SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Validation of Variables before Loading the Data

Reply
Occasional Contributor
Posts: 7

Validation of Variables before Loading the Data

Hi All,

 

Great to have this community! Thanks for having me here!

 

Could you please help me how I can code the following steps below:

Note: Source data are csv files that have at least 300+ columns and being dumped daily in a file location. 

 

 

1. As a first step, I will be creating a data definition table which has the information of the csv's (source file) variables. I will use this file as a reference for validation. This file will have the field name, field format, field length, description, etc.

2. Read the first line of the CSV file to check and validate the column names vs. the data definition I created on step 1. If the variables are validated and correct, procede to the input statement, else if not, have an error note stating the variable(s) are not wrongly defined.

3. After the data is validated, I want to have a report e.g this &variable is not in the source file, or the source file has the wrong variable definition like var1 should be a numeric but in the source file it was a character, etc.

 

Is there a way we can have these steps into one code? I hope you could help me.

Thanks in advance!

Occasional Contributor
Posts: 7

Re: Validation of Variables before Loading the Data

I have found an answer for the validation in this thread:

 

https://communities.sas.com/t5/SAS-Data-Management/Check-if-Multiple-Variables-exist-in-a-SAS-Datase...

 

Once I have figure out how I can put the steps that I want into a single code, I will post it here. But if you have a better way how to code, kindly let me know.

 

Thanks!

Super User
Posts: 9,691

Re: Validation of Variables before Loading the Data

Check Integrate Constraint : proc datasets ....... modify have; ic create age=...........; initiate; audit; quit;
Occasional Contributor
Posts: 7

Re: Validation of Variables before Loading the Data

Hi Xia Keshan,

Thanks for your reply! I would definitely have a look this functions and hopefully to start my coding.
Super User
Super User
Posts: 7,430

Re: Validation of Variables before Loading the Data

Not sure I am seeing a need for this.  What you can do is to run a proc import on the CSV - this will automatically import all the data, and do a best guess at type and length of the data.  Once you have that as a dataset, the you can query the metadata to see if everything is as you expect.  An example:

/* Simulate dataset from import */
data temp;
  a="Abcd"; b=2; c=4; d="ERTYRF";
run;

/* Create check dataset */
data want (keep=variable result);
  set sashelp.vcolumn (where=(libname="WORK" and memname="TEMP"));
  select(upcase(name));
    when("A") do;
        variable="A";
        if type ne "char" then do;
          result="Type should be char";
          output;
        end;
        if length ne 2 then do;
          result="Length is not 2";
          output;
        end;
      end;
      /* Other variables */
    otherwise;
  end;
run;
Occasional Contributor
Posts: 7

Re: Validation of Variables before Loading the Data

Thanks for your reply, mate! But, is there a way I can put everything into one SAS code -- importing the data definition, and then importing the first obs of the csv file and then validation. Once validated, the code will continue to load the complete csv into SAS data set.

I am not sure how I can start this, but if not, I probably do it one code per steps.

Thanks!
Super User
Super User
Posts: 7,430

Re: Validation of Variables before Loading the Data

Do you have an import agreementwith the vendor, which details file structure and such like.  I guess you must have something like that for the data definition.  If so then I would simplify the process, you write a datastep import program to read the CSV data in as specified in the agreement - a validated well designed data import process is far better then checking it at the end.

 

As for your question, not sure reading one row in will validate the file.  Import processes normally sample the first x amount of records (set by a reg key).  Thats why I suggest using proc import to import the whole data in one step - this automatically has the sampleing of the data to get best fit, and then the metadata of the imported dataset can be compared to your data definition.  If things are of different type or longer than expected that will pick it up.  Its not complicated really:

1) import defintion

2) import data

3) compare definition to sashelp.vcolumn where imported data

4) if fine execute proper datastep import program

 

Trying to do it all in one step I think would be messy at best.

 

 

Occasional Contributor
Posts: 7

Re: Validation of Variables before Loading the Data

Hi RW9! Thanks for the tip! That was also the approach I been thinking this past few days. 

 

I've already created a three (3) data definition of the source files and did a coding how to import these excel files. Below is the code.

1. Filename statement to read the files in a directory

2. SQL statement to input the excel file names in &name and &name2

3. Import procedure

 

filename indata pipe 'dir <directory>\*.xlsx /b';
data LD_tbldef;
length filename $100;
infile indata truncover; /* infile statement for file names*/
input filename $100.; /* read the file names from the directory*/
filename2=scan(filename,1); /*creating filename2 variable to exclude .xlsx extension*/
x=_n_; /*iteration of the variables*/
call symput ('num_files',_n_); /* store the record number in a macro variable*/
*if find(filename,'.xlsx','i') /*get only .xlsx files*/;
run;
%put Macro '&num_files' resolves to >>> &num_files.;

proc sql noprint;
select filename into :name separated by '|' from LD_tbldef;
select filename2 into :name2 separated by '|' from LD_tbldef;
quit;
%put Macro '&name' resolves to >>> &name;
%put Macro '&name2' resolves to >>> &name2;

%macro read_LD_TblDef;
%do i = 1 %to &num_files.;
%let j = %cmpres(%scan(&name.,&i.,'|'));
%let k = %cmpres(%scan(&name2.,&i.,'|'));
proc import
out = libref.&k.
datafile="M:\LeeA\DIA\&j."
dbms=xlsx replace;
sheet="Sheet1";
getnames = yes;
run;
%put i = &i.;
%put Note: Macro '&j' resolves to >>> &j.;
%put Note: Macro '&k' resolves to >>> &k.;
%end;
%mend;
%read_LD_TblDef;

 

After this step, I will do now the validation part. Smiley Happy

 

Thanks again!

Super User
Super User
Posts: 7,430

Re: Validation of Variables before Loading the Data

Hi,

 

You could of course remove all the messy macro code and simply do (note, not checked, modify to your specifics):

filename indata pipe 'dir "<dir>\*.xlsx" /b';

data _null_;
  infile indata dlm="¬";
  input;
  call execute(cats('proc import out=',compress(scan(_infile_,1)),'datafile="',strip(_infile_),'" sheet="Sheet1" getnames=yes;run;'));
run; 

 

Occasional Contributor
Posts: 7

Re: Validation of Variables before Loading the Data

Hi RW9,

 

I have tried this code but I am encountering an error which I can't figure out. See below:

 

NOTE: CALL EXECUTE generated line.
1 + proc import out=LD_Births_TableDefinitiondatafile="LD_Births_TableDefinition.xlsx" sheet="Sheet1" getnames=yes;run;
2 The SAS System 16:44 Wednesday, June 15, 2016

ERROR: "LD_BIRTHS_TABLEDEFINITIONDATAFILE" is not a valid name.

 

I already tried to put a "''" between 

compress(scan(_infile_,1)),'datafile="',strip(_infile_)

 however, it's still fail. So for the mean time, I am still using the previous code I have. If you can help me to figure out what could be missing here, that would be great! Smiley Happy


call_execute.PNG
Occasional Contributor
Posts: 7

Re: Validation of Variables before Loading the Data

Hi Guys!

 

Thanks for your help! I believe I was able to manage the coding and I used Reeza's coding from this link:

 

https://communities.sas.com/t5/SAS-Data-Management/Check-if-Multiple-Variables-exist-in-a-SAS-Datase...

 

The codes that I made are as follows:

1. I created an import macro code to import the table definition and the csv files;

2. I created a macro input that has the infile statement inside which calls the naming convention of my source files

3. I used Reeza's code (with a little tweak for my need) for my validation. This validation macro checks the variable count for my data definition table and for csv file. It creates a report that if a variable is missing, the code will not continue to process the infile statement (macro input). Inside this macro I call the macro input statement.

 

Again, thank you so much for the help specially for those who gave inputs and ideas for this inquiry! Smiley Happy  

 

Ask a Question
Discussion stats
  • 10 replies
  • 458 views
  • 1 like
  • 3 in conversation