06-08-2016 06:06 PM
Great to have this community! Thanks for having me here!
Could you please help me code the following steps?
Note: The source data are CSV files with at least 300+ columns, dumped daily into a file location.
1. As a first step, I will create a data definition table which holds the information about the source CSV files' variables. I will use this file as a reference for validation. It will have the field name, field format, field length, description, etc.
2. Read the first line of the CSV file to check and validate the column names against the data definition I created in step 1. If the variables are validated and correct, proceed to the input statement; if not, write an error note stating which variable(s) are wrongly defined.
3. After the data is validated, I want a report, e.g. "&variable is not in the source file", or "the source file has the wrong variable definition" (for example, var1 should be numeric but in the source file it is character), etc.
Is there a way to combine these steps into one program? I hope you can help me.
Thanks in advance!
06-08-2016 07:30 PM
I have found an answer for the validation in this thread:
Once I have figured out how to put the steps I want into a single program, I will post it here. But if you have a better way to code it, kindly let me know.
06-09-2016 04:33 AM
Not sure I am seeing a need for this. What you can do is run a proc import on the CSV - this will automatically import all the data and make a best guess at the type and length of each variable. Once you have that as a dataset, you can query the metadata to see if everything is as you expect. An example:
/* Simulate dataset from import */
data temp;
  a="Abcd"; b=2; c=4; d="ERTYRF";
run;

/* Create check dataset */
data want (keep=variable result);
  set sashelp.vcolumn (where=(libname="WORK" and memname="TEMP"));
  select(upcase(name));
    when("A") do;
      variable="A";
      if type ne "char" then do;
        result="Type should be char";
        output;
      end;
      if length ne 2 then do;
        result="Length is not 2";
        output;
      end;
    end;
    /* Other variables */
    otherwise;
  end;
run;
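To avoid hard-coding one WHEN branch per variable (with 300+ columns that gets long), the same check can be driven by the definition table itself. This is an untested sketch - the DEF dataset and its columns (name/type/length) are hypothetical stand-ins for whatever your data definition table from step 1 actually contains:

```sas
/* Hypothetical definition table: expected metadata per variable */
data def;
  length name $32 type $4;
  input name $ type $ length;
  datalines;
A char 4
B num  8
C num  8
D char 6
;
run;

/* Compare expected vs. actual metadata from the dictionary view */
proc sql;
  create table want as
  select coalesce(d.name, c.name) as variable,
         case
           when d.name is null       then 'Unexpected variable in source'
           when c.name is null       then 'Variable missing from source'
           when d.type ne c.type     then 'Wrong type'
           when d.length ne c.length then 'Wrong length'
           else 'OK'
         end as result
  from def as d
  full join
       (select upcase(name) as name, type, length
        from sashelp.vcolumn
        where libname='WORK' and memname='TEMP') as c
  on upcase(d.name) = c.name;
quit;
```

A full join is used so the report catches variables in either direction: expected but missing, and present but unexpected.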
06-09-2016 06:56 PM
06-10-2016 04:57 AM
Do you have an import agreement with the vendor which details the file structure and so on? I guess you must have something like that for the data definition. If so, I would simplify the process: write a data step import program to read the CSV data in as specified in the agreement - a validated, well-designed data import process is far better than checking it at the end.
As for your question, I am not sure reading one row in will validate the file. Import processes normally sample the first x records (set by a registry key). That's why I suggest using proc import to import the whole file in one step - this automatically samples the data to get the best fit, and then the metadata of the imported dataset can be compared to your data definition. If things are of a different type or longer than expected, that will pick it up. It's not complicated really:
1) import definition
2) import data
3) compare definition to sashelp.vcolumn where imported data
4) if fine execute proper datastep import program
Trying to do it all in one step I think would be messy at best.
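The gate between steps 3 and 4 above can be sketched roughly as follows. This is untested, and the names are hypothetical: CHECK_RESULTS stands for whatever dataset your comparison in step 3 produces, and the INPUT statement is a placeholder for your agreed file layout:

```sas
/* Count validation problems into a macro variable */
proc sql noprint;
  select count(*) into :n_issues trimmed
  from check_results
  where result ne 'OK';
quit;

/* Only run the proper data step import when validation passed */
%macro run_import;
  %if &n_issues. = 0 %then %do;
    data final;
      infile "daily.csv" dsd firstobs=2 truncover;
      input var1 var2 $ var3;  /* per the agreed data definition */
    run;
  %end;
  %else %put ERROR: &n_issues. definition mismatch(es) found - import skipped.;
%mend run_import;
%run_import
```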
06-12-2016 05:36 PM
Hi RW9! Thanks for the tip! That was also the approach I have been thinking about these past few days.
I've already created three (3) data definitions of the source files and wrote code to import these Excel files. Below is the code:
1. Filename statement to read the files in a directory
2. SQL statement to input the excel file names in &name and &name2
3. Import procedure
filename indata pipe 'dir "<directory>\*.xlsx" /b';

data LD_tbldef;
  length filename $100;
  infile indata truncover;             /* infile statement for file names */
  input filename $100.;                /* read the file names from the directory */
  filename2 = scan(filename, 1, '.'); /* filename2 excludes the .xlsx extension */
  x = _n_;                             /* iteration counter */
  call symputx('num_files', _n_);      /* store the record count in a macro variable */
  /* optional: keep only .xlsx files (the dir mask above already filters) */
  * if find(filename, '.xlsx', 'i');
run;

%put Macro num_files resolves to >>> &num_files.;

proc sql noprint;
  select filename  into :name  separated by '|' from LD_tbldef;
  select filename2 into :name2 separated by '|' from LD_tbldef;
quit;

%put Macro name resolves to >>> &name;
%put Macro name2 resolves to >>> &name2;

%macro import_all;
  %do i = 1 %to &num_files.;
    %let j = %scan(&name., &i., |);
    %let k = %scan(&name2., &i., |);
    proc import datafile="<directory>\&j."
                out=libref.&k.
                dbms=xlsx replace;
      getnames=yes;
    run;
    %put i = &i.;
    %put Note: Macro j resolves to >>> &j.;
    %put Note: Macro k resolves to >>> &k.;
  %end;
%mend import_all;
%import_all
After this step, I will now do the validation part.
06-13-2016 04:08 AM
You could of course remove all the messy macro code and simply do this (note: not checked, modify to your specifics):
filename indata pipe 'dir "<dir>\*.xlsx" /b';

data _null_;
  infile indata dlm="¬";
  input;
  call execute(cats('proc import out=', compress(scan(_infile_, 1)),
                    'datafile="', strip(_infile_),
                    '" sheet="Sheet1" getnames=yes;run;'));
run;
06-15-2016 10:44 PM
I have tried this code but I am encountering an error that I can't figure out. See below:
NOTE: CALL EXECUTE generated line.
1 + proc import out=LD_Births_TableDefinitiondatafile="LD_Births_TableDefinition.xlsx" sheet="Sheet1" getnames=yes;run;
2 The SAS System 16:44 Wednesday, June 15, 2016
ERROR: "LD_BIRTHS_TABLEDEFINITIONDATAFILE" is not a valid name.
I already tried to put a "''" between them; however, it still fails. So for the meantime, I am still using the previous code I have. If you can help me figure out what could be missing here, that would be great!
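For what it's worth, the log line shows the OUT= dataset name and the DATAFILE= keyword fused together ("...TableDefinitiondatafile="), which suggests the generated code is missing a space. Simply writing the space inside the CATS() literal doesn't help because CATS() strips leading and trailing blanks from each argument. One way to keep the space is to concatenate with || instead - an untested sketch, with dbms=xlsx added as an assumption for the .xlsx files:

```sas
data _null_;
  infile indata dlm="¬";
  input;
  /* || preserves the blank before datafile= that cats() would strip */
  call execute('proc import out=' || strip(scan(_infile_, 1, '.')) ||
               ' datafile="' || strip(_infile_) ||
               '" dbms=xlsx replace; sheet="Sheet1"; getnames=yes; run;');
run;
```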
06-15-2016 10:40 PM
Thanks for your help! I believe I was able to manage the coding, and I used Reeza's code from this link:
The code that I wrote is as follows:
1. I created an import macro to import the table definition and the CSV files;
2. I created a macro (input) that contains the infile statement, which follows the naming convention of my source files;
3. I used Reeza's code (with a little tweak for my needs) for the validation. This validation macro checks the variable count in my data definition table against the CSV file. It creates a report, and if a variable is missing, the code will not continue to the infile statement (the input macro). Inside this validation macro I call the input macro.
Again, thank you so much for the help, especially to those who gave inputs and ideas for this inquiry!