Hi,
I have a number of Excel files, named Company1_2016Q1.xlsx, Company1_2016Q2.xlsx, Company2_2016Q1.xlsx, Company2_2016Q2.xlsx etc. They are stored in one folder with multiple subfolders.
I have a following macro to import all Excel files from a catalog to SAS (I've found it somwhere on the Web, I didn't write it on my own). Could you help me modify the macro in two ways:
1. I would like to loop not only through the folder but also through subsoflders and sub-subfolders (if they exist)
2. I have some problem with manipulating the imported datasest - please take a look at the comments in the code below.
%macro read_files (katalog);
%let rc=%sysfunc(filename(rawdata,&katalog));
%let did=%sysfunc(dopen(&rawdata));
%let dnum=%sysfunc(dnum(&did));
%do i=1 %to &dnum;
%let name_file=%sysfunc(dread(&did,&i));
%let rozszerz=%sysfunc(scan(&name_file,2,.));
%if &rozszerz=xlsx %then %do;
%let name_file_short=%sysfunc(scan(&name_file,1,.));
PROC IMPORT OUT=WORK.&name_file_short
DATAFILE="&katalog\&name_file" DBMS=xlsx REPLACE;
RUN;
/* HERE IS THE PROBLEM:
- I want to create two variables in the dataset on the basis of the dataset name but the substr doesn't work*/
data WORK.&name_file_short;
set WORK.&name_file_short;
comp=%substr(&name_file_short,1,8);
date=%substr(&name_file_short,10,6);
run;
%end;
%end;
%mend;
%read_files(C:\Data);
Doh! Yes, of course you need to remove file extensions. Still, I would highly advise that using dates and such like in the dataset names, imagine how much simpler the code would be if you appended all the data into one dataset and used that, or just had all the dataset names similar rather than trying to work out what each one is called. Again, its the Excel thinking, dates in tab or file names. So this would be my preferred way (assuming I couldn't append all the data into one file either in SAS or Excel - note with VBA you could load all the files and dump to CSV, then read the one file in).
filename tmp pipe 'dir "s:\temp\rob" /s /b'; data test; infile tmp dlm="¬"; length buff $2000; input buff $; fname=scan(buff,countw(buff,"\"),"\"); if index(upcase(fname),".XLSX")>0 then call execute(cats('proc import datafile="',buff,'" out=DS',put(_n_,best.),'; run;')); /* Extra bit to add date into dataset */ fname=strip(tranwrd(substr(fname,1,11),"_","")); call execute(cats('data ds',put(_n_,best.),'; date="',fname,'"; run;')); run;
But yes, you could just scan(fname,1,".").
Assuming you have x window access this is pretty simple (and assuming there is no duplicate names in the subfolders!):
filename tmp pipe 'dir "s:\temp\rob" /s /b'; data test; infile tmp dlm="¬"; length buff $2000; input buff $; fname=scan(buff,countw(buff,"\"),"\"); if index(upcase(fname),".XLSX")>0 then call execute(cats('proc import datafile="',buff,'" out=',fname,'; run;')); run;
Note however that Excel is a really poor data medium, and proc import is the worst method of importing the data, so what you will likely end up with is rubbish that you need to post process endlessly. Far better to get the data in a better format (XML, CSV etc.).
Thanks. I know that Excel files are poor data medium, but I cannot change it unfornately.
Another problem is that names of the Excel files start with number and the code doesn't work (as it tries to create a sas datasest with a name starting from a digit). Could you help me modifying this code? I've tried to manipulate the "out=',fname,'" part but with little success.
Once I have all the data in SAS files, I need to modify each by adding some variable, depending on the dataset name (see my comments in the code in the previous post). Could you suggest something there?
Best
Minor update then:
filename tmp pipe 'dir "s:\temp\rob" /s /b'; data test; infile tmp dlm="¬"; length buff $2000; input buff $; fname=scan(buff,countw(buff,"\"),"\"); /* Update fname to be valid SAS dataset name here per what you know the files to be called I will just add a character per your information */ fname=cats("A",fname); if index(upcase(fname),".XLSX")>0 then call execute(cats('proc import datafile="',buff,'" out=',fname,'; run;')); run;
Thanks. It should work now, but there is another problem with my data. I get the following error "ERROR: Libref ''_2017_03_123_ALL' exceeds 8 characters." Unfortunately I cannot simply take first 8 characters of the name - I need everything, as the name of the file contains information that I want to use later on. So maybe the original "proc import" macro could be somehow modified to loop through subfolders as well? It works OK for all files in one folder (at least the part that imports data).
Please post your code and what the filenames look like - exactly. Nowhere in my code do I set libnames. It aalso wouldn't matter how you arrive at your end code, the dataset names have to conform to SAS dataset name conventions - a very good example of why not to put "data" in headings or filenames. If you need the date in the dataset, then do that in another step, maybe something like:
filename tmp pipe 'dir "s:\temp\rob" /s /b'; data test; infile tmp dlm="¬"; length buff $2000; input buff $; fname=scan(buff,countw(buff,"\"),"\"); if index(upcase(fname),".XLSX")>0 then call execute(cats('proc import datafile="',buff,'" out=DS',put(_n_,best.),'; run;')); /* Extra bit to add date into dataset */ fname=strip(tranwrd(substr(fname,1,11),"_","")); call execute(cats('data ds',put(_n_,best.),'; date="',fname,'"; run;')); run;
Note in the above I am just using _n_ to create DSx where x refers to the _n_ number. Am just guessing at what you have/need though. This will add a column in with the date part of the filename.
The code is as follows:
filename tmp pipe 'dir "C:\........" /s /b';
data test;
infile tmp dlm="¬";
length buff $2000;
input buff $;
fname=scan(buff,countw(buff,"\"),"\");
/* Update fname to be valid SAS dataset name here per what you know the
files to be called I will just add a character per your information */
fname=cats("_",fname);
if index(upcase(fname),".XLSX")>0 then
call execute(cats('proc import datafile="',buff,'" out=',fname,'; run;'));
run;
And the error message is as follows (it gives the same error for each file, I only paste one):
NOTE: The infile TMP is:
Unnamed Pipe Access Device,
PROCESS=dir "C:\..........." /s /b,
RECFM=V,LRECL=32767
NOTE: 16 records were read from the infile TMP.
The minimum record length was 61.
The maximum record length was 93.
NOTE: The data set WORK.TEST has 16 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 2.21 seconds
cpu time 0.15 seconds
NOTE: CALL EXECUTE generated line.
NOTE: PROCEDURE IMPORT used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
ERROR: Libref '_2017_03_123_GR1' exceeds 8 characters.
NOTE: The SAS System stopped processing this step because of errors.
The Excel names are as follows: 2017_03_123_GR1.xlsx; 2017_03_124_GR1.xlsx; 2017_03_123_GR2.xlsx; 2017_03_124_GR3.xlsx etc.
I believe what you are doing when you attempt to read 2017_03_123_GR1.xlsx is that is the value of FNAME.
So when this is encountered:
call execute(cats('proc import datafile="',buff,'" out=',fname,'; run;'));
the output SAS DATASET name is 2017_03_123_GR1.xlsx which SAS treats the 2017_03_123_GR1 portion as the library name and the XLSX as the dataset name in LIB.SETNAME notation.
So truncate the Fname at the . before using it in the Call Execute.
Perhaps Fname= Scan(fname,1,'.'); just before.
Doh! Yes, of course you need to remove file extensions. Still, I would highly advise that using dates and such like in the dataset names, imagine how much simpler the code would be if you appended all the data into one dataset and used that, or just had all the dataset names similar rather than trying to work out what each one is called. Again, its the Excel thinking, dates in tab or file names. So this would be my preferred way (assuming I couldn't append all the data into one file either in SAS or Excel - note with VBA you could load all the files and dump to CSV, then read the one file in).
filename tmp pipe 'dir "s:\temp\rob" /s /b'; data test; infile tmp dlm="¬"; length buff $2000; input buff $; fname=scan(buff,countw(buff,"\"),"\"); if index(upcase(fname),".XLSX")>0 then call execute(cats('proc import datafile="',buff,'" out=DS',put(_n_,best.),'; run;')); /* Extra bit to add date into dataset */ fname=strip(tranwrd(substr(fname,1,11),"_","")); call execute(cats('data ds',put(_n_,best.),'; date="',fname,'"; run;')); run;
But yes, you could just scan(fname,1,".").
Ah, thats an easy one. Proc import is what is known as a guessing procedure. Its scans the datafile with some parameters and guessse what the best format for the data is, so you don't necessarily always get the same result, or know what you will get. I always recommend that the user specifies what they want from the data, to do this by creating a datastep import and fixing how SAS reads the data, assign correct lengths, informats etc. It takes a bit longer, and you can use the output from a proc import as the basis, but the key is to make sure you - the person who knows the data best or can interpret it best - is the one who makes the call on the import. Excel for instance has lots of lovely functionality (mostly hidden) which can really mess up data, each cell can be different for instance.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.