Hi SAS Users,
As we know Proc import causing quite a bit of noise. However, it is my way of importing excel files into SAS at the initial stage, afterwards, I can use libname to refer to the file.
And one of the issues is that the length and data types are inconsistent.
I am wondering if you can have a look to make the data consistent regarding Type and length in my code as below:
options compress=yes reuse=yes;
options mergenoby=error;
OPTIONS MPRINT;
/*options validvarname=any; in case transform to SAS Base*/
%macro ImportAndTranspose(
File=
, cur=
, outf=
, StartSheet=
, EndSheet=
);
%local i;
%do i = &StartSheet. %to &EndSheet.;
%if &i=1 %then %do;
proc import datafile= "&File."
out= &outf.&i.
dbms= xlsx
replace;
range= "Sheet1$A:X";
getnames= yes;
run;
proc sort data= &outf.&i.;
by Type;
run;
%end;
%else %if &i=34 %then %do;
proc import datafile= "&File."
out= &outf.&i.
dbms= xlsx
replace;
range= "Sheet&i.$A:AG";
getnames= yes;
run;
proc sort data= &outf.&i.;
by Type;
run;
proc transpose data= &outf.&i.
out= &outf.&i._out(rename=(COL1=s&i. _NAME_=Year) drop=_label_);
by Type;
VAR '1988'N - '2019'N;
run;
proc print data=&outf.&i._out;
run;
data &outf.&i._outx;
set &outf.&i._out;
if s&i. in: ('NA', '$$', '') then s&i. = " ";/********/
run;
%end;
%else %if (&i ne 1) and (&i ne 34) %then %do;
proc import datafile= "&File."
out= &outf.&i.
dbms= xlsx
replace;
range= "Sheet&i.$A:AG";
getnames= yes;
run;
proc sort data= &outf.&i.;
by Type;
run;
proc transpose data= &outf.&i.
out= &outf.&i._out(rename=(COL1=s&i. _NAME_=Year) drop=_label_);
by Type;
VAR '1988'N - '2019'N;
run;
proc print data=&outf.&i._out;
run;
data &outf.&i._outx;
set &outf.&i._out;
if input(s&i., ?? 32.) = . then s&i.2=.;
else s&i.2=input(s&i., 32.);
drop s&i.;
rename s&i.2=s&i.;
run;
%end;
%end;
/* Merging*/
%let outfm = %substr(&outf,1,%length(&outf)-5);
data &outfm.merge1;
merge
%do i =&StartSheet.+1 %to &EndSheet.;
&outf.&i._outx(keep= type year s&i.)
%end;
;
by type year;
run;
data &outfm.merge2;
merge
&outf.&StartSheet.
&outfm.merge1
;
by type;
run;
%mend;
/*Replicate all files in one folder*/
filename mydir 'C:\Users\pnguyen\Desktop\New folder';
data _null_;
did = dopen('mydir');
do i = 1 to dnum(did);
fname = scan(dread(did,i),1,'.');
/*filename:ARGENTINAARS*/
length short_fn $29 currency $3 ;
short_fn= cats(substr(fname, 1,length(fname)-3),'_');
currency=substr(fname,length(fname)-2);
cmd=cats('%ImportAndTranspose(File=C:\Users\pnguyen\Desktop\New folder\',
strip(fname),
',cur=',currency,
',outf=',short_fn,'sheet,startsheet=1,endsheet=45);');
call execute(cmd);
end;
keep fname;
run;
Many thanks and warmest regards.
Can you clarify the plan for combining these and the reason for the variability?
Depending on the answers I would recommend different strategies of help in automating the combination.
If you truly have no idea what is in each xlsx file then using a LIBNAME with XLSX engine will let you get access to information on all of the sheets in the files.
libname src xlsx "&file";
proc contents data=src._all_ noprint out=contents;
run;
Note this works very well when the sheets are actually structured like a dataset. Header row with varaible names. Data starts in second row. No "footers" or "titles". One type of value per column. But if you are dealing with messy spreadsheets this will not work that well.
If you have a well defined structure for that data (even if there is some flexibility) then you can easily use the PROC CONTENTS output to check for things like wrong data type, wrong length of character variables. duplicate variable names. So that you could use it to generate the code to "fix" the issues so that the data can be combined.
For example your step that is trying to use PROC TRANSPOSE to get the YEAR value out of the column header (variable name) and into an actual variable. That should work fine if the only variability in the sheets is how many "year" columns there are. You could easily handle some sheets with years from 1988 to 2019 and others that have different range of years.
If I understand correctly you are importing sheets from one excel file.
Is a sheet an expand of the other sheets (different variables except identifier(s))
or all sheets are of same format (same variables) ?
Is the target to concatenate/merge all imported sheets into one sas data set?
Supposing all sheets are of the same format, same variables, then copy the generated code from the log and adapt it to serve for all sheets by applying:
1) max length of character variables
2) appropriate informat and format of numeric variables depending on the specific variable.
3) wrap the code with a macro and execute it for all sheets in a loop, similar to what you have done.
Thank you for your helpful comments. I am wondering if I just import them by using proc import like that. And after all, I will concatenate them, before concatenating them, I will set max length (by using length statement) and play some tricks to make sure all the variables' types are consistent through all dataset (because after all, most of the variables are consistent in Type but inconsistent in length (just one or two useless variables inconsistent in Types).
Is it a reasonable solution?
Many thanks.
It seems to me that you are doing hard work again and again.
By adapting the generated code in advance you may save a lot of time in the future.
You just need to know your data and assign appropriate length, informat and format.
You even can add more code, may be, to eliminate some later steps of adapting the data set for you needs.
This is why I NEVER import Excel files directly. When I receive data in this format, I always save it to a csv file and import that with a simple data step. Since that step never changes, the datasets will have a consistent structure.
Can you clarify the plan for combining these and the reason for the variability?
Depending on the answers I would recommend different strategies of help in automating the combination.
If you truly have no idea what is in each xlsx file then using a LIBNAME with XLSX engine will let you get access to information on all of the sheets in the files.
libname src xlsx "&file";
proc contents data=src._all_ noprint out=contents;
run;
Note this works very well when the sheets are actually structured like a dataset. Header row with varaible names. Data starts in second row. No "footers" or "titles". One type of value per column. But if you are dealing with messy spreadsheets this will not work that well.
If you have a well defined structure for that data (even if there is some flexibility) then you can easily use the PROC CONTENTS output to check for things like wrong data type, wrong length of character variables. duplicate variable names. So that you could use it to generate the code to "fix" the issues so that the data can be combined.
For example your step that is trying to use PROC TRANSPOSE to get the YEAR value out of the column header (variable name) and into an actual variable. That should work fine if the only variability in the sheets is how many "year" columns there are. You could easily handle some sheets with years from 1988 to 2019 and others that have different range of years.
Hi @Tom
I intend to let the Type and Year consistent because other experts recommend that I should change from xlsx to csv type of data and then import afterwards or I should make the data being consistent regarding type of data and length of data that I can avoid a lot of issues afterwards.
When I tried to learn how to transform from csv file to xlsx file, I found some other posts but the code is quite complex to me so far...
And as you said, my data is like that, yes
Note this works very well when the sheets are actually structured like a dataset. Header row with varaible names. Data starts in second row. No "footers" or "titles". One type of value per column. But if you are dealing with messy spreadsheets this will not work that well.
Other experts also help me regarding spotting out the errors and else, I just try to optimize the code to make sure all length and charatcer type be consistent since importing because I did not understand the code to change from xlsx to csv
Best regards,
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.