DATA Step, Macro, Functions and more

Variable format inconsistency in stacking many excel files

Accepted Solution Solved
Reply
Contributor
Posts: 73
Accepted Solution

Variable format inconsistency in stacking many excel files

I have a large number of excel workbooks, all of them sharing the same column header/variable names. When I read them in using proc import, the same variables are sometimes numeric and other times character. For example a date variable can be read as numeric, but also gets turned into character in files with all missing date values. If I want to stack them together, I run into this problem. Is there a quick automated way to go through all datasets, and get them to agree on the right variable formats as well as lengths (so nothing gets truncated in stacking)?

 

I found a close solution here in the answer given by username Oligolas and I tried it. I does work, but seems like it turns everything into character. But I would like to keep what should be numeric numeric. If dates are all missing, I want that to be numeric so they can be merged with non-missing dates from other files. I guess the logic is that for each variable, if it's seen a numeric in any of the imported files, then make it alwasy numeric. If it's always character, then make it character. Not quite sure how to implement this. Thanks.

 

https://communities.sas.com/t5/Base-SAS-Programming/Variable-has-been-defined-as-both-character-and-...

 


Accepted Solutions
Solution
‎05-28-2017 10:27 PM
Super User
Posts: 19,815

Re: Variable format inconsistency in stacking many excel files

Posted in reply to apolitical

apolitical wrote:
By using a data step to import, I assume you meaning spelling out (in)format specifications for every variable in the dataset, is that correct? I was looking to see if it's possible to avoid doing that on over 100 variables, but if that's the most reliable way to go, then I will do it. Thanks.

 

It's definitely the most reliable. There's a VBS script on here to automate the conversion. 

 

Additionally, you can first use PROC IMPORT on your data and then check the log. It will have the code as a data step so you copy that and modify it. Hold down ALT while selecting the code to omit the line numbers from the log. AND most importantly, you can read all 100 files in one step. 

 

Here's a little write up I did on this process:

https://communities.sas.com/t5/SAS-Communities-Library/How-do-I-write-a-macro-to-import-multiple-tex...

View solution in original post


All Replies
Super User
Posts: 11,343

Re: Variable format inconsistency in stacking many excel files

Posted in reply to apolitical

Proc Import is a GUESSING procedure. It has to guess from the contents of the data what the data type is. So if the first few rows of a variable contain nothing SAS assigns a character type (an possibly only one character long). If your "numeric" field has something else that may not be a simple number sucha a NA, N/A, NULL, BLANK etc then SAS may decide the column is text.

 

If you want to have more control you need to use another approach, possible the LIbname or convert the file to CSV where you can use a data step to import the fields with the exact properties you want set. If you are reading multiple Excel files that should be the same format then this may pay off the most in the long run.

Contributor
Posts: 73

Re: Variable format inconsistency in stacking many excel files

By using a data step to import, I assume you meaning spelling out (in)format specifications for every variable in the dataset, is that correct? I was looking to see if it's possible to avoid doing that on over 100 variables, but if that's the most reliable way to go, then I will do it. Thanks.
Solution
‎05-28-2017 10:27 PM
Super User
Posts: 19,815

Re: Variable format inconsistency in stacking many excel files

Posted in reply to apolitical

apolitical wrote:
By using a data step to import, I assume you meaning spelling out (in)format specifications for every variable in the dataset, is that correct? I was looking to see if it's possible to avoid doing that on over 100 variables, but if that's the most reliable way to go, then I will do it. Thanks.

 

It's definitely the most reliable. There's a VBS script on here to automate the conversion. 

 

Additionally, you can first use PROC IMPORT on your data and then check the log. It will have the code as a data step so you copy that and modify it. Hold down ALT while selecting the code to omit the line numbers from the log. AND most importantly, you can read all 100 files in one step. 

 

Here's a little write up I did on this process:

https://communities.sas.com/t5/SAS-Communities-Library/How-do-I-write-a-macro-to-import-multiple-tex...

Contributor
Posts: 73

Re: Variable format inconsistency in stacking many excel files

For some reason, when I execute proc import, the log does not show the detailed background code as if running a data step. How do I get those lines to show so I can copy and modify and put them into a data step code of my own, and use that on all files?

The log I got was:
NOTE: The import data set has 354 observations and 115 variables.
NOTE: Compressing data set WORK.TRY decreased size by 33.33 percent.
Compressed is 4 pages; un-compressed would require 6 pages.
NOTE: WORK.TRY data set was successfully created.
NOTE: PROCEDURE IMPORT used (Total process time):
real time 0.27 seconds
cpu time 0.23 seconds
Super User
Posts: 19,815

Re: Variable format inconsistency in stacking many excel files

Posted in reply to apolitical

apolitical wrote:
For some reason, when I execute proc import, the log does not show the detailed background code as if running a data step. How do I get those lines to show so I can copy and modify and put them into a data step code of my own, and use that on all files?

The log I got was:
NOTE: The import data set has 354 observations and 115 variables.
NOTE: Compressing data set WORK.TRY decreased size by 33.33 percent.
Compressed is 4 pages; un-compressed would require 6 pages.
NOTE: WORK.TRY data set was successfully created.
NOTE: PROCEDURE IMPORT used (Total process time):
real time 0.27 seconds
cpu time 0.23 seconds

This means you were trying to import an Excel file still. You will get data step code when trying to read text files.

Contributor
Posts: 73

Re: Variable format inconsistency in stacking many excel files

that's right. i have converted an xlsx file into csv and imported it. i can copy the informat/format/input chunk of the log file and make changes as necessary, which is convenient. but i still have 30 files to import, each with over 100 fields. is there an easier way to determine the correct format for each variable, besides looking at what values a variable contains across all 30 files, and humanly decide what format is optimal to use, so they are consistent for later stacking?
Super User
Posts: 19,815

Re: Variable format inconsistency in stacking many excel files

Posted in reply to apolitical

You stated that your files are the same structure. Did you look at the link I posted on how once you've figured out one file you can automate it for all your other files?

 

You can set the GUESSINGROWS option in the PROC IMPORT to a large number, as large as your file. This way it scans the entire file before guessing so it should be closer to your actual data, BUT there is no way to guarantee it's what you want unless you explicitly specify it. Computers are dumb, they only do what you tell them to do.  

 


apolitical wrote:
that's right. i have converted an xlsx file into csv and imported it. i can copy the informat/format/input chunk of the log file and make changes as necessary, which is convenient. but i still have 30 files to import, each with over 100 fields. is there an easier way to determine the correct format for each variable, besides looking at what values a variable contains across all 30 files, and humanly decide what format is optimal to use, so they are consistent for later stacking?

 

Contributor
Posts: 73

Re: Variable format inconsistency in stacking many excel files

I ended up using proc import to generate the SAS recognized formats and informats command, and compared across all raw files for each variable. For character variables, I chose $max. For variables that can be both numeric and character, I went back to look at the data to determine what is the correct format to take on. I wrote this into a macro and executed it on all files and stacked them. Certainly not the most efficient way of doing things, but seems to have worked out OK.

Thank you so much.
PROC Star
Posts: 102

Re: Variable format inconsistency in stacking many excel files

Posted in reply to apolitical

If I understand you correctly, you have some sheets that are standardised and contain the same columns. The problem is, of course, that sometimes all the cells in a numeric column are missing,and it gets defined as character in that dataset, and that character variables get different lengths.

 

You may want to use a LIBNAME instead of PROC IMPORT, and then use the DBSASTYPE option to define your variable types, e.g.

%let Exceldef=dbsastype=(arrivalDt=date name=char(50) n=numeric);

libname excel1 excel 'c:\ws1.xls';
libname excel2 excel 'c:\ws2.xls';

data want;
  set 
    excel1.sheet1(&excelDef)
    excel2.sheet2(&excelDef)
    ;
run;
Contributor
Posts: 73

Re: Variable format inconsistency in stacking many excel files

Thank you. I got this error message:

1 dbsastype=(arrivalDt=date name=char(50) n=numeric)
--------- -
108 108
ERROR 108-63: Invalid syntax for option DBSASTYPE.
PROC Star
Posts: 102

Re: Variable format inconsistency in stacking many excel files

Posted in reply to apolitical
Can you show all of the log from your program?
Contributor
Posts: 73

Re: Variable format inconsistency in stacking many excel files

I only changed the file path of the logs below. The "sheet1" in "excel1.sheet1(&excelDef)" does not have to the real sheet/tab name in the excel file, does it?


5447 %let Exceldef=dbsastype=(arrivalDt=date name=char(50) n=numeric);
5448
5449 libname excel1 excel 'c:\001.xlsx';
NOTE: Libref EXCEL1 was successfully assigned as follows:
Engine: EXCEL
Physical Name: c:\001.xlsx
5450 libname excel2 excel 'c:\002..xlsx';
NOTE: Libref EXCEL2 was successfully assigned as follows:
Engine: EXCEL
Physical Name: c:\002.xlsx
5451

5452 data want;
5453 set
5454 excel1.sheet1(&excelDef)
SYMBOLGEN: Macro variable EXCELDEF resolves to dbsastype=(arrivalDt=date name=char(50) n=numeric)
NOTE: Line generated by the macro variable "EXCELDEF".
1 dbsastype=(arrivalDt=date name=char(50) n=numeric)
--------- -
108 108
ERROR 108-63: Invalid syntax for option DBSASTYPE.

5455 excel2.sheet2(&excelDef)
SYMBOLGEN: Macro variable EXCELDEF resolves to dbsastype=(arrivalDt=date name=char(50) n=numeric)
NOTE: Line generated by the macro variable "EXCELDEF".
1 dbsastype=(arrivalDt=date name=char(50) n=numeric)
--------- -
108 108
ERROR 108-63: Invalid syntax for option DBSASTYPE.

5456 ;
5457 run;

NOTE: Compression was disabled for data set WORK.WANT because compression overhead would increase the size of the data
set.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.WANT may be incomplete. When this step was stopped there were 0 observations and 0
variables.
WARNING: Data set WORK.WANT was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.01 seconds

PROC Star
Posts: 102

Re: Variable format inconsistency in stacking many excel files

Posted in reply to apolitical

apolitical wrote:

I only changed the file path of the logs below. The "sheet1" in "excel1.sheet1(&excelDef)" does not have to the real sheet/tab name in the excel file, does it?


5447 %let Exceldef=dbsastype=(arrivalDt=date name=char(50) n=numeric);
5448
5449 libname excel1 excel 'c:\001.xlsx';
NOTE: Libref EXCEL1 was successfully assigned as follows:
Engine: EXCEL
Physical Name: c:\001.xlsx
5450 libname excel2 excel 'c:\002..xlsx';
NOTE: Libref EXCEL2 was successfully assigned as follows:
Engine: EXCEL
Physical Name: c:\002.xlsx
5451

5452 data want;
5453 set
5454 excel1.sheet1(&excelDef)
SYMBOLGEN: Macro variable EXCELDEF resolves to dbsastype=(arrivalDt=date name=char(50) n=numeric)
NOTE: Line generated by the macro variable "EXCELDEF".
1 dbsastype=(arrivalDt=date name=char(50) n=numeric)
--------- -
108 108
ERROR 108-63: Invalid syntax for option DBSASTYPE.

5455 excel2.sheet2(&excelDef)
SYMBOLGEN: Macro variable EXCELDEF resolves to dbsastype=(arrivalDt=date name=char(50) n=numeric)
NOTE: Line generated by the macro variable "EXCELDEF".
1 dbsastype=(arrivalDt=date name=char(50) n=numeric)
--------- -
108 108
ERROR 108-63: Invalid syntax for option DBSASTYPE.

5456 ;
5457 run;

NOTE: Compression was disabled for data set WORK.WANT because compression overhead would increase the size of the data
set.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.WANT may be incomplete. When this step was stopped there were 0 observations and 0
variables.
WARNING: Data set WORK.WANT was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.01 seconds


Yes, you will have to change the names of the sheets you read from to the correct names. You will also have to change the variable definitions to the actual variables in the Excel tables you are reading.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 13 replies
  • 171 views
  • 0 likes
  • 4 in conversation