Requesting ideas/help to read multiple excel files xlsx from a folder in server efficiently?
Folder is in server
Number of cols: 130
Row 1 has some junk and has to be ignored
Variable names begin from Row 2
Values begin from Row 3
Also some columns do not have variable names and is blank
Any simple convenient yet efficient way to handle this?An example will help. Thank you in advance
PS My mind isn't thinking well. Sorry for the bother
It is a pain in ***.
If you just use PROC IMPORT it will use the first row as the names and include the names in the data.
If you tell it the data starts on line 3 it will not read the names as data, but it still tries to use the title row as the names.
Instead use the RANGE option to tell it start reading on A2. But then you NEED to know the sheetname.
Example file:
Example Code:
%let filename="c:\downloads\testa.xlsx";
proc import datafile=&filename dbms=xlsx
out=test1 replace
;
run;
proc print; title1 'TEST1';
run;
proc import datafile=&filename dbms=xlsx
out=test2 replace
;
datarow=3;
run;
proc print; title1 'TEST2';
run;
proc import datafile=&filename dbms=xlsx
out=test3 replace
;
range="Sheet1$A2:0";
run;
proc print; title1 'TEST3';
run;
title1;
Example Results:
TEST1 Title_ Obs line B C 1 ID Name Age 2 1 Sam 23 3 2 Fred 33 TEST2 Title_ Obs line B C 1 1 Sam 23 2 2 Fred 33 TEST3 Obs ID Name Age 1 1 Sam 23 2 2 Fred 33
1. Do all the files have similar structure?
2. What type of server, Windows, Unix? If you're on Unix, you likely need to use PCFILES.
3. What happens if you write a libname reference to the file and use PROC COPY?
4. If they are similar structure have you tried using a basic PROC IMPORT with DBMS = XLSX and specifying the RANGE from A2:END? There's a way to specify a dynamic range so if this works, I can look up how to do that.
Documentation reference for Excel files:
libname myFile XLSX "path to xlsx file";
*you need to ensure pcfiles is set up and running;
libname myTest PCFILES Path = "path to xlsx file" server= port=;
The XLSX LIBNAME engine and the XLSX PROC IMPORT engine work on Unix, so the SAS PC Files Server might not be needed.
Vince DelGobbo
SAS R&D
It's a windows server.
This seems to work
proc import
DATAFILE = "\\corp\sites\RIB1001\HLSCreditRisk\RMBS Collateral Data\RAW DATA FOR NAVEEN\12646wah7.xlsx"
DBMS=XLSX OUT =test REPLACE;
getnames=no;
datarow=3;
run;
But I do not understand the datarow as the doc says Starts reading data from the specified row number in the delimited text file. So what's the delimiter here?
+1 to your thoughts.
I'm very unimpressed too as I was embarrassed 😞
It is a pain in ***.
If you just use PROC IMPORT it will use the first row as the names and include the names in the data.
If you tell it the data starts on line 3 it will not read the names as data, but it still tries to use the title row as the names.
Instead use the RANGE option to tell it start reading on A2. But then you NEED to know the sheetname.
Example file:
Example Code:
%let filename="c:\downloads\testa.xlsx";
proc import datafile=&filename dbms=xlsx
out=test1 replace
;
run;
proc print; title1 'TEST1';
run;
proc import datafile=&filename dbms=xlsx
out=test2 replace
;
datarow=3;
run;
proc print; title1 'TEST2';
run;
proc import datafile=&filename dbms=xlsx
out=test3 replace
;
range="Sheet1$A2:0";
run;
proc print; title1 'TEST3';
run;
title1;
Example Results:
TEST1 Title_ Obs line B C 1 ID Name Age 2 1 Sam 23 3 2 Fred 33 TEST2 Title_ Obs line B C 1 1 Sam 23 2 2 Fred 33 TEST3 Obs ID Name Age 1 1 Sam 23 2 2 Fred 33
@Tom where in the documentation does it say that data row applies to XLSX DBMS? In the documentation, it only shows that RANGE and SHEET apply to XLSX DBMS?
That's primarily the issue here, I'm guessing that it's under a different section than the link above?
Since DATAROW doesn't really work right unless you want GETNAMES=NO not sure it matters that much.
On this page:
It mentions DATAROW under the section on RANGE. It also does not seem to properly describe the limitation on the RANGE option that it MUST include a sheetname if you are not using a named range.
In the simple case of reading the first worksheet, I don't think that you need to specify the worksheet name in the range.
proc import datafile=&filename dbms=xlsx
out=test3 replace;
range='$A2:0';
run; quit;
Vince DelGobbo
SAS R&D
I thought I tried that. But looking at the log I had tried just '$A2'.
Interestingly you don't need the trailing zero. So just '$A2:' works.
Marvelous Sir @Tom . First off, thank you for your time and detailing the example. let me try that solution, and then come back to you should i seek any clarification and of course to mark the answer.
Sir @Tom , @Reeza , @Vince_SAS
Sorry for the bother yet again. May I request a favor in helping me with some kind of a tool that would convert all the xlsx files in a folder to CSV. I know this request is perhaps not appropriate in a SAS forum but I am just taking a chance. The reason that led to me to this is I am not quite happy with Proc import making guesses incorrectly for some files and perhaps correctly for others. Could i get lucky plz?
https://gist.github.com/statgeek/c51f58a009f8d315a200f34912e494b1
This goes from xml to xlsx but it's easy to modify, change the 51 to 6 and the extension in the code from xlsx to csv
https://docs.microsoft.com/en-us/office/vba/api/excel.xlfileformat
I don't think this deals with multiple sheets in each file, is that something you need to deal with?
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.