Re: Import Excel spreadsheet which contains two separate datasets

K_McInvale · Posted 12-08-2015 10:18 AM

Note: I am using PC SAS 9.4 M2(x64).

I have some Excel spreadsheets to import and each spreadsheet contains two datasets that ate stacked horizontally.

Upstream
Record_No, Col2, Col3, etc
1 , data, data, data
2 , data, data, data
3 , data, data, data
n. , data, data, data
Blank-row
Downstream
Record_No, col2, col3, col4
1 , data, data, data
......
......

I'm thinking something like import the whole sheet into a dataset. Then a proc statement data = work.upstream. work.downstream. A do until loop until the blank row or merged cell 'downstream'. This data output to upstream and the remaining data to 'downstream' dataset.

Any suggestions would be greatly appreciated

DartRodrigo · Posted 12-08-2015 10:36 AM

Hi mate,

This code works as you need:

Use range option to specify the range cells and get the first table and the second, the following example i've made:

TABLE1
COL1	COL2	COL3	COL4	COL5
A	B	C	D	E
A	B	C	D	E
A	B	C	D	E
A	B	C	D	E
A	B	C	D	E
A	B	C	D	E
A	B	C	D	E

TABLE2
COL1	COL2	COL3	COL4	COL5
A	B	C	D	E
A	B	C	D	E
A	B	C	D	E
A	B	C	D	E
A	B	C	D	E
A	B	C	D	E
A	B	C	D	E

PROC IMPORT DATAFILE="/PATH/test.xls"
		OUT = UPSTREAM                       
    DBMS=XLS REPLACE;                                                                                                                 
    SHEET="Sheet1";                                                                                                                    
    GETNAMES=YES;                                                                                                                      
DATAROW=2;
RANGE="Sheet1$A1:E9";
RUN;

PROC IMPORT DATAFILE="/PATH/test.xls"
		OUT = DOWNSTREAM                       
    DBMS=XLS REPLACE;                                                                                                                 
    SHEET="Sheet1";                                                                                                                    
    GETNAMES=YES;                                                                                                                      
DATAROW=2;
RANGE="Sheet1$A11:E19";
RUN;

Hope this helps

K_McInvale · Posted 12-08-2015 11:32 AM

Sorry, I should have been more clear. The first section, Table1, contains 'n' observations. I am importing from roughly 30 workbooks and n is rarely the same. So, using range= does not help me in this situation.

DartRodrigo · Posted 12-08-2015 11:53 AM

Ok,

I changed to use macro to get the values and separate the datasets:

PROC IMPORT DATAFILE="/path/test.xls"
		OUT = FIRSTSTREAM                       
    DBMS=XLS REPLACE;                                                                                                                 
    SHEET="Sheet1";                                                                                                                    
    GETNAMES=YES;                                                                                                                      
DATAROW=3;
RUN;

DATA HAVE;
   SET FIRSTSTREAM;
   X = _N_;
RUN;

PROC SQL;
  SELECT X+2    INTO: FIRSTLINE FROM HAVE WHERE TABLE1 = "";
  SELECT MAX(X) INTO: LASTLINE  FROM HAVE;
QUIT;


PROC IMPORT DATAFILE="/path/test.xls"
		OUT = UPSTREAM                       
    DBMS=XLS REPLACE;                                                                                                                 
    SHEET="Sheet1";                                                                                                                    
    GETNAMES=YES;                                                                                                                      
DATAROW=2;
RANGE="Sheet1$A1:E&FIRSTLINE.";
RUN;

PROC IMPORT DATAFILE="/path/test.xls"
		OUT = DOWNSTREAM                       
    DBMS=XLS REPLACE;                                                                                                                 
    SHEET="Sheet1";                                                                                                                    
    GETNAMES=YES;                                                                                                                      
DATAROW=2;
RANGE="Sheet1$A&FIRSTLINE.:E&LASTLINE.";
RUN;

PGStats · Posted 12-08-2015 12:48 PM

Could you post one of your Excel files? The proper approach depends a lot on data structure and layout.

PG

K_McInvale · Posted 12-08-2015 02:10 PM

Excellent point Mr Respected Advisor! I have a dummy workbook and I will place an image here as well.

RW9 · Posted 12-09-2015 04:17 AM

Which shows my example nicely. What you have posted is a best case scenario, all the columns are text, and the same structure. Is your data really like that, I wouldn't think so. So what will you do if:

- col1 is missing for any reason within the data?

- numbers where previous table is characters

- different numbers of columns

These are a few things that spring to mind, not to mention hidded data, special characters, no fixed setup, so could change at further attempts etc. Why are you importing this data in the first place, why not access the Raw data this was produced from? I mentioned this in another topic, but one of the big reasons to avoid Excel as an output format is the general misconception that it is a data transfer format and can be used as such (unlike Word/PDF), hence we end up trying to read in output review files, which is what you are trying to do here, rather than go through a strcutured validated robust approach of:

- create data import specifcation

- get agreement on file format (using proper data transfer file formats = CSV, XML etc.)

- create import program based on specifcations

- run program on test data and validate

- run at each import occasion and validate

DartRodrigo · Posted 12-09-2015 07:00 AM

PROC IMPORT DATAFILE="/path/test.xls"
		OUT = FIRSTSTREAM                       
    DBMS=XLS REPLACE;                                                                                                                 
    SHEET="Sheet1";                                                                                                                    
    GETNAMES=YES;                                                                                                                      
DATAROW=3;
RUN;

DATA HAVE;
   SET FIRSTSTREAM;
   X = _N_;
RUN;

PROC SQL;
  SELECT X+2    INTO: FIRSTLINE FROM HAVE WHERE TABLE1 = "";
  SELECT MAX(X) INTO: LASTLINE  FROM HAVE;
QUIT;


PROC IMPORT DATAFILE="/path/test.xls"
		OUT = UPSTREAM                       
    DBMS=XLS REPLACE;                                                                                                                 
    SHEET="Sheet1";                                                                                                                    
    GETNAMES=YES;                                                                                                                      
DATAROW=2;
RANGE="Sheet1$A1:E&FIRSTLINE.";
RUN;

PROC IMPORT DATAFILE="/path/test.xls"
		OUT = DOWNSTREAM                       
    DBMS=XLS REPLACE;                                                                                                                 
    SHEET="Sheet1";                                                                                                                    
    GETNAMES=YES;                                                                                                                      
DATAROW=2;
RANGE="Sheet1$A&FIRSTLINE.:E&LASTLINE.";
RUN;

This code do exactly what you need, and you don't need to specify which one is the line to breakup the tables.
You only have to change the where clause in proc sql.

Hope this helps

RW9 · Posted 12-08-2015 10:43 AM

I wouldn't import it as is. The reason is that you will have text (downstrem) in a numeric field. Its messy at best, and assuming a missing value is the break could be misleading. Its another example of Excel being a problem when dealing with real data. I would suggest you break the table yourself into two sheets, and preferentially save each sheet as CSV and write an import program for each. This avoids the numerous issues when using Excel .

The 2025 SAS Hackathon has begun!

SAS Training: Just a Click Away