Solved: Re: Import excel file and read specific column header

AZIQ1 · Posted 03-23-2018 02:50 PM

Hi,

I am a beginner to intermediate SAS user and need help.

Here is a sample excel file that I want to import in SAS:

This is a sample file

This is a sample file

This is a sample file
This is a sample file
School A
Class	Grade	Pass
1	A	Yes
2	B	Yes
3	E	No
This is a sample file

This is a sample file

This is a sample file
This is a sample file
School B
Class	Grade	Pass
4	A	Yes
5	E	No
6	E	No

I want SAS to read the data every time it hits the column header "Class" so the final data looks like a clean SAS data set with single column header and all the data and delete the rows in between:

Class	Grade	Pass
1	A	Yes
2	B	Yes
3	E	No
4	A	Yes
5	E	No
6	E	No

Thank you

Patrick · Posted 03-24-2018 02:26 AM

@AZIQ1

I'm giving basically the same answer as @PGStats but also posting some code creating data as it would look like after a proc import.

/*proc import datafile="c:\temp\sample.xlsx"*/
/*  dbms=xlsx*/
/*  out=work.source replace;*/
/*  sheet='sheet1';*/
/*  getnames=no;*/
/*run;*/

DATA WORK.source;
    INFILE DATALINES4
        DLM='7F'x
        truncover
        DSD ;
    INPUT
        a               : $30.
        b               : $30.
        c               : $30. ;
DATALINES4;
This is a sample file   
   
This is a sample file   
   
This is a sample file   
This is a sample file   
School A   
ClassGradePass 
1AYes 
2BYes 
3ENo 
This is a sample file   
   
This is a sample file   
   
This is a sample file   
This is a sample file   
School B   
ClassGradePass
4AYes
5ENo
6ENo
;;;;
run;

data want;
  set source;
  retain headerFlg 0;
  if find(a,'This is a sample file','ti') then headerFlg=0;
  if headerFlg=1 then
    do;
      ClassID=input(a,best32.);
      Grade=input(b,$1.);
      Pass=input(c,$3.);
      output;
      keep ClassID Grade Pass;
    end;

  if find(a,'class','ti') then headerFlg=1;

run;

N.B: For your real data you will need to find text tokens where you're 100% sure that they are sufficient as indicators where your wanted data starts and where it ends.

You probably need also to implement some logic which verifies on which column your data starts (unless you can be certain it's always on column A).

View solution in original post

Reeza · Posted 03-23-2018 03:23 PM

Is your file Excel or CSV?

Are you doing this multiple times or once for a single file?

What have you tried so far that doesn't work, so we don't suggest things you've already tried.

AZIQ1 · Posted 03-23-2018 03:29 PM

Thank you for your response.
It is in excel
It is for multiple files that are similar
I haven't tried anything, I know range specification but that wont work here since the row can be any number.

Reeza · Posted 03-23-2018 03:36 PM

So there's no consistency to the data at all? They're not using the same ranges across the tables?

If that's the case, I would consider converting it to CSV and then using a manual data step to parse the code into what you need. Not particularly fun but it gets the job done.

AZIQ1 · Posted 03-23-2018 03:46 PM

Thank you, but there are hundreds of files.

Reeza · Posted 03-23-2018 04:14 PM

@AZIQ1 wrote:
Thank you, but there are hundreds of files.

Ok, isn't that why your using a programming language so you can automate/generalize your process?

There's a VBS script I posted here earlier this week that will convert all XLSX files in a folder into CSV

Then write the program once for a single file, test it. Once it's working, its trivial to run it for all by either modifying the program.

We can't provide code for a problem where no data has been provided except a general idea for a single file.

Patrick · Posted 03-24-2018 02:26 AM

@AZIQ1

I'm giving basically the same answer as @PGStats but also posting some code creating data as it would look like after a proc import.

/*proc import datafile="c:\temp\sample.xlsx"*/
/*  dbms=xlsx*/
/*  out=work.source replace;*/
/*  sheet='sheet1';*/
/*  getnames=no;*/
/*run;*/

DATA WORK.source;
    INFILE DATALINES4
        DLM='7F'x
        truncover
        DSD ;
    INPUT
        a               : $30.
        b               : $30.
        c               : $30. ;
DATALINES4;
This is a sample file   
   
This is a sample file   
   
This is a sample file   
This is a sample file   
School A   
ClassGradePass 
1AYes 
2BYes 
3ENo 
This is a sample file   
   
This is a sample file   
   
This is a sample file   
This is a sample file   
School B   
ClassGradePass
4AYes
5ENo
6ENo
;;;;
run;

data want;
  set source;
  retain headerFlg 0;
  if find(a,'This is a sample file','ti') then headerFlg=0;
  if headerFlg=1 then
    do;
      ClassID=input(a,best32.);
      Grade=input(b,$1.);
      Pass=input(c,$3.);
      output;
      keep ClassID Grade Pass;
    end;

  if find(a,'class','ti') then headerFlg=1;

run;

N.B: For your real data you will need to find text tokens where you're 100% sure that they are sufficient as indicators where your wanted data starts and where it ends.

You probably need also to implement some logic which verifies on which column your data starts (unless you can be certain it's always on column A).

SuryaKiran · Posted 03-23-2018 04:11 PM

Hi,

If this is your situation the best way is convert the file to CSV as others mentioned and then using DATA STEP you can read as you need.

DATA WORK.SAS_Import;
INFILE '/SAS_Import.csv' dlm=',' dsd ;
    INPUT @;
	IF prxmatch("m/this|sample|School|class| /oi",_INFILE_) or missing(compress(_infile_,',')) then delete;
	else INPUT CLASS $ GRADE $ PASS $;
RUN;

There is another way, for this you need to understand how your data is imported and then convert the dataset (Cleansing data).

proc import datafile='/SAS_Import.xlsx'
dbms=xlsx out=Have replace;
Getnames=no;
run;

data want;
set have(rename=(A=CLASS B=GRADE C=PASS));
IF prxmatch("m/this|sample|School|class/oi",CLASS) or missing(CLASS) then delete;
run;

Thanks,
Suryakiran

AZIQ1 · Posted 04-06-2018 11:48 PM

Thank you so much

PGStats · Posted 03-24-2018 01:37 AM

The fact that the data resides in an Excel spreadsheet is not important. You just need to read the three columns, something like

infile xl Excel "path\myFile.xlsx" header=no;

data test;

set xl.'Sheet1$A1:C1000';

columns will be named F1, F2 and F3.

then set a flag for when the subtables start

if F1="Class" and F2 = "Grade" and F3 = "Pass" then inTable = 1;

else do;

reset the flag when you read beyond the subtable

if inTable then if notdigit(trim(F1)) > 0 then inTable = 0;

Transfer the values if Inside the subtable

if inTable then do;

...

output;

end;

...

end;

PG

AZIQ1 · Posted 04-06-2018 11:47 PM

Thank you so much this worked too

Registration is open

SAS Training: Just a Click Away