SAS Programming

Mr_Breeze · Posted 09-16-2011 01:12 PM

Hi all,

I have some semi-structured text files that I'm trying to read into SAS v9.1.3. An example is shown below in blue.

The column headings are listed above the dashed line and the data blocks are below the line. I'm not sure how to parse the data into a table structure (see desired outcome at the bottom). Any suggestions would be greatly appreciated.

Daily Report for 08SEP11

Printed on 09SEP11 at EST 07:19:00

Page 1

Tran No R Company Name DB/CR Amount DB Curr Date/Time Appv'd

Tran ID N Payment Method Transfer Amount CR Curr Date/Time Prnt'd

Value Date FX Rate

-------- - ------------------------ ------------------ -- ---- ----------------

2510001 READING RAINBOW 375.50 DB USD 08SEP11 07:11:22

SUPERXFER R book 08SEP11 15:47:00

08SEP11 * PENDING *

Debit account number: 3333333333

Credit bank aaa: 12345678

Credit acct number: 1111111111

Credit acct name: IRA FLEXIBLE SPENDING ACCOUNT

2510002 BABY SUPERSTORE 175,000.00 DB USD 08SEP11 07:20:31

BSUPER001 N Fedwire Out Interface 08SEP11 08:20:40

08SEP11 * CONFIRMED *

Debit account number: 2222222222

Receiving bank aba: 234567898

Credit account number: 2222-1111111-7

Credit account name: IND OPSE

Tran_No	Company_Name	DB_CR_Amt	DateTime_Appr	DateTime_Prn	Tran_ID	Pmt_Mthd	Value_date	DB_Acct_Nbr	CR_Bank_AAA	CR_Acct_Nbr	CR_Acct_Name
2510001	READING RAINBOW	375.50	08SEP11 07:11:22	08SEP11 15:47:00	SUPERXFER	book	08SEP11	3333333333	12345678	1111111111	IRA FLEXIBLE SPENDING ACCOUNT
2510002	BABY SUPERSTORE	175,000.00	08SEP11 07:20:31	08SEP11 08:20:40	BSUPER001	Fedwire Out Interface	08SEP11	2222222222	234567898	2222-1111111-7	IND OPSE

Tom · Posted 09-18-2011 01:45 PM

Here is a cleaned up version. I had skipped the line with the value date before.

data

company(keep=tran_no company_name db_cr_amount db curr appdt apptm

tran_id rn payment_method trdt trtm valuedt fx_rate)

company_account(keep=tran_no acctype acctid)

;

infile src end=eof truncover;

input @ ;

if (_n_=1 or index(_infile_,'0c'x)) then do until (eof or _infile_=:'----');

input / @;

end;

do while (_infile_=:'----' or _infile_=' ');

input / @ ;

end;

*-----------------------------------------------------------------------------;

* Read transaction report ;

*-----------------------------------------------------------------------------;

input tran_no 1-8company_name $12-35

db_cr_amount : comma15. db $ curr $ appdt date7. apptm time8.

/ tran_id $ 1-9 rn $ 11 payment_method $ 13-64 trdt date7. trtm time8.

/ @12 valuedt date7. fx_rate & $20.

@

;

format appdt trdt valuedt date9. apptm trtm time8. db_cr_amount comma15.2 ;

output company;

*-----------------------------------------------------------------------------;

* Read the account numbers ;

*-----------------------------------------------------------------------------;

do until (_infile_=' ' or eof) ;

input / @;

if _infile_ ne ' ' then do;

acctype = left(scan(_infile_,1,':'));

acctid = left(scan(_infile_,2,':'));

output company_account;

end;

run;

proc print data=company width=min;

run;

proc print data=company_account width=min;

run;

View solution in original post

Reeza · Posted 09-16-2011 01:32 PM

If you can delete the heading and all the records have the identical structure its a case of reading multiple line per subject data type.

Can you attach an example of your text file to your post?

What have you tried?

Here's a starting point if you haven't tried anything at all.

http://www.ats.ucla.edu/stat/sas/code/multiple_lines_per_s.htm

Mr_Breeze · Posted 09-16-2011 01:45 PM

Hi Reeza,

Thanks for the reply. I'm using input and # to point to multiple observations right now, but I haven't finished coding it yet. The text file is exactly like what I copied in the question body, just a lot more rows.

I would delete the heading and other records, but this is a file that I will receive daily, and I want to append each one to a master file, so it would be cumbersome to delete the heading of each file.

I'm thinking of using some sort of If to tell the program when to start and stop reading lines per data type.

Reeza · Posted 09-16-2011 02:20 PM

Is the page no, ie page no repeated across diff pages or does it just occur at the top?

When I copy and paste into notepad there's a lot of extra lines that get added in, which is why I'm not sure of your actual file format.

One method is to brute force it.

A record seems to start with an account number which seems to be a x digit so if the first value is xdigits its your start of record, would the tran id always be character or could it start with a number as well? Anyways, I'd read in the full string for each line and then parse separately.

You have more structure to your data than I've had with some programs like this but it's a method that works but isn't very elegant.

filename editset "H:\sample.txt";

*READ IN FILE;

data sample;

infile editset lrecl=80 END=eof truncover;

input line $256.;

run;

Mr_Breeze · Posted 09-16-2011 02:42 PM

The page numbers are repeated across different 'pages' in the file.

I didn't think of bringing in the whole line and then parsing... I'll try that and see how it works.

Thanks for the suggestion!

DLing · Posted 09-16-2011 04:09 PM

Here's an idea to try out as well. If the data occurs in very specific blocks in a predictable pattern, then this structure may work:

data want;

infile in (with all suitable options like line length truncover...);

input @; *--- read the line into _infile_ buffer;

if left( _infile_ ) =: 'Daily Report' then do; *--- skip the report headers;

do i=1 to 9; input; end; *--- swallow 9 lines;

return;

end;

acct = substr((left,_infile_),1,7);

if verify(acct,'0123456789') = 0 then do;

input #1 acct : 7. company_name db_cr_amount dt_appr

#2 .....

#3 .....

#4 .....

#5 .....

#6 .....

#7 .....

;

output;

end;

run;

The idea is to read a line & hold it, figure out if it is the start of a known block, if it is not the start of something useful, do nothing with it, start next loop.

If it is the start of a known block, then handle that known block as appropriate.

The "return" statement can be used to "start the next iteration of the data step".

Too bad you have to work with this data structure, it's far easier if you can arrange with them to give you the CSV file or XML file or Excel file even. Anything structured is better than having to "undo" the report formatting.

Tom · Posted 09-16-2011 08:05 PM

# pointer will probably not work unless both your header lines and your report blocks have the same number of lines.

If your blocks are consistently seven lines long then use / instead of # to move to the next line in your input. Then you can still scan the file line by line to look for the start of the next block. If they are not a consistent number of lines then it looks like you can read until you find the next empty line.

You should look for start of pages. Usually you will see the formfeed character at the start of all pages (except the first). ( index(_infile_,'0C'x) )

You header line seems to be all dashes. To look for these I usually use this test :

index(_infile_,'--') and compress(_infile_,'-')=' '

This will find lines that have at least two adjacent hyphens but no other printable characters.

Another trick is let in the input statement parse the line into words for you and then based on what you see you can assign the right word your variables.

infile ... truncover ;

length x1-x20 $200 ;

input x1-x20 ;

if length(x1)=7 and . ^= input(x1,??6.) then *** found start of new block *** ;

Ksharp · Posted 09-16-2011 11:51 PM

Can you post a attachment which contains some sample data .

It will be more helpful to get a code.

Ksharp

Tom · Posted 09-17-2011 11:59 AM

You could try this. It seems to read these two records. I tried to put in something to account for mulitple pages, but no way to test if it works.

The account descriptions and ID numbers listed below each transaction where not consistently formatted between your two records, so I decided to output those into a separated dataset. I assumed that the TRAN_NO could serve as a key to link the records back together.

data

company(keep=tran_no company_name db_cr_amount db curr appdt apptm

tran_id rn payment_method trdt trtm )

company_account(keep=tran_no acctype acctid)

;

infile src end=eof truncover ;

input @ ;

if (_n_=1 or index(_infile_,'0c'x)) then do until(eof or _infile_=:'----');

input / @;

end;

if _infile_=:'----' or _infile_=' ' then input / @ ;

*-----------------------------------------------------------------------------;

* Read one company report ;

*-----------------------------------------------------------------------------;

input tran_no 1-8 company_name $12-35 db_cr_amount : comma15. db $ curr $ appdt date7. apptm time8.

/ tran_id $ 1-9 rn $ 11 payment_method $ 13-64 trdt date7. trtm time8.

;

format appdt trdt date9. apptm trtm time8. db_cr_amount comma15.2 ;

output company;

*-----------------------------------------------------------------------------;

* Read the account numbers ;

*-----------------------------------------------------------------------------;

do until (_infile_=' ' or eof) ;

input / @;

if _infile_ ne ' ' then do;

acctype = left(scan(_infile_,1,':'));

acctid = left(scan(_infile_,2,':'));

output company_account;

end;

run;

proc print data=company width=min; run;

proc print data=company_account width=min; run;

Tom · Posted 09-18-2011 01:45 PM

Here is a cleaned up version. I had skipped the line with the value date before.

data

company(keep=tran_no company_name db_cr_amount db curr appdt apptm

tran_id rn payment_method trdt trtm valuedt fx_rate)

company_account(keep=tran_no acctype acctid)

;

infile src end=eof truncover;

input @ ;

if (_n_=1 or index(_infile_,'0c'x)) then do until (eof or _infile_=:'----');

input / @;

end;

do while (_infile_=:'----' or _infile_=' ');

input / @ ;

end;

*-----------------------------------------------------------------------------;

* Read transaction report ;

*-----------------------------------------------------------------------------;

input tran_no 1-8company_name $12-35

db_cr_amount : comma15. db $ curr $ appdt date7. apptm time8.

/ tran_id $ 1-9 rn $ 11 payment_method $ 13-64 trdt date7. trtm time8.

/ @12 valuedt date7. fx_rate & $20.

@

;

format appdt trdt valuedt date9. apptm trtm time8. db_cr_amount comma15.2 ;

output company;

*-----------------------------------------------------------------------------;

* Read the account numbers ;

*-----------------------------------------------------------------------------;

do until (_infile_=' ' or eof) ;

input / @;

if _infile_ ne ' ' then do;

acctype = left(scan(_infile_,1,':'));

acctid = left(scan(_infile_,2,':'));

output company_account;

end;

run;

proc print data=company width=min;

run;

proc print data=company_account width=min;

run;

Mr_Breeze · Posted 09-19-2011 09:41 AM

Thanks, Tom!!

It does work across multiple pages, and parses everything nicely. Once I transpose the account file and join it to the company file, it will be lined up perfectly for my purposes.

I greatly appreciate all of the help!

SAS Programming

How to parse 'messy data with variable data across multiple observations?

Re: How to parse 'messy data with variable data across multiple observations?

How to parse 'messy data with variable data across multiple observations?

How to parse 'messy data with variable data across multiple observations?

How to parse 'messy data with variable data across multiple observations?

Re: How to parse 'messy data with variable data across multiple observations?

Re: How to parse 'messy data with variable data across multiple observations?

Re: How to parse 'messy data with variable data across multiple observations?

Re: How to parse 'messy data with variable data across multiple observations?

Re: How to parse 'messy data with variable data across multiple observations?

Re: How to parse 'messy data with variable data across multiple observations?

Re: How to parse 'messy data with variable data across multiple observations?

Follow Us

What is...

SAS Programming

Special offer for SAS Communities members

SAS Training: Just a Click Away

Follow Us

What is...