DATA Step, Macro, Functions and more

How to parse 'messy data with variable data across multiple observations?

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 9
Accepted Solution

How to parse 'messy data with variable data across multiple observations?

Hi all,

I have some semi-structured text files that I'm trying to read into SAS v9.1.3. An example is shown below in blue.

The column headings are listed above the dashed line and the data blocks are below the line. I'm not sure how to parse the data into a table structure (see desired outcome at the bottom). Any suggestions would be greatly appreciated.

                               Daily Report for 08SEP11

                      Printed on 09SEP11 at EST 07:19:00

                                                                      Page  1

Tran No  R Company Name             DB/CR Amount       DB Curr Date/Time Appv'd

Tran ID  N Payment Method           Transfer Amount    CR Curr Date/Time Prnt'd

           Value Date               FX Rate

-------- - ------------------------ ------------------ -- ---- ----------------

2510001    READING RAINBOW              375.50 DB  USD 08SEP11 07:11:22

SUPERXFER R book                                                08SEP11 15:47:00

           08SEP11                                              * PENDING *

           Debit account number:  3333333333

           Credit bank aaa:       12345678

           Credit acct number: 1111111111

           Credit acct name:   IRA FLEXIBLE SPENDING ACCOUNT

2510002    BABY SUPERSTORE         175,000.00 DB  USD 08SEP11 07:20:31

BSUPER001 N Fedwire Out Interface                               08SEP11 08:20:40

           08SEP11                                              * CONFIRMED *

           Debit account number:  2222222222

           Receiving bank aba:    234567898

           Credit account number: 2222-1111111-7

           Credit account name:   IND OPSE

Tran_NoCompany_NameDB_CR_AmtDateTime_ApprDateTime_PrnTran_IDPmt_MthdValue_dateDB_Acct_NbrCR_Bank_AAACR_Acct_NbrCR_Acct_Name
2510001READING RAINBOW375.5008SEP11 07:11:2208SEP11 15:47:00SUPERXFERbook08SEP113333333333123456781111111111 IRA FLEXIBLE SPENDING ACCOUNT
2510002BABY SUPERSTORE175,000.0008SEP11 07:20:3108SEP11 08:20:40BSUPER001Fedwire Out Interface08SEP1122222222222345678982222-1111111-7 IND OPSE



Accepted Solutions
Solution
‎09-18-2011 01:45 PM
Super User
Super User
Posts: 7,039

Re: How to parse 'messy data with variable data across multiple observations?

Here is a cleaned up version.  I had skipped the line with the value date before.

data

  company(keep=tran_no company_name db_cr_amount db curr appdt apptm

               tran_id rn payment_method trdt trtm valuedt fx_rate)

  company_account(keep=tran_no acctype acctid)

;

  infile src end=eof truncover;

  input @ ;

  if (_n_=1 or index(_infile_,'0c'x)) then do until (eof or _infile_=:'----');

    input / @;

  end;

  do while (_infile_=:'----' or _infile_=' ');

    input / @ ;

  end;

*-----------------------------------------------------------------------------;

* Read transaction report ;

*-----------------------------------------------------------------------------;

  input tran_no 1-8company_name $12-35

        db_cr_amount : comma15. db $ curr $ appdt date7. apptm time8.

      / tran_id $ 1-9 rn $ 11 payment_method $ 13-64 trdt date7. trtm time8.

      /  @12 valuedt date7.  fx_rate & $20.

      @

  ;

  format appdt trdt valuedt date9. apptm trtm time8. db_cr_amount comma15.2 ;

  output company;

*-----------------------------------------------------------------------------;

* Read the account numbers ;

*-----------------------------------------------------------------------------;

  do until (_infile_=' ' or eof) ;

    input / @;

    if _infile_ ne ' ' then do;

      acctype = left(scan(_infile_,1,':'));

      acctid = left(scan(_infile_,2,':'));

      output company_account;

    end;

  end;

run;

proc print data=company width=min;

run;


proc print data=company_account width=min;

run;

View solution in original post


All Replies
Super User
Posts: 19,770

How to parse 'messy data with variable data across multiple observations?

Posted in reply to Mr_Breeze

If you can delete the heading and all the records have the identical structure its a case of reading multiple line per subject data type.

Can you attach an example of your text file to your post?

What have you tried?

Here's a starting point if you haven't tried anything at all.

http://www.ats.ucla.edu/stat/sas/code/multiple_lines_per_s.htm

Occasional Contributor
Posts: 9

How to parse 'messy data with variable data across multiple observations?

Hi Reeza,

Thanks for the reply. I'm using input and # to point to multiple observations right now, but I haven't finished coding it yet. The text file is exactly like what I copied in the question body, just a lot more rows.

I would delete the heading and other records, but this is a file that I will receive daily, and I want to append each one to a master file, so it would be cumbersome to delete the heading of each file.

I'm thinking of using some sort of If to tell the program when to start and stop reading lines per data type.

Super User
Posts: 19,770

How to parse 'messy data with variable data across multiple observations?

Posted in reply to Mr_Breeze

Is the page no, ie page no repeated across diff pages or does it just occur at the top?

When I copy and paste into notepad there's a lot of extra lines that get added in, which is why I'm not sure of your actual file format.

One method is to brute force it. 

A record seems to start with an account number which seems to be a x digit so if the first value is xdigits its your start of record, would the tran id always be character or could it start with a number as well? Anyways, I'd read in the full string for each line and then parse separately.

You have more structure to your data than I've had with some programs like this but it's a method that works but isn't very elegant.

filename editset  "H:\sample.txt";

*READ IN FILE;

data sample;

    infile editset lrecl=80 END=eof truncover;

    input line $256.;

   

run;

Occasional Contributor
Posts: 9

Re: How to parse 'messy data with variable data across multiple observations?

The page numbers are repeated across different 'pages' in the file.

I didn't think of bringing in the whole line and then parsing... I'll try that and see how it works.

Thanks for the suggestion!

Frequent Contributor
Posts: 104

Re: How to parse 'messy data with variable data across multiple observations?

Posted in reply to Mr_Breeze

Here's an idea to try out as well.  If the data occurs in very specific blocks in a predictable pattern, then this structure may work:

data want;

     infile in (with all suitable options like line length truncover...);

     input @;                                         *--- read the line into _infile_ buffer;

     if left( _infile_ ) =: 'Daily Report' then do;                    *--- skip the report headers;

          do i=1 to 9;   input;   end;     *--- swallow 9 lines;

          return;

     end;

     acct = substr((left,_infile_),1,7);

     if verify(acct,'0123456789') = 0 then do;

          input #1 acct : 7. company_name db_cr_amount dt_appr

                 #2 .....

                 #3 .....

                 #4 .....

                 #5 .....

                 #6 .....

                 #7 .....

          ;

          output;

   end;

run;

The idea is to read a line & hold it, figure out if it is the start of a known block, if it is not the start of something useful, do nothing with it, start next loop.

If it is the start of a known block, then handle that known block as appropriate.

The "return" statement can be used to "start the next iteration of the data step".

Too bad you have to work with this data structure, it's far easier if you can arrange with them to give you the CSV file or XML file or Excel file even.  Anything structured is better than having to "undo" the report formatting.

Super User
Super User
Posts: 7,039

Re: How to parse 'messy data with variable data across multiple observations?

Posted in reply to Mr_Breeze

# pointer will probably not work unless both your header lines and your report blocks have the same number of lines.

If your blocks are consistently seven lines long then use / instead of # to move to the next line in your input.  Then you can still scan the file line by line to look for the start of the next block.  If they are not a consistent number of lines then it looks like you can read until you find the next empty line.

You should look for start of pages.  Usually you will see the formfeed character at the start of all pages (except the first).  ( index(_infile_,'0C'x) )

You header line seems to be all dashes.  To look for these I usually use this test :

  index(_infile_,'--') and compress(_infile_,'-')=' '

This will find lines that have at least two adjacent hyphens but no other printable characters.

Another trick is let in the input statement parse the line into words for you and then based on what you see you can assign the right word your variables.

infile ... truncover ;

length x1-x20 $200 ;

input x1-x20 ;

if length(x1)=7 and . ^= input(x1,??6.) then *** found start of new block *** ;

Super User
Posts: 10,020

Re: How to parse 'messy data with variable data across multiple observations?

Posted in reply to Mr_Breeze

Can you post a attachment which contains some sample data .

It will be more helpful to get a code.

Ksharp

Super User
Super User
Posts: 7,039

Re: How to parse 'messy data with variable data across multiple observations?

Posted in reply to Mr_Breeze

You could try this. It seems to read these two records. I tried to put in something to account for mulitple pages, but no way to test if it works.

The account descriptions and ID numbers listed below each transaction where not consistently formatted between your two records, so I decided to output those into a separated dataset.  I assumed that the TRAN_NO could serve as a key to link the records back together.

data                                                                                                                                   

company(keep=tran_no company_name db_cr_amount db curr appdt apptm                                                                    

              tran_id rn payment_method trdt trtm )                                                                                    

company_account(keep=tran_no acctype acctid)                                                                                          

;                                                                                                                                      

  infile src end=eof truncover ;                                                                                                       

  input @ ;                                                                                                                            

  if (_n_=1 or index(_infile_,'0c'x)) then do until(eof or _infile_=:'----');                                                          

    input / @;                                                                                                                         

  end;                                                                                                                                 

  if _infile_=:'----' or _infile_=' ' then input / @ ;                                                                                 

                                                                                                                                       

*-----------------------------------------------------------------------------;                                                        

* Read one company report ;                                                                                                            

*-----------------------------------------------------------------------------;                                                        

  input tran_no 1-8 company_name $12-35 db_cr_amount : comma15. db $ curr $ appdt date7. apptm time8.                                  

     /  tran_id $ 1-9 rn $ 11 payment_method $ 13-64 trdt date7. trtm time8.                                                           

  ;                                                                                                                                    

  format appdt trdt date9. apptm trtm time8. db_cr_amount comma15.2 ;                                                                  

  output company;                                                                                                                      

*-----------------------------------------------------------------------------;                                                        

* Read the account numbers ;                                                                                                           

*-----------------------------------------------------------------------------;                                                        

  do until (_infile_=' ' or eof) ;                                                                                                     

    input / @;                                                                                                                         

    if _infile_ ne ' ' then do;                                                                                                        

      acctype = left(scan(_infile_,1,':'));                                                                                            

      acctid  = left(scan(_infile_,2,':'));                                                                                            

      output company_account;                                                                                                          

    end;                                                                                                                               

  end;                                                                                                                                 

                                                                                                                                       

run;                                                                                                                                   

                                                                                                                                       

proc print data=company width=min; run;                                                                                                

proc print data=company_account width=min; run;                                                                                        

Solution
‎09-18-2011 01:45 PM
Super User
Super User
Posts: 7,039

Re: How to parse 'messy data with variable data across multiple observations?

Here is a cleaned up version.  I had skipped the line with the value date before.

data

  company(keep=tran_no company_name db_cr_amount db curr appdt apptm

               tran_id rn payment_method trdt trtm valuedt fx_rate)

  company_account(keep=tran_no acctype acctid)

;

  infile src end=eof truncover;

  input @ ;

  if (_n_=1 or index(_infile_,'0c'x)) then do until (eof or _infile_=:'----');

    input / @;

  end;

  do while (_infile_=:'----' or _infile_=' ');

    input / @ ;

  end;

*-----------------------------------------------------------------------------;

* Read transaction report ;

*-----------------------------------------------------------------------------;

  input tran_no 1-8company_name $12-35

        db_cr_amount : comma15. db $ curr $ appdt date7. apptm time8.

      / tran_id $ 1-9 rn $ 11 payment_method $ 13-64 trdt date7. trtm time8.

      /  @12 valuedt date7.  fx_rate & $20.

      @

  ;

  format appdt trdt valuedt date9. apptm trtm time8. db_cr_amount comma15.2 ;

  output company;

*-----------------------------------------------------------------------------;

* Read the account numbers ;

*-----------------------------------------------------------------------------;

  do until (_infile_=' ' or eof) ;

    input / @;

    if _infile_ ne ' ' then do;

      acctype = left(scan(_infile_,1,':'));

      acctid = left(scan(_infile_,2,':'));

      output company_account;

    end;

  end;

run;

proc print data=company width=min;

run;


proc print data=company_account width=min;

run;

Occasional Contributor
Posts: 9

Re: How to parse 'messy data with variable data across multiple observations?

Thanks, Tom!!

It does work across multiple pages, and parses everything nicely. Once I transpose the account file and join it to the company file, it will be lined up perfectly for my purposes.

I greatly appreciate all of the help!

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 10 replies
  • 1009 views
  • 6 likes
  • 5 in conversation