DATA Step, Macro, Functions and more

Parse Data Based on a Particular Value.

Reply
Occasional Contributor
Posts: 12

Parse Data Based on a Particular Value.

I have a text file that I want to import and parse in SAS. The problem with this text file, though, is that it has no delimiters; however, each line starts the same:

 

1234567892120QAZWSXEDCRFVTGBYHNUJMWSXZCVBNMLKIOPWSDX                 /*start of new record*/
1234567892121PLMOKNIJBUHVYGCTFXRDZIKJUYHGVFCDXSZMKI
1234567892121QWEDSAZXCVBNHGFRTYHJKZASWQEDCFGTRNHALOIYBVNU
1234567892122NMLIOPLOIKMUJHNYTGBCFMNJKTDCFVEWSEUZWEYQ45
1234519994780ZXCDEWSAQRFVCDEWSFHYUSAJHTGKMCJURQB879312              /*start of new record*/
1234519994781PQALZMXNSKEIRUFHVNBGTZXCVBNMLKJHGFD

 

What I want to do is separate the data into columns based on a particular value. In the example above, if digit 12 is 0, then that line is separated a certain way. If digit 12 is 1, then that line is separated a different way. The 12th digit can range from 0-9, and depending on the value of that digit, that line has to be separated a particular way. 

 

If the 12th digit = 0, then it is a new record, so the first four lines of data above would be one record but would need to separated three different ways; the last two lines would another record and would need to be separated two different ways.

 

Any idea as to how I can do this? Any help is greatly appreciated.

 

/*here is what I have so far...not much since I got stuck; r_type is the 12th digit*/
/*I was trying to parse everything based on one value of r_type but realized I need
to parse each line differently based on the value of r_type on that line.*/ data new; infile 'location of my text file'; length c_num $4 w_num $7 r_type $1 spa $12 active_num $16; input c_num 1-4 w_num 5-11 r_type 12 spa 13-24 active_num 25-40; run;

 

 

Super User
Posts: 19,791

Re: Parse Data Based on a Particular Value.

You can use multiple input statements and trailing @@ to hold the line. 

 

Input @12 special_char @@;

 

if special_char = 1 then 

Input code goes here;

Else input code goes here ;

Occasional Contributor
Posts: 12

Re: Parse Data Based on a Particular Value.

Okay. Thanks so much for your response. Based on what you said, I wrote the code below...is this what you were talking about?

 

data new;
infile 'my location';

length c_num $4 w_num $7 r_type $1 spa $12 active_num $16
receipt date $8 name $64 action_type $8 var_name $20;

input @12 record_type@@;
if record_type = 0 then input c_num 1-4 w_num 5-11 spa 13-24 receipt_date 71-78;
else if record_type = 1 then input c_num 1-4 w_num 5-11 @14 name;
else if record_type = 2 then input c_num 1-4 w_num 5-11 @13 action_type;
else if recorD_type = 3 then input c_num 1-4 w_num 5-11 @14 var_name;

proc print data = new;
run;
Super User
Posts: 11,343

Re: Parse Data Based on a Particular Value.

I would expand on @Reeza with:

 

data new;
   infile ...;
   informat special_char $1.;
   Input  special_char 12 @@;

   select (special_char);
      when ('0') input <rest of input statement for rule 0>;
      when ('1') input <rest of input statement for rule 1>;
      when ('2') input <rest of input statement for rule 2>;
      when ('3') input <rest of input statement for rule 3>;
      ...
      when ('9') input <rest of input statement for rule 9>;
      otherwise Put "WARNING: Unexpected value" +1 special_char= $1. +1 "on line" +1 _n_;
   end;
<other stuff in data step>
run;

Read as a single character in case you find something that you don't expect.

 

Trusted Advisor
Posts: 1,564

Re: Parse Data Based on a Particular Value.

As one record is composed of  up to 10 rows, there need some more work:

- you need RETAIN all variables in memory, until you have all different inputs of one record 

- when you meet a new record you need OUTPUT the previous agregated record data

   before dealing with the new record

- if it is the first INPUT statement then there is no previous record to OUYTPUT.

- you need to check End Of Input file in order to OUTPUT the last record

- you should consider: 

     + is r_type 0 exists in every record ? can it be a flag to recognize start of new record ?

     + which r_types are must and which are optional ? 

     + how to deal with an absent must r_tyoe record ?

     + is the input sorted by r_type ? if yes and r_type 9 is a must it can be used as flag to OUTPUT the record

 

Try to add to the code you got from @Reeza and @ballardw and check your results.

If you still have issues we are here to help.

 

Occasional Contributor
Posts: 12

Re: Parse Data Based on a Particular Value.

Okay. Thanks for the help. I am learning a lot here. My SAS skills are limited...

R_type exists in every record, as far as I know, but the input is not sorted by r_type. I added to the code @Reeza and @ballardw suggested, but I am not sure how to incorporate all of the other things you mentioned into the code, specifically checking the end of input and outputting the record. Would I need to add some type of loop before the select statement below?

 

data new;
infile 'my location';
length client_num $4 workcase_num $7 record_type $1 spa $12
receipt_date $8 name $104 action_type $8 var_name $20;
informat r_type $1.;
retain c_num w_num spa receipt_date name action_type var_name;
input r_type 12 @@;
select (r_type);
when ('0') input c_num 1-4 w_num 5-11 spa 13-24 receipt_date 71-78;
when ('1') input c_num 1-4 w_num 5-11 @14 name;
when ('2') input c_num 1-4 w_num 5-11 @13 action_type;
when ('3') input c_num 1-4 w_num 5-11 @14 var_name;
/*the rest of the r_type values here*/
otherwise Put "Other value" +1 r_type = $1. +1 "on line" +1 _n_;
end;
run;
Trusted Advisor
Posts: 1,564

Re: Parse Data Based on a Particular Value.

First, I have some more remarks:

- r_type is probably in position 13 and not 12. So either c_num is $5 or w_num is $8.

- As r_type value is between 0 and 9 I prefer define it as numeric.

   If c_num and w_num should be numeric too, and not alphanumeric, then better define them as numeric.

   I assumed they should be numeric and printed into log if they are not.

- In order to be able to sort data by c_num w_num r_type I broke the program into two steps.

  Step 1 - input those 3 variables and the rest into a temporary row_rest variable

               checking validity - are they numeric.

  Step 2 - parse the row_rest into wanted variables according to r_type and dealing with output when ready.

- The sort enables SAS to check are c_num w_num the same as previous row or a new record start.

- more remarks inside the attached code

- you should check and adapt postions of all other variables in row-rest variable.

  remember tha I have cut off the first 12 bytes.

 

When you run the program successfully, I'll show you how to use macro in order to reuse similar code many times

in a short way.

 

Attachment
Occasional Contributor
Posts: 12

Re: Parse Data Based on a Particular Value.

Well, I made modifications to Reeza/ballardw/Shmuel's code below and got it to run; however, I am not getting the right results. The code parses the data correctly and splits it up, but the problem seems to be when putting the data back together to get one record per w_num (some fields are accurate for each record, while other fields belong to some other record).  I'm not sure how or why that is happening.  

Also, how can I make the code below more efficient? I am a fairly new SAS user, so I am sure the modifications I made weren't the best way to go...

 

I pasted the code below and attached it as well.

 

data new;
 infile 'my location';
 length c_num $4 w_num $7 r_type $1 spa $12
		account_num $16 csr $3 TEAM_ID $3 REASSIGN_REASON $2 
		W_TYPE $8 other $14 receipt_date $8 entry_date $8
		other2 $29 COMPLIANCE_DATE $8 EXCEPTION_DATE $8 
		REVIEW_DATE $8 RESOLVE_DATE $8 other3 $56 addr_flag $1
		name $104 addr $104 city_state_zip $30 other4 $282 
		phone $38 other5 $19 action_type $8 queue_ID $8 
		other6 $73 other7 $73 other8 $83 other9 $73 
		other10 $101 other11 $7 other12 $44;
 retain c_num w_num r_type spa account_num
		csr TEAM_ID REASSIGN_REASON W_TYPE other 
		receipt_date entry_date other2 COMPLIANCE_DATE 
		EXCEPTION_DATE REVIEW_DATE RESOLVE_DATE other3 
		addr_flag name addr city_state_zip other4 phone 
		other5 action_type queue_ID other6 other7 other8
		other9 other10 other11 other12;
 informat r_type $1.;
 input r_type 12 @@;
 select (r_type);
	when ('0') input c_num 1-4 w_num 5-11 
               spa 13-24 account_num 25-40 csr 41-43
               TEAM_ID 44-46 REASSIGN_REASON 47-48 
               W_TYPE 49-56 other 57-70 
               receipt_date 71-78 entry_date 79-86
               other2 87-115 COMPLIANCE_DATE 116-123
               EXCEPTION_DATE 124-131 
               REVIEW_DATE 132-139 
               RESOLVE_DATE 140-147 other3 148-203;
	when ('1') input c_num 1-4 w_num 5-11
               addr_flag 13 name 14-117 addr 118-221 
               city_state_zip 222-251 other4 252-533 
			   phone 534-571 other5 572-590;
	when ('2') input c_num 1-4 w_num 5-11
               action_type 13-20 queue_ID 21-28 
               other6 29-101;
	when ('3') input c_num 1-4 w_num 5-11
			   other7 13-85;
	when ('4') input c_num 1-4 w_num 5-11
               other8 13-95;
	when ('5') input c_num 1-4 w_num 5-11
               other9 13-85;
	when ('6') input c_num 1-4 w_num 5-11
               other10 13-113;
	when ('9') input c_num 1-4 other11 5-11
               other12 13-56;
    otherwise Put "Other value" +1 r_type = $1.
              +1 "on line" +1 _n_;
 end;
run;


proc sort data=new; 
  by w_num r_type; 
run;

data want;
 set new end=eof;  /* eof is assigned variable indicator for End-Of-File */
 by w_num r_type; 
     
    length c_num $4 w_num $7 r_type $1 spa $12
		account_num $16 csr $3 TEAM_ID $3 REASSIGN_REASON $2 
		W_TYPE $8 other $14 receipt_date $8 entry_date $8
		other2 $29 COMPLIANCE_DATE $8 EXCEPTION_DATE $8 
		REVIEW_DATE $8 RESOLVE_DATE $8 other3 $56 addr_flag $1
		name $104 addr $104 city_state_zip $30 other4 $282 
		phone $38 other5 $19 action_type $8 queue_ID $8 
		other6 $73 other7 $73 other8 $83 other9 $73 
		other10 $101 other11 $7 other12 $44;
 	retain c_num w_num r_type spa account_num
		csr TEAM_ID REASSIGN_REASON W_TYPE other 
		receipt_date entry_date other2 COMPLIANCE_DATE 
		EXCEPTION_DATE REVIEW_DATE RESOLVE_DATE other3 
		addr_flag name addr city_state_zip other4 phone 
		other5 action_type queue_ID other6 other7 other8
		other9 other10 other11 other12;
    array rest $ spa account_num
		csr TEAM_ID REASSIGN_REASON W_TYPE other 
		receipt_date entry_date other2 COMPLIANCE_DATE 
		EXCEPTION_DATE REVIEW_DATE RESOLVE_DATE other3 
		addr_flag name addr city_state_zip other4 phone 
		other5 action_type queue_ID other6 other7 other8
		other9 other10 other11 other12;
    array typs {10} t0-t9; /* checking are all r_types exist */

    if first.w_num then do; 
       if _N_ > 1 then output; /* new record and not the first in input */
       do i=1 to dim(rest); call missing(rest(i)); end;  /* clean for new record */
       do i=1 to 9; typs(i)=0; end;  /* 0=absent, 1=exist */
    end; else
         
    select (r_type);
        when (0) link in_typ0;
        when (1) link in_typ1;
        when (2) link in_typ2;
        when (3) link in_typ3;
        when (4) link in_typ4;
        when (5) link in_typ5;
        when (6) link in_typ6;
        when (9) link in_typ9;                
        otherwise Put "Other value " r_type= ;
    end;
 
    if eof then output;  /* output last record in input */  
    drop i row_rest t0-t9 r_type;  
return;
/*------------------------*/

in_typ0:
    /* parse row_rest when r_type = 0 */
    if t0 = 1 then do;
       put '**Duplicate r_type=0 for ' c_num= w_num= ;
       return;
    end; 
    t0=1;
    spa = substr(row_rest,1,12);
    receipt_date = substr(row_rest,59,8);
return;
/*------------------------*/

in_typ1:
    /* parse row_rest when r_type = 1 */
    if t1 = 1 then do;    
       put '**Duplicate r_type=1 for ' c_num= w_num= ;
       return;
    end;
    t1 = 1;
    name = row_rest; 
return;
/*------------------------*/

in_typ2:
    /* parse row_rest when r_type = 2 */
    if t2 = 1 then do;   
       put '**Duplicate r_type=2 for ' c_num= w_num= ;
       return;
    end;
    t2 = 1;
    action_type = substr(row_rest,1,8); 
return;
/*------------------------*/

in_typ3:
    /* parse row_rest when r_type = 3 */
    if t3 = 1 then do;    
       put '**Duplicate r_type=3 for ' c_num= w_num= ;
       return;
    end;
    t3 = 1;
    var_name = substr(row_rest,1,20);   
return;
/*------------------------*/

in_typ4:
    /* parse row_rest when r_type = 4 */
    if t4 = 1 then do;    
       put '**Duplicate r_type=4 for ' c_num= w_num= ;
       return;
    end;
    t4 =1;
      
return;
/*------------------------*/

in_typ5:
    /* parse row_rest when r_type = 5 */
    if t5 = 1 then do;   
       put '**Duplicate r_type=5 for ' c_num= w_num= ;
       return;
    end;
    t5 = 1;
      
return;
/*------------------------*/

in_typ6:
    /* parse row_rest when r_type = 6 */
    if t6 = 1 then do;    
       put '**Duplicate r_type=6 for ' c_num= w_num= ;
       return;
    end;
    t6 = 1;
     
return;
/*------------------------*/

in_typ9:
    /* parse row_rest when r_type = 9 */
    if t9 = 1 then do;
       put '**Duplicate r_type=9 for ' c_num= w_num= ;
       return;
    end;
    t9 = 1;
      
return;
/*------------------------*/
run;
Attachment
Trusted Advisor
Posts: 1,564

Re: Parse Data Based on a Particular Value.

@yeppy12, I'll be not available next 3 weeks.

 

1) You have in your code 3 steps: I shall relate to them by names: DATA1, SORT, DATA2.

2) Both DATA1 and DATA2 are parsing the input. 

3) If you check my test70.sas code you will see that first step parses input into 4 variables:

    client_num   workcase_num   r_type  row_rest
    the last varaible - row_rest - is parsed in DATA2, after the SORT.

-  In your code, DATA1 does not create variable row_rest, so probably you get ERRORs in DATA2.

4) You added variables to create from later r_rtypes, that you have not mentioned in yout first post.

    You need add those variables into DATA2, each under its appropriate label (in_typ0 - in_typ9 fitting r_type).

5) in oredr to be able to help you more, post some test data containing all kinds of r_type, at least

    two or three client_numbers.

6) Check your log after running. If you still have problems post your full log, too.

 

 

Occasional Contributor
Posts: 12

Re: Parse Data Based on a Particular Value.

I really appreciate all of the help and responses.  I believe I am close to getting this code working, but I still have a few issues.  My code is attached.

 

1.  I am not getting any results for in_typ0 but in_typ1, in_typ2, etc. seem to be populating.  My final dataset, Want, is blank for all of the fields listed under in_typ0.  Any idea why that might be happening?

 

2. It seems like when the data is split up and then combined again that some of the data is for a different record.  For example, the account number and workcase type will be for one person, but the name and address will be for a different record.  Any idea as to why that might be happening?

 

Any help is greatly appreciated.  Thanks.

Trusted Advisor
Posts: 1,564

Re: Parse Data Based on a Particular Value.

I  cannot doany test now. I see few issues

1. You have not added parsing row_rest  variable for types 7-9 in Data2 step.

2. If name and address are of the next person then maybe you missed retain of those variables in Data2 step.

Super User
Super User
Posts: 7,043

Re: Parse Data Based on a Particular Value.

Do you have unique identifiers?  If so you should be able to read the 10 different record types into 10 different tables and then merge them to get the single record with all of the fields.  If you data file is small (which is probably anything up to 1 gigabyte these adays) then just read it 10 times.

data new0;
  infile 'location of my text file';
  input @12 check 1. @1 @ ;
  if check=0;
  input ...... ;
run;
data new1;
  infile 'location of my text file';
  input @12 check 1. @1 @ ;
  if check=1;
  input ...... ;
run;
etc.

There are ways to do that in one data step, but unless you file is large why bother.

Then sort (if needed) and merge.

data new ;
  merge new0 - new9 ;
  by c_num w_num;
run;

 

Occasional Contributor
Posts: 12

Re: Parse Data Based on a Particular Value.

shmuel, that is a good point. I will check my code in data step 2.

 

Tom, your idea is a good one. There is a unique identifier and the file is not that big, so breaking it up into 10 pieces and merging them back together should work. I will give it a try.

 

Thank you both so much for your help! Smiley Happy

 

Ask a Question
Discussion stats
  • 12 replies
  • 379 views
  • 0 likes
  • 5 in conversation