DATA Step, Macro, Functions and more

complicated txt file

Reply
New Contributor
Posts: 2

complicated txt file

[ Edited ]

Hi folks, thanks for all  the great suggestions. I realized that I simplied the problem too much in my original post and I still can’t figure out with your suggestions. So here is problem:

First of all, there is no delimiters; everything based on positions. One row is for one customer, and it can contains several types of items. Every row starts with Customer ID which take the first 5 characters. There are 4 types of items with similar patterns but different length.  Each item start  with 4 characters (a combination of letter and number)

One type starts with "AN01" and has length of 10; one type start with "BC01" and has length of 8.  

One type starts with "CR4R" and has length of 8; one type start with "DY03" and has length of 6.  

 

Each row can have any number of any types. For example 2 “AN01”, followed by 3 “BC01”,  then 10 “CR4R”, then 2 “DY03”. Another row can have 3 “AN01”, no “BC01”,  then 8 “CR4R”, then 1 “DY03”.

 

The tricky part is that every type can have space in  the value. E.g. ‘BC01  89’

 

For example:

 

CUST12345AN01    98BC01  98BC01  89CR4R 4 5DY03 9
CUST12346AN01    98BC01  98BC01  89 BC017 89CR4R 4 5DY03 9

 

The first line means: 'CUST12345' is the customer ID, with 5 items

 

AN01    98
BC01  98
BC01  89
CR4R 4 5
DY03 9

 

I would like the output to have one item per row along with its Customer ID:

 

CUST_ID      ITEM
CUST12345 AN01    98
CUST12345 BC01  98
CUST12345 BC01  89
CUST12345 CR4R 4 5
CUST12345 DY03 9
CUST12346 AN01    98
CUST12346 BC01  98
CUST12346 BC01  89
CUST12346 BC017 89
CUST12346 CR4R 4 5
CUST12346 DY03 9

 

Original post:

 

Hi folks, I had another complicated txt files that I need to read in based on their position. There is no delimiter. One row is one customer, and there are two types of items. Customer ID always take the first 5 characters. One type of item starts with "A" and has length of 10; the other type start with "B" and has length of 15.  Each customer can have any number of A or B items. The tricky part is that A or B can have space in  the value. 

 

For example:

CUST12345A10     98B98    89  1245A20     98B88    89  1245 

 

This line means: 'CUST12345' is the customer ID

'A10     98' is the value for the first "A" item

'B98    89  1245' is  the value for  the first "B" item

'A20     98' is the value for the 2nd "A" item

'B88    89  1245' is  the value for the 2nd "B" item

 

 

It will look like: 

 

CUST12345 A10 98

CUST12345 A20 98

 

The space in “A” and “B” values drive me crazy. Any idea how to create the output? 

 

Respected Advisor
Posts: 4,641

Re: complicated txt file

Regular expressions to the rescue:

 

data test;
if not prxid then prxid + prxparse("/[AB][^AB]+/i");
infile datalines truncover;
input @5 custId $5. line $200.;
length value $20;
start = 1;
stop = length(line);
call prxnext(prxid, start, stop, line, pos, len);
do while (pos > 0);
    value = substr(line, pos, len);
    output;
    call prxnext(prxid, start, stop, line, pos, len);
    end;
drop prxid start stop pos len line;
datalines;
CUST12345A10     98B98    89  1245A20     98B88    89  1245
;

proc print; run;
PG
PROC Star
Posts: 1,558

Re: complicated txt file

Or slightly different:

 

data WANT;
  if not PRXID then PRXID + prxparse("/A[^AB]+/i");
  infile datalines truncover;
  input ;
  length VALUE $20;
  START = index(_infile_,'A');          
  CUSTID=substr(_infile_,1,START-1);
  call prxnext(PRXID, START, -1, _infile_, POS, LEN);
  do while (POS > 0);       
      VALUE = substr(_infile_, POS, LEN);
      output;
      call prxnext(PRXID, START, -1, _infile_, POS, LEN);
  end;
  keep CUSTID VALUE;
datalines;
CUST12345A10     98B98    89  1245A20     98B88    89  1245
run;

proc print noobs; 
run;

 

 

VALUE CUSTID
A10 98 CUST12345
A20 98 CUST12345
Super User
Posts: 9,671

Re: complicated txt file

data have;
infile cards truncover;
input cust $9. a $10. b $15. @;
do while(not missing(a));
 a1=substr(a,1,3);
 a2=substr(a,length(a)-2);
 output;
 input a $10. b $15. @;
end;

cards;
CUST12345A10     98B98    89  1245A20     98B88    89  1245 
CUST22345A10     98B98    89  1245 
;
run;
Respected Advisor
Posts: 4,641

Re: complicated txt file

Adapted to your new requirement, assuming the values are digits and spaces:

 

data test;
if not prxid then 
    prxid + prxparse("/(AN01|BC01|CR4R|DY03)[ 0-9]+/i");
infile datalines truncover;
input @1 custId $9. line $200.;
length item $20;
start = 1;
stop = length(line);
call prxnext(prxid, start, stop, line, pos, len);
do while (pos > 0);
    item = substr(line, pos, len);
    output;
    call prxnext(prxid, start, stop, line, pos, len);
    end;
drop prxid start stop pos len line;
datalines;
CUST12345AN01    98BC01  98BC01  89CR4R 4 5DY03 9
CUST12346AN01    98BC01  98BC01  89 BC017 89CR4R 4 5DY03 9
;

proc print; run;
PG
Super User
Super User
Posts: 6,498

Re: complicated txt file

So read ahead then you can calculate how many characters to read based on the beginning of the next item. You can use negative cursor movement values to skip back. You can use $VARYING. informat to read varying length fields.

data want ;
  infile cards truncover column = col;
  length cust_id $9 n 8 item $10 ;
  input cust_id $9. item $10. +(-10)  @ ;
  do n=1 by 1 until(item=' ');
    select ;
      when (item=:'AN') len=10;
      when (item=:'BC') len=8;
      when (item=:'CR') len=8;
      when (item=:'DY') len=6;
      otherwise len=0;
    end;
    input item $varying10. len  @;
    if n=1 or item ne ' ' then output;
    input item $10. +(-10) @ ;
  end;
*---+----0----+----0----+----0----+----0----+----0----+----0;
cards;
CUST12345AN01    98BC01  98BC01  89CR4R 4 5DY03 9
CUST12346AN01    98BC01  98BC01  89BC017 89CR4R 4 5DY03 9
;
proc print; run;
Super User
Posts: 9,671

Re: complicated txt file

data have;
infile cards truncover;
input x $200.;
cards;
CUST12345AN01    98BC01  98BC01  89CR4R 4 5DY03 9
CUST12346AN01    98BC01  98BC01  89 BC017 89CR4R 4 5DY03 9
;
run;
data temp;
 set have;
 pid=prxparse('/[a-z]+[\d\s]+/io');
 s=1;e=length(x);
 call prxnext(pid,s,e,x,p,l);
 do while(p gt 0);
  temp=substr(x,p,l);output;
  call prxnext(pid,s,e,x,p,l);
 end;
 keep temp;
run;

data want;
 set temp;
 length cust $ 40;
 retain cust;
 if upcase(temp) =: 'CUST' then do;
  cust=temp;delete;
 end;
run;
Ask a Question
Discussion stats
  • 6 replies
  • 147 views
  • 0 likes
  • 5 in conversation