complicated txt file

yymissing · Posted 04-19-2017 10:50 PM

Hi folks, thanks for all the great suggestions. I realized that I simplied the problem too much in my original post and I still can’t figure out with your suggestions. So here is problem:

First of all, there is no delimiters; everything based on positions. One row is for one customer, and it can contains several types of items. Every row starts with Customer ID which take the first 5 characters. There are 4 types of items with similar patterns but different length. Each item start with 4 characters (a combination of letter and number)

One type starts with "AN01" and has length of 10; one type start with "BC01" and has length of 8.

One type starts with "CR4R" and has length of 8; one type start with "DY03" and has length of 6.

Each row can have any number of any types. For example 2 “AN01”, followed by 3 “BC01”, then 10 “CR4R”, then 2 “DY03”. Another row can have 3 “AN01”, no “BC01”, then 8 “CR4R”, then 1 “DY03”.

The tricky part is that every type can have space in the value. E.g. ‘BC01 89’

For example:

CUST12345AN01    98BC01  98BC01  89CR4R 4 5DY03 9
CUST12346AN01    98BC01  98BC01  89 BC017 89CR4R 4 5DY03 9

The first line means: 'CUST12345' is the customer ID, with 5 items

AN01    98
BC01  98
BC01  89
CR4R 4 5
DY03 9

I would like the output to have one item per row along with its Customer ID:

CUST_ID      ITEM
CUST12345 AN01    98
CUST12345 BC01  98
CUST12345 BC01  89
CUST12345 CR4R 4 5
CUST12345 DY03 9
CUST12346 AN01    98
CUST12346 BC01  98
CUST12346 BC01  89
CUST12346 BC017 89
CUST12346 CR4R 4 5
CUST12346 DY03 9

Original post:

Hi folks, I had another complicated txt files that I need to read in based on their position. There is no delimiter. One row is one customer, and there are two types of items. Customer ID always take the first 5 characters. One type of item starts with "A" and has length of 10; the other type start with "B" and has length of 15. Each customer can have any number of A or B items. The tricky part is that A or B can have space in the value.

For example:

CUST12345A10 98B98 89 1245A20 98B88 89 1245

This line means: 'CUST12345' is the customer ID

'A10 98' is the value for the first "A" item

'B98 89 1245' is the value for the first "B" item

'A20 98' is the value for the 2nd "A" item

'B88 89 1245' is the value for the 2nd "B" item

It will look like:

CUST12345 A10 98

CUST12345 A20 98

The space in “A” and “B” values drive me crazy. Any idea how to create the output?

PGStats · Posted 04-19-2017 11:37 PM

Regular expressions to the rescue:

data test;
if not prxid then prxid + prxparse("/[AB][^AB]+/i");
infile datalines truncover;
input @5 custId $5. line $200.;
length value $20;
start = 1;
stop = length(line);
call prxnext(prxid, start, stop, line, pos, len);
do while (pos > 0);
    value = substr(line, pos, len);
    output;
    call prxnext(prxid, start, stop, line, pos, len);
    end;
drop prxid start stop pos len line;
datalines;
CUST12345A10     98B98    89  1245A20     98B88    89  1245
;

proc print; run;

PG

ChrisNZ · Posted 04-20-2017 01:15 AM

Or slightly different:

data WANT;
  if not PRXID then PRXID + prxparse("/A[^AB]+/i");
  infile datalines truncover;
  input ;
  length VALUE $20;
  START = index(_infile_,'A');          
  CUSTID=substr(_infile_,1,START-1);
  call prxnext(PRXID, START, -1, _infile_, POS, LEN);
  do while (POS > 0);       
      VALUE = substr(_infile_, POS, LEN);
      output;
      call prxnext(PRXID, START, -1, _infile_, POS, LEN);
  end;
  keep CUSTID VALUE;
datalines;
CUST12345A10     98B98    89  1245A20     98B88    89  1245
run;

proc print noobs; 
run;

VALUE	CUSTID
A10 98	CUST12345
A20 98	CUST12345

High-Performance SAS Coding - Third Edition

Ksharp · Posted 04-20-2017 06:40 AM

data have;
infile cards truncover;
input cust $9. a $10. b $15. @;
do while(not missing(a));
 a1=substr(a,1,3);
 a2=substr(a,length(a)-2);
 output;
 input a $10. b $15. @;
end;

cards;
CUST12345A10     98B98    89  1245A20     98B88    89  1245 
CUST22345A10     98B98    89  1245 
;
run;

PGStats · Posted 04-20-2017 09:24 AM

Adapted to your new requirement, assuming the values are digits and spaces:

data test;
if not prxid then 
    prxid + prxparse("/(AN01|BC01|CR4R|DY03)[ 0-9]+/i");
infile datalines truncover;
input @1 custId $9. line $200.;
length item $20;
start = 1;
stop = length(line);
call prxnext(prxid, start, stop, line, pos, len);
do while (pos > 0);
    item = substr(line, pos, len);
    output;
    call prxnext(prxid, start, stop, line, pos, len);
    end;
drop prxid start stop pos len line;
datalines;
CUST12345AN01    98BC01  98BC01  89CR4R 4 5DY03 9
CUST12346AN01    98BC01  98BC01  89 BC017 89CR4R 4 5DY03 9
;

proc print; run;

PG

Tom · Posted 04-20-2017 01:00 PM

So read ahead then you can calculate how many characters to read based on the beginning of the next item. You can use negative cursor movement values to skip back. You can use $VARYING. informat to read varying length fields.

data want ;
  infile cards truncover column = col;
  length cust_id $9 n 8 item $10 ;
  input cust_id $9. item $10. +(-10)  @ ;
  do n=1 by 1 until(item=' ');
    select ;
      when (item=:'AN') len=10;
      when (item=:'BC') len=8;
      when (item=:'CR') len=8;
      when (item=:'DY') len=6;
      otherwise len=0;
    end;
    input item $varying10. len  @;
    if n=1 or item ne ' ' then output;
    input item $10. +(-10) @ ;
  end;
*---+----0----+----0----+----0----+----0----+----0----+----0;
cards;
CUST12345AN01    98BC01  98BC01  89CR4R 4 5DY03 9
CUST12346AN01    98BC01  98BC01  89BC017 89CR4R 4 5DY03 9
;
proc print; run;

Ksharp · Posted 04-21-2017 12:52 AM

data have;
infile cards truncover;
input x $200.;
cards;
CUST12345AN01    98BC01  98BC01  89CR4R 4 5DY03 9
CUST12346AN01    98BC01  98BC01  89 BC017 89CR4R 4 5DY03 9
;
run;
data temp;
 set have;
 pid=prxparse('/[a-z]+[\d\s]+/io');
 s=1;e=length(x);
 call prxnext(pid,s,e,x,p,l);
 do while(p gt 0);
  temp=substr(x,p,l);output;
  call prxnext(pid,s,e,x,p,l);
 end;
 keep temp;
run;

data want;
 set temp;
 length cust $ 40;
 retain cust;
 if upcase(temp) =: 'CUST' then do;
  cust=temp;delete;
 end;
run;

complicated txt file

Re: complicated txt file

Re: complicated txt file

Re: complicated txt file

Re: complicated txt file

Re: complicated txt file

Re: complicated txt file

complicated txt file

Re: complicated txt file

Re: complicated txt file

Re: complicated txt file

Re: complicated txt file

Re: complicated txt file

Re: complicated txt file

Register Today!

SAS Training: Just a Click Away