Hi folks, thanks for all the great suggestions. I realized that I simplied the problem too much in my original post and I still can’t figure out with your suggestions. So here is problem:
First of all, there is no delimiters; everything based on positions. One row is for one customer, and it can contains several types of items. Every row starts with Customer ID which take the first 5 characters. There are 4 types of items with similar patterns but different length. Each item start with 4 characters (a combination of letter and number)
One type starts with "AN01" and has length of 10; one type start with "BC01" and has length of 8.
One type starts with "CR4R" and has length of 8; one type start with "DY03" and has length of 6.
Each row can have any number of any types. For example 2 “AN01”, followed by 3 “BC01”, then 10 “CR4R”, then 2 “DY03”. Another row can have 3 “AN01”, no “BC01”, then 8 “CR4R”, then 1 “DY03”.
The tricky part is that every type can have space in the value. E.g. ‘BC01 89’
For example:
CUST12345AN01 98BC01 98BC01 89CR4R 4 5DY03 9
CUST12346AN01 98BC01 98BC01 89 BC017 89CR4R 4 5DY03 9
The first line means: 'CUST12345' is the customer ID, with 5 items
AN01 98
BC01 98
BC01 89
CR4R 4 5
DY03 9
I would like the output to have one item per row along with its Customer ID:
CUST_ID ITEM
CUST12345 AN01 98
CUST12345 BC01 98
CUST12345 BC01 89
CUST12345 CR4R 4 5
CUST12345 DY03 9
CUST12346 AN01 98
CUST12346 BC01 98
CUST12346 BC01 89
CUST12346 BC017 89
CUST12346 CR4R 4 5
CUST12346 DY03 9
Original post:
Hi folks, I had another complicated txt files that I need to read in based on their position. There is no delimiter. One row is one customer, and there are two types of items. Customer ID always take the first 5 characters. One type of item starts with "A" and has length of 10; the other type start with "B" and has length of 15. Each customer can have any number of A or B items. The tricky part is that A or B can have space in the value.
For example:
CUST12345A10 98B98 89 1245A20 98B88 89 1245
This line means: 'CUST12345' is the customer ID
'A10 98' is the value for the first "A" item
'B98 89 1245' is the value for the first "B" item
'A20 98' is the value for the 2nd "A" item
'B88 89 1245' is the value for the 2nd "B" item
It will look like:
CUST12345 A10 98
CUST12345 A20 98
The space in “A” and “B” values drive me crazy. Any idea how to create the output?
Regular expressions to the rescue:
data test;
if not prxid then prxid + prxparse("/[AB][^AB]+/i");
infile datalines truncover;
input @5 custId $5. line $200.;
length value $20;
start = 1;
stop = length(line);
call prxnext(prxid, start, stop, line, pos, len);
do while (pos > 0);
value = substr(line, pos, len);
output;
call prxnext(prxid, start, stop, line, pos, len);
end;
drop prxid start stop pos len line;
datalines;
CUST12345A10 98B98 89 1245A20 98B88 89 1245
;
proc print; run;
Or slightly different:
data WANT;
if not PRXID then PRXID + prxparse("/A[^AB]+/i");
infile datalines truncover;
input ;
length VALUE $20;
START = index(_infile_,'A');
CUSTID=substr(_infile_,1,START-1);
call prxnext(PRXID, START, -1, _infile_, POS, LEN);
do while (POS > 0);
VALUE = substr(_infile_, POS, LEN);
output;
call prxnext(PRXID, START, -1, _infile_, POS, LEN);
end;
keep CUSTID VALUE;
datalines;
CUST12345A10 98B98 89 1245A20 98B88 89 1245
run;
proc print noobs;
run;
VALUE | CUSTID |
---|---|
A10 98 | CUST12345 |
A20 98 | CUST12345 |
data have;
infile cards truncover;
input cust $9. a $10. b $15. @;
do while(not missing(a));
a1=substr(a,1,3);
a2=substr(a,length(a)-2);
output;
input a $10. b $15. @;
end;
cards;
CUST12345A10 98B98 89 1245A20 98B88 89 1245
CUST22345A10 98B98 89 1245
;
run;
Adapted to your new requirement, assuming the values are digits and spaces:
data test;
if not prxid then
prxid + prxparse("/(AN01|BC01|CR4R|DY03)[ 0-9]+/i");
infile datalines truncover;
input @1 custId $9. line $200.;
length item $20;
start = 1;
stop = length(line);
call prxnext(prxid, start, stop, line, pos, len);
do while (pos > 0);
item = substr(line, pos, len);
output;
call prxnext(prxid, start, stop, line, pos, len);
end;
drop prxid start stop pos len line;
datalines;
CUST12345AN01 98BC01 98BC01 89CR4R 4 5DY03 9
CUST12346AN01 98BC01 98BC01 89 BC017 89CR4R 4 5DY03 9
;
proc print; run;
So read ahead then you can calculate how many characters to read based on the beginning of the next item. You can use negative cursor movement values to skip back. You can use $VARYING. informat to read varying length fields.
data want ;
infile cards truncover column = col;
length cust_id $9 n 8 item $10 ;
input cust_id $9. item $10. +(-10) @ ;
do n=1 by 1 until(item=' ');
select ;
when (item=:'AN') len=10;
when (item=:'BC') len=8;
when (item=:'CR') len=8;
when (item=:'DY') len=6;
otherwise len=0;
end;
input item $varying10. len @;
if n=1 or item ne ' ' then output;
input item $10. +(-10) @ ;
end;
*---+----0----+----0----+----0----+----0----+----0----+----0;
cards;
CUST12345AN01 98BC01 98BC01 89CR4R 4 5DY03 9
CUST12346AN01 98BC01 98BC01 89BC017 89CR4R 4 5DY03 9
;
proc print; run;
data have;
infile cards truncover;
input x $200.;
cards;
CUST12345AN01 98BC01 98BC01 89CR4R 4 5DY03 9
CUST12346AN01 98BC01 98BC01 89 BC017 89CR4R 4 5DY03 9
;
run;
data temp;
set have;
pid=prxparse('/[a-z]+[\d\s]+/io');
s=1;e=length(x);
call prxnext(pid,s,e,x,p,l);
do while(p gt 0);
temp=substr(x,p,l);output;
call prxnext(pid,s,e,x,p,l);
end;
keep temp;
run;
data want;
set temp;
length cust $ 40;
retain cust;
if upcase(temp) =: 'CUST' then do;
cust=temp;delete;
end;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.