Hello Community,
I am working on data cleanup project and i have come across a item i am not sure how to tackle.
I have a data set with 5 columns, I need to retain 4 of them as is, but one of them I need to extract the largest string from.
In most cases this string is a 16 digit record number.
The field will look like this, (the value i want is 'X')
Column Cell [AAAAAA XXXXXXXXXXXXXXXX BBBBBB]
Or [XXXXXXXXXXXXXXXX AAAAAA BBBBBB]
Or [AAAAAA BBBBBB XXXXXXXXXXXXXXXX]
The 16 digit number can be in any of the three positions... (dont ask, legacy bookeeping)
And some times the same field will be dirty data with some random value entered as a place holder.
What is the best way to extract that 16 digit X field from my data column and place it in a new column?
data want;
string = 'AAAAAA XXXXXXXXXXXXXXXX BBBBBB';
do i = 1 to countw(string);
if length(scan(string, i)) = 16 then wanted_word = scan(string, i);
end;
put _all_;
run;
data want;
string = 'AAAAAA XXXXXXXXXXXXXXXX BBBBBB';
do i = 1 to countw(string);
if length(scan(string, i)) = 16 then wanted_word = scan(string, i);
end;
put _all_;
run;
Are A and B numbers too or is X the only digit? Are the brackets part of the string?
A and B are alpha numeric, X is a numeric value.
Like this?
data WANT;
input STRING :&$32.;
DIGITS=prxchange('s/(?:\w{6} (\d{16}) \w{6}|\w{6} \w{6} (\d{16})|(\d{16}) \w{6} \w{6}|.*)/\1\2\3/',1,STRING);
cards;
ssssss 6676543456545678 llllll
llllll ssssss 6676543456545678
6676543456545678 ssssss llllll
ssssss 643456545678 llllll
run;
STRING | DIGITS |
---|---|
ssssss 6676543456545678 llllll | 6676543456545678 |
llllll ssssss 6676543456545678 | 6676543456545678 |
6676543456545678 ssssss llllll | 6676543456545678 |
ssssss 643456545678 llllll |
data WANT; input STRING :&$32.; p=prxmatch('/\d{16}/',string); if p then DIGITS=substr(string,p,16); drop p ; cards; ssssss 6676543456545678 llllll llllll ssssss 6676543456545678 6676543456545678 ssssss llllll ssssss 643456545678 llllll run;
@smilbuta : you wrote: "In most cases this string is a 16 digit record number." - can you explain what to extract in the other cases?
Yes,
In the historical data, sometimes a record number was manually entered that does not conform to the norm. It will be a singular alphanumeric digit of variable length. The size of all of those is under 16 digits. They are few and far between and I would be able to reconcile those after the fact.
By extracting the 16 digit value i am seeking and populating the new column, any blanks remaining in the new column would be non conforming data from the original source.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.