Hi All,
I'm working on a text mining project which has millions of reports. My problem statement is little weird. I have column with alph number column with all possible values. from the given data i would like to select only number which has exactly 5 digits. below are the some criteria
1. values are separated by comma,colon, semi colon, space
2. the data i want to fetch might has pre text and post text value
3. i want to select values which are highlighted in bold color
4. if the leading numbers as zero(0) and followed by digits which is 5 digit will be selected, example is given in 1st row
I appreciate your help
5 digit column | output |
ffreg45646 %345^$45e 0005353532342 ertehf%43432- 435-- erte0034344 | 45646;43432;34344 |
4235354 sfd4t5345345 erteyye 45634# ye54545 | 54545 |
00055 0545554 54454545 000000052345 | 52345 |
wer43423, 445wsfsf4535, 4344%455 | 43423 |
Hi @sss ,
If I understood your question correctly you want to parse these 5 digit numbers then collect them into a comma separated list?
If that's the case then PRXPARSE and PRXNEXT should do the trick.
This code:
/* imported from .csv file */
data have;
set import;
keep text;
run;
/* print data */
title 'Original Data';
proc print data=have;run;
/* parse data using prxparse and prxnext */
data want;
set have;
length out found $20;
re = prxparse("/(\d\d\d\d\d)/");
start = 1;
stop = length(text);
call prxnext(re,start,stop,text,position,length);
do while (position > 0);
out = substr(text,position,length);
found = catx(', ',found,out);
put out;
call prxnext(re,start,stop,text,position,length);
end;
keep text found;
run;
/* print output */
title 'Parsed Data';
proc print data=want;run;
Produces this output:
I hope that helps
Harry
@sss wrote:
HI Harry,
I appreciate your help, your code works great however my requirement is different from your output. Let me try to simplify it.
from the first 1st row, i highlighted the values i'm looking for.
For ex:
if the input data is
ffreg45646 %345^$45e 0005353532342 ertehf%43432- 435-- erte0034344
the expected output is
45646 ; 43432; 34344
in this case the delimiter is space
... stuff deleted ...
for 5th delimited value 435-- erte0034344 , i would like to extract only last 5 digit because it had number less then 5 length and specialchar and char should be trimmed
Actually, given that your text has a blank (one of the 4 delimiters you specificed) prior to the "ert", the 5th delimited value is just "435--", and you have a sixth delimited value: "erte0034344", which qualifies "34344" for the result variable since there are 5 contiguous digits after removal of leading zeroes.
HI Harry, I appreciate your help, your code works great however my requirement is different from your output. Let me try to simplify it.
from the first 1st row, i highlighted the values i'm looking for.
FOr ex: if the input data is ffreg45646 %345^$45e 0005353532342 ertehf%43432- 435-- erte0034344 the expected output is 45646 ; 43432; 34344 in this case the delimiter is space
from the first delimited value ffreg45646 i'want to have value 45646. as it has length of 5 digits continuously.
from 2nd delimited value %345^$45e it has special character in between and the length 3 char and 2 char which not a valid value for me. it should continuous 5 digits in this case it has special character in between.
from 3rd delimited value 0005353532342 we need to ingore all preceeding zeros(0) and the length should be equal to 5 however it have length 10. this also doesn't meet my requirement so it should be excluded from output
from 4th delimited value ertehf%43432 it has char and special char both should be trimmed and leftover numbers are equal to exactly 5 , which meets the requirement hence should be part of output
for 5th delimited value 435-- erte0034344 , i would like to extract only last 5 digit because it had number less then 5 length and special char and char should be trimmed
Any suggestion mates 🙂
Appreciate your help
You second row has txt=" 4235354 sfd4t5345345 erteyye 45634# ye54545"
Why is your expected result not 45634;54545?
Assuming the second row result should be 45634;54545, then just loop of line segments:
data have;
input txt $70. @71 expected_result $70. ;
datalines4;
ffreg45646 %345^$45e 0005353532342 ertehf%43432- 435-- erte0034344 45646;43432;34344
4235354 sfd4t5345345 erteyye 45634# ye54545 54545
00055 0545554 54454545 000000052345 52345
wer43423, 445wsfsf4535, 4344%455 43423
;;;;run;
data want (drop=_:);
set have ;
length result $70;
*text segments are separated by comma,colon, semi-colon, space ;
do _s=1 to countw(txt,',:; ');
_segment=scan(txt,_s,',:; ');
_d=indexc(_segment,'123456789'); /*Leftmost digit, except 0 */
if _d=0 then continue;
_segment=substr(_segment,_d); /*Left justify to leading non-zero digit*/
/*If leftmost non-digit is position 6, then update RESULT*/
if notdigit(_segment)=6 then result= catx(';',result,substr(_segment,1,5));
end;
run;
This code assumes that each segment (i.e. delimited by blank, colon, semi-colon, or comma) has no more than one candidate for inclusion in the result, and that the first non-zero digit is the start of the candidate.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.