Hi All,
I'm working on a text mining project which has millions of reports. My problem statement is little weird. I have column with alph number column with all possible values. from the given data i would like to select only number which has exactly 5 digits. below are the some criteria
1. values are separated by comma,colon, semi colon, space
2. the data i want to fetch might has pre text and post text value
3. i want to select values which are highlighted in bold color
4. if the leading numbers as zero(0) and followed by digits which is 5 digit will be selected, example is given in 1st row
I appreciate your help
5 digit column | output |
ffreg45646 %345^$45e 0005353532342 ertehf%43432- 435-- erte0034344 | 45646;43432;34344 |
4235354 sfd4t5345345 erteyye 45634# ye54545 | 54545 |
00055 0545554 54454545 000000052345 | 52345 |
wer43423, 445wsfsf4535, 4344%455 | 43423 |
Hi @sss ,
If I understood your question correctly you want to parse these 5 digit numbers then collect them into a comma separated list?
If that's the case then PRXPARSE and PRXNEXT should do the trick.
This code:
/* imported from .csv file */
data have;
set import;
keep text;
run;
/* print data */
title 'Original Data';
proc print data=have;run;
/* parse data using prxparse and prxnext */
data want;
set have;
length out found $20;
re = prxparse("/(\d\d\d\d\d)/");
start = 1;
stop = length(text);
call prxnext(re,start,stop,text,position,length);
do while (position > 0);
out = substr(text,position,length);
found = catx(', ',found,out);
put out;
call prxnext(re,start,stop,text,position,length);
end;
keep text found;
run;
/* print output */
title 'Parsed Data';
proc print data=want;run;
Produces this output:
I hope that helps
Harry
@sss wrote:
HI Harry,
I appreciate your help, your code works great however my requirement is different from your output. Let me try to simplify it.
from the first 1st row, i highlighted the values i'm looking for.
For ex:
if the input data is
ffreg45646 %345^$45e 0005353532342 ertehf%43432- 435-- erte0034344
the expected output is
45646 ; 43432; 34344
in this case the delimiter is space
... stuff deleted ...
for 5th delimited value 435-- erte0034344 , i would like to extract only last 5 digit because it had number less then 5 length and specialchar and char should be trimmed
Actually, given that your text has a blank (one of the 4 delimiters you specificed) prior to the "ert", the 5th delimited value is just "435--", and you have a sixth delimited value: "erte0034344", which qualifies "34344" for the result variable since there are 5 contiguous digits after removal of leading zeroes.
HI Harry, I appreciate your help, your code works great however my requirement is different from your output. Let me try to simplify it.
from the first 1st row, i highlighted the values i'm looking for.
FOr ex: if the input data is ffreg45646 %345^$45e 0005353532342 ertehf%43432- 435-- erte0034344 the expected output is 45646 ; 43432; 34344 in this case the delimiter is space
from the first delimited value ffreg45646 i'want to have value 45646. as it has length of 5 digits continuously.
from 2nd delimited value %345^$45e it has special character in between and the length 3 char and 2 char which not a valid value for me. it should continuous 5 digits in this case it has special character in between.
from 3rd delimited value 0005353532342 we need to ingore all preceeding zeros(0) and the length should be equal to 5 however it have length 10. this also doesn't meet my requirement so it should be excluded from output
from 4th delimited value ertehf%43432 it has char and special char both should be trimmed and leftover numbers are equal to exactly 5 , which meets the requirement hence should be part of output
for 5th delimited value 435-- erte0034344 , i would like to extract only last 5 digit because it had number less then 5 length and special char and char should be trimmed
Any suggestion mates 🙂
Appreciate your help
You second row has txt=" 4235354 sfd4t5345345 erteyye 45634# ye54545"
Why is your expected result not 45634;54545?
Assuming the second row result should be 45634;54545, then just loop of line segments:
data have;
input txt $70. @71 expected_result $70. ;
datalines4;
ffreg45646 %345^$45e 0005353532342 ertehf%43432- 435-- erte0034344 45646;43432;34344
4235354 sfd4t5345345 erteyye 45634# ye54545 54545
00055 0545554 54454545 000000052345 52345
wer43423, 445wsfsf4535, 4344%455 43423
;;;;run;
data want (drop=_:);
set have ;
length result $70;
*text segments are separated by comma,colon, semi-colon, space ;
do _s=1 to countw(txt,',:; ');
_segment=scan(txt,_s,',:; ');
_d=indexc(_segment,'123456789'); /*Leftmost digit, except 0 */
if _d=0 then continue;
_segment=substr(_segment,_d); /*Left justify to leading non-zero digit*/
/*If leftmost non-digit is position 6, then update RESULT*/
if notdigit(_segment)=6 then result= catx(';',result,substr(_segment,1,5));
end;
run;
This code assumes that each segment (i.e. delimited by blank, colon, semi-colon, or comma) has no more than one candidate for inclusion in the result, and that the first non-zero digit is the start of the candidate.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.