BookmarkSubscribeRSS Feed
sss
Fluorite | Level 6 sss
Fluorite | Level 6

 

Hi All,

I'm working on a text mining project which has millions of reports.  My problem statement is little weird.  I have column with alph number column with all possible values.  from the given data i would like to select only number which has exactly 5 digits. below are the some criteria 

1. values are separated by comma,colon, semi colon, space

2. the data i want to fetch might has pre text and post text value

3. i want to select values which are highlighted in bold color

4. if the leading numbers as zero(0) and followed by digits which is 5 digit will be selected, example is given in 1st row

 

I appreciate your help 

5 digit column                                                      output            
ffreg45646 %345^$45e 0005353532342 ertehf%43432- 435-- erte0034344   45646;43432;34344 
 4235354 sfd4t5345345 erteyye 45634# ye54545                        54545
00055 0545554 54454545 000000052345                                52345
wer43423, 445wsfsf4535, 4344%455                                   43423

 

 

7 REPLIES 7
HarrySnart
SAS Employee

Hi @sss ,

 

If I understood your question correctly you want to parse these 5 digit numbers then collect them into a comma separated list?

 

If that's the case then PRXPARSE and PRXNEXT should do the trick.

 

This code:

/* imported from .csv file */
data have;
set import;
keep text;
run;

/* print data */
title 'Original Data';
proc print data=have;run;

/* parse data using prxparse and prxnext */
data want;
set have;
length out found $20;
re = prxparse("/(\d\d\d\d\d)/");
start = 1;
stop = length(text);
call prxnext(re,start,stop,text,position,length);
do while (position > 0);
out = substr(text,position,length);
found = catx(', ',found,out);
put out;
call prxnext(re,start,stop,text,position,length);

end;
keep text found;
run;

/* print output */
title 'Parsed Data';
proc print data=want;run;

 

Produces this output:

HarrySnart_0-1699958689540.png

 

I hope that helps

Harry

sss
Fluorite | Level 6 sss
Fluorite | Level 6
HI Harry,

I appreciate your help, your code works great however my requirement is different from your output. Let me try to simplify it.

from the first 1st row, i highlighted the values i'm looking for.
For ex:
if the input data is
ffreg45646 %345^$45e 0005353532342 ertehf%43432- 435-- erte0034344 

the expected output is
45646 ; 43432; 34344

in this case the delimiter is space
from the first delimited value
ffreg45646 i 'want to have value 45646. as it has length of 5 digits continuously.

from 2nd delimited value %345^$45e it has special character in between and the length 3 char and 2 char which not a valid value for me. it should continuous 5 digits in this case it has special character in between.

from 3rd delimited value 0005353532342 we need to ignore all preceding zeros(0) and the length should be equal to 5 however it have length 10. this also doesn't meet my requirement so it should be excluded from output

from 4th delimited value ertehf%43432 it has char and special char both should be trimmed and leftover numbers are equal to exactly 5 , which meets the requirement hence should be part of output

for 5th delimited value 435-- erte0034344 , i would like to extract only last 5 digit because it had number less then 5 length and specialchar and char should be trimmed
mkeintz
PROC Star

@sss wrote:
HI Harry,

I appreciate your help, your code works great however my requirement is different from your output. Let me try to simplify it.

from the first 1st row, i highlighted the values i'm looking for.
For ex:
if the input data is
ffreg45646 %345^$45e 0005353532342 ertehf%43432- 435-- erte0034344 

the expected output is
45646 ; 43432; 34344

in this case the delimiter is space

... stuff deleted ...

for 5th delimited value 435-- erte0034344 , i would like to extract only last 5 digit because it had number less then 5 length and specialchar and char should be trimmed

Actually, given that your text has a blank (one of the 4 delimiters you specificed) prior to the "ert",  the 5th delimited value is just "435--", and you have a sixth delimited value: "erte0034344", which qualifies "34344" for the result variable since there are 5 contiguous digits after removal of leading zeroes.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
sss
Fluorite | Level 6 sss
Fluorite | Level 6

HI Harry, I appreciate your help, your code works great however my requirement is different from your output. Let me try to simplify it.

from the first 1st row, i highlighted the values i'm looking for.

FOr ex: if the input data is ffreg45646 %345^$45e 0005353532342 ertehf%43432- 435-- erte0034344  the expected output is 45646 ; 43432; 34344 in this case the delimiter is space

 

from the first delimited value ffreg45646 i'want to have value 45646. as it has length of 5 digits continuously.

 

from 2nd delimited value %345^$45e it has special character in between and the length 3 char and 2 char which not a valid value for me. it should continuous 5 digits in this case it has special character in between.

 

from 3rd delimited value 0005353532342 we need to ingore all preceeding zeros(0) and the length should be equal to 5 however it have length 10. this also doesn't meet my requirement so it should be excluded from output

 

from 4th delimited value ertehf%43432 it has char and special char both should be trimmed and leftover numbers are equal to exactly 5 , which meets the requirement hence should be part of output

 

for 5th delimited value 435-- erte0034344 , i would like to extract only last 5 digit because it had number less then 5 length and special char and char should be trimmed

sss
Fluorite | Level 6 sss
Fluorite | Level 6

Any suggestion mates 🙂

sss
Fluorite | Level 6 sss
Fluorite | Level 6

Appreciate your help

mkeintz
PROC Star

You second row has txt=" 4235354 sfd4t5345345 erteyye 45634# ye54545

 

Why is your expected result not  45634;54545?

 

Assuming the second row result should be 45634;54545, then just loop of line segments:

 

data have;
  input txt $70.  @71 expected_result $70. ;
datalines4;
ffreg45646 %345^$45e 0005353532342 ertehf%43432- 435-- erte0034344    45646;43432;34344 
 4235354 sfd4t5345345 erteyye 45634# ye54545                          54545
00055 0545554 54454545 000000052345                                   52345
wer43423, 445wsfsf4535, 4344%455                                      43423
;;;;run;

data want (drop=_:);
  set have ;
  length result $70;

  *text segments are separated by comma,colon, semi-colon, space ;
  do _s=1 to countw(txt,',:; ');
    _segment=scan(txt,_s,',:; ');            
    _d=indexc(_segment,'123456789');          /*Leftmost digit, except 0 */
    if _d=0 then continue;
    _segment=substr(_segment,_d);             /*Left justify to leading non-zero digit*/

    /*If leftmost non-digit is position 6, then update RESULT*/
    if notdigit(_segment)=6 then result= catx(';',result,substr(_segment,1,5));
  end;
run;

This code assumes that each segment (i.e. delimited by blank, colon, semi-colon, or comma) has no more than one candidate for inclusion in the result, and that the first non-zero digit is the start of the candidate.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-wordmark-2025-midnight.png

Register Today!

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.


Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 1767 views
  • 0 likes
  • 3 in conversation