BookmarkSubscribeRSS Feed
sss
Fluorite | Level 6 sss
Fluorite | Level 6

 

Hi All,

I'm working on a text mining project which has millions of reports.  My problem statement is little weird.  I have column with alph number column with all possible values.  from the given data i would like to select only number which has exactly 5 digits. below are the some criteria 

1. values are separated by comma,colon, semi colon, space

2. the data i want to fetch might has pre text and post text value

3. i want to select values which are highlighted in bold color

4. if the leading numbers as zero(0) and followed by digits which is 5 digit will be selected, example is given in 1st row

 

I appreciate your help 

5 digit column                                                      output            
ffreg45646 %345^$45e 0005353532342 ertehf%43432- 435-- erte0034344   45646;43432;34344 
 4235354 sfd4t5345345 erteyye 45634# ye54545                        54545
00055 0545554 54454545 000000052345                                52345
wer43423, 445wsfsf4535, 4344%455                                   43423

 

 

7 REPLIES 7
HarrySnart
SAS Employee

Hi @sss ,

 

If I understood your question correctly you want to parse these 5 digit numbers then collect them into a comma separated list?

 

If that's the case then PRXPARSE and PRXNEXT should do the trick.

 

This code:

/* imported from .csv file */
data have;
set import;
keep text;
run;

/* print data */
title 'Original Data';
proc print data=have;run;

/* parse data using prxparse and prxnext */
data want;
set have;
length out found $20;
re = prxparse("/(\d\d\d\d\d)/");
start = 1;
stop = length(text);
call prxnext(re,start,stop,text,position,length);
do while (position > 0);
out = substr(text,position,length);
found = catx(', ',found,out);
put out;
call prxnext(re,start,stop,text,position,length);

end;
keep text found;
run;

/* print output */
title 'Parsed Data';
proc print data=want;run;

 

Produces this output:

HarrySnart_0-1699958689540.png

 

I hope that helps

Harry

sss
Fluorite | Level 6 sss
Fluorite | Level 6
HI Harry,

I appreciate your help, your code works great however my requirement is different from your output. Let me try to simplify it.

from the first 1st row, i highlighted the values i'm looking for.
For ex:
if the input data is
ffreg45646 %345^$45e 0005353532342 ertehf%43432- 435-- erte0034344 

the expected output is
45646 ; 43432; 34344

in this case the delimiter is space
from the first delimited value
ffreg45646 i 'want to have value 45646. as it has length of 5 digits continuously.

from 2nd delimited value %345^$45e it has special character in between and the length 3 char and 2 char which not a valid value for me. it should continuous 5 digits in this case it has special character in between.

from 3rd delimited value 0005353532342 we need to ignore all preceding zeros(0) and the length should be equal to 5 however it have length 10. this also doesn't meet my requirement so it should be excluded from output

from 4th delimited value ertehf%43432 it has char and special char both should be trimmed and leftover numbers are equal to exactly 5 , which meets the requirement hence should be part of output

for 5th delimited value 435-- erte0034344 , i would like to extract only last 5 digit because it had number less then 5 length and specialchar and char should be trimmed
mkeintz
PROC Star

@sss wrote:
HI Harry,

I appreciate your help, your code works great however my requirement is different from your output. Let me try to simplify it.

from the first 1st row, i highlighted the values i'm looking for.
For ex:
if the input data is
ffreg45646 %345^$45e 0005353532342 ertehf%43432- 435-- erte0034344 

the expected output is
45646 ; 43432; 34344

in this case the delimiter is space

... stuff deleted ...

for 5th delimited value 435-- erte0034344 , i would like to extract only last 5 digit because it had number less then 5 length and specialchar and char should be trimmed

Actually, given that your text has a blank (one of the 4 delimiters you specificed) prior to the "ert",  the 5th delimited value is just "435--", and you have a sixth delimited value: "erte0034344", which qualifies "34344" for the result variable since there are 5 contiguous digits after removal of leading zeroes.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
sss
Fluorite | Level 6 sss
Fluorite | Level 6

HI Harry, I appreciate your help, your code works great however my requirement is different from your output. Let me try to simplify it.

from the first 1st row, i highlighted the values i'm looking for.

FOr ex: if the input data is ffreg45646 %345^$45e 0005353532342 ertehf%43432- 435-- erte0034344  the expected output is 45646 ; 43432; 34344 in this case the delimiter is space

 

from the first delimited value ffreg45646 i'want to have value 45646. as it has length of 5 digits continuously.

 

from 2nd delimited value %345^$45e it has special character in between and the length 3 char and 2 char which not a valid value for me. it should continuous 5 digits in this case it has special character in between.

 

from 3rd delimited value 0005353532342 we need to ingore all preceeding zeros(0) and the length should be equal to 5 however it have length 10. this also doesn't meet my requirement so it should be excluded from output

 

from 4th delimited value ertehf%43432 it has char and special char both should be trimmed and leftover numbers are equal to exactly 5 , which meets the requirement hence should be part of output

 

for 5th delimited value 435-- erte0034344 , i would like to extract only last 5 digit because it had number less then 5 length and special char and char should be trimmed

sss
Fluorite | Level 6 sss
Fluorite | Level 6

Any suggestion mates 🙂

sss
Fluorite | Level 6 sss
Fluorite | Level 6

Appreciate your help

mkeintz
PROC Star

You second row has txt=" 4235354 sfd4t5345345 erteyye 45634# ye54545

 

Why is your expected result not  45634;54545?

 

Assuming the second row result should be 45634;54545, then just loop of line segments:

 

data have;
  input txt $70.  @71 expected_result $70. ;
datalines4;
ffreg45646 %345^$45e 0005353532342 ertehf%43432- 435-- erte0034344    45646;43432;34344 
 4235354 sfd4t5345345 erteyye 45634# ye54545                          54545
00055 0545554 54454545 000000052345                                   52345
wer43423, 445wsfsf4535, 4344%455                                      43423
;;;;run;

data want (drop=_:);
  set have ;
  length result $70;

  *text segments are separated by comma,colon, semi-colon, space ;
  do _s=1 to countw(txt,',:; ');
    _segment=scan(txt,_s,',:; ');            
    _d=indexc(_segment,'123456789');          /*Leftmost digit, except 0 */
    if _d=0 then continue;
    _segment=substr(_segment,_d);             /*Left justify to leading non-zero digit*/

    /*If leftmost non-digit is position 6, then update RESULT*/
    if notdigit(_segment)=6 then result= catx(';',result,substr(_segment,1,5));
  end;
run;

This code assumes that each segment (i.e. delimited by blank, colon, semi-colon, or comma) has no more than one candidate for inclusion in the result, and that the first non-zero digit is the start of the candidate.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 852 views
  • 0 likes
  • 3 in conversation