BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
gzr2mz39
Quartz | Level 8

How would I go about extracting all words from a text field that start with "SM" where the 2 letters "SM" are uppercase? All words that begin with "SM" would be copied into a new column where each word that begins with "SM" for that specific record would be separated by a space or comma.

For example, for record 100 the text is "SM690 and SM780 occurred at 1:00pm". The newly created column would show "SM690 SM780" or "SM690,SM780".

Thank you.

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

Below should do the job. The code is based on SAS sample code found here.

data have;
  infile datalines truncover;
  input text $100.;
  datalines;
SM690 and SM780 occurred at 1:00pm
SM690 and Sm780 occurred at 1:00pm
abSM690 and Sm780 occurred at 1:00pm
;

data want(drop=_:);
  set have;
  length words $100;
  _prxid = prxparse('/\bSM.*?\b/');
  _start = 1;
  _stop = length(text);

  /* Use PRXNEXT to find the first instance of the pattern, */
  /* then use DO WHILE to find all further instances.       */
  /* PRXNEXT changes the _start parameter so that searching  */
  /* begins again after the last match.                     */
  call prxnext(_prxid, _start, _stop, trim(text), _pos, _len);

  do while (_pos > 0);
    words = catx(' ',words,substr(text, _pos, _len));
    call prxnext(_prxid, _start, _stop, trim(text), _pos, _len);
  end;
run;

proc print data=want;
run;

Patrick_0-1595643828428.png

 

 

View solution in original post

5 REPLIES 5
Nicole_Fox
Obsidian | Level 7

Using a where clause is one way to accomplish this. where x like 'SM%';

 

 

 

data x;   
     do i=1 to 10; 
       x='small';
       output;
     end; 
     do i=1 to 10; 
       x='SMURF';
       output;
     end; 
     drop i; 
run;
data y; 
    set x;
    where x like 'SM%';
run; 

 

 

 

Patrick
Opal | Level 21

Below should do the job. The code is based on SAS sample code found here.

data have;
  infile datalines truncover;
  input text $100.;
  datalines;
SM690 and SM780 occurred at 1:00pm
SM690 and Sm780 occurred at 1:00pm
abSM690 and Sm780 occurred at 1:00pm
;

data want(drop=_:);
  set have;
  length words $100;
  _prxid = prxparse('/\bSM.*?\b/');
  _start = 1;
  _stop = length(text);

  /* Use PRXNEXT to find the first instance of the pattern, */
  /* then use DO WHILE to find all further instances.       */
  /* PRXNEXT changes the _start parameter so that searching  */
  /* begins again after the last match.                     */
  call prxnext(_prxid, _start, _stop, trim(text), _pos, _len);

  do while (_pos > 0);
    words = catx(' ',words,substr(text, _pos, _len));
    call prxnext(_prxid, _start, _stop, trim(text), _pos, _len);
  end;
run;

proc print data=want;
run;

Patrick_0-1595643828428.png

 

 

ChrisNZ
Tourmaline | Level 20

Like this?

data HAVE;
  infile cards truncover;
  input TEXT $100.;
cards;
SM690 and SM780 occurred at 1:00pm
SM690 and Sm780 occurred at 1:00pm
abSM690 and Sm780 occurred at 1:00pm
run;

data WANT;
  set HAVE;
  WORDS=prxchange('s/(\b(?!SM).*? )/ /',-1,TEXT);
run;

proc print data=WANT;
run;
Obs TEXT WORDS
1 SM690 and SM780 occurred at 1:00pm SM690 SM780
2 SM690 and Sm780 occurred at 1:00pm SM690
3 abSM690 and Sm780 occurred at 1:00pm  

 

Patrick
Opal | Level 21

@ChrisNZ  How would you amend your RegEx to also capture SM888 from below data?

SM780,SM888 occurred at 1:00pm and SM999
ChrisNZ
Tourmaline | Level 20

@Patrick Not too sure why the \b is not usable here at the end of the word.

regex101.com suggests it should be a valid expression for the examples given and for your example. 

I suppose we could use this:

data HAVE;
  infile cards truncover;
  input TEXT $100.;
  WORDS=prxchange('s/(\b(?!SM).*?\W)/ /',-1,TEXT);
cards;
SM690 and SM780 occurred at 1:00pm
SM690 and Sm780 occurred at 1:00pm
abSM690 and Sm780 occurred at 1:00pm
SM780,SM888 occurred at 1:00pm and SM999
run;

proc print; run;
Obs TEXT WORDS
1 SM690 and SM780 occurred at 1:00pm SM690 SM780
2 SM690 and Sm780 occurred at 1:00pm SM690
3 abSM690 and Sm780 occurred at 1:00pm  
4 SM780,SM888 occurred at 1:00pm and SM999 SM780 SM888 SM999

 

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 758 views
  • 3 likes
  • 4 in conversation