BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
gzr2mz39
Quartz | Level 8

How would I go about extracting all words from a text field that start with "SM" where the 2 letters "SM" are uppercase? All words that begin with "SM" would be copied into a new column where each word that begins with "SM" for that specific record would be separated by a space or comma.

For example, for record 100 the text is "SM690 and SM780 occurred at 1:00pm". The newly created column would show "SM690 SM780" or "SM690,SM780".

Thank you.

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

Below should do the job. The code is based on SAS sample code found here.

data have;
  infile datalines truncover;
  input text $100.;
  datalines;
SM690 and SM780 occurred at 1:00pm
SM690 and Sm780 occurred at 1:00pm
abSM690 and Sm780 occurred at 1:00pm
;

data want(drop=_:);
  set have;
  length words $100;
  _prxid = prxparse('/\bSM.*?\b/');
  _start = 1;
  _stop = length(text);

  /* Use PRXNEXT to find the first instance of the pattern, */
  /* then use DO WHILE to find all further instances.       */
  /* PRXNEXT changes the _start parameter so that searching  */
  /* begins again after the last match.                     */
  call prxnext(_prxid, _start, _stop, trim(text), _pos, _len);

  do while (_pos > 0);
    words = catx(' ',words,substr(text, _pos, _len));
    call prxnext(_prxid, _start, _stop, trim(text), _pos, _len);
  end;
run;

proc print data=want;
run;

Patrick_0-1595643828428.png

 

 

View solution in original post

5 REPLIES 5
Nicole_Fox
Obsidian | Level 7

Using a where clause is one way to accomplish this. where x like 'SM%';

 

 

 

data x;   
     do i=1 to 10; 
       x='small';
       output;
     end; 
     do i=1 to 10; 
       x='SMURF';
       output;
     end; 
     drop i; 
run;
data y; 
    set x;
    where x like 'SM%';
run; 

 

 

 

Patrick
Opal | Level 21

Below should do the job. The code is based on SAS sample code found here.

data have;
  infile datalines truncover;
  input text $100.;
  datalines;
SM690 and SM780 occurred at 1:00pm
SM690 and Sm780 occurred at 1:00pm
abSM690 and Sm780 occurred at 1:00pm
;

data want(drop=_:);
  set have;
  length words $100;
  _prxid = prxparse('/\bSM.*?\b/');
  _start = 1;
  _stop = length(text);

  /* Use PRXNEXT to find the first instance of the pattern, */
  /* then use DO WHILE to find all further instances.       */
  /* PRXNEXT changes the _start parameter so that searching  */
  /* begins again after the last match.                     */
  call prxnext(_prxid, _start, _stop, trim(text), _pos, _len);

  do while (_pos > 0);
    words = catx(' ',words,substr(text, _pos, _len));
    call prxnext(_prxid, _start, _stop, trim(text), _pos, _len);
  end;
run;

proc print data=want;
run;

Patrick_0-1595643828428.png

 

 

ChrisNZ
Tourmaline | Level 20

Like this?

data HAVE;
  infile cards truncover;
  input TEXT $100.;
cards;
SM690 and SM780 occurred at 1:00pm
SM690 and Sm780 occurred at 1:00pm
abSM690 and Sm780 occurred at 1:00pm
run;

data WANT;
  set HAVE;
  WORDS=prxchange('s/(\b(?!SM).*? )/ /',-1,TEXT);
run;

proc print data=WANT;
run;
Obs TEXT WORDS
1 SM690 and SM780 occurred at 1:00pm SM690 SM780
2 SM690 and Sm780 occurred at 1:00pm SM690
3 abSM690 and Sm780 occurred at 1:00pm  

 

Patrick
Opal | Level 21

@ChrisNZ  How would you amend your RegEx to also capture SM888 from below data?

SM780,SM888 occurred at 1:00pm and SM999
ChrisNZ
Tourmaline | Level 20

@Patrick Not too sure why the \b is not usable here at the end of the word.

regex101.com suggests it should be a valid expression for the examples given and for your example. 

I suppose we could use this:

data HAVE;
  infile cards truncover;
  input TEXT $100.;
  WORDS=prxchange('s/(\b(?!SM).*?\W)/ /',-1,TEXT);
cards;
SM690 and SM780 occurred at 1:00pm
SM690 and Sm780 occurred at 1:00pm
abSM690 and Sm780 occurred at 1:00pm
SM780,SM888 occurred at 1:00pm and SM999
run;

proc print; run;
Obs TEXT WORDS
1 SM690 and SM780 occurred at 1:00pm SM690 SM780
2 SM690 and Sm780 occurred at 1:00pm SM690
3 abSM690 and Sm780 occurred at 1:00pm  
4 SM780,SM888 occurred at 1:00pm and SM999 SM780 SM888 SM999

 

 

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 1940 views
  • 3 likes
  • 4 in conversation