Solved: extracting words from a text field based on their first 2 letters

gzr2mz39 · Posted 07-24-2020 08:21 PM

How would I go about extracting all words from a text field that start with "SM" where the 2 letters "SM" are uppercase? All words that begin with "SM" would be copied into a new column where each word that begins with "SM" for that specific record would be separated by a space or comma.

For example, for record 100 the text is "SM690 and SM780 occurred at 1:00pm". The newly created column would show "SM690 SM780" or "SM690,SM780".

Thank you.

Patrick · Posted 07-24-2020 10:23 PM

Below should do the job. The code is based on SAS sample code found here.

data have;
  infile datalines truncover;
  input text $100.;
  datalines;
SM690 and SM780 occurred at 1:00pm
SM690 and Sm780 occurred at 1:00pm
abSM690 and Sm780 occurred at 1:00pm
;

data want(drop=_:);
  set have;
  length words $100;
  _prxid = prxparse('/\bSM.*?\b/');
  _start = 1;
  _stop = length(text);

  /* Use PRXNEXT to find the first instance of the pattern, */
  /* then use DO WHILE to find all further instances.       */
  /* PRXNEXT changes the _start parameter so that searching  */
  /* begins again after the last match.                     */
  call prxnext(_prxid, _start, _stop, trim(text), _pos, _len);

  do while (_pos > 0);
    words = catx(' ',words,substr(text, _pos, _len));
    call prxnext(_prxid, _start, _stop, trim(text), _pos, _len);
  end;
run;

proc print data=want;
run;

View solution in original post

Nicole_Fox · Posted 07-24-2020 08:50 PM

Using a where clause is one way to accomplish this. where x like 'SM%';

data x;   
     do i=1 to 10; 
       x='small';
       output;
     end; 
     do i=1 to 10; 
       x='SMURF';
       output;
     end; 
     drop i; 
run;
data y; 
    set x;
    where x like 'SM%';
run;

Patrick · Posted 07-24-2020 10:23 PM

Below should do the job. The code is based on SAS sample code found here.

data have;
  infile datalines truncover;
  input text $100.;
  datalines;
SM690 and SM780 occurred at 1:00pm
SM690 and Sm780 occurred at 1:00pm
abSM690 and Sm780 occurred at 1:00pm
;

data want(drop=_:);
  set have;
  length words $100;
  _prxid = prxparse('/\bSM.*?\b/');
  _start = 1;
  _stop = length(text);

  /* Use PRXNEXT to find the first instance of the pattern, */
  /* then use DO WHILE to find all further instances.       */
  /* PRXNEXT changes the _start parameter so that searching  */
  /* begins again after the last match.                     */
  call prxnext(_prxid, _start, _stop, trim(text), _pos, _len);

  do while (_pos > 0);
    words = catx(' ',words,substr(text, _pos, _len));
    call prxnext(_prxid, _start, _stop, trim(text), _pos, _len);
  end;
run;

proc print data=want;
run;

ChrisNZ · Posted 07-25-2020 03:32 AM

Like this?

data HAVE;
  infile cards truncover;
  input TEXT $100.;
cards;
SM690 and SM780 occurred at 1:00pm
SM690 and Sm780 occurred at 1:00pm
abSM690 and Sm780 occurred at 1:00pm
run;

data WANT;
  set HAVE;
  WORDS=prxchange('s/(\b(?!SM).*? )/ /',-1,TEXT);
run;

proc print data=WANT;
run;

Obs	TEXT	WORDS
1	SM690 and SM780 occurred at 1:00pm	SM690 SM780
2	SM690 and Sm780 occurred at 1:00pm	SM690
3	abSM690 and Sm780 occurred at 1:00pm

High-Performance SAS Coding - Third Edition

Patrick · Posted 07-25-2020 11:24 PM

@ChrisNZ How would you amend your RegEx to also capture SM888 from below data?

SM780,SM888 occurred at 1:00pm and SM999

ChrisNZ · Posted 07-26-2020 02:58 AM

@Patrick Not too sure why the \b is not usable here at the end of the word.

regex101.com suggests it should be a valid expression for the examples given and for your example.

I suppose we could use this:

data HAVE;
  infile cards truncover;
  input TEXT $100.;
  WORDS=prxchange('s/(\b(?!SM).*?\W)/ /',-1,TEXT);
cards;
SM690 and SM780 occurred at 1:00pm
SM690 and Sm780 occurred at 1:00pm
abSM690 and Sm780 occurred at 1:00pm
SM780,SM888 occurred at 1:00pm and SM999
run;

proc print; run;

Obs	TEXT	WORDS
1	SM690 and SM780 occurred at 1:00pm	SM690 SM780
2	SM690 and Sm780 occurred at 1:00pm	SM690
3	abSM690 and Sm780 occurred at 1:00pm
4	SM780,SM888 occurred at 1:00pm and SM999	SM780 SM888 SM999

High-Performance SAS Coding - Third Edition

extracting words from a text field based on their first 2 letters

Re: extracting words from a text field based on their first 2 letters

Re: extracting words from a text field based on their first 2 letters

Re: extracting words from a text field based on their first 2 letters

Re: extracting words from a text field based on their first 2 letters

Re: extracting words from a text field based on their first 2 letters

Re: extracting words from a text field based on their first 2 letters

Registration is open

extracting words from a text field based on their first 2 letters

Re: extracting words from a text field based on their first 2 letters

Re: extracting words from a text field based on their first 2 letters

Re: extracting words from a text field based on their first 2 letters

Re: extracting words from a text field based on their first 2 letters

Re: extracting words from a text field based on their first 2 letters

Re: extracting words from a text field based on their first 2 letters

Registration is open

SAS Training: Just a Click Away