How would I go about extracting all words from a text field that start with "SM" where the 2 letters "SM" are uppercase? All words that begin with "SM" would be copied into a new column where each word that begins with "SM" for that specific record would be separated by a space or comma.
For example, for record 100 the text is "SM690 and SM780 occurred at 1:00pm". The newly created column would show "SM690 SM780" or "SM690,SM780".
Thank you.
Below should do the job. The code is based on SAS sample code found here.
data have;
infile datalines truncover;
input text $100.;
datalines;
SM690 and SM780 occurred at 1:00pm
SM690 and Sm780 occurred at 1:00pm
abSM690 and Sm780 occurred at 1:00pm
;
data want(drop=_:);
set have;
length words $100;
_prxid = prxparse('/\bSM.*?\b/');
_start = 1;
_stop = length(text);
/* Use PRXNEXT to find the first instance of the pattern, */
/* then use DO WHILE to find all further instances. */
/* PRXNEXT changes the _start parameter so that searching */
/* begins again after the last match. */
call prxnext(_prxid, _start, _stop, trim(text), _pos, _len);
do while (_pos > 0);
words = catx(' ',words,substr(text, _pos, _len));
call prxnext(_prxid, _start, _stop, trim(text), _pos, _len);
end;
run;
proc print data=want;
run;
Using a where clause is one way to accomplish this. where x like 'SM%';
data x;
do i=1 to 10;
x='small';
output;
end;
do i=1 to 10;
x='SMURF';
output;
end;
drop i;
run;
data y;
set x;
where x like 'SM%';
run;
Below should do the job. The code is based on SAS sample code found here.
data have;
infile datalines truncover;
input text $100.;
datalines;
SM690 and SM780 occurred at 1:00pm
SM690 and Sm780 occurred at 1:00pm
abSM690 and Sm780 occurred at 1:00pm
;
data want(drop=_:);
set have;
length words $100;
_prxid = prxparse('/\bSM.*?\b/');
_start = 1;
_stop = length(text);
/* Use PRXNEXT to find the first instance of the pattern, */
/* then use DO WHILE to find all further instances. */
/* PRXNEXT changes the _start parameter so that searching */
/* begins again after the last match. */
call prxnext(_prxid, _start, _stop, trim(text), _pos, _len);
do while (_pos > 0);
words = catx(' ',words,substr(text, _pos, _len));
call prxnext(_prxid, _start, _stop, trim(text), _pos, _len);
end;
run;
proc print data=want;
run;
Like this?
data HAVE;
infile cards truncover;
input TEXT $100.;
cards;
SM690 and SM780 occurred at 1:00pm
SM690 and Sm780 occurred at 1:00pm
abSM690 and Sm780 occurred at 1:00pm
run;
data WANT;
set HAVE;
WORDS=prxchange('s/(\b(?!SM).*? )/ /',-1,TEXT);
run;
proc print data=WANT;
run;
| Obs | TEXT | WORDS |
|---|---|---|
| 1 | SM690 and SM780 occurred at 1:00pm | SM690 SM780 |
| 2 | SM690 and Sm780 occurred at 1:00pm | SM690 |
| 3 | abSM690 and Sm780 occurred at 1:00pm |
@ChrisNZ How would you amend your RegEx to also capture SM888 from below data?
SM780,SM888 occurred at 1:00pm and SM999
@Patrick Not too sure why the \b is not usable here at the end of the word.
regex101.com suggests it should be a valid expression for the examples given and for your example.
I suppose we could use this:
data HAVE;
infile cards truncover;
input TEXT $100.;
WORDS=prxchange('s/(\b(?!SM).*?\W)/ /',-1,TEXT);
cards;
SM690 and SM780 occurred at 1:00pm
SM690 and Sm780 occurred at 1:00pm
abSM690 and Sm780 occurred at 1:00pm
SM780,SM888 occurred at 1:00pm and SM999
run;
proc print; run;
| Obs | TEXT | WORDS |
|---|---|---|
| 1 | SM690 and SM780 occurred at 1:00pm | SM690 SM780 |
| 2 | SM690 and Sm780 occurred at 1:00pm | SM690 |
| 3 | abSM690 and Sm780 occurred at 1:00pm | |
| 4 | SM780,SM888 occurred at 1:00pm and SM999 | SM780 SM888 SM999 |
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.