How would I go about extracting all words from a text field that start with "SM" where the 2 letters "SM" are uppercase? All words that begin with "SM" would be copied into a new column where each word that begins with "SM" for that specific record would be separated by a space or comma.
For example, for record 100 the text is "SM690 and SM780 occurred at 1:00pm". The newly created column would show "SM690 SM780" or "SM690,SM780".
Thank you.
Below should do the job. The code is based on SAS sample code found here.
data have;
infile datalines truncover;
input text $100.;
datalines;
SM690 and SM780 occurred at 1:00pm
SM690 and Sm780 occurred at 1:00pm
abSM690 and Sm780 occurred at 1:00pm
;
data want(drop=_:);
set have;
length words $100;
_prxid = prxparse('/\bSM.*?\b/');
_start = 1;
_stop = length(text);
/* Use PRXNEXT to find the first instance of the pattern, */
/* then use DO WHILE to find all further instances. */
/* PRXNEXT changes the _start parameter so that searching */
/* begins again after the last match. */
call prxnext(_prxid, _start, _stop, trim(text), _pos, _len);
do while (_pos > 0);
words = catx(' ',words,substr(text, _pos, _len));
call prxnext(_prxid, _start, _stop, trim(text), _pos, _len);
end;
run;
proc print data=want;
run;
Using a where clause is one way to accomplish this. where x like 'SM%';
data x;
do i=1 to 10;
x='small';
output;
end;
do i=1 to 10;
x='SMURF';
output;
end;
drop i;
run;
data y;
set x;
where x like 'SM%';
run;
Below should do the job. The code is based on SAS sample code found here.
data have;
infile datalines truncover;
input text $100.;
datalines;
SM690 and SM780 occurred at 1:00pm
SM690 and Sm780 occurred at 1:00pm
abSM690 and Sm780 occurred at 1:00pm
;
data want(drop=_:);
set have;
length words $100;
_prxid = prxparse('/\bSM.*?\b/');
_start = 1;
_stop = length(text);
/* Use PRXNEXT to find the first instance of the pattern, */
/* then use DO WHILE to find all further instances. */
/* PRXNEXT changes the _start parameter so that searching */
/* begins again after the last match. */
call prxnext(_prxid, _start, _stop, trim(text), _pos, _len);
do while (_pos > 0);
words = catx(' ',words,substr(text, _pos, _len));
call prxnext(_prxid, _start, _stop, trim(text), _pos, _len);
end;
run;
proc print data=want;
run;
Like this?
data HAVE;
infile cards truncover;
input TEXT $100.;
cards;
SM690 and SM780 occurred at 1:00pm
SM690 and Sm780 occurred at 1:00pm
abSM690 and Sm780 occurred at 1:00pm
run;
data WANT;
set HAVE;
WORDS=prxchange('s/(\b(?!SM).*? )/ /',-1,TEXT);
run;
proc print data=WANT;
run;
Obs | TEXT | WORDS |
---|---|---|
1 | SM690 and SM780 occurred at 1:00pm | SM690 SM780 |
2 | SM690 and Sm780 occurred at 1:00pm | SM690 |
3 | abSM690 and Sm780 occurred at 1:00pm |
@ChrisNZ How would you amend your RegEx to also capture SM888 from below data?
SM780,SM888 occurred at 1:00pm and SM999
@Patrick Not too sure why the \b is not usable here at the end of the word.
regex101.com suggests it should be a valid expression for the examples given and for your example.
I suppose we could use this:
data HAVE;
infile cards truncover;
input TEXT $100.;
WORDS=prxchange('s/(\b(?!SM).*?\W)/ /',-1,TEXT);
cards;
SM690 and SM780 occurred at 1:00pm
SM690 and Sm780 occurred at 1:00pm
abSM690 and Sm780 occurred at 1:00pm
SM780,SM888 occurred at 1:00pm and SM999
run;
proc print; run;
Obs | TEXT | WORDS |
---|---|---|
1 | SM690 and SM780 occurred at 1:00pm | SM690 SM780 |
2 | SM690 and Sm780 occurred at 1:00pm | SM690 |
3 | abSM690 and Sm780 occurred at 1:00pm | |
4 | SM780,SM888 occurred at 1:00pm and SM999 | SM780 SM888 SM999 |
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.