Is the code below what you're after?
data check;
var="Patient reported 100 c headache 3f and nausea. MD noticed rash.";
output;
var="a2 Pt. Rptd. Backache. usd25";
output;
var="b25h of patient reported seeing spots.";
output;
var="3g Elevated pulse and 6d labored breathing.";
output;
var="Headache.";
output;
run;
data want;
if _n_=1 then _rx=prxparse('/(^|[^ab\d])(\d+)([^cd\d]|$)/oi');
retain _rx;
set check;
length string $20;
_check=var;
_check=compress(_check);
if prxmatch(_rx,_check) then
do;
string=prxposn(_rx,2,_check);
end;
run;
Thank you, Patrick!
Sorry, just one more scenario, for cases like this one below:
var="abc h3 4k Headache.";
There are 2 valid numbers here. I'm OK with 3 being returned, but not 34. Is there any way to fix that?
Thanks!
Not really pretty but best I could come up with:
data want;
if _n_=1 then _rx=prxparse('/(^|[^ab\d])(\d+)([^cd\d]|$)/oi');
retain _rx;
set check;
length string $20;
_check=var;
_check=prxchange('s/(\d) +(\d)/\1\|\2/o',-1,_check);
_check=compress(_check);
if prxmatch(_rx,_check) then
do;
string=prxposn(_rx,2,_check);
end;
run;
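To make the idea behind the extra PRXCHANGE step easier to see, here is a minimal sketch (not Patrick's exact code). It assumes the only problem is that COMPRESS glues digits from neighbouring words together. The variable names raw, merged and fixed are made up for illustration.

```sas
data _null_;
length raw merged fixed $40;
raw = 'abc h3 4k Headache.';
/* Plain COMPRESS removes the blank, so 3 and 4 become 34 */
merged = compress(raw);
/* Put a | between two digits separated only by blanks, then compress */
fixed = prxchange('s/(\d) +(\d)/$1|$2/', -1, raw);
fixed = compress(fixed);
/* merged=abch34kHeadache.  fixed=abch3|4kHeadache. */
put merged= fixed=;
run;
```

With the separator in place, the \d+ in the main pattern can no longer run across what used to be a word boundary.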
Hi Patrick,
Thank you so much! It's mostly working, but something is still missing. I had to change my sample to better reflect my real data:
data check;
var="Patient reported .1 c headache 1.3f and nausea. MD noticed rash."; output;
var="abc2.2 Pt. Rptd. Backache. usd2.5"; output;
var="efg2.5h of patient reported seeing spots."; output;
var="1.3g Elevated pulse and 0.6d labored breathing."; output;
var="Headache."; output;
var="abc .3 5 4.5k Headache.";output;
run;
Expected Output:
var | string | _check |
Patient reported .1 c headache 1.3f and nausea. MD noticed rash. | 1.3 | Patientreported.1cheadache1.3fandnausea.MDnoticedrash. |
abc2.2 Pt. Rptd. Backache. usd2.5 | 2.5 | abc2.2Pt.Rptd.Backache.usd2.5 |
efg2.5h of patient reported seeing spots. | | efg2.5hofpatientreportedseeingspots. |
1.3g Elevated pulse and 0.6d labored breathing. | 1.3 | 1.3gElevatedpulseand0.6dlaboredbreathing. |
Headache. | | Headache. |
abc .3 5 4.5k Headache. | 4.5 | abc.3\|54.5kHeadache. |
data check;
var="Patient reported .1 c headache 1.3f and nausea. MD noticed rash."; output;
var="abc2.2 Pt. Rptd. Backache. usd2.5"; output;
var="efg2.5h of patient reported seeing spots."; output;
var="1.3g Elevated pulse and 0.6d labored breathing."; output;
var="Headache."; output;
var="abc .3 5 4.5k Headache."; output;
run;
data want;
set check;
length v $ 20;
retain pid;
if _n_ eq 1 then pid=prxparse('/\b[dh-z]+\d+\.?\d+\b|\b\d+\.?\d+[abe-z]+\b/i');
call prxsubstr(pid, var, position, length);
if position ne 0 then v = compress(substr(var, position, length),'.','kd');
drop pid position length;
run;
Xia Keshan
Thank you guys very much for your help!
I don't want to start a new discussion, since this is another question about regular expressions, though I wonder if it's even possible. Is there a way to identify a word only if it is within, say, 3 words of another? I'm going to use the same sample (modified), so the requirement becomes this:
data check;
var="Patient ab ht reported .1 headache 1.3f and nausea. MD ods noticed rash."; output;
var="ab ht 2.2 Pt. Rptd. Backache. ht usd2.5 od"; output;
var="2.5h of ods patient reported seeing spots."; output;
var="1.3g Elevated pulse ab ht and 0.6d labored breathing."; output;
var="Headache."; output;
var="ab ht .3 5 4 .5k Headache ab";output;
run;
Expected Output:
var | string |
Patient ab ht reported .1 headache 1.3f and nausea. MD ods noticed rash. | 1.3 |
ab ht 2.2 Pt. Rptd. Backache. ht usd2.5 od | 2.5 |
2.5h of ods patient reported seeing spots. | |
1.3g Elevated pulse ab ht and 0.6d labored breathing. | |
Headache. | |
ab ht .3 5 4 .5k Headache ab | .5 |
Thanks!
1) If you're looking for stuff like '.2', '3.2' but not '3' then your RegEx looks fine to me.
2) What is the delimiter for a word? A specific pattern within 3 word distance could be something like (untested): (\bods\b)(\b\w+){0,2}(\d*\.\d+)
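As a rough sketch of the "within N words" idea, and assuming that a word is simply any run of non-blank characters separated by whitespace, something like the following could be tried. The pattern, the sample strings and the names demo, text and num are all hypothetical, and the keyword ods is only used as an example.

```sas
data demo;
length text $60 num $8;
/* Keyword, then at most two other whitespace-delimited tokens, then a decimal */
rx = prxparse('/\bods\s+(?:\S+\s+){0,2}?(\d*\.\d+)/i');
do text = 'MD ods noticed 1.3 rash', 'ods a b c 1.3';
/* Capture buffer 1 holds the number when the pattern matches */
if prxmatch(rx, text) then num = prxposn(rx, 1, text);
else num = ' ';
output;
end;
drop rx;
run;
```

In the first string the number is the second token after the keyword, so it should be captured. In the second string three tokens sit in between, so num should stay blank. Changing {0,2} changes the allowed distance.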
Thank you Patrick. I'll try that too.
You're really giving me a headache. I hope you don't post a question like this again.
data check;
var="Patient ab ht reported .1 headache 1.3f and nausea. MD ods noticed rash."; output;
var="ab ht 2.2 Pt. Rptd. Backache. ht usd2.5 od"; output;
var="2.5h of ods patient reported seeing spots."; output;
var="1.3g Elevated pulse ab ht and 0.6d labored breathing."; output;
var="Headache."; output;
var="ab ht .3 5 4 .5k Headache ab"; output;
run;
data want;
set check;
length v $ 20;
retain pid;
var=prxchange('s/\bab\s+ht\b|\bods\b/|/i',-1,var);
if _n_ eq 1 then pid=prxparse('/[^|]\s+(\S+)?\s+(\S+)?\s+[dh-z]+(\d+)?\.?\d+\s+|\s+(\d+)?\.?\d+[abe-z]+\s+(\S+)?\s+(\S+)?\s+[^|]/i');
call prxsubstr(pid, var, position, length);
if position ne 0 then v = prxchange('s/^\.+(?=\d+\.\d+)|\.+$//' ,-1, compress(substr(var, position, length),'.','kd') );
drop pid position length;
run;
Xia Keshan
Xia Keshan,
Thank you so much! That’s awesome!
One question: I believe you carried over [dh-z]+ and [abe-z]+ from my first question, right? So I tried to remove them, as in the code below, but I don't understand why it's not working. What did I do wrong?
data want;
set check;
length v $ 20;
retain pid;
var=prxchange('s/\bab\s+ht\b|\bods\b/|/i',-1,var);
if _n_ eq 1 then pid=prxparse('/[^|]\s+(\S+)?\s+(\S+)?\s+(\d+)?\.?\d+\s+|\s+(\d+)?\.?\d+\s+(\S+)?\s+(\S+)?\s+[^|]/i');
call prxsubstr(pid, var, position, length);
if position ne 0 then v = prxchange('s/^\.+(?=\d+\.\d+)|\.+$//' ,-1, compress(substr(var, position, length),'.','kd') );
drop pid position length;
run;
Why do you want to remove that line? What is your purpose?
If you don't want that condition any more, try this one:
data check;
var="Patient ab ht reported .1 headache 1.3f and nausea. MD ods noticed rash."; output;
var="ab ht 2.2 Pt. Rptd. Backache. ht usd2.5 od"; output;
var="2.5h of ods patient reported seeing spots."; output;
var="1.3g Elevated pulse ab ht and 0.6d labored breathing."; output;
var="Headache."; output;
var="ab ht .3 5 4 .5k Headache ab"; output;
run;
data want;
set check;
length v $ 20;
retain pid;
if _n_ eq 1 then pid=prxparse('/\s+[dh-z]+(\d+)?\.?\d+\s+|\s+(\d+)?\.?\d+[abe-z]+\s+/i');
call prxsubstr(pid, var, position, length);
if position ne 0 then v = prxchange('s/^\.+(?=\d+\.\d+)|\.+$//' ,-1, compress(substr(var, position, length),'.','kd') );
drop pid position length;
run;
Xia Keshan
The requirement was this:
There is no [dh-z] or [abe-z] in it. I think your previous code is right as long as these are removed, but I can't get it to work.
I don't know why it doesn't work.
Try removing the '?' after the '\.', so that this line:
if _n_ eq 1 then pid=prxparse('/[^|]\s+(\S+)?\s+(\S+)?\s+[dh-z]+(\d+)?\.?\d+\s+|\s+(\d+)?\.?\d+[abe-z]+\s+(\S+)?\s+(\S+)?\s+[^|]/i');
data want;
set check;
length v $ 20;
retain pid;
var=prxchange('s/\bab\s+ht\b|\bods\b/|/i',-1,var);
if _n_ eq 1 then pid=prxparse('/[^|]\s+(\S+)?\s+(\S+)?\s+[dh-z]+(\d+)?\.\d+\s+|\s+(\d+)?\.\d+[abe-z]+\s+(\S+)?\s+(\S+)?\s+[^|]/i');
call prxsubstr(pid, var, position, length);
if position ne 0 then v = prxchange('s/^\.+(?=\d+\.\d+)|\.+$//' ,-1, compress(substr(var, position, length),'.','kd') );
drop pid position length;
run;
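For anyone wondering what difference that '?' makes, here is a small sketch using a made-up string and two cut-down patterns, not the full expressions above. With \.? the decimal point is optional, so a bare integer can satisfy the pattern. With \. a decimal point is required. The names rx_optional, rx_required, text, pos1, len1, pos2 and len2 are only for illustration.

```sas
data _null_;
rx_optional = prxparse('/(\d+)?\.?\d+/');
rx_required = prxparse('/(\d+)?\.\d+/');
text = 'dose 5 then .5k';
/* Optional point: the first hit is the bare integer 5 */
call prxsubstr(rx_optional, text, pos1, len1);
/* Required point: the integer is skipped and .5 is found */
call prxsubstr(rx_required, text, pos2, len2);
/* pos1=6 len1=1 pos2=13 len2=2 */
put pos1= len1= pos2= len2=;
run;
```

So if whole numbers should never be returned, the pattern with the required decimal point is the safer of the two.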
Hi xia keshan,
Sorry, am I misunderstanding [dh-z]? Is it there for the "not starting with 'abc' or 'efg'" rule from the previous example? That rule is not part of this task, and neither is [abe-z], so I want to remove both from the code you provided, but I can't make it work.
The only 2 requests for this task are:
Thanks.