Patrick
Opal | Level 21

Is below what you're after?

data check;
  var="Patient reported 100 c headache 3f and nausea. MD noticed rash.";
  output;
  var="a2 Pt. Rptd. Backache. usd25";
  output;
  var="b25h of patient reported seeing spots.";
  output;
  var="3g Elevated pulse and 6d labored breathing.";
  output;
  var="Headache.";
  output;
run;

data want;
  if _n_=1 then _rx=prxparse('/(^|[^ab\d])(\d+)([^cd\d]|$)/oi');
  retain _rx;
  set check;
  length string $20;
  _check=var;
  _check=compress(_check);
  if prxmatch(_rx,_check) then
    do;
      string=prxposn(_rx,2,_check);
    end;
run;

allaboutsas
Calcite | Level 5

Thank you, Patrick!

Sorry, just one more scenario, for cases like the one below:

var="abc h3 4k Headache.";

There are two valid numbers here. I'm fine with 3 being returned, but not 34. Is there any way to fix this?

Thanks!
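
For readers following along, here is a minimal sketch of where the 34 seems to come from: compress() removes every blank before the pattern is applied, so "h3 4k" becomes "h34k" and \d+ sees one two-digit number. The step below reuses the pattern from the code above (without the /o option); the variable names rx and digits are only illustrative.

```sas
data _null_;
  length _check $40 digits $20;
  rx = prxparse('/(^|[^ab\d])(\d+)([^cd\d]|$)/i');
  /* compress() with no other arguments strips all blanks */
  _check = compress("abc h3 4k Headache.");
  if prxmatch(rx, _check) then digits = prxposn(rx, 2, _check);
  put _check= digits=;   /* _check=abch34kHeadache. digits=34 */
run;
```

So any fix has to keep some separator between the two numbers before the blanks are removed.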

Patrick
Opal | Level 21

Not really pretty but best I could come up with:

data want;
  if _n_=1 then _rx=prxparse('/(^|[^ab\d])(\d+)([^cd\d]|$)/oi');
  retain _rx;
  set check;
  length string $20;
  _check=var;
  _check=prxchange('s/(\d) +(\d)/\1\|\2/o',-1,_check);
  _check=compress(_check);
  if prxmatch(_rx,_check) then
    do;
      string=prxposn(_rx,2,_check);
    end;
run;

allaboutsas
Calcite | Level 5

Hi Patrick,

Thank you so much! It mostly works, but something is still missing. I have to change my sample to better match my real data:

  1. The number wanted has to be a decimal: (\d*\.\d+)
  2. Words that must not appear directly in front of the number: ‘abc’, ‘efg’
  3. Letters that must not appear directly behind the number: ‘c’, ‘d’

data check;
  var="Patient reported .1 c headache 1.3f and nausea. MD noticed rash."; output;
  var="abc2.2 Pt. Rptd. Backache. usd2.5"; output;
  var="efg2.5h of patient reported seeing spots."; output;
  var="1.3g Elevated pulse and 0.6d labored breathing."; output;
  var="Headache."; output;
  var="abc .3 5 4.5k Headache."; output;
run;

Expected Output:

| var | string | _check |
|---|---|---|
| Patient reported .1 c headache 1.3f and nausea. MD noticed rash. | 1.3 | Patientreported.1cheadache1.3fandnausea.MDnoticedrash. |
| abc2.2 Pt. Rptd. Backache. usd2.5 | 2.5 | abc2.2Pt.Rptd.Backache.usd2.5 |
| efg2.5h of patient reported seeing spots. | | efg2.5hofpatientreportedseeingspots. |
| 1.3g Elevated pulse and 0.6d labored breathing. | 1.3 | 1.3gElevatedpulseand0.6dlaboredbreathing. |
| Headache. | | Headache. |
| abc .3 5 4.5k Headache. | 4.5 | abc.3\|54.5kHeadache. |

Patrick
Opal | Level 21

With regular expressions it's really important that the patterns one wants to search for are very clearly defined. I therefore suggest you first do an in-depth analysis of your real data and then provide sample data and extraction rules that cover "everything".

Ksharp
Super User

data check;
var="Patient reported .1 c headache 1.3f and nausea. MD noticed rash."; output;
var="abc2.2 Pt. Rptd. Backache. usd2.5"; output;
var="efg2.5h of patient reported seeing spots."; output;
var="1.3g Elevated pulse and 0.6d labored breathing."; output;
var="Headache."; output;
var="abc .3 5 4.5k Headache.";output;
run;
 
 
data want;
set check;
length v $ 20;
retain pid;
if _n_ eq 1 then pid=prxparse('/\b[dh-z]+\d+\.?\d+\b|\b\d+\.?\d+[abe-z]+\b/i');
call prxsubstr(pid, var, position, length);
if position ne 0 then v = compress(substr(var, position, length),'.','kd');
drop pid position length;
run;

Xia Keshan
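
A note on the last step in the code above, for anyone puzzling over it: in compress(), the 'k' modifier keeps the listed characters instead of removing them and 'd' adds all digits to the list, so everything except digits and dots is stripped from the matched substring. A minimal sketch using two values from the sample data (the variable name kept is only illustrative):

```sas
data _null_;
  length kept $20;
  /* keep only digits and the dot */
  kept = compress('usd2.5', '.', 'kd');
  put kept=;   /* kept=2.5 */
  kept = compress('1.3f', '.', 'kd');
  put kept=;   /* kept=1.3 */
run;
```

Because dots are kept too, a matched substring that carries sentence punctuation would keep that dot as well.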

allaboutsas
Calcite | Level 5

Thank you guys very much for your help!

I don’t want to start a new discussion since this is another regular expression question, though I wonder if it’s even possible. Is there a way to identify a word if it’s within, let’s say, 3 words of the number? I’m going to use the same sample (modified), so the requirements become:

  1. The number wanted has to be a decimal: (\d*\.\d+)
  2. Words that must not appear within 3 words in front of or behind the number: ‘ab ht’, ‘ods’.

data check;
  var="Patient ab ht reported .1 headache 1.3f and nausea. MD ods noticed rash."; output;
  var="ab ht 2.2 Pt. Rptd. Backache. ht usd2.5 od"; output;
  var="2.5h of ods patient reported seeing spots."; output;
  var="1.3g Elevated pulse ab ht and 0.6d labored breathing."; output;
  var="Headache."; output;
  var="ab ht .3 5 4 .5k Headache ab"; output;
run;

Expected Output:

| var | string |
|---|---|
| Patient ab ht reported .1 headache 1.3f and nausea. MD ods noticed rash. | 1.3 |
| ab ht 2.2 Pt. Rptd. Backache. ht usd2.5 od | 2.5 |
| 2.5h of ods patient reported seeing spots. | |
| 1.3g Elevated pulse ab ht and 0.6d labored breathing. | |
| Headache. | |
| ab ht .3 5 4 .5k Headache ab | .5 |

Thanks!

Patrick
Opal | Level 21

1) If you're looking for values like '.2' and '3.2' but not '3', then your RegEx looks fine to me.

2) What is the delimiter for a word? A pattern for a specific word within a 3-word distance could be something like (untested):   (\bods\b)(\b\w+){0,2}(\d*\.\d+)
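
To make the word-distance idea concrete, here is a minimal sketch that assumes words are separated by whitespace. It is a variation on the untested pattern above, not the same pattern: a keyword, then up to two intervening words, then a decimal. The variable names rx and near are only illustrative.

```sas
data _null_;
  length near $20;
  /* keyword, 0 to 2 whitespace-delimited words, then a decimal in group 1 */
  rx = prxparse('/\b(?:ab\s+ht|ods)\b(?:\s+\S+){0,2}\s+(\d*\.\d+)/i');
  var = "Patient ab ht reported .1 headache 1.3f and nausea. MD ods noticed rash.";
  if prxmatch(rx, var) then near = prxposn(rx, 1, var);
  put near=;   /* near=.1 */
run;
```

This only finds a decimal that follows a keyword too closely, so it identifies values to reject. A keyword sitting behind the number would need a mirrored pattern.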

allaboutsas
Calcite | Level 5

Thank you Patrick. I'll try that too.

Ksharp
Super User

You're really giving me a headache. I hope you don't post a question like this again.

data check;
var="Patient ab ht reported .1 headache 1.3f and nausea. MD ods noticed rash."; output;
var="ab ht 2.2 Pt. Rptd. Backache. ht usd2.5 od"; output;
var="2.5h of ods patient reported seeing spots."; output;
var="1.3g Elevated pulse ab ht and 0.6d labored breathing."; output;
var="Headache."; output;
var="ab ht .3 5 4 .5k Headache ab";output;
run;
 
 
data want;
set check;
length v $ 20;
retain pid;
var=prxchange('s/\bab\s+ht\b|\bods\b/|/i',-1,var);
if _n_ eq 1 then pid=prxparse('/[^|]\s+(\S+)?\s+(\S+)?\s+[dh-z]+(\d+)?\.?\d+\s+|\s+(\d+)?\.?\d+[abe-z]+\s+(\S+)?\s+(\S+)?\s+[^|]/i');
call prxsubstr(pid, var, position, length);
if position ne 0 then v = prxchange('s/^\.+(?=\d+\.\d+)|\.+$//' ,-1, compress(substr(var, position, length),'.','kd') );
drop pid position length;
run;

Xia Keshan

allaboutsas
Calcite | Level 5

Xia Keshan,

Thank you so much! That’s awesome!


One question: I believe you carried over [dh-z]+ and [abe-z]+ from my first question, right? I tried to remove them, as in the code below, but I don’t understand why it’s not working. What did I do wrong?


data want;
set check;
length v $ 20;
retain pid;
var=prxchange('s/\bab\s+ht\b|\bods\b/|/i',-1,var);
if _n_ eq 1 then pid=prxparse('/[^|]\s+(\S+)?\s+(\S+)?\s+(\d+)?\.?\d+\s+|\s+(\d+)?\.?\d+\s+(\S+)?\s+(\S+)?\s+[^|]/i');
call prxsubstr(pid, var, position, length);
if position ne 0 then v = prxchange('s/^\.+(?=\d+\.\d+)|\.+$//' ,-1, compress(substr(var, position, length),'.','kd') );
drop pid position length;
run;

Ksharp
Super User

Why do you want to remove that line? What is your purpose?

If you don't want that condition any more, try this one:

 
data check;
var="Patient ab ht reported .1 headache 1.3f and nausea. MD ods noticed rash."; output;
var="ab ht 2.2 Pt. Rptd. Backache. ht usd2.5 od"; output;
var="2.5h of ods patient reported seeing spots."; output;
var="1.3g Elevated pulse ab ht and 0.6d labored breathing."; output;
var="Headache."; output;
var="ab ht .3 5 4 .5k Headache ab";output;
run;
 
 
data want;
set check;
length v $ 20;
retain pid;

if _n_ eq 1 then pid=prxparse('/\s+[dh-z]+(\d+)?\.?\d+\s+|\s+(\d+)?\.?\d+[abe-z]+\s+/i');
call prxsubstr(pid, var, position, length);
if position ne 0 then v = prxchange('s/^\.+(?=\d+\.\d+)|\.+$//' ,-1, compress(substr(var, position, length),'.','kd') );
drop pid position length;
run;

Xia Keshan


allaboutsas
Calcite | Level 5

The requirements were these:

  1. The number wanted has to be a decimal: (\d*\.\d+)
  2. Words that must not appear within 3 words in front of or behind the number: ‘ab ht’, ‘ods’.

There is no [dh-z] or [abe-z] requirement. I think your previous code is right as long as these are removed, but I can't get it to work.

Ksharp
Super User

I don't know why it doesn't work.

Try removing the '?' after '\.' in this line:

if _n_ eq 1 then pid=prxparse('/[^|]\s+(\S+)?\s+(\S+)?\s+[dh-z]+(\d+)?\.?\d+\s+|\s+(\d+)?\.?\d+[abe-z]+\s+(\S+)?\s+(\S+)?\s+[^|]/i'); 

data want;
set check;
length v $ 20;
retain pid;
var=prxchange('s/\bab\s+ht\b|\bods\b/|/i',-1,var);
if _n_ eq 1 then pid=prxparse('/[^|]\s+(\S+)?\s+(\S+)?\s+[dh-z]+(\d+)?\.\d+\s+|\s+(\d+)?\.\d+[abe-z]+\s+(\S+)?\s+(\S+)?\s+[^|]/i');
call prxsubstr(pid, var, position, length);
if position ne 0 then v = prxchange('s/^\.+(?=\d+\.\d+)|\.+$//' ,-1, compress(substr(var, position, length),'.','kd') );
drop pid position length;
run;

allaboutsas
Calcite | Level 5

Hi xia keshan,

Sorry, am I misunderstanding [dh-z]? Is it there for the "not preceded by ‘abc’, ‘efg’" rule from the previous example? That isn't a requirement for this task, and neither is [abe-z], so I want to remove them from the code you provided, but I can’t make it work.


The only 2 requirements for this task are:

  1. The number wanted has to be a decimal: (\d*\.\d+)
  2. Words that must not appear within 3 words in front of or behind the number: ‘ab ht’, ‘ods’.

Thanks.
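
For later readers, one possible way to express just these two rules is to skip the single large pattern and count words explicitly. This is a sketch under stated assumptions, not a solution posted in this thread: it assumes words are blank-delimited, that "within 3 words" means the keyword is one of the three words before or after the number, and that the first qualifying decimal is the one wanted. The variables _rx, _tmp, _word, _n, _i, _j and _near are hypothetical helper names.

```sas
data check;
  length var $80;
  var="Patient ab ht reported .1 headache 1.3f and nausea. MD ods noticed rash."; output;
  var="ab ht 2.2 Pt. Rptd. Backache. ht usd2.5 od"; output;
  var="2.5h of ods patient reported seeing spots."; output;
  var="1.3g Elevated pulse ab ht and 0.6d labored breathing."; output;
  var="Headache."; output;
  var="ab ht .3 5 4 .5k Headache ab"; output;
run;

data want;
  set check;
  length v $20 _tmp $200 _word $50;
  retain _rx;
  if _n_ = 1 then _rx = prxparse('/(\d*\.\d+)/');
  /* Collapse each keyword into a one-word marker so "ab ht" counts as one word */
  _tmp = prxchange('s/\bab\s+ht\b|\bods\b/|/i', -1, var);
  _n = countw(_tmp, ' ');
  /* Walk the blank-delimited words until a usable decimal is found */
  do _i = 1 to _n while (v = ' ');
    _word = scan(_tmp, _i, ' ');
    if prxmatch(_rx, _word) then do;
      /* Look for a marker among the 3 words before and the 3 words after */
      _near = 0;
      do _j = max(1, _i - 3) to min(_n, _i + 3);
        if scan(_tmp, _j, ' ') = '|' then _near = 1;
      end;
      if not _near then v = prxposn(_rx, 1, _word);
    end;
  end;
  drop _:;
run;
```

Traced by hand against the six sample rows, this should give 1.3, 2.5, blank, blank, blank and .5, which matches the expected output above, but it has not been run against real data. A keyword with punctuation attached (for example "ods,") would not be recognised as a marker word and would need extra handling.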
