BookmarkSubscribeRSS Feed
aasdfafafsdfsaf
Calcite | Level 5

Hi,

 

I have a table of data with a column of text for medical information/notes. There is a 8 digit identifier that is used for each patient. An example of one patient information could be "John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID: 1234-5678"
I want to be able to pull out all ID information and store it as a new column in the table. however the ID format is not always the same, sometimes there could be a space like "1234 5678" or no space "12345678" or possible hyphens in different spots while always being 8 digits long.

 

I have been using prxchange but it doesnt seem to be working.

 

Any help would be greatly appreciated.

17 REPLIES 17
ErikLund_Jensen
Rhodochrosite | Level 12

Hi @aasdfafafsdfsaf 

 

Will the identifier (regardless of format) always be the last item in the string, or could it be be found anywhere?

aasdfafafsdfsaf
Calcite | Level 5

it can be found anywhere unfortunately. there isnt really a unified format for this as everyone that fills in the information may fill it in differently. The only thing i noticed was that there is always an "ID:" beforehand or "ID"

PaigeMiller
Diamond | Level 26

As you look at the data, can you provide a set of rules (in words, not SAS code) that will allow someone to create a SAS program that will extract the desired information?

--
Paige Miller
aasdfafafsdfsaf
Calcite | Level 5

I suppose it would be something like

1. Look for the ID in any case or ID:
2. look for a sequence of 8 digits after the prxmatch position of id
3. put the 8 digit value into a new column

Patrick
Opal | Level 21

Below an approach using RegEx.

data have;
  infile datalines truncover;
  input str $200.;
  datalines;
John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID: 1234-5678
John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID 12 345678 
ID: 1234-5678 John Smith was called at telephone number (111)-222-3333 he was notified of his update.
ID 12 345678. John Smith was called at telephone number (111)-222-3333 he was notified of his update.
John Smith ID:1234 5678 was called at telephone number. (111)-222-3333 he was notified of his update.
John Smith ID:1234 56789 was called at telephone number. (111)-222-3333 he was notified of his update.
;

data want(drop=_:);
  set have;
  length _pat_id $9 pat_id $8;
  _prxid=prxparse('/\bid[: ]+(\d([\d- ]){6,10}\d)\b/oi');
  if prxmatch(_prxid,str) then
    do;
      call prxposn(_prxid,1,_pos,_len);
      _pat_id=compress(substr(str,_pos,_len),,'kd');
      if length(_pat_id) = 8 then pat_id=_pat_id;
    end;
run;

proc print data=want;
run;

 For the RegEx bit {6,10}: the lower boundary is the number of digits that must exist, the higher boundary includes the additional non-digits that are allowed. If there can be only max two non-digits (dash or blank) then change the upper boundary to 8.

Should there be a possibility that your text contains multiple patient id's then you would need to amend the code by adding logic for call prxnext()

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @aasdfafafsdfsaf 

 

This code handled the different positions and formats in my test data. Note the little cheat:  The code creates a string of all digits found after ID, but it is placed in a variable 8 chars long, so it is only the 8 digits from the identifier that ahr kept, and following digits from telephone number etc. are discarded, 

* Test data;
data have;
  str = "John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID: 1234-5678"; output;
  str = "John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID 12 345678 "; output;
  str = "ID: 1234-5678 John Smith was called at telephone number (111)-222-3333 he was notified of his update."; output;
  str = "ID 12 345678. John Smith was called at telephone number (111)-222-3333 he was notified of his update."; output;
  str = "John Smith ID:1234 5678 was called at telephone number. (111)-222-3333 he was notified of his update."; output;
run;  

* Data step with intermediate results - easier to maintain;
data want1 (drop=s1 s2);
  set have;
  Length Identifier $8;
  s1 = translate(str,'    ','.-:,');
  s2 = substr(s1,index(lowcase(s1),'id '));
  Identifier = compress(s2,,'kd');
run;

* Proc sql with function calls nested for use as an expression;
* Beware! nested function calls in proc sql have a habit of returning max 200 bytes regardless of input length;
proc sql;
  create table want2 as
    select 
      str, 
      compress(substr(translate(str,'    ','.-:,'),index(lowcase(translate(str,'    ','.-:,')),'id ')),,'kd') as Identifier length=8
    from have;
quit;

 Skærmbillede 2024-01-18 180300.png

aasdfafafsdfsaf
Calcite | Level 5
Thanks for the reply. I noticed another issue in some of the fields. Is there a quick work around if there are two instances of ID? if it was mentioned earlier in the text and then later on with ID: 1234 5678
mkeintz
PROC Star

Using the data provided by @Patrick , you can:

  1. Find the word "ID"
  2. Concatenate the next two words into the ID variable.

A "word" below is a string of text delimited by a blank, a colon, a dash, or a period  - see the _delim variable.

 

data have;
  infile datalines truncover;
  input str $200.;
  datalines;
John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID: 1234-5678
John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID 12 345678 
ID: 1234-5678 John Smith was called at telephone number (111)-222-3333 he was notified of his update.
ID 12 345678. John Smith was called at telephone number (111)-222-3333 he was notified of his update.
John Smith ID:1234 5678 was called at telephone number. (111)-222-3333 he was notified of his update.
John Smith ID:1234 56789 was called at telephone number. (111)-222-3333 he was notified of his update.
;
data want (drop=_:);
  set have;

  _delim=' :-.';
  _nw=countw(str,_delim);
  do _w=1 to _nw until (scan(str,_w,_delim)='ID');
  end;
  length id $10;
  if _w<_nw-1 then id=catx(' ',scan(str,_w+1,_delim),scan(str,_w+2,_delim));
  put _n_= _w= id=;
run;

which produces this on the log:

_N_=1 _w=17 id=1234 5678
_N_=2 _w=17 id=12 345678
_N_=3 _w=1 id=1234 5678
_N_=4 _w=1 id=12 345678
_N_=5 _w=3 id=1234 5678
_N_=6 _w=3 id=1234 56789
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
aasdfafafsdfsaf
Calcite | Level 5

Thanks for the reply. is there a way we can find a sequence of numbers after the mention of ID: in the text? I'm noticing in some of the text ID is mentioned twice where the 1234-5678 is only mentioned after the second ID

Patrick
Opal | Level 21

@aasdfafafsdfsaf wrote:

Thanks for the reply. is there a way we can find a sequence of numbers after the mention of ID: in the text? I'm noticing in some of the text ID is mentioned twice where the 1234-5678 is only mentioned after the second ID


If the sample data used in below code doesn't cover all your cases then please provide amended sample data that represents the different patterns and tell us what you'd like as result (like populating multiple variables or creating multiple rows?).

Below code populates additional variables and is based on SAS Docu sample found here.

data have;
  infile datalines truncover;
  input str $200.;
  datalines;
John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID: 1234-5678
John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID 12 345678 
ID: 1234-5678 John Smith was called at telephone number (111)-222-3333 he was notified of his update.
ID 12 345678. John Smith was called at telephone number (111)-222-3333 he was notified of his update.
John Smith ID:1234 5678 was called at telephone number. (111)-222-3333 he was notified of his update.
John Smith ID:1234 56789 was called at telephone number. (111)-222-3333 he was notified of his update.
John Smith ID:1234 56789 was called at telephone number. There is also ID: 1234-9999 available
John Smith ID:1234 5678 was called at telephone number. There is also ID: 1234-9999 available
;

data want(drop=_:);
  set have;
  array pat_id {3} $8;
  _prxid=prxparse('/\bid[: ]+(\d([\d- ]){6,10}\d)\b/oi');

  length _pat_id $9 _i _start _stop _pos _len 8;
  _start=1;
  _stop=length(str);
  call prxnext(_prxid, _start, _stop, str, _pos, _len);
  do while (_pos > 0);
    _pat_id = compress(substr(str, _pos, _len),,'kd');
    /* only populate pat_id if identified string contains exactly 8 digits */
    if length(_pat_id) = 8 then 
      do;
        _i=sum(_i,1);
        pat_id[_i]=_pat_id;
      end;
    call prxnext(_prxid, _start, _stop, str, _pos, _len);
  end;
run;

proc print data=want;
run;

Patrick_0-1706830841443.png

aasdfafafsdfsaf
Calcite | Level 5
Hi Patrick,

Thanks for the reply. It the example text would be more like something "John Smith was given an id and a telephone call. ID 1234-5678" where the term ID is mentioned earlier but later on it would be recorded like showed. I want to extract this number and put it into a singular column. As far as I know do not believe there are multiple ID numbers mentioned. just the term "ID" may be mentioned multiple times before the number is given
Patrick
Opal | Level 21

Hi @aasdfafafsdfsaf 

The regular expression used in both of my answers will only match if an ID is followed by digits. For the pattern to match there must be a blank or colon between ID and the first digit, after the first digit the only characters allowed are a digit, a blank or a dash. 

\bid[: ]+(\d([\d- ]){6,10}\d)

The existence of the word ID followed by some other characters that don't match the pattern won't influence the result.

 

From what you describe now I suggest you use the code from my first answer, add to the data have step any additional lines of text (cases) where things are not working with your current code and test if that also causes issues with what I proposed (which I believe it won't).

aasdfafafsdfsaf
Calcite | Level 5

Would this work in a table setting? The string i need to parse is a column within a table

Patrick
Opal | Level 21

@aasdfafafsdfsaf wrote:

Would this work in a table setting? The string i need to parse is a column within a table


Not sure that I understand. The sample code I've shared parses what's stored in variable str - which would be "a column within a table";

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 17 replies
  • 2754 views
  • 2 likes
  • 6 in conversation