Re: How do i parse specific number combinations from a field in a tabl...

aasdfafafsdfsaf · Posted 01-18-2024 10:56 AM

Hi,

I have a table of data with a column of text for medical information/notes. There is a 8 digit identifier that is used for each patient. An example of one patient information could be "John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID: 1234-5678"
I want to be able to pull out all ID information and store it as a new column in the table. however the ID format is not always the same, sometimes there could be a space like "1234 5678" or no space "12345678" or possible hyphens in different spots while always being 8 digits long.

I have been using prxchange but it doesnt seem to be working.

Any help would be greatly appreciated.

ErikLund_Jensen · Posted 01-18-2024 11:17 AM

Hi @aasdfafafsdfsaf

Will the identifier (regardless of format) always be the last item in the string, or could it be be found anywhere?

aasdfafafsdfsaf · Posted 01-18-2024 11:19 AM

it can be found anywhere unfortunately. there isnt really a unified format for this as everyone that fills in the information may fill it in differently. The only thing i noticed was that there is always an "ID:" beforehand or "ID"

PaigeMiller · Posted 01-18-2024 11:27 AM

As you look at the data, can you provide a set of rules (in words, not SAS code) that will allow someone to create a SAS program that will extract the desired information?

--
Paige Miller

aasdfafafsdfsaf · Posted 01-18-2024 12:00 PM

I suppose it would be something like

1. Look for the ID in any case or ID:
2. look for a sequence of 8 digits after the prxmatch position of id
3. put the 8 digit value into a new column

Patrick · Posted 01-18-2024 04:23 PM

Below an approach using RegEx.

data have;
  infile datalines truncover;
  input str $200.;
  datalines;
John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID: 1234-5678
John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID 12 345678 
ID: 1234-5678 John Smith was called at telephone number (111)-222-3333 he was notified of his update.
ID 12 345678. John Smith was called at telephone number (111)-222-3333 he was notified of his update.
John Smith ID:1234 5678 was called at telephone number. (111)-222-3333 he was notified of his update.
John Smith ID:1234 56789 was called at telephone number. (111)-222-3333 he was notified of his update.
;

data want(drop=_:);
  set have;
  length _pat_id $9 pat_id $8;
  _prxid=prxparse('/\bid[: ]+(\d([\d- ]){6,10}\d)\b/oi');
  if prxmatch(_prxid,str) then
    do;
      call prxposn(_prxid,1,_pos,_len);
      _pat_id=compress(substr(str,_pos,_len),,'kd');
      if length(_pat_id) = 8 then pat_id=_pat_id;
    end;
run;

proc print data=want;
run;

For the RegEx bit {6,10}: the lower boundary is the number of digits that must exist, the higher boundary includes the additional non-digits that are allowed. If there can be only max two non-digits (dash or blank) then change the upper boundary to 8.

Should there be a possibility that your text contains multiple patient id's then you would need to amend the code by adding logic for call prxnext()

ErikLund_Jensen · Posted 01-18-2024 12:04 PM

Hi @aasdfafafsdfsaf

This code handled the different positions and formats in my test data. Note the little cheat: The code creates a string of all digits found after ID, but it is placed in a variable 8 chars long, so it is only the 8 digits from the identifier that ahr kept, and following digits from telephone number etc. are discarded,

* Test data;
data have;
  str = "John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID: 1234-5678"; output;
  str = "John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID 12 345678 "; output;
  str = "ID: 1234-5678 John Smith was called at telephone number (111)-222-3333 he was notified of his update."; output;
  str = "ID 12 345678. John Smith was called at telephone number (111)-222-3333 he was notified of his update."; output;
  str = "John Smith ID:1234 5678 was called at telephone number. (111)-222-3333 he was notified of his update."; output;
run;  

* Data step with intermediate results - easier to maintain;
data want1 (drop=s1 s2);
  set have;
  Length Identifier $8;
  s1 = translate(str,'    ','.-:,');
  s2 = substr(s1,index(lowcase(s1),'id '));
  Identifier = compress(s2,,'kd');
run;

* Proc sql with function calls nested for use as an expression;
* Beware! nested function calls in proc sql have a habit of returning max 200 bytes regardless of input length;
proc sql;
  create table want2 as
    select 
      str, 
      compress(substr(translate(str,'    ','.-:,'),index(lowcase(translate(str,'    ','.-:,')),'id ')),,'kd') as Identifier length=8
    from have;
quit;

aasdfafafsdfsaf · Posted 02-01-2024 11:30 AM

Thanks for the reply. I noticed another issue in some of the fields. Is there a quick work around if there are two instances of ID? if it was mentioned earlier in the text and then later on with ID: 1234 5678

mkeintz · Posted 01-18-2024 05:08 PM

Using the data provided by @Patrick , you can:

Find the word "ID"
Concatenate the next two words into the ID variable.

A "word" below is a string of text delimited by a blank, a colon, a dash, or a period - see the _delim variable.

data have;
  infile datalines truncover;
  input str $200.;
  datalines;
John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID: 1234-5678
John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID 12 345678 
ID: 1234-5678 John Smith was called at telephone number (111)-222-3333 he was notified of his update.
ID 12 345678. John Smith was called at telephone number (111)-222-3333 he was notified of his update.
John Smith ID:1234 5678 was called at telephone number. (111)-222-3333 he was notified of his update.
John Smith ID:1234 56789 was called at telephone number. (111)-222-3333 he was notified of his update.
;
data want (drop=_:);
  set have;

  _delim=' :-.';
  _nw=countw(str,_delim);
  do _w=1 to _nw until (scan(str,_w,_delim)='ID');
  end;
  length id $10;
  if _w<_nw-1 then id=catx(' ',scan(str,_w+1,_delim),scan(str,_w+2,_delim));
  put _n_= _w= id=;
run;

which produces this on the log:

_N_=1 _w=17 id=1234 5678
_N_=2 _w=17 id=12 345678
_N_=3 _w=1 id=1234 5678
_N_=4 _w=1 id=12 345678
_N_=5 _w=3 id=1234 5678
_N_=6 _w=3 id=1234 56789

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

aasdfafafsdfsaf · Posted 02-01-2024 11:58 AM

Thanks for the reply. is there a way we can find a sequence of numbers after the mention of ID: in the text? I'm noticing in some of the text ID is mentioned twice where the 1234-5678 is only mentioned after the second ID

Patrick · Posted 02-01-2024 06:06 PM

@aasdfafafsdfsaf wrote:

Thanks for the reply. is there a way we can find a sequence of numbers after the mention of ID: in the text? I'm noticing in some of the text ID is mentioned twice where the 1234-5678 is only mentioned after the second ID

If the sample data used in below code doesn't cover all your cases then please provide amended sample data that represents the different patterns and tell us what you'd like as result (like populating multiple variables or creating multiple rows?).

Below code populates additional variables and is based on SAS Docu sample found here.

data have;
  infile datalines truncover;
  input str $200.;
  datalines;
John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID: 1234-5678
John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID 12 345678 
ID: 1234-5678 John Smith was called at telephone number (111)-222-3333 he was notified of his update.
ID 12 345678. John Smith was called at telephone number (111)-222-3333 he was notified of his update.
John Smith ID:1234 5678 was called at telephone number. (111)-222-3333 he was notified of his update.
John Smith ID:1234 56789 was called at telephone number. (111)-222-3333 he was notified of his update.
John Smith ID:1234 56789 was called at telephone number. There is also ID: 1234-9999 available
John Smith ID:1234 5678 was called at telephone number. There is also ID: 1234-9999 available
;

data want(drop=_:);
  set have;
  array pat_id {3} $8;
  _prxid=prxparse('/\bid[: ]+(\d([\d- ]){6,10}\d)\b/oi');

  length _pat_id $9 _i _start _stop _pos _len 8;
  _start=1;
  _stop=length(str);
  call prxnext(_prxid, _start, _stop, str, _pos, _len);
  do while (_pos > 0);
    _pat_id = compress(substr(str, _pos, _len),,'kd');
    /* only populate pat_id if identified string contains exactly 8 digits */
    if length(_pat_id) = 8 then 
      do;
        _i=sum(_i,1);
        pat_id[_i]=_pat_id;
      end;
    call prxnext(_prxid, _start, _stop, str, _pos, _len);
  end;
run;

proc print data=want;
run;

aasdfafafsdfsaf · Posted 02-02-2024 12:15 PM

Hi Patrick,

Thanks for the reply. It the example text would be more like something "John Smith was given an id and a telephone call. ID 1234-5678" where the term ID is mentioned earlier but later on it would be recorded like showed. I want to extract this number and put it into a singular column. As far as I know do not believe there are multiple ID numbers mentioned. just the term "ID" may be mentioned multiple times before the number is given

Patrick · Posted 02-02-2024 09:20 PM

Hi @aasdfafafsdfsaf

The regular expression used in both of my answers will only match if an ID is followed by digits. For the pattern to match there must be a blank or colon between ID and the first digit, after the first digit the only characters allowed are a digit, a blank or a dash.

\bid[: ]+(\d([\d- ]){6,10}\d)

The existence of the word ID followed by some other characters that don't match the pattern won't influence the result.

From what you describe now I suggest you use the code from my first answer, add to the data have step any additional lines of text (cases) where things are not working with your current code and test if that also causes issues with what I proposed (which I believe it won't).

aasdfafafsdfsaf · Posted 02-13-2024 10:12 AM

Would this work in a table setting? The string i need to parse is a column within a table

Patrick · Posted 02-14-2024 02:19 AM

@aasdfafafsdfsaf wrote:

Would this work in a table setting? The string i need to parse is a column within a table

Not sure that I understand. The sample code I've shared parses what's stored in variable str - which would be "a column within a table";

How do i parse specific number combinations from a field in a table?

Re: How do i parse specific number combinations from a field in a table?

Re: How do i parse specific number combinations from a field in a table?

Re: How do i parse specific number combinations from a field in a table?

Re: How do i parse specific number combinations from a field in a table?

Re: How do i parse specific number combinations from a field in a table?

Re: How do i parse specific number combinations from a field in a table?

Re: How do i parse specific number combinations from a field in a table?

Re: How do i parse specific number combinations from a field in a table?

Re: How do i parse specific number combinations from a field in a table?

Re: How do i parse specific number combinations from a field in a table?

Re: How do i parse specific number combinations from a field in a table?

Re: How do i parse specific number combinations from a field in a table?

Re: How do i parse specific number combinations from a field in a table?

Re: How do i parse specific number combinations from a field in a table?

Classroom Training Available!