Hi,
I have a table of data with a column of text for medical information/notes. There is a 8 digit identifier that is used for each patient. An example of one patient information could be "John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID: 1234-5678"
I want to be able to pull out all ID information and store it as a new column in the table. however the ID format is not always the same, sometimes there could be a space like "1234 5678" or no space "12345678" or possible hyphens in different spots while always being 8 digits long.
I have been using prxchange but it doesnt seem to be working.
Any help would be greatly appreciated.
Will the identifier (regardless of format) always be the last item in the string, or could it be be found anywhere?
it can be found anywhere unfortunately. there isnt really a unified format for this as everyone that fills in the information may fill it in differently. The only thing i noticed was that there is always an "ID:" beforehand or "ID"
As you look at the data, can you provide a set of rules (in words, not SAS code) that will allow someone to create a SAS program that will extract the desired information?
I suppose it would be something like
1. Look for the ID in any case or ID:
2. look for a sequence of 8 digits after the prxmatch position of id
3. put the 8 digit value into a new column
Below an approach using RegEx.
data have;
infile datalines truncover;
input str $200.;
datalines;
John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID: 1234-5678
John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID 12 345678
ID: 1234-5678 John Smith was called at telephone number (111)-222-3333 he was notified of his update.
ID 12 345678. John Smith was called at telephone number (111)-222-3333 he was notified of his update.
John Smith ID:1234 5678 was called at telephone number. (111)-222-3333 he was notified of his update.
John Smith ID:1234 56789 was called at telephone number. (111)-222-3333 he was notified of his update.
;
data want(drop=_:);
set have;
length _pat_id $9 pat_id $8;
_prxid=prxparse('/\bid[: ]+(\d([\d- ]){6,10}\d)\b/oi');
if prxmatch(_prxid,str) then
do;
call prxposn(_prxid,1,_pos,_len);
_pat_id=compress(substr(str,_pos,_len),,'kd');
if length(_pat_id) = 8 then pat_id=_pat_id;
end;
run;
proc print data=want;
run;
For the RegEx bit {6,10}
: the lower boundary is the number of digits that must exist, the higher boundary includes the additional non-digits that are allowed. If there can be only max two non-digits (dash or blank) then change the upper boundary to 8.
Should there be a possibility that your text contains multiple patient id's then you would need to amend the code by adding logic for call prxnext()
This code handled the different positions and formats in my test data. Note the little cheat: The code creates a string of all digits found after ID, but it is placed in a variable 8 chars long, so it is only the 8 digits from the identifier that ahr kept, and following digits from telephone number etc. are discarded,
* Test data;
data have;
str = "John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID: 1234-5678"; output;
str = "John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID 12 345678 "; output;
str = "ID: 1234-5678 John Smith was called at telephone number (111)-222-3333 he was notified of his update."; output;
str = "ID 12 345678. John Smith was called at telephone number (111)-222-3333 he was notified of his update."; output;
str = "John Smith ID:1234 5678 was called at telephone number. (111)-222-3333 he was notified of his update."; output;
run;
* Data step with intermediate results - easier to maintain;
data want1 (drop=s1 s2);
set have;
Length Identifier $8;
s1 = translate(str,' ','.-:,');
s2 = substr(s1,index(lowcase(s1),'id '));
Identifier = compress(s2,,'kd');
run;
* Proc sql with function calls nested for use as an expression;
* Beware! nested function calls in proc sql have a habit of returning max 200 bytes regardless of input length;
proc sql;
create table want2 as
select
str,
compress(substr(translate(str,' ','.-:,'),index(lowcase(translate(str,' ','.-:,')),'id ')),,'kd') as Identifier length=8
from have;
quit;
Using the data provided by @Patrick , you can:
A "word" below is a string of text delimited by a blank, a colon, a dash, or a period - see the _delim variable.
data have;
infile datalines truncover;
input str $200.;
datalines;
John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID: 1234-5678
John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID 12 345678
ID: 1234-5678 John Smith was called at telephone number (111)-222-3333 he was notified of his update.
ID 12 345678. John Smith was called at telephone number (111)-222-3333 he was notified of his update.
John Smith ID:1234 5678 was called at telephone number. (111)-222-3333 he was notified of his update.
John Smith ID:1234 56789 was called at telephone number. (111)-222-3333 he was notified of his update.
;
data want (drop=_:);
set have;
_delim=' :-.';
_nw=countw(str,_delim);
do _w=1 to _nw until (scan(str,_w,_delim)='ID');
end;
length id $10;
if _w<_nw-1 then id=catx(' ',scan(str,_w+1,_delim),scan(str,_w+2,_delim));
put _n_= _w= id=;
run;
which produces this on the log:
_N_=1 _w=17 id=1234 5678
_N_=2 _w=17 id=12 345678
_N_=3 _w=1 id=1234 5678
_N_=4 _w=1 id=12 345678
_N_=5 _w=3 id=1234 5678
_N_=6 _w=3 id=1234 56789
Thanks for the reply. is there a way we can find a sequence of numbers after the mention of ID: in the text? I'm noticing in some of the text ID is mentioned twice where the 1234-5678 is only mentioned after the second ID
@aasdfafafsdfsaf wrote:
Thanks for the reply. is there a way we can find a sequence of numbers after the mention of ID: in the text? I'm noticing in some of the text ID is mentioned twice where the 1234-5678 is only mentioned after the second ID
If the sample data used in below code doesn't cover all your cases then please provide amended sample data that represents the different patterns and tell us what you'd like as result (like populating multiple variables or creating multiple rows?).
Below code populates additional variables and is based on SAS Docu sample found here.
data have;
infile datalines truncover;
input str $200.;
datalines;
John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID: 1234-5678
John Smith was called at telephone number (111)-222-3333 he was notified of his update. ID 12 345678
ID: 1234-5678 John Smith was called at telephone number (111)-222-3333 he was notified of his update.
ID 12 345678. John Smith was called at telephone number (111)-222-3333 he was notified of his update.
John Smith ID:1234 5678 was called at telephone number. (111)-222-3333 he was notified of his update.
John Smith ID:1234 56789 was called at telephone number. (111)-222-3333 he was notified of his update.
John Smith ID:1234 56789 was called at telephone number. There is also ID: 1234-9999 available
John Smith ID:1234 5678 was called at telephone number. There is also ID: 1234-9999 available
;
data want(drop=_:);
set have;
array pat_id {3} $8;
_prxid=prxparse('/\bid[: ]+(\d([\d- ]){6,10}\d)\b/oi');
length _pat_id $9 _i _start _stop _pos _len 8;
_start=1;
_stop=length(str);
call prxnext(_prxid, _start, _stop, str, _pos, _len);
do while (_pos > 0);
_pat_id = compress(substr(str, _pos, _len),,'kd');
/* only populate pat_id if identified string contains exactly 8 digits */
if length(_pat_id) = 8 then
do;
_i=sum(_i,1);
pat_id[_i]=_pat_id;
end;
call prxnext(_prxid, _start, _stop, str, _pos, _len);
end;
run;
proc print data=want;
run;
The regular expression used in both of my answers will only match if an ID is followed by digits. For the pattern to match there must be a blank or colon between ID and the first digit, after the first digit the only characters allowed are a digit, a blank or a dash.
\bid[: ]+(\d([\d- ]){6,10}\d)
The existence of the word ID followed by some other characters that don't match the pattern won't influence the result.
From what you describe now I suggest you use the code from my first answer, add to the data have step any additional lines of text (cases) where things are not working with your current code and test if that also causes issues with what I proposed (which I believe it won't).
Would this work in a table setting? The string i need to parse is a column within a table
@aasdfafafsdfsaf wrote:
Would this work in a table setting? The string i need to parse is a column within a table
Not sure that I understand. The sample code I've shared parses what's stored in variable str - which would be "a column within a table";
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.