I need to find the most efficient way possible to extract 6 test results that are populated in one text field. The variable is called RESULT_NOTES and the general format of the variable looks something like this (patterns vary a little for each record- just an example of 1, also all of the numbers are made up in this example):
RESULT COMMENT(S): ---------------------------------------------------------------------~ TEST NAME RESULT ALERT UOM PERFORMED AT~---------------------------------------------------------------------~Perfluorobutanesulfonic Acid 10 ng/mL 01~Synonym(s): PFBS~Population reference interval derived from BBC Laboratories~data (n=300) is usually less than 20 ng/mL~(90% CI, <0.0.050 - 0.23 ng/mL) (97.5th percentile)~General U.S. population from CDC-NHANES (2013-2014)~(n=3000) (isomers not described) is typically~below 0.1 ng/mL (95th percentile)~ -----~Perfluoroheptanoic Acid 8 ng/mL 01~Synonym(s): PFHpA~Population reference interval derived from BBC Laboratories~data (n=300) is usually less than 97 ng/mL~(90% CI, 0.25 - 0.73 ng/mL) (97.5th percentile)~General U.S. population from CDC-NHANES (2013-2014)~(n=2168) (isomers not described) is typically~below 0.20 ng/mL (95% CI, 0.10 - 0.20 ng/mL)~(95th percentile)~ -----~Perfluorohexanesulfonic Acid 16 ng/mL 01~Synonym(s): PFHxS~Population reference interval derived from BBC Laboratories~data (n=300) is usually less than 30 ng/mL~(90% CI, 4.1 - 17 ng/mL) (97.5th percentile)~General U.S. population from CDC-NHANES (2015-2016)~(n=1993) (isomers not described) is typically~below 4.9 ng/mL (95% CI, 4.1 - 5.8 ng/mL)~(95th percentile)~ -----~Perfluorooctanoic Acid 20 ng/mL 01~Synonym(s): PFOA; FC-143 Component~Population reference interval derived from~BBC Laboratories data (n=300) is usually less than 4.1 ng/mL~(90% CI, 3.3 - 8.0 ng/mL) (97.5th percentile)~General U.S. population from CDC-NHANES (2015-2016)~(n=1993) for the linear isomer is typically below~8 ng/mL (95% CI,7.6-10.3 ng/mL) (95th percentile)~below 1.9 ng/mL (95% CI, 1.5 - 2.2 ng/mL)~(95th percentile)~ -----~Perfluorononanoic Acid 5 ng/mL 01~Synonym(s): PFNA~Population reference interval derived from BBC Laboratories~data (n=300) is usually less than 1.4 ng/mL~(90% CI, 1.2 - 2.3 ng/mL) (97.5th percentile)~General U.S. population from CDC-NHANES (2015-2016)~(n=1993) (isomers not described) is typically~below 1.9 ng/mL (95% CI, 1.5 - 2.2 ng/mL)~(95th percentile)~ -----~Perfluorooctanesulfonic Acid 12 ng/mL 01~Please be advised that this report contains fictitious~results to be used for interface program validation.~This report does not contain real patient results.~Synonym(s): PFOS~Population reference interval derived from BBC Laboratories~data (n=300) is usually less than 12 ng/mL~(90% CI, 7.7 - 15 ng/mL) (97.5th percentile)~General U.S. population from CDC-NHANES (2015-2016)~(n=1993) for the linear isomer is typically~below 13 ng/mL (95% CI, 10 - 18 ng/mL)~(95th percentile)~This test was developed and its performance characteristics~determined by BBC Laboratories. It has not been cleared or approved~by the US Food and Drug Administration.~ -----~---------------------------------------------------------------------~ ALERT/ABNORMAL FLAG LEGEND:~ L= Below Low Normal, H= Above High Normal, LL= Alert Low~ HH= Alert High, <= Panic Low, >= Panic High, A= Abnormal~---------------------------------------------------------------------
Ideally, I would need to create a variable for each compound, with the numeric value and then a separate variable for the units an example below:
Perfluorobutanesulfonic_Value Perfluorobutanesulfonic_Units etc...…(for each compound bolded above)
10 ng/ml
Thank you for any help or suggestions!
A typical strategy:
data wantList;
set have;
stringId = _n_;
do compound =
"Perfluorobutanesulfonic Acid",
"Perfluoroheptanoic Acid",
"Perfluorohexanesulfonic Acid";
pos = find(text, compound, "it");
if pos > 0 then do;
pos = pos + length(compound);
length var $26; /* 32 - length("_units") */
var = scan(compound, 1);
value = input(scan(substr(text, pos), 1, " "), best.);
units = scan(substr(text, pos), 2, " ");
output;
end;
end;
drop text pos;
run;
proc transpose data=wantList out=wantvalues(drop=_name_) suffix=_value;
by stringId;
var value;
id var;
run;
proc transpose data=wantList out=wantunits(drop=_name_) suffix=_units;
by stringId;
var units;
id var;
run;
data want;
merge wantValues wantUnits;
by stringId;
drop stringId;
run;
You need two transpose steps because values are numeric and units are character.
May I assume that the prefix Perfluoro appears only at start of the text to be extracted and not in between extracted texts ?
Are All acids and is the extracted text always in a format of "Perfluoro .... Acid <amount> ng/ml" ?
If positive then next code (not tested) may help:
data want;
set have;
length text1 $32000
searc_for $9
out_var1 $30
out_value 8
out_unit $5
;
flag=1;
search_for = 'Perfluoro';
text1 = substr(result_notes,index(result_notes,search_for));
do until falg=0;
out_var1 = scan(text1,1)|| 'Acid';
out_value = input(scan(text1,3),best4.);
out_unit = scan(text1,4,' .');
output;
keep out_var1 out_value out_unit;
ix = index(text1,out_unit) + length(out_unit) +1;
text1 = substr(text1,index(result_notes,search_for));
if text1 = '' then flag=0;
end;
run;
Here's a start, anyway. This code assumes your data is in a text file named "remarks.txt" located in the SAS default directory.
data intermediate;
drop _:;
retain __rxID 0;
length _Found $ 32767 Name $50 Units $ 15 Value 8;
if _n_=1 then do;
_rxID=prxparse('/-~(\w+\s?\w*\/?\w*\s)+01~/');
put 'NOTE: ' _rxID=;
end;
infile "remarks.txt";
input;
_start = 1;
_stop = length(_infile_);
call prxnext(_rxID, _start, _stop, _infile_, _position, _length);
do while (_position > 0);
_found = substr(_infile_, _position+2, _length-3);
Units=scan(_found,-2,' ');
Value=input(scan(_found,-3,' '),10.);
Name=substr(_found,1,anydigit(_found)-2);
output;
call prxnext(_rxID, _start, _stop, _infile_, _position, _length);
end;
run;
%macro Fixnames;
proc sql noprint;
select distinct
Name
,cats(scan(Name,1),'_Value')
,cats(scan(Name,1),'_Units')
into :Name1- , :val1- , :unit1-
from intermediate
order by 1
;
quit;
data want;
retain
%do i=1 %to &sqlobs;
&&val&i 0
&&unit&i ' '
%end;
;
set intermediate end=last;
%do i=1 %to &sqlobs;
if Name="&&name&i" then do;
&&val&i=value;
&&unit&i=units;
end;
%end;
if last then output;
run;
%mend;
%fixnames
The program produces this result:
Obs | Perfluorobutanesulfonic_Value | Perfluorobutanesulfonic_Units | Perfluoroheptanoic_Value | Perfluoroheptanoic_Units | Perfluorohexanesulfonic_Value | Perfluorohexanesulfonic_Units | Perfluorononanoic_Value | Perfluorononanoic_Units | Perfluorooctanesulfonic_Value | Perfluorooctanesulfonic_Units | Perfluorooctanoic_Value | Perfluorooctanoic_Units |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 10 | ng/mL | 8 | ng/mL | 16 | ng/mL | 5 | ng/mL | 12 | ng/mL | 20 | ng/mL |
A typical strategy:
data wantList;
set have;
stringId = _n_;
do compound =
"Perfluorobutanesulfonic Acid",
"Perfluoroheptanoic Acid",
"Perfluorohexanesulfonic Acid";
pos = find(text, compound, "it");
if pos > 0 then do;
pos = pos + length(compound);
length var $26; /* 32 - length("_units") */
var = scan(compound, 1);
value = input(scan(substr(text, pos), 1, " "), best.);
units = scan(substr(text, pos), 2, " ");
output;
end;
end;
drop text pos;
run;
proc transpose data=wantList out=wantvalues(drop=_name_) suffix=_value;
by stringId;
var value;
id var;
run;
proc transpose data=wantList out=wantunits(drop=_name_) suffix=_units;
by stringId;
var units;
id var;
run;
data want;
merge wantValues wantUnits;
by stringId;
drop stringId;
run;
You need two transpose steps because values are numeric and units are character.
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.