I need to find the most efficient way possible to extract 6 test results that are populated in one text field. The variable is called RESULT_NOTES and the general format of the variable looks something like this (patterns vary a little for each record- just an example of 1, also all of the numbers are made up in this example):
RESULT COMMENT(S): ---------------------------------------------------------------------~ TEST NAME RESULT ALERT UOM PERFORMED AT~---------------------------------------------------------------------~Perfluorobutanesulfonic Acid 10 ng/mL 01~Synonym(s): PFBS~Population reference interval derived from BBC Laboratories~data (n=300) is usually less than 20 ng/mL~(90% CI, <0.0.050 - 0.23 ng/mL) (97.5th percentile)~General U.S. population from CDC-NHANES (2013-2014)~(n=3000) (isomers not described) is typically~below 0.1 ng/mL (95th percentile)~ -----~Perfluoroheptanoic Acid 8 ng/mL 01~Synonym(s): PFHpA~Population reference interval derived from BBC Laboratories~data (n=300) is usually less than 97 ng/mL~(90% CI, 0.25 - 0.73 ng/mL) (97.5th percentile)~General U.S. population from CDC-NHANES (2013-2014)~(n=2168) (isomers not described) is typically~below 0.20 ng/mL (95% CI, 0.10 - 0.20 ng/mL)~(95th percentile)~ -----~Perfluorohexanesulfonic Acid 16 ng/mL 01~Synonym(s): PFHxS~Population reference interval derived from BBC Laboratories~data (n=300) is usually less than 30 ng/mL~(90% CI, 4.1 - 17 ng/mL) (97.5th percentile)~General U.S. population from CDC-NHANES (2015-2016)~(n=1993) (isomers not described) is typically~below 4.9 ng/mL (95% CI, 4.1 - 5.8 ng/mL)~(95th percentile)~ -----~Perfluorooctanoic Acid 20 ng/mL 01~Synonym(s): PFOA; FC-143 Component~Population reference interval derived from~BBC Laboratories data (n=300) is usually less than 4.1 ng/mL~(90% CI, 3.3 - 8.0 ng/mL) (97.5th percentile)~General U.S. population from CDC-NHANES (2015-2016)~(n=1993) for the linear isomer is typically below~8 ng/mL (95% CI,7.6-10.3 ng/mL) (95th percentile)~below 1.9 ng/mL (95% CI, 1.5 - 2.2 ng/mL)~(95th percentile)~ -----~Perfluorononanoic Acid 5 ng/mL 01~Synonym(s): PFNA~Population reference interval derived from BBC Laboratories~data (n=300) is usually less than 1.4 ng/mL~(90% CI, 1.2 - 2.3 ng/mL) (97.5th percentile)~General U.S. population from CDC-NHANES (2015-2016)~(n=1993) (isomers not described) is typically~below 1.9 ng/mL (95% CI, 1.5 - 2.2 ng/mL)~(95th percentile)~ -----~Perfluorooctanesulfonic Acid 12 ng/mL 01~Please be advised that this report contains fictitious~results to be used for interface program validation.~This report does not contain real patient results.~Synonym(s): PFOS~Population reference interval derived from BBC Laboratories~data (n=300) is usually less than 12 ng/mL~(90% CI, 7.7 - 15 ng/mL) (97.5th percentile)~General U.S. population from CDC-NHANES (2015-2016)~(n=1993) for the linear isomer is typically~below 13 ng/mL (95% CI, 10 - 18 ng/mL)~(95th percentile)~This test was developed and its performance characteristics~determined by BBC Laboratories. It has not been cleared or approved~by the US Food and Drug Administration.~ -----~---------------------------------------------------------------------~ ALERT/ABNORMAL FLAG LEGEND:~ L= Below Low Normal, H= Above High Normal, LL= Alert Low~ HH= Alert High, <= Panic Low, >= Panic High, A= Abnormal~---------------------------------------------------------------------
Ideally, I would need to create a variable for each compound, with the numeric value and then a separate variable for the units an example below:
Perfluorobutanesulfonic_Value Perfluorobutanesulfonic_Units etc...…(for each compound bolded above)
10 ng/ml
Thank you for any help or suggestions!
A typical strategy:
data wantList;
set have;
stringId = _n_;
do compound =
"Perfluorobutanesulfonic Acid",
"Perfluoroheptanoic Acid",
"Perfluorohexanesulfonic Acid";
pos = find(text, compound, "it");
if pos > 0 then do;
pos = pos + length(compound);
length var $26; /* 32 - length("_units") */
var = scan(compound, 1);
value = input(scan(substr(text, pos), 1, " "), best.);
units = scan(substr(text, pos), 2, " ");
output;
end;
end;
drop text pos;
run;
proc transpose data=wantList out=wantvalues(drop=_name_) suffix=_value;
by stringId;
var value;
id var;
run;
proc transpose data=wantList out=wantunits(drop=_name_) suffix=_units;
by stringId;
var units;
id var;
run;
data want;
merge wantValues wantUnits;
by stringId;
drop stringId;
run;
You need two transpose steps because values are numeric and units are character.
May I assume that the prefix Perfluoro appears only at start of the text to be extracted and not in between extracted texts ?
Are All acids and is the extracted text always in a format of "Perfluoro .... Acid <amount> ng/ml" ?
If positive then next code (not tested) may help:
data want;
set have;
length text1 $32000
searc_for $9
out_var1 $30
out_value 8
out_unit $5
;
flag=1;
search_for = 'Perfluoro';
text1 = substr(result_notes,index(result_notes,search_for));
do until falg=0;
out_var1 = scan(text1,1)|| 'Acid';
out_value = input(scan(text1,3),best4.);
out_unit = scan(text1,4,' .');
output;
keep out_var1 out_value out_unit;
ix = index(text1,out_unit) + length(out_unit) +1;
text1 = substr(text1,index(result_notes,search_for));
if text1 = '' then flag=0;
end;
run;
Here's a start, anyway. This code assumes your data is in a text file named "remarks.txt" located in the SAS default directory.
data intermediate;
drop _:;
retain __rxID 0;
length _Found $ 32767 Name $50 Units $ 15 Value 8;
if _n_=1 then do;
_rxID=prxparse('/-~(\w+\s?\w*\/?\w*\s)+01~/');
put 'NOTE: ' _rxID=;
end;
infile "remarks.txt";
input;
_start = 1;
_stop = length(_infile_);
call prxnext(_rxID, _start, _stop, _infile_, _position, _length);
do while (_position > 0);
_found = substr(_infile_, _position+2, _length-3);
Units=scan(_found,-2,' ');
Value=input(scan(_found,-3,' '),10.);
Name=substr(_found,1,anydigit(_found)-2);
output;
call prxnext(_rxID, _start, _stop, _infile_, _position, _length);
end;
run;
%macro Fixnames;
proc sql noprint;
select distinct
Name
,cats(scan(Name,1),'_Value')
,cats(scan(Name,1),'_Units')
into :Name1- , :val1- , :unit1-
from intermediate
order by 1
;
quit;
data want;
retain
%do i=1 %to &sqlobs;
&&val&i 0
&&unit&i ' '
%end;
;
set intermediate end=last;
%do i=1 %to &sqlobs;
if Name="&&name&i" then do;
&&val&i=value;
&&unit&i=units;
end;
%end;
if last then output;
run;
%mend;
%fixnames
The program produces this result:
Obs | Perfluorobutanesulfonic_Value | Perfluorobutanesulfonic_Units | Perfluoroheptanoic_Value | Perfluoroheptanoic_Units | Perfluorohexanesulfonic_Value | Perfluorohexanesulfonic_Units | Perfluorononanoic_Value | Perfluorononanoic_Units | Perfluorooctanesulfonic_Value | Perfluorooctanesulfonic_Units | Perfluorooctanoic_Value | Perfluorooctanoic_Units |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 10 | ng/mL | 8 | ng/mL | 16 | ng/mL | 5 | ng/mL | 12 | ng/mL | 20 | ng/mL |
A typical strategy:
data wantList;
set have;
stringId = _n_;
do compound =
"Perfluorobutanesulfonic Acid",
"Perfluoroheptanoic Acid",
"Perfluorohexanesulfonic Acid";
pos = find(text, compound, "it");
if pos > 0 then do;
pos = pos + length(compound);
length var $26; /* 32 - length("_units") */
var = scan(compound, 1);
value = input(scan(substr(text, pos), 1, " "), best.);
units = scan(substr(text, pos), 2, " ");
output;
end;
end;
drop text pos;
run;
proc transpose data=wantList out=wantvalues(drop=_name_) suffix=_value;
by stringId;
var value;
id var;
run;
proc transpose data=wantList out=wantunits(drop=_name_) suffix=_units;
by stringId;
var units;
id var;
run;
data want;
merge wantValues wantUnits;
by stringId;
drop stringId;
run;
You need two transpose steps because values are numeric and units are character.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.