Solved: How to extract 6 different test results from a text field

sas21 · Posted 12-02-2020 03:44 PM

I need to find the most efficient way possible to extract 6 test results that are populated in one text field. The variable is called RESULT_NOTES and the general format of the variable looks something like this (patterns vary a little for each record- just an example of 1, also all of the numbers are made up in this example):

RESULT COMMENT(S): ---------------------------------------------------------------------~ TEST NAME RESULT ALERT UOM PERFORMED AT~---------------------------------------------------------------------~Perfluorobutanesulfonic Acid 10 ng/mL 01~Synonym(s): PFBS~Population reference interval derived from BBC Laboratories~data (n=300) is usually less than 20 ng/mL~(90% CI, <0.0.050 - 0.23 ng/mL) (97.5th percentile)~General U.S. population from CDC-NHANES (2013-2014)~(n=3000) (isomers not described) is typically~below 0.1 ng/mL (95th percentile)~ -----~Perfluoroheptanoic Acid 8 ng/mL 01~Synonym(s): PFHpA~Population reference interval derived from BBC Laboratories~data (n=300) is usually less than 97 ng/mL~(90% CI, 0.25 - 0.73 ng/mL) (97.5th percentile)~General U.S. population from CDC-NHANES (2013-2014)~(n=2168) (isomers not described) is typically~below 0.20 ng/mL (95% CI, 0.10 - 0.20 ng/mL)~(95th percentile)~ -----~Perfluorohexanesulfonic Acid 16 ng/mL 01~Synonym(s): PFHxS~Population reference interval derived from BBC Laboratories~data (n=300) is usually less than 30 ng/mL~(90% CI, 4.1 - 17 ng/mL) (97.5th percentile)~General U.S. population from CDC-NHANES (2015-2016)~(n=1993) (isomers not described) is typically~below 4.9 ng/mL (95% CI, 4.1 - 5.8 ng/mL)~(95th percentile)~ -----~Perfluorooctanoic Acid 20 ng/mL 01~Synonym(s): PFOA; FC-143 Component~Population reference interval derived from~BBC Laboratories data (n=300) is usually less than 4.1 ng/mL~(90% CI, 3.3 - 8.0 ng/mL) (97.5th percentile)~General U.S. population from CDC-NHANES (2015-2016)~(n=1993) for the linear isomer is typically below~8 ng/mL (95% CI,7.6-10.3 ng/mL) (95th percentile)~below 1.9 ng/mL (95% CI, 1.5 - 2.2 ng/mL)~(95th percentile)~ -----~Perfluorononanoic Acid 5 ng/mL 01~Synonym(s): PFNA~Population reference interval derived from BBC Laboratories~data (n=300) is usually less than 1.4 ng/mL~(90% CI, 1.2 - 2.3 ng/mL) (97.5th percentile)~General U.S. population from CDC-NHANES (2015-2016)~(n=1993) (isomers not described) is typically~below 1.9 ng/mL (95% CI, 1.5 - 2.2 ng/mL)~(95th percentile)~ -----~Perfluorooctanesulfonic Acid 12 ng/mL 01~Please be advised that this report contains fictitious~results to be used for interface program validation.~This report does not contain real patient results.~Synonym(s): PFOS~Population reference interval derived from BBC Laboratories~data (n=300) is usually less than 12 ng/mL~(90% CI, 7.7 - 15 ng/mL) (97.5th percentile)~General U.S. population from CDC-NHANES (2015-2016)~(n=1993) for the linear isomer is typically~below 13 ng/mL (95% CI, 10 - 18 ng/mL)~(95th percentile)~This test was developed and its performance characteristics~determined by BBC Laboratories. It has not been cleared or approved~by the US Food and Drug Administration.~ -----~---------------------------------------------------------------------~ ALERT/ABNORMAL FLAG LEGEND:~ L= Below Low Normal, H= Above High Normal, LL= Alert Low~ HH= Alert High, <= Panic Low, >= Panic High, A= Abnormal~---------------------------------------------------------------------

Ideally, I would need to create a variable for each compound, with the numeric value and then a separate variable for the units an example below:

Perfluorobutanesulfonic_Value Perfluorobutanesulfonic_Units etc...…(for each compound bolded above)

10 ng/ml

Thank you for any help or suggestions!

%MCEPASTEBIN%

%MCEPASTEBIN

PGStats · Posted 12-02-2020 05:16 PM

A typical strategy:

data wantList;
set have;
stringId = _n_;
do compound = 
    "Perfluorobutanesulfonic Acid", 
    "Perfluoroheptanoic Acid",
    "Perfluorohexanesulfonic Acid";
    pos = find(text, compound, "it");
    if pos > 0 then do;
        pos = pos + length(compound);
        length var $26; /* 32 - length("_units") */
        var = scan(compound, 1);
        value = input(scan(substr(text, pos), 1, " "), best.);
        units = scan(substr(text, pos), 2, " ");
        output;
        end;
    end;
drop text pos;
run;

proc transpose data=wantList out=wantvalues(drop=_name_) suffix=_value;
by stringId;
var value;
id var;
run;

proc transpose data=wantList out=wantunits(drop=_name_) suffix=_units;
by stringId;
var units;
id var;
run;

data want;
merge wantValues wantUnits;
by stringId;
drop stringId;
run;

You need two transpose steps because values are numeric and units are character.

PG

View solution in original post

Shmuel · Posted 12-02-2020 04:48 PM

May I assume that the prefix Perfluoro appears only at start of the text to be extracted and not in between extracted texts ?

Are All acids and is the extracted text always in a format of "Perfluoro .... Acid <amount> ng/ml" ?

If positive then next code (not tested) may help:


data want;
 set have;
     length text1 $32000 
	        searc_for $9
			out_var1 $30
			out_value  8
			out_unit  $5
		;
	flag=1;
	search_for = 'Perfluoro';
	text1 = substr(result_notes,index(result_notes,search_for));
	do until falg=0;
	   out_var1 = scan(text1,1)|| 'Acid';
	   out_value = input(scan(text1,3),best4.);
	   out_unit = scan(text1,4,' .');
	   output; 
	   keep out_var1 out_value out_unit;
	   
	   ix = index(text1,out_unit) + length(out_unit) +1;
	   text1 = substr(text1,index(result_notes,search_for));
	   if text1 = '' then flag=0;
	end;
run;

SASJedi · Posted 12-02-2020 04:54 PM

Here's a start, anyway. This code assumes your data is in a text file named "remarks.txt" located in the SAS default directory.

data intermediate;
   drop _:;
   retain __rxID 0;
   length _Found $ 32767 Name $50 Units $ 15 Value 8;
   if _n_=1 then do;
      _rxID=prxparse('/-~(\w+\s?\w*\/?\w*\s)+01~/');
      put 'NOTE: ' _rxID=;
   end;
   infile "remarks.txt";
   input;
   _start = 1;
   _stop = length(_infile_);
   call prxnext(_rxID, _start, _stop, _infile_, _position, _length);
      do while (_position > 0);
         _found = substr(_infile_, _position+2, _length-3);
         Units=scan(_found,-2,' ');
         Value=input(scan(_found,-3,' '),10.);
         Name=substr(_found,1,anydigit(_found)-2);
         output;
         call prxnext(_rxID, _start, _stop, _infile_, _position, _length);
      end;
run;

%macro Fixnames;
proc sql noprint;
select distinct
       Name
      ,cats(scan(Name,1),'_Value')
      ,cats(scan(Name,1),'_Units')
  into :Name1- , :val1- , :unit1-
  from intermediate
  order by 1
;
quit;

data want;
  retain 
     %do i=1 %to &sqlobs;
     &&val&i 0 
     &&unit&i '          '
     %end;
  ;
  set intermediate end=last;
  %do i=1 %to &sqlobs;
  if Name="&&name&i" then do;
     &&val&i=value;
     &&unit&i=units;
  end;
  %end;
  if last then output;
run;
%mend;

%fixnames

The program produces this result:

Obs	Perfluorobutanesulfonic_Value	Perfluorobutanesulfonic_Units	Perfluoroheptanoic_Value	Perfluoroheptanoic_Units	Perfluorohexanesulfonic_Value	Perfluorohexanesulfonic_Units	Perfluorononanoic_Value	Perfluorononanoic_Units	Perfluorooctanesulfonic_Value	Perfluorooctanesulfonic_Units	Perfluorooctanoic_Value	Perfluorooctanoic_Units
1	10	ng/mL	8	ng/mL	16	ng/mL	5	ng/mL	12	ng/mL	20	ng/mL

Check out my Jedi SAS Tricks for SAS Users

PGStats · Posted 12-02-2020 05:16 PM

A typical strategy:

data wantList;
set have;
stringId = _n_;
do compound = 
    "Perfluorobutanesulfonic Acid", 
    "Perfluoroheptanoic Acid",
    "Perfluorohexanesulfonic Acid";
    pos = find(text, compound, "it");
    if pos > 0 then do;
        pos = pos + length(compound);
        length var $26; /* 32 - length("_units") */
        var = scan(compound, 1);
        value = input(scan(substr(text, pos), 1, " "), best.);
        units = scan(substr(text, pos), 2, " ");
        output;
        end;
    end;
drop text pos;
run;

proc transpose data=wantList out=wantvalues(drop=_name_) suffix=_value;
by stringId;
var value;
id var;
run;

proc transpose data=wantList out=wantunits(drop=_name_) suffix=_units;
by stringId;
var units;
id var;
run;

data want;
merge wantValues wantUnits;
by stringId;
drop stringId;
run;

You need two transpose steps because values are numeric and units are character.

PG

How to extract 6 different test results from a text field

Re: How to extract 6 different test results from a text field

Re: How to extract 6 different test results from a text field

Re: How to extract 6 different test results from a text field

Re: How to extract 6 different test results from a text field

How to extract 6 different test results from a text field

Re: How to extract 6 different test results from a text field

Re: How to extract 6 different test results from a text field

Re: How to extract 6 different test results from a text field

Re: How to extract 6 different test results from a text field

Registration is open

SAS Training: Just a Click Away