Extracting keywords and corresponding number

Frequent Contributor
Posts: 135
Extracting keywords and corresponding number

Hi everyone

Imagine this string in a dataset as follows:

 

"Heart: Normal heart muscle colon (maximal wall thickness = 15mm). Normal aorta (maximal wall thickness = 1.6mm).

Lung: Normal lung (maximal wall thickness = 1.9mm), however movement is absent from the distal part.

Other: Reactive lymphadenopathy is seen . No complications of disease were noted."

 

What I want is to extract data as follows:

 

 

Heart Muscle   15mm
Aorta          1.6mm
Lung           1.9mm

 

So basically, the code will need to break the paragraph into sentences, then find the sentences that contain one of the keywords (heart muscle, aorta, lung) and extract the corresponding number from the same sentence.

Any help appreciated

 



All Replies
Esteemed Advisor
Posts: 7,229

Re: Extracting keywords and corresponding number

Whilst it may be technically possible to search the string, find things, and then extract further information, I really wouldn't recommend it. It's one of the reasons free text in databases is frowned upon: you could have anything. In this instance, as it is medical data, I would at the very minimum have a medic review the free text and provide their expert opinion on what should be extracted from it. Someone could write anything in that free text. What would you do with this, for example:

"Heart: Normal heart muscle colon (maximal wall thickness = 15mm). Normal aorta (maximal wall thickness = 1.6mm) moving to abnormal aorta (maximal wall thickness = 1.5mm)"

Frequent Contributor
Posts: 135

Re: Extracting keywords and corresponding number

Thank you RW9

I fully understand the limitation. However, this is the standard way the data was entered, so I felt comfortable using SAS as a first step. There will be further manual reviews of the results to make sure we are getting what we need.
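
One way to support that manual review is to flag records in which a keyword appears more than once, as in the "abnormal aorta" example above. The following is only a rough sketch: the dataset name review_flags, the counters n_heart, n_aorta, n_lung and the needs_review flag are hypothetical names, and it assumes that a keyword immediately followed by ":" is a section header (such as "Lung:") and should not be counted.

```sas
data review_flags;
  length x $500 keyword $20;
  /* The edge case quoted earlier in the thread. */
  x = "Heart: Normal heart muscle colon (maximal wall thickness = 15mm). Normal aorta (maximal wall thickness = 1.6mm) moving to abnormal aorta (maximal wall thickness = 1.5mm)";

  /* Assumption: a keyword directly followed by ":" is a header, so skip it. */
  re = prxparse('/\b(heart muscle|aorta|lung)\b(?!:)/i');
  n_heart = 0; n_aorta = 0; n_lung = 0;

  start_pos = 1;
  stop_pos  = length(x);
  call prxnext(re, start_pos, stop_pos, x, pos, len);
  do while (pos > 0);
    /* Count each keyword occurrence, ignoring case. */
    keyword = lowcase(substr(x, pos, len));
    select (keyword);
      when ('heart muscle') n_heart = n_heart + 1;
      when ('aorta')        n_aorta = n_aorta + 1;
      when ('lung')         n_lung  = n_lung + 1;
      otherwise;
    end;
    call prxnext(re, start_pos, stop_pos, x, pos, len);
  end;

  /* Hypothetical flag: 1 when any keyword occurs more than once. */
  needs_review = (n_heart > 1 or n_aorta > 1 or n_lung > 1);
  keep n_heart n_aorta n_lung needs_review;
run;
```

Records with needs_review = 1 could then be routed to a reviewer rather than trusted automatically.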

Grand Advisor
Posts: 9,584

Re: Extracting keywords and corresponding number

Here is something that could give you a start.

 

data have;
x="Heart: Normal heart muscle colon (maximal wall thickness = 15mm). Normal aorta (maximal wall thickness = 1.6mm).
Lung: Normal lung (maximal wall thickness = 1.9mm), however movement is absent from the distal part.
Other: Reactive lymphadenopathy is seen . No complications of disease were noted.";

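/* Splitting on parentheses gives alternating chunks: the text before each "(" and the text inside it. */
do i=1 to countw(x,'()') by 2;
 /* Label: the last piece of the preceding text after any "." or ":". */
 x1=strip(scan(scan(x,i,'()'),-1,'.:'));
 /* Value: whatever follows the last "=" inside the parentheses. */
 x2=strip(scan(scan(x,i+1,'()'),-1,'='));
 if not missing(x1) and not missing(x2) then output;
end;
keep x1 x2;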
run;
Frequent Contributor
Posts: 135

Re: Extracting keywords and corresponding number

Thank you Ksharp
Almost there with your solution.
Assuming that I want to rely on the occurrence of the words "heart muscle", "aorta", or "lung" and extract the number from their sentences, could you please advise on how to do that?

I.e., I don't want to rely on ":" or "=".

Solution
‎06-24-2017 07:30 AM
Grand Advisor
Posts: 9,584

Re: Extracting keywords and corresponding number

OK, if you like Perl regular expressions:

 

data have;
x="Heart: Normal heart muscle colon (maximal wall thickness = 15mm). Normal aorta (maximal wall thickness = 1.6mm).
Lung: Normal lung (maximal wall thickness = 1.9mm), however movement is absent from the distal part.
Other: Reactive lymphadenopathy is seen . No complications of disease were noted.";


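/* pid finds each keyword followed by its parenthesised measurement. */
pid=prxparse('/(heart muscle|aorta|lung)[\w\s]+\([^\(\)]+\)/i');
/* pid1 captures the keyword (buffer 1) and the number with its unit (buffer 2). */
pid1=prxparse('/(heart muscle|aorta|lung).+\s([\d\.]+mm)/i');
s=1;
e=length(x);
call prxnext(pid,s,e,x,p,l);
do while(p>0);
 want=substr(x,p,l);
 if prxmatch(pid1,want) then do;
  call prxposn(pid1,1,p1,l1);
  x1=substr(want,p1,l1);
  call prxposn(pid1,2,p2,l2);
  x2=substr(want,p2,l2);
  /* Output only on a match, so stale x1/x2 values are never written. */
  output;
 end;
 call prxnext(pid,s,e,x,p,l);
end;

drop pid s e p l pid1 p1 l1 p2 l2;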
run;
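
For readers who would rather split the text into sentences first, as the original question describes, a rough sketch might look like the following. It is not part of the accepted answer. The names want, tmp, sentence, keyword, value, kw_re and num_re are made up for illustration, and it assumes that a sentence ends at a period followed by whitespace (or at the end of the text), so the decimal point in "1.6mm" is left alone. It also assumes the text never contains the "|" character, which is used as a temporary sentence delimiter.

```sas
data want;
  length x tmp $500 sentence $200 keyword $20 value $12;
  /* Same sample text as in the thread, joined with single spaces. */
  x = catx(' ',
    "Heart: Normal heart muscle colon (maximal wall thickness = 15mm). Normal aorta (maximal wall thickness = 1.6mm).",
    "Lung: Normal lung (maximal wall thickness = 1.9mm), however movement is absent from the distal part.",
    "Other: Reactive lymphadenopathy is seen . No complications of disease were noted.");

  /* Replace each sentence-ending period with "|" (assumed not to occur in the text). */
  tmp = prxchange('s/\.(\s+|$)/|/', -1, trim(x));

  kw_re  = prxparse('/\b(heart muscle|aorta|lung)\b/i');
  num_re = prxparse('/(\d+(?:\.\d+)?\s*mm)\b/i');

  do i = 1 to countw(tmp, '|');
    sentence = scan(tmp, i, '|');
    /* Keep only sentences that contain both a keyword and a measurement. */
    if prxmatch(kw_re, sentence) and prxmatch(num_re, sentence) then do;
      keyword = lowcase(prxposn(kw_re, 1, sentence));
      value   = prxposn(num_re, 1, sentence);
      output;
    end;
  end;
  keep keyword value;
run;
```

This takes only the first keyword and the first measurement in each sentence, so a sentence that mentions the same structure twice would still need manual review.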
Frequent Contributor
Posts: 135

Re: Extracting keywords and corresponding number

Excellent work Ksharp, as usual
Thanks
Super Contributor
Posts: 275

Re: Extracting keywords and corresponding number

 

data have;
string="Heart: Normal heart muscle colon (maximal wall thickness = 15mm). Normal aorta (maximal wall thickness = 1.6mm).
Lung: Normal lung (maximal wall thickness = 1.9mm), however movement is absent from the distal part.
Other: Reactive lymphadenopathy is seen . No complications of disease were noted.";
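/* A number (optionally with a decimal part) followed by its unit, e.g. 15mm or 1.6mm. */
id1=prxparse('/\d+(\.\d+)?\w+/');
/* One to three words between "Normal " and the opening parenthesis. */
id2=prxparse('/(?<=Normal )((\S+ ){1,3})(?=\()/');
start1=1;
start2=1;
end=length(string);
call prxnext(id2,start2,end,string,position2,length2);
call prxnext(id1,start1,end,string,position1,length1);
/* Names and numbers are paired by order of appearance; stop when either runs out. */
do while(position1>0 and position2>0);
 Name=substr(string,position2,length2);
 Number=substr(string,position1,length1);
 output;
 call prxnext(id2,start2,end,string,position2,length2);
 call prxnext(id1,start1,end,string,position1,length1);
end;
keep name number;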
run;
proc print;run;
