Hi everyone
Imagine this string in a dataset, as follows:
"Heart: Normal heart muscle colon (maximal wall thickness = 15mm). Normal aorta (maximal wall thickness = 1.6mm).
Lung: Normal lung (maximal wall thickness = 1.9mm), however movement is absent from the distal part.
Other: Reactive lymphadenopathy is seen . No complications of disease were noted."
What I want is to extract the data as follows:
Heart Muscle | 15mm
Aorta | 1.6mm
Lung | 1.9mm
So basically, the code needs to break the paragraph into sentences, find the sentences that contain "heart muscle", "aorta", or "lung", and extract the corresponding number from the same sentence.
Any help appreciated
Whilst it may be technically possible to search the string, find things, and then extract further information, I really wouldn't recommend it. It's one of the reasons free text in databases is frowned upon: you could have anything. In this instance, as it is medical data, I would at the very minimum have a medic review the free text and provide their expert opinion on what should be extracted from it. Someone could write anything in that free text. What would you do with:
"Heart: Normal heart muscle colon (maximal wall thickness = 15mm). Normal aorta (maximal wall thickness = 1.6mm) moving to abnormal aorta (maximal wall thickness = 1.5mm)"
Thank you RW9
I fully understand the limitation. However, this is the standard way the data was entered, so I felt comfortable using SAS as a first step. There will be further manual reviews of the results to make sure we are getting what we need.
Here is something that could give you a start.
data have;
  x="Heart: Normal heart muscle colon (maximal wall thickness = 15mm). Normal aorta (maximal wall thickness = 1.6mm).
Lung: Normal lung (maximal wall thickness = 1.9mm), however movement is absent from the distal part.
Other: Reactive lymphadenopathy is seen . No complications of disease were noted.";
  /* Splitting on "(" and ")" gives alternating pieces: the text before a parenthesis, then the text inside it */
  do i=1 to countw(x,'()') by 2;
    x1=scan(scan(x,i,'()'),-1,'.:');   /* label: last chunk after a "." or ":" */
    x2=scan(scan(x,i+1,'()'),-1,'=');  /* value: text after the "=" inside the parentheses */
    if not missing(x1) and not missing(x2) then output;
  end;
  keep x1 x2;
run;
Thank you Ksharp
Almost there with your solution.
Assuming that I want to rely on the occurrence of the words "heart muscle", "aorta", or "lung" and extract the number in their sentences, could you please advise on how to do that?
I.e., I don't want to rely on ":" or "=".
OK, if you like Perl regular expressions:
data have;
  x="Heart: Normal heart muscle colon (maximal wall thickness = 15mm). Normal aorta (maximal wall thickness = 1.6mm).
Lung: Normal lung (maximal wall thickness = 1.9mm), however movement is absent from the distal part.
Other: Reactive lymphadenopathy is seen . No complications of disease were noted.";
  /* pid: a keyword, optional words after it, then one parenthesised group */
  pid=prxparse('/(heart muscle|aorta|lung)[\w\s]+\([^\(\)]+\)/i');
  /* pid1: within that fragment, capture the keyword (buffer 1) and the measurement in mm (buffer 2) */
  pid1=prxparse('/(heart muscle|aorta|lung).+\s([\d\.]+mm)/i');
  s=1;
  e=length(x);
  call prxnext(pid,s,e,x,p,l);
  do while(p>0);
    want=substr(x,p,l);   /* the matched fragment, e.g. aorta (maximal wall thickness = 1.6mm) */
    if prxmatch(pid1,want) then do;
      call prxposn(pid1,1,p1,l1);
      x1=substr(want,p1,l1);
      call prxposn(pid1,2,p2,l2);
      x2=substr(want,p2,l2);
    end;
    output;
    call prxnext(pid,s,e,x,p,l);   /* move on to the next fragment */
  end;
  drop pid s e p l pid1 p1 l1 p2 l2;
run;
data have;
  string="Heart: Normal heart muscle colon (maximal wall thickness = 15mm). Normal aorta (maximal wall thickness = 1.6mm).
Lung: Normal lung (maximal wall thickness = 1.9mm), however movement is absent from the distal part.
Other: Reactive lymphadenopathy is seen . No complications of disease were noted.";
  /* id1: a number with an optional decimal part plus its unit, e.g. 15mm or 1.6mm */
  id1=prxparse('/\d+(?:\.\d+)?\w+/');
  /* id2: one to three words between "Normal " and the opening parenthesis */
  id2=prxparse('/(?<=Normal )((\S+ ){1,3})(?=\()/');
  start1=1;
  start2=1;
  end=length(string);
  call prxnext(id2,start2,end,string,position2,length2);
  call prxnext(id1,start1,end,string,position1,length1);
  /* Names and numbers are paired purely by order of appearance; stop when either runs out */
  do while(position1>0 and position2>0);
    Name=substr(string,position2,length2);
    Number=substr(string,position1,length1);
    output;
    call prxnext(id2,start2,end,string,position2,length2);
    call prxnext(id1,start1,end,string,position1,length1);
  end;
  keep name number;
run;
proc print;run;
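For completeness, here is a rough sketch of the sentence-by-sentence approach described in the original question: split the text into sentences first, then look for a keyword and a measurement inside each one. It is only an illustration under some assumptions: a sentence is taken to end at a full stop followed by whitespace (so the decimal point in 1.6mm does not split anything), the text never contains a "|" character, and the variable names (marked, sentence, organ, value) are made up for this example.

```sas
data want;
  length marked $500 sentence $200 organ $20 value $12;
  x = "Heart: Normal heart muscle colon (maximal wall thickness = 15mm). "
   || "Normal aorta (maximal wall thickness = 1.6mm). "
   || "Lung: Normal lung (maximal wall thickness = 1.9mm), however movement is absent from the distal part. "
   || "Other: Reactive lymphadenopathy is seen . "
   || "No complications of disease were noted.";

  /* Assumption: a full stop followed by whitespace ends a sentence. */
  /* Replace every such boundary with "|" so SCAN can walk the sentences. */
  marked = prxchange('s/\.\s+/|/', -1, x);

  /* Buffer 1: the keyword. Buffer 2: the first number followed by mm after it. */
  re = prxparse('/(heart muscle|aorta|lung)\b.*?(\d+(?:\.\d+)?\s*mm)/i');

  do i = 1 to countw(marked, '|');
    sentence = scan(marked, i, '|');
    if prxmatch(re, sentence) then do;
      organ = lowcase(prxposn(re, 1, sentence));
      value = prxposn(re, 2, sentence);
      output;
    end;
  end;

  keep organ value sentence;
run;
```

Note that this sketch takes only the first keyword and measurement in each sentence, so a sentence such as RW9's example with two aorta measurements would still need the manual review mentioned above.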