ammarhm
Lapis Lazuli | Level 10

Hi everyone

Imagine this string in a dataset, as follows:

 

"Heart: Normal heart muscle colon (maximal wall thickness = 15mm). Normal aorta (maximal wall thickness = 1.6mm).

Lung: Normal lung (maximal wall thickness = 1.9mm), however movement is absent from the distal part.

Other: Reactive lymphadenopathy is seen . No complications of disease were noted."

 

What I want is to extract data as follows:

Heart muscle   15mm
Aorta          1.6mm
Lung           1.9mm

So basically, the code will need to break the paragraph into sentences, find the sentences that contain "heart muscle", "aorta", or "lung", and extract the corresponding number from the same sentence.

Any help appreciated

 

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

OK, if you like Perl regular expressions:

data have;
x="Heart: Normal heart muscle colon (maximal wall thickness = 15mm). Normal aorta (maximal wall thickness = 1.6mm).
Lung: Normal lung (maximal wall thickness = 1.9mm), however movement is absent from the distal part.
Other: Reactive lymphadenopathy is seen . No complications of disease were noted.";

/* pid finds each keyword together with the parenthesised text that follows it */
pid=prxparse('/(heart muscle|aorta|lung)[\w\s]+\([^\(\)]+\)/i');
/* pid1 captures the keyword (buffer 1) and the measurement (buffer 2) */
pid1=prxparse('/(heart muscle|aorta|lung).+\s([\d\.]+mm)/i');
s=1;
e=length(x);
call prxnext(pid,s,e,x,p,l);
do while(p>0);
 want=substr(x,p,l);
 if prxmatch(pid1,want) then do;
  call prxposn(pid1,1,p1,l1);
  x1=substr(want,p1,l1);
  call prxposn(pid1,2,p2,l2);
  x2=substr(want,p2,l2);
  output; /* write a row only when a measurement was found, so no stale x1/x2 */
 end;
 call prxnext(pid,s,e,x,p,l);
end;

drop pid s e p l pid1 p1 l1 p2 l2;
run;
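
The question asks for the number to come from the same sentence as the keyword. A minimal single-pass sketch of that idea follows. It is not the code above: it assumes that a period cannot occur between the keyword and the number, so `[^.]*?` stands in for "same sentence". The names `want_single_pass`, `organ`, `thickness_mm`, `start_pos` and `stop_pos` are made up for illustration.

```sas
data want_single_pass;
  length x $400 organ $20;
  /* Sample text from the question, joined with single blanks. */
  x = catx(' ',
      "Heart: Normal heart muscle colon (maximal wall thickness = 15mm).",
      "Normal aorta (maximal wall thickness = 1.6mm).",
      "Lung: Normal lung (maximal wall thickness = 1.9mm), however movement is absent from the distal part.",
      "Other: Reactive lymphadenopathy is seen . No complications of disease were noted.");
  /* Buffer 1: keyword. Buffer 2: number directly before "mm".          */
  /* [^.]*? refuses to cross a period (assumed to be the sentence end). */
  re = prxparse('/(heart muscle|aorta|lung)[^.]*?(\d+(?:\.\d+)?)\s*mm/i');
  start_pos = 1;
  stop_pos = length(x);
  call prxnext(re, start_pos, stop_pos, x, pos, len);
  do while (pos > 0);
    /* PRXPOSN returns the capture buffers of the match just found. */
    organ = lowcase(prxposn(re, 1, x));
    thickness_mm = input(prxposn(re, 2, x), best32.);
    output;
    call prxnext(re, start_pos, stop_pos, x, pos, len);
  end;
  keep organ thickness_mm;
run;
```

On the sample text this should give one row per keyword, with the thickness as a numeric variable. Any text with a period between the keyword and the number (an abbreviation, or an earlier decimal) would be skipped, so results still need checking.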


7 REPLIES
RW9
Diamond | Level 26

Whilst it may be technically possible to search the string, find things, and then extract further information, I really wouldn't recommend it. It's one of the reasons free text in databases is frowned upon: it could contain anything. In this instance, as it is medical data, I would at the very minimum have a medic review the free text and give their expert opinion on what should be extracted from it. Someone could write anything in that free text. What would you do with:

"Heart: Normal heart muscle colon (maximal wall thickness = 15mm). Normal aorta (maximal wall thickness = 1.6mm) moving to abnormal aorta (maximal wall thickness = 1.5mm)"

ammarhm
Lapis Lazuli | Level 10

Thank you RW9

I fully understand the limitation. However, this is the standard way the data was entered, so I felt comfortable using SAS as a first step. There will be further manual reviews of the results to make sure we are getting what we need.
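
One way to feed that manual review is to flag every keyword that ends up with more than one extracted value, as in RW9's two-aorta example. This is only a rough sketch: the rows in `extracted` are typed in by hand to mimic what a keyword-based extraction would be expected to return for that sentence, and the names `extracted`, `organ`, `value`, `needs_review` and `n_values` are made up.

```sas
/* Hypothetical rows standing in for the extraction result on RW9's example. */
data extracted;
  length organ $20 value $12;
  infile datalines dlm='|';
  input organ $ value $;
  datalines;
heart muscle|15mm
aorta|1.6mm
aorta|1.5mm
;
run;

/* List every keyword that has more than one extracted value. */
proc sql;
  create table needs_review as
  select organ, count(*) as n_values
  from extracted
  group by organ
  having count(*) > 1;
quit;
```

With these rows, `needs_review` would hold only aorta, with a count of 2. A real check would probably group by record as well as by keyword.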

Ksharp
Super User

Here is something to give you a start.

 

data have;
x="Heart: Normal heart muscle colon (maximal wall thickness = 15mm). Normal aorta (maximal wall thickness = 1.6mm).
Lung: Normal lung (maximal wall thickness = 1.9mm), however movement is absent from the distal part.
Other: Reactive lymphadenopathy is seen . No complications of disease were noted.";

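/* '(' and ')' split x into pieces: odd pieces end with the label, even pieces hold the measurement */
do i=1 to countw(x,'()') by 2;
 x1=scan(scan(x,i,'()'),-1,'.:');   /* text after the last '.' or ':' */
 x2=scan(scan(x,i+1,'()'),-1,'=');  /* text after the '=' */
 if not missing(x1) and not missing(x2) then output;
end;
keep x1 x2;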
run;
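
The x1 and x2 values from this step are still raw text: x1 starts with "Normal" and x2 carries the unit. A small cleanup sketch, assuming rows shaped like that output, is below. The input rows are typed in by hand here, and `cleaned`, `organ` and `value_mm` are made-up names.

```sas
data cleaned;
  length x1 $40 x2 $12 organ $40;
  infile datalines dlm='|';
  input x1 $ x2 $;
  /* Drop a leading "Normal" and any surrounding blanks from the label. */
  organ = strip(prxchange('s/^\s*Normal\s+//i', 1, x1));
  /* Keep only digits and the decimal point, then convert to numeric. */
  value_mm = input(compress(x2, '.', 'kd'), best32.);
  datalines;
Normal heart muscle colon|15mm
Normal aorta|1.6mm
Normal lung|1.9mm
;
run;
```

The COMPRESS call keeps every digit and period in x2, so a value with extra punctuation (for example "approx. 1.6mm") would not convert cleanly and would need its own handling.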
ammarhm
Lapis Lazuli | Level 10

Thank you Ksharp
Almost there with your solution.
Assuming that I want to rely on the occurrence of the words "heart muscle", "aorta", or "lung" and extract the number from their sentences, could you please advise on how to do that?

I.e., I don't want to rely on using ":" or "=".

ammarhm
Lapis Lazuli | Level 10
Excellent work Ksharp, as usual
Thanks
slchen
Lapis Lazuli | Level 10

 

data have;
string="Heart: Normal heart muscle colon (maximal wall thickness = 15mm). Normal aorta (maximal wall thickness = 1.6mm).
Lung: Normal lung (maximal wall thickness = 1.9mm), however movement is absent from the distal part.
Other: Reactive lymphadenopathy is seen . No complications of disease were noted.";
/* id1: a measurement such as 15mm or 1.6mm (decimal point escaped, single digits allowed) */
id1=prxparse('/\d+(\.\d+)?[a-z]+/i');
/* id2: one to three words between "Normal " and the opening parenthesis */
id2=prxparse('/(?<=Normal )((\S+ ){1,3})(?=\()/');
start1=1;
start2=1;
end=length(string);
call prxnext(id2,start2,end,string,position2,length2);
call prxnext(id1,start1,end,string,position1,length1);
/* the n-th name is paired with the n-th number, so stop as soon as either pattern runs out */
do while(position1>0 and position2>0);
 Name=substr(string,position2,length2);
 Number=substr(string,position1,length1);
 output;
 call prxnext(id2,start2,end,string,position2,length2);
 call prxnext(id1,start1,end,string,position1,length1);
end;
keep name number;
run;
proc print;run;
