Hi everyone
i need help solving the following problem, i think Perl extensions would be the best way to do it, but not sure how...
I have a text containing information on different body parts.
The text is written in the following way
text text text text textdimentionstext text text where length, height, width text text text of 30.2mm, 24mm and 55.3mm
Some of the strings will only have one or two of the three keywords (length, height, width):
text text text text textdimentionstext text text where height, width text text text of 24mm and 55.3mm
or
text text text text textdimentionstext text text where length, text text text 55.3mm
So basically:
the code will need to seach for the occurance of any of the three words (or two or one of them, depending on what the sentence contains) and then assigns the numbers consquetivly as they appear in the sentence.
Any help is greatly appreciated
The common features in all strings are:
1. the keywords are always from a list of the three vaiables (length, height, width)
2. The measurement always ends in the two letters mm, they might contain a decimal or not (.)
data have;
infile cards truncover;
input x $200.;
cards;
text text text text textdimentionstext text text where length, height, width text text text of 30.2mm, 24mm and 55.3mm
text text text text textdimentionstext text text where height, width text text text of 24mm and 55.3mm
text text text text textdimentionstext text text where length, text text text 55.3mm
;
run;
data want;
set have;
group+1;
s=1;e=length(x);
pid=prxparse('/\b(length|height|width)\b/oi');
call prxnext(pid,s,e,x,p,l);
do while(p>0);
want=substr(x,p,l);
output;
call prxnext(pid,s,e,x,p,l);
end;
pid=prxparse('/\d+(\.\d+)?mm/oi');
call prxnext(pid,s,e,x,p,l);
do while(p>0);
want=substr(x,p,l);
output;
call prxnext(pid,s,e,x,p,l);
end;
drop pid p l s e;
run;
data k v(rename=(want=_want));
set want;
if anydigit(want) then output v;
else output k;
run;
data final_want;
merge k v;
by group;
drop x;
run;
Well, first let me advise you that it is highly recommended NOT to use free text for any kind of analysis. I can't stress that enough.
That being said, it is technically possible to extract the information, but those rules have to be really robust for anything to work:
data want (drop=i tmp ind:); infile datalines dlm="~"; length line tmp $2000; input line $; array loc{3} $200; array dim{3} $200; ind1=1; ind2=1; do i=1 to countw(line," ,"); tmp=scan(line,i," ,"); if tmp in ("length","height","width") then do; loc{ind1}=tmp; ind1=ind1+1; end; if index(tmp,"mm")>0 then do; dim{ind2}=tmp; ind2=ind2+1; end; end; datalines; aabg thkd length, height, width as abe ddd 24mm and 55.3mm,34mm trryurturur aswwrqr length, height possioe asad 12mm wert 34mm ; run;
Again, really not advised.
data have;
infile cards truncover;
input x $200.;
cards;
text text text text textdimentionstext text text where length, height, width text text text of 30.2mm, 24mm and 55.3mm
text text text text textdimentionstext text text where height, width text text text of 24mm and 55.3mm
text text text text textdimentionstext text text where length, text text text 55.3mm
;
run;
data want;
set have;
group+1;
s=1;e=length(x);
pid=prxparse('/\b(length|height|width)\b/oi');
call prxnext(pid,s,e,x,p,l);
do while(p>0);
want=substr(x,p,l);
output;
call prxnext(pid,s,e,x,p,l);
end;
pid=prxparse('/\d+(\.\d+)?mm/oi');
call prxnext(pid,s,e,x,p,l);
do while(p>0);
want=substr(x,p,l);
output;
call prxnext(pid,s,e,x,p,l);
end;
drop pid p l s e;
run;
data k v(rename=(want=_want));
set want;
if anydigit(want) then output v;
else output k;
run;
data final_want;
merge k v;
by group;
drop x;
run;
Thank you everyone
@Ksharp while the code ran perfectly well i ran into a problems:
1. The code extracted that last number only, i correct this by assing s=1 after the first do-end loop
another question, if the text file lióoks like this:
text text text text textdimentionstext text text where length 22.3mm, height 12mm, width 53mm text text text
is there a way to specifically link the number to the keyword just befre it? The reason I am asking is that there might be another keyword and a number after it in the middle, and I am not interested in these other keywords :
text text text text textdimentionstext text text where length 22.3mm, height 12mm, thickness 2.1mm, width 53mm text text text
In the above example i am not interested in thickness
Kind regards
If you know all the keyword , you can specify it in
pid=prxparse('/\b(length|height|width|thickness)\b/oi');
Otherwise, you need write some code to find out all these keywords.
Thanks Ksharp
To explain further: I know all the keywords....
The way the code is made, is that the second part of the code simply extracts the series of numbers in textas they occure.
So in the following example, the first part of the code identifies the three keywords which I am interested in, which is very adequate:
pid=prxparse('/\b(length|height|width|thickness)\b/oi');
on
text text text text textdimentionstext text text where length 22.3mm, height 12mm, thickness 2.1mm, width 53mm text text text
will return
length, height, width ( I am not interested in thickness and I dont want it to be extracted)
BUT
The second part of the code will pull out 4 series of numbers:
pid=prxparse('/\d+(\.\d+)?mm/oi');
will return
22.3mm, 12mm, 2.1mm, 53mm
Now this will cause a missmatch when the data is merged. That is why I am trying to link the number directly to the keyword before it, if that can be done then 2.1mm will not be extracted
Kind regards
Yeah. Specify all the keywords in
pid=prxparse('/\b(length|height|width|thickness)\b/oi');
that would avoid such kind of error.
Or use PG's code. If any one of V or Value is missing, that means there is a problem in that record.
Building on @Ksharp's excellent solution, I would simplify, refine and complete as:
data have;
infile cards truncover;
input x $200.;
cards;
text text text text textdimentionstext text text where length, height, width text text text of 30.2mm, 24mm and 55.3mm
text text text text textdimentionstext text text where height, width text text text of 24mm and 55.3mm
text text text text textdimentionstext text text where length, text text text 55.3mm
;
data k(keep=group i var) v(keep=group i value) ;
set have;
group+1;
s=1;e=length(x);
pid=prxparse('/\b(length|height|width)\b/oi');
call prxnext(pid,s,e,x,p,l);
do i = 1 by 1 while(p>0);
var=substr(x,p,l);
output k;
call prxnext(pid,s,e,x,p,l);
end;
pid=prxparse('/\d+(\.\d*)?\s?mm/oi');
call prxnext(pid,s,e,x,p,l);
do i = 1 by 1 while(p>0);
value=input(compress(substr(x,p,l),"mM"),best.);
output v;
call prxnext(pid,s,e,x,p,l);
end;
run;
data list;
merge k v;
by group i;
drop i;
run;
proc transpose data=list out=table(drop=_name_);
by group;
id var;
var value;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.