## Extracting keywords and corresponding number

Solved
Regular Contributor
Posts: 182

# Extracting keywords and corresponding number

Hi everyone

i need help solving the following problem, i think Perl extensions would be the best way to do it, but not sure how...

I have a text containing information on different body parts.

The text is written in the following way

text text text text textdimentionstext text text where length, height, width text text text of 30.2mm, 24mm and 55.3mm

Some of the strings will only have one or two of the three keywords (length, height, width):

text text text text textdimentionstext text text where height, width text text text of  24mm and 55.3mm

or

text text text text textdimentionstext text text where length, text text text 55.3mm

So basically:

the code will need to seach for the occurance of any of the three words (or two or one of them, depending on what the sentence contains) and then assigns the numbers consquetivly as they appear in the sentence.

Any help is greatly appreciated

The common features in all strings are:

1. the keywords are always from a list of the three vaiables (length, height, width)

2. The measurement always ends in the two letters mm, they might contain a decimal or not (.)

Accepted Solutions
Solution
‎08-11-2017 09:57 PM
Super User
Posts: 10,770

## Re: Extracting keywords and corresponding number

``````data have;
infile cards truncover;
input x \$200.;
cards;
text text text text textdimentionstext text text where length, height, width text text text of 30.2mm, 24mm and 55.3mm
text text text text textdimentionstext text text where height, width text text text of  24mm and 55.3mm
text text text text textdimentionstext text text where length, text text text 55.3mm
;
run;
data want;
set have;
group+1;
s=1;e=length(x);
pid=prxparse('/\b(length|height|width)\b/oi');
call prxnext(pid,s,e,x,p,l);
do while(p>0);
want=substr(x,p,l);
output;
call prxnext(pid,s,e,x,p,l);
end;

pid=prxparse('/\d+(\.\d+)?mm/oi');
call prxnext(pid,s,e,x,p,l);
do while(p>0);
want=substr(x,p,l);
output;
call prxnext(pid,s,e,x,p,l);
end;

drop pid p l s e;
run;

data k v(rename=(want=_want));
set want;
if anydigit(want) then output v;
else output k;
run;
data final_want;
merge k v;
by group;
drop x;
run;``````

All Replies
Super User
Posts: 9,599

## Re: Extracting keywords and corresponding number

Well, first let me advise you that it is highly recommended NOT to use free text for any kind of analysis.  I can't stress that enough.

That being said, it is technically possible to extract the information, but those rules have to be really robust for anything to work:

```data want (drop=i tmp ind:);
infile datalines dlm="~";
length line tmp \$2000;
input line \$;
array loc{3} \$200;
array dim{3} \$200;
ind1=1;
ind2=1;
do i=1 to countw(line," ,");
tmp=scan(line,i," ,");
if tmp in ("length","height","width") then do;
loc{ind1}=tmp;
ind1=ind1+1;
end;
if index(tmp,"mm")>0 then do;
dim{ind2}=tmp;
ind2=ind2+1;
end;
end;
datalines;
aabg thkd length, height, width as abe ddd 24mm and 55.3mm,34mm
trryurturur aswwrqr length, height possioe asad 12mm wert 34mm
;
run;```

Again, really not advised.

Solution
‎08-11-2017 09:57 PM
Super User
Posts: 10,770

## Re: Extracting keywords and corresponding number

``````data have;
infile cards truncover;
input x \$200.;
cards;
text text text text textdimentionstext text text where length, height, width text text text of 30.2mm, 24mm and 55.3mm
text text text text textdimentionstext text text where height, width text text text of  24mm and 55.3mm
text text text text textdimentionstext text text where length, text text text 55.3mm
;
run;
data want;
set have;
group+1;
s=1;e=length(x);
pid=prxparse('/\b(length|height|width)\b/oi');
call prxnext(pid,s,e,x,p,l);
do while(p>0);
want=substr(x,p,l);
output;
call prxnext(pid,s,e,x,p,l);
end;

pid=prxparse('/\d+(\.\d+)?mm/oi');
call prxnext(pid,s,e,x,p,l);
do while(p>0);
want=substr(x,p,l);
output;
call prxnext(pid,s,e,x,p,l);
end;

drop pid p l s e;
run;

data k v(rename=(want=_want));
set want;
if anydigit(want) then output v;
else output k;
run;
data final_want;
merge k v;
by group;
drop x;
run;``````
Regular Contributor
Posts: 182

## Re: Extracting keywords and corresponding number

Thank you everyone

@Ksharp while the code ran perfectly well i ran into a problems:

1. The code extracted that last number only, i correct this by assing s=1 after the first do-end loop

another question, if the text file lióoks like this:

text text text text textdimentionstext text text where length 22.3mm, height 12mm, width 53mm text text text

is there a way to specifically link the number to the keyword just befre it? The reason I am asking is that there might be another keyword and a number after it in the middle, and I am not interested in these other keywords :

text text text text textdimentionstext text text where length 22.3mm, height 12mm, thickness 2.1mmwidth 53mm text text text

In the above example i am not interested in thickness

Kind regards

Super User
Posts: 10,770

## Re: Extracting keywords and corresponding number

If you know all the keyword , you can specify it in

``pid=prxparse('/\b(length|height|width|thickness)\b/oi');``

Otherwise, you need write some code to find out all these keywords.

Regular Contributor
Posts: 182

## Re: Extracting keywords and corresponding number

Thanks Ksharp

To explain further: I know all the keywords....

The way the code is made, is that the second part of the code simply extracts the series of numbers in textas they occure.

So in the following example, the first part of the code identifies the three keywords which I am interested in, which is very adequate:

``pid=prxparse('/\b(length|height|width|thickness)\b/oi');``

on

text text text text textdimentionstext text text where length 22.3mm, height 12mm, thickness 2.1mmwidth 53mm text text text

will return

length, height, width ( I am not interested in thickness and I dont want it to be extracted)

BUT

The second part of the code will pull out 4 series of numbers:

`` pid=prxparse('/\d+(\.\d+)?mm/oi');``

will return

22.3mm, 12mm, 2.1mm, 53mm

Now this will cause a missmatch when the data is merged. That is why I am trying to link the number directly to the keyword before it, if that can be done then 2.1mm will not be extracted

Kind regards

Super User
Posts: 10,770

## Re: Extracting keywords and corresponding number

Yeah. Specify all the keywords in

``pid=prxparse('/\b(length|height|width|thickness)\b/oi');``

that would avoid such kind of error.

Or use PG's code. If any one of V or Value is missing, that means there is a problem in that record.

Posts: 5,523

## Re: Extracting keywords and corresponding number

Building on @Ksharp's excellent solution, I would simplify, refine and complete as:

``````data have;
infile cards truncover;
input x \$200.;
cards;
text text text text textdimentionstext text text where length, height, width text text text of 30.2mm, 24mm and 55.3mm
text text text text textdimentionstext text text where height, width text text text of  24mm and 55.3mm
text text text text textdimentionstext text text where length, text text text 55.3mm
;

data k(keep=group i var) v(keep=group i value) ;
set have;
group+1;
s=1;e=length(x);
pid=prxparse('/\b(length|height|width)\b/oi');
call prxnext(pid,s,e,x,p,l);
do i = 1 by 1 while(p>0);
var=substr(x,p,l);
output k;
call prxnext(pid,s,e,x,p,l);
end;

pid=prxparse('/\d+(\.\d*)?\s?mm/oi');
call prxnext(pid,s,e,x,p,l);
do i = 1 by 1 while(p>0);
value=input(compress(substr(x,p,l),"mM"),best.);
output v;
call prxnext(pid,s,e,x,p,l);
end;

run;

data list;
merge k v;
by group i;
drop i;
run;

proc transpose data=list out=table(drop=_name_);
by group;
id var;
var value;
run;``````
PG
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
• 7 replies
• 644 views
• 10 likes
• 4 in conversation