Extracting keywords and corresponding number

Regular Contributor
Posts: 156

Hi everyone,

I need help solving the following problem. I think Perl regular expressions would be the best way to do it, but I am not sure how.

I have a text containing information on different body parts.

The text is written in the following way:

text text text text textdimentionstext text text where length, height, width text text text of 30.2mm, 24mm and 55.3mm

Some of the strings will only have one or two of the three keywords (length, height, width):

text text text text textdimentionstext text text where height, width text text text of  24mm and 55.3mm

or

text text text text textdimentionstext text text where length, text text text 55.3mm

So basically:

the code needs to search for any occurrence of the three words (or two or one of them, depending on what the sentence contains) and then assign the numbers consecutively, in the order they appear in the sentence.

Any help is greatly appreciated.

The common features in all strings are:

1. The keywords always come from a list of three variables (length, height, width).

2. Each measurement always ends in the two letters mm and may or may not contain a decimal point (.).



All Replies
Super User
Posts: 7,413

Re: Extracting keywords and corresponding number

Well, first let me advise you that it is highly recommended NOT to use free text for any kind of analysis.  I can't stress that enough.  

That being said, it is technically possible to extract the information, but those rules have to be really robust for anything to work:

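data want (drop=i tmp ind:);
  /* Assumes "~" never occurs in the text, so each line is read whole */
  infile datalines dlm="~";
  length line tmp $2000;
  input line $;
  array loc{3} $200;   /* keywords, in order of appearance */
  array dim{3} $200;   /* measurements, in order of appearance */
  ind1=1;
  ind2=1;
  /* Walk through the words, splitting on blanks and commas */
  do i=1 to countw(line," ,");
    tmp=scan(line,i," ,");
    /* Case-insensitive keyword test; ind1<=3 keeps the subscript in range */
    if lowcase(tmp) in ("length","height","width") and ind1<=3 then do;
      loc{ind1}=tmp;
      ind1=ind1+1;
    end;
    /* Any word containing "mm" is treated as a measurement */
    if index(tmp,"mm")>0 and ind2<=3 then do;
      dim{ind2}=tmp;
      ind2=ind2+1;
    end;
  end;
datalines;
aabg thkd length, height, width as abe ddd 24mm and 55.3mm,34mm
trryurturur aswwrqr length, height possioe asad 12mm wert 34mm
;
run;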

Again, really not advised.

Solution
2 weeks ago
Super User
Posts: 9,687

Re: Extracting keywords and corresponding number

data have;
infile cards truncover;
input x $200.;
cards;
text text text text textdimentionstext text text where length, height, width text text text of 30.2mm, 24mm and 55.3mm
text text text text textdimentionstext text text where height, width text text text of  24mm and 55.3mm
text text text text textdimentionstext text text where length, text text text 55.3mm
;
run;
data want;
 set have;
 group+1;   /* one group number per input line */
 s=1;e=length(x);
 /* Pass 1: output each keyword in the order it appears */
 pid=prxparse('/\b(length|height|width)\b/oi');
 call prxnext(pid,s,e,x,p,l);
 do while(p>0);
  want=substr(x,p,l);
  output;
  call prxnext(pid,s,e,x,p,l);
 end;

 /* Pass 2: output each measurement such as 30.2mm or 24mm */
 pid=prxparse('/\d+(\.\d+)?mm/oi');
 call prxnext(pid,s,e,x,p,l);
 do while(p>0);
  want=substr(x,p,l);
  output;
  call prxnext(pid,s,e,x,p,l);
 end;

 drop pid p l s e;
 run;

 /* Split the rows: values containing a digit go to v, keywords go to k */
 data k v(rename=(want=_want));
  set want;
  if anydigit(want) then output v;
   else output k;
run;
/* Pair keywords and measurements row by row within each group */
data final_want;
 merge k v;
 by group;
 drop x;
run;
Regular Contributor
Posts: 156

Re: Extracting keywords and corresponding number

Thank you everyone

@Ksharp, while the code ran perfectly well, I ran into one problem:

1. The code extracted only the last number. I corrected this by assigning s=1 after the first do-end loop.

Another question: if the text looks like this:

text text text text textdimentionstext text text where length 22.3mm, height 12mm, width 53mm text text text

is there a way to specifically link each number to the keyword just before it? The reason I am asking is that there might be another keyword followed by a number in the middle, and I am not interested in these other keywords:

text text text text textdimentionstext text text where length 22.3mm, height 12mm, thickness 2.1mm, width 53mm text text text

In the above example I am not interested in thickness.

Kind regards

Super User
Posts: 9,687

Re: Extracting keywords and corresponding number

If you know all the keywords, you can specify them in

pid=prxparse('/\b(length|height|width|thickness)\b/oi');

Otherwise, you need to write some code to find all these keywords.
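
As a rough sketch of what such discovery code might look like: the step below captures whatever word sits directly in front of each measurement and tabulates the results. It assumes the "keyword number mm" layout (for example "height 12mm"); the data set name found and the variable name keyword are made up for illustration.

```sas
/* Small sample in the "keyword number mm" layout (illustrative data) */
data have;
  infile cards truncover;
  input x $200.;
cards;
text text where length 22.3mm, height 12mm, thickness 2.1mm, width 53mm text
text text where height 24mm and width 55.3mm text
;

data found(keep=keyword);
  set have;
  length keyword $40;
  retain pid;
  /* Capture group 1 = the word immediately before a number followed by mm */
  if _n_ = 1 then pid = prxparse('/\b([a-z]+)\s+\d+(?:\.\d+)?\s?mm/i');
  s = 1;
  e = length(x);
  call prxnext(pid, s, e, x, p, l);
  do while (p > 0);
    keyword = lowcase(prxposn(pid, 1, x));
    output;
    call prxnext(pid, s, e, x, p, l);
  end;
run;

/* List the distinct candidate keywords, most frequent first */
proc freq data=found order=freq;
  tables keyword / nocum;
run;
```

In text where the numbers are listed separately from the keywords (such as "of 30.2mm"), this would pick up filler words like "of", so the resulting list still needs to be reviewed by hand.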

Regular Contributor
Posts: 156

Re: Extracting keywords and corresponding number

Thanks Ksharp

To explain further: I know all the keywords.

The way the code is written, the second part simply extracts the series of numbers in the text as they occur.

So in the following example, the first part of the code identifies the three keywords I am interested in, which is exactly what I need:

pid=prxparse('/\b(length|height|width)\b/oi');

on

text text text text textdimentionstext text text where length 22.3mm, height 12mm, thickness 2.1mm, width 53mm text text text

will return

length, height, width (I am not interested in thickness and I don't want it to be extracted)

BUT

The second part of the code will pull out four numbers:

 pid=prxparse('/\d+(\.\d+)?mm/oi');

will return

 22.3mm, 12mm, 2.1mm, 53mm

This will cause a mismatch when the data are merged. That is why I am trying to link each number directly to the keyword before it; if that can be done, then 2.1mm will not be extracted.

Kind regards

Super User
Posts: 9,687

Re: Extracting keywords and corresponding number

Yes. Specify all the keywords in

pid=prxparse('/\b(length|height|width|thickness)\b/oi');

which would avoid that kind of error.

Or use PG's code. If either var or value is missing, that means there is a problem in that record.
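
Another option for tying each number to the keyword directly before it is to match the keyword and its number with a single pattern and read both from capture groups. The following is only a sketch: it assumes the number follows its keyword separated by nothing but whitespace (as in "length 22.3mm"), and the data set names pairs and table are illustrative.

```sas
/* Sample in the "keyword number mm" layout, including an unwanted keyword */
data have;
  infile cards truncover;
  input x $200.;
cards;
text text where length 22.3mm, height 12mm, thickness 2.1mm, width 53mm text
text text where height 24mm and width 55.3mm text
;

data pairs(keep=group var value);
  set have;
  length var $10;
  retain pid;
  /* Group 1 = wanted keyword, group 2 = the number right after it */
  if _n_ = 1 then
    pid = prxparse('/\b(length|height|width)\s+(\d+(?:\.\d+)?)\s?mm/i');
  group + 1;
  s = 1;
  e = length(x);
  call prxnext(pid, s, e, x, p, l);
  do while (p > 0);
    var   = lowcase(prxposn(pid, 1, x));
    value = input(prxposn(pid, 2, x), best32.);
    output;
    call prxnext(pid, s, e, x, p, l);
  end;
run;

/* One row per input line, one column per keyword */
proc transpose data=pairs out=table(drop=_name_);
  by group;
  id var;
  var value;
run;
```

Because thickness is not in the keyword list, "thickness 2.1mm" never matches and no merge is needed. PROC TRANSPOSE would stop with an error if the same keyword appeared twice in one line, so repeated keywords would need separate handling.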

Respected Advisor
Posts: 4,654

Re: Extracting keywords and corresponding number

Building on @Ksharp's excellent solution, I would simplify, refine and complete it as:

data have;
infile cards truncover;
input x $200.;
cards;
text text text text textdimentionstext text text where length, height, width text text text of 30.2mm, 24mm and 55.3mm
text text text text textdimentionstext text text where height, width text text text of  24mm and 55.3mm
text text text text textdimentionstext text text where length, text text text 55.3mm
;

/* k receives the keywords, v the numeric values; i numbers them within each line */
data k(keep=group i var) v(keep=group i value) ;
 set have;
 group+1;
 s=1;e=length(x);
 pid=prxparse('/\b(length|height|width)\b/oi');
 call prxnext(pid,s,e,x,p,l);
 do i = 1 by 1 while(p>0);
  var=substr(x,p,l);
  output k;
  call prxnext(pid,s,e,x,p,l);
 end;

 /* Restart at the beginning of x so measurements before the last keyword are found too */
 s=1;
 pid=prxparse('/\d+(\.\d*)?\s?mm/oi');
 call prxnext(pid,s,e,x,p,l);
 do i = 1 by 1 while(p>0);
  /* Strip the unit and convert the text to a number */
  value=input(compress(substr(x,p,l),"mM"),best.);
  output v;
  call prxnext(pid,s,e,x,p,l);
 end;

 run;

/* Pair the i-th keyword with the i-th value of each line */
data list;
 merge k v;
 by group i;
drop i;
run;

/* One row per line, one column per keyword */
proc transpose data=list out=table(drop=_name_);
by group;
id var;
var value;
run;
PG
☑ This topic is solved.
