BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ammarhm
Lapis Lazuli | Level 10

Hi everyone

i need help solving the following problem, i think Perl extensions would be the best way to do it, but not sure how...

I have a text containing information on different body parts.

The text is written in the following way

text text text text textdimentionstext text text where length, height, width text text text of 30.2mm, 24mm and 55.3mm

 

 

 

Some of the strings will only have one or two of the three keywords (length, height, width):

 

text text text text textdimentionstext text text where height, width text text text of  24mm and 55.3mm

or

text text text text textdimentionstext text text where length, text text text 55.3mm

 

So basically:

 

the code will need to seach for the occurance of any of the three words (or two or one of them, depending on what the sentence contains) and then assigns the numbers consquetivly as they appear in the sentence.

Any help is greatly appreciated

 

 

The common features in all strings are: 

1. the keywords are always from a list of the three vaiables (length, height, width)

2. The measurement always ends in the two letters mm, they might contain a decimal or not (.)

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User
data have;
infile cards truncover;
input x $200.;
cards;
text text text text textdimentionstext text text where length, height, width text text text of 30.2mm, 24mm and 55.3mm
text text text text textdimentionstext text text where height, width text text text of  24mm and 55.3mm
text text text text textdimentionstext text text where length, text text text 55.3mm
;
run;
data want;
 set have;
 group+1;
 s=1;e=length(x);
 pid=prxparse('/\b(length|height|width)\b/oi');
 call prxnext(pid,s,e,x,p,l);
 do while(p>0);
  want=substr(x,p,l);
  output;
  call prxnext(pid,s,e,x,p,l);
 end;

 pid=prxparse('/\d+(\.\d+)?mm/oi');
 call prxnext(pid,s,e,x,p,l);
 do while(p>0);
  want=substr(x,p,l);
  output;
  call prxnext(pid,s,e,x,p,l);
 end;

 drop pid p l s e;
 run;

 data k v(rename=(want=_want));
  set want;
  if anydigit(want) then output v;
   else output k;
run;
data final_want;
 merge k v;
 by group;
 drop x;
run;

View solution in original post

7 REPLIES 7
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Well, first let me advise you that it is highly recommended NOT to use free text for any kind of analysis.  I can't stress that enough.  

That being said, it is technically possible to extract the information, but those rules have to be really robust for anything to work:

data want (drop=i tmp ind:);
  infile datalines dlm="~";
  length line tmp $2000;
  input line $;
  array loc{3} $200;
  array dim{3} $200;
  ind1=1;
  ind2=1;
  do i=1 to countw(line," ,");
    tmp=scan(line,i," ,");
    if tmp in ("length","height","width") then do;
      loc{ind1}=tmp;
      ind1=ind1+1;
    end;
    if index(tmp,"mm")>0 then do;
      dim{ind2}=tmp;
      ind2=ind2+1;
    end;
  end;
datalines;
aabg thkd length, height, width as abe ddd 24mm and 55.3mm,34mm
trryurturur aswwrqr length, height possioe asad 12mm wert 34mm
;
run;

Again, really not advised.

Ksharp
Super User
data have;
infile cards truncover;
input x $200.;
cards;
text text text text textdimentionstext text text where length, height, width text text text of 30.2mm, 24mm and 55.3mm
text text text text textdimentionstext text text where height, width text text text of  24mm and 55.3mm
text text text text textdimentionstext text text where length, text text text 55.3mm
;
run;
data want;
 set have;
 group+1;
 s=1;e=length(x);
 pid=prxparse('/\b(length|height|width)\b/oi');
 call prxnext(pid,s,e,x,p,l);
 do while(p>0);
  want=substr(x,p,l);
  output;
  call prxnext(pid,s,e,x,p,l);
 end;

 pid=prxparse('/\d+(\.\d+)?mm/oi');
 call prxnext(pid,s,e,x,p,l);
 do while(p>0);
  want=substr(x,p,l);
  output;
  call prxnext(pid,s,e,x,p,l);
 end;

 drop pid p l s e;
 run;

 data k v(rename=(want=_want));
  set want;
  if anydigit(want) then output v;
   else output k;
run;
data final_want;
 merge k v;
 by group;
 drop x;
run;
ammarhm
Lapis Lazuli | Level 10

Thank you everyone

@Ksharp while the code ran perfectly well i ran into a problems:

1. The code extracted that last number only, i correct this by assing s=1 after the first do-end loop

 

 

another question, if the text file lióoks like this:

 

text text text text textdimentionstext text text where length 22.3mm, height 12mm, width 53mm text text text

 

is there a way to specifically link the number to the keyword just befre it? The reason I am asking is that there might be another keyword and a number after it in the middle, and I am not interested in these other keywords :

 

text text text text textdimentionstext text text where length 22.3mm, height 12mm, thickness 2.1mmwidth 53mm text text text

In the above example i am not interested in thickness

Kind regards

 

 

 

Ksharp
Super User

If you know all the keyword , you can specify it in 

 

pid=prxparse('/\b(length|height|width|thickness)\b/oi');

Otherwise, you need write some code to find out all these keywords.

ammarhm
Lapis Lazuli | Level 10

Thanks Ksharp

To explain further: I know all the keywords....

The way the code is made, is that the second part of the code simply extracts the series of numbers in textas they occure.

So in the following example, the first part of the code identifies the three keywords which I am interested in, which is very adequate:

pid=prxparse('/\b(length|height|width|thickness)\b/oi');

on

text text text text textdimentionstext text text where length 22.3mm, height 12mm, thickness 2.1mmwidth 53mm text text text

will return

length, height, width ( I am not interested in thickness and I dont want it to be extracted)

BUT

The second part of the code will pull out 4 series of numbers:

 

 pid=prxparse('/\d+(\.\d+)?mm/oi');

will return

 

 22.3mm, 12mm, 2.1mm, 53mm

Now this will cause a missmatch when the data is merged. That is why I am trying to link the number directly to the keyword before it, if that can be done then 2.1mm will not be extracted

Kind regards

 

Ksharp
Super User

Yeah. Specify all the keywords in 

pid=prxparse('/\b(length|height|width|thickness)\b/oi');

that would avoid such kind of error.

 

Or use PG's code. If any one of V or Value is missing, that means there is a problem in that record. 

PGStats
Opal | Level 21

Building on @Ksharp's excellent solution, I would simplify, refine and complete as:

 

data have;
infile cards truncover;
input x $200.;
cards;
text text text text textdimentionstext text text where length, height, width text text text of 30.2mm, 24mm and 55.3mm
text text text text textdimentionstext text text where height, width text text text of  24mm and 55.3mm
text text text text textdimentionstext text text where length, text text text 55.3mm
;

data k(keep=group i var) v(keep=group i value) ;
 set have;
 group+1;
 s=1;e=length(x);
 pid=prxparse('/\b(length|height|width)\b/oi');
 call prxnext(pid,s,e,x,p,l);
 do i = 1 by 1 while(p>0);
  var=substr(x,p,l);
  output k;
  call prxnext(pid,s,e,x,p,l);
 end;

 pid=prxparse('/\d+(\.\d*)?\s?mm/oi');
 call prxnext(pid,s,e,x,p,l);
 do i = 1 by 1 while(p>0);
  value=input(compress(substr(x,p,l),"mM"),best.);
  output v;
  call prxnext(pid,s,e,x,p,l);
 end;

 run;

data list;
 merge k v;
 by group i;
drop i;
run;

proc transpose data=list out=table(drop=_name_);
by group;
id var;
var value;
run;
PG

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1380 views
  • 10 likes
  • 4 in conversation