Word distance algorithm to identify record of interest

Frequent Contributor
Posts: 96

Hi there,

Based on help from the SAS community, I am trying to identify records in which specific words of interest appear in a particular order (e.g. "medicine" first and "diet" second) and the distance between these words is less than 7 words.

data test;

xyz='She was prescribed exercise and diet. You may visit next week to take further advice about medicine as well as as well diet. You must take diet according to your dietician. Later we will think to revise your medicine'; *search for the word she;
First = 'medicine' ;
Second = 'diet' ;
array firsts (3) f1-f3;
Array seconds (3) s1-s3;
Findex=1;
Sindex=1;
do i = 1 to (countw(xyz));
if upcase(First) = upcase(Scan(xyz,i)) then do;
Firsts[Findex] = i;
Findex = Findex+1;
end;
if upcase(Second) = upcase(Scan(xyz,i)) then do;
Seconds[Sindex]=i;
Sindex = Sindex +1;
end;
end;

do i = 1 to (n(of Firsts(*)));
put First "occurs in position" +1 Firsts[i] ;
end;
do j = 1 to (n(of seconds(*)));
put second "occurs in position" +1 seconds[j] ;
end;


if (Firsts(1) lt seconds(1) and seconds(1) - firsts(1) le 6 and Firsts(1) ne . and seconds(1) ne . )
or (Firsts(1) lt seconds(2) and seconds(2) - firsts(1) le 6 and Firsts(1) ne . and seconds(2) ne . )
or (Firsts(1) lt seconds(3) and seconds(3) - firsts(1) le 6 and Firsts(1) ne . and seconds(3) ne . )
or (Firsts(2) lt seconds(1) and seconds(1) - firsts(2) le 6 and Firsts(2) ne . and seconds(1) ne . )
or (Firsts(2) lt seconds(2) and seconds(2) - firsts(2) le 6 and Firsts(2) ne . and seconds(2) ne . )
or (Firsts(2) lt seconds(3) and seconds(3) - firsts(2) le 6 and Firsts(2) ne . and seconds(3) ne . )
or (Firsts(3) lt seconds(1) and seconds(1) - firsts(3) le 6 and Firsts(3) ne . and seconds(1) ne . )
or (Firsts(3) lt seconds(2) and seconds(2) - firsts(3) le 6 and Firsts(3) ne . and seconds(2) ne . )
or (Firsts(3) lt seconds(3) and seconds(3) - firsts(3) le 6 and Firsts(3) ne . and seconds(3) ne . ) ;

run;

 

 

Can somebody suggest some simplified code for the long IF condition at the end of the step (the colored section in the original post)?

 

Thank you in advance for your kind reply. 

Regards,

Deepak

Swain

Accepted Solutions
Solution
‎04-29-2016 03:54 PM
Super User
Posts: 10,500

Re: Word distance algorithm to identify record of interest

Look at this example:

data _null_;
   length word $40;
   xyz='She was prescribed exercise and diet. You may visit next week to take further advice about medicine as well as as well diet. You must take diet according to your dietician. Later we will think to revise your medicine';
   First  = 'medicine';
   Second = 'diet';
   /* Assumes neither word occurs more than 4 times; further occurrences are ignored */
   array firsts  (4) f1-f4;
   array seconds (4) s1-s4;
   /* These index variables point to where to store the next word position in the arrays */
   Findex = 1;
   Sindex = 1;
   do i = 1 to countw(xyz);
      word = upcase(scan(xyz, i));
      if word = upcase(First) and Findex le dim(firsts) then do;
         firsts[Findex] = i;
         Findex = Findex + 1;
      end;
      if word = upcase(Second) and Sindex le dim(seconds) then do;
         seconds[Sindex] = i;
         Sindex = Sindex + 1;
      end;
   end;

   /* N counts the non-missing elements, so only populated slots are compared */
   do i = 1 to n(of firsts(*));
      do j = 1 to n(of seconds(*));
         /* Second must come after First, at most 6 words later */
         if 0 < seconds[j] - firsts[i] le 6 then
            put First "occurs in position" +1 firsts[i] "and" +1 Second "occurs at position" +1 seconds[j];
      end;
   end;

run;

Also, it may be time to read a bit about arrays and logic constructs. To output just the pairs where "diet" is 6 or fewer words after "medicine", compare each pair of stored positions.
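If the goal is to keep matching records rather than write to the log, the same pairwise comparison can set a flag. The following is only a rough sketch: the data set names notes and want, the sample sentences, the variable names nf, ns and flag, and the upper bound of 10 occurrences are all assumptions made for illustration, not part of the original question.

```sas
data notes;                            /* hypothetical sample records */
   length xyz $120;
   xyz = 'Take your medicine with a balanced diet.'; output;
   xyz = 'Follow the diet first and then take the medicine.'; output;
run;

data want;
   set notes;
   length word $40;
   array firsts  (10) f1-f10;          /* assumed upper bound on occurrences */
   array seconds (10) s1-s10;
   nf = 0;                             /* number of positions stored so far */
   ns = 0;
   do i = 1 to countw(xyz);
      word = upcase(scan(xyz, i));
      if word = 'MEDICINE' and nf < dim(firsts) then do;
         nf = nf + 1;
         firsts[nf] = i;
      end;
      if word = 'DIET' and ns < dim(seconds) then do;
         ns = ns + 1;
         seconds[ns] = i;
      end;
   end;

   flag = 0;
   do i = 1 to nf;
      do j = 1 to ns;
         /* second word must follow the first, at most 6 words later */
         if 0 < seconds[j] - firsts[i] le 6 then flag = 1;
      end;
   end;

   if flag;                            /* subsetting IF keeps only matching records */
   keep xyz;
run;
```

With these two sample sentences only the first record should survive, because in the second one "diet" comes before "medicine".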

 

 

Please post code in the code box that opens when you select the "run" icon above. It preserves formatting, and the indents really make it much easier to read the nested DO loop code frequently needed for arrays.

 

You could look up the N function, as in n(of seconds(*)), to find out that it returns the count of populated (non-missing) cells in the array, so all of your "and ... ne ." conditions are not needed.
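A tiny illustration of that behaviour, as a sketch with made-up values (the array name pos and the variables filled and total are only for this example):

```sas
data _null_;
   array pos (4) p1-p4;        /* numeric array; all four elements start out missing */
   pos[1] = 6;
   pos[2] = 23;
   filled = n(of pos(*));      /* N counts only the non-missing elements */
   total  = dim(pos);          /* DIM returns the declared size of the array */
   put filled= total=;         /* writes filled=2 total=4 to the log */
run;
```

Because the positions are stored from slot 1 upward with no gaps, looping from 1 to n(of pos(*)) visits exactly the populated slots.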

You DID need the condition that (position of second) - (position of first) is greater than 0, so that the second word comes after the first.



All Replies
Frequent Contributor
Posts: 96

Re: Word distance algorithm to identify record of interest

Hi there,
I am familiar with First. and Last., but the concept of firsts() and seconds() is new to me. In other words, I am new to SAS. Could you kindly point me to some informative material on it to enrich my knowledge? Thank you in advance for your kind reply. Have a nice weekend.
Regards,
Deepak
Swain
Respected Advisor
Posts: 4,649

Re: Word distance algorithm to identify record of interest

[ Edited ]

Pattern matching is ideal for this kind of intricate request:

 

data test;
xyz='She was prescribed exercise and diet. You may visit next week to take 
further advice about medicine as well as as well diet. You must take diet 
according to your dietician. Later we will think to revise your medicine';
run;

data _null_;
set test;
First = 'medicine' ;
do Second = "well", "diet", "you", "must", "take" ;
    interest = prxmatch(cats("/", First, "(\W+\w+){0,6}\W+", Second, "\b/i"), xyz);
    put (First Second interest) (=)/;
    end;
run;

/* 
 Pattern reads : Find First word, followed with 0 to 6 words (a sequence 
 of non-word characters (\W) followed by a sequence of word characters (\w)), 
 followed with a sequence of non-word characters, followed with the 
 Second word, ending on a word boundary (\b). 
 The match is case insensitive (i). 
*/

Edit: new version + comments 

PG
Frequent Contributor
Posts: 96

Re: Word distance algorithm to identify record of interest

Hi there,
I am very interested in exploring the use of Perl regular expressions in this context. Unfortunately, being a novice in SAS, I have little understanding of them. I tried to run the given code but could not understand the importance of "think". I am looking for records in which the first word, medicine, is followed by the second word, diet, with fewer than 7 intervening words.
Thank you in advance for your valuable input.
Regards,
Deepak
Swain
Respected Advisor
Posts: 4,649

Re: Word distance algorithm to identify record of interest

"think" is a word that is present in the string, but more than 7 words away from "medicine", hence the result interest=0.

 

The pattern means: find the word medicine, followed by zero to six words (each a run of non-word characters such as spaces followed by a run of word characters), followed by non-word characters and the word diet.
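For the specific medicine/diet pair, the same idea can flag records in a data set. This is a sketch under a few assumptions: the data set names notes and flagged and the sample sentences are invented for illustration, the leading \b is added so that "medicine" must start at a word boundary, and {0,5} is used on the assumption that a position difference of at most 6 (as in the DATA step solution) means at most 5 intervening words. Use {0,6} to allow up to 6 intervening words.

```sas
data notes;                            /* hypothetical sample records */
   length xyz $120;
   xyz = 'Take your medicine with a balanced diet.'; output;
   xyz = 'Follow the diet first and then take the medicine.'; output;
   xyz = 'Medicine, rest, fluids, sleep, walks, fruit, vegetables and then diet.'; output;
run;

data flagged;
   set notes;
   retain re;
   /* compile the pattern once and reuse it for every record */
   if _n_ = 1 then re = prxparse('/\bmedicine(\W+\w+){0,5}\W+diet\b/i');
   /* PRXMATCH returns the start position of the match, or 0 if there is none */
   interest = (prxmatch(re, xyz) > 0);
   drop re;
   put interest= xyz=;
run;
```

With these sample sentences, interest should be 1 for the first record and 0 for the other two: in the second "diet" precedes "medicine", and in the third there are eight words between them.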

PG
Respected Advisor
Posts: 4,649

Re: Word distance algorithm to identify record of interest

See edited version of my code.

PG
☑ This topic is SOLVED.
