measure distance between two words in a text string

I am interested in measuring the distance between two specific words in a text string, in terms of the number of words between them.

Most of the functions I am aware of give the distance in terms of the number of characters, such as:

 

data _null_;
   searchhere='residential treatment facility';
   fullword=indexw(searchhere,'treatment');
   put fullword=;
run;

data _null_;
   xyz='She sells seashells? Yes, she does.'; *search for the word she;
   whereisShe=findw(xyz,'she');
   put whereisShe;
run;

N.B.: I am looking for the distance, i.e. the number of words, between 'sick' and 'antibiotics' in the string: Very sick people may only take antibiotics.

 

Thank you in advance for your kind reply.
Deepak Swain

Accepted Solutions
Solution
‎04-28-2016 10:20 AM
Super User
Posts: 11,343

Re: measure distance between two words in a text string

Posted in reply to DeepakSwain

What if one of the words is repeated? Which count would you want?

What if both words appear in the string multiple times?

What if the "first" word actually occurs after the "second" word?

Is the search to be case-sensitive? Is "Sick" to match "sick"? (I assume yes, but you should clarify.)

What happens when only one of the words matches?

 

You may also have to look at the delimiters between words: does a dash in a compound word qualify? Would sick-bed count as "sick"?
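To make the delimiter question concrete, here is a small sketch (the sample sentence and variable names are made up for illustration). It compares the default delimiter list of COUNTW and SCAN, which includes the hyphen on ASCII systems, with a call that treats only a blank as a delimiter.

```sas
data _null_;
   text = 'He was sick-bed bound';      /* hypothetical sample text */

   /* Default delimiters: the hyphen splits "sick-bed" into two words */
   n_default  = countw(text);
   w3_default = scan(text, 3);

   /* Blank as the only delimiter: "sick-bed" stays one word */
   n_blank    = countw(text, ' ');
   w3_blank   = scan(text, 3, ' ');

   put n_default= w3_default=;
   put n_blank= w3_blank=;
run;
```

With the default list, the third word should come back as "sick"; with a blank as the only delimiter it should be "sick-bed". Whichever list you choose, pass the same one to both COUNTW and SCAN so the word numbering stays consistent.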

 

A stub of some code that may work: it matches the FIRST occurrence of each word, regardless of case.

data _null_; 

xyz='She sells seashells? Yes, she does.'; *search for the word she;
First = 'She'  ;
Second = 'does' ;
Firstword=.; 
Secondword=.;
do i = 1 to (countw(xyz));
   if missing(Firstword) and upcase(First) = upcase(Scan(xyz,i)) then FirstWord=i;
   if missing(Secondword) and upcase(Second) = upcase(Scan(xyz,i)) then Secondword=i;
end;

put Firstword= SecondWord=;

run;
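As a sketch of how the stub above could be applied to the sentence in the question, the step below adds a subtraction to get the number of words in between. The variable names word and between are introduced here for illustration, and the sketch assumes the first occurrence of each word is the one wanted and that the first word must precede the second.

```sas
data _null_;
   length word $ 40;
   xyz    = 'Very sick people may only take antibiotics.';
   first  = 'sick';
   second = 'antibiotics';
   firstword  = .;
   secondword = .;

   /* Record the word number of the first occurrence of each word */
   do i = 1 to countw(xyz);
      word = upcase(scan(xyz, i));
      if missing(firstword)  and word = upcase(first)  then firstword  = i;
      if missing(secondword) and word = upcase(second) then secondword = i;
   end;

   /* Report only when both words were found and first precedes second */
   if nmiss(firstword, secondword) = 0 and firstword < secondword then do;
      between = secondword - firstword - 1;   /* words strictly in between */
      put firstword= secondword= between=;
   end;
run;
```

For this sentence "sick" is word 2 and "antibiotics" is word 7, so between should be 4 (people, may, only, take).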



All Replies
Frequent Contributor
Posts: 104

Re: measure distance between two words in a text string

Posted in reply to DeepakSwain
Hi there,

First of all, I want to thank you for your kind reply. Using your code I have successfully measured the distance between two specific words as a number of words.

data _null_;
   xyz='She was prescribed exercise and drug. You may visit next week to take further advice about medicine as well as diet'; *search for the word she;
   First = 'medicine' ;
   Second = 'diet' ;
   Firstword=.;
   Secondword=.;
   worddistance=.;
   do i = 1 to (countw(xyz));
      if missing(Firstword) and upcase(First) = upcase(Scan(xyz,i)) then FirstWord=i;
      if missing(Secondword) and upcase(Second) = upcase(Scan(xyz,i)) then Secondword=i;
   end;
   if Firstword lt Secondword;
   worddistance= SecondWord-Firstword;
   put worddistance=;
   put Secondword = Firstword=;
run;

Once again, thank you for raising some questions which are very relevant to my analysis. To start the discussion, I tried to keep the problem as simple as possible.

* The match is case-insensitive.
* If the first word comes after the second word, the record can be filtered out of flagging/analysis by using if Firstword lt Secondword;
* If only one of the two words is present, the record is automatically filtered out of flagging/analysis, which is what I want.
* The issue that remains is how to calculate the distance when a word occurs multiple times, e.g.:

xyz='She was prescribed exercise and diet. You may visit next week to take further advice about medicine as well as diet';

The above code does not work for this string. The word 'diet' occurs twice, and the code measures the distance to the first "diet", not the second. In addition, the condition that the second word must come after the first word excludes the record.

Once again, thank you in advance for your kind guidance.

Regards,
Deepak Swain
Super User
Posts: 11,343

Re: measure distance between two words in a text string

Posted in reply to DeepakSwain

You could extend the logic to find words that occur multiple times, but you'll still need to make some assumptions and decisions.

For instance, you can find out how many times each word occurs and then use an array to store the positions of the first, second, etc. occurrence of each word.

 

This demonstrates getting those values.

You will need to decide which comparisons of the positions you want.

data _null_; 

   xyz='She sells seashells? Yes, she does.'; *search for the word she;
   First = 'She'  ;
   Second = 'does' ;
   array firsts (4)  f1-f4; /* assumes neither word occurs more than 4 times */
   array seconds (4) s1-s4;
   Findex=1;/* these index variables will point to where to store the word count in the arrays*/
   Sindex=1;
   do i = 1 to (countw(xyz));
      if upcase(First) = upcase(Scan(xyz,i)) then do;
         Firsts[Findex] = i;
         Findex = Findex+1;
      end;
      if upcase(Second) = upcase(Scan(xyz,i)) then do;
         Seconds[Sindex]=i;
         Sindex = Sindex +1;
      end;
   end;

   do i = 1 to (n(of Firsts(*)));
      put First "occurs in position" +1 Firsts[i] +(-1) '.' @;
      do j = 1 to (n(of seconds(*)));
         put +1 second "occurs in position" +1 seconds[j];
      end;
      put;
   end;

run;
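As one possible way to settle the "which comparison" decision, the sketch below keeps the smallest distance from any occurrence of the first word to a later occurrence of the second word. This rule is an assumption for illustration, not something the thread settled on, and the names lastfirst, best, bestfirst and bestsecond are hypothetical. It works in a single pass without arrays, so it does not depend on a maximum number of occurrences.

```sas
data _null_;
   length word $ 40;
   xyz    = 'She was prescribed exercise and diet. You may visit next week to take further advice about medicine as well as diet';
   first  = 'medicine';
   second = 'diet';
   lastfirst = .;   /* word number of the most recent occurrence of FIRST */
   best      = .;   /* smallest first-to-second distance seen so far      */

   do i = 1 to countw(xyz);
      word = upcase(scan(xyz, i));
      if word = upcase(first) then lastfirst = i;
      else if word = upcase(second) and not missing(lastfirst) then do;
         /* The nearest preceding FIRST is always the most recent one */
         if missing(best) or i - lastfirst < best then do;
            best       = i - lastfirst;
            bestfirst  = lastfirst;
            bestsecond = i;
         end;
      end;
   end;

   if not missing(best) then put bestfirst= bestsecond= best=;
   else put 'No occurrence of ' second 'after ' first;
run;
```

For the string above, the "diet" at word 6 is skipped because no "medicine" has been seen yet, and the pair reported should be word 17 and word 21, a distance of 4. Subtract 1 from best if you want the number of words strictly in between.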