Hi there,
Based on help from the SAS community, I have tried to identify records that contain specific words of interest in a particular order (e.g. "medicine" first and "diet" second), where the distance between the two words is less than 7 words.
data test;
xyz='She was prescribed exercise and diet. You may visit next week to take further advice about medicine as well as as well diet. You must take diet according to your dietician. Later we will think to revise your medicine'; *search for the word she;
First = 'medicine' ;
Second = 'diet' ;
array firsts (3) f1-f3;
Array seconds (3) s1-s3;
Findex=1;
Sindex=1;
do i = 1 to (countw(xyz));
if upcase(First) = upcase(Scan(xyz,i)) then do;
Firsts[Findex] = i;
Findex = Findex+1;
end;
if upcase(Second) = upcase(Scan(xyz,i)) then do;
Seconds[Sindex]=i;
Sindex = Sindex +1;
end;
end;
do i = 1 to (n(of Firsts(*)));
put First "occurs in position" +1 Firsts[i] ;
end;
do j = 1 to (n(of seconds(*)));
put second "occurs in position" +1 seconds[j] ;
end;
if (Firsts(1) lt seconds(1) and seconds(1) - firsts(1) le 6 and Firsts(1) ne . and seconds(1) ne . )
or (Firsts(1) lt seconds(2) and seconds(2) - firsts(1) le 6 and Firsts(1) ne . and seconds(2) ne . )
or (Firsts(1) lt seconds(3) and seconds(3) - firsts(1) le 6 and Firsts(1) ne . and seconds(3) ne . )
or (Firsts(2) lt seconds(1) and seconds(1) - firsts(2) le 6 and Firsts(2) ne . and seconds(1) ne . )
or (Firsts(2) lt seconds(2) and seconds(2) - firsts(2) le 6 and Firsts(2) ne . and seconds(2) ne . )
or (Firsts(2) lt seconds(3) and seconds(3) - firsts(2) le 6 and Firsts(2) ne . and seconds(3) ne . )
or (Firsts(3) lt seconds(1) and seconds(1) - firsts(3) le 6 and Firsts(3) ne . and seconds(1) ne . )
or (Firsts(3) lt seconds(2) and seconds(2) - firsts(3) le 6 and Firsts(3) ne . and seconds(2) ne . )
or (Firsts(3) lt seconds(3) and seconds(3) - firsts(3) le 6 and Firsts(3) ne . and seconds(3) ne . ) ;
run;
Can somebody suggest some simplified code for the colored section?
Thank you in advance for your kind reply.
Regards,
Deepak
Look at this example:
data _null_;
   xyz='She was prescribed exercise and diet. You may visit next week to take further advice about medicine as well as as well diet. You must take diet according to your dietician. Later we will think to revise your medicine';
   First = 'medicine' ;
   Second ='diet' ;
   array firsts (4) f1-f4; /* assumes that neither word occurs more than 4 times */
   Array seconds (4) s1-s4;
   Findex=1; /* these index variables point to where the next word position is stored in the arrays */
   Sindex=1;
   /* record the word position of every occurrence of each word */
   do i = 1 to (countw(xyz));
      if upcase(First) = upcase(Scan(xyz,i)) then do;
         Firsts[Findex] = i;
         Findex = Findex+1;
      end;
      if upcase(Second) = upcase(Scan(xyz,i)) then do;
         Seconds[Sindex]=i;
         Sindex = Sindex +1;
      end;
   end;
   /* compare every pair: Second must follow First by 1 to 6 word positions */
   do i = 1 to (n(of Firsts(*)));
      do j = 1 to (n(of seconds(*)));
         if 0< seconds[j]- firsts[i] le 6 then
            put First "occurs in position" +1 firsts[i] "and"+1 Second "occurs at position" +1 Seconds[j];
      end;
   end;
run;
Also, it may be time to read a bit about arrays and logic constructs. To output just the cases where "diet" is 6 or fewer words after "medicine", compare each pair of values.
Please post code in the box that opens after selecting the "run" icon above. It preserves formatting, and the indents really make it much easier to read the nested do loops that arrays frequently need.
If you look up n(of seconds(*)), you will find that it returns the count of populated (non-missing) cells in the array, so all of your "and ... ne ." conditions are not needed.
You DID need to add the condition that the position of the second word minus the position of the first is greater than 0.
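A minimal sketch of those two points, in case it helps. The word positions below (17 for "medicine"; 6, 23 and 27 for "diet") were counted by hand from the sample sentence and are typed in as literals, so treat them as an assumption rather than program output. The variable names populated, first_pos, gap and in_range are made up for this illustration.

```sas
data _null_;
   array seconds (4) s1-s4;        /* all four cells start out missing */
   seconds[1] = 6;                 /* hand-counted positions of "diet" (assumed) */
   seconds[2] = 23;
   seconds[3] = 27;
   populated = n(of seconds(*));   /* N counts only the non-missing cells */
   put populated=;
   first_pos = 17;                 /* hand-counted position of "medicine" (assumed) */
   do j = 1 to populated;
      gap = seconds[j] - first_pos;
      in_range = (0 < gap <= 6);   /* chained comparison: after First and within 6 positions */
      put j= gap= in_range=;
   end;
run;
```

Using the count as the loop bound only works because the cells are filled from the first one onward with no holes, which is how the Findex/Sindex counters fill them. The cell left missing is never visited, so no separate "ne ." test is required.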
Pattern matching is ideal for this kind of intricate request:
data test;
xyz='She was prescribed exercise and diet. You may visit next week to take
further advice about medicine as well as as well diet. You must take diet
according to your dietician. Later we will think to revise your medicine';
run;
data _null_;
   set test;
   First = 'medicine' ;
   do Second = "well", "diet", "you", "must", "take" ;
      interest = prxmatch(cats("/", First, "(\W+\w+){0,6}\W+", Second, "\b/i"), xyz);
      put (First Second interest) (=)/;
   end;
run;
/*
Pattern reads : Find First word, followed with 0 to 6 words (a sequence
of non-word characters (\W) followed by a sequence of word characters (\w)),
followed with a sequence of non-word characters, followed with the
Second word, ending on a word boundary (\b).
The match is case insensitive (i).
*/
Edit: new version + comments
"think" is a word that is present in the string but more than 7 words away from "medicine", thus the result interest=0.
The pattern means: find the word medicine, followed by zero to six words (each one a run of non-word characters followed by a run of word characters), followed by non-word characters and the word diet.
See edited version of my code.
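To flag records rather than just print the result, something along these lines might work. It is only a sketch: the data set names notes and flagged, the two sample sentences, and the variables pattern and pos are invented for the illustration. It also makes two changes to the pattern that are my assumptions, not part of the code above: a leading \b so that First must start on a word boundary, and {0,5} instead of {0,6}, on the reading that a position difference of at most 6 (as in the array version) means at most five words in between.

```sas
data notes;                        /* hypothetical sample records */
   length xyz $200;
   xyz = 'Take medicine as well as a sensible diet.';        output;
   xyz = 'Keep to the diet and review the medicine later.';  output;
run;

data flagged;
   set notes;
   First  = 'medicine';
   Second = 'diet';
   /* First, then 0 to 5 intervening words, then Second as a whole word */
   pattern = cats('/\b', First, '(\W+\w+){0,5}\W+', Second, '\b/i');
   pos = prxmatch(pattern, xyz);   /* start position of the match, 0 if no match */
   if pos > 0;                     /* subsetting IF: keep only matching records */
run;
```

With these two sentences only the first record should be kept, since in the second one "diet" comes before "medicine". Note that \w and \W do not split text exactly the way the default SCAN delimiters do (apostrophes and underscores, for example), so the two approaches can count words differently on messier text.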