Hi there,
Based on help from the SAS community, I have tried to identify records in which specific words of interest appear in a particular order (e.g. "medicine" first and "diet" second) and in which the distance between these words is less than 7 words.
data test;
xyz='She was prescribed exercise and diet. You may visit next week to take further advice about medicine as well as as well diet. You must take diet according to your dietician. Later we will think to revise your medicine'; *search for the word she;
First = 'medicine' ;
Second = 'diet' ;
array firsts (3) f1-f3;
Array seconds (3) s1-s3;
Findex=1;
Sindex=1;
do i = 1 to (countw(xyz));
  if upcase(First) = upcase(Scan(xyz,i)) then do;
    Firsts[Findex] = i;
    Findex = Findex+1;
  end;
  if upcase(Second) = upcase(Scan(xyz,i)) then do;
    Seconds[Sindex]=i;
    Sindex = Sindex +1;
  end;
end;
do i = 1 to (n(of Firsts(*)));
  put First "occurs in position" +1 Firsts[i] ;
end;
do j = 1 to (n(of seconds(*)));
  put second "occurs in position" +1 seconds[j] ;
end;
if (Firsts(1) lt seconds(1) and seconds(1) - firsts(1) le 6 and Firsts(1) ne . and seconds(1) ne . )
or (Firsts(1) lt seconds(2) and seconds(2) - firsts(1) le 6 and Firsts(1) ne . and seconds(2) ne . )
or (Firsts(1) lt seconds(3) and seconds(3) - firsts(1) le 6 and Firsts(1) ne . and seconds(3) ne . )
or (Firsts(2) lt seconds(1) and seconds(1) - firsts(2) le 6 and Firsts(2) ne . and seconds(1) ne . )
or (Firsts(2) lt seconds(2) and seconds(2) - firsts(2) le 6 and Firsts(2) ne . and seconds(2) ne . )
or (Firsts(2) lt seconds(3) and seconds(3) - firsts(2) le 6 and Firsts(2) ne . and seconds(3) ne . )
or (Firsts(3) lt seconds(1) and seconds(1) - firsts(3) le 6 and Firsts(3) ne . and seconds(1) ne . )
or (Firsts(3) lt seconds(2) and seconds(2) - firsts(3) le 6 and Firsts(3) ne . and seconds(2) ne . )
or (Firsts(3) lt seconds(3) and seconds(3) - firsts(3) le 6 and Firsts(3) ne . and seconds(3) ne . ) ;
run;
Can somebody suggest simpler code for the colored section (the long IF condition)?
Thank you in advance for your kind reply.
Regards,
Deepak
Look at this example:
data _null_;
xyz='She was prescribed exercise and diet. You may visit next week to take further advice about medicine as well as as well diet. You must take diet according to your dietician. Later we will think to revise your medicine';
First = 'medicine' ;
Second ='diet' ;
array firsts (4) f1-f4;   /* assumes that neither word occurs more than 4 times */
array seconds (4) s1-s4;
Findex=1;   /* these index variables point to where the next word position is stored in each array */
Sindex=1;
do i = 1 to (countw(xyz));
  if upcase(First) = upcase(Scan(xyz,i)) then do;
    Firsts[Findex] = i;
    Findex = Findex+1;
  end;
  if upcase(Second) = upcase(Scan(xyz,i)) then do;
    Seconds[Sindex]=i;
    Sindex = Sindex +1;
  end;
end;
/* compare every stored position of First with every stored position of Second */
do i = 1 to (n(of Firsts(*)));
  do j = 1 to (n(of seconds(*)));
    if 0< seconds[j]- firsts[i] le 6 then
      put First "occurs in position" +1 firsts[i] "and"+1 Second "occurs at position" +1 Seconds[j];
  end;
end;
run;
Also, it may be time to read a bit about arrays and logic constructs. To output just the cases where "diet" is 6 or fewer words after "medicine", compare each pair of stored positions.
Please post code in the box that opens after selecting the "run" icon above. It preserves formatting, and indentation makes the nested DO loops that arrays frequently need much easier to read.
If you look up n(of seconds(*)), you will find that it returns the count of populated (non-missing) cells in the array, so none of your "ne ." checks are needed.
You DID need the condition that (position of Second) - (position of First) is greater than 0.
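Since the original goal was to identify records, here is a minimal sketch of a variation that keeps only the matching rows of a data set. It is not the code from the answer above: instead of storing positions in arrays, it remembers the position of the most recent First word while scanning, which is sufficient because the nearest preceding occurrence is always the closest one. The data set names have and want, the sample sentences, and the helper variables word, lastFirst and found are all assumptions made for illustration.

```sas
/* Hypothetical sample data: data set and variable names are assumptions */
data have;
  length xyz $200;
  xyz = 'Take your medicine and then follow the diet plan.'; output;
  xyz = 'Follow the diet first. Medicine comes later.'; output;
run;

data want;
  set have;
  length word $40;
  First  = 'medicine';
  Second = 'diet';
  lastFirst = .;   /* position of the most recent First word seen so far */
  found = 0;       /* set to 1 once a qualifying pair is seen */
  do i = 1 to countw(xyz) until (found);
    word = upcase(scan(xyz, i));
    /* Second must come after a First word and at most 6 positions later */
    if word = upcase(Second) and lastFirst ne . then do;
      if i - lastFirst le 6 then found = 1;
    end;
    if word = upcase(First) then lastFirst = i;
  end;
  if found;        /* subsetting IF: keep only matching records */
  drop i word lastFirst found;
run;
```

With this sample data, only the first row should remain in want, because in the second row "diet" comes before "medicine". This version also avoids having to guess an upper bound for the array size.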
Pattern matching is ideal for this kind of intricate request:
data test;
xyz='She was prescribed exercise and diet. You may visit next week to take
further advice about medicine as well as as well diet. You must take diet
according to your dietician. Later we will think to revise your medicine';
run;
data _null_;
  set test;
  First = 'medicine' ;
  do Second = "well", "diet", "you", "must", "take" ;
    interest = prxmatch(cats("/", First, "(\W+\w+){0,6}\W+", Second, "\b/i"), xyz);
    put (First Second interest) (=)/;
  end;
run;
/*
Pattern reads : Find First word, followed with 0 to 6 words (a sequence
of non-word characters (\W) followed by a sequence of word characters (\w)),
followed with a sequence of non-word characters, followed with the
Second word, ending on a word boundary (\b).
The match is case insensitive (i).
*/
Edit: new version + comments
"think" is a word that is present in the string, but it is more than 7 words away from "medicine", hence the result interest=0.
The pattern means: find the word medicine, followed by zero to six words (each one a run of non-word characters followed by a run of word characters), followed by one or more non-word characters such as spaces, and then the word diet.
See edited version of my code.
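One detail worth checking is how the quantifier relates to the distance rule in the question. With {0,6}, up to six whole words may sit between the two words, so the second word can be up to 7 positions after the first. The array code in this thread uses "le 6" on the difference in positions, which corresponds to {0,5}. The sketch below builds the quantifier from a hypothetical maxGap variable and also adds a leading \b so that First only matches at the start of a word. The shortened sample string, maxGap and hit are assumptions for illustration, not part of the original answer.

```sas
data _null_;
  /* shortened sample text (assumption): medicine is word 1, diet is word 7, You is word 8 */
  xyz = 'medicine as well as as well diet. You must take diet';
  First  = 'medicine';
  maxGap = 6;   /* largest allowed difference in word positions (assumption) */
  do Second = 'well', 'diet', 'you';
    /* at most maxGap-1 whole words may sit between First and Second */
    hit = prxmatch(cats('/\b', First, '(\W+\w+){0,', maxGap - 1, '}\W+', Second, '\b/i'), xyz);
    put Second= hit=;
  end;
run;
```

Here hit should be greater than 0 for "well" and "diet" and 0 for "you", which is 7 positions after "medicine". Replacing maxGap - 1 with maxGap reproduces the {0,6} behaviour of the code above.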