Pyrite | Level 9

## Flagging records based on word distance criteria

Hi there,

Following guidance from the SAS community, I have been trying to flag records based on a word-distance criterion. I am probably making a syntax mistake, because I am not getting the result I expect. I have attached a sample of 4 records to test the code given below:

``````
data test2;
  set test1;
  First = 'negative';
  Second = 'malignancy';
  array firsts (4) f1-f4;   /* assumes 1) that the first word won't occur more than 4 times */
  array seconds (4) s1-s4;
  Findex = 1;   /* these index variables will point to where to store the word count in the arrays */
  Sindex = 1;
  do i = 1 to (countw(xyz));
    if upcase(First) = upcase(scan(xyz, i)) then do;
      firsts[Findex] = i;
      Findex = Findex + 1;
    end;
    if upcase(Second) = upcase(scan(xyz, i)) then do;
      seconds[Sindex] = i;
      Sindex = Sindex + 1;
    end;
  end;

  do i = 1 to (n(of firsts(*)));
    do j = 1 to (n(of seconds(*)));
      if 0 < seconds[j] - firsts[i] le 6 then
        put First "occurs in position" +1 firsts[i] "and" +1 Second "occurs at position" +1 seconds[j];
      negative_malignancy = 1;
    end;
  end;

run;
``````
Swain
ACCEPTED SOLUTION
Opal | Level 21

## Re: Flagging records based on word distance criteria

At the risk of sounding insistent, I would favor treating this problem with regular expressions (I renamed your file Diagnosis.xls):

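``````
libname xl Excel "&sasforum\datasets\diagnosis.xls" access=readonly;

data test2;
  First = 'negative';
  Second = 'malignancy';
  /* {0,6} allows up to six intervening words between First and Second; */
  /* use {0,5} if Second must be at most six word positions after First */
  prx = prxParse(cats("/(", First, ")(?:\W+\w+){0,6}\W+(", Second, ")\b/i"));
  set xl.'test1$'n;
  if prxMatch(prx, diagnosis) then do;
    /* character positions of the two captured words within diagnosis */
    call prxPosn(prx, 1, firstPos);
    call prxPosn(prx, 2, secondPos);
  end;
  drop prx;
run;

proc sql;
  select * from test2;
quit;
``````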
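
For readers without the Excel file, the same idea can be tried on inline data. This is only a minimal sketch: the three sample rows and the data set names `sample` and `flagged` are made up for illustration, and the `negative_malignancy` flag (the name comes from the original question) is set directly from the result of `prxMatch`.

```sas
/* Hypothetical sample rows; the variable name diagnosis mirrors the code above */
data sample;
  infile datalines truncover;
  input diagnosis $char80.;
  datalines;
Biopsy negative for malignancy
Negative, no evidence of malignancy
Malignancy present, margins negative
;

data flagged;
  set sample;
  retain prx;
  /* compile the pattern on the first row only and keep its id */
  if _n_ = 1 then
    prx = prxParse("/(negative)(?:\W+\w+){0,6}\W+(malignancy)\b/i");
  /* prxMatch returns the start position of the match, or 0 if there is none */
  negative_malignancy = (prxMatch(prx, diagnosis) > 0);
  drop prx;
run;

proc print data=flagged;
run;
```

With these rows, the first two should be flagged and the third should not, because there "negative" comes after "malignancy".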
PG
Pyrite | Level 9

## Re: Flagging records based on word distance criteria

Hi there,

Perl regular expressions seem to be awesome. Can you modify the pattern to add a third word to the criteria? For example:

first word = 'negative'

second word = 'for'

third word = 'malignancy'

the distance between the first and second word = 1

the distance between the second and third word le 6.

Swain
Opal | Level 21

## Re: Flagging records based on word distance criteria

Extended version:

``````
libname xl Excel "&sasforum\datasets\diagnosis.xls" access=readonly;

data test3;
  First  = 'negative';
  Second = 'for';
  Third  = 'malignancy';
  /* Second must directly follow First; up to five words may sit between Second and Third */
  prx = prxParse(cats("/(", First, ")\W+", Second, "(?:\W+\w+){0,5}\W+(", Third, ")\b/i"));
  set xl.'test1$'n;
  if prxMatch(prx, diagnosis) then do;
    call prxPosn(prx, 1, firstPos);
    call prxPosn(prx, 2, lastPos);
  end;
  drop prx;
run;

proc sql;
  select * from test3;
quit;
``````
PG
Pyrite | Level 9

## Re: Flagging records based on word distance criteria

Hi PG,

For the same situation, the code you provided works fantastically using the cats function. To get a better understanding of the regular expression syntax, I am trying to write the same code without using the cats function.

``````
libname xl Excel "&sasforum\datasets\diagnosis.xls" access=readonly;

data test3;
  First  = 'negative';
  Second = 'for';
  Third  = 'malignancy';
  prx = prxParse(cats("/(", First, ")\W+", Second, "(?:\W+\w+){0,5}\W+(", Third, ")\b/i"));
  set xl.'test1$'n;
  if prxMatch(prx, diagnosis) then do;
    call prxPosn(prx, 1, firstPos);
    call prxPosn(prx, 2, lastPos);
  end;
  drop prx;
run;

proc sql;
  select * from test3;
quit;
``````

I am trying to replace the three highlighted lines of code with a single line:

``prx = prxParse("/("negative")\W+", for, "(?:\W+\w+){0,5}\W+(" malignancy")\b/i"));``

Regards,

Deepak

Swain
Opal | Level 21

## Re: Flagging records based on word distance criteria

``prx = prxParse("/(negative)\W+for(?:\W+\w+){0,5}\W+(malignancy)\b/io");``
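
For anyone reading the pattern for the first time, here is a small annotated sketch that runs it against three made-up test strings. The strings and the variable names `text` and `pos` are only for illustration.

```sas
data _null_;
  /* Same pattern as above, annotated piece by piece:
       (negative)        first word, capture group 1
       \W+for            one or more non-word characters, then the second word
       (?:\W+\w+){0,5}   zero to five intervening words (non-capturing group)
       \W+(malignancy)   separator, then the third word as capture group 2
       \b                word boundary after the third word
       /io               i = ignore case, o = compile the pattern only once */
  prx = prxParse("/(negative)\W+for(?:\W+\w+){0,5}\W+(malignancy)\b/io");
  length text $80;
  do text = "Negative for malignancy",
            "negative for any sign of malignancy",
            "negative, but suspicious for malignancy";
    pos = prxMatch(prx, text);   /* start position of the match, 0 if none */
    put text= pos=;
  end;
run;
```

The first two strings should report `pos=1`. The third should report `pos=0`, because "for" does not directly follow "negative".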

PG
Pyrite | Level 9

## Re: Flagging records based on word distance criteria

Hi there,

The solution provided is an excellent example of Perl regular expressions. Being new to SAS, I have limited experience with this language. Unfortunately, I have no idea how the prxparse and cats functions are used together. Is there an article or other material showing examples of using them together? I would like to understand the syntax better so that I can use it more often in the future.

Deepak

Swain
Opal | Level 21

## Re: Flagging records based on word distance criteria

cats is a simple string concatenation function (it strips leading and trailing blanks from each argument before joining them) that is used here to build the search pattern. Regular expression matching functions are more difficult to master but quite powerful. There are many tutorials available on the net. The following example is a bit easier to understand than the previous one:

``````
data test;
xyz='She was prescribed exercise and diet. You may visit next week to take
further advice about medicine as well as as well diet. You must take diet
according to your dietician. Later we will think to revise your medicine';
run;

data _null_;
  length pattern $50;
  set test;
  put xyz=/;
  First = 'medicine';
  /* try several second words and print each pattern with its match position */
  do Second = "well", "diet", "you", "must", "take";
    pattern = cats("/", First, "(\W+\w+){0,6}\W+", Second, "\b/i");
    interest = prxmatch(pattern, xyz);   /* 0 means no match */
    put (pattern interest) (=/)/;
  end;
run;
``````
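
To see what cats actually hands to prxParse, it can help to print the assembled pattern to the log. The sketch below is only an illustration: the variable names `padded` and `pattern` and the `$20` lengths are assumptions chosen to make the blank padding visible.

```sas
data _null_;
  length First Second $20 pattern $60;
  First  = 'negative';
  Second = 'malignancy';

  /* the || operator keeps the trailing blanks of the $20 variable */
  padded = "/(" || First || ")";
  put padded=;

  /* cats strips leading and trailing blanks from each argument before joining */
  pattern = cats("/(", First, ")(?:\W+\w+){0,6}\W+(", Second, ")\b/i");
  put pattern=;
run;
```

The second `put` should show the pattern with no embedded blanks, which is why cats is convenient for building a pattern from character variables.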

Good luck!

PG
Pyrite | Level 9

## Re: Flagging records based on word distance criteria

Hi PG,

Thanks for your patience with my silly questions, and for your kind reply. Over the last 24 hours I have had a nice learning experience with Perl regular expressions. Your code is now crystal clear to me, and I can use similar logic in various situations.

Regards,

Deepak

Swain

## Re: Flagging records based on word distance criteria

@DeepakSwain: Also, what kind of failure did you encounter? I've run your code and the result seems to be correct and the log is clean. Of course, I had to name the character variable XYZ, not DIAGNOSIS, to make the code work.
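
One thing that may be worth checking in the original step: `negative_malignancy=1;` sits outside the `if ... then` statement, so it runs whenever both words occur at least once, whatever their distance. The sketch below wraps the two statements in a `do ... end` block. The two test rows, the `length` statements, and the `dim()` guards against array overflow are additions for illustration, not part of the original code.

```sas
/* Hypothetical test rows: only the first one meets the distance criterion */
data test1;
  length xyz $200;
  xyz = 'Specimen negative for any evidence of malignancy'; output;
  xyz = 'Malignancy noted earlier, repeat biopsy negative'; output;
run;

data test2;
  set test1;
  length First Second $20;
  First  = 'negative';
  Second = 'malignancy';
  array firsts  (4) f1-f4;
  array seconds (4) s1-s4;
  Findex = 1;
  Sindex = 1;
  do i = 1 to countw(xyz);
    /* record the word position of each occurrence, up to the array size */
    if upcase(First) = upcase(scan(xyz, i)) and Findex le dim(firsts) then do;
      firsts[Findex] = i;
      Findex = Findex + 1;
    end;
    if upcase(Second) = upcase(scan(xyz, i)) and Sindex le dim(seconds) then do;
      seconds[Sindex] = i;
      Sindex = Sindex + 1;
    end;
  end;

  negative_malignancy = 0;
  do i = 1 to n(of firsts(*));
    do j = 1 to n(of seconds(*));
      /* the flag is now set only when the distance test passes */
      if 0 < seconds[j] - firsts[i] le 6 then do;
        put First "occurs in position" +1 firsts[i] "and" +1 Second "occurs at position" +1 seconds[j];
        negative_malignancy = 1;
      end;
    end;
  end;
  drop i j Findex Sindex;
run;
```

With these two rows, the first should end up with `negative_malignancy=1` and the second with `0`. The original step would set the flag to 1 for both.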

Pyrite | Level 9

## Re: Flagging records based on word distance criteria

Thank you for your feedback. Appreciated.

Swain