DATA Step, Macro, Functions and more

Flagging records based on word distance criteria

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 104
Accepted Solution

Flagging records based on word distance criteria

Hi there,

Following guidance from the sas community, I have been trying to flag records based on word distance criteria. Probably I am doing some mistake in syntax and not achieving success. I have attached a sample of 4 records to test my code given below:

Thank you in advance for your kind reply. 

data test2; 
set test1;
   First = 'negative' ;
   Second ='malignancy' ;
   array firsts (4)  f1-f4; /*assumes 1) that the first word won't occur more than 4 times*/
   Array seconds (4) s1-s4; 
   Findex=1;/* these index variables will point to where to store the word count in the arrays*/
   Sindex=1;
   do i = 1 to (countw(xyz));
      if upcase(First) = upcase(Scan(xyz,i)) then do;
         Firsts[Findex] = i;
         Findex = Findex+1;
      end;
      if upcase(Second) = upcase(Scan(xyz,i)) then do;
         Seconds[Sindex]=i;
         Sindex = Sindex +1;
      end;
   end;

   do i = 1 to (n(of Firsts(*)));
      do j = 1 to (n(of seconds(*)));
         if 0< seconds[j]- firsts[i] le 6 then 
            put First "occurs in position" +1 firsts[i] "and"+1 Second "occurs at position" +1 Seconds[j];
			negative_malignancy=1;
      end;
   end;

run;
Swain

Accepted Solutions
Solution
‎05-03-2016 04:23 PM
Respected Advisor
Posts: 4,934

Re: Flagging records based on word distance criteria

Posted in reply to DeepakSwain

At the risk of sounding insistent, I would favor treating this problem with regular expressions (I renamed your file Diagnosis.xls):

 

libname xl Excel "&sasforum\datasets\diagnosis.xls" access=readonly;

data test2;
First = 'negative' ;
Second ='malignancy' ;
prx = prxParse(cats("/(", First, ")(?:\W+\w+){0,6}\W+(", Second, ")\b/i"));
set xl.'test1$'n;
if prxMatch(prx, diagnosis) then do;
    call prxPosn(prx, 1, firstPos);
    call prxPosn(prx, 2, secondPos);
    end;
drop prx;
run;

proc sql;
select * from test2;
quit;
PG

View solution in original post


All Replies
Solution
‎05-03-2016 04:23 PM
Respected Advisor
Posts: 4,934

Re: Flagging records based on word distance criteria

Posted in reply to DeepakSwain

At the risk of sounding insistent, I would favor treating this problem with regular expressions (I renamed your file Diagnosis.xls):

 

libname xl Excel "&sasforum\datasets\diagnosis.xls" access=readonly;

data test2;
First = 'negative' ;
Second ='malignancy' ;
prx = prxParse(cats("/(", First, ")(?:\W+\w+){0,6}\W+(", Second, ")\b/i"));
set xl.'test1$'n;
if prxMatch(prx, diagnosis) then do;
    call prxPosn(prx, 1, firstPos);
    call prxPosn(prx, 2, secondPos);
    end;
drop prx;
run;

proc sql;
select * from test2;
quit;
PG
Frequent Contributor
Posts: 104

Re: Flagging records based on word distance criteria

Hi there,

Perl regular expression seems to be awesome. Can you modify it to put third word in the criteria for example;

first word= 'negative'

second word='for' 

third word ='malignancy'

the distance between first and second word = 1

the distance between second and third word le 6.

 

Thank you in advance for your kind guidance. Good learning experience.

 

Swain
Respected Advisor
Posts: 4,934

Re: Flagging records based on word distance criteria

Posted in reply to DeepakSwain

Extended version:

 

libname xl Excel "&sasforum\datasets\diagnosis.xls" access=readonly;

data test3;
First =  'negative' ;
Second = 'for';
Third  = 'malignancy' ;
prx = prxParse(cats("/(", First, ")\W+", Second, "(?:\W+\w+){0,5}\W+(", Third, ")\b/i"));
set xl.'test1$'n;
if prxMatch(prx, diagnosis) then do;
    call prxPosn(prx, 1, firstPos);
    call prxPosn(prx, 2, lastPos);
    end;
drop prx;
run;

proc sql;
select * from test3;
quit;
PG
Frequent Contributor
Posts: 104

Re: Flagging records based on word distance criteria

Hi PG,

 

For the same situation, the code provided by you is working fanstastic using cats function. Just to have a better understing of the regular expression syntax, I am trying to write the same code without using cats function.  

libname xl Excel "&sasforum\datasets\diagnosis.xls" access=readonly;

data test3;
First =  'negative' ;
Second = 'for';
Third  = 'malignancy' ;
prx = prxParse(cats("/(", First, ")\W+", Second, "(?:\W+\w+){0,5}\W+(", Third, ")\b/i"));
set xl.'test1$'n;
if prxMatch(prx, diagnosis) then do;
    call prxPosn(prx, 1, firstPos);
    call prxPosn(prx, 2, lastPos);
    end;
drop prx;
run;

proc sql;
select * from test3;
quit;

 I am trying to replace the 3 red lines code with a single line code:

prx =

prxParse("/("negative")\W+", for, "(?:\W+\w+){0,5}\W+(" malignancy")\b/i"));

 Due to limited experience, I am doing some syntax error. Can you please help me. 

 

Regards,

Deepak

Swain
Respected Advisor
Posts: 4,934

Re: Flagging records based on word distance criteria

Posted in reply to DeepakSwain

prx = prxParse("/(negative)\W+for(?:\W+\w+){0,5}\W+(malignancy)\b/io");

PG
Frequent Contributor
Posts: 104

Re: Flagging records based on word distance criteria

Hi there,

The solution provided is an excellent example of PERL. Being new to SAS, I having limited experience with this language. Unfortunately I have no idea about use of prxparse and cats function together. Is it possible to get some article or information materials showing examples of use of prxparse and cats  function together. I am interested to have better understanding about the syntax so that I can use it more frequently in the future.

 

Thank you in advance for your kind reply. 

 

Deepak

Swain
Respected Advisor
Posts: 4,934

Re: Flagging records based on word distance criteria

Posted in reply to DeepakSwain

cats is a simple string concatenation function that is used to build the search pattern. Regular expression matching functions are more difficult to master but quite powerful. There are many tutorials available on the net. The following example is a bit easier to understand than the previous:

 

data test;
xyz='She was prescribed exercise and diet. You may visit next week to take 
 further advice about medicine as well as as well diet. You must take diet 
 according to your dietician. Later we will think to revise your medicine';
run;

data _null_;
length pattern $50;
set test;
put xyz=/;
First = 'medicine' ;
do Second = "well", "diet", "you", "must", "take" ;
    pattern = cats("/", First, "(\W+\w+){0,6}\W+", Second, "\b/i");
    interest = prxmatch(pattern, xyz);
    put (pattern interest) (=/)/;
    end;
run;

Good luck!

PG
Frequent Contributor
Posts: 104

Re: Flagging records based on word distance criteria

Hi PG,

 

Thanks for showing your patience with my silly questions as well as for your kind reply. In last 24 hours I am having a nice learning experience in perl. Now your code is crystal clear to me and I can use similar logic in varoius situations. 

 

Regards,

Deepak

 

Swain
Trusted Advisor
Posts: 1,118

Re: Flagging records based on word distance criteria

Posted in reply to DeepakSwain

@DeepakSwain: Also, what kind of failure did you encounter? I've run your code and the result seems to be correct and the log is clean. Of course, I had to name the character variable XYZ, not DIAGNOSIS, to make the code work.

Frequent Contributor
Posts: 104

Re: Flagging records based on word distance criteria

Posted in reply to FreelanceReinhard

Thank you for your feedback. Appreciated. 

Swain
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 10 replies
  • 802 views
  • 2 likes
  • 3 in conversation