Text mining and content categorization

How to use Prxmatch to match all substrings

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 10
Accepted Solution

How to use Prxmatch to match all substrings

Hi all,

 

I would like to create an indicator variable for successfully matching all the sub-string. I can do the following, for example:

 

if count(source_var, "substring1") >0 and count(source_var, "substring2")>0 and count(source_var, "substring3")>0 then indicator=1;

else indicator=0;

 

But I believe there is a much quicker way to do it using Prxmatch syntax....please advise if you have a sample Prxmatch code to achieving this task in SAS. Thank you very much in advance!


Accepted Solutions
Solution
‎03-04-2017 04:53 PM
Grand Advisor
Posts: 10,052

Re: How to use Prxmatch to match all substrings

Medication names are not what I would think of are the best candidates for PRX as the names are not consistent.

 

indicator = (indexw(source_var,"ABACAVIR",'/') >0 and indexw(source_var,"EMTRICITABINE",'/') >0 );

should work for two to find both and extend with more if need more in an "and" useage, or nesting of () with or conditions.

 

How many is a 'Large number of meds' and are you discussing mutliple regimens in combination? 

Here may be one way to work with varying number of concerned lists.

data example;
   informat source_var $200.;
   input source_var;
   /* the $25 below should be set large enough to hold the longest names expected*/
   array reg1 {2} $25 _temporary_ ("ABACAVIR", "EMTRICITABINE");
   array reg2 {3} $25 _temporary_ ("ABACAVIR","CIPROFLOXACIN", "DOXYCYCLINE");
   /* strip is needed as the temporary array elements get padded to the length of the set length*/
   /* use indexw to reduce chance of compounds considered matches*/
   indreg1 = ( indexw(source_var,strip(reg1[1]),'/') >0 and indexw(source_var,strip(reg1[2]),'/') >0 );
   /* loop to go over longer lists*/
   counter=0;
   do i = 1 to dim(reg2);
      counter = sum(counter,( indexw(source_var,strip(reg2[i]),'/') >0));
   end;
   indreg2= (counter=3); /* 3 indicates all 3 are found*/
   drop i counter;
   label 
      indreg1='Indicator for regime 1'
      indreg2='Indicator for regime 2'
   ;
        
datalines;
ABACAVIR/COBICISTAT/ELVITEGRAVIR/EMTRICITABINE/TENOFOVIR
ABACAVIR/COBICISTAT/ELVITEGRAVIR/CIPROFLOXACIN/TENOFOVIR
ABACAVIR/DOXYCYCLINE/ELVITEGRAVIR/EMTRICITABINE/TENOFOVIR
ABACAVIR/DOXYCYCLINE/CIPROFLOXACIN/EMTRICITABINE/TENOFOVIR
;
run;

 

Note that INDEXW is case sensitive so you may want to upcase your source_var.

View solution in original post


All Replies
Contributor
Posts: 23

Re: How to use Prxmatch to match all substrings

Hi,

Try below

 

if prxmatch ("m/substring1|substring2|substring3/i",source_var) > 0 then indicator=1 else indicator =0 ;

Occasional Contributor
Posts: 10

Re: How to use Prxmatch to match all substrings

Thanks for your input! I tried your code, but it seems to address the situation where at least one of the substrings is found in source_var, as opposed to that all of the substrings are found in source_var (which is what I want to do). In other words, the code is more similar to count(source_var, "substring1") >0 OR count(source_var, "substring2")>0 OR count(source_var, "substring3")>0
Grand Advisor
Posts: 10,052

Re: How to use Prxmatch to match all substrings

You might post some example data and the searched substrings. If the searched substrings follow patterns then the PRX approach would likely work but if the targets are things like "This String", "bdQ rx - 12345" and "(**)brg" then PRX arent' going to help.

 

If you don't have more that a few of these targets you may want to look at:

indicator= (sum( count(source_var, "substring1") >0 , count(source_var, "substring2")>0, count(source_var, "substring3")>0) > 0);

 

or the index function as count may be overkill if presence is the only requirement.

 

Also is case  insensitivity involved? 

Occasional Contributor
Posts: 10

Re: How to use Prxmatch to match all substrings

Thank you for your feedback!

Basically, I have this text string variable for medication regimen (for example, "ABACAVIR/COBICISTAT/ELVITEGRAVIR/EMTRICITABINE/TENOFOVIR"). Each medication is separated by "/". I want to create an indicator variable to capture the medication regimen that includes "ABACAVIR" and "EMTRICITABINE" ( the order does not matter). That's why I was using count() function but I thought there is probably a much quicker way to do it without having to do multiple count() or index() functions in a statement, especially to save time when I need to search for a large numbers of meds.

Solution
‎03-04-2017 04:53 PM
Grand Advisor
Posts: 10,052

Re: How to use Prxmatch to match all substrings

Medication names are not what I would think of are the best candidates for PRX as the names are not consistent.

 

indicator = (indexw(source_var,"ABACAVIR",'/') >0 and indexw(source_var,"EMTRICITABINE",'/') >0 );

should work for two to find both and extend with more if need more in an "and" useage, or nesting of () with or conditions.

 

How many is a 'Large number of meds' and are you discussing mutliple regimens in combination? 

Here may be one way to work with varying number of concerned lists.

data example;
   informat source_var $200.;
   input source_var;
   /* the $25 below should be set large enough to hold the longest names expected*/
   array reg1 {2} $25 _temporary_ ("ABACAVIR", "EMTRICITABINE");
   array reg2 {3} $25 _temporary_ ("ABACAVIR","CIPROFLOXACIN", "DOXYCYCLINE");
   /* strip is needed as the temporary array elements get padded to the length of the set length*/
   /* use indexw to reduce chance of compounds considered matches*/
   indreg1 = ( indexw(source_var,strip(reg1[1]),'/') >0 and indexw(source_var,strip(reg1[2]),'/') >0 );
   /* loop to go over longer lists*/
   counter=0;
   do i = 1 to dim(reg2);
      counter = sum(counter,( indexw(source_var,strip(reg2[i]),'/') >0));
   end;
   indreg2= (counter=3); /* 3 indicates all 3 are found*/
   drop i counter;
   label 
      indreg1='Indicator for regime 1'
      indreg2='Indicator for regime 2'
   ;
        
datalines;
ABACAVIR/COBICISTAT/ELVITEGRAVIR/EMTRICITABINE/TENOFOVIR
ABACAVIR/COBICISTAT/ELVITEGRAVIR/CIPROFLOXACIN/TENOFOVIR
ABACAVIR/DOXYCYCLINE/ELVITEGRAVIR/EMTRICITABINE/TENOFOVIR
ABACAVIR/DOXYCYCLINE/CIPROFLOXACIN/EMTRICITABINE/TENOFOVIR
;
run;

 

Note that INDEXW is case sensitive so you may want to upcase your source_var.

Occasional Contributor
Posts: 10

Re: How to use Prxmatch to match all substrings

Thank you very much for the clarification.
☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 179 views
  • 1 like
  • 3 in conversation