DATA Step, Macro, Functions and more

Categorizing from a Text Field - Rx Drug Categorizaiton

Reply
Regular Contributor
Posts: 199

Categorizing from a Text Field - Rx Drug Categorizaiton

I need to categorize prescription drug information using data that was self-reported in a survey in an open field / text box. 

 

rxsublist is a variable whose values are character strings. It was an open field in a survey. A typical example reads, "Paracetamol, Ibuprofen, Tramadol, valium" or "Paxil, Neurontin, Serzone, Abilify, Zoloft"

 

I want SAS to do the following:

If rxsublist contains "Paracetamol" OR "Ibuprofen" then newvar_nsaid = 1, if not newvar_said=0.

If rxsublist contains "Tramadol" OR "percocet" then newvar_opiate=1, if not newvar_optiate=0;

If rxsublist contains "valium" OR "xanax OR "Xanax" then newvar_benzo=1, if not newvar_benzo=0; 

 

I have tried using the IN operator, and multiple OR statements, but I can't figure it out.

 

What is the best way to do this?

Super User
Posts: 10,500

Re: Categorizing from a Text Field - Rx Drug Categorizaiton

Why don't you post the error messages you got?

 

Since you imply that you did If/then/else that means you were in a data step and CONTAINS only works in Proc SQL. Also OR should have a comparison on both side: I a=1 OR b=3. If you use something like:

If a='this' or 'that' then 'that' is an invalid numeric variable and generates an error. If you use A=3 or 4 the 4 is considered the result of a comparison and any non-missing non-zero value is considered to be true by SAS . Therefore "if A=3 or 4" is always true.

 

The construct: If not newvar_said=0 is way incorrect. The construct would be If thisvalue then newvar_said=1; Else newvar_said=0;

NOT is an operator negating the truth value of the next variable or expression.

 

First you need to decide if your strings only exist as complete words or if they may occur imbedded in other strings.

Second to use equality comparisons such as IN then make sure the case you compare is correct. Your example implies that letter case is not consistent. IN is the equivalent of If this=that or this=somethingelse or this=yetanothervalue. You might pull each word out of your list separately but

 

Try something like this:

data want;

   set have;

   rxsublist= upcase(rxsublist);

  NewVar_said = (index(rxsublist,'PARACETAMOL')>0 or index(rxsublist,'IBUPROFEN')>0 );

  NewVar_opiate = (index(rxsublist,'TRAMADOL')>0 or index(rxsublist,'PERCOCET')>0 );

  NewVar_benzo = (index(rxsublist,'VALIUM')>0 or index(rxsublist,'XANAX')>0 );

run;

 

This uses the SAS behavior of assigning the value 1 to a true comparison and 0 to false. So the outside ( ) tells SAS to treat everything within as a logical comparison and return 1 or 0.

 

Another approach could be breaking rxsublist into individual words and some array processing but depending on how many values are in the entries and how many specific ones you have to search for other approaches may become more robust.

Respected Advisor
Posts: 4,649

Re: Categorizing from a Text Field - Rx Drug Categorizaiton

You could use

 

newvar_said = findw(rxsublist, "paracetamol", ,"i") or findw(rxsublist, "ibuprofen", ,"i");

etc. The 'i' modifier makes the match case insensitive.

PG
Regular Contributor
Posts: 199

Re: Categorizing from a Text Field - Rx Drug Categorizaiton

@PGStats  W/ your suggested syntax, what should the value of newvar_said if the expression is true?

 

Thanks for your help!

Respected Advisor
Posts: 4,649

Re: Categorizing from a Text Field - Rx Drug Categorizaiton

SAS codes a true value as 1 and a false value as 0.  When evaluating a logical expression, missing values and zeros are considered false and anything else is considered true. FINDW returns 0 if the word is not found and a positive number when the word is found. 

PG
Regular Contributor
Posts: 199

Re: Categorizing from a Text Field - Rx Drug Categorizaiton

@PGStats That's what I thought...When I include this syntax in the DATA step, alll the values in the newvar_said equal zero, despite the value of the variable being equal to the quoted string (i.e. findw(rxsublist, "ibuprofen", ,"i")...What am I missing? Thanks!

 

 

 

 

Respected Advisor
Posts: 4,649

Re: Categorizing from a Text Field - Rx Drug Categorizaiton

Sorry about that. It appears that leaving out the third argument to function FINDW doesnt have the effect I expected. Try this instead:

 

newvar_said = findw(rxsublist, "paracetamol", " ", "pio") or findw(rxsublist, "ibuprofen", " ", "pio");

The p modifier adds punctuation marks to the list of delimiters. The o modifier request some optimisation.

PG
Regular Contributor
Posts: 199

Re: Categorizing from a Text Field - Rx Drug Categorizaiton

@PGStats Thanks, we are getting closer! One additional modification is needed however...and I should have made this clear at the onset. Sorry.

 

Some observations have values with multiple drugs in the same category. For example, "Effexor , Amytriptyline, Metformin, Valuim, Flomax,". 

 

In my categorization scheme, Effexor and Amytriptyline are both anti-depressants. When I use the following code, a value of 1 is generated for rx_anti_dep. What I really want is a value of 1 to be generated for EACH drug, so that the value in this scenario would be 2, 1 for Effexor and 1 for Amytriptyline.

 

rx_anti_dep = findw(rxsublist, "Amytriptyline", " ", "pio") OR

                       findw(rxsublist, "Effexor", " ", "pio");

 

Any suggestions?

Respected Advisor
Posts: 4,649

Re: Categorizing from a Text Field - Rx Drug Categorizaiton

I do.

 

rx_anti_dep = (findw(rxsublist, "Amytriptyline", " ", "pio") > 0) +
                       (findw(rxsublist, "Effexor", " ", "pio") > 0);
PG
Regular Contributor
Posts: 199

Re: Categorizing from a Text Field - Rx Drug Categorizaiton

@PGStats I'm not exactly sure how to describe the change in the output by replacing "OR" w/ "+"...The value of rx_anti_dep = 12 for the observation "Effexor , Amytriptyline, Metformin, Valuim, Flomax,". It should equal 2. I'm not sure where the 12 comes from since there are more than 12 "+"s (See below).

 

rx_anti_dep = findw(rxsublist, "Prozac", " ", "pio") +
findw(rxsublist, "prozac", " ", "pio") +
findw(rxsublist, "Paxcil", " ", "pio") +
findw(rxsublist, "trazadone", " ", "pio") +
findw(rxsublist, "effexor", " ", "pio") +
findw(rxsublist, "depression meds", " ", "pio") +
findw(rxsublist, "anti-depressant med", " ", "pio") +
findw(rxsublist, "anti-depressants", " ", "pio") +
findw(rxsublist, "Welbutrin", " ", "pio") +
findw(rxsublist, "welbutrin", " ", "pio") +
findw(rxsublist, "wellbutrin", " ", "pio") +
findw(rxsublist, "Wellbutrin", " ", "pio") +
findw(rxsublist, "Celexa", " ", "pio") +
findw(rxsublist, "celexa", " ", "pio") +
findw(rxsublist, "Anti-depressants", " ", "pio") +
findw(rxsublist, "antidepressants", " ", "pio") +
findw(rxsublist, "Depression Pills", " ", "pio") +
findw(rxsublist, "many anti depression meds", " ", "pio") +
findw(rxsublist, "Lexapro", " ", "pio") +
findw(rxsublist, "zoloft", " ", "pio") +
findw(rxsublist, "Zoloft", " ", "pio") +
findw(rxsublist, "Esciotalopram", " ", "pio") +
findw(rxsublist, "venlafaxine", " ", "pio") +
findw(rxsublist, "anti-depressants", " ", "pio") +
findw(rxsublist, "amitriptyline", " ", "pio") +
findw(rxsublist, "Amytriptyline", " ", "pio") +
findw(rxsublist, "SSRIs", " ", "pio") +
findw(rxsublist, "Venlafaxine", " ", "pio") +
findw(rxsublist, "amyitriptaline", " ", "pio") +
findw(rxsublist, "mirtazipine", " ", "pio") +
findw(rxsublist, "Citalopram", " ", "pio");

Respected Advisor
Posts: 4,649

Re: Categorizing from a Text Field - Rx Drug Categorizaiton

Look again at what I proposed... (FINDW(...)>0) + ...  

also, do not repeat the check for different upcase/lowcase combinations of the same word. 

PG
Trusted Advisor
Posts: 1,378

Re: Categorizing from a Text Field - Rx Drug Categorizaiton

 

use a code like:

 

data want;

  set have;

        length  drugx catg1-catg4     $15;    /* assuming 3 categories + other */

        drop i drugx;

        call missing(catg1-catg4);

 

 

       no_of_drugs = countw(rxsublist,',  ');

       do i=1 to no_of_drugs;

            drugx = lowcase(scan(rxsublist, i));

            if drugx in ("paracetamol"  "Ibuprofen" ) then catg1 = 1; else

            if drugx in ("tramadol"  "percocet")         then catg2 = 1; else

            if drugx in ( "valium"  "xanax")                then catg3 = 1; else catg4 = 1;

       end;

run;

   

            

            

       

Regular Contributor
Posts: 199

Re: Categorizing from a Text Field - Rx Drug Categorizaiton

@Shmuel Thanks for your help. Your approach was my original approach, but it seemed like the IN Operator was struggling when the value contained more than one string (i.e. Drug or quoted word/phrase). For example, if the value was "Zoloft, Xanax, Immitrex, Pain killers/Muscle Relaxers.", a value of 1 was generated for "Zoloft" (i.e. catg5, let's say), but a value of missing was generated for catg3 (i.e. Xanax) and catg6 (i.e. Immitrex), etc.. If the only value in the observation was "Zoloft" this wasn't a problem, but that is rarely the case in this dataset.

 

I just substituted your exact code and the results are similar, except...there are instances where it actually worked (See Observation 1767 in the attached). In the vast majority of observations however, only one value was generated for one new variable. In a small number of cases, the syntax seems to have worked correctly (i.e. Observation 1767). I don't understand why b/c there are no observable differences in the values and the syntax is the same. Hmmm....

Trusted Advisor
Posts: 1,378

Re: Categorizing from a Text Field - Rx Drug Categorizaiton

In your first post you wrote:

If rxsublist contains "Paracetamol" OR "Ibuprofen" then newvar_nsaid = 1, if not newvar_said=0.
If rxsublist contains "Tramadol" OR "percocet" then newvar_opiate=1, if not newvar_optiate=0;
If rxsublist contains "valium" OR "xanax OR "Xanax" then newvar_benzo=1, if not newvar_benzo=0; 

I used the term category as group of drugs, thus defining catg1-catg4 according to above couples given.

 

Let us look at rxsublist =  "Zoloft, Xanax, Immitrex, Pain killers/Muscle Relaxers."

then the code  no_of_drugs = countw(rxsublist,',  ');  will result in no_of_drugs = 6 words 

because I have defind only comma and space as delimiters. A slight change, adding slash

to the delimiters:  no_of_drugs = countw(rxsublist,',  /');  will result in no_of_drugs = 7 words.

 

In this case not all words are drugs. As you have descriptive words and other non drugs names,

then catg4 used as other drug may be used as a control  to look for  skipped/forgotten drugs, in order to add then to the code and rerun. You can even check for such words without assigning any category, even not other category, like:

       if drugx in ("and" "or" "other" "for" "like" ... etc ...) then; else

 

You can define categories one per each drug or per group of drugs, as you find it most applicable.

in case of category catg3 used for drug "Xanax" only, then you can add a line like:

     if drugx = "xanax" then catg3 = 1;  else

just remeber that drugx contains lowcase words.

 

I'm not sure if I have unswered all your questions. I have looked at the table result you have attached,

but I miss the logic - your code - given in the category definitions.

 

If you have more questions, focus on the issue giving: rxsublist, the relevant code line, the result of this code-line

and the expected result.

Respected Advisor
Posts: 3,892

Re: Categorizing from a Text Field - Rx Drug Categorizaiton

@jcorroon

 

I would go for an approach using formats as this will help you to avoid "death by coding".

Depending on your real data you will probably have to amend the character used as delimiter in the scan() function (it's currently a blank).

 

/** option 1 **/
proc format;
  invalue $ DrugCat (just upcase)
    'PROZAC' = 'Group 1'
    'PAXCIL' = 'Group 1'
    'TRAZADONE' = 'Group 2'
    'EFFEXOR' = 'Group 2'
    other = 'other'
    ;
run;

data sample_long(keep=pat_id Drug DrugCat);
  infile datalines truncover;
  input have_str $100.;

  pat_id=1;
  length Drug $20;
  do _i=1 to countw(have_str,' ');
    Drug=scan(have_str,_i,' ');
    DrugCat=input(Drug,$DrugCat.);
    output;
  end;
datalines;
trazadone effexor prozac paxcil
effexor prozac paxcil
trazadone  prozac paxcil
trazadone something
;
run;

/** option 2 **/
proc format;
  invalue DrugCatNum (just upcase)
    'PROZAC' = 1
    'PAXCIL' = 1
    'TRAZADONE' = 2
    'EFFEXOR' = 2
    other = 99
    ;
run;

data sample_wide(drop=_:);
  infile datalines truncover;
  input have_str $100.;

  pat_id=1;
  length _Drug $20;
  array arr_DrugCatCnt [*} DrugCatCnt_1 - DrugCatCnt_2 DrugCatCnt_99;
  do _i=1 to countw(have_str,' ');
    _Drug=scan(have_str,_i,' ');
    _DrugCat=input(_Drug,DrugCatNum.);
    if _DrugCat=99 then _ind=dim(arr_DrugCatCnt); else _ind=_DrugCat;
    arr_DrugCatCnt[_ind]=sum(arr_DrugCatCnt[_ind],1);
  end;
datalines;
trazadone effexor prozac paxcil
effexor prozac paxcil
trazadone  prozac paxcil
trazadone something
;
run;
Ask a Question
Discussion stats
  • 15 replies
  • 510 views
  • 7 likes
  • 5 in conversation