BookmarkSubscribeRSS Feed
_maldini_
Barite | Level 11

I need to categorize prescription drug information using data that was self-reported in a survey in an open field / text box. 

 

rxsublist is a variable whose values are character strings. It was an open field in a survey. A typical example reads, "Paracetamol, Ibuprofen, Tramadol, valium" or "Paxil, Neurontin, Serzone, Abilify, Zoloft"

 

I want SAS to do the following:

If rxsublist contains "Paracetamol" OR "Ibuprofen" then newvar_nsaid = 1, if not newvar_said=0.

If rxsublist contains "Tramadol" OR "percocet" then newvar_opiate=1, if not newvar_optiate=0;

If rxsublist contains "valium" OR "xanax OR "Xanax" then newvar_benzo=1, if not newvar_benzo=0; 

 

I have tried using the IN operator, and multiple OR statements, but I can't figure it out.

 

What is the best way to do this?

15 REPLIES 15
ballardw
Super User

Why don't you post the error messages you got?

 

Since you imply that you did If/then/else that means you were in a data step and CONTAINS only works in Proc SQL. Also OR should have a comparison on both side: I a=1 OR b=3. If you use something like:

If a='this' or 'that' then 'that' is an invalid numeric variable and generates an error. If you use A=3 or 4 the 4 is considered the result of a comparison and any non-missing non-zero value is considered to be true by SAS . Therefore "if A=3 or 4" is always true.

 

The construct: If not newvar_said=0 is way incorrect. The construct would be If thisvalue then newvar_said=1; Else newvar_said=0;

NOT is an operator negating the truth value of the next variable or expression.

 

First you need to decide if your strings only exist as complete words or if they may occur imbedded in other strings.

Second to use equality comparisons such as IN then make sure the case you compare is correct. Your example implies that letter case is not consistent. IN is the equivalent of If this=that or this=somethingelse or this=yetanothervalue. You might pull each word out of your list separately but

 

Try something like this:

data want;

   set have;

   rxsublist= upcase(rxsublist);

  NewVar_said = (index(rxsublist,'PARACETAMOL')>0 or index(rxsublist,'IBUPROFEN')>0 );

  NewVar_opiate = (index(rxsublist,'TRAMADOL')>0 or index(rxsublist,'PERCOCET')>0 );

  NewVar_benzo = (index(rxsublist,'VALIUM')>0 or index(rxsublist,'XANAX')>0 );

run;

 

This uses the SAS behavior of assigning the value 1 to a true comparison and 0 to false. So the outside ( ) tells SAS to treat everything within as a logical comparison and return 1 or 0.

 

Another approach could be breaking rxsublist into individual words and some array processing but depending on how many values are in the entries and how many specific ones you have to search for other approaches may become more robust.

PGStats
Opal | Level 21

You could use

 

newvar_said = findw(rxsublist, "paracetamol", ,"i") or findw(rxsublist, "ibuprofen", ,"i");

etc. The 'i' modifier makes the match case insensitive.

PG
_maldini_
Barite | Level 11

@PGStats  W/ your suggested syntax, what should the value of newvar_said if the expression is true?

 

Thanks for your help!

PGStats
Opal | Level 21

SAS codes a true value as 1 and a false value as 0.  When evaluating a logical expression, missing values and zeros are considered false and anything else is considered true. FINDW returns 0 if the word is not found and a positive number when the word is found. 

PG
_maldini_
Barite | Level 11

@PGStats That's what I thought...When I include this syntax in the DATA step, alll the values in the newvar_said equal zero, despite the value of the variable being equal to the quoted string (i.e. findw(rxsublist, "ibuprofen", ,"i")...What am I missing? Thanks!

 

 

 

 

PGStats
Opal | Level 21

Sorry about that. It appears that leaving out the third argument to function FINDW doesnt have the effect I expected. Try this instead:

 

newvar_said = findw(rxsublist, "paracetamol", " ", "pio") or findw(rxsublist, "ibuprofen", " ", "pio");

The p modifier adds punctuation marks to the list of delimiters. The o modifier request some optimisation.

PG
_maldini_
Barite | Level 11

@PGStats Thanks, we are getting closer! One additional modification is needed however...and I should have made this clear at the onset. Sorry.

 

Some observations have values with multiple drugs in the same category. For example, "Effexor , Amytriptyline, Metformin, Valuim, Flomax,". 

 

In my categorization scheme, Effexor and Amytriptyline are both anti-depressants. When I use the following code, a value of 1 is generated for rx_anti_dep. What I really want is a value of 1 to be generated for EACH drug, so that the value in this scenario would be 2, 1 for Effexor and 1 for Amytriptyline.

 

rx_anti_dep = findw(rxsublist, "Amytriptyline", " ", "pio") OR

                       findw(rxsublist, "Effexor", " ", "pio");

 

Any suggestions?

PGStats
Opal | Level 21

I do.

 

rx_anti_dep = (findw(rxsublist, "Amytriptyline", " ", "pio") > 0) +
                       (findw(rxsublist, "Effexor", " ", "pio") > 0);
PG
_maldini_
Barite | Level 11

@PGStats I'm not exactly sure how to describe the change in the output by replacing "OR" w/ "+"...The value of rx_anti_dep = 12 for the observation "Effexor , Amytriptyline, Metformin, Valuim, Flomax,". It should equal 2. I'm not sure where the 12 comes from since there are more than 12 "+"s (See below).

 

rx_anti_dep = findw(rxsublist, "Prozac", " ", "pio") +
findw(rxsublist, "prozac", " ", "pio") +
findw(rxsublist, "Paxcil", " ", "pio") +
findw(rxsublist, "trazadone", " ", "pio") +
findw(rxsublist, "effexor", " ", "pio") +
findw(rxsublist, "depression meds", " ", "pio") +
findw(rxsublist, "anti-depressant med", " ", "pio") +
findw(rxsublist, "anti-depressants", " ", "pio") +
findw(rxsublist, "Welbutrin", " ", "pio") +
findw(rxsublist, "welbutrin", " ", "pio") +
findw(rxsublist, "wellbutrin", " ", "pio") +
findw(rxsublist, "Wellbutrin", " ", "pio") +
findw(rxsublist, "Celexa", " ", "pio") +
findw(rxsublist, "celexa", " ", "pio") +
findw(rxsublist, "Anti-depressants", " ", "pio") +
findw(rxsublist, "antidepressants", " ", "pio") +
findw(rxsublist, "Depression Pills", " ", "pio") +
findw(rxsublist, "many anti depression meds", " ", "pio") +
findw(rxsublist, "Lexapro", " ", "pio") +
findw(rxsublist, "zoloft", " ", "pio") +
findw(rxsublist, "Zoloft", " ", "pio") +
findw(rxsublist, "Esciotalopram", " ", "pio") +
findw(rxsublist, "venlafaxine", " ", "pio") +
findw(rxsublist, "anti-depressants", " ", "pio") +
findw(rxsublist, "amitriptyline", " ", "pio") +
findw(rxsublist, "Amytriptyline", " ", "pio") +
findw(rxsublist, "SSRIs", " ", "pio") +
findw(rxsublist, "Venlafaxine", " ", "pio") +
findw(rxsublist, "amyitriptaline", " ", "pio") +
findw(rxsublist, "mirtazipine", " ", "pio") +
findw(rxsublist, "Citalopram", " ", "pio");

PGStats
Opal | Level 21

Look again at what I proposed... (FINDW(...)>0) + ...  

also, do not repeat the check for different upcase/lowcase combinations of the same word. 

PG
Shmuel
Garnet | Level 18

 

use a code like:

 

data want;

  set have;

        length  drugx catg1-catg4     $15;    /* assuming 3 categories + other */

        drop i drugx;

        call missing(catg1-catg4);

 

 

       no_of_drugs = countw(rxsublist,',  ');

       do i=1 to no_of_drugs;

            drugx = lowcase(scan(rxsublist, i));

            if drugx in ("paracetamol"  "Ibuprofen" ) then catg1 = 1; else

            if drugx in ("tramadol"  "percocet")         then catg2 = 1; else

            if drugx in ( "valium"  "xanax")                then catg3 = 1; else catg4 = 1;

       end;

run;

   

            

            

       

_maldini_
Barite | Level 11

@Shmuel Thanks for your help. Your approach was my original approach, but it seemed like the IN Operator was struggling when the value contained more than one string (i.e. Drug or quoted word/phrase). For example, if the value was "Zoloft, Xanax, Immitrex, Pain killers/Muscle Relaxers.", a value of 1 was generated for "Zoloft" (i.e. catg5, let's say), but a value of missing was generated for catg3 (i.e. Xanax) and catg6 (i.e. Immitrex), etc.. If the only value in the observation was "Zoloft" this wasn't a problem, but that is rarely the case in this dataset.

 

I just substituted your exact code and the results are similar, except...there are instances where it actually worked (See Observation 1767 in the attached). In the vast majority of observations however, only one value was generated for one new variable. In a small number of cases, the syntax seems to have worked correctly (i.e. Observation 1767). I don't understand why b/c there are no observable differences in the values and the syntax is the same. Hmmm....

Shmuel
Garnet | Level 18

In your first post you wrote:

If rxsublist contains "Paracetamol" OR "Ibuprofen" then newvar_nsaid = 1, if not newvar_said=0.
If rxsublist contains "Tramadol" OR "percocet" then newvar_opiate=1, if not newvar_optiate=0;
If rxsublist contains "valium" OR "xanax OR "Xanax" then newvar_benzo=1, if not newvar_benzo=0; 

I used the term category as group of drugs, thus defining catg1-catg4 according to above couples given.

 

Let us look at rxsublist =  "Zoloft, Xanax, Immitrex, Pain killers/Muscle Relaxers."

then the code  no_of_drugs = countw(rxsublist,',  ');  will result in no_of_drugs = 6 words 

because I have defind only comma and space as delimiters. A slight change, adding slash

to the delimiters:  no_of_drugs = countw(rxsublist,',  /');  will result in no_of_drugs = 7 words.

 

In this case not all words are drugs. As you have descriptive words and other non drugs names,

then catg4 used as other drug may be used as a control  to look for  skipped/forgotten drugs, in order to add then to the code and rerun. You can even check for such words without assigning any category, even not other category, like:

       if drugx in ("and" "or" "other" "for" "like" ... etc ...) then; else

 

You can define categories one per each drug or per group of drugs, as you find it most applicable.

in case of category catg3 used for drug "Xanax" only, then you can add a line like:

     if drugx = "xanax" then catg3 = 1;  else

just remeber that drugx contains lowcase words.

 

I'm not sure if I have unswered all your questions. I have looked at the table result you have attached,

but I miss the logic - your code - given in the category definitions.

 

If you have more questions, focus on the issue giving: rxsublist, the relevant code line, the result of this code-line

and the expected result.

Patrick
Opal | Level 21

@_maldini_

 

I would go for an approach using formats as this will help you to avoid "death by coding".

Depending on your real data you will probably have to amend the character used as delimiter in the scan() function (it's currently a blank).

 

/** option 1 **/
proc format;
  invalue $ DrugCat (just upcase)
    'PROZAC' = 'Group 1'
    'PAXCIL' = 'Group 1'
    'TRAZADONE' = 'Group 2'
    'EFFEXOR' = 'Group 2'
    other = 'other'
    ;
run;

data sample_long(keep=pat_id Drug DrugCat);
  infile datalines truncover;
  input have_str $100.;

  pat_id=1;
  length Drug $20;
  do _i=1 to countw(have_str,' ');
    Drug=scan(have_str,_i,' ');
    DrugCat=input(Drug,$DrugCat.);
    output;
  end;
datalines;
trazadone effexor prozac paxcil
effexor prozac paxcil
trazadone  prozac paxcil
trazadone something
;
run;

/** option 2 **/
proc format;
  invalue DrugCatNum (just upcase)
    'PROZAC' = 1
    'PAXCIL' = 1
    'TRAZADONE' = 2
    'EFFEXOR' = 2
    other = 99
    ;
run;

data sample_wide(drop=_:);
  infile datalines truncover;
  input have_str $100.;

  pat_id=1;
  length _Drug $20;
  array arr_DrugCatCnt [*} DrugCatCnt_1 - DrugCatCnt_2 DrugCatCnt_99;
  do _i=1 to countw(have_str,' ');
    _Drug=scan(have_str,_i,' ');
    _DrugCat=input(_Drug,DrugCatNum.);
    if _DrugCat=99 then _ind=dim(arr_DrugCatCnt); else _ind=_DrugCat;
    arr_DrugCatCnt[_ind]=sum(arr_DrugCatCnt[_ind],1);
  end;
datalines;
trazadone effexor prozac paxcil
effexor prozac paxcil
trazadone  prozac paxcil
trazadone something
;
run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 15 replies
  • 1970 views
  • 7 likes
  • 5 in conversation