Re: Creating new variables from multiple character variable responses

jessho · Posted 05-17-2019 06:53 AM

Hi all,
I have two character variables (med1, med2) for medications prescribed in my dataset. I want to create a cleaner/rolled up third variable that aggregates all similar medications so that I can put this new variable into my logistic regression model.

Med1 Med2
Buspirone 50mg. Citalopram chloride
Buspirone 10mg. Buspirone 10
Buspirone. Citalopram 5mg
Citalopram 20mg. Citalopram

In the data, citalopram and buspirone are written many different ways, sometimes with dosages and full compound names. I am trying to creat a new variable, MedAgg, that combines all the instances these meds are prescribed into just buspirone and citalopram and from there I can assign them numerical values as my goal is to use Medagg as a categorical variable in my regression.

Thanks in advance!

jessho · Posted 05-17-2019 06:53 AM

Astounding · Posted 05-17-2019 10:29 AM

This will involve a repetitive process, where you compile a list, check the data, and adjust the list. In more detail, begin by compiling a list of all values in your data set:

data master_list;
   set have;
   entered_value = med1;
   output;
   entered_value = med2;
   output;
   keep entered_value;
run;

proc sort data=master_list nodupkey;
   by entered_value;
run;

Then add the value you would like to use when aggregating. For example:

data perm.master_list;
   set master_list;
   length desired_value $ 10;
   if index(upcase(entered_value), 'BUSPIRONE')
   then desired_value = 'Buspirone';
   if index(upcase(entered_value), 'CITALOPRAM')
   then desired_value = 'Citalopram';
run;

As indicated, save that data set permanently. You will need to make changes to it, based on what it contains. To do that:

proc print data=perm.master_list;
   where desired_value = ' ';
run;

That will display the values that could not be translated properly because of mis-spellings. You will need to fill in the proper value for DESIRED_VALUE in the permanent master list.

Once you have gotten that far, we can talk about ways to utilize the data set. You may need to supply more details for your plan. It's not clear how a single new variable will account for values of two existing variables.

jessho · Posted 05-17-2019 12:23 PM

That makes sense -- when compiling the list of all values in the first step, do I need to list out every single value for each variable -- there are over 1000 with about 500 uniquely written. Is there a way to do that first step more expeditiously, or can I do that first step for only the values I care about (e.g., all the different strengths and ways buspirone is written). Does that make sense?

Astounding · Posted 05-17-2019 12:29 PM

No, you don't need to list out all values. Note that the PROC PRINT uses a WHERE clause. So it only prints those that could not be translated (probably due to poor spelling on the original entry). Unless your data is really bad, most of them will not print. The PROC PRINT gives you a list of those that require human intervention to decide on the right categorization. Try it!

jessho · Posted 05-18-2019 04:08 PM

Hi!

This worked great. I went through this for my 7 medication variables (erxname1-erxname7), and now all the appropriate med names show up in my desired_value variable column. I have saved this is as master_file.

However, this new file no longer seems to have some of the variables from my original data set, work.testunder1. I want to do frequency tables with desired_value meds and diagnoses in my original data set. Any sense of what happened?

Here's set of my original code:

s*/
data master_list;
set work.testunder1;
entered_value = erxname1;
output;
entered_value = erxname2;
output;
entered_value = erxname3;
output;
entered_value = erxname4;
output;
entered_value = erxname5;
output;
entered_value = erxname6;
output;
entered_value = erxname7;
output;
keep entered_value;
run;

proc sort data=master_list ;
by entered_value;
run;

data master_list;
set master_list;

Astounding · Posted 05-18-2019 06:05 PM

A few points ...

There is nothing in your code that changes your original file in any way. You may have changed it, but not by using the program you have posted.

There are a couple of steps missing in creation of the master file. The PROC SORT should use the option NODUPKEY, to remove duplicate versions of the same ENTERED_VALUE. And the bottom of the final DATA step that creates DESIRED_VALUE does not appear.

It is shocking (in a good way) that there are no blank instances of DESIRED_VALUE. This is the first time I have ever seen a medical study with no typos or variations in spelling of a drug name. Possibly this study used an electronic data entry system with a checkbox for the drug name/dose.

These steps do not change your original data, and thus do not add DESIRED_VALUE to your original data. That is the next step once all the DESIRED_VALUEs are in place. These steps merely create a master list that will (somehow) be used going forward to change your original data.

You need to decide what structure to your original data will support the analysis that you want to perform. For example, should the program add 7 new variables with categorized values ("Buspirone" or "Citalopram") that match the more specific incoming values that appear in ERXNAME1 through ERXNAME7? There isn't a right and wrong answer for this step ... it all depends on what you want to do next, and what structure you need for the data going forward to perform that analysis.

jessho · Posted 05-19-2019 09:21 AM

Thank you -- added the n dupkey command. Oops, I think that final data step got truncated when I pasted. Yes, all the prescribing data was combed through once already before I got it, which likely fixed a lot of the problems.

Ideally, I would like the new desired_value to come into my original data_set (work.testunder1) as a categorical variable to make it easier to use in my analysis.

What would be the easiest way to do that?

ballardw · Posted 05-17-2019 10:43 AM

Please define the rules for determining "similar".

jessho · Posted 05-17-2019 12:24 PM

For instance, every instance buspirone is response for med1, I want to count as "buspirone." by every instance, I mean whether buspirone written as "buspirone" or "buspirone 10mg" or "buspirone daily" or buspirone chloride"

Astounding · Posted 05-19-2019 02:35 PM

You have 7 original variables. Are you saying that you want to reduce them to (for example) a single variable? Should BUSPIRONE be 1 if any of the 7 contains "Buspirone" and 0 otherwise?

jessho · Posted 05-19-2019 06:19 PM

No, want to create one categorical variable (medrx) that includes those 7 categories

Astounding · Posted 05-19-2019 07:47 PM

OK, here's another guess as to what you want:

"Buspirone Buspirone Citalopram Neither Citalopram Citalopram Neither"

all contained in a single long character variable.

If that's not it, you will have to provide an example.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away