This will involve a repetitive process, where you compile a list, check the data, and adjust the list. In more detail, begin by compiling a list of all values in your data set:
data master_list;
set have;
entered_value = med1;
output;
entered_value = med2;
output;
keep entered_value;
run;
proc sort data=master_list nodupkey;
by entered_value;
run;
Then add the value you would like to use when aggregating. For example:
data perm.master_list;
set master_list;
length desired_value $ 10;
if index(upcase(entered_value), 'BUSPIRONE')
then desired_value = 'Buspirone';
if index(upcase(entered_value), 'CITALOPRAM')
then desired_value = 'Citalopram';
run;
As indicated, save that data set permanently. You will need to make changes to it, based on what it contains. To do that:
proc print data=perm.master_list;
where desired_value = ' ';
run;
That will display the values that could not be translated properly because of mis-spellings. You will need to fill in the proper value for DESIRED_VALUE in the permanent master list.
Once you have gotten that far, we can talk about ways to utilize the data set. You may need to supply more details for your plan. It's not clear how a single new variable will account for values of two existing variables.
That makes sense -- when compiling the list of all values in the first step, do I need to list out every single value for each variable -- there are over 1000 with about 500 uniquely written. Is there a way to do that first step more expeditiously, or can I do that first step for only the values I care about (e.g., all the different strengths and ways buspirone is written). Does that make sense?
No, you don't need to list out all values. Note that the PROC PRINT uses a WHERE clause. So it only prints those that could not be translated (probably due to poor spelling on the original entry). Unless your data is really bad, most of them will not print. The PROC PRINT gives you a list of those that require human intervention to decide on the right categorization. Try it!
Hi!
This worked great. I went through this for my 7 medication variables (erxname1-erxname7), and now all the appropriate med names show up in my desired_value variable column. I have saved this is as master_file.
However, this new file no longer seems to have some of the variables from my original data set, work.testunder1. I want to do frequency tables with desired_value meds and diagnoses in my original data set. Any sense of what happened?
Here's set of my original code:
s*/
data master_list;
set work.testunder1;
entered_value = erxname1;
output;
entered_value = erxname2;
output;
entered_value = erxname3;
output;
entered_value = erxname4;
output;
entered_value = erxname5;
output;
entered_value = erxname6;
output;
entered_value = erxname7;
output;
keep entered_value;
run;
proc sort data=master_list ;
by entered_value;
run;
data master_list;
set master_list;
A few points ...
There is nothing in your code that changes your original file in any way. You may have changed it, but not by using the program you have posted.
There are a couple of steps missing in creation of the master file. The PROC SORT should use the option NODUPKEY, to remove duplicate versions of the same ENTERED_VALUE. And the bottom of the final DATA step that creates DESIRED_VALUE does not appear.
It is shocking (in a good way) that there are no blank instances of DESIRED_VALUE. This is the first time I have ever seen a medical study with no typos or variations in spelling of a drug name. Possibly this study used an electronic data entry system with a checkbox for the drug name/dose.
These steps do not change your original data, and thus do not add DESIRED_VALUE to your original data. That is the next step once all the DESIRED_VALUEs are in place. These steps merely create a master list that will (somehow) be used going forward to change your original data.
You need to decide what structure to your original data will support the analysis that you want to perform. For example, should the program add 7 new variables with categorized values ("Buspirone" or "Citalopram") that match the more specific incoming values that appear in ERXNAME1 through ERXNAME7? There isn't a right and wrong answer for this step ... it all depends on what you want to do next, and what structure you need for the data going forward to perform that analysis.
Thank you -- added the n dupkey command. Oops, I think that final data step got truncated when I pasted. Yes, all the prescribing data was combed through once already before I got it, which likely fixed a lot of the problems.
Ideally, I would like the new desired_value to come into my original data_set (work.testunder1) as a categorical variable to make it easier to use in my analysis.
What would be the easiest way to do that?
Please define the rules for determining "similar".
For instance, every instance buspirone is response for med1, I want to count as "buspirone." by every instance, I mean whether buspirone written as "buspirone" or "buspirone 10mg" or "buspirone daily" or buspirone chloride"
No, want to create one categorical variable (medrx) that includes those 7 categories
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.