I'm using SAS University Edition and I need to create dummy variables for my data, but the data I'm using is spread across multiple columns. Each row is a patient's hospital visit. I would like to create a dummy variable for each individual diagnosis code.
Here's an example of what my data looks like:
And here's what I want it to look like in the end:
Another complicating matter is that there are hundreds of Diagnosis columns so I would prefer a way that doesn't require typing them all out. Can anyone offer advice?
Hundreds of names isn't a problem. But some other issues are. For example, what do these diagnosis codes look like? Are they valid names for variables in SAS? Perhaps they begin with a number, or perhaps they contain illegal characters such as a decimal point. Or perhaps they contain so many characters that they would exceed SAS's limit of 32 characters for a name. Will the spelling be consistent from one instance to the next?
If they start with a number, how would you like to handle that? Could the word "Diag_" be added as the first part of the name?
The diagnosis codes are valid names for variables-- they look like K8001, S722XA, etc. and they would not exceed the limit of 32 characters. They all start with a letter but are a combinations of letters and numbers after that. The spelling will be consistent as well. If it helps, they are ICD-10-CM diagnosis codes.
That's a good start.
To process the data, then, begin by getting lists of all possible diagnosis codes:
proc sql;
create table diag1 as select distinct diagnosis1 as diagnosis from have order by diagnosis;
create table diag2 as select distinct diagnosis2 as diagnosis from have order by diagnosis;
create table diag3 as select distinct diagnosis3 as diagnosis from have order by diagnosis;
quit;
data all_diags;
merge diag1 diag2 diag3;
by diagnosis;
if diagnosis > ' ';
run;
This gives you a unique list of all codes found in the data.
Next, create a macro variable holding those codes:
proc sql;
select diagnosis into : all_codes separated by ' ' from all_diags;
quit;
%put &all_codes;
You don't need the %PUT statement, but it might help illustrate what is happening at this point.
Finally, use that list in a DATA step and search for matches. Assuming 6 characters are needed for the longest code:
data want;
set have;
array diags {*} $ 6 &all_codes;
do _n_ = 1 to dim(diags);
diags{_n_} = 0;
if diagnosis1 = vname(diags{_n_}) then diags{_n_} = 1;
if diagnosis2 = vname(diags{_n_}) then diags{_n_} = 1;
if diagnosis3 = vname(diags{_n_}) then diags{_n_} = 1;
end;
run;
I'm assuming here that you really have just 3 diagnoses per hospital stay. You could always add a few more to the program if need be.
The code is untested. It looks right, so see if it works for what you need.
All of this seems to work up until the last section of code. Then I get the following error:
Do you know what is wrong or how to get around this error?
data have;
input id (d1-d3) ($);
cards;
1 A B C
2 Y B D
3 Z B .
;
run;
data temp;
set have;
d=d1;v=1;output;
d=d2;v=1;output;
d=d3;v=1;output;
run;
proc transpose data=temp(where=(d is not missing)) out=temp1;
by id;
id d;
var v;
run;
proc stdize data=temp1 out=want reponly missing=0;
var _numeric_;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.