Hi, I'm trying to create a dummy variable for each character variable.
data have;
StudyID comorbidity1 comorbidity2 comorbidity3 comorbidty4
1 COPD HTN DM NONE
2 HTN COPD DM MENTAL_ILLNESS
data want;
StudyID COPD HTN DM MENTAL_ILLNESS
1 1 1 1 0
2 1 1 1 1
*what you start with;
DATA Have
;
LENGTH StudyID $8
Comorbidity1 $32
Comorbidity2 $32
Comorbidity3 $32
Comorbidity4 $32
;
INFILE Datalines MISSOVER;
INPUT StudyID $
Comorbidity1 $
Comorbidity2 $
Comorbidity3 $
Comorbidity4 $
;
DATALINES;
1 COPD HTN DM NONE
2 HTN COPD DM MENTAL_ILLNESS
3 DM
4 MENTAL_ILLNESS NONE
5 DM MENTAL_ILLNESS
;
RUN;
*transpose to long;
proc transpose data=have out=long;
by studyID;
var comorbidity1-Comorbidity4;
run;
*add indicator variable and delete empty rows;
data long2;
set long;
count=1;
if missing(col1) then delete;
run;
*transpose to wide again;
proc transpose data=long2 out=wide prefix=DX_;
by studyID;
id col1;
var count;
run;
*fill in missing with 0;
data want;
set wide;
array _dx(*) dx_:;
do i=1 to dim(_dx);
if missing(_dx(i)) then _dx(i) = 0;
end;
run;
Do not create dummy variables yourself. SAS has already done this for you in many procedures, so if you are trying to create an analysis or statistical model involving character variables, then the CLASS statement in most modelling procedures are all that you need.
*what you start with;
DATA Have
;
LENGTH StudyID $8
Comorbidity1 $32
Comorbidity2 $32
Comorbidity3 $32
Comorbidity4 $32
;
INFILE Datalines MISSOVER;
INPUT StudyID $
Comorbidity1 $
Comorbidity2 $
Comorbidity3 $
Comorbidity4 $
;
DATALINES;
1 COPD HTN DM NONE
2 HTN COPD DM MENTAL_ILLNESS
3 DM
4 MENTAL_ILLNESS NONE
5 DM MENTAL_ILLNESS
;
RUN;
*transpose to long;
proc transpose data=have out=long;
by studyID;
var comorbidity1-Comorbidity4;
run;
*add indicator variable and delete empty rows;
data long2;
set long;
count=1;
if missing(col1) then delete;
run;
*transpose to wide again;
proc transpose data=long2 out=wide prefix=DX_;
by studyID;
id col1;
var count;
run;
*fill in missing with 0;
data want;
set wide;
array _dx(*) dx_:;
do i=1 to dim(_dx);
if missing(_dx(i)) then _dx(i) = 0;
end;
run;
If you really needed the column names without the Dx_ prefix, a little macro that would strip them off can be written like so:
%MACRO Remove_Prefix_From_Var(Lib=, Set=, Pfx=, Debug=NO);
%LOCAL Cmnt;
%LOCAL Vars;
%IF %QUPCASE(&Debug) = YES %THEN
%DO;
%LET Cmnt = ;
%PUT &Nte1 Debug is on in macro &SysMacroName;
%END;
%ELSE
%DO;
%LET Cmnt = *;
%END;
PROC CONTENTS DATA=&Lib..&Set
OUT =WORK.&Set.V
VARNUM
NOPRINT
;
RUN;
PROC SQL NOPRINT;
SELECT CATS(Name, '=', TRANWRD(lowcase(SUBSTR(Name, 1, LENGTH("&Pfx"))), lowcase("&Pfx"), ''), SUBSTR(Name, LENGTH("&Pfx") + 1))
AS Work_Var LENGTH=512
INTO : Vars SEPARATED BY ' '
FROM WORK.&Set.V
;
QUIT;
%&Cmnt.PUT NOTE: &=Vars;
PROC DATASETS LIBRARY=&Lib NOLIST;
MODIFY &Set;
RENAME
&Vars
;
QUIT;
%MEND Remove_Prefix_From_Var;
I used SAShelp.gas to generate test data like so:
data work.gas;
set sashelp.gas(Rename=(cpratio=dx_CpRatio
EqRatio=dx_EqRatio
Fuel=Dx_Fuel
NOx=Dx_NOx))
;
run;
**-----------------------------------------------------------------------------**;
%Remove_Prefix_From_Var(Lib=WORK, Set=Gas, Pfx=Dx_, Debug=YES);
But of course you have actual data.
If it's useful, great. Otherwise, carry on.
Jim
I would solve the problem this way:
DATA Have (DROP=Comorbidity)
Columns (KEEP=Comorbidity)
;
DROP _:;
LENGTH StudyID $8
Comorbidity1 $32
Comorbidity2 $32
Comorbidity3 $32
Comorbidity4 $32
;
INFILE Datalines MISSOVER;
INPUT StudyID $
Comorbidity1 $
Comorbidity2 $
Comorbidity3 $
Comorbidity4 $
;
ARRAY Comorbidities [4] Comorbidity1 - Comorbidity4;
DO _i = 1 TO 4;
Comorbidity = Comorbidities[_i];
IF UPCASE(Comorbidity) ^= 'NONE' AND
NOT MISSING(Comorbidity) THEN
OUTPUT Columns;
END;
OUTPUT Have;
DELETE;
DATALINES;
1 COPD HTN DM NONE
2 HTN COPD DM MENTAL_ILLNESS
3 DM
4 MENTAL_ILLNESS NONE
5 DM MENTAL_ILLNESS
;
RUN;
PROC SORT DATA=Columns NODUPKEY;
BY Comorbidity;
RUN;
DATA _NULL_;
SET Columns NOBS=Nbr_of_Columns;
IF _N_ = 1 THEN
CALL SYMPUTX('Nbr_of_Columns', Nbr_of_Columns, 'G');
CALL SYMPUTX(CATS('Column', PUT(_N_, 3.)), Comorbidity, 'G');
RUN;
DATA Want;
SET Have;
DROP Comorbidity1 - Comorbidity&Nbr_of_Columns j;
ARRAY Comorbidities [*] Comorbidity1 - Comorbidity&Nbr_of_Columns;
%MACRO Init_Column_Values;
%LOCAL i;
%DO i = 1 %TO &Nbr_of_Columns;
&&Column&i = 0;
%END;
%MEND;
%Init_Column_Values;
DO j = 1 TO &Nbr_of_Columns;
SELECT(Comorbidities[j]);
%MACRO Set_Column_Values;
%LOCAL i;
%DO i = 1 %TO &Nbr_of_Columns;
WHEN ("&&Column&i")
&&Column&i = 1;
%END;
%MEND;
%Set_Column_Values
OTHERWISE;
END;
END;
RUN;
The program is dynamic. The program will automatically adjust to however many unique values are in your data. The program will then dynamically determine which columns are populated and which are not setting a boolean (0 or 1).
Results from the above data:
Jim
If we add made-up conditions BMI and GERI, as in the following:
1 COPD HTN DM NONE 2 HTN COPD DM MENTAL_ILLNESS 3 DM 4 MENTAL_ILLNESS NONE 5 DM MENTAL_ILLNESS 6 BMI GERI
The program adjusts automatically, and the results are as follows:
Jim
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.