BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
stancemcgraw
Obsidian | Level 7

Hi, I'm trying to create a dummy variable for each character variable. 

 

data have;

StudyID   comorbidity1 comorbidity2 comorbidity3 comorbidty4

      1           COPD              HTN              DM               NONE

       2           HTN                COPD           DM         MENTAL_ILLNESS

data want;

    StudyID      COPD           HTN         DM       MENTAL_ILLNESS

          1              1                      1             1                   0

          2              1                    1             1                     1

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User
*what you start with;
DATA	Have	
		
		;
	
	LENGTH	StudyID			$8
			Comorbidity1	$32
			Comorbidity2	$32
			Comorbidity3	$32
			Comorbidity4	$32
			;
	INFILE	Datalines		MISSOVER;
	INPUT	StudyID			$
			Comorbidity1	$
			Comorbidity2	$
			Comorbidity3	$
			Comorbidity4	$
			;
DATALINES;
1           COPD              HTN              DM         NONE
2           HTN               COPD             DM         MENTAL_ILLNESS
3           DM
4           MENTAL_ILLNESS    NONE
5           DM                MENTAL_ILLNESS
;
RUN;


*transpose to long;
proc transpose data=have out=long;
by studyID;
var comorbidity1-Comorbidity4;
run;

*add indicator variable and delete empty rows;
data long2;
set long;
count=1;
if missing(col1) then delete;
run;

*transpose to wide again;
proc transpose data=long2 out=wide prefix=DX_;
by studyID;
id col1;
var count;
run;

*fill in missing with 0;
data want;
set wide;
array _dx(*) dx_:;
do i=1 to dim(_dx);
if missing(_dx(i)) then _dx(i) = 0;
end;

run;

View solution in original post

11 REPLIES 11
PaigeMiller
Diamond | Level 26

Do not create dummy variables yourself. SAS has already done this for you in many procedures, so if you are trying to create an analysis or statistical model involving character variables, then the CLASS statement in most modelling procedures are all that you need.

--
Paige Miller
Reeza
Super User
Not the way this data is structured - it needs to be restructured regardless here to work properly in a statistical model.
Reeza
Super User
How many different categories do you have? If it's a few it's easy enough to manually code them but if it needs to be dynamic that's more coding...and likely more stable if this is ever going to be run in the future.
Reeza
Super User
*what you start with;
DATA	Have	
		
		;
	
	LENGTH	StudyID			$8
			Comorbidity1	$32
			Comorbidity2	$32
			Comorbidity3	$32
			Comorbidity4	$32
			;
	INFILE	Datalines		MISSOVER;
	INPUT	StudyID			$
			Comorbidity1	$
			Comorbidity2	$
			Comorbidity3	$
			Comorbidity4	$
			;
DATALINES;
1           COPD              HTN              DM         NONE
2           HTN               COPD             DM         MENTAL_ILLNESS
3           DM
4           MENTAL_ILLNESS    NONE
5           DM                MENTAL_ILLNESS
;
RUN;


*transpose to long;
proc transpose data=have out=long;
by studyID;
var comorbidity1-Comorbidity4;
run;

*add indicator variable and delete empty rows;
data long2;
set long;
count=1;
if missing(col1) then delete;
run;

*transpose to wide again;
proc transpose data=long2 out=wide prefix=DX_;
by studyID;
id col1;
var count;
run;

*fill in missing with 0;
data want;
set wide;
array _dx(*) dx_:;
do i=1 to dim(_dx);
if missing(_dx(i)) then _dx(i) = 0;
end;

run;
stancemcgraw
Obsidian | Level 7
this works until step 3, making data wide again it says, " The ID value "DX_Use_Choices_1000___9999" occurs twice in the same BY group".....
Reeza
Super User
That means you have duplicate either in the diagnosis or in the IDs?
Figure out what is duplicated and either de-duplicate it before this process or modify your example data to reflect your actual data ad I can take another look at it.
stancemcgraw
Obsidian | Level 7
I used the LET option on the transpose statement and that seemed to work. Now I just need to remove the DX_ suffix from each variable. Not sure how to do that...
Reeza
Super User
prefix = option on PROC TRANSPOSE added those - I do that to make it easier to reference all the variables at once within an array. You can add IDLABEL so that your labels are clean and use those instead when printing/displaying the report.
jimbarbour
Meteorite | Level 14

If you really needed the column names without the Dx_ prefix, a little macro that would strip them off can be written like so:

%MACRO	Remove_Prefix_From_Var(Lib=, Set=, Pfx=, Debug=NO);
	%LOCAL	Cmnt;
	%LOCAL	Vars;

	%IF	%QUPCASE(&Debug)	=	YES	%THEN
		%DO;
			%LET	Cmnt	=	;
			%PUT	&Nte1  Debug is on in macro &SysMacroName;
		%END;
	%ELSE
		%DO;
			%LET	Cmnt	=	*;
		%END;

	PROC	CONTENTS	DATA=&Lib..&Set
						OUT	=WORK.&Set.V
						VARNUM
						NOPRINT
						;
	RUN;

	PROC	SQL	NOPRINT;
		SELECT	CATS(Name, '=', TRANWRD(lowcase(SUBSTR(Name, 1, LENGTH("&Pfx"))), lowcase("&Pfx"), ''), SUBSTR(Name, LENGTH("&Pfx") + 1))
			AS	Work_Var	LENGTH=512
			INTO	:	Vars	SEPARATED BY ' '
			FROM	WORK.&Set.V
			;
	QUIT;

	%&Cmnt.PUT	NOTE:  &=Vars;

	PROC	DATASETS	LIBRARY=&Lib	NOLIST;
		MODIFY	&Set;
			RENAME	
				&Vars
				;
	QUIT;
%MEND	Remove_Prefix_From_Var;

I used SAShelp.gas to generate test data like so:

data	work.gas;
	set	sashelp.gas(Rename=(cpratio=dx_CpRatio
							EqRatio=dx_EqRatio
							Fuel=Dx_Fuel
							NOx=Dx_NOx))
							;
run;

**-----------------------------------------------------------------------------**;

%Remove_Prefix_From_Var(Lib=WORK, Set=Gas, Pfx=Dx_, Debug=YES);

But of course you have actual data.

 

If it's useful, great.  Otherwise, carry on.

 

Jim

 

 

jimbarbour
Meteorite | Level 14

I would solve the problem this way:

DATA	Have	(DROP=Comorbidity)	
		Columns	(KEEP=Comorbidity)
		;
	DROP	_:;
	LENGTH	StudyID			$8
			Comorbidity1	$32
			Comorbidity2	$32
			Comorbidity3	$32
			Comorbidity4	$32
			;
	INFILE	Datalines		MISSOVER;
	INPUT	StudyID			$
			Comorbidity1	$
			Comorbidity2	$
			Comorbidity3	$
			Comorbidity4	$
			;
	ARRAY	Comorbidities	[4]	Comorbidity1 - Comorbidity4;
	DO	_i			=	1	TO	4;
		Comorbidity	=	Comorbidities[_i];
		IF	UPCASE(Comorbidity)	^=	'NONE'	AND
			NOT	MISSING(Comorbidity)		THEN
			OUTPUT	Columns;
	END;
	OUTPUT	Have;
	DELETE;
DATALINES;
1           COPD              HTN              DM         NONE
2           HTN               COPD             DM         MENTAL_ILLNESS
3           DM
4           MENTAL_ILLNESS    NONE
5           DM                MENTAL_ILLNESS
;
RUN;

PROC	SORT	DATA=Columns	NODUPKEY;
	BY	Comorbidity;
RUN;

DATA	_NULL_;
	SET	Columns	NOBS=Nbr_of_Columns;
	IF	_N_	=	1	THEN
		CALL	SYMPUTX('Nbr_of_Columns', Nbr_of_Columns, 'G');
	CALL	SYMPUTX(CATS('Column', PUT(_N_, 3.)), Comorbidity, 'G');
RUN;

DATA	Want;
	SET	Have;
	DROP	Comorbidity1 - Comorbidity&Nbr_of_Columns	j;
	ARRAY	Comorbidities	[*]	Comorbidity1 - Comorbidity&Nbr_of_Columns;

	%MACRO	Init_Column_Values;
		%LOCAL	i;
		%DO	i	=	1	%TO	&Nbr_of_Columns;
			&&Column&i		=	0;
		%END;
	%MEND;
	%Init_Column_Values;

	DO	j	=	1	TO	&Nbr_of_Columns;
		SELECT(Comorbidities[j]);
			%MACRO	Set_Column_Values;
				%LOCAL	i;
				%DO	i	=	1	%TO	&Nbr_of_Columns;
						WHEN	("&&Column&i")
							&&Column&i		=	1;
				%END;
			%MEND;
			%Set_Column_Values
			OTHERWISE;
		END;
	END;
RUN;

The program is dynamic.  The program will automatically adjust to however many unique values are in your data.  The program will then dynamically determine which columns are populated and which are not setting a boolean (0 or 1).

 

Results from the above data:

jimbarbour_0-1600896927154.png

 

Jim

jimbarbour
Meteorite | Level 14

If we add made-up conditions BMI and GERI, as in the following:

1           COPD              HTN              DM         NONE
2           HTN               COPD             DM         MENTAL_ILLNESS
3           DM
4           MENTAL_ILLNESS    NONE
5           DM                MENTAL_ILLNESS
6           BMI               GERI

The program adjusts automatically, and the results are as follows:

jimbarbour_0-1600897434056.png

 

Jim

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 1922 views
  • 1 like
  • 4 in conversation