Solved: Re: Create new dummy variables from multiple columns of character vari...

stancemcgraw · Posted 09-23-2020 03:31 PM

Hi, I'm trying to create a dummy variable for each character variable.

data have;

StudyID comorbidity1 comorbidity2 comorbidity3 comorbidty4

1 COPD HTN DM NONE

2 HTN COPD DM MENTAL_ILLNESS

data want;

StudyID COPD HTN DM MENTAL_ILLNESS

1 1 1 1 0

2 1 1 1 1

Reeza · Posted 09-23-2020 05:50 PM

*what you start with;
DATA	Have	
		
		;
	
	LENGTH	StudyID			$8
			Comorbidity1	$32
			Comorbidity2	$32
			Comorbidity3	$32
			Comorbidity4	$32
			;
	INFILE	Datalines		MISSOVER;
	INPUT	StudyID			$
			Comorbidity1	$
			Comorbidity2	$
			Comorbidity3	$
			Comorbidity4	$
			;
DATALINES;
1           COPD              HTN              DM         NONE
2           HTN               COPD             DM         MENTAL_ILLNESS
3           DM
4           MENTAL_ILLNESS    NONE
5           DM                MENTAL_ILLNESS
;
RUN;


*transpose to long;
proc transpose data=have out=long;
by studyID;
var comorbidity1-Comorbidity4;
run;

*add indicator variable and delete empty rows;
data long2;
set long;
count=1;
if missing(col1) then delete;
run;

*transpose to wide again;
proc transpose data=long2 out=wide prefix=DX_;
by studyID;
id col1;
var count;
run;

*fill in missing with 0;
data want;
set wide;
array _dx(*) dx_:;
do i=1 to dim(_dx);
if missing(_dx(i)) then _dx(i) = 0;
end;

run;

View solution in original post

PaigeMiller · Posted 09-23-2020 03:37 PM

Do not create dummy variables yourself. SAS has already done this for you in many procedures, so if you are trying to create an analysis or statistical model involving character variables, then the CLASS statement in most modelling procedures are all that you need.

--
Paige Miller

Reeza · Posted 09-23-2020 03:39 PM

Not the way this data is structured - it needs to be restructured regardless here to work properly in a statistical model.

Reeza · Posted 09-23-2020 03:37 PM

How many different categories do you have? If it's a few it's easy enough to manually code them but if it needs to be dynamic that's more coding...and likely more stable if this is ever going to be run in the future.

Reeza · Posted 09-23-2020 05:50 PM

*what you start with;
DATA	Have	
		
		;
	
	LENGTH	StudyID			$8
			Comorbidity1	$32
			Comorbidity2	$32
			Comorbidity3	$32
			Comorbidity4	$32
			;
	INFILE	Datalines		MISSOVER;
	INPUT	StudyID			$
			Comorbidity1	$
			Comorbidity2	$
			Comorbidity3	$
			Comorbidity4	$
			;
DATALINES;
1           COPD              HTN              DM         NONE
2           HTN               COPD             DM         MENTAL_ILLNESS
3           DM
4           MENTAL_ILLNESS    NONE
5           DM                MENTAL_ILLNESS
;
RUN;


*transpose to long;
proc transpose data=have out=long;
by studyID;
var comorbidity1-Comorbidity4;
run;

*add indicator variable and delete empty rows;
data long2;
set long;
count=1;
if missing(col1) then delete;
run;

*transpose to wide again;
proc transpose data=long2 out=wide prefix=DX_;
by studyID;
id col1;
var count;
run;

*fill in missing with 0;
data want;
set wide;
array _dx(*) dx_:;
do i=1 to dim(_dx);
if missing(_dx(i)) then _dx(i) = 0;
end;

run;

stancemcgraw · Posted 09-24-2020 12:50 PM

this works until step 3, making data wide again it says, " The ID value "DX_Use_Choices_1000___9999" occurs twice in the same BY group".....

Reeza · Posted 09-24-2020 01:49 PM

That means you have duplicate either in the diagnosis or in the IDs?
Figure out what is duplicated and either de-duplicate it before this process or modify your example data to reflect your actual data ad I can take another look at it.

stancemcgraw · Posted 09-24-2020 02:09 PM

I used the LET option on the transpose statement and that seemed to work. Now I just need to remove the DX_ suffix from each variable. Not sure how to do that...

Reeza · Posted 09-24-2020 03:21 PM

prefix = option on PROC TRANSPOSE added those - I do that to make it easier to reference all the variables at once within an array. You can add IDLABEL so that your labels are clean and use those instead when printing/displaying the report.

jimbarbour · Posted 09-24-2020 07:01 PM

If you really needed the column names without the Dx_ prefix, a little macro that would strip them off can be written like so:

%MACRO	Remove_Prefix_From_Var(Lib=, Set=, Pfx=, Debug=NO);
	%LOCAL	Cmnt;
	%LOCAL	Vars;

	%IF	%QUPCASE(&Debug)	=	YES	%THEN
		%DO;
			%LET	Cmnt	=	;
			%PUT	&Nte1  Debug is on in macro &SysMacroName;
		%END;
	%ELSE
		%DO;
			%LET	Cmnt	=	*;
		%END;

	PROC	CONTENTS	DATA=&Lib..&Set
						OUT	=WORK.&Set.V
						VARNUM
						NOPRINT
						;
	RUN;

	PROC	SQL	NOPRINT;
		SELECT	CATS(Name, '=', TRANWRD(lowcase(SUBSTR(Name, 1, LENGTH("&Pfx"))), lowcase("&Pfx"), ''), SUBSTR(Name, LENGTH("&Pfx") + 1))
			AS	Work_Var	LENGTH=512
			INTO	:	Vars	SEPARATED BY ' '
			FROM	WORK.&Set.V
			;
	QUIT;

	%&Cmnt.PUT	NOTE:  &=Vars;

	PROC	DATASETS	LIBRARY=&Lib	NOLIST;
		MODIFY	&Set;
			RENAME	
				&Vars
				;
	QUIT;
%MEND	Remove_Prefix_From_Var;

I used SAShelp.gas to generate test data like so:

data	work.gas;
	set	sashelp.gas(Rename=(cpratio=dx_CpRatio
							EqRatio=dx_EqRatio
							Fuel=Dx_Fuel
							NOx=Dx_NOx))
							;
run;

**-----------------------------------------------------------------------------**;

%Remove_Prefix_From_Var(Lib=WORK, Set=Gas, Pfx=Dx_, Debug=YES);

But of course you have actual data.

If it's useful, great. Otherwise, carry on.

Jim

jimbarbour · Posted 09-23-2020 05:35 PM

I would solve the problem this way:

DATA	Have	(DROP=Comorbidity)	
		Columns	(KEEP=Comorbidity)
		;
	DROP	_:;
	LENGTH	StudyID			$8
			Comorbidity1	$32
			Comorbidity2	$32
			Comorbidity3	$32
			Comorbidity4	$32
			;
	INFILE	Datalines		MISSOVER;
	INPUT	StudyID			$
			Comorbidity1	$
			Comorbidity2	$
			Comorbidity3	$
			Comorbidity4	$
			;
	ARRAY	Comorbidities	[4]	Comorbidity1 - Comorbidity4;
	DO	_i			=	1	TO	4;
		Comorbidity	=	Comorbidities[_i];
		IF	UPCASE(Comorbidity)	^=	'NONE'	AND
			NOT	MISSING(Comorbidity)		THEN
			OUTPUT	Columns;
	END;
	OUTPUT	Have;
	DELETE;
DATALINES;
1           COPD              HTN              DM         NONE
2           HTN               COPD             DM         MENTAL_ILLNESS
3           DM
4           MENTAL_ILLNESS    NONE
5           DM                MENTAL_ILLNESS
;
RUN;

PROC	SORT	DATA=Columns	NODUPKEY;
	BY	Comorbidity;
RUN;

DATA	_NULL_;
	SET	Columns	NOBS=Nbr_of_Columns;
	IF	_N_	=	1	THEN
		CALL	SYMPUTX('Nbr_of_Columns', Nbr_of_Columns, 'G');
	CALL	SYMPUTX(CATS('Column', PUT(_N_, 3.)), Comorbidity, 'G');
RUN;

DATA	Want;
	SET	Have;
	DROP	Comorbidity1 - Comorbidity&Nbr_of_Columns	j;
	ARRAY	Comorbidities	[*]	Comorbidity1 - Comorbidity&Nbr_of_Columns;

	%MACRO	Init_Column_Values;
		%LOCAL	i;
		%DO	i	=	1	%TO	&Nbr_of_Columns;
			&&Column&i		=	0;
		%END;
	%MEND;
	%Init_Column_Values;

	DO	j	=	1	TO	&Nbr_of_Columns;
		SELECT(Comorbidities[j]);
			%MACRO	Set_Column_Values;
				%LOCAL	i;
				%DO	i	=	1	%TO	&Nbr_of_Columns;
						WHEN	("&&Column&i")
							&&Column&i		=	1;
				%END;
			%MEND;
			%Set_Column_Values
			OTHERWISE;
		END;
	END;
RUN;

The program is dynamic. The program will automatically adjust to however many unique values are in your data. The program will then dynamically determine which columns are populated and which are not setting a boolean (0 or 1).

Results from the above data:

Jim

jimbarbour · Posted 09-23-2020 05:44 PM

If we add made-up conditions BMI and GERI, as in the following:

1           COPD              HTN              DM         NONE
2           HTN               COPD             DM         MENTAL_ILLNESS
3           DM
4           MENTAL_ILLNESS    NONE
5           DM                MENTAL_ILLNESS
6           BMI               GERI

The program adjusts automatically, and the results are as follows:

Jim

Create new dummy variables from multiple columns of character variables

Re: Create new dummy variables from multiple columns of character variables

Re: Create new dummy variables from multiple columns of character variables

Re: Create new dummy variables from multiple columns of character variables

Re: Create new dummy variables from multiple columns of character variables

Re: Create new dummy variables from multiple columns of character variables

Re: Create new dummy variables from multiple columns of character variables

Re: Create new dummy variables from multiple columns of character variables

Re: Create new dummy variables from multiple columns of character variables

Re: Create new dummy variables from multiple columns of character variables

Re: Create new dummy variables from multiple columns of character variables

Re: Create new dummy variables from multiple columns of character variables

Re: Create new dummy variables from multiple columns of character variables

Classroom Training Available!