combining multiple columns

pacman94 · Posted 12-14-2020 11:49 AM

Hello

I have the following dataset:

ID	race1	race2	race3	race4	race_sum
1	1	0	1	0	2
2	0	1	0	0	1
3	0	0	0	1	1
4	1	0	0	0	1

What I would like to do is that:

if race_sum >1 then it should final_race = 'Multiple race'

if race_sum = 1 then it should be the original race variable

something like this:

ID	race1	race2	race3	race4	race_sum	final_race
1	1	0	1	0	2	Multiple Race
2	0	1	0	0	1	race2
3	0	0	0	1	1	race4
4	1	0	0	0	1	race1

data test;

set test;

if race_sum > 1 then final_race = 'Multiple Race'

if race_sum = 1 then final_race =??

;run;

I am not sure how to do the later part

Thanks!!

Reeza · Posted 12-14-2020 12:09 PM

Functions you need:

WHICHN - returns the index of the 1

VNAME - returns the variable name for a variable at a specified index

Nesting them together gets you what you need.

Concepts need: Arrays

https://stats.idre.ucla.edu/sas/seminars/sas-arrays/

Untested:

data want ;
set have;

array _race(*) race1-race4;

if race_sum > 1 then final_race = "Multiple Race";
else final_race = vname(_race(whichn(1, of _race(*))));

run;

FYI - programming using the following style is dangerous and not recommended. It destroys your input data so you need to run your code again from the prior step at minimum and makes it significantly harder to debug your code.

data test;
set test;

@pacman94 wrote:

Hello

I have the following dataset:

ID race1 race2 race3 race4 race_sum

1 1 0 1 0 2

2 0 1 0 0 1

3 0 0 0 1 1

4 1 0 0 0 1

What I would like to do is that:

if race_sum >1 then it should final_race = 'Multiple race'

if race_sum = 1 then it should be the original race variable

something like this:

ID race1 race2 race3 race4 race_sum final_race

1 1 0 1 0 2 Multiple Race

2 0 1 0 0 1 race2

3 0 0 0 1 1 race4

4 1 0 0 0 1 race1

data test;

set test;

if race_sum > 1 then final_race = 'Multiple Race'

if race_sum = 1 then final_race =??

;run;

I am not sure how to do the later part

Thanks!!

Reeza · Posted 12-14-2020 12:20 PM

Fully tested:

data have;
	infile cards dlm='09'x;
	input ID Caucasian AfricanAmerican EastAsian SouthEastAsian;
	label AfricanAmerican='African American' EastAsian="East Asian" 
		SOuthEastAsian="Southeast Asian";
	cards;
1	1	0	1	0	
2	0	1	0	0
3	0	0	0	1
4	1	0	0	0
;
	;
	;;
run;

data want;
	set have;
	length list_race_label list_race_name $200.;
	array _race(*) Caucasian--SouthEastAsian;
	
	race_sum=sum(of _race(*));

	if race_sum > 1 then
		final_race="Multiple Race";
	else
		final_race=vname(_race(whichn(1, of _race(*))));
	*list of race;

	do i=1 to dim(_race);

		if _race(i)=1 then
			list_race_label=catx(", ", trim(list_race_label), vlabel(_race(i)));

		if _race(i)=1 then
			list_race_name=catx(", ", trim(list_race_label), vname(_race(i)));
	end;
run;

https://gist.github.com/statgeek/972120a19b583acc3a947b58b3177a41

taylormccormick · Posted 12-14-2020 12:38 PM

I'm pretty new at this but I think this will work...

Else if race1= 1 then final_race= 'race1'
Else if race2= 1 then final_race= 'race2'
Else if race3= 1 then final_race= 'race3'
Else if race4= 1 then final_race= 'race4';

ballardw · Posted 12-14-2020 12:58 PM

90% plus of requests when I get to process similar data the user would like the "race" responses in some specific order.

The easiest way I have found to do that consistently is to assign a numeric value to the "final race" variable and an associated format that displays the text in the desired order.

So I would start with creating the Format first such as:

Proc format;
value raceorder
1 = "Asian"
2 = "Black"
3 = "Native American"
4 = "White"
5 = "Multiple Race"
6 = "Unknown"
;
run;

And then use the Sum/ whichn information to assign the above numeric values.

My example includes Unknown category because I get too much garbage and always have some in most of my data. That would be the sum of races le 0. (less than in case your system has MISSING values) or Whichn result of 0.

Example of creating data with "race" variables not in the possibly desired report order and making the final race variable match a report order as defined in that format above.

data have;
input ID 	race1 	race2 	race3 	race4 ;
label 
   Race1 = 'White'
   Race2 = 'Black'
   Race3 = 'Asian'
   Race4 = 'Native American'
;
datalines;
1 	1 	0 	1 	0 	
2 	0 	1 	0 	0 	
3 	0 	0 	0 	1 	
4 	1 	0 	0 	0 	
5  0  0  1  0
6  0  0  0  0 
;
data want;
   set have;
   array r race: ;
   if sum(of r(*))=2 then final_race=5;
   else select (whichn(1, of r(*)));
      when (1) final_race =4;
      when (2) final_race =2;
      when (3) final_race =1;
      when (4) final_race =3;
      otherwise final_race=6;
   end;
run;

proc freq data=want;
   tables final_race;
   format final_race raceorder.;
run;

I use an explicit value for the "Unknown" or missing or what ever your system may want because if you use an actual missing value then the sort order place the "Unknown" first.

Style choices abound in variations of this.

The format approach can become extremely powerful if you get to work with one of the data sets that has subgroups of Asian (one I have seen has something like 24) or Hispanic split up into geographic sub groups. Then Multilabel formats could provide the way to do subgroup/main "race" categories.

combining multiple columns

Re: combining multiple columns

Re: combining multiple columns

Re: combining multiple columns

Re: combining multiple columns

combining multiple columns

Re: combining multiple columns

Re: combining multiple columns

Re: combining multiple columns

Re: combining multiple columns

Registration is open

SAS Training: Just a Click Away