BookmarkSubscribeRSS Feed
pacman94
Calcite | Level 5

Hello

 

I have the following dataset: 

IDrace1race2race3race4race_sum
110102
201001
300011
410001

 

What I would like to do is that:

if race_sum >1 then it should final_race = 'Multiple race'

if race_sum = 1 then it should be the original race variable

 

something like this:

IDrace1race2race3race4race_sumfinal_race
110102Multiple Race
201001race2
300011race4
410001race1

 

data test;

set test;

if race_sum > 1 then final_race = 'Multiple Race'

if race_sum = 1 then final_race =?? 

;run;

 

I am not sure how to do the later part

 

Thanks!!

4 REPLIES 4
Reeza
Super User

Functions you need:

WHICHN - returns the index of the 1

VNAME - returns the variable name for a variable at a specified index

 

Nesting them together gets you what you need.

 

Concepts need: Arrays

https://stats.idre.ucla.edu/sas/seminars/sas-arrays/

 

Untested:

 

data want ;
set have;

array _race(*) race1-race4;

if race_sum > 1 then final_race = "Multiple Race";
else final_race = vname(_race(whichn(1, of _race(*))));

run;

 

FYI - programming using the following style is dangerous and not recommended. It destroys your input data so you need to run your code again from the prior step at minimum and makes it significantly harder to debug your code. 

data test;
set test;

@pacman94 wrote:

Hello

 

I have the following dataset: 

ID race1 race2 race3 race4 race_sum
1 1 0 1 0 2
2 0 1 0 0 1
3 0 0 0 1 1
4 1 0 0 0 1

 

What I would like to do is that:

if race_sum >1 then it should final_race = 'Multiple race'

if race_sum = 1 then it should be the original race variable

 

something like this:

ID race1 race2 race3 race4 race_sum final_race
1 1 0 1 0 2 Multiple Race
2 0 1 0 0 1 race2
3 0 0 0 1 1 race4
4 1 0 0 0 1 race1

 

data test;

set test;

if race_sum > 1 then final_race = 'Multiple Race'

if race_sum = 1 then final_race =?? 

;run;

 

I am not sure how to do the later part

 

Thanks!!


 

Reeza
Super User

Fully tested:

 

data have;
	infile cards dlm='09'x;
	input ID Caucasian AfricanAmerican EastAsian SouthEastAsian;
	label AfricanAmerican='African American' EastAsian="East Asian" 
		SOuthEastAsian="Southeast Asian";
	cards;
1	1	0	1	0	
2	0	1	0	0
3	0	0	0	1
4	1	0	0	0
;
	;
	;;
run;

data want;
	set have;
	length list_race_label list_race_name $200.;
	array _race(*) Caucasian--SouthEastAsian;
	
	race_sum=sum(of _race(*));

	if race_sum > 1 then
		final_race="Multiple Race";
	else
		final_race=vname(_race(whichn(1, of _race(*))));
	*list of race;

	do i=1 to dim(_race);

		if _race(i)=1 then
			list_race_label=catx(", ", trim(list_race_label), vlabel(_race(i)));

		if _race(i)=1 then
			list_race_name=catx(", ", trim(list_race_label), vname(_race(i)));
	end;
run;

https://gist.github.com/statgeek/972120a19b583acc3a947b58b3177a41

taylormccormick
Calcite | Level 5
I'm pretty new at this but I think this will work...

Else if race1= 1 then final_race= 'race1'
Else if race2= 1 then final_race= 'race2'
Else if race3= 1 then final_race= 'race3'
Else if race4= 1 then final_race= 'race4';
ballardw
Super User

90% plus of requests when I get to process similar data the user would like the "race" responses in some specific order.

The easiest way I have found to do that consistently is to assign a numeric value to the "final race" variable and an associated format that displays the text in the desired order.

So I would start with creating the Format first such as:

Proc format;
value raceorder
1 = "Asian"
2 = "Black"
3 = "Native American"
4 = "White"
5 = "Multiple Race"
6 = "Unknown"
;
run;

And then use the Sum/ whichn information to assign the above numeric values.

My example includes Unknown category because I get too much garbage and always have some in most of my data. That would be the sum of races le 0. (less than in case your system has MISSING values) or Whichn result of 0.

 

Example of creating data with "race" variables not in the possibly desired report order and making the final race variable match a report order as defined in that format above.

data have;
input ID 	race1 	race2 	race3 	race4 ;
label 
   Race1 = 'White'
   Race2 = 'Black'
   Race3 = 'Asian'
   Race4 = 'Native American'
;
datalines;
1 	1 	0 	1 	0 	
2 	0 	1 	0 	0 	
3 	0 	0 	0 	1 	
4 	1 	0 	0 	0 	
5  0  0  1  0
6  0  0  0  0 
;
data want;
   set have;
   array r race: ;
   if sum(of r(*))=2 then final_race=5;
   else select (whichn(1, of r(*)));
      when (1) final_race =4;
      when (2) final_race =2;
      when (3) final_race =1;
      when (4) final_race =3;
      otherwise final_race=6;
   end;
run;

proc freq data=want;
   tables final_race;
   format final_race raceorder.;
run;

I use an explicit value for the "Unknown" or missing or what ever your system may want because if you use an actual missing value then the sort order place the "Unknown" first.

 

Style choices abound in variations of this.

The format approach can become extremely powerful if you get to work with one of the data sets that has subgroups of Asian (one I have seen has something like 24) or Hispanic split up into geographic sub groups. Then Multilabel formats could provide the way to do subgroup/main "race" categories.

 

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon Kicks Off on June 11!

Watch the live Hackathon Kickoff to get all the essential information about the SAS Hackathon—including how to join, how to participate, and expert tips for success.

YouTube LinkedIn

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 1263 views
  • 0 likes
  • 4 in conversation