Hello
I have the following dataset:
ID | race1 | race2 | race3 | race4 | race_sum |
1 | 1 | 0 | 1 | 0 | 2 |
2 | 0 | 1 | 0 | 0 | 1 |
3 | 0 | 0 | 0 | 1 | 1 |
4 | 1 | 0 | 0 | 0 | 1 |
What I would like to do is that:
if race_sum >1 then it should final_race = 'Multiple race'
if race_sum = 1 then it should be the original race variable
something like this:
ID | race1 | race2 | race3 | race4 | race_sum | final_race |
1 | 1 | 0 | 1 | 0 | 2 | Multiple Race |
2 | 0 | 1 | 0 | 0 | 1 | race2 |
3 | 0 | 0 | 0 | 1 | 1 | race4 |
4 | 1 | 0 | 0 | 0 | 1 | race1 |
data test;
set test;
if race_sum > 1 then final_race = 'Multiple Race'
if race_sum = 1 then final_race =??
;run;
I am not sure how to do the later part
Thanks!!
Functions you need:
WHICHN - returns the index of the 1
VNAME - returns the variable name for a variable at a specified index
Nesting them together gets you what you need.
Concepts need: Arrays
https://stats.idre.ucla.edu/sas/seminars/sas-arrays/
Untested:
data want ;
set have;
array _race(*) race1-race4;
if race_sum > 1 then final_race = "Multiple Race";
else final_race = vname(_race(whichn(1, of _race(*))));
run;
FYI - programming using the following style is dangerous and not recommended. It destroys your input data so you need to run your code again from the prior step at minimum and makes it significantly harder to debug your code.
data test;
set test;
@pacman94 wrote:
Hello
I have the following dataset:
ID race1 race2 race3 race4 race_sum 1 1 0 1 0 2 2 0 1 0 0 1 3 0 0 0 1 1 4 1 0 0 0 1
What I would like to do is that:
if race_sum >1 then it should final_race = 'Multiple race'
if race_sum = 1 then it should be the original race variable
something like this:
ID race1 race2 race3 race4 race_sum final_race 1 1 0 1 0 2 Multiple Race 2 0 1 0 0 1 race2 3 0 0 0 1 1 race4 4 1 0 0 0 1 race1
data test;
set test;
if race_sum > 1 then final_race = 'Multiple Race'
if race_sum = 1 then final_race =??
;run;
I am not sure how to do the later part
Thanks!!
Fully tested:
data have;
infile cards dlm='09'x;
input ID Caucasian AfricanAmerican EastAsian SouthEastAsian;
label AfricanAmerican='African American' EastAsian="East Asian"
SOuthEastAsian="Southeast Asian";
cards;
1 1 0 1 0
2 0 1 0 0
3 0 0 0 1
4 1 0 0 0
;
;
;;
run;
data want;
set have;
length list_race_label list_race_name $200.;
array _race(*) Caucasian--SouthEastAsian;
race_sum=sum(of _race(*));
if race_sum > 1 then
final_race="Multiple Race";
else
final_race=vname(_race(whichn(1, of _race(*))));
*list of race;
do i=1 to dim(_race);
if _race(i)=1 then
list_race_label=catx(", ", trim(list_race_label), vlabel(_race(i)));
if _race(i)=1 then
list_race_name=catx(", ", trim(list_race_label), vname(_race(i)));
end;
run;
https://gist.github.com/statgeek/972120a19b583acc3a947b58b3177a41
90% plus of requests when I get to process similar data the user would like the "race" responses in some specific order.
The easiest way I have found to do that consistently is to assign a numeric value to the "final race" variable and an associated format that displays the text in the desired order.
So I would start with creating the Format first such as:
Proc format; value raceorder 1 = "Asian" 2 = "Black" 3 = "Native American" 4 = "White" 5 = "Multiple Race" 6 = "Unknown" ; run;
And then use the Sum/ whichn information to assign the above numeric values.
My example includes Unknown category because I get too much garbage and always have some in most of my data. That would be the sum of races le 0. (less than in case your system has MISSING values) or Whichn result of 0.
Example of creating data with "race" variables not in the possibly desired report order and making the final race variable match a report order as defined in that format above.
data have; input ID race1 race2 race3 race4 ; label Race1 = 'White' Race2 = 'Black' Race3 = 'Asian' Race4 = 'Native American' ; datalines; 1 1 0 1 0 2 0 1 0 0 3 0 0 0 1 4 1 0 0 0 5 0 0 1 0 6 0 0 0 0 ; data want; set have; array r race: ; if sum(of r(*))=2 then final_race=5; else select (whichn(1, of r(*))); when (1) final_race =4; when (2) final_race =2; when (3) final_race =1; when (4) final_race =3; otherwise final_race=6; end; run; proc freq data=want; tables final_race; format final_race raceorder.; run;
I use an explicit value for the "Unknown" or missing or what ever your system may want because if you use an actual missing value then the sort order place the "Unknown" first.
Style choices abound in variations of this.
The format approach can become extremely powerful if you get to work with one of the data sets that has subgroups of Asian (one I have seen has something like 24) or Hispanic split up into geographic sub groups. Then Multilabel formats could provide the way to do subgroup/main "race" categories.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.