Hello,
I have a dataset that ressembles the one below (table A): each person (id) could tick to which ethnicity they identified themselves, so they could answer yes to multiple ones (each ethnicity was a variable with the answer being yes, no, or not stated). I am looking to have ONE variable that tells me the ethnicity for each id (if people answered yes to multiple ethnicities, they would be reclassified into ''multiple ethnicities'' and if a value was missing or not stated, it would be reclassified as ''not stated'' in ethnicities or deleted from the dataset. See example below of result wanted (table B). I wanted to use the ARRAY function, but not sure how to do this specifically. Thanks!
Table A
ID | Ethnicity_2 | Ethnicity_2 | Ethnicity_3 | Ethnicity_4 | Ethnicity_5 |
1 | 0 | 0 | 0 | 0 | 1 |
2 | 0 | 1 | 0 | 0 | 0 |
3 | 0 | 1 | 1 | 0 | 0 |
4 | 1 | 0 | 0 | 0 | 0 |
5 | 99 | 0 | 0 | 1 | 1 |
6 | 0 | 0 | 0 | 0 | 1 |
7 | 0 | 1 | 0 | 1 | 0 |
8 | 0 | 1 | 0 | 0 | 0 |
9 | 0 | 0 | 1 | 0 | 0 |
10 | 0 | 0 | 0 | 1 | 0 |
1= YES | |||||
0= NO | |||||
99= NOT STATED |
Table B
ID | Ethnicity_1 | Ethnicity_2 | Ethnicity_3 | Ethnicity_4 | Ethnicity_5 | Ethnicity_final |
1 | 0 | 0 | 0 | 0 | 1 | Ethnicity_5 |
2 | 0 | 1 | 0 | 0 | 0 | Ethnicity_2 |
3 | 0 | 1 | 1 | 0 | 0 | Mutiple ethnicities |
4 | 1 | 0 | 0 | 0 | 0 | Ethnicity_1 |
5 | 99 | 0 | 0 | 1 | 1 | Not stated |
6 | 0 | 0 | 0 | 0 | 1 | Ethnicity_5 |
7 | 0 | 1 | 0 | 1 | 0 | Mutiple ethnicities |
8 | 0 | 1 | 0 | 0 | 0 | Ethnicity_2 |
9 | 0 | 0 | 1 | 0 | 0 | Ethnicity_3 |
10 | 0 | 0 | 0 | 1 | 0 | Ethnicity_4 |
* Ethnicity_final could also be numeric to make it easier: 1-5 and 6 = 'multiple ethnicities' and 7= 'not stated' for example |
data want;
set have;
array _eth (5) ethnicity_1-ethnicity_5;
*check for multiple ethnicities;
if sum(of _eth(*)) >1 then ethnicity_final = 'Multiple Ethnicity';
*check for none specified;
else if sum(of _eth(*))=0 then ethnicity_final = 'Not Stated';
*else only 1 is specified, map to ethnicity value;
else do;
*find which value in array is 1;
index_ethnicity = whichn(1, of _eth(*));
*get variable name of ethnicity;
ethnicity_final = vname(eth(index_ethnicity));
end;
run;
Untested because it's lunch time 🙂
@SJ12 wrote:
Hello,
I have a dataset that ressembles the one below (table A): each person (id) could tick to which ethnicity they identified themselves, so they could answer yes to multiple ones (each ethnicity was a variable with the answer being yes, no, or not stated). I am looking to have ONE variable that tells me the ethnicity for each id (if people answered yes to multiple ethnicities, they would be reclassified into ''multiple ethnicities'' and if a value was missing or not stated, it would be reclassified as ''not stated'' in ethnicities or deleted from the dataset. See example below of result wanted (table B). I wanted to use the ARRAY function, but not sure how to do this specifically. Thanks!
Table A
ID Ethnicity_2 Ethnicity_2 Ethnicity_3 Ethnicity_4 Ethnicity_5 1 0 0 0 0 1 2 0 1 0 0 0 3 0 1 1 0 0 4 1 0 0 0 0 5 99 0 0 1 1 6 0 0 0 0 1 7 0 1 0 1 0 8 0 1 0 0 0 9 0 0 1 0 0 10 0 0 0 1 0 1= YES 0= NO 99= NOT STATED
Table B
ID Ethnicity_1 Ethnicity_2 Ethnicity_3 Ethnicity_4 Ethnicity_5 Ethnicity_final 1 0 0 0 0 1 Ethnicity_5 2 0 1 0 0 0 Ethnicity_2 3 0 1 1 0 0 Mutiple ethnicities 4 1 0 0 0 0 Ethnicity_1 5 99 0 0 1 1 Not stated 6 0 0 0 0 1 Ethnicity_5 7 0 1 0 1 0 Mutiple ethnicities 8 0 1 0 0 0 Ethnicity_2 9 0 0 1 0 0 Ethnicity_3 10 0 0 0 1 0 Ethnicity_4
* Ethnicity_final could also be numeric to make it easier: 1-5 and 6 = 'multiple ethnicities' and 7= 'not stated' for example
@Reeza wrote:
data want; set have; array _eth (5) ethnicity_1-ethnicity_5; *check for multiple ethnicities; if sum(of _eth(*)) >1 then ethnicity_final = 'Multiple Ethnicity'; *check for none specified; else if sum(of _eth(*))=0 then ethnicity_final = 'Not Stated'; *else only 1 is specified, map to ethnicity value; else do; *find which value in array is 1; index_ethnicity = whichn(1, of _eth(*)); *get variable name of ethnicity; ethnicity_final = vname(eth(index_ethnicity)); end; run;
Untested because it's lunch time 🙂
Set the 99 to missing before checking any values.
Otherwise this will set "Multiple" when an "unstated" 99 occurs.
if sum(of _eth(*)) >1 then ethnicity_final = 'Multiple Ethnicity';
Thank you!
Arrays aren't really needed here, as there are many functions that can work across an entire row.
proc format;
value ethf 2='Multiple Ethnicities' 99='Not Stated';
run;
data want;
set have;
if sum(of ethnicity:)=1 then ethnicity_final=whichn(1,of ethnicity:);
else if whichn(99,of ethnicity:) then ethnicity_final=99;
else if sum(of ethnicity:)>1 then ethnicity_final=2;
format ethnicity_final ethf.;
run;
If you really need to have 'Ethnicity 1' instead of plain old '1' in the ethnicity_final variable, that's a little more work. You can use
proc format;
value ethf 1='Ethnicity 1' 2='Ethnicity 2' 3='Ethnicity 3' 4='Ethnicity 4' 5='Ethnicity 5'
98='Multiple Ethnicities' 99='Not Stated';
run;
data want;
set have;
if sum(of ethnicity:)=1 then ethnicity_final=whichn(1,of ethnicity:);
else if whichn(99,of ethnicity:) then ethnicity_final=99;
else if sum(of ethnicity:)>1 then ethnicity_final=98;
format ethnicity_final ethf.;
run;
Now let me throw out an opinion and some advice ... it may be the case that you have greatly simplified the problem so as you can present it to us. That is fine. The problem I have is that when you create text for an output, such as the text 'Ethinicity_1', this is likely difficult for others to understand because they don't know what the 5 ethnicities are by ethnicity number. Instead, provide text that means something, rather than coded numbers which your audience may not understand. If your data is set up such that the ethnicity is Caucausian when ethnicity_1=1 and Hispanic when when ethnicity_2=1 and so on, then the output should not say "Ethnicity_1", it should say "Caucasian", which your audience will understand better. Don't produce cryptic output, produce understandable output.
Sounds good, thanks for your help!
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.