I am relatively new to SAS and I am trying to calculate Race and ethnicity from a matric table as Follows:
Hispanic | White | Black | Asian | American Indian/Native American | Hawaiian/Pacific Islander | Unknown |
N | Y | N | N | N | N | |
N | Y | N | N | N | N | |
Z | Y | N | N | N | N | |
N | Y | N | N | Y | N | |
N | Y | N | N | N | N | |
Z | Z | Z | Z | Z | Z | |
Y | N | N | N | N | N |
Y = Yes, N=No, Z=Unknown
I just need to get a count of each race and add a 'Multiracial' variable, not sure how to do that since each race is its own variable and not sure how to code for multiracial. Multiracial would be Y to any two or more of the races. And if Hispanic = Y then only Hispanic regardless if any other race = Y.
Thanks,
Thomas
Using letters Y N Z is not the easiest way to do things here, in my opinion.
Using numeric value 1 0 and missing makes the problem a lot simpler. So, if you start where the values are 1 0 and missing, then this gets you the desired result.
data have2;
set have;
if sum(of hispanic--hawaiian)>1 and hispanic^=1 then multiracial=1;
else multiracial=0;
if n(of hispanic--hawaiian)=0 then unknown=1;
else unknown=0;
run;
proc summary data=have2;
var hispanic--hawaiian multiracial unknown;
output out=want sum=;
run;
Now, since you are new, I point out that we can only work with data in SAS data set, not the screen capture of your data that you show which doesn't even have valid variable names. So you can make your SAS coding simple if your data is in meaningful form, and you must have valid SAS variable names.
Using letters Y N Z is not the easiest way to do things here, in my opinion.
Using numeric value 1 0 and missing makes the problem a lot simpler. So, if you start where the values are 1 0 and missing, then this gets you the desired result.
data have2;
set have;
if sum(of hispanic--hawaiian)>1 and hispanic^=1 then multiracial=1;
else multiracial=0;
if n(of hispanic--hawaiian)=0 then unknown=1;
else unknown=0;
run;
proc summary data=have2;
var hispanic--hawaiian multiracial unknown;
output out=want sum=;
run;
Now, since you are new, I point out that we can only work with data in SAS data set, not the screen capture of your data that you show which doesn't even have valid variable names. So you can make your SAS coding simple if your data is in meaningful form, and you must have valid SAS variable names.
I agree that 1 and 0 would make this easier but this is the data that I have and this is the format required by whom we report the data to. I am trying to verify their counts with what we submitted.
For all kinds of statistics, you are WAY better off with the numeric 0/1/missing. The fact that people who have no clue about computing deliver data in bad formats, layouts and contents does NOT force you to also be an idiot and make your life unnecessarily hard.
Convert the data to numeric as soon as you get it (use a proper informat), and use formats to display them in the manner the pumpkins expect.
@mothert wrote:
I agree that 1 and 0 would make this easier but this is the data that I have and this is the format required by whom we report the data to. I am trying to verify their counts with what we submitted.
Then convert the Y Z N to 1 0 missing yourself, and then the code I provided will work*.
Otherwise, please provide an example SAS data set of your Y N Z data, with actual valid SAS data set names, following these instructions or by typing in working SAS data step code yourself. Emphasis on working. Screen captures and Excel files are not acceptable. Once we have that, we will show you how to convert the Y N Z to 1 0 missing, and then the code I provided will work*.
* — assuming you adjust the code for the new variable names of the 1 0 missing variables.
What is the purpose of that Unknown column in the matrix? It has no values and you do not mention it at all. That matrix doesn't look like much of guideline.
Provide some example of your data. Best is as working data step code. At least provide the names of your SAS variables and some example values. Best would be include an identification variable so we can discuss things like "why when ID=23 does xxx occur?" instead of trying to count lines. Try to cover as many cases as you think you may have in the data.
For that example provide what you expect as a result. It is best to separate the "have" starting data and the "want" result so there is no confusion.
I need a count of the Unknown which is the value of 'Z' in the race fields. I need to add variables, Multi-Racial and Unknown. And the matrix I provided is an example of the data. The variable names were RWHT = White, RBLK = Black, RASN = Asian, RAIN = American Indian, RHIP = Hawaiian/Pacific Islander, HISP = Hispanic. Need to add Mutliracial and Unknow. Mutliracial would be if any two or more of RWHT, RBLK, RASN, RAIN, RHIP = Y. Unknown would be if any race/ethinicty = Z and all others = N.
The result I am looking for would be as follows:
Unduplicated Headcount | 2022 |
Asian/Pacific Islanders | 128 |
Black (non-Hispanic) | 595 |
Hispanic | 478 |
American Indian/Alaskan Native | 34 |
White (non-Hispanic) | 4,552 |
Multi Racial | 132 |
Unknown | 100 |
The matrix is a sample of the actual date from the file I am reading into SAS.
"The matrix is a sample of the actual date from the file I am reading into SAS."
So, how are you reading that??? The column headings aren't valid variable names and awkward to work with. And still not description or values for the column Uknown.
If that is a text file I will show you how I would read that data.
proc format; invalue ynz (upcase) 'Y'=1 'N'=0 'Z'=. other=_error_; run; data read; informat Hispanic White Black Asian AIAN HoPI ynz.; input Hispanic White Black Asian AIAN HoPI; label AIAN='American Indian/Alaska Native' Hopi='Hawaiian/Pacific Islander' ; /* NOTE providing actual examples of other values and MULTIRACIAL */ datalines; N Y N N N N N Y N N N N Z Y N N N N N Y N N Y N N Y N N N N Z Z Z Z Z Z Y N N N N N N Y Y N N N N N Y n n n n n n y n n n n n n y y n n n n n y ;
The above uses a slightly advanced technique of a custom informat. The INVALUE makes and names numeric informat. That means that it will read TEXT and result in a numeric value. The UPCASE option tells SAS to make the text upper case before comparing to the list of values, which helps if someone isn't real consistent when entering data into a form. The
'Y'=1 means "read a Y and assign 1 as the value". The other=_error_ means if a value other than Y, N or Z (after UPCASE!) is encountered then treat the value as an error. That means the log will show a note about invalid data and assign a missing value to the variable. In real work that means investigate which value is encountered and you decide what to do with it.
The data step uses an Informat statement to assign the informat to the variables and then uses Input to read them.
Note: You did not provide any values in your Unknown column so I did not include them. I did include some rows with other values as you did not provide any Asian, AIAN or HoPI (often Hawaiian or Other Pacific Islander) values so I did.
I get lots of garbage data. Read it as makes sense and create the report to standard. The boss doesn't care as long as the results match the requirements, on time, accurate and repeatable (when the NEXT stupid file comes in).
data read;
input (Hispanic White Black Asian AIAN HoPI) ($);
label AIAN='American Indian/Alaska Native'
Hopi='Hawaiian/Pacific Islander'
;
datalines;
N Y N N N N
N Y N N N N
Z Y N N N N
N Y N N Y N
N Y N N N N
Z Z Z Z Z Z
Y N N N N N
;
proc transpose data=read(obs=0) out=temp;
var _all_;
run;
proc sql noprint;
select cat('sum(',_name_,'="Y") as ',_name_) into : list separated by ','
from temp;
create table temp2 as
select &list. from read;
quit;
proc transpose data=temp2 out=want;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.