Hi everyone,
I am having trouble with if/then statements.
I am trying to create several new variables from two variables: race and hispanic.
Race:
0=MR
1= White
2= Black
3= American Indian
4=Asian
5=Pacific Islander
8= Other
Hispanic:
1= Spanish/Hispanic Origin
2= Not of Spanish/Hispanic Origin
9=Unknown
I need to create several binary variables:
White/non Hispanic
Black/non Hispanic
Hispanic
Other
So far, I have tried this code:
data h.data2;
set h.data;
if (nrace=1) and (hispanic=2) then whtnh=1;
else whtnh=0;
if (nrace=2) and (hispanic=2) then blknh=1;
else blknh=0;
if nrace not in ('1','2','0') and (hispanic=1) then hisp=1;
else hisp=0;
if nrace not in ('1', '2', '0', '8') then else=1;
else else=0;
run;
The code works and produces results, but the total of each '1' does not equal the total sample, 550274:
=551202
Am I not thinking this through logically, or is it a coding issue? Thanks in advance. I've included the log below.
Log:
So you can debug this yourself. You go into data set h.data2 and find records where none of these new variables is 1, these are not getting assigned to a combined group, and from there you can determine where your IF statements are failing.
Now I'm lost. I don't see double-counting, and I don't see 551202. When I add up the numbers next to category 1, I don't get 551202, I get 526233.
To resolve this, you need to LOOK AT the data set h.data2 with your own eyes and see where none of the new variables is 1, they will all be zero, that's where your logic error is. (I guess it's also possible that two or more of the new variables are 1, that would also be a problem, which can be identified the same way)
Your may notice that the total cumulative frequency for all of your variables is 550274. So the exact same number of records are getting assigned something.
So, how many records have a missing value for the race or ethnicity variable?
A very easy piece of code to test your recoding results is to use Proc freq with list.
Try this : (or what ever the actual data set name is, your "log" does not match code shown for data set names) any you LOG only shows the output data set with 550274 records:
NOTE: The data set H.MHMATCH2 has 550274 observations and 63 variables.
Proc freq data=h2.data2; tables nrace*hispanic*whtnh / list missing; run;
The list option places the values on a single line so is easy to read. The Missing means the variables with missing values will be shown. If you have 551202 records in the data set this table will total to 551202 and will pretty clearly show where/why your recoding comes in with fewer records.
@bazingarollcall wrote:
Thanks for your reply, ballardw. My sample size is actually 550274 and the race/ethnicity variables total to 551202, so some observations are presumably being counted more than once. How do I go about checking this?
I think that you need to write out some very explicit numbers stating where it comes from and how you adt them to get 551202. None of the data you showed in the first post does that.
If you are totaling all of the 1's across 4 variables then you have misunderstood what your code is doing. There is nothing in your code that made any of the assignments of the variables whtnh, blknh, hisp and else exclusive. Nothing. If you expect the 1 values to total to 550274 then your assignment code is wrong.
Use the Proc Freq code I showed before only add all 4 of your dependent variables in the tables statement:
tables nrace*hispanic*whtnh* blknh* hisp * else / list missing;
You will see that you have some values with 1 in multiple columns.
Note that not using the proper values can generate issues. When you use Nrace=2 in one place and Nrace not in ('0' '1' '2') then conversions of values become a possible issue. So, is NRACE numeric or Character?
The following data step shows what happens when coding with each combination of Nrace and Hispanic as provided assuming NO records are missing either of the variables.
data toofew toomany; do nrace=0 to 5,8; do hispanic= 1,2,9; if (nrace=1) and (hispanic=2) then whtnh=1; else whtnh=0; if (nrace=2) and (hispanic=2) then blknh=1; else blknh=0; if nrace not in (1,2,0) and (hispanic=1) then hisp=1; else hisp=0; if nrace not in (1, 2, 0, 8) then else=1; else else=0; if sum(whtnh,blknh,hisp,else)>1 then output toomany; if sum(whtnh,blknh,hisp,else)=0 then output toofew; end; end; run;
Note that the data set TOOMANY will have three records that show that HISP and ELSE are both set to one. That might be part of your issue.
I would suggest that you look at the TOOFEW data set as well. That shows 9 records where the values of all the variables are 0. Which considering that some of then have Hispanic=1 may well not be what you expect.
I really do not understand why your race = MR (multiracial), white or black is excluded from HISP. Typically if you have a "non-Hispanic White" category then Hispanic White is counted in Hispanic, same with black or other races.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.