BookmarkSubscribeRSS Feed
bazingarollcall
Fluorite | Level 6

Hi everyone, 

 

I am having trouble with if/then statements.

 

I am trying to create several new variables from two variables: race and hispanic.

Race: 

0=MR
1= White
2= Black
3= American Indian
4=Asian
5=Pacific Islander
8= Other

 

Hispanic:

1= Spanish/Hispanic Origin

2= Not of Spanish/Hispanic Origin

9=Unknown

 

I need to create several binary variables:

White/non Hispanic

Black/non Hispanic

Hispanic

Other

 

So far, I have tried this code:

data h.data2;
set h.data;


if (nrace=1) and (hispanic=2) then whtnh=1;
else whtnh=0;

 

if (nrace=2) and (hispanic=2) then blknh=1;
else blknh=0;

 

if nrace not in ('1','2','0') and (hispanic=1) then hisp=1;
else hisp=0;

 

if nrace not in ('1', '2', '0', '8') then else=1;
else else=0;
run;

 

The code works and produces results, but the total of each '1' does not equal the total sample, 550274:

 

bazingarollcall_0-1621952294458.png

=551202

 

 

Am I not thinking this through logically, or is it a coding issue? Thanks in advance. I've included the log below. 

 

 

 

 

Log:

 
 
 
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72
73 data h.mhmatch2;
74 set h.mhmatch;
NOTE: Data file H.MHMATCH.DATA is in a format that is native to another host, or the file encoding does not match the session
encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might reduce
performance.
75
76
77 if (nrace=1) and (hispanic=2) then whtnh=1;
78 else whtnh=0;
79
80
81 if (nrace=2) and (hispanic=2) then blknh=1;
82 else blknh=0;
83
84
85 if nrace not in ('1','2','0') and (hispanic=1) then hisp=1;
86 else hisp=0;
87
88
89 if nrace not in ('1', '2', '0', '8') then else=1;
90 else else=0;
91 run;
 
NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
77:20 81:20 85:5 85:37 89:5
NOTE: There were 550274 observations read from the data set H.MHMATCH.
NOTE: The data set H.MHMATCH2 has 550274 observations and 63 variables.
NOTE: DATA statement used (Total process time):
real time 7.08 seconds
cpu time 3.96 seconds
 
 
92
93
94 *check to make sure they were created correctly;
95 proc freq data=h.mhmatch2;
96 tables whtnh blknh hisp else;
97 run;
 
NOTE: There were 550274 observations read from the data set H.MHMATCH2.
NOTE: PROCEDURE FREQ used (Total process time):
real time 3.99 seconds
cpu time 0.59 seconds
 
 
98
99
100 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
112
 
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72
73 data h.mhmatch2;
74 set h.mhmatch;
NOTE: Data file H.MHMATCH.DATA is in a format that is native to another host, or the file encoding does not match the session
encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might reduce
performance.
75
76
77 if (nrace=1) and (hispanic=2) then whtnh=1;
78 else whtnh=0;
79
80
81 if (nrace=2) and (hispanic=2) then blknh=1;
82 else blknh=0;
83
84
85 if nrace not in ('1','2','0') and (hispanic=1) then hisp=1;
86 else hisp=0;
87
88
89 if nrace not in ('1', '2', '0', '8') then else=1;
90 else else=0;
91 run;
 
NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
77:20 81:20 85:5 85:37 89:5
NOTE: There were 550274 observations read from the data set H.MHMATCH.
NOTE: The data set H.MHMATCH2 has 550274 observations and 63 variables.
NOTE: DATA statement used (Total process time):
real time 7.08 seconds
cpu time 3.96 seconds
 
 
92
93
94 *check to make sure they were created correctly;
95 proc freq data=h.mhmatch2;
96 tables whtnh blknh hisp else;
97 run;
 
NOTE: There were 550274 observations read from the data set H.MHMATCH2.
NOTE: PROCEDURE FREQ used (Total process time):
real time 3.99 seconds
cpu time 0.59 seconds
 
 
98
99
100 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
112
6 REPLIES 6
PaigeMiller
Diamond | Level 26

So you can debug this yourself. You go into data set h.data2 and find records where none of these new variables is 1, these are not getting assigned to a combined group, and from there you can determine where your IF statements are failing.

--
Paige Miller
bazingarollcall
Fluorite | Level 6
Thanks for your reply, PaigeMiller. My sample size is actually 550274 and the race/ethnicity variables total to 551202.

Some observations are being counted twice. How should I go about checking which ones are being counted more than once?
PaigeMiller
Diamond | Level 26

Now I'm lost. I don't see double-counting, and I don't see 551202. When I add up the numbers next to category 1, I don't get 551202, I get 526233.

 

To resolve this, you need to LOOK AT the data set h.data2 with your own eyes and see where none of the new variables is 1, they will all be zero, that's where your logic error is. (I guess it's also possible that two or more of the new variables are 1, that would also be a problem, which can be identified the same way)

--
Paige Miller
ballardw
Super User

Your may notice that the total cumulative frequency for all of your variables is 550274. So the exact same number of records are getting assigned something.

So, how many records have a missing value for the race or ethnicity variable?

 

A very easy piece of code to test your recoding results is to use Proc freq with list.

Try this : (or what ever the actual data set name is, your "log" does not match code shown for data set names) any you LOG only shows the output data set with 550274 records: 

NOTE: The data set H.MHMATCH2 has 550274 observations and 63 variables.

 

Proc freq data=h2.data2;
  tables nrace*hispanic*whtnh / list missing;
run;

The list option places the values on a single line so is easy to read. The Missing means the variables with missing values will be shown. If you have 551202 records in the data set this table will total to 551202 and will pretty clearly show where/why your recoding comes in with fewer records.

bazingarollcall
Fluorite | Level 6
Thanks for your reply, ballardw. My sample size is actually 550274 and the race/ethnicity variables total to 551202, so some observations are presumably being counted more than once. How do I go about checking this?
ballardw
Super User

@bazingarollcall wrote:
Thanks for your reply, ballardw. My sample size is actually 550274 and the race/ethnicity variables total to 551202, so some observations are presumably being counted more than once. How do I go about checking this?

I think that you need to write out some very explicit numbers stating where it comes from and how you adt them to get 551202. None of the data you showed in the first post does that.

If you are totaling all of the 1's across 4 variables then you have misunderstood what your code is doing. There is nothing in your code that made any of the assignments of the variables whtnh, blknh, hisp and else exclusive. Nothing. If you expect the 1 values to total to 550274 then your assignment code is wrong.

 

Use the Proc Freq code I showed before only add all 4 of your dependent variables in the tables statement:

 

tables nrace*hispanic*whtnh* blknh* hisp * else / list missing;

You will see that you have some values with 1 in multiple columns.

 

Note that not using the proper values can generate issues. When you use Nrace=2 in one place and Nrace not in ('0' '1' '2') then conversions of values become a possible issue. So, is NRACE numeric or Character?

The following data step shows what happens when coding with each combination of Nrace and Hispanic as provided assuming NO records are missing either of the variables.

 

data toofew toomany;
   do nrace=0 to 5,8;
      do hispanic= 1,2,9;
         if (nrace=1) and (hispanic=2) then whtnh=1;
         else whtnh=0;
         if (nrace=2) and (hispanic=2) then blknh=1;
         else blknh=0;
         if nrace not in (1,2,0) and (hispanic=1) then hisp=1;
         else hisp=0;
         if nrace not in (1, 2, 0, 8) then else=1;
         else else=0;
         if sum(whtnh,blknh,hisp,else)>1 then  output toomany;
         if sum(whtnh,blknh,hisp,else)=0 then  output toofew;
      end;
   end;
run;

Note that the data set TOOMANY will have three records that show that HISP and ELSE are both set to one. That might be part of your issue.

 

I would suggest that you look at the TOOFEW data set as well. That shows 9 records where the values of all the variables are 0. Which considering that some of then have Hispanic=1 may well not be what you expect.

I really do not understand why your race = MR (multiracial), white or black is excluded from HISP. Typically if you have a "non-Hispanic White" category then Hispanic White is counted in Hispanic, same with black or other races.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 1682 views
  • 0 likes
  • 3 in conversation