bazingarollcall
Fluorite | Level 6

Hello,

I have a dataset with several categorical variables. I'd like to produce chi square tests on some of the variables. One of the variables, race, has several levels:

 

White

Black

Hispanic

Other

Unknown

 

The level 'unknown' cannot be included in the chi square test, but must be included in the total population of interest, which is ~85,000.

 

I've tried the following code:

 

proc format;
   value $plan
    'var1','var2'='var1+2'
    'var3','var4','var5'='var3-5';
run;

proc freq data=popstudy;
  tables (race sex age)*(var1) / chisq;
  format var1 $plan.;
 where race not= ('Unknown') ;
run;

 

This reduces the total population to ~60,000, which is not what I need.

 

I've also tried subsetting the dataset and producing another dataset that removes all values of 'Unknown' race, but this reduces the population as the code above does and is not correct.

 

My supervisor says I need to rename the 'Unknown' level as equal to 99999 or something similar, and include a "where" statement in my proc freq statement, but I am not sure what they mean by this.

 

 

Thank you.

1 REPLY
ballardw
Super User

First of all, what do you think this statement does:

where race not= ('Unknown') ;

 

You aren't going to get a chi-square based on the full population count when you drop a level of a variable or have missing values for that variable. It just does not work that way. The chi-square test uses cell counts: an observation that is excluded by a WHERE statement, or that is missing a value for one of the table variables, contributes nothing to the counts for that table.
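
One way to see what the WHERE statement does is to count the rows it keeps before running any test. This is only a sketch: it assumes the dataset and variable names from the original post (popstudy, race) and that the level is stored exactly as 'Unknown' (the comparison is case sensitive). The column names n_all and n_known are made up for illustration.

```sas
/* Compare the full row count with the rows that survive the WHERE condition. */
/* n_all and n_known are hypothetical names, not from the original post.      */
proc sql;
  select count(*)               as n_all,    /* every row in popstudy          */
         sum(race ne 'Unknown') as n_known   /* rows the WHERE would keep      */
  from popstudy;
quit;
```

If n_known comes back near 60,000, that is the population every table in that PROC FREQ step is built from, because a WHERE statement filters the input for the whole step, including the sex and age tables.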

 

If your N must be 85,000 in the calculations using race, then you need to include the 'Unknown' records in some way. The simplest is to keep 'Unknown' as a level of race. Another is to impute values for race, but with that category missing for nearly 30% of your records I would be very leery of that.
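
Keeping 'Unknown' as a level amounts to running the original step without the WHERE statement. A minimal sketch, assuming the $plan format from the original post has already been defined:

```sas
/* No WHERE statement: every row, including race = 'Unknown', is counted. */
proc freq data=popstudy;
  tables (race sex age)*var1 / chisq;
  format var1 $plan.;
run;
```

The trade-off is that the chi-square for race then treats 'Unknown' as a category of its own, which may not be the comparison you are asked to make.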

 

I think you need to discuss this requirement with your supervisor a bit more. The approach you outline only replaces 'Unknown' with '99999', which by itself doesn't help, and any WHERE statement is going to reduce your N.

You could set race to a SAS missing value instead of 'Unknown', but that still reduces the N for the tables involving the race variable. It would not, however, reduce the N for the statistics involving age or sex.
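
The missing-value idea could look something like the sketch below. It assumes the names from the original post (popstudy, race, sex, age, var1, $plan); popstudy2 and race_known are hypothetical names introduced here. PROC FREQ drops missing values separately for each table request, so only the race table loses the 'Unknown' rows. The MISSPRINT option is added so the missing count is displayed in the race table without entering the percentages or the test.

```sas
/* Copy race, then blank out 'Unknown' so SAS treats it as missing.      */
/* popstudy2 and race_known are hypothetical names for illustration.     */
data popstudy2;
  set popstudy;
  race_known = race;                      /* inherits the length of race */
  if race = 'Unknown' then race_known = ' ';
run;

proc freq data=popstudy2;
  /* Race table: missing rows are shown but excluded from the statistics. */
  tables race_known*var1 / chisq missprint;
  /* Sex and age tables: unaffected by the recode, so they use all rows   */
  /* that have nonmissing sex/age and var1.                               */
  tables (sex age)*var1 / chisq;
  format var1 $plan.;
run;
```

Whether a displayed-but-excluded count satisfies the "must be included in the total population" requirement is something to confirm with whoever set that requirement.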

 

What specific question are you supposed to be answering with this chi-square test?

 

 

 
