Exclude categorical variable level from chi square analysis, but not f...

Posted 02-24-2021 01:13 PM
(1407 views)

Hello,

I have a dataset with several categorical variables. I'd like to produce chi square tests on some of the variables. One of the variables, race, has several levels:

White

Black

Hispanic

Other

Unknown

The level 'unknown' cannot be included in the chi square test, but must be included in the total population of interest, which is ~85,000.

I've tried the following code:

proc format;

value $plan

'var1','var2'='var1+2'

'var3','var4','var5'='var3-5';

run;

proc freq data=popstudy;

tables (race sex age)*(var1) / chisq;

format var1 $plan.;

where race not= ('Unknown') ;

run;

This reduces the total population to ~60,000, which is not what I need.

I've also tried subsetting the dataset and producing another dataset that removes all values of 'Unknown' race, but this reduces the population as the code above does and is not correct.

My supervisor says I need to rename the 'Unknown' level as equal to 99999 or something similar, and include a "where" statement in my proc freq statement, but I am not sure what they mean by this.

Thank you.

1 REPLY

In reply to @bazingarollcall:

First of all, what do you think this statement does:

where race not= ('Unknown') ;

You aren't going to get a chi-square whose total count covers every record once you drop a level of a variable or have missing values for that variable. It just does not work that way. Chi-square is computed from cell counts. If a record has nothing to count for a variable, it is not counted in the combination of that variable with the other.
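A quick way to see what that WHERE statement does is to count the records with and without the condition. This is only a sketch: it assumes the dataset is `popstudy` and that the level is stored exactly as 'Unknown', as in the posted code. The column aliases `n_total` and `n_after_where` are made up for illustration.

```sas
/* Compare the full n with the n that survives the WHERE condition. */
/* Assumes race is stored exactly as 'Unknown' (case-sensitive).    */
proc sql;
  select count(*)               as n_total,
         sum(race ne 'Unknown') as n_after_where  /* true = 1, false = 0 */
  from popstudy;
quit;
```

Because WHERE filters observations before any table is built, the smaller n applies to the sex and age tables in that PROC FREQ step as well, not only to the race table.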

If your N must be 85,000 in the calculations using race, then you need to include the 'Unknown' records one way or another. The simplest is to keep 'Unknown' as a level of race. Another is to "impute" values for race, but when a category is missing for nearly 30% of your records I would be very leery of that.
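As a minimal sketch of the simplest option, the posted PROC FREQ step can be run without the WHERE statement. It assumes the `$plan.` format from the original post has already been defined and that the dataset is `popstudy`.

```sas
/* No WHERE statement: every record is read, and 'Unknown' is   */
/* tested as a fifth race level, so n stays at the full count.  */
proc freq data=popstudy;
  tables (race sex age)*var1 / chisq;
  format var1 $plan.;   /* format taken from the original post */
run;
```

Note that this changes the test itself: 'Unknown' now contributes a row to the race table and to its chi-square statistic.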

I think you need to discuss this requirement with your supervisor a bit more. The approach you sort of outline only replaces 'Unknown' with '99999', so it doesn't help, and any WHERE statement is going to reduce your n.

You could set race to a SAS missing value instead of 'Unknown', but that will still reduce the n for tables involving the race variable. However, it would not reduce the n for the statistics involving age or sex.
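The missing-value approach could look something like the sketch below. The names `popstudy2` and `race_known` are hypothetical, and the `$plan.` format is assumed to be defined as in the original post. The MISSPRINT option is an addition that was not mentioned in the thread: it displays the missing frequencies in the table while leaving them out of the percentages and statistics.

```sas
/* Copy race into a new variable and blank out the 'Unknown' level.   */
/* race_known is a hypothetical name; assigning from race first gives */
/* it the same length as race.                                        */
data popstudy2;
  set popstudy;
  race_known = race;
  if upcase(race_known) = 'UNKNOWN' then call missing(race_known);
run;

proc freq data=popstudy2;
  /* Missing race_known is left out of this table's statistics;     */
  /* MISSPRINT still displays the missing frequencies in the table. */
  tables race_known*var1 / chisq missprint;
  /* These tables do not involve race, so they keep the full n. */
  tables (sex age)*var1 / chisq;
  format var1 $plan.;
run;
```

There is no WHERE statement here, so PROC FREQ drops the missing race values only from the table that uses `race_known`.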

What specific question are you supposed to be answering with this chi-square?
