<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Exclude categorical variable level from chi square analysis, but not from total population in SAS Procedures</title>
    <link>https://communities.sas.com/t5/SAS-Procedures/Exclude-categorical-variable-level-from-chi-square-analysis-but/m-p/721667#M80222</link>
    <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/370041"&gt;@bazingarollcall&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;Hello,&lt;/P&gt;
&lt;P&gt;I have a dataset with several categorical variables. I'd like to produce chi square tests on some of the variables. One of the variables, race, has several levels:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;White&lt;/P&gt;
&lt;P&gt;Black&lt;/P&gt;
&lt;P&gt;Hispanic&lt;/P&gt;
&lt;P&gt;Other&lt;/P&gt;
&lt;P&gt;Unknown&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The level 'unknown' cannot be included in the chi square test, but must be included in the total population of interest, which is ~85,000.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I've tried the following code:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;proc format;&lt;BR /&gt;&amp;nbsp;&amp;nbsp; value $plan&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; 'var1','var2'='var1+2'&amp;nbsp;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; 'var3','var4','var5'='var3-5';&lt;BR /&gt;run;&lt;/P&gt;
&lt;P&gt;proc freq data=popstudy;&lt;BR /&gt;&amp;nbsp; tables (race sex age)*(var1) / chisq;&lt;BR /&gt;&amp;nbsp; format var1 $plan.;&lt;BR /&gt;&amp;nbsp;where race not= ('Unknown') ;&lt;BR /&gt;run;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This reduces the total population to ~60,000, which is not what I need.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I've also tried subsetting the dataset and producing another dataset that removes all values of&amp;nbsp;'Unknown' race, but this reduces the population as the code above does and is not correct.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;My supervisor says I need to rename the 'Unknown' level as equal to 99999 or something similar, and include a "where" statement in my proc freq statement, but I am not sure what they mean by this.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thank you.&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;First of all, what do you think this statement does:&lt;/P&gt;
&lt;P&gt;where race not= ('Unknown') ;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
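&lt;P&gt;You are not going to get a chi-square test whose total count covers every record once you drop a level of a variable or the variable has missing values. It just does not work that way. The chi-square statistic is computed from cell counts, so a record with nothing to count for one variable is left out of the table that crosses that variable with another. A WHERE statement goes further: it removes those records from every table in the PROC FREQ step, including the ones for sex and age.&lt;/P&gt;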
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If your N must be 85,000 in the calculations that use race, then you need to include the 'Unknown' records in one way or another. The simplest is to keep 'Unknown' as a level of race. Another is to impute values for race, but with the category missing for nearly 30% of your records I would be very leery of that.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I think you need to discuss this requirement with your supervisor a bit more. The approach you outline only replaces 'Unknown' with '99999', which does not help by itself, and any WHERE statement is going to reduce your N.&lt;/P&gt;
&lt;P&gt;You could set race to a SAS missing value instead of 'Unknown', but that still reduces the N for tables involving the race variable. However, it would not reduce the N for the statistics involving age or sex.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;What specific question are you supposed to be answering with this chi-square test?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Wed, 24 Feb 2021 19:57:27 GMT</pubDate>
    <dc:creator>ballardw</dc:creator>
    <dc:date>2021-02-24T19:57:27Z</dc:date>
    <item>
      <title>Exclude categorical variable level from chi square analysis, but not from total population</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Exclude-categorical-variable-level-from-chi-square-analysis-but/m-p/721653#M80221</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I have a dataset with several categorical variables. I'd like to produce chi square tests on some of the variables. One of the variables, race, has several levels:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;White&lt;/P&gt;&lt;P&gt;Black&lt;/P&gt;&lt;P&gt;Hispanic&lt;/P&gt;&lt;P&gt;Other&lt;/P&gt;&lt;P&gt;Unknown&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The level 'unknown' cannot be included in the chi square test, but must be included in the total population of interest, which is ~85,000.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I've tried the following code:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;proc format;&lt;BR /&gt;&amp;nbsp;&amp;nbsp; value $plan&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; 'var1','var2'='var1+2'&amp;nbsp;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; 'var3','var4','var5'='var3-5';&lt;BR /&gt;run;&lt;/P&gt;&lt;P&gt;proc freq data=popstudy;&lt;BR /&gt;&amp;nbsp; tables (race sex age)*(var1) / chisq;&lt;BR /&gt;&amp;nbsp; format var1 $plan.;&lt;BR /&gt;&amp;nbsp;where race not= ('Unknown') ;&lt;BR /&gt;run;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This reduces the total population to ~60,000, which is not what I need.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I've also tried subsetting the dataset and producing another dataset that removes all values of&amp;nbsp;'Unknown' race, but this reduces the population as the code above does and is not correct.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;My supervisor says I need to rename the 'Unknown' level as equal to 99999 or something similar, and include a "where" statement in my proc freq statement, but I am not sure what they mean by this.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you.&lt;/P&gt;</description>
      <pubDate>Wed, 24 Feb 2021 18:13:28 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Exclude-categorical-variable-level-from-chi-square-analysis-but/m-p/721653#M80221</guid>
      <dc:creator>bazingarollcall</dc:creator>
      <dc:date>2021-02-24T18:13:28Z</dc:date>
    </item>
    <item>
      <title>Re: Exclude categorical variable level from chi square analysis, but not from total population</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Exclude-categorical-variable-level-from-chi-square-analysis-but/m-p/721667#M80222</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/370041"&gt;@bazingarollcall&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;Hello,&lt;/P&gt;
&lt;P&gt;I have a dataset with several categorical variables. I'd like to produce chi square tests on some of the variables. One of the variables, race, has several levels:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;White&lt;/P&gt;
&lt;P&gt;Black&lt;/P&gt;
&lt;P&gt;Hispanic&lt;/P&gt;
&lt;P&gt;Other&lt;/P&gt;
&lt;P&gt;Unknown&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The level 'unknown' cannot be included in the chi square test, but must be included in the total population of interest, which is ~85,000.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I've tried the following code:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;proc format;&lt;BR /&gt;&amp;nbsp;&amp;nbsp; value $plan&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; 'var1','var2'='var1+2'&amp;nbsp;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; 'var3','var4','var5'='var3-5';&lt;BR /&gt;run;&lt;/P&gt;
&lt;P&gt;proc freq data=popstudy;&lt;BR /&gt;&amp;nbsp; tables (race sex age)*(var1) / chisq;&lt;BR /&gt;&amp;nbsp; format var1 $plan.;&lt;BR /&gt;&amp;nbsp;where race not= ('Unknown') ;&lt;BR /&gt;run;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This reduces the total population to ~60,000, which is not what I need.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I've also tried subsetting the dataset and producing another dataset that removes all values of&amp;nbsp;'Unknown' race, but this reduces the population as the code above does and is not correct.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;My supervisor says I need to rename the 'Unknown' level as equal to 99999 or something similar, and include a "where" statement in my proc freq statement, but I am not sure what they mean by this.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thank you.&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;First of all, what do you think this statement does:&lt;/P&gt;
&lt;P&gt;where race not= ('Unknown') ;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
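&lt;P&gt;You are not going to get a chi-square test whose total count covers every record once you drop a level of a variable or the variable has missing values. It just does not work that way. The chi-square statistic is computed from cell counts, so a record with nothing to count for one variable is left out of the table that crosses that variable with another. A WHERE statement goes further: it removes those records from every table in the PROC FREQ step, including the ones for sex and age.&lt;/P&gt;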
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If your N must be 85,000 in the calculations that use race, then you need to include the 'Unknown' records in one way or another. The simplest is to keep 'Unknown' as a level of race. Another is to impute values for race, but with the category missing for nearly 30% of your records I would be very leery of that.&lt;/P&gt;
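&lt;P&gt;As a minimal sketch of the simplest option, assuming the dataset, variables and $plan. format from your post: dropping the WHERE statement keeps all ~85,000 records, and 'Unknown' then enters the race test as an ordinary level.&lt;/P&gt;
&lt;P&gt;/* Sketch only: no WHERE statement, so every record is counted */&lt;BR /&gt;proc freq data=popstudy;&lt;BR /&gt;&amp;nbsp; tables (race sex age)*(var1) / chisq;&lt;BR /&gt;&amp;nbsp; format var1 $plan.;&lt;BR /&gt;run;&lt;/P&gt;
&lt;P&gt;The trade-off is that the chi-square statistic for race now tests 'Unknown' as if it were a real category, which may not be what the analysis calls for.&lt;/P&gt;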
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I think you need to discuss this requirement with your supervisor a bit more. The approach you outline only replaces 'Unknown' with '99999', which does not help by itself, and any WHERE statement is going to reduce your N.&lt;/P&gt;
&lt;P&gt;You could set race to a SAS missing value instead of 'Unknown', but that still reduces the N for tables involving the race variable. However, it would not reduce the N for the statistics involving age or sex.&lt;/P&gt;
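&lt;P&gt;A rough sketch of that idea, assuming race is a character variable and using the hypothetical names popstudy2 and race_cs, which are not in your post. A data step blanks out 'Unknown' in a copy of race, and PROC FREQ then leaves those records out of the race table only. The MISSPRINT option displays the missing level in the table without using it in the percentages or the chi-square statistic.&lt;/P&gt;
&lt;P&gt;/* Sketch only: popstudy2 and race_cs are made-up names */&lt;BR /&gt;data popstudy2;&lt;BR /&gt;&amp;nbsp; set popstudy;&lt;BR /&gt;&amp;nbsp; race_cs = race;&amp;nbsp; /* copy, so the original race is kept */&lt;BR /&gt;&amp;nbsp; if race = 'Unknown' then race_cs = ' ';&amp;nbsp; /* blank = character missing */&lt;BR /&gt;run;&lt;/P&gt;
&lt;P&gt;/* No WHERE statement: records with missing race_cs are excluded from the race table only */&lt;BR /&gt;proc freq data=popstudy2;&lt;BR /&gt;&amp;nbsp; tables (race_cs sex age)*(var1) / chisq missprint;&lt;BR /&gt;&amp;nbsp; format var1 $plan.;&lt;BR /&gt;run;&lt;/P&gt;
&lt;P&gt;With this setup the sex and age tables should keep the full ~85,000 records, while the chi-square for race is computed on the ~60,000 records with a known race. Check the counts reported with each table to confirm.&lt;/P&gt;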
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;What specific question are you supposed to be answering with this chi-square test?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 24 Feb 2021 19:57:27 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Exclude-categorical-variable-level-from-chi-square-analysis-but/m-p/721667#M80222</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2021-02-24T19:57:27Z</dc:date>
    </item>
  </channel>
</rss>

