Solved: Re: Chi Square for inclusive values

merdemk · Posted 04-27-2021 12:15 AM

Hello all,

My apologies if this was asked before. I have a variable indicating why people select a particular academic field. It is something like this:

Obs  Reason
1    money (coded as 0)
2    ideals (coded as 1)
3    money and ideals (coded as 2)
4    none (coded as 3)

I want to test the hypothesis "people are more likely to select a field because of money than because of ideals". The "money and ideals" category confuses me. If I create a binary variable, I have to exclude "money and ideals", but I don't think I can do that and test only money vs. ideals, because almost half of the observations are "money and ideals". Any help is much appreciated. Thanks!

ballardw · Posted 04-27-2021 12:56 AM

You have a fairly classic example of not matching the questions in a survey to the analysis plan or, more commonly, of having no analysis plan before writing the survey.

Depending on what you actually collected, there may not be much you can do except use all 4 levels. How many "fields" are represented in the data, or are you not looking at the fields in this analysis? And how many respondents are there? You may not have much of a sample to work with if "field" is included.

Was that an actual question or are these the results of coding from other actual questions, or a multi-response question that allowed selecting both categories and possibly more categories? Or the result of coding open text responses into a category?

merdemk · Posted 04-27-2021 01:58 AM

Thanks for the answer. Unfortunately I didn't design the survey. Here is the frequency data:

Code  Reason             Frequency
0     money              35
1     ideals             22
2     money and ideals   54
3     none                8

I believe that was the actual question, and subjects were able to select multiple choices at once. Do you think omitting option 2 causes bias? If I omit 2, the 0/1 ratio is 35/22, but if I include 2, it becomes 89/76. These are quite different, and in the latter case I would be artificially increasing the sample size.

ballardw · Posted 04-27-2021 11:11 AM

Including option 2 in any way confounds the data because it is clearly neither one of the other two choices.

Subset the data and do the test you were going to do.

There is no reason to create an additional variable for a quick test; just add a WHERE statement such as: where <your variable name> in (0 1); and see what you get.

The write up should include that half the data was thrown away because the data collection did not match the requirements for the analysis.

Or go back to whoever requested this analysis and discuss how the analysis plan intended this question to be used.
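
The subsetting suggestion could look roughly like the sketch below. It is only an illustration: the dataset name `counts` and the variable name `reason` are made up for this example, and the data are entered as aggregated counts (the frequencies quoted in this thread) rather than one row per respondent. For a one-way table, the CHISQ option of PROC FREQ requests a chi-square test for equal proportions.

```sas
/* Aggregated counts per reason code:
   0 = money, 1 = ideals, 2 = money and ideals, 3 = none.
   Dataset and variable names are hypothetical. */
data counts;
  input reason n;
  datalines;
0 35
1 22
2 54
3 8
;

/* Keep only the "money only" and "ideals only" respondents and
   test whether the two categories are equally frequent */
proc freq data=counts;
  where reason in (0 1);
  weight n;
  tables reason / chisq;
run;
```

With 35 versus 22 respondents the expected count under equal proportions is 28.5 per category, so the statistic works out to 84.5/28.5, about 2.96 on 1 degree of freedom. With one row per respondent, the WEIGHT statement would simply be dropped.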

Reeza · Posted 04-27-2021 12:02 PM

If this is homework or class work, you do it anyway. If someone asked for the analysis, you explain that the survey wasn't designed to answer this question, because it's a multi-select question in which the respondent isn't picking between two specific choices. You can make an assumption: either put everyone who said money/ideals in Money and everyone else in Ideals (excluding None, of course), or exclude the group that selected both, and see what the results are. I would probably do it both ways to see what happens, out of curiosity.
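
One way to run both variants side by side is sketched here. The dataset names `counts` and `variant_a` and the variables `reason` and `motive` are hypothetical, and the counts are the aggregated frequencies quoted in this thread; treat it as an illustration of the two assumptions, not as a recommended analysis.

```sas
/* Aggregated counts per reason code:
   0 = money, 1 = ideals, 2 = money and ideals, 3 = none */
data counts;
  input reason n;
  datalines;
0 35
1 22
2 54
3 8
;

/* Variant A: count "money and ideals" as Money; "none" is excluded */
data variant_a;
  set counts;
  where reason ne 3;
  length motive $6;
  if reason in (0 2) then motive = 'Money';
  else motive = 'Ideals';
run;

proc freq data=variant_a;
  weight n;
  tables motive / chisq;   /* one-way test for equal proportions */
run;

/* Variant B: drop everyone who selected both reasons */
proc freq data=variant_a;
  where reason in (0 1);
  weight n;
  tables motive / chisq;
run;
```

Variant A compares 89 with 22 respondents and Variant B compares 35 with 22, so how far the two results differ shows how strongly the conclusion depends on the assumption made about the "both" group.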

FreelanceReinh · Posted 05-03-2021 02:47 PM

Hello @merdemk,

Sorry for the late response, but here's another suggestion: Wouldn't it be appropriate to divide this multiple-choice survey question into two Yes/No questions (i.e., "... because of money?" and "... because of ideals?")? Then, assuming that the respondents were randomly selected from a large population, you could model it as two dependent binary variables (each having a Bernoulli distribution) and use McNemar's test to test the null hypothesis of equal marginal probabilities for "Yes."
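
To spell out what "equal marginal probabilities" means here, let X be the "money" indicator and Y the "ideals" indicator, with cell probabilities p_xy = P(X=x, Y=y). This notation is introduced only for this note:

```latex
\begin{aligned}
p_1 &= P(X=1) = p_{10} + p_{11}, \\
p_2 &= P(Y=1) = p_{01} + p_{11}, \\
H_0 &: p_1 = p_2 \iff p_{10} = p_{01}.
\end{aligned}
```

Since p_11 appears in both margins and cancels, the comparison rests only on the respondents who chose exactly one of the two reasons, while the "both" and "none" groups still count toward the sample.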

Code example:

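/* Frequencies of the response codes 0, 1, 2, 3 */
data have;
  Motivation_V18_H6 = _n_ - 1;
  input n @@;
  cards;
35 22 54 8
;

/* Split the multiple-choice item into two Yes/No indicators */
data want;
  set have;
  money  = (Motivation_V18_H6 in (0 2));
  ideals = (Motivation_V18_H6 in (1 2));
run;

/* AGREE requests McNemar's test and the kappa coefficient */
proc freq data=want;
  weight n;
  tables money*ideals / agree;
run;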

Result:

Table of money by ideals

money     ideals

Frequency|
Percent  |
Row Pct  |
Col Pct  |       0|       1|  Total
---------+--------+--------+
       0 |      8 |     22 |     30
         |   6.72 |  18.49 |  25.21
         |  26.67 |  73.33 |
         |  18.60 |  28.95 |
---------+--------+--------+
       1 |     35 |     54 |     89
         |  29.41 |  45.38 |  74.79
         |  39.33 |  60.67 |
         |  81.40 |  71.05 |
---------+--------+--------+
Total          43       76      119
            36.13    63.87   100.00


Statistics for Table of money by ideals


          McNemar's Test

Chi-Square        DF    Pr > ChiSq

    2.9649         1        0.0851


           Simple Kappa Coefficient

            Standard
Estimate       Error    95% Confidence Limits

 -0.1107      0.0841     -0.2755       0.0541

Sample Size = 119
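
As a check on the output above, the uncorrected McNemar statistic can be worked out by hand from the two discordant cells of the table. Here n_10 = 35 is the count for money = 1, ideals = 0, and n_01 = 22 is the count for money = 0, ideals = 1 (subscript notation assumed for this note):

```latex
\chi^2 = \frac{(n_{10} - n_{01})^2}{n_{10} + n_{01}}
       = \frac{(35 - 22)^2}{35 + 22}
       = \frac{169}{57} \approx 2.9649
```

This agrees with the Chi-Square value reported by PROC FREQ. The 54 respondents who chose both reasons and the 8 who chose neither do not enter the statistic.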

I've run simulations to estimate the power and type I error probability of the test in situations similar to your numeric example:

/* Format for p-values (using a 5% significance level) */

proc format;
value signif
low-0.05   = 'significant'
0.05<-high = 'not significant';
run;

/* Simulation of two dependent Bernoulli distributed random variables¹,                   */
/* estimation of the power and type I error probability of McNemar's test                 */
/*                                                                                        */
/* ¹ based on sample code from Rick Wicklin's blog article                                */
/*   https://blogs.sas.com/content/iml/2016/03/16/simulate-multinomial-sas-data-step.html */

%let SampleSize = 1000000;
%let N = 119;

%macro sim(outds,  /* Name of output dataset */
              p1,  /* Parameter of the Bernoulli distribution of the row variable (X) */
              p2,  /* Parameter of the Bernoulli distribution of the column variable (Y) */
              p11  /* p11=P(X=Y=1); note: max(0, p1+p2-1) <= p11 <= min(p1, p2) */
          );
data &outds(keep=i pv);
call streaminit(27182818);
array probs[4] _temporary_ (%sysevalf(1-&p1-&p2+&p11) /* p00 */
                            %sysevalf(&p2-&p11)       /* p01 */
                            %sysevalf(&p1-&p11)       /* p10 */
                            %sysevalf(&p11));         /* p11 */
array x[4] x00 x01 x10 x11;
do i = 1 to &SampleSize; 
  ItemsLeft = &N;
  cumProb = 0;
  do j = 1 to dim(probs)-1;
    p = probs[j] / (1 - cumProb);
    x[j] = rand('binom', p, ItemsLeft);
    ItemsLeft = ItemsLeft - x[j];
    cumProb = cumProb + probs[j];
  end;
  x[dim(probs)] = ItemsLeft;

  if x01 ne x10 then pv=sdf('chisq',(x01-x10)**2/(x01+x10),1); /* p-value of McNemar's test */
  else pv=1;
  output;
end;
run;

proc freq data=&outds;
format pv signif.;
tables pv / binomial;
run;

%mend sim;

%sim(sim0, 165/238, 165/238, 54/119) /* one particular case of the null hypothesis (p1=p2) */
/* Estimated probability of type I error: 0.0492 */

%sim(sim1, 89/119, 76/119, 54/119) /* Corr(X,Y)=(&p11-&p1*&p2)/sqrt(&p1*&p2*(1-&p1)*(1-&p2))=-0.1144 */
/* Estimated power: 0.4052 */

%sim(sim2, 89/119, 76/119, 72/119) /* larger p11, Corr(X,Y)=0.6107 */
/* Estimated power: 0.8530 */
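
For readers who want to see the cell probabilities behind the three macro calls, the following sketch recomputes them with the same formulas the macro uses for its `probs` array, together with the correlation formula from the comments. The dataset name `scenarios` and the variables `scenario`, `valid`, and `corr` are made up for this illustration.

```sas
/* Recompute the 2x2 cell probabilities and Corr(X,Y) for each scenario */
data scenarios;
  length scenario $4;
  do scenario = 'sim0', 'sim1', 'sim2';
    select (scenario);
      when ('sim0') do; p1 = 165/238; p2 = 165/238; p11 = 54/119; end;
      when ('sim1') do; p1 = 89/119;  p2 = 76/119;  p11 = 54/119; end;
      otherwise     do; p1 = 89/119;  p2 = 76/119;  p11 = 72/119; end;
    end;
    p00 = 1 - p1 - p2 + p11;   /* neither reason      */
    p01 = p2 - p11;            /* ideals only (Y = 1) */
    p10 = p1 - p11;            /* money only  (X = 1) */
    /* 1 if p11 lies within the admissible range noted in the macro header */
    valid = (max(0, p1 + p2 - 1) <= p11 <= min(p1, p2));
    corr = (p11 - p1*p2) / sqrt(p1*p2*(1 - p1)*(1 - p2));
    output;
  end;
run;

proc print data=scenarios noobs;
  format p: 6.4 corr 7.4;
run;
```

In the sim0 case the two discordant probabilities p01 and p10 are equal (both 28.5/119), which is what makes it an instance of the null hypothesis, while p11 and p00 stay at the observed 54/119 and 8/119.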

Two variants of McNemar's test (not shown here) -- the exact version available with the EXACT MCNEM statement in PROC FREQ and the asymptotic version using Edwards' correction (see Fleiss, Levin, and Paik, 2003, p. 375) -- turned out to be rather conservative, hence less powerful than the default "uncorrected" asymptotic version of the test.
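
A minimal sketch of those two variants on the observed table, for anyone who wants to compare them with the default result. The dataset names `cells` and `edwards` and the variable names in the second step are hypothetical. The corrected statistic follows the usual form of Edwards' continuity correction, (|n10 - n01| - 1)**2 / (n10 + n01).

```sas
/* Observed 2x2 table as cell counts */
data cells;
  input money ideals n;
  datalines;
0 0 8
0 1 22
1 0 35
1 1 54
;

/* Exact version: EXACT MCNEM adds an exact p-value to the McNemar output */
proc freq data=cells;
  weight n;
  tables money*ideals / agree;
  exact mcnem;
run;

/* Edwards' continuity correction computed by hand from the discordant cells */
data edwards;
  n10 = 35;   /* money = 1, ideals = 0 */
  n01 = 22;   /* money = 0, ideals = 1 */
  chisq_corrected = (abs(n10 - n01) - 1)**2 / (n10 + n01);
  pvalue = sdf('chisq', chisq_corrected, 1);
  put chisq_corrected= 8.4 pvalue= 8.4;
run;
```

The corrected statistic (144/57) is smaller than the uncorrected 169/57, which fits the observation that the corrected test is the more conservative of the two.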

merdemk · Posted 05-03-2021 03:57 PM

Hi @FreelanceReinhard,

Thank you so much! This is an excellent idea!
