Hello all,
My apologies if this was asked before. I have a variable indicating why people select a particular academic field. It is something like this:
Obs Reason
1 money (coded as 0)
2 ideals (1)
3 money and ideals (2)
4 none (3)
I want to test the hypothesis "people are more likely to select a field because of money than ideals". The "money and ideals" category confuses me. If I create a binary variable, I have to exclude "money and ideals", but I don't think I can do that and test only money vs. ideals, as almost half of the observations are "money and ideals". Any help is much appreciated. Thanks!
Including option 2 in any way confounds the data because it is clearly neither one of the other two choices.
Subset the data and do the test you were going to do.
There is no reason to create an additional variable for a quick test; just add a WHERE statement such as: where <your variable name> in (0 1); and see what you get.
The write-up should state that about half of the data was excluded because the data collection did not match the requirements for the analysis.
Or go back to whoever requested this analysis and discuss how the analysis plan intended this question to be used.
You have a fairly classic example of survey questions that do not match the analysis plan or, more commonly, of a survey written with no analysis plan at all.
Depending on what you actually collected there may not be much you can do except use all 4 levels. How many "fields" are represented in the data, or are you not looking at the fields in this analysis? And how many respondents? You may not have much sample to work with including "field".
Was that an actual question or are these the results of coding from other actual questions, or a multi-response question that allowed selecting both categories and possibly more categories? Or the result of coding open text responses into a category?
Thanks for the answer. Unfortunately I didn't design the survey. Here is the frequency data:
I believe that was the actual question and subjects were able to select multiple choices at once. Do you think omitting option 2 causes bias? If I omit 2, the 0/1 ratio is 35/22, but if I include 2, it becomes 89/76. These are quite different; in addition, I would be artificially increasing the sample size in the latter case.
Hello @merdemk,
Sorry for the late response, but here's another suggestion: Wouldn't it be appropriate to divide this multiple-choice survey question into two Yes/No questions (i.e., "... because of money?" and "... because of ideals?")? Then, assuming that the respondents were randomly selected from a large population, you could model it as two dependent binary variables (each having a Bernoulli distribution) and use McNemar's test to test the null hypothesis of equal marginal probabilities for "Yes."
Code example:
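/* Cell counts for Motivation_V18_H6 = 0 (money), 1 (ideals), 2 (both), 3 (none) */
data have;
  Motivation_V18_H6=_n_-1;
  input n @@;
  cards;
35 22 54 8
;

/* Split the multiple-choice item into two Yes/No indicators */
data want;
  set have;
  money  = (Motivation_V18_H6 in (0 2));
  ideals = (Motivation_V18_H6 in (1 2));
run;

/* The AGREE option requests McNemar's test for the 2x2 table */
proc freq data=want;
  weight n;
  tables money*ideals / agree;
run;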
Result:
Table of money by ideals

money     ideals

Frequency|
Percent  |
Row Pct  |
Col Pct  |       0|       1|  Total
---------+--------+--------+
       0 |      8 |     22 |     30
         |   6.72 |  18.49 |  25.21
         |  26.67 |  73.33 |
         |  18.60 |  28.95 |
---------+--------+--------+
       1 |     35 |     54 |     89
         |  29.41 |  45.38 |  74.79
         |  39.33 |  60.67 |
         |  81.40 |  71.05 |
---------+--------+--------+
Total          43       76      119
            36.13    63.87   100.00

Statistics for Table of money by ideals

McNemar's Test
Chi-Square    DF    Pr > ChiSq
    2.9649     1        0.0851

Simple Kappa Coefficient
            Standard
Estimate       Error    95% Confidence Limits
 -0.1107      0.0841     -0.2755       0.0541

Sample Size = 119
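For readers who want to check the McNemar chi-square in this output by hand, here is a short worked calculation. It assumes the standard uncorrected form of the statistic, which uses only the two discordant cells (money only = 35, ideals only = 22). The labels n_{10} and n_{01} for those cells are introduced here only for illustration.

```latex
\chi^2 \;=\; \frac{(n_{10}-n_{01})^2}{n_{10}+n_{01}}
       \;=\; \frac{(35-22)^2}{35+22}
       \;=\; \frac{169}{57}
       \;\approx\; 2.9649, \qquad \mathrm{df}=1
```

The 54 "money and ideals" and 8 "none" responses contribute equally to both marginal proportions (89/119 vs. 76/119), so they cancel out of the difference and do not enter the statistic. They still count toward the sample size of 119.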
I've run simulations to estimate the power and type I error probability of the test in situations similar to your numeric example:
/* Format for p-values (using a 5% significance level) */

proc format;
  value signif
    low-0.05   = 'significant'
    0.05<-high = 'not significant';
run;

/* Simulation of two dependent Bernoulli distributed random variables¹,   */
/* estimation of the power and type I error probability of McNemar's test */
/*                                                                        */
/* ¹ based on sample code from Rick Wicklin's blog article                */
/* https://blogs.sas.com/content/iml/2016/03/16/simulate-multinomial-sas-data-step.html */

%let SampleSize = 1000000;
%let N = 119;

%macro sim(outds, /* Name of output dataset */
           p1,    /* Parameter of the Bernoulli distribution of the row variable (X) */
           p2,    /* Parameter of the Bernoulli distribution of the column variable (Y) */
           p11    /* p11=P(X=Y=1); note: max(0, p1+p2-1) <= p11 <= min(p1, p2) */
          );
  data &outds(keep=i pv);
    call streaminit(27182818);
    array probs[4] _temporary_ (%sysevalf(1-&p1-&p2+&p11) /* p00 */
                                %sysevalf(&p2-&p11)       /* p01 */
                                %sysevalf(&p1-&p11)       /* p10 */
                                %sysevalf(&p11));         /* p11 */
    array x[4] x00 x01 x10 x11;
    do i = 1 to &SampleSize;
      ItemsLeft = &N;
      cumProb = 0;
      do j = 1 to dim(probs)-1;
        p = probs[j] / (1 - cumProb);
        x[j] = rand('binom', p, ItemsLeft);
        ItemsLeft = ItemsLeft - x[j];
        cumProb = cumProb + probs[j];
      end;
      x[dim(probs)] = ItemsLeft;
      if x01 ne x10 then pv=sdf('chisq',(x01-x10)**2/(x01+x10),1); /* p-value of McNemar's test */
      else pv=1;
      output;
    end;
  run;

  proc freq data=&outds;
    format pv signif.;
    tables pv / binomial;
  run;
%mend sim;

%sim(sim0, 165/238, 165/238, 54/119) /* one particular case of the null hypothesis (p1=p2) */
/* Estimated probability of type I error: 0.0492 */

%sim(sim1, 89/119, 76/119, 54/119) /* Corr(X,Y)=(&p11-&p1*&p2)/sqrt(&p1*&p2*(1-&p1)*(1-&p2))=-0.1144 */
/* Estimated power: 0.4052 */

%sim(sim2, 89/119, 76/119, 72/119) /* larger p11, Corr(X,Y)=0.6107 */
/* Estimated power: 0.8530 */
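The correlations quoted in the comments of the sim1 and sim2 calls follow from the formula given in the sim1 comment. As a hand check for the sim1 parameters (p1 = 89/119, p2 = 76/119, p11 = 54/119), with intermediate values rounded to five decimals:

```latex
\mathrm{Corr}(X,Y)
  \;=\; \frac{p_{11}-p_1 p_2}{\sqrt{p_1\,p_2\,(1-p_1)(1-p_2)}}
  \;\approx\; \frac{0.45378 - 0.74790\cdot 0.63866}
                   {\sqrt{0.74790\cdot 0.63866\cdot 0.25210\cdot 0.36134}}
  \;\approx\; \frac{-0.02387}{0.20860}
  \;\approx\; -0.1144
```

With p11 = 72/119 and the same p1 and p2, the numerator becomes about 0.12739 and the result is about 0.6107, which matches the sim2 comment.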
Two variants of McNemar's test (not shown here) -- the exact version available with the EXACT MCNEM statement in PROC FREQ and the asymptotic version using Edwards' correction (see Fleiss, Levin, and Paik, 2003, p. 375) -- turned out to be rather conservative, hence less powerful than the default "uncorrected" asymptotic version of the test.