How do I simulate case control data in analyzing power for conditional logistic regression model? I have experience in simulating logistic regression models, but not sure how to generate data that's case control (1:1 ratio) design. Thank you!
I'm glad you found these thoughts helpful. I wasn't actually "suggesting" an approach, I was sort of wondering out loud how you intended to get the 1:1 ratio. My understanding is that it is okay to recruit more "cases" than "controls," and that can improve the power, especially for rare diseases/events. I found this abstract of a Lancet article, although I have not read the paper.
Ultimately, a simulation should reflect the design of the experiment that you are intending to carry out. In a real study, how would you recruit the cases and controls? If you use random selection, then in your simulation you can output the first N cases and the first N controls to generate the case-control sample. Of course, you might need to generate many more than 2N observations, so you should probably use DO UNTIL logic. Although your intentions are not clear, perhaps the following simulation might be modified to your needs:
%let N = 15; /* want N in each group */
data CaseControl;
length Group $7;
call streaminit(1234); /* set the random number seed */
NCase = 0; NControl = 0;
do until (NCase >= &N and NControl >= &N);
x1 = rand("Uniform");
c1 = rand("Bernoulli", 0.3);
eta = 1 + 0.8*x1 - 1.2*c1 + 0.4*c1*x1; /* interaction model */
mu = logistic(eta);
Y = rand("Bernoulli", mu);
Group = ifc(Y = 1, "Case", "Control");
if Group = "Case" & NCase < &N then do;
NCase + 1; output;
end;
if Group = "Control" & NControl < &N then do;
NControl + 1; output;
end;
end;
keep Y x1 c1 Group;
run;
proc logistic data=CaseControl plots(only)=effectplot;
class c1 ;
model Y(event='1') = x1 | c1;
run;
I think we need a few more details about your plans for the study, such as how you plan to achieve the 1:1 ratio. Typically case-control studies are observational, and the "case" group has the disease whereas the "control" group does not. Are you assuming that the half the population has the disease and half don't? Or do you plan to over/undersample the population to achieve the 1:1 ratio? Could you provide more details?
In general, all simulation studies, start with a model of the population from which you will sample. Read the articles "Simulating data for a logistic regression model" (DATA step) and "Simulate many samples from a logistic regression model" (SAS/IML) for the general framework. (You can also read an example about using simulation to estimate power.) During the simulation you can assign GROUP="Control" for subjects that have Y=0 and GROUP="Case" for subjects that have Y=1. But in general these groups won't have a 1:1 ratio unless additional steps are taken.
Thank you for your reply!!!
This is actually very helpful. Yes I'm doing power for a case-control data (1:1 ratio) that aimed to test an interaction between a continuous variable and race. Generating logistic data using an interaction term is no problem for me, but I got stuck at making half the sample control, half case. Seems you are suggesting generating larger data and sample from it to get the 1:1 case-control data that I want? That is a great idea! Question though: Will sampling from the larger data set change the effect size I used to generate the larger data set?
Thank you! I'm so glad I posted the question after scratching my head in front of my computer for a whole day!!
I'm glad you found these thoughts helpful. I wasn't actually "suggesting" an approach, I was sort of wondering out loud how you intended to get the 1:1 ratio. My understanding is that it is okay to recruit more "cases" than "controls," and that can improve the power, especially for rare diseases/events. I found this abstract of a Lancet article, although I have not read the paper.
Ultimately, a simulation should reflect the design of the experiment that you are intending to carry out. In a real study, how would you recruit the cases and controls? If you use random selection, then in your simulation you can output the first N cases and the first N controls to generate the case-control sample. Of course, you might need to generate many more than 2N observations, so you should probably use DO UNTIL logic. Although your intentions are not clear, perhaps the following simulation might be modified to your needs:
%let N = 15; /* want N in each group */
data CaseControl;
length Group $7;
call streaminit(1234); /* set the random number seed */
NCase = 0; NControl = 0;
do until (NCase >= &N and NControl >= &N);
x1 = rand("Uniform");
c1 = rand("Bernoulli", 0.3);
eta = 1 + 0.8*x1 - 1.2*c1 + 0.4*c1*x1; /* interaction model */
mu = logistic(eta);
Y = rand("Bernoulli", mu);
Group = ifc(Y = 1, "Case", "Control");
if Group = "Case" & NCase < &N then do;
NCase + 1; output;
end;
if Group = "Control" & NControl < &N then do;
NControl + 1; output;
end;
end;
keep Y x1 c1 Group;
run;
proc logistic data=CaseControl plots(only)=effectplot;
class c1 ;
model Y(event='1') = x1 | c1;
run;
Thank you, Rich for your help! I went with our original thought, generated larger data from the model I'm testing, used Proc Surveyselect to select a 1:1 ratio case/control, and achieved what I wanted. The solution your sent is definitely more elegant. I'm saving it for the future. Thank you for your prompt replies, suggestions, and help!
Hi @Rick_SAS, you are definitely an expert in simulation. Could you please help me with my issue:
Thank you very much! Best regards, Thierry
The simulation method @Rick_SAS's suggestion is ok. But the analysis should be conditional logistic regression due to the design. Therefore, there has to be a strata statement in proc logistic.
data simulation;
array p{0:1}; *vector of probabilities of exposure in case and control group;
array odds{0:1};
or=1.5; *the odds ratio, that are common for all strata;
do strata=1 to 100000;
p[0]=rand('uniform'); *the probability of exposure in the control group;
*now some simple calculations to calculate probability in the case group;
odds[0]=p[0]/(1-p[0]);
odds[1]=or*odds[0];
p[1]=odds[1]/(1+odds[1]);
do case=0 to 1;
exposure=rand('bernoulli',p[case]);
output;
end;
end;
keep strata case exposure;
run;
*The strata statement in proc logistic condition that due to the design there is one case per stratum;
proc logistic data=simulation;
class exposure(ref="0");
model case(event="1")=exposure;
strata strata/nosummary;
run;
Thank you for your reply - yes, conditional model should be used for matched data. I liked your simulation part, too!
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.