Solved: Re: Simulating Case Control data

liliu · Posted 06-09-2018 07:29 PM

How do I simulate case control data in analyzing power for conditional logistic regression model? I have experience in simulating logistic regression models, but not sure how to generate data that's case control (1:1 ratio) design. Thank you!

Rick_SAS · Posted 06-10-2018 12:50 PM

I'm glad you found these thoughts helpful. I wasn't actually "suggesting" an approach, I was sort of wondering out loud how you intended to get the 1:1 ratio. My understanding is that it is okay to recruit more "cases" than "controls," and that can improve the power, especially for rare diseases/events. I found this abstract of a Lancet article, although I have not read the paper.

Ultimately, a simulation should reflect the design of the experiment that you are intending to carry out. In a real study, how would you recruit the cases and controls? If you use random selection, then in your simulation you can output the first N cases and the first N controls to generate the case-control sample. Of course, you might need to generate many more than 2N observations, so you should probably use DO UNTIL logic. Although your intentions are not clear, perhaps the following simulation might be modified to your needs:

%let N = 15;             /* want N in each group */ 
data CaseControl;
length Group $7;
call streaminit(1234);                     /* set the random number seed */

NCase = 0; NControl = 0;
do until (NCase >= &N and NControl >= &N);  
   x1 = rand("Uniform");
   c1 = rand("Bernoulli", 0.3);
   eta = 1 + 0.8*x1 - 1.2*c1 + 0.4*c1*x1;  /* interaction model */
   mu = logistic(eta);
   Y = rand("Bernoulli", mu);
   Group = ifc(Y = 1, "Case", "Control");
   if Group = "Case" & NCase < &N then do;
      NCase + 1; output;
   end;
   if Group = "Control" & NControl < &N then do;
      NControl + 1; output;
   end;
end;

keep Y x1 c1 Group;
run;

proc logistic data=CaseControl plots(only)=effectplot;
class c1 ;
model Y(event='1') = x1 | c1;
run;

View solution in original post

Rick_SAS · Posted 06-10-2018 06:32 AM

I think we need a few more details about your plans for the study, such as how you plan to achieve the 1:1 ratio. Typically case-control studies are observational, and the "case" group has the disease whereas the "control" group does not. Are you assuming that the half the population has the disease and half don't? Or do you plan to over/undersample the population to achieve the 1:1 ratio? Could you provide more details?

In general, all simulation studies, start with a model of the population from which you will sample. Read the articles "Simulating data for a logistic regression model" (DATA step) and "Simulate many samples from a logistic regression model" (SAS/IML) for the general framework. (You can also read an example about using simulation to estimate power.) During the simulation you can assign GROUP="Control" for subjects that have Y=0 and GROUP="Case" for subjects that have Y=1. But in general these groups won't have a 1:1 ratio unless additional steps are taken.

liliu · Posted 06-10-2018 10:55 AM

Thank you for your reply!!!

This is actually very helpful. Yes I'm doing power for a case-control data (1:1 ratio) that aimed to test an interaction between a continuous variable and race. Generating logistic data using an interaction term is no problem for me, but I got stuck at making half the sample control, half case. Seems you are suggesting generating larger data and sample from it to get the 1:1 case-control data that I want? That is a great idea! Question though: Will sampling from the larger data set change the effect size I used to generate the larger data set?

Thank you! I'm so glad I posted the question after scratching my head in front of my computer for a whole day!!

Rick_SAS · Posted 06-10-2018 12:50 PM

I'm glad you found these thoughts helpful. I wasn't actually "suggesting" an approach, I was sort of wondering out loud how you intended to get the 1:1 ratio. My understanding is that it is okay to recruit more "cases" than "controls," and that can improve the power, especially for rare diseases/events. I found this abstract of a Lancet article, although I have not read the paper.

Ultimately, a simulation should reflect the design of the experiment that you are intending to carry out. In a real study, how would you recruit the cases and controls? If you use random selection, then in your simulation you can output the first N cases and the first N controls to generate the case-control sample. Of course, you might need to generate many more than 2N observations, so you should probably use DO UNTIL logic. Although your intentions are not clear, perhaps the following simulation might be modified to your needs:

%let N = 15;             /* want N in each group */ 
data CaseControl;
length Group $7;
call streaminit(1234);                     /* set the random number seed */

NCase = 0; NControl = 0;
do until (NCase >= &N and NControl >= &N);  
   x1 = rand("Uniform");
   c1 = rand("Bernoulli", 0.3);
   eta = 1 + 0.8*x1 - 1.2*c1 + 0.4*c1*x1;  /* interaction model */
   mu = logistic(eta);
   Y = rand("Bernoulli", mu);
   Group = ifc(Y = 1, "Case", "Control");
   if Group = "Case" & NCase < &N then do;
      NCase + 1; output;
   end;
   if Group = "Control" & NControl < &N then do;
      NControl + 1; output;
   end;
end;

keep Y x1 c1 Group;
run;

proc logistic data=CaseControl plots(only)=effectplot;
class c1 ;
model Y(event='1') = x1 | c1;
run;

liliu · Posted 06-10-2018 05:20 PM

Thank you, Rich for your help! I went with our original thought, generated larger data from the model I'm testing, used Proc Surveyselect to select a 1:1 ratio case/control, and achieved what I wanted. The solution your sent is definitely more elegant. I'm saving it for the future. Thank you for your prompt replies, suggestions, and help!

Rick_SAS · Posted 06-10-2018 07:52 PM

Great! The advantage of the PROC SURVEYSELECT approach is that is equally valid for real or simulated data,

liliu · Posted 06-11-2018 12:58 PM

Thank you, Rick, for all your help!

Just so you know, I ran the simulations using your program just to make
sure mine is correct. Results are very close with yours is consistently a
little bit more conservative than mine (1000 simulations, if mine has 81.9%
power, yours would be 80.5). I ended up using results from your program to
be safe.

I really appreciate your expertise and help!!!

tritringuyen · Posted 08-14-2018 03:09 PM

Hi @Rick_SAS, you are definitely an expert in simulation. Could you please help me with my issue:

https://communities.sas.com/t5/SAS-Procedures/Simulation-Sampling-and-OLS-regression-in-SAS/m-p/4867...

Thank you very much! Best regards, Thierry

JacobSimonsen · Posted 08-15-2018 05:18 AM

The simulation method @Rick_SAS's suggestion is ok. But the analysis should be conditional logistic regression due to the design. Therefore, there has to be a strata statement in proc logistic.

data simulation;

array p{0:1}; *vector of probabilities of exposure in case and control group;

array odds{0:1};

or=1.5; *the odds ratio, that are common for all strata;

do strata=1 to 100000;

p[0]=rand('uniform'); *the probability of exposure in the control group;

*now some simple calculations to calculate probability in the case group;

odds[0]=p[0]/(1-p[0]);

odds[1]=or*odds[0];

p[1]=odds[1]/(1+odds[1]);

do case=0 to 1;

exposure=rand('bernoulli',p[case]);

output;

end;

keep strata case exposure;

run;

*The strata statement in proc logistic condition that due to the design there is one case per stratum;

proc logistic data=simulation;

class exposure(ref="0");

model case(event="1")=exposure;

strata strata/nosummary;

run;

liliu · Posted 08-15-2018 04:18 PM

Thank you for your reply - yes, conditional model should be used for matched data. I liked your simulation part, too!