Simulating Case Control data

Posted 06-09-2018 07:29 PM
(3043 views)

How do I simulate case-control data to analyze power for a conditional logistic regression model? I have experience simulating data from logistic regression models, but I'm not sure how to generate data for a case-control (1:1 ratio) design. Thank you!

I think we need a few more details about your plans for the study, such as how you plan to achieve the 1:1 ratio. Typically, case-control studies are observational: the "case" group has the disease, whereas the "control" group does not. Are you assuming that half the population has the disease and half does not? Or do you plan to oversample or undersample the population to achieve the 1:1 ratio? Could you provide more details?

In general, all simulation studies start with a model of the population from which you will sample. Read the articles "Simulating data for a logistic regression model" (DATA step) and "Simulate many samples from a logistic regression model" (SAS/IML) for the general framework. (You can also read an example about using simulation to estimate power.) During the simulation, you can assign GROUP="Control" for subjects that have Y=0 and GROUP="Case" for subjects that have Y=1. But in general these groups won't have a 1:1 ratio unless additional steps are taken.
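As a rough illustration of that last point, here is a minimal sketch that simulates a population from a logistic model and labels each subject as a case or a control. The population size (1000) and the coefficients are made-up values for illustration only, not anything from a real study.

```sas
/* Sketch only: the population size and coefficients are illustrative assumptions */
data Population;
   length Group $7;
   call streaminit(1234);                 /* set the random number seed */
   do i = 1 to 1000;
      x1  = rand("Uniform");              /* continuous covariate */
      eta = -2 + 0.8*x1;                  /* linear predictor (assumed values) */
      mu  = logistic(eta);                /* P(Y=1 | x1) */
      Y   = rand("Bernoulli", mu);
      Group = ifc(Y = 1, "Case", "Control");
      output;
   end;
   keep Y x1 Group;
run;

/* Count the subjects in each group */
proc freq data=Population;
   tables Group;
run;
```

With an intercept like the one assumed here, the PROC FREQ table should show far fewer cases than controls, which is why an extra sampling step is needed to obtain a 1:1 design.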

Thank you for your reply!!!

This is actually very helpful. Yes, I'm doing a power analysis for a case-control study (1:1 ratio) that aims to test an interaction between a continuous variable and race. Generating logistic data with an interaction term is no problem for me, but I got stuck at making half the sample controls and half cases. It seems you are suggesting that I generate a larger data set and sample from it to get the 1:1 case-control data that I want? That is a great idea! One question, though: will sampling from the larger data set change the effect size that I used to generate it?

Thank you! I'm so glad I posted the question after scratching my head in front of my computer for a whole day!!

I'm glad you found these thoughts helpful. I wasn't actually "suggesting" an approach, I was sort of wondering out loud how you intended to get the 1:1 ratio. My understanding is that it is okay to recruit more "cases" than "controls," and that can improve the power, especially for rare diseases/events. I found this abstract of a Lancet article, although I have not read the paper.

Ultimately, a simulation should reflect the design of the experiment that you are intending to carry out. In a real study, how would you recruit the cases and controls? If you use random selection, then in your simulation you can output the first N cases and the first N controls to generate the case-control sample. Of course, you might need to generate many more than 2N observations, so you should probably use DO UNTIL logic. Although your intentions are not clear, perhaps the following simulation might be modified to your needs:

```
%let N = 15;                           /* want N in each group */
data CaseControl;
   length Group $7;
   call streaminit(1234);              /* set the random number seed */
   NCase = 0; NControl = 0;
   do until (NCase >= &N and NControl >= &N);
      x1 = rand("Uniform");
      c1 = rand("Bernoulli", 0.3);
      eta = 1 + 0.8*x1 - 1.2*c1 + 0.4*c1*x1;   /* interaction model */
      mu = logistic(eta);
      Y = rand("Bernoulli", mu);
      Group = ifc(Y = 1, "Case", "Control");
      /* keep only the first N cases and the first N controls */
      if Group = "Case" & NCase < &N then do;
         NCase + 1; output;
      end;
      if Group = "Control" & NControl < &N then do;
         NControl + 1; output;
      end;
   end;
   keep Y x1 c1 Group;
run;

proc logistic data=CaseControl plots(only)=effectplot;
   class c1;
   model Y(event='1') = x1 | c1;
run;
```

Great! The advantage of the PROC SURVEYSELECT approach is that it is equally valid for real or simulated data.
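The post that describes the PROC SURVEYSELECT approach is not shown in this thread, so the following is only a guess at what such an approach might look like. It simulates a population (the population size of 2000 is an assumption; the model is the interaction model used earlier in the thread) and then draws a simple random sample of N subjects from each level of Y.

```sas
%let N = 15;                              /* want N in each group */

/* Assumed population; with real data, this step would be replaced by the cohort */
data Population;
   call streaminit(1234);
   do i = 1 to 2000;
      x1 = rand("Uniform");
      c1 = rand("Bernoulli", 0.3);
      eta = 1 + 0.8*x1 - 1.2*c1 + 0.4*c1*x1;
      Y = rand("Bernoulli", logistic(eta));
      output;
   end;
   keep Y x1 c1;
run;

/* The STRATA statement requires the data to be sorted by the strata variable */
proc sort data=Population;
   by Y;
run;

/* Draw N controls (Y=0) and N cases (Y=1) by simple random sampling */
proc surveyselect data=Population out=CaseControl
                  method=srs n=&N seed=1234;
   strata Y;
run;
```

Note that PROC SURVEYSELECT stops with an error if a stratum contains fewer than N subjects, so the population must be large enough to supply N cases and N controls.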

Thank you, Rick, for all your help!

Just so you know, I ran the simulations using your program to make sure mine is correct. The results are very close, with yours consistently a little more conservative than mine (with 1000 simulations, if mine gives 81.9% power, yours gives 80.5%). I ended up using the results from your program to be safe.

I really appreciate your expertise and help!!!
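For readers who want to try a similar check, here is a hedged sketch of how a simulation-based power estimate for the interaction term might be set up. It is not the program that either poster ran. The group size (N = 100), the 0.05 significance level, and the choice to treat c1 as a numeric 0/1 covariate are all assumptions made for illustration, and the analysis is the unconditional logistic model from the earlier reply.

```sas
%let N = 100;                 /* cases and controls per sample (assumed) */
%let NumSamples = 1000;       /* number of simulated samples */

data Sim;
   call streaminit(1234);
   do SampleID = 1 to &NumSamples;
      NCase = 0; NControl = 0;            /* reset the counters for each sample */
      do until (NCase >= &N and NControl >= &N);
         x1 = rand("Uniform");
         c1 = rand("Bernoulli", 0.3);
         eta = 1 + 0.8*x1 - 1.2*c1 + 0.4*c1*x1;
         Y = rand("Bernoulli", logistic(eta));
         if Y = 1 and NCase < &N then do;
            NCase + 1; output;
         end;
         else if Y = 0 and NControl < &N then do;
            NControl + 1; output;
         end;
      end;
   end;
   keep SampleID Y x1 c1;
run;

ods exclude all;                          /* suppress printed output for each BY group */
proc logistic data=Sim;
   by SampleID;
   model Y(event='1') = x1 c1 x1*c1;      /* c1 is treated as a numeric 0/1 covariate */
   ods output ParameterEstimates=PE;
run;
ods exclude none;

/* Flag the samples in which the interaction is significant at the 0.05 level */
data Reject;
   set PE;
   where upcase(Variable) = "X1*C1";
   RejectH0 = (ProbChiSq < 0.05);
run;

/* The mean of the 0/1 indicator estimates the power */
proc means data=Reject n mean;
   var RejectH0;
run;
```

Small differences between two such programs, like the 81.9% versus 80.5% mentioned above, are within what Monte Carlo error alone could produce with 1000 samples.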

Hi @Rick_SAS, you are definitely an expert in simulation. Could you please help me with my issue:

Thank you very much! Best regards, Thierry

The simulation method that @Rick_SAS suggested is OK, but because of the matched design, the analysis should be a conditional logistic regression. Therefore, PROC LOGISTIC needs a STRATA statement.

```
data simulation;
   array p{0:1};        * probabilities of exposure in the control (0) and case (1) groups;
   array odds{0:1};
   or = 1.5;            * the odds ratio, which is common to all strata;
   do strata = 1 to 100000;
      p[0] = rand('uniform');   * the probability of exposure in the control group;
      * now some simple calculations to get the probability in the case group;
      odds[0] = p[0] / (1 - p[0]);
      odds[1] = or * odds[0];
      p[1] = odds[1] / (1 + odds[1]);
      do case = 0 to 1;
         exposure = rand('bernoulli', p[case]);
         output;
      end;
   end;
   keep strata case exposure;
run;

* The STRATA statement conditions on the design, in which there is one case per stratum;
proc logistic data=simulation;
   class exposure(ref="0");
   model case(event="1") = exposure;
   strata strata / nosummary;
run;
```
