Epi_Stats
Obsidian | Level 7

Hi,

 

My question relates to general guidance when conducting regression analysis, and I have therefore not provided example data.

 

I have a patient population with N=1,000.

 

In this population, a small number of patients (n<50) received a particular drug, which is the exposure focus of interest in my analysis.

 

I want to run a logistic regression to examine the effect of exposure to this drug on a patient's risk of hospital admission.

 

To obtain an adjusted effect estimate, I need to include 6 co-variates in my model. When I do this, I get an odds-ratio of ~2 (SE 0.8), and the CIs for the effect estimate are quite wide, no doubt partly due to the small sample.

 

I am aware of the "rule of 10 events per variable" when running logistic regression, but I am wondering if I am interpreting this correctly - in my above example, does this mean 10 hospital admissions per co-variate included in my model, or 10 patients exposed to the drug of interest per co-variate included in my model?

 

Thank you in advance for your help,
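
As a rough check on how wide the interval in the question is, the sketch below computes a 95% Wald confidence interval from the reported odds ratio. It assumes the quoted SE of 0.8 is the standard error of the coefficient on the log-odds scale, which the post does not state; the function name `wald_ci_for_or` is made up for illustration.

```python
import math

def wald_ci_for_or(odds_ratio, se_log_or, z=1.96):
    """Return the (lower, upper) Wald interval for an odds ratio.

    Assumes se_log_or is the standard error of log(odds_ratio).
    """
    log_or = math.log(odds_ratio)
    return math.exp(log_or - z * se_log_or), math.exp(log_or + z * se_log_or)

# Figures from the question: OR ~2, SE 0.8 (assumed to be on the log-odds scale).
lower, upper = wald_ci_for_or(2.0, 0.8)
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```

Under that assumption the interval runs from roughly 0.4 to 9.6, so it comfortably includes 1.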

 

1 ACCEPTED SOLUTION

sbxkoenk
SAS Super FREQ

@Epi_Stats wrote:

I have a patient population with N=1,000.

In this population, a small number of patients (n<50) received a particular drug, which is the exposure focus of interest in my analysis.

I want to run a logistic regression to examine the effect of exposure to this drug on a patient's risk of hospital admission.

To obtain an adjusted effect estimate, I need to include 6 co-variates in my model. When I do this, I get an odds-ratio of ~2 (SE 0.8), and the CIs for the effect estimate are quite wide, no doubt partly due to the small sample.

I am aware of the "rule of 10 events per variable" when running logistic regression, but I am wondering if I am interpreting this correctly - in my above example, does this mean 10 hospital admissions per co-variate included in my model, or 10 patients exposed to the drug of interest per co-variate included in my model?

OK. I have read your original question again.
You have 1000 patients.
You want to model the (risk of) hospital admission.
You want to know whether having received a particular drug changes that risk, and only about 50 people (let's call them your treatment group) were exposed to this drug.

The rule of 10 refers to your dependent variable, hospital admission (Yes / No), not to the number of people in your treatment group: you need at least 10 observations in the less frequent outcome level (usually the admissions) per coefficient.
A 5% treatment group is big enough if the drug effect is strong.
The statistical procedure (such as PROC LOGISTIC) takes the small number of drug users (relative to the total) into account when calculating the effect estimate and its confidence interval. The fewer people taking the drug, the more difficult it becomes to get a significant effect for it, but if the effect is strong, it shows up. So there is no harm in trying to include "drug" as an explanatory variable. Even with 15 drug users, you could try that. But you have noticed yourself that the CIs of the effect estimate become very wide, which is indeed (partially) due to the low number and the imbalance.
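
To see why a small exposed group widens the interval even when the total N is large, a crude (unadjusted) 2x2 calculation helps. The sketch uses Woolf's formula for the standard error of a log odds ratio; the cell counts are hypothetical, chosen only to mimic about 50 exposed patients out of 1,000, and are not taken from the poster's data.

```python
import math

def woolf_se(a, b, c, d):
    """SE of log(OR) for a 2x2 table: sqrt(1/a + 1/b + 1/c + 1/d)."""
    return math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

# Hypothetical counts (not from the thread):
# exposed:   a admitted, b not admitted
# unexposed: c admitted, d not admitted
a, b, c, d = 10, 40, 105, 845

odds_ratio = (a * d) / (b * c)
se = woolf_se(a, b, c, d)
# Fraction of the variance of log(OR) contributed by the two exposed cells.
exposed_share = (1 / a + 1 / b) / (se ** 2)
print(f"OR = {odds_ratio:.2f}, SE(log OR) = {se:.2f}")
print(f"Share of variance from the exposed group: {exposed_share:.0%}")
```

With these made-up counts, about nine tenths of the variance comes from the two exposed cells, so adding more unexposed patients would barely narrow the interval. An adjusted model will differ in detail, but the same imbalance drives its precision.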

 

@PaigeMiller : in credit scoring terms ... @Epi_Stats is wondering if the rural / urban place of living can be used as a candidate effect to model credit-worthiness (going delinquent or not) if only 5% is rural (50 rural versus 950 urban).

 

Koen

Epi_Stats
Obsidian | Level 7

Hi JosvanderVelden,

 

Thank you for your response.

 

Yes, I have searched for a solution to my question in the forum, but have not found an answer.

 

I want to know whether the "events" in the "rule of 10" relate to the outcome or the exposure.

 

My regression equation is as follows:

 

logit(P(Y = 1)) = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + β6X6 + β7X7

 

Where Y is the dependent variable hospital admission (Yes/No); X1 is the drug exposure of interest; and X2-7 are confounders included in the model.

 

In my dataset, there are approximately 50 patients taking this drug. Since I have 6 confounders (X2-7) in my model, would I require at least 60 patients taking the drug in order to calculate an effect estimate with greater precision? I am aware of the lack of consensus on the "rule of 10" in regression, but leaving that aside, am I interpreting the rule correctly in my example?

 

sbxkoenk
SAS Super FREQ

Hello,

 

Your logistic regression has a binary target.

Your binary target is coded 'Y' vs. 'N', or 1 vs. 0, or 'event' vs. 'non-event', or ...
One of the two levels in your outcome (target variable) is the minority level (possibly even rare).

For every coefficient (parameter) that you need to estimate in your logistic regression equation, you need at least 10 observations belonging to the minority level. 

 

Take care: some covariates may be class effects (CLASS statement).
For a class effect with k levels, you need to estimate (k-1) coefficients (assuming effect coding is used). So, counting the number of covariates is not enough.
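
The counting described above can be sketched in a few lines. This is only an illustration: the helper names, the mix of covariate types and the number of admissions are all hypothetical, since the thread does not give them, and the intercept is assumed not to count toward the rule.

```python
def count_coefficients(n_single_df, class_levels):
    """Count slope coefficients, excluding the intercept.

    n_single_df: number of continuous or binary covariates (one coefficient each).
    class_levels: number of levels of each CLASS effect (k - 1 coefficients each).
    """
    return n_single_df + sum(k - 1 for k in class_levels)

def events_per_coefficient(n_events, n_total, n_coefficients):
    """Observations in the minority outcome level per coefficient."""
    minority = min(n_events, n_total - n_events)
    return minority / n_coefficients

# Hypothetical model: drug exposure plus 4 single-coefficient covariates,
# and 2 class effects with 3 and 4 levels.
n_coef = count_coefficients(5, [3, 4])
# Hypothetical outcome split: 115 admissions among 1,000 patients.
epc = events_per_coefficient(115, 1000, n_coef)
print(n_coef, epc)
```

Here seven variables turn into ten coefficients, so the rule would ask for at least 100 admissions; the assumed 115 just clears it.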

Kind regards,

Koen

Epi_Stats
Obsidian | Level 7

Thank you Koen,

 

I am conscious of all of this, but it does not answer my question - which I have now re-phrased below for clarity.

PaigeMiller
Diamond | Level 26

I have to admit that I am always skeptical of these "rules of thumb" that provide advice on some statistical topic.

 

To be specific, I have no doubt that this "rule" applies to some situations and has worked for some people. I am not skeptical of that. I am skeptical that this is a universal "rule" that applies everywhere.

 

In particular, if the subject matter you are analyzing has very strong effects, you might be able to detect them with fewer observations than the "rule" requires. If the effects are weak, you might need a lot more data than the "rule" requires.

--
Paige Miller
Epi_Stats
Obsidian | Level 7

Thank you Paige,

 

I agree with you, and I am equally skeptical of such "rules of thumb"!...

 

Perhaps I need to re-phrase my question,

 

If the proportion of individuals in my population (exposed to a drug or treatment of interest) is very small, say <10% of the overall population, then is it acceptable to not perform regression for this particular subgroup, since the sample is underpowered and the output will not be clinically significant?

PaigeMiller
Diamond | Level 26

@Epi_Stats wrote:

Thank you Paige,

 

I agree with you, and I am equally skeptical of such "rules of thumb"!...

 

Perhaps I need to re-phrase my question,

 

If the proportion of individuals in my population (exposed to a drug or treatment of interest) is very small, say <10% of the overall population, then is it acceptable to not perform regression for this particular subgroup, since the sample is underpowered and the output will not be clinically significant?


People in my industry (banking) try to model loans that go delinquent. For most situations, this is far less than 10% of the overall population of loans. Models of this data still are useful. 

 

In a previous job in manufacturing, we tried to model what causes bad parts to be made. Again, far less than 10% are in the "bad" category, and yet the models were useful.

 

It seems that in your question you have more than two categories, but I wouldn't let that stop me from including the small group in the analyses. Whether or not it will be clinically significant (as opposed to statistically significant) is something that I cannot judge.

--
Paige Miller
Epi_Stats
Obsidian | Level 7

Thank you very much, Paige. Your response has helped me!😁

Epi_Stats
Obsidian | Level 7

Thanks Koen!

PaigeMiller
Diamond | Level 26

@sbxkoenk wrote:

@PaigeMiller : in credit scoring terms ... @Epi_Stats is wondering if the rural / urban place of living can be used as a candidate effect to model credit-worthiness (going delinquent or not) if only 5% is rural (50 rural versus 950 urban).

 


We sometimes use a variable like "previous bankruptcy" in modeling and off the top of my head, I would imagine that the rate of previous bankruptcy is under 5%. When we do such modeling, there are usually 100,000 or more total observations in the data set, and hence 5,000 or more "previous bankruptcy" in the data.

--
Paige Miller
