Solved: Re: Regression and Rule of 10

Epi_Stats · Posted 10-09-2023 04:01 AM

Hi,

My question relates to general guidance when conducting regression analysis, and I have therefore not provided example data.

I have a patient population with N=1,000.

In this population, a small number of patients (n<50) received a particular drug, which is the exposure focus of interest in my analysis.

I want to run a logistic regression to examine the effect of exposure to this drug on a patient's risk of hospital admission.

To obtain an adjusted effect estimate, I need to include 6 co-variates in my model. When I do this, I get an odds-ratio of ~2 (SE 0.8), and the CIs for the effect estimate are quite wide, no doubt partly due to the small sample.
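To make "quite wide" concrete, here is a minimal Python sketch of the Wald interval implied by those two numbers. It assumes the reported SE of 0.8 is on the log-odds scale (the usual scale for a logistic regression coefficient); if the SE was instead reported on the odds-ratio scale, the interval would differ. The function name is hypothetical.

```python
import math


def wald_ci_from_log_se(odds_ratio, se_log_or, z=1.96):
    """Wald confidence interval for an odds ratio.

    Assumes se_log_or is the standard error of log(odds ratio),
    i.e. of the logistic regression coefficient itself.
    """
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return lower, upper


# Numbers quoted in the post: OR of about 2, SE 0.8 (assumed log scale).
lower, upper = wald_ci_from_log_se(2.0, 0.8)
print(round(lower, 2), round(upper, 2))  # prints 0.42 9.59
```

Under that assumption the 95% interval runs from roughly 0.4 to 9.6, so it includes 1 and is compatible with anything from a protective effect to a large increase in odds.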

I am aware of the "rule of 10 events per variable" when running logistic regression, but I am wondering if I am interpreting this correctly - in my above example, does this mean 10 hospital admissions per co-variate included in my model, or 10 patients exposed to the drug of interest per co-variate included in my model?

Thank you in advance for your help,

JosvanderVelden · Posted 10-09-2023 04:51 AM

Have you searched for the post on this subject in this forum? A simple search should provide URLs to threads such as https://communities.sas.com/t5/Statistical-Procedures/Logistic-Regression-10-events-per-predictor-ru... & https://communities.sas.com/t5/SAS-Data-Science/A-Question-on-Modeling-Rare-Events-Data/m-p/482319.

Epi_Stats · Posted 10-09-2023 05:20 AM

Hi JosvanderVelden,

Thank you for your response.

Yes, I have searched for a solution to my question in the forum, but have not found an answer.

I want to know whether the "events" in the "rule of 10" relate to the outcome or to the exposure.

My regression equation is as follows:

Y_i = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + β₄X₄ + β₅X₅ + β₆X₆ + β₇X₇ + ε_i

Where Y is the dependent variable hospital admission (Yes/No); X1 is the drug exposure of interest; and X2-X7 are the confounders included in the model.

In my dataset, there are approximately 50 patients taking this drug, so since I have 6 confounders (X2-X7) in my model, would I require at least 60 patients taking the drug in order to calculate an effect estimate with greater precision? I am aware of the lack of consensus for the "rule of 10" in regression, but setting that aside, I am wondering if I am interpreting the rule correctly in my example.

sbxkoenk · Posted 10-10-2023 12:35 PM

Hello,

Your logistic regression has a binary target.

Your binary target is coded 'Y' vs. 'N', or 1 vs. 0, or 'event' vs. 'non-event', and so on.
One of the two levels in your outcome (target variable) is the minority level (possibly even rare).

For every coefficient (parameter) that you need to estimate in your logistic regression equation, you need at least 10 observations belonging to the minority level.

Take care: some covariates may be class effects (CLASS statement).
For a class effect with k levels, you need to estimate (k-1) coefficients (this holds for effect coding as well as reference coding). So counting the number of covariates is not enough; you have to count the coefficients.
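The counting described above can be sketched in a few lines of Python (not SAS). The function name and every number in the example are hypothetical, chosen only for illustration; in particular, the thread never states how many admissions there were. The intercept is left out of the coefficient count here, which is an assumption, since conventions vary.

```python
def events_per_parameter(n_events, n_total, levels_per_effect):
    """Minority-level observations per estimated coefficient.

    levels_per_effect has one entry per model effect: 1 for a continuous
    covariate, k for a class effect with k levels (which contributes
    k - 1 coefficients). The intercept is not counted (an assumption).
    """
    # The rule counts the rarer of the two outcome levels.
    minority = min(n_events, n_total - n_events)
    n_coefficients = sum(1 if k == 1 else k - 1 for k in levels_per_effect)
    return minority / n_coefficients, n_coefficients


# Hypothetical example: 1,000 patients, 120 admissions (assumed count),
# a binary drug exposure (2 levels), five continuous confounders and one
# class confounder with 4 levels.
epv, n_coef = events_per_parameter(120, 1000, [2, 1, 1, 1, 1, 1, 4])
print(n_coef, round(epv, 1))  # prints 9 13.3
```

Note that seven effects turn into nine coefficients here, and that the number of patients exposed to the drug does not enter the calculation at all.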

Kind regards,

Koen

Epi_Stats · Posted 10-10-2023 02:17 PM

Thank you Koen,

I am conscious of all of this, but it does not answer my question - which I have now re-phrased below for clarity.

PaigeMiller · Posted 10-10-2023 01:43 PM

I have to admit that I am always skeptical of these "rules of thumb" that provide advice on some statistical topic.

To be specific, I have no doubt that this "rule" applies to some situations and has worked for some people. I am not skeptical of that. I am skeptical that this is a universal "rule" that applies everywhere.

In particular, if the subject matter you are analyzing has very strong effects, you might be able to detect them with fewer observations than the "rule" requires. If the effects are weak, you might need a lot more data than the "rule" requires.

--
Paige Miller

Epi_Stats · Posted 10-10-2023 02:16 PM

Thank you Paige,

I agree with you, and I am equally skeptical of such "rules of thumb"!...

Perhaps I need to re-phrase my question,

If the proportion of individuals in my population (exposed to a drug or treatment of interest) is very small, say <10% of the overall population, then is it acceptable to not perform regression for this particular subgroup, since the sample is underpowered and the output will not be clinically significant?

PaigeMiller · Posted 10-10-2023 02:39 PM

@Epi_Stats wrote:

Thank you Paige,

I agree with you, and I am equally skeptical of such "rules of thumb"!...

Perhaps I need to re-phrase my question,

If the proportion of individuals in my population (exposed to a drug or treatment of interest) is very small, say <10% of the overall population, then is it acceptable to not perform regression for this particular subgroup, since the sample is underpowered and the output will not be clinically significant?

People in my industry (banking) try to model loans that go delinquent. For most situations, this is far less than 10% of the overall population of loans. Models of this data still are useful.

In a previous job in manufacturing, we tried to model what causes bad parts to be made. Again, far less than 10% are in the "bad" category, and yet the models were useful.

It seems that in your question you have more than two categories, but I wouldn't let that stop me from including the small group in the analyses. Whether or not it will be clinically significant (as opposed to statistically significant) is something that I cannot judge.

--
Paige Miller

Epi_Stats · Posted 10-10-2023 03:01 PM

Thank you very much, Paige. Your response has helped me!😁

sbxkoenk · Posted 10-10-2023 03:18 PM

@Epi_Stats wrote:

I have a patient population with N=1,000.

In this population, a small number of patients (n<50) received a particular drug, which is the exposure focus of interest in my analysis.

I want to run a logistic regression to examine the effect of exposure to this drug on a patient's risk of hospital admission.

To obtain an adjusted effect estimate, I need to include 6 co-variates in my model. When I do this, I get an odds-ratio of ~2 (SE 0.8), and the CIs for the effect estimate are quite wide, no doubt partly due to the small sample.

I am aware of the "rule of 10 events per variable" when running logistic regression, but I am wondering if I am interpreting this correctly - in my above example, does this mean 10 hospital admissions per co-variate included in my model, or 10 patients exposed to the drug of interest per co-variate included in my model?

OK. I have read your original question again.
You have 1,000 patients.
You want to model the (risk of) hospital admission.
You believe that having received a particular drug affects that risk, and fewer than 50 people (let's call them your treatment group) were exposed to this drug.

The rule of 10 refers to the number of hospital admissions (10 hospital admissions per coefficient, assuming admission is the less frequent outcome level), because hospital admission (Yes <-> No) is your dependent variable. It does not refer to the number of people in your treatment group.
A 5% treatment group is big enough if the drug effect is strong.
The statistical procedure (such as PROC LOGISTIC) takes the small number of drug users (relative to the total) into account when calculating the effect size and its confidence interval. The fewer people taking the drug, the more difficult it becomes to get a significant effect for it, but if the effect is strong, it shows up. So there is no harm or risk in including "drug" as an explanatory variable. Even with 15 drug users, you could try that. But you have noticed yourself that the CIs of the effect estimate become very wide, which is indeed (partially) due to the low number and the imbalance.
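To illustrate why a small exposed group widens the interval, here is a rough Python sketch using the unadjusted 2x2-table odds ratio with Woolf's standard error. This is a simplification, not what PROC LOGISTIC computes for a model with six covariates, and all cell counts below are hypothetical (the thread does not give them).

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald (Woolf) CI from a 2x2 table.

    a = exposed and admitted,   b = exposed and not admitted
    c = unexposed and admitted, d = unexposed and not admitted
    Returns (odds ratio, lower limit, upper limit, SE of the log odds ratio).
    """
    log_or = math.log((a * d) / (b * c))
    # Woolf's formula: every cell adds 1/count, so the smallest cell dominates.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return (math.exp(log_or),
            math.exp(log_or - z * se),
            math.exp(log_or + z * se),
            se)


# Hypothetical counts: 50 exposed out of 1,000 patients.
small = odds_ratio_ci(10, 40, 105, 845)
# Same proportions with ten times as many patients.
large = odds_ratio_ci(100, 400, 1050, 8450)

for label, (or_, lo, hi, se) in (("n=1,000", small), ("n=10,000", large)):
    print(f"{label}: OR={or_:.2f}, 95% CI {lo:.2f} to {hi:.2f}, SE(log OR)={se:.3f}")
```

Both tables give the same odds ratio (about 2), but in the smaller one the term 1/a for the ten exposed, admitted patients accounts for most of the variance. Adding more unexposed patients would barely change the standard error, while adding exposed patients would.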

@PaigeMiller : in credit scoring terms ... @Epi_Stats is wondering if the rural / urban place of living can be used as a candidate effect to model credit-worthiness (going delinquent or not) if only 5% is rural (50 rural versus 950 urban).

Koen

Epi_Stats · Posted 10-10-2023 03:46 PM

Thanks Koen!

PaigeMiller · Posted 10-10-2023 04:36 PM

@sbxkoenk wrote:

@PaigeMiller : in credit scoring terms ... @Epi_Stats is wondering if the rural / urban place of living can be used as a candidate effect to model credit-worthiness (going delinquent or not) if only 5% is rural (50 rural versus 950 urban).

We sometimes use a variable like "previous bankruptcy" in modeling, and off the top of my head, I would imagine that the rate of previous bankruptcy is under 5%. When we do such modeling, there are usually 100,000 or more total observations in the data set. Even at a rate of a few percent, that still means thousands of "previous bankruptcy" cases in the data.

--
Paige Miller
