Stepwise logistic regression

Posted 07-24-2019 12:46 PM
(3711 views)

I am attempting to use the stepwise selection method to build a parsimonious model from 30 covariates, a dichotomous outcome, and 177 observations. I set SLENTRY=SLSTAY=0.1, and the initial univariate chi-square scores show 10 variables meeting the entry criterion. However, the two predictors with the largest chi-square scores each terminate the stepwise process: once entered, each fails the retention criterion (p>0.6), and the output states "Model building terminates because the last effect entered is removed by the Wald statistic criterion". If I exclude these two predictors from the stepwise selection, the model proceeds as *expected* until no additional predictors meet the entry criterion.

I have two questions:

1) Why does a predictor with a very large chi-square score, and p=0.0007, fail to be retained in the stepwise model?
2) Is it statistically defensible to exclude predictors with large chi-square scores from the stepwise process and proceed as I have described above?

All advice and citations accepted with gratitude.
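For readers following along, the entry and retention rules described in this post can be sketched in a few lines of Python. This is only a simplified illustration of one stepwise iteration, not PROC LOGISTIC's actual implementation: the function `stepwise_step`, the predictor names, and every p-value except the 0.0007 quoted in the post are hypothetical, and the sketch checks only the newly entered effect for removal.

```python
SLENTRY = 0.10  # significance level to enter, as in the question
SLSTAY = 0.10   # significance level to stay, as in the question


def stepwise_step(score_p, wald_p_once_entered):
    """Simulate one simplified stepwise iteration.

    score_p: hypothetical score-test p-values of effects not yet in the model.
    wald_p_once_entered: hypothetical Wald p-value each effect would have
        after being added to the current model.
    Returns a (status, effect) tuple.
    """
    if not score_p:
        return "stop: no candidates left", None
    best = min(score_p, key=score_p.get)  # strongest candidate by score test
    if score_p[best] > SLENTRY:
        return "stop: no effect meets SLENTRY", None
    if wald_p_once_entered[best] > SLSTAY:
        # The effect just entered is removed again, so selection terminates.
        return "stop: last effect entered is removed", best
    return "entered", best


# Hypothetical numbers: x_a has the best score test but a poor Wald test.
score_p = {"x_a": 0.0007, "x_b": 0.02, "x_c": 0.30}
wald_p = {"x_a": 0.62, "x_b": 0.03, "x_c": 0.35}

print(stepwise_step(score_p, wald_p))

# Excluding x_a lets the next candidate enter and stay.
without_a = {k: v for k, v in score_p.items() if k != "x_a"}
print(stepwise_step(without_a, wald_p))
```

In this toy setup the first call stops on `x_a`, while the second call enters `x_b`, which mirrors the behaviour described in the post when the two problem predictors are excluded.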

1 ACCEPTED SOLUTION


Opinion gratefully noted.

11 REPLIES


To run reliable models you usually need about 25 observations per covariate. For this model you would need at least 25*30 = 750 observations, assuming none of your covariates are categorical. You don't have enough data to run what you want. I would consider doing a PLS regression instead.
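The arithmetic behind that rule of thumb, as a minimal sketch. The 25-observations-per-covariate figure is the rule quoted in this reply, not a universal constant, and the variable names are illustrative only.

```python
OBS_PER_COVARIATE = 25  # rule of thumb quoted in the reply
n_covariates = 30       # covariates in the question
n_observations = 177    # sample size stated in the question

# Observations the rule asks for with all 30 covariates.
required = OBS_PER_COVARIATE * n_covariates

# Covariates the available sample would support under the same rule.
supported = n_observations // OBS_PER_COVARIATE

print(f"observations needed for {n_covariates} covariates: {required}")
print(f"covariates supported by {n_observations} observations: {supported}")
```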


Stepwise regression is what I call a counter-intuitive method. It adds a variable to the model because it meets some significance criterion, and then it can remove that same variable in the next step (or a later step) because it no longer meets the significance criterion. How can that be? How does that make sense? Why would you want to use such a procedure? How would you explain it to someone?

If you want to hear what people say about it, go to your favorite internet search engine and type in "problems with stepwise regression" and read what people say.

What is happening is that when you have correlated predictor variables (as your 30 variables are), the presence of (for example) X7 in the model changes the coefficients of X1-X6. When the coefficients change, the p-values change, and a variable that was significant without X7 in the model can become non-significant when X7 is in the model.
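One way to see this effect on p-values is through the standard error. The sketch below borrows the variance inflation factor from linear regression with two predictors, where a correlation r between them inflates a coefficient's variance by 1/(1 - r^2), and uses it as a rough stand-in for the logistic case. The coefficient, standard error, and correlation are made-up numbers, and holding the coefficient fixed when the second predictor enters is a simplifying assumption.

```python
import math


def wald_p_value(coef, se):
    """Two-sided p-value of the Wald test z = coef / se (normal approximation)."""
    z = abs(coef / se)
    return math.erfc(z / math.sqrt(2))


def inflated_se(se, r):
    """Standard error after adding one predictor correlated at r (VIF rule)."""
    vif = 1.0 / (1.0 - r * r)
    return se * math.sqrt(vif)


coef, se = 0.8, 0.3  # hypothetical estimate for X1 alone
r = 0.95             # hypothetical correlation between X1 and X7

p_alone = wald_p_value(coef, se)
p_joint = wald_p_value(coef, inflated_se(se, r))

print(f"X1 alone:   p = {p_alone:.4f}")
print(f"X1 with X7: p = {p_joint:.4f}")
```

With these numbers the first p-value falls below 0.1 and the second rises above it, so the same variable looks significant on its own and non-significant next to a highly correlated one.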

So, what should a conscientious data analyst do? My OPINION is that you should not use any form of stepwise regression (not stepwise, not forward, not backward). Instead, I use Partial Least Squares regression (PROC PLS in SAS) when I have many correlated X variables, and in PLS, a variable that is a good predictor remains a good predictor even when other variables are entered into (or removed from) the model. But wait: PROC PLS only works on continuous Y variables; it doesn't handle the logistic case. There is nothing in SAS that will perform Logistic PLS. There is a paper which explains the Logistic PLS algorithm, and I have written a SAS macro that performs Logistic PLS based upon this paper. I like the way it works in these situations, but I don't think my employer would want me to share the macro.

So what should you do? Well, I don't know. There is R code that performs Logistic PLS, if that's something that would help.

--

Paige Miller


I did suggest that SAS produce a PROC that performs Logistic PLS, but no one has voted for it 😞

https://communities.sas.com/t5/SASware-Ballot-Ideas/Logistic-version-of-PROC-PLS/idi-p/485503

--

Paige Miller


That’s a splendid response. Thank you. Now I have to convince a client.


@lcmichael_unc wrote:

Can I explain everything that STEPWISE does? No, I can't.

--

Paige Miller


Humor is an excellent explanation. Thanks.


@lcmichael_unc wrote:

What does the log say?

I would not be surprised to have something that relates to @Reeza's comment about sample size.


The log is silent...

NOTE: PROC LOGISTIC is modeling the probability that SVR12=1.

NOTE: Convergence criterion (GCONV=1E-8) satisfied in Step 0.

NOTE: Convergence criterion (GCONV=1E-8) satisfied in Step 1.

NOTE: Convergence criterion (GCONV=1E-8) satisfied in Step 2.

NOTE: LACKFIT is ignored since there is no explanatory variable in the model.

NOTE: The data set WORK.RSQUARE has 1 observations and 7 variables.

NOTE: The data set WORK.PARAMEST has 6 observations and 9 variables.

NOTE: The data set WORK.MODELINFO has 5 observations and 3 variables.

NOTE: The data set WORK.GOF has 2 observations and 5 variables.

NOTE: The data set WORK.ODDSRAT has 2 observations and 5 variables.

NOTE: The data set WORK.NOBS has 2 observations and 6 variables.

NOTE: There were 174 observations read from the data set FR190301.MITT_GT_VF.

NOTE: PROCEDURE LOGISTIC used (Total process time):

real time 0.18 seconds

cpu time 0.14 seconds


...and I wish it were not so.


@ballardw wrote:

@lcmichael_unc wrote:

What does the log say?

I would not be surprised to have something that relates to @Reeza's comment about sample size.

In my opinion, this is a deficiency of the method of stepwise regression, and has nothing to do with sample size.

--

Paige Miller


Opinion gratefully noted.
