Logistic Regression - Number of variables

11-12-2016 02:44 PM

Hi,

I am building a logistic regression model, and the model I ended up with predicts quite well.

My problem is that I started with 40+ variables and the selection procedure chose around 19 of them for the final output. Isn't 19 variables too many for a final model? I know it varies by business, but the models I built earlier never had more than 8-9 variables in the final output. The model predicts quite well on 2 different types of validation datasets, so it does not seem overfitted to me.

Another question: the final model includes a variable, Product, which is highly predictive and has three categories (three types of products). I tried to split this model into three separate models, one for each product. However, my overall model (where Product is a variable) is more predictive than any of the three individual models. Should I keep a single model, or does it make sense to split it into three models, one for each product?

Many thanks as always.

Regards

Sachin

Posted in reply to sachin01663

11-12-2016 09:25 PM

How many observations do you have?

Posted in reply to sachin01663

11-12-2016 09:55 PM

1) You can use PROC HPGENSELECT to check again how many variables you should retain.

2) You could also consider a mixed-effects logistic regression that treats Product as a random effect.

Check PROC GEE or PROC GLIMMIX.

Posted in reply to sachin01663

11-13-2016 12:12 AM

Posted in reply to sachin01663

11-13-2016 12:15 AM

@Reeza: The validation datasets on which the model is working had around 1 million observations each.

Posted in reply to sachin01663

11-13-2016 01:52 AM

With that many observations you're probably fine. You can also consider reducing your significance level so that fewer variables are retained, and see how it affects your accuracy. The level is typically 0.05, but because you have so much data, many effects will be statistically significant without being practically significant.
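To illustrate the point about sample size, here is a minimal Python sketch (not SAS, and not taken from the thread). It applies a pooled two-proportion z-test to the same small gap in event rates at two sample sizes. The rates 3.0% and 3.1%, the group sizes, and the function name are all hypothetical, chosen only for illustration.

```python
import math

def two_proportion_p_value(p1, p2, n_per_group):
    """Two-sided p-value of a pooled two-proportion z-test with equal group sizes."""
    pooled = (p1 + p2) / 2                      # pooled rate (valid because groups are equal-sized)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))     # P(|Z| > |z|) for a standard normal

# Hypothetical event rates: the same 0.1-point gap at two sample sizes.
p_small = two_proportion_p_value(0.030, 0.031, 1_000)
p_large = two_proportion_p_value(0.030, 0.031, 1_000_000)
print(f"n=1,000 per group:     p = {p_small:.3g}")
print(f"n=1,000,000 per group: p = {p_large:.3g}")
```

The gap is identical in both calls; only the sample size changes, and with it the verdict at the 0.05 level.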

Have you set prior probabilities since your event rate is so low?

How does your event rate change across the three levels that you're considering stratifying on? Have you considered a stratified model instead of separate models?
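On the prior-probabilities question: a prior correction matters when the training sample has a different event rate than the population, for example because events were oversampled. The thread does not say whether that is the case here, so the sketch below rests on that assumption. It is written in Python for illustration, and the function name, the 50% sample rate, and the 3% population rate are hypothetical.

```python
def adjust_for_prior(p_sample, sample_rate, population_rate):
    """Rescale a predicted probability from a model trained on a sample with
    event rate `sample_rate` to a population with event rate `population_rate`."""
    event_part = p_sample * population_rate / sample_rate
    non_event_part = (1 - p_sample) * (1 - population_rate) / (1 - sample_rate)
    return event_part / (event_part + non_event_part)

# Hypothetical case: model trained on a 50/50 oversample, true event rate 3%.
adjusted = adjust_for_prior(0.5, sample_rate=0.5, population_rate=0.03)
print(adjusted)
```

A score equal to the sample base rate maps to the population base rate, and when the two rates are equal the score is left unchanged. The correction is monotone, so it changes calibration but not the rank ordering of scores.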

Posted in reply to Reeza

11-13-2016 04:21 PM

Hi, you are right, I should lower the significance level as the sample size is huge. I haven't set prior probabilities or tried stratified regression. I will read more about them.

If my model is tested on two different validation datasets and predicts consistently on both, shouldn't it be fine? I am getting a c statistic of 0.7 (about 70% concordant pairs) on the training as well as the validation datasets.
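For reference, the c statistic mentioned here can be computed by comparing every event with every non-event: a pair is concordant when the event has the higher predicted probability, and tied pairs count as half. The following is a minimal Python sketch on a tiny made-up dataset; the scores and labels are hypothetical, and the quadratic loop is for illustration only and would be far too slow for a million rows.

```python
def c_statistic(scores, labels):
    """Share of (event, non-event) pairs in which the event has the higher
    score, counting tied pairs as half."""
    events = [s for s, y in zip(scores, labels) if y == 1]
    non_events = [s for s, y in zip(scores, labels) if y == 0]
    concordant = 0
    tied = 0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1
            elif e == n:
                tied += 1
    return (concordant + 0.5 * tied) / (len(events) * len(non_events))

# Hypothetical predicted probabilities and outcomes (1 = event).
scores = [0.10, 0.40, 0.35, 0.80]
labels = [0, 0, 1, 1]
c = c_statistic(scores, labels)
print(c)
```

Because the statistic depends only on the ordering of scores, matching values on training and validation data indicate stable ranking, not that the predicted probabilities are well calibrated.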

Thanks for all the replies.

Sachin

Posted in reply to Reeza

11-13-2016 04:25 PM

Yes, the event rate changes a lot. Product A has 2.5%, B has 4% and C has 3%. This is the reason I wanted to build different models: one product has a much higher event rate, so Product is highly correlated with the outcome.
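One way to read these rates: in a single model with Product as a categorical variable, differences in base rate are absorbed as shifts on the log-odds scale. The Python sketch below converts the three quoted event rates to log-odds and expresses B and C relative to A. It ignores all other predictors, which is a simplifying assumption, and the choice of A as the reference level is arbitrary.

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1 - p))

# Event rates quoted in the post above.
rates = {"A": 0.025, "B": 0.04, "C": 0.03}

# Shift of each product's log-odds relative to product A (reference level).
baseline = logit(rates["A"])
shifts = {product: logit(rate) - baseline for product, rate in rates.items()}

for product, shift in shifts.items():
    print(f"Product {product}: log-odds shift {shift:+.3f}, odds ratio {math.exp(shift):.3f}")
```

Separate models would be worth the added complexity mainly if the effects of the other predictors, and not just the base rate, differ by product.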