I have a statistical model with 27 input variables. It was built on a Train dataset and tested on a Test dataset, and I've validated it with a KS test and PSI. It fails the KS test, which I think shows that it's overfitted.
I'm just not sure what I should do next.
Should I start taking some of the variables out of the model? If so, how do I decide which ones? Do I let other variables come in instead of the ones I'm taking out?
Should I start taking some of the variables out of the model?
Yes. 27 variables is far too many for a scorecard.
Generally, a scorecard contains 8-15 variables.
There are several ways to drop insignificant variables:
1)
proc logistic......
model ...../selection=stepwise
........
2) Drop the variables with P value > 0.05 in the parameter estimates table.
3) proc hpgenselect
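For option 1, here is a minimal sketch of what the full call might look like. The dataset work.train, the target bad and the predictors band_a, band_b and band_c are all made up for illustration (simulated in the data step), so swap in your own names. SLENTRY= and SLSTAY= are the significance levels for entering and staying in the model.

```sas
/* Hypothetical data: 6000 rows, binary target BAD, three banded predictors. */
data work.train;
  call streaminit(2024);
  do id = 1 to 6000;
    band_a = ceil(4 * rand('uniform'));   /* 4 bands, related to the target   */
    band_b = ceil(3 * rand('uniform'));   /* 3 bands, related to the target   */
    band_c = ceil(4 * rand('uniform'));   /* 4 bands, unrelated to the target */
    p   = 1 / (1 + exp(-(-2 + 0.5*band_a - 0.4*band_b)));
    bad = rand('bernoulli', p);
    output;
  end;
  drop p;
run;

/* Stepwise selection: an effect enters if its p-value is below SLENTRY */
/* and is removed again if its p-value rises above SLSTAY.              */
proc logistic data=work.train;
  class band_a band_b band_c / param=ref;
  model bad(event='1') = band_a band_b band_c
        / selection=stepwise slentry=0.05 slstay=0.05;
run;
```

Because the stay threshold is applied at every step, the final stepwise model will normally not contain effects above that threshold.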
Thanks for this!
I'm building the model using Proc Logistic. In the output table "Summary of Stepwise Selection" there weren't any variables with a Pr>Chi-Square value >0.05. Is this the P value?
I have dropped a couple of the variables, which has left me with 22 variables. The validation is slightly better, but the model is still not fully validated.
I've not used proc hpgenselect before so I'll read up on that and how to use it to help my model build.
Sorry, to clarify the second way: I mean NOT using the selection= option.
proc logistic......
model .....;
run;
Then check the output:
Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    -6.4769    6.8880           0.8842            0.3471
Age         1     1.6308    0.9349           3.0425            0.0811
Weight      1    -0.1569    0.0810           3.7507            0.0528
Then drop all the variables with P value > 0.05 by hand.
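If reading the table by eye gets tedious with many variables, one possible way is to capture it with ODS OUTPUT and filter it. This is only a sketch: work.train, bad, x1, x2 and x3 are hypothetical names on simulated data, and work.pe is just a name I picked for the captured table.

```sas
/* Hypothetical data: binary target BAD and three numeric predictors. */
data work.train;
  call streaminit(11);
  do id = 1 to 6000;
    x1 = rand('normal');
    x2 = rand('normal');
    x3 = rand('normal');   /* unrelated to the target by construction */
    p   = 1 / (1 + exp(-(-1 + 0.8*x1 - 0.5*x2)));
    bad = rand('bernoulli', p);
    output;
  end;
  drop p;
run;

/* Fit the full model with no SELECTION= option and save the */
/* "Analysis of Maximum Likelihood Estimates" table.         */
ods output ParameterEstimates=work.pe;
proc logistic data=work.train;
  model bad(event='1') = x1 x2 x3;
run;

/* List the candidates for manual removal (p-value above 0.05). */
proc print data=work.pe noobs;
  where Variable ne 'Intercept' and ProbChiSq > 0.05;
  var Variable Estimate StdErr WaldChiSq ProbChiSq;
run;
```

After dropping variables it is worth refitting and checking the table again, since the remaining p-values change once a variable is removed.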
Is this the method I should use if I'm modelling categorical variables too?
I will have maybe 4 different "bands" coming in for each class variable. Do I null only the band that has P>0.05, or do I null the whole variable if one band has that value?
Thanks for your help with this, it's much appreciated.
do I null the whole variables if one band has that value?
Drop the whole variable, whether it is categorical or numeric (i.e. null the whole variable).
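For CLASS variables, the per-band rows in the parameter estimates table each compare one band with the rest of the coding, while the "Type 3 Analysis of Effects" table gives one joint test per variable. That joint test may be the more natural thing to look at when deciding on the whole variable. A minimal sketch, assuming made-up names (work.train, bad, band_a, band_c) on simulated data:

```sas
/* Hypothetical data: two banded predictors, only BAND_A drives the target. */
data work.train;
  call streaminit(5);
  do id = 1 to 6000;
    band_a = ceil(4 * rand('uniform'));
    band_c = ceil(4 * rand('uniform'));
    p   = 1 / (1 + exp(-(-2 + 0.5*band_a)));
    bad = rand('bernoulli', p);
    output;
  end;
  drop p;
run;

/* Save the Type 3 table: one Wald test per CLASS variable, */
/* covering all of its bands at once.                       */
ods output Type3=work.type3;
proc logistic data=work.train;
  class band_a band_c / param=ref;
  model bad(event='1') = band_a band_c;
run;

proc print data=work.type3 noobs;
run;
```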
1. How did you decide which variables to leave in your model? What model fitting approach did you use?
2. Is your model designed to be explanatory or predictive?
3. How many observations do you have? If you fit a model with 27 variables and 500 observations it will never work.
4. For your categorical variables, how many levels do you have? A categorical variable with four levels actually counts as 3 variables, since it requires 3 dummy variables.
5. What did you set as the parameterization method for your categorical variables?
6. For the categorical variables, did you analyze them ahead of time to see if the levels make sense, or if they could be combined?
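On question 4, a quick way to see how many parameters the class variables really add is to count their levels. A rough sketch, assuming hypothetical names (work.train, band_a, band_b, band_c) and simulated data; the n_params column is something I compute here, not something SAS reports.

```sas
/* Hypothetical data: three banded predictors with 4, 3 and 5 bands. */
data work.train;
  call streaminit(3);
  do id = 1 to 6000;
    band_a = ceil(4 * rand('uniform'));
    band_b = ceil(3 * rand('uniform'));
    band_c = ceil(5 * rand('uniform'));
    output;
  end;
run;

/* NLEVELS reports the number of distinct levels per variable; */
/* NOPRINT on TABLES suppresses the frequency tables.          */
ods output NLevels=work.nlev;
proc freq data=work.train nlevels;
  tables band_a band_b band_c / noprint;
run;

/* Each class variable contributes (levels - 1) parameters to the model. */
data work.nlev;
  set work.nlev;
  n_params = NLevels - 1;
run;

proc print data=work.nlev noobs;
  var TableVar NLevels n_params;
run;
```

Summing n_params over all class variables gives the number of parameters (excluding the intercept) that the observations have to support.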
1. They needed to have a suitable Information Value and a trend that makes logical sense.
2. Predictive.
3. Approx 6000 observations
4. The categorical variables have different numbers of levels. I am only modelling categorical variables.
5. Not sure what this question means.
6. Yes, I analyzed over 200 variables (categorical) to make sure the levels made logical sense and combined levels where needed.
Thanks for your interest in helping me with this.
I don't think I have anything under that option. I just put CLASS then list the class variables followed by (REF = 'NULL')
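For context on the parameterization question: when the CLASS statement has no PARAM= option, PROC LOGISTIC uses effect coding by default, so each estimate is a deviation from the average over the levels rather than a comparison with the reference band. If the intent is to compare each band with the 'NULL' band, PARAM=REF can be stated explicitly. A minimal sketch, with work.train, bad and band_a as made-up names on simulated data:

```sas
/* Hypothetical data: one character class variable with a 'NULL' band. */
data work.train;
  call streaminit(7);
  length band_a $4;
  array lv[4] $4 _temporary_ ('NULL', 'LOW', 'MID', 'HIGH');
  do id = 1 to 6000;
    k = ceil(4 * rand('uniform'));
    band_a = lv[k];
    p   = 1 / (1 + exp(-(-2 + 0.4*k)));
    bad = rand('bernoulli', p);
    output;
  end;
  drop k p;
run;

/* Reference coding: each estimate compares a band with the 'NULL' band. */
proc logistic data=work.train;
  class band_a (ref='NULL') / param=ref;
  model bad(event='1') = band_a;
run;
```

The overall fit is the same under either coding. Only the meaning of the individual band estimates and their p-values changes.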
Thanks.
This is a statistics question. It would be better to post it in the Stat forum and call on @StatDave.
I've got it to validate today! Thanks for all your help with this, it's really appreciated. If I have similar questions in the future I'll be sure to post them on the stat board.