manonlyn
Obsidian | Level 7

I've got a statistical model with 27 variables coming in. It was built on a Train dataset and tested on a Test dataset, and I've validated it with a KS test and PSI. It fails the KS test, which I think shows that it's overfitted.

I'm just not sure what I should do next.

Should I start taking some of the variables out of the model? If so, how do I decide which ones? Do I let other variables come in instead of the ones I'm taking out?

 

12 REPLIES
Ksharp
Super User

Should I start taking some of the variables out of the model?

Yes. 27 variables are far too many for a scorecard.

Generally, a scorecard contains 8-15 variables.

There are several ways to drop insignificant variables:

1)

proc logistic......

model ...../selection=stepwise

........

 

2) Drop the variables whose p-value is > 0.05 in the parameter estimates table.

 

3) proc hpgenselect
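
To make option 1 concrete, here is a minimal sketch of a stepwise fit. The dataset `work.train`, the response `bad`, and the predictors `x1-x27` are placeholder names, not taken from the thread. The entry and stay thresholds are written out explicitly so the 0.05 cut-off is visible. Categorical predictors would also need a CLASS statement.

```sas
/* Sketch only: work.train, bad, and x1-x27 are placeholder names. */
proc logistic data=work.train;
   /* Model the probability that bad = 1 */
   model bad(event='1') = x1-x27
         / selection=stepwise
           slentry=0.05   /* significance level required to enter the model */
           slstay=0.05;   /* significance level required to stay in the model */
run;
```

With these settings, any variable left in the final model has already passed the 0.05 stay threshold, so the final table may show no p-values above 0.05.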

manonlyn
Obsidian | Level 7

Thanks for this!

 

I'm building the model using Proc Logistic. In the output table "Summary of Stepwise Selection" there weren't any variables with a Pr>Chi-Square value >0.05. Is this the P Value?

 

I have dropped a couple of the variables, which has left me with 22 variables. The validation is slightly better, but the model still doesn't fully validate.

 

I've not used proc hpgenselect  before so I'll read up on that and how to use it to help my model build. 

 

 

Ksharp
Super User

Sorry.

For the second way, I mean NOT using the selection= option:

proc logistic......

model .....;

run;

 

and check output:

 

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1    -6.4769           6.8880            0.8842       0.3471
Age          1     1.6308           0.9349            3.0425       0.0811
Weight       1    -0.1569           0.0810            3.7507       0.0528

 

Then drop, by hand, all the variables whose p-value is > 0.05.
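
The manual check can be made less error-prone by capturing the estimates table as a dataset and filtering it. This is a rough sketch: `work.train`, `work.pe`, and `bad` are placeholder names, and `age` and `weight` simply mirror the sample output above.

```sas
/* Sketch only: work.train, work.pe and bad are placeholder names. */
ods output ParameterEstimates=work.pe;   /* capture the estimates table */
proc logistic data=work.train;
   model bad(event='1') = age weight;    /* note: no SELECTION= option */
run;

/* List the candidates for manual removal */
proc print data=work.pe noobs;
   where Variable ne 'Intercept' and ProbChiSq > 0.05;
   var Variable Estimate StdErr WaldChiSq ProbChiSq;
run;
```

In the sample output shown above, both Age (0.0811) and Weight (0.0528) would be listed. Since p-values shift when a variable is removed, it may be safer to drop one variable at a time and refit.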

manonlyn
Obsidian | Level 7

Is this the method I should use if I'm modelling categorical variables too? 

 

I will have maybe 4 different "bands" coming in for each class variable, so do I null only the band that has P>0.05, or do I null the whole variable if one band has that value?

 

Thanks for your help with this, it's much appreciated. 

Ksharp
Super User

do I null the whole variable if one band has that value?

 

Drop the whole variable, whether it is categorical or numeric (i.e. null the whole variable).
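
For a banded (CLASS) variable, the decision to drop the whole variable is easier to make from a test of the variable as a whole than from its individual bands. The sketch below assumes placeholder names (`work.train`, `bad`, `band_a`, `band_b`) and reference coding.

```sas
/* Sketch only: work.train, bad, band_a and band_b are placeholder names. */
proc logistic data=work.train;
   class band_a band_b / param=ref;
   model bad(event='1') = band_a band_b;
run;
/* In the output, "Analysis of Maximum Likelihood Estimates" has one row per
   non-reference band, while "Type 3 Analysis of Effects" has one row per
   variable, with a joint Wald test across all of its bands. */
```

The Type 3 p-value is then the natural one to compare against 0.05 when deciding whether a class variable stays or goes.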

 

Reeza
Super User

1. How did you decide which variables to leave in your model? What model fitting approach did you use?

2. Is your model designed to be explanatory or predictive?

3. How many observations do you have? If you fit a model with 27 variables and 500 observations, it will never work.

4. For your categorical variable, how many levels do you have? A categorical variable with four levels actually counts as 3 variables, since it requires 3 dummy variables.

5. What did you set as the parameterization method for your categorical variables?

6. For the categorical variables, did you analyze them ahead of time to see if the levels make sense, or if they could be combined?

 



manonlyn
Obsidian | Level 7

1. They needed a suitable Information Value (IV) and a trend that makes logical sense.

2. Predictive.

3. Approx 6000 observations

4. The categorical variables have different numbers of levels. I am only modelling categorical variables.

5. Not sure what this question means. 

6. Yes, I analyzed over 200 variables (categorical) to make sure the levels made logical sense and combined levels where needed. 

 

Thanks for your interest in helping me with this. 

manonlyn
Obsidian | Level 7

I don't think I have anything under that option. I just put CLASS then list the class variables followed by (REF = 'NULL')

 

Thanks.

Reeza
Super User

So you want all your categorical variables to be compared against the case where the value is NULL?

Usually you want PARAM=REF in your CLASS statement as well.

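
As a sketch of what that CLASS statement might look like, the example below uses the placeholder names `work.train`, `bad`, `band_a`, and `band_b`. It also assumes that 'NULL' is an actual value of each class variable. PROC LOGISTIC uses effect coding by default, so without PARAM=REF the estimates are not simple comparisons against the reference band.

```sas
/* Sketch only: work.train, bad, band_a and band_b are placeholder names,
   and 'NULL' is assumed to be an actual value of each class variable. */
proc logistic data=work.train;
   /* PARAM=REF: each band's estimate is its difference from the 'NULL' band */
   class band_a(ref='NULL') band_b(ref='NULL') / param=ref;
   model bad(event='1') = band_a band_b;
run;
```

With reference coding, a class variable with four bands contributes three parameters, one for each non-reference band.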
Ksharp
Super User

This is a statistics question. Better to post it in the Stat forum and call @StatDave.

manonlyn
Obsidian | Level 7

I've got it to validate today! Thanks for all your help with this; it's really appreciated. If I have similar questions in the future, I'll be sure to post them on the Stat board.
