Shakir_Juolay
Obsidian | Level 7

Consider the following hypothetical situation, about which I have two questions regarding the validity of the split between training and validation sets.

 

I am trying to build a logistic regression model with a single categorical explanatory variable X for a binary target Y, where the event is Y = 1.

 

Below is the distribution of Y and X in the Raw Table:

Y      #     %
0      97    53%
1      85    47%
Total  182   100%

 

X      #     %
A      62    34%
B      24    13%
C      91    50%
D      5     3%
Total  182   100%

 

I am doing a 70:30 split into training and validation sets.
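For concreteness, a stratified 70:30 split like the one described here can be sketched in pure Python (a hypothetical illustration, not SAS code; the `stratified_split` helper and the toy rows are my own):

```python
import random

def stratified_split(rows, key, train_frac=0.7, seed=42):
    """Split rows 70:30, preserving the proportion of each key(row)
    group (here the target Y) in both partitions."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row)
    train, valid = [], []
    for members in groups.values():
        rng.shuffle(members)
        cut = round(len(members) * train_frac)
        train.extend(members[:cut])
        valid.extend(members[cut:])
    return train, valid

# Toy data shaped like the Raw Table: 97 rows with Y=0 and 85 with Y=1
rows = [{"Y": 0}] * 97 + [{"Y": 1}] * 85
train, valid = stratified_split(rows, key=lambda r: r["Y"])
```

Stratifying on Y guarantees the overall event rate is nearly identical in the two partitions; it says nothing yet about the levels of X.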

 

Below is the distribution of Y and X in the Training Table:

Y      #     %
0      67    53%
1      60    47%
Total  127   100%

 

X      #     %
A      44    35%
B      17    13%
C      64    50%
D      2     2%
Total  127   100%

 

Below is the distribution of Y and X in the Validation Table:

Y      #     %
0      30    55%
1      25    45%
Total  55    100%

 

X      #     %
A      18    33%
B      7     13%
C      27    49%
D      3     5%
Total  55    100%

 

Question 1: Given the above numbers for the Raw, Training, and Validation Tables, is the split between Training and Validation a valid one for building a model?

 

If yes, then below are the cross-tabulation numbers.

 

Cross Tabulation in Raw Table

 

X  Y    #   %Y=1 in X
A  0    27
A  1    35  56%
B  0    11
B  1    13  54%
C  0    57
C  1    34  37%
D  0    2
D  1    3   60%

 

Cross Tabulation in Training Table

 

X  Y    #   %Y=1 in X
A  0    17
A  1    27  61%
B  0    8
B  1    9   53%
C  0    41
C  1    23  36%
D  0    1
D  1    1   50%

 

Cross Tabulation in Validation Table

 

X  Y    #   %Y=1 in X
A  0    10
A  1    8   44%
B  0    3
B  1    4   57%
C  0    16
C  1    11  41%
D  0    1
D  1    2   67%

 

Question 2: If the answer to Question 1 is yes, then given the cross-tabulation numbers above and the differences in %Y=1 in X (the event rate) across the levels of X in the Raw, Training, and Validation Tables (e.g., for X = A: 56% in Raw, 61% in Training, 44% in Validation), is the split between Training and Validation a valid one for building a model?
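The "%Y=1 in X" column is just the within-level event rate; as a sanity check, it can be recomputed from the Raw Table cross-tabulation counts (pure Python; the `event_rates` helper is my own):

```python
# Counts copied from the Raw Table cross tabulation: (X, Y) -> count
counts = {("A", 0): 27, ("A", 1): 35, ("B", 0): 11, ("B", 1): 13,
          ("C", 0): 57, ("C", 1): 34, ("D", 0): 2,  ("D", 1): 3}

def event_rates(counts):
    """%Y=1 within each level of X, i.e. the '%Y=1 in X' column."""
    levels = sorted({x for x, _ in counts})
    return {x: counts[(x, 1)] / (counts[(x, 0)] + counts[(x, 1)])
            for x in levels}

rates = event_rates(counts)  # e.g. rates["A"] is 35/62, about 56%
```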

2 REPLIES
DougWielenga
SAS Employee

There are a few issues with your hypothetical situation:

  * You have a single categorical input with four levels and a binary target, so you can estimate at most four distinct predicted values, one for each input level -- it is not clear that logistic regression improves on this fit when there are no interval inputs to consider.

  * You have relatively few observations overall, and only five of them have X="D", which makes splitting into training and validation a questionable approach.

  * Given that there are only 8 possible bins for observations to fall into (two possible outcomes times four possible inputs), the partitioning split seems about as good as it could be, but this would be a better candidate for cross-validation on the training data set were it not such a simple problem.
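To illustrate the first point above: with one free parameter per level of X, a logistic regression is saturated, so its fitted probabilities simply reproduce the observed per-level event rates. A pure-Python sketch (not SAS; the simple gradient-ascent fit is my own, using the Training Table cell counts for illustration):

```python
import math

# Training Table cross tabulation: level -> (count of Y=1, total count)
cells = {"A": (27, 44), "B": (9, 17), "C": (23, 64), "D": (1, 2)}

# One parameter per level (dummy coding, no shared intercept), so the
# model can match each level's event rate exactly.
levels = sorted(cells)
beta = {x: 0.0 for x in levels}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Gradient ascent on the Bernoulli log-likelihood; each coordinate is
# independent because each level has its own parameter.
for _ in range(5000):
    for x in levels:
        ones, n = cells[x]
        p = sigmoid(beta[x])
        beta[x] += 0.1 * (ones - n * p)   # d log-lik / d beta_x

fitted = {x: sigmoid(beta[x]) for x in levels}
```

The fitted probabilities converge to 27/44, 9/17, 23/64, and 1/2 -- exactly the "%Y=1 in X" column, which is why the logistic model adds nothing here beyond the raw cross tabulation.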

 

Data mining problems typically involve large numbers of observations, for which it makes sense to partition into training and validation (and possibly test) data sets.  The differences in the percentages arise because, with so few observations, a single observation accounts for 0.8% of the training set (1/127) and 1.8% of the validation set (1/55).  The differences in percentages are therefore not surprising, but splitting in the first place is likely not warranted.
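The cross-validation alternative mentioned above can be sketched as stratified k-fold splitting in pure Python (a hypothetical illustration, not a SAS procedure; dealing each class round-robin keeps every fold's class mix close to the full data's):

```python
import random

def stratified_kfold(rows, key, k=5, seed=1):
    """Yield (train, valid) pairs; each fold keeps roughly the same
    class mix as the full data by dealing each class round-robin."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row)
    for members in groups.values():
        rng.shuffle(members)
        for i, row in enumerate(members):
            folds[i % k].append(row)
    for i in range(k):
        valid = folds[i]
        train = [r for j in range(k) if j != i for r in folds[j]]
        yield train, valid

# Toy data shaped like the Raw Table: 97 rows with Y=0 and 85 with Y=1
rows = [{"Y": 0}] * 97 + [{"Y": 1}] * 85
splits = list(stratified_kfold(rows, key=lambda r: r["Y"]))
```

Every observation is then used for validation exactly once, which matters when a 30% holdout would contain only one or two rows of a rare level like X="D".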

 

I hope this helps!


Cordially,
Doug

Shakir_Juolay
Obsidian | Level 7

Thank you, Doug.
It was new learning for me that the benefit of a categorical variable in logistic regression only shows up when it is used together with an interval variable.
I know my sample size is small, hence the differences between the Training and Validation sets. But if my sample size were large enough (say 1000 -> 700 Training and 300 Validation), then are differences between Training and Validation, in terms of the overall target percentage and the target percentage within each level of a categorical variable, acceptable for model validation?
In other words, should the Training and Validation sets have the same/similar target percentage JUST for the entire set, or ALSO within each level of the categorical variables?
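One way to get both at once, given a large enough sample, is to stratify on the (X, Y) pair rather than on Y alone; each partition then keeps every cell's share, so the overall target percentage and the per-level event rates both carry over. A pure-Python sketch with made-up counts (the Raw Table cross tabulation scaled up tenfold, to mimic the larger-sample case; the helper is my own):

```python
import random

def stratified_split(rows, key, train_frac=0.7, seed=7):
    """70:30 split preserving the proportion of each key(row) group."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row)
    train, valid = [], []
    for members in groups.values():
        rng.shuffle(members)
        cut = round(len(members) * train_frac)
        train.extend(members[:cut])
        valid.extend(members[cut:])
    return train, valid

# Made-up counts: the Raw Table cross tabulation scaled up tenfold
cells = {("A", 0): 270, ("A", 1): 350, ("B", 0): 110, ("B", 1): 130,
         ("C", 0): 570, ("C", 1): 340, ("D", 0): 20,  ("D", 1): 30}
rows = [{"X": x, "Y": y} for (x, y), n in cells.items() for _ in range(n)]

# Stratify on the (X, Y) pair, not on Y alone
train, valid = stratified_split(rows, key=lambda r: (r["X"], r["Y"]))
```

With plain stratification on Y alone, the per-level rates can still drift by sampling chance, which is exactly the 56%/61%/44% pattern seen for X=A above.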

