Shakir_Juolay
Obsidian | Level 7

Consider the following hypothetical situation, about which I have two questions regarding the validity of the split between training and validation sets.

 

I am trying to build a logistic regression model with a single categorical explanatory variable X for a binary target Y, where the event is Y = 1.

 

Below is the distribution of Y and X in the Raw Table:

Y      #     %
0      97    53%
1      85    47%
Total  182   100%

 

X      #     %
A      62    34%
B      24    13%
C      91    50%
D      5     3%
Total  182   100%

 

I am doing a 70:30 split into training and validation sets.
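For concreteness, a stratified 70:30 split like the one described here can be sketched in pure Python (a hypothetical illustration, not SAS code; the `stratified_split` helper and the toy rows are my own):

```python
import random

def stratified_split(rows, key, train_frac=0.7, seed=42):
    """Split rows 70:30, preserving the proportion of each key(row)
    group (here the target Y) in both partitions."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row)
    train, valid = [], []
    for members in groups.values():
        rng.shuffle(members)
        cut = round(len(members) * train_frac)
        train.extend(members[:cut])
        valid.extend(members[cut:])
    return train, valid

# Toy data shaped like the Raw Table: 97 rows with Y=0 and 85 with Y=1
rows = [{"Y": 0}] * 97 + [{"Y": 1}] * 85
train, valid = stratified_split(rows, key=lambda r: r["Y"])
```

Stratifying on Y guarantees the overall event rate is nearly identical in the two partitions; it says nothing yet about the levels of X.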

 

Below is the distribution of Y and X in the Training Table:

Y      #     %
0      67    53%
1      60    47%
Total  127   100%

 

X      #     %
A      44    35%
B      17    13%
C      64    50%
D      2     2%
Total  127   100%

 

Below is the distribution of Y and X in the Validation Table:

Y      #     %
0      30    55%
1      25    45%
Total  55    100%

 

X      #     %
A      18    33%
B      7     13%
C      27    49%
D      3     5%
Total  55    100%

 

Question 1: Given the above numbers for the Raw, Training, and Validation Tables, is the split between Training and Validation a valid one for building a model?

 

If yes, then below are the cross-tabulation numbers.

 

Cross Tabulation in Raw Table

 

X  Y    #   %Y=1 in X
A  0    27
A  1    35  56%
B  0    11
B  1    13  54%
C  0    57
C  1    34  37%
D  0    2
D  1    3   60%

 

Cross Tabulation in Training Table

 

X  Y    #   %Y=1 in X
A  0    17
A  1    27  61%
B  0    8
B  1    9   53%
C  0    41
C  1    23  36%
D  0    1
D  1    1   50%

 

Cross Tabulation in Validation Table

 

X  Y    #   %Y=1 in X
A  0    10
A  1    8   44%
B  0    3
B  1    4   57%
C  0    16
C  1    11  41%
D  0    1
D  1    2   67%

 

Question 2: If the answer to Question 1 is yes, then given the cross-tabulation numbers above and the differences in %Y=1 in X (the event rate) across the levels of X in the Raw, Training, and Validation Tables (e.g., for X = A: 56% in Raw, 61% in Training, 44% in Validation), is the split between Training and Validation a valid one for building a model?
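The "%Y=1 in X" column is just the within-level event rate; as a sanity check, it can be recomputed from the Raw Table cross-tabulation counts (pure Python; the `event_rates` helper is my own):

```python
# Counts copied from the Raw Table cross tabulation: (X, Y) -> count
counts = {("A", 0): 27, ("A", 1): 35, ("B", 0): 11, ("B", 1): 13,
          ("C", 0): 57, ("C", 1): 34, ("D", 0): 2,  ("D", 1): 3}

def event_rates(counts):
    """%Y=1 within each level of X, i.e. the '%Y=1 in X' column."""
    levels = sorted({x for x, _ in counts})
    return {x: counts[(x, 1)] / (counts[(x, 0)] + counts[(x, 1)])
            for x in levels}

rates = event_rates(counts)  # e.g. rates["A"] is 35/62, about 56%
```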

2 REPLIES
DougWielenga
SAS Employee

There are a few issues with your hypothetical situation:

  * You have a single categorical input with four levels and a binary target, so you can estimate at most four distinct predicted values, one for each input level -- it is not clear that logistic regression improves on this fit when there are no interval inputs to consider.

  * You have relatively few observations overall, and only five of them have X="D", which makes splitting into training and validation a questionable approach.

  * Given that there are only 8 possible bins for observations to fall into (two possible outcomes times four possible inputs), the partitioning split seems about as good as it could be, but this would be a better candidate for cross-validation on the training data set were it not such a simple problem.
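To illustrate the first point above: with one free parameter per level of X, a logistic regression is saturated, so its fitted probabilities simply reproduce the observed per-level event rates. A pure-Python sketch (not SAS; the simple gradient-ascent fit is my own, using the Training Table cell counts for illustration):

```python
import math

# Training Table cross tabulation: level -> (count of Y=1, total count)
cells = {"A": (27, 44), "B": (9, 17), "C": (23, 64), "D": (1, 2)}

# One parameter per level (dummy coding, no shared intercept), so the
# model can match each level's event rate exactly.
levels = sorted(cells)
beta = {x: 0.0 for x in levels}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Gradient ascent on the Bernoulli log-likelihood; each coordinate is
# independent because each level has its own parameter.
for _ in range(5000):
    for x in levels:
        ones, n = cells[x]
        p = sigmoid(beta[x])
        beta[x] += 0.1 * (ones - n * p)   # d log-lik / d beta_x

fitted = {x: sigmoid(beta[x]) for x in levels}
```

The fitted probabilities converge to 27/44, 9/17, 23/64, and 1/2 -- exactly the "%Y=1 in X" column, which is why the logistic model adds nothing here beyond the raw cross tabulation.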

 

Data mining problems typically involve large numbers of observations, for which it makes sense to partition into training and validation (and possibly test) data sets.  The differences in the percentages arise because, with so few observations, a single observation accounts for 0.8% of the training set (1/127) and 1.8% of the validation set (1/55).  The differences in percentages are therefore not surprising, but splitting in the first place is likely not warranted.
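The cross-validation alternative mentioned above can be sketched as stratified k-fold splitting in pure Python (a hypothetical illustration, not a SAS procedure; dealing each class round-robin keeps every fold's class mix close to the full data's):

```python
import random

def stratified_kfold(rows, key, k=5, seed=1):
    """Yield (train, valid) pairs; each fold keeps roughly the same
    class mix as the full data by dealing each class round-robin."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row)
    for members in groups.values():
        rng.shuffle(members)
        for i, row in enumerate(members):
            folds[i % k].append(row)
    for i in range(k):
        valid = folds[i]
        train = [r for j in range(k) if j != i for r in folds[j]]
        yield train, valid

# Toy data shaped like the Raw Table: 97 rows with Y=0 and 85 with Y=1
rows = [{"Y": 0}] * 97 + [{"Y": 1}] * 85
splits = list(stratified_kfold(rows, key=lambda r: r["Y"]))
```

Every observation is then used for validation exactly once, which matters when a 30% holdout would contain only one or two rows of a rare level like X="D".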

 

I hope this helps!


Cordially,
Doug

Shakir_Juolay
Obsidian | Level 7

Thank you, Doug.
It was new learning for me that the benefit of a categorical variable in logistic regression only shows up when it is used together with an interval variable.
I know my sample size is small, hence the differences between the Training and Validation sets. But if my sample size were large enough (say 1000 -> 700 Training and 300 Validation), then are differences between Training and Validation, in terms of the overall target percentage and the target percentage within each level of a categorical variable, acceptable for model validation?
In other words, should the Training and Validation sets have the same/similar target percentage JUST for the entire set, or ALSO within each level of the categorical variables?
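One way to get both at once, given a large enough sample, is to stratify on the (X, Y) pair rather than on Y alone; each partition then keeps every cell's share, so the overall target percentage and the per-level event rates both carry over. A pure-Python sketch with made-up counts (the Raw Table cross tabulation scaled up tenfold, to mimic the larger-sample case; the helper is my own):

```python
import random

def stratified_split(rows, key, train_frac=0.7, seed=7):
    """70:30 split preserving the proportion of each key(row) group."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row)
    train, valid = [], []
    for members in groups.values():
        rng.shuffle(members)
        cut = round(len(members) * train_frac)
        train.extend(members[:cut])
        valid.extend(members[cut:])
    return train, valid

# Made-up counts: the Raw Table cross tabulation scaled up tenfold
cells = {("A", 0): 270, ("A", 1): 350, ("B", 0): 110, ("B", 1): 130,
         ("C", 0): 570, ("C", 1): 340, ("D", 0): 20,  ("D", 1): 30}
rows = [{"X": x, "Y": y} for (x, y), n in cells.items() for _ in range(n)]

# Stratify on the (X, Y) pair, not on Y alone
train, valid = stratified_split(rows, key=lambda r: (r["X"], r["Y"]))
```

With plain stratification on Y alone, the per-level rates can still drift by sampling chance, which is exactly the 56%/61%/44% pattern seen for X=A above.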

