topic Quality Check of Training and Validation Set in SAS Data Science

Quality Check of Training and Validation Set

Shakir_Juolay — Sun, 13 Sep 2020 06:46:32 GMT

Consider the following hypothetical situation, about which I have two questions regarding the validity of the split between the training and validation sets.

I am trying to build a logistic regression model with a single categorical explanatory variable X and a binary target Y, with the event being Y = 1.

Below are the distributions of Y and X in the Raw table:

Y	#	%
0	97	53%
1	85	47%
Total	182	100%

X	#	%
A	62	34%
B	24	13%
C	91	50%
D	5	3%
Total	182	100%

I am doing a 70:30 split into training and validation sets.

Below are the distributions of Y and X in the Training table:

Y	#	%
0	67	53%
1	60	47%
Total	127	100%

X	#	%
A	44	35%
B	17	13%
C	64	50%
D	2	2%
Total	127	100%

Below are the distributions of Y and X in the Validation table:

Y	#	%
0	30	55%
1	25	45%
Total	55	100%

X	#	%
A	18	33%
B	7	13%
C	27	49%
D	3	5%
Total	55	100%

Question 1: Given the above numbers for the Raw, Training and Validation tables, is the split between Training and Validation a valid one for building a model?
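One rough way to put a number on Question 1 is a Pearson chi-square test of homogeneity between the Training and Validation counts. The sketch below is only an illustration and is not part of the original post: the helper name `chi_square_stat` is made up, and the 5% critical values (3.841 for 1 degree of freedom, 7.815 for 3) are the standard table values. With only 2 and 3 observations at X=D the chi-square approximation is shaky, so treat the result as a sanity check rather than a formal test.

```python
def chi_square_stat(row_a, row_b):
    """Pearson chi-square statistic for a 2 x K table of counts."""
    total_a, total_b = sum(row_a), sum(row_b)
    grand = total_a + total_b
    stat = 0.0
    for a, b in zip(row_a, row_b):
        col = a + b
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * col / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Counts copied from the Training and Validation tables above.
train_y, valid_y = [67, 60], [30, 25]                # Y = 0, 1
train_x, valid_x = [44, 17, 64, 2], [18, 7, 27, 3]   # X = A, B, C, D

stat_y = chi_square_stat(train_y, valid_y)  # 1 degree of freedom
stat_x = chi_square_stat(train_x, valid_x)  # 3 degrees of freedom

print(f"Y: chi-square = {stat_y:.3f} (5% critical value 3.841)")
print(f"X: chi-square = {stat_x:.3f} (5% critical value 7.815)")
```

Both statistics fall below their critical values, which is consistent with the two partitions sharing the same marginal distributions of Y and X.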

If yes, then below are the cross tabulation numbers.

Cross Tabulation in Raw Table

X	Y	#	%Y=1 in X
A	0	27
A	1	35	56%
B	0	11
B	1	13	54%
C	0	57
C	1	34	37%
D	0	2
D	1	3	60%

Cross Tabulation in Training Table

X	Y	#	%Y=1 in X
A	0	17
A	1	27	61%
B	0	8
B	1	9	53%
C	0	41
C	1	23	36%
D	0	1
D	1	1	50%

Cross Tabulation in Validation Table

X	Y	#	%Y=1 in X
A	0	10
A	1	8	44%
B	0	3
B	1	4	57%
C	0	16
C	1	11	41%
D	0	1
D	1	2	67%

Question 2: If the answer to Question 1 is yes, then given the cross tabulation numbers above and the differences in %Y=1 in X (the event rate) across the levels of X in the Raw, Training and Validation tables (for example, for X=A: 56% in Raw, 61% in Training and 44% in Validation), is the split between Training and Validation still a valid one for building a model?
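To get a feel for how much of the gap in Question 2 could be sampling noise, one option is to put an approximate 95% interval around each Validation event rate. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not something from the original post; `wilson_interval` is a made-up helper and the counts are copied from the cross tabulations above.

```python
import math

def wilson_interval(events, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion events / n."""
    p = events / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# (events with Y = 1, observations) per level of X.
training = {"A": (27, 44), "B": (9, 17), "C": (23, 64), "D": (1, 2)}
validation = {"A": (8, 18), "B": (4, 7), "C": (11, 27), "D": (2, 3)}

for level in training:
    t_events, t_n = training[level]
    v_events, v_n = validation[level]
    low, high = wilson_interval(v_events, v_n)
    print(f"X={level}: training {t_events / t_n:.0%}, "
          f"validation {v_events / v_n:.0%}, "
          f"validation interval {low:.0%} to {high:.0%}")
```

For every level the Training event rate sits inside the Validation interval, so gaps such as 61% versus 44% at X=A are within what samples of this size can produce by chance.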

Re: Quality Check of Training and Validation Set

DougWielenga — Mon, 14 Sep 2020 13:28:20 GMT

There are a few issues with your hypothetical situation:

* you have a single categorical input with four levels and a binary target, so the model can only produce four distinct predicted values, one for each input level -- with no interval inputs to consider, it is not clear that logistic regression improves on that fit

* you have a relatively small number of observations overall, and there are only five observations where X="D", which makes splitting into training and validation a questionable approach

* given that there are only 8 possible bins for observations to be cast into (two possible outcomes and four possible inputs), the partitioning split seems as good as it could be, but this is likely a better candidate for cross-validation on the training data set were it not such a simple problem.
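The first point above can be checked with a few lines of arithmetic. With one dummy per level and no intercept (a cell-means coding, chosen here only for illustration; a reference-level coding gives the same fitted probabilities), the likelihood splits into four independent pieces, and maximizing each one returns the observed event rate for that level. The sketch uses the Training counts and a hand-rolled Newton-Raphson loop; `fit_level_logit` is a made-up helper, not a SAS routine.

```python
import math

def fit_level_logit(events, n, iterations=25):
    """Maximum-likelihood log-odds for one level via Newton-Raphson."""
    beta = 0.0
    for _ in range(iterations):
        p = 1 / (1 + math.exp(-beta))
        gradient = events - n * p      # score of the binomial log-likelihood
        curvature = n * p * (1 - p)    # negative second derivative
        beta += gradient / curvature
    return beta

# (events with Y = 1, observations) per level of X in the Training table.
training = {"A": (27, 44), "B": (9, 17), "C": (23, 64), "D": (1, 2)}

fitted = {}
for level, (events, n) in training.items():
    beta = fit_level_logit(events, n)
    fitted[level] = 1 / (1 + math.exp(-beta))
    print(f"X={level}: coefficient {beta:+.3f}, "
          f"fitted P(Y=1) {fitted[level]:.3f}, observed {events / n:.3f}")
```

The fitted probabilities reproduce the observed Training event rates, which is why the model here has nothing to add beyond the cross tabulation itself.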

Data Mining problems typically involve large numbers of observations, for which it makes sense to partition into training and validation (and possibly test) data sets. The differences in the percentages arise because, with such a small number of observations, a single observation accounts for 0.8% of the training set and 1.8% of the validation set. The differences in percentages are therefore not surprising, but splitting in the first place is likely not warranted.
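The arithmetic behind those two figures, plus the same calculation within each level of X in the Validation table, can be sketched as follows. The per-level part is an extension added for illustration; the counts come from the first post.

```python
train_n, valid_n = 127, 55

# Share of each partition carried by a single observation, in percent.
print(f"training:   {100 / train_n:.1f}%")
print(f"validation: {100 / valid_n:.1f}%")

# Within one level of X the effect is much larger. Validation rows per level:
valid_level_n = {"A": 18, "B": 7, "C": 27, "D": 3}
shift_per_obs = {level: 100 / n for level, n in valid_level_n.items()}
for level, shift in shift_per_obs.items():
    print(f"X={level}: one observation moves the event rate by {shift:.1f} points")
```

At X=A in Validation, for instance, a single observation switching from Y=0 to Y=1 moves the event rate from 44% to 50%.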

I hope this helps!

Cordially,
Doug

Re: Quality Check of Training and Validation Set

Shakir_Juolay — Wed, 16 Sep 2020 13:14:53 GMT

Thank you, Doug.
It was new to me that the benefit of a categorical variable in logistic regression is only seen when it is used together with an interval variable.
I know my sample size is small, and hence the differences between the Training and Validation sets. But if my sample size were large enough (say 1000 -> 700 Training and 300 Validation), would differences between Training and Validation in the overall target percentage, and in the target percentage within each level of a categorical variable, be acceptable for model validation?
In other words, should the Training and Validation sets have the same or a similar target percentage JUST for the entire set, or ALSO within each level of the categorical variables?
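If matching event rates within each level of X matters, one option is to stratify the partition on the combination of X and Y rather than on Y alone. The sketch below is an illustration under stated assumptions, not an answer from the thread: it rebuilds the 182 rows from the Raw cross tabulation in the first post, uses a made-up helper `stratified_split`, and rounds 70% of each cell to the nearest whole row.

```python
import random

def stratified_split(cells, train_tenths=7, seed=0):
    """Split rows 70:30 inside every (X, Y) cell; cells maps (x, y) -> count."""
    rng = random.Random(seed)
    train, valid = [], []
    for (x, y), count in cells.items():
        rows = [(x, y, i) for i in range(count)]    # i is a row id within the cell
        rng.shuffle(rows)
        n_train = (count * train_tenths + 5) // 10  # round to nearest, integers only
        train.extend(rows[:n_train])
        valid.extend(rows[n_train:])
    return train, valid

def event_rate(rows, level):
    """Share of rows with Y = 1 among rows at the given level of X."""
    ys = [y for x, y, _ in rows if x == level]
    return sum(ys) / len(ys)

# Counts from the Raw cross tabulation in the first post.
raw_cells = {("A", 0): 27, ("A", 1): 35, ("B", 0): 11, ("B", 1): 13,
             ("C", 0): 57, ("C", 1): 34, ("D", 0): 2, ("D", 1): 3}

train, valid = stratified_split(raw_cells)

for level in "ABCD":
    print(f"X={level}: training {event_rate(train, level):.0%}, "
          f"validation {event_rate(valid, level):.0%}")
```

Whole-row rounding still leaves small gaps, most visibly at X=D, where only 5 rows exist.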