10-14-2016 06:35 AM
I have used Start Groups and End Groups nodes to perform 5-fold cross-validation on a modelling node in SAS EM, grouping on a random variable in my training data that I created for this purpose. I now wish to use the model I have created to score a new dataset.
When I export the scoring code I can see that it references the random variable I created for the cross-validation, but this variable is not present in my new data, as it was only created for the cross-validation itself. Unless I am misreading the code, it appears to use the value of the random variable to score each of the 5 segments of the data differently. The datasets I will be scoring in the live environment could be fairly small (only a few thousand records at a time), so I don't feel that recreating a random segment variable there would be appropriate.
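For illustration, the exported code looks roughly like this (a simplified, hypothetical sketch; the variable names and coefficients are made up, but the branching on the fold variable is the point):

   /* Hypothetical, simplified sketch of the exported score code:
      a different set of coefficients is applied depending on the
      value of the cross-validation fold variable. */
   if fold = 1 then xbeta = -2.10 + 0.81*x1 + 0.33*x2;
   else if fold = 2 then xbeta = -1.95 + 0.78*x1 + 0.36*x2;
   else if fold = 3 then xbeta = -2.02 + 0.84*x1 + 0.30*x2;
   else if fold = 4 then xbeta = -2.15 + 0.79*x1 + 0.35*x2;
   else if fold = 5 then xbeta = -1.98 + 0.82*x1 + 0.31*x2;
   P_target1 = 1 / (1 + exp(-xbeta));   /* predicted probability */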
How do I apply the scoring code to my new data so that every observation is scored consistently?
10-14-2016 06:58 AM
I have created the random variable (called fold), given it values 1-5, and assigned it the role Segment, as advised in the answer by m_maldonado to this question: https://communities.sas.com/t5/SAS-Data-Mining/Using-cross-validation-in-Enterprise-Miner/m-p/233635... I have then used Start Groups and End Groups to perform the cross-validation.
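For reference, the fold variable was created with a simple DATA step along these lines (a minimal sketch; the dataset names are hypothetical):

   /* Assign each training observation a random fold 1-5.
      ranuni returns a uniform(0,1) draw, so ceil(5*u) gives 1..5. */
   data train_cv;
      set train;
      fold = ceil(5 * ranuni(12345));
   run;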
The random variable does not appear in the model as a predictor, but in the scoring code each of the 5 segments is scored differently according to which fold it is in. I don't see how to apply this to a new dataset unless I also create the random variable on my new data, which does not seem to make sense.
10-14-2016 08:46 AM
Cross-validation is used to verify results. You definitely shouldn't have a different model for each segment.
Shouldn't the scoring code you use come from the steps before the cross-validation?
10-14-2016 12:51 PM
Sorry I am late to the party.
I don't have EM handy. Sadly I spend more time in meetings than on hands-on software these days.
This is the kind of thing that I would suggest fixing directly on the score code while someone figures out the right way to do this.
If you have a chance, post the score code of the flow you have (the simpler the data the better), and the community and I will give you suggestions!
10-16-2016 07:32 AM
I have created my own workaround: I took the generated score code and adapted it to score my whole dataset 5 times (once for each fold), then calculated the average of the predicted probabilities from the 5 models for each observation. If I understand the method correctly from my reading, this is what is required.
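The averaging step itself is simple once the five sets of predictions exist on each observation. A minimal sketch, assuming the adapted score code has already produced p_fold1-p_fold5 on every row (these variable and dataset names are mine, not EM's):

   /* Average the predictions of the 5 fold models into one score. */
   data scored_final;
      set scored_all;                         /* one row per observation */
      p_average = mean(of p_fold1-p_fold5);   /* ensemble of 5 fold models */
   run;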
I'll try to create an example version of what I have done with some standard data so that I can post the score code. What would you need, an XML of the diagram?
In the meantime, thanks for your help.
10-16-2016 07:38 PM
I like that workaround!
An XML of the diagram or a quick screenshot, or both.
When you have a chance, I am also very curious to hear what you've learned about cross-validation. In particular, do you feel you get more predictive power, or is there anything else you might share?