DavidWilson
Calcite | Level 5

I have used Start Groups and End Groups nodes to perform 5-fold cross validation on a modelling node in SAS EM, grouping on a random variable that I created in my training data for this purpose. I now wish to use the model I have created to score a new dataset.

 

When I export the scoring code, I can see that it references the random variable that I created for the cross-validation, but this variable is not present in my new data. Unless I am misreading the code, it appears to use the value of the random variable to score each of the 5 segments of the data differently. The datasets I am scoring in the live environment could be fairly small (only a few thousand records at a time), so I don't feel that this would be appropriate.

 

How do I apply the scoring code to my new data so that every observation is scored consistently?

8 REPLIES
Reeza
Super User

When you built your model, wouldn't that variable have been excluded?

I'm not sure how you used it in CV.

DavidWilson
Calcite | Level 5

I created the random variable (called fold), gave it values 1-5, and assigned it the role Segment, as advised in m_maldonado's answer to this question: https://communities.sas.com/t5/SAS-Data-Mining/Using-cross-validation-in-Enterprise-Miner/m-p/233635... I then used Start Groups and End Groups nodes to perform the cross validation.
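For readers following along outside Enterprise Miner, the fold assignment described above can be sketched roughly as follows. This is only an illustration in Python, not the SAS code used in the thread; the function name `assign_folds` and the seed are hypothetical.

```python
import random

def assign_folds(n_rows, k=5, seed=42):
    """Return one fold label (1..k) per row, with near-equal fold sizes.

    Hypothetical helper: it mimics a random 'fold' variable with values 1-5.
    """
    rng = random.Random(seed)
    # Start from a balanced sequence 1, 2, ..., k, 1, 2, ... and shuffle it
    folds = [(i % k) + 1 for i in range(n_rows)]
    rng.shuffle(folds)
    return folds

folds = assign_folds(1000)
```

The point to notice is that the fold label exists only to split the training data. It carries no information about the observation itself, which is why it has no natural counterpart in new data.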

 

The random variable does not appear in the model as a predictor, but in the scoring code each of the 5 segments is scored differently according to which fold it is in. I don't see how to apply this to a new dataset unless I also create the random variable on my new data, which does not seem to make sense.

 

Reeza
Super User

Cross validation is used to verify results. You definitely shouldn't have a different model for each segment.

Shouldn't the scoring code you use come from the steps before the cross validation?

Reeza
Super User

You should also wait for an answer from Miguel or someone else, as my EM skills have gotten really rusty 🙁

DavidWilson
Calcite | Level 5
No problem. Many thanks for trying to help!
M_Maldonado
Barite | Level 11

Hi David,

Sorry I am late to the party.

I don't have EM handy. Sadly I spend more time in meetings than on hands-on software these days.

This is the kind of thing that I would suggest fixing directly on the score code while someone figures out the right way to do this.

 

If you have a chance, post the score code of the flow you have (the simpler the data the better), and the community and I will give you suggestions!

 

Best,

-M

DavidWilson
Calcite | Level 5

Thanks Miguel

 

I have created my own workaround: I took the generated score code and adapted it to score my whole dataset 5 times (once with each fold's model), then calculated the average of the 5 predicted probabilities for each observation. If I am understanding the method correctly from my reading, this is what is required.
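The workaround described above can be sketched as follows. This is a minimal illustration in Python, not the adapted SAS score code: the five logistic scorers, their coefficients, and the input names `x1` and `x2` are all made up for the example and stand in for the five fold-specific models that Enterprise Miner generated.

```python
import math

def make_logistic_scorer(intercept, coefs):
    """Build a scoring function for one fold's model (hypothetical coefficients)."""
    def score(row):
        z = intercept + sum(c * row[name] for name, c in coefs.items())
        return 1.0 / (1.0 + math.exp(-z))
    return score

# One model per fold; the numbers are placeholders, not real estimates
fold_models = [
    make_logistic_scorer(-1.00, {"x1": 0.50, "x2": 0.20}),
    make_logistic_scorer(-0.95, {"x1": 0.48, "x2": 0.22}),
    make_logistic_scorer(-1.05, {"x1": 0.52, "x2": 0.18}),
    make_logistic_scorer(-0.98, {"x1": 0.47, "x2": 0.21}),
    make_logistic_scorer(-1.02, {"x1": 0.51, "x2": 0.19}),
]

def score_averaged(row, models):
    """Score one observation with every fold model and average the probabilities."""
    probs = [model(row) for model in models]
    return sum(probs) / len(probs)

# New data has no 'fold' variable; every row is scored by all five models
new_data = [{"x1": 0.3, "x2": 1.2}, {"x1": -1.0, "x2": 0.4}, {"x1": 2.1, "x2": -0.7}]
scored = [dict(row, p_avg=score_averaged(row, fold_models)) for row in new_data]
```

Because every observation passes through the same five models, the score no longer depends on an arbitrary fold label, so all observations are scored consistently.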

 

I'll try to create an example version of what I have done with some standard data so that I can post the score code. What would you need: an XML of the diagram?

 

In the meantime, thanks for your help.

 

M_Maldonado
Barite | Level 11

I like that workaround!

XML of the diagram or a quick screenshot, or both 🙂

When you have a chance, I am also very curious to know more about what you have learned about cross validation. In particular, do you feel like you get more predictive power, or is there anything else you might share?

 

Thanks!
