PrestickNinja
Fluorite | Level 6

I am struggling with an odd issue with Miner.

 

The context - I am creating 3 sets of data within my project - a training, a validation and a test set. So far, so simple. The issue arises because I want the test set to be a purely out-of-time set (the few months after the development period), while the validation and training sets are just a 30/70 split of the remaining, in-time data.

 

What I have done:

I have placed SAS Code nodes after the raw dataset, which split the data into an in-time and an out-of-time set. This way the in-time set can be partitioned as usual using the Data Partition node. In the code I specify that the out-of-time set is the test set (using the macro variable for the test export set). The problem is that Miner insists on passing the original training set through too.

 

This is causing issues when I feed this node into the modelling steps after I split the validation set, as there are now 2 training sets. I can't figure out how to make Miner drop one of them or at least allow me to select one. The crude solution I have at the moment is to be sure that the correct training set is "on top" when the process is laid out, but this is not a permanent fix: once the model is handed off, just rearranging the physical position of the nodes will break the process.

 

Am I missing something obvious or is there no way to prevent a SAS Code node from exporting a training set?

1 ACCEPTED SOLUTION

Urban_Science
Quartz | Level 8

I think I solved it.  Connect the nodes to a "Data Append" node.  Then in the properties of the "Data Append" node, click on "..." for the Data Selector property.  Then change Use to No for the TRAIN role data coming from the Code Node.


6 REPLIES
Urban_Science
Quartz | Level 8

Hi PrestickNinja,

 

I created a simple diagram to see if I could replicate the issue that you were experiencing.  In short, I was unable to replicate your issue, but it gave me some ideas on what might be going wrong for you. 

I discovered that I needed two SAS Code nodes to separate out my out-of-time dataset: the first to remove the out-of-time data from the soon-to-be training and validation data, and the second to create the test data with only the out-of-time data.  From there I needed to link the test data into the flow after the non-test data was partitioned.  Attached is a picture of the flow.

 

Code in "Remove Test Data" node:

/* Export everything except the stand-in "out-of-time" rows as training data */
data &EM_EXPORT_TRAIN;
	set &EM_IMPORT_DATA;
	where Origin ne "Asia";
run;

Code in "Keep only Test Data" node:

/* Export only the stand-in "out-of-time" rows as test data */
data &EM_EXPORT_TEST;
	set &EM_IMPORT_DATA;
	where Origin = "Asia";
run;

The data in my diagram is from SASHELP.CARS, with Origin = "Asia" standing in for the out-of-time records.

The partitioning percentages for the Data Partition node are 70/30/0 (training/validation/test).

The Model comparison node is there to show that the Test data was successfully passed through.
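
For an actual out-of-time split, the same two-node pattern could key on a date instead of Origin. This is only a sketch: obs_date (assumed to be a numeric SAS date variable) and the 01JAN2020 cutoff are made-up placeholders, not names from the original project.

```sas
/* Sketch only: obs_date and the cutoff date are hypothetical placeholders. */

/* --- "Remove Test Data" node: keep the in-time rows --- */
data &EM_EXPORT_TRAIN;
    set &EM_IMPORT_DATA;
    /* A missing date compares as smaller than any date in SAS,
       so rows with a missing obs_date stay in the in-time set */
    where obs_date < '01JAN2020'd;
run;

/* --- "Keep only Test Data" node: keep the out-of-time rows --- */
data &EM_EXPORT_TEST;
    set &EM_IMPORT_DATA;
    where obs_date >= '01JAN2020'd;
run;
```

Each data step goes in its own SAS Code node; they are shown together here only so the two conditions can be compared side by side.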

 

Please let me know if you have any questions!  Good luck!

PrestickNinja
Fluorite | Level 6

Hi Urban_Science

 

Thanks for the thorough attempt at assisting with this. My original diagram looks very similar to yours and I have essentially done the same thing as you (with some extra bits of mapping in the code).

 

The issue I am having is with your second node, "Keep only Test Data": if you select Exported Data in its properties, it should show that it is exporting both a training set (in this case the raw data) and the test set. I am trying to find a way to drop the training output from the "Keep only Test Data" SAS Code node. If your code isn't generating a training set then I have no idea what I am doing wrong - I am using the code node without modification apart from the code.

 

After messing around with deleting and re-adding the nodes a few times, and plenty of manual updating, I have found that Miner takes the training set of the first node connected, provided the node is updated before connecting the second (test) data. So I have figured out a way to ensure the correct training set gets used. While not ideal, it works, so that is what I am doing for now.

 

Thanks again for taking the time to answer this.

 

 

Urban_Science
Quartz | Level 8
Good news, I have replicated the issue where "Keep only Test Data" passes training data too. I'll try some things over lunch to see what I can find.
Urban_Science
Quartz | Level 8

I think I solved it.  Connect the nodes to a "Data Append" node.  Then in the properties of the "Data Append" node, click on "..." for the Data Selector property.  Then change Use to No for the TRAIN role data coming from the Code Node.

PrestickNinja
Fluorite | Level 6
Ah, great. That sounds like it should fix the issue. I will give it a try as soon as I get in to work.

