Lobbie
Obsidian | Level 7

Hi,

 

I have 2 separate datasets: 1 for training and 1 for testing.  The variables are the same, except that the training dataset also contains a Target variable.  The Target variable is ordinal and takes the values Low, Medium and High.  Each record has a unique identifier 'TranID' plus the variables used for modelling.  I want to build a simple decision tree to predict the probabilities of High, Medium and Low for each record in the test dataset.

 

My questions are,

  1. How do I add the test dataset into the process flow? The SAS EM guides all show a single raw input dataset and then use the Data Partition node to partition it into Train, Validate and Test datasets.
  2. Will the decision tree output the test results as "TranID, Low (probability), Medium (probability), High (probability)", which I can export into a text file?  If yes, which node(s) do I have to use?

I am a SAS EM newbie and am using SAS EM version 14.1.

 

Thank you very much in advance.

 

Regards,

Lobbie

1 ACCEPTED SOLUTION

Accepted Solutions
WendyCzika
SAS Employee

What you have for your test data set is what Enterprise Miner considers a "score" data set.  To get predictions for that data set, you can connect both the Input Data node for that data set (with the Role property set to Score) and the modeling node that you want to use for your predictions (that uses your training data) to a Score node, as in the attached screenshot of a sample flow.  The Score node will apply the score code from the model to your score data set that doesn't contain the target.  Hope that helps!

 


ScoreFlow.png

MelodieRush
SAS Employee

You can designate your data to be whatever role you want it to be. I created the image below to show you one file is set to Train and the second file is set to Test. I can then feed them both into the next node in the flow.

 

EM_NEWPROJECT2.gif

Catch the SAS Global Forum keynotes, announcements, and tech content!
sasglobalforum.com | #SASGF



MelodieRush
SAS Employee

To save to a .txt file, you can use the Save Data node on the Utility tab.

 

2017-03-16_13-00-48.png
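
As an alternative to the Save Data node, a SAS Code node connected after the Score node could write the predictions out directly. This is only a sketch under assumptions: it supposes the target variable is named Target, so that the probability columns follow Enterprise Miner's P_<target><level> naming (check the actual column names in your scored data first), and the output path is a placeholder.

```sas
/* Sketch for a SAS Code node connected after the Score node.          */
/* Assumptions: the target is named Target, so the probability columns */
/* are P_TargetLow, P_TargetMedium and P_TargetHigh; the output path   */
/* is made up. Verify the column names in the scored table first.      */
data work.predictions;
    set &EM_IMPORT_SCORE;   /* scored data passed in from the Score node */
    keep TranID P_TargetLow P_TargetMedium P_TargetHigh;
run;

proc export data=work.predictions
    outfile="C:\temp\predictions.txt"   /* hypothetical location */
    dbms=tab                            /* tab-delimited text    */
    replace;
run;
```

The path is resolved on the machine where the SAS session runs, which may be a server rather than your desktop.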




Lobbie
Obsidian | Level 7

Hi Wendy and Melodie,

 

Thank you both for your answers.  

 

@MelodieRush, if I need to score the Test dataset later, do I just connect the Score node to the Decision Tree node as shown by @WendyCzika and SAS EM will know which dataset is which?

 

Regards,

Will

MelodieRush
SAS Employee

Yes, you can never go wrong following @WendyCzika's advice! EM has lots of different data roles: training, validation, test, score, transaction and raw. It knows what to do when your data is correctly identified and defined.




ChrisGermain
Fluorite | Level 6

What if I need to do data pre-processing on the "test" data set? I have a train and test data set as well, and in production, I would want to add some pre-processing steps that would apply to my "real data" before the modeling nodes run. I'm trying to append my train and test data sets but it doesn't output as one file when I use stat explore or other nodes following. I want to do all of the pre-processing on both the train and test data set to handle missing values, create features, etc. before running through a regression node or decision tree, etc.

 

I'm also a newb but can't seem to figure this out. 

MelodieRush
SAS Employee

Are you bringing your data in as one dataset and using data partition to split into training, validation and test or are you bringing 2 (or 3) separate datasets?

 

 




ChrisGermain
Fluorite | Level 6

Hi,

 

I'm bringing in 2 separate data tables, train and test. They both have all of the same variables except the target variable in the training data set. If I union them in EG, how would I be able to partition where only the data with the target variable is set for training? Would EM know this by me marking that variable as the target variable?

MelodieRush
SAS Employee

If you combine your data in EG you could use a variation of the code below in the SAS Code node to divide the data the way you want into Training and Test.

 

data &EM_EXPORT_TRAIN &EM_EXPORT_VALIDATE
        &EM_EXPORT_TEST;
    set &EM_IMPORT_DATA;
    if partition_key=1 then output &EM_EXPORT_TRAIN;
    else if partition_key=2 then output &EM_EXPORT_VALIDATE;
    else if partition_key=3 then output &EM_EXPORT_TEST;
run;

 

Here's a slide that illustrates using a SAS Code Node with this code (from an upcoming Ask the Expert for Tips and Tricks in SAS Enterprise Miner, look for it in April. 🙂 )

 2018-01-19_15-24-02.png

 

You can add as many nodes as you want before this SAS code node to do your transformations. This will enable you to have all the same transformation on both the Training and Test datasets.
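
For the step of combining the two tables before they reach Enterprise Miner, something like the following DATA step could create the partition_key column that the code above tests. It is a minimal sketch: the library and table names (mylib.train, mylib.test, mylib.combined) are hypothetical, and only the values 1 and 3 are assigned, so no validation rows are produced.

```sas
/* Sketch: stack the two tables and flag where each row came from.  */
/* mylib.train, mylib.test and mylib.combined are hypothetical.     */
data mylib.combined;
    set mylib.train(in=from_train)
        mylib.test;
    /* 1 = training rows, 3 = test rows, matching the code above */
    if from_train then partition_key = 1;
    else partition_key = 3;
run;
```

Rows that come from the test table will simply have a missing target value, since that column exists only in the training table.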




ChrisGermain
Fluorite | Level 6

Thank you! I think it would make sense to include a partition following the pre-processing to split the TRAIN data into training and validation and the TEST data into its own dataset.

 

And since I'm such a newbie, a few confirmations:

 

&EM_EXPORT_TRAIN should be "&MY_DATASET_NAME"

partition_key should be "my_field"?

 

 

MelodieRush
SAS Employee

Actually, leave &EM_EXPORT_TRAIN and the other & data set names as they are; they will then automatically feed into the next node.  The & marks macro variables, which resolve to the correct data set names for the previous and following nodes. If you set up your data source node and then connect the SAS Code node to it, the first line of code only needs to be changed if you are not creating all 3 datasets. Otherwise leave it as is and only change partition_key to your "Field Name" and the values to match.
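
As an illustration of that adjustment, a version that creates only the training and test data sets might look like the sketch below. The field name source_flag and its values 'TRAIN' and 'TEST' are hypothetical stand-ins for your own field and values; the &EM_ macro variables stay exactly as they are.

```sas
/* Sketch: only two outputs, so &EM_EXPORT_VALIDATE is dropped from  */
/* the DATA statement. source_flag and its values are hypothetical;  */
/* substitute your own field name and values.                        */
data &EM_EXPORT_TRAIN &EM_EXPORT_TEST;
    set &EM_IMPORT_DATA;
    if source_flag = 'TRAIN' then output &EM_EXPORT_TRAIN;
    else if source_flag = 'TEST' then output &EM_EXPORT_TEST;
run;
```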




ChrisGermain
Fluorite | Level 6

For some reason this is not working for me... It doesn't look like the code editor is recognizing the macros or the code properly:

 

1.PNG2.PNG

ChrisGermain
Fluorite | Level 6

Disregard. I had to change the "code format" option under Score from "data" to "other".

ChrisGermain
Fluorite | Level 6

Hi,

 

I'm now having an issue after changing around some nodes to try different models. I've run the SAS Code node to separate my training and test data using my criteria. I'm following that with a Data Partition node to split the training set into training and validation portions. After that I have a Transform Variables node, which then feeds all of my models. When I run a Score node on the best model, my Test data is not running. Any thoughts on why? I wasn't having an issue when I ran the SAS Code node and Data Partition node following the Transform Variables node. Pic is attached. Capture.PNG

