Solved: Re: SAS Viya Model Studio and Data Partitioning

andreas_zaras · Posted 05-21-2020 03:31 PM

Hello,

I am using SAS Viya Model Studio ML and DM for educational purposes. I use a data set that I partition into 70% training and 30% validation. I assume that if I build the project from scratch, the software randomly chooses different training and validation sets each time, so the results of, e.g., the Decision Tree will be slightly different every time. Is that right?

If so, is there a way to set a seed so that every time I create the project the training and validation sets are the same?

One solution I have found is to add a binary partition variable to the data set so that the sets are the same every time, but I was wondering whether I can do this without the extra variable, via a seed. Setting a seed was possible in SAS EM.

Thanks in advance,

Andreas
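
The binary partition variable described in the question could be built along the following lines. This is only a minimal sketch: the output table work.hmeq_flagged, the variable name partition_flag, and the seed 12345 are hypothetical, and sampsio.hmeq is used purely as a stand-in for your own data. It assumes a regular single-threaded SAS DATA step (not one running in CAS), so that rows are read in the same order on every run.

```sas
/* Hypothetical names: work.hmeq_flagged, partition_flag; the seed 12345 is arbitrary */
data work.hmeq_flagged;
   set sampsio.hmeq;
   /* Seed the random-number stream once, before the first RAND call */
   if _n_ = 1 then call streaminit(12345);
   /* 1 = training (roughly 70% of rows), 0 = validation (the rest) */
   partition_flag = (rand('uniform') < 0.7);
run;
```

Because each row is assigned independently, the split is approximately, not exactly, 70/30. Rerunning the step with the same seed and the same input order should give the same flags.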

SimonWilliams · Posted 05-21-2020 05:12 PM

Hi Andreas,

Might PROC PARTITION help? https://go.documentation.sas.com/?docsetId=casstat&docsetTarget=casstat_partition_examples02.htm&doc...

Cheers, Simon

Example 29.2 Stratified Sampling
This example demonstrates how to use PROC PARTITION to perform stratified sampling to partition the data; it uses the same data table as is used in Example 29.1.

You can load the sampsio.hmeq data set into your CAS session by naming your CAS engine libref in the first statement of the following DATA step. This DATA step assumes that your CAS engine libref is named mycas, but you can substitute any appropriately defined CAS engine libref.

data mycas.hmeq;
   set sampsio.hmeq;
run;

The following statements perform the partitioning:

proc partition data=mycas.hmeq samppct=10 samppct2=20 seed=10 partind nthreads=3;
   by BAD;
   output out=mycas.out3 copyvars=(job reason loan value delinq derog);
run;

proc print data=mycas.out3(obs=20);
run;

The SAMPPCT=10 option requests that 10% of the input data be included in the training partition, and the SAMPPCT2=20 option requests that 20% of the input data be included in the testing partition. The SEED= option specifies 10 as the random seed to be used in the partitioning process. The PARTIND option requests that the output data table, mycas.out3, include an indicator that shows whether each observation is selected to a partition (1 for training or 2 for testing) or not (0). The OUTPUT statement requests that the sampled data be stored in a table named mycas.out3, and the COPYVARS= option lists the variables to be copied from mycas.hmeq to mycas.out3.

Register today and join us virtually on June 16!
sasglobalforum.com | #SASGF

View now: on-demand content for SAS users

SimonWilliams · Posted 05-21-2020 06:03 PM

You may also find this recently published article helpful: https://communities.sas.com/t5/SAS-Communities-Library/SAS-Model-Studio-8-5-projects-and-considerati...

andreas_zaras · Posted 05-22-2020 02:40 AM

Hello Simon,

Thanks for your answer!

So do you agree that if I build the project from scratch, the software randomly chooses different training and validation sets each time, so the results of, e.g., the Decision Tree will be slightly different every time?

If so, is there a way to set a seed in the Model Studio GUI so that every time I create the project the training and validation sets are the same?

SimonWilliams · Posted 05-22-2020 07:37 AM

Hi Andreas,

OK, I spoke with a colleague in R&D, and they confirmed that the seed for the partitioning of data within Model Studio 8.5 is a fixed value. It is the same value for each project you create and for each run of the data node.

You should be able to verify this by looking at summary statistics for each of the partitioned tables. They should be the same. If you are seeing behaviour which suggests that the partitioning of data is not consistent, then please contact Technical Support and provide some examples.

If you are seeing slightly different results for each run of the model, then the algorithms that underpin each modelling technique may have seed or starting values that are chosen at random or can be specified by the user. The VDMML documentation may help. https://go.documentation.sas.com/?docsetId=casml&docsetTarget=casml_whatsnew_sect003.htm&docsetVersi...

You are correct that there is no option for the user within the Model Studio GUI to set the seed value. Your feedback has been passed on to R&D.

Cheers, Simon
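
The summary-statistics check suggested above might look something like this for a table that carries a partition indicator. The sketch assumes the mycas.out3 table produced by the PROC PARTITION documentation example quoted earlier in this thread, with its _PartInd_ indicator and the copied variables loan and value. How to reach the partitioned tables that Model Studio itself creates is not covered here.

```sas
/* Row counts per partition: _PartInd_ is the indicator added by the PARTIND option */
proc freq data=mycas.out3;
   tables _PartInd_;
run;

/* Summary statistics of two copied numeric variables within each partition */
proc means data=mycas.out3 n mean std min max;
   class _PartInd_;
   var loan value;
run;
```

If two runs yield the same counts and statistics in every partition, that is consistent with identical partitions, although matching summaries alone do not prove a row-for-row match.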

andreas_zaras · Posted 05-22-2020 07:54 AM

Thanks Simon!

SimonWilliams · Posted 05-27-2020 11:22 AM

Hi Andreas,

I received an update from my colleagues on this topic.

In essence, once you have created a Model Studio project which uses data 'x', every time the data node is run, your partitions will remain the same.

However, if you create multiple projects which use the same set of data 'x', the partitions will look different across the projects.

If you are teaching students, each student has their own project, and you really want them to have identical partitions for data 'x', then use the program method I outline in the communities article: create identical partitions by having each student run PROC PARTITION with the same seed.

Sorry for any confusion, and as mentioned before we've provided feedback for Model Studio users to be able to set the seed in the Model Studio GUI.

Thanks, Simon
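
As a rough illustration of the same-seed approach for the 70/30 split from the original question, each student could run something like the following. It is adapted from the documentation example quoted earlier in the thread, not taken from the communities article. The output table name mycas.hmeq_part and the seed 12345 are arbitrary choices, and mycas.hmeq is assumed to have been loaded as in that example.

```sas
/* Assumptions: mycas.hmeq is already loaded; the table name hmeq_part and seed 12345 are arbitrary */
proc partition data=mycas.hmeq samppct=70 seed=12345 partind;
   by BAD;   /* stratify on the target so both partitions keep a similar event rate */
   output out=mycas.hmeq_part copyvars=(job reason loan value delinq derog);
run;
```

With only SAMPPCT= specified, the _PartInd_ indicator is 1 for the sampled 70% (training) and 0 for the remaining rows (validation). Extend the COPYVARS= list with any other inputs the models need. Whether every student gets exactly the same rows may also depend on the CAS setup (for example, how the data is distributed across workers and threads), so it is worth confirming on your own environment.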

andreas_zaras · Posted 05-27-2020 03:43 PM

Hi Simon!

Thanks for the update.

Another good idea to pass on to R&D is making a seed available for the event-based sampling facility. I think that at the moment, every time you create a project, it samples the events and the non-events with a new seed, so the results won't be the same.

Thanks,

Andreas

SimonWilliams · Posted 05-28-2020 10:49 AM

Hi Andreas,

I will pass your feedback regarding the seed for event-based sampling on to R&D.

Cheers, Simon
