If you use, or plan to use, Model Studio 8.5 and want to partition your data into training, validation and test sets to build your model, then read on.
This article covers two topics: a closer look at the 'automatic data duplication', and some simple code that prevents it from happening in the first place.
Before I dive in, my thanks to SAS R&D colleagues Fred Burke and Mark Thomas for their help in getting clarity on the points and running some confirmation tests.
In my last article, I briefly mentioned a recent update to the Visual Data Mining and Machine Learning Advanced Topics documentation. The text reads:
Model Studio copies the data source when the first Data node is run. This can cause performance issues and can cause you to run out of disk space. The amount of space that is required depends on the number of saved projects and on the size of the data source. To prevent Model Studio from automatically creating copies of your data, ensure that the following conditions are met:
Some points of clarification.
Instead of “Model Studio copies the data source when the first Data node is run”, it is probably more precise to say “Model Studio copies the data source only on the first run/execution of the Data node in the pipeline”.
Key variables are required within VDMML-based projects. If the key variable is named _DMINDEX_ and is numeric with distinct values, then Viya/CAS will recognize it as a key variable. If the variable has another name and/or is not numeric, then you will need to use the Model Studio interface to define it as a key variable.
Partitioning of data is optional. It is up to the Model Studio user to decide whether to select partitioning as part of the Model Studio project properties. The default behaviour of Model Studio is to partition the data. If the user deselects the partition option, then bullet point 2 becomes irrelevant.
To be clear, the data is copied to the CAS Server Controller host machine. The data is only duplicated if the Model Studio user requests the data to be partitioned within the Model Studio interface.
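The requirement that a key variable be numeric with distinct values can be checked before the table goes anywhere near Model Studio. The following is a minimal sketch only, run against a WORK copy of sashelp.heart rather than a CAS table; the table name work.heart_key is hypothetical and used purely for illustration.

```sas
/* Illustrative only: build a small WORK table with a candidate key.     */
/* work.heart_key is a hypothetical name, not part of the original code. */
data work.heart_key;
    set sashelp.heart;
    _dmIndex_ = _N_;   /* single-threaded DATA step, so _N_ is distinct */
run;

proc sql;
    /* Confirm the candidate key column is numeric (type = 'num'). */
    select name, type
        from dictionary.columns
        where libname = 'WORK'
          and memname = 'HEART_KEY'
          and upcase(name) = '_DMINDEX_';

    /* The two counts match only when every key value is distinct. */
    select count(*) as n_rows,
           count(distinct _dmIndex_) as n_distinct_keys
        from work.heart_key;
quit;
```

If the two counts differ, the variable contains repeated values and would not meet the 'distinct values' condition described above.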
With this in mind I thought it would be worth sharing a visual description of if/when/how the steps should be completed to prevent data duplication.
The overarching recommendation, if data duplication is to be avoided, is to add the key and partition variables to the data before it is used within Model Studio. Readying the data before running analytical routines is commonplace, so having a repeatable approach is recommended. Interfaces like Data Studio can be used to generate key and partition variables.
One way to tackle this, and to implement the 3 bullet points described in the documentation, is provided in the example code below. The final table that gets created, heart_dmindex_partind, is the one that would be used as the input table within Model Studio.
The key points to remember are:
My thanks to Fred for writing the code below. Let us know what you think by adding your comments below.
/*
* This example uses the sashelp.HEART data set and creates both the index
* and partition variables for it. We use this SAS data set because it is
* easily accessible by all users.
*
* However, in your use case your first step would likely be to use the 'Data'
* tool in the Administration application 'Manage Environment' to import your
* table (i.e., from your local filesystem) into your CASUSER caslib and then
* add the index and partition variables to that CAS table.
*
* This example code is expected to be run in SAS Studio.
*
*/
OPTIONS MPRINT NOSYMBOLGEN NOSOURCE;
options validvarname=ANY validmemname=extend VARLENCHK=NOWARN CASDATALIMIT=ALL;
CAS mycas sessopts=(metrics=true messagelevel=all);
CASLIB _ALL_ ASSIGN SESSREF=mycas;
/* Create the mylibref libref so that SAS can use it to access the table in CAS. */
libname mylibref cas caslib="CASUSER";
/* Just to ensure the table does not exist. */
proc datasets lib=mylibref nolist nowarn;
    delete heart_dmindex_partind;
run;
quit;
/* Ensure the table is not already loaded in CAS. */
proc cas;
    table.dropTable / caslib="CASUSER" name="heart_dmindex_partind" quiet=TRUE;
run;
quit;
/* Load the data into CAS so that we can access it. */
data mylibref.heart_dmindex_partind;
    set sashelp.heart;
run;
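/*
* Create the _dmIndex_ variable.
* Model Studio will recognize this variable and automatically assign it a role
* of KEY.
*
* The DATA step runs in several threads in CAS and _N_ restarts at 1 in each
* thread, so _N_ is combined with _threadid_ to keep the values distinct.
* This holds as long as no single thread processes 1,000,000 rows or more;
* use a larger multiplier than 1E6 for bigger tables.
*/
data mylibref.heart_dmindex_partind;
    length _dmIndex_ 8;
    _dmIndex_ = _threadid_ * 1E6 + _N_;
    set mylibref.heart_dmindex_partind;
run;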
/*
* Create the partition variable '_PartInd_'. The call below specifies that
* the partition variable _PartInd_ be created and that the percentage of
* train, validation and test observations will be 70%, 10%, and 20% respectively.
*
* For more information see:
* https://go.documentation.sas.com/?docsetId=casstat&docsetTarget=casstat_partition_examples.htm&docsetVersion=8.5
*/
proc partition data=mylibref.heart_dmindex_partind
        samppct=10 samppct2=20 seed=10 partind nthreads=3;
    by STATUS;
    output out=mylibref.heart_dmindex_partind;
run;
/*
* The above code will generate a Log view, a Results view and an Output Data
* view. Select the Output Data view and verify that the variables _dmIndex_ and
* _PartInd_ have been successfully added to mylibref.heart_dmindex_partind.
*/
/*
* Now write the updated table to the filesystem of the CASUSER caslib and
* promote it from session scope to global scope so other CAS sessions you
* create can see it.
*/
proc cas;
    session mycas;
    /* Save the in-memory table to a .sashdat file in the caslib. */
    table.save /
        caslib="CASUSER"
        table="heart_dmindex_partind"
        name="heart_dmindex_partind.sashdat"
        replace=TRUE;
    run;
    /* Confirm that the file was written. */
    table.fileInfo /
        caslib="CASUSER"
        path="heart_dmindex_partind.sashdat";
    run;
    /* Promote the table to global scope. */
    table.promote / caslib="CASUSER" name="heart_dmindex_partind";
    run;
quit;
/*
* Check that the table heart_dmindex_partind is available in CASUSER
* with global scope.
*/
proc casutil;
    list tables incaslib="CASUSER";
quit;
CAS _ALL_ clear;
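Once the table has been promoted, a quick frequency count can confirm that the split looks as intended. The following is a sketch only: it assumes the code above ran successfully and that the promoted table is visible to a new CAS session for the same user. The session name chk and the libref chklib are hypothetical names chosen for this check.

```sas
/* Start a fresh session, because the earlier sessions were cleared. */
cas chk;
libname chklib cas caslib="CASUSER";

/* Count rows per partition value. The percentages should be close to */
/* the 70% / 10% / 20% split requested in the PROC PARTITION call.    */
proc freq data=chklib.heart_dmindex_partind;
    tables _PartInd_ / nocum;
run;

/* End the checking session. */
cas chk terminate;
```

As I read the PROC PARTITION documentation, _PartInd_ is 1 for the SAMPPCT= sample, 2 for the SAMPPCT2= sample and 0 for the remaining rows, but it is worth confirming this against the frequency table before mapping the values to training, validation and test in Model Studio.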
Thanks for reading.
--Simon