In data science, it is well known that honest assessment on a hold-out sample is the tried-and-true way to help ensure model generalizability and to obtain a fair measure of model performance. This post in my ongoing series introducing Model Studio to the Enterprise Miner user addresses the differences and similarities in how the two tools partition data.
If you’ve been keeping up with this series, then you know that for a while I’ve been hinting at addressing the process of building models. This is, after all, the primary purpose of these analytical tools. But if you build models for a living, then you know that partitioning the data PRIOR to building a model is a critical step. So, I thought it would be appropriate to discuss data partitioning first. And for this topic, I think you’ll see there are more differences than similarities between these analytical workhorses, which makes a quick discussion of data partitioning essential for anyone moving into the Model Studio world from Enterprise Miner.
Enterprise Miner:
For Enterprise Miner, partitioning data is typically handled within a diagram after the project has been created and a data source has been defined. If you missed my earlier posts, I’ve already addressed Enterprise Miner projects and data sources in part 1 and part 2 of this series, so go back and check them out if you haven’t already. Once the data source node exists within a diagram, partitioning can be performed. The Data Partition node is found on the Sample tab of the SEMMA tools palette, where it is the second node from the left. (For more on SEMMA, see part 3 of this series.) Here we see a data source, called Commsdata, exists within a diagram, the Sample tab is active, and the Data Partition node has been selected.
Once the Data Partition node is placed into the diagram, the Data Source node can be connected to it.
Before running the Data Partition node, the analyst typically checks the properties panel and makes any changes that are needed for their project. Here is the default properties panel for the Data Partition node:
The most important properties, arguably, are the Data Set Allocations. This is where the analyst decides how many partitions to create from the raw data, as well as the allocation percentage for each. The defaults create Training, Validation, and Test data sets with allocations of 40%, 30%, and 30%, respectively. Suppose the analyst wanted only a two-way partition of the data, with 70% going to Training and 30% going to Validation. This is handled by simply changing the Training allocation to 70 and the Test allocation to 0.
Percentages are not the only values that can be used in the Data Set Allocations property; integer values representing ratios also work. Suppose you want only training and validation partitions, with a 2-to-1 ratio between them; that is, two observations to train models for every observation used to validate them. In this case, the data scientist simply enters 2 and 1 for the Training and Validation allocations, respectively, which assigns 66.67% of the data to training and 33.33% to validation. Typing integers representing a ratio is simpler than typing the actual percentages.
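For the curious, the ratio-to-percentage arithmetic can be sketched in a few lines of Python. This is just an illustration of the math, not SAS code, and the function name is mine:

```python
# Illustrative sketch (not SAS code): how allocation values, whether
# percentages or ratio integers, normalize to percentages of the data.
def normalize_allocations(train, validate, test=0):
    """Convert allocation values (percentages or ratio integers)
    into percentages that sum to 100."""
    total = train + validate + test
    return tuple(round(100 * part / total, 2) for part in (train, validate, test))

print(normalize_allocations(2, 1))    # ratio 2:1 -> (66.67, 33.33, 0.0)
print(normalize_allocations(70, 30))  # percentages pass through unchanged
```

Either way you type the values, the same proportional split of the data results.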
Another important property is the Partitioning Method, which determines the sampling method used to split the data. With the default setting, Enterprise Miner chooses the sampling strategy automatically, removing that decision from the analyst. The options for this property are Default, Simple Random, Cluster, and Stratified. The Default setting depends on the measurement level of the target variable: if the data source has a class target variable (for example, nominal or binary), then stratified sampling is used; otherwise, simple random sampling is performed. The Cluster option may sound intriguing, and I’m sure it has its applications, but in over 15 years of using Enterprise Miner, I’ve never used it once. In case you are curious, here’s how the Enterprise Miner help documentation describes it: “Using simple cluster partitioning, you allocate the distinct values of the cluster variable to the various partitions using simple random partitioning. Note that with this method, the partition percentages apply to the values of the cluster variable, and not to the number of observations that are in the partition data sets. If you select cluster partitioning as your method, you must use Variables property of the Data Partition node to set the Partition Role of the cluster variable (which can be the target variable) to Cluster.”
It is important to understand what takes place behind the scenes when partitioning is done, and how this differs from Model Studio. When Enterprise Miner partitions the raw data set, two or three physical tables are created, depending on the Data Set Allocations property. That is, the raw table is physically split according to the partitioning strategy, and these individual tables are passed down the process flow as nodes in the analysis are run. The tables are SAS data sets, created and saved within the project folder that is defined when the project is created. For a two-way partitioning of the data, below is what the system names the two created tables:
This behavior is very different from what happens in Model Studio, which I’ll describe in the next section. The exception is when Enterprise Miner works with distributed data stored on a database appliance such as Teradata, or a platform such as Hadoop; in that case, the process aligns more closely with what will be described for Model Studio.
Because data partitioning in Enterprise Miner happens within a diagram, a single project can consider different partitioning strategies. One diagram could use a two-way partitioning of the raw data with specific allocation percentages, while a second diagram within the same project could use a two-way partitioning of the same raw data with different percentages, or even a different number of partitions altogether.
Before moving on to how data partitioning is done in Model Studio, there’s another small point to address. There is technically another way that partitioning can be handled in Enterprise Miner: partitioning the data outside of Enterprise Miner, before the project is even created. If the raw data is manually split, say using SAS DATA step code in SAS Studio, then the pre-partitioned data sets can be brought into an Enterprise Miner project using the Data Source Wizard. Within the wizard, each partition of the data is assigned a role of Train, Validate, or, if needed, Test.
In this case, the Data Partition node would not be used within the diagram, but rather, the individual data partitions, with appropriate roles, would be used.
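As a rough illustration of that manual pre-split, here is a sketch in Python rather than a SAS DATA step; the function and file names are hypothetical:

```python
# Illustrative sketch (not SAS DATA step syntax): physically splitting
# raw data into separate training and validation tables before any
# project is created. Each observation lands in exactly one output file.
import csv
import random

def presplit_csv(path, train_path, validate_path, train_pct=0.7, seed=42):
    rng = random.Random(seed)
    with open(path, newline="") as src, \
         open(train_path, "w", newline="") as tr, \
         open(validate_path, "w", newline="") as va:
        reader = csv.reader(src)
        header = next(reader)
        train_writer, validate_writer = csv.writer(tr), csv.writer(va)
        train_writer.writerow(header)
        validate_writer.writerow(header)
        for row in reader:
            # Route each observation to one physical table.
            target = train_writer if rng.random() < train_pct else validate_writer
            target.writerow(row)
```

The two resulting files play the same role as the separate Train and Validate data sets you would register through the Data Source Wizard.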
Model Studio:
Right out of the gate, you need a different mindset when it comes to partitioning data in Model Studio compared to Enterprise Miner. There is no Data Partition node in Model Studio. In Model Studio, partitioning happens at the project level, not within pipelines, so unlike Enterprise Miner, a project can have one and only one partitioning strategy. Also, rather than physically breaking up the raw data into new tables, Model Studio creates a partition indicator variable that assigns each observation to a partitioned sample. More on this partition variable later. The partitioning options live in the Project settings, which can be accessed in two ways. First, when the project is being created, the user can open the Project settings by clicking the Advanced button in the New Project window.
This opens the New Project Settings window where the Partition Data options are the second item in the column of options to the left.
The Project settings, and thus the Partition Data options, are also available on the Data tab after the project is created, via the Settings shortcut button in the upper right corner of the Data tab window.
The Partition Data options can be changed only until a node within a pipeline is run; once a single node has run, the partitioning strategy is locked. As can be seen above, the default data partition creates Training, Validation, and Test samples, with allocations of 60%, 30%, and 10%, respectively. Note that although the default number of partitions is the same as in Enterprise Miner, the default percentages differ. To change the number of partitions or their allocations, simply enter the desired percentages into the appropriate options. To create a 70%/30% split between training and validation, set the values as follows:
Of course, integer values for ratios could also be used just as in Enterprise Miner.
Unlike Enterprise Miner, Model Studio offers only two options for the sampling method, set through the Method property: Stratify (the default) and Simple random. The choice should depend on the target variable: Simple random is the typical method for an interval target, while Stratified is best for all other target measurement levels.
The Partition Data options can be turned off by deselecting the Create partition variable option. This would be done if a partition indicator variable was created outside of Model Studio; in that case, the user-created partition variable is assigned a role of Partition on the Data tab prior to running a node.

As noted above, Model Studio handles partitioning by creating a partition indicator variable, not by physically breaking the raw data into subset tables. Because Model Studio works with in-memory data, a partition variable is a much more efficient way to handle data partitioning. For example, the Data Exploration node provides summary information on the entire raw data set; Model Studio computes these summary statistics and graphs simply by ignoring the partition variable. When Model Studio builds a model, the partition variable assigns each observation to the appropriate partition. Typically, we think of indicator variables as taking on only the values 0 and 1. In the case of a three-way partition into training, validation, and test, however, the partition indicator takes on values of 0, 1, or 2 (see below) to determine which of the three partitions each observation falls into.
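Here is a small Python sketch of the indicator-variable idea. It is illustrative only, not Model Studio internals; the column name is my own, and the 0/1/2 coding follows the convention just described:

```python
# Illustrative sketch: partitioning via an indicator column instead of
# physically splitting the table. The table itself stays whole in memory.
import random

def add_partition_variable(rows, pcts=(0.6, 0.3, 0.1), seed=42):
    """Attach a partition indicator column using the default
    60/30/10 allocations (0=training, 1=validation, 2=test)."""
    rng = random.Random(seed)
    for row in rows:
        r = rng.random()
        if r < pcts[0]:
            row["partition"] = 0          # training
        elif r < pcts[0] + pcts[1]:
            row["partition"] = 1          # validation
        else:
            row["partition"] = 2          # test

rows = [{"id": i} for i in range(1000)]
add_partition_variable(rows)

# Full-data summaries simply ignore the column...
print(len(rows))                          # 1000
# ...while model training filters on it.
train = [r for r in rows if r["partition"] == 0]
```

Because no observation is copied or moved, the same in-memory table serves both whole-data exploration and partition-aware modeling, which is the efficiency the indicator approach buys.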
Partitioning raw data is a cornerstone of building predictive models; the literature and experience alike indicate it is the best way to arrive at the best-performing, most generalizable model. The process for partitioning data is very different between Enterprise Miner and Model Studio, and understanding the differences, and the similarities, puts you in a much better position to continue your journey into the wonderful world of Model Studio as you make the transition away from Enterprise Miner.
More Posts in This Series:
Model Studio for Enterprise Miner Users: Part 1
Model Studio for Enterprise Miner Users: Part 2, Data
Model Studio for Enterprise Miner Users: Part 3, Let’s get philosophical
Model Studio for SAS Enterprise Miner Users: Part 5, Building Models…Let’s get physical!
Find more articles from SAS Global Enablement and Learning here.