
Model Studio for Enterprise Miner Users: Part 2, Data

Started ‎03-29-2024
Modified ‎03-29-2024
Views 201

This post is part 2 in a multi-part series that has the purpose of introducing Model Studio to Enterprise Miner users. If you did not get a chance to check out Part 1 in the series, do so by clicking here. In this post, I’m going to cover all things data-related for the two SAS tools. I’ll discuss how data sources are created in Enterprise Miner, some of the project Advisor Options in Model Studio, and the benefits of the Data tab in a Model Studio project. For brevity, I’ll continue using the same abbreviations for each tool that I introduced in my first post: E-Miner for Enterprise Miner and M-Studio for Model Studio.

 

A Data Source in E-Miner:

 

A data source must be brought into an existing project in E-Miner, so a project must be created first. I covered E-Miner projects in my first post in this series. When a data table is imported into an E-Miner project, using E-Miner lingo, it becomes a data source. A data source includes the original data table, but it also has additional metadata attached. For E-Miner, this additional metadata, or information about the data, covers three areas: the variable roles, the variable measurement levels, and the role of the table itself. This metadata allows E-Miner to use the data efficiently and make certain decisions for the data scientist during the analysis. These decisions remove mundane tasks from the analyst, freeing them to focus on more important details in their work. (If this makes you nervous, don’t be. In E-Miner the user can always change or undo the decisions the software makes.) An example of E-Miner making such a decision is the case where the data has a binary target and a regression model is to be built: E-Miner will automatically invoke logistic regression rather than linear regression, removing the mundane choice of which type of regression model to create from the user.

Variable roles include, but are not limited to, Input, Rejected, Target, and ID. Variable measurement levels include, but are not limited to, Interval, Nominal, Ordinal, and Binary. These measurement levels go beyond the two data types usually used when writing a SAS DATA step: numeric and character. And the possible table roles include, but are not limited to, Input, Score, Training, Validation, and Transaction.
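To make the idea concrete, here is a rough Python sketch of a data source as "table plus metadata". The class and function names are my own illustration, not E-Miner's internal representation:

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class VariableMeta:
    role: str   # e.g. "Input", "Target", "ID", "Rejected"
    level: str  # e.g. "Interval", "Nominal", "Ordinal", "Binary"


@dataclass
class DataSource:
    table_name: str
    table_role: str  # e.g. "Input", "Score", "Training", "Validation", "Transaction"
    variables: Dict[str, VariableMeta] = field(default_factory=dict)


def pick_regression(ds: DataSource, target: str) -> str:
    # Sketch of the kind of decision metadata enables: an Interval target
    # implies linear regression; a categorical (e.g. Binary) target implies
    # logistic regression.
    return "linear" if ds.variables[target].level == "Interval" else "logistic"
```

With a binary target defined in the metadata, `pick_regression` lands on logistic regression without the analyst having to say so.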

 

From within a project, a new data source is created by using the “File” pull-down menu, the “New” shortcut button, or by right-clicking on “Data Sources” in the project panel. A new data source is created through the Data Source Wizard, which takes the user through a series of steps such as naming the data source, selecting the library where the data table exists, establishing metadata for the data source, and deciding if a sample of the data should be taken. Here’s a screenshot of step 1 in the Data Source Wizard:

 

01_JT_EM_data_wizard-300x157.png

 


 

The number of steps in the wizard updates as the metadata is established. For example, when the column (i.e., variable) metadata is defined, if the target is given a measurement level of binary, an additional step for “decision processing” appears in the wizard. Decision processing can be used to update predicted probabilities for a binary target if the event rate in the observed data differs from the true population event rate, as is the case when the data are over-sampled. I won’t discuss each step in the Data Source Wizard here, but I will cover the Metadata Advisor Options for establishing column metadata, which takes place in step 4. Step 4 in the wizard has two options for the Metadata Advisor: Basic, which is the default, and Advanced. When the Basic option is used, every variable is assigned the role of Input, unless a variable name contains the word “target” or “ID”; in these latter cases, a variable is assigned a role of Target or ID, respectively. No variables are rejected when the Basic advisor is used. Further, each variable’s measurement level is assigned as Nominal if it is categorical and Interval if it is numeric. No variable is assigned a measurement level of Ordinal or Binary by the Basic advisor. E-Miner goes further in establishing metadata when the Advanced advisor is used. Take a look at the Advanced Advisor options below:

 

02_JT_EM_advanced_advisor_options-300x243.png

 

There are three main options: Reject Vars with Excessive Missing Values, Detect Class Levels, and Reject Vars with Excessive Class Values. Each option can be turned on or off, and each has its own default threshold, which can also be changed by the user. Two of the options guard against having problematic variables in the data source. “Reject Vars with Excessive Missing Values” protects against having an input variable with a high proportion of missing values; the default cut-off is 50%. Categorical variables with too many levels can also create problems when modeling, so the option “Reject Vars with Excessive Class Values” guards against this: E-Miner will reject any categorical variable with more than 20 levels when this advanced advisor option is used. Finally, “Detect Class Levels” converts numeric variables with a small number of distinct values to categorical, or “class,” variables. The default threshold is 20, so a numeric variable having only 10 distinct values, for example, would be assigned a measurement level of Nominal. This option changes numeric variables with 2 distinct values to Binary and any numeric variable with a single value to Unary.
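The Basic and Advanced advisor rules described above can be sketched roughly in Python. The function names, return shape, and details are my own simplification for illustration, not E-Miner code; the thresholds are the defaults mentioned in the text:

```python
def basic_advisor(name: str, is_numeric: bool) -> tuple:
    """Sketch of the Basic Metadata Advisor rules (illustrative only)."""
    lowered = name.lower()
    if "target" in lowered:           # simplified version of the name check
        role = "Target"
    elif "id" in lowered:
        role = "ID"
    else:
        role = "Input"
    # Basic never assigns Ordinal or Binary, and never rejects a variable.
    level = "Interval" if is_numeric else "Nominal"
    return role, level


def advanced_advisor(values, is_numeric,
                     max_pct_missing=0.5, class_cutoff=20, max_class_levels=20):
    """Sketch of the three Advanced Advisor rules with their default
    thresholds (illustrative only, not E-Miner's implementation)."""
    n = len(values)
    nonmissing = [v for v in values if v is not None]
    distinct = len(set(nonmissing))

    # Reject Vars with Excessive Missing Values (default cut-off 50%)
    if n and (n - len(nonmissing)) / n > max_pct_missing:
        return "Rejected", "Interval" if is_numeric else "Nominal"

    if is_numeric:
        # Detect Class Levels: numeric variables with few distinct values
        # become class variables (default threshold 20)
        if distinct == 1:
            return "Input", "Unary"
        if distinct == 2:
            return "Input", "Binary"
        if distinct <= class_cutoff:
            return "Input", "Nominal"
        return "Input", "Interval"

    # Reject Vars with Excessive Class Values (default 20 levels)
    if distinct > max_class_levels:
        return "Rejected", "Nominal"
    return "Input", "Nominal"
```

So, per the rules above, a numeric column with 30 distinct values stays Interval, one with only 2 distinct values becomes Binary, and a categorical column with 25 levels is rejected.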

 

Of course, whether the Basic or Advanced advisor is used, the analyst can always make individual changes to variables in the Column Metadata step, which is step 5 in the wizard. The second-to-last step in the wizard is where the role of the table is defined, and the final step is always a summary of the data source, including its assigned metadata. When a data source is added to a project, its name appears under the Data Sources folder in the project panel. One difference between E-Miner and M-Studio is that an E-Miner project can contain more than one data source.

 

Once a data source is defined, the role of the table can be changed by selecting the data source name in the project panel and then changing the Role property in the properties panel, which is located under the project panel. Variable metadata for a data source can be changed within the project panel itself by right-clicking the name of the data source and selecting Edit Variables…

 

03_JT_EM_data_sources_folder-296x300.png

 

Variable metadata can also be changed within a pipeline by using the Metadata node. More on E-Miner nodes in my next post.

 

Data Access and Advisor Options in M-Studio:

 

As stated in my first post, a project in M-Studio cannot be created without data being assigned to it during the creation process. This behavior is a bit different from E-Miner, where data sources are added after the project is created. In M-Studio, creating a project and defining data are inseparable. Contained within M-Studio’s “New Project” window is a required field for Data. The New Project window was covered in my first post in this series. When the user clicks “Browse” to select a data table, M-Studio opens a “Choose Data” window which provides three options: Available, Data Sources, and Import.

 

04_JT_MS_Choose_Data_Tabs_border-300x141.png

 

M-Studio works only with data in CAS memory. Data tables already distributed and available in CAS memory are found under the “Available” tab. Data tables already on the CAS server but not yet loaded into CAS memory are available under the “Data Sources” tab. And data that needs to be imported into CAS, distributed, and loaded into memory can be brought in using the “Import” tab.

 

Once an in-memory data table is selected, M-Studio helps establish variable metadata just as the Advanced advisor options do for E-Miner. These metadata advisor options are available by clicking the “Advanced” button on the New Project window. One of the advanced options when creating an M-Studio project is “Advisor Options”. Here’s a screenshot:

 

05_JT_MS_advisor_options_good_border.png

 

The three advisor options are the same as the advanced advisor options discussed above for E-Miner, but M-Studio names them differently. “Maximum class levels” here is the same as “Reject Vars with Excessive Class Values” in E-Miner. “Interval cutoff” and “Maximum percent missing” in M-Studio correspond to “Detect Class Levels” and “Reject Vars with Excessive Missing Values”, respectively, in E-Miner. These options have the same default cut-offs as in E-Miner. One difference: in E-Miner, when the advanced advisor options are used, the user can decide whether each option is enabled or turned off; in M-Studio, they are applied automatically, except for “Maximum percent missing”, which can be disabled by deselecting its “Apply” checkbox. These metadata options can be seen only during the creation of an M-Studio project.

 

The Data Tab in M-Studio:

 

In M-Studio, once the project is created, the project automatically opens on the Data tab. This is a huge difference from E-Miner. The Data tab is possible because in M-Studio only one data table is applied to each project. The Data tab is where M-Studio’s assigned metadata can be viewed and changed by the user. So just like in E-Miner, an analyst is never locked into one of the metadata settings decided upon by the software. Here’s a look at the Data tab for a Data Mining and Machine Learning project:

 

06_JT_MS_data_page-1024x444.png

 

Aside from the Data tab itself, M-Studio takes metadata to a whole new level compared to E-Miner. In addition to the variable roles and measurement levels, M-Studio also allows the analyst to apply certain data preprocessing rules to each variable. Applying these preprocessing rules is optional. On the Data tab, when a variable is selected, the metadata fields open in a pane on the right of the screen. In addition to the variable role and measurement level, other fields are shown, and the fields available depend on the measurement level of the variable. These additional fields allow data preprocessing rules to be defined for variables.

For interval inputs, the additional fields are Transform, Impute, Lower limit, and Upper limit. Rules can be established here for types of transformations (such as inverse, log, square, etc.), imputations (such as custom constant value, maximum, mean, etc.), and lower and upper limits for value replacement. The Data tab only establishes the rule; the preprocessing action itself is invoked by using the appropriate node within a pipeline. For example, if you set a variable to have an Impute rule of median and then place an imputation node in a pipeline, the median is imputed for that variable when the pipeline is run. For nominal variables, rules can be set for Order (such as ascending or descending), Transform, and Impute. Nominal inputs can also be assessed for bias, toggled on and off in the same metadata pane. This is another leap beyond E-Miner, which does not have the ability to assess categorical variables for bias. Below is a view of the metadata pane for an interval input.
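To see how such rules might behave when the matching pipeline node runs, here is a rough Python sketch of an interval variable's Impute, limit, and Transform rules being applied. The function and its behavior are my own illustration, not M-Studio code:

```python
import math
import statistics


def apply_interval_rules(values, transform=None, impute=None,
                         lower=None, upper=None):
    """Sketch of interval-variable preprocessing rules (illustrative only):
    impute missing values, replace values outside the limits, transform."""
    observed = [v for v in values if v is not None]

    # Impute rule, e.g. "median": fill missing values with the median
    if impute == "median":
        fill = statistics.median(observed)
        values = [fill if v is None else v for v in values]

    # Lower/Upper limit rules: replace out-of-range values with the limit
    if lower is not None:
        values = [max(v, lower) for v in values]
    if upper is not None:
        values = [min(v, upper) for v in values]

    # Transform rule, e.g. "log"
    if transform == "log":
        values = [math.log(v) for v in values]
    return values
```

Note that in M-Studio the Data tab only records the rule; nothing like this runs until the corresponding node executes in a pipeline.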

 

07_JT_MS_Var_Metadata_interval-186x300.png

 

An additional role for a variable in M-Studio that is not available for E-Miner is Partition. The user can create their own partition variable and partition the data according to this variable. The role of partition is only available for variables with a measurement level of nominal or ordinal. Another difference from E-Miner is that to create an M-Studio project, some variable roles are required. For example, to create a Data Mining and Machine Learning project, a single variable must have the role of Target, or a pipeline cannot be run. A Forecasting project requires two variable roles: Time and Dependent. For Text Analytics projects, a variable must have a role of Text. Further, when it comes to the role of Target for a Data Mining and Machine Learning project, M-Studio can have just one. E-Miner allows for two variables with a role of target because it has two-stage modeling capabilities. M-Studio does not yet have that capability, at least within a single project. In M-Studio, two-stage modeling could be done using two projects, one for each target variable.
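A user-defined partition variable of the kind described above can be sketched as a nominal column of partition labels. The labels and 60/30/10 split here are my own example, not an M-Studio default:

```python
import random


def make_partition(n, seed=0, weights=(0.6, 0.3, 0.1)):
    """Sketch of building a nominal partition variable for n rows
    (illustrative only); M-Studio could then assign it the Partition role."""
    rng = random.Random(seed)  # seeded for a repeatable split
    labels = ("TRAIN", "VALIDATE", "TEST")
    return [rng.choices(labels, weights=weights)[0] for _ in range(n)]
```

Each row gets one of the three labels, and the column's nominal measurement level makes it eligible for the Partition role.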

 

So, it’s pretty clear that although there are some similarities between “data” in E-Miner and M-Studio, how the tools handle and display data, as well as the full capabilities of each, are quite different. M-Studio works only with data that resides in CAS memory, and this is simply not the case for E-Miner; this fact alone drives a lot of the differences. For me, as a user of both tools, it took just a bit of time to get used to the differences, and I quickly fell in love with the Data tab and its capabilities in M-Studio. One final warning about the Data tab before wrapping up this installment in my series: after pipelines are run in M-Studio, if changes are made on the Data tab, then all pipelines must be rerun. Yes, this can sometimes be frustrating and a bit time consuming, but the tool must operate this way. If I go back and reject a variable that was an input in my prior runs, the pipelines must be rerun because that same variable may have ended up in my final models.

 

For the next post in the series, I’ll be covering differences in how models are built between the two tools: the SEMMA tools palette in E-Miner, the modeling nodes in each of the three types of M-Studio projects, and pipeline templates in M-Studio. I hope you are enjoying your journey from the E-Miner world into the new and modern world of M-Studio on SAS Viya. And if you are, stay tuned, there’s a lot more to come!

 

 

Find more articles from SAS Global Enablement and Learning here.
