Overview
A new node, Feature Machine, has been added to Model Studio in SAS Visual Data Mining and Machine Learning 8.5. After analyzing your data, it automatically generates an entire set of transformed features for export to downstream nodes. It generates features in three steps. First, it explores the data, grouping input variables into categories that share the same statistical profile. This profile draws on many variable attributes, including cardinality, coefficient of variation, entropy, qualitative variation, skewness, kurtosis, missingness, and outlier percentage. Next, the node screens the input variables to identify those that should be excluded from feature generation or transformed in a specific way. Finally, the variables that survive screening are used to generate features, based on the exploration groupings and the transformation policies that you select. In this article, I describe the following functionality and features of the new node:
Input Variable Screening
Transformation Policies
Feature Generation and Selection
Feature Transformations
Feature Machine Results
Input Variable Screening
The input variable screening process scans each input for several data quality issues that undermine its usefulness as a predictive modeling input. If an input exceeds the threshold for one or more of these issues, it is either excluded from further feature generation or flagged for the Group Rare feature transformation. Several options let you control the screening for these individual data quality issues (a Python sketch of several of the screening statistics follows the list):
Coefficient of variation – Identify interval variables that have a low coefficient of variation (close to constant value). These variables are excluded from feature processing. Enabled by default.
Group rare levels – Identify nominal variables that have rare levels. These variables are transformed by rare level grouping. Enabled by default.
Leakage percent threshold – Identify variables that have a very high level of information about the target (leakage variables). Variables that exceed your specified threshold (target entropy reduction) are excluded from feature processing. Default=90.
Mutual information threshold – Identify variables that have a low level of information about the target (not informative). Variables that are below your specified threshold are excluded from feature processing. Default=0.05.
Redundancy threshold – Identify variables that are redundant (highly correlated). If the Symmetric Uncertainty for two variables exceeds your specified threshold, the variable that has less information about the target is excluded from feature processing. Default=1, which effectively disables redundancy screening.
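Below is a minimal Python sketch of several of these screening statistics, assuming discrete (nominal or pre-binned) variables for the entropy-based measures; the node's actual estimators and binning are internal to SAS and may differ.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score

def entropy(s: pd.Series) -> float:
    """Shannon entropy (in nats) of a discrete variable."""
    p = s.value_counts(normalize=True)
    return float(-(p * np.log(p)).sum())

def coefficient_of_variation(x: pd.Series) -> float:
    """Std/mean of an interval variable; values near 0 mean near constant."""
    return float(x.std() / abs(x.mean()))

def target_entropy_reduction(x: pd.Series, y: pd.Series) -> float:
    """I(X;Y) / H(Y): the fraction of target entropy explained by X.
    Values near 1 suggest target leakage."""
    return mutual_info_score(x, y) / entropy(y)

def symmetric_uncertainty(a: pd.Series, b: pd.Series) -> float:
    """SU = 2 * I(A;B) / (H(A) + H(B)), bounded in [0, 1];
    high SU between two inputs indicates redundancy."""
    return 2 * mutual_info_score(a, b) / (entropy(a) + entropy(b))
```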
Transformation Policies
There are seven transformation policies available for selection. The features generated for each policy are designed to treat the data issue ascribed to that policy. Policies flagged with an asterisk are enabled by default.
Cardinality - Treatment of high cardinality*
Entropy - Treatment of low entropy
Kurtosis - Treatment of high kurtosis
Missingness - Treatment of missing values*
Outliers - Treatment of outliers
Qualitative variation - Treatment of low indices of qualitative variation
Skewness - Treatment of high skewness*
Feature Generation and Selection
Multiple features can be generated per input variable, with the type and number of features determined by the transformation policies that are selected. The name of each feature encodes the transformation pipeline applied to produce it: the transformation name (see the list of feature transformations in the section below), followed by an underscore and the input variable name. When Feature Selection is enabled (the default), all the features for an input variable are ranked using the Symmetric Uncertainty statistic, and the top-ranked features per input are selected and output from the node. When it is disabled, all generated features are output from the node. For Feature Selection, you specify the number of selected features per input with the Number of features per input option (Default=2).
The value that you specify for this option is compared against the feature rank values to determine the selected features. Because features are ranked by their SU values, a tie (two or more features with the same SU value) can result in more features being selected for an input than the number you specified. For example, suppose the generated features for input variable AGE are ranked such that the third, fourth, and fifth features are tied at Feature Rank 3. If the specified number of features per input is 1, only the first feature is kept. If it is 2, the first two features are kept. However, if it is 3, the first five features are kept, since features 3, 4, and 5 all share Rank 3. The Where clause for selecting the features is: Where Feature Rank <= Number of features per input. The sketch below illustrates this tie-aware selection.
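Here is a short Python sketch of that selection rule, using hypothetical feature names and SU values for input variable AGE (the node computes these internally):

```python
import pandas as pd

# Hypothetical ranking of five generated features for input variable AGE.
features = pd.DataFrame({
    "feature": ["f1_AGE", "f2_AGE", "f3_AGE", "f4_AGE", "f5_AGE"],
    "su":      [0.41,     0.38,     0.30,     0.30,     0.30],
})

# Dense ranking gives tied SU values the same Feature Rank (1, 2, 3, 3, 3).
features["feature_rank"] = (features["su"]
                            .rank(method="dense", ascending=False)
                            .astype(int))

n_per_input = 3
selected = features[features["feature_rank"] <= n_per_input]
print(selected)  # all five rows survive, since ranks 3, 3, 3 tie at the cutoff
```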
Feature Transformations
Listed below is the set of available feature transformations, grouped by transformation policy. This superset of feature transformations is the source for the naming of all features. For additional information on these transformations, follow this link: https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_details23.htm&docsetVersion=8.5&locale=en
Cardinality (Nominal to Interval transformations)
hc_tar_mean – Mean target encoding
hc_tar_min – Minimum target encoding
hc_tar_max – Maximum target encoding
hc_tar_frq_rat – Frequency ratio target encoding
hc_tar_woe – Weight of evidence target encoding
hc_tar_evt_prob – Event probability target encoding
hc_lbl_cnt – Level count rank
hc_cnt – Level count
hc_cnt_log – Level count followed by Log transformation
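As an illustration, here is a minimal Python sketch of two of these cardinality transformations for a binary 0/1 target; how the node smooths estimates or handles unseen levels is not covered here, and the output column names are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({"city":   ["A", "A", "B", "C", "C", "C"],
                   "target": [1,   0,   1,   0,   1,   1]})

# hc_tar_mean: replace each level with the mean of the target for that level.
df["hc_tar_mean_city"] = df.groupby("city")["target"].transform("mean")

# hc_cnt: replace each level with its frequency count.
df["hc_cnt_city"] = df.groupby("city")["city"].transform("count")
```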
Entropy, Qualitative variation
grp_rare1 – Mode imputation and group rare levels
grp_rare2 – Missing level and group rare levels
lchehi_lab – Label encoding
lcnhenhi_grp_rare – Group rare levels
lcnhenhi_rtree5 – Five-bin regression tree binning
lcnhenhi_rtree10 – Ten-bin regression tree binning
lcnhenhi_dtree5 – Five-bin decision tree binning
lcnhenhi_dtree10 – Ten-bin decision tree binning
lcnhenhi_woe5 – Five-bin Weight of Evidence binning
lcnhenhi_woe10 – Ten-bin Weight of Evidence binning
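A minimal sketch of rare-level grouping, assuming a 5% frequency cutoff and a "_RARE_" label for the grouped level (the node's actual cutoff and grouped-level name are internal):

```python
import pandas as pd

def group_rare(s: pd.Series, min_frac: float = 0.05) -> pd.Series:
    """Merge levels whose relative frequency falls below min_frac."""
    freq = s.value_counts(normalize=True)
    rare = freq[freq < min_frac].index
    return s.where(~s.isin(rare), "_RARE_")

s = pd.Series(["A"] * 60 + ["B"] * 38 + ["C", "D"])
print(group_rare(s).value_counts())  # C and D collapse into _RARE_
```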
Kurtosis
hk_yj – Yeo-Johnson power transformations with parameters -2, -1, 0, 1, 2
hk_dtree_disct5 – Five-bin decision tree binning
hk_dtree_disct10 – Ten-bin decision tree binning
hk_rtree_disct5 – Five-bin regression tree binning
hk_rtree_disct10 – Ten-bin regression tree binning
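For hk_yj, a sketch using SciPy's Yeo-Johnson implementation at the node's fixed parameters; the candidate-feature names here are hypothetical:

```python
import numpy as np
from scipy.stats import yeojohnson

x = np.array([-3.0, -1.0, 0.0, 2.0, 10.0])  # Yeo-Johnson handles negatives

# One candidate feature per fixed lambda (names are illustrative only).
candidates = {f"hk_yj{lam}": yeojohnson(x, lmbda=lam)
              for lam in (-2, -1, 0, 1, 2)}
```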
Missingness
cpy_int_med_imp – Median imputation
cpy_nom_mode_imp_lab – Mode imputation and Label encoding
cpy_nom_miss_lev_lab – Missing level and Label encoding
miss_ind – Missing indicator
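A brief sketch of the two interval-variable cases:

```python
import numpy as np
import pandas as pd

x = pd.Series([4.0, np.nan, 7.0, np.nan, 5.0])

cpy_int_med_imp = x.fillna(x.median())   # median imputation -> 4, 5, 7, 5, 5
miss_ind = x.isna().astype(int)          # missing indicator  -> 0, 1, 0, 1, 0
```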
Outliers
ho_winsor - Winsorization
ho_quan_disct5 – Five-bin quantile binning
ho_quan_disct10 – Ten-bin quantile binning
ho_dtree_disct5 – Five-bin decision tree binning
ho_dtree_disct10 – Ten-bin decision tree binning
ho_rtree_disct5 – Five-bin regression tree binning
ho_rtree_disct10 – Ten-bin regression tree binning
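A sketch of ho_winsor and ho_quan_disct5; the 5th/95th percentile winsorization limits are an assumption, since the node's cutoffs are internal:

```python
import numpy as np
import pandas as pd

x = pd.Series(np.random.default_rng(1).lognormal(size=1000))  # heavy right tail

lo, hi = x.quantile([0.05, 0.95])
ho_winsor = x.clip(lower=lo, upper=hi)          # cap extreme values

ho_quan_disct5 = pd.qcut(x, q=5, labels=False)  # five equal-frequency bins
```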
Skewness
hs_bc – Box-Cox power transformations with parameters -2, -1, 0, 1, 2
hs_dtree_disct5 – Five-bin decision tree binning
hs_dtree_disct10 – Ten-bin decision tree binning
hs_rtree_disct5 – Five-bin regression tree binning
hs_rtree_disct10 – Ten-bin regression tree binning
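For hs_bc, a sketch at the node's fixed parameters; note that Box-Cox requires strictly positive values (whether and how the node shifts nonpositive data first is not shown here):

```python
import numpy as np
from scipy.stats import boxcox

x = np.array([0.5, 1.0, 2.0, 8.0, 40.0])  # strictly positive

# One candidate per fixed lambda; lambda = 0 reduces to the log transform.
candidates = {lam: boxcox(x, lmbda=lam) for lam in (-2, -1, 0, 1, 2)}
```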
Kurtosis, Outliers, Skewness (Low or Medium-rated values)
nhoks_nloks_pow, nhoks_nloks_log – Tukey's ladder of power transformations with parameters -2, -1, -0.5, 0, 0.5, 1, 2
nhoks_nloks_dtree5 – Five-bin decision tree binning
nhoks_nloks_dtree10 – Ten-bin decision tree binning
nhoks_nloks_rtree5 – Five-bin regression tree binning
nhoks_nloks_rtree10 – Ten-bin regression tree binning
Kurtosis, Outliers, Skewness (Low-rated values)
all_l_oks_dtree5 – Five-bin decision tree binning
all_l_oks_dtree10 – Ten-bin decision tree binning
all_l_oks_rtree5 – Five-bin regression tree binning
all_l_oks_rtree10 – Ten-bin regression tree binning
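Supervised tree binning (the dtree/rtree transformations that recur throughout the groups above) can be sketched as fitting a shallow tree on the single input and using its leaf assignments as bins. This is one plausible reading of the technique, not SAS's documented implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))                                   # single input
y = (x[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int) # nominal target

# Five-bin decision tree binning: at most five leaves = at most five bins.
tree = DecisionTreeClassifier(max_leaf_nodes=5).fit(x, y)
bins = tree.apply(x)  # leaf index for each row is its bin label

# For an interval target, DecisionTreeRegressor gives the rtree variants.
```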
Feature Machine Results
After running Feature Machine, open the results from the node's pop-up menu. When Feature Selection is enabled, the Selected Features report is displayed. It lists the selected features that are output by the node, sorted by Input Variable, Feature Rank, and Feature. Downstream nodes receive only these features. The Description column describes each feature: the input variable, followed by a colon and the data quality issue, followed by a hyphen and the transformation method. Consider the example description "AGE: Not high (outlier, kurtosis, skewness) - power(2) + impute(median)". Expanded, this means: the feature is for input variable AGE; it addresses the case where one or more of outlier percentage, kurtosis, and skewness has a medium value but none is high; and it is transformed by squaring the variable (power of 2) and imputing missing values with the median.
When Feature Selection is disabled, the Generated Features report is displayed instead. This contains the list of all generated features, which are output by the node for input into downstream nodes. These are sorted by Input Variable and Feature.
The Output report is always displayed. It contains the Generated Features print output, a listing of all features generated by Feature Machine, even when Feature Selection is enabled. When Feature Selection is disabled, it contains the same information as the Generated Features report.
Summary
In this article, I have given an overview of the new Feature Machine node in Model Studio on SAS Visual Data Mining and Machine Learning 8.5, explaining its functionality and how it works. Here are the main points:
With this node, you automatically generate transformed features that address different data issues that negatively impact predictive modeling.
The number and type of these features are based upon the transformation policies that you select.
Variable screening excludes from feature generation any variables that exceed certain data quality thresholds.
With the Feature Selection option, you control how many features are selected per input variable. Selected features are the top N ranked features, which are exported to downstream nodes.