<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>WayneThompson Tracker</title>
    <link>https://communities.sas.com/kntur85557/tracker</link>
    <description>WayneThompson Tracker</description>
    <pubDate>Sun, 10 May 2026 14:29:24 GMT</pubDate>
    <dc:date>2026-05-10T14:29:24Z</dc:date>
    <item>
      <title>Re: Boosting, bagging and Random Forest</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Boosting-bagging-and-Random-Forest/m-p/355625#M5275</link>
      <description>&lt;P&gt;Very thorough reply by Neville, who implemented all of these methods in SAS. A couple of supporting comments:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;1. Outliers: The base model used in RF is a large decision tree. Decision trees are robust to outliers because they isolate them in small regions of the feature space. Since the prediction for each leaf is the average (for regression) or the majority class (for classification), outliers that are isolated in separate leaves won't influence the rest of the predictions (for regression, for instance, they would not affect the means of the other leaves).&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;2. Validation data: Yes, please use it in the common case of rare events, where the OOB estimate might not be sufficient.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;3. Transforms: consider transforming a continuous Y, as with many algorithms.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 03 May 2017 14:37:51 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Boosting-bagging-and-Random-Forest/m-p/355625#M5275</guid>
      <dc:creator>WayneThompson</dc:creator>
      <dc:date>2017-05-03T14:37:51Z</dc:date>
    </item>
    <item>
      <title>Re: What is SAS Visual Statistics?</title>
      <link>https://communities.sas.com/t5/SAS-Communities-Library/What-is-SAS-Visual-Statistics/tac-p/223461#M630</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Visual Statistics (VS) is a web-client-based product that supports interactive, ad hoc data analysis, whereas Enterprise Miner (EM) is a rich client with more of a batch-type interface for developing repeatable analyses through the process flow diagram. VS is targeted both to novice analysts with a fundamental understanding of regression modeling and, to a lesser extent, to advanced users. EM tends to be targeted more towards advanced data miners and data scientists, with much more algorithmic depth. One common use case I am seeing is using VS prior to modeling for feature reduction, through the interactivity the product provides using all of the data, and then fine-tuning the model in EM. VS generates SAS score code, so you can compare models in EM using the Model Import node. One use case I am not seeing enough is using VS to do post hoc validation of models – say, evaluating score band cutoffs interactively and evaluating high-leverage and influence points, even with really big data. You may also want to use VS to derive segments interactively and then use the Group Processing feature of EM to do stratified modeling. It is a great question – I hope this helps some. I was the EM product manager for over a decade and now am working on VS. I still love both products and use them all the time. &lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Mon, 29 Jun 2015 18:53:03 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Communities-Library/What-is-SAS-Visual-Statistics/tac-p/223461#M630</guid>
      <dc:creator>WayneThompson</dc:creator>
      <dc:date>2015-06-29T18:53:03Z</dc:date>
    </item>
    <item>
      <title>Re: Difference between boosting through start groups and gradient boosting</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Difference-between-boosting-through-start-groups-and-gradient/m-p/189933#M2363</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;&lt;EM&gt;Bagging&lt;/EM&gt; (Breiman 1996) is a common ensemble algorithm, in which you do the following:&lt;/P&gt;&lt;OL style="list-style-type: decimal;"&gt;&lt;LI&gt;Develop separate models on &lt;EM&gt;k&lt;/EM&gt; random samples of the data of about the same size.&lt;/LI&gt;&lt;LI&gt;Fit a classification or regression tree to each sample. I tend to bag only trees, but the Start Groups and End Groups nodes allow other algorithms.&lt;/LI&gt;&lt;LI&gt;Average or vote to derive the final predictions or classifications.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&lt;EM&gt;Boosting&lt;/EM&gt; (Freund and Schapire 1996), also supported through Start Groups &amp;gt; Tree &amp;gt; End Groups, goes one step further and weights observations that are misclassified by the previous models more heavily for inclusion in subsequent samples. The successive samples are adjusted to accommodate previously computed inaccuracies. &lt;EM&gt;Gradient boosting&lt;/EM&gt; (Friedman 2001) resamples the training data several times to generate results that form a weighted average of the resampled data set. Each tree in the series is fit to the residual of the prediction from the earlier trees in the series. The residual is defined in terms of the derivative of a loss function. For squared error loss and an interval target, the residual is simply the target value minus the predicted value. Because each successive sample is weighted according to the classification accuracy of previous models, this approach is sometimes called &lt;EM&gt;stochastic gradient boosting&lt;/EM&gt;. 
&lt;/P&gt;&lt;P&gt;Random forests is my favorite data mining algorithm, especially when I have little subject knowledge of the application. You grow many large decision trees at random and vote over all trees in the forest. The algorithm works as follows:&lt;/P&gt;&lt;OL style="list-style-type: decimal;"&gt;&lt;LI&gt;You develop random samples of the data and grow &lt;EM&gt;k&lt;/EM&gt; decision trees. The size of &lt;EM&gt;k&lt;/EM&gt; is large, usually greater than or equal to 100. A typical sample size is about two-thirds of the training data.&lt;/LI&gt;&lt;LI&gt;At each split point for each tree, you evaluate a random subset of candidate inputs (predictors). You hold the size of the subset constant across all trees.&lt;/LI&gt;&lt;LI&gt;You grow each tree as large as possible without pruning.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;In a random forest, you are perturbing not only the data but also the variables that are used to construct each tree. The error rate is measured on the remaining holdout data not used for training. This remaining one-third of the data is called the out-of-bag sample. Variable importance can also be inferred based on how often an input was used in the construction of the trees.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 09 Jan 2014 16:17:49 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Difference-between-boosting-through-start-groups-and-gradient/m-p/189933#M2363</guid>
      <dc:creator>WayneThompson</dc:creator>
      <dc:date>2014-01-09T16:17:49Z</dc:date>
    </item>
    <item>
      <title>Re: Clustering binary data with Enterprise Miner</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Clustering-binary-data-with-Enterprise-Miner/m-p/115008#M963</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi Sonnfan,&lt;/P&gt;&lt;P&gt;It appears you want to cluster variables and not observations. In that case, you can use the Variable Clustering node, factor analysis (see PROC FACTOR), or principal components. If you want to cluster rows, note that f&lt;SPAN style="background: white;"&gt;or binary data the Euclidean distance measure used by K-Means is equivalent to counting the number of variables on which two cases disagree. However, you can try one of the following approaches:&lt;/SPAN&gt;&lt;/P&gt;&lt;OL style="list-style-type: decimal;"&gt;&lt;LI&gt;&lt;SPAN style="background: white;"&gt;Run PROC DISTANCE, selecting the distance type that you want, and apply clustering using that distance matrix.&lt;/SPAN&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&lt;SPAN style="background: white;"&gt;OR&lt;/SPAN&gt;&lt;/P&gt;&lt;OL style="list-style-type: decimal;"&gt;&lt;LI&gt;You can project the binary variables and do the clustering as follows:&lt;/LI&gt;&lt;LI&gt;run factor analysis or PCA on the binary variables;&lt;/LI&gt;&lt;LI&gt;save the factor or component scores as new variables;&lt;/LI&gt;&lt;LI&gt;cluster on the basis of those scores (in that case, the data will no longer be binary).&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Good luck&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Tue, 20 Aug 2013 20:10:48 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Clustering-binary-data-with-Enterprise-Miner/m-p/115008#M963</guid>
      <dc:creator>WayneThompson</dc:creator>
      <dc:date>2013-08-20T20:10:48Z</dc:date>
    </item>
    <item>
      <title>RPM vs. Enterprise Miner</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/RPM-vs-Enterprise-Miner/m-p/22728#M115</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Thanks DLing.&amp;nbsp; Here is an older SGF paper that discusses the RPM task within SAS EG and SAS AMO for Excel, and how it is integrated and in turn can be extended in EM.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="http://support.sas.com/resources/papers/proceedings10/113-2010.pdf"&gt;http://support.sas.com/resources/papers/proceedings10/113-2010.pdf&lt;/A&gt;&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Mon, 03 Oct 2011 21:36:54 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/RPM-vs-Enterprise-Miner/m-p/22728#M115</guid>
      <dc:creator>WayneThompson</dc:creator>
      <dc:date>2011-10-03T21:36:54Z</dc:date>
    </item>
    <item>
      <title>Decision Tree</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Decision-Tree/m-p/63473#M386</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Prediction=0 in the table refers to the prediction for the selected node. Assuming you are starting with the root node and you have more 0s than 1s, then the classification prediction for the root node is 0.&amp;nbsp; When you split on the root node and continue growing the tree interactively, hopefully you will end up with some nodes (leaves) with prediction = 1.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;In some cases, when you have a rare target event (1s) and little if any signal in the data, the null root node can end up being the final classification. In this case, try using the inverse priors option under "Decisions" for the Input Data Source node.&amp;nbsp; Or add some additional predictors, if available.&amp;nbsp; Anyway, I could still be off base with your question. Hope this helps. &lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 14 Sep 2011 14:45:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Decision-Tree/m-p/63473#M386</guid>
      <dc:creator>WayneThompson</dc:creator>
      <dc:date>2011-09-14T14:45:17Z</dc:date>
    </item>
    <item>
      <title>Decision Tree</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Decision-Tree/m-p/63470#M383</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi Nikhil,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;My fault, but I am probably not understanding the question well.&lt;/P&gt;&lt;P&gt;When you run similar startup code for the project:&lt;/P&gt;&lt;P&gt;%let EM_INTERACTIVE_TREE_MAXOBS=100000;&lt;/P&gt;&lt;P&gt;%let EM_INTERACTIVE_TREE_SAMPLEMETHOD=RANDOM;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;And you are modeling a binary response, do you have 1's and 0's distributed in the root node?&lt;/P&gt;&lt;P&gt;Is your target variable indeed binary and set to the binary variable role in the input node, or is it set as interval?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Tue, 13 Sep 2011 18:36:11 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Decision-Tree/m-p/63470#M383</guid>
      <dc:creator>WayneThompson</dc:creator>
      <dc:date>2011-09-13T18:36:11Z</dc:date>
    </item>
    <item>
      <title>Decision Tree</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Decision-Tree/m-p/63467#M380</link>
      <description>&lt;P&gt;Some users may wish to override default Enterprise Miner interactive decision tree sampling strategies. Enterprise Miner provides two macros that you can issue with your project startup code that will modify interactive decision tree input data sampling behaviors:&lt;/P&gt;
&lt;P&gt;&lt;A name="p19pw9kub55y3bn19wc13g2txyr4" target="_blank"&gt;&lt;/A&gt;&lt;/P&gt;
&lt;PRE&gt;%let EM_INTERACTIVE_TREE_MAXOBS= &amp;lt;max-number-of-observations-in-sample&amp;gt;;&lt;BR /&gt;%let EM_INTERACTIVE_TREE_SAMPLEMETHOD=&amp;lt;RANDOM | FIRSTN&amp;gt;;&lt;/PRE&gt;
&lt;P&gt;The first macro specifies the maximum number of observations that can exist in an Interactive Decision Tree node sample. You use this macro if you want to manually control the sample size. Otherwise, Enterprise Miner will use its own algorithms to perform sampling for your interactive decision tree.&lt;/P&gt;
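For instance (the values below are purely illustrative, not defaults), project startup code that caps the interactive tree sample at 50,000 observations and keeps random sampling would be:

```sas
%let EM_INTERACTIVE_TREE_MAXOBS=50000;        /* cap the node sample at 50,000 observations */
%let EM_INTERACTIVE_TREE_SAMPLEMETHOD=RANDOM; /* draw the sample at random */
```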
&lt;P&gt;The second macro specifies the sampling methodology that will be used to create an Interactive Decision Tree node sample. You can use this macro if you want to manually control the methodology Enterprise Miner uses to create interactive decision tree samples. By default, Enterprise Miner uses random sampling for interactive decision trees. You can use the macro to choose between RANDOM and FIRSTN sample creation. You use the EM_INTERACTIVE_TREE_MAXOBS macro to specify the number of observations for both RANDOM and FIRSTN sampling strategies.&lt;/P&gt;</description>
      <pubDate>Fri, 07 Jul 2017 17:55:13 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Decision-Tree/m-p/63467#M380</guid>
      <dc:creator>WayneThompson</dc:creator>
      <dc:date>2017-07-07T17:55:13Z</dc:date>
    </item>
    <item>
      <title>Survival node in Enterprise Miner</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Survival-node-in-Enterprise-Miner/m-p/62402#M371</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;By default, the last 1/4 of the interval is used for validation.&amp;nbsp; The help below describes how to specify a user-defined interval for validation. &lt;/P&gt;&lt;H3&gt;Survival Node Train Properties: Survival Validation &lt;/H3&gt;&lt;P&gt;&lt;A name="p0fxtt3hqptm6jn1df7pzih9u2sx"&gt;&lt;/A&gt;The Survival Validation section of the Survival node Properties panel provides the following settings for configuring the validation of your Survival model: &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Survival Validation Method&lt;/STRONG&gt; – Use the &lt;STRONG&gt;Survival Validation Method&lt;/STRONG&gt; property to specify how the validation holdout sample is generated. By default, the training, validation and test data sets that are passed to the Survival node are split into two time intervals. The first 75% of the interval is set aside as training data, for model creation. The remaining portion of the interval is set aside for survival model validation, or scoring. When the default survival validation method is selected, the entire scoring interval is used for validation calculations. However, you can choose to create a user-specified time interval if you desire. When you select user-specified time intervals, you can specify a hypothetical scoring date as well as the interval length. &lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Scoring Time&lt;/STRONG&gt; – When you select User Specified as the value for your &lt;STRONG&gt;Survival Validation Method&lt;/STRONG&gt; setting, you can use the &lt;STRONG&gt;Scoring Time&lt;/STRONG&gt; property to specify a hypothetical scoring date. The hypothetical scoring date divides the data into the two subsets of data. Select the button to the right of the &lt;STRONG&gt;Scoring Time&lt;/STRONG&gt; property to open a table that you use to specify the date value for scoring. 
The &lt;STRONG&gt;Scoring Time&lt;/STRONG&gt; date value that you specify must fall between your defined start date and censor date. &lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Interval Length&lt;/STRONG&gt; – When you select &lt;EM&gt;User Specified&lt;/EM&gt; as the value for your &lt;STRONG&gt;Survival Validation Method&lt;/STRONG&gt;, the &lt;STRONG&gt;Interval Length&lt;/STRONG&gt; property specifies the length of the time interval used to perform survival validation. The interval will begin with the hypothetical scoring date that you specified in the &lt;STRONG&gt;Scoring Time &lt;/STRONG&gt;property setting, and then extend the specified number of time intervals forward. If the &lt;STRONG&gt;Time Unit&lt;/STRONG&gt; property is set to &lt;EM&gt;Day&lt;/EM&gt; and the &lt;STRONG&gt;Interval Length&lt;/STRONG&gt; property is set to &lt;EM&gt;15&lt;/EM&gt;, your scoring interval is the 15 days that follow the hypothetical scoring date. &lt;/LI&gt;&lt;/UL&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 09 Sep 2011 18:50:26 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Survival-node-in-Enterprise-Miner/m-p/62402#M371</guid>
      <dc:creator>WayneThompson</dc:creator>
      <dc:date>2011-09-09T18:50:26Z</dc:date>
    </item>
    <item>
      <title>Enterprise Miner 7.1 SAS9.3 Released</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Enterprise-Miner-7-1-SAS9-3-Released/m-p/50584#M291</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Some of you may already know, but EM 7.1 on SAS 9.3 was released July 12th.&amp;nbsp; The pubs team prepared a nice summary here on support.sas.com:&lt;/P&gt;&lt;P&gt;&lt;A href="http://support.sas.com/documentation/cdl/en/whatsnew/64209/HTML/default/viewer.htm#emdocwhatsnew71.htm"&gt;http://support.sas.com/documentation/cdl/en/whatsnew/64209/HTML/default/viewer.htm#emdocwhatsnew71.htm&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Good luck with your data mining projects. &lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 18 Aug 2011 14:33:08 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Enterprise-Miner-7-1-SAS9-3-Released/m-p/50584#M291</guid>
      <dc:creator>WayneThompson</dc:creator>
      <dc:date>2011-08-18T14:33:08Z</dc:date>
    </item>
    <item>
      <title>What happens when WARN is M or MU</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/What-happens-when-WARN-is-M-or-MU/m-p/46156#M272</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;This means missing inputs and unknown inputs.&amp;nbsp; Please check the input variables in your scoring table. &lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 18 Aug 2011 14:28:03 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/What-happens-when-WARN-is-M-or-MU/m-p/46156#M272</guid>
      <dc:creator>WayneThompson</dc:creator>
      <dc:date>2011-08-18T14:28:03Z</dc:date>
    </item>
    <item>
      <title>SAS Rapid Predictive Modeler Documentation</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/SAS-Rapid-Predictive-Modeler-Documentation/m-p/41301#M234</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;When you open the Rapid Predictive Modeler task, select the help button on the task.&amp;nbsp; Totally agree that we should add the help into the main EG online help.&amp;nbsp; Here is an SGF paper: &lt;A href="http://support.sas.com/resources/papers/proceedings10/113-2010.pdf"&gt;http://support.sas.com/resources/papers/proceedings10/113-2010.pdf&lt;/A&gt;&lt;/P&gt;&lt;P&gt;The best way to learn how the models are developed is to run them from EG (AMO) and then review them in EM.&amp;nbsp; Good luck. &lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 18 Aug 2011 14:25:04 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/SAS-Rapid-Predictive-Modeler-Documentation/m-p/41301#M234</guid>
      <dc:creator>WayneThompson</dc:creator>
      <dc:date>2011-08-18T14:25:04Z</dc:date>
    </item>
    <item>
      <title>Re: renaming variables in E Miner</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/renaming-variables-in-E-Miner/m-p/14819#M62</link>
      <description>Please also see this tech support note &lt;A href="http://support.sas.com/kb/38/578.html" target="_blank"&gt;http://support.sas.com/kb/38/578.html&lt;/A&gt;&lt;BR /&gt;
&lt;BR /&gt;
As you indicate, I would use the SAS Code node to rename the variables, using the macro variable names for the data sources you want to import and export. &lt;BR /&gt;
&lt;BR /&gt;
SAS has a 32-character limit.&lt;BR /&gt;
Teradata has a 30-character limit.&lt;BR /&gt;
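As a sketch of the SAS Code node approach (the data set and variable names below are hypothetical, chosen only to illustrate the length limits), you might write something like:

```sas
/* &amp;em_import_data and &amp;em_export_train are the SAS Code node's
   import/export macro variables; the rename itself is illustrative */
data &amp;em_export_train;
   set &amp;em_import_data;
   /* 32 characters: legal in SAS but over Teradata's 30-character limit */
   rename customer_lifetime_value_usd_2011 = clv_2011;
run;
```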
Long variable names are something we will look into.</description>
      <pubDate>Fri, 17 Jun 2011 16:22:00 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/renaming-variables-in-E-Miner/m-p/14819#M62</guid>
      <dc:creator>WayneThompson</dc:creator>
      <dc:date>2011-06-17T16:22:00Z</dc:date>
    </item>
    <item>
      <title>Re: Exterprise Miner Error</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Exterprise-Miner-Error/m-p/59113#M344</link>
      <description>Thanks for using SAS EM.  Please see this usage note about the maximum number of levels exceeding 512:  &lt;A href="http://support.sas.com/kb/20/054.html" target="_blank"&gt;http://support.sas.com/kb/20/054.html&lt;/A&gt;&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
Also, as suggested, I would recommend being careful about using a variable like ZIP code, or other nominal variables that have a really large number of discrete values, as predictors (inputs).  You can reject such a variable or use a binning technique.</description>
      <pubDate>Fri, 29 Apr 2011 14:38:15 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Exterprise-Miner-Error/m-p/59113#M344</guid>
      <dc:creator>WayneThompson</dc:creator>
      <dc:date>2011-04-29T14:38:15Z</dc:date>
    </item>
    <item>
      <title>Rexler Data Mining Survey</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Rexler-Data-Mining-Survey/m-p/53957#M301</link>
      <description>Karl Rexer has asked me to let folks from the SAS DM community know about his Data Miner Survey, which covers trends in data mining, applications, and vendor tool usage. &lt;BR /&gt;
 &lt;BR /&gt;
If you are interested, the survey should take approximately 20 minutes to complete.   Survey link:&lt;BR /&gt;
 &lt;A href="http://www.RexerAnalytics.com/Data-Miner-Survey-2011-Intro2.html" target="_blank"&gt;www.RexerAnalytics.com/Data-Miner-Survey-2011-Intro2.html&lt;/A&gt;&lt;BR /&gt;
Access Code:  KS37P&lt;BR /&gt;
&lt;BR /&gt;
thanks much</description>
      <pubDate>Wed, 20 Apr 2011 19:54:06 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Rexler-Data-Mining-Survey/m-p/53957#M301</guid>
      <dc:creator>WayneThompson</dc:creator>
      <dc:date>2011-04-20T19:54:06Z</dc:date>
    </item>
    <item>
      <title>Re: Neural Network Graphics</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Neural-Network-Graphics/m-p/44148#M256</link>
      <description>An interactive neural network builder to define and visualize the network is a feature we would like to add to EM.  Thanks</description>
      <pubDate>Fri, 03 Dec 2010 22:07:59 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Neural-Network-Graphics/m-p/44148#M256</guid>
      <dc:creator>WayneThompson</dc:creator>
      <dc:date>2010-12-03T22:07:59Z</dc:date>
    </item>
    <item>
      <title>Re: Kernals in Support Vector Machines</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Kernals-in-Support-Vector-Machines/m-p/31963#M170</link>
      <description>An experimental linear and nonlinear kernel SVM node is planned for EM 7.1 on SAS 9.3, mid next year. Thanks for your use of the software. There are quite a few other classification and prediction tools you will want to try.  Some users have reported good Gradient Boosting results. Will keep you and the forum up to date.</description>
      <pubDate>Wed, 10 Nov 2010 22:31:04 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Kernals-in-Support-Vector-Machines/m-p/31963#M170</guid>
      <dc:creator>WayneThompson</dc:creator>
      <dc:date>2010-11-10T22:31:04Z</dc:date>
    </item>
    <item>
      <title>Re: Data Mining without using EM</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Data-Mining-without-using-EM/m-p/70581#M438</link>
      <description>Hello Aha,&lt;BR /&gt;
&lt;BR /&gt;
Please contact me at wayne.thompson@sas.com to discuss EM pricing relative to your budget and application needs. I am here to listen to your concerns at a minimum.  &lt;BR /&gt;
&lt;BR /&gt;
STAT provides a broad range of predictive/classification methods, along with clustering and many other techniques.  EM provides some algorithms, like decision trees, gradient boosting, neural networks, memory-based reasoning, etc., not found in STAT, along with the GUI for developing repeatable, self-documenting analyses very quickly, with an emphasis on complete score code generation for deployment.  Others from the list may have more practical reasons -- I work at SAS.  Anyway, we hope you are using SAS, and STAT is a great product --- EM requires it and includes the SAS Code node for embedding STAT procedures into your EM analysis.  &lt;BR /&gt;
&lt;BR /&gt;
Wayne</description>
      <pubDate>Tue, 07 Sep 2010 18:38:07 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Data-Mining-without-using-EM/m-p/70581#M438</guid>
      <dc:creator>WayneThompson</dc:creator>
      <dc:date>2010-09-07T18:38:07Z</dc:date>
    </item>
    <item>
      <title>Enterprise Miner 6.2 SAS 9.2 released August 17th</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Enterprise-Miner-6-2-SAS-9-2-released-August-17th/m-p/66572#M398</link>
      <description>SAS Enterprise Miner 6.2 was released last week, delivering major new features targeted at specific business applications and complex information technology deployments. Rapid Predictive Modeler is a new feature for general business users who need to develop reliable models for predicting customer response and retention. The Interactive Decision Tree is enhanced to display more information about the nodes and leaves and to show more plots based on validation data. The credit scoring functions have been improved to provide more customization of bins and more control over scorecard points (Credit Scoring for EM 6.2). For more details, please see the What's New document at &lt;BR /&gt;
&lt;BR /&gt;
&lt;A href="http://support.sas.com/documentation/cdl/en/whatsnew/62580/HTML/default/viewer.htm#/documentation/cdl/en/whatsnew/62580/HTML/default/emwhatsnew62.htm" target="_blank"&gt;http://support.sas.com/documentation/cdl/en/whatsnew/62580/HTML/default/viewer.htm#/documentation/cdl/en/whatsnew/62580/HTML/default/emwhatsnew62.htm&lt;/A&gt;&lt;BR /&gt;
&lt;BR /&gt;
We hope you find these features useful and fun to use.  Wayne Thompson SAS Product Manager Analytics.</description>
      <pubDate>Mon, 23 Aug 2010 20:44:13 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Enterprise-Miner-6-2-SAS-9-2-released-August-17th/m-p/66572#M398</guid>
      <dc:creator>WayneThompson</dc:creator>
      <dc:date>2010-08-23T20:44:13Z</dc:date>
    </item>
    <item>
      <title>Re: Score or points allocations after building logistic regression</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Score-or-points-allocations-after-building-logistic-regression/m-p/15014#M67</link>
      <description>&lt;P&gt;Use the Credit Scoring for EM product which provides variable selection, interactive or batch classing (binning) including weights of evidence calculations, scorecard construction with scaling and diagnostics, and reject inference.&lt;/P&gt;</description>
      <pubDate>Fri, 07 Jul 2017 19:12:24 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Score-or-points-allocations-after-building-logistic-regression/m-p/15014#M67</guid>
      <dc:creator>WayneThompson</dc:creator>
      <dc:date>2017-07-07T19:12:24Z</dc:date>
    </item>
  </channel>
</rss>

