11-19-2021
DougWielenga
SAS Employee
Member since
04-10-2012
- 279 Posts
- 23 Likes Given
- 108 Solutions
- 69 Likes Received
Activity Feed for DougWielenga
- Got a Like for Re: How to get ROC curve in Model Comparison. 06-06-2024 12:54 PM
- Got a Like for Re: %dmnormip(_Format) sas. 02-21-2022 01:30 PM
- Got a Like for Re: PROC HPSPLIT Decision Tree. 12-01-2021 12:15 AM
- Posted Re: SAS EM: does clustering node have elbow method to select the optimal # of clusters? on SAS Data Science. 11-15-2021 02:10 PM
- Liked Re: result of segment profile chart doesn't show up all. Is there something wrong? for MeraHumaira. 11-15-2021 10:11 AM
- Liked Re: SAS to Python for Reeza. 11-15-2021 10:05 AM
- Got a Like for Re: How to get ROC curve in Model Comparison. 12-13-2020 08:21 PM
- Posted Re: Quality Check of Training and Validation Set on SAS Data Science. 09-14-2020 09:28 AM
- Got a Like for Re: Variable Importance and Variable Worth in Clustering. 07-28-2020 08:52 AM
- Got a Like for Re: How to get ROC curve in Model Comparison. 05-20-2020 03:04 PM
- Got a Like for Re: Need links to instructions on reading these SAS charts: Mean Predicted, Model Score and Depth.. 04-14-2020 12:33 AM
- Got a Like for Re: Help with over/under sampling of the rare event in predictive modelling. 11-15-2019 04:47 AM
- Got a Like for Re: Next-Best Offer/Recommendation under the Association Node while using Sequence Analysis. 11-13-2019 12:06 AM
- Got a Like for Re: EM_DGRAPH problem. 08-21-2019 12:54 PM
- Got a Like for Re: Nonexistent DiagramGetReportMetadata.xml. 08-21-2019 12:53 PM
- Got a Like for Re: Gains vs % response charts. 04-22-2019 11:42 AM
- Got a Like for Re: How to check overfitting. 04-17-2019 02:27 PM
- Got a Like for Re: SAS Enterprise Miner: Imbalance data. 01-20-2019 12:55 AM
- Posted Re: Putting a Support Threshold in Apriori Algorithm - SAS E-miner on SAS Data Science. 01-16-2019 01:50 PM
- Posted Re: Lift chart value in business case after Random Forest model on SAS Data Science. 01-16-2019 01:37 PM
11-15-2021
02:10 PM
@ycenycute -- The thing to understand about any such cluster selection approach ("elbow", CCC, ABC, etc.) is that there is no "right" answer. All of these approaches effectively attempt to identify the point where creating a larger number of clusters provides a diminishing return in "value". Since there is no "correct" number of clusters, it is common to generate several cluster solutions and evaluate the usefulness of each one in light of your business/research questions of interest. Understanding the nuances of how each approach identifies good candidate solutions would require an understanding of the mathematics behind any statistic used both in the clustering and in the assessment of the clusters. For example, a distance metric based on squared deviations might give a very different clustering than one based on absolute deviations, in which large deviations are not penalized as heavily. Even if you have a good understanding of those metrics, you must still consider any candidate solution in light of the original research/business question.
It would be entirely expected for two people with the same data set but different business needs to settle on completely different cluster solutions as ideal. For example, someone wanting to identify non-trivial group sizes for marketing purposes might tend toward a smaller number of clusters, and might even ignore outliers in order to better separate the people in the middle of the pack and keep each market segment nontrivial. Someone looking at the same data to understand new market opportunities, however, might be willing to create a larger number of clusters so they could examine the small clusters at the fringes, which, though small, are emerging over time and can point to new areas of opportunity. In either case, there might not be a single metric that chooses the ultimate cluster solution for the business problem. The metrics get us closer to identifying good candidates, but it is always wise to look at a range of nearby solutions in order to identify the best cluster solution for a particular business problem.
Another thing to consider is that a cluster solution depends on the variables that are included, so adding or removing a variable changes the potential solution. If you put a large number of variables into a single cluster solution, chances are only a small subset of those variables is really driving the clustering. In many cases, it makes more sense to create several cluster solutions for different subsets of variables that are reasonably considered together. For example, suppose you had information on recency of purchases, frequency of purchases, and amount of purchases over various time windows (e.g. over the last 30, 60, 90, 180, 360 days). Rather than cramming all of the variables into one cluster solution, it might be far more effective to cluster each of the subgroups of variables separately. You could then build a profile for each potential buyer based on the cluster prediction from each of the three cluster solutions (Recency, Frequency, Monetary), which would paint a clearer picture of your candidates. Again, since there is no "correct" cluster solution, you can build any such candidate cluster solutions based on your particular business need. The choice among them in the end is more likely to be driven by the business/research question than by any particular metric.
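To make that idea concrete, here is a minimal base SAS sketch of clustering each RFM block separately rather than cramming every variable into one solution. The data set and variable names (CUSTOMERS, CUST_ID, RECENCY_*, FREQ_*, AMT_*) are hypothetical, and four clusters per block is an arbitrary choice -- in practice you would compare several candidate solutions as described above.

/* one k-means solution per block of related variables (hypothetical names)          */
proc fastclus data=customers maxclusters=4 noprint
              out=rec_clus(keep=cust_id cluster rename=(cluster=rec_segment));
   var recency_30 recency_90 recency_360;
run;

proc fastclus data=customers maxclusters=4 noprint
              out=freq_clus(keep=cust_id cluster rename=(cluster=freq_segment));
   var freq_30 freq_90 freq_360;
run;

proc fastclus data=customers maxclusters=4 noprint
              out=mon_clus(keep=cust_id cluster rename=(cluster=mon_segment));
   var amt_30 amt_90 amt_360;
run;

/* combine the three segment assignments into one RFM profile per customer           */
/* (assumes CUSTOMERS, and therefore each OUT= data set, is sorted by CUST_ID)       */
data rfm_profile;
   merge rec_clus freq_clus mon_clus;
   by cust_id;
run;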
I hope this helps! Cordially, Doug
09-14-2020
09:28 AM
There are a few issues with your hypothetical situation:
* you have a single categorical input with four levels and a binary target, so you can estimate only four distinct predicted values, one for each input level -- it is not clear that using logistic regression improves this fit without any interval inputs to consider
* you have a relatively small number of observations overall and there are only five observations where X="D" which makes splitting into training and validation a questionable approach
* given that there are only 8 possible bins for observations to be cast into (two possible outcomes and four possible inputs), the partitioning split seems as good as it could be, but this is likely a better candidate for cross-validation on the training data set were it not such a simple problem.
Data mining problems typically involve large numbers of observations, for which it makes sense to partition into training and validation (and possibly test) data sets. The differences in the percentages arise because, with such a small number of observations, a single observation accounts for 0.8% in training and 1.8% in validation. The differences in percentages are therefore not surprising, but splitting in the first place is likely not warranted.
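To see the first point concretely, here is a small base SAS sketch. The data set SMALL and variables X and Y are hypothetical placeholders (X taking values A-D, Y coded 0/1): a saturated logistic model on a single four-level input simply reproduces the observed event rate within each level.

proc freq data=small;
   tables x*y / nocol nopercent;   /* row percentages = the four possible predicted values      */
run;

proc logistic data=small;
   class x / param=glm;
   model y(event='1') = x;         /* saturated model: fitted probabilities match the row       */
run;                               /* proportions shown by PROC FREQ above                      */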
I hope this helps!
Cordially, Doug
01-16-2019
01:50 PM
I need guidance on setting a threshold for the "support" in Apriori Algorithm using SAS E-miner.
The Association node in SAS Enterprise Miner allows you to specify the minimum confidence level and the minimum support, measured either by count or by percentage, using the following node properties (a small sketch relating the two support settings follows the list):
* Minimum Confidence Level - specifies the minimum confidence level that is required to generate a rule.
* Support Type - specify minimum support based on Count or Percent, and then set the appropriate parameter below:
-- Support Count - specifies the minimum transaction frequency to support an association.
-- Support Percentage - specifies the minimum transaction frequency, expressed as a percentage of total transactions, required to support an association.
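As a quick illustration of how the Count and Percent settings relate (the numbers below are hypothetical), the same threshold can be expressed either way:

data _null_;
   n_transactions = 10000;                                      /* hypothetical transaction count          */
   support_pct    = 2;                                          /* minimum support expressed as a percent  */
   support_count  = ceil(n_transactions * support_pct / 100);   /* equivalent minimum support count = 200  */
   put support_count=;
run;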
I hope this helps!
Doug
01-16-2019
01:37 PM
I computed an HP Forest (random forest) model using SAS Enterprise Miner to identify the propensity to buy a product. As a result, I have found that my model produces a lift chart value of about 9 (in the first decile) on the validation set. In your opinion and experience, isn't that value too high? Should I expect a smaller value?
The maximum possible lift is relative to the overall population rate of your event. The lift in the first decile will differ across data sets even if they have the same overall population rate, since the ability of the available data to predict an event varies from one data set to another. Lift is useful because it has no units -- it is just a relative increase/decrease in the occurrence of an event -- but this also means that you must consider both the lift and the actual predicted occurrence rate to properly assess how "big" it is. For a discussion of how the occurrence rate and lift are related, see the discussion at
https://communities.sas.com/t5/SAS-Data-Mining-and-Machine/Help-with-over-under-sampling-of-the-rare-event-in-predictive/m-p/388519#M5851
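As a quick numeric illustration (the rates below are hypothetical, not from your data), the base rate also caps how large the first-decile lift can possibly be:

data lift_example;
   base_rate   = 0.02;                      /* hypothetical overall event rate            */
   decile_rate = 0.18;                      /* hypothetical event rate in the top decile  */
   lift        = decile_rate / base_rate;   /* = 9                                        */
   /* if every event landed in the top 10%, that decile's event rate would be             */
   /* min(1, base_rate/0.10), so the largest lift the first decile could show is:         */
   max_lift_decile1 = min(1, base_rate/0.10) / base_rate;   /* = 10 here                  */
   put lift= max_lift_decile1=;
run;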
Hope this helps!
Doug
01-16-2019
12:15 PM
I have a response rate of 2%. I oversampled (50/50) and built a model (I took the original data set and oversampled using the Sample node in SAS EM). Now I have to score new observations, but before that I need to add a Decisions node and make changes to the decision weights (to adjust the probabilities since I oversampled, otherwise they are too high). I am getting stuck on what changes to make in the Decisions node. I don't have any profit/loss matrix.
I am following this -
https://communities.sas.com/t5/SAS-Communities-Library/Tip-How-to-model-a-rare-target-using-an-overs...
In my Decisions node, what should go in the lower right, upper right, lower left, and upper left corners of the Decision Weights tab? (I didn't understand how they got 1.0526 in that thread.)
In the thread they have 5% response rate and oversampled to 50-50.
There are several things to consider in this situation.
Oversampling to 50/50: This popular approach seems to originate from the fact that, for a fixed sample size, the greatest power for detecting a binary outcome occurs with a balanced sample. When you are sampling, however, you are no longer talking about a fixed sample size. In data mining, it is common for one event to be far more rare than the other. In this situation, oversampling to 50/50 (especially when your event rate is only 2%) risks having a non-representative sample of the non-events. Classic metrics of model performance are often going to look very different in highly unbalanced situations, as I discussed in
https://communities.sas.com/t5/SAS-Data-Mining-and-Machine/Help-with-over-under-sampling-of-the-rare-event-in-predictive/m-p/388519#M5851
Using Inverse Priors: In many situations, you can address both the issues inherent in heavy oversampling and the rareness of the event of interest by setting up decision weights as discussed in SAS Note 47965, available at
http://support.sas.com/kb/47/965.html
In situations where you don't have specific costs/profits (and even in situations where you do!), this is a reasonable approach to identifying useful models that might otherwise not be available without oversampling heavily.
Readjusting the probabilities for oversampling: If you set up prior probabilities using Decision Processing (click on the ... to the right of Decisions for your Input Data node) and click on the Default with Inverse Prior Weights button on the Decisions tab inside the Decisions Processing dialog, your model will reflect the original population even if you oversample.
An important question to ask: Do you really need the probabilities expressed in terms of the original population? Adjustments made using the Decisions node attempt to adjust the probabilities toward what a representative sample from the population might have produced. In reality, this changes the predicted probabilities but does not change the sort order of the observations. Whether the population is oversampled and then adjusted, or the raw data is used, the probabilities are still only approximations. In practice, the performance on holdout data is still likely optimistic for many reasons, including:
* the holdout data is often used to choose the final model
* the data is typically removed in time and other factors that influence the outcome might have changed
* the target is often a surrogate for the actual target of interest (e.g. modeling response to a past campaign in order to predict response to a future campaign is a surrogate-target scenario)
In practice, it might be enough to work with the distribution of the predicted outcomes from the oversampled data, since the ranking of observations is unchanged. Should you wish to adjust back to the original population, however, you can specify the priors in the Decision Processing dialog of the Decisions node in the same way you can in the Input Data node -- just be sure to do it in only one place. The dialog is the same in both places, but in the Decisions node you access it by clicking on the ... to the right of Custom Editor. In general, I prefer doing this in the Input Data node. Please note that changing the priors can also impact the associated decisions; in general, it makes sense to use decision weights based on the priors.
The weights you saw in the example came from a data set where the rare event had 5% and the common event had 95%. The inverse prior weights then became 20.00 (1 / 5%) and 1.0526 ( 1 / 95%).
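As a small worked sketch of that arithmetic, including what the analogous inverse prior weights would be for the 2% response rate in your question:

data inverse_prior_weights;
   /* example from the referenced thread: 5% events, 95% non-events */
   w_event_5pct     = 1 / 0.05;    /* = 20.00   */
   w_nonevent_95pct = 1 / 0.95;    /* = 1.0526  */
   /* analogous weights for a 2% response rate                      */
   w_event_2pct     = 1 / 0.02;    /* = 50.00   */
   w_nonevent_98pct = 1 / 0.98;    /* = 1.0204  */
   put w_event_5pct= w_nonevent_95pct= w_event_2pct= w_nonevent_98pct=;
run;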
Hope this helps!
Doug
01-16-2019
10:34 AM
1 Like
For discussions regarding SMOTE, please see
https://communities.sas.com/t5/SAS-Data-Mining-and-Machine/SAS-Enterprise-Miner-SMOTE-sampling-with-categorical-variables/m-p/394037/thread-id/5980/highlight/true#M6009
https://communities.sas.com/t5/SAS-Data-Mining-and-Machine/SMOTE-with-missing-values/m-p/426917/highlight/true#M6545
For a discussion of the issues with analyzing rare events, please see
https://communities.sas.com/t5/SAS-Data-Mining-and-Machine/Help-with-over-under-sampling-of-the-rare-event-in-predictive/m-p/388519#M5851
For an approach to modeling rare events, consider looking at SAS note 47965 available at
http://support.sas.com/kb/47/965.html
Hope this helps!
Cordially,
Doug
01-16-2019
10:20 AM
If you are looking to identify univariate outliers, you can look at the distribution of each variable in the Replacement node. This node allows you to visualize the values/levels of continuous/categorical data and to filter the values (if desired) based on the following criteria:
For continuous variables, you can use the Default Limits Method property to specify a default method for determining the range limits for interval variables, and the Cutoff Values property to modify the cutoff values for the various limit methods, using the respective options shown below from the Replacement node documentation:
Default Limits Method — Use the Default Limits Method property to specify the default method to determine the range limits for interval variables. Use any of the methods below.
Mean Absolute Deviation (MAD) — The Mean Absolute Deviation method eliminates values that are more than n deviations from the median. You specify the threshold value for the number of deviations, n, in the Cutoff for MAD property.
User-Specified Limits — The User-Specified Limits method specifies a filter for observations that is based on the interval values that are displayed in the Lower Limit and Upper Limit columns of your data table. You specify these limits in the Interactive Replacement Interval Filter window.
Metadata Limits — Metadata Limits are the lower and upper limit attributes that you can specify when you create a data source or when you are modifying the Variables table of an Input Data node on the diagram workspace.
Extreme Percentiles — The Extreme Percentiles method filters values that are in the top and bottom pth percentiles of an interval variable's distribution. You specify the upper and lower threshold value for p in the Cutoff Percentiles for Extreme Percentiles property.
Modal Center — The Modal Center method eliminates values that are more than n spacings from the modal center. You specify the threshold value for the number of spacings, n, in the Cutoff for Modal Center property.
Standard Deviations from the Mean — (default setting) The Standard Deviations from the Mean method filters values that are greater than or equal to n standard deviations from the mean. You must use the Cutoff for Standard Deviation property to specify the threshold value that you want to use for n.
None — Do not filter interval variables
Cutoff Values — Click the ellipses (...) button to the right of the Cutoff Values property to open the Cutoff Values window. You use the Cutoff Values window to modify the cutoff values for the various limit methods available in the Default Limits Method property.
MAD — When you specify Mean Absolute Deviation as your Default Limits Method, you must use the MAD property of the Replacement node to quantify n, the threshold value for the number of deviations from the median value. Specify the number of deviations from the median to be used as cutoff value. That is, values that are that many mean absolute deviations away from the median will be used as the limit values. When set to User-Specified the values specified using the Interval Editor are used. When set to Missing, blanks or missing values are used as the replacement values. Permissible values are real numbers greater than or equal to zero. The default value is 9.0.
Percentiles for Extreme Percentiles — When you specify Extreme Percentiles as your Default Limits Method, you must use the Percentiles for Extreme Percentiles property to specify p, the threshold value used to quantify the top and bottom pth percentiles. Permissible values are percentages greater than or equal to 0 and less than 50. (P specifies upper and lower thresholds, 50% + 50% = 100%.) The default value is 0.5, or 0.5%.
Modal Center — When you specify Modal Center as your Default Limits Method, you must use the Modal Center property to specify the threshold number of spacings, n. That is, values that are that many spacings away from the modal center will be used as the limit values. Permissible values are real numbers greater than or equal to zero. The default value is 9.0.
Standard Deviation — Use the Standard Deviation property to quantify n, the threshold for number of standard deviations from the mean. That is, values that are that many standard deviations away from the mean will be used as the limit values. Permissible values are real numbers greater than or equal to zero. The default value is 3.0.
You can click on the ... to the right of Replacement Editor under the Interval Variables (or Class Variables) section in order to interactively view the range of values and choose custom settings for each variable. However, manually inspecting individual variables this way can be extraordinarily time consuming in typical data mining scenarios.
If you are looking to identify multivariate outliers, you might consider building principal components with your interval inputs and then looking for outliers on the individual PCs that are generated. This might lead you to identify observations that are not necessarily unusual in any given dimension but which are when considering multiple dimensions.
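If you would rather compute the flags yourself in a SAS Code node, here is a rough sketch of both ideas -- a standard-deviation rule for univariate outliers and principal component scores for multivariate ones. The data set MYDATA and the variables INCOME, AGE, and TENURE are hypothetical placeholders.

/* univariate: flag values more than 3 standard deviations from the mean              */
proc means data=mydata noprint;
   var income;
   output out=stats mean=mu std=sigma;
run;

data flagged;
   if _n_ = 1 then set stats(keep=mu sigma);    /* make MU and SIGMA available on every row */
   set mydata;
   sd_outlier = (abs(income - mu) > 3*sigma);   /* mirrors the default 3-standard-deviation cutoff */
run;

/* multivariate: score standardized principal components and inspect their extremes   */
proc princomp data=mydata out=pcscores std;
   var income age tenure;
run;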
While there is motivation to consider excluding outliers in certain clustering situations which might otherwise be driven by extremely small outlying clusters, it is typically problematic to ignore data from a predictive modeling standpoint. Tree-based methods can minimize the effect of outliers since outliers do not have excessive weight as they do in many distance based optimization methods.
I hope this helps!
Cordially,
Doug
01-15-2019
04:50 PM
The Java error message you are encountering is a generic one that can appear for various reasons (even though it is uncommon). If it is only an issue in a particular diagram, you might consider building a new diagram (or a new project). If you are encountering the error in different projects, there might be a configuration issue (e.g. did your client update to a newer version of Java that isn't supported?). If building a new diagram in a new project does not address the problem, you will likely need to contact SAS Technical Support.
When contacting SAS Technical Support, you will likely be asked questions such as:
1. Is this happening with all users or with just one user?
2. Is this happening with all data sources or just certain data sources?
3. Are there certain steps you are performing whenever this error occurs?
4. When is the last time you have restarted the SAS services/sessions on your server?
5. Can you reproduce the error in a new project and diagram?
You might also check on whether you are running out of Java memory and/or allotted memory for the user. Usually, you can overcome problematic nodes by dragging in a new node, and you can overcome problematic diagrams by building a new diagram. It is important not to import a corrupted diagram when testing with a newly built diagram, as there might be artifacts associated with the previously used diagrams that are carried over.
Hope this helps!
Doug
01-15-2019
04:39 PM
I made a decision tree in SAS Enterprise Miner and the Nodes only display some of the values. Is there a way to display all values?
When you have a large number of values (or when the levels themselves are long strings of characters), it won't always be possible to view a list of every single level in a static picture or graphic. If you want to see all of the levels represented in a terminal node, the easiest way is to note the Node ID for the terminal node in question and then view the associated rules by clicking on View --> Model --> Node Rules in the Decision Tree results browser. For nodes other than terminal nodes, you might need to choose a particular path of interest and view the rules for the associated terminal node. You can also use the interactive mode by clicking on the ... to the right of Interactive in the Train section of the Tree properties panel and clicking on the node of interest to see the splitting values. You can always see all of the actual values in the score code for a terminal node by clicking on View --> Scoring --> SAS Code in the Decision Tree results browser.
Hope this helps!
Doug
01-15-2019
04:24 PM
Hello. I am using the Interactive Decision Tree tool in SAS Enterprise Miner 14.3. When I split a node and make any edits to the rule, the logworth of that variable drops to 0, so I can no longer tell its relative strength versus the other variables. Is there any workaround for this behavior?
When you are in interactive splitting, the logworth being displayed is the logworth for that split. If you split on a variable, it is possible that further splitting on that variable will have a logworth of zero even if the response values differ since a logworth of zero just means that the variable in question cannot be used to create a better fitting model by splitting that particular node. To see overall variable importance (not just for that particular node), close the interactive portion of the tree and open the tree results. Inside the tree results browser, click on Model --> Variable importance to display the overall importance values for the fitted model.
Hope this helps!
Doug
01-15-2019
04:15 PM
1 Like
Is there a way (node) in Miner that will tell which observations fall into particular percentiles and/or quantiles for a particular variable?
For example: Which observations of Speed fall beyond the 85th percentile?
It would be helpful to better understand what you hope to do with those observations. Data mining data sets typically contain a huge number of observations, so writing out observations that meet some criterion like the one you described is not particularly useful in most situations. It would be easy to run some simple code, such as the MEANS or UNIVARIATE procedure in a SAS Code node, to get specific statistics, but you would likely be better off using the StatExplore node, or exploring the data exported from a particular node and using the Plot wizard to build graphs of interest.
To do so, click on a particular node and then click on the ... to the right of Exported Data in the General properties section of the node properties panel. From here, you can click on Explore... in order to obtain a sample of the data for exploration. From there, you can click on Actions --> Plot (or just click on the Plot icon) and build a graph of interest. It is typically not practical to try to plot the whole data set, but you can modify the Sample Properties options to increase the Fetch Size to Max, which is the maximum that can be downloaded to the SAS Enterprise Miner client. You could also consider creating indicator variables that identify when a variable is above or below some threshold of interest.
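If the UNIVARIATE route fits your need, here is a rough sketch you could drop into a SAS Code node (MYDATA and SPEED are hypothetical placeholders) that flags observations beyond the 85th percentile:

proc univariate data=mydata noprint;
   var speed;
   output out=pctl pctlpts=85 pctlpre=P_;    /* writes the 85th percentile as P_85         */
run;

data flagged;
   if _n_ = 1 then set pctl;                 /* makes P_85 available on every row          */
   set mydata;
   above_p85 = (speed > P_85);               /* indicator variable, as suggested above     */
run;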
If you can explain more about what you hope to do with those observations, I might be able to provide some better approaches.
Hope this helps!
Doug
01-15-2019
04:01 PM
I am building a model in which I need to use different sets of input variables for different targets. There are inputs that should be used for some targets but rejected for others. Is there any way to do this in SAS Enterprise Miner using the Group Processing nodes or any other recommended method?
The Group Processing nodes allow you to analyze different targets using the same potential input variables, or to analyze multiple subgroups against the same target(s) using the same potential input variables. If you have different targets and a different set of input variables for each, this is much easier to set up and manage with parallel flows.
If all of the input variables are in a single data source, you can simply create a separate branch for each group of inputs that you want to consider and then connect modeling nodes to each of those separate branches. If you have multiple targets that would use the same set of inputs, you can use Group Processing on that particular branch.
In the end, keeping the models on separate paths is very beneficial on many occasions for the following reasons:
* allows you to see all of the results on any particular model
* allows you to retrain only the target(s) of interest without being required to refit all of the models that were done in group processing
* allows you to easily obtain score code for the target variable(s) run through a particular branch
* allows models to be run in parallel rather than sequentially, taking advantage of additional CPUs if available
Group Processing can be very helpful when fitting a model against multiple segments or when building models for multiple targets with the same input variables.
Hope this helps!
Doug
01-15-2019
03:07 PM
1) hierarchical clustering based on the maximum number of clusters, 2) submitting this solution to FASTCLUS. Is this right? How can I see this in the SAS code?
If you look at the options in the Cluster node, you will see the following settings in the Selection Criterion section:
Clustering Method: Ward (default) or you can change to Average or Centroid
Preliminary Maximum: 50 (default)
Minimum: 2 (default)
SAS Enterprise Miner identifies initial seeds using the DMVQ procedure (which can perform Vector Quantization and k-means clustering). Note: This initial step was performed by FASTCLUS in early versions of SAS Enterprise Miner prior to the introduction of the DMVQ procedure. The DMVQ procedure provides k seeds to the CLUSTER procedure based on the Preliminary Maximum setting (50 by default). These seeds (50 in my example) are clustered hierarchically by the CLUSTER procedure in order to identify candidate solutions based on the Cubic Clustering Criterion (CCC). The seeds and the associated statistics from this step are written to the
<project folder> / Workspaces / <workspace folder> / <node id>_CLUSSEED.sas7bdat
data set. The _CCC_ variable in this data set contains the computed CCC value for various steps in the hierarchical clustering of the initial cluster seeds. Candidates for the optimum number of clusters based on this hierarchical step are identified and then the DMVQ procedure runs again to obtain a direct (k-means) cluster analysis of the training data itself based on the number of seeds chosen by the hierarchical step. You can see the results of this in the Output window of the Cluster node where it shows output from the CLUSTER procedure including the Eigenvalues of the Covariance Matrix, the Cluster History, and the Candidates for Optimum Number of Clusters.
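If it helps to see the shape of this two-stage approach outside of SAS Enterprise Miner, here is a rough base SAS sketch. It is only an approximation -- the node itself uses the DMVQ procedure as described above, and the data set TRAIN and inputs X1-X10 are hypothetical placeholders.

proc fastclus data=train maxclusters=50 mean=seeds noprint;   /* preliminary k-means pass: 50 seeds    */
   var x1-x10;
run;

proc cluster data=seeds method=ward ccc pseudo outtree=tree;  /* hierarchical clustering of the seeds  */
   var x1-x10;
   freq _freq_;                                               /* weight each seed by its cluster size  */
   rmsstd _rmsstd_;                                           /* needed so CCC reflects the full data  */
run;

proc tree data=tree nclusters=5 out=seed_clusters noprint;    /* cut the tree at a chosen candidate k  */
run;

/* a final k-means pass (the node reruns PROC DMVQ) then clusters the training data with the chosen k  */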
In order to see the actual code that is running, you will need to add some options to your Project Start Code requesting SAS Enterprise Miner to print the logic from the macros which are running to perform these steps. Specifically, you can get a great deal more detail in the Log if you add the following statement to the Project Start Code:
/*** BEGIN SAS CODE ***/
options mprint source mlogic;
/*** END SAS CODE ***/
Remember that SAS Enterprise Miner handles ordinal and nominal inputs as well as interval inputs, so there is more happening than described here, but this is the basic outline of how the process works.
I hope this helps!
Cordially,
Doug
01-15-2019
01:51 PM
Is there any way in which I can perform a grid search to find the best parameters? I want to know how I can autotune the parameters for the model on the basis of cross-validation error or any other method.
It sounds like you are talking about a Decision Tree model. When building this type of model, interpretation is often as important as overall performance. There are likely to be many different trees that perform very similarly on hold-out data and the best model for your business question will depend on what question you are trying to ask.
For example, if you are trying to identify different groups of people to market to, you might set a high minimum number of observations per terminal leaf, since you don't want to create different marketing strategies for a large number of tiny groups of customers. If you are looking for fraud, however, you might be interested in growing the tree deeper to find unusual terminal nodes with a small number of entries. Even if you end up comparing different types of models (not just trees), you might still choose a simpler model that has some interpretation rather than fitting a complex model that has none. In the case of a tree, though, the interpretation is typically of great interest. If interpretation is not the goal, a random forest (which is composed of many trees) is likely to provide a higher assessment value than a single tree model.
While it is easy to build out several different trees in parallel to see how different settings impact the building of the tree, there is no way for the software to anticipate which tree will provide the appropriate balance of interpretation and performance so no 'grid search' is available.
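If you do want to try several settings yourself, one workaround outside of the Decision Tree node is a simple macro loop over a setting such as maximum depth, for example with PROC HPSPLIT. This is only a hedged sketch -- the data set TRAIN, the target, and the inputs are hypothetical -- and you would still compare the resulting trees on interpretation as well as validation performance.

%macro try_depths(depths=2 4 6 8);
   %local i d;
   %do i = 1 %to %sysfunc(countw(&depths));
      %let d = %scan(&depths, &i);
      title "Decision tree with MAXDEPTH=&d";
      proc hpsplit data=train maxdepth=&d;               /* hypothetical data set and variables */
         class target cat_input1 cat_input2;
         model target(event='1') = cat_input1 cat_input2 num_input1 num_input2;
         partition fraction(validate=0.3);               /* hold out 30% for validation         */
      run;
   %end;
   title;
%mend try_depths;
%try_depths()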
Hope this helps!
Doug
01-09-2019
02:03 PM
1 Like
So what method exactly is used in this preliminary cluster pass?
This is of importance, I guess.
To clarify, the hierarchical clustering being done is only on the cluster seeds initially generated by the FASTCLUS procedure. There is no hierarchical clustering of the entire data set. The initial cluster seeds (using the maximum number of clusters of interest) are clustered hierarchically, reducing the number of seeds to submit to FASTCLUS by one at each step. This generates a different clustering solution for each value from the maximum number of clusters considered down to the smallest. You can then choose the clustering solution based on any of the criteria that are provided and/or interpretability and/or usefulness.
SAS Enterprise Miner provides the option to use AVERAGE, CENTROID, or WARD when doing this hierarchical step.
Hope this helps!
Cordially,
Doug