About geniusgenie

DougWielenga · ‎01-09-2019

I am using a decision tree with standardized data. In my previous question I asked whether I should use standardized data with a decision tree or not, and one of the members suggested that it won't hurt my model. The problem I am facing is that a few columns, in both standardized and non-standardized form, are harder to interpret. The difference standardization makes is to bring them into a range. Now at least I know that my standardized data falls between 0 and 1. The only thing to worry about is: how should I interpret the results in tree algorithms? Should I maintain a data dictionary with both unique standardized and non-standardized values and compare them, or is there some other recommended way?

My question to you is why are you using standardized data in the first place? Trees are usually of interest when interpretability is desired. Standardizing your data can make that almost impossible. Splits based on a variable containing 'price in dollars' are easily understood, while splits on a replacement variable containing 'log(price in dollars)' are almost completely useless for interpretation.

Note that the Tree is creating splits, so in theory any ordinally equivalent set of values (values that are naturally ordered in the same way) will generate the same results. In practice, transforming your input variables can change which split points are considered, since not every single possible value is necessarily considered for splitting. If the tree differs in how it splits the observations at any level, all subsequent splits will likewise be impacted.

Consider that modeling really represents multiple activities, sometimes done jointly:
* variables are selected for use in possible models
* candidate models are constructed
* candidate models are evaluated to choose the final model
* interpretation is attempted

If your goal is interpretation, the standardization is not likely to help you. If the transformations do generate a better overall model, you can also consider keeping both the transformed and non-transformed values. In this case, you might build the initial model considering both transformed and non-transformed variables as potential input variables, but then perform an analysis of the predicted outcome (based on some threshold you specify) using only the non-transformed variables in a Segment Profile node. In this way you can get the potential benefits of transforming the inputs as well as a more accessible interpretation. In most cases, however, I would not expect all of the extra work to result in meaningful differences, since the Tree models are not impacted directly by scale or location changes.

I hope this helps! Cordially, Doug
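The point about ordinally equivalent values can be illustrated with a small sketch. It is written in plain Python rather than SAS, and it is not how the Decision Tree node is implemented: the `gini` and `best_split` helpers, the price values, and the target values are all made up for illustration. It tries every possible threshold (the "in theory" case described above) on a price variable and on its log, and checks that both lead to the same partition of the rows.

```python
import math

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def best_split(values, labels):
    """Return the set of row indices sent left by the best threshold split.

    Every boundary between adjacent distinct sorted values is tried, which is
    the 'in theory' case; real software may test only a subset of candidates.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_score, best_left = None, None
    for k in range(1, len(order)):
        if values[order[k - 1]] == values[order[k]]:
            continue  # no threshold separates equal values
        left, right = order[:k], order[k:]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / len(order)
        if best_score is None or score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left

# Hypothetical data: price in dollars and a binary target.
price = [12.0, 250.0, 40.0, 900.0, 75.0, 5.0, 310.0, 60.0]
target = [0, 1, 0, 1, 0, 0, 1, 1]
log_price = [math.log(p) for p in price]

# The log transform keeps the ordering, so the same rows go left and right.
same_partition = best_split(price, target) == best_split(log_price, target)
print(same_partition)
```

The rows are partitioned identically, but the threshold is reported in log-dollars in the second case, which is exactly what makes the transformed tree harder to read.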

shilpaISBCBA · ‎03-13-2018

BIG THANKS!

WendyCzika · ‎02-01-2018

It is on the "HPDM" (high-performance data mining) tab, second node from right, see screenshot.

DougWielenga · ‎08-24-2017

When you get 100% accuracy, you need to go back and check your input variables to make sure you have not inadvertently included a variable containing information you would not have available when scoring new data. For example, I could easily predict which accounts were going to default if there was a field that indicated how much money was lost when the loan did default, but that information would never be available for new data.

You can also get very high classification rates (although typically not 100%) when you have a rare event that only happens a small percentage of the time. Suppose your event happens 1% of the time: then you can say "nobody has the event" and be 99% correct with respect to misclassification, yet not have a model of any usefulness. More details would be needed to speculate further on the misclassification aspect.

In Data Mining scenarios, you typically have sufficient data to use holdout data (validation data) to demonstrate empirically that the model is useful. When you have more limited data, you are left with cross-validation options. When you have very limited data, you are left with assessing things based on your business knowledge. The less data there is, the more uncertainty you are likely to have.

With regard to choosing the 'best' model, you need to incorporate your business objectives. You can choose a model based on many different statistics, yet none of them might actually be best suited to your situation, depending on the business objectives you are trying to accomplish. You need to identify your goals and assess how costly it is to misclassify someone, which can be complex if you have more than two levels. In the end, your choice of strategy should support the goals you had when you started building the model in the first place.

Hope this helps! Doug
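The rare-event arithmetic above can be checked with a short sketch. It is plain Python rather than SAS, and the `baseline_report` helper and the counts (10,000 rows, 100 events, which is a 1% event rate) are assumptions chosen only to match the example in the post.

```python
def baseline_report(n_total, n_events):
    """Accuracy and recall of a 'model' that always predicts non-event."""
    true_negatives = n_total - n_events   # every non-event is labeled correctly
    true_positives = 0                    # no event is ever predicted
    accuracy = (true_negatives + true_positives) / n_total
    recall = true_positives / n_events if n_events else 0.0
    return accuracy, recall

# Hypothetical data set: 10,000 rows with a 1% event rate.
accuracy, recall = baseline_report(n_total=10_000, n_events=100)
print(f"accuracy={accuracy:.2%} recall={recall:.2%}")
```

The accuracy is 99% while not a single event is found, which is why misclassification rate alone is a poor basis for choosing a model when the event is rare.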

geniusgenie · ‎08-08-2017

Thanks for your reply, I got it done. Regards

geniusgenie · ‎08-08-2017

Hi Doug, Thanks for your reply. I am getting your points. I am using impute before anything gets applied. I will apply your thoughts and will be sharing results with you. Regards

DougWielenga · ‎08-07-2017

Not all of the nodes you indicated do variable selection. For example, the Neural Network uses all available inputs, as do the HP SVM models, so it is important to consider doing variable selection prior to running the model. The Decision Tree does variable selection automatically -- the approach depends on the settings you choose -- while the HP Forest node does it optionally. In situations where you use the HP Forest node for variable selection, the usual recommendation is to run a second HP Forest node with only the important variables selected. There is not a simple answer to your question, but I would recommend you review the documentation available for all of these nodes in SAS Enterprise Miner by opening the application, clicking on Help --> Contents, and then navigating in the panel on the left to the Node Reference. The help for the individual nodes is arranged by tab (Sample, Explore, Modify, Model, Assess). All of the modeling nodes you described are in the Model folder under the Node Reference. Additional documentation is available at http://support.sas.com/documentation/onlinedoc/miner/index.html You can read up on each of the modeling approaches and let us know if you have specific questions about something. I hope this helps! Doug

DougWielenga · ‎08-07-2017

Correlation among input variables could be a very important issue in classical regression, where the structure of the model was critical to generating useful results and interpretation. In most data mining scenarios, you have far more data than was available to historical approaches, as well as powerful methods (linear and nonlinear) that allow you to model relationships using flexible models which adapt to your data. You can use holdout data to empirically validate the relationships rather than relying on assumptions. The amount of interpretation available differs from model to model. Trees provide simple interpretability, while neural network and SVM models do not lend themselves to interpretation. Correlation is a concern for interpretation of simple regression models, but interpretation is not meaningful if the model is inadequate, which such models often are. If you would benefit from broader training in using these methods, check out the training available at http://support.sas.com/training/us/paths/dm.html where you can get a better understanding of how these different models can be used. Hope this helps! Doug

DougWielenga · ‎08-07-2017

SAS Enterprise Miner generates a ROC curve for the Train, Validate, and Test data sets in the Model Comparison node when modeling a binary target. It also generates a misclassification chart for the Train and Validate data sets, but it does not generate a misclassification chart for the Test data set. In the design of SAS Enterprise Miner, Test data sets are intended for a final unbiased evaluation of model performance, so they are not used by default when a Validate data set is present.

Please note that SAS Enterprise Miner always generates
F_<target variable name> : the target variable value
I_<target variable name> : the predicted target value (based on highest probability)
but it can also generate D_<target variable name>, which contains the 'decision' outcome based on the decision weights and priors entered in the target profile when one is present. For example, if the target variable is named 'BAD', SAS Enterprise Miner would create the variables F_BAD, I_BAD, and D_BAD.

If a Test data set is available, you can add a SAS Code node after any modeling node and enter the following code in the Training code section. This example assumes the target variable is named BAD.

/*** BEGIN SAS CODE ***/
proc freq data=&em_import_test;
   tables F_BAD*I_BAD;
   tables F_BAD*D_BAD; *only available if Decision profile has been created;
run;
/*** END SAS CODE ***/

The code above will generate both misclassification charts if the target profile is available.

Hope this helps! Doug

DougWielenga · ‎08-07-2017

In the SAS Enterprise Miner help (open SAS Enterprise Miner and click Help --> Contents), navigate in the panel on the left to Node Reference --> Assess Nodes --> Cutoff Node, and then in the panel on the right to Cutoff Node Train Properties. Scroll down until you see Event Precision Equal Recall, where it says the following:

Event Precision Equal Recall: With precision defined as % true predicted events / (true predicted + false predicted) and recall defined as the event classification rate, this method chooses the point at which precision and recall are equal.

There are two ways to find this in the output:
(1) In the Overall Rates plot, a line is drawn at the requested point, and hovering over the line with the mouse will show the cutoff (see attached document for plot).
(2) In the Output section, you can see that the first two columns are closest when the cutoff is 0.36 (row marked below; see attached document for partial table).

         Event      True      False     Overall
         Precision  Positive  Positive  Classification
Cutoff   Rate       Rate      Rate      Rate
------   ---------  --------  --------  --------------
0.99       200.00      8.30      0.00       161.78
0.98       200.00     13.26      0.00       162.77
0.97       196.67     15.47      0.07       163.15
 . . .
0.38       134.11    125.06     15.31       172.80
0.37       132.51    127.47     16.18       172.59
0.36       131.01    129.67     17.01       172.36   <-- closest
0.35       128.45    131.27     18.22       171.71

I hope this helps! Doug
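The "precision equals recall" rule quoted above can be sketched outside the Cutoff node. The following is plain Python rather than SAS and is not the node's actual implementation: the helper names, the scores, and the labels are hypothetical, and the rule is approximated by scanning a grid of cutoffs and keeping the one where the two rates are closest.

```python
def precision_recall(scores, labels, cutoff):
    """Precision and recall when rows with score >= cutoff are predicted events."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 1)
    precision = tp / (tp + fp) if tp + fp else None  # undefined with no predicted events
    recall = tp / (tp + fn) if tp + fn else None     # undefined with no actual events
    return precision, recall

def equal_precision_recall_cutoff(scores, labels, cutoffs):
    """Return (gap, cutoff, precision, recall) for the cutoff with the smallest gap."""
    best = None
    for c in cutoffs:
        p, r = precision_recall(scores, labels, c)
        if p is None or r is None:
            continue
        gap = abs(p - r)
        if best is None or gap < best[0]:
            best = (gap, c, p, r)
    return best

# Hypothetical predicted event probabilities and actual outcomes (1 = event).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
cutoffs = [k / 100 for k in range(99, 0, -1)]  # 0.99 down to 0.01

gap, cutoff, precision, recall = equal_precision_recall_cutoff(scores, labels, cutoffs)
print(cutoff, precision, recall)
```

Scanning from the highest cutoff downward and keeping only strict improvements means ties are resolved in favor of the highest cutoff; that tie-breaking choice is an assumption of this sketch.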

DougWielenga · ‎08-07-2017

It really doesn't matter to an algorithm where the data came from or whether or not there should be 'missing' values. The data structure for the techniques you are describing anticipates that there will be a distinct unit/observation/entity on each row (not spread across multiple rows) and that each column will contain an attribute for the unit/observation/entity on the corresponding row. So if we were looking at cars, your rows might correspond to a particular make and model of a car, and the columns might correspond to things like suggested retail price, city mpg, highway mpg, number of cylinders, drivetrain type (front/rear/all-wheel), bluetooth enabled (yes/no), etc. It is possible that you don't have complete information even in simple situations like this, since Mazda doesn't have cylinders (it's a chamber) in its rotary engine, and some models might not post certain information.

It is important to note that a neural network, a support vector machine, or a regression model will drop any observation with incomplete data, which simply means there is a missing value for one or more of the input variables. Decision Tree models are able to incorporate these observations, but you must impute/guess the missing value if you want the observation to be considered at all in your neural network or regression model. Adding rows with incomplete data will not help these latter modeling types, but even incomplete data can be used by a Decision Tree model. If the rows that have been 'added' are not really contributing any additional information to the model, it is possible that one of those methods requiring complete data might be helpful. From a method standpoint, however, it is important to understand how the methods are interpreting your data and to decide what will generate meaningful results.

I hope this helps, Doug
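To make the effect of missing values concrete, here is a small sketch in plain Python rather than SAS. The car rows, the column names, and the use of mean imputation are all assumptions made for illustration; the post does not prescribe a particular imputation method. It counts how many rows a complete-case method would keep before and after the missing inputs are filled in.

```python
# Hypothetical rows: one car per row, None marks a missing value.
cars = [
    {"model": "A", "msrp": 21000, "city_mpg": 28, "cylinders": 4},
    {"model": "B", "msrp": 35000, "city_mpg": None, "cylinders": 6},
    {"model": "C", "msrp": 27000, "city_mpg": 22, "cylinders": None},  # e.g. rotary engine
    {"model": "D", "msrp": 18000, "city_mpg": 31, "cylinders": 4},
]
inputs = ["msrp", "city_mpg", "cylinders"]

def complete_cases(rows, columns):
    """Rows a complete-case method (regression, neural network) would keep."""
    return [r for r in rows if all(r[c] is not None for c in columns)]

def impute_mean(rows, column):
    """Return copies of the rows with missing values in `column` set to the observed mean."""
    observed = [r[column] for r in rows if r[column] is not None]
    mean = sum(observed) / len(observed)
    return [dict(r, **{column: mean if r[column] is None else r[column]}) for r in rows]

usable_before = len(complete_cases(cars, inputs))

imputed = cars
for column in inputs:
    imputed = impute_mean(imputed, column)
usable_after = len(complete_cases(imputed, inputs))

print(usable_before, usable_after)
```

Half of the rows are lost to a complete-case method in this toy data, and all of them are retained once the missing inputs are imputed. Whether an imputed value is meaningful (for example, a cylinder count for a rotary engine) is a separate question.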

WendyCzika · ‎07-21-2017

See this link to a recent webinar on this very topic: https://communities.sas.com/t5/SAS-Communities-Library/Variable-Selection-SAS-Enterprise-Guide-amp-SAS-Enterprise-Miner/ta-p/319961 or the slides here: https://communities.sas.com/kntur85557/attachments/kntur85557/library/1742/1/Combined%20Slides%2010MAR17%20AtE%20Variable%20Selection%20in%20EG%20and%20EM-%20Ne....pdf

geniusgenie · ‎06-05-2017

Hi Wendy, Thanks for your message, I am able to enter this code. Could you also tell me whether this option is going to work for all HP Forest nodes (in case I have multiple nodes), or whether I can use it for a separate node as well? I have a standard HP Forest node, and I am looking to add another node separately on which to implement this code as a conditional decision tree. So, what should I do in this case?

Reeza · ‎06-04-2017

You should consider getting this book, it has a lot of answers to the questions you've been asking. https://play.google.com/store/books/details?id=YM6GAgAAQBAJ&source=productsearch&utm_source=HA_Desktop_US&utm_medium=SEM&utm_campaign=PLA&pcampaignid=MKTAD0930BO1&gclid=CMHw8cLOpNQCFSoIMgodONIMqQ&gclsrc=ds&dclid=CI6dhsPOpNQCFYQtaQodKocAyw

Tom · ‎06-04-2017

The way you can create a clear example of what you want is to copy and paste your data as TEXT ONLY. No tabs. No HTML table. Make sure to use the Insert code or Insert SAS code icons in the rich text editor on this site so that it will preserve the spaces and be displayed using a non-proportional font. If you want to edit the text in the code boxes, remember to put the cursor inside the box and click the icon again so that you can edit in the pop-up window. If you try to edit without doing that, it messes up the formatting of the line breaks. Do not try to post an example with more than one or two variables and one or two groups (filenames?). Once you know how to do the transformation for one variable, say SD_TYPE, you can do it yourself for the other variables that you want to process in the same way.

The basic structure of your program is going to be:
transform SHEET2 into NEW_SHEET2 that is one row per filename.
transform SHEET3 into NEW_SHEET3 that is one row per filename.
Then you just need to merge the three datasets.

data want ;
  merge sheet1 new_sheet2 new_sheet3 ;
  by filename;
run;

Online Status	Offline
Date Last Visited	‎09-09-2017 10:31 AM

Re: How to check overfitting

Re: How to check overfitting

Re: How to check overfitting

How to check overfitting

Re: How to calculate f-score of classifiers

Re: Need an advice on Imbalance datasets

Re: Need an advice on Imbalance datasets

Re: Missing values in a column

feature selection methods in sas em

Missing values in a column

Re: How to do the feature extraction or selection in SAS Enterprise mi...

Re: How to calculate f-score of classifiers

Re: Correlated variables in classifiers

Re: Precision and recall scores in cutoff node

Re: Missing values in a column

Re: How to interpret results when decision tree used with standardised...

Re: How to see correlation matrix in sas enterprise miner

Re: C4.5 and c5.0 in decision trees

Re: How to check overfitting

Re: How to calculate f-score of classifiers

Re: Need an advice on Imbalance datasets

Re: How to do the feature extraction or selection in SAS Enterprise mi...

Re: Correlated variables in classifiers

Re: Testing Confusion matrix in Enterprise Miner

Re: Precision and recall scores in cutoff node

Re: Missing values in a column

Re: feature selection methods in sas em

Re: decision stump and conditional decision trees

Re: Cross validation in machine learning algorithms

Re: dataset integration