About PadraicGNeville

PadraicGNeville · 01-23-2017

Depends on what the prediction would be without a model. If the proportion of observations with the most common target value in the data is near 1 - 0.058, then a misclassification rate of 0.058 is not good. On the other hand, if the proportion is around 1/2, then 0.058 is a great number. I suspect AUC of 0.81 is good, because it is much larger than 0.5.
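To make the "prediction without a model" comparison concrete, here is a minimal Python sketch. The two data sets are made up for illustration; the point is only that always predicting the most common class gives a baseline misclassification rate against which 0.058 can be judged.

```python
from collections import Counter

def baseline_misclassification(targets):
    """Misclassification rate of always predicting the most common class."""
    counts = Counter(targets)
    most_common_count = counts.most_common(1)[0][1]
    return 1.0 - most_common_count / len(targets)

# Hypothetical targets: in the first set 94.2% of observations are class 0,
# in the second the classes are evenly split.
rare = [0] * 942 + [1] * 58
balanced = [0] * 500 + [1] * 500

rare_baseline = baseline_misclassification(rare)
balanced_baseline = baseline_misclassification(balanced)
print(round(rare_baseline, 3), round(balanced_baseline, 3))
```

With the first data set a model scoring 0.058 is no better than the no-model baseline; with the second it is far better than the baseline of 0.5.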

PadraicGNeville · 01-20-2017

Yes this is weird. I guess no split is created in the initial tree. I cannot guess why reducing the number of training observations fixes it. If you are willing and able to provide the data to SAS Tech Support, I will figure out why this is happening. -Padraic

PadraicGNeville · 11-30-2016

I'm stumped. May I use your data to reproduce it? If so, you can either upload the data to the community, or upload it privately through SAS Technical Support. In that case tell Tech Support that "Padraic Neville wants the data in order to investigate the problem," and they will quickly let me know when it is available. Technical Support will want the site number that appears at the top of the SAS logs. In Enterprise Miner, you can get the log as follows:
1. Launch the SAS Enterprise Miner client.
2. Open any project and run your diagram flow.
3. Right-click on Results and select View►SAS Results►Log.
4. Search for Site at the top of the log to identify your site number.

PadraicGNeville · 11-30-2016

You understand it perfectly. The software appears to be confused. Something is confusing it. There is a parameter to set the within-node sample size. (I do not remember its name.) Set it to something larger than 24,000. If that does not cure it, I wonder whether there is a FREQ variable, and what the counts of the target classes are. Let me know and we can go from there. -Padraic

PadraicGNeville · 10-24-2016

Hi, Subham. Unfortunately, the Boosting node in EM is not designed to output information about the individual trees beyond the number of leaves. The EM node invokes PROC TREEBOOST to build the boosted model. An expert user might know how to run PROC TREEBOOST directly in an EM Code node. In that case, one could include the RULES= dataset, STATSBYNODE= dataset, and TOPOLOGY= dataset options in the SAVE statement. Together they would describe all the splits and node statistics. However, they are not documented or really supported by tech support for the boosting node.

In my opinion multicollinearity is not a problem. The boosting algorithm does not invert a matrix, so there is no concern about matrix ranks.

Regarding the ratio variable, if there are no other missing values in the data, then simply set the -99 values to missing. "Missing" is a special value that boosting will assign to the best branch independently of the other values. The algorithm also considers splitting Missing vs Non-Missing. If the ratio variable contains other missing values that you do not want to merge with the special -99 value, then one approach is to create a second variable with value 0 for observations with non-special ratio values and 1 for observations with the special value. In the original ratio variable, replace the -99 value with the average of the legitimate values. Setting it to the average value is an attempt to make those observations uninfluential in the split search. If there are more special values than just -99, then create a separate value for each in the second variable, and declare that variable nominal instead of binary. This approach is just an idea. I cannot think of a better approach.

Good luck. Padraic
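The indicator-variable idea for the ratio variable can be sketched outside SAS. The following Python snippet is only an illustration under assumed names (`split_special`, `ratio`); it builds the second variable and replaces the -99 code with the average of the legitimate values, leaving genuine missing values (NaN here) untouched.

```python
import math

SPECIAL = -99.0  # the special code mentioned in the post

def split_special(ratio_values, special=SPECIAL):
    """Return (cleaned_ratio, indicator).

    indicator is 1 where the special code appeared and 0 elsewhere.
    The special code is replaced by the mean of the legitimate values;
    genuinely missing values (NaN) stay missing.
    """
    legit = [v for v in ratio_values if not math.isnan(v) and v != special]
    mean_legit = sum(legit) / len(legit)
    cleaned, indicator = [], []
    for v in ratio_values:
        if v == special:  # NaN never compares equal, so it falls through
            cleaned.append(mean_legit)
            indicator.append(1)
        else:
            cleaned.append(v)
            indicator.append(0)
    return cleaned, indicator

# Hypothetical values: one special code and one genuine missing value.
ratio = [0.5, 1.5, -99.0, float("nan"), 1.0]
cleaned, indicator = split_special(ratio)
print(cleaned, indicator)
```

With several special codes, the indicator would take a distinct value per code and be treated as nominal, as described above.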

PadraicGNeville · 07-07-2016

A weight is a positive number. "Weighting observations" means a positive number is associated with each observation, and the algorithm utilizes that number somehow. Intuitively, observations with larger weights influence the algorithm more than observations with smaller weights. When AdaBoost has created 10 trees in its boosting model, it will assign small weights to observations it is predicting well, so that AdaBoost will create the 11th tree focusing on observations it hitherto predicted poorly. "Weighting trees" means the predictions from the trees are multiplied by a weight: P(X) = W1 T1(X) + W2 T2(X), where Ti(X) is the prediction of tree i for inputs X, and Wi are the weights. Sometimes the gradient boosting algorithm is explained this way: first train the next tree (T2), and then find a single number (W2) that works best.
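Both kinds of weight show up in one round of the textbook discrete AdaBoost update, sketched below in Python. This is the standard published formula for labels in {-1, +1}, not a description of what any SAS procedure does; the function name and the four-observation example are made up.

```python
import math

def adaboost_update(weights, y_true, y_pred):
    """One round of discrete AdaBoost for labels in {-1, +1}.

    Returns (tree_weight, new_observation_weights).
    """
    total = sum(weights)
    err = sum(w for w, y, p in zip(weights, y_true, y_pred) if y != p) / total
    if err <= 0.0 or err >= 1.0:
        raise ValueError("weighted error must be strictly between 0 and 1")
    # Weight of the tree in the final sum: larger when the tree errs less.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Observation weights shrink where the tree was right, grow where wrong.
    raw = [w * math.exp(-alpha * y * p)
           for w, y, p in zip(weights, y_true, y_pred)]
    norm = sum(raw)
    return alpha, [w / norm for w in raw]

# Hypothetical round: the tree gets the last of four observations wrong.
y_true = [1, 1, -1, -1]
y_pred = [1, 1, -1, 1]
weights = [0.25, 0.25, 0.25, 0.25]
alpha, new_weights = adaboost_update(weights, y_true, y_pred)
print(alpha, new_weights)
```

After the update the misclassified observation carries half of the total weight, so the next tree concentrates on it.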

PadraicGNeville · 07-07-2016

This is a constant prediction. Either no split is created or many trees were created and then pruned back with validation data. If many trees are created and then pruned, then there should be an iteration sequence. If no split is created, then yes, some special handling of the settings is necessary, for example setting leafsize = 1. Without seeing the data I cannot help much with this.

PadraicGNeville · 07-07-2016

"Weight" could refer to weighting the observations or to weighting the trees in the model. A boosting model typically consists of a sum of decision trees trained sequentially. Some algorithms describe the sum as weighted. In AdaBoost, the original boosting algorithm, observations are given weights before training a tree. The weights are different for each tree. Gradient boosting algorithms do not use weights like this. Instead, the algorithm modifies the target values input to a tree. The EM Boosting node uses gradient boosting. On rare occasions, people assign weights to the observations at the start in order to match proportions of groups in the training data with those in a future population to which the model will be applied.
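What "modifies the target values" means can be shown with the textbook negative-gradient targets for two common losses. This Python sketch is generic gradient boosting; which loss functions the EM Boosting node actually uses is not stated here, and the function names are illustrative.

```python
import math

def squared_error_pseudo_target(y, current_prediction):
    """Negative gradient of squared-error loss: the plain residual."""
    return [yi - fi for yi, fi in zip(y, current_prediction)]

def logistic_pseudo_target(y, current_score):
    """Negative gradient of log loss for y in {0, 1}.

    current_score is the model's current log-odds, so the modified
    target is y minus the current predicted probability.
    """
    return [yi - 1.0 / (1.0 + math.exp(-fi))
            for yi, fi in zip(y, current_score)]

# Hypothetical current state of a boosting model.
interval_targets = squared_error_pseudo_target([3.0, 5.0], [4.0, 4.0])
binary_targets = logistic_pseudo_target([1, 0], [0.0, 0.0])
print(interval_targets, binary_targets)
```

The next tree is trained on these modified targets with every observation weighted equally, in contrast to the reweighting done by AdaBoost.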

PadraicGNeville · 07-07-2016

The boosting node handles nominal targets automatically. No special settings are necessary other than declaring the target as nominal. There should be a result, even if the result is to predict all observations the same way.

PadraicGNeville · 07-07-2016

There is no way to tell PROC TREEBOOST to incorporate other learners. That said, if the target Y has an interval level of measurement, then use a pseudo-target (Y - P) as input to PROC TREEBOOST, where P is the prediction from another learner. The final prediction is P_treeboost + P, where P_treeboost is the prediction from PROC TREEBOOST.
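The pseudo-target trick can be sketched in Python. PROC TREEBOOST is not called here: a least-squares line (`fit_line`) stands in for the other learner and a one-split regression stump (`fit_stump`) stands in for the boosted-tree model. Both helpers and the data are hypothetical and chosen only to keep the example self-contained.

```python
def fit_line(x, y):
    """Least-squares line; stands in for the 'other learner'."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    return lambda v: intercept + slope * v

def fit_stump(x, target):
    """Best single-split regression stump; stands in for the tree model."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [t for xi, t in zip(x, target) if xi <= threshold]
        right = [t for xi, t in zip(x, target) if xi > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - lm) ** 2 for t in left)
               + sum((t - rm) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda v: lm if v <= threshold else rm

def sum_squared_error(y, pred):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred))

# Hypothetical interval target: a trend plus a jump after x = 3.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]

other = fit_line(x, y)
p = [other(v) for v in x]                          # P from the other learner
pseudo_target = [yi - pi for yi, pi in zip(y, p)]  # Y - P
tree_like = fit_stump(x, pseudo_target)            # trained on Y - P
final = [pi + tree_like(v) for pi, v in zip(p, x)] # P_tree + P

print(sum_squared_error(y, p), sum_squared_error(y, final))
```

The second learner only has to explain what the first one left over, which is why the combined prediction fits at least as well on the training data as P alone.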

PadraicGNeville · 05-26-2016

This should never happen. A SAS procedure (ARBOR) is hopelessly confused about the format of a categorical variable. The procedure generated scoring code with the lines

    ****** TEMPORARY VARIABLES FOR FORMATTED VALUES ******;
    LENGTH _ARBFMT_12 $ 12; DROP _ARBFMT_12;
    _ARBFMT_12 = ' '; /* Initialize to avoid warning. */
    LENGTH _ARBFMT_0 $ 0; DROP _ARBFMT_0;
    _ARBFMT_0 = ' '; /* Initialize to avoid warning. */

and so on. Elsewhere there should be lines of the form

    _ARBFMT_0 = PUT( variable, format);

Such lines reveal the variable and format that tripped the SAS procedure. If you can find those lines, you might be able to fix them by 1. correcting the format if it is wrong, and 2. putting the length of the format in the LENGTH _ARBFMT_0 statement. If you would give SAS Technical Support the Tree4_emtree data set and let them know I want to look at it, then I will figure out the bug and whether it can be addressed. (https://support.sas.com/techsup/contact/ ) I apologize for the inconvenience. Kind Regards, Padraic

PadraicGNeville · 05-13-2016

Yes. P(class j) = scale * unadjusted_P( j) * prior(j) / proportion_in_data(j), where the scale is chosen to get sum over j of P(j) = 1.
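The formula is easy to check numerically. Here is a small Python sketch of it; the function name and the numbers (a 50/50 oversampled training set with true priors of 5% and 95%) are assumptions for illustration.

```python
def adjust_posteriors(unadjusted, priors, data_proportions):
    """Apply P(j) = scale * unadjusted_P(j) * prior(j) / proportion_in_data(j).

    scale is chosen so that the adjusted probabilities sum to 1.
    """
    raw = [p * prior / prop
           for p, prior, prop in zip(unadjusted, priors, data_proportions)]
    scale = 1.0 / sum(raw)
    return [scale * r for r in raw]

# Hypothetical node: classes look equally likely in the oversampled data.
unadjusted = [0.5, 0.5]
priors = [0.05, 0.95]
data_proportions = [0.5, 0.5]

adjusted = adjust_posteriors(unadjusted, priors, data_proportions)
print(adjusted)
```

A node that looks like a coin flip in the oversampled data falls back to the prior probabilities after adjustment, as expected.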

PadraicGNeville · 05-13-2016

Hi. No, SAS EM does not think you only have 86.72 events. The display is adjusting the counts to reflect the adjusted priors. There might be a display setting that turns the adjustment off (I don't know). In any case, the computational code knows about all the observations. The adjustments can change the computations in three places:
1. Depending on user-properties, the split search will act as if there are 86.72 events or as if there are 1301 events.
2. Depending on user-properties, the tree can be retrospectively pruned based on the adjusted numbers or the unadjusted numbers.
3. The posterior probabilities will be adjusted.
I am guessing that the default behaviour for 1 and 2 is to not incorporate the adjustments. Why: because typically the adjustment makes the event look more rare, and rare events typically fool trees into being too small. As you point out, the data step code coming out of logistic includes code at the end to adjust the posterior probabilities. The decision tree code does not output corresponding code because it outputs posterior probabilities that are already adjusted: the decision tree computes the adjustments before outputting the data step code. Let us know if you still have questions. -Padraic

PadraicGNeville · 02-12-2016

The log should report the name of the dataset and the number of observations shortly after it reports the PROC ARBOR and SAVE statements. Remember to replace "sequence" with a libname.memname of your choice, for example EM_LIB.Sequence.

PadraicGNeville · 02-11-2016

You're helping me more than I am helping you now. The EM macro, directory, and file names are a mystery to me. I'm sorry I can't help more.


Re: Random Branch Assignments (RBA)

Re: How to convert SAS Decision Tree model to PMML format?

Re: proc arboretum

Re: proc arboretum

Re: How to convert SAS Decision Tree model to PMML format?

Re: Getting OOB error for each tree in RF

Re: how to define positive and negative samples well?

Re: Getting a proximity matrix from a random forest

Re: SEMMA

Re: proc arboretum

Re: HPForest predicted class probabilities differ across runs - is thi...

Re: Stratified bootstrap sampling with random forest

Re: Cross Validation in regression and decision trees

Re: How to convert SAS Decision Tree model to PMML format?

Re: proc arboretum

Re: Predictive Model Results

Re: Enterprise Miner Gradient Boosting not producing model

Re: Enterprise miner Node Leaf size issues

Re: Enterprise miner Node Leaf size issues

Re: Gradient Boosting Output Understanding in EM

Re: Weight in Gradient Boosting

Re: Multilevel classification using Gradient boosting SAS

Re: Weight in Gradient Boosting

Re: Multilevel classification using Gradient boosting SAS

Re: Proc Treeboost

Re: Decision Tree Error: Temporary Variable for Formatted Value

Re: Oversampling and Decision tree help Plz!

Re: Oversampling and Decision tree help Plz!

Re: SAS Miner: Export Leaf Table (without manually saving) from a node...

Re: SAS Miner: Export Leaf Table (without manually saving) from a node...