Solved: Frequency variable in Decision Tree node

topkatz · Posted 08-28-2009 10:49 PM

Hi.

I have assigned a variable the type of frequency in the Decision Tree node, although it's really a weight variable. All the values are between 0 and 1. Enterprise Miner 5.3 doesn't have a weight type for variables. I had hoped that only the relative values of the weights would affect the composition of the tree, but I find that is not the case, because when I multiply all the weights by a single common positive value, the tree can change significantly. My theory is that the minimum splitting rule of the tree is based on the sum of the frequencies, rather than on the raw number of observations. Is this correct? Is there more to it? Thanks!

-- TMK --

DougWielenga · Posted 08-11-2017 11:34 AM

Note: I edited the original poster's Title to remove the version number since the answer below applies to all versions!

Enterprise Miner does not handle weight variables because there is not a standard for how weight variables should be handled in a data mining context.  Enterprise Miner does support the use of a FREQ variable which can take non-integer values.  More information is located in the Predictive Modeling chapter of the Enterprise Miner Reference which is available by going to Help --> Contents from within Enterprise Miner and then navigating to

Analytics
  Predictive Modeling

in the panel on the left and then clicking on "The Frequency Variable and Weighted Estimation" from the links on the right.

/* BEGIN EXCERPT */

The Frequency Variable and Weighted Estimation

All of the modeling nodes allow you to specify a frequency variable. Typically, the values of the frequency variable are nonnegative integers. The data are treated as if each case were replicated as many times as the value of the frequency variable.

Unlike most SAS procedures, the modeling nodes in Enterprise Miner accept values for a frequency variable that are not integers without truncating the fractional part. Thus, you can use a frequency variable to perform weighted analyses. However, Enterprise Miner does not provide explicit support for sampling weights, noise-variance weights, or other analyses where the weight variable does not represent the frequency of occurrence of each case. If the frequency variable represents sampling weights or noise-variance weights, the point estimates of regression coefficients and neural network weights will be valid. But if the frequency variable does not represent actual frequencies, then standard errors, significance tests, and statistics such as MSE, AIC, and SBC may be invalid. If you want to do weighted estimation under the usual assumption for weighted least-squares that the weights are inversely proportional to the noise variance (error variance) of the target variable, you can obtain statistically correct results by specifying frequency values that add up to the sample size. If you want to use sampling weights that are inversely proportional to the sampling probability of each case, you can get approximate estimates for MSE and related statistics in the Regression and Neural Network nodes by specifying frequencies that add up to the effective sample size. A pessimistic approximation to the effective sample size is provided by [sum(w(i))]^2/sum(w(i)^2), where w(i) is a sampling weight for case i. This trick will not work properly with the Tree node.

/* END EXCERPT */
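
As a rough illustration of the rescaling the excerpt describes, here is a minimal Python sketch. It is not Enterprise Miner code: the function names and the sample weights are made up for illustration, and it only shows the arithmetic of the effective sample size formula [sum(w(i))]^2/sum(w(i)^2) and of rescaling weights so that they add up to a chosen total.

```python
def effective_sample_size(weights):
    """Pessimistic effective sample size: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


def rescale_to_total(weights, target_total):
    """Multiply every weight by one constant so the weights sum to target_total."""
    factor = target_total / sum(weights)
    return [w * factor for w in weights]


# Hypothetical weights between 0 and 1, as in the original question.
weights = [0.2, 0.5, 0.1, 0.9, 0.3]

n_eff = effective_sample_size(weights)

# Noise-variance weights: frequencies that add up to the sample size.
freq_for_wls = rescale_to_total(weights, len(weights))

# Sampling weights: frequencies that add up to the effective sample size.
freq_for_sampling = rescale_to_total(weights, n_eff)

print(n_eff, sum(freq_for_wls), sum(freq_for_sampling))
```

The rescaled values could then be stored in the column that is assigned the frequency role. As the excerpt notes, this approach is aimed at the Regression and Neural Network nodes and will not work properly with the Tree node.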

Specifying a FREQ variable only makes sense when it corresponds to the actual number of replicates for an observation.  It is also used in some cases for credit scoring applications as part of a common practice in the industry.  Other than these two uses, we generally don't recommend the use of the FREQ variable in Enterprise Miner.

In some modeling methods, such as decision trees, fractional frequencies less than one can cause anomalies.  Enterprise Miner treats each row as an observation in predictive modeling, so a node with many rows can appear to contain many observations even though their cumulative frequency is still very close to zero.  This can lead to errors in the Tree node, which uses the number of observations in a node to decide when splitting should occur and does not check for fractional frequencies, since these are expected to be positive integer values in a modeling data set.
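
To make the mismatch concrete, here is a small Python sketch. It is only an illustration and does not reproduce Enterprise Miner's actual splitting logic: the helper names and the minimum size of 10 are hypothetical. It shows that the row count and the sum of frequencies in a node can give opposite answers against the same threshold, and that multiplying every frequency by a common constant changes the frequency sum while leaving the row count untouched.

```python
def node_counts(freqs):
    """Return (number of rows, sum of frequencies) for the rows in one node."""
    return len(freqs), sum(freqs)


def passes_threshold(count, min_size):
    """Hypothetical minimum-size check; min_size is an assumed setting."""
    return count >= min_size


MIN_SIZE = 10  # assumed threshold, for illustration only

# 200 rows, each with a small fractional frequency.
freqs = [0.02] * 200
rows, freq_sum = node_counts(freqs)

# The same rows after multiplying every frequency by a common constant.
scaled_rows, scaled_sum = node_counts([f * 50 for f in freqs])

print(rows, passes_threshold(rows, MIN_SIZE))              # row count
print(freq_sum, passes_threshold(freq_sum, MIN_SIZE))      # frequency sum
print(scaled_sum, passes_threshold(scaled_sum, MIN_SIZE))  # rescaled sum
```

Any rule that looks at the frequency sum will react to a rescaling of the weights, while any rule that looks at the row count will not, which is one way a tree could change when all the weights are multiplied by a constant.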

I hope this helps!

Doug
