Re: Calculating Percentiles in Miner

gcarterIT · Posted 10-22-2018 07:50 AM

Is there a way (node) in Miner that will tell which observations fall into particular percentiles and/or quantiles for a particular variable?

For example: which observations of Speed fall beyond the 85th percentile?

Thanks

DougWielenga · Posted 01-15-2019 04:15 PM

It would help to know what you hope to do with the observations. Data mining data sets typically contain a huge number of observations, so writing out the ones that meet a criterion like the one you described is not particularly useful in most situations. It would be easy to run some simple code in a SAS Code node, such as the MEANS or UNIVARIATE procedure, to get specific statistics, but you would likely be better off using the StatExplore node, or simply exploring the data exported from a particular node and using the Plot wizard to build graphs of interest.

To explore the exported data, click on a particular node and then click on the ... to the right of Exported Data in the General section of the node properties panel. From there, click on Explore... to obtain a sample of the data for exploration. Then click on Actions --> Plot (or just click on the Plot icon) and build a graph of interest. It is typically not practical to plot the whole data set, but you can modify the Sample Properties options to increase the Fetch Size to Max, which is the maximum that can be downloaded to the SAS Enterprise Miner client. You could also consider creating indicator variables that identify when a variable is above or below some threshold of interest.
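
As a rough illustration of the indicator-variable idea, here is a minimal sketch. It is plain Python, not Enterprise Miner or SAS code, and it only shows the logic. The Speed values and the function names `percentile` and `flag_above` are made up for the example, and the percentile is computed by linear interpolation, which is only one of several common definitions, so a SAS procedure may return a slightly different cutoff.

```python
def percentile(values, p):
    """Return the p-th percentile (0 <= p <= 100) by linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    # Fractional position of the requested percentile in the sorted data.
    pos = (len(ordered) - 1) * p / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


def flag_above(values, p):
    """Return the cutoff and a 0/1 indicator per observation (1 = above cutoff)."""
    cutoff = percentile(values, p)
    return cutoff, [1 if v > cutoff else 0 for v in values]


# Hypothetical Speed values, made up for illustration.
speed = [42, 55, 61, 48, 70, 95, 52, 58, 66, 50, 47]
cutoff, above_p85 = flag_above(speed, 85)
print(cutoff)     # 68.0
print(above_p85)  # [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]
```

The indicator keeps every observation in the data set and simply marks the ones above the cutoff, which is usually more practical than writing those observations out separately.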

If you can explain more about what you hope to do with those observations, I might be able to provide some better approaches.

Hope this helps!

Doug

gcarterIT · Posted 01-15-2019 09:05 PM

Thanks.

I was hoping to use the mean along with the upper quantiles to identify outliers.

DougWielenga · Posted 01-16-2019 10:20 AM

If you are looking to identify univariate outliers, you can look at the distribution of each variable in the Replacement node. This node allows you to visualize the values/levels of continuous/categorical data and to filter the values (if desired) based on the following criteria:

For continuous variables, you can use the Default Limits Method property to choose how the range limits for interval variables are determined, and the Cutoff Values property to modify the cutoff values used by the various limit methods. The respective options, taken from the Replacement node documentation, are shown below:

Default Limits Method — Use the Default Limits Method property to specify the default method to determine the range limits for interval variables. Use any of the methods below.
- Mean Absolute Deviation (MAD) — The Mean Absolute Deviation method eliminates values that are more than n deviations from the median. You specify the threshold value for the number of deviations, n, in the Cutoff for MAD property.
- User-Specified Limits — The User-Specified Limits method specifies a filter for observations that is based on the interval values that are displayed in the Lower Limit and Upper Limit columns of your data table. You specify these limits in the Interactive Replacement Interval Filter window.
- Metadata Limits — Metadata Limits are the lower and upper limit attributes that you can specify when you create a data source or when you are modifying the Variables table of an Input Data node on the diagram workspace.
- Extreme Percentiles — The Extreme Percentiles method filters values that are in the top and bottom pth percentiles of an interval variable's distribution. You specify the upper and lower threshold value for p in the Cutoff Percentiles for Extreme Percentiles property.
- Modal Center — The Modal Center method eliminates values that are more than n spacings from the modal center. You specify the threshold value for the number of spacings, n, in the Cutoff for Modal Center property.
- Standard Deviations from the Mean — (default setting) The Standard Deviations from the Mean method filters values that are greater than or equal to n standard deviations from the mean. You must use the Cutoff for Standard Deviation property to specify the threshold value that you want to use for n.
- None — Do not filter interval variables
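
To make two of these limit methods concrete, here is a minimal sketch of how the limits for Standard Deviations from the Mean and Extreme Percentiles could be computed. It is plain Python rather than anything the Replacement node runs. The data and the function names `std_limits` and `percentile_limits` are hypothetical, and using the sample standard deviation and linear-interpolation percentiles are assumptions on my part, not documented behavior of the node.

```python
import statistics


def std_limits(values, n=3.0):
    """Limits at n standard deviations on either side of the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (an assumption)
    return mean - n * sd, mean + n * sd


def percentile_limits(values, p=0.5):
    """Limits at the p-th and (100 - p)-th percentiles, linear interpolation."""
    ordered = sorted(values)

    def pct(q):
        pos = (len(ordered) - 1) * q / 100.0
        lower = int(pos)
        upper = min(lower + 1, len(ordered) - 1)
        return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])

    return pct(p), pct(100 - p)


# Hypothetical values with one extreme observation (60).
values = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 60]

# Values at or beyond 3 standard deviations from the mean.
lo, hi = std_limits(values, n=3.0)
outside_std = [v for v in values if v <= lo or v >= hi]
print(outside_std)  # [60]

# Values in the bottom and top 10 percent (10 is used so the tiny sample shows something).
p_lo, p_hi = percentile_limits(values, p=10)
outside_pct = [v for v in values if v < p_lo or v > p_hi]
print(outside_pct)  # [10, 10, 60]
```

Note the difference in behavior: the standard deviation rule flags only the value 60, while a percentile rule always flags roughly the same share of observations at each end, whether or not those observations are unusual.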

Cutoff Values — Click the ellipses (...) button to the right of the Cutoff Values property to open the Cutoff Values window. You use the Cutoff Values window to modify the cutoff values for the various limit methods available in the Default Limits Method property.
- MAD — When you specify Mean Absolute Deviation as your Default Limits Method, you must use the MAD property of the Replacement node to quantify n, the threshold value for the number of deviations from the median value. Specify the number of deviations from the median to be used as cutoff value. That is, values that are that many mean absolute deviations away from the median will be used as the limit values. When set to User-Specified the values specified using the Interval Editor are used. When set to Missing, blanks or missing values are used as the replacement values. Permissible values are real numbers greater than or equal to zero. The default value is 9.0.
- Percentiles for Extreme Percentiles — When you specify Extreme Percentiles as your Default Limits Method, you must use the Percentiles for Extreme Percentiles property to specify p, the threshold value used to quantify the top and bottom pth percentiles. Permissible values are percentages greater than or equal to 0 and less than 50. (P specifies upper and lower thresholds, 50% + 50% = 100%.) The default value is 0.5, or 0.5%.
- Modal Center — When you specify Modal Center as your Default Limits Method, you must use the Modal Center property to specify n, the threshold number of spacings. That is, values that are that many spacings away from the modal center will be used as the limit values. Permissible values are real numbers greater than or equal to zero. The default value is 9.0.
- Standard Deviation — Use the Standard Deviation property to quantify n, the threshold for number of standard deviations from the mean. That is, values that are that many standard deviations away from the mean will be used as the limit values. Permissible values are real numbers greater than or equal to zero. The default value is 3.0.
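
For the MAD cutoff, a minimal sketch of the arithmetic might look like the following. It is plain Python, not the node's implementation. The function name `mad_limits` and the data are hypothetical, and I am assuming the deviation is the median of the absolute deviations from the median with no scaling constant applied. The node's exact definition may differ.

```python
import statistics


def mad_limits(values, n=9.0):
    """Limits at n median absolute deviations on either side of the median."""
    med = statistics.median(values)
    # Assumed definition: median of the absolute deviations from the median.
    mad = statistics.median([abs(v - med) for v in values])
    return med - n * mad, med + n * mad


# Hypothetical values with one extreme observation (60).
values = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 60]

lo, hi = mad_limits(values, n=9.0)
outside = [v for v in values if v < lo or v > hi]
print(lo, hi, outside)  # 3.0 21.0 [60]
```

Because both the center and the spread are medians, the extreme value has very little influence on the limits it is being compared against. One caveat of this sketch: if more than half of the values are identical, the deviation is zero and the limits collapse onto the median.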

You can click on the ... to the right of Replacement Editor under the Interval Variables (or Class Variables) section in order to interactively view the range of values and choose custom settings for each variable. However, reviewing individual variables manually can be extraordinarily time-consuming in typical data mining scenarios.

If you are looking to identify multivariate outliers, you might consider building principal components with your interval inputs and then looking for outliers on the individual PCs that are generated. This might lead you to identify observations that are not unusual in any single dimension but are unusual when multiple dimensions are considered together.
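
To illustrate the principal components idea, here is a minimal two-variable sketch. It is plain Python and not the Principal Components node. The data, the function names `pc_scores_2d` and `flag_scores`, and the cutoff of 2 standard deviations are all hypothetical, and the components are built from the covariance of the raw values, which is an assumption. The last observation, (3, 8), lies inside the range of both variables but far from the pattern the other points follow.

```python
import math
import statistics


def pc_scores_2d(xs, ys):
    """Project centered (x, y) pairs onto the two principal axes."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cx = [x - mx for x in xs]
    cy = [y - my for y in ys]
    n = len(xs)
    sxx = sum(a * a for a in cx) / (n - 1)
    syy = sum(b * b for b in cy) / (n - 1)
    sxy = sum(a * b for a, b in zip(cx, cy)) / (n - 1)
    # Angle of the major axis of a symmetric 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    pc1 = [a * c + b * s for a, b in zip(cx, cy)]
    pc2 = [-a * s + b * c for a, b in zip(cx, cy)]
    return pc1, pc2


def flag_scores(scores, k=2.0):
    """Indices whose score is more than k standard deviations from zero."""
    sd = statistics.stdev(scores)
    return [i for i, v in enumerate(scores) if abs(v) > k * sd]


# Hypothetical data: y tracks x closely, except for the last point (3, 8).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 3]
ys = [1.5, 1.5, 3.5, 3.5, 5.5, 5.5, 7.5, 7.5, 9.5, 9.5, 8]

pc1, pc2 = pc_scores_2d(xs, ys)
flagged = flag_scores(pc2, k=2.0)
print(flagged)  # [10]
```

Neither x = 3 nor y = 8 would be caught by a univariate filter, but the observation stands out on the second component, which measures distance from the main direction of the data.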

While there is reason to consider excluding outliers in certain clustering situations, where the solution might otherwise be driven by extremely small outlying clusters, it is typically problematic to ignore data from a predictive modeling standpoint. Tree-based methods can minimize the effect of outliers, since outliers do not carry the excessive weight there that they do in many distance-based optimization methods.

I hope this helps!

Cordially,

Doug
