If you are looking to identify univariate outliers, you can look at the distribution of each variable in the Replacement node. This node lets you view the values of continuous variables and the levels of categorical variables, and filter values (if desired) based on the following criteria:
For continuous variables, use the Default Limits Method property to choose how the range limits for interval variables are determined, and the Cutoff Values property to change the cutoffs that each limit method uses. The relevant options, from the Replacement node documentation, are:
Default Limits Method — Use the Default Limits Method property to specify the default method to determine the range limits for interval variables. Use any of the methods below.
Mean Absolute Deviation (MAD) — The Mean Absolute Deviation method eliminates values that are more than n deviations from the median. You specify the threshold value for the number of deviations, n, in the Cutoff for MAD property.
User-Specified Limits — The User-Specified Limits method specifies a filter for observations that is based on the interval values that are displayed in the Lower Limit and Upper Limit columns of your data table. You specify these limits in the Interactive Replacement Interval Filter window.
Metadata Limits — Metadata Limits are the lower and upper limit attributes that you can specify when you create a data source or when you are modifying the Variables table of an Input Data node on the diagram workspace.
Extreme Percentiles — The Extreme Percentiles method filters values that are in the top and bottom pth percentiles of an interval variable's distribution. You specify the upper and lower threshold value for p in the Cutoff Percentiles for Extreme Percentiles property.
Modal Center — The Modal Center method eliminates values that are more than n spacings from the modal center. You specify the threshold value for the number of spacings, n, in the Cutoff for Modal Center property.
Standard Deviations from the Mean — (default setting) The Standard Deviations from the Mean method filters values that are greater than or equal to n standard deviations from the mean. You must use the Cutoff for Standard Deviation property to specify the threshold value that you want to use for n.
None — Do not filter interval variables
Cutoff Values — Click the ellipses (...) button to the right of the Cutoff Values property to open the Cutoff Values window. You use the Cutoff Values window to modify the cutoff values for the various limit methods available in the Default Limits Method property.
MAD — When you specify Mean Absolute Deviation as your Default Limits Method, you must use the MAD property of the Replacement node to quantify n, the threshold value for the number of deviations from the median value. Specify the number of deviations from the median to be used as cutoff value. That is, values that are that many mean absolute deviations away from the median will be used as the limit values. When set to User-Specified the values specified using the Interval Editor are used. When set to Missing, blanks or missing values are used as the replacement values. Permissible values are real numbers greater than or equal to zero. The default value is 9.0.
Percentiles for Extreme Percentiles — When you specify Extreme Percentiles as your Default Limits Method, you must use the Percentiles for Extreme Percentiles property to specify p, the threshold value used to quantify the top and bottom pth percentiles. Permissible values are percentages greater than or equal to 0 and less than 50. (P specifies upper and lower thresholds, 50% + 50% = 100%.) The default value is 0.5, or 0.5%.
Modal Center — When you specify Modal Center as your Default Limits Method, you must use the Modal Center property to specify the threshold number of spacings, n. That is, values that are that many spacings away from the modal center will be used as the limit values. Permissible values are real numbers greater than or equal to zero. The default value is 9.0.
Standard Deviation — Use the Standard Deviation property to quantify n, the threshold for number of standard deviations from the mean. That is, values that are that many standard deviations away from the mean will be used as the limit values. Permissible values are real numbers greater than or equal to zero. The default value is 3.0.
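To make the cutoffs above concrete, here is a rough Python sketch of how three of the limit methods could be computed outside Enterprise Miner. This is not the Replacement node's implementation: the function names are hypothetical, the MAD definition (mean of absolute deviations from the median) is my literal reading of the documentation text, and the percentile interpolation rule is an assumption. The defaults (3.0, 9.0, 0.5) are the documented default cutoffs.

```python
import math
import statistics


def sd_limits(values, n=3.0):
    """Mean +/- n sample standard deviations."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return mean - n * sd, mean + n * sd


def mad_limits(values, n=9.0):
    """Median +/- n mean absolute deviations from the median (assumed definition)."""
    med = statistics.median(values)
    mad = statistics.fmean([abs(v - med) for v in values])
    return med - n * mad, med + n * mad


def percentile_limits(values, p=0.5):
    """p-th and (100 - p)-th percentiles, using linear interpolation (assumed)."""
    ordered = sorted(values)

    def pct(q):
        k = (len(ordered) - 1) * q / 100.0
        lo, hi = math.floor(k), math.ceil(k)
        return ordered[lo] + (ordered[hi] - ordered[lo]) * (k - lo)

    return pct(p), pct(100.0 - p)


def outside(values, limits):
    """Values strictly outside the (lower, upper) limits."""
    lower, upper = limits
    return [v for v in values if v < lower or v > upper]


# Made-up illustration data: twenty ordinary values and one extreme value.
values = [10, 11, 12, 13, 14] * 4 + [100]

for name, limits in [
    ("standard deviations", sd_limits(values)),
    ("MAD", mad_limits(values)),
    ("extreme percentiles", percentile_limits(values)),
]:
    print(name, limits, outside(values, limits))
```

Note that a single extreme value inflates the mean and standard deviation, so the standard deviation limits are the widest of the three here. That is one reason to compare methods before settling on a cutoff.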
You can click the ellipsis (...) button to the right of Replacement Editor under the Interval Variables (or Class Variables) section to view the range of values interactively and choose custom settings for each variable. However, reviewing variables one at a time can be extraordinarily time-consuming in typical data mining scenarios.
If you are looking to identify multivariate outliers, you might consider building principal components from your interval inputs and then looking for outliers on the individual PCs that are generated. This might lead you to identify observations that are not unusual in any single dimension but are unusual when several dimensions are considered together.
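As a rough illustration of the principal components idea (a sketch in Python, not Enterprise Miner code), consider just two correlated inputs. For two standardized variables the principal axes are always (1, 1)/sqrt(2) and (1, -1)/sqrt(2), so the PC scores can be computed without a linear algebra library. The function names, the made-up data, and the choice of a 3 standard deviation cutoff on each PC are all assumptions for the example.

```python
import math
import statistics


def zscores(values):
    """Standardize to mean 0 and sample standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return [0.0 for _ in values]  # constant column: nothing to flag
    return [(v - mean) / sd for v in values]


def pc_outliers_2d(x, y, n_sd=3.0):
    """Indices of rows that are n_sd or more SDs from the mean on either PC."""
    zx, zy = zscores(x), zscores(y)
    # Principal axes of two standardized variables: (1, 1) and (1, -1), scaled.
    pc_sum = [(a + b) / math.sqrt(2) for a, b in zip(zx, zy)]
    pc_diff = [(a - b) / math.sqrt(2) for a, b in zip(zx, zy)]
    flagged = set()
    for scores in (pc_sum, pc_diff):
        for i, s in enumerate(zscores(scores)):
            if abs(s) >= n_sd:
                flagged.add(i)
    return sorted(flagged)


# Made-up data: y tracks x closely for 20 rows, then one row breaks the pattern.
x = [float(i) for i in range(20)] + [5.0]
y = [i + ((i * 7) % 5 - 2) * 0.3 for i in range(20)] + [15.0]

# Largest univariate z-score magnitude for each input.
print(max(abs(z) for z in zscores(x)), max(abs(z) for z in zscores(y)))
# Rows flagged on the principal components.
print(pc_outliers_2d(x, y))
```

The last row (x = 5, y = 15) sits comfortably inside the range of both variables, so a univariate filter leaves it alone, but it lies far from the x ≈ y pattern and stands out on the second component. With more than two inputs you would compute the components with a proper PCA routine rather than this closed form.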
While there is motivation to consider excluding outliers in certain clustering situations, where the solution might otherwise be driven by extremely small outlying clusters, it is typically problematic to ignore data from a predictive modeling standpoint. Tree-based methods can minimize the effect of outliers, since outliers do not carry the excessive weight that they do in many distance-based optimization methods.
I hope this helps!
Cordially,
Doug