Is there a way (node) in Miner that will tell which observations fall into particular percentiles and/or quantiles for a particular variable?
For example: which observations of Speed fall beyond the 85th percentile?
Thanks
It would be helpful to better understand what you hope to do with the observations. Data mining data sets typically contain a huge number of observations, so writing out the observations that meet a criterion like the one you described is not particularly useful in most situations. It would be easy to run some simple code, such as the MEANS or UNIVARIATE procedure in a SAS Code node, to get specific statistics, but you would likely be better off using the StatExplore node, or simply exploring the data exported from a particular node and using the Plot wizard to build graphs of interest.
To explore the exported data, click on a particular node and then click on the ... to the right of Exported Data in the General properties section of the node properties panel. From here, you can click on Explore... to obtain a sample of the data for exploration. From there, you can click on Actions --> Plot (or just click on the Plot icon) and build a graph of interest. It is typically not practical to plot the whole data set, but you can modify the Sample Properties options to increase the Fetch Size to Max, which is the maximum that can be downloaded to the SAS Enterprise Miner client. You could also consider creating indicator variables that identify when a variable is above or below some threshold of interest.
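As a rough sketch of the indicator-variable idea, something like the following could be run in a SAS Code node. The data set name work.mydata and the output names are placeholders I made up for illustration; Speed is the variable from the question. Inside a SAS Code node you would point at the node's imported data instead of work.mydata.

```sas
/* Compute the 85th percentile of Speed and store it in work.pctl. */
/* With PCTLPRE=P and PCTLPTS=85 the output variable is named P85. */
proc univariate data=work.mydata noprint;
  var Speed;
  output out=work.pctl pctlpts=85 pctlpre=P;
run;

/* Attach the single-row percentile to every observation and       */
/* create a 0/1 indicator for values above the 85th percentile.    */
data work.flagged;
  if _n_ = 1 then set work.pctl;   /* P85 is retained across rows  */
  set work.mydata;
  above_p85 = (Speed > P85);       /* missing Speed gives 0        */
run;
```

The flag can then be used as a filter or grouping variable in later nodes, which is usually more practical than listing the observations themselves.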
If you can explain more about what you hope to do with those observations, I might be able to provide some better approaches.
Hope this helps!
Doug
Thanks.
I was hoping to use the mean along with the upper quantiles to identify outliers.
If you are looking to identify univariate outliers, you can look at the distribution of each variable in the Replacement node. This node allows you to visualize the values/levels of continuous/categorical data and to filter the values (if desired).
For continuous variables, you can use the Default Limits Method property to specify a default method for determining the range limits for interval variables, or the Cutoff Values property to modify the cutoff values for the various limit methods. Both are described in the Replacement node documentation.
You can click on the ... to the right of Replacement Editor under the Interval Variables (or Class Variables) section to interactively view the range of values and choose custom settings for each variable. However, reviewing individual variables manually can be extraordinarily time-consuming in typical data mining scenarios.
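If you would rather see what a limit rule does in code, here is a minimal sketch of one common rule: flagging values more than 3 standard deviations from the mean. This is only an illustration and is not meant to reproduce exactly what the Replacement node does internally. The data set work.mydata, the cutoff of 3, and the new variable names are my own assumptions.

```sas
/* Get the mean and standard deviation of Speed in a one-row data set. */
proc means data=work.mydata noprint;
  var Speed;
  output out=work.stats mean=speed_mean std=speed_std;
run;

/* Compute lower/upper limits and flag observations outside them. */
data work.limits_flagged;
  if _n_ = 1 then set work.stats(keep=speed_mean speed_std);
  set work.mydata;
  lower = speed_mean - 3*speed_std;
  upper = speed_mean + 3*speed_std;
  /* Guard against missing values, which SAS sorts below any number. */
  outside_limits = (not missing(Speed)) and (Speed < lower or Speed > upper);
run;
```

Changing the multiplier, or swapping the limits for percentiles, gives the other kinds of rules discussed in this thread.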
If you are looking to identify multivariate outliers, you might consider building principal components from your interval inputs and then looking for outliers on the individual PCs that are generated. This might lead you to identify observations that are not necessarily unusual in any single dimension but are unusual when multiple dimensions are considered together.
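A minimal sketch of the principal components idea in SAS code might look like this. The inputs x1-x3, the data set names, and the cutoff of 3 are hypothetical and only serve to show the mechanics; in practice you would list your own interval inputs and pick a cutoff that suits your data.

```sas
/* Score each observation on the principal components.             */
/* STD scales the component scores to unit variance, so a cutoff   */
/* on the absolute score is comparable across components.          */
proc princomp data=work.mydata out=work.pcs std noprint;
  var x1 x2 x3;
run;

/* Flag observations that are extreme on any component.            */
/* The OUT= data set holds the scores as Prin1, Prin2, Prin3.      */
data work.pc_flagged;
  set work.pcs;
  array pc {*} Prin1-Prin3;
  pc_outlier = 0;
  do i = 1 to dim(pc);
    if abs(pc{i}) > 3 then pc_outlier = 1;
  end;
  drop i;
run;
```

Observations with pc_outlier = 1 are candidates for a closer look; they are not necessarily errors.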
While there is motivation to consider excluding outliers in certain clustering situations, where the solution might otherwise be driven by extremely small outlying clusters, it is typically problematic to ignore data from a predictive modeling standpoint. Tree-based methods can minimize the effect of outliers, since outliers do not carry excessive weight as they do in many distance-based optimization methods.
I hope this helps!
Cordially,
Doug