Hi there
I would like to try the SMOTE method using the code available online. I am wondering how missing values in the dataset affect the code. My guess is that it is best to deal with the missing values first and run the SMOTE code afterwards. Am I correct?
Thanks
Nicolas
The answer to your question depends on what approach you are using to apply SMOTE. In another SAS communities discussion available at
there is a link to code which uses the MODECLUS procedure to attempt this approach. The MODECLUS documentation is part of the SAS/STAT documentation, available at
http://support.sas.com/documentation/onlinedoc/stat/
where it states the following:
Missing Values
If the data are coordinates, observations with missing values are excluded from the analysis. If the data are distances, missing values are treated as infinite. The neighbors of each observation are determined solely by the distances in that observation. The distances are not required to be symmetric, and there is no check for symmetry; the neighbors of each observation are determined only from the distances in that observation. This treatment of missing values is quite different from that of the CLUSTER procedure, which ignores the upper triangle of the distance matrix.
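To make the coordinate-data rule concrete, here is a minimal Python sketch (not SAS, and not what MODECLUS does internally) of excluding any observation with a missing coordinate and checking how much data that costs. The toy rows and the helper names `is_missing` and `complete_cases` are made up for illustration.

```python
import math

# Hypothetical toy data: each row is (x1, x2); None marks a missing value.
rows = [
    (1.0, 2.0),
    (1.5, None),
    (2.0, 2.5),
    (None, 3.0),
    (3.0, 3.5),
]

def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def complete_cases(data):
    """Keep only the rows in which every coordinate is present."""
    return [row for row in data if not any(is_missing(v) for v in row)]

kept = complete_cases(rows)
dropped_fraction = 1 - len(kept) / len(rows)
print(f"kept {len(kept)} of {len(rows)} rows; dropped {dropped_fraction:.0%}")
```

Running a check like this on the minority class before applying SMOTE shows how many observations would be silently left out of the neighbor search.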
How this affects your results depends on the type of data and on the nature of the missing values. Replacing the missing values will likely give different results, and it is unclear whether those results will be better. Missing values are typically replaced when excluding incomplete observations would discard a large percentage of the data. In SAS Enterprise Miner it is not necessary to replace all missing values, because you can generate a set of cluster seeds from a subset of the original data. However, as the number of observations with missing values on the input variables grows, the solution becomes less meaningful (which still might be better than no solution at all, but not necessarily!).
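As a rough illustration of the trade-off, the following Python sketch (again not the SAS code from the linked discussion) generates SMOTE-style synthetic points two ways: from complete cases only, and after a simple mean imputation. The data, the function names `mean_impute` and `smote_like`, and the choice of mean imputation are all assumptions made for the example; other replacement methods may suit your data better.

```python
import math
import random

# Hypothetical minority-class rows (x1, x2); None marks a missing value.
minority = [(1.0, 2.0), (1.5, None), (2.0, 2.5), (None, 3.0), (3.0, 3.5)]

def mean_impute(data):
    """Replace None in each column with the mean of that column's observed values."""
    n_cols = len(data[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in data if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [
        tuple(means[j] if row[j] is None else row[j] for j in range(n_cols))
        for row in data
    ]

def smote_like(points, k=2, n_new=4, seed=0):
    """Interpolate between a random point and one of its k nearest neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(points))
        base = points[i]
        others = [p for j, p in enumerate(points) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

complete = [row for row in minority if None not in row]
imputed = mean_impute(minority)

from_complete = smote_like(complete)
from_imputed = smote_like(imputed)
print(len(complete), "complete rows ->", from_complete)
print(len(imputed), "imputed rows  ->", from_imputed)
```

The two runs draw neighbors from different sets of points, so the synthetic observations differ. Neither set is automatically better: the imputed rows keep more data, but they also pull neighbors toward the column means.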
Hope this helps!
Doug
