SMOTE with missing values

NicolasC — Thu, 11 Jan 2018 16:55:23 GMT

Hi there

I would like to try the SMOTE method using the code available online. I am wondering how missing values in the dataset affect the code when it runs. My guess is that it is best to deal with the missing values first and run the SMOTE code afterwards. Am I correct?

Thanks

Nicolas

Re: SMOTE with missing values

DougWielenga — Fri, 12 Jan 2018 16:36:44 GMT

The answer to your question depends on what approach you are using to apply SMOTE. In another SAS communities discussion available at

https://communities.sas.com/t5/SAS-Data-Mining-and-Machine/SAS-Enterprise-Miner-SMOTE-sampling-with-categorical-variables/m-p/394037/thread-id/5980/highlight/true#M6009

there is a link to code that uses the MODECLUS procedure to implement this approach. The MODECLUS documentation is part of the SAS/STAT documentation, available at

http://support.sas.com/documentation/onlinedoc/stat/

where it states the following:

Missing Values

If the data are coordinates, observations with missing values are excluded from the analysis. If the data are distances, missing values are treated as infinite. The neighbors of each observation are determined solely by the distances in that observation. The distances are not required to be symmetric, and there is no check for symmetry; the neighbors of each observation are determined only from the distances in that observation. This treatment of missing values is quite different from that of the CLUSTER procedure, which ignores the upper triangle of the distance matrix.

How this affects your results depends in part on the type of data and the nature of the missing values. Replacing the missing values will likely give different results, and it is unclear whether those results will be better. Missing values are typically replaced when it is necessary to avoid ignoring a large percentage of the data. In SAS Enterprise Miner it is not necessary to replace all missing values, because you can generate a set of cluster seeds from a subset of the original data. However, as the number of observations with missing values on the input variables increases, you move toward a less meaningful solution (which still might be better than no solution at all, but not necessarily!).
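To make the trade-off concrete, here is a rough sketch in plain Python (not SAS, and not the MODECLUS-based code linked above). It assumes coordinate data, drops observations with missing values in the same spirit as the documentation excerpt, and then builds synthetic minority rows by interpolating toward a nearest neighbor, which is the basic SMOTE idea. The function names `complete_cases` and `smote_like` and the sample data are hypothetical and only for illustration.

```python
import math
import random

def complete_cases(rows):
    """Keep only rows with no missing (None or NaN) values."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))
    return [r for r in rows if not any(is_missing(v) for v in r)]

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbors (Euclidean distance)."""
    if len(minority) < 2:
        raise ValueError("need at least two complete minority rows")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other rows by distance to the chosen base row.
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Hypothetical minority-class data; None marks a missing value.
minority_raw = [(1.0, 2.0), (1.5, None), (2.0, 2.5),
                (None, 3.0), (1.2, 2.2), (1.8, 2.8)]
minority = complete_cases(minority_raw)
new_points = smote_like(minority, n_new=4)
print(len(minority_raw) - len(minority), "rows dropped;",
      len(new_points), "synthetic rows created")
```

In this toy example a third of the minority rows are lost before any neighbors are computed, so the synthetic rows reflect only the complete cases. If you impute first instead, those rows come back, but the imputed values then influence which neighbors are chosen.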

Hope this helps!

Doug
