NicolasC
Fluorite | Level 6

Hi there

 

I would like to try the SMOTE method using the code available online. I am wondering how missing values in the dataset would affect the code. My guess is that it is best to deal with the missing values first and run the SMOTE code afterwards. Am I correct?

 

Thanks

 

Nicolas

 

Accepted Solution
DougWielenga
SAS Employee

The answer to your question depends on which approach you are using to apply SMOTE. In another SAS Communities discussion, available at

 

https://communities.sas.com/t5/SAS-Data-Mining-and-Machine/SAS-Enterprise-Miner-SMOTE-sampling-with-...

 

there is a link to code that uses the MODECLUS procedure to attempt this approach. The MODECLUS documentation is part of the SAS/STAT documentation, available at

 

  http://support.sas.com/documentation/onlinedoc/stat/

 

where it states the following:

Missing Values

If the data are coordinates, observations with missing values are excluded from the analysis. If the data are distances, missing values are treated as infinite. The neighbors of each observation are determined solely by the distances in that observation. The distances are not required to be symmetric, and there is no check for symmetry; the neighbors of each observation are determined only from the distances in that observation. This treatment of missing values is quite different from that of the CLUSTER procedure, which ignores the upper triangle of the distance matrix.
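To get a feel for what "excluded from the analysis" can cost with coordinate data, here is a minimal Python sketch. It is not the MODECLUS-based code from the linked thread; the data and the helper name `complete_cases` are made up for illustration. It only counts how many minority-class rows survive if every row with a missing coordinate is dropped.

```python
import math


def is_missing(v):
    """Treat None and NaN as missing."""
    return v is None or (isinstance(v, float) and math.isnan(v))


def complete_cases(rows):
    """Keep only rows with no missing coordinate (hypothetical helper)."""
    return [r for r in rows if not any(is_missing(v) for v in r)]


# Hypothetical minority-class observations with two inputs (x1, x2).
minority = [
    (1.0, 2.0),
    (1.5, None),
    (2.0, 2.5),
    (float("nan"), 3.0),
    (2.5, 3.5),
]

kept = complete_cases(minority)
dropped_share = 1 - len(kept) / len(minority)

print(len(kept), "of", len(minority), "rows kept")
print(f"share excluded: {dropped_share:.0%}")
```

If the excluded share is small, leaving the missing values alone may cost little; if it is large, the neighbors are being determined from only a fraction of the rare class.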

 

How this impacts your data ultimately depends in part on the type of data and the nature of the missing values. Replacing the missing values will likely give different results, and it is unclear whether those results will be better. Missing values are typically replaced when leaving them alone would mean ignoring a large percentage of the data. In SAS Enterprise Miner it is not necessary to replace all missing values, because you can generate a set of cluster seeds from a subset of the original data. But as the number of observations with missing values on the input variables increases, you move toward a less meaningful solution (which still might be better than no solution at all, but not necessarily!).
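The trade-off can be sketched in Python as follows. This is only an illustration under stated assumptions: the data are made up, mean imputation stands in for whatever replacement method you would actually choose, and `smote_like` is a hypothetical, bare-bones version of the SMOTE interpolation idea rather than the code from the linked thread or any SAS procedure.

```python
import math
import random


def is_missing(v):
    """Treat None and NaN as missing."""
    return v is None or (isinstance(v, float) and math.isnan(v))


def mean_impute(rows):
    """Replace each missing value with the mean of the observed values
    in its column. Assumes every column has at least one observed value."""
    ncol = len(rows[0])
    means = []
    for j in range(ncol):
        observed = [r[j] for r in rows if not is_missing(r[j])]
        means.append(sum(observed) / len(observed))
    return [
        tuple(means[j] if is_missing(r[j]) else r[j] for j in range(ncol))
        for r in rows
    ]


def smote_like(rows, n_new, k=2, seed=0):
    """Create n_new synthetic points, each interpolated between a randomly
    chosen point and one of its k nearest neighbors (Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(rows)
        others = sorted(
            (r for r in rows if r is not base),
            key=lambda r: math.dist(base, r),
        )
        neighbor = rng.choice(others[:k])
        u = rng.random()  # position along the segment from base to neighbor
        synthetic.append(
            tuple(b + u * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Hypothetical minority-class observations with two inputs (x1, x2).
minority = [
    (1.0, 2.0),
    (1.5, None),
    (2.0, 2.5),
    (float("nan"), 3.0),
    (2.5, 3.5),
]

# Option A: leave missing values alone, so incomplete rows drop out.
complete = [r for r in minority if not any(is_missing(v) for v in r)]
# Option B: replace missing values first, so every row can be a neighbor.
imputed = mean_impute(minority)

synthetic_a = smote_like(complete, n_new=4)
synthetic_b = smote_like(imputed, n_new=4)

print("rows used without replacement:", len(complete))
print("rows used after replacement:  ", len(imputed))
```

Both options produce synthetic points, but from different neighbor sets. Option B uses every row, at the price of treating the imputed values as if they had been observed, which is why it is unclear in advance which result will be better.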

 

Hope this helps!

Doug


