SAS Enterprise Miner: SMOTE sampling with categorical variables

Posted 09-07-2017 04:10 PM
(6653 views)

Hi everybody,

Do you know a way to perform SMOTE sampling in SAS Enterprise Miner with categorical variables?

I have found SAS code that does this:

http://support.sas.com/resources/papers/proceedings15/3282-2015.zip

The problem is that my data set contains categorical explanatory variables, and the code only handles numeric variables.

However, I know that categorical variables can be handled (https://www.jair.org/media/953/live-953-2037-jair.pdf), for example with a suitable R package for SMOTE.

If you have a SAS code example that performs this task, I would really appreciate seeing how it works.

Thank you so much for your help,

Marco

1 ACCEPTED SOLUTION

The code you found uses the MODECLUS procedure, which (as you pointed out) is intended for numeric data. It also does not scale to the size of typical data mining data sets. The Cluster node in SAS Enterprise Miner does allow categorical variables when creating a cluster solution and can handle large-scale data. You might therefore consider creating clusters with the Cluster node and then sampling from the resulting segments as needed to achieve a similar effect.

The challenge with including categorical variables in a cluster solution is that they are already natural segmenting variables, with all their mass at a set of distinct points, whereas numeric variables are typically spread across a much larger set of values that must be grouped around centroids. The resulting clusters, however, typically do not break cleanly along the categorical variable levels and might be more difficult to explain.

Hope this helps!

Doug
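For readers who want to see the "cluster, then sample from the segments" idea in concrete terms, here is a minimal sketch. It is written in Python rather than SAS, purely to show the logic, and it is not Doug's code or anything Enterprise Miner generates. It assumes each row already carries a segment identifier from some clustering step; the column names `segment` and `target`, the minority label `1`, and the function name are all hypothetical. Note that it duplicates existing minority rows within each segment rather than synthesizing new ones, so it only approximates what SMOTE does.

```python
import random
from collections import defaultdict


def oversample_minority_by_segment(rows, n_extra, target="target",
                                   minority=1, segment="segment", seed=0):
    """Return rows plus extra minority rows drawn with replacement.

    The extra rows are allocated across segments in proportion to each
    segment's share of the minority class. Column names are assumptions.
    """
    rng = random.Random(seed)

    # Group the minority rows by the segment they were assigned to.
    by_segment = defaultdict(list)
    for row in rows:
        if row[target] == minority:
            by_segment[row[segment]].append(row)

    total = sum(len(members) for members in by_segment.values())
    if total == 0:
        raise ValueError("no minority rows to sample from")

    extra = []
    for seg_id in sorted(by_segment):
        members = by_segment[seg_id]
        # Rounding means the overall count can differ slightly from n_extra.
        quota = round(n_extra * len(members) / total)
        extra.extend(dict(rng.choice(members)) for _ in range(quota))

    return rows + extra
```

Sampling within segments keeps the mix of minority "types" roughly as the clustering found it, categorical levels included, instead of letting one dense region dominate the resampled data.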

2 REPLIES

@DougWielenga wrote: The code you found uses the MODECLUS procedure which (as you pointed out) is intended for numerical data. It also has the problem of not being able to scale to the size of typical data mining data sets. The Cluster node in SAS Enterprise Miner does allow for using categorical variables in creating a cluster solution and is capable of handling large scale data. Therefore, you might consider creating clusters with the Cluster node and then sampling from the segments it produces as desired to achieve a similar effect.

Hey Doug,

Could you explain in more detail how we can use the output of the Cluster node with the SMOTE SAS code?

I don't think I understand the idea.

I found this paper on a method that allows categorical variables, but it provides only pseudocode:

http://support.sas.com/resources/papers/proceedings15/3483-2015.pdf

Any ideas on how it could be implemented in SAS code?
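As a starting point for a port, here is a rough sketch of one published way to handle mixed data: SMOTE-NC, as I understand its description in the JAIR paper linked in the original question. It is not necessarily the method in the 3483-2015 paper, and it is written in Python rather than SAS only to make the steps explicit. The function and argument names are hypothetical. It assumes at least one numeric column and at least two minority rows, does not rescale the numeric columns, picks base rows at random instead of looping over every minority row, and breaks majority-vote ties in favor of the nearest neighbour's value.

```python
import random
import statistics
from collections import Counter


def smote_nc(minority, num_cols, cat_cols, n_new, k=5, seed=0):
    """Generate n_new synthetic minority rows from a list of dict rows."""
    if len(minority) < 2 or not num_cols:
        raise ValueError("need >= 2 minority rows and >= 1 numeric column")
    rng = random.Random(seed)

    # Penalty for a categorical mismatch: the median of the standard
    # deviations of the numeric columns, computed on the minority class.
    med = statistics.median(
        [statistics.stdev([row[c] for row in minority]) for c in num_cols]
    )

    def dist(a, b):
        # Euclidean distance on numeric columns, plus med^2 for every
        # categorical column on which the two rows disagree.
        d2 = sum((a[c] - b[c]) ** 2 for c in num_cols)
        d2 += med ** 2 * sum(a[c] != b[c] for c in cat_cols)
        return d2 ** 0.5

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base row (excluding itself).
        neighbours = sorted(
            (r for r in minority if r is not base),
            key=lambda r: dist(base, r),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()

        # Numeric values: interpolate between the base row and the neighbour.
        new = {c: base[c] + gap * (mate[c] - base[c]) for c in num_cols}
        # Categorical values: most frequent level among the k neighbours.
        for c in cat_cols:
            new[c] = Counter(r[c] for r in neighbours).most_common(1)[0][0]
        synthetic.append(new)

    return synthetic
```

The returned rows contain only the listed columns, so the target value would be added back before appending them to the training data.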
