uguraltuntas67
Fluorite | Level 6

Hello everyone,

 

I have a dataset like the one below. User_ID is not unique (the same user can appear on several rows), and the Previous_Jobs and Gender variables are categorical. I have 8 million records and I want to cluster the users.

 

User_ID, Previous_Jobs,   Experience(Month), Gender, Age
1,       Driver,          1.15,              M,      28
1,       Cleaner,         6,                 M,      28
1,       Accountant,      24,                M,      28
2,       Data Analyst,    36,                F,      26
2,       Data Scientist,  12,                F,      26

 

These are my variables and I want to use all of them. I am using SAS 9.4.

I can cluster in Enterprise Guide or Enterprise Miner.

How can I handle the repeated User_ID values together with the Previous_Jobs and Experience columns? How can I cluster on these variables (including the categorical ones)?

Is there any method, paper, flow or suggestion that you can share with me?

2 REPLIES
JThompson
SAS Employee

I'm not sure I'll be able to answer all your questions, but let me provide some feedback. BTW: I am an Enterprise Miner user, so my answers will pertain to that tool. I've never built clusters in Enterprise Guide. In E-Miner, your data must have one unique ID per row. The data you have listed is "transactional," meaning a single ID can appear on more than one row, and this type of data cannot be used to build clusters in E-Miner. You must first summarize it to one row per ID.
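As a rough illustration of what "one row per ID" could look like for the sample rows in the question, here is a minimal Python sketch (not SAS or E-Miner code). The summary columns `n_jobs`, `total_months`, and `max_months` are my own assumptions about what might be worth keeping; they are not prescribed by E-Miner, and the right roll-up depends on what the clusters are meant to capture.

```python
# Sample rows from the question:
# (user_id, previous_job, experience_months, gender, age)
rows = [
    (1, "Driver", 1.15, "M", 28),
    (1, "Cleaner", 6, "M", 28),
    (1, "Accountant", 24, "M", 28),
    (2, "Data Analyst", 36, "F", 26),
    (2, "Data Scientist", 12, "F", 26),
]


def roll_up(rows):
    """Collapse transactional rows into one summary record per user_id.

    The summary fields are hypothetical choices made for illustration.
    """
    users = {}
    for user_id, job, months, gender, age in rows:
        rec = users.setdefault(
            user_id,
            {"n_jobs": 0, "total_months": 0.0, "max_months": 0.0,
             "gender": gender, "age": age},
        )
        rec["n_jobs"] += 1                                   # number of previous jobs
        rec["total_months"] += months                        # total experience
        rec["max_months"] = max(rec["max_months"], months)   # longest single job
    return users


per_user = roll_up(rows)
for user_id, rec in sorted(per_user.items()):
    print(user_id, rec)
```

Gender and age are constant within a user in the sample, so the sketch simply keeps the first value it sees. In SAS the same kind of summary would normally be done with an aggregation step before the data reaches the cluster node.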

Also, in E-Miner, the cluster node can use categorical inputs, but this can have a direct, and sometimes negative, effect on the results. The way E-Miner handles categorical inputs is by dummy coding. In most cases, the levels of your categorical input will drive the clusters that are formed, meaning you are not likely to see different levels fall within the same cluster. For this reason, we usually try to avoid using categorical inputs in clustering unless necessary. The "experience" variable you give as an example should work fine, as it is interval (meaning numeric).
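To make the dummy-coding point concrete, here is a small Python sketch of generic 0/1 dummy coding for the job levels in the question. It is only an illustration under my own assumptions; E-Miner's exact internal coding and scaling may differ, and the helper name `dummy_code` is hypothetical.

```python
import math


def dummy_code(value, levels):
    """Return a 0/1 indicator vector with a 1 in the position of `value`."""
    return [1.0 if value == level else 0.0 for level in levels]


# Job levels taken from the sample data in the question.
levels = sorted({"Driver", "Cleaner", "Accountant",
                 "Data Analyst", "Data Scientist"})

driver = dummy_code("Driver", levels)
cleaner = dummy_code("Cleaner", levels)

# Euclidean distance contributed by the job variable alone.
same_level = math.dist(driver, driver)   # identical levels
diff_level = math.dist(driver, cleaner)  # any two different levels

print(levels)
print(driver, cleaner)
print(same_level, diff_level)
```

With this coding, any two different levels are the same fixed distance apart and identical levels are at distance zero. There is no "slightly different" job, which is one way to see why the levels of a nominal input tend to split the clusters along category lines.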

Finally, I do not have a paper or resource to share, but if you check the E-Miner help/documentation, the document on the Cluster node may be helpful. It includes an example of how dummy coding is used for a nominal input.

Hope this helps.

Jeff

Kemal_Sozer
Fluorite | Level 6

Hello,

 

I think that this page is your answer. 

 

Clustering Nominal Variables

 

BR.

 

Kemal.
