Re: Clustering with nominal data

uguraltuntas67 · Posted 03-04-2021 07:23 AM

Hello everyone,

I have a dataset like the one below. User_ID is not unique per row, and the Previous_Jobs and Gender variables are categorical. I have 8 million rows and I want to cluster users.

User_ID, Previous_Jobs, Experience(Month), Gender, Age

1, Driver, 1.15, M, 28

1, Cleaner, 6, M, 28

1, Accountant, 24, M, 28

2, Data Analyst, 36, F, 26

2, Data Scientist, 12, F, 26

These are my variables and I want to use all of them. I am using SAS 9.4.

I can cluster on Enterprise Guide or Enterprise Miner.

How can I handle the repeated User_ID values and the multiple Previous_Jobs and Experience rows per user? How can I cluster on these variables, including the categorical ones?

Is there any method, paper, flow or suggestion that you can share with me?

JThompson · Posted 06-01-2021 11:58 AM

I'm not sure I'll be able to answer all your questions, but let me provide some feedback. BTW: I am an Enterprise Miner user, so my answers will pertain to that tool. I've never built clusters in Enterprise Guide. In E-Miner, your data must have one unique ID per row. The data you have listed is "transactional", where a unique ID can fall on more than one row, and this type of data cannot be used to build clusters in E-Miner. You must have one row per ID.
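To illustrate the "one row per ID" requirement, here is a minimal sketch in plain Python (not SAS code) that rolls the sample rows from the question up to one record per user. The summary columns (n_jobs, total_months, months_by_job) and the function name are hypothetical choices for illustration, not something the cluster node requires; the same roll-up could be done in SAS before the data reaches E-Miner.

```python
from collections import defaultdict

# Sample rows from the question:
# (user_id, previous_job, experience_months, gender, age)
rows = [
    (1, "Driver", 1.15, "M", 28),
    (1, "Cleaner", 6, "M", 28),
    (1, "Accountant", 24, "M", 28),
    (2, "Data Analyst", 36, "F", 26),
    (2, "Data Scientist", 12, "F", 26),
]

def to_one_row_per_user(rows):
    """Collapse transactional rows into one summary record per user_id.

    The summary fields are illustrative assumptions; pick whatever
    aggregates make sense for your own clustering goal.
    """
    users = {}
    for user_id, job, months, gender, age in rows:
        rec = users.setdefault(user_id, {
            "user_id": user_id,
            "gender": gender,          # assumed constant per user
            "age": age,                # assumed constant per user
            "n_jobs": 0,
            "total_months": 0.0,
            "months_by_job": defaultdict(float),
        })
        rec["n_jobs"] += 1
        rec["total_months"] += months
        rec["months_by_job"][job] += months
    return list(users.values())

per_user = to_one_row_per_user(rows)
for rec in per_user:
    print(rec["user_id"], rec["n_jobs"], rec["total_months"])
```

With 8 million rows and many distinct job titles, a per-job column for every title may become very wide, so grouping job titles into a smaller set of categories first may be worth considering.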

Also, in E-Miner, the cluster node can use categorical inputs, but this can have a direct, and sometimes negative, effect on the results. The way E-Miner handles categorical inputs is by dummy coding. In most cases, the levels of your categorical input will drive the clusters that are formed, meaning you are not likely to see different levels fall within the same cluster. For this reason, we usually try to avoid using categorical inputs in clustering unless necessary. The "experience" variable you give as an example should work fine, as it is interval (meaning numeric).
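The effect of dummy coding on distances can be seen with a small sketch in plain Python. It assumes simple 0/1 dummy coding with one column per level and a squared Euclidean distance; the exact encoding and scaling E-Miner applies may differ, and the three users and their standardized experience values are made up for illustration.

```python
def dummy_code(value, levels):
    """Return one 0/1 indicator per level (simple one-hot coding)."""
    return [1.0 if value == level else 0.0 for level in levels]

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

gender_levels = ["F", "M"]

# Hypothetical users: [standardized experience] + dummy-coded gender
user_a = [0.2] + dummy_code("M", gender_levels)
user_b = [0.3] + dummy_code("M", gender_levels)  # same level, different experience
user_c = [0.2] + dummy_code("F", gender_levels)  # same experience, different level

d_same_level = squared_distance(user_a, user_b)
d_diff_level = squared_distance(user_a, user_c)

print(d_same_level, d_diff_level)
```

Under these assumptions, a mismatch in the nominal level alone adds 2.0 to the squared distance, which is far larger than the small difference in the interval input. That is one way to see why the levels of a categorical input tend to drive which cluster a row lands in.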

Finally, I do not have a paper or resource to share, but if you check the E-Miner help/documentation, the document on the cluster node may be helpful. It includes an example of how dummy coding is used for a nominal input.

Hope this helps.

Jeff

Kemal_Sozer · Posted 03-08-2022 09:42 AM

Hello,

I think this page answers your question:

Clustering Nominal Variables

BR.

Kemal.
