cmajorros
Calcite | Level 5

Dear all experts,

Hi, I am new to the data mining field and have been assigned to a project. The objective of the project is to find the characteristics that distinguish good customers from bad ones.

Therefore I designed my experiment by following these steps:

1) Handle missing values with the "Replacement" node. (Some fields show a missing value only because no matching record was found in the database. Take the bankruptcy field: a customer with a previous bankruptcy record is shown as "Yes", and everyone else is shown as ".". This kind of missing value should therefore be replaced with "No".)

2) Drop missing values using the "Impute" node

3) Oversample the data (good and bad customers should be in equal proportion, 50:50)

4) Reduce multicollinearity and identify the potential factors to use in clustering later, using the "Regression" node

5) Clustering the data
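
To make steps 1 and 3 concrete outside Enterprise Miner, here is a minimal Python sketch. It is not what the Replacement or Sample nodes do internally. The field names (`bankrupt`, `status`) and the helper functions are hypothetical, and duplicating minority rows at random is only one of several ways to reach a 50:50 split.

```python
import random
from collections import Counter

def replace_missing(rows, field, missing=".", fill="No"):
    """Return copies of rows in which `field` equal to `missing` becomes `fill`."""
    return [{**r, field: fill if r[field] == missing else r[field]} for r in rows]

def oversample_to_balance(rows, target, seed=0):
    """Randomly duplicate rows of the smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    groups = {}
    for r in rows:
        groups.setdefault(r[target], []).append(r)
    largest = max(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(g)
        # Sample with replacement to fill the gap (k is 0 for the largest class).
        balanced.extend(rng.choices(g, k=largest - len(g)))
    return balanced

# Hypothetical toy data: "." means no bankruptcy record was found.
customers = [
    {"id": 1, "bankrupt": "Yes", "status": "bad"},
    {"id": 2, "bankrupt": ".", "status": "good"},
    {"id": 3, "bankrupt": ".", "status": "good"},
    {"id": 4, "bankrupt": ".", "status": "bad"},
    {"id": 5, "bankrupt": ".", "status": "good"},
    {"id": 6, "bankrupt": "Yes", "status": "good"},
]

cleaned = replace_missing(customers, "bankrupt")
balanced = oversample_to_balance(cleaned, "status")
print(Counter(r["status"] for r in balanced))
```

Replacing "." with "No" is only safe for fields where a missing match really does mean "no record", as with the bankruptcy flag described in step 1.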

Do you see any problems with my experimental design? Please tell me if there are any steps I should change. I also still have a problem with step 5, clustering: I know that the target variable cannot be used in clustering. So after step 4, should I split the data into two groups, Good and Bad, and apply clustering within each group? I am not sure whether my design is correct. Are there any examples or literature on this?



Thanks in advance for all your help. I look forward to hearing from you soon.


Best regards,

Ros

7 REPLIES
Ksharp
Super User

I suggest PROC LOGISTIC.

cmajorros
Calcite | Level 5

I have never tried this before. Is there a demonstration that shows how to use it and how it works?

stat_sas
Ammonite | Level 13

Hi,

There are a few things that might be helpful in your design.

1.     If your variables have a lot of missing values, say 50% or more, then it's better to drop those variables from further analysis. We can't always assume that missing means 'No'.

2.     I'm not sure what you mean by dropping missing values using the Impute node.

3.     After oversampling, you will have data clustered on your target variable. You can run a cluster analysis for a two-cluster solution based on the independent variables and compare target and non-target customers within each cluster. This will give you an idea of how significant the independent variables are in separating target and non-target customers.
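
As a rough illustration of point 3, the sketch below clusters on the independent variables only and then cross-tabulates cluster membership against the target. It is a toy k-means written with the Python standard library, not the algorithm behind the Enterprise Miner Cluster node. The data, the `target` labels, and the farthest-first choice of starting centers are all assumptions made for the example.

```python
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points]

def kmeans(points, k=2, iters=20):
    """Toy k-means with a deterministic farthest-first initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points,
                           key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iters):
        labels = assign(points, centers)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return assign(points, centers)

# Hypothetical independent variables (two per customer) and target labels.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
target = ["good", "good", "bad", "bad", "bad", "good"]

labels = kmeans(points, k=2)           # the target is NOT used for clustering
crosstab = Counter(zip(labels, target))
for (cluster, status), n in sorted(crosstab.items()):
    print(f"cluster {cluster}  {status:<4}  {n}")
```

If one cluster is dominated by good customers and the other by bad ones, the independent variables carry information about the target; if both clusters show roughly the same mix, they do not separate the two groups well.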

cmajorros
Calcite | Level 5

Dear stat_sas,

Thanks for your reply. I first used the Replacement node for the factors whose value should be 0 rather than missing. These are missing only because no matching record was found when I mapped the data. For the other factors, if the missing values exceed 50 percent, the factor is eliminated by the Impute node.

I have one more question. I have more than 100 factors in my experiment. I think I need to eliminate the factors that are not related to being a good or bad customer by using the Regression node. Is it OK to run the regression before separating the data into 2 groups (Good and Bad)?

stat_sas
Ammonite | Level 13

Yes, this is step 4 in your design.

paguiar
Calcite | Level 5

Use the Variable Selection node to eliminate variables that have missing values and low correlation with the target.

You can also enable AOV16 to transform interval variables into categorical ones.
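
The screening idea here (drop variables with too many missing values or a weak relationship to the target) can be sketched in plain Python. This is an assumption-laden illustration, not the Variable Selection node's own criteria: the 50% missing cutoff comes from earlier in the thread, while the 0.3 correlation threshold, the column names, and the helper functions are made up for the example.

```python
from math import sqrt

def missing_rate(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either side has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def select_variables(columns, target, max_missing=0.5, min_abs_corr=0.3):
    """Keep columns with few missing values and a non-trivial correlation to target."""
    kept = []
    for name, values in columns.items():
        if missing_rate(values) > max_missing:
            continue
        # Correlate using only the rows where this variable is present.
        pairs = [(v, t) for v, t in zip(values, target) if v is not None]
        if not pairs:
            continue
        xs, ys = zip(*pairs)
        if abs(pearson(xs, ys)) >= min_abs_corr:
            kept.append(name)
    return kept

# Hypothetical data: target 1 = bad customer, 0 = good customer.
target = [1, 1, 1, 0, 0, 0]
columns = {
    "late_payments": [5, 4, 6, 0, 1, 0],                # strongly related to target
    "shoe_size": [40, 42, 41, 41, 40, 42],              # unrelated to target
    "income_verified": [None, None, None, None, 1, 0],  # mostly missing
}

selected = select_variables(columns, target)
print(selected)
```

A simple correlation screen like this looks at one variable at a time, so it does not address multicollinearity between the surviving variables.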

cmajorros
Calcite | Level 5

Thanks, but what is the difference between Variable Selection and Regression? I think both methods are able to eliminate correlation. I have never used Variable Selection before. How does it work?
