09-10-2014 03:17 AM
Dear experts,
Hi, I am new to the data mining field and have been assigned to a project. The objective of my project is to find the characteristics of good customers and bad ones.
Therefore, I designed my experiment with the following steps:
1) Handle missing values using the Replacement node. (Some factors show up as missing only because no record was found in the database. Take the bankrupt-customer flag: if a customer has a bankruptcy record, the field shows "Yes"; if not, it shows ".". This kind of missing value should therefore be replaced with "No".)
2) Drop missing values using the "Impute" node
3) Oversample the data (good and bad customers should be in equal proportion, 50:50)
4) Reduce multicollinearity and identify the potential factors to use in clustering later, using the "Regression" node
5) Cluster the data
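To make step 3 concrete, here is a rough sketch of balancing the two classes to 50:50 by resampling the smaller class with replacement. It is plain Python rather than a SAS Enterprise Miner node, and the function name `oversample_to_balance`, the `status` field, and the toy data are all made up for illustration.

```python
import random


def oversample_to_balance(rows, target_key, seed=0):
    """Resample smaller classes with replacement until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    groups = {}
    for row in rows:
        groups.setdefault(row[target_key], []).append(row)
    largest = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        # Draw the extra rows needed to reach the size of the largest class.
        balanced.extend(rng.choice(group) for _ in range(largest - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical data: 2 bad customers and 8 good customers.
customers = [{"id": i, "status": "Bad" if i < 2 else "Good"} for i in range(10)]
balanced = oversample_to_balance(customers, "status")

counts = {}
for row in balanced:
    counts[row["status"]] = counts.get(row["status"], 0) + 1
print(counts)
```

Reducing the larger class instead would also give a 50:50 split; which one is appropriate depends on how much data there is.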
Do you see any problems with my experimental design? Please let me know if there are any steps I should change. I also still have a problem with step 5, clustering: I know that the target variable cannot be used in a clustering technique. So, after step 4, should I separate the data into two groups, Good and Bad, and apply clustering within each group? I am not sure whether my design is correct. Are there any examples or literature reviews?
Thanks in advance for all your help. I look forward to hearing from you soon.
09-10-2014 01:34 PM
There are a few things that might be helpful in your design.
1. If your variables have a lot of missing values, say 50% or more, it is better to drop those variables from further analysis. We can't always assume that missing means 'No'.
2. I am not sure what you mean by dropping missing values using the Impute node.
3. After oversampling you will have data clustered by your target variable. You can run a cluster analysis for a two-cluster solution based on the independent variables
and correlate target and non-target customers within each cluster. This will give you an idea of how significant the independent variables are in clustering target and non-
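
One way to read point 3, as a rough sketch: cluster on the independent variables only, then cross-tabulate the clusters against the target afterwards. The tiny k-means below is plain Python written for illustration, not the SAS Cluster node; the variable names and data are made up, and real inputs would normally be standardized first.

```python
import math
from collections import Counter


def kmeans(points, k=2, iterations=20):
    """Very small k-means (assumes k >= 2). Returns one cluster index per point."""
    n = len(points)
    # Deterministic start: pick k points spread evenly through the list.
    centers = [points[i * (n - 1) // (k - 1)] for i in range(k)]
    labels = [0] * n
    for _ in range(iterations):
        # Assign each point to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels


# Hypothetical inputs: (income, late_payments). The target is kept separate
# and is NOT used for clustering.
inputs = [(50, 0), (55, 1), (60, 0), (52, 1), (58, 5), (20, 5), (25, 6), (22, 4)]
target = ["Good", "Good", "Good", "Good", "Bad", "Bad", "Bad", "Bad"]

labels = kmeans(inputs, k=2)
crosstab = Counter(zip(labels, target))  # (cluster, target) -> count
for (cluster, status), count in sorted(crosstab.items()):
    print(cluster, status, count)
```

If the clusters line up closely with Good and Bad, the inputs used for clustering carry information about the target; if both clusters are an even mix, they do not.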
09-10-2014 11:43 PM
Thanks for your reply. I first used the Replacement node to handle missing values for factors whose value should be 0 rather than missing; these arise because no record was found when I mapped the data. For other kinds of factors, if the missing values exceed 50 percent, the factor is eliminated by the Impute node.
I have one more question. I have more than 100 factors in my experiment. I think I need to eliminate the factors that are not related to being a good/bad customer by using the Regression node. Is it OK if I run the regression before separating the data into two groups (Good and Bad)?
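
To spell out the two missing-value rules described above, here is a small plain-Python sketch (not the Replacement or Impute node itself). The column names, the "." missing marker as a string, and the helper names are assumptions for illustration.

```python
MISSING = "."  # assumed marker for a missing value


def replace_missing_flags(rows, flag_columns, fill="No"):
    """In flag columns, a missing value means 'no record found', so fill it in."""
    return [
        {k: (fill if k in flag_columns and v == MISSING else v) for k, v in row.items()}
        for row in rows
    ]


def drop_sparse_columns(rows, max_missing_share=0.5):
    """Drop every column whose share of missing values exceeds the threshold."""
    keep = []
    for col in rows[0]:
        share = sum(1 for row in rows if row[col] == MISSING) / len(rows)
        if share <= max_missing_share:
            keep.append(col)
    return [{col: row[col] for col in keep} for row in rows]


# Hypothetical records: "bankrupt" is a flag, "fax" is mostly missing.
rows = [
    {"bankrupt": "Yes", "income": "50", "fax": "."},
    {"bankrupt": ".", "income": ".", "fax": "."},
    {"bankrupt": ".", "income": "42", "fax": "."},
    {"bankrupt": ".", "income": "61", "fax": "7"},
]
cleaned = drop_sparse_columns(replace_missing_flags(rows, {"bankrupt"}))
for row in cleaned:
    print(row)
```

The order matters here: the flag replacement runs first, so "bankrupt" is not dropped for being 75% missing.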
09-11-2014 01:27 PM
Use the Variable Selection node to eliminate variables that have missing values and low correlation to the target.
You can also enable AOV16 to transform interval variables into categorical ones.
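
As a rough illustration of the idea (not the node's actual algorithm): keep inputs whose squared correlation with the target clears a threshold, and bin an interval input into 16 equal-width groups, which is how I understand AOV16. The threshold value, function names, and data below are made up.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between one input and a numeric target."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column explains nothing
    return sxy * sxy / (sxx * syy)


def select_inputs(table, target, min_r2=0.05):
    """Keep the inputs whose R-squared with the target reaches min_r2
    (0.05 is an arbitrary cutoff chosen for this sketch)."""
    return [name for name, xs in table.items() if r_squared(xs, target) >= min_r2]


def equal_width_bins(xs, n_bins=16):
    """Turn an interval variable into bin indices 0..n_bins-1 of equal width."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0] * len(xs)
    width = (hi - lo) / n_bins
    return [min(int((x - lo) / width), n_bins - 1) for x in xs]


# Hypothetical data: target 1 = bad customer, 0 = good customer.
target = [1, 1, 1, 0, 0, 0]
table = {
    "late_payments": [5, 6, 4, 0, 1, 0],
    "shoe_size": [40, 42, 41, 41, 40, 42],
}
selected = select_inputs(table, target)
print(selected)
print(equal_width_bins([0, 8, 16]))
```

Binning lets a nonlinear relationship with the target show up, which a straight correlation on the raw interval variable could miss.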
09-12-2014 09:20 PM
Thanks, but what is the difference between Variable Selection and Regression? I think both methods are able to eliminate correlation. I have never used Variable Selection before. How does it work?