Dear experts,

I am new to the data mining field and have been assigned to a project. The objective is to identify the characteristics of good customers and of bad ones. I designed my experiment with the following steps:

1) Handle missing values with the "Replacement" node. Some fields appear as missing only because no record was found in the database. For example, the bankruptcy field shows "Yes" if the customer has a bankruptcy record and "." if not, so this kind of missing value should be replaced with "No".
2) Handle the remaining missing values with the "Impute" node.
3) Oversample the data so that good and bad customers are in equal proportion (50:50).
4) Reduce multicollinearity and identify the candidate factors to use in clustering later, using the "Regression" node.
5) Cluster the data.

Do you see any problems with this experimental design? Please let me know if there are any steps I should change.

I also still have a problem with step 5. I know that the target variable cannot be used in clustering. So after step 4, should I split the data into two groups, Good and Bad, and apply clustering to each group separately? I am not sure whether my design is correct. Are there any examples or literature I could look at?

Thanks in advance for your help. I look forward to hearing from you.

Best regards,
Ros
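
To make steps 1 and 3 concrete outside the node-based tool, here is a minimal Python sketch using only the standard library. The field names ("bankrupt", "status"), the toy records, and both helper functions are hypothetical and exist only for illustration. It shows the "." to "No" replacement and simple random oversampling with replacement, which is one of several ways to reach a 50:50 split and may not match what the tool does internally.

```python
import random

def replace_missing_flag(rows, field, missing=".", fill="No"):
    """Step 1: treat the missing marker in a yes/no flag as 'No'."""
    return [{**r, field: fill if r[field] == missing else r[field]} for r in rows]

def oversample_to_balance(rows, target, seed=0):
    """Step 3: resample smaller classes with replacement up to the largest class size."""
    rng = random.Random(seed)
    groups = {}
    for r in rows:
        groups.setdefault(r[target], []).append(r)
    largest = max(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(g)
        # Draw extra copies only for classes smaller than the largest one.
        balanced.extend(rng.choices(g, k=largest - len(g)))
    return balanced

# Hypothetical toy data: one bad customer, three good customers.
customers = [
    {"bankrupt": "Yes", "status": "Bad"},
    {"bankrupt": ".", "status": "Good"},
    {"bankrupt": ".", "status": "Good"},
    {"bankrupt": ".", "status": "Good"},
]

cleaned = replace_missing_flag(customers, "bankrupt")
balanced = oversample_to_balance(cleaned, "status")
```

Note that oversampling only duplicates existing records, so any clustering run afterwards sees repeated points for the smaller class.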