09-05-2012 10:41 AM
I am relatively new to SAS, having used it for the last month to do statistical analysis on small and medium-sized datasets. I am now working with a much larger dataset (~40000 observations) with around 300 variables.
Of these 300 variables, more than half are not numerical (binary, categorical), so I have created another dataset with the same number of observations but with all variables numerical. My conversion rules are as follows:
1. Binary variable: Y = 1, N = 0, missing data = 0
2. Categorical: 0, 1, 2 .. n where n is the number of categories. Missing data set to 0
I then run PROC VARCLUS on that data in the hope of reducing the number of variables and building a better prediction model:
proc varclus data=worktable maxeigen=0.7 outtree=tree maxclusters=2;
run;
This gives me a total variation explained of ~30.
Increasing to maxclusters=3 gives a total variation explained of ~50, and going all the way to maxclusters=30 gives ~120.
I then increased to maxclusters=40 and got a total variation explained of ~130; the total variation explained always goes up. Reading around on the internet, I found that these values are normally in the 30-40 range and actually go down once maxclusters increases beyond 10.
I am aware that each dataset is unique and different. However, the result I have seems quite abnormal. Do you have any suggestions, or an explanation for why the total variation explained I got is so large?
Thank you very much,
09-05-2012 10:57 AM
I would be wary of converting missing data to a valid category. Because VARCLUS omits observations with missing values from the analysis, you are getting a very different result than you would with the following coding:
1. Binary variable: Y = 1, N = 0, missing data = '.'
2. Categorical: 1, 2 .. n where n is the number of categories. Missing data set to '.'
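As a rough illustration of that coding, a DATA step along these lines could do the conversion. The input dataset name (rawtable), the variable names (flag_raw, region_raw, flag, region), and the category labels are hypothetical placeholders, not taken from the original data; only the output name worktable matches the PROC VARCLUS call above.

```sas
/* Sketch only: rawtable, flag_raw, region_raw and the labels are assumed names */
data worktable;
    set rawtable;

    /* Binary: Y -> 1, N -> 0, anything else (including blank) -> missing (.) */
    if flag_raw = 'Y' then flag = 1;
    else if flag_raw = 'N' then flag = 0;
    else flag = .;

    /* Categorical: codes 1..n, unrecognized or blank -> missing (.) */
    select (region_raw);
        when ('North') region = 1;
        when ('South') region = 2;
        when ('West')  region = 3;
        otherwise      region = .;
    end;

    /* Keep only the numeric versions */
    drop flag_raw region_raw;
run;
```

With this coding, missing values stay missing instead of being folded into the N or 0 category.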
09-05-2012 04:09 PM
If an observation has a missing value in any variable and I do not convert it to a value, VARCLUS will not process that observation. My dataset has missing data in almost all variables, so if I leave the missing values as they are, the log says ~39000 observations were omitted due to missing data.
Do you have a suggestion for this case?