10-11-2012 03:35 PM
This question is more about methodology than proc varclus.
"I have a question: if we are using proc varclus to eliminate redundancy among the IVs, how do we go about selecting the cluster representatives? I know that the lower the (1-R^2) ratio, the better a variable is as a representative. However, if we also use other factors, such as business sense or a variable's univariate chi-square, should we select cluster representatives that have a higher univariate chi-square or make more 'business sense' even if they have a higher (1-R^2) ratio? Or should we select the top 5 or top 10 variables per cluster and look at other statistics later on? Please advise!"
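For readers who want the ratio spelled out: proc varclus reports, for each variable, the R^2 with its own cluster and the R^2 with the next closest cluster, and the (1-R^2) ratio is (1 - R^2 own) / (1 - R^2 next closest). The sketch below is only an illustration in Python, not SAS output. The variable names and R^2 values are made up, and `rank_by_ratio` is a hypothetical helper that orders the variables in each cluster from lowest ratio (best representative by this criterion alone) to highest.

```python
def one_minus_r2_ratio(r2_own, r2_next):
    """(1 - R^2 with own cluster) / (1 - R^2 with next closest cluster)."""
    return (1.0 - r2_own) / (1.0 - r2_next)


# Hypothetical values for illustration only, not real proc varclus output.
rows = [
    {"cluster": 1, "variable": "income", "r2_own": 0.90, "r2_next": 0.20},
    {"cluster": 1, "variable": "salary", "r2_own": 0.80, "r2_next": 0.40},
    {"cluster": 2, "variable": "age", "r2_own": 0.85, "r2_next": 0.10},
    {"cluster": 2, "variable": "tenure", "r2_own": 0.70, "r2_next": 0.25},
]


def rank_by_ratio(rows):
    """Group variables by cluster and sort each group by ascending ratio."""
    ranked = {}
    for row in rows:
        ratio = one_minus_r2_ratio(row["r2_own"], row["r2_next"])
        ranked.setdefault(row["cluster"], []).append((ratio, row["variable"]))
    for members in ranked.values():
        members.sort()  # lowest ratio first
    return ranked


ranked = rank_by_ratio(rows)
for cluster, members in sorted(ranked.items()):
    for ratio, variable in members:
        print(cluster, variable, round(ratio, 3))
```

A ranking like this only says which variable sits most cleanly inside its cluster. It says nothing about predictive power, which is the point of the question.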
10-12-2012 07:56 AM
Business sense should trump all other considerations, in my opinion. If you can't explain to your audience what is going on, it is difficult to make a good case for the selection. Note that this is the case only if you are in the situation where you have several variables that are "close" in explanatory power.
And again, it is only opinion.
10-12-2012 08:31 AM
Let's say I define the predictive power of a variable by its univariate chi-square statistic. I think there will be variables with a higher (1-R^2) ratio and higher predictive power that get left out if we concentrate only on a low (1-R^2) ratio to eliminate redundancy. Or, since the variables have been grouped into clusters, will all the variables inside one cluster have roughly similar predictive power?
10-15-2012 08:30 AM
Still don't know if this is helpful, but depending on the measure used to cluster, wouldn't the variables within a cluster all be measuring about the same thing, have similar patterns, or be explicit representations of some underlying variable? (As you might guess, I haven't used varclus in a LONG time.) Given that, I would pick the variable that makes the most business sense, even if it is slightly lower in predictive power. Now, if there is a fair difference in the statistics (and I don't have an objective measure of "fair difference"), you really should pick the "better" predictor and see whether it makes sense from a business standpoint.
I am starting to think that PROC PLS might be of some utility here.
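The rule of thumb in this reply can be written down as a small sketch. Everything here is an assumption made for illustration: the variable names and chi-square values are invented, `pick_representative` is a hypothetical helper, and the 10% cutoff for a "fair difference" is an arbitrary placeholder, since no objective measure was given. The idea is to prefer a business-sensible variable when its chi-square is close to the best in the cluster, and otherwise fall back to the strongest predictor.

```python
def pick_representative(candidates, preferred, max_relative_gap=0.10):
    """Pick one variable name from a cluster.

    candidates: list of dicts with "variable" and "chi_square" keys.
    preferred: set of variable names judged to make business sense.
    max_relative_gap: assumed cutoff for a "slight" loss of predictive power.
    """
    best = max(candidates, key=lambda c: c["chi_square"])
    threshold = (1.0 - max_relative_gap) * best["chi_square"]
    # Business-sensible variables that are nearly as predictive as the best.
    close_preferred = [
        c for c in candidates
        if c["variable"] in preferred and c["chi_square"] >= threshold
    ]
    if close_preferred:
        return max(close_preferred, key=lambda c: c["chi_square"])["variable"]
    # Otherwise the gap is a "fair difference": take the better predictor.
    return best["variable"]


# Hypothetical cluster for illustration only.
cluster = [
    {"variable": "income", "chi_square": 120.0},
    {"variable": "salary_band", "chi_square": 112.0},
    {"variable": "tax_code", "chi_square": 60.0},
]

close_pick = pick_representative(cluster, preferred={"salary_band"})
far_pick = pick_representative(cluster, preferred={"tax_code"})
print(close_pick, far_pick)
```

In the second call the preferred variable is far weaker than the best one, so the sketch returns the better predictor, which then still needs the business-sense check described above.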
10-15-2012 10:38 AM
Proc varclus can help remove 'redundancy', not 'irrelevancy'. What I did next was request a large number of clusters (50 and then 100). That left fewer variables in each cluster for me to analyze and shortlist. Then one can pick variables that have a low (1-R^2) ratio and also make business sense.
I will surely look at proc PLS too.