11-27-2014 11:16 AM

Will someone please answer my statistically naive question: Can I use a chi-square test to check for a relationship (or independence) between clusters (produced by k-means, using 2 continuous variables) and categorical variables? I attempted a logistic regression, but separation occurred; I could not correct for it and need an alternative. Note that this is not medical data but a social science study of household demographics. Thank you!

11-27-2014 12:11 PM

Separation in this case would be a good thing: it means some variables classify your clusters well.

It's worth looking into which variables caused the separation and whether the effect is truly predictive or the result of a measurement error.

11-27-2014 12:43 PM

Thank you! I am not a statistician and do not program; I am using JMP for my analysis. I am unable to determine the cause of the separation. Also, to clarify, I have three clusters that emerged from k-means (using 2 variables: income and % catch sold). I then wanted to test the relationship between the 3 clusters and other variables (location of household, habitat, gear, etc.) with independent chi-square tests. I am asking about the chi-square option because when I ran logistic regression (dependent variable: cluster; independent variables: location of household, habitat, gear, etc.), separation occurred. In other words, I am trying to determine whether there is a relationship between market-oriented households (from the clusters) and their location, the habitat they extract from, etc.

I understand this may not be as robust as regression, but is it an option?

Thanks again!

12-01-2014 04:51 PM

Yes, assuming your observations (households, I assume) are independent, you could look at the association of each variable (ignoring all other variables) with the clusters by using PROC FREQ. For example:

proc freq;
   table location*cluster / chisq;
run;
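As a slightly fuller sketch of the same test, assuming a data set named households (a hypothetical name) that holds the variables location, habitat, gear, and cluster, you could request the test for several variables in one step. The EXPECTED option prints the expected cell counts, so you can judge whether they are large enough for the chi-square approximation, and FISHER requests an exact test as a fallback for sparse tables (it can be slow for large tables).

```sas
/* Hypothetical data set and variable names; adjust to your data. */
proc freq data=households;
   /* One two-way table per variable, each crossed with cluster */
   tables (location habitat gear)*cluster / chisq expected fisher;
run;
```

Each test here still ignores the other variables, so the results describe marginal associations only.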

However, to determine the partial effect of each variable after taking account of the effects of the other variables, you would need to use either stratification or a model-based approach. The model-based approach would be a nominal (generalized logit) logistic model. It sounds like you already tried this and it resulted in separation issues, probably because the data are too sparse to support the model. A possible alternative is stratification using the CMH option in PROC FREQ. For example, the code below tests the association between location and cluster after stratifying on habitat. You may want the NOPRINT option to avoid printing all the separate location*cluster tables for the various habitat levels.

proc freq;
   table habitat*location*cluster / cmh noprint;
run;
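On the separation itself, one simple way to look for its source is to print plain cell counts for each predictor against the clusters and scan for zeros. The sketch below assumes a hypothetical data set named households with variables location, habitat, gear, and cluster. A predictor level with a zero count in one or more clusters is a likely cause of the separation in the logistic model.

```sas
/* Hypothetical data set and variable names; adjust to your data. */
proc freq data=households;
   /* Show frequencies only, so empty cells are easy to spot */
   tables (location habitat gear)*cluster / norow nocol nopercent;
run;
```

If a zero cell turns up, combining sparse levels of that variable is one common remedy, provided the combined category still makes sense for the study.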

12-01-2014 05:13 PM

Thank you, this is very helpful. I appreciate it!