annasantos
Calcite | Level 5

Will someone please answer my statistically naive question: Can I use chi-square to test for a relationship (or independence) between clusters (produced by k-means, using 2 continuous variables) and categorical variables? I attempted a logistic regression, but separation occurred and I could not correct for it, so I need an alternative. Note this is not medical data but a social science study of household demographics. Thank you!

4 REPLIES
Reeza
Super User

Separation in this case would be a good thing - it means some variables are classifying/categorizing your clusters well.

It's worth looking into which variables caused the separation and whether it's truly predictive or a measurement error.

annasantos
Calcite | Level 5

Thank you! I am not a statistician and do not program, and I am using JMP for my analysis. I am unable to determine the cause of separation. Also, to clarify, I have three clusters that emerged from k-means (using 2 variables: income and % catch sold). I then wanted to test the relationship between the 3 clusters and other variables (location of household, habitat, gear, etc.) with independent chi-square tests. I am asking about the chi-square option because when I ran logistic regression (dependent: clusters; independent: location of household, habitat, gear, etc.) separation occurred. In other words, I am trying to determine whether there is a relationship between market-oriented households (from clusters) and their location, the habitat they extract from, etc.

I understand this may not be as robust as regression, but is it an option?

Thanks again!

StatDave
SAS Super FREQ

Yes, assuming your observations (households, I assume) are independent, you could look at the association of each variable (ignoring all other variables) with the clusters by using PROC FREQ.  For example:

/* Chi-square test of association between location and cluster */
proc freq;
   table location*cluster / chisq;
run;

However, to determine the partial effect of each variable after taking account of the effects of the other variables, you would need to either use stratification or a model-based approach.  The model-based approach would be a nominal (generalized logit) logistic model.  It sounds like you already tried this and it resulted in separation issues, probably due to the data being too sparse to support the model.  A possible alternative is stratification using the CMH option in PROC FREQ.  For example, this tests the effect of location after stratifying on habitat.  You may want the NOPRINT option to avoid printing all the separate location*cluster tables for the various habitat levels.

/* CMH test of location by cluster, stratified on habitat */
proc freq;
   table habitat*location*cluster / cmh noprint;
run;
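For reference, the nominal (generalized logit) model mentioned above could be sketched roughly as follows. This is only an illustration under assumptions: the data set name households is hypothetical, and the predictor list (location, habitat, gear) is taken from the variables named in this thread. With sparse data, this is the kind of model where PROC LOGISTIC would be expected to report separation in the log.

```sas
/* Sketch only: "households" is a hypothetical data set name,      */
/* assumed to hold one row per household with the cluster label.   */
proc logistic data=households;
   /* Treat the predictors as categorical, reference-cell coding */
   class location habitat gear / param=ref;
   /* LINK=GLOGIT fits a nominal (generalized logit) model for    */
   /* the three-level cluster response                            */
   model cluster = location habitat gear / link=glogit;
run;
```

Fitting one predictor at a time in the MODEL statement is one way to see which variable triggers the separation warning.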

annasantos
Calcite | Level 5

Thank you, this is very helpful. I appreciate it!
