New Contributor
Posts: 2

Appropriate metrics to characterize group size distribution?

I would like to know which metrics are appropriate for capturing the distribution of subgroup sizes (beyond the variance of the sizes).

Consider a simple case where 100 data points form three subgroups in two different ways. In the first case the subgroups are A (30 data points), B (30 data points), and C (40 data points); in the second they are A (5 data points), B (5 data points), and C (90 data points). The subgroup sizes are clearly more evenly distributed in the first case than in the second. Are there any statistical metrics that capture this?

-Julie

Super User
Posts: 13,583

Re: Appropriate metrics to characterize group size distribution?

A chi-square test may be an appropriate starting point.

What question is the analysis supposed to answer?

New Contributor
Posts: 2

Re: Appropriate metrics to characterize group size distribution?

Thank you. The question is, "How evenly distributed are the subgroup/cluster sizes?" There may be a better way to articulate it, but this is essentially what I want to address.

Super User
Posts: 23,776

Re: Appropriate metrics to characterize group size distribution?

Then @ballardw's suggestion of a chi-square test seems appropriate to me as well.

Super User
Posts: 13,583

Re: Appropriate metrics to characterize group size distribution?

Here is a brief example that uses a chi-square test to demonstrate one possible approach. The data set has three scenarios (variable TestGroup). Each scenario holds the counts for your groups A, B, and C (variable Bin) in two subpopulations (variable SampGr). In each scenario, SampGr=1 is what you actually observe, and SampGr=2 is a hypothetical "even distribution" with roughly one-third of the counts in each of the 3 bins. The Rate variable holds the count.

data test;
input TestGroup SampGr  Bin $ rate;
datalines;
1 1  A 20
1 1  B 50
1 1  C 30
1 2  A 33
1 2  B 33
1 2  C 33
2 1  A 10
2 1  B 10
2 1  C 80
2 2  A 33
2 2  B 33
2 2  C 33
3 1  A 25
3 1  B 35
3 1  C 40
3 2  A 33
3 2  B 33
3 2  C 33
;
run;

/* One Bin-by-SampGr table per scenario; WEIGHT treats Rate as the cell count */
proc freq data=test;
by testgroup;
tables bin*Sampgr /chisq ;
weight rate;
run;

Look at the output for each BY group, in particular the Statistics table. The chi-square test here is basically a measure of similarity between the observed distribution and the even one: the lower the p-value, the less likely it is that the two are similarly distributed. You could use the chi-square p-value, or the other coefficients in that table, as a "metric".
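If you want the statistics as data rather than printed output, one option is the OUTPUT statement of PROC FREQ. The following is a minimal sketch that assumes the data set test created above already exists; the output data set name chisq_stats is a hypothetical name chosen for illustration.

```sas
/* Assumes the data set TEST from the DATA step above already exists. */
/* NOPRINT suppresses the displayed tables; the OUTPUT data set is still created. */
proc freq data=test noprint;
   by testgroup;
   tables bin*Sampgr / chisq;
   weight rate;
   /* chisq_stats is a hypothetical name; one row per BY group */
   output out=chisq_stats chisq;
run;

/* Pearson chi-square, its degrees of freedom and p-value, and Cramer's V */
proc print data=chisq_stats;
   var testgroup _PCHI_ DF_PCHI P_PCHI _CRAMV_;
run;
```

Keep in mind that the p-value shrinks as the total count grows even when the proportions stay the same, whereas Cramer's V is scaled by the total count, so it may be the easier of the two to compare across data sets of different sizes.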

The first test group is very likely not similar (p-value = 0.0332), i.e., SampGr 1 is not evenly distributed; the second test group is almost definitely not similar (p-value < 0.0001); and the third is reasonably even (p-value = 0.4008). Perfect agreement would result in a p-value of 1.
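As a rough check on the first p-value, the Pearson statistic can be worked by hand. This sketch assumes the standard formula for a two-way table; the symbols are introduced here for illustration only: $O_{ij}$ is the observed count for bin $i$ in subpopulation $j$, $R_i$ and $C_j$ are the row and column totals, and $N = 199$ is the grand total for test group 1. For example, the expected count for bin A in SampGr 1 is $53 \cdot 100 / 199 \approx 26.63$.

```latex
E_{ij} = \frac{R_i \, C_j}{N}, \qquad
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \approx 6.81, \qquad
p = e^{-\chi^2 / 2} \approx 0.033 \quad \text{(2 degrees of freedom)}
```

The closed form for $p$ holds only because a 3-by-2 table has 2 degrees of freedom; other table sizes need the general chi-square distribution.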

There is a reason I named the weight variable Rate: you could easily standardize the data by using the percentages from your raw data in place of the counts.
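For the two cases in the original question, a one-way table may be a more direct check, since for a one-way table the CHISQ option tests the null hypothesis of equal proportions across the levels. This is only a sketch: the data set name sizes and the variable names Scenario and Size are hypothetical, and the counts are the ones given in the question.

```sas
/* Subgroup sizes from the original question.
   SIZES, Scenario and Size are hypothetical names for this sketch. */
data sizes;
   input Scenario Bin $ Size;
   datalines;
1 A 30
1 B 30
1 C 40
2 A 5
2 B 5
2 C 90
;
run;

/* One-way table per scenario: CHISQ requests a goodness-of-fit test
   against equal proportions in A, B and C. */
proc freq data=sizes;
   by Scenario;
   tables Bin / chisq;
   weight Size;
run;
```

Worked by hand with an expected count of 100/3 per bin, the statistic is 2.0 for the 30/30/40 split and 144.5 for the 5/5/90 split, so the second case should be flagged as far less even than the first.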
