02-05-2015 06:45 PM

I want to know which metrics are appropriate for capturing the distribution of subgroup sizes (in addition to the variance of the sizes).

Consider the simple case where I have 100 data points forming three subgroups in two ways. In the first case the subgroups are A (30 data points), B (30 data points), and C (40 data points), while in the second they are A (5 data points), B (5 data points), and C (90 data points). It's obvious that the subgroup sizes are more evenly distributed in the first case than in the second. Yet, are there any statistical metrics that indicate this?

Please feel free to share your ideas. I appreciate your time.

-Julie

02-05-2015 07:14 PM

A chi-square test may be an appropriate starting point.

What question is the analysis supposed to answer?

02-05-2015 08:33 PM

Thank you. The question is "how evenly distributed are the subgroup/cluster sizes?" Maybe there is a better way to articulate it, but this is essentially what I want to address.

02-05-2015 08:42 PM

Then @ballardw's suggestion of a chi-square test seems appropriate to me as well.
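
A more direct variant of the same idea, offered as a sketch rather than as either poster's method: for a one-way table, the CHISQ option in PROC FREQ tests the null hypothesis of equal proportions, so no reference sample is needed. The data set name sizes and the variables Scenario, Subgroup, and n are made up for illustration; the counts are the two cases from the original question.

```sas
/* Hypothetical data set: the two cases from the question, one row per subgroup */
data sizes;
   input Scenario Subgroup $ n;
   datalines;
1 A 30
1 B 30
1 C 40
2 A 5
2 B 5
2 C 90
;
run;

/* One-way table: CHISQ gives a goodness-of-fit test against equal proportions */
proc freq data=sizes;
   by Scenario;
   tables Subgroup / chisq;
   weight n;
run;
```

Worked by hand, the expected count is 100/3 per subgroup, which gives a chi-square statistic of 2.0 for the first case and 144.5 for the second (2 degrees of freedom each), so the second case departs far more strongly from an even split.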

02-06-2015 11:41 AM

Here is a brief example that uses a chi-square test to demonstrate one possible approach. The data set has three scenarios (variable TestGroup). Each scenario holds the counts for your groups A, B, and C (variable Bin) in two subpopulations (variable SampGr): SampGr=1 is what you actually observe, and SampGr=2 represents a hypothetical "even distribution" with roughly one-third of the counts in each of the three bins. The Rate variable holds the count.

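data test;
   /* TestGroup = scenario, SampGr = 1 (observed) or 2 (even reference), */
   /* Bin = subgroup label, rate = count in that bin                     */
   input TestGroup SampGr Bin $ rate;
   datalines;
1 1 A 20
1 1 B 50
1 1 C 30
1 2 A 33
1 2 B 33
1 2 C 33
2 1 A 10
2 1 B 10
2 1 C 80
2 2 A 33
2 2 B 33
2 2 C 33
3 1 A 25
3 1 B 35
3 1 C 40
3 2 A 33
3 2 B 33
3 2 C 33
;
run;

/* Within each scenario, compare the observed counts with the even reference */
proc freq data=test;
   by testgroup;
   tables bin*Sampgr / chisq;
   weight rate;
run;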

Look at the output for each BY group, in particular the Statistics table. The chi-square test here serves basically as a measure of similarity: the lower the p-value, the less likely it is that the two samples are similarly distributed. You could use the chi-square p-value, or one of the other coefficients in that table, as a "metric".

The first TestGroup is very likely not similar to the reference (p-value = 0.0332), i.e. SampGr 1 is not evenly distributed; the second TestGroup is almost definitely not similar (p-value < 0.0001); and the third is somewhat smooth (p-value = 0.4008). Perfect agreement would result in a p-value of 1.
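
If the goal is a single number per scenario, the statistics can be written to a data set instead of being read off the listing. The sketch below assumes the test data set from the example above and uses the OUTPUT statement of PROC FREQ; chi_stats is a made-up name. Cramer's V (_CRAMV_) is one of the "other coefficients" that the CHISQ option reports, and because it is scaled by the total count it may be easier to compare across samples of different size than the p-value.

```sas
/* Write the chi-square statistics for each BY group to a data set */
proc freq data=test noprint;
   by testgroup;
   tables bin*Sampgr / chisq;
   weight rate;
   output out=chi_stats chisq;   /* chi_stats is a hypothetical name */
run;

/* Pearson chi-square, its degrees of freedom and p-value, and Cramer's V */
proc print data=chi_stats noobs;
   var testgroup _PCHI_ DF_PCHI P_PCHI _CRAMV_;
run;
```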

There is a reason I used Rate for the weight value: you could easily standardize the data by using the percentages from your raw data in place of the counts.
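
A minimal sketch of that standardization, assuming raw counts sit in a hypothetical data set raw with variables Bin and count (the three counts are invented for illustration): PROC SQL converts each count to a percentage of the total, and the resulting rate column can then serve as the WEIGHT variable.

```sas
/* Hypothetical raw counts; replace with your own subgroup sizes */
data raw;
   input Bin $ count;
   datalines;
A 12
B 18
C 30
;
run;

/* Convert each count to a percentage of the total across all bins */
proc sql;
   create table pct as
   select Bin,
          count,
          100 * count / sum(count) as rate
   from raw;
quit;
```

Keep in mind that with percentages as weights, the chi-square test behaves as if each sample had 100 observations, whatever the real sample size was.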