juliema
Calcite | Level 5

I want to know which metrics are appropriate for capturing the distribution of subgroup sizes (in addition to the variance of the sizes).

Consider the simple case where I have 100 data points forming three subgroups in two ways. In the first case the subgroups are A (30 data points), B (30 data points), and C (40 data points), while in the second they are A (5 data points), B (5 data points), and C (90 data points). It's obvious that the subgroup sizes are more evenly distributed in the first case than in the second. Are there any statistical metrics that indicate this?

Please feel free to share your ideas. I appreciate your time.

-Julie

ballardw
Super User

A chi-square test may be an appropriate starting point.

What question is the analysis supposed to answer?

juliema
Calcite | Level 5

Thank you. The question is "how evenly distributed are the subgroup/cluster sizes?" Maybe there is a better way to articulate it, but this is essentially what I want to address.

Reeza
Super User

Then @ballardw's suggestion of chi-square seems appropriate to me as well.

ballardw
Super User

Here is a brief example using a chi-square test to demonstrate one possible approach. The data set has three scenarios (variable TestGroup), each giving the counts in your groups A, B and C (variable Bin) for two subpopulations (variable SampGr). In each scenario, SampGr=1 is what you actually observe and SampGr=2 represents a hypothetical "even distribution" with roughly one-third of the counts in each of the 3 bins. The Rate variable holds the count.

data test;
   input TestGroup SampGr  Bin $ rate;
datalines;
1 1  A 20
1 1  B 50
1 1  C 30
1 2  A 33
1 2  B 33
1 2  C 33
2 1  A 10
2 1  B 10
2 1  C 80
2 2  A 33
2 2  B 33
2 2  C 33
3 1  A 25
3 1  B 35
3 1  C 40
3 2  A 33
3 2  B 33
3 2  C 33
;
run;

proc freq data=test;
   by testgroup;
   tables bin*Sampgr /chisq ;
   weight rate;
run;

Look at the output for each BY group, in particular the Statistics table. The chi-square test here is basically a measure of similarity: the lower the p-value, the less likely it is that the two samples are similarly distributed. You could use the chi-square p-value, or the other coefficients in the output, as a "metric".

The first test group is very likely not similar (p-value = 0.0332), i.e. SampGr 1 is not evenly distributed; the second is almost definitely not similar (p-value < 0.0001); and the third is fairly even (p-value = 0.4008). Perfect agreement would result in a p-value of 1.

There is a reason I named the weight variable Rate: you could easily standardize the data by using the percentages from your raw data in place of the counts.
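As a possible variation on the example above, the two cases from the original question could also be tested directly against equal proportions, without adding a hypothetical "even" sample. This is only a sketch: the data set name sizes and the variables Scenario and Count are made up for illustration. It relies on PROC FREQ running a one-way chi-square test of equal proportions when CHISQ is requested for a single-variable table.

```sas
/* Hypothetical data: the two 100-point cases from the question */
data sizes;
   input Scenario Bin $ Count;
datalines;
1 A 30
1 B 30
1 C 40
2 A 5
2 B 5
2 C 90
;
run;

proc freq data=sizes;
   by Scenario;
   tables Bin / chisq;   /* one-way test; null hypothesis is equal proportions */
   weight Count;         /* each row stands for Count observations */
run;
```

With an expected count of 100/3 per bin, the chi-square statistic works out to 2.0 for the 30/30/40 case and 144.5 for the 5/5/90 case, so a larger statistic (smaller p-value) indicates a less even split. As with the two-sample version, the statistic grows with the total count, so comparisons are most meaningful between scenarios with the same total.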
