laura944
Fluorite | Level 6

 

Hi All, I have the below data on a “Diversity Index” over time for a given population of people:

 

Year    Diversity Index
2005    35%
2006    36%
2007    37%
2008    38%
2009    39%
2010    38%
2011    40%

 

 

Per the Census website: “the DI tells us the chance that two people chosen at random will be from different racial and ethnic groups….The DI is bounded between 0 and 1, with a zero-value indicating that everyone in the population has the same racial and ethnic characteristics, while a value close to 1 indicates that everyone in the population has different characteristics.” (The full definition is below.)

 

I’d like to: 1) determine if there is a statistically significant upward trend in the index over time for this population, and 2) plot the data and show CIs around the estimates.


Questions:

1) Would beta regression work for this, with the DI as the dependent variable and year as the independent variable? This seems to be a “continuous proportion” rather than a “count proportion.” We don’t have a value of DI per person; it is one value for a group of people.

2) Some of the same people would be in the DI in different years. Should this be taken into account? If so, how (e.g., robust standard errors in a beta regression)? It doesn't seem obvious since, again, the value is only calculated for a population and not for an individual.

 

 

EQUATION BELOW:

 

 

Diversity Index Equation 

 

DI = 1 – (H² + W² + B² + AIAN² + Asian² + NHPI² + SOR² + Multi²)

  

H is the proportion of the population who are Hispanic or Latino.

W is the proportion of the population who are White alone, not Hispanic or Latino.

B is the proportion of the population who are Black or African American alone, not Hispanic or Latino.

AIAN is the proportion of the population who are American Indian and Alaska Native alone, not Hispanic or Latino.

Asian is the proportion of the population who are Asian alone, not Hispanic or Latino.

NHPI is the proportion of the population who are Native Hawaiian and Other Pacific Islander alone, not Hispanic or Latino.

SOR is the proportion of the population who are Some Other Race alone, not Hispanic or Latino.

Multi is the proportion of the population who are Two or More Races, not Hispanic or Latino.

 

Source: https://www.census.gov/library/visualizations/interactive/racial-and-ethnic-diversity-in-the-united-...
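
As a small illustration of the equation above, here is a minimal Python sketch that computes the DI from group proportions. The function name `diversity_index` and the example proportions are made up for illustration; they are not Census figures. One thing the sketch makes visible: with eight categories, the index can be at most 1 - 1/8 = 0.875, which is reached when all groups are the same size.

```python
def diversity_index(shares):
    """Return 1 minus the sum of squared group proportions."""
    total = sum(shares.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("group proportions must sum to 1")
    return 1.0 - sum(p * p for p in shares.values())


# Hypothetical proportions, invented for illustration only
example = {
    "H": 0.20, "W": 0.60, "B": 0.10, "AIAN": 0.01,
    "Asian": 0.05, "NHPI": 0.01, "SOR": 0.01, "Multi": 0.02,
}
di_example = diversity_index(example)

# Eight equally sized groups give the largest possible value
equal = {name: 1 / 8 for name in example}
di_max = diversity_index(equal)

print(f"example DI = {di_example:.4f}, maximum with 8 groups = {di_max:.4f}")
```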

 

 

 

SteveDenham
Jade | Level 19

Save for the fact that the beta distribution is not supported at zero, this looks like a pretty good approach, as the DI is not a proportion, given the example equation.  The only other distribution that comes to mind quickly is maybe a gamma, with a non-standard logit link.

 

One issue I have, at least with the example data, is that it looks like the response is linear over the range presented, so you probably ought to include that approach as well.

 

SteveDenham

laura944
Fluorite | Level 6

Thank you Steve!! I had planned to include year as the independent variable. Would including it as continuous (e.g., coded as 1, 2, 3, 4, ...) address the issue you have about the response being linear over the range presented?

SteveDenham
Jade | Level 19

If you include time as continuous, you force the relationship to be linear, which is what I was thinking.  Recoding makes sense if you want to interpret the coefficient as change in DI per year.
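
To make the linear option concrete, here is a rough Python sketch of an ordinary least squares fit of DI on recoded year, using the seven values from the original post. It is only a sketch under the usual OLS assumptions (independent errors, constant variance), which may not hold for yearly values from overlapping populations. The variable names are illustrative, and the critical value 2.571 is the two-sided 5% point of Student's t with 5 degrees of freedom.

```python
import math

# Data from the original post; year recoded as 1..7 (2005 -> 1)
t = [1, 2, 3, 4, 5, 6, 7]
di = [0.35, 0.36, 0.37, 0.38, 0.39, 0.38, 0.40]

n = len(t)
t_bar = sum(t) / n
di_bar = sum(di) / n

# Least squares slope and intercept
sxx = sum((ti - t_bar) ** 2 for ti in t)
sxy = sum((ti - t_bar) * (yi - di_bar) for ti, yi in zip(t, di))
slope = sxy / sxx
intercept = di_bar - slope * t_bar

# Standard error of the slope from the residual variance (n - 2 df)
residuals = [yi - (intercept + slope * ti) for ti, yi in zip(t, di)]
sse = sum(r * r for r in residuals)
se_slope = math.sqrt(sse / (n - 2) / sxx)

T_CRIT = 2.571  # two-sided 5% critical value, Student's t with 5 df
t_stat = slope / se_slope
ci_low = slope - T_CRIT * se_slope
ci_high = slope + T_CRIT * se_slope

print(f"slope per year = {slope:.4f}, t = {t_stat:.2f}")
print(f"95% CI for slope: ({ci_low:.4f}, {ci_high:.4f})")
```

Because year is coded 1 to 7, the slope reads directly as the estimated change in DI per year.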

 

SteveDenham

laura944
Fluorite | Level 6

Excellent thanks - very helpful.

StatDave
SAS Super FREQ

See these notes (56992, 57480) about modeling continuous proportions. Note that the beta model is one option. The fractional logistic and 4- or 5-parameter logistic models are others. Note that while proportions over a range near 0.5 might be approximately linear, this is usually not the case as you approach 0 or 1, so a logistic sort of model makes more sense.
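
For readers who want to see what a fractional logistic fit does, here is a minimal Python sketch, not the SAS approach described in the notes above. It assumes a logit link, logit(mean DI) = b0 + b1 * year, and solves the Bernoulli quasi-likelihood score equations by Newton-Raphson. The function name `fit_fractional_logit` is hypothetical, and the sketch gives point estimates only; standard errors (ideally robust ones) would come from a statistical package.

```python
import math


def fit_fractional_logit(x, y, iters=25):
    """Fit logit(mu) = b0 + b1 * x by Newton-Raphson on the quasi-score."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        mu = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Score vector: sum of (y - mu) times (1, x)
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Information matrix: sum of mu * (1 - mu) times (1, x)(1, x)'
        w = [mi * (1.0 - mi) for mi in mu]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: beta += H^{-1} g, using the 2x2 inverse
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Data from the original post; year recoded as 1..7 (2005 -> 1)
t = [1, 2, 3, 4, 5, 6, 7]
di = [0.35, 0.36, 0.37, 0.38, 0.39, 0.38, 0.40]

b0, b1 = fit_fractional_logit(t, di)
fitted = [1.0 / (1.0 + math.exp(-(b0 + b1 * ti))) for ti in t]

print(f"b0 = {b0:.4f}, b1 = {b1:.4f} (change in log-odds of DI per year)")
print("fitted DI:", [round(p, 4) for p in fitted])
```

The fitted values always stay strictly between 0 and 1, which is the practical advantage over a straight line when the index gets close to either bound.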

laura944
Fluorite | Level 6

thank you!! I will check out the notes!!
