saoirse872
Fluorite | Level 6

**Note: data is hypothetical**

 

I would like to compare the percentage of women in my data to the percentage of women in a population dataset, to see if the % women in my data is high or low. So for example,

 

|  | My data | Population |
|---|---|---|
| Female | 30% | 25% |
| Male | 70% | 75% |

 

However, the two datasets overall have different mixes of occupations and age groups, so we don’t expect the percentage of women to be the same. Therefore, we have to adjust for these differences. The table shows, for the population, the percentage of women in each combination of occupation and age groups, as well as the CIs around the percentage.

 

 

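POPULATION DATA:

| Occupation | Age group | % of women | CI for proportion |
|---|---|---|---|
| Professional | Younger | 50% | (45%, 55%) |
| Professional | Older | 70% | (64%, 76%) |
| Not professional | Older | 20% | (15%, 25%) |
| Not professional | Younger | 15% | (10%, 20%) |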

 

 

In my dataset, I have 50% who are not professional and older, 20% who are not professional and younger, and 30% who are professional and younger.

Therefore, to get the expected proportion of women in my data, I calculate: .5*.2 + .2*.15 + .3*.5 = .28
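Written out, this is a weighted average of the population rates, where the weights are my sample's shares of each occupation and age group. The symbols $w_g$ (share of my sample in group $g$) and $p_g$ (population proportion of women in group $g$) are notation introduced here for illustration only. The professional and older group has weight 0 because it does not appear in my data.

```latex
\hat{p}_{\text{expected}}
  = \sum_{g} w_g \, p_g
  = (0.50)(0.20) + (0.20)(0.15) + (0.30)(0.50) + (0)(0.70)
  = 0.10 + 0.03 + 0.15
  = 0.28
```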

 

However, the population numbers have CIs around the estimates. Should those be ignored or taken into account here?

 

Thanks in advance for any thoughts.


StatDave
SAS Super FREQ

One approach is to obtain the aggregated, combined data and fit a log-linear model (a Poisson model on the aggregated counts) with gender and source (sample, population) and their interaction as the predictors of primary interest, and occupation and age (and anything else needed) as covariates. You can then compare the sources adjusted for the covariates. For example, assuming that all of the above variables are categorical, or can be, the SLICE statement will provide the means in each gender-source combination and test for differences between the sources for each gender.

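/* Aggregate to one row per source x gender x occupation x age_group cell; */
/* OUT= stores each cell frequency in the variable COUNT */
proc freq;
   table source*gender*occupation*age_group / noprint out=aggdata;
run;

/* Log-linear (Poisson) model on the aggregated cell counts */
proc genmod data=aggdata;
   class gender source occupation age_group;
   model count = gender|source occupation age_group / dist=poisson;
   /* Compare the sources within each gender; ILINK reports estimates on the count scale */
   slice gender*source / sliceby=gender ilink means diff cl;
run;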
saoirse872
Fluorite | Level 6

Thank you!! This is a fantastic suggestion. A follow-up question: I'm curious about the rationale for a log-linear model here, as opposed to, say, a logistic regression with gender as the dependent variable and age, source, and occupation as the independent variables?

StatDave
SAS Super FREQ

That was my initial thought, but I rejected it for a reason that, on reconsideration, doesn't seem problematic. So yes, I believe that would work, and it would not require aggregating the data. In PROC LOGISTIC, just be sure to use the EVENT= option after the response variable in the MODEL statement to select the female level as the event level to model. You can then specify SOURCE in the LSMEANS statement, rather than using the SLICE statement, with the ILINK option to see the estimated, adjusted probabilities for the two sources.
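A minimal sketch of that setup might look like the following. The data set name `combined` and the formatted value `'Female'` are assumptions for illustration, so substitute whatever your data actually uses. PARAM=GLM is specified on the CLASS statement because LSMEANS in PROC LOGISTIC requires GLM parameterization.

```sas
/* Sketch only: "combined" (one row per person, sample and population stacked) */
/* and the level label 'Female' are assumed names, not taken from the thread.  */
proc logistic data=combined;
   /* LSMEANS requires GLM parameterization of the CLASS variables */
   class source occupation age_group / param=glm;
   /* EVENT= selects the female level as the event being modeled */
   model gender(event='Female') = source occupation age_group;
   /* ILINK gives adjusted probabilities for each source; DIFF and CL add */
   /* the comparison between sources and confidence limits                */
   lsmeans source / ilink diff cl;
run;
```

The DIFF results compare the two sources on the logit scale, while the ILINK columns show the adjusted probability of being female for each source.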

saoirse872
Fluorite | Level 6

Wonderful. Thank you for the excellent advice!!!
