(Note: cross posted to stack exchange but no responses)
@mjkop56 wrote:
Thank you! Hypothetical example. Here is the original data:
             | Male | Female | PNA
# Applied    | 900  | 1800   | 300
# Selected   | 330  | 570    | 100
Success Rate | 37%  | 32%    | 33%
2 extremes:
1) All the PNA are male: success rate for males = 36%, females = 32% (note, the female rate does not change)
2) All the PNA are female: success rate for males = 37% (the male rate does not change), females = 32%
I guess it's more that the data would be presented as a range: the male success rate ranges from 36% to 37%,
and the female success rate ranges from 32% to 32%
(Note this example is made up, real data would show more of a difference)
Okay, you can certainly do this, but these are not error bars in any standard sense, and you would be advised to make that very clear. As @Reeza pointed out, also be prepared to have your assumptions challenged.
However, say there is missing data (some people did not report their gender) and one argues that error bars are needed to account for the missing data.
How would you compute these error bars in this case? They would not be based on the standard error of the estimate that is usually computed for sampled data. They would be some other type of error bars ... and that's fine with me as long as you explain that they are not based on sampling variability. But really, how do you compute the error bars in this case?
Imputing is another way to handle this.
thank you - your responses have been very helpful.
Missing data are hard. There's a whole literature about handling missing data. I think what you're describing is some sort of 'sensitivity analysis.'
Note, in your example case I think the extremes would be the cases where gender is perfectly associated with selection among the PNA (all the selected PNA are one gender and all the rejected PNA are the other), not "all the PNA are male" or "all the PNA are female."
Of course it's very unlikely that either of those worst-case extremes actually occurred, which is how you end up in the land of multiple imputation, etc.
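A crude sketch of the multiple-imputation idea, in Python since the thread has no code. The counts are the hypothetical ones from this thread; the assumption that PNA applicants follow the observed 1:2 male:female mix is mine for illustration, not something the data supports:

```python
import random

# Hypothetical counts from the example in this thread
applied = {"male": 900, "female": 1800}    # applicants who reported gender
selected = {"male": 330, "female": 570}
pna_selected, pna_rejected = 100, 200      # the 300 PNA applicants

# ASSUMPTION: impute each PNA applicant's gender using the observed
# applicant mix (900:1800, i.e. P(male) = 1/3). Real multiple imputation
# would model the missingness and pool estimates via Rubin's rules.
p_male = applied["male"] / (applied["male"] + applied["female"])

random.seed(42)
male_rates, female_rates = [], []
for _ in range(1000):                      # number of imputed data sets
    sel_m = sum(random.random() < p_male for _ in range(pna_selected))
    rej_m = sum(random.random() < p_male for _ in range(pna_rejected))
    sel_f, rej_f = pna_selected - sel_m, pna_rejected - rej_m
    male_rates.append((selected["male"] + sel_m) / (applied["male"] + sel_m + rej_m))
    female_rates.append((selected["female"] + sel_f) / (applied["female"] + sel_f + rej_f))

print(f"male:   {min(male_rates):.1%} to {max(male_rates):.1%}")
print(f"female: {min(female_rates):.1%} to {max(female_rates):.1%}")
```

Because this assumes the PNA group mirrors the respondents, the spread across imputations is much narrower than the worst-case extremes; the range it produces reflects only the imputation model, which is exactly the kind of assumption that will be challenged.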
thank you! I'm a bit unclear about how you came up with the extremes - could you please explain?
@mjkop56 wrote:
thank you! I'm a bit unclear about how you came up with the extremes - could you please explain?
Hi,
Your original data is:
             | Male | Female | PNA
# Applied    | 900  | 1800   | 300
# Selected   | 330  | 570    | 100
Success Rate | 37%  | 32%    | 33%
The extremes are not "all the PNA are male" or "all the PNA are female." The extremes are (I think) that gender is perfectly associated with selection among the PNA. That would (I think) give you the most extreme estimates of the acceptance rates for males and females.
So if all the males among the PNA are selected and no females among the PNA are selected, your table becomes:
             | Male         | Female
# Applied    | 900+100=1000 | 1800+200=2000
# Selected   | 330+100=430  | 570
Success Rate | 43%          | 28.5%
The other extreme is all the females among the PNA are selected and no males among the PNA are selected:
             | Male         | Female
# Applied    | 900+200=1100 | 1800+100=1900
# Selected   | 330          | 570+100=670
Success Rate | 30%          | 35.3%
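For completeness, here is that bounding calculation as a small script (Python for illustration; the thread itself has no code, and the counts are the hypothetical ones above):

```python
# Worst-case bounds on the success rates, following the logic above:
# among the 300 PNA applicants, 100 were selected and 200 were not.
applied = {"male": 900, "female": 1800}
selected = {"male": 330, "female": 570}
pna_selected, pna_rejected = 100, 200

def rate(group, extra_applied, extra_selected):
    """Success rate after assigning some PNA applicants to this group."""
    return (selected[group] + extra_selected) / (applied[group] + extra_applied)

# Extreme 1: every selected PNA applicant is male, every rejected one female
male_hi   = rate("male",   pna_selected, pna_selected)    # 430/1000 = 43.0%
female_lo = rate("female", pna_rejected, 0)               # 570/2000 = 28.5%

# Extreme 2: every selected PNA applicant is female, every rejected one male
male_lo   = rate("male",   pna_rejected, 0)               # 330/1100 = 30.0%
female_hi = rate("female", pna_selected, pna_selected)    # 670/1900 ≈ 35.3%

print(f"male:   {male_lo:.1%} to {male_hi:.1%}")      # 30.0% to 43.0%
print(f"female: {female_lo:.1%} to {female_hi:.1%}")  # 28.5% to 35.3%
```

Note how much wider these bounds are than the naive "all PNA male / all PNA female" ranges (36%-37% and 32%-32%), because perfect association between gender and selection within the PNA moves both the numerator and the denominator in the worst direction.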
thanks for the explanation!
I agree imputation is problematic, and who is in the PNA is unknown (I think it could go either way).
"Prefer not to say is not missing, it's a category of its own." => good point, since someone actually chose this option rather than skipping the question. But I'm wondering what the practical difference is between the two situations. In the case of "prefer not to answer," we don't know where those people fit, so it seems to be the same as missing.
If you're going to assume a gender after someone has said they don't want to specify one, what was the point of including that option?
The point of including it is that providing gender needs to be optional; it can't be required. However, if a sizeable portion did not want to answer, it affects the conclusions you can draw.