mjkop56
Obsidian | Level 7
Hi all, I have data on a large number of people who applied to receive an award. Everyone who applied had to fill out a survey and specify their gender (although they could choose "prefer not to say"). My understanding is that it is NOT appropriate to put "error bars" on this data, given that the survey covered the whole population rather than a sample of it.
 
However, say there is missing data (some people did not report their gender), and one argues that error bars are needed to account for it. Hypothetical example: submissions are 40% female, 50% male, and 10% did not say. Suppose the success rates are 25% for females, 26% for males, and 27% for unknowns (so the conclusion depends on who is in the unknown group); one could conceivably create error bars showing what the result would be if ALL the unknowns were women vs. ALL the unknowns were men.
 
My understanding is that error bars are used to reflect sampling error, NOT missing data. With missing data one would typically impute. Is this correct? Is there any value in making up error bars for missing data?
 

(Note: cross-posted to Stack Exchange, but no responses there)


PaigeMiller
Diamond | Level 26

However, say there is missing data (some people did not report their gender) and one argues that error bars are needed to account for the missing data.

 

How would you compute error bars in this case? They would not be based on the standard error of the estimate that is usually computed for sampled data. They would be some other type of error bars ... and that's fine with me, as long as you explain that they are not based on sampling variability. But really, how do you compute the error bars in this case?

 

Imputing is another way to handle this.

--
Paige Miller
mjkop56
Obsidian | Level 7

Thank you! Hypothetical example. Here is the original data:

 

              Male   Female   PNA
# Applied      900     1800   300
# Selected     330      570   100
Success Rate   37%      32%   33%

 

2 extremes:

1) all the PNA are male - success rate for male= 36%, female = 32% (note, F does not change)

2) all the PNA are female - success rates for male = 37% (M does not change), female = 32%

 

 I guess it's more that the data would be presented as a range: Male success rate ranges from 36% - 37%

and  Female success rate ranges from 32%-32% 

 

(Note this example is made up, real data would show more of a difference)
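The "all PNA are one gender" bounds described above can be sketched in a few lines of Python. This is a minimal illustration using the hypothetical counts from this example; the variable names are mine, not from the thread:

```python
# Hypothetical counts from the example above.
applied = {"male": 900, "female": 1800, "pna": 300}
selected = {"male": 330, "female": 570, "pna": 100}

def rate(sel, app):
    """Success rate as a percentage."""
    return 100 * sel / app

# Observed rates, treating PNA as its own category
male_obs = rate(selected["male"], applied["male"])        # ~36.7%
female_obs = rate(selected["female"], applied["female"])  # ~31.7%

# Extreme 1: every PNA respondent is assumed male
male_if_all_pna_male = rate(selected["male"] + selected["pna"],
                            applied["male"] + applied["pna"])        # ~35.8%

# Extreme 2: every PNA respondent is assumed female
female_if_all_pna_female = rate(selected["female"] + selected["pna"],
                                applied["female"] + applied["pna"])  # ~31.9%

print(f"Male:   {male_if_all_pna_male:.1f}% to {male_obs:.1f}%")
print(f"Female: {female_if_all_pna_female:.1f}% to {female_obs:.1f}%")
```

Rounded to whole percentages, this reproduces the 36%-37% male range and the essentially unchanged 32% female range quoted above.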

PaigeMiller
Diamond | Level 26

@mjkop56 wrote:

Thank you! Hypothetical example. Here is the original data:

 

              Male   Female   PNA
# Applied      900     1800   300
# Selected     330      570   100
Success Rate   37%      32%   33%

 

2 extremes:

1) all the PNA are male - success rate for male= 36%, female = 32% (note, F does not change)

2) all the PNA are female - success rates for male = 37% (M does not change), female = 32%

 

 I guess it's more that the data would be presented as a range: Male success rate ranges from 36% - 37%

and  Female success rate ranges from 32%-32% 

 

(Note this example is made up, real data would show more of a difference)


Okay, you can certainly do this, but these are not error bars in any standard sense, and you would be well advised to make that abundantly clear. As pointed out by @Reeza, also be prepared to have your assumptions challenged.

--
Paige Miller
mjkop56
Obsidian | Level 7

thank you - your responses have been very helpful.

Quentin
Super User

Missing data are hard. There's a whole literature about handling missing data. I think what you're describing is some sort of 'sensitivity analysis.' 

 

Note, in your example case I think the extremes would be the:

 

  • The PNA category has 100 males who were selected and 200 females who were rejected. That would give you a 43% acceptance rate for males and 28.5% for females.
  • The PNA category has 100 females who were selected and 200 males who were rejected. That would give you a 30% acceptance rate for males and 35% for females.

Of course it's very unlikely that either of those worst-case extremes occurred, which is how you end up in the land of multiple imputation etc.

BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
mjkop56
Obsidian | Level 7

thank you! I'm a bit unclear about how you came up with the extremes - could you please explain?

Quentin
Super User

@mjkop56 wrote:

thank you! I'm a bit unclear about how you came up with the extremes - could you please explain?


Hi,

 

Your original data is:

              Male   Female   PNA
# Applied      900     1800   300
# Selected     330      570   100
Success Rate   37%      32%   33%

 

The extremes are not "all the PNA are male" or "all the PNA are female." The extremes are (I think) that gender is perfectly associated with selection among the PNA. That would give you the most extreme estimates you could get for the acceptance rates of males and females.

 

So if all males among the PNA are selected and no females among the PNA are selected, your table becomes:

 

 

              Male            Female           PNA
# Applied     900+100=1000    1800+200=2000    --
# Selected    330+100=430     570              --
Success Rate  43%             28.5%            --

 

The other extreme is that all females among the PNA are selected and no males among the PNA are selected:

 

              Male            Female           PNA
# Applied     900+200=1100    1800+100=1900    --
# Selected    330             570+100=670      --
Success Rate  30%             35%              --
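The perfect-association extremes above can be verified with a short Python sketch. The counts are the hypothetical figures from this thread; the variable names are mine:

```python
# Among the 300 PNA applicants, 100 were selected and 200 were rejected.
# Each extreme assumes selection is perfectly tied to gender within the PNA group.
applied = {"male": 900, "female": 1800}
selected = {"male": 330, "female": 570}
pna_selected, pna_rejected = 100, 200

def rate(sel, app):
    """Success rate as a percentage."""
    return 100 * sel / app

# Extreme 1: all 100 selected PNA are male, all 200 rejected PNA are female
male_hi = rate(selected["male"] + pna_selected,
               applied["male"] + pna_selected)      # 43.0%
female_lo = rate(selected["female"],
                 applied["female"] + pna_rejected)  # 28.5%

# Extreme 2: all 100 selected PNA are female, all 200 rejected PNA are male
male_lo = rate(selected["male"],
               applied["male"] + pna_rejected)      # 30.0%
female_hi = rate(selected["female"] + pna_selected,
                 applied["female"] + pna_selected)  # ~35.3%

print(f"Male:   {male_lo:.1f}% to {male_hi:.1f}%")
print(f"Female: {female_lo:.1f}% to {female_hi:.1f}%")
```

Note that these bounds (30%-43% for males, 28.5%-35.3% for females) are much wider than the "all PNA are one gender" bounds earlier in the thread, because they also let selection vary within the PNA group.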

 

Reeza
Super User
Prefer not to say is not missing; it's a category on its own. And from my somewhat biased experience, it's most likely women/non-binary people who choose that option. However, having that option and then attempting imputation seems problematic. If you want to analyze by subpopulation, things get a bit different, but I wouldn't impute or combine it.
mjkop56
Obsidian | Level 7

I agree imputation is problematic, and who is in the PNA is unknown (I think it could go either way).

Reeza
Super User
As long as the sample size is large enough, I wouldn't recommend consolidating these categories. PNA could include non-binary people, who would not fit in either group.
mjkop56
Obsidian | Level 7

"Prefer not to say is not missing, it's a category on its own." => Good point, since someone actually chose this option rather than skipping the question. But I'm wondering what the practical difference is between those two situations. In the case of "prefer not to answer," we don't know where they fit, so it seems to be the same thing as missing.

Reeza
Super User

If you're going to assume a gender after someone has said they don't want to specify one, what was the point of including that option?

 

 

mjkop56
Obsidian | Level 7

The point of including it is that providing gender must be optional; it can't be required. However, if a sizeable portion did not want to answer, that affects the conclusions you can draw.

