- Winsorize independent variable for two groups sepa...

10-21-2015 08:47 AM

Hi,

I'm working with a dataset that is a combination of two separate datasets. There are certain variables that are positively skewed. I have decided to winsorize to address this but I wasn't sure if I should winsorize the variables from the different datasets separately or all together.

For example (winsorize at 75th percentile):

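| Dataset | Freckles | Winsorize_together_75 | Winsorize_groups_separately_75 |
|---------|----------|-----------------------|--------------------------------|
| 1       | 10       | 10                    | 10                             |
| 1       | 15       | 15                    | 15                             |
| 1       | 20       | 20                    | 20                             |
| 1       | 99       | 75                    | 99                             |
| 1       | 100      | 75                    | 99                             |
| 1       | 10       | 10                    | 10                             |
| 2       | 15       | 15                    | 15                             |
| 2       | 20       | 20                    | 20                             |
| 2       | 20       | 20                    | 20                             |
| 2       | 25       | 25                    | 25                             |
| 2       | 75       | 75                    | 55                             |
| 2       | 105      | 75                    | 55                             |
| 2       | 35       | 35                    | 35                             |
| 2       | 35       | 35                    | 35                             |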

Should I winsorize the positively skewed variable for the overall dataset or for the two datasets separately?

Thanks!
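
For readers who want to check the two winsorized columns in the example, here is a minimal sketch in Python (standard library only). It assumes upper-tail winsorizing only, and it assumes the 75th percentile is computed with the "empirical distribution function with averaging" rule, which is the rule that reproduces the caps of 75, 99, and 55 in the table and appears to match SAS's default percentile definition. The function and variable names are illustrative, not from the original post.

```python
import math

def percentile_edf_avg(values, p):
    """p-th percentile (0 < p < 1) via the empirical distribution
    function with averaging (assumed rule, see lead-in)."""
    xs = sorted(values)
    n = len(xs)
    target = n * p
    j = math.floor(target)      # integer part of n*p
    g = target - j              # fractional part of n*p
    if j >= n:
        return xs[-1]
    if g == 0 and j >= 1:
        # n*p is an integer: average the j-th and (j+1)-th ordered values
        return (xs[j - 1] + xs[j]) / 2
    # otherwise take the (j+1)-th ordered value (index j, 0-based)
    return xs[j]

def winsorize_upper(values, cap):
    """Replace every value above `cap` with `cap` (upper tail only)."""
    return [min(v, cap) for v in values]

# Freckles values from the example table, in row order
freckles_1 = [10, 15, 20, 99, 100, 10]
freckles_2 = [15, 20, 20, 25, 75, 105, 35, 35]

# Option A: one cap computed from the pooled data
cap_pooled = percentile_edf_avg(freckles_1 + freckles_2, 0.75)
together_1 = winsorize_upper(freckles_1, cap_pooled)
together_2 = winsorize_upper(freckles_2, cap_pooled)

# Option B: a separate cap for each dataset
cap_1 = percentile_edf_avg(freckles_1, 0.75)
cap_2 = percentile_edf_avg(freckles_2, 0.75)
separate_1 = winsorize_upper(freckles_1, cap_1)
separate_2 = winsorize_upper(freckles_2, cap_2)

print("pooled cap:", cap_pooled, "| group caps:", cap_1, cap_2)
print("together:  ", together_1, together_2)
print("separately:", separate_1, separate_2)
```

A different percentile definition can give different caps on samples this small, so the rule used should be stated alongside any winsorized result.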

10-21-2015 08:53 AM

The decision on what analysis to do depends on the goal of the analysis. Can you please state the goal of your analysis? Thanks.

10-21-2015 11:23 AM

Thanks for your reply. I'm trying to perform a t-test.

To give you more background, the individuals all come from one online community but were divided based on how they responded to a particular question.

10-21-2015 12:16 PM

he2182 wrote:

Thanks for your reply. I'm trying to perform a t-test.

To give you more background, the individuals all come from one online community but were divided based on how they responded to a particular question.

This doesn't really tell us whether the distributions of the individuals are the same or different, based on how they responded to a particular question. So, I don't really have a recommendation based on this about how to perform the winsorizing.

But I do agree completely with @SteveDenham on this matter, in which case the issue of how to winsorize isn't relevant.

10-21-2015 09:08 AM

To expand on PaigeMiller's suggestion, do you think the data are coming from a single population or from two different populations?

If you merge and then Winsorize, you are assuming that each sample is drawn from the same population. If you Winsorize separately, you are implicitly assuming that each sample comes from its own population, which makes me wonder whether it is appropriate to merge them together.

10-21-2015 09:13 AM

Rick_SAS wrote:

To expand on PaigeMiller's suggestion, do you think the data are coming from a single population or from two different populations?

Well, Rick, that's a good point, but it wasn't my point. My point is that if you are looking to compare means or medians, then that might lead to one decision, and if you are looking to compare standard deviations or variances, you might choose a different decision. We don't know what comparison or analysis the user wants to do.
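
To make the means-versus-variances distinction concrete, the sketch below compares each group's mean and standard deviation under three treatments of the original poster's toy data: raw, winsorized at the pooled cap, and winsorized at per-group caps. The caps (75 pooled; 99 and 55 per group) are taken from the example table rather than recomputed, and the helper names are illustrative only.

```python
from statistics import mean, stdev

def winsorize_upper(values, cap):
    """Replace every value above `cap` with `cap` (upper tail only)."""
    return [min(v, cap) for v in values]

# Freckles values from the original poster's table
group_1 = [10, 15, 20, 99, 100, 10]
group_2 = [15, 20, 20, 25, 75, 105, 35, 35]

# Caps copied from the table: 75 pooled, 99 and 55 per group
schemes = {
    "raw": (group_1, group_2),
    "pooled cap": (winsorize_upper(group_1, 75), winsorize_upper(group_2, 75)),
    "separate caps": (winsorize_upper(group_1, 99), winsorize_upper(group_2, 55)),
}

summary = {}
for name, (g1, g2) in schemes.items():
    summary[name] = {
        "mean_diff": mean(g1) - mean(g2),  # what a t-test compares
        "sd_1": stdev(g1),                 # what a variance comparison sees
        "sd_2": stdev(g2),
    }
    print(f"{name:14s} mean diff = {summary[name]['mean_diff']:6.2f}  "
          f"sd1 = {summary[name]['sd_1']:6.2f}  sd2 = {summary[name]['sd_2']:6.2f}")
```

On these 14 values the difference in means is about 1.08 for the raw data, about -3.33 with the pooled cap, and about 9.67 with separate caps, so the winsorizing choice alone changes the sign and size of the quantity a t-test would examine. This is a toy illustration, not evidence about the real dataset.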

10-21-2015 11:43 AM

I don't particularly care for Winsorizing data. Trading a long tail for a heavy tail does little to make the distribution normal, and it grossly underestimates the true variance, no matter what the distribution. If there isn't a particular process known to generate the data (waiting times, counts, etc.), then it's probably not a good idea to assume an underlying non-normal distribution, such as a gamma or Poisson. Which means:

Why not use a nonparametric test? The median is probably a better indicator of central tendency for these samples in any case, so a Wilcoxon rank sum test would be nearly ideal for what I think the OP is trying to do.

Steve Denham
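
As a rough illustration of the rank sum test suggested above, here is a sketch in Python (standard library only) applied to the unwinsorized toy data from the first post. It assumes midranks for ties and a tie-corrected normal approximation without continuity correction. With only 14 observations an exact test would be preferable, so treat the p-value as indicative only. The function names are illustrative.

```python
import math

def midranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_test(a, b):
    """Wilcoxon rank sum W, Mann-Whitney U, z, and two-sided p
    (normal approximation with tie correction; an assumption here)."""
    n1, n2 = len(a), len(b)
    n = n1 + n2
    combined = list(a) + list(b)
    ranks = midranks(combined)
    w = sum(ranks[:n1])              # rank sum of the first sample
    u = w - n1 * (n1 + 1) / 2        # equivalent Mann-Whitney U
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u - n1 * n2 / 2) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return w, u, z, p

# Raw Freckles values from the original poster's table
group_1 = [10, 15, 20, 99, 100, 10]
group_2 = [15, 20, 20, 25, 75, 105, 35, 35]

w, u, z, p = rank_sum_test(group_1, group_2)
print(f"W = {w}, U = {u}, z = {z:.3f}, p = {p:.3f}")
```

For this toy data the rank sum for dataset 1 is W = 37.5 and U = 16.5. Because the test uses only ranks, the two extreme values in each group have no more pull than any other large observation.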

10-21-2015 12:37 PM

Thank you for your suggestion.

I have already performed the Wilcoxon rank sum test but also wanted to do a t-test.

10-26-2015 02:38 PM

Why do the t-test? You have already tested whether the two groups differ as far as location. If you did do another test, did you plan on adjusting the p-value for multiple testing? Or, and I really hope this is not the case, were you going to keep doing tests until you found one that agreed with your hoped-for outcome? I offer the following from John Tukey: "The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data."

Steve Denham
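
For reference, adjusting p-values for multiple testing could look like the sketch below, which shows the Bonferroni and Holm adjustments for two tests. The p-values are hypothetical placeholders, since the thread does not report any, and the function names are illustrative.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def holm(p_values):
    """Holm step-down adjustment: the smallest p-value is multiplied by m,
    the next by m - 1, and so on, keeping the adjusted values non-decreasing."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical p-values for illustration only; the thread reports none.
p_rank_sum, p_t_test = 0.03, 0.04

bonf = bonferroni([p_rank_sum, p_t_test])
holm_adj = holm([p_rank_sum, p_t_test])
print("Bonferroni:", bonf)
print("Holm:      ", holm_adj)
```

An adjustment of this kind controls the error rate across the tests that were run. It does not address the separate concern raised above about choosing among tests after seeing the results.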

10-26-2015 05:24 PM

Thank you for your concern. I wanted to report both results. The Wilcoxon rank sum test results were significant, so worry not, we were not on a fishing expedition.