he2182
Calcite | Level 5

 Hi,

 

I'm working with a dataset that is a combination of two separate datasets. There are certain variables that are positively skewed. I have decided to winsorize to address this but I wasn't sure if I should winsorize the variables from the different datasets separately or all together.

 

For example (winsorize at 75th percentile):

 

Dataset  Freckles  Winsorize_together_75  Winsorize_groups_separately_75
1        10        10                     10
1        15        15                     15
1        20        20                     20
1        99        75                     99
1        100       75                     99
1        10        10                     10
2        15        15                     15
2        20        20                     20
2        20        20                     20
2        25        25                     25
2        75        75                     55
2        105       75                     55
2        35        35                     35
2        35        35                     35

 

Should I winsorize the positively skewed variable for the overall dataset or for the two datasets separately?

 

Thanks!

 

9 REPLIES
PaigeMiller
Diamond | Level 26

The decision on what analysis to do depends on the goal of the analysis. Can you please state the goal of your analysis? Thanks.

--
Paige Miller
he2182
Calcite | Level 5

Thanks for your reply. I'm trying to perform a t-test.

 

To give you more background, the individuals all come from one online community but were divided based on how they responded to a particular question. 

PaigeMiller
Diamond | Level 26

@he2182 wrote:

Thanks for your reply. I'm trying to perform a t-test.

 

To give you more background, the individuals all come from one online community but were divided based on how they responded to a particular question. 


This doesn't really tell us whether the two groups, split by how they responded to a particular question, have the same or different distributions. So I don't have a recommendation, based on this, about how to perform the winsorizing.

 

But I do agree completely with @SteveDenham on this matter, in which case the issue of how to winsorize isn't relevant.

--
Paige Miller
Rick_SAS
SAS Super FREQ

To expand on PaigeMiller's suggestion, do you think the data are coming from a single population or from two different populations?

 

If you merge and then Winsorize, you are assuming that each sample is drawn from the same population. If you Winsorize separately, you are implicitly assuming that each sample comes from its own population, which makes me wonder whether it is appropriate to merge them together.
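To make the two options concrete, here is a minimal Python sketch (not SAS, and not anyone's actual analysis in this thread) that caps the upper tail at the 75th percentile, once with a single pooled cutoff and once with a cutoff per dataset. The helper names `percentile` and `winsorize_upper` are hypothetical, and the linear-interpolation percentile rule is an assumption; other percentile definitions give different cutoffs on samples this small.

```python
# Hypothetical sketch: upper-tail winsorizing, pooled versus per group.
# The percentile rule (linear interpolation) is an assumption; other
# definitions give different cutoffs on small samples.

def percentile(values, pct):
    """pct-th percentile by linear interpolation between order statistics."""
    s = sorted(values)
    pos = (len(s) - 1) * pct / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def winsorize_upper(values, cap):
    """Replace every value above cap with cap; leave the rest unchanged."""
    return [min(v, cap) for v in values]

# Freckles values from the example table in the original question.
group1 = [10, 15, 20, 99, 100, 10]
group2 = [15, 20, 20, 25, 75, 105, 35, 35]

# Option A: one cutoff from the merged data (assumes a single population).
pooled_cap = percentile(group1 + group2, 75)
pooled = (winsorize_upper(group1, pooled_cap),
          winsorize_upper(group2, pooled_cap))

# Option B: one cutoff per dataset (assumes two separate populations).
cap1 = percentile(group1, 75)
cap2 = percentile(group2, 75)
separate = (winsorize_upper(group1, cap1),
            winsorize_upper(group2, cap2))

print("pooled cutoff:", pooled_cap)
print("separate cutoffs:", cap1, cap2)
```

Under this percentile rule the pooled cutoff is 65, while the separate cutoffs are 79.25 and 45, so the same raw value can be capped in one scheme and left alone in the other. That is exactly where the single-population versus two-population assumption shows up in the data.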

PaigeMiller
Diamond | Level 26

@Rick_SAS wrote:

To expand on PaigeMiller's suggestion, do you think the data are coming from a single population or from two different populations?



Well, Rick, that's a good point, but it wasn't my point. My point is that if you are looking to compare means or medians, then that might lead to one decision, and if you are looking to compare standard deviations or variances, you might choose a different decision. We don't know what comparison or analysis the user wants to do.

--
Paige Miller
SteveDenham
Jade | Level 19

I don't particularly care for Winsorizing data. Trading a long tail for a heavy tail accomplishes little as far as getting a normal distribution, and it grossly underestimates the true variance, no matter what the distribution. If there isn't a particular process known to generate the data (waiting times, counts, etc.), then it's probably not a good idea to assume an underlying non-normal distribution, such as a gamma or Poisson. Which means:

 

Why not use a nonparametric test?  The median is probably a better indicator of central tendency for these samples in any case, so a Wilcoxon rank sum test would be nearly ideal for what I think the OP is trying to do.
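For readers who want to see what the rank sum test computes, here is a rough Python sketch using only the standard library. It is an illustration under stated assumptions, not a replacement for a proper procedure: the function names `average_ranks` and `rank_sum_test` are hypothetical, the p-value uses the large-sample normal approximation with no tie correction and no continuity correction, and with samples as small as the example in the question an exact test would be more appropriate. In SAS itself, PROC NPAR1WAY with the WILCOXON option is the usual route.

```python
from math import sqrt
from statistics import NormalDist

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_test(x, y):
    """Return (W, z, two-sided p), where W is the rank sum of sample x.

    Assumption: normal approximation, no tie or continuity correction.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])
    mean_w = n1 * (n1 + n2 + 1) / 2
    var_w = n1 * n2 * (n1 + n2 + 1) / 12
    z = (w - mean_w) / sqrt(var_w)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return w, z, p

# Raw (un-winsorized) Freckles values from the example in the question.
group1 = [10, 15, 20, 99, 100, 10]
group2 = [15, 20, 20, 25, 75, 105, 35, 35]
w, z, p = rank_sum_test(group1, group2)
print(f"W = {w}, z = {z:.3f}, approximate two-sided p = {p:.3f}")
```

Because the test works on ranks, the extreme values only contribute their position in the ordering, which is why no winsorizing step is needed beforehand.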

 

Steve Denham

he2182
Calcite | Level 5

Thank you for your suggestion.  

 

I have already performed the Wilcoxon rank sum test but also wanted to do a t-test.

SteveDenham
Jade | Level 19

Why do the t test? You have already tested whether the two groups differ as far as location. If you did do another test, did you plan on adjusting the p value for multiple testing? Or, and I really hope this is not the case, were you going to keep doing tests until you found one that agreed with your hoped-for outcome? I offer the following from John Tukey: "The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data."

Steve Denham

he2182
Calcite | Level 5

Thank you for your concern. I wanted to report both results. The Wilcoxon rank sum test results were significant, so worry not, we were not on a fishing expedition.

