I'd like to winsorize my data from a self-paced reading study. In particular, (in the sample data below) I'd like to make it so that any value in "responsetime" that is more than 2.5 times the standard deviation away from the mean for a given "region", then that too-high value will be replaced by 2.5 times the standard deviation plus the mean. For example, region 4 (with values 304, 510, 331, 474, 1347, 379, 1072, 854, 542, 547, 340, 410) has a mean of 592.5 and a standard deviation of 314.6. Therefore, any value for region 3 that is more than 1378.7 should be replaced by 1378.7. In this sample, none of the values would be replaced. Is there a relatively simple way to command SAS to do this for my whole data set?
I'm also uncertain whether this should be done by conditions and regions, or just by regions. My thinking is that it's more conservative to do it by regions only.
data reading2;
input subject
lines subject$ item $ region $ responsetime verbtype$ animate$;
400 | 4 | 4 | 304 | Alternating | Animate |
400 | 4 | 5 | 498 | Alternating | Animate |
400 | 4 | 6 | 340 | Alternating | Animate |
400 | 4 | 7 | 485 | Alternating | Animate |
400 | 15 | 4 | 510 | NonAlternating | Inanimate |
400 | 15 | 5 | 304 | NonAlternating | Inanimate |
400 | 15 | 6 | 371 | NonAlternating | Inanimate |
400 | 15 | 7 | 483 | NonAlternating | Inanimate |
400 | 23 | 4 | 331 | NonAlternating | Inanimate |
400 | 23 | 5 | 332 | NonAlternating | Inanimate |
400 | 23 | 6 | 332 | NonAlternating | Inanimate |
400 | 23 | 7 | 357 | NonAlternating | Inanimate |
400 | 6 | 4 | 474 | Alternating | Animate |
400 | 6 | 5 | 309 | Alternating | Animate |
400 | 6 | 6 | 335 | Alternating | Animate |
400 | 6 | 7 | 337 | Alternating | Animate |
401 | 26 | 4 | 1347 | NonAlternating | Inanimate |
401 | 26 | 5 | 514 | NonAlternating | Inanimate |
401 | 26 | 6 | 484 | NonAlternating | Inanimate |
401 | 26 | 7 | 527 | NonAlternating | Inanimate |
401 | 5 | 4 | 379 | Alternating | Animate |
401 | 5 | 5 | 378 | Alternating | Animate |
401 | 5 | 6 | 1720 | Alternating | Animate |
401 | 5 | 7 | 399 | Alternating | Animate |
401 | 19 | 4 | 1072 | NonAlternating | Animate |
401 | 19 | 5 | 459 | NonAlternating | Animate |
401 | 19 | 6 | 679 | NonAlternating | Animate |
401 | 19 | 7 | 985 | NonAlternating | Animate |
401 | 18 | 4 | 854 | NonAlternating | Inanimate |
401 | 18 | 5 | 434 | NonAlternating | Inanimate |
401 | 18 | 6 | 959 | NonAlternating | Inanimate |
401 | 18 | 7 | 438 | NonAlternating | Inanimate |
401 | 17 | 4 | 542 | NonAlternating | Animate |
401 | 17 | 5 | 374 | NonAlternating | Animate |
401 | 17 | 6 | 398 | NonAlternating | Animate |
401 | 17 | 7 | 428 | NonAlternating | Animate |
401 | 23 | 4 | 547 | NonAlternating | Animate |
401 | 23 | 5 | 522 | NonAlternating | Animate |
401 | 23 | 6 | 579 | NonAlternating | Animate |
401 | 23 | 7 | 401 | NonAlternating | Animate |
401 | 2 | 4 | 340 | Alternating | Inanimate |
401 | 2 | 5 | 437 | Alternating | Inanimate |
401 | 2 | 6 | 361 | Alternating | Inanimate |
401 | 2 | 7 | 369 | Alternating | Inanimate |
401 | 13 | 4 | 410 | Alternating | Animate |
401 | 13 | 5 | 392 | Alternating | Animate |
401 | 13 | 6 | 702 | Alternating | Animate |
401 | 13 | 7 | 1280 | Alternating | Animate |
;
I think this may do what you are asking for.
proc sql; create table want as select a.lines, a.subject, a.item,min(a.responstime,b.limit) as responsetime , a.region, a.verbtype from have as a left join (select region,mean(responsetime)+2.5*std(responsetime) as limit from have group by region) as b on a.region=b.region ; quit;
I would be very careful about actually replacing values in tables until you are sure of the result to avoid destroying the existing data (incorrect code, losing variables or such).
You did not mention any lower limit check so none is attempted.
You can certainly carry out the scheme that you request, but be aware that "to Winsorize data" has a particular statistical meaning that is different from what you describe. For the statistical definition, see "How to Winsorize data in SAS." For a discussion of the pros and cons of trimming the extreme values, see "Winsorization: The good, the bad, and the ugly."
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.