Hello experts,
I want to winsorize all of my variables, and I am using the following code given by @Linlin. The problem is that this code does not work for the 0.1 and 99.9 percentiles or the 0.5 and 99.5 percentiles; it works only for the 1st and 99th percentiles. Also, I cannot winsorize three variables with this code. What code is required to winsorize multiple variables at a time? I am winsorizing 24,000 observations.
data have;
    input price size;
    cards;
10 3
20 5
30 5
45 1
2 3
20 2
30 2
45 1
38 3
20 3
39 3
46 1
;
run;

proc univariate data=have noprint;
    var price size;
    output out=temp pctlpts = 1 99 pctlpre = price size pctlname = pct1 pct99;
run;

/* create 4 macro variables holding the 4 percentile values of interest */
data _null_;
    set temp;
    call symputx('price1', pricepct1);
    call symputx('price99', pricepct99);
    call symputx('size1', sizepct1);
    call symputx('size99', sizepct99);
run;
%put _user_;

/* keep only observations strictly inside both percentile ranges */
data want;
    set have;
    where (&price1 < price < &price99) and (&size1 < size < &size99);
run;

proc print; run;
First, the code you are using does not perform Winsorization. It performs "trimming" because it excludes the extreme values. For Winsorization, you need to REPLACE the extreme values with another, less extreme value. Furthermore, the code you are using is based on quantiles, whereas Winsorization is defined in terms of COUNTS (the k smallest and k largest values). If there are tied values, the two approaches do not give the same result.
> Also, I can not winsorize three variable by using this code
I'm not sure why you say this. The code creates a new data set that has fewer observations by discarding the extreme percentiles. The code you are using discards the ENTIRE observation if one of the variables has an extreme value. Suppose for X1 you discard observations 1 and 2, and for X2 you discard observations 101 and 102. Then the new data set equals the old one except that it does not have obs 1,2,101, or 102. This generalizes.
Although you CAN generalize the program, you should ask yourself whether it is a good idea. Personally, I wouldn't throw out the entire observation just because one value is extreme. For more about Winsorization, see "Winsorization: The good, the bad, and the ugly."
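To make the trimming-versus-replacing distinction concrete, here is a minimal sketch that clamps each variable to its own cutoffs instead of deleting rows. It is only an illustration: it still uses percentile cutoffs rather than counts, so with tied values it will not match the count-based definition described above. The data set name cutoffs and the _lo/_hi suffixes are hypothetical names chosen for this example.

```sas
/* One-row data set of cutoffs: price_lo price_hi size_lo size_hi */
/* (the _lo/_hi suffixes are hypothetical names for this sketch)   */
proc univariate data=have noprint;
    var price size;
    output out=cutoffs pctlpts = 1 99 pctlpre = price size pctlname = _lo _hi;
run;

data want;
    if _n_ = 1 then set cutoffs;   /* cutoffs are retained for every row */
    set have;
    array v[*]  price size;
    array lo[*] price_lo size_lo;
    array hi[*] price_hi size_hi;
    do i = 1 to dim(v);
        /* replace extreme values with the cutoff; leave missing values alone */
        if not missing(v[i]) then v[i] = min(max(v[i], lo[i]), hi[i]);
    end;
    drop i price_lo price_hi size_lo size_hi;
run;
```

Because every variable is clamped independently, no observation is discarded, and adding a third variable only means adding it to the VAR statement, the PCTLPRE= list, and the three arrays.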
As suggested in the "Winsorized mean" item in the list of Frequently Asked-for Statistics (FASTats) in the Important Links section of the Statistical Procedures Community page, use the WINSOR= option in PROC UNIVARIATE.
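As a rough sketch of that suggestion, assuming the same have data set as in the question, the option goes on the PROC UNIVARIATE statement. A value between 0 and 1/2 is treated as the proportion of observations to Winsorize in each tail; the 0.1 used here is only an illustrative choice.

```sas
/* Report Winsorized means for price and size.               */
/* 0.1 is an illustrative proportion, not a recommended one. */
proc univariate data=have winsorized=0.1;
    var price size;
run;
```

Note that this reports the Winsorized mean for each variable; it does not write a data set of Winsorized values.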
Did you try Pctlpts = .5 99.5 ?
Do you have a clue how to modify the macro code to match the names created for the pctlpts .5 and 99.5?
This has no problems creating the .5 and 99.5:
proc univariate data=have noprint;
    var price size;
    output out=temp pctlpts = .5 99.5 pctlpre = price size pctlname = pct_5 pct99_5;
run;
Look at the names of the variables in the Temp set and it should be easy to modify that data step creating macro variables.
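For instance, with pctlname = pct_5 pct99_5 the Temp data set should contain pricepct_5, pricepct99_5, sizepct_5, and sizepct99_5, so the data step might be adapted roughly as follows. The macro variable names are arbitrary choices for this sketch.

```sas
/* Macro variable names (price_lo, price_hi, ...) are hypothetical */
data _null_;
    set temp;
    call symputx('price_lo', pricepct_5);
    call symputx('price_hi', pricepct99_5);
    call symputx('size_lo',  sizepct_5);
    call symputx('size_hi',  sizepct99_5);
run;
%put _user_;
```

Any later step that used &price1, &price99, &size1, and &size99 would then refer to these new macro variables instead.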