Hello, I have several (hundreds of) variables that I need to “winsorize” at some level, let us say at the 95% and 5%. As you know, this means that all the observations with values > 95 percentile take the value of the 95% percentile , and all observations with value less than the 5% percentile take the value of the 5th percentile. I have been using the proc univariate, get the p5 and p95, and use “if” statements : For instance for a variable Raw1, I compute the 95 and 5 percentiles, I create a new variable Raw1_w and I "winsorize" it with if statements. Raw1_w = Raw1; if Raw1 < Raw1_p5 then Raw1_w = Raw1_p5; if Raw1 > Raw1_p95 then Raw1_w = Raw1_p95; if Raw1 =. then Raw1_w = .; The problem is that when I have several hundreds variables, this 4 lines of code per variable result in a very complex and very long program and I make mistakes too frequently .. , and I was wondering if you can think in a better solution to do this. Thank you !!!,
Use a macro.
Here's one that caps outliers using 25/75 but could easily be modified for 5/95 and caps them.
SAS Box-Plot/Tukey Method of Capping Outliers
If it's to calculate means/statistics proc univariate will also do that using winsorize option.
Use a macro.
Here's one that caps outliers using 25/75 but could easily be modified for 5/95 and caps them.
SAS Box-Plot/Tukey Method of Capping Outliers
If it's to calculate means/statistics proc univariate will also do that using winsorize option.
Thank you again Reeza, My first time using these macros. It works. Would you know by a chance how to modify the macro so the winsorized variables have different names than the original varialbes? For instance, creating new names for the winsorized variables adding _winsor to the original name? I tried several things but I was not successful Thank you again
Modify the macro slightly - I'm adding the win to the beginning of the variable because that will allow you to use naming shortcuts in SAS later on if required.
%macro cap(dset=,var=, lower=, upper=);
data &dset;
set &dset;
win_&var = &var;
if &var>&upper then win_&var=&upper;
if &var<&lower then win_&var=&lower;
run;
%mend;
Thank you again !!!!
Here is a simple IML code .
data have;
do i=1 to 100;
a=ceil(ranuni(1)*100);
b=ceil(ranuni(2)*100);
output;
end;
drop i;
run;
%let low=0.05 ;
%let high=0.95 ;
proc iml;
use have;
read all var _num_ into x[c=vname];
close have;
call qntl(q,x,{&low ,&high});
do i=1 to ncol(x);
x[loc(x[,i]<q[1,i]),i]=q[1,i];
x[loc(x[,i]>q[2,i]),i]=q[2,i];
end;
create want from x[c=vname];
append from x;
close want;
quit;
Xia Keshan
The previous solutions have two problems:
1) They do not correctly Winsorize the data
2) They do not handle missing values
Winsorization does not involve quantiles. To Winsorize, you specify an integer k. You then replace the lowest (nonmissing) k values by the (k+1)st ordered value, and replace the highest k values by the (n-k)th ordered value, where n is the number of nonmissing observations.
Notice that if there are repeated values in the tails, then Winsorizing is different than "capping" and quantiles are not the same as choosing k. For example, if k=2 and the data are
0 4 4 4 5 5 6 6 6 20
then the Winsorized data should be
4 4 4 4 5 5 6 6 6 6
Here is a SAS/IML program that Winsorizes data. To validate, take the (ordinary) mean of the Winsorized data and compare to the Winsorized means as computed by using PROC UNIVARIATE:
%let dsname = sashelp.heart;
/* SAS/IML program to Winsorize data */
/* Modified from the "trimmed mean" algorithm in
Wicklin (2010) "Rediscovering SAS/IML Software", p 2-3 */
proc iml;
start Winsorize(x, prop);
p = ncol(x); /* number of columns */
w = x; /* copy of x */
do i = 1 to p;
z = x[,i]; /* copy i_th column */
n = countn(z); /* count nonmissing values */
k = ceil(prop*n); /* number of observations to trim */
r = rank(z); /* rank data in i_th column */
/* find target values and obs with smaller/larger values */
lowIdx = loc(r<=k & r^=.);
lowVal = z[loc(r=k+1)];
highIdx = loc(r>=n-k+1);
highVal = z[loc(r=n-k)];
/* Winsorize k largest and k smallest values */
w[lowIdx,i] = lowVal;
w[highIdx,i] = highVal;
end;
return(w);
finish;
/* test it */
use &dsname;
read all var _NUM_ into X[colname=varNames];
close;
winX = Winsorize(X, 0.1);
/* Optional: compute Winsorized mean = mean of the Winsorized data */
winMean = mean(winX);
print winMean[c=varNames f=8.4];
/* write Winsorized data to data set, if necessary */
quit;
/* Validation: compute Winsorized means by using UNIVARIATE */
ods graphics off;
ods exclude all;
proc univariate data=&dsname winsorized=0.1;
ods output WinsorizedMeans=winMeans;
run;
ods exclude none;
proc print data=winMeans;
var VarName HalfP HalfN Mean;
run;
Thanks Rick, that is good to know.
Thank you Rick,
I see the problems . One question is that you mention you need to specify an integer K. Perhaps a problem is that this integer should be (for instance) the 20 percentile, and 80 percentile (to winsorize at the 20 % and 80%), so k is different for each variable. See for instance the desired output. The first column is an ID, columns 2 and 3 have the data for Va and VB, the other 2 columns have two variables winsorized. Va has several missing values.
My question is thus, how can this program be “adapted” to make “k” be the percentile desired to winsorize at the 20%-80%instead of a fix number?
Thank you again,
id Va Vb Va_win Vb_win
2 4 1001 4 998
3 . 1000 . 998
4 4 999 4 998
5 . 998 . 998
6 5 997 5 997
7 6 996 6 996
8 . 995 . 995
9 8 994 8 994
10 9 993 9 993
18 . 985 . 985
19 18 984 18 984
20 . 983 . 983
21 20 982 20 982
22 . 981 . 981
23 . 980 . 980
24 . 979 . 979
25 . 978 . 978
26 25 977 25 978
27 . 976 . 978
28 26 976 25 978
If you want to just cap it at the values of the 20th and 80th percentiles, see the answers by the others. In particular, @Ksharp has already shown how to use the QNTL function in IML to replace values less than (resp., greater than) the 20th (resp, 80th) percentile.
Just understand that what you are doing is slightly different than the Winsorization method that PROC UNIVARIATE uses. If there are no repeated values and no missing values, the two methods are equivalent, I think.
@Rick
There are two different way to define winsorize in PROC UNIVARIATE , one is like yours , another is as OP's said.
Xia Keshan (a.k.a Ksharp)
No, UNIVARIATE uses only the symmetric definition. however, you can specify the number of observations to Winsorize as an integer, k, or as a percentage (0.1=10%). In both cases the procedure Winsorizes the same number of observations in both tails.
Percentages are not the same as percentiles. As I show in my response, when there are repeated values this is not the same as changing all data less than the alpha_th percentile and all data greater than the (1-alpha)th percentile.
I don't see the difference in what I've coded and how Wikipedia and NumPy define Winsorize :smileyconfused:
Winsorising - Wikipedia, the free encyclopedia
scipy.stats.mstats.winsorize — SciPy v0.14.0 Reference Guide
Yes, the Wikipedia article uses a different definition than PROC UNIVARIATE does. The UNIVARIATE procedure uses the statistical definition of the symmetric Winsorization as studied by Tukey and colleagues in s series of papers. (Symmetric means change k values in the upper and lower tails.) By using the Tukey definition, you can get standard errors, confidence intervals, and other distributional information about the Winsorized mean.
There's nothing inherently wrong with replacing via quantiles, but I don't know whether the distribution of the resulting statistics is understood for that nonsymmetric case.
In case this thread is referenced in the future, I am including a link to the article "How to Winsorize data in SAS," in which I consolidate my thoughts on this issue.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.