Obsidian | Level 7

## How to winsorize

Hello, I have several (hundreds of) variables that I need to “winsorize” at some level, let us say at the 95% and 5%.  As you know, this  means that all the observations with values > 95 percentile take the value of the 95% percentile , and all observations with value less than the 5% percentile take the value of the 5th percentile. I have been using the proc univariate, get the p5 and p95, and use “if” statements : For instance for a variable Raw1, I compute the 95 and 5 percentiles, I create a new variable Raw1_w  and I "winsorize" it with if statements. Raw1_w =  Raw1; if  Raw1 < Raw1_p5  then Raw1_w =  Raw1_p5; if  Raw1 > Raw1_p95  then Raw1_w =  Raw1_p95; if  Raw1 =.  then Raw1_w = .; The problem is that when I have several hundreds variables, this 4 lines of code per variable result in a  very complex and very long program and I make mistakes too frequently .. , and I was wondering if you can think in a better solution to do this. Thank you  !!!,

1 ACCEPTED SOLUTION

Accepted Solutions
Super User

## Re: How to winsorize

Use a macro.

Here's one that caps outliers using 25/75 but could easily be modified for 5/95 and caps them.

SAS Box-Plot/Tukey Method of Capping Outliers

If it's to calculate means/statistics proc univariate will also do that using winsorize option.

14 REPLIES 14
Super User

## Re: How to winsorize

Use a macro.

Here's one that caps outliers using 25/75 but could easily be modified for 5/95 and caps them.

SAS Box-Plot/Tukey Method of Capping Outliers

If it's to calculate means/statistics proc univariate will also do that using winsorize option.

Obsidian | Level 7

## Re: How to winsorize

Thank you again Reeza, My first time using these macros.  It works. Would you know by a chance how to modify the macro so the winsorized variables have different names than the original varialbes?  For instance, creating  new names for the winsorized variables adding _winsor to the original name?  I tried several things but I was not successful Thank you again

Super User

## Re: How to winsorize

Modify the macro slightly - I'm adding the win to the beginning of the variable because that will allow you to use naming shortcuts in SAS later on if required.

%macro cap(dset=,var=, lower=, upper=);

data &dset;

set &dset;

win_&var = &var;

if &var>&upper then win_&var=&upper;

if &var<&lower then win_&var=&lower;

run;

%mend;

Obsidian | Level 7

## Re: How to winsorize

Thank you again !!!!

Super User

## Re: How to winsorize

Here is a simple IML code .

### Code: Program

`data have; do i=1 to 100;  a=ceil(ranuni(1)*100);  b=ceil(ranuni(2)*100);  output; end; drop i;run;%let low=0.05 ;%let high=0.95 ;proc iml;use have;read all var _num_ into x[c=vname];close have;call qntl(q,x,{&low ,&high});do i=1 to ncol(x); x[loc(x[,i]<q[1,i]),i]=q[1,i]; x[loc(x[,i]>q[2,i]),i]=q[2,i];end;create want from x[c=vname];append from x;close want;quit;`

Xia Keshan

SAS Super FREQ

## Re: How to winsorize

The previous solutions have two problems:

1) They do not correctly Winsorize the data

2) They do not handle missing values

Winsorization does not involve quantiles. To Winsorize, you specify an integer k. You then replace the lowest (nonmissing) k values by the (k+1)st ordered value, and replace the highest k values by the (n-k)th ordered value, where n is the number of nonmissing observations.

Notice that if there are repeated values in the tails, then Winsorizing is different than "capping" and  quantiles are not the same as choosing k. For example, if k=2 and the data are

0 4 4 4 5 5 6 6 6 20

then the Winsorized data should be

4 4 4 4 5 5 6 6 6 6

Here is a SAS/IML program that Winsorizes data. To validate, take the (ordinary) mean of the Winsorized data and compare to the Winsorized means as computed by using PROC UNIVARIATE:

%let dsname = sashelp.heart;

/* SAS/IML program to Winsorize data */
/* Modified from the "trimmed mean" algorithm in
Wicklin (2010) "Rediscovering SAS/IML Software", p 2-3 */
proc iml;

start Winsorize(x, prop);
p = ncol(x);      /* number of columns */
w = x;            /* copy of x */
do i = 1 to p;
z = x[,i];        /* copy i_th column */
n = countn(z);    /* count nonmissing values */
k = ceil(prop*n); /* number of observations to trim */
r = rank(z);      /* rank data in i_th column */
/* find target values and obs with smaller/larger values */
lowIdx = loc(r<=k & r^=.);
lowVal = z[loc(r=k+1)];
highIdx = loc(r>=n-k+1);
highVal = z[loc(r=n-k)];
/* Winsorize k largest and k smallest values */
w[lowIdx,i] = lowVal;
w[highIdx,i] = highVal;
end;
return(w);
finish;

/* test it */

use &dsname;
read all var _NUM_ into X[colname=varNames];
close;
winX = Winsorize(X, 0.1);

/* Optional: compute Winsorized mean = mean of the Winsorized data */
winMean = mean(winX);
print winMean[c=varNames f=8.4];

/* write Winsorized data to data set, if necessary */
quit;

/* Validation: compute Winsorized means by using UNIVARIATE */
ods graphics off;
ods exclude all;
proc univariate data=&dsname winsorized=0.1;
ods output WinsorizedMeans=winMeans;
run;
ods exclude none;
proc print data=winMeans;
var VarName HalfP HalfN Mean;
run;

Super User

## Re: How to winsorize

Thanks Rick, that is good to know.

Obsidian | Level 7

## Re: How to winsorize

Thank you Rick,

I see the problems .  One question is that you mention you need to specify an integer K.  Perhaps a problem is that this integer should be (for instance) the 20 percentile, and 80 percentile (to winsorize at the 20 % and 80%), so k is different for each variable. See for instance the desired output. The first column  is an ID, columns 2 and 3 have the data for Va and VB,  the other 2 columns have two variables winsorized.  Va has several missing values.

My question is thus, how can this program be “adapted” to make “k” be the percentile desired to winsorize at the 20%-80%instead of a fix number?

Thank you again,

id Va Vb   Va_win Vb_win

2 4 1001   4 998

3 . 1000   . 998

4 4 999   4 998

5 . 998   . 998

6 5 997   5 997

7 6 996   6 996

8 . 995   . 995

9 8 994   8 994

10 9 993   9 993

18 . 985   . 985

19 18 984   18 984

20 . 983   . 983

21 20 982   20 982

22 . 981   . 981

23 . 980   . 980

24 . 979   . 979

25 . 978   . 978

26 25 977   25 978

27 . 976   . 978

28 26 976   25 978

SAS Super FREQ

## Re: How to winsorize

If you want to just cap it at the values of the 20th and 80th percentiles, see the answers by the others. In particular, @Ksharp has already shown how to use the QNTL function in IML to replace values less than (resp., greater than) the 20th (resp, 80th)  percentile.

Just understand that what you are doing is slightly different than the Winsorization method that PROC UNIVARIATE uses.  If there are no repeated values and no missing values, the two methods are equivalent, I think.

Super User

## Re: How to winsorize

@Rick

There are two different way to define winsorize in PROC UNIVARIATE , one is like yours , another is as OP's said.

Xia Keshan  (a.k.a  Ksharp)

SAS Super FREQ

## Re: How to winsorize

No, UNIVARIATE uses only the symmetric definition.  however, you can specify the number of observations to Winsorize as an integer, k, or as a percentage (0.1=10%).   In both cases the procedure Winsorizes the same number of observations in both tails.

Percentages are not the same as percentiles. As I show in my response, when there are repeated values this is not the same as changing all data less than the alpha_th percentile and all data greater than the (1-alpha)th percentile.

Super User

## Re: How to winsorize

I don't see the difference in what I've coded and how Wikipedia and NumPy define Winsorize :smileyconfused:

Winsorising - Wikipedia, the free encyclopedia

SAS Super FREQ

## Re: How to winsorize

Yes, the Wikipedia article uses a different definition than PROC UNIVARIATE does. The UNIVARIATE procedure uses the statistical definition of the symmetric Winsorization as studied by Tukey and colleagues in s series of papers.  (Symmetric means change k values in the upper and lower tails.) By using the Tukey definition, you can get standard errors, confidence intervals, and other distributional information about the Winsorized mean.

There's nothing inherently wrong with replacing via quantiles, but I don't know whether the distribution of the resulting statistics is understood for that nonsymmetric case.

SAS Super FREQ

## Re: How to winsorize

In case this thread is referenced in the future, I am including a link to the article "How to Winsorize data in SAS," in which I consolidate my thoughts on this issue.

Discussion stats
• 14 replies
• 7431 views
• 8 likes
• 4 in conversation