Solved: How to winsorize

Thomas_mp · Posted 06-26-2015 01:45 PM

Hello, I have several (hundreds of) variables that I need to “winsorize” at some level, let us say at the 95% and 5%. As you know, this means that all the observations with values > 95 percentile take the value of the 95% percentile , and all observations with value less than the 5% percentile take the value of the 5th percentile. I have been using the proc univariate, get the p5 and p95, and use “if” statements : For instance for a variable Raw1, I compute the 95 and 5 percentiles, I create a new variable Raw1_w and I "winsorize" it with if statements. Raw1_w = Raw1; if Raw1 < Raw1_p5 then Raw1_w = Raw1_p5; if Raw1 > Raw1_p95 then Raw1_w = Raw1_p95; if Raw1 =. then Raw1_w = .; The problem is that when I have several hundreds variables, this 4 lines of code per variable result in a very complex and very long program and I make mistakes too frequently .. , and I was wondering if you can think in a better solution to do this. Thank you !!!,

Reeza · Posted 06-26-2015 01:49 PM

Use a macro.

Here's one that caps outliers using 25/75 but could easily be modified for 5/95 and caps them.

SAS Box-Plot/Tukey Method of Capping Outliers

If it's to calculate means/statistics proc univariate will also do that using winsorize option.

View solution in original post

Reeza · Posted 06-26-2015 01:49 PM

Use a macro.

Here's one that caps outliers using 25/75 but could easily be modified for 5/95 and caps them.

SAS Box-Plot/Tukey Method of Capping Outliers

If it's to calculate means/statistics proc univariate will also do that using winsorize option.

Thomas_mp · Posted 06-26-2015 04:53 PM

Thank you again Reeza, My first time using these macros. It works. Would you know by a chance how to modify the macro so the winsorized variables have different names than the original varialbes? For instance, creating new names for the winsorized variables adding _winsor to the original name? I tried several things but I was not successful Thank you again

Reeza · Posted 06-26-2015 04:58 PM

Modify the macro slightly - I'm adding the win to the beginning of the variable because that will allow you to use naming shortcuts in SAS later on if required.

%macro cap(dset=,var=, lower=, upper=);

data &dset;

set &dset;

win_&var = &var;

if &var>&upper then win_&var=&upper;

if &var<&lower then win_&var=&lower;

run;

%mend;

Thomas_mp · Posted 06-26-2015 06:14 PM

Thank you again !!!!

Ksharp · Posted 06-27-2015 03:12 AM

Here is a simple IML code .

Code: Program

data have;
 do i=1 to 100;
  a=ceil(ranuni(1)*100);
  b=ceil(ranuni(2)*100);
  output;
 end;
 drop i;
run;


%let low=0.05 ;
%let high=0.95 ;

proc iml;
use have;
read all var _num_ into x[c=vname];
close have;
call qntl(q,x,{&low ,&high});

do i=1 to ncol(x);
 x[loc(x[,i]<q[1,i]),i]=q[1,i];
 x[loc(x[,i]>q[2,i]),i]=q[2,i];
end;

create want from x[c=vname];
append from x;
close want;

quit;

Xia Keshan

Rick_SAS · Posted 06-29-2015 10:51 AM

The previous solutions have two problems:

1) They do not correctly Winsorize the data

2) They do not handle missing values

Winsorization does not involve quantiles. To Winsorize, you specify an integer k. You then replace the lowest (nonmissing) k values by the (k+1)st ordered value, and replace the highest k values by the (n-k)th ordered value, where n is the number of nonmissing observations.

Notice that if there are repeated values in the tails, then Winsorizing is different than "capping" and quantiles are not the same as choosing k. For example, if k=2 and the data are

0 4 4 4 5 5 6 6 6 20

then the Winsorized data should be

4 4 4 4 5 5 6 6 6 6

Here is a SAS/IML program that Winsorizes data. To validate, take the (ordinary) mean of the Winsorized data and compare to the Winsorized means as computed by using PROC UNIVARIATE:

%let dsname = sashelp.heart;

/* SAS/IML program to Winsorize data */
/* Modified from the "trimmed mean" algorithm in
Wicklin (2010) "Rediscovering SAS/IML Software", p 2-3 */
proc iml;

start Winsorize(x, prop);
   p = ncol(x);      /* number of columns */
   w = x;            /* copy of x */
   do i = 1 to p;
      z = x[,i];        /* copy i_th column */
      n = countn(z);    /* count nonmissing values */
      k = ceil(prop*n); /* number of observations to trim */
      r = rank(z);      /* rank data in i_th column */
      /* find target values and obs with smaller/larger values */
      lowIdx = loc(r<=k & r^=.);
      lowVal = z[loc(r=k+1)];
      highIdx = loc(r>=n-k+1);
      highVal = z[loc(r=n-k)];
      /* Winsorize k largest and k smallest values */
      w[lowIdx,i] = lowVal;
      w[highIdx,i] = highVal;
   end;
   return(w);
finish;

/* test it */

use &dsname;
read all var _NUM_ into X[colname=varNames];
close;
winX = Winsorize(X, 0.1);

/* Optional: compute Winsorized mean = mean of the Winsorized data */
winMean = mean(winX);
print winMean[c=varNames f=8.4];

/* write Winsorized data to data set, if necessary */
quit;

/* Validation: compute Winsorized means by using UNIVARIATE */
ods graphics off;
ods exclude all;
proc univariate data=&dsname winsorized=0.1;
ods output WinsorizedMeans=winMeans;
run;
ods exclude none;
proc print data=winMeans;
var VarName HalfP HalfN Mean;
run;

Reeza · Posted 06-29-2015 11:12 AM

Thanks Rick, that is good to know.

Thomas_mp · Posted 06-29-2015 12:24 PM

Thank you Rick,

I see the problems . One question is that you mention you need to specify an integer K. Perhaps a problem is that this integer should be (for instance) the 20 percentile, and 80 percentile (to winsorize at the 20 % and 80%), so k is different for each variable. See for instance the desired output. The first column is an ID, columns 2 and 3 have the data for Va and VB, the other 2 columns have two variables winsorized. Va has several missing values.

My question is thus, how can this program be “adapted” to make “k” be the percentile desired to winsorize at the 20%-80%instead of a fix number?

Thank you again,

id Va Vb Va_win Vb_win

2 4 1001 4 998

3 . 1000 . 998

4 4 999 4 998

5 . 998 . 998

6 5 997 5 997

7 6 996 6 996

8 . 995 . 995

9 8 994 8 994

10 9 993 9 993

18 . 985 . 985

19 18 984 18 984

20 . 983 . 983

21 20 982 20 982

22 . 981 . 981

23 . 980 . 980

24 . 979 . 979

25 . 978 . 978

26 25 977 25 978

27 . 976 . 978

28 26 976 25 978

Rick_SAS · Posted 06-29-2015 01:30 PM

If you want to just cap it at the values of the 20th and 80th percentiles, see the answers by the others. In particular, @Ksharp has already shown how to use the QNTL function in IML to replace values less than (resp., greater than) the 20th (resp, 80th) percentile.

Just understand that what you are doing is slightly different than the Winsorization method that PROC UNIVARIATE uses. If there are no repeated values and no missing values, the two methods are equivalent, I think.

Ksharp · Posted 06-30-2015 08:06 AM

@Rick

There are two different way to define winsorize in PROC UNIVARIATE , one is like yours , another is as OP's said.

Xia Keshan (a.k.a Ksharp)

Rick_SAS · Posted 07-06-2015 03:58 PM

No, UNIVARIATE uses only the symmetric definition. however, you can specify the number of observations to Winsorize as an integer, k, or as a percentage (0.1=10%). In both cases the procedure Winsorizes the same number of observations in both tails.

Percentages are not the same as percentiles. As I show in my response, when there are repeated values this is not the same as changing all data less than the alpha_th percentile and all data greater than the (1-alpha)th percentile.

Reeza · Posted 06-29-2015 01:37 PM

I don't see the difference in what I've coded and how Wikipedia and NumPy define Winsorize :smileyconfused:

Winsorising - Wikipedia, the free encyclopedia

scipy.stats.mstats.winsorize — SciPy v0.14.0 Reference Guide

Rick_SAS · Posted 06-29-2015 01:57 PM

Yes, the Wikipedia article uses a different definition than PROC UNIVARIATE does. The UNIVARIATE procedure uses the statistical definition of the symmetric Winsorization as studied by Tukey and colleagues in s series of papers. (Symmetric means change k values in the upper and lower tails.) By using the Tukey definition, you can get standard errors, confidence intervals, and other distributional information about the Winsorized mean.

There's nothing inherently wrong with replacing via quantiles, but I don't know whether the distribution of the resulting statistics is understood for that nonsymmetric case.

Rick_SAS · Posted 06-17-2016 08:43 AM

In case this thread is referenced in the future, I am including a link to the article "How to Winsorize data in SAS," in which I consolidate my thoughts on this issue.

How to winsorize

Re: How to winsorize

Re: How to winsorize

Re: How to winsorize

Re: How to winsorize

Re: How to winsorize

Re: How to winsorize

Code: Program

Re: How to winsorize

Re: How to winsorize

Re: How to winsorize

Re: How to winsorize

Re: How to winsorize

Re: How to winsorize

Re: How to winsorize

Re: How to winsorize

Re: How to winsorize

Code: Program

Registration is open

SAS Training: Just a Click Away