BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Thomas_mp
Obsidian | Level 7

Hello, I have several (hundreds of) variables that I need to “winsorize” at some level, let us say at the 95% and 5%.  As you know, this  means that all the observations with values > 95 percentile take the value of the 95% percentile , and all observations with value less than the 5% percentile take the value of the 5th percentile. I have been using the proc univariate, get the p5 and p95, and use “if” statements : For instance for a variable Raw1, I compute the 95 and 5 percentiles, I create a new variable Raw1_w  and I "winsorize" it with if statements. Raw1_w =  Raw1; if  Raw1 < Raw1_p5  then Raw1_w =  Raw1_p5; if  Raw1 > Raw1_p95  then Raw1_w =  Raw1_p95; if  Raw1 =.  then Raw1_w = .; The problem is that when I have several hundreds variables, this 4 lines of code per variable result in a  very complex and very long program and I make mistakes too frequently .. , and I was wondering if you can think in a better solution to do this. Thank you  !!!,

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

Use a macro.

Here's one that caps outliers using 25/75 but could easily be modified for 5/95 and caps them.

SAS Box-Plot/Tukey Method of Capping Outliers

If it's to calculate means/statistics proc univariate will also do that using winsorize option.

View solution in original post

14 REPLIES 14
Reeza
Super User

Use a macro.

Here's one that caps outliers using 25/75 but could easily be modified for 5/95 and caps them.

SAS Box-Plot/Tukey Method of Capping Outliers

If it's to calculate means/statistics proc univariate will also do that using winsorize option.

Thomas_mp
Obsidian | Level 7

Thank you again Reeza, My first time using these macros.  It works. Would you know by a chance how to modify the macro so the winsorized variables have different names than the original varialbes?  For instance, creating  new names for the winsorized variables adding _winsor to the original name?  I tried several things but I was not successful Thank you again

Reeza
Super User

Modify the macro slightly - I'm adding the win to the beginning of the variable because that will allow you to use naming shortcuts in SAS later on if required.

%macro cap(dset=,var=, lower=, upper=);

data &dset;

set &dset;

win_&var = &var;

if &var>&upper then win_&var=&upper;

if &var<&lower then win_&var=&lower;

run;

%mend;

Thomas_mp
Obsidian | Level 7

Thank you again !!!!

Ksharp
Super User

Here is a simple IML code .

Code: Program

data have;
do i=1 to 100;
  a=ceil(ranuni(1)*100);
  b=ceil(ranuni(2)*100);
  output;
end;
drop i;
run;


%let low=0.05 ;
%let high=0.95 ;

proc iml;
use have;
read all var _num_ into x[c=vname];
close have;
call qntl(q,x,{&low ,&high});

do i=1 to ncol(x);
x[loc(x[,i]<q[1,i]),i]=q[1,i];
x[loc(x[,i]>q[2,i]),i]=q[2,i];
end;

create want from x[c=vname];
append from x;
close want;

quit;

Xia Keshan

Rick_SAS
SAS Super FREQ

The previous solutions have two problems:

1) They do not correctly Winsorize the data

2) They do not handle missing values

Winsorization does not involve quantiles. To Winsorize, you specify an integer k. You then replace the lowest (nonmissing) k values by the (k+1)st ordered value, and replace the highest k values by the (n-k)th ordered value, where n is the number of nonmissing observations.

Notice that if there are repeated values in the tails, then Winsorizing is different than "capping" and  quantiles are not the same as choosing k. For example, if k=2 and the data are

0 4 4 4 5 5 6 6 6 20

then the Winsorized data should be

4 4 4 4 5 5 6 6 6 6

Here is a SAS/IML program that Winsorizes data. To validate, take the (ordinary) mean of the Winsorized data and compare to the Winsorized means as computed by using PROC UNIVARIATE:


%let dsname = sashelp.heart;

/* SAS/IML program to Winsorize data */
/* Modified from the "trimmed mean" algorithm in
   Wicklin (2010) "Rediscovering SAS/IML Software", p 2-3 */
proc iml;

start Winsorize(x, prop);
   p = ncol(x);      /* number of columns */
   w = x;            /* copy of x */
   do i = 1 to p;
      z = x[,i];        /* copy i_th column */
      n = countn(z);    /* count nonmissing values */
      k = ceil(prop*n); /* number of observations to trim */
      r = rank(z);      /* rank data in i_th column */
      /* find target values and obs with smaller/larger values */
      lowIdx = loc(r<=k & r^=.);
      lowVal = z[loc(r=k+1)];
      highIdx = loc(r>=n-k+1);
      highVal = z[loc(r=n-k)];
      /* Winsorize k largest and k smallest values */
      w[lowIdx,i] = lowVal;
      w[highIdx,i] = highVal;
   end;
   return(w);
finish;

/* test it */

use &dsname;
read all var _NUM_ into X[colname=varNames];
close;
winX = Winsorize(X, 0.1);

/* Optional: compute Winsorized mean = mean of the Winsorized data */
winMean = mean(winX);
print winMean[c=varNames f=8.4];

/* write Winsorized data to data set, if necessary */
quit;

/* Validation: compute Winsorized means by using UNIVARIATE */
ods graphics off;
ods exclude all;
proc univariate data=&dsname winsorized=0.1;
   ods output WinsorizedMeans=winMeans;
run;
ods exclude none;
proc print data=winMeans;
   var VarName HalfP HalfN Mean;
run;

Reeza
Super User

Thanks Rick, that is good to know.

Thomas_mp
Obsidian | Level 7

Thank you Rick,

I see the problems .  One question is that you mention you need to specify an integer K.  Perhaps a problem is that this integer should be (for instance) the 20 percentile, and 80 percentile (to winsorize at the 20 % and 80%), so k is different for each variable. See for instance the desired output. The first column  is an ID, columns 2 and 3 have the data for Va and VB,  the other 2 columns have two variables winsorized.  Va has several missing values.

My question is thus, how can this program be “adapted” to make “k” be the percentile desired to winsorize at the 20%-80%instead of a fix number?

Thank you again,

id Va Vb   Va_win Vb_win

2 4 1001   4 998

3 . 1000   . 998

4 4 999   4 998

5 . 998   . 998

6 5 997   5 997

7 6 996   6 996

8 . 995   . 995

9 8 994   8 994

10 9 993   9 993

18 . 985   . 985

19 18 984   18 984

20 . 983   . 983

21 20 982   20 982

22 . 981   . 981

23 . 980   . 980

24 . 979   . 979

25 . 978   . 978

26 25 977   25 978

27 . 976   . 978

28 26 976   25 978

Rick_SAS
SAS Super FREQ

If you want to just cap it at the values of the 20th and 80th percentiles, see the answers by the others. In particular, @Ksharp has already shown how to use the QNTL function in IML to replace values less than (resp., greater than) the 20th (resp, 80th)  percentile.

Just understand that what you are doing is slightly different than the Winsorization method that PROC UNIVARIATE uses.  If there are no repeated values and no missing values, the two methods are equivalent, I think.

Ksharp
Super User

@Rick

There are two different way to define winsorize in PROC UNIVARIATE , one is like yours , another is as OP's said.

Xia Keshan  (a.k.a  Ksharp)

Rick_SAS
SAS Super FREQ

No, UNIVARIATE uses only the symmetric definition.  however, you can specify the number of observations to Winsorize as an integer, k, or as a percentage (0.1=10%).   In both cases the procedure Winsorizes the same number of observations in both tails.

Percentages are not the same as percentiles. As I show in my response, when there are repeated values this is not the same as changing all data less than the alpha_th percentile and all data greater than the (1-alpha)th percentile.

Reeza
Super User

I don't see the difference in what I've coded and how Wikipedia and NumPy define Winsorize :smileyconfused:

Winsorising - Wikipedia, the free encyclopedia

scipy.stats.mstats.winsorize &mdash; SciPy v0.14.0 Reference Guide

Rick_SAS
SAS Super FREQ

Yes, the Wikipedia article uses a different definition than PROC UNIVARIATE does. The UNIVARIATE procedure uses the statistical definition of the symmetric Winsorization as studied by Tukey and colleagues in s series of papers.  (Symmetric means change k values in the upper and lower tails.) By using the Tukey definition, you can get standard errors, confidence intervals, and other distributional information about the Winsorized mean.

There's nothing inherently wrong with replacing via quantiles, but I don't know whether the distribution of the resulting statistics is understood for that nonsymmetric case.

Rick_SAS
SAS Super FREQ

In case this thread is referenced in the future, I am including a link to the article "How to Winsorize data in SAS," in which I consolidate my thoughts on this issue.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 8145 views
  • 8 likes
  • 4 in conversation