Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Programming
- /
- SAS Procedures
- /
- How to winsorize

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

🔒 This topic is **solved** and **locked**.
Need further help from the community? Please
sign in and ask a **new** question.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 06-26-2015 01:45 PM
(7430 views)

Hello, I have several (hundreds of) variables that I need to “winsorize” at some level, let us say at the 95% and 5%. As you know, this means that all the observations with values > 95 percentile take the value of the 95% percentile , and all observations with value less than the 5% percentile take the value of the 5th percentile. I have been using the proc univariate, get the p5 and p95, and use “if” statements : For instance for a variable Raw1, I compute the 95 and 5 percentiles, I create a new variable Raw1_w and I "winsorize" it with if statements. Raw1_w = Raw1; if Raw1 < Raw1_p5 then Raw1_w = Raw1_p5; if Raw1 > Raw1_p95 then Raw1_w = Raw1_p95; if Raw1 =. then Raw1_w = .; The problem is that when I have several hundreds variables, this 4 lines of code per variable result in a very complex and very long program and I make mistakes too frequently .. , and I was wondering if you can think in a better solution to do this. Thank you !!!,

1 ACCEPTED SOLUTION

Accepted Solutions

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Use a macro.

Here's one that caps outliers using 25/75 but could easily be modified for 5/95 and caps them.

SAS Box-Plot/Tukey Method of Capping Outliers

If it's to calculate means/statistics proc univariate will also do that using winsorize option.

14 REPLIES 14

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Use a macro.

Here's one that caps outliers using 25/75 but could easily be modified for 5/95 and caps them.

SAS Box-Plot/Tukey Method of Capping Outliers

If it's to calculate means/statistics proc univariate will also do that using winsorize option.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Modify the macro slightly - I'm adding the win to the beginning of the variable because that will allow you to use naming shortcuts in SAS later on if required.

%macro cap(dset=,var=, lower=, upper=);

data &dset;

set &dset;

win_&var = &var;

if &var>&upper then win_&var=&upper;

if &var<&lower then win_&var=&lower;

run;

%mend;

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thank you again !!!!

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Here is a simple IML code .

data have;

do i=1 to 100;

a=ceil(ranuni(1)*100);

b=ceil(ranuni(2)*100);

output;

end;

drop i;

run;

%let low=0.05 ;

%let high=0.95 ;

proc iml;

use have;

read all var _num_ into x[c=vname];

close have;

call qntl(q,x,{&low ,&high});

do i=1 to ncol(x);

x[loc(x[,i]<q[1,i]),i]=q[1,i];

x[loc(x[,i]>q[2,i]),i]=q[2,i];

end;

create want from x[c=vname];

append from x;

close want;

quit;

Xia Keshan

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

The previous solutions have two problems:

1) They do not correctly Winsorize the data

2) They do not handle missing values

Winsorization does not involve quantiles. To Winsorize, you specify an integer k. You then replace the lowest (nonmissing) k values by the (k+1)st ordered value, and replace the highest k values by the (n-k)th ordered value, where n is the number of nonmissing observations.

Notice that if there are repeated values in the tails, then Winsorizing is different than "capping" and quantiles are not the same as choosing k. For example, if k=2 and the data are

0 4 4 4 5 5 6 6 6 20

then the Winsorized data should be

4 4 4 4 5 5 6 6 6 6

Here is a SAS/IML program that Winsorizes data. To validate, take the (ordinary) mean of the Winsorized data and compare to the Winsorized means as computed by using PROC UNIVARIATE:

%let dsname = sashelp.heart;

/* SAS/IML program to Winsorize data */

/* Modified from the "trimmed mean" algorithm in

Wicklin (2010) "Rediscovering SAS/IML Software", p 2-3 */

proc iml;

start Winsorize(x, prop);

p = ncol(x); /* number of columns */

w = x; /* copy of x */

do i = 1 to p;

z = x[,i]; /* copy i_th column */

n = countn(z); /* count nonmissing values */

k = ceil(prop*n); /* number of observations to trim */

r = rank(z); /* rank data in i_th column */

/* find target values and obs with smaller/larger values */

lowIdx = loc(r<=k & r^=.);

lowVal = z[loc(r=k+1)];

highIdx = loc(r>=n-k+1);

highVal = z[loc(r=n-k)];

/* Winsorize k largest and k smallest values */

w[lowIdx,i] = lowVal;

w[highIdx,i] = highVal;

end;

return(w);

finish;

/* test it */

use &dsname;

read all var _NUM_ into X[colname=varNames];

close;

winX = Winsorize(X, 0.1);

/* Optional: compute Winsorized mean = mean of the Winsorized data */

winMean = mean(winX);

print winMean[c=varNames f=8.4];

/* write Winsorized data to data set, if necessary */

quit;

/* Validation: compute Winsorized means by using UNIVARIATE */

ods graphics off;

ods exclude all;

proc univariate data=&dsname winsorized=0.1;

ods output WinsorizedMeans=winMeans;

run;

ods exclude none;

proc print data=winMeans;

var VarName HalfP HalfN Mean;

run;

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thanks Rick, that is good to know.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thank you Rick,

I see the problems . One question is that you mention you need to specify an integer K. Perhaps a problem is that this integer should be (for instance) the 20 percentile, and 80 percentile (to winsorize at the 20 % and 80%), so k is different for each variable. See for instance the desired output. The first column is an ID, columns 2 and 3 have the data for Va and VB, the other 2 columns have two variables winsorized. Va has several missing values.

My question is thus, how can this program be “adapted” to make “k” be the percentile desired to winsorize at the 20%-80%instead of a fix number?

Thank you again,

id Va Vb Va_win Vb_win

2 4 1001 4 998

3 . 1000 . 998

4 4 999 4 998

5 . 998 . 998

6 5 997 5 997

7 6 996 6 996

8 . 995 . 995

9 8 994 8 994

10 9 993 9 993

18 . 985 . 985

19 18 984 18 984

20 . 983 . 983

21 20 982 20 982

22 . 981 . 981

23 . 980 . 980

24 . 979 . 979

25 . 978 . 978

26 25 977 25 978

27 . 976 . 978

28 26 976 25 978

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

If you want to just cap it at the values of the 20th and 80th percentiles, see the answers by the others. In particular, @Ksharp has already shown how to use the QNTL function in IML to replace values less than (resp., greater than) the 20th (resp, 80th) percentile.

Just understand that what you are doing is slightly different than the Winsorization method that PROC UNIVARIATE uses. If there are no repeated values and no missing values, the two methods are equivalent, I think.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@Rick

There are two different way to define winsorize in PROC UNIVARIATE , one is like yours , another is as OP's said.

Xia Keshan (a.k.a Ksharp)

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

No, UNIVARIATE uses only the symmetric definition. however, you can specify the number of observations to Winsorize as an integer, k, or as a percentage (0.1=10%). In both cases the procedure Winsorizes the same number of observations in both tails.

Percentages are not the same as percentiles. As I show in my response, when there are repeated values this is not the same as changing all data less than the alpha_th percentile and all data greater than the (1-alpha)th percentile.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

I don't see the difference in what I've coded and how Wikipedia and NumPy define Winsorize :smileyconfused:

Winsorising - Wikipedia, the free encyclopedia

scipy.stats.mstats.winsorize — SciPy v0.14.0 Reference Guide

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Yes, the Wikipedia article uses a different definition than PROC UNIVARIATE does. The UNIVARIATE procedure uses the statistical definition of the symmetric Winsorization as studied by Tukey and colleagues in s series of papers. (Symmetric means change k values in the upper and lower tails.) By using the Tukey definition, you can get standard errors, confidence intervals, and other distributional information about the Winsorized mean.

There's nothing inherently wrong with replacing via quantiles, but I don't know whether the distribution of the resulting statistics is understood for that nonsymmetric case.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.

**If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website. **

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.