turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- SAS Procedures
- /
- how to winsorize variable using SAS

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

08-15-2012 07:35 AM

Hi,

I did know how to winsorize in Stata, but how to do it in SAS.

For example, I have three variables: size, sales, total assets

I want winsorize at the 1% and 99%, i.e. values that are less than the value at 1% are replaced by the value at 1%, and values that are greater than the value at 99% are replaced by the value at 99%. I don't want obs. with outliers are deleted.

Thanks.

Accepted Solutions

Solution

08-15-2012
08:58 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to comeon2012

08-15-2012 08:58 PM - last edited on 01-23-2017 10:56 AM by ChrisHemedinger

That's because in this small sample the 99%ile and 1%ile are at the max and min of the variables, so the winsorizing values were already at the extremes. Add more observations or change the percentiles (but if you change the percentiles, modify the names of the variables in the WLO and WHI arrays correspondingly.

Also you mistakenly generated the 10%ile instead of the first pecentile in the univariate procedure but that had no effect because the data step was looking for __size1, (but the actual variable created was __size10).

I will change to the 10th and 90th percentiles which should produce results in your example.

*Editor's note: including original code solution in this message for convenience:*

Two steps:

1. find the winsor low and high percentiles (PROC UNIVARIATE)

2. truncate outliers to those values (DATA step)

```
%let L=10; %* 10th percentile *;
%let H=%eval(100 - &L); %* 90th percentile*;
proc univariate data=have noprint;
var size sales assets;
output out=_winsor pctlpts=&L &H
pctlpre=__size __sales __assets;
run;
data want (drop=__:);
set have;
if _n_=1 then set _winsor;
array wlo {*} __size&L __sales&L __assets&L;
array whi {*} __size&H __sales&H __assets&H;
array wval {*} wsize wsales wassets;
array val {*} size sales assets;
do _V=1 to dim(val);
wval{_V}=min(max(val{_V},wlo{_V}),whi{_V});
end;
run;
```

All Replies

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to comeon2012

08-15-2012 08:13 AM

lTwo steps:

1. find the winsor low and high percentiles

2. truncate outliers to those values

%let L=10; %* 10th percentile *;

%let H=%eval(100 - &L); %* 90th percentile*;

proc univariate data=have noprint;

var size sales assets;

output out=_winsor pctlpts=&L &H pctlpre=__size __sales __assets;

run;

data want (drop=_;

set have;

if _n_=1 then set _winsor;

array wlo {*} __size&L __sales&L __assets&L;

array whi {*} __size&H __sales&H __assets&H;

array wval {*} wsize wsales wassets;

array val {*} size sales assets;

do _V=1 to dim(val);

wval{_V}=min(max(val{_V},wlo{_V}),whi{_V});

end;

run;

This can be easily macro-ized so that you can make parameters of the variable list, input/output data sets, and percentiles. I suspect you may be a user of WRDS, given you are dealing with company-based financial data. If so, you can find a more generalized winsorize macro via the RESEARCH tab on the WRDS web site.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to mkeintz

08-15-2012 08:49 AM

Hi **mkeintz**,

Problem again.

It seems that the winsor procedure does not make difference since values in want dataset(after winsorizing) is the same as have dataset(before winsorizing).

For example:

**data** have;

input size sales assets;

cards;

10 3 15

20 5 20

30 5 23.2

45 19 14.3

2 3 16.4

20 2 170

30 2 3

45 1 5

100 3 10

20 3 25

39 3 30

46 1 12

;

**run**;

**proc** **univariate** data=have noprint;

var size sales assets;

output out=_winsor pctlpts=**10** **99** pctlpre=__size __sales __assets;

**run**;

**data** want (drop=_;

set have;

if _n_=**1** then set _winsor;

array wlo {*} __size1 __sales1 __assets1;

array whi {*} __size99 __sales99 __assets99;

array wval {*} wsize wsales wassets;

array val {*} size sales assets;

do _V=**1** to dim(val);

wval{_V}=min(max(val{_V},wlo{_V}),whi{_V});

end;

**run**;

Solution

08-15-2012
08:58 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to comeon2012

08-15-2012 08:58 PM - last edited on 01-23-2017 10:56 AM by ChrisHemedinger

That's because in this small sample the 99%ile and 1%ile are at the max and min of the variables, so the winsorizing values were already at the extremes. Add more observations or change the percentiles (but if you change the percentiles, modify the names of the variables in the WLO and WHI arrays correspondingly.

Also you mistakenly generated the 10%ile instead of the first pecentile in the univariate procedure but that had no effect because the data step was looking for __size1, (but the actual variable created was __size10).

I will change to the 10th and 90th percentiles which should produce results in your example.

*Editor's note: including original code solution in this message for convenience:*

Two steps:

1. find the winsor low and high percentiles (PROC UNIVARIATE)

2. truncate outliers to those values (DATA step)

```
%let L=10; %* 10th percentile *;
%let H=%eval(100 - &L); %* 90th percentile*;
proc univariate data=have noprint;
var size sales assets;
output out=_winsor pctlpts=&L &H
pctlpre=__size __sales __assets;
run;
data want (drop=__:);
set have;
if _n_=1 then set _winsor;
array wlo {*} __size&L __sales&L __assets&L;
array whi {*} __size&H __sales&H __assets&H;
array wval {*} wsize wsales wassets;
array val {*} size sales assets;
do _V=1 to dim(val);
wval{_V}=min(max(val{_V},wlo{_V}),whi{_V});
end;
run;
```

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to mkeintz

08-15-2012 10:08 PM

It's my fault. Sorry for that.

Thanks anyway.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to mkeintz

12-01-2014 09:59 AM

Hello,

I found this answer very helpful, but I have a question for how to make it a little more specific to my needs.

I have data that I need to winsorize by quarter. So, I have a large dataset with lots of values for each quarter, and I want to winsorize the top and bottom percent within each time period.

How could I modify the above code to make that happen?

Thanks!

John

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to mahler_ji

01-15-2015 04:07 AM

Hello Folks!

I have also a similar problem here! I have one continuous variable and it was not normally distributed - So I tried to take the square root of the variable and then winsorize at 5% and 95% - It was not normal again! So I went to winsorize at %7 and %97 - Then the variable was normally distributed!

But, all the original significant associations from PROC GLM disappeared! can you guys please suggest me any solution for this?

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to mantubiradar19

01-15-2015 01:50 PM

Your question appears as an ANSWERED question on the Forum, which it is not. Post it as a new question (Discussion) on the SAS Statistical Procedures Community to attract more attention. - PG

PG

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to PGStats

01-16-2015 02:13 AM

Sure! Thank you!

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to comeon2012

08-15-2012 08:17 AM

A Google search brings me the following links, you may find them relevant:

http://www.wrds.us/index.php/repository/view/1

Haikuo

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to comeon2012

08-15-2012 10:03 PM

Are you try to calculate Winsorized mean ? It is an example from documentation.

title 'Robust Estimates for Blood Pressure Data'; ods select TrimmedMeans WinsorizedMeans RobustScale; proc univariate data=BPressure trimmed=1 .1 winsorized=.1 robustscale; var Systolic; run;

Ksharp