- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I did know how to winsorize in Stata, but how to do it in SAS.
For example, I have three variables: size, sales, total assets
I want winsorize at the 1% and 99%, i.e. values that are less than the value at 1% are replaced by the value at 1%, and values that are greater than the value at 99% are replaced by the value at 99%. I don't want obs. with outliers are deleted.
Thanks.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
That's because in this small sample the 99%ile and 1%ile are at the max and min of the variables, so the winsorizing values were already at the extremes. Add more observations or change the percentiles (but if you change the percentiles, modify the names of the variables in the WLO and WHI arrays correspondingly.
Also you mistakenly generated the 10%ile instead of the first pecentile in the univariate procedure but that had no effect because the data step was looking for __size1, (but the actual variable created was __size10).
I will change to the 10th and 90th percentiles which should produce results in your example.
Editor's note: including original code solution in this message for convenience:
Two steps:
1. find the winsor low and high percentiles (PROC UNIVARIATE)
2. truncate outliers to those values (DATA step)
%let L=10; %* 10th percentile *;
%let H=%eval(100 - &L); %* 90th percentile*;
proc univariate data=have noprint;
var size sales assets;
output out=_winsor pctlpts=&L &H
pctlpre=__size __sales __assets;
run;
data want (drop=__:);
set have;
if _n_=1 then set _winsor;
array wlo {*} __size&L __sales&L __assets&L;
array whi {*} __size&H __sales&H __assets&H;
array wval {*} wsize wsales wassets;
array val {*} size sales assets;
do _V=1 to dim(val);
wval{_V}=min(max(val{_V},wlo{_V}),whi{_V});
end;
run;
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
lTwo steps:
1. find the winsor low and high percentiles
2. truncate outliers to those values
%let L=10; %* 10th percentile *;
%let H=%eval(100 - &L); %* 90th percentile*;
proc univariate data=have noprint;
var size sales assets;
output out=_winsor pctlpts=&L &H pctlpre=__size __sales __assets;
run;
data want (drop=_:);
set have;
if _n_=1 then set _winsor;
array wlo {*} __size&L __sales&L __assets&L;
array whi {*} __size&H __sales&H __assets&H;
array wval {*} wsize wsales wassets;
array val {*} size sales assets;
do _V=1 to dim(val);
wval{_V}=min(max(val{_V},wlo{_V}),whi{_V});
end;
run;
This can be easily macro-ized so that you can make parameters of the variable list, input/output data sets, and percentiles. I suspect you may be a user of WRDS, given you are dealing with company-based financial data. If so, you can find a more generalized winsorize macro via the RESEARCH tab on the WRDS web site.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi mkeintz,
Problem again.
It seems that the winsor procedure does not make difference since values in want dataset(after winsorizing) is the same as have dataset(before winsorizing).
For example:
data have;
input size sales assets;
cards;
10 3 15
20 5 20
30 5 23.2
45 19 14.3
2 3 16.4
20 2 170
30 2 3
45 1 5
100 3 10
20 3 25
39 3 30
46 1 12
;
run;
proc univariate data=have noprint;
var size sales assets;
output out=_winsor pctlpts=10 99 pctlpre=__size __sales __assets;
run;
data want (drop=_:);
set have;
if _n_=1 then set _winsor;
array wlo {*} __size1 __sales1 __assets1;
array whi {*} __size99 __sales99 __assets99;
array wval {*} wsize wsales wassets;
array val {*} size sales assets;
do _V=1 to dim(val);
wval{_V}=min(max(val{_V},wlo{_V}),whi{_V});
end;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
That's because in this small sample the 99%ile and 1%ile are at the max and min of the variables, so the winsorizing values were already at the extremes. Add more observations or change the percentiles (but if you change the percentiles, modify the names of the variables in the WLO and WHI arrays correspondingly.
Also you mistakenly generated the 10%ile instead of the first pecentile in the univariate procedure but that had no effect because the data step was looking for __size1, (but the actual variable created was __size10).
I will change to the 10th and 90th percentiles which should produce results in your example.
Editor's note: including original code solution in this message for convenience:
Two steps:
1. find the winsor low and high percentiles (PROC UNIVARIATE)
2. truncate outliers to those values (DATA step)
%let L=10; %* 10th percentile *;
%let H=%eval(100 - &L); %* 90th percentile*;
proc univariate data=have noprint;
var size sales assets;
output out=_winsor pctlpts=&L &H
pctlpre=__size __sales __assets;
run;
data want (drop=__:);
set have;
if _n_=1 then set _winsor;
array wlo {*} __size&L __sales&L __assets&L;
array whi {*} __size&H __sales&H __assets&H;
array wval {*} wsize wsales wassets;
array val {*} size sales assets;
do _V=1 to dim(val);
wval{_V}=min(max(val{_V},wlo{_V}),whi{_V});
end;
run;
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
It's my fault. Sorry for that.
Thanks anyway.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
I found this answer very helpful, but I have a question for how to make it a little more specific to my needs.
I have data that I need to winsorize by quarter. So, I have a large dataset with lots of values for each quarter, and I want to winsorize the top and bottom percent within each time period.
How could I modify the above code to make that happen?
Thanks!
John
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello Folks!
I have also a similar problem here! I have one continuous variable and it was not normally distributed - So I tried to take the square root of the variable and then winsorize at 5% and 95% - It was not normal again! So I went to winsorize at %7 and %97 - Then the variable was normally distributed!
But, all the original significant associations from PROC GLM disappeared! can you guys please suggest me any solution for this?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Your question appears as an ANSWERED question on the Forum, which it is not. Post it as a new question (Discussion) on the SAS Statistical Procedures Community to attract more attention. - PG
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Sure! Thank you!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
A Google search brings me the following links, you may find them relevant:
http://www.wrds.us/index.php/repository/view/1
Haikuo
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Are you try to calculate Winsorized mean ? It is an example from documentation.
title 'Robust Estimates for Blood Pressure Data'; ods select TrimmedMeans WinsorizedMeans RobustScale; proc univariate data=BPressure trimmed=1 .1 winsorized=.1 robustscale; var Systolic; run;
Ksharp