BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
comeon2012
Fluorite | Level 6

Hi,

I did know how to winsorize in Stata, but how to do it in SAS.

For example, I have three variables: size, sales, total assets

I want winsorize at the 1% and 99%, i.e. values that are less than the value at 1% are replaced by the value at 1%, and values that are greater than the value at 99% are replaced by the value at 99%. I don't want obs. with outliers are deleted.

Thanks.


1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

That's because in this small sample the 99%ile and 1%ile are at the max and min of the variables, so the winsorizing values were already at the extremes.  Add more observations or change the percentiles (but if you change the percentiles, modify the names of the variables in the WLO and WHI arrays correspondingly.

 

Also you mistakenly generated the 10%ile instead of the first pecentile in the univariate procedure but that had no effect because the data step was looking for __size1, (but the actual variable created was __size10).

 

I will change to the 10th and 90th percentiles which should produce results in your example.

 

Editor's note: including original code solution in this message for convenience:

 

Two steps:

 1. find the winsor low and high percentiles (PROC UNIVARIATE)

 2. truncate outliers to those values (DATA step)

 

%let L=10;    %* 10th percentile *;
%let H=%eval(100 - &L);   %* 90th percentile*;
proc univariate data=have noprint;
   var size sales assets;
   output out=_winsor   pctlpts=&L  &H    
   pctlpre=__size  __sales  __assets;
run;
data want (drop=__:);
  set have;
  if _n_=1 then set _winsor;
  array wlo  {*} __size&L  __sales&L   __assets&L;
  array whi  {*} __size&H __sales&H __assets&H;
  array wval {*} wsize wsales wassets;
  array val   {*} size sales assets;
  do _V=1 to dim(val);
     wval{_V}=min(max(val{_V},wlo{_V}),whi{_V});
  end;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

10 REPLIES 10
mkeintz
PROC Star

lTwo steps:

   1. find the winsor low and high percentiles

   2. truncate outliers to those values

%let L=10;    %* 10th percentile *;

%let H=%eval(100 - &L);   %* 90th percentile*;

proc univariate data=have noprint;

   var size sales assets;

   output out=_winsor   pctlpts=&L  &H     pctlpre=__size  __sales  __assets;

run;

data want (drop=_:);

  set have;

  if _n_=1 then set _winsor;

  array wlo  {*} __size&L  __sales&L   __assets&L;

  array whi  {*} __size&H __sales&H __assets&H;

  array wval {*} wsize wsales wassets;

  array val   {*} size sales assets;

  do _V=1 to dim(val);

     wval{_V}=min(max(val{_V},wlo{_V}),whi{_V});

  end;

run;

 

This can be easily macro-ized so that you can make parameters of the variable list, input/output data sets, and percentiles.  I suspect you may be a user of WRDS, given you are dealing with company-based financial data.  If so, you can find a more generalized winsorize macro via the RESEARCH tab on the WRDS web site.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
comeon2012
Fluorite | Level 6

Hi mkeintz,

Problem again.

It seems that the winsor procedure does not make difference since values in want dataset(after winsorizing) is the same as have dataset(before winsorizing).

For example:

data have;

input size sales assets;

cards;

10 3 15

20 5 20

30 5 23.2

45 19 14.3

2 3 16.4

20 2 170

30 2 3

45 1 5

100 3 10

20 3 25

39 3 30

46 1 12

;

run;

proc univariate data=have noprint;

   var size sales assets;

   output out=_winsor   pctlpts=10 99     pctlpre=__size  __sales  __assets;

run;

data want (drop=_:);

  set have;

  if _n_=1 then set _winsor;

  array wlo  {*} __size1 __sales1 __assets1;

  array whi  {*} __size99 __sales99 __assets99;

  array wval {*} wsize wsales wassets;

  array val   {*} size sales assets;

  do _V=1 to dim(val);

     wval{_V}=min(max(val{_V},wlo{_V}),whi{_V});

  end;

run;

mkeintz
PROC Star

That's because in this small sample the 99%ile and 1%ile are at the max and min of the variables, so the winsorizing values were already at the extremes.  Add more observations or change the percentiles (but if you change the percentiles, modify the names of the variables in the WLO and WHI arrays correspondingly.

 

Also you mistakenly generated the 10%ile instead of the first pecentile in the univariate procedure but that had no effect because the data step was looking for __size1, (but the actual variable created was __size10).

 

I will change to the 10th and 90th percentiles which should produce results in your example.

 

Editor's note: including original code solution in this message for convenience:

 

Two steps:

 1. find the winsor low and high percentiles (PROC UNIVARIATE)

 2. truncate outliers to those values (DATA step)

 

%let L=10;    %* 10th percentile *;
%let H=%eval(100 - &L);   %* 90th percentile*;
proc univariate data=have noprint;
   var size sales assets;
   output out=_winsor   pctlpts=&L  &H    
   pctlpre=__size  __sales  __assets;
run;
data want (drop=__:);
  set have;
  if _n_=1 then set _winsor;
  array wlo  {*} __size&L  __sales&L   __assets&L;
  array whi  {*} __size&H __sales&H __assets&H;
  array wval {*} wsize wsales wassets;
  array val   {*} size sales assets;
  do _V=1 to dim(val);
     wval{_V}=min(max(val{_V},wlo{_V}),whi{_V});
  end;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
comeon2012
Fluorite | Level 6

It's my fault. Sorry for that.

Thanks anyway.

mahler_ji
Obsidian | Level 7

Hello,

I found this answer very helpful, but I have a question for how to make it a little more specific to my needs.

I have data that I need to winsorize by quarter.  So, I have a large dataset with lots of values for each quarter, and I want to winsorize the top and bottom percent within each time period.

How could I modify the above code to make that happen?

Thanks!

John

mantubiradar19
Quartz | Level 8

Hello Folks!

I have also a similar problem here! I have one continuous variable and it was not normally distributed - So I tried to take the square root of the variable and then winsorize at 5% and 95% - It was not normal again! So I went to winsorize at %7 and %97 - Then the variable was normally distributed!

But, all the original significant associations from PROC GLM disappeared! can you guys please suggest me any solution for this?

PGStats
Opal | Level 21

Your question appears as an ANSWERED question on the Forum, which it is not. Post it as a new question (Discussion) on the SAS Statistical Procedures Community to attract more attention. - PG

PG
mantubiradar19
Quartz | Level 8

Sure! Thank you!

Haikuo
Onyx | Level 15

A Google search brings me the following links, you may find them relevant:

Google Groups

http://www.wrds.us/index.php/repository/view/1

Haikuo

Ksharp
Super User

Are you try to calculate Winsorized mean ? It is an example from documentation.

title 'Robust Estimates for Blood Pressure Data';
ods select TrimmedMeans WinsorizedMeans RobustScale;
proc univariate data=BPressure trimmed=1 .1
winsorized=.1 robustscale;
var Systolic;
run;

Ksharp

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 30957 views
  • 8 likes
  • 7 in conversation