Solved: Re: winsoring outliers

lioradam · Posted 01-20-2019 02:38 PM

Hello,

I'm running the following regression (a Fama MecBeth regression):

proc sort data=WORK.QUERY out=FM_quadric_reg1;
by gvkey;
run;

PROC REG DATA=FM_quadric_reg1
outest=FM_quadric_reg011 noprint outseb;

MODEL y= x1 x2;
by gvkey ;
RUN;
QUIT;
proc means data=FM_quadric_reg011 n mean t probt;
where _TYPE_ = 'PARMS';
run;

Now I want to winsore the outliers: for both beta 1 , beta 2, and the intercept, I want to turn values of that exceed mean plus 3 std dev to the value of mean plus 3 std dev and to turn the values that below mean minus 3 std dev to the value of mean minus 3 std dev.

I want to do it for both beta 1 (the coefficient of x1) , beta 2 (the coefficient of x2) and the intercept.

could you please assist in writing the code for that?

Thank you,

Lior

PGStats · Posted 01-20-2019 06:41 PM

This isn't winsorizing. Winsorizing involves censoring the same number of values from the lower and upper tail of the distribution. What you describe can be done with:

proc sort data=WORK.QUERY out=FM_quadric_reg1;
 by gvkey;
run;
 
PROC REG DATA=FM_quadric_reg1
  outest=FM_quadric_reg011 noprint outseb;
BY gvkey ;
MODEL y= x1 x2;
RUN;

proc sql;
create table FM_quadric_reg011_Winz as
select 
    *,
    case when abs(intercept-mean(intercept)) > 3 * std(intercept) 
    then mean(intercept) + sign(intercept-mean(intercept)) * 3 * std(intercept) 
    else intercept end as winz_intercept,
    case when abs(x1-mean(x1)) > 3 * std(x1) 
    then mean(x1) + sign(x1-mean(x1)) * 3 * std(x1) 
    else x1 end as winz_x1,    
    case when abs(intercept-mean(intercept)) > 3 * std(x2) 
    then mean(x2) + sign(x2-mean(x2)) * 3 * std(x2) 
    else x2 end as winz_x2
from FM_quadric_reg011
where _type_ = "PARMS";
quit;

PG

View solution in original post

PaigeMiller · Posted 01-20-2019 05:30 PM

@lioradam wrote:

Hello,

I'm running the following regression (a Fama MecBeth regression):

proc sort data=WORK.QUERY out=FM_quadric_reg1;
by gvkey;
run;

PROC REG DATA=FM_quadric_reg1
outest=FM_quadric_reg011 noprint outseb;

MODEL y= x1 x2;
by gvkey ;
RUN;
QUIT;
proc means data=FM_quadric_reg011 n mean t probt;
where _TYPE_ = 'PARMS';
run;

Now I want to winsore the outliers: for both beta 1 , beta 2, and the intercept, I want to turn values of that exceed mean plus 3 std dev to the value of mean plus 3 std dev and to turn the values that below mean minus 3 std dev to the value of mean minus 3 std dev.

I want to do it for both beta 1 (the coefficient of x1) , beta 2 (the coefficient of x2) and the intercept.

could you please assist in writing the code for that?

Thank you,

Lior

So you get many different beta1 and beta2 and intercept because of the BY statement. It's not clear that they should be winsorized, or that the different beta1, beta2 and intercept across the BY groups would have the same distribution allowing winsorizing to make sense.

You'd have to be willing to produce reasons to your clients why Winsorizing makes sense here. I'm trying to think of an example, and the only example I can think of is a simulation experiment, where the BY is for different iterations in the simulation.

But you can take the output data set PROC MEANS and then merge that into the coefficients to determine if they are byond 3 sigma. Something like this:

proc means data=FM_quadric_reg011;
    where _TYPE_ = 'PARMS';
    output out=_stats_ mean= std=/autoname;
run;
 
Data fm_quadric_reg011a;
    if _n_=1 then set _stats_;
    set fm_quadric_reg011;
    /* Formula for winsorizing goes here */
run;

--
Paige Miller

lioradam · Posted 01-21-2019 01:46 AM

Hi!

The method of calculating many beta1 and beta2 and intercept is common practice in accounting research and it aimed to treat a possible collinearity between observations. then the mean value of beta1 and beta2 and intercept are calculated . so as any mean calculation its affected by outliers, and there for it needed to be addressed.

Any way thank you for your assistance!

Lior

Reeza · Posted 01-20-2019 06:05 PM

I agree with @PaigeMiller.

That being said, here's an example of how to cap results, it's not efficient by any means, but it works.

https://gist.github.com/statgeek/31316a678433a1db8136

lioradam · Posted 01-21-2019 02:00 AM

Hi Rezza,

I used PG code however the code you sent looks also suitable for me.

thank you very much!

Lior

PGStats · Posted 01-20-2019 06:41 PM

This isn't winsorizing. Winsorizing involves censoring the same number of values from the lower and upper tail of the distribution. What you describe can be done with:

proc sort data=WORK.QUERY out=FM_quadric_reg1;
 by gvkey;
run;
 
PROC REG DATA=FM_quadric_reg1
  outest=FM_quadric_reg011 noprint outseb;
BY gvkey ;
MODEL y= x1 x2;
RUN;

proc sql;
create table FM_quadric_reg011_Winz as
select 
    *,
    case when abs(intercept-mean(intercept)) > 3 * std(intercept) 
    then mean(intercept) + sign(intercept-mean(intercept)) * 3 * std(intercept) 
    else intercept end as winz_intercept,
    case when abs(x1-mean(x1)) > 3 * std(x1) 
    then mean(x1) + sign(x1-mean(x1)) * 3 * std(x1) 
    else x1 end as winz_x1,    
    case when abs(intercept-mean(intercept)) > 3 * std(x2) 
    then mean(x2) + sign(x2-mean(x2)) * 3 * std(x2) 
    else x2 end as winz_x2
from FM_quadric_reg011
where _type_ = "PARMS";
quit;

PG

lioradam · Posted 01-21-2019 02:03 AM

Hi PG,

I used it and it worked perfectly (I calculate it also manually to verify that this is what I attend to receive).

Thank you very much!

Lior

lioradam · Posted 01-23-2019 08:43 AM

Hi PG,

I want to try another type of winsoring, instead of winzoring coefficients above mean plus 3 STD (or below mean minus 3 STD), I want to do winsoring for coefficient values above 95% or below 5%.

Could you please assist how to adjust the code you wrote accordingly?

Kind Regards,

Lior

PGStats · Posted 01-23-2019 11:40 AM

SQL doesn't provide percentile summaries, other than the median. So you will have to use a different approach, such as:

proc summary data=FM_quadric_reg011;
by gvkey;
var intercept x1 x2;
output out=FM_quadric_reg011_summ p5= p95= / autoname;
run;

data FM_quadric_reg011_Winz;
merge FM_quadric_reg011 FM_quadric_reg011_summ; by gvkey;
intercept = min(max(intercept, intercept_p5), intercept_p95);
x1 = min(max(x1, x1_p5), x1_p95);
x2 = min(max(x2, x2_p5), x2_p95);
run;

PG

lioradam · Posted 01-24-2019 04:25 AM

Hi PG,

Thank you for your response.

I wrote the following code, based on your suggestion, but for some reason, I received different result than when winsoring manually (for verification).

Also, I don't see "output data" of "FM_quadric_regr11_Winz",

and when I remove the condition " where _TYPE_ = 'PARMS'" from the "proc summary" I still receive the same results (the same value of " winz_year_of_tenure2" for example) which don't make sense.

Could you please try to see what I wrote incorrectly?

The code I wrote:

proc sort data=WORK.QUERY_FOR_FINAL1_0001 out=FM_quadric_reg1;

by gvkey ;

run;

PROC REG DATA=FM_quadric_reg1

outest=FM_quadric_regr11 noprint outseb;

MODEL y = x1 x2;

by gvkey ;

RUN;

QUIT;

proc summary data=FM_quadric_regr11;

var intercept x1 x2;

output out=FM_quadric_regr11_summ p5= p95= / autoname;

where _TYPE_ = 'PARMS';

quit;

data FM_quadric_regr11_Winz;

merge FM_quadric_regr11 FM_quadric_regr11_summ;

intercept = min(max(intercept, intercept_p5), intercept_p95);

x1 = min(max(x1, x1_p5), x1_p95);

x2 = min(max(x2, x2_p5), x2_p95);

quit;

proc means data=FM_quadric_regr11_Winz n mean t probt;

where _TYPE_ = 'PARMS';

run;

Thank you,

Lior

PGStats · Posted 01-24-2019 12:34 PM

You are missing by gvkey in many steps.

PG

lioradam · Posted 01-24-2019 08:29 PM

I intentionally didn't include "by gvkey" in the "proc summary" because I want to calculate p5 and p95 of all the rows together. (the entire table).

there for the table FM_quadric_reg011_summ include only one row, while the table FM_quadric_reg011 including many rows (as the number of gvkey). that's why the merge procedure does not work.

Reeza · Posted 01-24-2019 09:17 PM

If you're trying to merge a data set with multiple rows, with on with a single row you do it this way:

data want;
set multipleRows;

if _n_ = 1 then set OneRow;

run;

@lioradam wrote:

I intentionally didn't include "by gvkey" in the "proc summary" because I want to calculate p5 and p95 of all the rows together. (the entire table).

there for the table FM_quadric_reg011_summ include only one row, while the table FM_quadric_reg011 including many rows (as the number of gvkey). that's why the merge procedure does not work.

lioradam · Posted 01-25-2019 07:46 PM

Thank you Reeza !

Classroom Training Available!