Hello,
I'm running the following regression (a Fama MecBeth regression):
Thank you,
This isn't winsorizing. Winsorizing involves censoring the same number of values from the lower and upper tail of the distribution. What you describe can be done with:
proc sort data=WORK.QUERY out=FM_quadric_reg1;
by gvkey;
run;
PROC REG DATA=FM_quadric_reg1
outest=FM_quadric_reg011 noprint outseb;
BY gvkey ;
MODEL y= x1 x2;
RUN;
proc sql;
create table FM_quadric_reg011_Winz as
select
*,
case when abs(intercept-mean(intercept)) > 3 * std(intercept)
then mean(intercept) + sign(intercept-mean(intercept)) * 3 * std(intercept)
else intercept end as winz_intercept,
case when abs(x1-mean(x1)) > 3 * std(x1)
then mean(x1) + sign(x1-mean(x1)) * 3 * std(x1)
else x1 end as winz_x1,
case when abs(intercept-mean(intercept)) > 3 * std(x2)
then mean(x2) + sign(x2-mean(x2)) * 3 * std(x2)
else x2 end as winz_x2
from FM_quadric_reg011
where _type_ = "PARMS";
quit;
@lioradam wrote:
Hello,
I'm running the following regression (a Fama MecBeth regression):
proc sort data=WORK.QUERY out=FM_quadric_reg1;
by gvkey;
run;PROC REG DATA=FM_quadric_reg1
outest=FM_quadric_reg011 noprint outseb;
MODEL y= x1 x2;
by gvkey ;
RUN;
QUIT;
proc means data=FM_quadric_reg011 n mean t probt;
where _TYPE_ = 'PARMS';
run;Now I want to winsore the outliers: for both beta 1 , beta 2, and the intercept, I want to turn values of that exceed mean plus 3 std dev to the value of mean plus 3 std dev and to turn the values that below mean minus 3 std dev to the value of mean minus 3 std dev.I want to do it for both beta 1 (the coefficient of x1) , beta 2 (the coefficient of x2) and the intercept.could you please assist in writing the code for that?Thank you,
Lior
So you get many different beta1 and beta2 and intercept because of the BY statement. It's not clear that they should be winsorized, or that the different beta1, beta2 and intercept across the BY groups would have the same distribution allowing winsorizing to make sense.
You'd have to be willing to produce reasons to your clients why Winsorizing makes sense here. I'm trying to think of an example, and the only example I can think of is a simulation experiment, where the BY is for different iterations in the simulation.
But you can take the output data set PROC MEANS and then merge that into the coefficients to determine if they are byond 3 sigma. Something like this:
proc means data=FM_quadric_reg011;
where _TYPE_ = 'PARMS';
output out=_stats_ mean= std=/autoname;
run;
Data fm_quadric_reg011a;
if _n_=1 then set _stats_;
set fm_quadric_reg011;
/* Formula for winsorizing goes here */
run;
Hi!
The method of calculating many beta1 and beta2 and intercept is common practice in accounting research and it aimed to treat a possible collinearity between observations. then the mean value of beta1 and beta2 and intercept are calculated . so as any mean calculation its affected by outliers, and there for it needed to be addressed.
Any way thank you for your assistance!
Lior
I agree with @PaigeMiller.
That being said, here's an example of how to cap results, it's not efficient by any means, but it works.
Hi Rezza,
I used PG code however the code you sent looks also suitable for me.
thank you very much!
Lior
This isn't winsorizing. Winsorizing involves censoring the same number of values from the lower and upper tail of the distribution. What you describe can be done with:
proc sort data=WORK.QUERY out=FM_quadric_reg1;
by gvkey;
run;
PROC REG DATA=FM_quadric_reg1
outest=FM_quadric_reg011 noprint outseb;
BY gvkey ;
MODEL y= x1 x2;
RUN;
proc sql;
create table FM_quadric_reg011_Winz as
select
*,
case when abs(intercept-mean(intercept)) > 3 * std(intercept)
then mean(intercept) + sign(intercept-mean(intercept)) * 3 * std(intercept)
else intercept end as winz_intercept,
case when abs(x1-mean(x1)) > 3 * std(x1)
then mean(x1) + sign(x1-mean(x1)) * 3 * std(x1)
else x1 end as winz_x1,
case when abs(intercept-mean(intercept)) > 3 * std(x2)
then mean(x2) + sign(x2-mean(x2)) * 3 * std(x2)
else x2 end as winz_x2
from FM_quadric_reg011
where _type_ = "PARMS";
quit;
Hi PG,
I used it and it worked perfectly (I calculate it also manually to verify that this is what I attend to receive).
Thank you very much!
Lior
Hi PG,
I want to try another type of winsoring, instead of winzoring coefficients above mean plus 3 STD (or below mean minus 3 STD), I want to do winsoring for coefficient values above 95% or below 5%.
Could you please assist how to adjust the code you wrote accordingly?
Kind Regards,
Lior
SQL doesn't provide percentile summaries, other than the median. So you will have to use a different approach, such as:
proc summary data=FM_quadric_reg011;
by gvkey;
var intercept x1 x2;
output out=FM_quadric_reg011_summ p5= p95= / autoname;
run;
data FM_quadric_reg011_Winz;
merge FM_quadric_reg011 FM_quadric_reg011_summ; by gvkey;
intercept = min(max(intercept, intercept_p5), intercept_p95);
x1 = min(max(x1, x1_p5), x1_p95);
x2 = min(max(x2, x2_p5), x2_p95);
run;
Hi PG,
Thank you for your response.
I wrote the following code, based on your suggestion, but for some reason, I received different result than when winsoring manually (for verification).
Also, I don't see "output data" of "FM_quadric_regr11_Winz",
and when I remove the condition " where _TYPE_ = 'PARMS'" from the "proc summary" I still receive the same results (the same value of " winz_year_of_tenure2" for example) which don't make sense.
Could you please try to see what I wrote incorrectly?
The code I wrote:
proc sort data=WORK.QUERY_FOR_FINAL1_0001 out=FM_quadric_reg1;
by gvkey ;
run;
PROC REG DATA=FM_quadric_reg1
outest=FM_quadric_regr11 noprint outseb;
MODEL y = x1 x2;
by gvkey ;
RUN;
QUIT;
proc summary data=FM_quadric_regr11;
var intercept x1 x2;
output out=FM_quadric_regr11_summ p5= p95= / autoname;
where _TYPE_ = 'PARMS';
quit;
data FM_quadric_regr11_Winz;
merge FM_quadric_regr11 FM_quadric_regr11_summ;
intercept = min(max(intercept, intercept_p5), intercept_p95);
x1 = min(max(x1, x1_p5), x1_p95);
x2 = min(max(x2, x2_p5), x2_p95);
quit;
proc means data=FM_quadric_regr11_Winz n mean t probt;
where _TYPE_ = 'PARMS';
run;
Thank you,
Lior
You are missing by gvkey in many steps.
I intentionally didn't include "by gvkey" in the "proc summary" because I want to calculate p5 and p95 of all the rows together. (the entire table).
there for the table FM_quadric_reg011_summ include only one row, while the table FM_quadric_reg011 including many rows (as the number of gvkey). that's why the merge procedure does not work.
If you're trying to merge a data set with multiple rows, with on with a single row you do it this way:
data want;
set multipleRows;
if _n_ = 1 then set OneRow;
run;
@lioradam wrote:
I intentionally didn't include "by gvkey" in the "proc summary" because I want to calculate p5 and p95 of all the rows together. (the entire table).
there for the table FM_quadric_reg011_summ include only one row, while the table FM_quadric_reg011 including many rows (as the number of gvkey). that's why the merge procedure does not work.
Thank you Reeza !
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.