BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
lioradam
Obsidian | Level 7

Hello,

 

I'm running the following regression (a Fama MecBeth regression):

 


proc sort data=WORK.QUERY out=FM_quadric_reg1;
 by gvkey;
run;
 
PROC REG DATA=FM_quadric_reg1
  outest=FM_quadric_reg011 noprint outseb;

MODEL y= x1 x2;
   by gvkey ;
  RUN;
QUIT;
proc means data=FM_quadric_reg011 n mean t probt;
 where _TYPE_ = 'PARMS';
run;
 
Now I want to winsore the outliers: for both beta 1 , beta 2, and the intercept, I want to turn values of that exceed mean  plus 3 std dev to the value of mean plus 3 std dev and to turn the values that below mean minus 3 std dev to the value of mean minus 3 std dev.
I want to do it for both beta 1 (the coefficient of x1) , beta 2 (the coefficient of x2) and the intercept.
could you please assist in writing the code for that?

 

Thank you,
Lior
 
1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

This isn't winsorizing. Winsorizing involves censoring the same number of values from the lower and upper tail of the distribution. What you describe can be done with:

 

proc sort data=WORK.QUERY out=FM_quadric_reg1;
 by gvkey;
run;
 
PROC REG DATA=FM_quadric_reg1
  outest=FM_quadric_reg011 noprint outseb;
BY gvkey ;
MODEL y= x1 x2;
RUN;

proc sql;
create table FM_quadric_reg011_Winz as
select 
    *,
    case when abs(intercept-mean(intercept)) > 3 * std(intercept) 
    then mean(intercept) + sign(intercept-mean(intercept)) * 3 * std(intercept) 
    else intercept end as winz_intercept,
    case when abs(x1-mean(x1)) > 3 * std(x1) 
    then mean(x1) + sign(x1-mean(x1)) * 3 * std(x1) 
    else x1 end as winz_x1,    
    case when abs(intercept-mean(intercept)) > 3 * std(x2) 
    then mean(x2) + sign(x2-mean(x2)) * 3 * std(x2) 
    else x2 end as winz_x2
from FM_quadric_reg011
where _type_ = "PARMS";
quit;
PG

View solution in original post

13 REPLIES 13
PaigeMiller
Diamond | Level 26

@lioradam wrote:

Hello,

 

I'm running the following regression (a Fama MecBeth regression):

 


proc sort data=WORK.QUERY out=FM_quadric_reg1;
 by gvkey;
run;
 
PROC REG DATA=FM_quadric_reg1
  outest=FM_quadric_reg011 noprint outseb;

MODEL y= x1 x2;
   by gvkey ;
  RUN;
QUIT;
proc means data=FM_quadric_reg011 n mean t probt;
 where _TYPE_ = 'PARMS';
run;
 
Now I want to winsore the outliers: for both beta 1 , beta 2, and the intercept, I want to turn values of that exceed mean  plus 3 std dev to the value of mean plus 3 std dev and to turn the values that below mean minus 3 std dev to the value of mean minus 3 std dev.
I want to do it for both beta 1 (the coefficient of x1) , beta 2 (the coefficient of x2) and the intercept.
could you please assist in writing the code for that?

 

Thank you,
Lior
 

So you get many different beta1 and beta2 and intercept because of the BY statement. It's not clear that they should be winsorized, or that the different beta1, beta2 and intercept across the BY groups would have the same distribution allowing winsorizing to make sense.

 

You'd have to be willing to produce reasons to your clients why Winsorizing makes sense here. I'm trying to think of an example, and the only example I can think of is a simulation experiment, where the BY is for different iterations in the simulation.

 

But you can take the output data set PROC MEANS and then merge that into the coefficients to determine if they are byond 3 sigma. Something like this:

proc means data=FM_quadric_reg011;
    where _TYPE_ = 'PARMS';
    output out=_stats_ mean= std=/autoname;
run;
 
Data fm_quadric_reg011a;
    if _n_=1 then set _stats_;
    set fm_quadric_reg011;
    /* Formula for winsorizing goes here */
run;

 

--
Paige Miller
lioradam
Obsidian | Level 7

 

Hi!

 

The method of calculating many beta1 and beta2 and intercept is common practice in accounting research and it aimed to treat a possible collinearity between observations.  then the mean value of beta1 and beta2 and intercept  are calculated . so as any mean calculation its affected by outliers, and there for it needed to be addressed.

 

Any way thank you for your assistance!

Lior

 

Reeza
Super User

I agree with @PaigeMiller.

 

That being said, here's an example of how to cap results, it's not efficient by any means, but it works. 

 

https://gist.github.com/statgeek/31316a678433a1db8136

lioradam
Obsidian | Level 7

Hi Rezza,

 

I used PG code however the code you sent looks also suitable for me.

thank you very much!

Lior

 

PGStats
Opal | Level 21

This isn't winsorizing. Winsorizing involves censoring the same number of values from the lower and upper tail of the distribution. What you describe can be done with:

 

proc sort data=WORK.QUERY out=FM_quadric_reg1;
 by gvkey;
run;
 
PROC REG DATA=FM_quadric_reg1
  outest=FM_quadric_reg011 noprint outseb;
BY gvkey ;
MODEL y= x1 x2;
RUN;

proc sql;
create table FM_quadric_reg011_Winz as
select 
    *,
    case when abs(intercept-mean(intercept)) > 3 * std(intercept) 
    then mean(intercept) + sign(intercept-mean(intercept)) * 3 * std(intercept) 
    else intercept end as winz_intercept,
    case when abs(x1-mean(x1)) > 3 * std(x1) 
    then mean(x1) + sign(x1-mean(x1)) * 3 * std(x1) 
    else x1 end as winz_x1,    
    case when abs(intercept-mean(intercept)) > 3 * std(x2) 
    then mean(x2) + sign(x2-mean(x2)) * 3 * std(x2) 
    else x2 end as winz_x2
from FM_quadric_reg011
where _type_ = "PARMS";
quit;
PG
lioradam
Obsidian | Level 7

Hi PG,

I used it and it worked perfectly (I calculate it also manually to verify that this is what I attend to receive).

 

Thank you very much!

Lior

lioradam
Obsidian | Level 7

Hi PG,

 

I want to try another type of winsoring, instead of winzoring coefficients above mean plus 3 STD (or below mean minus 3 STD), I want to do winsoring for coefficient values above 95% or below 5%. 

Could you please assist how to adjust the code you wrote accordingly?

 

Kind Regards,

Lior

 

 

PGStats
Opal | Level 21

SQL doesn't provide percentile summaries, other than the median. So you will have to use a different approach, such as:

 

proc summary data=FM_quadric_reg011;
by gvkey;
var intercept x1 x2;
output out=FM_quadric_reg011_summ p5= p95= / autoname;
run;

data FM_quadric_reg011_Winz;
merge FM_quadric_reg011 FM_quadric_reg011_summ; by gvkey;
intercept = min(max(intercept, intercept_p5), intercept_p95);
x1 = min(max(x1, x1_p5), x1_p95);
x2 = min(max(x2, x2_p5), x2_p95);
run;
PG
lioradam
Obsidian | Level 7

Hi PG,

Thank you for your response.

I wrote the following code, based on your suggestion, but for some reason, I received different result than when winsoring manually (for verification).

Also, I don't see "output data" of "FM_quadric_regr11_Winz",

 and when I remove the condition " where _TYPE_ = 'PARMS'" from the "proc summary" I still receive the same results (the same value of " winz_year_of_tenure2" for  example) which don't make sense.

 

Could you please try to see what I wrote incorrectly?

 

The code I wrote:

 

proc sort data=WORK.QUERY_FOR_FINAL1_0001 out=FM_quadric_reg1;

 by gvkey ;

run;

PROC REG DATA=FM_quadric_reg1

  outest=FM_quadric_regr11 noprint outseb;

     

MODEL y = x1 x2;

   by gvkey ;

RUN;

QUIT;

 

proc summary data=FM_quadric_regr11;

var intercept x1 x2;

output out=FM_quadric_regr11_summ p5= p95= / autoname;

 where _TYPE_ = 'PARMS';

quit;

 

data FM_quadric_regr11_Winz;

merge FM_quadric_regr11 FM_quadric_regr11_summ;

intercept = min(max(intercept, intercept_p5), intercept_p95);

x1 = min(max(x1, x1_p5), x1_p95);

x2 = min(max(x2, x2_p5), x2_p95);

 

quit;

proc means data=FM_quadric_regr11_Winz n mean t probt;

 where _TYPE_ = 'PARMS';

run;

 

Thank you,

Lior

PGStats
Opal | Level 21

You are missing by gvkey in many steps.

PG
lioradam
Obsidian | Level 7

I intentionally didn't include "by gvkey" in the "proc summary" because I want to calculate p5 and p95 of all the rows together. (the entire table).

there for the table FM_quadric_reg011_summ include only one row, while the table FM_quadric_reg011 including many rows (as the number of gvkey). that's why the merge procedure does not work.

Reeza
Super User

If you're trying to merge a data set with multiple rows, with on with a single row you do it this way:

 

data want;
set multipleRows;

if _n_ = 1 then set OneRow;

run;

@lioradam wrote:

I intentionally didn't include "by gvkey" in the "proc summary" because I want to calculate p5 and p95 of all the rows together. (the entire table).

there for the table FM_quadric_reg011_summ include only one row, while the table FM_quadric_reg011 including many rows (as the number of gvkey). that's why the merge procedure does not work.


 

lioradam
Obsidian | Level 7

Thank you Reeza !

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 1618 views
  • 2 likes
  • 4 in conversation