Re: Code for eliminating outliers

lioradam · Posted 07-13-2022 09:20 AM

Hello all,

My data refers to CEO tenures (average tenure length is 8 years). This means that my sample includes firm-year observations regarding several measures of net profit (accounting return), and those observations are sorted by CEO tenures.

I wrote a code to eliminate outliers of the variables "net profit" in the 1% and 99% (program 1). Then I wrote a second code that said that if there is an outlier during CEO tenure, all the other observations in the same tenure will be eliminated as well (program 2).

According to my understanding, those programs combined should reduce my sample by an average of 16%: 2% of the sample should be reduced in program 1, and as the average tenure length is 8 years, after program 2, the reduction of observations should be 16%.

However, the actual reduction is a third of the sample (twice what I expected).

Enclosed are the two codes (two programs).

Could you please advise if there are problems with the codes?

Thanks in advance,

Lior

PaigeMiller · Posted 07-13-2022 09:33 AM

Many of us will not download attachments. Code should be posted as text in the window that appears when you click on the "running man" icon.

I do not follow your logic about how there is a 2% drop because of program one and an average tenure is 8 years, so this leads to an expected 16% drop in number of records. Nevertheless, this should be easy for you to trace in your data sets to see if the correct records are being deleted. Since we don't have your data, we can't do that tracing.

--
Paige Miller

lioradam · Posted 07-13-2022 10:03 AM

Hi,

Since 2% of the observations (observations below 1% and above 99%) are reduced, if, in the worst-case scenario, all outliers belong to separate tenures (there are no two or more outliers in the same tenure) and since the average tenure length is 8 years, then the outliers should be around 16% (2% multiple by 8).

Enclose the code of the two programs:

Program 1:

options mprint;

data Final_without_outliers_YB;

set ADJUSTED_VARIABLES;

run;

/*title;*/

%macro change(var);

proc means data= ADJUSTED_VARIABLES n nmiss mean std median q1 q3 p1 p99 min max T ProbT/* noprint*/ ;

var &var;

output out=temp_YB p1=p1 p99=p99;

run;

data _NULL_;

set temp_YB;

call symputx("p1_YB",p1);

call symputx("p99_YB",p99);

run;

%put _user_;

data Final_without_outliers_YB;

set Final_without_outliers_YB;

if &var < &p1_YB then &var = .;

if &var > &p99_YB then &var = .;

run;

title 'Results after Omitting Outliers';

proc means data = Final_without_outliers_YB n nmiss mean std median q1 q3 p1 p99 min max T ProbT;

var &var;

run;

title;

%mend;

%change(Adj_m1_ebsi_to_assets_a)

%change(Adj_m1_ebwo_to_assets_a)

%change(ebsi_to_assets_a)

%change(ebwo_to_assets_a)

%change(ewem_a_conpustat)

%change(e_bxi_a)

%change(cash_operation_a)

%change(Adj_m1_ewem_a_compustat)

%change(Adj_m1_ewem_a_exe)

%change(Adj_m1_ewem_a_exe_5)

%change(Adj_m1_ewem_a_exe_10)

%change(Adj_m1_ewem_a_exe_15)

%change(Adj_m1_ewem_a_20_exe)

%change(Adj_m1_e_bxi_a)

%change(Adj_m1_cash_operation_a)

%change(tobin_q)

Program 2:

options mprint;

data FINAL_WITHOUT_OUTLIERS_YB01;

set FINAL_WITHOUT_OUTLIERS_YB;

%macro change1(var);

proc means data=FINAL_WITHOUT_OUTLIERS_YB01 noprint nway;

var &var;

class GVKEY01 Full_Name;

output out=work.e n=n;

run;

data e_&var;

set e;

if _FREQ_ > n;

diff_&var = _FREQ_ - n;

drop _type_ _FREQ_ n;

run;

proc sort data = FINAL_WITHOUT_OUTLIERS_YB01 out = finel_WO_temp;

by GVKEY01 Full_Name;

run;

proc sort data = e_&var out = e_temp_&var;

by GVKEY01 Full_Name;

run;

data FINAL_WITHOUT_OUTLIERS_YB01;

merge e_temp_&var finel_WO_temp;

by GVKEY01 Full_Name;

if diff_&var > 0 then

&var = .;

drop diff_&var; /* if chacking then omit line*/

run;

proc delete data = finel_WO_temp;

run;

proc delete data = e_temp_&var;

run;

proc delete data = e_&var;

run;

proc delete data = e;

run;

%mend;

%change1(Adj_m1_ebsi_to_assets_a)

%change1(Adj_m1_ebwo_to_assets_a)

%change1(ebsi_to_assets_a)

%change1(ebwo_to_assets_a)

%change1(ewem_a_conpustat)

%change1(e_bxi_a)

%change1(cash_operation_a)

%change1(Adj_m1_ewem_a_compustat)

%change1(Adj_m1_ewem_a_exe)

%change1(Adj_m1_ewem_a_exe_5)

%change1(Adj_m1_ewem_a_exe_10)

%change1(Adj_m1_ewem_a_exe_15)

%change1(Adj_m1_ewem_a_20_exe)

%change1(Adj_m1_e_bxi_a)

%change1(Adj_m1_cash_operation_a)

%change1(tobin_q)

PaigeMiller · Posted 07-13-2022 10:12 AM

Repeating: Code should be posted as text in the window that appears when you click on the "running man" icon.

A tenure of 8 years average doesn't seem to me to relate to the 2% of the observations that are deemed to be outliers. I don't see how multiplying them together makes any sense here.

Did you try my suggestion to see what records have been deleted to determine if these are correct, or not?

--
Paige Miller

lioradam · Posted 07-13-2022 02:18 PM

Hi,

Yes, I have checked what records have been deleted and the code is fine.

I analyzed the results in the wrong way.

Thank you,

Lior

Reeza · Posted 07-13-2022 11:47 AM

then the outliers should be around 16% (2% multiple by 8).

This is incorrect. You would expect somewhere between 2% and 8% as there would very likely be overlaps in the outliers.

lioradam · Posted 07-13-2022 02:26 PM

Hi Reeza,

You are right.

Anyway, I analyzed the data in the wrong way. Now I understand it and it makes sense.

Thank you,

Lior

mkeintz · Posted 07-13-2022 12:04 PM

@lioradam wrote:
... stuff deleted ...

According to my understanding, those programs combined should reduce my sample by an average of 16%: 2% of the sample should be reduced in program 1, and as the average tenure length is 8 years, after program 2, the reduction of observations should be 16%.

However, the actual reduction is a third of the sample (twice what I expected).

Enclosed are the two codes (two programs).

I haven't looked at your code. Please enter it in the popup box that will appear when you click on the "running man" glyph when posting a message to this forum.

But I wonder whether there is much justification for your expectation of 16% of observations to be removed by your outlier-year identification.

Did you generate accounting return cutoff points on a year-by-year basis, or globally (i.e. for all years)? If they are global outliers, you will likely find more upper outliers in globally profitable years (i.e. years with high rates of return). This means you are introducing a likely bias in your sample of outliers, since a non-outlier performer (i.e. an "ordinary" 95%ile) might exceed the global 99%ile in that high profit year. This could easily end up identifying more ceo-tenures above the 99%ile than you would it you used annually-adjusted distribution of returns to generate annual outlier cutoff values.

And the same argument might be applied to the lower cutoff point - the 1%iles.

And of course if your study covers a lot of years, then this bias mechanism could be exaggerated - the longer your study the more chance of an extremely good or bad year, i.e. more business cycles.

Also why do you assume that the average tenure length would be the same for ceo's at the extremes? Wouldn't high performers likely stay longer? Of course you could make the counter argument for low performers - they will have shorter-than-average tenures. But I don't think you should assume that these tendencies are symmetrical - which your rationale suggests.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

lioradam · Posted 07-13-2022 02:34 PM

Hi,

I checked what records have been deleted and the code works fine. It was my mistake.

You raise good comments. I calculate the outliers for all years, however, the return is adjusted to industry return in each year.

I dill with the other comments you raise, through different procedures in my research.

Thank you!

Lior

Registration is open

SAS Training: Just a Click Away