Hello,
I am modeling the effect of firm performance on CEO salary. SAS read 10,741 observations from my merged ceo_firm data set, but only used 2,066 of them. 8,675 observations had missing values. I am surprised by this, however, due to the substantive data cleaning I have done in preparation for running the regression. Do you have any idea why I could still be missing so many observations? Here is my code:
libname paper "~/425/425_Final_Paper"; *import compustat dataset; PROC IMPORT out=paper.CompuStat_Execucomp datafile="~/425/425_Final_Paper/CompuStat_Execucomp.csv" DBMS=CSV replace; getnames= yes; guessingrows=2000; run; *keep only CEOs; data paper.Compustat_ExecuComp2; set paper.Compustat_ExecuComp; if CEOANN = 'CEO'; run; *** extra DATA step for checking previous data step results by comparing number of records selected; data check_CEOANN; set paper.Compustat_ExecuComp; if CEOANN =: 'CEO'; run; proc sort data=paper.Compustat_ExecuComp2; by ticker year; run; *import csrp dataset; PROC IMPORT out=paper.CSRP_Monthly_Stock_char datafile="~/425/425_Final_Paper/CSRP_MonthlyStock_char.csv" DBMS=CSV replace; getnames= yes; guessingrows=max; run; *remove bad data; data paper.CSRP_Monthly_Stock_char2; set paper.CSRP_Monthly_Stock_char (rename=(ret=character_ret)); drop character_ret; if cusip = ' ' then delete; ret = input(character_ret, ??8.); if ret = . then delete; date = input(put(date, z8.), yymmdd8.); format date yymmdd10.; year = year(date); month = month(date); if cusip =: '?' then cusip = substr(cusip, 2); run; proc contents data=paper.CSRP_Monthly_Stock_char2; run; proc sort data=paper.CSRP_Monthly_Stock_char2; by ticker year; run; *Remove all bad years from both data sources; data paper.Compustat_ExecuComp3; merge paper.Compustat_ExecuComp2 paper.multiple_CEOs (keep=ticker year in=had_multiple_CEOs); by ticker year; if had_multiple_CEOs then delete; run; data paper.CSRP_Monthly_Stock_char3; merge paper.CSRP_Monthly_Stock_char2 paper.multiple_CEOs (keep=ticker year in=had_multiple_CEOs); by ticker year; if had_multiple_CEOs then delete; run; *find additional bad data: multiple return records for the same month/year; proc freq data=paper.CSRP_Monthly_Stock_char3; tables ticker * year * month / noprint out=paper.multiple_returns (where=(count > 1)); run; *Remove all matching year data for multiple returns; proc sort data=paper.multiple_returns out=multiple_returns (keep=ticker year) NODUPKEY; by ticker year; run; data paper.Compustat_ExecuComp4; merge paper.Compustat_ExecuComp3 multiple_returns (in=had_multiple_returns); by ticker year; if had_multiple_returns then delete; run; data paper.CSRP_Monthly_Stock_char4; merge paper.CSRP_Monthly_Stock_char3 multiple_returns (in=had_multiple_returns); by ticker year; if had_multiple_returns then delete; run; *create new variable annualized growth; data paper.CSRP_annual_returns; set paper.CSRP_Monthly_Stock_char4; by ticker year; retain annual_return 1; annual_return = annual_return * (1 + RET); if month = 12 or last.ticker; annual_return = (annual_return - 1) * 100; output; annual_return = 1; keep ticker year annual_return; run; *use proc contents to see if there is a type mismatch; proc contents data=paper.CSRP_annual_returns; run; proc contents data=paper.Compustat_ExecuComp4; run; *MERGE statement has more than one data set with repeats of BY values; proc sort data=paper.compustat_execucomp4; by ticker; run; proc sort data=paper.CSRP_annual_returns nodupkey; by ticker; run; data want; merge paper.compustat_execucomp4 (in=in1) paper.CSRP_annual_returns; by ticker; if in1; run; *Merge CEO data and firm data; DATA paper.ceo_firm ; length ticker $5; MERGE paper.CSRP_annual_returns paper.compustat_execucomp4; BY ticker; RUN; proc contents data=paper.ceo_firm; run; *remove missing return data; data paper.ceo_firm; set paper.ceo_firm; if annual_return = . then delete; run; proc means data=paper.CSRP_Monthly_Stock_char3 n nmiss; var ret; run; *Use OLS to estimate model; title ""; ods graphics on; proc reg data=paper.ceo_firm plots(maxpoints=none); model salary = annual_return /clb acov; run; ods graphics off;
Thanks in advance for the help!
Without seeing the data I don't think anyone can tell you why you are missing salary and/or annual return from so many records.
I think you have to manually review the files to see why there are so many missing values. Could be an error in how the data was imported.
Art, CEO, AnalystFinder.com
Which particular dataset do you think would be helpful in looking at this issue?
Post the results of the PROC MEANS call
proc means data=paper.CSRP_Monthly_Stock_char3 n nmiss; var salary annual_return; run;
. It contains the count of missing and nonmissing values for the variables in your model.
From
proc means data=paper.Compustat_ExecuComp4 n nmiss; var salary; run;
The MEANS Procedure
Analysis Variable : SALARYN N Miss
13346 | 0 |
and from:
proc means data=paper.CSRP_annual_returns n nmiss; var annual_return; run;
we get
The MEANS Procedure
Analysis Variable : annual_returnN N Miss
53176 | 0 |
Didn't you really want to run those proc means on paper.ceo_firm since that is the file you're using in proc reg?
proc reg data=paper.ceo_firm plots(maxpoints=none); model salary = annual_return /clb acov; run;
And you haven't (I don't think) addressed the question I raised in my previous post.
Art, CEO, AnalystFinder.com
I agree, in principle, with Rick, but I don't think RET is the variable you're using. Your code includes:
data paper.CSRP_annual_returns; set paper.CSRP_Monthly_Stock_char4; by ticker year; retain annual_return 1; annual_return = annual_return * (1 + RET); if month = 12 or last.ticker; annual_return = (annual_return - 1) * 100; output; annual_return = 1; keep ticker year annual_return; run;
In that code you're only calculating annual_return if month=12 or it's the last record for a particular ticker. Is that really what you want to be doing? And your proc means only analyzes RET, while your variable is annual_return.
I don't know which file contains SALARY
Art, CEO, AnalystFinder.com
Art, CEO,
Salary is in Compustat_ExecuComp4.
And with regards to annual_return, I will quote a SAS expert from another post:
Do you always begin with January and end with December?
Is DATE a SAS date, or merely a numeric value with YMD?
Is your data in sorted order by TICKER DATE?
Assuming "yes", "just a number", and "yes", you could program it in this way:
I will look into your other points now.
data temp;
set have;
by ticker date;
retain annual_return 1;
annual_return = annual_return * (1 + RET);
month = int( mod(date,10000) / 100);
if month = 12;
annual_return = (annual_return - 1) * 100;
output;
annual_return = 1;
run;
If DATE is actually a SAS date, the calculation of MONTH is easier:
month = month(date);
Note that the annualized return is calculated separately for each year. If you get 10% for one year, and 20% for the next year, the total value after two years would be 1.1 * 1.2 * original value ... all assuming that I got the formulas correct.
I don't know who provided that code or whether or not it's doing what you want. Most of us on this forum are users .. just like you. We don't get paid, may sometimes suggest wrong answers, and often aren't provided with enough details to make a suggestion that actually does what the requester wanted to do (but didn't convey the correct request).
As for myself, I wouldn't necessarily call myself an expert, but I do have a PhD and have been using SAS for the past 45 years.
Make up an example for two years of one ticker and, for each record, show what you expect the value of annual_return to be.
As for Salary, it has to be in the file you are analyzing with proc reg. Earlier you had asked which file to look at to identify why you had missing values .. I think it may be the comparison of those two files and, if they're different, figuring out why they're different.
Art, CEO, AnalystFinder.com
P.S. I just looked up the original post and, if your data are in fact monthly values, then the code that @Astounding suggested does indeed do the calculation correctly. However, I also noticed in that thread that you were trying obtain the geometric mean, but didn't see @mkeintz's suggestion incorporated in your code.
However, I also noticed in that thread, that you were getting missing values back then. Was that problem ever corrected?
A) I am going to forgo the geometric mean for now. I am just looking for percentage growth over the course of a year. Here’s an example of what I would expect, based off numbers from yahoo finance:
Ticker year annual_return
AAOI 2014 61.35% (my data actually says its -35.15776393)
ACFC 2013 115.42% (my data actually says its 125.9565%)
B) I used:
proc contents
data=paper.ceo_firm;
run;
to confirm that salary is in the data set I am analyzing with proc reg
Is this what you were wondering?
C) I will find that thread and investigate my missing values further.
@art297 any other suggestions?
From what I've read in previous posts on this and related topics I'd be most interested in seeing what data was missing and, if it the results didn't seem reasonable, I'd investigate further.
You might want to run something like:
proc format;
value sal
.='Missing'
low-4999='<5000'
5000-39999='< 40000'
40000-high='40000 or more'
;
value return
.='missing'
low-.299='< .3'
.3-.499='<.5'
.5-high='.5 or more'
;
run;
proc freq data=paper.ceo_firm;
tables year*(annual_return salary)/ missing;
format salary sal.;
format annual_return return.;
run;
Of course, you might want to select better ranges for the formats.
Art, CEO, AnalystFinder.com
Here is what the output looks like:
Table of year by annual_returnyear annual_returnmissing < .3 <.5 .5 or more Total20102011201220132014201520162017Total
|
Frequency Percent Row Pct Col Pct | Table of year by SALARYyear SALARYMissing <5000 < 40000 Total20102011201220132014201520162017Total
|
Do you see anything systematic going on here?
The tables posted in a garbled fashion where one can't tell what it represents. It would, in this case, be better to post a snapshot of the result screens and post them as pdf files.
Art, CEO, AnalystFinder.com
@art297, my bad. I will take note of that for next time. It's just that I have been told more than a few times that people do not prefer to have to open attached files on this forum is all.
People on the forum don't like pictures of datasets .. they'd rather have data posted in the form of an analyzable data step.
Similarly, they don't like opening file type that might contain viruses, like Word or Excel.
However, when showing one's output, a picture (i.e., jpg, pdf or the like) I'd think would be welcomed by most.
Art, CEO, AnalystFinder.com
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.