Re: Large number of observations with missing values in regression out...

sastuck · Posted 03-30-2018 02:32 PM

Hello,

I am modeling the effect of firm performance on CEO salary. SAS read 10,741 observations from my merged ceo_firm data set, but only used 2,066 of them. 8,675 observations had missing values. I am surprised by this, however, due to the substantive data cleaning I have done in preparation for running the regression. Do you have any idea why I could still be missing so many observations? Here is my code:

libname paper "~/425/425_Final_Paper";

*import compustat dataset;
PROC IMPORT out=paper.CompuStat_Execucomp
datafile="~/425/425_Final_Paper/CompuStat_Execucomp.csv"
DBMS=CSV replace;
getnames= yes;
guessingrows=2000;
run;

*keep only CEOs;
data paper.Compustat_ExecuComp2;
set paper.Compustat_ExecuComp;
if CEOANN = 'CEO';
run;

*** extra DATA step for checking previous data step results by comparing number of records selected;
data check_CEOANN;
set paper.Compustat_ExecuComp;
if CEOANN =: 'CEO';
run;

proc sort data=paper.Compustat_ExecuComp2;
by ticker year;
run;

*import csrp dataset;
PROC IMPORT out=paper.CSRP_Monthly_Stock_char
datafile="~/425/425_Final_Paper/CSRP_MonthlyStock_char.csv"
DBMS=CSV replace;
getnames= yes;
guessingrows=max;
run;

*remove bad data; 
data paper.CSRP_Monthly_Stock_char2;
set paper.CSRP_Monthly_Stock_char (rename=(ret=character_ret));
drop character_ret;
if cusip = ' ' then delete;
ret = input(character_ret, ??8.);
if ret = . then delete;
date = input(put(date, z8.), yymmdd8.);
format date yymmdd10.;
year = year(date);
month = month(date);
if cusip =: '?' then cusip = substr(cusip, 2);
run;

proc contents 
data=paper.CSRP_Monthly_Stock_char2;
run;

proc sort data=paper.CSRP_Monthly_Stock_char2;
by ticker year;
run;

*Remove all bad years from both data sources;
data paper.Compustat_ExecuComp3;
merge paper.Compustat_ExecuComp2
paper.multiple_CEOs (keep=ticker year in=had_multiple_CEOs);
by ticker year;
if had_multiple_CEOs then delete;
run;

data paper.CSRP_Monthly_Stock_char3;
merge paper.CSRP_Monthly_Stock_char2
paper.multiple_CEOs (keep=ticker year in=had_multiple_CEOs);
by ticker year;
if had_multiple_CEOs then delete;
run;

*find additional bad data: multiple return records for the same month/year;
proc freq data=paper.CSRP_Monthly_Stock_char3;
tables ticker * year * month / noprint out=paper.multiple_returns (where=(count > 1));
run;

*Remove all matching year data for multiple returns;
proc sort data=paper.multiple_returns out=multiple_returns (keep=ticker year) NODUPKEY;
by ticker year;
run;

data paper.Compustat_ExecuComp4;
merge paper.Compustat_ExecuComp3
multiple_returns (in=had_multiple_returns);
by ticker year;
if had_multiple_returns then delete;
run;

data paper.CSRP_Monthly_Stock_char4;
merge paper.CSRP_Monthly_Stock_char3
multiple_returns (in=had_multiple_returns);
by ticker year;
if had_multiple_returns then delete;
run;

*create new variable annualized growth;
data paper.CSRP_annual_returns;
set paper.CSRP_Monthly_Stock_char4;
by ticker year;
retain annual_return 1;
annual_return = annual_return * (1 + RET);
if month = 12 or last.ticker;
annual_return = (annual_return - 1) * 100;
output;
annual_return = 1;
keep ticker year annual_return;
run;

*use proc contents to see if there is a type mismatch;
proc contents 
data=paper.CSRP_annual_returns;
run;

proc contents 
data=paper.Compustat_ExecuComp4;
run;

*MERGE statement has more than one data set with repeats of BY values;
proc sort data=paper.compustat_execucomp4;
  by ticker;
run;
proc sort data=paper.CSRP_annual_returns nodupkey;
  by ticker;
run;
data want;
  merge paper.compustat_execucomp4 (in=in1) paper.CSRP_annual_returns;
  by ticker;
  if in1;
run;

*Merge CEO data and firm data; 
DATA paper.ceo_firm ; 
length ticker $5;
MERGE paper.CSRP_annual_returns 
paper.compustat_execucomp4; 
BY ticker; 
RUN;

proc contents 
data=paper.ceo_firm;
run;

*remove missing return data; 
data paper.ceo_firm;
set paper.ceo_firm;
if annual_return = . then delete;
run;

proc means data=paper.CSRP_Monthly_Stock_char3   n nmiss;
var ret;
run;

*Use OLS to estimate model;
title "";
ods graphics on;
proc reg data=paper.ceo_firm plots(maxpoints=none);
model salary = annual_return /clb acov;
run;  
ods graphics off;

Thanks in advance for the help!

art297 · Posted 03-30-2018 02:51 PM

Without seeing the data I don't think anyone can tell you why you are missing salary and/or annual return from so many records.

I think you have to manually review the files to see why there are so many missing values. Could be an error in how the data was imported.

Art, CEO, AnalystFinder.com

sastuck · Posted 03-30-2018 02:52 PM

Which particular dataset do you think would be helpful in looking at this issue?

Rick_SAS · Posted 03-30-2018 03:02 PM

Post the results of the PROC MEANS call

proc means data=paper.CSRP_Monthly_Stock_char3   n nmiss;
var salary annual_return;
run;

. It contains the count of missing and nonmissing values for the variables in your model.

sastuck · Posted 03-30-2018 03:58 PM

From

proc means data=paper.Compustat_ExecuComp4   n nmiss;
var salary;
run;

The MEANS Procedure

Analysis Variable : SALARYN N Miss

13346

0

and from:

proc means data=paper.CSRP_annual_returns   n nmiss;
var annual_return;
run;

we get

The MEANS Procedure

Analysis Variable : annual_returnN N Miss

53176

0

art297 · Posted 03-30-2018 04:16 PM

Didn't you really want to run those proc means on paper.ceo_firm since that is the file you're using in proc reg?

proc reg data=paper.ceo_firm plots(maxpoints=none);
model salary = annual_return /clb acov;
run;

And you haven't (I don't think) addressed the question I raised in my previous post.

Art, CEO, AnalystFinder.com

art297 · Posted 03-30-2018 03:10 PM

I agree, in principle, with Rick, but I don't think RET is the variable you're using. Your code includes:

data paper.CSRP_annual_returns;
  set paper.CSRP_Monthly_Stock_char4;
  by ticker year;
  retain annual_return 1;
  annual_return = annual_return * (1 + RET);
  if month = 12 or last.ticker;
  annual_return = (annual_return - 1) * 100;
  output;
  annual_return = 1;
  keep ticker year annual_return;
run;

In that code you're only calculating annual_return if month=12 or it's the last record for a particular ticker. Is that really what you want to be doing? And your proc means only analyzes RET, while your variable is annual_return.

I don't know which file contains SALARY

Art, CEO, AnalystFinder.com

Art, CEO,

sastuck · Posted 03-30-2018 04:19 PM

Salary is in Compustat_ExecuComp4.

And with regards to annual_return, I will quote a SAS expert from another post:

Do you always begin with January and end with December?

Is DATE a SAS date, or merely a numeric value with YMD?

Is your data in sorted order by TICKER DATE?

Assuming "yes", "just a number", and "yes", you could program it in this way:

I will look into your other points now.

data temp;

set have;

by ticker date;

retain annual_return 1;

annual_return = annual_return * (1 + RET);

month = int( mod(date,10000) / 100);

if month = 12;

annual_return = (annual_return - 1) * 100;

output;

annual_return = 1;

run;

If DATE is actually a SAS date, the calculation of MONTH is easier:

month = month(date);

Note that the annualized return is calculated separately for each year. If you get 10% for one year, and 20% for the next year, the total value after two years would be 1.1 * 1.2 * original value ... all assuming that I got the formulas correct.

art297 · Posted 03-30-2018 05:22 PM

I don't know who provided that code or whether or not it's doing what you want. Most of us on this forum are users .. just like you. We don't get paid, may sometimes suggest wrong answers, and often aren't provided with enough details to make a suggestion that actually does what the requester wanted to do (but didn't convey the correct request).

As for myself, I wouldn't necessarily call myself an expert, but I do have a PhD and have been using SAS for the past 45 years.

Make up an example for two years of one ticker and, for each record, show what you expect the value of annual_return to be.

As for Salary, it has to be in the file you are analyzing with proc reg. Earlier you had asked which file to look at to identify why you had missing values .. I think it may be the comparison of those two files and, if they're different, figuring out why they're different.

Art, CEO, AnalystFinder.com

P.S. I just looked up the original post and, if your data are in fact monthly values, then the code that @Astounding suggested does indeed do the calculation correctly. However, I also noticed in that thread that you were trying obtain the geometric mean, but didn't see @mkeintz's suggestion incorporated in your code.

However, I also noticed in that thread, that you were getting missing values back then. Was that problem ever corrected?

sastuck · Posted 03-30-2018 07:41 PM

A) I am going to forgo the geometric mean for now. I am just looking for percentage growth over the course of a year. Here’s an example of what I would expect, based off numbers from yahoo finance:

Ticker year annual_return

AAOI 2014 61.35% (my data actually says its -35.15776393)

ACFC 2013 115.42% (my data actually says its 125.9565%)

B) I used:

proc contents

data=paper.ceo_firm;

run;

to confirm that salary is in the data set I am analyzing with proc reg

Is this what you were wondering?

C) I will find that thread and investigate my missing values further.

@art297 any other suggestions?

art297 · Posted 04-01-2018 05:25 PM

From what I've read in previous posts on this and related topics I'd be most interested in seeing what data was missing and, if it the results didn't seem reasonable, I'd investigate further.

You might want to run something like:

proc format;
value sal
.='Missing'
low-4999='<5000'
5000-39999='< 40000'
40000-high='40000 or more'
;
value return
.='missing'
low-.299='< .3'
.3-.499='<.5'
.5-high='.5 or more'
;
run;

proc freq data=paper.ceo_firm;
tables year*(annual_return salary)/ missing;
format salary sal.;
format annual_return return.;
run;

Of course, you might want to select better ranges for the formats.

Art, CEO, AnalystFinder.com

sastuck · Posted 04-01-2018 05:33 PM

Here is what the output looks like:

Table of year by annual_returnyear annual_returnmissing < .3 <.5 .5 or more Total20102011201220132014201520162017Total

156 0.72 2.33 22.41	1843 8.45 27.53 29.12	26 0.12 0.39 32.91	4670 21.42 69.75 31.76	6695 30.70
137 0.63 5.03 19.68	892 4.09 32.78 14.09	8 0.04 0.29 10.13	1684 7.72 61.89 11.45	2721 12.48
114 0.52 4.84 16.38	594 2.72 25.24 9.38	8 0.04 0.34 10.13	1637 7.51 69.57 11.13	2353 10.79
97 0.44 4.00 13.94	608 2.79 25.06 9.61	10 0.05 0.41 12.66	1711 7.85 70.53 11.64	2426 11.12
86 0.39 3.44 12.36	781 3.58 31.27 12.34	8 0.04 0.32 10.13	1623 7.44 64.97 11.04	2498 11.46
64 0.29 2.67 9.20	803 3.68 33.44 12.69	10 0.05 0.42 12.66	1524 6.99 63.47 10.37	2401 11.01
42 0.19 1.97 6.03	564 2.59 26.42 8.91	9 0.04 0.42 11.39	1520 6.97 71.19 10.34	2135 9.79
0 0.00 0.00 0.00	245 1.12 42.39 3.87	0 0.00 0.00 0.00	333 1.53 57.61 2.26	578 2.65
696 3.19	6330 29.03	79 0.36	14702 67.42	21807 100.00

Frequency

Percent

Row Pct

Col Pct

Table of year by SALARYyear SALARYMissing <5000 < 40000 Total20102011201220132014201520162017Total

4665 21.39 69.68 55.14	2029 9.30 30.31 15.21	1 0.00 0.01 20.00	6695 30.70
726 3.33 26.68 8.58	1994 9.14 73.28 14.95	1 0.00 0.04 20.00	2721 12.48
425 1.95 18.06 5.02	1927 8.84 81.90 14.44	1 0.00 0.04 20.00	2353 10.79
502 2.30 20.69 5.93	1923 8.82 79.27 14.41	1 0.00 0.04 20.00	2426 11.12
622 2.85 24.90 7.35	1875 8.60 75.06 14.05	1 0.00 0.04 20.00	2498 11.46
596 2.73 24.82 7.04	1805 8.28 75.18 13.53	0 0.00 0.00 0.00	2401 11.01
421 1.93 19.72 4.98	1714 7.86 80.28 12.85	0 0.00 0.00 0.00	2135 9.79
504 2.31 87.20 5.96	74 0.34 12.80 0.55	0 0.00 0.00 0.00	578 2.65
8461 38.80	13341 61.18	5 0.02	21807 100.00

Do you see anything systematic going on here?

art297 · Posted 04-01-2018 05:39 PM

The tables posted in a garbled fashion where one can't tell what it represents. It would, in this case, be better to post a snapshot of the result screens and post them as pdf files.

Art, CEO, AnalystFinder.com

sastuck · Posted 04-01-2018 05:41 PM

@art297, my bad. I will take note of that for next time. It's just that I have been told more than a few times that people do not prefer to have to open attached files on this forum is all.

art297 · Posted 04-01-2018 05:45 PM

People on the forum don't like pictures of datasets .. they'd rather have data posted in the form of an analyzable data step.

Similarly, they don't like opening file type that might contain viruses, like Word or Excel.

However, when showing one's output, a picture (i.e., jpg, pdf or the like) I'd think would be welcomed by most.

Art, CEO, AnalystFinder.com

Large number of observations with missing values in regression output

Re: Large number of observations with missing values in regression output

Re: Large number of observations with missing values in regression output

Re: Large number of observations with missing values in regression output

Re: Large number of observations with missing values in regression output

Re: Large number of observations with missing values in regression output

Re: Large number of observations with missing values in regression output

Re: Large number of observations with missing values in regression output

Re: Large number of observations with missing values in regression output

Re: Large number of observations with missing values in regression output

Re: Large number of observations with missing values in regression output

Re: Large number of observations with missing values in regression output

Re: Large number of observations with missing values in regression output

Re: Large number of observations with missing values in regression output

Re: Large number of observations with missing values in regression output

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away