Hello,
I have a variable I would like to include in my regression model "bonus" that is character and needs to be numeric. While I am looking for help with the data step that could fix this, I am also wondering how this will fit into the context of the rest of my program, since I have been renaming the data sets as I've been going. So one question I have, for example, is if we change the name of the new data set (with bonus as a numeric variable), then do we need to change the data set going into the merge statement (which I use to run the regression) so that this change is put into use?
Here's what bonus looks like:
36656478161452797741211945451505255534647956761056894889091751038798281
ADDRESS | Char | 41 | $41. | $41. |
AGE | Num | 8 | BEST12. | BEST32. |
ALLOTHPD | Char | 1 | $1. | $1. |
ALLOTHTOT | Char | 1 | $1. | $1. |
BECAMECEO | Num | 8 | BEST12. | BEST32. |
BONUS | Char | 12 | $12. | $12. |
CEOANN | Char | 3 | $3. | $3. |
CFOANN | Char | 7 | $7. | $7. |
CHG_CTRL_PYMT | Num | 8 | BEST12. | BEST32. |
CITY | Char | 15 | $15. | $15. |
COMMENT | Char | 1 | $1. | $1. |
CONAME | Char | 34 | $34. | $34. |
CO_PER_ROL | Char | 5 | $5. | $5. |
CUSIP | Char | 8 | $8. | $8. |
DEFER_BALANCE_TOT | Num | 8 | BEST12. | BEST32. |
DEFER_CONTRIB_CO_TOT | Num | 8 | BEST12. | BEST32. |
DEFER_CONTRIB_EXEC_TOT | Num | 8 | BEST12. | BEST32. |
DEFER_EARNINGS_TOT | Num | 8 | BEST12. | BEST32. |
DEFER_RPT_AS_COMP_TOT | Num | 8 | BEST12. | BEST32. |
DEFER_WITHDR_TOT | Num | 8 | BEST12. | BEST32. |
EIP_UNEARN_NUM | Num | 8 | BEST12. | BEST32. |
EIP_UNEARN_VAL | Num | 8 | BEST12. | BEST32. |
EXCHANGE | Char | 3 | $3. | $3. |
EXECDIR | Char | 2 | $2. | $2. |
EXECID | Num | 8 | BEST12. | BEST32. |
EXECRANK | Char | 8 | $8. | $8. |
EXECRANKANN | Num | 8 | BEST12. | BEST32. |
EXEC_FNAME | Char | 12 | $12. | $12. |
EXEC_FULLNAME | Char | 614 | $614. | $614. |
EXEC_LNAME | Char | 22 | $22. | $22. |
EXEC_MNAME | Char | 9 | $9. | $9. |
GENDER | Char | 6 | $6. | $6. |
GVKEY | Num | 8 | BEST12. | BEST32. |
INDDESC | Char | 44 | $44. | $44. |
INTERLOCK | Char | 3 | $3. | $3. |
JOINED_CO | Num | 8 | BEST12. | BEST32. |
LEFTCO | Num | 8 | BEST12. | BEST32. |
LEFTOFC | Num |
and here is what my program looks like:
libname paper "~/425/425_Final_Paper"; *import compustat dataset; PROC IMPORT out=paper.CompuStat_Execucomp datafile="~/425/425_Final_Paper/CompuStat_Execucomp.csv" DBMS=CSV replace; getnames= yes; guessingrows=2000; run; *keep only CEOs; data paper.Compustat_ExecuComp2; set paper.Compustat_ExecuComp; if CEOANN = 'CEO'; run; *** extra DATA step for checking previous data step results by comparing number of records selected; data check_CEOANN; set paper.Compustat_ExecuComp; if CEOANN =: 'CEO'; run; proc sort data=paper.Compustat_ExecuComp2; by ticker year; run; *import csrp dataset; PROC IMPORT out=paper.CSRP_Monthly_Stock_char datafile="~/425/425_Final_Paper/CSRP_MonthlyStock_char.csv" DBMS=CSV replace; getnames= yes; guessingrows=max; run; *remove bad data; data paper.CSRP_Monthly_Stock_char2; set paper.CSRP_Monthly_Stock_char (rename=(ret=character_ret)); drop character_ret; if cusip = ' ' then delete; ret = input(character_ret, ??8.); if ret = . then delete; date = input(put(date, z8.), yymmdd8.); format date yymmdd10.; year = year(date); month = month(date); if cusip =: '?' then cusip = substr(cusip, 2); run; proc contents data=paper.CSRP_Monthly_Stock_char; run; proc contents data=paper.CSRP_Monthly_Stock_char2; run; proc contents data=paper.multiple_CEOs; run; proc sort data=paper.CSRP_Monthly_Stock_char2; by ticker year; run; proc sort data=paper.Compustat_ExecuComp2; by ticker year; run; *Remove all bad years from both data sources; data paper.Compustat_ExecuComp3; length ticker $5; merge paper.Compustat_ExecuComp2 paper.multiple_CEOs (keep=ticker year in=had_multiple_CEOs); by ticker year; if had_multiple_CEOs then delete; run; data paper.CSRP_Monthly_Stock_char3; merge paper.CSRP_Monthly_Stock_char2 paper.multiple_CEOs (keep=ticker year in=had_multiple_CEOs); by ticker year; if had_multiple_CEOs then delete; run; *find additional bad data: multiple return records for the same month/year; proc freq data=paper.CSRP_Monthly_Stock_char3; tables ticker * year * month / noprint out=paper.multiple_returns (where=(count > 1)); run; *Remove all matching year data for multiple returns; proc sort data=paper.multiple_returns out=multiple_returns (keep=ticker year) NODUPKEY; by ticker year; run; data paper.Compustat_ExecuComp4; merge paper.Compustat_ExecuComp3 multiple_returns (in=had_multiple_returns); by ticker year; if had_multiple_returns then delete; run; data paper.CSRP_Monthly_Stock_char4; length ticker $5; merge paper.CSRP_Monthly_Stock_char3 multiple_returns (in=had_multiple_returns); by ticker year; if had_multiple_returns then delete; run; proc contents data=paper.CSRP_Monthly_Stock_char3; run; proc contents data=paper.CSRP_Monthly_Stock_char4; run; *create new variable annualized growth; data paper.CSRP_annual_returns; set paper.CSRP_Monthly_Stock_char4; by ticker year; retain annual_return 1; annual_return = annual_return * (1 + RET); if month = 12 or last.ticker; annual_return = (annual_return - 1) * 100; output; annual_return = 1; keep ticker year annual_return; run; *use proc contents to see if there is a type mismatch; proc contents data=paper.CSRP_annual_returns; run; proc contents data=paper.Compustat_ExecuComp4; run; *MERGE statement has more than one data set with repeats of BY values; proc sort data=paper.CSRP_annual_returns out=paper.CSRP_annual_returns2 nodupkey; by ticker; run; *Merge CEO data and firm data; DATA paper.ceo_firm ; length ticker $5; MERGE paper.CSRP_annual_returns2 (in=in1) paper.compustat_execucomp4 (in=in2) ; BY ticker; if in1 and in2; RUN; proc contents data=paper.ceo_firm; run; *remove missing return data; data paper.ceo_firm; set paper.ceo_firm; if annual_return = . then delete; run; proc means data=paper.CSRP_Monthly_Stock_char3 n nmiss; var ret; run; proc means data=paper.Compustat_ExecuComp4 n nmiss; var salary; run; proc means data=paper.CSRP_annual_returns n nmiss; var annual_return; run; *create dummy variable that indicates whether the firm's market value increases in a given year; data paper.ceo_firm; set paper.ceo_firm; by ticker year; last_return = lag(annual_return); if not first.ticker then increase = (annual_return > last_return); else increase = .; run;
Thanks so much for the help!
Given that BONUS is originally part of executive compensation, here would be an easy place to revise the program:
*Remove all bad years from both data sources; data paper.Compustat_ExecuComp3; length ticker $5; merge paper.Compustat_ExecuComp2 paper.multiple_CEOs (keep=ticker year in=had_multiple_CEOs); by ticker year; if had_multiple_CEOs then delete; run;
One way to change it:
*Remove all bad years from both data sources; data paper.Compustat_ExecuComp3; length ticker $5; merge paper.Compustat_ExecuComp2 (rename=(bonus=bonus_char)) paper.multiple_CEOs (keep=ticker year in=had_multiple_CEOs); by ticker year; if had_multiple_CEOs then delete;
bonus = input(bonus_char, ??12.);
drop bonus_char; run;
You might want to examine the character values for BONUS. This program changes them to missing values. It could be appropriate to change them to 0 or to some other value. It all depends on what those character values contain, and what those values mean. If you want to isolate them and inspect them, it could be done in this way:
proc freq data=paper.CompuStat_ExecuComp2;
tables bonus;
where bonus > ' ' and input(bonus, 12.) = .;
run;
Given that BONUS is originally part of executive compensation, here would be an easy place to revise the program:
*Remove all bad years from both data sources; data paper.Compustat_ExecuComp3; length ticker $5; merge paper.Compustat_ExecuComp2 paper.multiple_CEOs (keep=ticker year in=had_multiple_CEOs); by ticker year; if had_multiple_CEOs then delete; run;
One way to change it:
*Remove all bad years from both data sources; data paper.Compustat_ExecuComp3; length ticker $5; merge paper.Compustat_ExecuComp2 (rename=(bonus=bonus_char)) paper.multiple_CEOs (keep=ticker year in=had_multiple_CEOs); by ticker year; if had_multiple_CEOs then delete;
bonus = input(bonus_char, ??12.);
drop bonus_char; run;
You might want to examine the character values for BONUS. This program changes them to missing values. It could be appropriate to change them to 0 or to some other value. It all depends on what those character values contain, and what those values mean. If you want to isolate them and inspect them, it could be done in this way:
proc freq data=paper.CompuStat_ExecuComp2;
tables bonus;
where bonus > ' ' and input(bonus, 12.) = .;
run;
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.