Hello,
I am cleaning up my executive compensation data and need some help. I found the code that I need to use on my dataset online, but it is from a Stata tutorial. Could you help me translate this to SAS code? The variable names should be the exact same.
sort execid year cusip by execid year; gen count_abs = [_N] drop if count_obs > 1 drop count_obs sort cusip year execid by cusip year: gen count_obs = [_N] drop if count_ops > 1 drop count_ops
and for my firm performance data:
sort cusip fyear at by cusip fyear: gen count_obs =[_N] drop if count_obs > 1 drop count_obs gen roa = ni / at rename fyear year
Any help is appreciated!
The first one just looks like it's removing duplicates.
proc sort data=have;
by execid year cusip;
run;
proc sort data=have out=deduped uodupkey;
by exceed year;
run;
That should help you get started.
@sastuck wrote:
Hello,
I am cleaning up my executive compensation data and need some help. I found the code that I need to use on my dataset online, but it is from a Stata tutorial. Could you help me translate this to SAS code? The variable names should be the exact same.
sort execid year cusip by execid year; gen count_abs = [_N] drop if count_obs > 1 drop count_obs sort cusip year execid by cusip year: gen count_obs = [_N] drop if count_ops > 1 drop count_opsand for my firm performance data:
sort cusip fyear at by cusip fyear: gen count_obs =[_N] drop if count_obs > 1 drop count_obs gen roa = ni / at rename fyear yearAny help is appreciated!
This is where I am so far:
/*keep only CEOs*/ data CEO; set paper.Compustat_ExecuComp; if CEOANN = 'CEO' then output CEO; run; /*drop CEOs who work in more than one firm in a given year*/ proc sort data=paper.Compustat_ExecuComp; by execid year cusip; run; proc sort data=paper.Compustat_ExecuComp out=deduped nodupkey; by execid year; run; /*drop all observations generated by firms with more than one CEO in a given year*/ proc sort data=paper.Compustat_ExecuComp out=deduped nodupkey; by cusip year execid; run;
But the amount of observations has remained constant even with the new code. Since keeping only CEOs the dataset went from 74300 to 13678 observations but has remained the same even though I am expecting the number to get smaller. Why is this?
Its not removing duplicates entirely, it's removing the duplicate rows. so this comment is incorrect:
/*drop CEOs who work in more than one firm in a given year*/
It keeps the first record for CEOs, it doesn't drop them.
If you want no duplicates at all, look at the UNIQUEOUT option for PROC SORT instead.
but it doesn't seem to be dropping anything? and what do you mean it keeps the first record for CEOs, not dropping them?
Thanks!
Take a small subset of your data, say 5 different ceos with some having no duplicates and some having duplicates.
Track those through the data set examining the companies at each step.
@sastuck wrote:
but it doesn't seem to be dropping anything? and what do you mean it keeps the first record for CEOs, not dropping them?
Thanks!
The NODUPKEY option means to output only one observation per BY group. So if there are 5 observations with the same BY values then only 1 is written and the other 4 are deleted. But it sounds like you want to delete ALL observations when there are more than one observation in the BY group.
But the amount of observations has remained constant even with the new code. Since keeping only CEOs the dataset went from 74300 to 13678 observations but has remained the same even though I am expecting the number to get smaller. Why is this?
Why would you expect it to keep getting smaller?
Do you expect that one executive is the sole CEO of more than one company at a time?
Do you expect that one company has more than one CEO at a time?
Also don't keep going back to the same source dataset if you want to apply additional filtering to the subset.
proc sort data=INPUT out=step1 nodupkey ;
by x ;
run;
proc sort data=step1 out=step2 nodupkey ;
by y;
run;
What are you trying to do? Explain in words.
What does your source data look like?
Can you make a small example set of records that demonstrate what you are trying to do?
I find it much easier to understand if you use a data step to select which records you want.
So if you have a variable named BUSINESS for the company id and and a variable EXECUTIVE for the person id and variable named ROLE that has values that include 'CEO' then to find the businesses that have only one executive in the CEO role you would do something like this.
data sole_ceo ;
set have ;
by business ;
where role='CEO';
if first.business and last.business;
run;
So if there is only one observation for that business it will be both the first and the last observation in that by group.
@Tom I don't think the OP only wants companies with a single CEO over the entire study period, as your example produces.
I believe they are asking for single CEO's for any given cusip/year, which @Reeza's suggestion would produce:
proc sort data=paper.compustat_execucomp
nouniquekey out=_null_ uniqueout=want;
where ceoann='CEO';
by cusip year;
run;
@sastuck: The "nouniquekey" tells SAS that the normal OUT= sorted dataset gets only the keys that don't have a single record. But since that's not what the OP wants, use the "out=_null_" to send those records to a null data set. And then "uniqueout=want" sends all the rest to dataset WANT.
But I also don't see how a uniqueout option in proc sort would generate this eithe
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.