- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
I am cleaning up my executive compensation data and need some help. I found the code that I need to use on my dataset online, but it is from a Stata tutorial. Could you help me translate this to SAS code? The variable names should be the exact same.
sort execid year cusip by execid year; gen count_abs = [_N] drop if count_obs > 1 drop count_obs sort cusip year execid by cusip year: gen count_obs = [_N] drop if count_ops > 1 drop count_ops
and for my firm performance data:
sort cusip fyear at by cusip fyear: gen count_obs =[_N] drop if count_obs > 1 drop count_obs gen roa = ni / at rename fyear year
Any help is appreciated!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The first one just looks like it's removing duplicates.
proc sort data=have;
by execid year cusip;
run;
proc sort data=have out=deduped uodupkey;
by exceed year;
run;
That should help you get started.
@sastuck wrote:
Hello,
I am cleaning up my executive compensation data and need some help. I found the code that I need to use on my dataset online, but it is from a Stata tutorial. Could you help me translate this to SAS code? The variable names should be the exact same.
sort execid year cusip by execid year; gen count_abs = [_N] drop if count_obs > 1 drop count_obs sort cusip year execid by cusip year: gen count_obs = [_N] drop if count_ops > 1 drop count_opsand for my firm performance data:
sort cusip fyear at by cusip fyear: gen count_obs =[_N] drop if count_obs > 1 drop count_obs gen roa = ni / at rename fyear yearAny help is appreciated!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This is where I am so far:
/*keep only CEOs*/ data CEO; set paper.Compustat_ExecuComp; if CEOANN = 'CEO' then output CEO; run; /*drop CEOs who work in more than one firm in a given year*/ proc sort data=paper.Compustat_ExecuComp; by execid year cusip; run; proc sort data=paper.Compustat_ExecuComp out=deduped nodupkey; by execid year; run; /*drop all observations generated by firms with more than one CEO in a given year*/ proc sort data=paper.Compustat_ExecuComp out=deduped nodupkey; by cusip year execid; run;
But the amount of observations has remained constant even with the new code. Since keeping only CEOs the dataset went from 74300 to 13678 observations but has remained the same even though I am expecting the number to get smaller. Why is this?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Its not removing duplicates entirely, it's removing the duplicate rows. so this comment is incorrect:
/*drop CEOs who work in more than one firm in a given year*/
It keeps the first record for CEOs, it doesn't drop them.
If you want no duplicates at all, look at the UNIQUEOUT option for PROC SORT instead.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
but it doesn't seem to be dropping anything? and what do you mean it keeps the first record for CEOs, not dropping them?
Thanks!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Take a small subset of your data, say 5 different ceos with some having no duplicates and some having duplicates.
Track those through the data set examining the companies at each step.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@sastuck wrote:
but it doesn't seem to be dropping anything? and what do you mean it keeps the first record for CEOs, not dropping them?
Thanks!
The NODUPKEY option means to output only one observation per BY group. So if there are 5 observations with the same BY values then only 1 is written and the other 4 are deleted. But it sounds like you want to delete ALL observations when there are more than one observation in the BY group.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
But the amount of observations has remained constant even with the new code. Since keeping only CEOs the dataset went from 74300 to 13678 observations but has remained the same even though I am expecting the number to get smaller. Why is this?
Why would you expect it to keep getting smaller?
Do you expect that one executive is the sole CEO of more than one company at a time?
Do you expect that one company has more than one CEO at a time?
Also don't keep going back to the same source dataset if you want to apply additional filtering to the subset.
proc sort data=INPUT out=step1 nodupkey ;
by x ;
run;
proc sort data=step1 out=step2 nodupkey ;
by y;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
What are you trying to do? Explain in words.
What does your source data look like?
Can you make a small example set of records that demonstrate what you are trying to do?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I find it much easier to understand if you use a data step to select which records you want.
So if you have a variable named BUSINESS for the company id and and a variable EXECUTIVE for the person id and variable named ROLE that has values that include 'CEO' then to find the businesses that have only one executive in the CEO role you would do something like this.
data sole_ceo ;
set have ;
by business ;
where role='CEO';
if first.business and last.business;
run;
So if there is only one observation for that business it will be both the first and the last observation in that by group.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Tom I don't think the OP only wants companies with a single CEO over the entire study period, as your example produces.
I believe they are asking for single CEO's for any given cusip/year, which @Reeza's suggestion would produce:
proc sort data=paper.compustat_execucomp
nouniquekey out=_null_ uniqueout=want;
where ceoann='CEO';
by cusip year;
run;
@sastuck: The "nouniquekey" tells SAS that the normal OUT= sorted dataset gets only the keys that don't have a single record. But since that's not what the OP wants, use the "out=_null_" to send those records to a null data set. And then "uniqueout=want" sends all the rest to dataset WANT.
But I also don't see how a uniqueout option in proc sort would generate this eithe
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------