BookmarkSubscribeRSS Feed
sastuck
Pyrite | Level 9

Hello,

 

I am cleaning up my executive compensation data and need some help. I found the code that I need to use on my dataset online, but it is from a Stata tutorial. Could you help me translate this to SAS code? The variable names should be the exact same. 

sort execid year cusip
by execid year; gen count_abs = [_N] 
drop if count_obs > 1 
drop count_obs

sort cusip year execid 
by cusip year: gen count_obs = [_N]
drop if count_ops > 1
drop count_ops

and for my firm performance data:

 

sort cusip fyear at
by cusip fyear: gen count_obs =[_N]

drop if count_obs > 1
drop count_obs

gen roa = ni / at

rename fyear year 

Any help is appreciated! 

10 REPLIES 10
Reeza
Super User

The first one just looks like it's removing duplicates.

 

proc sort data=have;
by execid year cusip;
run;

proc sort data=have out=deduped uodupkey;
by exceed year;
run;

That should help you get started.

 


@sastuck wrote:

Hello,

 

I am cleaning up my executive compensation data and need some help. I found the code that I need to use on my dataset online, but it is from a Stata tutorial. Could you help me translate this to SAS code? The variable names should be the exact same. 

sort execid year cusip
by execid year; gen count_abs = [_N] 
drop if count_obs > 1 
drop count_obs

sort cusip year execid 
by cusip year: gen count_obs = [_N]
drop if count_ops > 1
drop count_ops

and for my firm performance data:

 

sort cusip fyear at
by cusip fyear: gen count_obs =[_N]

drop if count_obs > 1
drop count_obs

gen roa = ni / at

rename fyear year 

Any help is appreciated! 


 

 

sastuck
Pyrite | Level 9

This is where I am so far: 

 

/*keep only CEOs*/
data CEO;
set paper.Compustat_ExecuComp;
  if CEOANN = 'CEO' then output CEO;
run;

/*drop CEOs who work in more than one firm in a given year*/ 
proc sort data=paper.Compustat_ExecuComp;
by execid year cusip;
run;

proc sort data=paper.Compustat_ExecuComp out=deduped nodupkey;
by execid year;
run;

/*drop all observations generated by firms with more than one CEO in a given year*/
proc sort data=paper.Compustat_ExecuComp out=deduped nodupkey;
by cusip year execid;
run;

But the amount of observations has remained constant even with the new code. Since keeping only CEOs the dataset went from 74300 to 13678 observations but has remained the same even though I am expecting the number to get smaller. Why is this? 

Reeza
Super User

Its not removing duplicates entirely, it's removing the duplicate rows. so this comment is incorrect:

 

/*drop CEOs who work in more than one firm in a given year*/ 

 

It keeps the first record for CEOs, it doesn't drop them.

 

If you want no duplicates at all, look at the UNIQUEOUT option for PROC SORT instead. 

sastuck
Pyrite | Level 9

but it doesn't seem to be dropping anything? and what do you mean it keeps the first record for CEOs, not dropping them?

 

Thanks!

Reeza
Super User

Take a small subset of your data, say 5 different ceos with some having no duplicates and some having duplicates. 

Track those through the data set examining the companies at each step. 

 

 

Tom
Super User Tom
Super User

@sastuck wrote:

but it doesn't seem to be dropping anything? and what do you mean it keeps the first record for CEOs, not dropping them?

 

Thanks!


The NODUPKEY option means to output only one observation per BY group.  So if there are 5 observations with the same BY values then only 1 is written and the other 4 are deleted.  But it sounds like you want to delete ALL observations when there are more than one observation in the BY group.

Tom
Super User Tom
Super User

But the amount of observations has remained constant even with the new code. Since keeping only CEOs the dataset went from 74300 to 13678 observations but has remained the same even though I am expecting the number to get smaller. Why is this? 

 

Why would you expect it to keep getting smaller?

Do you expect that one executive is the sole CEO of more than one company at a time?

Do you expect that one company has more than one CEO at a time?

 

Also don't keep going back to the same source dataset if you want to apply additional filtering to the subset.

 

proc sort data=INPUT out=step1 nodupkey ;
  by x ;
run;

proc sort data=step1 out=step2 nodupkey ;
  by y;
run;

 

Tom
Super User Tom
Super User

What are you trying to do? Explain in words.

What does your source data look like?

Can you make a small example set of records that demonstrate what you are trying to do?

Tom
Super User Tom
Super User

I find it much easier to understand if you use a data step to select which records you want.

So if you have a variable named BUSINESS for the company id and and a variable EXECUTIVE for the person id and variable named ROLE that has values that include 'CEO' then to find the businesses that have only one executive in the CEO role you would do something like this.

 

data sole_ceo ;
  set have ;
  by business ;
  where role='CEO';
  if first.business and last.business;
run;

So if there is only one observation for that business it will be both the first and the last observation in that by group.

mkeintz
PROC Star

@Tom  I don't think the OP only wants companies with a single CEO over the entire study period, as your example produces.

 

I believe they are asking for single CEO's for any given cusip/year, which @Reeza's suggestion would produce:

 

proc sort data=paper.compustat_execucomp  
       nouniquekey out=_null_ uniqueout=want;
  where ceoann='CEO';
  by cusip year;
run;

 

@sastuck: The "nouniquekey" tells SAS that the normal OUT= sorted dataset gets only the keys that don't have a single record.  But since that's not what the OP wants, use the "out=_null_" to send those records to a null data set.  And then "uniqueout=want" sends all the rest to dataset WANT.

 

 

But I also don't see how a uniqueout option in proc sort would generate this eithe

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 4959 views
  • 3 likes
  • 4 in conversation