Re: Translate from Stata to SAS code

sastuck · Posted 03-04-2018 03:00 PM

Hello,

I am cleaning up my executive compensation data and need some help. I found the code that I need to use on my dataset online, but it is from a Stata tutorial. Could you help me translate this to SAS code? The variable names should be the exact same.

sort execid year cusip
by execid year; gen count_abs = [_N] 
drop if count_obs > 1 
drop count_obs

sort cusip year execid 
by cusip year: gen count_obs = [_N]
drop if count_ops > 1
drop count_ops

and for my firm performance data:

sort cusip fyear at
by cusip fyear: gen count_obs =[_N]

drop if count_obs > 1
drop count_obs

gen roa = ni / at

rename fyear year

Any help is appreciated!

Reeza · Posted 03-04-2018 03:09 PM

The first one just looks like it's removing duplicates.

proc sort data=have;
by execid year cusip;
run;

proc sort data=have out=deduped uodupkey;
by exceed year;
run;

That should help you get started.

@sastuck wrote:

Hello,

I am cleaning up my executive compensation data and need some help. I found the code that I need to use on my dataset online, but it is from a Stata tutorial. Could you help me translate this to SAS code? The variable names should be the exact same.
sort execid year cusip
by execid year; gen count_abs = [_N] 
drop if count_obs > 1 
drop count_obs

sort cusip year execid 
by cusip year: gen count_obs = [_N]
drop if count_ops > 1
drop count_ops
and for my firm performance data:
sort cusip fyear at
by cusip fyear: gen count_obs =[_N]

drop if count_obs > 1
drop count_obs

gen roa = ni / at

rename fyear year 
Any help is appreciated!

sastuck · Posted 03-04-2018 04:27 PM

This is where I am so far:

/*keep only CEOs*/
data CEO;
set paper.Compustat_ExecuComp;
  if CEOANN = 'CEO' then output CEO;
run;

/*drop CEOs who work in more than one firm in a given year*/ 
proc sort data=paper.Compustat_ExecuComp;
by execid year cusip;
run;

proc sort data=paper.Compustat_ExecuComp out=deduped nodupkey;
by execid year;
run;

/*drop all observations generated by firms with more than one CEO in a given year*/
proc sort data=paper.Compustat_ExecuComp out=deduped nodupkey;
by cusip year execid;
run;

But the amount of observations has remained constant even with the new code. Since keeping only CEOs the dataset went from 74300 to 13678 observations but has remained the same even though I am expecting the number to get smaller. Why is this?

Reeza · Posted 03-04-2018 05:02 PM

Its not removing duplicates entirely, it's removing the duplicate rows. so this comment is incorrect:

/*drop CEOs who work in more than one firm in a given year*/

It keeps the first record for CEOs, it doesn't drop them.

If you want no duplicates at all, look at the UNIQUEOUT option for PROC SORT instead.

sastuck · Posted 03-05-2018 05:50 PM

but it doesn't seem to be dropping anything? and what do you mean it keeps the first record for CEOs, not dropping them?

Thanks!

Reeza · Posted 03-05-2018 05:52 PM

Take a small subset of your data, say 5 different ceos with some having no duplicates and some having duplicates.

Track those through the data set examining the companies at each step.

Tom · Posted 03-05-2018 11:52 PM

@sastuck wrote:

but it doesn't seem to be dropping anything? and what do you mean it keeps the first record for CEOs, not dropping them?

Thanks!

The NODUPKEY option means to output only one observation per BY group. So if there are 5 observations with the same BY values then only 1 is written and the other 4 are deleted. But it sounds like you want to delete ALL observations when there are more than one observation in the BY group.

Tom · Posted 03-04-2018 06:02 PM

But the amount of observations has remained constant even with the new code. Since keeping only CEOs the dataset went from 74300 to 13678 observations but has remained the same even though I am expecting the number to get smaller. Why is this?

Why would you expect it to keep getting smaller?

Do you expect that one executive is the sole CEO of more than one company at a time?

Do you expect that one company has more than one CEO at a time?

Also don't keep going back to the same source dataset if you want to apply additional filtering to the subset.

proc sort data=INPUT out=step1 nodupkey ;
  by x ;
run;

proc sort data=step1 out=step2 nodupkey ;
  by y;
run;

Tom · Posted 03-04-2018 05:55 PM

What are you trying to do? Explain in words.

What does your source data look like?

Can you make a small example set of records that demonstrate what you are trying to do?

Tom · Posted 03-04-2018 06:07 PM

I find it much easier to understand if you use a data step to select which records you want.

So if you have a variable named BUSINESS for the company id and and a variable EXECUTIVE for the person id and variable named ROLE that has values that include 'CEO' then to find the businesses that have only one executive in the CEO role you would do something like this.

data sole_ceo ;
  set have ;
  by business ;
  where role='CEO';
  if first.business and last.business;
run;

So if there is only one observation for that business it will be both the first and the last observation in that by group.

mkeintz · Posted 03-06-2018 12:42 AM

@Tom I don't think the OP only wants companies with a single CEO over the entire study period, as your example produces.

I believe they are asking for single CEO's for any given cusip/year, which @Reeza's suggestion would produce:

proc sort data=paper.compustat_execucomp  
       nouniquekey out=_null_ uniqueout=want;
  where ceoann='CEO';
  by cusip year;
run;

@sastuck: The "nouniquekey" tells SAS that the normal OUT= sorted dataset gets only the keys that don't have a single record. But since that's not what the OP wants, use the "out=_null_" to send those records to a null data set. And then "uniqueout=want" sends all the rest to dataset WANT.

But I also don't see how a uniqueout option in proc sort would generate this eithe

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Training: Just a Click Away