Hello all!
You will notice from the log that my data set PAPER.COMPUSTAT_EXECUCOMP4 has 13355 observations.
NOTE: There were 13678 observations read from the data set PAPER.COMPUSTAT_EXECUCOMP3.
NOTE: There were 4304 observations read from the data set WORK.MULTIPLE_RETURNS.
NOTE: The data set PAPER.COMPUSTAT_EXECUCOMP4 has 13355 observations and 107 variables.
NOTE: DATA statement used (Total process time):
      real time            0.14 seconds
      user cpu time        0.03 seconds
      system cpu time      0.04 seconds
      memory               4073.15k
      OS Memory            42680.00k
      Timestamp            03/30/2018 02:06:59 PM
      Step Count           206  Switch Count  2
      Page Faults          0
      Page Reclaims        575
      Page Swaps           0
      Voluntary Context Switches    1143
      Involuntary Context Switches  2
      Block Input Operations        55840
      Block Output Operations       54536

And this is the dataset that we merged for the final CEO_FIRM dataset.
So how could it be possible, as the log shows, that CEO_FIRM, the merged database has 55084 observations?
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 53961 observations read from the data set PAPER.CSRP_ANNUAL_RETURNS.
NOTE: There were 13355 observations read from the data set PAPER.COMPUSTAT_EXECUCOMP4.
NOTE: The data set PAPER.CEO_FIRM has 55084 observations and 108 variables.
NOTE: DATA statement used (Total process time):
      real time            0.39 seconds
      user cpu time        0.06 seconds
      system cpu time      0.10 seconds
      memory               5286.21k
      OS Memory            44216.00k
      Timestamp            03/30/2018 02:07:03 PM
      Step Count           211  Switch Count  6
      Page Faults          0
      Page Reclaims        869
      Page Swaps           0
      Voluntary Context Switches    3404
      Involuntary Context Switches  20
      Block Input Operations        56576
      Block Output Operations       224008
Shouldn't it only have as many observations as the smaller dataset, if it's a 1-to-1 match for each CEO/annual_return? I highlighted the relevant section of the log.
Let me know if you have any ideas!
-SAStuck
If you merge two files by some key variable and the resulting dataset has more observations than the larger input file, then there must be key values that are present in only one of the files.
So if you merge A and B by ID, where A has 10 obs and B has 8 obs, and the result has more than 10 obs, then there must be some values of ID that appear in only one of the two inputs.
If you want to figure out which input dataset is contributing data to the merge, you can use the IN= dataset options to set flag variables.
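For example, here is a tiny sketch with made-up datasets A and B showing how a one-to-one match can still produce more rows than either input when some IDs appear in only one file:

```sas
* A has IDs 1-10, B has IDs 5-12; only IDs 5-10 overlap;
data a;
  do id=1 to 10; output; end;
run;

data b;
  do id=5 to 12; output; end;
run;

* Match-merge by ID: the result has 12 obs, one per distinct ID,
  which is more than either input (10 and 8);
data ab;
  merge a b;
  by id;
run;
```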
DATA paper.ceo_firm ;
length ticker $5 source $10;
MERGE
paper.CSRP_annual_returns2 (in=in1)
paper.compustat_execucomp4 (in=in2)
;
BY ticker;
if in1 and in2 then source='BOTH';
else if in1 then source='ANNUAL';
else source='EXECCOMP';
RUN;
proc freq ;
tables source;
run;
Thanks for the reply. Here is the code:
*Merge CEO data and firm data;
DATA paper.ceo_firm;
  length ticker $5;
  MERGE paper.CSRP_annual_returns paper.compustat_execucomp4;
  BY ticker;
RUN;
I found this code, which may have worked:
*MERGE statement has more than one data set with repeats of BY values;
proc sort data=paper.compustat_execucomp4;
by ticker;
run;
proc sort data=paper.CSRP_annual_returns nodupkey;
by ticker;
run;
data want;
merge paper.compustat_execucomp4 (in=in1) paper.CSRP_annual_returns;
by ticker;
if in1;
run;
The new ceo_firm dataset has 10,741 rows, which seems more realistic.
What do you think?
Yes, that would eliminate the "more than one record" warning. The only question is whether you dropped the correct records or, if not, whether you can retrieve the original file to find out. You used:
proc sort data=paper.CSRP_annual_returns nodupkey;
  by ticker;
run;
which will definitely leave only one record per ticker, but you did it in place, overwriting the original (?) file.
Art, CEO, AnalystFinder.com
right . . . recommendations?
Hopefully, you still have the original file.
If so, I would create a file that only contains the duplicates and check each duplicate to see if they are different and, if they are, which one to keep.
Only getting the duplicate records should be easy. e.g.:
proc sort data=paper.CSRP_annual_returns;
  by ticker;
run;

data test;
  set paper.CSRP_annual_returns;
  by ticker;
  if not (first.ticker and last.ticker);
run;
Art, CEO, AnalystFinder.com
I am happy to try this. Would you just mind explaining this portion so that I understand?
data test;
  set paper.CSRP_annual_returns;
  if not (first.ticker and last.ticker);
run;
Additionally, how do I make sure not to overwrite the original dataset again?
1. How not to replace a file during a sort: simply use the out= option when running the sort. I.e., instead of sorting the file in place, use something like:
proc sort data=paper.CSRP_annual_returns out=paper.CSRP_annual_returns2 nodupkey;
  by ticker;
run;
2. My bad! The code should have included a by statement. e.g.
data test;
  set paper.CSRP_annual_returns;
  by ticker;
  if not (first.ticker and last.ticker);
run;
That way, first.ticker and last.ticker will both equal 1 only when a ticker has no duplicates. Thus you only want to look at/review the records that don't meet that condition.
Art, CEO, AnalystFinder.com
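To make the first./last. logic concrete, here is a toy illustration with made-up data (the ticker values are hypothetical):

```sas
data have;
  input ticker $ year;
datalines;
AAA 2001
AAA 2002
BBB 2001
;

data dups;
  set have;
  by ticker;
  * first.ticker and last.ticker are both 1 only for BBB,
    the ticker with a single record, so only the two
    duplicate AAA rows survive this subsetting IF;
  if not (first.ticker and last.ticker);
run;
```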
I ran the following code:
*MERGE statement has more than one data set with repeats of BY values;
proc sort data=paper.CSRP_annual_returns out=paper.CSRP_annual_returns2;
  by ticker;
run;

proc sort data=paper.CSRP_annual_returns out=paper.CSRP_annual_returns2 nodupkey;
  by ticker;
run;

*Merge CEO data and firm data;
DATA paper.ceo_firm;
  length ticker $5;
  MERGE paper.CSRP_annual_returns paper.compustat_execucomp4;
  BY ticker;
RUN;
And now, as you will see in the log, the by variable issue is persisting. Any ideas? Maybe I didn't implement the code properly?
I don't know why you sorted the file twice but, regardless, you used the wrong file in your DATA step merge.
Didn't you mean to use?:
MERGE paper.CSRP_annual_returns2 paper.compustat_execucomp4;
Art, CEO, AnalystFinder.com
Thanks, Art.
Here's the update:
*MERGE statement has more than one data set with repeats of BY values;
proc sort data=paper.CSRP_annual_returns out=paper.CSRP_annual_returns2 nodupkey;
  by ticker;
run;

*Merge CEO data and firm data;
DATA paper.ceo_firm;
  length ticker $5;
  MERGE paper.CSRP_annual_returns2 paper.compustat_execucomp4;
  BY ticker;
RUN;
And the log:
I don't know what you have in execomp4 (I probably have the file name wrong .. the file that you DIDN'T run the nodupkey on), but was your annual file the one that contained yearly averages? Unless you're just analyzing one year, I don't think you want to unduplicate that file.
I think your duplicate problem is in your execomp4 file.
Art, CEO, AnalystFinder.com
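One way to test that without touching the original file (assuming the duplicates really are in execucomp4) is PROC SORT's DUPOUT= option, which writes the records that NODUPKEY would discard to a separate dataset:

```sas
* Leave paper.compustat_execucomp4 intact: write one row per ticker
  to _nodup and every discarded duplicate to _dups;
proc sort data=paper.compustat_execucomp4 out=_nodup dupout=_dups nodupkey;
  by ticker;
run;

* If _dups has any observations, execucomp4 has repeated tickers;
proc print data=_dups (obs=20);
run;
```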
Thanks @Tom for the very clear and useful response. You made it very easy to understand what is going on here.
Here is the output from the proc freq:
source     Frequency   Percent   Cumulative Frequency   Cumulative Percent
ANNUAL          8461     38.80                   8461                38.80
BOTH           12650     58.01                  21111                96.81
EXECCOMP         696      3.19                  21807               100.00
8461 + 696 = 9157, so 12650 - 9157 = 3493 are coming from one of these datasets. How do we know which one? Also, there are 21,807 observations in the ceo_firm dataset. How does this tie in to the numbers above?
Here's the log from the merge statement:
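One hedged way to see which input still repeats tickers (dataset names as used earlier in the thread) is to count rows per ticker in each input, e.g. with PROC SQL:

```sas
* List any tickers that occur more than once in each input dataset;
proc sql;
  select ticker, count(*) as n
    from paper.CSRP_annual_returns2
    group by ticker
    having count(*) > 1;

  select ticker, count(*) as n
    from paper.compustat_execucomp4
    group by ticker
    having count(*) > 1;
quit;
```

Whichever query returns rows is the dataset still contributing the repeated BY values.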