Hello all!
You will notice from the log that my data set PAPER.COMPUSTAT_EXECUCOMP4 has 13355 observations.
NOTE: There were 13678 observations read from the data set PAPER.COMPUSTAT_EXECUCOMP3.
NOTE: There were 4304 observations read from the data set WORK.MULTIPLE_RETURNS.
NOTE: The data set PAPER.COMPUSTAT_EXECUCOMP4 has 13355 observations and 107 variables.
NOTE: DATA statement used (Total process time):
      real time            0.14 seconds
      user cpu time        0.03 seconds
      system cpu time      0.04 seconds
      memory               4073.15k
      OS Memory            42680.00k
      Timestamp            03/30/2018 02:06:59 PM
      Step Count           206  Switch Count  2
      Page Faults          0
      Page Reclaims        575
      Page Swaps           0
      Voluntary Context Switches    1143
      Involuntary Context Switches  2
      Block Input Operations        55840
      Block Output Operations       54536

And this is the dataset that we merged for the final CEO_FIRM dataset.
So how could it be possible, as the log shows, that CEO_FIRM, the merged database has 55084 observations?
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 53961 observations read from the data set PAPER.CSRP_ANNUAL_RETURNS.
NOTE: There were 13355 observations read from the data set PAPER.COMPUSTAT_EXECUCOMP4.
NOTE: The data set PAPER.CEO_FIRM has 55084 observations and 108 variables.
NOTE: DATA statement used (Total process time):
      real time            0.39 seconds
      user cpu time        0.06 seconds
      system cpu time      0.10 seconds
      memory               5286.21k
      OS Memory            44216.00k
      Timestamp            03/30/2018 02:07:03 PM
      Step Count           211  Switch Count  6
      Page Faults          0
      Page Reclaims        869
      Page Swaps           0
      Voluntary Context Switches    3404
      Involuntary Context Switches  20
      Block Input Operations        56576
      Block Output Operations       224008
Shouldn't it only have as many observations as the smaller dataset, if it's a 1-to-1 match for each CEO/annual_return? I highlighted the relevant section of the log.
Let me know if you have any ideas!
-SAStuck
If you merge two files by some key variable and the resulting dataset has more observations than the larger input file, then there must be key values that are present in only one of the files.
So if you merge A and B by ID, where A has 10 obs and B has 8 obs, and the result has more than 10 obs, then there must be some values of ID that appear in only one of the two inputs.
If you want to figure out which input dataset is contributing data to the merge, you can use the IN= dataset options to set flag variables.
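For example, here is a tiny sketch with made-up datasets A and B showing how a one-to-one match can still produce more rows than either input when some IDs appear in only one file:

```sas
* A has IDs 1-10, B has IDs 5-12; only IDs 5-10 overlap;
data a;
  do id=1 to 10; output; end;
run;

data b;
  do id=5 to 12; output; end;
run;

* Match-merge by ID: the result has 12 obs, one per distinct ID,
  which is more than either input (10 and 8);
data ab;
  merge a b;
  by id;
run;
```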
DATA paper.ceo_firm ;
length ticker $5 source $10;
MERGE
paper.CSRP_annual_returns2 (in=in1)
paper.compustat_execucomp4 (in=in2)
;
BY ticker;
if in1 and in2 then source='BOTH';
else if in1 then source='ANNUAL';
else source='EXECCOMP';
RUN;
proc freq ;
tables source;
run;
Thanks for the reply. Here is the code:
*Merge CEO data and firm data;
DATA paper.ceo_firm;
  length ticker $5;
  MERGE paper.CSRP_annual_returns paper.compustat_execucomp4;
  BY ticker;
RUN;
I found this code, which may have worked:
*MERGE statement has more than one data set with repeats of BY values;
proc sort data=paper.compustat_execucomp4;
by ticker;
run;
proc sort data=paper.CSRP_annual_returns nodupkey;
by ticker;
run;
data want;
merge paper.compustat_execucomp4 (in=in1) paper.CSRP_annual_returns;
by ticker;
if in1;
run;
The new ceo_firm dataset has 10,741 rows, which seems more realistic.
What do you think?
Yes, that would eliminate the "more than one record" warning. The only question is whether you dropped the correct records or, if not, whether you can retrieve the original file to find out. You used:
proc sort data=paper.CSRP_annual_returns nodupkey;
  by ticker;
run;
which will definitely leave only one record per ticker, but you did it in place, overwriting the original (?) file.
Art, CEO, AnalystFinder.com
right . . . recommendations?
Hopefully, you still have the original file.
If so, I would create a file that only contains the duplicates and check each duplicate to see if they are different and, if they are, which one to keep.
Only getting the duplicate records should be easy. e.g.:
proc sort data=paper.CSRP_annual_returns;
  by ticker;
run;

data test;
  set paper.CSRP_annual_returns;
  by ticker;
  if not (first.ticker and last.ticker);
run;
Art, CEO, AnalystFinder.com
I am happy to try this. Would you just mind explaining this portion so that I understand?
data test;
  set paper.CSRP_annual_returns;
  if not (first.ticker and last.ticker);
run;
Additionally, how do I make sure not to overwrite the original dataset again?
1. How not to replace a file during a sort: simply use the out= option when running the sort. I.e., instead of sorting the file in place, use something like:
proc sort data=paper.CSRP_annual_returns out=paper.CSRP_annual_returns2 nodupkey;
  by ticker;
run;
2. My bad! The code should have included a by statement. e.g.
data test;
  set paper.CSRP_annual_returns;
  by ticker;
  if not (first.ticker and last.ticker);
run;
That way, first.ticker and last.ticker will both equal 1 only when a ticker has no duplicates. Thus you only want to look at/review the records that don't meet that condition.
Art, CEO, AnalystFinder.com
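To make the first./last. logic concrete, here is a toy illustration with made-up data (the ticker values are hypothetical):

```sas
data have;
  input ticker $ year;
datalines;
AAA 2001
AAA 2002
BBB 2001
;

data dups;
  set have;
  by ticker;
  * first.ticker and last.ticker are both 1 only for BBB,
    the ticker with a single record, so only the two
    duplicate AAA rows survive this subsetting IF;
  if not (first.ticker and last.ticker);
run;
```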
I ran the following code:
*MERGE statement has more than one data set with repeats of BY values;
proc sort data=paper.CSRP_annual_returns out=paper.CSRP_annual_returns2;
  by ticker;
run;

proc sort data=paper.CSRP_annual_returns out=paper.CSRP_annual_returns2 nodupkey;
  by ticker;
run;

*Merge CEO data and firm data;
DATA paper.ceo_firm;
  length ticker $5;
  MERGE paper.CSRP_annual_returns paper.compustat_execucomp4;
  BY ticker;
RUN;
And now, as you will see in the log, the by variable issue is persisting. Any ideas? Maybe I didn't implement the code properly?
I don't know why you sorted the file twice but, regardless, you used the wrong file in your DATA step merge.
Didn't you mean to use?:
MERGE paper.CSRP_annual_returns2 paper.compustat_execucomp4;
Art, CEO, AnalystFinder.com
Thanks, Art.
Here's the update:
*MERGE statement has more than one data set with repeats of BY values;
proc sort data=paper.CSRP_annual_returns out=paper.CSRP_annual_returns2 nodupkey;
  by ticker;
run;

*Merge CEO data and firm data;
DATA paper.ceo_firm;
  length ticker $5;
  MERGE paper.CSRP_annual_returns2 paper.compustat_execucomp4;
  BY ticker;
RUN;
And the log:
I don't know what you have in execomp4 (I probably have the file name wrong .. the file that you DIDN'T run the nodupkey on), but was your annual file the one that contained yearly averages? Unless you're just analyzing one year, I don't think you want to unduplicate that file.
I think your duplicate problem is in your execomp4 file.
Art, CEO, AnalystFinder.com
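One way to test that without touching the original file (assuming the duplicates really are in execucomp4) is PROC SORT's DUPOUT= option, which writes the records that NODUPKEY would discard to a separate dataset:

```sas
* Leave paper.compustat_execucomp4 intact: write one row per ticker
  to _nodup and every discarded duplicate to _dups;
proc sort data=paper.compustat_execucomp4 out=_nodup dupout=_dups nodupkey;
  by ticker;
run;

* If _dups has any observations, execucomp4 has repeated tickers;
proc print data=_dups (obs=20);
run;
```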
Thanks @Tom for the very clear and useful response. You made it very easy to understand what is going on here.
Here is the output from the proc freq:
source     Frequency   Percent   Cumulative Frequency   Cumulative Percent
ANNUAL          8461     38.80                   8461                38.80
BOTH           12650     58.01                  21111                96.81
EXECCOMP         696      3.19                  21807               100.00
8461 + 696 = 9157, so 12650 - 9157 = 3493 are coming from one of these datasets. How do we know which one? Also, there are 21,807 observations in the ceo_firm dataset. How does this tie in to the numbers above?
Here's the log from the merge statement:
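One hedged way to see which input still repeats tickers (dataset names as used earlier in the thread) is to count rows per ticker in each input, e.g. with PROC SQL:

```sas
* List any tickers that occur more than once in each input dataset;
proc sql;
  select ticker, count(*) as n
    from paper.CSRP_annual_returns2
    group by ticker
    having count(*) > 1;

  select ticker, count(*) as n
    from paper.compustat_execucomp4
    group by ticker
    having count(*) > 1;
quit;
```

Whichever query returns rows is the dataset still contributing the repeated BY values.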