sastuck
Pyrite | Level 9

Hello all!

 

You will notice from the log that my data set PAPER.COMPUSTAT_EXECUCOMP4 has 13355 observations.

 

NOTE: There were 13678 observations read from the data set PAPER.COMPUSTAT_EXECUCOMP3.

NOTE: There were 4304 observations read from the data set WORK.MULTIPLE_RETURNS.

NOTE: The data set PAPER.COMPUSTAT_EXECUCOMP4 has 13355 observations and 107 variables.

NOTE: DATA statement used (Total process time):
real time 0.14 seconds
user cpu time 0.03 seconds
system cpu time 0.04 seconds
memory 4073.15k
OS Memory 42680.00k
Timestamp 03/30/2018 02:06:59 PM
Step Count 206 Switch Count 2
Page Faults 0
Page Reclaims 575
Page Swaps 0
Voluntary Context Switches 1143
Involuntary Context Switches 2
Block Input Operations 55840
Block Output Operations 54536

And this is the dataset that we merged for the final CEO_FIRM dataset.

 

So how is it possible, as the log shows, that CEO_FIRM, the merged dataset, has 55084 observations?

 

NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 53961 observations read from the data set PAPER.CSRP_ANNUAL_RETURNS.

NOTE: There were 13355 observations read from the data set PAPER.COMPUSTAT_EXECUCOMP4.

NOTE: The data set PAPER.CEO_FIRM has 55084 observations and 108 variables.
NOTE: DATA statement used (Total process time):
real time 0.39 seconds
user cpu time 0.06 seconds
system cpu time 0.10 seconds
memory 5286.21k
OS Memory 44216.00k
Timestamp 03/30/2018 02:07:03 PM
Step Count 211 Switch Count 6
Page Faults 0
Page Reclaims 869
Page Swaps 0
Voluntary Context Switches 3404
Involuntary Context Switches 20
Block Input Operations 56576
Block Output Operations 224008

 

Shouldn't it have at most as many observations as the smaller dataset, if it's a 1-to-1 match for each CEO/annual_return? I highlighted the relevant section of the log.

 

Let me know if you have any ideas!

 

-SAStuck

18 REPLIES
Quentin
Super User
Can you show your code (merge statement)? Depending on how you code it, it could be keeping only the records that match or keeping all records. Also, the note about the MERGE statement having more than one data set with repeats of BY values is definitely a concern, so you will want to dig into that.
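To illustrate the difference, here is a minimal sketch (hypothetical datasets a and b, both already sorted by a key variable id):

* keeps every row from either input, matched where id agrees;
data all_rows;
  merge a (in=ina) b (in=inb);
  by id;
run;

* keeps only rows whose id appears in both inputs;
data matched_only;
  merge a (in=ina) b (in=inb);
  by id;
  if ina and inb;
run;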
sastuck
Pyrite | Level 9

Thanks for the reply. Here is the code:

 

*Merge CEO data and firm data;
DATA paper.ceo_firm;
  length ticker $5;
  MERGE paper.CSRP_annual_returns
        paper.compustat_execucomp4;
  BY ticker;
RUN;
sastuck
Pyrite | Level 9

I found this code, which may have worked:

 

*MERGE statement has more than one data set with repeats of BY values;
proc sort data=paper.compustat_execucomp4;
  by ticker;
run;

proc sort data=paper.CSRP_annual_returns nodupkey;
  by ticker;
run;

data want;
  merge paper.compustat_execucomp4 (in=in1) paper.CSRP_annual_returns;
  by ticker;
  if in1;
run;

The new ceo_firm dataset has 10,741 rows, which seems more realistic.

 

What do you think?

art297
Opal | Level 21

Yes, that would eliminate the "more than one record" note. The only question is whether you dropped the correct records and, if not, whether you can retrieve the original file to find out. You used:

proc sort data=paper.CSRP_annual_returns nodupkey;
  by ticker;
run;

which will definitely leave only one record per ticker, but you did it by replacing the original (?) file.
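If you need to rerun it, PROC SORT's OUT= and DUPOUT= options let you keep the original intact and also capture exactly what NODUPKEY throws away. A sketch (the two output dataset names here are made up):

proc sort data=paper.CSRP_annual_returns
          out=work.annual_returns_dedup
          dupout=work.annual_returns_dropped
          nodupkey;
  by ticker;
run;

work.annual_returns_dedup ends up with one row per ticker, and work.annual_returns_dropped holds every row NODUPKEY deleted, so you can inspect what was lost.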

 

Art, CEO, AnalystFinder.com

 

 

sastuck
Pyrite | Level 9

right . . . recommendations?  

art297
Opal | Level 21

Hopefully, you still have the original file.

 

If so, I would create a file that only contains the duplicates, then check each set of duplicates to see whether the records differ and, if they do, decide which one to keep.

 

Getting only the duplicate records should be easy, e.g.:

proc sort data=paper.CSRP_annual_returns;
 by ticker;
run;

data test;
  set paper.CSRP_annual_returns;
  by ticker;
  if not (first.ticker and last.ticker);
run;
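If you also want a quick count of how many distinct tickers are affected, PROC FREQ's NLEVELS option is one way (a sketch; test is the dataset created just above):

proc freq data=test nlevels;
  tables ticker / noprint;
run;

The NOPRINT option suppresses the frequency table itself, so the output is just the number of distinct ticker values among the duplicated rows.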

Art, CEO, AnalystFinder.com

 

 

sastuck
Pyrite | Level 9

I am happy to try this. Would you just mind explaining this portion so that I understand?

 

data test;
  set paper.CSRP_annual_returns;
  if not (first.ticker and last.ticker);
run;

Additionally, how do I make sure not to overwrite the original dataset again? 

art297
Opal | Level 21

1. How not to replace a file during a sort: simply use the OUT= option when running the sort. i.e., instead of sorting the file in place, use something like:

proc sort data=paper.CSRP_annual_returns
          out=paper.CSRP_annual_returns2
          nodupkey;
  by ticker;
run;

2. My bad! The code should have included a BY statement, e.g.:

data test;
  set paper.CSRP_annual_returns;
  by ticker;
  if not (first.ticker and last.ticker);
run;

That way, first.ticker and last.ticker are both equal to 1 only when a ticker has a single record. The subsetting IF therefore keeps just the observations that fail that condition, i.e., the duplicates you want to look at/review.
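To make that concrete, here is a tiny self-contained demo (hypothetical data): AAA appears once, so first.ticker and last.ticker are both 1 on its row and it is dropped; BBB appears twice, so both of its rows are kept.

data demo;
  input ticker $ ret;
datalines;
AAA 0.10
BBB 0.05
BBB 0.07
;
run;

data dups;
  set demo;
  by ticker;  * demo is already in ticker order;
  if not (first.ticker and last.ticker);
run;

The resulting dups dataset contains only the two BBB rows.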

 

Art, CEO, AnalystFinder.com

 

 

 

 

sastuck
Pyrite | Level 9

I ran the following code:

 

*MERGE statement has more than one data set with repeats of BY values;
proc sort data=paper.CSRP_annual_returns
 out=paper.CSRP_annual_returns2;
 by ticker;
run;

proc sort data=paper.CSRP_annual_returns
               out=paper.CSRP_annual_returns2
               nodupkey;
  by ticker;
run;

*Merge CEO data and firm data; 
DATA paper.ceo_firm ; 
length ticker $5;
MERGE paper.CSRP_annual_returns 
paper.compustat_execucomp4; 
BY ticker; 
RUN;

And now, as you will see in the log, the BY-value issue persists. Any ideas? Maybe I didn't implement the code properly?

 

1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
70
71
72 *MERGE statement has more than one data set with repeats of BY values;
73 proc sort data=paper.CSRP_annual_returns
74 out=paper.CSRP_annual_returns2;
75 by ticker;
76 run;
 
NOTE: There were 53176 observations read from the data set PAPER.CSRP_ANNUAL_RETURNS.
NOTE: The data set PAPER.CSRP_ANNUAL_RETURNS2 has 53176 observations and 3 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.02 seconds
user cpu time 0.02 seconds
system cpu time 0.00 seconds
memory 6129.37k
OS Memory 36708.00k
Timestamp 03/31/2018 12:03:36 AM
Step Count 70 Switch Count 2
Page Faults 0
Page Reclaims 1003
Page Swaps 0
Voluntary Context Switches 56
Involuntary Context Switches 0
Block Input Operations 0
Block Output Operations 2576
 
 
77
78 proc sort data=paper.CSRP_annual_returns
79 out=paper.CSRP_annual_returns2
80 nodupkey;
81 by ticker;
82 run;
 
NOTE: There were 53176 observations read from the data set PAPER.CSRP_ANNUAL_RETURNS.
NOTE: 42661 observations with duplicate key values were deleted.
NOTE: The data set PAPER.CSRP_ANNUAL_RETURNS2 has 10515 observations and 3 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.02 seconds
user cpu time 0.01 seconds
system cpu time 0.00 seconds
memory 6129.37k
OS Memory 36708.00k
Timestamp 03/31/2018 12:03:36 AM
Step Count 71 Switch Count 2
Page Faults 0
Page Reclaims 842
Page Swaps 0
Voluntary Context Switches 48
Involuntary Context Switches 0
Block Input Operations 32
Block Output Operations 528
 
 
83
84
85 *Merge CEO data and firm data;
86 DATA paper.ceo_firm ;
87 length ticker $5;
88 MERGE paper.CSRP_annual_returns
89 paper.compustat_execucomp4;
90 BY ticker;
91 RUN;
 
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 53176 observations read from the data set PAPER.CSRP_ANNUAL_RETURNS.
NOTE: There were 13346 observations read from the data set PAPER.COMPUSTAT_EXECUCOMP4.
NOTE: The data set PAPER.CEO_FIRM has 54361 observations and 108 variables.
NOTE: DATA statement used (Total process time):
real time 0.36 seconds
user cpu time 0.03 seconds
system cpu time 0.12 seconds
memory 5276.21k
OS Memory 37040.00k
Timestamp 03/31/2018 12:03:36 AM
Step Count 72 Switch Count 2
Page Faults 0
Page Reclaims 1005
Page Swaps 0
Voluntary Context Switches 3349
Involuntary Context Switches 3
Block Input Operations 0
Block Output Operations 220936
 
 
92
93 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
105
 
 
Thanks!
 
SAStuck
art297
Opal | Level 21

I don't know why you sorted the file twice but, regardless, you used the wrong file in your DATA step merge.

 

Didn't you mean to use:

MERGE paper.CSRP_annual_returns2
      paper.compustat_execucomp4;

Art, CEO, AnalystFinder.com

 

 

sastuck
Pyrite | Level 9

Thanks, Art.

 

Here's the update:

 

*MERGE statement has more than one data set with repeats of BY values;
proc sort data=paper.CSRP_annual_returns
               out=paper.CSRP_annual_returns2
               nodupkey;
  by ticker;
run;

*Merge CEO data and firm data; 
DATA paper.ceo_firm ; 
length ticker $5;
MERGE paper.CSRP_annual_returns2 
paper.compustat_execucomp4; 
BY ticker; 
RUN;

And the log:

 

1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
70
71 *MERGE statement has more than one data set with repeats of BY values;
72 proc sort data=paper.CSRP_annual_returns
73 out=paper.CSRP_annual_returns2
74 nodupkey;
75 by ticker;
76 run;
 
NOTE: There were 53176 observations read from the data set PAPER.CSRP_ANNUAL_RETURNS.
NOTE: 42661 observations with duplicate key values were deleted.
NOTE: The data set PAPER.CSRP_ANNUAL_RETURNS2 has 10515 observations and 3 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.02 seconds
user cpu time 0.02 seconds
system cpu time 0.00 seconds
memory 6135.43k
OS Memory 36196.00k
Timestamp 03/31/2018 12:37:11 AM
Step Count 77 Switch Count 2
Page Faults 0
Page Reclaims 871
Page Swaps 0
Voluntary Context Switches 43
Involuntary Context Switches 0
Block Input Operations 0
Block Output Operations 528
 
 
77
78 *Merge CEO data and firm data;
79 DATA paper.ceo_firm ;
80 length ticker $5;
81 MERGE paper.CSRP_annual_returns2
82 paper.compustat_execucomp4;
83 BY ticker;
84 RUN;
 
NOTE: There were 10515 observations read from the data set PAPER.CSRP_ANNUAL_RETURNS2.
NOTE: There were 13346 observations read from the data set PAPER.COMPUSTAT_EXECUCOMP4.
NOTE: The data set PAPER.CEO_FIRM has 21807 observations and 108 variables.
NOTE: DATA statement used (Total process time):
real time 0.29 seconds
user cpu time 0.01 seconds
system cpu time 0.06 seconds
memory 4219.71k
OS Memory 35248.00k
Timestamp 03/31/2018 12:37:11 AM
Step Count 78 Switch Count 2
Page Faults 0
Page Reclaims 740
Page Swaps 0
Voluntary Context Switches 1273
Involuntary Context Switches 2
Block Input Operations 576
Block Output Operations 88840
 
 
85
86 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
98
 
 
 
Now, the BY-value issue appears to be gone, but the number of rows in the merged dataset is still larger than that of either input dataset. Suggestions?
art297
Opal | Level 21

I don't know what you have in execomp4 (I probably have the file name wrong ... the file that you DIDN'T run the nodupkey on), but was your annual file the one that contained yearly averages? Unless you're just analyzing one year, I don't think you want to unduplicate that file.

I think your duplicate problem is in your execomp4 file.
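A quick way to check is to count repeated tickers directly (a sketch; the output table name is made up):

proc sql;
  create table work.execomp4_dups as
  select ticker, count(*) as n_rows
  from paper.compustat_execucomp4
  group by ticker
  having count(*) > 1;
quit;

If work.execomp4_dups comes out non-empty, that file also has repeats of the BY value.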

 

Art, CEO, AnalystFinder.com

 

Tom
Super User Tom
Super User

If you merge two files by some key variable and the resulting dataset has more observations than the largest input file, then there are key values that are present in only one of the files.

 

So if you merge A and B by ID, where A has 10 obs and B has 8 obs, and the result has more than 10 obs, then there must be some values of ID that appear in only one of the two inputs.
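Here is a minimal sketch of that situation (hypothetical data): A has IDs 1, 2, and 3, B has IDs 3 and 4, only ID 3 matches, and the merge produces 4 observations, more than either input.

data a;
  input id x;
datalines;
1 10
2 20
3 30
;
run;

data b;
  input id y;
datalines;
3 99
4 88
;
run;

* ids 1, 2, and 4 appear in only one input, so the result;
* has 4 rows even though the larger input has only 3;
data merged;
  merge a b;
  by id;
run;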

 

If you want to figure out which input dataset is contributing data to the merge, you can use the IN= dataset options to set flag variables.

DATA paper.ceo_firm ;
   length ticker $5 source $10;
  MERGE
    paper.CSRP_annual_returns2 (in=in1)
    paper.compustat_execucomp4 (in=in2)
  ;
  BY ticker;
  if in1 and in2 then source='BOTH';
  else if in1 then source='ANNUAL';
  else source='EXECCOMP';
RUN;

proc freq ;
 tables source;
run;
 
sastuck
Pyrite | Level 9

Thanks @Tom for the very clear and useful response. You made it very easy to understand what is going on here.

 

Here is the output from the proc freq:

 

 

source     Frequency   Percent   Cumulative Frequency   Cumulative Percent
ANNUAL         8461     38.80                    8461                38.80
BOTH          12650     58.01                   21111                96.81
EXECCOMP        696      3.19                   21807               100.00

 

 

8461+696=9157, so 12650-9157=3493 are coming from one of these datasets--how do we know which one? Also, there are 21,807 observations in the ceo_firm dataset. How does this tie in to the numbers above? 

 

Here's the log from the merge statement:

 

1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
70
71 *Merge CEO data and firm data;
72 DATA paper.ceo_firm ;
73 length ticker $5 source $10;
74 MERGE
75 paper.CSRP_annual_returns2 (in=in1)
76 paper.compustat_execucomp4 (in=in2)
77 ;
78 BY ticker;
79 if in1 and in2 then source='BOTH';
80 else if in1 then source='ANNUAL';
81 else source='EXECCOMP';
82 RUN;
 
NOTE: There were 10515 observations read from the data set PAPER.CSRP_ANNUAL_RETURNS2.
NOTE: There were 13346 observations read from the data set PAPER.COMPUSTAT_EXECUCOMP4.
NOTE: The data set PAPER.CEO_FIRM has 21807 observations and 109 variables.
NOTE: DATA statement used (Total process time):
real time 0.16 seconds
user cpu time 0.01 seconds
system cpu time 0.04 seconds
memory 4217.15k
OS Memory 33456.00k
Timestamp 03/31/2018 03:53:50 AM
Step Count 35 Switch Count 2
Page Faults 0
Page Reclaims 768
Page Swaps 0
Voluntary Context Switches 638
Involuntary Context Switches 0
Block Input Operations 54880
Block Output Operations 89104
 
 
83
84 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
96
 
 
 
 
 
 Thanks again!

