ak2011
Fluorite | Level 6
Hi, sorry, my first message on this topic contained an error: I ran proc sort data=d3 twice (I forgot to run proc sort data=d4). I have made the corrections.
 
I would appreciate it if someone could help me with the SAS code to resolve this problem.
I merged 4 datasets by id (common to all) but received the SAS note "MERGE statement has more than one data set with repeats of BY values". Can anyone help with the correct merge code to avoid this note? My real datasets are huge; these are just subsets.
My ultimate aim is to count the number of Ca case, Pop cont and Ca cont subjects for each status (S and NS).
 
Please find below the 4 datasets. Output attached. Thanks in advance for your help.
 
/*Pollutants*/
data d1;
input id$ 1-5 job 7 id_job$ 9-15 hcl_exp 17 amo_exp 19 bio_exp 21 cla_exp 23;
datalines;
OSa03 4 OSa03_4 1 0 0 0
OSa06 3 OSa06_3 0 1 0 0
OSa13 1 OSa13_1 0 1 1 0
OSa13 3 OSa13_3 0 1 1 1
OSa29 2 OSa29_2 0 0 0 1
OSa29 4 OSa29_4 0 1 1 0
OSa30 4 OSa30_4 0 0 1 0
OSa30 1 OSa30_1 1 0 0 0
OSa30 2 OSa30_2 0 1 1 1
OSa54 3 OSa54_3 0 1 0 0
OSa64 3 OSa64_3 0 1 0 0
OSa73 3 OSa73_3 0 0 0 1
OSa74 3 OSa74_3 1 0 0 0
OSa78 3 OSa78_3 0 1 0 0
;
proc sort data=d1; by id; run;

/* Cancer subjects*/
data d2;
input id$ 1-5 lung$ 7-15;
datalines;
OSa01 Pop cont
OSa06 Ca cont
OSa11 Pop cont
OSa13 Ca case
OSa29 Ca cont
OSa30 Ca case
OSa31 Ca cont
OSa54 Pop cont
OSa73 Pop cont
;
proc sort data=d2; by id; run;
/* Exposure level*/
data d3;
/* idchem occupies columns 9-14 in the data lines below */
input id$ 1-5 job 7 idchem 9-14 level 16;
datalines;
OSa03 4 211701 3
OSa06 3 210701 3
OSa13 1 210701 3
OSa13 1 990021 3
OSa13 3 210701 3
OSa13 3 990005 3
OSa13 3 990021 2
OSa29 2 990005 3
OSa29 4 210701 3
OSa30 1 990021 3
OSa30 2 211701 3
OSa30 3 210701 3
OSa30 3 990005 3
OSa30 3 990021 3
OSa54 3 990005 3
OSa64 3 210701 2
OSa74 1 211701 3
OSa78 4 210701 3
OSa78 4 990005 3
OSa78 4 990021 3
;
proc sort data=d3; by id; run;

/* Exposure Duration*/
data d4;
/* idchem occupies columns 7-12 in the data lines below */
input id$ 1-5 idchem 7-12 status$ 14-15 duration 16-18;
datalines;
OSa03 211701 S 6
OSa06 210701 S 9
OSa13 210701 S 37
OSa13 990005 S 5
OSa13 990021 S 37
OSa29 210701 NS 12
OSa29 990005 S 2
OSa30 210701 S 8
OSa30 211701 NS 8
OSa30 990005 S 8
OSa30 990021 S 15
OSa54 210701 NS 14
OSa64 210701 S 15
OSa74 211701 NS 21
OSa78 210701 NS 20
OSa78 990005 S 20
OSa78 990021 S 20
OSa86 990005 S 14
OSa93 210701 S 4
OSa93 990005 S 13
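;
proc sort data=d4; by id; run;

/* Merging d1,d2,d3 and d4*/
data mg4;
merge d1 d2 d3 d4; by id;
run;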

1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72
73 /*Pollutants*/
74 data d1;
75 input id$ 1-5 job 7 id_job$ 9-15 hcl_exp 17 amo_exp 19 bio_exp 21 cla_exp 23;
76 datalines;
 
NOTE: The data set WORK.D1 has 14 observations and 7 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
 
 
91 ;
92 proc sort data=d1; by id; run;
 
NOTE: There were 14 observations read from the data set WORK.D1.
NOTE: The data set WORK.D1 has 14 observations and 7 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
 
 
93
94 /* Cancer subjects*/
95 data d2;
96 input id$ 1-5 lung$ 7-15;
97 datalines;
 
NOTE: The data set WORK.D2 has 9 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
 
 
107 ;
108 proc sort data=d2; by id; run;
 
NOTE: There were 9 observations read from the data set WORK.D2.
NOTE: The data set WORK.D2 has 9 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds
 
 
109 /* Exposure level*/
110 data d3;
111 input id$ 1-5 job 7 idchem 9 level 16;
112 datalines;
 
NOTE: The data set WORK.D3 has 20 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds
 
 
133 ;
134 proc sort data=d3; by id; run;
 
NOTE: There were 20 observations read from the data set WORK.D3.
NOTE: The data set WORK.D3 has 20 observations and 4 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
 
 
135
136 /* Exposure Duration*/
137 data d4;
138 input id$ 1-5 idchem 7 status$ 14-15 duration 16-18;
139 datalines;
 
NOTE: The data set WORK.D4 has 20 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
 
 
160 ;
161
162 proc sort data=d4; by id; run;
 
NOTE: There were 20 observations read from the data set WORK.D4.
NOTE: The data set WORK.D4 has 20 observations and 4 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
 
 
163
164 /* Merging d1,d2,d3 and d4*/
165 data mg4;
166 merge d1 d2 d3 d4; by id;
167 run;
 
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 14 observations read from the data set WORK.D1.
NOTE: There were 9 observations read from the data set WORK.D2.
NOTE: There were 20 observations read from the data set WORK.D3.
NOTE: There were 20 observations read from the data set WORK.D4.
NOTE: The data set WORK.MG4 has 27 observations and 12 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.02 seconds
 
 
168
169 proc print data=mg4;
170 title "Table 1 corrected. Merged datasets(d1,d2,d3,d4)"; run;
NOTE: There were 27 observations read from the data set WORK.MG4.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.54 seconds
cpu time 0.55 seconds
 
 
171
172
173
174 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
186

 
Duggins
Obsidian | Level 7

This occurs any time the variables in your BY statement do not uniquely identify records in two or more of the data sets being merged. For example, OSa13 appears twice in D1 and five times in D3. The NOTE in your log is telling you that these repeats exist because your BY statement includes only ID.
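
One way to see which IDs repeat is to count rows per ID in each data set. This is only a sketch, assuming the D1 and D3 data sets from the question already exist in WORK; the column name n_rows is made up for illustration.

```sas
/* List every id that appears more than once in d1, then in d3. */
/* n_rows is a hypothetical column name used only for display.  */
proc sql;
  title "Repeated ids in d1";
  select id, count(*) as n_rows
    from d1
    group by id
    having count(*) > 1;

  title "Repeated ids in d3";
  select id, count(*) as n_rows
    from d3
    group by id
    having count(*) > 1;
quit;
title;
```

Any ID listed for two or more of the merged data sets is enough to trigger the NOTE.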

When combining the data, you'll get the first OSa13 record from D1 paired with the first OSa13 record from D3, and the second record from D1 paired with the second record from D3. After that, D1 has no more OSa13 records, so SAS keeps reusing the values from its last OSa13 record for the remaining OSa13 records in D3.
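
The pairing behavior described here can be reproduced with a small toy example. The data sets left and right and the variables x and y are invented for illustration and are not part of the original data.

```sas
/* Toy data: id A appears twice in left and three times in right. */
data left;
  input id $ x;
  datalines;
A 1
A 2
;

data right;
  input id $ y;
  datalines;
A 10
A 20
A 30
;

/* Both inputs repeat the BY value, so the log shows the same NOTE. */
data both;
  merge left right;
  by id;
run;

proc print data=both;
run;
```

The result has three rows for A: x=1 with y=10, x=2 with y=20, and then x=2 again with y=30, because left has no third record and its last values are retained.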

Since you have other variables, such as Job, that seem to distinguish records, my advice is to include more BY variables so that you can uniquely determine which records should be joined across your D# data sets. The documentation for the MERGE statement includes a note about how this is handled: https://documentation.sas.com/?docsetId=lestmtsref&docsetTarget=n1i8w2bwu1fn5kn1gpxj18xttbb0.htm&doc...
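
As a rough sketch of that advice, and not necessarily the right join for your real data, the merges can be split so that at most one data set in each MERGE repeats the BY values. It assumes D1 has one row per id and job and D2 has one row per id, as in the posted samples. The names d13, d24 and n_subjects are made up here, and counting distinct ids is only one reading of the stated aim.

```sas
/* d1 is unique by id and job, so this merge is one-to-many. */
proc sort data=d1; by id job; run;
proc sort data=d3; by id job; run;

data d13;
  merge d1 d3;
  by id job;
run;

/* d2 is unique by id, so merging it with d4 is also one-to-many. */
proc sort data=d2; by id; run;
proc sort data=d4; by id; run;

data d24;
  merge d2(in=in2) d4(in=in4);
  by id;
  if in2 and in4;  /* assumption: keep only subjects found in both */
run;

/* Count distinct subjects for each status and lung group. */
proc sql;
  select status, lung, count(distinct id) as n_subjects
    from d24
    group by status, lung;
quit;
```

Neither merge should produce the NOTE on the sample data, because only one data set in each MERGE statement repeats the BY values.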

Duggins
Obsidian | Level 7

No problem! If you have other issues with this, let us know - otherwise you can mark that post as a solution.
