Merging problem: Merge statement.....Repeats of by values

ak2011 · Posted 02-18-2020 05:00 AM

I would appreciate if someone could help me with the SAS code to resolve this problem.

I merged 4 datasets by id (common) to all but received a SAS message " Merge statement has more than one datasets with repeats of by values". Any help with the correct merge code to avoid this message( merge statement.......repeats of by values)? I have a huge datasets; these are just sub datasets.

My ultimate aim is to count the number of ca case, pop cont and ca cont for each status (S and NS).

Please find below the 4 datasets. Output attached. Thanks in advance for your help.

/*Pollutants*/
data d1;
input id$ 1-5

 job 7 id_job$ 9-15 hcl_exp 17 amo_exp 19 bio_exp 21 cla_exp 23;
datalines;
OSa03 4 OSa03_4 1 0 0 0
OSa06 3	OSa06_3 0 1 0 0 
OSa13 1	OSa13_1 0 1 1 0
OSa13 3	OSa13_3 0 1 1 1
OSa29 2	OSa29_2 0 0 0 1
OSa29 4	OSa29_4 0 1 1 0
OSa30 4	OSa30_4 0 0 1 0
OSa30 1	OSa30_1 1 0 0 0
OSa30 2	OSa30_2 0 1 1 1
OSa54 3	OSa54_3 0 1 0 0
OSa64 3	OSa64_3 0 1 0 0
OSa73 3	OSa73_3 0 0 0 1
OSa74 3	OSa74_3 1 0 0 0
OSa78 3	OSa78_3 0 1 0 0
;
proc sort data=d1; by id; run;

/*  Cancer subjects*/
data d2;
input id$ 1-5 lung$ 7-15;
datalines;
OSa01 Pop cont
OSa06 Ca cont
OSa11 Pop cont
OSa13 Ca case
OSa29 Ca cont
OSa30 Ca case
OSa31 Ca cont
OSa54 Pop cont
OSa73 Pop cont
;
proc sort data=d2; by id; run;
/* Exposure level*/
data d3;
input id$ 1-5 job 7 idchem 9 level 16;
datalines;
OSa03 4 211701 3
OSa06 3	210701 3
OSa13 1	210701 3
OSa13 1	990021 3
OSa13 3	210701 3
OSa13 3	990005 3
OSa13 3	990021 2
OSa29 2	990005 3
OSa29 4	210701 3
OSa30 1 990021 3
OSa30 2	211701 3
OSa30 3	210701 3
OSa30 3	990005 3
OSa30 3	990021 3
OSa54 3	990005 3
OSa64 3	210701 2
OSa74 1 211701 3
OSa78 4	210701 3
OSa78 4	990005 3
OSa78 4	990021 3
;
proc sort data=d3; by id; run;

/* Exposure Duration*/
data d4;
input id$ 1-5 idchem 7 status$ 14-15 duration 16-18;
datalines;
OSa03 211701 S 6
OSa06 210701 S 9
OSa13 210701 S 37
OSa13 990005 S 5
OSa13 990021 S 37
OSa29 210701 NS 12
OSa29 990005 S 2
OSa30 210701 S 8
OSa30 211701 NS 8
OSa30 990005 S  8
OSa30 990021 S 15
OSa54 210701 NS 14
OSa64 210701 S 15
OSa74 211701 NS 21
OSa78 210701 NS 20
OSa78 990005 S 20
OSa78 990021 S 20
OSa86 990005 S 14
OSa93 210701 S 4
OSa93 990005 S 13
;

1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;

72

73

74 /*Pollutants*/

75 data d1;

76 input id$ 1-5 job 7 id_job$ 9-15 hcl_exp 17 amo_exp 19 bio_exp 21 cla_exp 23;

77 datalines;

NOTE: The data set WORK.D1 has 14 observations and 7 variables.

NOTE: DATA statement used (Total process time):

real time 0.01 seconds

cpu time 0.01 seconds

92 ;

93 proc sort data=d1; by id; run;

NOTE: There were 14 observations read from the data set WORK.D1.

NOTE: The data set WORK.D1 has 14 observations and 7 variables.

NOTE: PROCEDURE SORT used (Total process time):

real time 0.01 seconds

cpu time 0.00 seconds

94

95 /* Cancer subjects*/

96 data d2;

97 input id$ 1-5 lung$ 7-15;

98 datalines;

NOTE: The data set WORK.D2 has 9 observations and 2 variables.

NOTE: DATA statement used (Total process time):

real time 0.01 seconds

cpu time 0.00 seconds

108 ;

109 proc sort data=d2; by id; run;

NOTE: There were 9 observations read from the data set WORK.D2.

NOTE: The data set WORK.D2 has 9 observations and 2 variables.

NOTE: PROCEDURE SORT used (Total process time):

real time 0.00 seconds

cpu time 0.00 seconds

110 /* Exposure level*/

111 data d3;

112 input id$ 1-5 job 7 idchem 9 level 16;

113 datalines;

NOTE: The data set WORK.D3 has 20 observations and 4 variables.

NOTE: DATA statement used (Total process time):

real time 0.00 seconds

cpu time 0.01 seconds

134 ;

135 proc sort data=d3; by id; run;

NOTE: There were 20 observations read from the data set WORK.D3.

NOTE: The data set WORK.D3 has 20 observations and 4 variables.

NOTE: PROCEDURE SORT used (Total process time):

real time 0.00 seconds

cpu time 0.00 seconds

136

137 /* Exposure Duration*/

138 data d4;

139 input id$ 1-5 idchem 7 status$ 14-15 duration 16-18;

140 datalines;

NOTE: The data set WORK.D4 has 20 observations and 4 variables.

NOTE: DATA statement used (Total process time):

real time 0.01 seconds

cpu time 0.00 seconds

161 ;

162

163 proc sort data=d3; by id; run;

NOTE: Input data set is already sorted, no sorting done.

NOTE: PROCEDURE SORT used (Total process time):

real time 0.00 seconds

cpu time 0.00 seconds

164

165 /* Merging d1,d2,d3 and d4*/

166 data mg4;

167 merge d1 d2 d3 d4; by id;

168 run;

NOTE: MERGE statement has more than one data set with repeats of BY values.

NOTE: There were 14 observations read from the data set WORK.D1.

NOTE: There were 9 observations read from the data set WORK.D2.

NOTE: There were 20 observations read from the data set WORK.D3.

NOTE: There were 20 observations read from the data set WORK.D4.

NOTE: The data set WORK.MG4 has 27 observations and 12 variables.

NOTE: DATA statement used (Total process time):

real time 0.02 seconds

cpu time 0.02 seconds

169

170 proc print data=mg4;

171 title "Table 1. Merged datasets(d1,d2,d3,d4)"; run;

NOTE: There were 27 observations read from the data set WORK.MG4.

NOTE: PROCEDURE PRINT used (Total process time):

real time 0.34 seconds

cpu time 0.34 seconds

172

173 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;

185

1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;

72

73

74 /*Pollutants*/

75 data d1;

76 input id$ 1-5 job 7 id_job$ 9-15 hcl_exp 17 amo_exp 19 bio_exp 21 cla_exp 23;

77 datalines;

NOTE: The data set WORK.D1 has 14 observations and 7 variables.

NOTE: DATA statement used (Total process time):

real time 0.01 seconds

cpu time 0.01 seconds

92 ;

93 proc sort data=d1; by id; run;

NOTE: There were 14 observations read from the data set WORK.D1.

NOTE: The data set WORK.D1 has 14 observations and 7 variables.

NOTE: PROCEDURE SORT used (Total process time):

real time 0.01 seconds

cpu time 0.00 seconds

94

95 /* Cancer subjects*/

96 data d2;

97 input id$ 1-5 lung$ 7-15;

98 datalines;

NOTE: The data set WORK.D2 has 9 observations and 2 variables.

NOTE: DATA statement used (Total process time):

real time 0.01 seconds

cpu time 0.00 seconds

108 ;

109 proc sort data=d2; by id; run;

NOTE: There were 9 observations read from the data set WORK.D2.

NOTE: The data set WORK.D2 has 9 observations and 2 variables.

NOTE: PROCEDURE SORT used (Total process time):

real time 0.00 seconds

cpu time 0.00 seconds

110 /* Exposure level*/

111 data d3;

112 input id$ 1-5 job 7 idchem 9 level 16;

113 datalines;

NOTE: The data set WORK.D3 has 20 observations and 4 variables.

NOTE: DATA statement used (Total process time):

real time 0.00 seconds

cpu time 0.01 seconds

134 ;

135 proc sort data=d3; by id; run;

NOTE: There were 20 observations read from the data set WORK.D3.

NOTE: The data set WORK.D3 has 20 observations and 4 variables.

NOTE: PROCEDURE SORT used (Total process time):

real time 0.00 seconds

cpu time 0.00 seconds

136

137 /* Exposure Duration*/

138 data d4;

139 input id$ 1-5 idchem 7 status$ 14-15 duration 16-18;

140 datalines;

NOTE: The data set WORK.D4 has 20 observations and 4 variables.

NOTE: DATA statement used (Total process time):

real time 0.01 seconds

cpu time 0.00 seconds

161 ;

162

163 proc sort data=d3; by id; run;

NOTE: Input data set is already sorted, no sorting done.

NOTE: PROCEDURE SORT used (Total process time):

real time 0.00 seconds

cpu time 0.00 seconds

164

165 /* Merging d1,d2,d3 and d4*/

166 data mg4;

167 merge d1 d2 d3 d4; by id;

168 run;

NOTE: MERGE statement has more than one data set with repeats of BY values.

NOTE: There were 14 observations read from the data set WORK.D1.

NOTE: There were 9 observations read from the data set WORK.D2.

NOTE: There were 20 observations read from the data set WORK.D3.

NOTE: There were 20 observations read from the data set WORK.D4.

NOTE: The data set WORK.MG4 has 27 observations and 12 variables.

NOTE: DATA statement used (Total process time):

real time 0.02 seconds

cpu time 0.02 seconds

169

170 proc print data=mg4;

171 title "Table 1. Merged datasets(d1,d2,d3,d4)"; run;

NOTE: There were 27 observations read from the data set WORK.MG4.

NOTE: PROCEDURE PRINT used (Total process time):

real time 0.34 seconds

cpu time 0.34 seconds

172

173 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;

185

Shmuel · Posted 02-19-2020 04:00 PM

The IDs are not unique in your data sets. This is the reason you get the message

MERGE statement has more than one data set with repeats of BY values

In such case there is different approach of SAS base and SQL.

Are there another variables beyond ID to make it unique?

otherwise use SQL full join to merge the data sets.

Merging problem: Merge statement.....Repeats of by values

Re: Merging problem: Merge statement.....Repeats of by values

Ready to join fellow brilliant minds for the SAS Hackathon?

Click image to register for webinar

Classroom Training Available!