merging problem: ..Repeats of by values(please ignore 1st & 2nd posts)

ak2011 · Posted 02-18-2020 06:09 AM


Hi,
There were errors in my previous 2 posts. Ignore them and consider this one instead.
Could you please help me to resolve this issue? I merged 4 datasets by id(a common variable) 
and had a message "Merge statement has.....repeats of by values". Any with the SAS code to avoid this
statement? I have huge datasets but these ones are just subsets of the larger ones.
My ultimate aim is to count the number of ca case ca cont pop cont in  the status(S and NS) 
variables in the merged dataset Table 1 final(attached pdf).
Thanks in advance for your expert assistance.
ak.

/*Pollutants*/
data d1;
input id$ 1-5 job 7 id_job$ 9-15 hcl_exp 17 amo_exp 19 bio_exp 21 cla_exp 23;
datalines;
OSa03 4 OSa03_4 1 0 0 0
OSa06 3	OSa06_3 0 1 0 0 
OSa13 1	OSa13_1 0 1 1 0
OSa13 3	OSa13_3 0 1 1 1
OSa29 2	OSa29_2 0 0 0 1
OSa29 4	OSa29_4 0 1 1 0
OSa30 4	OSa30_4 0 0 1 0
OSa30 1	OSa30_1 1 0 0 0
OSa30 2	OSa30_2 0 1 1 1
OSa54 3	OSa54_3 0 1 0 0
OSa64 3	OSa64_3 0 1 0 0
OSa73 3	OSa73_3 0 0 0 1
OSa74 3	OSa74_3 1 0 0 0
OSa78 3	OSa78_3 0 1 0 0
;
proc sort data=d1; by id; run;

/*  Cancer subjects*/
data d2;
input id$ 1-5 lung$ 7-15;
datalines;
OSa01 Pop cont
OSa06 Ca cont
OSa11 Pop cont
OSa13 Ca case
OSa29 Ca cont
OSa30 Ca case
OSa31 Ca cont
OSa54 Pop cont
OSa73 Pop cont
;
proc sort data=d2; by id; run;
/* Exposure level*/
data d3;
input id$ 1-5 job 7 idchem 9-15 level 16;
datalines;
OSa03 4 211701 3
OSa06 3	210701 3
OSa13 1	210701 3
OSa13 1	990021 3
OSa13 3	210701 3
OSa13 3	990005 3
OSa13 3	990021 2
OSa29 2	990005 3
OSa29 4	210701 3
OSa30 1 990021 3
OSa30 2	211701 3
OSa30 3	210701 3
OSa30 3	990005 3
OSa30 3	990021 3
OSa54 3	990005 3
OSa64 3	210701 2
OSa74 1 211701 3
OSa78 4	210701 3
OSa78 4	990005 3
OSa78 4	990021 3
;
proc sort data=d3; by id; run;

/* Exposure Duration*/
data d4;
input id$ 1-5 idchem 7-12 status$ 14-15 duration 16-18;
datalines;
OSa03 211701 S 6
OSa06 210701 S 9
OSa13 210701 S 37
OSa13 990005 S 5
OSa13 990021 S 37
OSa29 210701 NS 12
OSa29 990005 S 2
OSa30 210701 S 8
OSa30 211701 NS 8
OSa30 990005 S  8
OSa30 990021 S 15
OSa54 210701 NS 14
OSa64 210701 S 15
OSa74 211701 NS 21
OSa78 210701 NS 20
OSa78 990005 S 20
OSa78 990021 S 20
OSa86 990005 S 14
OSa93 210701 S 4
OSa93 990005 S 13
;

proc sort data=d4; by id; run;

/* Merging d1,d2,d3 and d4*/
data mg4; 
merge d1 d2 d3 d4; by id;
run;

proc print data=mg4;
title "Table 1 final. Merged datasets(d1,d2,d3,d4)"; run;

1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;

72

73 /*Pollutants*/

74 data d1;

75 input id$ 1-5 job 7 id_job$ 9-15 hcl_exp 17 amo_exp 19 bio_exp 21 cla_exp 23;

76 datalines;

NOTE: The data set WORK.D1 has 14 observations and 7 variables.

NOTE: DATA statement used (Total process time):

real time 0.01 seconds

cpu time 0.01 seconds

91 ;

92 proc sort data=d1; by id; run;

NOTE: There were 14 observations read from the data set WORK.D1.

NOTE: The data set WORK.D1 has 14 observations and 7 variables.

NOTE: PROCEDURE SORT used (Total process time):

real time 0.01 seconds

cpu time 0.02 seconds

93

94 /* Cancer subjects*/

95 data d2;

96 input id$ 1-5 lung$ 7-15;

97 datalines;

NOTE: The data set WORK.D2 has 9 observations and 2 variables.

NOTE: DATA statement used (Total process time):

real time 0.01 seconds

cpu time 0.01 seconds

107 ;

108 proc sort data=d2; by id; run;

NOTE: There were 9 observations read from the data set WORK.D2.

NOTE: The data set WORK.D2 has 9 observations and 2 variables.

NOTE: PROCEDURE SORT used (Total process time):

real time 0.01 seconds

cpu time 0.01 seconds

109 /* Exposure level*/

110 data d3;

111 input id$ 1-5 job 7 idchem 9-15 level 16;

112 datalines;

NOTE: The data set WORK.D3 has 20 observations and 4 variables.

NOTE: DATA statement used (Total process time):

real time 0.01 seconds

cpu time 0.01 seconds

133 ;

134 proc sort data=d3; by id; run;

NOTE: There were 20 observations read from the data set WORK.D3.

NOTE: The data set WORK.D3 has 20 observations and 4 variables.

NOTE: PROCEDURE SORT used (Total process time):

real time 0.01 seconds

cpu time 0.01 seconds

135

136 /* Exposure Duration*/

137 data d4;

138 input id$ 1-5 idchem 7-12 status$ 14-15 duration 16-18;

139 datalines;

NOTE: The data set WORK.D4 has 20 observations and 4 variables.

NOTE: DATA statement used (Total process time):

real time 0.01 seconds

cpu time 0.01 seconds

160 ;

161

162 proc sort data=d4; by id; run;

NOTE: There were 20 observations read from the data set WORK.D4.

NOTE: The data set WORK.D4 has 20 observations and 4 variables.

NOTE: PROCEDURE SORT used (Total process time):

real time 0.01 seconds

cpu time 0.01 seconds

163

164 /* Merging d1,d2,d3 and d4*/

165 data mg4;

166 merge d1 d2 d3 d4; by id;

167 run;

NOTE: MERGE statement has more than one data set with repeats of BY values.

NOTE: There were 14 observations read from the data set WORK.D1.

NOTE: There were 9 observations read from the data set WORK.D2.

NOTE: There were 20 observations read from the data set WORK.D3.

NOTE: There were 20 observations read from the data set WORK.D4.

NOTE: The data set WORK.MG4 has 27 observations and 12 variables.

NOTE: DATA statement used (Total process time):

real time 0.04 seconds

cpu time 0.03 seconds

168

169 proc print data=mg4;

170 title "Table 1 final. Merged datasets(d1,d2,d3,d4)"; run;

NOTE: There were 27 observations read from the data set WORK.MG4.

NOTE: PROCEDURE PRINT used (Total process time):

real time 0.51 seconds

cpu time 0.51 seconds

171

172 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;

184

mkeintz · Posted 02-18-2020 11:33 AM

You are merging 2 or more data sets via the MERGE statement in combination with the BY statement. Consider the case when merging only 2 datasets A and B, by variable ID. If, for ID=1, dataset A has 2 obs and dataset B has 4 obs, the result will have 4 obs. Both datasets have repeats of the BY variable. Or it could be, for ID=1 A has 1 obs, B has 2, and for ID=2 it's the opposite (A has 2 obs and B has 1). These cases will generate the messages you report. There is nothing wrong in your program. It's just thet many time a user expects that only one dataset (or maybe no dataset) has repeats of ID.

BUT ... whichever dataset has the shorter BY-group will have its final obs duplicated for all the subsequent matches with the dataset having the longer BY-group. So the question is: what do you want to do about these situations? Do you want to propagate observations from the shorter group, or not? There is a simple way to eliminate that behavior if needed. I.e. if for a given ID N(a)=2 and N(b)=4 you could have 4 obs, but the 3rd and 4th obs could have all missing values for the variables from A, which might be your preference. The issue is probably how you want to count status S and NS.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Tom · Posted 02-18-2020 11:40 AM

Your post is a little to long to follow clearly.

The note you are referencing means that in more than one of the datasets you are merging the BY variables (ID in your case) do NOT uniquely identify the observations.

So how is your data structured? Did you expect all four of those datasets to have only one observation for each ID? Did you expect three of them have that condition? Only in those cases does merging by ID makes sense.

If not then what do you want to do when there are 2 observations for ID=1 in one of the datasets and 3 observations for ID=1 in another?

Is it possible you want to first merge some of the datasets by some combination of variables that will uniquely identify the observations?

ak2011 · Posted 02-18-2020 09:12 PM

Thank you. Yes, I want all four datasets to have only one observation for each id, that is why I merged by id. I thought maybe there might be another approach to handle the situation to avoid the "merge statement.....repeats of by values:. If you have a clue, please let me known.
ak

mkeintz · Posted 02-18-2020 09:46 PM

@ak2011 wrote:
Thank you. Yes, I want all four datasets to have only one observation for each id, that is why I merged by id. I thought maybe there might be another approach to handle the situation to avoid the "merge statement.....repeats of by values:. If you have a clue, please let me known.
ak

If you want only one obs per id in each source dataset, then you have to decide on a rule to determine which observation to keep available for the subsequent merge. If dataset D1 has two records with ID=1, how do you know which record you want? Unless those records are complete duplicates of each other, your choice may depend on the values of the non ID variables. Or it may be that the variables of interest to your task may not change within any repeated ID - in which case you can just keep, say, the first record for each ID. Regardless, once you've implemented such a rule, you will then be able do the merge without getting the warning message.

BTW, your log note doesn't tell you how many datasets have repeats of BY values. Do you know which datasets have this condition?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

merging problem: ..Repeats of by values(please ignore 1st & 2nd posts)

Re: merging problem: ..Repeats of by values(please ignore 1st & 2nd posts)

Re: merging problem: ..Repeats of by values(please ignore 1st & 2nd posts)

Re: merging problem: ..Repeats of by values(please ignore 1st & 2nd posts)

Re: merging problem: ..Repeats of by values(please ignore 1st & 2nd posts)

merging problem: ..Repeats of by values(please ignore 1st & 2nd posts)

Re: merging problem: ..Repeats of by values(please ignore 1st & 2nd posts)

Re: merging problem: ..Repeats of by values(please ignore 1st & 2nd posts)

Re: merging problem: ..Repeats of by values(please ignore 1st & 2nd posts)

Re: merging problem: ..Repeats of by values(please ignore 1st & 2nd posts)

Registration is open

SAS Training: Just a Click Away