Hi,
There were errors in my previous 2 posts. Ignore them and consider this one instead.
Could you please help me to resolve this issue? I merged 4 datasets by id(a common variable)
and had a message "Merge statement has.....repeats of by values". Any with the SAS code to avoid this
statement? I have huge datasets but these ones are just subsets of the larger ones.
My ultimate aim is to count the number of ca case ca cont pop cont in the status(S and NS)
variables in the merged dataset Table 1 final(attached pdf).
Thanks in advance for your expert assistance.
ak.
/*Pollutants*/
data d1;
input id$ 1-5 job 7 id_job$ 9-15 hcl_exp 17 amo_exp 19 bio_exp 21 cla_exp 23;
datalines;
OSa03 4 OSa03_4 1 0 0 0
OSa06 3 OSa06_3 0 1 0 0
OSa13 1 OSa13_1 0 1 1 0
OSa13 3 OSa13_3 0 1 1 1
OSa29 2 OSa29_2 0 0 0 1
OSa29 4 OSa29_4 0 1 1 0
OSa30 4 OSa30_4 0 0 1 0
OSa30 1 OSa30_1 1 0 0 0
OSa30 2 OSa30_2 0 1 1 1
OSa54 3 OSa54_3 0 1 0 0
OSa64 3 OSa64_3 0 1 0 0
OSa73 3 OSa73_3 0 0 0 1
OSa74 3 OSa74_3 1 0 0 0
OSa78 3 OSa78_3 0 1 0 0
;
proc sort data=d1; by id; run;
/* Cancer subjects*/
data d2;
input id$ 1-5 lung$ 7-15;
datalines;
OSa01 Pop cont
OSa06 Ca cont
OSa11 Pop cont
OSa13 Ca case
OSa29 Ca cont
OSa30 Ca case
OSa31 Ca cont
OSa54 Pop cont
OSa73 Pop cont
;
proc sort data=d2; by id; run;
/* Exposure level*/
data d3;
input id$ 1-5 job 7 idchem 9-15 level 16;
datalines;
OSa03 4 211701 3
OSa06 3 210701 3
OSa13 1 210701 3
OSa13 1 990021 3
OSa13 3 210701 3
OSa13 3 990005 3
OSa13 3 990021 2
OSa29 2 990005 3
OSa29 4 210701 3
OSa30 1 990021 3
OSa30 2 211701 3
OSa30 3 210701 3
OSa30 3 990005 3
OSa30 3 990021 3
OSa54 3 990005 3
OSa64 3 210701 2
OSa74 1 211701 3
OSa78 4 210701 3
OSa78 4 990005 3
OSa78 4 990021 3
;
proc sort data=d3; by id; run;
/* Exposure Duration*/
data d4;
input id$ 1-5 idchem 7-12 status$ 14-15 duration 16-18;
datalines;
OSa03 211701 S 6
OSa06 210701 S 9
OSa13 210701 S 37
OSa13 990005 S 5
OSa13 990021 S 37
OSa29 210701 NS 12
OSa29 990005 S 2
OSa30 210701 S 8
OSa30 211701 NS 8
OSa30 990005 S 8
OSa30 990021 S 15
OSa54 210701 NS 14
OSa64 210701 S 15
OSa74 211701 NS 21
OSa78 210701 NS 20
OSa78 990005 S 20
OSa78 990021 S 20
OSa86 990005 S 14
OSa93 210701 S 4
OSa93 990005 S 13
;
proc sort data=d4; by id; run;
/* Merging d1,d2,d3 and d4*/
data mg4;
merge d1 d2 d3 d4; by id;
run;
proc print data=mg4;
title "Table 1 final. Merged datasets(d1,d2,d3,d4)"; run;
You are merging 2 or more data sets via the MERGE statement in combination with the BY statement. Consider the case when merging only 2 datasets A and B, by variable ID. If, for ID=1, dataset A has 2 obs and dataset B has 4 obs, the result will have 4 obs. Both datasets have repeats of the BY variable. Or it could be, for ID=1 A has 1 obs, B has 2, and for ID=2 it's the opposite (A has 2 obs and B has 1). These cases will generate the messages you report. There is nothing wrong in your program. It's just thet many time a user expects that only one dataset (or maybe no dataset) has repeats of ID.
BUT ... whichever dataset has the shorter BY-group will have its final obs duplicated for all the subsequent matches with the dataset having the longer BY-group. So the question is: what do you want to do about these situations? Do you want to propagate observations from the shorter group, or not? There is a simple way to eliminate that behavior if needed. I.e. if for a given ID N(a)=2 and N(b)=4 you could have 4 obs, but the 3rd and 4th obs could have all missing values for the variables from A, which might be your preference. The issue is probably how you want to count status S and NS.
Your post is a little to long to follow clearly.
The note you are referencing means that in more than one of the datasets you are merging the BY variables (ID in your case) do NOT uniquely identify the observations.
So how is your data structured? Did you expect all four of those datasets to have only one observation for each ID? Did you expect three of them have that condition? Only in those cases does merging by ID makes sense.
If not then what do you want to do when there are 2 observations for ID=1 in one of the datasets and 3 observations for ID=1 in another?
Is it possible you want to first merge some of the datasets by some combination of variables that will uniquely identify the observations?
@ak2011 wrote:
Thank you. Yes, I want all four datasets to have only one observation for each id, that is why I merged by id. I thought maybe there might be another approach to handle the situation to avoid the "merge statement.....repeats of by values:. If you have a clue, please let me known.
ak
If you want only one obs per id in each source dataset, then you have to decide on a rule to determine which observation to keep available for the subsequent merge. If dataset D1 has two records with ID=1, how do you know which record you want? Unless those records are complete duplicates of each other, your choice may depend on the values of the non ID variables. Or it may be that the variables of interest to your task may not change within any repeated ID - in which case you can just keep, say, the first record for each ID. Regardless, once you've implemented such a rule, you will then be able do the merge without getting the warning message.
BTW, your log note doesn't tell you how many datasets have repeats of BY values. Do you know which datasets have this condition?
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.