Hi All,
I am trying to merge two datasets, and I get the "NOTE: MERGE statement has more than one data set with repeats of BY values" in my log. Also, the number of observations in my final merged dataset is larger than it should be; I need exactly the number of observations in the 1st input dataset.
I have typed an example of input datasets 1 and 2 and my final dataset below. The merging has to be done using the zipcode variable. In the 1st dataset, more than one UniqueID has the same zip code. In the 2nd dataset, more than one zip code has the same R_value, and input dataset 2 has more observations than dataset 1. The zipcode field is unique in input dataset 2. I need the R_value of the zip code for each UniqueID in the 1st dataset. I used the MERGE statement with the in=a syntax, and it gives me extra observations along with the note above. Please advise an appropriate MERGE statement and/or SQL join.
Thanks.
Input dataset 1:
UniqueID zipcode
001 30313
002 30313
003 30002
004 30002
005 30003
006 30004
007 30004
Input dataset 2:
zipcode R_value
30313 1
30002 1
30003 2
30004 2
56006 3
57000 4
67012 5
65789 6
56890 6
76890 7
Final dataset:
UniqueID zipcode R_value
001 30313 1
002 30313 1
003 30002 1
004 30002 1
005 30003 2
006 30004 2
007 30004 2
I agree, but the problem you confronted is spelled out in the note shown in your log:
dataset2 has at least two records for one of your zip codes. To achieve what you want, whether using SQL or a data step merge, you need to get rid of (or at least ignore) the duplicates in dataset2.
e.g.:
data dataset1;
input UniqueID $ zipcode;
cards;
001 30313
002 30313
003 30002
004 30002
005 30003
006 30004
007 30004
;
data dataset2;
input zipcode R_value;
cards;
30313 1
30002 1
30002 1
30003 2
30004 2
56006 3
57000 4
67012 5
65789 6
56890 6
76890 7
;
proc sort data=dataset1;
  by zipcode;
run;

/* nodupkey keeps only the first record for each zipcode, dropping the duplicates */
proc sort data=dataset2 nodupkey;
  by zipcode;
run;

/* if in1 keeps only the observations that come from dataset1 */
data want;
  merge dataset1 (in=in1) dataset2;
  by zipcode;
  if in1;
run;
It looks like a table lookup.
A left join on zipcode would work, or a PROC FORMAT lookup.
I'm a fan of PROC FORMAT.
See the examples in this paper:
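For illustration, here is a rough sketch of both approaches, reusing the dataset1 and dataset2 names from the data steps above; the want_sql, dataset2_nodup, zipfmt, zipr, and want_fmt names are made up for this sketch.

proc sql;
  /* select distinct guards against exact duplicate rows in dataset2;       */
  /* if one zipcode carried two different R_values, you would still have to */
  /* decide which record to keep                                            */
  create table want_sql as
  select a.UniqueID, a.zipcode, b.R_value
    from dataset1 as a
         left join (select distinct zipcode, R_value from dataset2) as b
    on a.zipcode = b.zipcode;
quit;

/* PROC FORMAT lookup: turn dataset2 into a numeric format, then apply it with PUT */
proc sort data=dataset2 out=dataset2_nodup nodupkey;
  by zipcode;
run;

data zipfmt;
  set dataset2_nodup;
  retain fmtname 'zipr' type 'n';
  start = zipcode;                        /* value to look up            */
  label = strip(put(R_value, best.));     /* format labels are character */
  keep fmtname type start label;
run;

proc format cntlin=zipfmt;
run;

data want_fmt;
  set dataset1;
  R_value = input(put(zipcode, zipr.), 8.);
run;

If some zip codes in dataset1 might be missing from dataset2, add an OTHER range to the format so the lookup returns a clean missing value instead of the raw zip code.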
Thanks all for your replies. Reeza, I am going to try PROC FORMAT to create a lookup table and see if that works. Arthur, the zip codes are unique in the 2nd input dataset; you accidentally included 30002 twice in your code.
dr2014: that was not an accident, and I don't think the zip codes are unique in your 2nd dataset. Having just one duplicate is all it would take to cause the note you received in your log.
Run a proc sort on your 2nd dataset with the nodupkey option, using zipcode as the by variable, and see how many records are eliminated.
If none are eliminated, I don't think you'll get the note in your log when you merge the two files.
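A rough sketch of that check, assuming your 2nd dataset is named dataset2 as in the earlier data steps (dataset2_check is a made-up output name):

proc sort data=dataset2 out=dataset2_check nodupkey;
  by zipcode;
run;
/* the log then reports how many observations with duplicate key values were deleted */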
@Arthur Tabachneck: you were right. I did have duplicate zip codes in my 2nd dataset. I didn't understand what you were referring to at the time, but as I understood the data better I realized it. Just wanted to let you know.
You have to have duplicate zip codes in dataset 2 for you to get that message.
You even state that
"The 2nd dataset, more than one zip code has same R_value"
You should check your data.
Use a SQL left join or clean up dataset 2 and use a regular merge.
From the message you get, zipcode is not unique in DS 2. Try to apply a unique index/primary key constraint/sort with nodupkey to verify that.
For the choice of lookup technique: if you don't have any performance constraints, use the technique you are most comfortable with. For me, SQL is quite simple, and many people have that competence.
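For illustration, a rough sketch of the integrity-constraint check, assuming the 2nd dataset is named dataset2 (the constraint name zip_unique is made up):

proc sql;
  /* adding the constraint fails with an error if zipcode is not unique, */
  /* which confirms that DS 2 contains duplicate zip codes               */
  alter table dataset2
    add constraint zip_unique unique(zipcode);
quit;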