Hi All,
I am trying to merge two datasets, and I get the "NOTE: MERGE statement has more than one data set with repeats of BY values" in my log. Also, the number of observations in my final merged dataset is larger than it should be; I need exactly the number of observations in the 1st input dataset.
I have typed an example of input datasets 1 and 2 and my final dataset below. The merging has to be done using the zipcode variable. In the 1st dataset, more than one UniqueID has the same zip code. In the 2nd dataset, more than one zip code has the same R_value, and input dataset 2 has more observations than dataset 1. The zipcode field is unique in input dataset 2. I need the R_value of the zip code for each UniqueID in the 1st dataset. I used the MERGE statement with the in=a syntax, and it gives me extra observations along with the note above. Please advise an appropriate MERGE statement and/or SQL join.
Thanks.
Input dataset 1:
UniqueID zipcode
001 30313
002 30313
003 30002
004 30002
005 30003
006 30004
007 30004
Input dataset 2:
zipcode R_value
30313 1
30002 1
30003 2
30004 2
56006 3
57000 4
67012 5
65789 6
56890 6
76890 7
Final dataset:
UniqueID zipcode R_value
001 30313 1
002 30313 1
003 30002 1
004 30002 1
005 30003 2
006 30004 2
007 30004 2
I agree, but the problem you confronted is spelled out in the note shown in your log:
dataset2 has at least two records for one of your zip codes. To achieve what you want, whether using SQL or a data step merge, you need to get rid of (or at least ignore) the duplicates in dataset2.
e.g.:
data dataset1;
input UniqueID $ zipcode;
cards;
001 30313
002 30313
003 30002
004 30002
005 30003
006 30004
007 30004
;
data dataset2;
input zipcode R_value;
cards;
30313 1
30002 1
30002 1
30003 2
30004 2
56006 3
57000 4
67012 5
65789 6
56890 6
76890 7
;
proc sort data=dataset1;
  by zipcode;
run;

/* nodupkey keeps only the first record for each zipcode, dropping the duplicates */
proc sort data=dataset2 nodupkey;
  by zipcode;
run;

/* if in1 keeps only the observations that come from dataset1 */
data want;
  merge dataset1 (in=in1) dataset2;
  by zipcode;
  if in1;
run;
It looks like a table lookup.
A left join on zipcode would work, or a PROC FORMAT lookup.
I'm a fan of PROC FORMAT.
See the examples in this paper:
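For illustration, here is a rough sketch of both approaches, reusing the dataset1 and dataset2 names from the data steps above; the want_sql, dataset2_nodup, zipfmt, zipr, and want_fmt names are made up for this sketch.

proc sql;
  /* select distinct guards against exact duplicate rows in dataset2;       */
  /* if one zipcode carried two different R_values, you would still have to */
  /* decide which record to keep                                            */
  create table want_sql as
  select a.UniqueID, a.zipcode, b.R_value
    from dataset1 as a
         left join (select distinct zipcode, R_value from dataset2) as b
    on a.zipcode = b.zipcode;
quit;

/* PROC FORMAT lookup: turn dataset2 into a numeric format, then apply it with PUT */
proc sort data=dataset2 out=dataset2_nodup nodupkey;
  by zipcode;
run;

data zipfmt;
  set dataset2_nodup;
  retain fmtname 'zipr' type 'n';
  start = zipcode;                        /* value to look up            */
  label = strip(put(R_value, best.));     /* format labels are character */
  keep fmtname type start label;
run;

proc format cntlin=zipfmt;
run;

data want_fmt;
  set dataset1;
  R_value = input(put(zipcode, zipr.), 8.);
run;

If some zip codes in dataset1 might be missing from dataset2, add an OTHER range to the format so the lookup returns a clean missing value instead of the raw zip code.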
Thanks all for your replies. Reeza, I am going to try PROC FORMAT to create a lookup table and see if that works. Arthur, the zip codes are unique in the 2nd input dataset; you accidentally included 30002 twice in your code.
dr2014: that was not an accident, and I don't think the zip codes are unique in your 2nd dataset. Having just one duplicate is all it would take to cause the note you received in your log.
Run a proc sort on your 2nd dataset with the nodupkey option, using zipcode as the by variable, and see how many records are eliminated.
If none are eliminated, I don't think you'll get the note in your log when you merge the two files.
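A rough sketch of that check, assuming your 2nd dataset is named dataset2 as in the earlier data steps (dataset2_check is a made-up output name):

proc sort data=dataset2 out=dataset2_check nodupkey;
  by zipcode;
run;
/* the log then reports how many observations with duplicate key values were deleted */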
@Arthur Tabachneck: you were right. I did have duplicate zip codes in my 2nd dataset. I didn't understand what you were referring to at the time, but as I understood the data better I realized it. Just wanted to let you know.
You have to have duplicate zip codes in dataset 2 for you to get that message.
You even state that
"The 2nd dataset, more than one zip code has same R_value"
You should check your data.
Use a SQL left join or clean up dataset 2 and use a regular merge.
From the message you get, zipcode is not unique in DS 2. Try to apply a unique index/primary key constraint/sort with nodupkey to verify that.
For the choice of lookup technique: if you don't have any performance constraints, use the technique you are most comfortable with. For me, SQL is quite simple, and many people have that competence.
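For illustration, a rough sketch of the integrity-constraint check, assuming the 2nd dataset is named dataset2 (the constraint name zip_unique is made up):

proc sql;
  /* adding the constraint fails with an error if zipcode is not unique, */
  /* which confirms that DS 2 contains duplicate zip codes               */
  alter table dataset2
    add constraint zip_unique unique(zipcode);
quit;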