
NOTE: MERGE statement has more than one data set with repeats of by values

Frequent Contributor
Posts: 124

Hi All,

I am looking to merge two datasets and I get the "NOTE: MERGE statement has more than one data set with repeats of BY values" message in my log. Also, the number of observations in my final merged dataset is increased; I need exactly the number of observations in the 1st input dataset.

I have typed an example of input datasets 1 and 2 and my final dataset below. The merging has to be done using the zipcode variable. In the 1st dataset, more than one UniqueID has the same zip code. In the 2nd dataset, more than one zip code has the same R_value, and input dataset 2 has more observations than input dataset 1. The zipcode field is unique in input dataset 2. I need the R_value for the zip code of each UniqueID in the 1st dataset. I used the MERGE statement with the in=a etc. syntax, and it gives me more observations than expected along with the note above. Please advise an appropriate merge statement and/or SQL join.

Thanks.

Input dataset 1:

UniqueID  zipcode
001       30313
002       30313
003       30002
004       30002
005       30003
006       30004
007       30004

Input dataset 2:

zipcode  R_value
30313    1
30002    1
30003    2
30004    2
56006    3
57000    4
67012    5
65789    6
56890    6
76890    7

Final dataset:

UniqueID  zipcode  R_value
001       30313    1
002       30313    1
003       30002    1
004       30002    1
005       30003    2
006       30004    2
007       30004    2


Accepted Solutions
Solution
02-01-2015 04:48 PM
PROC Star
Posts: 7,471

Re: NOTE: MERGE statement has more than one data set with repeats of by values

I agree with the suggestions above, but the problem you confronted was stated in the note shown in your log:

dataset2 has at least two records for one of your zip codes. To achieve what you want, whether using SQL or a DATA step merge, you need to get rid of (or at least ignore) the duplicates in dataset2.

e.g.:

data dataset1;
  input UniqueID $ zipcode;
  cards;
001 30313
002 30313
003 30002
004 30002
005 30003
006 30004
007 30004
;

data dataset2;
  input zipcode R_value;
  cards;
30313 1
30002 1
30002 1
30003 2
30004 2
56006 3
57000 4
67012 5
65789 6
56890 6
76890 7
;

proc sort data=dataset1;
  by zipcode;
run;

proc sort data=dataset2 nodupkey;
  by zipcode;
run;

data want;
  merge dataset1 (in=in1) dataset2;
  by zipcode;
  if in1;
run;
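
Since the question also asks about an SQL join: here is a minimal PROC SQL sketch of the same lookup, assuming the sample data above. The select distinct subquery collapses the duplicate zipcode row in dataset2 before the left join, and the output name want_sql is just illustrative.

proc sql;
  /* Left join dataset1 to a de-duplicated view of dataset2, so   */
  /* every UniqueID keeps exactly one row in the result.          */
  create table want_sql as
  select a.UniqueID,
         a.zipcode,
         b.R_value
  from dataset1 as a
  left join (select distinct zipcode, R_value from dataset2) as b
    on a.zipcode = b.zipcode;
quit;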



All Replies
Super User
Posts: 19,789

Re: NOTE: MERGE statement has more than one data set with repeats of by values

It looks like a table lookup.

A left join on zipcode would work, or a PROC FORMAT lookup.

I'm a fan of PROC FORMAT.

See the examples in this paper:

http://www2.sas.com/proceedings/sugi30/001-30.pdf
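
For illustration only (this sketch is not from the original reply), the PROC FORMAT approach could look roughly like this with the sample data in this thread; the names zipfmt, zip_unique, and want_fmt are made up for the example:

/* Make sure each zipcode maps to a single R_value first. */
proc sort data=dataset2 out=zip_unique nodupkey;
  by zipcode;
run;

/* Build a CNTLIN dataset: start = zipcode, label = R_value as text. */
data zipfmt;
  set zip_unique;
  retain fmtname 'zipfmt' type 'n';
  start = zipcode;
  label = strip(put(R_value, best12.));
  keep fmtname type start label;
run;

proc format cntlin=zipfmt;
run;

/* Look up R_value without sorting or merging dataset1. */
data want_fmt;
  set dataset1;
  R_value = input(put(zipcode, zipfmt.), best12.);
run;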

Frequent Contributor
Posts: 124

Re: NOTE: MERGE statement has more than one data set with repeats of by values

Thanks all for your replies. Reeza, I am going to try PROC FORMAT to create a lookup table and see if that works. Arthur, the zip codes are unique in the second dataset; you accidentally included 30002 twice in your code.

PROC Star
Posts: 7,471

Re: NOTE: MERGE statement has more than one data set with repeats of by values

dr2014: that was not an accident, and I don't think the zip codes are unique in your 2nd dataset. Just one duplicate is all it would take to cause the note you received in your log.

Run a PROC SORT on your 2nd dataset with the NODUPKEY option, using zipcode as the BY variable, and see how many records are eliminated.

If none are eliminated, I don't think you'll get the note in your log when you merge the two files.
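
A minimal sketch of that check, assuming the dataset and variable names used in this thread (the OUT= name dataset2_nodup is just an example so the original data is left untouched):

proc sort data=dataset2 out=dataset2_nodup nodupkey;
  by zipcode;
run;
/* The log then reports how many observations with duplicate key  */
/* values were deleted; anything above zero means zipcode is not  */
/* unique in dataset2.                                             */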

Frequent Contributor
Posts: 124

Re: NOTE: MERGE statement has more than one data set with repeats of by values

@Arthur Tabachneck, you were right: I did have duplicate zip codes in my 2nd dataset. I didn't understand what you were referring to at the time, but as I understood the data better I realized it. Just wanted to let you know.

Occasional Contributor
Posts: 15

Re: NOTE: MERGE statement has more than one data set with repeats of by values

You have to have duplicate zip codes in dataset 2 for you to get that message.

You even state that

"The 2nd dataset, more than one zip code has same R_value"

You should check your data.

Use a SQL left join or clean up dataset 2 and use a regular merge.

Super User
Posts: 5,428

Re: NOTE: MERGE statement has more than one data set with repeats of by values

From the message you get, zipcode is not unique in dataset 2. Try to apply a unique index, a primary key constraint, or a sort with NODUPKEY to verify that.

For the choice of lookup technique: if you don't have any performance constraints, use the technique you are most comfortable with. For me, SQL is quite simple, and many people have that competence.
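
As an illustrative sketch (not from the reply above), the unique-index check could be done in PROC SQL; if dataset2 contains duplicate zipcodes, this step fails with an error, which confirms the key is not unique:

proc sql;
  /* For a simple index, the index name must match the column name. */
  create unique index zipcode on work.dataset2 (zipcode);
quit;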

Data never sleeps