BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
dr2014
Quartz | Level 8

Hi All,

I am looking to merge two datasets and I get the "NOTE: MERGE statement has more than one data set with repeats of  BY values" in my log, Also, the number of observations in my final merged dataset are
increased. I need the exact number of observations in the 1st input dataset.

I have typed the example of the input datasets 1 and 2 and my final dataset below. The merging has to be done using the zipcode variable. In the 1st dataset more then one uniqueID has the same zip code. The 2nd dataset, more than one zip code has same R_value and there more number of observations in input dataset 2. The zipcode field is unique in the input dataset 2. I need to have the R_value for the zip code for each unique ID in the 1st dataset. I used the merge statement and  in=a etc. syntax and it gives me more number of observations and the note specified above. Please advice an appropriate merge statement and/or SQL join.

Thanks.

Input dataset 1:

UniqueID  zipcode

001           30313

002           30313

003           30002

004           30002

005           30003

006           30004

007           30004

Input dataset 2:

zipcode R_value

30313    1

30002    1

30003    2

30004    2

56006    3

57000    4

67012     5

65789     6

56890     6

76890     7

Finaldataset:

UniqueID  zipcode  R-value

001           30313      1

002           30313      1

003           30002      1

004           30002      1

005           30003      2

006           30004      2

007           30004      2

1 ACCEPTED SOLUTION

Accepted Solutions
art297
Opal | Level 21

I agree with , but the problem you confronted was listed in the note shown in your log:

dataset2 has at least two records for one of your zip codes. To achieve what you want, whether using sql or a datastep merge, you need to get rid of (or at least ignore) the duplicates in dataset2.

e.g.:

data dataset1;

  input UniqueID $  zipcode;

  cards;

001           30313

002           30313

003           30002

004           30002

005           30003

006           30004

007           30004

;

data dataset2;

  input zipcode R_value;

  cards;

30313    1

30002    1

30002    1

30003    2

30004    2

56006    3

57000    4

67012     5

65789     6

56890     6

76890     7

;

proc sort data=dataset1;

  by zipcode;

run;

proc sort data=dataset2 nodupkey;

  by zipcode;

run;

data want;

  merge dataset1 (in=in1) dataset2;

  by zipcode;

  if in1;

run;

View solution in original post

7 REPLIES 7
Reeza
Super User

It looks like a table lookup.

A left join would work joined on UniqueID or a proc format.

I'm a fan of proc format

-See the examples in this paper:

http://www2.sas.com/proceedings/sugi30/001-30.pdf

dr2014
Quartz | Level 8

Thanks all for your replies. Reeza I am going to try proc format to create a look up table and see if that works. Arthur , the zip codes are unique in the second dataset ,you accidentally included 30002 twice in your code.

art297
Opal | Level 21

dr2014: that was not an accident and I don't think zip codes are unique in your 2nd dataset. Having just one instance is all it would take to case the note you received in your log.

Run a proc sort on your 2nd dataset with the nodupkey option, using zipcode as the by variable, and see how many records are eliminated.

If none are eliminated, I don't think you'll get the note in your log when you merge the two files.

dr2014
Quartz | Level 8

@Arthur Tabachneck. you were right. I did have duplicate zip codes in my 2nd dataset. I didn't understand what you were referring to then, but as I understood the data better I realized it. Just wanted to let you know.

BurntDirt
Calcite | Level 5

You have to have duplicate zip codes in dataset 2 for you to get that message.

You even state that

"The 2nd dataset, more than one zip code has same R_value"

You should check your data.

Use a SQL left join or clean up dataset 2 and use a regular merge.

art297
Opal | Level 21

I agree with , but the problem you confronted was listed in the note shown in your log:

dataset2 has at least two records for one of your zip codes. To achieve what you want, whether using sql or a datastep merge, you need to get rid of (or at least ignore) the duplicates in dataset2.

e.g.:

data dataset1;

  input UniqueID $  zipcode;

  cards;

001           30313

002           30313

003           30002

004           30002

005           30003

006           30004

007           30004

;

data dataset2;

  input zipcode R_value;

  cards;

30313    1

30002    1

30002    1

30003    2

30004    2

56006    3

57000    4

67012     5

65789     6

56890     6

76890     7

;

proc sort data=dataset1;

  by zipcode;

run;

proc sort data=dataset2 nodupkey;

  by zipcode;

run;

data want;

  merge dataset1 (in=in1) dataset2;

  by zipcode;

  if in1;

run;

LinusH
Tourmaline | Level 20

From the message you get, zipcode is not unique in DS 2. Try to apply a unique index/primary key constraint/sort nudupkey to verify that.

For the coice of lookup technique: id you don't have any performance constraints, use the technique you are most comfortable with. For me, SQL is quite simple and many people have that competence.

Data never sleeps

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 7 replies
  • 31197 views
  • 3 likes
  • 5 in conversation