BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Paul_NYS
Obsidian | Level 7

Hi Everyone

I am merging two data sets using the below data step. I am seeing some instances of duplicate records being created and I don't understand why it is happening. Is there something I forgot in this or is there a reason why this is happening? I am also including an example of the dup records. One other item, is there a way to create a de-duped dataset? Using proc sort with nodupkey and dupout would require using all the variables in the dataset, I assume.

Paul

Code:

data permev1;

merge perm_ev (in=a) nyccust (keep=entity_id DispOutcome1 DispoDate);

if a;

by entity_id;

run;

Dup records:

21560655    222238    22    1    14JUL2011    12SEP2011    .    .    12SEP2011    .    .    .    .    .    14JUL2011    .    25APR2012    25APR2012    XRF    Permanency Tracking Completed   XUK   PERM ONLY    XUK    30JAN2014    Order of Custody Continued

21560655    222238    22    1    14JUL2011    12SEP2011    .    .    12SEP2011    .    .    .    .    .    14JUL2011    .    25APR2012    25APR2012    XRF    Permanency Tracking Completed  

XUK   PERM ONLY    XUK    30JAN2014    Order of Custody Continued
1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

So if you just want 2 that match and you don't care which 2, you could add a single statement to the end of your DATA step:

a=0;

Good luck.

View solution in original post

15 REPLIES 15
ballardw
Super User

Are you expecting your output data set to only have one record for each in the perm_ev set?

If so, look for duplicates of the by variable in either of the datasets.

Paul_NYS
Obsidian | Level 7

No, not necessarily. There are duplicates in the original perm_ev set, which is fine. The problem is that the perm_ev data set has 216,526 records and the merged data set permev1 has 216,720 records. Records like the one I already included were created in the merge. I just want to add the above two columns (DispOutcome1 DispoDate) from nyccust to the records in the perm_ev data set and that's it.

Paul

Astounding
PROC Star

So suppose your first data set contains 2 observations for an entity_id, and your second data set contains 4 observations for the same entity_id.  What would you like the outcome to be?  (Your answer actually determines the proper programming tools to use.)

Paul_NYS
Obsidian | Level 7

2 obs. to match the original data set. I really just need to add two columns (DispOutcome1 DispoDate) from the second data set (nyccust) to the records in the first (perm_ev) as they exist in the first.

Paul

Astounding
PROC Star

So if you just want 2 that match and you don't care which 2, you could add a single statement to the end of your DATA step:

a=0;

Good luck.

art297
Opal | Level 21

Paul, the situation is more complex if you received the following note in your log: "NOTE: MERGE statement has more than one data set with repeats of BY values." The key question is whether you have duplicates in one of your files, or in both of your files.

If you don't want duplicate records in either file, then I would just sort each file, using the nodupkey option, only including your one by variable in the by statement (i.e., by entity_id; ), and using an out= option so that you don't destroy the original data files. No need to include a dupout option, as you appear to only want a de-duped merged file.

Paul_NYS
Obsidian | Level 7

Dups in the original perm_ev data set are fine, so de-duping the original data set would not help. I just want to add the two new columns based on the entity_id 'by' variable.

Paul

art297
Opal | Level 21

Paul: I disagree!

I think that the following represents your problem and solution:

data have1;

  input id x;

  cards;

1 1

2 2

3 3

4 4

4 4.5

5 5

;

data have2;

  input id y;

  cards;

1 11

2 21

3 31

4 41

4 41.5

;

data dontwant;

  merge have1 (in=ina) have2;

  by id;

  if ina;

run;

proc sort data=have2 out=need2 nodupkey;

  by id;

run;

data want;

  merge have1 (in=ina) need2;

  by id;

  if ina;

run;

Paul_NYS
Obsidian | Level 7

Neither one of those worked actually. a=0 gives me fewer records than I started with and Art the above gives me the same record set that I am getting (216,720).

I want to keep the original record set in the first data set as is (the same ones and no new records) and just add the two columns from the second record set onto the end of the records in the first data set.

Paul

Astounding
PROC Star

Paul,

I'm actually shocked that a=0 didn't work.  When you say "fewer records" do you mean fewer than 216,526?  I would love to see the log (both the programming statements and the notes that SAS generates).

Paul_NYS
Obsidian | Level 7

Hi Astounding

I re-ran it with the above to generate the log actually and when I checked the second time, it did produce the correct number of records. So thanks a lot for that! The first time I was looking at the wrong data set apparently. EG sometimes displays different data sets when finishing.

I want to verify though that in using this method the records from the first record set (perm_ev) are the ones that remain?

I ran two types of merges to check (between perm_ev and the merged data set) and both seem to indicate that the 216,526 that in the merged file are the ones that are in the first perm_ev data set.

Paul

Astounding
PROC Star

Paul,

Yes, the records from the first data set are the ones that remain.  (Unless there is something truly bizarre happening, such as the variable "a" being part of the second data set and its value overwriting the value produced by in=a.)

In the unusual case where an entity_id appears twice in the first data set and four times in the second data set, only the first two matching records from the second data set would be used. The third and fourth would just be deleted by "if a=0;"  When there are multiple matching records in the second data set, as long as you don't care which matching record is used, this solution would work.

art297
Opal | Level 21

I agree! And 's suggested solution would be a lot quicker to run than the one I had suggested. Anything more than 's suggested code would only be needed if DispOutcome1 and DispoDate have to be assigned differently when there are duplicate records in either file.

Paul_NYS
Obsidian | Level 7

I will keep that in mind. Thanks Art.

Paul

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 15 replies
  • 4343 views
  • 3 likes
  • 4 in conversation