DATA Step, Macro, Functions and more

Merging two data sets and dups

Accepted Solution Solved
Reply
Regular Contributor
Posts: 216
Accepted Solution

Merging two data sets and dups

Hi Everyone

I am merging two data sets using the below data step. I am seeing some instances of duplicate records being created and I don't understand why it is happening. Is there something I forgot in this or is there a reason why this is happening? I am also including an example of the dup records. One other item, is there a way to create a de-duped dataset? Using proc sort with nodupkey and dupout would require using all the variables in the dataset, I assume.

Paul

Code:

data permev1;

merge perm_ev (in=a) nyccust (keep=entity_id DispOutcome1 DispoDate);

if a;

by entity_id;

run;

Dup records:

21560655    222238    22    1    14JUL2011    12SEP2011    .    .    12SEP2011    .    .    .    .    .    14JUL2011    .    25APR2012    25APR2012    XRF    Permanency Tracking Completed   XUK   PERM ONLY    XUK    30JAN2014    Order of Custody Continued

21560655    222238    22    1    14JUL2011    12SEP2011    .    .    12SEP2011    .    .    .    .    .    14JUL2011    .    25APR2012    25APR2012    XRF    Permanency Tracking Completed  

XUK   PERM ONLY    XUK    30JAN2014    Order of Custody Continued

Accepted Solutions
Solution
‎12-10-2014 01:14 PM
Super User
Posts: 5,075

Re: Merging two data sets and dups

So if you just want 2 that match and you don't care which 2, you could add a single statement to the end of your DATA step:

a=0;

Good luck.

View solution in original post


All Replies
Super User
Posts: 10,471

Re: Merging two data sets and dups

Are you expecting your output data set to only have one record for each in the perm_ev set?

If so, look for duplicates of the by variable in either of the datasets.

Regular Contributor
Posts: 216

Re: Merging two data sets and dups

No, not necessarily. There are duplicates in the original perm_ev set, which is fine. The problem is that the perm_ev data set has 216,526 records and the merged data set permev1 has 216,720 records. Records like the one I already included were created in the merge. I just want to add the above two columns (DispOutcome1 DispoDate) from nyccust to the records in the perm_ev data set and that's it.

Paul

Super User
Posts: 5,075

Re: Merging two data sets and dups

So suppose your first data set contains 2 observations for an entity_id, and your second data set contains 4 observations for the same entity_id.  What would you like the outcome to be?  (Your answer actually determines the proper programming tools to use.)

Regular Contributor
Posts: 216

Re: Merging two data sets and dups

2 obs. to match the original data set. I really just need to add two columns (DispOutcome1 DispoDate) from the second data set (nyccust) to the records in the first (perm_ev) as they exist in the first.

Paul

Solution
‎12-10-2014 01:14 PM
Super User
Posts: 5,075

Re: Merging two data sets and dups

So if you just want 2 that match and you don't care which 2, you could add a single statement to the end of your DATA step:

a=0;

Good luck.

PROC Star
Posts: 7,356

Re: Merging two data sets and dups

Paul, the situation is more complex if you received the following note in your log: "NOTE: MERGE statement has more than one data set with repeats of BY values." The key question is whether you have duplicates in one of your files, or in both of your files.

If you don't want duplicate records in either file, then I would just sort each file, using the nodupkey option, only including your one by variable in the by statement (i.e., by entity_id; ), and using an out= option so that you don't destroy the original data files. No need to include a dupout option, as you appear to only want a de-duped merged file.

Regular Contributor
Posts: 216

Re: Merging two data sets and dups

Dups in the original perm_ev data set are fine, so de-duping the original data set would not help. I just want to add the two new columns based on the entity_id 'by' variable.

Paul

PROC Star
Posts: 7,356

Re: Merging two data sets and dups

Paul: I disagree!

I think that the following represents your problem and solution:

data have1;

  input id x;

  cards;

1 1

2 2

3 3

4 4

4 4.5

5 5

;

data have2;

  input id y;

  cards;

1 11

2 21

3 31

4 41

4 41.5

;

data dontwant;

  merge have1 (in=ina) have2;

  by id;

  if ina;

run;

proc sort data=have2 out=need2 nodupkey;

  by id;

run;

data want;

  merge have1 (in=ina) need2;

  by id;

  if ina;

run;

Regular Contributor
Posts: 216

Re: Merging two data sets and dups

Neither one of those worked actually. a=0 gives me fewer records than I started with and Art the above gives me the same record set that I am getting (216,720).

I want to keep the original record set in the first data set as is (the same ones and no new records) and just add the two columns from the second record set onto the end of the records in the first data set.

Paul

Super User
Posts: 5,075

Re: Merging two data sets and dups

Paul,

I'm actually shocked that a=0 didn't work.  When you say "fewer records" do you mean fewer than 216,526?  I would love to see the log (both the programming statements and the notes that SAS generates).

Regular Contributor
Posts: 216

Re: Merging two data sets and dups

Hi Astounding

I re-ran it with the above to generate the log actually and when I checked the second time, it did produce the correct number of records. So thanks a lot for that! The first time I was looking at the wrong data set apparently. EG sometimes displays different data sets when finishing.

I want to verify though that in using this method the records from the first record set (perm_ev) are the ones that remain?

I ran two types of merges to check (between perm_ev and the merged data set) and both seem to indicate that the 216,526 that in the merged file are the ones that are in the first perm_ev data set.

Paul

Super User
Posts: 5,075

Re: Merging two data sets and dups

Paul,

Yes, the records from the first data set are the ones that remain.  (Unless there is something truly bizarre happening, such as the variable "a" being part of the second data set and its value overwriting the value produced by in=a.)

In the unusual case where an entity_id appears twice in the first data set and four times in the second data set, only the first two matching records from the second data set would be used. The third and fourth would just be deleted by "if a=0;"  When there are multiple matching records in the second data set, as long as you don't care which matching record is used, this solution would work.

PROC Star
Posts: 7,356

Re: Merging two data sets and dups

I agree! And 's suggested solution would be a lot quicker to run than the one I had suggested. Anything more than 's suggested code would only be needed if DispOutcome1 and DispoDate have to be assigned differently when there are duplicate records in either file.

Regular Contributor
Posts: 216

Re: Merging two data sets and dups

I will keep that in mind. Thanks Art.

Paul

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 15 replies
  • 410 views
  • 3 likes
  • 4 in conversation