Solved: Re: Merging two data sets and dups

Paul_NYS · Posted 12-10-2014 11:16 AM

Hi Everyone

I am merging two data sets using the below data step. I am seeing some instances of duplicate records being created and I don't understand why it is happening. Is there something I forgot in this or is there a reason why this is happening? I am also including an example of the dup records. One other item, is there a way to create a de-duped dataset? Using proc sort with nodupkey and dupout would require using all the variables in the dataset, I assume.

Paul

Code:

data permev1;

merge perm_ev (in=a) nyccust (keep=entity_id DispOutcome1 DispoDate);

if a;

by entity_id;

run;

Dup records:

21560655 222238 22 1 14JUL2011 12SEP2011 . . 12SEP2011 . . . . . 14JUL2011 . 25APR2012 25APR2012 XRF Permanency Tracking Completed

XUK

PERM ONLY XUK 30JAN2014 Order of Custody Continued

21560655 222238 22 1 14JUL2011 12SEP2011 . . 12SEP2011 . . . . . 14JUL2011 . 25APR2012 25APR2012 XRF Permanency Tracking Completed

XUK

PERM ONLY XUK 30JAN2014 Order of Custody Continued

Astounding · Posted 12-10-2014 01:14 PM

So if you just want 2 that match and you don't care which 2, you could add a single statement to the end of your DATA step:

a=0;

Good luck.

View solution in original post

ballardw · Posted 12-10-2014 11:44 AM

Are you expecting your output data set to only have one record for each in the perm_ev set?

If so, look for duplicates of the by variable in either of the datasets.

Paul_NYS · Posted 12-10-2014 12:04 PM

No, not necessarily. There are duplicates in the original perm_ev set, which is fine. The problem is that the perm_ev data set has 216,526 records and the merged data set permev1 has 216,720 records. Records like the one I already included were created in the merge. I just want to add the above two columns (DispOutcome1 DispoDate) from nyccust to the records in the perm_ev data set and that's it.

Paul

Astounding · Posted 12-10-2014 12:35 PM

So suppose your first data set contains 2 observations for an entity_id, and your second data set contains 4 observations for the same entity_id. What would you like the outcome to be? (Your answer actually determines the proper programming tools to use.)

Paul_NYS · Posted 12-10-2014 01:03 PM

2 obs. to match the original data set. I really just need to add two columns (DispOutcome1 DispoDate) from the second data set (nyccust) to the records in the first (perm_ev) as they exist in the first.

Paul

Astounding · Posted 12-10-2014 01:14 PM

So if you just want 2 that match and you don't care which 2, you could add a single statement to the end of your DATA step:

a=0;

Good luck.

art297 · Posted 12-10-2014 11:49 AM

Paul, the situation is more complex if you received the following note in your log: "NOTE: MERGE statement has more than one data set with repeats of BY values." The key question is whether you have duplicates in one of your files, or in both of your files.

If you don't want duplicate records in either file, then I would just sort each file, using the nodupkey option, only including your one by variable in the by statement (i.e., by entity_id; ), and using an out= option so that you don't destroy the original data files. No need to include a dupout option, as you appear to only want a de-duped merged file.

Paul_NYS · Posted 12-10-2014 12:06 PM

Dups in the original perm_ev data set are fine, so de-duping the original data set would not help. I just want to add the two new columns based on the entity_id 'by' variable.

Paul

art297 · Posted 12-10-2014 01:20 PM

Paul: I disagree!

I think that the following represents your problem and solution:

data have1;

input id x;

cards;

1 1

2 2

3 3

4 4

4 4.5

5 5

;

data have2;

input id y;

cards;

1 11

2 21

3 31

4 41

4 41.5

;

data dontwant;

merge have1 (in=ina) have2;

by id;

if ina;

run;

proc sort data=have2 out=need2 nodupkey;

by id;

run;

data want;

merge have1 (in=ina) need2;

by id;

if ina;

run;

Paul_NYS · Posted 12-10-2014 02:25 PM

Neither one of those worked actually. a=0 gives me fewer records than I started with and Art the above gives me the same record set that I am getting (216,720).

I want to keep the original record set in the first data set as is (the same ones and no new records) and just add the two columns from the second record set onto the end of the records in the first data set.

Paul

Astounding · Posted 12-10-2014 03:25 PM

Paul,

I'm actually shocked that a=0 didn't work. When you say "fewer records" do you mean fewer than 216,526? I would love to see the log (both the programming statements and the notes that SAS generates).

Paul_NYS · Posted 12-10-2014 03:46 PM

Hi Astounding

I re-ran it with the above to generate the log actually and when I checked the second time, it did produce the correct number of records. So thanks a lot for that! The first time I was looking at the wrong data set apparently. EG sometimes displays different data sets when finishing.

I want to verify though that in using this method the records from the first record set (perm_ev) are the ones that remain?

I ran two types of merges to check (between perm_ev and the merged data set) and both seem to indicate that the 216,526 that in the merged file are the ones that are in the first perm_ev data set.

Paul

Astounding · Posted 12-10-2014 03:56 PM

Paul,

Yes, the records from the first data set are the ones that remain. (Unless there is something truly bizarre happening, such as the variable "a" being part of the second data set and its value overwriting the value produced by in=a.)

In the unusual case where an entity_id appears twice in the first data set and four times in the second data set, only the first two matching records from the second data set would be used. The third and fourth would just be deleted by "if a=0;" When there are multiple matching records in the second data set, as long as you don't care which matching record is used, this solution would work.

art297 · Posted 12-10-2014 04:08 PM

I agree! And 's suggested solution would be a lot quicker to run than the one I had suggested. Anything more than 's suggested code would only be needed if DispOutcome1 and DispoDate have to be assigned differently when there are duplicate records in either file.

Paul_NYS · Posted 12-10-2014 04:14 PM

I will keep that in mind. Thanks Art.

Paul

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away