12-10-2014 11:16 AM
I am merging two data sets using the below data step. I am seeing some instances of duplicate records being created and I don't understand why it is happening. Is there something I forgot in this or is there a reason why this is happening? I am also including an example of the dup records. One other item, is there a way to create a de-duped dataset? Using proc sort with nodupkey and dupout would require using all the variables in the dataset, I assume.
data permev1;
  merge perm_ev (in=a) nyccust (keep=entity_id DispOutcome1 DispoDate);
  by entity_id;
run;
|21560655 222238 22 1 14JUL2011 12SEP2011 . . 12SEP2011 . . . . . 14JUL2011 . 25APR2012 25APR2012 XRF Permanency Tracking Completed||XUK||PERM ONLY XUK 30JAN2014 Order of Custody Continued|
|21560655 222238 22 1 14JUL2011 12SEP2011 . . 12SEP2011 . . . . . 14JUL2011 . 25APR2012 25APR2012 XRF Permanency Tracking Completed||XUK||PERM ONLY XUK 30JAN2014 Order of Custody Continued|
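One way to see where the extra rows come from is to count repeated entity_id values in each input data set before merging. A diagnostic sketch using the data set names from this post:

```sas
/* Count how many entity_id values repeat in each input data set */
proc sql;
  select count(*) as dup_keys_perm_ev
  from (select entity_id from perm_ev group by entity_id having count(*) > 1);
  select count(*) as dup_keys_nyccust
  from (select entity_id from nyccust group by entity_id having count(*) > 1);
quit;
```

If both counts are nonzero, the merge is many-to-many on at least some entity_id values, which is what produces extra records.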
12-10-2014 12:04 PM
No, not necessarily. There are duplicates in the original perm_ev set, which is fine. The problem is that the perm_ev data set has 216,526 records and the merged data set permev1 has 216,720 records. Records like the one I already included were created in the merge. I just want to add the above two columns (DispOutcome1 DispoDate) from nyccust to the records in the perm_ev data set and that's it.
12-10-2014 12:35 PM
So suppose your first data set contains 2 observations for an entity_id, and your second data set contains 4 observations for the same entity_id. What would you like the outcome to be? (Your answer actually determines the proper programming tools to use.)
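For reference, a SAS match-merge produces max(n1, n2) observations per BY group, with values from the shorter side retained for the extra rows. A minimal sketch with hypothetical data (2 observations versus 4 yields 4):

```sas
data one;               /* 2 observations for id=1 */
  input id x;
  datalines;
1 10
1 20
;
data two;               /* 4 observations for id=1 */
  input id y;
  datalines;
1 101
1 102
1 103
1 104
;
data both;              /* 4 obs: x=10,20,20,20 (last value of x is retained) */
  merge one two;
  by id;
run;
```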
12-10-2014 01:03 PM
2 obs. to match the original data set. I really just need to add two columns (DispOutcome1 DispoDate) from the second data set (nyccust) to the records in the first (perm_ev) as they exist in the first.
12-10-2014 11:49 AM
Paul, the situation is more complex if you received the following note in your log: "NOTE: MERGE statement has more than one data set with repeats of BY values." The key question is whether you have duplicates in one of your files, or in both of your files.
If you don't want duplicate records in either file, then I would just sort each file, using the nodupkey option, only including your one by variable in the by statement (i.e., by entity_id; ), and using an out= option so that you don't destroy the original data files. No need to include a dupout option, as you appear to only want a de-duped merged file.
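That suggestion as a sketch, using the data set names from this thread (the out= data set names are placeholders):

```sas
/* De-dup each file on entity_id alone, writing to new data sets
   so the originals are untouched */
proc sort data=perm_ev out=perm_ev_nodup nodupkey;
  by entity_id;
run;

proc sort data=nyccust out=nyccust_nodup nodupkey;
  by entity_id;
run;
```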
12-10-2014 12:06 PM
Dups in the original perm_ev data set are fine, so de-duping the original data set would not help. I just want to add the two new columns based on the entity_id 'by' variable.
12-10-2014 01:20 PM
Paul: I disagree!
I think that the following represents your problem and solution:
data want;                               /* the problem: many-to-many merge */
  merge have1 (in=ina) have2;            /* have1: input id x; have2: input id y; */
  by id;
run;
proc sort data=have2 out=need2 nodupkey; /* the fix: de-dup the second data set */
  by id;
run;
data want;
  merge have1 (in=ina) need2;
  by id;
run;
12-10-2014 02:25 PM
Neither of those worked, actually. a=0 gives me fewer records than I started with, and Art, the above gives me the same record count that I am already getting (216,720).
I want to keep the original record set in the first data set as is (the same ones and no new records) and just add the two columns from the second record set onto the end of the records in the first data set.
12-10-2014 03:25 PM
I'm actually shocked that a=0 didn't work. When you say "fewer records" do you mean fewer than 216,526? I would love to see the log (both the programming statements and the notes that SAS generates).
12-10-2014 03:46 PM
I re-ran it with the above to generate the log, and when I checked the second time it did produce the correct number of records, so thanks a lot for that! The first time I was apparently looking at the wrong data set; Enterprise Guide sometimes displays a different data set when a program finishes.
I want to verify, though, that with this method the records from the first data set (perm_ev) are the ones that remain?
I ran two types of merges to check (between perm_ev and the merged data set), and both seem to indicate that the 216,526 records in the merged file are the ones from the original perm_ev data set.
12-10-2014 03:56 PM
Yes, the records from the first data set are the ones that remain. (Unless there is something truly bizarre happening, such as the variable "a" being part of the second data set and its value overwriting the value produced by in=a.)
In the unusual case where an entity_id appears twice in the first data set and four times in the second data set, only the first two matching records from the second data set would be used. The third and fourth would just be deleted by "if a=0;" When there are multiple matching records in the second data set, as long as you don't care which matching record is used, this solution would work.
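Putting the accepted approach together, a hedged sketch of the final step (names from this thread; the subsetting `if a;` is equivalent to deleting records where a=0):

```sas
/* De-dup only the lookup data set, leaving perm_ev untouched */
proc sort data=nyccust out=nyccust_nodup nodupkey;
  by entity_id;
run;

data permev1;
  merge perm_ev (in=a)
        nyccust_nodup (keep=entity_id DispOutcome1 DispoDate);
  by entity_id;
  if a;               /* keep only records that exist in perm_ev */
run;
```

Because nyccust_nodup has at most one record per entity_id, every perm_ev record matches at most one lookup record, so permev1 keeps exactly the 216,526 original observations.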