Hi Everyone
I am merging two data sets using the below data step. I am seeing some instances of duplicate records being created and I don't understand why it is happening. Is there something I forgot in this or is there a reason why this is happening? I am also including an example of the dup records. One other item, is there a way to create a de-duped dataset? Using proc sort with nodupkey and dupout would require using all the variables in the dataset, I assume.
Paul
Code:
data permev1;
merge perm_ev (in=a) nyccust (keep=entity_id DispOutcome1 DispoDate);
if a;
by entity_id;
run;
Dup records:
21560655 222238 22 1 14JUL2011 12SEP2011 . . 12SEP2011 . . . . . 14JUL2011 . 25APR2012 25APR2012 XRF Permanency Tracking Completed | XUK | PERM ONLY XUK 30JAN2014 Order of Custody Continued |
21560655 222238 22 1 14JUL2011 12SEP2011 . . 12SEP2011 . . . . . 14JUL2011 . 25APR2012 25APR2012 XRF Permanency Tracking Completed | XUK | PERM ONLY XUK 30JAN2014 Order of Custody Continued |
So if you just want 2 that match and you don't care which 2, you could add a single statement to the end of your DATA step:
a=0;
Good luck.
Are you expecting your output data set to only have one record for each in the perm_ev set?
If so, look for duplicates of the by variable in either of the datasets.
No, not necessarily. There are duplicates in the original perm_ev set, which is fine. The problem is that the perm_ev data set has 216,526 records and the merged data set permev1 has 216,720 records. Records like the one I already included were created in the merge. I just want to add the above two columns (DispOutcome1 DispoDate) from nyccust to the records in the perm_ev data set and that's it.
Paul
So suppose your first data set contains 2 observations for an entity_id, and your second data set contains 4 observations for the same entity_id. What would you like the outcome to be? (Your answer actually determines the proper programming tools to use.)
2 obs. to match the original data set. I really just need to add two columns (DispOutcome1 DispoDate) from the second data set (nyccust) to the records in the first (perm_ev) as they exist in the first.
Paul
So if you just want 2 that match and you don't care which 2, you could add a single statement to the end of your DATA step:
a=0;
Good luck.
Paul, the situation is more complex if you received the following note in your log: "NOTE: MERGE statement has more than one data set with repeats of BY values." The key question is whether you have duplicates in one of your files, or in both of your files.
If you don't want duplicate records in either file, then I would just sort each file, using the nodupkey option, only including your one by variable in the by statement (i.e., by entity_id; ), and using an out= option so that you don't destroy the original data files. No need to include a dupout option, as you appear to only want a de-duped merged file.
Dups in the original perm_ev data set are fine, so de-duping the original data set would not help. I just want to add the two new columns based on the entity_id 'by' variable.
Paul
Paul: I disagree!
I think that the following represents your problem and solution:
data have1;
input id x;
cards;
1 1
2 2
3 3
4 4
4 4.5
5 5
;
data have2;
input id y;
cards;
1 11
2 21
3 31
4 41
4 41.5
;
data dontwant;
merge have1 (in=ina) have2;
by id;
if ina;
run;
proc sort data=have2 out=need2 nodupkey;
by id;
run;
data want;
merge have1 (in=ina) need2;
by id;
if ina;
run;
Neither one of those worked actually. a=0 gives me fewer records than I started with and Art the above gives me the same record set that I am getting (216,720).
I want to keep the original record set in the first data set as is (the same ones and no new records) and just add the two columns from the second record set onto the end of the records in the first data set.
Paul
Paul,
I'm actually shocked that a=0 didn't work. When you say "fewer records" do you mean fewer than 216,526? I would love to see the log (both the programming statements and the notes that SAS generates).
Hi Astounding
I re-ran it with the above to generate the log actually and when I checked the second time, it did produce the correct number of records. So thanks a lot for that! The first time I was looking at the wrong data set apparently. EG sometimes displays different data sets when finishing.
I want to verify though that in using this method the records from the first record set (perm_ev) are the ones that remain?
I ran two types of merges to check (between perm_ev and the merged data set) and both seem to indicate that the 216,526 that in the merged file are the ones that are in the first perm_ev data set.
Paul
Paul,
Yes, the records from the first data set are the ones that remain. (Unless there is something truly bizarre happening, such as the variable "a" being part of the second data set and its value overwriting the value produced by in=a.)
In the unusual case where an entity_id appears twice in the first data set and four times in the second data set, only the first two matching records from the second data set would be used. The third and fourth would just be deleted by "if a=0;" When there are multiple matching records in the second data set, as long as you don't care which matching record is used, this solution would work.
I will keep that in mind. Thanks Art.
Paul
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.