I need to flag duplicates in a dataset based on the following criteria: 1. cases with the same first name, last name and zipcode 2. cases with the same first name, last name and email address. Cases should be flagged by assigning a set of duplicates a newUserId with the same id as the most recent enrollment date in the set. For example see following table: userid firstName lastName zip emailAddress enrollmentDate 1 John Smith 88787 jsmith@gmail.com 2011-01-01 2 John Smith 99888 jsmith@gmail.com 2013-01-01 3 John Smith 99888 johnsmith@gmail.com 2014-01-01 4 John Smith 11111 smith12@aol.com 2016-01-01 The first two rows should be flagged as duplicates, because they share firstName, lastName and zip. The 2nd and 3rd rows are also duplicates, because they share firstName, lastName and emailAddress. This means that all of the first three rows are a single set of duplicates. These three rows should each get a newUserId that should match the most recent enrollmentDate in the set, so 3. So, the output I want is: userid firstName lastName zip emailAddress enrollmentDate newUserId 1 John Smith 88787 jsmith@gmail.com 2011-01-01 3 2 John Smith 99888 jsmith@gmail.com 2013-01-01 3 3 John Smith 99888 johnsmith@gmail.com 2014-01-01 3 4 John Smith 11111 smith12@aol.com 2016-01-01 4 I know how to use first / last processing to flag duplicates on these two criteria separately, but can't figure out how to flag an entire 'duplicate cluster' with a single value.
... View more