Solved: Using Enterprise Guide Remove Duplicates based on 2 variables and keep...

FLCrime · Posted 05-27-2020 12:40 PM

I have the following three variables:

Person Registration Registration_Date

I would like to remove any complete duplicates and any rows that contain the same Person and Registration, keeping only the row with the latest Registration_Date. Example below:

Person  Registration  Registration_Date
Pete    A             2019
Marco   A             1993
Sam     B             2002
Sam     B             2003
Sam     C             1960
David   A             2002

This should result in:

Person  Registration  Registration_Date
Pete    A             2019
Marco   A             1993
Sam     B             2003
Sam     C             1960
David   A             2002

ed_sas_member · Posted 05-27-2020 12:50 PM

Hi @FLCrime

Please try this:

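data have;
	/* Sample data from the question, plus one complete duplicate (David A 2002) */
	input Person $ Registration $ Registration_Date;
	datalines;
Pete  A 2019
Marco A 1993
Sam   B 2002
Sam   B 2003
Sam   C 1960
David A 2002
David A 2002
;
run;

/* Ascending sort: the latest Registration_Date ends up last in each Person/Registration group */
proc sort data=have out=have_sorted;
	by Person Registration Registration_Date;
run;

/* Keep only the last (latest) row of each Person/Registration group */
data want;
	set have_sorted;
	by Person Registration;
	if last.Registration then output;
run;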

Best,

FLCrime · Posted 05-27-2020 01:00 PM

Thank you for the response. Is there a way to do this through point and click, or perhaps the Query Builder? I am not very well versed in code, and the data set has about 8 million rows.

jebjur · Posted 05-27-2020 05:32 PM

If you want to use a task in Enterprise Guide, the Sort Data task will work, but you may have to use it twice. The first time, sort by all 3 variables and set the sort order for Registration_Date to 'Descending', so that the most recent date is the first observation in each Person and Registration group.

Then, in the 2nd Sort Data task (run on the output data set from the 1st Sort Data task), sort only by Person and Registration, and in the Options section under 'Duplicate Records' select "Keep only the first record for each 'Sort by' group". This removes any duplicate observations for Person and Registration, leaving only the most recent row in each group.
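
For anyone who would rather see this as code, the two Sort Data tasks correspond roughly to the two PROC SORT steps sketched below. This is only a sketch: it assumes the input data set is named have (as in the earlier reply), the output names have_sorted, want and want_sql are placeholders, and the code that Enterprise Guide actually generates for the task may differ. The PROC SQL step at the end is an alternative that should be reachable from the Query Builder by grouping on Person and Registration and taking the MAX of Registration_Date; it works here only because there are no other columns to carry along.

```sas
/* Step 1: within each Person/Registration group, put the latest
   Registration_Date first */
proc sort data=have out=have_sorted;
	by Person Registration descending Registration_Date;
run;

/* Step 2: keep one row per Person/Registration group.
   NODUPKEY drops later rows that repeat the BY values; EQUALS keeps the
   incoming order within a group, so the row retained is the latest one. */
proc sort data=have_sorted out=want nodupkey equals;
	by Person Registration;
run;

/* Alternative: one row per group with the maximum date */
proc sql;
	create table want_sql as
	select Person, Registration, max(Registration_Date) as Registration_Date
	from have
	group by Person, Registration;
quit;
```

Either route also removes complete duplicates, since identical rows share the same Person and Registration and collapse into a single row.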
