Solved: Deleting duplicates based on multiple criteria

BenBrady · Posted 06-20-2017 09:41 PM

I want to keep or delete observations based on multiple conditions. For example, in the dataset below I want to do the following:

If there are duplicates of a person, I want to keep the one where score1=score2. If this still leaves duplicates of that person, then I want the one with the lower year (i.e. 2015, not 2016). If there are duplicates for a person but score1 is not equal to score2 for any of them, then I want to keep the one with the lower year. The first table below is the original data set and the second is how I want it to look:

Person	Score1	Score2	Year
A	15	15	2015
A	15	15	2016
A	15	12	2016
B	8	7	2015
C	10	10	2016
D	11	12	2015
D	11	13	2016


Person	Score1	Score2	Year
A	15	15	2015
B	8	7	2015
C	10	10	2016
D	11	12	2015

Note that if there are no duplicates for a person, I want to keep that person regardless of whether score1=score2.

Jagadishkatam · Posted 06-20-2017 10:30 PM

Please try this SQL:

data have;
/* list input: values are separated by blanks */
input Person $ Score1 Score2 Year;
cards;
A 15 15 2015
A 15 15 2016
A 15 12 2016
B 8 7 2015
C 10 10 2016
D 11 12 2015
D 11 13 2016
;

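proc sql;
  /* inner query: count rows per person and year; where a person-year has
     several rows, flag 1 = scores match, 2 = scores differ; a lone row gets 1 */
  create table want as
  select * from
    (select *, count(*) as count,
            case when count(*) > 1 and score1 = score2 then 1
                 when count(*) > 1 and score1 ^= score2 then 2
                 else count(*) end as flag
     from have
     group by person, year)
  group by person
  having min(year) = year and min(flag) = flag; /* lowest year and lowest flag per person */
quit;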
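
The query above groups by person and year, so as far as I can tell the count and flag only compare rows that share a year, which is the case in the sample data. If a person's matching row could fall in a later year than a non-matching one, a sort-based version may be easier to reason about. A minimal sketch, assuming the same have dataset; the names ranked, mismatch and want2 are made up for illustration:

```sas
/* mismatch = 0 when score1 = score2, 1 otherwise (hypothetical helper variable) */
data ranked;
  set have;
  mismatch = (score1 ne score2);
run;

/* within each person: matching rows first, then the lowest year */
proc sort data=ranked;
  by person mismatch year;
run;

/* keep only the first row of each person */
data want2;
  set ranked;
  by person;
  if first.person;
  drop mismatch;
run;
```

On the sample data this keeps A 2015, B 2015, C 2016 and D 2015, the same rows as the wanted table. Because only the first row per person is kept, a person always ends up with exactly one observation, even when two rows tie on both criteria.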

Thanks,
Jag

BenBrady · Posted 06-20-2017 10:54 PM

Thanks! How would I change the SAS code to apply it to an existing data set with the same variables that has thousands of observations?

Jagadishkatam · Posted 06-20-2017 11:15 PM

I assume your dataset has the same variables (person, score1, score2 and year). If so, replace have in the code below with the name of your dataset. This will create the dataset want.

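proc sql;
  /* inner query: count rows per person and year; where a person-year has
     several rows, flag 1 = scores match, 2 = scores differ; a lone row gets 1 */
  create table want as
  select * from
    (select *, count(*) as count,
            case when count(*) > 1 and score1 = score2 then 1
                 when count(*) > 1 and score1 ^= score2 then 2
                 else count(*) end as flag
     from have
     group by person, year)
  group by person
  having min(year) = year and min(flag) = flag; /* lowest year and lowest flag per person */
quit;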
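
With thousands of observations it is hard to eyeball the result, so a quick check on want may help. A minimal sketch, assuming the have and want datasets named above; n_rows is just an illustrative column name:

```sas
proc sql;
  /* list any person that still has more than one row in want */
  select person, count(*) as n_rows
  from want
  group by person
  having count(*) > 1;

  /* list any person in have that is missing from want */
  select distinct person
  from have
  where person not in (select person from want);
quit;
```

If both queries return no rows, every person in have appears exactly once in want.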

Thanks,
Jag
