Solved: Re: Delete duplicate rows if two variables match

Costasg · Posted 07-30-2011 01:28 PM

Hello, I need some help regarding deleting duplicate rows; my dataset has 3 columns and looks like this:

code date volume
3 jun1996 100
3 jul1996 110
3 jul1996 120
3 aug1996 130
4 jun1996 105
4 jul1996 110
4 jul1996 110

What I want to do is delete the rows that have the same code and date (volume can be different, I want to keep the one with the highest number; if these are the same just keep one)

Any ideas on how to do that?

So it would be like that:

code date volume
3 jun1996 100
3 jul1996 120
3 aug1996 130
4 jun1996 105
4 jul1996 110

Many thanks,

Costas

Tom · Posted 07-30-2011 02:47 PM

In general you want to sort and keep the last one per group.

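/* Sort so the largest volume comes last within each code/date group */
proc sort data=have out=want;
   by code date volume;
run;

/* Keep only the last row of each code/date group */
data want;
   set want;
   by code date;
   if last.date;
run;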

Costasg · Posted 07-30-2011 03:39 PM

It worked!

Many thanks tom

data_null__ · Posted 07-30-2011 03:23 PM

For this type of summary I would use PROC SUMMARY. You could use the MAX= statistic, but IDGROUP works with both character and numeric variables.

data have;
   input code date :monyy. volume;
   format date monyy7.;
   cards;
3 jun1996 100
3 jul1996 110
3 jul1996 120
3 aug1996 130
4 jun1996 105
4 jul1996 110
;;;;
run;

proc summary data=have nway;
   class code date;
   output out=new idgroup(max(volume) out(volume)=);
run;

proc print;
run;
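
For comparison, the MAX= form mentioned above might look like the sketch below. It relies on volume being numeric. The output name new_max and the DROP= of the automatic _TYPE_ and _FREQ_ variables are illustrative choices, not part of the original answer.

```sas
proc summary data=have nway;
   class code date;
   /* MAX= is a numeric statistic, so this only works because volume is numeric */
   output out=new_max(drop=_type_ _freq_) max(volume)=volume;
run;
```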

art297 · Posted 07-31-2011 06:14 PM

While you already have two excellent answers, there are typically many ways of accomplishing the same thing in SAS. Thus, a third way of accomplishing what you want is with two sorts. E.g.,

proc sort data=have out=want;
   by code date descending volume;
run;

proc sort data=want nodupkey;
   by code date;
run;

lostprophet · Posted 06-05-2017 07:13 AM

Thanks Art297, your 'third way' saved my day.

Ksharp · Posted 08-04-2011 04:26 AM

A fourth way, after Art's:

data have;
   input code date :monyy. volume;
   format date monyy7.;
   cards;
3 jun1996 100
3 jul1996 110
3 jul1996 120
3 aug1996 130
4 jun1996 105
4 jul1996 110
4 jul1996 110
;;;;
   run;
proc sql noprint;
 create table want as
  select distinct *
   from have
    group by code,date
     having volume eq max(volume);
quit;

Ksharp
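
Because this table has only the three columns, the same result could probably also be written as a plain grouped aggregate, which avoids remerging the summary statistic back onto the original rows. This is an alternative formulation rather than Ksharp's code, the table name want_max is made up for illustration, and it would drop any extra columns if the real table had them.

```sas
proc sql noprint;
   create table want_max as
   /* one row per code/date, carrying the largest volume in that group */
   select code, date, max(volume) as volume
   from have
   group by code, date;
quit;
```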

PsycResearcher · Posted 08-05-2011 10:59 AM

I have a related question. What if I want to keep the duplicates but delete the non-duplicates? I have posted the following dataset in this forum before. It looks like this:

participantID	Treatment_Start_Date	Assessment_Date	Scores
1	13JAN2001	13JAN2001	5
1	13JAN2001	24MAR2001	6
1	13JAN2001	07MAY2001	8
1	15DEC2001	15DEC2001	9
2	01FEB2008	01FEB2008	5
2	01FEB2008	15MAY2008	2
2	01FEB2008	06JAN2009	1
2	15DEC2009	15DEC2009	3
2	15DEC2009	15JAN2010	5
2	26MAY2010	26MAY2010	4

OR

data have;
   informat Treatment_Start_Date Assessment_Date date9.;
   format Treatment_Start_Date Assessment_Date date9.;
   input participantID Treatment_Start_Date Assessment_Date Scores;
   cards;
1 13JAN2001 13JAN2001 5
1 13JAN2001 24MAR2001 6
1 13JAN2001 07MAY2001 8
1 15DEC2001 15DEC2001 9
2 01FEB2008 01FEB2008 5
2 01FEB2008 15MAY2008 2
2 01FEB2008 06JAN2009 1
2 15DEC2009 15DEC2009 3
2 15DEC2009 15JAN2010 5
2 26MAY2010 26MAY2010 4
;
run;

For participant 1, for example, I want to keep only the three lines of data for 13JAN2001. For participant 2, I want to keep 01FEB2008 (three lines) and 15DEC2009 (two lines). I tried to find the "opposite" of NODUPKEY (something like "KEEPDUPKEY") but have not succeeded. Is there any way to do this?

Thanks!

Chester

art297 · Posted 08-05-2011 11:59 AM

Very similar to the code Tom proposed earlier:

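proc sort data=have out=want;
   by participantID Treatment_Start_Date;
run;

data want;
   set want;
   by participantID Treatment_Start_Date;
   /* Drop rows that are both first and last in their group, i.e. the only row for that key */
   if not(first.Treatment_Start_Date and last.Treatment_Start_Date);
run;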
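
On the question of an "opposite" of NODUPKEY: PROC SORT gained a NOUNIQUEKEY option in SAS 9.3, which drops observations whose BY-group key occurs only once. The sketch below assumes a release that supports that option; on older releases the DATA step approach is the one to use.

```sas
/* Keep only rows whose participantID/Treatment_Start_Date key occurs more than once */
proc sort data=have out=want nouniquekey;
   by participantID Treatment_Start_Date;
run;
```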

Ksharp · Posted 08-09-2011 06:44 AM

How about:

data have;
   input code date1 : date9. date2 : date9. volume;
   format date1 date2 date9.;
   cards;
1 13JAN2001 13JAN2001 5
1 13JAN2001 24MAR2001 6
1 13JAN2001 07MAY2001 8
1 15DEC2001 15DEC2001 9
2 01FEB2008 01FEB2008 5
2 01FEB2008 15MAY2008 2
2 01FEB2008 06JAN2009 1
2 15DEC2009 15DEC2009 3
2 15DEC2009 15JAN2010 5
2 26MAY2010 26MAY2010 4
;
   run;
proc sql noprint;
 create table want as
  select  *
   from have
    group by code,date1
     having count(date1) gt 1;
quit;

Ksharp

PsycResearcher · Posted 08-11-2011 05:51 PM

Thanks for being helpful, Ksharp. I guess that PROC SQL is a very useful procedure that I should learn.

sbhat · Posted 05-17-2017 12:15 PM

I used the following code to create a test data set:

data test0;
input ID1 ID2 date score;

datalines;
1 1.1 2004 8
1 1.1 2004 7
1 1.1 2004 1
2 1.2 2005 1
2 1.2 2006 1
2 1.2 2007 1
2 2.2 2005 8
2 2.2 2006 8
2 2.2 2007 8
3 3.1 2005 5
3 3.2 2005 6
3 3.3 2005 5
3 3.1 2006 5
3 3.2 2006 6
3 3.3 2006 5
3 3.1 2007 5
3 3.2 2007 6
3 3.3 2007 5
4 4.1 2005 8
4 4.1 2006 8
4 4.1 2007 8
5 5.1 2005 5
5 5.2 2006 6
5 5.3 2007 5
;

I want to test for the presence of duplicate observations in the data set.

The rule is that ID1 (the primary indicator) should be present only once for each date. For example, ID1 = 2 and ID1 = 3 have duplicate observations because the same value of ID1 is repeated within one particular value of date. However, ID1 = 5 does not have a duplicate observation, although its secondary indicator (ID2) changes its value across dates.

Any help on this matter is highly appreciated.

I want to create an indicator variable that takes the value 1 if a particular observation is a duplicate observation.
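
One possible sketch, reusing the FIRST./LAST. technique from earlier in the thread: sort by ID1 and date, then flag every row that belongs to an ID1/date group with more than one row. The names test1 and dup are made up for illustration, and flagging all rows of a repeated group (rather than only the second and later ones) is an assumption about what is wanted.

```sas
proc sort data=test0 out=test1;
   by ID1 date;
run;

data test1;
   set test1;
   by ID1 date;
   /* dup = 1 when this ID1/date combination occurs more than once, else 0 */
   dup = not (first.date and last.date);
run;
```

To flag only the repeats and leave the first row of each group unflagged, the assignment could be changed to dup = not first.date.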
