BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Costasg
Calcite | Level 5

Hello, I need some help regarding deleting duplicate rows; my dataset has 3 columns and looks like this:

code     date          volume

3          jun1996     100

3          jul1996      110

3          jul1996      120

3          aug1996    130

4          jun1996     105

4          jul1996      110

4          jul1996      110

What I want to do is delete the rows that have the same code and date (volume can be different, I want to keep the one  with the highest number; if these are the same just keep one)

Any ideas on how to do that?

So it would be like that:

code     date          volume

3          jun1996     100

3          jul1996      120

3          aug1996    130

4          jun1996     105

4          jul1996      110

Many thanks,

Costas

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

In general you want to sort and keep the last one per group.

proc sort data=have out=want ;

   by code date volume;

run;

data want;

   set want;

   by code date ;

  if last.date;

run;

View solution in original post

11 REPLIES 11
Tom
Super User Tom
Super User

In general you want to sort and keep the last one per group.

proc sort data=have out=want ;

   by code date volume;

run;

data want;

   set want;

   by code date ;

  if last.date;

run;

Costasg
Calcite | Level 5

It worked!

Many thanks tom

data_null__
Jade | Level 19

For this type of summary I would use PROC SUMMARY.  You could use MAX= parameter  but IDGROUP works with both character and numeric variables.

data have;

   inptut code date :monyy. volume;

   format date monyy7.;

   cards;

3 jun1996 100

3 jul1996 110

3 jul1996 120

3 aug1996 130

4 jun1996 105

4 jul1996 110

4 jul1996 110

;;;;

   run;

proc summary data=have nway;

   class code date;

   output out=new idgroup(max(volume) out(volume)=);

   run;

proc print;

   run;

art297
Opal | Level 21

While you already have two excellent answers, there are typically many ways of accomplishing the same thing in SAS.  Thus, a third way of accomplishing what you want is with two sorts.  E.g.,

proc sort data=have out=want ;

   by code date descending volume;

run;

proc sort data=want nodupkey;

   by code date;

run;

lostprophet
Calcite | Level 5

Thanks Art297, your 'third way' saved my day.

Ksharp
Super User

The fourth way after art.T

data have;
   input code date :monyy. volume;
   format date monyy7.;
   cards;
3 jun1996 100
3 jul1996 110
3 jul1996 120
3 aug1996 130
4 jun1996 105
4 jul1996 110
4 jul1996 110
;;;;
   run;
proc sql noprint;
 create table want as
  select distinct *
   from have
    group by code,date
     having volume eq max(volume);
quit;

Ksharp

PsycResearcher
Calcite | Level 5

I have a related question. How happen if I want to keep the duplicates but delete the non-duplicates. I have posted the following dataset in this forum before. It looks like this:

participantID

Treatment_Start_Date

Assessment_Date

Scores

1

13JAN2001

13JAN2001

5

1

13JAN2001

24MAR2001

6

1

13JAN2001

07MAY2001

8

1

15DEC2001

15DEC2001

9

2

01FEB2008

01FEB2008

5

2

01FEB2008

15MAY2008

2

2

01FEB2008

06JAN2009

1

2

15DEC2009

15DEC2009

3

2

15DEC2009

15JAN2010

5

2

26MAY2010

26MAY2010

4

OR

data have;

  informat Treatment_Start_Date Assessment_Date date9.;

  format Treatment_Start_Date Assessment_Date date9.;

  input participantID Treatment_Start_Date Assessment_Date Scores;

  cards;

1 13JAN2001 13JAN2001 5

1 13JAN2001 24MAR2001 6

1 13JAN2001 07MAY2001 8

1 15DEC2001 15DEC2001 9

2 01FEB2008 01FEB2008 5

2 01FEB2008 15MAY2008 2

2 01FEB2008 06JAN2009 1

2 15DEC2009 15DEC2009 3

2 15DEC2009 15JAN2010 5

2 26MAY2010 26MAY2010 4

;

run;


For participant 1 for example, I wan to keep only his three lines of data for 13JAN2001. For participant 2, I want to keep 01FEB2008 (three lines) and 15DEC2009 (two lines). I tried to find the "opposite" of NODUPKEY (something like KEEPDUPKEY") but have not succeed. Is there anyway to do this?

Thanks!

Chester

art297
Opal | Level 21

Very similar to the code Tom proposed earlier:

proc sort data=have out=want ;

   by participantID Treatment_Start_Date;

run;

data want;

   set want;

   by code date ;

  if not(first.Treatment_Start_Date and last.Treatment_Start_Date);

run;

Ksharp
Super User

How about:

data have;
   input code date1 : date9. date2 : date9. volume;
   format date1 date2 date9.;
   cards;
1 13JAN2001 13JAN2001 5
1 13JAN2001 24MAR2001 6
1 13JAN2001 07MAY2001 8
1 15DEC2001 15DEC2001 9
2 01FEB2008 01FEB2008 5
2 01FEB2008 15MAY2008 2
2 01FEB2008 06JAN2009 1
2 15DEC2009 15DEC2009 3
2 15DEC2009 15JAN2010 5
2 26MAY2010 26MAY2010 4
;
   run;
proc sql noprint;
 create table want as
  select  *
   from have
    group by code,date1
     having count(date1) gt 1;
quit;

Ksharp

PsycResearcher
Calcite | Level 5

thanks for being helpful, Ksharp. I guess that proc sql is a very useful command that I should learn.

sbhat
Calcite | Level 5

I used the following code to create a test data set:

 

data test0;
input ID1 ID2 date score;

datalines;
1 1.1 2004 8
1 1.1 2004 7
1 1.1 2004 1
2 1.2 2005 1
2 1.2 2006 1
2 1.2 2007 1
2 2.2 2005 8
2 2.2 2006 8
2 2.2 2007 8
3 3.1 2005 5
3 3.2 2005 6
3 3.3 2005 5
3 3.1 2006 5
3 3.2 2006 6
3 3.3 2006 5
3 3.1 2007 5
3 3.2 2007 6
3 3.3 2007 5
4 4.1 2005 8
4 4.1 2006 8
4 4.1 2007 8
5 5.1 2005 5
5 5.2 2006 6
5 5.3 2007 5
;

 

I want to test the presence of duplicate observations in the data sets.

 

The rule is ID1 (primary indicator) should be present only once for each date. For Example, ID1 = 2 and ID1 = 3 have duplicate observations as they have repeatations of the same value for ID1 for one particular value of date. However, ID1 =5 does not have a duplicate observation, although it's secondary indicator (ID2) changes it's value across dates.

 

Any help on this matter is highly appreciated. 

 

I want to create an indicator variable that will take a value 1 for if a particular observation is a duplicate observation. 

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 58940 views
  • 15 likes
  • 8 in conversation