BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Costasg
Calcite | Level 5

Hello, I need some help regarding deleting duplicate rows; my dataset has 3 columns and looks like this:

code     date          volume

3          jun1996     100

3          jul1996      110

3          jul1996      120

3          aug1996    130

4          jun1996     105

4          jul1996      110

4          jul1996      110

What I want to do is delete the rows that have the same code and date (volume can be different, I want to keep the one  with the highest number; if these are the same just keep one)

Any ideas on how to do that?

So it would be like that:

code     date          volume

3          jun1996     100

3          jul1996      120

3          aug1996    130

4          jun1996     105

4          jul1996      110

Many thanks,

Costas

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

In general you want to sort and keep the last one per group.

proc sort data=have out=want ;

   by code date volume;

run;

data want;

   set want;

   by code date ;

  if last.date;

run;

View solution in original post

11 REPLIES 11
Tom
Super User Tom
Super User

In general you want to sort and keep the last one per group.

proc sort data=have out=want ;

   by code date volume;

run;

data want;

   set want;

   by code date ;

  if last.date;

run;

Costasg
Calcite | Level 5

It worked!

Many thanks tom

data_null__
Jade | Level 19

For this type of summary I would use PROC SUMMARY.  You could use MAX= parameter  but IDGROUP works with both character and numeric variables.

data have;

   inptut code date :monyy. volume;

   format date monyy7.;

   cards;

3 jun1996 100

3 jul1996 110

3 jul1996 120

3 aug1996 130

4 jun1996 105

4 jul1996 110

4 jul1996 110

;;;;

   run;

proc summary data=have nway;

   class code date;

   output out=new idgroup(max(volume) out(volume)=);

   run;

proc print;

   run;

art297
Opal | Level 21

While you already have two excellent answers, there are typically many ways of accomplishing the same thing in SAS.  Thus, a third way of accomplishing what you want is with two sorts.  E.g.,

proc sort data=have out=want ;

   by code date descending volume;

run;

proc sort data=want nodupkey;

   by code date;

run;

lostprophet
Calcite | Level 5

Thanks Art297, your 'third way' saved my day.

Ksharp
Super User

The fourth way after art.T

data have;
   input code date :monyy. volume;
   format date monyy7.;
   cards;
3 jun1996 100
3 jul1996 110
3 jul1996 120
3 aug1996 130
4 jun1996 105
4 jul1996 110
4 jul1996 110
;;;;
   run;
proc sql noprint;
 create table want as
  select distinct *
   from have
    group by code,date
     having volume eq max(volume);
quit;

Ksharp

PsycResearcher
Calcite | Level 5

I have a related question. How happen if I want to keep the duplicates but delete the non-duplicates. I have posted the following dataset in this forum before. It looks like this:

participantID

Treatment_Start_Date

Assessment_Date

Scores

1

13JAN2001

13JAN2001

5

1

13JAN2001

24MAR2001

6

1

13JAN2001

07MAY2001

8

1

15DEC2001

15DEC2001

9

2

01FEB2008

01FEB2008

5

2

01FEB2008

15MAY2008

2

2

01FEB2008

06JAN2009

1

2

15DEC2009

15DEC2009

3

2

15DEC2009

15JAN2010

5

2

26MAY2010

26MAY2010

4

OR

data have;

  informat Treatment_Start_Date Assessment_Date date9.;

  format Treatment_Start_Date Assessment_Date date9.;

  input participantID Treatment_Start_Date Assessment_Date Scores;

  cards;

1 13JAN2001 13JAN2001 5

1 13JAN2001 24MAR2001 6

1 13JAN2001 07MAY2001 8

1 15DEC2001 15DEC2001 9

2 01FEB2008 01FEB2008 5

2 01FEB2008 15MAY2008 2

2 01FEB2008 06JAN2009 1

2 15DEC2009 15DEC2009 3

2 15DEC2009 15JAN2010 5

2 26MAY2010 26MAY2010 4

;

run;


For participant 1 for example, I wan to keep only his three lines of data for 13JAN2001. For participant 2, I want to keep 01FEB2008 (three lines) and 15DEC2009 (two lines). I tried to find the "opposite" of NODUPKEY (something like KEEPDUPKEY") but have not succeed. Is there anyway to do this?

Thanks!

Chester

art297
Opal | Level 21

Very similar to the code Tom proposed earlier:

proc sort data=have out=want ;

   by participantID Treatment_Start_Date;

run;

data want;

   set want;

   by code date ;

  if not(first.Treatment_Start_Date and last.Treatment_Start_Date);

run;

Ksharp
Super User

How about:

data have;
   input code date1 : date9. date2 : date9. volume;
   format date1 date2 date9.;
   cards;
1 13JAN2001 13JAN2001 5
1 13JAN2001 24MAR2001 6
1 13JAN2001 07MAY2001 8
1 15DEC2001 15DEC2001 9
2 01FEB2008 01FEB2008 5
2 01FEB2008 15MAY2008 2
2 01FEB2008 06JAN2009 1
2 15DEC2009 15DEC2009 3
2 15DEC2009 15JAN2010 5
2 26MAY2010 26MAY2010 4
;
   run;
proc sql noprint;
 create table want as
  select  *
   from have
    group by code,date1
     having count(date1) gt 1;
quit;

Ksharp

PsycResearcher
Calcite | Level 5

thanks for being helpful, Ksharp. I guess that proc sql is a very useful command that I should learn.

sbhat
Calcite | Level 5

I used the following code to create a test data set:

 

data test0;
input ID1 ID2 date score;

datalines;
1 1.1 2004 8
1 1.1 2004 7
1 1.1 2004 1
2 1.2 2005 1
2 1.2 2006 1
2 1.2 2007 1
2 2.2 2005 8
2 2.2 2006 8
2 2.2 2007 8
3 3.1 2005 5
3 3.2 2005 6
3 3.3 2005 5
3 3.1 2006 5
3 3.2 2006 6
3 3.3 2006 5
3 3.1 2007 5
3 3.2 2007 6
3 3.3 2007 5
4 4.1 2005 8
4 4.1 2006 8
4 4.1 2007 8
5 5.1 2005 5
5 5.2 2006 6
5 5.3 2007 5
;

 

I want to test the presence of duplicate observations in the data sets.

 

The rule is ID1 (primary indicator) should be present only once for each date. For Example, ID1 = 2 and ID1 = 3 have duplicate observations as they have repeatations of the same value for ID1 for one particular value of date. However, ID1 =5 does not have a duplicate observation, although it's secondary indicator (ID2) changes it's value across dates.

 

Any help on this matter is highly appreciated. 

 

I want to create an indicator variable that will take a value 1 for if a particular observation is a duplicate observation. 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 57355 views
  • 15 likes
  • 8 in conversation