Help using Base SAS procedures

Delete duplicate rows if two variables match

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 75
Accepted Solution

Delete duplicate rows if two variables match

Hello, I need some help regarding deleting duplicate rows; my dataset has 3 columns and looks like this:

code     date          volume

3          jun1996     100

3          jul1996      110

3          jul1996      120

3          aug1996    130

4          jun1996     105

4          jul1996      110

4          jul1996      110

What I want to do is delete the rows that have the same code and date (volume can be different, I want to keep the one  with the highest number; if these are the same just keep one)

Any ideas on how to do that?

So it would be like that:

code     date          volume

3          jun1996     100

3          jul1996      120

3          aug1996    130

4          jun1996     105

4          jul1996      110

Many thanks,

Costas


Accepted Solutions
Solution
‎07-30-2011 02:47 PM
Super User
Super User
Posts: 6,500

Re: Delete duplicate rows if two variables match

In general you want to sort and keep the last one per group.

proc sort data=have out=want ;

   by code date volume;

run;

data want;

   set want;

   by code date ;

  if last.date;

run;

View solution in original post


All Replies
Solution
‎07-30-2011 02:47 PM
Super User
Super User
Posts: 6,500

Re: Delete duplicate rows if two variables match

In general you want to sort and keep the last one per group.

proc sort data=have out=want ;

   by code date volume;

run;

data want;

   set want;

   by code date ;

  if last.date;

run;

Frequent Contributor
Posts: 75

Delete duplicate rows if two variables match

It worked!

Many thanks tom

Respected Advisor
Posts: 3,777

Delete duplicate rows if two variables match

For this type of summary I would use PROC SUMMARY.  You could use MAX= parameter  but IDGROUP works with both character and numeric variables.

data have;

   inptut code date :monyy. volume;

   format date monyy7.;

   cards;

3 jun1996 100

3 jul1996 110

3 jul1996 120

3 aug1996 130

4 jun1996 105

4 jul1996 110

4 jul1996 110

;;;;

   run;

proc summary data=have nway;

   class code date;

   output out=new idgroup(max(volume) out(volume)=);

   run;

proc print;

   run;

PROC Star
Posts: 7,363

Delete duplicate rows if two variables match

While you already have two excellent answers, there are typically many ways of accomplishing the same thing in SAS.  Thus, a third way of accomplishing what you want is with two sorts.  E.g.,

proc sort data=have out=want ;

   by code date descending volume;

run;

proc sort data=want nodupkey;

   by code date;

run;

Occasional Learner
Posts: 1

Re: Delete duplicate rows if two variables match

Thanks Art297, your 'third way' saved my day.

Super User
Posts: 9,681

Re: Delete duplicate rows if two variables match

The fourth way after art.T

data have;
   input code date :monyy. volume;
   format date monyy7.;
   cards;
3 jun1996 100
3 jul1996 110
3 jul1996 120
3 aug1996 130
4 jun1996 105
4 jul1996 110
4 jul1996 110
;;;;
   run;
proc sql noprint;
 create table want as
  select distinct *
   from have
    group by code,date
     having volume eq max(volume);
quit;

Ksharp

Occasional Contributor
Posts: 14

Re: Delete duplicate rows if two variables match

I have a related question. How happen if I want to keep the duplicates but delete the non-duplicates. I have posted the following dataset in this forum before. It looks like this:

participantID

Treatment_Start_Date

Assessment_Date

Scores

1

13JAN2001

13JAN2001

5

1

13JAN2001

24MAR2001

6

1

13JAN2001

07MAY2001

8

1

15DEC2001

15DEC2001

9

2

01FEB2008

01FEB2008

5

2

01FEB2008

15MAY2008

2

2

01FEB2008

06JAN2009

1

2

15DEC2009

15DEC2009

3

2

15DEC2009

15JAN2010

5

2

26MAY2010

26MAY2010

4

OR

data have;

  informat Treatment_Start_Date Assessment_Date date9.;

  format Treatment_Start_Date Assessment_Date date9.;

  input participantID Treatment_Start_Date Assessment_Date Scores;

  cards;

1 13JAN2001 13JAN2001 5

1 13JAN2001 24MAR2001 6

1 13JAN2001 07MAY2001 8

1 15DEC2001 15DEC2001 9

2 01FEB2008 01FEB2008 5

2 01FEB2008 15MAY2008 2

2 01FEB2008 06JAN2009 1

2 15DEC2009 15DEC2009 3

2 15DEC2009 15JAN2010 5

2 26MAY2010 26MAY2010 4

;

run;


For participant 1 for example, I wan to keep only his three lines of data for 13JAN2001. For participant 2, I want to keep 01FEB2008 (three lines) and 15DEC2009 (two lines). I tried to find the "opposite" of NODUPKEY (something like KEEPDUPKEY") but have not succeed. Is there anyway to do this?

Thanks!

Chester

PROC Star
Posts: 7,363

Re: Delete duplicate rows if two variables match

Very similar to the code Tom proposed earlier:

proc sort data=have out=want ;

   by participantID Treatment_Start_Date;

run;

data want;

   set want;

   by code date ;

  if not(first.Treatment_Start_Date and last.Treatment_Start_Date);

run;

Super User
Posts: 9,681

Re: Delete duplicate rows if two variables match

How about:

data have;
   input code date1 : date9. date2 : date9. volume;
   format date1 date2 date9.;
   cards;
1 13JAN2001 13JAN2001 5
1 13JAN2001 24MAR2001 6
1 13JAN2001 07MAY2001 8
1 15DEC2001 15DEC2001 9
2 01FEB2008 01FEB2008 5
2 01FEB2008 15MAY2008 2
2 01FEB2008 06JAN2009 1
2 15DEC2009 15DEC2009 3
2 15DEC2009 15JAN2010 5
2 26MAY2010 26MAY2010 4
;
   run;
proc sql noprint;
 create table want as
  select  *
   from have
    group by code,date1
     having count(date1) gt 1;
quit;

Ksharp

Occasional Contributor
Posts: 14

Re: Delete duplicate rows if two variables match

thanks for being helpful, Ksharp. I guess that proc sql is a very useful command that I should learn.

Learner
Posts: 1

Re: Delete duplicate rows if two variables match

I used the following code to create a test data set:

 

data test0;
input ID1 ID2 date score;

datalines;
1 1.1 2004 8
1 1.1 2004 7
1 1.1 2004 1
2 1.2 2005 1
2 1.2 2006 1
2 1.2 2007 1
2 2.2 2005 8
2 2.2 2006 8
2 2.2 2007 8
3 3.1 2005 5
3 3.2 2005 6
3 3.3 2005 5
3 3.1 2006 5
3 3.2 2006 6
3 3.3 2006 5
3 3.1 2007 5
3 3.2 2007 6
3 3.3 2007 5
4 4.1 2005 8
4 4.1 2006 8
4 4.1 2007 8
5 5.1 2005 5
5 5.2 2006 6
5 5.3 2007 5
;

 

I want to test the presence of duplicate observations in the data sets.

 

The rule is ID1 (primary indicator) should be present only once for each date. For Example, ID1 = 2 and ID1 = 3 have duplicate observations as they have repeatations of the same value for ID1 for one particular value of date. However, ID1 =5 does not have a duplicate observation, although it's secondary indicator (ID2) changes it's value across dates.

 

Any help on this matter is highly appreciated. 

 

I want to create an indicator variable that will take a value 1 for if a particular observation is a duplicate observation. 

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 11 replies
  • 34876 views
  • 15 likes
  • 8 in conversation