Hello, I need some help regarding deleting duplicate rows; my dataset has 3 columns and looks like this:
code date volume
3 jun1996 100
3 jul1996 110
3 jul1996 120
3 aug1996 130
4 jun1996 105
4 jul1996 110
4 jul1996 110
What I want to do is delete the rows that have the same code and date (volume can be different, I want to keep the one with the highest number; if these are the same just keep one)
Any ideas on how to do that?
So it would be like that:
code date volume
3 jun1996 100
3 jul1996 120
3 aug1996 130
4 jun1996 105
4 jul1996 110
Many thanks,
Costas
In general you want to sort and keep the last one per group.
proc sort data=have out=want ;
by code date volume;
run;
data want;
set want;
by code date ;
if last.date;
run;
In general you want to sort and keep the last one per group.
proc sort data=have out=want ;
by code date volume;
run;
data want;
set want;
by code date ;
if last.date;
run;
It worked!
Many thanks tom
For this type of summary I would use PROC SUMMARY. You could use MAX= parameter but IDGROUP works with both character and numeric variables.
data have;
inptut code date :monyy. volume;
format date monyy7.;
cards;
3 jun1996 100
3 jul1996 110
3 jul1996 120
3 aug1996 130
4 jun1996 105
4 jul1996 110
4 jul1996 110
;;;;
run;
proc summary data=have nway;
class code date;
output out=new idgroup(max(volume) out(volume)=);
run;
proc print;
run;
While you already have two excellent answers, there are typically many ways of accomplishing the same thing in SAS. Thus, a third way of accomplishing what you want is with two sorts. E.g.,
proc sort data=have out=want ;
by code date descending volume;
run;
proc sort data=want nodupkey;
by code date;
run;
Thanks Art297, your 'third way' saved my day.
The fourth way after art.T
data have; input code date :monyy. volume; format date monyy7.; cards; 3 jun1996 100 3 jul1996 110 3 jul1996 120 3 aug1996 130 4 jun1996 105 4 jul1996 110 4 jul1996 110 ;;;; run; proc sql noprint; create table want as select distinct * from have group by code,date having volume eq max(volume); quit;
Ksharp
I have a related question. How happen if I want to keep the duplicates but delete the non-duplicates. I have posted the following dataset in this forum before. It looks like this:
participantID | Treatment_Start_Date | Assessment_Date | Scores |
1 | 13JAN2001 | 13JAN2001 | 5 |
1 | 13JAN2001 | 24MAR2001 | 6 |
1 | 13JAN2001 | 07MAY2001 | 8 |
1 | 15DEC2001 | 15DEC2001 | 9 |
2 | 01FEB2008 | 01FEB2008 | 5 |
2 | 01FEB2008 | 15MAY2008 | 2 |
2 | 01FEB2008 | 06JAN2009 | 1 |
2 | 15DEC2009 | 15DEC2009 | 3 |
2 | 15DEC2009 | 15JAN2010 | 5 |
2 | 26MAY2010 | 26MAY2010 | 4 |
OR
data have;
informat Treatment_Start_Date Assessment_Date date9.;
format Treatment_Start_Date Assessment_Date date9.;
input participantID Treatment_Start_Date Assessment_Date Scores;
cards;
1 13JAN2001 13JAN2001 5
1 13JAN2001 24MAR2001 6
1 13JAN2001 07MAY2001 8
1 15DEC2001 15DEC2001 9
2 01FEB2008 01FEB2008 5
2 01FEB2008 15MAY2008 2
2 01FEB2008 06JAN2009 1
2 15DEC2009 15DEC2009 3
2 15DEC2009 15JAN2010 5
2 26MAY2010 26MAY2010 4
;
run;
For participant 1 for example, I wan to keep only his three lines of data for 13JAN2001. For participant 2, I want to keep 01FEB2008 (three lines) and 15DEC2009 (two lines). I tried to find the "opposite" of NODUPKEY (something like KEEPDUPKEY") but have not succeed. Is there anyway to do this?
Thanks!
Chester
Very similar to the code Tom proposed earlier:
proc sort data=have out=want ;
by participantID Treatment_Start_Date;
run;
data want;
set want;
by code date ;
if not(first.Treatment_Start_Date and last.Treatment_Start_Date);
run;
How about:
data have; input code date1 : date9. date2 : date9. volume; format date1 date2 date9.; cards; 1 13JAN2001 13JAN2001 5 1 13JAN2001 24MAR2001 6 1 13JAN2001 07MAY2001 8 1 15DEC2001 15DEC2001 9 2 01FEB2008 01FEB2008 5 2 01FEB2008 15MAY2008 2 2 01FEB2008 06JAN2009 1 2 15DEC2009 15DEC2009 3 2 15DEC2009 15JAN2010 5 2 26MAY2010 26MAY2010 4 ; run; proc sql noprint; create table want as select * from have group by code,date1 having count(date1) gt 1; quit;
Ksharp
thanks for being helpful, Ksharp. I guess that proc sql is a very useful command that I should learn.
I used the following code to create a test data set:
data test0;
input ID1 ID2 date score;
datalines;
1 1.1 2004 8
1 1.1 2004 7
1 1.1 2004 1
2 1.2 2005 1
2 1.2 2006 1
2 1.2 2007 1
2 2.2 2005 8
2 2.2 2006 8
2 2.2 2007 8
3 3.1 2005 5
3 3.2 2005 6
3 3.3 2005 5
3 3.1 2006 5
3 3.2 2006 6
3 3.3 2006 5
3 3.1 2007 5
3 3.2 2007 6
3 3.3 2007 5
4 4.1 2005 8
4 4.1 2006 8
4 4.1 2007 8
5 5.1 2005 5
5 5.2 2006 6
5 5.3 2007 5
;
I want to test the presence of duplicate observations in the data sets.
The rule is ID1 (primary indicator) should be present only once for each date. For Example, ID1 = 2 and ID1 = 3 have duplicate observations as they have repeatations of the same value for ID1 for one particular value of date. However, ID1 =5 does not have a duplicate observation, although it's secondary indicator (ID2) changes it's value across dates.
Any help on this matter is highly appreciated.
I want to create an indicator variable that will take a value 1 for if a particular observation is a duplicate observation.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.