Hello, I need some help regarding deleting duplicate rows; my dataset has 3 columns and looks like this:
code date volume
3 jun1996 100
3 jul1996 110
3 jul1996 120
3 aug1996 130
4 jun1996 105
4 jul1996 110
4 jul1996 110
What I want to do is delete the rows that have the same code and date (volume can be different, I want to keep the one with the highest number; if these are the same just keep one)
Any ideas on how to do that?
So it would be like that:
code date volume
3 jun1996 100
3 jul1996 120
3 aug1996 130
4 jun1996 105
4 jul1996 110
Many thanks,
Costas
In general you want to sort and keep the last one per group.
proc sort data=have out=want ;
by code date volume;
run;
data want;
set want;
by code date ;
if last.date;
run;
In general you want to sort and keep the last one per group.
proc sort data=have out=want ;
by code date volume;
run;
data want;
set want;
by code date ;
if last.date;
run;
It worked!
Many thanks tom
For this type of summary I would use PROC SUMMARY. You could use MAX= parameter but IDGROUP works with both character and numeric variables.
data have;
inptut code date :monyy. volume;
format date monyy7.;
cards;
3 jun1996 100
3 jul1996 110
3 jul1996 120
3 aug1996 130
4 jun1996 105
4 jul1996 110
4 jul1996 110
;;;;
run;
proc summary data=have nway;
class code date;
output out=new idgroup(max(volume) out(volume)=);
run;
proc print;
run;
While you already have two excellent answers, there are typically many ways of accomplishing the same thing in SAS. Thus, a third way of accomplishing what you want is with two sorts. E.g.,
proc sort data=have out=want ;
by code date descending volume;
run;
proc sort data=want nodupkey;
by code date;
run;
Thanks Art297, your 'third way' saved my day.
The fourth way after art.T
data have; input code date :monyy. volume; format date monyy7.; cards; 3 jun1996 100 3 jul1996 110 3 jul1996 120 3 aug1996 130 4 jun1996 105 4 jul1996 110 4 jul1996 110 ;;;; run; proc sql noprint; create table want as select distinct * from have group by code,date having volume eq max(volume); quit;
Ksharp
I have a related question. How happen if I want to keep the duplicates but delete the non-duplicates. I have posted the following dataset in this forum before. It looks like this:
participantID | Treatment_Start_Date | Assessment_Date | Scores |
1 | 13JAN2001 | 13JAN2001 | 5 |
1 | 13JAN2001 | 24MAR2001 | 6 |
1 | 13JAN2001 | 07MAY2001 | 8 |
1 | 15DEC2001 | 15DEC2001 | 9 |
2 | 01FEB2008 | 01FEB2008 | 5 |
2 | 01FEB2008 | 15MAY2008 | 2 |
2 | 01FEB2008 | 06JAN2009 | 1 |
2 | 15DEC2009 | 15DEC2009 | 3 |
2 | 15DEC2009 | 15JAN2010 | 5 |
2 | 26MAY2010 | 26MAY2010 | 4 |
OR
data have;
informat Treatment_Start_Date Assessment_Date date9.;
format Treatment_Start_Date Assessment_Date date9.;
input participantID Treatment_Start_Date Assessment_Date Scores;
cards;
1 13JAN2001 13JAN2001 5
1 13JAN2001 24MAR2001 6
1 13JAN2001 07MAY2001 8
1 15DEC2001 15DEC2001 9
2 01FEB2008 01FEB2008 5
2 01FEB2008 15MAY2008 2
2 01FEB2008 06JAN2009 1
2 15DEC2009 15DEC2009 3
2 15DEC2009 15JAN2010 5
2 26MAY2010 26MAY2010 4
;
run;
For participant 1 for example, I wan to keep only his three lines of data for 13JAN2001. For participant 2, I want to keep 01FEB2008 (three lines) and 15DEC2009 (two lines). I tried to find the "opposite" of NODUPKEY (something like KEEPDUPKEY") but have not succeed. Is there anyway to do this?
Thanks!
Chester
Very similar to the code Tom proposed earlier:
proc sort data=have out=want ;
by participantID Treatment_Start_Date;
run;
data want;
set want;
by code date ;
if not(first.Treatment_Start_Date and last.Treatment_Start_Date);
run;
How about:
data have; input code date1 : date9. date2 : date9. volume; format date1 date2 date9.; cards; 1 13JAN2001 13JAN2001 5 1 13JAN2001 24MAR2001 6 1 13JAN2001 07MAY2001 8 1 15DEC2001 15DEC2001 9 2 01FEB2008 01FEB2008 5 2 01FEB2008 15MAY2008 2 2 01FEB2008 06JAN2009 1 2 15DEC2009 15DEC2009 3 2 15DEC2009 15JAN2010 5 2 26MAY2010 26MAY2010 4 ; run; proc sql noprint; create table want as select * from have group by code,date1 having count(date1) gt 1; quit;
Ksharp
thanks for being helpful, Ksharp. I guess that proc sql is a very useful command that I should learn.
I used the following code to create a test data set:
data test0;
input ID1 ID2 date score;
datalines;
1 1.1 2004 8
1 1.1 2004 7
1 1.1 2004 1
2 1.2 2005 1
2 1.2 2006 1
2 1.2 2007 1
2 2.2 2005 8
2 2.2 2006 8
2 2.2 2007 8
3 3.1 2005 5
3 3.2 2005 6
3 3.3 2005 5
3 3.1 2006 5
3 3.2 2006 6
3 3.3 2006 5
3 3.1 2007 5
3 3.2 2007 6
3 3.3 2007 5
4 4.1 2005 8
4 4.1 2006 8
4 4.1 2007 8
5 5.1 2005 5
5 5.2 2006 6
5 5.3 2007 5
;
I want to test the presence of duplicate observations in the data sets.
The rule is ID1 (primary indicator) should be present only once for each date. For Example, ID1 = 2 and ID1 = 3 have duplicate observations as they have repeatations of the same value for ID1 for one particular value of date. However, ID1 =5 does not have a duplicate observation, although it's secondary indicator (ID2) changes it's value across dates.
Any help on this matter is highly appreciated.
I want to create an indicator variable that will take a value 1 for if a particular observation is a duplicate observation.
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.