- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Dear SAS Community:
I have certain data. Basically it is panel data with company_id, year and quarter (qtr). I would like to remove duplicate records.
I know I can use the following command:
proc sort data=test nodupkey out=test 1;
by year qtr;
run;
However this randomly removes duplicates. I have certain variables (var1, var2, var3, and var4 that are percentages and I would like to keep the maximum values for each year-quarter for var3 and var4. So for example in 1999 I have two different rows- one for which var3 and var4 are equal to 0.5 and another one for which var3 and var4 are equal to 1. How can I keep the one with the higher value?
data work.TEST;
infile datalines dsd truncover;
input company_id:BEST12. year:32. qtr:32. var1:32. var2:32. var3:32. var4:32.;
format company_id BEST12.;
datalines4;
11334,1992,2,0,0,0,0
11334,1992,3,0,0,0,0
11334,1992,3,0,0,0,0
11334,1992,3,0,0,0,0
11334,1992,4,0,0,0,0
11334,1994,2,0,0,0,0
11334,1994,3,0,0,0,0
11334,1994,3,0,0,0,0
11334,1994,3,0,0,0,0
11334,1994,4,0,0,0,0
11334,1995,2,0,0,0,0
11334,1995,3,0,0,0,0
11334,1995,3,0,0,0,0
11334,1995,3,0,0,0,0
11334,1995,4,0,0,0,0
11334,1996,2,0,0,0,0
11334,1996,3,0,0,0,0
11334,1996,3,0,0,0,0
11334,1996,3,0,0,0,0
11334,1996,4,0,0,0,0
11334,1997,2,0,0,0,0
11334,1997,3,0,0,0,0
11334,1997,3,0,0,0,0
11334,1997,3,0,0,0,0
11334,1997,4,0,0,0,0
11334,1998,2,0,0,0,0
11334,1998,3,0,0,0,0
11334,1998,3,0,0,0,0
11334,1998,3,0,0,0,0
11334,1998,4,0,0,0,0
11334,1999,2,0.5,0.5,0.5,0.5
11334,1999,2,1,1,1,1
11334,1999,3,0.5,0.5,0.5,0.5
11334,1999,3,1,1,1,1
11334,1999,3,0.5,0.5,0.5,0.5
11334,1999,3,1,1,1,1
11334,1999,3,0.5,0.5,0.5,0.5
11334,1999,3,1,1,1,1
11334,1999,4,0.5,0.5,0.5,0.5
11334,1999,4,1,1,1,1
11334,2000,2,0,0,0,0
11334,2000,2,0,0,0,0
11334,2000,3,0,0,0,0
11334,2000,3,0,0,0,0
11334,2000,3,0,0,0,0
11334,2000,3,0,0,0,0
11334,2000,3,0,0,0,0
11334,2000,3,0,0,0,0
11334,2000,4,0,0,0,0
11334,2000,4,0,0,0,0
11334,2001,2,0,0,0,0
11334,2001,2,0,0,0,0
11334,2001,3,0,0,0,0
11334,2001,3,0,0,0,0
11334,2001,3,0,0,0,0
11334,2001,3,0,0,0,0
11334,2001,3,0,0,0,0
11334,2001,3,0,0,0,0
11334,2001,4,0,0,0,0
11334,2001,4,0,0,0,0
11334,2002,2,0,0,0,0
11334,2002,2,0,0,0,0
11334,2002,3,0,0,0,0
11334,2002,3,0,0,0,0
11334,2002,3,0,0,0,0
11334,2002,3,0,0,0,0
11334,2002,3,0,0,0,0
11334,2002,3,0,0,0,0
11334,2002,4,0,0,0,0
11334,2002,4,0,0,0,0
11334,2003,2,1,1,1,1
11334,2003,3,1,1,1,1
11334,2003,3,1,1,1,1
11334,2003,3,1,1,1,1
11334,2003,4,1,1,1,1
11334,2004,2,0,0,0,0
11334,2004,3,0,0,0,0
11334,2004,3,0,0,0,0
11334,2004,3,0,0,0,0
11334,2004,4,0,0,0,0
11334,2005,2,0,0,0,0
11334,2005,3,0,0,0,0
11334,2005,3,0,0,0,0
11334,2005,3,0,0,0,0
11334,2005,4,0,0,0,0
11334,2006,2,0,0,0,0
11334,2006,3,0,0,0,0
11334,2006,3,0,0,0,0
11334,2006,3,0,0,0,0
11334,2006,4,0,0,0,0
11334,2007,2,0,0,0,0
11334,2007,3,0,0,0,0
11334,2007,3,0,0,0,0
11334,2007,3,0,0,0,0
11334,2007,4,0,0,0,0
11334,2008,2,0,0,0,0
11334,2008,3,0,0,0,0
11334,2008,3,0,0,0,0
11334,2008,3,0,0,0,0
11334,2008,4,0,0,0,0
11334,2009,2,0,0,0,0
11334,2009,3,0,0,0,0
11334,2009,3,0,0,0,0
11334,2009,3,0,0,0,0
11334,2009,4,0,0,0,0
11334,2011,2,0,0.5,0.5,0.5
11334,2011,3,0,0.5,0.5,0.5
11334,2011,3,0,0.5,0.5,0.5
11334,2011,3,0,0.5,0.5,0.5
11334,2011,4,0,0.5,0.5,0.5
11334,2013,2,0.5,0.5,0.5,0.5
11334,2013,3,0.5,0.5,0.5,0.5
11334,2013,3,0.5,0.5,0.5,0.5
11334,2013,3,0.5,0.5,0.5,0.5
11334,2013,4,0.5,0.5,0.5,0.5
11334,2014,2,0,0,0,0
11334,2014,3,0,0,0,0
11334,2014,3,0,0,0,0
11334,2014,3,0,0,0,0
11334,2014,4,0,0,0,0
11334,2016,2,0,0,0,0
11334,2016,3,0,0,0,0
11334,2016,3,0,0,0,0
11334,2016,3,0,0,0,0
11334,2016,4,0,0,0,0
;;;;
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here is a hash approach that does not require sorting.
data _null_;
if _N_ = 1 then do;
declare hash h(dataset:'test(obs=0)', ordered:'A');
h.defineKey('company_id', 'year', 'qtr');
h.defineData(all:'Y');
h.defineDone();
end;
set test end=eof;
if h.check() ne 0 then h.add();
else do;
_var1=var1;_var2=var2;_var3=var3;_var4=var4;
rc=h.find();
if var3 <= _var3 then do;
var1=_var1;var2=_var2;var3=_var3;var4=_var4;
h.replace();
end;
end;
if eof then h.output(dataset:'want');
run;
@Kurt_Bremsers and my solution generates the exact same data set though 🙂
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Sort by company_id, year, quarter and the selective variable, then use last. to extract the higher value:
proc sort data=have;
by company_id year quarter var3;
run;
data want;
set have;
by company_id year quarter;
if last.quarter;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you sir!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here is a hash approach that does not require sorting.
data _null_;
if _N_ = 1 then do;
declare hash h(dataset:'test(obs=0)', ordered:'A');
h.defineKey('company_id', 'year', 'qtr');
h.defineData(all:'Y');
h.defineDone();
end;
set test end=eof;
if h.check() ne 0 then h.add();
else do;
_var1=var1;_var2=var2;_var3=var3;_var4=var4;
rc=h.find();
if var3 <= _var3 then do;
var1=_var1;var2=_var2;var3=_var3;var4=_var4;
h.replace();
end;
end;
if eof then h.output(dataset:'want');
run;
@Kurt_Bremsers and my solution generates the exact same data set though 🙂
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
SAS gives multiple approaches to the same issue.
This is mine
proc sql;
create table work.test2 as
select year,qtr , Max(var3) as var3, Max(var4) as var4 from work.test
group by year, qtr
order by year, qtr;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
proc means data=have noprint Nway;
class company year qtr;
output out=want max = / autoname autolabel;
run;