Dear SAS Community:
I have certain data. Basically it is panel data with company_id, year and quarter (qtr). I would like to remove duplicate records.
I know I can use the following command:
proc sort data=test nodupkey out=test 1;
by year qtr;
run;
However this randomly removes duplicates. I have certain variables (var1, var2, var3, and var4 that are percentages and I would like to keep the maximum values for each year-quarter for var3 and var4. So for example in 1999 I have two different rows- one for which var3 and var4 are equal to 0.5 and another one for which var3 and var4 are equal to 1. How can I keep the one with the higher value?
data work.TEST;
infile datalines dsd truncover;
input company_id:BEST12. year:32. qtr:32. var1:32. var2:32. var3:32. var4:32.;
format company_id BEST12.;
datalines4;
11334,1992,2,0,0,0,0
11334,1992,3,0,0,0,0
11334,1992,3,0,0,0,0
11334,1992,3,0,0,0,0
11334,1992,4,0,0,0,0
11334,1994,2,0,0,0,0
11334,1994,3,0,0,0,0
11334,1994,3,0,0,0,0
11334,1994,3,0,0,0,0
11334,1994,4,0,0,0,0
11334,1995,2,0,0,0,0
11334,1995,3,0,0,0,0
11334,1995,3,0,0,0,0
11334,1995,3,0,0,0,0
11334,1995,4,0,0,0,0
11334,1996,2,0,0,0,0
11334,1996,3,0,0,0,0
11334,1996,3,0,0,0,0
11334,1996,3,0,0,0,0
11334,1996,4,0,0,0,0
11334,1997,2,0,0,0,0
11334,1997,3,0,0,0,0
11334,1997,3,0,0,0,0
11334,1997,3,0,0,0,0
11334,1997,4,0,0,0,0
11334,1998,2,0,0,0,0
11334,1998,3,0,0,0,0
11334,1998,3,0,0,0,0
11334,1998,3,0,0,0,0
11334,1998,4,0,0,0,0
11334,1999,2,0.5,0.5,0.5,0.5
11334,1999,2,1,1,1,1
11334,1999,3,0.5,0.5,0.5,0.5
11334,1999,3,1,1,1,1
11334,1999,3,0.5,0.5,0.5,0.5
11334,1999,3,1,1,1,1
11334,1999,3,0.5,0.5,0.5,0.5
11334,1999,3,1,1,1,1
11334,1999,4,0.5,0.5,0.5,0.5
11334,1999,4,1,1,1,1
11334,2000,2,0,0,0,0
11334,2000,2,0,0,0,0
11334,2000,3,0,0,0,0
11334,2000,3,0,0,0,0
11334,2000,3,0,0,0,0
11334,2000,3,0,0,0,0
11334,2000,3,0,0,0,0
11334,2000,3,0,0,0,0
11334,2000,4,0,0,0,0
11334,2000,4,0,0,0,0
11334,2001,2,0,0,0,0
11334,2001,2,0,0,0,0
11334,2001,3,0,0,0,0
11334,2001,3,0,0,0,0
11334,2001,3,0,0,0,0
11334,2001,3,0,0,0,0
11334,2001,3,0,0,0,0
11334,2001,3,0,0,0,0
11334,2001,4,0,0,0,0
11334,2001,4,0,0,0,0
11334,2002,2,0,0,0,0
11334,2002,2,0,0,0,0
11334,2002,3,0,0,0,0
11334,2002,3,0,0,0,0
11334,2002,3,0,0,0,0
11334,2002,3,0,0,0,0
11334,2002,3,0,0,0,0
11334,2002,3,0,0,0,0
11334,2002,4,0,0,0,0
11334,2002,4,0,0,0,0
11334,2003,2,1,1,1,1
11334,2003,3,1,1,1,1
11334,2003,3,1,1,1,1
11334,2003,3,1,1,1,1
11334,2003,4,1,1,1,1
11334,2004,2,0,0,0,0
11334,2004,3,0,0,0,0
11334,2004,3,0,0,0,0
11334,2004,3,0,0,0,0
11334,2004,4,0,0,0,0
11334,2005,2,0,0,0,0
11334,2005,3,0,0,0,0
11334,2005,3,0,0,0,0
11334,2005,3,0,0,0,0
11334,2005,4,0,0,0,0
11334,2006,2,0,0,0,0
11334,2006,3,0,0,0,0
11334,2006,3,0,0,0,0
11334,2006,3,0,0,0,0
11334,2006,4,0,0,0,0
11334,2007,2,0,0,0,0
11334,2007,3,0,0,0,0
11334,2007,3,0,0,0,0
11334,2007,3,0,0,0,0
11334,2007,4,0,0,0,0
11334,2008,2,0,0,0,0
11334,2008,3,0,0,0,0
11334,2008,3,0,0,0,0
11334,2008,3,0,0,0,0
11334,2008,4,0,0,0,0
11334,2009,2,0,0,0,0
11334,2009,3,0,0,0,0
11334,2009,3,0,0,0,0
11334,2009,3,0,0,0,0
11334,2009,4,0,0,0,0
11334,2011,2,0,0.5,0.5,0.5
11334,2011,3,0,0.5,0.5,0.5
11334,2011,3,0,0.5,0.5,0.5
11334,2011,3,0,0.5,0.5,0.5
11334,2011,4,0,0.5,0.5,0.5
11334,2013,2,0.5,0.5,0.5,0.5
11334,2013,3,0.5,0.5,0.5,0.5
11334,2013,3,0.5,0.5,0.5,0.5
11334,2013,3,0.5,0.5,0.5,0.5
11334,2013,4,0.5,0.5,0.5,0.5
11334,2014,2,0,0,0,0
11334,2014,3,0,0,0,0
11334,2014,3,0,0,0,0
11334,2014,3,0,0,0,0
11334,2014,4,0,0,0,0
11334,2016,2,0,0,0,0
11334,2016,3,0,0,0,0
11334,2016,3,0,0,0,0
11334,2016,3,0,0,0,0
11334,2016,4,0,0,0,0
;;;;
Here is a hash approach that does not require sorting.
data _null_;
if _N_ = 1 then do;
declare hash h(dataset:'test(obs=0)', ordered:'A');
h.defineKey('company_id', 'year', 'qtr');
h.defineData(all:'Y');
h.defineDone();
end;
set test end=eof;
if h.check() ne 0 then h.add();
else do;
_var1=var1;_var2=var2;_var3=var3;_var4=var4;
rc=h.find();
if var3 <= _var3 then do;
var1=_var1;var2=_var2;var3=_var3;var4=_var4;
h.replace();
end;
end;
if eof then h.output(dataset:'want');
run;
@Kurt_Bremsers and my solution generates the exact same data set though 🙂
Sort by company_id, year, quarter and the selective variable, then use last. to extract the higher value:
proc sort data=have;
by company_id year quarter var3;
run;
data want;
set have;
by company_id year quarter;
if last.quarter;
run;
Thank you sir!
Here is a hash approach that does not require sorting.
data _null_;
if _N_ = 1 then do;
declare hash h(dataset:'test(obs=0)', ordered:'A');
h.defineKey('company_id', 'year', 'qtr');
h.defineData(all:'Y');
h.defineDone();
end;
set test end=eof;
if h.check() ne 0 then h.add();
else do;
_var1=var1;_var2=var2;_var3=var3;_var4=var4;
rc=h.find();
if var3 <= _var3 then do;
var1=_var1;var2=_var2;var3=_var3;var4=_var4;
h.replace();
end;
end;
if eof then h.output(dataset:'want');
run;
@Kurt_Bremsers and my solution generates the exact same data set though 🙂
SAS gives multiple approaches to the same issue.
This is mine
proc sql;
create table work.test2 as
select year,qtr , Max(var3) as var3, Max(var4) as var4 from work.test
group by year, qtr
order by year, qtr;
quit;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.