Solved: Removing duplicate rows with additional conditions

Agent1592 · Posted 01-06-2019 05:13 AM

Dear SAS Community:

I have certain data. Basically it is panel data with company_id, year and quarter (qtr). I would like to remove duplicate records.

I know I can use the following command:

proc sort data=test nodupkey out=test 1;
by year qtr;
run;

However this randomly removes duplicates. I have certain variables (var1, var2, var3, and var4 that are percentages and I would like to keep the maximum values for each year-quarter for var3 and var4. So for example in 1999 I have two different rows- one for which var3 and var4 are equal to 0.5 and another one for which var3 and var4 are equal to 1. How can I keep the one with the higher value?

data work.TEST;
  infile datalines dsd truncover;
  input company_id:BEST12. year:32. qtr:32. var1:32. var2:32. var3:32. var4:32.;
  format company_id BEST12.;
datalines4;
11334,1992,2,0,0,0,0
11334,1992,3,0,0,0,0
11334,1992,3,0,0,0,0
11334,1992,3,0,0,0,0
11334,1992,4,0,0,0,0
11334,1994,2,0,0,0,0
11334,1994,3,0,0,0,0
11334,1994,3,0,0,0,0
11334,1994,3,0,0,0,0
11334,1994,4,0,0,0,0
11334,1995,2,0,0,0,0
11334,1995,3,0,0,0,0
11334,1995,3,0,0,0,0
11334,1995,3,0,0,0,0
11334,1995,4,0,0,0,0
11334,1996,2,0,0,0,0
11334,1996,3,0,0,0,0
11334,1996,3,0,0,0,0
11334,1996,3,0,0,0,0
11334,1996,4,0,0,0,0
11334,1997,2,0,0,0,0
11334,1997,3,0,0,0,0
11334,1997,3,0,0,0,0
11334,1997,3,0,0,0,0
11334,1997,4,0,0,0,0
11334,1998,2,0,0,0,0
11334,1998,3,0,0,0,0
11334,1998,3,0,0,0,0
11334,1998,3,0,0,0,0
11334,1998,4,0,0,0,0
11334,1999,2,0.5,0.5,0.5,0.5
11334,1999,2,1,1,1,1
11334,1999,3,0.5,0.5,0.5,0.5
11334,1999,3,1,1,1,1
11334,1999,3,0.5,0.5,0.5,0.5
11334,1999,3,1,1,1,1
11334,1999,3,0.5,0.5,0.5,0.5
11334,1999,3,1,1,1,1
11334,1999,4,0.5,0.5,0.5,0.5
11334,1999,4,1,1,1,1
11334,2000,2,0,0,0,0
11334,2000,2,0,0,0,0
11334,2000,3,0,0,0,0
11334,2000,3,0,0,0,0
11334,2000,3,0,0,0,0
11334,2000,3,0,0,0,0
11334,2000,3,0,0,0,0
11334,2000,3,0,0,0,0
11334,2000,4,0,0,0,0
11334,2000,4,0,0,0,0
11334,2001,2,0,0,0,0
11334,2001,2,0,0,0,0
11334,2001,3,0,0,0,0
11334,2001,3,0,0,0,0
11334,2001,3,0,0,0,0
11334,2001,3,0,0,0,0
11334,2001,3,0,0,0,0
11334,2001,3,0,0,0,0
11334,2001,4,0,0,0,0
11334,2001,4,0,0,0,0
11334,2002,2,0,0,0,0
11334,2002,2,0,0,0,0
11334,2002,3,0,0,0,0
11334,2002,3,0,0,0,0
11334,2002,3,0,0,0,0
11334,2002,3,0,0,0,0
11334,2002,3,0,0,0,0
11334,2002,3,0,0,0,0
11334,2002,4,0,0,0,0
11334,2002,4,0,0,0,0
11334,2003,2,1,1,1,1
11334,2003,3,1,1,1,1
11334,2003,3,1,1,1,1
11334,2003,3,1,1,1,1
11334,2003,4,1,1,1,1
11334,2004,2,0,0,0,0
11334,2004,3,0,0,0,0
11334,2004,3,0,0,0,0
11334,2004,3,0,0,0,0
11334,2004,4,0,0,0,0
11334,2005,2,0,0,0,0
11334,2005,3,0,0,0,0
11334,2005,3,0,0,0,0
11334,2005,3,0,0,0,0
11334,2005,4,0,0,0,0
11334,2006,2,0,0,0,0
11334,2006,3,0,0,0,0
11334,2006,3,0,0,0,0
11334,2006,3,0,0,0,0
11334,2006,4,0,0,0,0
11334,2007,2,0,0,0,0
11334,2007,3,0,0,0,0
11334,2007,3,0,0,0,0
11334,2007,3,0,0,0,0
11334,2007,4,0,0,0,0
11334,2008,2,0,0,0,0
11334,2008,3,0,0,0,0
11334,2008,3,0,0,0,0
11334,2008,3,0,0,0,0
11334,2008,4,0,0,0,0
11334,2009,2,0,0,0,0
11334,2009,3,0,0,0,0
11334,2009,3,0,0,0,0
11334,2009,3,0,0,0,0
11334,2009,4,0,0,0,0
11334,2011,2,0,0.5,0.5,0.5
11334,2011,3,0,0.5,0.5,0.5
11334,2011,3,0,0.5,0.5,0.5
11334,2011,3,0,0.5,0.5,0.5
11334,2011,4,0,0.5,0.5,0.5
11334,2013,2,0.5,0.5,0.5,0.5
11334,2013,3,0.5,0.5,0.5,0.5
11334,2013,3,0.5,0.5,0.5,0.5
11334,2013,3,0.5,0.5,0.5,0.5
11334,2013,4,0.5,0.5,0.5,0.5
11334,2014,2,0,0,0,0
11334,2014,3,0,0,0,0
11334,2014,3,0,0,0,0
11334,2014,3,0,0,0,0
11334,2014,4,0,0,0,0
11334,2016,2,0,0,0,0
11334,2016,3,0,0,0,0
11334,2016,3,0,0,0,0
11334,2016,3,0,0,0,0
11334,2016,4,0,0,0,0
;;;;

PeterClemmensen · Posted 01-06-2019 12:07 PM

Here is a hash approach that does not require sorting.

data _null_;
   if _N_ = 1 then do;
      declare hash h(dataset:'test(obs=0)', ordered:'A');
      h.defineKey('company_id', 'year', 'qtr');
      h.defineData(all:'Y');
      h.defineDone();
   end;

   set test end=eof;

   if h.check() ne 0 then h.add();
   else do;
      _var1=var1;_var2=var2;_var3=var3;_var4=var4;
      rc=h.find();
      if var3 <= _var3 then do;
         var1=_var1;var2=_var2;var3=_var3;var4=_var4;
         h.replace();
      end;
   end;

   if eof then h.output(dataset:'want');
run;

@Kurt_Bremsers and my solution generates the exact same data set though 🙂

The DATA to DATA Step Macro
Blog: SASnrd

View solution in original post

Kurt_Bremser · Posted 01-06-2019 05:56 AM

Sort by company_id, year, quarter and the selective variable, then use last. to extract the higher value:

proc sort data=have;
by company_id year quarter var3;
run;

data want;
set have;
by company_id year quarter;
if last.quarter;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Agent1592 · Posted 01-06-2019 05:01 PM

Thank you sir!

PeterClemmensen · Posted 01-06-2019 12:07 PM

Here is a hash approach that does not require sorting.

data _null_;
   if _N_ = 1 then do;
      declare hash h(dataset:'test(obs=0)', ordered:'A');
      h.defineKey('company_id', 'year', 'qtr');
      h.defineData(all:'Y');
      h.defineDone();
   end;

   set test end=eof;

   if h.check() ne 0 then h.add();
   else do;
      _var1=var1;_var2=var2;_var3=var3;_var4=var4;
      rc=h.find();
      if var3 <= _var3 then do;
         var1=_var1;var2=_var2;var3=_var3;var4=_var4;
         h.replace();
      end;
   end;

   if eof then h.output(dataset:'want');
run;

@Kurt_Bremsers and my solution generates the exact same data set though 🙂

The DATA to DATA Step Macro
Blog: SASnrd

Agent1592 · Posted 01-06-2019 05:01 PM

Thank you!

Sajid01 · Posted 01-06-2019 12:56 PM

SAS gives multiple approaches to the same issue.

This is mine

proc sql;
create table work.test2 as
select year,qtr , Max(var3) as var3, Max(var4) as var4 from work.test
group by year, qtr
order by year, qtr;
quit;

Reeza · Posted 01-06-2019 02:23 PM

If you want the MAX for each value for a particular grouping, PROC MEANS/SUMMARY is the most efficient method.

proc means data=have noprint Nway;
class company year qtr;
output out=want max = / autoname autolabel;
run;

Removing duplicate rows with additional conditions

Re: Removing duplicate rows with additional conditions

Re: Removing duplicate rows with additional conditions

Re: Removing duplicate rows with additional conditions

Re: Removing duplicate rows with additional conditions

Re: Removing duplicate rows with additional conditions

Re: Removing duplicate rows with additional conditions

Re: Removing duplicate rows with additional conditions

Registration is open

Call for Content EXTENDED

Registration is open

Call for Content EXTENDED

SAS Training: Just a Click Away