Solved: Duplicate data with different dates - keep the newest

SasStatistics · Posted 06-14-2021 03:32 PM

Assume I have data of the sort:

Customer ID	Account	Create Date
1	123	01-jan-20
1	345	02-jan-20
2	300	10-jan-20
1	*200*	*05-jan-20*

Notice how I have three duplicates in terms of Customer_ID, I would only like to keep the last observation among these three duplicates since it is the newest observation. For example, 05-jan-20 is "younger" than (01-jan-20 and 02-jan-20) and therefore I want to keep the observation from 05-jan-20.

Any advide on how I could do this?

Thanks.

PaigeMiller · Posted 06-14-2021 03:44 PM

proc sort data=have;
    by customerID createdate;
run;
data want;
    set have;
    by customerID;
    if last.customerID;
run;

--
Paige Miller

View solution in original post

PaigeMiller · Posted 06-14-2021 03:44 PM

proc sort data=have;
    by customerID createdate;
run;
data want;
    set have;
    by customerID;
    if last.customerID;
run;

--
Paige Miller

tarheel13 · Posted 06-14-2021 03:53 PM

proc sql;
create table want as select * 
    from yourtable
     group by customerID
     having createdate=max(createdate);
quit;

Astounding · Posted 06-14-2021 04:49 PM

In real life, I would expect that your data set contains additional variables. If you want to keep the most recent value for all variables, it is possible that the most recent observation for a variable contains a missing value while an earlier observation contains a nonmissing value. To keep the most recent non-missing value (possibly piecing together the final observation from several original observations), use:

proc sort data=have;
    by customerID createdate;
run;
data want;
    update have (obs=0) have;
    by customerID;
run;

mkeintz · Posted 06-14-2021 08:22 PM

If sorting is expensive, here's an alternative:

proc summary data=have nway;
  class customer_id;
  var create_date;
  output out=want (drop=_:) 
    max(create_date)=create_date 
    maxid(create_date(_all_))=;
run;

Not only is pre-sorting unnecessary, but you could also ask for 1-way, 2-way, n-way classifications. This would be handy if you need something like the most recent date not only for each customer, but also for (say) each product (which might otherwise require multiple pre-sorts).

The "maxid(create_date(_all_))=" expression tells proc summary to output the entire record (all the variables) containing the corresponding most-recent-date.

Edited comment: The program above will also generate these messages:

WARNING: Variable customer_id already exists on file WORK.WANT.
WARNING: Variable create_date already exists on file WORK.WANT.
WARNING: The duplicate variables will not be included in the output data set of the output statement number 1.

This is due to the fact that the CLASS and VAR statements name variables (customer_id and create_date, respectively) that will be in the output dataset. However, they are also implied in the "(_all_)" subparameter of the maxid parameter of the OUTPUT statement. Now you could eliminate the warning for the create_date variable via:

proc summary data=have nway;
  class customer_id;
  var create_date;
  output out=want (drop=_: ) 
    max(create_date)=_max_create_date 
    maxid(create_date(_all_))=;
run;

Since the record for maxid(create_date) has the actual maximum value for create_date, renaming the generated maximum value statistic to "_max_create_date" (subsequently dropped) simply eliminates a duplicate value as well as the superfluous warning.

There may be a way to get rid of the remaining warning for the class variable, while using the "_all_" parameter, but it eludes me at the moment.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

andreas_lds · Posted 06-15-2021 02:21 AM

To add two more ways to solve the problem:

/* 1) double sort */
proc sort data=have;
   by CustomerID descending CreateDate;
run;

proc sort data=have out=want nodupkey;
   by CustomerID;
run;

If proc sort seems to be to simple, you could use a hash-object:

data _null_;
   if 0 then set have;
   
   if _n_ = 1 then do;
      declare hash h();
      h.defineKey('CustomerID');
      h.defineData('CustomerID', 'Account', 'CreateDate');
      h.defineDone();
   end;
   
   set have(rename=(Account = acc CreateDate = cd)) end=jobDone;
   
   if h.find() ^= 0 then do;
      Account = acc;
      CreateDate = cd;
      h.add();
   end;
   else do;
      if CreateDate < cd then do;
         Account = acc;
         CreateDate = cd;
         h.update();
      end;
   end;
   
   if jobDone then do;
      h.output(dataset: 'work.want2');
   end;
run;

Patrick · Posted 06-15-2021 02:48 AM

@SasStatistics

You've got already solutions how to keep the row with the most recent date per customer id.

Assuming your sample data somehow represents your actual data: Isn't it possible that a customer can have more than one account? If so then wouldn't you need to keep the most recent row per customer_id AND account_id? ...and potentially also include some other column in your logic with an expiration date or account status?

Duplicate data with different dates - keep the newest

Re: Duplicate data with different dates - keep the newest

Re: Duplicate data with different dates - keep the newest

Re: Duplicate data with different dates - keep the newest

Re: Duplicate data with different dates - keep the newest

Re: Duplicate data with different dates - keep the newest

Re: Duplicate data with different dates - keep the newest

Re: Duplicate data with different dates - keep the newest

SAS Innovate 2025: Call for Content

Classroom Training Available!