Assume I have data of the sort:
Customer ID | Account | Create Date |
1 | 123 | 01-jan-20 |
1 | 345 | 02-jan-20 |
2 | 300 | 10-jan-20 |
1 | 200 | 05-jan-20 |
Notice how I have three duplicates in terms of Customer_ID, I would only like to keep the last observation among these three duplicates since it is the newest observation. For example, 05-jan-20 is "younger" than (01-jan-20 and 02-jan-20) and therefore I want to keep the observation from 05-jan-20.
Any advide on how I could do this?
Thanks.
proc sort data=have;
by customerID createdate;
run;
data want;
set have;
by customerID;
if last.customerID;
run;
proc sort data=have;
by customerID createdate;
run;
data want;
set have;
by customerID;
if last.customerID;
run;
proc sql;
create table want as select *
from yourtable
group by customerID
having createdate=max(createdate);
quit;
In real life, I would expect that your data set contains additional variables. If you want to keep the most recent value for all variables, it is possible that the most recent observation for a variable contains a missing value while an earlier observation contains a nonmissing value. To keep the most recent non-missing value (possibly piecing together the final observation from several original observations), use:
proc sort data=have;
by customerID createdate;
run;
data want;
update have (obs=0) have;
by customerID;
run;
If sorting is expensive, here's an alternative:
proc summary data=have nway;
class customer_id;
var create_date;
output out=want (drop=_:)
max(create_date)=create_date
maxid(create_date(_all_))=;
run;
Not only is pre-sorting unnecessary, but you could also ask for 1-way, 2-way, n-way classifications. This would be handy if you need something like the most recent date not only for each customer, but also for (say) each product (which might otherwise require multiple pre-sorts).
The "maxid(create_date(_all_))=" expression tells proc summary to output the entire record (all the variables) containing the corresponding most-recent-date.
Edited comment: The program above will also generate these messages:
WARNING: Variable customer_id already exists on file WORK.WANT.
WARNING: Variable create_date already exists on file WORK.WANT.
WARNING: The duplicate variables will not be included in the output data set of the output statement number 1.
This is due to the fact that the CLASS and VAR statements name variables (customer_id and create_date, respectively) that will be in the output dataset. However, they are also implied in the "(_all_)" subparameter of the maxid parameter of the OUTPUT statement. Now you could eliminate the warning for the create_date variable via:
proc summary data=have nway;
class customer_id;
var create_date;
output out=want (drop=_: )
max(create_date)=_max_create_date
maxid(create_date(_all_))=;
run;
Since the record for maxid(create_date) has the actual maximum value for create_date, renaming the generated maximum value statistic to "_max_create_date" (subsequently dropped) simply eliminates a duplicate value as well as the superfluous warning.
There may be a way to get rid of the remaining warning for the class variable, while using the "_all_" parameter, but it eludes me at the moment.
To add two more ways to solve the problem:
/* 1) double sort */
proc sort data=have;
by CustomerID descending CreateDate;
run;
proc sort data=have out=want nodupkey;
by CustomerID;
run;
If proc sort seems to be to simple, you could use a hash-object:
data _null_;
if 0 then set have;
if _n_ = 1 then do;
declare hash h();
h.defineKey('CustomerID');
h.defineData('CustomerID', 'Account', 'CreateDate');
h.defineDone();
end;
set have(rename=(Account = acc CreateDate = cd)) end=jobDone;
if h.find() ^= 0 then do;
Account = acc;
CreateDate = cd;
h.add();
end;
else do;
if CreateDate < cd then do;
Account = acc;
CreateDate = cd;
h.update();
end;
end;
if jobDone then do;
h.output(dataset: 'work.want2');
end;
run;
You've got already solutions how to keep the row with the most recent date per customer id.
Assuming your sample data somehow represents your actual data: Isn't it possible that a customer can have more than one account? If so then wouldn't you need to keep the most recent row per customer_id AND account_id? ...and potentially also include some other column in your logic with an expiration date or account status?
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.