BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
SasStatistics
Pyrite | Level 9

Assume I have data of the sort: 

Customer ID  Account Create Date
1 123 01-jan-20
1 345 02-jan-20
2 300 10-jan-20
1 200 05-jan-20

 

Notice how I have three duplicates in terms of Customer_ID, I would only like to keep the last observation among these three duplicates since it is the newest observation. For example, 05-jan-20 is "younger" than (01-jan-20 and 02-jan-20) and therefore I want to keep the observation from 05-jan-20.

Any advide on how I could do this? 

Thanks. 

1 ACCEPTED SOLUTION

Accepted Solutions
PaigeMiller
Diamond | Level 26
proc sort data=have;
    by customerID createdate;
run;
data want;
    set have;
    by customerID;
    if last.customerID;
run;
--
Paige Miller

View solution in original post

6 REPLIES 6
PaigeMiller
Diamond | Level 26
proc sort data=have;
    by customerID createdate;
run;
data want;
    set have;
    by customerID;
    if last.customerID;
run;
--
Paige Miller
tarheel13
Rhodochrosite | Level 12
proc sql;
create table want as select * 
    from yourtable
     group by customerID
     having createdate=max(createdate);
quit;
Astounding
PROC Star

In real life, I would expect that your data set contains additional variables.  If you want to keep the most recent value for all variables, it is possible that the most recent observation for a variable contains a missing value while an earlier observation contains a nonmissing value.  To keep the most recent non-missing value (possibly piecing together the final observation from several original observations), use:

proc sort data=have;
    by customerID createdate;
run;
data want;
    update have (obs=0) have;
    by customerID;
run;
mkeintz
PROC Star

If sorting is expensive, here's an alternative: 

 

proc summary data=have nway;
  class customer_id;
  var create_date;
  output out=want (drop=_:) 
    max(create_date)=create_date 
    maxid(create_date(_all_))=;
run;

Not only is pre-sorting unnecessary, but you could also ask for 1-way, 2-way, n-way classifications.  This would be handy if you need something like the most recent date not only for each customer, but also for (say) each product (which might otherwise require multiple pre-sorts).

 

The "maxid(create_date(_all_))=" expression tells proc summary to output the entire record (all the variables) containing the corresponding most-recent-date.

 

Edited comment:   The program above will also generate these messages:

 

WARNING: Variable customer_id already exists on file WORK.WANT.
WARNING: Variable create_date already exists on file WORK.WANT.
WARNING: The duplicate variables will not be included in the output data set of the output statement number 1.

 

This is due to the fact that the CLASS and VAR statements name variables (customer_id and create_date, respectively) that will be in the output dataset.  However, they are also implied in the "(_all_)" subparameter of the maxid parameter of the OUTPUT statement.  Now you could eliminate the warning for the create_date variable via:

 

 

proc summary data=have nway;
  class customer_id;
  var create_date;
  output out=want (drop=_: ) 
    max(create_date)=_max_create_date 
    maxid(create_date(_all_))=;
run;

 

 

Since the record for maxid(create_date) has the actual maximum value for create_date, renaming the generated maximum value statistic to "_max_create_date" (subsequently dropped) simply eliminates a duplicate value as well as the superfluous warning.

 

There may be a way to get rid of the remaining warning for the class variable, while using the "_all_" parameter, but it eludes me at the moment.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
andreas_lds
Jade | Level 19

To add two more ways to solve the problem:

/* 1) double sort */
proc sort data=have;
   by CustomerID descending CreateDate;
run;

proc sort data=have out=want nodupkey;
   by CustomerID;
run;

If proc sort seems to be to simple, you could use a hash-object:

data _null_;
   if 0 then set have;
   
   if _n_ = 1 then do;
      declare hash h();
      h.defineKey('CustomerID');
      h.defineData('CustomerID', 'Account', 'CreateDate');
      h.defineDone();
   end;
   
   set have(rename=(Account = acc CreateDate = cd)) end=jobDone;
   
   if h.find() ^= 0 then do;
      Account = acc;
      CreateDate = cd;
      h.add();
   end;
   else do;
      if CreateDate < cd then do;
         Account = acc;
         CreateDate = cd;
         h.update();
      end;
   end;
   
   if jobDone then do;
      h.output(dataset: 'work.want2');
   end;
run;
Patrick
Opal | Level 21

@SasStatistics 

You've got already solutions how to keep the row with the most recent date per customer id. 

Assuming your sample data somehow represents your actual data: Isn't it possible that a customer can have more than one account? If so then wouldn't you need to keep the most recent row per customer_id AND account_id? ...and potentially also include some other column in your logic with an expiration date or account status?

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 4397 views
  • 5 likes
  • 7 in conversation