Hello,
I have a data set full of orders.
The data set has 150+ columns and when I update it every day some duplicates will be introduced.
I need to remove those duplicates (based on order_id) and keep the record with the most recent update_date.
The following code works most of the time but I just noticed there are duplicates that the code for some reason is not removing and it does not make any sense.
Here is my code:
proc sort data=Orders out=Orders;
by order_id update_date;
run;
data Orders;
set Orders;
by order_id update_date;
if last.update_date;
run;
What am I missing here and why is this working for some orders but not the others.
I've been staring at one particular order for the past 2hrs and it's the same exact order_id and the update_date is clearly different yet the code returns both records!?
Your subsetting statement is incorrect:
if last.update_date;
It should be:
if last.order_id;
Also notice, this is dangerous:
data orders;
set orders;
...........
If your logic is wrong, you can destroy your data set ORDERS. Safer:
data orders2;
set orders;
..............
Hmm..OK...I will try that.
But why?
If I have 2 records:
Order_Id Update_date
100 1/1/2019
100 2/1/2019
So I want to keep the one from 2/1/2019 why would I say if last.order_id and not last.update_date??
Just trying to understand.
Second part makes perfect sense..I will change that.
Thank you for your answer @Astounding !!
I forgot to mention...especially since the code I had was removing duplicates as I was hoping it would...just not all of them
proc sort data=Orders;
by order_id descending update_date;
run;
proc sort data=Orders nodupkey;
by order_id;
run;
@SasDewd wrote:
Hello,
I have a data set full of orders.
The data set has 150+ columns and when I update it every day some duplicates will be introduced.
I need to remove those duplicates (based on order_id) and keep the record with the most recent update_date.
The following code works most of the time but I just noticed there are duplicates that the code for some reason is not removing and it does not make any sense.
Here is my code:
proc sort data=Orders out=Orders;
by order_id update_date;
run;data Orders;
set Orders;
by order_id update_date;
if last.update_date;
run;
What am I missing here and why is this working for some orders but not the others.
I've been staring at one particular order for the past 2hrs and it's the same exact order_id and the update_date is clearly different yet the code returns both records!?
You may need to consider how you are "updating" you data. I sounds like you might be APPENDING data. Either with Proc Append or using a SET statement with two data sets, the master and the "new" data.
If you want a single record for ORDER_ID perhaps you need to use a data step UPDATE statement.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.