Hi,
My data looks like this,
Customer_Id | Transaction_No | Transaction_Date | Amount | Sales_Credit |
4699 | 80902288 | 22-Sep-15 | 55.3 | 2 |
4699 | 80686478 | 12-Sep-15 | 62.3 | 1 |
4699 | 80687203 | 12-Sep-15 | 55.3 | 1 |
4919 | 81735223 | 12-Oct-15 | 15.99 | 2 |
4919 | 80917293 | 22-Sep-15 | 15.99 | 1 |
4919 | 80917294 | 22-Sep-15 | 15.99 | 1 |
4919 | 80938352 | 22-Sep-15 | 23.99 | 1 |
6320 | 80926085 | 23-Sep-15 | 51.75 | 2 |
6320 | 80517713 | 09-Sep-15 | 51.75 | 1 |
6320 | 80512633 | 09-Sep-15 | 51.75 | 1 |
I need to find only the data with actual sales. Like in customer_id "4699" only the second observation remains. However with customer_id= "4919" I need to randomly delete either of the two observations having amount 15.99. Could you help me how to proceed with this? If statements with first.variable concept or proc sql?
Here could be solution using Hash:
data have; infile cards dlm='09'x truncover; input (Customer_Id Transaction_No) (:$20.) Transaction_Date :anydtdte20. Amount Sales_Credit ; format Transaction_Date date9.; cards; 4699 80902288 22-Sep-15 55.3 2 4699 80686478 12-Sep-15 62.3 1 4699 80687203 12-Sep-15 55.3 1 4919 81735223 12-Oct-15 15.99 2 4919 80917293 22-Sep-15 15.99 1 4919 80917294 22-Sep-15 15.99 1 4919 80938352 22-Sep-15 23.99 1 6320 80926085 23-Sep-15 51.75 2 6320 80517713 09-Sep-15 51.75 1 6320 80512633 09-Sep-15 51.75 1 ; data want;
if _n_=1 then do;
dcl hash h(dataset:'have(drop=Transaction_No rename=(Transaction_Date=_date Sales_Credit=_credit) where=(_Credit=2))', multidata:'y');
h.definekey('customer_id', 'amount');
h.definedata(all: 'y');
h.definedone();
call missing (_credit, _date);
end;
set have(where=(Sales_Credit=1));
if h.find()=0 and _date >= Transaction_Date then do;
h.removedup();
delete;
end;
drop _:;
run;
Can you describe why you wish to perform those actions?
And define "actual sale", it isn't self explanatory.
Reducing records be randomly deletes, AND keep details (in your case transaction number) - why? It's "dangerous, since the details it's not a fact any more.
I would simply combine a filter with select distinct.
I am sorry to miss that information. When sales_credit= 2, it refers to a "Product return". Therefore by removing that observation and the observation with similar amount within the customer_id I am trying to achieve only those observations that show actual sales.
And yes, I agree the transaction_id is not required once I randomly select data.
Can you guarantee that there will always be an exact match when it's a 2?
And do you need sale dates?
Yes, I can. Although there can be multiple 2's for one customer_id with its respective purchase transaction.
Like this:
16759 | 80902332 | 22-Sep-15 | 19.99 | 2 |
16759 | 80902327 | 22-Sep-15 | 19.99 | 2 |
16759 | 80902328 | 22-Sep-15 | 19.99 | 2 |
16759 | 80902326 | 22-Sep-15 | 19.99 | 2 |
16759 | 80902335 | 22-Sep-15 | 19.99 | 2 |
16759 | 80159162 | 05-Sep-15 | 19.99 | 1 |
16759 | 80161144 | 05-Sep-15 | 19.99 | 1 |
16759 | 80160655 | 05-Sep-15 | 19.99 | 1 |
16759 | 80191135 | 06-Sep-15 | 19.99 | 1 |
16759 | 80191307 | 06-Sep-15 | 19.99 | 1 |
Would it be easier if I skip the the transaction_date?
Here could be solution using Hash:
data have; infile cards dlm='09'x truncover; input (Customer_Id Transaction_No) (:$20.) Transaction_Date :anydtdte20. Amount Sales_Credit ; format Transaction_Date date9.; cards; 4699 80902288 22-Sep-15 55.3 2 4699 80686478 12-Sep-15 62.3 1 4699 80687203 12-Sep-15 55.3 1 4919 81735223 12-Oct-15 15.99 2 4919 80917293 22-Sep-15 15.99 1 4919 80917294 22-Sep-15 15.99 1 4919 80938352 22-Sep-15 23.99 1 6320 80926085 23-Sep-15 51.75 2 6320 80517713 09-Sep-15 51.75 1 6320 80512633 09-Sep-15 51.75 1 ; data want;
if _n_=1 then do;
dcl hash h(dataset:'have(drop=Transaction_No rename=(Transaction_Date=_date Sales_Credit=_credit) where=(_Credit=2))', multidata:'y');
h.definekey('customer_id', 'amount');
h.definedata(all: 'y');
h.definedone();
call missing (_credit, _date);
end;
set have(where=(Sales_Credit=1));
if h.find()=0 and _date >= Transaction_Date then do;
h.removedup();
delete;
end;
drop _:;
run;
This works great!! Thank you so much.
The two things I would change in your code:
- A keep statement for the data set loaded into the hash so that you're sure not to load variables you don't want to load
- Loop over the "duplicates" in the hash and chose the oldest date where "_date >= Transaction_Date" is true. You can't be sure that the first match is the correct one, and it could also be possible that the condition is true for more than one return.
@Patrick wrote:The two things I would change in your code:
- A keep statement for the data set loaded into the hash so that you're sure not to load variables you don't want to load
- Loop over the "duplicates" in the hash and chose the oldest date where "_date >= Transaction_Date" is true. You can't be sure that the first match is the correct one, and it could also be possible that the condition is true for more than one return.
1. Absolutely Agreed.
2. Not so sure about that. Customer behaves randomly, your suggestion is based on the assumption that 'first in - first out', which is not always necessarily the case. Without more information to identify the paired transactions, this is as far as I would go. But I do believe you have raised a valid point to OP.
Haikuo
I believe if you don't select the earliest return date which satifies condition "_date >= Transaction_Date" then you are at risk to miss some returns. Consider the following:
Rec_No | Customer_Id | Transaction_Date | Amount | Sales_Credit |
1 | 1 | 1-Nov-15 | 5 | 1 |
2 | 1 | 2-Nov-15 | 5 | 2 |
3 | 1 | 3-Nov-15 | 5 | 1 |
4 | 1 | 4-Nov-15 | 5 | 2 |
If you use record 4 to remove record 1 then the pair record 2/record 3 won't satisfy condition "_date >= Transaction_Date" and you'll miss the removal of record 3.
@Patrick wrote:I believe if you don't select the earliest return date which satifies condition "_date >= Transaction_Date" then you are at risk to miss some returns. Consider the following:
Rec_No Customer_Id Transaction_Date Amount Sales_Credit 1 1 1-Nov-15 5 1 2 1 2-Nov-15 5 2 3 1 3-Nov-15 5 1 4 1 4-Nov-15 5 2
If you use record 4 to remove record 1 then the pair record 2/record 3 won't satisfy condition "_date >= Transaction_Date" and you'll miss the removal of record 3.
Thank you for your following up and your point is well taken. However, if OP unfortunately has data like that, the problem is beyond a simple code fix. Surely you will have your 'balance sheet', but you will never know if paired transations are the actual pairs without more information. If the purpose is just to get the net 'balance', then the condition of '_date >= Transaction_Date' can be completely ignored. Given the data as is, my approach is to get as far as the 'known' can stand, and stop right at the point where 'unknown' starts.
Just my 2cents,
Haikuo
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.