In SAS, I have transaction data that I need to dedup as follows: if a customer attempted multiple transactions of the same amount with the same merchant within 2 hours, I should keep only the latest instance based on transaction time in my final dataset; but if the transactions are more than 2 hours apart, I should keep all of the records. Below is my sample data.
cust_nbr | txn_id | merchant | txn_amt | acct_num | txn_date | txn_time |
11111111111111111111 | A1234 | XYZ | 1000 | 0123456789 | 17Jan2021 | 23:59:57 |
11111111111111111111 | B9021 | XYZ | 1000 | 0123456789 | 18Jan2021 | 0:00:03 |
11111111111111111111 | V7619 | XYZ | 2000 | 0123456789 | 18Jan2021 | 0:00:08 |
11111111111111111111 | W8911 | XYZ | 2000 | 0123456789 | 18Jan2021 | 0:00:20 |
22222222222222222 | J0569 | ABC | 3000 | 9876543210 | 17Jan2021 | 8:00:02 |
22222222222222222 | T2801 | ABC | 3000 | 9876543210 | 18Jan2021 | 9:00:07 |
And below is my desired output.
cust_nbr | txn_id | merchant | txn_amt | acct_num | txn_date | txn_time |
11111111111111111111 | B9021 | XYZ | 1000 | 0123456789 | 18Jan2021 | 0:00:03 |
11111111111111111111 | W8911 | XYZ | 2000 | 0123456789 | 18Jan2021 | 0:00:20 |
22222222222222222 | J0569 | ABC | 3000 | 9876543210 | 17Jan2021 | 8:00:02 |
22222222222222222 | T2801 | ABC | 3000 | 9876543210 | 18Jan2021 | 9:00:07 |
I am guessing I need to do some kind of conditional deduping based on transaction date and time.
Any help is appreciated!
If de-dupable transactions are consecutive (i.e. never separated by some other transaction), then:
Edited to include transaction date as well as transaction time:
data want (drop=nxt_date nxt_time);
  set have end=end_of_have;
  by cust_nbr merchant txn_amt notsorted;
  if end_of_have=0 then set have (firstobs=2 keep=txn_date txn_time rename=(txn_date=nxt_date txn_time=nxt_time));
  if last.txn_amt=0 and dhms(nxt_date,0,0,nxt_time)-dhms(txn_date,0,0,txn_time)<='02:00:00't then delete;
run;
The NOTSORTED option allows you to treat the data as physically grouped, but not necessarily in ascending or descending order.
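As a minimal sketch of what NOTSORTED does (using the thread's dataset HAVE): FIRST. and LAST. flags are set at every change in the BY values, without requiring the data to be in ascending or descending order.

```sas
/* Sketch: with NOTSORTED, FIRST.txn_amt flips whenever the
   cust_nbr/merchant/txn_amt combination changes, even if the
   groups are not in sort order. */
data _null_;
  set have;
  by cust_nbr merchant txn_amt notsorted;
  if first.txn_amt then put 'New group starts at txn_id=' txn_id;
run;
```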
The
IF END_OF_HAVE=0 then set have (firstobs=2 ...)
statement reads ahead one observation to retrieve the upcoming txn_date and txn_time, renamed to nxt_date and nxt_time. But once the first SET statement has read the last observation of the data set (END_OF_HAVE=1), the second (conditional) SET does not read again (otherwise it would prematurely end the data step).
The other thing to note here is the DELETE statement, which filters out the unwanted "duplicates" (i.e., those within 2 hours of the next matching transaction).
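The look-ahead trick can be sketched in isolation (variable names match the thread's data; the second SET simply runs one observation ahead of the first):

```sas
/* Sketch of the look-ahead pattern: a second SET on the same
   dataset with FIRSTOBS=2 stays one observation ahead, making
   the next row's txn_time visible as nxt_time. */
data _null_;
  set have end=eof;
  if not eof then
    set have (firstobs=2 keep=txn_time rename=(txn_time=nxt_time));
  else call missing(nxt_time);  /* no "next" row after the last obs */
  put txn_time= nxt_time=;
run;
```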
Is the variable txn_time the start and end time of the transaction instead of the transaction time?
Is there a key to identify a single transaction? Or do you always have a set of start and end records?
@japelin Variable txn_time is the exact time at which the transaction took place and to identify a single transaction I have txn_id column in my dataset. I have edited my post and added txn_id in my sample data and desired output table.
@mkeintz Thanks for taking a look at my query. Actually the data is at the transaction level, so the dedupable records might not always be consecutive.
Are the data sorted in any way? Possibly by customer?
No, the real data is not sorted; as I mentioned, it is at the transaction level.
I tried sorting the data by customer and running the code provided above, but I get the error below in the log.
if last.txn_amt=0 and dhms(nxt_date,,nxt_time) - dhms(txn_date,,txn_time)<='02:00:00't then
ERROR 159-185: Null parameters for DHMS are invalid.
I've corrected the DHMS function arguments.
Thanks @mkeintz, but is there a possible solution to make this code work even if the data is not sorted? And will this code work if I can get my data sorted, say on customer number alone or in combination with any other variable?
data have;
infile cards expandtabs truncover;
input cust_nbr :$80. txn_id :$40. merchant :$40. txn_amt acct_num :$40. txn_date :date9. txn_time :time8.;
format txn_date date9. txn_time time8.;
cards;
11111111111111111111 A1234 XYZ 1000 0123456789 17Jan2021 23:59:57
11111111111111111111 B9021 XYZ 1000 0123456789 18Jan2021 0:00:03
11111111111111111111 V7619 XYZ 2000 0123456789 18Jan2021 0:00:08
11111111111111111111 W8911 XYZ 2000 0123456789 18Jan2021 0:00:20
22222222222222222 J0569 ABC 3000 9876543210 17Jan2021 8:00:02
22222222222222222 T2801 ABC 3000 9876543210 18Jan2021 9:00:07
;
data have;
  set have;
  /* combine date and time into a single datetime value (seconds) */
  datetime=dhms(txn_date,0,0,txn_time);
run;

data temp;
  set have;
  by cust_nbr merchant txn_amt;
  /* DIF returns the gap in seconds from the previous observation */
  dif=dif(datetime);
  if first.txn_amt then call missing(dif);
run;

data temp;
  set temp;
  by cust_nbr merchant txn_amt;
  /* start a new group at each BY group, or when the gap exceeds 2 hours */
  if first.txn_amt or dif>60*60*2 then group+1;
run;

data want;
  set temp;
  by group;
  /* keep only the latest transaction in each group */
  if last.group;
  drop dif group datetime;
run;
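Since the real data is not sorted (as noted earlier in the thread), and these steps use BY groups without NOTSORTED, the input would first need to be ordered; a PROC SORT along these lines (sort keys are an assumption based on the BY statements above) should do it:

```sas
/* Assumed prerequisite: order the data so that BY-group processing
   and DIF() see matching transactions in chronological order. */
proc sort data=have;
  by cust_nbr merchant txn_amt txn_date txn_time;
run;
```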