BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
pk1206
Fluorite | Level 6

In SAS, I have a transaction data and I need to dedup it in such a way that if a customer has attempted multiple transactions of the same amount with the same merchant within 2 hours then I should get the latest instance based on transaction time in my final dataset, but if the transaction time is greater than 2 hours then I should get all the records. Below is my sample data.

 

cust_nbrtxn_idmerchanttxn_amtacct_numtxn_datetxn_time
11111111111111111111A1234XYZ1000012345678917Jan202123:59:57
11111111111111111111B9021XYZ1000012345678918Jan20210:00:03
11111111111111111111V7619XYZ2000012345678918Jan20210:00:08
11111111111111111111W8911XYZ2000012345678918Jan20210:00:20
22222222222222222J0569ABC3000987654321017Jan20218:00:02
22222222222222222T2801ABC3000987654321018Jan20219:00:07

 

And below is my desired output.

cust_nbrtxn_idmerchanttxn_amtacct_numtxn_datetxn_time
11111111111111111111B9021XYZ1000012345678918Jan20210:00:03
11111111111111111111W8911XYZ2000012345678918Jan20210:00:20
22222222222222222J0569ABC3000987654321017Jan20218:00:02
22222222222222222T2801ABC3000987654321018Jan20219:00:07

 

I am guessing I need to do some kind of conditional deduping based on transaction date and time.

 

Any help is appreciated !

 

@Reeza @Ksharp @Kurt_Bremser @Tom @MichelleHomes

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

If de-dupable transactions are consecutive (i.e. never separated by some other transaction), then:

 

Editted correction to include transaction date as well as transaction time:  

 

data want (drop=nxt_time);
  set have end=end_of_have;
  by cust_nbr merchant txn_amt notsorted;
  if end_of_have=0 then set have (firstobs=2 keep=txn_date txn_time rename=(txn_date=nxt_date txn_time=nxt_time));

  if last.txn_amt=0 and dhms(nxt_date,0,0,nxt_time)-dhms(txn_date,0,0,txn_time)<='02:00:00't then delete;
run;

 

 

The NOTSORTED option allows you to treat the data as physically grouped, but not necessarily in ascending or descending order.

 

The

    IF END_OF_HAVE then set have (firstobs=2 ...)

statement reads ahead oneobservation to retrieve the upcoming txn_time, renamed to nxt_time.   But when the first SET statement has read the last obs of the data set (END_OF_HAVE=1), the second (conditional) SET does not read (otherwise it would pre-maturely end the data step).

 

The other thing to note here, is the DELETE statement, which filters out the unwanted "duplicates"  (i.e. within 2 hours of the next matching transaction).  

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

9 REPLIES 9
japelin
Rhodochrosite | Level 12

Is the variable txn_time the start and end time of the transaction instead of the transaction time?
Is there a key to identify a single transaction? Or do you always have a set of start and end records?

pk1206
Fluorite | Level 6

@japelin Variable txn_time is the exact time at which the transaction took place and to identify a single transaction I have txn_id column in my dataset. I have edited my post and added txn_id in my sample data and desired output table.

mkeintz
PROC Star

If de-dupable transactions are consecutive (i.e. never separated by some other transaction), then:

 

Editted correction to include transaction date as well as transaction time:  

 

data want (drop=nxt_time);
  set have end=end_of_have;
  by cust_nbr merchant txn_amt notsorted;
  if end_of_have=0 then set have (firstobs=2 keep=txn_date txn_time rename=(txn_date=nxt_date txn_time=nxt_time));

  if last.txn_amt=0 and dhms(nxt_date,0,0,nxt_time)-dhms(txn_date,0,0,txn_time)<='02:00:00't then delete;
run;

 

 

The NOTSORTED option allows you to treat the data as physically grouped, but not necessarily in ascending or descending order.

 

The

    IF END_OF_HAVE then set have (firstobs=2 ...)

statement reads ahead oneobservation to retrieve the upcoming txn_time, renamed to nxt_time.   But when the first SET statement has read the last obs of the data set (END_OF_HAVE=1), the second (conditional) SET does not read (otherwise it would pre-maturely end the data step).

 

The other thing to note here, is the DELETE statement, which filters out the unwanted "duplicates"  (i.e. within 2 hours of the next matching transaction).  

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
pk1206
Fluorite | Level 6

@mkeintz Thanks for taking a look at my query. Actually the data is at transaction level in which the dedupable records might not be consecutive always.

mkeintz
PROC Star

Are the data sorted in any way?  Possibly by customer?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
pk1206
Fluorite | Level 6

No, the real time data is not sorted as I mentioned that it is on a transaction level.

 

And I tried sorting the data on customer and run the code provided above but getting below error in the log.

 

if last.txn_amt=0 and dhms(nxt_date,,nxt_time) - dhms(txn_date,,txn_time)<='02:00:00't then

ERROR 159-185: Null parameters for DHMS are invalid.

mkeintz
PROC Star

I've corrected the DHMS function arguments.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
pk1206
Fluorite | Level 6

Thanks @mkeintz , but is there a possible solution to make this code work even if the data is not sorted ? And will this code work if I can get my data sorted, say on customer number alone or with any other variable in combination ?

Ksharp
Super User
data have;
infile cards expandtabs truncover;
input cust_nbr :$80.	txn_id :$40.	merchant :$40.	txn_amt	acct_num :$40.	txn_date :date9.	txn_time :time8.;
format txn_date date9.	txn_time time8.;
cards;
11111111111111111111	A1234	XYZ	1000	0123456789	17Jan2021	23:59:57
11111111111111111111	B9021	XYZ	1000	0123456789	18Jan2021	0:00:03
11111111111111111111	V7619	XYZ	2000	0123456789	18Jan2021	0:00:08
11111111111111111111	W8911	XYZ	2000	0123456789	18Jan2021	0:00:20
22222222222222222	J0569	ABC	3000	9876543210	17Jan2021	8:00:02
22222222222222222	T2801	ABC	3000	9876543210	18Jan2021	9:00:07
;

data have;
 set have;
 datetime=dhms(txn_date,0,0,txn_time);
run;

data temp;
 set have;
 by cust_nbr merchant txn_amt;
 dif=dif(datetime);
 if first.txn_amt then call missing(dif);
run;
data temp;
 set temp;
 by cust_nbr merchant txn_amt;
 if first.txn_amt or dif>60*60*2 then group+1;
run;
data want;
 set temp;
 by group;
 if last.group;
drop dif group datetime;
run;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 786 views
  • 6 likes
  • 4 in conversation