I will try to describe my problem in pseudo-code. I am a self-taught programmer, so I apologize in advance for any gaps in my vocabulary and knowledge.
I have transaction data across accounts, timestamped down to the second. The problem with every moving-average answer I have looked up is that the data is always very structured and designed in a way that makes the calculation relatively easy.
Transaction data is difficult because it can vary wildly from account to account. An account might have 10 transactions in a 24-hour period, or 5 transactions over 10 days. For each transaction I need to check whether the previous transactions fall within the past 24 hours, and if not, exclude them.
I have created a fake dataset with dates that are commensurate with the problems I am facing.
data sample;
   input customerid : 7.
         transactionid : $13.
         date : datetime26.
         value : 3.
   ;
   format date datetime26.;
datalines;
2133780 2133780_00355 15JUN2020:12:47:17.000000 53
2133780 2133780_00687 15JUN2020:12:48:35.000000 72
2133780 2133780_00181 15JUN2020:12:59:44.000000 88
2133780 2133780_00273 15JUN2020:16:43:46.000000 96
2133780 2133780_00803 16JUN2020:22:30:15.000000 25
2133780 2133780_00411 17JUN2020:07:22:18.000000 75
2133780 2133780_00191 17JUN2020:23:16:39.000000 59
2133780 2133780_00461 14JUL2020:09:35:32.000000 22
2133780 2133780_00532 21JUL2020:10:40:54.000000 46
2133780 2133780_00609 28JUL2020:09:58:30.000000 36
2133780 2133780_00504 28JUL2020:10:12:55.000000 87
2133780 2133780_00488 31JUL2020:15:30:44.000000 36
2689999 2689999_00284 02JUN2020:09:03:55.000000 44
2689999 2689999_00734 06JUN2020:11:20:23.000000 95
2689999 2689999_00061 06JUN2020:13:40:22.000000 29
2689999 2689999_00540 07JUN2020:14:00:19.000000 14
2689999 2689999_00616 08JUN2020:13:36:24.000000 14
2689999 2689999_00002 08JUN2020:13:59:32.000000 82
2689999 2689999_00966 04JUL2020:11:29:49.000000 21
2689999 2689999_00859 10JUL2020:14:46:37.000000 34
2689999 2689999_00816 29JUL2020:13:56:45.000000 95
2689999 2689999_00045 29JUL2020:15:12:19.000000 42
2689999 2689999_00481 02AUG2020:11:02:21.000000 38
2689999 2689999_00227 02AUG2020:11:03:41.000000 42
2689999 2689999_00354 02AUG2020:12:26:34.000000 42
2689999 2689999_00794 03AUG2020:09:10:43.000000 73
2689999 2689999_00857 20AUG2020:15:03:19.000000 16
;
run;
So, for each account, I need to:
1) Loop through every transaction
2) On each transaction, review all previous transactions, compare the transaction date to each previous date, and calculate how long ago it occurred
3) Count those transactions that qualify
4) Sum those transaction amounts that qualify
5) Divide to find average
6) Repeat
7) Reset when you get to a new account
(A brute-force sketch of these steps follows the list.)
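To make the steps concrete, here is that brute-force version written as a plain DATA step. It is just a sketch: it assumes the data is sorted by customerid and date within each account, that no account has more than 1,000 transactions inside any 24-hour window, and that a gap of exactly 24 hours still counts as inside the window. It expresses the logic, but I doubt it will scale to the real data:
data brute_force;
   set sample;
   by customerid;
   array _d {1000} _temporary_;   /* dates of transactions in the window  */
   array _v {1000} _temporary_;   /* values of transactions in the window */
   retain _n 0;
   if first.customerid then _n = 0;   /* reset when a new account starts */

   /* keep only prior transactions still within the last 24 hours */
   _k = 0;
   do _i = 1 to _n;
      if date - _d{_i} <= 86400 then do;   /* exactly 24 hours counts as inside; use < for exclusive */
         _k + 1;
         _d{_k} = _d{_i};
         _v{_k} = _v{_i};
      end;
   end;
   _n = _k;

   /* add the current transaction, then count / sum / divide */
   _n + 1;
   _d{_n} = date;
   _v{_n} = value;
   count = _n;
   running_ave = 0;
   do _i = 1 to _n;
      running_ave = running_ave + _v{_i};
   end;
   running_ave = running_ave / _n;
   drop _:;
run;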
I believe the hash object will be the best route, because there could be tens of millions of rows. I have begun reading 'Data Management Solutions Using SAS Hash Table Operations', but the concept is so foreign to me that I am having trouble grasping it.
Here is what I have tried:
data _null_;
   /* all transactions, keyed by customerid */
   dcl hash transactions(dataset:"sample", multidata:"Y", ordered:"A");
   transactions.defineKey("customerid");
   transactions.defineData("date","value");
   transactions.defineDone();

   /* same data again, with date renamed so the two dates can be compared */
   dcl hash accounts(dataset:"sample(rename=(date=base_date))", multidata:"Y", ordered:"A");
   accounts.defineKey("customerid");
   accounts.defineData("customerid","base_date");
   accounts.defineDone();

   /* define the result hash object tables */
   dcl hash h_pointer;
   dcl hash byAccount(ordered:"A");
   byAccount.defineKey("customerid");
   byAccount.defineData("customerid","date","value","_average");
   byAccount.defineDone();

   if 0 then set sample;
   format base_date datetime26.;

   lr = 0;
   do until(lr);
      set sample end=lr;
      call missing(customerid, date, base_date, value);
      accounts.find();
      transactions_rc = transactions.find();
      do while(transactions_rc = 0);
         time_diff = intck('second', base_date, date);
         put time_diff;
         if (-86400 le time_diff le 86400) then leave;
         transactions_rc = transactions.find_next();
      end;
      h_pointer = byAccount;
      link calculator;
   end;
   stop;

calculator:
   call missing(date, value, _average);
   rc = h_pointer.find();
   counts + 1;
   h_pointer.replace();
   return;
run;
Right now, it is just trying to count. Any help would be appreciated!
I believe the hash object is a good fit. You might want to check out the section on stacks: a separate stack for each day, outputting (as appropriate) and clearing as you complete each day.
I do have a couple of questions about your requirements:
3) Count those transactions that qualify
4) Sum those transaction amounts that qualify
5) Divide to find average
I am not sure what you mean by "transactions that qualify", nor what you want to do with the results.
3) Count those transactions that qualify
4) Sum those transaction amounts that qualify
What I mean, if you look at the transactions in the sample data, is that as time passes, older transactions fall out of the 24-hour window. That is what I meant when I said they would no longer "qualify": they wouldn't be considered for the current line's 24-hour calculation.
I have done the calculations by hand to see what I want the result table to look like. I hope it helps:
Row | customerid | transactionid | date | value | count | running_ave | Notes |
1 | 2133780 | 2133780_00355 | 15Jun2020 12:47:17 | 53 | 1 | 53.0 | |
2 | 2133780 | 2133780_00687 | 15Jun2020 12:48:35 | 72 | 2 | 62.5 | |
3 | 2133780 | 2133780_00181 | 15Jun2020 12:59:44 | 88 | 3 | 71.0 | |
4 | 2133780 | 2133780_00273 | 15Jun2020 16:43:46 | 96 | 4 | 77.3 | |
5 | 2133780 | 2133780_00803 | 16Jun2020 22:30:15 | 25 | 1 | 25.0 | More than 24 hours passed since the last transaction, a full reset |
6 | 2133780 | 2133780_00411 | 17Jun2020 7:22:18 | 75 | 2 | 50.0 | Even though a "day" has passed, it is June 17th now, the previous transaction is still within a 24 hour window |
7 | 2133780 | 2133780_00191 | 17Jun2020 23:16:39 | 59 | 2 | 67.0 | The transaction on row 5 now falls away, but the previous transaction on row 6 is within a 24 hour window |
8 | 2133780 | 2133780_00461 | 14Jul2020 9:35:32 | 22 | 1 | 22.0 | |
9 | 2133780 | 2133780_00532 | 21Jul2020 10:40:54 | 46 | 1 | 46.0 | |
10 | 2133780 | 2133780_00609 | 28Jul2020 9:58:30 | 36 | 1 | 36.0 | |
11 | 2133780 | 2133780_00504 | 28Jul2020 10:12:55 | 87 | 2 | 61.5 | |
12 | 2133780 | 2133780_00488 | 31Jul2020 15:30:44 | 36 | 1 | 36.0 | |
13 | 2689999 | 2689999_00284 | 02Jun2020 9:03:55 | 44 | 1 | 44.0 | A new CustomerId |
14 | 2689999 | 2689999_00734 | 06Jun2020 11:20:23 | 95 | 1 | 95.0 | |
15 | 2689999 | 2689999_00061 | 06Jun2020 13:40:22 | 29 | 2 | 62.0 | |
16 | 2689999 | 2689999_00540 | 07Jun2020 14:00:19 | 14 | 1 | 14.0 | This is just outside of the 24 hour window of the previous transaction |
17 | 2689999 | 2689999_00616 | 08Jun2020 13:36:24 | 14 | 2 | 14.0 | This is just inside of the 24 hour window of the previous transaction |
18 | 2689999 | 2689999_00002 | 08Jun2020 13:59:32 | 82 | 3 | 36.7 | This is around 30 seconds from being outside the 24 hour window of the previous 2 transactions. This is how granular I want it to get |
19 | 2689999 | 2689999_00966 | 04Jul2020 11:29:49 | 21 | 1 | 21.0 | |
20 | 2689999 | 2689999_00859 | 10Jul2020 14:46:37 | 34 | 1 | 34.0 | |
21 | 2689999 | 2689999_00816 | 29Jul2020 13:56:45 | 95 | 1 | 95.0 | |
22 | 2689999 | 2689999_00045 | 29Jul2020 15:12:19 | 42 | 2 | 68.5 | |
23 | 2689999 | 2689999_00481 | 02Aug2020 11:02:21 | 38 | 1 | 38.0 | |
24 | 2689999 | 2689999_00227 | 02Aug2020 11:03:41 | 42 | 2 | 40.0 | |
25 | 2689999 | 2689999_00354 | 02Aug2020 12:26:34 | 42 | 3 | 40.7 | |
26 | 2689999 | 2689999_00794 | 03Aug2020 9:10:43 | 73 | 4 | 48.8 | A new day, but still within 24 hours of the previous 3 transactions |
27 | 2689999 | 2689999_00857 | 20Aug2020 15:03:19 | 16 | 1 | 16.0 | |
Also, someone else in the comments below mentioned that they would use a hash of hashes if the data were sorted by timestamp/customerid. I wanted to say that in the actual transaction data I am using, the data is sorted in this manner; I chose to present the sample data sorted by customerid/timestamp to illustrate my problem.
Here's a hash-of-hashes solution:
/* SAMPLE is the dataset created by the data step in the original post. */
proc sort data=sample;
   by date customerid;
run;
data want (drop=_:);
   set sample;
   if _n_=1 then do;
      declare hash hoh ();
      hoh.definekey('customerid');
      hoh.definedata('customerid','h','ih','_value_sum','nvalues');
      hoh.definedone();
      declare hash h;
      declare hiter ih;
   end;

   /* New customerid? Initialize a new hash & iterator. */
   if hoh.find()^=0 then do;
      h=_new_ hash(ordered:'a', multidata:'Y');   /* Added the MULTIDATA option */
      h.definekey('date');
      h.definedata('date','value');
      h.definedone();
      ih=_new_ hiter('h');
      _value_sum=0;
      nvalues=0;
   end;

   h.add();
   _value_sum = _value_sum + value;
   nvalues = nvalues + 1;

   _stale_timestamp = date - '24:00:00't;
   format _stale_timestamp datetime20.;
   array _drop_dates {0:100};

   /* Adjust _value_sum and nvalues for all stale timestamps.        */
   /* Also capture their dates to guide stale data item deletions.   */
   do _rc=ih.first() by 0 while(date < _stale_timestamp);
      _value_sum = _value_sum - value;
      nvalues = nvalues - 1;
      _ndd = n(of _drop_dates{*});
      if _drop_dates{_ndd}^=date then _drop_dates{_ndd+1}=date;
      _rc = ih.next();
   end;

   /* Replace updated _value_sum and nvalues in hoh */
   hoh.replace();

   /* Now remove any stale timestamps from h */
   _ndd = n(of _drop_dates{*});
   if _ndd>0 then do _d=1 to _ndd;
      h.remove(key:_drop_dates{_d});
      _drop_dates{_d}=.;
   end;

   average = _value_sum/nvalues;
   ih.last();   /* Retrieve date and value of most recent record */
run;
Using the multidata:"Y" option alleviates the need to use transactionid as a hash key for storage purposes.
If there are duplicate dates, they all appear in the hash object, but each date appears only once in the _drop_dates array. That's not a problem, because the REMOVE method removes all of the items stored under the specified key.
Also, this assumes that there are no more than 100 records in any 24-hour period for a given customerid. If there can be more, modify the upper bound of _drop_dates.
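As a side note, here is a tiny standalone sketch (an illustration only, not part of the solution above) showing that with multidata:'y' the REMOVE method deletes every item stored under a key:
data _null_;
   length k v 8;
   call missing(k, v);
   declare hash h(multidata:'y');
   h.definekey('k');
   h.definedata('k','v');
   h.definedone();
   k = 1; v = 10; rc = h.add();
   k = 1; v = 20; rc = h.add();   /* duplicate key, kept thanks to multidata */
   k = 2; v = 30; rc = h.add();
   rc = h.remove(key: 1);         /* removes BOTH items stored under k=1 */
   n = h.num_items;
   put n=;                        /* writes n=1 to the log */
run;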
EDITED NOTE: I had originally forgotten to include the multidata:'Y' option in the "h=_new_ hash ..." statement, even though I explicitly mentioned it in my comments.
It's there now, which eliminates the ERROR messages for "duplicate key" and (more importantly) properly accounts for duplicate dates in the running 24-hour average calculations.
Thank you for taking the time to help my understanding of the hash object. I will attempt to apply this solution to my data and check back in.
I get an error that repeats ad nauseam in the log: "ERROR: Duplicate key."
Despite that error, on a small data set of 45,000 transactions it seems to have calculated correctly.
However, on a large data set of 20 million transactions it failed, saying it ran out of memory.
The errors are generated by the unassigned H.ADD method call: at every attempt to add a hash item whose key is already in the table, the key is not inserted and an error is written to the log. Since discarding the duplicate-key items has no effect on the correctness of your calculation, it's okay; all you want is to make the pesky log errors go away. You can do that by making the call assigned (_iorc_ is a nice receptacle for it because it is (a) available and (b) automatically dropped):
_iorc_ = h.add() ;
As for the memory overload, turn on the FULLSTIMER system option to see the DATA step's memory footprint at the point where it terminates, and compare it to your available system resources. If your MEMSIZE option is set too low relative to the RAM available on your machine, you can remedy the situation by bumping it up when you invoke SAS or in the configuration file.
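For example (the 8G value below is just an illustration; pick a value appropriate for your machine and site configuration):
/* FULLSTIMER can be turned on inside the session: */
options fullstimer;

/* MEMSIZE is a startup option, so set it when SAS is invoked */
/* or in the configuration file, for example:                 */
/*    sas -memsize 8G        (command line)                   */
/*    -MEMSIZE 8G            (sasv9.cfg)                       */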
Kind regards
Paul D.
SQL would be a lot more suitable.
proc sql;
   create table WANT as
   select a.*, mean(b.VALUE) as AVG
   from HAVE a left join HAVE b
     on a.CUSTOMERID = b.CUSTOMERID
    and b.DATE between a.DATE and a.DATE - 3600*24
   group by a.CUSTOMERID, a.DATE, a.VALUE, a.TRANSACTIONID
   order by a.CUSTOMERID, a.DATE, a.VALUE;
quit;
I agree for the small sample data set, but across 100 million rows of transaction data it is just not reasonable. While researching, I discovered hash tables and thought they would be the perfect solution for lightning-fast lookups and calculations at such a scale.
@Array_Mover wrote:
I agree for the small sample data set, but across 100 million rows of transaction data it is just not reasonable. While researching, I discovered hash tables and thought they would be the perfect solution for lightning-fast lookups and calculations at such a scale.
Not necessarily. You don't need all the rows at once for this logic to work; the minimum you need at once is all the rows for a given account.
So you could process, say, one hundredth of the table at a time (all accounts between 1 and 1000000, then 1000001 and 2000000, and so on) and still get all the results you need.
If you have 20 transactions per account on average, each chunk of a million accounts is a 20-million-row table, which the SQL parser might well decide to load into a hash table. Or not.
In any case, it's not complex programming, there are no interpreted-versus-compiled inefficiencies, and since it seems you were going to load partial hash tables anyway, you might be better off feeding partial tables to SQL instead.
Here I start with 2 million accounts and 200 million observations.
Splitting the table into 20 takes 30 seconds, and each of the 20 SORT+SQL steps takes 1.2 minutes, so the whole thing takes about 25 minutes on my machine (140 SAS sessions open atm).
data W.HAVE;
   VALUE=1;
   do DATE='01jan2020:0:0'dt to '31aug2020:23:59:59'dt by '06:00:00't;
      do ACCT=1 to 2e6;
         if ranuni(1) > .9 then output;
      end;
   end;
run;
data _null_;
   call execute('data ');
   do I=1 to 20;
      call execute(cat('W.TMP',I));
   end;
   do I=1 to 20;
      call execute(ifc(I=1,';set W.HAVE;','else '));
      call execute(cat('if ',I-1,'e5 <= ACCT < ',I,'e5 then output W.TMP',I,';'));
   end;
run;
data _null_;
   do I=1 to 20;
      call execute(cat('proc sort data=W.TMP',I,'; by ACCT DATE; run; '));
      call execute('proc sql; create table W.OUT as select a.*, mean(b.VALUE) as AVG ');
      call execute(cat('from W.TMP',I,' a left join W.TMP',I,' b '));
      call execute('on a.ACCT=b.ACCT and b.DATE between a.DATE and a.DATE-3600*24 ');
      call execute('group by a.ACCT, a.DATE, a.VALUE; quit; ');
      call execute('proc append base=W.WANT data=W.OUT; run; ');
   end;
run;
Note that library W is an SPDE library with parameters compress=binary partsize=1T, in order to reduce IO. There is never a good reason not to use SPDE to process large tables.
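For reference, such a library can be assigned with a LIBNAME statement along these lines (the path is a placeholder; the options are the ones mentioned above):
libname W spde '/some/fast/disk/path' compress=binary partsize=1T;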
Regardless of whether you adopt ChrisNZ's approach using SQL or hash tables, permit me to suggest that you use a macro loop to partition the data.
A simple technique for partitioning the data, and one that is easily adjusted to change the size of each subset, is to pick a digit of the account number to filter on. Using one digit typically gives subsets of about 10% each. If that is still too large, use two digits for roughly 1% subsets, three digits for 0.1% subsets, and so on.
For example:
%macro LoopThruSubsets(digits=1, stopAt=);
   %local i;

   proc delete data=cumulativeResults;   /* clear out before starting */
   run;

   %do i = 0 %to %sysfunc(coalescec(&stopAt, 10**&digits-1));
      %put loop for last &digits digits = %sysfunc(putn(&i, z&digits.));

      data subset;
         set have;
         where mod(acct, 10**&digits) = &i;   /* keep only accounts whose last &digits digits equal &i */
      run;

      proc append base=cumulativeResults data=subset;
      run;
   %end;
%mend;
Note that I used a simple DATA step that does nothing but subset the data, purely to highlight the approach; the intent is not that you literally create a local subset and stop there, but that you plug your actual processing (hash or SQL) into the loop in its place.
You can then call the macro to get a 10% sample like this:
%LoopThruSubsets(digits=1)
And a 1% sample:
%LoopThruSubsets(digits=2)
And if you want to test using just the first 1% sample, call the macro this way:
%LoopThruSubsets(digits=2,stopAt=0)
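To tie this back to the SQL approach above: inside the %do loop, the subset-only DATA step would be replaced by (or followed by) the per-subset processing. Here is a sketch, reusing the dataset and column names from the earlier posts (HAVE with ACCT, DATE, VALUE); SUBSET_AVG is just a name I made up here:
      data subset;
         set have;
         where mod(acct, 10**&digits) = &i;
      run;

      proc sort data=subset;
         by acct date;
      run;

      proc sql;
         create table subset_avg as
         select a.*, mean(b.value) as avg
         from subset a left join subset b
           on a.acct = b.acct
          and b.date between a.date and a.date - 3600*24
         group by a.acct, a.date, a.value;
      quit;

      proc append base=cumulativeResults data=subset_avg;
      run;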