I will try to describe my problem in pseudo-code. I am a self-taught programmer, so I apologize in advance for any gaps in my vocabulary and knowledge.
I have transaction data across accounts, timestamped down to the second. The problem with every moving-average answer I have looked up is that the data is always very structured and designed in a way that makes the calculation relatively easy.
Transaction data is difficult because it can vary wildly from account to account. An account might have 10 transactions in a 24-hour period, or 5 transactions over 10 days. For each transaction I need to check whether the previous transactions fall within the past 24 hours, and if not, exclude them.
I have created a fake dataset with dates that are commensurate with the problems I am facing.
data sample;
   input customerid : 7.
         transactionid : $13.
         date : datetime26.
         value : 3.
   ;
   format date datetime26.;
datalines;
2133780 2133780_00355 15JUN2020:12:47:17.000000 53
2133780 2133780_00687 15JUN2020:12:48:35.000000 72
2133780 2133780_00181 15JUN2020:12:59:44.000000 88
2133780 2133780_00273 15JUN2020:16:43:46.000000 96
2133780 2133780_00803 16JUN2020:22:30:15.000000 25
2133780 2133780_00411 17JUN2020:07:22:18.000000 75
2133780 2133780_00191 17JUN2020:23:16:39.000000 59
2133780 2133780_00461 14JUL2020:09:35:32.000000 22
2133780 2133780_00532 21JUL2020:10:40:54.000000 46
2133780 2133780_00609 28JUL2020:09:58:30.000000 36
2133780 2133780_00504 28JUL2020:10:12:55.000000 87
2133780 2133780_00488 31JUL2020:15:30:44.000000 36
2689999 2689999_00284 02JUN2020:09:03:55.000000 44
2689999 2689999_00734 06JUN2020:11:20:23.000000 95
2689999 2689999_00061 06JUN2020:13:40:22.000000 29
2689999 2689999_00540 07JUN2020:14:00:19.000000 14
2689999 2689999_00616 08JUN2020:13:36:24.000000 14
2689999 2689999_00002 08JUN2020:13:59:32.000000 82
2689999 2689999_00966 04JUL2020:11:29:49.000000 21
2689999 2689999_00859 10JUL2020:14:46:37.000000 34
2689999 2689999_00816 29JUL2020:13:56:45.000000 95
2689999 2689999_00045 29JUL2020:15:12:19.000000 42
2689999 2689999_00481 02AUG2020:11:02:21.000000 38
2689999 2689999_00227 02AUG2020:11:03:41.000000 42
2689999 2689999_00354 02AUG2020:12:26:34.000000 42
2689999 2689999_00794 03AUG2020:09:10:43.000000 73
2689999 2689999_00857 20AUG2020:15:03:19.000000 16
;
run;
So, for each account, I need to:
1) Loop through every transaction
2) On each transaction, review all previous transactions, compare the transaction date to each previous date, and calculate how long ago it occurred
3) Count those transactions that qualify
4) Sum those transaction amounts that qualify
5) Divide to find average
6) Repeat
7) Reset when you get to a new account
(A brute-force sketch of these steps follows the list.)
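To make the steps concrete, here is that brute-force version written as a plain DATA step. It is just a sketch: it assumes the data is sorted by customerid and date within each account, that no account has more than 1,000 transactions inside any 24-hour window, and that a gap of exactly 24 hours still counts as inside the window. It expresses the logic, but I doubt it will scale to the real data:
data brute_force;
   set sample;
   by customerid;
   array _d {1000} _temporary_;   /* dates of transactions in the window  */
   array _v {1000} _temporary_;   /* values of transactions in the window */
   retain _n 0;
   if first.customerid then _n = 0;   /* reset when a new account starts */

   /* keep only prior transactions still within the last 24 hours */
   _k = 0;
   do _i = 1 to _n;
      if date - _d{_i} <= 86400 then do;   /* exactly 24 hours counts as inside; use < for exclusive */
         _k + 1;
         _d{_k} = _d{_i};
         _v{_k} = _v{_i};
      end;
   end;
   _n = _k;

   /* add the current transaction, then count / sum / divide */
   _n + 1;
   _d{_n} = date;
   _v{_n} = value;
   count = _n;
   running_ave = 0;
   do _i = 1 to _n;
      running_ave = running_ave + _v{_i};
   end;
   running_ave = running_ave / _n;
   drop _:;
run;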
I believe the hash object will be the best route, because there could be tens of millions of rows. I have begun reading 'Data Management Solutions Using SAS Hash Table Operations', but the concept is so foreign to me that I am having trouble grasping it.
Here is what I have tried:
data _null_;
   /* all transactions, keyed by customerid */
   dcl hash transactions(dataset:"sample", multidata:"Y", ordered:"A");
   transactions.defineKey("customerid");
   transactions.defineData("date","value");
   transactions.defineDone();

   /* same data again, with date renamed so the two dates can be compared */
   dcl hash accounts(dataset:"sample(rename=(date=base_date))", multidata:"Y", ordered:"A");
   accounts.defineKey("customerid");
   accounts.defineData("customerid","base_date");
   accounts.defineDone();

   /* define the result hash object tables */
   dcl hash h_pointer;
   dcl hash byAccount(ordered:"A");
   byAccount.defineKey("customerid");
   byAccount.defineData("customerid","date","value","_average");
   byAccount.defineDone();

   if 0 then set sample;
   format base_date datetime26.;

   lr = 0;
   do until(lr);
      set sample end=lr;
      call missing(customerid, date, base_date, value);
      accounts.find();
      transactions_rc = transactions.find();
      do while(transactions_rc = 0);
         time_diff = intck('second', base_date, date);
         put time_diff;
         if (-86400 le time_diff le 86400) then leave;
         transactions_rc = transactions.find_next();
      end;
      h_pointer = byAccount;
      link calculator;
   end;
   stop;

calculator:
   call missing(date, value, _average);
   rc = h_pointer.find();
   counts + 1;
   h_pointer.replace();
   return;
run;
Right now, it is just trying to count. Any help would be appreciated!
I believe the hash object is a good fit. You might want to check out the section on stacks: a separate stack for each day, outputting (as appropriate) and clearing as you complete each day.
I do have a couple of questions about your requirements:
3) Count those transactions that qualify
4) Sum those transaction amounts that qualify
5) Divide to find average
I am not sure what you mean by "transactions that qualify", nor what you want to do with the results.
3) Count those transactions that qualify
4) Sum those transaction amounts that qualify
What I mean, if you look at the transactions in the sample data, is that as time passes, older transactions fall out of the 24-hour window. That is what I meant when I said they would no longer "qualify": they wouldn't be considered for the current line's 24-hour calculation.
I have done the calculations by hand to see what I want the result table to look like. I hope it helps:
Row | customerid | transactionid | date | value | count | running_ave | Notes |
1 | 2133780 | 2133780_00355 | 15Jun2020 12:47:17 | 53 | 1 | 53.0 | |
2 | 2133780 | 2133780_00687 | 15Jun2020 12:48:35 | 72 | 2 | 62.5 | |
3 | 2133780 | 2133780_00181 | 15Jun2020 12:59:44 | 88 | 3 | 71.0 | |
4 | 2133780 | 2133780_00273 | 15Jun2020 16:43:46 | 96 | 4 | 77.3 | |
5 | 2133780 | 2133780_00803 | 16Jun2020 22:30:15 | 25 | 1 | 25.0 | More than 24 hours passed since the last transaction, a full reset |
6 | 2133780 | 2133780_00411 | 17Jun2020 7:22:18 | 75 | 2 | 50.0 | Even though a "day" has passed, it is June 17th now, the previous transaction is still within a 24 hour window |
7 | 2133780 | 2133780_00191 | 17Jun2020 23:16:39 | 59 | 2 | 67.0 | The transaction on row 5 now falls away, but the previous transaction on row 6 is within a 24 hour window |
8 | 2133780 | 2133780_00461 | 14Jul2020 9:35:32 | 22 | 1 | 22.0 | |
9 | 2133780 | 2133780_00532 | 21Jul2020 10:40:54 | 46 | 1 | 46.0 | |
10 | 2133780 | 2133780_00609 | 28Jul2020 9:58:30 | 36 | 1 | 36.0 | |
11 | 2133780 | 2133780_00504 | 28Jul2020 10:12:55 | 87 | 2 | 61.5 | |
12 | 2133780 | 2133780_00488 | 31Jul2020 15:30:44 | 36 | 1 | 36.0 | |
13 | 2689999 | 2689999_00284 | 02Jun2020 9:03:55 | 44 | 1 | 44.0 | A new CustomerId |
14 | 2689999 | 2689999_00734 | 06Jun2020 11:20:23 | 95 | 1 | 95.0 | |
15 | 2689999 | 2689999_00061 | 06Jun2020 13:40:22 | 29 | 2 | 62.0 | |
16 | 2689999 | 2689999_00540 | 07Jun2020 14:00:19 | 14 | 1 | 14.0 | This is just outside of the 24 hour window of the previous transaction |
17 | 2689999 | 2689999_00616 | 08Jun2020 13:36:24 | 14 | 2 | 14.0 | This is just inside of the 24 hour window of the previous transaction |
18 | 2689999 | 2689999_00002 | 08Jun2020 13:59:32 | 82 | 3 | 36.7 | This is around 30 seconds from being outside the 24 hour window of the previous 2 transactions. This is how granular I want it to get |
19 | 2689999 | 2689999_00966 | 04Jul2020 11:29:49 | 21 | 1 | 21.0 | |
20 | 2689999 | 2689999_00859 | 10Jul2020 14:46:37 | 34 | 1 | 34.0 | |
21 | 2689999 | 2689999_00816 | 29Jul2020 13:56:45 | 95 | 1 | 95.0 | |
22 | 2689999 | 2689999_00045 | 29Jul2020 15:12:19 | 42 | 2 | 68.5 | |
23 | 2689999 | 2689999_00481 | 02Aug2020 11:02:21 | 38 | 1 | 38.0 | |
24 | 2689999 | 2689999_00227 | 02Aug2020 11:03:41 | 42 | 2 | 40.0 | |
25 | 2689999 | 2689999_00354 | 02Aug2020 12:26:34 | 42 | 3 | 40.7 | |
26 | 2689999 | 2689999_00794 | 03Aug2020 9:10:43 | 73 | 4 | 48.8 | A new day, but still within 24 hours of the previous 3 transactions |
27 | 2689999 | 2689999_00857 | 20Aug2020 15:03:19 | 16 | 1 | 16.0 | |
Also, someone else in the comments below mentioned that they would use a hash of hashes if the data were sorted by timestamp/customerid. I wanted to say that in the actual transaction data I am using, the data is sorted in this manner; I chose to present the sample data sorted by customerid/timestamp to illustrate my problem.
Here's a hash-of-hashes solution:
/* SAMPLE is the dataset created by the data step in the original post. */
proc sort data=sample;
   by date customerid;
run;
data want (drop=_:);
   set sample;
   if _n_=1 then do;
      declare hash hoh ();
      hoh.definekey('customerid');
      hoh.definedata('customerid','h','ih','_value_sum','nvalues');
      hoh.definedone();
      declare hash h;
      declare hiter ih;
   end;

   /* New customerid? Initialize a new hash & iterator. */
   if hoh.find()^=0 then do;
      h=_new_ hash(ordered:'a', multidata:'Y');   /* Added the MULTIDATA option */
      h.definekey('date');
      h.definedata('date','value');
      h.definedone();
      ih=_new_ hiter('h');
      _value_sum=0;
      nvalues=0;
   end;

   h.add();
   _value_sum = _value_sum + value;
   nvalues = nvalues + 1;

   _stale_timestamp = date - '24:00:00't;
   format _stale_timestamp datetime20.;
   array _drop_dates {0:100};

   /* Adjust _value_sum and nvalues for all stale timestamps.        */
   /* Also capture their dates to guide stale data item deletions.   */
   do _rc=ih.first() by 0 while(date < _stale_timestamp);
      _value_sum = _value_sum - value;
      nvalues = nvalues - 1;
      _ndd = n(of _drop_dates{*});
      if _drop_dates{_ndd}^=date then _drop_dates{_ndd+1}=date;
      _rc = ih.next();
   end;

   /* Replace updated _value_sum and nvalues in hoh */
   hoh.replace();

   /* Now remove any stale timestamps from h */
   _ndd = n(of _drop_dates{*});
   if _ndd>0 then do _d=1 to _ndd;
      h.remove(key:_drop_dates{_d});
      _drop_dates{_d}=.;
   end;

   average = _value_sum/nvalues;
   ih.last();   /* Retrieve date and value of most recent record */
run;
Using the multidata:"Y" option alleviates the need to use transactionid as a hash key for storage purposes.
If there are duplicate dates, they all appear in the hash object, but each date appears only once in the _drop_dates array. That's not a problem, because the REMOVE method removes all of the items stored under the specified key.
Also, this assumes that there are no more than 100 records in any 24-hour period for a given customerid. If there can be more, modify the upper bound of _drop_dates.
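As a side note, here is a tiny standalone sketch (an illustration only, not part of the solution above) showing that with multidata:'y' the REMOVE method deletes every item stored under a key:
data _null_;
   length k v 8;
   call missing(k, v);
   declare hash h(multidata:'y');
   h.definekey('k');
   h.definedata('k','v');
   h.definedone();
   k = 1; v = 10; rc = h.add();
   k = 1; v = 20; rc = h.add();   /* duplicate key, kept thanks to multidata */
   k = 2; v = 30; rc = h.add();
   rc = h.remove(key: 1);         /* removes BOTH items stored under k=1 */
   n = h.num_items;
   put n=;                        /* writes n=1 to the log */
run;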
EDITED NOTE: I had originally forgotten to include the multidata:'Y' option in the "h=_new_ hash ..." statement, even though I explicitly mentioned it in my comments.
It's there now, which eliminates the ERROR messages for "duplicate key" and (more importantly) properly accounts for duplicate dates in the running 24-hour average calculations.
Thank you for taking the time to help my understanding of the hash object. I will attempt to apply this solution to my data and check back in.
I get an error that repeats ad nauseam in the log: "ERROR: Duplicate key."
Despite that error, on a small data set of 45,000 transactions it seems to have calculated correctly.
However, on a large data set of 20 million transactions it failed, saying it ran out of memory.
The errors are generated by the unassigned H.ADD method call: at every attempt to add a hash item whose key is already in the table, the key is not inserted and an error is written to the log. Since discarding the duplicate-key items has no effect on the correctness of your calculation, it's okay; all you want is to make the pesky log errors go away. You can do that by making the call assigned (_iorc_ is a nice receptacle for it because it is (a) available and (b) automatically dropped):
_iorc_ = h.add() ;
As for the memory overload, turn on the FULLSTIMER system option to see the DATA step's memory footprint at the point where it terminates, and compare it to your available system resources. If your MEMSIZE option is set too low relative to the RAM available on your machine, you can remedy the situation by bumping it up when you invoke SAS or in the configuration file.
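For example (the 8G value below is just an illustration; pick a value appropriate for your machine and site configuration):
/* FULLSTIMER can be turned on inside the session: */
options fullstimer;

/* MEMSIZE is a startup option, so set it when SAS is invoked */
/* or in the configuration file, for example:                 */
/*    sas -memsize 8G        (command line)                   */
/*    -MEMSIZE 8G            (sasv9.cfg)                       */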
Kind regards
Paul D.
SQL would be a lot more suitable.
proc sql;
   create table WANT as
   select a.*, mean(b.VALUE) as AVG
   from HAVE a left join HAVE b
     on a.CUSTOMERID = b.CUSTOMERID
    and b.DATE between a.DATE and a.DATE - 3600*24
   group by a.CUSTOMERID, a.DATE, a.VALUE, a.TRANSACTIONID
   order by a.CUSTOMERID, a.DATE, a.VALUE;
quit;
I agree for the small sample data set, but across 100 million rows of transaction data it is just not reasonable. While researching, I discovered hash tables and thought they would be the perfect solution for lightning-fast lookups and calculations at such a scale.
@Array_Mover wrote:
I agree for the small sample data set, but across 100 million rows of transaction data it is just not reasonable. While researching, I discovered hash tables and thought they would be the perfect solution for lightning-fast lookups and calculations at such a scale.
Not necessarily. You don't need all the rows at once for this logic to work; the minimum you need at once is all the rows for a given account.
So you could process, say, one hundredth of the table at a time (all accounts between 1 and 1000000, then 1000001 and 2000000, and so on) and still get all the results you need.
If you have 20 transactions per account on average, each chunk of a million accounts is a 20-million-row table, which the SQL parser might well decide to load into a hash table. Or not.
In any case, it's not complex programming, there are no interpreted-versus-compiled inefficiencies, and since it seems you were going to load partial hash tables anyway, you might be better off feeding partial tables to SQL instead.
Here I start with 2 million accounts and 200 million observations.
Splitting the table into 20 takes 30 seconds, and each of the 20 SORT+SQL steps takes 1.2 minutes, so the whole thing takes about 25 minutes on my machine (140 SAS sessions open atm).
data W.HAVE;
   VALUE=1;
   do DATE='01jan2020:0:0'dt to '31aug2020:23:59:59'dt by '06:00:00't;
      do ACCT=1 to 2e6;
         if ranuni(1) > .9 then output;
      end;
   end;
run;
data _null_;
   call execute('data ');
   do I=1 to 20;
      call execute(cat('W.TMP',I));
   end;
   do I=1 to 20;
      call execute(ifc(I=1,';set W.HAVE;','else '));
      call execute(cat('if ',I-1,'e5 <= ACCT < ',I,'e5 then output W.TMP',I,';'));
   end;
run;
data _null_;
   do I=1 to 20;
      call execute(cat('proc sort data=W.TMP',I,'; by ACCT DATE; run; '));
      call execute('proc sql; create table W.OUT as select a.*, mean(b.VALUE) as AVG ');
      call execute(cat('from W.TMP',I,' a left join W.TMP',I,' b '));
      call execute('on a.ACCT=b.ACCT and b.DATE between a.DATE and a.DATE-3600*24 ');
      call execute('group by a.ACCT, a.DATE, a.VALUE; quit; ');
      call execute('proc append base=W.WANT data=W.OUT; run; ');
   end;
run;
Note that library W is an SPDE library with parameters compress=binary partsize=1T, in order to reduce IO. There is never a good reason not to use SPDE to process large tables.
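For reference, such a library can be assigned with a LIBNAME statement along these lines (the path is a placeholder; the options are the ones mentioned above):
libname W spde '/some/fast/disk/path' compress=binary partsize=1T;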
Regardless of whether you adopt ChrisNZ's approach using SQL or hash tables, permit me to suggest that you use a macro loop to partition the data.
A simple technique for partitioning the data, and one that is easily adjusted to change the size of each subset, is to pick a digit of the account number to filter on. Using one digit typically gives subsets of about 10% each. If that is still too large, use two digits for roughly 1% subsets, three digits for 0.1% subsets, and so on.
For example:
%macro LoopThruSubsets(digits=1, stopAt=);
   %local i;

   proc delete data=cumulativeResults;   /* clear out before starting */
   run;

   %do i = 0 %to %sysfunc(coalescec(&stopAt, 10**&digits-1));
      %put loop for last &digits digits = %sysfunc(putn(&i, z&digits.));

      data subset;
         set have;
         where mod(acct, 10**&digits) = &i;   /* keep only accounts whose last &digits digits equal &i */
      run;

      proc append base=cumulativeResults data=subset;
      run;
   %end;
%mend;
Note that I used a simple DATA step that does nothing but subset the data, purely to highlight the approach; the intent is not that you literally create a local subset and stop there, but that you plug your actual processing (hash or SQL) into the loop in its place.
You can then call the macro to get a 10% sample like this:
%LoopThruSubsets(digits=1)
And a 1% sample:
%LoopThruSubsets(digits=2)
And if you want to test using just the first 1% sample, call the macro this way:
%LoopThruSubsets(digits=2,stopAt=0)
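To tie this back to the SQL approach above: inside the %do loop, the subset-only DATA step would be replaced by (or followed by) the per-subset processing. Here is a sketch, reusing the dataset and column names from the earlier posts (HAVE with ACCT, DATE, VALUE); SUBSET_AVG is just a name I made up here:
      data subset;
         set have;
         where mod(acct, 10**&digits) = &i;
      run;

      proc sort data=subset;
         by acct date;
      run;

      proc sql;
         create table subset_avg as
         select a.*, mean(b.value) as avg
         from subset a left join subset b
           on a.acct = b.acct
          and b.date between a.date and a.date - 3600*24
         group by a.acct, a.date, a.value;
      quit;

      proc append base=cumulativeResults data=subset_avg;
      run;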