Array_Mover
Obsidian | Level 7

I will try to describe my problem in pseudo-code.  I am a self-taught programmer, so I apologize in advance for any gaps in my vocabulary and knowledge.

 

I have transaction data across accounts down to the second.  The problem with every moving-average answer I have looked up is that the data is always very structured and arranged in a way that makes the calculation relatively easy.

 

The transaction data is difficult because it can vary wildly depending on the account.  An account might have 10 transactions in a 24-hour period, or 5 transactions over 10 days.  I need to be able to check whether the previous transactions were within 24 hours and, if not, exclude them from the count.

 

I have created a fake dataset with dates that are commensurate with the problems I am facing.

 

data sample;
    input customerid : 7.
          transactionid : $13.
          date : datetime26.
          value : 3.
    ;
    format date datetime26.;

datalines;
2133780 2133780_00355 15JUN2020:12:47:17.000000   53
2133780 2133780_00687 15JUN2020:12:48:35.000000   72
2133780 2133780_00181 15JUN2020:12:59:44.000000   88
2133780 2133780_00273 15JUN2020:16:43:46.000000   96
2133780 2133780_00803 16JUN2020:22:30:15.000000   25
2133780 2133780_00411 17JUN2020:07:22:18.000000   75
2133780 2133780_00191 17JUN2020:23:16:39.000000   59
2133780 2133780_00461 14JUL2020:09:35:32.000000   22
2133780 2133780_00532 21JUL2020:10:40:54.000000   46
2133780 2133780_00609 28JUL2020:09:58:30.000000   36
2133780 2133780_00504 28JUL2020:10:12:55.000000   87
2133780 2133780_00488 31JUL2020:15:30:44.000000   36
2689999 2689999_00284 02JUN2020:09:03:55.000000   44
2689999 2689999_00734 06JUN2020:11:20:23.000000   95
2689999 2689999_00061 06JUN2020:13:40:22.000000   29
2689999 2689999_00540 07JUN2020:14:00:19.000000   14
2689999 2689999_00616 08JUN2020:13:36:24.000000   14
2689999 2689999_00002 08JUN2020:13:59:32.000000   82
2689999 2689999_00966 04JUL2020:11:29:49.000000   21
2689999 2689999_00859 10JUL2020:14:46:37.000000   34
2689999 2689999_00816 29JUL2020:13:56:45.000000   95
2689999 2689999_00045 29JUL2020:15:12:19.000000   42
2689999 2689999_00481 02AUG2020:11:02:21.000000   38
2689999 2689999_00227 02AUG2020:11:03:41.000000   42
2689999 2689999_00354 02AUG2020:12:26:34.000000   42
2689999 2689999_00794 03AUG2020:09:10:43.000000   73
2689999 2689999_00857 20AUG2020:15:03:19.000000   16
;
run;

So, for each account, I need to (a rough DATA step sketch of these steps follows the list):

1) Loop through every transaction

2) On each transaction, review all previous transactions, compare their dates to the current transaction's date, and calculate how long ago each occurred

3) Count those transactions that qualify

4) Sum those transaction amounts that qualify

5) Divide to find the average

6) Repeat

7) Reset when you get to a new account
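
To pin down the logic, here is a rough brute-force DATA step sketch of those steps.  My assumptions: the data are sorted by customerid and date, no account has more than 1000 transactions inside any single 24-hour window (an arbitrary bound), and a gap of exactly 24 hours still counts as inside the window.  It is only meant to show the calculation I am describing.

data want_sketch;
  set sample;
  by customerid;

  /* arbitrary bound - assumes no account has more than 1000      */
  /* transactions inside any single 24-hour window                */
  array _d {1000} _temporary_;   /* dates still inside the window  */
  array _v {1000} _temporary_;   /* values still inside the window */
  retain _n 0;

  if first.customerid then _n = 0;        /* 7) reset for a new account */

  /* 2) keep only prior transactions within 24 hours (86400 seconds) */
  _keep = 0;
  do _i = 1 to _n;
    if date - _d{_i} <= 86400 then do;    /* exactly 24h treated as inside */
      _keep + 1;
      _d{_keep} = _d{_i};
      _v{_keep} = _v{_i};
    end;
  end;

  /* the current transaction joins the window */
  _n = _keep + 1;
  _d{_n} = date;
  _v{_n} = value;

  /* 3)-5) count, sum and average the qualifying transactions */
  count = _n;
  _sum = 0;
  do _i = 1 to _n;
    _sum + _v{_i};
  end;
  running_ave = _sum / count;

  drop _:;
run;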

 

I believe the Hash Object will be the best route, because there could be tens of millions of rows.  I have begun reading 'Data Management Solutions Using SAS Hash Table Operations', but the concept is so foreign to me that I am having trouble grasping it.

 

Here is what I have tried:

 

data _null_;
  dcl hash transactions(dataset:"sample", multidata:"Y", ordered:"A");
  transactions.defineKey("customerid");
  transactions.defineData("date", "value");
  transactions.defineDone();

  dcl hash accounts(dataset:"sample(rename=(date=base_date))", multidata:"Y", ordered:"A");
  accounts.defineKey("customerid");
  accounts.defineData("customerid", "base_date");
  accounts.defineDone();

  /* define the result hash object tables */
  dcl hash h_pointer;
  dcl hash byAccount(ordered:"A");
  byAccount.defineKey("customerid");
  byAccount.defineData("customerid", "date", "value", "_average");
  byAccount.defineDone();

  if 0 then set sample;
  format base_date datetime26.;

  lr = 0;
  do until(lr);
    set sample end=lr;
    call missing(customerid, date, base_date, value);
    accounts.find();
    transactions_rc = transactions.find();
    do while(transactions_rc = 0);
      time_diff = intck('second', base_date, date);
      put time_diff;
      if (-86400 le time_diff le 86400) then leave;
      transactions_rc = transactions.find_next();
    end;
    h_pointer = byAccount;
    link calculator;
  end;
  stop;

  calculator:
    call missing(date, value, _average);
    rc = h_pointer.find();
    counts + 1;
    h_pointer.replace();
  return;
run;

Right now, it is just trying to count.  Any help would be appreciated!

 

@hashman @DonH

DonH
Lapis Lazuli | Level 10

I believe the hash object is a good fit. You might want to check out the section on stacks - a separate stack for each day - outputting (as appropriate) and clearing as you complete each day.

 

I do have a couple of questions about your requirements:

 

3) Count those transactions that qualify

4) Sum those transaction amounts that qualify

5) Divide to find the average

 

I am not sure what you mean by "transactions that qualify," nor what to do with the results.

Array_Mover
Obsidian | Level 7

3) Count those transactions that qualify

4) Sum those transaction amounts that qualify

What I mean, if you look at the transactions in the sample data, is that as time passes, older transactions fall out of the 24-hour window.  That is what I meant by saying they would no longer "qualify": they would not be considered for the current line's 24-hour calculation.

 

I have done the calculations by hand to see what I want the result table to look like.  I hope it helps:

Row | customerid | transactionid | date               | value | count | running_ave | Notes
----+------------+---------------+--------------------+-------+-------+-------------+------
1   | 2133780    | 2133780_00355 | 15Jun2020 12:47:17 | 53    | 1     | 53.0        |
2   | 2133780    | 2133780_00687 | 15Jun2020 12:48:35 | 72    | 2     | 62.5        |
3   | 2133780    | 2133780_00181 | 15Jun2020 12:59:44 | 88    | 3     | 71.0        |
4   | 2133780    | 2133780_00273 | 15Jun2020 16:43:46 | 96    | 4     | 77.3        |
5   | 2133780    | 2133780_00803 | 16Jun2020 22:30:15 | 25    | 1     | 25.0        | More than 24 hours passed since the last transaction, a full reset
6   | 2133780    | 2133780_00411 | 17Jun2020 7:22:18  | 75    | 2     | 50.0        | Even though a "day" has passed, it is June 17th now, the previous transaction is still within a 24 hour window
7   | 2133780    | 2133780_00191 | 17Jun2020 23:16:39 | 59    | 2     | 67.0        | The transaction on row 5 now falls away, but the previous transaction on row 6 is within a 24 hour window
8   | 2133780    | 2133780_00461 | 14Jul2020 9:35:32  | 22    | 1     | 22.0        |
9   | 2133780    | 2133780_00532 | 21Jul2020 10:40:54 | 46    | 1     | 46.0        |
10  | 2133780    | 2133780_00609 | 28Jul2020 9:58:30  | 36    | 1     | 36.0        |
11  | 2133780    | 2133780_00504 | 28Jul2020 10:12:55 | 87    | 2     | 61.5        |
12  | 2133780    | 2133780_00488 | 31Jul2020 15:30:44 | 36    | 1     | 36.0        |
13  | 2689999    | 2689999_00284 | 02Jun2020 9:03:55  | 44    | 1     | 44.0        | A new CustomerId
14  | 2689999    | 2689999_00734 | 06Jun2020 11:20:23 | 95    | 1     | 95.0        |
15  | 2689999    | 2689999_00061 | 06Jun2020 13:40:22 | 29    | 2     | 62.0        |
16  | 2689999    | 2689999_00540 | 07Jun2020 14:00:19 | 14    | 1     | 14.0        | This is just outside of the 24 hour window of the previous transaction
17  | 2689999    | 2689999_00616 | 08Jun2020 13:36:24 | 14    | 2     | 14.0        | This is just inside of the 24 hour window of the previous transaction
18  | 2689999    | 2689999_00002 | 08Jun2020 13:59:32 | 82    | 3     | 36.7        | This is around 30 seconds from being outside the 24 hour window of the previous 2 transactions.  This is how granular I want it to get
19  | 2689999    | 2689999_00966 | 04Jul2020 11:29:49 | 21    | 1     | 21.0        |
20  | 2689999    | 2689999_00859 | 10Jul2020 14:46:37 | 34    | 1     | 34.0        |
21  | 2689999    | 2689999_00816 | 29Jul2020 13:56:45 | 95    | 1     | 95.0        |
22  | 2689999    | 2689999_00045 | 29Jul2020 15:12:19 | 42    | 2     | 68.5        |
23  | 2689999    | 2689999_00481 | 02Aug2020 11:02:21 | 38    | 1     | 38.0        |
24  | 2689999    | 2689999_00227 | 02Aug2020 11:03:41 | 42    | 2     | 40.0        |
25  | 2689999    | 2689999_00354 | 02Aug2020 12:26:34 | 42    | 3     | 40.7        |
26  | 2689999    | 2689999_00794 | 03Aug2020 9:10:43  | 73    | 4     | 48.8        | A new day, but still within 24 hours of the previous 3 transactions
27  | 2689999    | 2689999_00857 | 20Aug2020 15:03:19 | 16    | 1     | 16.0        |

 


Array_Mover
Obsidian | Level 7

Also, someone else in the comments below mentioned that they would use a hash of hashes if the data were sorted by timestamp/customerid.  I wanted to say that in the actual transaction data I am using, the data is sorted in this manner.  I chose to present the sample data sorted by customerid/timestamp for the purpose of illustrating my problem.

hashman
Ammonite | Level 13

@DonH @mkeintz @ChrisNZ @Ksharp:

 

I greatly appreciate all of you chiming in. Would love to myself ... but haven't time.

 

Kind regards

Paul D.    

mkeintz
PROC Star

Here's a hash-of-hashes solution:

 

data sample;
  input customerid : 7.  transactionid: $13.
        date : datetime26.  value : 3. ;
    format date datetime26.;
datalines;
2133780 2133780_00355 15JUN2020:12:47:17.000000   53
2133780 2133780_00687 15JUN2020:12:48:35.000000   72
2133780 2133780_00181 15JUN2020:12:59:44.000000   88
2133780 2133780_00273 15JUN2020:16:43:46.000000   96
2133780 2133780_00803 16JUN2020:22:30:15.000000   25
2133780 2133780_00411 17JUN2020:07:22:18.000000   75
2133780 2133780_00191 17JUN2020:23:16:39.000000   59
2133780 2133780_00461 14JUL2020:09:35:32.000000   22
2133780 2133780_00532 21JUL2020:10:40:54.000000   46
2133780 2133780_00609 28JUL2020:09:58:30.000000   36
2133780 2133780_00504 28JUL2020:10:12:55.000000   87
2133780 2133780_00488 31JUL2020:15:30:44.000000   36
2689999 2689999_00284 02JUN2020:09:03:55.000000   44
2689999 2689999_00734 06JUN2020:11:20:23.000000   95
2689999 2689999_00061 06JUN2020:13:40:22.000000   29
2689999 2689999_00540 07JUN2020:14:00:19.000000   14
2689999 2689999_00616 08JUN2020:13:36:24.000000   14
2689999 2689999_00002 08JUN2020:13:59:32.000000   82
2689999 2689999_00966 04JUL2020:11:29:49.000000   21
2689999 2689999_00859 10JUL2020:14:46:37.000000   34
2689999 2689999_00816 29JUL2020:13:56:45.000000   95
2689999 2689999_00045 29JUL2020:15:12:19.000000   42
2689999 2689999_00481 02AUG2020:11:02:21.000000   38
2689999 2689999_00227 02AUG2020:11:03:41.000000   42
2689999 2689999_00354 02AUG2020:12:26:34.000000   42
2689999 2689999_00794 03AUG2020:09:10:43.000000   73
2689999 2689999_00857 20AUG2020:15:03:19.000000   16
run;

proc sort;
  by date customerid;
run;


data want (drop=_:);
  set sample;

  if _n_=1 then do;
    declare hash hoh ();
      hoh.definekey('customerid');
      hoh.definedata('customerid','h','ih','_value_sum','nvalues');
      hoh.definedone();
    declare hash h;
    declare hiter ih;
  end;

  /* New customerid?  Initialize a new hash & iterator. */
  if hoh.find()^=0 then do;
    h=_new_ hash(ordered:'a',multidata:'Y');  /*Added the MULTIDATA option*/
      h.definekey('date');
      h.definedata('date','value');
      h.definedone();
    ih=_new_ hiter('h');
    _value_sum=0;
    nvalues=0;
  end;
  h.add();

  _value_sum = _value_sum + value;
  nvalues=nvalues+1;

  _stale_timestamp = date - '24:00:00't;
  format _stale_timestamp datetime20.;

  array _drop_dates {0:100} ;
 
  /* Adjust _value_sum and nvalues for all stale timestamps       */
  /* Also capture their dates to guide stale data item deletions  */
  do _rc=ih.first() by 0 while(date<_stale_timestamp);
    _value_sum = _value_sum - value;
    nvalues = nvalues - 1;
    _ndd = n(of _drop_dates{*});
    if _drop_dates{_ndd}^=date then _drop_dates{_ndd+1}=date;
    _rc = ih.next();
  end;

  /* Replace updated _value_sum and nvalues in hoh */
  hoh.replace();

  /* Now remove any stale timestamps from h */
  _ndd=n(of _drop_dates{*});
  if _ndd>0 then do _d=1 to _ndd;
    h.remove(key:_drop_dates{_d});
    _drop_dates{_d}=.;
  end;

  average=_value_sum/nvalues;
  ih.last();  /*Retrieve date and value of most recent record*/
run;

Using the multidata:"Y" option eliminates the need to use transactionid as a hash key for storage purposes.

 

If there are duplicate dates, they all appear in the hash object, but each date appears only once in the _drop_dates array.  That's not a problem, because the REMOVE method removes all of the items for the specified key.
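
For instance, this throwaway step (not part of the solution above; the key and data values are made up) shows that one REMOVE call clears every item stored under a key when multidata:'Y' is in effect:

data _null_;
  declare hash h(multidata:'Y');
  h.definekey('k');
  h.definedata('k','v');
  h.definedone();

  k=1; v=10; h.add();
  k=1; v=20; h.add();           /* second item under the same key */
  k=2; v=30; h.add();

  rc = h.remove(key:1);         /* removes BOTH items with k=1 */
  n = h.num_items;
  put 'items left after remove: ' n=;   /* prints n=1 */
run;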

 

Also, this assumes that there are no more than 100 records in any 24-hour period for a given customerid.  If there can be more, then modify the upper bound of _drop_dates.

 

EDITED NOTE:  I had originally forgotten to insert the multidata:'Y' option in the "h=_new_ hash ..." statement - even though I explicitly mentioned it in my comments.

It's there now, which will eliminate the ERROR messages for "duplicate keys" and (more importantly) properly account for duplicate dates in the running 24-hour average calculations.

hashman
Ammonite | Level 13

@mkeintz:

What a beautiful piece of hash code, Mark. Thanks 1E+6!

Kind regards

Paul D. 

Array_Mover
Obsidian | Level 7

Thank you for taking the time to help my understanding of the hash object. I will attempt to apply this solution to my data and check back in.

Array_Mover
Obsidian | Level 7

I get an error that repeats ad nauseam in the log: "ERROR: Duplicate key."

 

Despite that error, on a small data set of 45,000 transactions it seems like it still calculated correctly.

 

However, on a large data set of 20 million transactions, it failed saying it ran out of memory.

hashman
Ammonite | Level 13

@Array_Mover:

The errors are generated by the unassigned H.ADD method call at every attempt to add a hash item whose key is already in the table. The key is not inserted, and an error is written to the log. Since discarding the duplicate-key items has no effect on the correctness of your calculation, it's okay; all you want is to make the pesky log errors go away. You can do that by making the call assigned (_IORC_ is a nice receptacle for it because it's (a) available and (b) automatically dropped):

 

_iorc_ = h.add() ;
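
If you want to see the difference in isolation, here is a tiny throwaway step (nothing to do with your data; the single key variable is made up):

data _null_;
  declare hash h();    /* no MULTIDATA, so duplicate keys are rejected   */
  h.definekey('k');
  h.definedone();
  k = 1;
  h.add();             /* first add: succeeds quietly                    */
  h.add();             /* unassigned duplicate add: ERROR in the log     */
  rc = h.add();        /* assigned duplicate add: no ERROR, rc ne 0      */
  put rc=;
run;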

 

As for the memory overload, turn the FULLSTIMER system option on to see the DATA step memory footprint at which it terminates and compare it to your available system resources. If your MEMSIZE option is set too low compared to the RAM available on your machine, you can remedy the situation by bumping it up when you invoke SAS or in the configuration file. 
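
For example (the 8G value and the file names below are placeholders, not recommendations):

options fullstimer;   /* step-by-step memory usage now appears in the log notes */

/* MEMSIZE cannot be changed mid-session; set it at invocation, e.g.:
     sas -memsize 8G
   or add  -MEMSIZE 8G  to your sasv9.cfg configuration file.            */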

 

Kind regards

Paul D.

ChrisNZ
Tourmaline | Level 20

SQL would be a lot more suitable.

 

proc sql;
  create table WANT as
  select a.*, mean(b.VALUE) as AVG
  from HAVE a left join HAVE b
    on a.CUSTOMERID=b.CUSTOMERID
   and b.DATE between a.DATE and a.DATE-3600*24
  group by a.CUSTOMERID, a.DATE, a.VALUE, a.TRANSACTIONID
  order by a.CUSTOMERID, a.DATE, a.VALUE;
quit;

 

 

 

Array_Mover
Obsidian | Level 7

I agree for the small sample data set, but across 100 million rows of transaction data, it is just not reasonable.  While researching, I discovered hash tables and thought that would be the perfect solution for lightning-fast lookups and calculations at such a scale.

ChrisNZ
Tourmaline | Level 20

@Array_Mover wrote:

I agree for the small sample data set, but across 100 million rows of transaction data, it is just not reasonable.  While researching, I discovered hash tables and thought that would be the perfect solution for lightning-fast lookups and calculations at such a scale.


Not necessarily. You don't need all the rows at once for this logic to work. The minimum you need is all the rows for a given account at once.

So you could process the table one hundredth at a time (say all accounts between 1 and 1000000, then 1000001 and 2000000, etc.) and still get all the results you need.

If you have 20 transactions per account on average, that's a 20-million row table, which the SQL parser might well decide to load into a hash table. Or not.

In any case, it's not complex programming, there are no interpreted-versus-compiled inefficiencies, and since it seems you are going to load partial hash tables anyway, you might be better off loading partial tables into SQL instead.

ChrisNZ
Tourmaline | Level 20

Here I start with 2 million accounts and 200 million observations.

Splitting the table into 20 takes 30 seconds, and each of the 20 SORT+SQL steps takes 1.2 minutes, so the whole thing takes about 25 minutes on my machine (140 SAS sessions open atm).

data W.HAVE;
  VALUE=1;
  do DATE='01jan2020:0:0'dt to '31aug2020:23:59:59'dt by '06:00:00't;
    do ACCT =1 to 2e6;
      if ranuni(1) > .9 then output;
    end;
  end;
run;
       
data _null_;
  call execute('data ');
  do I=1 to 20;
    call execute(cat('W.TMP',I));
  end;
  do I=1 to 20;
    call execute(ifc(I=1,';set W.HAVE;','else '));
    call execute(cat('if ',I-1,'e5 < ACCT <= ',I,'e5 then output W.TMP',I,';'));
  end;
run;
 
data _null_;
  do I=1 to 20;
    call execute(cat('proc sort data=W.TMP',I,'; by ACCT DATE; run;                '));
    call execute('proc sql; create table W.OUT as select a.*, mean(b.VALUE) as AVG ');
    call execute(cat('from W.TMP', I, ' a left join W.TMP', I, ' b                 '));
    call execute('on a.ACCT=b.ACCT and b.DATE between a.DATE and a.DATE-3600*24    ');
    call execute('group by a.ACCT, a.DATE, a.VALUE; quit;                          ');
    call execute('proc append base=W.WANT data=W.OUT; run;                         ');
  end;
run;

Note that library W is an SPDE library with parameters compress=binary partsize=1T, in order to reduce IO. There is never a good reason not to use SPDE to process large tables.
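
For reference, the library assignment looks something like this (the path is a placeholder):

/* SPDE engine, binary compression, 1 TB partitions - as described above */
libname W spde '/some/fast/disk/spde' compress=binary partsize=1T;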

DonH
Lapis Lazuli | Level 10

Regardless of whether you adopt ChrisNZ's approach using SQL or hash tables, permit me to suggest that you use a macro loop to partition the data.

A simple technique to partition the data, and one that is easily updated to change the size of the subset, is to pick a digit from the account number to filter on. Typically that results in each subset being about 10% of the data. If that subset is too large, use two digits, giving you approximately a 1% subset; three digits gives a 0.1% subset; and so on.

For example:

%macro LoopThruSubsets(digits=1,stopAt=);
  %local i;

  proc delete data=cumulativeResults;  /* clear out before starting */
  run;

  %do i = 0 %to %sysfunc(coalescec(&stopAt,10**&digits-1));
    %put loop for last &digits digits = %sysfunc(putn(&i,z&digits.));

    data subset;
      set have;
      where mod(acct,10**&digits)=&i;
    run;

    proc append base=cumulativeResults data=subset;
    run;
  %end;
%mend;

Note that I used a simple DATA step that does nothing but subset the data, in order to highlight the approach (edited to further clarify that the intent was not to create a local subset for processing).

You can then call the macro to get a 10% sample like this:

%LoopThruSubsets(digits=1)

And a 1% sample:

%LoopThruSubsets(digits=2)

And if you want to test using just the first 1% sample, call the macro this way:

%LoopThruSubsets(digits=2,stopAt=0)
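
Once you are happy with the subsetting, the placeholder DATA step and PROC APPEND inside the %do loop would be replaced by the real work. Here is a hypothetical sketch of that loop body, borrowing ChrisNZ's SQL and reusing the HAVE/ACCT/DATE/VALUE names from the examples above:

    /* subset on the last &digits digits of the account number */
    data subset;
      set have;
      where mod(acct,10**&digits)=&i;
    run;

    /* 24-hour running average within the subset */
    proc sql;
      create table subsetResults as
      select a.*, mean(b.VALUE) as AVG
      from subset a left join subset b
        on a.ACCT=b.ACCT
       and b.DATE between a.DATE and a.DATE-3600*24
      group by a.ACCT, a.DATE, a.VALUE;
    quit;

    /* accumulate the results across subsets */
    proc append base=cumulativeResults data=subsetResults;
    run;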

 

