I posted this question a while back and received a great solution using a hash object, courtesy of user Ksharp:
data new;
input CustID TranAmt:dollar6. TranDate:mmddyy10.;
format TranDate mmddyy10.;
datalines;
111 $100 6/1/2014
111 $200 6/2/2014
111 $500 6/4/2014
111 $500 6/10/2014
111 $100 6/11/2014
222 $200 6/3/2014
222 $500 6/4/2014
222 $600 6/4/2014
222 $600 6/11/2014
222 $600 6/19/2014
222 $600 6/28/2014
222 $600 6/29/2014
;
run;
data want;
if _n_=1 then do;
if 0 then set new(rename=(TranAmt=_TranAmt));
declare hash h(dataset:'new(rename=(TranAmt=_TranAmt))',hashexp:20,multidata:'y');
h.definekey('CustID','TranDate');
h.definedata('_TranAmt');
h.definedone();
end;
set new;
count_8day=0;
sum_8day=0;
do i=TranDate-7 to TranDate;
rc=h.find(key:CustId,key:i);
do while(rc=0);
count_8day+1;
sum_8day+_TranAmt;
rc=h.find_next(key:CustId,key:i);
end;
end;
drop i rc _TranAmt;
run;
This works for "small" datasets, but now I'm working with datasets in the 40-60 million observation range, and the hash object solution is throwing an out-of-memory error. I've posted the hash solution above for reference.
The problem I'm trying to solve is counting the number of transactions that occurred in the last X days. The best example: on 6/11/14, I want to know how many transactions happened in the prior 8 days. Including 6/11, that window runs from 6/4 through 6/11, or 3 total transactions.
Any suggestions on how to adapt this for much larger datasets? Thank you.
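For concreteness, the target computation can be sketched outside SAS. This is an illustrative Python rendition of the rolling window (names and structure are mine, not the thread's SAS code):

```python
from datetime import date

# Illustrative sketch (not the thread's SAS code): rolling 8-day
# transaction count and sum per customer. The window is the current
# transaction date minus 7 days through the current date, inclusive.
def rolling_8day(transactions):
    """transactions: list of (custid, amount, date) tuples."""
    results = []
    for custid, amount, d in transactions:
        window = [a for (c, a, t) in transactions
                  if c == custid and 0 <= (d - t).days <= 7]
        results.append((custid, d, len(window), sum(window)))
    return results

data = [
    (111, 100, date(2014, 6, 1)),
    (111, 200, date(2014, 6, 2)),
    (111, 500, date(2014, 6, 4)),
    (111, 500, date(2014, 6, 10)),
    (111, 100, date(2014, 6, 11)),
]
# e.g. on 2014-06-11 the window spans 6/4 through 6/11: 3 transactions
```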
@Haikuo is right. Holding one BY group in memory at a time can save a lot of memory.
data new;
input CustID TranAmt:dollar6. TranDate:mmddyy10.;
format TranDate mmddyy10.;
datalines;
111 $100 6/1/2014
111 $200 6/2/2014
111 $500 6/4/2014
111 $500 6/10/2014
111 $100 6/11/2014
222 $200 6/3/2014
222 $500 6/4/2014
222 $600 6/4/2014
222 $600 6/11/2014
222 $600 6/19/2014
222 $600 6/28/2014
222 $600 6/29/2014
;
run;
data want;
if _n_=1 then do;
if 0 then set new(rename=(TranAmt=_TranAmt));
declare hash h(hashexp:20,multidata:'y');
h.definekey('TranDate');
h.definedata('_TranAmt');
h.definedone();
end;
do until(last.CustID);
set new;
by CustID;
_TranAmt=TranAmt;
h.add();
end;
do until(last.CustID);
set new;
by CustID;
count_8day=0;
sum_8day=0;
do i=TranDate-7 to TranDate;
rc=h.find(key:i);
do while(rc=0);
count_8day+1;
sum_8day+_TranAmt;
rc=h.find_next(key:i);
end;
end;
output;
end;
h.clear();
drop i rc _TranAmt;
run;
Just count?
Used 18Jun2014 as an arbitrary day, since NONE of your example data would have been returned for anything relative to today's date unless you were looking back 1,000 days or so.
proc summary data=new nway;
where trandate ge ('18Jun2014'd - 8);
class custid;
var tranamt;
output out=want (drop= _:) n=TranCount;
run;
Assumed that you meant transaction by custid.
I'm not in front of my machine with SAS loaded, so I can't test... but I don't believe this will work, since it has a hard-coded date. I'm basically trying to do a "rolling" 8-day look-back on each record. I also realized I didn't include the required output dataset; it's listed below.
Thank you for your prompt responses.
CustID | TranAmt | TranDate | EXPECTED |
111 | 100 | 1-Jun-14 | 1 |
111 | 200 | 2-Jun-14 | 2 |
111 | 500 | 4-Jun-14 | 3 |
111 | 500 | 10-Jun-14 | 2 |
111 | 100 | 11-Jun-14 | 3 |
222 | 200 | 3-Jun-14 | 1 |
222 | 500 | 4-Jun-14 | 2 |
222 | 600 | 4-Jun-14 | 3 |
222 | 600 | 11-Jun-14 | 3 |
222 | 600 | 19-Jun-14 | 1 |
222 | 600 | 28-Jun-14 | 1 |
222 | 600 | 29-Jun-14 | 2 |
When you use hash objects, ALL the data loaded into the hash has to fit in memory, and it looks like you've hit the limit in your environment. There's no way of guaranteeing you won't ever run out of memory, so if you are using colossal data sets you're better off using an alternative method, as @ballardw suggests.
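For a rough sense of scale (the per-item figure below is an assumption for illustration, not a measurement — in SAS the actual per-item cost is reported by the hash object's ITEM_SIZE attribute after DEFINEDONE), the arithmetic is simply items times bytes per item:

```python
# Back-of-envelope memory estimate for a fully loaded hash object.
# bytes_per_item is an ASSUMED figure; SAS reports the real value
# via the hash attribute h.item_size.
items = 50_000_000          # rows loaded into the hash
bytes_per_item = 64         # assumption for illustration only
gib = items * bytes_per_item / 1024**3
print(f"{gib:.2f} GiB")     # roughly 3 GiB before the step reads a single row
```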
Although your example doesn't clarify the issue, I presume you want rolling counts and sums.
Now if your data is sorted by custid, you could revise your program to process the data one custid at a time, by use of two "do until last.custid" loops containing a SET statement. The first loop would build the hash object one item at a time, and the second would do pretty much what you're doing now. At the end of the second loop, clear the hash object (rc=h.clear()) to recover memory.
But this would be an expensive proposition: adding items to object h one at a time instead of as a sort of "batch load".
You'll be much better off with temporary arrays (indexed on dates), one for date-specific total count, and one for date-specific total transamt:
%let lobound=%eval(%sysfunc(inputn(01jan2012,date9.))-7);
%let upbound=%sysfunc(inputn(31dec2016,date9.));
data want (keep=custid trandate tranamt count_8day sum_8day);
array trcnt{&lobound:&upbound} _temporary_;
array tramt{&lobound:&upbound} _temporary_;
call missing(of tramt{*},of trcnt{*});
do until (last.custid);
set new;
by custid;
trcnt{trandate}+1;
tramt{trandate}+tranamt;
end;
do until (last.custid);
set new;
by custid;
count_8day=0;
sum_8day=0;
do i=0 to 7;
count_8day=sum(count_8day,trcnt{trandate-i});
sum_8day=sum(sum_8day,tramt{trandate-i});
end;
output;
end;
run;
array mydata{'01jan2012'd:'31dec2016'd};
So you have to ask the macro processor to get that value. That's why I have
%let lobound=%eval(%sysfunc(inputn(01jan2012,date9.))-7);
%let upbound=%sysfunc(inputn(31dec2016,date9.));
and
array trcnt{&lobound:&upbound} _temporary_;
array tramt{&lobound:&upbound} _temporary_;
You appear to want the total count and total tranamt for each date, so notice that the first "do until" group uses:
trcnt{trandate}+1;
tramt{trandate}+tranamt;
instead of
trcnt{trandate}=1;
tramt{trandate}=tranamt;
resulting in one array element per date, with accumulated values for that date.
Yes, your hash object would have two items if a given date has two records. But when you retrieve those data items, you accumulate the results. This just accumulates data prior to retrieval.
It would be a problem if you needed something other than accumulations within a date.
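The pre-aggregation idea is language-independent. A minimal sketch (illustrative Python, not the poster's SAS code): one bucket per date holding a running count and running sum, instead of one hash item per transaction.

```python
from collections import defaultdict

# Pre-aggregation sketch: collapse multiple records per date into a
# single (count, sum) bucket before any window lookups happen.
counts = defaultdict(int)
sums = defaultdict(int)
for amount, d in [(500, "2014-06-04"), (600, "2014-06-04"), (500, "2014-06-10")]:
    counts[d] += 1          # analogous to trcnt{trandate}+1
    sums[d] += amount       # analogous to tramt{trandate}+tranamt
# the two records on 6/4 collapse into one bucket: count=2, sum=1100
```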
data new;
input CustID TranAmt:dollar6. TranDate:mmddyy10.;
format TranDate mmddyy10.;
datalines;
111 $100 6/1/2014
111 $200 6/2/2014
111 $500 6/4/2014
111 $500 6/10/2014
111 $100 6/11/2014
222 $200 6/3/2014
222 $500 6/4/2014
222 $600 6/4/2014
222 $600 6/11/2014
222 $600 6/19/2014
222 $600 6/28/2014
222 $600 6/29/2014
;
run;
/*How about only loading one custid at a time? The following tweak on Ksharp's code should circumvent, or at least alleviate, RAM stress.*/
data want_hash;
if _n_=1 then
do;
declare hash h(multidata:'y', ordered:'y');
h.definekey('TranDate');
h.definedata('TranDate','_TranAmt');
h.definedone();
call missing (_TranAmt);
end;
do until (last.custid);
set new;
by CustID TranDate;
if first.Custid then
rc=h.clear();
rc=h.add(key:trandate,data:trandate, data:tranamt);
end;
do until (last.custid);
set new;
by CustID TranDate;
sum_8day=0;
count_8day=0;
do i=TranDate-7 to TranDate;
rc=h.find(key:i);
do while(rc=0);
sum_8day+_TranAmt;
count_8day+1;
rc=h.find_next(key:i);
end;
end;
output;
end;
drop i rc _TranAmt;
run;
/*this will be slow, but it is a lot easier to code and read*/
proc sql;
create table want_sql as
select *, (select sum(tranamt) from new a where a.custid=b.custid and b.trandate-a.trandate between 0 and 7) as sum_8days,
(select count(*) from new a where a.custid=b.custid and b.trandate-a.trandate between 0 and 7) as count_8days
from new b
;
quit;
13 hours? If you have to do this again, I'd strongly suggest using the array solution, which should be noticeably faster. In the program code below, processing 100,000,000 skinny observations, the T2 dataset (using arrays) took about half the time to create as T1 (using hash): about 84 seconds for the hash vs. 44 for the arrays on our server.
A hash is more flexible, but when it comes to looking up single values it is intrinsically slower than an array, since an array subscript points directly to the location of the data value, whereas a hash must process the key value before it can locate the data.
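The same trade-off holds outside SAS: a direct subscript dereference skips the hashing step entirely. A hedged micro-benchmark sketch (illustrative Python; absolute timings depend entirely on the machine, so none are asserted here):

```python
import timeit

# Direct indexing vs. hashed lookup for the same key.
n = 100_000
arr = list(range(n))              # the subscript IS the location
table = {k: k for k in range(n)}  # the key must be hashed, then probed

array_time = timeit.timeit(lambda: arr[12345], number=100_000)
hash_time = timeit.timeit(lambda: table[12345], number=100_000)
# both lookups return the same value; only the mechanics differ
```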