I posted this question a while back and received a great solution using a hash object, courtesy of user Ksharp:
data new;
input CustID TranAmt:dollar6. TranDate:mmddyy10.;
format TranDate mmddyy10.;
datalines;
111 $100 6/1/2014
111 $200 6/2/2014
111 $500 6/4/2014
111 $500 6/10/2014
111 $100 6/11/2014
222 $200 6/3/2014
222 $500 6/4/2014
222 $600 6/4/2014
222 $600 6/11/2014
222 $600 6/19/2014
222 $600 6/28/2014
222 $600 6/29/2014
;
run;
data want;
if _n_=1 then do;
if 0 then set new(rename=(TranAmt=_TranAmt));
declare hash h(dataset:'new(rename=(TranAmt=_TranAmt))',hashexp:20,multidata:'y');
h.definekey('CustID','TranDate');
h.definedata('_TranAmt');
h.definedone();
end;
set new;
count_8day=0;
sum_8day=0;
do i=TranDate-7 to TranDate;
rc=h.find(key:CustId,key:i);
do while(rc=0);
count_8day+1;
sum_8day+_TranAmt;
rc=h.find_next(key:CustId,key:i);
end;
end;
drop i rc _TranAmt;
run;
This works for "small" datasets, but now I'm working with datasets in the 40-60 million observation range, and the hash object solution is throwing an out-of-memory error. I've posted the hash solution above for reference.
The problem I'm trying to solve is counting the number of transactions that occurred in the last X days. The best example: on 6/11/14, I want to know how many transactions happened in the prior 8 days. Including 6/11, that window runs from 6/4 through 6/11, or 3 total transactions.
Any suggestions on how to adapt this for much larger datasets? Thank you.
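For concreteness, the target computation can be sketched outside SAS. This is an illustrative Python rendition of the rolling window (names and structure are mine, not the thread's SAS code):

```python
from datetime import date

# Illustrative sketch (not the thread's SAS code): rolling 8-day
# transaction count and sum per customer. The window is the current
# transaction date minus 7 days through the current date, inclusive.
def rolling_8day(transactions):
    """transactions: list of (custid, amount, date) tuples."""
    results = []
    for custid, amount, d in transactions:
        window = [a for (c, a, t) in transactions
                  if c == custid and 0 <= (d - t).days <= 7]
        results.append((custid, d, len(window), sum(window)))
    return results

data = [
    (111, 100, date(2014, 6, 1)),
    (111, 200, date(2014, 6, 2)),
    (111, 500, date(2014, 6, 4)),
    (111, 500, date(2014, 6, 10)),
    (111, 100, date(2014, 6, 11)),
]
# e.g. on 2014-06-11 the window spans 6/4 through 6/11: 3 transactions
```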
@Haikuo is right. Holding one BY group in memory at a time can save a lot of memory.
data new;
input CustID TranAmt:dollar6. TranDate:mmddyy10.;
format TranDate mmddyy10.;
datalines;
111 $100 6/1/2014
111 $200 6/2/2014
111 $500 6/4/2014
111 $500 6/10/2014
111 $100 6/11/2014
222 $200 6/3/2014
222 $500 6/4/2014
222 $600 6/4/2014
222 $600 6/11/2014
222 $600 6/19/2014
222 $600 6/28/2014
222 $600 6/29/2014
;
run;
data want;
if _n_=1 then do;
if 0 then set new(rename=(TranAmt=_TranAmt));
declare hash h(hashexp:20,multidata:'y');
h.definekey('TranDate');
h.definedata('_TranAmt');
h.definedone();
end;
do until(last.CustID);
set new;
by CustID;
_TranAmt=TranAmt;
h.add();
end;
do until(last.CustID);
set new;
by CustID;
count_8day=0;
sum_8day=0;
do i=TranDate-7 to TranDate;
rc=h.find(key:i);
do while(rc=0);
count_8day+1;
sum_8day+_TranAmt;
rc=h.find_next(key:i);
end;
end;
output;
end;
h.clear();
drop i rc _TranAmt;
run;
Just count?
Used 18Jun2014 as an arbitrary day, since NONE of your example data would have been returned for anything relative to today's date unless you were looking back 1,000 days or so.
proc summary data=new nway;
where trandate ge ('18Jun2014'd - 8);
class custid;
var tranamt;
output out=want (drop= _:) n=TranCount;
run;
Assumed that you meant transaction by custid.
I'm not in front of my machine with SAS loaded, so I can't test... but I don't believe this will work, since it has a hard-coded date. I'm basically trying to do a "rolling" 8-day look-back on each record. I also realized I didn't include the required output dataset; it's listed below.
Thank you for your prompt responses.
CustID | TranAmt | TranDate | EXPECTED |
111 | 100 | 1-Jun-14 | 1 |
111 | 200 | 2-Jun-14 | 2 |
111 | 500 | 4-Jun-14 | 3 |
111 | 500 | 10-Jun-14 | 2 |
111 | 100 | 11-Jun-14 | 3 |
222 | 200 | 3-Jun-14 | 1 |
222 | 500 | 4-Jun-14 | 2 |
222 | 600 | 4-Jun-14 | 3 |
222 | 600 | 11-Jun-14 | 3 |
222 | 600 | 19-Jun-14 | 1 |
222 | 600 | 28-Jun-14 | 1 |
222 | 600 | 29-Jun-14 | 2 |
When you use hash objects, ALL the data loaded into the hash has to fit in memory, and it looks like you've hit the limit in your environment. There's no way of guaranteeing you won't ever run out of memory, so if you are using colossal data sets you're better off using an alternative method, as @ballardw suggests.
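For a rough sense of scale (the per-item figure below is an assumption for illustration, not a measurement — in SAS the actual per-item cost is reported by the hash object's ITEM_SIZE attribute after DEFINEDONE), the arithmetic is simply items times bytes per item:

```python
# Back-of-envelope memory estimate for a fully loaded hash object.
# bytes_per_item is an ASSUMED figure; SAS reports the real value
# via the hash attribute h.item_size.
items = 50_000_000          # rows loaded into the hash
bytes_per_item = 64         # assumption for illustration only
gib = items * bytes_per_item / 1024**3
print(f"{gib:.2f} GiB")     # roughly 3 GiB before the step reads a single row
```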
Although your example doesn't clarify the issue, I presume you want rolling counts and sums.
Now if your data is sorted by custid, you could revise your program to process the data one custid at a time, by use of two "do until last.custid" loops containing a SET statement. The first loop would build the hash object one item at a time, and the second would do pretty much what you're doing now. At the end of the second loop, clear the hash object (rc=h.clear()) to recover memory.
But this would be an expensive proposition: adding items to object h one at a time instead of as a sort of "batch load".
You'll be much better off with temporary arrays (indexed on dates), one for date-specific total count, and one for date-specific total transamt:
%let lobound=%eval(%sysfunc(inputn(01jan2012,date9.))-7);
%let upbound=%sysfunc(inputn(31dec2016,date9.));
data want (keep=custid trandate tranamt count_8day sum_8day);
array trcnt{&lobound:&upbound} _temporary_;
array tramt{&lobound:&upbound} _temporary_;
call missing(of tramt{*},of trcnt{*});
do until (last.custid);
set new;
by custid;
trcnt{trandate}+1;
tramt{trandate}+tranamt;
end;
do until (last.custid);
set new;
by custid;
count_8day=0;
sum_8day=0;
do i=0 to 7;
count_8day=sum(count_8day,trcnt{trandate-i});
sum_8day=sum(sum_8day,tramt{trandate-i});
end;
output;
end;
run;
array mydata{'01jan2012'd:'31dec2016'd};
So you have to ask the macro processor to get that value. That's why I have
%let lobound=%eval(%sysfunc(inputn(01jan2012,date9.))-7);
%let upbound=%sysfunc(inputn(31dec2016,date9.));
and
array trcnt{&lobound:&upbound} _temporary_;
array tramt{&lobound:&upbound} _temporary_;
You appear to want the total count and total tranamt for each date, so notice that the first "do until" group uses:
trcnt{trandate}+1;
tramt{trandate}+tranamt;
instead of
trcnt{trandate}=1;
tramt{trandate}=tranamt;
resulting in one array element per date, with accumulated values for that date.
Yes, your hash object would have two items if a given date has two records. But when you retrieve those data items, you accumulate the results. This just accumulates data prior to retrieval.
It would be a problem if you needed something other than accumulations within a date.
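The pre-aggregation idea is language-independent. A minimal sketch (illustrative Python, not the poster's SAS code): one bucket per date holding a running count and running sum, instead of one hash item per transaction.

```python
from collections import defaultdict

# Pre-aggregation sketch: collapse multiple records per date into a
# single (count, sum) bucket before any window lookups happen.
counts = defaultdict(int)
sums = defaultdict(int)
for amount, d in [(500, "2014-06-04"), (600, "2014-06-04"), (500, "2014-06-10")]:
    counts[d] += 1          # analogous to trcnt{trandate}+1
    sums[d] += amount       # analogous to tramt{trandate}+tranamt
# the two records on 6/4 collapse into one bucket: count=2, sum=1100
```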
data new;
input CustID TranAmt:dollar6. TranDate:mmddyy10.;
format TranDate mmddyy10.;
datalines;
111 $100 6/1/2014
111 $200 6/2/2014
111 $500 6/4/2014
111 $500 6/10/2014
111 $100 6/11/2014
222 $200 6/3/2014
222 $500 6/4/2014
222 $600 6/4/2014
222 $600 6/11/2014
222 $600 6/19/2014
222 $600 6/28/2014
222 $600 6/29/2014
;
run;
/*How about only loading one custid at a time? The following tweak on Ksharp's code should circumvent, or at least alleviate, RAM stress.*/
data want_hash;
if _n_=1 then
do;
declare hash h(multidata:'y', ordered:'y');
h.definekey('TranDate');
h.definedata('TranDate','_TranAmt');
h.definedone();
call missing (_TranAmt);
end;
do until (last.custid);
set new;
by CustID TranDate;
if first.Custid then
rc=h.clear();
rc=h.add(key:trandate,data:trandate, data:tranamt);
end;
do until (last.custid);
set new;
by CustID TranDate;
sum_8day=0;
count_8day=0;
do i=TranDate-7 to TranDate;
rc=h.find(key:i);
do while(rc=0);
sum_8day+_TranAmt;
count_8day+1;
rc=h.find_next(key:i);
end;
end;
output;
end;
drop i rc _TranAmt;
run;
/*this will be slow, but it is a lot easier to code and read*/
proc sql;
create table want_sql as
select *, (select sum(tranamt) from new a where a.custid=b.custid and b.trandate-a.trandate between 0 and 7) as sum_8days,
(select count(*) from new a where a.custid=b.custid and b.trandate-a.trandate between 0 and 7) as count_8days
from new b
;
quit;
13 hours? If you have to do this again, I'd strongly suggest using the array solution, which should be noticeably faster. In the program code below, processing 100,000,000 skinny observations, the T2 dataset (using arrays) took about half the time to create as T1 (using hash): about 84 seconds for the hash vs. 44 for the arrays on our server.
A hash is more flexible, but when it comes to looking up single values it is intrinsically slower than an array, since an array subscript points directly to the location of the data value, whereas a hash must process the key value before it can locate the data.
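The same trade-off holds outside SAS: a direct subscript dereference skips the hashing step entirely. A hedged micro-benchmark sketch (illustrative Python; absolute timings depend entirely on the machine, so none are asserted here):

```python
import timeit

# Direct indexing vs. hashed lookup for the same key.
n = 100_000
arr = list(range(n))              # the subscript IS the location
table = {k: k for k in range(n)}  # the key must be hashed, then probed

array_time = timeit.timeit(lambda: arr[12345], number=100_000)
hash_time = timeit.timeit(lambda: table[12345], number=100_000)
# both lookups return the same value; only the mechanics differ
```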