BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Fenton007
Fluorite | Level 6

I posted this question awhile back and received a great solution via a Hash Object via user Ksharp

data new;
  input CustID TranAmt:dollar6. TranDate:mmddyy10.;
  format TranDate mmddyy10.;
  datalines;
111 $100 6/1/2014
111 $200 6/2/2014
111 $500 6/4/2014
111 $500 6/10/2014
111 $100 6/11/2014
222 $200 6/3/2014
222 $500 6/4/2014
222 $600 6/4/2014
222 $600 6/11/2014
222 $600 6/19/2014
222 $600 6/28/2014
222 $600 6/29/2014
;
run;

data want;
 if _n_=1 then do;
  if 0 then set new(rename=(TranAmt=_TranAmt));
  declare hash h(dataset:'new(rename=(TranAmt=_TranAmt))',hashexp:20,multidata:'y');
  h.definekey('CustID','TranDate');
  h.definedata('_TranAmt');
  h.definedone();
 end;
set new;
count_8day=0;
sum_8day=0;
do i=TranDate-7 to TranDate;
 rc=h.find(key:CustId,key:i);
 do while(rc=0);
  count_8day+1;
  sum_8day+_TranAmt;
  rc=h.find_next(key:CustId,key:i);
 end;
end;
drop i rc _TranAmt;
run;

This works, for "small" datasets, but now I'm working with datasets in the 40-60 million observation range, and the Hash Object solution is throwing an error that's running out of memory.  I've posted the HASH solution below for reference.

 

The problem I'm trying to solve is count the number of transactions that occured in the last X days.  The best example is saying if on 6/11/16, I want to know how many transactions happened in the prior 8 days, so including 6/11, that would be transactions from 6/11 - 6/4, or 3 total transactions.

 

Any suggestions on how to adapt this for much larger datasets?  Thank you.

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

@Haikuo is right. Hold one group value each time  can save a lot more memory .

 

 

data new;
  input CustID TranAmt:dollar6. TranDate:mmddyy10.;
  format TranDate mmddyy10.;
  datalines;
111 $100 6/1/2014
111 $200 6/2/2014
111 $500 6/4/2014
111 $500 6/10/2014
111 $100 6/11/2014
222 $200 6/3/2014
222 $500 6/4/2014
222 $600 6/4/2014
222 $600 6/11/2014
222 $600 6/19/2014
222 $600 6/28/2014
222 $600 6/29/2014
;
run;

data want;
 if _n_=1 then do;
  if 0 then set new(rename=(TranAmt=_TranAmt));
  declare hash h(hashexp:20,multidata:'y');
  h.definekey('TranDate');
  h.definedata('_TranAmt');
  h.definedone();
 end;

do until(last.CustID);
set new;
by CustID;
 _TranAmt=TranAmt;h.add();
end;


do until(last.CustID);
set new;
by CustID;

count_8day=0;
sum_8day=0;
do i=TranDate-7 to TranDate;
 rc=h.find(key:i);
 do while(rc=0);
  count_8day+1;
  sum_8day+_TranAmt;
  rc=h.find_next(key:i);
 end;
end;

output;
end;

h.clear();
drop i rc _TranAmt;
run;

View solution in original post

10 REPLIES 10
ballardw
Super User

Just count?

Used 18Jun2014 as an arbitrary day since NONE of your example data would have been returned for anything related to todays date unless you were looking for 1000 days or so.

 

proc summary data=new nway;
   where trandate ge ('18Jun2014'd - 8);
   class custid;
   var tranamt;
   output out=want (drop= _:) n=TranCount;
run;

 

 

Assumed that you meant transaction by custid.

Fenton007
Fluorite | Level 6

I'm not in front of my machine with SAS loaded on it, so I can't test... but I don't believe this will work since it has a hard-coded date.  I'm basically trying to do a "rolling" look back 8 days on each record.  I didn't realize I didn't include a required output dataset.  It's listed below.

 

THank you for your prompt responses.

 

CustID

TranAmtTranDateEXPECTED
1111001-Jun-141
1112002-Jun-142
1115004-Jun-143
11150010-Jun-142
11110011-Jun-143
2222003-Jun-141
2225004-Jun-142
2226004-Jun-143
22260011-Jun-143
22260019-Jun-141
22260028-Jun-141
22260029-Jun-142
ChrisBrooks
Ammonite | Level 13

When you use hash objects ALL the data loaded into the hash has to be loaded into memory and it looks like you’ve hit the limit in your environment. There’s no way of guaranteeing you won’t ever run out of memory so if you are using colossal data sets you’re better off using an alternative method as @ballardw suggests.

mkeintz
PROC Star

Although your example doesn't clarify the issue, I presume you want rolling counts and sums.

 

Now if your data is sorted by custid, you could revise your program to process the data one custid at a time, by use of two "do until last.custid" loops containing a SET statement.  The first loop would build the hash object one item at a time, and the second would do pretty much what you're doing now.  At the end of the second loop, clear the hash object (rc=h.clear()) to recover memory.

 

But this would be an expensive proposition - adding items to object h one at a time instead of as a sort of "batch load".

 

You'll be much better off with temporary arrays (indexed on dates), one for date-specific total count, and one for date-specific total transamt:

 

%let lobound=%eval(%sysfunc(inputn(01jan2012,date9.))-7);
%let upbound=%sysfunc(inputn(31dec2016,date9.));

data want (keep=custid trandate tranamt count_8day sum_8day);

  array trcnt{&lobound:&upbound} _temporary_;
  array tramt{&lobound:&upbound} _temporary_;
  call missing(of tramt{*},of trcnt{*});
  do until (last.custid);
    set new;
    by custid;
    trcnt{trandate}+1;
    tramt{trandate}+tranamt;
  end;

  do until (last.custid);
    set new;
    by custid;
    count_8day=0;
    sum_8day=0;
    do i=0 to 7;
      count_8day=sum(count_8day,trcnt{trandate-i});
      sum_8day=sum(sum_8day,tramt{trandate-i});
    end;
    output;
  end;
run;

 

 

 

  1. Arrays can be indexed like this: array x{20:40} meaning the lower bound of array X is 20 and the upper bound is 40.  Since dates are whole numbers, if you know the minimum and maximum possible dates for your data you can use those as the bounds, and you could use TRANDATE as the array index.  Just be sure to move the lower bound down by an extra 7 so that even the lowest incoming date won't go out of range looking at the previous week's history.

  2. Unfortunately you can't declare an array with date literals as bounds.  I.e. you can't use:  

       array mydata{'01jan2012'd:'31dec2016'd};

    So you have to ask the macro processor to get that value.  That's why I have  

       %let lobound=%eval(%sysfunc(inputn(01jan2012,date9.))-7);

       %let upbound=%sysfunc(inputn(31dec2016,date9.));

    and

       array trcnt{&lobound:&upbound} _temporary_;

       array tramt{&lobound:&upbound} _temporary_;

  3. Because this recycles the same array for each custid, it uses a lot less memory than the hash object you are using.  Also it's faster than the hash object, since the array is key-indexed.  (the hash object might end up being faster if you were retrieving a lot more variables in definedata method).
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Fenton007
Fluorite | Level 6
Thanks for the thorough explanation. Would using Tran_dt as the array index be a problem if there are multiple transactions on the same date for the same cust_id. Ex... cust 111 has 6 obs, all dated 1/1/2016?
mkeintz
PROC Star

You appear to want the total count and total tranamt for each date, so notice that the first "do until" group uses:

    trcnt{trandate}+1;

    tramt{trandate}+tranamt;

instead of

    trcnt{trandate}=1;

    tramt{trandate}=tranamt;

resulting in one array element per date with accumluated values for that date.

 

Yes, your hash object would have two items if a given date has two records.  But when you retrieve those data items, you accumulate the results.  This just accumulates data prior to retrieval.

 

It would be a problem if you needed something other than accumulations within a date.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Haikuo
Onyx | Level 15
data new;
	input CustID TranAmt:dollar6. TranDate:mmddyy10.;
	format TranDate mmddyy10.;
	datalines;
111 $100 6/1/2014
111 $200 6/2/2014
111 $500 6/4/2014
111 $500 6/10/2014
111 $100 6/11/2014
222 $200 6/3/2014
222 $500 6/4/2014
222 $600 6/4/2014
222 $600 6/11/2014
222 $600 6/19/2014
222 $600 6/28/2014
222 $600 6/29/2014
;
run;

/*hwo about only loading one custid at a time, following tweak on Ksharp's code should circumvent or at least alleviate ram stress*/
data want_hash;
	if _n_=1 then
		do;
			declare hash h(multidata:'y', ordered:'y');
			h.definekey('TranDate');
			h.definedata('TranDate','_TranAmt');
			h.definedone();

			call missing (_TranAmt);
		end;

	do until (last.custid);
		set new;
		by CustID TranDate;

		if first.Custid then
			rc=h.clear();
		rc=h.add(key:trandate,data:trandate, data:tranamt);
	end;

	do until (last.custid);
		set new;
		by CustID TranDate;
		sum_8day=0;
		count_8day=0;

		do i=TranDate-7 to TranDate;
			rc=h.find(key:i);

			do while(rc=0);
				sum_8day+_TranAmt;
				count_8day+1;
				rc=h.find_next(key:i);
			end;
		end;

		output;
	end;
	drop i rc _TranAmt;
run;

/*this will slow, but a lot easier to code and read*/
proc sql;
	create table want_sql as
		select *, (select sum(tranamt) from new a where a.custid=b.custid and  b.trandate-a.trandate between 0 and 7) as sum_8days,
(select count(*) from new a where a.custid=b.custid and  b.trandate-a.trandate between 0 and 7) as count_8days
from new b
	;
quit;
Ksharp
Super User

@Haikuo is right. Hold one group value each time  can save a lot more memory .

 

 

data new;
  input CustID TranAmt:dollar6. TranDate:mmddyy10.;
  format TranDate mmddyy10.;
  datalines;
111 $100 6/1/2014
111 $200 6/2/2014
111 $500 6/4/2014
111 $500 6/10/2014
111 $100 6/11/2014
222 $200 6/3/2014
222 $500 6/4/2014
222 $600 6/4/2014
222 $600 6/11/2014
222 $600 6/19/2014
222 $600 6/28/2014
222 $600 6/29/2014
;
run;

data want;
 if _n_=1 then do;
  if 0 then set new(rename=(TranAmt=_TranAmt));
  declare hash h(hashexp:20,multidata:'y');
  h.definekey('TranDate');
  h.definedata('_TranAmt');
  h.definedone();
 end;

do until(last.CustID);
set new;
by CustID;
 _TranAmt=TranAmt;h.add();
end;


do until(last.CustID);
set new;
by CustID;

count_8day=0;
sum_8day=0;
do i=TranDate-7 to TranDate;
 rc=h.find(key:i);
 do while(rc=0);
  count_8day+1;
  sum_8day+_TranAmt;
  rc=h.find_next(key:i);
 end;
end;

output;
end;

h.clear();
drop i rc _TranAmt;
run;
Fenton007
Fluorite | Level 6
@Ksharp thanks again for the tip! It took 13 hours on 40MM obs, but it ran and looks good. I can't thank you and the other posters, @Haikuo and @mkeintz.
mkeintz
PROC Star

13 hours?   If you have to do this again, I'd strongly suggest using the array solution, which should be noticably faster.  In the program code below processing 100,000,000 skinny observaions, the T2 dataset (using arrays) took about half the time to create as the T1 (using hash).  About 84 seconds vs 44 on our server.

 

hash is more flexible, but when it comes to looking up single values, it is intrinsically slower than arrays, since the latter has a key pointing directly to the array location of data value, whereas hash needs to process the key value before locating the data value.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 2781 views
  • 1 like
  • 6 in conversation