Hi @Patrick
Very elegant indeed - I didn't think of that. I had a hash lookup in mind, but (I am ashamed to admit) most of my toolbox is pre-V9, and after 15 years with V9 I am still not really familiar with hash objects, so it would take me too long to figure out.
But I could not resist running a test to compare the two solutions. I used 24 months in A and 5.000.000 observations in B with random intervals. SQL is the winner:
236 data A (drop=i s);
237 format begin end mmddyy10.;
238 s = '01dec2017'd;
239 do i = 1 to 24;
240 begin = intnx('month',s,i);
241 end = intnx('month',begin,1)-1;
242 month = put(begin,monyy.);
243 output;
244 end;
245 run;
NOTE: The data set WORK.A has 24 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.00 seconds
246
247 data B;
248 format recur_start recur_end yymmdd10.;
249 do custid = 1 to 5000000;
250 recur_start = (ranuni(1) * 730) + 21000;
251 recur_end = min(recur_start + (ranuni(3) * 730),22100);
252 recur_amt = int(ranuni(5)*1000);
253 output;
254 end;
255 run;
NOTE: The data set WORK.B has 5000000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 0.45 seconds
cpu time 0.45 seconds
256
257 data want(keep=month begin end recur_amt);
258 set b;
259 _last=0;
260 do _i=1 to _nobs;
261 set a point=_i nobs=_nobs;
262 if begin<=recur_end and end>=recur_start then output want;
263 end;
264 run;
NOTE: There were 5000000 observations read from the data set WORK.B.
NOTE: The data set WORK.WANT has 51607422 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 17.74 seconds
cpu time 16.90 seconds
265
266 proc sql;
267 create table want as
268 select a.month, b.custid, b.recur_start, b.recur_end, b.recur_amt
269 from A, B
270 where
271 a.begin <= b.recur_end
272 and a.end >= b.recur_start;
NOTE: The execution of this query involves performing one or more Cartesian product joins that can
not be optimized.
NOTE: Table WORK.WANT created, with 51607422 rows and 5 columns.
273 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 9.92 seconds
cpu time 8.40 seconds
I ran this on a Lenovo PC with SSD disk. I also tried it on a Linux Grid running a heavy batch load at the moment, and got almost the same figures:
21
22 data want(keep=month begin end recur_amt);
23 set b;
24 _last=0;
25 do _i=1 to _nobs;
26 set a point=_i nobs=_nobs;
27 if begin<=recur_end and end>=recur_start then output want;
28 end;
29 run;
NOTE: There were 5000000 observations read from the data set WORK.B.
NOTE: The data set WORK.WANT has 51607422 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 14.18 seconds
cpu time 13.16 seconds
30
31 proc sql;
32 create table want as
33 select a.month, b.custid, b.recur_start, b.recur_end, b.recur_amt
34 from A, B
35 where
36 a.begin <= b.recur_end
37 and a.end >= b.recur_start;
NOTE: The execution of this query involves performing one or more Cartesian product joins that can
not be optimized.
NOTE: Table WORK.WANT created, with 51607422 rows and 5 columns.
38 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 8.39 seconds
cpu time 8.39 seconds
... View more