Solved: Re: Loop through large set multiple times based on matching values fro...

redfishJAX · Posted 03-10-2019 11:32 AM

What is the best way to do this?

The actual datasets are millions of obs.. one small (A) against one huge (~10m) B) but i need A to loop through B and find matches based on date conditions.
once each single observation on A has looped through B, i need it to do the same for the nexst observation in A and so on

A
month begin   end
Jan19   1/1/2019 1/31/2019
Feb19   2/1/2019 2/28/2019

B
custid recur_start recur_end recur_amt
1 10/15/2014 2/16/2019 150
2 2/18/2018 1/31/2019 150
3 12/15/2012 3/31/2021 100

result dataset needed:
month custid recur_start recur_end recur_amt
Jan19 1 10/15/2014 2/16/2019 150
Jan19 2 2/18/2018 1/31/2019 150
Jan19 3 12/15/2012 3/31/2021 100
Feb19 1 10/15/2014 2/16/2019 150
Feb19 3 12/15/2012 3/31/2021 100

ErikLund_Jensen · Posted 03-10-2019 06:11 PM

A relevant question because we don't know the actual problem. But I tried to imagine a situation where a partial overlap should be removed, like in "make a list of employees employed in the full months of april, may and june".

I often do interval joins, but the requirement has always been either a boolean "overlap or not", a real interval join where the overlapping interval is the wanted output, or overlap on a given day like end-of-month. It was never "Full overlap only", so I am just curious - is it something people often use "out there"?

View solution in original post

VDD · Posted 03-10-2019 11:42 AM

you request is not completely clear and allows for a programmer to build results that are not what you want because you forgot the full conditional statement for updating the results.

second why not flip you process and run through the small dataset to find what records match in A and use an update process.

ErikLund_Jensen · Posted 03-10-2019 12:02 PM

Hi @redfishJAX

One way is an SQL join. The might be more effecient solutions, but I think this would run pretty fast too, because the A data set is small, and it is simple.

data A; 
	informat begin end mmddyy10.;
	format begin end mmddyy10.;
	input month $ begin end;
datalines;
Jan19 1/1/2019 1/31/2019
Feb19 2/1/2019 2/28/2019
;
run;

data B; 
	informat recur_start recur_end mmddyy10.;
	format recur_start recur_end mmddyy10.;
	input custid recur_start recur_end recur_amt;
datalines;
1 10/15/2014 2/16/2019 150
2 2/18/2018 1/31/2019 150
3 12/15/2012 3/31/2021 100 
;
run;

proc sql;
	create table C as 
		select a.month, b.custid, b.recur_start, b.recur_end, b.recur_amt
		from A, B
		where 
			a.begin <= b.recur_end
			and a.end >= b.recur_start;
quit;

Patrick · Posted 03-10-2019 03:54 PM

I'd go in real live likely for SQL but below just for fun a data step approach.

data A; 
	informat begin end mmddyy10.;
	format begin end mmddyy10.;
	input month $ begin end;
datalines;
Jan19 1/1/2019 1/31/2019
Feb19 2/1/2019 2/28/2019
;
run;

data B; 
	informat recur_start recur_end mmddyy10.;
	format recur_start recur_end mmddyy10.;
	input custid recur_start recur_end recur_amt;
datalines;
1 10/15/2014 2/16/2019 150
2 2/18/2018 1/31/2019 150
3 12/15/2012 3/31/2021 100 
;
run;

data want(keep=month begin end recur_amt);
  set b;
  _last=0;
  do _i=1 to _nobs;
    set a point=_i nobs=_nobs;
    if begin<=recur_end and end>=recur_start then output want;
  end;
run;

ErikLund_Jensen · Posted 03-10-2019 05:15 PM

Hi @Patrick

Very elegant indeed - I didn't think of that. I had a hash lookup in mind, but (I am ashamed to admit) most of my toolbox is pre-V9, and after 15 years with V9 I am still not really familiar with hash objects, so it would take me too long to figure out.

But I could not resist running a test to compare the two solutions. I used 24 months in A and 5.000.000 observations in B with random intervals. SQL is the winner:

236  data A (drop=i s);
237      format begin end mmddyy10.;
238      s = '01dec2017'd;
239      do i = 1 to 24;
240          begin = intnx('month',s,i);
241          end = intnx('month',begin,1)-1;
242          month = put(begin,monyy.);
243          output;
244      end;
245  run;

NOTE: The data set WORK.A has 24 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.00 seconds


246
247  data B;
248      format recur_start recur_end yymmdd10.;
249      do custid = 1 to 5000000;
250          recur_start = (ranuni(1) * 730) + 21000;
251          recur_end = min(recur_start + (ranuni(3) * 730),22100);
252          recur_amt = int(ranuni(5)*1000);
253          output;
254      end;
255  run;

NOTE: The data set WORK.B has 5000000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.45 seconds
      cpu time            0.45 seconds


256
257  data want(keep=month begin end recur_amt);
258    set b;
259    _last=0;
260    do _i=1 to _nobs;
261      set a point=_i nobs=_nobs;
262      if begin<=recur_end and end>=recur_start then output want;
263    end;
264  run;

NOTE: There were 5000000 observations read from the data set WORK.B.
NOTE: The data set WORK.WANT has 51607422 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           17.74 seconds
      cpu time            16.90 seconds


265
266  proc sql;
267      create table want as
268          select a.month, b.custid, b.recur_start, b.recur_end, b.recur_amt
269          from A, B
270          where
271              a.begin <= b.recur_end
272              and a.end >= b.recur_start;
NOTE: The execution of this query involves performing one or more Cartesian product joins that can
      not be optimized.
NOTE: Table WORK.WANT created, with 51607422 rows and 5 columns.

273  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           9.92 seconds
      cpu time            8.40 seconds

I ran this on a Lenovo PC with SSD disk. I also tried it on a Linux Grid running a heavy batch load at the moment, and got almost the same figures:

21
22    data want(keep=month begin end recur_amt);
23      set b;
24      _last=0;
25      do _i=1 to _nobs;
26        set a point=_i nobs=_nobs;
27        if begin<=recur_end and end>=recur_start then output want;
28      end;
29    run;

NOTE: There were 5000000 observations read from the data set WORK.B.
NOTE: The data set WORK.WANT has 51607422 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           14.18 seconds
      cpu time            13.16 seconds


30
31    proc sql;
32        create table want as
33            select a.month, b.custid, b.recur_start, b.recur_end, b.recur_amt
34            from A, B
35            where
36                a.begin <= b.recur_end
37                and a.end >= b.recur_start;
NOTE: The execution of this query involves performing one or more Cartesian product joins that can
      not be optimized.
NOTE: Table WORK.WANT created, with 51607422 rows and 5 columns.

38    quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           8.39 seconds
      cpu time            8.39 seconds

Astounding · Posted 03-10-2019 04:52 PM

You have cleverly shown the easy cases, and omitted the difficult cases. What should be done with a partial overlap ... keep it or remove it?

A
month begin end
Jan19 1/1/2019 3/31/2019

B
custid recur_start recur_end recur_amt
1 10/15/2014 2/16/2019 150
2 2/18/2018 1/31/2019 150
3 12/15/2012 3/31/2021 100

ErikLund_Jensen · Posted 03-10-2019 06:11 PM

A relevant question because we don't know the actual problem. But I tried to imagine a situation where a partial overlap should be removed, like in "make a list of employees employed in the full months of april, may and june".

I often do interval joins, but the requirement has always been either a boolean "overlap or not", a real interval join where the overlapping interval is the wanted output, or overlap on a given day like end-of-month. It was never "Full overlap only", so I am just curious - is it something people often use "out there"?

redfishJAX · Posted 03-11-2019 07:55 AM

Thanks for all of the input. Going to give all of these a try and see what works best.
And fyi.. overlaps are intended to be included.

Loop through large set multiple times based on matching values from a smaller dataset

Re: Loop through large set multiple times based on matching values from a smaller dataset

Re: Loop through large set multiple times based on matching values from a smaller dataset

Re: Loop through large set multiple times based on matching values from a smaller dataset

Re: Loop through large set multiple times based on matching values from a smaller dataset

Re: Loop through large set multiple times based on matching values from a smaller dataset

Re: Loop through large set multiple times based on matching values from a smaller dataset

Re: Loop through large set multiple times based on matching values from a smaller dataset

Re: Loop through large set multiple times based on matching values from a smaller dataset

SAS Innovate 2025: Call for Content

Classroom Training Available!