What is the best way to do this?
The actual datasets are millions of obs.. one small (A) against one huge (~10m) B) but i need A to loop through B and find matches based on date conditions.
once each single observation on A has looped through B, i need it to do the same for the nexst observation in A and so on
A
month begin end
Jan19 1/1/2019 1/31/2019
Feb19 2/1/2019 2/28/2019
B
custid recur_start recur_end recur_amt
1 10/15/2014 2/16/2019 150
2 2/18/2018 1/31/2019 150
3 12/15/2012 3/31/2021 100
result dataset needed:
month custid recur_start recur_end recur_amt
Jan19 1 10/15/2014 2/16/2019 150
Jan19 2 2/18/2018 1/31/2019 150
Jan19 3 12/15/2012 3/31/2021 100
Feb19 1 10/15/2014 2/16/2019 150
Feb19 3 12/15/2012 3/31/2021 100
you request is not completely clear and allows for a programmer to build results that are not what you want because you forgot the full conditional statement for updating the results.
second why not flip you process and run through the small dataset to find what records match in A and use an update process.
Hi @redfishJAX
One way is an SQL join. The might be more effecient solutions, but I think this would run pretty fast too, because the A data set is small, and it is simple.
data A;
informat begin end mmddyy10.;
format begin end mmddyy10.;
input month $ begin end;
datalines;
Jan19 1/1/2019 1/31/2019
Feb19 2/1/2019 2/28/2019
;
run;
data B;
informat recur_start recur_end mmddyy10.;
format recur_start recur_end mmddyy10.;
input custid recur_start recur_end recur_amt;
datalines;
1 10/15/2014 2/16/2019 150
2 2/18/2018 1/31/2019 150
3 12/15/2012 3/31/2021 100
;
run;
proc sql;
create table C as
select a.month, b.custid, b.recur_start, b.recur_end, b.recur_amt
from A, B
where
a.begin <= b.recur_end
and a.end >= b.recur_start;
quit;
I'd go in real live likely for SQL but below just for fun a data step approach.
data A;
informat begin end mmddyy10.;
format begin end mmddyy10.;
input month $ begin end;
datalines;
Jan19 1/1/2019 1/31/2019
Feb19 2/1/2019 2/28/2019
;
run;
data B;
informat recur_start recur_end mmddyy10.;
format recur_start recur_end mmddyy10.;
input custid recur_start recur_end recur_amt;
datalines;
1 10/15/2014 2/16/2019 150
2 2/18/2018 1/31/2019 150
3 12/15/2012 3/31/2021 100
;
run;
data want(keep=month begin end recur_amt);
set b;
_last=0;
do _i=1 to _nobs;
set a point=_i nobs=_nobs;
if begin<=recur_end and end>=recur_start then output want;
end;
run;
Hi @Patrick
Very elegant indeed - I didn't think of that. I had a hash lookup in mind, but (I am ashamed to admit) most of my toolbox is pre-V9, and after 15 years with V9 I am still not really familiar with hash objects, so it would take me too long to figure out.
But I could not resist running a test to compare the two solutions. I used 24 months in A and 5.000.000 observations in B with random intervals. SQL is the winner:
236 data A (drop=i s); 237 format begin end mmddyy10.; 238 s = '01dec2017'd; 239 do i = 1 to 24; 240 begin = intnx('month',s,i); 241 end = intnx('month',begin,1)-1; 242 month = put(begin,monyy.); 243 output; 244 end; 245 run; NOTE: The data set WORK.A has 24 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.02 seconds cpu time 0.00 seconds 246 247 data B; 248 format recur_start recur_end yymmdd10.; 249 do custid = 1 to 5000000; 250 recur_start = (ranuni(1) * 730) + 21000; 251 recur_end = min(recur_start + (ranuni(3) * 730),22100); 252 recur_amt = int(ranuni(5)*1000); 253 output; 254 end; 255 run; NOTE: The data set WORK.B has 5000000 observations and 4 variables. NOTE: DATA statement used (Total process time): real time 0.45 seconds cpu time 0.45 seconds 256 257 data want(keep=month begin end recur_amt); 258 set b; 259 _last=0; 260 do _i=1 to _nobs; 261 set a point=_i nobs=_nobs; 262 if begin<=recur_end and end>=recur_start then output want; 263 end; 264 run; NOTE: There were 5000000 observations read from the data set WORK.B. NOTE: The data set WORK.WANT has 51607422 observations and 4 variables. NOTE: DATA statement used (Total process time): real time 17.74 seconds cpu time 16.90 seconds 265 266 proc sql; 267 create table want as 268 select a.month, b.custid, b.recur_start, b.recur_end, b.recur_amt 269 from A, B 270 where 271 a.begin <= b.recur_end 272 and a.end >= b.recur_start; NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized. NOTE: Table WORK.WANT created, with 51607422 rows and 5 columns. 273 quit; NOTE: PROCEDURE SQL used (Total process time): real time 9.92 seconds cpu time 8.40 seconds
I ran this on a Lenovo PC with SSD disk. I also tried it on a Linux Grid running a heavy batch load at the moment, and got almost the same figures:
21 22 data want(keep=month begin end recur_amt); 23 set b; 24 _last=0; 25 do _i=1 to _nobs; 26 set a point=_i nobs=_nobs; 27 if begin<=recur_end and end>=recur_start then output want; 28 end; 29 run; NOTE: There were 5000000 observations read from the data set WORK.B. NOTE: The data set WORK.WANT has 51607422 observations and 4 variables. NOTE: DATA statement used (Total process time): real time 14.18 seconds cpu time 13.16 seconds 30 31 proc sql; 32 create table want as 33 select a.month, b.custid, b.recur_start, b.recur_end, b.recur_amt 34 from A, B 35 where 36 a.begin <= b.recur_end 37 and a.end >= b.recur_start; NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized. NOTE: Table WORK.WANT created, with 51607422 rows and 5 columns. 38 quit; NOTE: PROCEDURE SQL used (Total process time): real time 8.39 seconds cpu time 8.39 seconds
You have cleverly shown the easy cases, and omitted the difficult cases. What should be done with a partial overlap ... keep it or remove it?
A
month begin end
Jan19 1/1/2019 3/31/2019
B
custid recur_start recur_end recur_amt
1 10/15/2014 2/16/2019 150
2 2/18/2018 1/31/2019 150
3 12/15/2012 3/31/2021 100
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.