Hello, I have a rather large join and aggregation that is part of a higher-level program. The join and aggregation is used many times in that program with tables that are structurally identical but contain different data, so it would be nice to speed this step up.
/*
We have one large table A (~550 million rows) and one smaller table B (~1.2 million rows).
Table A is sorted by I, ID, ID_2; table B is sorted by ID, ID_2. Due to the way the tables are created, B contains only ID and ID_2 values that occur in A.
Table A, however, might not contain every ID/ID_2 entry of table B. This should be irrelevant for the question and is therefore not modeled here.
*/
/***************************************************************************************************/
/* Create sample data: */
%let max_i = 5600;
%let max_id = 55000;
%let max_category = 22;
data _ids;
    do ID = 1 to &max_id.;
        ID_2 = "1";
        output;
        /* Some IDs occur with both ID_2 = "1" and ID_2 = "0" */
        if rand('uniform',0,1) < 0.01 then do; ID_2 = "0"; output; end;
    end;
run;
proc sort data=_ids; by ID ID_2; run;
data _i;
    do I = 1 to &max_i.;
        output;
        /* Not every I is present. */
        /* Not modeled here, but possibly important: we don't know max_i,
           and max_i differs between the various tables A. */
        if rand('uniform',0,1) < 0.01 then I + 1;
    end;
run;
/* Cartesian product for table A */
proc sql;
    create table table_A as
    select A.I
          ,B.ID
          ,B.ID_2
          ,ceil(rand('uniform',0,&max_category.)) as CATEGORY
    from _i A
    left join _ids B
        on 1 = 1
    order by A.I, B.ID, B.ID_2;
quit;
/* Create table B */
data table_B;
    set _ids;
    do CATEGORY = 1 to &max_category.;
        X = rand('uniform',0,1);
        output;
    end;
run;
/***************************************************************************************************/
/* Join and Aggregation */
/* Option A: join and group by */
proc sql;
    create table WANT_OPTION_A as
    select A.I
          ,sum(B.X) as SUM_X
    from table_A(sortedby=I ID ID_2) A
    inner join table_B(sortedby=ID ID_2 CATEGORY) B
        on  A.ID       = B.ID
        and A.ID_2     = B.ID_2
        and A.CATEGORY = B.CATEGORY
    group by A.I;
quit;
/* Option B: hash lookup */
data WANT_OPTION_B(keep=I SUM_X);
    /* Pull table_B's variables into the PDV without reading a row */
    if 0 then set table_B;
    if _N_ = 1 then do;
        declare hash HH_BEW (dataset: 'table_B');
        HH_BEW.defineKey ('ID', 'ID_2', 'CATEGORY');
        HH_BEW.defineData ('X');
        HH_BEW.defineDone ();
    end;
    set table_A(sortedby=I ID ID_2);
    by I;
    call missing (X);
    retain SUM_X;
    if first.I then SUM_X = 0;
    /* X stays missing when the key is not found; sum() ignores missing */
    RC = HH_BEW.find ();
    SUM_X = sum (SUM_X, X);
    if last.I;
run;
Unfortunately I cannot change underlying settings like memsize. However, I have some control over the size of table A in the higher-level program without losing significant time there, i.e. I can split table A into several tables (e.g. two with 330 million rows each) or aggregate tables (e.g. two with 660 million rows each into one with 1320 million rows).

Test results:

For a table A with 330 million rows:
- Option A: ~4 minutes user time and ~7 minutes CPU time
- Option B: ~4 minutes user time and ~4 minutes CPU time

For a table A with 660 million rows:
- Option A: ~8 minutes user time and ~14 minutes CPU time
- Option B: ~8 minutes user time and ~8 minutes CPU time

Since both options scale roughly linearly with the size of table A, there seems to be no need to change that size. Option B is better, but only regarding CPU time.

I thought about combining ID, ID_2 and CATEGORY into a single key and using a format on table A (a rough sketch of what I mean is at the end of this post). This would get rid of the join or hash. However, I still need the full table A with the CATEGORY information for another join, so there is no way to save time by creating only the aggregate instead of table A. I have never used a format for this before, hence it would be a small challenge. Would this still be worth a try? Or is there any other way to speed something up?

Thank you in advance.
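PS: To make the format idea concrete, here is a minimal, untested sketch of what I have in mind. Note that it builds an informat rather than a format, because the lookup has to return the numeric X; the name XKEY, the '|' separator, and the $40 key length are just placeholders I made up for illustration.

/* Build a numeric informat from table_B mapping "ID|ID_2|CATEGORY" to X */
data cntlin;
    set table_B end=last;
    retain FMTNAME 'XKEY' TYPE 'I';
    length START $ 40 LABEL $ 32;
    START = catx('|', ID, ID_2, CATEGORY);
    LABEL = left(put(X, best32.));
    output;
    if last then do;
        /* catch-all: unmatched keys return missing, like option B's find() */
        HLO   = 'O';
        LABEL = '.';
        output;
    end;
run;
proc format cntlin=cntlin;
run;
/* One sequential pass over table A: no join, no hash object */
data WANT_OPTION_C(keep=I SUM_X);
    set table_A(sortedby=I ID ID_2);
    by I;
    retain SUM_X;
    if first.I then SUM_X = 0;
    /* sum() ignores the missing value returned for unmatched keys */
    SUM_X = sum(SUM_X, input(catx('|', ID, ID_2, CATEGORY), XKEY.));
    if last.I;
run;

The open question for me is whether this informat lookup is actually faster than HH_BEW.find().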