☑ This topic is solved.
Bernd_S
Fluorite | Level 6

Hello,

 

I have a rather large join and aggregation which is part of a higher-level program. The join and aggregation is used many times in that program, with tables that are structurally identical but contain different data. Therefore it would be nice to speed this step up.

 

 /* 
    We have one large table A (~550 million rows) and one smaller table B (~1.2 million rows).
    Table A is sorted by i, id, id_2; table B is sorted by id, id_2. Due to the way the tables
    are created, B only contains id and id_2 entries of A.
    Table A might not contain every id/id_2 entry of table B. This should be irrelevant for the
    question and is therefore not modeled here.
 */  


/***************************************************************************************************/
/* Create sample data: */
 %let max_i = 5600;
 %let max_id = 55000;
 %let max_category = 22;

 data _ids;
 	do ID = 1 to &max_id.;
		ID_2 = "1";
		output;
		/* We have some ids for which id_2 is 1 and 0*/
		if rand('uniform',0,1) < 0.01 then do; ID_2="0"; output; end;
	end;
 run;
 proc sort data=_ids; by ID ID_2;run;

 data _i;
	do I = 1 to &max_i.;
		output;
/*		Not every i is given. */
/*		Furthermore it is not modeled here but might be important: we don't know max_i and max_i is different for different table As.*/
		if rand('uniform',0,1) < 0.01 then i+1;
	end;
 run;

/* Cartesian Product for table A*/
 proc sql;
 	create table table_A as
	select A.I
		  ,B.ID
		  ,B.ID_2
		  ,ceil(rand('uniform',0,&max_category.)) as CATEGORY
	from _i A
	left join _ids B
	on 1 = 1
	order by A.I, B.ID, B.ID_2;
 quit;

/* Create table B*/
 data table_B;
 	set _ids;
	do CATEGORY = 1 to &max_category.;
		x = rand('uniform',0,1);
		output;
	end;
 run;



/***************************************************************************************************/
/* Join and Aggregation */

 /* Option A: Join and group by */
proc sql;
create table WANT_OPTION_A as
select  A.I 
		,sum (B.X) as SUM_X 
from table_A(sortedby=I ID ID_2) A
inner join table_B(sortedby=ID ID_2 CATEGORY) B
on 	A.ID = B.ID and 
	A.ID_2 = B.ID_2 and
	A.CATEGORY = B.CATEGORY
group by A.I;
quit;



 /* Option B: Hash*/
data WANT_OPTION_B(keep = I SUM_X);

      if 0 then set table_B;
      if _N_ = 1 then do;
         declare hash HH_BEW (dataset: 'table_B');
         HH_BEW.defineKey  ('ID', 'ID_2', 'CATEGORY');
         HH_BEW.defineData ('X');
         HH_BEW.defineDone ();

      end;

      set table_A(sortedby=I ID ID_2);
      by I;
	  call missing (X);
      retain SUM_X;
      if first.I then SUM_X = 0;

      RC = HH_BEW.find ();

      SUM_X = sum (SUM_X, X);
      
      if last.I;
 run;

 

Unfortunately I cannot change underlying settings like MEMSIZE.

However, I have some control over the size of table A in the higher-level program without losing significant time there, i.e. I can split table A into several tables (e.g. two with 330 million rows each) or combine tables (e.g. two with 660 million rows each into one with 1,260 million rows).

 

Test results:

  • For a table A with 330 million rows:
    • Option A: ~4 minutes user time and ~7 minutes CPU time
    • Option B: ~4 minutes user time and ~4 minutes CPU time
  • For a table A with 660 million rows:
    • Option A: ~8 minutes user time and ~14 minutes CPU time
    • Option B: ~8 minutes user time and ~8 minutes CPU time

There seems to be no need to change the size of table A. Option B is better, but only in CPU time.

 

I thought about combining ID, ID_2 and CATEGORY into a single key and using a format on table A. This would get rid of the join or hash. However, I still need the full table A with the CATEGORY information for another join, so there is no way I can save time by creating only the aggregate instead of table A. I have never used a format for this before, hence it would be a small challenge; a rough sketch of the idea is below.
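
To make the idea concrete, this is roughly what I have in mind (untested; the format name XVAL and the '|' key separator are just placeholders, and it assumes, as in my data, that every key of table A exists in table B):

/* build a character format from table B: 'id|id_2|category' -> x */
data _fmt(keep=FMTNAME TYPE START LABEL);
	set table_B;
	retain FMTNAME 'XVAL' TYPE 'C'; /* TYPE='C': character format, referenced as $XVAL. */
	length START $ 20 LABEL $ 32;
	START = catx('|', ID, ID_2, CATEGORY);
	LABEL = put(X, best32.);
run;

proc format cntlin=_fmt;
run;

/* one sequential pass over table A, no join or hash */
data WANT_OPTION_FMT(keep = I SUM_X);
	set table_A(sortedby=I ID ID_2);
	by I;
	if first.I then SUM_X = 0;
	SUM_X + input(put(catx('|', ID, ID_2, CATEGORY), $XVAL.), best32.);
	if last.I;
run;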

 

Would this be still worth a try? Or is there any other way to speed something up?  

 

Thank you in advance.

1 ACCEPTED SOLUTION

Accepted Solutions
yabwon
Onyx | Level 15

I used your test data for tests

 

Table A is ~9.2GB and B is ~38MB.

 

I observed that ID_2 is character and that the ID_2 values are "1" or "0".

 

If you can make ID_2 numeric, with values 1 and 0 instead of "1" and "0":

 data _ids;
 	do ID = 1 to &max_id.;
		ID_2 = 1; /* <-- numeric */
		output;
		/* We have some ids for which id_2 is 1 and 0*/
		if rand('uniform',0,1) < 0.01 then do; ID_2=0; output; end;
	end;
 run;

Then you could use a temporary array for look-up.

%let max_i = 5600;
%let max_id = 55000;
%let max_category = 22;
%let max_ID_2=1;
                                        /* if ID_2 is numeric */
%let range_up = %sysevalf(&max_id.*1000 + &max_ID_2.*100 + &max_category.);
%put &=range_up.;


/* Option C: Array*/
data WANT_OPTION_B_array(keep = I SUM_X);

  /* populate array */
  array XX[&range_up.] _temporary_; /* instead of a hash table */
  do until(eofB);
    set table_B end=EOFB;
    XX[ID*1000 + ID_2*100 + CATEGORY]=X;
  end;

  do until (EOFA);
    set table_A end=EOFA;
    by I;

    if first.I then SUM_X = 0;
    SUM_X + XX[ID*1000 + ID_2*100 + CATEGORY]; /* direct keys addressing */
    
    if last.I then output;
  end;
  stop;
run;

With numeric ID_2 the result for the hash table code on my machine was:

NOTE: There were 1221506 observations read from the data set WORK.TABLE_B.
NOTE: There were 307875035 observations read from the data set WORK.TABLE_A.
NOTE: The data set WORK.WANT_OPTION_B has 5545 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           1:22.39
      cpu time            1:22.73

and the result with Array was:

NOTE: There were 1221506 observations read from the data set WORK.TABLE_B.
NOTE: There were 307875035 observations read from the data set WORK.TABLE_A.
NOTE: The data set WORK.WANT_OPTION_B_ARRAY has 5545 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           13.81 seconds
      user cpu time       12.36 seconds
      system cpu time     1.65 seconds
      memory              430637.46k
      OS Memory           458620.00k

(the printout is with FULLSTIMER, so you can see memory usage too)

 

The formula for the array size:

%let range_up = %sysevalf(&max_id.*1000 + &max_ID_2.*100 + &max_category.);

is based on the assumption that CATEGORY has at most 2 digits and ID_2 is 0 or 1, so the key ID*1000 + ID_2*100 + CATEGORY is collision-free (e.g. ID=123, ID_2=1, CATEGORY=22 maps to 123*1000 + 1*100 + 22 = 123122).

If the number of categories grows beyond 99, or ID_2 can exceed 9, you will have to adjust the 1000 and 100 accordingly.

 

Hope it helps.

 

Bart

 

P.S. I learned this technique from @hashman 

 

[EDIT:]

P.S.2

A linear pass through the data was very fast, but only ~6 seconds faster than the "array lookup":

data WANT_OPTION_B_read(keep = I SUM_X);
    set test.table_A(sortedby=I ID ID_2);
    by I;
    SUM_X + 1;
    if last.I;
run;

NOTE: There were 308619066 observations read from the data set WORK.TABLE_A.
NOTE: The data set WORK.WANT_OPTION_B_READ has 5558 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           7.52 seconds
      cpu time            7.56 seconds

 

{EDIT 2:} 

P.S.3 I did additional run of the hash approach to see memory usage:

NOTE: There were 1221506 observations read from the data set WORK.TABLE_B.
NOTE: There were 307875035 observations read from the data set WORK.TABLE_A.
NOTE: The data set WORK.WANT_OPTION_B has 5545 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           1:22.18
      user cpu time       1:20.67
      system cpu time     1.60 seconds
      memory              91185.12k
      OS Memory           119216.00k

Array took ~4.5 times more RAM.

 

But! If ID_2 is only 1 or 0, then we can re-arrange the lookup key calculation (ID stays below 1e5 and CATEGORY*1e5 stays below 1e7, so the three terms still cannot collide):

%let range_up = %sysevalf(&max_id. + &max_ID_2.*1e7 + &max_category.*1e5);
%put &=range_up.;



/* Option C: Array*/
data WANT_OPTION_B_array(keep = I SUM_X);

  /* populate array */
  array XX[&range_up.] _temporary_;
  do until(eofB);
    set table_B end=EOFB;
    XX[ID + ID_2*1e7 + CATEGORY*1e5]=X;
  end;

  do until (EOFA);
    set table_A end=EOFA;
    by I;

    if first.I then SUM_X = 0;
    SUM_X + XX[ID + ID_2*1e7 + CATEGORY*1e5];
    
    if last.I then output;
  end;
  stop;
run;

This makes the maximal array size 12,255,000 instead of 55,000,122 (~4.5 times smaller), and the result is even better, with the array's RAM now comparable to the hash table's:

NOTE: There were 1221506 observations read from the data set WORK.TABLE_B.
NOTE: There were 307875035 observations read from the data set WORK.TABLE_A.
NOTE: The data set WORK.WANT_OPTION_B_ARRAY has 5545 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           9.78 seconds
      user cpu time       8.35 seconds
      system cpu time     1.53 seconds
      memory              96686.93k
      OS Memory           124672.00k

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation




22 REPLIES
Reeza
Super User

Do you have indexes on the tables? Is it worthwhile to create an index with one of the ID variables?

 

https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/008-30.pdf
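
For example, a composite index on the lookup keys of table B could be created like this (sketch only; the index name KEY_IDX is made up):

proc datasets lib=work nolist;
	modify table_B;
	index create KEY_IDX = (ID ID_2 CATEGORY);
quit;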

yabwon
Onyx | Level 15

It's a very good article!




Tom
Super User

Can you explain what type of combination of the two datasets you are trying to do?

Do the key variables uniquely identify the observations in both datasets? If not, why not? What does the duplication mean?

You mentioned aggregation.  What type of aggregation are you doing?

 

The fastest way to combine two datasets that are already sorted is with the MERGE statement.

data want;
  merge a(in=in1) b(in=in2);
  by id1 id2;
run;

 

ballardw
Super User

A complete description of the aggregations might be in order. 

 

Proc summary will "aggregate" multiple variables by multiple class/by variables pretty efficiently. So perhaps combine all (or most) of your aggregations into one or a very small number of summary data sets, and then use those instead of aggregating with multiple calls on the same data set.
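
For instance, one PROC SUMMARY call can produce several statistics for several analysis variables in a single pass (a generic sketch; HAVE, AMOUNT and PRICE are hypothetical names):

proc summary data=have nway;
	class ID ID_2;                      /* all grouping variables at once */
	var AMOUNT PRICE;                   /* every analysis variable listed once */
	output out=agg_all(drop=_:) sum= mean= / autoname;
run;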

AhmedAl_Attar
Rhodochrosite | Level 12

Hi @Bernd_S 

How about creating a DATA step view of the merge, followed by PROC SUMMARY?

PROC SORT DATA=work.table_a; BY id id_2 category; Run;

DATA WANT_OPTION_A_v (KEEP=i x)/VIEW=WANT_OPTION_A_v;
	MERGE table_a table_b;
	By id id_2 category;
RUN;
PROC SUMMARY DATA=work.WANT_OPTION_A_v NWAY;
	CLASS i;
	VAR x;
	output out=WANT_OPTION_A_sum(DROP=_:) sum=sum_x;
RUN;

If i has too high a cardinality, you may want to sort the view by i and then use a BY statement in PROC SUMMARY:

PROC SORT DATA=work.table_a; BY id id_2 category; Run;

DATA WANT_OPTION_A_v (KEEP=i x)/VIEW=WANT_OPTION_A_v;
	MERGE table_a table_b;
	By id id_2 category;
RUN;
PROC SORT DATA=WANT_OPTION_A_v OUT=WANT_OPTION_A_srt;
   BY i;
RUN;
PROC SUMMARY DATA=work.WANT_OPTION_A_srt NWAY;
	BY i;
	VAR x;
	output out=WANT_OPTION_A_sum(DROP=_:) sum=sum_x;
RUN;

Hope this helps,

Ahmed

Patrick
Opal | Level 21

Thanks for providing sample data and code. 

Based on testing on my laptop, the hash lookup is faster than the SQL because it avoids the sort (which is the multithreaded portion of the SQL).

Option FULLSTIMER indicates that in my environment reading the data from disk is not the bottleneck. You should check this in your environment as well.

AFAIK, with these results the only thing that could really speed up processing is a faster CPU with matching throughput when reading from disk.

 

One other option would be to run this in multiple sessions (multiple CPUs) and process the data in multiple chunks, but... the time it would take to determine where to split the data and to invoke these sessions (rsubmit blocks) would likely consume too much extra time to make this worthwhile, given the current elapsed time of your single-threaded process. And with multiple processes reading the same file in parallel, there is also a chance that I/O becomes the bottleneck.

 

Just reading table A and loading table B into a hash already takes considerable time.

28         options fullstimer msglevel=i;
29         data _null_;
30           set table_a(keep=i id id_2 category);
31         run;

NOTE: There were 308646702 observations read from the data set WORK.TABLE_A.
NOTE: DATA statement used (Total process time):
      real time           1:01.85
      user cpu time       32.73 seconds
      system cpu time     10.75 seconds
      memory              1276.00k
      OS Memory           22120.00k
      Timestamp           03/07/2024 12:54:31 PM
      Step Count                        53  Switch Count  100
      

32         data _null_;
33             if 0 then set table_a(keep=i id id_2 category) table_b(keep=x);
34             declare hash hh_bew (dataset: 'table_b');
35             hh_bew.definekey  ('id', 'id_2', 'category');
36             hh_bew.definedata ('x');
37             hh_bew.definedone ();
38         run;

NOTE: There were 1223244 observations read from the data set WORK.TABLE_B.
NOTE: DATA STEP stopped due to looping.
NOTE: DATA statement used (Total process time):
      real time           0.52 seconds
      user cpu time       0.46 seconds
      system cpu time     0.00 seconds
      memory              134070.12k
      OS Memory           154756.00k
      Timestamp           03/07/2024 12:54:32 PM
      Step Count                        54  Switch Count  4

And here is the log from the run in my environment. Please note that the data step hash lookup is similar to a left join (and not an inner join). If you only wanted rows in the result set where at least one source row per ID had a match in the hash, some additional logic would be required.

28         options fullstimer msglevel=i compress=no;
29         /***************************************************************************************************/
30         /* Join and Aggregation */
31         /* Option A: Join and group by */
32         proc sql _method ;
33           create table WANT_OPTION_A1 as
34             select  A.I
35               ,sum (B.X) as SUM_X
36             from table_A(sortedby=I ID ID_2 keep=i id id_2 category) A
37               inner join table_B(sortedby=ID ID_2 CATEGORY) B
38                 on  A.ID = B.ID and
39                 A.ID_2 = B.ID_2 and
40                 A.CATEGORY = B.CATEGORY
41               group by A.I;

NOTE: SQL execution methods chosen are:

      sqxcrta
          sqxsumg
              sqxsort
                  sqxjhsh
                      sqxsrc( WORK.TABLE_A(alias = A) )
                      sqxsrc( WORK.TABLE_B(alias = B) )
NOTE: SAS threaded sort was used.
NOTE: Table WORK.WANT_OPTION_A1 created, with 5551 rows and 2 columns.

42         quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           5:43.96
      user cpu time       5:27.71
      system cpu time     22.79 seconds
      memory              1147505.04k
      OS Memory           1166680.00k
      Timestamp           03/07/2024 12:11:41 PM
      Step Count                        39  Switch Count  478
      

43         
44         /* Option B: Hash*/
45         data want_option_b(keep = i sum_x);
46           if _n_ = 1 then
47             do;
48               if 0 then set table_a(keep=i id id_2 category) table_b(keep=x);
49               declare hash hh_bew (dataset: 'table_b');
50               hh_bew.definekey  ('id', 'id_2', 'category');
51               hh_bew.definedata ('x');
52               hh_bew.definedone ();
53             end;
54         
55           set table_a(keep=i id id_2 category);
56           by i;
57           if hh_bew.find()=0 then sum_x + x;
58           if last.i then
59             do;
60               output;
61               call missing(sum_x);
62             end;
63         run;

NOTE: There were 1223244 observations read from the data set WORK.TABLE_B.
NOTE: There were 308646702 observations read from the data set WORK.TABLE_A.
NOTE: The data set WORK.WANT_OPTION_B has 5551 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           4:27.76
      user cpu time       3:09.60
      system cpu time     20.25 seconds
      memory              134490.06k
      OS Memory           154508.00k
      Timestamp           03/07/2024 12:16:09 PM
      Step Count                        40  Switch Count  433
      

 

 

Bernd_S
Fluorite | Level 6

Hi,

 

thanks everybody for the replies.

Since questions were asked about the tables and what I try to achieve, I will provide more background information.

 

Table A has the primary key i, id, id_2 and its creation is such that it is sorted by i. I think I can change the higher-level program so that table A is sorted by i, id, id_2 without much overhead, if this helps.

Table B has the primary key id, id_2, category. It is sorted by the primary key.

The aggregation I'm talking about is the sum(x) over each i.

 

The higher-level program is basically a Monte Carlo simulation, which is already run in multiple sessions. I need to simulate a large number of scenarios, i.e. max_i is the number of scenarios of one run of a session, and the total number of scenarios is the sum of max_i over all runs of all sessions. Table A is the outcome of one run of a session, i.e. for nearly every id, id_2 of table B a category is simulated for each scenario i. The simulation of the category is rather complex, and I need the full information in table A for other purposes as well. Each run creates a different table A. Furthermore, I must calculate the sum of x for each scenario i. Table B is basically a mapping of each CATEGORY of each id, id_2 to a value x; the values of table B are the same for every run. Therefore this rather large join/aggregation must be carried out several times. It is the most time-consuming part of the whole program, and I could decrease the total time of the higher-level program by a significant amount if this step ran, say, a minute faster.

 

 @Patrick: Thanks for your analysis; it coincides with my findings. Your hash implementation is slightly faster than mine, so right now this is my preferred solution. Due to the nature of the tables the lookup always hits, so we can skip the if hh_bew.find()=0 check.
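
The step then reduces to something like this (untested sketch of the simplification, based on Patrick's code; since every key matches, the return code is not tested):

data want_option_b(keep=i sum_x);
	if _n_ = 1 then do;
		if 0 then set table_a(keep=i id id_2 category) table_b(keep=x);
		declare hash hh_bew (dataset: 'table_b');
		hh_bew.definekey  ('id', 'id_2', 'category');
		hh_bew.definedata ('x');
		hh_bew.definedone ();
	end;
	set table_a(keep=i id id_2 category);
	by i;
	rc = hh_bew.find();   /* always a hit for my data */
	sum_x + x;
	if last.i then do;
		output;
		call missing(sum_x);
	end;
run;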

 

@Reeza: Since I'm creating the tables in the higher-level program, I could put indexes on them; however, I'm not familiar with indexes. Depending on max_i, I retrieve a different percentage of the rows of table B; for max_i=5600 around 80% are retrieved. Could this help? How would this step look with indexes?
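
Would it be something like this (an untested guess, assuming a composite index named KEY_IDX on table B as suggested above)?

data want_option_idx(keep=i sum_x);
	set table_A;
	by i;
	if first.i then sum_x = 0;
	set table_B key=KEY_IDX/unique;   /* keyed lookup via the composite index */
	if _iorc_ = 0 then sum_x + x;
	else do;                          /* no match: reset error flag, clear X */
		_error_ = 0;
		call missing(x);
	end;
	if last.i;
run;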

 

@AhmedAl_Attar: Thanks for your suggestion. I used your code, and on my machine its real time is slightly longer than @Patrick's suggestion, while its CPU time is around double that of the hash method.

 

 

yabwon
Onyx | Level 15

See the accepted solution at the top of this thread.



Patrick
Opal | Level 21

@yabwon Impressive! I knew that using a temporary array is faster than a hash lookup, but I didn't expect such a significant difference.

@Bernd_S If your actual data allows for it, then using a temporary array appears to have a very significant impact. And of course, what should also drastically reduce overall elapsed time is running your multiple simulation and aggregation steps in multiple processes in parallel - either by running multiple batch jobs in parallel or, if using EG/Studio, with a single program spawning multiple sessions (rsubmit); see the sketch below.
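
A minimal sketch of that multi-session pattern (assumes SAS/CONNECT is licensed; the session names and the chunking are placeholders):

options autosignon sascmd='!sascmd';

/* fire off two chunks in parallel; each remote session sees
   the parent WORK library as PWORK */
rsubmit task1 wait=no inheritlib=(work=pwork);
	/* join/aggregate chunk 1 of table A here */
endrsubmit;

rsubmit task2 wait=no inheritlib=(work=pwork);
	/* join/aggregate chunk 2 of table A here */
endrsubmit;

waitfor _all_ task1 task2;
signoff _all_;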

yabwon
Onyx | Level 15

@Patrick I saw in your post that you did a nice "processing time analysis" too; could you be so kind as to check the "array approach" time on your machine? That way we'd have a broader view.

 

Bart




Patrick
Opal | Level 21

@yabwon 

Here you go. I've "tweaked" your code slightly so it accepts id_2 as a character variable. I've also made the array two-dimensional, so there is no need to multiply ID by 1000.

29         options fullstimer msglevel=i ps=max;
30         /************ hash lookup ************/
32         data want_HASH(keep = i sum_x);
33           if _n_ = 1 then
34             do;
35               if 0 then set table_a(keep=i id id_2 category) table_b(keep=x);
36               declare hash hh_bew (dataset: 'table_b');
37               hh_bew.definekey  ('id', 'id_2', 'category');
38               hh_bew.definedata ('x');
39               hh_bew.definedone ();
40             end;
41         
42           set table_a(keep=i id id_2 category);
43           by i;
44           if hh_bew.find()=0 then sum_x + x;
45           if last.i then
46             do;
47               output;
48               call missing(sum_x);
49             end;
50         run;

NOTE: There were 1221506 observations read from the data set WORK.TABLE_B.
NOTE: There were 308097127 observations read from the data set WORK.TABLE_A.
NOTE: The data set WORK.WANT_HASH has 5549 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           2:44.66
      user cpu time       2:35.29
      system cpu time     3.96 seconds
      memory              134044.34k
      OS Memory           153324.00k
      Timestamp           03/07/2024 11:27:16 PM
      Step Count                        31  Switch Count  249
      

51         
52         /************ temporary array direct key lookup ************/
53         %let max_i = 5600;
54         %let max_id = 55000;
55         %let max_category = 22;
56         %let max_ID_2=1;
57         
58         data want_ARRAY(keep = I SUM_X);
59         
60           /* populate array */
61           array XX[&max_id,%eval(&max_category*10 +1)] _temporary_; /* instead hash table */
62           do until(eofB);
63             set table_B end=EOFB;
64             XX[ID,CATEGORY*10 +input(ID_2,1.)]=X;
65           end;
66         
67           do until (EOFA);
68             set table_A end=EOFA;
69             by I;
70         
71             if first.I then SUM_X = 0;
72             SUM_X + XX[ID,CATEGORY*10 +input(ID_2,1.)]; /* direct keys addressing */
73         
74             if last.I then output;
75           end;
76           stop;
77         run;

NOTE: There were 1221506 observations read from the data set WORK.TABLE_B.
NOTE: There were 308097127 observations read from the data set WORK.TABLE_A.
NOTE: The data set WORK.WANT_ARRAY has 5549 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           57.70 seconds
      user cpu time       52.67 seconds
      system cpu time     2.46 seconds
      memory              96194.21k
      OS Memory           115820.00k
      Timestamp           03/07/2024 11:28:14 PM
      Step Count                        32  Switch Count  102
yabwon
Onyx | Level 15

In fact, why do we have to limit ourselves to 2 dimensions? 🙂

 

data WANT_OPTION_B_array2(keep = I SUM_X);

  /* populate array */
  array XX[&max_id.       /* 1 to 55000 */
          ,&max_category. /* 1 to 22 */
          ,0:1            /* for ID_2  values 0 to 1 */
          ] _temporary_;
  do until(eofB);
    set test.table_B end=EOFB;
    XX[ID,CATEGORY,ID_2]=X;
  end;

  do until (EOFA);
    set test.table_A end=EOFA;
    by I;

    if first.I then SUM_X = 0;
    SUM_X + XX[ID,CATEGORY,ID_2];
    
    if last.I then output;
  end;
  stop;
run;

With 3 dimensions, we don't have to do any additional key calculations at all.

 

 

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



Patrick
Opal | Level 21

@yabwon LOL. Not sure why, but I was so convinced that we're limited to a maximum of two dimensions that I didn't even try.... Learning new things today 🙂

yabwon
Onyx | Level 15

Well, you did figure out that there can be 2 dimensions. Me, I didn't even step beyond 1 dimension in the first place 😄 😄 😄

 

Bart
