Solved: Re: Join and aggregation of large tables - is there a faster way? - Page 2

AhmedAl_Attar · Posted 03-07-2024 11:02 AM

Guys,

Thank you for inspiring solutions, and just as an FYI, on my machine, Single level/dimension array approach was the fastest when compared to 2-Dimensional & 3-Dimensional array approaches. I wonder if you had similar findings.

Tom · Posted 03-07-2024 11:08 AM

@AhmedAl_Attar wrote:

@yabwon @Patrick

Guys,

Thank you for inspiring solutions, and just as an FYI, on my machine, Single level/dimension array approach was the fastest when compared to 2-Dimensional & 3-Dimensional array approaches. I wonder if you had similar findings.

Could just be a result of re-running and getting benefits of caching. Run the two in opposite order and see if the the difference stays the same.

If so then I assume it because of the reduced number of checks that SAS has to do to make sure the indexes are in bounds. Doing the simple multiplication and then just checking that the result is in bounds is probably faster than checking three index values. Especially since it probably has to do the multiplication anyway to index to the requested location.

Patrick · Posted 03-07-2024 08:40 PM

@yabwon And for my own upskilling and because I could I've now also been running this aggregation step in Viya 4 CAS on an environment with 192 threads (4 CAS workers with 48 CPU's each).

This would in reality only be worth it if the simulation process already runs in CAS and creates a CAS table.

The hash approach took 14 seconds. I've been told that if there is no hash method used that needs write access to the hash then CAS is smart enough to load the hash only once (not sure if once per worker or once at all on the controller node).

The temp array approach executed in CAS without errors or warning BUT it somehow skipped processing all the data. Not sure yet if this is a bug or if I'm just not understanding yet how things work.

The moment I removed the do until step that loads the array the set statement read all the rows from casuser.table_a.

casuser.table_A is partitioned by I - which is why I don't need a 2nd single threaded pass through the data to consolidate the results from the different nodes ("MapReduce").

casuser.table_B is replicated on all nodes.

And here the log.

Spoiler

179  /************ hash lookup ************/
180  data casuser.want_HASH(keep = i sum_x replace=yes);
181    if _n_ = 1 then
182      do;
183        if 0 then set casuser.table_A(keep=i id id_2 category) casuser.table_B(keep=x);
184        declare hash hh_bew (dataset: 'casuser.table_B');
185        hh_bew.definekey  ('id', 'id_2', 'category');
186        hh_bew.definedata ('x');
187        hh_bew.definedone ();
188      end;
189  
190    set casuser.table_A(keep=i id id_2 category);
191    by i;
192    if hh_bew.find()=0 then sum_x + x;
193    if last.i then
194      do;
195        output;
196        call missing(sum_x);
197      end;
198  run;
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.
NOTE: There were 0 observations read from the table TABLE_A in caslib CASUSER(*****).
NOTE: There were 0 observations read from the table TABLE_B in caslib CASUSER(*****).
NOTE: There were 308519088 observations read from the table TABLE_A in caslib CASUSER(*****).
NOTE: There were 4890072 observations read from the table table_B in caslib CASUSER(*****).
NOTE: The table want_HASH in caslib CASUSER(*****) has 5552 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           13.56 seconds
      user cpu time       0.06 seconds
      system cpu time     0.02 seconds
      memory              1675.28k
      OS Memory           40812.00k
      Timestamp           03/08/2024 12:59:55 AM
      Step Count                        18  Switch Count  2
      Page Faults                       0
      Page Reclaims                     140
      Page Swaps                        0
      Voluntary Context Switches        275
      Involuntary Context Switches      0
      Block Input Operations            0
      Block Output Operations           0
      
199  
200  /************ temporary array direct key lookup ************/
201  %let max_i = 5600;
202  %let max_id = 55000;
203  %let max_category = 22;
204  %let max_ID_2=1;
205  
206  data casuser.want_ARRAY(keep = I SUM_X replace=yes) ;
207  
208    /* populate array */
209    array XX[&max_id.       /* 1 to 55000 */
210            ,&max_category. /* 1 to 22 */
211            ,0:1            /* for ID_2  values 0 to 1 */
212            ] _temporary_;
213    if _n_=1 then
214      do until(eofB);
215        set casuser.table_B end=EOFB;
216        XX[ID,CATEGORY,input(ID_2,1.)]=X;
217      end;
218  
219    set casuser.table_A;
220    by I;
221  
222    if first.I then SUM_X = 0;
223    SUM_X + XX[ID,CATEGORY,input(ID_2,1.)];
224  
225    if last.I then output;
226  
227  run;
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.
NOTE: There were 1222518 observations read from the table TABLE_B in caslib CASUSER(*****).
NOTE: There were 76796502 observations read from the table TABLE_A in caslib CASUSER(*****).
NOTE: The table want_ARRAY in caslib CASUSER(*****) has 1382 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           12.74 seconds
      user cpu time       0.03 seconds
      system cpu time     0.03 seconds
      memory              20470.84k
      OS Memory           59720.00k
      Timestamp           03/08/2024 01:00:07 AM
      Step Count                        19  Switch Count  2
      Page Faults                       0
      Page Reclaims                     703
      Page Swaps                        0
      Voluntary Context Switches        238
      Involuntary Context Switches      0
      Block Input Operations            0
      Block Output Operations           0

179 /************ hash lookup ************/ 180 data casuser.want_HASH(keep = i sum_x replace=yes); 181 if _n_ = 1 then 182 do; 183 if 0 then set casuser.table_A(keep=i id id_2 category) casuser.table_B(keep=x); 184 declare hash hh_bew (dataset: 'casuser.table_B'); 185 hh_bew.definekey ('id', 'id_2', 'category'); 186 hh_bew.definedata ('x'); 187 hh_bew.definedone (); 188 end; 189 190 set casuser.table_A(keep=i id id_2 category); 191 by i; 192 if hh_bew.find()=0 then sum_x + x; 193 if last.i then 194 do; 195 output; 196 call missing(sum_x); 197 end; 198 run; NOTE: Running DATA step in Cloud Analytic Services. NOTE: The DATA step will run in multiple threads. NOTE: There were 0 observations read from the table TABLE_A in caslib CASUSER(*****). NOTE: There were 0 observations read from the table TABLE_B in caslib CASUSER(*****). NOTE: There were 308519088 observations read from the table TABLE_A in caslib CASUSER(*****). NOTE: There were 4890072 observations read from the table table_B in caslib CASUSER(*****). NOTE: The table want_HASH in caslib CASUSER(*****) has 5552 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 13.56 seconds user cpu time 0.06 seconds system cpu time 0.02 seconds memory 1675.28k OS Memory 40812.00k Timestamp 03/08/2024 12:59:55 AM Step Count 18 Switch Count 2 Page Faults 0 Page Reclaims 140 Page Swaps 0 Voluntary Context Switches 275 Involuntary Context Switches 0 Block Input Operations 0 Block Output Operations 0 199 200 /************ temporary array direct key lookup ************/ 201 %let max_i = 5600; 202 %let max_id = 55000; 203 %let max_category = 22; 204 %let max_ID_2=1; 205 206 data casuser.want_ARRAY(keep = I SUM_X replace=yes) ; 207 208 /* populate array */ 209 array XX[&max_id. /* 1 to 55000 */ 210 ,&max_category. /* 1 to 22 */ 211 ,0:1 /* for ID_2 values 0 to 1 */ 212 ] _temporary_; 213 if _n_=1 then 214 do until(eofB); 215 set casuser.table_B end=EOFB; 216 XX[ID,CATEGORY,input(ID_2,1.)]=X; 217 end; 218 219 set casuser.table_A; 220 by I; 221 222 if first.I then SUM_X = 0; 223 SUM_X + XX[ID,CATEGORY,input(ID_2,1.)]; 224 225 if last.I then output; 226 227 run; NOTE: Running DATA step in Cloud Analytic Services. NOTE: The DATA step will run in multiple threads. NOTE: There were 1222518 observations read from the table TABLE_B in caslib CASUSER(*****). NOTE: There were 76796502 observations read from the table TABLE_A in caslib CASUSER(*****). NOTE: The table want_ARRAY in caslib CASUSER(*****) has 1382 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 12.74 seconds user cpu time 0.03 seconds system cpu time 0.03 seconds memory 20470.84k OS Memory 59720.00k Timestamp 03/08/2024 01:00:07 AM Step Count 19 Switch Count 2 Page Faults 0 Page Reclaims 703 Page Swaps 0 Voluntary Context Switches 238 Involuntary Context Switches 0 Block Input Operations 0 Block Output Operations 0

yabwon · Posted 03-08-2024 03:36 AM

@Patrick , so for hash table approach adding some "brute computing force" gives a nice boost.

There is one more interesting observation (regardless the fact the array approach has some data reading issues, which is totally bizarre to me).

When you take a look at my or @Bernd_S log from SAS 9.4, you will notice there is almost no difference between CPU time and Real time, e.g. @Bernd_S log says:

for array:

real time 1:04.32
user cpu time 55.27 seconds

for hash:

real time 4:46.23
user cpu time 4:41.24

In case of Viya there is almost none CPU time reported, I wonder what is the reason?

Another experiment I did was running both codes (hash and array) on my "home made Viya" which is "SAS9 + work in RAM disk", result was that there was no difference in processing time between data in RAM drive and data in SSD drive...

Bart

P.S. RamDrive tool i'm using is imdisk-toolkit (https://sourceforge.net/projects/imdisk-toolkit/)

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

yabwon · Posted 03-08-2024 04:56 AM

@Patrick

I did one more test with my "home made parallel processing using systask" code (basis was materials from my old parallel-systask training: https://pages.mini.pw.edu.pl/~jablonskib/SASpublic/Parallel-processing-in-BASE-SAS.sas)

I managed:

- with 5 parallel sas sessions I managed to reduce hash-table processing to 26.54 sec.

- with 4 parallel sas sessions I managed to reduce hash-table processing to 31.995 sec.

- with 5 parallel sas sessions I managed to reduce array processing to 4.831 sec.

- with 4 parallel sas sessions I managed to reduce array processing to 5.551 sec.

- with 3 parallel sas sessions I managed to reduce array processing to 7.005 sec.

I think SSD drives did the most "speed boost" but I dare to say "not bad for a laptop" 😉

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

hashman · Posted 03-07-2024 02:46 PM

Cześć Bartku,

Thanks for the plug.

Key-indexed search / aggregation always wins speed-wise if there is enough memory for it.

If not, an array-based hash table is the next fastest option coupled with a drastic memory reduction. I'd just use the simplest collision resolution policy, the linear probing, and set the hash table size at approximately twice the number of distinct keys. Code can be adapted almost verbatim from our (with Richard DeVenezia) last year's SESUG paper "Hashes From the Ashes", Section 2:

https://www.lexjansen.com/sesug/2023/SESUG2023_Paper_203_Final_PDF.pdf

The hash object is a great tool (it would be funny if I thought otherwise) with gobs of functionality compared to array-based hashing but the latter can be faster/lighter at specialized tasks. You know, at times one needs a dedicated blade instead of those included in a Swiss Army knife.

Gorące pozdrowienia

Paul D.

Bernd_S · Posted 03-07-2024 07:35 AM

Hi,

@yabwon Thank you very much, that is exactly what I need. I also didn't know and didn't expect that the use of an array would be so much faster. I can make ID_2 numeric.

Your solution on my machine:

real time 1:04.32
user cpu time 55.27 seconds
system cpu time 9.06 seconds
memory 96561.62k
OS Memory 132524.00k

@Patricks Hash solution:

real time 4:46.23
user cpu time 4:41.24
system cpu time 5.00 seconds
memory 132394.82k
OS Memory 168216.00k

@Patrick Thanks again for your suggestions and analysis.

mkeintz · Posted 03-07-2024 08:05 PM

@Bernd_S wrote:

Hi,

@yabwon Thank you very much, that is exactly what I need. I also didn't know and didn't expect that the use of an array would be so much faster. I can make ID_2 numeric.

Arrays are ALWAYS faster than hash lookups, because arrays use direct key indexing. The array index effectively points directly at the location of the content. But a hash KEY needs to go through a transformation to find its location in the hash object (but the hash transformation will be faster than the analogous binary search for a key in a sorted list).

But arrays can't do this:

You can index (i.e. definekey) on character values..
Dynamically grow the number of dataitems in the object.
Retrieve a dataitem of multiple variables with a single hash method call, whereas the array analog would require programming retrieval of the i'th element from each of a set of one-dimension arrays (or multiple columns from a given array row). The same relative advantage exists for adding to the hash vs adding to arrays.
Dynamically create the content of a hash object.
Dynamically free up memory used by a hash object.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!