
Study on the best method to join two tables


Introduction

 

One question that comes back often, both on this forum and around me, is which method a SAS programmer should use when joining tables.

 

There is of course no definitive answer to this question, but since I am working on the third edition of my book about optimising the performance of SAS programs, I thought I would take a closer look at this common theme.

 

No silver bullet

 

It is impossible to make a comprehensive study about table joins.

 

The goal of a join might be to add one or more columns, or to add none but to validate against some baseline values, or perhaps to add or delete rows.

The input tables could be sorted or indexed or without any order. Some tables may be huge while others are tiny, some may be wide (many columns) and others narrow (few columns). There could be many tables in a join or just a couple.

The number of matches derived from a join could vary from a few rows using unique keys to billions of rows using Cartesian products.

The output tables may need to be sorted or not.

The ways to combine data are countless.

 

Furthermore, SAS has many syntaxes to read and write tables. Data steps use the SET, MERGE, UPDATE and MODIFY statements, together with options like KEY= or POINT=. Data steps can also use hash tables. PROC SQL offers the CREATE, UPDATE, INSERT and DELETE statements with operators such as JOIN, UNION, INTERSECT and EXCEPT and their many variants, as well as nested queries and subqueries.

It is even possible to use other SAS features such as formats.

The SAS language offers a wide array of methods to join a given set of data.
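To illustrate the format approach mentioned above, here is a minimal sketch (it is not part of the benchmark below): it assumes a base table MAIN and a lookup table LOOKUP with a unique numeric KEY and a character VALUE, all hypothetical names.

data CNTLIN;                                           /* build the PROC FORMAT control data set   */
  retain FMTNAME 'LOOKUPF' TYPE 'N';                   /* hypothetical numeric format name         */
  set LOOKUP(rename=(KEY=START VALUE=LABEL)) end=LAST;
  output;
  if LAST then do;                                     /* catch-all: keys with no match get blanks */
    HLO='O'; LABEL=' '; output;
  end;
run;

proc format cntlin=CNTLIN;
run;

data OUT;
  set MAIN;
  VALUE=put(KEY, LOOKUPF.);                            /* the "join" is a simple format look-up    */
run;

Building the format is a one-off cost; as discussed in the comments below, it can be prohibitive for 10 million distinct keys.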

 

The variations are endless: each join differs from the others and has to be examined on its own merits.
   

Benchmark

 

Yet these considerations are not much help, and one must start somewhere if one is to study the most efficient way to join tables. Here is a benchmark that I found interesting enough to publish, and I hope you will find it interesting too.

 

For this test, we consider a common task: combining two tables and adding data from the second table to the first.

This example uses two tables with 10 million rows, and we gradually retrieve more rows and more variables from the second table.


Four methods are used (a minimal sketch of each follows this list):
- Sort and Merge
- Index + SET with option KEY=
- Index + load in memory with SASFILE + SET with option KEY=
- Hash table
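As a quick preview, here is a minimal, self-contained sketch of each method on tiny toy tables (LEFT, RIGHT, RVAR1 and OUT1-OUT4 are hypothetical names, not those used in the benchmark):

data LEFT RIGHT;                                  /* toy data: the same keys exist in both tables */
  do KEY=1 to 100;
    VAR1=KEY*10;
    output;
  end;
run;

/* 1. Sort + Merge BY */
proc sort data=LEFT;  by KEY; run;
proc sort data=RIGHT; by KEY; run;
data OUT1;
  merge LEFT RIGHT(rename=(VAR1=RVAR1));
  by KEY;
run;

/* 2. Index + SET with option KEY= */
proc sql;
  create index KEY on RIGHT(KEY);
quit;
data OUT2;
  set LEFT;
  set RIGHT(rename=(VAR1=RVAR1)) key=KEY;                  /* indexed random read            */
  if _IORC_ then do; _ERROR_=0; call missing(RVAR1); end;  /* handle keys with no match      */
run;

/* 3. Index + SASFILE + SET with option KEY= : as above, after loading RIGHT in memory */
sasfile RIGHT open;
data OUT3;
  set LEFT;
  set RIGHT(rename=(VAR1=RVAR1)) key=KEY;
  if _IORC_ then do; _ERROR_=0; call missing(RVAR1); end;
run;
sasfile RIGHT close;

/* 4. Hash table */
data OUT4;
  if _N_=1 then do;
    if 0 then set RIGHT(rename=(VAR1=RVAR1));              /* define the host variable RVAR1 */
    declare hash H(dataset:"RIGHT(rename=(VAR1=RVAR1))");
    H.definekey('KEY');
    H.definedata('RVAR1');
    H.definedone();
  end;
  set LEFT;
  if H.find() ne 0 then call missing(RVAR1);
run;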

 

The benchmark code is provided at the bottom of this page.

 

Listed below are the results of these four tests in my environment, with the method yielding the lowest times highlighted in color.

 

The results are measured in elapsed (i.e. wall clock) seconds. The lower the better.

 

 

Join method: Sort + Merge BY

  # columns       % of rows retrieved
  retrieved       1%     5%    10%    20%    50%   100%
      1           79     80     81     84     79     78
      5           81     79     79     79     79     79
     10           89     88     89     95     93     90
     20           93     94     92     95     90     90
     30           95     95     95     96     95     94
     40           99     98    102     99    100     99
     50          104    105    105    105    105    110

 

Join method: Hash table

  # columns       % of rows retrieved
  retrieved       1%     5%    10%    20%    50%   100%
      1           57     59     58     60     65     73
      5           56     57     59     61     66     77
     10           61     63     64     69     69     80
     20           66     68     64     69     71     82
     30           61     62     63     66     72     82
     40           62     63     64     67     75     89
     50            -      -      -      -      -      -

 

 

Join method: Index + SET KEY=

  # columns       % of rows retrieved
  retrieved       1%     5%    10%    20%    50%   100%
      1           47     61     69     94    161    268
      5           46     56     68     92    153    270
     10           48     59     71     94    187    297
     20           49     59     71     97    159    270
     30           48     57     68     90    156    267
     40           47     56     68     88    156    262
     50           47     56     68     92    158    263

 

 

Join method: Index + SASFILE + SET KEY=

  # columns       % of rows retrieved
  retrieved       1%     5%    10%    20%    50%   100%
      1           49     53     53     58     72     93
      5           50     52     55     59     71     96
     10           55     58     60     66     77     99
     20           61     63     67     72     82    105
     30           63     66     68     73     87    110
     40           68     71     72     78     94    118
     50            -      -      -      -      -      -

 

Summarising the results yields the table below, showing the lowest elapsed times.

 

  # columns       % of rows retrieved
  retrieved       1%     5%    10%    20%    50%   100%
      1           47     53     53     58     65     73
      5           46     52     55     59     66     77
     10           48     58     60     66     69     80
     20           49     59     64     69     71     82
     30           48     57     63     66     72     82
     40           47     56     64     67     75     89
     50           47     56     68     92    105    110

In the original article each cell is colour-coded by the winning method: Hash table, Index+set key=, Index+sasfile+key=, or Sort+merge (which also yields a sorted output).

 

Based on the results listed above, one can see that some methods are clearly better than others depending on the task at hand.

 

Index look-ups are the fastest when few rows are needed (area highlighted in green).

 

When more rows are fetched, loading data in memory accelerates the random reads: Using SASFILE has an edge for fewer columns (area highlighted in blue) while a hash table provides better results when more columns are retrieved (area highlighted in purple).

 

Note that for the bottom row of the table (50 columns retrieved), the only available methods are indexing and sorting since the lookup table is too large to fit in memory. In this case, sorting becomes better than indexing when more than 20% of the rows are fetched, which is a common rule-of-thumb value.

 

Also note that when both input tables are already sorted, using MERGE BY is always the best solution by far: it only takes 32 seconds to merge all rows and all columns. That’s why permanent tables should be sorted when possible (with the VALIDATED flag set!). Sorting is a one-off cost that yields ongoing savings. Furthermore, using MERGE BY creates a sorted table (SAS, when will we have a way to set the SORT VALIDATED flag? When?)
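For example (a minimal sketch; LIB.BIG is a hypothetical permanent table), a one-off PROC SORT sets the validated sort flag, and the PRESORTED option lets you record the flag on a table that is already in order without paying for a full sort:

proc sort data=LIB.BIG;              /* one-off sort: sets SORTEDBY and the validated flag    */
  by KEY;
run;

proc sort data=LIB.BIG presorted;    /* data already in KEY order: only verifies the order    */
  by KEY;                            /* and records the sort information, no actual sort      */
run;

proc contents data=LIB.BIG;          /* the Sort Information section should show Validated: YES */
run;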

 

Conclusion

 

I hope you found these results as interesting as I did.

There is no “good” or “bad” join method and as usual, one must master the different tools available in order to choose the right one for a given task.

 

Did you expect these results? Do you have comments about the results? Do you have other experiences or insights to share?

Feel free to comment.

 

Here is the code for this simple benchmark:

 

 



%macro g;
  %let cols=1 5 10 20 30 40 50;
  %let rows=0.01 0.05 0.10 0.20 0.50 1;
  %do colno=1 %to %sysfunc(countw(&cols));
    %let colnb=%scan(&cols,&colno);
    %do rowno=1 %to %sysfunc(countw(&rows,%str( )));
      %let rowpct=%scan(&rows,&rowno,%str ( ));
 
      %put Trace: &=rowpct &=colnb *start;
      data TAB1 TAB2;
        retain VAR1-VAR50 8;
        do I=1 to 1e7;
          KEY=int(ranuni(0)*1e15);
          output;
        end;
      run;
 
      %put Trace: &=rowpct &=colnb *sort merge;
      proc sort data=TAB1 out=SORTED1 sortsize=1g details; by KEY; run;
      proc sort data=TAB2(keep=KEY VAR1-VAR&colnb.) out=SORTED2 sortsize=1g details; by KEY; run;
      data OUT;
        merge SORTED1 SORTED2; by KEY;
        %* We always fetch all rows with a merge;
      run;
               
      %put Trace: &=rowpct &=colnb *index;
      proc sql; create index KEY on TAB2(KEY); quit;
      data OUT;
        set TAB1;
        if ranuni(0) < &rowpct. then set TAB2(keep=KEY VAR1-VAR&colnb.) key=KEY;
      run;
 
      %put Trace: &=rowpct &=colnb *index sasfile;
      data REDUCED(index=(KEY)); set TAB2(keep=KEY VAR1-VAR&colnb.); run;
      sasfile REDUCED open;
      data OUT;
        set TAB1;
        if ranuni(0) < &rowpct. then set REDUCED(keep=KEY VAR1-VAR&colnb.) key=KEY;
      run;
      sasfile REDUCED close;
 
      %put Trace: &=rowpct &=colnb *hash;
      data OUT;
        set TAB1;
        if _N_=1 then do;
          dcl hash TAB2(dataset: "TAB2(keep=KEY VAR1-VAR&colnb)" );
          TAB2.definekey('KEY');
          TAB2.definedata('KEY'%do i=1 %to &colnb.;,%unquote(%nrbquote('VAR&i'))%end; );
          TAB2.definedone();
        end;
        if ranuni(0) < &rowpct. then RC=TAB2.find();
      run;    
    %end;
  %end;
%mend;

 

 

 

Comments

Interesting metrics - I've been using hash merges ever since the hash object was supported but ONLY when the number of rows is very large, otherwise the overhead kills your performance gain.

@ChrisBrooks These results validate your approach.

It seems that in some cases, retrieving a large number of columns is also a reason for using a hash table, even for fewer (10% here) rows. 

Awesome article. Thanks for writing this.

@kiranv_ Thank you!  🙂

Great article @ChrisNZ - it's featured in the October 2017 issue of the SAS Tech Report: Tips Extra newsletter, so be prepared for the onslaught of new fans (and maybe a few opinions 😉 )

Thank you Chris!

I am safe here in remote NZ. That's precisely the reason I moved here: to escape the onslaught of my fans.  🐵

@ChrisHemedinger I just looked at the newsletter. I am humbled to be in such good company. 

Very good and insightful findings. You have used the elapsed time (real time?). What about the CPU time? Could you post a comparison of the memory used?

 

DataSP

@KachiM Indeed, I am showing elapsed time. I added this missing information to the article.

The CPU time is never far behind (or in front, in the case of the sort and indexing steps), so some steps may well be CPU-bound.

These figures will vary significantly depending on the environment anyway.
I haven't looked at memory usage; I'll have a look.

The whole benchmark takes about 5 hours to run as given, if you want to run it (or part of it) to see how your platform performs.

Outstanding article! This kind of metrics-driven comparison is always of great value to us.

 

Thanks!

   Tom

@TomKari Thank you so much for your positive feedback!  

I must admit to a period of doubt when the book was turned down by SAS Publishing (as they reckoned there was no interest in this kind of material), and it is good to know that articles such as this one are still appreciated within the SAS community.

 

@datasp I could not gather much insight by looking at memory usage. That's why benchmarks are so time-consuming: It can be hard to find interesting patterns.

@ChrisNZ

 

Suppose two methods give about the same real time. In such circumstances, I prefer to check the memory used and/or the ratio of real time to CPU time and decide on the more efficient method. Benchmarks are time-consuming and sometimes frustrating.

I agree with you that very few of the point-and-click programmers do look into efficiency. When the process takes days to finish, then they scratch their heads. I am for efficiency in programming. Your book-title (2nd ed.)  enticed me to buy it.

 

Thanks

DataSp

"point-and-click programmers" lol. So true. When there are ties, I'd give a sorted output priority if that's relevant. Otherwise CPU time would be considered, as CPUs are both harder to upgrade and costlier (because of CPU-based licencing and invoicing of their time, mostly on mainframes). Memory usage would come last: it is cheap and easy to upgrade, and seldom gets invoiced. 

Thank you for your trust in buying the book. Don't hesitate to get back to me.

FK1

Hello @ChrisNZ,

 

thank you for your very interesting article. As you mentioned yourself: "It is even possible to use other SAS features such as formats." I was wondering: have you also tried the "format" method?

 

At first sight, I found there is, for example, an "r" missing in the word "run" in the second proc sort step:

 

proc sort data=TAB2(keep=KEY VAR1-VAR30&colnb.) out=SORTED2 sortsize=1g details; by KEY; un;

 Also, you must have accidentally typed a "30":

VAR30&colnb.

 

 

ERROR: Not all variables in the list VAR1-VAR301 were found.
180: LINE and COLUMN cannot be determined.
NOTE: NOSPOOL is on. Rerunning with OPTION SPOOL might allow recovery of the LINE and COLUMN where the error has occurred.
ERROR 180-322: Statement is not valid or it is used out of proper order.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.SORTED2 may be incomplete.  When this step was stopped there were 0 observations and 0 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.02 seconds
      cpu time            0.00 seconds

 

 

Sincerely yours,

FK

Fixed the 2 typos. Thank you. 🙂 I must have mixed code from different batches...

 

Yes I did try using formats. They were way too slow for the 10 million values required here. It took 20 minutes just to create the format with proc format if I recall. 

FK1

Hi all,

I found another bug in the HASH-method:

 

It is in TAB2.definedata(...). Therefore, I am posting my code, which ran perfectly fine without any warnings or errors:

 

%macro g;
  %let cols=1 5 10 20 30 40 50;
  %let rows=0.01 0.05 0.10 0.20 0.50 1;
  %do colno=1 %to %sysfunc(countw(&cols));
    %let colnb=%scan(&cols,&colno);
    %do rowno=1 %to %sysfunc(countw(&rows,%str( )));
      %let rowpct=%scan(&rows,&rowno,%str ( ));
 
      %put Trace: &=rowpct &=colnb *start;
      data TAB1 TAB2;
        retain VAR1-VAR50 8;
        do I=1 to 1e7;
          KEY=int(ranuni(0)*1e15);
          output;
        end;
      run;
 
      %put Trace: &=rowpct &=colnb *sort merge;
      proc sort data=TAB1 out=SORTED1 sortsize=1g details; by KEY; run;
      proc sort data=TAB2(keep=KEY VAR1-VAR&colnb.) out=SORTED2 sortsize=1g details; by KEY; run;
      data OUT;
        merge SORTED1 SORTED2; by KEY;
        %* We always fetch all rows with a merge;
      run;
               
      %put Trace: &=rowpct &=colnb *index;
      proc sql; create index KEY on TAB2;
      data OUT;
        set TAB1;
        if ranuni(0) < &rowpct. then set TAB2(keep=KEY VAR1-VAR&colnb.) key=KEY;
      run;
 
      %put Trace: &=rowpct &=colnb *index sasfile;
      data REDUCED(index=(KEY)); set TAB2(keep=KEY VAR1-VAR&colnb.); run;
      sasfile REDUCED open;
      data OUT;
        set TAB1;
        if ranuni(0) < &rowpct. then set REDUCED(keep=KEY VAR1-VAR&colnb.) key=KEY;
      run;
      sasfile REDUCED close;
 
      %put Trace: &=rowpct &=colnb *hash;
      data OUT;
        set TAB1;
        if _N_=1 then do;
          dcl hash TAB2(dataset: "TAB2(keep=KEY VAR1-VAR&colnb)" );
          TAB2.definekey('KEY');
          TAB2.definedata('KEY'%do j=1 %to &colnb.;,%unquote(%nrbquote("VAR&j."))%end; );
          TAB2.definedone();
        end;
        if ranuni(0) < &rowpct. then RC=TAB2.find();
      run;    
    %end;
  %end;
%mend;

Cheers,

FK1

 @FK1 Thank you.

 

The changed line can be either 

TAB2.definedata('KEY'%do i=1 %to 50;,%unquote(%nrbquote('VAR&i'))%end; );
or

TAB2.definedata('KEY'%do i=1 %to 50;,"VAR&i" %end; );

 

Now you have to share your results! 🙂

 

 

 

 

 

@ChrisNZ

 

Thanks for your systematic and informative comparison of join techniques - very useful.

 

I'd suggest another technique, using the hash object strictly as an index rather than for data storage. Then use a SET ... POINT= statement, along the lines of Paul Dorfman's "HASH + POINT=KEY" paper.

 

data key_lookup / view=key_lookup;
  set tab2 (keep=key);
  P=_n_;
run;

data out;
  if _n_=1 then do;
    set tab2 (keep=key);
    declare hash h (dataset:'key_lookup');
      h.definekey('key');
      h.definedata('p');
      h.definedone();
  end;

  set tab1;
  where ranuni(0)<&rowpct. ;
  h.find();
  set tab2 point=p;
run;

 

I see this as an improved version of "Index + SET KEY".  The index system does a binary search of the index values in order to identify the proper record.  At some scale the hash lookup of key for purposes of generating a record number ought to be faster. 

 

 

Also, as an experimental design issue, I'd suggest not using ranuni(0), which generates a different random value every time it's used. For each given column count/sampling percentage, I'd use a fixed value as the ranuni argument. That would allow each technique to generate perfectly identical output samples.

@mkeintz Thank you, very interesting. Your technique piqued my interest and I was hoping for a new way to speed up joins.

Sadly, this hash+point method never outperforms other join methods in this test in any of the data configurations.

It typically takes 50% more time than a hash join.

I tried to load the table in a sasfile to avoid the random-access-from-disk inefficiencies (which means no view, since sasfile still can't load views), but a simple hash is still faster.

 

Thanks for the tip about ranuni(). I'll use ranuni(1) from now on.
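As a small sketch of why this matters (SASHELP.CLASS is used here purely for illustration): with the same fixed non-zero seed, each data step regenerates the same random stream, so every join method samples exactly the same rows.

data SAMPLE1;
  set SASHELP.CLASS;
  if ranuni(1) < 0.5 then output;            /* fixed seed: reproducible selection       */
run;

data SAMPLE2;
  set SASHELP.CLASS;
  if ranuni(1) < 0.5 then output;            /* same seed, same calls: same rows again   */
run;

proc compare base=SAMPLE1 compare=SAMPLE2;   /* reports no differences                   */
run;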

Just a very small comment. I believe that SAS recommends the use of RAND('UNIFORM') instead of RANUNI, as it's more modern and apparently somewhat improved. At the level we're using it, I'm sure we'd never notice the difference, but thought I'd pass it along.

 

Tom

@TomKari Yes I did see this new preferred function being discussed by statisticians on these pages, but old habits die hard and for what I do, ranuni() is plenty good enough. Repeatability with ranuni(1) is actually an advantage here. Thank you for your input. 🙂
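For reference, a minimal sketch of the RAND('UNIFORM') equivalent (again on SASHELP.CLASS): the seed is set once with CALL STREAMINIT, which keeps the selection reproducible in the same way as ranuni(1).

data SAMPLE3;
  if _N_=1 then call streaminit(1);          /* fixed seed for the RAND stream           */
  set SASHELP.CLASS;
  if rand('uniform') < 0.5 then output;
run;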
