
Study on the best method to join two tables

by PROC Star on 10-23-2017 04:51 AM - edited 3 weeks ago

Introduction

 

One question that comes back often, both on this forum and around me, is about the method that a SAS programmer should use when joining tables.

 

There is of course no definitive answer to this question, but since I am working on the third edition of my book about optimising the performance of SAS programs, I thought I would take a closer look at this common theme.

 

No silver bullet

 

It is impossible to make a comprehensive study of table joins.

The goal of a join might be to add one or more columns, or to add none and simply validate against some baseline values, or perhaps to add or delete rows.

The input tables could be sorted, indexed, or in no particular order. Some tables may be huge while others are tiny, some may be wide (many columns) and others narrow (few columns). There could be many tables in a join or just a couple.

The number of matches derived from a join could vary from a few rows using unique keys to billions of rows using Cartesian products.

The output tables may need to be sorted or not.

The ways to combine data are countless.

 

Furthermore, SAS has many syntaxes to read and write tables. Data steps use the SET, MERGE, UPDATE and MODIFY statements, together with options like KEY= or POINT=. Data steps can also use hash tables. PROC SQL offers the CREATE, UPDATE, INSERT and DELETE statements with operators such as JOIN, UNION, INTERSECT and EXCEPT and their many variants, as well as nested queries and subqueries.

It is even possible to use other SAS features such as formats.

The SAS language offers a wide array of methods to join a given set of data.
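As a small illustration of the format approach, here is a minimal sketch (the tables SMALL and BIG and the format name LOOKUPF are hypothetical and not part of the benchmark below; SMALL is assumed to hold numeric KEY and VALUE columns). The small table is turned into a format, and a PUT() call then performs the "join" with no sorting, indexing or merging:

data CNTLIN;
  set SMALL end=_LAST;
  retain FMTNAME 'LOOKUPF' TYPE 'N';  * one numeric format built from SMALL;
  START = KEY;                        * the format range is the join key;
  LABEL = put(VALUE, best32.);        * format labels are stored as text;
  output;
  if _LAST then do;                   * catch-all entry for unmatched keys;
    HLO   = 'O';
    LABEL = ' ';
    output;
  end;
run;

proc format cntlin=CNTLIN;
run;

data JOINED;
  set BIG;
  VALUE = input(put(KEY, LOOKUPF.), best32.);  * the format look-up acts as the join;
run;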

 

The variations are infinite: each join differs from the next and has to be examined on its own merits.
   

Benchmark

 

Yet these considerations are not much help, and one must start somewhere if one is to study the most efficient way to join tables. Here is a benchmark that I found interesting enough to publish, and I hope you will find it interesting too.

 

For this test, we consider a common task: combining two tables and adding data from the second table to the first.

This example uses two tables with 10 million rows, and we gradually retrieve more rows and more variables from the second table.


Four methods are used:
- Sort and merge
- Index + SET with option KEY=
- Index + load in memory with SASFILE + SET with option KEY=
- Hash table

 

The benchmark code is provided at the bottom of this page.

 

Listed below are the results of these four tests in my environment; the summary table at the end shows the lowest time for each combination, together with the method that achieves it.

 

The results are measured in elapsed (i.e. wall clock) seconds. The lower the better.
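As an aside, elapsed times like these are normally read off the log (hence the %put Trace markers in the benchmark code below), but they can also be captured programmatically. The sketch below is purely illustrative; the STOPWATCH macro name is made up and is not part of the benchmark. The FULLSTIMER system option is another useful companion, as it makes each step report CPU time and memory alongside the real time.

%macro stopwatch(label);
  %* No argument starts the clock, an argument reports the elapsed time;
  %global _t0;
  %if %superq(label) = %then %let _t0 = %sysfunc(datetime());
  %else %put NOTE: &label took %sysevalf(%sysfunc(datetime()) - &_t0) seconds.;
%mend;

options fullstimer;   * make each step report CPU time and memory alongside real time;

%stopwatch()
proc sort data=TAB1 out=SORTED1;
  by KEY;
run;
%stopwatch(Sorting TAB1)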

 

 

Join method: Sort + Merge by

# columns          % of rows retrieved
retrieved      1%     5%    10%    20%    50%   100%
     1         79     80     81     84     79     78
     5         81     79     79     79     79     79
    10         89     88     89     95     93     90
    20         93     94     92     95     90     90
    30         95     95     95     96     95     94
    40         99     98    102     99    100     99
    50        104    105    105    105    105    110

 

Join method: Hash table

# columns          % of rows retrieved
retrieved      1%     5%    10%    20%    50%   100%
     1         57     59     58     60     65     73
     5         56     57     59     61     66     77
    10         61     63     64     69     69     80
    20         66     68     64     69     71     82
    30         61     62     63     66     72     82
    40         62     63     64     67     75     89
    50          -      -      -      -      -      -

(dashes: with 50 columns the look-up table no longer fits in memory)

 

 

Join method: Index+set key=

# columns          % of rows retrieved
retrieved      1%     5%    10%    20%    50%   100%
     1         47     61     69     94    161    268
     5         46     56     68     92    153    270
    10         48     59     71     94    187    297
    20         49     59     71     97    159    270
    30         48     57     68     90    156    267
    40         47     56     68     88    156    262
    50         47     56     68     92    158    263

 

 

Join method: Index+sasfile+set key=

# columns          % of rows retrieved
retrieved      1%     5%    10%    20%    50%   100%
     1         49     53     53     58     72     93
     5         50     52     55     59     71     96
    10         55     58     60     66     77     99
    20         61     63     67     72     82    105
    30         63     66     68     73     87    110
    40         68     71     72     78     94    118
    50          -      -      -      -      -      -

(dashes: with 50 columns the look-up table no longer fits in memory)

 

Summarising the results yields the table below, showing the lowest elapsed time for each combination and the method that achieves it.

 

# columns          % of rows retrieved
retrieved      1%      5%     10%     20%     50%    100%
     1        47 I    53 F    53 F    58 F    65 H    73 H
     5        46 I    52 F    55 F    59 F    66 H    77 H
    10        48 I    58 F    60 F    66 F    69 H    80 H
    20        49 I    59 I    64 H    69 H    71 H    82 H
    30        48 I    57 I    63 H    66 H    72 H    82 H
    40        47 I    56 I    64 H    67 H    75 H    89 H
    50        47 I    56 I    68 I    92 I   105 S   110 S

Fastest method for each cell:
  S = Sort+merge (sorted output)      H = Hash table
  I = Index+set key=                  F = Index+sasfile+key=

 

Based on the results listed above, one can see that some methods are clearly better than others, depending on the task at hand.

 

Index look-ups are the fastest when few rows are needed (the cells marked I in the summary table).
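For reference, the indexed look-up pattern looks like the sketch below. The table names MASTER and LOOKUP and the variable VALUE are hypothetical (the real benchmark code is at the bottom of the page); the _IORC_ test handles keys that have no match in the indexed table.

data OUT;
  set MASTER;                    * the large table drives the step;
  set LOOKUP key=KEY / unique;   * random read through the index on KEY;
  if _IORC_ ne 0 then do;        * the key was not found in LOOKUP;
    _ERROR_ = 0;                 * avoid dumping the PDV to the log;
    call missing(VALUE);         * clear values kept from the previous match;
  end;
run;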

 

When more rows are fetched, loading data in memory accelerates the random reads: using SASFILE has an edge when fewer columns are retrieved (the cells marked F), while a hash table provides better results when more columns are retrieved (the cells marked H).
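The SASFILE variant is sketched below, using the same hypothetical MASTER and LOOKUP tables; SASFILE keeps the indexed look-up table in memory so that the random reads are served from RAM rather than disk.

sasfile WORK.LOOKUP open;        * load the indexed look-up table into memory;

data OUT;
  set MASTER;
  set WORK.LOOKUP key=KEY / unique;
  if _IORC_ ne 0 then do; _ERROR_ = 0; call missing(VALUE); end;
run;

sasfile WORK.LOOKUP close;       * release the memory;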

 

Note that for the bottom row of the table (50 columns retrieved), the only available methods are indexing and sorting since the lookup table is too large to fit in memory. In this case, sorting becomes better than indexing when more than 20% of the rows are fetched, which is a common rule-of-thumb value.

 

Also note that when both input tables are already sorted, using MERGE BY is by far the best solution: it takes only 32 seconds to merge all rows and all columns. That's why permanent tables should be sorted whenever possible (with the VALIDATED flag set!). Sorting is a one-off cost that yields ongoing savings. Furthermore, using MERGE BY creates a sorted table. (SAS, when will we have a way to set the SORT VALIDATED flag? When?)
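On the subject of sort flags, two related features may help (a hedged aside, not used in the benchmark): the PRESORTED option of PROC SORT (SAS 9.3 and later) checks whether the data is already in order and only performs the sort if it is not, while the SORTEDBY= data set option merely asserts an order, which SAS records as supplied by the user rather than validated. TAB1 below is the benchmark's first table; TAB1_ASSERTED is just an illustrative copy.

proc sort data=TAB1 presorted;     * verify the existing order, sorting only if needed;
  by KEY;
run;

data TAB1_ASSERTED(sortedby=KEY);  * assert the order without validating it;
  set TAB1;
run;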

 

Conclusion

 

I hope you found these results as interesting as I did.

There is no “good” or “bad” join method, and as usual one must master the different tools available in order to choose the right one for a given task.

 

Did you expect these results? Do you have comments about the results? Do you have other experiences or insights to share?

Feel free to comment.

 

Here is the code for this simple benchmark:

 

 



%macro g;
  %let cols=1 5 10 20 30 40 50;
  %let rows=0.01 0.05 0.10 0.20 0.50 1;
  %do colno=1 %to %sysfunc(countw(&cols));
    %let colnb=%scan(&cols,&colno);
    %do rowno=1 %to %sysfunc(countw(&rows,%str( )));
      %let rowpct=%scan(&rows,&rowno,%str( ));
 
      %put Trace: &=rowpct &=colnb *start;
      data TAB1 TAB2;
        retain VAR1-VAR50 8;
        do I=1 to 1e7;
          KEY=int(ranuni(0)*1e15);
          output;
        end;
      run;
 
      %put Trace: &=rowpct &=colnb *sort merge;
      proc sort data=TAB1 out=SORTED1 sortsize=1g details; by KEY; run;
      proc sort data=TAB2(keep=KEY VAR1-VAR&colnb.) out=SORTED2 sortsize=1g details; by KEY; run;
      data OUT;
        merge SORTED1 SORTED2; by KEY;
        %* We always fetch all rows with a merge;
      run;
               
      %put Trace: &=rowpct &=colnb *index;
      proc sql; create index KEY on TAB2; quit;
      data OUT;
        set TAB1;
        if ranuni(0) < &rowpct. then set TAB2(keep=KEY VAR1-VAR&colnb.) key=KEY;
      run;
 
      %put Trace: &=rowpct &=colnb *index sasfile;
      data REDUCED(index=(KEY)); set TAB2(keep=KEY VAR1-VAR&colnb.); run;
      sasfile REDUCED open;
      data OUT;
        set TAB1;
        if ranuni(0) < &rowpct. then set REDUCED(keep=KEY VAR1-VAR&colnb.) key=KEY;
      run;
      sasfile REDUCED close;
 
      %put Trace: &=rowpct &=colnb *hash;
      data OUT;
        set TAB1;
        if _N_=1 then do;
          dcl hash TAB2(dataset: "TAB2(keep=KEY VAR1-VAR&colnb)" );
          TAB2.definekey('KEY');
          TAB2.definedata('KEY' %do i=1 %to &colnb.; ,"VAR&i" %end; );
          TAB2.definedone();
        end;
        if ranuni(0) < &rowpct. then RC=TAB2.find();
      run;    
    %end;
  %end;
%mend;

%* Run the benchmark, about 5 hours as given;
%g

 

 

 

Comments
by Super Contributor
on ‎10-23-2017 10:22 AM

Interesting metrics - I've been using hash merges ever since the hash object was supported but ONLY when the number of rows is very large, otherwise the overhead kills your performance gain.

by PROC Star
a month ago

@ChrisBrooks These results validate your approach.

It seems that in some cases, retrieving a large number of columns is also a reason to use a hash table, even for fewer rows (10% here).

by PROC Star
a month ago

Awesome article. Thanks for writing this.

by PROC Star
a month ago

@kiranv_ Thank you!  :)

by Community Manager
3 weeks ago

Great article @ChrisNZ - it's featured in the October 2017 issue of the SAS Tech Report: Tips Extra newsletter, so be prepared for the onslaught of new fans (and maybe a few opinions ;) )

by PROC Star
3 weeks ago

Thank you Chris!

I am safe here in remote NZ. That's precisely the reason I moved here: to escape the onslaught of my fans.  :o)

by PROC Star
3 weeks ago

@ChrisHemedinger I just looked at the newsletter. I am humbled to be in such good company. 

by Super Contributor
3 weeks ago

Very good and insightful findings. You have used the elapsed time (real time?). What about the CPU time? Could you post a comparison of the memory used?

 

DataSP

by PROC Star
3 weeks ago

@datasp Indeed, I am showing elapsed time. I have added this missing information to the article.

The CPU time is never far behind (or in front, in the case of the sort and indexing steps), so some steps may well be CPU-bound.

These figures will vary significantly depending on the environment anyway.
I haven't looked at memory usage; I'll have a look.

The whole benchmark takes about 5 hours to run as given, if you want to run it (or part of it) to see how your platform performs.

by PROC Star
3 weeks ago

Outstanding article! This kind of metrics-driven comparison is always of great value to us.

 

Thanks!

   Tom

by PROC Star
2 weeks ago - last edited 2 weeks ago

@TomKari Thank you so much for your positive feedback!  

I must admit to a period of doubt when the book was turned down by SAS Publishing (as they reckoned there was no interest in this kind of material), and it is good to know that articles such as this one are still appreciated within the SAS community.

 

@datasp I could not gather much insight by looking at memory usage. That's why benchmarks are so time-consuming: It can be hard to find interesting patterns.

by Super Contributor
2 weeks ago

@ChrisNZ

 

Suppose two methods give about the same real time. In such circumstances, I prefer to check the memory used and/or the ratio of real time to CPU time and decide on the more efficient method. Benchmarks are time-consuming and sometimes frustrating.

I agree with you that very few point-and-click programmers look into efficiency. When the process takes days to finish, then they scratch their heads. I am all for efficiency in programming. Your book title (2nd ed.) enticed me to buy it.

 

Thanks

DataSp

by PROC Star
2 weeks ago

"point-and-click programmers" lol. So true. When there are ties, I'd give a sorted output priority if that's relevant. Otherwise CPU time would be considered, as CPUs are both harder to upgrade and costlier (because of CPU-based licencing and invoicing of their time, mostly on mainframes). Memory usage would come last: it is cheap and easy to upgrade, and seldom gets invoiced. 

Thank you for your trust in buying the book. Don't hesitate to get back to me.
