Introduction
One question that comes back often, both on this forum and around me, is about the method that a SAS programmer should use when joining tables.
There is of course no definitive answer to this question, but since I am working on the third edition of my book about optimising the performance of SAS programs, I thought I would take a closer look at this common theme.
No silver bullet
It is impossible to make a comprehensive study about table joins.
The goal of a join might be to add one or more columns, to add none but validate values against some baseline, or perhaps to add or delete rows.
The input tables could be sorted or indexed or without any order. Some tables may be huge while others are tiny, some may be wide (many columns) and others narrow (few columns). There could be many tables in a join or just a couple.
The number of matches derived from a join could vary from a few rows using unique keys to billions of rows using Cartesian products.
The output tables may need to be sorted or not.
The ways to combine data are countless.
Furthermore, SAS has many syntaxes to read and write tables. Data steps use the SET, MERGE, UPDATE and MODIFY statements, together with options like KEY= or POINT=. Data steps can also use hash tables. PROC SQL offers the CREATE, UPDATE, INSERT and DELETE statements, with operators such as JOIN, UNION, INTERSECT and EXCEPT and their many variants, as well as nested queries and subqueries.
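To make this variety concrete, here are two of the many equivalent ways to fetch a column from one table into another. This is only a sketch: TAB1, TAB2, KEY and VAR1 are illustrative names, not part of the benchmark below.

```sas
* Data step approach: sort both tables, then MERGE BY. ;
proc sort data=TAB1; by KEY; run;
proc sort data=TAB2; by KEY; run;
data OUT;
  merge TAB1(in=A) TAB2(keep=KEY VAR1);
  by KEY;
  if A;  * keep only the rows present in TAB1 ;
run;

* PROC SQL approach: a left join needs no prior sort. ;
proc sql;
  create table OUT as
  select a.*, b.VAR1
  from TAB1 a left join TAB2 b
    on a.KEY = b.KEY;
quit;
```

Both steps produce the same matches; the data step requires sorted inputs while PROC SQL chooses its own access plan.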
It is even possible to use other SAS features such as formats.
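As a quick illustration of the format approach (a sketch with made-up names: SMALL, BIG, KEY and LABEL are not from the benchmark, and LABEL must be character), a small lookup table can be turned into a format via PROC FORMAT's CNTLIN= option, after which the "join" is a simple PUT() call during a sequential read:

```sas
* Build a numeric format LOOKUPF from lookup table SMALL (KEY -> LABEL). ;
data CNTLIN;
  set SMALL(rename=(KEY=START));
  retain FMTNAME 'LOOKUPF' TYPE 'N';
run;
proc format cntlin=CNTLIN; run;

* The join is now a format lookup while reading BIG sequentially. ;
data OUT;
  set BIG;
  LABEL = put(KEY, LOOKUPF.);
run;
```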
The SAS language offers a wide array of methods to join a given set of data.
The variations are infinite, each join differs from the other and has to be examined on its own merit.
Benchmark
Yet these considerations are not much help, and one must start somewhere if one is to study the most efficient way to join tables. Here is a benchmark that I found interesting enough to publish; I hope you will find it interesting too.
For this test, we consider a common task such as combining two tables, and adding data from the second table to the first.
This example uses two tables with 10 million rows, and we gradually retrieve more rows and more variables from the second table.
Four methods are used:
- Sort and MERGE
- Index + SET with option KEY=
- Index + load in memory with SASFILE + SET with option KEY=
- Hash table
The benchmark code is provided at the bottom of this page.
Listed below are the results of these four tests in my environment, with the method yielding the lowest times highlighted in color.
The results are measured in elapsed (i.e. wall clock) seconds. The lower the better.
Join method: Sort + Merge by

| # columns retrieved \ % of rows retrieved | 1% | 5% | 10% | 20% | 50% | 100% |
|---|---|---|---|---|---|---|
| 1 | 79 | 80 | 81 | 84 | 79 | 78 |
| 5 | 81 | 79 | 79 | 79 | 79 | 79 |
| 10 | 89 | 88 | 89 | 95 | 93 | 90 |
| 20 | 93 | 94 | 92 | 95 | 90 | 90 |
| 30 | 95 | 95 | 95 | 96 | 95 | 94 |
| 40 | 99 | 98 | 102 | 99 | 100 | 99 |
| 50 | 104 | 105 | 105 | 105 | 105 | 110 |
Join method: Hash table

| # columns retrieved \ % of rows retrieved | 1% | 5% | 10% | 20% | 50% | 100% |
|---|---|---|---|---|---|---|
| 1 | 57 | 59 | 58 | 60 | 65 | 73 |
| 5 | 56 | 57 | 59 | 61 | 66 | 77 |
| 10 | 61 | 63 | 64 | 69 | 69 | 80 |
| 20 | 66 | 68 | 64 | 69 | 71 | 82 |
| 30 | 61 | 62 | 63 | 66 | 72 | 82 |
| 40 | 62 | 63 | 64 | 67 | 75 | 89 |
| 50 | - | - | - | - | - | - |
Join method: Index + set key=

| # columns retrieved \ % of rows retrieved | 1% | 5% | 10% | 20% | 50% | 100% |
|---|---|---|---|---|---|---|
| 1 | 47 | 61 | 69 | 94 | 161 | 268 |
| 5 | 46 | 56 | 68 | 92 | 153 | 270 |
| 10 | 48 | 59 | 71 | 94 | 187 | 297 |
| 20 | 49 | 59 | 71 | 97 | 159 | 270 |
| 30 | 48 | 57 | 68 | 90 | 156 | 267 |
| 40 | 47 | 56 | 68 | 88 | 156 | 262 |
| 50 | 47 | 56 | 68 | 92 | 158 | 263 |
Join method: Index + sasfile + set key=

| # columns retrieved \ % of rows retrieved | 1% | 5% | 10% | 20% | 50% | 100% |
|---|---|---|---|---|---|---|
| 1 | 49 | 53 | 53 | 58 | 72 | 93 |
| 5 | 50 | 52 | 55 | 59 | 71 | 96 |
| 10 | 55 | 58 | 60 | 66 | 77 | 99 |
| 20 | 61 | 63 | 67 | 72 | 82 | 105 |
| 30 | 63 | 66 | 68 | 73 | 87 | 110 |
| 40 | 68 | 71 | 72 | 78 | 94 | 118 |
| 50 | - | - | - | - | - | - |
Summarising the results yields the table below, showing the lowest elapsed times.
| # columns retrieved \ % of rows retrieved | 1% | 5% | 10% | 20% | 50% | 100% |
|---|---|---|---|---|---|---|
| 1 | 47 | 53 | 53 | 58 | 65 | 73 |
| 5 | 46 | 52 | 55 | 59 | 66 | 77 |
| 10 | 48 | 58 | 60 | 66 | 69 | 80 |
| 20 | 49 | 59 | 64 | 69 | 71 | 82 |
| 30 | 48 | 57 | 63 | 66 | 72 | 82 |
| 40 | 47 | 56 | 64 | 67 | 75 | 89 |
| 50 | 47 | 56 | 68 | 92 | 105 | 110 |

The methods yielding these lowest times (shown as colour highlights in the original tables) are: Hash table, Index+set key=, Index+sasfile+key=, and Sort+merge (sorted output).
Based on the results listed above, one can see that some methods are clearly better than others depending on the task at hand.
Index look-ups are the fastest when few rows are needed (area highlighted in green).
When more rows are fetched, loading data in memory accelerates the random reads: Using SASFILE has an edge for fewer columns (area highlighted in blue) while a hash table provides better results when more columns are retrieved (area highlighted in purple).
Note that for the bottom row of the table (50 columns retrieved), the only available methods are indexing and sorting since the lookup table is too large to fit in memory. In this case, sorting becomes better than indexing when more than 20% of the rows are fetched, which is a common rule-of-thumb value.
Also note that when both input tables are already sorted, using MERGE BY is always the best solution by far: it takes only 32 seconds to merge all rows and all columns. That's why permanent tables should be sorted when possible (with the VALIDATED flag set!). Sorting is a one-off cost that yields ongoing savings. Furthermore, using MERGE BY creates a sorted output table. (SAS, when will we have a way to set the SORT VALIDATED flag? When?)
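For reference, the 32-second case above corresponds to nothing more than the plain MERGE BY step from the benchmark, run on inputs that are already sorted:

```sas
* SORTED1 and SORTED2 are already sorted by KEY,  ;
* so the merge incurs no sort cost at all.        ;
data OUT;
  merge SORTED1 SORTED2;
  by KEY;
run;
```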
Conclusion
I hope you found these results as interesting as I did.
There is no “good” or “bad” join method and as usual, one must master the different tools available in order to choose the right one for a given task.
Did you expect these results? Do you have comments about the results? Do you have other experiences or insights to share?
Feel free to comment.
Here is the code for this simple benchmark:
%macro g;
  %let cols=1 5 10 20 30 40 50;
  %let rows=0.01 0.05 0.10 0.20 0.50 1;
  %do colno=1 %to %sysfunc(countw(&cols));
    %let colnb=%scan(&cols,&colno);
    %do rowno=1 %to %sysfunc(countw(&rows,%str( )));
      %* Use a blank delimiter: the default %scan delimiters include the decimal point;
      %let rowpct=%scan(&rows,&rowno,%str( ));
      %put Trace: &=rowpct &=colnb *start;
      data TAB1 TAB2;
        retain VAR1-VAR50 8;
        do I=1 to 1e7;
          KEY=int(ranuni(0)*1e15);
          output;
        end;
      run;
      %put Trace: &=rowpct &=colnb *sort merge;
      proc sort data=TAB1 out=SORTED1 sortsize=1g details; by KEY; run;
      proc sort data=TAB2(keep=KEY VAR1-VAR&colnb.) out=SORTED2 sortsize=1g details; by KEY; run;
      data OUT;
        merge SORTED1 SORTED2; by KEY;
        %* We always fetch all rows with a merge;
      run;
      %put Trace: &=rowpct &=colnb *index;
      proc sql; create index KEY on TAB2; quit;
      data OUT;
        set TAB1;
        if ranuni(0) < &rowpct. then set TAB2(keep=KEY VAR1-VAR&colnb.) key=KEY;
      run;
      %put Trace: &=rowpct &=colnb *index sasfile;
      data REDUCED(index=(KEY)); set TAB2(keep=KEY VAR1-VAR&colnb.); run;
      sasfile REDUCED open;
      data OUT;
        set TAB1;
        if ranuni(0) < &rowpct. then set REDUCED(keep=KEY VAR1-VAR&colnb.) key=KEY;
      run;
      sasfile REDUCED close;
      %put Trace: &=rowpct &=colnb *hash;
      data OUT;
        set TAB1;
        if _N_=1 then do;
          dcl hash TAB2(dataset: "TAB2(keep=KEY VAR1-VAR&colnb)");
          TAB2.definekey('KEY');
          TAB2.definedata('KEY' %do i=1 %to &colnb.; ,"VAR&i" %end;);
          TAB2.definedone();
        end;
        if ranuni(0) < &rowpct. then RC=TAB2.find();
      run;
    %end;
  %end;
%mend;
%g;