DATA Step, Macro, Functions and more

Join Efficiency

Super Contributor
Posts: 316


Hi Experts,

I have two tables:

table a (20 columns, around 1.4 million records)

table b (100 columns, around 2 million records)

I'll be creating a table from these two using an inner join, so I get only the records whose keys exist in both tables. The thing is that I won't be taking any columns from table b.

Does it make a difference if I first create a temporary table containing only the key column from table b (since I won't need its other columns) and then join that temporary table with table a, instead of joining directly against table b with all its columns and rows?

Would that be more efficient, or am I just creating an extra step?

Many thanks!
Milton
Respected Advisor
Posts: 3,887

Re: Join Efficiency

Hi Milton

I believe creating a temporary table would add an additional pass through the data and additional I/O - something you want to avoid.

If the tables are SAS files:
I'm asking myself whether a hash table might perform better: load the keys of table B into a hash object and use its check() method to look up the keys and select the matching rows in table A.
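A minimal sketch of that approach (assuming the join key is a single column named key, present under that name in both tables - adjust to your real key):

[pre]
data want;
   /* load only the key column of table B into an in-memory hash */
   if _n_ = 1 then do;
      declare hash hb (dataset: "b (keep=key)");
      hb.defineKey("key");
      hb.defineDone();
   end;
   set a;
   /* check() returns 0 when the current key value exists in the hash */
   if hb.check() = 0;
run;
[/pre]

This reads table B once to build the hash and table A once to filter it, with no sorting required.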

If your keys are indexed:
Make sure that you write your code in a way that the indexes get used.

If the SQL syntax gets a bit more complicated, I often use EXPLAIN to see which execution path is used - readjusting the syntax sometimes makes quite a difference.
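For SAS PROC SQL itself (which has no EXPLAIN statement), the _method option serves the same purpose - it prints the join algorithm the optimizer picked, e.g.:

[pre]
proc sql _method;
   select a.*
   from a inner join b
   on a.key = b.key;
quit;
[/pre]

Look in the log for codes such as sqxjm (sort-merge join), sqxjndx (index join) or sqxjhsh (hash join).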

HTH
Patrick
Super Contributor
Posts: 474

Re: Join Efficiency

For this particular problem, you should without any doubt hash table B into memory and then make a single pass over table A, outputting the observations whose keys match entries in the hash.

To load the table into the hash, create a view over it that drops everything but the key.

Now, hashing a table does have some considerations, such as:
Table size to hash - for a regular table, roughly 2 to 3 million observations.
Duplicate keys - only supported as of SAS 9.2 (the multidata option).
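A sketch of such a key-only view (again assuming the key column is named key):

[pre]
/* view that exposes only the key column of table B */
data bkeys / view=bkeys;
   set b (keep=key);
run;
[/pre]

Loading the hash from bkeys instead of b keeps the memory footprint down to the key column alone.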

Cheers from Portugal.

Daniel Santos @ www.cgd.pt
Super User
Posts: 5,255

Re: Join Efficiency

Agree that hash could be the fastest solution.
But it's not SQL, and it's considerably more complicated to code if you don't already know the technique.
Also, if the data does not reside in SAS, there is the aspect of data transportation.

If this is SAS data, you probably can't expect SAS to use an indexed join strategy; it will most likely use a sort-merge join.
That is probably the best choice anyway, unless there are very few common key values between the two tables.

I don't know about external databases and indexed joins, but SPD Server has a special join index that might work well in this kind of scenario.

/Linus
Data never sleeps
N/A
Posts: 0

Re: Join Efficiency

The simplest and quickest thing to do, if you want all columns from "a", is this SQL:

[pre]
select a.* from a, b where a.key = b.key;
[/pre]

or

[pre]
select a.* from a inner join b on a.key=b.key;
[/pre]

The next level of improvement would be to use pass-through SQL and let the database server do the join directly on its own box, using the DBMS's flavor of SQL:

[pre]
proc sql threads feedback;
  connect to ____ as db (%remote_database);
  create table c as
    select * from connection to db
    ( select a.* from a, b where a.key = b.key );
  disconnect from db;
quit;
[/pre]

If "a" and "b" are permanent SAS data sets that are used a lot and are generally selected and joined on the key values, then create an index on the key columns; SAS can then use the index to speed up the join.
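For example (assuming a single key column named key on both data sets):

[pre]
proc sql;
   create index key on a (key);
   create index key on b (key);
quit;
[/pre]

Note that PROC SQL will only choose an index join when the optimizer estimates it to be cheaper than a sort-merge, so an index is no guarantee - check what was actually used with the _method option.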

If "a" and "b" are in a database, the tables should already have an indexed primary key defined, so the database should be using that index for the joins. If the "key" values are not the primary key, or not part of the primary key, then you should have an index created on those key columns for the tables.

Hope this helps.

Chuck
Discussion stats
  • 4 replies
  • 147 views
  • 0 likes
  • 5 in conversation