i'll be creating a table from these two tables using an inner join so i can get only records whose keys exists in both tables. The thing is that i won't be getting any columns from table b.
does it make a difference if i create a temporary table containing only the key column that i'll be using for the join since i won't be needing the other columns from table b and then join that temporary table with table a instead of table b which has a lot of columns and rows?
would it be more efficient to do that or am i just creating an extra step?
I believe creating a temporary table would add an additional pass through the data and additional I\O. Something you want to avoid.
If the tables are SAS files:
I'm asking myself whethere a hash table might perform better. Load the keys in table B into a hash table and use the hash.rc() method to lookup the keys and select the rows in table A.
If your keys are indexed:
Make sure that you write your code in a way that the indexes get used.
If SQL syntax gets a bit more complicated I'm often using a EXPLAIN to see what execution path is used - and re-adjusting the syntax makes sometimes quite a difference.
Agree that hash could be the fastest solution.
But it's not SQL, and it's quite more complicated to code if you do not already know the technique.
Also, if the data does not reside in SAS, there is the aspect of data transportation.
If this is SAS data, you could probably not expect to get SAS using an indexed join strategy, it would probably use a sort-merge join strategy.
Which is probably the best, unless there are very few common key values between the two tables.
I don't know about external databases and indexed joins, but in SPD Server there is a special join index that can be used that maybe would work well in this kind of scenario.
The simplest and quickest thing to do , if you want all columns from "a", then the SQL for that would be:
select a.* from a, b where a.key = b.key;
select a.* from a inner join b on a.key=b.key;
the next level of improvement would be to use pass-through and let the database server do the join directly on its box, utilizing the DBMS's flavor of SQL to do the join:
Proc SQL threads feedback;
connect to ____ as db (%remote_database);
create table c from connection to db
( select a* from a,b where a.key = b.key );
disconnect from db
If "a" and "b" are permanent SAS datasets that are used a lot, and generally selected and joined on the key values, then create an index on the key values. Then, SAS will use the index to speed up the join.
If "a" and "b" are in a database, the tables should already have an indexed primary key defined, so the database should be using that index for the joins. If the "key" values are not the primary key, or not part of the primary key, then you should have an index created on those key columns for the tables.