BookmarkSubscribeRSS Feed
milts
Pyrite | Level 9
Hi Experts,

I have two tables:

table a (20 columns with around 1.4 mil records)

table b (100 columns with around 2 mil records)

i'll be creating a table from these two tables using an inner join so i can get only records whose keys exists in both tables. The thing is that i won't be getting any columns from table b.

does it make a difference if i create a temporary table containing only the key column that i'll be using for the join since i won't be needing the other columns from table b and then join that temporary table with table a instead of table b which has a lot of columns and rows?

would it be more efficient to do that or am i just creating an extra step?

Many thanks!
Milton
4 REPLIES 4
Patrick
Opal | Level 21
Hi Milton

I believe creating a temporary table would add an additional pass through the data and additional I\O. Something you want to avoid.

If the tables are SAS files:
I'm asking myself whethere a hash table might perform better. Load the keys in table B into a hash table and use the hash.rc() method to lookup the keys and select the rows in table A.

If your keys are indexed:
Make sure that you write your code in a way that the indexes get used.

If SQL syntax gets a bit more complicated I'm often using a EXPLAIN to see what execution path is used - and re-adjusting the syntax makes sometimes quite a difference.

HTH
Patrick Message was edited by: Patrick
DanielSantos
Barite | Level 11
For this particular problem, you should without any doubt hash table B into memory and then perform a single pass on table A, outputting observations that will match the key with the hash.

To load the table into the hash, create a view over it, that will drop everything but the key.

Now, hashing a table, does have some considerations, such as:
Table size to hash - for a regular table, 2 to 3 Million observations.
Duplicate keys - only allowed in 9.2.

Cheers from Portugal.

Daniel Santos @ www.cgd.pt
LinusH
Tourmaline | Level 20
Agree that hash could be the fastest solution.
But it's not SQL, and it's quite more complicated to code if you do not already know the technique.
Also, if the data does not reside in SAS, there is the aspect of data transportation.

If this is SAS data, you could probably not expect to get SAS using an indexed join strategy, it would probably use a sort-merge join strategy.
Which is probably the best, unless there are very few common key values between the two tables.

I don't know about external databases and indexed joins, but in SPD Server there is a special join index that can be used that maybe would work well in this kind of scenario.

/Linus
Data never sleeps
deleted_user
Not applicable
The simplest and quickest thing to do , if you want all columns from "a", then the SQL for that would be:

[pre]
select a.* from a, b where a.key = b.key;
[/pre]

or

[pre]
select a.* from a inner join b on a.key=b.key;
[/pre]

the next level of improvement would be to use pass-through and let the database server do the join directly on its box, utilizing the DBMS's flavor of SQL to do the join:

[pre]
Proc SQL threads feedback;
connect to ____ as db (%remote_database);
create table c from connection to db
( select a* from a,b where a.key = b.key );
disconnect from db
quit;
[/pre]

If "a" and "b" are permanent SAS datasets that are used a lot, and generally selected and joined on the key values, then create an index on the key values. Then, SAS will use the index to speed up the join.

If "a" and "b" are in a database, the tables should already have an indexed primary key defined, so the database should be using that index for the joins. If the "key" values are not the primary key, or not part of the primary key, then you should have an index created on those key columns for the tables.

Hope this helps. Message was edited by: Chuck

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 829 views
  • 0 likes
  • 5 in conversation