emsmpa
Calcite | Level 5

Hi community,

We would like to know the best way to join data from large data sets (more than 10 million records).

Options:

a) sort and merge

b) create a format and then apply it within a data step

c) hash table join

d) sql join


Which is the quickest, and which is the least memory intensive? We frequently used a format (option b), but some of our programs crashed because we did not have enough memory.
We subsequently used hash joins instead.
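
For readers comparing options b) and c), here is a minimal sketch of both lookups side by side. Everything named here is an assumption for illustration: a large table `claims`, a small lookup table `drugs` with a unique character key `drugcode` and a decode `drugname`, and an arbitrary length of 40 for the decoded value. Both approaches hold the lookup table in memory, so neither avoids the memory cost entirely; the hash object only needs the key and data columns you load into it.

```sas
/* Option b: build a format from the lookup table, then apply it.   */
/* Assumes DRUGCODE is character and unique in DRUGS (hypothetical). */
data drugfmt;
  set drugs(keep=drugcode drugname
            rename=(drugcode=start drugname=label)) end=last;
  retain fmtname 'drugfmt' type 'C';
  output;
  if last then do;            /* extra row: unmatched keys map to blank */
    hlo = 'O';
    label = ' ';
    output;
  end;
run;

proc format cntlin=drugfmt;
run;

data new_fmt;
  set claims;
  length drugname $40;        /* assumed length for the decoded value */
  drugname = put(drugcode, $drugfmt.);
run;

/* Option c: load the same lookup table into a hash object. */
data new_hash;
  if 0 then set drugs(keep=drugcode drugname);  /* define variables only */
  if _n_ = 1 then do;
    declare hash h(dataset: 'drugs(keep=drugcode drugname)');
    h.defineKey('drugcode');
    h.defineData('drugname');
    h.defineDone();
  end;
  set claims;
  /* DRUGNAME is retained, so clear it when the key is not found */
  if h.find() ne 0 then call missing(drugname);
run;
```

Both steps keep every row of `claims` and leave `drugname` blank for unmatched codes, so they behave like a left join.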

Do you have a view on how large a format can be (in terms of number of records) before it's better to try another method?

Thanks!

3 REPLIES
TomKari
Onyx | Level 15

1. You are correct about the FORMAT approach. Because it's in-memory, it will be very fast, but uses up a lot of memory and you may run out. So, it's the fastest but the MOST memory intensive, sigh.

2. I have limited experience with hash table joins, so won't comment.

3. If your data is in SAS datasets, I believe you'll see similar performance from a sort and merge and from a SQL join, as behind the covers SQL will often need to sort both datasets, and that's the expensive part.

4. If your data is in a database, depending on circumstances you might get the best results from pushing a JOIN to the database engine. It's worth trying, see if it's better, worse, or your DBA comes after you with a gun.

5. If you can sort and keep both datasets in the sequence of your join key, that will be very fast with either a join or a sort and merge (sort is usually optimized to be very fast if the data is almost in the correct sequence).
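
A minimal sketch of the sort and merge approach from points 3 and 5, assuming a hypothetical large table `claims` and a lookup table `drugs` joined on a key named `drugcode` (none of these names come from the question):

```sas
/* Sort both inputs by the join key. If a data set is already flagged */
/* as sorted by this key, PROC SORT can skip the work.                */
proc sort data=claims out=claims_s;
  by drugcode;
run;

/* NODUPKEY keeps one row per key so the merge is many-to-one */
proc sort data=drugs(keep=drugcode drugname) out=drugs_s nodupkey;
  by drugcode;
run;

data new_merge;
  merge claims_s(in=inclaim) drugs_s;
  by drugcode;
  if inclaim;    /* keep every claim, matched or not (left join) */
run;
```

The result is ordered by `drugcode`, so if the large table is normally kept in another order it would need a further sort afterwards.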

Tom

Tom
Super User

It really depends on what you are doing. If you were able to do it with a FORMAT, then it sounds like one of the tables is used to look up a decoded value for a variable available in the other. In that case you can maintain the lookup table with an INDEX and then use the SET statement with the KEY= option to look up the decode variable (or variables).
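
A minimal sketch of the indexed lookup, borrowing the `claims` and `drugs` names from the pharmacy example in this reply. It assumes `drugcode` is unique in `drugs` and that the index does not already exist:

```sas
/* One-time step: create a unique index on the lookup key */
proc datasets library=work nolist;
  modify drugs;
  index create drugcode / unique;
quit;

data new_key;
  set claims;                                  /* stays in its own order */
  set drugs(keep=drugcode drugname) key=drugcode / unique;
  if _iorc_ ne 0 then do;                      /* no matching key found */
    _error_ = 0;                               /* clear the error flag  */
    call missing(drugname);                    /* drop the retained value */
  end;
run;
```

This reads the lookup table through its index rather than loading it into memory, so it trades speed for a smaller memory footprint, and `claims` does not need to be sorted by `drugcode`.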

It might be possible that PROC SQL could optimize this for you without you having to do anything special in the code.

Whether to sort the main table depends on how it will be used, but normally its sort variables are different from the variables needed for the lookup in the other table.

For example, you could have pharmacy claims sorted by patient id and date and want to look up the drug name from the drugcode included in the claim record.

proc sql;
  /* keep every claim; drugname is missing when the code is not in DRUGS */
  create table new as
    select a.*, b.drugname
    from claims a left join drugs b
      on a.drugcode = b.drugcode
    order by a.patient, a.claimdt
  ;
quit;

emsmpa
Calcite | Level 5

Thanks to all of you!

Helped us a lot.
