<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: joining data from large data sets in SAS Procedures</title>
    <link>https://communities.sas.com/t5/SAS-Procedures/joining-data-from-large-data-sets/m-p/114773#M31749</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;It really depends on what you are doing.&amp;nbsp; If you were able to do it with a FORMAT, then it sounds like one of the tables is used to look up a decoded value for a variable available in the other.&amp;nbsp; In that case you can maintain the lookup table with an INDEX and then use the SET statement with the KEY= option to look up the decode variable (or variables).&lt;/P&gt;&lt;P&gt;It might be possible that PROC SQL could optimize this for you without you having to do anything special in the code.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Whether to sort the other table depends on how it will be used.&amp;nbsp; But normally its sort variables are different from the variables needed to look up values in the other table.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;For example, you could have pharmacy claims sorted by patient id and date and want to look up the drug name from the drugcode included in the claim record.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;proc sql ;&lt;/P&gt;&lt;P&gt; create table new as select a.*,b.drugname &lt;/P&gt;&lt;P&gt;&amp;nbsp; from claims a left join drugs b&lt;/P&gt;&lt;P&gt;&amp;nbsp; on a.drugcode = b.drugcode&lt;/P&gt;&lt;P&gt;&amp;nbsp; order by a.patient,a.claimdt&lt;/P&gt;&lt;P&gt;;&lt;/P&gt;&lt;P&gt;quit;&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Sun, 22 Jul 2012 19:29:09 GMT</pubDate>
    <dc:creator>Tom</dc:creator>
    <dc:date>2012-07-22T19:29:09Z</dc:date>
    <item>
      <title>joining data from large data sets</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/joining-data-from-large-data-sets/m-p/114771#M31747</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi community,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We would be interested in knowing the best way of joining data from large data sets (&amp;gt;10 million records). &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Options: &lt;/P&gt;&lt;P&gt;a) sort and merge &lt;/P&gt;&lt;P&gt;b) create a format and then apply it within a data step &lt;/P&gt;&lt;P&gt;c) hash table join &lt;/P&gt;&lt;P&gt;d) SQL join&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;Which is the quickest, and which is the least memory intensive? We frequently used a format (option b), but in some programs it crashed because we did not have enough memory. &lt;BR /&gt;We subsequently used hash joins instead. &lt;/P&gt;&lt;P&gt;Do you have a view on how large a format can be (in terms of number of records) before it's better to try another method?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Sat, 21 Jul 2012 12:27:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/joining-data-from-large-data-sets/m-p/114771#M31747</guid>
      <dc:creator>emsmpa</dc:creator>
      <dc:date>2012-07-21T12:27:53Z</dc:date>
    </item>
    <item>
      <title>Re: joining data from large data sets</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/joining-data-from-large-data-sets/m-p/114772#M31748</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;1. You are correct about the FORMAT approach. Because it's in-memory, it will be very fast, but it uses a lot of memory and you may run out. So, it's the fastest but the MOST memory intensive, sigh.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;2. I have limited experience with hash table joins, so I won't comment.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;3. If your data is in SAS datasets, I believe you'll see similar performance from a sort and merge and from a SQL join, as behind the covers SQL will need to sort both datasets, and that's the expensive part.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;4. If your data is in a database, depending on circumstances you might get the best results from pushing a JOIN to the database engine. It's worth trying: see if it's better, worse, or whether your DBA comes after you with a gun.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;5. If you can sort and keep both datasets in the sequence of your join key, that will be very fast with either a join or a sort and merge (sort is usually optimized to be very fast if the data is almost in the correct sequence).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Tom&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Sat, 21 Jul 2012 14:38:13 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/joining-data-from-large-data-sets/m-p/114772#M31748</guid>
      <dc:creator>TomKari</dc:creator>
      <dc:date>2012-07-21T14:38:13Z</dc:date>
    </item>
    <item>
      <title>Re: joining data from large data sets</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/joining-data-from-large-data-sets/m-p/114773#M31749</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;It really depends on what you are doing.&amp;nbsp; If you were able to do it with a FORMAT, then it sounds like one of the tables is used to look up a decoded value for a variable available in the other.&amp;nbsp; In that case you can maintain the lookup table with an INDEX and then use the SET statement with the KEY= option to look up the decode variable (or variables).&lt;/P&gt;&lt;P&gt;It might be possible that PROC SQL could optimize this for you without you having to do anything special in the code.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Whether to sort the other table depends on how it will be used.&amp;nbsp; But normally its sort variables are different from the variables needed to look up values in the other table.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;For example, you could have pharmacy claims sorted by patient id and date and want to look up the drug name from the drugcode included in the claim record.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;proc sql ;&lt;/P&gt;&lt;P&gt; create table new as select a.*,b.drugname &lt;/P&gt;&lt;P&gt;&amp;nbsp; from claims a left join drugs b&lt;/P&gt;&lt;P&gt;&amp;nbsp; on a.drugcode = b.drugcode&lt;/P&gt;&lt;P&gt;&amp;nbsp; order by a.patient,a.claimdt&lt;/P&gt;&lt;P&gt;;&lt;/P&gt;&lt;P&gt;quit;&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Sun, 22 Jul 2012 19:29:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/joining-data-from-large-data-sets/m-p/114773#M31749</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2012-07-22T19:29:09Z</dc:date>
    </item>
    <item>
      <title>Re: joining data from large data sets</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/joining-data-from-large-data-sets/m-p/114774#M31750</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Thanks to all of you :)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Helped us a lot.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Mon, 30 Jul 2012 09:50:20 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/joining-data-from-large-data-sets/m-p/114774#M31750</guid>
      <dc:creator>emsmpa</dc:creator>
      <dc:date>2012-07-30T09:50:20Z</dc:date>
    </item>
  </channel>
</rss>

