<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: hash joins with multiple fact tables in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/hash-joins-with-multiple-fact-tables/m-p/369975#M88366</link>
    <description>&lt;P&gt;&lt;FONT color="#000000" face="Calibri" size="3"&gt;The datasets are not identical in metadata and have a few overlapping named variables. I worry about stacking the datasets because of this. I wouldn’t want to accidentally truncate values in instances where the lengths and things disagreed but the variable names were the same.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#000000" face="Calibri" size="3"&gt;Because of this I like your approach to rename the variables to something new then rename them back post merge, this could work but I would also have to drop and variables that were in one dataset but not in the appended dataset. This is shown in the second example of the code that I had originally attached. This would also be easier if I used your renaming method and just made each variable have a prefix of the variable name. That would simplify my keep= statements.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#000000" face="Calibri" size="3"&gt;I might attempt using multiple formats to see how that works out, I’m not a fan of needing to concatenate the multiple primary keys to make one for the index and then subsequently doing the same on the base dataset. That seems like a lot of reads to me.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#000000" face="Calibri" size="3"&gt;At this point I think I am going to have to just sort/sort merge my lookup tables and split up the larger files prior to merging them. That or revisit my settings on my UNIX server.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#000000" face="Calibri" size="3"&gt;Right now my config file has&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#000000" face="Calibri" size="3"&gt;Memsize 8g/Sortsize 1g/BUFNO =20/BUFSIZE=512k &lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#000000" face="Calibri" size="3"&gt;I will need to see if I can up some of these to get the tables to load into hash faster. It is a virtual machine so potentially I could have the server admin allocate more memory for this task, or move the job to a higher powered server.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#000000" face="Calibri" size="3"&gt;Thanks!&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Fri, 23 Jun 2017 16:02:10 GMT</pubDate>
    <dc:creator>RyanMcGovern</dc:creator>
    <dc:date>2017-06-23T16:02:10Z</dc:date>
    <item>
      <title>hash joins with multiple fact tables</title>
      <link>https://communities.sas.com/t5/SAS-Programming/hash-joins-with-multiple-fact-tables/m-p/369522#M88236</link>
      <description>&lt;P&gt;Hi everyone!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am looking to improve some processing and am looking to see if there is a way to either:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Keep the hash tables in memory after the data step is closed (so that the hash tables do not have to be reloaded into memory for the next iteration)&lt;/LI&gt;&lt;LI&gt;Match multiple tables back to the same lookup tables in one data step while avoiding butchering the variables in the output datasets.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Here is my situation. I am currently matching dozens of large files (100m+ rows &amp;gt;800 columns) against the same 2 lookup tables. Lookup table 1 has 20m+rows 5 columns and lookup table 2 has 2m+rows 4 columns.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Option 1 I think is promising… if there is a way to restrict sas from purging the hash objects it should save time loading them when I go to lookup values for the next dataset.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I made an example of both options, if you run it you will see what happens to the variables for option 2. Option one I don’t know how to restrict SAS from purging the hash objects.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I appreciate your comments!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;-Ryan&lt;/P&gt;</description>
      <pubDate>Thu, 22 Jun 2017 14:41:21 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/hash-joins-with-multiple-fact-tables/m-p/369522#M88236</guid>
      <dc:creator>RyanMcGovern</dc:creator>
      <dc:date>2017-06-22T14:41:21Z</dc:date>
    </item>
    <item>
      <title>Re: hash joins with multiple fact tables</title>
      <link>https://communities.sas.com/t5/SAS-Programming/hash-joins-with-multiple-fact-tables/m-p/369586#M88264</link>
      <description>&lt;P&gt;Sorry, maybe I am missing something here, but in the example you give, why not merge on the two additional variables or do a dataset update?&lt;/P&gt;
&lt;PRE&gt;proc sort data=sashelp.class out=class (keep=name age height);
  by name;
run;
proc sort data=sashelp.classfit out=classfit (keep=name lowermean uppermean);
  by name;
run;
data want; 
  merge class (in=a) classfit (in=b);
  by name;
  if a;
run;

/* Or update - not sure what would be quiicker and not tested this code */
data want;
  update class classfit;
  by name;
run;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 22 Jun 2017 16:42:35 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/hash-joins-with-multiple-fact-tables/m-p/369586#M88264</guid>
      <dc:creator>RW9</dc:creator>
      <dc:date>2017-06-22T16:42:35Z</dc:date>
    </item>
    <item>
      <title>Re: hash joins with multiple fact tables</title>
      <link>https://communities.sas.com/t5/SAS-Programming/hash-joins-with-multiple-fact-tables/m-p/369599#M88267</link>
      <description>&lt;P&gt;&lt;FONT color="#000000" face="Calibri" size="3"&gt;Thanks for the response, in real life the lookup tables are independent, they each share a common key with the fact table but do not share one with each other. Because of this I cannot combine both crosswalks and then read this all into one hash object.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#000000" face="Calibri" size="3"&gt;Thanks RW9!&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#000000" face="Calibri" size="3"&gt;here is what it looks like &lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#000000" face="Calibri" size="3"&gt;&lt;IMG title="HASH_Example_1.JPG" alt="HASH_Example_1.JPG" src="https://communities.sas.com/t5/image/serverpage/image-id/9658iF56F4D9181A5A96D/image-size/original?v=1.0&amp;amp;px=-1" border="0" /&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;again i have dozens of different fact tables that i need to hit against the same lookup tables.&lt;/P&gt;</description>
      <pubDate>Thu, 22 Jun 2017 17:21:48 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/hash-joins-with-multiple-fact-tables/m-p/369599#M88267</guid>
      <dc:creator>RyanMcGovern</dc:creator>
      <dc:date>2017-06-22T17:21:48Z</dc:date>
    </item>
    <item>
      <title>Re: hash joins with multiple fact tables</title>
      <link>https://communities.sas.com/t5/SAS-Programming/hash-joins-with-multiple-fact-tables/m-p/369889#M88355</link>
      <description>&lt;P&gt;The answer to this depends on the exact details.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In particular, there are probably two or three reasonable options here - depending on your system, the datasets, etc.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If the many datasets&amp;nbsp;with 800+ variables are&amp;nbsp;either:&lt;/P&gt;
&lt;P&gt;a) Identical in metadata [variable names, types, lengths]&lt;/P&gt;
&lt;P&gt;or&lt;/P&gt;
&lt;P&gt;b) Have a combination of identical variable names/types/lengths and non-overlapping variable names&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Then there is a simple solution, where you have one data step&amp;nbsp;and run each dataset through a loop.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;---&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If your datasets are nonidentical and have overlapping variable names that are not identically typed, then you can't easily do that. &amp;nbsp;You might be able to if you use some interesting renaming; I've done that, where I renamed every variable (through a macro) to DS_&amp;lt;var&amp;gt;, where DS was a prefix for that dataset or even the name of the dataset if sufficiently short. &amp;nbsp;It's just ... messy, and error-prone.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You can use DOSUBL to generate data steps inside your main data step to do the matching work, and allow persistent hash tables, but that's quite&amp;nbsp;slow and&amp;nbsp;probably not faster than just reloading the data sets into hash tables.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;---&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Third, you could consider formats instead of hash tables. &amp;nbsp;Formats are persistent and quite fast to lookup, and if you're only writing a couple of them per dataset, not slow to load either. &amp;nbsp;They do sometimes have performance issues with high numbers of formats, but in a server environment you might be okay there. &amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;---&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Fourth, another probably-slow-but-maybe-worth-checking option is PROC DS2. &amp;nbsp;That would let you have persistent hash tables, I believe,&amp;nbsp;and process each of your datasets. &amp;nbsp;DS2 tends to be&amp;nbsp;slow, though, on simple data reads.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you can give us a bit more information as to your problems' full scope, we can probably help you find the specific solution that's most appropriate.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 23 Jun 2017 14:10:15 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/hash-joins-with-multiple-fact-tables/m-p/369889#M88355</guid>
      <dc:creator>snoopy369</dc:creator>
      <dc:date>2017-06-23T14:10:15Z</dc:date>
    </item>
    <item>
      <title>Re: hash joins with multiple fact tables</title>
      <link>https://communities.sas.com/t5/SAS-Programming/hash-joins-with-multiple-fact-tables/m-p/369975#M88366</link>
      <description>&lt;P&gt;&lt;FONT color="#000000" face="Calibri" size="3"&gt;The datasets are not identical in metadata and have a few overlapping named variables. I worry about stacking the datasets because of this. I wouldn’t want to accidentally truncate values in instances where the lengths and things disagreed but the variable names were the same.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#000000" face="Calibri" size="3"&gt;Because of this I like your approach to rename the variables to something new then rename them back post merge, this could work but I would also have to drop and variables that were in one dataset but not in the appended dataset. This is shown in the second example of the code that I had originally attached. This would also be easier if I used your renaming method and just made each variable have a prefix of the variable name. That would simplify my keep= statements.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#000000" face="Calibri" size="3"&gt;I might attempt using multiple formats to see how that works out, I’m not a fan of needing to concatenate the multiple primary keys to make one for the index and then subsequently doing the same on the base dataset. That seems like a lot of reads to me.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#000000" face="Calibri" size="3"&gt;At this point I think I am going to have to just sort/sort merge my lookup tables and split up the larger files prior to merging them. That or revisit my settings on my UNIX server.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#000000" face="Calibri" size="3"&gt;Right now my config file has&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#000000" face="Calibri" size="3"&gt;Memsize 8g/Sortsize 1g/BUFNO =20/BUFSIZE=512k &lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#000000" face="Calibri" size="3"&gt;I will need to see if I can up some of these to get the tables to load into hash faster. It is a virtual machine so potentially I could have the server admin allocate more memory for this task, or move the job to a higher powered server.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#000000" face="Calibri" size="3"&gt;Thanks!&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 23 Jun 2017 16:02:10 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/hash-joins-with-multiple-fact-tables/m-p/369975#M88366</guid>
      <dc:creator>RyanMcGovern</dc:creator>
      <dc:date>2017-06-23T16:02:10Z</dc:date>
    </item>
    <item>
      <title>Re: hash joins with multiple fact tables</title>
      <link>https://communities.sas.com/t5/SAS-Programming/hash-joins-with-multiple-fact-tables/m-p/370014#M88374</link>
      <description>Try overloading the formats to make a single lookup return the multple columns by parsing the returned value on a suitable delimiter.&lt;BR /&gt;Then you need just one format lookup for each dimension. &lt;BR /&gt;However the problem of loading the format into memory is still needed in each step (I expect).</description>
      <pubDate>Fri, 23 Jun 2017 16:59:24 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/hash-joins-with-multiple-fact-tables/m-p/370014#M88374</guid>
      <dc:creator>Peter_C</dc:creator>
      <dc:date>2017-06-23T16:59:24Z</dc:date>
    </item>
  </channel>
</rss>

