<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964192#M375532</link>
    <description>Maybe a bitmap can help; see Paul M. Dorfman’s classic article and give it a try: &lt;A href="https://support.sas.com/resources/papers/proceedings/proceedings/sugi26/p008-26.pdf" target="_blank"&gt;https://support.sas.com/resources/papers/proceedings/proceedings/sugi26/p008-26.pdf&lt;/A&gt;</description>
    <pubDate>Mon, 14 Apr 2025 02:39:39 GMT</pubDate>
    <dc:creator>whymath</dc:creator>
    <dc:date>2025-04-14T02:39:39Z</dc:date>
    <item>
      <title>Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964079#M375495</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Problem:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I have two datasets: the first is detail records from a very large dataset (1.2 TB) and the second is row IDs from an only slightly smaller "header" dataset (110 GB). The relation between line and header is many-to-one. I am trying to select the obs in the line dataset that have a match in the header. The header dataset only contains the key variable.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;What I've done so far:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The smaller "header" dataset is too large to fit in a hash object even if I increased the memsize to 115 GB – almost all of the available memory on the box!&lt;/LI&gt;&lt;LI&gt;I sorted and indexed the smaller header dataset by the key variable.&lt;/LI&gt;&lt;LI&gt;I selected 1/20th of the large dataset using the firstobs and obs dataset options.&lt;/LI&gt;&lt;LI&gt;I used PROC SQL because I was advised that it is multi-threaded.&lt;/LI&gt;&lt;LI&gt;I read the post &lt;A href="https://communities.sas.com/t5/SAS-Programming/Efficient-way-of-merging-very-large-datasets/m-p/40080" target="_self"&gt;Efficient Way of Merging Very Large Datasets&lt;/A&gt;.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Result:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I started the script 8 days ago and my best guess, from looking at the size of the output .lck file in Windows File Explorer, is that it is only about one tenth of the way through.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;The help I need:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;What would I need to do to access this dataset in a reasonable amount of time -- a couple of days? Should I try to break the line input dataset into chunks, sort and interleave by clm_id, and then try a DATA step merge? If I were to request more memory and processors for this virtual machine, how much would I need?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;SAS Versions:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The large dataset was created under SAS ver 9.0401M7 but the small dataset was created under 9.0401M5. They are being accessed under 9.0401M5.&lt;/LI&gt;&lt;LI&gt;Large Line Dataset: taf_other_services_line (16)&lt;/LI&gt;&lt;LI&gt;Size on disk: 1.22 TB&lt;/LI&gt;&lt;LI&gt;Obs: 5,398,943,292&lt;/LI&gt;&lt;LI&gt;Vars: 59&lt;/LI&gt;&lt;LI&gt;Observation Length: 525&lt;/LI&gt;&lt;LI&gt;Page Size: 65,536 / Pages: 19,749,411&lt;/LI&gt;&lt;LI&gt;Indexes: 0 / Sorted: NO / Point to Observations: YES&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Smaller Header Dataset:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Dataset size on disk: 110 GB&lt;/LI&gt;&lt;LI&gt;Index size on disk: 126 GB&lt;/LI&gt;&lt;LI&gt;Obs: 1,849,842,886&lt;/LI&gt;&lt;LI&gt;Vars: 1&lt;/LI&gt;&lt;LI&gt;Observation Length: 64&lt;/LI&gt;&lt;LI&gt;Page Size: 65,536 / Pages: 1,811,797&lt;/LI&gt;&lt;LI&gt;Indexes: 1 / Sorted: YES&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Query:&lt;/STRONG&gt;&lt;/P&gt;&lt;LI-CODE lang="sas"&gt;proc sql stimer ;
    create table saslibrary.outputdataset as
    select t.bene_id, t.clm_id, &amp;lt;26 other variables&amp;gt;
    from
        saslibrary.lineinputdataset (firstobs=4859048953 obs=5128996116) as t
        inner join saslibrary.headerinputdataset as c on (t.clm_id = c.clm_id)
    ;
quit;&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;OS:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;MS Windows Server 2016 Standard V 10.0.14393 Build 14393&lt;/LI&gt;&lt;LI&gt;Hardware according to Windows Task Manager:&lt;/LI&gt;&lt;LI&gt;Memory Installed: 128 GB&lt;/LI&gt;&lt;LI&gt;Virtual Memory: 46 GB&lt;/LI&gt;&lt;LI&gt;Page File Space: 18.0 GB&lt;/LI&gt;&lt;LI&gt;Maximum Speed: 2.90 GHz&lt;/LI&gt;&lt;LI&gt;Sockets: 6&lt;/LI&gt;&lt;LI&gt;Virtual processors: 12&lt;/LI&gt;&lt;LI&gt;L1 cache: n/a&lt;/LI&gt;&lt;LI&gt;Processor: Intel Xeon Gold 6542Y&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;For those of you familiar with Medicaid data, this is the TAF data from CMS/MACBIS.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you for reading.&lt;/P&gt;</description>
      <pubDate>Thu, 10 Apr 2025 22:15:23 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964079#M375495</guid>
      <dc:creator>kenkaran</dc:creator>
      <dc:date>2025-04-10T22:15:23Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964135#M375504</link>
      <description>&lt;OL&gt;
&lt;LI&gt;PROC SQL is &lt;EM&gt;not&lt;/EM&gt; multi-threaded, so there is no threading advantage over the DATA step.&lt;/LI&gt;
&lt;LI&gt;SQL and the DATA step produce the same result set in a many-to-one situation.&lt;/LI&gt;
&lt;LI&gt;Because of the fundamental differences between SQL and DATA step processing, the SQL join will be significantly more resource-intensive.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;I'd recommend using a DATA step MERGE. You've already sorted and indexed the smaller dataset to conform to the larger one's sort order, so you're ready to roll. I'd expect that to process much faster than the SQL join.&amp;nbsp;&lt;/P&gt;
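&lt;P&gt;A minimal sketch of such a MERGE, reusing the dataset names from the question (and assuming both inputs are in clm_id order), might look like this:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data saslibrary.outputdataset;
  /* both inputs must be in clm_id order for MERGE ... BY */
  merge saslibrary.lineinputdataset   (in=online)
        saslibrary.headerinputdataset (in=onhead);
  by clm_id;
  if online and onhead;  /* many-to-one: keep line rows that have a header match */
run;&lt;/CODE&gt;&lt;/PRE&gt;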
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 11 Apr 2025 19:11:26 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964135#M375504</guid>
      <dc:creator>SASJedi</dc:creator>
      <dc:date>2025-04-11T19:11:26Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964136#M375505</link>
      <description>&lt;P&gt;You can also try using FedSQL. FedSQL is multi-threaded where possible.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc fedsql;
    create table saslibrary.outputdataset as
        select t.bene_id, t.clm_id, &amp;lt;26 other variables&amp;gt;
        from saslibrary.lineinputdataset as t
        inner join
             saslibrary.headerinputdataset as c 
        on t.clm_id = c.clm_id
    ;
quit;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 11 Apr 2025 19:49:03 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964136#M375505</guid>
      <dc:creator>Stu_SAS</dc:creator>
      <dc:date>2025-04-11T19:49:03Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964137#M375506</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/61362"&gt;@Stu_SAS&lt;/a&gt;, thank you for the response. Three follow-up questions:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Can I use the firstobs and obs data set options as I did in regular PROC SQL?&lt;/LI&gt;&lt;LI&gt;Is there a way to write what observation I'm on and the clock time to the log every n observations?&lt;/LI&gt;&lt;LI&gt;Does FedSQL have a hint feature as in Oracle? (Not that Oracle ever "takes the hint.")&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Fri, 11 Apr 2025 19:55:58 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964137#M375506</guid>
      <dc:creator>kenkaran</dc:creator>
      <dc:date>2025-04-11T19:55:58Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964138#M375507</link>
      <description>&lt;OL&gt;
&lt;LI&gt;Unfortunately, you cannot use SAS input dataset options like in PROC SQL. FedSQL follows the ANSI SQL:1999 standard.&lt;/LI&gt;
&lt;LI&gt;FedSQL does support the stimer option, but you cannot have it print information as it runs, as you could in a DATA step.&lt;/LI&gt;
&lt;LI&gt;FedSQL does not have hints, but PROC SQL kind of does with the magic= option; however, FedSQL does have a &lt;A href="https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/fedsqlref/n1qnwnxtq52mkvn1ky86xv6fms9m.htm" target="_self"&gt;ton of table options you can apply&lt;/A&gt;. Here's &lt;A href="https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/proc/p0bfvd7cz4cgibn1fhdvo0055blr.htm" target="_self"&gt;how to apply them&lt;/A&gt;.&lt;/LI&gt;
&lt;/OL&gt;
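&lt;P&gt;For illustration only (check the linked doc for the exact options available): FedSQL table options are written in braces after the table name, so a compressed output table might look like this:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc fedsql;
    create table saslibrary.outputdataset {options compress=yes} as
        select t.bene_id, t.clm_id
        from saslibrary.lineinputdataset as t
        inner join saslibrary.headerinputdataset as c
        on t.clm_id = c.clm_id
    ;
quit;&lt;/CODE&gt;&lt;/PRE&gt;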
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 11 Apr 2025 20:25:57 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964138#M375507</guid>
      <dc:creator>Stu_SAS</dc:creator>
      <dc:date>2025-04-11T20:25:57Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964139#M375508</link>
      <description>&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/61362"&gt;@Stu_SAS&lt;/a&gt;, thank you for the FedSQL suggestion and the clarifications. I will try this.</description>
      <pubDate>Fri, 11 Apr 2025 20:31:04 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964139#M375508</guid>
      <dc:creator>kenkaran</dc:creator>
      <dc:date>2025-04-11T20:31:04Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964147#M375512</link>
      <description>&lt;P&gt;Is either of the datasets compressed? Compressing the 1.2 TB dataset would likely speed up the join as it will improve IO. What proportion of the rows are you selecting out of the large dataset? Do you always select ALL rows from the small dataset for sub-setting the large one? If so, an index may not help, so I suggest you try without an index to see if that improves performance.&lt;/P&gt;
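&lt;P&gt;For illustration, rebuilding a compressed copy is a one-step job (the _c suffix and the choice of COMPRESS=BINARY are just placeholders; test both settings on a slice first):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* BINARY compression often suits long observations that mix numeric
   and character variables; COMPRESS=YES (RLE) favors character data */
data saslibrary.lineinputdataset_c (compress=binary);
  set saslibrary.lineinputdataset;
run;&lt;/CODE&gt;&lt;/PRE&gt;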
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;What type of SAS library are these stored in? V9? An &lt;A href="https://communities.sas.com/t5/SAS-Communities-Library/Unlocking-the-Power-of-Open-File-Formats-Freedom-Flexibility-and/ta-p/962381/jump-to/first-unread-message" target="_blank" rel="noopener"&gt;SPDE library&lt;/A&gt; might improve performance (See table in the link).&lt;/P&gt;</description>
      <pubDate>Fri, 11 Apr 2025 23:20:24 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964147#M375512</guid>
      <dc:creator>SASKiwi</dc:creator>
      <dc:date>2025-04-11T23:20:24Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964159#M375519</link>
      <description>Since your smaller dataset has only ONE variable and your memory is so big, I would use a hash table to merge these two tables.</description>
      <pubDate>Sat, 12 Apr 2025 07:18:30 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964159#M375519</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2025-04-12T07:18:30Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964162#M375521</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/18408"&gt;@Ksharp&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;I would like to use Hash Table to merge these two tables.&lt;/BLOCKQUOTE&gt;
&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/473735"&gt;@kenkaran&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;UL&gt;
&lt;LI&gt;The smaller "header" dataset is too small to fit in a hash dataset even if I increased the memsize to 115 GB&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;(...)&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Smaller Header Dataset:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Dataset size on disk: &amp;nbsp;110 GB&lt;/LI&gt;
&lt;LI&gt;Index size on disk:&amp;nbsp; 126 GB&lt;/LI&gt;
&lt;LI&gt;Obs: 1,849,842,886&lt;/LI&gt;
&lt;LI&gt;Vars: &amp;nbsp;1&lt;/LI&gt;
&lt;LI&gt;Observation Length:&amp;nbsp; 64&lt;/LI&gt;
&lt;/UL&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;It should be possible to use a much smaller key item for the hash object, e.g. &lt;FONT face="courier new,courier"&gt;&lt;A href="https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lefunctionsref/n0mu81ozw0013yn1kzwve5ola26e.htm" target="_blank" rel="noopener"&gt;md5&lt;/A&gt;(clm_id)&lt;/FONT&gt;, which takes only 16 bytes, instead of the 64-byte &lt;FONT face="courier new,courier"&gt;clm_id&lt;/FONT&gt; itself.&amp;nbsp;Or maybe there are obvious redundancies in the structure of the &lt;FONT face="courier new,courier"&gt;clm_id&lt;/FONT&gt; values (such as long strings of zeros or blanks) which could be "compressed" without losing information. Then the 1.8E9 key values will have a chance to fit into memory.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
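&lt;P&gt;For comparison, the indexed alternative discussed in the next paragraph might look roughly like this (a sketch with the question's dataset names; _IORC_ indicates whether the keyed read found a match):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data match;
  set saslibrary.lineinputdataset;                       /* read each line row */
  set saslibrary.headerinputdataset key=clm_id / unique; /* indexed probe */
  if _iorc_ = 0 then output;  /* index hit: keep the row */
  else _error_ = 0;           /* clear the error flag on misses */
run;&lt;/CODE&gt;&lt;/PRE&gt;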
&lt;P&gt;I'm not sure, though, if the hash lookup &lt;EM&gt;plus&lt;/EM&gt; the operations needed to obtain the smaller keys on both sides of the merge perform better than a DATA step using a second SET statement with KEY=&lt;FONT face="courier new,courier"&gt;clm_id&lt;/FONT&gt; option, which benefits from the index created already. You may want to compare test runs using small subsets of both datasets so that the run times are only a few minutes.&lt;/P&gt;</description>
      <pubDate>Sat, 12 Apr 2025 08:33:02 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964162#M375521</guid>
      <dc:creator>FreelanceReinh</dc:creator>
      <dc:date>2025-04-12T08:33:02Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964169#M375527</link>
      <description>Out of curiosity, WHY do you need to actually join these?  I haven't worked with Medicaid but have worked a ton with similarly huge Medicare data. If you're actually performing an analysis, do you actually need all 60-ish variables to do this?  The complete date range?</description>
      <pubDate>Sat, 12 Apr 2025 16:31:51 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964169#M375527</guid>
      <dc:creator>quickbluefish</dc:creator>
      <dc:date>2025-04-12T16:31:51Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964171#M375529</link>
      <description>&lt;P&gt;I would not use firstobs/obs to divide the join into subgroup joins, because a given CLM_ID may be in more than one of those subgroup joins.&amp;nbsp; Instead, examine each CLM_ID once, by &lt;EM&gt;&lt;U&gt;&lt;STRONG&gt;choosing a restricted range of CLM_ID in both datasets&lt;/STRONG&gt;&lt;/U&gt;&lt;/EM&gt;, for each subgroup join.&amp;nbsp; This can work because CLM_ID is the join variable.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Let's say you divide your CLM_ID values into 5 ranges, each range with a lower limit (LLLLLLL) and upper limit (UUUUUUU), where LLLLLLL and UUUUUUU are quintile values.&amp;nbsp; Of course, the lowest range doesn't need a specified LLLLLLL, and the highest range doesn't need a specified UUUUUUU.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Then you could run five programs, such as the below - just put in values in place of LLLLLLL and UUUUUUU:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data want1;
  set bigdataset (keep= list of variables);
  where LLLLLLL &amp;lt;= clm_id &amp;lt; UUUUUUU;

  if _n_=1 then do;
    declare hash h (dataset:'header (where=(LLLLLLL &amp;lt;= clm_id &amp;lt; UUUUUUU))');
      h.definekey('clm_id');
      h.definedata('clm_id');
      h.definedone();
  end;

  if h.check()=0;
run;


&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Notes:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Limiting the range in the hash object this way allows you to avoid requiring more memory than available.&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;Using the "where" statement after the SET outsources the filtering of the big data set to the data set engine, saves a lot of resources.&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;Then it's just a matter of seeing if the filtered CLM_ID from the big data set is also found in the hash object.&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;I've coded the above ("&amp;lt;="&amp;nbsp;&lt;SPAN&gt;&amp;nbsp;for lower limit, and "&amp;lt;"&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;for upper limit) to avoid double inclusion of the quintile values.&amp;nbsp; So either drop the upper limit for the highest range, or change "&amp;lt;" to "&amp;lt;=".&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;Of course, this requires generating the quintile CLM_ID values.&amp;nbsp; You could do something like this to find the quintiles:&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort data=header out=header_sorted  nodup;
  by clm_id;
run;

data limits (drop=CLM_ID);
  set header_sorted nobs=nclm;

  retain quintile 1;

  /* the very first key value is the lower limit of the first range */
  if _n_=1 then LLLLLLL=clm_id;
  retain LLLLLLL;

  /* keep only the observations at the quintile positions */
  if _N_ = ceil(nclm*(quintile/5));

  UUUUUUU=clm_id;
  output;
  quintile+1;
  LLLLLLL=UUUUUUU;  /* this range's upper limit becomes the next range's lower limit */
run;
proc print;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 06 May 2025 03:42:54 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964171#M375529</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2025-05-06T03:42:54Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964192#M375532</link>
      <description>Maybe a bitmap can help; see Paul M. Dorfman’s classic article and give it a try: &lt;A href="https://support.sas.com/resources/papers/proceedings/proceedings/sugi26/p008-26.pdf" target="_blank"&gt;https://support.sas.com/resources/papers/proceedings/proceedings/sugi26/p008-26.pdf&lt;/A&gt;</description>
      <pubDate>Mon, 14 Apr 2025 02:39:39 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964192#M375532</guid>
      <dc:creator>whymath</dc:creator>
      <dc:date>2025-04-14T02:39:39Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964277#M375557</link>
      <description>&lt;P&gt;FedSQL is computationally multi-threaded, but in base SAS, it uses a single read-write thread. In the situation described, the process is most likely I/O bound, not CPU bound. So I don't think FedSQL (or threaded DS2) would help in this situation.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 14 Apr 2025 23:59:50 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964277#M375557</guid>
      <dc:creator>SASJedi</dc:creator>
      <dc:date>2025-04-14T23:59:50Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964283#M375559</link>
      <description>&lt;P&gt;You certainly want to avoid sorting your big dataset, so a hash table lookup feels like a good option.&lt;/P&gt;
&lt;P&gt;Given that the length of your key variable is 64, I assume it's already a hex digest created using SHA-256.&lt;/P&gt;
&lt;P&gt;You can't fit all the keys of your header table into memory, so, like &lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/32733"&gt;@FreelanceReinh&lt;/a&gt;, I've been thinking about how to reduce the size of your key values so they can fit. Converting your key values to an MD5 binary string should reduce memory requirements to what's available to you. BUT using MD5 instead of SHA-256 will increase the collision risk, which with your data volume isn't negligible. IF that still-small risk of selecting a key that's not in your list is acceptable, then using MD5 as in the sample code below should be an option.&lt;/P&gt;
&lt;P&gt;I would also use the SPDE engine for storing such a huge SAS table.&lt;/P&gt;
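&lt;P&gt;For instance, a hypothetical SPDE library definition (paths made up) that the step below could write to:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;libname spde_saslibrary spde 'D:\spde\meta'
        datapath=('D:\spde\data1' 'E:\spde\data2')  /* spread I/O across disks */
        partsize=50g compress=yes;&lt;/CODE&gt;&lt;/PRE&gt;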
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data spde_saslibrary.want(compress=yes);
  if _n_=1 then
    do;
      length _key $16;
      dcl hash h1();
      h1.defineKey('_key');
      h1.defineDone();
      do until(_done);
        set saslibrary.headerinputdataset end=_done;
        _key=md5(clm_id);
        _rc=h1.ref();
      end;
    end;
  set saslibrary.lineinputdataset;
  _key=md5(clm_id);
  if h1.check()=0 then output;
  drop _:;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;...and I seem to remember that at one point there was an issue with the hash object where, if you didn't define a data item, the key variables also got used as data, doubling the required memory. If I remember right and that's still an issue with your SAS version, then load a placeholder data variable so this doesn't happen.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data spde_saslibrary.want(compress=yes);
  if _n_=1 then
    do;
      length _key $16;
      retain _placeholder ' ';
      dcl hash h1();
      h1.defineKey('_key');
      h1.defineData('_placeholder');
      h1.defineDone();
      do until(_done);
        set saslibrary.headerinputdataset end=_done;
        _key=md5(clm_id);
        _rc=h1.ref();
      end;
    end;
  set saslibrary.lineinputdataset;
  _key=md5(clm_id);
  if h1.check()=0 then output;
  drop _:;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Now... If your key variable clm_id contains a 64-character hex string, then that's a base16 value. Another way of shortening the string without increasing the collision risk could be to convert this base16 value to a base32 value.&lt;BR /&gt;I'm not sure how much processing time such a conversion would add, but it's certainly worth giving it a shot - if you can make it work. The approaches I've seen always first convert the values to base10 and need to do summations. The problem with SAS is that a SHA-256 value doesn't fit as a full-precision integer into a SAS numerical variable. One could do it using something like Python, which supports such large integers, or else find another approach that doesn't require an intermediary numerical variable.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 15 Apr 2025 03:27:43 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964283#M375559</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2025-04-15T03:27:43Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964320#M375571</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/12447"&gt;@Patrick&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;BUT using md5 instead of sha-256 will increase the collision risk which with your data volume isn't negligeable.&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;According to simple approximation formulas (assuming the MD5 digests are uniformly distributed random strings, which might be too optimistic, but I'm not sure), the collision probability for&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/473735"&gt;@kenkaran&lt;/a&gt;'s 1.8E9 keys should be approx. 5E-21, i.e., &lt;EM&gt;extremely&lt;/EM&gt; small (see, e.g., &lt;A href="https://towardsdatascience.com/collision-risk-in-hash-based-surrogate-keys-4c87b716cbcd/" target="_blank" rel="noopener"&gt;https://towardsdatascience.com/collision-risk-in-hash-based-surrogate-keys-4c87b716cbcd/&lt;/A&gt;).&lt;/P&gt;
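&lt;P&gt;(Spelled out, that's the standard birthday approximation for a 128-bit digest: p ≈ n^2 / 2^129 = (1.85E9)^2 / 2^129 ≈ 3.4E18 / 6.8E38 ≈ 5E-21.)&lt;/P&gt;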
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/12447"&gt;@Patrick&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;I believe to remember that at one point there was an issue with the hash that when one didn't define Data the key variables got used as Data doubling the required memory.&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Adding the 1-byte dummy data item won't hurt. Tests on my Windows SAS 9.4M5 suggest, however, that for keys with length 16 there is no decrease in &lt;A href="https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lecompobjref/p195co8u1s7a91n1xv1get0544t3.htm" target="_blank" rel="noopener"&gt;item_size&lt;/A&gt; by doing so. (The benefit starts at length 17.)&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/12447"&gt;@Patrick&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;Now... If your key variable&amp;nbsp;clm_id contains a 64 character hex string then that's a base16 value. Another way for shortening the string without increasing the collision risk could be to convert this base16 value to a base32 value.&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;I would rather favor base 256, as it is both simpler to obtain -- the &lt;A href="https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/leforinforref/p1bol4tujnx1w7n1g2jg6pimf0hk.htm" target="_blank" rel="noopener"&gt;$HEX64. informat&lt;/A&gt; does the conversion -- and more effective: The string length is halved to 32 bytes (as opposed to 52 bytes with base 32).&lt;/P&gt;</description>
      <pubDate>Tue, 15 Apr 2025 10:07:21 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964320#M375571</guid>
      <dc:creator>FreelanceReinh</dc:creator>
      <dc:date>2025-04-15T10:07:21Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964330#M375574</link>
      <description>&lt;P&gt;Will half of the header table fit into memory?&amp;nbsp; A third? I'd first try to chunk the header table and run a hash lookup for each chunk. Plus, it might give you a decent idea as to how much memory would be needed for the whole header table.&lt;/P&gt;</description>
      <pubDate>Tue, 15 Apr 2025 13:25:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964330#M375574</guid>
      <dc:creator>DerylHollick</dc:creator>
      <dc:date>2025-04-15T13:25:17Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964379#M375590</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/3174"&gt;@DerylHollick&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;Will half of the header table fit into memory?&amp;nbsp; A third? I'd first try to chunk the header table and run a hash lookup for each chunk. Plus, it might give you a decent idea as to how much memory would be needed for the whole header table.&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;This is what my suggested code does.&amp;nbsp; But instead of just loading any old half of the header (or in my example a fifth), I proposed selecting a half (or a fifth) for a given range of join variable values - on both datasets.&amp;nbsp; This makes the number of needed join comparisons about one fourth (in the case of halves) or one 25th (for fifths) for each subgroup join - a very effective reduction especially when the process "outsources" the subgroup selection to the data engine.&amp;nbsp; &amp;nbsp;True, one has to process the big dataset for each subgroup join, but that burden is significantly reduced by outsourcing the subgroup filtering to the data engine, via the WHERE options.&lt;/P&gt;</description>
      <pubDate>Tue, 15 Apr 2025 22:46:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964379#M375590</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2025-04-15T22:46:17Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964754#M375697</link>
      <description>&lt;P&gt;I am using two data sets (small and big) to illustrate one method of solving your problem.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data small;
input id $8.;
datalines;
Rama
Seetha
Sras
Gopal
John
;
run;



data big;
input id $8. amount;
datalines;
Seetha   100
Rama     200
Gopal    500
Krishna  300
John     400
Anbu     500
Kachi    500
Lakshi   600
asdfgh   700
ordsfg   600
pqwers   600
kasert   700
lasert   800
Anbu     100
Rama     100
Gopal    400
;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;I am keeping only essential variables for sorting.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG class="fc6omth regular_f1jfh96c f18ev72d" aria-hidden="true"&gt;I am holding ID from&lt;/STRONG&gt;&lt;SPAN class="fc6omth regular_f1jfh96c f18ev72d"&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;the BIG data set&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN class="fc6omth regular_f1jfh96c f18ev72d"&gt;and add rowid(_N_)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;This new data set is small enough to sort.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort data = small;
by id;
run;

data tempbig;
   set big(keep = id);
   RID = _N_;
run;

proc sort data = tempbig;
by id;
run;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Next I merge the 2 data sets, SMALL and TEMPBIG.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data temp;
   merge small(in = a) tempbig(in = b);
   by id;
   if a and b;
run;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Next, using the POINT= option of the SET statement, the matched records from BIG are output.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data want;
   do i = 1 by 1 until(eof);
      set temp(keep = rid)  end = eof;
      p = rid;
      set big point = p;
      output;
   end;
   stop;
   drop i rid;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The output data set is:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;Obs  id      amount
  1  Gopal      500
  2  Gopal      400
  3  John       400
  4  Rama       200
  5  Rama       100
  6  Seetha     100&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 21 Apr 2025 15:16:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964754#M375697</guid>
      <dc:creator>KachiM</dc:creator>
      <dc:date>2025-04-21T15:16:17Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964764#M375700</link>
      <description>&lt;P&gt;Let T = duration for flat read of detail table D&lt;/P&gt;
&lt;P&gt;Let K = number of header keys that _can_ fit in a hash table.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Do ceil(1,849,842,886 / K) data step reads through D with hash lookup selection.&lt;/P&gt;
&lt;P&gt;Append the selections of each run-through.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Guessing conservatively, and presuming 70 GB set aside for a session, there should be enough memory for a hash table that holds 500M 64-byte keys. So maybe 6 reads through D. Make that a worst case of 10 read-throughs, so 10 * (T + Tout) (writing key-matched records).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If this needs to be done more than one time you might want to also code a solution in Proc DS2 that uses THREADs and compare resource and time consumptions to DATA step.&lt;/P&gt;</description>
      <pubDate>Mon, 21 Apr 2025 16:08:18 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964764#M375700</guid>
      <dc:creator>RichardAD</dc:creator>
      <dc:date>2025-04-21T16:08:18Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964797#M375706</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/300781"&gt;@RichardAD&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;Let T = duration for flat read of detail table D&lt;/P&gt;
&lt;P&gt;Let K = number of header keys that _can_ fit in a hash table.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Do ceil(1,849,842,886 / K) data step reads through D with hash lookup selection.&lt;/P&gt;
&lt;P&gt;Append the selections of each run-through.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Guessing conservatively, and presuming 70 GB set aside for a session, there should be enough memory for a hash table that holds 500M 64-byte keys. So maybe 6 reads through D. Make that a worst case of 10 read-throughs, so 10 * (T + Tout) (writing key-matched records).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If this needs to be done more than one time you might want to also code a solution in Proc DS2 that uses THREADs and compare resource and time consumptions to DATA step.&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You don't need that much input/output activity.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If, after examining the header dataset in comparison to your available memory for the hash object, you find that you have to do 10 subgroup joins, you don't have to generate input/output totaling 10*(T + Tout). It can be reduced to 3*T + 1*Tout. Divide the header into 10 subgroups based on the value of the join variable. Then, given you have the disk space, also divide the detail dataset into 10 smaller datasets using the same join variable. That can be done in one DATA step totaling 2*T of input/output. Then each of the 10 subgroup joins will need only 0.1*(T + ~Tout). You can save even more by creating the detail subgroups containing only the variables of interest.&lt;/P&gt;</description>
      <pubDate>Tue, 22 Apr 2025 00:50:13 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964797#M375706</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2025-04-22T00:50:13Z</dc:date>
    </item>
  </channel>
</rss>

