<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Optimizing a join with large datasets in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869739#M343549</link>
    <description>&lt;P&gt;0) it's hard to optimise if there is no code...&amp;nbsp;&lt;/P&gt;
&lt;P&gt;1) does it always have to be a "left join"?&lt;/P&gt;
&lt;P&gt;2) is it a "many-to-many" join?&lt;/P&gt;
&lt;P&gt;3) are you adding those two little tables, or are they just for filtering?&amp;nbsp;&lt;/P&gt;
&lt;P&gt;4) what was the result of the filtering, were those tables much "smaller"?&amp;nbsp;&lt;/P&gt;
&lt;P&gt;5) how long does it run (options fullstimer;)?&amp;nbsp;&lt;/P&gt;
&lt;P&gt;6) what is the size of your session (RAM available)?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Bart&lt;/P&gt;</description>
    <pubDate>Fri, 14 Apr 2023 10:32:39 GMT</pubDate>
    <dc:creator>yabwon</dc:creator>
    <dc:date>2023-04-14T10:32:39Z</dc:date>
    <item>
      <title>Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869737#M343547</link>
      <description>&lt;P&gt;Hi!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;It’s my first thread&amp;nbsp;&lt;span class="lia-unicode-emoji" title=":grinning_face:"&gt;😀&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I’m in the following situation:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;Left join on five tables (first table: 100 million rows / 15 columns / about 15GB;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;second table: 300 million rows / 15 columns / about 80GB;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;third table: 10,000 rows / 10 columns / about 2MB;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;fourth table: 100 million rows / 40 columns / about 40GB;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;fifth table: 50 rows / 5 columns / about 150KB)&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;Filtering with IN clauses and date intervals&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;Selecting multiple variables and applying some text functions and CASE WHEN statements&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I’d like to improve the performance of my code, especially the speed.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I tried filtering the largest tables before joining them, and also compressing them with the binary option, but performance didn’t improve much.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I’d like to know if you have any advice to make it better (I’ve already read this interesting topic and tried some of its methods: &lt;A href="https://communities.sas.com/t5/SAS-Global-Forum-Proceedings/Making-SAS-Tables-Smaller-and-Faster-Without-Data-Loss/ta-p/726323" target="_blank"&gt;Making SAS® Tables Smaller and Faster Without Data Loss - SAS Support Communities&lt;/A&gt;, but without meaningful results).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hope you can help me; in any case, thanks in advance!&lt;/P&gt;</description>
      <pubDate>Fri, 14 Apr 2023 10:13:02 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869737#M343547</guid>
      <dc:creator>Daniele_R</dc:creator>
      <dc:date>2023-04-14T10:13:02Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869738#M343548</link>
      <description>Are the tables native SAS? Are they indexed? Do you have any I/O constraints, e.g. network or disk?&lt;BR /&gt;Sometimes with SAS I've found that multi-table joins don't seem to be well optimised, so I break the query into smaller chunks. At least I do this to get a baseline.</description>
      <pubDate>Fri, 14 Apr 2023 10:24:28 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869738#M343548</guid>
      <dc:creator>bpowell</dc:creator>
      <dc:date>2023-04-14T10:24:28Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869739#M343549</link>
      <description>&lt;P&gt;0) it's hard to optimise if there is no code...&amp;nbsp;&lt;/P&gt;
&lt;P&gt;1) does it always have to be a "left join"?&lt;/P&gt;
&lt;P&gt;2) is it a "many-to-many" join?&lt;/P&gt;
&lt;P&gt;3) are you adding those two little tables, or are they just for filtering?&amp;nbsp;&lt;/P&gt;
&lt;P&gt;4) what was the result of the filtering, were those tables much "smaller"?&amp;nbsp;&lt;/P&gt;
&lt;P&gt;5) how long does it run (options fullstimer;)?&amp;nbsp;&lt;/P&gt;
&lt;P&gt;6) what is the size of your session (RAM available)?&lt;/P&gt;
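&lt;P&gt;For point 5, a minimal sketch of how to capture the timings (wrap it around your existing join):&lt;/P&gt;
&lt;PRE&gt;options fullstimer;    /* log detailed real/CPU time and memory stats */
/* ... run your join here ... */
options nofullstimer;  /* back to the default, shorter notes */&lt;/PRE&gt;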
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Bart&lt;/P&gt;
</description>
      <pubDate>Fri, 14 Apr 2023 10:32:39 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869739#M343549</guid>
      <dc:creator>yabwon</dc:creator>
      <dc:date>2023-04-14T10:32:39Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869742#M343551</link>
      <description>I was also wondering about RAM, but I expect you'd need, what, 128GB to make a difference on this query? Most of the time will be spent in I/O. I don't use more than 16-24GB, and the 80GB dataset I normalize obviously can't be handled mostly in memory.</description>
      <pubDate>Fri, 14 Apr 2023 10:40:06 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869742#M343551</guid>
      <dc:creator>bpowell</dc:creator>
      <dc:date>2023-04-14T10:40:06Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869747#M343554</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/440567"&gt;@Daniele_R&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here is my first stab at this&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;Left join on five tables (first table: 100 million rows / 15 columns / about 15GB&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;second table: 300 million rows / 15 columns / about 80GB&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;third table: 10 000 rows / 10 columns / about 2MB &lt;FONT color="#0000FF"&gt;&lt;STRONG&gt;/* This can be loaded into a Hash Object */&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;fourth table: 100 million rows / 40 columns / about 40GB&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;fifth table: 50 rows / 5 columns / about 150KB) &lt;FONT color="#0000FF"&gt;&lt;STRONG&gt;/* This can be loaded into a Hash Object */&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;Filtering with IN clauses and Date intervals&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;Selecting multiple variables and applying some Text functions and CASE WHEN statements&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
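&lt;P&gt;A hash-object lookup could look roughly like this (a minimal sketch; the dataset, key, and column names are placeholders for yours):&lt;/P&gt;
&lt;PRE&gt;data work.joined;
  /* make the lookup columns known to the compiler without reading rows */
  if 0 then set work.small_lookup;
  if _n_ = 1 then do;
    declare hash h(dataset: "work.small_lookup");
    h.defineKey("id");          /* join key           */
    h.defineData("lookup_val"); /* column(s) to fetch */
    h.defineDone();
  end;
  set work.big_table;
  call missing(lookup_val);
  rc = h.find();  /* left-join semantics: lookup_val stays missing on no match */
  drop rc;
run;&lt;/PRE&gt;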
&lt;P&gt;&lt;EM&gt;Ensure you have the proper simple (single-column) &lt;STRONG&gt;/&lt;/STRONG&gt; complex (multiple-column) indexes defined for the tables below; evaluate based on your filtering criteria.&amp;nbsp;&lt;/EM&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;EM&gt;100 million rows / 15 columns / about 15GB&lt;/EM&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;EM&gt;300 million rows / 15 columns / about 80GB&lt;/EM&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;EM&gt;100 million rows / 40 columns / about 40GB&lt;/EM&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;EM&gt;Here are a couple of papers to get you started on SAS indexes:&lt;/EM&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A title="The Basics of Using SAS® Indexes" href="https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/247-30.pdf" target="_blank" rel="noopener"&gt;&lt;EM&gt;The Basics of Using SAS® Indexes&lt;/EM&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A title="Frequently Asked Questions about SAS® Indexes" href="https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/008-30.pdf" target="_blank" rel="noopener"&gt;&lt;EM&gt;Frequently Asked Questions about SAS® Indexes&lt;/EM&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
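&lt;P&gt;Creating them might look like this (a sketch; library, table, and column names are placeholders):&lt;/P&gt;
&lt;PRE&gt;proc datasets library=mylib nolist;
  modify big_table1;
    index create cust_id;                  /* simple index    */
  modify big_table2;
    index create ck = (cust_id tx_date);   /* composite index */
quit;&lt;/PRE&gt;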
&lt;P&gt;Depending on the nature of your tables:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;External Relational Database Management System (RDBMS) ---&amp;gt; Most have partitioning features, see if you can utilize them if applicable&lt;/LI&gt;
&lt;LI&gt;SAS Data sets. I would investigate using &lt;A title="SAS® 9.4 Scalable Performance Data Engine: Reference, Fourth Edition" href="https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/engspde/titlepage.htm" target="_blank" rel="noopener"&gt;SPDE&lt;/A&gt; to store such large tables.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;One other suggestion:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Consider creating temporary tables (if possible) by filtering these large tables before your query, then use them in your query. (This is a workaround for &lt;A title="What Is a Common Table Expression (CTE) in SQL?" href="https://learnsql.com/blog/what-is-common-table-expression/" target="_blank" rel="noopener"&gt;CTEs&lt;/A&gt;, which PROC SQL does not support.)&lt;/LI&gt;
&lt;/UL&gt;
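&lt;P&gt;For example (a sketch; library, table, and column names are placeholders):&lt;/P&gt;
&lt;PRE&gt;/* keep only the rows and columns the join actually needs */
data work.t2_small;
  set mylib.big_table2(keep=cust_id tx_date amount);
  where tx_date between '01JAN2022'd and '31DEC2022'd;
run;&lt;/PRE&gt;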
&lt;P&gt;Hope this helps&lt;/P&gt;</description>
      <pubDate>Fri, 14 Apr 2023 11:24:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869747#M343554</guid>
      <dc:creator>AhmedAl_Attar</dc:creator>
      <dc:date>2023-04-14T11:24:17Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869809#M343576</link>
      <description>&lt;P&gt;Thank you very much for all these tips to improve our data processing.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I will start reading the articles and experimenting with the suggested optimizations.&lt;/P&gt;</description>
      <pubDate>Fri, 14 Apr 2023 14:43:51 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869809#M343576</guid>
      <dc:creator>Daniele_R</dc:creator>
      <dc:date>2023-04-14T14:43:51Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869825#M343585</link>
      <description>In addition, I would highly recommend looking into your SAS environment and invocation settings.&lt;BR /&gt;Options that can enhance performance include:&lt;BR /&gt;-memsize (Default: 2G)&lt;BR /&gt;-sortsize (Default: 1G)&lt;BR /&gt;-CPUCOUNT (Default: 4)&lt;BR /&gt;-UTILLOC (Default: work)&lt;BR /&gt;Changing these defaults will help with the performance issues.&lt;BR /&gt;Here is another paper to read &lt;span class="lia-unicode-emoji" title=":winking_face:"&gt;😉&lt;/span&gt;&lt;BR /&gt;&lt;A href="https://support.sas.com/resources/papers/proceedings09/333-2009.pdf" target="_blank"&gt;https://support.sas.com/resources/papers/proceedings09/333-2009.pdf&lt;/A&gt;</description>
      <pubDate>Fri, 14 Apr 2023 15:09:26 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869825#M343585</guid>
      <dc:creator>AhmedAl_Attar</dc:creator>
      <dc:date>2023-04-14T15:09:26Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869845#M343593</link>
      <description>&lt;P&gt;On the CPU count option: I find CPU use is relatively low in SAS, especially with large datasets. There is plenty of room for SAS to use higher compression on datasets, and hence more CPU, and I don't know why this hasn't been done. During lockdown I found that compressing an already-compressed SAS dataset with 7z on the fast setting could reduce the file size by ~90%. Depending on how the compressed dataset is unpacked during computation, this may yield some benefit. It certainly would when writing output datasets.&lt;/P&gt;</description>
      <pubDate>Fri, 14 Apr 2023 16:10:28 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869845#M343593</guid>
      <dc:creator>bpowell</dc:creator>
      <dc:date>2023-04-14T16:10:28Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869849#M343594</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/12843"&gt;@bpowell&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;There are ways to reduce the SAS data set size and optimize it for processing without compression.&lt;/P&gt;
&lt;P&gt;This paper sums them up: &lt;A title="Twenty Ways to Run Your SAS® Program Faster and Use Less  Space" href="https://support.sas.com/resources/papers/proceedings19/3684-2019.pdf" target="_blank" rel="noopener"&gt;Twenty Ways to Run Your SAS® Program Faster and Use Less Space&lt;/A&gt;.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I personally follow and advise my users to follow these points.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;6. Numeric variables in SAS data sets have a default length of 8. If the values of the &lt;BR /&gt;numeric variable are all integers, you can reduce the space by using the &lt;A title="Maximum Integer Size" href="https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lepg/p0dv87zb3bnse6n1mqo360be70qr.htm#p1euam1x1o1mjmn17qwpo67796qj" target="_blank" rel="noopener"&gt;following&amp;nbsp;&lt;/A&gt;table.&lt;/P&gt;
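&lt;P&gt;For example (a sketch; variable names are placeholders, and the safe maximum integer per length is platform-dependent, so check the linked table first):&lt;/P&gt;
&lt;PRE&gt;data work.slim;
  length flag 3 qty 4;   /* instead of the default 8 bytes each */
  set work.wide;
run;&lt;/PRE&gt;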
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;7. Sometimes character variables imported into SAS from other systems, like Oracle or Excel, have very large lengths. You can use the following procedure to get the&amp;nbsp;shortest possible length for your character variable, although you might want to&amp;nbsp;allow room for growth:&lt;BR /&gt;a. Use the LENGTH function to calculate the actual length of the variable in each&amp;nbsp;observation in the data set.&lt;BR /&gt;b. Use the MAX option in PROC SUMMARY to get the largest value of the length.&lt;BR /&gt;c. Use the LENGTH statement to shorten the length of the character variable to&amp;nbsp;the maximum length.&lt;/P&gt;
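&lt;P&gt;Steps a-c could be sketched like this (dataset and variable names are hypothetical):&lt;/P&gt;
&lt;PRE&gt;/* a+b. largest actual length of the character variable */
proc sql noprint;
  select max(length(city)) into :maxlen trimmed
  from work.imported;
quit;

/* c. shorten the variable to that maximum */
data work.slim;
  length city $ &amp;amp;maxlen;
  set work.imported;
run;&lt;/PRE&gt;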
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;8. Switch variables from numeric to character if they are integers and range in value&amp;nbsp;from -9 to 99. The minimum length for numeric variables is 3, so you can save space if the variable can fit into one or two characters.&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;9. Switch variables from character to numeric if they are all integers and occupy more than 3 bytes. For example, the number 1234 would occupy 4 bytes as a character&amp;nbsp;variable but item 6 above shows it would only occupy 3 bytes as a numeric variable.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Note: When the data set record length is reduced, that means more records can fit into the buffer, which in turn means less time to read the data&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 14 Apr 2023 16:45:52 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869849#M343594</guid>
      <dc:creator>AhmedAl_Attar</dc:creator>
      <dc:date>2023-04-14T16:45:52Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869919#M343620</link>
      <description>&lt;P&gt;"There is loads of room for SAS to use higher compression on datasets and hence more CPU and I don't know why this hasn't been done." It has been done. Check out the &lt;A href="https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/engsas7bdat/p1vhgpuz9l5350n1t0w4zc4v7izz.htm" target="_blank" rel="noopener"&gt;COMPRESS option&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I can get compression ratios with some of my tables approaching 90% if character columns have a lot of blank space.&lt;/P&gt;</description>
      <pubDate>Fri, 14 Apr 2023 22:33:54 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869919#M343620</guid>
      <dc:creator>SASKiwi</dc:creator>
      <dc:date>2023-04-14T22:33:54Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869948#M343629</link>
      <description>&lt;P&gt;It's been my experience that multi-table joins in SQL tend to be very slow and consume excessive resources (disk space in WORK).&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Split the process into several steps&lt;/LI&gt;
&lt;LI&gt;Use SORT and DATA steps rather than SQL (unless you explicitly need a cartesian product)&lt;/LI&gt;
&lt;LI&gt;Do lookups with hash objects, which can be integrated into a DATA step that also does a MERGE&lt;/LI&gt;
&lt;/UL&gt;
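&lt;P&gt;The SORT/DATA-step variant might look like this (a sketch with placeholder names and keys):&lt;/P&gt;
&lt;PRE&gt;proc sort data=mylib.big1 out=work.s1; by cust_id; run;
proc sort data=mylib.big2 out=work.s2; by cust_id; run;

data work.step1;
  merge work.s1(in=a) work.s2;
  by cust_id;
  if a;   /* keep left-join semantics */
run;&lt;/PRE&gt;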
&lt;P&gt;We need to see your current code (accompanied by some example data in DATA steps with DATALINES, so we know the variable attributes) to throw the whole weight of our experience at your problem.&lt;/P&gt;</description>
      <pubDate>Sat, 15 Apr 2023 07:31:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869948#M343629</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2023-04-15T07:31:09Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/870101#M343688</link>
      <description>Well, excuse me but "duh".&lt;BR /&gt;&lt;BR /&gt;40     data test;&lt;BR /&gt;41     length test $1000.;&lt;BR /&gt;42     do i=1 to 100000000;&lt;BR /&gt;43     test="&lt;BR /&gt;43   !       ....&lt;BR /&gt;43   !     ...";&lt;BR /&gt;NOTE: The quoted string currently being processed has become more than 262 characters long.&lt;BR /&gt;      You might have unbalanced quotation marks.&lt;BR /&gt;44     output;&lt;BR /&gt;45     end;&lt;BR /&gt;46     drop i;&lt;BR /&gt;47     run;&lt;BR /&gt;&lt;BR /&gt;NOTE: The data set WORK.TEST has 100000000 observations and 1 variables.&lt;BR /&gt;NOTE: Compressing data set WORK.TEST decreased size by 97.82 percent.&lt;BR /&gt;      Compressed is 33607 pages; un-compressed would require 1538462 pages.&lt;BR /&gt;NOTE: DATA statement used (Total process time):&lt;BR /&gt;      real time           56.63 seconds&lt;BR /&gt;      cpu time            48.06 seconds&lt;BR /&gt;&lt;BR /&gt;Output dataset size is 2GB *compressed*. SAS would have written hundreds of GB without the compress option being set. That's a 1,000-character string that is about 99% whitespace.&lt;BR /&gt;&lt;BR /&gt;Now, if, as I said, you compress the output dataset with 7z, on fast, the resulting output dataset is &amp;lt;200MB. Like I said, SAS has * a lot * of scope for additional compression.</description>
      <pubDate>Mon, 17 Apr 2023 08:27:10 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/870101#M343688</guid>
      <dc:creator>bpowell</dc:creator>
      <dc:date>2023-04-17T08:27:10Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/870113#M343692</link>
      <description>&lt;P&gt;The 7z compression compresses&amp;nbsp;&lt;EM&gt;everything&lt;/EM&gt;, which also means that one would need to uncompress&amp;nbsp;&lt;EM&gt;everything&lt;/EM&gt; before use.&lt;/P&gt;
&lt;P&gt;SAS compresses &lt;EM&gt;observations&lt;/EM&gt;, which means that individual observations can still be found within a dataset page without having to uncompress the whole dataset first.&lt;/P&gt;
&lt;P&gt;And you need to run the same test with COMPRESS=BINARY. COMPRESS=YES only does RLE (run length encoding) compression (which is almost undetectable in terms of CPU usage).&lt;/P&gt;</description>
      <pubDate>Mon, 17 Apr 2023 09:52:04 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/870113#M343692</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2023-04-17T09:52:04Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/870142#M343699</link>
      <description>Yes - I know SAS can't read a zipped dataset. My point is that the overall compression is poor. Does anything you say mean SAS could not achieve greater compression by using more CPU? No. I would be surprised if there wasn't at least some scope to further leverage the multi-CPU universe we now inhabit. I don't believe COMPRESS=YES has been updated since the sas7bdat format appeared, which is what, 30 years old? Back then multi-CPU meant mainframes, and CPU time was costly. That's no longer the case. Time for an upgrade.</description>
      <pubDate>Mon, 17 Apr 2023 11:50:44 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/870142#M343699</guid>
      <dc:creator>bpowell</dc:creator>
      <dc:date>2023-04-17T11:50:44Z</dc:date>
    </item>
  </channel>
</rss>

