<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Efficient way of merging very large datasets in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Efficient-way-of-merging-very-large-datasets/m-p/40090#M8123</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Performance is fickle, so making broad generalizations is difficult.&amp;nbsp; I deal with datasets of these sizes regularly (from billions to hundreds of billions of rows, mostly web event data) and I have encountered situations that contradict all of the following, but they are usually the 'best' starting points.&amp;nbsp; I make these points assuming there is no highly parallel DBMS available (Oracle Exadata, Teradata, Netezza, SPDS, Aster nCluster, Greenplum, Vertica, etc.), since in the vast majority of cases such a system would beat anything a single-threaded SAS DATA step ever could.&lt;/P&gt;&lt;P&gt;An index will typically only benefit a merge if the smaller table makes up under 20-30% of the larger table.&lt;/P&gt;&lt;P&gt;Hash tables will be very fast; however, loading a table of this size into memory is unlikely to work unless the table contains very few variables and the machine has lots of RAM (my servers have 512GB and there are still many tables too large).&lt;/P&gt;&lt;P&gt;Formats are also a good approach for large-table merges.&amp;nbsp; As mentioned, you can use one to map an ID to a variable; a variation that has not been pointed out is creating a format from the merge key to the record number and then reading with SET and the POINT= option.&amp;nbsp; Sometimes this is very quick.&amp;nbsp; It is similar to using an index and breaks down as the percentage of records being read grows, and formats have memory limits as well, as has been pointed out.&lt;/P&gt;&lt;P&gt;An option that has not been mentioned yet is SPDE.&amp;nbsp; If you have the I/O subsystem to support it, you can sometimes see large performance gains because it allows multi-threaded I/O in the DATA step, shifting the burden of the task onto RAM and CPU.&amp;nbsp; Indexes are also read multi-threaded, which can help their performance as well.&amp;nbsp; SPDE also allows indexes and data contents to be held in separate locations, where each I/O system can be optimized for its type of usage: indexes on storage meant for intense IOPS and random access, data on disks geared for sequential read throughput.&amp;nbsp; Also, for tables over 2 billion rows on a 32-bit operating system, you need to use SPDE anyway...&lt;/P&gt;&lt;P&gt;And finally, the best of these options is always to split files of this size into logical groups of smaller files, perform the tasks in parallel, and then SET the results back together at the end if necessary.&lt;/P&gt;&lt;P&gt;On a related note, I am very excited about big data and SAS with the new product SAS/ACCESS Interface to Hadoop!&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Fri, 23 Mar 2012 03:50:42 GMT</pubDate>
    <dc:creator>FriedEgg</dc:creator>
    <dc:date>2012-03-23T03:50:42Z</dc:date>
    <item>
      <title>Efficient way of merging very large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-way-of-merging-very-large-datasets/m-p/40080#M8113</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have come across a situation where I need to merge two datasets, each of about 40 million observations.&lt;/P&gt;&lt;P&gt;The details are:&lt;/P&gt;&lt;P&gt;1. Dataset1 contains &lt;SPAN style="font-family: 'courier new', courier; font-size: 12pt;"&gt;lot_id&lt;/SPAN&gt; and &lt;SPAN style="font-family: 'courier new', courier; font-size: 12pt;"&gt;unit_id&lt;/SPAN&gt;, plus some other variables. Each &lt;SPAN style="font-family: 'courier new', courier; font-size: 12pt;"&gt;lot_id&lt;/SPAN&gt; is mapped to more than one &lt;SPAN style="font-family: 'courier new', courier; font-size: 12pt;"&gt;unit_id&lt;/SPAN&gt;.&lt;/P&gt;&lt;P&gt;2. Dataset2 contains &lt;SPAN style="font-family: 'courier new', courier; font-size: 12pt;"&gt;unit_id&lt;/SPAN&gt;s and their &lt;SPAN style="font-family: 'courier new', courier; font-size: 12pt;"&gt;value&lt;/SPAN&gt;s.&lt;/P&gt;&lt;P&gt;The task is to obtain the total &lt;SPAN style="font-family: 'courier new', courier; font-size: 12pt;"&gt;value&lt;/SPAN&gt; for each &lt;SPAN style="font-family: 'courier new', courier; font-size: 12pt;"&gt;lot_id&lt;/SPAN&gt; by merging the two datasets (by &lt;SPAN style="font-family: 'courier new', courier; font-size: 12pt;"&gt;unit_id&lt;/SPAN&gt;).&lt;/P&gt;&lt;P&gt;Both data sets are already indexed/sorted by &lt;SPAN style="font-family: 'courier new', courier; font-size: 12pt;"&gt;unit_id&lt;/SPAN&gt;, which is unique in both datasets. I have to select very few variables from each dataset.&lt;/P&gt;&lt;P&gt;At the moment I use the PROC SQL code below:&lt;/P&gt;&lt;P&gt;proc sql;&lt;/P&gt;&lt;P&gt;create table output as&lt;/P&gt;&lt;P&gt;select A.lot_id, sum(B.value) as value&lt;/P&gt;&lt;P&gt;from Dataset1 as A, Dataset2 as B&lt;/P&gt;&lt;P&gt;where A.unit_id = B.unit_id&lt;/P&gt;&lt;P&gt;group by lot_id;&lt;/P&gt;&lt;P&gt;quit;&lt;/P&gt;&lt;P&gt;I know I could achieve the same with a MERGE statement and then use first. / last. variables to obtain the sum of values for each &lt;SPAN style="font-family: 'courier new', courier; font-size: 12pt;"&gt;lot_id&lt;/SPAN&gt;.&lt;/P&gt;&lt;P&gt;My question is:&lt;/P&gt;&lt;P&gt;&lt;SPAN style="text-decoration: underline;"&gt;&lt;STRONG&gt;What is the most efficient way (in terms of time) to merge very large datasets?&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;In the near future I will need to perform similar merges on datasets of about 10 billion observations.&lt;/P&gt;&lt;P&gt;If anyone can guide me to a more time-efficient approach, that would be a great help.&lt;/P&gt;&lt;P&gt;Many thanks in advance&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 22 Mar 2012 00:21:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-way-of-merging-very-large-datasets/m-p/40080#M8113</guid>
      <dc:creator>LHV</dc:creator>
      <dc:date>2012-03-22T00:21:53Z</dc:date>
    </item>
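The MERGE-then-first./last. route the poster mentions can be sketched as below, using the thread's dataset1/dataset2, lot_id, unit_id, and value names and assuming both inputs are already sorted by unit_id, as stated in the question:

```sas
/* Inner join on unit_id via interleaved MERGE */
data merged;
  merge dataset1(keep=unit_id lot_id in=in1)
        dataset2(keep=unit_id value  in=in2);
  by unit_id;
  if in1 and in2;                  /* keep matched rows only */
run;

proc sort data=merged;             /* extra full pass: regroup by lot_id */
  by lot_id;
run;

data totals(keep=lot_id total_value);
  set merged;
  by lot_id;
  if first.lot_id then total_value = 0;
  total_value + value;             /* sum statement: implicitly retained */
  if last.lot_id then output;
run;
```

Note the second sort by lot_id, which adds a full extra pass over the merged data; that pass is the main cost this approach carries beyond the single PROC SQL step.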
    <item>
      <title>Efficient way of merging very large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-way-of-merging-very-large-datasets/m-p/40081#M8114</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Wow, &lt;STRONG&gt;10 billion&lt;/STRONG&gt;! What dataset is that?&lt;/P&gt;&lt;P&gt;If I were you, a hash table would be my favorite.&lt;/P&gt;&lt;P&gt;Given your very large table, I would split it into many small tables, each containing only one lot_id, and then use them one by one.&lt;/P&gt;&lt;P&gt;Ksharp&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 22 Mar 2012 09:01:24 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-way-of-merging-very-large-datasets/m-p/40081#M8114</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2012-03-22T09:01:24Z</dc:date>
    </item>
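Ksharp's hash-table suggestion can be sketched as follows; this is a minimal sketch assuming the unit_id/lot_id pairs from dataset1 fit in memory, with variable names taken from the question:

```sas
/* Load dataset1's key->lot_id pairs into a hash, then stream dataset2 */
data merged;
  if _n_ = 1 then do;
    if 0 then set dataset1(keep=unit_id lot_id);   /* define host variables */
    declare hash h(dataset: "dataset1(keep=unit_id lot_id)");
    h.defineKey("unit_id");
    h.defineData("lot_id");
    h.defineDone();
  end;
  set dataset2;                     /* unit_id, value */
  if h.find() = 0 then output;      /* keep only matched rows */
run;

/* Total value per lot_id */
proc means data=merged noprint nway;
  class lot_id;
  var value;
  output out=totals(drop=_type_ _freq_) sum=total_value;
run;
```

Neither input needs to be sorted for this; the trade-off is that the whole lookup table must fit in RAM.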
    <item>
      <title>Efficient way of merging very large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-way-of-merging-very-large-datasets/m-p/40082#M8115</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;If the data is that large, you might want to leverage the underlying database.&amp;nbsp; Is the data stored in Oracle, Teradata, or SPDS?&amp;nbsp; Use SAS integrated functions, pass-through SQL, or execution of an in-database function.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 22 Mar 2012 10:32:32 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-way-of-merging-very-large-datasets/m-p/40082#M8115</guid>
      <dc:creator>jcbell</dc:creator>
      <dc:date>2012-03-22T10:32:32Z</dc:date>
    </item>
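jcbell's pass-through suggestion would look roughly like this; the Oracle engine, connection options, and table names below are placeholders for whatever the site actually runs, so the join and aggregation execute inside the database and only the small result crosses the wire:

```sas
/* Explicit SQL pass-through: the inner query runs in the DBMS */
proc sql;
  connect to oracle (user=myuser password=XXXX path=mydb);
  create table totals as
  select * from connection to oracle
    ( select a.lot_id, sum(b.value) as total_value
        from dataset1 a
        join dataset2 b
          on a.unit_id = b.unit_id
       group by a.lot_id );
  disconnect from oracle;
quit;
```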
    <item>
      <title>Re: Efficient way of merging very large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-way-of-merging-very-large-datasets/m-p/40083#M8116</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Definitely worth a try:&amp;nbsp; Create a format from one data set, then apply the format when summarizing the second data set.&amp;nbsp; For example:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;data create_format;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; set&amp;nbsp; dataset1 (keep=lot_id unit_id);&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; retain fmtname '$lots';&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; rename lot_id=label unit_id=start;&lt;/P&gt;&lt;P&gt;run;&lt;/P&gt;&lt;P&gt;proc format cntlin=create_format;&lt;/P&gt;&lt;P&gt;run;&lt;/P&gt;&lt;P&gt;proc summary data=dataset2 NWAY;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; var value;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; class unit_id;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; format unit_id $lots.;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; output out=totals (keep=unit_id value) sum=;&lt;/P&gt;&lt;P&gt;run;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;In the output data set TOTALS, there are just 2 variables and VALUE is the sum of all VALUEs.&amp;nbsp; The format $lots is still required to convert the UNIT_ID into the LOT_ID, for example:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;data totals;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; set totals;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; lot_id = put(unit_id, $lots.);&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; drop unit_id;&lt;/P&gt;&lt;P&gt;run;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Good luck.&amp;nbsp; Let us know what you find out in terms of speed.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 22 Mar 2012 13:43:32 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-way-of-merging-very-large-datasets/m-p/40083#M8116</guid>
      <dc:creator>Astounding</dc:creator>
      <dc:date>2012-03-22T13:43:32Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient way of merging very large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-way-of-merging-very-large-datasets/m-p/40084#M8117</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Many thanks for all your replies. At the moment I am concentrating on merging the 40-million-observation datasets; the PROC SQL step took just over an hour to complete, and I have to repeat the same task for 12 different months, which may take more than 15 hours.&lt;/P&gt;&lt;P&gt;(I am not worried about the 10-billion-observation merge at this point, because there I will be merging a 20-million-observation dataset with the 10-billion one.)&lt;/P&gt;&lt;P&gt;Thanks to Ksharp; I will try hash tables, as I need to bring a couple more variables from Dataset1 (the lot_ids belong to certain categories, and I need that information for the analysis).&lt;/P&gt;&lt;P&gt;However, the PROC FORMAT technique suggested by Astounding looks simple and effective to me. But I don't know how to bring additional variables (e.g., category) from Dataset1 that way without doing another iteration of PROC FORMAT.&lt;/P&gt;&lt;P&gt;Anyway, I will post the outcomes once I succeed with my trials.&lt;/P&gt;&lt;P&gt;To jcbell: the data is on a SAS server, and I don't know how to 'leverage the underlying database'.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 22 Mar 2012 22:28:50 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-way-of-merging-very-large-datasets/m-p/40084#M8117</guid>
      <dc:creator>LHV</dc:creator>
      <dc:date>2012-03-22T22:28:50Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient way of merging very large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-way-of-merging-very-large-datasets/m-p/40085#M8118</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;LHV,&lt;/P&gt;&lt;P&gt;It's possible to add a few more variables to a single format, but you have to watch the details.&amp;nbsp; Here's how you would begin:&lt;/P&gt;&lt;P&gt;data create_format;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; set dataset1 (keep=lot_id unit_id plus a few more);&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; retain fmtname '$lots';&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; rename unit_id=start;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; length label $ 200;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; label = lot_id;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; substr(label, 31) = plus;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; substr(label, 51) = a;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; substr(label, 71) = few;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; substr(label, 91) = more;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; keep start label fmtname;&lt;/P&gt;&lt;P&gt;run;&lt;/P&gt;&lt;P&gt;The idea is to pack the additional information into the format label at known positions.&amp;nbsp; If you are adding numeric pieces, convert them to character with the PUT function.&amp;nbsp; Once that is done, PROC SUMMARY needs to use only the beginning of the formatted value, not the whole string that includes the added information.&amp;nbsp; So, run PROC FORMAT as before, and then:&lt;/P&gt;&lt;P&gt;proc summary data=dataset2 NWAY;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; var value;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; class unit_id;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; format unit_id $lots20.;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; output out=totals (keep=unit_id value) sum=;&lt;/P&gt;&lt;P&gt;run;&lt;/P&gt;&lt;P&gt;In this step, the only difference is specifying $LOTS20., which tells SAS how many characters of the $LOTS format to apply.&amp;nbsp; Then, in a DATA step, retrieve all the related information:&lt;/P&gt;&lt;P&gt;data totals;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; set totals;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; length long_string $ 200 lot_id plus a few more $ 20;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; long_string = put(unit_id, $lots200.);&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; lot_id = substr(long_string, 1, 20);&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; plus = substr(long_string, 31, 20);&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; a = substr(long_string, 51, 20);&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; few = substr(long_string, 71, 20);&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; more = substr(long_string, 91, 20);&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; drop long_string unit_id;&lt;/P&gt;&lt;P&gt;run;&lt;/P&gt;&lt;P&gt;Of course, you have to adjust the length assigned to each variable, depending on what is actually needed.&lt;/P&gt;&lt;P&gt;I suspect this will be MUCH faster than anything that has to join the data sets, but you have to use tools that you feel comfortable with.&amp;nbsp; Good luck.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 22 Mar 2012 22:55:16 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-way-of-merging-very-large-datasets/m-p/40085#M8118</guid>
      <dc:creator>Astounding</dc:creator>
      <dc:date>2012-03-22T22:55:16Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient way of merging very large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-way-of-merging-very-large-datasets/m-p/40086#M8119</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;If you are going to try the lookup-format approach suggested by Astounding, then be aware that formats are loaded into memory, so the size of the format is limited by your available memory.&lt;/P&gt;&lt;P&gt;This SAS note explains in more detail: &lt;A href="http://support.sas.com/kb/30/997.html"&gt;http://support.sas.com/kb/30/997.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I have successfully used formats with a million or so discrete values, but I suspect you may run into memory limits with a 40 million unique-value lookup, and most certainly with 10 billion! Only testing will confirm what your maximum is.&lt;/P&gt;&lt;P&gt;I like Ksharp's idea of splitting the problem by LOT_ID, since a fast lookup by hash (which also uses memory) or by format would then probably work OK. You can then combine the results for all grouped LOT_IDs afterwards.&lt;/P&gt;&lt;P&gt;What is the average size of your LOT_IDs (how many UNIT_IDs in each)?&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 22 Mar 2012 23:33:12 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-way-of-merging-very-large-datasets/m-p/40086#M8119</guid>
      <dc:creator>SASKiwi</dc:creator>
      <dc:date>2012-03-22T23:33:12Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient way of merging very large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-way-of-merging-very-large-datasets/m-p/40087#M8120</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Why not just index the look-up dataset and use the KEY= option on a SET statement?&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 22 Mar 2012 23:35:14 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-way-of-merging-very-large-datasets/m-p/40087#M8120</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2012-03-22T23:35:14Z</dc:date>
    </item>
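Tom's KEY= suggestion can be sketched as below; this is a minimal sketch assuming dataset1 carries a simple index on unit_id and using the thread's variable names:

```sas
/* For each row of dataset2, fetch the matching dataset1 row via its index */
data merged;
  set dataset2;                   /* unit_id, value */
  set dataset1 key=unit_id;       /* direct lookup through the index */
  if _iorc_ = 0 then output;      /* 0 means the key was found */
  else do;
    _error_ = 0;                  /* suppress the error flag on misses */
    call missing(lot_id);         /* clear values retained from the last hit */
  end;
run;
```

As noted elsewhere in the thread, indexed lookup tends to lose to a sequential pass once a large fraction of the big table is being read.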
    <item>
      <title>Re: Efficient way of merging very large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-way-of-merging-very-large-datasets/m-p/40088#M8121</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt; I would favor jcbell's approach. Depending on where your tables are, most of the time might be spent on data transit. Also note that most DBMSs will use hashing for large merge operations when it is deemed beneficial.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 23 Mar 2012 02:11:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-way-of-merging-very-large-datasets/m-p/40088#M8121</guid>
      <dc:creator>PGStats</dc:creator>
      <dc:date>2012-03-23T02:11:19Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient way of merging very large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-way-of-merging-very-large-datasets/m-p/40089#M8122</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;I think it will be very, very slow.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 23 Mar 2012 02:53:22 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-way-of-merging-very-large-datasets/m-p/40089#M8122</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2012-03-23T02:53:22Z</dc:date>
    </item>
    <item>
      <title>Efficient way of merging very large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-way-of-merging-very-large-datasets/m-p/40090#M8123</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Performance is fickle, so making broad generalizations is difficult.&amp;nbsp; I deal with datasets of these sizes regularly (from billions to hundreds of billions of rows, mostly web event data) and I have encountered situations that contradict all of the following, but they are usually the 'best' starting points.&amp;nbsp; I make these points assuming there is no highly parallel DBMS available (Oracle Exadata, Teradata, Netezza, SPDS, Aster nCluster, Greenplum, Vertica, etc.), since in the vast majority of cases such a system would beat anything a single-threaded SAS DATA step ever could.&lt;/P&gt;&lt;P&gt;An index will typically only benefit a merge if the smaller table makes up under 20-30% of the larger table.&lt;/P&gt;&lt;P&gt;Hash tables will be very fast; however, loading a table of this size into memory is unlikely to work unless the table contains very few variables and the machine has lots of RAM (my servers have 512GB and there are still many tables too large).&lt;/P&gt;&lt;P&gt;Formats are also a good approach for large-table merges.&amp;nbsp; As mentioned, you can use one to map an ID to a variable; a variation that has not been pointed out is creating a format from the merge key to the record number and then reading with SET and the POINT= option.&amp;nbsp; Sometimes this is very quick.&amp;nbsp; It is similar to using an index and breaks down as the percentage of records being read grows, and formats have memory limits as well, as has been pointed out.&lt;/P&gt;&lt;P&gt;An option that has not been mentioned yet is SPDE.&amp;nbsp; If you have the I/O subsystem to support it, you can sometimes see large performance gains because it allows multi-threaded I/O in the DATA step, shifting the burden of the task onto RAM and CPU.&amp;nbsp; Indexes are also read multi-threaded, which can help their performance as well.&amp;nbsp; SPDE also allows indexes and data contents to be held in separate locations, where each I/O system can be optimized for its type of usage: indexes on storage meant for intense IOPS and random access, data on disks geared for sequential read throughput.&amp;nbsp; Also, for tables over 2 billion rows on a 32-bit operating system, you need to use SPDE anyway...&lt;/P&gt;&lt;P&gt;And finally, the best of these options is always to split files of this size into logical groups of smaller files, perform the tasks in parallel, and then SET the results back together at the end if necessary.&lt;/P&gt;&lt;P&gt;On a related note, I am very excited about big data and SAS with the new product SAS/ACCESS Interface to Hadoop!&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 23 Mar 2012 03:50:42 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-way-of-merging-very-large-datasets/m-p/40090#M8123</guid>
      <dc:creator>FriedEgg</dc:creator>
      <dc:date>2012-03-23T03:50:42Z</dc:date>
    </item>
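The SPDE layout FriedEgg describes (data and indexes on separately tuned storage) is configured on the LIBNAME statement; the library name and paths below are placeholders:

```sas
/* SPDE library: multi-threaded I/O, data partitions split from indexes */
libname big spde '/sas/meta'                /* metadata location */
  datapath=('/disk1/big' '/disk2/big')      /* data partitions: sequential-throughput disks */
  indexpath=('/ssd/big_idx')                /* indexes: storage tuned for random access/IOPS */
  partsize=4g;                              /* max size of each data partition file */
```

Datasets written through this libref are automatically partitioned across the DATAPATH= locations, which is what enables the threaded reads mentioned in the post.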
  </channel>
</rss>

