<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Query huge dataset in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Query-huge-dataset/m-p/692818#M211134</link>
    <description>Hi ChrisNZ,&lt;BR /&gt;&lt;BR /&gt;Sorry the dataset has 500 millions rows instead. Not sure is this considered huge or not. Anyway, the dataset is compressed but not sorted and not indexed.&lt;BR /&gt;&lt;BR /&gt;Here's the log:&lt;BR /&gt;NOTE: There were 0 observations read from the data set ILIN.ACMVPF.&lt;BR /&gt;WHERE (trandate='18OCT2020'D) and (orgtrcde in ('T679', 'TA69') or batctrcde in ('T679', 'TA69'));&lt;BR /&gt;NOTE: The data set WORK.TEST has 0 observations and 59 variables.&lt;BR /&gt;NOTE: DATA statement used (Total process time):&lt;BR /&gt;real time 24:43.62&lt;BR /&gt;user cpu time 4:23.93&lt;BR /&gt;system cpu time 1:55.64&lt;BR /&gt;memory 875.78k&lt;BR /&gt;OS Memory 20120.00k&lt;BR /&gt;Timestamp 10/20/2020 07:06:40 PM&lt;BR /&gt;Step Count 154 Switch Count 2230&lt;BR /&gt;Page Faults 123&lt;BR /&gt;Page Reclaims 592&lt;BR /&gt;Page Swaps 0&lt;BR /&gt;Voluntary Context Switches 932839&lt;BR /&gt;Involuntary Context Switches 150130&lt;BR /&gt;Block Input Operations 336232352&lt;BR /&gt;Block Output Operations 144</description>
    <pubDate>Tue, 20 Oct 2020 11:14:22 GMT</pubDate>
    <dc:creator>bekbek3128</dc:creator>
    <dc:date>2020-10-20T11:14:22Z</dc:date>
    <item>
      <title>Query huge dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Query-huge-dataset/m-p/692735#M211095</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We are suffering from performance issue in querying a dataset with more than 30 millions rows. It takes almost 30 minutes to select the necessary data from the dataset. Any suggestion how can we improve the performance?&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Tue, 20 Oct 2020 04:01:06 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Query-huge-dataset/m-p/692735#M211095</guid>
      <dc:creator>bekbek3128</dc:creator>
      <dc:date>2020-10-20T04:01:06Z</dc:date>
    </item>
    <item>
      <title>Re: Query huge dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Query-huge-dataset/m-p/692740#M211096</link>
      <description>&lt;P&gt;30 millions rows is not huge.&lt;/P&gt;
&lt;P&gt;Show us the log, with option fullstimer turned on.&lt;/P&gt;
&lt;P&gt;Do you use a where clause, if statements to subset the data set?&lt;/P&gt;
&lt;P&gt;Is the data set sorted? Indexed? compressed?&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 20 Oct 2020 04:56:30 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Query-huge-dataset/m-p/692740#M211096</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2020-10-20T04:56:30Z</dc:date>
    </item>
    <item>
      <title>Re: Query huge dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Query-huge-dataset/m-p/692741#M211097</link>
      <description>&lt;P&gt;Buy better hardware?&lt;/P&gt;
&lt;P&gt;Without the code you are currently using it is hardly possible to suggest something useful.&lt;/P&gt;</description>
      <pubDate>Tue, 20 Oct 2020 04:57:44 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Query-huge-dataset/m-p/692741#M211097</guid>
      <dc:creator>andreas_lds</dc:creator>
      <dc:date>2020-10-20T04:57:44Z</dc:date>
    </item>
    <item>
      <title>Re: Query huge dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Query-huge-dataset/m-p/692765#M211106</link>
      <description>&lt;P&gt;Please provide details:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;complete log of the step with options fullstimer&lt;/LI&gt;
&lt;LI&gt;observation size&lt;/LI&gt;
&lt;LI&gt;is this a native SAS dataset, or data in a remote database?&lt;/LI&gt;
&lt;LI&gt;if native SAS, stored on local disks, SAN, or network share?&lt;/LI&gt;
&lt;LI&gt;SAS server setup: number of cores, operating system&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Tue, 20 Oct 2020 06:54:39 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Query-huge-dataset/m-p/692765#M211106</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2020-10-20T06:54:39Z</dc:date>
    </item>
    <item>
      <title>Re: Query huge dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Query-huge-dataset/m-p/692818#M211134</link>
      <description>Hi ChrisNZ,&lt;BR /&gt;&lt;BR /&gt;Sorry the dataset has 500 millions rows instead. Not sure is this considered huge or not. Anyway, the dataset is compressed but not sorted and not indexed.&lt;BR /&gt;&lt;BR /&gt;Here's the log:&lt;BR /&gt;NOTE: There were 0 observations read from the data set ILIN.ACMVPF.&lt;BR /&gt;WHERE (trandate='18OCT2020'D) and (orgtrcde in ('T679', 'TA69') or batctrcde in ('T679', 'TA69'));&lt;BR /&gt;NOTE: The data set WORK.TEST has 0 observations and 59 variables.&lt;BR /&gt;NOTE: DATA statement used (Total process time):&lt;BR /&gt;real time 24:43.62&lt;BR /&gt;user cpu time 4:23.93&lt;BR /&gt;system cpu time 1:55.64&lt;BR /&gt;memory 875.78k&lt;BR /&gt;OS Memory 20120.00k&lt;BR /&gt;Timestamp 10/20/2020 07:06:40 PM&lt;BR /&gt;Step Count 154 Switch Count 2230&lt;BR /&gt;Page Faults 123&lt;BR /&gt;Page Reclaims 592&lt;BR /&gt;Page Swaps 0&lt;BR /&gt;Voluntary Context Switches 932839&lt;BR /&gt;Involuntary Context Switches 150130&lt;BR /&gt;Block Input Operations 336232352&lt;BR /&gt;Block Output Operations 144</description>
      <pubDate>Tue, 20 Oct 2020 11:14:22 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Query-huge-dataset/m-p/692818#M211134</guid>
      <dc:creator>bekbek3128</dc:creator>
      <dc:date>2020-10-20T11:14:22Z</dc:date>
    </item>
    <item>
      <title>Re: Query huge dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Query-huge-dataset/m-p/692823#M211139</link>
      <description>Hi KurtBremser,&lt;BR /&gt;&lt;BR /&gt;Hre's the log:&lt;BR /&gt;Hi ChrisNZ,&lt;BR /&gt;&lt;BR /&gt;Sorry the dataset has 500 millions rows instead. Not sure is this considered huge or not. Anyway, the dataset is compressed but not sorted and not indexed.&lt;BR /&gt;&lt;BR /&gt;Here's the log:&lt;BR /&gt;NOTE: There were 0 observations read from the data set ILIN.ACMVPF.&lt;BR /&gt;WHERE (trandate='18OCT2020'D) and (orgtrcde in ('T679', 'TA69') or batctrcde in ('T679', 'TA69'));&lt;BR /&gt;NOTE: The data set WORK.TEST has 0 observations and 59 variables.&lt;BR /&gt;NOTE: DATA statement used (Total process time):&lt;BR /&gt;real time 24:43.62&lt;BR /&gt;user cpu time 4:23.93&lt;BR /&gt;system cpu time 1:55.64&lt;BR /&gt;memory 875.78k&lt;BR /&gt;OS Memory 20120.00k&lt;BR /&gt;Timestamp 10/20/2020 07:06:40 PM&lt;BR /&gt;Step Count 154 Switch Count 2230&lt;BR /&gt;Page Faults 123&lt;BR /&gt;Page Reclaims 592&lt;BR /&gt;Page Swaps 0&lt;BR /&gt;Voluntary Context Switches 932839&lt;BR /&gt;Involuntary Context Switches 150130&lt;BR /&gt;Block Input Operations 336232352&lt;BR /&gt;Block Output Operations 144&lt;BR /&gt;&lt;BR /&gt;Observation size is 484928084&lt;BR /&gt;It is a native SAS stored on SAN.&lt;BR /&gt;Server V8: 4 Core , 32gb memory Operating system: Red Hat Linux Enterprise 6.3</description>
      <pubDate>Tue, 20 Oct 2020 11:31:27 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Query-huge-dataset/m-p/692823#M211139</guid>
      <dc:creator>bekbek3128</dc:creator>
      <dc:date>2020-10-20T11:31:27Z</dc:date>
    </item>
    <item>
      <title>Re: Query huge dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Query-huge-dataset/m-p/692827#M211142</link>
      <description>&lt;P&gt;An order of magnitude DOES make a difference.&lt;/P&gt;
&lt;P&gt;Depending on the observation size and therefore the resulting file size, we could make educated guesses about the SAN bandwidth.&lt;/P&gt;
&lt;P&gt;From your log&lt;/P&gt;
&lt;PRE&gt;real time 24:43.62
user cpu time 4:23.93
system cpu time 1:55.64&lt;/PRE&gt;
&lt;P&gt;you have ~7.5 minutes CPU time vs. ~25 minutes real time, so you are quite clearly I/O bound.&lt;/P&gt;
&lt;P&gt;Is your source dataset&amp;nbsp;&lt;SPAN&gt;ILIN.ACMVPF compressed? If not, consider using COMPRESS=YES on datasets that contain character variables of considerable length; also test COMPRESS=BINARY for mainly numeric/short character datasets.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;PS saw that you already use compression; please run a PROC CONTENTS and tell us the reported file size.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 20 Oct 2020 11:53:36 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Query-huge-dataset/m-p/692827#M211142</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2020-10-20T11:53:36Z</dc:date>
    </item>
    <item>
      <title>Re: Query huge dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Query-huge-dataset/m-p/693003#M211241</link>
      <description>&lt;P&gt;I note that your query does not return any rows. How does the performance vary when rows are returned? How many rows are likely to be returned in a typical query on the 500m row table? If a typical query returns less than 10% of the rows than indexes on the query variables will help. That will have to be traded off against the overhead of maintaining the indexes when updating the source table.&lt;/P&gt;</description>
      <pubDate>Tue, 20 Oct 2020 20:11:24 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Query-huge-dataset/m-p/693003#M211241</guid>
      <dc:creator>SASKiwi</dc:creator>
      <dc:date>2020-10-20T20:11:24Z</dc:date>
    </item>
    <item>
      <title>Re: Query huge dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Query-huge-dataset/m-p/693044#M211257</link>
      <description>&lt;P&gt;SAS reads over 300k rows a second, that's not bad.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If the observation size is&amp;nbsp;&lt;SPAN&gt;484928084 (is it really??) that's 1.5 TB a second.&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-family: inherit;"&gt;This is unlikely. Can you check the values again please?.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;In any case, this seems to be a textbook example of where SPDE should be used.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;SPDE is more efficient when using large tables and indexes.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Run something&amp;nbsp;like this:&lt;/SPAN&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;libname SPEEDY spde "%sysfunc(pathname(ILIN))" partsize=1T compress=binary;

proc copy in=INLIN out=SPEEDY;
  select ACMVPF;
run;

proc datasets lib=SPEEDY noprint;
  modify ACMVPF;
  index create TRANDATE;
quit;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The run time of when querying SPEEDY.ACMVPF should drop to just seconds.&lt;/P&gt;
&lt;P&gt;Use any libname you want instead of SPEEDY.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 20 Oct 2020 23:25:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Query-huge-dataset/m-p/693044#M211257</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2020-10-20T23:25:53Z</dc:date>
    </item>
    <item>
      <title>Re: Query huge dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Query-huge-dataset/m-p/693077#M211278</link>
      <description>&lt;P&gt;I seriously doubt an observation size of&amp;nbsp;&lt;SPAN&gt;484928084 (almost 500 MB).&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Please post the output of PROC CONTENTS.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;PS we want to know the observation&amp;nbsp;&lt;EM&gt;size&lt;/EM&gt;, not the observation&amp;nbsp;&lt;EM&gt;number&lt;/EM&gt;.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 21 Oct 2020 06:33:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Query-huge-dataset/m-p/693077#M211278</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2020-10-21T06:33:53Z</dc:date>
    </item>
  </channel>
</rss>

