<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Need Code for faster data collection in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Need-Code-for-faster-data-collection/m-p/688795#M209338</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/260675"&gt;@abdulla&lt;/a&gt;&amp;nbsp; Your question though stated in simple terms, actually leads to utilization of more far reaching advanced concepts/techniques that pertains to performance and efficiency.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Your code, ho&lt;SPAN&gt;wever seems a perfect candidate for application of Indexes particularly for&amp;nbsp; '&lt;EM&gt;Where'&amp;nbsp;&lt;/EM&gt;subsetting or filter. The idea is to move away from sequential processing or in other&amp;nbsp;words- direct access using B-Tree binary search, locate and retrieve.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;To apply indexes, you&amp;nbsp;would however need to do some home&amp;nbsp;work to&amp;nbsp;weigh the cost and benefits associated&amp;nbsp;with it. Basically&amp;nbsp;what this means is &lt;EM&gt;1. Knowing your data 2. Identifying Candidates for indexes 3. Frequency of usage to determine the&amp;nbsp;worth 4. maintenance methodology &lt;/EM&gt;and so forth.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I'd encourage you to get acquainted&amp;nbsp;&lt;SPAN&gt;with the &lt;EM&gt;concept of indexes&lt;/EM&gt; and see if your use case presents an ideal scenario for its implementation. In my humble opinion, your code seems to really suggest a great potential for considering indexes.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;And hey, if you are new to indexes, at least you know that such a thing exists and is&amp;nbsp;worth learning. Best!&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Sun, 04 Oct 2020 20:57:37 GMT</pubDate>
    <dc:creator>novinosrin</dc:creator>
    <dc:date>2020-10-04T20:57:37Z</dc:date>
    <item>
      <title>Need Code for faster data collection</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-Code-for-faster-data-collection/m-p/688788#M209332</link>
      <description>&lt;P&gt;%let year1=1970; %let year2=2020;&lt;BR /&gt;proc sql; create table comp0 as&lt;BR /&gt;select gvkey, fyear, datadate, SIC, fic, curcd, cogs, SALE, CSHO, AQC, AT, ACT, CAPX, CH, CHE,LT,LCT,&lt;BR /&gt;DLC, DLTT, DP, DV, DVC, dvp, SPPIV, IPODATE&lt;BR /&gt;from mydat.funda&lt;BR /&gt;where fyear between &amp;amp;year1 and &amp;amp;year2&lt;BR /&gt;and CONSOL='C'&lt;BR /&gt;and INDFMT='INDL'&lt;BR /&gt;and DATAFMT='STD'&lt;BR /&gt;and POPsrc='D'&lt;BR /&gt;order by gvkey, fyear;&lt;BR /&gt;quit;&lt;BR /&gt;RUN;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have all the data saved in my library in&amp;nbsp;&lt;A href="https://communities.sas.com/t5/SAS-Programming/how-to-export-dataset-to-a-sas7bdat-file/td-p/687684" target="_blank" rel="noopener"&gt;sas7bdat&lt;/A&gt;&amp;nbsp; format. I can collect my required data using the above code but runtime is slow. Is there any efficient method so that it takes less time to get the data?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 04 Oct 2020 19:51:08 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-Code-for-faster-data-collection/m-p/688788#M209332</guid>
      <dc:creator>abdulla</dc:creator>
      <dc:date>2020-10-04T19:51:08Z</dc:date>
    </item>
    <item>
      <title>Re: Need Code for faster data collection</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-Code-for-faster-data-collection/m-p/688790#M209333</link>
      <description>&lt;P&gt;Try a simple proc sort with a KEEP= and WHERE= dataset option:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort
  data=mydat.funda (
    keep=
      gvkey fyear datadate SIC fic curcd cogs SALE CSHO
      AQC AT ACT CAPX CH CHE LT LCT DLC DLTT DP DV DVC dvp
      SPPIV IPODATE CONSOL INDFMT DATAFMT POPsrc
    where=(
      fyear between &amp;amp;year1 and &amp;amp;year2
      and CONSOL='C'
      and INDFMT='INDL'
      and DATAFMT='STD'
      and POPsrc='D'
    )
  )
  out=comp0 (
    drop=CONSOL INDFMT DATAFMT POPsrc
  )
;
by gvkey fyear;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 04 Oct 2020 20:23:33 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-Code-for-faster-data-collection/m-p/688790#M209333</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2020-10-04T20:23:33Z</dc:date>
    </item>
    <item>
      <title>Re: Need Code for faster data collection</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-Code-for-faster-data-collection/m-p/688793#M209336</link>
      <description>It is still not faster.</description>
      <pubDate>Sun, 04 Oct 2020 20:47:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-Code-for-faster-data-collection/m-p/688793#M209336</guid>
      <dc:creator>abdulla</dc:creator>
      <dc:date>2020-10-04T20:47:09Z</dc:date>
    </item>
    <item>
      <title>Re: Need Code for faster data collection</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-Code-for-faster-data-collection/m-p/688794#M209337</link>
      <description>&lt;P&gt;Please supply some information:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;observation size&lt;/LI&gt;
&lt;LI&gt;observation number (before and after the WHERE condition)&lt;/LI&gt;
&lt;LI&gt;type of computer (operating system, number of cores, CPU type/generation)&lt;/LI&gt;
&lt;LI&gt;storage infrastructure (is the source library on a local disk, SAN, network share, RDBMS)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Run both codes with&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;options fullstimer;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;and post the log of both.&lt;/P&gt;</description>
      <pubDate>Sun, 04 Oct 2020 20:52:46 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-Code-for-faster-data-collection/m-p/688794#M209337</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2020-10-04T20:52:46Z</dc:date>
    </item>
    <item>
      <title>Re: Need Code for faster data collection</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-Code-for-faster-data-collection/m-p/688795#M209338</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/260675"&gt;@abdulla&lt;/a&gt;&amp;nbsp; Your question though stated in simple terms, actually leads to utilization of more far reaching advanced concepts/techniques that pertains to performance and efficiency.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Your code, ho&lt;SPAN&gt;wever seems a perfect candidate for application of Indexes particularly for&amp;nbsp; '&lt;EM&gt;Where'&amp;nbsp;&lt;/EM&gt;subsetting or filter. The idea is to move away from sequential processing or in other&amp;nbsp;words- direct access using B-Tree binary search, locate and retrieve.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;To apply indexes, you&amp;nbsp;would however need to do some home&amp;nbsp;work to&amp;nbsp;weigh the cost and benefits associated&amp;nbsp;with it. Basically&amp;nbsp;what this means is &lt;EM&gt;1. Knowing your data 2. Identifying Candidates for indexes 3. Frequency of usage to determine the&amp;nbsp;worth 4. maintenance methodology &lt;/EM&gt;and so forth.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I'd encourage you to get acquainted&amp;nbsp;&lt;SPAN&gt;with the &lt;EM&gt;concept of indexes&lt;/EM&gt; and see if your use case presents an ideal scenario for its implementation. In my humble opinion, your code seems to really suggest a great potential for considering indexes.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;And hey, if you are new to indexes, at least you know that such a thing exists and is&amp;nbsp;worth learning. Best!&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 04 Oct 2020 20:57:37 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-Code-for-faster-data-collection/m-p/688795#M209338</guid>
      <dc:creator>novinosrin</dc:creator>
      <dc:date>2020-10-04T20:57:37Z</dc:date>
    </item>
    <item>
      <title>Re: Need Code for faster data collection</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-Code-for-faster-data-collection/m-p/688805#M209343</link>
      <description>&lt;P&gt;Depending on whether the dataset is indexed on FYEAR and it might be better to use IN rather than BETWEEN so that you are doing equality tests instead of inequality tests.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;where fyear in ( &amp;amp;year1 : &amp;amp;year2)&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Sun, 04 Oct 2020 23:35:42 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-Code-for-faster-data-collection/m-p/688805#M209343</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2020-10-04T23:35:42Z</dc:date>
    </item>
    <item>
      <title>Re: Need Code for faster data collection</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-Code-for-faster-data-collection/m-p/688814#M209346</link>
      <description>&lt;P&gt;How is this table used? And how many times?&lt;/P&gt;
&lt;P&gt;What approximate proportion of the rows does this where clause represent?&lt;/P&gt;
&lt;P&gt;Are (some of) these selection criteria used often when querying the table?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 05 Oct 2020 01:43:56 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-Code-for-faster-data-collection/m-p/688814#M209346</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2020-10-05T01:43:56Z</dc:date>
    </item>
    <item>
      <title>Re: Need Code for faster data collection</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-Code-for-faster-data-collection/m-p/688827#M209351</link>
      <description>&lt;P&gt;Given the data set name funda, and the list of variables you have provided, you are using the Compustat fundamental annuals data set.&amp;nbsp; &amp;nbsp;If you are using it at WRDS, it is already sorted by GVKEY/datadate, which is almost the same as gvkey/fyear.&amp;nbsp; This reduces the burden of the sort process, but neither proc sort nor proc sql know, so both of those procs probably create intermediate datasets to generate the final sorted product.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;But you can write code to take advantage of this knowledge, effectively reading in all the data for one gvkey at a time, sorting it in memory and outputting the sorted results, again one gvkey at a time.&amp;nbsp; This avoids the creation of the intermediate data sets used by proc sort and proc sql.&amp;nbsp; That's a lot of saved input/output activity.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You can do all this in a data step, using an ordered hash object as below.&amp;nbsp; This program&amp;nbsp;reads in all the qualifying records for a gvkey, and puts them in an ordered hash.&amp;nbsp; At the end of each gvkey, it then retrieves data from the ordered hash, starting at the top, and outputs each retrieved set of values.&amp;nbsp; &amp;nbsp;Then clear the hash in preparation for the next gvkey:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data comp0;
  set mydat.funda (keep=
      gvkey fyear datadate SIC fic curcd cogs SALE CSHO
      AQC AT ACT CAPX CH CHE LT LCT DLC DLTT DP DV DVC dvp
      SPPIV IPODATE CONSOL INDFMT DATAFMT POPsrc);
  by gvkey;
  where fyear between &amp;amp;year1 and &amp;amp;year2
      and CONSOL='C'     and INDFMT='INDL'
      and DATAFMT='STD'  and POPsrc='D';

  if _n_=1 then do;
    declare hash h (dataset:'mydat.funda (obs=0
                    keep= gvkey fyear datadate SIC fic curcd cogs SALE CSHO
                    AQC AT ACT CAPX CH CHE LT LCT DLC DLTT DP DV DVC dvp
                    SPPIV IPODATE CONSOL INDFMT DATAFMT POPsrc)'
                   ,ordered:'A');
      h.definekey('fyear');
      h.definedata(all:'Y');
      h.definedone();
    declare hiter hi ('h');
  end;
  h.add();

  if last.gvkey;
  do while (hi.next()=0);
    output;
  end;
  rc=h.clear();
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;H is a hash object (think lookup table for your purposes).&amp;nbsp; It has the attribute &lt;EM&gt;&lt;STRONG&gt;ordered:'A'&lt;/STRONG&gt;&lt;/EM&gt;, telling sas that its contents will be sorted by the declared hash keys ('fyear' here).&amp;nbsp; &amp;nbsp;HI is a hash iterator, which provides a technique to step through each dataitem (i.e. each "row") in the hash.&amp;nbsp; That's where you see the "hi.next()" method being used.&amp;nbsp; hi.next() starts at the top of the hash and returns a zero as long as it is successful.&amp;nbsp; Once it steps past the last dataitem in the hash it returns a non-zero, so h can be cleared in preparation for the next gvkey.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;There are further techniques that can save some time, in creating a data set &lt;EM&gt;&lt;STRONG&gt;view&lt;/STRONG&gt;&lt;/EM&gt;, instead of (or in addition to) a data set &lt;EM&gt;&lt;STRONG&gt;file&lt;/STRONG&gt;&lt;/EM&gt;.&amp;nbsp; &amp;nbsp;But see whether this approach speeds up things first.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Edit note: notice I used an "obs=0" dataset name parameter in the declare hash statement.&amp;nbsp; That's because I didn't want the hash object to pre-load data from mydata.funda, but I did want it to look at the dataset variable names and attributes in the variables of interest.&amp;nbsp;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 05 Oct 2020 05:48:29 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-Code-for-faster-data-collection/m-p/688827#M209351</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2020-10-05T05:48:29Z</dc:date>
    </item>
    <item>
      <title>Re: Need Code for faster data collection</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-Code-for-faster-data-collection/m-p/688828#M209352</link>
      <description>If the data is sorted then just add PRESORTED to Kurt's proc sort code to get a sorted table with minimum resource cost.&lt;BR /&gt;</description>
      <pubDate>Mon, 05 Oct 2020 06:04:13 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-Code-for-faster-data-collection/m-p/688828#M209352</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2020-10-05T06:04:13Z</dc:date>
    </item>
    <item>
      <title>Re: Need Code for faster data collection</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-Code-for-faster-data-collection/m-p/688854#M209356</link>
      <description>&lt;P&gt;Try SPDE engine.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;libname temp spde 'c:\temp';
options user=temp;

data funda;
 set mydat.funda;
run;

%let year1=1970; %let year2=2020;
proc sql; create table comp0 as
select gvkey, fyear, datadate, SIC, fic, curcd, cogs, SALE, CSHO, AQC, AT, ACT, CAPX, CH, CHE,LT,LCT,
DLC, DLTT, DP, DV, DVC, dvp, SPPIV, IPODATE
from funda /*&amp;lt;------*/
where fyear between &amp;amp;year1 and &amp;amp;year2
and CONSOL='C'
and INDFMT='INDL'
and DATAFMT='STD'
and POPsrc='D'
order by gvkey, fyear;
quit;
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Mon, 05 Oct 2020 11:20:36 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-Code-for-faster-data-collection/m-p/688854#M209356</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2020-10-05T11:20:36Z</dc:date>
    </item>
  </channel>
</rss>

