topic Re: Need Code for faster data collection in SAS Programming

Need Code for faster data collection

abdulla — Sun, 04 Oct 2020 19:51:08 GMT

%let year1=1970; %let year2=2020;
proc sql; create table comp0 as
select gvkey, fyear, datadate, SIC, fic, curcd, cogs, SALE, CSHO, AQC, AT, ACT, CAPX, CH, CHE,LT,LCT,
DLC, DLTT, DP, DV, DVC, dvp, SPPIV, IPODATE
from mydat.funda
where fyear between &year1 and &year2
and CONSOL='C'
and INDFMT='INDL'
and DATAFMT='STD'
and POPsrc='D'
order by gvkey, fyear;
quit;
RUN;

I have all the data saved in my library in sas7bdat format. I can collect my required data using the above code but runtime is slow. Is there any efficient method so that it takes less time to get the data?

Re: Need Code for faster data collection

Kurt_Bremser — Sun, 04 Oct 2020 20:23:33 GMT

Try a simple proc sort with a KEEP= and WHERE= dataset option:

proc sort
  data=mydat.funda (
    keep=
      gvkey fyear datadate SIC fic curcd cogs SALE CSHO
      AQC AT ACT CAPX CH CHE LT LCT DLC DLTT DP DV DVC dvp
      SPPIV IPODATE CONSOL INDFMT DATAFMT POPsrc
    where=(
      fyear between &year1 and &year2
      and CONSOL='C'
      and INDFMT='INDL'
      and DATAFMT='STD'
      and POPsrc='D'
    )
  )
  out=comp0 (
    drop=CONSOL INDFMT DATAFMT POPsrc
  )
;
by gvkey fyear;
run;

Re: Need Code for faster data collection

abdulla — Sun, 04 Oct 2020 20:47:09 GMT

It is still not faster.

Re: Need Code for faster data collection

Kurt_Bremser — Sun, 04 Oct 2020 20:52:46 GMT

Please supply some information:

observation size
observation number (before and after the WHERE condition)
type of computer (operating system, number of cores, CPU type/generation)
storage infrastructure (is the source library on a local disk, SAN, network share, RDBMS)

Run both codes with

options fullstimer;

and post the log of both.

Re: Need Code for faster data collection

novinosrin — Sun, 04 Oct 2020 20:57:37 GMT

Hi @abdulla Your question though stated in simple terms, actually leads to utilization of more far reaching advanced concepts/techniques that pertains to performance and efficiency.

Your code, however seems a perfect candidate for application of Indexes particularly for 'Where' subsetting or filter. The idea is to move away from sequential processing or in other words- direct access using B-Tree binary search, locate and retrieve.

To apply indexes, you would however need to do some home work to weigh the cost and benefits associated with it. Basically what this means is 1. Knowing your data 2. Identifying Candidates for indexes 3. Frequency of usage to determine the worth 4. maintenance methodology and so forth.

I'd encourage you to get acquainted with the concept of indexes and see if your use case presents an ideal scenario for its implementation. In my humble opinion, your code seems to really suggest a great potential for considering indexes.

And hey, if you are new to indexes, at least you know that such a thing exists and is worth learning. Best!

Re: Need Code for faster data collection

Tom — Sun, 04 Oct 2020 23:35:42 GMT

Depending on whether the dataset is indexed on FYEAR and it might be better to use IN rather than BETWEEN so that you are doing equality tests instead of inequality tests.

where fyear in ( &year1 : &year2)

Re: Need Code for faster data collection

ChrisNZ — Mon, 05 Oct 2020 01:43:56 GMT

How is this table used? And how many times?

What approximate proportion of the rows does this where clause represent?

Are (some of) these selection criteria used often when querying the table?

Re: Need Code for faster data collection

mkeintz — Mon, 05 Oct 2020 05:48:29 GMT

Given the data set name funda, and the list of variables you have provided, you are using the Compustat fundamental annuals data set. If you are using it at WRDS, it is already sorted by GVKEY/datadate, which is almost the same as gvkey/fyear. This reduces the burden of the sort process, but neither proc sort nor proc sql know, so both of those procs probably create intermediate datasets to generate the final sorted product.

But you can write code to take advantage of this knowledge, effectively reading in all the data for one gvkey at a time, sorting it in memory and outputting the sorted results, again one gvkey at a time. This avoids the creation of the intermediate data sets used by proc sort and proc sql. That's a lot of saved input/output activity.

You can do all this in a data step, using an ordered hash object as below. This program reads in all the qualifying records for a gvkey, and puts them in an ordered hash. At the end of each gvkey, it then retrieves data from the ordered hash, starting at the top, and outputs each retrieved set of values. Then clear the hash in preparation for the next gvkey:

data comp0;
  set mydat.funda (keep=
      gvkey fyear datadate SIC fic curcd cogs SALE CSHO
      AQC AT ACT CAPX CH CHE LT LCT DLC DLTT DP DV DVC dvp
      SPPIV IPODATE CONSOL INDFMT DATAFMT POPsrc);
  by gvkey;
  where fyear between &year1 and &year2
      and CONSOL='C'     and INDFMT='INDL'
      and DATAFMT='STD'  and POPsrc='D';

  if _n_=1 then do;
    declare hash h (dataset:'mydat.funda (obs=0
                    keep= gvkey fyear datadate SIC fic curcd cogs SALE CSHO
                    AQC AT ACT CAPX CH CHE LT LCT DLC DLTT DP DV DVC dvp
                    SPPIV IPODATE CONSOL INDFMT DATAFMT POPsrc)'
                   ,ordered:'A');
      h.definekey('fyear');
      h.definedata(all:'Y');
      h.definedone();
    declare hiter hi ('h');
  end;
  h.add();

  if last.gvkey;
  do while (hi.next()=0);
    output;
  end;
  rc=h.clear();
run;

H is a hash object (think lookup table for your purposes). It has the attribute ordered:'A', telling sas that its contents will be sorted by the declared hash keys ('fyear' here). HI is a hash iterator, which provides a technique to step through each dataitem (i.e. each "row") in the hash. That's where you see the "hi.next()" method being used. hi.next() starts at the top of the hash and returns a zero as long as it is successful. Once it steps past the last dataitem in the hash it returns a non-zero, so h can be cleared in preparation for the next gvkey.

There are further techniques that can save some time, in creating a data set view, instead of (or in addition to) a data set file. But see whether this approach speeds up things first.

Edit note: notice I used an "obs=0" dataset name parameter in the declare hash statement. That's because I didn't want the hash object to pre-load data from mydata.funda, but I did want it to look at the dataset variable names and attributes in the variables of interest.

Re: Need Code for faster data collection

ChrisNZ — Mon, 05 Oct 2020 06:04:13 GMT

If the data is sorted then just add PRESORTED to Kurt's proc sort code to get a sorted table with minimum resource cost.

Re: Need Code for faster data collection

Ksharp — Mon, 05 Oct 2020 11:20:36 GMT

Try SPDE engine.

libname temp spde 'c:\temp';
options user=temp;

data funda;
 set mydat.funda;
run;

%let year1=1970; %let year2=2020;
proc sql; create table comp0 as
select gvkey, fyear, datadate, SIC, fic, curcd, cogs, SALE, CSHO, AQC, AT, ACT, CAPX, CH, CHE,LT,LCT,
DLC, DLTT, DP, DV, DVC, dvp, SPPIV, IPODATE
from funda /*<------*/
where fyear between &year1 and &year2
and CONSOL='C'
and INDFMT='INDL'
and DATAFMT='STD'
and POPsrc='D'
order by gvkey, fyear;
quit;