Re: Need Code for faster data collection

abdulla · Posted 10-04-2020 03:51 PM

%let year1=1970; %let year2=2020;
proc sql; create table comp0 as
select gvkey, fyear, datadate, SIC, fic, curcd, cogs, SALE, CSHO, AQC, AT, ACT, CAPX, CH, CHE,LT,LCT,
DLC, DLTT, DP, DV, DVC, dvp, SPPIV, IPODATE
from mydat.funda
where fyear between &year1 and &year2
and CONSOL='C'
and INDFMT='INDL'
and DATAFMT='STD'
and POPsrc='D'
order by gvkey, fyear;
quit;
RUN;

I have all the data saved in my library in sas7bdat format. I can collect my required data using the above code but runtime is slow. Is there any efficient method so that it takes less time to get the data?

Kurt_Bremser · Posted 10-04-2020 04:23 PM

Try a simple proc sort with a KEEP= and WHERE= dataset option:

proc sort
  data=mydat.funda (
    keep=
      gvkey fyear datadate SIC fic curcd cogs SALE CSHO
      AQC AT ACT CAPX CH CHE LT LCT DLC DLTT DP DV DVC dvp
      SPPIV IPODATE CONSOL INDFMT DATAFMT POPsrc
    where=(
      fyear between &year1 and &year2
      and CONSOL='C'
      and INDFMT='INDL'
      and DATAFMT='STD'
      and POPsrc='D'
    )
  )
  out=comp0 (
    drop=CONSOL INDFMT DATAFMT POPsrc
  )
;
by gvkey fyear;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

abdulla · Posted 10-04-2020 04:47 PM

It is still not faster.

Kurt_Bremser · Posted 10-04-2020 04:52 PM

Please supply some information:

observation size
observation number (before and after the WHERE condition)
type of computer (operating system, number of cores, CPU type/generation)
storage infrastructure (is the source library on a local disk, SAN, network share, RDBMS)

Run both codes with

options fullstimer;

and post the log of both.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mkeintz · Posted 10-05-2020 01:44 AM

Given the data set name funda, and the list of variables you have provided, you are using the Compustat fundamental annuals data set. If you are using it at WRDS, it is already sorted by GVKEY/datadate, which is almost the same as gvkey/fyear. This reduces the burden of the sort process, but neither proc sort nor proc sql know, so both of those procs probably create intermediate datasets to generate the final sorted product.

But you can write code to take advantage of this knowledge, effectively reading in all the data for one gvkey at a time, sorting it in memory and outputting the sorted results, again one gvkey at a time. This avoids the creation of the intermediate data sets used by proc sort and proc sql. That's a lot of saved input/output activity.

You can do all this in a data step, using an ordered hash object as below. This program reads in all the qualifying records for a gvkey, and puts them in an ordered hash. At the end of each gvkey, it then retrieves data from the ordered hash, starting at the top, and outputs each retrieved set of values. Then clear the hash in preparation for the next gvkey:

data comp0;
  set mydat.funda (keep=
      gvkey fyear datadate SIC fic curcd cogs SALE CSHO
      AQC AT ACT CAPX CH CHE LT LCT DLC DLTT DP DV DVC dvp
      SPPIV IPODATE CONSOL INDFMT DATAFMT POPsrc);
  by gvkey;
  where fyear between &year1 and &year2
      and CONSOL='C'     and INDFMT='INDL'
      and DATAFMT='STD'  and POPsrc='D';

  if _n_=1 then do;
    declare hash h (dataset:'mydat.funda (obs=0
                    keep= gvkey fyear datadate SIC fic curcd cogs SALE CSHO
                    AQC AT ACT CAPX CH CHE LT LCT DLC DLTT DP DV DVC dvp
                    SPPIV IPODATE CONSOL INDFMT DATAFMT POPsrc)'
                   ,ordered:'A');
      h.definekey('fyear');
      h.definedata(all:'Y');
      h.definedone();
    declare hiter hi ('h');
  end;
  h.add();

  if last.gvkey;
  do while (hi.next()=0);
    output;
  end;
  rc=h.clear();
run;

H is a hash object (think lookup table for your purposes). It has the attribute ordered:'A', telling sas that its contents will be sorted by the declared hash keys ('fyear' here). HI is a hash iterator, which provides a technique to step through each dataitem (i.e. each "row") in the hash. That's where you see the "hi.next()" method being used. hi.next() starts at the top of the hash and returns a zero as long as it is successful. Once it steps past the last dataitem in the hash it returns a non-zero, so h can be cleared in preparation for the next gvkey.

There are further techniques that can save some time, in creating a data set view, instead of (or in addition to) a data set file. But see whether this approach speeds up things first.

Edit note: notice I used an "obs=0" dataset name parameter in the declare hash statement. That's because I didn't want the hash object to pre-load data from mydata.funda, but I did want it to look at the dataset variable names and attributes in the variables of interest.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

ChrisNZ · Posted 10-05-2020 02:04 AM

If the data is sorted then just add PRESORTED to Kurt's proc sort code to get a sorted table with minimum resource cost.

High-Performance SAS Coding - Third Edition

novinosrin · Posted 10-04-2020 04:57 PM

Hi @abdulla Your question though stated in simple terms, actually leads to utilization of more far reaching advanced concepts/techniques that pertains to performance and efficiency.

Your code, however seems a perfect candidate for application of Indexes particularly for 'Where' subsetting or filter. The idea is to move away from sequential processing or in other words- direct access using B-Tree binary search, locate and retrieve.

To apply indexes, you would however need to do some home work to weigh the cost and benefits associated with it. Basically what this means is 1. Knowing your data 2. Identifying Candidates for indexes 3. Frequency of usage to determine the worth 4. maintenance methodology and so forth.

I'd encourage you to get acquainted with the concept of indexes and see if your use case presents an ideal scenario for its implementation. In my humble opinion, your code seems to really suggest a great potential for considering indexes.

And hey, if you are new to indexes, at least you know that such a thing exists and is worth learning. Best!

Tom · Posted 10-04-2020 07:35 PM

Depending on whether the dataset is indexed on FYEAR and it might be better to use IN rather than BETWEEN so that you are doing equality tests instead of inequality tests.

where fyear in ( &year1 : &year2)

ChrisNZ · Posted 10-04-2020 09:43 PM

How is this table used? And how many times?

What approximate proportion of the rows does this where clause represent?

Are (some of) these selection criteria used often when querying the table?

High-Performance SAS Coding - Third Edition

Ksharp · Posted 10-05-2020 07:20 AM

Try SPDE engine.

libname temp spde 'c:\temp';
options user=temp;

data funda;
 set mydat.funda;
run;

%let year1=1970; %let year2=2020;
proc sql; create table comp0 as
select gvkey, fyear, datadate, SIC, fic, curcd, cogs, SALE, CSHO, AQC, AT, ACT, CAPX, CH, CHE,LT,LCT,
DLC, DLTT, DP, DV, DVC, dvp, SPPIV, IPODATE
from funda /*<------*/
where fyear between &year1 and &year2
and CONSOL='C'
and INDFMT='INDL'
and DATAFMT='STD'
and POPsrc='D'
order by gvkey, fyear;
quit;

Registration is open

SAS Training: Just a Click Away