BookmarkSubscribeRSS Feed
abdulla
Pyrite | Level 9

%let year1=1970; %let year2=2020;
proc sql; create table comp0 as
select gvkey, fyear, datadate, SIC, fic, curcd, cogs, SALE, CSHO, AQC, AT, ACT, CAPX, CH, CHE,LT,LCT,
DLC, DLTT, DP, DV, DVC, dvp, SPPIV, IPODATE
from mydat.funda
where fyear between &year1 and &year2
and CONSOL='C'
and INDFMT='INDL'
and DATAFMT='STD'
and POPsrc='D'
order by gvkey, fyear;
quit;
RUN;

 

I have all the data saved in my library in sas7bdat  format. I can collect my required data using the above code but runtime is slow. Is there any efficient method so that it takes less time to get the data?

 

9 REPLIES 9
Kurt_Bremser
Super User

Try a simple proc sort with a KEEP= and WHERE= dataset option:

 

proc sort
  data=mydat.funda (
    keep=
      gvkey fyear datadate SIC fic curcd cogs SALE CSHO
      AQC AT ACT CAPX CH CHE LT LCT DLC DLTT DP DV DVC dvp
      SPPIV IPODATE CONSOL INDFMT DATAFMT POPsrc
    where=(
      fyear between &year1 and &year2
      and CONSOL='C'
      and INDFMT='INDL'
      and DATAFMT='STD'
      and POPsrc='D'
    )
  )
  out=comp0 (
    drop=CONSOL INDFMT DATAFMT POPsrc
  )
;
by gvkey fyear;
run;

 

 

abdulla
Pyrite | Level 9
It is still not faster.
Kurt_Bremser
Super User

Please supply some information:

  • observation size
  • observation number (before and after the WHERE condition)
  • type of computer (operating system, number of cores, CPU type/generation)
  • storage infrastructure (is the source library on a local disk, SAN, network share, RDBMS)

Run both codes with

options fullstimer;

and post the log of both.

mkeintz
PROC Star

Given the data set name funda, and the list of variables you have provided, you are using the Compustat fundamental annuals data set.   If you are using it at WRDS, it is already sorted by GVKEY/datadate, which is almost the same as gvkey/fyear.  This reduces the burden of the sort process, but neither proc sort nor proc sql know, so both of those procs probably create intermediate datasets to generate the final sorted product.

 

But you can write code to take advantage of this knowledge, effectively reading in all the data for one gvkey at a time, sorting it in memory and outputting the sorted results, again one gvkey at a time.  This avoids the creation of the intermediate data sets used by proc sort and proc sql.  That's a lot of saved input/output activity.

 

You can do all this in a data step, using an ordered hash object as below.  This program reads in all the qualifying records for a gvkey, and puts them in an ordered hash.  At the end of each gvkey, it then retrieves data from the ordered hash, starting at the top, and outputs each retrieved set of values.   Then clear the hash in preparation for the next gvkey:

 

data comp0;
  set mydat.funda (keep=
      gvkey fyear datadate SIC fic curcd cogs SALE CSHO
      AQC AT ACT CAPX CH CHE LT LCT DLC DLTT DP DV DVC dvp
      SPPIV IPODATE CONSOL INDFMT DATAFMT POPsrc);
  by gvkey;
  where fyear between &year1 and &year2
      and CONSOL='C'     and INDFMT='INDL'
      and DATAFMT='STD'  and POPsrc='D';

  if _n_=1 then do;
    declare hash h (dataset:'mydat.funda (obs=0
                    keep= gvkey fyear datadate SIC fic curcd cogs SALE CSHO
                    AQC AT ACT CAPX CH CHE LT LCT DLC DLTT DP DV DVC dvp
                    SPPIV IPODATE CONSOL INDFMT DATAFMT POPsrc)'
                   ,ordered:'A');
      h.definekey('fyear');
      h.definedata(all:'Y');
      h.definedone();
    declare hiter hi ('h');
  end;
  h.add();

  if last.gvkey;
  do while (hi.next()=0);
    output;
  end;
  rc=h.clear();
run;

H is a hash object (think lookup table for your purposes).  It has the attribute ordered:'A', telling sas that its contents will be sorted by the declared hash keys ('fyear' here).   HI is a hash iterator, which provides a technique to step through each dataitem (i.e. each "row") in the hash.  That's where you see the "hi.next()" method being used.  hi.next() starts at the top of the hash and returns a zero as long as it is successful.  Once it steps past the last dataitem in the hash it returns a non-zero, so h can be cleared in preparation for the next gvkey.

 

There are further techniques that can save some time, in creating a data set view, instead of (or in addition to) a data set file.   But see whether this approach speeds up things first.

 

 

Edit note: notice I used an "obs=0" dataset name parameter in the declare hash statement.  That's because I didn't want the hash object to pre-load data from mydata.funda, but I did want it to look at the dataset variable names and attributes in the variables of interest.  

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ChrisNZ
Tourmaline | Level 20
If the data is sorted then just add PRESORTED to Kurt's proc sort code to get a sorted table with minimum resource cost.
novinosrin
Tourmaline | Level 20

Hi @abdulla  Your question though stated in simple terms, actually leads to utilization of more far reaching advanced concepts/techniques that pertains to performance and efficiency. 

 

Your code, however seems a perfect candidate for application of Indexes particularly for  'Where' subsetting or filter. The idea is to move away from sequential processing or in other words- direct access using B-Tree binary search, locate and retrieve. 

 

To apply indexes, you would however need to do some home work to weigh the cost and benefits associated with it. Basically what this means is 1. Knowing your data 2. Identifying Candidates for indexes 3. Frequency of usage to determine the worth 4. maintenance methodology and so forth.

 

I'd encourage you to get acquainted with the concept of indexes and see if your use case presents an ideal scenario for its implementation. In my humble opinion, your code seems to really suggest a great potential for considering indexes. 

 

And hey, if you are new to indexes, at least you know that such a thing exists and is worth learning. Best!

Tom
Super User Tom
Super User

Depending on whether the dataset is indexed on FYEAR and it might be better to use IN rather than BETWEEN so that you are doing equality tests instead of inequality tests.

where fyear in ( &year1 : &year2)
ChrisNZ
Tourmaline | Level 20

How is this table used? And how many times?

What approximate proportion of the rows does this where clause represent?

Are (some of) these selection criteria used often when querying the table?

 

Ksharp
Super User

Try SPDE engine.

 

libname temp spde 'c:\temp';
options user=temp;

data funda;
 set mydat.funda;
run;

%let year1=1970; %let year2=2020;
proc sql; create table comp0 as
select gvkey, fyear, datadate, SIC, fic, curcd, cogs, SALE, CSHO, AQC, AT, ACT, CAPX, CH, CHE,LT,LCT,
DLC, DLTT, DP, DV, DVC, dvp, SPPIV, IPODATE
from funda /*<------*/
where fyear between &year1 and &year2
and CONSOL='C'
and INDFMT='INDL'
and DATAFMT='STD'
and POPsrc='D'
order by gvkey, fyear;
quit;

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 1212 views
  • 11 likes
  • 7 in conversation