%let year1=1970; %let year2=2020;
proc sql; create table comp0 as
select gvkey, fyear, datadate, SIC, fic, curcd, cogs, SALE, CSHO, AQC, AT, ACT, CAPX, CH, CHE,LT,LCT,
DLC, DLTT, DP, DV, DVC, dvp, SPPIV, IPODATE
from mydat.funda
where fyear between &year1 and &year2
and CONSOL='C'
and INDFMT='INDL'
and DATAFMT='STD'
and POPsrc='D'
order by gvkey, fyear;
quit;
RUN;
I have all the data saved in my library in sas7bdat format. I can collect my required data using the above code but runtime is slow. Is there any efficient method so that it takes less time to get the data?
Try a simple proc sort with a KEEP= and WHERE= dataset option:
proc sort
data=mydat.funda (
keep=
gvkey fyear datadate SIC fic curcd cogs SALE CSHO
AQC AT ACT CAPX CH CHE LT LCT DLC DLTT DP DV DVC dvp
SPPIV IPODATE CONSOL INDFMT DATAFMT POPsrc
where=(
fyear between &year1 and &year2
and CONSOL='C'
and INDFMT='INDL'
and DATAFMT='STD'
and POPsrc='D'
)
)
out=comp0 (
drop=CONSOL INDFMT DATAFMT POPsrc
)
;
by gvkey fyear;
run;
Please supply some information:
Run both codes with
options fullstimer;
and post the log of both.
Given the data set name funda, and the list of variables you have provided, you are using the Compustat fundamental annuals data set. If you are using it at WRDS, it is already sorted by GVKEY/datadate, which is almost the same as gvkey/fyear. This reduces the burden of the sort process, but neither proc sort nor proc sql know, so both of those procs probably create intermediate datasets to generate the final sorted product.
But you can write code to take advantage of this knowledge, effectively reading in all the data for one gvkey at a time, sorting it in memory and outputting the sorted results, again one gvkey at a time. This avoids the creation of the intermediate data sets used by proc sort and proc sql. That's a lot of saved input/output activity.
You can do all this in a data step, using an ordered hash object as below. This program reads in all the qualifying records for a gvkey, and puts them in an ordered hash. At the end of each gvkey, it then retrieves data from the ordered hash, starting at the top, and outputs each retrieved set of values. Then clear the hash in preparation for the next gvkey:
data comp0;
set mydat.funda (keep=
gvkey fyear datadate SIC fic curcd cogs SALE CSHO
AQC AT ACT CAPX CH CHE LT LCT DLC DLTT DP DV DVC dvp
SPPIV IPODATE CONSOL INDFMT DATAFMT POPsrc);
by gvkey;
where fyear between &year1 and &year2
and CONSOL='C' and INDFMT='INDL'
and DATAFMT='STD' and POPsrc='D';
if _n_=1 then do;
declare hash h (dataset:'mydat.funda (obs=0
keep= gvkey fyear datadate SIC fic curcd cogs SALE CSHO
AQC AT ACT CAPX CH CHE LT LCT DLC DLTT DP DV DVC dvp
SPPIV IPODATE CONSOL INDFMT DATAFMT POPsrc)'
,ordered:'A');
h.definekey('fyear');
h.definedata(all:'Y');
h.definedone();
declare hiter hi ('h');
end;
h.add();
if last.gvkey;
do while (hi.next()=0);
output;
end;
rc=h.clear();
run;
H is a hash object (think lookup table for your purposes). It has the attribute ordered:'A', telling sas that its contents will be sorted by the declared hash keys ('fyear' here). HI is a hash iterator, which provides a technique to step through each dataitem (i.e. each "row") in the hash. That's where you see the "hi.next()" method being used. hi.next() starts at the top of the hash and returns a zero as long as it is successful. Once it steps past the last dataitem in the hash it returns a non-zero, so h can be cleared in preparation for the next gvkey.
There are further techniques that can save some time, in creating a data set view, instead of (or in addition to) a data set file. But see whether this approach speeds up things first.
Edit note: notice I used an "obs=0" dataset name parameter in the declare hash statement. That's because I didn't want the hash object to pre-load data from mydata.funda, but I did want it to look at the dataset variable names and attributes in the variables of interest.
Hi @abdulla Your question though stated in simple terms, actually leads to utilization of more far reaching advanced concepts/techniques that pertains to performance and efficiency.
Your code, however seems a perfect candidate for application of Indexes particularly for 'Where' subsetting or filter. The idea is to move away from sequential processing or in other words- direct access using B-Tree binary search, locate and retrieve.
To apply indexes, you would however need to do some home work to weigh the cost and benefits associated with it. Basically what this means is 1. Knowing your data 2. Identifying Candidates for indexes 3. Frequency of usage to determine the worth 4. maintenance methodology and so forth.
I'd encourage you to get acquainted with the concept of indexes and see if your use case presents an ideal scenario for its implementation. In my humble opinion, your code seems to really suggest a great potential for considering indexes.
And hey, if you are new to indexes, at least you know that such a thing exists and is worth learning. Best!
Depending on whether the dataset is indexed on FYEAR and it might be better to use IN rather than BETWEEN so that you are doing equality tests instead of inequality tests.
where fyear in ( &year1 : &year2)
How is this table used? And how many times?
What approximate proportion of the rows does this where clause represent?
Are (some of) these selection criteria used often when querying the table?
Try SPDE engine.
libname temp spde 'c:\temp';
options user=temp;
data funda;
set mydat.funda;
run;
%let year1=1970; %let year2=2020;
proc sql; create table comp0 as
select gvkey, fyear, datadate, SIC, fic, curcd, cogs, SALE, CSHO, AQC, AT, ACT, CAPX, CH, CHE,LT,LCT,
DLC, DLTT, DP, DV, DVC, dvp, SPPIV, IPODATE
from funda /*<------*/
where fyear between &year1 and &year2
and CONSOL='C'
and INDFMT='INDL'
and DATAFMT='STD'
and POPsrc='D'
order by gvkey, fyear;
quit;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.