DATA Step, Macro, Functions and more

reduce running time that works for a large dataset saved in a remote library

Reply
Occasional Contributor
Posts: 16

reduce running time that works for a large dataset saved in a remote library

Hi,

I am trying to work on a large dataset in total of approximately 40TB. I can access to the datasets by connecting to a server that has all the datasets saved as sas dataset format. Here I am trying to find an efficient coding that reduces running time as much as possible. The only output I need is the group difference for a couple of variables such as 'count' and the group mean such as 'meanspread'  that I compute using proc means at the end of the attached code. In order to reduce the time, I tried to use 'rsubmit' and 'SAS Data Views'. However, it still takes way too long and any comment or feedback that helps to reduce the running time is truly appreciated at this moment.

FYI, the name of the dataset in the remote library 'taq' is composed of two parts: cq_(or ct_) + date for the dataset, e.g., cq_19930102, cq_19930103,... :

libname mydir 'C:\sasdata\taq';

%let wrds=wrds.wharton.upenn.edu 4016;

options comamid=TCP remote=WRDS;

signon username=_prompt_;

rsubmit;

libname taq '/wrds/taq/sasdata';

data quotes/view=quotes;

set taq.cq_: ;

run;

data trades/view=trades;

set taq.ct_: ;

run;

data tradesquotes/view=tradesquotes;

set trades quotes ;

by symbol date time type;

run;

data tr_qt/view= tr_qt;

set tradesquotes;

lag_date=lag(date);

if date^=lag_date or symbol^=lag_symbol then do;

retain prior_qtime_adjusted;

if type='T' then output;

keep symbol date time price size prior_bid prior_ofr;

run;

data tr_qt2/view= tr_qt2;

set tr_qt;

lagprice1=lag(price);

mid=(prior_ofr+prior_bid)/2;

if price>mid then order=1;

else if price<mid then order=-1;

run;

proc means data=tr_qt2 noprint;

var size dollars;

class month order;

output out=output_raw n=count sum(size dollars)=sumsize sumdollars mean(spread)=meanspread;

run;

data outputcut;

set output_raw;

where _type_=3;

run;

proc transpose data=outputcut out=output_num;

by month;

var count;

run;

proc transpose data=outputcut out=output_sh;

by month;

var sumsize;

run;

data tmp_OIBNUM(keep=month OIBNUM);

set output_num;

OIBNUM=COL3-COL1;

run;

data tmp_OIBSH(keep=month OIBSH);

set output_sh;

OIBSH=COL3-COL1;

run;

data tmp_LIQ (keep=month QSPR NUMTRANS VOL);

set output_raw;

where _type_=2;

QSPR=meanspread;

NUMTRANS=count;

VOL=sumdollars;

run;

data finalout;

merge tmp_OIBNUM tmp_OIBSH tmp_LIQ;

by month;

run;

proc download data=finalout out=mydir.final;

run;

proc print data=finalout (obs=100);

run;

endrsubmit;

PROC Star
Posts: 7,492

Re: reduce running time that works for a large dataset saved in a remote library

You must not have posted all of your code as the code you posted would have crashed because it contained a do loop without an end statement.

I'm not sure if using a view is actually going to help, but I'll leave that for others to comment on.  However, what I will comment on is the fact that you appear to be doing three, four or more unnecessary reads of the large datasets.

Since you didn't include any sort statements can we presume that the data, as input, is already in the correct sort order?

Hard to say without knowing the full code or having an example dataset, but it appears that everything leading up to your proc means could be replaced with one datastep .. something like:

data tr_qt2 (keep=symbol date time price size

                  prior_bid prior_ofrtr_qt2

                  lagprice1 mid order);

  set taq.cq_:  taq.ct_: ;

  by symbol date time type;

  lag_date=lag(date);

/*  if date^=lag_date or symbol^=lag_symbol then do;*/

/*  ???? Do what?*/

/*  ???? end with an end; statement;*/

  if type='T';

  lagprice1=lag(price);

  mid=(prior_ofr+prior_bid)/2;

  if price>mid then order=1;

  else if price<mid then order=-1;

/* ????  what to do if price eq mid*/

run;

Trusted Advisor
Posts: 2,116

Re: reduce running time that works for a large dataset saved in a remote library

To follow on to Art's comments, DATA VIEWs save on disk space at the expense of execution time, as every time the data are accessed the computations have to be redone.  Arts approach gets rid of the need for views altogether.

One caution on the use of lag(price).  As this is after a subsetting IF, you may or may not be getting the value you want.  This note from the LAG function documentation is pertinent:  "Note:   Storing values at the bottom of the queue and returning values from the top of the queue occurs only when the function is executed. An occurrence of the LAGn function that is executed conditionally will store and return values only from the observations for which the condition is satisfied."  [Art's program will give the same lag sequence that yours did, it's just a general caution.]

Doc Muhlbaier

Duke

Ask a Question
Discussion stats
  • 2 replies
  • 589 views
  • 6 likes
  • 3 in conversation