Solved: Re: Proc SQL Taking a lot of Time. Am I doing anything wrong?

rkdubey84 · Posted 03-04-2017 11:59 AM

I am running the below code for a data-set containing 708045 Observations for 2 variables. And Its Taking approx 41 minutes to run.

Is it reasonable? Am I doing something wrong?

%Macro HILO(T);
Data &T._ID; Set DA.&T._Final;
Keep DTM1 TradePrice;
run;

proc sql;
create table &T.HILO as
select *,
 (select Max(TradePrice) from &T._ID where dtm1=a.dtm1+'00:01:00't ) as HI1,
 (select Max(TradePrice) from &T._ID where dtm1=a.dtm1+'00:05:00't ) as HI5,
 (select Min(TradePrice) from &T._ID where dtm1=a.dtm1+'00:01:00't ) as LO1,
 (select Min(TradePrice) from &T._ID where dtm1=a.dtm1+'00:05:00't ) as LO5, 
 (select Count(*)from &T._ID where dtm1=a.dtm1+'00:01:00't ) as Count1,
 (select Count(*)from &T._ID where a.dtm1 <= dtm1 <=a.dtm1+'00:05:00't ) as Count5
 from &T._ID as a;
quit;

Data DA.&T.HILO; Set &T.HILO;
PVol_1min = SQRT(1/Count1*(log(Hi1/Lo1)**2));
PVol_5min = SQRT(1/Count5*(log(Hi5/Lo5)**2));
run;

Data Analysis.&T._Final; 
Merge DA.&T.HILO DA.&T._Final;
By DTM1;
run;

proc datasets lib=work
 nolist kill;
quit;
run; 

%Mend;

%HILO(ACC);

Ksharp · Posted 03-05-2017 03:02 AM

OK. Assuming the interval between two neighbour obs always be one minute.

proc import datafile='/folders/myfolders/acc_test.txt' out=have dbms=tab replace;
run;
proc sql;
create table key as
 select dtm1,max(tradeprice) as max,min(tradeprice) as min,count(tradeprice) as count 
  from have
   group by dtm1;
quit;
data want;
 if _n_=1 then do;
  if 0 then set key;
  declare hash h(dataset:'key');
  h.definekey('dtm1');
  h.definedata('max','min','count');
  h.definedone();
 end;
set have;
call missing(max,min,count);
k=dtm1+'00:01:00't;
rc=h.find(key:k);
L1=min;H1=max;COUNT1=count;

count5=0;
do i=0 to 5;
call missing(max,min,count);
k=dtm1+'00:01:00't*i;
rc=h.find(key:k);
L5=min;H5=max;COUNT5+count;
end;

drop i k rc max min count;
run;

View solution in original post

Kurt_Bremser · Posted 03-04-2017 02:19 PM

How fine is your timeseries? I'd start by building the max, min and count values per smallest time unit, sort by time unit, and calculate the -1 and -5 minute times. Then a data step merge where you merge by the different time values would do it.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

rkdubey84 · Posted 03-04-2017 02:35 PM

Hi @Kurt_Bremser,

Thanks for responding. It would be great if you can give me some hint about how to go about building the max, min and count values per smallest time unit, sort by time unit, and calculate the -1 and -5 minute times.

This is how my dataset looks like:

DTM1 TradePrice

6-3-13 9:15	1218
6-3-13 9:15	1218.05
6-3-13 9:15	1219.55
6-3-13 9:15	1219.6
6-3-13 9:15	1219.7
6-3-13 9:15	1219.7
6-3-13 9:15	1218
6-3-13 9:15	1218
6-3-13 9:15	1218
6-3-13 9:15	1218
6-3-13 9:15	1218
6-3-13 9:16	1219.25
6-3-13 9:16	1218.05
6-3-13 9:16	1218.05
6-3-13 9:16	1218
6-3-13 9:16	1219
6-3-13 9:16	1219
6-3-13 9:16	1219

6-3-13 9:16	1219.3
6-3-13 9:16	1218.8
6-3-13 9:16	1218.25
6-3-13 9:16	1218.2
6-3-13 9:17	1218
6-3-13 9:17	1218
6-3-13 9:17	1218.5
6-3-13 9:17	1218.55
6-3-13 9:17	1218.8

Thanks in advance.

Ritesh

Kurt_Bremser · Posted 03-04-2017 02:38 PM

If you have a SAS dataset, post it in a data step with datalines. Excel files cannot preserve the structure of SAS datasets.

This is important as I need to know how your datetime values are stored in SAS.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

rkdubey84 · Posted 03-04-2017 02:44 PM

Hi @Kurt_Bremser,

Attaching the SAS Dataset Pic as Unable to attach the data.

rkdubey84 · Posted 03-04-2017 03:02 PM

HI @Kurt_Bremser;

Attaching the Txt file for Datalines statement as well.

Thanks.

Ritesh

Kurt_Bremser · Posted 03-04-2017 04:09 PM

proc sort data=have;
by dtm1;
run;

data have2;
set have (keep=dtm1 tradeprice);
by dtm;
retain count max_price min_price;
if first.dtm1
then do;
  count = 0;
  max_price = 0;
  min_price = 99999999999;
end;
count + 1;
max_price = max(tradeprice,max_price);
min_price = min(tradeprice,min_price);
if last.dtm1 
then do;
  time1 = intnx('minute',dtm1,-1);
  time5 = intnx('minute',dtm1,-5);
  output;
end;
drop tradeprice;
run;

data want;
merge
  have1 (in=1 drop=time1 time5)
  have1 (drop=dtm1 time5 rename=(time1=dtm1 count=count1 max_price=max_price1 min_price=min_price1))
  have1 (drop=dtm1 time1 rename=(time5=dtm1 count=count5 max_price=max_price5 min_price=min_price5))
;
by dtm1;
if a;
run;

Untested (no SAS available on weekends).

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

rkdubey84 · Posted 03-04-2017 04:22 PM

This one is pretty quick. I got few errors in the last step while merging.

294 data ACC_HILO;
295 merge
296 ACC_New (in=1 drop=time1 time5)
-
24
24
ERROR 24-322: Variable name is not valid.

297 ACC_New (drop=dtm1 time5 rename=(time1=dtm1 count=count1 max_price=max_price1
297! min_price=min_price1))
298 ACC_New (drop=dtm1 time1 rename=(time5=dtm1 count=count5 max_price=max_price5
298! min_price=min_price5))
299 ;
ERROR: Variable time1 is not on file WORK.ACC_NEW.
ERROR: Variable count is not on file WORK.ACC_NEW.
ERROR: Variable max_price is not on file WORK.ACC_NEW.
ERROR: Variable min_price is not on file WORK.ACC_NEW.
ERROR: Invalid DROP, KEEP, or RENAME option on file WORK.ACC_NEW.
ERROR: Variable time5 is not on file WORK.ACC_NEW.
ERROR: Variable count is not on file WORK.ACC_NEW.
ERROR: Variable max_price is not on file WORK.ACC_NEW.
ERROR: Variable min_price is not on file WORK.ACC_NEW.
ERROR: Invalid DROP, KEEP, or RENAME option on file WORK.ACC_NEW.
300 by dtm1;
301 if a;
302 run;

NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.ACC_HILO may be incomplete. When this step was stopped there were 0
observations and 1 variables.
WARNING: Data set WORK.ACC_HILO was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds

Kurt_Bremser · Posted 03-04-2017 04:34 PM

Please post your log of my first datastep.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

rkdubey84 · Posted 03-04-2017 04:39 PM

Hi @Kurt_Bremser, Thanks a ton for the help.


proc sort data=DA.ACC_Final;
by dtm1;
run;

Data ACC_New; Set DA.ACC_Final;
run;

data ACC_Test;
set DA.ACC_Final (keep=dtm1 tradeprice);
by dtm1;
retain count max_price min_price;
if first.dtm1
then do;
  count = 0;
  max_price = 0;
  min_price = 99999999999;
end;
count + 1;
max_price = max(tradeprice,max_price);
min_price = min(tradeprice,min_price);
if last.dtm1 
then do;
  time1 = intnx('minute',dtm1,-1);
  time5 = intnx('minute',dtm1,-5);
  output;
end;
drop tradeprice;
run;

data ACC_HILO;
merge
  ACC_New (in=1 drop=time1 time5)
  ACC_New (drop=dtm1 time5 rename=(time1=dtm1 count=count1 max_price=max_price1 min_price=min_price1))
  ACC_New (drop=dtm1 time1 rename=(time5=dtm1 count=count5 max_price=max_price5 min_price=min_price5))
;
by dtm1;
if a;
run;

Log:

303
304 proc sort data=DA.ACC_Final;
305 by dtm1;
306 run;

NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

307
308 Data ACC_New; Set DA.ACC_Final;
309 run;

NOTE: There were 708045 observations read from the data set DA.ACC_FINAL.
NOTE: The data set WORK.ACC_NEW has 708045 observations and 47 variables.
NOTE: DATA statement used (Total process time):
real time 0.27 seconds
cpu time 0.26 seconds

310
311 data ACC_Test;
312 set DA.ACC_Final (keep=dtm1 tradeprice);
313 by dtm1;
314 retain count max_price min_price;
315 if first.dtm1
316 then do;
317 count = 0;
318 max_price = 0;
319 min_price = 99999999999;
320 end;
321 count + 1;
322 max_price = max(tradeprice,max_price);
323 min_price = min(tradeprice,min_price);
324 if last.dtm1
325 then do;
326 time1 = intnx('minute',dtm1,-1);
327 time5 = intnx('minute',dtm1,-5);
328 output;
329 end;
330 drop tradeprice;
331 run;

NOTE: There were 708045 observations read from the data set DA.ACC_FINAL.
NOTE: The data set WORK.ACC_TEST has 23211 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.17 seconds
cpu time 0.17 seconds

332
333 data ACC_HILO;
334 merge
335 ACC_New (in=1 drop=time1 time5)
-
24
24
ERROR 24-322: Variable name is not valid.

336 ACC_New (drop=dtm1 time5 rename=(time1=dtm1 count=count1 max_price=max_price1
336! min_price=min_price1))
337 ACC_New (drop=dtm1 time1 rename=(time5=dtm1 count=count5 max_price=max_price5
337! min_price=min_price5))
338 ;
ERROR: Variable time1 is not on file WORK.ACC_NEW.
ERROR: Variable count is not on file WORK.ACC_NEW.
ERROR: Variable max_price is not on file WORK.ACC_NEW.
ERROR: Variable min_price is not on file WORK.ACC_NEW.
ERROR: Invalid DROP, KEEP, or RENAME option on file WORK.ACC_NEW.
ERROR: Variable time5 is not on file WORK.ACC_NEW.
ERROR: Variable count is not on file WORK.ACC_NEW.
ERROR: Variable max_price is not on file WORK.ACC_NEW.
ERROR: Variable min_price is not on file WORK.ACC_NEW.
ERROR: Invalid DROP, KEEP, or RENAME option on file WORK.ACC_NEW.
339 by dtm1;
340 if a;
341 run;

NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.ACC_HILO may be incomplete. When this step was stopped there were 0
observations and 1 variables.
WARNING: Data set WORK.ACC_HILO was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds

rkdubey84 · Posted 03-04-2017 04:42 PM

335 ACC_New (in=1 drop=time1 time5)
-
24
24
This one referred to the number 1 in the code.

Kurt_Bremser · Posted 03-04-2017 05:22 PM

You have to usd ACC_TEST in the merge.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

rogerjdeangelis · Posted 03-04-2017 09:29 PM

Stock price gyrations in 1 an 5 minute windows

https://goo.gl/JaA6wE
https://communities.sas.com/t5/forums/replypage/board-id/programming/message-id/76933

Not sure I created data like yours, maybe this will help?

I doubled the size of your input?

I don't have my 'beast' computer. This was done on an Dell E6400 dual code DDR2 2.2ghz 8gb ram(circa 2010)?

MAX and Mins --> 9.18 seconds
5 minute durations 0.30 seconds
1 minute durations 0.20 seconds

Did not do the rest but they should ve very fast.

I assummed the dtm1 was sorted and used left joins instead of outer joins


HAVE   (sequence of minutes)
====

Datetimes in seconds (no loss of generality using seconds)

to 40 obs WORK.T_ID total obs=1,416,090

 Obs       DTM1       TRADEPRICE

   1    1804269812         2
   2    1804269816        57
   3    1804269817        39
   4    1804269818        39
   5    1804269819        39
   6    1804269820        39
   7    1804269824        61
   8    1804269828        70
   9    1804269832        92

  10    1804269836        81  This group is less than 5 minutes apart
  11    1804269837        12
  12    1804269838        12
  13    1804269839        12
  14    1804269840        12

  15    1804269844        36
  16    1804269848        49
  17    1804269852        51

  18    1804269856        30  5 that are less than or equal to 1 minute apart
  19    1804269857        80
  20    1804269858        80
  21    1804269859        80
  22    1804269860        80

  23    1804269864         3
  24    1804269868        70
  25    1804269872        21


WANT
====

Max(TradePrice) where l.dtm1=r.dtm1+'00:05:00't ) as HI5
Min(TradePrice) where  l.dtm1=r.dtm1+'00:05:00't ) as LO5
Max(TradePrice) where  l.dtm1=r.dtm1+'00:01:00't ) as HI1
Min(TradePrice) where  l.dtm1=r.dtm1+'00:01:00't ) as LO1


Up to 40 obs WORK.T_HILO total obs

Obs    HI5    LO5    HI1    LO1

 1      99     0      99     0


Up to 40 obs WORK.T_5SEC_LAP total obs=1

Obs    COUNT5 (counts each of the durrations <=5?)

 1     590038

Up to 40 obs WORK.T_1SEC_LAP total obs=1

Obs    COUNT1  (counts each of the durrations <=1?)

 1     354023

WORKING CODE
============

T_ID as l left join t_id(drop=tradeprice) as r on l.dtm1=r.dtm1+'00:05:00't

* create some data
==================

Data T_ID (sortedby=dtm1 keep=dtm1 tradeprice);
retain dtm1 0;
start=int(datetime());
dtm1=start;
do dtm1= start to start+5*708045 by 4;
     tradeprice=int(100*uniform(5731));
     savdtm1=dtm1;
     if mod(dtm1,1000)=0 then do;
        dtm1=savdtm1-3;
        output;
        dtm1=savdtm1-2;
        output;
        dtm1=savdtm1-1;
        output;
     end;
     dtm1=savdtm1;
     output;
end;
drop start;
run;quit;

MAX/MINS
=========

proc sql;
create table T_HILO as
 select
  (select Max(TradePrice) from T_ID as l left join t_id(drop=tradeprice) as r on l.dtm1=r.dtm1+'00:05:00't ) as HI5
 ,(select Min(TradePrice) from T_ID as l left join t_id(drop=tradeprice) as r on l.dtm1=r.dtm1+'00:05:00't ) as LO5
 ,(select Max(TradePrice) from T_ID as l left join t_id(drop=tradeprice) as r on l.dtm1=r.dtm1+'00:01:00't ) as HI1
 ,(select Min(TradePrice) from T_ID as l left join t_id(drop=tradeprice) as r on l.dtm1=r.dtm1+'00:01:00't ) as LO1
from
  T_ID(obs=1)
;quit;

* Durations
===========

data t_5sec_lap(keep=cnt rename=cnt=count5);
  retain beg . cnt 0;
  set T_ID end=dne;
  if _n_=1 then beg=dtm1;
  else do;
    if dtm1 - beg <= 5 then do;
        cnt=cnt+1;
    end;
    else do;
       beg=dtm1;
    end;
  end;
  if dne then output;
run;quit;

data t_1sec_lap(keep=cnt rename=cnt=count1);;
  retain beg . cnt 0;
  set T_ID end=dne;
  if _n_=1 then beg=dtm1;
  else do;
    if dtm1 - beg <= 1 then do;
        cnt=cnt+1;
    end;
    else do;
       beg=dtm1;
    end;
  end;
  if dne then output;
run;quit;

mkeintz · Posted 03-04-2017 09:49 PM

The take home lesson of this topic is that it is difficult to get PROC SQL to take advantage of the order of observations in a data set, while @Kurt_Bremser's data step code does take advantage of the order.

In order to satisfy such conditions as

where dtm1=a.dtm1+'00:01:00't

sql has to effectively compare every record to every other record, to see if they are exactly 1 minute apart.

Kurt's code, on the other hand, knows that the order of the data set means that once a record is more than 1 minute after a reference record, then there is no need to check the remainder of the data set.

I am sure there are proc sql experts out there that can improve your code, but in my experience the data step approach has always been faster for matching ordered records based on the ordering variables.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ksharp · Posted 03-04-2017 10:57 PM

proc import datafile='/folders/myfolders/acc_test.txt' out=have dbms=tab replace;
run;
proc sql;
create table key as
 select dtm1,max(tradeprice) as max,min(tradeprice) as min,count(tradeprice) as count 
  from have
   group by dtm1;
quit;
data want;
 if _n_=1 then do;
  if 0 then set key;
  declare hash h(dataset:'key');
  h.definekey('dtm1');
  h.definedata('max','min','count');
  h.definedone();
 end;
set have;
call missing(max,min,count);
k=dtm1+'00:01:00't;
rc=h.find(key:k);
L1=min;H1=max;COUNT1=count;

call missing(max,min,count);
k=dtm1+'00:05:00't;
rc=h.find(key:k);
L5=min;H5=max;COUNT5=count;

drop k rc max min count;
run;

Registration is open

SAS Training: Just a Click Away