How to merge datasets by closest time using a hash table

Accepted Solution Solved
Reply
New Contributor
Posts: 2
Accepted Solution

How to merge datasets by closest time using a hash table

Hi, I have a very simple dataset on SAS 9.4 that has time and the best bid/best offer (BB/BO) at that time. The BB/BO at each time is the prevailing BB/BO until the next observation. I am trying to take a large dataset and merge it back onto itself so that it has the prevailing BB and BO 5 minutes into the future, i.e. I want the observation at at 11:35:00 to have the prevailing BB/BO at 11:40:00. This is only problematic because there is no observation at 11:40:00, and the observations happen in irregular intervals. Since my data is large, I am trying to use hash tables to maximize my efficiency. Here is an example of the dataset:

 

timeBBBO
11:35:00 AM71.571.56
11:36:01 AM71.5171.57
11:36:35 AM71.5371.58
11:36:48 AM71.5571.59
11:37:02 AM71.5371.58
11:37:55 AM71.5371.57
11:38:02 AM71.5571.59
11:38:22 AM71.5371.56
11:39:32 AM71.5271.56
11:39:58 AM71.571.56
11:40:01 AM71.5671.59
11:40:31 AM71.5671.58
11:41:05 AM71.5671.58
11:41:42 AM71.5671.58
11:41:51 AM71.5671.57
11:42:03 AM71.5671.57
11:42:05 AM71.5671.57
11:43:35 AM71.5671.57
11:44:41 AM71.5671.57
11:44:45 AM71.5671.57

 

 

In the end, I want the observation at 11:35:00 to be matched up with the BB/BO in the observation whose time is 11:39:58, since that contained the prevailing BB/BO at 11:40:00 (even though the 11:40:01 observation is closer in time to 11:40:00, it was not the prevailing BB/BO at 11:40:00). I have created the following code to do this. It gives me no errors or warnings, it just says that the output dataset has 0 observations.

 

The variables I'll have in the end are: time, timep5(time plus 5 minutes), time_5(the time that the prevailing NBBO 5 minutes later was set), BB, BO, BB5 (prevailing BB 5 minutes later), and BO (prevailing BO 5 minutes later).

 

I attempt to do this by creating two queues (current time and time+5), and and move the time+5 queue forward until it is ahead of the regular time queue's variable timep5, which is set to exactly 5 minutes after the original time. Once the time+5 queue moves ahead, I take the previous observation, which must have been behind the timep5 variable, and output it.

 

Any idea where I'm going wrong? Thanks so much!

 

data time5;

      length tid time timep5 time_5 BB BO BB5 BO5 8;

      declare hash time1(ordered:'a');

            time1.defineKey('tid');

            time1.defineData('time','timep5','BB','BO');

            time1.defineDone();

      declare hash timef(ordered:'a');

            timef.defineKey('tid');

            timef.defineData('time_5','BB5','BO5');

            timef.defineDone();

      declare hiter timereg('time1');

      declare hiter timef5('timef');

 

      do until(last_time);

            set nbbo1c end=last_time;

            by time;

            tid+1;

            if '09:30:00't <= time <= '04:05:00't then do;

                  time_5=time;

                  BB5=BB;

                  BO5=BO;

                  timef.add();

                  if '09:30:00't <= time <= '04:00:00't then do;

                        time=time;

                        timep5=time+hms(0,5,0);

                        BB=BB;

                        BO=BO;

                        time1.add();

                  end;

            end;

            else delete;

      end;

      keep time timep5 bb bo time_5 BB5 BO5;

      format time timep5 time_5 time12.3 BB BO BB5 BO5 8.2;

 

      t1=timereg.first();

      t5=timef5.first();

      do while (t1=0 and t5=0);

            select;

                  when (time_5<timep5) do;

                        t5=timef5.next();

                  end;

                  when (time_5=timep5) do;

                        time=time;

                        output;

                        t1=timereg.next();

                  end;

                  when (time_5>timep5) do;

                        time=time;

                        t5=timef5.prev();

                        output;

                        t1=timereg.next();

                  end;

                  otherwise;

            end;

      end;

run;

 


Accepted Solutions
Solution
‎11-29-2016 04:54 PM
Super User
Posts: 787

Re: How to merge datasets by closest time using a hash table

I've rethought this, and realize there is no need for an array, or hash.  Instead, just recognize that the lagged value  of the first time greater than or equal to 5 minutes plus current time, must be the closest time preceding the 5 minute target.

 

So

 

data have;
  infile datalines truncover dlm=',' dsd;
  input tm:time. (BB BO) (:best32.);
  format tm time8.;
  datalines;
10:35:00,71.5,71.56
10:36:01,71.51,71.57
10:36:35,71.53,71.58
10:36:48,71.55,71.59
10:37:02,71.53,71.58
10:37:55,71.53,71.57
10:38:02,71.55,71.59
10:38:22,71.53,71.56
10:39:32,71.52,71.56
10:39:58,71.5,71.56
10:40:01,71.56,71.59
10:40:31,71.56,71.58
10:41:05,71.56,71.58
11:41:42,71.56,71.58
11:41:51,71.56,71.57
11:42:03,71.56,71.57
11:42:05,71.56,71.57
11:43:35,71.56,71.57
11:44:41,71.56,71.57
11:44:45,71.56,71.57
11:54:45,88,88
11:55:45,99,99
;
run;

data want (drop=target_tm next_: );

  set have;
  target_tm=tm+300;

  do while (next_tm<target_tm);
    if out_of_time=0 then set have (keep=tm bb bo 
              rename=(tm=next_tm bb=next_bb bo=next_bo))
              end=out_of_time;
    else next_tm='24:00:00't;
    before_5min_tm=lag(next_tm);
	before_5min_bb=lag(next_bb);
	before_5min_bo=lag(next_bo);
  end;
  retain next_: before_5min_: ;
  format before_5min_tm time8.0;
run;

View solution in original post


All Replies
Super User
Posts: 787

Re: How to merge datasets by closest time using a hash table

[ Edited ]

This is not a problem I would solve with hash objects, although it can be done.  Instead, I'd suggest arrays with one element for each second of the day.  Key-indexing these arrays with time is faster than hash lookups, and in this case, less complex.

 

You want the BB/BO values that are found at a time closest to five minutes into the future (TARGET_TIME).

 

The program is untested, but is supposed to do the following:

  1. Declare arrays _BB and _BO with lower bound corresponding to 00:00:00 and upper bound at 24:00:00.   Since time is valued at seconds after midnight, lower bound of both arrays is 0, upper bound is 86400
     
  2. Read a single BB/BO record with a SET statement.  Calculate a target time at 5 minutes later.

  3. In a loop, read a sequence of records with a second SET until a first occurance of time (renamed_to NEXT_TIME) >=target time  is found.
    Inside the loop:
    1. Record each BB/BO in the appropriate elements of _BO/_BB.
    2. Keep a lagged value of NEXT_TIME, in variable BEFORE_TARGET
  4. After the loop use the array values for either NEXT_TIME or BEFORE_TARGET, whichever is closer to target

 

Editted to correct some typing errors and added a RETAIN statement 

 

data want (drop=target_time before_target next_: );
  array _bb {0:86400} _temporary_;  /*best bid by time*/
  array _bo {0:86400} _temporary_;  /*best offer "    */

  set have;
  target_time=time+300;

retain next_time before_target; do while (next_time<target_time); if out_of_time=0 then set have (keep=time bb bo rename=(time=next_time bb=next_bb bo=next_bo)) end=out_of_time; else next_time='24:00:00't; before_target=lag(next_time); _bb{next_time}=next_bb; _bo{next_time}=next_bo; end; ** At this point next_time>=target_time but before_target<target_time**; if (next_time-target_time) < (target_time-before_target) then future_time=next_time; else future_time=before_target; format future_time hhmmss8.0 ; future_bo=_bo{future_time}; future_bb=_bb{future_time}; run;

 

 If the data are sorted by SYMBOL/TIME, it would be relatively straightforward to accomodate.

 

 

 

 

 

Respected Advisor
Posts: 3,823

Re: How to merge datasets by closest time using a hash table

[ Edited ]

As you only need some sort of look ahead I believe it's easier to go with a DOW loop approach.

data have;
  infile datalines truncover dlm=',' dsd;
  input tm:time. (BB BO) (:best32.);
  format tm time8.;
  datalines;
10:35:00,71.5,71.56
10:36:01,71.51,71.57
10:36:35,71.53,71.58
10:36:48,71.55,71.59
10:37:02,71.53,71.58
10:37:55,71.53,71.57
10:38:02,71.55,71.59
10:38:22,71.53,71.56
10:39:32,71.52,71.56
10:39:58,71.5,71.56
10:40:01,71.56,71.59
10:40:31,71.56,71.58
10:41:05,71.56,71.58
11:41:42,71.56,71.58
11:41:51,71.56,71.57
11:42:03,71.56,71.57
11:42:05,71.56,71.57
11:43:35,71.56,71.57
11:44:41,71.56,71.57
11:44:45,71.56,71.57
11:54:45,88,88
11:55:45,99,99
;
run;


data want;

  set have end=last;

  if last then
    do;
      call missing(tm_next, bb_next, bo_next);      
    end;

  else 
    do;
      _i=_n_+2;
      do while(_i-1<=_nobs);
        if _i-2<=_nobs then
          set have(keep=tm bb bo rename=(tm=tm_next bb=bb_next bo=bo_next)) point=_i nobs=_nobs;
        if tm_next-tm>300 or _i-1=_nobs then
          do;
            _i=_i-1;
            set have(keep=tm bb bo rename=(tm=tm_next bb=bb_next bo=bo_next)) point=_i;
            diff=tm_next-tm;
            if tm_next-tm>300 then call missing(tm_next, bb_next, bo_next);
            leave;
          end;
        _i+1;
      end;
    end; 

run;

 

Grand Advisor
Posts: 9,567

Re: How to merge datasets by closest time using a hash table

You didn't post the output yet .



data have;
infile cards truncover expandtabs;
input time : time10. x $	BB	BO;
format time time10.;
cards;
11:35:00 AM	71.5	71.56
11:36:01 AM	71.51	71.57
11:36:35 AM	71.53	71.58
11:36:48 AM	71.55	71.59
11:37:02 AM	71.53	71.58
11:37:55 AM	71.53	71.57
11:38:02 AM	71.55	71.59
11:38:22 AM	71.53	71.56
11:39:32 AM	71.52	71.56
11:39:58 AM	71.5	71.56
11:40:01 AM	71.56	71.59
11:40:31 AM	71.56	71.58
11:41:05 AM	71.56	71.58
11:41:42 AM	71.56	71.58
11:41:51 AM	71.56	71.57
11:42:03 AM	71.56	71.57
11:42:05 AM	71.56	71.57
11:43:35 AM	71.56	71.57
11:44:41 AM	71.56	71.57
11:44:45 AM	71.56	71.57
;
run;
data temp;
 set have;
 time=time-'00:05:00't;
run;
data want;
 set temp have(in=inb);
 by time;
 lag_time=lag(time);
 lag_bb=lag(bb);
 lag_bo=lag(bo);
 
 if inb then do;
  prevail_time=lag_time+'00:05:00't;
  prevail_bb=lag_bb;
  prevail_bo=lag_bo;
  output;
 end;
format prevail_time time10.;
drop lag_:;
run;
 

Solution
‎11-29-2016 04:54 PM
Super User
Posts: 787

Re: How to merge datasets by closest time using a hash table

I've rethought this, and realize there is no need for an array, or hash.  Instead, just recognize that the lagged value  of the first time greater than or equal to 5 minutes plus current time, must be the closest time preceding the 5 minute target.

 

So

 

data have;
  infile datalines truncover dlm=',' dsd;
  input tm:time. (BB BO) (:best32.);
  format tm time8.;
  datalines;
10:35:00,71.5,71.56
10:36:01,71.51,71.57
10:36:35,71.53,71.58
10:36:48,71.55,71.59
10:37:02,71.53,71.58
10:37:55,71.53,71.57
10:38:02,71.55,71.59
10:38:22,71.53,71.56
10:39:32,71.52,71.56
10:39:58,71.5,71.56
10:40:01,71.56,71.59
10:40:31,71.56,71.58
10:41:05,71.56,71.58
11:41:42,71.56,71.58
11:41:51,71.56,71.57
11:42:03,71.56,71.57
11:42:05,71.56,71.57
11:43:35,71.56,71.57
11:44:41,71.56,71.57
11:44:45,71.56,71.57
11:54:45,88,88
11:55:45,99,99
;
run;

data want (drop=target_tm next_: );

  set have;
  target_tm=tm+300;

  do while (next_tm<target_tm);
    if out_of_time=0 then set have (keep=tm bb bo 
              rename=(tm=next_tm bb=next_bb bo=next_bo))
              end=out_of_time;
    else next_tm='24:00:00't;
    before_5min_tm=lag(next_tm);
	before_5min_bb=lag(next_bb);
	before_5min_bo=lag(next_bo);
  end;
  retain next_: before_5min_: ;
  format before_5min_tm time8.0;
run;
New Contributor
Posts: 2

Re: How to merge datasets by closest time using a hash table

Thanks all for your help - very helpful solutions. The processing time on this program was .01 seconds for my datasets and it did exactly what I needed. My only note is that if there are repeat times (in my example dataset there weren't, but in case there were), it is necessary to add other variables to the retain statement so that repeated times aren't blank. Thanks for your help!

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 5 replies
  • 229 views
  • 0 likes
  • 4 in conversation