Help using Base SAS procedures

same trade times - large data sets

Accepted Solution Solved
Reply
Contributor
Posts: 56
Accepted Solution

same trade times - large data sets

The Datasets here are for example only, my real datasets are for up to 1.5 million observations.

I have 2 products 'A' and 'B' trading, trade_time and trade_price are recorded variables, so I have

DATA product_A;

   INPUT A_datetime datetime16. A_price $;

   DATALINES;

22JUN11:00:00:04    1.7

22JUN11:00:00:08 1.9

22JUN11:00:00:10 2.3

22JUN11:00:05:12 2.9

22JUN11:00:05:36 3.5

22JUN11:00:32:45 3.2

22JUN11:00:39:01 3.1

22JUN11:00:39:03 2.7

22JUN11:01:10:12 2.6

22JUN11:07:22:10 2.1

26JUN11:12:42:17 3.4

26JUN11:12:42:18 3.6

26JUN11:12:42:21 3.9

;

RUN;

DATA product_B;

   INPUT B_datetime datetime16. B_price $;

   DATALINES;

22JUN11:00:00:09    0.02

22JUN11:00:05:34    0.07

22JUN11:00:39:05    0.04

26JUN11:12:42:20    0.09

;

RUN;

Each time 'B' trades, i want to know what 'A' traded at, but only if 'A' traded within the last 5 seconds looking backward.

(i.e if B trades and there are no records in the last 5 seconds for A, then discard, even if 'A' trades one second afterwards, - looking backwards in time only)

So I want to create a new dataset called 'same_trade_times'. which would have the following data,...

DATA same_trade_times;

   INPUT B_datetime datetime16.

         B_price $

         A_datetime $

         A_price $

         ;

   DATALINES;

22JUN11:00:00:09   0.02   22JUN11:00:00:08    1.9

22JUN11:00:39:05   0.04   22JUN11:00:39:03    2.7

26JUN11:00:42:20   0.09   26JUN11:12:42:18    3.6

;

RUN;

any ideas on this one?


Accepted Solutions
Solution
‎03-26-2012 03:18 PM
Super User
Super User
Posts: 6,500

Re: same trade times - large data sets

Use MERGE and RETAIN.

data want ;

  merge A(in=ina rename=(a_datetime=datetime))

        B(in=inb rename=(b_datetime=datetime))

  ;

  by datetime;

  retain a_datetime b_datetime prev_a;

  if ina then a_datetime=datetime;

  if inb then b_datetime=datetime;

  if ina then prev_a = a;

  if inb and  0 <= (b_datetime - a_datetime) <= 5 then output;

run;

View solution in original post


All Replies
Solution
‎03-26-2012 03:18 PM
Super User
Super User
Posts: 6,500

Re: same trade times - large data sets

Use MERGE and RETAIN.

data want ;

  merge A(in=ina rename=(a_datetime=datetime))

        B(in=inb rename=(b_datetime=datetime))

  ;

  by datetime;

  retain a_datetime b_datetime prev_a;

  if ina then a_datetime=datetime;

  if inb then b_datetime=datetime;

  if ina then prev_a = a;

  if inb and  0 <= (b_datetime - a_datetime) <= 5 then output;

run;

Contributor
Posts: 56

Re: same trade times - large data sets

is that an 'in' or an 'if' at the start of the third line up from the bottom?

Super User
Super User
Posts: 6,500

Re: same trade times - large data sets

IF of course.  I have updated the previous post to fix the typo.

Super User
Posts: 5,083

Re: same trade times - large data sets

Tom,

I'm not sure you can rally MERGE these two.  The DATETIME values may not form an exact match.

I don't have the time to do this, but here is an approach (bare bones outline).  It seems a little strange at first to create 5-second intervals based on the Product_A data set, but logically it seems sound.  Of course, when it comes to setting up a format with 1M+ levels on a PC, there are no guarantees as to how much time it will take.

Use Product_A data to create a CNTLIN data set for PROC FORMAT.  The ranges would be from lag(a_datetime) to the MIN of either a_datetime or lag(a_datetime)+5.  The CNTLIN data set would need to use an extra variable to indicate whether the endpoint of the range was either included (if using lag(a_datetime)+5 as the endpoint) or excluded (if using a_datetime as the endpoint).  The format would use concatenated a_datetime + A_Price as the LABEL.  Attention would be needed to properly create the very final range, and possibly to set up the Other= definition when there is no match.

Applying the format would be easy:

data same_time_trades;

set product_b;

A_match = put(b_datetime, fmtname.);

Then split A_match into the datetime and the price.

Super User
Super User
Posts: 6,500

Re: same trade times - large data sets

There is no need to match the datetime values.  The RETAIN statements take care of that.  The same logic should work using SET with a BY statment to interleave the records.

Super User
Posts: 5,083

same trade times - large data sets

OK, got it.  And I like it!

You've probably posted this already, but I think you meant to code prev_a=a_price instead of prev_a=a.

Super Contributor
Posts: 1,636

Re: same trade times - large data sets

I ran Tom's Code with the sample data OP provided.

DATA A;

   INPUT A_datetime datetime16. A_price $;

   DATALINES;

22JUN11:00:00:04 1.7

22JUN11:00:00:08 1.9

22JUN11:00:00:10 2.3

22JUN11:00:05:12 2.9

22JUN11:00:05:36 3.5

22JUN11:00:32:45 3.2

22JUN11:00:39:01 3.1

22JUN11:00:39:03 2.7

22JUN11:01:10:12 2.6

22JUN11:07:22:10 2.1

26JUN11:12:42:17 3.4

26JUN11:12:42:18 3.6

26JUN11:12:42:21 3.9

;

RUN;

DATA B;

   INPUT B_datetime datetime16. B_price $;

   DATALINES;

22JUN11:00:00:09    0.02

22JUN11:00:05:34    0.07

22JUN11:00:39:05    0.04

26JUN11:12:42:20    0.09

;

run;

data want ;

  merge A(in=ina rename=(a_datetime=datetime))

        B(in=inb rename=(b_datetime=datetime))

  ;

  by datetime;

  retain a_datetime b_datetime prev_a;

  if ina then a_datetime=datetime;

  if inb then b_datetime=datetime;

  if ina then prev_a = a_price;

  if inb and  0 <= (b_datetime - a_datetime) <= 5 then output;

  format a_datetime b_datetime datetime datetime16.;

run;

proc print ;run;

                                          a                b

                                          _                _

                  d                       d                d

                  a A  B                  a                a

                  t _  _                  t                t  p

                  e p  p                  e                e  r

                  t r  r                  t                t  e

O                i i  i                  i                i  v

b                m c  c                  m                m  _

s                e e  e                  e                e  a

1 22JUN11:00:00:09   0.02 22JUN11:00:00:08 22JUN11:00:00:09 1.9

2 22JUN11:00:39:05   0.04 22JUN11:00:39:03 22JUN11:00:39:05 2.7

3 26JUN11:12:42:20   0.09 26JUN11:12:42:18 26JUN11:12:42:20 3.6

Contributor
Posts: 56

same trade times - large data sets

seems to work on the small example data set, although variable 'a_price' seems to not show up in the 'want' table?

Contributor
Posts: 56

same trade times - large data sets

Just ran it on my large data sets

A - 1.5 million obs

B = 21k obs

data set 'want' gets done in 0.42 seconds, WOW, amazing!

and for some reason 'a_pricea now gets shown fine in the output of 'want' for these large data sets, strange?

Super User
Super User
Posts: 6,500

same trade times - large data sets

You are getting more values for A_PRICE because of more exact matches in DATETIME values.

Without an exact match the price for A that corresponds to the A_DATETIME would be in the variable PREV_A.

You could deal with that with more rename commands or make the logic more complex to handle getting the value retained back into the A_PRICE variable before the OUTPUT statement.

Note that if you wanted to match one B record to multiple A records then this method will not work.  It will only output the A price that is closest in time (but still earlier).

Respected Advisor
Posts: 3,124

same trade times - large data sets

Hi skipper,

The following code will be slower than Tom's, however, it should be way faster than my other two trials. It has a little bit more than 1 pass of your huge dataset, and it will keep your a_price as you wish. Please give it try and let me how it goes:

data have (index=(a_datetime));

format a_datetime datetime16.;

  set product_a product_b(rename=b_datetime=a_datetime);

  by a_datetime;

  lga_time=lag(a_datetime);

  lga_price=lag(a_price);

  run;

  data want (drop=lgSmiley Happy;

  format b_datetime datetime16.;

  set product_b (rename=b_datetime=a_datetime);

    set have key=a_datetime;

    if _iorc_=%sysrc(_sok) then do;

       if not missing(a_price) then do;

          b_datetime=a_datetime;

          output;

        end;

        else if missing(a_price) then do;

           b_datetime=a_datetime;

           a_datetime= lga_time;

           a_price=lga_price;

           if 0<=b_datetime-a_datetime<=5 then output;

        end;

     end;

run;

Contributor
Posts: 56

same trade times - large data sets

thanks I will try this and report back

Super User
Posts: 9,681

same trade times - large data sets

DOW can do it. But If you need fast it ,Hash Table will be favored.

But for your data, I am curious. You said you want datetime within five seconds .

There two obs within 5 seconds for  26JUN11:00:42:20   0.09

26JUN11:12:42:17 3.4

26JUN11:12:42:18 3.6


You need them all, or just only need the nearest obs ?

DATA product_A;
   INPUT A_datetime datetime16. A_price $;
   format a_datetime datetime.;
   DATALINES;
22JUN11:00:00:04    1.7
22JUN11:00:00:08 1.9
22JUN11:00:00:10 2.3
22JUN11:00:05:12 2.9
22JUN11:00:05:36 3.5
22JUN11:00:32:45 3.2
22JUN11:00:39:01 3.1
22JUN11:00:39:03 2.7
22JUN11:01:10:12 2.6
22JUN11:07:22:10 2.1
26JUN11:12:42:17 3.4
26JUN11:12:42:18 3.6
26JUN11:12:42:21 3.9
;
RUN;

DATA product_B;
   INPUT B_datetime datetime16. B_price $;
   format b_datetime datetime.;
   DATALINES;
22JUN11:00:00:09    0.02
22JUN11:00:05:34    0.07
22JUN11:00:39:05    0.04
26JUN11:12:42:20    0.09
;
RUN;

data want;
 set product_b;
 do until(a_datetime gt b_datetime);
  if b_datetime-a_datetime lt 5 and b_datetime-a_datetime ge 0 then output;
  set product_a;
 end;
run;


Ksharp

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 13 replies
  • 416 views
  • 0 likes
  • 6 in conversation