BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
skipper
Calcite | Level 5

The Datasets here are for example only, my real datasets are for up to 1.5 million observations.

I have 2 products 'A' and 'B' trading, trade_time and trade_price are recorded variables, so I have

DATA product_A;

   INPUT A_datetime datetime16. A_price $;

   DATALINES;

22JUN11:00:00:04    1.7

22JUN11:00:00:08 1.9

22JUN11:00:00:10 2.3

22JUN11:00:05:12 2.9

22JUN11:00:05:36 3.5

22JUN11:00:32:45 3.2

22JUN11:00:39:01 3.1

22JUN11:00:39:03 2.7

22JUN11:01:10:12 2.6

22JUN11:07:22:10 2.1

26JUN11:12:42:17 3.4

26JUN11:12:42:18 3.6

26JUN11:12:42:21 3.9

;

RUN;

DATA product_B;

   INPUT B_datetime datetime16. B_price $;

   DATALINES;

22JUN11:00:00:09    0.02

22JUN11:00:05:34    0.07

22JUN11:00:39:05    0.04

26JUN11:12:42:20    0.09

;

RUN;

Each time 'B' trades, i want to know what 'A' traded at, but only if 'A' traded within the last 5 seconds looking backward.

(i.e if B trades and there are no records in the last 5 seconds for A, then discard, even if 'A' trades one second afterwards, - looking backwards in time only)

So I want to create a new dataset called 'same_trade_times'. which would have the following data,...

DATA same_trade_times;

   INPUT B_datetime datetime16.

         B_price $

         A_datetime $

         A_price $

         ;

   DATALINES;

22JUN11:00:00:09   0.02   22JUN11:00:00:08    1.9

22JUN11:00:39:05   0.04   22JUN11:00:39:03    2.7

26JUN11:00:42:20   0.09   26JUN11:12:42:18    3.6

;

RUN;

any ideas on this one?

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Use MERGE and RETAIN.

data want ;

  merge A(in=ina rename=(a_datetime=datetime))

        B(in=inb rename=(b_datetime=datetime))

  ;

  by datetime;

  retain a_datetime b_datetime prev_a;

  if ina then a_datetime=datetime;

  if inb then b_datetime=datetime;

  if ina then prev_a = a;

  if inb and  0 <= (b_datetime - a_datetime) <= 5 then output;

run;

View solution in original post

13 REPLIES 13
Tom
Super User Tom
Super User

Use MERGE and RETAIN.

data want ;

  merge A(in=ina rename=(a_datetime=datetime))

        B(in=inb rename=(b_datetime=datetime))

  ;

  by datetime;

  retain a_datetime b_datetime prev_a;

  if ina then a_datetime=datetime;

  if inb then b_datetime=datetime;

  if ina then prev_a = a;

  if inb and  0 <= (b_datetime - a_datetime) <= 5 then output;

run;

skipper
Calcite | Level 5

is that an 'in' or an 'if' at the start of the third line up from the bottom?

Tom
Super User Tom
Super User

IF of course.  I have updated the previous post to fix the typo.

Astounding
PROC Star

Tom,

I'm not sure you can rally MERGE these two.  The DATETIME values may not form an exact match.

I don't have the time to do this, but here is an approach (bare bones outline).  It seems a little strange at first to create 5-second intervals based on the Product_A data set, but logically it seems sound.  Of course, when it comes to setting up a format with 1M+ levels on a PC, there are no guarantees as to how much time it will take.

Use Product_A data to create a CNTLIN data set for PROC FORMAT.  The ranges would be from lag(a_datetime) to the MIN of either a_datetime or lag(a_datetime)+5.  The CNTLIN data set would need to use an extra variable to indicate whether the endpoint of the range was either included (if using lag(a_datetime)+5 as the endpoint) or excluded (if using a_datetime as the endpoint).  The format would use concatenated a_datetime + A_Price as the LABEL.  Attention would be needed to properly create the very final range, and possibly to set up the Other= definition when there is no match.

Applying the format would be easy:

data same_time_trades;

set product_b;

A_match = put(b_datetime, fmtname.);

Then split A_match into the datetime and the price.

Tom
Super User Tom
Super User

There is no need to match the datetime values.  The RETAIN statements take care of that.  The same logic should work using SET with a BY statment to interleave the records.

Astounding
PROC Star

OK, got it.  And I like it!

You've probably posted this already, but I think you meant to code prev_a=a_price instead of prev_a=a.

Linlin
Lapis Lazuli | Level 10

I ran Tom's Code with the sample data OP provided.

DATA A;

   INPUT A_datetime datetime16. A_price $;

   DATALINES;

22JUN11:00:00:04 1.7

22JUN11:00:00:08 1.9

22JUN11:00:00:10 2.3

22JUN11:00:05:12 2.9

22JUN11:00:05:36 3.5

22JUN11:00:32:45 3.2

22JUN11:00:39:01 3.1

22JUN11:00:39:03 2.7

22JUN11:01:10:12 2.6

22JUN11:07:22:10 2.1

26JUN11:12:42:17 3.4

26JUN11:12:42:18 3.6

26JUN11:12:42:21 3.9

;

RUN;

DATA B;

   INPUT B_datetime datetime16. B_price $;

   DATALINES;

22JUN11:00:00:09    0.02

22JUN11:00:05:34    0.07

22JUN11:00:39:05    0.04

26JUN11:12:42:20    0.09

;

run;

data want ;

  merge A(in=ina rename=(a_datetime=datetime))

        B(in=inb rename=(b_datetime=datetime))

  ;

  by datetime;

  retain a_datetime b_datetime prev_a;

  if ina then a_datetime=datetime;

  if inb then b_datetime=datetime;

  if ina then prev_a = a_price;

  if inb and  0 <= (b_datetime - a_datetime) <= 5 then output;

  format a_datetime b_datetime datetime datetime16.;

run;

proc print ;run;

                                          a                b

                                          _                _

                  d                       d                d

                  a A  B                  a                a

                  t _  _                  t                t  p

                  e p  p                  e                e  r

                  t r  r                  t                t  e

O                i i  i                  i                i  v

b                m c  c                  m                m  _

s                e e  e                  e                e  a

1 22JUN11:00:00:09   0.02 22JUN11:00:00:08 22JUN11:00:00:09 1.9

2 22JUN11:00:39:05   0.04 22JUN11:00:39:03 22JUN11:00:39:05 2.7

3 26JUN11:12:42:20   0.09 26JUN11:12:42:18 26JUN11:12:42:20 3.6

skipper
Calcite | Level 5

seems to work on the small example data set, although variable 'a_price' seems to not show up in the 'want' table?

skipper
Calcite | Level 5

Just ran it on my large data sets

A - 1.5 million obs

B = 21k obs

data set 'want' gets done in 0.42 seconds, WOW, amazing!

and for some reason 'a_pricea now gets shown fine in the output of 'want' for these large data sets, strange?

Tom
Super User Tom
Super User

You are getting more values for A_PRICE because of more exact matches in DATETIME values.

Without an exact match the price for A that corresponds to the A_DATETIME would be in the variable PREV_A.

You could deal with that with more rename commands or make the logic more complex to handle getting the value retained back into the A_PRICE variable before the OUTPUT statement.

Note that if you wanted to match one B record to multiple A records then this method will not work.  It will only output the A price that is closest in time (but still earlier).

Haikuo
Onyx | Level 15

Hi skipper,

The following code will be slower than Tom's, however, it should be way faster than my other two trials. It has a little bit more than 1 pass of your huge dataset, and it will keep your a_price as you wish. Please give it try and let me how it goes:

data have (index=(a_datetime));

format a_datetime datetime16.;

  set product_a product_b(rename=b_datetime=a_datetime);

  by a_datetime;

  lga_time=lag(a_datetime);

  lga_price=lag(a_price);

  run;

  data want (drop=lg:);

  format b_datetime datetime16.;

  set product_b (rename=b_datetime=a_datetime);

    set have key=a_datetime;

    if _iorc_=%sysrc(_sok) then do;

       if not missing(a_price) then do;

          b_datetime=a_datetime;

          output;

        end;

        else if missing(a_price) then do;

           b_datetime=a_datetime;

           a_datetime= lga_time;

           a_price=lga_price;

           if 0<=b_datetime-a_datetime<=5 then output;

        end;

     end;

run;

skipper
Calcite | Level 5

thanks I will try this and report back

Ksharp
Super User

DOW can do it. But If you need fast it ,Hash Table will be favored.

But for your data, I am curious. You said you want datetime within five seconds .

There two obs within 5 seconds for  26JUN11:00:42:20   0.09

26JUN11:12:42:17 3.4

26JUN11:12:42:18 3.6


You need them all, or just only need the nearest obs ?

DATA product_A;
   INPUT A_datetime datetime16. A_price $;
   format a_datetime datetime.;
   DATALINES;
22JUN11:00:00:04    1.7
22JUN11:00:00:08 1.9
22JUN11:00:00:10 2.3
22JUN11:00:05:12 2.9
22JUN11:00:05:36 3.5
22JUN11:00:32:45 3.2
22JUN11:00:39:01 3.1
22JUN11:00:39:03 2.7
22JUN11:01:10:12 2.6
22JUN11:07:22:10 2.1
26JUN11:12:42:17 3.4
26JUN11:12:42:18 3.6
26JUN11:12:42:21 3.9
;
RUN;

DATA product_B;
   INPUT B_datetime datetime16. B_price $;
   format b_datetime datetime.;
   DATALINES;
22JUN11:00:00:09    0.02
22JUN11:00:05:34    0.07
22JUN11:00:39:05    0.04
26JUN11:12:42:20    0.09
;
RUN;

data want;
 set product_b;
 do until(a_datetime gt b_datetime);
  if b_datetime-a_datetime lt 5 and b_datetime-a_datetime ge 0 then output;
  set product_a;
 end;
run;


Ksharp

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 1197 views
  • 0 likes
  • 6 in conversation