DATA Step, Macro, Functions and more

Difficult merge/join with huge datasets

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 78
Accepted Solution

Difficult merge/join with huge datasets

I'm good with sql, but it's not the right tool for this job.  I have to connect execution quantities for stocks with their last known prices. (That is, I can partially join the tables on date and ticker, but I have to find the price from Prices that is the latest time before the Event's time.)   Below is some sample code and a sql statement that works, but is too inefficent my large datasets.  I'm wondering if there is some tricky data step with merging that can do this better.

data Events;

    format TradeDate mmddyys10. TradeTime time.;

    input TradeDate :anydtdte. TradeTime :anydttme. Ticker :$4.  Qty :best10.;

cards;

1Aug2013 10:00 MSFT 1000

1Aug2013 10:02 MSFT 1002

1Aug2013 14:20 MSFT 1420

1Aug2013 15:59 MSFT 1559

2Aug2013 09:31 MSFT 0931

2Aug2013 12:01 MSFT 1201

2Aug2013 15:20 MSFT 1520

2Aug2013 15:21 MSFT 1521

1Aug2013 10:00 YHOO 1000

1Aug2013 10:06 YHOO 1006

1Aug2013 14:21 YHOO 1421

1Aug2013 15:58 YHOO 1558

2Aug2013 09:30 YHOO 0930

2Aug2013 12:01 YHOO 1201

2Aug2013 15:20 YHOO 1520

2Aug2013 15:27 YHOO 1527

1Aug2013 10:00 AAPL 1000

1Aug2013 10:06 AAPL 1006

1Aug2013 14:21 AAPL 1421

1Aug2013 15:58 AAPL 1558

2Aug2013 09:30 AAPL 0930

2Aug2013 12:01 AAPL 1201

2Aug2013 15:20 AAPL 1520

2Aug2013 15:27 AAPL 1527

;

    run;

proc sort data=Events;

    by Ticker TradeDate TradeTime;

data Prices;

    format TradeDate mmddyys10. TradeTime time.;

    input TradeDate :anydtdte. TradeTime :anydttme. Ticker :$4.  Price :best10.;

cards;

1Aug2013 09:30 MSFT 09.30

1Aug2013 10:01 MSFT 10.01

1Aug2013 10:02 MSFT 10.02

1Aug2013 11:59 MSFT 11.59

2Aug2013 09:30 MSFT 09.30

2Aug2013 10:01 MSFT 10.01

2Aug2013 15:18 MSFT 15.18

2Aug2013 15:19 MSFT 15.19

2Aug2013 15:20 MSFT 15.20

2Aug2013 15:21 MSFT 15.21

1Aug2013 10:01 YHOO 10.01

1Aug2013 10:05 YHOO 10.05

2Aug2013 09:28 YHOO 09.28

2Aug2013 09:29 YHOO 09.29

2Aug2013 11:00 YHOO 11.00

2Aug2013 15:25 YHOO 15.25

2Aug2013 16:00 YHOO 16.00

;

    run;

proc sort data=Prices;

    by Ticker TradeDate TradeTime;

proc print data=Events;

proc print data=Prices;

/* too slow on big data */
proc sql;

    select a.*, b.TradeTime as PriceTime, b.Price

    from Events a

    join Prices b

       on a.TradeDate = b.TradeDate

      and a.Ticker = b.Ticker

      and b.TradeTime < a.TradeTime

    group by a.TradeDate, a.Ticker, a.TradeTime

    having b.TradeTime = max( b.TradeTime)

    order by a.Ticker, a.TradeDate, a.TradeTime;

Here are the results:

TradeDateTradeTimeTickerQtyPriceTimePrice
08/01/201310:00:00MSFT10009:30:009.3
08/01/201310:02:00MSFT100210:01:0010.01
08/01/201314:20:00MSFT142011:59:0011.59
08/01/201315:59:00MSFT155911:59:0011.59
08/02/20139:31:00MSFT9319:30:009.3
08/02/201312:01:00MSFT120110:01:0010.01
08/02/201315:20:00MSFT152015:19:0015.19
08/02/201315:21:00MSFT152115:20:0015.2
08/01/201310:06:00YHOO100610:05:0010.05
08/01/201314:21:00YHOO142110:05:0010.05
08/01/201315:58:00YHOO155810:05:0010.05
08/02/20139:30:00YHOO9309:29:009.29
08/02/201312:01:00YHOO120111:00:0011
08/02/201315:20:00YHOO152011:00:0011
08/02/201315:27:00YHOO152715:25:0015.25

Accepted Solutions
Solution
‎08-30-2014 08:04 AM
Super User
Posts: 10,044

Re: Difficult merge/join with huge datasets

How about this :

data Events;
    format TradeDate mmddyys10. TradeTime time.;
    input TradeDate :anydtdte. TradeTime :anydttme. Ticker :$4.  Qty :best10.;
cards;
1Aug2013 10:00 MSFT 1000
1Aug2013 10:02 MSFT 1002
1Aug2013 14:20 MSFT 1420
1Aug2013 15:59 MSFT 1559
2Aug2013 09:31 MSFT 0931
2Aug2013 12:01 MSFT 1201
2Aug2013 15:20 MSFT 1520
2Aug2013 15:21 MSFT 1521
1Aug2013 10:00 YHOO 1000
1Aug2013 10:06 YHOO 1006
1Aug2013 14:21 YHOO 1421
1Aug2013 15:58 YHOO 1558
2Aug2013 09:30 YHOO 0930
2Aug2013 12:01 YHOO 1201
2Aug2013 15:20 YHOO 1520
2Aug2013 15:27 YHOO 1527
1Aug2013 10:00 AAPL 1000
1Aug2013 10:06 AAPL 1006
1Aug2013 14:21 AAPL 1421
1Aug2013 15:58 AAPL 1558
2Aug2013 09:30 AAPL 0930
2Aug2013 12:01 AAPL 1201
2Aug2013 15:20 AAPL 1520
2Aug2013 15:27 AAPL 1527
;
    run;
proc sort data=Events;
    by Ticker TradeDate TradeTime;
run;
 
data Prices;
    format TradeDate mmddyys10. TradeTime time.;
    input TradeDate :anydtdte. TradeTime :anydttme. Ticker :$4.  Price :best10.;
cards;
1Aug2013 09:30 MSFT 09.30
1Aug2013 10:01 MSFT 10.01
1Aug2013 10:02 MSFT 10.02
1Aug2013 11:59 MSFT 11.59
2Aug2013 09:30 MSFT 09.30
2Aug2013 10:01 MSFT 10.01
2Aug2013 15:18 MSFT 15.18
2Aug2013 15:19 MSFT 15.19
2Aug2013 15:20 MSFT 15.20
2Aug2013 15:21 MSFT 15.21
1Aug2013 10:01 YHOO 10.01
1Aug2013 10:05 YHOO 10.05
2Aug2013 09:28 YHOO 09.28
2Aug2013 09:29 YHOO 09.29
2Aug2013 11:00 YHOO 11.00
2Aug2013 15:25 YHOO 15.25
2Aug2013 16:00 YHOO 16.00
;
    run;
proc sort data=Prices;
    by Ticker TradeDate TradeTime;
run;

data want(drop=_price);
 set Events Prices(rename=(price=_price))  ;
 by Ticker TradeDate TradeTime ;
 retain price;
 if first.Ticker then call missing(price);
 if not missing(_price) then price=_price;
 if not missing(Qty) and not missing(price);
run;

Xia Keshan

View solution in original post


All Replies
Trusted Advisor
Posts: 1,228

Re: Difficult merge/join with huge datasets

Hi,

Hope small modification will generate the desired results. Remove a.TradeTime from group by clause.

instead of

group by a.TradeDate, a.Ticker, a.TradeTime

try this

group by a.TradeDate, a.Ticker

Frequent Contributor
Posts: 78

Re: Difficult merge/join with huge datasets

The results generated by your suggestion are not correct, but thanks for responding.

the new results only contain the rows that can be priced with the latest price for each ticker/date pair.  Events that occurred earlier than the latest price get omitted but should not be, because they have a valid price (before the event).

TradeDate TradeTime Tickr Qty PriceTime Price

08/01/2013 14:20:00 MSFT 1420 11:59:00 11.59

08/01/2013 15:59:00 MSFT 1559 11:59:00 11.59

08/02/2013 15:21:00 MSFT 1521 15:20:00 15.2

08/01/2013 10:06:00 YHOO 1006 10:05:00 10.05

08/01/2013 14:21:00 YHOO 1421 10:05:00 10.05

08/01/2013 15:58:00 YHOO 1558 10:05:00 10.05

08/02/2013 15:27:00 YHOO 1527 15:25:00 15.25

Trusted Advisor
Posts: 3,215

Re: Difficult merge/join with huge datasets

There are lot options to optimize performance. Most of then balancing between resources.
You did not mention those.

For a datastep merge approach using balance line (ordered access with point direct access)

%let dsid = %sysfunc(open(prices));

%let pr_nobs =%sysfunc(attrn(&dsid,NOBS));

%let rc = %sysfunc(close(&dsid));

%put prices has &pr_nobs observations;  

/* using assumption oredered events - prices */

data results;

  set Events nobs=evn_nobs ;

  Retain eventlk 1  ;

  check=1 ;

  do while (eventlk < &pr_nobs & check ) ; 

   set Prices (rename=(Ticker=pTicker TradeDate=PriceDate TradeTime=Pricetime ) ) point=eventlk  end=endprc ; 

   if ( Ticker > PTicker | ( Ticker = PTicker & TradeDate > PriceDate) |( Ticker = PTicker & TradeDate = PriceDate & TradeTime > Pricetime ) ) then eventlk=eventlk+1 ;

   else check=0;

  end;

  if not endprc then eventlk=eventlk-1 ; /* one back that has passed set equation */

  set Prices (rename=(Ticker=pTicker TradeDate=PriceDate TradeTime=Pricetime ) ) point=eventlk  end=endprc ; 

  IF ( Ticker=PTicker & TradeDate = PriceDate) then output; /* */

run; 

proc print;

run;

---->-- ja karman --<-----
Frequent Contributor
Posts: 78

Re: Difficult merge/join with huge datasets

This is the kind of answer I was hoping for -- there are techniques here that are new to me.  The conditional use of Set, pluse the point= option is something I need to learn about.

I'll give this a try (next week; I'm burned out) and report back.

Thanks for the reply.

Respected Advisor
Posts: 3,156

Re: Difficult merge/join with huge datasets

This is a Hash solution, suppose 'prices' is small enough to fit into your memory, and 'prices' does not need to be presorted.

data want_hash;

     if _n_=1 then

           do;

                if 0 then

                     set prices (rename=(tradetime=pricetime));

                declare hash h(dataset:'prices (rename=(tradetime=pricetime))', multidata:'y');

                h.definekey('ticker','tradedate');

     h.definedata(all:

                'y');

                h.definedone();

           end;

     set Events;

     do rc=h.find() by 0 while (rc=0);

           if pricetime < tradetime then

                do;

                     _f=1;

                     if pricetime > _t then

                           do;

                                _t=pricetime;

                                _P=price;

                           end;

                end;

           rc=h.find_next();

     end;

     price=_p;

     pricetime=_t;

     if _f=1 then

           output;

     drop rc _:;

run;


Haikuo

Frequent Contributor
Posts: 78

Re: Difficult merge/join with huge datasets

It's good that you posted this idea for future reference.  As it happens, my Prices data is much bigger than my Events data, so this probably won't work for me.  I can break the problem down into chunks (the smallest being a ticker-date), but extracting the chunks may be time-consuming, given that I'd have to wrap a loop around it.  It's worth a test, however.

Solution
‎08-30-2014 08:04 AM
Super User
Posts: 10,044

Re: Difficult merge/join with huge datasets

How about this :

data Events;
    format TradeDate mmddyys10. TradeTime time.;
    input TradeDate :anydtdte. TradeTime :anydttme. Ticker :$4.  Qty :best10.;
cards;
1Aug2013 10:00 MSFT 1000
1Aug2013 10:02 MSFT 1002
1Aug2013 14:20 MSFT 1420
1Aug2013 15:59 MSFT 1559
2Aug2013 09:31 MSFT 0931
2Aug2013 12:01 MSFT 1201
2Aug2013 15:20 MSFT 1520
2Aug2013 15:21 MSFT 1521
1Aug2013 10:00 YHOO 1000
1Aug2013 10:06 YHOO 1006
1Aug2013 14:21 YHOO 1421
1Aug2013 15:58 YHOO 1558
2Aug2013 09:30 YHOO 0930
2Aug2013 12:01 YHOO 1201
2Aug2013 15:20 YHOO 1520
2Aug2013 15:27 YHOO 1527
1Aug2013 10:00 AAPL 1000
1Aug2013 10:06 AAPL 1006
1Aug2013 14:21 AAPL 1421
1Aug2013 15:58 AAPL 1558
2Aug2013 09:30 AAPL 0930
2Aug2013 12:01 AAPL 1201
2Aug2013 15:20 AAPL 1520
2Aug2013 15:27 AAPL 1527
;
    run;
proc sort data=Events;
    by Ticker TradeDate TradeTime;
run;
 
data Prices;
    format TradeDate mmddyys10. TradeTime time.;
    input TradeDate :anydtdte. TradeTime :anydttme. Ticker :$4.  Price :best10.;
cards;
1Aug2013 09:30 MSFT 09.30
1Aug2013 10:01 MSFT 10.01
1Aug2013 10:02 MSFT 10.02
1Aug2013 11:59 MSFT 11.59
2Aug2013 09:30 MSFT 09.30
2Aug2013 10:01 MSFT 10.01
2Aug2013 15:18 MSFT 15.18
2Aug2013 15:19 MSFT 15.19
2Aug2013 15:20 MSFT 15.20
2Aug2013 15:21 MSFT 15.21
1Aug2013 10:01 YHOO 10.01
1Aug2013 10:05 YHOO 10.05
2Aug2013 09:28 YHOO 09.28
2Aug2013 09:29 YHOO 09.29
2Aug2013 11:00 YHOO 11.00
2Aug2013 15:25 YHOO 15.25
2Aug2013 16:00 YHOO 16.00
;
    run;
proc sort data=Prices;
    by Ticker TradeDate TradeTime;
run;

data want(drop=_price);
 set Events Prices(rename=(price=_price))  ;
 by Ticker TradeDate TradeTime ;
 retain price;
 if first.Ticker then call missing(price);
 if not missing(_price) then price=_price;
 if not missing(Qty) and not missing(price);
run;

Xia Keshan

Frequent Contributor
Posts: 78

Re: Difficult merge/join with huge datasets

This is very interesting because it's so direct.  My data is actually not sorted, but I built an index on (Ticker TradeDate TradeTime) on each, so I think the BY statement will still work.  And I surely need to change the first IF to include TradeDate, like this

if first.Ticker or first.TradeDate then call missing(price);

Thanks for the reply; I'm excited to try this.  (That should be obvious because I'm replying on a holiday weekend about a work-related problem.)

Super User
Posts: 5,437

Re: Difficult merge/join with huge datasets

I haven't dived into your problem. But just want to hint that building an index for BY processing is rarely good for performance. Many times it's more efficient to sort the whole table instead. Pls test both scenarios an get back with your results.

Data never sleeps
Respected Advisor
Posts: 3,156

Re: Difficult merge/join with huge datasets

Since it is sorted by  ' Ticker TradeDate TradeTime;', "if first.TradeDate" will contain 'if first.Ticker', so 'if first.TradeDate then call missing(price);' should do. One less condition to check may save you a few secs.

Regards,

Haikuo

Trusted Advisor
Posts: 3,215

Re: Difficult merge/join with huge datasets

jdmarino can you give more details on your dataset-sizing SAS version and your computer resource limitations (memory cpu IO/dasd) ?
Indexing can work but has the advantage of additional IO (caused by random IO). IO is often the most delaying factor. Even when you an index it can be more efficient to process it sequential.
Today's computers are not having the limitations of several years back. The hash approach could be the most effective one when the memory (several Gb) is there. 

---->-- ja karman --<-----
Frequent Contributor
Posts: 78

Re: Difficult merge/join with huge datasets

My original SQL version took 5 hours.

KSharp's SET-BY interleaving took < 1 minute (after a 7 minute sort).

I haven't tried Jaap's method, mostly because it seems to me like a manual version of KSharp's.

I am trying Hai.Kuo's hash method, but running out of memory, and I don't know why.  The dataset is 7G on my disk, but I have 24G of RAM.  This is the error:

ERROR: Hash object added 16252912 items when memory failure occurred.

FATAL: Insufficient memory to execute DATA step program. Aborted during the EXECUTION phase.

ERROR: The SAS System stopped processing this step because of insufficient memory.

Proc options group=memory reports:

Group=MEMORY

SORTSIZE=268435456                    Size parameter for sort

SUMSIZE=0         Upper limit for data-dependent memory usage during summarization

MAXMEMQUERY=0     Maximum amount of memory returned when inquiring as to available space

MEMBLKSZ=16777216 Size of memory blocks allocated to support MEMLIB and MEMCACHE options.

MEMMAXSZ=2147483648                    Maximum amount of memory allocated to support MEMLIB and MEMCACHE options.

LOADMEMSIZE=0     Suggested memory limit for loaded SAS executables

MEMSIZE=2147483648                    Specifies the limit on the total amount of memory to be used by the SAS System

REALMEMSIZE=0     Limit on the total amount of real memory to be used by the SAS System N

I set "-MEMSIZE 19G" in my sasv9.cfg file.  Any ideas why I'm not using all the physical memory?

I have 64-bit SAS 9.3 (TS1M1) installed on a Windows 7 machine.

FYI: I don't expect the hash method to be efficient because I have only 100 keys (50 tickers x 2 dates) in my real data.  That means for each key I'll be looping/scanning an average of 80,000 (16,000,000 items / 100 keys / 2) elements to find the one I want for each row in Events.

Trusted Advisor
Posts: 3,215

Re: Difficult merge/join with huge datasets

John you are telling You have set 19Gb for the memsize counting the digits I am seeing 2Gb. For sorting you are having 256Mb a common default value.
Setting it to higher values will improve processing performance by avoiding IO. But why is your setting of memsize a fail?

Which sasv9.config did you change and is that one really used? Not overwritten by another?

The hashing is using a b-tree keys search. Very popular as of NO-SQL should perform well. The number of keys can be a problem needing some additional tuning.   
 

---->-- ja karman --<-----
Frequent Contributor
Posts: 78

Re: Difficult merge/join with huge datasets

>> Why is memsize a fail?

I don't know.  I was hoping you did.

Another fail: SORTSIZE.  In my .cfg file I have it set to 19G.

Does one need admin rights on a Windows machine to set these params?  I don't have admin rights on this machine.

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 18 replies
  • 685 views
  • 7 likes
  • 6 in conversation