Solved: Re: Difficult merge/join with huge datasets

jdmarino · Posted 08-29-2014 01:44 PM

I'm good with sql, but it's not the right tool for this job. I have to connect execution quantities for stocks with their last known prices. (That is, I can partially join the tables on date and ticker, but I have to find the price from Prices that is the latest time before the Event's time.) Below is some sample code and a sql statement that works, but is too inefficent my large datasets. I'm wondering if there is some tricky data step with merging that can do this better.

data Events;

format TradeDate mmddyys10. TradeTime time.;

input TradeDate :anydtdte. TradeTime :anydttme. Ticker :$4. Qty :best10.;

cards;

1Aug2013 10:00 MSFT 1000

1Aug2013 10:02 MSFT 1002

1Aug2013 14:20 MSFT 1420

1Aug2013 15:59 MSFT 1559

2Aug2013 09:31 MSFT 0931

2Aug2013 12:01 MSFT 1201

2Aug2013 15:20 MSFT 1520

2Aug2013 15:21 MSFT 1521

1Aug2013 10:00 YHOO 1000

1Aug2013 10:06 YHOO 1006

1Aug2013 14:21 YHOO 1421

1Aug2013 15:58 YHOO 1558

2Aug2013 09:30 YHOO 0930

2Aug2013 12:01 YHOO 1201

2Aug2013 15:20 YHOO 1520

2Aug2013 15:27 YHOO 1527

1Aug2013 10:00 AAPL 1000

1Aug2013 10:06 AAPL 1006

1Aug2013 14:21 AAPL 1421

1Aug2013 15:58 AAPL 1558

2Aug2013 09:30 AAPL 0930

2Aug2013 12:01 AAPL 1201

2Aug2013 15:20 AAPL 1520

2Aug2013 15:27 AAPL 1527

;

run;

proc sort data=Events;

by Ticker TradeDate TradeTime;

data Prices;

format TradeDate mmddyys10. TradeTime time.;

input TradeDate :anydtdte. TradeTime :anydttme. Ticker :$4. Price :best10.;

cards;

1Aug2013 09:30 MSFT 09.30

1Aug2013 10:01 MSFT 10.01

1Aug2013 10:02 MSFT 10.02

1Aug2013 11:59 MSFT 11.59

2Aug2013 09:30 MSFT 09.30

2Aug2013 10:01 MSFT 10.01

2Aug2013 15:18 MSFT 15.18

2Aug2013 15:19 MSFT 15.19

2Aug2013 15:20 MSFT 15.20

2Aug2013 15:21 MSFT 15.21

1Aug2013 10:01 YHOO 10.01

1Aug2013 10:05 YHOO 10.05

2Aug2013 09:28 YHOO 09.28

2Aug2013 09:29 YHOO 09.29

2Aug2013 11:00 YHOO 11.00

2Aug2013 15:25 YHOO 15.25

2Aug2013 16:00 YHOO 16.00

;

run;

proc sort data=Prices;

by Ticker TradeDate TradeTime;

proc print data=Events;

proc print data=Prices;

/* too slow on big data */
proc sql;

select a.*, b.TradeTime as PriceTime, b.Price

from Events a

join Prices b

on a.TradeDate = b.TradeDate

and a.Ticker = b.Ticker

and b.TradeTime < a.TradeTime

group by a.TradeDate, a.Ticker, a.TradeTime

having b.TradeTime = max( b.TradeTime)

order by a.Ticker, a.TradeDate, a.TradeTime;

Here are the results:

TradeDate	TradeTime	Ticker	Qty	PriceTime	Price
08/01/2013	10:00:00	MSFT	1000	9:30:00	9.3
08/01/2013	10:02:00	MSFT	1002	10:01:00	10.01
08/01/2013	14:20:00	MSFT	1420	11:59:00	11.59
08/01/2013	15:59:00	MSFT	1559	11:59:00	11.59
08/02/2013	9:31:00	MSFT	931	9:30:00	9.3
08/02/2013	12:01:00	MSFT	1201	10:01:00	10.01
08/02/2013	15:20:00	MSFT	1520	15:19:00	15.19
08/02/2013	15:21:00	MSFT	1521	15:20:00	15.2
08/01/2013	10:06:00	YHOO	1006	10:05:00	10.05
08/01/2013	14:21:00	YHOO	1421	10:05:00	10.05
08/01/2013	15:58:00	YHOO	1558	10:05:00	10.05
08/02/2013	9:30:00	YHOO	930	9:29:00	9.29
08/02/2013	12:01:00	YHOO	1201	11:00:00	11
08/02/2013	15:20:00	YHOO	1520	11:00:00	11
08/02/2013	15:27:00	YHOO	1527	15:25:00	15.25

Ksharp · Posted 08-30-2014 08:04 AM

How about this :

data Events;
    format TradeDate mmddyys10. TradeTime time.;
    input TradeDate :anydtdte. TradeTime :anydttme. Ticker :$4.  Qty :best10.;
cards;
1Aug2013 10:00 MSFT 1000
1Aug2013 10:02 MSFT 1002
1Aug2013 14:20 MSFT 1420
1Aug2013 15:59 MSFT 1559
2Aug2013 09:31 MSFT 0931
2Aug2013 12:01 MSFT 1201
2Aug2013 15:20 MSFT 1520
2Aug2013 15:21 MSFT 1521
1Aug2013 10:00 YHOO 1000
1Aug2013 10:06 YHOO 1006
1Aug2013 14:21 YHOO 1421
1Aug2013 15:58 YHOO 1558
2Aug2013 09:30 YHOO 0930
2Aug2013 12:01 YHOO 1201
2Aug2013 15:20 YHOO 1520
2Aug2013 15:27 YHOO 1527
1Aug2013 10:00 AAPL 1000
1Aug2013 10:06 AAPL 1006
1Aug2013 14:21 AAPL 1421
1Aug2013 15:58 AAPL 1558
2Aug2013 09:30 AAPL 0930
2Aug2013 12:01 AAPL 1201
2Aug2013 15:20 AAPL 1520
2Aug2013 15:27 AAPL 1527
;
    run;
proc sort data=Events;
    by Ticker TradeDate TradeTime;
run;
 
data Prices;
    format TradeDate mmddyys10. TradeTime time.;
    input TradeDate :anydtdte. TradeTime :anydttme. Ticker :$4.  Price :best10.;
cards;
1Aug2013 09:30 MSFT 09.30
1Aug2013 10:01 MSFT 10.01
1Aug2013 10:02 MSFT 10.02
1Aug2013 11:59 MSFT 11.59
2Aug2013 09:30 MSFT 09.30
2Aug2013 10:01 MSFT 10.01
2Aug2013 15:18 MSFT 15.18
2Aug2013 15:19 MSFT 15.19
2Aug2013 15:20 MSFT 15.20
2Aug2013 15:21 MSFT 15.21
1Aug2013 10:01 YHOO 10.01
1Aug2013 10:05 YHOO 10.05
2Aug2013 09:28 YHOO 09.28
2Aug2013 09:29 YHOO 09.29
2Aug2013 11:00 YHOO 11.00
2Aug2013 15:25 YHOO 15.25
2Aug2013 16:00 YHOO 16.00
;
    run;
proc sort data=Prices;
    by Ticker TradeDate TradeTime;
run;

data want(drop=_price);
 set Events Prices(rename=(price=_price))  ;
 by Ticker TradeDate TradeTime ;
 retain price;
 if first.Ticker then call missing(price);
 if not missing(_price) then price=_price;
 if not missing(Qty) and not missing(price);
run;

Xia Keshan

View solution in original post

stat_sas · Posted 08-29-2014 02:51 PM

Hi,

Hope small modification will generate the desired results. Remove a.TradeTime from group by clause.

instead of

group by a.TradeDate, a.Ticker, a.TradeTime

try this

group by a.TradeDate, a.Ticker

jdmarino · Posted 08-29-2014 04:15 PM

The results generated by your suggestion are not correct, but thanks for responding.

the new results only contain the rows that can be priced with the latest price for each ticker/date pair. Events that occurred earlier than the latest price get omitted but should not be, because they have a valid price (before the event).

TradeDate TradeTime Tickr Qty PriceTime Price

08/01/2013 14:20:00 MSFT 1420 11:59:00 11.59

08/01/2013 15:59:00 MSFT 1559 11:59:00 11.59

08/02/2013 15:21:00 MSFT 1521 15:20:00 15.2

08/01/2013 10:06:00 YHOO 1006 10:05:00 10.05

08/01/2013 14:21:00 YHOO 1421 10:05:00 10.05

08/01/2013 15:58:00 YHOO 1558 10:05:00 10.05

08/02/2013 15:27:00 YHOO 1527 15:25:00 15.25

jakarman · Posted 08-29-2014 04:26 PM

There are lot options to optimize performance. Most of then balancing between resources.
You did not mention those.

For a datastep merge approach using balance line (ordered access with point direct access)

%let dsid = %sysfunc(open(prices));

%let pr_nobs =%sysfunc(attrn(&dsid,NOBS));

%let rc = %sysfunc(close(&dsid));

%put prices has &pr_nobs observations;

/* using assumption oredered events - prices */

data results;

set Events nobs=evn_nobs ;

Retain eventlk 1 ;

check=1 ;

do while (eventlk < &pr_nobs & check ) ;

set Prices (rename=(Ticker=pTicker TradeDate=PriceDate TradeTime=Pricetime ) ) point=eventlk end=endprc ;

if ( Ticker > PTicker | ( Ticker = PTicker & TradeDate > PriceDate) |( Ticker = PTicker & TradeDate = PriceDate & TradeTime > Pricetime ) ) then eventlk=eventlk+1 ;

else check=0;

end;

if not endprc then eventlk=eventlk-1 ; /* one back that has passed set equation */

set Prices (rename=(Ticker=pTicker TradeDate=PriceDate TradeTime=Pricetime ) ) point=eventlk end=endprc ;

IF ( Ticker=PTicker & TradeDate = PriceDate) then output; /* */

run;

proc print;

run;

---->-- ja karman --<-----

jdmarino · Posted 08-29-2014 04:42 PM

This is the kind of answer I was hoping for -- there are techniques here that are new to me. The conditional use of Set, pluse the point= option is something I need to learn about.

I'll give this a try (next week; I'm burned out) and report back.

Thanks for the reply.

Haikuo · Posted 08-29-2014 08:54 PM

This is a Hash solution, suppose 'prices' is small enough to fit into your memory, and 'prices' does not need to be presorted.

data want_hash;

if _n_=1 then

do;

if 0 then

set prices (rename=(tradetime=pricetime));

declare hash h(dataset:'prices (rename=(tradetime=pricetime))', multidata:'y');

h.definekey('ticker','tradedate');

h.definedata(all:

'y');

h.definedone();

end;

set Events;

do rc=h.find() by 0 while (rc=0);

if pricetime < tradetime then

do;

_f=1;

if pricetime > _t then

do;

_t=pricetime;

_P=price;

end;

rc=h.find_next();

end;

price=_p;

pricetime=_t;

if _f=1 then

output;

drop rc _:;

run;

Haikuo

jdmarino · Posted 08-30-2014 10:18 AM

It's good that you posted this idea for future reference. As it happens, my Prices data is much bigger than my Events data, so this probably won't work for me. I can break the problem down into chunks (the smallest being a ticker-date), but extracting the chunks may be time-consuming, given that I'd have to wrap a loop around it. It's worth a test, however.

Ksharp · Posted 08-30-2014 08:04 AM

How about this :

data Events;
    format TradeDate mmddyys10. TradeTime time.;
    input TradeDate :anydtdte. TradeTime :anydttme. Ticker :$4.  Qty :best10.;
cards;
1Aug2013 10:00 MSFT 1000
1Aug2013 10:02 MSFT 1002
1Aug2013 14:20 MSFT 1420
1Aug2013 15:59 MSFT 1559
2Aug2013 09:31 MSFT 0931
2Aug2013 12:01 MSFT 1201
2Aug2013 15:20 MSFT 1520
2Aug2013 15:21 MSFT 1521
1Aug2013 10:00 YHOO 1000
1Aug2013 10:06 YHOO 1006
1Aug2013 14:21 YHOO 1421
1Aug2013 15:58 YHOO 1558
2Aug2013 09:30 YHOO 0930
2Aug2013 12:01 YHOO 1201
2Aug2013 15:20 YHOO 1520
2Aug2013 15:27 YHOO 1527
1Aug2013 10:00 AAPL 1000
1Aug2013 10:06 AAPL 1006
1Aug2013 14:21 AAPL 1421
1Aug2013 15:58 AAPL 1558
2Aug2013 09:30 AAPL 0930
2Aug2013 12:01 AAPL 1201
2Aug2013 15:20 AAPL 1520
2Aug2013 15:27 AAPL 1527
;
    run;
proc sort data=Events;
    by Ticker TradeDate TradeTime;
run;
 
data Prices;
    format TradeDate mmddyys10. TradeTime time.;
    input TradeDate :anydtdte. TradeTime :anydttme. Ticker :$4.  Price :best10.;
cards;
1Aug2013 09:30 MSFT 09.30
1Aug2013 10:01 MSFT 10.01
1Aug2013 10:02 MSFT 10.02
1Aug2013 11:59 MSFT 11.59
2Aug2013 09:30 MSFT 09.30
2Aug2013 10:01 MSFT 10.01
2Aug2013 15:18 MSFT 15.18
2Aug2013 15:19 MSFT 15.19
2Aug2013 15:20 MSFT 15.20
2Aug2013 15:21 MSFT 15.21
1Aug2013 10:01 YHOO 10.01
1Aug2013 10:05 YHOO 10.05
2Aug2013 09:28 YHOO 09.28
2Aug2013 09:29 YHOO 09.29
2Aug2013 11:00 YHOO 11.00
2Aug2013 15:25 YHOO 15.25
2Aug2013 16:00 YHOO 16.00
;
    run;
proc sort data=Prices;
    by Ticker TradeDate TradeTime;
run;

data want(drop=_price);
 set Events Prices(rename=(price=_price))  ;
 by Ticker TradeDate TradeTime ;
 retain price;
 if first.Ticker then call missing(price);
 if not missing(_price) then price=_price;
 if not missing(Qty) and not missing(price);
run;

Xia Keshan

jdmarino · Posted 08-30-2014 10:27 AM

This is very interesting because it's so direct. My data is actually not sorted, but I built an index on (Ticker TradeDate TradeTime) on each, so I think the BY statement will still work. And I surely need to change the first IF to include TradeDate, like this

if first.Ticker or first.TradeDate then call missing(price);

Thanks for the reply; I'm excited to try this. (That should be obvious because I'm replying on a holiday weekend about a work-related problem.)

LinusH · Posted 09-01-2014 02:41 AM

I haven't dived into your problem. But just want to hint that building an index for BY processing is rarely good for performance. Many times it's more efficient to sort the whole table instead. Pls test both scenarios an get back with your results.

Data never sleeps

Haikuo · Posted 09-04-2014 03:07 PM

Since it is sorted by ' Ticker TradeDate TradeTime;', "if first.TradeDate" will contain 'if first.Ticker', so 'if first.TradeDate then call missing(price);' should do. One less condition to check may save you a few secs.

Regards,

Haikuo

jakarman · Posted 08-30-2014 10:36 AM

jdmarino can you give more details on your dataset-sizing SAS version and your computer resource limitations (memory cpu IO/dasd) ?
Indexing can work but has the advantage of additional IO (caused by random IO). IO is often the most delaying factor. Even when you an index it can be more efficient to process it sequential.
Today's computers are not having the limitations of several years back. The hash approach could be the most effective one when the memory (several Gb) is there.

---->-- ja karman --<-----

jdmarino · Posted 09-04-2014 09:06 AM

My original SQL version took 5 hours.

KSharp's SET-BY interleaving took < 1 minute (after a 7 minute sort).

I haven't tried Jaap's method, mostly because it seems to me like a manual version of KSharp's.

I am trying Hai.Kuo's hash method, but running out of memory, and I don't know why. The dataset is 7G on my disk, but I have 24G of RAM. This is the error:

ERROR: Hash object added 16252912 items when memory failure occurred.

FATAL: Insufficient memory to execute DATA step program. Aborted during the EXECUTION phase.

ERROR: The SAS System stopped processing this step because of insufficient memory.

Proc options group=memory reports:

Group=MEMORY

SORTSIZE=268435456 Size parameter for sort

SUMSIZE=0 Upper limit for data-dependent memory usage during summarization

MAXMEMQUERY=0 Maximum amount of memory returned when inquiring as to available space

MEMBLKSZ=16777216 Size of memory blocks allocated to support MEMLIB and MEMCACHE options.

MEMMAXSZ=2147483648 Maximum amount of memory allocated to support MEMLIB and MEMCACHE options.

LOADMEMSIZE=0 Suggested memory limit for loaded SAS executables

MEMSIZE=2147483648 Specifies the limit on the total amount of memory to be used by the SAS System

REALMEMSIZE=0 Limit on the total amount of real memory to be used by the SAS System N

I set "-MEMSIZE 19G" in my sasv9.cfg file. Any ideas why I'm not using all the physical memory?

I have 64-bit SAS 9.3 (TS1M1) installed on a Windows 7 machine.

FYI: I don't expect the hash method to be efficient because I have only 100 keys (50 tickers x 2 dates) in my real data. That means for each key I'll be looping/scanning an average of 80,000 (16,000,000 items / 100 keys / 2) elements to find the one I want for each row in Events.

jakarman · Posted 09-04-2014 09:24 AM

John you are telling You have set 19Gb for the memsize counting the digits I am seeing 2Gb. For sorting you are having 256Mb a common default value.
Setting it to higher values will improve processing performance by avoiding IO. But why is your setting of memsize a fail?

Which sasv9.config did you change and is that one really used? Not overwritten by another?

The hashing is using a b-tree keys search. Very popular as of NO-SQL should perform well. The number of keys can be a problem needing some additional tuning.

---->-- ja karman --<-----

jdmarino · Posted 09-04-2014 10:10 AM

>> Why is memsize a fail?

I don't know. I was hoping you did.

Another fail: SORTSIZE. In my .cfg file I have it set to 19G.

Does one need admin rights on a Windows machine to set these params? I don't have admin rights on this machine.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away