I'm good with sql, but it's not the right tool for this job. I have to connect execution quantities for stocks with their last known prices. (That is, I can partially join the tables on date and ticker, but I have to find the price from Prices that is the latest time before the Event's time.) Below is some sample code and a sql statement that works, but is too inefficent my large datasets. I'm wondering if there is some tricky data step with merging that can do this better.
data Events;
format TradeDate mmddyys10. TradeTime time.;
input TradeDate :anydtdte. TradeTime :anydttme. Ticker :$4. Qty :best10.;
cards;
1Aug2013 10:00 MSFT 1000
1Aug2013 10:02 MSFT 1002
1Aug2013 14:20 MSFT 1420
1Aug2013 15:59 MSFT 1559
2Aug2013 09:31 MSFT 0931
2Aug2013 12:01 MSFT 1201
2Aug2013 15:20 MSFT 1520
2Aug2013 15:21 MSFT 1521
1Aug2013 10:00 YHOO 1000
1Aug2013 10:06 YHOO 1006
1Aug2013 14:21 YHOO 1421
1Aug2013 15:58 YHOO 1558
2Aug2013 09:30 YHOO 0930
2Aug2013 12:01 YHOO 1201
2Aug2013 15:20 YHOO 1520
2Aug2013 15:27 YHOO 1527
1Aug2013 10:00 AAPL 1000
1Aug2013 10:06 AAPL 1006
1Aug2013 14:21 AAPL 1421
1Aug2013 15:58 AAPL 1558
2Aug2013 09:30 AAPL 0930
2Aug2013 12:01 AAPL 1201
2Aug2013 15:20 AAPL 1520
2Aug2013 15:27 AAPL 1527
;
run;
proc sort data=Events;
by Ticker TradeDate TradeTime;
data Prices;
format TradeDate mmddyys10. TradeTime time.;
input TradeDate :anydtdte. TradeTime :anydttme. Ticker :$4. Price :best10.;
cards;
1Aug2013 09:30 MSFT 09.30
1Aug2013 10:01 MSFT 10.01
1Aug2013 10:02 MSFT 10.02
1Aug2013 11:59 MSFT 11.59
2Aug2013 09:30 MSFT 09.30
2Aug2013 10:01 MSFT 10.01
2Aug2013 15:18 MSFT 15.18
2Aug2013 15:19 MSFT 15.19
2Aug2013 15:20 MSFT 15.20
2Aug2013 15:21 MSFT 15.21
1Aug2013 10:01 YHOO 10.01
1Aug2013 10:05 YHOO 10.05
2Aug2013 09:28 YHOO 09.28
2Aug2013 09:29 YHOO 09.29
2Aug2013 11:00 YHOO 11.00
2Aug2013 15:25 YHOO 15.25
2Aug2013 16:00 YHOO 16.00
;
run;
proc sort data=Prices;
by Ticker TradeDate TradeTime;
proc print data=Events;
proc print data=Prices;
/* too slow on big data */
proc sql;
select a.*, b.TradeTime as PriceTime, b.Price
from Events a
join Prices b
on a.TradeDate = b.TradeDate
and a.Ticker = b.Ticker
and b.TradeTime < a.TradeTime
group by a.TradeDate, a.Ticker, a.TradeTime
having b.TradeTime = max( b.TradeTime)
order by a.Ticker, a.TradeDate, a.TradeTime;
Here are the results:
TradeDate | TradeTime | Ticker | Qty | PriceTime | Price |
---|---|---|---|---|---|
08/01/2013 | 10:00:00 | MSFT | 1000 | 9:30:00 | 9.3 |
08/01/2013 | 10:02:00 | MSFT | 1002 | 10:01:00 | 10.01 |
08/01/2013 | 14:20:00 | MSFT | 1420 | 11:59:00 | 11.59 |
08/01/2013 | 15:59:00 | MSFT | 1559 | 11:59:00 | 11.59 |
08/02/2013 | 9:31:00 | MSFT | 931 | 9:30:00 | 9.3 |
08/02/2013 | 12:01:00 | MSFT | 1201 | 10:01:00 | 10.01 |
08/02/2013 | 15:20:00 | MSFT | 1520 | 15:19:00 | 15.19 |
08/02/2013 | 15:21:00 | MSFT | 1521 | 15:20:00 | 15.2 |
08/01/2013 | 10:06:00 | YHOO | 1006 | 10:05:00 | 10.05 |
08/01/2013 | 14:21:00 | YHOO | 1421 | 10:05:00 | 10.05 |
08/01/2013 | 15:58:00 | YHOO | 1558 | 10:05:00 | 10.05 |
08/02/2013 | 9:30:00 | YHOO | 930 | 9:29:00 | 9.29 |
08/02/2013 | 12:01:00 | YHOO | 1201 | 11:00:00 | 11 |
08/02/2013 | 15:20:00 | YHOO | 1520 | 11:00:00 | 11 |
08/02/2013 | 15:27:00 | YHOO | 1527 | 15:25:00 | 15.25 |
How about this :
data Events; format TradeDate mmddyys10. TradeTime time.; input TradeDate :anydtdte. TradeTime :anydttme. Ticker :$4. Qty :best10.; cards; 1Aug2013 10:00 MSFT 1000 1Aug2013 10:02 MSFT 1002 1Aug2013 14:20 MSFT 1420 1Aug2013 15:59 MSFT 1559 2Aug2013 09:31 MSFT 0931 2Aug2013 12:01 MSFT 1201 2Aug2013 15:20 MSFT 1520 2Aug2013 15:21 MSFT 1521 1Aug2013 10:00 YHOO 1000 1Aug2013 10:06 YHOO 1006 1Aug2013 14:21 YHOO 1421 1Aug2013 15:58 YHOO 1558 2Aug2013 09:30 YHOO 0930 2Aug2013 12:01 YHOO 1201 2Aug2013 15:20 YHOO 1520 2Aug2013 15:27 YHOO 1527 1Aug2013 10:00 AAPL 1000 1Aug2013 10:06 AAPL 1006 1Aug2013 14:21 AAPL 1421 1Aug2013 15:58 AAPL 1558 2Aug2013 09:30 AAPL 0930 2Aug2013 12:01 AAPL 1201 2Aug2013 15:20 AAPL 1520 2Aug2013 15:27 AAPL 1527 ; run; proc sort data=Events; by Ticker TradeDate TradeTime; run; data Prices; format TradeDate mmddyys10. TradeTime time.; input TradeDate :anydtdte. TradeTime :anydttme. Ticker :$4. Price :best10.; cards; 1Aug2013 09:30 MSFT 09.30 1Aug2013 10:01 MSFT 10.01 1Aug2013 10:02 MSFT 10.02 1Aug2013 11:59 MSFT 11.59 2Aug2013 09:30 MSFT 09.30 2Aug2013 10:01 MSFT 10.01 2Aug2013 15:18 MSFT 15.18 2Aug2013 15:19 MSFT 15.19 2Aug2013 15:20 MSFT 15.20 2Aug2013 15:21 MSFT 15.21 1Aug2013 10:01 YHOO 10.01 1Aug2013 10:05 YHOO 10.05 2Aug2013 09:28 YHOO 09.28 2Aug2013 09:29 YHOO 09.29 2Aug2013 11:00 YHOO 11.00 2Aug2013 15:25 YHOO 15.25 2Aug2013 16:00 YHOO 16.00 ; run; proc sort data=Prices; by Ticker TradeDate TradeTime; run; data want(drop=_price); set Events Prices(rename=(price=_price)) ; by Ticker TradeDate TradeTime ; retain price; if first.Ticker then call missing(price); if not missing(_price) then price=_price; if not missing(Qty) and not missing(price); run;
Xia Keshan
Hi,
Hope small modification will generate the desired results. Remove a.TradeTime from group by clause.
instead of
group by a.TradeDate, a.Ticker, a.TradeTime
try this
group by a.TradeDate, a.Ticker
The results generated by your suggestion are not correct, but thanks for responding.
the new results only contain the rows that can be priced with the latest price for each ticker/date pair. Events that occurred earlier than the latest price get omitted but should not be, because they have a valid price (before the event).
TradeDate TradeTime Tickr Qty PriceTime Price
08/01/2013 14:20:00 MSFT 1420 11:59:00 11.59
08/01/2013 15:59:00 MSFT 1559 11:59:00 11.59
08/02/2013 15:21:00 MSFT 1521 15:20:00 15.2
08/01/2013 10:06:00 YHOO 1006 10:05:00 10.05
08/01/2013 14:21:00 YHOO 1421 10:05:00 10.05
08/01/2013 15:58:00 YHOO 1558 10:05:00 10.05
08/02/2013 15:27:00 YHOO 1527 15:25:00 15.25
There are lot options to optimize performance. Most of then balancing between resources.
You did not mention those.
For a datastep merge approach using balance line (ordered access with point direct access)
%let dsid = %sysfunc(open(prices));
%let pr_nobs =%sysfunc(attrn(&dsid,NOBS));
%let rc = %sysfunc(close(&dsid));
%put prices has &pr_nobs observations;
/* using assumption oredered events - prices */
data results;
set Events nobs=evn_nobs ;
Retain eventlk 1 ;
check=1 ;
do while (eventlk < &pr_nobs & check ) ;
set Prices (rename=(Ticker=pTicker TradeDate=PriceDate TradeTime=Pricetime ) ) point=eventlk end=endprc ;
if ( Ticker > PTicker | ( Ticker = PTicker & TradeDate > PriceDate) |( Ticker = PTicker & TradeDate = PriceDate & TradeTime > Pricetime ) ) then eventlk=eventlk+1 ;
else check=0;
end;
if not endprc then eventlk=eventlk-1 ; /* one back that has passed set equation */
set Prices (rename=(Ticker=pTicker TradeDate=PriceDate TradeTime=Pricetime ) ) point=eventlk end=endprc ;
IF ( Ticker=PTicker & TradeDate = PriceDate) then output; /* */
run;
proc print;
run;
This is the kind of answer I was hoping for -- there are techniques here that are new to me. The conditional use of Set, pluse the point= option is something I need to learn about.
I'll give this a try (next week; I'm burned out) and report back.
Thanks for the reply.
This is a Hash solution, suppose 'prices' is small enough to fit into your memory, and 'prices' does not need to be presorted.
data want_hash;
if _n_=1 then
do;
if 0 then
set prices (rename=(tradetime=pricetime));
declare hash h(dataset:'prices (rename=(tradetime=pricetime))', multidata:'y');
h.definekey('ticker','tradedate');
h.definedata(all:
'y');
h.definedone();
end;
set Events;
do rc=h.find() by 0 while (rc=0);
if pricetime < tradetime then
do;
_f=1;
if pricetime > _t then
do;
_t=pricetime;
_P=price;
end;
end;
rc=h.find_next();
end;
price=_p;
pricetime=_t;
if _f=1 then
output;
drop rc _:;
run;
Haikuo
It's good that you posted this idea for future reference. As it happens, my Prices data is much bigger than my Events data, so this probably won't work for me. I can break the problem down into chunks (the smallest being a ticker-date), but extracting the chunks may be time-consuming, given that I'd have to wrap a loop around it. It's worth a test, however.
How about this :
data Events; format TradeDate mmddyys10. TradeTime time.; input TradeDate :anydtdte. TradeTime :anydttme. Ticker :$4. Qty :best10.; cards; 1Aug2013 10:00 MSFT 1000 1Aug2013 10:02 MSFT 1002 1Aug2013 14:20 MSFT 1420 1Aug2013 15:59 MSFT 1559 2Aug2013 09:31 MSFT 0931 2Aug2013 12:01 MSFT 1201 2Aug2013 15:20 MSFT 1520 2Aug2013 15:21 MSFT 1521 1Aug2013 10:00 YHOO 1000 1Aug2013 10:06 YHOO 1006 1Aug2013 14:21 YHOO 1421 1Aug2013 15:58 YHOO 1558 2Aug2013 09:30 YHOO 0930 2Aug2013 12:01 YHOO 1201 2Aug2013 15:20 YHOO 1520 2Aug2013 15:27 YHOO 1527 1Aug2013 10:00 AAPL 1000 1Aug2013 10:06 AAPL 1006 1Aug2013 14:21 AAPL 1421 1Aug2013 15:58 AAPL 1558 2Aug2013 09:30 AAPL 0930 2Aug2013 12:01 AAPL 1201 2Aug2013 15:20 AAPL 1520 2Aug2013 15:27 AAPL 1527 ; run; proc sort data=Events; by Ticker TradeDate TradeTime; run; data Prices; format TradeDate mmddyys10. TradeTime time.; input TradeDate :anydtdte. TradeTime :anydttme. Ticker :$4. Price :best10.; cards; 1Aug2013 09:30 MSFT 09.30 1Aug2013 10:01 MSFT 10.01 1Aug2013 10:02 MSFT 10.02 1Aug2013 11:59 MSFT 11.59 2Aug2013 09:30 MSFT 09.30 2Aug2013 10:01 MSFT 10.01 2Aug2013 15:18 MSFT 15.18 2Aug2013 15:19 MSFT 15.19 2Aug2013 15:20 MSFT 15.20 2Aug2013 15:21 MSFT 15.21 1Aug2013 10:01 YHOO 10.01 1Aug2013 10:05 YHOO 10.05 2Aug2013 09:28 YHOO 09.28 2Aug2013 09:29 YHOO 09.29 2Aug2013 11:00 YHOO 11.00 2Aug2013 15:25 YHOO 15.25 2Aug2013 16:00 YHOO 16.00 ; run; proc sort data=Prices; by Ticker TradeDate TradeTime; run; data want(drop=_price); set Events Prices(rename=(price=_price)) ; by Ticker TradeDate TradeTime ; retain price; if first.Ticker then call missing(price); if not missing(_price) then price=_price; if not missing(Qty) and not missing(price); run;
Xia Keshan
This is very interesting because it's so direct. My data is actually not sorted, but I built an index on (Ticker TradeDate TradeTime) on each, so I think the BY statement will still work. And I surely need to change the first IF to include TradeDate, like this
if first.Ticker or first.TradeDate then call missing(price);
Thanks for the reply; I'm excited to try this. (That should be obvious because I'm replying on a holiday weekend about a work-related problem.)
I haven't dived into your problem. But just want to hint that building an index for BY processing is rarely good for performance. Many times it's more efficient to sort the whole table instead. Pls test both scenarios an get back with your results.
Since it is sorted by ' Ticker TradeDate TradeTime;', "if first.TradeDate" will contain 'if first.Ticker', so 'if first.TradeDate then call missing(price);' should do. One less condition to check may save you a few secs.
Regards,
Haikuo
jdmarino can you give more details on your dataset-sizing SAS version and your computer resource limitations (memory cpu IO/dasd) ?
Indexing can work but has the advantage of additional IO (caused by random IO). IO is often the most delaying factor. Even when you an index it can be more efficient to process it sequential.
Today's computers are not having the limitations of several years back. The hash approach could be the most effective one when the memory (several Gb) is there.
My original SQL version took 5 hours.
KSharp's SET-BY interleaving took < 1 minute (after a 7 minute sort).
I haven't tried Jaap's method, mostly because it seems to me like a manual version of KSharp's.
I am trying Hai.Kuo's hash method, but running out of memory, and I don't know why. The dataset is 7G on my disk, but I have 24G of RAM. This is the error:
ERROR: Hash object added 16252912 items when memory failure occurred.
FATAL: Insufficient memory to execute DATA step program. Aborted during the EXECUTION phase.
ERROR: The SAS System stopped processing this step because of insufficient memory.
Proc options group=memory reports:
Group=MEMORY
SORTSIZE=268435456 Size parameter for sort
SUMSIZE=0 Upper limit for data-dependent memory usage during summarization
MAXMEMQUERY=0 Maximum amount of memory returned when inquiring as to available space
MEMBLKSZ=16777216 Size of memory blocks allocated to support MEMLIB and MEMCACHE options.
MEMMAXSZ=2147483648 Maximum amount of memory allocated to support MEMLIB and MEMCACHE options.
LOADMEMSIZE=0 Suggested memory limit for loaded SAS executables
MEMSIZE=2147483648 Specifies the limit on the total amount of memory to be used by the SAS System
REALMEMSIZE=0 Limit on the total amount of real memory to be used by the SAS System N
I set "-MEMSIZE 19G" in my sasv9.cfg file. Any ideas why I'm not using all the physical memory?
I have 64-bit SAS 9.3 (TS1M1) installed on a Windows 7 machine.
FYI: I don't expect the hash method to be efficient because I have only 100 keys (50 tickers x 2 dates) in my real data. That means for each key I'll be looping/scanning an average of 80,000 (16,000,000 items / 100 keys / 2) elements to find the one I want for each row in Events.
John you are telling You have set 19Gb for the memsize counting the digits I am seeing 2Gb. For sorting you are having 256Mb a common default value.
Setting it to higher values will improve processing performance by avoiding IO. But why is your setting of memsize a fail?
Which sasv9.config did you change and is that one really used? Not overwritten by another?
The hashing is using a b-tree keys search. Very popular as of NO-SQL should perform well. The number of keys can be a problem needing some additional tuning.
>> Why is memsize a fail?
I don't know. I was hoping you did.
Another fail: SORTSIZE. In my .cfg file I have it set to 19G.
Does one need admin rights on a Windows machine to set these params? I don't have admin rights on this machine.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.