My first post... please be kind 😊
I am using SAS Studio.
I have a base table of date, time, and coin ticker, as follows:
data BaseTXs;
infile datalines delimiter=',';
input
date :yymmdd10.
time :time.
coin $;
format
date yymmdd10.
time timeampm.;
datalines;
2020-09-01,9:29:55 PM,BTC
2020-09-01,9:58:00 PM,BTC
2020-09-08,10:30:42 AM,BTC
2020-09-09,4:20:00 PM,BTC
;
To my base table, I simply want to append the price of bitcoin using my PriceLookup table, picking the price that is closest to my date and time. Here is my example PriceLookup table:
data PriceLookup;
infile datalines delimiter=',' truncover;
input
price
date :yymmdd10.
time :time.
coin $;
format
price 8.2
date yymmdd10.
time timeampm.;
datalines;
351.05,2020-08-31,11:59:59 PM,BTC
400.45,2020-09-01,8:16:41 AM,BTC
425.02,2020-09-01,9:18:45 PM,BTC
378.05,2020-09-01,10:13:47 PM,BTC
755.66,2020-09-07,6:23:47 PM,BTC
789.99,2020-09-08,9:25:00 AM,BTC
804.33,2020-09-08,1:55:54 PM,BTC
1013.82,2020-09-09,3:05:00 AM,BTC
1020.10,2020-09-09,6:00:00 PM,BTC
1015.05,2020-09-10,8:00:00 PM,BTC
999.65,2020-09-10,8:15:05 PM,BTC
;
So my final output table should look like this:
data BaseTXs_fin;
infile datalines delimiter=',';
input
date :yymmdd10.
time :time.
coin $
price;
format
date yymmdd10.
time timeampm.
price 8.2;
datalines;
2020-09-01,9:29:55 PM,BTC,425.02
2020-09-01,9:58:00 PM,BTC,378.05
2020-09-08,10:30:42 AM,BTC,789.99
2020-09-09,4:20:00 PM,BTC,1020.10
;
I have seen several posts/articles showing how this can be done in one PROC SQL step, but I can't seem to alter the code to fit my needs (I am mostly unsure how to setup the WHERE, GROUP BY, and HAVING statements). Here, for example:
https://www.lexjansen.com/pharmasug/2003/CodersCorner/cc001.pdf
Any help would be much appreciated!
(My request is actually a bit simplified, as my production base table consists of more coins than just bitcoin, and each different coin has its own separate PriceLookup table. So I'd need to do a merge for each coin. I think I can do this on my own if I can just get the simplified version to work, but if you're up for providing the extra help, I certainly won't complain!)
Keeping sorted data is one of SAS's great features, so you should take advantage of that.
This takes 20 seconds on my small Windows server with 500 and 200,000,000 observations.
data _V/view=_V; %* Save data for one date and next date on one line;
if ^LASTOBS then do;
set PRICELOOKUP (firstobs=2);
DT2=DATE*3600*24+TIME; PRICE2=PRICE;
end;
set PRICELOOKUP end=LASTOBS;
DT1=DATE*3600*24+TIME; PRICE1=PRICE;
format DT1 DT2 datetime.;
run;
data WANT;
set BASETXS _V; %* Interleave prices and transactions;
by DATE TIME;
if PRICE1 then do; % Previous date data also contain next date price, retain all;
D1=DT1; P1=PRICE1; D2=DT2; P2=PRICE2; retain D1 D2 P1 P2;
end;
else do; %* Transaction found, use previous record values and keep closest;
DIFF1=abs(DATE*3600*24 + TIME - D1);
DIFF2=abs(DATE*3600*24 + TIME - D2);
PRICE=ifn(DIFF1<DIFF2, P1, P2);
output;
end;
keep DATE TIME COIN PRICE;
run;
.
Congratulations on a perfect first post!
Well presented, well formatted, clear, comprehensive and concise.
And this coming from the grumpier one around here... 🙂
For small volumes, this will do:
proc sql;
select b.*, l.PRICE, abs( l.DATE*3600*24 + l.TIME - b.DATE*3600*24 - b.TIME ) as DIFF
from BASETXS b, PRICELOOKUP l
group by b.DATE, b.TIME
having DIFF=min(DIFF);
date | time | coin | price | DIFF |
---|---|---|---|---|
2020-09-01 | 9:29:55 PM | BTC | 425.02 | 670 |
2020-09-01 | 9:58:00 PM | BTC | 378.05 | 947 |
2020-09-08 | 10:30:42 AM | BTC | 789.99 | 3942 |
2020-09-09 | 4:20:00 PM | BTC | 1020.10 | 6000 |
This creates a Cartesian product however, so speed will plummet as volumes grow.
If the volumes are too large, other more complex strategies may be necessary.
Thank you for the kind words and swift reply, ChrisNZ! Your code suggestion worked perfectly on my sample tables (as you already knew), but it failed on my huge production tables (as you also predicted). I get ERROR: Sort execution failure. I looked it up on support.sas.com and thought maybe moving my work library to a bigger hard drive would fix that, but that process looks a little too technical.
My base table is small enough I think, with around 1200 records and a size of 320kb. But my smallest PriceLookup table is about 500,000 records, and the largest is about 21,000,000 records. Doing a Cartesian product is probably indeed bogging it down.
Note: All tables are already sorted by date, time... and all PriceLookup tables have the same layout. Not sure if that's helpful info or not.
I was trying to step through a logical solution on my own:
This sounds like I would need to nest my datasteps though, which I dont think I can do.
Any code or suggestions would be much appreciated!
Hello @iggles and welcome to the SAS Support Communities!
Interesting task. I would suggest that you create numeric informats from CNTLIN= datasets based on the PriceLookup tables (e.g., one per coin or one per coin and year, depending on the data volumes). The start and end points of the informat ranges would be the mean values of two consecutive datetimes in the PriceLookup table. (The first and last range could use HLO='L' and 'H', respectively.) The prices would serve as the labels. The lookup would then be a simple assignment statement using the INPUTN function.
Here's an example using your sample datasets, creating one informat per coin, using the name of the coin as the informat name (assuming that those are valid informat names).
data infmts(keep=fmtname--start);
do until(last.coin);
set pricelookup;
by coin;
dt=dhms(date,0,0,time);
lag_dt=lag(dt);
lag_price=lag(price);
if first.coin then do;
fmtname=coin;
type='I';
hlo='L';
end;
else do;
end=mean(dt, lag_dt);
label=lag_price;
output;
hlo='I';
start=end;
end;
if last.coin then do;
hlo='H';
label=price;
output;
end;
end;
format start end datetime21.1;
run;
proc format cntlin=infmts;
run;
data want;
set BaseTXs;
price=inputn(put(dhms(date,0,0,time),12.),coin);
format price 8.2;
run;
Thanks for giving it a go, @FreelanceReinh !
I went with ChrisNZ's suggestion and that seemed to work well enough for me. Although I did not try your code yet, I can still use it as a learning tool for future reference. Thanks again!
Keeping sorted data is one of SAS's great features, so you should take advantage of that.
This takes 20 seconds on my small Windows server with 500 and 200,000,000 observations.
data _V/view=_V; %* Save data for one date and next date on one line;
if ^LASTOBS then do;
set PRICELOOKUP (firstobs=2);
DT2=DATE*3600*24+TIME; PRICE2=PRICE;
end;
set PRICELOOKUP end=LASTOBS;
DT1=DATE*3600*24+TIME; PRICE1=PRICE;
format DT1 DT2 datetime.;
run;
data WANT;
set BASETXS _V; %* Interleave prices and transactions;
by DATE TIME;
if PRICE1 then do; % Previous date data also contain next date price, retain all;
D1=DT1; P1=PRICE1; D2=DT2; P2=PRICE2; retain D1 D2 P1 P2;
end;
else do; %* Transaction found, use previous record values and keep closest;
DIFF1=abs(DATE*3600*24 + TIME - D1);
DIFF2=abs(DATE*3600*24 + TIME - D2);
PRICE=ifn(DIFF1<DIFF2, P1, P2);
output;
end;
keep DATE TIME COIN PRICE;
run;
.
Thanks @ChrisNZ !
This gets me close enough such that I think I can get it from here. My production BaseTX table has several different coins in the coin column, and each coin has its own PriceHistory table, so I'll just need to loop through each coin's PriceHistory table. For anyone else reading, the solution code that ChrisNZ provided assumes only 1 type of coin and 1 PriceHistory table, but it works like a charm to grab the best match based on date/time.
BTW Chris, I assume from the fern avatar and 'NZ' in your name that you are from New Zealand? I was there for the 2019 International Hobbit Day celebration (I'm from the USA). I actually rented a campervan for 2 months and drove around both islands, taking selfies where I could dressed as a hobbit. That itself was a lot of fun, but the jaw-dropping scenery is what makes me want to go back. Truly an amazing country! Take care, and thanks again for the help.
2 months! Nice. Very glad you enjoyed your time here. We're lucky we have a sensible govt and are covid free atm, so it's a good place to be.
For more coins, and provided the volumes are reasonable and the lookup tables homogeneous in structure, this should work:
proc sql;
select b.*
, l.PRICE
, abs(l.DATE*3600*24+l.TIME -b.DATE*3600*24-b.TIME) as DIFF
from BASETXS b
, (select * from PRICELOOKUP1
union
select * from PRICELOOKUP2 ) l
where b.COIN = l.COIN
group by b.DATE, b.TIME, b.COIN
having DIFF=min(DIFF);
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.