BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Dominik1
Fluorite | Level 6

Hey, iam new to SAS and I'm writting my Bachelorthesis in empirical finance at the moment. I have a major problem though. For the thesis which deals with transaction costs i have 2 data sets. One contains bid-ask quotes for a certain stock and one contains all the trades for this stock. The problem is now that i need to compute an "easy" equation. I need to compute the realized spread which is defined as price of a trade t minus the quotation midoint in five minutes. And thats the problem. I don't know how to get SAS to the point that it gives me the value of the mid quotation in 5 minutes which is also in the data set. The thing is that there is probably never a quotation midpoint exactly! after 5 minutes.

Maybe one of you knows how to deal with such a problem.

 

Thank you in advance.

Dominik

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

Assuming I understand what is your purpose.

 

Data A ;
input midquote time : anydttme.;
format time time.;
cards;
91.72       9:02:35.27
91.73       9:02:40.21
91.7         9:02:47.51
;
run;
 
Data B; 
input tradepice time : anydttme.;
format time time.;
cards;
91.73         9:02:36.27
91.6           9:02:40.21
91.74         9:02:41.21
;
run;

data want;
 set a(in=ina) b(in=inb);
 by time;
 retain _midquote _time;
 if ina then do;_midquote=midquote; _time=time;end;
 if inb;
 format _time time.;
 drop midquote time;
run;

View solution in original post

14 REPLIES 14
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Not entirely sure on your question.  Posting test data - in the form of a datastep, plus required output is useful to illustrate the problem.  If you want to find a time value plus five, then you can do: 

new_time=time_variable + "00:05"t;

 

Remember a time value is the number of seconds since midnight, so you could do:
time_variable + 5 * 60;  

i.e. 5 times 60 seconds.

 

As for the rest of your question its not clear at all to me.

 

Dominik1
Fluorite | Level 6

Thank you for your answer and sorry for being unclear. It is not that easy to describe. I will try to show you the data structure. The problem is that i need to calcuate a simple difference. Not a time difference but a difference of the value of a variable lets say at 2:15pm (now) and the value of the variable in 2:20pm which is also in the data set.

 

So i have two data sets. Data set A contains the variable of interest called midquote and all limit orders that have been postet at a specific time. Data set B contains data about occuring trades (prices, quanitity) and a time variable.

 

Data A (example, there are about 500 000 observations)

midquote time

91.72       9:02:35.27

91.73       9:02:40.21

91.7         9:02:47.51

 

Data B 

tradepice time

91.73         9:02:36.27

91.6           9:02:40.21

91.74         9:02:41.21

 

What i need to do now is to subtract from the trade price now the value for the midquote that prevails 5 minutes later. The problem is that there is often no midquote exactly 5 minutes later which I need to take the closest one in time. There is the idea that for each trade i cut out a t+5 minute timeintervall of dataset A and take the last observation. However, i dont know exactly how to program such a loop.

I hope I could give you some more information.

 

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Ah ok, think I am getting it.  What you want to do is to merge the merge the largest time which is within the 5 minute future back to the data.  Try something like:

data a;
  mid_quote=91.72; time="9:02:35.27"t; output;
  mid_quote=91.73; time="9:02:40.21"t; output;
  mid_quote=91.7; time="9:02:47.51"t; output;
run;

data b; 
  trade_price=91.73; time="9:02:36.27"t; output;
  trade_price=91.6; time="9:02:40.21"t; output;
  trade_price=91.74; time="9:02:41.21"t; output;
run;

proc sql;
  create table WANT as
  select  B.*,
          (select MID_QUOTE from (select * from A where (B.TIME-"00:05"t) <= TIME <= B.TIME) having TIME=max(TIME)) as MID_QUOTE
  from    B B;
quit;
Dominik1
Fluorite | Level 6

Ok thank you that sounds good. I put in the variables time is actually called lob_time and dataset a ads_lob_taq while dataset b is called tradads.

 

So i put it like this 

proc sql;
  create table WANT as
  select  B.*,
          (select midquote from (select * from ads_lob_taq where (B.lob_time-"00:05"t) <= lob_time <= B.lob_time) having lob_time=max(lob_time)) as midquote
  from    tradads B;
quit;

 

However, i get the error message now:

507 proc sql;
508 create table WANT as
509 select B.*,
510 (select mpask from (select * from ads_lob_taq where (B.lob_time-"00:05"t) <=
510! lob_time <= B.lob_time) having lob_time=max(lob_time)) as mpask
511 from tradads B;
NOTE: The query requires remerging summary statistics back with the original data.
ERROR: Subquery evaluated to more than one row.
NOTE: Correlation values are: lob_time=32782.98 lob_time=32782.98 .

 

what can I do to avoid that? Thank you already for your help'!!

RW9
Diamond | Level 26 RW9
Diamond | Level 26

What this means is that the subquery (the bit between brackets) is returning more than one record based on the where condition.  It may be as simple as putting distinct:

proc sql;
  create table WANT as
  select  B.*,
          (select distinct midquote from (select * from ads_lob_taq where (B.lob_time-"00:05"t) <= lob_time <= B.lob_time) having lob_time=max(lob_time)) as midquote
  from    tradads B;
quit;

That should solve it.  If it doesn't then you might be getting different miquote values, on the same minimum value.  Thats more problematic as you need to decide which one is more relevant and then code to that.  For instance if 01:00 appears twice with values 9.23 and 9.21, they both fulfill the restriction, but you can't have two of them, so is it the biggest, the smallest, average etc.

Dominik1
Fluorite | Level 6

I used your version and it was executing! 🙂 So far so good, however it is still operating that step and no end yet 😞 How can i implement this with the average oder largest? 

 

Edit: Wow, it is doing what it should do! Thank you very very much, I have been so busy with finding a solution for that problem. The only thing is that it needs endless times to execute the order. Is there any possibility to reduce computing time? SAS is now executing for two hours 😞

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Yes, that's probably because its doing the subquery for each row.  The below code should work faster, it basically joins all data back to trade price dataset, then sorts and takes the first.  Its not pretty, but I don't have time right now to really think about it, though you could check out hash tables:

http://support.sas.com/documentation/cdl/en/lrcon/65287/HTML/default/viewer.htm#n1b4cbtmb049xtn1vh9x...

 

data b;
  mid_quote=91.72; time="9:02:35.27"t; output;
  mid_quote=91.73; time="9:02:40.21"t; output;
  mid_quote=91.7; time="9:02:47.51"t; output;
  format time time8.;
run;

data a; 
  trade_price=91.73; time="9:02:36.27"t; output;
  trade_price=91.6; time="9:02:40.21"t; output;
  trade_price=91.74; time="9:02:41.21"t; output;
  format time time8.;
run;

proc sql;
  create table WANT as
  select  A.*,
          B.MID_QUOTE,
          B.TIME as BTIME
  from    A A
  left join B B
  on      (B.TIME-"00:05"t) <= A.TIME <= B.TIME;
quit;
proc sort data=want;
  by time btime;
run;
data want;
  set want;
  by time;
  if first.time then output;
run;
  
Ksharp
Super User

Then what is your output ?

Dominik1
Fluorite | Level 6

They output is as requested, I get the midquote in t+5 written in the trade file for each trade. So my Trade file is extended by the midquote variable. As far as I checked it the quotes are correct. The only problem is that i have about 30 pairs of trade and quote files and so far I was only able to process four because the step needs about 1-2 hours per file. Is there a way to reduce the time?

Ksharp
Super User

Assuming I understand what is your purpose.

 

Data A ;
input midquote time : anydttme.;
format time time.;
cards;
91.72       9:02:35.27
91.73       9:02:40.21
91.7         9:02:47.51
;
run;
 
Data B; 
input tradepice time : anydttme.;
format time time.;
cards;
91.73         9:02:36.27
91.6           9:02:40.21
91.74         9:02:41.21
;
run;

data want;
 set a(in=ina) b(in=inb);
 by time;
 retain _midquote _time;
 if ina then do;_midquote=midquote; _time=time;end;
 if inb;
 format _time time.;
 drop midquote time;
run;
Dominik1
Fluorite | Level 6

Thank you very much Xia! I shifted the first file 5 minutes back in time and then it worked perfectly! RW9 your code was also helpful but still need hours of computing time for large samples. 

Thank you both for your help!

mgao6767
Calcite | Level 5

Hi I've encountered the same problem recently and I've found several ways in doing that. They are all effective but the most efficient one I found is as follow. It generates correct results for billions of transactions within minutes. The PROC SQL is effective but it takes enormous time to execute should you have a huge dataset. So I figure out a way to do it only using DATA step.

 

The problem is to find the nearest midpoint after 5min for every transaction.

 

Suppose you have the following dataset:

 

Data transactions;
	input PRICE MIDPT TIME : anydttme.;
	format time time.;
	cards;
		18.2	18.43	10:00:53
		18.41	18.405	10:07:05
		18.38	18.375	10:10:14
		18.42	18.395	10:13:53
	;
run;

 

 

The code below will generate a dataset named FinalMatchedResult where the nearest MIDPT after 5min for each transaction is added as a new variable MIDPT_5.

 

 

			/*	
				The transactions are sorted according to TIME.
				Add an ID to each transaction.
			 */
			DATA WORK.transactions; 
				set WORK.transactions;
				ID + 1;
			RUN;

			/*	
				Create a new dataset where each transaction is delayed by 5min (300s).
				Keep record of the ID in the original dataset.
			 */
			DATA WORK.transactions2;
				set WORK.transactions;
				TIME = TIME + 300;
				rename ID = OriginalID;
			RUN;

			/*	
				Stack the two datasets together and sort them by TIME, ID and OriginalID.

				After these two steps, you will have a dataset where the nearst MIDPT after 5min
					for each transaction is simply the first obs that has non-missing OriginalID.

				This finding is especially important, and you should notice such result in the
					output WORK.transaction3 sorted.
			 */
			DATA WORK.transactions3;
				set WORK.transactions WORK.transactions2;
			RUN;

			PROC sort data= WORK.transactions3;
				by TIME ID OriginalID;
			RUN;

			/*	
				Based on previous result, delete all irrelavent transactions.
			 */
			DATA WORK.transactions3_matched (drop=lastObs);
				set transactions3;
				lastObs = lag(OriginalID);
				if OriginalID =. and lastObs =. then delete;
			RUN;

			/*	
				One MIDPT maybe matched to several past transactions 5min ago.
				Luckily these transactions' positions in the dataset are known, start to _N_ -1,
					where _N_ is the row number of current MIDPT.
				The matched results are saved in WORK.transactions3_matched.
			 */
			DATA WORK.matched_result;
				set WORK.transactions3_matched;
				retain start 1;
				if OriginalID =. then do;
					MIDPT_5 = MIDPT;
					do i = start to _N_ - 1;
						set transactions3_matched point= i;
						output;
					end;
					start = _N_ + 1;
				end;
			RUN;

			/*	
				WORK.FinalMatchedResult saves all nearest MIDPT 5min later for every transaction.
				You could double check the dataset to make sure you have the result right.
			 */
			DATA WORK.FinalMatchedResult(drop=ID OriginalID start);
				set WORK.matched_result;
				TIME = TIME - 300;
			RUN;

The output dataset looks like this:

 

 

PRICE    MIDPT    TIME           MIDPT_5

18.2        18.43      10:00:53     18.405

18.41      18.405    10:07:05     18.395

 

 

Hope it helps!

 

Adrian

Ksharp
Super User
Can you start a new session ? It is almost a year ago.

And Don't  forget  Posting  your sample data and output you want .


mgao6767
Calcite | Level 5
Yes sure. I found this discussion days ago when I was dealing with exactly the same problem. Don't know whether I should start a new one because I'm only improving on execution time.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 2242 views
  • 3 likes
  • 4 in conversation