Counting character observations per specific interval

Reply
Occasional Contributor
Posts: 7

Counting character observations per specific interval

Hi community,

Im very new to sas so i will really appreciate any help. My problem is that i need to sort out the high-frequency fx data (csv), which includes date, time per second, currency pairs, order type, bid price, ask price, bid volume and ask volume. So far i managed very little, only to sort if by currency pairs. My first issue is that i need to sum up the bid volume (as well as ask vol.) in 5 min interval, but the bid-ask vol are represented by certain letters. So what i need is to count how many A`s, B`etc. i have in each 5 min interval. The second issue is that i need to find the average of bid-ask price at the end of the same 5 min interval (per sec1 , than per sec 300, etc). The rest excel can manage.

Attaching the file.

I will be VERY grateful for any tips or help!!

Regards

Anna

Attachment
Super Contributor
Posts: 333

Re: Counting character observations per specific interval

I think we need some more information before we can help...

What are you summing over? Currency Pairs? Order Type? Everything?

What starts the 5 min intervals? Do these intervals overlap? Are these by currency pairs or order type (or both)?

Clarity on the needs for the average price would help as well.

EJ

Frequent Contributor
Posts: 114

Re: Counting character observations per specific interval

Hi,

Looking for more information...

as per heading may be COUNTC, COUNTW function can help.... but we need more info...

Occasional Contributor
Posts: 7

Re: Counting character observations per specific interval

Hi again,

I think i should tell the whole story of what im actually doing, hope it will shed some light.

I posted an example in excel to the original post of what Im trying to do.  The  columns im after are date, time,currency pair, 5 min return and net ask-bid volume. But since the amount data is HUGE I need to sort it out is sas so i can use it in eviews for some modelling.

First of all, the data needs to be separated into three files for eur/usd, use/chf and gbp/usd exchange rates. Then i need to calculate average of bid and ask prices per every second (if one of them is missing, then no averaging , for example, if no ask price, then ask=bid price). After that the 5 min ln return needs to be calculated (return=ln(everage bid-ask price per second 300/everage bid-ask price per second 1) starting from the beginning.

For the bid and ask volume the size is enjoyingly expressed by letters from A to G. So the way i am thinking to find what i actually need (net ask-bid volume) is:

count how many As, Bs, etc were there in the 5 min interval (the same interval for which return is calculated) separately for bid and ask volume. Then to assign specific value to each letter, multiply the amount of letter by the assigned value, sum them:

for example: if there were 2 As and 3Bs for bid volume in 5 min interval (and assigned values are A=1, B=6), then 2*1+3*6=20 (5 min bid volume).

the same goes for ask volume. the last step is 5 min ask volume minus 5 min bid volume  = net order flow.

I hope it sounds more clear now.

Looking forward for any suggestions!

Regards

Anna

Super Contributor
Posts: 644

Re: Counting character observations per specific interval

I'm not sure I can answer all your questions but I think I can assist you part of the way. You do not need to sort your data in order to split it out.

Can I assume you have used Proc Import to get the data into SAS?  Because you have no header row you probably have your table with columns called COL1, COL2, COL3 etc or something similar.  To check, use (assuming your table is called imported)

Proc Contents data = imported ;

run ;

Note from the output whether the date and time values have been correctly imported as numbers.  In the case of dates, most probably not.

Do the split first.

Data eur_usd, usd_chf gbp_usd ;

      Set imported ;

     Rename  COL1 = datechar

                    COL2 = time          /* or timechar if it is not numeric */

                    COL3 = currency_pair

                    /* etc */

                    ;

     Select COL3 ;

          When 'EUR/USD' then output eur_usd ;

          When 'USD/CHF' then output usd_chf ;

          When 'GBP/USD' then output gbp_usd ;

          Otherwise return ;

     End ;

Run ;


Now you should have 3 tables with just the values you want.

If all is well you can save much space


Proc Delete data = imported ;

run ;


(Can always be imported from the text file)


Perform the same calculations on each new table, I'll use gbp_usd as an example.


Data gbp_usd_1 ;

     Set gbp_usd ;

     Date = input (datechar, MMDDYY.) ;

     Format date MMDDYY10. ;

     /* If needed

     Time = input (timechar, time.) ;

     Format time time. ;

     */

     timesec = int(time) ;

     time5min = int(timesec/300) ;

     Format     timesec

                    time5min     time. ;

     /* more stuff - your averageing goes here */

     Drop     datechar /* timechar */ ;

run ;


Your description of the averaging has been somewhat 'lost in translation'.  I am guessing you need a row average of bid and ask, and then this values averaged over the 5 minute period.  To do this I would replace


/* more stuff - your averageing goes here */

with

     row_average = mean (COL5, COL6) ;     /* use the renamed values instead */


Then to get the 5 min average


Proc SQL ;

     Create table gbp_usd_5min_average as

          Select  date

               ,     time5min

               ,     currency_pair

               ,     MEAN (row_average)     As _5min_average

          From    gbp_usd_1

          Group  By     1, 2, 3

          ;

Quit ;


Try this out and let us know whether it meets your requirement, and if not try and explain your requirements even more clearly.  If you encounter errors please copy the LOG record so we can assist.


Richard


Occasional Contributor
Posts: 7

Re: Counting character observations per specific interval

Hi,

Thank you very much for your suggestions!

Ya, its not that easy to explain things as i thought.

I sort of figured out how to split the data to get three currency pairs that i want.

I guess now i need to create an interval variable for every 5 min ( trying now by using your suggestions). Next step will be:

for bid and ask price:  find average  per 1st sec of the beginning of the interval and per last second (sec. 300) of the end of the 5 min interval and so on with 300 sec step (or the closest since the time doesnt exactly go second by second ). Then to find return: R=average bidask price( per sec300)/average bidask price (per sec1) and so on, so i will end up with 5 min returns. I do not need to average of bid ask price for 5 min, just the average per every row (per every second), which then used to fine 5 min returns

for bid ask volume, the first thing is to count the amount of each letter (letters go from A fo G) in the same 5 min interval (like in excel file that i attached to the original post).

Any ideas? Thank you in advance!

Here is the code that i used to sort everything into different files:

data test2;

infile'C:/Thesis Data/ehd_1p5-2002/*.out' dlmstr=',' dsd truncover;

input date :mmddyy8. time :hhmmss8. spot $ type $ bid ask bidvol $ askvol $;

format date date9. time time8.0;

run;

%macro

split (data=,var=);

proc sort data=&data(keep=spot) out=values nodupkey;

by &var;

run;

data _null_;

set values end=last;

call symputx(

'val'||left(_n_),translate(&var,'_','/'));

call symputx(

'va'||left(_n_),&var);

if last then call symputx(

'count',_n_);

run;

data

%do i=1 %to &count;

&&val&i

%end;

;

set &data;

select(&var);

%do i=1 %to &count;

when(

"&&va&i") output &&val&i;

%end;

otherwise;

end;

run;

%mend

split;

%split(data=test2, var=spot)


Super Contributor
Posts: 333

Re: Counting character observations per specific interval

Maybe Im missing somthing, but I am still not sure what starts the interval and whether the different types of transactions matter. I think you need to describe the output you are looking for based on the example data you gave. For example if I look at the usd_jpy breakout in the sample data the first observation is 1:08 am but the next is 3:48 the next day. How do this work with your intervals? Without this information we cant help you identify the right observartions for your formulas.

EJ

Super Contributor
Posts: 644

Re: Counting character observations per specific interval

You undersold yourself on SAS skills!  Had you supplied your code in the first post we could have given you more specific advice.

I would recommend you do not sort your data, because it is already in date time order and a very large dataset consumes mega resources in sorting.  Also, unless you are careful, the sort may disturb the existing order.

To get distinct values of spot, and subsequently split your data

Proc SQL Noprint ;

     Crate table spot_list as

          Select distinct spot

          from test2

          ;

     Select distinct spot into :spotlist separated by ' '  /* space */

          from spot_list ;

     Select count * into :spotcount

          from spot_list ;

Quit ;

%Let spottables  = %sysfunc (translate(&spotlist, _, / )) ;

%macro Whenspots ;

     %do i = 1 to &spotcount ;

          When "%Scan(&Spotlist, &i)" then output %Scan(&spottables, &i) ;

     %end ;

%mend ;

Data &spottables ;

     Set test2 ;

     time5min = int(timesec/300) ;

     /* calculate row average - not sure i understand your requirements fully here */

     ...

     Format     time5min     time. ;

     Select spot ;

          %Whenspots ;

          Otherwise ;

     End ;

Run ;


From there you can modify my SQL code to get averages across 5 min intervals. Another SQL step (or Proc Summary with a class statement) can count your As, Bs, etc, and sum volumes, using "group by time5min, type" then followed by Proc transpose to put all the counts/sums into the same row.  A datastep using first.time5min and last.time5min (times need not be on the exact boundary) can be used for calculations specific to those rows (if required.)

Here is an outline fo at least soe of what you want

%Macro AnalyseSpottbl (Spottbl) ;

     Proc SQL ;

          Create table &Spottbl._summary as

               Select  date

                    ,     spot

                    ,     time5min

                    ,     type

                    ,     'count_' || type as countname

                    ,     'bidvol_' || type as bidname

                    ,     'askvol_' || type as askname

                    ,     count (*) as count5min

                    ,     sum (bidvol) as bidvol5min

                    ,     sum (askvol) as askvol5min

               from test2

               group by 1, 2, 3, 4, 5, 6, 7

               ;

Quit ;

Proc transpose data = &Spottbl._summary (rename = (bidname = _name_))

                    out = &Spottbl._bidvol (drop = _name_)

                    ;

          By date spot time5min ;

          var bidvol5min ;

Run ;

         

Proc transpose data = &Spottbl._summary (rename = (askname = _name_))

                    out = &Spottbl._askvol (drop = _name_)

                    ;

          By date spot time5min ;

          var askvol5min ;

Run ;

Proc transpose data = &Spottbl._summary (rename = (countname = _name_))

                    out = &Spottbl._counts (drop = _name_)

                    ;

          By date spot time5min ;

          var count5min ;

Run ;

Data &Spottbl._allstats ;

     Merge &Spottbl._counts

               &Spottbl._bidvol

               &Spottbl._askvol

               ;

          By date spot time5min ;

Run ;

%Mend ;

%Macro DoAnalysis (num) ;

     %Do I = 1 to &num ;

          %Let Spottbl = %Scan(&spottables, &i) ;

          %AnalyseSpottbl (&Spottbl) ;

     %End ;

%Mend ;

%DoAnalysis (&spotcount) ;

Good luck

Richard

Occasional Contributor
Posts: 7

Re: Counting character observations per specific interval

I have to say when i posted my question the first time i was still struggling with simple thing like opening cdv files in sas, but since then its going better, very slow though...

The sorting itself went fine, looks like i got all the stuff right. I just attached an example of 1 day data im working with. Now i created new variable: datetime variable, average price, netOF (net order flow (askvol-bidvol) - i replaced letters with actual values so its easier to work with them) and 5min interval, which is 1 for first 5 min (dont know how to do it for the whole period without using if statement for every datetime and not sure if i actually need it or not).  The thing is I'm not sure how to specify is sas:

1. how to calculate ln return by taking the last average price in 5min interval (in example file it is the average price in the same row as 02jan2002 00:07:02 ) and divide it by the first one (first row average price for 02jan2002 00:02:52 ) and so on til the end.Then report the value of return in the row which is the end of every 5min interval.

2. how to sum all net OF for every 5 min interval (for the first 5 min  - from 00:02:52 til 00:07:02) and report in the same row as return.

3. And how get the output with  variables which are calculated for every 5min (Im "shrinking" the data so eviews can read it)

Any ideas?

Thanks in advance

Anna

Super Contributor
Posts: 333

Re: Counting character observations per specific interval

The example was great but it starts mid dataset. How should them be treated? Why did it start at 18:31 when there was obs with 18:30 right before it? Also the end of the period is actually longer than 5 minutes so is the end the observation after 5 min is passed? what happens if the next observation is 10 minutes latter?

With out the rules to define the intervals we cant provide specific code. Basically you would have to id each observation that starts the interval (sec1) and retain the value until you get to your end of the interval then perform you calculation.

Here is the code I was playing with to figure it out:

data eur_usd2;

  set eur_usd;

  datetime = dhms(date,hour(time),minute(time),second(time));

  secsince = datetime - lag1(datetime);

  if secsince > 360 then do;

  start = 1;

  addsec = 0;

  end;

  else start = 0;

  if start = 0 then do;

  addsec = addsec + secsince;

  end;

  retain addsec;

  format datetime datetime16.;

run;

Occasional Contributor
Posts: 7

Re: Counting character observations per specific interval

I meant the last attachment, the txt file : test_eurusd. Sorry for inconvenience

Anna

Super Contributor
Posts: 333

Re: Counting character observations per specific interval

Are you hand assigning the 5min_int var? if so I would just id the start and end of each interval. (The formulas are not the problem ...I think the problem has been getting sas to id the intervals for you).

Using that as an example (0 to start the interval 1 for those obs in the interval, and 2 to end it or something like that):

data data3;

     set data2;

     if min_5_int = 0 then sec1avg = average(bid,ask);

    

     if min_5_int = 2 then do;

          sec300avg = average(bid,ask);

          5minret = ln(sec300avg/sec1avg);

     end;

     retain sec1avg;

     /*once you verify this works you can add: if min_5_int = 2 then output; */

  run;

I didnt have time to test it but hopefully it at least gets you going in the right direction.

EJ

Occasional Contributor
Posts: 7

Re: Counting character observations per specific interval

Thanks! I think it looks like what i need, will try it.

Anna

Occasional Contributor
Posts: 7

Re: Counting character observations per specific interval

Im not sure how to id the intervals correctly to avoid hand assigning... Any tips?

Regards

Anna

Super Contributor
Posts: 333

Re: Counting character observations per specific interval

The code in the early post was my first pass at identifying the interval. But just saying 5 min interval does not mean a lot once you look at complex data. What if there is only a 3 min interval before a large jump in time? Does that matter? The likely hood that every series is not exactly 5 minutes complicates things.

I think the answer lies in the following code (I just dont have the time to get further with it). Hopefully someone else in the community may have some ideas / solutions.

data eur_usd2;

  set eur_usd;

  start = 0;

  datetime = dhms(date,hour(time),minute(time),second(time));

  avg = mean(bid,ask);

  if _N_ = 1 then do;

  start = 1;

  secsince = 0;

  lagavg = 0;

  addsec = 0;

  end;

  else do;

  secsince = datetime - lag1(datetime);

  lagavg = lag1(avg);

  end;

  if secsince > 300 or addsec>300 then do;

  start = 1;

  addsec = 0;

  sec1avg = avg;

  end;

  if start = 0 then do;

  addsec = addsec + secsince;

  end;

  retain addsec sec1avg;

  format datetime datetime16.;

run;

EJ

Ask a Question
Discussion stats
  • 14 replies
  • 643 views
  • 6 likes
  • 4 in conversation