SAS Data Management

NewSASuser2018 · Posted 04-25-2018 07:35 AM

Hello sas community!

Many thanks again for beeing very helpful regarding my last problem. The solution worked marvelously!

My issue is the following. I have a large dataset containing ultra high frequency data (tick data):

Time RateBid RateAsk .....

01.01.2015:17:12:12.445   xxxxxxxxxx              xxxxxxxxx
01.01.2015:17:13:32.565   xxxxxxxxxx              xxxxxxxxx
01.01.2015:17:13:40.685   xxxxxxxxxx              xxxxxxxxx
01.01.2015:17:14:59.895   1.32473             1.32487
01.01.2015:17:14:59.895 1.86743               1.97473

An example.csv is attached below. I have already removed many obvious data anomalies and now want to filter for outliers as suggested in the literature.

1) For quotes that have the same time stamp (in HH:MM: SS : SSS precision) I want to detect duplicates (see the last to lines of my illustration above). For each of these duplicates I want to keep the median of RateBid and RateAsk for the respective timestamps respectively. So something like that:

Time RateBid RateAsk .....

01.01.2015:17:12:12.445   xxxxxxxxxx              xxxxxxxxx
01.01.2015:17:13:32.565   xxxxxxxxxx              xxxxxxxxx
01.01.2015:17:13:40.685   xxxxxxxxxx              xxxxxxxxx
01.01.2015:17:14:59.895   medRateBid         medRateAsk

My attempt was this:

proc sql;
   title 'Duplicate Rows in DUPLICATES Table';
   select *, count(*) as Count
      from example
      group by Time
      having count(*) > 1;
run;

So I was able to get at least see what timestamps feature duplicates. Online I found many solutions for detecting and deleting duplicates using proc sort and e.g. the noupkey option, however, I did not find any that suits my purpose.

As a second step I want to perform a manupulation regarding which I struggle understanding sas's syntax.

For each day delete the entry for which the spread is more than 50 times the median spread on that day. My attempt is this:

data example;
set example;
spread = RateAsk - RateBid;
run;

data example;
set example;
t=datepart(Time);
t1=timepart(Time);
format t date9. t1 time.;
run;

data example;
set example;
by t;
medspread= median(spread);
run;

data example;
set example;
if spread > 50*medspread then delete;
run;

This approch is erroneous, however. I don't get an error message but sas does not coumpute the median of the variable "spread" for each day. I must have got the indexing wrong.

After this step, to make sure, I want to delete all entries for which the so called mid-quote deviated by more than 10 mean absolute deviations from a rolling centered median (excluding the observation under consideration) of the 50 observations around the one considered. Here to be honest, I cannot figure out how to construct such a measure in sas. Generally I am used to working with matlab where I would just construct a loop and use the runnig index to index intro the matrix elements. I know, however, that this is not the way to go in sas (that I know).

I hope you can help me with my issue. Thank you very much in advance!

Kind regars

NewSASuser2018 · Posted 04-25-2018 10:05 AM

Any help with any Problem as they are technicall 3 is greatly appreciated!

Best

NewSASuser2018 · Posted 04-25-2018 10:37 AM

As an update,

I seem to have figured out my first issue:

proc sql;
create table want as
select *, median(RateBid) as median_bid, median(RateAsk) as median_ask
from work.example
group by Time;
quit;


data want;
  set want;
  by Time;
  if last.Time;   
run;

so calculate the median for all observations (which is just the observation if there are no duplicates) and then I select the last one of ech observation. So for all time stamps only one observation (either the observation itself or the median) remains. Feel free to correct this, if I overlook something!

SAS Data Management

delete entries for which the spread is 50 times the median spread

Re: delete entries for which the spread is 50 times the median spread

Re: delete entries for which the spread is 50 times the median spread

SAS Helps You Understand Disease Spread: Part 2 Biostat Concepts in th...

Spreading the dates over the months

Using Network Analysis and Machine Learning to Identify Virus Spread T...

Group data and spread across.

Mean median

Follow Us

What is...

SAS Data Management

Join us for our biggest event of the year!

Follow Us

What is...