Hello sas community!
Many thanks again for beeing very helpful regarding my last problem. The solution worked marvelously!
My issue is the following. I have a large dataset containing ultra high frequency data (tick data):
An example.csv is attached below. I have already removed many obvious data anomalies and now want to filter for outliers as suggested in the literature.
1) For quotes that have the same time stamp (in HH:MM: SS : SSS precision) I want to detect duplicates (see the last to lines of my illustration above). For each of these duplicates I want to keep the median of RateBid and RateAsk for the respective timestamps respectively. So something like that:
My attempt was this:
proc sql;
title 'Duplicate Rows in DUPLICATES Table';
select *, count(*) as Count
from example
group by Time
having count(*) > 1;
run;
So I was able to get at least see what timestamps feature duplicates. Online I found many solutions for detecting and deleting duplicates using proc sort and e.g. the noupkey option, however, I did not find any that suits my purpose.
As a second step I want to perform a manupulation regarding which I struggle understanding sas's syntax.
For each day delete the entry for which the spread is more than 50 times the median spread on that day. My attempt is this:
data example;
set example;
spread = RateAsk - RateBid;
run;
data example;
set example;
t=datepart(Time);
t1=timepart(Time);
format t date9. t1 time.;
run;
data example;
set example;
by t;
medspread= median(spread);
run;
data example;
set example;
if spread > 50*medspread then delete;
run;
This approch is erroneous, however. I don't get an error message but sas does not coumpute the median of the variable "spread" for each day. I must have got the indexing wrong.
After this step, to make sure, I want to delete all entries for which the so called mid-quote deviated by more than 10 mean absolute deviations from a rolling centered median (excluding the observation under consideration) of the 50 observations around the one considered. Here to be honest, I cannot figure out how to construct such a measure in sas. Generally I am used to working with matlab where I would just construct a loop and use the runnig index to index intro the matrix elements. I know, however, that this is not the way to go in sas (that I know).
I hope you can help me with my issue. Thank you very much in advance!
Kind regars
As an update,
I seem to have figured out my first issue:
proc sql;
create table want as
select *, median(RateBid) as median_bid, median(RateAsk) as median_ask
from work.example
group by Time;
quit;
data want;
set want;
by Time;
if last.Time;
run;
so calculate the median for all observations (which is just the observation if there are no duplicates) and then I select the last one of ech observation. So for all time stamps only one observation (either the observation itself or the median) remains. Feel free to correct this, if I overlook something!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.