04-25-2018 07:35 AM - edited 04-25-2018 07:38 AM
Hello sas community!
Many thanks again for beeing very helpful regarding my last problem. The solution worked marvelously!
My issue is the following. I have a large dataset containing ultra high frequency data (tick data):
An example.csv is attached below. I have already removed many obvious data anomalies and now want to filter for outliers as suggested in the literature.
1) For quotes that have the same time stamp (in HH:MM: SS : SSS precision) I want to detect duplicates (see the last to lines of my illustration above). For each of these duplicates I want to keep the median of RateBid and RateAsk for the respective timestamps respectively. So something like that:
My attempt was this:
proc sql; title 'Duplicate Rows in DUPLICATES Table'; select *, count(*) as Count from example group by Time having count(*) > 1; run;
So I was able to get at least see what timestamps feature duplicates. Online I found many solutions for detecting and deleting duplicates using proc sort and e.g. the noupkey option, however, I did not find any that suits my purpose.
As a second step I want to perform a manupulation regarding which I struggle understanding sas's syntax.
For each day delete the entry for which the spread is more than 50 times the median spread on that day. My attempt is this:
data example; set example; spread = RateAsk - RateBid; run; data example; set example; t=datepart(Time); t1=timepart(Time); format t date9. t1 time.; run; data example; set example; by t; medspread= median(spread); run; data example; set example; if spread > 50*medspread then delete; run;
This approch is erroneous, however. I don't get an error message but sas does not coumpute the median of the variable "spread" for each day. I must have got the indexing wrong.
After this step, to make sure, I want to delete all entries for which the so called mid-quote deviated by more than 10 mean absolute deviations from a rolling centered median (excluding the observation under consideration) of the 50 observations around the one considered. Here to be honest, I cannot figure out how to construct such a measure in sas. Generally I am used to working with matlab where I would just construct a loop and use the runnig index to index intro the matrix elements. I know, however, that this is not the way to go in sas (that I know).
I hope you can help me with my issue. Thank you very much in advance!
04-25-2018 10:37 AM
As an update,
I seem to have figured out my first issue:
proc sql; create table want as select *, median(RateBid) as median_bid, median(RateAsk) as median_ask from work.example group by Time; quit; data want; set want; by Time; if last.Time; run;
so calculate the median for all observations (which is just the observation if there are no duplicates) and then I select the last one of ech observation. So for all time stamps only one observation (either the observation itself or the median) remains. Feel free to correct this, if I overlook something!