07-18-2016 02:04 PM
I have written a procedure that flags suspicious data for a process in which we read in data on a daily basis into a data warehouse. One of the dimensions that I am currently flagging is when we have a correlation break (which I have defined as a situation where the day over day correlation of today's data when compared to yesterday's data is significantly different than the historical values).
If this happens, I was thinking that it would be useful to see an overlay of the distribution of today's values over some historical distribution of the data. How would I do that?
The current form of the data that I have is a skinny table with three columns: date, identifier, value. There are about 1000 identifiers, and 50 days worth of data (for this example).
Or, what is the best way that you could think of to show a representation of this "correlation break" so that we can determine whether or not there is, in fact, a data integrity problem.
07-18-2016 02:22 PM
Can you provide a bit more definition to the "correlation break"? Do you mean correlation is the PROC CORR sense, or something else?
It might be helpful to post a small sample of dummy data, perhaps a dataset YESTERDAY, a dataset TODAYcorr which passes your correlation test, and a dataset TODAYnocorr which would be flagged as suspicous.
Know the metric you are using to define correlation break should help people in suggesting a visualization for this concept.
07-18-2016 02:36 PM
Thanks for your reply. I was hoping not to get into the definition of "correlation break" since the way that I flag that is a little involved, but I can give it a shot if you'd like.
I calculate the day-over-day correlation of a metric every day. So if the data from yesterday is exactly the same for today for a certian variable, then the correlation would be 1. And then I use proc timeseries to try to get a predicted value of that correlation (since it wouldn't make sense to just constrain for the correlation to be within certain bounds for all different variables, regardless). If the value of today's correlation is larger than three standard deviations away from the predicted value using proc timeseries, then I flag that variable for today as having a correlation break.
07-18-2016 02:28 PM
If you a comparison data set with the same horizontal axis variable and values then append the report values to the base data with an additional variable to indicate the data source (base, current or similar). The IN data set options can do this easily:
if in2 then source="Current";
You may need to control the x values of the output set possibly with range values in a WHERE clause data set option as well.
Proc Sgplot data=want;
series x=xvariable y=yvariable/ group=source;
If I were doing this for as many variables as you appear to have then a macro or call execute code would likely be appropriate.
07-18-2016 02:38 PM
Thank you very much, I will see what this looks like. I am just trying to find the best way to look at this data.
I am currently doing this, which seems to be ok:
proc sgplot data=want;
density w_old / LEGENDLABEL= "Historical" lineattrs=(pattern=solid);
density w_new / LEGENDLABEL= "Today" lineattrs=(pattern=solid);
keylegend / location=inside position=topright across=1;
Thanks again for your suggestion!
07-18-2016 02:53 PM
If you want to display the correlation break itself, perhaps would want some sort of series plot with date on the x-axis, with the y-axis plotting the observed correlation each day, and a band plot showing the predicted interval. And any points are outside of the predicted band, they could be traffic-lighted.
Another option would be to just visualize today's correlation (with a scatter plot and regression line), rather than visualize the break. That is, if you are triggering an assessment of today's data, maybe it's enough to send an alert email that says "observed correlation of today's data and yesterday's data was X, which is less than expected lower limit of Y. See below correlation plot
But I maybe misunderstanding...