BookmarkSubscribeRSS Feed
mahler_ji
Obsidian | Level 7

Hey all,

 

I have written a procedure that flags suspicious data for a process in which we read in data on a daily basis into a data warehouse.  One of the dimensions that I am currently flagging is when we have a correlation break (which I have defined as a situation where the day over day correlation of today's data when compared to yesterday's data is significantly different than the historical values).

 

If this happens, I was thinking that it would be useful to see an overlay of the distribution of today's values over some historical distribution of the data.  How would I do that?

 

The current form of the data that I have is a skinny table with three columns: date, identifier, value.  There are about 1000 identifiers, and 50 days worth of data (for this example).

 

Or, what is the best way that you could think of to show a representation of this "correlation break" so that we can determine whether or not there is, in fact, a data integrity problem.

 

5 REPLIES 5
Quentin
Super User

Can you provide a bit more definition to the "correlation break"?  Do you mean correlation is the PROC CORR sense, or something else?

 

It might be helpful to post a small sample of dummy data, perhaps a dataset YESTERDAY, a dataset TODAYcorr which passes your correlation test, and a dataset TODAYnocorr which would be flagged as suspicous.

 

Know the metric you are using to define correlation break should help people in suggesting a visualization for this concept.

 

 

 

BASUG is hosting free webinars ! Next up: Art Carpenter on February 28 at noon ET. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
mahler_ji
Obsidian | Level 7

Hello,

 

Thanks for your reply.  I was hoping not to get into the definition of "correlation break" since the way that I flag that is a little involved, but I can give it a shot if you'd like.

 

I calculate the day-over-day correlation of a metric every day.  So if the data from yesterday is exactly the same for today for a certian variable, then the correlation would be 1.  And then I use proc timeseries to try to get a predicted value of that correlation (since it wouldn't make sense to just constrain for the correlation to be within certain bounds for all different variables, regardless).  If the value of today's correlation is larger than three standard deviations away from the predicted value using proc timeseries, then I flag that variable for today as having a correlation break.

 

 

ballardw
Super User

If you a comparison data set with the same horizontal axis variable and values then append the report values to the base data with an additional variable to indicate the data source (base, current or similar). The IN data set options can do this easily:

 

Data want;

    set

        base

        report (in=in2);

   if in2 then source="Current";

   else Source="Base";

run;

 

You may need to control the x values of the output set possibly with range values in a WHERE clause data set option as well.

 

Then

Proc Sgplot data=want;

   series x=xvariable y=yvariable/ group=source;

run;

 

If I were doing this for as many variables as you appear to have then a macro or call execute code would likely be appropriate.

mahler_ji
Obsidian | Level 7

ballardw,

 

Thank you very much, I will see what this looks like.  I am just trying to find the best way to look at this data.  

 

I am currently doing this, which seems to be ok:

 

title 'Distributions';
proc sgplot data=want;
density w_old / LEGENDLABEL= "Historical" lineattrs=(pattern=solid);
density w_new / LEGENDLABEL= "Today" lineattrs=(pattern=solid);
keylegend / location=inside position=topright across=1;
xaxis display=(nolabel);
run;

 

Thanks again for your suggestion!

Quentin
Super User

If you want to display the correlation break itself, perhaps would want some sort of series plot with date on the x-axis, with the y-axis plotting the observed correlation each day, and a band plot showing the predicted interval.  And any points are outside of the predicted band, they could be traffic-lighted.

 

Another option would be to just visualize today's correlation (with a scatter plot and regression line), rather than visualize the break.  That is, if you are triggering an assessment of today's data, maybe it's enough to send an alert email that says "observed correlation of today's data and yesterday's data was X, which is less than expected lower limit of Y.  See below correlation plot  

 

But I maybe misunderstanding...

BASUG is hosting free webinars ! Next up: Art Carpenter on February 28 at noon ET. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

sas-innovate-2024.png

 

Time is running out to save with the early bird rate. Register by Friday, March 1 for just $695 - $100 off the standard rate.

 

Check out the agenda and get ready for a jam-packed event featuring workshops, super demos, breakout sessions, roundtables, inspiring keynotes and incredible networking events. 

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 966 views
  • 0 likes
  • 3 in conversation