Data visualization with SAS programming

How to Best Present Data

Reply
Frequent Contributor
Posts: 101

How to Best Present Data

Hey all,

 

I have written a procedure that flags suspicious data for a process in which we read in data on a daily basis into a data warehouse.  One of the dimensions that I am currently flagging is when we have a correlation break (which I have defined as a situation where the day over day correlation of today's data when compared to yesterday's data is significantly different than the historical values).

 

If this happens, I was thinking that it would be useful to see an overlay of the distribution of today's values over some historical distribution of the data.  How would I do that?

 

The current form of the data that I have is a skinny table with three columns: date, identifier, value.  There are about 1000 identifiers, and 50 days worth of data (for this example).

 

Or, what is the best way that you could think of to show a representation of this "correlation break" so that we can determine whether or not there is, in fact, a data integrity problem.

 

PROC Star
Posts: 1,287

Re: How to Best Present Data

Can you provide a bit more definition to the "correlation break"?  Do you mean correlation is the PROC CORR sense, or something else?

 

It might be helpful to post a small sample of dummy data, perhaps a dataset YESTERDAY, a dataset TODAYcorr which passes your correlation test, and a dataset TODAYnocorr which would be flagged as suspicous.

 

Know the metric you are using to define correlation break should help people in suggesting a visualization for this concept.

 

 

 

Frequent Contributor
Posts: 101

Re: How to Best Present Data

Hello,

 

Thanks for your reply.  I was hoping not to get into the definition of "correlation break" since the way that I flag that is a little involved, but I can give it a shot if you'd like.

 

I calculate the day-over-day correlation of a metric every day.  So if the data from yesterday is exactly the same for today for a certian variable, then the correlation would be 1.  And then I use proc timeseries to try to get a predicted value of that correlation (since it wouldn't make sense to just constrain for the correlation to be within certain bounds for all different variables, regardless).  If the value of today's correlation is larger than three standard deviations away from the predicted value using proc timeseries, then I flag that variable for today as having a correlation break.

 

 

Super User
Posts: 11,101

Re: How to Best Present Data

If you a comparison data set with the same horizontal axis variable and values then append the report values to the base data with an additional variable to indicate the data source (base, current or similar). The IN data set options can do this easily:

 

Data want;

    set

        base

        report (in=in2);

   if in2 then source="Current";

   else Source="Base";

run;

 

You may need to control the x values of the output set possibly with range values in a WHERE clause data set option as well.

 

Then

Proc Sgplot data=want;

   series x=xvariable y=yvariable/ group=source;

run;

 

If I were doing this for as many variables as you appear to have then a macro or call execute code would likely be appropriate.

Frequent Contributor
Posts: 101

Re: How to Best Present Data

ballardw,

 

Thank you very much, I will see what this looks like.  I am just trying to find the best way to look at this data.  

 

I am currently doing this, which seems to be ok:

 

title 'Distributions';
proc sgplot data=want;
density w_old / LEGENDLABEL= "Historical" lineattrs=(pattern=solid);
density w_new / LEGENDLABEL= "Today" lineattrs=(pattern=solid);
keylegend / location=inside position=topright across=1;
xaxis display=(nolabel);
run;

 

Thanks again for your suggestion!

PROC Star
Posts: 1,287

Re: How to Best Present Data

If you want to display the correlation break itself, perhaps would want some sort of series plot with date on the x-axis, with the y-axis plotting the observed correlation each day, and a band plot showing the predicted interval.  And any points are outside of the predicted band, they could be traffic-lighted.

 

Another option would be to just visualize today's correlation (with a scatter plot and regression line), rather than visualize the break.  That is, if you are triggering an assessment of today's data, maybe it's enough to send an alert email that says "observed correlation of today's data and yesterday's data was X, which is less than expected lower limit of Y.  See below correlation plot  

 

But I maybe misunderstanding...

Ask a Question
Discussion stats
  • 5 replies
  • 302 views
  • 0 likes
  • 3 in conversation