Solved: Negative residual values with heatmap

daltonchris7720 · Posted 09-11-2019 12:44 AM

Hello

I have a large dataset (>5,000,000 observations) and I'm trying a few different models with an insurance cost (OOP, or out of pocket) as the outcome. There are lots of zeros in the outcome variable, so I'm trying PROC HPGENSELECT with a Tweedie or a ZINB distribution (separate models).

When I try to do a heat map of the residuals or standardised residuals v. the predicted values, the map won't show the negative residuals on the y axis.

I'm using

ods graphics on / maxobs=5565474 nxybinsmax=673740588;
proc sgplot data=ZINB_resid_stan;
   heatmap x=predZINB_OOP y=resid_stan;
run;

Do I need some more code to make the heatmap show the negative values? I've attached an example of the heatmap output.

Regards

Chris

Rick_SAS · Posted 09-11-2019 08:53 AM

I suspect if you add

REFLINE 0 / axis=y;

you will see that the negative bins are actually there. You can use

PROC MEANS data=ZINB_resid_stan;
var resid_stan;
run;

to find the minimum value of resid_stan, which I'd guess to be about -3.

The following example shows that the HEATMAP statement can, indeed, show negative values.

data test;
call streaminit(1);
do i = 1 to 10000;
   x = rand("Normal");
   y = rand("Normal");
   output;
end;
run;

proc sgplot data=test;
  heatmap x=x y=y;
run;

daltonchris7720 · Posted 09-13-2019 02:18 AM

Thanks Rick.

Unfortunately I updated the dataset I am analysing and I'm now getting this message:

The default number of heatmap bins exceeds the maximum possible number of bins.  The heatmap is not produced.

I haven't increased the number of observations greatly (5565474 to 5723125) so I'm not sure why it is doing this now.

With the smaller dataset, the log was giving me the option of increasing NXYBINSMAX as well as MAXOBS, but it's not doing this now. Maybe I've just hit the limit!

Regards

Chris

Rick_SAS · Posted 09-13-2019 07:52 AM

What is the goal of your analysis?

The HEATMAP statement approximates the density of the scatter plot. With that many bins, each box shows the count for a very small region. There are other ways to get similar information, as detailed in the article "The Essential Guide to Binning in SAS."

The section on 2-D binning links to information about PROC KDE and using SAS/IML to compute the counts.

If you use PROC KDE, the syntax might look something like this:

proc kde data=sashelp.heart;
  bivar cholesterol systolic / bwm=0.5 plots=contour ngrid=250;
run;

If you use PROC IML (or even the DATA step) to compute the counts within each bin, you can then create a scatter point where each point is the center of the bin and the color of the marker represents the count.
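
The DATA step route described above might be sketched as follows. This is only an illustration under assumptions: the data set names (sim, binned, BinCounts), the variable names (xCenter, yCenter) and the bin width of 0.25 are made up for the example, and the simulated data stands in for your own residuals. It also assumes a SAS 9.4 release in which the SCATTER statement supports the COLORRESPONSE= option.

```sas
/* Simulated stand-in data; replace with your own data set */
data sim;
   call streaminit(1);
   do i = 1 to 10000;
      x = rand("Normal");
      y = rand("Normal");
      output;
   end;
run;

/* Assign each observation to a bin and record the bin center.
   The width of 0.25 is an arbitrary choice for illustration. */
%let width = 0.25;
data binned;
   set sim;
   xCenter = (floor(x / &width) + 0.5) * &width;
   yCenter = (floor(y / &width) + 0.5) * &width;
run;

/* Count the observations in each occupied bin */
proc freq data=binned noprint;
   tables xCenter*yCenter / out=BinCounts(drop=percent);
run;

/* One marker per bin, colored by the count in that bin */
proc sgplot data=BinCounts;
   scatter x=xCenter y=yCenter / colorresponse=count
           markerattrs=(symbol=SquareFilled size=6);
run;
```

Because the counts are computed before plotting, the graph only has to draw one marker per occupied bin rather than millions of observations.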

daltonchris7720 · Posted 09-15-2019 02:37 AM

Thanks again Rick.

The goal here is just to get a view of the usual residual scatter plots (standardised residuals v. predicted outcome etc.) and a scatter plot of observations v. predicted outcomes.

The outcome is the out of pocket expense I mentioned in another post (trying to use PROC HPFMM), modelled here as a continuous outcome using PROC HPREG (OLS model) and PROC HPGENSELECT (with a Tweedie distribution). I'm also trying to model it as a count outcome with PROC HPGENSELECT and a ZINB distribution.

So I'm looking at the residuals, or trying to, with these models.

Despite the large dataset, these models are not very "accurate", but I guess I don't have the "correct" predictor variables. Like trying to model stock prices, I suspect.

A logistic model seems more useful, modelling the out of pocket expense as binary.

I'll investigate the guide to binning.

Regards

Chris

Rick_SAS · Posted 09-15-2019 05:41 AM

If that is your goal, you do not need as many bins as you are using. A typical graph uses about 600 horizontal pixels and 400 vertical pixels for the graph area. To make a heat map on a fine scale, you probably want between 1 and 5 pixels per bin. It certainly doesn't make sense to use more than 600 x 400 = 240,000 bins!

Try using about 200 bins in each direction and see if that enables you to see the density of the data at the scale you are interested in:

proc sgplot data=Have;
   heatmap x=x1 y=x2 / nxbins=200 nybins=200;
run;
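
Applied to the data set and variable names from the original question, the same idea might look like the sketch below. The MAXOBS= value of 6000000 is an assumption (a round number above the 5,723,125 observations mentioned earlier in the thread), and the REFLINE statement is the one suggested earlier to mark zero on the residual axis.

```sas
/* MAXOBS= must be at least the number of observations;
   6000000 is an assumed round figure, adjust as needed */
ods graphics on / maxobs=6000000;

proc sgplot data=ZINB_resid_stan;
   /* 200 x 200 bins instead of the default bin count */
   heatmap x=predZINB_OOP y=resid_stan / nxbins=200 nybins=200;
   /* reference line at zero so bins with negative residuals are easy to spot */
   refline 0 / axis=y;
run;
```

With the bin counts set explicitly, there should be no need to raise NXYBINSMAX.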

daltonchris7720 · Posted 09-15-2019 08:33 PM

Thanks Rick.

Always simple when you know how.

Regards

Chris
