daltonchris7720
Calcite | Level 5

Hello

 

I have a large dataset (>5,000,000 observations) and I'm trying a few different models with an insurance cost (OOP, or out of pocket) as the outcome. There are lots of zeros in the outcome variable, so I'm trying PROC HPGENSELECT with a Tweedie or ZINB distribution (separate models).

When I try to do a heat map of the residuals or standardised residuals v. the predicted observations, the map won't show the negative residuals on the y axis.

I'm using

ods graphics on/MAXOBS=5565474 NXYBINSMAX= 673740588;
proc sgplot data=ZINB_resid_stan;
heatmap x=predZINB_OOP y=resid_stan;
run;

Do I need some more code to make the heatmap show the negative values? I've attached an example of the heatmap output.

 

Regards

 

Chris

1 ACCEPTED SOLUTION

Accepted Solutions
Rick_SAS
SAS Super FREQ

If that is your goal, you do not need as many bins as you are using. A typical graph uses about 600 horizontal pixels and 400 vertical pixels for the graph area. To make a heat map on a fine scale, you probably want between 1 and 5 pixels per bin. It certainly doesn't make sense to use more than (600 x 400) bins!

 

Try using about 200 bins in each direction and see if that enables you to see the density of the data at the scale you are interested in:

 

proc sgplot data=Have;
   heatmap x=x1 y=x2 / nxbins=200 nybins=200;
run;


Rick_SAS
SAS Super FREQ

I suspect if you add

    REFLINE 0 / axis=y;

you will see that the negative bins are actually there. You can use

 

PROC MEANS data=ZINB_resid_stan;
var resid_stan;
run;

 

to find the minimum value of resid_stan, which I'd guess to be about -3.
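Putting the two suggestions together with the dataset and variable names from the original post, a minimal sketch might look like the following. The MIN and MAX keywords on PROC MEANS are an addition here, used only to restrict the output to the range of the residuals.

```sas
/* Check the range of the standardised residuals first */
proc means data=ZINB_resid_stan min max;
   var resid_stan;
run;

/* Same heat map as in the question, with a reference line at y=0 */
/* so that any bins below zero are easy to spot                   */
proc sgplot data=ZINB_resid_stan;
   heatmap x=predZINB_OOP y=resid_stan;
   refline 0 / axis=y;
run;
```

If the reported minimum is negative but nothing appears below the reference line, the issue is more likely the bin size or the colour scale than the HEATMAP statement itself.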

 

The following example shows that the HEATMAP statement can, indeed, show negative values.

 

data test;
call streaminit(1);
do i = 1 to 10000;
   x = rand("Normal");
   y = rand("Normal");
   output;
end;
run;

proc sgplot data=test;
  heatmap x=x y=y;
run;
daltonchris7720
Calcite | Level 5

Thanks Rick.

Unfortunately I updated the dataset I am analysing and I'm now getting this message:

The default number of heatmap bins exceeds the maximum possible number of bins.  The heatmap is not produced.

I haven't increased the number of observations greatly (5565474 to 5723125) so I'm not sure why it is doing this now.

With the smaller dataset, the log was giving me the option of increasing NXYBINSMAX as well as MAXOBS, but it's not doing this now. Maybe I've just hit the limit!

Regards

Chris

Rick_SAS
SAS Super FREQ

What is the goal of your analysis?

 

The HEATMAP statement approximates the density of the scatter plot. With that many bins, each box shows the count for a very small region. There are other ways to get similar information, as detailed in the article "The Essential Guide to Binning in SAS."

The section on 2-D binning links to information about PROC KDE and using SAS/IML to compute the counts.

 

If you use PROC KDE, the syntax might look something like this:

proc kde data=sashelp.heart;
  bivar cholesterol systolic / bwm=0.5 plots=contour ngrid=250;
run;

If you use PROC IML (or even the DATA step) to compute the counts within each bin, you can then create a scatter point where each point is the center of the bin and the color of the marker represents the count.
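As a rough illustration of the DATA step route, the sketch below bins the data from the original post by hand, counts the observations in each bin, and plots the bin centers coloured by count. The bin widths (xwidth, ywidth) and the dataset names binned and bincounts are assumptions made up for this example; choose widths that suit the actual range of the data. It also assumes a SAS release in which the SCATTER statement supports COLORRESPONSE=.

```sas
/* Hypothetical bin widths: adjust to the range of your data */
%let xwidth = 50;
%let ywidth = 0.25;

/* Assign each observation to the center of the bin that contains it */
data binned;
   set ZINB_resid_stan(keep=predZINB_OOP resid_stan);
   where nmiss(predZINB_OOP, resid_stan) = 0;
   xbin = (floor(predZINB_OOP / &xwidth) + 0.5) * &xwidth;
   ybin = (floor(resid_stan   / &ywidth) + 0.5) * &ywidth;
run;

/* Count the observations in each non-empty bin */
proc summary data=binned nway;
   class xbin ybin;
   output out=bincounts(drop=_type_ rename=(_freq_=count));
run;

/* One marker per bin center; marker colour shows the count */
proc sgplot data=bincounts;
   scatter x=xbin y=ybin / colorresponse=count
                           markerattrs=(symbol=SquareFilled size=4);
   refline 0 / axis=y;
run;
```

Because the plot is drawn from the summarised table rather than from the millions of raw observations, it should not run into the MAXOBS or bin-count limits mentioned earlier in the thread.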

 

daltonchris7720
Calcite | Level 5

Thanks again Rick.

The goal here is just to get a view of the usual residual scatter plots (standardised residuals v. predicted outcome etc.) and a scatter plot of observations v. predicted outcomes.

The outcome is the out of pocket expense I mentioned in another post (trying to use PROC HPFMM), modelled here as a continuous outcome using PROC HPREG (OLS model) and PROC HPGENSELECT (with a Tweedie distribution). I'm also trying to model it as a count outcome with PROC HPGENSELECT and a ZINB distribution.

So I'm looking at the residuals, or trying to, with these models.

Despite the large dataset, these models are not very "accurate", but I guess I don't have the "correct" predictor variables. Like trying to model stock prices, I suspect.

A logistic model seems more useful, modelling the out of pocket expense as binary.

I'll investigate the guide to binning.

Regards

Chris

Rick_SAS
SAS Super FREQ

If that is your goal, you do not need as many bins as you are using. A typical graph uses about 600 horizontal pixels and 400 vertical pixels for the graph area. To make a heat map on a fine scale, you probably want between 1 and 5 pixels per bin. It certainly doesn't make sense to use more than (600 x 400) bins!

 

Try using about 200 bins in each direction and see if that enables you to see the density of the data at the scale you are interested in:

 

proc sgplot data=Have;
   heatmap x=x1 y=x2 / nxbins=200 nybins=200;
run;
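Applied to the residual plot from the original question, this might look like the sketch below. With 200 bins in each direction on a graph area of roughly 600 x 400 pixels, each bin covers about 3 x 2 pixels, and the total of 40,000 bins is well under 600 x 400 = 240,000. The MAXOBS= value is the observation count quoted earlier in the thread; whether it is still required at this bin count is an assumption, so it can be dropped if the log does not ask for it.

```sas
/* Raise the observation limit only if the log requests it */
ods graphics on / maxobs=5723125;

proc sgplot data=ZINB_resid_stan;
   /* 200 x 200 bins instead of the default bin count */
   heatmap x=predZINB_OOP y=resid_stan / nxbins=200 nybins=200;
   /* reference line to make negative residuals easy to see */
   refline 0 / axis=y;
run;
```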

daltonchris7720
Calcite | Level 5

Thanks Rick.

Always simple when you know how.

Regards

Chris
