Re: Proc univariate - selecting numbers for bins

michal_1407 · Posted 09-23-2025 08:34 AM

Hi,
 
I need your help.
 
I want to understand how PROC UNIVARIATE decides how many bins are required.
 
I have the following code:
 
ods output HistogramBins = prefix_th; 
proc univariate data = datain noprint; 
histogram y / vscale = percent  MIDPERCENTS;
run;
 
and I see that the number of bins differs depending on the data.
 
I found the original paper: https://www.jstor.org/stable/2288074 and I use the following approach:
width = 3.5 * σ * n^(-1/3)
nbins = ceil( (max - min) / width )
 
but I still get a different number of bins than SAS produces.
 
Can you help?
 
from support:
 
ENDPOINTS <=values |KEY |UNIFORM>
uses histogram bin endpoints as the tick mark values for the horizontal axis and determines how to compute the bin width of the histogram bars. You can specify the following values:
values
specifies both the left and right endpoints of each histogram interval. The width of the histogram bars is the difference between consecutive endpoints. The procedure uses the same values for all variables.

KEY
determines the endpoints for the data in the key cell. The initial number of endpoints is based on the number of observations in the key cell by using the method of Terrell and Scott (1985). The procedure extends the endpoint list for the key cell in either direction as necessary until it spans the data in the remaining cells.
UNIFORM
determines the endpoints by using all the observations as if there were no cells. In other words, the number of endpoints is based on the total sample size by using the method of Terrell and Scott (1985).
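
For comparison with the automatic choice, the "values" form of ENDPOINTS= quoted above fixes the bins explicitly. The following is a minimal sketch, assuming the Sashelp.Heart sample data and an illustrative endpoint list (50 to 350 by 25) chosen only because it should comfortably span the Weight variable. The output data set name work.bins_fixed is hypothetical.

```sas
/* Sketch: bypass the automatic bin selection with explicit endpoints. */
/* The endpoint list is an illustrative assumption, not a recommended  */
/* setting; pick values that span your own data.                       */
proc univariate data=sashelp.heart noprint;
   var weight;
   histogram weight / endpoints = 50 to 350 by 25
                      outhistogram = work.bins_fixed;  /* one row per bin */
run;

proc print data=work.bins_fixed;
run;
```

With explicit endpoints the bin width is simply the difference between consecutive values, so any mismatch with a hand-computed formula disappears. KEY and UNIFORM only matter when a CLASS statement splits the data into cells.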

Tom · Posted 09-23-2025 09:17 AM

If you just want a HISTOGRAM why are you running PROC UNIVARIATE instead of the appropriate graphics procedure, like PROC SGPLOT with the HISTOGRAM statement?

michal_1407 · Posted 09-23-2025 09:57 AM

My goal is to get the HistogramBins table.
Based on this table the histogram is generated. I edited my post, sorry for my mistake

Tom · Posted 09-23-2025 10:08 AM

What do you mean by "table"?

Sounds like you want to use ODS OUTPUT to convert this TABLE (tabular report) in the output of PROC UNIVARIATE into a DATASET?

And what is your question or your goal?

Do you want to know if there is a way to change PROC UNIVARIATE so that it produces a different number of bins?

Do you want to understand how PROC UNIVARIATE decides how many bins are required?

michal_1407 · Posted 09-23-2025 10:58 AM

Hi,

I want to understand how PROC UNIVARIATE decides how many bins are required.

Only this

ballardw · Posted 09-24-2025 01:18 AM

@michal_1407 wrote:
Hi,

I want to understand how PROC UNIVARIATE decides how many bins are required.

Only this

PROC UNIVARIATE has been around for a very long time. The first time I used it was 1987, and it wasn't new then, so there are a great many options available, and they can interact.

If you check the online references you will likely see repeated references to

the procedure computes the midpoints by using an algorithm (Terrell and Scott 1985)

which in the references listed becomes:

Terrell, G. R., and Scott, D. W. (1985). “Oversmoothed Nonparametric Density Estimates.” Journal of the American Statistical Association 80:209–214.

michal_1407 · Posted 09-24-2025 04:21 AM

Thanks, I saw this paper, but I still get a different number of bins in SAS than the paper gives, and I want to understand how SAS does it.

Tom · Posted 09-24-2025 09:31 AM

@michal_1407 wrote:
Thanks, I saw this paper, but I still get a different number of bins in SAS than the paper gives, and I want to understand how SAS does it.

I don't have access to the paper. Can you show your work? How did you pick the KEY cell? (Or, for that matter, how does PROC UNIVARIATE pick the KEY cell?) Or did you ask it to just use UNIFORM bins?

It looks like PROC UNIVARIATE can output a number of statistics that from their names might be related to that paper. Perhaps you could see if using those in the formula shows how it determined the number of bins.

Also note that the particular ODS output table you selected does not include empty bins, at least it did not include empty bins at front or back in the examples I tried. Is that confusing your calculations?
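
One way to "show the work" is to put the hand-computed bin counts next to what the procedure produced. The sketch below assumes the Sashelp.Heart sample data and the hypothetical data set names work.hist_bins, work.stats, and work.rule_of_thumb. It computes the formula from the original post (width = 3.5 * σ * n^(-1/3), which as far as I know is Scott's 1979 normal-reference rule) and, as a second candidate, the Terrell and Scott (1985) lower bound of ceil((2n)^(1/3)) bins. Whether PROC UNIVARIATE uses exactly either of these as its starting point is an assumption, not something confirmed in this thread.

```sas
/* Bins that PROC UNIVARIATE actually chose (no plot, no tables) */
proc univariate data=sashelp.heart noprint;
   var weight;
   histogram weight / outhistogram=work.hist_bins noplot;
run;

/* Summary statistics needed for the hand calculation */
proc means data=sashelp.heart noprint;
   var weight;
   output out=work.stats n=n std=sigma min=xmin max=xmax;
run;

data work.rule_of_thumb;
   set work.stats;
   /* Formula from the original post (assumed to be Scott's rule) */
   width_scott = 3.5 * sigma * n**(-1/3);
   nbins_scott = ceil((xmax - xmin) / width_scott);
   /* ASSUMPTION: Terrell-Scott lower bound on the number of bins */
   nbins_ts    = ceil((2*n)**(1/3));
run;

proc print data=work.rule_of_thumb;
   var n sigma xmin xmax width_scott nbins_scott nbins_ts;
run;

/* Count and span of the bins SAS produced, for comparison */
proc sql;
   select count(*)     as nbins_sas,
          min(_midpt_) as first_midpoint,
          max(_midpt_) as last_midpoint
   from work.hist_bins;
quit;
```

If the counts disagree, compare the spacing of the midpoints as well: a round spacing (for example 10 or 25) suggests the width was adjusted after the initial count was chosen.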

Ksharp · Posted 09-24-2025 03:49 AM

If you want to get the HistogramBins table, try the OUTHISTOGRAM= option:

proc univariate data=sashelp.heart;
var weight;
histogram weight/outhistogram= histogram;
run;

And @Rick_SAS might give you a hand.

https://blogs.sas.com/content/iml/2023/05/01/overlay-curve-histogram-sas.html

michal_1407 · Posted 09-24-2025 04:20 AM

Thanks for the answer, but I want to understand how SAS determines the number of bins.

Rick_SAS · Posted 09-29-2025 10:52 AM

If your question is answered, please close this thread. If you have additional questions, please let us know how we can help.

Rick_SAS · Posted 09-24-2025 11:10 AM

The histogram bin widths (and therefore the number of bins) are determined not only by n, the number of nonmissing values, but also by choosing bin widths that are "convenient", as described in Lewart (Algorithm 463 of the Collected Algorithms of the ACM, 1973). You can get the bin locations from Lewart's algorithm by using the GSCALE subroutine in SAS IML. For details, examples, and a discussion, see "The location of ticks in statistical graphics" on The DO Loop blog.
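
A minimal sketch of the GSCALE approach, assuming the Sashelp.Heart sample data and assuming the requested number of intervals is the Terrell-Scott lower bound ceil((2n)^(1/3)). The names nTarget and nIntervals are hypothetical, and whether PROC UNIVARIATE passes exactly these arguments internally is not confirmed here, so the ticks may not match the procedure's bins in every case.

```sas
proc iml;
   use sashelp.heart;
   read all var {"Weight"} into x;
   close sashelp.heart;

   x = x[loc(x ^= .)];            /* keep nonmissing values only */
   n = nrow(x);

   /* ASSUMPTION: initial interval count from Terrell and Scott (1985) */
   nTarget = ceil((2*n)##(1/3));

   /* GSCALE returns "nice" axis values: the first three elements of  */
   /* scale are the scaled minimum, scaled maximum, and the increment */
   call gscale(scale, x, nTarget);
   ticks = do(scale[1], scale[2], scale[3]);
   nIntervals = ncol(ticks) - 1;

   print n nTarget nIntervals;
   print scale, ticks;
quit;
```

Because the increment is rounded to a convenient number, nIntervals can differ from nTarget, which is one plausible reason a formula that uses only n, σ, and the range does not reproduce the bin count.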
