michal_1407
Obsidian | Level 7
Hi,
 
I need your help.
 
I want to understand how PROC UNIVARIATE decides how many histogram bins to use.
 
I have the following code:
 
ods output HistogramBins = prefix_th; 
proc univariate data = datain noprint; 
histogram y / vscale = percent  MIDPERCENTS;
run;
 
Depending on the data, I get different numbers of bins.
 
I found the original paper (https://www.jstor.org/stable/2288074) and tried the following approach:
width = 3.5 * σ * n^(-1/3)
nbins = ceil( (max - min) / width )
 
But I still get a different number of bins than SAS produces.
 
Can you help?
 
From the support documentation:
 
ENDPOINTS <=values |KEY |UNIFORM>
uses histogram bin endpoints as the tick mark values for the horizontal axis and determines how to compute the bin width of the histogram bars. You can specify the following values:
values
specifies both the left and right endpoints of each histogram interval. The width of the histogram bars is the difference between consecutive endpoints. The procedure uses the same values for all variables.

KEY
determines the endpoints for the data in the key cell. The initial number of endpoints is based on the number of observations in the key cell by using the method of Terrell and Scott (1985). The procedure extends the endpoint list for the key cell in either direction as necessary until it spans the data in the remaining cells.
UNIFORM
determines the endpoints by using all the observations as if there were no cells. In other words, the number of endpoints is based on the total sample size by using the method of Terrell and Scott (1985).
Tom
Super User

If you just want a HISTOGRAM why are you running PROC UNIVARIATE instead of the appropriate graphics procedure, like PROC SGPLOT with the HISTOGRAM statement?

michal_1407
Obsidian | Level 7
My goal is to get the HistogramBins table.
The histogram is generated from this table. I edited my post; sorry for my mistake.
Tom
Super User

What do you mean by "table"?

It sounds like you want to use ODS OUTPUT to convert this TABLE (the tabular report in the PROC UNIVARIATE output, shown as a screenshot in the original post) into a DATASET?

 

And what is your question or your goal?

Do you want to know if there is a way to change PROC UNIVARIATE so that it produces a different number of bins?

Do you want to understand how PROC UNIVARIATE decides how many bins are required?

michal_1407
Obsidian | Level 7
Hi,

I want to understand how PROC UNIVARIATE decides how many bins are required.

Only this
ballardw
Super User

@michal_1407 wrote:
Hi,

I want to understand how PROC UNIVARIATE decides how many bins are required.

Only this

PROC UNIVARIATE has been around for a very long time (I first used it in 1987, and it wasn't new then), so there are a great many options to work with.

 

If you check the online references you will likely see repeated references to 

the procedure computes the midpoints by using an algorithm (Terrell and Scott 1985)

which in the references listed becomes:

  • Terrell, G. R., and Scott, D. W. (1985). “Oversmoothed Nonparametric Density Estimates.” Journal of the American Statistical Association 80:209–214.
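For anyone checking the arithmetic by hand, here is a minimal sketch that puts the two rules side by side. It assumes, as far as I know, that the rule usually attributed to Terrell and Scott (1985) is a lower bound on the number of bins of about (2n)^(1/3), whereas the width 3.5 * σ * n^(-1/3) quoted in the question looks more like Scott's (1979) normal-reference rule. The data set `sashelp.heart`, the variable `weight`, and all of the names `stats`, `rules`, `lo`, `hi`, `k_terrell_scott`, `h_question`, and `k_question` are illustrative choices, not anything PROC UNIVARIATE produces.

```sas
/* Summary statistics used by both rules (sashelp.heart is only an example) */
proc means data=sashelp.heart noprint;
   var weight;
   output out=stats n=n std=sd min=lo max=hi;
run;

data rules;
   set stats;
   /* Assumed Terrell-Scott rule: lower bound on the number of bins */
   k_terrell_scott = ceil((2*n)**(1/3));
   /* Width rule quoted in the original question */
   h_question = 3.5 * sd * n**(-1/3);
   k_question = ceil((hi - lo) / h_question);
run;

proc print data=rules noobs;
   var n sd lo hi k_terrell_scott h_question k_question;
run;
```

The two counts generally differ, and neither has to match the histogram exactly, because the procedure may adjust the bins afterwards.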

michal_1407
Obsidian | Level 7
Thanks, I saw this paper, but I still get a different number of bins in SAS than the paper gives, and I want to understand how SAS does it.
Tom
Super User

@michal_1407 wrote:
Thanks, I saw this paper, but I still get a different number of bins in SAS than the paper gives, and I want to understand how SAS does it.

I don't have access to the paper. Can you show your work? How did you pick the KEY cell? (Or, for that matter, how does PROC UNIVARIATE pick the KEY cell?) Or did you ask it to just use UNIFORM bins?

 

It looks like PROC UNIVARIATE can output a number of statistics that from their names might be related to that paper.  Perhaps you could see if using those in the formula shows how it determined the number of bins.

 

Also note that the particular ODS output table you selected does not include empty bins, at least it did not include empty bins at front or back in the examples I tried.  Is that confusing your calculations?

Ksharp
Super User

If you want to get the HistogramBins table, try the OUTHISTOGRAM= option:

proc univariate data=sashelp.heart;
var weight;
histogram weight/outhistogram= histogram;
run;

And @Rick_SAS  might give you a hand.

https://blogs.sas.com/content/iml/2023/05/01/overlay-curve-histogram-sas.html
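To see how many bins the procedure actually used, the OUTHISTOGRAM= data set can be summarized directly. This is only a sketch: it assumes the default midpoint-based output, in which the bin centers are stored in the variable `_MIDPT_` (with the ENDPOINTS option the variable names differ), and the column aliases `nbins`, `first_midpoint`, `last_midpoint`, and `bin_width` are made up for illustration.

```sas
/* Recreate the OUTHISTOGRAM= data set; NOPRINT only suppresses the summary tables */
proc univariate data=sashelp.heart noprint;
   var weight;
   histogram weight / outhistogram=histogram;
run;

/* Count the bins and recover the (constant) bin width from the midpoints */
proc sql;
   select count(*)      as nbins,
          min(_midpt_)  as first_midpoint,
          max(_midpt_)  as last_midpoint,
          (max(_midpt_) - min(_midpt_)) / (count(*) - 1) as bin_width
   from histogram;
quit;
```

A bin width that comes out as a round number is a hint that the bins were adjusted to convenient values rather than taken straight from a formula.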

michal_1407
Obsidian | Level 7
Thanks for the answer, but I want to understand how SAS determines the number of bins.
Rick_SAS
SAS Super FREQ

The histogram bin widths (and therefore the number of bins) are determined not only by n, the number of nonmissing values, but also by choosing bin widths that are "convenient", as described in Lewart (Algorithm 463 of the Collected Algorithms of the ACM, 1973). You can get the bin locations from Lewart's algorithm by using the GSCALE subroutine in SAS IML. For details, examples, and a discussion, see "The location of ticks in statistical graphics" on The DO Loop blog.
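A minimal sketch of the GSCALE call, assuming SAS IML is available. Passing the (2n)^(1/3) count as the requested number of intervals is my assumption about how the two steps fit together, not a documented recipe, and the names `x`, `n`, `k`, `scale`, `ticks`, and `nticks` are illustrative. PROC UNIVARIATE works with bin midpoints by default, so the result need not match its bins exactly.

```sas
proc iml;
   use sashelp.heart;
   read all var {weight} into x;
   close sashelp.heart;

   x = x[loc(x ^= .)];            /* keep nonmissing values only */
   n = countn(x);                 /* number of nonmissing values */
   k = ceil((2*n)##(1/3));        /* assumed starting count (Terrell-Scott lower bound) */

   /* GSCALE returns a "nice" minimum, maximum, and increment for about k intervals */
   call gscale(scale, x, k);
   ticks  = do(scale[1], scale[2], scale[3]);
   nticks = ncol(ticks);
   print n k nticks, scale;
quit;
```

Comparing `scale[3]` with the bin width in the histogram is a quick way to check whether the "convenient" rounding explains the difference from the hand calculation.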
