How to categorize in natural breaks

Not applicable
Posts: 1

How to categorize in natural breaks


My boss has asked me to show the distribution of a variable (square footage of 17,000 public library outlets/branches), breaking it into 5 categories using natural breaks. Putting aside arguments of the "best" number of categories or the problems with categorizing a continuous variable... I've used proc rank for creating deciles, quantiles, etc., but the only place I've seen that uses "natural breaks" (outside of distance sporting events) is chloropleth maps. I've Googled for a solution.  The closest I got, other than a bunch of articles on analyses using the Jenks optimization in ArcGIS, was a post to the UGA SAS-Listserv in 2000. I've seen a few possible links to using proc cluster, but that always seems to be when doing a multivariate analysis.

My question: Does anyone know whether SAS has a procedure that does this?  I could probably write up code to do the iterations, but if there is a ready-made proc, there's no reason to do that.  It also looks like the head/tail breaks approach is better (would be more appropriate for this data, since it has a long tail) and would be more straight-forward if I had to write something myself.

My current strategy is to use a decile approach and look at the distributions within deciles to determine reasonable break points.  If anyone has any suggestions, macros, or knowledge about how to do this in SAS without reinventing the wheel, I'd appreciate it.  I haven't attached data since it is a simple univariate descriptive statistical analysis - the only "difficult" part is determining the break points.



Posts: 4,238

Re: How to categorize in natural breaks

I'm not sure what mathematical definition you are using for "natural breaks," but one interpretation is to identify places where there are "gaps" in the data. In other words, you want to cluster the data.

One problem is that your boss is not letting the data speak.  Perhaps there are three "natural" clusters. Insisting on five will result in breaking apart two of the clusters.  Those break points will not be "natural."

The first thing I'd do is plot the data as a histogram with a fringe plot (="rug plot") underneath.

You might also want to apply a square root transform to normalize the data.

If you go the "must have 5 groups" route, look at PROC FASTCLUS, which does k-means clustering.

Occasional Learner
Posts: 1

Re: How to categorize in natural breaks

I've been working with the same problem..trying to replicate "Jenks-Fisher Natural Breaks" via SAS.  The goal of the Jenks algorithm is straightforward...create N groups out of a single variable so that the within subclass variation is minimized and between class variation is maximized.  It's pretty much the default classification mode for most GIS/mapping software.  I'm not at all sure what tweaks ESRI has in their ArcMap..they claim their soln. is proprietary.  But you can come very close to what ArcMap provides using either FASTCLUS (several times if you want to try, say, 4 through 7 different categories) or using CLUSTER and then Proc Tree.  When i've done very simple checks - ie looking at the within group variance with, say, 4 categories out of ArcMap / Jenks and maxclusters = 4 with SAS. the SAS intervals have slightly less overall variance than the similar categories out of ArcMap.  The data happens to be the % of quartz by weight in topsoil samples in a new USGS geochemical/mineralogical survey of the lower 48.  ArcMap DOES let you start w/ any of several different categorization methods and then modify it interactively w/ a data histogram.

proc fastclus data=A_Quartzn out=quartz.fastout maxclusters = 4;

    var A_Quartzn;


first is proc means using the groups/clusters picked by FASTCLUS, next is the same using the 4 Jenks categories out of ArcMap.  The Min/Max are the lower and upper level of each interval.

                                   Analysis Variable : A_Quartzn


     Cluster     Obs     Minimum     Maximum      Mean     Std Dev        Variance


           1     703      77.7      99.9      88.7069701        6.5550437      42.9685981

           2    1541      50.1     77.6      61.5883193       7.8154812      61.0817466

           3    1967      22.5     50.0      38.0440773       7.7738989      60.4335038

           4     591       0.2      22.4      12.0318105         6.7539344      45.6156304


L:\AHS646_Bob\USGS\usgs_sas\                                                                                                         19:25 Wednesday, October 1, 2014   5

Modifying USGS geochemical database, Getting mapping levels

The MEANS Procedure

                                  Analysis Variable : A_Quartzn


       ArcMap     Obs   Minimum   Maximum      Mean      Std Dev        Variance


           1     841      73.7      99.9     86.5820452        7.6916868      59.1620463

           2    1348      50.8      73.6     60.5956231       6.5572610      42.9976720

           3    1687      29.0      50.7     40.8642561       6.1089930      37.3197952

           4     926       0.2      28.9     17.0363931         8.6349186      74.5618201

Ask a Question
Discussion stats
  • 2 replies
  • 1 like
  • 3 in conversation