06-19-2014 10:57 AM
My boss has asked me to show the distribution of a variable (square footage of 17,000 public library outlets/branches), breaking it into 5 categories using natural breaks. Putting aside arguments of the "best" number of categories or the problems with categorizing a continuous variable... I've used proc rank for creating deciles, quantiles, etc., but the only place I've seen that uses "natural breaks" (outside of distance sporting events) is chloropleth maps. I've Googled for a solution. The closest I got, other than a bunch of articles on analyses using the Jenks optimization in ArcGIS, was a post to the UGA SAS-Listserv in 2000. I've seen a few possible links to using proc cluster, but that always seems to be when doing a multivariate analysis.
My question: Does anyone know whether SAS has a procedure that does this? I could probably write up code to do the iterations, but if there is a ready-made proc, there's no reason to do that. It also looks like the head/tail breaks approach is better (would be more appropriate for this data, since it has a long tail) and would be more straight-forward if I had to write something myself.
My current strategy is to use a decile approach and look at the distributions within deciles to determine reasonable break points. If anyone has any suggestions, macros, or knowledge about how to do this in SAS without reinventing the wheel, I'd appreciate it. I haven't attached data since it is a simple univariate descriptive statistical analysis - the only "difficult" part is determining the break points.
06-19-2014 11:47 AM
I'm not sure what mathematical definition you are using for "natural breaks," but one interpretation is to identify places where there are "gaps" in the data. In other words, you want to cluster the data.
One problem is that your boss is not letting the data speak. Perhaps there are three "natural" clusters. Insisting on five will result in breaking apart two of the clusters. Those break points will not be "natural."
The first thing I'd do is plot the data as a histogram with a fringe plot (="rug plot") underneath.
You might also want to apply a square root transform to normalize the data.
If you go the "must have 5 groups" route, look at PROC FASTCLUS, which does k-means clustering.
10-01-2014 07:49 PM
I've been working with the same problem..trying to replicate "Jenks-Fisher Natural Breaks" via SAS. The goal of the Jenks algorithm is straightforward...create N groups out of a single variable so that the within subclass variation is minimized and between class variation is maximized. It's pretty much the default classification mode for most GIS/mapping software. I'm not at all sure what tweaks ESRI has in their ArcMap..they claim their soln. is proprietary. But you can come very close to what ArcMap provides using either FASTCLUS (several times if you want to try, say, 4 through 7 different categories) or using CLUSTER and then Proc Tree. When i've done very simple checks - ie looking at the within group variance with, say, 4 categories out of ArcMap / Jenks and maxclusters = 4 with SAS. the SAS intervals have slightly less overall variance than the similar categories out of ArcMap. The data happens to be the % of quartz by weight in topsoil samples in a new USGS geochemical/mineralogical survey of the lower 48. ArcMap DOES let you start w/ any of several different categorization methods and then modify it interactively w/ a data histogram.
proc fastclus data=A_Quartzn out=quartz.fastout maxclusters = 4;
first is proc means using the groups/clusters picked by FASTCLUS, next is the same using the 4 Jenks categories out of ArcMap. The Min/Max are the lower and upper level of each interval.
Analysis Variable : A_Quartzn
Cluster Obs Minimum Maximum Mean Std Dev Variance
1 703 77.7 99.9 88.7069701 6.5550437 42.9685981
2 1541 50.1 77.6 61.5883193 7.8154812 61.0817466
3 1967 22.5 50.0 38.0440773 7.7738989 60.4335038
4 591 0.2 22.4 12.0318105 6.7539344 45.6156304
L:\AHS646_Bob\USGS\usgs_sas\ahs_USGS_14_09_29.sas 19:25 Wednesday, October 1, 2014 5
Modifying USGS geochemical database, Getting mapping levels
The MEANS Procedure
Analysis Variable : A_Quartzn
ArcMap Obs Minimum Maximum Mean Std Dev Variance
1 841 73.7 99.9 86.5820452 7.6916868 59.1620463
2 1348 50.8 73.6 60.5956231 6.5572610 42.9976720
3 1687 29.0 50.7 40.8642561 6.1089930 37.3197952
4 926 0.2 28.9 17.0363931 8.6349186 74.5618201