I am creating a box and whisker graph using 4 data points: -53.8, -41.2, -27.0, and -26.5. I would have expected the box would extend from the 2nd value to the 3rd value and the bottom whisker would extend to the minimum value and the upper whisker would extend to the maximum value. Below is the graph I created in SAS where the box extends to the midpoint between the 1st and 2nd value (-47.5) to the midpoint between the 3rd and 4th value (-26.75), but the whiskers extend to the minimum and maximum. Any ideas why SAS is computing 1st and 3rd quartiles in this manner? With only 4 values, I would expect the median would divide the box evenly.
First, a caveat: I don't use or have access to Visual Analytics.
SAS has different approaches for calculating percentiles depending on usage. I do know that Proc Univariate, Proc SGPlot, Proc Boxplot and others use either a PCTLDEF= or PERCENTILE= option with values from 1 to 5 to specify which approach is used. The boxes are drawn from the 25th to the 75th percentile, so the definition in use does affect the appearance of the graph. The same applies to the median, the 50th percentile.
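To illustrate how much the choice of definition can move the box edges, here is a rough Python sketch (not SAS code) of two of the non-default rules as I understand them from the SAS documentation: definition 3 (empirical distribution function) and definition 4 (weighted average aimed at position (n+1)p). The function names and the clamping of out-of-range positions are my own assumptions for illustration.

```python
import math

def split_position(pos):
    """Split a position into integer part j and fractional part g."""
    j = math.floor(pos + 1e-9)          # tolerate floating-point noise
    g = pos - j
    return j, (g if g > 1e-9 else 0.0)

def pick(xs, i):
    """Return the i-th order statistic (1-based), clamped to the data range."""
    return xs[min(max(i, 1), len(xs)) - 1]

def pctldef3(values, p):
    """Definition 3 sketch: x_j if g == 0, otherwise x_{j+1}, with n*p = j + g."""
    xs = sorted(values)
    j, g = split_position(len(xs) * p)
    return pick(xs, j if g == 0 else j + 1)

def pctldef4(values, p):
    """Definition 4 sketch: (1-g)*x_j + g*x_{j+1}, with (n+1)*p = j + g."""
    xs = sorted(values)
    j, g = split_position((len(xs) + 1) * p)
    return (1 - g) * pick(xs, j) + g * pick(xs, j + 1)

data = [-53.8, -41.2, -27.0, -26.5]
print(pctldef3(data, 0.25), pctldef3(data, 0.75))  # -53.8 -27.0
print(pctldef4(data, 0.25), pctldef4(data, 0.75))  # about -50.65 and -26.625
```

With only four observations, each rule puts the lower edge of the box in a noticeably different place.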
For what it's worth, PROC UNIVARIATE produces the same chart. I think that with just 4 data points, the quartile limits are not going to behave in any intuitive way. And I don't think your statement "With only 4 values, I would expect the median would divide the box evenly" is correct.
Anyway, the PROC UNIVARIATE documentation explains exactly how the percentiles are computed. https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/procstat/procstat_univariate_details14.htm
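As a hedged illustration, the default rule (definition 5, "empirical distribution function with averaging") can be sketched in Python as follows. This is not SAS code, and the function name is hypothetical. The rule writes n*p as j + g, with j the integer part and g the fractional part, and returns the average of x_j and x_{j+1} when g = 0, or x_{j+1} when g > 0.

```python
import math

def sas_default_percentile(values, p):
    """Sketch of percentile definition 5 (EDF with averaging)."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    xs = sorted(values)
    pos = len(xs) * p
    j = math.floor(pos + 1e-9)       # integer part of n*p, tolerant of float noise
    g = pos - j                      # fractional part of n*p
    if g > 1e-9:
        return xs[j]                 # x_{j+1} in 1-based terms
    return (xs[j - 1] + xs[j]) / 2   # average of x_j and x_{j+1}

data = [-53.8, -41.2, -27.0, -26.5]
q1, med, q3 = (sas_default_percentile(data, p) for p in (0.25, 0.50, 0.75))
print(q1, med, q3)  # approximately -47.5 -34.1 -26.75
```

With n = 4, the positions n*p for the three quartiles are exactly 1, 2 and 3, so g = 0 every time and each quartile is the average of two neighbouring values. That matches the box edges at -47.5 and -26.75 described in the question.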
I'm guessing your real data has a lot more than 4 data points. Am I correct? If so, does the boxplot on the real data look more intuitively correct?
Hello @bmcohen36,
@bmcohen36 wrote:
... using 4 data points: -53.8, -41.2, -27.0, and -26.5. I would have expected the box would extend from the 2nd value to the 3rd value (...) With only 4 values, I would expect the median would divide the box evenly.
In addition to the five quantile definitions offered by SAS there are (at least) four more available in other common statistical software packages. Of course, they can be implemented in SAS by programming: see Rick Wicklin's blog article "Sample quantiles: A comparison of 9 definitions" and the accompanying PROC IML code.
However, none of those nine definitions matches your expectations, even though defining the first, second and third quartile of your example data as x_0.25 = -41.2, x_0.5 = (-41.2 - 27.0)/2 and x_0.75 = -27.0, respectively, would satisfy the criterion
- at least 100p percent of the sample values are less than or equal to x_p and
- at least 100(1-p) percent of the sample values are greater than or equal to x_p
which is sometimes used to characterize sample p-quantiles x_p (0 < p < 1). Note that, by this characterization, all values in the interval [-53.8, -41.2] qualify as a first quartile and similarly all values in [-41.2, -27.0] as a median and all values in [-27.0, -26.5] as a third quartile. Hence, your definition would pick the upper interval endpoint for the first quartile, the midpoint of the interval for the median and the lower interval endpoint for the third quartile to make the definition unique. The default quantile definition in SAS, however, consistently uses the interval midpoints in these cases.
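The two-part criterion above is easy to check numerically. Here is a minimal Python sketch (the function name is made up for illustration) that tests whether a candidate value qualifies as a p-quantile of the four data points:

```python
def qualifies_as_quantile(values, p, x):
    """True if x meets both conditions of the p-quantile criterion."""
    n = len(values)
    at_or_below = sum(v <= x for v in values)   # count of values <= x
    at_or_above = sum(v >= x for v in values)   # count of values >= x
    # p = 0.25, 0.5 and 0.75 are exact in binary, so plain comparisons are safe here
    return at_or_below >= n * p and at_or_above >= n * (1 - p)

data = [-53.8, -41.2, -27.0, -26.5]

# First quartile: the expected value and the SAS midpoint both qualify
print(qualifies_as_quantile(data, 0.25, -41.2))   # True
print(qualifies_as_quantile(data, 0.25, -47.5))   # True
print(qualifies_as_quantile(data, 0.25, -41.0))   # False (just outside the interval)

# Third quartile: again both candidates qualify
print(qualifies_as_quantile(data, 0.75, -27.0))   # True
print(qualifies_as_quantile(data, 0.75, -26.75))  # True
```

So both the expected box edges (-41.2 and -27.0) and the ones SAS draws (-47.5 and -26.75) are legitimate quartiles under this criterion. They differ only in which point of the qualifying interval is chosen.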