Percentile95
Fluorite | Level 6

Dear SAS-VA Users,

 

I am currently developing reports that include statistical distribution analyses within the field of clinical biochemistry. A key feature of these reports is a histogram that displays the distribution of test results. Users can interactively filter the test results using several parameters, and the histogram updates accordingly with the new data.

 

However, I have encountered a problem: the histograms often display irregular spikes or other "artifacts" (please see the image below). I suspect these artifacts arise from the algorithm SAS-VA uses to determine bin widths. They undermine the users' confidence in the validity of the distribution analysis and the accompanying calculations. Additionally, SAS-VA provides limited options for adjusting the bin size or range. This issue does not occur with other statistical software I have used.

 

I would greatly appreciate any suggestions for resolving or minimizing this problem.

 

I have attached a data file with a single variable containing 64,151 test results. Creating a SAS-VA histogram with these data results in a spiked histogram, as shown below.

 

Best regards,
Percentile95

Bad_histogram.png

8 REPLIES
Stu_SAS
SAS Employee

Hey @Percentile95! First, thank you so much for supplying sample data. This makes working on a solution much easier!

I found that for this particular set of data, if you set the number of bins to 50, you get a smooth distribution as expected:

 

Stu_SAS_0-1717681024621.png

 

When comparing this with SGPLOT, you get the same results as in Visual Analytics with the automatic algorithm:

Stu_SAS_1-1717681212918.png

 

proc sgplot data=a.histogram_bin_problem;
    histogram internal_reply_num / scale=count binwidth=0.02;
run;

Stu_SAS_2-1717681679271.png

 

This is because they use the same auto-binning algorithm under the hood. In this case I would recommend choosing a number of bins that helps generate a smoother distribution.
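
For readers who want to try a fixed number of bins in SGPLOT as well, a minimal sketch might look like the following. It assumes that the NBINS= option of the HISTOGRAM statement plays the same role as the "Number of bins" setting in Visual Analytics, which this thread does not confirm, and the value 50 is only the one that happened to work for the attached data.

```sas
/* Sketch: request a fixed number of bins instead of the automatic choice. */
/* Assumption: 50 bins suits the full data set; a filtered subset may need */
/* a different value.                                                      */
proc sgplot data=a.histogram_bin_problem;
    histogram internal_reply_num / scale=count nbins=50;
run;
```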

 

Percentile95
Fluorite | Level 6

Thanks for the fast reply, much appreciated.

 

I understand your suggestion, but the SAS-VA report interface allows users to adjust which numbers go into the histogram.

 

As you showed with SGPLOT, using the automatic algorithm results in a spiked histogram. Adjusting to:

 

proc sgplot data=a.histogram_bin_problem;
    histogram internal_reply_num / scale=count binwidth=0.02;
run;

 

Gives a nice smooth looking histogram. 

 

If the user then adjusts the input (albeit in SAS-VA):

proc sgplot data=a.histogram_bin_problem;
    histogram internal_reply_num / scale=count binwidth=0.02;
    where internal_reply_num between 2 and 2.8;
run;

The output is again a "bad"-looking histogram:

Percentile95_0-1717683301963.png

 

So the input to the Histogram function (SGPLOT or SAS-VA) is dynamic, and I'm hoping for a better/different auto-binning algorithm. I have tried, as suggested by you, to use the number of bins that gives a smooth histogram, but then the input changes and I get a spiked histogram. I have attempted to use a parameter inside "Number of bins" to let the user adjust the look of the histogram, but parameters are not allowed as input.

 

Hope this makes sense.

Stu_SAS
SAS Employee
Thanks, @Percentile95. I agree with your suggestion about allowing users to adjust the number of bins with a parameter. We recently released Dynamic Parameters in Visual Analytics and are planning on adding more places where you can add dynamic values. The histogram number of bins sounds like a fantastic place. I'll bring this to R&D for their thoughts.
ballardw
Super User

First, a caveat or two. I don't have access to VA, so I am not sure whether this suggestion can be implemented.

Second, it takes a bit more training on the part of individuals reading but BOXPLOTS can contain a lot of distribution information and are not subject to "bin width" issues. Outlier definitions and displays have some issues but I suspect may be easier to deal with.  So perhaps consider box plots until this alternate parameterization of histograms is available.
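
As a rough illustration of the box plot idea outside VA, a minimal SGPLOT sketch could look like this. The data set and variable names are taken from the code posted earlier in the thread; whether the VA box plot object behaves identically is an assumption.

```sas
/* Sketch: a horizontal box plot of the same variable.               */
/* No bin width or number of bins is involved, so the display is not */
/* affected by how the values are rounded.                           */
proc sgplot data=a.histogram_bin_problem;
    hbox internal_reply_num;
run;
```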

Percentile95
Fluorite | Level 6

Hi ballardw,

 

I appreciate your well-thought-out suggestion to my problem.

Yes, the SAS-VA boxplot is quite useful as a supplement to ordinary histograms.

Hence, I have used your suggestion (see below) together with histograms inside a stacking container, where each stacked histogram uses a different number of bins. This allows users to choose the number of bins that gives a smooth-looking histogram.

 

The very best regards

 

histsas.png

Quentin
Super User

I didn't look at the data, but curious as to the underlying cause of these weird looking spiky histograms. 

 

Is it that your values are not continuous, but rounded in some way that creates spikes for certain bin sizes / bin locations?

Rick_SAS
SAS Super FREQ

This can happen with data that are rounded. For a discussion, example, and solution, see 

The mystery of the density curve that was too short - The DO Loop (sas.com)

To make this visual illusion disappear, use a bin width that is at least as large as the rounding unit in the data. 
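
To apply this advice, the rounding unit has to be known first. If it is not documented, one heuristic is to take the smallest gap between neighbouring distinct values. The sketch below assumes the data set and variable names used earlier in the thread; the work data sets and the macro variable `round_unit` are hypothetical names introduced here for illustration.

```sas
/* Sketch: estimate the rounding unit as the smallest gap between */
/* neighbouring distinct values. This is a heuristic only.        */

/* One row per distinct non-missing value, in ascending order */
proc sort data=a.histogram_bin_problem(keep=internal_reply_num
                                       where=(internal_reply_num is not missing))
          out=work.distinct_vals nodupkey;
    by internal_reply_num;
run;

/* Gap to the previous distinct value; rounding removes floating-point noise */
data work.gaps;
    set work.distinct_vals;
    gap = round(dif(internal_reply_num), 1e-8);
    if gap > 0;   /* also drops the first row, where the gap is missing */
run;

/* Store the smallest gap in a macro variable and write it to the log */
proc sql noprint;
    select min(gap) into :round_unit trimmed
    from work.gaps;
quit;

%put NOTE: Estimated rounding unit = &round_unit;
```

The smallest gap can understate the unit if a few values were recorded with more decimals than the rest, so it is worth checking the result against a frequency table of the gaps.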

FreelanceReinh
Jade | Level 19

@Rick_SAS wrote:

To make this visual illusion disappear, use a bin width that is at least as large as the rounding unit in the data. 


If the bin width is larger than the rounding unit, it should be an integer multiple of the rounding unit. Otherwise, you can still get those spikes in the histogram, as was discussed in the 2021 thread Histogram does not reflect summary statistics, where due to the non-integer ratio 1.2 : 1 every fifth histogram bar comprised two values rather than one.
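
A minimal SGPLOT sketch of this rule follows. It assumes a rounding unit of 0.01, which the thread does not state and which should be verified against the data. The macro variables `round_unit`, `k`, `filter`, `xmin`, `binwidth`, and `binstart` are hypothetical names introduced for illustration. In SGPLOT, BINSTART= gives the midpoint of the first bin, so the sketch places that midpoint such that bin boundaries fall halfway between possible data values.

```sas
/* Sketch: bin width as an integer multiple of the rounding unit. */
%let round_unit = 0.01;  /* ASSUMPTION: rounding unit of the data, verify first */
%let k          = 2;     /* each bin spans k rounding units (positive integer)  */
%let filter     = internal_reply_num between 2 and 2.8;  /* example user filter */

/* Smallest value in the filtered data, used to align the first bin */
proc sql noprint;
    select min(internal_reply_num) into :xmin trimmed
    from a.histogram_bin_problem
    where &filter;
quit;

/* Bin width = k * unit; first bin covers xmin, xmin+unit, ..., xmin+(k-1)*unit */
%let binwidth = %sysevalf(&k * &round_unit);
%let binstart = %sysevalf(&xmin + (&k - 1) * &round_unit / 2);

proc sgplot data=a.histogram_bin_problem;
    where &filter;
    histogram internal_reply_num / scale=count
                                   binwidth=&binwidth
                                   binstart=&binstart;
run;
```

With these settings every bin covers exactly `k` possible data values, whichever filter is applied. A ratio such as 1.2 : 1 cannot achieve that, because most bins then cover one possible value and every fifth bin covers two.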
