Dear SAS-VA Users,
I am currently developing reports that include statistical distribution analyses within the field of clinical biochemistry. A key feature of these reports is a histogram that displays the distribution of test results. Users can interactively filter the test results using several parameters, and the histogram updates accordingly with the new data.
However, I have encountered a problem: the histograms often display irregular spikes or other types of "artifacts" (please see the image below). I suspect these artifacts arise from the algorithm SAS-VA uses to determine bin widths. They undermine the users' confidence in the validity of the distribution analysis and the accompanying calculations. Additionally, SAS-VA provides limited options for adjusting the bin size or range. This issue does not occur with other statistical software I have used.
I would greatly appreciate any suggestions for resolving or minimizing this problem.
I have attached a data file with a single variable containing 64,151 test results. Creating a SAS-VA histogram with these data results in a spiked histogram, as shown below.
Best regards,
Percentile95
Hey @Percentile95! First, thank you so much for supplying sample data. This makes working on a solution much easier!
I found that for this particular set of data, if you set the number of bins to 50, you get a smooth distribution as expected:
When comparing this with SGPLOT, you get the same results as in Visual Analytics with the automatic algorithm:
proc sgplot data=a.histogram_bin_problem;
histogram internal_reply_num / scale=count;
run;
This is because they use the same auto-binning algorithm under the hood. In this case I would recommend choosing a number of bins that helps generate a smoother distribution.
Thanks for the fast reply, much appreciated.
I understand your suggestion, but the SAS-VA report interface allows users to adjust which numbers go into the histogram.
As you showed with SGPLOT, using the automatic algorithm results in a spiked histogram. Adjusting to:
proc sgplot data=a.histogram_bin_problem;
histogram internal_reply_num / scale=count binwidth=0.02;
run;
gives a nice, smooth-looking histogram.
If the user then adjusts the input (albeit in SAS-VA):
proc sgplot data=a.histogram_bin_problem;
histogram internal_reply_num / scale=count binwidth=0.02;
where internal_reply_num between 2 and 2.8;
run;
The output is again a "bad" looking histogram:
So the input to the Histogram function (SGPLOT or SAS-VA) is dynamic, and I'm hoping for a better/different auto-binning algorithm. I have tried, as suggested by you, to use the number of bins that gives a smooth histogram, but then the input changes and I get a spiked histogram. I have attempted to use a parameter inside "Number of bins" to let the user adjust the look of the histogram, but parameters are not allowed as input.
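One possible explanation, offered as an assumption rather than anything confirmed in this thread: if the results are recorded at a fixed resolution (say 0.01), a bin whose edges fall exactly on recorded values, or whose width is not a whole multiple of that resolution, can cover a different number of possible values than its neighbours, and that can show up as spikes. A minimal SGPLOT sketch under that assumption pins both the width and the start of the bins. The 0.01 resolution and the BINSTART= value are illustrative guesses, not properties checked against the attached data:

```sas
/* Sketch only: assumes results are recorded in steps of 0.01.        */
/* BINWIDTH=0.02 is a whole multiple of that step, and BINSTART=      */
/* (the midpoint of the first bin) is placed halfway between two      */
/* recorded values, so the first bin spans 1.995 to 2.015 and every   */
/* bin covers exactly two possible values (2.00 and 2.01, and so on). */
proc sgplot data=a.histogram_bin_problem;
  where internal_reply_num between 2 and 2.8;
  histogram internal_reply_num / scale=count binwidth=0.02 binstart=2.005;
run;
```

This only applies to SGPLOT; SAS-VA may not expose equivalent options. If the assumption holds, it may explain why a fixed width looks smooth for one subset and spiked for another, since the automatic start of the first bin moves with the filtered range.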
Hope this makes sense.
First, a caveat or two. I don't have access to VA, so I am not sure whether this suggestion can be implemented.
Second, it takes a bit more training for the people reading them, but box plots can convey a lot of distribution information and are not subject to "bin width" issues. Outlier definitions and displays have some issues of their own, but I suspect they may be easier to deal with. So perhaps consider box plots until an alternate parameterization of histograms is available.
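For reference, a minimal SGPLOT sketch of the box plot idea, reusing the data set and variable names from the earlier posts. Whether VA offers an equivalent object with the same filtering behaviour is not something I can confirm:

```sas
/* Horizontal box plot of the same variable; no bin width is involved. */
proc sgplot data=a.histogram_bin_problem;
  where internal_reply_num between 2 and 2.8;  /* same example filter used earlier in the thread */
  hbox internal_reply_num;
run;
```

Changing the WHERE range changes the quartiles and whiskers, but there is no binning step that could introduce spikes.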