04-24-2013 02:14 PM
I am trying to plot the distribution of scores of a continuous variable for 4 groups on one plot. The best visualization I have found for this is PROC SGPLOT with the DENSITY statement (rather than bulky overlapping histograms, which don't display the data well). However, I'm not 100% positive on the interpretation of the x and y axes.
For the x axis, even when I set min and max values in my code, the graph displays from -50 to 200, even though the variable's scores range from 0 to 200 (no one has a score below 0, so negative values don't make sense). Is this like a standard deviation unit below the mean? If so, do I need to set the mean for my groups? (And how would I do that?)
For the y axis, I believe this is the theoretical proportion of all observations that would fall under the curve, at that value or less, but I just want to make sure I'm interpreting it correctly.
Originally, I'd just wanted distribution curves that would be the smoothed-curve version of a histogram (i.e., the percent on the y axis and the actual score, ranging from 0 to 200, on the x axis), but I'm not sure there's a way to do this.
Help very much appreciated.
04-24-2013 02:48 PM
We may be able to provide more help if you share your code. The DENSITY statement defaults to a normal density distribution. If your data are skewed toward the left, that could explain why the curve goes below 0. See
The KERNEL density estimate uses a non-parametric smoother and will follow the histogram better for non-normal data.
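To see the difference between the two estimates, a minimal sketch along these lines may help. It uses the SASHELP.CLASS sample data rather than the poster's data, and the legend labels are only illustrative:

```sas
/* Sketch only: SASHELP.CLASS stands in for the poster's data set */
proc sgplot data=sashelp.class;
  histogram weight;                                    /* observed distribution */
  density weight / type=normal legendlabel="Normal";   /* normal curve fitted from the sample mean and std dev */
  density weight / type=kernel legendlabel="Kernel";   /* nonparametric kernel estimate */
run;
```

Because the normal curve is determined only by the sample mean and standard deviation, it can extend past the smallest and largest observed values, while the kernel curve stays closer to the observed range.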
04-24-2013 06:05 PM
OK, thank you. I think using kernel is what I'm looking for: it provides the correct range of the score variable on the x axis. It's just not as "pretty" a graph as the normally distributed graph using density.
Do you know if my interpretation of the density plot axes is appropriate, though?
For DENSITY PLOT
X= ? (it ranges from -50 to 200, my variable is a continuous score that ranges from 0-200, but I have 4 groups with essentially 2 different means of the scores.)
Y=density= the theoretical proportion of all observations that would fall under the curve, at that value or less -?
I want to make sure I understand the interpretation of the density graph.
(My code is essentially:
proc sgplot data=score;
these are all scores for the same measure, but I created separate variables for each group I wanted to plot, because I wanted to compare their distributions. The 2 case groups have similar distributions, and the 2 control groups are also similar but they differ a bit more.)
04-24-2013 08:48 PM
Changing to type=kernel makes it not so much of a smoothed curve but a nonparametric curve that more closely follows the data. So, with smaller n's, it looks messier, but for large numbers with approximately normal distribution it doesn't change much. It did take my x axis to the scale of the variable I am using, as well.
04-25-2013 06:02 AM
How "messy" it looks is related to the bandwidth for the kernel density estimator. If you think that the default bandwidth is not smoothing the data enough, you can manually increase the bandwidth. The syntax in PROC SGPLOT is the TYPE=KERNEL(C=value) option on the DENSITY statement. Small values (near zero) result in widely oscillating fits. Large values (near 100) result in fitting the mean density. Try values in the range 20 to 60 and see what happens.
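As a rough illustration of the bandwidth option, the following sketch overlays the default kernel estimate and two manually chosen C= values from the range suggested above. It uses SASHELP.CLASS as stand-in data, and the particular values 20 and 60 are just examples to experiment with:

```sas
/* Sketch only: compare the default bandwidth with two manual C= values */
proc sgplot data=sashelp.class;
  density weight / type=kernel        legendlabel="Default bandwidth";
  density weight / type=kernel(c=20)  legendlabel="C=20";
  density weight / type=kernel(c=60)  legendlabel="C=60";
run;
```

Comparing the curves side by side should make it easier to pick a value that smooths out the noise without hiding real features of the distribution.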
04-24-2013 08:17 PM
When you create four separate variables, each variable is scaled independently. If you have some groups with many observations and other groups with few observations, this might be misleading. However, if the number of observations in each group is approximately equal, then this is a good approach. This method and other similar methods are described in this article: Overlay density estimates on a plot - The DO Loop
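For readers following along, the four-variable approach might look something like the sketch below. The data set name SCORES, the variables SCORE and GROUP, the group coding 1 to 4, and the new variable names are all hypothetical, since the original code was not posted in full:

```sas
/* Hypothetical input: data set SCORES with SCORE (0-200) and GROUP (1-4) */
data score_wide;
  set scores;
  /* Copy the score into one variable per group; the others stay missing */
  if group = 1 then case1 = score;
  else if group = 2 then case2 = score;
  else if group = 3 then control1 = score;
  else if group = 4 then control2 = score;
run;

/* Each DENSITY statement is estimated and scaled from its own variable */
proc sgplot data=score_wide;
  density case1    / type=kernel legendlabel="Case 1";
  density case2    / type=kernel legendlabel="Case 2";
  density control1 / type=kernel legendlabel="Control 1";
  density control2 / type=kernel legendlabel="Control 2";
  xaxis label="Score";
run;
```

Each curve is estimated only from the nonmissing values of its own variable, which is why unequal group sizes can make the overlay misleading.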
04-24-2013 03:07 PM
I'm not sure that I understand your concerns, but if you have ONE variable with FOUR class levels, then look up comparative histogram in PROC UNIVARIATE: Base SAS(R) 9.3 Procedures Guide: Statistical Procedures, Second Edition
The output will be panelled histograms. However, you can use the OUTKERNEL= option to write the kernel density estimates to a SAS data set and then overlay the curves on a single plot. The following example should get you started, if this is what you want to do:
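proc univariate data=sashelp.class;
class sex;   /* one kernel density estimate per class level */
histogram weight / kernel outkernel=outk;   /* write the estimates to OUTK */
run;
proc sgplot data=outk;
series x=_value_ y=_Density_ / group=sex;   /* overlay the curves on one plot */
run;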