kldepi
Calcite | Level 5

Hi-

I am trying to plot the distribution of scores of a continuous variable for 4 groups on one plot, and have found that the best visualization for what I am looking for is PROC SGPLOT with the DENSITY statement (rather than bulky overlapping histograms, which don't display the data well). However, I'm not 100% positive on the interpretation of the x and y axes.

For the x axis, even when I set min and max values in my code, the graph displays from -50 to 200 (even though the variable's score range is 0 to 200- ie no-one has scores <0 and this doesn't make sense) - is this like a standard deviation unit below the mean? If so, do I need to set the mean for my groups? (And how would I do that?)

For the y axis, I believe this is the theoretical proportion of all observations that would fall under the curve, at that value or less, but I just want to make sure I'm interpreting correctly.

Originally, I'd just wanted distribution curves that would be the smoothed curve version of a histogram (i.e., the percent on the y axis and the true actual score, ranging from 0 to 200, on the x), but I'm not sure there's a way to do this....

Help very much appreciated.

8 REPLIES
Doc_Duke
Rhodochrosite | Level 12

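We may be able to provide more help if you share your code. The DENSITY statement defaults to a normal density distribution. A normal curve is fitted from the sample mean and standard deviation and has no lower bound, so if your data are skewed or widely spread, that could explain why the curve goes below 0 even though no score does. See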

http://support.sas.com/documentation/cdl/en/grstatproc/62603/HTML/default/viewer.htm#density-stmt.ht...

The KERNEL density estimate uses a non-parametric smoother and will follow the histogram better for non-normal data.
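
A small sketch may help with reading the axes of the TYPE=NORMAL curve. It assumes a hypothetical group with mean 100 and standard deviation 50 (made-up numbers, not taken from the data in this thread). By default the DENSITY statement fits the normal curve from each variable's sample mean and standard deviation, so the x axis stays in score units, and the y axis is the height of the density rather than the proportion at or below a score. A proportion corresponds to an area under the curve.

```sas
/* Hypothetical group: mean 100, standard deviation 50 (assumed values). */
data normal_check;
   mu = 100;
   sigma = 50;
   do x = -50 to 200 by 50;
      density  = pdf('NORMAL', x, mu, sigma);  /* curve height, as on the y axis   */
      cum_prop = cdf('NORMAL', x, mu, sigma);  /* proportion at or below x (area)  */
      output;
   end;
run;

proc print data=normal_check noobs;
   var x density cum_prop;
run;
```

With these numbers the height at the mean is about 0.008, while the cumulative proportion there is 0.5. About 2.3% of the fitted normal lies below 0, which is why the curve can be drawn into negative scores even when no observation is negative.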

Doc Muhlbaier

Duke

kldepi
Calcite | Level 5

Ok, Thank you- I think using kernel is what I'm looking for- it provides the correct range of the score variable on the x-axis.... it's just not as "pretty" a graph as the normally distributed graph using density.

Do you know if my interpretation of the density plot axes is appropriate, though?

For DENSITY PLOT

X= ? (it ranges from -50 to 200, my variable is a continuous score that ranges from 0-200, but I have 4 groups with essentially 2 different means of the scores.)

Y=density= the theoretical proportion of all observations that would fall under the curve, at that value or less -?

I want to make sure I understand the interpretation of the density graph.

(My code is essentially:

proc sgplot data=score;
   density caseFscore / type=normal;
   density caseMscore / type=normal;
   density controlFscore / type=normal;
   density controlMscore / type=normal;
run;

these are all scores for the same measure, but I created separate variables for each group I wanted to plot, because I wanted to compare their distributions. The 2 case groups have similar distributions, and the 2 control groups are also similar but they differ a bit more.)

Reeza
Super User

What happens if you change the type=normal to type=kernel?

kldepi
Calcite | Level 5

Changing to type=kernel makes it not so much of a smoothed curve but a nonparametric curve that more closely follows the data. So, with smaller n's, it looks messier, but for large numbers with approximately normal distribution it doesn't change much. It did take my x axis to the scale of the variable I am using, as well.

Rick_SAS
SAS Super FREQ

How "messy" it looks is related to the bandwidth for the kernel density estimator. If you think that the default bandwidth is undersmoothing the data, you can manually increase the bandwidth. The syntax in PROC SGPLOT is TYPE=KERNEL(C=value). Small values (near zero) result in widely oscillating fits. Large values (near 100) result in fitting the mean density. Try values in the range 20 to 60 and see what happens.
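
As a rough sketch of the C= suboption, the following overlays the default kernel estimate with two manually chosen bandwidths. It uses sashelp.class purely as stand-in data, and the C= values are placeholders taken from the range suggested above; which values are useful depends on your data, so adjust them if the curves come out too flat or too wiggly.

```sas
/* Stand-in data: sashelp.class is small (19 rows), so the bandwidth matters. */
proc sgplot data=sashelp.class;
   density weight / type=kernel       legendlabel="Default bandwidth";
   density weight / type=kernel(c=20) legendlabel="C=20";  /* placeholder value */
   density weight / type=kernel(c=60) legendlabel="C=60";  /* placeholder value */
   xaxis label="Weight";
run;
```

Plotting several bandwidths on one graph makes it easier to judge how much smoothing is enough before settling on a single value.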

Rick_SAS
SAS Super FREQ

When you create four separate variables, each variable is scaled independently. If you have some groups with many observations and other groups with few observations, this might be misleading. However, if the number of observations in each group is approximately equal, then this is a good approach.  This method and other similar methods are described in this article: Overlay density estimates on a plot - The DO Loop
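
A minimal sketch of the separate-variables approach, assuming the data start out as one variable plus a group variable. It uses sashelp.class as stand-in data, and the names class_wide, weightF, and weightM are made up for illustration. The PROC FREQ step is there only to check whether the group sizes are similar before comparing the curves.

```sas
/* Check group sizes first; very unequal groups can make the overlay misleading. */
proc freq data=sashelp.class;
   tables sex;
run;

/* Split one variable into one column per group (missing elsewhere). */
data class_wide;
   set sashelp.class;
   if sex = 'F' then weightF = weight;
   else if sex = 'M' then weightM = weight;
run;

/* Each DENSITY statement is estimated and scaled from its own column. */
proc sgplot data=class_wide;
   density weightF / type=kernel legendlabel="Female";
   density weightM / type=kernel legendlabel="Male";
   xaxis label="Weight";
run;
```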

kldepi
Calcite | Level 5

Good to know- Thanks, that is helpful Rick!

Rick_SAS
SAS Super FREQ

I'm not sure that I understand your concerns, but if you have ONE variable with FOUR class levels, then look up comparative histogram in PROC UNIVARIATE: Base SAS(R) 9.3 Procedures Guide: Statistical Procedures, Second Edition

The output will be panelled histograms. However, you can use the OUTKERNEL= option to write the kernel density estimates to a SAS data set and then overlay the curves on a single plot. The following example should get you started, if this is what you want to do:


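/* Panelled histograms by class level; OUTKERNEL= saves the kernel density estimates */
proc univariate data=sashelp.class;
   class sex;
   var weight;
   histogram weight / kernel outkernel=outk;
run;

/* Overlay the saved kernel curves, one per class level, on a single plot */
proc sgplot data=outk;
   series x=_value_ y=_Density_ / group=sex;
run;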
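
If a percent scale on the y axis is what you were originally after, one possible variation is sketched here. It assumes that the OUTKERNEL= data set in your SAS release also contains a _PERCENT_ column (the kernel estimate scaled to the histogram's percent axis); the PROC CONTENTS step is included so you can confirm that before plotting.

```sas
/* Recreate the kernel output so this sketch runs on its own. */
proc univariate data=sashelp.class;
   class sex;
   var weight;
   histogram weight / kernel outkernel=outk;
run;

/* Confirm which columns exist; _PERCENT_ is assumed below. */
proc contents data=outk varnum;
run;

/* Same overlay as before, but on the percent scale. */
proc sgplot data=outk;
   series x=_value_ y=_percent_ / group=sex;
   xaxis label="Weight";
   yaxis label="Percent (scaled kernel estimate)";
run;
```

The percent values depend on the histogram's bin width, so they are comparable to the bars of the matching histogram rather than being an absolute quantity.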
