Hi,
Rather embarrassingly, I produced a draft report for a client, who immediately spotted that something was wrong. It took me an hour to find a "fix", but I'm worried that this problem arose in the first place. The data were calculated scores from a questionnaire given repeatedly (at fixed but unequal intervals) to subjects; the example below uses a different dataset, but illustrates the problem.
/* Examples illustrating PROC SGPLOT overlay problem */;
data forplot ;
set sashelp.bmimen ;
If Age < 20 then delete; /* simplify the dataset */
agedec = 10*int(Age/10); /* calculate age decade */
run;
title3 'BMI in men by age group';
proc means data=forplot maxdec=2 min max median q1 q3 ;
var BMI;
class agedec ;
run;
title3 'BMI in men by age group - discrete xaxis';
proc sgplot data=forplot noautolegend;
vbox BMI / category=agedec ;
scatter x=agedec y=BMI / jitter;
xaxis type=discrete ;
label agedec = 'Age decade' BMI="Body Mass Index" ;
run;
title3 'BMI in men by age group - time xaxis';
proc sgplot data=forplot noautolegend;
vbox BMI / category=agedec ;
scatter x=agedec y=BMI / jitter;
xaxis type=time ;
label agedec = 'Age decade' BMI="Body Mass Index" ;
run;
Why is the plot using a discrete axis so different from a plot using a time axis? Even more alarmingly, why are there two men in their 20s with BMI > 40 in the first plot, but four such men in the second plot!?!
Two men with BMI > 40 have appeared out of thin air?!?
This appears to be an issue with the JITTER option, not the axis type or the XAXIS statement.
Try removing it and see what you get. I suspect jitter is partially random and uses something about the data to decide how the points are displayed.
Thanks for the swift response, Reeza. You are (of course) correct - removing jitter does result in only two outliers, but in my questionnaire dataset there are many observations with identical scores, so the points overlap.
Also, is there any documentation describing this "jitter problem"?
No, not that I know of. I don't know anything for sure; this is mostly an educated guess.
Jitter seems to be wrong here: you definitely have only two outliers for agedec=20, and I'm not sure where the extra points are coming from. I would contact SAS tech support.
The problem is that you did not specify the NOOUTLIERS option on the VBOX statement. When you used the DISCRETE axis, no discrete-style jittering occurred for the SCATTER plot because there were no collisions within the scatter. Therefore, the scatter points were simply drawn on top of the box outliers. When you changed the axis type to TIME (LINEAR would do this as well), some jittering was applied to all scatter points, which revealed the box outliers.
Using the NOOUTLIERS option on the VBOX will clear all of this up for you. In general, this option should be used any time you create a box-scatter overlay.
Thanks!
Dan
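One way to see the two layers described above is to give each statement its own marker style. The following is only a sketch: it assumes the forplot dataset from the original post, and the colours and symbols are arbitrary choices made for illustration. With a TIME or LINEAR axis, the filled red markers should be the outliers drawn by VBOX, and the open blue markers the jittered SCATTER points for the same observations.

```sas
/* Diagnostic sketch: colour the two layers differently so that     */
/* VBOX outliers and SCATTER points can be told apart.              */
/* Assumes the FORPLOT dataset created in the original post.        */
title3 'BMI in men by age group - which layer draws which marker?';
proc sgplot data=forplot noautolegend;
  /* Outliers drawn by the box plot: filled red circles */
  vbox BMI / category=agedec
             outlierattrs=(color=red symbol=circlefilled);
  /* Raw observations drawn by the scatter plot: open blue circles */
  scatter x=agedec y=BMI / jitter
             markerattrs=(color=blue symbol=circle);
  xaxis type=linear;
  label agedec = 'Age decade' BMI = "Body Mass Index";
run;
title3;
```

Once it is clear which markers come from VBOX, adding NOOUTLIERS to the VBOX statement removes the red layer and leaves one marker per observation.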
OK, it took me a while to understand this, so I'll rephrase it for anyone else.
Four dots appear because two are outliers drawn by the VBOX statement, and these are overlaid with the same two observations, now jittered, drawn by the SCATTER statement. Without JITTER, the scatter points land exactly on top of the outlier markers, so the duplication is invisible.
To draw the scatter on top of the box plot, suppress the box outliers with NOOUTLIERS so that each observation is shown only once and it doesn't look like extra outliers are present.
*create sample data;
data forplot;
set sashelp.bmimen;
if age < 20 then delete;
agedec = 10*int(age/10);
decade = agedec;
format decade time8.;
run;
*adds tooltips to points in HTML to see x/y values when hovering;
ods graphics /imagemap;
title 'Just box plots';
proc sgplot data=forplot noautolegend;
vbox BMI / category=agedec ;
run;
title 'Just scatter plots';
proc sgplot data=forplot noautolegend;
scatter x=agedec y=BMI / jitter;
xaxis type=linear ;
label agedec = 'Age decade' BMI="Body Mass Index" ;
run;
title 'Overlaid plots';
proc sgplot data=forplot noautolegend;
vbox BMI / category=agedec nooutliers ;
scatter x=agedec y=BMI / jitter;
xaxis type=linear ;
label agedec = 'Age decade' BMI="Body Mass Index" ;
run;
title;
*check number of outliers;
title "Number of Outliers for agedec=20";
proc sql;
select count(*) as Number_Outliers label="Number of Outliers"
from forplot
where agedec=20 and bmi > 40;
quit;
title;
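The SQL check above uses BMI > 40 as a stand-in for "outlier". A slightly more general check, sketched below, counts the observations that fall beyond the 1.5*IQR fences in each age decade. It assumes that VBOX is using its default percentile definition and the usual 1.5*IQR rule for flagging outliers; the dataset name quart and the column names n_high and n_low are made up for this example.

```sas
/* Quartiles of BMI for each age decade */
proc means data=forplot noprint nway;
  class agedec;
  var BMI;
  output out=quart q1=q1 q3=q3;
run;

/* Count observations beyond the 1.5*IQR fences.                   */
/* In PROC SQL a comparison evaluates to 1 (true) or 0 (false),    */
/* so SUM() of a comparison counts the rows where it holds.        */
title "Observations beyond the 1.5*IQR fences, by age decade";
proc sql;
  select f.agedec label="Age decade",
         sum(f.BMI > q.q3 + 1.5*(q.q3 - q.q1)) as n_high label="Above upper fence",
         sum(f.BMI < q.q1 - 1.5*(q.q3 - q.q1)) as n_low  label="Below lower fence"
  from forplot as f
       inner join quart as q
       on f.agedec = q.agedec
  group by f.agedec;
quit;
title;
```

If the counts match the number of outlier markers in the "Just box plots" graph, any additional markers in an overlay must be coming from the SCATTER statement.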
Thank you very much for the discussion and solution. After some experimentation, the following shows two "correct" solutions:
/* Examples illustrating PROC SGPLOT overlay problem */;
data forplot ;
set sashelp.bmimen ;
If Age < 20 then delete; /* simplify the dataset */
agedec = 10*int(Age/10); /* calculate age decade */
run;
title3 'BMI in men by age group';
proc means data=forplot maxdec=2 min max median q1 q3 ;
var BMI;
class agedec ;
run;
title3 'BMI in men by age group - discrete xaxis';
proc sgplot data=forplot noautolegend;
vbox BMI / category=agedec NOOUTLIERS boxwidth = 0.8 ;
scatter x=agedec y=BMI / jitter transparency = 0.75 ;
xaxis type=discrete ;
label agedec = 'Age decade' BMI="Body Mass Index" ;
run;
title3 'BMI in men by age group - time xaxis';
proc sgplot data=forplot noautolegend;
vbox BMI / category=agedec NOOUTLIERS ;
scatter x=agedec y=BMI / jitter transparency = 0.75 ;
xaxis type=time ;
label agedec = 'Age decade' BMI="Body Mass Index" ;
run;
Conclusions:
1. You MUST use NOOUTLIERS when you overlay a SCATTER plot over a VBOX plot.
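2. The results of JITTER differ depending on the axis type: on a DISCRETE axis only colliding points are displaced, whereas on a TIME or LINEAR axis some jitter is applied to every point (see Dan's explanation above).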
3. Use of the TRANSPARENCY option enables you to "see the wood for the trees", especially with a large number of data points.
4. The BOXWIDTH by default is different with a DISCRETE or TIME axis, but this can easily be modified with the BOXWIDTH option.
5. This SAS Support Communities forum is fantastic!
I have a similar problem using SAS 9.4.
I have added NOOUTLIERS and "xaxis type=discrete".
However, two points with y-values 2.4 and 2.44 still overlap.
It seems that the JITTER option is not working.
Could you please help me find out why?
Here is the code:
proc sgplot data=mydata noautolegend noborder;
scatter y=level x=Group/ jitter jitterwidth=1 group=group markerattrs=(size=10);
vbox level/ category= Group group = group discreteoffset=0.2 boxwidth=0.3 nooutliers /*nofill*/
dataskin=gloss nomean;
xaxis type = discrete display= (nolabel noticks) values=("Bill" "Aman" "Rude") valueattrs=(Color=Black Family=Arial Size=16);
yaxis display=(noline noticks)
label= "Levels" labelattrs= (Color = Black Family = Arial Size = 16) grid valueattrs=(Color=Black Family=Arial Size=16);
run; quit;
Thank you!