Hi,
Rather embarrassingly, I produced a draft report for a client, who immediately spotted that something was wrong. It took me an hour to find a "fix", but I'm worried that this problem arose in the first place. The data were calculated scores from a questionnaire given repeatedly (and at fixed but unequal intervals) to subjects; the example below uses a different dataset, but illustrates the problem.
/* Examples illustrating PROC SGPLOT overlay problem */;
data forplot ;
set sashelp.bmimen ;
If Age < 20 then delete; /* simplify the dataset */
agedec = 10*int(Age/10); /* calculate age decade */
run;
title3 'BMI in men by age group';
proc means data=forplot maxdec=2 min max median q1 q3 ;
var BMI;
class agedec ;
run;
title3 'BMI in men by age group - discrete xaxis';
proc sgplot data=forplot noautolegend;
vbox BMI / category=agedec ;
scatter x=agedec y=BMI / jitter;
xaxis type=discrete ;
label agedec = 'Age decade' BMI="Body Mass Index" ;
run;
title3 'BMI in men by age group - time xaxis';
proc sgplot data=forplot noautolegend;
vbox BMI / category=agedec ;
scatter x=agedec y=BMI / jitter;
xaxis type=time ;
label agedec = 'Age decade' BMI="Body Mass Index" ;
run;
Why is the plot using a discrete axis so different from a plot using a time axis? Even more alarmingly, why are there two men in their 20s with BMI > 40 in the first plot, but four such men in the second plot!?!
Two men with BMI > 40 have appeared out of thin air?!?
SAS 9.4 (TS1M6) X64_10PRO WIN 10.0.17763 Workstation
Accepted Solutions
The problem is that you did not specify the NOOUTLIERS option on the VBOX statement. When you used the DISCRETE axis, no discrete-style jittering occurred for the SCATTER plot because there were no collisions within the scatter. Therefore, the scatter points were simply drawn on top of the box outliers. When you changed the axis type to TIME (LINEAR would do this as well), some jittering was applied to all scatter points, which revealed the box outliers.
Using the NOOUTLIERS option on the VBOX will clear all of this up for you. In general, this option should be used any time you create a box-scatter overlay.
Thanks!
Dan
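One way to see this explanation for yourself is to give the two sets of markers different colors. The following is only a sketch: it reuses the forplot data step from the question, and the red/blue color choices are illustrative assumptions rather than part of the original answer. NOOUTLIERS is deliberately left off so that the box outliers (red) and the jittered scatter points (blue) can be told apart.

```sas
/* Same sample data as in the question */
data forplot;
  set sashelp.bmimen;
  if age < 20 then delete;
  agedec = 10*int(age/10);
run;

title 'Box outliers (red) vs. jittered scatter points (blue)';
proc sgplot data=forplot noautolegend;
  /* NOOUTLIERS is omitted on purpose: the VBOX draws its own outlier markers */
  vbox BMI / category=agedec
             outlierattrs=(color=red symbol=circlefilled);
  /* The SCATTER draws every observation again, including the outliers */
  scatter x=agedec y=BMI / jitter
             markerattrs=(color=blue symbol=circle);
  xaxis type=linear;
  label agedec='Age decade' BMI='Body Mass Index';
run;
title;
```

If the explanation holds, each extreme observation should appear once in red at the category position and once in blue at a jittered position.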
This appears to be an issue with the JITTER option, not the axis type or statement.
Try removing it and see what you get. I suspect jitter is partially random and that something about the data affects how the points are displayed.
Thanks for the swift response, Reeza. You are (of course) correct - removing jitter does result in only two outliers, but in my questionnaire dataset there are many observations with identical scores, so the points overlap.
Also, is there any documentation describing this "jitter problem"?
No, not that I know of. I don't know anything for sure; this is mostly an educated guess.
The jitter seems to be wrong here: you definitely have only two outliers for agedec=20. I would contact SAS Technical Support about this. I'm not sure where those points are coming from.
Ok, took me a while to understand this, so I'll rephrase for anyone else.
The 4 dots appear because two are outliers drawn by the VBOX statement, and these are overlaid with the same two points, now jittered, from the SCATTER statement. If you don't specify JITTER, the scatter points sit exactly on top of the box outliers, which hides the duplication.
To graph the scatter on top of the box plot you should suppress the outliers so they don't show and it doesn't look like multiple outliers are present.
*create sample data;
data forplot;
set sashelp.bmimen;
if age < 20 then delete;
agedec = 10*int(age/10);
decade = agedec;
format decade time8.;
run;
*adds tooltips to points in HTML to see x/y values when hovering;
ods graphics /imagemap;
title 'Just box plots';
proc sgplot data=forplot noautolegend;
vbox BMI / category=agedec ;
run;
title 'Just scatter plots';
proc sgplot data=forplot noautolegend;
scatter x=agedec y=BMI / jitter;
xaxis type=linear ;
label agedec = 'Age decade' BMI="Body Mass Index" ;
run;
title 'Overlaid plots';
proc sgplot data=forplot noautolegend;
vbox BMI / category=agedec nooutliers ;
scatter x=agedec y=BMI / jitter;
xaxis type=linear ;
label agedec = 'Age decade' BMI="Body Mass Index" ;
run;
title;
*check number of outliers;
title "Number of Outliers for agedec=20";
proc sql;
select count(*) as Number_Outliers label="Number of Outliers"
from forplot
where agedec=20 and bmi > 40;
quit;
title;
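The check above counts observations with BMI > 40, which happens to match the outliers in this group. A more general check is to count the points that a box plot would flag, assuming the usual rule of more than 1.5 times the interquartile range beyond the quartiles. The sketch below makes that assumption; the data set name fences and the column names high_outliers and low_outliers are hypothetical, and the quartiles come from PROC MEANS defaults, which may not match every plot setting.

```sas
/* Same sample data as above */
data forplot;
  set sashelp.bmimen;
  if age < 20 then delete;
  agedec = 10*int(age/10);
run;

/* Quartiles per age decade (hypothetical output data set: fences) */
proc means data=forplot noprint nway;
  class agedec;
  var bmi;
  output out=fences q1=q1 q3=q3;
run;

/* Count points beyond 1.5*IQR from the quartiles in each decade.
   A comparison evaluates to 1 or 0 in PROC SQL, so SUM counts the matches. */
title 'Observations outside the 1.5*IQR fences, by age decade';
proc sql;
  select f.agedec,
         sum(d.bmi > f.q3 + 1.5*(f.q3 - f.q1)) as high_outliers,
         sum(d.bmi < f.q1 - 1.5*(f.q3 - f.q1)) as low_outliers
  from forplot as d
       inner join fences as f
       on d.agedec = f.agedec
  where d.bmi is not missing
  group by f.agedec;
quit;
title;
```

Comparing these counts with the number of outlier markers in the box-only plot is a quick way to confirm that no extra points are being invented.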
@DanH_sas wrote:
The problem is that you did not specify the NOOUTLIERS option on the VBOX statement. When you used the DISCRETE axis, no discrete-style jittering occurred for the SCATTER plot because there were no collisions within the scatter. Therefore, the scatter points were simply drawn on top of the box outliers. When you changed the axis type to TIME (LINEAR would do this as well), some jittering was applied to all scatter points, which revealed the box outliers.
Using the NOOUTLIERS option on the VBOX will clear all of this up for you. In general, this option should be used any time you create a box-scatter overlay.
Thanks!
Dan
Thank you very much for the discussion and solution. After some experimentation, the following shows two "correct" solutions:
/* Examples illustrating PROC SGPLOT overlay problem */;
data forplot ;
set sashelp.bmimen ;
If Age < 20 then delete; /* simplify the dataset */
agedec = 10*int(Age/10); /* calculate age decade */
run;
title3 'BMI in men by age group';
proc means data=forplot maxdec=2 min max median q1 q3 ;
var BMI;
class agedec ;
run;
title3 'BMI in men by age group - discrete xaxis';
proc sgplot data=forplot noautolegend;
vbox BMI / category=agedec NOOUTLIERS boxwidth = 0.8 ;
scatter x=agedec y=BMI / jitter transparency = 0.75 ;
xaxis type=discrete ;
label agedec = 'Age decade' BMI="Body Mass Index" ;
run;
title3 'BMI in men by age group - time xaxis';
proc sgplot data=forplot noautolegend;
vbox BMI / category=agedec NOOUTLIERS ;
scatter x=agedec y=BMI / jitter transparency = 0.75 ;
xaxis type=time ;
label agedec = 'Age decade' BMI="Body Mass Index" ;
run;
Conclusions:
1. You MUST use NOOUTLIERS when you overlay a SCATTER plot over a VBOX plot.
2. The results of JITTER differ depending on the axis type:
- With a DISCRETE axis, the jitter is random in one dimension (see first plot).
- With a TIME axis, the jitter is random in two dimensions (see second plot).
3. Use of the TRANSPARENCY option enables you to "see the wood for the trees", especially with a large number of data points.
4. The default box width differs between a DISCRETE and a TIME axis, but it can easily be changed with the BOXWIDTH option.
5. This SAS Support Communities forum is fantastic!
I have a similar problem using SAS 9.4.
I have added NOOUTLIERS and "xaxis type = discrete".
However, two points with y-values 2.4 and 2.44 still overlap.
It seems that the JITTER option is not having any effect.
Could you please help me find out why?
Here is the code:
proc sgplot data=mydata noautolegend noborder;
scatter y=level x=Group/ jitter jitterwidth=1 group=group markerattrs=(size=10);
vbox level/ category= Group group = group discreteoffset=0.2 boxwidth=0.3 nooutliers /*nofill*/
dataskin=gloss nomean;
xaxis type = discrete display= (nolabel noticks) values=("Bill" "Aman" "Rude") valueattrs=(Color=Black Family=Arial Size=16);
yaxis display=(noline noticks)
label= "Levels" labelattrs= (Color = Black Family = Arial Size = 16) grid valueattrs=(Color=Black Family=Arial Size=16);
run; quit;
Thank you!
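A possible explanation, going by the accepted solution above: on a DISCRETE axis, jittering only happens when scatter points collide, and 2.4 and 2.44 are different values, so they may not count as a collision even though the large markers overlap visually. One workaround that is sometimes used is to jitter the x position yourself and plot on a LINEAR axis. The sketch below is untested against the poster's data: the mydata step is a made-up stand-in, and the names grp_num and x_jit, the seeds, and the jitter width of 0.3 are all assumptions chosen for illustration.

```sas
/* Hypothetical stand-in for the poster's data: three groups, numeric level */
data mydata;
  call streaminit(2024);
  length Group $4;
  do Group = 'Bill', 'Aman', 'Rude';
    do i = 1 to 12;
      level = round(2 + 0.3*rand('normal'), 0.01);
      output;
    end;
  end;
  drop i;
run;

/* Map each group to a numeric position and add manual horizontal jitter */
data jittered;
  set mydata;
  if _n_ = 1 then call streaminit(12345);
  select (Group);
    when ('Bill') grp_num = 1;
    when ('Aman') grp_num = 2;
    when ('Rude') grp_num = 3;
    otherwise     grp_num = .;
  end;
  /* Uniform offset in (-0.15, 0.15) around the group position */
  x_jit = grp_num + 0.3*(rand('uniform') - 0.5);
run;

proc sgplot data=jittered noautolegend noborder;
  vbox level / category=grp_num nooutliers boxwidth=0.4 nomean;
  /* No JITTER option here: the offset is already built into x_jit */
  scatter x=x_jit y=level / markerattrs=(size=10) transparency=0.3;
  /* VALUESHINT keeps MIN=/MAX= in charge of the axis range so that
     boxes and jittered points at the ends are not cut off */
  xaxis type=linear values=(1 2 3) valuesdisplay=("Bill" "Aman" "Rude")
        valueshint min=0.5 max=3.5 display=(nolabel noticks);
  yaxis label="Levels" grid;
run;
```

Because the offsets are random, nearby points can still land close together; changing the seed or widening the jitter may help.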