Lapis Lazuli | Level 10

## PROC SGPLOT overlay problem

Hi,

Rather embarrassingly, I produced a draft report for a client, who immediately spotted that something was wrong. It took me an hour to find a "fix", but I'm worried that this problem arose in the first place. The data were calculated scores from a questionnaire given repeatedly (and at fixed but unequal intervals) to subjects; the example below uses a different dataset, but illustrates the problem.

``````
/* Examples illustrating PROC SGPLOT overlay problem */;

data forplot ;
set sashelp.bmimen ;
If Age < 20 then delete; /* simplify the dataset */
agedec = 10*int(Age/10); /* calculate age decade */
run;

title3 'BMI in men by age group';
proc means data=forplot maxdec=2 min max median q1 q3 ;
var BMI;
class agedec ;
run;

title3 'BMI in men by age group - discrete xaxis';
proc sgplot data=forplot noautolegend;
vbox BMI / category=agedec ;
scatter x=agedec y=BMI / jitter;
xaxis type=discrete ;
label agedec = 'Age decade' BMI="Body Mass Index"  ;
run;

title3 'BMI in men by age group - time xaxis';
proc sgplot data=forplot noautolegend;
vbox BMI / category=agedec ;
scatter x=agedec y=BMI / jitter;
xaxis type=time ;
label agedec = 'Age decade' BMI="Body Mass Index"  ;
run;
``````

Why is the plot using a discrete axis so different from a plot using a time axis? Even more alarmingly, why are there two men in their 20s with BMI > 40 in the first plot, but four such men in the second plot!?!

Two men with BMI > 40 have appeared out of thin air?!?

Norman.
SAS 9.4 (TS1M6) X64_10PRO WIN 10.0.17763 Workstation

1 ACCEPTED SOLUTION

SAS Super FREQ

## Re: PROC SGPLOT overlay problem

The problem is that you did not specify the NOOUTLIERS option on the VBOX statement. With the DISCRETE axis, no discrete-style jittering occurred for the SCATTER plot because there were no collisions within the scatter, so the scatter points were simply drawn on top of the box outliers. When you changed the axis type to TIME (LINEAR would do this as well), some jittering was applied to all scatter points, which revealed the box outliers underneath.

Using the NOOUTLIERS option on the VBOX will clear all of this up for you. In general, this option should be used any time you create a box-scatter overlay.

Thanks!
Dan

7 REPLIES
Super User

## Re: PROC SGPLOT overlay problem

This appears to be an issue with the JITTER option, not the axis type or statement.

Try removing it and see what you get. I suspect jitter is partly random and uses something about the data to decide how the points are displayed.

Lapis Lazuli | Level 10

## Re: PROC SGPLOT overlay problem

Thanks for the swift response, Reeza. You are (of course) correct - removing jitter does result in only two outliers, but in my questionnaire dataset there are many observations with identical scores, so the points overlap.

Also, is there any documentation describing this "jitter problem"?

Norman.
SAS 9.4 (TS1M6) X64_10PRO WIN 10.0.17763 Workstation

Super User

## Re: PROC SGPLOT overlay problem

No, and I don't know anything for sure; this is mostly an educated guess.

Jitter seems to be wrong here. You definitely only have two outliers for agedec=20, and I'm not sure where the extra points are coming from, so I would contact SAS tech support.

SAS Super FREQ

## Re: PROC SGPLOT overlay problem

The problem is that you did not specify the NOOUTLIERS option on the VBOX statement. With the DISCRETE axis, no discrete-style jittering occurred for the SCATTER plot because there were no collisions within the scatter, so the scatter points were simply drawn on top of the box outliers. When you changed the axis type to TIME (LINEAR would do this as well), some jittering was applied to all scatter points, which revealed the box outliers underneath.

Using the NOOUTLIERS option on the VBOX will clear all of this up for you. In general, this option should be used any time you create a box-scatter overlay.

Thanks!
Dan

Super User

## Re: PROC SGPLOT overlay problem

Ok, took me a while to understand this, so I'll rephrase for anyone else.

The four dots appear because two are outliers drawn by the VBOX statement, and these are overlaid with the same two observations, now jittered, from the SCATTER statement. If you don't specify JITTER, the scatter markers land exactly on top of the box outliers, so the duplication is not visible.

To graph the scatter on top of the box plot, you should suppress the box plot's outliers so that each observation is drawn only once and it doesn't look like extra outliers are present.

``````
*create sample data;
data forplot;
set sashelp.bmimen;
if age < 20 then delete;
agedec = 10*int(age/10);

run;

*adds tooltips to points in HTML to see x/y values when hovering;
ods graphics /imagemap;

title 'Just box plots';
proc sgplot data=forplot noautolegend;
vbox BMI / category=agedec  ;
run;

title 'Just scatter plots';
proc sgplot data=forplot noautolegend;
scatter x=agedec y=BMI / jitter;
xaxis type=linear ;
label agedec = 'Age decade' BMI="Body Mass Index"  ;
run;

title 'Overlayed plots';
proc sgplot data=forplot noautolegend;
vbox BMI / category=agedec  nooutliers ;
scatter x=agedec y=BMI / jitter;
xaxis type=linear ;
label agedec = 'Age decade' BMI="Body Mass Index"  ;
run;
title;

*check number of outliers;
title "Number of Outliers for agedec=20";
proc sql;
select count(*) as Number_Outliers label="Number of Outliers"
from forplot
where agedec=20 and bmi > 40;
quit;
title;

``````
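The count above hard-codes the BMI > 40 threshold for one age group. As a rough cross-check for every group, the sketch below counts the observations that fall outside the usual 1.5 × IQR fences, which is the rule VBOX uses by default to decide what to draw as an outlier. It assumes the `forplot` dataset created above; the `fences` dataset and the `q1`, `q3` and `n_outliers` names are illustrative only. Groups with no outliers will not appear in the output.

``````
/* Sketch: count box-plot outliers per age decade (1.5*IQR rule). */
/* Assumes FORPLOT exists as created above; FENCES is a           */
/* hypothetical helper dataset introduced for this example.       */

/* Quartiles per age decade (PROC MEANS default percentile        */
/* definition, QNTLDEF=5) */
proc means data=forplot noprint nway;
class agedec;
var BMI;
output out=fences q1=q1 q3=q3;
run;

/* Count observations beyond the lower or upper fence */
title "Number of outliers by age decade";
proc sql;
select f.agedec,
       count(*) as n_outliers label="Number of outliers"
from forplot as p inner join fences as f
  on p.agedec = f.agedec
where p.BMI < f.q1 - 1.5*(f.q3 - f.q1)
   or p.BMI > f.q3 + 1.5*(f.q3 - f.q1)
group by f.agedec;
quit;
title;
``````

If the count for a group is lower than the number of outlier-like markers in an overlaid plot, the extra markers are most likely the VBOX outliers showing alongside the jittered SCATTER points.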

Lapis Lazuli | Level 10

## Re: PROC SGPLOT overlay problem

Thank you very much for the discussion and solution. After some experimentation, the following shows two "correct" solutions:

``````
/* Examples illustrating PROC SGPLOT overlay problem */;

data forplot ;
set sashelp.bmimen ;
If Age < 20 then delete; /* simplify the dataset */
agedec = 10*int(Age/10); /* calculate age decade */
run;

title3 'BMI in men by age group';
proc means data=forplot maxdec=2 min max median q1 q3 ;
var BMI;
class agedec ;
run;

title3 'BMI in men by age group - discrete xaxis';
proc sgplot data=forplot noautolegend;
vbox BMI / category=agedec NOOUTLIERS boxwidth = 0.8 ;
scatter x=agedec y=BMI / jitter transparency = 0.75 ;
xaxis type=discrete ;
label agedec = 'Age decade' BMI="Body Mass Index"  ;
run;

title3 'BMI in men by age group - time xaxis';
proc sgplot data=forplot noautolegend;
vbox BMI / category=agedec NOOUTLIERS ;
scatter x=agedec y=BMI / jitter transparency = 0.75 ;
xaxis type=time ;
label agedec = 'Age decade' BMI="Body Mass Index"  ;
run;
``````

Conclusions:

1. You MUST use NOOUTLIERS when you overlay a SCATTER plot over a VBOX plot.

2. The results of JITTER differ depending on the axis type:

• With a DISCRETE axis, the jitter is random in one dimension (see first plot).
• With a TIME axis, the jitter is random in two dimensions (see second plot).

3. Use of the TRANSPARENCY option enables you to "see the wood for the trees", especially with a large number of data points.

4. The BOXWIDTH by default is different with a DISCRETE or TIME axis, but this can easily be modified with the BOXWIDTH option.

5. This SAS Support Communities forum is fantastic!

Norman.
SAS 9.4 (TS1M6) X64_10PRO WIN 10.0.17763 Workstation

Calcite | Level 5

## Re: PROC SGPLOT overlay problem

I have a similar problem using SAS 9.4.

I have added "NOOUTLIERS" and "xaxis type = discrete".

However, two points with y-values 2.4 and 2.44 are still overlapping.

It seems the jitter function is not working.

Could you please help me find out why the jitter function is not working?

Here is the code:

proc sgplot data=mydata noautolegend noborder;

scatter y=level x=Group/ jitter jitterwidth=1 group=group  markerattrs=(size=10);

vbox level/ category= Group group = group discreteoffset=0.2 boxwidth=0.3 nooutliers /*nofill*/;

xaxis type = discrete display= (nolabel noticks) values=("Bill" "Aman" "Rude") valueattrs=(Color=Black Family=Arial Size=16);

yaxis display=(noline noticks)
label= "Levels"  labelattrs= (Color = Black Family = Arial Size = 16) grid valueattrs=(Color=Black Family=Arial Size=16);
run; quit;

Thank you!
