Norman21
Lapis Lazuli | Level 10

Hi,

 

Rather embarrassingly, I produced a draft report for a client, who immediately spotted that something was wrong. It took me an hour to find a "fix", but I'm worried that this problem arose in the first place. The data were calculated scores from a questionnaire given repeatedly (and at fixed but unequal intervals) to subjects; the example below uses a different dataset, but illustrates the problem.

 


/* Examples illustrating PROC SGPLOT overlay problem */

data forplot;
  set sashelp.bmimen;
  if Age < 20 then delete;   /* simplify the dataset */
  agedec = 10*int(Age/10);   /* calculate age decade */
run;

title3 'BMI in men by age group';
proc means data=forplot maxdec=2 min max median q1 q3;
  var BMI;
  class agedec;
run;

title3 'BMI in men by age group - discrete xaxis';
proc sgplot data=forplot noautolegend;
  vbox BMI / category=agedec;
  scatter x=agedec y=BMI / jitter;
  xaxis type=discrete;
  label agedec='Age decade' BMI='Body Mass Index';
run;

title3 'BMI in men by age group - time xaxis';
proc sgplot data=forplot noautolegend;
  vbox BMI / category=agedec;
  scatter x=agedec y=BMI / jitter;
  xaxis type=time;
  label agedec='Age decade' BMI='Body Mass Index';
run;

SGPLOT_problem.PNG

 

Why is the plot using a discrete axis so different from a plot using a time axis? Even more alarmingly, why are there two men in their 20s with BMI > 40 in the first plot, but four such men in the second plot!?!

 

Two men with BMI > 40 have appeared out of thin air?!?

Norman.
SAS 9.4 (TS1M6) X64_10PRO WIN 10.0.17763 Workstation

ACCEPTED SOLUTION
DanH_sas
SAS Super FREQ

The problem is that you did not specify the NOOUTLIERS option on the VBOX statement. With the DISCRETE axis, no discrete-style jittering occurred in the SCATTER plot because none of the scatter points collided, so the scatter markers were drawn directly on top of the box outliers. When you changed the axis type to TIME (LINEAR would do the same), some jitter was applied to every scatter point, which moved the markers off the box outliers and revealed them.

 

Using the NOOUTLIERS option on the VBOX will clear all of this up for you. In general, this option should be used any time you create a box-scatter overlay.

 

Thanks!
Dan
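
One way to see the overlap described above is to draw the two layers with different markers and no jitter. The following is only an illustrative sketch, not part of Dan's answer: it assumes the forplot dataset from the original post, and the marker colours, symbols and sizes are arbitrary choices. Each red outlier drawn by VBOX should sit directly under a black plus drawn by SCATTER, which suggests that both statements are plotting the same observations.

```sas
/* Rebuild the dataset from the original post */
data forplot;
  set sashelp.bmimen;
  if Age < 20 then delete;
  agedec = 10*int(Age/10);
run;

title3 'Box outliers (red) and scatter markers (black plus), no jitter';
proc sgplot data=forplot noautolegend;
  /* Outliers are left switched on deliberately and coloured red */
  vbox BMI / category=agedec
             outlierattrs=(color=red symbol=circlefilled size=9);
  /* Same observations drawn again by SCATTER, without JITTER */
  scatter x=agedec y=BMI / markerattrs=(color=black symbol=plus size=9);
  xaxis type=discrete;
  label agedec='Age decade' BMI='Body Mass Index';
run;
title3;
```

Adding JITTER to the SCATTER statement and switching the axis to TYPE=LINEAR should then pull the black markers away from the red ones, which is the effect seen in the original post.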

Reeza
Super User

This appears to be an issue with the JITTER option, not the axis type or the XAXIS statement.

Try removing it and see what you get. I suspect the jitter is partly random and depends on something about the data.

 




 

 

Norman21
Lapis Lazuli | Level 10

Thanks for the swift response, Reeza. You are (of course) correct - removing jitter does result in only two outliers, but in my questionnaire dataset there are many observations with identical scores, so the points overlap.

 

Also, is there any documentation describing this "jitter problem"?

Norman.
SAS 9.4 (TS1M6) X64_10PRO WIN 10.0.17763 Workstation

Reeza
Super User

No, not that I know of. I don't know anything for sure; this is mostly an educated guess.

Jitter seems to be wrong here: you definitely have only two outliers for agedec=20. I'm not sure where the extra points are coming from, so I would contact SAS Technical Support.

 




 


Reeza
Super User

OK, it took me a while to understand this, so I'll rephrase it for anyone else.

The four dots appear because two are outliers drawn by the VBOX statement and the other two are the same observations drawn, with jitter, by the SCATTER statement. Without JITTER, the scatter markers sit exactly on top of the box outliers, so the duplication is invisible.

To draw the scatter on top of the box plot, suppress the box plot's outliers with NOOUTLIERS so that each observation is drawn only once and it doesn't look as if extra outliers are present.

 

*create sample data;
data forplot;
  set sashelp.bmimen;
  if age < 20 then delete;
  agedec = 10*int(age/10);

  decade = agedec;
  format decade time8.;
run;

*adds tooltips to points in HTML to see x/y values when hovering;
ods graphics / imagemap;

title 'Just box plots';
proc sgplot data=forplot noautolegend;
  vbox BMI / category=agedec;
run;

title 'Just scatter plots';
proc sgplot data=forplot noautolegend;
  scatter x=agedec y=BMI / jitter;
  xaxis type=linear;
  label agedec='Age decade' BMI='Body Mass Index';
run;

title 'Overlaid plots';
proc sgplot data=forplot noautolegend;
  vbox BMI / category=agedec nooutliers;
  scatter x=agedec y=BMI / jitter;
  xaxis type=linear;
  label agedec='Age decade' BMI='Body Mass Index';
run;
title;

*check number of outliers;
title "Number of Outliers for agedec=20";
proc sql;
  select count(*) as Number_Outliers label="Number of Outliers"
  from forplot
  where agedec=20 and bmi > 40;
quit;
title;
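
The check above uses a fixed threshold (BMI > 40) for one decade. A more general check, sketched here as an assumption rather than as anything posted in the thread, counts the points beyond the usual 1.5 x IQR fences in every decade. It assumes that the box plot flags outliers with that rule and that the default percentile definition in PROC MEANS matches the one used by VBOX; the names fences and n_outliers are made up for the illustration.

```sas
/* Same sample data as above */
data forplot;
  set sashelp.bmimen;
  if age < 20 then delete;
  agedec = 10*int(age/10);
run;

/* Quartiles of BMI for each age decade */
proc means data=forplot noprint nway;
  class agedec;
  var BMI;
  output out=fences q1=q1 q3=q3;
run;

/* Count observations outside Q1 - 1.5*IQR and Q3 + 1.5*IQR */
title "Number of points beyond the 1.5 x IQR fences, by age decade";
proc sql;
  select p.agedec,
         sum(p.BMI is not missing and
             (p.BMI < f.q1 - 1.5*(f.q3 - f.q1) or
              p.BMI > f.q3 + 1.5*(f.q3 - f.q1))) as n_outliers
           label='Number of outliers'
  from forplot as p
       inner join fences as f
       on p.agedec = f.agedec
  group by p.agedec;
quit;
title;
```

If the counts agree with the number of outlier markers in the "Just box plots" graph, any extra markers in an overlay are likely to be duplicates from the SCATTER statement rather than additional observations.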




 

Norman21
Lapis Lazuli | Level 10

Thank you very much for the discussion and solution. After some experimentation, the following shows two "correct" solutions:

 


/* Examples illustrating PROC SGPLOT overlay problem */

data forplot;
  set sashelp.bmimen;
  if Age < 20 then delete;   /* simplify the dataset */
  agedec = 10*int(Age/10);   /* calculate age decade */
run;

title3 'BMI in men by age group';
proc means data=forplot maxdec=2 min max median q1 q3;
  var BMI;
  class agedec;
run;

title3 'BMI in men by age group - discrete xaxis';
proc sgplot data=forplot noautolegend;
  vbox BMI / category=agedec nooutliers boxwidth=0.8;
  scatter x=agedec y=BMI / jitter transparency=0.75;
  xaxis type=discrete;
  label agedec='Age decade' BMI='Body Mass Index';
run;

title3 'BMI in men by age group - time xaxis';
proc sgplot data=forplot noautolegend;
  vbox BMI / category=agedec nooutliers;
  scatter x=agedec y=BMI / jitter transparency=0.75;
  xaxis type=time;
  label agedec='Age decade' BMI='Body Mass Index';
run;

SGPLOT_problem_solved.PNG

 

Conclusions:

 

1. You MUST use NOOUTLIERS when you overlay a SCATTER plot over a VBOX plot.

2. The results of JITTER differ depending on the axis type:

  • With a DISCRETE axis, the jitter is random in one dimension (see first plot).
  • With a TIME axis, the jitter is random in two dimensions (see second plot).

3. Use of the TRANSPARENCY option enables you to "see the wood for the trees", especially with a large number of data points.

4. The BOXWIDTH by default is different with a DISCRETE or TIME axis, but this can easily be modified with the BOXWIDTH option.

5. This SAS Support Communities forum is fantastic!

Norman.
SAS 9.4 (TS1M6) X64_10PRO WIN 10.0.17763 Workstation
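
The original question concerned questionnaire scores collected at fixed but unequal intervals. A minimal sketch of how the conclusions above might be applied to data of that shape follows. Everything in it is hypothetical: the visits dataset is simulated, and the variable names week, subject and score and the visit weeks 0, 2, 6 and 12 are invented for the illustration. Scores are rounded so that many observations tie, as described for the real data.

```sas
/* Simulated questionnaire scores at unequal intervals (illustrative only) */
data visits;
  call streaminit(2021);
  do week = 0, 2, 6, 12;
    do subject = 1 to 40;
      score = round(rand('normal', 50, 10));  /* integer scores, many ties */
      output;
    end;
  end;
run;

title3 'Simulated scores by week - linear xaxis';
proc sgplot data=visits noautolegend;
  /* NOOUTLIERS so each observation is drawn once, by SCATTER only */
  vbox score / category=week nooutliers boxwidth=0.8;
  scatter x=week y=score / jitter transparency=0.75;
  /* A linear axis keeps the spacing between visits proportional to time */
  xaxis type=linear values=(0 2 6 12) label='Week';
  yaxis label='Score';
run;
title3;
```

The BOXWIDTH value may need adjusting, since the default box width appears to differ between discrete and interval axes, as noted in conclusion 4.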

CEvenepoel
Calcite | Level 5

I have a similar problem using SAS 9.4.

 

I have added NOOUTLIERS and "xaxis type = discrete". However, two points with y-values 2.4 and 2.44 still overlap, so the jitter does not seem to be working.

Could you please help me find out why?

Here is the code:

proc sgplot data=mydata noautolegend noborder;
  scatter y=level x=Group / jitter jitterwidth=1 group=group markerattrs=(size=10);
  vbox level / category=Group group=group discreteoffset=0.2 boxwidth=0.3
               nooutliers /*nofill*/ dataskin=gloss nomean;
  xaxis type=discrete display=(nolabel noticks) values=("Bill" "Aman" "Rude")
        valueattrs=(Color=Black Family=Arial Size=16);
  yaxis display=(noline noticks) label="Levels"
        labelattrs=(Color=Black Family=Arial Size=16) grid
        valueattrs=(Color=Black Family=Arial Size=16);
run; quit;

 

Thank you!
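
This question was not answered in the thread. One possible starting point, offered only as a guess, follows from the accepted solution: if jitter on a discrete axis is applied only to points that collide, then 2.4 and 2.44 may not count as a collision, even though markers of size 10 overlap visually. The sketch below lists the exact ties in the data, which on that assumption are the only points that would be jittered. It assumes the poster's mydata dataset with the variables Group and level as used in the code above; n_points is a made-up column name.

```sas
/* List (Group, level) combinations that occur more than once */
title "Exact ties within each group";
proc sql;
  select Group, level, count(*) as n_points
  from mydata
  group by Group, level
  having count(*) > 1;
quit;
title;
```

If the two overlapping points do not appear in this list, they differ in value, and reducing the marker size or adding TRANSPARENCY= to the SCATTER statement may be a simpler way to tell them apart.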
