Hi All,
I'm making a boxplot with a character x-axis, on a fairly large dataset (3M records). When I added an AXISTABLE to the plot (to show the mean for each group), it slowed down dramatically.
In testing, it looks like a plot with a numeric x-axis runs fine (with or without an axistable). With a character x-axis and no axistable it runs fine. But as soon as I add an axistable to the character x-axis, it slows down dramatically.
Sample code:
data have ;
  do cat=1 to 5 ;
    catc=put(cat,1.) ;
    do i=1 to 100000 ;
      score=ranuni(0)*cat ;
      output ;
    end ;
  end ;
run ;

options stimer ;

ods listing close ;
ods pdf file="%sysfunc(pathname(work))/mypdf.pdf" ;

*numeric category ;
proc sgplot data=have ;
  vbox score/category=cat extreme;
run ;

proc sgplot data=have ;
  vbox score/category=cat extreme;
  xaxistable score /location=inside stat=mean ;
run ;

*character category ;
proc sgplot data=have ;
  vbox score/category=catc extreme;
run ;

proc sgplot data=have ;
  vbox score/category=catc extreme;
  xaxistable score /location=inside stat=mean ;
run ;

ods pdf close ;
My log (PC SAS, 9.4M4):
14   ods pdf file="%sysfunc(pathname(work))/mypdf.pdf" ;
NOTE: Writing ODS PDF output to DISK destination "C:\Users\Quentin\AppData\Local\Temp\SAS Temporary Files\_TD1828_MD1QCFVC_\mypdf.pdf", printer "PDF".
15
16   *numeric category ;
17   proc sgplot data=have ;
18   vbox score/category=cat extreme;
19   run ;
NOTE: Since no format is assigned, the numeric category variable will use the default of BEST6.
NOTE: PROCEDURE SGPLOT used (Total process time):
      real time           3.59 seconds
      cpu time            2.93 seconds
NOTE: Compressing data set WORK._DOCTMP000000000000000000058 increased size by 54.03 percent.
      Compressed is 191 pages; un-compressed would require 124 pages.
NOTE: Compressing data set WORK._DOCTMP000000000000000000059 decreased size by 27.45 percent.
      Compressed is 267 pages; un-compressed would require 368 pages.
NOTE: There were 500000 observations read from the data set WORK.HAVE.
20
21   proc sgplot data=have ;
22   vbox score/category=cat extreme;
23   xaxistable score /location=inside stat=mean ;
24   run ;
NOTE: Since no format is assigned, the numeric category variable will use the default of BEST6.
NOTE: PROCEDURE SGPLOT used (Total process time):
      real time           2.58 seconds
      cpu time            1.79 seconds
NOTE: Compressing data set WORK._DOCTMP000000000000000000060 increased size by 54.03 percent.
      Compressed is 191 pages; un-compressed would require 124 pages.
NOTE: Compressing data set WORK._DOCTMP000000000000000000061 decreased size by 27.45 percent.
      Compressed is 267 pages; un-compressed would require 368 pages.
NOTE: Marker and line antialiasing has been disabled for at least one plot because the threshold has been reached.
      You can set ANTIALIASMAX=500000 in the ODS GRAPHICS statement to enable antialiasing for all plots.
NOTE: There were 500000 observations read from the data set WORK.HAVE.
25
26   *character category ;
27   proc sgplot data=have ;
28   vbox score/category=catc extreme;
29   run ;
NOTE: PROCEDURE SGPLOT used (Total process time):
      real time           1.79 seconds
      cpu time            1.56 seconds
NOTE: Compressing data set WORK._DOCTMP000000000000000000063 decreased size by 12.60 percent.
      Compressed is 215 pages; un-compressed would require 246 pages.
NOTE: There were 500000 observations read from the data set WORK.HAVE.
30
31   proc sgplot data=have ;
32   vbox score/category=catc extreme;
33   xaxistable score /location=inside stat=mean ;
34   run ;
NOTE: PROCEDURE SGPLOT used (Total process time):
      real time           59.75 seconds
      cpu time            25.57 seconds
NOTE: Compressing data set WORK._DOCTMP000000000000000000065 decreased size by 12.60 percent.
      Compressed is 215 pages; un-compressed would require 246 pages.
NOTE: Marker and line antialiasing has been disabled for at least one plot because the threshold has been reached.
      You can set ANTIALIASMAX=500000 in the ODS GRAPHICS statement to enable antialiasing for all plots.
NOTE: There were 500000 observations read from the data set WORK.HAVE.
35
36   ods pdf close ;
So each of the first 3 plots runs in under four seconds. In the 4th plot, where I have a character x-axis and add the axistable, it takes almost a minute.
If you're brave, bump the sample size up to 1,000,000 per group. Plots 1-3 then take ~30 seconds each, and plot 4 takes 11 minutes. I also initially got Java VM errors, so I had to increase the memory per http://support.sas.com/kb/31/184.html just to get it to run.
So my questions:
1. Why would adding an XAXISTABLE slow down generation of the graph so dramatically (when the x-axis is a character var)? [In my head, SAS only has to run PROC MEANS in the background to compute the means]
2. For those with a more current version, do you see the same performance hit on the 4th plot?
3. Why would adding an XAXISTABLE trigger the "marker and line antialiasing has been disabled" note? I understand that message when I have a complex plot with lots of shapes, but that doesn't apply here.
It should be easy enough for me to switch to a numeric variable with a format attached on the x-axis, instead of a character variable, to avoid the tremendous performance hit. But I'm curious whether there is a good explanation for this, and whether the problem exists in 9.4M5/6.
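The workaround mentioned above could look roughly like the sketch below. The format name catlbl and its labels are hypothetical, made up for illustration. The timings in this thread only cover an unformatted numeric category, so whether a formatted numeric category is equally fast is an assumption that would need to be checked. If the real data has only a character variable, a numeric code would have to be derived first.

```sas
/* Hypothetical format: maps the numeric codes to display labels */
proc format ;
  value catlbl
    1='Group 1'
    2='Group 2'
    3='Group 3'
    4='Group 4'
    5='Group 5' ;
run ;

/* Numeric category with a format attached, plus the axis table */
proc sgplot data=have ;
  format cat catlbl. ;
  vbox score / category=cat extreme ;
  xaxistable score / location=inside stat=mean ;
run ;
```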
Thanks,
-Q.
Hey @Quentin , thanks for bringing this to our attention. For your use case, I believe you will find it faster to use the DISPLAYSTATS option on the VBOX statement. Give that a try and see if that works well for you.
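A minimal sketch of that suggestion against the sample data from the question, assuming DISPLAYSTATS accepts a parenthesized list of statistics and that MEAN is one of them. The option appears to require a release later than 9.4M4 (the original poster's closing reply suggests as much), so check the VBOX documentation for your release before relying on it.

```sas
/* Character category, with the mean shown by the box plot itself */
/* instead of by a separate XAXISTABLE statement                  */
proc sgplot data=have ;
  vbox score / category=catc extreme displaystats=(mean) ;
run ;
```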
Why not calculate the mean first, and then use PROC SGPLOT?
@Ksharp wrote:
Why not calculate the mean first, and then use PROC SGPLOT?
I could definitely do that, but I like axistable, and after making a graph, I thought "hey, maybe I should show the mean/min/max values." When I added them, I was surprised that they slowed the program so much (I was making five or six graphs, and the program went from running in less than a minute to > 10 minutes). Then when I tried to build a test case to post, I couldn't replicate the problem at first. I was even more surprised when I realized it was because my test case had a numeric variable; when I changed it to character, it slowed dramatically. (You always learn from making test cases.)
I'm a huge fan of ODS graphics. But my one complaint would be that they seem slow to generate, especially with moderate to large data. Sometimes with box plots, I resort to calculating the values for the boxplot myself (using PROC MEANS or whatever), and then use GTL BOXPLOTPARM, which typically speeds things up, because there is less data to crunch.
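A rough sketch of that pre-summarize approach, using the sample data from the question. The data set names stats_wide and stats_long, the s_ variable names, and the template name boxparm_sketch are all made up for illustration. It assumes whiskers should run to the data minimum and maximum, which matches the EXTREME option used in the VBOX examples. PROC MEANS and the box plot statements may use different default quantile definitions, so the quartiles are not guaranteed to match VBOX exactly.

```sas
/* Summarize once: one row per category */
proc means data=have noprint nway ;
  class catc ;
  var score ;
  output out=stats_wide(drop=_type_ _freq_)
         min=s_min q1=s_q1 median=s_median q3=s_q3 max=s_max mean=s_mean ;
run ;

/* Reshape to the long layout BOXPLOTPARM reads: one row per statistic */
data stats_long ;
  set stats_wide ;
  length stat $8 ;
  stat='MIN' ;    value=s_min ;    output ;
  stat='Q1' ;     value=s_q1 ;     output ;
  stat='MEDIAN' ; value=s_median ; output ;
  stat='Q3' ;     value=s_q3 ;     output ;
  stat='MAX' ;    value=s_max ;    output ;
  stat='MEAN' ;   value=s_mean ;   output ;
  keep catc stat value ;
run ;

/* GTL template that draws boxes from the precomputed statistics */
proc template ;
  define statgraph boxparm_sketch ;
    begingraph ;
      layout overlay ;
        boxplotparm x=catc y=value stat=stat ;
      endlayout ;
    endbegingraph ;
  end ;
run ;

proc sgrender data=stats_long template=boxparm_sketch ;
run ;
```

With the sample data, the graph step then reads 30 summary rows (5 categories times 6 statistics) instead of 500,000 observations.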
Given that with bigger data I was getting java virtual machine memory errors, it makes me wonder if the JVM is doing statistical calculations and that is the cause of poor speed. In my head, I would have thought that when I make a box plot in SAS, it would use SAS to calculate the statistics needed. But it seems like often the statistical calculations are slower than you would get from a SAS PROC.
PROC MEANS can calculate the means in < 0.2 seconds, regardless of whether the class variable is numeric or character:
12   proc means data=have mean;
13   var score ;
14   class cat ;
15   run ;
NOTE: There were 500000 observations read from the data set WORK.HAVE.
NOTE: PROCEDURE MEANS used (Total process time):
      real time           0.13 seconds
      cpu time            0.12 seconds
16
17
18   proc means data=have mean;
19   var score ;
20   class catc ;
21   run ;
NOTE: There were 500000 observations read from the data set WORK.HAVE.
NOTE: PROCEDURE MEANS used (Total process time):
      real time           0.06 seconds
      cpu time            0.15 seconds
Seems fair to be surprised that SGPLOT would take 20-50 seconds to calculate the same values.
Yeah, that is how SAS usually surprises us.
Sometimes I get results like yours too.
Just ran in M6 and it's similar but not as bad.
1    OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
NOTE: ODS statements in the SAS Studio environment may disable some output features.
69
70   data have ;
71   do cat=1 to 5 ;
72   catc=put(cat,1.) ;
73   do i=1 to 100000 ;
74   score=ranuni(0)*cat ;
75   output ;
76   end ;
77   end ;
78   run ;
NOTE: The data set WORK.HAVE has 500000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.20 seconds
      cpu time            0.20 seconds
79
80   options stimer ;
81
82   ods listing close ;
83   ods pdf file="%sysfunc(pathname(work))/mypdf.pdf" ;
NOTE: Writing ODS PDF output to DISK destination "/tmp/SAS_workAE9500004D83_localhost.localdomain/SAS_work5F1F00004D83_localhost.localdomain/mypdf.pdf", printer "PDF".
84
85   *numeric category ;
86   proc sgplot data=have ;
87   vbox score/category=cat extreme;
88   run ;
NOTE: Since no format is assigned, the numeric category variable will use the default of BEST6.
NOTE: PROCEDURE SGPLOT used (Total process time):
      real time           7.47 seconds
      cpu time            1.81 seconds
NOTE: There were 500000 observations read from the data set WORK.HAVE.
89
90   proc sgplot data=have ;
91   vbox score/category=cat extreme;
92   xaxistable score /location=inside stat=mean ;
93   run ;
NOTE: Since no format is assigned, the numeric category variable will use the default of BEST6.
NOTE: PROCEDURE SGPLOT used (Total process time):
      real time           4.07 seconds
      cpu time            1.19 seconds
NOTE: Marker and line antialiasing has been disabled for at least one plot because the threshold has been reached.
      You can set ANTIALIASMAX=500000 in the ODS GRAPHICS statement to enable antialiasing for all plots.
NOTE: There were 500000 observations read from the data set WORK.HAVE.
94
95   *character category ;
96   proc sgplot data=have ;
97   vbox score/category=catc extreme;
98   run ;
NOTE: PROCEDURE SGPLOT used (Total process time):
      real time           1.18 seconds
      cpu time            1.06 seconds
NOTE: There were 500000 observations read from the data set WORK.HAVE.
99
100  proc sgplot data=have ;
101  vbox score/category=catc extreme;
102  xaxistable score /location=inside stat=mean ;
103  run ;
NOTE: PROCEDURE SGPLOT used (Total process time):
      real time           31.64 seconds
      cpu time            19.39 seconds
NOTE: Marker and line antialiasing has been disabled for at least one plot because the threshold has been reached.
      You can set ANTIALIASMAX=500000 in the ODS GRAPHICS statement to enable antialiasing for all plots.
NOTE: There were 500000 observations read from the data set WORK.HAVE.
104
105  ods pdf close ;
NOTE: ODS PDF printed 4 pages to /tmp/SAS_workAE9500004D83_localhost.localdomain/SAS_work5F1F00004D83_localhost.localdomain/mypdf.pdf.
106
107  OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
117
Thanks @DanH_sas that looks lovely. Best argument yet that I should upgrade from 9.4M4. : )