I have data of spreads in two different Rating classes (AA and AAA). I want to test the spreads for normality in each Rating class. I therefore produce this panel using Proc Univariate:
The problem is that the reference line through the Q-Q plot appears to be the same for every rating, regardless of the class.
Code used:
proc univariate data=testData;
var Spread;
Class Rating;
qqplot Spread/square normal(MU=EST SIGMA=EST );
run;
If I use BY instead of Class, I get individually fitted qq-plots. But that requires sorting and does not provide a panel layout. It should work for Class as well, and display properly in the panel layout.
Just for reference, here is the BY output for the AAA rating, clearly not matching the AAA output in the panel layout:
Yes, these are the observations:
data testdata;
input rating :$3. Spread :16.8;
datalines;
AA -16.233096
AA -12.366438
AA -6.320539
AA -3.732885
AA 7.516216
AA -4.689121
AA -14.602099
AA 2.9222857069
AA 3.1682522766
AA -14.341467
AA -10.905014
AA 10.171641418
AA -9.329653
AAA -28.50975333
AAA -28.534444
AAA -28.932457
AAA -28.760108
AAA -28.667521
AAA -28.935067
AAA -29.9774
AAA -28.891263
AAA -28.324166
AAA -26.943601
AAA -3.785452
AAA -26.157575
AAA -19.440571
AAA -17.983667
AAA -17.165252
AAA -6.671511
AAA -7.015316
AAA -8.647729
AAA 8.857968
AAA -13.716122
AAA -6.366132
AAA -6.11798
AAA -15.030489
run;
No, I don't think so. Both plots have the same scale and share the same slope/line.
If you use the OVERLAY option:
proc univariate data=testData;
var Spread;
Class Rating;
qqplot Spread/overlay square normal(MU=EST SIGMA=EST );
run;
I think there must be some option to adjust this line.
@Rick_SAS may know.
"Both plot have same scale and share the same slope/line"
That is exactly the problem: they should not have the same line. They are two separate samples, separated by Rating as per the Class statement. BY processing does the correct thing, and Class should be able to do the same. I really hope there is some syntax I have not come across that can solve this. Overlay makes no difference.
Yes, I would report this as a bug. The normal plot shown in the CLASS based graphs is plotting the Normal Line from the first panel (first CLASS value of AA) in all the panels. Other than that, the BY and CLASS based analysis results are identical.
This is a known issue. In a comparative Q-Q plot (requested with a CLASS statement) the quantiles of the plotted points and the reference line in each cell are computed using parameters of the distribution fitted to the data in the key cell, which by default is the cell in row 1 and column 1. When you're using a normal distribution the quantiles are unaffected by this, but that is not the case for distributions with shape parameters.
Currently you do need to use a BY statement to produce independently-fitted Q-Q plots.
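For the toy data posted earlier in the thread, the BY workaround could look like the minimal sketch below. It assumes the `testData` data set from the original post; the sorted copy `testSorted` is a name made up for this illustration.

```sas
/* BY processing needs the data sorted by the BY variable.      */
/* testSorted is a hypothetical name for the sorted copy.       */
proc sort data=testData out=testSorted;
  by Rating;
run;

/* One independently fitted Q-Q plot per Rating value.          */
/* Each BY group gets its own estimates for MU and SIGMA.       */
proc univariate data=testSorted;
  by Rating;
  var Spread;
  qqplot Spread / square normal(mu=est sigma=est);
run;
```

This produces one graph per rating rather than a panel, which is the trade-off discussed in the rest of the thread.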
Hi Bucky,
Thanks for acknowledging that there is a bug.
The BY workaround does not offer what the panel layout can provide. The above was just a toy example. If I want to inspect 20+ groups of data for normality, the panel plots would tell me at a glance which ones to focus on and what makes them fail the normality test (outliers, tails, etc.). I can code something myself, but it would be very convenient to have Univariate provide this.
Until a hot fix that resolves this is issued, you can build your own Q-Q plot panel with SGPANEL by computing the coordinates to be plotted.
Example
Consider the COVID-19 testing data available from the New York State Department of Health. The number of tests performed is plotted by county. To keep the demonstration fast, the data is reduced to every 31st day within each county, and only counties whose names start with A or B are used (again, less data === faster output).
Fetch the data
* https://data.ny.gov/browse?tags=covid-19
* New York State Statewide COVID-19 Testing;
filename testing temp;
filename headers temp;
%if not %sysfunc(exist(work.testing,data)) %then %do;
proc http
url = 'https://health.data.ny.gov/api/views/xdss-u53e/rows.csv?accessType=DOWNLOAD&api_foundry=true'
method = "get"
out = testing
headerout = headers
;
run;
proc import datafile=testing dbms=csv replace out=work.testing
;
guessingrows=all;
run;
%end;
data testing_31;
set testing;
by county;
if first.county then seq=1; else seq+1;
if mod(seq,31) = 0;
run;
Plots using only UNIVARIATE and BY give just one panel per BY value:
proc univariate noprint data=testing_31;
var Total_Number_of_Tests_Performed;
*Class county;
by county;
qqplot Total_Number_of_Tests_Performed / square normal(MU=EST SIGMA=EST);
where county < 'C';
output out=unistats mean=mean std=std;
run;
Compute the coordinates for the per county QQ plots and plot them with SGPANEL and SCATTER and SERIES statements.
proc rank
data=testing_31
normal=BLOM
out=qq(keep=county Total_Number_of_Tests_Performed nq)
;
by county;
var Total_Number_of_Tests_Performed ;
ranks nq;
run;
proc means nway noprint data=testing_31;
class county;
var Total_Number_of_Tests_Performed ;
output out=line mean=mean std=std;
run;
data refline;
set line;
xn25 = quantile('normal', 0.25);
xn75 = quantile('normal', 0.75);
yn25 = xn25 * std + mean;
yn75 = xn75 * std + mean;
x = xn25; y = yn25; output;
x = xn75; y = yn75; output;
keep county x y;
run;
data plot;
merge qq refline;
by county;
keep county nq Total_Number_of_Tests_Performed x y ;
if first.county then seq=1; else seq+1;
if seq > 2 then call missing (x, y);
run;
proc sgpanel data=plot;
panelby county / columns=3 rows=3;
scatter x=nq y=Total_Number_of_Tests_Performed;
series x=x y=y;
where county < 'C';
run;
SGPANEL output
Thanks @RichardDeVen , that's a great approach for this.
Proc Rank is a real power tool. And I like how your SGPANEL produces more of a grid instead of the vectorized layout from Univariate.
Cheers,
Jørgen
One possible problem arises when you have a large number of groups with a wide variation in the group-wise values being Q-Q plotted. SGPANEL produces images with uniform axes, so some groups can appear so squashed or stretched that no information is really discernible. Contrast that with the UNIVARIATE BY/CLASS plots, which use the full plot area for their graph output.
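One option that may soften the uniform-axes problem is the UNISCALE= option of the PANELBY statement. The sketch below is untested against this data and assumes the `plot` data set built by the code earlier in the thread. With UNISCALE=COLUMN only the column (X) axes are shared, so with a single column of cells each row should get its own Y axis range.

```sas
/* Assumes the PLOT data set created by the RANK/MEANS/DATA     */
/* steps earlier in the thread.                                 */
proc sgpanel data=plot;
  where county < 'C';
  /* UNISCALE=COLUMN shares only the column (X) axes across     */
  /* cells; COLUMNS=1 stacks the cells so that each row can     */
  /* scale its Y axis to its own county.                        */
  panelby county / columns=1 uniscale=column;
  scatter x=nq y=Total_Number_of_Tests_Performed;
  series  x=x  y=y;
run;
```

The cost is that the grid layout is lost: independent Y axes are available per row, not per cell, so a multi-column panel would still share a Y axis within each row.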
Another approach, which I have not explored yet, is presenting the UNIVARIATE plots (only the plots) in an ODS lattice LAYOUT which would thus grid-ify full plot area graphs.
@RichardDeVen Hello, sorry for the ping but I just took a look at the code for fetching the testing data. I noticed that you used a %IF/%DO/%END block without running it inside a %MACRO. And it works ?! Did they change something so that you can use macro DO/END blocks in programs without wrapping in a macro?
Cheers,
JB
%IF-%THEN-%ELSE in "open code" was added a few years ago, but it has some limitations: it needs %DO-%END even for a single statement, and it cannot be nested (no %IF inside another %IF).
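A minimal sketch of the open-code form, outside any %MACRO definition. The use of `sashelp.class` is only an illustration, and to my knowledge the feature requires SAS 9.4M5 or later.

```sas
/* Open-code %IF: no surrounding %MACRO ... %MEND is needed.    */
/* Both branches must be wrapped in %DO ... %END.               */
%if %sysfunc(exist(sashelp.class)) %then %do;
  %put NOTE: sashelp.class exists.;
%end;
%else %do;
  %put NOTE: sashelp.class was not found.;
%end;
```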
Doc: https://go.documentation.sas.com/doc/en/pgmmvacdc/9.4/mcrolref/n18fij8dqsue9pn1lp8436e5mvb7.htm
Blog about it from 2018:
https://blogs.sas.com/content/sasdummy/2018/07/05/if-then-else-sas-programs/
Bart
Thank you @yabwon , appreciate the links and info