Bug? Proc Univariate QQplot with Class

JB1_DK · Posted 02-08-2021 06:34 AM

I have data of spreads in two different Rating classes (AA and AAA). I want to test the spreads for normality in each Rating class. I therefore produce this panel using Proc Univariate:

The problem is that it seems that the line through the qq-plot is the same for all ratings, disregarding the class.

Code used:

proc univariate data=testData;
var Spread;
Class Rating;
qqplot Spread/square normal(MU=EST SIGMA=EST );

run;

If I use BY instead of Class, I get individually fitted qq-plots. But that requires sorting and does not provide a panel layout. It should work for Class as well, and display properly in the panel layout.

Just for reference, here is the BY output for the AAA rating, clearly not matching the AAA output in the panel layout:

RichardDeVen · Posted 02-08-2021 11:16 AM

Can you add the 'testData' to the post ?

JB1_DK · Posted 02-09-2021 04:26 AM

Yes, these are the observations:

data testdata;
input rating :$3. Spread :16.8;
datalines;
AA -16.233096
AA -12.366438
AA -6.320539
AA -3.732885
AA 7.516216
AA -4.689121
AA -14.602099
AA 2.9222857069
AA 3.1682522766
AA -14.341467
AA -10.905014
AA 10.171641418
AA -9.329653
AAA -28.50975333
AAA -28.534444
AAA -28.932457
AAA -28.760108
AAA -28.667521
AAA -28.935067
AAA -29.9774
AAA -28.891263
AAA -28.324166
AAA -26.943601
AAA -3.785452
AAA -26.157575
AAA -19.440571
AAA -17.983667
AAA -17.165252
AAA -6.671511
AAA -7.015316
AAA -8.647729
AAA 8.857968
AAA -13.716122
AAA -6.366132
AAA -6.11798
AAA -15.030489
run;

Ksharp · Posted 02-09-2021 08:04 AM

No, I don't think it is a bug. Both plots have the same scale and share the same slope/line.

You can see this if you use the OVERLAY option:

proc univariate data=testData;
var Spread;
Class Rating;
qqplot Spread/overlay square normal(MU=EST SIGMA=EST );

run;

I think there must be some option to adjust this line.

@Rick_SAS may know.

JB1_DK · Posted 02-09-2021 10:50 AM

"Both plot have same scale and share the same slope/line"

That is the exact problem: they should not have the same line. They are two separate samples, separated by Rating as per the CLASS statement. BY processing does the correct thing, and CLASS should be able to do the same. I really hope there is some syntax I have not come across that can solve this. OVERLAY makes no difference.

RichardDeVen · Posted 02-09-2021 11:45 AM

Yes, I would report this as a bug. The normal plot shown in the CLASS based graphs is plotting the Normal Line from the first panel (first CLASS value of AA) in all the panels. Other than that, the BY and CLASS based analysis results are identical.

BuckyRansdell · Posted 02-09-2021 03:36 PM

This is a known issue. In a comparative Q-Q plot (requested with a CLASS statement) the quantiles of the plotted points and the reference line in each cell are computed using parameters of the distribution fitted to the data in the key cell, which by default is the cell in row 1 and column 1. When you're using a normal distribution the quantiles are unaffected by this, but that is not the case for distributions with shape parameters.
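One way to see the key cell at work is the KEYLEVEL= option of the CLASS statement, which selects the key cell. The following is only a minimal sketch, assuming the testdata data set posted earlier in the thread has been created. If the explanation above holds, the reference line in both cells should then follow the AAA fit rather than the AA fit.

```sas
/* Sketch: make AAA the key cell instead of the default first cell (AA). */
/* Assumes work.testdata from the data step posted earlier in this thread. */
proc univariate data=testdata;
  class Rating / keylevel='AAA';
  var Spread;
  qqplot Spread / square normal(mu=est sigma=est);
run;
```

This does not fit each cell separately. It only changes which cell supplies the parameters, which makes the behavior easier to confirm.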

Currently you do need to use a BY statement to produce independently-fitted Q-Q plots.
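For the toy data in this thread, the BY workaround could look like the sketch below. The output data set name testdata_sorted is a hypothetical name chosen for illustration, and the code assumes the testdata data set posted earlier has been created.

```sas
/* BY-group processing needs the data sorted by the BY variable. */
/* testdata_sorted is a hypothetical name used only in this sketch. */
proc sort data=testdata out=testdata_sorted;
  by Rating;
run;

/* Each BY group gets its own fitted line, one plot per group. */
proc univariate data=testdata_sorted;
  by Rating;
  var Spread;
  qqplot Spread / square normal(mu=est sigma=est);
run;
```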

JB1_DK · Posted 02-10-2021 04:42 AM

Hi Bucky,

Thanks for acknowledging that there is a bug.

The BY workaround does not offer what the panel layout can provide. The above was just a toy example. If I want to inspect 20+ groups of data for normality, the panel plots would tell me at a glance which ones to focus on and what makes them fail the normality test (outliers, tails etc.). I can code something myself, but this would be very convenient to have Univariate provide.

RichardDeVen · Posted 02-10-2021 04:46 PM

Until the resolving hot fix is issued, you can SGPANEL your own QQ plot by computing the coordinates to be plotted.

Example

Proc RANK - Compute BLOM normal quantiles
Proc MEANS - Compute mean and std
DATA Step - Use QUANTILE, std, and mean to compute normal reference line end points at the .25 and .75 quantiles
- Needs some more thinking on how to extend the line to the plot edges
DATA Step - Merge quantiles with reference line data
SGPANEL - Output 'tight' QQ plots for quick review

Consider COVID-19 testing data available from the New York State Department of Health. The number of tests performed is plotted by county. The data is reduced to every 31st day within each county so that the demonstration runs faster on a smaller data set, and it only deals with counties whose names start with A or B (again, less data === faster output).

Fetch the data

* https://data.ny.gov/browse?tags=covid-19;
* New York State Statewide COVID-19 Testing;

filename testing temp;
filename headers temp;

%if not %sysfunc(exist(work.testing,data)) %then %do;
  proc http 
    url = 'https://health.data.ny.gov/api/views/xdss-u53e/rows.csv?accessType=DOWNLOAD&api_foundry=true'
    method = "get"
    out = testing
    headerout = headers
  ;
  run;

  proc import datafile=testing dbms=csv replace out=work.testing ;
    guessingrows=all;
  run;
%end;

data testing_31;
  set testing;
  by county;
  if first.county then seq=1; else seq+1;
  if mod(seq,31) = 0;
run;

Plots using only UNIVARIATE and BY, which gives only one plot per BY value:

proc univariate noprint data=testing_31;
  var Total_Number_of_Tests_Performed;
  *Class county;
  by county;
  qqplot Total_Number_of_Tests_Performed / square normal(MU=EST SIGMA=EST);
  where county < 'C';
  output out=unistats mean=mean std=std;
run;

Compute the coordinates for the per county QQ plots and plot them with SGPANEL and SCATTER and SERIES statements.

proc rank 
  data=testing_31
  normal=BLOM
  out=qq(keep=county Total_Number_of_Tests_Performed nq)
;
  by county;
  var Total_Number_of_Tests_Performed ;
  ranks nq;
run;

proc means nway noprint data=testing_31;
  class county;
  var Total_Number_of_Tests_Performed ;
  output out=line mean=mean std=std;
run;

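data refline;
  set line;

  * Standard normal quantiles at the first and third quartiles;
  xn25 = quantile('normal', 0.25);
  xn75 = quantile('normal', 0.75);

  * Matching points on the fitted normal line y = mean + std * x;
  yn25 = xn25 * std + mean;
  yn75 = xn75 * std + mean;

  * One output row per end point of the reference line;
  x = xn25; y = yn25; output;
  x = xn75; y = yn75; output;

  keep county x y;
run;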

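data plot;
  merge qq refline;
  by county;
  keep county nq Total_Number_of_Tests_Performed x y ;

  * refline has only two rows per county and MERGE retains the last x and y, so clear them after row 2;
  if first.county then seq=1; else seq+1;
  if seq > 2 then call missing (x, y);
run;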

proc sgpanel data=plot;
  panelby county / columns=3 rows=3;
  scatter x=nq y=Total_Number_of_Tests_Performed;
  series  x=x y=y;

  where county < 'C';
run;

SGPANEL output

JB1_DK · Posted 02-11-2021 04:20 AM

Thanks @RichardDeVen , that's a great approach for this.

Proc Rank is a real power tool. And I like how your SGPANEL produces more of a grid instead of the vectorized layout from Univariate.

Cheers,

Jørgen

RichardDeVen · Posted 02-11-2021 12:29 PM

The grid can be a problem when you have a large number of groups with a wide variation in the group-wise values being QQ'd. SGPANEL produces images that have uniform axes, so some groups can appear so squashed or stretched that there is really no discernible information. Contrast that with the UNIVARIATE BY/CLASS plots, which use the full plot area for their graph output.

Another approach, which I have not explored yet, is presenting the UNIVARIATE plots (only the plots) in an ODS lattice LAYOUT which would thus grid-ify full plot area graphs.
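A rough sketch of that idea for the two ratings in the toy data is shown below. It is untested in this thread and rests on assumptions: the testdata data set from earlier has been created, the open destination supports ODS LAYOUT GRIDDED (for example HTML or PDF), and the graph size is an arbitrary choice.

```sas
/* Sketch: two independently fitted Q-Q plots side by side. */
/* Assumes work.testdata from earlier in the thread and a destination
   that supports ODS LAYOUT GRIDDED (for example HTML or PDF). */
ods graphics / width=4in height=4in;   /* arbitrary size for the sketch */
ods layout gridded columns=2;

ods region;
proc univariate data=testdata noprint;
  where Rating = 'AA';
  var Spread;
  qqplot Spread / normal(mu=est sigma=est);
run;

ods region;
proc univariate data=testdata noprint;
  where Rating = 'AAA';
  var Spread;
  qqplot Spread / normal(mu=est sigma=est);
run;

ods layout end;
ods graphics / reset;
```

Each region holds a separately fitted plot with its own axes. For 20+ groups the repeated steps would need to be generated, for example with a macro loop.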

JB1_DK · Posted 05-01-2024 03:08 AM

@RichardDeVen Hello, sorry for the ping but I just took a look at the code for fetching the testing data. I noticed that you used a %IF/%DO/%END block without running it inside a %MACRO. And it works ?! Did they change something so that you can use macro DO/END blocks in programs without wrapping in a macro?

Cheers,

JB

yabwon · Posted 05-01-2024 06:45 AM

%IF-%THEN-%ELSE in "open code" was added a few years ago, but it has some limitations: it needs %DO-%END even for a single statement, and it cannot be nested (an %IF inside another %IF).

Doc: https://go.documentation.sas.com/doc/en/pgmmvacdc/9.4/mcrolref/n18fij8dqsue9pn1lp8436e5mvb7.htm

Blog about it from 2018:

https://blogs.sas.com/content/sasdummy/2018/07/05/if-then-else-sas-programs/
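A minimal sketch of the open-code form, assuming SAS 9.4M5 or later. SASHELP.CLASS is used only as a data set that normally exists, and the messages are illustrative.

```sas
/* Open-code %IF: no surrounding %MACRO definition. */
/* Each branch must be wrapped in %DO ... %END, even for one statement. */
%if %sysfunc(exist(sashelp.class)) %then %do;
  %put NOTE: SASHELP.CLASS exists.;
%end;
%else %do;
  %put NOTE: SASHELP.CLASS was not found.;
%end;
```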

Bart


JB1_DK · Posted 05-01-2024 08:02 AM

Thank you @yabwon , appreciate the links and info
