BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Cuneyt
Obsidian | Level 7

Hi All,

 

I am running the following code to compare two empirical distributions.

 

proc npar1way edf;
   class source;
   var naics;
   freq emp;
run;

I am getting p-values for Kolmogorov-Smirnov Two-Sample Test and Kuiper Two-Sample Test, but NOT for Cramer-von Mises test. The output includes Cramer-von Mises statistics only (see below table). How can I get the p-value for these statistics?

 

 

Cramer-von Mises Statistics
(Asymptotic)
CM 0.000607 CMa 704.035210

 

1 ACCEPTED SOLUTION

Accepted Solutions
Rick_SAS
SAS Super FREQ

 can confirm from the PROC NPAR1WAY documentation

https://go.documentation.sas.com/doc/en/statcdc/14.2/statug/statug_npar1way_details32.htm

that the procedure computes the statistics, but not the p-value.

 

Like you, I don't know why this p-value is not computed. Perhaps it is very expensive to estimate. I suggest you use the K-S test or the Kuiper test if you need to test the null hypothesis.

View solution in original post

8 REPLIES 8
ballardw
Super User

It may help to provide both the entire log, to see if the procedure reported any issues, and all of the output, to see if something tells us there may be an issue with the data itself.

 

Cuneyt
Obsidian | Level 7

I am attaching the SAS data set. The code, log, and output are shown below:

proc npar1way edf data=kl plots=none;
   class source;
   var naics;
   freq est_emp_acc_N;
run;
68   proc npar1way edf data=kl plots=none;
69      class source;
70      var naics;
71      freq est_emp_acc_N;
72   run;

NOTE: PROCEDURE NPAR1WAY used (Total process time):
      real time           0.04 seconds
      cpu time            0.03 seconds
                                         The SAS System         18:02 Friday, August 25, 2023   1

                                      The NPAR1WAY Procedure

                           Kolmogorov-Smirnov Test for Variable naics
                                  Classified by Variable source

                                               EDF at    Deviation from Mean
                     source           N       Maximum        at Maximum

                     1            42522      0.577936         58.387741
                     2          1117060      0.284009        -11.391745
                     Total      1159582      0.294787

                          Maximum Deviation Occurred at Observation 19
                                Value of naics at Maximum = 33.0

                          Kolmogorov-Smirnov Two-Sample Test (Asymptotic)

                               KS   0.055244     D         0.293927
                               KSa  59.488655    Pr > KSa  <.0001


                             Cramer-von Mises Test for Variable naics
                                  Classified by Variable source

                                                     Summed Deviation
                             source            N         from Mean

                             1             42522        678.218161
                             2           1117060         25.817049

                             Cramer-von Mises Statistics (Asymptotic)

                                 CM  0.000607    CMa  704.035210


                                  Kuiper Test for Variable naics
                                  Classified by Variable source

                                                        Deviation
                                 source           N     from Mean

                                 1            42522      0.293927
                                 2          1117060      0.087944

                               Kuiper Two-Sample Test (Asymptotic)

                         K  0.381872    Ka  77.287947    Pr > Ka  <.0001


Rick_SAS
SAS Super FREQ

 can confirm from the PROC NPAR1WAY documentation

https://go.documentation.sas.com/doc/en/statcdc/14.2/statug/statug_npar1way_details32.htm

that the procedure computes the statistics, but not the p-value.

 

Like you, I don't know why this p-value is not computed. Perhaps it is very expensive to estimate. I suggest you use the K-S test or the Kuiper test if you need to test the null hypothesis.

Cuneyt
Obsidian | Level 7
Thanks, Rick!
FreelanceReinh
Jade | Level 19

Hello @Cuneyt,

 

I was curious about the missing p-value for the Cramér-von Mises two-sample test and investigated the issue.

 

Apparently, PROC NPAR1WAY does not compute the p-value and the idea is that the user should compare the test statistic, which the procedure does provide, to the appropriate critical value (for the chosen significance level) found in statistical tables.

 

Textbooks on nonparametric statistics like [1] refer to exact quantiles for small samples (such as n+m<=17;  n and m being the two sample sizes) published in the 1960s and they report quantiles of the asymptotic distribution of the test statistic (the CMa value in the PROC NPAR1WAY output), see details below (and [2]).

 

All of these calculations (done under the null hypothesis) assume continuous distributions, in particular the absence of ties. So, real-world data with ties, such as your heavily tied NAICS values, were maybe one of the problems the PROC NPAR1WAY developers were facing with regard to reliable p-value computations.

 

Otherwise, the small-sample calculations are feasible in a DATA step: Using the CALL LEXPERM routine in a loop over comb(m+n, m) permutations I was able to reproduce all numbers I picked from tables 1 - 5 (pp. 1154-1158) in [3]. There is a link to this paper (open access PDF version) in [4] (reference 3). Given that comb(30, 15)=155,117,520, an extension of the 1960s results to small samples with m+n up to around 30 appears feasible. So, exact p-values can be obtained from such calculations for those small sample sizes (assuming no ties).

 

The literature mentions that the large-sample quantiles are quite accurate even for small samples. For your example with m+n=1,159,582 they should be fine, aside from the ties issue. With your CMa value >700 compared to the 99.9% quantile of 1.168 ([1], p. 464) I guess that the p-value in your case would be extremely close to zero.

 

The asymptotic distribution of the CMa test statistic can be found in [2]. A link to a PDF version of this paper is available via Google Scholar. The relevant formula (4.35) for the asymptotic distribution function a1 on p. 202 of that article involves a series whose terms include varying values of the modified Bessel function K¼. In Base SAS there are two types of Bessel functions available, IBESSEL and JBESSEL, but not "KBESSEL." The "KBESSEL" function can be derived from the IBESSEL function (see [5], p. 375, formula 9.6.2, available online via a link in reference no. 23 of [6]). However, one of the IBESSEL functions in that formula uses a negative first argument (−1/4), which the SAS implementation of the function does not allow.

 

In my (finally successful) attempt to compute a1(z) for some values of z (confirming the results in [2], p. 203, table 1) I used the series expansion ([5], p. 375, formula 9.6.10) of the IBESSEL function to calculate that "IBESSEL(−1/4,...)" value. While the computation of the latter series worked quite well, the outer series turned out to be numerically challenging. "The convergence of this series is very rapid", as Anderson and Darling mention ([2], p. 202). Indeed, for the values I checked, a DO loop do j=0 to 1 (!) was basically sufficient to compute the series from j=0 to infinity, but increasing the end value of the loop to values >=4 would easily entail grossly incorrect results, sometimes without any hints (such as "invalid argument" notes) in the log.

 

I suppose that mainly computational difficulties like these led to the decision not to implement the calculation of p-values for the Cramér-von Mises two-sample test in PROC NPAR1WAY.

 

References:

[1] Conover, W. J. (1999). Practical Nonparametric Statistics. 3rd ed. New York: John Wiley & Sons.

[2] Anderson, T.W., Darling, D.A. (1952). Asymptotic theory of certain 'goodness of fit' criteria based on stochastic processes. The Annals of Mathematical Statistics, 23, 193-212. 

[3] Anderson, T.W. (1962). On the distribution of the two-sample Cramer-von Mises criterion. The Annals of Mathematical Statistics, 33, 1148-1159.

[4] Wikipedia article Cramér–von Mises criterion.

[5] Abramowitz, M., Stegun, I.A., eds. (1964/1983). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Washington: National Bureau of Standards; New York: Dover Publications (reprint).

[6] Wikipedia article Bessel function.

Cuneyt
Obsidian | Level 7

Hello @FreelanceReinh ,

 

Thank you so much for your detailed response.

 

To your first point: I looked at several papers from the 1960's for p-value tables, but I wasn't quite comfortable using any of these tables as I am not very familiar with nonparametric statistics. I did not want to invest much time into carefully reading these papers to understand exactly what they were doing.

 

To your second point about ties: You mention that there are many ties in my data set because both samples have the same NAICS codes. If my understanding is correct, a tie refers to the y-values. In my case, NAICS codes are my x-values (categories) and the frequencies are my y-values. I used Example 8.2 in SAS documentation on PROC NPAR1WAY, where the responses (1 through 5, categorical) are x-values, and their frequencies are y-values. PROC NPAR1WAY compares two samples (active vs placebo).

SAS Help Center: Example 85.1 Two-Sample Location Tests and Plots

SAS Help Center: Example 85.2 EDF Statistics and EDF Plot

From this point of view, I should not have any ties in my data set. Am I missing something?

 

To your last point about the computational difficulties: I ended up using an R function that implements K-S, C-vM, and Anderson-Darling tests and produces (if I understand correctly) exact p-values.

6.2 Comparison of two distributions | Notes for Nonparametric Statistics (bookdown.org)

The numerical values from SAS and R do not agree. Could I be using the wrong R function?

 

Any insights would be greatly appreciated.

 

Sincerely,

Cuneyt

 

FreelanceReinh
Jade | Level 19

You're welcome.

 

So, the NAICS values are codes. This is an important information. As far as I see, these codes are on a nominal scale or would you say that "Agriculture, ..." (code 11) is "less than" "Mining, ..." (code 21), less than "Utilities" (code 22), etc.? The Cramér-von Mises two-sample test, however, requires that the measurement scale is at least ordinal. And the same is true for all the other EDF tests because the concept of an empirical distribution function requires ordered values. Hence, these tests are inappropriate for your data. You may want to use a chi-square test (PROC FREQ) instead.

 

The response values in the arthritis data from the documentation example you mention are on an ordinal scale -- poor (1) < fair (2) < moderate (3) < good (4) < excellent (5) -- and this is essential for the tests for difference in location as well as the EDF tests. Their results would change dramatically (and thus get totally wrong) if the numeric codes 1, 2, 3, 4, 5 were arbitrarily permuted (e.g., poor=3, fair=1, moderate=5, ...).

 

The FREQ statement of PROC NPAR1WAY is just a convenient feature to allow for frequency data: The arthritis dataset from the documentation could be expanded from 10 to 59 observations by creating one observation per patient. Then the FREQ statement could be omitted or used with a constant variable equal to 1 for all observations. Either way, the results would be exactly the same as for the original dataset.

 

I have never used R, so cannot comment on R functions, except that I'm not surprised that the results from SAS and R do not perfectly agree in some cases. The material in the linked bookdown.org chapter is not applicable to your nominal (frequency) data anyway, as explained above.

Cuneyt
Obsidian | Level 7

Thank you very much for response. I think you hit the nail right on the head. My scale is nominal and EDF tests are, therefore, not appropriate. As you suggested, I will use the chi-square test. Again, thanks for your insights.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 8 replies
  • 1327 views
  • 8 likes
  • 4 in conversation