Programming the statistical procedures from SAS

Kappa statistics for inter-rater reliability

Regular Contributor
Posts: 173



I am trying to obtain a kappa statistic to test inter-rater reliability in my data.

There are 25 records, and the two raters agree on 22 of them (about 88%).

But when I use PROC FREQ as below, the output I get is very difficult to interpret, and I am not sure why the kappa is such a tiny value.

Can anyone explain this? Any suggestion or advice would be greatly appreciated.


SAS code:

proc freq data=audio.audiokappa;
   tables jcomrisk * acomrisk / agree;
   test kappa;
run;


Table of JComRisk by AComRisk

Statistics for Table of JComRisk by AComRisk

McNemar's Test
    Statistic (S)            0.3333
    Pr > S                   0.5637

Simple Kappa Coefficient
    Kappa                   -0.0563
    ASE                      0.0404
    95% Lower Conf Limit    -0.1355
    95% Upper Conf Limit     0.0228

Test of H0: Kappa = 0
    ASE under H0             0.1872
    One-sided Pr < Z         0.3817
    Two-sided Pr > |Z|       0.7634

Sample Size = 25

Respected Advisor
Posts: 4,606

Re: Kappa statistics for inter-rater reliability

Well, imagine the following experiment. You have a loaded coin that is engineered to fall on HEAD almost all the time, in fact, it falls on TAIL only 6% of the time. You ask two raters to toss the coin in turn 25 times and you note the results. The first rater, Jcom, gets only one TAIL and the second rater, Acom, gets two TAILs, but not on the same tosses as Jcom. The fact that they agree on 22 of the tosses is only due to chance. That's exactly your result table. The fact that those results could be obtained by chance only is what the Kappa statistic is telling you here.
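To make that concrete, here is a small Python sketch computing Cohen's kappa from first principles. The 2-by-2 cell counts (a=0, b=1, c=2, d=22) are not in the original posting; they are back-solved from the agreement proportions quoted later in this thread, so treat them as an assumption:

```python
# Cohen's kappa for a 2x2 agreement table, computed from first principles.
# Cell layout: a = both raters positive, d = both raters negative,
#              b and c = the two kinds of disagreement.

def cohens_kappa(a, b, c, d):
    n = a + b + c + d
    po = (a + d) / n                                      # observed agreement
    # chance-expected agreement from each rater's marginal prevalences
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (po - pe) / (1 - pe)

# Counts consistent with the statistics quoted in this thread (assumed)
a, b, c, d = 0, 1, 2, 22
print(round((a + d) / 25, 2))                 # observed agreement -> 0.88
print(round(cohens_kappa(a, b, c, d), 4))     # kappa -> -0.0563
```

Even with 88% raw agreement, kappa sits at essentially zero, because the chance-expected agreement for these skewed marginals is already 0.8864.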


Grand Advisor
Posts: 16,925

Re: Kappa statistics for inter-rater reliability

from trusty google:

Returning to our original example on chest findings in pneumonia, the agreement on the presence of tactile fremitus was high (85%), but the kappa of 0.01 would seem to indicate that this agreement is really very poor. The reason for the discrepancy between the unadjusted level of agreement and kappa is that tactile fremitus is such a rare finding, illustrating that kappa may not be reliable for rare observations. Kappa is affected by prevalence of the finding under consideration much like predictive values are affected by the prevalence of the disease under consideration.5 For rare findings, very low values of kappa may not necessarily reflect low rates of overall agreement.
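The prevalence effect described in that quote is easy to demonstrate numerically. Below is a Python sketch comparing two invented 2x2 tables (both are illustrative assumptions): each has the same 22/25 = 88% observed agreement, but one is heavily skewed toward negatives while the other is balanced, and kappa differs dramatically:

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa from the four cells of a 2x2 agreement table."""
    n = a + b + c + d
    po = (a + d) / n                                      # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2   # chance agreement
    return (po - pe) / (1 - pe)

# Same 22/25 = 88% raw agreement in both tables (hypothetical counts):
skewed   = (0, 1, 2, 22)   # almost everything rated negative
balanced = (11, 1, 2, 11)  # positives and negatives about equally common

print(round(cohens_kappa(*skewed), 3))    # -> -0.056  (near zero)
print(round(cohens_kappa(*balanced), 3))  # -> 0.76
```

Identical raw agreement, wildly different kappa: for the skewed table, chance alone already predicts almost all of the agreement.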

Regular Contributor
Posts: 173

Re: Kappa statistics for inter-rater reliability

Thanks for the clarification, Reeza! The article is useful too; I had referred to it earlier.

Regular Contributor
Posts: 152

Re: Kappa statistics for inter-rater reliability

I agree with Reeza.  See the following paper from the 2009 SAS User Group Proceedings:

Regular Contributor
Posts: 173

Re: Kappa statistics for inter-rater reliability

Thanks 1zmm for the article; it's indeed helpful. As explained in the paper, I also calculated other parameters such as the prevalence index, the bias index, and the PABAK statistic. In my case, however, the prevalence index (-0.88) and the bias index (-0.04) come out negative, while the PABAK value looks alright (0.76). Please see below all the parameters I calculated.

Any advice on how to interpret these negative values? Is it alright to get negative values in this calculation? I am fairly sure I calculated them correctly.

(The question that I am testing here is "Was risk communication between the physician and the patient present?" In the original example, 1 = Yes and 2 = No.)

Below are other parameters:

  • expected proportion of agreement = 0.8864
  • proportion of positive agreement = 0
  • proportion of negative agreement = 0.94
  • prevalence index = -0.88
  • bias index = -0.04
  • PABAK = 0.76

Any advice on what values would be ideal to be used in the paper and how to interpret the negative values would be appreciated!

Thanks much!


Regular Contributor
Posts: 152

Re: Kappa statistics for inter-rater reliability

In a 2-by-2 table where

    a = the number where observer #1 records a positive value and observer #2 records a positive value;
    b = the number where observer #1 records a positive value and observer #2 records a negative value;
    c = the number where observer #1 records a negative value and observer #2 records a positive value;
    d = the number where both observers #1 and #2 record a negative value; and
    N = a + b + c + d,

then

    prevalence index = (a - d)/N,
    bias index = (b - c)/N,
    proportion of agreement = (a + d)/N, and
    PABAK = 2*(a + d)/N - 1.
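As a check, these formulas reproduce the values posted above. A small Python sketch (the cell counts a=0, b=1, c=2, d=22 are back-solved from those posted values, so treat them as an assumption):

```python
# Prevalence index, bias index, and PABAK for a 2x2 agreement table.
# a = both positive, d = both negative, b and c = the disagreement cells.
a, b, c, d = 0, 1, 2, 22   # counts inferred from the thread, not confirmed
n = a + b + c + d

prevalence_index = (a - d) / n     # how far positives are from 50% prevalence
bias_index       = (b - c) / n     # how much the raters' marginals differ
po               = (a + d) / n     # raw proportion of agreement
pabak            = 2 * po - 1      # prevalence- and bias-adjusted kappa

print(prevalence_index, bias_index, round(pabak, 2))  # -> -0.88 -0.04 0.76
```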

The prevalence index measures how far the proportion of positive results departs from 0.50. The large negative prevalence index for your data implies that this proportion differs substantially from 0.50; specifically, d is much larger than a.

The bias index describes how much the two observers differ on the proportion of positive results.  The bias index for your data, -0.04, is close to zero, indicating that the two observers do NOT differ very much on the proportion of positive results.

The large negative value of the prevalence index implies that your observers rated a large proportion of the results as negative, and the large proportion of agreement on these negative results, 0.94, indicates that both observers agreed on them. However, the proportion of agreement on the positive results is 0.00: the observers did NOT agree at all on the positives. Therefore, the original kappa statistic probably summarizes your results best, by averaging the agreement on positive and negative results while correcting for chance agreement: it indicates no agreement between the observers beyond chance. The prevalence index shows why this is so: you don't have enough situations with positive responses on which both raters can provide a rating. Enrich your sample with situations that increase the proportion of positive responses.

Finally, I don't think the PABAK statistic is informative here because it adjusts both for the bias and the prevalence without indicating how to remedy the problem.  The PABAK statistic is a function of the percentage of observed agreement and does not provide any further information than that.

To answer your question: risk communication between the physician and the patient was NOT present in significantly more cases than one would expect by chance. Further studies would require more situations in which positive responses by the raters could be evaluated.

Regular Contributor
Posts: 173

Re: Kappa statistics for inter-rater reliability

Thanks so much for explaining and helping me with the interpretation! Indeed helpful.

However, I would like to know what difference it makes whether the 2 raters agree on positive results or on negative results.

I mean, I understand that my sample does not contain enough records with positive responses; but how will this particular kappa result affect the further analyses? In my further analyses, I will be comparing the mean decrease in lipids from baseline to year 1 between the cases who received risk communication and those who didn't (we expect the risk-communication cohort to show a significant decrease). So how would this analysis be impacted, given that there is a large proportion of agreement on the negative results (0.94) and NO agreement on the positive results?

I performed several t-tests, and they show a good amount of mean decrease in different kinds of lipids in the risk-communication cohort compared to the no-risk-communication cohort, but none of them yielded a significant p-value. So does this mean that the kappa result is indirectly telling us that we have a small risk-communication cohort compared to the no-risk-communication cohort?

And also, in my case kappa does show higher agreement on the negative responses; so how can we say that whatever agreement the raters show on the negative results is due to chance ONLY? There is a good amount of agreement on the negative results, so the p-value should still come out significant. Right?

So, if I understand the kappa statistic correctly, the kappa value is directly proportional to the number of positive responses; meaning a higher kappa implies more positive responses than negative responses and a greater proportion of agreement between the 2 raters on positive results than on negative results. Is my understanding correct?

I think I am not able to understand why kappa is concerned ONLY with positive responses, if its real use is to test inter-rater reliability/agreement, and the agreement can be on either positive or negative responses.

Sorry if I've confused you here, but I would definitely appreciate it if you could explain in more detail.

Thanks again!


Respected Advisor
Posts: 4,606

Re: Kappa statistics for inter-rater reliability

The value of Kappa has nothing to do with the names of your categories. Exchange positives with negatives and you will get the same Kappa value. The Kappa statistic simply compares your scores with the scores expected from random rating (the null hypothesis), conditional on the average prevalence. If two raters were told to choose independently about 6% positives at random within your set of 25 cases, they would arrive at a score similar to yours. Agreement would require that they identify at least sometimes the same positives.

Compare the following tables
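That symmetry is easy to verify numerically. A short Python sketch (using the counts a=0, b=1, c=2, d=22 implied earlier in the thread, which are an assumption) swaps the roles of "positive" and "negative" and recomputes kappa:

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa from a 2x2 agreement table (a=both positive, d=both negative)."""
    n = a + b + c + d
    po = (a + d) / n                                      # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2   # chance agreement
    return (po - pe) / (1 - pe)

original  = cohens_kappa(0, 1, 2, 22)   # positives rare
relabeled = cohens_kappa(22, 2, 1, 0)   # swap labels: a <-> d, b <-> c
print(original == relabeled)            # -> True
```

Relabeling the categories swaps a with d and b with c, which leaves both the observed and the chance-expected agreement unchanged, so kappa is identical.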


