Thursday, March 28, 2013

Inter-rater Reliability, Cohen's Kappa and Surprises with R


This is part three in adventures with statsmodels.stats, after power and multicomparison.

This time it is about Cohen's Kappa, a measure of inter-rater agreement or reliability. Suppose we have two raters that each assigns the same subjects or objects to one of a fixed number of categories. The question then is: How well do the raters agree in their assignments? Kappa provides a measure of association, the largest possible value is one, the smallest is minus 1, and it has a corresponding statistical test for the hypothesis that agreement is only by chance. Cohen's kappa and similar measures have widespread use, among other fields, in medical or biostatistic. In one class of applications, raters are doctors, subjects are patients and the categories are medical diagnosis. Cohen's Kappa provides a measure how well the two sets of diagnoses agree, and a hypothesis test whether the agreement is purely random.

For more background see this Wikipedia page which was the starting point for my code.

Most of the following focuses on weighted kappa, and the interpretation of different weighting schemes. In the last part, I add some comments about R, which provided me with several hours of debugging, since I'm essentially an R newbie and have not yet figured out some of it's "funny" behavior.

Cohen's Kappa

The relevant information about the data can be summarized by a square contingency table, where the entries are the counts of observations for pairs of category assignments. For example, a are the number of observations where both raters assigned category one, e is the number of observations where rater 1 assigned category two and rater 2 assigned category one.

Rater   Rater 2
  cat cat1 cat2 cat3 cat4
Rater 1 cat1 a b c d
cat2 e f g h
cat3 i j k l
cat4 m n o p

To have an illustrative example and code to work with, I import the relevant function and define a table as a numpy array.

import numpy as np
from statsmodels.stats.inter_rater import cohens_kappa, to_table

table = np.array([[ 2.,  6.,  3.,  0.],
                  [ 5.,  4.,  2.,  2.],
                  [ 5.,  2.,  6.,  0.],
                  [ 2.,  2.,  3.,  7.]])

Calling the function from statsmodels and printing the results, we see that kappa for this table is 0.1661, and that we cannot reject at the 5% level the Null Hypothesis that kappa is zero.

>>> print cohens_kappa(table)
              Simple Kappa Coefficient
          Kappa                     0.1661
          ASE                       0.0901
          95% Lower Conf Limit      -0.0105
          95% Upper Conf Limit      0.3426

             Test of H0: Simple Kappa = 0

          ASE under H0              0.0798
          Z                         2.0806
          One-sided Pr >  Z         0.0187
          Two-sided Pr > |Z|        0.0375

cohens_kappa returns a results instance that provides additional information.

>>> res = cohens_kappa(table)
>>> type(res)
<class 'statsmodels.stats.inter_rater.KappaResults'>
>>> res.keys()
['distribution_kappa', 'pvalue_one_sided', 'kappa_low', 'kappa_max', 'std_kappa',
'z_value', 'alpha', 'pvalue_two_sided', 'var_kappa', 'kappa_upp', 'kind', 'kappa',
'std_kappa0', 'alpha_ci', 'weights', 'distribution_zero_null', 'var_kappa0']

Cohen's kappa takes the column and row total counts as given when the "pure chance" agreement is calculated. This has two consequences: While kappa is always smaller than or equal to one (as is the correlation coefficient), the largest kappa that can be achieved for given marginal counts, can be strictly smaller than one. cohens_kappa calculates the largest simple Kappa that can be obtained for the given margin totals, available as kappa_max

>>> res.kappa_max

The second consequence is that kappa values in unequally distributed tables might be surprising

>>> t3 = np.array([[98, 2], [2,0]])
>>> t3
array([[98,  2],
       [ 2,  0]])
>>> res3 = cohens_kappa(t3)
>>> res3.kappa

Kappa is negative, raters agree in 98 out of 100 cases, but disagree on the other two. Conditioning on the margin totals to define chance agreement, works as if raters have been told that there are 98 cases of category one and 2 cases of category two. Raters in this example disagree which are the two cases in category two.

An extension of the simple kappa allows for different weights when aggregating the agreement counts from the contingency table. Which weights are appropriate for a given dataset depends on the interpretation of the categories. Kappa is a measure of association between two random variables (the two ratings). If the categories were numeric, then we could also use the correlation coefficient. Weighted kappa allows us to measure agreement for different types of data, such as categorical or ordered categories. I am going through some examples for different interpretations of the categories, and show how the corresponding weights can be defined for cohens_kappa.

Case 1: Categorical Data

Suppose our categories are "clothing", "housing", "food" and "transportation", and raters have to assign each unit to one of the categories. In this case, the categories are purely categorical, there is no difference in closeness between any pairs. Agreement means that both raters assign the same category, if one rater assigns, for example, "housing" and the other rater assigns "food", then they don't "almost" agree, they just disagree.

The count of agreement cases only counts those on the main diagonal, a+f+k+p, off-diagonal counts are ignored. The corresponding Cohen's kappa is then called simple or unweighted.

Case 2: Equally Spaced Ordered Categories

Examples for this case are simple grading scales, such as 0 to 4 stars, or letter grades A, B, C, D. The assumption is that weighting for the agreement of points only depends on the distance. One star difference is the same whether it is zero stars versus one star, or 2 stars versus 3 stars. The weight or distance only depends on how far away the point is from the main diagonal, b, g, l and e, j, o all represent one level disagreement.

The distance table in this case is

>>> d
array([[ 0.,  1.,  2.,  3.],
       [ 1.,  0.,  1.,  2.],
       [ 2.,  1.,  0.,  1.],
       [ 3.,  2.,  1.,  0.]])

The actual weighting could be linear, quadratic or some arbitrary monotonic function of the distance. The corresponding options as described in the docstring are

wt : None or string
    wt in ['linear', 'ca' or None] : use linear weights, Cicchetti-Allison
        actual weights are linear in the score "weights" difference
    wt in ['quadratic', 'fc'] : use quadratic weights, Fleiss-Cohen
        actual weights are squared in the score "weights" difference

The kappa for these two weights are

>>> cohens_kappa(table, wt='linear').kappa
>>> cohens_kappa(table, wt='quadratic').kappa

We get the same result if we use the above distance matrix directly

>>> cohens_kappa(table, weights=d).kappa
>>> cohens_kappa(table, weights=d**2).kappa

This distance based weighting scheme implies that the weight matrix has a Toeplitz structure, the value of an entry only depends on the distance to the main diagonal. To allow more flexibility in the definition of weights, I added the toeplitz option

wt : None or string
    wt = 'toeplitz' : weight matrix is constructed as a toeplitz matrix
        from the one dimensional weights.

Consider the following examples, the categories are "solid blue", "leaning blue", "leaning red" and "solid red". We would like to include a half point weight for assignments where the raters chose neighboring categories, and not include pairs of category assignments that are more than one level apart.

weights = [0, 0.5, 1, 1]

This is again defined in terms of distance. We can also get the actual weights (which are 1 - the distance matrix)

>>> res = cohens_kappa(table, weights=np.array([0, 0.5, 1, 1]), wt='toeplitz')
>>> 1 - res.weights
array([[ 1. ,  0.5,  0. ,  0. ],
       [ 0.5,  1. ,  0.5,  0. ],
       [ 0. ,  0.5,  1. ,  0.5],
       [ 0. ,  0. ,  0.5,  1. ]])

The kappa given these weights is larger than the simple kappa:

>>> res.kappa

As a check, we can also calculate the simple kappa as a special case of toeplitz

>>> cohens_kappa(table, weights=np.array([0, 1, 1, 1]), wt='toeplitz').kappa
>>> _ - cohens_kappa(table).kappa

Aside: Since R's irr package does not offer any option that corresponds to toeplitz, the unit test only look at cases where toeplitz has an equivalent representation in on of the other options.

Case 3: Unequally Spaced Ordered Categories

We can also look at cases where categories are ordered but the distance between categories is not equal. Take as example a list of categories "apples", "oranges", "cars", "minivans", "trucks". We could assign a distance scale that reflects closeness across categories, for example (1, 2, 7, 8, 10). The distance between apples and oranges is 1, the distance between oranges and cars is 5. The scale will be rescaled for the weights, so only relative distances matter( in the linear reweighting case).

We have only 4 categories in our example table, so let's assume our categories are "excellent", "very good", "very bad" and "awful", and we assign distance scores (0, 1, 9, 10). wt='linear' uses the linear Cicchetti-Allison transformation to rescale the scores. Weighted kappa and weights (rescaled distances) are

>>> res = cohens_kappa(table, weights=np.array([0, 1, 9, 10]), wt='linear')
>>> res.kappa
>>> res.weights
array([[ 0. ,  0.1,  0.9,  1. ],
       [ 0.1,  0. ,  0.8,  0.9],
       [ 0.9,  0.8,  0. ,  0.1],
       [ 1. ,  0.9,  0.1,  0. ]])

The actual weights are

>>> 1 - res.weights
array([[ 1. ,  0.9,  0.1,  0. ],
       [ 0.9,  1. ,  0.2,  0.1],
       [ 0.1,  0.2,  1. ,  0.9],
       [ 0. ,  0.1,  0.9,  1. ]])

Identical category assignments have weight 1, ("excellent", "very good") and ("very bad" and "awful") pairs have weight 0.9 close to full agreement, ("very good" and "very bad") pairs have weight 0.2 (normalized distance is 0.8), and so on.

Another usage for this is to calculate kappa for merged tables. We can define a scale that has a distance of zero between the two categories at the extremes.

>>> res = cohens_kappa(table, weights=np.array([1, 1, 2, 2]), wt='linear')
>>> res.kappa

>>> res.weights
array([[ 0.,  0.,  1.,  1.],
       [ 0.,  0.,  1.,  1.],
       [ 1.,  1.,  0.,  0.],
       [ 1.,  1.,  0.,  0.]])

>>> 1 - res.weights
array([[ 1.,  1.,  0.,  0.],
       [ 1.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  1.],
       [ 0.,  0.,  1.,  1.]])

Which gives us the same kappa as the one from the merged table where we combine the two "good" and the two "bad" categories into one level each

>>> table_merged = np.array([[table[:2, :2].sum(), table[:2, 2:].sum()],
                             [table[2:, :2].sum(), table[2:, 2:].sum()]])
>>> table_merged
array([[ 17.,   7.],
       [ 11.,  16.]])

>>> cohens_kappa(table_merged).kappa

Besides weighted kappa, cohens_kappa also calculates the standard deviations and hypothesis test for it. I also implemented Fleiss' kappa, which considers the case when there are many raters, but I only have kappa itself, no standard deviation or tests yet (mainly because the SAS manual did not have the equations for it). An additional helper function to_table can convert the original observations given by the ratings for all individuals to the contingency table as required by cohen's kappa.

I had also added more functions for inter-rater agreement or reliability to my wish list, but I am not planning to work on it anytime soon (except maybe ICC Intraclass correlation .

Note: While writing this, after not having looked at cohens_kappa for some time, I discovered that the function does not check whether the user specified weights in the toeplitz case make sense. When the full weight matrix is given, the function also just uses whatever the user defined. I think, raising ValueErrors or automatic transformations to valid weight or distance matrices in the flexible cases will be necessary. Also, I think the docstring is not clear in what is a weight matrix (with 1 on the main diagonal) and what is a distance matrix (with 0 on the main diagonal)

Development Notes and Surprises with R

I had a quiet evening, was to tired or too lazy to do any serious work, and picked something from my wishlist that looked easy to implement as a late evening activity. I started with the Wikipedia article on Cohen's kappa and implemented it very fast. Then, I went to the SAS manual to get more background, and to see what other statistical packages provide. Since SAS has a lot more than what's described on Wikipedia and it has the formulas in the documentation, it started to get serious. Understanding and working through the weighting schemes started to be a fun challenge. A few days later, I was ready to look at R, to write more unit tests than the ones based on the printouts that I found for the SAS examples.

I found the package irr in R, that has Cohens' kappa. So, I started to write my tests cases verifying against it. I have been using R for several years to write unit tests for scipy.stats and statsmodels, but never tried to learn it systematically. My knowledge of R is very eclectic. Stackoverflow is often a good source to get around some of the trickier parts, once I know what to look out for.

R factors are one dimensional, I guess

I started to look at one of the datasets that come with the irr package. I converted the dataframe with factors to numeric values, and started to compare the results with my implementation. My numbers didn't match and I spent some time looking for a bug. Finally, I realized that R converts each factor separately to numeric.

The last rater in the following example does not use one of the categories, so the mapping between string and numeric levels is not the same across raters. Individuals (rows) 1 and 4 have the same diagnosis by all raters, but the numeric value of the last rater (column) is different.

> sapply(diagnoses, class)
  rater1   rater2   rater3   rater4   rater5   rater6
"factor" "factor" "factor" "factor" "factor" "factor"
> sapply(diagnoses, mode)
   rater1    rater2    rater3    rater4    rater5    rater6
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric"

> diagnoses[1:5, ]
                   rater1                  rater2                  rater3           rater4
1             4. Neurosis             4. Neurosis             4. Neurosis      4. Neurosis
2 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder         5. Other
3 2. Personality Disorder        3. Schizophrenia        3. Schizophrenia 3. Schizophrenia
4                5. Other                5. Other                5. Other         5. Other
5 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder      4. Neurosis
            rater5      rater6
1      4. Neurosis 4. Neurosis
2         5. Other    5. Other
3 3. Schizophrenia    5. Other
4         5. Other    5. Other
5      4. Neurosis 4. Neurosis

> data.matrix(diagnoses)[1:5,]
  rater1 rater2 rater3 rater4 rater5 rater6
1      4      4      4      4      4      3
2      2      2      2      5      5      4
3      2      3      3      3      3      4
4      5      5      5      5      5      4
5      2      2      2      4      4      3

I had expected that a matrix is a matrix and not a collection of columns.

Columns with no observations

Since the last rater does not use one of the levels, the contingency table has a zero row or column when the last rater is compared to any of the other raters. Either kappa2 (or the factor handling in R) drops the column with zeros, which changes the value of Cohen's kappa. SAS had a similar problem in the past with several work-arounds published by users on the internet, but according to the latest documentation, SAS has an explicit option not to drop rows or columns with all zeros.

However, since I was already alerted to the problem from the SAS stories, debugging time wasn't very long for this.


  1. Hi Josef,

    Thanks for this post, this is a great function. I had a comment/question. Because cohens_kappa always requires 'square' input, I've run into difficulty using it regularly with pandas.crosstab which I was using to convert an n(subjects) x m(raters) data frame into pairwise frequency tables.

    The issue has been if some factors/levels have no members, the output of crosstab is not square. This is expected behaviour I'm sure, but does mean further fiddling is required to "re-square" to frequency table. This a big behavioural difference with the R version of kappa2(which accepts the original nxm matrix rather than a freq table) and likely part of the reason for your last few points in the article - it's their hack to cope with factors without members.

    My "hack" for the python version was to write a function that will accept an array that lists all possible factor levels (eg [0,1,2,4,9]) and produces a zero'd square dataframe that size. I then perform the crosstab and add the two. This preserves the squareness and keeps the zeros which I think statistically is the correct thing to do!

    I'm just wondering that considering the likely use cases of cohens_kappa, is there a better way of making it "factor/level" aware? Realistically, for anything other than 2 levels you won't want to be entering tables by hand as you have done in your examples. Making it play nice with crosstab output would make for nice pandas integraton!

    Thanks again,

  2. Hi David,

    The import at the top also shows a helper function "to_table" which can be used to create the table from the data. However, I have not tried it out with a pandas dataframe.

    I would like to keep the functions separate, but there might be ways to integrate them better. Please, open an issue for statsmodels on github, if there are problems with "to_table" or you have suggestions to make it more user friendly. For example, I don't think I keep track of label names.

  3. Hi,

    It's long after the fact, so you have probably discovered this by now. The problem you experienced with factors in R is one that can be easily addressed. One only has to explicitly list the expected factor levels when creating the factors.

    Say the variable 'color' represents the color of a series of objects which may be either red, blue, or green. But the rater has seen a sequence of objects that contains no blue ones.

    The default operation of the factor() function is to use only the levels that are found in the object being factored. That is illustrated with the code fragment here.

    > color = c("red","green","green","red")
    > fcolor = factor( color)
    > table( fcolor )
    green red
    2 2

    But if the expected levels are supplied in the call to factor as shown here, the desired and correct behavior results.

    > fcolor = factor( color, levels=c("blue","green","red"))
    > table(fcolor)
    blue green red
    0 2 2