This time it is about Cohen's Kappa, a measure of inter-rater agreement or reliability. Suppose we have two raters who each assign the same subjects or objects to one of a fixed number of categories. The question then is: How well do the raters agree in their assignments? Kappa provides a measure of association; the largest possible value is one, the smallest is minus one, and it has a corresponding statistical test for the hypothesis that agreement occurs only by chance. Cohen's kappa and similar measures are widely used, among other fields, in medicine and biostatistics. In one class of applications, the raters are doctors, the subjects are patients, and the categories are medical diagnoses. Cohen's kappa provides a measure of how well the two sets of diagnoses agree, and a hypothesis test of whether the agreement is purely random.
For more background, see this Wikipedia page, which was the starting point for my code.
Most of the following focuses on weighted kappa and the interpretation of different weighting schemes. In the last part, I add some comments about R, which provided me with several hours of debugging, since I'm essentially an R newbie and have not yet figured out some of its "funny" behavior.
The relevant information about the data can be summarized by a square contingency table, where the entries are the counts of observations for pairs of category assignments. For example, a is the number of observations where both raters assigned category one, and e is the number of observations where rater 1 assigned category two and rater 2 assigned category one.
                     Rater 2
            cat1   cat2   cat3   cat4
Rater 1
    cat1      a      b      c      d
    cat2      e      f      g      h
    cat3      i      j      k      l
    cat4      m      n      o      p
To have an illustrative example and code to work with, I import the relevant functions and define a table as a numpy array.
import numpy as np
from statsmodels.stats.inter_rater import cohens_kappa, to_table

table = np.array([[ 2.,  6.,  3.,  0.],
                  [ 5.,  4.,  2.,  2.],
                  [ 5.,  2.,  6.,  0.],
                  [ 2.,  2.,  3.,  7.]])
Calling the function from statsmodels and printing the results, we see that kappa for this table is 0.1661, and that the test of the null hypothesis that kappa is zero rejects at the 5% level (two-sided p-value 0.0375), although the 95% confidence interval, which is based on a different standard error than the test, still includes zero.
>>> print cohens_kappa(table)
          Simple Kappa Coefficient
      --------------------------------
      Kappa                     0.1661
      ASE                       0.0901
      95% Lower Conf Limit     -0.0105
      95% Upper Conf Limit      0.3426

         Test of H0: Simple Kappa = 0

      ASE under H0              0.0798
      Z                         2.0806
      One-sided Pr >  Z         0.0187
      Two-sided Pr > |Z|        0.0375
cohens_kappa returns a results instance that provides additional information.
>>> res = cohens_kappa(table)
>>> type(res)
<class 'statsmodels.stats.inter_rater.KappaResults'>
>>> res.keys()
['distribution_kappa', 'pvalue_one_sided', 'kappa_low', 'kappa_max',
 'std_kappa', 'z_value', 'alpha', 'pvalue_two_sided', 'var_kappa',
 'kappa_upp', 'kind', 'kappa', 'std_kappa0', 'alpha_ci', 'weights',
 'distribution_zero_null', 'var_kappa0']
Cohen's kappa takes the column and row total counts as given when the "pure chance" agreement is calculated. This has two consequences. First, while kappa is always smaller than or equal to one (as is the correlation coefficient), the largest kappa that can be achieved for given marginal counts can be strictly smaller than one. cohens_kappa calculates the largest simple kappa that can be obtained for the given margin totals, available as kappa_max.
>>> res.kappa_max 0.86969851814001009
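The idea behind kappa_max is to replace the observed diagonal with the largest diagonal the margins allow: the maximal achievable proportion of agreement is the sum over categories of the smaller of the two marginal proportions. A minimal sketch of that calculation in plain numpy (not the statsmodels internals):

```python
import numpy as np

table = np.array([[2., 6., 3., 0.],
                  [5., 4., 2., 2.],
                  [5., 2., 6., 0.],
                  [2., 2., 3., 7.]])

n = table.sum()
marg1 = table.sum(axis=1) / n              # row margins (rater 1)
marg2 = table.sum(axis=0) / n              # column margins (rater 2)
p_exp = (marg1 * marg2).sum()              # chance agreement given margins
p_max = np.minimum(marg1, marg2).sum()     # best achievable agreement
kappa_max = (p_max - p_exp) / (1 - p_exp)  # ~0.8697, matches res.kappa_max
```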
The second consequence is that kappa values for tables with very unequal margins can be surprising.
>>> t3 = np.array([[98, 2], [2, 0]])
>>> t3
array([[98,  2],
       [ 2,  0]])
>>> res3 = cohens_kappa(t3)
>>> res3.kappa
-0.019999999999998037
Kappa is negative even though the raters agree in 98 of the 102 cases and disagree only on the remaining four. Conditioning on the margin totals to define chance agreement works as if the raters had been told that there are 100 cases of category one and 2 cases of category two. The raters in this example disagree about which are the two cases in category two.
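To see the mechanics, here is the simple kappa for t3 computed directly from the definition, kappa = (p_o - p_e) / (1 - p_e), as a sketch in plain numpy:

```python
import numpy as np

t3 = np.array([[98., 2.], [2., 0.]])
n = t3.sum()                           # 102 observations
p_obs = np.trace(t3) / n               # observed agreement: 98/102
marg1 = t3.sum(axis=1) / n             # row margins (rater 1)
marg2 = t3.sum(axis=0) / n             # column margins (rater 2)
p_exp = (marg1 * marg2).sum()          # chance agreement given margins
kappa = (p_obs - p_exp) / (1 - p_exp)  # -> -0.02
```

Because the margins are so lopsided, the chance agreement p_exp is slightly larger than the observed agreement, which is what pushes kappa below zero.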
An extension of the simple kappa allows for different weights when aggregating the agreement counts from the contingency table. Which weights are appropriate for a given dataset depends on the interpretation of the categories. Kappa is a measure of association between two random variables (the two ratings). If the categories were numeric, then we could also use the correlation coefficient. Weighted kappa allows us to measure agreement for different types of data, such as categorical or ordered categories. I go through some examples for different interpretations of the categories, and show how the corresponding weights can be defined for cohens_kappa.
Case 1: Categorical Data
Suppose our categories are "clothing", "housing", "food" and "transportation", and raters have to assign each unit to one of the categories. In this case, the categories are purely categorical; there is no notion of closeness between any pair. Agreement means that both raters assign the same category. If one rater assigns, for example, "housing" and the other rater assigns "food", then they don't "almost" agree, they just disagree.
The count of agreement cases includes only those on the main diagonal, a+f+k+p; off-diagonal counts are ignored. The corresponding Cohen's kappa is then called simple or unweighted.
Case 2: Equally Spaced Ordered Categories
Examples of this case are simple grading scales, such as 0 to 4 stars, or letter grades A, B, C, D. The assumption is that the weight for the agreement of a pair of ratings depends only on the distance between them. One star of difference counts the same whether it is zero stars versus one star or two stars versus three stars. The weight or distance only depends on how far away the entry is from the main diagonal; b, g, l and e, j, o all represent one level of disagreement.
The distance table in this case is
>>> d
array([[ 0.,  1.,  2.,  3.],
       [ 1.,  0.,  1.,  2.],
       [ 2.,  1.,  0.,  1.],
       [ 3.,  2.,  1.,  0.]])
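One way to construct this distance matrix (a sketch in plain numpy, not the statsmodels internals) is from the absolute differences of the category indices:

```python
import numpy as np

levels = np.arange(4.)                         # category indices 0..3
d = np.abs(levels[:, None] - levels[None, :])  # |i - j|, a Toeplitz matrix
```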
The actual weighting could be linear, quadratic or some arbitrary monotonic function of the distance. The corresponding options as described in the docstring are
wt : None or string
    <...>
    wt in ['linear', 'ca' or None] : use linear weights, Cicchetti-Allison
        actual weights are linear in the score "weights" difference
    wt in ['quadratic', 'fc'] : use quadratic weights, Fleiss-Cohen
        actual weights are squared in the score "weights" difference
The kappa for these two weights are
>>> cohens_kappa(table, wt='linear').kappa
0.23404255319148948
>>> cohens_kappa(table, wt='quadratic').kappa
0.31609195402298862
We get the same results if we use the above distance matrix directly:
>>> cohens_kappa(table, weights=d).kappa
0.23404255319148937
>>> cohens_kappa(table, weights=d**2).kappa
0.31609195402298873
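The weighting enters the kappa formula by replacing the diagonal counts with weighted sums over the whole table: with agreement weights w (ones on the diagonal), p_o = sum(w * P) and p_e is the same weighted sum over the product of the margins. A sketch of the linear-weights case in plain numpy (not the statsmodels internals, but it reproduces the linear kappa above):

```python
import numpy as np

table = np.array([[2., 6., 3., 0.],
                  [5., 4., 2., 2.],
                  [5., 2., 6., 0.],
                  [2., 2., 3., 7.]])

# distance matrix for 4 equally spaced categories, normalized to max 1
levels = np.arange(4.)
d = np.abs(levels[:, None] - levels[None, :])
w = 1 - d / d.max()                        # agreement weights, 1 on diagonal

n = table.sum()
p_obs = (w * table).sum() / n              # weighted observed agreement
marg1 = table.sum(axis=1) / n
marg2 = table.sum(axis=0) / n
p_exp = (w * np.outer(marg1, marg2)).sum() # weighted chance agreement
kappa_lin = (p_obs - p_exp) / (1 - p_exp)  # ~0.2340, matches wt='linear'
```

Using w = (1 - d / d.max())**2 in place of w would not give the quadratic variant directly; the quadratic scheme squares the normalized distances before forming 1 - distance, i.e. w = 1 - (d / d.max())**2.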
This distance-based weighting scheme implies that the weight matrix has a Toeplitz structure: the value of an entry depends only on its distance to the main diagonal. To allow more flexibility in the definition of weights, I added the toeplitz option.
wt : None or string
    <...>
    wt = 'toeplitz' : weight matrix is constructed as a toeplitz matrix
        from the one dimensional weights.
Consider the following example: the categories are "solid blue", "leaning blue", "leaning red" and "solid red". We would like to give a half-point weight to assignments where the raters chose neighboring categories, and zero weight to pairs of category assignments that are more than one level apart.
weights = [0, 0.5, 1, 1]
This is again defined in terms of distance. We can also inspect the actual agreement weights, which are 1 minus the distance matrix:
>>> res = cohens_kappa(table, weights=np.array([0, 0.5, 1, 1]), wt='toeplitz')
>>> 1 - res.weights
array([[ 1. ,  0.5,  0. ,  0. ],
       [ 0.5,  1. ,  0.5,  0. ],
       [ 0. ,  0.5,  1. ,  0.5],
       [ 0. ,  0. ,  0.5,  1. ]])
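The toeplitz option expands the one-dimensional distances into the full square matrix; a sketch of that construction in plain numpy (how statsmodels does it internally may differ):

```python
import numpy as np

w1d = np.array([0., 0.5, 1., 1.])   # distance by level difference
idx = np.arange(len(w1d))
# index the 1-D weights by |i - j| to get the Toeplitz distance matrix
dist = w1d[np.abs(idx[:, None] - idx[None, :])]
agreement = 1 - dist                # 1 on the diagonal, 0.5 one level off
```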
The kappa given these weights is larger than the simple kappa:
>>> res.kappa 0.19131334022750779
As a check, we can also calculate the simple kappa as a special case of toeplitz weights:
>>> cohens_kappa(table, weights=np.array([0, 1, 1, 1]), wt='toeplitz').kappa
0.16607051609606549
>>> _ - cohens_kappa(table).kappa
1.1102230246251565e-16
Aside: Since R's irr package does not offer any option that corresponds to toeplitz, the unit tests only cover cases where toeplitz has an equivalent representation in one of the other options.
Case 3: Unequally Spaced Ordered Categories
We can also look at cases where the categories are ordered but the distances between categories are not equal. Take as an example a list of categories "apples", "oranges", "cars", "minivans", "trucks". We could assign a distance scale that reflects closeness across categories, for example (1, 2, 7, 8, 10). The distance between apples and oranges is 1, the distance between oranges and cars is 5. The scale will be rescaled for the weights, so only relative distances matter (in the linear reweighting case).
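With the linear (Cicchetti-Allison) convention, the scores enter only through their normalized pairwise differences; a sketch for the five-category scale above, in plain numpy:

```python
import numpy as np

scores = np.array([1., 2., 7., 8., 10.])
d = np.abs(scores[:, None] - scores[None, :])  # pairwise score differences
d_norm = d / d.max()    # rescaled so the largest distance is 1
```

Rescaling the scores by any positive affine transformation, e.g. (10, 20, 70, 80, 100), leaves d_norm and hence the linear weighted kappa unchanged.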
We have only 4 categories in our example table, so let's assume our categories are "excellent", "very good", "very bad" and "awful", and we assign distance scores (0, 1, 9, 10). wt='linear' uses the linear Cicchetti-Allison transformation to rescale the scores. Weighted kappa and weights (rescaled distances) are
>>> res = cohens_kappa(table, weights=np.array([0, 1, 9, 10]), wt='linear')
>>> res.kappa
0.28157383419689141
>>> res.weights
array([[ 0. ,  0.1,  0.9,  1. ],
       [ 0.1,  0. ,  0.8,  0.9],
       [ 0.9,  0.8,  0. ,  0.1],
       [ 1. ,  0.9,  0.1,  0. ]])
The actual weights are
>>> 1 - res.weights
array([[ 1. ,  0.9,  0.1,  0. ],
       [ 0.9,  1. ,  0.2,  0.1],
       [ 0.1,  0.2,  1. ,  0.9],
       [ 0. ,  0.1,  0.9,  1. ]])
Identical category assignments have weight 1; ("excellent", "very good") and ("very bad", "awful") pairs have weight 0.9, close to full agreement; ("very good", "very bad") pairs have weight 0.2 (the normalized distance is 0.8); and so on.
Another use for this is to calculate kappa for merged tables. We can define a scale that has a distance of zero between the two categories at each extreme.
>>> res = cohens_kappa(table, weights=np.array([1, 1, 2, 2]), wt='linear')
>>> res.kappa
0.298165137614679
>>> res.weights
array([[ 0.,  0.,  1.,  1.],
       [ 0.,  0.,  1.,  1.],
       [ 1.,  1.,  0.,  0.],
       [ 1.,  1.,  0.,  0.]])
>>> 1 - res.weights
array([[ 1.,  1.,  0.,  0.],
       [ 1.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  1.],
       [ 0.,  0.,  1.,  1.]])
This gives us the same kappa as the one from the merged table, where we combine the two "good" and the two "bad" categories into one level each:
>>> table_merged = np.array([[table[:2, :2].sum(), table[:2, 2:].sum()],
...                          [table[2:, :2].sum(), table[2:, 2:].sum()]])
>>> table_merged
array([[ 17.,   7.],
       [ 11.,  16.]])
>>> cohens_kappa(table_merged).kappa
0.29816513761467894
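For merging equal-sized blocks of adjacent categories, the merged table can also be built more compactly with a reshape-and-sum; a sketch in plain numpy:

```python
import numpy as np

table = np.array([[2., 6., 3., 0.],
                  [5., 4., 2., 2.],
                  [5., 2., 6., 0.],
                  [2., 2., 3., 7.]])

# collapse each pair of adjacent rows and columns into one category:
# entry [i, k] sums the 2x2 block table[2i:2i+2, 2k:2k+2]
table_merged = table.reshape(2, 2, 2, 2).sum(axis=(1, 3))
```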
Besides weighted kappa, cohens_kappa also calculates the standard deviations and the hypothesis tests. I also implemented Fleiss' kappa, which covers the case of many raters, but so far I only have kappa itself, no standard deviation or tests (mainly because the SAS manual did not have the equations for it). An additional helper function, to_table, converts the original observations, given as the ratings for all individuals, to the contingency table required by Cohen's kappa.
I had also added more functions for inter-rater agreement or reliability to my wish list, but I am not planning to work on them anytime soon (except maybe ICC, the intraclass correlation).
Note: While writing this, after not having looked at cohens_kappa for some time, I discovered that the function does not check whether user-specified weights in the toeplitz case make sense. When the full weight matrix is given, the function also just uses whatever the user defined. I think raising ValueErrors, or automatically transforming to valid weight or distance matrices in the flexible cases, will be necessary. Also, the docstring is not clear about what is a weight matrix (with 1 on the main diagonal) and what is a distance matrix (with 0 on the main diagonal).
Development Notes and Surprises with R
I had a quiet evening, was too tired or too lazy to do any serious work, and picked something from my wish list that looked easy to implement as a late-evening activity. I started with the Wikipedia article on Cohen's kappa and implemented it very fast. Then I went to the SAS manual to get more background and to see what other statistical packages provide. Since SAS has a lot more than what's described on Wikipedia, and its documentation includes the formulas, it started to get serious. Understanding and working through the weighting schemes started to be a fun challenge. A few days later, I was ready to look at R, to write more unit tests than the ones based on the printouts I found for the SAS examples.
I found the irr package in R, which has Cohen's kappa. So I started to write my test cases verifying against it. I have been using R for several years to write unit tests for scipy.stats and statsmodels, but never tried to learn it systematically. My knowledge of R is very eclectic. Stackoverflow is often a good source to get around some of the trickier parts, once I know what to look out for.
R factors are one dimensional, I guess
I started to look at one of the datasets that come with the irr package. I converted the dataframe with factors to numeric values and started to compare the results with my implementation. My numbers didn't match, and I spent some time looking for a bug. Finally, I realized that R converts each factor column separately to numeric.
The last rater in the following example does not use one of the categories, so the mapping between string and numeric levels is not the same across raters. Individuals (rows) 1 and 4 have the same diagnosis by all raters, but the numeric value of the last rater (column) is different.
> sapply(diagnoses, class)
  rater1   rater2   rater3   rater4   rater5   rater6
"factor" "factor" "factor" "factor" "factor" "factor"
> sapply(diagnoses, mode)
   rater1    rater2    rater3    rater4    rater5    rater6
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
> diagnoses[1:5, ]
                   rater1                   rater2                   rater3                   rater4
1             4. Neurosis              4. Neurosis              4. Neurosis              4. Neurosis
2 2. Personality Disorder  2. Personality Disorder  2. Personality Disorder                 5. Other
3 2. Personality Disorder         3. Schizophrenia         3. Schizophrenia         3. Schizophrenia
4                5. Other                 5. Other                 5. Other                 5. Other
5 2. Personality Disorder  2. Personality Disorder  2. Personality Disorder              4. Neurosis
            rater5           rater6
1      4. Neurosis      4. Neurosis
2         5. Other         5. Other
3 3. Schizophrenia         5. Other
4         5. Other         5. Other
5      4. Neurosis      4. Neurosis
> data.matrix(diagnoses)[1:5,]
  rater1 rater2 rater3 rater4 rater5 rater6
1      4      4      4      4      4      3
2      2      2      2      5      5      4
3      2      3      3      3      3      4
4      5      5      5      5      5      4
5      2      2      2      4      4      3
I had expected that a matrix is a matrix and not a collection of columns.
Columns with no observations
Since the last rater does not use one of the levels, the contingency table has a zero row or column when the last rater is compared to any of the other raters. Either kappa2 or the factor handling in R drops the column of zeros, which changes the value of Cohen's kappa. SAS had a similar problem in the past, with several workarounds published by users on the internet, but according to the latest documentation, SAS now has an explicit option not to drop rows or columns that are all zeros.
However, since the SAS stories had already alerted me to this problem, the debugging time wasn't very long here.