Showing posts with label statistical tests. Show all posts
Showing posts with label statistical tests. Show all posts

Friday, April 19, 2013

Binomial Proportions, Equivalence and Power - part 0

Just a pre-announcement because I have a nice graph.

I am looking into tests for binomial proportions, especially equivalence (TOST) and non-inferiority tests.

SAS provides a good overview over the available methods and power for it

Power and significance levels in testing for proportions have a saw tooth pattern because the observed proportions are discrete, see for example this SAS page

Unfortunately for my unit testing, I have not found any equivalence tests for proportions in R. Currently, I'm just trying to match some examples that I found on the internet.

And here is the plot for my power function. It shows the power as a function of the sample size, for either the normal approximation or the binomial distribution, of the test for equivalence, TOST two one-sided tests. The TOST test itself is based on the normal approximation.

Wednesday, April 17, 2013

Multiple Testing P-Value Corrections in Statsmodels

series : "Statistics in Python"

This is a follow-up to my previous posts, here and this post, which are on software development, and multiple comparisons which looks at a specific case of pairwise mean comparisons.

In the following, I provide a brief introduction to the statsmodels functions for p-value asjustements to correct for multiple testing problems and then illustrate some properties using several Monte Carlo experiments.

Monday, April 15, 2013

Debugging: Multiple Testing P-Value Corrections in Statsmodels

subtitle: "The earth is round after all"

series : "Adventures with Statistics in Python"

If you run an experiment and it shows that the earth is not round, then you better check your experiment, your instrument, and don't forget to look up the definition of "round"

Statsmodels has 11 methods for correcting p-values to take account of multiple testing (or it will have after I merge my latest changes).

The following mainly describes how it took me some time to figure out what the interpretation of the results of a Monte Carlo run is. I wrote the Monte Carlo to verify that the multiple testing p-value corrections make sense. I will provide some additional explanations of the multiple testing function in statsmodels in a follow-up post.

Let's start with the earth that doesn't look round.

experiment:
Monte Carlo with 5000 or 10000 replications, to see how well the p-value corrections are doing. We have 30 p-values from hypothesis tests, for 10 of those the null hypothesis is false.
instrument:
statsmodels.stats.multipletests to make the p-value correction

The first results

==========================================================================================
         b      s      sh     hs     h    hommel  fdr_i  fdr_n  fdr_tsbky fdr_tsbh fdr_gbs
------------------------------------------------------------------------------------------
reject 9.6118 9.619  9.7178 9.7274 9.7178  9.72  10.3128 9.8724  10.5152  10.5474  10.5328
r_>k   0.0236 0.0246 0.0374 0.0384 0.0374 0.0376  0.2908 0.0736   0.3962   0.4118   0.4022
------------------------------------------------------------------------------------------

The headers are shortcuts for the p-value correction method. In the first line, reject, are the average number of rejections across Monte Carlo iterations. The second line, r_>k, are the fraction of cases where we reject more than 10 hypothesis. The average number of rejections is large because the alternative in the simulation is far away from the null hypothesis, and the corresponding p-values are small. So all methods are able to reject most of the false hypotheses.

The last three methods estimate, as part of the algorithm, the number of null hypotheses that are correct. All three of those methods reject a true null hypothesis in roughly 40% of all cases. All methods are supposed to limit the false discovery rate (FDR) to alpha which is 5% in this simulations. I expected the fraction in the last line to be below 0.05. So what's wrong?

It looks obvious, after the fact, but it had me puzzled for 3 days.

Changing the experiment: The above data are based on p-values that are the outcome of 30 independent t-tests, which is already my second version for generating random p-values. For my third version, I changed to a data generating process similar to Benjamini, Krieger and Yekutieli 2006, which is the article on which fdr_tsbky is based. None of the changes makes a qualitative difference to the results.

Checking the instrument: All p-values corrections except fdr_tsbky and fdr_gbs are tested against R. For the case at hand, the p-values for fdr_tsbh are tested against R's multtest package. However, the first step is a discrete estimate (number of true null hypothesis) and since it is discrete, the tests will not find differences that show up only in borderline cases. I checked a few more cases which also verify against R. Also, most methods have a double implementation, separately for the p-value correction and for the rejection boolean. Since they all give identical or similar answers, I start to doubt that there is a problem with the instrument.

Is the earth really round? I try to read through the proof that these adaptive methods limit the FDR to alpha, to see if I missed some assumptions, but give up quickly. These are famous authors, and papers that have long been accepted and been widely used. I also don't find any assumption besides independence of the p-values, which I have in my Monte Carlo. However, looking a bit more closely at the proofs shows that I don't really understand FDR. When I implemented these functions, I focused on the algorithms and only skimmed the interpretation.

What is the False Discovery Rate? Got it. I should not rely on vague memories of definitions that I read two years ago. What I was looking at, is not the FDR.

One of my new results (with a different data generating process in the Monte Carlo, but still 10 out of 30 hypotheses are false)

==============================================================================================
              b      s      sh     hs     h    hommel fdr_i  fdr_n  fdr_tsbky fdr_tsbh fdr_gbs
----------------------------------------------------------------------------------------------
reject      5.2924 5.3264 5.5316 5.5576 5.5272 5.5818 8.1904 5.8318   8.5982   8.692    8.633
rejecta     5.2596 5.2926 5.492  5.5176 5.488  5.5408 7.876  5.7744   8.162     8.23    8.1804
reject0     0.0328 0.0338 0.0396  0.04  0.0392 0.041  0.3144 0.0574   0.4362   0.462    0.4526
r_>k        0.0002 0.0002 0.0006 0.0006 0.0006 0.0006 0.0636 0.0016   0.1224   0.1344   0.1308
fdr         0.0057 0.0058 0.0065 0.0065 0.0064 0.0067 0.0336 0.0081   0.0438   0.046    0.0451
----------------------------------------------------------------------------------------------

reject : average number of rejections

rejecta : average number of rejections for cases where null hypotheses is false (10)

rejecta : average number of rejections for cases where null hypotheses is true (20)

r_>k : fraction of Monte Carlo iterations where we reject more than 10 hypotheses

fdr : average of the fraction of rejections when null is true out of all rejections

The last numbers look much better, the numbers are below alpha=0.05 as required, including the fdr for the last three methods.

"Consider the problem of testing m null hypotheses h1, ..., hm simultaneously, of which m0 are true nulls. The proportion of true null hypotheses is denoted by mu0 = m0/m. Benjamini and Hochberg(1995) used R and V to denote, respectively, the total number of rejections and the number of false rejections, and this notation has persisted in the literature. <...> The FDR was loosely defined by Benjamini and Hochberg(1995) as E(V/R) where V/R is interpreted as zero if R = 0." Benjamini, Krieger and Yekutieli 2006, page 2127

Some additional explanations are in this Wikipedia page

What I had in mind when I wrote the code for my Monte Carlo results, was the family wise error rate, FWER,

"The FWER is the probability of making even one type I error in the family, FWER = Pr(V >= 1)" Wikipedia

Although, I did not look up that definition either. What I actually used, is Pr(R > k) where k is the number of false hypothesis in the data generating process. Although, I had chosen my initial cases so Pr(R > k) is close to Pr(V > 0).

In the follow-up post I will go over the new Monte Carlo results, which now look all pretty good.

Reference

Benjamini, Yoav, Abba M. Krieger, and Daniel Yekutieli. 2006. “Adaptive Linear Step-up Procedures That Control the False Discovery Rate.” Biometrika 93 (3) (September 1): 491–507. doi:10.1093/biomet/93.3.491.

Thursday, March 28, 2013

Inter-rater Reliability, Cohen's Kappa and Surprises with R

Introduction

This is part three in adventures with statsmodels.stats, after power and multicomparison.

This time it is about Cohen's Kappa, a measure of inter-rater agreement or reliability. Suppose we have two raters that each assigns the same subjects or objects to one of a fixed number of categories. The question then is: How well do the raters agree in their assignments? Kappa provides a measure of association, the largest possible value is one, the smallest is minus 1, and it has a corresponding statistical test for the hypothesis that agreement is only by chance. Cohen's kappa and similar measures have widespread use, among other fields, in medical or biostatistic. In one class of applications, raters are doctors, subjects are patients and the categories are medical diagnosis. Cohen's Kappa provides a measure how well the two sets of diagnoses agree, and a hypothesis test whether the agreement is purely random.

For more background see this Wikipedia page which was the starting point for my code.

Most of the following focuses on weighted kappa, and the interpretation of different weighting schemes. In the last part, I add some comments about R, which provided me with several hours of debugging, since I'm essentially an R newbie and have not yet figured out some of it's "funny" behavior.

Monday, March 25, 2013

Multiple Comparison and Tukey HSD or why statsmodels is awful

Introduction

Statistical tests are often grouped into one-sample, two-sample and k-sample tests, depending on how many samples are involved in the test. In k-sample tests the usual Null hypothesis is that a statistic, for example the mean as in a one-way ANOVA, or the distribution in goodness-of-fit tests, is the same in all groups or samples. The common test is the joint test that all samples have the same value, against the alternative that at least one sample or group has a different value.

However, often we are not just interested in the joint hypothesis if all samples are the same, but we would also like to know for which pairs of samples the hypothesis of equal values is rejected. In this case we conduct several tests at the same time, one test for each pair of samples.

This results, as a consequence, in a multiple testing problem and we should correct our test distribution or p-values to account for this.

I mentioned some of the one- and two sample test in statsmodels before. Today, I just want to look at pairwise comparison of means. We have k samples and we want to test for each pair whether the mean is the same, or not.

Instead of adding more explanations here, I just want to point to R tutorial and also the brief description on Wikipedia. A search for "Tukey HSD" or multiple comparison on the internet will find many tutorials and explanations.

The following are examples in statsmodels and R interspersed with a few explanatory comments.

Sunday, March 17, 2013

Statistical Power in Statsmodels

I merged last week a branch of mine into statsmodels that contains large parts of basic power calculations and some effect size calculations. The documentation is in this section . Some parts are still missing but I thought I have worked enough on this for a while.

(Adding the power calculation for a new test now takes approximately: 3 lines of real code, 200 lines of wrapping it with mostly boiler plate and docstrings, and 30 to 100 lines of tests.)

The first part contains some information on the implementation. In the second part, I compare the calls to the function in the R pwr package to the calls in my (statsmodels') version.

I am comparing it to the pwr package because I ended up writing almost all unit tests against it. The initial development was based on the SAS manual, I used the explanations on the G-Power website for F-tests, and some parts were initially written based on articles that I read. However, during testing I adjusted the options (and fixed bugs), so I was able to match the results to pwr, and I think pwr has just the right level of abstraction and easiness of use, that I ended up with code that is pretty close to it.

If you just want to see the examples, skip to the second part.

Friday, March 15, 2013

Different Fields - Different Problems: Effect Size

or What's your scale?

Effect Size

I have been working on and off for a while now on adding statistical power calculations to statsmodels. One of the topics I ran into is the effect size.

At the beginning, I wasn't quite sure what to make of it. While I was working on power calculations, it just seemed to be a convenient way of specifying the distance between the alternative and the null hypothesis. However, there were references that sounded like it's something special and important. This was my first message to the mailing list

A classical alternative to NHST (null-hypothesis significance testing):
report your effect size and confidence intervals

http://en.wikipedia.org/wiki/Effect_size

http://onlinelibrary.wiley.com/doi/10.1111/j.1469-185X.2007.00027.x/abstract

http://onlinelibrary.wiley.com/doi/10.1111/j.1460-9568.2011.07902.x/abstract

I bumped into this while looking for power analysis.

MIA in python, as far as I can tell.

Now what is the fuss all about?

Scaling Issues

Today I finally found some good motivating quotes:

"In the behavioral, educational, and social sciences (BESS), units of measurement are many times arbitrary, in the sense that there is no necessary reason why the measurement instrument is based on a particular scaling. Many, but certainly not all, constructs dealt with in the BESS are not directly observable and the instruments used to measure such constructs do not generally have a natural scaling metric as do many measures, for example, in the physical sciences."

and

"However, effects sizes based on raw scores are not always helpful or generalizable due to the lack of natural scaling metrics and multiple scales existing for the same phenomenon in the BESS. A common methodological suggestion in the BESS is to report standardized effect sizes in order to facilitate the interpretation of results and for the cumulation of scientific knowledge across studies, which is the goal of meta-analysis (<...>). A standardized effect size is an effect size that describes the size of the effect but that does not depend on any particular measurement scale."

The two quotes are from the introduction in "Confidence Intervals for Standardized Effect Sizes: Theory, Application, and Implementation" by Ken Kelley http://www.jstatsoft.org/v20/a08 .

Large parts of the literature that I was browsing or reading on this, are in Psychological journals. This can also be seen in the list of references in the Wikipedia page on effect size.

One additional part, that I found puzzling was the definition of "conventional" effect sizes by Cohen. "For Cohen's d an effect size of 0.2 to 0.3 might be a "small" effect, around 0.5 a "medium" effect and 0.8 to infinity, a "large" effect." (sentence from the Wikipedia page)

"Small" what? small potatoes, small reduction in the number of deaths, low wages? or, "I'm almost indifferent" (+0 on the mailing lists)?

Where I come from

Now it's clearer why I haven't seen this in my traditional area, economics and econometrics.

Although economics falls into BESS, in the SS part, it has a long tradition of getting a common scale, money. Physical units also show up in some questions.

National Income Accounting tries to measure the economy with money as a unit. (And if something doesn't have a price associated with it, then it's ignored by most. That's another problem. Or, we make a price.) There are many measurement problems, but there is also a large industry to figure out common standards.

Effect sizes have a scale that is "natural":

  • What's the increase in lifetime salary, if you attend business school?
  • What's the increase in sales (in Dollars, or physical units) if you lower the price?
  • What's the increase in sales if you run an advertising campaign?
  • What's your rate of return if you invest in stocks?

Effects might not be easy to estimate, or cannot be estimated accurately, but we don't need a long debate about what to report as effect.

Post Scripts

(i) I just saw the table at the end of this SAS page http://support.sas.com/documentation/cdl/en/statug/65328/HTML/default/viewer.htm#statug_glm_details22.htm . I love replicating SAS tables, but will refrain for now, since I am supposed to go back to other things.

(ii) I started my last round of work on this because I was looking at effect size as distance measure for a chisquare goodness of fit test. When the sample size is very large, then small deviations from the Null Hypothesis will cause a statistical test to reject the Null, even if the effect, the difference to the Null is for all practical purposes irrelevant. My recent preferred solution to this is to switch to equivalence test or something similar, not testing the hypothesis that the effect is exactly zero, but to test whether the effect is "small" or not.

(iii) I have several plans for blog posts (cohens_kappa, power onion) but never found the quiet time or urge to actually write them.