Wednesday, April 24, 2013

Help: I caught a bug

I think I am turning too much into a statistician and econometrician lately; I must have caught a virus or something. Maybe it already started a while ago.

The theme of the scipy conference this year is "Machine Learning & Tools for Reproducible Science". However, I'm not doing any sexy twitter analysis; I just spent some days coding tests for proportions, boring stuff like pairwise comparisons of proportions.

Anyway, I decided to submit a tutorial proposal for econometrics with statsmodels to the scipy conference; see the (lightly edited) proposal below. My proposal didn't get accepted, and my first reaction was: wrong topic, too much statistics. We just want numbers, not to check whether the model is correct and to find out how to fix it.

That leaves me with more time to go back to figuring out which other basic statistical tests are still missing in Python.

Statistics in Python: Reproducing Research

This is just a short comment on a blog post.

Fernando Perez wrote a nice article about "Literate computing" and computational reproducibility: IPython in the age of data-driven journalism

In the second part, he explains that Vincent Arel-Bundock came up with an IPython notebook within three hours to replicate some criticism of an economics journal article. Vincent's notebook can be seen here.

What I found most striking was not the presentation as a notebook, although that makes it easy to read; instead it was: pandas, patsy and statsmodels, and no R in sight. We have come a long way with Statistics in Python since I started to get involved in it five years ago.

Vincent has made many improvements and contributions to statsmodels in the last year.

Aside

I'm not following much of the economics debates these days, so I only know what I read in the two references that Fernando gave.

My impression is that it's just the usual (mis)use of economics research results. Politicians like the numbers that give them ammunition for their position. As an economist, you are either very careful about how you present the results, or you join the political game (I worked for several years in an agricultural department of a small country). (An example of the use of economics results in another area: blaming the financial crisis on the work on copulas.)

"Believable" research: If your results sound too good or too interesting to be true, maybe they are not, and you better check your calculations. Although mistakes are not uncommon, the business as usual part is that the results are often very sensitive to assumptions, and it takes time to figure out what results are robust. I have seen enough economic debates where there never was a clear answer that convinced more than half of all economists. A long time ago, when the Asian Tigers where still tigers, one question was: Did they grow because of or in spite of government intervention?

Friday, April 19, 2013

Binomial Proportions, Equivalence and Power - part 0

Just a pre-announcement because I have a nice graph.

I am looking into tests for binomial proportions, especially equivalence (TOST) and non-inferiority tests.

SAS provides a good overview of the available methods and of the corresponding power calculations.

Power and significance levels in tests for proportions have a sawtooth pattern because the observed proportions are discrete; see for example this SAS page.
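
As a small illustration of where the sawtooth comes from, the following sketch (my own toy example, not a statsmodels function) computes the exact power of a one-sided binomial test over a grid of sample sizes; plotting power against nobs_grid shows the jumps that occur whenever the discrete critical value changes.

    import numpy as np
    from scipy import stats

    def power_binom_onesided(p0, p1, nobs, alpha=0.05):
        # critical value: smallest integer c with P(X > c | p0) <= alpha
        crit = stats.binom.ppf(1 - alpha, nobs, p0)
        # reject H0: p <= p0 when X > crit; power is P(X > crit) under p1
        return stats.binom.sf(crit, nobs, p1)

    nobs_grid = np.arange(20, 201)
    power = np.array([power_binom_onesided(0.5, 0.65, n) for n in nobs_grid])
    # power is not monotone in the sample size because of the discreteness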

Unfortunately for my unit testing, I have not found any equivalence tests for proportions in R. Currently, I'm just trying to match some examples that I found on the internet.

And here is the plot for my power function. It shows the power of the equivalence test, TOST (two one-sided tests), as a function of the sample size, computed with either the normal approximation or the exact binomial distribution. The TOST test itself is based on the normal approximation.
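
To make the construction concrete, here is a minimal sketch of such a TOST for a single proportion using the normal approximation. The function name and the use of the observed proportion in the standard error are my own choices for illustration; the statsmodels implementation may differ in those details.

    import numpy as np
    from scipy import stats

    def tost_proportion_norm(count, nobs, low, upp):
        """Equivalence test (TOST) for one proportion, normal approximation.

        H0: p <= low or p >= upp   versus   H1: low < p < upp.
        Equivalence is concluded when both one-sided p-values are small.
        """
        prop = count / float(nobs)
        std = np.sqrt(prop * (1 - prop) / nobs)
        p_low = stats.norm.sf((prop - low) / std)   # one-sided test against the lower margin
        p_upp = stats.norm.cdf((prop - upp) / std)  # one-sided test against the upper margin
        return max(p_low, p_upp), p_low, p_upp

    # example: 48 successes out of 100, equivalence margins 0.4 and 0.6
    print(tost_proportion_norm(48, 100, 0.4, 0.6))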

Wednesday, April 17, 2013

Multiple Testing P-Value Corrections in Statsmodels

series : "Statistics in Python"

This is a follow-up to my previous posts, here and this post, which are on software development, and to the post on multiple comparisons, which looks at the specific case of pairwise mean comparisons.

In the following, I provide a brief introduction to the statsmodels functions for p-value adjustments to correct for multiple testing problems, and then illustrate some properties using several Monte Carlo experiments.
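
As a minimal usage sketch of the function discussed here (the method strings 'holm' and 'fdr_bh' are the ones I expect to match the abbreviations used in the tables below; see the docstring for the full list):

    import numpy as np
    from statsmodels.stats.multitest import multipletests

    pvals = np.array([0.001, 0.008, 0.02, 0.04, 0.045, 0.3, 0.6])

    # Holm step-down correction, controls the family-wise error rate
    reject, pvals_corr, _, _ = multipletests(pvals, alpha=0.05, method='holm')

    # Benjamini-Hochberg step-up correction, controls the false discovery rate
    reject_bh, pvals_bh, _, _ = multipletests(pvals, alpha=0.05, method='fdr_bh')

    print(reject, pvals_corr)
    print(reject_bh, pvals_bh)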

Monday, April 15, 2013

Debugging: Multiple Testing P-Value Corrections in Statsmodels

subtitle: "The earth is round after all"

series : "Adventures with Statistics in Python"

If you run an experiment and it shows that the earth is not round, then you had better check your experiment and your instrument, and don't forget to look up the definition of "round".

Statsmodels has 11 methods for correcting p-values to take account of multiple testing (or it will have after I merge my latest changes).

The following mainly describes how it took me some time to figure out how to interpret the results of a Monte Carlo run. I wrote the Monte Carlo to verify that the multiple testing p-value corrections make sense. I will provide some additional explanations of the multiple testing function in statsmodels in a follow-up post.

Let's start with the earth that doesn't look round.

experiment:
Monte Carlo with 5000 or 10000 replications, to see how well the p-value corrections are doing. We have 30 p-values from hypothesis tests; for 10 of those the null hypothesis is false.
instrument:
statsmodels.stats.multipletests to make the p-value corrections (a sketch of this setup follows below)
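
Here is a minimal sketch of that setup, with a data generating process of my own choosing (independent z-tests, a fixed shift of 3 for the ten false nulls); the actual simulation behind the tables may differ in these details.

    import numpy as np
    from scipy import stats
    from statsmodels.stats.multitest import multipletests

    n_rep, n_tests, n_false, alpha = 5000, 30, 10, 0.05
    effect = 3.0  # shift of the z-statistic under the alternative (my choice)
    rng = np.random.RandomState(12345)

    n_reject = np.zeros(n_rep)
    for i in range(n_rep):
        # 10 false nulls (shifted mean), 20 true nulls, independent two-sided z-tests
        z = rng.normal(size=n_tests)
        z[:n_false] += effect
        pvals = 2 * stats.norm.sf(np.abs(z))
        reject, _, _, _ = multipletests(pvals, alpha=alpha, method='fdr_tsbh')
        n_reject[i] = reject.sum()

    print(n_reject.mean())              # "reject": average number of rejections
    print((n_reject > n_false).mean())  # "r_>k": fraction with more than 10 rejections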

The first results

==========================================================================================
         b      s      sh     hs     h    hommel  fdr_i  fdr_n  fdr_tsbky fdr_tsbh fdr_gbs
------------------------------------------------------------------------------------------
reject 9.6118 9.619  9.7178 9.7274 9.7178  9.72  10.3128 9.8724  10.5152  10.5474  10.5328
r_>k   0.0236 0.0246 0.0374 0.0384 0.0374 0.0376  0.2908 0.0736   0.3962   0.4118   0.4022
------------------------------------------------------------------------------------------

The headers are shortcuts for the p-value correction methods. The first line, reject, is the average number of rejections across Monte Carlo iterations. The second line, r_>k, is the fraction of cases where we reject more than 10 hypotheses. The average number of rejections is large because the alternative in the simulation is far away from the null hypothesis, and the corresponding p-values are small, so all methods are able to reject most of the false hypotheses.

The last three methods estimate, as part of the algorithm, the number of null hypotheses that are correct. All three of those methods reject a true null hypothesis in roughly 40% of all cases. All methods are supposed to limit the false discovery rate (FDR) to alpha, which is 5% in this simulation. I expected the fraction in the last line to be below 0.05. So what's wrong?

It looks obvious, after the fact, but it had me puzzled for 3 days.

Changing the experiment: The above data are based on p-values that are the outcome of 30 independent t-tests, which is already my second version for generating random p-values. For my third version, I changed to a data generating process similar to Benjamini, Krieger and Yekutieli 2006, which is the article on which fdr_tsbky is based. None of the changes makes a qualitative difference to the results.

Checking the instrument: All p-value corrections except fdr_tsbky and fdr_gbs are tested against R. For the case at hand, the p-values for fdr_tsbh are tested against R's multtest package. However, the first step is a discrete estimate (the number of true null hypotheses), and since it is discrete, the tests will not find differences that show up only in borderline cases. I checked a few more cases, which also verify against R. Also, most methods have a double implementation, separately for the p-value correction and for the rejection boolean. Since they all give identical or similar answers, I start to doubt that there is a problem with the instrument.

Is the earth really round? I try to read through the proof that these adaptive methods limit the FDR to alpha, to see if I missed some assumptions, but give up quickly. These are famous authors, and the papers have long been accepted and widely used. I also don't find any assumption besides independence of the p-values, which holds in my Monte Carlo. However, looking a bit more closely at the proofs shows that I don't really understand FDR. When I implemented these functions, I focused on the algorithms and only skimmed the interpretation.

What is the False Discovery Rate? Got it. I should not rely on vague memories of definitions that I read two years ago. What I was looking at is not the FDR.

One of my new results (with a different data generating process in the Monte Carlo, but still with 10 out of 30 null hypotheses false):

==============================================================================================
              b      s      sh     hs     h    hommel fdr_i  fdr_n  fdr_tsbky fdr_tsbh fdr_gbs
----------------------------------------------------------------------------------------------
reject      5.2924 5.3264 5.5316 5.5576 5.5272 5.5818 8.1904 5.8318   8.5982   8.692    8.633
rejecta     5.2596 5.2926 5.492  5.5176 5.488  5.5408 7.876  5.7744   8.162     8.23    8.1804
reject0     0.0328 0.0338 0.0396  0.04  0.0392 0.041  0.3144 0.0574   0.4362   0.462    0.4526
r_>k        0.0002 0.0002 0.0006 0.0006 0.0006 0.0006 0.0636 0.0016   0.1224   0.1344   0.1308
fdr         0.0057 0.0058 0.0065 0.0065 0.0064 0.0067 0.0336 0.0081   0.0438   0.046    0.0451
----------------------------------------------------------------------------------------------

reject : average number of rejections

rejecta : average number of rejections for cases where the null hypothesis is false (10)

reject0 : average number of rejections for cases where the null hypothesis is true (20)

r_>k : fraction of Monte Carlo iterations where we reject more than 10 hypotheses

fdr : average of the fraction of rejections with a true null among all rejections (see the sketch below)
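
Here is a sketch of how these summaries can be computed from the boolean rejection results of the simulation above (array and function names are my own):

    import numpy as np

    def summarize(reject_bool, null_false, k=10):
        """Summary rows of the Monte Carlo table.

        reject_bool : boolean array, shape (n_replications, n_tests), rejections from multipletests
        null_false  : boolean array, shape (n_tests,), True for the tests whose null is false
        """
        R = reject_bool.sum(1).astype(float)    # total rejections per replication
        V = reject_bool[:, ~null_false].sum(1)  # rejections of true nulls per replication
        return {
            'reject':  R.mean(),
            'rejecta': reject_bool[:, null_false].sum(1).mean(),
            'reject0': V.mean(),
            'r_>k':    (R > k).mean(),
            'fdr':     (V / np.maximum(R, 1.0)).mean(),  # V/R, set to 0 when R = 0 (then V = 0 too)
        }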

The numbers in the last row, fdr, look much better; they are below alpha = 0.05 as required, including for the last three methods.

"Consider the problem of testing m null hypotheses h1, ..., hm simultaneously, of which m0 are true nulls. The proportion of true null hypotheses is denoted by mu0 = m0/m. Benjamini and Hochberg(1995) used R and V to denote, respectively, the total number of rejections and the number of false rejections, and this notation has persisted in the literature. <...> The FDR was loosely defined by Benjamini and Hochberg(1995) as E(V/R) where V/R is interpreted as zero if R = 0." Benjamini, Krieger and Yekutieli 2006, page 2127

Some additional explanations are in this Wikipedia page

What I had in mind when I wrote the code for my Monte Carlo results was the family-wise error rate, FWER:

"The FWER is the probability of making even one type I error in the family, FWER = Pr(V >= 1)" Wikipedia

I did not look up that definition either, though. What I actually used is Pr(R > k), where k is the number of false hypotheses in the data generating process. However, I had chosen my initial cases so that Pr(R > k) is close to Pr(V > 0).
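
To keep the different error measures straight, here is a small helper (my own naming) that computes them from per-replication counts; Pr(R > k) only coincides with the FWER, Pr(V >= 1), when all k false nulls are essentially always rejected, as in my first set of results.

    import numpy as np

    def error_measures(V, R, k):
        """V: false rejections per replication, R: total rejections, k: number of false nulls."""
        fwer = np.mean(V >= 1)                 # FWER = Pr(V >= 1)
        pr_r_gt_k = np.mean(R > k)             # what my first Monte Carlo summary measured
        fdr = np.mean(V / np.maximum(R, 1.0))  # FDR = E[V/R], with V/R := 0 when R = 0
        return fwer, pr_r_gt_k, fdr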

In the follow-up post I will go over the new Monte Carlo results, which now look all pretty good.

Reference

Benjamini, Yoav, Abba M. Krieger, and Daniel Yekutieli. 2006. “Adaptive Linear Step-up Procedures That Control the False Discovery Rate.” Biometrika 93 (3) (September 1): 491–507. doi:10.1093/biomet/93.3.491.