Wednesday, April 17, 2013

Multiple Testing P-Value Corrections in Statsmodels

series : "Statistics in Python"

This is a follow-up to my previous posts on software development and on multiple comparisons, the latter of which looks at a specific case of pairwise mean comparisons.

In the following, I provide a brief introduction to the statsmodels functions for p-value adjustments that correct for multiple testing problems, and then illustrate some of their properties using several Monte Carlo experiments.

The Multiple Testing Problem

Let us start with a single hypothesis test. As a simple case, consider a test that the mean, mu, of a random variable is zero. The null hypothesis is mu = 0; the alternative hypothesis is that mu is different from zero. A Type I error occurs when we reject the null hypothesis even though it is true. In the classical testing framework, we choose a probability alpha, often 0.05, and we want to choose our test in such a way that the probability of a Type I error is equal to or less than alpha. If this probability is equal to alpha, then the test has the correct "size"; if it is smaller than alpha, then the test is "conservative". One way of implementing the test decision is to reject the null hypothesis if the p-value that we obtain from a statistical package is smaller than or equal to alpha, i.e. p-value <= alpha.

Now, instead of looking at one hypothesis test, we have many hypothesis tests at the same time. For example, suppose we have 30 variables and want to test for each of them whether its mean is zero, or we have 100 observations and want to test each of them for being an outlier. We have to decide for each test whether to reject the null hypothesis.

The statistical problem is that if each test rejects a true null hypothesis with probability alpha, then the probability of rejecting at least one true hypothesis becomes large, much larger than alpha.

We can derive the probability of rejecting a true null hypothesis with some simple probability calculations for the case when all tests are independent of each other.

  • alpha : probability of rejecting a true null hypothesis in one test
  • (1 - alpha) : probability of not rejecting a true null hypothesis in one test
  • (1 - alpha)^m : probability of not rejecting any of m true null hypotheses (we have m tests)
  • 1 - (1 - alpha)^m : probability of rejecting at least one out of m true null hypotheses

We can use the numbers from my Monte Carlo to illustrate: alpha = 0.05, m = 30

>>> (1 - 0.05)**30
0.21463876394293727
>>> 1 - (1 - 0.05)**30
0.78536123605706276

If we use the conventional significance level for 30 tests, then there is a probability of 79% that we reject at least one true null hypothesis.

To avoid this high probability of at least one Type I error, we either need to adjust our significance level, alpha, or the p-values that we use in our decision. For the latter, we can use p-value adjustments, so we can still decide on rejecting a null hypothesis based on the original alpha.

reject if p-value_corrected <= alpha

The main questions that arise are: how much do we want to restrict the probability of a Type I error, and what should we do if we expect that some hypotheses are not true?

We can introduce some notation, to make the different cases clearer (source Wikipedia):

The following table gives the numbers of errors committed when testing m null hypotheses, and defines the random variables that are related to the m hypothesis tests.

                         Null hypothesis is True  Alternative hypothesis is True     Total
Declared significant              V                            S                       R
Declared non-significant          U                            T                     m - R
Total                            m_0                         m - m_0                   m

m is the total number of hypotheses tested
m_0 is the number of true null hypotheses
m - m_0 is the number of true alternative hypotheses
V is the number of false positives (Type I error) (also called "false discoveries")
S is the number of true positives (also called "true discoveries")
T is the number of false negatives (Type II error)
U is the number of true negatives
R is the number of rejected null hypotheses (also called "discoveries")

In m hypothesis tests of which m_0 are true null hypotheses, R is an observable random variable, and S, T, U, and V are unobservable random variables.

Additionally, in my Monte Carlo simulations I will use k = m - m_0 for the number of tests for which the null hypothesis is not true.

Although S, T, U, and V are unobserved when we apply this to data, we can calculate these random variables in our Monte Carlo simulation. The theoretical results are statements about the probabilities or expected values of these random variables, or of functions of them. The two main error measures for multiple testing are the family-wise error rate (FWER) and the false discovery rate (FDR):

FWER = P(V > 0)

FDR = E(V/R) where V/R is defined to be 0 when R = 0.

FWER is the probability that we reject at least one true null hypothesis. FDR is the expected fraction of false rejections out of all rejections and zero if there are no rejections.
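
In the Monte Carlo simulation both quantities can be estimated directly from the per-replication counts. A minimal sketch (not the exact simulation code), assuming we have collected arrays V and R over the replications:

import numpy as np

def fwer_fdr(V, R):
    # V: number of rejected true null hypotheses per replication
    # R: total number of rejections per replication
    V = np.asarray(V, dtype=float)
    R = np.asarray(R, dtype=float)
    fwer = np.mean(V > 0)                                # P(V > 0)
    ratio = np.where(R > 0, V / np.maximum(R, 1), 0.0)   # V/R, set to 0 when R == 0
    fdr = ratio.mean()                                   # E(V/R)
    return fwer, fdr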

In the following, I will first show the Monte Carlo results for the case when we only have true null hypotheses. Then, I will introduce the p-value corrections that are implemented in statsmodels and describe some Monte Carlo results for cases when some null hypotheses are false. We consider two assumptions on the dependence between p-values: first that the p-values are independent, and then that they are positively correlated.

Only True Null Hypotheses

For our first simulation, we assume that all null hypotheses are true, i.e. m = m_0 and k = 0. In this case, all rejections are rejections of true null hypotheses, so V = R and S = 0.

As a consequence, V/R is either zero or one, and the FDR, E(V/R), is the same as the probability that V > 0; that is, the FDR is equal to the FWER.

The following table shows the results from the Monte Carlo simulation with m = 30 tests.

alpha = 0.05 k = 0 corr = 0 nrep = 5000

=================================================================================================
              raw     b     s     sh    hs    h   hommel fdr_i  fdr_n  fdr_tsbky fdr_tsbh fdr_gbs
-------------------------------------------------------------------------------------------------
no. reject   1.5112  0.05 0.051  0.05 0.051  0.05 0.0502 0.0532 0.0124   0.0502   0.0534   0.0514
no. rejecta   0.0    0.0   0.0   0.0   0.0   0.0   0.0    0.0    0.0      0.0      0.0      0.0
no. reject0  1.5112  0.05 0.051  0.05 0.051  0.05 0.0502 0.0532 0.0124   0.0502   0.0534   0.0514
P(R>0)       0.786  0.048 0.049 0.048 0.049 0.048 0.0482 0.049  0.0124   0.0466   0.049    0.048
P(R>k)       0.786  0.048 0.049 0.048 0.049 0.048 0.0482 0.049  0.0124   0.0466   0.049    0.048
FWER P(V>0)  0.786  0.048 0.049 0.048 0.049 0.048 0.0482 0.049  0.0124   0.0466   0.049    0.048
P(S<k)        0.0    0.0   0.0   0.0   0.0   0.0   0.0    0.0    0.0      0.0      0.0      0.0
E(R-k | R>k) 1.5112  0.05 0.051  0.05 0.051  0.05 0.0502 0.0532 0.0124   0.0502   0.0534   0.0514
FDR E(V/R)   0.786  0.048 0.049 0.048 0.049 0.048 0.0482 0.049  0.0124   0.0466   0.049    0.048
-------------------------------------------------------------------------------------------------

In the results, I report the mean across replications for expectations and the fraction of replications for probabilities of events. The number of replications is 5000, so the resulting numbers should be close to their theoretical expectations and probabilities.

The first three rows show the average number of rejections across Monte Carlo replications: all, correct, and false rejections respectively. no. reject corresponds to E(R), no. rejecta corresponds to E(S), and no. reject0 corresponds to E(V).

The first column shows the results for the test based on raw p-values, which does not take multiple testing into account. The FWER is 0.786, very close to what we calculated above.

All other methods control the family-wise error rate (FWER) at 5 percent.

Some null hypotheses are false

In the next case, we assume that 10 out of 30 null hypotheses are false, k = 10 and m_0 = 20. The p-values are random variables created from the normal distribution; delta is the mean of the normal distribution for the cases in which the alternative is the data generating model. The actual p-values are calculated as the upper tail probability of the normal distribution. delta = 2.5 is a moderately large effect or deviation from the null.
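
As a rough sketch of the data generating process (not the exact code of my simulation, and the seed is only for illustration), such one-sided p-values can be generated like this:

import numpy as np
from scipy import stats

m, k, delta = 30, 10, 2.5
np.random.seed(12345)

# mean is delta for the k false null hypotheses, zero for the true nulls
means = np.zeros(m)
means[:k] = delta

# normally distributed test statistics and upper tail p-values
tstats = means + np.random.randn(m)
pval_raw = stats.norm.sf(tstats)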

This case is more interesting because we can now see differences across methods in how they handle rejections when some rejections are correct. Testing based on raw p-values has the largest number of rejections; the FWER is 0.64 and the FDR is 10%, so we reject too many true null hypotheses by both measures.

alpha = 0.05 k = 10 delta = 2.5 corr = 0 nrep = 5000

======================================================================================================
              raw     b      s      sh     hs     h    hommel fdr_i  fdr_n  fdr_tsbky fdr_tsbh fdr_gbs
------------------------------------------------------------------------------------------------------
no. reject   9.0362 3.3646 3.399  3.5032 3.5322 3.5016 3.5498 5.9844 3.3256   6.322    6.4448    6.27
no. rejecta  8.036  3.334  3.3678 3.4668 3.4952 3.4652 3.5122 5.7516 3.2894   6.014    6.117    5.9602
no. reject0  1.0002 0.0306 0.0312 0.0364 0.037  0.0364 0.0376 0.2328 0.0362   0.308    0.3278   0.3098
P(R>0)        1.0   0.9848 0.9854 0.9848 0.9854 0.9848 0.9866 0.9922 0.9112   0.992    0.9922   0.9846
P(R>k)       0.1724  0.0    0.0    0.0    0.0    0.0    0.0   0.0106  0.0     0.0272   0.0304   0.0318
FWER P(V>0)   0.64  0.0296 0.0302 0.0352 0.0358 0.0352 0.0364 0.1982 0.035    0.246    0.2588   0.2438
P(S<k)       0.8876  1.0    1.0    1.0    1.0    1.0    1.0   0.988   1.0     0.9734   0.971    0.9706
E(R-k | R>k) 0.2458  0.0    0.0    0.0    0.0    0.0    0.0   0.0124  0.0     0.037    0.0416   0.0444
FDR E(V/R)   0.1028 0.008  0.0079 0.0087 0.0088 0.0087 0.0089 0.0313 0.0074   0.0382   0.0401   0.0377
------------------------------------------------------------------------------------------------------

Before interpreting the results, we can briefly review what the multiple testing methods are, and what theoretical results we should expect. I provide short descriptions of the methods further down.

The relevant code in my Monte Carlo simulation is

import statsmodels.stats.multitest as smm
rej, pval_corr = smm.multipletests(pval_raw, alpha=alpha, method=method)[:2]

The method argument is a string that selects one of the following procedures (a small usage example follows the list):

`b`, `bonferroni` : one-step correction
`s`, `sidak` : one-step correction
`hs`, `holm-sidak` : step down method using Sidak adjustments
`h`, `holm` : step-down method using Bonferroni adjustments
`sh`, `simes-hochberg` : step-up method  (independent)
`hommel` : closed method based on Simes tests (non-negative)
`fdr_i`, `fdr_bh` : Benjamini/Hochberg  (non-negative)
`fdr_n`, `fdr_by` : Benjamini/Yekutieli (negative)
`fdr_tsbh` : two stage fdr correction (Benjamini/Hochberg)
`fdr_tsbky` : two stage fdr correction (Benjamini/Krieger/Yekutieli)
`fdr_gbs` : adaptive step-down fdr correction (Gavrilov, Benjamini, Sarkar)
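
As a small usage example (the p-values here are made up for illustration):

import numpy as np
import statsmodels.stats.multitest as smm

pvals = np.array([0.001, 0.01, 0.04, 0.2, 0.5])
rej, pval_corr = smm.multipletests(pvals, alpha=0.05, method='holm')[:2]
# the Holm adjusted p-values are [0.005, 0.04, 0.12, 0.4, 0.5], so at
# alpha = 0.05 only the first two hypotheses are rejected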

In the test suite, I use the following mapping for the comparison to the multtest package in R Bioconductor, where the number refers to the column in the output of mt.rawp2adjp

rmethods = {'rawp':(0,'pval'), 'Bonferroni':(1,'b'), 'Holm':(2,'h'),
            'Hochberg':(3,'sh'), 'SidakSS':(4,'s'), 'SidakSD':(5,'hs'),
            'BH':(6,'fdr_i'), 'BY':(7,'fdr_n'),
            'TSBH_0.05':(9, 'fdr_tsbh')}

The equivalent of hommel is p_adjust(pval, method='hommel') in R's stats package.

The first six methods control the FWER under independence (and in some cases also under positive dependence); the last five methods control the FDR. This is also confirmed in the Monte Carlo results: the largest FWER (probability of rejecting at least one true null hypothesis) among the first six methods is 0.036, which is smaller than the required 5 percent. The FDR methods, with the exception of fdr_n, have a FWER that is larger than 5 percent; in about a fifth to a quarter of the iterations we reject at least one true null hypothesis. However, these methods restrict the FDR, not the FWER, and they succeed in this: the largest FDR is 0.04, below the required 5 percent.

fdr_n is designed for negatively correlated p-values, and I will ignore it in most of the following.

Given that all methods satisfy their constraint on the FWER or FDR respectively, we can look at the power to identify the false hypotheses. The second row, no. rejecta, shows the average number of false null hypotheses that are correctly rejected; in terms of the table above this is the expected value of S, E(S). The methods that control the FWER correctly reject between 3.33 and 3.5 (out of 10) false hypotheses; the Bonferroni correction has the lowest power and Hommel the highest, but the differences are not very large. The FDR based methods correctly reject about 6 of the 10 false null hypotheses, but also reject slightly more true null hypotheses (third row).

Larger difference to null

To show the results in a more extreme way, I increase the difference from the null hypothesis for those hypotheses that are false. A delta of 3.5 corresponds to a mean p-value of 0.0002 for the false null hypotheses.

alpha = 0.05 k = 10 delta = 3.5 corr = 0 nrep = 5000

=======================================================================================================
               raw     b      s      sh     hs     h    hommel fdr_i  fdr_n  fdr_tsbky fdr_tsbh fdr_gbs
-------------------------------------------------------------------------------------------------------
no. reject   10.6836 7.175  7.2042 7.4478 7.4716 7.4468 7.4802 9.4604 7.898    9.8112   9.8678   9.845
no. rejecta   9.6834 7.1444 7.173  7.4058 7.4278 7.4048 7.4362 9.114   7.83    9.2996   9.326    9.312
no. reject0   1.0002 0.0306 0.0312 0.042  0.0438 0.042  0.044  0.3464 0.068    0.5116   0.5418   0.533
P(R>0)         1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0      1.0      1.0      1.0
P(R>k)        0.5312 0.001  0.001  0.0034 0.0036 0.0034 0.0034 0.1552 0.0098   0.2542   0.271    0.2634
FWER P(V>0)    0.64  0.0296 0.0302 0.0408 0.0426 0.0408 0.0428 0.2838 0.0658   0.3802   0.3972   0.3886
P(S<k)        0.2782 0.9662 0.9644 0.9432 0.9414 0.9434 0.9428 0.5796 0.8832   0.4816   0.4682   0.474
E(R-k | R>k)  0.8156 0.001  0.001  0.0034 0.0038 0.0034 0.0034 0.1912 0.0106   0.3476   0.377     0.37
FDR E(V/R)    0.0866 0.0039 0.004  0.0051 0.0053 0.0051 0.0053 0.0327 0.0076   0.0464   0.0489   0.048
-------------------------------------------------------------------------------------------------------

All methods for correcting for multiple testing control the FWER or FDR, respectively, below 5 percent as before. The power to identify the false null hypotheses is, however, much larger. The FWER controlling methods now correctly identify about 7 false null hypotheses, while the FDR controlling methods now identify more than 9 out of 10 false null hypotheses on average (second row). The FWER of the FDR methods that estimate m_0 has increased to almost 40 percent, that is, they reject at least one true null hypothesis in almost 40 percent of the simulated replications. However, we still only reject about half a true null hypothesis on average (third row).

These results illustrate the main advantage of the FDR controlling methods: we reject a slightly larger number of true null hypotheses on average, but we gain power for rejecting cases where the null hypothesis is false. In this case, the total fraction of false rejections out of all rejections is around 5% (E(V)/E(R) = 0.5116 / 9.8112 = 0.052 for fdr_tsbky, and E(V)/E(R) = 0.5418 / 9.8678 = 0.055 for fdr_tsbh). Note that this is not the same as the FDR, since we take the ratio of the expectations and not the expectation of the ratio.

Multiple Testing Procedures

It's time to present a brief description of the multiple testing procedures that are implemented in statsmodels. As I described earlier, it took me quite a long time to wrap my head around the various step-up and step-down procedures; I read many articles and parts of books and skimmed many more. However, the list of references that I actually used is lost on my old, broken computer. A good overview of the methods is, as often, the SAS manual. Wikipedia also has several good pages on the multiple testing problem and procedures.

FWER controlling methods

Bonferroni and Sidak make the adjustments to all tests at the same time. The Bonferroni correction divides the critical level alpha by the number of tests m, which is the same as multiplying all p-values by the number of tests.

reject if pval <= alpha / m    or
reject if pval_corrected = pval * m <= alpha

Sidak corrects the significance level analogously to our previous calculation for the FWER in the case of independent p-values

reject if pval <= 1 - (1 - alpha)**(1/m)    or
reject if pval_corrected = 1 - (1 - pval)**m <= alpha

Sidak is more powerful than Bonferroni under independence since it uses the exact probabilities instead of the approximation, but the difference is small.

The following compares the Bonferroni thresholds, alpha / m, with the Sidak thresholds, 1 - (1 - alpha)**(1/m), for m = 1, ..., 19:

>>> import numpy as np
>>> alpha = 0.05
>>> alpha * (1./np.arange(1, 20))
array([ 0.05      ,  0.025     ,  0.01666667,  0.0125    ,  0.01      ,
        0.00833333,  0.00714286,  0.00625   ,  0.00555556,  0.005     ,
        0.00454545,  0.00416667,  0.00384615,  0.00357143,  0.00333333,
        0.003125  ,  0.00294118,  0.00277778,  0.00263158])
>>> (1 - (1 - alpha)**(1./np.arange(1, 20)))
array([ 0.05      ,  0.02532057,  0.01695243,  0.01274146,  0.01020622,
        0.00851244,  0.00730083,  0.00639115,  0.00568304,  0.0051162 ,
        0.00465217,  0.00426532,  0.00393786,  0.0036571 ,  0.00341371,
        0.0032007 ,  0.00301271,  0.00284557,  0.00269601])

All other methods evaluate the tests sequentially and apply a different amount of correction at each step in the sequence. Step-down methods start from the smallest p-value and stop when a test is not rejected. Step-up methods start from the largest p-value and stop when a test is significant and is rejected. Once we stop at a critical test, all remaining tests are treated the same way to enforce monotonicity in the rejections. Similarly, the p-value corrections also depend on the p-value of the critical test to impose monotonicity.

Holm is a step-down method that applies a Bonferroni correction at each step, where the correction factor is the number of hypotheses that have not yet been decided: the smallest p-value is compared with alpha/m, the next smallest with alpha/(m-1), and so on down to alpha/1. Holm-Sidak uses the same stepwise approach but applies the Sidak correction instead of the Bonferroni correction at each step.
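
As a minimal sketch of the Holm p-value adjustment (an illustration of the step-down idea, not the statsmodels implementation):

import numpy as np

def holm_adjust(pvals):
    # sort the p-values, apply the decreasing Bonferroni factors m, m-1, ..., 1,
    # enforce monotonicity with a cumulative maximum, and restore the original order
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)
    factors = np.arange(m, 0, -1)
    adj_sorted = np.minimum(1, np.maximum.accumulate(factors * pvals[order]))
    adjusted = np.empty(m)
    adjusted[order] = adj_sorted
    return adjusted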

Bonferroni, Sidak and Holm's procedures are valid independently of the dependence structure of the p-values.

Simes-Hochberg is a step-up method that uses a Bonferroni correction at each step similar to Holm's method. This test is valid under independence and under some forms of positive dependence of p-values. Simes-Hochberg is more powerful than Holm's method under independence.

Hommel is a closed test that controls the FWER. Closed tests like Hommel's evaluate all relevant joint sub-hypotheses and are among the most powerful tests that control the FWER. Joint sub-hypotheses are formed as intersection hypotheses. As an example, we partition the set of tests into two groups, A and B. One sub-hypothesis is that all tests in A satisfy the null hypothesis and all tests in B satisfy the alternative hypothesis. We can reject a null hypothesis if we can also reject all sub-hypotheses that include it. Note: there is a large literature on closed tests and intersection(-union) tests for various applications, especially multiple comparisons, which looks interesting, but I never tried to understand the details and they did not look easy to implement. Hommel's method is valid under independence and under positive dependence of the p-values.

Hommel's test has the reputation of being more difficult to calculate, but I have not experienced any remarkable difference. Hommel is the most powerful of the FWER controlling procedures implemented here. We can also see in the Monte Carlo simulations that Hommel has the largest power, but only by a small margin compared to the other stepwise methods.

FDR controlling methods

fdr_bh is the step-up procedure by Benjamini and Hochberg (1995) that controls the FDR under independence and is robust to some forms of positive dependence of the p-values (positive regression dependency). fdr_by is the step-up procedure by Benjamini and Yekutieli (2001) that controls the FDR also under other forms of dependence, including the case when the p-values are negatively correlated. In the above Monte Carlo we can see that fdr_by has less power than fdr_bh in the uncorrelated case.
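
As a minimal sketch of the Benjamini/Hochberg step-up adjustment behind fdr_bh (again only an illustration, not the statsmodels code):

import numpy as np

def bh_adjust(pvals):
    # the adjusted value for the i-th smallest p-value is p_(i) * m / i,
    # made monotone from the largest p-value downwards (step-up)
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)
    scaled = pvals[order] * m / np.arange(1, m + 1)
    adj_sorted = np.minimum(1, np.minimum.accumulate(scaled[::-1])[::-1])
    adjusted = np.empty(m)
    adjusted[order] = adj_sorted
    return adjusted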

The last three methods that control the FDR are adaptive methods that estimate, or adjust to, the number of true null hypotheses. The basic idea is that we can adjust alpha in fdr_bh if we have an estimate of the number of true null hypotheses, m_0 in the notation above; the adjusted alpha is alpha * m / m_0. fdr_tsbh uses fdr_bh in a first stage to estimate m_0, and then uses the corrected alpha in a second stage of fdr_bh. fdr_tsbky uses a different corrected alpha in both the first and the second stage, based on Benjamini, Krieger and Yekutieli (2006). fdr_gbs is based on Gavrilov, Benjamini and Sarkar (2009), which is conceptually similar but uses a single stage for the test.

Aside: I did not find implementations of fdr_tsbky and fdr_gbs in R to verify my implementation. Additionally, Benjamini and co-authors usually only describe the procedure for rejecting hypotheses, not the corresponding p-value corrections. I derived the p-value corrections in analogy to the other cases, and in my simulations there is now no divergence between the reject decisions and the corrected p-values. (In the current version of statsmodels, before I merge my latest changes, fdrcorrection_twostage returns the correct reject decision; however, the returned p-values are based on the second stage and need the adjusted alpha for the correct decision.) After my latest changes, all p-values can be directly compared with the overall alpha, which is either the level for the FWER or the level for the FDR.

Important: one comment on the usage of the two-stage fdr methods. All methods except fdr_tsbky and fdr_tsbh calculate the p-value correction independently of the alpha that is specified as a function argument; the corrected p-values can be compared with any alpha to decide which hypotheses to reject. The two two-stage fdr methods, fdr_tsbky and fdr_tsbh, use alpha to correct the p-values. The corrected p-values are therefore specific to the alpha given as an argument to the function, and we cannot compare them with any other alpha to decide which hypotheses to reject.
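
To make this concrete, a small example with made-up p-values; the two-stage corrected p-values are tied to the alpha that is passed in:

import numpy as np
import statsmodels.stats.multitest as smm

pvals = np.array([0.001, 0.01, 0.04, 0.2, 0.5])

# for most methods the corrected p-values do not depend on alpha
_, pc_holm = smm.multipletests(pvals, alpha=0.05, method='holm')[:2]

# for the two-stage methods they do, so pc_05 is only valid at alpha = 0.05
_, pc_05 = smm.multipletests(pvals, alpha=0.05, method='fdr_tsbh')[:2]
_, pc_01 = smm.multipletests(pvals, alpha=0.01, method='fdr_tsbh')[:2]
# pc_05 and pc_01 generally differ; do not compare pc_05 with any other alpha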

Monte Carlo with Positive Dependence

Back to our Monte Carlo simulations: we look next at the results when the p-values are positively correlated. The difference between the alternative and the null is again delta = 2.5.

alpha = 0.05 k = 10 delta = 2.5 corr = 0.8 nrep = 5000

======================================================================================================
              raw     b      s      sh     hs     h    hommel fdr_i  fdr_n  fdr_tsbky fdr_tsbh fdr_gbs
------------------------------------------------------------------------------------------------------
no. reject    9.07  3.3692 3.3958 3.6254 3.6134 3.5822 3.7362 6.0968 3.6852   6.6166   6.7446   6.5312
no. rejecta  8.045  3.3328 3.3592 3.5096 3.5374 3.5084 3.5896 5.5386 3.5466   5.6552   5.734    5.2934
no. reject0  1.025  0.0364 0.0366 0.1158 0.076  0.0738 0.1466 0.5582 0.1386   0.9614   1.0106   1.2378
P(R>0)       0.9508 0.6078 0.6098 0.6078 0.6098 0.6078 0.613  0.6604 0.4794   0.6526   0.6604   0.6076
P(R>k)        0.19  0.0134 0.0134 0.018  0.0184 0.018   0.02  0.0864 0.0302   0.1138   0.1196   0.116
FWER P(V>0)  0.1906 0.0134 0.0134 0.018  0.0184 0.018   0.02  0.0866 0.0304   0.1142    0.12    0.1164
P(S<k)       0.4292 0.8884 0.8874 0.8626 0.8606 0.8626 0.8624 0.6276 0.8134   0.5812   0.572    0.5816
E(R-k | R>k) 1.0244 0.0364 0.0366 0.1158 0.076  0.0738 0.1466 0.558  0.1384   0.961    1.0102   1.2374
FDR E(V/R)   0.0545 0.0024 0.0024 0.005  0.0041 0.004  0.0065 0.0272 0.0077   0.0415   0.0435   0.0471
------------------------------------------------------------------------------------------------------

The FWER has dropped for all methods; the FWER controlling methods now have a FWER at or below 0.02, which is conservative compared to our target rate of 0.05. The adaptive FDR controlling procedures still have an FDR close to but below 0.05. Overall, I don't see any considerable differences to the independence case. All methods look robust to this positive correlation, although many of them become more conservative.

Monte Carlo with staggered alternatives

So far, all p-values for which the null hypothesis was not true had the same delta, the difference between the alternative and the null hypothesis. The following table shows the Monte Carlo results when delta increases from 1 to 4 across the 10 hypotheses for which the null hypothesis is false. This corresponds to the following means of the normal random variables and the p-values at those means:

>>> import numpy as np
>>> from scipy import stats
>>> k = 10
>>> np.round(np.linspace(1, 4, k), 2)
array([ 1.  ,  1.33,  1.67,  2.  ,  2.33,  2.67,  3.  ,  3.33,  3.67,  4.  ])
>>> stats.norm.sf(np.linspace(1, 4, k))
array([ 0.15865525,  0.09121122,  0.04779035,  0.02275013,  0.00981533,
        0.00383038,  0.0013499 ,  0.00042906,  0.00012287,  0.00003167])

I keep the positive correlation from the previous example.

alpha = 0.05 k = 10 delta = np.linspace(1,4, k) corr = 0.8 nrep = 5000

======================================================================================================
              raw     b      s      sh     hs     h    hommel fdr_i  fdr_n  fdr_tsbky fdr_tsbh fdr_gbs
------------------------------------------------------------------------------------------------------
no. reject   8.2658 3.8924 3.9132 4.0912 4.0694 4.0488 4.1406 5.911  4.0438   6.4374   6.5454   6.6894
no. rejecta  7.2408 3.856  3.8766 3.9758 3.994  3.9754 3.9942 5.3608 3.909    5.5002    5.56    5.4762
no. reject0  1.025  0.0364 0.0366 0.1154 0.0754 0.0734 0.1464 0.5502 0.1348   0.9372   0.9854   1.2132
P(R>0)       0.9966 0.8982 0.8998 0.8982 0.8998 0.8982 0.8996 0.9104 0.8158   0.9076   0.9104   0.8982
P(R>k)       0.1576 0.0102 0.0102 0.0134 0.0134 0.0132 0.0164 0.0702 0.022    0.0926   0.0978   0.0954
FWER P(V>0)  0.1906 0.0134 0.0134 0.0178 0.018  0.0178 0.0198 0.082  0.0284   0.1054   0.1114   0.108
P(S<k)       0.7956 0.9832 0.9828 0.9752 0.9752 0.9756 0.9738 0.9056 0.9642   0.881    0.8758   0.8772
E(R-k | R>k) 0.978  0.0322 0.0326 0.1104  0.07  0.068  0.1416 0.5336 0.127    0.919    0.9656   1.196
FDR E(V/R)   0.0552 0.0025 0.0025 0.0051 0.0041 0.0041 0.0066 0.0268 0.0075   0.0401   0.0421   0.0458
------------------------------------------------------------------------------------------------------

The most remarkable observation is that I don't see much of a difference compared to the other cases. All conclusions that we drew so far carry over to this case. Most methods are conservative; the adaptive FDR procedures are still the closest to the nominal level.

Conclusions

The Monte Carlo simulations confirm the results that we expect from the statistical theory for multiple testing corrections. All methods control either the family-wise error rate (FWER) or the false discovery rate (FDR). If we suspect that there are false hypotheses in our set of tests, then the methods that control the FDR are more powerful in rejecting false null hypotheses, at the cost of a small increase in the number of rejected true null hypotheses. Among the FWER controlling procedures, Hommel has the highest power, closely followed by Holm-Sidak; however, the differences in power among the stepwise methods appear to be rather small. Among the methods that control the FDR, fdr_tsbh seems to have the best properties. fdr_bh (fdr_i) has lower power than the adaptive methods, and has a lower FDR under independence and under positive correlation of the p-values. Somewhat surprisingly, among the adaptive methods both fdr_tsbky and fdr_gbs are not as successful as fdr_tsbh in the Monte Carlo simulations, for which I do not know a theoretical explanation.

References

(partial list)

Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society. Series B (Methodological) 57 (1) (January 1): 289–300. doi:10.2307/2346101.

Benjamini, Yoav, Abba M. Krieger, and Daniel Yekutieli. 2006. “Adaptive Linear Step-up Procedures That Control the False Discovery Rate.” Biometrika 93 (3) (September 1): 491–507. doi:10.1093/biomet/93.3.491.

Benjamini, Yoav, and Daniel Yekutieli. 2001. “The Control of the False Discovery Rate in Multiple Testing Under Dependency.” The Annals of Statistics 29 (4) (August): 1165–1188. doi:10.1214/aos/1013699998.

Gavrilov, Yulia, Yoav Benjamini, and Sanat K. Sarkar. 2009. “An Adaptive Step-down Procedure with Proven FDR Control Under Independence.” The Annals of Statistics 37 (2) (April): 619–629. doi:10.1214/07-AOS586.

Holm, Sture. 1979. “A Simple Sequentially Rejective Multiple Test Procedure.” Scandinavian Journal of Statistics 6: 65–70.

Hommel, G. 1988. “A Stage-wise Rejective Multiple Test Procedure Based on a Modified Bonferroni Test.” Biometrika 75: 383–386.

Editorial Notes :

to cite Wikipedia: This article needs additional citations for verification. (It's just a blog post.)

I added the license by-nc-sa to this blog (in the footer). This follows Doug Hellmann's PYMOTW.
