Sunday, March 17, 2013

Statistical Power in Statsmodels

Last week I merged a branch of mine into statsmodels that contains large parts of the basic power calculations and some effect size calculations. The documentation is in this section. Some parts are still missing, but I think I have worked enough on this for a while.

(Adding the power calculation for a new test now takes approximately: 3 lines of real code, 200 lines of wrapping it with mostly boilerplate and docstrings, and 30 to 100 lines of tests.)

The first part contains some information on the implementation. In the second part, I compare the calls to the functions in the R pwr package to the calls in my (statsmodels') version.

I am comparing it to the pwr package because I ended up writing almost all unit tests against it. The initial development was based on the SAS manual, I used the explanations on the G-Power website for the F-tests, and some parts were initially written based on articles that I read. However, during testing I adjusted the options (and fixed bugs) so that I could match the results to pwr. I think pwr has just the right level of abstraction and ease of use, so I ended up with code that is pretty close to it.

If you just want to see the examples, skip to the second part.

The Power Onion

About onions (although the current onion is not so large):

Armin and our mailing list discussion

The main underlying calculations just get the power for the main distributions:

  • normal distribution
  • non-central t distribution
  • non-central F distribution and
  • non-central chi-square distribution

I have not yet gotten to the binomial distribution for proportions. I also have not yet coded anything on power calculations for equivalence tests, although I went through the relevant parts in the SAS documentation.

The base for the calculations is a set of standalone functions that calculate the power given a distribution and are possibly targeted to a subgroup of tests. So far, there are two functions for the t distribution, two for the F distribution, and one each for the chi-square and normal distributions.
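To make this concrete, here is a minimal sketch of such a standalone function for the one-sample t-test, written directly against scipy's non-central t distribution. The function name and the exact branching are mine for this post; the statsmodels implementation differs in details.

import numpy as np
from scipy import stats

def ttest_power_sketch(effect_size, nobs, alpha, alternative='two-sided'):
    # power of a one-sample t-test: probability that the t statistic, which
    # follows a non-central t distribution under the alternative, falls in
    # the rejection region defined under the null
    df = nobs - 1
    nc = effect_size * np.sqrt(nobs)   # non-centrality parameter
    if alternative == 'two-sided':
        crit = stats.t.isf(alpha / 2., df)
        return stats.nct.sf(crit, df, nc) + stats.nct.cdf(-crit, df, nc)
    elif alternative == 'larger':
        return stats.nct.sf(stats.t.isf(alpha, df), df, nc)
    else:  # 'smaller'
        return stats.nct.cdf(stats.t.ppf(alpha, df), df, nc)

With effect_size=0.2, nobs=60 and alpha=0.1 this should reproduce (up to the precision of scipy's nct) the 0.4556 from the one-sample t-test example below.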

Given a power function, and the condition that it should be equal to a specified power, we can solve for any one of its arguments.

Take the one-sample t-test as an example. We can solve for any one of effect_size, nobs, alpha, and power, given all the other parameters:

def solve_power(self, effect_size=None, nobs=None, alpha=None, power=None,
                alternative='two-sided'):
    '''solve for any one parameter of the power of a one sample t-test

    for the one sample t-test the keywords are:
        effect_size, nobs, alpha, power

    Exactly one needs to be ``None``, all others need numeric values.
    '''

The signature also contains additional keywords as extra options, such as ``alternative``; these always have a default, always need to have a value, and are never solved for.

For solving the power equation, I created classes. The base class has a method that generically finds the root for whichever keyword is set to None, or at least tries to. The individual classes define the power methods, which just delegate to the corresponding standalone function. The solve_power method of the specific classes does some precalculations in some cases, but essentially delegates all the work to the method of the base class. Initially, I added solve_power to the individual classes only to have a place for an explicit docstring (instead of the **kwds of the base class).
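A rough sketch of this structure, assuming made-up class names and a simple one-sided z-test as the subclass (the bracketing interval for the root-finder is also just a guess here; statsmodels keeps adjustable start values per instance):

import numpy as np
from scipy import optimize, stats

class PowerSketch(object):
    # base class: solve power(...) = power generically for whichever
    # keyword was left as None

    def power(self, effect_size, nobs, alpha):
        raise NotImplementedError   # provided by each subclass

    def solve_power(self, effect_size=None, nobs=None, alpha=None, power=None):
        kwds = dict(effect_size=effect_size, nobs=nobs, alpha=alpha)
        if power is None:
            return self.power(**kwds)
        # exactly one of the remaining keywords has to be None
        key = [k for k, v in kwds.items() if v is None][0]

        def rootfunc(value):
            kwds[key] = value
            return self.power(**kwds) - power

        # crude bracketing interval for the root-finder
        upper = 1 - 1e-6 if key == 'alpha' else 1e6
        return optimize.brentq(rootfunc, 1e-6, upper)

class ZTestPowerSketch(PowerSketch):
    # power of a one-sided z-test, the simplest special case
    def power(self, effect_size, nobs, alpha):
        crit = stats.norm.isf(alpha)
        return stats.norm.sf(crit - effect_size * np.sqrt(nobs))

For example, ZTestPowerSketch().solve_power(effect_size=0.2, alpha=0.05, power=0.8) solves for the sample size and should give roughly 155 observations after rounding up.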

Aside: So far I could have designed most of this in a purely functional structure as well, just calling functions with extra keywords. However, classes keep the code nicer together, and it's easier to add new attributes. At some point, I needed to adjust the starting parameters for the root-finding in one class. Since I initially used a module-level global for the starting values, they got messed up (and it took me some time to find the bug). Now, starting values are just an instance attribute, and I don't have to worry about changes affecting other classes or other instances. And I don't have to hand them around to each function.

Back to the onion: It is at the level of these power classes that we can start to specialize the calculations for specific statistical tests. Many are still missing. For example, there are many tests whose underlying distribution is the normal, t, or F distribution. It is possible to create new classes that are targeted to specific statistical tests. This will make them easier to use, since we do not need to know how to translate a test to the base distribution. The examples below illustrate some cases where we just use the basic normal power class for different statistical tests.

So far, all calculations are based on effect size. The effect size is a standardized measure that is usually independent of the sample size. For example, for the one-sample t-test, the effect size is the mean divided by the standard deviation of the sample, and the t-statistic is equal to the effect size times the square root of the sample size.
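As a small illustration (the function name is mine, not statsmodels'):

import numpy as np

def cohens_d_one_sample(x, value=0):
    # standardized effect size for a one-sample t-test:
    # mean difference divided by the sample standard deviation
    x = np.asarray(x, float)
    return (x.mean() - value) / x.std(ddof=1)

# the t statistic is then d * sqrt(nobs), i.e. d = t / sqrt(nobs)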

The effect size depends on the specific statistical test and on some additional assumptions, for example on the estimator for the standard deviation. Functions to calculate effect sizes are also useful for translating more familiar statistics into the function arguments. For example, in R the pwr package separates the effect size calculations from the power calculations, while the power functions in the stats package take the original statistics; power.prop.test, for instance, takes two probabilities instead of the effect size.
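For proportions, the translation used by pwr's ES.h is Cohen's h, the difference of arcsine-square-root transformed proportions; a minimal sketch (statsmodels' proportion_effectsize is based on the same transformation, although, as noted further below, with the roles of the two probabilities reversed):

import numpy as np

def cohens_h_sketch(prob1, prob2):
    # Cohen's h: difference of the arcsine-square-root transformed proportions
    return 2 * (np.arcsin(np.sqrt(prob1)) - np.arcsin(np.sqrt(prob2)))

cohens_h_sketch(0.5, 0.4) gives about 0.2014, matching the ES.h(0.5, 0.4) example below.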

So far, I have only added effect size functions for the chi-square test and for proportions (based on the normal distribution). More need to follow.
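The chi-square effect size is Cohen's w; a minimal sketch for the goodness-of-fit case (the statsmodels function chisquare_effectsize has a few additional options):

import numpy as np

def chisquare_effectsize_sketch(probs0, probs1):
    # Cohen's w: square root of the sum of squared differences between the
    # alternative and the null probabilities, scaled by the null probabilities
    probs0 = np.asarray(probs0, float)
    probs1 = np.asarray(probs1, float)
    return np.sqrt(((probs1 - probs0)**2 / probs0).sum())

With the two probability distributions from the chi-square goodness-of-fit example below, this gives about 0.2887, the same value as pwr's ES.w1.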

Aside (another one): So far, the effect size is taken as given for the power calculations. It is supposed to represent the difference or distance of the alternative from the null hypothesis that we want to test. Given the effect size, we can choose or calculate both the type I (alpha) and type II (beta = 1 - power) errors. A related issue is estimating the effect size that corresponds to a statistical test for a given sample or dataset. In this case, the effect size is a random variable with statistical properties that we would like to know, such as confidence intervals. My initial motivation was to find the type II error (post hoc) for a chi-square goodness-of-fit test with a large sample. Since the sample size was large, I suspected that I had a very small type II error for a fixed alpha in the usual range of 0.05 or 0.1.

The last part of the onion is that I would like to connect everything related to a test. My target object (eventually) is a statistical test that has everything attached to it: the distribution under the null, the distribution under an alternative, confidence intervals, effect size estimates, and power calculations (and nice plots, and the kitchen sink).

The Examples: pwr versus statsmodels

For the following, I copied the examples from the pwr help pages into R, ran them, and copied my solution below each one. This is the first time that I have gone systematically through all the cases; I saw some things that I would still like to change, and it also shows where parts are still missing.

The main import is

>>> import statsmodels.stats.power as smp

but replacing the module import by the subpackage import should also work (replace smp by sms)

>>> import statsmodels.stats.api as sms

T-Tests

T-test with one sample

This is a test of the null hypothesis that the mean of a random sample is equal to a given value, under the assumption that the observations are normally distributed and the variance is estimated.

in R pwr:

> ## Exercise 2.5 p. 47 from Cohen (1988)
> p = pwr.t.test(d=0.2, n=60, sig.level=0.10, type="one.sample", alternative="two.sided")
> p$power
[1] 0.4555817640839868

in python (this calls the standalone power function):

>>> print smp.ttest_power(0.2, nobs=60, alpha=0.1, alternative='two-sided')
0.455581759963

We can also check the other alternatives

> p = pwr.t.test(d=0.2, n=60, sig.level=0.10, type="one.sample", alternative="less")
> p$power
[1] 0.002400866110904731
> p = pwr.t.test(d=0.2, n=60, sig.level=0.10, type="one.sample", alternative="greater")
> p$power
[1] 0.6013403206947192

and

>>> smp.ttest_power(0.2, nobs=60, alpha=0.1, alternative='smaller')
0.00240086591201
>>> smp.ttest_power(0.2, nobs=60, alpha=0.1, alternative='larger')
0.601340316701

and the same by calling the method

>>> smp.TTestPower().power(0.2, nobs=60, alpha=0.1, alternative='larger')
0.601340316701

Note: I like smaller and larger better than R's less and greater.

T-test with paired samples

> ## Exercise p. 50 from Cohen (1988)
> d<-8/(16*sqrt(2*(1-0.6)))
> d
[1] 0.5590169943749475
> p = pwr.t.test(d=d, n=40, sig.level=0.05, type="paired", alternative="two.sided")
> p$power
[1] 0.931524825320931

My three ways to call the same underlying function

>>> smp.TTestPower().solve_power(0.5590169943749475, nobs=40, alpha=0.05, alternative='two-sided')
0.93152481603307669
>>> smp.TTestPower().power(0.5590169943749475, nobs=40, alpha=0.05, alternative='two-sided')
0.93152481603307669
>>> smp.ttest_power(0.5590169943749475, nobs=40, alpha=0.05, alternative='two-sided')
0.93152481603307669

Note: The paired t-test is the same as the one-sample t-test applied to the pairwise differences of the data.

T-test with two independent samples

> ## Exercise 2.1 p. 40 from Cohen (1988)
> d<-2/2.8
> d
[1] 0.7142857142857143
> p = pwr.t.test(d=d, n=30, sig.level=0.05, type="two.sample", alternative="two.sided")
> p$power
[1] 0.7764888766441673

and

>>> smp.TTestIndPower().solve_power(2/2.8, nobs1=30, ratio=1, alpha=0.05, alternative='two-sided')
0.776488873187

Calculating the sample size to achieve a given power

> ## Exercise 2.10 p. 59
> p = pwr.t.test(d=0.3, power=0.75, sig.level=0.05, type="two.sample", alternative="greater")
> p$n
[1] 120.2232019425424

and

>>> smp.TTestIndPower().solve_power(0.3, power=0.75, ratio=1, alpha=0.05, alternative='larger')
array([ 120.22320279])

If we want unequal sample sizes, here is an example with twice as many observations in group 2 as in group 1:

>>> smp.TTestIndPower().solve_power(0.3, power=0.75, ratio=2, alpha=0.05, alternative='larger')
array([ 90.11015096])

In this case we need about 90 observations in sample 1, and about 180 observations in sample 2

>>> 90.11015096 * 2
180.22030192

Note: In contrast to the pwr package, I do not specify the number of observations for the second sample; instead I define ratio with nobs2 = nobs1 * ratio. I thought this would be more useful, since we can calculate the sample size to achieve a given power when we want, for example, twice as many observations in one sample as in the other.

For the z-test, the normal-distribution-based equivalent of the t-test, I allow ratio=0 to represent the one-sample test, so I'm currently using only one function for all three cases: one sample, and two independent samples with equal or unequal sample sizes.

The t-test for unequal sample sizes is a separate function in pwr; in my case it is the same function, with a default ratio=1, which means equal sample sizes in both groups.

> ## Exercise 2.3 p. 437 from Cohen (1988)
> p = pwr.t2n.test(d=0.6, n1=90, n2=60, alternative="greater")
> p$power
[1] 0.97372615462922

and

>>> smp.TTestIndPower().solve_power(0.6, nobs1=90, ratio=60./90, alpha=0.05, alternative='larger')
0.973726135488

Two-sided, which is the default alternative:

>>> smp.TTestIndPower().solve_power(0.6, nobs1=90, ratio=60./90, alpha=0.05)
0.947015360249

Tests for Proportions

Proportion for one sample (Normal Approximation)

This is a test that the proportion in a sample is equal to a hypothesized value using the normal distribution to approximate the distribution of the test statistic.

> ## Exercise 6.5 p. 203 from Cohen
> h<-ES.h(0.5,0.4)
> h
[1] 0.2013579207903309
> p = pwr.p.test(h=h, n=60, sig.level=0.05, alternative="two.sided")
> p$power
[1] 0.3447014091272153

>>> from statsmodels.stats._proportion import proportion_effectsize
>>> proportion_effectsize(0.4, 0.5)
0.20135792079033088
>>> smp.NormalIndPower().solve_power(0.2013579207903309, nobs1=60, alpha=0.05, ratio=0)
0.34470140912721525

TODO: It looks like proportion_effectsize escaped my checking before I merged the branch

>>> proportion_effectsize(0.5, 0.4)
-0.20135792079033088
>>> proportion_effectsize([0.5, 0.5], [0.4, 0.6])[1]
0.20135792079033066

Compared to pwr, the roles of prob1 and prob2 are reversed. I can't find any unit tests for it. Also, the two functions in _proportion, proportion_effectsize and confint_proportion, have not been added to stats.api yet.

Calculating the sample size for given power

> ## Exercise 6.8 p. 208
> p = pwr.p.test(h=0.2, power=0.95, sig.level=0.05, alternative="two.sided")
> p$n
[1] 324.867699185228

and

>>> smp.NormalIndPower().solve_power(0.2, power=0.95, alpha=0.05, ratio=0)
array([ 324.86772737])

Note: this currently returns an array with one element instead of just a number.

Tests for two Proportions (Normal Approximation)

This is a two sample test for equality of proportion.

Note: I did not look at this case as a test case, and I wasn't sure it would work out of the box until I started to write this post.

> ## Exercise 6.1 p. 198 from Cohen (1988)
> p = pwr.2p.test(h=0.3, n=80, sig.level=0.05, alternative="greater")
> p$power
[1] 0.5996777045464378

However, this is also just a test based on the normal distribution, so we can use the basic normal power class.

>>> smp.NormalIndPower().solve_power(0.3, nobs1=80, alpha=0.05, alternative='larger')
0.59967770454643765

Testing with different sample sizes (pwr uses a separate function; I just use ratio as in the t-tests):

> ## Exercise 6.3 P. 200 from Cohen (1988)
> p = pwr.2p2n.test(h=0.30, n1=80, n2=245, sig.level=0.05, alternative="greater")
> p$power
[1] 0.7532924472318865

and

>>> smp.NormalIndPower().solve_power(0.3, nobs1=80, alpha=0.05, ratio=245./80, alternative='larger')
0.75329244723188638

Finding the required size for the second sample to obtain a given power

> ## Exercise 6.7 p. 207 from Cohen (1988)
> p = pwr.2p2n.test(h=0.20, n1=1600, power=0.9, sig.level=0.01, alternative="two.sided")
> p$n2
[1] 484.6645457384055

and

>>> smp.NormalIndPower().solve_power(0.2, nobs1=1600, alpha=0.01, power=0.9, ratio=None, alternative='two-sided')
array([ 0.30291534])
>>> 1600 * _
484.66454494576033

So, besides the 1600 observations in the first sample, we also need about 484 observations in the second sample to get a power of 0.9. My version solves for the ratio, and we need to calculate the number of observations for the second sample from it. (Reminder: _ in the Python interpreter is just a stand-in for the last result.)

F-Tests

I started to write the power calculations for the F-tests based on an explanation on the G-Power website, so some of it doesn't follow the same conventions as the pwr version.

Technical Note: All my cases that use the non-central F distribution agree only to around 6 decimals. I have not checked yet whether this is a limitation of the implementation of ncf in scipy.

F-test for ANOVA

This is an F-test of the null hypothesis that the means of several groups are identical. Both pwr and my version only consider the balanced case, that is, the sample size is assumed to be equal in all groups.

> ## Exercise 8.1 P. 357 from Cohen (1988)
> p = pwr.anova.test(f=0.28, k=4, n=20, sig.level=0.05)
> p$power
[1] 0.514979292196866

and

>>> smp.FTestAnovaPower().power(0.28, 80, 0.05, k_groups=4)
0.51498349798649223

Note: In my case, nobs is the total sample size, while n in pwr is the sample size of each group.

Solving for the sample size (for each group in pwr, for the total in statsmodels):

> ## Exercise 8.10 p. 391
> p = pwr.anova.test(f=0.28, k=4, power=0.80, sig.level=0.05)
> p$n
[1] 35.75789484377913

and

>>> smp.FTestAnovaPower().solve_power(0.28, alpha=0.05, power=0.8, k_groups=4)
array([ 143.03088865])
>>> _ / 4
array([ 35.75772216])

We need about 143 observations in total, which makes about 36 observations per group. Rounding up, we need 36*4 = 144 observations in total.

F-test for Linear Model

The power calculation for this F-test requires that the user directly specify the degrees of freedom for the numerator and denominator.

> ## Exercise 9.1 P. 424 from Cohen (1988)
> p = pwr.f2.test(u=5, v=89, f2=0.1/(1-0.1), sig.level=0.05)
> p$power
[1] 0.6735857709143944

and

>>> import numpy as np
>>> smp.FTestPower().solve_power(effect_size=np.sqrt(0.1/(1-0.1)), df_num=89, df_denom=5, alpha=0.05)
0.67358904832076627

Note: Following G-Power, the effect size is defined in terms of f, not f2; that is, it is the square root of pwr's f2 argument.
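In Cohen's notation for the linear model, f2 = R2 / (1 - R2), so the pwr example above corresponds to R2 = 0.1, and the statsmodels call just takes the square root:

import numpy as np

f2 = 0.1 / (1 - 0.1)   # Cohen's f-squared for R-squared = 0.1
f = np.sqrt(f2)        # effect size as used in the FTestPower call above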

Chi-square Tests

Chi-square goodness-of-fit

This test was one of the first cases that I was interested in. For this case, I also added the effect size calculation, given two probability distributions or two frequency distributions. It has a few additional options compared to the pwr version.

Given two probability distributions, we can calculate the effect size and then the power of the test.

> ## Exercise 7.1 p. 249 from Cohen
> P0<-rep(1/4,4)
> P1<-c(0.375,rep((1-0.375)/3,3))
> P0
[1] 0.25 0.25 0.25 0.25
> P1
[1] 0.3750000000000000 0.2083333333333333 0.2083333333333333 0.2083333333333333
> ES.w1(P0,P1)
[1] 0.288675134594813
> p = pwr.chisq.test(w=ES.w1(P0,P1), N=100, df=(4-1))
> p$power
[1] 0.673983392432693

using python and statsmodels

>>> p0 = np.ones(4) / 4
>>> p1 = np.concatenate(([0.375], np.ones(3) * (1 - 0.375) / 3))
>>> import statsmodels.stats.api as sms
>>> es = sms.chisquare_effectsize(p0, p1)
>>> es
0.28867513459481292
>>> smp.GofChisquarePower().solve_power(es, nobs=100, n_bins=4, alpha=0.05)
0.67398350421626085

Note: I am using the number of bins, n_bins, as the argument, and not the degrees of freedom, df, as pwr does, since I always have problems remembering how the degrees of freedom are calculated. If no parameters of the distribution are estimated, then df = n_bins - 1. If there are parameters that are (appropriately) estimated, then df = n_bins - 1 - ddof, where ddof is the number of estimated parameters. This matches the definition of the arguments in scipy.stats.chisquare.
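For example (with made-up frequencies), scipy.stats.chisquare follows the same convention:

from scipy import stats

freq = [30, 24, 22, 24]          # hypothetical observed counts in 4 bins
stats.chisquare(freq)            # df = n_bins - 1 = 3, no estimated parameters
stats.chisquare(freq, ddof=1)    # df = n_bins - 1 - 1 = 2, one estimated parameter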

Chisquare test for contingency tables

This is another case that I had not looked at before. The chi-square test for independence in contingency tables is also in scipy.stats.

We can just use the chisquare test and adjust the degrees of freedom through n_bins.

Find the power, in R

> ## Exercise 7.3 p. 251
> p = pwr.chisq.test(w=0.346, df=(2-1)*(3-1), N=140, sig.level=0.01)
> p$power
[1] 0.88540528724145

and using statsmodels

>>> smp.GofChisquarePower().solve_power(0.346, nobs=140, n_bins=(2-1)*(3-1) + 1, alpha=0.01)
0.88540533954389766
>>> _ - 0.88540528724145
5.2302447706153998e-08

Finding the sample size to obtain a given power

> ## Exercise 7.8 p. 270
> p = pwr.chisq.test(w=0.1, df=(5-1)*(6-1), power=0.80, sig.level=0.05)
> p$N
[1] 2096.079183648519

and

>>> smp.GofChisquarePower().solve_power(0.1, n_bins=(5-1)*(6-1) + 1, alpha=0.05, power=0.8)
array([ 2096.07846526])

Looks close enough, but maybe we should add a convenience class for this case (and verify it for more cases).

Correlation

For completeness, let's also look at correlation, another case that I have skipped so far. I had only skimmed the pwr help page before.

The page mentions the following transformation, citing Cohen.

transf = lambda r, n: np.arctanh(r) + r / (2. * (n - 1))

However, this depends on the number of observations. The effect size and sample size cannot be calculated separately, so we cannot just reuse the existing power calculation. I also did not manage to match the result of pwr in some quick tries with the existing normal power function.

Let's try the simpler case of the plain Fisher z-transform first; see Wikipedia.

(I assume this uses a correction of three degrees of freedom, that is nobs - 3 as in the variance of the Fisher z-transform, but I have not read up on the details yet.)

The example in R

> ## Exercise 3.1 p. 96 from Cohen (1988)
> p = pwr.r.test(r=0.3,n=50,sig.level=0.05,alternative="two.sided")
> p$power
[1] 0.563888283901816
> p = pwr.r.test(r=0.3,n=50,sig.level=0.05,alternative="greater")
> p$power
[1] 0.6853182676922907

and my Fisher z-transform approximation

>>> smp.NormalIndPower().power(np.arctanh(0.3), nobs1=50-3, alpha=0.05, ratio=0, alternative='two-sided')
0.56436763896354003
>>> smp.NormalIndPower().power(np.arctanh(0.3), nobs1=50-3, alpha=0.05, ratio=0, alternative='larger')
0.68335663306958949

sample size calculations:

> ## Exercise 3.4 p. 208
> p = pwr.r.test(r=0.3,power=0.80,sig.level=0.05,alternative="two.sided")
> p$n
[1] 84.7489125589482
> p = pwr.r.test(r=0.5,power=0.80,sig.level=0.05,alternative="two.sided")
> p$n
[1] 28.8737622795125
> p = pwr.r.test(r=0.1,power=0.80,sig.level=0.05,alternative="two.sided")
> p$n
[1] 782.4485960810694

and the Fisher z-transform approximation

>>> [smp.NormalIndPower().solve_power(np.arctanh(r), alpha=0.05, ratio=0,
            power= 0.8, alternative='two-sided') + 3 for r in [0.3, 0.5, 0.1]]
[array([ 84.92761044]), array([ 29.01223669]), array([ 782.64821794])]

It looks close enough, given that we use a different approximation, but this needs more work and testing before it will go into statsmodels.

Finally

I would welcome feedback on this new part of statsmodels. Some of the interface issues are not fully settled yet. I have also added some alias functions for users who don't like the class creation, tt_solve_power, tt_ind_solve_power, and zt_ind_solve_power, but I'm not quite happy with them yet.

The root-finding in the power calculations is based on scipy.optimize. I have tested quite a few cases, but I think adjustments will be necessary as the functions get used in more cases. Relatedly, I am currently using only the generic root-finding algorithm, even though we can calculate the solution explicitly in some cases. It wouldn't be difficult to add those explicit solutions to the power classes.

I don't know yet how many power classes that are just small variations on the existing ones should be added, or whether we should instead add options to the existing power classes. Examples above are the chi-square contingency table and two-proportion tests, which might deserve a more explicit, specialized interface.

For some of the current power classes I did not add the corresponding effect size functions. One example: t-tests with different ways of estimating the population variance.

And then, of course, there are power and effect size calculations missing for many statistical tests.

Statsmodels is open source, and contributions for missing pieces would be welcome.

4 comments:

  1. Joe, I think you are doing a hell of an amount of great work. But I think it is really difficult for others to build on it, given the level of documentation in statsmodels. Should not what you have written here go into the docu of statsmodels? I think statsmodels would/could be used a LOT more if the documentation were improved. As you may have seen, I have started some stats-intro (http://work.thaslwanter.at/Stats/html/StatsFH.html), and will in the future try to incorporate introductory documentation for statsmodels. My statistical knowledge is narrow enough that I am hesitant to "mess" around with the stats-models documentation myself.

  2. Thomas, I think your statistics in python introduction looks very nice and is very needed to make python more popular in this area.

    The documentation on the power calculations will go into statsmodels in some form.
    My main reason for not adding much in terms of introductory information in statsmodels is the time constraint (and personal priorities), I already have a hard time keeping the minimal documentation and docstrings in a usable form.

    The requirements to improve the statsmodels documentation are not really high. There are many areas that need improvements, and in most cases it is not necessary to understand all the details. And, I would rather review some pull requests for improvements than coming up with the explanations and examples by myself.

  3. I came across this blog because I'm really struggling finding examples of how to use statsmodels, which functions to use, how to even just get the functions to work. To take statsmodels to the next level, and gain a wider audience, there really SHOULD be a more urgent effort to at least provide EXAMPLES of each function so a user doesn't have to screw around with figuring it out on their own with little help from the documentation.

    Think about it. If you make the best product but people don't know how to use it and get frustrated, many will just give up and find alternatives. Maybe that's the way statsmodels wants it. If they didn't want it that way, they'd provide clear examples like pandas does. That's why pandas is popular - because people can LEARN how to use it QUICKLY with EXAMPLES.

    If you have time constraints, think outside the box and let people submit examples to you so you just review and post them. Have users curate the content. Lots of alternative options rather than taking it on all by yourself.

  4. Hi Jarad and Zoey,
    Writing comprehensive documentation is a lot of work, and I usually stay with a topic just long enough to figure out the statistical background and the computational issues.

    statsmodels is very open to contributions, preferably as a pull request on github. There is also the possibility to contribute notebooks to or link notebooks from the wiki https://github.com/statsmodels/statsmodels/wiki/Examples . In some popular areas users are writing examples, notebooks and blog posts, but they usually don't end up in statsmodels because they are not submitted for inclusion.

    (I have neglected this blog for years now, because I'm not patient enough or don't want to allocate enough time to write something "comprehensible".)

    On the other hand, the documentation has improved considerably over the years; many developers, contributors and users provide documentation and notebooks. However, in some areas they are not easy to find, and in some areas code development moves faster than writing documentation.

    As an example, I just wrote an overview of which models can be used depending on the type (continuous, binary, count, ...) of the response variable: https://github.com/statsmodels/statsmodels/issues/2642 . This should be extended to several pages with examples or links to examples, but I doubt I will take the time for this right now; I will write some notebooks for specific topics in the next months.

    Another recommendation for specific questions is to ask on stackoverflow. The response time is in most cases very good. Helping there takes a limited amount of time, and points out some priorities when "improve statsmodels" is a task that is too large and daunting.
