Wednesday, October 17, 2012

TOST: statistically significant difference and equivalence

or "Look I found a dime"

The Story

Suppose we have two strategies (treatments) for making money. We want to test whether there is a difference in the payoffs that we get with the two strategies. Assume that we are confident enough to rely on t-tests, that is, the sample means are approximately normally distributed. For some reason, like transaction costs or cost differences, we don't care about the difference between the strategies if it is less than 50 cents.
To have an example, we can simulate two samples, and let's take a dime, 0.1, as the true difference:
payoff_s1 = sigma * np.random.randn(nobs)
payoff_s2 = 0.1 + sigma * np.random.randn(nobs)
I picked sigma=0.5 to get good numbers for the story.

Two Tests: t-test and TOST

We compare two tests: a standard t-test for independent samples and a test for equivalence, two one-sided tests (TOST):
stats.ttest_ind(payoff_s1, payoff_s2)
smws.tost_ind(payoff_s1, payoff_s2, -0.5, 0.5, usevar='pooled')
The null hypothesis for the t-test is that the two samples have the same mean. If the p-value of the t-test is below, say 0.05, we reject the hypothesis that the two means are the same. If the p-value is above 0.05, then we don't have enough evidence to reject the null hypothesis. This can also happen when the power of the test is not high enough given our sample size.
As the sample size increases, we have more information and the test becomes more powerful.
If the true means are different, then in large samples we will always reject the null hypothesis of equal means. (As the number of observations goes to infinity the probability of rejection goes to one if the means are different.)
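The effect of the sample size on power can be checked with a small Monte Carlo sketch. It reuses sigma = 0.5 and the true difference of 0.1 from the story; the seed, the number of replications and the name rejection_rate are arbitrary choices made for this illustration, not taken from the original script.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(12345)  # arbitrary seed, not the one used in the post
sigma, true_diff, n_rep = 0.5, 0.1, 2000

rejection_rate = {}
for nobs in [10, 100, 1000]:
    # each column is one simulated experiment with nobs observations per strategy
    s1 = sigma * rng.standard_normal((nobs, n_rep))
    s2 = true_diff + sigma * rng.standard_normal((nobs, n_rep))
    # two-sided t-test for every column at once
    pvalues = stats.ttest_ind(s1, s2, axis=0)[1]
    # fraction of experiments in which equal means is rejected at 0.05
    rejection_rate[nobs] = float(np.mean(pvalues < 0.05))

for nobs, rate in rejection_rate.items():
    print('nobs:', nobs, 'rejection rate:', rate)
```

The rejection rate should rise toward one as nobs grows, which is the large sample behavior described above.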
The second test, TOST, has as null hypothesis that the difference is outside an interval. In the symmetric case, this means that the absolute difference is at least as large as a given threshold. If the p-value is below 0.05, then we reject the null hypothesis that the two means differ by more than the threshold. If the p-value is above 0.05, we have insufficient evidence to reject the hypothesis that the two means differ by at least the threshold.
Note that the null hypotheses of the t-test and of TOST are reversed: rejection means a significant difference in the t-test and significant equivalence in TOST.
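To make the "two one-sided tests" idea concrete, here is a minimal sketch that builds a TOST p-value from two one-sided pooled-variance t-tests using only numpy and scipy. The function name tost_ind_sketch is hypothetical and this is not the statsmodels implementation; it assumes equal variances in the two samples and takes the larger of the two one-sided p-values as the overall p-value.

```python
import numpy as np
from scipy import stats

def tost_ind_sketch(x1, x2, low, upp):
    """TOST for mean(x1) - mean(x2) with a pooled variance (illustration only)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    diff = x1.mean() - x2.mean()
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / df
    se = np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    # first one-sided test, H0: diff <= low against H1: diff > low
    p_low = stats.t.sf((diff - low) / se, df)
    # second one-sided test, H0: diff >= upp against H1: diff < upp
    p_upp = stats.t.cdf((diff - upp) / se, df)
    # equivalence is claimed only if both one-sided tests reject
    return max(p_low, p_upp), p_low, p_upp

# usage with simulated payoffs as in the story (arbitrary seed)
rng = np.random.default_rng(0)
payoff_s1 = 0.5 * rng.standard_normal(200)
payoff_s2 = 0.1 + 0.5 * rng.standard_normal(200)
print(tost_ind_sketch(payoff_s1, payoff_s2, -0.5, 0.5))
```

With low = upp = 0 the two one-sided p-values sum to one, and twice the smaller one is the usual two-sided t-test p-value, which is a quick way to check the sketch against stats.ttest_ind.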

The Results

Looking at the simulated results:
small sample size:
nobs: 10 diff in means: -0.14039151695
ttest: 0.606109617438 not different    tost: 0.0977715582206 different
With 10 observations the information is not enough to reject the null hypothesis in either test. The t-test says we cannot reject that they are the same. The TOST test says we cannot reject that they are different (by at least 0.5).
medium sample size:
nobs: 100 diff in means: 0.131634043864
ttest: 0.0757146249227 not different    tost: 6.39909387346e-07 not different
The t-test does not reject that they are the same at a significance level of 0.05. The TOST test now rejects the hypothesis that there is a large (at least 0.5) difference.
large sample size:
nobs: 1000 diff in means: 0.107020981612
ttest: 1.51161249802e-06 different        tost: 1.23092818968e-65 not different
Both tests now reject their null hypotheses. The t-test rejects that the means are the same. However, the difference in means is only about 0.1, so the statistically significant difference is not large enough that we really care. Statistical significance doesn't mean it's also an important difference. The TOST test strongly rejects that there is a difference of at least 0.5, indicating that, given our threshold of 0.5, the two strategies are equivalent.
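The three cases can be summarized by reading the two p-values together. The helper below is a hypothetical illustration (the name interpret and the wording of the verdicts are not part of the original script); it only combines the two decisions at a common significance level.

```python
def interpret(p_ttest, p_tost, alpha=0.05):
    """Combine the t-test and TOST decisions into one verdict (illustration only)."""
    different = p_ttest < alpha    # reject H0: the means are equal
    equivalent = p_tost < alpha    # reject H0: the difference is outside the margin
    if different and equivalent:
        return 'statistically different, but within the equivalence margin'
    if equivalent:
        return 'equivalent, no significant difference'
    if different:
        return 'different, equivalence not shown'
    return 'inconclusive'

# p-values (rounded) from the three sample sizes reported above
for nobs, p1, p2 in [(10, 0.606, 0.0978), (100, 0.0757, 6.4e-07), (1000, 1.5e-06, 1.2e-65)]:
    print('nobs:', nobs, interpret(p1, p2))
```

For the simulated results this gives "inconclusive" for 10 observations, equivalence without a significant difference for 100, and a statistically significant but practically irrelevant difference for 1000.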

The Script

import numpy as np
from scipy import stats
import statsmodels.stats.weightstats as smws

nobs_all = [10, 100, 1000]
sigma = 0.5

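seed = 628561  # chosen to produce a nice result in the small sample
print(seed)
np.random.seed(seed)  # the seed has to be set to have an effect
for nobs in nobs_all:
    payoff_s1 = sigma * np.random.randn(nobs)
    payoff_s2 = 0.1 + sigma * np.random.randn(nobs)

    # p-value of the two-sided t-test, H0: equal means
    p1 = stats.ttest_ind(payoff_s1, payoff_s2)[1]
    # p-value of TOST, H0: difference outside (-0.5, 0.5)
    # (called tost_ind when this was written, ttost_ind in released statsmodels)
    p2 = smws.ttost_ind(payoff_s1, payoff_s2, -0.5, 0.5, usevar='pooled')[0]

    print('nobs:', nobs, 'diff in means:', payoff_s2.mean() - payoff_s1.mean())
    print('ttest:', p1, ['not different', 'different    '][int(p1 < 0.05)], end=' ')
    print('   tost:', p2, ['different    ', 'not different'][int(p2 < 0.05)])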


The t-tests are available in scipy.stats. I wrote the first version of the paired-sample TOST based just on a scipy.stats t-test. My new versions, including tost_ind, will soon come to statsmodels.
Editorial note:
I looked at tests for equivalence like TOST a while ago in response to some discussion on the scipy-user mailing list about statistical significance. This time I mainly coded, and spent some time looking at how to verify my code against SAS and R. Finding references and quotes is left to the reader or to another time. There are some controversies around TOST and some problems with it, but from all I saw, it's still the most widely accepted approach and is recommended by the US government for bioequivalence tests.


  1. What if the distribution is not normal?

  2. The TOST (two one-sided tests) principle is very general, and we can convert many of the usual hypothesis tests into TOSTs.

    I worked a bit more on this since I wrote this article. I now also have TOST for proportion and for the normal distribution, proportion_ztost, binom_tost and ztost, and renamed the t-test version to ttost.

    As far as I understand, all the usual properties and assumptions for parametric hypothesis tests also apply to the TOST version. I haven't looked at whether there are TOST or similar tests for nonparametric hypothesis tests (Mann-Whitney U, Wilcoxon, ...).

    For parametric tests, we have:

    In large samples the normal distribution is often a good approximation (asymptotically)

    We can transform the statistic so it is closer to normal, for example Fisher z-transform for proportions.

    We can create TOST tests based on other distributions; binom_tost for proportions uses the binomial distribution. I haven't looked yet at tests for equality of variances. The problem with tests based on the F and chi-square distributions is that they don't have a one-sided test in the standard formulation, so we would need a signed version.
    Another idea that I haven't looked at is kstost, a Kolmogorov-Smirnov goodness-of-fit test as a TOST to see whether distributions are "equivalent".
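
    As an illustration of a TOST built on a distribution other than the t distribution, here is a minimal sketch of an exact binomial version for one proportion, using only scipy. The name binom_tost_sketch and the example numbers are made up for this illustration; it follows the same two one-sided tests logic but is not claimed to match the statsmodels binom_tost code.

```python
from scipy import stats

def binom_tost_sketch(count, nobs, low, upp):
    """Exact binomial TOST for H0: p <= low or p >= upp (illustration only)."""
    # one-sided test against the lower bound: P(X >= count) if p = low
    p_low = stats.binom.sf(count - 1, nobs, low)
    # one-sided test against the upper bound: P(X <= count) if p = upp
    p_upp = stats.binom.cdf(count, nobs, upp)
    # equivalence requires both one-sided tests to reject
    return max(p_low, p_upp), p_low, p_upp

# 50 successes in 100 trials, equivalence margin (0.4, 0.6) for the proportion
print(binom_tost_sketch(50, 100, 0.4, 0.6))
```

    A narrower margin, for example (0.49, 0.51), needs far more observations before the same observed proportion counts as equivalent.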