
Friday, March 15, 2013

Different Fields - Different Problems: Effect Size

or What's your scale?

Effect Size

I have been working on and off for a while now on adding statistical power calculations to statsmodels. One of the topics I ran into is the effect size.

At the beginning, I wasn't quite sure what to make of it. While I was working on power calculations, it just seemed to be a convenient way of specifying the distance between the alternative and the null hypothesis. However, there were references that sounded like it was something special and important. This was my first message to the mailing list:

A classical alternative to NHST (null-hypothesis significance testing):
report your effect size and confidence intervals

http://en.wikipedia.org/wiki/Effect_size

http://onlinelibrary.wiley.com/doi/10.1111/j.1469-185X.2007.00027.x/abstract

http://onlinelibrary.wiley.com/doi/10.1111/j.1460-9568.2011.07902.x/abstract

I bumped into this while looking for power analysis.

MIA in python, as far as I can tell.

Now what is the fuss all about?

Scaling Issues

Today I finally found some good motivating quotes:

"In the behavioral, educational, and social sciences (BESS), units of measurement are many times arbitrary, in the sense that there is no necessary reason why the measurement instrument is based on a particular scaling. Many, but certainly not all, constructs dealt with in the BESS are not directly observable and the instruments used to measure such constructs do not generally have a natural scaling metric as do many measures, for example, in the physical sciences."

and

"However, effects sizes based on raw scores are not always helpful or generalizable due to the lack of natural scaling metrics and multiple scales existing for the same phenomenon in the BESS. A common methodological suggestion in the BESS is to report standardized effect sizes in order to facilitate the interpretation of results and for the cumulation of scientific knowledge across studies, which is the goal of meta-analysis (<...>). A standardized effect size is an effect size that describes the size of the effect but that does not depend on any particular measurement scale."

The two quotes are from the introduction in "Confidence Intervals for Standardized Effect Sizes: Theory, Application, and Implementation" by Ken Kelley http://www.jstatsoft.org/v20/a08 .

Large parts of the literature that I was browsing or reading on this are in psychology journals. This can also be seen in the list of references on the Wikipedia page on effect size.

One additional part that I found puzzling was the definition of "conventional" effect sizes by Cohen: "For Cohen's d an effect size of 0.2 to 0.3 might be a "small" effect, around 0.5 a "medium" effect and 0.8 to infinity, a "large" effect." (sentence from the Wikipedia page)

"Small" what? small potatoes, small reduction in the number of deaths, low wages? or, "I'm almost indifferent" (+0 on the mailing lists)?

Where I come from

Now it's clearer why I haven't seen this in my traditional area, economics and econometrics.

Although economics falls into BESS (in the SS part), it has a long tradition of using a common scale: money. Physical units also show up in some questions.

National Income Accounting tries to measure the economy with money as a unit. (And if something doesn't have a price associated with it, then it's ignored by most. That's another problem. Or, we make a price.) There are many measurement problems, but there is also a large industry to figure out common standards.

Effect sizes have a scale that is "natural":

  • What's the increase in lifetime salary, if you attend business school?
  • What's the increase in sales (in Dollars, or physical units) if you lower the price?
  • What's the increase in sales if you run an advertising campaign?
  • What's your rate of return if you invest in stocks?

Effects might not be easy to estimate, or cannot be estimated accurately, but we don't need a long debate about what to report as the effect.

Post Scripts

(i) I just saw the table at the end of this SAS page http://support.sas.com/documentation/cdl/en/statug/65328/HTML/default/viewer.htm#statug_glm_details22.htm . I love replicating SAS tables, but will refrain for now, since I am supposed to go back to other things.

(ii) I started my last round of work on this because I was looking at effect size as a distance measure for a chisquare goodness-of-fit test. When the sample size is very large, small deviations from the null hypothesis will cause a statistical test to reject the null, even if the effect, the difference from the null, is for all practical purposes irrelevant. My recent preferred solution to this is to switch to an equivalence test or something similar: not testing the hypothesis that the effect is exactly zero, but testing whether the effect is "small" or not.
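To illustrate the sample-size problem with a small sketch (my own example, not from the original analysis): Cohen's effect size w for the chisquare goodness-of-fit test depends only on the hypothesized and observed proportions, while the test statistic is roughly nobs * w**2 and therefore grows with the sample size:

import numpy as np
from scipy import stats

p_null = np.array([0.25, 0.25, 0.25, 0.25])   # null hypothesis proportions
p_obs = np.array([0.26, 0.24, 0.25, 0.25])    # observed proportions, tiny deviation

# Cohen's effect size w, independent of the number of observations
w = np.sqrt(((p_obs - p_null)**2 / p_null).sum())

for nobs in [100, 10000, 1000000]:
    chi2_stat = nobs * w**2                   # chisquare statistic for these proportions
    pval = stats.chi2.sf(chi2_stat, len(p_null) - 1)
    print 'nobs:', nobs, 'effect size:', w, 'p-value:', pval

The effect size stays the same tiny number, but with a million observations the p-value becomes essentially zero.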

(iii) I have several plans for blog posts (cohens_kappa, power onion) but never found the quiet time or urge to actually write them.

Wednesday, October 17, 2012

TOST: statistically significant difference and equivalence

or "Look I found a dime"

The Story

Suppose we have two strategies (treatments) for making money. We want to test whether there is a difference in the payoffs that we get with the two strategies. Assume that we are confident enough to rely on t-tests, that is, that the means are approximately normally distributed. For some reason, like transaction costs or cost differences, we don't care about the difference between the strategies if it is less than 50 cents.
To have an example, we can simulate two samples; let's take as the true difference a dime, 0.1:
payoff_s1 = sigma * np.random.randn(nobs)
payoff_s2 = 0.1 + sigma * np.random.randn(nobs)
I picked sigma=0.5 to get good numbers for the story.

Two Tests: t-test and TOST

We compare two tests: a standard t-test for independent samples and a test for equivalence, two one-sided tests, TOST:
stats.ttest_ind(payoff_s1, payoff_s2)
smws.tost_ind(payoff_s1, payoff_s2, -0.5, 0.5, usevar='pooled')
The null hypothesis for the t-test is that the two samples have the same mean. If the p-value of the t-test is below, say 0.05, we reject the hypothesis that the two means are the same. If the p-value is above 0.05, then we don't have enough evidence to reject the null hypothesis. This can also happen when the power of the test is not high enough given our sample size.
As the sample size increases, we have more information and the test becomes more powerful.
If the true means are different, then in large samples we will always reject the null hypothesis of equal means. (As the number of observations goes to infinity the probability of rejection goes to one if the means are different.)
The second test, TOST, has as its null hypothesis that the difference is outside an interval. In the symmetric case, this means that the absolute difference is at least as large as a given threshold. If the p-value is below 0.05, then we reject the null hypothesis that the two means differ by more than the threshold. If the p-value is above 0.05, we have insufficient evidence to reject the hypothesis that the two means differ by at least the threshold.
Note that the null hypotheses of the t-test and of TOST are reversed: rejection means a significant difference in the t-test and significant equivalence in TOST.
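To make the construction concrete, here is a rough sketch of how an independent-sample TOST can be pieced together from two one-sided t-tests against the interval bounds. It assumes pooled variances, the function name is my own, and the actual statsmodels tost_ind may differ in details:

import numpy as np
from scipy import stats

def tost_ind_sketch(x1, x2, low, upp):
    '''p-value of two one-sided t-tests for equivalence of two independent means
    (illustrative sketch with pooled variance, not the statsmodels implementation)'''
    nobs1, nobs2 = len(x1), len(x2)
    diff = x1.mean() - x2.mean()
    # pooled variance and standard error of the difference in means
    var_pooled = ((nobs1 - 1) * x1.var(ddof=1) + (nobs2 - 1) * x2.var(ddof=1)) \
                 / (nobs1 + nobs2 - 2.)
    std_diff = np.sqrt(var_pooled * (1. / nobs1 + 1. / nobs2))
    df = nobs1 + nobs2 - 2
    # first one-sided test, H0: diff <= low against H1: diff > low
    p_low = stats.t.sf((diff - low) / std_diff, df)
    # second one-sided test, H0: diff >= upp against H1: diff < upp
    p_upp = stats.t.cdf((diff - upp) / std_diff, df)
    # equivalence is rejected only if both one-sided tests reject
    return max(p_low, p_upp)

With the simulated samples from the script below, tost_ind_sketch(payoff_s1, payoff_s2, -0.5, 0.5) should give essentially the same p-value as smws.tost_ind with usevar='pooled'.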

The Results

Looking at the simulated results:
small sample size:
nobs: 10 diff in means: -0.14039151695
ttest: 0.606109617438 not different    tost: 0.0977715582206 different
With 10 observations the information is not enough to reject the null hypothesis in either test. The t-test cannot reject that the means are the same; the TOST cannot reject that they differ by at least 0.5. Neither test is conclusive at this sample size.
medium sample size:
nobs: 100 diff in means: 0.131634043864
ttest: 0.0757146249227 not different    tost: 6.39909387346e-07 not different
The t-test does not reject that the means are the same at a significance level of 0.05. The TOST now rejects the hypothesis that there is a large (at least 0.5) difference.
large sample size:
nobs: 1000 diff in means: 0.107020981612
ttest: 1.51161249802e-06 different        tost: 1.23092818968e-65 not different
Both tests now reject their null hypothesis. The t-test rejects that the means are the same. However, the difference in means is only about 0.1, so the statistically significant difference is not large enough that we really care; statistical significance doesn't mean it is also an important difference. The TOST strongly rejects that there is a difference of at least 0.5, indicating that, given our threshold of 0.5, the two strategies are equivalent for our purposes.

The Script

import numpy as np
from scipy import stats
import statsmodels.stats.weightstats as smws

nobs_all = [10, 100, 1000]
sigma = 0.5

seed = 628561  #chosen to produce nice result in small sample
print seed
for nobs in nobs_all:
    np.random.seed(seed)
    payoff_s1 = sigma * np.random.randn(nobs)
    payoff_s2 = 0.1 + sigma * np.random.randn(nobs)

    p1 = stats.ttest_ind(payoff_s1, payoff_s2)[1]
    p2 = smws.tost_ind(payoff_s1, payoff_s2, -0.5, 0.5, usevar='pooled')[0]

    print 'nobs:', nobs, 'diff in means:', payoff_s2.mean() - payoff_s1.mean()
    print 'ttest:', p1,    ['not different', 'different    '][p1 < 0.05],
    print '   tost:', p2, ['different    ', 'not different'][p2 < 0.05]

Notes

Implementation:
The t-tests are available in scipy.stats. I wrote the first version, for a paired-sample TOST, just based on a scipy.stats t-test: https://gist.github.com/3900314 . My new versions, including tost_ind, will soon come to statsmodels.
Editorial note:
I looked at tests for equivalence like TOST a while ago in response to some discussion on the scipy-user mailing list about statistical significance. This time I mainly coded, and spent some time looking at how to verify my code against SAS and R. Finding references and quotes is left to the reader or to another time. There are some controversies around TOST and some problems with it, but from all I saw, it's still the most widely accepted approach and is recommended by the US government for bio-equivalence tests.

Tuesday, April 28, 2009

Having fun with expectations

Using the distribution classes in scipy.stats, it is easy to calculate the expectation of a function with respect to any distribution using numerical integration.

I’m going to write a function that calculates the expectation, then I attach it to the class of continuous distributions in scipy.stats. Finally we can use our new method with any existing distribution.

Warning: monkey patching a class can have unintended effects if the new or changed methods interfere with other uses. In this case we just add a new method, which does not affect any of the original use of the distributions.

import numpy as np
from scipy import stats, integrate

def expectedfunc(self, fn=None, args=(), lb=None, ub=None, conditional=False):
    '''calculate expected value of a function with respect to the distribution

    only for standard version of distribution,
    location and scale not tested

    Parameters
    ----------
        all parameters are keyword parameters
        fn : function (default: identity mapping)
           Function for which integral is calculated. Takes only one argument.
        args : tuple
           argument (parameters) of the distribution
        lb, ub : numbers
           lower and upper bound for integration, default is set to the support
           of the distribution
        conditional : boolean (False)
           If true then the integral is corrected by the conditional probability
           of the integration interval. The return value is the expectation
           of the function, conditional on being in the given interval.

    Returns
    -------
        expected value : float
    '''
    if fn is None:
        def fun(x, *args):
            return x*self.pdf(x, *args)
    else:
        def fun(x, *args):
            return fn(x)*self.pdf(x, *args)
    if lb is None:
        lb = self.a
    if ub is None:
        ub = self.b
    if conditional:
        invfac = self.sf(lb,*args) - self.sf(ub,*args)
    else:
        invfac = 1.0
    return integrate.quad(fun, lb, ub,
                                args=args)[0]/invfac

For now this is just a function where the first argument is a distribution instance, as they are available in scipy.stats. We can call this function to calculate the fourth moment of the standard normal distribution and compare it with the moment method of stats.norm:

>>> print stats.norm.moment(4)
3.0
>>> print expectedfunc(stats.norm, lambda(x): (x)**4)
3.0

We obtain the same result, which means in this case our function works correctly.

Now we can attach it to stats.distributions.rv_continuous, which is the superclass of all continuous distributions. We could have also used new.instancemethod, which is, however, deprecated and will be removed in py3k.

>>> import types
>>> stats.distributions.rv_continuous.expectedfunc = \
...       types.MethodType(expectedfunc, None, stats.distributions.rv_continuous)
>>> #print dir(stats.norm)
>>> print stats.norm.expectedfunc
<bound method norm_gen.expectedfunc of <scipy.stats.distributions.norm_gen object at 0x02122830>>

Examples

Here is the fourth moment for both the normal and the t distribution. The t distribution requires one parameter, the degrees of freedom, which I set to 10 to get fatter tails:

>>> print stats.norm.expectedfunc(lambda(x): (x)**4)
3.0
>>> print stats.norm.moment(4)
3.0
>>> print stats.t.expectedfunc(lambda(x): (x)**4, 10)
6.25
>>> print stats.t.moment(4, 10)
6.25

Expectation of some additional functions:

>>> print stats.norm.expectedfunc(lambda(x): np.sqrt(np.abs(x)))
0.822178958662
>>> print stats.norm.expectedfunc(lambda(x): np.exp(-np.abs(x)))
0.52315658373
>>> print stats.norm.expectedfunc(lambda(x): np.exp(-x), lb=0)
0.261578291865

If our function is identical to one, and we use integration bounds on our integral, then we get the probability of the interval:

>>> print stats.norm.expectedfunc(lambda(x): 1, lb=-1, ub=0.5)
0.532807207343
>>> print stats.norm.cdf(0.5) - stats.norm.cdf(-1)
0.532807207343

Can we calculate the expectation of exp(x)?

>>> print stats.norm.expectedfunc(lambda(x): np.exp(x))
Warning: The ocurrence of roundoff error is detected, which prevents
  the requested tolerance from being achieved.  The error may be
  underestimated.
-1.#IND

The expectation of exp(x) with respect to the standard normal distribution is actually finite, it is the moment generating function evaluated at one, exp(0.5), roughly 1.649, but the numerical integration over the infinite interval runs into trouble here and returns NaN, not a number (-1.#IND is how Windows prints it). Restricting the integration to a wide but finite interval should recover the correct value.

If we integrate with respect to a distribution with finite support, for example the R-distribution, stats.rdist, then the numerical integration has no problems:

>>> print stats.rdist.expectedfunc(lambda(x): np.exp(x),0.1)
1.49242160729

We can also try complex values:

>>> print stats.norm.expectedfunc(lambda(x): np.exp(1j*x))
0.606530659713

I have no idea if this is correct, but this is the basic calculation for the characteristic function of a distribution.
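In fact, we can check it (a check I added here, not part of the original session): the characteristic function of the standard normal distribution is exp(-t**2/2), so at t=1 we expect exp(-0.5), which matches the value above.

>>> print np.exp(-0.5)
0.606530659713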

Next we can try out conditional expectation. As an example, we calculate the expectation of a standard normal random variable conditional on values being in the top decile, i.e. the expectation of all values in the top 10 %.

>>> lbdec = stats.norm.isf(0.1)
>>> print stats.norm.expectedfunc(lb=lbdec, conditional=True)
1.75498331932
>>> print expectedfunc(stats.norm, lb=lbdec, conditional=True)
1.75498331932
>>> print stats.norm.expectedfunc(lambda(x): 1, lb=-lbdec, ub=lbdec)
0.8
>>> #should be 0.8

What’s the variance if we truncate the normal distribution at the 10% and 90% quantiles?

>>> print stats.norm.expectedfunc(lambda(x): x**2, lb=-lbdec, ub=lbdec,conditional=True)
0.437724594904
>>> print expectedfunc(stats.norm, lambda(x): x**2, lb=-lbdec, ub=lbdec,conditional=True)
0.437724594904

and verify the result with the truncated normal distribution, stats.truncnorm:

>>> print stats.truncnorm.moment(2,-lbdec,lbdec)
0.437724594904
We can do the same exercise for the t distribution with 10 degrees of freedom, conditioning on its top decile:

>>> lbdect = stats.t.isf(0.1, 10)
>>> print stats.t.expectedfunc(args=(10,), lb=lbdect, conditional=True)
1.9892028422
>>> print expectedfunc(stats.t, args=(10,), lb=lbdect, conditional=True)
1.9892028422

The t distribution has fatter tails than the normal distribution, so the conditional expectation of the top decile is larger for the t distribution than for the normal distribution, 1.989 versus 1.755.

Sunday, March 15, 2009

Using the Gaussian Kernel Density Estimation

In scipy.stats we can find a class to estimate and use a Gaussian kernel density estimator, scipy.stats.gaussian_kde.

Until recently, I didn’t know how this part of scipy works, and the following describes roughly how I figured out what it does.

First, we can look at help to see whether the documentation provides enough information for its use. Here is the first part.

>>> help(stats.gaussian_kde)
Help on class gaussian_kde in module scipy.stats.kde:

class gaussian_kde(__builtin__.object)
 |  Representation of a kernel-density estimate using Gaussian kernels.
 |
 |   Parameters
 |   ----------
 |   dataset : (# of dims, # of data)-array
 |       datapoints to estimate from
 |
 |   Members
 |   -------
 |   d : int
 |       number of dimensions
 |   n : int
 |       number of datapoints
 |
 |   Methods
 |   -------
 |   kde.evaluate(points) : array
 |       evaluate the estimated pdf on a provided set of points
 |   kde(points) : array
 |       same as kde.evaluate(points)
 |   kde.integrate_gaussian(mean, cov) : float
 |       multiply pdf with a specified Gaussian and integrate over the whole domain
 |   kde.integrate_box_1d(low, high) : float
 |       integrate pdf (1D only) between two bounds
 |   kde.integrate_box(low_bounds, high_bounds) : float
 |       integrate pdf over a rectangular space between low_bounds and high_bounds
 |   kde.integrate_kde(other_kde) : float
 |       integrate two kernel density estimates multiplied together
 |
 |  Internal Methods
 |  ----------------
 |   kde.covariance_factor() : float
 |       computes the coefficient that multiplies the data covariance matrix to
 |       obtain the kernel covariance matrix. Set this method to
 |       kde.scotts_factor or kde.silverman_factor (or subclass to provide your
 |       own). The default is scotts_factor.
 |
 |  Methods defined here:
 |
 |  __call__ = evaluate(self, points)
 |
 |  __init__(self, dataset)
 |
 |  covariance_factor = scotts_factor(self)
 |
 |  evaluate(self, points)
 |      Evaluate the estimated pdf on a set of points.
 |
 ...

The first two basic methods that I was interested in are the initialization, __init__, which just takes a dataset as its argument, and evaluate, or __call__, which takes as its argument the list of points at which we want to evaluate the estimated pdf.

Let’s just try this out:

First, I get the standard imports:

import numpy as np
from scipy import stats
import matplotlib.pylab as plt

then I generate a sample that I can feed to the kde; as usual for continuous variables, I start with a normal distribution:

n_basesample = 1000
np.random.seed(8765678)
xn = np.random.randn(n_basesample)

Now, we create an instance of the gaussian_kde class and feed our sample to it:

gkde=stats.gaussian_kde(xn)

We need some points at which to evaluate the estimated density function:

ind = np.linspace(-7,7,101)
kdepdf = gkde.evaluate(ind)

and finally we plot the histogram of our data, together with the density that generated our sample (the data generating process, DGP) and the estimated density.

plt.figure()
# plot histogram of sample
plt.hist(xn, bins=20, normed=1)
# plot data generating density
plt.plot(ind, stats.norm.pdf(ind), color="r", label='DGP normal')
# plot estimated density
plt.plot(ind, kdepdf, label='kde', color="g")
plt.title('Kernel Density Estimation')
plt.legend()
#plt.show()

and this is the graph that we get

kde_normal.png

This graph looks pretty good: when the underlying distribution is the normal distribution, the Gaussian kernel density estimate follows the true distribution very closely, at least for a large sample like the one we used.

Now, to make it a bit more difficult, we can look at a bimodal distribution and see whether the estimator is still able to fit it so well. As an example, I pick a mixture of two normal distributions: 60% of the sample comes from a normal distribution with mean -3 and standard deviation equal to one, and 40% of the observations come from a normal distribution with mean 3 and the same standard deviation.

alpha = 0.6 #weight for (prob of) lower distribution
mlow, mhigh = (-3,3)  #mean locations for gaussian mixture
# the sample sizes passed to randn must be integers
xn = np.concatenate([mlow + np.random.randn(int(alpha * n_basesample)),
               mhigh + np.random.randn(int((1 - alpha) * n_basesample))])

With the new data set, we can run the same commands as above, except that we have to change the plot of the data generating density to use the mixture density instead:

plt.plot(ind, alpha * stats.norm.pdf(ind, loc=mlow) +
              (1-alpha) * stats.norm.pdf(ind, loc=mhigh),
              color="r", label='normal mix')

The next graph shows what we get in this case.

kde_mixture.png

The estimated density is oversmoothed; the peaks of the estimated pdf are too small. So the automatic selection of the smoothing parameter doesn't work in this case. Now it's time to find out how gaussian_kde actually selects the smoothing parameter, or bandwidth, of the kernel.
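As a preview, and only as a sketch based on the "Internal Methods" section of the help text above: the documented way to change the bandwidth is to override covariance_factor, for example by subclassing gaussian_kde. The class name and the factor 0.5 below are my own arbitrary choices to show the mechanism, not a recommended bandwidth.

class gaussian_kde_narrow(stats.gaussian_kde):
    '''gaussian_kde with a hand-picked, smaller bandwidth factor (illustration only)'''
    def covariance_factor(self):
        # shrink Scott's rule-of-thumb factor to reduce the oversmoothing
        return 0.5 * self.scotts_factor()

gkde_narrow = gaussian_kde_narrow(xn)
plt.plot(ind, gkde_narrow.evaluate(ind), color="b", label='kde, smaller bandwidth')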

But I’m out of time for today. We’ll dig into this next time.

Saturday, March 14, 2009

Warmup: Fitting Distributions

As a warmup exercise, I generate some random samples, fit two distributions, and plot the results. I also calculate the Kolmogorov-Smirnov test using scipy.stats.kstest.

This is a shortened, simplified version of a script that I wrote to see how the distributions in scipy.stats can be used to automatically fit some data and select the best fitting distribution.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def plothist(x, distfn, args, loc, scale, right=1):
    '''plot histogram and pdf, based on matplotlib doc example'''
    plt.figure()
    # the histogram of the data
    n, bins, patches = plt.hist(x, 25, normed=1, facecolor='green', alpha=0.75)
    maxheight = max([p.get_height() for p in patches])
    axlim = list(plt.axis())
    axlim[-1] = maxheight*1.05

    # add more points for density plot
    pdfpoints = bins + np.diff(bins)[0]*np.linspace(-0.5,0.5,11)[:,np.newaxis]
    pdfpoints = np.sort(pdfpoints.ravel())

    # calculate and plot the estimated density function
    yt = distfn.pdf(pdfpoints, loc=loc, scale=scale, *args)
    yt[yt>maxheight] = maxheight
    lt = plt.plot(pdfpoints, yt, 'r--', linewidth=1, label='estimated')
    # calculate and plot the density function that generated the data
    ys = stats.t.pdf(pdfpoints, 10, scale=10,)*right
    ls = plt.plot(pdfpoints, ys, 'b-', linewidth=1, label='true')

    plt.legend()
    plt.xlabel('values')
    plt.ylabel('Probability')
    plt.title(r'$\mathrm{Fitting\ Distribution\ %s :}\ \mu=%f,\ \sigma=%f$'%(distfn.name,loc,scale))
    plt.grid(True)
    plt.draw()


n = 500
dgp_arg = 10
dgp_scale = 10
np.random.seed(6543789)
rvs = stats.t.rvs(dgp_arg, scale=dgp_scale, size=n)
sm = rvs.mean()
sstd = np.sqrt(rvs.var())
ssupp = (rvs.min(), rvs.max())

for distr in [stats.norm, stats.t]:
    distname = distr.name
    # estimate parameters
    par_est = distr.fit(rvs,loc=sm, scale=sstd)
    print '\nFitting distribution %s' % distname
    print 'estimated distribution parameters\n', par_est
    arg_est = par_est[:-2]  # get shape parameters if any
    loc_est = par_est[-2]
    scale_est = par_est[-1]
    rvs_normed = (rvs-loc_est)/scale_est
    ks_stat, ks_pval = stats.kstest(rvs_normed,distname, arg_est)
    print 'ks-stat = %f, ks-pval = %f' % (ks_stat, ks_pval)
    plothist(rvs, distr, arg_est, loc_est, scale_est, right = 1)
    plt.savefig('ex_dist1_%s.png'% distname)

#plt.show() # if we want to see it on screen

Output

ex_dist1_norm.png
ex_dist1_t.png

The script produces the following output for the parameter estimate and the Kolmogorov-Smirnov test

Fitting distribution norm
estimated distribution parameters
[ -0.70287027  11.22248481]
ks-stat = 0.037073, ks-pval = 0.493706

Fitting distribution t
estimated distribution parameters
[ 7.8518085  -0.69695469  9.71315677]
ks-stat = 0.019562, ks-pval = 0.990926

The p-values of the Kolmogorov-Smirnov test are derived under the assumption that we test against a known distribution, not against a distribution with estimated parameters. But in this example, the numbers still look pretty informative: the p-values are large for both distributions, from which I conclude that both distributions fit the sample relatively well overall. The p-value of the t distribution is about twice the p-value of the normal distribution, which suggests that the sample was generated by a t distribution rather than a normal distribution. Although, if we looked only at the p-value of the normal distribution, we wouldn't reject the hypothesis that the sample was generated by a normal distribution.

However, I want to emphasize that this is an informal interpretation, one in which we can be quite confident and which we know to be correct since we generated the data, but it is not the result of a formal statistical test.
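One common way to account for the estimated parameters, not done here, is a parametric bootstrap of the Kolmogorov-Smirnov statistic: simulate samples from the fitted distribution, refit and recompute the statistic for each, and compare the observed statistic with that bootstrap distribution. The helper below is only a sketch of the idea, with names of my own choosing:

def ks_pval_bootstrap(distr, rvs, n_boot=200):
    '''parametric bootstrap p-value for a KS test with estimated parameters (sketch)'''
    par = distr.fit(rvs)
    arg, loc, scale = par[:-2], par[-2], par[-1]
    ks_obs = stats.kstest((rvs - loc) / scale, distr.name, arg)[0]
    count = 0
    for i in range(n_boot):
        # simulate from the fitted distribution and refit before testing
        sim = distr.rvs(loc=loc, scale=scale, size=len(rvs), *arg)
        par_b = distr.fit(sim)
        arg_b, loc_b, scale_b = par_b[:-2], par_b[-2], par_b[-1]
        ks_b = stats.kstest((sim - loc_b) / scale_b, distr.name, arg_b)[0]
        count += (ks_b >= ks_obs)
    return count / float(n_boot)

For the two fits above, this could be called as ks_pval_bootstrap(stats.norm, rvs) and ks_pval_bootstrap(stats.t, rvs).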

script file ex_dist1.py

Editorial comments:

Sphinx and restructured text work well, and the highlighting with pygments also works without additional intervention, but adding the graphs to blogger requires some manual upload and editing of the formatting, which still takes too much time. I haven't figured out yet how to upload source files; the upload file button seems to be missing from my blog editor.

Update: some Google searches later, it seems that it is not possible to attach regular text files to Blogger, so the script needs to be hosted somewhere else.