joepy: python

Showing posts with label python. Show all posts

Monday, March 25, 2013

Multiple Comparison and Tukey HSD or why statsmodels is awful

Introduction

Statistical tests are often grouped into one-sample, two-sample and k-sample tests, depending on how many samples are involved in the test. In k-sample tests the usual Null hypothesis is that a statistic, for example the mean as in a one-way ANOVA, or the distribution in goodness-of-fit tests, is the same in all groups or samples. The common test is the joint test that all samples have the same value, against the alternative that at least one sample or group has a different value.

However, often we are not just interested in the joint hypothesis if all samples are the same, but we would also like to know for which pairs of samples the hypothesis of equal values is rejected. In this case we conduct several tests at the same time, one test for each pair of samples.

This results, as a consequence, in a multiple testing problem and we should correct our test distribution or p-values to account for this.

I mentioned some of the one- and two sample test in statsmodels before. Today, I just want to look at pairwise comparison of means. We have k samples and we want to test for each pair whether the mean is the same, or not.

Instead of adding more explanations here, I just want to point to R tutorial and also the brief description on Wikipedia. A search for "Tukey HSD" or multiple comparison on the internet will find many tutorials and explanations.

The following are examples in statsmodels and R interspersed with a few explanatory comments.

Statistical Power in Statsmodels

I merged last week a branch of mine into statsmodels that contains large parts of basic power calculations and some effect size calculations. The documentation is in this section . Some parts are still missing but I thought I have worked enough on this for a while.

(Adding the power calculation for a new test now takes approximately: 3 lines of real code, 200 lines of wrapping it with mostly boiler plate and docstrings, and 30 to 100 lines of tests.)

The first part contains some information on the implementation. In the second part, I compare the calls to the function in the R pwr package to the calls in my (statsmodels') version.

I am comparing it to the pwr package because I ended up writing almost all unit tests against it. The initial development was based on the SAS manual, I used the explanations on the G-Power website for F-tests, and some parts were initially written based on articles that I read. However, during testing I adjusted the options (and fixed bugs), so I was able to match the results to pwr, and I think pwr has just the right level of abstraction and easiness of use, that I ended up with code that is pretty close to it.

If you just want to see the examples, skip to the second part.

Having fun with expectations

Using the distribution classes in scipy.stats it is easy to calculate expectations of a function with respect to any distributions using numerical integration.

I’m going to write a function that calculates the expectation, then I attach it to the class of continuous distributions in scipy.stats. Finally we can use our new method with any existing distribution.

Warning: Monkey Patching a class can have unintended effects if the new or changed methods interfere with other uses. In this case we just add a new method, which does not effect any of the original use of the distributions.

import numpy as np
from scipy import stats, integrate

def expectedfunc(self, fn=None, args=(), lb=None, ub=None, conditional=False):
    '''calculate expected value of a function with respect to the distribution

    only for standard version of distribution,
    location and scale not tested

    Parameters
    ----------
        all parameters are keyword parameters
        fn : function (default: identity mapping)
           Function for which integral is calculated. Takes only one argument.
        args : tuple
           argument (parameters) of the distribution
        lb, ub : numbers
           lower and upper bound for integration, default is set to the support
           of the distribution
        conditional : boolean (False)
           If true then the integral is corrected by the conditional probability
           of the integration interval. The return value is the expectation
           of the function, conditional on being in the given interval.

    Returns
    -------
        expected value : float
    '''
    if fn is None:
        def fun(x, *args):
            return x*self.pdf(x, *args)
    else:
        def fun(x, *args):
            return fn(x)*self.pdf(x, *args)
    if lb is None:
        lb = self.a
    if ub is None:
        ub = self.b
    if conditional:
        invfac = self.sf(lb,*args) - self.sf(ub,*args)
    else:
        invfac = 1.0
    return integrate.quad(fun, lb, ub,
                                args=args)[0]/invfac

For now this is just a function where the first argument is a distribution instance, as they are available in scipy.stats. We can call this function to calculate the forth moment of the standard normal distribution and compare it with the moment of stats.norm

>>> print stats.norm.moment(4)
3.0
>>> print expectedfunc(stats.norm, lambda(x): (x)**4)
3.0

We obtain the same result, which means in this case our function works correctly.

Now we can attach it to stats.distributions.rv_continuous, which is the superclass of all continuous distributions. We could have alse used new.instancemethod which is, however, depreciated and will be removen in py3k.

>>> import types
>>> stats.distributions.rv_continuous.expectedfunc =
...       types.MethodType(expectedfunc,None,stats.distributions.rv_continuous)

>>> #print dir(stats.norm)
>>> print stats.norm.expectedfunc
<bound method norm_gen.expectedfunc of <scipy.stats.distributions.norm_gen object at 0x02122830>>

Examples

Here is the forth moment for both the normal and the t distribution. The t distribution requires one parameter, the degrees of freedom, which I set to 10 to get fatter tails:

>>> print stats.norm.expectedfunc(lambda(x): (x)**4)
3.0
>>> print stats.norm.moment(4)
3.0
>>> print stats.t.expectedfunc(lambda(x): (x)**4, 10)
6.25
>>> print stats.t.moment(4, 10)
6.25

Expectation of some additional functions:

>>> print stats.norm.expectedfunc(lambda(x): np.sqrt(np.abs(x)))
0.822178958662
>>> print stats.norm.expectedfunc(lambda(x): np.exp(-np.abs(x)))
0.52315658373
>>> print stats.norm.expectedfunc(lambda(x): np.exp(-x), lb=0)
0.261578291865

If our function is identical to one, and we use integration bounds on our integral, then we get the probability of the interval:

>>> print stats.norm.expectedfunc(lambda(x): 1, lb=-1, ub=0.5)
0.532807207343
>>> print stats.norm.cdf(0.5) - stats.norm.cdf(-1)
0.532807207343

Can we calculate the expectation of exp(x)?

>>> print stats.norm.expectedfunc(lambda(x): np.exp(x))
Warning: The ocurrence of roundoff error is detected, which prevents
  the requested tolerance from being achieved.  The error may be
  underestimated.
-1.#IND

The expectation of exp(x) with respect to the standard normal distribution is unbound, and our numerical integration function returns nan, not a number.

If we integrate with respect to a distribution with finite support, for example the rdistribution, rdist, then the expectation is finite:

>>> print stats.rdist.expectedfunc(lambda(x): np.exp(x),0.1)
1.49242160729

We can also try complex values:

>>> print stats.norm.expectedfunc(lambda(x): np.exp(1j*x))
0.606530659713

I have no idea if this is correct, but this is the basic calculation for the characteristic function of a distribution.

Next we can try out conditional expectation. As an example, we calculate the expectation of a standard normal random variable conditional on values being in the top decile, i.e. the expectation of all values in the top 10 %.

>>> lbdec = stats.norm.isf(0.1)
>>> print stats.norm.expectedfunc(lb=lbdec, conditional=True)
1.75498331932
>>> print expectedfunc(stats.norm, lb=lbdec, conditional=True)
1.75498331932
>>> print stats.norm.expectedfunc(lambda(x): 1, lb=-lbdec, ub=lbdec)
0.8
>>> #should be 0.8

What’s the variance if we truncate the normal distribution at the 0.1 and 0.9 deciles?

>>> print stats.norm.expectedfunc(lambda(x): x**2, lb=-lbdec, ub=lbdec,conditional=True)
0.437724594904
>>> print expectedfunc(stats.norm, lambda(x): x**2, lb=-lbdec, ub=lbdec,conditional=True)
0.437724594904

and verify the result with truncated normal

>>> print stats.truncnorm.moment(2,-lbdec,lbdec)
0.437724594904

>>> lbdect = stats.t.isf(0.1, 10)
>>> print stats.t.expectedfunc(args=(10,), lb=lbdect, conditional=True)
1.9892028422
>>> print expectedfunc(stats.t, args=(10,), lb=lbdect, conditional=True)
1.9892028422

The t distribution has fatter tails than the normal distribution, so the conditional expectation of the top decile is larger for the t distribution than for the normal distribution, 1.989 versus 1.755.

Sunday, March 15, 2009

Using the Gaussian Kernel Density Estimation

In scipy.stats we can find a class to estimate and use a gaussian kernel density estimator, scipy.stats.stats.gaussian_kde.

Until recently, I didn’t know how this part of scipy works, and the following describes roughly how I figured out what it does.

First, we can look at help to see whether the documentation provides enough information for its use. Here is the first part.

>>> help(stats.gaussian_kde)
Help on class gaussian_kde in module scipy.stats.kde:

class gaussian_kde(__builtin__.object)
 |  Representation of a kernel-density estimate using Gaussian kernels.
 |
 |   Parameters
 |   ----------
 |   dataset : (# of dims, # of data)-array
 |       datapoints to estimate from
 |
 |   Members
 |   -------
 |   d : int
 |       number of dimensions
 |   n : int
 |       number of datapoints
 |
 |   Methods
 |   -------
 |   kde.evaluate(points) : array
 |       evaluate the estimated pdf on a provided set of points
 |   kde(points) : array
 |       same as kde.evaluate(points)
 |   kde.integrate_gaussian(mean, cov) : float
 |       multiply pdf with a specified Gaussian and integrate over the whole domain
 |   kde.integrate_box_1d(low, high) : float
 |       integrate pdf (1D only) between two bounds
 |   kde.integrate_box(low_bounds, high_bounds) : float
 |       integrate pdf over a rectangular space between low_bounds and high_bounds
 |   kde.integrate_kde(other_kde) : float
 |       integrate two kernel density estimates multiplied together
 |
 |  Internal Methods
 |  ----------------
 |   kde.covariance_factor() : float
 |       computes the coefficient that multiplies the data covariance matrix to
 |       obtain the kernel covariance matrix. Set this method to
 |       kde.scotts_factor or kde.silverman_factor (or subclass to provide your
 |       own). The default is scotts_factor.
 |
 |  Methods defined here:
 |
 |  __call__ = evaluate(self, points)
 |
 |  __init__(self, dataset)
 |
 |  covariance_factor = scotts_factor(self)
 |
 |  evaluate(self, points)
 |      Evaluate the estimated pdf on a set of points.
 |
 ...

The first two basic methods that I was interested in is the initialization, __init__ just takes a data set as a argument, and then evaluate, or __call__ which takes as function argument the list of points at which we want to evaluated the estimated pdf.

Let’s just try this out:

First, I get the standard imports:

import numpy as np
from scipy import stats
import matplotlib.pylab as plt

then I generate a sample that I can feed to the kde, as usual for continuous variables, I start with a normal distribution:

n_basesample = 1000
np.random.seed(8765678)
xn = np.random.randn(n_basesample)

Now, we create an instance of the gaussian_kde class and feed our sample to it:

gkde=stats.gaussian_kde(xn)

We need some points at which we evaluate the density funtion for the estimated density function:

ind = np.linspace(-7,7,101)
kdepdf = gkde.evaluate(ind)

and finally we create the plot of the histogram of our data, together with the density that created our sample, the data generating process DGP, and finally the estimated density.

plt.figure()
# plot histgram of sample
plt.hist(xn, bins=20, normed=1)
# plot data generating density
plt.plot(ind, stats.norm.pdf(ind), color="r", label='DGP normal')
# plot estimated density
plt.plot(ind, kdepdf, label='kde', color="g")
plt.title('Kernel Density Estimation')
plt.legend()
#plt.show()

and this is the graph that we get

This graph looks pretty good, when the underlying distribution is the normal distribution, then the gaussian kernel density estimate follows very closely the true distribution, at least for a large sample as we used.

Now, to make it a bit more difficult we can look at a bimodal distribution, and see if it is still able to fit so well. As an example, I pick a mixture of two normal distributions, 60% of the sample comes from a normal distribution with mean -3 and standard deviation equal to one, 40% of the observation are from the normal distribution with mean 3 and the same standard deviation.

alpha = 0.6 #weight for (prob of) lower distribution
mlow, mhigh = (-3,3)  #mean locations for gaussian mixture
xn =  np.concatenate([mlow + np.random.randn(alpha * n_basesample),
               mhigh + np.random.randn((1-alpha) * n_basesample)])

With the new data set, we can run the same commands as above, except that we have to change the plot of the data generating density to use the mixture density instead:

plt.plot(ind, alpha * stats.norm.pdf(ind, loc=mlow) +
              (1-alpha) * stats.norm.pdf(ind, loc=mhigh),
              color="r", label='normal mix')

The next graph shows what we get in this case.

The estimated density is oversmoothing, the peaks of the estimated pdf are too small. So the automatic selection of the smoothing parameter doesn’t work in this case. Now, it’s time to find out how gaussian_kde actually selects the smoothing parameter or bandwith of the kernel.

But I’m out of time for today. We dig into this next time.

Saturday, March 14, 2009

Warmup: Fitting Distributions

As a warmup exercise, I generate some random samples, fit two distributions and plot the results. I also calculate the Kolmogorvo-Smirnov test using scipy.stats.kstest.

This is a shortened, simplified version of a script that I wrote to see how the dostrinutions in scipy.stats can be used to automatically fit some data and select the best fitting distribution.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def plothist(x, distfn, args, loc, scale, right=1):
    '''plot histogram and pdf, based on matplotlib doc example'''
    plt.figure()
    # the histogram of the data
    n, bins, patches = plt.hist(x, 25, normed=1, facecolor='green', alpha=0.75)
    maxheight = max([p.get_height() for p in patches])
    axlim = list(plt.axis())
    axlim[-1] = maxheight*1.05

    # add more points for density plot
    pdfpoints = bins + np.diff(bins)[0]*np.linspace(-0.5,0.5,11)[:,np.newaxis]
    pdfpoints = np.sort(pdfpoints.ravel())

    # calculate and plot the estimated density function
    yt = distfn.pdf(pdfpoints, loc=loc, scale=scale, *args)
    yt[yt>maxheight] = maxheight
    lt = plt.plot(pdfpoints, yt, 'r--', linewidth=1, label='estimated')
    # calculate and plot the density function that generated the data
    ys = stats.t.pdf(pdfpoints, 10, scale=10,)*right
    ls = plt.plot(pdfpoints, ys, 'b-', linewidth=1, label='true')

    plt.legend()
    plt.xlabel('values')
    plt.ylabel('Probability')
    plt.title(r'$\mathrm{Fitting\ Distribution\ %s :}\ \mu=%f,\ \sigma=%f$'%(distfn.name,loc,scale))
    plt.grid(True)
    plt.draw()


n = 500
dgp_arg = 10
dgp_scale = 10
np.random.seed(6543789)
rvs = stats.t.rvs(dgp_arg, scale=dgp_scale, size=n)
sm = rvs.mean()
sstd = np.sqrt(rvs.var())
ssupp = (rvs.min(), rvs.max())

for distr in [stats.norm, stats.t]:
    distname = distr.name
    # estimate parameters
    par_est = distr.fit(rvs,loc=sm, scale=sstd)
    print '\nFitting distribution %s' % distname
    print 'estimated distribution parameters\n', par_est
    arg_est = par_est[:-2]  # get scale parameters if any
    loc_est = par_est[-2]
    scale_est = par_est[-1]
    rvs_normed = (rvs-loc_est)/scale_est
    ks_stat, ks_pval = stats.kstest(rvs_normed,distname, arg_est)
    print 'ks-stat = %f, ks-pval = %f)' % (ks_stat, ks_pval)
    plothist(rvs, distr, arg_est, loc_est, scale_est, right = 1)
    plt.savefig('ex_dist1_%s.png'% distname)

#plt.show() # if we want to see it on screen

(Source code)

Output

The script produces the following output for the parameter estimate and the Kolmogorov-Smirnov test

Fitting distribution norm
estimated distribution parameters
[ -0.70287027  11.22248481]
ks-stat = 0.037073, ks-pval = 0.493706)

Fitting distribution t
estimated distribution parameters
[ 7.8518085  -0.69695469  9.71315677]
ks-stat = 0.019562, ks-pval = 0.990926)

The p-values of Kolmogorov-Smirnov test are derived under the assumption that we test against a known distribution and not against a distribution with estimated parameters. But in this example, the numbers look pretty informative, the p-values are large for both distributions, from which I conclude that both distributions fit overall the sample relatively well. The pvalue of the t-distribution is about twice the p-value of the normal distribution, which strongly suggest that the distribution was generated by a t-distribution and not by a normal distribution. Although, if we look only at the pvalue of the normal distribution, we wouldn't reject the hypothesis that the sample was generated by the normal distribution.

However, I want to emphasis that this is an informal interpretation, in which we can be quite confident and which we know to be correct since we generated the data, but that it is not the result of a formal statistical test for it.

script file ex_dist1.py

Editorial comments:

Sphinx and restructured text work well, and the highlighting with pygments also works without additional intervention, but adding the graphs to blogger requires some manual upload and editing of the formatting, which still takes too much time. I haven't figured out yet how to upload source files, the upload file button seems to be missing from my blog editor.

update: some google searches later, it seems that it is not possible to attach regular text files to blogger, so it needs to be hosted somewhere else

joepy

Monday, March 25, 2013

Multiple Comparison and Tukey HSD or why statsmodels is awful

Introduction

Sunday, March 17, 2013

Statistical Power in Statsmodels

Tuesday, April 28, 2009