
Wednesday, December 5, 2012

Visual Inspection of Random Numbers

This is another post showing what a few lines of matplotlib can produce.

Background

When I wrote the test suite for scipy.stats.distributions, I had to mark quite a few distributions as slow so that they are skipped under the normal test runs. One of the reasons that some distributions are slow is that the generic random number generation is very indirect if only the density function is available.

For some time I was looking at spline interpolation of the inverse cumulative distribution function, ppf, as an approximate way of generating random numbers. However, since scipy has no splines that impose monotonicity, that did not work.

Finally, I wrote a script that just uses linear interpolation of the cdf of a distribution, using scipy.interpolate.interp1d so we can use standard inversion to create random numbers. To check whether the interpolation looks accurate enough, I went to "proof by plotting".

The interpolating random number generator takes about 0.3 seconds for one million random numbers, not counting the setup cost of creating the interpolator. The script is currently just a quick hack to see if it works.
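
The core of the idea, as a minimal sketch (not the actual script; the grid, the tail cutoffs and the example distribution are choices I make up here):

import numpy as np
from scipy import stats, interpolate

# grid on the support (cut off far in the tails) and cdf values on the grid
distr = stats.t(5)
grid = np.linspace(distr.ppf(1e-6), distr.ppf(1 - 1e-6), 1000)
cdfvals = distr.cdf(grid)

# linear interpolation of (cdf values -> grid points) gives an approximate ppf,
# so standard inversion just pushes uniform random numbers through it
approx_ppf = interpolate.interp1d(cdfvals, grid)
u = np.random.uniform(cdfvals[0], cdfvals[-1], size=1000000)
rvs = approx_ppf(u)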

The Plots

As an example I took the t distribution with 5 degrees of freedom, which has somewhat heavy tails. I calculated the approximation for 1000 intervals, and then for 10 and 20 intervals as contrast.

Since a large part of the "action" is in the tails, and I want those to be reasonably accurate, I could not just look at a regular histogram, where the tails are not very visible. So I looked at two variations: one with a log scale, the other where the bin widths are chosen so that each bin has equal probability instead of equal length.
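
For the equal-probability bins, the edges can be taken from the ppf at equally spaced probabilities; a one-line sketch (my own choice of cutoffs, assuming the true t(5) distribution is used for the edges):

import numpy as np
from scipy import stats

# 50 bins that each have (almost) the same probability under t(5)
edges = stats.t.ppf(np.linspace(0.001, 0.999, 51), 5)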

The results are the following four plots, with equal-length bins in the first row, equal-probability bins in the second row, and a linear scale on the left side and log-scaled probabilities on the right side. With 1000 segments in the interpolation, I don't see any systematic deviation of the random numbers from the true distribution. Below is the qqplot, generated with statsmodels, that indicates that the random numbers are consistent with a t(5) distribution.
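
The qqplot itself is a one-liner in statsmodels; roughly (with a stand-in for the interpolation-based random numbers):

import statsmodels.api as sm
from scipy import stats

rvs = stats.t.rvs(5, size=10000)   # stand-in for the interpolation-based random numbers
fig = sm.qqplot(rvs, dist=stats.t, distargs=(5,), line='45')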




As a contrast, below are the same kind of plots for 20 intervals in the interpolation, which corresponds to a symmetric step-function density with 20 intervals, many of them close to zero. The histogram clearly shows the steps; the qqplot shows systematic curved segments, which are even more visible in the qqplot for 10 intervals.


The plots for 10 intervals are in my gallery: histogram and qqplot.

Saturday, December 1, 2012

Characteristic Functions and scipy.stats

scipy.stats.distributions is among other things a nice formula collection.

One piece that is missing is the characteristic functions of the distributions.

Wikipedia is very good for large areas of statistics; for some details and examples see http://en.wikipedia.org/wiki/Characteristic_function_%28probability_theory%29 . Wikipedia also lists the characteristic function on the pages of many individual distributions.

I wrote something like the script below already several times (for the easy cases).

The characteristic function for the normal distribution is easy, but looking at the characteristic function of the t-distribution, I wish someone had translated it into code already.

t-distribution

Since I haven't seen it yet, I sat down and tried it myself. I managed to code the characteristic function of the t-distribution, but it returns NaNs when it is evaluated close to zero for large df.
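
For reference, the formula that the script below implements is, for $\nu$ degrees of freedom,

$$\varphi(t) = \frac{K_{\nu/2}(\sqrt{\nu}\,|t|)\,(\sqrt{\nu}\,|t|)^{\nu/2}}{\Gamma(\nu/2)\,2^{\nu/2-1}}$$

where $K$ is the modified Bessel function of the second kind; the $K_{\nu/2}$ factor is exactly where the numerical trouble near zero comes from.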

I didn't find a Bessel "k" function that works in this case:

>>> special.kv(50/2., 1e-30)
inf
>>> special.kve(50/2., 1e-30)
inf

The t-distribution approaches the normal distribution as the shape parameter, the degrees of freedom, gets large. So, the characteristic function of the t-distribution should be well behaved for large df. However, the individual terms go to infinity or zero.

Since in my current case, I don't care about the behavior of the characteristic function around zero, I stopped trying to get a better implementation.

Warning: monkey patching included in the script

Aside: I cannot make up my mind whether the abbreviation for characteristic function should be chf or cf. I have both versions in various scripts that I wrote.

The script

Here's my current script

# -*- coding: utf-8 -*-
"""Characteristic Functions

Created on Fri Nov 30 22:43:36 2012
Author: Josef Perktold
"""

import numpy as np
from scipy import stats, special

def chf_normal(t, loc=0, scale=1):
    '''characteristic function of normal distribution

    Parameters
    ----------
    t : array_like
        points at which characteristic function is evaluated
    loc : float (or array_like ?)
        mean of underlying normal distribution
    scale : float (or array_like ?)
        standard deviation, scale of normal distribution

    Returns
    -------
    chfval : ndarray
        characteristic function evaluated at t

    Notes
    -----
    Can be used for higher dimensional arguments if ``t``, ``loc`` and ``scale``
    broadcast.

    '''
    t = np.asarray(t)
    return np.exp(1j * t * loc - 0.5 * t**2 * scale**2)


def chf_t(t, df, loc=0, scale=1):
    '''characteristic function of t distribution

    breaks down for large df and t close to zero
    '''
    t = np.asarray(t)
    vhalf = df / 2.
    term = np.sqrt(df) * np.abs(t*scale)
    cf = special.kv(vhalf, term) * np.power(term, vhalf)
    cf = cf / special.gamma(vhalf) / 2**(vhalf - 1)
    cf = cf * np.exp(1j * loc * t)
    if cf.shape == () and t == 0:
        #special case: kv(0) returns nan
        #for df>15 or so, we also get nans in the result in neighborhood of 0
        cf = np.ones((), cf.dtype)  #return same dtype as cf would
    else:
        cf[t == 0] = 1

    return cf

def chf_t_(t, df, loc=0, scale=1):
    #not much, but a bit better with log
    vhalf = df / 2.
    term = np.sqrt(df) * np.abs(t*scale)
    cf = np.log(special.kv(vhalf, term)) + vhalf * np.log(term)
    cf = cf - special.gammaln(vhalf) - (vhalf - 1) * np.log(2)
    cf = cf + 1j * loc * t
    if cf.shape == () and t == 0:
        #special case: kv(0) returns nan
        #for df>15 or so, we also get nans in the result in neighborhood of 0
        cf = np.zeros((), cf.dtype)  #return same dtype as cf would
    else:
        cf[t == 0] = 0
    return np.exp(cf)


def chfn(self, t, *args, **kwds):
    return chf_normal(t, *args, **kwds)

#monkeypatch scipy.stats
stats.distributions.norm_gen._chf = chfn

t = np.linspace(-1, 1, 11)
print stats.norm._chf(t, loc=1, scale=2)
print chf_t(t, 50, loc=1, scale=2)
print chf_t_(t, 50, loc=1, scale=2)

Editorial note: I had written this initially for the scipy-user mailing list. (Old habits are difficult to break.) But I remembered just before hitting Send that the recommendation is to put it in a blog.

Application and Plot: Wrapped Circular T Distribution

As explained in my previous post, once we have the characteristic function of a distribution defined on the real line, it is simple to get the Fourier approximation for the wrapped circular distribution. As an application of the characteristic function of the t distribution, I constructed the wrapped circular distributions.
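
In code, the construction needs only a few lines once the characteristic function is available; a rough sketch using chf_t from the script above (degrees of freedom, grid and number of terms are arbitrary choices here):

import numpy as np

theta = np.linspace(-np.pi, np.pi, 181)
df = 5
# Fourier series of the wrapped density, built from the characteristic
# function evaluated at the integers k = 1, ..., 10
pdf_wrapped = np.ones_like(theta) / (2 * np.pi)
for k in range(1, 11):
    cf_k = chf_t(k, df)
    pdf_wrapped += (cf_k.real * np.cos(k * theta)
                    + cf_k.imag * np.sin(k * theta)) / np.pi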

The following plot shows an example of the density functions of the wrapped Cauchy, the wrapped normal distribution, and the wrapped t distribution for a few values of the degrees of freedom. Normal and Cauchy distributions are the two extreme cases of the t distribution, when the degrees of freedom go to infinity and when the degrees of freedom is one, respectively.

The distributions in the plot have the same location and scale parameters. However, this implies that the variances of the distributions are not the same. As a different comparison, we could have adjusted the scale parameters to obtain distributions with identical variance. The latter is the more informative comparison when we are estimating the parameters from data, since the estimated distribution then reflects a variance similar to that of the data.

The fatter tails of the Cauchy distribution and of the t distributions with few degrees of freedom are clearly visible in the plot.

Tuesday, November 20, 2012

Orthogonal Series and Wrapped Circular Distribution

This is just a quick follow-up on the previous posting.

recommended reading: Mardia and Jupp, section 3.5.7 on wrapped distributions http://www.amazon.com/Directional-Statistics-Kanti-V-Mardia/dp/0471953334

To construct a wrapped distribution on a circle, we can take a distribution that is defined on the real line, like the normal, Cauchy, t or stable distribution, and wrap it around the circle. Essentially, it's just taking the support modulo 2 pi and adding the overlapping densities. For some distributions the wrapped density has a nice closed-form expression, for example the wrapped Cauchy distribution, which is also available in scipy.stats.
For other distributions, the density is given as an infinite sum which, however, converges very fast in many cases.
Mardia and Jupp show how to construct the series representation of the wrapped distribution from the characteristic function of the original, not wrapped distribution.
The basic idea is that for circular wrapped distributions the characteristic function is only evaluated at the integers, and we can construct the Fourier expansion of the wrapped density directly from the real and imaginary parts of the characteristic function. (In contrast, for densities on the real line we need a continuous inverse Fourier transform that involves integration.)
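
Written out (this is my reading of the result, not a quote), if $\varphi$ is the characteristic function of the unwrapped distribution and $f_w$ the wrapped density on $(-\pi, \pi]$, then

$$f_w(\theta) = \sum_{k=-\infty}^{\infty} f(\theta + 2\pi k) = \frac{1}{2\pi}\Bigl(1 + 2\sum_{k=1}^{\infty}\bigl[\operatorname{Re}\,\varphi(k)\,\cos(k\theta) + \operatorname{Im}\,\varphi(k)\,\sin(k\theta)\bigr]\Bigr)$$

so only the values $\varphi(1), \varphi(2), \ldots$ of the characteristic function at the integers are needed, and truncating the sum gives the finite Fourier approximation.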

To see that it works, I did a "proof by plotting"

For the wrapped Cauchy distribution, I can use scipy.stats.wrapcauchy.pdf as a check. For both the wrapped Cauchy and the wrapped normal distribution, I also coded the series from Mardia and Jupp's book directly (pdf-series1 in the plots). I also drew a large number (10000) of random numbers to be able to compare with the histogram. The generic construction from only the characteristic function is pdf-series2-chf in the plots. I used 10 terms in the series representation.
The plots are a bit "boring" because all two, respectively three, lines for the density coincide up to a few decimals.

Here's the wrapped Cauchy:


And here's the wrapped normal distribution:

Sunday, November 18, 2012

Density Estimation with Orthogonal Series - circular data

Background

Orthogonal series are very useful. If we have a basis $(g_i)_{i \in N}$ for some function space (usually with smoothness and integrability conditions), then we can represent a function as a linear combination of the basis functions:
$$f(x) = \sum_{i \in N} c_i g_i(x)$$
To get an approximation to the function f, we can truncate after some finite number of terms. (N is all positive integers.)
Orthonormal polynomials are convenient for density estimation, because we can directly estimate the coefficients $c_i$ from the data without having to run a non-linear optimization. In the basic case, we just need to calculate the moments of the data.
The orthogonality and normalization of the basis functions are defined with respect to a weighting function $w$:
$$\int g_i(x)\, g_j(x)\, w(x)\, dx = \begin{cases} 0 & \text{if } i \neq j \\ 1 & \text{if } i = j \end{cases}$$
In the case of estimating or approximating a density, we can use a reference density as the weighting function. The first term then corresponds to the reference density, and higher order terms are deviations from the reference density. This forms the basis for smooth goodness-of-fit tests. It is also very similar to series expansions of distributions, for example the Gram-Charlier expansion, where the reference density is the normal distribution and the higher order terms are based on Hermite polynomials.
In the basic form, we can just add the weighting function to the expansion above:
$$f(x) = \sum_{i \in N} c_i g_i(x) w(x)$$
However, these kinds of series expansions do not necessarily produce densities that are non-negative over the full range of the density function. As a consequence, several non-linear transformations have been introduced in the literature, for example squaring or taking the exponential. The transformed expansions always result in non-negative densities. However, they lose the simple estimation property and have to be estimated with non-linear optimization. (I haven't actually coded any of those yet.)
These series approximations to densities can be extended to the multivariate case, but I haven't coded those yet either.
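
As a concrete sketch of the basic (untransformed) version with the normal reference density and Hermite polynomials, using numpy.polynomial (the helper name, the truncation order, and the assumption that the data are roughly standardized are all my own choices):

import numpy as np
from math import factorial
from numpy.polynomial import hermite_e   # probabilists' Hermite polynomials He_k
from scipy import stats

def hermite_density(data, x, order=6):
    '''basic orthogonal series density estimate, normal reference density'''
    x = np.asarray(x)
    dens = np.zeros(x.shape)
    for k in range(order + 1):
        coef = np.zeros(k + 1)
        coef[k] = 1.0 / np.sqrt(factorial(k))        # orthonormal w.r.t. the normal weight
        c_k = hermite_e.hermeval(data, coef).mean()  # coefficient estimated from the data
        dens += c_k * hermite_e.hermeval(x, coef)
    return dens * stats.norm.pdf(x)                  # multiply by the reference density w(x)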

The Quest

I got started with this after a "random" reading, "Orthogonal series density estimation" http://wires.wiley.com/WileyCDA/WiresArticle/wisId-WICS97.html and later "Smooth tests of goodness of fit" http://wires.wiley.com/WileyCDA/WiresArticle/wisId-WICS171.html Both papers give well motivated introductions.
In the meantime I have read dozens more papers in this direction. The latest is a lot more theoretical http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2141299 and goes into continuous-time stochastic processes, which I'm not yet ready to go back to; along a similar line, there is an orthonormal series variance estimator http://onlinelibrary.wiley.com/doi/10.1111/j.1368-423X.2012.00390.x/abstract
scipy.special has a nice collection of orthogonal polynomials. Now numpy.polynomial also has a good implementation of orthogonal polynomials, but it was not available when I started with this. The scipy.special documentation is a bit "thin". It is good enough when we know what we are looking for, but not very helpful when we only have a vague idea what kind of "animals" those functions are.
The first problem was to find the normalization, since the polynomials in scipy are orthogonal but not orthonormal. http://old.nabble.com/orthogonal-polynomials---tt31619489.html
Also, on the way I had to figure out how to construct orthonormal polynomials for an arbitrary given weight function (reference density), and learn about recurrence equations and how we can construct and evaluate orthogonal polynomials. Neither of those are standard training where I come from.
Plots of some of my previous results can be seen in my gallery. Two examples:

Fourier polynomials


 and Hermite polynomials (black line, green line is a normal distribution)

The latest Installment

Recently, I started to figure out the basics of circular or directional statistics, see for example http://en.wikipedia.org/wiki/Directional_statistics .
Trying to understand the usual smooth goodness-of-fit tests, I read http://onlinelibrary.wiley.com/doi/10.1111/j.1467-842X.2009.00558.x/abstract . However, orthonormal polynomials on the unit circle are "different". To get orthogonal polynomials with the Von Mises distribution as the weight function, we need Verblunsky coefficients and the Szego recurrence. Now what are those? Searching with Google, I didn't find any basic explanations. I don't really want to get a math book on the topic (by Barry Simon) and read it.
http://old.nabble.com/Orthogonal-polynomials-on-the-unit-circle-tt34608320.html
To get started with something easier, I went back to orthogonal polynomials with a uniform weight function, that is, without weights. In this case, the polynomials are just trigonometric functions, or Fourier series.
An explanation and application that imposes additionally non-negativity of the density function (which I do not impose in my code) is http://onlinelibrary.wiley.com/doi/10.1111/j.0006-341X.2004.00195.x/full
The coefficients of the series approximation are just the circular moments of the underlying distribution. We can calculate those for a given distribution, or we can calculate the empirical moments from the data.
Detour: scipy.integrate.quad
An underused feature of scipy.integrate.quad is that we can use a weight function. For example, calculating the cosine and sine parts of the circular moments can be done with
integrate.quad(pdf_func, low, upp, weight='cos', wvar=k)
integrate.quad(pdf_func, low, upp, weight='sin', wvar=k)
which calculate the cosine and sine parts of the k-th circular moment of a circular distribution given by pdf_func. The integration limits are either (0, 2π) or (-π, π). We cannot integrate with the complex definition:
integrate.quad(lambda x: np.exp(1j*k*x)*pdf_func(x, *args), low, upp)
because quad throws away the imaginary part and issues a warning about the casting to float.
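
A small sketch of both routes, the population moments via the quad weight option and the empirical moments directly from the data (the von Mises example, the sample size and the number of terms are arbitrary choices here):

import numpy as np
from scipy import stats, integrate

kappa, n_terms = 2.0, 15
pdf_func = stats.vonmises(kappa).pdf

# cosine and sine parts of the circular moments, by integration
mom_cos = [integrate.quad(pdf_func, -np.pi, np.pi, weight='cos', wvar=k)[0]
           for k in range(1, n_terms + 1)]
mom_sin = [integrate.quad(pdf_func, -np.pi, np.pi, weight='sin', wvar=k)[0]
           for k in range(1, n_terms + 1)]

# the empirical circular moments are just sample averages
data = stats.vonmises.rvs(kappa, size=1000)
emp_cos = [np.cos(k * data).mean() for k in range(1, n_terms + 1)]
emp_sin = [np.sin(k * data).mean() for k in range(1, n_terms + 1)]

# Fourier series density estimate built from the empirical circular moments
theta = np.linspace(-np.pi, np.pi, 181)
dens = np.ones_like(theta) / (2 * np.pi)
for k in range(1, n_terms + 1):
    dens += (emp_cos[k - 1] * np.cos(k * theta)
             + emp_sin[k - 1] * np.sin(k * theta)) / np.pi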

The Plots

And now, the plots. I drew random numbers from a two-component mixture of Von Mises distributions [1]. The plots contain the histogram of the data and the estimated density based on the trigonometric series. For reference, they also contain the density of the data generating process, the mixture distribution, and the density given by the 15-component series based on the circular moments of the data generating distribution (calculated by integration as above). With 15 components, the series density based on the true moments is essentially indistinguishable from the true density.

First plot: 10000 observations, which makes the histogram and estimated moments close to the true density and moments.

Second plot: 200 observations. Given the randomness in the data, the histogram is pretty uneven (for the chosen number of bins). I fitted 3 components in the series density estimate.



Third and fourth plots: 1000 observations, in one plot I used 5 components, in the other plot I used 15 components in the series density. The density with 15 components is fitting random "bumps" in the histogram.

Some Comments

Orthogonal series expansion could be or is very useful. The advantage compared to kernel density estimation is that it is much faster and we do not need to keep the original data for evaluating the density. All we need are the coefficients for the series. It also works better on bounded support than kernel density estimation. One of the disadvantages is that it is a global method and will not be able to adjust locally if there are regions with different features, unless we sufficiently increase the number of terms in the series. Increasing the number of terms will make the density estimate more sensitive to random effects.
My impression is that orthogonal series expansions for densities are limited in their usefulness when the distribution contains a shape parameter and not just location and scale. A while ago, I wrote the recursion for the polynomial series with the Poisson as the weight function. It can be used for testing whether a distribution is Poisson, as in the paper I used as reference. However, I finally came to the conclusion that this is not really so useful, since in many cases we want count regression, with the shape parameter being a function of some explanatory variables. The series expansion of the Poisson distribution is specific to a given shape parameter, which means that we cannot create the orthonormal basis independently of the regressors. I also have not seen any articles that use orthogonal expansions in regression outside the normal distribution case, as far as I remember.
One of the main missing pieces in my code is automatic selection of the bandwidth or of the optimal penalization. For the former, we need to select the number of components in the series expansion. For the latter, we use a larger number of terms but need to find an increasingly strong penalization for higher order terms. I only know of one journal article that derives the penalization for Fourier series on the real line.
Related to the last point: one of the main pieces of work that George and Ralph did during GSOC last summer is automatic bandwidth selection for kernel density estimation and kernel regression in the new nonparametric extension of statsmodels. There are many other new features besides this. statsmodels will get good coverage of kernel methods when the branch is merged, which will happen very soon.
(My code is mostly only part of my private collection of "dirty" scripts.)
[1] I just concatenated the data and didn't randomize the number of observations in each component.
Editorial Note: I'm still using rst2blogger with the default settings. I am able to write Latex math in restructured text for use with sphinx, which I used for the draft. With rst2blogger the default is conversion to mathml, which does not recognize all the Latex math that I was using, and some fine-tuning got lost. Additionally, the math does not display in Internet Explorer on my computer.
PS: Where does your black box start?
Just a thought after reading this again.
Sometimes I'm happy to use somebody else's code or recipes without worrying about why it works. Sometimes I have to dig in myself because there are no recipes available. But often I have to dig in because I feel a need to understand whatever I'm using and I cannot rest until my ignorance is sufficiently reduced (or I realize that the mountain is too big.)
And after five or more years of Statistics in Python, I can safely say that I learned a lot about topics that I never heard of before.

Monday, November 5, 2012

Polar Histogram

Just posting two plots from my gallery to show what we can do with matplotlib, numpy and scipy. (No public code for now.)
Both plots show the histogram of the data and the density function (pdf) of the Von Mises distribution with estimated mu and kappa.
The first one shows arrival times on a 24 hour clock.
The second one shows wind direction (zero is north, but I didn't use that offset in this plot).
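
The recipe for the density part is short; the Von Mises density is $\exp(\kappa \cos(\theta - \mu)) / (2\pi I_0(\kappa))$, and a rough sketch of plotting it on polar axes (mu and kappa are placeholders here, not the estimates from my data) is

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

mu, kappa = 1.0, 2.0                       # placeholders for the estimated parameters
theta = np.linspace(0, 2 * np.pi, 361)
ax = plt.subplot(111, projection='polar')  # polar axes, as in the plots above
ax.plot(theta, stats.vonmises.pdf(theta, kappa, loc=mu))
plt.show()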

Saturday, January 28, 2012

Distributions with matplotlib in 3d

I finally managed to figure out the settings for matplotlib's surface plot that makes a bivariate distribution look more like those in published articles.

The first version uses

ax2 = fig2.add_subplot(111, projection='3d')
surf = ax2.plot_surface(X, Y, Z, rstride=1, cstride=1, 
        cmap = cm.gray_r, alpha=0.9, linewidth=1)




The second version uses

surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, color='0.8',
                       alpha=0.85, linewidth=1)
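
For context, a minimal self-contained version of this kind of figure (the grid and the independent bivariate normal density are only an illustration, not the distributions in the plots above):

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D   # registers the 3d projection
from scipy import stats

x = np.linspace(-3, 3, 50)
X, Y = np.meshgrid(x, x)
Z = stats.norm.pdf(X) * stats.norm.pdf(Y)  # density of two independent standard normals

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1,
                cmap=cm.gray_r, alpha=0.9, linewidth=1)
plt.show()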

Previously I was mainly trying out contour plots, since I didn't get the 3d plots to look nice enough.

For example, the following shows a mixture of two bivariate normal distributions and the estimate by gaussian_kde. The colored areas are the differences between the kde and the true distribution: in the blue areas the kde is too large, in the reddish areas the kde is too small. It's a contour plot version of showing that gaussian_kde with the default bandwidth settings lowers the hills and fills the valleys in the case of bimodal distributions.


Tuesday, April 28, 2009

Having fun with expectations

Using the distribution classes in scipy.stats, it is easy to calculate the expectation of a function with respect to any distribution using numerical integration.

I'm going to write a function that calculates the expectation, then attach it to the class of continuous distributions in scipy.stats. Finally, we can use our new method with any existing distribution.
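
In formulas, what the function below computes is

$$E[f(X)] = \int_{lb}^{ub} f(x)\,\mathrm{pdf}(x)\,dx$$

and, if conditional=True, this is divided by the probability of the interval, $P(lb < X < ub)$, so that the result is the conditional expectation $E[f(X) \mid lb < X < ub]$.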

Warning: Monkey patching a class can have unintended effects if the new or changed methods interfere with other uses. In this case we just add a new method, which does not affect any of the original uses of the distributions.

import numpy as np
from scipy import stats, integrate

def expectedfunc(self, fn=None, args=(), lb=None, ub=None, conditional=False):
    '''calculate expected value of a function with respect to the distribution

    only for standard version of distribution,
    location and scale not tested

    Parameters
    ----------
        all parameters are keyword parameters
        fn : function (default: identity mapping)
           Function for which integral is calculated. Takes only one argument.
        args : tuple
           argument (parameters) of the distribution
        lb, ub : numbers
           lower and upper bound for integration, default is set to the support
           of the distribution
        conditional : boolean (False)
           If true then the integral is corrected by the conditional probability
           of the integration interval. The return value is the expectation
           of the function, conditional on being in the given interval.

    Returns
    -------
        expected value : float
    '''
    if fn is None:
        def fun(x, *args):
            return x*self.pdf(x, *args)
    else:
        def fun(x, *args):
            return fn(x)*self.pdf(x, *args)
    if lb is None:
        lb = self.a
    if ub is None:
        ub = self.b
    if conditional:
        invfac = self.sf(lb,*args) - self.sf(ub,*args)
    else:
        invfac = 1.0
    return integrate.quad(fun, lb, ub, args=args)[0] / invfac

For now this is just a function where the first argument is a distribution instance, as they are available in scipy.stats. We can call this function to calculate the fourth moment of the standard normal distribution and compare it with the moment of stats.norm:

>>> print stats.norm.moment(4)
3.0
>>> print expectedfunc(stats.norm, lambda(x): (x)**4)
3.0

We obtain the same result, which means in this case our function works correctly.

Now we can attach it to stats.distributions.rv_continuous, which is the superclass of all continuous distributions. We could also have used new.instancemethod, which is, however, deprecated and will be removed in py3k.

>>> import types
>>> stats.distributions.rv_continuous.expectedfunc = types.MethodType(
...     expectedfunc, None, stats.distributions.rv_continuous)
>>> #print dir(stats.norm)
>>> print stats.norm.expectedfunc
<bound method norm_gen.expectedfunc of <scipy.stats.distributions.norm_gen object at 0x02122830>>

Examples

Here is the fourth moment for both the normal and the t distribution. The t distribution requires one parameter, the degrees of freedom, which I set to 10 to get fatter tails:

>>> print stats.norm.expectedfunc(lambda(x): (x)**4)
3.0
>>> print stats.norm.moment(4)
3.0
>>> print stats.t.expectedfunc(lambda(x): (x)**4, 10)
6.25
>>> print stats.t.moment(4, 10)
6.25

Expectation of some additional functions:

>>> print stats.norm.expectedfunc(lambda(x): np.sqrt(np.abs(x)))
0.822178958662
>>> print stats.norm.expectedfunc(lambda(x): np.exp(-np.abs(x)))
0.52315658373
>>> print stats.norm.expectedfunc(lambda(x): np.exp(-x), lb=0)
0.261578291865

If our function is identically one, and we use integration bounds on our integral, then we get the probability of the interval:

>>> print stats.norm.expectedfunc(lambda(x): 1, lb=-1, ub=0.5)
0.532807207343
>>> print stats.norm.cdf(0.5) - stats.norm.cdf(-1)
0.532807207343

Can we calculate the expectation of exp(x)?

>>> print stats.norm.expectedfunc(lambda(x): np.exp(x))
Warning: The ocurrence of roundoff error is detected, which prevents
  the requested tolerance from being achieved.  The error may be
  underestimated.
-1.#IND

The expectation of exp(x) with respect to the standard normal distribution is actually finite, exp(0.5) ≈ 1.649 (the moment generating function evaluated at one), but the numerical integration over the unbounded interval breaks down here and returns nan, not a number.
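
One way to see this numerically (a quick sketch, output not verified here) is to give the integration explicit finite bounds, which avoids the problem quad has with the doubly infinite interval:

>>> print stats.norm.expectedfunc(lambda x: np.exp(x), lb=-20, ub=20)

which should return approximately exp(0.5) ≈ 1.6487.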

If we integrate with respect to a distribution with finite support, for example the R distribution, rdist, then the numerical integration has no problems:

>>> print stats.rdist.expectedfunc(lambda(x): np.exp(x),0.1)
1.49242160729

We can also try complex values:

>>> print stats.norm.expectedfunc(lambda(x): np.exp(1j*x))
0.606530659713

I have no idea if this is correct, but this is the basic calculation for the characteristic function of a distribution.

Next we can try out conditional expectation. As an example, we calculate the expectation of a standard normal random variable conditional on values being in the top decile, i.e. the expectation of all values in the top 10 %.

>>> lbdec = stats.norm.isf(0.1)
>>> print stats.norm.expectedfunc(lb=lbdec, conditional=True)
1.75498331932
>>> print expectedfunc(stats.norm, lb=lbdec, conditional=True)
1.75498331932
>>> print stats.norm.expectedfunc(lambda(x): 1, lb=-lbdec, ub=lbdec)
0.8
>>> #should be 0.8

What's the variance if we truncate the normal distribution at the 0.1 and 0.9 quantiles?

>>> print stats.norm.expectedfunc(lambda(x): x**2, lb=-lbdec, ub=lbdec,conditional=True)
0.437724594904
>>> print expectedfunc(stats.norm, lambda(x): x**2, lb=-lbdec, ub=lbdec,conditional=True)
0.437724594904

and verify the result with the truncated normal distribution:

>>> print stats.truncnorm.moment(2,-lbdec,lbdec)
0.437724594904
>>> lbdect = stats.t.isf(0.1, 10)
>>> print stats.t.expectedfunc(args=(10,), lb=lbdect, conditional=True)
1.9892028422
>>> print expectedfunc(stats.t, args=(10,), lb=lbdect, conditional=True)
1.9892028422

The t distribution has fatter tails than the normal distribution, so the conditional expectation of the top decile is larger for the t distribution than for the normal distribution, 1.989 versus 1.755.

Saturday, March 14, 2009

Warmup: Fitting Distributions

As a warmup exercise, I generate some random samples, fit two distributions and plot the results. I also calculate the Kolmogorov-Smirnov test using scipy.stats.kstest.

This is a shortened, simplified version of a script that I wrote to see how the distributions in scipy.stats can be used to automatically fit some data and select the best-fitting distribution.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def plothist(x, distfn, args, loc, scale, right=1):
    '''plot histogram and pdf, based on matplotlib doc example'''
    plt.figure()
    # the histogram of the data
    n, bins, patches = plt.hist(x, 25, normed=1, facecolor='green', alpha=0.75)
    maxheight = max([p.get_height() for p in patches])
    axlim = list(plt.axis())
    axlim[-1] = maxheight*1.05

    # add more points for density plot
    pdfpoints = bins + np.diff(bins)[0]*np.linspace(-0.5,0.5,11)[:,np.newaxis]
    pdfpoints = np.sort(pdfpoints.ravel())

    # calculate and plot the estimated density function
    yt = distfn.pdf(pdfpoints, loc=loc, scale=scale, *args)
    yt[yt>maxheight] = maxheight
    lt = plt.plot(pdfpoints, yt, 'r--', linewidth=1, label='estimated')
    # calculate and plot the density function that generated the data
    ys = stats.t.pdf(pdfpoints, 10, scale=10,)*right
    ls = plt.plot(pdfpoints, ys, 'b-', linewidth=1, label='true')

    plt.legend()
    plt.xlabel('values')
    plt.ylabel('Probability')
    plt.title(r'$\mathrm{Fitting\ Distribution\ %s :}\ \mu=%f,\ \sigma=%f$'%(distfn.name,loc,scale))
    plt.grid(True)
    plt.draw()


n = 500
dgp_arg = 10
dgp_scale = 10
np.random.seed(6543789)
rvs = stats.t.rvs(dgp_arg, scale=dgp_scale, size=n)
sm = rvs.mean()
sstd = np.sqrt(rvs.var())
ssupp = (rvs.min(), rvs.max())

for distr in [stats.norm, stats.t]:
    distname = distr.name
    # estimate parameters
    par_est = distr.fit(rvs,loc=sm, scale=sstd)
    print '\nFitting distribution %s' % distname
    print 'estimated distribution parameters\n', par_est
    arg_est = par_est[:-2]  # get shape parameters if any
    loc_est = par_est[-2]
    scale_est = par_est[-1]
    rvs_normed = (rvs-loc_est)/scale_est
    ks_stat, ks_pval = stats.kstest(rvs_normed,distname, arg_est)
    print 'ks-stat = %f, ks-pval = %f)' % (ks_stat, ks_pval)
    plothist(rvs, distr, arg_est, loc_est, scale_est, right = 1)
    plt.savefig('ex_dist1_%s.png'% distname)

#plt.show() # if we want to see it on screen

Output

ex_dist1_norm.png
ex_dist1_t.png

The script produces the following output for the parameter estimate and the Kolmogorov-Smirnov test

Fitting distribution norm
estimated distribution parameters
[ -0.70287027  11.22248481]
ks-stat = 0.037073, ks-pval = 0.493706)

Fitting distribution t
estimated distribution parameters
[ 7.8518085  -0.69695469  9.71315677]
ks-stat = 0.019562, ks-pval = 0.990926)

The p-values of the Kolmogorov-Smirnov test are derived under the assumption that we test against a known distribution and not against a distribution with estimated parameters. But in this example the numbers look pretty informative: the p-values are large for both distributions, from which I conclude that both distributions fit the sample relatively well overall. The p-value of the t distribution is about twice the p-value of the normal distribution, which suggests that the sample was generated by a t distribution and not by a normal distribution. Although, if we looked only at the p-value of the normal distribution, we would not reject the hypothesis that the sample was generated by the normal distribution.

However, I want to emphasize that this is an informal interpretation; we can be quite confident in it, and we know it is correct since we generated the data, but it is not the result of a formal statistical test.

script file ex_dist1.py

Editorial comments:

Sphinx and restructured text work well, and the highlighting with pygments also works without additional intervention, but adding the graphs to blogger requires some manual uploading and editing of the formatting, which still takes too much time. I haven't figured out yet how to upload source files; the upload-file button seems to be missing from my blog editor.

Update: some Google searches later, it seems that it is not possible to attach regular text files to blogger, so the script needs to be hosted somewhere else.