
Saturday, December 1, 2012

Characteristic Functions and scipy.stats

scipy.stats.distributions is, among other things, a nice formula collection.

One of the missing pieces is the characteristic function for each of the distributions.

Wikipedia is very good for large areas of statistics; see http://en.wikipedia.org/wiki/Characteristic_function_%28probability_theory%29 for some details and examples. Wikipedia also lists the characteristic function on the pages for many individual distributions.

I have written something like the script below several times already (for the easy cases).

The characteristic function for the normal distribution is easy, but looking at the characteristic function of the t-distribution, I wish someone had translated it into code already.

t-distribution

Since I haven't seen it yet, I sat down and tried it myself. I managed to code the characteristic function of the t-distribution, but it returns NaNs when it is evaluated close to zero for large df.

I didn't find a Bessel "K" function that works in this case:

>>> special.kv(50/2., 1e-30)
inf
>>> special.kve(50/2., 1e-30)
inf

The t-distribution approaches the normal distribution as the shape parameter, the degrees of freedom, gets large. So, the characteristic function of the t-distribution should be well behaved for large df. However, the individual terms go to infinity or zero.
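To see where the nans come from (this is my own small illustration, not part of the original post): for large df and t very close to zero, kv overflows to inf while the power term underflows to zero, and the product inf * 0 is nan.

import numpy as np
from scipy import special

vhalf = 25.                      # df / 2 for df = 50
term = np.sqrt(50) * 1e-15       # sqrt(df) * |t * scale| for t very close to zero
print(special.kv(vhalf, term))   # overflows to inf
print(term**vhalf)               # underflows to 0.0
print(special.kv(vhalf, term) * term**vhalf)   # inf * 0 -> nan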

Since in my current case, I don't care about the behavior of the characteristic function around zero, I stopped trying to get a better implementation.

Warning: monkey patching included in the script

Aside: I cannot make up my mind whether the abbreviation for characteristic function should be chf or cf. I have both versions in various scripts that I wrote.

The script

Here's my current script:

# -*- coding: utf-8 -*-
"""Characteristic Functions

Created on Fri Nov 30 22:43:36 2012
Author: Josef Perktold
"""

import numpy as np
from scipy import stats, special

def chf_normal(t, loc=0, scale=1):
    '''characteristic function of normal distribution

    Parameters
    ----------
    t : array_like
        points at which characteristic function is evaluated
    loc : float (or array_like ?)
        mean of underlying normal distribution
    scale : float (or array_like ?)
        standard deviation, scale of normal distribution

    Returns
    -------
    chfval : ndarray
        characteristic function evaluated at t

    Notes
    -----
    Can be used for higher dimensional arguments if ``t``, ``loc`` and ``scale``
    broadcast.

    '''
    t = np.asarray(t)
    return np.exp(1j * t * loc - 0.5 * t**2 * scale**2)


def chf_t(t, df, loc=0, scale=1):
    '''characteristic function of t distribution

    breaks down for large df and t close to zero
    '''
    t = np.asarray(t)
    vhalf = df / 2.
    term = np.sqrt(df) * np.abs(t*scale)
    cf = special.kv(vhalf, term) * np.power(term, vhalf)
    cf = cf / special.gamma(vhalf) / 2**(vhalf - 1)
    cf = cf * np.exp(1j * loc * t)
    if cf.shape == () and t == 0:
        #special case: kv(0) returns nan
        #for df>15 or so, we also get nans in the result in neighborhood of 0
        cf = np.ones((), cf.dtype)  #return same dtype as cf would
    else:
        cf[t == 0] = 1

    return cf

def chf_t_(t, df, loc=0, scale=1):
    '''characteristic function of t distribution, computed with logs

    not much, but a bit better than chf_t
    '''
    t = np.asarray(t)
    vhalf = df / 2.
    term = np.sqrt(df) * np.abs(t*scale)
    cf = np.log(special.kv(vhalf, term)) + vhalf * np.log(term)
    cf = cf - special.gammaln(vhalf) - (vhalf - 1) * np.log(2)
    cf = cf + 1j * loc * t
    if cf.shape == () and t == 0:
        #special case: kv(0) returns nan
        #for df>15 or so, we also get nans in the result in neighborhood of 0
        cf = np.zeros((), cf.dtype)  #return same dtype as cf would
    else:
        cf[t == 0] = 0
    return np.exp(cf)


def chfn(self, t, *args, **kwds):
    return chf_normal(t, *args, **kwds)

#monkeypatch scipy.stats
stats.distributions.norm_gen._chf = chfn

t = np.linspace(-1, 1, 11)
print(stats.norm._chf(t, loc=1, scale=2))
print(chf_t(t, 50, loc=1, scale=2))
print(chf_t_(t, 50, loc=1, scale=2))
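As a rough check of the limiting behavior mentioned above (my addition, not part of the original script): for moderately large df, the characteristic function of the t distribution should be close to the normal one, with a difference of roughly order 1/df.

print(np.max(np.abs(chf_t(t, 100) - chf_normal(t))))   # small, roughly of order 1/df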

Editorial note: I had written this initially for the scipy-user mailing list. (Old habits are difficult to break.) But I remembered just before hitting Send that the recommendation is to put it in a blog.

Application and Plot: Wrapped Circular T Distribution

As explained in my previous post, once we have the characteristic function of a distribution defined on the real line, it is simple to get the Fourier approximation for the wrapped circular distribution. As an application of the characteristic function of the t distribution, I constructed the wrapped circular distributions.
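Concretely, the construction can be sketched like this (my own illustrative code, reusing chf_t from the script above; the function name is made up): the wrapped density is a Fourier series whose coefficients are the real and imaginary parts of the characteristic function evaluated at the integers.

import numpy as np

def pdf_wrapped_from_chf(theta, chf, n_terms=10, **kwds):
    '''Fourier series approximation to a wrapped circular density.

    chf is the characteristic function of the unwrapped distribution;
    it is evaluated only at the integers 1, ..., n_terms.
    '''
    theta = np.asarray(theta)
    k = np.arange(1, n_terms + 1)
    cf = chf(k, **kwds)
    terms = (cf.real * np.cos(k * theta[..., None])
             + cf.imag * np.sin(k * theta[..., None]))
    return (1 + 2 * terms.sum(-1)) / (2 * np.pi)

theta = np.linspace(-np.pi, np.pi, 181)
pdf_wrapped_t = pdf_wrapped_from_chf(theta, chf_t, n_terms=10, df=5)

Since loc and scale are passed through to chf_t, the same function covers the shifted and scaled case.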

The following plot shows an example of the density functions of the wrapped Cauchy, the wrapped normal distribution, and the wrapped t distribution for a few values of the degrees of freedom. Normal and Cauchy distributions are the two extreme cases of the t distribution, when the degrees of freedom go to infinity and when the degrees of freedom is one, respectively.

The distributions in the plot have the same location and scale parameters. This implies, however, that the variances of the distributions are not the same. As a different comparison, we could have adjusted the scale parameters to obtain distributions with identical variance. The latter is the more informative comparison when we are estimating the parameters from data, since the estimated distributions then reflect a variance similar to that of the data.

The fatter tails of the Cauchy distribution and of t distributions with low degrees of freedom are clearly visible in the plot.

Tuesday, November 20, 2012

Orthogonal Series and Wrapped Circular Distribution

This is just a quick follow-up on the previous posting.

Recommended reading: Mardia and Jupp, Section 3.5.7 on wrapped distributions: http://www.amazon.com/Directional-Statistics-Kanti-V-Mardia/dp/0471953334

To construct a wrapped distribution on the circle, we can take a distribution that is defined on the real line, like the normal, Cauchy, t or stable distribution, and wrap it around the circle. Essentially, we just take the support modulo (2 pi) and add up the overlapping densities. For some distributions the wrapped density has a nice closed-form expression, for example the wrapped Cauchy distribution, which is also available in scipy.stats.
For other distributions, the density is given as an infinite sum, which in many cases converges very fast.
Mardia and Jupp show how to construct the series representation of the wrapped distribution from the characteristic function of the original, unwrapped distribution.
The basic idea is that for wrapped circular distributions the characteristic function only needs to be evaluated at the integers, and we can construct the Fourier expansion of the wrapped density directly from the real and imaginary parts of the characteristic function. (In contrast, for densities on the real line we need a continuous inverse Fourier transform, which involves integration.)

To see that it works, I did a "proof by plotting".

For the wrapped Cauchy distribution, I can use scipy.stats.wrapcauchy.pdf as a check. For both the wrapped Cauchy and the wrapped normal distribution, I also coded the series from Mardia and Jupp's book directly (pdf-series1 in the plots). I also drew a large number (10,000) of random numbers to be able to compare with the histogram. The generic construction from only the characteristic function is pdf-series2-chf in the plots. I used 10 terms in the series representation.
The plots are a bit "boring" because the two, respectively three, density lines coincide up to a few decimals.
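The same check can also be done numerically; here is a minimal sketch for the wrapped Cauchy with loc zero (my own code, not the script behind the plots). The characteristic function of the Cauchy distribution at the integers is exp(-scale * k), and scipy.stats.wrapcauchy uses c = exp(-scale) as its shape parameter.

import numpy as np
from scipy import stats

scale = 0.5                      # scale of the unwrapped Cauchy, loc = 0
rho = np.exp(-scale)             # shape parameter c of stats.wrapcauchy
k = np.arange(1, 11)             # 10 terms in the series
cf = np.exp(-scale * k)          # characteristic function of Cauchy(0, scale) at the integers

theta = np.linspace(0, 2 * np.pi, 361, endpoint=False)
pdf_series = (1 + 2 * (cf * np.cos(k * theta[:, None])).sum(-1)) / (2 * np.pi)
pdf_exact = stats.wrapcauchy.pdf(theta, rho)
print(np.max(np.abs(pdf_series - pdf_exact)))   # small; limited by truncation after 10 terms

The sine terms drop out here because the location is zero.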

Here's the wrapped Cauchy:


And here's the wrapped normal distribution:

Sunday, November 18, 2012

Density Estimation with Orthogonal Series - circular data

Background

Orthogonal series are very useful. If we have a basis $(g_i)_{i \in N}$ for some function space (usually with smoothness and integrability conditions), then we can represent a function as a linear combination of the basis functions:

$$f(x) = \sum_{i \in N} c_i g_i(x)$$

To get an approximation to the function $f$, we can truncate the sum after some finite number of terms. ($N$ is the set of all positive integers.)
Orthonormal polynomials are convenient for density estimation, because we can directly estimate the coefficients $c_i$ from the data without having to run a non-linear optimization. In the basic case, we just need to calculate the moments of the data.
The orthogonality and normalization of the basis functions are defined with respect to a weighting function $w$:

$$\int g_i(x) g_j(x) w(x) \, dx = \begin{cases} 0 & \text{if } i \neq j \\ 1 & \text{if } i = j \end{cases}$$
In the case of estimating or approximating a density, we can use a reference density as the weighting function. Then the first term corresponds to the reference density, and higher-order terms are deviations from it. This forms the basis for smooth goodness-of-fit tests. It is also very similar to series expansions of distributions, for example the Gram-Charlier expansion, where the reference density is the normal distribution and the higher-order terms are based on Hermite polynomials.
In the basic form, we can just add the weighting function to the expansion above:

$$f(x) = \sum_{i \in N} c_i g_i(x) w(x)$$
However, these kinds of series expansions do not necessarily produce densities that are non-negative over the full range of the density function. As a consequence, several non-linear transformations have been introduced in the literature, for example squaring or taking the exponential. The transformed expansions always result in non-negative densities; however, they lose the simple estimation property and have to be estimated with non-linear optimization. (I haven't actually coded any of those yet.)
These series approximations to densities can be extended to the multivariate case, but I haven't coded that yet either.
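Below is a minimal sketch of the basic (untransformed) estimator with the standard normal density as the reference and Hermite polynomials as the basis; this is my own illustration, the function names are made up, and it assumes the data are roughly standardized.

import numpy as np
from numpy.polynomial import hermite_e
from scipy import special, stats

def hermite_coeffs(data, n_terms=6):
    '''c_i = E[g_i(X)], with g_i = He_i / sqrt(i!) orthonormal under the
    standard normal weight (probabilists' Hermite polynomials)'''
    coeffs = []
    for i in range(n_terms):
        basis_i = np.zeros(i + 1)
        basis_i[i] = 1. / np.sqrt(special.factorial(i))
        coeffs.append(hermite_e.hermeval(data, basis_i).mean())
    return np.array(coeffs)

def pdf_hermite_series(x, coeffs):
    '''density estimate f(x) = sum_i c_i g_i(x) w(x), with w the standard normal pdf'''
    x = np.asarray(x)
    total = 0.0
    for i, c in enumerate(coeffs):
        basis_i = np.zeros(i + 1)
        basis_i[i] = 1. / np.sqrt(special.factorial(i))
        total = total + c * hermite_e.hermeval(x, basis_i)
    return total * stats.norm.pdf(x)

data = stats.t.rvs(10, size=1000)     # some roughly standardized sample
xx = np.linspace(-4, 4, 201)
pdf_hat = pdf_hermite_series(xx, hermite_coeffs(data, n_terms=6))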

The Quest

I got started with this after some "random" reading: "Orthogonal series density estimation" http://wires.wiley.com/WileyCDA/WiresArticle/wisId-WICS97.html and later "Smooth tests of goodness of fit" http://wires.wiley.com/WileyCDA/WiresArticle/wisId-WICS171.html Both papers give well-motivated introductions.
In the meantime I have read dozens more papers in this direction. The latest, http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2141299 , is a lot more theoretical and goes into continuous-time stochastic processes, which I'm not yet ready to go back to; along a similar line, there is an orthonormal series variance estimator, http://onlinelibrary.wiley.com/doi/10.1111/j.1368-423X.2012.00390.x/abstract
scipy.special has a nice collection of orthogonal polynomials. numpy.polynomial now also has a good implementation of orthogonal polynomials, but it was not available when I started with this. The scipy.special documentation is a bit "thin": it is good enough when we know what we are looking for, but not very helpful when we only have a vague idea what kind of "animals" those functions are.
The first problem was to find the normalization, since the polynomials in scipy are orthogonal but not orthonormal. http://old.nabble.com/orthogonal-polynomials---tt31619489.html
Also, along the way I had to figure out how to construct orthonormal polynomials for an arbitrary given weight function (reference density), and learn about recurrence equations and how we can construct and evaluate orthogonal polynomials. Neither of those is part of the standard training where I come from.
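For example (my own check, not from the original mailing-list thread): the probabilists' Hermite polynomials in scipy.special are orthogonal with respect to exp(-x**2/2) but not normalized; against the standard normal density the squared norm of He_n is n!.

import numpy as np
from scipy import special, integrate, stats

n = 4
norm_sq, _ = integrate.quad(
    lambda x: special.eval_hermitenorm(n, x)**2 * stats.norm.pdf(x),
    -np.inf, np.inf)
print(norm_sq, special.factorial(n))   # both approximately 24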
Plots of some of my previous results can be seen in my gallery. Two examples:

Fourier polynomials


and Hermite polynomials (black line; the green line is a normal distribution)

The Latest Installment

Recently, I started to figure out the basics of circular or directional statistics, see for example http://en.wikipedia.org/wiki/Directional_statistics .
Trying to understand the usual smooth goodness-of-fit tests, I read http://onlinelibrary.wiley.com/doi/10.1111/j.1467-842X.2009.00558.x/abstract However, orthonormal polynomials on the unit circle are "different". To get orthogonal polynomials with the von Mises distribution as the weight function, we need Verblunsky coefficients and the Szegő recurrence. Now what are those? Searching with Google, I didn't find any basic explanations. I don't really want to get a math book on the topic (by Barry Simon) and read it.
http://old.nabble.com/Orthogonal-polynomials-on-the-unit-circle-tt34608320.html
To get started with something easier, I went back to orthogonal polynomials with a uniform weight function, that is, no weights. In this case, the polynomials are just trigonometric functions, i.e. a Fourier series.
An explanation and application that additionally imposes non-negativity of the density function (which I do not impose in my code) is http://onlinelibrary.wiley.com/doi/10.1111/j.0006-341X.2004.00195.x/full
The coefficients of the series approximation are just the circular moments of the underlying distribution. We can calculate those for a given distribution, or we can calculate the empirical moments from the data.
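A minimal sketch of the estimator (my own illustration with made-up names): the empirical circular moments are just sample averages of cos(k theta) and sin(k theta), and they go directly into a truncated trigonometric series.

import numpy as np
from scipy import stats

def pdf_trig_series(theta, data, n_terms=5):
    '''trigonometric (Fourier) series density estimate on the circle,
    with coefficients given by the empirical circular moments'''
    theta = np.asarray(theta)
    k = np.arange(1, n_terms + 1)
    ak = np.cos(k * data[:, None]).mean(0)    # empirical cosine moments
    bk = np.sin(k * data[:, None]).mean(0)    # empirical sine moments
    terms = ak * np.cos(k * theta[..., None]) + bk * np.sin(k * theta[..., None])
    return (1 + 2 * terms.sum(-1)) / (2 * np.pi)

data = stats.vonmises.rvs(2., size=1000)      # example data, kappa = 2, loc = 0
grid = np.linspace(-np.pi, np.pi, 181)
pdf_hat = pdf_trig_series(grid, data, n_terms=5)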
Detour: scipy.integrate.quad
An underused feature of scipy.integrate.quad is that we can specify a weight function. For example, the cosine and sine parts of the circular moments can be calculated with
integrate.quad(pdf_func, low, upp, weight='cos', wvar=k)
integrate.quad(pdf_func, low, upp, weight='sin', wvar=k)
which calculate the cosine and sine parts of the k-th circular moment of a circular distribution given by pdf_func. The integration limits are either (0, 2π) or (-π, π). We cannot integrate using the complex definition directly:
integrate.quad(lambda x: np.exp(1j*k*x)*pdf_func(x, *args), low, upp)
because quad throws away the imaginary part and issues a warning about the casting to float.
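As a small check (my addition): for the von Mises distribution the k-th cosine moment is I_k(kappa) / I_0(kappa), which the weighted quad call reproduces.

import numpy as np
from scipy import stats, special, integrate

kappa, k = 2., 3
cos_mom, _ = integrate.quad(lambda x: stats.vonmises.pdf(x, kappa),
                            -np.pi, np.pi, weight='cos', wvar=k)
print(cos_mom, special.iv(k, kappa) / special.iv(0, kappa))   # should agree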

The Plots

And now, the plots. I drew random numbers from a two-component mixture of von Mises distributions [1]. The plots contain the histogram of the data and the estimated density based on the trigonometric series. For reference, they also contain the density of the data generating process, the mixture distribution, and the density given by the 15-component series based on the circular moments of the data generating distribution (calculated by integration as above). With 15 components, the series distribution based on the true moments is essentially indistinguishable from the true density.

First plot: 10000 observations, which makes the histogram and estimated moments close to the true density and moments.

Second plot: 200 observations. Given the randomness in the data, the histogram is pretty uneven (for the given number of bins). I fitted 3 components in the series density estimate.



Third and fourth plots: 1000 observations. In one plot I used 5 components, in the other 15 components in the series density. The density with 15 components fits random "bumps" in the histogram.

Some Comments

Orthogonal series expansion could be or is very useful. The advantage compared to kernel density estimation is that it is much faster and we do not need to keep the original data for evaluating the density. All we need are the coefficients for the series. It also works better on bounded support than kernel density estimation. One of the disadvantages is that it is a global method and will not be able to adjust locally if there are regions with different features, unless we sufficiently increase the number of terms in the series. Increasing the number of terms will make the density estimate more sensitive to random effects.
My impression is that orthogonal series expansions for densities are limited in their usefulness when the distribution contains a shape parameter and not just location and scale. A while ago, I wrote the recursion for a polynomial series with the Poisson as the weight function. It can be used for testing whether a distribution is Poisson, as in the paper I used as reference. However, I finally came to the conclusion that this is not really so useful, since in many cases we want count regression, with the shape parameter a function of some explanatory variables. The series expansion of the Poisson distribution is specific to a given shape parameter, which means that we cannot create the orthonormal basis independently of the regressors. As far as I remember, I also have not seen any articles that use orthogonal expansions in regression outside the normal distribution case.
One of the main missing pieces in my code is automatic selection of the bandwidth or of the optimal penalization. For the former, we need to select the number of components in the series expansion. For the latter, we use a larger number of terms but need to find an increasingly strong penalization for higher-order terms. I only know of one journal article that derives the penalization for Fourier series on the real line.
Related to the last point: one of the main pieces of work that George and Ralph did during GSoC last summer was automatic bandwidth selection for kernel density estimation and kernel regression for the new nonparametric extension in statsmodels. There are many other new features besides this. statsmodels will get good coverage of kernel methods when the branch is merged, which will happen very soon.
(My code is mostly only part of my private collection of "dirty" scripts.)
[1] I just concatenated the data and didn't randomize the number of observations in each component.
Editorial Note: I'm still using rst2blogger with the default settings. I am able to edit LaTeX math in reStructuredText for use with Sphinx, which I used for the draft. With rst2blogger the default is conversion to MathML, which doesn't recognize all the LaTeX math that I was using, and some fine-tuning got lost. Additionally, the math doesn't display in Internet Explorer on my computer.
PS: Where does your black box start?
Just a thought after reading this again.
Sometimes I'm happy to use somebody else's code or recipes without worrying about why it works. Sometimes I have to dig in myself because there are no recipes available. But often I have to dig in because I feel a need to understand whatever I'm using and I cannot rest until my ignorance is sufficiently reduced (or I realize that the mountain is too big.)
And after five or more years of Statistics in Python, I can safely say that I learned a lot about topics that I never heard of before.

Monday, November 5, 2012

Polar Histogram

Just posting two plots from my gallery to show what we can do with matplotlib, numpy, and scipy. (No public code for now.)
Both plots show the histogram of the data and the density function (pdf) of the von Mises distribution with estimated mu and kappa.
The first one shows arrival times on a 24-hour clock.
The second one shows wind direction (zero is north; I didn't use the offset in this plot).
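A minimal sketch of how such a plot can be put together (this is not the code behind the plots above; the data here are simulated and all names are illustrative):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

data = stats.vonmises.rvs(2., loc=1., size=500)   # simulated directions in radians
data = np.angle(np.exp(1j * data))                # wrap into (-pi, pi]

# fit mu and kappa, keeping the scale fixed at 1
kappa, mu, _ = stats.vonmises.fit(data, fscale=1)

counts, edges = np.histogram(data, bins=24, range=(-np.pi, np.pi), density=True)
centers = (edges[:-1] + edges[1:]) / 2
grid = np.linspace(-np.pi, np.pi, 181)

ax = plt.subplot(projection='polar')
#ax.set_theta_zero_location('N')                  # put zero at north, e.g. for wind direction
ax.bar(centers, counts, width=2 * np.pi / 24, alpha=0.5)
ax.plot(grid, stats.vonmises.pdf(grid, kappa, loc=mu))
plt.show()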