Wednesday, December 5, 2012

Visual Inspection of Random Numbers

This is another post on showing what a few lines of matplotlib can produce.

Background

When I wrote the test suite for scipy.stats.distributions, I had to mark quite a few distributions as slow so they are skipped under the normal test runs, because they were very slow. One of the reasons that some distributions are slow is because the generic random number generation is very indirect if only the density function is available.

For some time I was looking at spline interpolation of the inverse cumulative distribution, ppf, as a approximate way of generating random numbers. However, since scipy has no splines that impose monotonicity, that did not work.

Finally, I wrote a script that just uses linear interpolation of the cdf of a distribution, using scipy.interpolate.interp1d so we can use standard inversion to create random numbers. To check whether the interpolation looks accurate enough, I went to "proof by plotting".

The interpolating random number generator takes about 0.3 seconds for one million random numbers, not counting the setup cost of creating the interpolator. The script is currently just a quick hack to see if it works.

The Plots

As an example I took the t distribution with 5 degrees of freedom, which has somewhat heavy tails. I calculated the approximation for 1000 intervals, and then for 10 and 20 intervals as contrast.

Since a large part of the "action" is in the tails, and I want to get those to be resonably accurate, I could not look just at a regular histogram since the tails are not very visible. So I looked at two variations, one with log scale, the other where the bin width is chosen so each bin has equal probability instead of equal length.

The result are the following four plots, with equal-length bins in the first row, equal-probability bins in the second row, and linear scale on the left side and log scaled probabilites on the right side. With 1000 segments in the interpolation, I don't see any systematic deviation of the random numbers from the true distribution. Below is the qqplot, generated with statsmodels, that indicates that the random numbers are consistent with a t(5) distribution.




As contrast, below are the same kind of plots for 20 intervals in the interpolation, which is a symmetric step function density with 20 intervals, many of them close to zero. The histogram shows clearly the steps, the qqplot shows systematic curved segments, which are more visible in the qqplot for 10 intervals.


The plots for 10 intervals are in my gallery histogram and qqplot

Saturday, December 1, 2012

Characteristic Functions and scipy.stats

scipy.stats.distributions is among other things a nice formula collection.

One of the parts that are missing are the characteristic functions for the distributions.

Wikipedia is very good for large areas of statistics, see for some details and examples http://en.wikipedia.org/wiki/Characteristic_function_%28probability_theory%29 Wikipedia lists the characteristic funtion also on the pages for many distributions.

I wrote something like the script below already several times (for the easy cases).

The characteristic function for the normal distribution is easy, but looking at the characteristic function of the t-distribution, I wish someone had translated it into code already.

t-distribution

Since I haven't seen it yet, I sat down and tried it myself. I managed to code the characteristic function of the t-distribution, but it returns NaNs when it is evaluated close to zero for large df.

I didn't find a Bessel "k" function that works in this case

>>> special.kv(50/2., 1e-30)
inf
>>> special.kve(50/2., 1e-30)
inf

The t-distribution approaches the normal distribution as the shape parameter, the degrees of freedom, gets large. So, the characteristic function of the t-distribution should be well behaved for large df. However, the individual terms go to infinity or zero.

Since in my current case, I don't care about the behavior of the characteristic function around zero, I stopped trying to get a better implementation.

Warning: monkey patching included in the script

Aside: I cannot make up my mind whether the abbreviation for characteristic function should be chf or cf. I have both versions in various scripts that I wrote.

The script

Here's my current script

# -*- coding: utf-8 -*-
"""Characteristic Functions

Created on Fri Nov 30 22:43:36 2012
Author: Josef Perktold
"""

import numpy as np
from scipy import stats, special

def chf_normal(t, loc=0, scale=1):
    '''characteristic function of normal distribution

    Parameters
    ----------
    t : array_like
        points at which characteristic function is evaluated
    loc : float (or array_like ?)
        mean of underlying normal distribution
    scale : float (or array_like ?)
        standard deviation, scale of normal distribution

    Returns
    -------
    chfval : ndarray
        characteristic function evaluated at x

    Notes
    -----
    Can be used for higher dimensional arguments if ``t``, ``mean`` and ``var``
    broadcast.

    '''
    t = np.asarray(t)
    return np.exp(1j * t * loc - 0.5 * t**2 * scale**2)


def chf_t(t, df, loc=0, scale=1):
    '''characteristic function of t distribution

    breaks down for large df and t close to zero
    '''
    t = np.asarray(t)
    vhalf = df / 2.
    term = np.sqrt(df) * np.abs(t*scale)
    cf = special.kv(vhalf, term) * np.power(term, vhalf)
    cf = cf / special.gamma(vhalf) / 2**(vhalf - 1)
    cf = cf * np.exp(1j * loc * t)
    if cf.shape == () and t == 0:
        #special case: kv(0) returns nan
        #for df>15 or so, we also get nans in the result in neighborhood of 0
        cf = np.ones((), cf.dtype)  #return same dtype as cf would
    else:
        cf[t == 0] = 1

    return cf

def chf_t_(t, df, loc=0, scale=1):
    #not much, but a bit better with log
    vhalf = df / 2.
    term = np.sqrt(df) * np.abs(t*scale)
    cf = np.log(special.kv(vhalf, term)) + vhalf * np.log(term)
    cf = cf - special.gammaln(vhalf) - (vhalf - 1) * np.log(2)
    cf = cf + 1j * loc * t
    if cf.shape == () and t == 0:
        #special case: kv(0) returns nan
        #for df>15 or so, we also get nans in the result in neighborhood of 0
        cf = np.zeros((), cf.dtype)  #return same dtype as cf would
    else:
        cf[t == 0] = 0
    return np.exp(cf)


def chfn(self, t, *args, **kwds):
    return chf_normal(t, *args, **kwds)

#monkeypatch scipy.stats
stats.distributions.norm_gen._chf = chfn

t = np.linspace(-1, 1, 11)
print stats.norm._chf(t, loc=1, scale=2)
print chf_t(t, 50, loc=1, scale=2)
print chf_t_(t, 50, loc=1, scale=2)

Editorial note: I had written this initially for the scipy-user mailing list. (Old habits are difficult to break.) But I remembered just before hitting Send that the recommendation is to put it in a blog.

Application and Plot: Wrapped Circular T Distribution

As explained in my previous post, once we have the characteristic function of a distribution defined on the real line, it is simple to get the Fourier approximation for the wrapped circular distribution. As an application of the characteristic function of the t distribution, I constructed the wrapped circular distributions.

The following plot shows an example of the density functions of the wrapped Cauchy, the wrapped normal distribution, and the wrapped t distribution for a few values of the degrees of freedom. Normal and Cauchy distributions are the two extreme cases of the t distribution, when the degrees of freedom go to infinity and when the degrees of freedom is one, respectively.

The distribution in the plot have the same location and scale parameter. However, this implies that the variance of the distributions is not the same. As a different comparison we could have adjusted the scale parameter to obtain distributions with identical variance. The latter is a more informative comparison when we are estimating the parameters based on data, and the estimated distribution reflects a similar variance as the data.

The fatter tails of Cauchy and t distributions with small t are clearly visible in the plot.