Wednesday, December 5, 2012

Visual Inspection of Random Numbers

This is another post on showing what a few lines of matplotlib can produce.

Background

When I wrote the test suite for scipy.stats.distributions, I had to mark quite a few distributions as slow so they are skipped under the normal test runs, because they were very slow. One of the reasons that some distributions are slow is because the generic random number generation is very indirect if only the density function is available.

For some time I was looking at spline interpolation of the inverse cumulative distribution, ppf, as a approximate way of generating random numbers. However, since scipy has no splines that impose monotonicity, that did not work.

Finally, I wrote a script that just uses linear interpolation of the cdf of a distribution, using scipy.interpolate.interp1d so we can use standard inversion to create random numbers. To check whether the interpolation looks accurate enough, I went to "proof by plotting".

The interpolating random number generator takes about 0.3 seconds for one million random numbers, not counting the setup cost of creating the interpolator. The script is currently just a quick hack to see if it works.

The Plots

As an example I took the t distribution with 5 degrees of freedom, which has somewhat heavy tails. I calculated the approximation for 1000 intervals, and then for 10 and 20 intervals as contrast.

Since a large part of the "action" is in the tails, and I want to get those to be resonably accurate, I could not look just at a regular histogram since the tails are not very visible. So I looked at two variations, one with log scale, the other where the bin width is chosen so each bin has equal probability instead of equal length.

The result are the following four plots, with equal-length bins in the first row, equal-probability bins in the second row, and linear scale on the left side and log scaled probabilites on the right side. With 1000 segments in the interpolation, I don't see any systematic deviation of the random numbers from the true distribution. Below is the qqplot, generated with statsmodels, that indicates that the random numbers are consistent with a t(5) distribution.




As contrast, below are the same kind of plots for 20 intervals in the interpolation, which is a symmetric step function density with 20 intervals, many of them close to zero. The histogram shows clearly the steps, the qqplot shows systematic curved segments, which are more visible in the qqplot for 10 intervals.


The plots for 10 intervals are in my gallery histogram and qqplot

4 comments:

  1. Re: "scipy has no splines that impose monotonicity"

    http://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.pchip.html

    Been there forever, but was undocumented until recently. Best to keep the good stuff hidden ;)

    ReplyDelete
  2. Thank you for this pointer.

    Embarrassing that I didn't see that. I went through that module hunting for monotone piecewise polynomials. My only excuse is, that last time that I asked the question on the mailinglist nobody else seems to know about it either.
    http://mail.scipy.org/pipermail/scipy-user/2010-May/025243.html
    That thread is about when I looked for the first time at the ppf interpolation question.

    I need to try it out (and maybe increase the test coverage).

    (My other lame excuse for not knowing about it, is that I might have mixed up pchip with chirp, which I wasn't really interested in. I love informative names.)

    (Since I'm starting to get older in python years, I don't think anymore that 3 years is forever.)

    ReplyDelete
  3. Here is an example plot using pchip for ppf approximation
    https://picasaweb.google.com/lh/photo/ccYFRhgye7fOvoS0a3PDU9MTjNZETYmyPJy0liipFm0?feat=directlink

    I didn't have much time to do anything with it, but initial results look good. (Maybe a tiny boundary effect, but it's too small in relative terms to worry about it.)

    ReplyDelete
  4. Just a brief comment about the timing.
    the pchip version for random numbers is 32 times slower than linear interpolation for one million random numbers and 1000 segments in the approximation.

    Using only a tenth in number of segments in the pchip version (since it's a better approximation) still has pchip four times slower than the linear approximation.

    ReplyDelete