In scipy.stats we can find a class to estimate and use a Gaussian kernel density estimator, scipy.stats.gaussian_kde.
Until recently, I didn’t know how this part of scipy works, and the following describes roughly how I figured out what it does.
First, we can look at help to see whether the documentation provides enough information for its use. Here is the first part.
>>> help(stats.gaussian_kde)
Help on class gaussian_kde in module scipy.stats.kde:
class gaussian_kde(__builtin__.object)
| Representation of a kernel-density estimate using Gaussian kernels.
|
| Parameters
| ----------
| dataset : (# of dims, # of data)-array
| datapoints to estimate from
|
| Members
| -------
| d : int
| number of dimensions
| n : int
| number of datapoints
|
| Methods
| -------
| kde.evaluate(points) : array
| evaluate the estimated pdf on a provided set of points
| kde(points) : array
| same as kde.evaluate(points)
| kde.integrate_gaussian(mean, cov) : float
| multiply pdf with a specified Gaussian and integrate over the whole domain
| kde.integrate_box_1d(low, high) : float
| integrate pdf (1D only) between two bounds
| kde.integrate_box(low_bounds, high_bounds) : float
| integrate pdf over a rectangular space between low_bounds and high_bounds
| kde.integrate_kde(other_kde) : float
| integrate two kernel density estimates multiplied together
|
| Internal Methods
| ----------------
| kde.covariance_factor() : float
| computes the coefficient that multiplies the data covariance matrix to
| obtain the kernel covariance matrix. Set this method to
| kde.scotts_factor or kde.silverman_factor (or subclass to provide your
| own). The default is scotts_factor.
|
| Methods defined here:
|
| __call__ = evaluate(self, points)
|
| __init__(self, dataset)
|
| covariance_factor = scotts_factor(self)
|
| evaluate(self, points)
| Evaluate the estimated pdf on a set of points.
|
...
The first two basic methods that I was interested in are the initialization and the evaluation: __init__ just takes a data set as an argument, and evaluate, or equivalently __call__, takes as its argument the list of points at which we want to evaluate the estimated pdf.
Let’s just try this out:
First, I get the standard imports:
import numpy as np
from scipy import stats
import matplotlib.pylab as plt
then I generate a sample that I can feed to the kde; as usual for continuous variables, I start with a normal distribution:
n_basesample = 1000
np.random.seed(8765678)
xn = np.random.randn(n_basesample)
Now, we create an instance of the gaussian_kde class and feed our sample to it:
gkde = stats.gaussian_kde(xn)
We need some points at which to evaluate the estimated density function:
ind = np.linspace(-7,7,101)
kdepdf = gkde.evaluate(ind)
and finally we plot the histogram of our data, together with the density that generated our sample (the data generating process, DGP) and the estimated density.
plt.figure()
# plot histogram of sample
plt.hist(xn, bins=20, density=True)
# plot data generating density
plt.plot(ind, stats.norm.pdf(ind), color="r", label='DGP normal')
# plot estimated density
plt.plot(ind, kdepdf, label='kde', color="g")
plt.title('Kernel Density Estimation')
plt.legend()
#plt.show()
and this is the graph that we get
This graph looks pretty good: when the underlying distribution is the normal distribution, the Gaussian kernel density estimate follows the true density very closely, at least for a large sample like the one we used.
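To put a number or two behind that visual impression (a small check of my own, not part of the original walkthrough), we can compare the estimated pdf with the true pdf on the grid, and use the integrate_box_1d method listed in the help output to compare probability masses:
# largest pointwise deviation between the estimated and the true pdf on the grid
max_abs_dev = np.max(np.abs(kdepdf - stats.norm.pdf(ind)))
# probability mass between -2 and 2 according to the kde ...
mass_kde = gkde.integrate_box_1d(-2, 2)
# ... and according to the true standard normal distribution
mass_true = stats.norm.cdf(2) - stats.norm.cdf(-2)
If the estimate is as close as the plot suggests, the two masses should be nearly identical.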
Now, to make it a bit more difficult, we can look at a bimodal distribution and see whether the estimator still fits so well. As an example, I pick a mixture of two normal distributions: 60% of the sample comes from a normal distribution with mean -3 and standard deviation one, and 40% of the observations come from a normal distribution with mean 3 and the same standard deviation.
alpha = 0.6 #weight for (prob of) lower distribution
mlow, mhigh = (-3,3) #mean locations for gaussian mixture
xn = np.concatenate([mlow + np.random.randn(int(alpha * n_basesample)),
                     mhigh + np.random.randn(int((1-alpha) * n_basesample))])
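A quick sanity check on the generated sample (my own addition, not in the original post): with these weights, the sample mean should be close to alpha*mlow + (1-alpha)*mhigh = -0.6, and the sample size should still be n_basesample.
xn.size    # 600 + 400 = 1000 observations
xn.mean()  # should be close to 0.6*(-3) + 0.4*3 = -0.6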
With the new data set, we can run the same commands as above, except that we have to change the plot of the data generating density to use the mixture density instead:
plt.plot(ind, alpha * stats.norm.pdf(ind, loc=mlow) +
              (1-alpha) * stats.norm.pdf(ind, loc=mhigh),
         color="r", label='normal mix')
The next graph shows what we get in this case.
The estimated density is oversmoothed: the peaks of the estimated pdf are too low. So the automatic selection of the smoothing parameter doesn’t work in this case. Now it’s time to find out how gaussian_kde actually selects the smoothing parameter, or bandwidth, of the kernel.
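As a small first peek (again my own aside, not from the original post), we can at least ask the fitted instance which factor it used; covariance_factor is documented in the help output above and defaults to scotts_factor:
# the coefficient that multiplies the data covariance matrix to obtain
# the kernel covariance matrix (scotts_factor by default, per the help above)
gkde.covariance_factor()
# the sample size and dimension that this factor is based on
gkde.n, gkde.d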
But I’m out of time for today. We’ll dig into this next time.