In scipy.stats we can find a class to estimate and use a gaussian kernel density estimator, scipy.stats.stats.gaussian_kde.
Until recently, I didn’t know how this part of scipy works, and the following describes roughly how I figured out what it does.
First, we can look at help to see whether the documentation provides enough information for its use. Here is the first part.
>>> help(stats.gaussian_kde)
Help on class gaussian_kde in module scipy.stats.kde:
class gaussian_kde(__builtin__.object)
 Representation of a kerneldensity estimate using Gaussian kernels.

 Parameters
 
 dataset : (# of dims, # of data)array
 datapoints to estimate from

 Members
 
 d : int
 number of dimensions
 n : int
 number of datapoints

 Methods
 
 kde.evaluate(points) : array
 evaluate the estimated pdf on a provided set of points
 kde(points) : array
 same as kde.evaluate(points)
 kde.integrate_gaussian(mean, cov) : float
 multiply pdf with a specified Gaussian and integrate over the whole domain
 kde.integrate_box_1d(low, high) : float
 integrate pdf (1D only) between two bounds
 kde.integrate_box(low_bounds, high_bounds) : float
 integrate pdf over a rectangular space between low_bounds and high_bounds
 kde.integrate_kde(other_kde) : float
 integrate two kernel density estimates multiplied together

 Internal Methods
 
 kde.covariance_factor() : float
 computes the coefficient that multiplies the data covariance matrix to
 obtain the kernel covariance matrix. Set this method to
 kde.scotts_factor or kde.silverman_factor (or subclass to provide your
 own). The default is scotts_factor.

 Methods defined here:

 __call__ = evaluate(self, points)

 __init__(self, dataset)

 covariance_factor = scotts_factor(self)

 evaluate(self, points)
 Evaluate the estimated pdf on a set of points.

...
The first two basic methods that I was interested in is the initialization, __init__ just takes a data set as a argument, and then evaluate, or __call__ which takes as function argument the list of points at which we want to evaluated the estimated pdf.
Let’s just try this out:
First, I get the standard imports:
import numpy as np
from scipy import stats
import matplotlib.pylab as plt
then I generate a sample that I can feed to the kde, as usual for continuous variables, I start with a normal distribution:
n_basesample = 1000
np.random.seed(8765678)
xn = np.random.randn(n_basesample)
Now, we create an instance of the gaussian_kde class and feed our sample to it:
gkde=stats.gaussian_kde(xn)
We need some points at which we evaluate the density funtion for the estimated density function:
ind = np.linspace(7,7,101)
kdepdf = gkde.evaluate(ind)
and finally we create the plot of the histogram of our data, together with the density that created our sample, the data generating process DGP, and finally the estimated density.
plt.figure()
# plot histgram of sample
plt.hist(xn, bins=20, normed=1)
# plot data generating density
plt.plot(ind, stats.norm.pdf(ind), color="r", label='DGP normal')
# plot estimated density
plt.plot(ind, kdepdf, label='kde', color="g")
plt.title('Kernel Density Estimation')
plt.legend()
#plt.show()
and this is the graph that we get
This graph looks pretty good, when the underlying distribution is the normal distribution, then the gaussian kernel density estimate follows very closely the true distribution, at least for a large sample as we used.
Now, to make it a bit more difficult we can look at a bimodal distribution, and see if it is still able to fit so well. As an example, I pick a mixture of two normal distributions, 60% of the sample comes from a normal distribution with mean 3 and standard deviation equal to one, 40% of the observation are from the normal distribution with mean 3 and the same standard deviation.
alpha = 0.6 #weight for (prob of) lower distribution
mlow, mhigh = (3,3) #mean locations for gaussian mixture
xn = np.concatenate([mlow + np.random.randn(alpha * n_basesample),
mhigh + np.random.randn((1alpha) * n_basesample)])
With the new data set, we can run the same commands as above, except that we have to change the plot of the data generating density to use the mixture density instead:
plt.plot(ind, alpha * stats.norm.pdf(ind, loc=mlow) +
(1alpha) * stats.norm.pdf(ind, loc=mhigh),
color="r", label='normal mix')
The next graph shows what we get in this case.
The estimated density is oversmoothing, the peaks of the estimated pdf are too small. So the automatic selection of the smoothing parameter doesn’t work in this case. Now, it’s time to find out how gaussian_kde actually selects the smoothing parameter or bandwith of the kernel.
But I’m out of time for today. We dig into this next time.
Hello,
ReplyDeleteI work daily with the scipy.stats.kde package and I find it quite useful. If you make some improvements, I would really appreciate to know it!
Regards
Stefano
Hi,
ReplyDeleteI found this tutorial very helpful, thanks!
I think that any kde will always broaden distributions somewhat, more so for small sample sizes. I tried to illustrate this point by giving just one sample, but this results in a ValueError. (Btw, giving an array of ints results in a gaussian_kde value of 0, which probably is a bug?) One very clear example of broadening is sampling a uniform distribution within limits 0 to 1 and computing the gaussian_kde, which will always have Gaussian wings below 0 and above 1, where the pdf really was 0.
If you find out how the automatic kernel width selection works, whether it is adaptive with position (smaller width in regions of higher sample density) and how it can be influenced by the user, or if it is possible to introduce limits xmin, xmax beyond which the gaussian_kde pdf is 0, please share!
Cheers, Christoph
thank you for this tutorial!
ReplyDeleteThis fail with 2D data set. How to estimate KDE if you have 2D point (X,Y) ?
ReplyDeleteThanks for the example code
ReplyDeleteIn case you want to fiddle with the bandwidth, you can use the "set_bandwidth" function  it takes a constant or a function or a name of it's inbuilt functions. In some cases you may want to reduce to smoothing  I find it estimates far too high
ReplyDelete