An ongoing, unfinished project of mine is to look at dependence measures between two random variables when we observe two samples.
Pearson's correlation is the most commonly used measure. However, it has the disadvantage that it can only measure linear (affine) relationships between the two variables. scipy.stats also has Spearman's rho and Kendall's tau, which both measure monotonic relationships. Neither of them can capture all types of non-linear dependency.
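To illustrate with a small sketch (the quadratic example and variable names are mine): on a symmetric, non-monotonic relationship all three measures report essentially zero, even though y is a noisy deterministic function of x.

import numpy as np
from scipy import stats

rs = np.random.RandomState(12345)
x = rs.uniform(-1, 1, size=500)
y = x**2 + 0.1 * rs.standard_normal(500)  # strong, but neither linear nor monotonic

print(stats.pearsonr(x, y)[0])    # roughly zero: no linear relationship
print(stats.spearmanr(x, y)[0])   # roughly zero: not monotonic
print(stats.kendalltau(x, y)[0])  # roughly zero: not monotonic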
There exist measures that can capture any type of non-linear relationship; mutual information is one of them, and the first that I implemented as an experimental version in the statsmodels sandbox.
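As a minimal sketch of the idea (not the sandbox code, which has more options), a histogram-based plug-in estimate of mutual information can be written as follows; mutual_info_binned is just an illustrative name:

import numpy as np

def mutual_info_binned(x, y, bins=10):
    # plug-in estimate: bin the sample, then sum p * log(p / (px * py))
    fxy, _, _ = np.histogram2d(x, y, bins=bins)
    fxy /= fxy.sum()                                    # joint probabilities
    fx_fy = np.outer(fxy.sum(axis=1), fxy.sum(axis=0))  # product of marginals
    mask = fxy > 0
    return np.sum(fxy[mask] * np.log(fxy[mask] / fx_fy[mask]))

The (population) mutual information is zero if and only if the two variables are independent, which is what makes it attractive as a general dependence measure.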
Another one that has been popular in econometrics for some time is the Hellinger distance, which is a proper metric; see, for example, Granger and Lin (1994).
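Granger and Lin work with a nonparametric version based on kernel density estimates; the following is only a crude binned sketch of the same idea, measuring the Hellinger distance between the estimated joint distribution and the product of its marginals (the function name is mine):

import numpy as np

def hellinger_dependence(x, y, bins=10):
    # binned Hellinger distance between the joint distribution and the
    # product of the marginals; zero when the binned joint factorizes
    fxy, _, _ = np.histogram2d(x, y, bins=bins)
    fxy /= fxy.sum()
    fx_fy = np.outer(fxy.sum(axis=1), fxy.sum(axis=0))
    return np.sqrt(0.5 * np.sum((np.sqrt(fxy) - np.sqrt(fx_fy))**2))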
Today, Satrajit and Yaroslav had a short discussion about distance correlation on the scikit-learn mailing list. I had downloaded some papers about it a while ago but hadn't read them yet. However, the Wikipedia link in the mailing list discussion looked reasonably easy to implement. After fixing a silly bug introduced while copying from my session to a script file, I get the same results as R. (Here's my gist.)
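The gist has the actual script; stripped down, the Wikipedia formulas translate into something like the following sketch, which returns the squared distance correlation (see the note on the square root below):

import numpy as np
from scipy.spatial.distance import pdist, squareform

def dist_corr(x, y):
    # accept 1d samples or (n, p) arrays, so the same code also
    # covers the multivariate case
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    a = squareform(pdist(x))  # pairwise Euclidean distances
    b = squareform(pdist(y))
    # double centering of the distance matrices
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()    # squared sample distance covariance
    dvarx = (A * A).mean()
    dvary = (B * B).mean()
    return dcov2 / np.sqrt(dvarx * dvary)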
Yaroslav has a more feature-rich implementation in pymvpa, and Satrajit is working on a version for scikit-learn.
I'm looking only at two univariate variables. A possible benefit of distance correlation is that it naturally extends to two multivariate variables, while mutual information looks, at least computationally, more difficult, and I have not seen a multivariate version of the Hellinger distance yet.
But I wanted to see examples, so I wrote something that looks similar to the graphs in Wikipedia - Distance_correlation. (They are not the same examples, since I only looked briefly at the source of the Wikipedia plots after writing mine.) Compared to mutual information, distance correlation shows a much weaker correlation or dependence between the two variables, while Kendall's tau doesn't find any correlation at all (since none of the relationships are monotonic).
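One such non-monotonic example (not one of my plots, just an illustration using the sketches from above): points on a noisy circle.

import numpy as np
from scipy import stats

rs = np.random.RandomState(0)
t = rs.uniform(0, 2 * np.pi, 500)
xc = np.cos(t) + 0.1 * rs.standard_normal(500)
yc = np.sin(t) + 0.1 * rs.standard_normal(500)

print(stats.kendalltau(xc, yc)[0])  # essentially zero: nothing monotonic
print(dist_corr(xc, yc))            # positive, but fairly weak
print(mutual_info_binned(xc, yc))   # picks up the dependence more strongly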
Note: I didn't take the square root in the distance correlation. Taking the square root, as the energy package in R does, I get
>>> np.sqrt([0.23052, 0.0621, 0.03831, 0.0042])
array([ 0.48012498, 0.24919872, 0.19572941, 0.06480741])

Now, what is this useful for? The main reason that I started to look in this direction was related to tests of independence of random variables. In the Gaussian case, zero correlation implies independence, but this is not the case for other distributions. There are many statistical tests of the null hypothesis that two variables are independent; the ones I started to look at were based on copulas. For goodness-of-fit tests there are many measures of divergence that define a distance between two probability functions, which can similarly be used to measure the distance between the estimated joint probability density function and the density under the hypothesis of independence.
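As a sketch of how such a test could look, here is a generic permutation test using dist_corr from above as the test statistic (dcor_permutation_test is just an illustrative name, not tied to any particular package):

import numpy as np

def dcor_permutation_test(x, y, n_perm=999, seed=0):
    # permutation test of the null hypothesis of independence: permuting
    # y breaks any dependence while keeping the marginals fixed
    rs = np.random.RandomState(seed)
    stat = dist_corr(x, y)
    exceed = sum(dist_corr(x, rs.permutation(y)) >= stat
                 for _ in range(n_perm))
    return stat, (exceed + 1) / (n_perm + 1)  # statistic and p-value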
While those are oriented towards testing, looking at the measures themselves gives a quantification of the strength of the relationship. For example, auto- or cross-correlation in time series analysis only measures linear dependence. The non-linear correlation measures are also able to capture dependencies that remain hidden from Pearson's correlation and from Kendall's tau.
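A hypothetical time series illustration: an ARCH-type process has essentially zero autocorrelation, yet consecutive observations are strongly dependent, and distance correlation (dist_corr from above) can pick that up.

import numpy as np

rs = np.random.RandomState(54321)
n = 1000
e = rs.standard_normal(n)
r = np.empty(n)
r[0] = e[0]
for t in range(1, n):
    # ARCH(1)-type recursion: uncorrelated but not independent
    r[t] = e[t] * np.sqrt(0.2 + 0.7 * r[t - 1]**2)

print(np.corrcoef(r[1:], r[:-1])[0, 1])  # linear autocorrelation: near zero
print(dist_corr(r[1:], r[:-1]))          # nonlinear dependence: positive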
PS: There are more interesting things to do than there is time for them.