An ongoing, unfinished project of mine is looking at dependence measures between two random variables when we observe two samples.

Pearson's correlation is the most commonly used measure. However, it has the disadvantage that it can only measure linear (affine) relationships between the two variables. scipy.stats also has Spearman's rho and Kendall's tau, which both measure monotonic relationships. Neither of them can capture all types of non-linear dependency.
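As a quick toy illustration (my own example, not one of the plots below): a deterministic but non-monotonic relationship that all three scipy.stats measures miss.

```python
# y is a deterministic function of x, but the relationship is neither
# linear nor monotonic, so all three measures come out near zero.
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, 500)
y = x**2  # perfectly dependent on x

r, _ = stats.pearsonr(x, y)
rho, _ = stats.spearmanr(x, y)
tau, _ = stats.kendalltau(x, y)
print(r, rho, tau)  # all close to zero
```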

There exist measures that can capture any type of non-linear relationship. Mutual information is one of them, and the first that I implemented as an experimental version in the statsmodels sandbox.

Another one that has been popular in econometrics for some time is Hellinger distance, which is a proper metric; see, for example, Granger and Lin (1994).

Today, Satrajit and Yaroslav had a short discussion about distance correlation on the scikit-learn mailing list. I had downloaded some papers about it a while ago but hadn't read them yet. However, the Wikipedia article linked in the mailing list discussion looked reasonably easy to implement. After fixing a silly bug from copying my session to a script file, I get the same results as R. (Here's my gist.)
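A minimal numpy sketch along the lines of the Wikipedia definition (double-centered pairwise distance matrices); the function name is mine, and this version takes the final square root as the R *energy* package does, which, as noted below, my script originally did not.

```python
# Distance correlation for two univariate samples, following the
# Wikipedia definition: double-center the pairwise distance matrices,
# then take the normalized average of their elementwise product.
import numpy as np

def dist_corr(x, y):
    x = np.asarray(x, float)[:, None]
    y = np.asarray(y, float)[:, None]
    a = np.abs(x - x.T)                       # pairwise |x_i - x_j|
    b = np.abs(y - y.T)
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()   # double centering
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()                    # squared distance covariance
    dvar_x = (A * A).mean()
    dvar_y = (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y))

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, 500)
print(dist_corr(x, x**2))  # clearly above zero, unlike Kendall's tau
```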

Yaroslav has a more feature-rich implementation in pymvpa, and Satrajit is working on a version for scikit-learn.

I'm looking only at two univariate variables. A possible benefit of distance correlation is that it extends naturally to two multivariate variables, while mutual information looks, at least computationally, more difficult, and I have not seen a multivariate version of Hellinger distance yet.

But I wanted to see examples, so I wrote something that looks similar to the graphs in the Wikipedia article on distance correlation. (They are not the same examples, since I only looked briefly at the source of the Wikipedia plots after writing mine.) Compared to mutual information, distance correlation shows a much weaker correlation or dependence between the two variables, while Kendall's tau doesn't find any correlation at all (since none of the relationships are monotonic).

Note: I didn't take the square root in the distance correlation. Taking the square root as the *energy* package in R does, I get

>>> np.sqrt([0.23052, 0.0621, 0.03831, 0.0042])
array([ 0.48012498, 0.24919872, 0.19572941, 0.06480741])

Now, what is this useful for? The main reason that I started to look in this direction was related to tests of independence of random variables. In the Gaussian case, zero correlation implies independence, but this is not the case for other distributions. There are many statistical tests of the null hypothesis that two variables are independent; the ones I started to look at were based on copulas. For goodness-of-fit tests there are many measures of divergence that define a distance between two probability functions, which can similarly be used to measure the distance between the estimated joint probability density function and the density under the hypothesis of independence.
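As a sketch of the testing idea: any dependence measure can be turned into an independence test by comparing the observed statistic to its distribution under random shuffles of one sample. This is a generic permutation recipe, not a specific test from the literature; plain Pearson correlation stands in for the statistic here.

```python
# Generic permutation test of independence: shuffle y, recompute the
# statistic, and count how often the shuffled value reaches the observed one.
import numpy as np

def perm_pvalue(x, y, stat, n_perm=999, seed=0):
    rng = np.random.RandomState(seed)
    t_obs = stat(x, y)
    count = sum(stat(x, rng.permutation(y)) >= t_obs for _ in range(n_perm))
    return (count + 1.0) / (n_perm + 1.0)   # add-one to avoid p = 0

rng = np.random.RandomState(1)
x = rng.normal(size=200)
y = x + rng.normal(size=200)                # clearly dependent
pval = perm_pvalue(x, y, lambda a, b: abs(np.corrcoef(a, b)[0, 1]))
print(pval)  # small: independence rejected
```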

While those are oriented towards testing, the measures themselves give a quantification of the strength of the relationship. For example, auto- or cross-correlation in time series analysis only measures linear dependence. The non-linear correlation measures are also able to capture dependencies that would be hidden from Pearson's correlation or from Kendall's tau.

PS: There are more interesting things to do than there is time for.

There is another non-linear, very robust measure of dependence called the Hilbert-Schmidt Independence Criterion, which has recently been used in the machine-learning literature. Here's a resource for it:

http://www.kyb.mpg.de/fileadmin/user_upload/files/publications/attachments/hsicALT05_%5b0%5d.pdf

Cameron,

Thanks for the reference.

I'm a bit out of practice reading Smola and Schoelkopf papers. I will leave any serious implementation to the machine learning developers, since incomplete Cholesky decomposition sounds outside my area.

However, from my fast reading, the definition of HSIC looks suspiciously similar to distance correlation with the Euclidean distance replaced by a kernel.

Just playing with the idea: if I replace the absolute distance in my univariate case with a distance measure that levels off, then I can get the "Distance" correlation for the two middle cases up to np.sqrt(0.33) = 0.57, which looks much closer to mutual information.
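The same idea in code form (entirely exploratory, not a standard measure): the absolute distance in the distance-correlation computation replaced by a bounded transform 1 - exp(-|d|/scale). The function name and the scale value are arbitrary choices for illustration.

```python
# Exploratory variant: bounded "distance" 1 - exp(-|d|/scale) in place of
# the absolute distance, then the usual double-centering and normalization.
import numpy as np

def bounded_dcorr(x, y, scale=1.0):
    def d(v):
        v = np.asarray(v, float)
        return 1.0 - np.exp(-np.abs(v[:, None] - v[None, :]) / scale)
    a, b = d(x), d(y)
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    # Cauchy-Schwarz keeps this ratio in [-1, 1]
    return (A * B).mean() / np.sqrt((A * A).mean() * (B * B).mean())

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, 400)
r_mid = bounded_dcorr(x, x**2, scale=0.1)
print(r_mid)
```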

A preliminary update:

I calculated several versions using a Gaussian kernel. I was puzzled for a while by the observation that I can make the correlation measure whatever I like by varying the scale. It looks like, in contrast to distance correlation, HSIC is not invariant to a common rescaling or bandwidth choice.
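Roughly what this means in code (a quick biased-estimator sketch, not the exact estimator from the paper): HSIC with Gaussian kernels, evaluated at two bandwidths, gives different values for the same data.

```python
# Biased HSIC estimate trace(K H L H) / n^2 with Gaussian kernels;
# H is the centering matrix. The value depends on the bandwidth choice.
import numpy as np

def hsic(x, y, sx, sy):
    n = len(x)
    K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * sx**2))
    L = np.exp(-(y[:, None] - y[None, :])**2 / (2 * sy**2))
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n**2

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, 200)
y = x**2
h_narrow = hsic(x, y, 0.1, 0.1)
h_wide = hsic(x, y, 1.0, 1.0)
print(h_narrow, h_wide)  # two different values for the same data
```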

There is still some chance that I made a mistake, but I don't think so.

Hello Josef,

thanks to your inspiration, I've written a scipy.weave based implementation of distance correlation that's not quadratic in space - and even a bit faster than the numpy stuff.

I'd like to include your implementation, since it's so easy to understand, and also as a reference for the unit tests, before I publish it on pypi.

Would you mind sticking a suitable license (MIT?) to https://gist.github.com/2938402#file_try_distance_corr.py ?

Best regards,

Florian

Hi Florian,

No problem about the license. Since it will end up in statsmodels, the license will be BSD-3, but I can also add an MIT license to the gist.

As an aside: scipy.weave is not maintained anymore, and most developers have switched to cython to implement extensions.

I'm staying for now with the easier-to-understand version that is quadratic in the number of observations, but I added a loop over the multivariate dimension in my latest version so as not to increase memory consumption even more.

I'm glad you find it useful.

Hello Josef,

that's awesome!

I'm not particularly keen on maintaining a package - statsmodels seems to be a fine home, and BSD-3 would work for me if you'd want to include it.

Thanks for the hint about weave - I missed that it was deprecated.

I've redone it in cython - it's about 30% slower than weave though. Still about 7 times faster than the simple version.

I've put it up at https://gist.github.com/2960386 for now - the dcov_all doesn't read much more complicated ;).

I'm no statistician - can you elaborate on why you don't take the square root?

So long,

Florian

Hi Florian,

Thanks for the cython version. The main advantage is moving away from the memory requirement, which will allow much larger arrays.

About the square root: I coded it this way mainly in analogy to the standard Pearson correlation coefficient, before I saw the square root in the R package. The kernel version in Cameron's reference doesn't take a square root either, as far as I can tell from my superficial reading. Reading the small print in the Wikipedia page again, I see now that they mention the square root.

I need to look at the papers on Distance Correlation to justify it either way.

Dear Josef,

I found your blog when I was browsing the web for info on the distance correlation coefficient, and it was very nice to find your implementation in Python, my favourite programming language. I am now using it in my research. Thank you very much!

Cheers,

Pedro

I can only find the original gist. Has any more work been done on distance correlation (either in statsmodels or sklearn)?

Thanks,

Ian

There was a discussion on the sklearn mailing list, but I don't know if they implemented it.

There is nothing yet in statsmodels. For me it was mainly "educational", trying to understand additional non-linear dependence measures. My code is not yet cleaned up, tested, and used for statistical measures and tests. (Part of my non-existing non-linear dependence estimation and testing toolbox.)

hi josef, it's funny i ran back into this today! will try to take florian's implementation and push it to scikit-learn, and hopefully i can add the stats for it as well.

Do you think there is a one-to-one mapping between Kendall's tau and MI? When I plot the MI against tau for different copula families, I find that they are monotonically increasing functions, which are also convex.

Thank you

Hao

Hi He Hao,

I think there is no direct relationship, from the definitions, between Kendall's tau and mutual information. Kendall's tau is defined by comparing all pairs of observation pairs, while mutual information uses the (empirical) distribution.

However, for some parametric families there might be a one-to-one mapping. Some copula families can be parameterized by Kendall's tau or Spearman's rho, and might or will also have a corresponding unique MI. That's more a property of copulas than of the definition of the dependence measures.

(Written based on what I remember, I haven't looked at copulas in a while.)

what do the "mfk" and "mfb" on the picture mean?

Hi Marcin,

If I remember correctly, "mfk" is mutual information calculated based on the kernel density estimate, and "mfb" is mutual information based on the binned sample. I would have to find the script file again to be sure about what I used.
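Roughly what I mean by the binned version (a plug-in sketch with an arbitrary bin count, not necessarily the exact script I used for the plots):

```python
# Binned mutual information: estimate the joint distribution with a
# 2-d histogram, take the marginals, and plug into the MI sum (in nats).
import numpy as np

def mi_binned(x, y, bins=10):
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1)
    py = pxy.sum(axis=0)
    mask = pxy > 0                      # skip empty cells (0 * log 0 = 0)
    return np.sum(pxy[mask] * np.log(pxy[mask] / np.outer(px, py)[mask]))

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, 1000)
m_dep = mi_binned(x, x**2)              # deterministic relationship
m_ind = mi_binned(x, rng.uniform(-1, 1, 1000))   # independent draw
print(m_dep, m_ind)  # large vs. close to zero
```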