Sunday, June 17, 2012

QR and Sequential Regression with Fixed Order

Background

I was fixing some bugs in determining the number of lags in statistical tests for time series and remembered that we still haven't optimized our sequential regressions. So, I opened an issue.
I learned a lot of linear algebra through the numpy and scipy mailing lists. The relevant idea and thread is here, quoting myself:
But from your description of QR, I thought specifically of the case
where we have a "natural" ordering of the regressors, similar to the
polynomial case of you and Anne. In the timeseries case, it would be by
increasing lags

yt on y_{t-1}
yt on y_{t-1}, y_{t-2}
...
...
yt on y_{t-k} for k= 1,...,K

or yt on xt and the lags of xt

This is really sequential LS with a predefined sequence, not PLS or
PCA/PCR or similar orthogonalization by "importance".
The usual procedure for deciding on the appropriate number of lags
usually loops over OLS with increasing number of regressors.
From the discussion, I thought there might be a way to "cheat" in this
using QR and Gram-Schmidt.
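To make that "usual procedure" concrete, here is a minimal sketch of the plain looping approach for picking the lag length of an autoregression by aic (an illustration with made-up simulation settings, not code from the original thread):
import numpy as np
from statsmodels.regression.linear_model import OLS

np.random.seed(12345)
# simulate a simple AR(2) series as an illustrative example
nobs_ar = 200
e = np.random.randn(nobs_ar + 100)
ys = np.zeros(nobs_ar + 100)
for t in range(2, nobs_ar + 100):
    ys[t] = 0.6 * ys[t - 1] + 0.2 * ys[t - 2] + e[t]
ys = ys[100:]   # drop burn-in observations

maxlag = 4
# column j of the lag matrix holds the series shifted by j+1 periods
xlags = np.column_stack([ys[maxlag - j - 1:-j - 1] for j in range(maxlag)])
yt = ys[maxlag:]

aic_lags = []
for k in range(1, maxlag + 1):
    res = OLS(yt, xlags[:, :k]).fit()   # regress y_t on lags 1 to k
    aic_lags.append(res.aic)
print(np.argmin(aic_lags) + 1)          # lag length with the smallest aic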
I never got around to trying out the QR idea, but I thought I'd give it a try today. A few hours of trial and error and working through some algebra later, I have what looks like a reasonable solution.

Using QR Decomposition for Sequential Regression

The following is a commented walk-through of my current script; the core is just a few lines.
First, some imports
import numpy as np
from statsmodels.regression.linear_model import OLS

from numpy.testing import assert_almost_equal
Then, we define a toy example to check whether it works. The last two variables have no effect. I chose it this way so that the minimum of the information criteria, aic and bic, is in the interior.
nobs, k_vars = 50, 4

np.random.seed(85325783)
x = np.random.randn(nobs, k_vars)
y = x[:, :k_vars-2].sum(1) + np.random.randn(nobs)
We start with the boring way of doing the sequential regression: use OLS from statsmodels and loop with an expanding number of explanatory variables in exog.
Note, this uses the generalized inverse, pinv, for each new matrix of explanatory variables. I'm storing parameters, residual sum of squares and information criteria to have a benchmark to compare against later.
ssr_ols = np.zeros(k_vars)
params_ols = np.zeros((k_vars, k_vars))
ic_ols = np.zeros((2, k_vars))
for i in range(4):
    res = OLS(y, x[:,:i+1]).fit()
    ssr_ols[i] = res.ssr
    params_ols[i, :i+1] = res.params
    ic_ols[0, i] = res.aic
    ic_ols[1, i] = res.bic
In my example, the estimated coefficients are
>>> params_ols
array([[ 0.7585129 ,  0.        ,  0.        ,  0.        ],
       [ 1.00564191,  0.96414302,  0.        ,  0.        ],
       [ 1.00776594,  0.93035613, -0.10759121,  0.        ],
       [ 1.03379054,  0.91302697, -0.12998046, -0.08787965]])
The first two coefficients are close to one, the last two coefficients are close to zero, all in the neighborhood of the true values of my simulated model.
Instead of using pinv, statsmodels also has the option to use QR for estimating a linear model, with the basic code as the following. This uses the QR decomposition of the matrix of explanatory variables.
q, r = np.linalg.qr(x)
qy = np.dot(q.T, y)

print '\nparams full model'
print params_ols[-1]
print np.linalg.solve(r, qy)
It gives the same result as the pinv version
params full model
[ 1.03379054  0.91302697 -0.12998046 -0.08787965]
[ 1.03379054  0.91302697 -0.12998046 -0.08787965]
We already have the QR decomposition for the full model. QR calculates the orthogonalization using the sequence as it is given by the matrix of explanatory variables.
My first attempt was to find the OLS parameters sequentially, using the appropriate subsets or slices of the QR matrices:
params1 = np.zeros((k_vars, k_vars))
for i in range(4):
    params1[i, :i+1] = np.linalg.solve(r[:i+1, :i+1], qy[:i+1])
This gives us the same parameter estimates as the OLS loop. The computations are on much smaller arrays than the original problem: r is (k_vars, k_vars) and qy is just one-dimensional with length k_vars.
There is still one loop, although a much smaller one, and I want to get rid of it. The following is trivial once we know what's going on, but it took me a while to figure this out. Some observations about QR:
  • q is the orthogonal matrix that spans the same space as x.
  • theta = r beta = qy is the parameter vector in the orthogonalized problem.
  • beta = inv(r) theta is the parameter vector for the original regressors.
All we need to do is to select the sequentially increasing subsets of r and theta. For calculating predicted values, we can work directly in the orthogonalized system.
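As a quick numerical check of the identities above (not in the original script), we can verify them with the arrays we already have
# x = q r, r beta = q'y, and beta solves the triangular system
beta_full = params_ols[-1]
assert_almost_equal(np.dot(q, r), x, decimal=12)
assert_almost_equal(np.dot(r, beta_full), qy, decimal=12)
assert_almost_equal(np.linalg.solve(r, qy), beta_full, decimal=12)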
Using an upper triangular matrix of ones
upp_tri = (np.arange(4)[:,None] <= np.arange(4)).astype(float)
we can get the expanding theta or qy in a matrix
>>> upp_tri * qy[:,None]
array([[ 4.87045847,  4.87045847,  4.87045847,  4.87045847],
       [-0.        , -6.94085588, -6.94085588, -6.94085588],
       [ 0.        ,  0.        ,  0.70410517,  0.70410517],
       [ 0.        ,  0.        ,  0.        ,  0.60930516]])
Then, we can just solve the system for all steps at once. First, I tried the matrix inverse since that's easier to understand theoretically
params2 = np.dot(np.linalg.inv(r), upp_tri * qy[:,None]).T
But, we can avoid the inverse and use linalg.solve directly:
params2 = np.linalg.solve(r, upp_tri * qy[:,None]).T
That's it, one line and I have the parameters for the four sequential regression problems. Comparing the parameter estimates of the three methods, we see they are the same (at floating point precision)
>>> np.max(np.abs(params1 - params2))
0.0
>>> np.max(np.abs(params1 - params_ols))
5.5511151231257827e-16
The rest is calculating fitted values and residuals
contrib = q * qy
fitted = contrib.cumsum(1)
resids = y[:,None] - fitted
ssrs = (resids**2).sum(0)

print '\nresidual sum of squares ssr'
print ssrs
print ssr_ols
which gives us identical residual sum of squares, ssr
[ 80.31708213  32.14160173  31.64583764  31.27458486]
[ 80.31708213  32.14160173  31.64583764  31.27458486]
I added some asserts in the script, so I don't break anything by accident
assert_almost_equal(ssrs, ssr_ols, decimal=13)
assert_almost_equal(params1, params_ols, decimal=13)
assert_almost_equal(params2, params_ols, decimal=13)
The application I have in mind for this is the selection of the number of regressors, or of lags in the time series case, so I want to calculate the information criteria, aic and bic. In the previous step, I calculated the residual sum of squares, which I can feed to helper functions that I had written earlier to calculate information criteria.
The values cannot be compared directly to the results from OLS, since the aic and bic reported by the OLS Results instance are based on the likelihood value, while here I use the ssr-based definition, which uses a different normalization and drops a constant term. However, the minimum is achieved at two variables in all cases, which was expected since I chose the simulated model for this.
print '\naic, bic'
print ic_ols
import statsmodels.tools.eval_measures as evm
dfmodelwc = range(1, k_vars+1)  #number of variables in regression
aics = [evm.aic_sigma(ssrs[i], nobs, dfmodelwc[i]) for i in range(k_vars)]
bics = [evm.bic_sigma(ssrs[i], nobs, dfmodelwc[i]) for i in range(k_vars)]
print aics
print bics
print 'aic, bic minimum number of variables'
print np.argmin(ic_ols, 1) + 1
print np.argmin(aics) + 1, np.argmin(bics) + 1
prints
aic, bic
[[ 167.59181941  123.80026281  125.02303444  126.43299217]
 [ 169.50384241  127.62430882  130.75910345  134.08108419]]
[4.4259823271765022, 3.5501511951835187, 3.5746066277468964, 3.6028057824070068]
[4.4642227872850651, 3.6266321154006445, 3.689328008072585, 3.7557676228412582]
aic, bic minimum number of variables
[2 2]
2 2
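As a side note, the ssr-based criteria appear to reduce to log(ssr) plus a penalty term. The following reproduces the first aic and bic values above; I reverse-engineered the formulas from the printed numbers, so treat them as an assumption rather than the documented definition.
print(np.log(ssrs[0]) + 2 * 1. / nobs)            # ~4.426, matches aics[0]
print(np.log(ssrs[0]) + 1 * np.log(nobs) / nobs)  # ~4.464, matches bics[0]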
PS: Since I never had any numerical linear algebra training, I can still get excited about figuring out a two-liner. I could have tried to find a book for it, but that would have been boring compared to working it out on my own.
PS: Following pep-8, I'm used to not using capital letters very often.

Friday, June 15, 2012

Non-linear dependence measures - Distance Correlation, Kendall's Tau and Mutual Information



An ongoing, unfinished project of mine is to look at dependence measures between two random variables when we observe two samples.

Pearson's correlation is the most commonly used measure. However, it has the disadvantage that it can only measure linear (affine) relationships between the two variables. Scipy.stats also has Spearman's rho and Kendall's tau, which both measure monotonic relationships. None of them can capture all types of non-linear dependence.
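As a small illustration (an example I made up, not one of the plots discussed below): for a purely quadratic relationship all three measures come out close to zero, even though y is a deterministic function of x.
import numpy as np
from scipy import stats

np.random.seed(12345)
x = np.random.uniform(-1, 1, 500)
y = x**2    # deterministic, but neither linear nor monotonic

print(stats.pearsonr(x, y)[0])    # close to zero: no linear relationship
print(stats.spearmanr(x, y)[0])   # close to zero: no monotonic relationship
print(stats.kendalltau(x, y)[0])  # close to zero: no monotonic relationship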

There exist measures that can capture any type of non-linear relationship; mutual information is one of them, and the first that I implemented, as an experimental version in the statsmodels sandbox.
Another one that has been popular in econometrics for some time is the Hellinger distance, which is a proper metric; see for example Granger and Lin (1994).
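For reference, here is a rough sketch of how mutual information can be estimated from a binned 2-D histogram of the two samples (a simplified illustration, not the sandbox version):
def mutual_info_binned(x, y, bins=10):
    # joint bin frequencies, converted to probabilities
    fxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = fxy / fxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x (column vector)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y (row vector)
    mask = pxy > 0                        # skip empty bins to avoid log(0)
    return np.sum(pxy[mask] * np.log(pxy[mask] / (px * py)[mask]))
On the quadratic example above, mutual_info_binned(x, y) is clearly positive, in contrast to the three correlation measures.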

Today, Satrajit and Yaroslav had a short discussion about distance correlation on the scikit-learn mailing list. I had downloaded some papers about it a while ago, but haven't read them yet. However, the Wikipedia page linked in the mailing list discussion looked reasonably easy to implement. After fixing a silly bug from copying from my session to a script file, I get the same results as R. (Here's my gist.)
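For readers who don't want to follow the links, here is a minimal sketch of the sample distance correlation for two univariate samples, written straight from the Wikipedia definition (an illustration, not the exact code in the gist). Unlike the distance correlation numbers behind my plots (see the note further down), this version takes the final square root, as the energy package does.
def dist_corr(x, y):
    # pairwise absolute distances within each sample
    x = np.asarray(x, float)[:, None]
    y = np.asarray(y, float)[:, None]
    a = np.abs(x - x.T)
    b = np.abs(y - y.T)
    # double-centering: subtract row and column means, add the grand mean
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()          # squared sample distance covariance
    dvarx = (A * A).mean()          # squared sample distance variances
    dvary = (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvarx * dvary))
On the quadratic example above, dist_corr(x, y) is clearly positive, unlike Pearson's, Spearman's and Kendall's correlation.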

Yaroslav has a more feature rich implementation in pymvpa, and Satrajit is working on a version for scikit-learn.

I'm looking only at two univariate variables. A possible benefit of distance correlation is that it extends naturally to two multivariate variables, while mutual information looks, at least computationally, more difficult, and I have not seen a multivariate version of the Hellinger distance yet.

But I wanted to see examples, so I wrote something that looks similar to the graphs in Wikipedia - Distance_correlation. (They are not the same examples, since I only looked briefly at the source of the Wikipedia plots after writing mine.) Compared to mutual information, distance correlation shows a much weaker correlation or dependence between the two variables, while Kendall's tau doesn't find any correlation at all (since none of the relationships are monotonic).

Note: I didn't take the square root in the distance correlation. Taking the square root, as the energy package in R does, I get

>>> np.sqrt([0.23052, 0.0621, 0.03831, 0.0042])
array([ 0.48012498,  0.24919872,  0.19572941,  0.06480741])

Now, what is this useful for? The main reason that I started to look in this direction was related to tests of independence of random variables. In the Gaussian case, zero correlation implies independence, but this is not the case for other distributions. There are many statistical tests of the null hypothesis that two variables are independent; the ones I started to look at were based on copulas. For goodness-of-fit tests there are many measures of divergence that define a distance between two probability functions, which can similarly be used to measure the distance between the estimated joint probability density function and the density under the hypothesis of independence.
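One simple, generic way to turn a dependence measure into a test of independence is a permutation test; here is a sketch (not one of the copula-based tests I mention above) using the dist_corr function from above as the test statistic:
def perm_test_indep(x, y, stat=dist_corr, n_perm=500, seed=0):
    # build the null distribution of the statistic by shuffling one sample
    rng = np.random.RandomState(seed)
    observed = stat(x, y)
    null = np.array([stat(x, rng.permutation(y)) for _ in range(n_perm)])
    pvalue = (null >= observed).mean()   # one-sided permutation p-value
    return observed, pvalue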

While those are oriented towards testing, looking at the measures themselves gives a quantification of the strength of the relationship. For example, auto- or cross-correlation in time series analysis only measures linear dependence. The non-linear correlation measures are also able to capture dependencies that would be hidden from Pearson's correlation or from Kendall's tau.

PS: There are more interesting things to do, than time for it.