Sunday, June 17, 2012

QR and Sequential Regression with Fixed Order

Background

I was fixing some bugs in determining the number of lags in statistical tests for time series, and remembered that we still haven't optimized our sequential regressions. So, I opened an issue.
I learned a lot of linear algebra through the numpy and scipy mailing lists. The relevant idea and thread is here, quoting myself:
But from your description of QR, I thought specifically of the case
where we have a "natural" ordering of the regressors, similar to the
polynomial case of you and Anne. In the timeseries case, it would be by
increasing lags

y_t on y_{t-1}
y_t on y_{t-1}, y_{t-2}
...
y_t on y_{t-1}, ..., y_{t-k}   for k = 1, ..., K

or y_t on x_t and the lags of x_t

This is really sequential LS with a predefined sequence, not PLS or
PCA/PCR or similar orthogonalization by "importance".
The usual procedure for deciding on the appropriate number of lags loops over OLS with an increasing number of regressors.
From the discussion, I thought there might be a way to "cheat" at this
using QR and Gram-Schmidt
I never got around to trying this out, but I thought I'd give it a try today. Some hours of trial and error and algebra later, I have what looks like a reasonable solution.
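To make the lag-selection setting concrete, here is a minimal sketch of the usual loop from the quote above, OLS with an increasing number of lags. This is not part of the original script; the toy series, seed and maxlag are made up for illustration.
import numpy as np
from statsmodels.regression.linear_model import OLS

np.random.seed(12345)
y = np.random.randn(200).cumsum()    # a toy series, for illustration only
maxlag = 6

# build the lag matrix by hand: column k-1 holds y_{t-k}
yt = y[maxlag:]
lags = np.column_stack([y[maxlag - k:-k] for k in range(1, maxlag + 1)])

aic = []
for k in range(1, maxlag + 1):
    res = OLS(yt, lags[:, :k]).fit()   # y_t on y_{t-1}, ..., y_{t-k}
    aic.append(res.aic)

nlags_aic = np.argmin(aic) + 1         # lag length chosen by aic
Each of these regressions redoes the full least squares computation from scratch, which is exactly what the QR trick below avoids.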

Using QR Decomposition for Sequential Regression

The following is a commented walk-through of my current script; the core is just a few lines.
First, some imports
import numpy as np
from statsmodels.regression.linear_model import OLS

from numpy.testing import assert_almost_equal
Then, we define a toy example to check whether it works. The last two variables have no effect. I used that to get the minimum of the information criteria, aic, bic, to be in the interior.
nobs, k_vars = 50, 4

np.random.seed(85325783)
x = np.random.randn(nobs, k_vars)
y = x[:, :k_vars-2].sum(1) + np.random.randn(nobs)
We start with the boring way of doing the sequential regression: use OLS from statsmodels and loop with an expanding number of explanatory variables in exog.
Note that this uses the generalized inverse, pinv, for each new matrix of explanatory variables. I'm storing the parameters, residual sum of squares and information criteria to have a benchmark to compare against later.
ssr_ols = np.zeros(k_vars)
params_ols = np.zeros((k_vars, k_vars))
ic_ols = np.zeros((2, k_vars))
for i in range(4):
    res = OLS(y, x[:,:i+1]).fit()
    ssr_ols[i] = res.ssr
    params_ols[i, :i+1] = res.params
    ic_ols[0, i] = res.aic
    ic_ols[1, i] = res.bic
In my example, the estimated coefficients are
>>> params_ols
array([[ 0.7585129 ,  0.        ,  0.        ,  0.        ],
       [ 1.00564191,  0.96414302,  0.        ,  0.        ],
       [ 1.00776594,  0.93035613, -0.10759121,  0.        ],
       [ 1.03379054,  0.91302697, -0.12998046, -0.08787965]])
The first two coefficients are close to one, the last two are close to zero, all in the neighborhood of the true values of my simulated model.
Instead of using pinv, statsmodels also has the option to use QR to estimate a linear model; the basic code looks like the following. It uses the QR decomposition of the matrix of explanatory variables.
q, r = np.linalg.qr(x)   # x = q r, with orthonormal q and upper triangular r
qy = np.dot(q.T, y)      # q'y, the dependent variable in the orthonormal basis

print '\nparams full model'
print params_ols[-1]
print np.linalg.solve(r, qy)
It gives the same result as the pinv version
params full model
[ 1.03379054  0.91302697 -0.12998046 -0.08787965]
[ 1.03379054  0.91302697 -0.12998046 -0.08787965]
We already have the QR decomposition for the full model. QR calculates the orthogonalization using the columns in the order in which they appear in the matrix of explanatory variables.
My first attempt was to find the OLS parameters sequentially, using the appropriate subsets or slices of the QR matrices
params1 = np.zeros((k_vars, k_vars))
for i in range(4):
    params1[i, :i+1] = np.linalg.solve(r[:i+1, :i+1], qy[:i+1])
This gives us the same parameter estimates as the OLS loop. It works because r is upper triangular, so q[:, :i+1] and r[:i+1, :i+1] are exactly the QR factors of x[:, :i+1]. The computations are on much smaller arrays than the original problem: r is (k_vars, k_vars) and qy is one-dimensional with length k_vars.
There is still one loop, although a much smaller one, and I want to get rid of it. The following is trivial once we know what's going on, but it took me a while to figure this out. Some observations about QR:
  • q has orthonormal columns that span the same column space as x.
  • theta = r beta = qy is the parameter vector in the orthogonalized problem.
  • beta = inv(r) theta is the parameter vector for the original regressors.
All we need to do is to select the sequentially increasing subsets of r and theta. For calculating predicted values, we can work directly in the orthogonalized system.
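As a quick sanity check of these relations (not in the original script), we can verify them numerically with the q, r and qy computed above:
# sanity checks for the QR relations above (sketch, reusing q, r, qy, params_ols)
assert_almost_equal(np.dot(q.T, q), np.eye(k_vars), decimal=12)          # columns of q are orthonormal
assert_almost_equal(np.dot(r, params_ols[-1]), qy, decimal=12)           # theta = r beta = q'y for the full model
assert_almost_equal(np.linalg.solve(r, qy), params_ols[-1], decimal=12)  # beta = inv(r) theta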
Using an upper triangular matrix of ones
upp_tri = (np.arange(4)[:,None] <= np.arange(4)).astype(float)
we can get the expanding theta or qy in a matrix
>>> upp_tri * qy[:,None]
array([[ 4.87045847,  4.87045847,  4.87045847,  4.87045847],
       [-0.        , -6.94085588, -6.94085588, -6.94085588],
       [ 0.        ,  0.        ,  0.70410517,  0.70410517],
       [ 0.        ,  0.        ,  0.        ,  0.60930516]])
Then, we can just solve the system for all steps at once. First, I tried the matrix inverse since that's easier to understand theoretically
params2 = np.dot(np.linalg.inv(r), upp_tri * qy[:,None]).T
But, we can avoid the inverse and use linalg.solve directly:
params2 = np.linalg.solve(r, upp_tri * qy[:,None]).T
That's it: one line, and I have the parameters for all four sequential regression problems. Comparing the parameter estimates of the three methods, we see that they are the same (up to floating point precision)
>>> np.max(np.abs(params1 - params2))
0.0
>>> np.max(np.abs(params1 - params_ols))
5.5511151231257827e-16
The rest is calculating fitted values and residuals
contrib = q * qy            # contribution of each orthogonalized regressor to the fit
fitted = contrib.cumsum(1)  # column i: fitted values using the first i+1 regressors
resids = y[:,None] - fitted
ssrs = (resids**2).sum(0)

print '\nresidual sum of squares ssr'
print ssrs
print ssr_ols
which gives us identical residual sum of squares, ssr
[ 80.31708213  32.14160173  31.64583764  31.27458486]
[ 80.31708213  32.14160173  31.64583764  31.27458486]
I added some asserts in the script, so I don't break anything by accident
assert_almost_equal(ssrs, ssr_ols, decimal=13)
assert_almost_equal(params1, params_ols, decimal=13)
assert_almost_equal(params2, params_ols, decimal=13)
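As an extra check that is not in the original script, the cumulative fitted values from the orthogonalized system can also be compared against the OLS loop:
# extra check (sketch): column i of fitted equals the OLS fitted values
# when regressing on the first i+1 explanatory variables
for i in range(k_vars):
    fitted_ols = OLS(y, x[:, :i+1]).fit().fittedvalues
    assert_almost_equal(fitted[:, i], fitted_ols, decimal=12)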
The applications that I have in mind for this are the selection of the number of regressors, or of the number of lags in the time series case, so I want to calculate the information criteria, aic and bic. In the previous step, I calculated the residual sum of squares, which I can feed into helper functions that I had written to calculate information criteria.
The values cannot be directly compared to the results from OLS, since aic and bic reported by the OLS Results instance are based on the likelihood value, while here I use the ssr based definition, which uses a different normalization and drops a constant term. However, the minimum is achieved at two variables for all cases, which was expected since I selected the simulated model for this.
print '\naic, bic'
print ic_ols
import statsmodels.tools.eval_measures as evm
dfmodelwc = range(1, k_vars+1)  #number of variables in regression
aics = [evm.aic_sigma(ssrs[i], nobs, dfmodelwc[i]) for i in range(k_vars)]
bics = [evm.bic_sigma(ssrs[i], nobs, dfmodelwc[i]) for i in range(k_vars)]
print aics
print bics
print 'aic, bic minimum number of variables'
print np.argmin(ic_ols, 1) + 1
print np.argmin(aics) + 1, np.argmin(bics) + 1
prints
aic, bic
[[ 167.59181941  123.80026281  125.02303444  126.43299217]
 [ 169.50384241  127.62430882  130.75910345  134.08108419]]
[4.4259823271765022, 3.5501511951835187, 3.5746066277468964, 3.6028057824070068]
[4.4642227872850651, 3.6266321154006445, 3.689328008072585, 3.7557676228412582]
aic, bic minimum number of variables
[2 2]
2 2
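For reference, the ssr-based criteria as used here appear to reduce to log(ssr) plus a penalty term; that is my reading of the helper functions, so treat the formulas below as an assumption. With that caveat, the numbers above can be reproduced directly:
# sketch: reproduce the ssr-based criteria directly, assuming
# aic = log(ssr) + 2*k/nobs and bic = log(ssr) + log(nobs)*k/nobs
ks = np.arange(1, k_vars + 1)
aics_direct = np.log(ssrs) + 2. * ks / nobs
bics_direct = np.log(ssrs) + np.log(nobs) * ks / nobs
assert_almost_equal(aics_direct, aics, decimal=12)
assert_almost_equal(bics_direct, bics, decimal=12)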
PS: Since I never had any numerical linear algebra training, I can still get excited about figuring out a two-liner. I could have tried to find a book for it, but that would have been boring compared to working it out on my own.
PS: Following pep-8, I'm used to not using capital letters very often.

6 comments:

  1. Nice. We use a similar trick for getting SSR in ANOVA type I. I'm wondering if we should switch to QR by default in OLS and keep around these matrices so we even save this computation.

    https://github.com/jseabold/statsmodels/blob/formula-integration/statsmodels/stats/anova.py#L118

  2. The problem with switching from pinv to QR as default is that we rely on pinv to handle singular design matrices. I don't know how QR behaves or how to work with it in the singular case, but a quick run of my example shows that it doesn't produce the same results. It will take some effort to figure out how to handle the singular case if QR is the default.

    We already store the Q and R matrices in GLS for reuse:
    self._exog_Q, self._exog_R = Q, R

  3. Hi Josef,

You can speed things up even more by appending y as a column on the right of the design matrix. The Q multiplication is then part of the QR factorization, and the elements of the last column of R are the coefficients of the orthogonal problem, except that the last (bottom) element is the residual. Said differently, the sum of squares of the last column is the variance of y, and the cumsum of the squares gives you the amount of variance removed for different numbers of variables. You don't even need Q.

Skipper, QR with column pivoting was once popular; it is one of the group of factorizations called rank preserving, but in practice it isn't much faster than SVD because determining which column to reduce next is a bit of computational work. Without the column pivoting it is difficult to deal with near singular matrices, which svd does better.

  4. Thanks Chuck,

    As you know, I learned a lot of the numerical linear algebra from you on the mailing lists. I was rereading your comments on that thread again this weekend but still couldn't understand that part.

I think now that the point why your improvement works is that Q is not just an orthogonal basis but an ortho*normal* basis. In terms of linear regression, this gives us standardized beta coefficients with an orthogonal design matrix. In that case, we get the variance decomposition directly from the beta coefficients.

    Proof by coding: I think I need to prepare an update.

    >>> q, r = np.linalg.qr(np.column_stack((x,y)))
    >>> s2 = r[:,-1]**2
    >>> ssrs = s2.sum() - s2.cumsum()[:-1]
    >>> np.max(np.abs(ssrs - ssr_ols))
    1.4210854715202004e-14

    nice

Aside: I was reading the SPSS algorithm description for the Sweep Operations that add or remove one variable, and there they mention at the end that the coefficients are standardized.

    on singular design matrices:
Is there an easy and cheap way of detecting a singular matrix from a QR decomposition? Given that singular matrices are not the common case, we could just try QR and, if it turns out to be singular, switch to pinv.

  5. The QR factorization without column pivoting does not handle singular matrices well. What happens is that you end up dividing by (near) zero in the Householder reflection I - 2*v*v^T/|v|^2 when v is small and that happens when a column is not linearly independent of the preceding columns. I suspect this isn't normally a problem in the lagged case. However, as long as v isn't exactly zero, you do end up with an orthogonal transformation, so the transformed problem itself should still be valid. That allows QR to be used as the first step in the SVD.

    To understand the column appended approach, note that the augmented matrix is reduced to upper triangular form by an orthogonal change of basis (rows), so the least squares problem remains the same as none of the Euclidean distances change. What you end up with in the three variable case is

    | * * * |        | * |
    | 0 * * |  X  =  | * |
    | 0 0 * |        | * |
    | 0 0 0 |        | * |

    The rhs is y in the new basis. The number of variables you use then corresponds to the number of components of y you can eliminate. If you use all three variables only the last component remains and is the residual. If you use the first two variables, then the last two components are the residuals, so on and so forth. The last column of Q for this process is the normalized residual in the original coordinates and it is orthogonal to the subspace spanned by the columns of the design matrix as it should be.
