Background
I was fixing some bugs in determining the number of lags in statistical tests for time series, and remembered that we still haven't optimized our sequential regressions, so I opened an issue. I learned a lot of linear algebra through the numpy and scipy mailing lists. The relevant idea and thread is here, quoting myself:
But from your description of QR, I thought specifically of the case where we have a "natural" ordering of the regressors, similar to the polynomial case of you and Anne. In the time series case, it would be by increasing lags

    y_t on y_{t-1}
    y_t on y_{t-1}, y_{t-2}
    ...
    y_t on y_{t-1}, ..., y_{t-k}    for k = 1, ..., K

or y_t on x_t and the lags of x_t. This is really sequential LS with a predefined sequence, not PLS or PCA/PCR or similar orthogonalization by "importance". The usual procedure for deciding on the appropriate number of lags loops over OLS with an increasing number of regressors.

From the discussion, I thought there might be a way to "cheat" in this using QR and Gram-Schmidt.

I never got around to trying this out, but I thought I'd give it a try today. Some hours of trial and error and working on some algebra later, I have what looks like a reasonable solution.
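To make the starting point concrete, here is a minimal sketch of that lag-selection loop (my own toy example, not from the original thread; the series, seed, and maxlag are made up):

    import numpy as np
    from statsmodels.regression.linear_model import OLS

    nobs_ts, maxlag = 200, 4
    np.random.seed(12345)
    yt = np.random.randn(nobs_ts)
    # matrix of lags y_{t-1}, ..., y_{t-maxlag}, aligned with y_t
    lags = np.column_stack([yt[maxlag - k:nobs_ts - k]
                            for k in range(1, maxlag + 1)])
    yy = yt[maxlag:]
    # loop over OLS with an increasing number of lags, compare aic and bic
    for k in range(1, maxlag + 1):
        res = OLS(yy, lags[:, :k]).fit()
        print(k, res.aic, res.bic)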
Using QR Decomposition for Sequential Regression
The following is a commented walk through my current script; the core is just a few lines. First, some imports:
    import numpy as np
    from statsmodels.regression.linear_model import OLS
    from numpy.testing import assert_almost_equal

Then, we define a toy example to try out whether it works. The last two variables have no effect. I used that to get the minimum of the information criteria, aic and bic, to be in the interior.
    nobs, k_vars = 50, 4
    np.random.seed(85325783)
    x = np.random.randn(nobs, k_vars)
    y = x[:, :k_vars-2].sum(1) + np.random.randn(nobs)

We start with the boring way of doing the sequential regression: use OLS from statsmodels and loop with an expanding number of explanatory variables in exog.
Note, this uses the generalized inverse, pinv, for each new matrix of explanatory variables. I'm storing parameters, residual sums of squares, and information criteria to have a benchmark to compare against later.
    ssr_ols = np.zeros(k_vars)
    params_ols = np.zeros((k_vars, k_vars))
    ic_ols = np.zeros((2, k_vars))
    for i in range(k_vars):
        res = OLS(y, x[:, :i+1]).fit()
        ssr_ols[i] = res.ssr
        params_ols[i, :i+1] = res.params
        ic_ols[0, i] = res.aic
        ic_ols[1, i] = res.bic

In my example, the estimated coefficients are
    >>> params_ols
    array([[ 0.7585129 ,  0.        ,  0.        ,  0.        ],
           [ 1.00564191,  0.96414302,  0.        ,  0.        ],
           [ 1.00776594,  0.93035613, -0.10759121,  0.        ],
           [ 1.03379054,  0.91302697, -0.12998046, -0.08787965]])

The first two coefficients are close to one, the last two coefficients are close to zero, all in the neighborhood of the true values of my simulated model.
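For reference, the pinv route boils down to the following (a sketch, not statsmodels' exact internal code):

    # OLS via the generalized inverse: beta = pinv(x) y
    params_pinv = np.dot(np.linalg.pinv(x), y)
    assert_almost_equal(params_pinv, params_ols[-1], decimal=13)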
Instead of using pinv, statsmodels also has the option to use QR for estimating a linear model, with basic code as follows. This uses the QR decomposition of the matrix of explanatory variables.
    q, r = np.linalg.qr(x)
    qy = np.dot(q.T, y)

    print('\nparams full model')
    print(params_ols[-1])
    print(np.linalg.solve(r, qy))

It gives the same result as the pinv version:
    params full model
    [ 1.03379054  0.91302697 -0.12998046 -0.08787965]
    [ 1.03379054  0.91302697 -0.12998046 -0.08787965]

We already have the QR decomposition for the full model. QR calculates the orthogonalization using the sequence as it is given by the matrix of explanatory variables.
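As a quick sanity check (my addition, not in the original script), q has orthonormal columns and the product q r reproduces x:

    # q'q = I and x = q r, at floating point precision
    assert_almost_equal(np.dot(q.T, q), np.eye(k_vars), decimal=13)
    assert_almost_equal(np.dot(q, r), x, decimal=13)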
My first attempt was to find the OLS parameters sequentially, using the appropriate subsets or slices of the QR matrices:
    params1 = np.zeros((k_vars, k_vars))
    for i in range(k_vars):
        params1[i, :i+1] = np.linalg.solve(r[:i+1, :i+1], qy[:i+1])

This gives us the same parameter estimates as the OLS loop. It works because r is upper triangular: the first i+1 columns of x only involve the first i+1 columns of q, so q[:, :i+1] and r[:i+1, :i+1] form a QR decomposition of the subset of regressors. The computations are on much smaller arrays than the original problem: r is (k_vars, k_vars) and qy is one-dimensional with length k_vars.
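We can verify that claim directly (again, my own check, not in the original script):

    # the leading blocks of q and r reproduce the leading columns of x,
    # because r is upper triangular
    for i in range(k_vars):
        assert_almost_equal(np.dot(q[:, :i+1], r[:i+1, :i+1]),
                            x[:, :i+1], decimal=13)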
There is still one loop, although a much smaller one, and I want to get rid of it. The following is trivial once we know what's going on, but it took me a while to figure out. Some observations about QR:

- q is the orthogonal matrix that spans the same space as x.
- theta = r beta = qy is the parameter vector in the orthogonalized problem.
- beta = inv(r) theta is the parameter vector for the original regressors.

All we need to do is to select the sequentially increasing subsets of r and theta. For calculating predicted values, we can work directly in the orthogonalized system, as the check below confirms.
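A quick check of that parameterization on the full model (my addition):

    # theta = r beta equals qy = q'y for the full model
    beta_full = np.linalg.solve(r, qy)
    assert_almost_equal(np.dot(r, beta_full), qy, decimal=13)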
Using an upper triangular matrix of ones
    upp_tri = (np.arange(k_vars)[:, None] <= np.arange(k_vars)).astype(float)

we can get the expanding theta, or qy, in a matrix:
    >>> upp_tri * qy[:, None]
    array([[ 4.87045847,  4.87045847,  4.87045847,  4.87045847],
           [-0.        , -6.94085588, -6.94085588, -6.94085588],
           [ 0.        ,  0.        ,  0.70410517,  0.70410517],
           [ 0.        ,  0.        ,  0.        ,  0.60930516]])

Then, we can just solve the system for all steps at once. First, I tried the matrix inverse, since that's easier to understand theoretically:
    params2 = np.dot(np.linalg.inv(r), upp_tri * qy[:, None]).T

But we can avoid the inverse and use linalg.solve directly:
    params2 = np.linalg.solve(r, upp_tri * qy[:, None]).T

That's it: one line, and I have the parameters for all four sequential regression problems. Comparing the parameter estimates of the three methods, we see they are the same (at floating point precision):
    >>> np.max(np.abs(params1 - params2))
    0.0
    >>> np.max(np.abs(params1 - params_ols))
    5.5511151231257827e-16
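As an aside, since r is upper triangular, a dedicated triangular solver can exploit that structure; a small variation using scipy (my addition, not part of the original script):

    # exploit the triangular structure of r; solve_triangular defaults
    # to treating its first argument as upper triangular
    from scipy.linalg import solve_triangular
    params2b = solve_triangular(r, upp_tri * qy[:, None]).T
    assert_almost_equal(params2b, params2, decimal=13)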
The rest is calculating fitted values and residuals. Because the columns of q are orthonormal, the fitted values of the model with the first i regressors are q[:, :i] qy[:i], so the contributions of the individual columns just accumulate:

    contrib = q * qy
    fitted = contrib.cumsum(1)
    resids = y[:, None] - fitted
    ssrs = (resids**2).sum(0)

    print('\nresidual sum of squares ssr')
    print(ssrs)
    print(ssr_ols)

which gives us identical residual sums of squares, ssr:
    [ 80.31708213  32.14160173  31.64583764  31.27458486]
    [ 80.31708213  32.14160173  31.64583764  31.27458486]

I added some asserts in the script, so I don't break anything by accident:
    assert_almost_equal(ssrs, ssr_ols, decimal=13)
    assert_almost_equal(params1, params_ols, decimal=13)
    assert_almost_equal(params2, params_ols, decimal=13)

The applications that I have in mind for this are selection of the number of regressors or lags in the time series case, so I want to calculate the information criteria, aic and bic. In the previous result, I had calculated the residual sum of squares, which I can feed to helper functions that I had written to calculate information criteria.
The values cannot be directly compared to the results from OLS, since the aic and bic reported by the OLS Results instance are based on the likelihood value, while here I use the ssr-based definition, which uses a different normalization and drops a constant term. However, the minimum is achieved at two variables in all cases, which was expected since I selected the simulated model for this.
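For reference, here is a sketch of the ssr-based definitions as I read those helper functions; the formulas reproduce the numbers printed below, but take them as my reading rather than the official documentation:

    # my reading of the ssr-based criteria used below (hypothetical names)
    def aic_ssr(ssr, nobs, df_modelwc):
        return np.log(ssr) + 2.0 * df_modelwc / nobs

    def bic_ssr(ssr, nobs, df_modelwc):
        return np.log(ssr) + np.log(nobs) * df_modelwc / nobs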
    print('\naic, bic')
    print(ic_ols)

    import statsmodels.tools.eval_measures as evm
    dfmodelwc = range(1, k_vars + 1)  # number of variables in the regression
    aics = [evm.aic_sigma(ssrs[i], nobs, dfmodelwc[i]) for i in range(k_vars)]
    bics = [evm.bic_sigma(ssrs[i], nobs, dfmodelwc[i]) for i in range(k_vars)]
    print(aics)
    print(bics)

    print('aic, bic minimum number of variables')
    print(np.argmin(ic_ols, 1) + 1)
    print(np.argmin(aics) + 1, np.argmin(bics) + 1)

prints
    aic, bic
    [[ 167.59181941  123.80026281  125.02303444  126.43299217]
     [ 169.50384241  127.62430882  130.75910345  134.08108419]]
    [4.4259823271765022, 3.5501511951835187, 3.5746066277468964, 3.6028057824070068]
    [4.4642227872850651, 3.6266321154006445, 3.689328008072585, 3.7557676228412582]
    aic, bic minimum number of variables
    [2 2]
    2 2

PS: Since I never had any numerical linear algebra training, I can still get excited about figuring out a two-liner. I could have tried to find a book for it, but that would have been boring compared to working it out on my own.
PS: Following PEP 8, I'm used to not using capital letters very often.