Using QR Decomposition for Sequential Regression
The following is just a commented walk-through of my current script; the core is just a few lines.
First, some imports
import numpy as np
from statsmodels.regression.linear_model import OLS
from numpy.testing import assert_almost_equal
Then, we define a toy example to check whether it works.
The last two variables have no effect. I used that to get the minimum of the
information criteria, aic and bic, to be in the interior.
nobs, k_vars = 50, 4
np.random.seed(85325783)
x = np.random.randn(nobs, k_vars)
y = x[:, :k_vars-2].sum(1) + np.random.randn(nobs)
We start with the boring way of doing the sequential regression: use OLS from
statsmodels and loop with an expanding number of explanatory variables in exog.
Note, this uses the generalized inverse, pinv, for each new matrix of
explanatory variables. I'm storing parameters, residual sum of squares and
information criteria to have a benchmark to compare against later.
ssr_ols = np.zeros(k_vars)
params_ols = np.zeros((k_vars, k_vars))
ic_ols = np.zeros((2, k_vars))
for i in range(4):
    res = OLS(y, x[:, :i+1]).fit()
    ssr_ols[i] = res.ssr
    params_ols[i, :i+1] = res.params
    ic_ols[0, i] = res.aic
    ic_ols[1, i] = res.bic
In my example, the estimated coefficients are
>>> params_ols
array([[ 0.7585129 , 0. , 0. , 0. ],
[ 1.00564191, 0.96414302, 0. , 0. ],
[ 1.00776594, 0.93035613, -0.10759121, 0. ],
[ 1.03379054, 0.91302697, -0.12998046, -0.08787965]])
The first two coefficients are close to one and the last two are close to zero, all in the neighborhood of the true values of my simulated model.
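As a quick sanity check (my own addition, not in the original script), we can compare the full-model estimates against the true coefficient vector of the simulation, which is one for the first two variables and zero for the last two:
# true coefficients of the simulated model: ones for the first two regressors,
# zeros for the two irrelevant ones; with this seed the estimates are within 0.2
true_params = np.array([1., 1., 0., 0.])
assert np.max(np.abs(params_ols[-1] - true_params)) < 0.2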
Instead of using pinv, statsmodels also has the option to use QR for estimating
a linear model; the basic code is the following. It uses the QR decomposition
of the matrix of explanatory variables.
q, r = np.linalg.qr(x)
qy = np.dot(q.T, y)
print '\nparams full model'
print params_ols[-1]
print np.linalg.solve(r, qy)
It gives the same result as the pinv version
params full model
[ 1.03379054 0.91302697 -0.12998046 -0.08787965]
[ 1.03379054 0.91302697 -0.12998046 -0.08787965]
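As an aside, if I remember the statsmodels interface correctly, the QR path can also be requested directly through the fit method; a minimal sketch (assuming the method="qr" keyword), which should reproduce the same parameters:
# use the built-in QR option of statsmodels instead of the default pinv
res_qr = OLS(y, x).fit(method="qr")
assert_almost_equal(res_qr.params, params_ols[-1], decimal=13)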
We already have the QR decomposition for the full model. QR calculates the
orthogonalization using the sequence as it is given by the matrix of
explanatory variables.
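This ordering is what makes the sequential slicing work: because r is upper triangular, the first i columns of x are exactly reproduced by the first i columns of q. A small check of that property (my own illustration, not part of the script):
# r is upper triangular, so x[:, :i] = q[:, :i] r[:i, :i] for every i,
# i.e. q[:, :i] spans the same column space as the first i regressors
for i in range(1, k_vars + 1):
    assert_almost_equal(x[:, :i], np.dot(q[:, :i], r[:i, :i]), decimal=13)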
My first attempt was to find the OLS parameters sequentially, using the
appropriate subsets or slices of the QR matrices
params1 = np.zeros((k_vars, k_vars))
for i in range(4):
    params1[i, :i+1] = np.linalg.solve(r[:i+1, :i+1], qy[:i+1])
This gives us the same parameter estimates as the OLS loop. The computations
are on much smaller arrays than the original problem: r is (k_vars, k_vars)
and qy is just one-dimensional with length k_vars.
There is still one loop, although a much smaller one, and I want to get rid of it.
The following is trivial once we know what's going on, but it took me a while to
figure this out. Some observations about QR (checked numerically below):
- q is the orthogonal matrix that spans the same space as x.
- theta = r beta = qy is the parameter vector in the orthogonalized problem.
- beta = inv(r) theta is the parameter vector for the original regressors.
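A short numerical check of these three observations (my own addition, using the arrays we already have):
# q has orthonormal columns, so q'q is the identity matrix
assert_almost_equal(np.dot(q.T, q), np.eye(k_vars), decimal=13)
# theta = q'y is the parameter vector of the orthogonalized problem, r beta = theta
beta = np.linalg.solve(r, qy)
assert_almost_equal(np.dot(r, beta), qy, decimal=13)
# and beta agrees with the OLS estimate of the full model
assert_almost_equal(beta, params_ols[-1], decimal=13)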
All we need to do is to select the sequentially increasing subsets of r and theta.
For calculating predicted values, we can work directly in the orthogonalized system.
Using an upper triangular matrix of ones
upp_tri = (np.arange(4)[:,None] <= np.arange(4)).astype(float)
we can get the expanding theta or qy in a matrix
>>> upp_tri * qy[:,None]
array([[ 4.87045847, 4.87045847, 4.87045847, 4.87045847],
[-0. , -6.94085588, -6.94085588, -6.94085588],
[ 0. , 0. , 0.70410517, 0.70410517],
[ 0. , 0. , 0. , 0.60930516]])
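The same mask could also be written with np.triu, which may be easier to read (an equivalent alternative, not what the original script uses):
# np.triu of a matrix of ones gives the same upper triangular mask
upp_tri2 = np.triu(np.ones((k_vars, k_vars)))
assert_almost_equal(upp_tri2, upp_tri, decimal=13)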
Then, we can just solve the system for all steps at once. First, I tried
the matrix inverse since that's easier to understand theoretically
params2 = np.dot(np.linalg.inv(r), upp_tri * qy[:,None]).T
But, we can avoid the inverse and use linalg.solve directly:
params2 = np.linalg.solve(r, upp_tri * qy[:,None]).T
That's it: one line, and I have the parameters for the four sequential
regression problems. Comparing the parameter estimates of the three methods,
we see they are the same (up to floating point precision)
>>> np.max(np.abs(params1 - params2))
0.0
>>> np.max(np.abs(params1 - params_ols))
5.5511151231257827e-16
The rest is calculating fitted values and residuals. Since the columns of q are orthonormal, each regressor's contribution to the fit is q[:, j] * qy[j], and the cumulative sum over columns gives the fitted values of the expanding models.
contrib = q * qy
fitted = contrib.cumsum(1)
resids = y[:,None] - fitted
ssrs = (resids**2).sum(0)
print '\nresidual sum of squares ssr'
print ssrs
print ssr_ols
which gives us identical residual sum of squares, ssr
[ 80.31708213 32.14160173 31.64583764 31.27458486]
[ 80.31708213 32.14160173 31.64583764 31.27458486]
I added some asserts in the script, so I don't break anything by accident
assert_almost_equal(ssrs, ssr_ols, decimal=13)
assert_almost_equal(params1, params_ols, decimal=13)
assert_almost_equal(params2, params_ols, decimal=13)
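One more check along the same lines (my own addition, not in the original script): the last column of fitted and resids should match the fitted values and residuals of the full model, which is still stored in res from the OLS loop above.
# res still holds the full model from the last iteration of the OLS loop
assert_almost_equal(fitted[:, -1], res.fittedvalues, decimal=13)
assert_almost_equal(resids[:, -1], res.resid, decimal=13)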
The applications that I have in mind for this are the selection of the number
of regressors, or of the number of lags in the time series case, so I want to
calculate the information criteria, aic and bic. In the previous result, I
had calculated the residual sum of squares, which I can feed to
helper functions that I had written to calculate information criteria.
The values cannot be directly compared to the results from OLS,
since the aic and bic reported by the OLS Results instance are
based on the likelihood value, while here I use the ssr based
definition, which uses a different normalization and drops a constant term.
However, the minimum is achieved at two variables in all cases, which was
expected since I selected the simulated model for this.
print '\naic, bic'
print ic_ols
import statsmodels.tools.eval_measures as evm
dfmodelwc = range(1, k_vars+1) #number of variables in regression
aics = [evm.aic_sigma(ssrs[i], nobs, dfmodelwc[i]) for i in range(k_vars)]
bics = [evm.bic_sigma(ssrs[i], nobs, dfmodelwc[i]) for i in range(k_vars)]
print aics
print bics
print 'aic, bic minimum number of variables'
print np.argmin(ic_ols, 1) + 1
print np.argmin(aics) + 1, np.argmin(bics) + 1
prints
aic, bic
[[ 167.59181941 123.80026281 125.02303444 126.43299217]
[ 169.50384241 127.62430882 130.75910345 134.08108419]]
[4.4259823271765022, 3.5501511951835187, 3.5746066277468964, 3.6028057824070068]
[4.4642227872850651, 3.6266321154006445, 3.689328008072585, 3.7557676228412582]
aic, bic minimum number of variables
[2 2]
2 2
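As a footnote on the ssr based definition: my reading of the eval_measures helpers is that aic_sigma is the log of the error variance term plus a penalty of 2*k/nobs, with log(nobs)*k/nobs for bic_sigma. Under that assumption (my own check, not verified against the statsmodels source here), the values above can be reproduced directly from the residual sums of squares:
# assuming aic_sigma(s, nobs, k) = log(s) + 2*k/nobs and
# bic_sigma(s, nobs, k) = log(s) + log(nobs)*k/nobs
ks = np.arange(1, k_vars + 1)
assert_almost_equal(aics, np.log(ssrs) + 2. * ks / nobs, decimal=13)
assert_almost_equal(bics, np.log(ssrs) + np.log(nobs) * ks / nobs, decimal=13)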
PS: Since I never had any numerical linear algebra training, I can still get excited about
figuring out a two-liner. I could have tried to find a book for it, but that would
have been boring compared to working it out on my own.
PPS: Following pep-8, I'm used to not using capital letters very often.