The Problem
(A warning upfront: there are problems replicating these results, more on that below.)
A week ago, I stumbled on this Wikipedia page, Numerical_Comparison_of_Statistical_Software, which presents some test results for the numerical accuracy of statistical packages.
For linear regression there is one test case, Longley, that we have in the datasets in statsmodels. But I wanted to look at Filip, which sounds like a difficult case: neither SAS nor SPSS produced a solution. Let's see how badly statsmodels and numpy are doing, or maybe not.
The model is a polynomial of degree 10. Description, data, certified values and a plot are on the NIST website here
1 Predictor Variable
82 Observations
Higher Level of Difficulty
Model: Polynomial, 11 Parameters
I parsed the data into an array dta; the first column is the endogenous variable, y, the second column is the exogenous variable, x. I saved y in endog. I also parsed the main NIST results into params_nist: first column the parameters, second column their standard deviations.
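(For reference, a minimal sketch of that preparation step. The file names, and the assumption that the NIST .dat file has a 60-line header, are mine; adjust as needed.)

import numpy as np

# Filip.dat saved from the NIST website; the data block of "y x" pairs
# starts after the header (assumed to be 60 lines here)
dta = np.loadtxt("Filip.dat", skiprows=60)
endog = dta[:, 0]

# certified parameter estimates and their standard deviations, copied by hand
# into a two-column text file (hypothetical file name)
params_nist = np.loadtxt("filip_params.txt")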
Fitting a Polynomial
Since it is a polynomial problem, let us treat it as such and use the polynomials from numpy.
First try, use the old polyfit function
>>> p_params = np.polyfit(dta[:,1], endog, 10)[::-1]
>>> p_params
array([-1467.48963361, -2772.17962811, -2316.37111156, -1127.97395547,
        -354.47823824,   -75.1242027 ,   -10.87531817,    -1.062215  ,
          -0.06701912,    -0.00246781,    -0.0000403 ])
>>> log_relative_error(p_params, params_nist[:,0])
array([ 7.87929761,  7.88443445,  7.88840683,  7.89138269,  7.89325784,
        7.89395336,  7.89341841,  7.89162977,  7.88859034,  7.88432427,
        7.87887292])
Not bad. Following the description on the Wikipedia page, I wrote a function log_relative_error that tells us how many significant digits of agreement there are between the two arrays. polyfit agrees at 7 to 8 significant digits, which is about the same as S-Plus on the Wikipedia page.
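For reference, here is a minimal sketch of that helper, following the usual definition of the log relative error; the function in my script may differ in details, for example in how exact agreement or zero certified values are handled.

import numpy as np

def log_relative_error(estimate, certified):
    # number of agreeing significant digits:
    # minus log10 of the relative error with respect to the certified values
    estimate = np.asarray(estimate, dtype=float)
    certified = np.asarray(certified, dtype=float)
    return -np.log10(np.abs(estimate - certified) / np.abs(certified))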
Let's work properly with polynomials and use the new polynomial package in numpy. Charles Harris wrote it and is still expanding and improving it.
>>> poly = np.polynomial.Polynomial.fit(dta[:,1], endog, 10)
>>> poly
Polynomial([ 0.88784146,  0.10879327, -0.53636698,  0.28747072,  2.20567367,
            -1.31072158, -4.21841581,  1.76229897,  3.8096025 , -0.77251557,
            -1.30327368], [-8.78146449, -3.13200249])
Oops, these numbers don't look like the NIST numbers. The last part, [-8.78146449, -3.13200249], shows the domain of the polynomial; our values have been transformed. A bit of introspection, and we figure out how to change the domain. To get the "standard" representation, we can convert the polynomial back to the standard domain (-1, 1).
>>> poly.convert(domain=(-1,1))
Polynomial([-1467.48961423, -2772.17959193, -2316.37108161, -1127.97394098,
             -354.4782337 ,   -75.12420174,   -10.87531804,    -1.06221499,
               -0.06701912,    -0.00246781,    -0.0000403 ], [-1.,  1.])
Now this looks more like the NIST numbers; it even agrees at 13 to 14 significant digits:
>>> log_relative_error(poly.convert(domain=(-1,1)).coef, params_nist[:,0])
array([ 13.72347502,  13.84056851,  13.81494335,  13.70878715,  13.78207216,
        13.79374075,  13.6729684 ,  13.71128925,  13.75445952,  13.68695573,
        13.67736436])
Nice job Charles. No problem fitting this polynomial with numpy.
Linear Regression
In the previous part we knew we were fitting a polynomial, but let's treat it just as a standard linear regression problem and use scikits.statsmodels.
First try: just create the design matrix in the simplest way and estimate
>>> exog0 = dta[:,1:]**np.arange(11)
>>> res0 = sm.OLS(endog, exog0).fit()
>>> res0.params
array([  8.443046917097718,   1.364996059973237,  -5.350750046084954,
        -3.34190287892045 ,  -0.406458629495091,   0.257727311296307,
         0.119771653524165,   0.023140891929892,   0.002403995206457,
         0.000131618839544,   0.000002990001222])
>>> log_relative_error(res0.params, params_nist[:,0])
array([-0.002491507096328, -0.000213790029725,  0.00100436814061 ,
        0.001288615104161,  0.000498264786078, -0.00148737673275 ,
       -0.004756810105056, -0.009359738327099, -0.015305377783833,
       -0.022566206229652, -0.031085341541384])
Bummer, 0 significant digits, way off.
We can print the full summary of the results; maybe we will see something:
>>> print res0.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.996
Model:                            OLS   Adj. R-squared:                  0.995
Method:                 Least Squares   F-statistic:                     2390.
Date:                Sat, 03 Mar 2012   Prob (F-statistic):           1.85e-84
Time:                        23:47:45   Log-Likelihood:                 344.73
No. Observations:                  82   AIC:                            -673.5
Df Residuals:                      74   BIC:                            -654.2
Df Model:                           7
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          8.4430     12.864      0.656      0.514       -17.189    34.075
x1             1.3650      6.496      0.210      0.834       -11.578    14.308
x2            -5.3508      9.347     -0.572      0.569       -23.974    13.273
x3            -3.3419     11.702     -0.286      0.776       -26.659    19.975
x4            -0.4065      5.923     -0.069      0.945       -12.209    11.396
x5             0.2577      1.734      0.149      0.882        -3.197     3.712
x6             0.1198      0.321      0.373      0.710        -0.520     0.759
x7             0.0231      0.038      0.604      0.548        -0.053     0.099
x8             0.0024      0.003      0.838      0.405        -0.003     0.008
x9             0.0001      0.000      1.072      0.287        -0.000     0.000
x10          2.99e-06   2.29e-06      1.303      0.197     -1.58e-06  7.56e-06
==============================================================================
Omnibus:                        1.604   Durbin-Watson:                   1.627
Prob(Omnibus):                  0.449   Jarque-Bera (JB):                1.379
Skew:                          -0.317   Prob(JB):                        0.502
Kurtosis:                       2.961   Cond. No.                        -1.#J
==============================================================================

The smallest eigenvalue is -0.38. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
R squared is 0.996, so we are fitting the curve pretty well, but the moment matrix X'X of our polynomial design is numerically not positive definite; there is even a negative eigenvalue. A negative eigenvalue sounds strange; a quadratic form shouldn't have one. Just to make sure that this is not a bug, check with numpy:
>>> np.linalg.eigvalsh(np.dot(exog0.T, exog0)).min()
-0.38006444279775781
>>> np.linalg.eigvals(np.dot(exog0.T, exog0)).min()
-0.00011161167956978682
I'm still suspicious, but I delay the detour into numpy's and scipy's linalg modules.
One more check of our regression results: the residual standard error is not very far away from the NIST value:
>>> np.sqrt(res0.mse_resid), 0.334801051324544E-02
(0.0038044343586352601, 0.0033480105132454399)
Conclusion: if we try to fit a linear regression with a design matrix whose moment matrix is not positive definite, then the parameters are not identified, but we can still get a good fit.
(Technical aside: statsmodels uses by default the generalized inverse, pinv, for linear regression. So it just drops singular values below a threshold close to zero. The parameter estimates will then be closer to those of a penalized ridge regression. But don't quote me on the last part, since I don't remember where I read that pinv is the limit of a ridge problem.)
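As a quick, informal illustration of that claim (not a proof), we can compare the pinv-based solution with ridge solutions for a shrinking penalty on a small singular example; the two should approach each other. The example data here is made up.

import numpy as np

def ridge(X, y, lam):
    # ridge estimate: solve (X'X + lam*I) b = X'y
    k = X.shape[1]
    return np.linalg.solve(np.dot(X.T, X) + lam * np.eye(k), np.dot(X.T, y))

# a small singular example; the Filip design matrix itself is too
# ill-conditioned for this comparison to be clean
np.random.seed(0)
X = np.random.randn(50, 3)
X = np.column_stack([X, X[:, 0] + X[:, 1]])   # exact linear dependence
y = np.random.randn(50)

params_pinv = np.dot(np.linalg.pinv(X), y)
for lam in [1e-2, 1e-6, 1e-10]:
    diff = np.max(np.abs(ridge(X, y, lam) - params_pinv))
    print("lam=%g  max difference to pinv solution: %g" % (lam, diff))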
The question for statsmodels is what to do about it.
One solution that works in this case, as we have seen with the numpy polynomials, is to rescale the explanatory variables or the design matrix. I'm showing one example below. My working title for this post was: Don't do this, or do we have to do it for you? Is it the responsibility of the user not to use a design matrix that numerically doesn't make much sense, so that we can only warn, or should we automatically transform the design matrix to make it numerically more stable? The latter will be costly and might not be required in 99% of the cases.
Another issue is that there are many different ways to do the linear algebra, and we have not investigated much what might work better or worse in different cases. See Addendum below for the effect that linear algebra can have in numerically unstable problems.
Rescaling the Design Matrix
Our design matrix looks pretty bad: the variables vary over a large range and the correlations are very high:
>>> np.set_printoptions(precision=3)
>>> print np.max(np.abs(exog0), 0)
[  1.000e+00   8.781e+00   7.711e+01   6.772e+02   5.947e+03   5.222e+04
   4.586e+05   4.027e+06   3.536e+07   3.105e+08   2.727e+09]
>>> print np.corrcoef(exog0[:,1:], rowvar=0)
[[ 1.    -0.991  0.969 -0.938  0.904 -0.87   0.838 -0.808  0.782 -0.758]
 [-0.991  1.    -0.993  0.975 -0.951  0.925 -0.899  0.874 -0.851  0.83 ]
 [ 0.969 -0.993  1.    -0.994  0.981 -0.963  0.943 -0.924  0.904 -0.886]
 [-0.938  0.975 -0.994  1.    -0.996  0.986 -0.973  0.958 -0.943  0.928]
 [ 0.904 -0.951  0.981 -0.996  1.    -0.997  0.99  -0.98   0.968 -0.957]
 [-0.87   0.925 -0.963  0.986 -0.997  1.    -0.998  0.992 -0.985  0.976]
 [ 0.838 -0.899  0.943 -0.973  0.99  -0.998  1.    -0.998  0.994 -0.988]
 [-0.808  0.874 -0.924  0.958 -0.98   0.992 -0.998  1.    -0.999  0.995]
 [ 0.782 -0.851  0.904 -0.943  0.968 -0.985  0.994 -0.999  1.    -0.999]
 [-0.758  0.83  -0.886  0.928 -0.957  0.976 -0.988  0.995 -0.999  1.   ]]
Now we can use just the simplest transform: limit the maximum absolute value of each column to one:
exog1 = exog0 / np.max(np.abs(exog0),0)
After running the regression on the rescaled design matrix, we see agreement with the NIST benchmark results at around 7 to 8 significant digits, both for the parameters and for the standard deviations of the parameter estimates (bse in statsmodels):
>>> res1 = sm.OLS(endog, exog1).fit()
>>> params_rescaled = res1.params / np.max(np.abs(exog0), 0)
>>> log_relative_error(params_rescaled, params_nist[:,0])
array([ 7.419,  7.414,  7.409,  7.402,  7.394,  7.384,  7.373,  7.36 ,
        7.346,  7.331,  7.314])
>>> bse_rescaled = res1.bse / np.max(np.abs(exog0), 0)
>>> log_relative_error(bse_rescaled, params_nist[:,1])
array([ 8.512,  8.435,  8.368,  8.308,  8.255,  8.207,  8.164,  8.124,
        8.089,  8.057,  8.028])
Also, R squared and the standard deviation of the residuals (using the appropriate degrees of freedom) agree with the NIST numbers at around 10 and 7 digits, respectively:
>>> log_relative_error(res1.rsquared, 0.996727416185620)
10.040156510920081
>>> log_relative_error(np.sqrt(res1.mse_resid), 0.334801051324544E-02)
7.8575015681097371
So we are doing pretty well with just a simple rescaling of the variables, although the note at the end of print res1.summary() still reports a smallest eigenvalue of -1.51e-15, which is essentially zero. I will worry about that later. I initially looked at another way of rescaling the design matrix, but I haven't checked yet how the choice of rescaling affects the results.
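For completeness, the rescaling steps from this section could be wrapped into a small helper. This is only a sketch of what an automatic transform might look like, not an existing statsmodels function, and it does no checking for zero columns or an explicit constant.

import numpy as np
import scikits.statsmodels.api as sm   # statsmodels.api in later versions

def ols_maxabs_rescaled(endog, exog):
    # rescale each column by its maximum absolute value, fit OLS,
    # and transform the parameters and standard errors back
    scale = np.max(np.abs(exog), 0)
    res = sm.OLS(endog, exog / scale).fit()
    return res.params / scale, res.bse / scale

params_rescaled, bse_rescaled = ols_maxabs_rescaled(endog, exog0)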
Addendum 1: Smallest eigenvalue of ill-conditioned array
Going back to the original design matrix without rescaling, define the moment matrix X'X:
>>> xpx0 = np.dot(exog0.T, exog0)
The eigenvalues, assuming a symmetric matrix, are:
>>> np.sort(np.linalg.eigvalsh(xpx0))
array([ -3.79709e-01,   1.14869e-05,   4.40507e-03,   3.20670e+00,
         7.91804e+02,   1.05833e+03,   3.98410e+05,   2.31485e+08,
         4.28415e+11,   1.93733e+15,   5.17955e+19])
This looks very badly conditioned: the largest eigenvalue is 5e19, while the smallest is "around zero".
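A quick way to quantify "very badly conditioned" is the 2-norm condition number. In exact arithmetic the condition number of X'X is the square of the condition number of X, which is one reason why forming the moment matrix explicitly is already a bad idea here. (A sketch only; for this matrix the numbers are so extreme that the computed values are themselves limited by machine precision.)

import numpy as np

# condition numbers via the SVD (2-norm); exog0 and xpx0 as defined above
print("cond(X)   = %e" % np.linalg.cond(exog0))
print("cond(X'X) = %e" % np.linalg.cond(xpx0))
# in exact arithmetic cond(X'X) = cond(X)**2; with numbers this large the
# computed values should not be taken too literally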
We can compare different algorithms to calculate the smallest eigenvalues (splinalg is scipy.linalg)
>>> np.sort(np.linalg.eigvals(xpx0))[:4]
array([  3.41128e-04,   5.58946e-04,   1.23213e-02,   3.33365e+00])
>>> np.sort(splinalg.eigvalsh(xpx0))[:4]
array([ -2.14363e+03,  -2.00323e-01,   1.26094e-05,   4.40956e-03])
>>> np.sort(splinalg.eigvals(xpx0))[:4]
array([ -3.66973e-05+0.j,   1.61750e-04+0.j,   7.90465e-03+0.j,
         2.01592e+00+0.j])
>>> np.sort(np.linalg.svd(xpx0)[1])[:4]
array([  2.84057e-05,   4.91555e-04,   7.28252e-03,   3.41739e+00])
>>> np.sort(splinalg.svd(xpx0)[1])[:4]
array([  2.19202e-05,   7.11920e-04,   7.00790e-03,   3.28229e+00])
>>> np.sort(np.linalg.svd(exog0)[1]**2)[:4]
array([  1.65709e-11,   3.08225e-08,   2.48138e-05,   1.08036e-02])
>>> np.sort(splinalg.svd(exog0)[1]**2)[:4]
array([  1.65708e-11,   3.08225e-08,   2.48138e-05,   1.08036e-02])
So we see that the estimates of the smallest eigenvalue are pretty much all over the place, from -0.38 to 2.8e-05. The last version, based on the singular value decomposition of exog0, is the closest to what statsmodels uses with pinv. It also looks like I picked the worst algorithm for the regression summary in this case.
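To make the connection to pinv explicit, here is a sketch of a generalized inverse based on the SVD with a relative cutoff; this mirrors the idea behind the pinv functions, not their exact implementations or default cutoffs.

import numpy as np

def pinv_svd(X, rcond=1e-15):
    # Moore-Penrose pseudo-inverse via the SVD, dropping singular values
    # smaller than rcond times the largest one
    u, s, vt = np.linalg.svd(X, full_matrices=False)
    s_inv = np.zeros_like(s)
    keep = s > rcond * s.max()
    s_inv[keep] = 1.0 / s[keep]
    return np.dot(vt.T * s_inv, u.T)

# least squares parameters for the ill-conditioned design matrix from above
params_pinv = np.dot(pinv_svd(exog0), endog)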
Warning: Calculations at machine precision are not necessarily deterministic, in the sense that if you run it repeatedly you might not always get the same results. There are several cases on the scipy and numpy mailing lists that report that the results might "randomly" switch between several patterns. And the results won't agree on different operating systems, compilers and versions of the linear algebra libraries. So, I don't expect that these results can be replicated in exactly the same way.
Addendum 2: scipy.linalg versus numpy.linalg
To avoid getting these changing results whenever I re-ran the script while preparing this post, I changed the statsmodels source to use scipy.linalg.pinv instead of numpy.linalg.pinv. I expected more replicable results; however, what I found is:
>>> exog0 = dta[:,1:]**np.arange(11)
>>> res0 = sm.OLS(endog, exog0).fit()
>>> log_relative_error(res0.params, params_nist[:,0])
array([ 5.31146488,  5.7400516 ,  6.53794562,  6.81318335,  6.81855769,
        7.22333339,  8.13319742,  7.38788711,  7.24457806,  7.18580677,
        7.12494176])
>>> log_relative_error(res0.bse, params_nist[:,1])
array([ 2.25861611,  2.25837872,  2.25825903,  2.25822427,  2.2582245 ,
        2.25823174,  2.25823693,  2.25823946,  2.25824058,  2.25824108,
        2.25824165])
Just by changing the algorithm that calculates the generalized inverse, I get agreement with the NIST data at 5 to 7 significant digits for the parameters and 2 digits for the standard errors of the parameter estimates, even with the very ill-conditioned original design matrix. That doesn't look so bad, much better than with the numpy.linalg version.
(But I need to write proper tests and look at this again when I can trust the results. I now have two Python sessions open: one that imported the original source, and one that imported the source after I changed it. Also, when I ran this regression repeatedly, the numbers changed once, but remained within the same neighborhood. Besides the different algorithm, there is also rcond, which defines the cutoff in pinv; I didn't check whether that differs between the numpy and scipy versions.)
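To get a feeling for how much the rcond cutoff matters here, we can bypass OLS and compute the pinv-based parameters directly for a few cutoff values. This is only an illustration; I haven't checked which defaults the numpy and scipy versions actually use.

import numpy as np

# exog0, endog, params_nist and log_relative_error as defined above
for rcond in [1e-20, 1e-15, 1e-10]:
    params = np.dot(np.linalg.pinv(exog0, rcond=rcond), endog)
    print("rcond = %g" % rcond)
    print(log_relative_error(params, params_nist[:, 0]))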
Conclusions
I think this test case on the NIST site is very well "cooked" to test the numerical accuracy of a linear regression program. The main lesson is that we shouldn't throw a numerically awful problem at a statistical package unless we know that the package takes care of the basic tricks for making the problem numerically more stable. It's safer to make sure our design matrix is numerically sound.
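A cheap precaution along these lines is to look at the condition number of the design matrix before trusting the estimates; something like the following sketch, where the threshold is an arbitrary rule of thumb and not a statsmodels feature.

import numpy as np

def check_condition(exog, threshold=1e10):
    # warn if the design matrix is badly conditioned (rule-of-thumb threshold)
    cond = np.linalg.cond(exog)
    if cond > threshold:
        print("Warning: condition number %.2e exceeds %.0e; "
              "consider rescaling or reparameterizing." % (cond, threshold))
    return cond

check_condition(exog0)   # raw power basis: huge condition number
check_condition(exog1)   # max-abs rescaled version: much smaller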
Also, if we just want to estimate a polynomial function, then we should use that information and use a specialized algorithm or, even better, an orthogonal polynomial basis instead of power polynomials.
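For example, numpy's Chebyshev class fits in an orthogonal basis on a rescaled domain. A sketch (the fitted coefficients live in the Chebyshev basis; converting them back to the power basis on the original scale reintroduces the ill-conditioning, so some accuracy can be lost in that last step):

import numpy as np

# fit in the Chebyshev basis; like Polynomial.fit, this maps the data
# to the window (-1, 1) internally
cheb = np.polynomial.Chebyshev.fit(dta[:, 1], endog, 10)

# convert to power-basis coefficients on the original x scale for comparison
power = cheb.convert(domain=(-1, 1), kind=np.polynomial.Polynomial)
print(log_relative_error(power.coef, params_nist[:, 0]))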
What does it mean for statistical analysis?
That, I'm not so sure. Multicollinearity is a serious issue, and there are various approaches for dealing with it. But my attitude so far has been:
If you work with real data and run into numerical problems, it's not a problem with numerical accuracy but with your data, or with your model.
We should still use basic precautions like scaling our variables appropriately, but if we have high multicollinearity, then it mainly means that the model that we specified is asking for information that's not in the data. In certain directions the data is not informative enough to reliably identify some parameters. Given measurement errors, noise in the data and misspecification, there are many other parts to worry about before machine precision becomes important. For a related discussion see this thread on the statsmodels mailinglist.
I tried before to come up with a case where standardizing (zscoring) the design matrix helps in improving the precision of the estimates, but I didn't manage to. Whether I zscored or not, the results were essentially the same. Now I have a test case to add to statsmodels. I am sceptical about automatic rescaling, but a while ago I started to look into how to make it easier for users to apply predefined transforms in statsmodels, instead of having to code them from scratch.
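For reference, zscoring here just means standardizing each non-constant column of the design matrix; a minimal sketch of the kind of predefined transform I have in mind, not an existing statsmodels function:

import numpy as np

def zscore_design(exog, has_constant=True):
    # standardize columns to mean zero and unit standard deviation,
    # leaving a leading constant column untouched
    exog = np.asarray(exog, dtype=float)
    out = exog.copy()
    start = 1 if has_constant else 0
    cols = exog[:, start:]
    out[:, start:] = (cols - cols.mean(0)) / cols.std(0)
    return out

exog_z = zscore_design(exog0)   # exog0 has the constant in the first column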
I'm not an expert in numerical analysis and don't intend to become one; my "numerical incompetence" has improved only a bit since this, although I now know a bit more linear algebra.
I put a script with the NIST case in this gist. I haven't yet copied over the parts from the interpreter sessions.
A final comment:
I don't like long interpreter sessions; I usually convert everything to a script as fast as possible. For this post, I copied everything directly from the session. After cleaning up the original script a bit, I'm getting different numbers for the log relative error (LRE). I'm now using scipy.linalg.pinv inside statsmodels, and the LRE is in this case a measure of behavior at machine precision; it bounces anywhere between 5 and 8. This is a good result in the sense that we can still get estimates with reasonable precision, but it makes the LRE unreliable for replicating the results. I will write a proper script and unit test later, so that I can be more certain about how much the numbers change and whether there isn't a bug somewhere in my "live" session.