Thursday, May 10, 2012

Regression Plots - Part 1

I started to work on improving the documentation for the regression plots in statsmodels. (However, I realized I first have to improve the plots themselves a bit.)

For now, just a question: Can you spot the mis-specification of the model?

I simulate a model and run a linear regression on three variables and a constant. Here is the estimation summary, which looks quite good: large R-squared, all variables significant, no obvious problems:
>>> print res.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.901
Model:                            OLS   Adj. R-squared:                  0.898
Method:                 Least Squares   F-statistic:                     290.3
Date:                Thu, 10 May 2012   Prob (F-statistic):           5.31e-48
Time:                        13:15:22   Log-Likelihood:                -173.85
No. Observations:                 100   AIC:                             355.7
Df Residuals:                      96   BIC:                             366.1
Df Model:                           3
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.4872      0.024     20.076      0.000         0.439     0.535
x2             0.5408      0.045     12.067      0.000         0.452     0.630
x3             0.5136      0.030     16.943      0.000         0.453     0.574
const          4.6294      0.372     12.446      0.000         3.891     5.368
==============================================================================
Omnibus:                        0.945   Durbin-Watson:                   1.570
Prob(Omnibus):                  0.624   Jarque-Bera (JB):                1.031
Skew:                          -0.159   Prob(JB):                        0.597
Kurtosis:                       2.617   Cond. No.                         33.2
==============================================================================
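For readers who want to play along, here is a minimal sketch of how such a setup looks in statsmodels. This is not the actual simulation I used (that would give away the answer); the data generating line is just a placeholder:

import numpy as np
import statsmodels.api as sm

np.random.seed(12345)
nobs = 100
x = np.random.uniform(0, 10, size=(nobs, 3))
exog = sm.add_constant(x, prepend=False)  #three regressors plus a constant
#placeholder linear data generating process; the simulated model for the
#puzzle contains a mis-specification that is intentionally not shown here
y = np.dot(exog, [0.5, 0.5, 0.5, 5.]) + np.random.randn(nobs)
res = sm.OLS(y, exog).fit()
print res.summary()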
The following three graphs are refactored versions of the regression plots. Each graph looks at the data and estimation results with respect to one of the three variables. (The graphs look better in original size.)
The short lines in the first subplot of each graph are the prediction confidence intervals for each observation.
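These per-observation intervals can be computed, for example, with the wls_prediction_std helper from the sandbox; this is an assumption on my part, the plotting code may compute them differently:

from statsmodels.sandbox.regression.predstd import wls_prediction_std

#prediction standard errors and lower/upper interval bounds, one per observation
prstd, iv_l, iv_u = wls_prediction_std(res)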
The code is short if we have the (still unpublished) helper functions; in the following, res is an OLS results instance.
from regressionplots_new import plot_regress_exog

fig9 = plot_regress_exog(res, exog_idx=0)
add_lowess(fig9, ax_idx=1, lines_idx=0)
add_lowess(fig9, ax_idx=2, lines_idx=0)
add_lowess(fig9, ax_idx=3, lines_idx=0)

fig10 = plot_regress_exog(res, exog_idx=1)
add_lowess(fig10, ax_idx=1, lines_idx=0)
add_lowess(fig10, ax_idx=2, lines_idx=0)
add_lowess(fig10, ax_idx=3, lines_idx=0)

fig11 = plot_regress_exog(res, exog_idx=2)
add_lowess(fig11, ax_idx=1, lines_idx=0)
add_lowess(fig11, ax_idx=2, lines_idx=0)
add_lowess(fig11, ax_idx=3, lines_idx=0)
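add_lowess is one of the unpublished helpers. A minimal sketch of what such a helper might do, assuming the lowess smoother in statsmodels.nonparametric (the actual helper may differ):

import statsmodels.api as sm

def add_lowess_sketch(fig, ax_idx=1, lines_idx=0, frac=0.2):
    #hypothetical stand-in for add_lowess: overlay a lowess smooth
    #of the points that are already plotted in one subplot
    ax = fig.axes[ax_idx]
    line = ax.get_lines()[lines_idx]
    x, y = line.get_xdata(), line.get_ydata()
    smoothed = sm.nonparametric.lowess(y, x, frac=frac)  #sorted (x, yhat) pairs
    ax.plot(smoothed[:, 0], smoothed[:, 1], 'r-')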

Tuesday, May 8, 2012

Plots in statsmodels: qqplot

Other news first, since I haven't managed to catch up with the blogs:
  • statsmodels has four students in GSoC, the first four projects described in my previous post. Congratulations to Alexandre, Divyanshu, George and Justin.
  • statsmodels 0.4.0 has been released, with the new name without scikits in front; more on PyPI.
statsmodels has a graphics subdirectory, where we started to collect some of the common statistical plots. To make the documentation a bit more exciting, I am adding plots directly to the docstrings for the individual functions. Currently, we don't have many of them in the online documentation yet; two examples are violin_plot and bean_plot.
A note on the documentation: Skipper improved the frontpage, which makes it easier to find the documentation for the latest released version and for the development version. The development version of the documentation is better and still improving, and it is incompatible with the 0.4.0 release in only one part.

quantile-quantile plot: qqplot

The documentation for the function is here. The function signature is
qqplot(data, dist=stats.norm, distargs=(), a=0, loc=0, scale=1, fit=False, line=False, ax=None)
I am not copying the entire docstring; what I would like to present here are some examples and how to work with the plots.
The first example is from the docstring. I don't like the defaults, so I kept adding keyword arguments until the plot was more to my taste.
  • The first plot uses no keywords; it assumes a normal distribution and does not standardize the data.
  • The second plot adds line='s', which according to the docstring
    's' - standardized line, the expected order statistics are scaled
          by the standard deviation of the given sample and have the mean
          added to them
    
    corresponds to the line after fitting location and scale for the normal distribution
  • The third plot adds fit=True to get standardized sample quantiles and plots the 45 degree line. That's the plot I would prefer.
  • The fourth plot is similar to the third plot, but with the t distribution instead of the normal distribution. I was surprised that the third and fourth plot look the same, until I checked and it turned out that the fitted t distribution has a huge degrees of freedom parameter and so is essentially identical to the normal distribution.


I will go over the code to produce this below.
I started the second example to see whether fitting the t distribution works correctly. Instead of using real data, I generate 1000 observations from a t distribution with df=4 and standard location (0) and scale (1).
  • The first plot fits a normal distribution, keywords: line='45', fit=True
  • The second plot fits the t distribution, keywords: dist=stats.t, line='45', fit=True
  • The third plot is the same as the second plot, but I fit the t distribution myself, instead of having qqplot do it. keywords: dist=stats.t, distargs=(dof,), loc=loc, scale=scale, line='45'. I added the estimated parameters into a text insert in the plot. qqplot currently doesn't tell us what the fitted parameters are.


The Code

Here was my first attempt, following the docstring example
from scipy import stats
import statsmodels.api as sm

#estimate to get the residuals
data = sm.datasets.longley.load()
data.exog = sm.add_constant(data.exog)
mod_fit = sm.OLS(data.endog, data.exog).fit()
res = mod_fit.resid

fig = sm.graphics.qqplot(res, dist=stats.t, line='45', fit=True)
fig.show()
It works, but the x-axis goes from -3 to 3, while there are only values from -2 to 2.
Detour to some background
A while ago we had a discussion on the mailing list about what a plot in statsmodels should return. With the helpful comments of John Hunter, we finally agreed that plots should take an ax (matplotlib axis) argument if it's meaningful, and always return a figure instance fig. If ax is None, or if the plot is a combination plot (several plots in one figure), then a figure is created and returned. If ax is given, then that is used to attach the plot elements. Ralf Gommers converted our plot functions to follow this pattern; besides that, he also wrote several of the plots that are currently in statsmodels.
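A minimal sketch of this pattern, with a hypothetical plot function:

from matplotlib import pyplot as plt

def my_plot(x, ax=None):
    #statsmodels convention: draw into the given axis if there is one,
    #otherwise create a new figure; always return the figure instance
    if ax is None:
        fig = plt.figure()
        ax = fig.add_subplot(1, 1, 1)
    else:
        fig = ax.figure
    ax.plot(x)
    return fig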
So, to change the axis limits in the above graph, all I have to add is:
fig.axes[0].set_xlim(-2, 2)
The resulting plot is then the same as the third plot in the first graph above.
The first graph
Here is now the script for the first graph in several stages:
First I import some modules and calculate the residuals following the example
from scipy import stats
from matplotlib import pyplot as plt
import statsmodels.api as sm

#example from docstring
data = sm.datasets.longley.load()
data.exog = sm.add_constant(data.exog)
mod_fit = sm.OLS(data.endog, data.exog).fit()
res = mod_fit.resid
Then I hardcode a left position for text inserts, and create a matplotlib figure instance
left = -1.8
fig = plt.figure()
Next we can add the first subplot. The only keyword argument for qqplot is ax, to tell qqplot to attach the plot to my first subplot. Since I want to insert a text to describe the keywords, I needed to spend some time with the matplotlib documentation. As we have a reference to the axis instance, it is easy to change or add plot elements:
ax = fig.add_subplot(2, 2, 1)
sm.graphics.qqplot(res, ax=ax)
top = ax.get_ylim()[1] * 0.75
txt = ax.text(left, top, 'no keywords', verticalalignment='top')
txt.set_bbox(dict(facecolor='k', alpha=0.1))
The other subplots follow the same pattern. I didn't try to generalize or avoid hardcoding
ax = fig.add_subplot(2, 2, 2)
sm.graphics.qqplot(res, line='s', ax=ax)
top = ax.get_ylim()[1] * 0.75
txt = ax.text(left, top, "line='s'", verticalalignment='top')
txt.set_bbox(dict(facecolor='k', alpha=0.1))

ax = fig.add_subplot(2, 2, 3)
sm.graphics.qqplot(res, line='45', fit=True, ax=ax)
ax.set_xlim(-2, 2)
top = ax.get_ylim()[1] * 0.75
txt = ax.text(left, top, "line='45', \nfit=True", verticalalignment='top')
txt.set_bbox(dict(facecolor='k', alpha=0.1))

ax = fig.add_subplot(2, 2, 4)
sm.graphics.qqplot(res, dist=stats.t, line='45', fit=True, ax=ax)
ax.set_xlim(-2, 2)
top = ax.get_ylim()[1] * 0.75
txt = ax.text(left, top, "dist=stats.t, \nline='45', \nfit=True",
              verticalalignment='top')
txt.set_bbox(dict(facecolor='k', alpha=0.1))
The final step is to adjust the layout so that axis labels don't overlap with other subplots if the graph is not very large:
fig.tight_layout()
The second graph
The second graph follows the same pattern with a few changes.
First we generate a random sample using scipy.stats, which under the hood uses the random numbers from numpy. You can notice here that I am cheating: I ran the script several times to find "nice" seeds. Especially in smaller samples, qqplot may often not be very good at distinguishing normal and t distributions.
import numpy as np
seed = np.random.randint(1000000)
print 'seed', seed
seed = 461970  #nice seed for nobs=1000
#seed = 571478  #nice seed for nobs=100
#seed = 247819  #for nobs=100, estimated t is essentially normal
np.random.seed(seed)
rvs = stats.t.rvs(4, size=1000)
The first two subplots are very similar to what is in the first graph:
fig2 = plt.figure()
ax = fig2.add_subplot(2, 2, 1)
fig2 = sm.graphics.qqplot(rvs, dist=stats.norm, line='45', fit=True, ax=ax)
top = ax.get_ylim()[1] * 0.75
xlim = ax.get_xlim()
frac = 0.1
left = xlim[0] * (1-frac) + xlim[1] * frac
txt = ax.text(left, top, "normal", verticalalignment='top')
txt.set_bbox(dict(facecolor='k', alpha=0.1))

ax = fig2.add_subplot(2, 2, 2)
fig2 = sm.graphics.qqplot(rvs, dist=stats.t, line='45', fit=True, ax=ax)
top = ax.get_ylim()[1] * 0.75
xlim = ax.get_xlim()
frac = 0.1
left = xlim[0] * (1-frac) + xlim[1] * frac
txt = ax.text(left, top, "t", verticalalignment='top')
txt.set_bbox(dict(facecolor='k', alpha=0.1))
For the third plot, I estimate the parameters of the t distribution to see whether I get the same results as in the second plot (I do), so I can insert the parameter estimates into the plot:
params = stats.t.fit(rvs)
dof, loc, scale = params
ax = fig2.add_subplot(2, 2, 4)
fig2 = sm.graphics.qqplot(rvs, dist=stats.t, distargs=(dof,), loc=loc,
                 scale=scale, line='45', fit=False, ax=ax)
top = ax.get_ylim()[1] * 0.75
xlim = ax.get_xlim()
frac = 0.1
left = xlim[0] * (1-frac) + xlim[1] * frac
txt = ax.text(left, top, "t \ndof=%3.2F \nloc=%3.2F, \nscale=%3.2F" % tuple(params),
              verticalalignment='top')
txt.set_bbox(dict(facecolor='k', alpha=0.1))
That's it for the plots, now I need to add them to the statsmodels documentation.

PS: normality tests, details left for another day

qqplots give us a visual check of whether a sample follows a specific distribution. The case we are most often interested in is a test for normality. scipy.stats and statsmodels have several normality tests. The ones I have written recently are Anderson-Darling and Lilliefors. Lilliefors is a Kolmogorov-Smirnov test for normality when mean and variance are estimated. A statistical test provides a more reliable check than a "vague" visual inspection, but these tests do not point us to a specific alternative and provide less information about the direction in which the null hypothesis might be incorrect.
Using the residuals in the first example, neither test rejects the null hypothesis that the residuals come from a normal distribution:
>>> normal_ad(res)
(0.43982328207860633, 0.25498161947268855)
>>> lillifors(res)
(0.17229856392873188, 0.2354638181341876)
On the other hand, in the second example with 1000 observations from the t distribution, the assumption that the data come from a normal distribution is clearly rejected:
>>> normal_ad(rvs)
(6.5408483355136013, 4.7694160497092537e-16)
>>> lillifors(rvs)
(0.05919821253474411, 8.5872265678140885e-09)
PPS:
I'm reluctant to publish the import paths, because I had forgotten to add these functions to a proper place for 0.4.0, and the import location will not stay where it is. When I wrote these scripts, it took me a few minutes to find out that they are not on any recommended import path:
>>> from statsmodels.stats.adnorm import normal_ad
>>> from statsmodels.stats.lilliefors import lillifors

Wednesday, April 11, 2012

Statsmodels and Google Summer of Code 2012

I didn't have much time or motivation to work on my blog these last weeks, mainly because I was busy discussing Google Summer of Code and preparing a new release for statsmodels.

So here is just an update on our Google Summer of Code candidates and their projects. This year was a successful year in attracting student proposals: we have six proposals, five of which we discussed quite extensively on our mailing list before the application.

Of the five projects, the first two are must-haves for econometrics or statistical packages: one on Systems of Equations, the other on Nonlinear Least-Squares and Nonlinear Robust Models. The next two are on nonparametric or semi-parametric methods: one on more traditional kernel estimation, the other using Empirical Likelihood, which is a relatively new approach that has become popular in recent research both in econometrics and in statistics. The fifth is on Dynamic Linear Models, mainly using the Kalman filter and a Bayesian approach, which would extend the depth of statsmodels in time series analysis.

All topics would be valuable extensions to statsmodels and significantly increase our coverage of statistics and econometrics. From the discussion on the mailing list I think that all candidates are qualified to bring the projects to a successful finish.

Estimating System of Equations

This is a standard econometrics topic, but I only recently found that graphical models and causal models discussed in other fields have a large overlap with this. In the case of a system of simultaneous equations, we have several variables that depend on each other. The simplest case in economics is a market equilibrium, where the demanded and supplied quantities depend on the price, and the price depends on the supply and demand. The estimation methods commonly used in this area are two-stage and three-stage least-squares and limited and full information maximum likelihood estimation. The first part of the project starts with the simpler case when we have several response variables, but they don't depend on each other simultaneously, although they can depend on past values of other response variables. I'm very glad that someone is picking this one up.
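As a rough illustration of the two-stage idea, here is a minimal two-stage least squares sketch in plain numpy (not the planned statsmodels interface):

import numpy as np

def tsls(y, exog, instruments):
    #first stage: project the (endogenous) regressors on the instruments;
    #exogenous regressors should be included among the instruments
    first = np.linalg.lstsq(instruments, exog)[0]
    exog_hat = np.dot(instruments, first)
    #second stage: regress y on the fitted values from the first stage
    beta = np.linalg.lstsq(exog_hat, y)[0]
    return beta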

Extension of Linear to Non Linear Models

This project has two parts: the first is extending the linear least-squares model to the non-linear case, the second is to implement non-linear models for robust estimation. Non-linear least squares is available in scipy, for example with scipy.optimize.curve_fit; however, in the statsmodels version we want to provide all the usual result statistics and statistical tests. The second part will implement two robust estimators for non-linear models that have been shown to be the most successful in recent Monte Carlo studies comparing different robust estimators for non-linear equations. Robust estimation here refers to the case when there are possibly many outliers in the data. My guess is that these will become the most used models of all the proposals.
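For reference, a minimal example of standard scipy curve_fit usage (the statsmodels version would add result statistics on top of this):

import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b):
    return a * np.exp(-b * x)

x = np.linspace(0, 4, 50)
y = func(x, 2.5, 1.3) + 0.1 * np.random.randn(50)
#point estimates and covariance matrix of the parameters a and b
popt, pcov = curve_fit(func, x, y)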

Nonparametric Estimation

This project extends the kernel-based methods in statsmodels from the univariate to the multivariate case, provides better bandwidth selection, and then implements nonparametric function estimation. Multivariate kernel density estimation should complement scipy.stats.gaussian_kde, which only works well with distributions that are approximately normal shaped or have only a single peak. Another extension is to provide kernels and estimation methods for discrete variables. These methods have been on our wishlist for a while, but only the univariate case has been included in statsmodels so far.
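A quick way to see the limitation is to apply gaussian_kde to a clearly bimodal sample (standard scipy usage):

import numpy as np
from scipy import stats

#mixture of two well separated normals
rvs = np.concatenate((np.random.randn(500) - 3, np.random.randn(500) + 3))
kde = stats.gaussian_kde(rvs)
grid = np.linspace(-7, 7, 201)
#with the default bandwidth the two modes tend to be oversmoothed
density = kde(grid)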

Empirical Likelihood

This is a relatively new approach in statistics and econometrics that avoids the distributional assumptions in estimation and in statistical tests. Instead of relying on a known distribution in small samples, where we often assume a normal distribution, or on the asymptotic normal distribution in large samples, this approach estimates the distribution in a nonparametric way. This is similar, to some extent, to the difference between, for example, a t-test and a rank-based Mann–Whitney U or Wilcoxon test, which are available in scipy.stats. The advantages are that in smaller samples the estimates and tests are more accurate when the distribution is not known, and in many cases, for example in finance, most tests show that the assumption of a normal distribution does not hold. For this project, I still have to catch up with some readings, because I'm only familiar with a small part of this literature, mainly on empirical likelihood in relation to the Generalized Method of Moments (GMM).
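As a small illustration of the analogy, scipy.stats has both the t-test and its rank-based counterpart:

import numpy as np
from scipy import stats

x = np.random.standard_t(3, size=50)        #heavy-tailed sample
y = np.random.standard_t(3, size=50) + 0.5  #shifted by 0.5
t_stat, t_pval = stats.ttest_ind(x, y)     #assumes normality
u_stat, u_pval = stats.mannwhitneyu(x, y)  #rank-based, distribution free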

Dynamic Linear Models

This covers state space models implemented with the Kalman filter for multivariate time series models, both from a likelihood and from a Bayesian perspective. The project expands the coverage of statsmodels in linear time series analysis, the first area where we are getting good coverage of models. Currently, we have univariate AR and ARIMA, vector autoregressive models (VAR), and structural VAR. Part of this project would be to get a good cython-based implementation of Kalman filters. Wes started a library, statlib, for this last year; however, it is still incomplete and needs to be integrated with statsmodels. Another advantage of this project is that it increases our infrastructure and models for macro-econometrics, estimation of macroeconomic models and dynamic stochastic general equilibrium (DSGE) models, which is currently still Matlab dominated, as far as I can tell.
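To give an idea of what the Kalman filter iterations look like, here is a minimal filter for the local level model (random walk plus noise) in plain numpy; this is only a sketch, not the planned implementation:

import numpy as np

def local_level_filter(y, sigma2_eps, sigma2_eta, a0=0., p0=1e6):
    #local level model: y[t] = mu[t] + eps[t], mu[t+1] = mu[t] + eta[t]
    a, p = a0, p0  #predicted state mean and variance (diffuse start)
    filtered = np.empty(len(y))
    for t in range(len(y)):
        f = p + sigma2_eps            #prediction error variance
        k = p / f                     #Kalman gain
        a = a + k * (y[t] - a)        #filtered state mean
        p = p * (1 - k) + sigma2_eta  #variance of the next prediction
        filtered[t] = a
    return filtered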

Now we still have to see how many GSoC slots we will get, but we have the chance this year to get a large increase in the speed of development of statsmodels, and we can reduce the number of cases where someone needs to run to R, or Stata, or Matlab because there is no implementation for a statistical analysis available in Python.