
Monday, January 30, 2012

Anscombe and Diagnostic Statistics

Now that I have these outlier and influence measures (see the last post), we can look at Anscombe's example, the Anscombe Quartet.

I took the example from matplotlib here. With a few changes, mainly adding the observation numbers, I get this graph:

The first example is what we would usually expect and shouldn't pose any problems for estimating the regression line. The second example is nonlinear, but we are fitting only a linear model. The third example has an outlier in y, which pulls the regression line towards it. The last example has an outlier in x.

Interpreting the last case is a bit tricky. The slope of the regression line is dominated by observation 7. If we think it is a reliable (correctly measured) observation, then it is very useful, since it contains a lot of information. If we doubt the reliability of that measurement, then we are left without any information about the slope and the effect of the explanatory variable.
(I don't remember where I read this, but someone argued that we shouldn't throw out x outliers.)

Now we can run a loop over the four problems and check the influence diagnostic table, and also Ramsey's RESET test for linearity.
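Here is a minimal sketch of such a loop for anyone who wants to reproduce it with a released version of statsmodels. I'm assuming the names this functionality ended up with there (OLSInfluence and reset_ramsey in statsmodels.stats.outliers_influence); the code behind this post was a development version, so names and output formatting differ slightly.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence, reset_ramsey

# Anscombe's quartet
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5.0])
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8.0])
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

for i, (xi, yi) in enumerate([(x, y1), (x, y2), (x, y3), (x4, y4)], start=1):
    res = sm.OLS(yi, sm.add_constant(xi)).fit()
    infl = OLSInfluence(res)
    print('example %d' % i)
    print(res.params)                   # (constant, slope)
    print(infl.summary_frame())         # Cook's d, hat diag, dffits, dfbetas, ...
    print(reset_ramsey(res, degree=5))  # Ramsey RESET test for linearity

summary_frame() returns roughly the same influence measures as the tables below, as a DataFrame.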

example 1: ok
-------------


Parameter estimates (constant, slope): [ 3.00009091  0.50009091]
=============================================================================================================
       obs      endog     fitted     Cook's   student.   hat diag    dffits   ext.stud.     dffits     dfbeta
                           value          d   residual              internal   residual                 slope
-------------------------------------------------------------------------------------------------------------
         0      8.040      8.001      0.000      0.033      0.100      0.011      0.031      0.010      0.003
         1      6.950      7.001      0.000     -0.043      0.100     -0.014     -0.041     -0.014      0.004
         2      7.580      9.501      0.489     -1.778      0.236     -0.989     -2.081     -1.158     -0.908
         3      8.810      7.501      0.062      1.110      0.091      0.351      1.127      0.356      0.000
         4      8.330      8.501      0.002     -0.148      0.127     -0.057     -0.140     -0.053     -0.029
         5      9.960     10.001      0.000     -0.041      0.318     -0.028     -0.038     -0.026     -0.022
         6      7.240      6.001      0.127      1.102      0.173      0.503      1.117      0.510     -0.351
         7      4.260      5.000      0.123     -0.725      0.318     -0.495     -0.705     -0.481      0.407
         8     10.840      9.001      0.279      1.635      0.173      0.747      1.838      0.840      0.578
         9      4.820      6.501      0.154     -1.455      0.127     -0.556     -1.569     -0.599      0.320
        10      5.680      5.500      0.004      0.166      0.236      0.092      0.157      0.087     -0.068
=============================================================================================================
RESET

F test: F=array([[ 0.44964358]]), p=[[ 0.77063577]], df_denom=5, df_num=4

Everything looks mostly ok; the RESET test doesn't reject the linear specification. Observation 2 might be an outlier: its studentized residual is around 2, which is about the threshold for the t-distribution.
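As a quick check of that threshold: under the null, the externally studentized residuals follow a t distribution with n - k - 1 = 11 - 2 - 1 = 8 degrees of freedom, so the two-sided 5% critical value is

from scipy import stats
print(stats.t.ppf(0.975, 8))   # about 2.31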

example 2: nonlinear
--------------------


Parameter estimates (constant, slope): [ 3.00090909  0.5       ]
=============================================================================================================
       obs      endog     fitted     Cook's   student.   hat diag    dffits   ext.stud.     dffits     dfbeta
                           value          d   residual              internal   residual                 slope
-------------------------------------------------------------------------------------------------------------
         0      9.140      8.001      0.052      0.971      0.100      0.324      0.967      0.322      0.097
         1      8.140      7.001      0.052      0.971      0.100      0.324      0.967      0.322     -0.097
         2      8.740      9.501      0.077     -0.704      0.236     -0.392     -0.683     -0.380     -0.298
         3      8.770      7.501      0.058      1.076      0.091      0.340      1.087      0.344      0.000
         4      9.260      8.501      0.032      0.657      0.127      0.251      0.635      0.242      0.130
         5      8.100     10.001      0.808     -1.861      0.318     -1.271     -2.236     -1.528     -1.291
         6      6.130      6.001      0.001      0.115      0.173      0.052      0.108      0.050     -0.034
         7      3.100      5.001      0.808     -1.861      0.318     -1.271     -2.236     -1.528      1.291
         8      9.130      9.001      0.001      0.115      0.173      0.052      0.108      0.050      0.034
         9      7.260      6.501      0.032      0.657      0.127      0.251      0.635      0.242     -0.130
        10      4.740      5.501      0.077     -0.704      0.236     -0.392     -0.683     -0.380      0.298
=============================================================================================================
RESET

F test: F=array([[ 820836.08290181]]), p=[[ 0.]], df_denom=5, df_num=4

Observations 5 and 7 look like outliers; they have the largest Cook's distances and studentized residuals. However, the RESET test strongly rejects the hypothesis that the linear specification is correct, so we had better look for another model, for example one that includes polynomial terms in the explanatory variable.


example 3: outlier at 2
-----------------------


Parameter estimates (constant, slope): [ 3.00245455  0.49972727]
=============================================================================================================
       obs      endog     fitted     Cook's   student.   hat diag    dffits   ext.stud.     dffits     dfbeta
                           value          d   residual              internal   residual                 slope
-------------------------------------------------------------------------------------------------------------
         0      7.460      8.000      0.012     -0.460      0.100     -0.153     -0.439     -0.146     -0.044
         1      6.770      7.000      0.002     -0.196      0.100     -0.065     -0.186     -0.062      0.019
         2     12.740      9.499      1.393      3.000      0.236      1.669   1203.540    669.587    525.268
         3      7.110      7.500      0.005     -0.331      0.091     -0.105     -0.314     -0.099      0.000
         4      7.810      8.499      0.026     -0.597      0.127     -0.228     -0.574     -0.219     -0.117
         5      8.840      9.999      0.301     -1.135      0.318     -0.775     -1.156     -0.790     -0.667
         6      6.080      6.001      0.001      0.070      0.173      0.032      0.066      0.030     -0.021
         7      5.390      5.001      0.034      0.381      0.318      0.260      0.362      0.247     -0.209
         8      8.150      8.999      0.059     -0.755      0.173     -0.345     -0.736     -0.336     -0.231
         9      6.420      6.500      0.000     -0.070      0.127     -0.027     -0.066     -0.025      0.013
        10      5.730      5.501      0.007      0.212      0.236      0.118      0.200      0.111     -0.087
=============================================================================================================
RESET
F test: F=array([[ 1.42791012]]), p=[[ 0.34730273]], df_denom=5, df_num=4


The results in the table are pretty clear: observation 2 has to be an outlier, since all the residual measures are huge. The RESET test doesn't indicate any problem with the linear specification.


example 4: exog_outlier at 7
----------------------------


Parameter estimates (constant, slope): [ 3.00172727  0.49990909]
=============================================================================================================
       obs      endog     fitted     Cook's   student.   hat diag    dffits   ext.stud.     dffits     dfbeta
                           value          d   residual              internal   residual                 slope
-------------------------------------------------------------------------------------------------------------
         0      6.580      7.001      0.007     -0.359      0.100     -0.120     -0.341     -0.114      0.034
         1      5.760      7.001      0.062     -1.059      0.100     -0.353     -1.067     -0.356      0.107
         2      7.710      7.001      0.020      0.605      0.100      0.202      0.582      0.194     -0.059
         3      8.840      7.001      0.137      1.569      0.100      0.523      1.735      0.578     -0.174
         4      8.470      7.001      0.087      1.253      0.100      0.418      1.300      0.433     -0.131
         5      7.040      7.001      0.000      0.033      0.100      0.011      0.031      0.011     -0.003
         6      5.250      7.001      0.124     -1.494      0.100     -0.498     -1.624     -0.541      0.163
         7     12.500     12.500     -1.#IO     -1.#IO      1.000     -1.#IO     -1.#IO     -1.#IO     -3.070
         8      5.560      7.001      0.084     -1.229      0.100     -0.410     -1.271     -0.423      0.128
         9      7.910      7.001      0.033      0.775      0.100      0.259      0.757      0.252     -0.076
        10      6.890      7.001      0.001     -0.095      0.100     -0.032     -0.089     -0.030      0.009
=============================================================================================================
RESET
F test: F=array([[ 126.84716907]]), p=[[ 0.00000007]], df_denom=9, df_num=4



What's going on with observation 7? The "-1.#IO" stands for numpy.NaN. All the measures that rely on leaving out observation 7 are indeterminate, since without observation 7 we cannot identify a slope.
The diagonal value of the hat matrix and the dfbeta for the slope indicate that this observation is very influential.
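A quick way to see the identification problem directly (a small sketch I'm adding for illustration, using the x values of example 4): dropping observation 7 leaves x constant at 8, so the design matrix of constant and x has rank 1 and the slope cannot be estimated.

import numpy as np

x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8.0])
keep = np.arange(len(x4)) != 7                           # leave out observation 7
exog = np.column_stack([np.ones(keep.sum()), x4[keep]])  # constant and x
print(np.linalg.matrix_rank(exog))                       # 1 instead of 2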
 

example 4a: exog_outlier at 7 (v2)
----------------------------------


Parameter estimates (constant, slope): [ 3.00172741  0.49990908]
=============================================================================================================
       obs      endog     fitted     Cook's   student.   hat diag    dffits   ext.stud.     dffits     dfbeta
                           value          d   residual              internal   residual                 slope
-------------------------------------------------------------------------------------------------------------
         0      7.006      7.006      0.000      0.000      0.091      0.000      0.000      0.000     -0.000
         1      6.580      7.001      0.007     -0.377      0.091     -0.119     -0.360     -0.114      0.033
         2      5.760      7.001      0.062     -1.110      0.091     -0.351     -1.125     -0.356      0.103
         3      7.710      7.001      0.020      0.634      0.091      0.201      0.614      0.194     -0.056
         4      8.840      7.001      0.135      1.645      0.091      0.520      1.828      0.578     -0.167
         5      8.470      7.001      0.086      1.314      0.091      0.416      1.371      0.434     -0.125
         6      7.040      7.001      0.000      0.035      0.091      0.011      0.033      0.011     -0.003
         7      5.250      7.001      0.123     -1.567      0.091     -0.495     -1.711     -0.541      0.156
         8     12.500     12.500      0.000     -0.000      1.000     -0.001     -0.000     -0.001     -0.001
         9      5.560      7.001      0.083     -1.289      0.091     -0.408     -1.339     -0.424      0.122
        10      7.910      7.001      0.033      0.813      0.091      0.257      0.798      0.253     -0.073
        11      6.890      7.001      0.001     -0.099      0.091     -0.031     -0.094     -0.030      0.009
=============================================================================================================
RESET
F test: F=array([[ 113.86707171]]), p=[[ 0.00000011]], df_denom=9, df_num=4

Since I didn't like the NaN results in the previous example, I made one change. I added an additional observation with an x value slightly larger than that of the large group, 8.01 instead of 8.00; the y value was chosen so that it lies on the predicted line. The new observation is the first observation, and the x outlier is now at index 8.

What happened to our influence measures? Both observations 0 and 8 have zero influence, since we only need one of them to identify the slope. Leaving one observation out doesn't help if we have two "similar" (in terms of parameter estimation) outliers.

Since this was interesting, let's try another variation. This time I change the value of the new observation to lie below the regression line (6.999 instead of 7.006). One more table and we are done.
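For reference, the modified data for the two variations can be put together like this (my reconstruction from the description, not code from the original post):

import numpy as np

# example 4 data, with the x outlier at index 7 as in the tables above
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8.0])
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

# new first observation at x = 8.01, with y either on the predicted line
# (7.006, example 4a) or slightly below it (6.999, example 4b)
x4mod = np.r_[8.01, x4]
y4a = np.r_[7.006, y4]
y4b = np.r_[6.999, y4]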


example 4b: exog_outlier at 7 (v3)
----------------------------------


Parameter estimates (constant, slope): [ 3.00063327  0.49996637]
=============================================================================================================
       obs      endog     fitted     Cook's   student.   hat diag    dffits   ext.stud.     dffits     dfbeta
                           value          d   residual              internal   residual                 slope
-------------------------------------------------------------------------------------------------------------
         0      6.999      7.005      0.000     -0.006      0.091     -0.002     -0.005     -0.002      0.001
         1      6.580      7.000      0.007     -0.376      0.091     -0.119     -0.359     -0.114      0.033
         2      5.760      7.000      0.062     -1.110      0.091     -0.351     -1.124     -0.356      0.103
         3      7.710      7.000      0.020      0.635      0.091      0.201      0.615      0.194     -0.056
         4      8.840      7.000      0.136      1.646      0.091      0.520      1.829      0.578     -0.167
         5      8.470      7.000      0.086      1.315      0.091      0.416      1.372      0.434     -0.125
         6      7.040      7.000      0.000      0.035      0.091      0.011      0.034      0.011     -0.003
         7      5.250      7.000      0.123     -1.566      0.091     -0.495     -1.710     -0.541      0.156
         8     12.500     12.500     21.566      0.006      1.000      6.567      0.005      6.231      5.965
         9      5.560      7.000      0.083     -1.289      0.091     -0.407     -1.339     -0.423      0.122
        10      7.910      7.000      0.033      0.814      0.091      0.257      0.799      0.253     -0.073
        11      6.890      7.000      0.001     -0.099      0.091     -0.031     -0.094     -0.030      0.009
=============================================================================================================
RESET
F test: F=array([[ 113.85091319]]), p=[[ 0.00000011]], df_denom=9, df_num=4

Now we see that the outlier measures, like the studentized residuals, are very small for observation 8. However, the influence measures (the diagonal value of the hat matrix, Cook's distance, dffits and dfbeta) are all large and clearly signal that this is a very influential observation.

Extra comment:

The RESET test in the versions of example four has different degrees of freedom than in the first three examples, even though it uses the same code. After looking at this surprising result a bit, it turns out that, because of the structure of the explanatory variable (only 2 or 3 distinct x values), the auxiliary regression for the RESET test has reduced rank, which changes the residual degrees of freedom of the test.
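This can be checked directly. Here is a small sketch of the check (my reconstruction of the auxiliary regression, assuming powers 2 through 5 of the fitted values are added, which matches df_num=4 in the output above):

import numpy as np
import statsmodels.api as sm

x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8.0])
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])
res = sm.OLS(y4, sm.add_constant(x4)).fit()

# auxiliary RESET design: original regressors plus powers of the fitted values
fitted = res.fittedvalues
aux = np.column_stack([sm.add_constant(x4)] + [fitted**p for p in range(2, 6)])
print(aux.shape[1])                    # 6 columns in the auxiliary design
print(np.linalg.matrix_rank(aux))      # but the rank is only 2

With only two distinct x values, the fitted values also take only two distinct values, so all their powers lie in the span of the constant and x.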

A bit of history:
I didn't know about the Anscombe example until this thread http://mail.scipy.org/pipermail/scipy-user/2011-February/028462.html
It took me a year to get back to this.


Sunday, January 29, 2012

Influence and Outlier measures in regression

Suppose we run a simple regression and want to know whether there are possible outliers, or observations that have a large influence on the estimated parameters. In the past, I added several specification tests to statsmodels. These are tests that check whether your model assumptions are "correct" or whether there are violations.
(This is sloppy phrasing; they are statistical tests after all.)

What is still missing in statsmodels are the usual outlier and influence measures. (There are robust linear models, though.)

I found a nice example for influence and outlier measures using SAS in http://www.bios.unc.edu/~truong/b545/reg/simple/wa.html

Trying to do something similar as an extension to statsmodels, I ended up with the table below. It is supposed to correspond to the bottom tables on the page linked above.
All the numbers look the same, but I'm still missing `Cov Ratio` and haven't looked at the prediction intervals yet. I also have dfbetas, but ran out of time to add them nicely to the table. Some of the measures have associated p-values or recommended thresholds that can be used to interpret the results and find the observations that might "mess up" the estimation.

Additionally, I also coded the Variance Inflation Factor (VIF), a measure of multicollinearity for the explanatory variables, and the Ramsey RESET test, which is unrelated to the rest here and is a test for the linear specification. I think I now have everything from this page: http://en.wikipedia.org/wiki/Category:Regression_diagnostics.
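A minimal usage sketch for these two diagnostics, with made-up data and using the names they ended up with in released statsmodels (an assumption on my part; the development code behind this post may have used different names):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import (variance_inflation_factor,
                                                  reset_ramsey)

rng = np.random.RandomState(0)
x1 = rng.standard_normal(50)
x2 = x1 + 0.1 * rng.standard_normal(50)          # nearly collinear with x1
exog = sm.add_constant(np.column_stack([x1, x2]))
endog = 1 + x1 + x2 + rng.standard_normal(50)
res = sm.OLS(endog, exog).fit()

# VIF for the explanatory variables (column 0 is the constant)
print([variance_inflation_factor(exog, i) for i in [1, 2]])
# RESET: F test on powers of the fitted values added to the regression
print(reset_ramsey(res, degree=3))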

The code is roughly the following, after imports and estimating the model by OLS:

infl = Influence(res_ols)
print infl.summary()


 obs   endog    fitted   Cook's d  studentized  hat diag    dffits  ext.stud.   dffits
                 value              residuals             internal  residuals
 0.0    64.0   59.7143     0.0317      0.7513    0.1008    0.2516     0.7338    0.2457
 1.0    71.0   67.0        0.0334      0.7079    0.1176    0.2585     0.6891    0.2516
 2.0    53.0   52.4286     0.0025      0.1124    0.2857    0.0711     0.1067    0.0675
 3.0    67.0   70.6429     0.058      -0.6778    0.2017   -0.3407    -0.6583   -0.3309
 4.0    55.0   59.7143     0.0383     -0.8265    0.1008   -0.2768    -0.8123   -0.272
 5.0    58.0   56.0714     0.0125      0.3515    0.1681    0.158      0.3355    0.1508
 6.0    77.0   67.0        0.2088      1.7697    0.1176    0.6462     2.0259    0.7398
 7.0    57.0   63.3571     0.0559     -1.1042    0.084    -0.3345    -1.1179   -0.3386
 8.0    56.0   67.0        0.2526     -1.9467    0.1176   -0.7108    -2.3435   -0.8557
 9.0    51.0   52.4286     0.0158     -0.281     0.2857   -0.1777    -0.2676   -0.1693
10.0    76.0   74.2857     0.031       0.3498    0.3361    0.2489     0.3339    0.2376
11.0    68.0   63.3571     0.0298      0.8064    0.084     0.2443     0.7912    0.2397


The HTML for this table was created with scikits.statsmodels.iolib.table.SimpleTable, but I never tried to fine-tune the HTML options for a nice result. The text table that gets printed in the Python shell is:

>>> print infl.summary_obs()
===================================================================================================
       obs      endog     fitted   Cook's d studentized   hat diag    dffits   ext.stud.     dffits
                           value              residuals              internal  residuals          
---------------------------------------------------------------------------------------------------
         0    64.0000    59.7143     0.0317      0.7513     0.1008     0.2516     0.7338     0.2457
         1    71.0000    67.0000     0.0334      0.7079     0.1176     0.2585     0.6891     0.2516
         2    53.0000    52.4286     0.0025      0.1124     0.2857     0.0711     0.1067     0.0675
         3    67.0000    70.6429     0.0580     -0.6778     0.2017    -0.3407    -0.6583    -0.3309
         4    55.0000    59.7143     0.0383     -0.8265     0.1008    -0.2768    -0.8123    -0.2720
         5    58.0000    56.0714     0.0125      0.3515     0.1681     0.1580     0.3355     0.1508
         6    77.0000    67.0000     0.2088      1.7697     0.1176     0.6462     2.0259     0.7398
         7    57.0000    63.3571     0.0559     -1.1042     0.0840    -0.3345    -1.1179    -0.3386
         8    56.0000    67.0000     0.2526     -1.9467     0.1176    -0.7108    -2.3435    -0.8557
         9    51.0000    52.4286     0.0158     -0.2810     0.2857    -0.1777    -0.2676    -0.1693
        10    76.0000    74.2857     0.0310      0.3498     0.3361     0.2489     0.3339     0.2376
        11    68.0000    63.3571     0.0298      0.8064     0.0840     0.2443     0.7912     0.2397
===================================================================================================



Observations 6 and 8 look like outliers from these measures.
 

This was the fun part, and I managed to get it done in a few (or more) hours this weekend. The rest sounds more like work.

Some comments on the implementation.

The first group of measures can be calculated efficiently from the original regression alone; the second group requires a loop that re-estimates the model, dropping each observation one at a time (leave one observation out). The latter can get expensive if the number of observations is large.

Initially I wanted to cache all leave-one-observation-out result instances, but this might get expensive in terms of memory, since at least some copies of the data would be made and stored together with the result instances. Instead, I finally decided to loop once but store all the attributes that will be required for the various measures. (Right now, I actually need only the estimate of the error variance and the parameter estimates.)
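A minimal sketch of that loop (my reconstruction for illustration, not the statsmodels implementation), keeping only the error variance estimate and the parameter estimates from each leave-one-out fit:

import numpy as np
import statsmodels.api as sm

def loo_params_sigma2(endog, exog):
    nobs = len(endog)
    params_loo = np.empty((nobs, exog.shape[1]))
    sigma2_loo = np.empty(nobs)
    for i in range(nobs):
        keep = np.arange(nobs) != i              # drop observation i
        res_i = sm.OLS(endog[keep], exog[keep]).fit()
        params_loo[i] = res_i.params             # parameter estimates
        sigma2_loo[i] = res_i.mse_resid          # error variance estimate
    return params_loo, sigma2_loo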

If the number of observations is large, then calculating the leave-one-observation-out results for every observation will be wasteful, because many observations will not be likely candidates for the "full treatment". The idea, not yet implemented, is to calculate the cheap measures first and use them to select the observations that are likely candidates for the more precise leave-one-out loop.

Also still missing: plots.
I did some cleanup for the regression plots, but the outlier plots still look plain vanilla.


Completely unrelated:
We went to the Fête des neiges this weekend and looked at the snow village (giant igloos).

Quote: "Parc Jean-Drapeau is proud to welcome a North-American premiere – a Snow Village which has a hotel, restaurant, replicas of familiar MontrĂ©al buildings and lighted snow tunnels to explore all winter long. This Finnish concept attraction allows visitors and clients to enjoy a unique MontrĂ©al-winter experience."


I'm definitely NOT going to book a room there. But if anyone is interested, here's the link.