Tuesday, January 31, 2012

The nice thing about seeing zeros

Now, why am I happy to see a result like this?
>>> np.round(table_sm - table_sas, 3)
array([[ 0.,  0.,  0., -0.,  0., -0., -0.,  0., -0.,  0., -0.,  0.],
       [ 0.,  0., -0., -0.,  0., -0., -0.,  0.,  0.,  0., -0.,  0.],
       [ 0.,  0., -0., -0., -0.,  0., -0., -0.,  0., -0.,  0., -0.],
       [ 0.,  0., -0.,  0., -0.,  0., -0.,  0.,  0.,  0., -0.,  0.],
       [ 0.,  0., -0., -0.,  0., -0., -0.,  0.,  0., -0.,  0.,  0.],
       [ 0.,  0., -0.,  0.,  0.,  0., -0., -0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0., -0.,  0., -0., -0., -0., -0.,  0.],
       [ 0.,  0., -0., -0., -0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0., -0., -0.,  0., -0., -0.],
       [ 0.,  0.,  0.,  0.,  0.,  0., -0., -0., -0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0., -0.,  0., -0.,  0., -0.,  0.],
       [ 0.,  0.,  0., -0., -0.,  0., -0., -0., -0.,  0., -0., -0.],
       [ 0.,  0., -0.,  0.,  0.,  0.,  0., -0.,  0., -0., -0., -0.],
       [ 0.,  0.,  0.,  0., -0., -0.,  0., -0., -0.,  0.,  0.,  0.],
       [ 0.,  0.,  0., -0., -0.,  0.,  0., -0., -0., -0.,  0.,  0.],
       [ 0.,  0., -0.,  0.,  0., -0.,  0., -0.,  0.,  0., -0.,  0.],
       [ 0.,  0.,  0., -0.,  0., -0.,  0., -0., -0., -0., -0., -0.],
       [ 0.,  0.,  0., -0.,  0.,  0.,  0.,  0., -0.,  0.,  0., -0.],
       [ 0.,  0., -0., -0.,  0., -0.,  0.,  0.,  0.,  0.,  0.,  0.]])
It means that I get the same results, up to print precision, as SAS in this table: http://www.sfu.ca/sasdoc/sashtml/stat/chap55/sect4.htm#regg2f
The statsmodels side of the comparison is stacked together like this:
table_sm = np.column_stack([
              np.arange(res.nobs) + 1,           # observation number
              res.model.endog,                   # dependent variable
              res.fittedvalues,                  # predicted value
              predict_mean_se,                   # std error of the mean prediction
              predict_mean_ci[:, 0],             # lower 95% confidence limit for the mean
              predict_mean_ci[:, 1],             # upper 95% confidence limit for the mean
              predict_ci[:, 0],                  # lower 95% prediction interval
              predict_ci[:, 1],                  # upper 95% prediction interval
              res.resid,                         # residual
              resid_se,                          # std error of the residual
              infl.resid_studentized_internal,   # internally studentized residual
              infl.cooks_distance()[0]           # Cook's distance
              ])
from numpy.testing import assert_almost_equal
assert_almost_equal(table_sm, table_sas, decimal=4)
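The intermediate arrays are not shown above. Here is a hedged sketch of how they could be computed from the usual OLS results attributes (df_resid, mse_resid, fittedvalues, model.exog); this is not necessarily how they were obtained in the original script.

import numpy as np
from scipy import stats

alpha = 0.05
q = stats.t.ppf(1 - alpha / 2., res.df_resid)        # t critical value
exog = res.model.exog
hat_diag = (exog * np.linalg.pinv(exog).T).sum(1)    # leverage h_ii

predict_mean_se = np.sqrt(hat_diag * res.mse_resid)             # se of the mean prediction
predict_mean_ci = np.column_stack([res.fittedvalues - q * predict_mean_se,
                                   res.fittedvalues + q * predict_mean_se])
predict_se = np.sqrt((1 + hat_diag) * res.mse_resid)            # se of a new observation
predict_ci = np.column_stack([res.fittedvalues - q * predict_se,
                              res.fittedvalues + q * predict_se])
resid_se = np.sqrt(res.mse_resid * (1 - hat_diag))              # se of the residuals

The full comparison table, printed with SimpleTable, then looks like this: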
>>> print st
=====================================================================================================================================
       Obs    Dep Var  Predicted    Std Error    Mean ci    Mean ci Predict ci Predict ci   Residual  Std Error    Student     Cook's
           Population      Value Mean Predict    95% low    95% upp    95% low    95% upp              Residual   Residual          D
-------------------------------------------------------------------------------------------------------------------------------------
         1      3.929      5.038        1.729      1.373      8.704     -1.903     11.980     -1.109      2.178     -0.509      0.055
         2      5.308      5.039        1.391      2.090      7.987     -1.553     11.631      0.269      2.408      0.112      0.001
         3      7.239      6.309        1.130      3.912      8.705     -0.055     12.672      0.930      2.541      0.366      0.009
         4      9.638      8.847        0.957      6.818     10.876      2.612     15.082      0.791      2.611      0.303      0.004
         5     12.866     12.655        0.872     10.806     14.504      6.476     18.834      0.211      2.641      0.080      0.000
         6     17.069     17.732        0.858     15.913     19.550     11.562     23.901     -0.663      2.645     -0.251      0.002
         7     23.191     24.078        0.883     22.205     25.951     17.892     30.264     -0.887      2.637     -0.336      0.004
         8     31.443     31.693        0.920     29.742     33.644     25.483     37.903     -0.250      2.624     -0.095      0.000
         9     39.818     40.577        0.949     38.566     42.589     34.348     46.807     -0.759      2.614     -0.290      0.004
        10     50.155     50.731        0.959     48.697     52.764     44.494     56.967     -0.576      2.610     -0.221      0.002
        11     62.947     62.153        0.949     60.142     64.164     55.924     68.382      0.794      2.614      0.304      0.004
        12     75.994     74.845        0.920     72.894     76.796     68.635     81.055      1.149      2.624      0.438      0.008
        13     91.972     88.806        0.883     86.933     90.679     82.620     94.992      3.166      2.637      1.201      0.054
        14    105.710    104.035        0.858    102.217    105.854     97.866    110.205      1.675      2.645      0.633      0.014
        15    122.775    120.534        0.872    118.686    122.383    114.356    126.713      2.241      2.641      0.849      0.026
        16    131.669    138.303        0.957    136.274    140.332    132.068    144.537     -6.633      2.611     -2.541      0.289
        17    151.325    157.340        1.130    154.943    159.736    150.976    163.704     -6.015      2.541     -2.367      0.370
        18    179.323    177.646        1.391    174.697    180.595    171.054    184.238      1.677      2.408      0.696      0.054
        19    203.211    199.221        1.729    195.556    202.887    192.280    206.163      3.990      2.178      1.831      0.704
===================================================================================================================================== 
 
This is again just produced with SimpleTable, and I didn't spend any time on formatting yet.

(aside: I'm still trying to figure out how to create an article without a large administrative overhead. This time I used sphinx but it doesn't quite work yet.)

Monday, January 30, 2012

Anscombe and Diagnostic Statistics

Now that I have these outlier and influence measures (see the last post), we can look at the Anscombe example, the Anscombe Quartet.

I took the example from matplotlib here. With a few changes, mainly adding the observation numbers, I get this graph:

The first example is what we would usually expect and shouldn't cause any problems for estimating the regression line. The second example is nonlinear, but we are fitting only a linear model. The third example has an outlier in y, which pulls the regression line towards it. The last example has an outlier in x.

Interpreting the last case is a bit tricky. The slope of the regression line is dominated by observation 7; if we think it is a reliable (correctly measured) observation, then it is very useful, since it contains a lot of information. If we doubt the reliability of that measurement, then we are left without any information about the slope and the effect of the explanatory variable.
(I don't remember where I read this, but someone argued that we shouldn't throw out x outliers.)

Now we can run a loop over the four problems and check the influence diagnostic table, and also Ramsey's RESET test for linearity.
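A rough sketch of that loop, assuming the four x and y arrays of the quartet are collected in lists xs and ys, and using the Influence class from the previous post together with a RESET helper; the names xs, ys and reset_ramsey are my own placeholders here, not necessarily the code behind the output below.

import numpy as np
import scikits.statsmodels.api as sm

for i in range(4):
    exog = np.column_stack((np.ones(len(xs[i])), xs[i]))   # constant and slope
    res = sm.OLS(np.asarray(ys[i]), exog).fit()
    print 'Parameter estimates (constant, slope):', res.params
    infl = Influence(res)              # per-observation outlier/influence measures
    print infl.summary_obs()
    print 'RESET'
    print reset_ramsey(res, degree=5)  # hypothetical helper, sketched further below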

example 1: ok
-------------


Parameter estimates (constant, slope): [ 3.00009091  0.50009091]
=============================================================================================================
       obs      endog     fitted     Cook's   student.   hat diag    dffits   ext.stud.     dffits     dfbeta
                           value          d   residual              internal   residual                 slope
-------------------------------------------------------------------------------------------------------------
         0      8.040      8.001      0.000      0.033      0.100      0.011      0.031      0.010      0.003
         1      6.950      7.001      0.000     -0.043      0.100     -0.014     -0.041     -0.014      0.004
         2      7.580      9.501      0.489     -1.778      0.236     -0.989     -2.081     -1.158     -0.908
         3      8.810      7.501      0.062      1.110      0.091      0.351      1.127      0.356      0.000
         4      8.330      8.501      0.002     -0.148      0.127     -0.057     -0.140     -0.053     -0.029
         5      9.960     10.001      0.000     -0.041      0.318     -0.028     -0.038     -0.026     -0.022
         6      7.240      6.001      0.127      1.102      0.173      0.503      1.117      0.510     -0.351
         7      4.260      5.000      0.123     -0.725      0.318     -0.495     -0.705     -0.481      0.407
         8     10.840      9.001      0.279      1.635      0.173      0.747      1.838      0.840      0.578
         9      4.820      6.501      0.154     -1.455      0.127     -0.556     -1.569     -0.599      0.320
        10      5.680      5.500      0.004      0.166      0.236      0.092      0.157      0.087     -0.068
=============================================================================================================
RESET

F test: F=array([[ 0.44964358]]), p=[[ 0.77063577]], df_denom=5, df_num=4

Everything looks mostly ok; the RESET test doesn't reject the linear specification. Observation 2 might be an outlier: its studentized residual is around 2, which is about the threshold for the t-distribution.

example 2: nonlinear
--------------------


Parameter estimates (constant, slope): [ 3.00090909  0.5       ]
=============================================================================================================
       obs      endog     fitted     Cook's   student.   hat diag    dffits   ext.stud.     dffits     dfbeta
                           value          d   residual              internal   residual                 slope
-------------------------------------------------------------------------------------------------------------
         0      9.140      8.001      0.052      0.971      0.100      0.324      0.967      0.322      0.097
         1      8.140      7.001      0.052      0.971      0.100      0.324      0.967      0.322     -0.097
         2      8.740      9.501      0.077     -0.704      0.236     -0.392     -0.683     -0.380     -0.298
         3      8.770      7.501      0.058      1.076      0.091      0.340      1.087      0.344      0.000
         4      9.260      8.501      0.032      0.657      0.127      0.251      0.635      0.242      0.130
         5      8.100     10.001      0.808     -1.861      0.318     -1.271     -2.236     -1.528     -1.291
         6      6.130      6.001      0.001      0.115      0.173      0.052      0.108      0.050     -0.034
         7      3.100      5.001      0.808     -1.861      0.318     -1.271     -2.236     -1.528      1.291
         8      9.130      9.001      0.001      0.115      0.173      0.052      0.108      0.050      0.034
         9      7.260      6.501      0.032      0.657      0.127      0.251      0.635      0.242     -0.130
        10      4.740      5.501      0.077     -0.704      0.236     -0.392     -0.683     -0.380      0.298
=============================================================================================================
RESET

F test: F=array([[ 820836.08290181]]), p=[[ 0.]], df_denom=5, df_num=4

Observations 5 and 7 look like outliers. However, the RESET test strongly rejects the hypothesis that the linear specification is correct. So we had better look for another model, for example one that includes polynomial terms for the explanatory variable.
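As a follow-up sketch (reusing the hypothetical xs, ys lists from the loop above), the second data set could be refit with a quadratic term like this:

import numpy as np
import scikits.statsmodels.api as sm

x1 = np.asarray(xs[1])                                       # second Anscombe data set
exog_quad = np.column_stack((np.ones(len(x1)), x1, x1**2))   # constant, x, x**2
res_quad = sm.OLS(np.asarray(ys[1]), exog_quad).fit()
print 'quadratic fit parameters:', res_quad.params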


example 3: outlier at 2
-----------------------


Parameter estimates (constant, slope): [ 3.00245455  0.49972727]
=============================================================================================================
       obs      endog     fitted     Cook's   student.   hat diag    dffits   ext.stud.     dffits     dfbeta
                           value          d   residual              internal   residual                 slope
-------------------------------------------------------------------------------------------------------------
         0      7.460      8.000      0.012     -0.460      0.100     -0.153     -0.439     -0.146     -0.044
         1      6.770      7.000      0.002     -0.196      0.100     -0.065     -0.186     -0.062      0.019
         2     12.740      9.499      1.393      3.000      0.236      1.669   1203.540    669.587    525.268
         3      7.110      7.500      0.005     -0.331      0.091     -0.105     -0.314     -0.099      0.000
         4      7.810      8.499      0.026     -0.597      0.127     -0.228     -0.574     -0.219     -0.117
         5      8.840      9.999      0.301     -1.135      0.318     -0.775     -1.156     -0.790     -0.667
         6      6.080      6.001      0.001      0.070      0.173      0.032      0.066      0.030     -0.021
         7      5.390      5.001      0.034      0.381      0.318      0.260      0.362      0.247     -0.209
         8      8.150      8.999      0.059     -0.755      0.173     -0.345     -0.736     -0.336     -0.231
         9      6.420      6.500      0.000     -0.070      0.127     -0.027     -0.066     -0.025      0.013
        10      5.730      5.501      0.007      0.212      0.236      0.118      0.200      0.111     -0.087
=============================================================================================================
RESET
F test: F=array([[ 1.42791012]]), p=[[ 0.34730273]], df_denom=5, df_num=4


The results in the table are pretty clear: observation 2 has to be an outlier, since all the residual measures are huge. The RESET test doesn't indicate any problem with the linear specification.


example 4: exog_outlier at 7
----------------------------


Parameter estimates (constant, slope): [ 3.00172727  0.49990909]
=============================================================================================================
       obs      endog     fitted     Cook's   student.   hat diag    dffits   ext.stud.     dffits     dfbeta
                           value          d   residual              internal   residual                 slope
-------------------------------------------------------------------------------------------------------------
         0      6.580      7.001      0.007     -0.359      0.100     -0.120     -0.341     -0.114      0.034
         1      5.760      7.001      0.062     -1.059      0.100     -0.353     -1.067     -0.356      0.107
         2      7.710      7.001      0.020      0.605      0.100      0.202      0.582      0.194     -0.059
         3      8.840      7.001      0.137      1.569      0.100      0.523      1.735      0.578     -0.174
         4      8.470      7.001      0.087      1.253      0.100      0.418      1.300      0.433     -0.131
         5      7.040      7.001      0.000      0.033      0.100      0.011      0.031      0.011     -0.003
         6      5.250      7.001      0.124     -1.494      0.100     -0.498     -1.624     -0.541      0.163
         7     12.500     12.500     -1.#IO     -1.#IO      1.000     -1.#IO     -1.#IO     -1.#IO     -3.070
         8      5.560      7.001      0.084     -1.229      0.100     -0.410     -1.271     -0.423      0.128
         9      7.910      7.001      0.033      0.775      0.100      0.259      0.757      0.252     -0.076
        10      6.890      7.001      0.001     -0.095      0.100     -0.032     -0.089     -0.030      0.009
=============================================================================================================
RESET
F test: F=array([[ 126.84716907]]), p=[[ 0.00000007]], df_denom=9, df_num=4



What's going on with observation 7? The "-1.#IO" is how numpy.nan gets printed here. All the measures that rely on leaving out observation 7 are indeterminate, since without observation 7 we cannot identify the slope.
The diagonal value of the hat matrix and the dfbeta for the slope indicate that this observation is very influential.
 

example 4a: exog_outlier at 7 (v2)
----------------------------------


Parameter estimates (constant, slope): [ 3.00172741  0.49990908]
=============================================================================================================
       obs      endog     fitted     Cook's   student.   hat diag    dffits   ext.stud.     dffits     dfbeta
                           value          d   residual              internal   residual                 slope
-------------------------------------------------------------------------------------------------------------
         0      7.006      7.006      0.000      0.000      0.091      0.000      0.000      0.000     -0.000
         1      6.580      7.001      0.007     -0.377      0.091     -0.119     -0.360     -0.114      0.033
         2      5.760      7.001      0.062     -1.110      0.091     -0.351     -1.125     -0.356      0.103
         3      7.710      7.001      0.020      0.634      0.091      0.201      0.614      0.194     -0.056
         4      8.840      7.001      0.135      1.645      0.091      0.520      1.828      0.578     -0.167
         5      8.470      7.001      0.086      1.314      0.091      0.416      1.371      0.434     -0.125
         6      7.040      7.001      0.000      0.035      0.091      0.011      0.033      0.011     -0.003
         7      5.250      7.001      0.123     -1.567      0.091     -0.495     -1.711     -0.541      0.156
         8     12.500     12.500      0.000     -0.000      1.000     -0.001     -0.000     -0.001     -0.001
         9      5.560      7.001      0.083     -1.289      0.091     -0.408     -1.339     -0.424      0.122
        10      7.910      7.001      0.033      0.813      0.091      0.257      0.798      0.253     -0.073
        11      6.890      7.001      0.001     -0.099      0.091     -0.031     -0.094     -0.030      0.009
=============================================================================================================
RESET
F test: F=array([[ 113.86707171]]), p=[[ 0.00000011]], df_denom=9, df_num=4

Since I didn't like the NaN results in the previous example, I made one change. I added an additional observation that has an x value slightly larger than that of the large group, 8.01 instead of 8.00; its y value was chosen so that it lies on the predicted line. The new observation is the first observation, and the x outlier is now at index 8.

What happened to our influence measures? Both observations 0 and 8 have zero influence, since we only need one of them to identify the slope. Leaving one observation out doesn't help if we have two "similar" (in terms of parameter estimation) outliers.

Since this was interesting, let's try another variation. This time I changed the value of the new observation so that it lies below the regression line (6.999 instead of 7.006). One more table and we are done.


example 4b: exog_outlier at 7 (v3)
----------------------------------


Parameter estimates (constant, slope): [ 3.00063327  0.49996637]
=============================================================================================================
       obs      endog     fitted     Cook's   student.   hat diag    dffits   ext.stud.     dffits     dfbeta
                           value          d   residual              internal   residual                 slope
-------------------------------------------------------------------------------------------------------------
         0      6.999      7.005      0.000     -0.006      0.091     -0.002     -0.005     -0.002      0.001
         1      6.580      7.000      0.007     -0.376      0.091     -0.119     -0.359     -0.114      0.033
         2      5.760      7.000      0.062     -1.110      0.091     -0.351     -1.124     -0.356      0.103
         3      7.710      7.000      0.020      0.635      0.091      0.201      0.615      0.194     -0.056
         4      8.840      7.000      0.136      1.646      0.091      0.520      1.829      0.578     -0.167
         5      8.470      7.000      0.086      1.315      0.091      0.416      1.372      0.434     -0.125
         6      7.040      7.000      0.000      0.035      0.091      0.011      0.034      0.011     -0.003
         7      5.250      7.000      0.123     -1.566      0.091     -0.495     -1.710     -0.541      0.156
         8     12.500     12.500     21.566      0.006      1.000      6.567      0.005      6.231      5.965
         9      5.560      7.000      0.083     -1.289      0.091     -0.407     -1.339     -0.423      0.122
        10      7.910      7.000      0.033      0.814      0.091      0.257      0.799      0.253     -0.073
        11      6.890      7.000      0.001     -0.099      0.091     -0.031     -0.094     -0.030      0.009
=============================================================================================================
RESET
F test: F=array([[ 113.85091319]]), p=[[ 0.00000011]], df_denom=9, df_num=4

Now we see that the outlier measures, like the studentized residuals, are very small for observation 8; however, the influence measures (the diagonal value of the hat matrix, Cook's distance, dffits and dfbeta) are all large and clearly signal that this is a very influential observation.

Extra comment:

The RESET test in the versions of example four has different degrees of freedom than in the first three examples, even though it uses the same code. After looking into this surprising result a bit, it turns out that, because of the structure of the explanatory variable (only two or three distinct x values), the auxiliary regression for the RESET test has reduced rank.
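For reference, the RESET test is essentially an auxiliary regression that adds powers of the fitted values to the original design matrix and F-tests the added terms. A rough sketch of that idea (my own, not the statsmodels implementation) is below; with only two or three distinct x values the added power columns become (nearly) collinear, which is where the rank reduction comes from.

import numpy as np
import scikits.statsmodels.api as sm

def reset_test(res, degree=5):
    # augment the design matrix with powers 2..degree of the fitted values
    powers = np.column_stack([res.fittedvalues**p for p in range(2, degree + 1)])
    exog_aux = np.column_stack([res.model.exog, powers])
    res_aux = sm.OLS(res.model.endog, exog_aux).fit()
    # F-test that the coefficients on all added power terms are zero
    k_orig = res.model.exog.shape[1]
    restriction = np.eye(exog_aux.shape[1])[k_orig:]
    return res_aux.f_test(restriction)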

A bit of history:
I didn't know about the Anscombe example until this thread http://mail.scipy.org/pipermail/scipy-user/2011-February/028462.html
It took me a year to get back to this.


Sunday, January 29, 2012

Influence and Outlier measures in regression

Suppose we run a simple regression and want to know whether there are possible outliers, or observations that have a large influence on the estimated parameters. In the past, I added several specification tests to statsmodels. These are tests that check whether your model assumptions are "correct" or whether there are violations.
(This is sloppy phrasing, they are statistical tests after all.)

What is missing so far in statsmodels are the usual outlier and influence measures. (Although there are robust linear models.)

I found a nice example for influence and outlier measures using SAS in http://www.bios.unc.edu/~truong/b545/reg/simple/wa.html

Trying to do something similar as an extension to statsmodels, I ended up with the table below. It is supposed to correspond to the bottom tables on the page linked above.
All the numbers look the same, but I'm still missing `Cov Ratio` and haven't looked at the prediction intervals yet. I also have dfbetas, but ran out of time to add them nicely to the table. Some of the measures have associated p-values or recommended thresholds that can be used to interpret the results and find which observations might "mess up" the estimation.

Additionally, I also coded the variance inflation factor (VIF), a measure of multicollinearity for the explanatory variables, and the Ramsey RESET test, which is unrelated to the rest here and is a test for the linear specification. I think I have now covered everything on this page: http://en.wikipedia.org/wiki/Category:Regression_diagnostics.
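The VIF itself is just 1 / (1 - R^2) from regressing one explanatory variable on all the others. A minimal sketch, with a function name of my own choosing and not necessarily the implementation I wrote:

import numpy as np
import scikits.statsmodels.api as sm

def vif(exog, col_idx):
    # regress column col_idx on the remaining columns (which should include
    # the constant) and return 1 / (1 - R^2)
    x_j = exog[:, col_idx]
    x_rest = np.delete(exog, col_idx, axis=1)
    r_squared = sm.OLS(x_j, x_rest).fit().rsquared
    return 1.0 / (1.0 - r_squared)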

The code is roughly the following, after the imports and after estimating the model by OLS:

infl = Influence(res_ols)
print infl.summary()


[HTML rendering of the influence summary table: an observation-by-observation table with columns obs, endog, fitted value, Cook's d, studentized residuals, hat diag, dffits internal, ext. stud. residuals and dffits; the numbers are the same as in the text version below.]

The HTML for this table was created with scikits.statsmodels.iolib.table.SimpleTable, but I never tried to fine-tune the HTML options for a nice result. The text table that gets printed in the Python shell is:

>>> print infl.summary_obs()
===================================================================================================
       obs      endog     fitted   Cook's d studentized   hat diag    dffits   ext.stud.     dffits
                           value              residuals              internal  residuals          
---------------------------------------------------------------------------------------------------
         0    64.0000    59.7143     0.0317      0.7513     0.1008     0.2516     0.7338     0.2457
         1    71.0000    67.0000     0.0334      0.7079     0.1176     0.2585     0.6891     0.2516
         2    53.0000    52.4286     0.0025      0.1124     0.2857     0.0711     0.1067     0.0675
         3    67.0000    70.6429     0.0580     -0.6778     0.2017    -0.3407    -0.6583    -0.3309
         4    55.0000    59.7143     0.0383     -0.8265     0.1008    -0.2768    -0.8123    -0.2720
         5    58.0000    56.0714     0.0125      0.3515     0.1681     0.1580     0.3355     0.1508
         6    77.0000    67.0000     0.2088      1.7697     0.1176     0.6462     2.0259     0.7398
         7    57.0000    63.3571     0.0559     -1.1042     0.0840    -0.3345    -1.1179    -0.3386
         8    56.0000    67.0000     0.2526     -1.9467     0.1176    -0.7108    -2.3435    -0.8557
         9    51.0000    52.4286     0.0158     -0.2810     0.2857    -0.1777    -0.2676    -0.1693
        10    76.0000    74.2857     0.0310      0.3498     0.3361     0.2489     0.3339     0.2376
        11    68.0000    63.3571     0.0298      0.8064     0.0840     0.2443     0.7912     0.2397
===================================================================================================



Observations 6 and 8 look like outliers from these measures.
 

This was the fun part, and I managed to get it done in a few (or more) hours this weekend. The rest sounds more like work.

Some comments on the implementation.

The first group of measures can be calculated efficiently from the original regression alone; the second group requires a loop that re-estimates the model dropping each observation one at a time (leave one observation out). The latter can get expensive if the number of observations is large.

Initially I wanted to cache all the Leave One Observation Out result instances, but this might get expensive in terms of memory, since at least some copies of the data will be made and stored together with the result instances. Instead, I finally decided to loop once but store all the attributes that are required for the various measures. (Right now, I actually need only the estimate of the error variance and the parameter estimates.)
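A stripped-down sketch of the two parts (my own simplification, not the actual implementation): the cheap measures come directly from the hat matrix diagonal of the original fit, and the leave-one-out loop keeps only the parameter estimates and the error variance of each reduced fit.

import numpy as np
import scikits.statsmodels.api as sm

def influence_sketch(res):
    endog, exog = res.model.endog, res.model.exog
    nobs = len(endog)
    # cheap measures, from the original regression only
    hat_diag = (exog * np.linalg.pinv(exog).T).sum(1)    # leverage h_ii
    resid_studentized = res.resid / np.sqrt(res.mse_resid * (1 - hat_diag))
    # leave-one-observation-out loop: store only params and error variance
    params_loo = np.zeros((nobs, exog.shape[1]))
    mse_loo = np.zeros(nobs)
    for i in range(nobs):
        keep = np.arange(nobs) != i
        res_i = sm.OLS(endog[keep], exog[keep]).fit()
        params_loo[i] = res_i.params
        mse_loo[i] = res_i.mse_resid
    dfbeta = res.params - params_loo    # change in the parameter estimates
    return hat_diag, resid_studentized, dfbeta, mse_loo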

If the number of observations is large, then calculating the Leave One Observation Out results for every observation will be wasteful, because many observations are not likely candidates for the "full treatment". The idea, not yet implemented, is to calculate the cheap measures first and use them to select the observations that are likely candidates for the more precise LOOO loop.

Also still missing: plots.
I did some cleanup for the regression plots, but the outlier plots still look plain vanilla.


Completely unrelated:
We went this weekend to the Fete des neiges and looked at the snow village (giant igloos).

Quote: "Parc Jean-Drapeau is proud to welcome a North-American premiere – a Snow Village which has a hotel, restaurant, replicas of familiar MontrĂ©al buildings and lighted snow tunnels to explore all winter long. This Finnish concept attraction allows visitors and clients to enjoy a unique MontrĂ©al-winter experience."


I'm definitely NOT going to book a room there. But if anyone is interested, here's the link.

Saturday, January 28, 2012

Distributions with matplotlib in 3d

I finally managed to figure out the settings for matplotlib's surface plot that make a bivariate distribution look more like those in published articles.

The first version uses

ax2 = fig2.add_subplot(111, projection='3d')
surf = ax2.plot_surface(X, Y, Z, rstride=1, cstride=1, 
        cmap = cm.gray_r, alpha=0.9, linewidth=1)




The second version uses

surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, color='0.8',
                       alpha=0.85, linewidth=1)
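For completeness, here is a self-contained sketch with the same kind of settings, using a standard bivariate normal density for Z; the data setup is my own example, not the figure from this post.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D   # registers the '3d' projection

x = np.linspace(-3, 3, 60)
X, Y = np.meshgrid(x, x)
Z = np.exp(-(X**2 + Y**2) / 2.) / (2 * np.pi)   # standard bivariate normal density

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1,
                       cmap=cm.gray_r, alpha=0.9, linewidth=1)
plt.show()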

Previously I was trying out mainly the contour plots, since I didn't get 3d to look nice enough.

For example, the following shows a mixture of two bivariate normal distributions and the estimate by gaussian_kde. The colored areas are the differences between the kde and the true distribution: in the blue areas the kde is too large, in the reddish areas the kde is too small. It's a contour plot version of showing that gaussian_kde with the default bandwidth settings lowers the hills and fills the valleys in the case of bimodal distributions.
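A rough sketch of how such a comparison could be put together; the mixture weights, means and standard deviation below are made up for illustration and are not the ones used in the figure.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def binormal_pdf(X, Y, mx, my, s):
    # spherical bivariate normal density with mean (mx, my) and std s
    return np.exp(-((X - mx)**2 + (Y - my)**2) / (2 * s**2)) / (2 * np.pi * s**2)

np.random.seed(0)
nobs = 500
# sample from an equal-weight mixture of two bivariate normals
means = np.where(np.random.rand(nobs) < 0.5, -1.5, 1.5)   # component centers
data = means + 0.7 * np.random.randn(2, nobs)             # shape (2, nobs)
kde = stats.gaussian_kde(data)

x = np.linspace(-4, 4, 80)
X, Y = np.meshgrid(x, x)
kde_vals = kde(np.vstack([X.ravel(), Y.ravel()])).reshape(X.shape)
true_vals = (0.5 * binormal_pdf(X, Y, -1.5, -1.5, 0.7)
             + 0.5 * binormal_pdf(X, Y, 1.5, 1.5, 0.7))

# blue where the kde is too large, red where it is too small (RdBu colormap)
plt.contourf(X, Y, kde_vals - true_vals, 20, cmap=plt.cm.RdBu)
plt.colorbar()
plt.show()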


Thursday, January 19, 2012

Monotone Hermite Polynomials

Just a quick note: following the instructions on Wikipedia

http://en.wikipedia.org/wiki/Cubic_Hermite_spline#Cardinal_spline

http://en.wikipedia.org/wiki/Monotone_cubic_interpolation

I finally implemented monotone interpolation, in pure numpy and fully vectorized. It works and looks good, but I haven't verified yet whether it's correct.
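A rough, loop-based sketch of the Fritsch-Carlson procedure from those pages; this is my own simplified version with a made-up function name, not the vectorized implementation mentioned above.

import numpy as np

def monotone_cubic_interp(x, y, x_new):
    # evaluate a monotone piecewise cubic Hermite interpolant at x_new
    x, y = np.asarray(x, float), np.asarray(y, float)
    h = np.diff(x)                       # interval lengths
    d = np.diff(y) / h                   # secant slopes
    # initial tangents: average of neighboring secants, one-sided at the ends
    m = np.empty_like(x)
    m[1:-1] = (d[:-1] + d[1:]) / 2.0
    m[0], m[-1] = d[0], d[-1]
    # Fritsch-Carlson limiter to enforce monotonicity
    for k in range(len(d)):
        if d[k] == 0.0:
            m[k] = m[k + 1] = 0.0
        else:
            a, b = m[k] / d[k], m[k + 1] / d[k]
            s = a**2 + b**2
            if s > 9.0:
                t = 3.0 / np.sqrt(s)
                m[k], m[k + 1] = t * a * d[k], t * b * d[k]
    # locate the interval of each evaluation point and apply the Hermite basis
    idx = np.clip(np.searchsorted(x, x_new) - 1, 0, len(h) - 1)
    t = (x_new - x[idx]) / h[idx]
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return (h00 * y[idx] + h10 * h[idx] * m[idx]
            + h01 * y[idx + 1] + h11 * h[idx] * m[idx + 1])

x_obs = np.array([0., 1., 2., 3., 4.])
y_obs = np.array([0., 1., 1., 2., 5.])
print monotone_cubic_interp(x_obs, y_obs, np.linspace(0., 4., 9))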



Another example: Interpolating a noisy sine curve with cardinal splines with c=0.1