tag:blogger.com,1999:blog-19098639615160037192017-11-28T19:11:14.353-08:00joepyJosef Perktoldnoreply@blogger.comBlogger38125tag:blogger.com,1999:blog-1909863961516003719.post-45736839972327638152013-08-06T14:07:00.001-07:002013-08-06T14:23:07.631-07:00Statsmodels Release 0.5.0rc1
<div class="document" id="statsmodels-release-0-5-0rc1">
<p>After approximately a year since our last release, we are finally ready again for a new release of statsmodels. Skipper pushed the distribution files to <a class="reference external" href="https://pypi.python.org/pypi/statsmodels">pypi</a> last week.</p>
<p>Actually the main new feature, the use of formulas in the style of R, has already been inofficially released one year ago in the distribution files for last years scipy conference. It was downloaded around 18000 times on github (before githubs download disappeared). Formulas and pandas integration have been a huge success, and after one year it looks pretty solid.</p>
<p>During last year we merged many additional new features, and continued to improve our traditional models,</p>
<p>I am planning to introduce some of the new features in future blog posts here. As always, I am very excited to get a new release out and a bit sad about all the things that we didn't have time to change or add.</p>
<p>For now, I just copied, and lightly edited, some of the release information from the <a class="reference external" href="http://statsmodels.sourceforge.net/devel/release/version0.5.html">documentation</a> which follows for the rest of this post. If you want the version with the inline links to the documentation of the different features, then look at the page in the documentation instead of reading here.</p>
<p>Statsmodels 0.5 is a large and very exciting release that brings together a year of work done by 36 authors, including almost 2000 commits. It contains many new features and a large amount of bug fixes detailed below.</p>
<p>The following major new features appear in this version.</p>
<div class="section" id="support-for-model-formulas-via-patsy">
<h4>Support for Model Formulas via Patsy</h4>
<p>Statsmodels now supports fitting models with a formula. This functionality is provided by <a class="reference external" href="http://patsy.readthedocs.org/en/latest/">patsy</a>. Patsy is now a dependency for statsmodels. Models can be individually imported from the <tt class="docutils literal">statsmodels.formula.api</tt> namespace or you can import them all as:</p>
<pre class="literal-block">
import statsmodels.formula.api as smf
</pre>
<p>Alternatively, each model in the usual <tt class="docutils literal">statsmodels.api</tt> namespace has a <tt class="docutils literal">from_formula</tt> classmethod that will create a model using a formula. Formulas are also available for specifying linear hypothesis tests using the <tt class="docutils literal">t_test</tt> and <tt class="docutils literal">f_test</tt> methods after model fitting. A typical workflow can now look something like this.</p>
<pre class="code python literal-block">
<span class="keyword namespace">import</span> <span class="name namespace">numpy</span> <span class="keyword namespace">as</span> <span class="name namespace">np</span>
<span class="keyword namespace">import</span> <span class="name namespace">pandas</span> <span class="keyword namespace">as</span> <span class="name namespace">pd</span>
<span class="keyword namespace">import</span> <span class="name namespace">statsmodels.formula.api</span> <span class="keyword namespace">as</span> <span class="name namespace">smf</span>
<span class="name">url</span> <span class="operator">=</span> <span class="literal string">'http://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv'</span>
<span class="name">data</span> <span class="operator">=</span> <span class="name">pd</span><span class="operator">.</span><span class="name">read_csv</span><span class="punctuation">(</span><span class="name">url</span><span class="punctuation">)</span>
<span class="comment"># Fit regression model (using the natural log of one of the regressors)</span>
<span class="name">results</span> <span class="operator">=</span> <span class="name">smf</span><span class="operator">.</span><span class="name">ols</span><span class="punctuation">(</span><span class="literal string">'Lottery ~ Literacy + np.log(Pop1831)'</span><span class="punctuation">,</span> <span class="name">data</span><span class="operator">=</span><span class="name">data</span><span class="punctuation">)</span><span class="operator">.</span><span class="name">fit</span><span class="punctuation">()</span>
</pre>
</div>
<div class="section" id="empirical-likelihood-google-summer-of-code-2012-project">
<h4>Empirical Likelihood (Google Summer of Code 2012 project)</h4>
<p>Empirical Likelihood-Based Inference for moments of univariate and multivariate variables is available as well as EL-based ANOVA tests. EL-based linear regression, including the regression through the origin model. In addition, the accelerated failure time model for inference on a linear regression model with a randomly right censored endogenous variable is available.</p>
</div>
<div class="section" id="analysis-of-variance-anova-modeling">
<h4>Analysis of Variance (ANOVA) Modeling</h4>
<p>Support for ANOVA is now available including type I, II, and III sums of squares.</p>
</div>
<div class="section" id="multivariate-kernel-density-estimators-gsoc-2012-project">
<h4>Multivariate Kernel Density Estimators (GSoC 2012 project)</h4>
<p>Kernel density estimation has been extended to handle multivariate estimation as well via product kernels. It is available as <cite>sm.nonparametric.KDEMultivariate</cite>. It supports least squares and maximum likelihood cross-validation for bandwidth estimation, as well as mixed continuous, ordered, and unordered categorical data. Conditional density estimation is also available via <cite>sm.nonparametric.KDEMUltivariateConditional</cite>.</p>
</div>
<div class="section" id="nonparameteric-regression-gsoc-2012-project">
<h4>Nonparameteric Regression (GSoC 2012 project)</h4>
<p>Kernel regression models are now available via <cite>sm.nonparametric.KernelReg</cite>. It is based on the product kernel mentioned above, so it also has the same set of features including support for cross-validation as well as support for estimation mixed continuous and categorical variables. Censored kernel regression is also provided by <cite>kernel_regression.KernelCensoredReg</cite>.</p>
</div>
<div class="section" id="quantile-regression-model">
<h4>Quantile Regression Model</h4>
<p>Quantile regression is supported via the <cite>sm.QuantReg</cite> class. Kernel and bandwidth selection options are available for estimating the asymptotic covariance matrix using a kernel density estimator.</p>
</div>
<div class="section" id="negative-binomial-regression-model">
<h4>Negative Binomial Regression Model</h4>
<p>It is now possible to fit negative binomial models for count data via maximum-likelihood using the <cite>sm.NegativeBinomial</cite> class. <tt class="docutils literal">NB1</tt>, <tt class="docutils literal">NB2</tt>, and <tt class="docutils literal">geometric</tt> variance specifications are available.</p>
</div>
<div class="section" id="l1-penalized-discrete-choice-models">
<h4>L1-penalized Discrete Choice Models</h4>
<p>A new optimization method has been added to the discrete models, which includes Logit, Probit, MNLogit and Poisson, that makes it possible to estimate the models with an l1, linear, penalization. This shrinks parameters towards zero and can set parameters that are not very different from zero to zero. This is especially useful if there are a large number of explanatory variables and a large associated number of parameters. <a class="reference external" href="http://cvxopt.org/">CVXOPT</a> is now an optional dependency that can be used for fitting these models.</p>
</div>
<div class="section" id="new-and-improved-graphics">
<h4>New and Improved Graphics</h4>
<ul class="simple"><li><strong>ProbPlot</strong>: A new <cite>ProbPlot</cite> object has been added to provide a simple interface to create P-P, Q-Q, and probability plots with options to fit a distribution and show various reference lines. In the case of Q-Q and P-P plots, two different samples can be compared with the <cite>other</cite> keyword argument. <cite>sm.graphics.ProbPlot</cite></li>
</ul><pre class="code python literal-block">
<span class="keyword namespace">import</span> <span class="name namespace">numpy</span> <span class="keyword namespace">as</span> <span class="name namespace">np</span>
<span class="keyword namespace">import</span> <span class="name namespace">statsmodels.api</span> <span class="keyword namespace">as</span> <span class="name namespace">sm</span>
<span class="name">x</span> <span class="operator">=</span> <span class="name">np</span><span class="operator">.</span><span class="name">random</span><span class="operator">.</span><span class="name">normal</span><span class="punctuation">(</span><span class="name">loc</span><span class="operator">=</span><span class="literal number float">1.12</span><span class="punctuation">,</span> <span class="name">scale</span><span class="operator">=</span><span class="literal number float">0.25</span><span class="punctuation">,</span> <span class="name">size</span><span class="operator">=</span><span class="literal number integer">37</span><span class="punctuation">)</span>
<span class="name">y</span> <span class="operator">=</span> <span class="name">np</span><span class="operator">.</span><span class="name">random</span><span class="operator">.</span><span class="name">normal</span><span class="punctuation">(</span><span class="name">loc</span><span class="operator">=</span><span class="literal number float">0.75</span><span class="punctuation">,</span> <span class="name">scale</span><span class="operator">=</span><span class="literal number float">0.45</span><span class="punctuation">,</span> <span class="name">size</span><span class="operator">=</span><span class="literal number integer">37</span><span class="punctuation">)</span>
<span class="name">ppx</span> <span class="operator">=</span> <span class="name">sm</span><span class="operator">.</span><span class="name">ProbPlot</span><span class="punctuation">(</span><span class="name">x</span><span class="punctuation">)</span>
<span class="name">ppy</span> <span class="operator">=</span> <span class="name">sm</span><span class="operator">.</span><span class="name">ProbPlot</span><span class="punctuation">(</span><span class="name">y</span><span class="punctuation">)</span>
<span class="name">fig1</span> <span class="operator">=</span> <span class="name">ppx</span><span class="operator">.</span><span class="name">qqplot</span><span class="punctuation">()</span>
<span class="name">fig2</span> <span class="operator">=</span> <span class="name">ppx</span><span class="operator">.</span><span class="name">qqplot</span><span class="punctuation">(</span><span class="name">other</span><span class="operator">=</span><span class="name">ppy</span><span class="punctuation">)</span>
</pre>
<ul class="simple"><li><strong>Mosaic Plot</strong>: Create a mosaic plot from a contingency table. This allows you to visualize multivariate categorical data in a rigorous and informative way. Available with <cite>sm.graphics.mosaic</cite>.</li>
<li><strong>Interaction Plot</strong>: Interaction plots now handle categorical factors as well as other improviments. <cite>sm.graphics.interaction_plot</cite>.</li>
<li><strong>Regression Plots</strong>: The regression plots have been refactored and improved. They can now handle pandas objects and regression results instances appropriately..</li>
</ul></div>
<div class="section" id="power-and-sample-size-calculations">
<h4>Power and Sample Size Calculations</h4>
<p>The power module (<tt class="docutils literal">statsmodel.stats.power</tt>) currently implements power and sample size calculations for the t-tests (<cite>sm.stats.TTestPower</cite>, <cite>sm.stats.TTestIndPower</cite>), normal based test (<cite>sm.stats.NormIndPower</cite>), F-tests (<cite>sm.stats.FTestPower</cite>, <cite>sm.stats.FTestAnovaPower</cite>) and Chisquare goodness of fit (<cite>sm.stats.GofChisquarePower</cite>) test. The implementation is class based, but the module also provides three shortcut functions, <cite>sm.stats.tt_solve_power</cite>, <cite>sm.stats.tt_ind_solve_power</cite> and <cite>sm.stats.zt_ind_solve_power</cite> to solve for any one of the parameters of the power equations. See this <a class="reference external" href="http://jpktd.blogspot.fr/2013/03/statistical-power-in-statsmodels.html">blog post</a> for a more in-depth description of the additions.</p>
<div class="section" id="other-important-new-features">
<h5>Other important new features</h5>
<ul><li><p class="first"><strong>IPython notebook examples</strong>: Many of our examples have been converted or added as IPython notebooks now. They are available <a class="reference external" href="http://statsmodels.sourceforge.net/devel/examples/index.html#notebook-examples">here</a>.</p>
</li>
<li><p class="first"><strong>Improved marginal effects for discrete choice models</strong>: Expanded options for obtaining marginal effects after the estimation of nonlinear discrete choice models are available.</p>
</li>
<li><p class="first"><strong>OLS influence outlier measures</strong>: After the estimation of a model with OLS, the common set of influence and outlier measures and a outlier test are now available attached as methods <tt class="docutils literal">get_influnce</tt> and <tt class="docutils literal">outlier_test</tt> to the Results instance.</p>
</li>
<li><p class="first"><strong>New datasets</strong>: New <cite>datasets</cite> are available for examples.</p>
</li>
<li><p class="first"><strong>Access to R datasets</strong>: We now have access to many of the same datasets available to R users through the <a class="reference external" href="http://vincentarelbundock.github.io/Rdatasets/">Rdatasets project</a>. You can access these using the <cite>sm.datasets.get_rdataset</cite> function. This function also includes caching of these datasets.</p>
</li>
<li><p class="first"><strong>Improved numerical differentiation tools</strong>: Numerical differentiation routines have been greatly improved and expanded to cover all the routines discussed in:</p>
<pre class="literal-block">
Ridout, M.S. (2009) Statistical applications of the complex-step method
of numerical differentiation. The American Statistician, 63, 66-74
</pre>
<p>See the <cite>sm.tools.numdiff</cite> module.</p>
</li>
<li><p class="first"><strong>Consistent constant handling across models</strong>: Result statistics no longer rely on the assumption that a constant is present in the model.</p>
</li>
<li><p class="first"><strong>Missing value handling across models</strong>: Users can now control what models do in the presence of missing values via the <tt class="docutils literal">missing</tt> keyword available in the instantiation of every model. The options are <tt class="docutils literal">'none'</tt>, <tt class="docutils literal">'drop'</tt>, and <tt class="docutils literal">'raise'</tt>. The default is <tt class="docutils literal">'none'</tt>, which does no missing value checks. To drop missing values use <tt class="docutils literal">'drop'</tt>. And <tt class="docutils literal">'raise'</tt> will raise an error in the presence of any missing data.</p>
</li>
<li><p class="first"><strong>Ability to write Stata datasets</strong>: Added the ability to write Stata <tt class="docutils literal">.dta</tt> files. See <cite>sm.iolib.StataWriter</cite>.</p>
</li>
<li><p class="first"><strong>ARIMA modeling</strong>: Statsmodels now has support for fitting Autoregressive Integrated Moving Average (ARIMA) models. See <cite>ARIMA</cite> and <cite>ARIMAResults</cite> for more information.</p>
</li>
<li><p class="first"><strong>Support for dynamic prediction in AR(I)MA models</strong>: It is now possible to obtain dynamic in-sample forecast values in <cite>ARMA</cite> and <cite>ARIMA</cite> models.</p>
</li>
<li><p class="first"><strong>Improved Pandas integration</strong>: Statsmodels now supports all frequencies available in pandas for time-series modeling. These are used for intelligent dates handling for prediction. These features are available, if you pass a pandas Series or DataFrame with a DatetimeIndex to a time-series model.</p>
</li>
<li><p class="first"><strong>New statistical hypothesis tests</strong>: Added statistics for calculating interrater agreement including Cohen's kappa and Fleiss' kappa, statistics and hypothesis tests for proportions, Tukey HSD (with plot) was added as an enhancement to the multiple comparison tests (<cite>sm.stats.multicomp.MultiComparison</cite>, <cite>sm.stats.multicomp.pairwise_tukeyhsd</cite>). Weighted statistics and t tests were enhanced with new options. Tests of equivalence for one sample and two independent or paired samples were added based on t tests and z tests (See <cite>tost</cite>).</p>
</li>
</ul></div>
<div class="section" id="major-bugs-fixed">
<h5>Major Bugs fixed</h5>
<ul class="simple"><li>Post-estimation statistics for weighted least squares that depended on the centered total sum of squares were not correct. These are now correct and tested.</li>
<li>Regression through the origin models now correctly use uncentered total sum of squares in post-estimation statistics. This affected the <span class="math">\(R^2\)</span>
value in linear models without a constant.</li>
</ul></div>
<div class="section" id="backwards-incompatible-changes-and-deprecations">
<h5>Backwards incompatible changes and deprecations</h5>
<ul class="simple"><li>Cython code is now non-optional. You will need a C compiler to build from source. If building from github and not a source release, you will also need Cython installed. See the <cite>installation documentation</cite>.</li>
<li>The <tt class="docutils literal">q_matrix</tt> keyword to <cite>t_test</cite> and <cite>f_test</cite> for linear models is deprecated. You can now specify linear hypotheses using formulas.</li>
<li>The <tt class="docutils literal">conf_int</tt> keyword to <cite>sm.tsa.acf</cite> is deprecated.</li>
<li>The <tt class="docutils literal">names</tt> argument is deprecated in <cite>sm.tsa.VAR</cite> and <cite>sm.tsa.SVAR</cite>. This is now automatically detected and handled.</li>
<li>The <tt class="docutils literal">order</tt> keyword to <cite>sm.tsa.ARMA.fit</cite> is deprecated. It is now passed in during model instantiation.</li>
<li>The empirical distribution function (<cite>sm.distributions.ECDF</cite>) and supporting functions have been moved to <tt class="docutils literal">statsmodels.distributions</tt>. Their old paths have been deprecated.</li>
<li>The <tt class="docutils literal">margeff</tt> method of the discrete choice models has been deprecated. Use <tt class="docutils literal">get_margeff</tt> instead. See above. Also, the vague <tt class="docutils literal">resid</tt> attribute of the discrete choice models has been deprecated in favor of the more descriptive <tt class="docutils literal">resid_dev</tt> to indicate that they are deviance residuals.</li>
<li>The class <tt class="docutils literal">KDE</tt> has been deprecated and renamed to <cite>KDEUnivariate</cite> to distinguish it from the new <tt class="docutils literal">KDEMultivariate</tt>. See above.</li>
</ul></div>
</div>
</div>
Josef Perktoldnoreply@blogger.com1tag:blogger.com,1999:blog-1909863961516003719.post-55322241696577899022013-06-18T19:02:00.001-07:002013-06-18T19:02:18.073-07:00Quasi-Random Numbers with Halton SequencesJust two quick plots.<br>
<br>
For maximum simulated likelihood estimation and for some other cases, we need to integrate the likelihood function with respect to a distribution that reflects unobserved heterogeneity. When numeric integration is too difficult, then we can integrate by simulating the underlying distribution.<br>
<br>
However, using random draws from the underlying distribution can be inefficient for integration, and there are several ways of speeding up the integration or of increasing the accuracy for the same amount of time.<br>
<br>
One possibility is to use sequences that mimic random draws from the underlying distribution but have a better coverage of the underlying space, examples for <a href="http://en.wikipedia.org/wiki/Low-discrepancy_sequence">low-discrepancy_sequences</a> are <a href="http://en.wikipedia.org/wiki/Sobol_sequence">Sobol</a> and <a href="http://en.wikipedia.org/wiki/Halton_sequence">Halton</a> sequences.<br>
<a href="http://jpktd.blogspot.com/2013/06/quasi-random-numbers-with-halton.html#more">Read more »</a>Josef Perktoldnoreply@blogger.com0tag:blogger.com,1999:blog-1909863961516003719.post-75727853911250618712013-05-01T21:21:00.001-07:002013-05-01T21:21:12.509-07:00Power Plots in statsmodels
<div class="document" id="power-plots-in-statsmodels">
<p>I just want to show another two plots for the statistical power of a test, since I didn't have time for this <a class="reference external" href="http://jpktd.blogspot.ca/2013/03/statistical-power-in-statsmodels.html">earlier</a></p>
<p>The code to produce them is just calling the methods of the power classes, for example for the one sample t-test.</p>
</div><a href="http://jpktd.blogspot.com/2013/05/power-plots-in-statsmodels.html#more">Read more »</a>Josef Perktoldnoreply@blogger.com0tag:blogger.com,1999:blog-1909863961516003719.post-22818648366785441632013-04-24T12:32:00.000-07:002013-04-24T12:32:39.455-07:00Help: I caught a bug
<div class="document" id="help-i-caught-a-bug">
<p>I think I must be turning too much into a statistician and econometrician lately, I must have caught a virus or something.
Maybe it started already a while <a class="reference external" href="http://jpktd.blogspot.ca/2012/03/data-in-python.html">ago</a></p>
<p>The theme of the <a class="reference external" href="http://conference.scipy.org/scipy2013/">scipy conference this year</a> is "Machine Learning & Tools for Reproducible Science". However, I'm not doing any sexy twitter analysis, I just spent some days coding tests for proportion, boring stuff like pairwise comparisons of proportions.</p>
<p>Anyway, I decided to submit a tutorial proposal for econometrics with statsmodels to the scipy conference, see (lightly edited) proposal below. Since my proposal didn't get accepted, my first response was: <em>Wrong topic</em>, <em>Too much statistic</em>, <em>We just want numbers, not check whether the model is correct, and find out how to fix it</em>.</p>
<p>That leaves me with more time to go back to figuring out which other basic statistical tests are still missing in Python.</p>
</div><a href="http://jpktd.blogspot.com/2013/04/help-i-caught-bug.html#more">Read more »</a>Josef Perktoldnoreply@blogger.com3tag:blogger.com,1999:blog-1909863961516003719.post-8853893026968554902013-04-24T11:23:00.001-07:002013-04-24T11:23:18.191-07:00Statistics in Python: Reproducing Research
<div class="document" id="statistics-in-python-reproducing-research">
<p>This is just a short comment on a blog post.</p>
<p>Ferando Perez wrote a nice article about <a class="reference external" href="http://blog.fperez.org/2013/04/literate-computing-and-computational.html">"Literate computing" and computational reproducibility: IPython in the age of data-driven journalism</a></p>
<p>In the second part, he explains that Vincent Arel-Bundock came up with an ipython notebook within three hours to replicate some criticism of an economics journal article. Vincent's notebook can be seen <a class="reference external" href="http://nbviewer.ipython.org/urls/raw.github.com/vincentarelbundock/Reinhart-Rogoff/master/reinhart-rogoff.ipynb">here</a>.</p>
<p>What I found most striking was not the presentation as a notebook, although that makes it easy to read, instead it was: <tt class="docutils literal">pandas</tt>, <tt class="docutils literal">patsy</tt> and <tt class="docutils literal">statsmodels</tt>, and no R in sight. We have come a long way with <em>Statistics in Python</em> since I started to get involved in it five years ago.</p>
<p>Vincent has made many improvements and contributions to statsmodels in the last year.</p>
<p><strong>Aside</strong></p>
<p>I'm not following much of the economics debates these days, so I only know what I read in the two references that Fernando gave.</p>
<p>My impression is that it's just the usual (mis)use of economics research results. Politicians like the numbers that give them ammunition for their position. As economist, you are either very careful about how to present the results, or you join the political game (I worked for several years in an agricultural department of a small country). (An example for the use of economics results in another area, <a class="reference external" href="http://www.wired.com/techbiz/it/magazine/17-03/wp_quant?currentPage=all">blaming the financial crisis on the work on copulas</a>.)</p>
<p>"Believable" research: If your results sound too good or too interesting to be true, maybe they are not, and you better check your calculations. Although mistakes are not uncommon, the <cite>business as usual</cite> part is that the results are often very sensitive to assumptions, and it takes time to figure out what results are robust. I have seen enough economic debates where there never was a clear answer that convinced more than half of all economists. A long time ago, when the Asian Tigers where still tigers, one question was: Did they grow <cite>because of</cite> or <cite>in spite of</cite> government intervention?</p>
</div>
Josef Perktoldnoreply@blogger.com1tag:blogger.com,1999:blog-1909863961516003719.post-38292678475487218102013-04-19T10:01:00.001-07:002013-04-19T10:01:59.604-07:00Binomial Proportions, Equivalence and Power - part 0
<div class="document" id="binomial-proportions-equivalence-and-power-part-0">
<p>Just a pre-announcement because I have a nice graph.</p>
<p>I am looking into tests for binomial proportions, especially equivalence (TOST) and non-inferiority tests.</p>
<p>SAS provides a good overview over the available <a class="reference external" href="http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_freq_a0000000660.htm">methods</a>
and <a class="reference external" href="http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_power_a0000000980.htm">power</a> for it</p>
<p>Power and significance levels in testing for proportions have a saw tooth pattern because the observed proportions are discrete, see for example this <a class="reference external" href="http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_power_sect017.htm">SAS page</a></p>
<p>Unfortunately for my unit testing, I have not found any equivalence tests for proportions in R. Currently, I'm just trying to match some examples that I found on the internet.</p>
<p>And here is the plot for my power function. It shows the power as a function of the sample size, for either the normal approximation or the binomial distribution, of the test for equivalence, TOST two one-sided tests. The TOST test itself is based on the normal approximation.</p>
<a href="http://1.bp.blogspot.com/-vYZYP5EutJE/UXFy-b4_WRI/AAAAAAAAAzE/y7tEx1SHxF8/s1600/proportion_power.png" imageanchor="1" ><img border="0" src="http://1.bp.blogspot.com/-vYZYP5EutJE/UXFy-b4_WRI/AAAAAAAAAzE/y7tEx1SHxF8/s640/proportion_power.png" /></a>
</div>
Josef Perktoldnoreply@blogger.com0tag:blogger.com,1999:blog-1909863961516003719.post-13495970995671402152013-04-17T11:26:00.001-07:002013-04-17T11:26:38.620-07:00Multiple Testing P-Value Corrections in Statsmodels
<div class="document" id="multiple-testing-p-value-corrections-in-statsmodels">
<p>series : "Statistics in Python"</p>
<p>This is a follow-up to my previous posts, <a class="reference external" href="http://jpktd.blogspot.ca/2013/04/debugging-multiple-testing-p-value.html">here</a> and <a class="reference external" href="http://jpktd.blogspot.ca/2012/03/10-lines-of-code-and-it-took-you-3.html">this post</a>, which are on software development, and <a class="reference external" href="http://jpktd.blogspot.ca/2013/03/multiple-comparison-and-tukey-hsd-or_25.html">multiple comparisons</a> which looks at a specific case of pairwise mean comparisons.</p>
<p>In the following, I provide a brief introduction to the statsmodels functions for p-value asjustements to correct for multiple testing problems and then illustrate some properties using several Monte Carlo experiments.</p>
</div><a href="http://jpktd.blogspot.com/2013/04/multiple-testing-p-value-corrections-in.html#more">Read more »</a>Josef Perktoldnoreply@blogger.com1tag:blogger.com,1999:blog-1909863961516003719.post-41233014195213174202013-04-15T08:23:00.001-07:002013-04-17T11:27:15.485-07:00Debugging: Multiple Testing P-Value Corrections in Statsmodels
<div class="document" id="debugging-multiple-testing-p-value-corrections-in-statsmodels">
<p>subtitle: "The earth is round after all"</p>
<p>series : "Adventures with Statistics in Python"</p>
<p><em>If you run an experiment and it shows that the earth is not round, then you better check your experiment, your instrument, and don't forget to look up the definition of "round"</em></p>
<p>Statsmodels has 11 methods for correcting p-values to take account of multiple testing (or it will have after I merge my latest changes).</p>
<p>The following mainly describes how it took me some time to figure out what the interpretation of the results of a Monte Carlo run is. I wrote the Monte Carlo to verify that the multiple testing p-value corrections make sense. I will provide some additional explanations of the multiple testing function in statsmodels in a follow-up post.</p>
<p>Let's start with the earth that doesn't look round.</p>
<dl class="docutils"><dt>experiment:</dt>
<dd>Monte Carlo with 5000 or 10000 replications, to see how well the p-value corrections are doing.
We have 30 p-values from hypothesis tests, for 10 of those the null hypothesis is false.</dd>
<dt>instrument:</dt>
<dd><tt class="docutils literal">statsmodels.stats.multipletests</tt> to make the p-value correction</dd>
</dl><p>The first results</p>
<pre class="literal-block">
==========================================================================================
b s sh hs h hommel fdr_i fdr_n fdr_tsbky fdr_tsbh fdr_gbs
------------------------------------------------------------------------------------------
reject 9.6118 9.619 9.7178 9.7274 9.7178 9.72 10.3128 9.8724 10.5152 10.5474 10.5328
r_>k 0.0236 0.0246 0.0374 0.0384 0.0374 0.0376 0.2908 0.0736 0.3962 0.4118 0.4022
------------------------------------------------------------------------------------------
</pre>
<p>The headers are shortcuts for the p-value correction method. In the first line, <tt class="docutils literal">reject</tt>, are the average number of rejections across Monte Carlo iterations. The second line, <tt class="docutils literal">r_>k</tt>, are the fraction of cases where we reject more than 10 hypothesis. The average number of rejections is large because the alternative in the simulation is far away from the null hypothesis, and the corresponding p-values are small. So all methods are able to reject most of the false hypotheses.</p>
<p>The last three methods estimate, as part of the algorithm, the number of null hypotheses that are correct. All three of those methods reject a true null hypothesis in roughly 40% of all cases. All methods are supposed to limit the false discovery rate (FDR) to alpha which is 5% in this simulations. I expected the fraction in the last line to be below 0.05. So what's wrong?</p>
<p>It looks obvious, after the fact, but it had me puzzled for 3 days.</p>
<p><strong>Changing the experiment:</strong> The above data are based on p-values that are the outcome of 30 independent t-tests, which is already my second version for generating random p-values. For my third version, I changed to a data generating process similar to Benjamini, Krieger and Yekutieli 2006, which is the article on which <tt class="docutils literal">fdr_tsbky</tt> is based. None of the changes makes a qualitative difference to the results.</p>
<p><strong>Checking the instrument:</strong> All p-values corrections except <tt class="docutils literal">fdr_tsbky</tt> and <tt class="docutils literal">fdr_gbs</tt> are tested against R. For the case at hand, the p-values for <tt class="docutils literal">fdr_tsbh</tt> are tested against R's multtest package. However, the first step is a discrete estimate (number of true null hypothesis) and since it is discrete, the tests will not find differences that show up only in borderline cases. I checked a few more cases which also verify against R. Also, most methods have a double implementation, separately for the p-value correction and for the rejection boolean. Since they all give identical or similar answers, I start to doubt that there is a problem with the instrument.</p>
<p><strong>Is the earth really round?</strong> I try to read through the proof that these adaptive methods limit the FDR to alpha, to see if I missed some assumptions, but give up quickly. These are famous authors, and papers that have long been accepted and been widely used. I also don't find any assumption besides independence of the p-values, which I have in my Monte Carlo. However, looking a bit more closely at the proofs shows that I don't really understand FDR. When I implemented these functions, I focused on the algorithms and only skimmed the interpretation.</p>
<p><strong>What is the False Discovery Rate?</strong> Got it. I should not rely on vague memories of definitions that I read two years ago. What I was looking at, is not the FDR.</p>
<p>One of my new results (with a different data generating process in the Monte Carlo, but still 10 out of 30 hypotheses are false)</p>
<pre class="literal-block">
==============================================================================================
b s sh hs h hommel fdr_i fdr_n fdr_tsbky fdr_tsbh fdr_gbs
----------------------------------------------------------------------------------------------
reject 5.2924 5.3264 5.5316 5.5576 5.5272 5.5818 8.1904 5.8318 8.5982 8.692 8.633
rejecta 5.2596 5.2926 5.492 5.5176 5.488 5.5408 7.876 5.7744 8.162 8.23 8.1804
reject0 0.0328 0.0338 0.0396 0.04 0.0392 0.041 0.3144 0.0574 0.4362 0.462 0.4526
r_>k 0.0002 0.0002 0.0006 0.0006 0.0006 0.0006 0.0636 0.0016 0.1224 0.1344 0.1308
fdr 0.0057 0.0058 0.0065 0.0065 0.0064 0.0067 0.0336 0.0081 0.0438 0.046 0.0451
----------------------------------------------------------------------------------------------
</pre>
<p><tt class="docutils literal">reject</tt> : average number of rejections</p>
<p><tt class="docutils literal">rejecta</tt> : average number of rejections for cases where null hypotheses is false (10)</p>
<p><tt class="docutils literal">rejecta</tt> : average number of rejections for cases where null hypotheses is true (20)</p>
<p><tt class="docutils literal">r_>k</tt> : fraction of Monte Carlo iterations where we reject more than 10 hypotheses</p>
<p><tt class="docutils literal">fdr</tt> : average of the fraction of rejections when null is true out of all rejections</p>
<p>The last numbers look much better, the numbers are below alpha=0.05 as required, including the <tt class="docutils literal">fdr</tt> for the last three methods.</p>
<p><em>"Consider the problem of testing m null hypotheses h1, ..., hm simultaneously, of which m0 are true nulls. The proportion of true null hypotheses is denoted by mu0 = m0/m. Benjamini and Hochberg(1995) used R and V to denote, respectively, the total number of rejections and the number of false rejections, and this notation has persisted in the literature.
<...>
The FDR was loosely defined by Benjamini and Hochberg(1995) as E(V/R) where V/R is interpreted as zero if R = 0."</em>
Benjamini, Krieger and Yekutieli 2006, page 2127</p>
<p>Some additional explanations are in <a class="reference external" href="http://en.wikipedia.org/wiki/False_discovery_rate#Classification_of_m_hypothesis_tests">this Wikipedia page</a></p>
<p>What I had in mind when I wrote the code for my Monte Carlo results, was the family wise error rate, FWER,</p>
<p><em>"The FWER is the probability of making even one type I error in the family, FWER = Pr(V >= 1)"</em>
<a class="reference external" href="http://en.wikipedia.org/wiki/Family_wise_error_rate#The_FWER">Wikipedia</a></p>
<p>Although, I did not look up that definition either. What I actually used, is <tt class="docutils literal">Pr(R > k)</tt> where k is the number of false hypothesis in the data generating process. Although, I had chosen my initial cases so <tt class="docutils literal">Pr(R > k)</tt> is close to <tt class="docutils literal">Pr(V > 0)</tt>.</p>
<p>In the follow-up post I will go over the new Monte Carlo results, which now look all pretty good.</p>
<p><em>Reference</em></p>
<p>Benjamini, Yoav, Abba M. Krieger, and Daniel Yekutieli. 2006. “Adaptive Linear Step-up Procedures That Control the False Discovery Rate.” Biometrika 93 (3) (September 1): 491–507. doi:10.1093/biomet/93.3.491.</p>
</div>
Josef Perktoldnoreply@blogger.com0tag:blogger.com,1999:blog-1909863961516003719.post-57103752896522795842013-03-28T20:53:00.001-07:002013-03-28T20:54:49.100-07:00Inter-rater Reliability, Cohen's Kappa and Surprises with R
<div class="document" id="inter-rater-reliability-cohen-s-kappa-and-surprises-with-r">
<div class="section" id="introduction">
<h4>Introduction</h4>
<p>This is part three in adventures with statsmodels.stats, after <a class="reference external" href="http://jpktd.blogspot.ca/2013/03/statistical-power-in-statsmodels.html">power</a> and <a class="reference external" href="http://jpktd.blogspot.ca/2013/03/multiple-comparison-and-tukey-hsd-or_25.html">multicomparison</a>.</p>
<p>This time it is about Cohen's Kappa, a measure of inter-rater agreement or reliability. Suppose we have two raters that each assigns the same subjects or objects to one of a fixed number of categories. The question then is: How well do the raters agree in their assignments? Kappa provides a measure of association, the largest possible value is one, the smallest is minus 1, and it has a corresponding statistical test for the hypothesis that agreement is only by chance. Cohen's kappa and similar measures have widespread use, among other fields, in medical or biostatistic. In one class of applications, raters are doctors, subjects are patients and the categories are medical diagnosis. Cohen's Kappa provides a measure how well the two sets of diagnoses agree, and a hypothesis test whether the agreement is purely random.</p>
<p>For more background see <a class="reference external" href="http://en.wikipedia.org/wiki/Cohen%27s_kappa">this Wikipedia page</a> which was the starting point for my code.</p>
<p>Most of the following focuses on weighted kappa, and the interpretation of different weighting schemes. In the last part, I add some comments about R, which provided me with several hours of debugging, since I'm essentially an R newbie and have not yet figured out some of it's "funny" behavior.</p>
</div></div><a href="http://jpktd.blogspot.com/2013/03/inter-rater-reliability-cohen-kappa-and_268.html#more">Read more »</a>Josef Perktoldnoreply@blogger.com3tag:blogger.com,1999:blog-1909863961516003719.post-11192507425785945842013-03-25T20:47:00.000-07:002013-03-26T15:59:33.565-07:00Multiple Comparison and Tukey HSD or why statsmodels is awful
<div class="document" id="multiple-comparison-and-tukey-hsd-or-statsmodels-is-awful">
<div class="section" id="introduction">
<h4>Introduction</h4>
<p>Statistical tests are often grouped into one-sample, two-sample and k-sample
tests, depending on how many samples are involved in the test. In k-sample
tests the usual Null hypothesis is that a statistic, for example the mean
as in a one-way ANOVA, or the distribution in goodness-of-fit tests, is the
same in all groups or samples. The common test is the joint test that all
samples have the same value, against the alternative that at least one sample
or group has a different value.</p>
<p>However, often we are not just interested in the joint hypothesis if all
samples are the same, but we would also like to know for which pairs of
samples the hypothesis of equal values is rejected. In this case we conduct
several tests at the same time, one test for each pair of samples.</p>
<p>This results, as a consequence, in a <a class="reference external" href="http://en.wikipedia.org/wiki/Multiple_testing">multiple testing problem</a>
and we should correct our test distribution or p-values to account for this.</p>
<p>I mentioned some of the one- and two sample test in statsmodels before. Today,
I just want to look at pairwise comparison of means. We have k samples
and we want to test for each pair whether the mean is the same, or not.</p>
<p>Instead of adding more explanations here, I just want to point to
<a class="reference external" href="http://rtutorialseries.blogspot.ca/2011/03/r-tutorial-series-anova-pairwise.html">R tutorial</a>
and also the brief description on Wikipedia. A search for "Tukey HSD" or
multiple comparison on the internet will find many tutorials and explanations.</p>
<p>The following are examples in statsmodels and R interspersed with a few explanatory comments.</p>
</div></div><a href="http://jpktd.blogspot.com/2013/03/multiple-comparison-and-tukey-hsd-or_25.html#more">Read more »</a>Josef Perktoldnoreply@blogger.com4tag:blogger.com,1999:blog-1909863961516003719.post-75309729768343813762013-03-17T10:42:00.000-07:002013-03-17T14:42:29.862-07:00Statistical Power in Statsmodels
<div class="document" id="statistical-power-in-statsmodels">
<p>I merged last week a branch of mine into statsmodels that contains large parts of
basic power calculations and some effect size calculations. The documentation is in
<a class="reference external" href="http://statsmodels.sourceforge.net/devel/stats.html#basic-statistics-and-t-tests-with-frequency-weights">this section</a> .
Some parts are still missing but I thought I have worked enough on this for a while.</p>
<p>(Adding the power calculation for a new test now takes approximately:
3 lines of real code, 200 lines of wrapping it with mostly boiler plate and docstrings,
and 30 to 100 lines of tests.)</p>
<p>The first part contains some information on the implementation. In the second part, I compare
the calls to the function in the R <tt class="docutils literal">pwr</tt> package to the calls in my (statsmodels') version.</p>
<p>I am comparing it to the <tt class="docutils literal">pwr</tt> package because I ended up writing almost all unit tests
against it. The initial development was based on the SAS manual,
I used the explanations on the G-Power website for F-tests, and some parts were initially
written based on articles that I read. However, during testing I adjusted the options
(and fixed bugs), so I was able to match the results to <tt class="docutils literal">pwr</tt>, and I think <tt class="docutils literal">pwr</tt> has
just the right level of abstraction and easiness of use, that I ended up with code that is
pretty close to it.</p>
<p>If you just want to see the examples, skip to the second part.</p>
</div><a href="http://jpktd.blogspot.com/2013/03/statistical-power-in-statsmodels.html#more">Read more »</a>Josef Perktoldnoreply@blogger.com4tag:blogger.com,1999:blog-1909863961516003719.post-22977216147854738282013-03-15T21:52:00.000-07:002013-03-17T11:56:53.300-07:00Different Fields - Different Problems: Effect Size
<div class="document" id="different-fields-different-problems-effect-size">
<p>or <em>What's your scale?</em></p>
<div class="section" id="effect-size">
<h4>Effect Size</h4>
<p>I have been working on and off for a while now on adding statistical power
calculations to statsmodels. One of the topics I ran into is the
effect size.</p>
<p>At the beginning, I wasn't quite sure what to make of it. While I was
working on power calculations, it just seemed to be a convenient way of
specifying the distance between the alternative and the null hypothesis.
However, there were references that sounded like it's something special
and important. This was my first message to the mailing list</p>
<blockquote>
<dl class="docutils"><dt>A classical alternative to NHST (null-hypothesis significance testing):</dt>
<dd>report your effect size and confidence intervals</dd>
</dl><p><a class="reference external" href="http://en.wikipedia.org/wiki/Effect_size">http://en.wikipedia.org/wiki/Effect_size</a></p>
<p><a class="reference external" href="http://onlinelibrary.wiley.com/doi/10.1111/j.1469-185X.2007.00027.x/abstract">http://onlinelibrary.wiley.com/doi/10.1111/j.1469-185X.2007.00027.x/abstract</a></p>
<p><a class="reference external" href="http://onlinelibrary.wiley.com/doi/10.1111/j.1460-9568.2011.07902.x/abstract">http://onlinelibrary.wiley.com/doi/10.1111/j.1460-9568.2011.07902.x/abstract</a></p>
<p>I bumped into this while looking for power analysis.</p>
<p>MIA in python, as far as I can tell.</p>
</blockquote>
<p>Now what is the fuss all about?</p>
</div>
<div class="section" id="scaling-issues">
<h4>Scaling Issues</h4>
<p>Today I finally found some good motivating quotes:</p>
<p><em>"In the behavioral, educational, and social sciences (BESS), units of measurement are many
times arbitrary, in the sense that there is no necessary reason why the measurement instrument
is based on a particular scaling. Many, but certainly not all, constructs dealt with in the
BESS are not directly observable and the instruments used to measure such constructs do
not generally have a natural scaling metric as do many measures, for example, in the physical
sciences."</em></p>
<p>and</p>
<p><em>"However, effects sizes based on raw scores are
not always helpful or generalizable due to the lack of natural scaling metrics and multiple
scales existing for the same phenomenon in the BESS. A common methodological suggestion
in the BESS is to report standardized effect sizes in order to facilitate the interpretation of
results and for the cumulation of scientific knowledge across studies, which is the goal of
meta-analysis (<...>). A standardized effect size is an effect size that describes the size of the effect
but that does not depend on any particular measurement scale."</em></p>
<p>The two quotes are from the introduction in "Confidence Intervals for Standardized Effect Sizes: Theory, Application, and Implementation" by Ken Kelley <a class="reference external" href="http://www.jstatsoft.org/v20/a08">http://www.jstatsoft.org/v20/a08</a> .</p>
<p>Large parts of the literature that I was browsing or reading on this, are in Psychological journals.
This can also be seen in the list of references in the Wikipedia page on effect size.</p>
<p>One additional part, that I found puzzling was the definition of "conventional" effect sizes by Cohen.
"For Cohen's d an effect size of 0.2 to 0.3 might be a "small" effect, around 0.5 a "medium" effect and 0.8 to infinity, a "large" effect." (sentence from the Wikipedia page)</p>
<p>"Small" what? small potatoes, small reduction in the number of deaths, low wages? or, "I'm almost indifferent" (+0 on the mailing lists)?</p>
</div>
<div class="section" id="where-i-come-from">
<h4>Where I come from</h4>
<p>Now it's clearer why I haven't seen this in my traditional area, economics and econometrics.</p>
<p>Although economics falls into BESS, in the SS part, it has a long tradition of getting a common scale, money. Physical units also show up in some questions.</p>
<p>National Income Accounting tries to measure the economy with money as a unit. (And if something doesn't have a price associated with it, then it's ignored by most. That's another problem. Or, we make a price.) There are many measurement problems, but there is also a large industry to figure out common standards.</p>
<p>Effect sizes have a scale that is "natural":</p>
<blockquote>
<ul class="simple"><li>What's the increase in lifetime salary, if you attend business school?</li>
<li>What's the increase in sales (in Dollars, or physical units) if you lower the price?</li>
<li>What's the increase in sales if you run an advertising campaign?</li>
<li>What's your rate of return if you invest in stocks?</li>
</ul></blockquote>
<p>Effects might not be easy to estimate, or cannot be estimated accurately, but we don't need a long debate about what to report as effect.</p>
</div>
<div class="section" id="post-scripts">
<h4>Post Scripts</h4>
<p><cite>(i)</cite> I just saw the table at the end of this SAS page <a class="reference external" href="http://support.sas.com/documentation/cdl/en/statug/65328/HTML/default/viewer.htm#statug_glm_details22.htm">http://support.sas.com/documentation/cdl/en/statug/65328/HTML/default/viewer.htm#statug_glm_details22.htm</a> .
I love replicating SAS tables, but will refrain for now, since I am supposed to go back to other things.</p>
<p><cite>(ii)</cite> I started my last round of work on this because I was looking at effect size as distance measure for a chisquare goodness of fit test. When the sample size is very large, then small deviations from the Null Hypothesis will cause a statistical test to reject the Null, even if the effect, the difference to the Null is for all practical purposes irrelevant. My recent preferred solution to this is to switch to equivalence test or something similar, not testing the hypothesis that the effect is exactly zero, but to test whether the effect is "small" or not.</p>
<p><cite>(iii)</cite> I have several plans for blog posts (cohens_kappa, power onion) but never found the quiet time or urge to actually write them.</p>
</div>
</div>
Josef Perktoldnoreply@blogger.com5tag:blogger.com,1999:blog-1909863961516003719.post-74674185516582029022012-12-05T12:46:00.001-08:002012-12-05T12:46:57.353-08:00Visual Inspection of Random Numbers
<div class="document" id="visual-inspection-of-random-numbers">
<p>This is another post on showing what a few lines of matplotlib can produce.</p>
<div class="section" id="background">
<h4>Background</h4>
<p>When I wrote the test suite for <tt class="docutils literal">scipy.stats.distributions</tt>, I had to mark quite a few distributions as slow
so they are skipped under the normal test runs, because they were very slow. One of the reasons that some distributions are slow is because the generic random number generation is very indirect if only the density function is available.</p>
<p>For some time I was looking at spline interpolation of the inverse cumulative distribution, ppf, as a approximate way of generating random numbers. However, since scipy has no splines that impose monotonicity, that did not work.</p>
<p>Finally, I wrote a script that just uses linear interpolation of the cdf of a distribution, using <tt class="docutils literal">scipy.interpolate.interp1d</tt> so we can use standard inversion to create random numbers. To check whether the interpolation looks accurate enough, I went to "proof by plotting".</p>
<p>The interpolating random number generator takes about 0.3 seconds for one million random numbers, not counting the setup cost of creating the interpolator. The script is currently just a quick hack to see if it works.</p>
</div>
<div class="section" id="the-plots">
<h4>The Plots</h4>
<p>As an example I took the t distribution with 5 degrees of freedom, which has somewhat heavy tails. I calculated the
approximation for 1000 intervals, and then for 10 and 20 intervals as contrast.</p>
<p>Since a large part of the "action" is in the tails, and I want to get those to be resonably accurate, I could not look just at a regular histogram since the tails are not very visible. So I looked at two variations, one with log scale, the other where the bin width is chosen so each bin has equal probability instead of equal length.</p>
<p>The result are the following four plots, with equal-length bins in the first row, equal-probability bins in the second row, and linear scale on the left side and log scaled probabilites on the right side. With 1000 segments in the interpolation, I don't see any systematic deviation of the random numbers from the true distribution. Below is the qqplot, generated with statsmodels, that indicates that the random numbers are consistent with a t(5) distribution.</p>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-O2DZfI9jHUs/UL-jt4T89EI/AAAAAAAAAv4/44pqMExOJ1k/s1600/rvs_t_hist.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="304" src="http://3.bp.blogspot.com/-O2DZfI9jHUs/UL-jt4T89EI/AAAAAAAAAv4/44pqMExOJ1k/s640/rvs_t_hist.png" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-PipSlSm1eTA/UL-jtrsdjmI/AAAAAAAAAv0/_m9CnDcvJv0/s1600/rvs_t_qqplot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="412" src="http://3.bp.blogspot.com/-PipSlSm1eTA/UL-jtrsdjmI/AAAAAAAAAv0/_m9CnDcvJv0/s640/rvs_t_qqplot.png" width="640" /></a></div>
<br />
<p>As contrast, below are the same kind of plots for 20 intervals in the interpolation, which is a symmetric step function density with 20 intervals, many of them close to zero. The histogram shows clearly the steps, the qqplot shows systematic curved segments, which are more visible in the qqplot for 10 intervals.</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-C-tZmPKJhOw/UL-jtMMRJjI/AAAAAAAAAvk/znEkUKcFJG8/s1600/rvs_t_hist_20.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="304" src="http://2.bp.blogspot.com/-C-tZmPKJhOw/UL-jtMMRJjI/AAAAAAAAAvk/znEkUKcFJG8/s640/rvs_t_hist_20.png" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-dkuFU1T6sMw/UL-kYYIDNNI/AAAAAAAAAwQ/pKCVq_us5bw/s1600/rvs_t_qqplot_20.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="524" src="http://4.bp.blogspot.com/-dkuFU1T6sMw/UL-kYYIDNNI/AAAAAAAAAwQ/pKCVq_us5bw/s640/rvs_t_qqplot_20.png" width="640" /></a></div>
<p>The plots for 10 intervals are in my gallery <a class="reference external" href="https://picasaweb.google.com/lh/photo/jligO2fh-oeWS-t6eT9NJtMTjNZETYmyPJy0liipFm0?feat=directlink">histogram</a>
and <a class="reference external" href="https://picasaweb.google.com/lh/photo/XSaV9jjcmdVDlOjXqp-yptMTjNZETYmyPJy0liipFm0?feat=directlink">qqplot</a></p>
</div>
</div>
Josef Perktoldnoreply@blogger.com4tag:blogger.com,1999:blog-1909863961516003719.post-8217177914397726792012-12-01T14:10:00.000-08:002012-12-02T05:46:01.660-08:00Characteristic Functions and scipy.stats
<div class="document" id="characteristic-functions-and-scipy-stats">
<p><tt class="docutils literal">scipy.stats.distributions</tt> is among other things a nice formula collection.</p>
<p>One of the parts that are missing are the characteristic functions for the distributions.</p>
<p>Wikipedia is very good for large areas of statistics, see for some details and examples <a class="reference external" href="http://en.wikipedia.org/wiki/Characteristic_function_%28probability_theory%29">http://en.wikipedia.org/wiki/Characteristic_function_%28probability_theory%29</a>
Wikipedia lists the characteristic funtion also on the pages for many distributions.</p>
<p>I wrote something like the script below already several times (for the easy cases).</p>
<p>The characteristic function for the normal distribution is easy, but looking at the characteristic function
of the t-distribution, I wish someone had translated it into code already.</p>
<div class="section" id="t-distribution">
<h4>t-distribution</h4>
<p>Since I haven't seen it yet, I sat down and tried it myself. I managed to code the characteristic function of the t-distribution, but it returns <tt class="docutils literal">NaNs</tt> when it is evaluated close to zero for large df.</p>
<p>I didn't find a Bessel "k" function that works in this case</p>
<pre class="literal-block">
>>> special.kv(50/2., 1e-30)
inf
>>> special.kve(50/2., 1e-30)
inf
</pre>
<p>The t-distribution approaches the normal distribution as the shape parameter, the degrees of freedom,
gets large. So, the characteristic function of the t-distribution should be well behaved for large <tt class="docutils literal">df</tt>.
However, the individual terms go to infinity or zero.</p>
<p>Since in my current case, I don't care about the behavior of the characteristic function around zero, I stopped trying to get a better implementation.</p>
<p>Warning: monkey patching included in the script</p>
<p><strong>Aside:</strong> I cannot make up my mind whether the abbreviation for characteristic function should be <tt class="docutils literal">chf</tt> or <tt class="docutils literal">cf</tt>.
I have both versions in various scripts that I wrote.</p>
</div>
<div class="section" id="the-script">
<h4>The script</h4>
<p>Here's my current script</p>
<pre class="literal-block">
# -*- coding: utf-8 -*-
"""Characteristic Functions
Created on Fri Nov 30 22:43:36 2012
Author: Josef Perktold
"""
import numpy as np
from scipy import stats, special
def chf_normal(t, loc=0, scale=1):
'''characteristic function of normal distribution
Parameters
----------
t : array_like
points at which characteristic function is evaluated
loc : float (or array_like ?)
mean of underlying normal distribution
scale : float (or array_like ?)
standard deviation, scale of normal distribution
Returns
-------
chfval : ndarray
characteristic function evaluated at x
Notes
-----
Can be used for higher dimensional arguments if ``t``, ``mean`` and ``var``
broadcast.
'''
t = np.asarray(t)
return np.exp(1j * t * loc - 0.5 * t**2 * scale**2)
def chf_t(t, df, loc=0, scale=1):
'''characteristic function of t distribution
breaks down for large df and t close to zero
'''
t = np.asarray(t)
vhalf = df / 2.
term = np.sqrt(df) * np.abs(t*scale)
cf = special.kv(vhalf, term) * np.power(term, vhalf)
cf = cf / special.gamma(vhalf) / 2**(vhalf - 1)
cf = cf * np.exp(1j * loc * t)
if cf.shape == () and t == 0:
#special case: kv(0) returns nan
#for df>15 or so, we also get nans in the result in neighborhood of 0
cf = np.ones((), cf.dtype) #return same dtype as cf would
else:
cf[t == 0] = 1
return cf
def chf_t_(t, df, loc=0, scale=1):
#not much, but a bit better with log
vhalf = df / 2.
term = np.sqrt(df) * np.abs(t*scale)
cf = np.log(special.kv(vhalf, term)) + vhalf * np.log(term)
cf = cf - special.gammaln(vhalf) - (vhalf - 1) * np.log(2)
cf = cf + 1j * loc * t
if cf.shape == () and t == 0:
#special case: kv(0) returns nan
#for df>15 or so, we also get nans in the result in neighborhood of 0
cf = np.zeros((), cf.dtype) #return same dtype as cf would
else:
cf[t == 0] = 0
return np.exp(cf)
def chfn(self, t, *args, **kwds):
return chf_normal(t, *args, **kwds)
#monkeypatch scipy.stats
stats.distributions.norm_gen._chf = chfn
t = np.linspace(-1, 1, 11)
print stats.norm._chf(t, loc=1, scale=2)
print chf_t(t, 50, loc=1, scale=2)
print chf_t_(t, 50, loc=1, scale=2)
</pre>
<p><strong>Editorial note</strong>: I had written this initially for the scipy-user mailing list. (Old habits are difficult to break.)
But I remembered just before hitting <tt class="docutils literal">Send</tt> that the recommendation is to put it in a blog.</p>
</div>
<div class="section" id="application-and-plot-wrapped-circular-t-distribution">
<h4>Application and Plot: Wrapped Circular T Distribution</h4>
<p>As explained in my previous <a href="http://jpktd.blogspot.ca/2012/11/orthogonal-series-and-wrapped-circular.html">post</a>, once we have the characteristic function of a distribution
defined on the real line, it is simple to get the Fourier approximation for the
wrapped circular distribution. As an application of the characteristic function of the
t distribution, I constructed the wrapped circular distributions.</p>
<p>The following plot shows an example of the density functions of the wrapped Cauchy,
the wrapped normal distribution, and the wrapped t distribution for a few values of the
degrees of freedom. Normal and Cauchy distributions are the two extreme cases of the t distribution,
when the degrees of freedom go to infinity and when the degrees of freedom is one, respectively.</p>
<p>The distribution in the plot have the same location and scale parameter. However, this implies that
the variance of the distributions is not the same. As a different comparison we could have adjusted
the scale parameter to obtain distributions with identical variance. The latter is a more informative comparison
when we are estimating the parameters based on data, and the estimated distribution reflects a similar
variance as the data.</p>
<p>The fatter tails of Cauchy and t distributions with small t are clearly visible in the plot.</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-0cPbdfm1I-k/ULpsdyisb0I/AAAAAAAAAu0/cvcYrYnKjQM/s1600/wrapped_t_chf_0.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="434" src="http://2.bp.blogspot.com/-0cPbdfm1I-k/ULpsdyisb0I/AAAAAAAAAu0/cvcYrYnKjQM/s640/wrapped_t_chf_0.png" width="640" /></a></div>
</div>
</div>
Josef Perktoldnoreply@blogger.com8tag:blogger.com,1999:blog-1909863961516003719.post-15397556797213218302012-11-20T10:21:00.001-08:002012-11-20T10:31:40.847-08:00Orthogonal Series and Wrapped Circular Distribution<div class="document" id="orthogonal-series-and-wrapped-circular-distribution">
This is just a quick follow-up on the previous posting.<br /><br />
recommended reading: Mardia and Jupp, section 3.5.7 on wrapped distributions
<a class="reference external" href="http://www.amazon.com/Directional-Statistics-Kanti-V-Mardia/dp/0471953334">http://www.amazon.com/Directional-Statistics-Kanti-V-Mardia/dp/0471953334</a><br /><br />
To construct a wrapped distributions on a circle, we can take a distribution
that is defined on the real line, like the normal, cauchy, t or stable
distribution, and wrap it around the circle. Essentially it's just taking
the support modulo (2 pi) and adding the overlapping densities. For some
distributions the wrapped density has a nice closed form expression, for
example the wrapped cauchy distribution that is also available in
<tt class="docutils literal">scipy.stats</tt>.<br />
For other distributions, the density is given as infinite sum, that however
converges in many cases very fast.<br />
Mardia and Jupp show how to construct the series representation of the wrapped
distribution from the characteristic function of the original, not wrapped
distribution.<br />
The basic idea is that for circular wrapped distributions the characteristic
function is only evaluated at the integers, and we can construct the Fourier
expansion of the wrapped density directly from the real and imaginary parts
of the characteristic function.
(In contrast, for densities on the real line we need a continuous inverse
Fourier transform that involves integration.)<br /><br />
To see that it works, I did a "proof by plotting"<br /><br />
For the wrapped Cauchy distribution, I can use <tt class="docutils literal">scipy.stats.wrapcauchy.pdf</tt>
as check. For both wrapped Cauchy and wrapped normal distributions, I also
coded directly the series from Mardia and Jupp's book (<tt class="docutils literal"><span class="pre">pdf-series1</span></tt>).
I also draw a large number (10000) of random numbers to be able to compare to
the histogram. The generic construction from only the characteristic function
is <tt class="docutils literal"><span class="pre">pdf-series2-chf</span></tt> in the plots. I used 10 terms in the series
representation.<br />
The plots are a bit "boring" because all 2 resp. 3 lines for the density
coincide up to a few decimals<br />
<br />
Here's the wrapped Cauchy:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-yFjdPoQtdZk/UKvAigvY7zI/AAAAAAAAAts/w6l1ORWkM_8/s1600/wrapped_cauchy_chf.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="440" src="http://4.bp.blogspot.com/-yFjdPoQtdZk/UKvAigvY7zI/AAAAAAAAAts/w6l1ORWkM_8/s640/wrapped_cauchy_chf.png" width="640" /></a></div>
<br />
And here's the wrapped normal distribution:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-NxGGrJ-nWEk/UKvAi1gUN1I/AAAAAAAAAt0/eyk9IjdHKsg/s1600/wrapped_normal_chf.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="442" src="http://4.bp.blogspot.com/-NxGGrJ-nWEk/UKvAi1gUN1I/AAAAAAAAAt0/eyk9IjdHKsg/s640/wrapped_normal_chf.png" width="640" /></a></div>
</div>
Josef Perktoldnoreply@blogger.com0tag:blogger.com,1999:blog-1909863961516003719.post-75152369300582173712012-11-18T19:09:00.000-08:002012-11-20T10:31:20.238-08:00Density Estimation with Orthogonal Series - circular data<div class="document" id="density-estimation-with-orthogonal-series-circular-data">
<div class="section" id="background">
<h4>
Background</h4>
Orthogonal Series are very useful. If we have a basis <math xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mo>(</mo><msub><mi>g</mi><mi>i</mi></msub><msub><mo>)</mo><mrow><mi>i</mi><mo>∈</mo><mi>N</mi></mrow></msub></mrow></math>
for some function space (usually with smoothness and integrability conditions),
then we can represent function as linear combinations of the basis functions:<br />
<div>
<math mode="display" xmlns="http://www.w3.org/1998/Math/MathML"><mtable><mtr><mtd><mi>f</mi><mo>(</mo><mi>x</mi><mo>)</mo><mo>=</mo><munder><mo>∑</mo><mrow><mi>i</mi><mo>∈</mo><mi>N</mi></mrow></munder><msub><mi>c</mi><mi>i</mi></msub><msub><mi>g</mi><mi>i</mi></msub><mo>(</mo><mi>x</mi><mo>)</mo></mtd></mtr></mtable></math></div>
To get an approximation to the function <cite>f</cite>, we can truncate after some finite
number of terms. (<i>N</i> is all positive integers.)<br />
Orthonormal polynomials are convenient for density estimation, because we can
directly estimate the coefficients <math xmlns="http://www.w3.org/1998/Math/MathML"><mrow><msub><mi>c</mi><mi>i</mi></msub></mrow></math> from the data without having to
run a non-linear optimization. In the basic case, we just need to calculate
the moments of the data.<br />
The orthogonality and normalization of the basis function is defined with
respect to a weighting function <math xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>w</mi></mrow></math>:<br />
<div>
<math mode="display" xmlns="http://www.w3.org/1998/Math/MathML"><mtable><mtr><mtd><mo>∫</mo><msub><mi>g</mi><mi>i</mi></msub><mo>(</mo><mi>x</mi><mo>)</mo><msub><mi>g</mi><mi>j</mi></msub><mo>(</mo><mi>x</mi><mo>)</mo><mi>w</mi><mo>(</mo><mi>x</mi><mo>)</mo></mtd><mtd><mo>=</mo><mn>0</mn><mtext> if </mtext><mi>i</mi><mo>≠</mo><mi>j</mi></mtd></mtr><mtr><mtd></mtd><mtd><mo>=</mo><mn>1</mn><mtext> if </mtext><mi>i</mi><mo>=</mo><mi>j</mi></mtd></mtr></mtable></math></div>
In the case of estimating or approximating a density we can use a reference
density as weighting function. Then, the first term corresponds to the
reference density, higher order terms are deviations from the reference
density. This forms the basis for smooth goodness-of-fit tests.
It is also very similar to series expansion of distributions, for example
the Gram-Charlier expansion for the normal distribution. The reference density
is the normal distribution. Higher order terms are based on Hermite polynomials.<br />
In the basic form, we can just add the weighting function to the expansion
above<br />
<div>
<math mode="display" xmlns="http://www.w3.org/1998/Math/MathML"><mtable><mtr><mtd><mi>f</mi><mo>(</mo><mi>x</mi><mo>)</mo><mo>=</mo><munder><mo>∑</mo><mrow><mi>i</mi><mo>∈</mo><mi>N</mi></mrow></munder><msub><mi>c</mi><mi>i</mi></msub><msub><mi>g</mi><mi>i</mi></msub><mo>(</mo><mi>x</mi><mo>)</mo><mi>w</mi><mo>(</mo><mi>x</mi><mo>)</mo></mtd></mtr></mtable></math></div>
However, these kinds of series expansion do not necessarily have densities
that are non-negative over the full range of the density function. As a
consequence, several non-linear transformation have been introduced in the
literature, for example squaring or taking the exponential. The transformed
expansion always results in non-negative densities. However, they loose the
simple estimation property and have to be estimated with non-linear
optimization.
(I haven't actually coded any of those yet.)<br />
These series approximation to densities can be extended to the multivariate
case, but I haven't coded those yet either.</div>
<div class="section" id="the-quest">
<h4>
The Quest</h4>
I got started with this after a "random" reading, <i>"Orthogonal series density
estimation"</i>
<a class="reference external" href="http://wires.wiley.com/WileyCDA/WiresArticle/wisId-WICS97.html">http://wires.wiley.com/WileyCDA/WiresArticle/wisId-WICS97.html</a>
and later <i>"Smooth tests of goodness of fit"</i>
<a class="reference external" href="http://wires.wiley.com/WileyCDA/WiresArticle/wisId-WICS171.html">http://wires.wiley.com/WileyCDA/WiresArticle/wisId-WICS171.html</a>
Both papers give well motivated introductions.<br />
In the mean time I have read dozens more papers in this direction.
The latest is a lot more
theoretical <a class="reference external" href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2141299">http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2141299</a>
and goes into continuous time stochastic processes, where I'm not yet ready
to go back to, and along a similar line, orthonormal series variance estimator
<a class="reference external" href="http://onlinelibrary.wiley.com/doi/10.1111/j.1368-423X.2012.00390.x/abstract">http://onlinelibrary.wiley.com/doi/10.1111/j.1368-423X.2012.00390.x/abstract</a><br />
<tt class="docutils literal">scipy.special</tt> has a nice collection of orthogonal polynomials. Now also
<tt class="docutils literal">numpy.polynomial</tt> has a good implementation of orthogonal polynomials, but they
were not available when I started with this. The <tt class="docutils literal">scipy.special</tt> documentation
is a bit "thin". It is good enough when we know what we are looking for, but not
very helpful when we only have a vague idea what kind of "animals" those
functions are.<br />
The first problem was to find the normalization, since the polynomials in scipy
are orthogonal but not orthonormal.
<a class="reference external" href="http://old.nabble.com/orthogonal-polynomials---tt31619489.html">http://old.nabble.com/orthogonal-polynomials---tt31619489.html</a><br />
Also, on the way I had to figure out how to construct orthonormal polynomials
for an arbitrary given weight function (reference density), and learn about
recurrence equations and how we can construct and evaluate orthogonal
polynomials. Neither of those are standard training where I come from.<br />
Plots of some of my previous results can be seen in my gallery. Two examples:<br />
<br />
Fourier polynomials<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-frKowg6RFLI/Td6uzu-JGOI/AAAAAAAAACs/pn7E4m0fE1g/s1600/four_dens_20.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="238" src="http://2.bp.blogspot.com/-frKowg6RFLI/Td6uzu-JGOI/AAAAAAAAACs/pn7E4m0fE1g/s320/four_dens_20.png" width="320" /> </a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
and Hermite polynomials (black line, green line is a normal distribution)<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-4WG-MvYRsQc/TftRHT_A7HI/AAAAAAAAAHI/8kkEp5W2pBg/s1600/density_hermit_mixnormal_1000_o20.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="228" src="http://3.bp.blogspot.com/-4WG-MvYRsQc/TftRHT_A7HI/AAAAAAAAAHI/8kkEp5W2pBg/s320/density_hermit_mixnormal_1000_o20.png" width="320" /></a></div>
</div>
<div class="section" id="the-latest-installment">
<h4>
The latest Installment</h4>
Recently, I started to figure out the basics of circular or directional
statistics, see for example <a class="reference external" href="http://en.wikipedia.org/wiki/Directional_statistics">http://en.wikipedia.org/wiki/Directional_statistics</a> .<br />
Trying to understand the usual smooth goodness-of-fit tests, I read
<a class="reference external" href="http://onlinelibrary.wiley.com/doi/10.1111/j.1467-842X.2009.00558.x/abstract">http://onlinelibrary.wiley.com/doi/10.1111/j.1467-842X.2009.00558.x/abstract</a>
However, orthonormal polynomials on the unit circle are "different". To get
orthogonal polynomials with the Von Mises distribution as the weight
functions, we need Verblunsky coefficients and Szego recurrence. Now what
are those? Searching with Google, I didn't find any basic explanations.
I don't really want to get a math book on the topic (by Barry Simon) and
read it.<br />
<a class="reference external" href="http://old.nabble.com/Orthogonal-polynomials-on-the-unit-circle-tt34608320.html">http://old.nabble.com/Orthogonal-polynomials-on-the-unit-circle-tt34608320.html</a><br />
To get started with something easier, I went back to orthogonal polynomials
with a uniform weight function, that is no weights. In this case, the
polynomials are just trigonometric functions or Fourier series.<br />
An explanation and application that imposes additionally non-negativity of the
density function (which I do not impose in my code) is
<a class="reference external" href="http://onlinelibrary.wiley.com/doi/10.1111/j.0006-341X.2004.00195.x/full">http://onlinelibrary.wiley.com/doi/10.1111/j.0006-341X.2004.00195.x/full</a><br />
The coefficients of the series approximation are just the circular moments
of the underlying distribution. We can calculate those for a given
distribution, or we can calculate the empirical moments from the data.<br />
<b>Detour</b>: <tt class="docutils literal">scipy.integrate.quad</tt><br />
An under used feature of <tt class="docutils literal">scipy.integrate.quad</tt> is that we are able
to use a weight function. For example, calculating the cosine and sine parts
of the circular moments can be done with<br />
<pre class="literal-block">integrate.quad(pdf_func, low, upp, weight='cos', wvar=k)
integrate.quad(pdf_func, low, upp, weight='sin', wvar=k)
</pre>
which calculates the k-th circular moment of a circular distribution given by
<tt class="docutils literal">pdf_func</tt>. The integration limits are either <math xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mo>(</mo><mn>0</mn><mo>,</mo><mn>2</mn><mi>π</mi><mo>)</mo></mrow></math> or
<math xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mo>(</mo><mo>-</mo><mi>π</mi><mo>,</mo><mi>π</mi><mo>)</mo></mrow></math>.
We cannot integrate with the complex definition:<br />
<pre class="literal-block">integrate.quad(lambda x: np.exp(1j*k*x)*pdf_func(x, *args), low, upp)
</pre>
because <tt class="docutils literal">quad</tt> throws away the imaginary part and issues a warning about the
casting to float.</div>
<div class="section" id="the-plots">
<h4>
The Plots</h4>
And now, the plots. I draw random numbers from a two component mixture of
Von Mises distributions <a class="footnote-reference" href="http://www.blogger.com/blogger.g?blogID=1909863961516003719&pli=1#id2" id="id1">[1]</a>. The plots contain the histogram of the data and
the estimated density based on the trigonometric series. For reference it also
contains the density of the data generating process, the mixture distribution,
and the density given by the 15 component series based on the circular moments
of the data generating distribution (calculated by integration as above). With 15 components the series distribution based on the true moments is essentially indistinguishable from the true density.<br />
<br />
First plot: 10000 observations, which makes the histogram and estimated
moments close to the true density and moments. <br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-zW-LoduD_Fc/UKg5Ek25Z-I/AAAAAAAAAqI/zdx7EMFcpCk/s1600/vonmises_mix_series1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="304" src="http://4.bp.blogspot.com/-zW-LoduD_Fc/UKg5Ek25Z-I/AAAAAAAAAqI/zdx7EMFcpCk/s640/vonmises_mix_series1.png" width="640" /></a></div>
<br />
Second plot: 200 observations, given the randomness in the data, the histogram
is pretty uneven (given the number of bins). I fitted 3 components in the
series density estimate.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-JVYRnT9WM40/UKg5FeeWKOI/AAAAAAAAAqQ/vymWqfzq89M/s1600/vonmises_mix_series2_200.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="304" src="http://4.bp.blogspot.com/-JVYRnT9WM40/UKg5FeeWKOI/AAAAAAAAAqQ/vymWqfzq89M/s640/vonmises_mix_series2_200.png" width="640" /></a></div>
<br />
Third and fourth plots: 1000 observations, in one plot I used 5 components,
in the other plot I used 15 components in the series density. The density
with 15 components is fitting random "bumps" in the histogram.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-t4v4S81zOao/UKg5Em13hUI/AAAAAAAAAqE/c0w22jlSPtg/s1600/vonmises_mix_series2_1000.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="304" src="http://3.bp.blogspot.com/-t4v4S81zOao/UKg5Em13hUI/AAAAAAAAAqE/c0w22jlSPtg/s640/vonmises_mix_series2_1000.png" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-ZX47MmG050E/UKjYsCP4WUI/AAAAAAAAArg/4W5gmp-kwVk/s1600/vonmises_mix_series2_1000_15.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="444" src="http://1.bp.blogspot.com/-ZX47MmG050E/UKjYsCP4WUI/AAAAAAAAArg/4W5gmp-kwVk/s640/vonmises_mix_series2_1000_15.png" width="640" /></a></div>
</div>
<div class="section" id="some-comments">
<h4>
Some Comments</h4>
Orthogonal series expansion could be or is very useful. The advantage
compared to kernel density estimation is that it is much faster and we do not
need to keep the original data for evaluating the density. All we need are the
coefficients for the series. It also works better on bounded support than
kernel density estimation. One of the disadvantages is that it
is a global method and will not be able to adjust locally if there are regions
with different features, unless we sufficiently increase the number of terms in
the series. Increasing the number of terms will make the density estimate
more sensitive to random effects.<br />
My impression is that orthogonal series expansion for densities are limited in
their usefulness when the distribution contains a shape parameter and not
just location and scale. A while ago, I wrote the recursion for polynomial
series with Poisson as the weight function. It can be used for testing
whether a distribution is Poisson, as in the paper I used as reference.
However, I finally came to the conclusion that this is not really so useful,
since in many cases we want to have count regression, with the shape parameter
as a function of some explanatory variables. The series expansion of the Poisson
distribution is specific to a given shape parameter, which means that we cannot
create the orthonormal base independently of the regressors. I also have not
seen any articles that uses orthogonal expansion in regression outside the
normal distribution case, as far as I remember.<br />
One of the main missing pieces in my code is automatic selection of the
bandwidth or of the optimal penalization.
For the former, we need to select the number of components in the
series expansion. For the later, we use a larger number of terms but need to
find an increasingly strong penalization for higher order terms. I only know
of one journal article that derives the penalization for Fourier series on
the real line.<br />
Related to the last point: One of the main work that George and Ralph did
during GSOC last summer is to get automatic bandwidth selection for kernel
density estimation and kernel regression for the new <tt class="docutils literal">nonparametric</tt>
extension in statsmodels. There are many other new features besides this.
<tt class="docutils literal">statsmodels</tt> will get a good coverage of kernel methods when the branch is
merged, which will happen very soon.<br />
(My code is mostly only part of my private collection of "dirty" scripts.)<br />
<table class="docutils footnote" frame="void" id="id2" rules="none"><colgroup><col class="label"></col><col></col></colgroup><tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="http://www.blogger.com/blogger.g?blogID=1909863961516003719&pli=1#id1">[1]</a></td><td>I just concatenated the data and didn't randomize on the number of
observations in each component.</td></tr>
</tbody></table>
<b>Editorial Note</b>: I'm still using <tt class="docutils literal">rst2blogger</tt> with the default settings.
I am able to edit Latex math in restructured text for the use with sphinx
which I used for the draft. With <tt class="docutils literal">rst2blogger</tt> the default is conversion to
mathml, which doesn't recognize all Latex math that I was using, and some
fine-tunig got lost. Additionally, the math doesn't display in Internet
Explorer on my computer.<br />
PS: <b>Where does your black box start?</b><br />
Just a thought after reading this again.<br />
Sometimes I'm happy to use somebody else's code or recipes without worrying
about why it works. Sometimes I have to dig in myself because there are no
recipes available. But often I have to dig in because I feel a need to
understand whatever I'm using and I cannot rest until my ignorance is
sufficiently reduced (or I realize that the mountain is too big.)<br />
And after five or more years of <i>Statistics in Python</i>, I can safely say that
I learned a lot about topics that I never heard of before.</div>
</div>
Josef Perktoldnoreply@blogger.com0tag:blogger.com,1999:blog-1909863961516003719.post-19872534048334080782012-11-05T12:28:00.000-08:002012-11-05T12:34:49.990-08:00Polar Histogram
Just posting two plots from my gallery to show what we can do with matplotlib, and numpy and scipy. (No public code for now)
<BR/>
Both plots show the histogram of the data and the density function (pdf) of the Von Mises distribution with estimated mu and kappa.
<BR/>
The first one shows arrival times in a 24 hour clock
<div class="separator" style="clear: both; text-align: center;">
<a href="http://goo.gl/photos/jtLrWZzIHI" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://lh4.googleusercontent.com/-CNAc8H7DDT4/UJgccQ4mGAI/AAAAAAAAApU/VAZ5fNghnyM/s512/polar_hist_arrival_clockwise.png" /></a>
</div>
The second one shows wind direction (zero is north, I didn't use the offset in this plot.)
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-KRHM0gm3VfQ/UJgZINRXqcI/AAAAAAAAAo8/sgOJ8FQPRDs/s1600/polar_hist_wind.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-KRHM0gm3VfQ/UJgZINRXqcI/AAAAAAAAAo8/sgOJ8FQPRDs/s512/polar_hist_wind.png" /></a></div>
Josef Perktoldnoreply@blogger.com0tag:blogger.com,1999:blog-1909863961516003719.post-642172432946782062012-10-17T10:08:00.001-07:002012-10-17T10:08:59.617-07:00TOST: statistically significant difference and equivalence<div class="document" id="tost-statistically-significant-difference-and-equivalence">
or "Look I found a dime"<br />
<div class="section" id="the-story">
<h4>
The Story</h4>
Suppose we have two strategies (treatments) for making money.
We want to test whether there is difference in the payoffs that
we get with the two strategies. Assume that we are confident
enough to rely on t tests, that is, means are approximately normally distributed.
For some reasons, like transaction cost or cost differences, we don't care about the
difference in the strategies if the difference is less than 50 cents.<br />
To have an example we can simulate two samples, and let's take
as a true difference a dime, 0.1<br />
<pre class="literal-block">payoff_s1 = sigma * np.random.randn(nobs)
payoff_s2 = 0.1 + sigma * np.random.randn(nobs)
</pre>
I picked sigma=0.5 to get good numbers for the story.</div>
<div class="section" id="two-tests-t-test-and-tost">
<h4>
Two Tests: t-test and TOST</h4>
We compare two test, a standard t test for independent samples and
a test for equivalence, two one-sided tests, TOST:<br />
<pre class="literal-block">stats.ttest_ind(payoff_s1, payoff_s2)
smws.tost_ind(payoff_s1, payoff_s2, -0.5, 0.5, usevar='pooled')
</pre>
The null hypothesis for the <cite>t-test</cite> is that the two samples have
the same mean. If the p-value of the <cite>t-test</cite> is below, say 0.05,
we reject the hypothesis that the two means are the same.
If the p-value is above 0.05, then we don't have enough evidence
to reject the null hypothesis. This can also happen when the
power of the test is not high enough given our sample size.<br />
As the sample size increases, we have more information and the
test becomes more powerful.<br />
<b>If the true means are different</b>, then in large samples
we will always reject the null hypothesis of equal means.
(As the number of observations goes to infinity the probability of
rejection goes to one if the means are different.)<br />
The second test, <cite>TOST</cite>, has as null hypothesis that the difference is outside
an interval. In the symmetric case, this means that the absolute difference is
at least as large as a given threshold.
If the p-value is below 0.05, then we reject the null hypothesis that
the two means differ more than the threshold. If the p-value is above
0.05, we have insufficient evidence to reject the hypothesis that
the two means differ enough.<br />
Note that the null hypothesis of <cite>t-test</cite> and of <cite>TOST</cite> are reversed, rejection means
significant difference in <cite>t-test</cite> and significant equivalence in <cite>TOST</cite>.</div>
<div class="section" id="the-results">
<h4>
The Results</h4>
Looking at the simulated results:<br />
<i>small sample size</i>:<br />
<pre class="literal-block">nobs: 10 diff in means: -0.14039151695
ttest: 0.606109617438 not different tost: 0.0977715582206 different
</pre>
With 10 observations the information is not enough to reject the null hypothesis
in either test. The <cite>t-test</cite> says we cannot reject that they are different.
The <cite>TOST</cite> test says we cannot reject that they are the same.<br />
<i>medium sample size</i>:<br />
<pre class="literal-block">nobs: 100 diff in means: 0.131634043864
ttest: 0.0757146249227 not different tost: 6.39909387346e-07 not different
</pre>
The <cite>t-test</cite> does not reject that they are the same at a significance size of 0.05.
The <cite>TOST</cite> test now rejects the hypothesis that there is a large (at least 0.5)
difference.<br />
<i>large sample size</i>:<br />
<pre class="literal-block">nobs: 1000 diff in means: 0.107020981612
ttest: 1.51161249802e-06 different tost: 1.23092818968e-65 not different
</pre>
Both tests no reject their null hypothesis. The <cite>t-test</cite> rejects that the means are
the same. However the mean is only 0.1, so the statistically significant difference
is not large enough that we really care. Statistical significance doesn't mean it's
also an important difference.
The <cite>TOST</cite> test strongly rejects that there is a difference of at least 0.5, indicating
that given our threshold of 0.5, the two strategies are the same.</div>
<div class="section" id="the-script">
<h4>
The Script</h4>
<pre class="literal-block">import numpy as np
from scipy import stats
import statsmodels.stats.weightstats as smws
nobs_all = [10, 100, 1000]
sigma = 0.5
seed = 628561 #chosen to produce nice result in small sample
print seed
for nobs in nobs_all:
np.random.seed(seed)
payoff_s1 = sigma * np.random.randn(nobs)
payoff_s2 = 0.1 + sigma * np.random.randn(nobs)
p1 = stats.ttest_ind(payoff_s1, payoff_s2)[1]
p2 = smws.tost_ind(payoff_s1, payoff_s2, -0.5, 0.5, usevar='pooled')[0]
print 'nobs:', nobs, 'diff in means:', payoff_s2.mean() - payoff_s1.mean()
print 'ttest:', p1, ['not different', 'different '][p1 < 0.05],
print ' tost:', p2, ['different ', 'not different'][p2 < 0.05]
</pre>
</div>
<div class="section" id="notes">
<h4>
Notes</h4>
<b>Implementation:</b><br />
The t-tests are available in scipy.stats.
I wrote the first version for paired sample <cite>TOST</cite> just based on a scipy.stats ttest <a class="reference external" href="https://gist.github.com/3900314">https://gist.github.com/3900314</a> .
My new versions including <tt class="docutils literal">tost_ind</tt> will soon come to statsmodels.<br />
<b>Editorial note:</b><br />
I looked at tests for equivalence like <cite>TOST</cite> a while ago in response to some discussion on the scipy-user
mailing list about statistical significance.
This time I mainly coded, and spend some time looking at how to verify my code against SAS and R.
Finding references and quotes is left to the reader or to another time. There are some controversies
around TOST and some problems with it, but from all I saw, it's still the most widely accepted approach
and is recommended by the US goverment for bio-equivalence tests.</div>
</div>
Josef Perktoldnoreply@blogger.com2tag:blogger.com,1999:blog-1909863961516003719.post-26661408707646487962012-06-17T19:44:00.000-07:002012-06-17T21:20:48.748-07:00QR and Sequential Regression with Fixed Order<div class="document" id="qr-and-sequential-regression-with-fixed-order">
<div class="section" id="background">
<h4>
Background</h4>
I was fixing some bugs in determining the number of lags in statistical tests
for time series, and remembered that we still haven't optimized our sequential
regressions. So, I opened an <a class="reference external" href="https://github.com/statsmodels/statsmodels/issues/296">issue</a> .<br />
I learned a lot of linear algebra through the numpy and scipy mailing lists.
The relevant idea and thread is <a class="reference external" href="http://mail.scipy.org/pipermail/numpy-discussion/2010-January/048033.html">here</a>,
quoting myself:<br />
<pre class="literal-block">But from your description of QR, I thought specifically of the case
where we have a "natural" ordering of the regressors, similar to the
polynomial case of you and Anne. In the timeseries case, it would be by
increasing lags
yt on y_{t-1}
yt on y_{t-1}, y_{t-2}
...
...
yt on y_{t-k} for k= 1,...,K
or yt on xt and the lags of xt
This is really sequential LS with a predefined sequence, not PLS or
PCA/PCR or similar orthogonalization by "importance".
The usual procedure for deciding on the appropriate number of lags
usually loops over OLS with increasing number of regressors.
>From the discussion, I thought there might be a way to "cheat" in this
using QR and Gram-Schmidt
</pre>
I never got around trying this out, but I thought I give it a try today.
Some hours of trial and error and working on some algebra later, I have
what looks like a reasonable solution.</div>
<div class="section" id="using-qr-decomposition-for-sequential-regression">
<h4>
Using QR Decomposition for Sequential Regression</h4>
The following is just a commented walk through my current script, the core is just a few lines.<br />
First, some imports<br />
<pre class="literal-block">import numpy as np
from statsmodels.regression.linear_model import OLS
from numpy.testing import assert_almost_equal
</pre>
Then, we define a toy example to try out whether it works.
The last two variables have no effect. I used that to get the minimum of the
information criteria, aic, bic, to be in interior.<br />
<pre class="literal-block">nobs, k_vars = 50, 4
np.random.seed(85325783)
x = np.random.randn(nobs, k_vars)
y = x[:, :k_vars-2].sum(1) + np.random.randn(nobs)
</pre>
We start with the boring way of doing the sequential regression: use OLS from
statsmodels and loop with expanding number of explanatory variables in exog.<br />
Note, this uses the generalized inverse, <tt class="docutils literal">pinv</tt>, for each new matrix of
explanatory variables. I'm storing parameters, residual sum of squares and
information criteria to have a benchmark to compare against later.<br />
<pre class="literal-block">ssr_ols = np.zeros(k_vars)
params_ols = np.zeros((k_vars, k_vars))
ic_ols = np.zeros((2, k_vars))
for i in range(4):
res = OLS(y, x[:,:i+1]).fit()
ssr_ols[i] = res.ssr
params_ols[i, :i+1] = res.params
ic_ols[0, i] = res.aic
ic_ols[1, i] = res.bic
</pre>
In my example, the estimated coefficients are<br />
<pre class="literal-block">>>> params_ols
array([[ 0.7585129 , 0. , 0. , 0. ],
[ 1.00564191, 0.96414302, 0. , 0. ],
[ 1.00776594, 0.93035613, -0.10759121, 0. ],
[ 1.03379054, 0.91302697, -0.12998046, -0.08787965]])
</pre>
The first two coefficients are close to one, the second two coefficients are close to zero, all in the neighborhood of the true values of my simulated model.<br />
Instead of using <tt class="docutils literal">pinv</tt>, statsmodels also has the option to use <cite>QR</cite>
for estimating a linear model, with the basic code as the following.
This uses the <cite>QR</cite> decomposition of the matrix of explanatory variables.<br />
<pre class="literal-block">q, r = np.linalg.qr(x)
qy = np.dot(q.T, y)
print '\nparams full model'
print params_ols[-1]
print np.linalg.solve(r, qy)
</pre>
It gives the same result as the <tt class="docutils literal">pinv</tt> version<br />
<pre class="literal-block">params full model
[ 1.03379054 0.91302697 -0.12998046 -0.08787965]
[ 1.03379054 0.91302697 -0.12998046 -0.08787965]
</pre>
We already have the <cite>QR</cite> decomposition for the full model. QR` calculates
the orthogonalization using the sequence as it is given by the matrix of
explanatory variables.<br />
My first attempt was to find the OLS parameters sequentially using the
appropriate subset or slices of the <cite>QR</cite> matrices<br />
<pre class="literal-block">params1 = np.zeros((k_vars, k_vars))
for i in range(4):
params1[i, :i+1] = np.linalg.solve(r[:i+1, :i+1], qy[:i+1])
</pre>
This gives us the same parameter estimates as the OLS loop. The computations
are on much smaller arrays than the original problem, <tt class="docutils literal">r</tt> is <tt class="docutils literal">(k_vars, k_vars)</tt>
and <tt class="docutils literal">qy</tt> is just one dimensional with length <tt class="docutils literal">k_vars</tt>.<br />
There is still one loop, although a much smaller one, and I want to get rid of it.
The following is trivial once we know what's going on, but it took me a while to
figure this out. Some observations about <cite>QR</cite>:<br />
<blockquote>
<ul class="simple">
<li><tt class="docutils literal">q</tt> is the orthogonal matrix that spans the same space as <tt class="docutils literal">x</tt>.</li>
<li><tt class="docutils literal">theta = r beta = qy</tt> is the parameter vector in the orthogonalized problem.</li>
<li><tt class="docutils literal">beta = inv(r) theta</tt> is the parameter vector for the original regressors.</li>
</ul>
</blockquote>
All we need to do is to select the sequentially increasing subsets of <tt class="docutils literal">r</tt> and <tt class="docutils literal">theta</tt>.
For calculating predicted values, we can work directly in the orthogonalized system.<br />
Using an upper triangular matrix of ones<br />
<pre class="literal-block">upp_tri = (np.arange(4)[:,None] <= np.arange(4)).astype(float)
</pre>
we can get the expanding <tt class="docutils literal">theta</tt> or <tt class="docutils literal">qy</tt> in a matrix<br />
<pre class="literal-block">>>> upp_tri * qy[:,None]
array([[ 4.87045847, 4.87045847, 4.87045847, 4.87045847],
[-0. , -6.94085588, -6.94085588, -6.94085588],
[ 0. , 0. , 0.70410517, 0.70410517],
[ 0. , 0. , 0. , 0.60930516]])
</pre>
Then, we can just solve the system for all steps at once. First, I tried
the matrix inverse since that's easier to understand theoretically<br />
<pre class="literal-block">params2 = np.dot(np.linalg.inv(r), upp_tri * qy[:,None]).T
</pre>
But, we can avoid the inverse and use <tt class="docutils literal">linalg.solve</tt> directly:<br />
<pre class="literal-block">params2 = np.linalg.solve(r, upp_tri * qy[:,None]).T
</pre>
<b>That's it</b>, one line and I have the parameters for the four sequential
regression problems. Comparing the parameter estimates of the three methods,
we see they are the same (at floating point precision)<br />
<pre class="literal-block">>>> np.max(np.abs(params1 - params2))
0.0
>>> np.max(np.abs(params1 - params_ols))
5.5511151231257827e-16
</pre>
The rest is calculating fitted values and residuals<br />
<pre class="literal-block">contrib = q * qy
fitted = contrib.cumsum(1)
resids = y[:,None] - fitted
ssrs = (resids**2).sum(0)
print '\nresidual sum of squares ssr'
print ssrs
print ssr_ols
</pre>
which gives us identical residual sum of squares, <tt class="docutils literal">ssr</tt><br />
<pre class="literal-block">[ 80.31708213 32.14160173 31.64583764 31.27458486]
[ 80.31708213 32.14160173 31.64583764 31.27458486]
</pre>
I added some asserts in the script, so I don't break anything by accident<br />
<pre class="literal-block">assert_almost_equal(ssrs, ssr_ols, decimal=13)
assert_almost_equal(params1, params_ols, decimal=13)
assert_almost_equal(params2, params_ols, decimal=13)
</pre>
The applications, that I have in mind for this, are selection of the
number of regressors or lags in the time series case, so I want to
calculate the information criteria, <tt class="docutils literal">aic</tt> and <tt class="docutils literal">bic</tt>. In the previous result, I
had calculated the residual sum of squares, which I can feed to
helper functions that I had written to calculate information criteria.<br />
The values cannot be directly compared to the results from OLS,
since <tt class="docutils literal">aic</tt> and <tt class="docutils literal">bic</tt> reported by the OLS Results instance are
based on the likelihood value, while here I use the <tt class="docutils literal">ssr</tt> based
definition, which uses a different normalization and drops a constant term.
However, the minimum is achieved at two variables for all cases, which was
expected since I selected the simulated model for this.<br />
<pre class="literal-block">print '\naic, bic'
print ic_ols
import statsmodels.tools.eval_measures as evm
dfmodelwc = range(1, k_vars+1) #number of variables in regression
aics = [evm.aic_sigma(ssrs[i], nobs, dfmodelwc[i]) for i in range(k_vars)]
bics = [evm.bic_sigma(ssrs[i], nobs, dfmodelwc[i]) for i in range(k_vars)]
print aics
print bics
print 'aic, bic minimum number of variables'
print np.argmin(ic_ols, 1) + 1
print np.argmin(aics) + 1, np.argmin(aics) + 1
</pre>
prints<br />
<pre class="literal-block">aic, bic
[[ 167.59181941 123.80026281 125.02303444 126.43299217]
[ 169.50384241 127.62430882 130.75910345 134.08108419]]
[4.4259823271765022, 3.5501511951835187, 3.5746066277468964, 3.6028057824070068]
[4.4642227872850651, 3.6266321154006445, 3.689328008072585, 3.7557676228412582]
aic, bic minimum number of variables
[2 2]
2 2
</pre>
PS: Since I never had any numerical linear algebra training, I can still get excited about
figuring out a two-liner. I could have tried to find a book for it, but that would
have been boring compared to working it out on my own.<br />
PS: Following pep-8, I'm used to not using capital letters very often.</div>
</div>Josef Perktoldnoreply@blogger.com6tag:blogger.com,1999:blog-1909863961516003719.post-76604774917694451232012-06-15T17:54:00.002-07:002012-06-17T21:21:18.961-07:00Non-linear dependence measures - Distance Correlation, Kendall's Tau and Mutual Information<editorial: a="" blogger="" directly="" in="" is="" messy.="" my="" post="" restructured="" text?="" where's="" writing=""> </editorial:><br />
<br />
An ongoing unfinished project of mine is to look at dependency measures between to random variables if we observe two samples.<br />
<br />
<a href="http://en.wikipedia.org/wiki/Pearson_correlation_coefficient">Pearson's correlation</a> is the most commonly used measure. However it has the disadvantage that it can only measure linear (affine) relationship between the two variables. Scipy.stats also has <a href="http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient">Spearman's rho</a> and <a href="http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient">Kendall's tau</a> which both measure monotonic relationship. Neither of them can capture all types of non-linear dependency.<br />
<br />
There exist measures that can capture any type of non-linear relationship, <a href="http://en.wikipedia.org/wiki/Mutual_information">mutual information</a> is one of them and the first that I implemented as an experimental version in the <a href="https://github.com/statsmodels/statsmodels/blob/master/statsmodels/sandbox/distributions/mv_measures.py">statsmodels sandbox</a>.<br />
Another one that has been popular in econometrics for some time is <a href="http://en.wikipedia.org/wiki/Hellinger_distance">Hellinger distance</a>, which is a proper metric, for example <a href="http://onlinelibrary.wiley.com/doi/10.1111/j.1467-9892.1994.tb00200.x/abstract">Granger, Lin 1994</a>.<br />
<br />
Today, Satrajit and Yaroslav had a short discussion about <a href="http://en.wikipedia.org/wiki/Distance_correlation">Distance correlation</a> on the scikit-learn mailing list. I had downloaded some papers about it a while ago, but haven't read them yet. However, the Wikipedia link in the mailing list discussion looked reasonably easy to implement. After fixing a silly bug copying from my session to a script file, I get the same results as R. (Here's my <a href="https://gist.github.com/2938402">gist</a>.)<br />
<br />
Yaroslav has a more feature rich implementation in pymvpa, and Satrajit is working on a version for scikit-learn. <br />
<br />
I'm looking only at two univariate variables, a possible benefit of Distance correlation is that it naturally extends to two multivariate variables, while mutual information looks, at least computationally more difficult, and I have not seen a multivariate version of Hellinger's Distance yet.<br />
<br />
But I wanted to see examples, so I wrote something that looks similar to the graphs in <a href="http://en.wikipedia.org/wiki/Distance_correlation">Wikipedia - Distance_correlation.</a> . (They are not the same examples, since I only looked briefly at the source of the Wikipedia plots after writing mine.) Compared to mutual information, distance correlation shows a much weaker correlation or dependence between the two variables, while Kendall's tau doesn't find any correlation at all (since none of the relationships are monotonic).<br />
<br />
Note: I didn't take the square root in the Distance correlation, taking the square root as the <i>energy</i> package in R, I get<br />
<br />
<div style="font-family: "Courier New",Courier,monospace;">
>>> np.sqrt([0.23052, 0.0621, 0.03831, 0.0042])<br />
array([ 0.48012498, 0.24919872, 0.19572941, 0.06480741])</div>
<br />
<a href="http://goo.gl/photos/TL8Wq8qPa9" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://lh6.googleusercontent.com/-YkywVEqBQEQ/T9vGhjrUIII/AAAAAAAAAek/TaJ77KQgoHk/s512/dependence_measures.png" /> </a><br />
Now, what is this useful for. There main reason that I started to look in this direction was related to tests of independence of random variables. In the Gaussian case, zero correlation implies independence, but this is not the case for other distributions. There are many statistical tests that the Null Hypothesis that two variables are independent, the ones I started to look at were based on <a href="http://en.wikipedia.org/wiki/Copula_%28probability_theory%29">copulas</a>. For goodness-of-fit tests there are many measures of divergence, that define a distance between two probability function, which can similarly be used to measure the distance between the estimated joint probability density function and the density under the hypothesis of independence.<br />
<br />
While those are oriented towards testing, looking at the measures themselves gives a quantification of the strength of the relationship. For example, auto- or cross-correlation in time series analysis only measures linear dependence. The non-linear correlation measures are also able to capture dependencies that will be hidden to Pearson's correlation or also to Kendall's tau.<br />
<br />
PS: There are more interesting things to do, than time for it.Josef Perktoldnoreply@blogger.com15tag:blogger.com,1999:blog-1909863961516003719.post-85383127593062955622012-05-10T10:55:00.001-07:002012-05-10T19:54:02.247-07:00Regression Plots - Part 1<div class="document" id="regression-plots-part-1">
I started to work on improving the documentation for the regressions plot in statsmodels. (However, I realized I have to improve them a bit.)
<br/>
<br/>
For now, just a question: <i>Can you spot the mis-specification of the model?</i>
<br/>
<br/>
I simulate a model, run a linear regression on three variables and a constant. Here is the estimation summary, which looks quite good, large R-squared, all variables significant, no obvious problems:
<pre class="literal-block">>>> print res.summary()
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.901
Model: OLS Adj. R-squared: 0.898
Method: Least Squares F-statistic: 290.3
Date: Thu, 10 May 2012 Prob (F-statistic): 5.31e-48
Time: 13:15:22 Log-Likelihood: -173.85
No. Observations: 100 AIC: 355.7
Df Residuals: 96 BIC: 366.1
Df Model: 3
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1 0.4872 0.024 20.076 0.000 0.439 0.535
x2 0.5408 0.045 12.067 0.000 0.452 0.630
x3 0.5136 0.030 16.943 0.000 0.453 0.574
const 4.6294 0.372 12.446 0.000 3.891 5.368
==============================================================================
Omnibus: 0.945 Durbin-Watson: 1.570
Prob(Omnibus): 0.624 Jarque-Bera (JB): 1.031
Skew: -0.159 Prob(JB): 0.597
Kurtosis: 2.617 Cond. No. 33.2
==============================================================================
</pre>
The following three graphs are refactored versions of the regression plots. Each graph looks at the data and estimation results with respect to one of the three variables. (The graphs look better in original size.)
<br/>
The short lines in the first subplot of each graph are the prediction confidence intervals for each observation.
<a href="http://goo.gl/photos/UhPCnQjatR" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="305" src="https://lh6.googleusercontent.com/-A2niHa-I4_k/T6v58S6ZlUI/AAAAAAAAAYE/uOqKYGKeF_o/s0/regressionplot0.png" width="640" /></a>
<a href="http://goo.gl/photos/5XAYagHCc3" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="305" src="https://lh5.googleusercontent.com/-p4VWQsiMmjA/T6v59Hs1EpI/AAAAAAAAAYQ/GDvfWvYDEXk/s0/regressionplot1.png" width="640" /></a>
<a href="http://goo.gl/photos/KjmJomvbxI" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="305" src="https://lh4.googleusercontent.com/-DezL0mCzY9w/T6v59O7AljI/AAAAAAAAAYM/_4KzbOe8tp4/s0/regressionplot2.png" width="640" /></a>
<br/>
The code is short, if we have the (still unpublished) helper functions.
<br/>
<cite>res</cite> is an OLS results instance
<pre class="literal-block">from regressionplots_new import plot_regress_exog
fig9 = plot_regress_exog(res, exog_idx=0)
add_lowess(fig9, ax_idx=1, lines_idx=0)
add_lowess(fig9, ax_idx=2, lines_idx=0)
add_lowess(fig9, ax_idx=3, lines_idx=0)
fig10 = plot_regress_exog(res, exog_idx=1)
add_lowess(fig10, ax_idx=1, lines_idx=0)
add_lowess(fig10, ax_idx=2, lines_idx=0)
add_lowess(fig10, ax_idx=3, lines_idx=0)
fig11 = plot_regress_exog(res, exog_idx=2)
add_lowess(fig11, ax_idx=1, lines_idx=0)
add_lowess(fig11, ax_idx=2, lines_idx=0)
add_lowess(fig11, ax_idx=3, lines_idx=0)
</pre>
</div>Josef Perktoldnoreply@blogger.com0tag:blogger.com,1999:blog-1909863961516003719.post-32166185444355650882012-05-08T10:07:00.000-07:002012-05-10T17:44:13.122-07:00Plots in statsmodels: qqplot<div class="document" id="plots-in-statsmodels-qqplot">
Other news first, since I haven't managed to catch up with the blogs:<br />
<blockquote>
<ul class="simple">
<li>statsmodels has four students in GSoC, the first four projects described in my previous <a class="reference external" href="http://jpktd.blogspot.ca/2012/04/statsmodels-and-google-summer-of-code.html">post</a>. Congratulations to Alexandre, Divyanshu, George and Justin</li>
<li>statsmodels 0.4.0 has been release with new name without scikits in front, more on <a class="reference external" href="http://pypi.python.org/pypi/scikits.statsmodels">pypi</a></li>
</ul>
</blockquote>
statsmodels has a <a class="reference external" href="http://statsmodels.sourceforge.net/devel/graphics.html">graphics subdirectory</a>, where we started to collect some of the common statistical plots. To make the documentation a bit more exciting, I am adding plots directly to the docstrings for the individual functions. Currently, we don't have many of them in the online documentation yet, two examples <a class="reference external" href="http://statsmodels.sourceforge.net/devel/generated/statsmodels.graphics.boxplots.violinplot.html">violin_plot</a> and <a class="reference external" href="http://statsmodels.sourceforge.net/devel/generated/statsmodels.graphics.boxplots.beanplot.html">bean_plot</a>.<br />
A note on the documentation: Skipper improved the <a class="reference external" href="http://statsmodels.sourceforge.net/">frontpage</a>, which makes it easier to find the documentation for the latest released version and for the development version. Currently, the development version is better and is improving, and it is incompatible with the 0.4.0 release in only one part.<br />
<div class="section" id="quantile-quantile-plot-qqplot">
<h4>
quantile-quantile plot: qqplot</h4>
The documentation for the function is <a class="reference external" href="http://www.blogger.com/statsmodels.sourceforge.net/devel/generated/statsmodels.graphics.gofplots.qqplot.html">here</a>. The function signature is<br />
<pre class="literal-block">qqplot(data, dist=stats.norm, distargs=(), a=0, loc=0, scale=1, fit=False, line=False, ax=None)
</pre>
I am not copying the entire docstring, what I would like to present here are some examples and how to work with the plots.<br />
The first example is from the docstring. I don't like the default, so I kept adding keyword arguments until the plot is more to my taste.<br />
<blockquote>
<ul>
<li><div class="first">
The first plot uses no keywords and assumes normal distribution, and does not standardize the data.</div>
</li>
<li><div class="first">
The second plot adds <cite>line='s'</cite>, which according to the docstring</div>
<pre class="literal-block">'s' - standardized line, the expected order statistics are scaled
by the standard deviation of the given sample and have the mean
added to them
</pre>
corresponds to the line after fitting location and scale for the normal distribution</li>
<li><div class="first">
The third plot adds <cite>fit=True</cite> to get standardized sample quantiles and plots the 45 degree line. That's the plot I would prefer.</div>
</li>
<li><div class="first">
The fourth plot is similar to the third plot, but with the t distribution instead of the normal distribution. I was surprised that the third and fourth plot look the same, until I checked and it turned out that the fitted t distribution has a huge degrees of freedom parameter and so is essentially identical to the normal distribution.</div>
</li>
</ul>
</blockquote>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://lh3.googleusercontent.com/-zYnK8P3AQhg/T6lQq9Q-SPI/AAAAAAAAAXE/e6135A__kSQ/s720/qqplot1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="432" src="https://lh3.googleusercontent.com/-zYnK8P3AQhg/T6lQq9Q-SPI/AAAAAAAAAXE/e6135A__kSQ/s640/qqplot1.png" width="640" /></a></div>
<br />
<br />
I will go over the code to produce this below.<br />
I started the second example to see whether fitting the t distribution works correctly. Instead of using real data, I generate 1000 observations with a t distribution with df=4 and standard location(<cite>0</cite>) and scale (<cite>1</cite>).<br />
<blockquote>
<ul class="simple">
<li>The first plot fits a normal distribution, keywords: <cite>line='45', fit=True</cite></li>
<li>The second plot fits the t distribution, keywords: <cite>dist=stats.t, line='45', fit=True</cite></li>
<li>The third plot is the same as the second plot, but I fit the t distribution myself, instead of having qqplot do it. keywords: <cite>dist=stats.t, distargs=(dof,), loc=loc, scale=scale, line='45'</cite>. I added the estimated parameters into a text insert in the plot. qqplot currently doesn't tell us what the fitted parameters are.</li>
</ul>
</blockquote>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://lh4.googleusercontent.com/-i8u0En5LVD0/T6lQrE5m9PI/AAAAAAAAAXM/As9lokXub_4/s640/qqplot2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="475" src="https://lh4.googleusercontent.com/-i8u0En5LVD0/T6lQrE5m9PI/AAAAAAAAAXM/As9lokXub_4/s640/qqplot2.png" width="640" /></a></div>
<br />
<br /></div>
<div class="section" id="the-code">
<h4>
The Code</h4>
Here was my first attempt, following the docstring example<br />
<pre class="literal-block">from scipy import stats
import statsmodels.api as sm
#estimate to get the residuals
data = sm.datasets.longley.load()
data.exog = sm.add_constant(data.exog)
mod_fit = sm.OLS(data.endog, data.exog).fit()
res = mod_fit.resid
fig = sm.graphics.qqplot(res, dist=stats.t, line='45', fit=True)
fig.show()
</pre>
It works but the x-axis goes from -3 to 3, while there are only values from -2 to 2.<br />
<div class="section" id="detour-to-some-background">
<h5>
Detour to some background</h5>
A while ago we had a discussion on the mailing list what a plot in statsmodels should return. With the helpful comments of John Hunter, we finally agreed that plots should take an <cite>ax</cite> (matplotlib axis) argument if it's meaningful, and always return a figure instance <cite>fig</cite>. If <cite>ax</cite> is None, or the plot is a combination plot (several plots in one figure), then a figure is created and returned. If <cite>ax</cite> is given, then that is used to attach the plot elements. Ralf Gommers converted our plot functions to follow this pattern, besides that, he also wrote several of the plots that are currently in statsmodels.<br />
So, to change the axis limits in the above graph, all I have to add is:<br />
<pre class="literal-block">fig.axes[0].set_xlim(-2, 2)
</pre>
The resulting plot is then the same as the third plot in the first graph above.</div>
<div class="section" id="the-first-graph">
<h5>
The first graph</h5>
Here is now the script for the first graph in several stages:<br />
First I import some modules and calculate the residuals following the example<br />
<pre class="literal-block">from scipy import stats
from matplotlib import pyplot as plt
import statsmodels.api as sm
#example from docstring
data = sm.datasets.longley.load()
data.exog = sm.add_constant(data.exog)
mod_fit = sm.OLS(data.endog, data.exog).fit()
res = mod_fit.resid
</pre>
Then I hardcode a left position for text inserts, and create a matplotlib figure instance<br />
<pre class="literal-block">left = -1.8
fig = plt.figure()
</pre>
Next we can add the first subplot. The only keyword arguments for qqplot is <cite>ax</cite> to tell qqplot to attach the plot to my first subplot. Since I want to insert a text to describe the keywords, I needed to spend some time with the matplotlib documentation. As we have a reference to the axis instance, it is easy to change or add plot elements<br />
<pre class="literal-block">ax = fig.add_subplot(2, 2, 1)
sm.graphics.qqplot(res, ax=ax)
top = ax.get_ylim()[1] * 0.75
txt = ax.text(left, top, 'no keywords', verticalalignment='top')
txt.set_bbox(dict(facecolor='k', alpha=0.1))
</pre>
The other subplots follow the same pattern. I didn't try to generalize or avoid hardcoding<br />
<pre class="literal-block">ax = fig.add_subplot(2, 2, 2)
sm.graphics.qqplot(res, line='s', ax=ax)
top = ax.get_ylim()[1] * 0.75
txt = ax.text(left, top, "line='s'", verticalalignment='top')
txt.set_bbox(dict(facecolor='k', alpha=0.1))
ax = fig.add_subplot(2, 2, 3)
sm.graphics.qqplot(res, line='45', fit=True, ax=ax)
ax.set_xlim(-2, 2)
top = ax.get_ylim()[1] * 0.75
txt = ax.text(left, top, "line='45', \nfit=True", verticalalignment='top')
txt.set_bbox(dict(facecolor='k', alpha=0.1))
ax = fig.add_subplot(2, 2, 4)
sm.graphics.qqplot(res, dist=stats.t, line='45', fit=True, ax=ax)
ax.set_xlim(-2, 2)
top = ax.get_ylim()[1] * 0.75
txt = ax.text(left, top, "dist=stats.t, \nline='45', \nfit=True",
verticalalignment='top')
txt.set_bbox(dict(facecolor='k', alpha=0.1))
</pre>
The final step is to adjust the layout, so that axis labels don't overlap with other subplots if the graph is not very large<br />
<pre class="literal-block">fig.tight_layout()
</pre>
</div>
<div class="section" id="the-second-graph">
<h5>
The second graph</h5>
The second graph follows the same pattern with a few changes.<br />
First we generate a random sample using scipy.stats which under the hood uses the random numbers from numpy. You can notice here that I am cheating. I ran the script several times to find "nice" seeds. Especially in smaller samples, qqplot might often not be very good in distinguishing normal and t distributions.<br />
<pre class="literal-block">import numpy as np
seed = np.random.randint(1000000)
print 'seed', seed
seed = 461970 #nice seed for nobs=1000
#seed = 571478 #nice seed for nobs=100
#seed = 247819 #for nobs=100, estimated t is essentially normal
np.random.seed(seed)
rvs = stats.t.rvs(4, size=1000)
</pre>
The first two subplot are very similar to what is in the first graph<br />
<pre class="literal-block">fig2 = plt.figure()
ax = fig2.add_subplot(2, 2, 1)
fig2 = sm.graphics.qqplot(rvs, dist=stats.norm, line='45', fit=True, ax=ax)
top = ax.get_ylim()[1] * 0.75
xlim = ax.get_xlim()
frac = 0.1
left = xlim[0] * (1-frac) + xlim[1] * frac
txt = ax.text(left, top, "normal", verticalalignment='top')
txt.set_bbox(dict(facecolor='k', alpha=0.1))
ax = fig2.add_subplot(2, 2, 2)
fig2 = sm.graphics.qqplot(rvs, dist=stats.t, line='45', fit=True, ax=ax)
top = ax.get_ylim()[1] * 0.75
xlim = ax.get_xlim()
frac = 0.1
left = xlim[0] * (1-frac) + xlim[1] * frac
txt = ax.text(left, top, "t", verticalalignment='top')
txt.set_bbox(dict(facecolor='k', alpha=0.1))
</pre>
For the third plot, I estimate the parameters of the t-distribution to see whether I get the same results as in the second plot (I do), and so I can insert the parameter estimates into the plot<br />
<pre class="literal-block">params = stats.t.fit(rvs)
dof, loc, scale = params
ax = fig2.add_subplot(2, 2, 4)
fig2 = sm.graphics.qqplot(rvs, dist=stats.t, distargs=(dof,), loc=loc,
scale=scale, line='45', fit=False, ax=ax)
top = ax.get_ylim()[1] * 0.75
xlim = ax.get_xlim()
frac = 0.1
left = xlim[0] * (1-frac) + xlim[1] * frac
txt = ax.text(left, top, "t \ndof=%3.2F \nloc=%3.2F, \nscale=%3.2F" % tuple(params),
verticalalignment='top')
txt.set_bbox(dict(facecolor='k', alpha=0.1))
</pre>
That's it for the plots, now I need to add them to the statsmodels documentation.</div>
</div>
<div class="section" id="ps-normality-tests-details-left-for-another-day">
<h4>
PS: normality tests, details left for another day</h4>
qqplots give us a visual check whether a sample follows a specific distribution. The case that we are interested in most often, is a test for normality. Scipy.stats and statsmodels have several normality tests. The ones I have written recently are Anderson-Darling and Lillifors. Lillifors is a Kolmogorov-Smirnov test for normality when mean and variance are estimated. Calculating a statistical test provides a more reliable test than a "vague" visual inspection, but these tests do not point us to a specific alternative and provide less information about the direction in which a null hypothesis might be incorrect.<br />
Using the residuals in the first example, neither test rejects the Null Hypothesis that the residuals come from a normal distribution<br />
<pre class="literal-block">>>> normal_ad(res)
(0.43982328207860633, 0.25498161947268855)
>>> lillifors(res)
(0.17229856392873188, 0.2354638181341876)
</pre>
On the other hand, in the second example with 1000 observations from the t distribution, the assumption that the data comes from a normal distribution is clearly rejected<br />
<pre class="literal-block">>>> normal_ad(rvs)
(6.5408483355136013, 4.7694160497092537e-16)
>>> lillifors(rvs)
(0.05919821253474411, 8.5872265678140885e-09)
</pre>
<b>PPS:</b><br />
I'm reluctant to publish the import path, because I had forgotten to add them to a proper place for 0.4.0, and the import location will not stay where it is. It took me a few minutes to find out that they are not on any recommended import path when I wrote these scripts<br />
<pre class="literal-block">>>> from statsmodels.stats.adnorm import normal_ad
>>> from statsmodels.stats.lilliefors import lillifors
</pre>
</div>
</div>Josef Perktoldnoreply@blogger.com0tag:blogger.com,1999:blog-1909863961516003719.post-18778066179066142682012-04-11T09:24:00.002-07:002012-04-11T09:25:49.692-07:00Statsmodels and Google Summer of Code 2012<div class="document" id="statsmodels-and-google-summer-of-code-2012"><p>I didn't have much time or motivation to work on my blog these last weeks, mainly because I was busy discussing Google Summer of Code and preparing a new release for statsmodels.</p><p>So here is just an update on our Google Summer of Code candidates and their projects. This year was a successful year in attracting student proposals. We have six proposals, five of them we discussed quite extensively on our mailing list before the application.</p><p>Of the five projects, the first two are must-haves for econometrics or statistical packages, one on System of Equations, the other on Nonlinear Least-Squares and Nonlinear Robust Models. The next two are nonparametric or semi-parametric methods, one more traditional kernel estimation, the other using Empirical Likelihood which is a relatively new approach that has become popular in recent research both in econometrics and in statistics. The fifth is on Dynamic Linear Models mainly using Kalman filter and a Bayesian approach, which would extend the depth of statsmodels in time series analysis.</p><p>All topics would be valuable extensions to statsmodels and significantly increase our coverage of statistics and econometrics. From the discussion on the mailing list I think that all candidates are qualified to bring the projects to a successful finish.</p><div class="section" id="estimating-system-of-equations"><h4>Estimating System of Equations</h4><p>This is a standard econometrics topic, but I only recently found that graphical models and causal models discussed in other fields have a large overlap with this. In the case of a system of simultaneous equations, we have several variables that depend on each other. The simplest case in economics is a market equilibrium, where the demanded and supplied quantities depend on the price, and the price depends on the supply and demand. The estimation methods commonly used in this area are two-stage and three-stage least-squares and limited and full information maximum likelihood estimation. The first part of the project starts with the simpler case when we have several response variables, but they don't depend on each other simultaneously, although they can depend on past values of other response variables. I'm very glad that someone is picking this one up.</p></div><div class="section" id="extension-of-linear-to-non-linear-models"><h4>Extension of Linear to Non Linear Models</h4><p>This project has two parts, the first is extending the linear least-squares model to the non-linear case, the second part is to implement non-linear models for robust estimation. Non-linear least squares is available in scipy for example with scipy.optimize.curve_fit. However, in the statsmodels version we want to provide all the usual results statistics and statistical tests. The second part will implement two robust estimators for non-linear model, that have been shown to be most successful in recent Monte Carlo studies comparing different robust estimators for non-linear equations. Robust estimation here refers to the case when there are possibly many outliers in the data. My guess is that these will become the most used models of all proposals.</p></div><div class="section" id="nonparametric-estimation"><h4>Nonparametric Estimation</h4><p>This project extends the kernel based method in statsmodels from the univariate to the multivariate case, will provide better bandwidth selection, and then implement nonparametric function estimation. Multivariate kernel density estimation should complement scipy.stats.gaussian_kde which only works well with distributions that are approximately normal shaped or have only a single peak. Another extension is to provide kernels and estimation methods for discrete variables. These methods have been on our wishlist for a while, but only the univariate case has been included in statsmodels so far.</p></div><div class="section" id="empirical-likelihood"><h4>Empirical Likelihood</h4><p>This is a relatively new approach in statistics and econometrics that avoids the distributional assumptions in estimation and in statistical tests. Instead of relying on a known distribution in small samples, where we often assume normal distribution, or instead of relying on the asymptotic normal distribution in large samples, this approach estimates the distribution in a nonparametric way. This is similar, to some extend, to the difference between, for example, a t-test and a rank-based Mann–Whitney U or Wilcoxon test, which are available in scipy.stats. The advantages are that in smaller samples the estimates and tests are more accurate when the distribution is not known, and in many cases, for example in finance, most tests show that the assumption of normal distribution does not hold. For this project, I still have to catch up with some readings because I'm only familiar with a small part of this literature, mainly on empirical likelihood in relation to Generalized Method of Moments (GMM).</p></div><div class="section" id="dynamic-linear-models"><h4>Dynamic Linear Models</h4><p>This covers statespace models implemented by Kalman Filter for multivariate time series models, both from a likelihood and a Bayesian perspective. The project expands the coverage of statsmodels in linear time series analysis, the first area where we get a good coverage of models. Currently, we have univariate AR and ARIMA, vector autoregressive models VAR, and structural VAR. Part of this project would be to get a good cython based implementation of Kalman filters. Wes has started a libray, statlib, for this last year, however, it is still incomplete and needs to be integrated with statsmodels. Another advantage of this project is that it increases our infrastructure and models for macro-econometrics, estimation of macroeconomic models and dynamic stochastic general equilibrium DSGE models, which is currently still Matlab dominated, as far as I can tell.</p><p>Now we still have to see how many GSoC slots we will get, but we have the chance this year to get a large increase in the speed of development of statsmodels, and we can reduce the number of cases where someone needs to run to R, or Stata, or Matlab because there is no implementation for a statistical analysis available in Python.</p></div></div>Josef Perktoldnoreply@blogger.com0tag:blogger.com,1999:blog-1909863961516003719.post-48230302547179122312012-03-04T11:12:00.000-08:002012-03-04T17:46:05.726-08:00Numerical Accuracy in Linear Least Squares and Rescaling<div class="document" id="numerical-accuracy-in-linear-least-squares-and-rescaling"> <div class="section" id="the-problem"><h4>The Problem</h4><p>(Warning upfront: there are problems to replicate this, more below)</p><p>A week ago, I stumbled on this <a class="reference external" href="http://en.wikibooks.org/wiki/Statistics:Numerical_Methods/Numerical_Comparison_of_Statistical_Software#Linear_Regression">Numerical_Comparison_of_Statistical_Software</a> which presents some test results for numerical accuracy of statistical packages.</p><p>For linear regression, there is one test, Longley, that we have in the datasets in statsmodels. But I wanted to look at Filip which sounds like a difficult case, neither SAS nor SPSS produced a solution. Let's see how badly statsmodels and numpy are doing, or maybe not.</p><p>The model is a polynomial of degree 10. Description, data, certified values and a plot are on the NIST website <a class="reference external" href="http://www.itl.nist.gov/div898/strd/lls/data/Filip.shtml">here</a></p><pre class="literal-block">1 Predictor Variable
82 Observations
Higher Level of Difficulty
Model: Polynomial, 11 Parameters
</pre><p>I parsed the data into an array <cite>dta</cite>, first column is the endogeous, <cite>y</cite>, variable second column is the exogenous, <cite>x</cite>, variable. I saved <cite>y</cite> in <cite>endog</cite>. I also parsed the main NIST result in <cite>params_nist</cite>, first column parameters, second column their standard deviation.</p></div><div class="section" id="fitting-a-polynomial"><h4>Fitting a Polynomial</h4><p>Since it is a polynomial problem, let us treat it as such and use the polynomials from numpy.</p><p>First try, use the old polyfit function</p><pre class="literal-block">>>> p_params = np.polyfit(dta[:,1], endog, 10)[::-1]
>>> p_params
array([-1467.48963361, -2772.17962811, -2316.37111156, -1127.97395547,
-354.47823824, -75.1242027 , -10.87531817, -1.062215 ,
-0.06701912, -0.00246781, -0.0000403 ])
>>> log_relative_error(p_params, params_nist[:,0])
array([ 7.87929761, 7.88443445, 7.88840683, 7.89138269, 7.89325784,
7.89395336, 7.89341841, 7.89162977, 7.88859034, 7.88432427,
7.87887292])
</pre><p>Not bad, following the description on the Wikipedia page, I wrote a function <cite>log_relative_error</cite> that tells us how many significant digits agreement is between the two arrays. <cite>polyfit</cite> agrees at 7 to 8 significant digits, that's about the same as S-Plus on the Wikipedia page.</p><p>Let's work properly with polynomials and use the new polynomial package in numpy. Charles Harris wrote it and is still expanding and improving it.</p><pre class="literal-block">>>> poly = np.polynomial.Polynomial.fit(dta[:,1],endog, 10)
>>> poly
Polynomial([ 0.88784146, 0.10879327, -0.53636698, 0.28747072, 2.20567367,
-1.31072158, -4.21841581, 1.76229897, 3.8096025 , -0.77251557,
-1.30327368], [-8.78146449, -3.13200249])
</pre><p>Oops, these numbers don't look like the NIST numbers. The last numbers, [-8.78146449, -3.13200249], show the domain of the polynomial, our values have been transformed. A bit of introspection, and we figure out how to change the domain. To get the "standard" representation, we can transform the domain back to the standard domain (-1, 1).</p><pre class="literal-block">>>> poly.convert(domain=(-1,1))
Polynomial([-1467.48961423, -2772.17959193, -2316.37108161, -1127.97394098,
-354.4782337 , -75.12420174, -10.87531804, -1.06221499,
-0.06701912, -0.00246781, -0.0000403 ], [-1., 1.])
</pre><p>Now, this looks more like NIST, it even agrees at 13 to 14 significant digits</p><pre class="literal-block">>>> log_relative_error(poly.convert(domain=(-1,1)).coef, params_nist[:,0])
array([ 13.72347502, 13.84056851, 13.81494335, 13.70878715,
13.78207216, 13.79374075, 13.6729684 , 13.71128925,
13.75445952, 13.68695573, 13.67736436])
</pre><p>Nice job Charles. No problem fitting this polynomial with numpy.</p></div><div class="section" id="linear-regression"><h4>Linear Regression</h4><p>In the previous part we knew we were fitting a polynomial, but lets treat it just as a standard linear regression problem and use scikits.statsmodels.</p><p>First try: just create the design matrix in the simplest way and estimate</p><pre class="literal-block">>>> exog0 = dta[:,1:]**np.arange(11)
>>> res0 = sm.OLS(endog, exog0).fit()
>>> res0.params
array([ 8.443046917097718, 1.364996059973237, -5.350750046084954,
-3.34190287892045 , -0.406458629495091, 0.257727311296307,
0.119771653524165, 0.023140891929892, 0.002403995206457,
0.000131618839544, 0.000002990001222])
>>> log_relative_error(res0.params, params_nist[:,0])
array([-0.002491507096328, -0.000213790029725, 0.00100436814061 ,
0.001288615104161, 0.000498264786078, -0.00148737673275 ,
-0.004756810105056, -0.009359738327099, -0.015305377783833,
-0.022566206229652, -0.031085341541384])
</pre><p>Bummer, 0 significant digits, way off.</p><p>We can print the full summary of the results, maybe we see something</p><pre class="literal-block">>>> print res0.summary()
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.996
Model: OLS Adj. R-squared: 0.995
Method: Least Squares F-statistic: 2390.
Date: Sat, 03 Mar 2012 Prob (F-statistic): 1.85e-84
Time: 23:47:45 Log-Likelihood: 344.73
No. Observations: 82 AIC: -673.5
Df Residuals: 74 BIC: -654.2
Df Model: 7
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 8.4430 12.864 0.656 0.514 -17.189 34.075
x1 1.3650 6.496 0.210 0.834 -11.578 14.308
x2 -5.3508 9.347 -0.572 0.569 -23.974 13.273
x3 -3.3419 11.702 -0.286 0.776 -26.659 19.975
x4 -0.4065 5.923 -0.069 0.945 -12.209 11.396
x5 0.2577 1.734 0.149 0.882 -3.197 3.712
x6 0.1198 0.321 0.373 0.710 -0.520 0.759
x7 0.0231 0.038 0.604 0.548 -0.053 0.099
x8 0.0024 0.003 0.838 0.405 -0.003 0.008
x9 0.0001 0.000 1.072 0.287 -0.000 0.000
x10 2.99e-06 2.29e-06 1.303 0.197 -1.58e-06 7.56e-06
==============================================================================
Omnibus: 1.604 Durbin-Watson: 1.627
Prob(Omnibus): 0.449 Jarque-Bera (JB): 1.379
Skew: -0.317 Prob(JB): 0.502
Kurtosis: 2.961 Cond. No. -1.#J
==============================================================================
The smallest eigenvalue is -0.38. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
</pre><p>R square is 0.996, so we are fitting the curve pretty well, but our design matrix with the polynomial is not positive definite. There is even a negative eigenvalue. A negative eigenvalue sounds strange, a quadratic form shouldn't have them. Just to make sure that this is not a bug, check with numpy</p><pre class="literal-block">>>> np.linalg.eigvalsh(np.dot(exog0.T, exog0)).min()
-0.38006444279775781
>>> np.linalg.eigvals(np.dot(exog0.T, exog0)).min()
-0.00011161167956978682
</pre><p>I'm still suspicious, but I delay the detour into numpy's and scipy's linalg modules.</p><p>One more check of our regression results, the residual standard error is not very far away from the Nist numbers:</p><pre class="literal-block">>>> np.sqrt(res0.mse_resid), 0.334801051324544E-02,
(0.0038044343586352601, 0.0033480105132454399)
</pre><p><strong>Conclusion</strong>: If you try to fit a linear regression with a non-positive definite design matrix, then the parameters are not identified, but we can still get a good fit.</p><p>(Technical aside: statsmodels uses by default the generalized inverse, pinv, for linear regression. So it just drops the eigenvalues below a threshold close to zero. The parameter estimates will be closer to a penalized Ridge regression. But don't quote me on the last part since I don't remember where I read that pinv is the limit of a Ridge problem.)</p><p>The question for statsmodels is what to do about it.</p><p>One solution that works in this case, as we have seen with numpy polynomials, is to rescale the explanatory variables or design matrix. I'm showing one example below. My working title for this post was: <strong>Don't do this, or do we have to do it for you?</strong> Is it the responsibility of the user not to use a design matrix that numerically doesn't make much sense and we can only warn, or should we automatically transform the design matrix to make it numerically more stable. The latter will be costly and might not be required in 99% of the cases?</p><p>Another issue is that there are many different ways to do the linear algebra, and we have not investigated much what might work better or worse in different cases. See Addendum below for the effect that linear algebra can have in numerically unstable problems.</p></div><div class="section" id="rescaling-the-design-matrix"><h4>Rescaling the Design Matrix</h4><p>Our design matrix looks pretty bad, variables vary in a large range and the correlation is very high</p><blockquote><pre class="doctest-block">>>> np.set_printoptions(precision=3)
>>> print np.max(np.abs(exog0),0)
[ 1.000e+00 8.781e+00 7.711e+01 6.772e+02 5.947e+03 5.222e+04
4.586e+05 4.027e+06 3.536e+07 3.105e+08 2.727e+09]
</pre><pre class="doctest-block">>>> print np.corrcoef(exog0[:,1:], rowvar=0)
[[ 1. -0.991 0.969 -0.938 0.904 -0.87 0.838 -0.808 0.782 -0.758]
[-0.991 1. -0.993 0.975 -0.951 0.925 -0.899 0.874 -0.851 0.83 ]
[ 0.969 -0.993 1. -0.994 0.981 -0.963 0.943 -0.924 0.904 -0.886]
[-0.938 0.975 -0.994 1. -0.996 0.986 -0.973 0.958 -0.943 0.928]
[ 0.904 -0.951 0.981 -0.996 1. -0.997 0.99 -0.98 0.968 -0.957]
[-0.87 0.925 -0.963 0.986 -0.997 1. -0.998 0.992 -0.985 0.976]
[ 0.838 -0.899 0.943 -0.973 0.99 -0.998 1. -0.998 0.994 -0.988]
[-0.808 0.874 -0.924 0.958 -0.98 0.992 -0.998 1. -0.999 0.995]
[ 0.782 -0.851 0.904 -0.943 0.968 -0.985 0.994 -0.999 1. -0.999]
[-0.758 0.83 -0.886 0.928 -0.957 0.976 -0.988 0.995 -0.999 1. ]]
</pre></blockquote><p>Now we can use just the simplest transform, limit the maximum absolute value to be one:</p><pre class="literal-block">exog1 = exog0 / np.max(np.abs(exog0),0)
</pre><p>After running the regression on the rescaled design matrix, we see an agreement with the NIST benchmark results at around 7 to 8 significant digits both for the parameters and for the standard deviation of the parameter estimates, bse in statsmodels:</p><pre class="literal-block">>>> res1 = sm.OLS(endog, exog1).fit()
>>> params_rescaled = res1.params / np.max(np.abs(exog0), 0)
>>> log_relative_error(params_rescaled, params_nist[:,0])
array([ 7.419, 7.414, 7.409, 7.402, 7.394, 7.384, 7.373, 7.36 ,
7.346, 7.331, 7.314])
>>> bse_rescaled = res1.bse / np.max(np.abs(exog0),0)
>>> log_relative_error(bse_rescaled, params_nist[:,1])
array([ 8.512, 8.435, 8.368, 8.308, 8.255, 8.207, 8.164, 8.124,
8.089, 8.057, 8.028])
</pre><p>Also R squared and the standard deviation of the residuals (using appropriate degrees of freedom) agrees with the NIST numbers at around 10 and 7 digits, resp.</p><pre class="literal-block">>>> log_relative_error(res1.rsquared, 0.996727416185620)
10.040156510920081
>>> log_relative_error(np.sqrt(res1.mse_resid), 0.334801051324544E-02)
7.8575015681097371
</pre><p>So we are doing pretty well just with a simple rescaling of the variables. Although, the comment at the end of <cite>print res1.summary()</cite> still reports a smallest eigenvalue of -1.51e-15, so essentially zero. But I worry about this later. I looked initially at another way of rescaling the design matrix but didn't check yet how the choice of the rescaling will affect the results</p></div><div class="section" id="addendum-1-smallest-eigenvalue-of-ill-conditioned-array"><h4>Addendum 1: Smallest eigenvalue of ill-conditioned array</h4><p>Going back to the original design matrix without rescaling, define the moment matrix <cite>X'X</cite>:</p><pre class="literal-block">>>> xpx0 = np.dot(exog0.T, exog0)
</pre><p>the eigenvalues, assuming a symmetric matrix, are</p><pre class="literal-block">>>> np.sort(np.linalg.eigvalsh(xpx0))
array([ -3.79709e-01, 1.14869e-05, 4.40507e-03, 3.20670e+00,
7.91804e+02, 1.05833e+03, 3.98410e+05, 2.31485e+08,
4.28415e+11, 1.93733e+15, 5.17955e+19])
</pre><p>This looks very badly conditioned. the largest eigenvalue is 5e19, the smallest is "around zero".</p><p>We can compare different algorithms to calculate the smallest eigenvalues (<cite>splinalg</cite> is <cite>scipy.linalg</cite>)</p><pre class="literal-block">>>> np.sort(np.linalg.eigvals(xpx0))[:4]
array([ 3.41128e-04, 5.58946e-04, 1.23213e-02, 3.33365e+00])
>>> np.sort(splinalg.eigvalsh(xpx0))[:4]
array([ -2.14363e+03, -2.00323e-01, 1.26094e-05, 4.40956e-03])
>>> np.sort(splinalg.eigvals(xpx0))[:4]
array([ -3.66973e-05+0.j, 1.61750e-04+0.j, 7.90465e-03+0.j,
2.01592e+00+0.j])
>>> np.sort(np.linalg.svd(xpx0)[1])[:4]
array([ 2.84057e-05, 4.91555e-04, 7.28252e-03, 3.41739e+00])
>>> np.sort(splinalg.svd(xpx0)[1])[:4]
array([ 2.19202e-05, 7.11920e-04, 7.00790e-03, 3.28229e+00])
>>> np.sort(np.linalg.svd(exog0)[1]**2)[:4]
array([ 1.65709e-11, 3.08225e-08, 2.48138e-05, 1.08036e-02])
>>> np.sort(splinalg.svd(exog0)[1]**2)[:4]
array([ 1.65708e-11, 3.08225e-08, 2.48138e-05, 1.08036e-02])
</pre><p>So, we see that they are pretty much all over the place, from -0.38 to 2.8e-05. The last version with singular value decomposition is the closest to what statsmodels uses with pinv. It also looks like I picked the worst algorithm for the regression <cite>summary</cite> in this case.</p><p><strong>Warning</strong>: Calculations at machine precision are not necessarily deterministic, in the sense that if you run it repeatedly you might not always get the same results. There are several cases on the scipy and numpy mailing lists that report that the results might "randomly" switch between several patterns. And the results won't agree on different operating systems, compilers and versions of the linear algebra libraries. So, I don't expect that these results can be replicated in exactly the same way.</p></div><div class="section" id="addendum-2-scipy-linalg-versus-numpy-linalg"><h4>Addendum 2: scipy.linalg versus numpy.linalg</h4><p>To avoid getting these changing results whenever I re-ran the script while preparing this post, I changed the statsmodels source to use <cite>scipy.linalg.pinv</cite> instead of <cite>numpy.linalg.pinv</cite>. I expected more replicable results, however what I found is:</p><pre class="literal-block">>>> exog0 = dta[:,1:]**np.arange(11)
>>> res0 = sm.OLS(endog, exog0).fit()
>>> log_relative_error(res0.params, params_nist[:,0])
array([ 5.31146488, 5.7400516 , 6.53794562, 6.81318335, 6.81855769,
7.22333339, 8.13319742, 7.38788711, 7.24457806, 7.18580677,
7.12494176])
>>> log_relative_error(res0.bse, params_nist[:,1])
array([ 2.25861611, 2.25837872, 2.25825903, 2.25822427, 2.2582245 ,
2.25823174, 2.25823693, 2.25823946, 2.25824058, 2.25824108,
2.25824165])
</pre><p>Just by changing the algorithm that calculates the generalized inverse, I get agreement with the NIST data at 5 to 7 significant digits for the parameters and 2 digits for the standard error of the parameter estimates even with the very ill-conditioned original design matrix. That doesn't look so bad, much better than when using the numpy.linalg version.</p><p>(But I need to write proper tests and look at this when I can trust the results. I now have two python sessions open, one that imported the original source, and one that imported the source after changing the statsmodels source. Also, if I run this regression repeatedly the numbers changed once, but remained within the same neighborhood. Besides different algorithm there is also <cite>rcond</cite> which defines the cutoff in pinv. I didn't check whether that differs in the numpy and scipy versions.)</p></div><div class="section" id="conclusions"><h4>Conclusions</h4><p>I think this test case on the NIST site is very well "cooked" to test the numerical accuracy of a linear regression program. The main lesson is that we shouldn't throw a numerically awful problem at a statistical package, unless we know that the package takes care for us of the basic tricks for making the problem numerically more stable. It's safer to make sure our design matrix is numerically sound.</p><p>Also, if we just want to estimate a polynomial function, then use the information and use a specialized algorithm, or, even better, use an orthogonal polynomial basis instead of power polynomials.</p><p><strong>What does it mean for statistical analysis?</strong></p><p>That, I'm not so sure. Multicollinearity is a serious issue, and there a various <a class="reference external" href="http://en.wikipedia.org/wiki/Multicollinearity#Remedies_for_multicollinearity">approaches</a> for dealing with it. But my attitude so far has been:</p><p><em>If you work with real data and run into numerical problems, it's not a problem with numerical accuracy but with your data, or with your model.</em></p><p>We should still use basic precautions like scaling our variables appropriately, but if we have high multicollinearity, then it mainly means that the model that we specified is asking for information that's not in the data. In certain directions the data is not informative enough to reliably identify some parameters. Given measurement errors, noise in the data and misspecification, there are many other parts to worry about before machine precision becomes important. For a related discussion see <a class="reference external" href="https://groups.google.com/group/pystatsmodels/browse_thread/thread/db07ab3b25a3fb98?hl=en">this thread</a> on the statsmodels mailinglist.</p><p>I tried before to come up with a case where standardizing (zscoring) the design matrix helps in improving the precision of the estimates but I didn't manage. Whether I zscored or not, the results where essentially the same. Now, I have a test case to add to statsmodels. I am sceptical about automatic rescaling, but I started a while ago to look into how to make it easier for users to use predefined transforms in statsmodels, instead of having to code them from scratch.</p><p>I'm not an expert in numerical analysis and don't intend to become one, my "numerical incompetence" has improved only a bit since <a class="reference external" href="http://mail.scipy.org/pipermail/scipy-user/2009-November/023107.html">this</a> although I know now a bit more linear algebra.</p><p>I put a script with the NIST case in <a class="reference external" href="https://gist.github.com/1974049">this gist</a>. I haven't yet copied over the parts from the interpreter sessions.</p><p><em>A final comment</em>:</p><p>I don't like long interpreter sessions, I usually convert everything as fast as possible to a script. For this, I copied everything directly from the session. After cleaning up the original script a bit, I'm getting different numbers for the log relative error (LRE). I'm now using <cite>scipy.linalg.pinv</cite> inside statsmodels, and LRE is in this case a measure for the behavior at machine precision, and bounces anywhere between 5 and 8. This is a good result in that we can still get estimates with a reasonable precision, but it makes LRE unreliable for replicating the results. I will make a proper script and unittest later, so that I can be more certain about how much the numbers change and whether there isn't a bug somewhere in my "live" session.</p></div></div>Josef Perktoldnoreply@blogger.com2tag:blogger.com,1999:blog-1909863961516003719.post-52092950221679110052012-03-03T07:58:00.000-08:002012-03-04T17:48:12.442-08:00Data "Analysis" in Python<div class="document" id="data-analysis-in-python"> <p>I'm catching up with some Twitter feeds and other information on the internet about the <a class="reference external" href="http://pydataworkshop.eventbrite.com/">PyData Workshop</a></p><p>There is a big effort in the Python/Numpy/SciPy community to get into the "Big Data" and data processing market.</p><p>Even the creator of Python was at the workshop and took not of <a class="reference external" href="https://plus.google.com/115212051037621986145/posts/hZVVtJ9bK3u">it</a>.</p><pre class="literal-block">Guido van Rossum - Yesterday 9:05 PM - Public
Pandas: a data analysis library for Python, poised to give R a run for its money
</pre><p>I think Python is well suited for this, Python in combination with numpy and scipy has been for 4 years my favorite language for coding for statistics and econometrics. I have been working for several years now on improving "Statistics in Python", both in scipy.stats and statsmodels.</p><p>Since the PyData Workshop didn't include anything about statistics or econometrics, it looks like my view is a bit out of mainstream. The blogoshpere is awash with articles about what's hype and what's reality behind BIG DATA. (I don't find the links to the articles I liked, but SAS might have a realistic view <a class="reference external" href="http://blogs.sas.com/content/corneroffice/2012/02/03/is-big-data-over-hyped/">Is big data overhyped</a> )</p><p>However, what came to my mind reading the buzz surrounding the PyData Workshop is more personal and specific to software developement in Python.</p><p>My first thoughts can be roughly summarized with</p><p><strong>You know that you are out of date, if</strong></p><ul class="simple"><li>you like mailing lists. <a class="footnote-reference" href="#id3" id="id1">[1]</a></li>
<li>you signed up for Twitter and never posted anything.</li>
<li>you signed up for Google plus and never posted anything.</li>
<li>you read the Twitter feed of others once a month.</li>
<li>you don't even know how to link to a Twitter message.</li>
</ul><p><strong>You know you don't do the popular things, if</strong></p><ul class="simple"><li>you spend two days checking the numerical accuracy of your algorithm for a case with bad data instead of trying to calculate it in the cloud.</li>
<li>you spend a week writing test cases verifying your code against other packages, instead of waiting for the bug reports from users.</li>
<li>you spend your time figuring out skew, kurtosis and fat tails, and everyone thinks the world is normal, (normally distributed, that is).</li>
<li>you think you can to "fancy" econometrics in Python, when users can just use STATA.</li>
<li>you think you can to "fancy" statistics in Python, when users can just use R.</li>
<li>you think "Data Analysis" requires statistics and econometrics.</li>
</ul><p><strong>You know you are missing the boat (or the point), if</strong></p><ul class="simple"><li>"all the best and brightest in the scipy/numpy community are doing a startup" <a class="footnote-reference" href="#id4" id="id2">[2]</a>, and you are not among them.</li>
<li>you are looking for your business plan, and you realize you never came up with one.</li>
<li>the "community" of your open source project consists mostly of two developers.</li>
</ul><table class="docutils footnote" frame="void" id="id3" rules="none"><colgroup><col class="label" /><col /></colgroup><tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td><a class="reference external" href="http://groups.google.com/group/scipy-user/about?hl=en">example</a></td></tr>
</tbody></table><table class="docutils footnote" frame="void" id="id4" rules="none"><colgroup><col class="label" /><col /></colgroup><tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id2">[2]</a></td><td>from <a class="reference external" href="http://twitter.com/#!/stochastician">this</a> feed</td></tr>
</tbody></table></div>Josef Perktoldnoreply@blogger.com1