Wednesday, April 24, 2013

Statistics in Python: Reproducing Research

This is just a short comment on a blog post.

Fernando Perez wrote a nice article about "Literate computing" and computational reproducibility: IPython in the age of data-driven journalism

In the second part, he explains that Vincent Arel-Bundock came up with an IPython notebook within three hours to replicate some criticism of an economics journal article. Vincent's notebook can be seen here.

What I found most striking was not the presentation as a notebook, although that makes it easy to read. Instead, it was the tools: pandas, patsy and statsmodels, and no R in sight. We have come a long way with statistics in Python since I started to get involved in it five years ago.
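To give a flavor of what that combination looks like, here is a minimal sketch. It is not taken from Vincent's notebook: the column names `debt_cat` and `growth` and all the numbers are made up for illustration. pandas holds the data, patsy parses the formula string, and statsmodels fits the model.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy numbers for illustration only; this is NOT the
# Reinhart-Rogoff data set, and the column names are assumptions.
df = pd.DataFrame({
    "debt_cat": ["0-30", "0-30", "30-60", "30-60",
                 "60-90", "60-90", "90+", "90+"],
    "growth": [4.0, 4.4, 3.0, 3.2, 2.9, 3.3, 1.8, 2.6],
})

# pandas: mean and median growth per debt category
group_stats = df.groupby("debt_cat")["growth"].agg(["mean", "median"])
print(group_stats)

# patsy turns the formula into a design matrix with dummy variables,
# statsmodels estimates the regression by OLS. The intercept is the mean
# of the baseline category, the other coefficients are differences to it.
res = smf.ols("growth ~ C(debt_cat)", data=df).fit()
print(res.params)
```

With a single categorical regressor, the OLS coefficients simply reproduce the group means, which makes it easy to check the regression output against the pandas table.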

Vincent has made many improvements and contributions to statsmodels in the last year.

Aside

I'm not following much of the economics debates these days, so I only know what I read in the two references that Fernando gave.

My impression is that it's just the usual (mis)use of economics research results. Politicians like the numbers that give them ammunition for their position. As an economist, you are either very careful about how to present the results, or you join the political game (I worked for several years in an agricultural department of a small country). An example of the use of economics results in another area is blaming the financial crisis on the work on copulas.

"Believable" research: If your results sound too good or too interesting to be true, maybe they are not, and you had better check your calculations. Although mistakes are not uncommon, the business-as-usual part is that the results are often very sensitive to assumptions, and it takes time to figure out which results are robust. I have seen enough economic debates where there never was a clear answer that convinced more than half of all economists. A long time ago, when the Asian Tigers were still tigers, one question was: Did they grow because of or in spite of government intervention?

1 comment:

  1. just another thought:

    The first graph in the Economist article cited by Fernando shows a huge difference between mean and median in the Reinhart-Rogoff analysis.

    This would require some heavy skewness in the data or some very large negative outliers, and would require a closer look. The median is a robust statistic and shows that the reduction in growth is small for half of all countries.

    We can use RLM in statsmodels if we want robust estimates, or, soon to come, quantile regression to see the impact on different groups of countries and years.
