Saturday, March 3, 2012

Data "Analysis" in Python

I'm catching up with some Twitter feeds and other information on the internet about the PyData Workshop.

There is a big effort in the Python/NumPy/SciPy community to get into the "Big Data" and data processing market.

Even the creator of Python was at the workshop and took note of it.

Guido van Rossum  -  Yesterday 9:05 PM  -  Public
Pandas: a data analysis library for Python, poised to give R a run for its money

I think Python is well suited for this. For four years, Python in combination with numpy and scipy has been my favorite language for coding statistics and econometrics. I have been working for several years now on improving "Statistics in Python", both in scipy.stats and statsmodels.

Since the PyData Workshop didn't include anything about statistics or econometrics, it looks like my view is a bit out of the mainstream. The blogosphere is awash with articles about what's hype and what's reality behind BIG DATA. (I can't find the links to the articles I liked, but SAS might have a realistic view: "Is big data overhyped".)

However, what came to my mind reading the buzz surrounding the PyData Workshop is more personal and specific to software development in Python.

My first thoughts can be roughly summarized as follows:

You know that you are out of date, if

  • you like mailing lists. [1]
  • you signed up for Twitter and never posted anything.
  • you signed up for Google plus and never posted anything.
  • you read the Twitter feed of others once a month.
  • you don't even know how to link to a Twitter message.

You know you don't do the popular things, if

  • you spend two days checking the numerical accuracy of your algorithm for a case with bad data instead of trying to calculate it in the cloud.
  • you spend a week writing test cases verifying your code against other packages, instead of waiting for the bug reports from users.
  • you spend your time figuring out skew, kurtosis and fat tails, and everyone thinks the world is normal (normally distributed, that is).
  • you think you can do "fancy" econometrics in Python, when users can just use STATA.
  • you think you can do "fancy" statistics in Python, when users can just use R.
  • you think "Data Analysis" requires statistics and econometrics.

You know you are missing the boat (or the point), if

  • "all the best and brightest in the scipy/numpy community are doing a startup" [2], and you are not among them.
  • you are looking for your business plan, and you realize you never came up with one.
  • the "community" of your open source project consists mostly of two developers.
[2] from this feed

1 comment:

  1. Joe, heads up!
    It is your work that got me started with Python and statistics, and I am using it for a course at university now. And I might be wrong, but I think that R as a programming language is ugly enough that Python will catch on for statistics applications.
    Keep up your good work - people like me are very grateful!