Friday, March 15, 2013

Different Fields - Different Problems: Effect Size

or What's your scale?

Effect Size

I have been working on and off for a while now on adding statistical power calculations to statsmodels. One of the topics I ran into is the effect size.

At the beginning, I wasn't quite sure what to make of it. While I was working on power calculations, it just seemed to be a convenient way of specifying the distance between the alternative and the null hypothesis. However, there were references that sounded like it's something special and important. This was my first message to the mailing list

A classical alternative to NHST (null-hypothesis significance testing):
report your effect size and confidence intervals

I bumped into this while looking for power analysis.

MIA in python, as far as I can tell.

Now what is the fuss all about?

Scaling Issues

Today I finally found some good motivating quotes:

"In the behavioral, educational, and social sciences (BESS), units of measurement are many times arbitrary, in the sense that there is no necessary reason why the measurement instrument is based on a particular scaling. Many, but certainly not all, constructs dealt with in the BESS are not directly observable and the instruments used to measure such constructs do not generally have a natural scaling metric as do many measures, for example, in the physical sciences."


"However, effects sizes based on raw scores are not always helpful or generalizable due to the lack of natural scaling metrics and multiple scales existing for the same phenomenon in the BESS. A common methodological suggestion in the BESS is to report standardized effect sizes in order to facilitate the interpretation of results and for the cumulation of scientific knowledge across studies, which is the goal of meta-analysis (<...>). A standardized effect size is an effect size that describes the size of the effect but that does not depend on any particular measurement scale."

The two quotes are from the introduction in "Confidence Intervals for Standardized Effect Sizes: Theory, Application, and Implementation" by Ken Kelley .

Large parts of the literature that I was browsing or reading on this, are in Psychological journals. This can also be seen in the list of references in the Wikipedia page on effect size.

One additional part, that I found puzzling was the definition of "conventional" effect sizes by Cohen. "For Cohen's d an effect size of 0.2 to 0.3 might be a "small" effect, around 0.5 a "medium" effect and 0.8 to infinity, a "large" effect." (sentence from the Wikipedia page)

"Small" what? small potatoes, small reduction in the number of deaths, low wages? or, "I'm almost indifferent" (+0 on the mailing lists)?

Where I come from

Now it's clearer why I haven't seen this in my traditional area, economics and econometrics.

Although economics falls into BESS, in the SS part, it has a long tradition of getting a common scale, money. Physical units also show up in some questions.

National Income Accounting tries to measure the economy with money as a unit. (And if something doesn't have a price associated with it, then it's ignored by most. That's another problem. Or, we make a price.) There are many measurement problems, but there is also a large industry to figure out common standards.

Effect sizes have a scale that is "natural":

  • What's the increase in lifetime salary, if you attend business school?
  • What's the increase in sales (in Dollars, or physical units) if you lower the price?
  • What's the increase in sales if you run an advertising campaign?
  • What's your rate of return if you invest in stocks?

Effects might not be easy to estimate, or cannot be estimated accurately, but we don't need a long debate about what to report as effect.

Post Scripts

(i) I just saw the table at the end of this SAS page . I love replicating SAS tables, but will refrain for now, since I am supposed to go back to other things.

(ii) I started my last round of work on this because I was looking at effect size as distance measure for a chisquare goodness of fit test. When the sample size is very large, then small deviations from the Null Hypothesis will cause a statistical test to reject the Null, even if the effect, the difference to the Null is for all practical purposes irrelevant. My recent preferred solution to this is to switch to equivalence test or something similar, not testing the hypothesis that the effect is exactly zero, but to test whether the effect is "small" or not.

(iii) I have several plans for blog posts (cohens_kappa, power onion) but never found the quiet time or urge to actually write them.


  1. I bumped into this while looking for power analysis.

    MIA in python, as far as I can tell.

    Are there any facilities for power analysis that you know of or is it still MIA?

  2. not so MIA anymore.

    I merged my branch into statsmodels a few days ago.
    But the documentation still has problems, I just saw: I didn't have the problem when I build the documentation locally.

    I'm preparing a comparison with ``pwr`` package in R.

    1. Thanks for that; I have used the pwr package in R. I'm pretty excited with all the progress folks are making for stats in python lately. Looking forward to being able to move completely to python for experimental design and analysis (I also like the AlgDesign package in R for DoE).

    2. Until there was a question on the statsmodels mailing list last November, I didn't even know what DoE is. It's completely outside my expertise. Hopefully, some users or developers familiar with this can create a DoE in python.

      I'm glad if my work on power and sample size calculations turns out to be useful for this.

  3. I published my follow-up post with the examples for using the power functions.