Crosscombe, Cornwall, UK

Or do stats using python

Python data analysis & visualisation

Recently I had to do some data-analysis and visualisation for multi-factorial data. Importantly, the question was about the weight (/significance) of one of the factors, taking into account i) the within- and between-subjects variance, ii) the multiple-comparisons problem (i.e., multiple related measures within one subject), and missing data (i.e. not all condition measured in all subjects). This short overview shows some of the ways to properly (?) address each of these issues and also shortly touching on the obsession with p-values (p < 0.05 is more important than the effect size).

The key here is using a clever way to organise your data and using some kind of statistical model to assess how much variance is explained for each of the factors and comparing this to some residual variance, in turn assessing whether the factors explain significantly more variance the some unexplained variance.

First, acquire the data. In my experience, the best way to organise the data is by using rows for measures and columns for factor. For example, I have a data-set with multiple subjects (sub_ids), stimulus conditions (e.g. sound intensity for two different kinds of stimuli), ears (l/r). For each of the measurements I have an amplitude and latency (I’m talking about an ABR study using either Click-stimuli or Chirps). This is all saved in a plain-text file (see next block):


Now we get to the actual python framework. For this we use Panda’s to get, select and group the data according to our needs. The beauty of Panda’s is that it is really geared towards this purpose. First of all, import all packages (including statsmodel, pandas and seaborn - I’ll get to those later)

import os
import os.path as op
import glob
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from statsmodels.formula.api import ols
#load the data
resdir = '/Users/cplanting/ABR-vis/'
data_bera = pd.read_csv(op.join(resdir,'amp.csv'), sep=';',header=0)
print data_bera.head()
   sub_id  intensity initials     ear stimulus_type  latency  amplitude
0       1         80       JM  Rechts         Chirp     5.77        620
1       1         80       JM  Rechts        Clicks     5.57        450
2       1         80       JM   Links         Chirp     5.57        725
3       1         80       JM   Links        Clicks     5.73        500
4       1         70       JM  Rechts        Clicks     5.93        475

Since the data also contains 5 dB-step sizes (i.e. 45 dB, not shown), let’s bin the data by rounding to the next multiple of 10. Rather than overwriting existing data, I just add a new column.

def int_round(x,mult):
    #round to closest multiple of 10; 5.0 -> 0.0; 5.1 -> 10.0
    return np.round(x / mult) * mult
data_bera['intensity (dB SPL)'] = int_round(data_bera['intensity'],10.0)
tmp = data_bera[data_bera.intensity <= 85]

The next step is visualising the data. A neat package that I use is Seaborn, a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. And it really does. For this data-set I want a factorplot with on the x-axis the intensity on the y-axis the amplitude and a bar for each of the two stimuli-types separately. Also, I’d like to have some idea of the confidence interval, taken as the 68 percentile based on 5000 bootstraps (i.e. the mean ± standard deviation; see e.g. the 68-95-99.7 rule). This is done by simply invoking the following:

g0 = sns.factorplot("intensity (dB SPL)","amplitude","stimulus_type",kind="auto",
                     ci=68,n_boot=5000, estimator=np.mean, 
                     data=tmp, size=6, palette="muted")


This already give me a strong indication that there is a strong effect of stimulus-type, and sound-intensity. To further explore this, I’m going to model the data based using two approaches. One is the plain-old ANOVA with two factors (stimulus-type & intensity) and their interaction. This brushes over the glaring point of repeated measures but demonstrates ‘how-to’ do this kind of analysis (whether rightly or wrongly). The other approach is by using a generalized linear mixed model. This GLMM approach models each response (/variable) as a combination of a random effect (e.g., the overall mean performance of a subject) and a fixed effect (e.g., the effect of a perturbation you induced, such a intensity level or condition). Especially when measuring several things within each subject, this mixed-model approach is really the way to go. Moreover, it is more geared towards the missing-data issue that ANOVA really cannot deal.

First let’s check the ANOVA model:

model_anova = ols('amplitude ~ stimulus_type*intensity',data=tmp).fit()
table = sm.stats.anova_lm(model_anova, typ=2) # Type 2 ANOVA DataFrame
print model_anova.summary()
print table
                            OLS Regression Results                            
Dep. Variable:              amplitude   R-squared:                       0.312
Model:                            OLS   Adj. R-squared:                  0.305
Method:                 Least Squares   F-statistic:                     43.38
Date:                Sat, 15 Aug 2015   Prob (F-statistic):           3.79e-23
Time:                        10:51:42   Log-Likelihood:                -1967.8
No. Observations:                 291   AIC:                             3944.
Df Residuals:                     287   BIC:                             3958.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                                        coef    std err          t      P>|t|      [95.0% Conf. Int.]
Intercept                           238.6935     38.346      6.225      0.000       163.219   314.168
stimulus_type[T.Clicks]             -76.5944     59.868     -1.279      0.202      -194.431    41.242
intensity                             6.4898      0.770      8.427      0.000         4.974     8.006
stimulus_type[T.Clicks]:intensity    -1.8457      1.150     -1.605      0.110        -4.109     0.417
Omnibus:                        2.872   Durbin-Watson:                   0.663
Prob(Omnibus):                  0.238   Jarque-Bera (JB):                2.756
Skew:                           0.238   Prob(JB):                        0.252
Kurtosis:                       3.008   Cond. No.                         320.

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                                  sum_sq   df          F        PR(>F)
stimulus_type             1927226.558883    1  43.434354  2.094849e-10
intensity                 4349489.911650    1  98.025468  4.505229e-20
stimulus_type:intensity    114352.377676    1   2.577186  1.095140e-01
Residual                 12734482.515712  287        NaN           NaN

Part of the variance is explained and the model fits the data well. Both the Intercept and the intensity explain a significant part of the variance. When looking at the stimulus type, however, it does not seem to explain a lot… Looking at the factorplot seems to suggest differently. I think* that it has to do with the large intersubject variance; probably to a larger degree than the within-subject variance. And here comes the mixed-model approach. This is done by defining a mixed model with two main factors and their interaction (note that the main effects are automatically added to the model when defining the interaction) and allowing for a random intercept for each subject (groups=tmp[“sub_id”]). This approach is very short in python code as can be seen here:

model = sm.MixedLM.from_formula("amplitude ~ intensity * stimulus_type", tmp, groups=tmp["sub_id"])
result =
print result.summary()
                       Mixed Linear Model Regression Results
Model:                      MixedLM         Dependent Variable:         amplitude 
No. Observations:           291             Method:                     REML      
No. Groups:                 16              Scale:                      15421.7194
Min. group size:            3               Likelihood:                 -1834.6913
Max. group size:            29              Converged:                  Yes       
Mean group size:            18.2                                                  
                                    Coef.   Std.Err.   z    P>|z|  [0.025   0.975]
Intercept                            95.938   53.345  1.798 0.072   -8.616 200.492
stimulus_type[T.Clicks]            -108.567   35.500 -3.058 0.002 -178.145 -38.989
intensity                             8.538    0.471 18.112 0.000    7.614   9.462
intensity:stimulus_type[T.Clicks]    -1.245    0.683 -1.824 0.068   -2.584   0.093
Intercept RE                      36195.683  117.297                              

By using a random intercept, the factor stimulus_type also pops up as significant. Note that this approach is very comparable to the LME4 package in R as can be see in the examples.

python statsmodels linear mixed-models