# Python data analysis & visualisation

Recently I had to do some data analysis and visualisation for multi-factorial data. Importantly, the question was about the weight (/significance) of one of the factors, taking into account i) the within- and between-subject variance, ii) the multiple-comparisons problem (i.e. multiple related measures within one subject), and iii) missing data (i.e. not all conditions measured in all subjects). This short overview shows some of the ways to properly (?) address each of these issues, and also briefly touches on the obsession with p-values (as if p < 0.05 were more important than the effect size).

The key here is organising your data in a clever way and using some kind of statistical model to assess how much variance is explained by each of the factors, comparing this to the residual variance; this in turn tells you whether the factors explain significantly more variance than remains unexplained.

First, acquire the data. In my experience, the best way to organise the data is with rows for measurements and columns for factors. For example, I have a data set with multiple subjects (sub_ids), stimulus conditions (e.g. sound intensity for two different kinds of stimuli), and ears (l/r). For each measurement I have an amplitude and a latency (I'm talking about an ABR study using either click stimuli or chirps). This is all saved in a plain-text file (see next block):

Now we get to the actual Python framework. For this we use pandas to get, select and group the data according to our needs. The beauty of pandas is that it is really geared towards this purpose. First of all, import all packages (including statsmodels, pandas and seaborn; I'll get to those later).
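A minimal sketch of the loading step (in the real analysis this came from a plain-text file; the filename `abr_data.txt` is my placeholder, and a few rows are embedded here so the sketch runs standalone):

```python
import io

import pandas as pd

# in the real analysis this would be:
#   tmp = pd.read_csv('abr_data.txt', sep=r'\s+')
# here a few rows are embedded directly so the sketch is self-contained
raw = """sub_id intensity initials ear stimulus_type latency amplitude
1 80 JM Rechts Chirp 5.77 620
1 80 JM Rechts Clicks 5.57 450
1 80 JM Links Chirp 5.57 725
1 80 JM Links Clicks 5.73 500
1 70 JM Rechts Clicks 5.93 475"""

# one row per measurement, one column per factor
tmp = pd.read_csv(io.StringIO(raw), sep=r'\s+')
print(tmp.head())
```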

```
sub_id intensity initials ear stimulus_type latency amplitude
0 1 80 JM Rechts Chirp 5.77 620
1 1 80 JM Rechts Clicks 5.57 450
2 1 80 JM Links Chirp 5.57 725
3 1 80 JM Links Clicks 5.73 500
4 1 70 JM Rechts Clicks 5.93 475
```

Since the data also contain 5 dB step sizes (i.e. 45 dB, not shown), let's bin the data by rounding to the next multiple of 10. Rather than overwriting the existing data, I just add a new column.
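A one-liner does the binning, assuming the DataFrame is called `tmp` as above (the column name `intensity_bin` and the choice to round midpoints such as 45 dB upwards are mine):

```python
import pandas as pd

# small stand-in frame so the sketch runs on its own
tmp = pd.DataFrame({'intensity': [45, 70, 75, 80]})

# round to the nearest multiple of 10 (midpoints go up),
# keeping the original intensity column untouched
tmp['intensity_bin'] = ((tmp['intensity'] + 5) // 10 * 10).astype(int)
print(tmp)
```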

The next step is visualising the data. A neat package that I use is seaborn, a Python visualisation library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics, and it really does. For this data set I want a factor plot with the intensity on the x-axis, the amplitude on the y-axis, and a bar for each of the two stimulus types separately. Also, I'd like some idea of the confidence interval, taken as the 68% interval based on 5000 bootstraps (i.e. roughly the mean ± standard deviation; see e.g. the 68-95-99.7 rule). This is done by simply invoking the following:

This already gives me a strong indication that there are strong effects of stimulus type and sound intensity. To explore this further, I'm going to model the data using two approaches. One is the plain old ANOVA with two factors (stimulus type & intensity) and their interaction. This brushes over the glaring issue of repeated measures, but demonstrates how to do this kind of analysis (whether rightly or wrongly). The other approach is a (generalised) linear mixed model. This mixed-model approach treats each response (/variable) as a combination of a random effect (e.g. the overall mean performance of a subject) and a fixed effect (e.g. the effect of a perturbation you induced, such as intensity level or condition). Especially when measuring several things within each subject, this mixed-model approach is really the way to go. Moreover, it is better geared towards the missing-data issue that ANOVA really cannot deal with.

First let’s check the **ANOVA model**:
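In statsmodels this amounts to an OLS fit with a formula, followed by an ANOVA table (synthetic data stands in for the real measurements; the effect sizes in the generator are my invention):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# synthetic stand-in for the real data set
rng = np.random.default_rng(1)
n = 291
tmp = pd.DataFrame({
    'intensity': rng.choice([50, 60, 70, 80], n),
    'stimulus_type': rng.choice(['Chirp', 'Clicks'], n),
})
tmp['amplitude'] = (240 + 6.5 * tmp['intensity']
                    - 75 * (tmp['stimulus_type'] == 'Clicks')
                    + rng.normal(0, 200, n))

# two factors plus their interaction; '*' expands to
# both main effects and the interaction term
model = smf.ols('amplitude ~ stimulus_type * intensity', data=tmp).fit()
print(model.summary())
print(anova_lm(model, typ=2))
```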

```
                            OLS Regression Results
==============================================================================
Dep. Variable:              amplitude   R-squared:                       0.312
Model:                            OLS   Adj. R-squared:                  0.305
Method:                 Least Squares   F-statistic:                     43.38
Date:                Sat, 15 Aug 2015   Prob (F-statistic):           3.79e-23
Time:                        10:51:42   Log-Likelihood:                -1967.8
No. Observations:                 291   AIC:                             3944.
Df Residuals:                     287   BIC:                             3958.
Df Model:                           3
Covariance Type:            nonrobust
=====================================================================================================
                                        coef    std err          t      P>|t|    [95.0% Conf. Int.]
-----------------------------------------------------------------------------------------------------
Intercept                           238.6935     38.346      6.225      0.000     163.219   314.168
stimulus_type[T.Clicks]             -76.5944     59.868     -1.279      0.202    -194.431    41.242
intensity                             6.4898      0.770      8.427      0.000       4.974     8.006
stimulus_type[T.Clicks]:intensity    -1.8457      1.150     -1.605      0.110      -4.109     0.417
==============================================================================
Omnibus:                        2.872   Durbin-Watson:                   0.663
Prob(Omnibus):                  0.238   Jarque-Bera (JB):                2.756
Skew:                           0.238   Prob(JB):                        0.252
Kurtosis:                       3.008   Cond. No.                         320.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

                                  sum_sq   df          F        PR(>F)
stimulus_type             1927226.558883    1  43.434354  2.094849e-10
intensity                 4349489.911650    1  98.025468  4.505229e-20
stimulus_type:intensity    114352.377676    1   2.577186  1.095140e-01
Residual                 12734482.515712  287        NaN           NaN
```

Part of the variance is explained and the model fits the data reasonably well. Both the intercept and the intensity explain a significant part of the variance. Stimulus type, however, does not seem to explain much, whereas the factor plot suggests otherwise. I think this has to do with the large inter-subject variance, probably larger than the within-subject variance. And here comes the **mixed-model approach**. This is done by defining a mixed model with the two main factors and their interaction (note that the main effects are automatically added to the model when defining the interaction) and allowing for a random intercept for each subject (groups=tmp["sub_id"]). This takes very little Python code, as can be seen here:
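With statsmodels' `mixedlm` the model is indeed compact (again with synthetic stand-in data; per-subject baseline offsets are built into the generator so the random intercept has something to pick up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# synthetic stand-in: 16 subjects, each with their own baseline amplitude
rng = np.random.default_rng(2)
rows = []
for sub_id in range(1, 17):
    baseline = rng.normal(0, 150)          # between-subject variance
    for intensity in [50, 60, 70, 80]:
        for stim in ['Chirp', 'Clicks']:
            amp = (100 + baseline + 8.5 * intensity
                   - 110 * (stim == 'Clicks') + rng.normal(0, 120))
            rows.append((sub_id, intensity, stim, amp))
tmp = pd.DataFrame(rows, columns=['sub_id', 'intensity',
                                  'stimulus_type', 'amplitude'])

# fixed effects: both factors and their interaction;
# random effect: one intercept per subject (groups=tmp['sub_id'])
md = smf.mixedlm('amplitude ~ stimulus_type * intensity',
                 data=tmp, groups=tmp['sub_id'])
mdf = md.fit()
print(mdf.summary())
```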

```
                 Mixed Linear Model Regression Results
==================================================================================
Model:                  MixedLM       Dependent Variable:       amplitude
No. Observations:       291           Method:                   REML
No. Groups:             16            Scale:                    15421.7194
Min. group size:        3             Likelihood:               -1834.6913
Max. group size:        29            Converged:                Yes
Mean group size:        18.2
----------------------------------------------------------------------------------
                                      Coef.   Std.Err.    z    P>|z|  [0.025  0.975]
----------------------------------------------------------------------------------
Intercept                            95.938    53.345   1.798  0.072   -8.616 200.492
stimulus_type[T.Clicks]            -108.567    35.500  -3.058  0.002 -178.145 -38.989
intensity                             8.538     0.471  18.112  0.000    7.614   9.462
intensity:stimulus_type[T.Clicks]    -1.245     0.683  -1.824  0.068   -2.584   0.093
Intercept RE                      36195.683   117.297
==================================================================================
```

By using a random intercept, the factor stimulus_type also pops up as significant. Note that this approach is very comparable to the lme4 package in R, as can be seen in its examples.
