The Permutation Test

The “permutation test” assesses whether two subpopulations, A and B, differ by some metric, e.g. the mean or median of a numerical observation.

This test originates in (no surprise) Fisher’s work in the 1930s.

As a toy example that arises in practice, let’s say we have a table where each row is an individual, and one column takes one of two possible values, called A or B.

A priori, this column might be utterly meaningless in terms of the context at hand. Moreover, for practical reasons, we want to limit the number of features which influence our decisions.

In other words, we want to “make it hard” to establish that there is a statistically significant difference between these two subpopulations.

Therefore, the null hypothesis of the permutation test asserts that the difference between these two subpopulations is statistically indistinguishable from randomly splitting the population into two groups. In other words, the designation of whether a sample belongs to subpopulation A or B is completely arbitrary.

If we’re going to test this hypothesis, we need to enter the realm where the null hypothesis is true. In this world, A and B are “from the same population,” so that any difference comes from the randomness of splitting this single population into two groups.

Therefore, in order to test this hypothesis, we need to simulate splitting the population into two groups and examine the differences between these two groups. The choice of metric (i.e. test statistic) comparing the two groups can greatly simplify our task.
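Simulating one such split amounts to shuffling the pooled sample and cutting it back into groups of the original sizes. A minimal sketch in Python, where the data and the helper name `random_split` are invented for illustration:

```python
import random

# Invented toy data: one numerical observation per individual.
a = [12.1, 11.4, 13.0, 12.7, 11.9]  # observations labeled A
b = [13.2, 12.8, 14.1, 13.5, 13.9]  # observations labeled B

def random_split(pooled, size_a, rng):
    """Shuffle the pooled sample, then cut it back into two groups
    of the original sizes -- a random re-labeling of A and B."""
    shuffled = list(pooled)
    rng.shuffle(shuffled)
    return shuffled[:size_a], shuffled[size_a:]

rng = random.Random(0)
pooled = a + b
sim_a, sim_b = random_split(pooled, len(a), rng)

# One draw of the test statistic (difference of means) under the null.
sim_diff = sum(sim_b) / len(sim_b) - sum(sim_a) / len(sim_a)
```

Each call to `random_split` yields one sample from the world where the A/B labels are arbitrary; repeating it many times traces out the null distribution of the statistic.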

As with most hypothesis tests, the null hypothesis (the difference is due to noise) generates a distribution of the test statistic: the difference of the metric between the two simulated groups.


We emphasize that the distribution of the test statistic under the null hypothesis is computed via simulation, unlike parametric tests, which rely on tables of a known distribution.

Let’s say our alternative hypothesis is that B is better than A. In this case, giving evidence for this difference requires us to show that only a small fraction of the differences generated by the null distribution are as large as the observed difference between A and B.

The p-value associated to the difference between A and B is straightforward to compute: it is the number of simulated differences which are at least as large as the observed difference, divided by the total number of simulated differences.
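Putting the simulation and the count together, a minimal self-contained sketch in pure Python (the toy data, the function name, and the simulation count are all invented for illustration):

```python
import random
import statistics

def permutation_p_value(a, b, statistic, n_sims=10_000, seed=0):
    """One-sided permutation test for 'B is better than A'.

    Counts how often a random re-splitting of the pooled sample
    produces a difference at least as large as the observed one."""
    rng = random.Random(seed)
    observed = statistic(b) - statistic(a)
    pooled = a + b
    hits = 0
    for _ in range(n_sims):
        rng.shuffle(pooled)
        sim_a, sim_b = pooled[:len(a)], pooled[len(a):]
        if statistic(sim_b) - statistic(sim_a) >= observed:
            hits += 1
    return hits / n_sims

# Made-up toy data: B's values look systematically larger than A's.
a = [12.1, 11.4, 13.0, 12.7, 11.9]
b = [13.2, 12.8, 14.1, 13.5, 13.9]
p = permutation_p_value(a, b, statistics.mean)
```

For data this cleanly separated, very few random re-splittings beat the observed difference, so the estimated p-value comes out small.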


As p-values are computed by counting samples, we did not need to assume or argue that the distribution of the test statistic has a particular, analytically nice form. The statistic could be something like the median or mode, which resists an analytically computable description.

This is the real power of the permutation test (and of nonparametric tests more generally): it is very flexible in the choice of test statistic!


Like all resampling methods, the permutation test requires certain assumptions to be met! Namely, that there are enough samples that the empirical distribution (of the test statistic) adequately approximates the “true” distribution.

The “permutation” part of the name is a bit unfortunate, as it hides the underlying intent and logic of the test. It refers to one way of generating splittings of the pooled sample: permute an ordered list of the pooled samples, then split it back into groups of the original sizes.