Probabilistic Notation

Personally, I find the notation and language conventionally used in probability theory imprecise to the point of confusion.


This becomes especially apparent when discussing the zoo of notions of error in statistical learning. To quote Elements of Statistical Learning:

Discussions of error rate estimation can be confusing, because we have to make clear which quantities are fixed and which are random… Indeed, in the first edition of our book, this section wasn’t sufficiently clear.

The text below is purely technical, but we hope that making these essential notions explicit justifies the tedium. Throughout, we assume the reader is familiar with the notion of a probability distribution.

Since we’re talking about probabilities, we should have notation for probability distributions with a fixed, explicit scope.


Given a set \(X\), we let

\[\mathrm{Prob}(X)\]

denote the set of probability distributions on \(X\).


Throughout this discussion, we will abusively (and hypocritically) conflate a probability distribution with an associated density.

Pushforward & Restrictions

The next two definitions show us how to use functions between sets to make new probability distributions.

Definition: Pushforward

A map of sets:

\[X_0 \overset{f} \longrightarrow X_1\]

induces a map of probability distributions:

\[\begin{split}\begin{align*} \mathrm{Prob}(X_0) &\overset{f_*} \longrightarrow \mathrm{Prob}(X_1) \\ \rho &\longmapsto \bigl[U \subset X_1 \mapsto (f_*\rho)(U) := \rho(f^{-1}(U)) \bigr] \end{align*}\end{split}\]

called the pushforward along \(f\). We call \(f_*(\rho)\) the pushforward of \(\rho\) along \(f\).
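For a finite set, the definition above reduces to adding up the mass of each fiber \(f^{-1}(y)\). The following is a minimal sketch under that assumption; the representation of a distribution as a dictionary from points to masses and the name `pushforward` are illustrative choices, not notation from the text.

```python
from collections import defaultdict
from fractions import Fraction


def pushforward(f, rho):
    """Push a finite distribution rho (dict: point -> mass) forward along f.

    The mass that f_* rho assigns to y is the total rho-mass of the fiber f^{-1}(y).
    """
    out = defaultdict(Fraction)  # missing keys start at Fraction(0)
    for x, mass in rho.items():
        out[f(x)] += mass
    return dict(out)


# Uniform distribution on X_0 = {0, 1, 2, 3}.
rho = {x: Fraction(1, 4) for x in range(4)}

# f sends a number to its parity, so X_1 = {0, 1}.
parity = pushforward(lambda x: x % 2, rho)
print(parity)
```

Exact fractions are used so that the total mass stays exactly 1 after pushing forward.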

An example to keep in mind is when \(X\) admits a factorization as:

\[X \simeq X_0 \times X_1\]

and \(f\) is the projection onto the zeroth component:

\[\begin{split}\begin{align*} X_0 \times X_1 &\overset{\pi^0}\longrightarrow X_0 \\ (x_0, x_1) &\longmapsto x_0 \end{align*}\end{split}\]

In this case, for a probability distribution \(\rho \in \mathrm{Prob}(X_0 \times X_1)\),

\[(\pi^0)_*(\rho) \in \mathrm{Prob}(X_0)\]

is commonly referred to as the marginal distribution. It is computed by integrating out the remaining component:

\[(\pi^0)_*(\rho)(x_0) = \int_{X_1} \rho(x_0, x_1) \mathrm{d}x_1\]


Wilson’s renormalization group methods apply this construction when the decomposition splits the degrees of freedom into high- and low-energy fields.

Universality/renormalizability results assert that the image of pushing forward along the projection onto the low-energy degrees of freedom concentrates along a finite-dimensional manifold.


By convention, when \(X\) is a finite set, integration coincides with summation.
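In the finite case, the marginal is obtained by summing the joint masses over the component being forgotten. Here is a small sketch with a made-up joint distribution; the sets, the masses, and the helper `marginal` are all hypothetical and serve only to illustrate the pushforward along a projection.

```python
from fractions import Fraction

# A joint distribution on X_0 x X_1, with X_0 = {"a", "b"} and X_1 = {0, 1}.
joint = {
    ("a", 0): Fraction(1, 8),
    ("a", 1): Fraction(3, 8),
    ("b", 0): Fraction(1, 4),
    ("b", 1): Fraction(1, 4),
}


def marginal(joint, component):
    """Push joint forward along the projection onto the given component (0 or 1)."""
    out = {}
    for point, mass in joint.items():
        key = point[component]  # image of the point under the projection
        out[key] = out.get(key, Fraction(0)) + mass
    return out


marginal_0 = marginal(joint, 0)  # sums over X_1
marginal_1 = marginal(joint, 1)  # sums over X_0
print(marginal_0, marginal_1)
```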


A classical unsupervised technique is to represent a complicated distribution on the ‘visible’ degrees of freedom \(\rho \in \mathrm{Prob}(X)\) as a pushforward of a distribution of a system augmented by ‘hidden’ degrees of freedom \(X_h\).

In other words, we are looking for some

\[\tilde{\rho}\in \mathrm{Prob}(X_h \times X)\]

so that:

\[\pi^h_*(\tilde{\rho}) \approx \rho\]

Here \(\pi^h\) denotes the projection \(X_h \times X \to X\) that forgets the hidden component.

This is useful when \(\tilde{\rho}\) is ‘simpler’, for example when the information in \(\tilde{\rho}\) admits a description as a low order polynomial.
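As a toy illustration of the hidden-variable idea, consider a hypothetical model in which a single hidden bit selects one of two simple distributions on \(X = \{0, 1, 2\}\). Everything in this sketch (the sets, the names `prior` and `emission`, the numbers) is an assumption made for the example; it only shows that forgetting the hidden component of \(\tilde{\rho}\) yields a distribution on the visible set.

```python
from fractions import Fraction

# Hypothetical hidden set X_h = {0, 1} with a uniform prior.
prior = {0: Fraction(1, 2), 1: Fraction(1, 2)}

# For each hidden value, a simple distribution on the visible set X = {0, 1, 2}.
emission = {
    0: {0: Fraction(1), 1: Fraction(0), 2: Fraction(0)},
    1: {0: Fraction(0), 1: Fraction(1, 2), 2: Fraction(1, 2)},
}

# rho_tilde is a distribution on X_h x X.
rho_tilde = {(h, x): prior[h] * emission[h][x] for h in prior for x in emission[h]}

# Push forward along the projection X_h x X -> X, i.e. sum out the hidden bit.
visible = {}
for (h, x), mass in rho_tilde.items():
    visible[x] = visible.get(x, Fraction(0)) + mass

print(visible)
```

In this example the visible distribution is a mixture of the two simple pieces, which is one elementary sense in which the augmented system can be easier to describe.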

Hopefully, this suggests a relationship between certain unsupervised machine learning techniques and Wilson’s renormalization methods.


Conventionally, one would indicate that the input of \(\rho\) is a tuple \((x, y) \in X \times Y\) by writing \(\rho(x, y)\).

This is problematic, as it is unclear whether one is referring to the distribution \(\rho\) or the number \(\rho(x, y)\). In other words, it promotes type errors. Explicit is better than implicit.

When \(f\) is an inclusion of a subset, we can generate probability distributions ‘contravariantly’:


Definition: Restriction

Given an inclusion of a subset:

\[A \overset{\iota_A}\longrightarrow X\]

we can restrict a probability distribution \(P \in \mathrm{Prob}(X)\) to \(A\):

\[\begin{split}\begin{align*} \mathrm{Prob}(X) &\overset{\iota_A^*} \longrightarrow \mathrm{Prob}(A) \\ P &\longmapsto (\iota_A^*P)(-) := P(-|A) =: P|_A \end{align*}\end{split}\]

conventionally referred to as the conditional distribution. We will refer to \(P|_A\) as \(P\) restricted to \(A\).

Obviously, I haven’t given an actual definition of the restriction of a probability distribution. Eventually, I will define it in variational terms, invoking the notion of relative entropy.
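Until that variational definition is in place, the elementary reading on a finite set is to discard the mass outside \(A\) and renormalize. The sketch below assumes exactly that reading, and additionally assumes \(A\) has nonzero probability; the function name `restrict` is illustrative.

```python
from fractions import Fraction


def restrict(P, A):
    """Restrict a finite distribution P (dict: point -> mass) to a subset A.

    Assumes the elementary definition P(x | A) = P(x) / P(A) for x in A.
    """
    total = sum(mass for x, mass in P.items() if x in A)
    if total == 0:
        raise ValueError("cannot restrict to a subset of probability zero")
    return {x: mass / total for x, mass in P.items() if x in A}


# A fair six-sided die, restricted to the even faces.
die = {face: Fraction(1, 6) for face in range(1, 7)}
even = restrict(die, {2, 4, 6})
print(even)
```

The result is a distribution on \(A\) itself rather than on \(X\), matching the direction of the map \(\iota_A^*\).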

We’d also like to discuss the special case when \(X_1\) admits an algebraic structure, for example when \(X_1 \simeq \mathbb{R}\).

“Random Variables”


A random variable \(\mathscr{O}\) on a set \(X\) is the data of an \(\mathbb{R}\)-valued function on \(X\):

\[X \overset{\mathscr{O}}\longrightarrow \mathbb{R}\]

Given a probability distribution \(\rho \in \mathrm{Prob}(X)\), this data naturally gives a probability distribution on \(\mathbb{R}\):

\[\mathscr{O}_* (\rho)\]


We apologize to those readers who are used to referring to a random variable as \(X\).

We’ve chosen this notation to emphasize what a random variable is in practice: an observable quantity, i.e. a ‘feature’.


The term ‘random variable’ is ambiguous. For example, it promotes the conflation of the (‘deterministic’) function \(\mathscr{O}\) with the (‘random’) probability distribution \(\mathscr{O}_*(\rho)\).
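The distinction can be made concrete in code, where the function and the distribution are visibly different kinds of object. In this sketch the choice of two fair dice, the observable "total shown", and the helper `pushforward` are all assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

# X: outcomes of rolling two dice; rho: the uniform distribution on X.
X = list(product(range(1, 7), repeat=2))
rho = {x: Fraction(1, 36) for x in X}


def observable(x):
    """A deterministic function X -> R: the total shown by the two dice."""
    return x[0] + x[1]


def pushforward(f, rho):
    """Distribution of f under rho: add up the mass of each fiber of f."""
    out = {}
    for x, mass in rho.items():
        out[f(x)] = out.get(f(x), Fraction(0)) + mass
    return out


# The 'random' object is the pushforward, a distribution on a subset of R.
law = pushforward(observable, rho)

print(observable((3, 4)))  # a number: nothing random about the function itself
print(law[7])              # a probability: the mass the pushforward puts on 7
```

Here `observable` plays the role of \(\mathscr{O}\) and `law` plays the role of \(\mathscr{O}_*(\rho)\); only the latter depends on \(\rho\).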

These are the essential ingredients of probability theory.