From Samples to Probability Distributions

The goal of these notes are to show how to think of sample data as probability distributions satisfying a finite support condition.

The Empirical Distribution

Usually, sample data is thought of a bunch of tuples \(z_i\) for all \(i=0, \ldots, N\).

In other words:


Sample data is the data of a map:

\[\begin{split}I &\overset{S}\longrightarrow \Omega \\ i &\longmapsto S_i =: z_i\end{split}\]

for some finite indexing set \(I\). We also adopt the following notation:

\[S \in \Omega^I\]


We review how the technology of tabular data fits into this perspective. For simplicity, we’ll consider the case of a table with two columns. Let’s say there is a column with a string type. As we’re speaking on a very general level, we’ll identify strings with some hashing:

\[\mathrm{String} \simeq \mathbb{N}\]

Furthermmore, let’s say the second column is a float, i.e. a element of \(\mathbb{R}\). In this case,

\[\Omega \simeq \mathbb{N}\times \mathbb{R}\]

Let’s further assume that our table is indexed by some set of “surrogate keys” \(I\).

This this case, the formal definition of sample data:
\[I \rightarrow \mathbb{N} \times \mathbb{R}\]

is the data of, for every \(i \in I\), a string and a float. This is, in particular, the data of a table with two columns. The general situation follows, mutatis mutandi.


It’s common to factor \(\Omega\) as:

\[\Omega \simeq \mathcal{X} \times \mathcal{Y}\]

and call \(\mathcal{X}\) the space of features, and \(\mathcal{Y}\) the space of labels.

The choice of a factorization establishes a supervised learning problem.

We view the “supervision” of a learning to be a structure. The distinction between the features and labels (which we can think of as columns in a table) is a choice we make or context gives. We view the situation


Standard examples to keep in mind are:

  1. \(\mathcal{Y} = \{0, 1\}\) corresponds to binary classification problem

  2. \(\mathcal{Y} = J\), for a finite set \(J\), corresponds to multiclass classification problem

  3. \(\mathcal{Y} = P(J)\) corresponds to a labelling problem. Here \(P(-)\) denotes the power set construction.

  4. \(\mathcal{Y} = \mathbb{R}\) corresponds to a regression problem


Let \(\rho_I^\mathrm{unif}\) denote the uniform distribution on the set \(I\). Viewing the sample as a map allows us to efficiently generate a distribution on \(\Omega\):

\[\rho_S := S_*(\rho_I^\mathrm{unif}) = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}\]

which we shall refer to as the empirical distribution associated to the sample \(S\).


These notes assume a familiarity with notation reviewed in probabilistic-notation.


This is construction is implicit in the construction of a histogram from data.


In practice, the ordering of a sample has any meaning in terms of the problem at hand. More concretely, no learning algorithm should depend on any ordering of the data, as any such dependence would introduce superfluous variance to the model, without any reciprocal reduction in bias.

Therefore, if we wanted to count the number of of all possible samples, we must not overcount the number of samples by individually counting samples differing by a permutation.

One benefit of the notion of the empirical distribution is that it naturally avoids the quotient process. In other words, it’s a convenient representation of the moduli space of samples.


Moreover, different choices for the distribution on \(I\) yield different “class weights” for the learning problem. This forms the basis for boosting algorithms.


In classical information theory, the empirical distribution is referred to as the type of the sample.

Probability Distributions with Finite Support

Below is a property which, in some sense, characterizes data probabilistically:


A probability distribution with finite support is a probability distribution of the form:

\[(I \overset{S}\longrightarrow \Omega)_*\rho_I = \sum_{i=1}^n \rho_i \delta_{x_i}\]

for some sample \(S\) and \(\rho_I\in\mathrm{Prob}(I)\).


In other words, when the distribution has finite support, only finitely many points have nonzero probability.

We introduce the following notation to invite the reader to imagine sample data as forming a geometric object.


The space of sample data:

\[\mathfrak{D}(\Omega) := \mathrm{Prob}^\mathrm{fin}(\Omega)\]

is defined as the space of finitely supported probability distributions.

At times, we may abusively omit reference to \(\Omega\) and simply write \(\mathfrak{D}\).

The space of sample data admits a filtration/stratification by cardinality. Below are the strata of this stratification.


We denote:

\[\mathfrak{D}_n \subset \mathfrak{D}\]

the space of distributions supported on \(n\) points, and

\[\mathfrak{D}_{\leq n} \subset \mathfrak{D}\]

the space of distributions supported on less than \(n+1\) points


When \(n=1\),

\[\mathfrak{D}_1 (\Omega) \simeq \Omega\]


Note that the empirical distribution can be considering as a point in the space of samples, in that there is a natural factorization:

\[\Omega^I \rightarrow \mathfrak{D}(\Omega) \rightarrow \mathrm{Prob}(\Omega)\]

Large Samples Approximation of Distributions

Fix a probability distribution \(\rho \in \mathrm{Prob}(\Omega)\). This in turn generates a distribution

\[\rho^{\otimes n} \in \mathrm{Prob}(\Omega^n)\]

for every \(n\) in the standard fashion.


This distribution satisfies a maximum entropy principle: it maximizes entropy subject to the condition that it’s marginals coincides with \(\rho\):

\[(\Omega^{\times n} \overset{\pi^i}\longrightarrow \Omega)_*\tilde{\rho} = \rho\]

for every \(i\).

The following theorem is essentially the inferential interpretation of probability (i.e. outcomes of repeated trials generate the probability distribution):


\[\bigl(\Omega^n \rightarrow \mathfrak{D}_{\leq n} \rightarrow \mathrm{Prob}(\Omega)\bigl)_* \rho^{\otimes n} \overset{n\rightarrow \infty}\longrightarrow \delta_{\rho} \in \mathrm{Prob}\bigl(\mathrm{Prob}(\Omega) \bigl)\]