# From Samples to Probability Distributions¶

The goal of these notes are to show how to think of sample data as probability distributions satisfying a finite support condition.

## The Empirical Distribution¶

Usually, sample data is thought of a bunch of tuples \(z_i\) for all \(i=0, \ldots, N\).

In other words:

Definition

**Sample data** is the data of a map:

for some finite indexing set \(I\). We also adopt the following notation:

Example

We review how the technology of tabular data fits into this perspective. For simplicity, we’ll consider the case of a table with two columns. Let’s say there is a column with a string type. As we’re speaking on a very general level, we’ll identify strings with some hashing:

\[\mathrm{String} \simeq \mathbb{N}\]

Furthermmore, let’s say the second column is a float, i.e. a element of \(\mathbb{R}\). In this case,

\[\Omega \simeq \mathbb{N}\times \mathbb{R}\]

Let’s further assume that our table is indexed by some set of “surrogate keys” \(I\).

- This this case, the formal definition of sample data:
- \[I \rightarrow \mathbb{N} \times \mathbb{R}\]

is the data of, for every \(i \in I\), a string and a float. This is, in particular, the data of a table with two columns. The general situation follows, mutatis mutandi.

Digression

It’s common to factor \(\Omega\) as:

and call \(\mathcal{X}\) the **space of features**, and
\(\mathcal{Y}\) the **space of labels**.

The choice of a factorization establishes a supervised learning problem.

We view the “supervision” of a learning to be a structure. The distinction between the features and labels (which we can think of as columns in a table) is a choice we make or context gives. We view the situation

Example

Standard examples to keep in mind are:

\(\mathcal{Y} = \{0, 1\}\) corresponds to binary classification problem

\(\mathcal{Y} = J\), for a finite set \(J\), corresponds to multiclass classification problem

\(\mathcal{Y} = P(J)\) corresponds to a labelling problem. Here \(P(-)\) denotes the power set construction.

\(\mathcal{Y} = \mathbb{R}\) corresponds to a regression problem

Definition

Let \(\rho_I^\mathrm{unif}\) denote the uniform distribution on the set \(I\). Viewing the sample as a map allows us to efficiently generate a distribution on \(\Omega\):

which we shall refer to as the **empirical distribution** associated to the
sample \(S\).

Warning

These notes assume a familiarity with notation reviewed in probabilistic-notation.

Note

This is construction is implicit in the construction of a histogram from data.

Motivation

In practice, the ordering of a sample has any meaning in terms of the problem at hand. More concretely, no learning algorithm should depend on any ordering of the data, as any such dependence would introduce superfluous variance to the model, without any reciprocal reduction in bias.

Therefore, if we wanted to count the number of of all possible samples, we must not overcount the number of samples by individually counting samples differing by a permutation.

One benefit of the notion of the empirical distribution is that it naturally avoids the quotient process. In other words, it’s a convenient representation of the moduli space of samples.

Note

Moreover, different choices for the distribution on \(I\) yield different “class weights” for the learning problem. This forms the basis for boosting algorithms.

Note

In classical information theory, the empirical distribution is referred to
as the **type** of the sample.

## Probability Distributions with Finite Support¶

Below is a property which, in some sense, characterizes data probabilistically:

Definition

A **probability distribution with finite support** is a probability
distribution of the form:

for some sample \(S\) and \(\rho_I\in\mathrm{Prob}(I)\).

Note

In other words, when the distribution has finite support, only finitely many points have nonzero probability.

We introduce the following notation to invite the reader to imagine sample data as forming a geometric object.

Definition

The **space of sample data**:

is defined as the space of finitely supported probability distributions.

At times, we may abusively omit reference to \(\Omega\) and simply write \(\mathfrak{D}\).

The space of sample data admits a filtration/stratification by cardinality. Below are the strata of this stratification.

Definition

We denote:

the space of distributions supported on \(n\) points, and

the space of distributions supported on less than \(n+1\) points

Example

When \(n=1\),

Note

Note that the empirical distribution can be considering as a point in the space of samples, in that there is a natural factorization:

## Large Samples Approximation of Distributions¶

Fix a probability distribution \(\rho \in \mathrm{Prob}(\Omega)\). This in turn generates a distribution

for every \(n\) in the standard fashion.

Note

This distribution satisfies a maximum entropy principle: it maximizes entropy subject to the condition that it’s marginals coincides with \(\rho\):

for every \(i\).

The following theorem is essentially the inferential interpretation of probability (i.e. outcomes of repeated trials generate the probability distribution):

Theorem