Hypotheses, Models and Loss Functions

These notes seek to establish a precise notion of losses and functions in statistical learning using a probabilistic framework

Models and Hypotheses


We call

\[\mathcal{H}_\mathrm{gen} := \mathrm{Prob}(\mathcal{X}\times\mathcal{Y})\]

the space of generative hypotheses

A generative model is a family of generative hypotheses.


A generative model is a space \(\mathcal{M}\), along with a map:

\[\begin{split}\mathcal{M} &\overset{\Theta}\longrightarrow \mathcal{H}_\mathrm{gen} \\ f &\longmapsto \rho^f \in \mathrm{Prob}(\mathcal{X}\times\mathcal{Y})\end{split}\]

Discriminative models are similiar to generative models, but are impartial in regard to the distribution of features:


We call

\[\mathcal{H}_\mathrm{discr} := \mathrm{Maps}\bigl(\mathcal{X}, \mathrm{Prob}(\mathcal{Y})\bigl)\]

the space of discriminative hypotheses


A discriminative model is a space \(\mathcal{M}\), along with a map:

\[\begin{split}\mathcal{M} &\overset{\Theta}\longrightarrow \mathcal{H}_\mathrm{discr} \\ f &\mapsto \Bigl[x \longmapsto \rho^f_x \Bigl] \in \mathrm{Maps}\bigl(\mathcal{X},\mathrm{Prob}(\mathcal{Y})\bigl)\end{split}\]


In general, \(\Theta\) is not an embedding.


When \(\mathcal{Y} \simeq \mathbb{R}\), the learning problem is called regression.

Linear regression may be described as:

\[\begin{split}\mathcal{M} &\simeq (\mathbb{R}^n)^\vee \longrightarrow \mathcal{H}_\mathrm{discr} \\ f &\simeq \beta \longmapsto \bigl( x \mapsto \delta_{\beta(x)} \bigl) \in \mathrm{Maps}(\mathbb{R}^n, \mathbb{R})\end{split}\]

Bayes’ Law factors generative models into the data of a discriminative model and a probability distribution on the space of features:


There is a natural equivalence:

\[\begin{split}\mathcal{H}_\mathrm{discr}\times\mathrm{Prob}(\mathcal{X}) &\overset{\simeq}\longrightarrow \mathcal{H}_\mathrm{gen} \\ (f, \tilde{\rho}) &\longmapsto \bigl( (x, y) \mapsto \tilde{\rho}(x) \rho_x^f(y) \bigl)\end{split}\]

Learning problems come in two flavors: supervised and unsupervised. We take the somewhat unusual approach, defining unsupervised learning as a special case of supervised learning:


A learning problem is unsupervised if:

\[\mathcal{Y} \simeq *\]


Note this means supervised learning problems inherit any construction or property of supervised learning problems.

However, it may be the case that they become uninteresting in the unsupervised case. For example, there are no interesting discriminative models for unsupervised learning problems.


To deemphasize the difference between supervised and unsupervised, we’ll adopt the notation:

\[\Omega := \mathcal{X}\times\mathcal{Y}\]

Loss Functions

A loss function pairs data and models, producing a number. This both assesses models given data and allows data to act on the space of models via gradient descents.


A loss function is a map:

\[\begin{split}\mathfrak{D} \times \mathcal{M} &\overset{\mathscr{L}}\longrightarrow \mathbb{R} \\ (\rho, f) &\longmapsto \mathscr{L}_{\rho} (f)\end{split}\]

where we are viewing data, \(\rho\), as a finitely support probability distribution on \(\Omega\)


A standard example comes from relative entropy:

\[\begin{split}\mathrm{Prob}(\Omega) \times \mathrm{Prob}(\Omega) &\longrightarrow \mathbb{R} \\ (\rho_A, \rho_B) &\longmapsto \mathcal{S}(\rho_A || \rho_B)\end{split}\]

and is:

\[\begin{split}\mathcal{D}(\Omega) \times \mathcal{M} \overset{\Theta}\longrightarrow &\mathrm{Prob}(\Omega) \times \mathrm{Prob}(\Omega)\overset{S( - || - )}\longrightarrow \mathbb{R} \\ (\rho^\mathrm{emp}, f) &\longmapsto \mathcal{S}( \rho^\mathrm{emp} || \rho_x)\end{split}\]


Although relative entropy does not coicide with cross entropy, they differ by a term independent of the model, so that optimizing relative entropy coincides with optimizing cross entropy.

Some loss functions do not depend on the hypothesis of the underlying model. In other words, many may no reference to \(\Theta\).


In many instances, there are additional regularization terms and hyperparameters:

\[\begin{split}\mathcal{M} \times \Lambda &\overset{g}\longrightarrow \mathbb{R} \\ f, \lambda &\longmapsto g_\lambda(f)\end{split}\]

which define a regularized loss function

\[\mathscr{L}_{\rho, \lambda}(f) : = \mathscr{L}_{\rho}(f) + g_\lambda(f)\]


A standard example comes from some linear coordinates:

\[\begin{split}\mathcal{M} &\simeq \mathbb{R}^n \\ f &\mapsto \beta\end{split}\]

and \(g_\lambda = \lambda \lvert\lvert - \lvert\lvert^2\) comes from a norm with a scaling factor \(\lambda\):

\[\mathscr{L}_{\rho, \lambda}(f) : = \mathscr{L}_{\rho}(f) + \lambda \lvert \lvert \beta \lvert\lvert^2\]


Regularization alters not only the loss function, but also augments the model. As a rule of thumb, this augmentation should be slight, as one normally optimzes hyperparameters though some randomized search.