Other than the entropy method, the transportation method is another method that controls the logarithm of the moment generating function. In this post we introduce Gibbs variational formula, which is used to derive the transportation lemma.
Gibbs Variational Formula
Let $P$ be a probability measure on $\Omega$ and $Z: \Omega \mapsto \mathbb{R}$ be any random variable.
Then, Gibbs variational formula states
[
\log \mathbb{E}_{P} e^{ Z - \mathbb{E}_{P} \left \lbrack Z \right \rbrack }
=
\sup_{Q \lt \lt P}
\left(
\mathbb{E}_{Q} \left \lbrack Z \right \rbrack
-
\mathbb{E}_{P} \left \lbrack Z \right \rbrack
-
D(Q \parallel P)
\right),
\label{eq:1}\tag{1}
]
which says that
complicated $\log \mathbb{E}_{P} e^{ Z - \mathbb{E}_{P} \left \lbrack Z \right \rbrack }$
can be decomposed into simple $\mathbb{E}_{Q}$ and $\mathbb{E}_{P}$.
The transformation from $P$ to $Q$ is the tilting introduced
in this post,
and its variations are innumerable,
but interstingly, when $Q$ is the exponential tilting of $P$, i.e.,
$
Q =
P_{
\frac
{e^{Z}}
{\mathbb{E} \left \lbrack e^{Z} \right \rbrack }
}
$,
the supremum is achieved.
Proof
Let
$
R =
P_{
\frac
{e^{Z}}
{\mathbb{E}_{p} \left \lbrack e^{Z} \right \rbrack }
}
$ and thus
$\frac{dR}{dP} = \frac {e^{Z}} {\mathbb{E}_{p} \left \lbrack e^{Z} \right \rbrack } $,
then we have
[
D(Q \parallel R)
=
\mathbb{E}_{Q} \left \lbrack \log \frac{dQ}{dR} \right \rbrack
=
\mathbb{E}_{Q} \left \lbrack \log \frac{dQ}{dP} \right \rbrack
-
\mathbb{E}_{Q} \left \lbrack \log \frac{dR}{dP} \right \rbrack
=
\mathbb{E}_{Q} \left \lbrack \log \frac{dQ}{dP} \right \rbrack
-
\mathbb{E}_{Q} \left \lbrack \log \frac {e^{Z}} {\mathbb{E}_{p} \left \lbrack e^{Z} \right \rbrack } \right \rbrack
]
The LHS is decomposed further as
[
=
\mathbb{E}_{Q} \left \lbrack \log \frac{dQ}{dP} \right \rbrack
-
\mathbb{E}_{Q} \left \lbrack \log e^{Z} \right \rbrack
+
\mathbb{E}_{Q} \left \lbrack \log {\mathbb{E}_{p} \left \lbrack e^{Z} \right \rbrack } \right \rbrack
=
D(Q \parallel P)
-
\mathbb{E}_{Q} \left \lbrack Z \right \rbrack
+
\log {\mathbb{E}_{p} \left \lbrack e^{Z} \right \rbrack }.
]
Since $D(Q \parallel R) \geq 0$, we obtain
[
\log {\mathbb{E}_{p} \left \lbrack e^{Z} \right \rbrack }
\geq
\mathbb{E}_{Q} \left \lbrack Z \right \rbrack
-
D(Q \parallel P),
\label{eq:2}\tag{2}
]
and replacing $Z$ with $Z - \mathbb{E}_{P} \left \lbrack Z \right \rbrack $ yields
[
\log \mathbb{E}_{P} e^{ Z - \mathbb{E}_{P} \left \lbrack Z \right \rbrack }
\geq
\mathbb{E}_{Q} \left \lbrack Z \right \rbrack
-
\mathbb{E}_{P} \left \lbrack Z \right \rbrack
-
D(Q \parallel P).
]
Thus, taking $\sup_{Q \lt \lt P}$ results in the statement.
The equality holds when
$
Q =
R =
P_{
\frac
{e^{Z}}
{\mathbb{E} \left \lbrack e^{Z} \right \rbrack }
}
$, because in that case $D(Q \parallel R) =0$.
Variational formula for KL Divergence
As Eq.$\,$($\ref{eq:2}$) says $ D(Q \parallel P) \geq \mathbb{E}_{Q} \left \lbrack Z \right \rbrack - \log {\mathbb{E}_{p} \left \lbrack e^{Z} \right \rbrack } $ for every $Q \lt \lt P$ and every $Z$ such that $\mathbb{E}_{p} \left \lbrack e^{Z} \right \rbrack \lt \infty$, it indicates [ D(Q \parallel P) = \sup_{Z: \mathbb{E}_{p} \left \lbrack e^{Z} \right \rbrack \lt \infty} \mathbb{E}_{Q} \left \lbrack Z \right \rbrack - \log {\mathbb{E}_{p} \left \lbrack e^{Z} \right \rbrack }. \label{eq:3}\tag{3} ] Letting $Z = \log \frac{dQ}{dP}$, we have [ \mathbb{E}_{Q} \left \lbrack Z \right \rbrack - \log {\mathbb{E}_{p} \left \lbrack e^{Z} \right \rbrack } = \mathbb{E}_{Q} \left \lbrack \log \frac{dQ}{dP} \right \rbrack - \log \mathbb{E}_{p} \left \lbrack \frac{dQ}{dP} \right \rbrack , ] and because of [ {\mathbb{E}_{p} \left \lbrack \frac{dQ}{dP} \right \rbrack } = 1 ] from the definition of the tilting, the second term is $0$. Since [ \mathbb{E}_{Q} \left \lbrack \log \frac{dQ}{dP} \right \rbrack = D(Q \parallel P), ] it indicates that the equality of Eq.$\,$($\ref{eq:3}$) holds with $Z = \log \frac{dQ}{dP}$.
Another Variational formula for KL Divergence
Jensen’s gap variational formula
For a convex function $f: I \mapsto \mathbb{R}$, the following is know as Jensen’s gap variational formula: [ \mathbb{E} \left \lbrack f(x) \right \rbrack - f \left ( \mathbb{E} \left \lbrack x \right \rbrack \right) = \min_{a \in I} \mathbb{E} \left \lbrack f(x) - f(a) - f’(a) (x-a) \right \rbrack. ] The equality holds when $a = \mathbb{E} \left \lbrack x \right \rbrack $, because the last term becomes $ f’(a) (\mathbb{E} \left \lbrack x \right \rbrack - \mathbb{E} \left \lbrack x \right \rbrack) = 0$.
Application to variance
Consider $f(x) = x^{2}$, then we have [ \mathbb{E} \left \lbrack x^{2} \right \rbrack - \mathbb{E} \left \lbrack x \right \rbrack ^{2} = \text{Var}(x) = \mathbb{E} \left \lbrack (x - \mathbb{E} \left \lbrack x \right \rbrack) ^{2}\right \rbrack. ] This can be expressed as [ = \min_{a} \mathbb{E} \left \lbrack x^{2} - a^{2} -2a (x-a) \right \rbrack = \min_{a} \mathbb{E} \left \lbrack (x - a) ^{2}\right \rbrack, ] meaning that the expected value minimizes the MSE.
Application to entropy
Consider $f(x) = x \log x$ for $x \in [0, \infty)$, then we have [ \mathbb{E} \left \lbrack x \log x \right \rbrack - \mathbb{E} \left \lbrack x \right \rbrack \log \mathbb{E} \left \lbrack x \right \rbrack = \text{Ent}(x) = \min_{a \gt 0} \mathbb{E} \left \lbrack x \log x - a \log a - (1 + \log a) (x -a) \right \rbrack = \min_{a \gt 0} \mathbb{E} \left \lbrack x \log \frac{x}{a} - (x -a) \right \rbrack ] When $x = \frac{dQ}{dP}$, [ \text{Ent}(x) = D(Q \parallel P) = \min_{a \gt 0} \mathbb{E}_{p} \left \lbrack \frac{dQ}{dP} \log \frac{dQ}{dP} - \log a - (\frac{dQ}{dP} -a) \right \rbrack, ] which is another variational formula for KL divergence.