The Dual Nature of MLE and KL Divergence in Generative Modeling
\(\DeclareMathOperator*{\argmin}{argmin}\) \(\DeclareMathOperator*{\argmax}{argmax}\)
KL-Divergence
In generative modeling, we assume the existence of a true probability distribution \(p_{\text{data}}\) of our dataset. However, this distribution is usually unknown. We aim to approximate the true data distribution \(p_{\text{data}}\), with a model distribution \(p_{\theta}\), parameterized by \(\theta\), striving for the closest possible match. One solution to achieve this is to find \(\theta^{*}_{\text{KL}}\) that minimizes the Kullback–Leibler divergence between \(p_{\text{data}}\) and \(p_{\theta}\).
Recall that the KL-divergence between two probability distributions \(p\) and \(q\) is given by:
\[D_{\text{KL}}(p || q) = \mathbb{E}_{\mathbf{x} \sim p(\mathbf{x})}\left[\log \left(\frac{p(\mathbf{x})}{q(\mathbf{x})}\right)\right]\]
with three properties:
- $$D_{\text{KL}}(p || q) \ge 0$$
- $$D_{\text{KL}}(p || q) = 0 \text{ if and only if } p=q$$
- $$D_{\text{KL}}(p || q) \ne D_{\text{KL}}(q || p)$$
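To make these properties concrete, here is a minimal NumPy sketch; the two discrete distributions are hypothetical values chosen purely for illustration:

```python
import numpy as np

# Two arbitrary discrete distributions over three outcomes
# (hypothetical values, chosen purely for illustration).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

def kl(p, q):
    """D_KL(p || q) = sum over x of p(x) * log(p(x) / q(x))."""
    return np.sum(p * np.log(p / q))

print(kl(p, q))  # ~0.0253  (non-negative)
print(kl(q, p))  # ~0.0258  (differs from kl(p, q): KL is not symmetric)
print(kl(p, p))  # 0.0      (zero exactly when the distributions match)
```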
Our objective is to find the parameter \(\theta^{*}_{\text{KL}}\) that minimizes \(D_{\text{KL}}(p_{\text{data}} || p_{\theta})\). Note that \(D_{\text{KL}}(p_{\text{data}} || p_{\theta})\) is preferred over \(D_{\text{KL}}(p_{\theta} || p_{\text{data}})\) because the expectation in the former is taken with respect to \(p_{\text{data}}\), the true data distribution, which we can sample from, whereas in the latter it is taken with respect to \(p_{\theta}\). We can expand the objective as:
\[\begin{align} \theta^{*}_{\text{KL}} &= \argmin\limits_{\theta} D_{\text{KL}}(p_{\text{data}} || p_{\theta}) \\ &= \argmin\limits_{\theta} \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x})}\left[\log \left(\frac{p_{\text{data}}(\mathbf{x})}{p_{\theta}(\mathbf{x})}\right)\right] \\ &= \argmin\limits_{\theta} \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x})}[\log (p_{\text{data}}(\mathbf{x}))] - \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x})}[\log (p_{\theta}(\mathbf{x}))] \\ &= \argmax\limits_{\theta} \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x})}[\log (p_{\theta}(\mathbf{x}))] \end{align}\]
The term \(\mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x})}[\log (p_{\text{data}}(\mathbf{x}))]\) is constant with respect to \(\theta\), so dropping it in the last step does not change the optimizer, and minimizing the remaining negative expected log-likelihood is the same as maximizing the expected log-likelihood. Since the last expression is an expectation, we can use sampling to compute an empirical approximation of it. Given the presupposition of a true underlying distribution \(p_{\text{data}}\), we can represent our dataset as samples \(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(n)}\) drawn from \(p_{\text{data}}(\mathbf{x})\), which leads us to the following derivation:
\[\begin{align} \argmin\limits_{\theta} D_{\text{KL}}(p_{\text{data}} || p_{\theta}) = \argmax\limits_{\theta} \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x})}[\log (p_{\theta}(\mathbf{x}))] \approx \argmax\limits_{\theta} \frac{1}{n} \sum_{i=1}^{n}\log (p_{\theta}(\mathbf{x}^{(i)})) \end{align}\]
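To see this Monte Carlo approximation in action, the sketch below pretends, purely for illustration, that \(p_{\text{data}}\) is a standard Gaussian we can sample from, and estimates \(\mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x})}[\log (p_{\theta}(\mathbf{x}))]\) for a few candidate parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend p_data is a standard Gaussian; in practice it is unknown and
# we only ever see samples from it (this choice is purely illustrative).
samples = rng.normal(loc=0.0, scale=1.0, size=10_000)

def log_p_theta(x, mu, sigma):
    """Log-density of a Gaussian model p_theta, with theta = (mu, sigma)."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

# Empirical approximation of E_{x ~ p_data}[log p_theta(x)]:
# (1/n) * sum_i log p_theta(x_i)
for mu in (0.0, 0.5, 1.0):
    estimate = np.mean(log_p_theta(samples, mu, 1.0))
    print(f"mu = {mu}: {estimate:.3f}")
# The estimate is largest at mu = 0, the model closest to p_data.
```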
Maximum Likelihood Estimation

In parallel with our exploration of KL-divergence, Maximum Likelihood Estimation (MLE) offers a complementary perspective. MLE finds the parameter \(\theta^{*}\) that maximizes the likelihood of observing the dataset under the model. More formally, given the i.i.d. samples \(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(n)}\), we can write the MLE objective as:
\[\begin{align} \theta^{*}_{\text{MLE}} &= \argmax\limits_{\theta} p_{\theta}(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(n)}) \\ &= \argmax\limits_{\theta} \prod_{i=1}^{n}p_{\theta}(\mathbf{x}^{(i)}) \\ &= \argmax\limits_{\theta} \log \left(\prod_{i=1}^{n}p_{\theta}(\mathbf{x}^{(i)}) \right) \\ &= \argmax\limits_{\theta} \sum_{i=1}^{n} \log p_{\theta}(\mathbf{x}^{(i)}) \end{align}\]
Here the second step uses the i.i.d. assumption, and taking the logarithm preserves the maximizer because it is monotonically increasing. The similarity to the KL objective is already apparent.
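As a quick numerical check, here is a toy sketch, assuming a one-parameter Gaussian model with fixed unit variance: maximizing \(\sum_{i} \log p_{\theta}(\mathbf{x}^{(i)})\) over a grid of candidate means recovers the sample mean, which is the known closed-form MLE for this model.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=5_000)  # i.i.d. samples

def total_log_likelihood(mu, x, sigma=1.0):
    """sum_i log p_theta(x_i) for a Gaussian model with fixed sigma."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu) ** 2 / (2 * sigma**2))

# Grid search over candidate means; keep the maximizer.
candidates = np.linspace(0.0, 4.0, 401)
log_liks = [total_log_likelihood(mu, data) for mu in candidates]
mu_mle = candidates[int(np.argmax(log_liks))]

print(mu_mle)       # close to 2.0 ...
print(data.mean())  # ... and to the sample mean, the closed-form MLE
```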
Relationship between MLE and KL-Divergence
The equations above show the connection between MLE and KL-divergence. In MLE, our goal is to find the parameter set \(\theta^{*}_{\text{MLE}}\) that maximizes the log-likelihood of our observed dataset. Conversely, in the context of KL-divergence, we minimize the divergence between the model distribution and the true data distribution. The connection can be shown as follows:
\[\begin{align} \theta^{*}_{\text{KL}} &= \argmin\limits_{\theta} D_{\text{KL}}(p_{\text{data}} || p_{\theta}) \\ &\approx \argmax\limits_{\theta} \frac{1}{n} \sum_{i=1}^{n}\log (p_{\theta}(\mathbf{x}^{(i)})) \\ &= \argmax\limits_{\theta} \sum_{i=1}^{n}\log (p_{\theta}(\mathbf{x}^{(i)})) \\ &= \theta^{*}_{\text{MLE}} \end{align}\]
where dropping the positive constant \(\frac{1}{n}\) does not change the maximizer. The practical significance of this duality lies in its utility for constructing probabilistic models and understanding the core of machine learning inference. Fitting a model by MLE drives its distribution as close, in the KL sense, to the true data distribution as the model family allows. Consider two probability models \(f\) and \(g\), each parameterized by a set of parameters \(\theta\). The figure below represents the application of MLE to find the optimal parameters.
The optimal parameters \(\theta^{*}\) for our models \(f\) and \(g\), namely \(f_{\theta^{*}}\) and \(g_{\theta^{*}}\), are those that minimize the KL-divergence to \(p_{\text{data}}\). Therefore, when we employ MLE, we are implicitly minimizing the KL-divergence between our model's distribution and the true distribution of the data.
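To make the two-family picture concrete, here is a small sketch, assuming a hypothetical gamma-distributed dataset and SciPy for the log-densities, that fits a Gaussian family \(f\) and an exponential family \(g\) by their closed-form MLEs and compares average log-likelihoods. Since the entropy of \(p_{\text{data}}\) is the same constant for both families, the one with the higher average log-likelihood has the smaller KL-divergence to the data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical dataset; the models do not "know" it is gamma-distributed.
data = rng.gamma(shape=2.0, scale=1.0, size=5_000)

# Family f: Gaussian. Family g: exponential. Both have closed-form MLEs.
mu, sigma = data.mean(), data.std()  # Gaussian MLE (ddof=0 is the MLE)
scale = data.mean()                  # exponential MLE: scale = sample mean

avg_ll_f = np.mean(stats.norm.logpdf(data, loc=mu, scale=sigma))
avg_ll_g = np.mean(stats.expon.logpdf(data, scale=scale))

# Higher average log-likelihood <=> smaller D_KL(p_data || model), up to
# the shared entropy constant of p_data.
print(f"f (Gaussian):    {avg_ll_f:.3f}")
print(f"g (exponential): {avg_ll_g:.3f}")
```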
Limitations
This approach, however, encounters a fundamental limitation: because \(p_{\text{data}}\) is unknown, we cannot directly compute \(D_{\text{KL}}(p_{\text{data}} || p_{\theta})\). While we can estimate \(\mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x})}[\log (p_{\theta}(\mathbf{x}))]\) using sampling methods, the full divergence
\[D_{\text{KL}}(p_{\text{data}} || p_{\theta}) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x})}[\log (p_{\text{data}}(\mathbf{x}))] - \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x})}[\log (p_{\theta}(\mathbf{x}))]\]
depends on the term \(\mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x})}[\log (p_{\text{data}}(\mathbf{x}))]\), which remains unknown. Future explorations must therefore focus on developing or refining methods that better approximate or characterize \(p_{\text{data}}\); such advances are crucial for improving the accuracy and reliability of model predictions.
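The gap can be made explicit in a toy setting where we secretly know \(p_{\text{data}}\). In the sketch below, which assumes a standard Gaussian as the "true" distribution purely for illustration, the cross term is estimable from samples alone, while the entropy term can be evaluated only because the toy knows the true density:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
samples = rng.normal(size=10_000)  # "unknown" p_data (secretly N(0, 1))

# Estimable from samples alone: E_{x ~ p_data}[log p_theta(x)]
cross_term = np.mean(stats.norm.logpdf(samples, loc=0.5, scale=1.0))

# NOT estimable in practice: E_{x ~ p_data}[log p_data(x)] requires the
# density of p_data itself. We can evaluate it here only because this
# toy example secretly knows that p_data is N(0, 1).
entropy_term = np.mean(stats.norm.logpdf(samples, loc=0.0, scale=1.0))

# Closed form: D_KL(N(0,1) || N(0.5,1)) = 0.5**2 / 2 = 0.125
print(entropy_term - cross_term)  # ~0.125
```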