Basic Normalizing Flow

1. Change of variables formula

Normalizing Flows (NF) can model flexible distributions for data sampling and density estimation. The idea is based on the change of variables formula: apply a series of invertible transformations to a simple distribution to approximate a complex one.

Let $\boldsymbol{z}\in\mathbb{R}^d$ have density $p_1(\boldsymbol{z})$ and let $\boldsymbol{x}=f(\boldsymbol{z})$, where $f$ is an invertible and differentiable transformation ($\boldsymbol{x}$ needs to have the same dimension as $\boldsymbol{z}$). Then $\boldsymbol{x}$ has density

\[p_2(\boldsymbol{x})=p_1(f^{-1}(\boldsymbol{x}))\cdot\left|\det\left(\frac{\partial f^{-1}(\boldsymbol{x})}{\partial \boldsymbol{x}}\right)\right|=p_1(f^{-1}(\boldsymbol{x}))\cdot\left|\det\left(\frac{\partial f(\boldsymbol{z})}{\partial \boldsymbol{z}}\right)^{-1}\right|.\]

(Intuitively, probability mass is conserved: $p_2(x)\,dx=p_1(z)\,dz$, so $\int p_2(x)\,dx=\int p_1(f^{-1}(x))\,\frac{dz}{dx}\,dx=\int p_1(z)\,dz=1$.)

The input and output space of the mapping must have the same dimension. If $d=1$, it is sufficient that $f$ is strictly monotonic. Ideally both $f$ and $f^{-1}$ are continuously differentiable. Differentiability everywhere is a sufficient condition; in theory the mapping $f$ does not have to be differentiable everywhere (e.g., it can be piecewise differentiable). In practice we usually use only differentiable transformations.
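
As a quick sanity check of the formula, here is a minimal 1-D sketch (an illustrative example, not from the lecture): with $z\sim\mathcal{N}(0,1)$ and $x=f(z)=e^z$, the change of variables formula should reproduce the log-normal density.

```python
import numpy as np
from scipy import stats

def p2(x):
    """p2(x) = p1(f^{-1}(x)) * |d f^{-1}(x)/dx| for f(z) = exp(z)."""
    z = np.log(x)      # f^{-1}(x)
    jac = 1.0 / x      # derivative of f^{-1}
    return stats.norm.pdf(z) * np.abs(jac)

x = np.linspace(0.1, 5.0, 50)
# exp(standard normal) is log-normal, so the two densities should agree.
assert np.allclose(p2(x), stats.lognorm.pdf(x, s=1.0))
```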

We can also stack a series of transformations $f_i$ such that \(\boldsymbol{x}=\boldsymbol{z}_K=f_K(f_{K-1}(\cdots f_2(f_1(\boldsymbol{z}_0))))\). The change of variables formula gives

\[p_K(\boldsymbol{x})=p_0(\boldsymbol{z}_0)\prod_{i=1}^K\left|\det\left(\frac{\partial f_i^{-1}(\boldsymbol{z}_i)}{\partial \boldsymbol{z}_i}\right)\right|,\]

or the log version

\[\log p_K(\boldsymbol{x})=\log p_0(\boldsymbol{z}_0)+\sum_{i=1}^K\log\left|\det\left(\frac{\partial f_i^{-1}(\boldsymbol{z}_i)}{\partial \boldsymbol{z}_i}\right)\right|.\]
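
Below is a small sketch of this accumulation for a stack of two 1-D transforms (the transforms and names are illustrative assumptions, not from the lecture). It uses the equivalent form with forward Jacobians, i.e. it subtracts $\sum_i\log\left|\frac{\partial f_i(\boldsymbol{z}_{i-1})}{\partial \boldsymbol{z}_{i-1}}\right|$.

```python
import numpy as np
from scipy import stats

# Each transform: (forward f, inverse f^{-1}, log|df/dz| as a function of its input z).
transforms = [
    (lambda z: 3.0 * z + 1.0, lambda x: (x - 1.0) / 3.0, lambda z: np.log(3.0)),
    (np.tanh,                 np.arctanh,                lambda z: np.log1p(-np.tanh(z) ** 2)),
]

def log_density(x):
    """log p_K(x) = log p_0(z_0) - sum_i log|df_i/dz_{i-1}|."""
    log_det = 0.0
    z = x
    for f, f_inv, log_abs_df in reversed(transforms):  # invert the stack: x -> ... -> z_0
        z = f_inv(z)                # recover the input z_{i-1} of f_i
        log_det += log_abs_df(z)    # forward Jacobian of f_i, evaluated at z_{i-1}
    return stats.norm.logpdf(z) - log_det

print(log_density(0.5))
```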

2. Forward and Reverse Parametrization

2.1 Reverse Parametrization

The reverse parametrization is used for density evaluation at $\boldsymbol{x}$. If we know $g=f^{-1}$ analytically, then $\boldsymbol{z}=g_\psi(\boldsymbol{x})$ is computable and the parameters $\psi$ can be learned. The density can be evaluated by

\[p_2(\boldsymbol{x})=p_1(g_\psi(\boldsymbol{x}))\cdot\left|\det\left(\frac{\partial g_\psi(\boldsymbol{x})}{\partial \boldsymbol{x}}\right)\right|.\]

This formula also works for the stacked transformation $g_\psi=g_{\psi_1}\circ\cdots\circ g_{\psi_k}$.

To learn the parameters $\psi$, we maximize the (average) log-likelihood of the data $\mathcal{D}=\{\boldsymbol{x}^{(j)}\}_{j=1}^n$:

\[\max\limits_{\psi}\frac{1}{n}\sum_{\boldsymbol{x}^{(j)}\in\mathcal{D}}\log p_\psi(\boldsymbol{x}^{(j)}).\]
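
A minimal sketch of this procedure (an illustrative toy, assuming a 1-D affine flow $g_\psi(x)=(x-\mu)/s$ with $\psi=(\mu,\log s)$, which is not from the lecture):

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)   # toy dataset D

def neg_log_likelihood(psi):
    mu, log_s = psi
    z = (data - mu) / np.exp(log_s)                # z = g_psi(x)
    # log p_psi(x) = log p_1(g_psi(x)) + log|dg_psi/dx| = log N(z; 0, 1) - log s
    log_p = stats.norm.logpdf(z) - log_s
    return -np.mean(log_p)                         # minimize the negative average log-likelihood

psi_hat = optimize.minimize(neg_log_likelihood, x0=np.zeros(2)).x
print(psi_hat)   # expect mu close to 2.0 and log s close to log(0.5)
```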

2.2 Forward Parametrization

The forward parametrization can be used for sampling $\boldsymbol{x}$ and for density evaluation at the sample. We assume that $\boldsymbol{x}=f_\theta(\boldsymbol{z})$ is computable and the parameters $\theta$ can be learned. The inverse $f^{-1}$ exists but may not be easy to compute.

For each sample $\boldsymbol{z}^{(j)}\sim p_1(\boldsymbol{z})$, we can compute the sample \(\boldsymbol{x}^{(j)}=f_\theta(\boldsymbol{z}^{(j)})\sim p_2(\boldsymbol{x})\) and the density \(p_2(\boldsymbol{x}^{(j)})=p_1(\boldsymbol{z}^{(j)})\cdot\left\vert\det\left(\frac{\partial f_\theta(\boldsymbol{z}^{(j)})}{\partial \boldsymbol{z}}\right)\right\vert^{-1}\).

The formula also works for $f_\theta=f_{\theta_1}\circ\ \cdots\circ f_{\theta_k}$.
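
A short sketch of this forward direction (an illustrative example, not from the lecture): $x=f(z)=z+z^3$ is strictly increasing and hence invertible, but its inverse is awkward to write down. We can still sample $x$ and evaluate the density at the sample using only the forward Jacobian.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

z = rng.normal(size=10_000)        # z^(j) ~ p_1 = N(0, 1)
x = z + z ** 3                     # x^(j) = f(z^(j)) ~ p_2
jac = 1.0 + 3.0 * z ** 2           # df/dz evaluated at z^(j)
log_p2 = stats.norm.logpdf(z) - np.log(np.abs(jac))   # log p_2(x^(j))

print(x[:3], log_p2[:3])
```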

PS: This is exactly what we need in Variational Inference: to sample $\boldsymbol{x}$ from a distribution $q$ and to compute the probability $q(\boldsymbol{x})$ for this sample.

References

Machine Learning for Graphs and Sequential Data (IN2323), TUM S21