Some Mathematical Notes

An Introduction to WGAN

May 21, 2024

Abstract: These are my notes from self-studying WGAN.

Notation

  • true samples $x_1,x_2,\dots,x_n$.

  • generated samples (fake samples) $\hat{x}_1,\hat{x}_2,\dots,\hat{x}_n$. We want these samples to approximate the true samples.

  • $p(x)$ is the distribution of the true samples.

  • $q(x)$ is the distribution of the generated samples, i.e. of $x=G(z)$, where $z\sim q(z)$ and $q(z)$ is the standard Gaussian distribution.

f-divergence

Definition 1: f-divergence

[D_f(P\|Q)=\int q(x) f\left(\frac{p(x)}{q(x)}\right)dx]

For the KL divergence, $f(u)=u\,\text{log}u$; for the JS divergence, $f(u)=-\frac{u+1}{2}\,\text{log}\frac{u+1}{2}+\frac{u}{2}\,\text{log}u$.

Generally, it is required that:

1. $f:\mathbb R^+\rightarrow \mathbb R$
2. $f(1)=0$ (to ensure $D_f(P\|P)=0$)
3. $f$ is convex (to ensure $E[f(x)]\geq f(E[x])$ by Jensen's inequality)

Then we have:

[\begin{aligned}\int q(x) f\left(\frac{p(x)}{q(x)}\right)dx&=E_{x\sim q(x)}\left[f\left(\frac{p(x)}{q(x)}\right)\right]\geq f\left(E_{x\sim q(x)}\left[\frac{p(x)}{q(x)}\right]\right)\\&=f\left(\int p(x)dx\right)=f(1)=0 \end{aligned}]

Therefore, the f-divergence is non-negative.
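As a quick numerical sanity check (a minimal numpy sketch; the two discrete distributions are made up for illustration), we can evaluate the discrete form of the definition for the KL and JS choices of $f$ and confirm the result is non-negative:

```python
import numpy as np

def f_divergence(p, q, f):
    """Discrete version of D_f(P||Q) = sum_x q(x) * f(p(x)/q(x))."""
    return np.sum(q * f(p / q))

f_kl = lambda u: u * np.log(u)                                            # KL divergence
f_js = lambda u: -(u + 1) / 2 * np.log((u + 1) / 2) + u / 2 * np.log(u)   # JS divergence

p = np.array([0.2, 0.5, 0.3])   # "true" distribution (toy example)
q = np.array([0.4, 0.4, 0.2])   # "generated" distribution (toy example)

print(f_divergence(p, q, f_kl))  # >= 0
print(f_divergence(p, q, f_js))  # >= 0, equals JS(P||Q)
```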

Convex Conjugate

The tangent line at $u=\xi$ of $y=f(u)$ is

[y=f(\xi)+f'(\xi)(u-\xi)]

Since $f$ is convex, we have

[f(u)=\text{max}_{\xi\in D}\left(f(\xi)+f'(\xi)(u-\xi)\right)]

Why can we do this? Since $f$ is convex, every tangent line lies below the graph of $f$, so the expression in parentheses is at most $f(u)$ for every $\xi$; taking $\xi=u$ attains equality, so the maximum is exactly $f(u)$.

Define $t=f'(\xi)$ and $g(t)=-f(\xi)+f'(\xi)\xi$ (the convex conjugate of $f$), and write the maximizing $t$ as a function $t=T(u)$; then we have

[f(u)=\text{max}_{T} \{T(u)u-g(T(u))\}]

Therefore, we have

[D_f(P\|Q)=\text{max}_T \int q(x)\left[\frac{p(x)}{q(x)}T\left(\frac{p(x)}{q(x)}\right)-g\left(T\left(\frac{p(x)}{q(x)}\right)\right)\right]dx]

With a slight abuse of notation, denote $T(x)=T\left(\frac{p(x)}{q(x)}\right)$; then

[D_f(P\|Q)=\text{max}_T\left(E_{x\sim p(x)}[T(x)]-E_{x\sim q(x)}[g(T(x))]\right)]
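As a worked example using the definitions above: for the KL case $f(u)=u\,\text{log}u$, we get $t=f'(\xi)=\text{log}\xi+1$, hence $\xi=e^{t-1}$ and $g(t)=-f(\xi)+f'(\xi)\xi=e^{t-1}$, so the bound becomes

[D_{KL}(P\|Q)=\text{max}_T\left(E_{x\sim p(x)}[T(x)]-E_{x\sim q(x)}[e^{T(x)-1}]\right)]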

To train the generator, we then solve $\underset{G}{\text{min}}\, D_f(P\|Q)$, where $Q$ is the distribution induced by $G$.

The loss of GAN

[\underset{D}{\text{min}}\;E_{x\sim p(x)}[-\text{log}D(x)]+E_{x\sim q(x)}[-\text{log}(1-D(x))]\\
\underset{G}{\text{min}}\;E_{z\sim q(z)}[-\text{log}(D(G(z)))]]
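A minimal PyTorch sketch of these two losses (the function names are mine; `D` is assumed to output a probability in $(0,1)$, and `G`, `z` come from elsewhere):

```python
import torch

def d_loss(D, G, x_real, z):
    x_fake = G(z).detach()                          # block gradients into G
    loss_real = -torch.log(D(x_real)).mean()        # E_{x~p}[-log D(x)]
    loss_fake = -torch.log(1 - D(x_fake)).mean()    # E_{x~q}[-log(1 - D(x))]
    return loss_real + loss_fake

def g_loss(D, G, z):
    return -torch.log(D(G(z))).mean()               # E_{z~q(z)}[-log D(G(z))]
```

In practice one usually feeds raw scores into `torch.nn.functional.binary_cross_entropy_with_logits` instead, for numerical stability.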

The loss of WGAN

[\underset{G}{\text{min}}\,\underset{D,\|D\|_L \leq 1}{\text{max}}\;E_{x\sim p(x)}[D(x)]-E_{z\sim q(z)}[D(G(z))]]

The loss of WGAN-GP

[\underset{D}{\text{min}}\;E_{x\sim q(x)}[D(x)]-E_{x\sim p(x)}[D(x)]+\lambda\, E_{x\sim r(x)}\left[(\|\nabla_x D(x)\|-c)^2\right]\\
\underset{G}{\text{min}}\; E_{z\sim q(z)}[-D(G(z))]]

where $c$ is usually set to 0 or 1, and $r(x)$ is a distribution derived from $p(x)$ and $q(x)$ (in practice, the distribution of random interpolations between real and generated samples).
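A minimal PyTorch sketch of this critic objective with the gradient penalty (the function names, and the assumption that inputs are 2-D `(batch, features)` tensors, are mine):

```python
import torch

def gradient_penalty(D, x_real, x_fake, c=1.0):
    # r(x): random interpolations between real and generated samples
    eps = torch.rand(x_real.size(0), 1, device=x_real.device)    # epsilon ~ U[0, 1]
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    grad = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    return ((grad.norm(dim=1) - c) ** 2).mean()                  # E_{x~r}[(||grad D(x)|| - c)^2]

def critic_loss(D, x_real, x_fake, lam=10.0, c=1.0):
    x_fake = x_fake.detach()                                     # critic step: no gradient to G
    # E_{x~q}[D(x)] - E_{x~p}[D(x)] + lambda * gradient penalty
    return D(x_fake).mean() - D(x_real).mean() + lam * gradient_penalty(D, x_real, x_fake, c)

def generator_loss(D, G, z):
    return -D(G(z)).mean()                                       # E_{z~q(z)}[-D(G(z))]
```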

Lipschitz constraints in Deep Learning

Suppose $P_r$ is the real distribution and $P_g$ is the generated distribution. Recall that the optimization target of WGAN is

[W(P_r,P_g)=\underset{\|f\|_L\leq 1}{\text{sup}}\;E_{x\sim P_r}[f(x)]-E_{x\sim P_g}[f(x)]]

where $\|f\|_L$ is defined to be

[\|f\|_L:=\underset{x\neq y}{\text{sup}}\frac{|f(x)-f(y)|}{\|x-y\|}]

And we define gradient penalty as

[\underset{f}{\text{min}}\;E_{x\sim P_g}[f(x)]-E_{x\sim P_r}[f(x)]+\lambda\, E_{\tilde{x}}\left[\left(\|\nabla_{\tilde{x}} f(\tilde{x})\|-1\right)^2\right]]

where $\tilde{x}=\epsilon x_{\text{real}}+(1-\epsilon)x_{\text{generated}},\,\epsilon\sim U[0,1], \,x_{\text{real}}\sim P_r, \,x_{\text{generated}}\sim P_g$.

How do we enforce $\|f\|_L\leq 1$? One answer is Spectral Normalization, which replaces each weight matrix $W$ in the model $f$ with $W/\|W\|_2$, where $\|W\|_2$ is the spectral norm (largest singular value) of $W$.
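In PyTorch this is available as `torch.nn.utils.spectral_norm`; a minimal sketch (the layer sizes are arbitrary):

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Each wrapped layer's weight W is replaced by W / sigma(W) on every forward pass,
# where sigma(W) is a power-iteration estimate of the spectral norm of W.
critic = nn.Sequential(
    spectral_norm(nn.Linear(784, 256)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(256, 1)),
)
```

Since LeakyReLU is 1-Lipschitz, the Lipschitz constant of the whole network is then bounded (up to the accuracy of the power iteration) by the product of the per-layer spectral norms, i.e. by 1.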

Transformation of Probability Distribution

Set $X\sim U[0,1],\,Y\sim N(0,1)$. We denote the transformation as $Y=f(X)$ and let $\rho(x)$ be the density function of $X$. Since the transformation preserves probability mass, the probability lying in $[x,x+dx]$ must equal that in $[y,y+dy]$; thus we have

[\begin{aligned}\rho(x)dx&=\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}}dy\\ \Rightarrow \int_0^x \rho(t)dt&=\int_{-\infty}^y \frac{1}{\sqrt{2\pi}}e^{-\frac{t^2}{2}}dt\\ \Rightarrow y&=\Phi^{-1}\left(\int_0^x \rho(t)dt\right)=\Phi^{-1}(x)\end{aligned}]

Problem: how do we find $f$ in general? We could use a neural network here, i.e. $Y=G(X,\theta)$.
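Before introducing a neural network, the closed-form map $y=\Phi^{-1}(x)$ above can be checked directly (a minimal numpy/scipy sketch):

```python
import numpy as np
from scipy.stats import norm

x = np.random.rand(100_000)    # X ~ U[0, 1]
y = norm.ppf(x)                # Y = Phi^{-1}(X), Phi = standard normal CDF

print(y.mean(), y.std())       # approximately 0 and 1
```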

Wasserstein Measure and WGAN

We define

[W_c[p(x),q(x)]=\underset{\gamma\in \pi(p,q)}{\text{inf}} E_{(x,y)\sim \gamma} [c(x,y)]=\underset{\gamma\in \pi(p,q)}{\text{inf}} \int \gamma(x,y)c(x,y)dxdy]

where the cost $c(x,y)=\|x-y\|$ is a metric, and the marginal distributions of $\gamma$ are

[\int \gamma(x,y)dy=p(x), \,\int \gamma(x,y)dx=q(y)]

The dual problem, as stated above (and derived below), is

[W(P_r,P_g)=\underset{\|f\|_L\leq 1}{\text{sup}}\;E_{x\sim P_r}[f(x)]-E_{x\sim P_g}[f(x)]]

Denote

[\Gamma= \left[ \begin{array}{c} \gamma(x_1,y_1)\\ \gamma(x_1,y_2)\\ \vdots\\ \hline \gamma(x_2,y_1)\\ \gamma(x_2,y_2)\\ \vdots\\ \hline \gamma(x_n,y_1)\\ \gamma(x_n,y_2)\\ \vdots \end{array} \right] \qquad C=\left[ \begin{array}{c} c(x_1,y_1)\\ c(x_1,y_2)\\ \vdots\\ \hline c(x_2,y_1)\\ c(x_2,y_2)\\ \vdots\\ \hline c(x_n,y_1)\\ c(x_n,y_2)\\ \vdots \end{array} \right] \qquad b=\left[ \begin{array}{c} p(x_1)\\ p(x_2)\\ \vdots\\ \hline q(x_1)\\ q(x_2)\\ \vdots \end{array} \right]]

[A=\left[ \begin{array}{c|c|c|c} 1,1,0,0,\cdots&0,0,0,0,\cdots&0,0,0,0,\cdots&\cdots\\ 0,0,0,0,\cdots&1,1,0,0,\cdots&0,0,0,0,\cdots&\cdots\\ 0,0,0,0,\cdots&0,0,0,0,\cdots&1,1,0,0,\cdots&\cdots\\ \vdots&\vdots&\vdots&\vdots\\ \hline 1,0,0,0,\cdots&1,0,0,0,\cdots&1,0,0,0,\cdots&\cdots\\ 0,1,0,0,\cdots&0,1,0,0,\cdots&0,1,0,0,\cdots&\cdots\\ 0,0,1,0,\cdots&0,0,1,0,\cdots&0,0,1,0,\cdots&\cdots\\ \vdots&\vdots&\vdots&\vdots \end{array} \right]]

Then the optimization problem is in fact a linear programming problem

[\underset{\Gamma}{\text{min}}\left\{\langle\Gamma,C\rangle \mid A\Gamma=b,\ \Gamma\geq 0\right\}]
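A minimal scipy sketch of this linear program for two small discrete distributions (the supports and weights are made up; `Gamma` is flattened row by row, as in the vector above):

```python
import numpy as np
from scipy.optimize import linprog

xs = np.array([0.0, 1.0, 2.0])                     # support of p (toy example)
ys = np.array([0.5, 1.5, 3.0])                     # support of q (toy example)
p = np.array([0.4, 0.4, 0.2])
q = np.array([0.3, 0.3, 0.4])

n, m = len(xs), len(ys)
C = np.abs(xs[:, None] - ys[None, :]).flatten()    # c(x_i, y_j) = |x_i - y_j|

# Marginal constraints A Gamma = b: row sums give p, column sums give q.
A = np.zeros((n + m, n * m))
for i in range(n):
    A[i, i * m:(i + 1) * m] = 1.0                  # sum_j gamma(x_i, y_j) = p(x_i)
for j in range(m):
    A[n + j, j::m] = 1.0                           # sum_i gamma(x_i, y_j) = q(y_j)
b = np.concatenate([p, q])

res = linprog(C, A_eq=A, b_eq=b, bounds=(0, None))
print("W_1(p, q) =", res.fun)
```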

Dual Form

Generally, for a problem

[\underset{x}{\text{min}}\left\{c^{\text{T}}x \mid Ax=b,\ x\geq 0\right\}]

where $x,c\in \mathbb{R}^n; b\in \mathbb{R}^m, A\in \mathbb{R}^{m\times n}$. Its dual form is

[\underset{y}{\text{max}}\left\{b^{\text{T}}y \mid A^{\text{T}}y\leq c\right\}]

Weak Dual Form

[\underset{y}{\text{max}}\left\{b^{\text{T}}y \mid A^{\text{T}}y\leq c\right\}\leq \underset{x}{\text{min}}\left\{c^{\text{T}}x \mid Ax=b,\ x\geq 0\right\}]

Strong Dual Form

[\underset{y}{\text{max}}\left\{b^{\text{T}}y \mid A^{\text{T}}y\leq c\right\}= \underset{x}{\text{min}}\left\{c^{\text{T}}x \mid Ax=b,\ x\geq 0\right\}]
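A small numerical check with scipy (the LP below is a made-up toy example, unrelated to the transport problem): the dual is solved as $\min_y -b^{\text{T}}y$ subject to $A^{\text{T}}y\leq c$, and the two optimal values coincide.

```python
import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 1.0])
c = np.array([1.0, 2.0, 3.0])

primal = linprog(c, A_eq=A, b_eq=b, bounds=(0, None))        # min c^T x  s.t. Ax = b, x >= 0
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=(None, None))    # max b^T y  s.t. A^T y <= c

print(primal.fun, -dual.fun)                                 # both equal the same optimum (2.0 here)
```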

The proof is omitted here. It relies on an important lemma:

Farkas' Lemma: Exactly one of the following statements holds:

  • $\exists x\in \mathbb{R}^n$ with $x\geq 0$ such that $Ax=b$;
  • $\exists y\in \mathbb{R}^m$ such that $A^{\text{T}}y\leq 0$ and $b^{\text{T}}y> 0$.

The Dual Problem of the optimal transportation

[\underset{\Gamma}{\text{min}}\left\{\langle\Gamma,C\rangle \mid A\Gamma=b,\ \Gamma\geq 0\right\}=
\underset{F}{\text{max}}\left\{\langle b,F\rangle \mid A^{\text{T}}F\leq C\right\}]

We denote F as

[F=\left[ \begin{array}{c} f(x_1)\\ f(x_2)\\ \vdots\\ \hline g(x_1)\\ g(x_2)\\ \vdots \end{array} \right]]

In this way,

[\langle b,F\rangle=\sum_i \left[p(x_i)f(x_i)+q(x_i)g(x_i)\right],\quad \text{or}\quad \langle b,F\rangle=\int \left[p(x)f(x)+q(x)g(x)\right]dx]

The constraint condition can be reformulated as

[A^{\text{T}}F\leq C\Rightarrow f(x_i)+g(y_j)\leq c(x_i,y_j)\quad \text{for all } i,j]

Since

[f(x)+g(x)\leq c(x,x)=0]

We have $f(x)\leq -g(x)$, and

[p(x_i)f(x_i)+q(x_i)g(x_i)\leq p(x_i)f(x_i)-q(x_i)f(x_i)]

We could therefore set $g=-f$, which gives

[W[p,q]=\underset{f,\;f(x)-f(y)\leq \|x-y\|}{\text{max}}\int \left[p(x)-q(x)\right]f(x)dx]

Thus the training objective of WGAN is

[\underset{G}{\text{min}}\,W(P_r,P_g)=\underset{G}{\text{min}}\,\underset{f,\|f\|_L\leq 1}{\text{max}}\;E_{x\sim p(x)}[f(x)]-E_{z\sim q(z)}[f(G(z))]]

Relation to Other Loss

We say $d$ is weaker than $d'$ if every sequence that converges under $d'$ also converges under $d$. Thus we have:

Theorem: Every sequence of distributions that converges under the KL, reverse KL, TV, or JS divergence also converges under the Wasserstein distance.


Introduction to Flow Model

May 06, 2024

Flow Model

Basic Ideas

Suppose a neural network generator $G$ defines a distribution $P_G$, that is

[z\sim \pi(z), \,x\sim P_G(x),
x=G(z)]

The optimal model is

[G^*=\underset{G}{\text{argmax}}\sum_{i=1}^m \text{log}P_G(x_i) \approx \underset{G}{\text{argmin}}\, KL(P_{data}\| P_G)]

From the transformation relationship, we have

[P_G(x_i)=\pi(z_i)\left|\text{det}(J_{G^{-1}})\right|]

where $z_i=G^{-1}(x_i)$. Applying $\text{log}$ to both sides, we have

[\text{log}P_G(x_i)=\text{log}\pi(G^{-1}(x_i))+\text{log}\left|\text{det}(J_{G^{-1}})\right|]
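As a quick check of this formula on a known one-dimensional map (my own example): take $G(z)=e^z$ with $z\sim N(0,1)$, so $G^{-1}(x)=\text{log}x$ and $|\text{det}(J_{G^{-1}})|=1/x$, which should reproduce the log-normal density.

```python
import numpy as np
from scipy.stats import norm, lognorm

x = np.linspace(0.1, 5.0, 50)

# log P_G(x) = log pi(G^{-1}(x)) + log |det J_{G^{-1}}|  with  G(z) = exp(z)
log_p = norm.logpdf(np.log(x)) + np.log(1.0 / x)

print(np.allclose(log_p, lognorm.logpdf(x, s=1)))            # True
```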

The iteration of multiple models (the flow) is as follows

[\pi(x)\rightarrow\boxed{G_1}\rightarrow P_1(x)\rightarrow\boxed{G_2}\rightarrow P_2(x)\rightarrow \cdots]

Then the resulting distribution is

[P_G(x_i)=\pi(z_i)\left|\text{det}(J_{G_1^{-1}})\right| \left|\text{det}(J_{G_2^{-1}})\right|\cdots\left|\text{det}(J_{G_K^{-1}})\right|,\\
\text{i.e.,}\quad \text{log}P_G(x_i)=\text{log}\pi(z_i)+\sum_{n=1}^K\text{log}\left|\text{det}(J_{G_n^{-1}})\right|]

Coupling Layer

[Figure: coupling layer]

It is easy to see that this is an invertible transformation: on the one hand,

[x_{i\leq d}=z_{i\leq d},
x_{i> d}=\beta_{i> d} z_{i> d}+\gamma_{i> d}]

on the other hand,

[z_{i\leq d}=x_{i\leq d},
z_{i> d}=\frac{x_{i> d}-\gamma_{i> d}}{\beta_{i> d}}]

Now we can compute the Jacobian matrix

[J_G=\left[ \begin{array}{c|c} I& 0 \\ \hline *& \text{diag}(\beta_{i>d}) \end{array} \right]]

[\text{det}J_G=\frac{\partial x_{d+1}}{\partial z_{d+1}}\frac{\partial x_{d+2}}{\partial z_{d+2}}\cdots \frac{\partial x_{D}}{\partial z_{D}}=\beta_{d+1}\beta_{d+2}\cdots \beta_{D}]
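A minimal PyTorch sketch of such an affine coupling layer (the small MLP that produces $\beta$ and $\gamma$ from $z_{i\leq d}$ is my own choice; the scale is parameterized as $\beta=e^{\text{log}\beta}$ so that $\beta>0$):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim, d, hidden=64):
        super().__init__()
        self.d = d
        # maps the copied part z_{<=d} to (log beta, gamma) for the remaining dims
        self.net = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - d)),
        )

    def forward(self, z):
        z1, z2 = z[:, :self.d], z[:, self.d:]
        log_beta, gamma = self.net(z1).chunk(2, dim=1)
        x2 = torch.exp(log_beta) * z2 + gamma          # x_{>d} = beta * z_{>d} + gamma
        log_det = log_beta.sum(dim=1)                  # log det J_G = sum_i log beta_i
        return torch.cat([z1, x2], dim=1), log_det

    def inverse(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        log_beta, gamma = self.net(x1).chunk(2, dim=1)
        z2 = (x2 - gamma) * torch.exp(-log_beta)       # z_{>d} = (x_{>d} - gamma) / beta
        return torch.cat([x1, z2], dim=1)
```

For example, `x, log_det = AffineCoupling(dim=4, d=2)(torch.randn(8, 4))` returns the transformed batch together with the per-sample log-determinant used in the likelihood above.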

Coupling Layer Stacks

[Figure: stacked coupling layers]