An Introduction to WGAN

Abstract: These are my notes from self-studying WGAN.

Notation

  • True samples $x_1,x_2,\dots,x_n$.

  • New (false) samples $\hat{x}_1,\hat{x}_2,\dots,\hat{x}_n$. We want these to approximate the true samples.

  • $p(x)$ is the distribution of the true samples.

  • $q(x)$ is the distribution of the false samples, which are generated by $x=G(z)$, where $z\sim q(z)$ and $q(z)$ is the standard Gaussian distribution.
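As a concrete (hypothetical) instance of this setup, here is a minimal PyTorch sketch of a generator $G$ that maps standard Gaussian noise $z\sim q(z)$ to false samples $x=G(z)$; the architecture and dimensions are made up purely for illustration.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps z ~ N(0, I) to a sample x = G(z); x then has distribution q(x)."""
    def __init__(self, z_dim=8, x_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 64), nn.ReLU(),
            nn.Linear(64, x_dim),
        )

    def forward(self, z):
        return self.net(z)

G = Generator()
z = torch.randn(16, 8)   # z ~ q(z), the standard Gaussian
x_fake = G(z)            # false samples \hat{x} = G(z)
```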

f-divergence

Definition 1: f-divergence

\[D_f(P||Q)=\int q(x) f(\frac{p(x)}{q(x)})dx\]

For the KL-divergence, $f(u)=u\log u$; for the JS-divergence, $f(u)=-\frac{u+1}{2}\log\frac{u+1}{2}+\frac{u}{2}\log u$.

Generally, it is required that:

1. $f:\mathbb R^+\rightarrow \mathbb R$;
2. $f(1)=0$ (to ensure $D_f(P||P)=0$);
3. $f$ is convex (so that Jensen's inequality $E[f(x)]\geq f(E[x])$ applies).

Then we have:

\[\begin{aligned}\int q(x) f\left(\frac{p(x)}{q(x)}\right)dx&=E_{x\sim q(x)}\left[f\left(\frac{p(x)}{q(x)}\right)\right]\geq f\left(E_{x\sim q(x)}\left[\frac{p(x)}{q(x)}\right]\right) \\&=f\left(\int p(x)\,dx\right)=f(1)=0 \end{aligned}\]

Therefore, the f-divergence is non-negative.
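As a quick numerical sanity check (my own sketch, not part of the note): with $f(u)=u\log u$ the f-divergence reproduces $KL(P\|Q)$ for two discrete distributions, and is non-negative as claimed.

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])              # p(x), a discrete "true" distribution
q = np.array([0.4, 0.4, 0.2])              # q(x), a discrete "false" distribution

f = lambda u: u * np.log(u)                # the f of the KL-divergence
D_f = np.sum(q * f(p / q))                 # sum_x q(x) f(p(x)/q(x))
kl  = np.sum(p * np.log(p / q))            # KL(P||Q) computed directly

print(D_f, kl)                             # both ≈ 0.0946, and >= 0 as expected
```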

Convex Conjugate

The tangent line at $u=\xi$ of $y=f(u)$ is

\[y=f(\xi)+f'(\xi)(u-\xi)\]

Since $f$ is convex, we have

\[f(u)=\text{max}_{\xi\in D}\left(f(\xi)+f'(\xi)(u-\xi)\right)\]

This holds because $f$ is convex: every tangent line lies below the graph, so $f(u)\geq f(\xi)+f'(\xi)(u-\xi)$ for all $\xi$, with equality when $\xi=u$. Taking the maximum over $\xi$ therefore recovers $f(u)$.

Define $t=f'(\xi)$ and $g(t)=-f(\xi)+f'(\xi)\xi$ (the convex conjugate of $f$), and write $t=T(u)$ for the maximizing value; then

\[f(u)=\text{max}_{T} \{T(u)u-g(T(u))\}\]

Therefore, we have

\[D_f(P||Q)=\text{max}_T \int q(x)\left[\frac{p(x)}{q(x)}T(\frac{p(x)}{q(x)})-g(T(\frac{p(x)}{q(x)}))\right]dx\]

With a slight abuse of notation, denote $T(x)=T\left(\frac{p(x)}{q(x)}\right)$; then

\[D_f(P||Q)=\text{max}_T(E_{x\sim p(x)}[T(x)]-E_{x\sim q(x)}[g(T(x))])\]

To train the generator, we then minimize this divergence over $G$, i.e. we solve $\underset{G}{\text{min}}\, D_f(P||Q)$.
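As a numerical illustration of the conjugate representation (my own sketch): for $f(u)=u\log u$, a standard computation gives the conjugate $g(t)=e^{t-1}$, and maximizing $tu-g(t)$ over a grid of $t$ recovers $f(u)$.

```python
import numpy as np

u = 2.5
t = np.linspace(-5.0, 5.0, 100001)         # grid over t = T(u)
lower_bounds = t * u - np.exp(t - 1.0)     # t*u - g(t), with g(t) = exp(t - 1)

print(lower_bounds.max())                  # max_t { t*u - g(t) } ≈ 2.2907
print(u * np.log(u))                       # f(u) = u*log(u)      ≈ 2.2907
```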

The loss of GAN

\[\underset{D}{\text{min}}\,E_{x\sim p(x)}[-\log D(x)]+E_{x\sim q(x)}[-\log(1-D(x))]\\ \underset{G}{\text{min}}\,E_{z\sim q(z)}[-\log(D(G(z)))]\]
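A minimal PyTorch sketch of these two losses (my own illustration; `D` is assumed to output a probability in $(0,1)$, e.g. via a final sigmoid):

```python
import torch

def d_loss(D, x_real, x_fake):
    # min_D  E_{x~p}[-log D(x)] + E_{x~q}[-log(1 - D(x))]
    return -torch.log(D(x_real)).mean() - torch.log(1.0 - D(x_fake)).mean()

def g_loss(D, G, z):
    # min_G  E_{z~q(z)}[-log D(G(z))]   (non-saturating generator loss)
    return -torch.log(D(G(z))).mean()
```

In practice one would usually compute these from pre-sigmoid logits with `torch.nn.functional.binary_cross_entropy_with_logits` for numerical stability.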

The loss of WGAN

\[\underset{G}{\text{min}}\underset{D,\|D\|_L \leq 1}{\text{max}}E_{x\sim p(x)}[D(x)]-E_{z\sim q(z)}[D(G(z))]\]
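A corresponding sketch of the WGAN objective (my own illustration; here `D` is the critic, with no sigmoid at the output, and the constraint $\|D\|_L\leq 1$ has to be enforced separately, e.g. by weight clipping, the gradient penalty, or the spectral normalization discussed below):

```python
def critic_loss(D, x_real, x_fake):
    # inner max_D  E_{x~p}[D(x)] - E_{z~q(z)}[D(G(z))]  <=>  minimize its negative
    return -(D(x_real).mean() - D(x_fake).mean())

def generator_loss(D, G, z):
    # outer min_G  -E_{z~q(z)}[D(G(z))]
    return -D(G(z)).mean()
```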

The loss of WGAN-GP

\[\underset{D}{\text{min}}\,E_{x\sim q(x)}[D(x)]-E_{x\sim p(x)}[D(x)]+\lambda\, E_{x\sim r(x)}\left[(\|\nabla_x D(x)\|-c)^2\right]\\ \underset{G}{\text{min}}\, E_{z\sim q(z)}[-D(G(z))]\]

where $c$ is usually set to 0 or 1, and $r(x)$ is a distribution derived from $p(x)$ and $q(x)$ (in practice, the distribution of points interpolated between real and generated samples).

Lipschitz constraints in Deep Learning

Suppose $P_r$ is the real distribution and $P_g$ is the generated distribution. Recall that the optimization target of WGAN is

\[W(P_r,P_g)=\underset{\|f\|_L\leq 1}{\text{sup}}E_{x\sim P_r}[f(x)]-E_{x\sim P_g}[f(x)]\]

where $\|f\|_L$ is defined to be

\[\|f\|_L:=\underset{x\neq y}{\text{sup}}\frac{|f(x)-f(y)|}{\|x-y\|}\]

And we define gradient penalty as

\[\underset{f}{\text{min}}\,E_{x\sim P_g}[f(x)]-E_{x\sim P_r}[f(x)]+\lambda\, E_{\tilde{x}}\left[(\|\nabla_{\tilde{x}} f(\tilde{x})\|-1)^2\right]\]

where $\tilde{x}=\epsilon x_{\text{real}}+(1-\epsilon)x_{\text{generated}},\,\epsilon\sim U[0,1], \,x_{\text{real}}\sim P_r, \,x_{\text{generated}}\sim P_g$.
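A sketch of this penalty in PyTorch (my own illustration; it assumes 2-D batches of shape `(batch, features)`, so image inputs would need extra broadcast dimensions on `eps`):

```python
import torch

def gradient_penalty(f, x_real, x_fake, lam=10.0):
    eps = torch.rand(x_real.size(0), 1)                       # epsilon ~ U[0, 1], per sample
    x_tilde = eps * x_real + (1.0 - eps) * x_fake             # interpolated samples
    x_tilde = x_tilde.detach().requires_grad_(True)
    grad = torch.autograd.grad(f(x_tilde).sum(), x_tilde,     # df/dx_tilde, shape (batch, features)
                               create_graph=True)[0]
    return lam * ((grad.norm(2, dim=1) - 1.0) ** 2).mean()    # lambda * (||f'(x_tilde)|| - 1)^2
```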

How do we enforce $\|f\|_L\leq 1$? One common way is spectral normalization: every weight matrix $W$ in the model $f$ is replaced by $W/\|W\|_2$, where $\|W\|_2$ is the spectral norm (largest singular value) of $W$.
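A rough sketch of the idea for a single weight matrix `W` (my own illustration): estimate the largest singular value by power iteration and divide it out. PyTorch also provides `torch.nn.utils.spectral_norm`, which applies the same normalization per layer.

```python
import torch

def spectral_normalize(W, n_iter=5):
    """Return W / sigma, where sigma approximates the spectral norm of W."""
    u = torch.randn(W.size(0))
    for _ in range(n_iter):                      # power iteration
        v = torch.mv(W.t(), u)
        v = v / (v.norm() + 1e-12)
        u = torch.mv(W, v)
        u = u / (u.norm() + 1e-12)
    sigma = torch.dot(u, torch.mv(W, v))         # ≈ largest singular value of W
    return W / sigma
```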

Transformation of Probability Distribution

Set $X\sim U[0,1]$ and $Y\sim N(0,1)$. Denote the transformation by $Y=f(X)$ and let $\rho(x)$ be the density function of $X$. Under the transformation, the probability mass in $[x,x+dx]$ must equal the probability mass in $[y,y+dy]$, so we have

\[\rho(x)dx=\frac{1}{\sqrt{2\pi}}e^{\frac{-y^2}{2}}dy\\\Rightarrow \int_0^x \rho(t)dt=\int_{-\infty}^y \frac{1}{\sqrt{2\pi}}e^{\frac{-t^2}{2}}dt \\ \Rightarrow y=\Phi^{-1}\left(\int_0^x \rho(t)dt\right)\]

Problem: how do we find $f$ in general? We can use a neural network here, i.e. $Y=G(X;\theta)$.
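A numerical illustration of the special case above (my own sketch): for $X\sim U[0,1]$ we have $\rho(t)=1$, so $y=\Phi^{-1}(x)$, and pushing uniform samples through the inverse Gaussian CDF produces approximately standard normal samples.

```python
import numpy as np
from scipy.stats import norm

x = np.random.rand(100000)     # X ~ U[0, 1]
y = norm.ppf(x)                # y = Phi^{-1}(x), the inverse Gaussian CDF

print(y.mean(), y.std())       # ≈ 0 and ≈ 1, consistent with N(0, 1)
```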

Wasserstein Measure and WGAN

We define

\[W_c[p(x),q(x)]=\underset{\gamma\in \pi(p,q)}{\text{inf}} E_{(x,y)\sim \gamma} [c(x,y)]=\underset{\gamma\in \pi(p,q)}{\text{inf}} \int \gamma(x,y)c(x,y)dxdy\]

where the cost $c(x,y)=\|x-y\|$ is a distance, and the marginal distributions of $\gamma$ satisfy

\[\int \gamma(x,y)dy=p(x), \,\int \gamma(x,y)dx=q(y)\]

The dual problem, as we have shown above, is

\[W(P_r,P_g)=\underset{\|f\|_L\leq 1}{\text{sup}}E_{x\sim P_r}[f(x)]-E_{x\sim P_g}[f(x)]\]

Denote

\[\Gamma= \left[ \begin{array}{c} \gamma(x_1,y_1)\\\gamma(x_1,y_2)\\ \vdots\\ \hline \gamma(x_2,y_1)\\\gamma(x_2,y_2)\\ \vdots\\ \hline \gamma(x_n,y_1)\\\gamma(x_n,y_2)\\ \vdots \end{array} \right] \;\;\;\;\;\; C=\left[ \begin{array}{c} c(x_1,y_1)\\c(x_1,y_2)\\ \vdots\\ \hline c(x_2,y_1)\\c(x_2,y_2)\\ \vdots\\ \hline c(x_n,y_1)\\c(x_n,y_2)\\ \vdots \end{array} \right] \;\;\;\;\;\; b=\left[ \begin{array}{c} p(x_1)\\ p(x_2)\\ \vdots\\ \hline q(x_1)\\ q(x_2)\\ \vdots\\ \end{array} \right]\]

\[A=\left[ \begin{array}{c|c|c|c} 1,1,\cdots&0,0,\cdots&0,0,\cdots&\cdots\\0,0,\cdots&1,1,\cdots&0,0,\cdots&\cdots\\0,0,\cdots&0,0,\cdots&1,1,\cdots&\cdots\\ \vdots&\vdots&\vdots&\vdots\\\hline 1,0,0,\cdots&1,0,0,\cdots&1,0,0,\cdots&\cdots\\0,1,0,\cdots&0,1,0,\cdots&0,1,0,\cdots&\cdots\\0,0,1,\cdots&0,0,1,\cdots&0,0,1,\cdots&\cdots\\ \vdots&\vdots&\vdots&\vdots\\ \end{array} \right]\]

Here the first block of rows of $A$ sums each $\gamma(x_i,\cdot)$ to give the entries $p(x_i)$ of $b$, and the second block sums each $\gamma(\cdot,y_j)$ to give the entries $q(x_j)$.

Then the optimization problem is in fact a linear programming problem

\[\underset{\Gamma}{\text{min}}\{\langle\Gamma,C\rangle|A\Gamma=b,\Gamma\geq 0\}\]
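For a small discrete example this linear program can be solved directly. The sketch below (my own, with made-up distributions on three points) builds $C$, $A$ and $b$ as above, solves for $\Gamma$ with SciPy, and checks the result against SciPy's closed-form 1-D Wasserstein distance.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import wasserstein_distance

x = np.array([0.0, 1.0, 2.0])                    # support points
p = np.array([0.5, 0.3, 0.2])                    # p(x_i)
q = np.array([0.2, 0.3, 0.5])                    # q(x_j)
n = len(x)

C = np.abs(x[:, None] - x[None, :]).ravel()      # c(x_i, x_j) = |x_i - x_j|, flattened like Gamma

A = np.zeros((2 * n, n * n))
for i in range(n):
    A[i, i * n:(i + 1) * n] = 1.0                # sum_j gamma(x_i, y_j) = p(x_i)
    A[n + i, i::n] = 1.0                         # sum_i gamma(x_i, y_j) = q(y_j)
b = np.concatenate([p, q])

res = linprog(C, A_eq=A, b_eq=b, bounds=(0, None))
print(res.fun)                                   # ≈ 0.6 for this example
print(wasserstein_distance(x, x, p, q))          # same value from SciPy's 1-D formula
```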

Dual Form

Generally, for a problem

\[\underset{x}{\text{min}}\{c^{\text{T}}x|Ax=b,x\geq 0\}\]

where $x,c\in \mathbb{R}^n; b\in \mathbb{R}^m, A\in \mathbb{R}^{m\times n}$. Its dual form is

\[\underset{y}{\text{max}}\{b^{\text{T}}y|A^{\text{T}}y\leq c\}\]

Weak Dual Form

\[\underset{y}{\text{max}}\{b^{\text{T}}y|A^{\text{T}}y\leq c\}\leq \underset{x}{\text{min}}\{c^{\text{T}}x|Ax=b,x\geq 0\}\]

Strong Dual Form

\[\underset{y}{\text{max}}\{b^{\text{T}}y|A^{\text{T}}y\leq c\}= \underset{x}{\text{min}}\{c^{\text{T}}x|Ax=b,x\geq 0\}\]

The proof is omitted here. It relies on an important lemma:

Farkas' Lemma: Exactly one of the following statements is true:

  • $\exists x\in \mathbb{R}^n$ with $x\geq 0$ such that $Ax=b$;
  • $\exists y\in \mathbb{R}^m$ such that $A^{\text{T}}y\leq 0$ and $b^{\text{T}}y> 0$.
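As a quick numerical sanity check of strong duality (my own sketch, with an arbitrary small LP), we can solve the primal and the dual with SciPy and compare the optimal values; note that the dual variable $y$ is unconstrained in sign, so the default non-negativity bounds must be removed.

```python
import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 1.0])
c = np.array([1.0, 3.0, 1.0])

primal = linprog(c, A_eq=A, b_eq=b, bounds=(0, None))          # min c^T x, Ax = b, x >= 0
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=(None, None))      # max b^T y, A^T y <= c

print(primal.fun, -dual.fun)                                   # both equal 2.0 here
```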

The Dual Problem of the optimal transportation

\[\underset{\Gamma}{\text{min}}\{\langle\Gamma,C\rangle|A\Gamma=b,\Gamma\geq 0\}= \underset{F}{\text{max}}\{\langle b,F\rangle|A^{\text{T}}F\leq C\}\]

We denote F as

\[F=\left[ \begin{array}{c} f(x_1)\\ f(x_2)\\ \vdots\\ \hline g(x_1)\\ g(x_2)\\ \vdots\\ \end{array} \right]\]

In this way,

\[\langle b,F\rangle=\sum_i \left[p(x_i)f(x_i)+q(x_i)g(x_i)\right],\\ \text{or, in the continuous case,} \;\langle b,F\rangle=\int \left[p(x)f(x)+q(x)g(x)\right]dx\]

The constraint condition can be reformulated as

\[A^{\text{T}}F\leq C\Rightarrow f(x_i)+g(x_j)\leq c(x_i,x_j)\quad\text{for all } i,j\]

In particular, taking $x_i=x_j=x$ gives

\[f(x)+g(x)\leq c(x,x)=0\]

so $g(x)\leq -f(x)$, and therefore

\[p(x_i)f(x_i)+q(x_i)g(x_i)\leq p(x_i)f(x_i)-q(x_i)f(x_i)\]

This suggests setting $g=-f$; the constraint $f(x_i)+g(x_j)\leq c(x_i,x_j)$ then becomes $f(x)-f(y)\leq\|x-y\|$, i.e. $\|f\|_L\leq 1$, and we obtain

\[W[p,q]=\underset{f,\|f\|_L\leq 1}{\text{max}}\int \left[p(x)-q(x)\right]f(x)dx=\underset{f,\|f\|_L\leq 1}{\text{max}}\;E_{x\sim p(x)}[f(x)]-E_{x\sim q(x)}[f(x)]\]

Thus the WGAN training objective is

\[\underset{G}{\text{min}}\,W(P_r,P_g)=\underset{G}{\text{min}}\underset{f,\|f\|_L\leq 1}{\text{max}}\,E_{x\sim p(x)}[f(x)]-E_{z\sim q(z)}[f(G(z))]\]
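Putting the pieces together, here is a compact, self-contained training sketch on toy 2-D data (my own illustration, with made-up architectures and hyper-parameters). The Lipschitz constraint is enforced crudely by weight clipping, as in the original WGAN; the gradient penalty or spectral normalization from above could be substituted.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))   # generator G(z)
f = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))   # critic f, no sigmoid
opt_g = torch.optim.RMSprop(G.parameters(), lr=5e-5)
opt_f = torch.optim.RMSprop(f.parameters(), lr=5e-5)

for step in range(1000):
    x_real = torch.randn(64, 2) + 3.0                   # toy "real" data: N(3, I)
    for _ in range(5):                                  # inner max over the critic f
        x_fake = G(torch.randn(64, 8)).detach()
        loss_f = f(x_fake).mean() - f(x_real).mean()    # minimize -(E_p[f] - E_q[f])
        opt_f.zero_grad()
        loss_f.backward()
        opt_f.step()
        for w in f.parameters():
            w.data.clamp_(-0.01, 0.01)                  # weight clipping for ||f||_L <= 1
    loss_g = -f(G(torch.randn(64, 8))).mean()           # outer min over the generator G
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```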

Relation to Other Loss

We say $d$ is weaker than $d'$ if every sequence that converges under $d'$ also converges under $d$. Thus we have:

Theorem: Every sequence of distributions that converges under the KL, reverse KL, TV, or JS divergence also converges under the Wasserstein distance.

References

苏剑林. (Oct. 07, 2018). 《深度学习中的Lipschitz约束:泛化与生成模型 》[Blog post]. Retrieved from https://kexue.fm/archives/6051

苏剑林. (May. 03, 2019). 《从动力学角度看优化算法(四):GAN的第三个阶段 》[Blog post]. Retrieved from https://kexue.fm/archives/6583

Mescheder L, Geiger A, Nowozin S. Which training methods for GANs do actually converge?[C]//International Conference on Machine Learning. PMLR, 2018: 3481-3490.

苏剑林. (Jun. 08, 2017). 《互怼的艺术:从零直达WGAN-GP 》[Blog post]. Retrieved from https://kexue.fm/archives/4439

苏剑林. (Jan. 20, 2019). 《从Wasserstein距离、对偶理论到WGAN 》[Blog post]. Retrieved from https://kexue.fm/archives/6280