An Introduction to WGAN
Abstract: This is my note from self-studying WGAN.
Notation
True samples $x_1,x_2,\dots,x_n$.
Generated samples (fake samples) $\hat{x}_1,\hat{x}_2,\dots,\hat{x}_n$. We want these to approximate the true samples.
$p(x)$ is the distribution of the true samples.
$q(x)$ is the distribution of the fake samples, which are generated by $x=G(z)$, where $z\sim q(z)$ and $q(z)$ is the standard Gaussian distribution.
f-divergence
Definition 1: f-divergence
\[D_f(P||Q)=\int q(x) f(\frac{p(x)}{q(x)})dx\]For KL-divergence, $f(u)=u\,\text{log}u$; for JS-divergence, $f(u)=-\frac{u+1}{2}\,\text{log}\frac{u+1}{2}+\frac{u}{2}\,\text{log}u$
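As a quick check, substituting $f(u)=u\,\text{log}u$ into the definition recovers the KL-divergence:
\[\int q(x)\,\frac{p(x)}{q(x)}\text{log}\frac{p(x)}{q(x)}dx=\int p(x)\,\text{log}\frac{p(x)}{q(x)}dx=\text{KL}(P||Q)\]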
Generally, it is required that:
1. $f:\mathbb R^+\rightarrow \mathbb R$;
2. $f(1)=0$ (to ensure $D_f(P||P)=0$);
3. $f$ is convex (so that Jensen's inequality $E[f(X)]\geq f(E[X])$ applies).
Then we have:
\[\begin{aligned}\int q(x) f(\frac{p(x)}{q(x)})dx&=E_{x\sim q(x)}\left[f(\frac{p(x)}{q(x)})\right]\geq f\left(E_{x\sim q(x)}[\frac{p(x)}{q(x)}]\right) \\&=f\left(\int p(x)dx\right)=f(1)=0 \end{aligned}\]Therefore, the f-divergence is non-negative.
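As a quick numerical sanity check, here is a minimal NumPy sketch (the two discrete distributions are made up for illustration) that evaluates the discrete version of $D_f$ for the KL and JS choices of $f$ and confirms both values are non-negative:

```python
import numpy as np

def f_kl(u):
    # f(u) = u log u  (KL-divergence)
    return u * np.log(u)

def f_js(u):
    # f(u) = -((u+1)/2) log((u+1)/2) + (u/2) log u  (JS-divergence)
    return -(u + 1) / 2 * np.log((u + 1) / 2) + u / 2 * np.log(u)

def f_divergence(p, q, f):
    # D_f(P || Q) = sum_x q(x) f(p(x)/q(x))   (discrete version of the integral)
    return np.sum(q * f(p / q))

# two arbitrary discrete distributions on 4 points
p = np.array([0.1, 0.4, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.25, 0.25])

print("KL:", f_divergence(p, q, f_kl))   # >= 0
print("JS:", f_divergence(p, q, f_js))   # >= 0 (and <= log 2)
```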
Convex Conjugate
The tangent line at $u=\xi$ of $y=f(u)$ is
\[y=f(\xi)+f'(\xi)(u-\xi)\]Since $f$ is convex, we have
\[f(u)=\text{max}_{\xi\in D}\left(f(\xi)+f'(\xi)(u-\xi)\right)\]This holds because a convex function lies above all of its tangent lines, so $f(u)\geq f(\xi)+f'(\xi)(u-\xi)$ for every $\xi$, with equality at $\xi=u$; taking the maximum over $\xi$ therefore recovers $f(u)$.
Define $t=f'(\xi)$ and $g(t)=-f(\xi)+f'(\xi)\xi$ (this $g$ is exactly the convex conjugate $f^*$ of $f$), and write the maximizing $t$ as a function $t=T(u)$; then we have
\[f(u)=\text{max}_{T} \{T(u)u-g(T(u))\}\]Therefore, we have
\[D_f(P||Q)=\text{max}_T \int q(x)\left[\frac{p(x)}{q(x)}T(\frac{p(x)}{q(x)})-g(T(\frac{p(x)}{q(x)}))\right]dx\]Denote $T(x)=T(\frac{p(x)}{q(x)})$, then
\[D_f(P||Q)=\text{max}_T(E_{x\sim p(x)}[T(x)]-E_{x\sim q(x)}[g(T(x))])\]To train the generator, we minimize this divergence over $G$, i.e. we solve $\underset{G}{\text{min}}\,\underset{T}{\text{max}}\left(E_{x\sim p(x)}[T(x)]-E_{x\sim q(x)}[g(T(x))]\right)$.
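As a concrete example, take the KL case $f(u)=u\,\text{log}u$: then $t=f'(\xi)=\text{log}\xi+1$, so $\xi=e^{t-1}$ and $g(t)=f'(\xi)\xi-f(\xi)=e^{t-1}$, which gives the variational bound
\[\text{KL}(P||Q)=\underset{T}{\text{max}}\left(E_{x\sim p(x)}[T(x)]-E_{x\sim q(x)}[e^{T(x)-1}]\right)\]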
The loss of GAN
\[\underset{D}{\text{min}}\;E_{x\sim p(x)}[-\text{log}D(x)]+E_{x\sim q(x)}[-\text{log}(1-D(x))]\\ \underset{G}{\text{min}}\;E_{z\sim q(z)}[-\text{log}D(G(z))]\]
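These two losses can be written as a minimal PyTorch sketch (assuming a discriminator `D` that outputs raw logits and a generator `G`; the helper names are hypothetical):

```python
import torch
import torch.nn.functional as F

def gan_d_loss(D, G, real, z):
    # min_D  E_p[-log D(x)] + E_q[-log(1 - D(G(z)))]
    real_logits = D(real)
    fake_logits = D(G(z).detach())
    loss_real = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake

def gan_g_loss(D, G, z):
    # min_G  E_q[-log D(G(z))]   (non-saturating generator loss)
    fake_logits = D(G(z))
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
```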
The loss of WGAN
\[\underset{G}{\text{min}}\underset{D,\|D\|_L \leq 1}{\text{max}}E_{x\sim p(x)}[D(x)]-E_{z\sim q(z)}[D(G(z))]\]
The loss of WGAN-GP
\[\underset{D}{\text{min}}\;E_{x\sim q(x)}[D(x)]-E_{x\sim p(x)}[D(x)]+\lambda E_{x\sim r(x)}[(\|\nabla_x D(x)\|-c)^2]\\ \underset{G}{\text{min}}\; E_{z\sim q(z)}[-D(G(z))]\]where $c$ is recommended to be set to 0 or 1, and $r(x)$ is a distribution derived from $p(x)$ and $q(x)$ (in practice, the distribution of random interpolations between real and generated samples).
Lipschitz constraints in Deep Learning
Suppose $P_r$ is the real distribution and $P_g$ is the generated distribution. Remember the optimization target of WGAN is
\[W(P_r,P_g)=\underset{\|f\|_L\leq 1}{\text{sup}}E_{x\sim P_r}[f(x)]-E_{x\sim P_g}[f(x)]\]where $\|f\|_L$ is defined as
\[\|f\|_L:=\underset{x\neq y}{\text{sup}}\frac{|f(x)-f(y)|}{\|x-y\|}\]And we define the gradient penalty as
\[\underset{f}{\text{min}}\;E_{x\sim P_g}[f(x)]-E_{x\sim P_r}[f(x)]+\lambda\, E_{\tilde{x}}\left[(\|\nabla_{\tilde{x}} f(\tilde{x})\|-1)^2\right]\]where $\tilde{x}=\epsilon x_{\text{real}}+(1-\epsilon)x_{\text{generated}},\,\epsilon\sim U[0,1], \,x_{\text{real}}\sim P_r, \,x_{\text{generated}}\sim P_g$.
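A minimal PyTorch sketch of this penalty, together with the corresponding critic loss (the function names are hypothetical; `f` is the critic and `lam` plays the role of $\lambda$):

```python
import torch

def gradient_penalty(f, x_real, x_fake, lam=10.0):
    # interpolate: x_tilde = eps * x_real + (1 - eps) * x_fake, with eps ~ U[0, 1]
    eps = torch.rand(x_real.size(0), *([1] * (x_real.dim() - 1)), device=x_real.device)
    x_tilde = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)

    # gradient of the critic output w.r.t. the interpolated points
    out = f(x_tilde)
    grads = torch.autograd.grad(outputs=out, inputs=x_tilde,
                                grad_outputs=torch.ones_like(out),
                                create_graph=True)[0]

    # penalty: lambda * E[(||grad|| - 1)^2]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return lam * ((grad_norm - 1) ** 2).mean()

def critic_loss(f, x_real, x_fake, lam=10.0):
    # min_f  E_{P_g}[f(x)] - E_{P_r}[f(x)] + gradient penalty
    return f(x_fake).mean() - f(x_real).mean() + gradient_penalty(f, x_real, x_fake, lam)
```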
How to make $\|f\|_L\leq 1$? The answer is to use Spectral Normalization, that is, to replace each weight matrix $w$ in the model $f$ with $w/\|w\|_2$, where $\|w\|_2$ is the spectral norm (the largest singular value) of $w$.
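In PyTorch this can be done with `torch.nn.utils.spectral_norm`, which divides each wrapped layer's weight by an estimate of its spectral norm obtained via power iteration; a minimal sketch (the architecture below is made up for illustration):

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# every linear layer is wrapped, so each weight matrix w is used as w / ||w||_2;
# together with 1-Lipschitz activations this keeps the whole critic (roughly) 1-Lipschitz
critic = nn.Sequential(
    spectral_norm(nn.Linear(784, 256)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(256, 256)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(256, 1)),
)
```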
Transformation of Probability Distribution
Set $X\sim U[0,1]$, $Y\sim N(0,1)$. We denote the transformation as $Y=f(X)$ and let $\rho(x)$ be the density of $X$. Since the transformation must preserve probability mass, the probability in $[x,x+dx]$ must equal the probability in $[y,y+dy]$, so we have
\[\rho(x)dx=\frac{1}{\sqrt{2\pi}}e^{\frac{-y^2}{2}}dy\\\Rightarrow \int_0^x \rho(t)dt=\int_{-\infty}^y \frac{1}{\sqrt{2\pi}}e^{\frac{-t^2}{2}}dt \\ \Rightarrow y=\Phi^{-1}\left(\int_0^x \rho(t)dt\right)\]Problem: how do we find $f$ in general? We could use a neural network here, that is, $Y=G(X;\theta)$.
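In this particular case the transformation is explicit: since $X\sim U[0,1]$ has $\rho(t)=1$, the formula reduces to $y=\Phi^{-1}(x)$. A minimal sketch using `scipy.stats.norm.ppf` for $\Phi^{-1}$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=100_000)   # X ~ U[0, 1]
y = norm.ppf(x)                           # y = Phi^{-1}(x), so Y ~ N(0, 1)

print(y.mean(), y.std())                  # should be close to 0 and 1
```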
Wasserstein Measure and WGAN
We define
\[W_c[p(x),q(x)]=\underset{\gamma\in \pi(p,q)}{\text{inf}} E_{(x,y)\sim \gamma} [c(x,y)]=\underset{\gamma\in \pi(p,q)}{\text{inf}} \int \gamma(x,y)c(x,y)dxdy\]where the cost $c(x,y)=\|x-y\|$ is a metric and the marginal distributions of $\gamma$ are
\[\int \gamma(x,y)dy=p(x), \,\int \gamma(x,y)dx=q(y)\]The dual problem is, as stated above,
\[W(P_r,P_g)=\underset{\|f\|_L\leq 1}{\text{sup}}E_{x\sim P_r}[f(x)]-E_{x\sim P_g}[f(x)]\]Denote
\[\Gamma= \left[ \begin{array}{c} \gamma(x_1,y_1)\\\gamma(x_1,y_2)\\ \vdots\\ \hline \gamma(x_2,y_1)\\\gamma(x_2,y_2)\\ \vdots\\ \hline \gamma(x_n,y_1)\\\gamma(x_n,y_2)\\ \vdots \end{array} \right] \;\;\;\;\;\; C=\left[ \begin{array}{c} c(x_1,y_1)\\c(x_1,y_2)\\ \vdots\\ \hline c(x_2,y_1)\\c(x_2,y_2)\\ \vdots\\ \hline c(x_n,y_1)\\c(x_n,y_2)\\ \vdots \end{array} \right] \;\;\;\;\;\; b=\left[ \begin{array}{c} p(x_1)\\ p(x_2)\\ \vdots\\ \hline q(x_1)\\ q(x_2)\\ \vdots\\ \end{array} \right]\] \[A=\left[ \begin{array}{c|c|c|c} 1,1,0,0,\cdots&0,0,0,0,\cdots&0,0,0,0,\cdots&\cdots\\0,0,0,0,\cdots&1,1,0,0,\cdots&0,0,0,0,\cdots&\cdots\\0,0,0,0,\cdots&0,0,0,0,\cdots&1,1,0,0,\cdots&\cdots\\ \vdots&\vdots&\vdots&\vdots\\\hline 1,0,0,0,\cdots&1,0,0,0,\cdots&1,0,0,0,\cdots&\cdots\\0,1,0,0,\cdots&0,1,0,0,\cdots&0,1,0,0,\cdots&\cdots\\0,0,0,0,\cdots&0,0,0,0,\cdots&0,0,0,0,\cdots&\cdots\\ \vdots&\vdots&\vdots&\vdots\\ \end{array} \right]\]Then the optimization problem is in fact a linear programming problem
\[\underset{\Gamma}{\text{min}}\{\langle\Gamma,C\rangle\,|\,A\Gamma=b,\Gamma\geq 0\}\]
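For small discrete distributions this linear program can be solved directly. A minimal sketch using `scipy.optimize.linprog` (the two distributions below are made up for illustration); the constraint matrix built in the loop is exactly the block matrix $A$ above:

```python
import numpy as np
from scipy.optimize import linprog

# discrete example: p and q supported on the points 0, 1, 2, 3, 4
points = np.arange(5.0).reshape(-1, 1)
p = np.array([0.1, 0.2, 0.4, 0.2, 0.1])   # "real" distribution p(x)
q = np.array([0.3, 0.3, 0.2, 0.1, 0.1])   # "generated" distribution q(y)

n = len(points)
C = np.abs(points - points.T)              # cost c(x, y) = |x - y|

# marginal constraints A @ Gamma = b, with Gamma flattened row-wise
A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1.0       # sum_j gamma(x_i, y_j) = p(x_i)
    A_eq[n + i, i::n] = 1.0                # sum_i gamma(x_i, y_i-th column) = q(y_i)
b_eq = np.concatenate([p, q])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
print("W1(p, q) =", res.fun)
```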
Dual Form
Generally, for a problem
\[\underset{x}{\text{min}}\{c^{\text{T}}x|Ax=b,x\geq 0\}\]where $x,c\in \mathbb{R}^n; b\in \mathbb{R}^m, A\in \mathbb{R}^{m\times n}$. Its dual form is
\[\underset{y}{\text{max}}\{b^{\text{T}}y|A^{\text{T}}y\leq c\}\]
Weak Dual Form
\[\underset{y}{\text{max}}\{b^{\text{T}}y|A^{\text{T}}y\leq c\}\leq \underset{x}{\text{min}}\{c^{\text{T}}x|Ax=b,x\geq 0\}\]
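This follows from feasibility alone: for any $x\geq 0$ with $Ax=b$ and any $y$ with $A^{\text{T}}y\leq c$,
\[b^{\text{T}}y=(Ax)^{\text{T}}y=x^{\text{T}}(A^{\text{T}}y)\leq x^{\text{T}}c=c^{\text{T}}x\]and taking the maximum over feasible $y$ and the minimum over feasible $x$ preserves the inequality.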
Strong Dual Form
\[\underset{y}{\text{max}}\{b^{\text{T}}y|A^{\text{T}}y\leq c\}= \underset{x}{\text{min}}\{c^{\text{T}}x|Ax=b,x\geq 0\}\]The proof is omitted here. It relies on an important lemma:
Farkas' Lemma: Exactly one of the following two statements is true:
- $\exists\, x\in \mathbb{R}^n$ with $x\geq 0$ such that $Ax=b$;
- $\exists\, y\in \mathbb{R}^m$ such that $A^{\text{T}}y\leq 0$ and $b^{\text{T}}y> 0$.
The Dual Problem of the optimal transportation
\[\underset{\Gamma}{\text{min}}\{\langle\Gamma,C\rangle|A\Gamma=b,\Gamma\geq 0\}= \underset{F}{\text{max}}\{\langle b,F\rangle|A^{\text{T}}F\leq C\}\]We denote $F$ as
\[F=\left[ \begin{array}{c} f(x_1)\\ f(x_2)\\ \vdots\\ \hline g(x_1)\\ g(x_2)\\ \vdots\\ \end{array} \right]\]In this way,
\[\langle b,F\rangle=\sum_i \left[p(x_i)f(x_i)+q(x_i)g(x_i)\right],\\ \text{or, in the continuous case,} \;\langle b,F\rangle=\int \left[p(x)f(x)+q(x)g(x)\right]dx\]The constraint condition can be reformulated as
\[A^{\text{T}}F\leq C\Rightarrow f(x_i)+g(x_j)\leq c(x_i,x_j)\quad\text{for all } i,j\]In particular, taking the two points to be the same, since
\[f(x)+g(x)\leq c(x,x)=0\]We have $f(x)\leq -g(x)$, and
\[p(x_i)f(x_i)+q(x_i)g(x_i)\leq p(x_i)f(x_i)-q(x_i)f(x_i)\]To maximize $\langle b,F\rangle$ we can therefore take $g=-f$, which is feasible exactly when $f(x)-f(y)\leq c(x,y)=\|x-y\|$ for all $x,y$, i.e. when $f$ is 1-Lipschitz. We thus obtain
\[W[p,q]=\underset{f,\,|f(x)-f(y)|\leq\|x-y\|}{\text{max}}\int [p(x)-q(x)]f(x)dx\]Thus the training objective of WGAN is
\[\underset{G}{\text{min}}\,W[p,q]=\underset{G}{\text{min}}\;\underset{f,\|f\|_L\leq 1}{\text{max}}\;E_{x\sim p(x)}[f(x)]-E_{z\sim q(z)}[f(G(z))]\]
Relation to Other Loss
We say $d$ is weaker than $d'$ if every sequence that converges under $d'$ also converges under $d$; thus we have
Theorem: Every sequence of distributions that converges under the KL, reverse KL, TV, or JS divergence also converges under the Wasserstein distance.
References
苏剑林. (Oct. 07, 2018). 《深度学习中的Lipschitz约束:泛化与生成模型 》[Blog post]. Retrieved from https://kexue.fm/archives/6051
苏剑林. (May. 03, 2019). 《从动力学角度看优化算法(四):GAN的第三个阶段 》[Blog post]. Retrieved from https://kexue.fm/archives/6583
Mescheder L, Geiger A, Nowozin S. Which training methods for gans do actually converge?[C]//International conference on machine learning. PMLR, 2018: 3481-3490.
苏剑林. (Jun. 08, 2017). 《互怼的艺术:从零直达WGAN-GP 》[Blog post]. Retrieved from https://kexue.fm/archives/4439
苏剑林. (Jan. 20, 2019). 《从Wasserstein距离、对偶理论到WGAN 》[Blog post]. Retrieved from https://kexue.fm/archives/6280