An Introduction to WGAN
Abstract: These are my notes from self-studying WGAN.
Notation
- True samples: $x_1,x_2,\dots,x_n$.
- Generated samples (fake samples): $\hat{x}_1,\hat{x}_2,\dots,\hat{x}_n$. We want these to approximate the true samples.
- $p(x)$ is the distribution of the true samples.
- $q(x)$ is the distribution of the fake samples, which are generated by $x=G(z)$, where $z\sim q(z)$ and $q(z)$ is the standard Gaussian distribution.
f-divergence
Definition 1 (f-divergence):
$$D_f(P\,\|\,Q)=\int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right)dx$$
For KL-divergence, $f(u)=u\,\text{log}u$; for JS-divergence, $f(u)=-\frac{u+1}{2}\,\text{log}\frac{u+1}{2}+\frac{u}{2}\,\text{log}u$
Generally, it is required that:
1. $f:\mathbb R^+\rightarrow \mathbb R$;
2. $f(1)=0$ (to ensure $D_f(P\,\|\,P)=0$);
3. $f$ is convex (so that Jensen's inequality $E[f(X)]\geq f(E[X])$ holds).
Then we have:
$$\begin{aligned}\int q(x) f\left(\frac{p(x)}{q(x)}\right)dx&=E_{x\sim q(x)}\left[f\left(\frac{p(x)}{q(x)}\right)\right]\geq f\left(E_{x\sim q(x)}\left[\frac{p(x)}{q(x)}\right]\right)\\&=f\left(\int p(x)dx\right)=f(1)=0\end{aligned}$$
Therefore, the f-divergence is non-negative.
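As a quick numerical sanity check (my own sketch, not part of the derivation above), the snippet below evaluates the f-divergence formula on two small discrete distributions: with $f(u)=u\,\text{log}\,u$ it reproduces the usual KL divergence, and both values come out non-negative as claimed. The example distributions `p` and `q` are arbitrary.

```python
# Evaluate D_f(P||Q) = sum_x q(x) * f(p(x)/q(x)) for discrete p, q.
import numpy as np

def f_divergence(p, q, f):
    u = p / q
    return np.sum(q * f(u))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

f_kl = lambda u: u * np.log(u)                                           # KL(P||Q)
f_js = lambda u: -(u + 1) / 2 * np.log((u + 1) / 2) + u / 2 * np.log(u)  # JS(P||Q)

print(f_divergence(p, q, f_kl))   # same as the direct KL below, and >= 0
print(np.sum(p * np.log(p / q)))  # direct KL(P||Q) for comparison
print(f_divergence(p, q, f_js))   # JS divergence, also >= 0
```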
Convex Conjugate
The tangent line at $u=\xi$ of $y=f(u)$ is
$$y=f(\xi)+f'(\xi)(u-\xi)$$
Since $f$ is convex, we have
$$f(u)=\underset{\xi\in D}{\text{max}}\left(f(\xi)+f'(\xi)(u-\xi)\right)$$
Why can we do this? Because $f$ is convex, its graph lies above every one of its tangent lines, so $f(u)\geq f(\xi)+f'(\xi)(u-\xi)$ for every $\xi$, with equality at $\xi=u$; taking the maximum over $\xi$ therefore recovers $f(u)$ exactly.
Define $t=f'(\xi)$ and $g(t)=-f(\xi)+f'(\xi)\xi$, and suppose $t=T(u)$; then
$$f(u)=\underset{T}{\text{max}}\,\{T(u)u-g(T(u))\}$$
Therefore, we have
$$D_f(P\,\|\,Q)=\underset{T}{\text{max}} \int q(x)\left[\frac{p(x)}{q(x)}T\!\left(\frac{p(x)}{q(x)}\right)-g\!\left(T\!\left(\frac{p(x)}{q(x)}\right)\right)\right]dx$$
Denote $T(x)=T(\frac{p(x)}{q(x)})$, then
$$D_f(P\,\|\,Q)=\underset{T}{\text{max}}\left(E_{x\sim p(x)}[T(x)]-E_{x\sim q(x)}[g(T(x))]\right)$$
To train the generator, we then solve $\underset{G}{\text{min}}\, D_f(P\,\|\,Q)$.
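The $g$ defined above is exactly the convex conjugate of $f$. As a small numerical check (my own sketch): for $f(u)=u\,\text{log}\,u$ the conjugate is $g(t)=e^{t-1}$, and maximizing $tu-g(t)$ over a grid of $t$ recovers $f(u)$; the grid range and test values of $u$ are arbitrary.

```python
# Verify f(u) = max_t { t*u - g(t) } for f(u) = u*log(u), g(t) = exp(t - 1).
import numpy as np

f = lambda u: u * np.log(u)
g = lambda t: np.exp(t - 1)            # convex conjugate of u*log(u)

ts = np.linspace(-5.0, 5.0, 20001)     # grid over the dual variable t
for u in [0.3, 1.0, 2.5]:
    recovered = np.max(ts * u - g(ts))
    print(u, f(u), recovered)          # the last two values agree up to grid error
```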
The loss of GAN
$$\underset{D}{\text{min}}\, E_{x\sim p(x)}[-\text{log}\,D(x)]+E_{x\sim q(x)}[-\text{log}(1-D(x))]$$

$$\underset{G}{\text{min}}\, E_{z\sim q(z)}[-\text{log}\,D(G(z))]$$
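As a concrete illustration, here is a minimal PyTorch-style sketch of these two losses (my own, not prescribed by anything above), assuming `D` and `G` are arbitrary `nn.Module`s and that `D` outputs a probability in $(0,1)$; the small constant inside the logarithms is only for numerical stability.

```python
import torch

def gan_d_loss(D, G, x_real, z):
    """min_D  E_p[-log D(x)] + E_q[-log(1 - D(G(z)))]"""
    x_fake = G(z).detach()                       # do not backpropagate into G here
    loss_real = -torch.log(D(x_real) + 1e-8).mean()
    loss_fake = -torch.log(1.0 - D(x_fake) + 1e-8).mean()
    return loss_real + loss_fake

def gan_g_loss(D, G, z):
    """min_G  E_q[-log D(G(z))]  (the non-saturating generator loss)"""
    return -torch.log(D(G(z)) + 1e-8).mean()
```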
The loss of WGAN
$$\underset{G}{\text{min}}\,\underset{D,\|D\|_L \leq 1}{\text{max}}\, E_{x\sim p(x)}[D(x)]-E_{z\sim q(z)}[D(G(z))]$$
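A sketch of the same objective in the same style (again my own illustration): the critic `D` has no sigmoid output, and the constraint $\|D\|_L\leq 1$ is enforced crudely by clipping the critic's weights after each update, as in the original WGAN; the clipping threshold `c = 0.01` is just a common default.

```python
import torch

def wgan_d_loss(D, G, x_real, z):
    """max_D  E_p[D(x)] - E_q[D(G(z))], written as a loss to minimize."""
    return -(D(x_real).mean() - D(G(z).detach()).mean())

def wgan_g_loss(D, G, z):
    """min_G  -E_q[D(G(z))]"""
    return -D(G(z)).mean()

def clip_critic(D, c=0.01):
    """Crude Lipschitz control: clamp every critic parameter to [-c, c]."""
    with torch.no_grad():
        for p in D.parameters():
            p.clamp_(-c, c)
```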
The loss of WGAN-GP
$$\underset{D}{\text{min}}\, E_{x\sim q(x)}[D(x)]-E_{x\sim p(x)}[D(x)]+\lambda\, E_{x\sim r(x)}\left[(\|\nabla_x D(x)\|-c)^2\right]$$

$$\underset{G}{\text{min}}\, E_{z\sim q(z)}[-D(G(z))]$$

where $c$ is recommended to be set to 0 or 1, and $r(x)$ is a distribution derived from $p(x)$ and $q(x)$ (in practice, random interpolations between real and generated samples).
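The penalty term can be sketched as follows (my own illustration, with $c=1$ and inputs assumed to have shape `(batch, features)`): $r(x)$ is realized by sampling random interpolations between real and generated batches, and the critic's gradient norm at those points is pushed toward $c$. The helper name `gradient_penalty` is mine.

```python
import torch

def gradient_penalty(D, x_real, x_fake, c=1.0):
    eps = torch.rand(x_real.size(0), 1, device=x_real.device)    # epsilon ~ U[0, 1]
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    d_hat = D(x_hat)
    grads = torch.autograd.grad(outputs=d_hat, inputs=x_hat,
                                grad_outputs=torch.ones_like(d_hat),
                                create_graph=True)[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return ((grad_norm - c) ** 2).mean()

# critic loss:  wgan_d_loss(D, G, x_real, z) + lam * gradient_penalty(D, x_real, G(z).detach())
```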
Lipschitz constraints in Deep Learning
Suppose $P_r$ is the real distribution and $P_g$ is the generated distribution. Recall that the optimization target of WGAN is
$$W(P_r,P_g)=\underset{\|f\|_L\leq 1}{\text{sup}}\, E_{x\sim P_r}[f(x)]-E_{x\sim P_g}[f(x)]$$
where $\|f\|_L$ is the Lipschitz norm, defined as
$$\|f\|_L:=\underset{x\neq y}{\text{max}}\frac{|f(x)-f(y)|}{|x-y|}$$
And we define gradient penalty as
$$\underset{f}{\text{min}}\, E_{x\sim P_g}[f(x)]-E_{x\sim P_r}[f(x)]+\lambda\, (|f'(\tilde{x})|-1)^2$$
where $\tilde{x}=\epsilon x_{\text{real}}+(1-\epsilon)x_{\text{generated}},\,\epsilon\sim U[0,1], \,x_{\text{real}}\sim P_r, \,x_{\text{generated}}\sim P_g$.
How can we enforce $\|f\|_L\leq 1$? One answer is spectral normalization: replace each weight matrix $w$ in the model $f$ with $w/\|w\|_2$, where $\|w\|_2$ is the spectral norm of $w$ (its largest singular value).
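In PyTorch this can be done with `torch.nn.utils.spectral_norm`, which rescales the wrapped layer's weight by an estimate of its spectral norm on every forward pass. A rough sketch of a critic built this way (the layer sizes are arbitrary):

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Every linear layer is spectrally normalized, so each layer is (approximately)
# 1-Lipschitz; with 1-Lipschitz activations such as ReLU, the whole critic is too.
critic = nn.Sequential(
    spectral_norm(nn.Linear(2, 128)), nn.ReLU(),
    spectral_norm(nn.Linear(128, 128)), nn.ReLU(),
    spectral_norm(nn.Linear(128, 1)),
)
```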
Transformation of Probability Distribution
Set $X\sim U[0,1]$ and $Y\sim N(0,1)$. Denote the transformation as $Y=f(X)$ and let $\rho(x)$ be the density of $X$. Since the transformation must preserve probability mass, the probability in $[x,x+dx]$ equals the probability in $[y,y+dy]$, so we have
$$\rho(x)dx=\frac{1}{\sqrt{2\pi}}e^{\frac{-y^2}{2}}dy\;\Rightarrow\; \int_0^x \rho(t)dt=\int_{-\infty}^y \frac{1}{\sqrt{2\pi}}e^{\frac{-t^2}{2}}dt\;\Rightarrow\; y=\Phi^{-1}\!\left(\int_0^x \rho(t)dt\right)$$
where $\Phi$ is the CDF of the standard normal distribution.
Problem: how do we find $f$ in general? We can use a neural network here, i.e. $Y=G(X,\theta)$.
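Before bringing in a neural network, the uniform-to-Gaussian case can be checked directly (my own sketch): for $X\sim U[0,1]$ we have $\int_0^x\rho(t)dt=x$, so $y=\Phi^{-1}(x)$, which `scipy.stats.norm.ppf` implements.

```python
import numpy as np
from scipy.stats import norm

x = np.random.rand(100_000)   # X ~ U[0, 1], so the integral of rho up to x is just x
y = norm.ppf(x)               # y = Phi^{-1}(x)

print(y.mean(), y.std())      # approximately 0 and 1, i.e. Y ~ N(0, 1)
```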
Wasserstein Distance and WGAN
We define
$$W_c[p(x),q(x)]=\underset{\gamma\in \pi(p,q)}{\text{inf}}\, E_{(x,y)\sim \gamma} [c(x,y)]=\underset{\gamma\in \pi(p,q)}{\text{inf}} \int \gamma(x,y)c(x,y)\,dxdy$$
where $c(x,y)=|x-y|$ is the transport cost, and the marginal distributions of $\gamma$ are constrained to be
$$\int \gamma(x,y)dy=p(x), \qquad \int \gamma(x,y)dx=q(y)$$
The dual problem, as stated above, is
$$W(P_r,P_g)=\underset{\|f\|_L\leq 1}{\text{sup}}\, E_{x\sim P_r}[f(x)]-E_{x\sim P_g}[f(x)]$$
Denote
$$\Gamma=\left[\begin{array}{c} \gamma(x_1,y_1)\\ \gamma(x_1,y_2)\\ \vdots\\ \hline \gamma(x_2,y_1)\\ \gamma(x_2,y_2)\\ \vdots\\ \hline \gamma(x_n,y_1)\\ \gamma(x_n,y_2)\\ \vdots \end{array}\right] \qquad C=\left[\begin{array}{c} c(x_1,y_1)\\ c(x_1,y_2)\\ \vdots\\ \hline c(x_2,y_1)\\ c(x_2,y_2)\\ \vdots\\ \hline c(x_n,y_1)\\ c(x_n,y_2)\\ \vdots \end{array}\right] \qquad b=\left[\begin{array}{c} p(x_1)\\ p(x_2)\\ \vdots\\ \hline q(x_1)\\ q(x_2)\\ \vdots \end{array}\right]$$

$$A=\left[\begin{array}{c|c|c|c} 1,1,0,0,\cdots&0,0,0,0,\cdots&0,0,0,0,\cdots&\cdots\\ 0,0,0,0,\cdots&1,1,0,0,\cdots&0,0,0,0,\cdots&\cdots\\ 0,0,0,0,\cdots&0,0,0,0,\cdots&1,1,0,0,\cdots&\cdots\\ \vdots&\vdots&\vdots&\vdots\\ \hline 1,0,0,0,\cdots&1,0,0,0,\cdots&1,0,0,0,\cdots&\cdots\\ 0,1,0,0,\cdots&0,1,0,0,\cdots&0,1,0,0,\cdots&\cdots\\ 0,0,0,0,\cdots&0,0,0,0,\cdots&0,0,0,0,\cdots&\cdots\\ \vdots&\vdots&\vdots&\vdots \end{array}\right]$$
Then the optimization problem is in fact a linear programming problem
$$\underset{\Gamma}{\text{min}}\left\{\langle\Gamma,C\rangle \mid A\Gamma=b,\,\Gamma\geq 0\right\}$$
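For two small discrete distributions this linear program can be solved directly; the sketch below (my own, with arbitrary supports and weights) uses `scipy.optimize.linprog` and checks the result against `scipy.stats.wasserstein_distance`, which computes $W_1$ on the real line.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import wasserstein_distance

xs = np.array([0.0, 1.0, 2.0]); p = np.array([0.5, 0.3, 0.2])   # support and weights of p
ys = np.array([0.5, 1.5, 3.0]); q = np.array([0.2, 0.3, 0.5])   # support and weights of q

n, m = len(xs), len(ys)
C = np.abs(xs[:, None] - ys[None, :]).ravel()    # cost c(x_i, y_j) = |x_i - y_j|, flattened

# Marginal constraints: sum_j gamma_ij = p_i and sum_i gamma_ij = q_j.
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0
for j in range(m):
    A_eq[n + j, j::m] = 1.0
b_eq = np.concatenate([p, q])

res = linprog(C, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(res.fun)                                   # optimal transport cost
print(wasserstein_distance(xs, ys, p, q))        # same value from scipy
```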
Dual Form
Generally, for a problem
$$\underset{x}{\text{min}}\left\{c^{\text{T}}x \mid Ax=b,\,x\geq 0\right\}$$
where $x,c\in \mathbb{R}^n; b\in \mathbb{R}^m, A\in \mathbb{R}^{m\times n}$. Its dual form is
$$\underset{y}{\text{max}}\left\{b^{\text{T}}y \mid A^{\text{T}}y\leq c\right\}$$
Weak Dual Form
$$\underset{y}{\text{max}}\left\{b^{\text{T}}y \mid A^{\text{T}}y\leq c\right\}\leq \underset{x}{\text{min}}\left\{c^{\text{T}}x \mid Ax=b,\,x\geq 0\right\}$$
Strong Dual Form
$$\underset{y}{\text{max}}\left\{b^{\text{T}}y \mid A^{\text{T}}y\leq c\right\}= \underset{x}{\text{min}}\left\{c^{\text{T}}x \mid Ax=b,\,x\geq 0\right\}$$
The proof is omitted here; it relies on an important lemma:
Farkas' Lemma: Exactly one of the following statements holds:
- $\exists x\in \mathbb{R}^n$ with $x\geq 0$ such that $Ax=b$;
- $\exists y\in \mathbb{R}^m$ such that $A^{\text{T}}y\leq 0$ and $b^{\text{T}}y> 0$.
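Strong duality is easy to check numerically on a toy problem (my own sketch, with an arbitrary $A$, $b$, $c$): solve the primal with `scipy.optimize.linprog`, then solve the dual by minimizing $-b^{\text{T}}y$ subject to $A^{\text{T}}y\leq c$ with $y$ unbounded; the two optimal values coincide.

```python
import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 0.0]])
b = np.array([4.0, 3.0])
c = np.array([2.0, 3.0, 4.0])

# primal:  min c^T x  s.t.  A x = b,  x >= 0
primal = linprog(c, A_eq=A, b_eq=b, bounds=(0, None))

# dual:    max b^T y  s.t.  A^T y <= c  (y free, so minimize -b^T y)
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=(None, None))

print(primal.fun, -dual.fun)   # both optimal values equal 10.0 here
```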
The Dual Problem of Optimal Transport
$$\underset{\Gamma}{\text{min}}\left\{\langle\Gamma,C\rangle \mid A\Gamma=b,\,\Gamma\geq 0\right\}=\underset{F}{\text{max}}\left\{\langle b,F\rangle \mid A^{\text{T}}F\leq C\right\}$$
We denote F as
$$F=\left[\begin{array}{c} f(x_1)\\ f(x_2)\\ \vdots\\ \hline g(x_1)\\ g(x_2)\\ \vdots \end{array}\right]$$
In this way,
$$\langle b,F\rangle=\sum_i \left[p(x_i)f(x_i)+q(x_i)g(x_i)\right],\quad\text{or}\quad \langle b,F\rangle=\int \left[p(x)f(x)+q(x)g(x)\right]dx$$
The constraint can be reformulated as
$$A^{\text{T}}F\leq C\;\Rightarrow\; f(x_i)+g(y_j)\leq c(x_i,y_j)\ \text{for all}\ i,j$$
In particular, taking $y=x$ gives
$$f(x)+g(x)\leq c(x,x)=0$$
so $g(x)\leq -f(x)$, and
$$p(x_i)f(x_i)+q(x_i)g(x_i)\leq p(x_i)f(x_i)-q(x_i)f(x_i)$$
The maximum is therefore attained at $g=-f$, and for $c(x,y)=|x-y|$ the constraint $f(x)+g(y)\leq c(x,y)$ becomes $f(x)-f(y)\leq|x-y|$, so we can write
$$W[p,q]=\underset{f}{\text{max}}\left\{\int \left[p(x)-q(x)\right]f(x)\,dx \;\Big|\; f(x)-f(y)\leq|x-y|\right\}$$
Thus training WGAN amounts to solving $\underset{G}{\text{min}}\, W(P_r,P_g)$, i.e.
$$\underset{G}{\text{min}}\,\underset{f,\|f\|_L\leq 1}{\text{max}}\, E_{x\sim p(x)}[f(x)]-E_{z\sim q(z)}[f(G(z))]$$
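Putting the pieces together, here is a self-contained toy sketch of this min-max training with the gradient penalty (my own illustration, nothing above prescribes these choices): the critic $f$ and the generator $G$ are small MLPs, the real data is a one-dimensional Gaussian mixture, and the hyper-parameters ($\lambda=10$, five critic steps per generator step, Adam with small learning rates) are common defaults.

```python
import torch
import torch.nn as nn

def mlp(din, dout):
    return nn.Sequential(nn.Linear(din, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU(),
                         nn.Linear(64, dout))

G, f = mlp(2, 1), mlp(1, 1)                       # generator and critic
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.0, 0.9))
opt_f = torch.optim.Adam(f.parameters(), lr=1e-4, betas=(0.0, 0.9))
lam, n_critic, batch = 10.0, 5, 256

def sample_real(n):
    """Toy P_r: an equal mixture of N(-2, 0.5^2) and N(2, 0.5^2)."""
    centers = torch.randint(0, 2, (n, 1)).float() * 4.0 - 2.0
    return centers + 0.5 * torch.randn(n, 1)

for step in range(2000):
    for _ in range(n_critic):                     # inner maximization over the critic
        x_real, z = sample_real(batch), torch.randn(batch, 2)
        x_fake = G(z).detach()
        eps = torch.rand(batch, 1)
        x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
        grad = torch.autograd.grad(f(x_hat).sum(), x_hat, create_graph=True)[0]
        gp = ((grad.norm(2, dim=1) - 1.0) ** 2).mean()
        loss_f = f(x_fake).mean() - f(x_real).mean() + lam * gp
        opt_f.zero_grad()
        loss_f.backward()
        opt_f.step()
    z = torch.randn(batch, 2)                     # outer minimization over the generator
    loss_g = -f(G(z)).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```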
Relation to Other Losses
We say $d$ is weaker than $d'$ if every sequence that converges under $d'$ also converges under $d$. We then have:
Theorem: Every sequence of distributions that converges under the KL, reverse KL, TV, or JS divergence also converges under the Wasserstein distance.