An Introduction to WGAN
Abstract: These are my notes from self-studying WGAN.
Notation
- True samples: $x_1,x_2,\dots,x_n$.
- Generated samples (fake samples): $\hat{x}_1,\hat{x}_2,\dots,\hat{x}_n$. We want these to approximate the true samples.
- $p(x)$ is the distribution of the true samples.
- $q(x)$ is the distribution of the fake samples, which are generated by $x=G(z)$, where $z\sim q(z)$ and $q(z)$ is the standard Gaussian distribution.
f-divergence
Definition 1 (f-divergence):
$$D_f(P\,\|\,Q)=\int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right)dx$$
For KL-divergence, $f(u)=u\,\text{log}u$; for JS-divergence, $f(u)=-\frac{u+1}{2}\,\text{log}\frac{u+1}{2}+\frac{u}{2}\,\text{log}u$
Generally, it is required that:
1. $f:\mathbb R^+\rightarrow \mathbb R$;
2. $f(1)=0$ (to ensure $D_f(P\,\|\,P)=0$);
3. $f$ is convex (so that Jensen's inequality $E[f(X)]\geq f(E[X])$ holds).
Then we have:
$$\begin{aligned}\int q(x) f\left(\frac{p(x)}{q(x)}\right)dx&=E_{x\sim q(x)}\left[f\left(\frac{p(x)}{q(x)}\right)\right]\geq f\left(E_{x\sim q(x)}\left[\frac{p(x)}{q(x)}\right]\right)\\&=f\left(\int p(x)dx\right)=f(1)=0\end{aligned}$$
Therefore, the f-divergence is non-negative.
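As a quick numerical sanity check (my own sketch, not part of the derivation above), the snippet below evaluates the f-divergence formula on two small discrete distributions: with $f(u)=u\,\text{log}\,u$ it reproduces the usual KL divergence, and both values come out non-negative as claimed. The example distributions `p` and `q` are arbitrary.

```python
# Evaluate D_f(P||Q) = sum_x q(x) * f(p(x)/q(x)) for discrete p, q.
import numpy as np

def f_divergence(p, q, f):
    u = p / q
    return np.sum(q * f(u))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

f_kl = lambda u: u * np.log(u)                                           # KL(P||Q)
f_js = lambda u: -(u + 1) / 2 * np.log((u + 1) / 2) + u / 2 * np.log(u)  # JS(P||Q)

print(f_divergence(p, q, f_kl))   # same as the direct KL below, and >= 0
print(np.sum(p * np.log(p / q)))  # direct KL(P||Q) for comparison
print(f_divergence(p, q, f_js))   # JS divergence, also >= 0
```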
Convex Conjugate
The tangent line at $u=\xi$ of $y=f(u)$ is
$$y=f(\xi)+f'(\xi)(u-\xi)$$
Since $f$ is convex, we have
$$f(u)=\underset{\xi\in D}{\text{max}}\left(f(\xi)+f'(\xi)(u-\xi)\right)$$
Why can we do this? Because $f$ is convex, its graph lies above every one of its tangent lines, so $f(u)\geq f(\xi)+f'(\xi)(u-\xi)$ for every $\xi$, with equality at $\xi=u$; taking the maximum over $\xi$ therefore recovers $f(u)$ exactly.
Define $t=f'(\xi)$ and $g(t)=-f(\xi)+f'(\xi)\xi$, and suppose $t=T(u)$; then
$$f(u)=\underset{T}{\text{max}}\,\{T(u)u-g(T(u))\}$$
Therefore, we have
$$D_f(P\,\|\,Q)=\underset{T}{\text{max}} \int q(x)\left[\frac{p(x)}{q(x)}T\!\left(\frac{p(x)}{q(x)}\right)-g\!\left(T\!\left(\frac{p(x)}{q(x)}\right)\right)\right]dx$$
Denote $T(x)=T(\frac{p(x)}{q(x)})$, then
$$D_f(P\,\|\,Q)=\underset{T}{\text{max}}\left(E_{x\sim p(x)}[T(x)]-E_{x\sim q(x)}[g(T(x))]\right)$$
To train the generator, we then solve $\underset{G}{\text{min}}\, D_f(P\,\|\,Q)$.
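The $g$ defined above is exactly the convex conjugate of $f$. As a small numerical check (my own sketch): for $f(u)=u\,\text{log}\,u$ the conjugate is $g(t)=e^{t-1}$, and maximizing $tu-g(t)$ over a grid of $t$ recovers $f(u)$; the grid range and test values of $u$ are arbitrary.

```python
# Verify f(u) = max_t { t*u - g(t) } for f(u) = u*log(u), g(t) = exp(t - 1).
import numpy as np

f = lambda u: u * np.log(u)
g = lambda t: np.exp(t - 1)            # convex conjugate of u*log(u)

ts = np.linspace(-5.0, 5.0, 20001)     # grid over the dual variable t
for u in [0.3, 1.0, 2.5]:
    recovered = np.max(ts * u - g(ts))
    print(u, f(u), recovered)          # the last two values agree up to grid error
```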
The loss of GAN
$$\underset{D}{\text{min}}\, E_{x\sim p(x)}[-\text{log}\,D(x)]+E_{x\sim q(x)}[-\text{log}(1-D(x))]$$

$$\underset{G}{\text{min}}\, E_{z\sim q(z)}[-\text{log}\,D(G(z))]$$
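As a concrete illustration, here is a minimal PyTorch-style sketch of these two losses (my own, not prescribed by anything above), assuming `D` and `G` are arbitrary `nn.Module`s and that `D` outputs a probability in $(0,1)$; the small constant inside the logarithms is only for numerical stability.

```python
import torch

def gan_d_loss(D, G, x_real, z):
    """min_D  E_p[-log D(x)] + E_q[-log(1 - D(G(z)))]"""
    x_fake = G(z).detach()                       # do not backpropagate into G here
    loss_real = -torch.log(D(x_real) + 1e-8).mean()
    loss_fake = -torch.log(1.0 - D(x_fake) + 1e-8).mean()
    return loss_real + loss_fake

def gan_g_loss(D, G, z):
    """min_G  E_q[-log D(G(z))]  (the non-saturating generator loss)"""
    return -torch.log(D(G(z)) + 1e-8).mean()
```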
The loss of WGAN
$$\underset{G}{\text{min}}\,\underset{D,\|D\|_L \leq 1}{\text{max}}\, E_{x\sim p(x)}[D(x)]-E_{z\sim q(z)}[D(G(z))]$$
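A sketch of the same objective in the same style (again my own illustration): the critic `D` has no sigmoid output, and the constraint $\|D\|_L\leq 1$ is enforced crudely by clipping the critic's weights after each update, as in the original WGAN; the clipping threshold `c = 0.01` is just a common default.

```python
import torch

def wgan_d_loss(D, G, x_real, z):
    """max_D  E_p[D(x)] - E_q[D(G(z))], written as a loss to minimize."""
    return -(D(x_real).mean() - D(G(z).detach()).mean())

def wgan_g_loss(D, G, z):
    """min_G  -E_q[D(G(z))]"""
    return -D(G(z)).mean()

def clip_critic(D, c=0.01):
    """Crude Lipschitz control: clamp every critic parameter to [-c, c]."""
    with torch.no_grad():
        for p in D.parameters():
            p.clamp_(-c, c)
```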
The loss of WGAN-GP
$$\underset{D}{\text{min}}\, E_{x\sim q(x)}[D(x)]-E_{x\sim p(x)}[D(x)]+\lambda\, E_{x\sim r(x)}\left[(\|\nabla_x D(x)\|-c)^2\right]$$

$$\underset{G}{\text{min}}\, E_{z\sim q(z)}[-D(G(z))]$$

where $c$ is recommended to be set to 0 or 1, and $r(x)$ is a distribution derived from $p(x)$ and $q(x)$ (in practice, random interpolations between real and generated samples).
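The penalty term can be sketched as follows (my own illustration, with $c=1$ and inputs assumed to have shape `(batch, features)`): $r(x)$ is realized by sampling random interpolations between real and generated batches, and the critic's gradient norm at those points is pushed toward $c$. The helper name `gradient_penalty` is mine.

```python
import torch

def gradient_penalty(D, x_real, x_fake, c=1.0):
    eps = torch.rand(x_real.size(0), 1, device=x_real.device)    # epsilon ~ U[0, 1]
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    d_hat = D(x_hat)
    grads = torch.autograd.grad(outputs=d_hat, inputs=x_hat,
                                grad_outputs=torch.ones_like(d_hat),
                                create_graph=True)[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return ((grad_norm - c) ** 2).mean()

# critic loss:  wgan_d_loss(D, G, x_real, z) + lam * gradient_penalty(D, x_real, G(z).detach())
```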
Lipschitz constraints in Deep Learning
Suppose $P_r$ is the real distribution and $P_g$ is the generated distribution. Recall that the optimization target of WGAN is
$$W(P_r,P_g)=\underset{\|f\|_L\leq 1}{\text{sup}}\, E_{x\sim P_r}[f(x)]-E_{x\sim P_g}[f(x)]$$
where $\|f\|_L$ is the Lipschitz norm, defined as
$$\|f\|_L:=\underset{x\neq y}{\text{max}}\frac{|f(x)-f(y)|}{|x-y|}$$
And we define gradient penalty as
$$\underset{f}{\text{min}}\, E_{x\sim P_g}[f(x)]-E_{x\sim P_r}[f(x)]+\lambda\, (|f'(\tilde{x})|-1)^2$$
where $\tilde{x}=\epsilon x_{\text{real}}+(1-\epsilon)x_{\text{generated}},\,\epsilon\sim U[0,1], \,x_{\text{real}}\sim P_r, \,x_{\text{generated}}\sim P_g$.
How can we enforce $\|f\|_L\leq 1$? One answer is spectral normalization: replace each weight matrix $w$ in the model $f$ with $w/\|w\|_2$, where $\|w\|_2$ is the spectral norm of $w$ (its largest singular value).
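In PyTorch this can be done with `torch.nn.utils.spectral_norm`, which rescales the wrapped layer's weight by an estimate of its spectral norm on every forward pass. A rough sketch of a critic built this way (the layer sizes are arbitrary):

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Every linear layer is spectrally normalized, so each layer is (approximately)
# 1-Lipschitz; with 1-Lipschitz activations such as ReLU, the whole critic is too.
critic = nn.Sequential(
    spectral_norm(nn.Linear(2, 128)), nn.ReLU(),
    spectral_norm(nn.Linear(128, 128)), nn.ReLU(),
    spectral_norm(nn.Linear(128, 1)),
)
```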
Transformation of Probability Distribution
Set $X\sim U[0,1]$ and $Y\sim N(0,1)$. Denote the transformation as $Y=f(X)$ and let $\rho(x)$ be the density of $X$. Since the transformation must preserve probability mass, the probability in $[x,x+dx]$ equals the probability in $[y,y+dy]$, so we have
$$\rho(x)dx=\frac{1}{\sqrt{2\pi}}e^{\frac{-y^2}{2}}dy\;\Rightarrow\; \int_0^x \rho(t)dt=\int_{-\infty}^y \frac{1}{\sqrt{2\pi}}e^{\frac{-t^2}{2}}dt\;\Rightarrow\; y=\Phi^{-1}\!\left(\int_0^x \rho(t)dt\right)$$
where $\Phi$ is the CDF of the standard normal distribution.
Problem: how do we find $f$ in general? We can use a neural network here, i.e. $Y=G(X,\theta)$.
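Before bringing in a neural network, the uniform-to-Gaussian case can be checked directly (my own sketch): for $X\sim U[0,1]$ we have $\int_0^x\rho(t)dt=x$, so $y=\Phi^{-1}(x)$, which `scipy.stats.norm.ppf` implements.

```python
import numpy as np
from scipy.stats import norm

x = np.random.rand(100_000)   # X ~ U[0, 1], so the integral of rho up to x is just x
y = norm.ppf(x)               # y = Phi^{-1}(x)

print(y.mean(), y.std())      # approximately 0 and 1, i.e. Y ~ N(0, 1)
```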
Wasserstein Distance and WGAN
We define
$$W_c[p(x),q(x)]=\underset{\gamma\in \pi(p,q)}{\text{inf}}\, E_{(x,y)\sim \gamma} [c(x,y)]=\underset{\gamma\in \pi(p,q)}{\text{inf}} \int \gamma(x,y)c(x,y)\,dxdy$$
where $c(x,y)=|x-y|$ is the transport cost, and the marginal distributions of $\gamma$ are constrained to be
$$\int \gamma(x,y)dy=p(x), \qquad \int \gamma(x,y)dx=q(y)$$
The dual problem, as stated above, is
$$W(P_r,P_g)=\underset{\|f\|_L\leq 1}{\text{sup}}\, E_{x\sim P_r}[f(x)]-E_{x\sim P_g}[f(x)]$$
Denote
$$\Gamma=\left[\begin{array}{c} \gamma(x_1,y_1)\\ \gamma(x_1,y_2)\\ \vdots\\ \hline \gamma(x_2,y_1)\\ \gamma(x_2,y_2)\\ \vdots\\ \hline \gamma(x_n,y_1)\\ \gamma(x_n,y_2)\\ \vdots \end{array}\right] \qquad C=\left[\begin{array}{c} c(x_1,y_1)\\ c(x_1,y_2)\\ \vdots\\ \hline c(x_2,y_1)\\ c(x_2,y_2)\\ \vdots\\ \hline c(x_n,y_1)\\ c(x_n,y_2)\\ \vdots \end{array}\right] \qquad b=\left[\begin{array}{c} p(x_1)\\ p(x_2)\\ \vdots\\ \hline q(x_1)\\ q(x_2)\\ \vdots \end{array}\right]$$

$$A=\left[\begin{array}{c|c|c|c} 1,1,0,0,\cdots&0,0,0,0,\cdots&0,0,0,0,\cdots&\cdots\\ 0,0,0,0,\cdots&1,1,0,0,\cdots&0,0,0,0,\cdots&\cdots\\ 0,0,0,0,\cdots&0,0,0,0,\cdots&1,1,0,0,\cdots&\cdots\\ \vdots&\vdots&\vdots&\vdots\\ \hline 1,0,0,0,\cdots&1,0,0,0,\cdots&1,0,0,0,\cdots&\cdots\\ 0,1,0,0,\cdots&0,1,0,0,\cdots&0,1,0,0,\cdots&\cdots\\ 0,0,0,0,\cdots&0,0,0,0,\cdots&0,0,0,0,\cdots&\cdots\\ \vdots&\vdots&\vdots&\vdots \end{array}\right]$$
Then the optimization problem is in fact a linear programming problem
$$\underset{\Gamma}{\text{min}}\left\{\langle\Gamma,C\rangle \mid A\Gamma=b,\,\Gamma\geq 0\right\}$$
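For two small discrete distributions this linear program can be solved directly; the sketch below (my own, with arbitrary supports and weights) uses `scipy.optimize.linprog` and checks the result against `scipy.stats.wasserstein_distance`, which computes $W_1$ on the real line.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import wasserstein_distance

xs = np.array([0.0, 1.0, 2.0]); p = np.array([0.5, 0.3, 0.2])   # support and weights of p
ys = np.array([0.5, 1.5, 3.0]); q = np.array([0.2, 0.3, 0.5])   # support and weights of q

n, m = len(xs), len(ys)
C = np.abs(xs[:, None] - ys[None, :]).ravel()    # cost c(x_i, y_j) = |x_i - y_j|, flattened

# Marginal constraints: sum_j gamma_ij = p_i and sum_i gamma_ij = q_j.
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0
for j in range(m):
    A_eq[n + j, j::m] = 1.0
b_eq = np.concatenate([p, q])

res = linprog(C, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(res.fun)                                   # optimal transport cost
print(wasserstein_distance(xs, ys, p, q))        # same value from scipy
```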
Dual Form
Generally, for a problem
$$\underset{x}{\text{min}}\left\{c^{\text{T}}x \mid Ax=b,\,x\geq 0\right\}$$
where $x,c\in \mathbb{R}^n; b\in \mathbb{R}^m, A\in \mathbb{R}^{m\times n}$. Its dual form is
$$\underset{y}{\text{max}}\left\{b^{\text{T}}y \mid A^{\text{T}}y\leq c\right\}$$
Weak Dual Form
$$\underset{y}{\text{max}}\left\{b^{\text{T}}y \mid A^{\text{T}}y\leq c\right\}\leq \underset{x}{\text{min}}\left\{c^{\text{T}}x \mid Ax=b,\,x\geq 0\right\}$$
Strong Dual Form
$$\underset{y}{\text{max}}\left\{b^{\text{T}}y \mid A^{\text{T}}y\leq c\right\}= \underset{x}{\text{min}}\left\{c^{\text{T}}x \mid Ax=b,\,x\geq 0\right\}$$
The proof is omitted here; it relies on an important lemma:
Farkas' Lemma: Exactly one of the following statements holds:
- $\exists x\in \mathbb{R}^n$ with $x\geq 0$ such that $Ax=b$;
- $\exists y\in \mathbb{R}^m$ such that $A^{\text{T}}y\leq 0$ and $b^{\text{T}}y> 0$.
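Strong duality is easy to check numerically on a toy problem (my own sketch, with an arbitrary $A$, $b$, $c$): solve the primal with `scipy.optimize.linprog`, then solve the dual by minimizing $-b^{\text{T}}y$ subject to $A^{\text{T}}y\leq c$ with $y$ unbounded; the two optimal values coincide.

```python
import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 0.0]])
b = np.array([4.0, 3.0])
c = np.array([2.0, 3.0, 4.0])

# primal:  min c^T x  s.t.  A x = b,  x >= 0
primal = linprog(c, A_eq=A, b_eq=b, bounds=(0, None))

# dual:    max b^T y  s.t.  A^T y <= c  (y free, so minimize -b^T y)
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=(None, None))

print(primal.fun, -dual.fun)   # both optimal values equal 10.0 here
```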
The Dual Problem of Optimal Transport
$$\underset{\Gamma}{\text{min}}\left\{\langle\Gamma,C\rangle \mid A\Gamma=b,\,\Gamma\geq 0\right\}=\underset{F}{\text{max}}\left\{\langle b,F\rangle \mid A^{\text{T}}F\leq C\right\}$$
We denote F as
$$F=\left[\begin{array}{c} f(x_1)\\ f(x_2)\\ \vdots\\ \hline g(x_1)\\ g(x_2)\\ \vdots \end{array}\right]$$
In this way,
$$\langle b,F\rangle=\sum_i \left[p(x_i)f(x_i)+q(x_i)g(x_i)\right],\quad\text{or}\quad \langle b,F\rangle=\int \left[p(x)f(x)+q(x)g(x)\right]dx$$
The constraint can be reformulated as
$$A^{\text{T}}F\leq C\;\Rightarrow\; f(x_i)+g(y_j)\leq c(x_i,y_j)\ \text{for all}\ i,j$$
In particular, taking $y=x$ gives
$$f(x)+g(x)\leq c(x,x)=0$$
so $g(x)\leq -f(x)$, and
$$p(x_i)f(x_i)+q(x_i)g(x_i)\leq p(x_i)f(x_i)-q(x_i)f(x_i)$$
The maximum is therefore attained at $g=-f$, and for $c(x,y)=|x-y|$ the constraint $f(x)+g(y)\leq c(x,y)$ becomes $f(x)-f(y)\leq|x-y|$, so we can write
$$W[p,q]=\underset{f}{\text{max}}\left\{\int \left[p(x)-q(x)\right]f(x)\,dx \;\Big|\; f(x)-f(y)\leq|x-y|\right\}$$
Thus training WGAN amounts to solving $\underset{G}{\text{min}}\, W(P_r,P_g)$, i.e.
$$\underset{G}{\text{min}}\,\underset{f,\|f\|_L\leq 1}{\text{max}}\, E_{x\sim p(x)}[f(x)]-E_{z\sim q(z)}[f(G(z))]$$
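Putting the pieces together, here is a self-contained toy sketch of this min-max training with the gradient penalty (my own illustration, nothing above prescribes these choices): the critic $f$ and the generator $G$ are small MLPs, the real data is a one-dimensional Gaussian mixture, and the hyper-parameters ($\lambda=10$, five critic steps per generator step, Adam with small learning rates) are common defaults.

```python
import torch
import torch.nn as nn

def mlp(din, dout):
    return nn.Sequential(nn.Linear(din, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU(),
                         nn.Linear(64, dout))

G, f = mlp(2, 1), mlp(1, 1)                       # generator and critic
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.0, 0.9))
opt_f = torch.optim.Adam(f.parameters(), lr=1e-4, betas=(0.0, 0.9))
lam, n_critic, batch = 10.0, 5, 256

def sample_real(n):
    """Toy P_r: an equal mixture of N(-2, 0.5^2) and N(2, 0.5^2)."""
    centers = torch.randint(0, 2, (n, 1)).float() * 4.0 - 2.0
    return centers + 0.5 * torch.randn(n, 1)

for step in range(2000):
    for _ in range(n_critic):                     # inner maximization over the critic
        x_real, z = sample_real(batch), torch.randn(batch, 2)
        x_fake = G(z).detach()
        eps = torch.rand(batch, 1)
        x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
        grad = torch.autograd.grad(f(x_hat).sum(), x_hat, create_graph=True)[0]
        gp = ((grad.norm(2, dim=1) - 1.0) ** 2).mean()
        loss_f = f(x_fake).mean() - f(x_real).mean() + lam * gp
        opt_f.zero_grad()
        loss_f.backward()
        opt_f.step()
    z = torch.randn(batch, 2)                     # outer minimization over the generator
    loss_g = -f(G(z)).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```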
Relation to Other Losses
We say $d$ is weaker than $d'$ if every sequence that converges under $d'$ also converges under $d$. We then have:
Theorem: Every sequence of distributions that converges under the KL, reverse KL, TV, or JS divergence also converges under the Wasserstein distance.