Sitemap

A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.

Pages

Posts

Future Blog Post

less than 1 minute read

Published:

This post will show up by default. To disable scheduling of future posts, edit _config.yml and set future: false.

Blog Post number 4

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 3

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 2

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 1

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

notes

Multipole Method

Published:

The $n$'th cyclotomic polynomial $\Phi_n(x)$ is formed by collecting all the linear factors $x-\zeta$ such that $\zeta$ is an $n$'th primitive root of unity, so we can explicitly define it via
$$\Phi_n(x)=\prod_{\substack{1\leq k\leq n\\ \gcd(k,n)=1}}\left(x-e^{2\pi i k/n}\right).$$
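For example (a small worked case added for illustration), for $n=4$ the primitive fourth roots of unity are $\pm i$, so
$$\Phi_4(x)=(x-i)(x+i)=x^2+1.$$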

portfolio

publications

Bidirectional Looking with A Novel Double Exponential Moving Average to Adaptive and Non-adaptive Momentum Optimizers

Published in Proceedings of Machine Learning Research (PMLR) vol. 202, 2023

Authors listed in the entry: Yineng Chen, Zuchao Li, Lefei Zhang, Bo Du, Hai Zhao. Paper: https://proceedings.mlr.press/v202/chen23r/chen23r.pdf

Fine-Grained Position Helps Memorizing More, a Novel Music Compound Transformer Model with Feature Interaction Fusion

Published in AAAI 2023

Paper: https://ojs.aaai.org/index.php/AAAI/article/view/25650

talks

Introduction to Flow Model

Published:

Flow Model

Basic Ideas

Suppose a neural network generator $G$ defines a distribution $P_G$, that is

$$z\sim \pi(z),\qquad x=G(z),\qquad x\sim P_G(x)$$

The optimal model is

$$G^*=\underset{G}{\text{argmax}}\sum_{i=1}^m \log P_G(x_i) \approx \underset{G}{\text{argmin}}\; KL(P_{data}\,\|\,P_G)$$

From the transformation relationship, we have

$$P_G(x_i)=\pi(z_i)\left|\det(J_{G^{-1}})\right|$$

where $z_i=G^{-1}(x_i)$. Applying $\log$ to both sides, we have

$$\log P_G(x_i)=\log\pi(G^{-1}(x_i))+\log\left|\det(J_{G^{-1}})\right|$$
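As a quick one-dimensional sanity check (an example added here, not in the original note), take an affine generator $x=G(z)=az+b$:
$$G^{-1}(x)=\frac{x-b}{a},\qquad \det(J_{G^{-1}})=\frac{1}{a},\qquad \log P_G(x)=\log\pi\!\left(\frac{x-b}{a}\right)-\log|a|$$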

The iteration of multiple models (the flow) is as follows

$$\pi(z)\rightarrow\boxed{G_1}\rightarrow P_1(x)\rightarrow\boxed{G_2}\rightarrow P_2(x)\rightarrow \cdots$$

And then we get the distribution

$$P_G(x_i)=\pi(z_i)\left|\det(J_{G_1^{-1}})\right|\left|\det(J_{G_2^{-1}})\right|\cdots\left|\det(J_{G_K^{-1}})\right|,$$
$$\text{i.e.,}\quad \log P_G(x_i)=\log\pi(z_i)+\sum_{n=1}^K\log\left|\det(J_{G_n^{-1}})\right|$$

Coupling Layer

(Figure: a coupling layer.)

It is easy to see that this is an invertible transformation: on the one hand,

$$x_{i\leq d}=z_{i\leq d},\qquad x_{i> d}=\beta_{i> d}\, z_{i> d}+\gamma_{i> d}$$

on the other hand,

$$z_{i\leq d}=x_{i\leq d},\qquad z_{i> d}=\frac{x_{i> d}-\gamma_{i> d}}{\beta_{i> d}}$$

Now we can compute the Jacobian Matrix

$$J_G=\left[ \begin{array}{c|c} I & 0 \\ \hline * & \text{diagonal} \end{array} \right]$$

$$\det J_G=\frac{\partial x_{d+1}}{\partial z_{d+1}}\frac{\partial x_{d+2}}{\partial z_{d+2}}\cdots \frac{\partial x_{D}}{\partial z_{D}}=\beta_{d+1}\beta_{d+2}\cdots \beta_{D}$$
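Below is a minimal NumPy sketch of such an affine coupling layer (the conditioner, the dimensions, and the use of $\log\beta$ to keep $\beta>0$ are illustrative assumptions, not taken from the note):

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 6, 3                                  # total dims, split point

# Hypothetical parameters of the conditioner producing (log_beta, gamma)
# from the untouched half; a real model would use a deeper network.
W = rng.standard_normal((d, 2 * (D - d))) * 0.1
b = np.zeros(2 * (D - d))

def conditioner(head):
    h = head @ W + b
    log_beta, gamma = np.split(h, 2, axis=-1)  # beta = exp(log_beta) > 0
    return log_beta, gamma

def forward(z):
    z_head, z_tail = z[..., :d], z[..., d:]
    log_beta, gamma = conditioner(z_head)
    x_tail = np.exp(log_beta) * z_tail + gamma
    log_det = log_beta.sum(axis=-1)            # log|det J_G| = sum_i log beta_i
    return np.concatenate([z_head, x_tail], axis=-1), log_det

def inverse(x):
    x_head, x_tail = x[..., :d], x[..., d:]
    log_beta, gamma = conditioner(x_head)      # same inputs -> same beta, gamma
    z_tail = (x_tail - gamma) * np.exp(-log_beta)
    return np.concatenate([x_head, z_tail], axis=-1)

z = rng.standard_normal((2, D))
x, log_det = forward(z)
assert np.allclose(inverse(x), z)              # invertibility check
```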

Coupling Layer Stacks

(Figure: stacked coupling layers.)

A Starter

Published:

This is the starter file

An Introduction to WGAN

Published:

Abstract: These are my notes from self-studying WGAN.

Notation

  • true samples $x_1,x_2,\dots,x_n$.

  • generated samples (fake samples) $\hat{x}_1,\hat{x}_2,\dots,\hat{x}_n$. We want this sequence to approximate the true samples.

  • $p(x)$ is the distribution of the true samples.

  • $q(x)$ is the distribution of the generated samples, produced by $x=G(z)$, where $z\sim q(z)$ and $q(z)$ is the standard Gaussian distribution.

f-divergence

Definition 1: f-divergence

$$D_f(P\,\|\,Q)=\int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right)dx$$

For the KL divergence, $f(u)=u\log u$; for the JS divergence, $f(u)=-\frac{u+1}{2}\log\frac{u+1}{2}+\frac{u}{2}\log u$.
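As a quick check (worked out here, not spelled out in the original note), substituting $f(u)=u\log u$ into the definition above recovers the KL divergence:
$$\int q(x)\,\frac{p(x)}{q(x)}\log\frac{p(x)}{q(x)}\,dx=\int p(x)\log\frac{p(x)}{q(x)}\,dx=KL(P\,\|\,Q)$$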

Generally, it is required that:

1. $f:\mathbb{R}^+\rightarrow \mathbb{R}$
2. $f(1)=0$ (to ensure $D_f(P\,\|\,P)=0$)
3. $f$ is convex (to ensure $E[f(x)]\geq f(E[x])$)

Then we have:

$$\begin{aligned}\int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right)dx&=E_{x\sim q(x)}\left[f\!\left(\frac{p(x)}{q(x)}\right)\right]\geq f\!\left(E_{x\sim q(x)}\!\left[\frac{p(x)}{q(x)}\right]\right) \\ &=f\!\left(\int p(x)\,dx\right)=f(1)=0 \end{aligned}$$

Therefore, the f-divergence is non-negative.

Convex Conjugate

The tangent line to $y=f(u)$ at $u=\xi$ is

$$y=f(\xi)+f'(\xi)(u-\xi)$$

Since $f$ is convex, we have

$$f(u)=\underset{\xi\in D}{\max}\left(f(\xi)+f'(\xi)(u-\xi)\right)$$

This works because a convex function lies above every one of its tangent lines, and the tangent taken at $\xi=u$ touches the curve at $u$, so the maximum over all tangent lines is attained there and equals $f(u)$.

Define $t=f'(\xi)$ and $g(t)=-f(\xi)+f'(\xi)\xi$, and suppose $t=T(u)$; then we have

$$f(u)=\underset{T}{\max}\left\{T(u)\,u-g(T(u))\right\}$$
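For a concrete case (worked out here for illustration): with $f(u)=u\log u$ we get $t=f'(\xi)=\log\xi+1$, hence $\xi=e^{t-1}$ and
$$g(t)=-f(\xi)+f'(\xi)\,\xi=\xi=e^{t-1},$$
which is the convex conjugate $f^*(t)=e^{t-1}$ of $f$ for the KL case.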

Therefore, we have

$$D_f(P\,\|\,Q)=\underset{T}{\max} \int q(x)\left[\frac{p(x)}{q(x)}\,T\!\left(\frac{p(x)}{q(x)}\right)-g\!\left(T\!\left(\frac{p(x)}{q(x)}\right)\right)\right]dx$$

Denote $T(x)=T(\frac{p(x)}{q(x)})$, then

$$D_f(P\,\|\,Q)=\underset{T}{\max}\left(E_{x\sim p(x)}[T(x)]-E_{x\sim q(x)}[g(T(x))]\right)$$

To train the generator, we in fact solve $\underset{G}{\min}\, D_f(P\,\|\,Q)$.

The loss of GAN

$$\underset{D}{\min}\; E_{x\sim p(x)}[-\log D(x)]+E_{x\sim q(x)}[-\log(1-D(x))]$$
$$\underset{G}{\min}\; E_{z\sim q(z)}[-\log D(G(z))]$$

The loss of WGAN

$$\underset{G}{\min}\;\underset{D,\,\|D\|_L\leq 1}{\max}\; E_{x\sim p(x)}[D(x)]-E_{z\sim q(z)}[D(G(z))]$$

The loss of WGAN-GP

$$\underset{D}{\min}\; E_{x\sim q(x)}[D(x)]-E_{x\sim p(x)}[D(x)]+\lambda\, E_{x\sim r(x)}\!\left[\left(\|\nabla_x D(x)\|-c\right)^2\right]$$
$$\underset{G}{\min}\; E_{z\sim q(z)}[-D(G(z))]$$

where $c$ is recommended to be set to 0 or 1, and $r(x)$ is a distribution derived from $p(x)$ and $q(x)$ (for example, by interpolating between real and generated samples).
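The two WGAN-GP objectives above can be written down directly in code. Here is a minimal PyTorch sketch (not from the note; `D`, `G`, `lam`, and the batch shapes are assumptions): the critic minimizes $E_q[D]-E_p[D]$ plus the gradient penalty, and the generator minimizes $-E_z[D(G(z))]$.

```python
import torch

def critic_loss_wgan_gp(D, x_real, x_fake, lam=10.0, c=1.0):
    # E_q[D(x)] - E_p[D(x)] + lam * E_r[(||grad_x D(x)|| - c)^2]
    # x_fake should be detached from G when updating the critic.
    eps = torch.rand(x_real.size(0), *([1] * (x_real.dim() - 1)), device=x_real.device)
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)   # samples from r(x)
    grad = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    penalty = ((grad.flatten(1).norm(2, dim=1) - c) ** 2).mean()
    return D(x_fake).mean() - D(x_real).mean() + lam * penalty

def generator_loss_wgan(D, G, z):
    # min_G E_{z~q(z)}[-D(G(z))]
    return -D(G(z)).mean()
```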

Lipschitz constraints in Deep Learning

Suppose $P_r$ is the real distribution and $P_g$ is the generated distribution. Recall that the optimization target of WGAN is

$$W(P_r,P_g)=\underset{\|f\|_L\leq 1}{\sup}\; E_{x\sim P_r}[f(x)]-E_{x\sim P_g}[f(x)]$$

where $\|f\|_L$ is defined to be

$$\|f\|_L:=\underset{x\neq y}{\max}\;\frac{|f(x)-f(y)|}{|x-y|}$$

And we define gradient penalty as

$$\underset{f}{\min}\; E_{x\sim P_g}[f(x)]-E_{x\sim P_r}[f(x)]+\lambda \left(|f'(\tilde{x})|-1\right)^2$$

where $\tilde{x}=\epsilon x_{\text{real}}+(1-\epsilon)x_{\text{generated}},\,\epsilon\sim U[0,1], \,x_{\text{real}}\sim P_r, \,x_{\text{generated}}\sim P_g$.

How do we enforce $\|f\|_L\leq 1$? One answer is Spectral Normalization, that is, to replace each weight matrix $w$ in the model $f$ with $w/\|w\|_2$, where $\|w\|_2$ is the spectral norm (largest singular value) of $w$.
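A small NumPy sketch of this idea (illustrative only): estimate the spectral norm by power iteration, then divide the weight by it so the resulting linear map is 1-Lipschitz.

```python
import numpy as np

def spectral_norm(W, n_iter=50, eps=1e-12):
    # Power iteration to estimate the largest singular value ||W||_2.
    rng = np.random.default_rng(0)
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v) + eps
        u = W @ v
        u /= np.linalg.norm(u) + eps
    return float(u @ W @ v)

W = np.random.default_rng(1).standard_normal((4, 3))
W_sn = W / spectral_norm(W)            # replace w with w / ||w||_2
print(np.linalg.norm(W_sn, 2))         # ~1.0: the normalized layer is 1-Lipschitz
```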

Transformation of Probability Distribution

Set $X\sim U[0,1]$ and $Y\sim N(0,1)$. We denote the transformation as $Y=f(X)$ and let $\rho(x)$ be the density function of $X$. Since under the transformation the probability mass lying in $[x,x+dx]$ and $[y,y+dy]$ must be equal, we have

$$\rho(x)\,dx=\frac{1}{\sqrt{2\pi}}e^{-y^2/2}\,dy \;\Rightarrow\; \int_0^x \rho(t)\,dt=\int_{-\infty}^y \frac{1}{\sqrt{2\pi}}e^{-t^2/2}\,dt \;\Rightarrow\; y=\Phi^{-1}\!\left(\int_0^x \rho(t)\,dt\right)$$

Problem: how do we find $f$? We can use a neural network here, that is, $Y=G(X,\theta)$.
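A tiny numerical sketch of this inverse-CDF construction (illustrative; it uses `scipy.stats.norm.ppf` for $\Phi^{-1}$, and for $X\sim U[0,1]$ the inner integral is simply $x$):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.uniform(size=100_000)   # X ~ U[0,1], whose CDF is F(x) = x on [0,1]
y = norm.ppf(x)                 # y = Phi^{-1}(F(x)), so Y ~ N(0,1)
print(y.mean(), y.std())        # close to 0 and 1
```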

Wasserstein Measure and WGAN

We define

$$W_c[p(x),q(x)]=\underset{\gamma\in \pi(p,q)}{\inf}\; E_{(x,y)\sim \gamma}[c(x,y)]=\underset{\gamma\in \pi(p,q)}{\inf} \int \gamma(x,y)\,c(x,y)\,dx\,dy$$

where $c(x,y)=|x-y|$ is the cost (distance) function and the marginal probability distributions satisfy

$$\int \gamma(x,y)\,dy=p(x), \qquad \int \gamma(x,y)\,dx=q(y)$$

The dual problem is, as we showed above,

$$W(P_r,P_g)=\underset{\|f\|_L\leq 1}{\sup}\; E_{x\sim P_r}[f(x)]-E_{x\sim P_g}[f(x)]$$

Denote

$$\Gamma=\left[\begin{array}{c} \gamma(x_1,y_1)\\ \gamma(x_1,y_2)\\ \vdots\\ \hline \gamma(x_2,y_1)\\ \gamma(x_2,y_2)\\ \vdots\\ \hline \gamma(x_n,y_1)\\ \gamma(x_n,y_2)\\ \vdots \end{array}\right] \qquad C=\left[\begin{array}{c} c(x_1,y_1)\\ c(x_1,y_2)\\ \vdots\\ \hline c(x_2,y_1)\\ c(x_2,y_2)\\ \vdots\\ \hline c(x_n,y_1)\\ c(x_n,y_2)\\ \vdots \end{array}\right] \qquad b=\left[\begin{array}{c} p(x_1)\\ p(x_2)\\ \vdots\\ \hline q(x_1)\\ q(x_2)\\ \vdots \end{array}\right]$$

$$A=\left[\begin{array}{c|c|c|c} 1,1,0,0,\cdots & 0,0,0,0,\cdots & 0,0,0,0,\cdots & \cdots\\ 0,0,0,0,\cdots & 1,1,0,0,\cdots & 0,0,0,0,\cdots & \cdots\\ 0,0,0,0,\cdots & 0,0,0,0,\cdots & 1,1,0,0,\cdots & \cdots\\ \vdots & \vdots & \vdots & \vdots\\ \hline 1,0,0,0,\cdots & 1,0,0,0,\cdots & 1,0,0,0,\cdots & \cdots\\ 0,1,0,0,\cdots & 0,1,0,0,\cdots & 0,1,0,0,\cdots & \cdots\\ 0,0,1,0,\cdots & 0,0,1,0,\cdots & 0,0,1,0,\cdots & \cdots\\ \vdots & \vdots & \vdots & \vdots \end{array}\right]$$

Then the optimization problem is in fact a linear programming problem

$$\underset{\Gamma}{\min}\left\{\langle\Gamma,C\rangle \mid A\Gamma=b,\ \Gamma\geq 0\right\}$$
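To make this concrete, here is a small numerical sketch (the support points and probabilities are made up for illustration) that builds $C$, $A$, $b$ as above and solves for $\Gamma$ with `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical discrete example: p and q supported on the same n points.
xs = np.array([0.0, 1.0, 2.0])
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
n = len(xs)

C = np.abs(xs[:, None] - xs[None, :]).ravel()   # c(x_i, y_j) = |x_i - y_j|

# A @ Gamma_vec = b encodes both marginal constraints.
A = np.zeros((2 * n, n * n))
for i in range(n):
    A[i, i * n:(i + 1) * n] = 1                 # sum_j gamma(x_i, y_j) = p(x_i)
    A[n + i, i::n] = 1                          # sum_i gamma(x_i, y_j) = q(y_j)
b = np.concatenate([p, q])

res = linprog(C, A_eq=A, b_eq=b, bounds=(0, None))
print("W_1(p, q) =", res.fun)                   # optimal transport cost
```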

Dual Form

Generally, for a problem

$$\underset{x}{\min}\left\{c^{\text{T}}x \mid Ax=b,\ x\geq 0\right\}$$

where $x,c\in \mathbb{R}^n; b\in \mathbb{R}^m, A\in \mathbb{R}^{m\times n}$. Its dual form is

$$\underset{y}{\max}\left\{b^{\text{T}}y \mid A^{\text{T}}y\leq c\right\}$$

Weak Dual Form

$$\underset{y}{\max}\left\{b^{\text{T}}y \mid A^{\text{T}}y\leq c\right\}\leq \underset{x}{\min}\left\{c^{\text{T}}x \mid Ax=b,\ x\geq 0\right\}$$

Strong Dual Form

$$\underset{y}{\max}\left\{b^{\text{T}}y \mid A^{\text{T}}y\leq c\right\}= \underset{x}{\min}\left\{c^{\text{T}}x \mid Ax=b,\ x\geq 0\right\}$$

The proof is omitted here; it relies on an important lemma.

Farkas' Lemma: exactly one of the following statements is true:

  • $\exists x\in \mathbb{R}^n$ with $x\geq 0$ such that $Ax=b$
  • $\exists y\in \mathbb{R}^m$ such that $A^{\text{T}}y\leq 0$ and $b^{\text{T}}y> 0$
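Although the proof is omitted, the strong dual form can be sanity-checked numerically on a small instance (an illustrative sketch with made-up $A$, $b$, $c$, using `scipy.optimize.linprog`):

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative data (not from the note): min c^T x  s.t.  Ax = b, x >= 0.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 1.0])
c = np.array([1.0, 2.0, 3.0])

primal = linprog(c, A_eq=A, b_eq=b, bounds=(0, None))

# Dual: max b^T y  s.t.  A^T y <= c (y free), written as minimizing -b^T y.
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=(None, None))

print(primal.fun, -dual.fun)   # both equal 2.0: strong duality holds here
```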

The Dual Problem of the Optimal Transportation

$$\underset{\Gamma}{\min}\left\{\langle\Gamma,C\rangle \mid A\Gamma=b,\ \Gamma\geq 0\right\}= \underset{F}{\max}\left\{\langle b,F\rangle \mid A^{\text{T}}F\leq C\right\}$$

We denote F as

$$F=\left[\begin{array}{c} f(x_1)\\ f(x_2)\\ \vdots\\ \hline g(x_1)\\ g(x_2)\\ \vdots \end{array}\right]$$

In this way,

$$\langle b,F\rangle=\sum_i p(x_i)f(x_i)+q(x_i)g(x_i), \quad\text{or}\quad \langle b,F\rangle=\int p(x)f(x)+q(x)g(x)\,dx$$

The constraint condition can be reformulated as

$$A^{\text{T}}F\leq C\;\Rightarrow\; f(x_i)+g(y_j)\leq c(x_i,y_j)$$

Since

$$f(x)+g(x)\leq c(x,x)=0$$

We have $f(x)\leq -g(x)$, and

$$p(x_i)f(x_i)+q(x_i)g(x_i)\leq p(x_i)f(x_i)-q(x_i)f(x_i)$$

We could therefore set $g=-f$, so that

$$W[p,q]=\underset{f,\; f(x)-f(y)\leq|x-y|}{\max}\int p(x)f(x)-q(x)f(x)\,dx$$

Thus the training of WGAN is

$$\underset{G}{\min}\,W(P_r,P_g)=\underset{G}{\min}\;\underset{f,\,\|f\|_L\leq 1}{\max}\; E_{x\sim p(x)}[f(x)]-E_{z\sim q(z)}[f(G(z))]$$

Relation to Other Loss

We say $d$ is weaker than $d'$ if every sequence that converges under $d'$ also converges under $d$. Thus we have:

Theorem: Every sequence of distributions that converges under the KL, reverse KL, TV, or JS divergence also converges under the Wasserstein distance.

Reference

苏剑林. (Oct. 07, 2018). 《深度学习中的Lipschitz约束:泛化与生成模型 》[Blog post]. Retrieved from https://kexue.fm/archives/6051

苏剑林. (May. 03, 2019). 《从动力学角度看优化算法(四):GAN的第三个阶段 》[Blog post]. Retrieved from https://kexue.fm/archives/6583

Mescheder, L., Geiger, A., & Nowozin, S. (2018). Which training methods for GANs do actually converge? In International Conference on Machine Learning (pp. 3481-3490). PMLR.

苏剑林. (Jun. 08, 2017). 《互怼的艺术:从零直达WGAN-GP 》[Blog post]. Retrieved from https://kexue.fm/archives/4439

苏剑林. (Jan. 20, 2019). 《从Wasserstein距离、对偶理论到WGAN 》[Blog post]. Retrieved from https://kexue.fm/archives/6280

teaching

Teaching experience 1

Undergraduate course, University 1, Department, 2014

This is a description of a teaching experience. You can use markdown like any other post.

Teaching experience 2

Workshop, University 1, Department, 2015

This is a description of a teaching experience. You can use markdown like any other post.