About me
This is a page not in the main menu.
Published:
This post will show up by default. To disable scheduling of future posts, edit config.yml and set future: false.
Published:
This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.
Published:
This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.
Published:
This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.
Published:
This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.
Published:
The $n$-th cyclotomic polynomial $\Phi_n(x)$ is formed by collecting all the linear factors $x-\zeta$ such that $\zeta$ is a primitive $n$-th root of unity, so we can explicitly define it via
$$\Phi_n(x)=\prod_{\substack{1\leq k\leq n\\ \gcd(k,n)=1}}\left(x-e^{2\pi i k/n}\right)$$
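As a quick check of this definition for small $n$ (a minimal sketch assuming SymPy is installed; the test point 1.7 is an arbitrary choice of mine), SymPy's built-in cyclotomic_poly can be compared against the product over the primitive roots:

```python
import cmath
from math import gcd

from sympy import symbols, cyclotomic_poly

x = symbols('x')
x0 = 1.7  # arbitrary test point

for n in (1, 2, 3, 4, 6, 12):
    phi = cyclotomic_poly(n, x)                      # built-in Phi_n(x)
    # product over primitive n-th roots of unity, evaluated at x0
    direct = 1.0 + 0.0j
    for k in range(1, n + 1):
        if gcd(k, n) == 1:
            direct *= x0 - cmath.exp(2j * cmath.pi * k / n)
    assert abs(complex(phi.subs(x, x0)) - direct) < 1e-9
    print(f"Phi_{n}(x) = {phi}")
```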
Published:
A new test is coming soon.
Short description of portfolio item number 1
Short description of portfolio item number 2
Published in 2023
Bidirectional Looking with A Novel Double Exponential Moving Average to Adaptive and Non-adaptive Momentum Optimizers. Yineng Chen, Zuchao Li, Lefei Zhang, Bo Du, Hai Zhao. Paper: https://proceedings.mlr.press/v202/chen23r/chen23r.pdf
Published in 2023
Fine-Grained Position Helps Memorizing More, a Novel Music Compound Transformer Model with Feature Interaction Fusion. AAAI 2023. Paper: https://ojs.aaai.org/index.php/AAAI/article/view/25650
Published:
Suppose a neural network generator $G$ defines a distribution $P_G$, that is
$$z\sim \pi(z), \quad x\sim P_G(x), \quad x=G(z)$$
The optimal model is
$$G^*=\underset{G}{\text{argmax}}\sum_{i=1}^m \log P_G(x_i) \approx \underset{G}{\text{argmin}}\; KL(P_{data}\,\|\,P_G)$$
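To make the approximation explicit (a short derivation, assuming the $x_i$ are i.i.d. samples from $P_{data}$):
$$\frac{1}{m}\sum_{i=1}^m \log P_G(x_i)\approx E_{x\sim P_{data}}[\log P_G(x)]=-KL(P_{data}\,\|\,P_G)-H(P_{data}),$$
so maximizing the log-likelihood is equivalent to minimizing $KL(P_{data}\,\|\,P_G)$, since the entropy $H(P_{data})$ does not depend on $G$.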
From the transformation relationship, we have
$$P_G(x_i)=\pi(z_i)\,\left|\det(J_{G^{-1}})\right|$$
where $z_i=G^{-1}(x_i)$. Applying $\log$ to both sides, we have
$$\log P_G(x_i)=\log\pi(G^{-1}(x_i))+\log\left|\det(J_{G^{-1}})\right|$$
Stacking multiple such models (the flow) gives
$$\pi(z)\rightarrow\boxed{G_1}\rightarrow P_1(x)\rightarrow\boxed{G_2}\rightarrow P_2(x)\rightarrow \cdots$$
and then the resulting distribution satisfies
$$P_G(x_i)=\pi(z_i)\left|\det(J_{G_1^{-1}})\right|\left|\det(J_{G_2^{-1}})\right|\cdots\left|\det(J_{G_K^{-1}})\right|,$$
$$\text{i.e.,}\quad \log P_G(x_i)=\log\pi(z_i)+\sum_{n=1}^K \log\left|\det(J_{G_n^{-1}})\right|$$
It is easy to see that the affine coupling layer defined below is an invertible transformation. On the one hand,
$$x_{i\leq d}=z_{i\leq d},\qquad x_{i> d}=\beta_{i> d}\, z_{i> d}+\gamma_{i> d}$$
on the other hand,
$$z_{i\leq d}=x_{i\leq d},\qquad z_{i> d}=\frac{x_{i> d}-\gamma_{i> d}}{\beta_{i> d}}$$
Now we can compute the Jacobian Matrix
$$J_G=\left[ \begin{array}{c|c} I & 0 \\ \hline * & \text{diagonal} \end{array} \right]$$
$$\det J_G=\frac{\partial x_{d+1}}{\partial z_{d+1}}\frac{\partial x_{d+2}}{\partial z_{d+2}}\cdots \frac{\partial x_{D}}{\partial z_{D}}=\beta_{d+1}\beta_{d+2}\cdots \beta_{D}$$
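The forward map, inverse map, and log-determinant above are easy to verify numerically. Below is a minimal NumPy sketch (an illustration under my own naming, not code from the note); in a real flow, beta and gamma would be produced by a neural network that takes $z_{i\leq d}$ as input:

```python
import numpy as np

def coupling_forward(z, d, beta, gamma):
    """x_{i<=d} = z_{i<=d};  x_{i>d} = beta * z_{i>d} + gamma."""
    x = z.copy()
    x[d:] = beta * z[d:] + gamma
    log_det = np.sum(np.log(np.abs(beta)))   # log|det J_G| = sum_i log|beta_i|
    return x, log_det

def coupling_inverse(x, d, beta, gamma):
    """z_{i<=d} = x_{i<=d};  z_{i>d} = (x_{i>d} - gamma) / beta."""
    z = x.copy()
    z[d:] = (x[d:] - gamma) / beta
    return z

# Usage: a 6-dimensional example with d = 3
rng = np.random.default_rng(0)
z = rng.normal(size=6)
beta, gamma = rng.uniform(0.5, 2.0, size=3), rng.normal(size=3)
x, log_det = coupling_forward(z, 3, beta, gamma)
assert np.allclose(coupling_inverse(x, 3, beta, gamma), z)    # invertibility
assert np.isclose(log_det, np.log(np.prod(np.abs(beta))))     # det J_G = prod(beta)
```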
Published:
Published:
Abstract: These are my notes from self-studying WGAN.
True samples: $x_1,x_2,\dots,x_n$.
Generated samples (fake samples): $\hat{x}_1,\hat{x}_2,\dots,\hat{x}_n$. We want this sequence to approximate the true samples.
$p(x)$ is the distribution of the true samples.
$q(x)$ is the distribution of the fake samples, which are generated by $x=G(z)$, where $z\sim q(z)$ and $q(z)$ is the standard Gaussian distribution.
Definition 1 (f-divergence):
$$D_f(P\,\|\,Q)=\int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right)dx$$
For KL-divergence, $f(u)=u\,\text{log}u$; for JS-divergence, $f(u)=-\frac{u+1}{2}\,\text{log}\frac{u+1}{2}+\frac{u}{2}\,\text{log}u$
Generally, it is required that:
1. $f:\mathbb R^+\rightarrow \mathbb R$;
2. $f(1)=0$ (to ensure $D_f(P\,\|\,P)=0$);
3. $f$ is convex (to ensure $E[f(x)]\geq f(E[x])$ by Jensen's inequality).
Then we have:
$$\begin{aligned}\int q(x) f\!\left(\frac{p(x)}{q(x)}\right)dx&=E_{x\sim q(x)}\left[f\!\left(\frac{p(x)}{q(x)}\right)\right]\geq f\!\left(E_{x\sim q(x)}\!\left[\frac{p(x)}{q(x)}\right]\right)\\&=f\!\left(\int p(x)\,dx\right)=f(1)=0\end{aligned}$$
Therefore, the f-divergence is non-negative.
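As a numerical sanity check of this definition (a minimal sketch assuming SciPy is available; the two Gaussians are arbitrary choices of mine), evaluating $\int q(x)f(p(x)/q(x))dx$ with $f(u)=u\log u$ by quadrature recovers the closed-form KL divergence:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

p = norm(loc=0.0, scale=1.0)   # P = N(0, 1)
q = norm(loc=1.0, scale=2.0)   # Q = N(1, 2^2)

f = lambda u: u * np.log(u)    # generator of the KL divergence

integrand = lambda x: q.pdf(x) * f(p.pdf(x) / q.pdf(x))
kl_numeric, _ = quad(integrand, -20, 20)

# Closed form: KL(N(m1,s1^2) || N(m2,s2^2)) = log(s2/s1) + (s1^2+(m1-m2)^2)/(2 s2^2) - 1/2
kl_exact = np.log(2.0 / 1.0) + (1.0 + 1.0) / (2 * 4.0) - 0.5
print(kl_numeric, kl_exact)    # both approximately 0.443
```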
The tangent line at $u=\xi$ of $y=f(u)$ is
$$y=f(\xi)+f'(\xi)(u-\xi)$$
Since $f$ is convex, we have
$$f(u)=\max_{\xi\in D}\left(f(\xi)+f'(\xi)(u-\xi)\right)$$
(Why can we do this? Because $f$ is convex, it lies above every one of its tangent lines, so $f(u)\geq f(\xi)+f'(\xi)(u-\xi)$ for all $\xi$, with equality at $\xi=u$; taking the maximum over $\xi$ therefore recovers $f(u)$ exactly.)
Define $t=f'(\xi)$ and $g(t)=-f(\xi)+f'(\xi)\xi$, and suppose $t=T(u)$; then we have
$$f(u)=\max_{T}\,\left\{T(u)\,u-g(T(u))\right\}$$
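As a concrete example (the KL case $f(u)=u\log u$, worked out here for illustration): $t=f'(\xi)=\log\xi+1$ gives $\xi=e^{t-1}$, so
$$g(t)=-f(\xi)+f'(\xi)\,\xi=-\xi\log\xi+(\log\xi+1)\,\xi=\xi=e^{t-1},$$
which is exactly the convex (Fenchel) conjugate $f^*(t)$ of $f$.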
Therefore, we have
$$D_f(P\,\|\,Q)=\max_T \int q(x)\left[\frac{p(x)}{q(x)}\,T\!\left(\frac{p(x)}{q(x)}\right)-g\!\left(T\!\left(\frac{p(x)}{q(x)}\right)\right)\right]dx$$
Denote $T(x)=T(\frac{p(x)}{q(x)})$, then
$$D_f(P\,\|\,Q)=\max_T\left(E_{x\sim p(x)}[T(x)]-E_{x\sim q(x)}[g(T(x))]\right)$$
To train the generator, we then minimize this divergence, i.e., we solve $\underset{G}{\min}\, D_f(P\,\|\,Q)$.
For the standard (non-saturating) GAN, the discriminator and generator objectives are
$$\underset{D}{\min}\; E_{x\sim p(x)}[-\log D(x)]+E_{x\sim q(x)}[-\log(1-D(x))],$$
$$\underset{G}{\min}\; E_{z\sim q(z)}[-\log D(G(z))]$$
For WGAN, the two objectives collapse into a single min-max problem over a 1-Lipschitz critic:
$$\underset{G}{\min}\;\underset{D,\,\|D\|_L \leq 1}{\max}\; E_{x\sim p(x)}[D(x)]-E_{z\sim q(z)}[D(G(z))]$$
Replacing the hard Lipschitz constraint with a gradient penalty gives
$$\underset{D}{\min}\; E_{x\sim q(x)}[D(x)]-E_{x\sim p(x)}[D(x)]+\lambda\, E_{x\sim r(x)}\big[(\|\nabla_x D(x)\|-c)^2\big],$$
$$\underset{G}{\min}\; E_{z\sim q(z)}[-D(G(z))]$$
where $c$ is recommended to be set to 0 or 1, and $r(x)$ is a distribution derived from $p(x)$ and $q(x)$ (e.g., by interpolating between real and generated samples, as made precise below).
Suppose $P_r$ is the real distribution and $P_g$ is the generated distribution. Remember the optimization target of WGAN is
$$W(P_r,P_g)=\underset{\|f\|_L\leq 1}{\sup}\; E_{x\sim P_r}[f(x)]-E_{x\sim P_g}[f(x)]$$
where $|f|_L$ is defined to be
$$\|f\|_L:=\underset{x\neq y}{\sup}\frac{|f(x)-f(y)|}{|x-y|}$$
And we define gradient penalty as
$$\underset{f}{\min}\; E_{x\sim P_g}[f(x)]-E_{x\sim P_r}[f(x)]+\lambda\,\big(\|\nabla_{\tilde{x}} f(\tilde{x})\|-1\big)^2$$
where $\tilde{x}=\epsilon x_{\text{real}}+(1-\epsilon)x_{\text{generated}},\,\epsilon\sim U[0,1], \,x_{\text{real}}\sim P_r, \,x_{\text{generated}}\sim P_g$.
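A minimal PyTorch sketch of this penalty term (the helper name gradient_penalty and the assumption that the critic f, the real batch x_real, and the generated batch x_fake share the same shape are mine, not from the note):

```python
import torch

def gradient_penalty(f, x_real, x_fake, lam=10.0):
    # epsilon ~ U[0,1], broadcast over all non-batch dimensions
    eps = torch.rand(x_real.size(0), *([1] * (x_real.dim() - 1)), device=x_real.device)
    # x_tilde = eps * x_real + (1 - eps) * x_fake, with gradients tracked w.r.t. x_tilde
    x_tilde = (eps * x_real + (1 - eps) * x_fake).detach().requires_grad_(True)
    scores = f(x_tilde)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=x_tilde, create_graph=True)[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()

# Critic loss would then be, per the objective above:
# critic_loss = f(x_fake).mean() - f(x_real).mean() + gradient_penalty(f, x_real, x_fake)
```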
How do we make $\|f\|_L\leq 1$? The answer is to use Spectral Normalization, that is, to replace each weight matrix $w$ in the model $f$ with $w/\|w\|_2$, where $\|w\|_2$ is the spectral norm (the largest singular value) of $w$.
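As a sketch of how this can be done (the power-iteration routine below is my own illustration, not taken from the note; PyTorch also ships a ready-made torch.nn.utils.spectral_norm wrapper):

```python
import numpy as np

def spectral_normalize(w, n_iter=20):
    # Estimate the largest singular value of w by power iteration, then divide by it.
    u = np.random.default_rng(0).normal(size=w.shape[0])
    for _ in range(n_iter):
        v = w.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = w @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = u @ w @ v               # estimated spectral norm
    return w / sigma

w = np.random.default_rng(1).normal(size=(64, 32))
w_sn = spectral_normalize(w)
print(np.linalg.norm(w_sn, 2))      # approximately 1.0: the layer is now 1-Lipschitz
```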
Set $X\sim U[0,1]$ and $Y\sim N(0,1)$. We denote the transformation as $Y=f(X)$ and let $\rho(x)$ be the density function of $X$. Since under the transformation the probabilities of lying in $[x,x+dx]$ and $[y,y+dy]$ must be equal, we have
$$\rho(x)dx=\frac{1}{\sqrt{2\pi}}e^{-y^2/2}dy\;\Rightarrow\; \int_0^x \rho(t)\,dt=\int_{-\infty}^y \frac{1}{\sqrt{2\pi}}e^{-t^2/2}\,dt\;\Rightarrow\; y=\Phi^{-1}\!\left(\int_0^x \rho(t)\,dt\right)$$
where $\Phi$ is the standard normal CDF.
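A quick numerical check of this uniform-to-Gaussian map, a minimal sketch assuming SciPy is available (norm.ppf is the inverse CDF $\Phi^{-1}$):

```python
import numpy as np
from scipy.stats import norm

x = np.random.default_rng(0).uniform(0, 1, size=100_000)   # X ~ U[0, 1]
y = norm.ppf(x)                                            # Y = Phi^{-1}(X)

print(y.mean(), y.std())   # approximately 0 and 1, as expected for N(0, 1)
```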
Problem: how do we find $f$ in general? We could use a neural network here, that is, $Y=G(X,\theta)$.
We define
$$W_c[p(x),q(x)]=\underset{\gamma\in \pi(p,q)}{\inf}\; E_{(x,y)\sim \gamma} [c(x,y)]=\underset{\gamma\in \pi(p,q)}{\inf} \int \gamma(x,y)\,c(x,y)\,dx\,dy$$
where $c(x,y)=|x-y|$ is the cost function and the marginal constraints are
$$\int \gamma(x,y)\,dy=p(x), \qquad \int \gamma(x,y)\,dx=q(y)$$
The dual problem is, as shown above,
$$W(P_r,P_g)=\underset{\|f\|_L\leq 1}{\sup}\; E_{x\sim P_r}[f(x)]-E_{x\sim P_g}[f(x)]$$
Denote
$$\Gamma= \left[ \begin{array}{c} \gamma(x_1,y_1)\\ \gamma(x_1,y_2)\\ \vdots\\ \hline \gamma(x_2,y_1)\\ \gamma(x_2,y_2)\\ \vdots\\ \hline \gamma(x_n,y_1)\\ \gamma(x_n,y_2)\\ \vdots \end{array} \right] \qquad C=\left[ \begin{array}{c} c(x_1,y_1)\\ c(x_1,y_2)\\ \vdots\\ \hline c(x_2,y_1)\\ c(x_2,y_2)\\ \vdots\\ \hline c(x_n,y_1)\\ c(x_n,y_2)\\ \vdots \end{array} \right] \qquad b=\left[ \begin{array}{c} p(x_1)\\ p(x_2)\\ \vdots\\ \hline q(x_1)\\ q(x_2)\\ \vdots \end{array} \right]$$
$$A=\left[ \begin{array}{c|c|c|c} 1,1,0,0,\cdots & 0,0,0,0,\cdots & 0,0,0,0,\cdots & \cdots\\ 0,0,0,0,\cdots & 1,1,0,0,\cdots & 0,0,0,0,\cdots & \cdots\\ 0,0,0,0,\cdots & 0,0,0,0,\cdots & 1,1,0,0,\cdots & \cdots\\ \vdots & \vdots & \vdots & \vdots\\ \hline 1,0,0,0,\cdots & 1,0,0,0,\cdots & 1,0,0,0,\cdots & \cdots\\ 0,1,0,0,\cdots & 0,1,0,0,\cdots & 0,1,0,0,\cdots & \cdots\\ 0,0,0,0,\cdots & 0,0,0,0,\cdots & 0,0,0,0,\cdots & \cdots\\ \vdots & \vdots & \vdots & \vdots \end{array} \right]$$
Then the optimization problem is in fact a linear programming problem
$$\underset{\Gamma}{\min}\left\{\langle\Gamma,C\rangle \,\middle|\, A\Gamma=b,\ \Gamma\geq 0\right\}$$
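For a tiny discrete example, this linear program can be solved directly with scipy.optimize.linprog; the sketch below (the two-point distributions and support points are arbitrary choices of mine) builds the constraint matrix $A$ and recovers the optimal coupling $\Gamma$:

```python
import numpy as np
from scipy.optimize import linprog

p = np.array([0.5, 0.5])          # source distribution p(x_i)
q = np.array([0.25, 0.75])        # target distribution q(y_j)
xs, ys = np.array([0.0, 1.0]), np.array([0.0, 1.0])
C = np.abs(xs[:, None] - ys[None, :])      # cost c(x_i, y_j) = |x_i - y_j|

n, m = len(p), len(q)
# Marginal constraints: sum_j gamma_ij = p_i and sum_i gamma_ij = q_j
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0
for j in range(m):
    A_eq[n + j, j::m] = 1.0
b_eq = np.concatenate([p, q])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(res.x.reshape(n, m))   # optimal coupling gamma
print(res.fun)               # Wasserstein-1 distance, here 0.25
```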
Generally, for a problem
$$\underset{x}{\min}\left\{c^{\text{T}}x \,\middle|\, Ax=b,\ x\geq 0\right\}$$
where $x,c\in \mathbb{R}^n; b\in \mathbb{R}^m, A\in \mathbb{R}^{m\times n}$. Its dual form is
$$\underset{y}{\max}\left\{b^{\text{T}}y \,\middle|\, A^{\text{T}}y\leq c\right\}$$
Weak duality states that
$$\underset{y}{\max}\left\{b^{\text{T}}y \,\middle|\, A^{\text{T}}y\leq c\right\}\leq \underset{x}{\min}\left\{c^{\text{T}}x \,\middle|\, Ax=b,\ x\geq 0\right\},$$
and strong duality states that, whenever an optimum exists, the two values are equal:
$$\underset{y}{\max}\left\{b^{\text{T}}y \,\middle|\, A^{\text{T}}y\leq c\right\}= \underset{x}{\min}\left\{c^{\text{T}}x \,\middle|\, Ax=b,\ x\geq 0\right\}$$
The proof is omitted here; it relies on an important lemma.
Farkas' Lemma: exactly one of the following two statements is true: either there exists $x\geq 0$ with $Ax=b$, or there exists $y$ with $A^{\text{T}}y\leq 0$ and $b^{\text{T}}y>0$.
Applying strong duality to the transport problem gives
$$\underset{\Gamma}{\min}\left\{\langle\Gamma,C\rangle \,\middle|\, A\Gamma=b,\ \Gamma\geq 0\right\}= \underset{F}{\max}\left\{\langle b,F\rangle \,\middle|\, A^{\text{T}}F\leq C\right\}$$
We denote F as
$$F=\left[ \begin{array}{c} f(x_1)\\ f(x_2)\\ \vdots\\ \hline g(x_1)\\ g(x_2)\\ \vdots \end{array} \right]$$
In this way,
$$\langle b,F\rangle=\sum_i p(x_i)f(x_i)+q(x_i)g(x_i), \quad\text{or}\quad \langle b,F\rangle=\int p(x)f(x)+q(x)g(x)\,dx$$
The constraint condition can be reformulated as
$$A^{\text{T}}F\leq C\;\Rightarrow\; f(x_i)+g(y_j)\leq c(x_i,y_j)$$
Since
$$f(x)+g(x)\leq c(x,x)=0$$
We have $f(x)\leq -g(x)$, and
$$p(x_i)f(x_i)+q(x_i)g(x_i)\leq p(x_i)f(x_i)-q(x_i)f(x_i)$$
We could therefore set $g=-f$, which gives
$$W[p,q]=\underset{f}{\max}\left\{\int p(x)f(x)-q(x)f(x)\,dx \,\middle|\, f(x)-f(y)\leq|x-y|\right\}$$
Thus the training of WGAN is
$$\underset{G}{\min}\, W(P_r,P_g)=\underset{G}{\min}\;\underset{f,\,\|f\|_L\leq 1}{\max}\; E_{x\sim p(x)}[f(x)]-E_{z\sim q(z)}[f(G(z))]$$
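A minimal PyTorch sketch of one training step under this min-max objective, using the original WGAN's weight clipping to roughly enforce the Lipschitz constraint (the helper name wgan_step and all hyper-parameters are illustrative assumptions; in practice the critic is updated several times per generator step, and the gradient-penalty or spectral-normalization variants above are usually preferred):

```python
import torch

def wgan_step(f, G, x_real, opt_f, opt_G, z_dim, clip=0.01):
    # Critic update: maximize E[f(x)] - E[f(G(z))], i.e. minimize the negation
    z = torch.randn(x_real.size(0), z_dim, device=x_real.device)
    critic_loss = f(G(z).detach()).mean() - f(x_real).mean()
    opt_f.zero_grad()
    critic_loss.backward()
    opt_f.step()
    for p in f.parameters():          # weight clipping as a crude Lipschitz constraint
        p.data.clamp_(-clip, clip)

    # Generator update: minimize -E[f(G(z))]
    z = torch.randn(x_real.size(0), z_dim, device=x_real.device)
    gen_loss = -f(G(z)).mean()
    opt_G.zero_grad()
    gen_loss.backward()
    opt_G.step()
    return critic_loss.item(), gen_loss.item()
```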
We say a metric $d$ is weaker than $d'$ if every sequence that converges under $d'$ also converges under $d$. Thus we have:
Theorem: Every sequence of distributions that converges under the KL, reverse KL, TV, or JS divergence also converges under the Wasserstein distance.
苏剑林. (Oct. 07, 2018). "Lipschitz Constraints in Deep Learning: Generalization and Generative Models" [Blog post]. Retrieved from https://kexue.fm/archives/6051
苏剑林. (May 03, 2019). "Optimization Algorithms from a Dynamical-Systems Viewpoint (4): The Third Stage of GANs" [Blog post]. Retrieved from https://kexue.fm/archives/6583
Mescheder, L., Geiger, A., Nowozin, S. Which training methods for GANs do actually converge? International Conference on Machine Learning (ICML), PMLR, 2018: 3481-3490.
苏剑林. (Jun. 08, 2017). "The Art of Adversarial Sparring: From Zero to WGAN-GP" [Blog post]. Retrieved from https://kexue.fm/archives/4439
苏剑林. (Jan. 20, 2019). "From the Wasserstein Distance and Duality Theory to WGAN" [Blog post]. Retrieved from https://kexue.fm/archives/6280
Undergraduate course, University 1, Department, 2014
This is a description of a teaching experience. You can use markdown like any other post.
Workshop, University 1, Department, 2015
This is a description of a teaching experience. You can use markdown like any other post.