
Denoising Diffusion Probabilistic Model (DDPM)

The formulas render badly on this page, so the article has been moved to https://zhuanlan.zhihu.com/p/636776166

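The Denoising Diffusion Probabilistic Model (DDPM) was proposed in 2020. It showed the world how powerful diffusion models can be and set off the surge of interest in them. I studied the topic on my own out of curiosity, and this post introduces DDPM by combining reference material found online with my own understanding. A caveat: my ability is limited, and I ran into many gaps in my knowledge along the way, so much of this was learned on the fly. If you find a mistake, please point it out in the comments so that we can learn and improve together.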

Prerequisites

① Bayes' rule (chain rule of probability)

P(A,B,C) = P(C\mid B,A)P(B,A)=P(C\mid B,A)P(B\mid A)P(A)

P(B,C\mid A) = P(B\mid A)P(C\mid A,B)

If the variables form a Markov chain A\rightarrow B\rightarrow C, that is, the distribution at the current step depends only on the previous step, then:

P(A,B,C) = P(C\mid B,A)P(B,A)=\textcolor{red}{P(C\mid B)}P(B\mid A)P(A)

P(B,C\mid A) = P(B\mid A)\color{red}{P(C\mid B)}
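
The Markov factorization above can be checked on a small discrete example. The sketch below is only an illustration: the two-state chain and all of its probability tables are made-up values, not something from the original article. It builds the joint distribution from P(C\mid B)P(B\mid A)P(A) and then recovers P(C\mid A,B) from that joint, which should no longer depend on A.

```python
from itertools import product

# Hypothetical two-state Markov chain A -> B -> C (all numbers are illustrative).
p_a = {0: 0.3, 1: 0.7}                                     # P(A)
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # P(B | A), indexed [a][b]
p_c_given_b = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.5, 1: 0.5}}   # P(C | B), indexed [b][c]


def joint(a, b, c):
    """P(A, B, C) built from the Markov factorization P(C|B) P(B|A) P(A)."""
    return p_c_given_b[b][c] * p_b_given_a[a][b] * p_a[a]


def cond_c_given_ab(c, a, b):
    """P(C | A, B) recovered from the joint by the definition of conditioning."""
    return joint(a, b, c) / sum(joint(a, b, k) for k in (0, 1))


# The joint must sum to 1 over all eight states.
total = sum(joint(a, b, c) for a, b, c in product((0, 1), repeat=3))
print(total)

# P(C | A, B) is the same for both values of A, so it reduces to P(C | B).
print(cond_c_given_ab(1, 0, 1), cond_c_given_ab(1, 1, 1), p_c_given_b[1][1])
```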

② Gaussian probability density function and the sum of Gaussian random variables

For a univariate Gaussian distribution \mathcal{N}(\mu , \sigma ^{2}) with mean \mu and variance \sigma ^{2}, the probability density function is:

\small q(x) = \frac{1}{\sqrt{2\pi }\sigma }\exp \left ( -\frac{1}{2}\left ( \frac{x-\mu }{\sigma } \right )^2 \right )

It is often convenient to drop the leading constant and write:

\small q(x) \propto \exp\left ( -\frac{1}{2}\left ( \frac{x-\mu }{\sigma } \right )^2 \right ) \Leftrightarrow q(x) \propto \exp \left ( -\frac{1}{2}\left ( \frac{1}{\sigma ^{2}}x^2 - \frac{2\mu }{\sigma ^{2}}x + \frac{\mu ^{2}}{\sigma ^{2}} \right ) \right )

Given two independent Gaussian random variables X\sim \mathcal{N}(\mu_{1} , \sigma_{1} ^{2}) and Y\sim \mathcal{N}(\mu_{2} , \sigma_{2} ^{2}), their linear combination aX+bY satisfies:

aX+bY\sim \mathcal{N}(a\times \mu _{1} + b \times \mu_{2},a^{2} \times \sigma _{1}^{2} + b^{2} \times \sigma _{2}^{2})
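
As a quick sanity check of this formula, the following sketch compares the predicted mean and variance of aX+bY with empirical estimates from samples. The coefficients and distribution parameters are arbitrary illustrative values, and X and Y are drawn independently, which the formula requires.

```python
import random


def sample_linear_combo(a, mu1, sigma1, b, mu2, sigma2, n, seed=0):
    """Draw n samples of aX + bY with independent X ~ N(mu1, sigma1^2), Y ~ N(mu2, sigma2^2)."""
    rng = random.Random(seed)
    return [a * rng.gauss(mu1, sigma1) + b * rng.gauss(mu2, sigma2) for _ in range(n)]


def mean_and_variance(values):
    """Empirical mean and (population) variance of a list of floats."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


# Illustrative parameters, not taken from the article.
a, mu1, sigma1 = 2.0, 1.0, 0.5
b, mu2, sigma2 = -3.0, -2.0, 1.5

predicted_mean = a * mu1 + b * mu2                      # a*mu1 + b*mu2
predicted_var = a**2 * sigma1**2 + b**2 * sigma2**2     # a^2*sigma1^2 + b^2*sigma2^2
empirical_mean, empirical_var = mean_and_variance(
    sample_linear_combo(a, mu1, sigma1, b, mu2, sigma2, n=200_000)
)
print(predicted_mean, empirical_mean)
print(predicted_var, empirical_var)
```

Note that the variances add with squared coefficients, so a negative b still increases the variance.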

③ KL divergence and cross-entropy

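For a detailed explanation, see my earlier blog post. Suppose the true distribution of a random variable is P and the approximate distribution obtained by modeling is Q. The KL divergence (green) and the cross-entropy (purple) of P and Q are then related by: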

\textcolor[rgb]{0, 0.39, 0}{D_{KL}(P,Q)} = \textcolor[rgb]{0.63, 0.12, 0.94}{-\sum P\log Q} - (-\sum P\log P) =\textcolor[rgb]{0.63, 0.12, 0.94}{\mathbb{E}_{P}[- \log Q]} - \mathbb{E}_{P}[- \log P]

For two univariate Gaussian distributions p\sim \mathcal{N}(\mu _{1}, \sigma _{1}^{2}) and q\sim \mathcal{N}(\mu _{2}, \sigma _{2}^{2}), the KL divergence is:

D_{KL}(p,q) = \log \frac{\sigma _{2}}{\sigma _{1}} + \frac{\sigma _{1}^{2} + (\mu _{1} - \mu _{2})^{2}}{2 \sigma _{2}^{2}} - \frac{1}{2}
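
The closed form above can be compared against a direct numerical integration of p \log (p/q). The sketch below is a minimal check under assumed parameter values (they are illustrative only); the integration range and step count are also arbitrary choices that happen to be wide and fine enough for these parameters.

```python
import math


def kl_gaussians(mu1, sigma1, mu2, sigma2):
    """Closed-form D_KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) for univariate Gaussians."""
    return (math.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2) ** 2) / (2.0 * sigma2**2)
            - 0.5)


def log_pdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(math.sqrt(2.0 * math.pi) * sigma)


def kl_numeric(mu1, sigma1, mu2, sigma2, lo=-30.0, hi=30.0, steps=60_000):
    """Midpoint-rule estimate of the integral of p(x) * (log p(x) - log q(x))."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        log_p = log_pdf(x, mu1, sigma1)
        total += math.exp(log_p) * (log_p - log_pdf(x, mu2, sigma2)) * dx
    return total


closed_form = kl_gaussians(0.0, 1.0, 1.0, 2.0)
numeric = kl_numeric(0.0, 1.0, 1.0, 2.0)
print(closed_form, numeric)
```

When the two standard deviations are equal, the logarithm vanishes and the expression reduces to (\mu_{1}-\mu_{2})^{2} / (2\sigma^{2}), which is the case used later for the training objective.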

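④ Reparameterization trick

To sample from a Gaussian distribution \mathcal{N}(\mu ,\sigma^{2} ), first draw z from the standard normal distribution \mathcal{N}(0 ,1 ), then compute \sigma \ast z + \mu, which is the desired sample. Note that z is scaled by the standard deviation \sigma, not by the variance \sigma^{2}. This moves the randomness into z, so the sample is differentiable with respect to \mu and \sigma.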
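
A minimal sketch of the reparameterization trick, using only the Python standard library. The values of mu and sigma are illustrative; the point is that the sample is a deterministic function of (mu, sigma) once z has been drawn, and that its empirical mean and variance match \mu and \sigma^{2}.

```python
import random


def sample_reparameterized(mu, sigma, rng):
    """Sample from N(mu, sigma^2) as mu + sigma * z with z ~ N(0, 1)."""
    z = rng.gauss(0.0, 1.0)   # all randomness lives in z
    return mu + sigma * z     # deterministic in mu and sigma once z is fixed


rng = random.Random(0)
mu, sigma = 3.0, 2.0          # illustrative values
draws = [sample_reparameterized(mu, sigma, rng) for _ in range(100_000)]

empirical_mean = sum(draws) / len(draws)
empirical_var = sum((d - empirical_mean) ** 2 for d in draws) / len(draws)
print(empirical_mean, empirical_var)   # should be close to mu and sigma**2
```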

Overview

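DDPM consists of two processes, drawn in the original figure as the noising process (right to left) and the denoising process (left to right).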

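★ Noising process: given a real image x_{0}, Gaussian noise is added to it step by step, giving x_{1},\ x_{2},\ \cdots. This is clearly a Markov chain. After a sufficiently large number of steps T, the image is drowned in Gaussian noise and can be regarded as isotropic Gaussian noise.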

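★ Denoising process: starting from the noisy image x_{T}, a neural network removes the noise step by step, giving x_{T-1},\ x_{T-2},\ \cdots, and finally recovers a realistic, noise-free image x_{0}. The noising process can therefore be seen as constructing the labels for the denoising process.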

Forward process (diffusion process, noising process)

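Given an initial image x_{0}, Gaussian noise is added to it step by step for T steps, producing a sequence of noisy images and eventually destroying the image. In the step from x_{t-1} to x_{t}, the variance of the added noise is \beta _{t}, also called the diffusion rate. It is a prescribed value between 0 and 1 that increases with the step index. The diffusion process is defined as: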

x_{t}=\sqrt{1-\beta _{t}}x_{t-1}+\sqrt{\beta _{t}}z _{t},\hspace{2em}z_{t}\sim \mathcal N(0,\boldsymbol{I})

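By this definition, each noising step multiplies the previous state x_{t-1} by the coefficient \sqrt{1-\beta_{t}} and then adds Gaussian noise with mean 0 and variance \beta_{t}. The noising process is therefore fixed in advance and has no learnable parameters. Written as a probability distribution: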

q(x_{t}\mid x_{t-1}) = \mathcal{N} (x_{t}; \sqrt{1 - \beta _{t}}x_{t-1}, \beta_{t}\boldsymbol{I})
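
A single forward step can be sketched in a few lines. This is a toy illustration, not the article's code: the "image" is a list of three floats, and the three diffusion rates are hypothetical values chosen only to be increasing and between 0 and 1.

```python
import math
import random


def forward_step(x_prev, beta_t, rng):
    """One draw from q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    return [math.sqrt(1.0 - beta_t) * v + math.sqrt(beta_t) * rng.gauss(0.0, 1.0)
            for v in x_prev]


rng = random.Random(0)
x0 = [1.0, -0.5, 0.25]        # a toy 3-pixel "image" (illustrative values)
betas = [0.1, 0.2, 0.3]       # hypothetical, increasing diffusion rates

# Apply the steps one after another to build the chain x_0 -> x_1 -> x_2 -> x_3.
trajectory = [x0]
for beta_t in betas:
    trajectory.append(forward_step(trajectory[-1], beta_t, rng))

for t, x_t in enumerate(trajectory):
    print(t, x_t)
```

Each pixel receives its own independent noise draw, which is what the identity covariance \boldsymbol{I} expresses.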

In addition, the noising process is a Markov chain, so the joint distribution factorizes as:

q(x_{1},x_{2},\cdots ,x_{T} | x_{0}) = q(x_{1} | x_{0})q(x_{2} | x_{1})\cdots q(x_{T}| x_{T-1}) = \prod_{t=1}^{T}q(x_{t}| x_{t-1})

Define \alpha _{t} = 1 - \beta _{t}, so that \alpha _{t} + \beta _{t} = 1. Substituting into the expression for x_{t} and unrolling the recursion gives a formula that goes from x_{0} to x_{t}:

x_{t} = \sqrt{1-\beta _{t}}x_{t-1} + \sqrt{\beta _{t}}z _{t} = \sqrt{\alpha _{t}}x_{t-1}+\sqrt{\beta _{t}}z _{t}

= \sqrt{\alpha _{t}}\textcolor{red}{(\sqrt{\alpha _{t-1}}x_{t-2} + \sqrt{\beta _{t-1}}z _{t-1})} + \sqrt{\beta _{t}}z _{t}

=\sqrt{\alpha _{t}\alpha _{t-1}}x_{t-2}+\sqrt{\alpha _{t}\beta _{t-1}}z _{t-1}+\sqrt{\beta _{t}}z _{t}

=\sqrt{\alpha _{t}\alpha _{t-1}}\textcolor{red}{(\sqrt{\alpha _{t-2}}x_{t-3} + \sqrt{\beta _{t-2}}z _{t-2})} + \sqrt{\alpha _{t}\beta _{t-1}}z _{t-1} + \sqrt{\beta _{t}}z _{t}

=\sqrt{\alpha _{t}\alpha _{t-1}\alpha _{t-2}}x_{t-3} + \sqrt{\alpha _{t}\alpha _{t-1}\beta _{t-2}}z _{t-2} + \sqrt{\alpha _{t}\beta _{t-1}}z _{t-1} + \sqrt{\beta _{t}}z _{t}

= \sqrt{\alpha _{t}\alpha _{t-1}\cdots \alpha _{1}}x_{0} + \sqrt{\alpha _{t}\alpha _{t-1}\cdots \alpha _{2}\beta _{1}}z _{1} + \sqrt{\alpha _{t}\alpha _{t-1}\cdots \alpha _{3}\beta _{2}}z _{2} + \cdots + \sqrt{\alpha _{t}\alpha _{t-1}\beta _{t-2}}z _{t-2} + \sqrt{\alpha _{t}\beta _{t-1}}z _{t-1} + \sqrt{\beta _{t}}z _{t}

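In the expression above, every term from the second to the last is an independent Gaussian noise with mean 0 and variance equal to the square of its coefficient. By the formula for sums of Gaussians, their sum is Gaussian with mean 0 and variance equal to the sum of the individual variances. Moreover, the squared coefficients of all terms (including the first) sum to 1. The proof follows; note that \alpha _{t} + \beta _{t} = 1 always holds.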

\alpha _{t}\alpha _{t-1}\cdots \alpha _{1} + \alpha _{t}\alpha _{t-1}\cdots \alpha _{2}\beta _{1} + \alpha _{t}\alpha _{t-1}\cdots \alpha _{3}\beta _{2} + \cdots + \alpha _{t}\beta _{t-1} + \beta _{t}

= \alpha _{t}\alpha _{t-1}\cdots \alpha _{2}(\alpha _{1} + \beta _{1}) + \alpha _{t}\alpha _{t-1}\cdots \alpha _{3}\beta _{2} + \cdots + \alpha _{t}\alpha _{t-1}\beta _{t-2} + \alpha _{t}\beta _{t-1} + \beta _{t}

= \alpha _{t}\alpha _{t-1}\cdots \alpha _{2}\times {\color{red}1} + \alpha _{t}\alpha _{t-1}\cdots \alpha _{3}\beta _{2} + \cdots + \alpha _{t}\alpha _{t-1}\beta _{t-2} + \alpha _{t}\beta _{t-1} + \beta _{t}

= \alpha _{t}\alpha _{t-1}\cdots \alpha _{3}(\alpha _{2}+\beta _{2}) + \cdots + \alpha _{t}\alpha _{t-1}\beta _{t-2} + \alpha _{t}\beta _{t-1} + \beta _{t}

= \alpha _{t}\alpha _{t-1}\cdots \alpha _{3}\times {\color{red}1} + \cdots + \alpha _{t}\alpha _{t-1}\beta _{t-2} + \alpha _{t}\beta _{t-1} + \beta _{t}

= \cdots \cdots = \alpha _{t} + \beta _{t} = 1
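
The telescoping argument above can also be confirmed numerically. The sketch below lists the squared coefficients of x_{0}, z_{1}, \cdots, z_{t} for an arbitrary, illustrative list of diffusion rates and checks that they sum to 1. The function name and the example rates are assumptions made for this illustration.

```python
import math


def squared_coefficients(betas):
    """Squared coefficients of x_0, z_1, ..., z_t in the unrolled expression for x_t.

    betas[0] holds beta_1, betas[1] holds beta_2, and so on.
    """
    alphas = [1.0 - b for b in betas]
    t = len(betas)
    coeffs = [math.prod(alphas)]            # x_0 term: alpha_t * ... * alpha_1
    for s in range(1, t + 1):
        # z_s term: alpha_t * ... * alpha_{s+1} * beta_s (empty product is 1 when s == t)
        coeffs.append(math.prod(alphas[s:]) * betas[s - 1])
    return coeffs


coeffs = squared_coefficients([0.1, 0.2, 0.3, 0.4])
print(coeffs)
print(sum(coeffs))   # equals 1 up to floating-point rounding
```

The first entry is \bar{\alpha }_{t}, and the remaining entries add up to 1-\bar{\alpha }_{t}, which is the total noise variance.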

Writing \bar{\alpha }_{t} for \alpha _{t}\alpha _{t-1}\cdots \alpha _{1}, the variances of the Gaussian noise terms sum to 1-\bar{\alpha }_{t}, and x_{t} can be expressed as:

x_{t} = \sqrt{\bar{\alpha }_{t}}x_{0} + \sqrt{1-\bar{\alpha }_{t}}\bar{z}_{t},\hspace{2em}\bar{z}_{t} \sim \mathcal N(0,\boldsymbol{I})

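This shows that x_{t} is a linear combination of the original image x_{0} and random noise \bar{z}_{t}. Given the initial value and the diffusion rate of every step, x_{t} at any time step can be obtained directly. Written as a probability distribution: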

q(x_{t}\mid x_{0}) = \mathcal{N}(x_{t}; \sqrt{\bar{\alpha }_{t}}x_{0}, (1-\bar{\alpha }_{t})\boldsymbol{I})
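
To illustrate that the closed form agrees with step-by-step noising, the sketch below samples x_{t} both ways for a scalar x_{0} and compares the empirical mean and variance. The schedule, the value of x_{0}, and the helper names are all assumptions for this example.

```python
import math
import random


def alpha_bar(betas, t):
    """Cumulative product alpha_1 * ... * alpha_t with alpha_s = 1 - beta_s (t is 1-indexed)."""
    return math.prod(1.0 - b for b in betas[:t])


def q_sample(x0, t, betas, rng):
    """Draw x_t directly from q(x_t | x_0) without simulating intermediate steps."""
    ab = alpha_bar(betas, t)
    return math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0)


def q_iterate(x0, t, betas, rng):
    """Draw x_t by applying the one-step transition q(x_s | x_{s-1}) t times."""
    x = x0
    for beta in betas[:t]:
        x = math.sqrt(1.0 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)
    return x


def mean_and_variance(values):
    mean = sum(values) / len(values)
    return mean, sum((v - mean) ** 2 for v in values) / len(values)


rng = random.Random(0)
betas = [0.05 * (s + 1) for s in range(10)]   # hypothetical schedule: 0.05, 0.10, ..., 0.50
x0, t, n = 2.0, 10, 50_000

direct_mean, direct_var = mean_and_variance([q_sample(x0, t, betas, rng) for _ in range(n)])
stepped_mean, stepped_var = mean_and_variance([q_iterate(x0, t, betas, rng) for _ in range(n)])
target_mean = math.sqrt(alpha_bar(betas, t)) * x0
target_var = 1.0 - alpha_bar(betas, t)
print(target_mean, direct_mean, stepped_mean)
print(target_var, direct_var, stepped_var)
```

In practice this is why training can pick an arbitrary time step t and produce x_{t} in one shot instead of running the whole chain.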

When the number of noising steps T is large enough, \bar{\alpha }_{T} tends to 0 and 1-\bar{\alpha }_{T} tends to 1, so x_{T} tends to a standard Gaussian distribution.
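
How quickly \bar{\alpha }_{t} decays depends on the schedule, which this article does not specify. As an assumption for illustration, the sketch below uses a linear schedule with T = 1000 steps and \beta rising from 10^{-4} to 0.02, the setting reported in the original DDPM paper, and prints \bar{\alpha }_{t} at a few time steps.

```python
def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced diffusion rates beta_1..beta_T (assumed schedule, see lead-in)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + step * i for i in range(T)]


def alpha_bars(betas):
    """Running products alpha_bar_1..alpha_bar_T with alpha_t = 1 - beta_t."""
    out, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        out.append(running)
    return out


betas = linear_betas(1000)
abar = alpha_bars(betas)

# alpha_bar starts close to 1 (almost no noise) and ends close to 0 (almost pure noise).
for t in (1, 250, 500, 750, 1000):
    print(t, abar[t - 1])
```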

Reverse process (reverse diffusion process, denoising process)

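The forward process gradually adds noise to the original image x_{0} until it becomes x_{T}; the reverse process gradually recovers x_{0} from x_{T}. The forward process is described by q(x_{t}\mid x_{t-1}), and the reverse process requires q(x_{t-1}\mid x_{t}). If this reversal can be achieved, a real sample can be reconstructed from random Gaussian noise \mathcal{N}(0,\boldsymbol{I}), that is, a real image can be obtained from a chaotic noise image, which is exactly image generation.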

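It has been shown in the literature that if q(x_{t}\mid x_{t-1}) is Gaussian and \beta _{t} is small enough, then q(x_{t-1}\mid x_{t}) is also Gaussian. Although the noise added at each forward step is known to be drawn from a specific Gaussian distribution, the sampled noise can take infinitely many values, so q(x_{t-1}\mid x_{t}) cannot be obtained directly. This is where deep learning comes in: a deep network with parameters \theta is trained to approximate it.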

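The reverse process is still a Markov chain. The network takes the current time step t and the current image x_{t} as input and builds the reverse conditional distribution, whose mean and variance are both parameterized and both take x_{t} and t as input: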

p_{\theta}(x_{t-1}\mid x_{t}) = \mathcal{N}\left ( x_{t-1}; \mu _{\theta}(x_{t}, t), \Sigma _{\theta}(x_{t}, t) \right )

p_{\theta}(x_{0:T}) = p_{\theta}(x_{T})p_{\theta}(x_{T-1} \mid x_{T})\cdots p_{\theta}(x_{0}\mid x_{1}) = p_{\theta}(x_{T}) \prod_{t=1}^{T}p_{\theta}(x_{t-1}\mid x_{t})

The true reverse process, also called the posterior of the diffusion process, can be written as:

q(x_{t-1}\mid x_{t}) = q(x_{t}\mid x_{t-1}) \frac{q(x_{t-1})}{q(x_{t})}

Here q(x_{t-1}) is unknown. If x_{0} is known, however, the posterior of the diffusion process can be written as:

\small q(x_{t-1}\mid x_{t}, x_{0}) = \frac{q(x_{t}\mid x_{t-1}, x_{0})\times q(x_{t-1}\mid x_{0})}{q(x_{t}\mid x_{0})} = \mathcal{N}\left ( x_{t-1}; {\color{blue}\tilde{\mu }(x_{t}, x_{0})}, {\color{red}\tilde{\beta }_{t}}\boldsymbol{I} \right )

From the derivation of the forward process, the following three relations hold:

\small x_{t-1} = \sqrt{\bar{\alpha }_{t-1}}x_{0} + \sqrt{1-\bar{\alpha }_{t-1}}\bar{z}_{t-1} \hspace{0.9em} \Rightarrow \hspace{0.9em} q(x_{t-1}\mid x_{0}) = \mathcal{N}\left ( x_{t-1}; \sqrt{\bar{\alpha }_{t-1}}x_{0}, \left ( 1-\bar{\alpha }_{t-1} \right )\boldsymbol{I} \right )

\small x_{t} = \sqrt{\bar{\alpha }_{t}}x_{0} + \sqrt{1-\bar{\alpha }_{t}}\bar{z}_{t} \hspace{1em} \Rightarrow \hspace{1em} q(x_{t}\mid x_{0}) = \mathcal{N}\left ( x_{t}; \sqrt{\bar{\alpha }_{t}}x_{0}, (1-\bar{\alpha }_{t})\boldsymbol{I} \right )

\small x_{t} = \sqrt{\alpha _{t}}x_{t-1} + \sqrt{\beta _{t}}z _{t} \hspace{1em} \Rightarrow \hspace{1em} q(x_{t}\mid x_{t-1}, x_{0}) = q(x_{t}\mid x_{t-1}) = \mathcal{N}\left ( x_{t}; \sqrt{\alpha _{t}}x_{t-1}, \beta _{t} \boldsymbol{I} \right )

Substituting the three expressions, using the Gaussian density from the prerequisites, expanding, and collecting like terms gives:

q(x_{t-1}\mid x_{t}, x_{0}) = \frac{\mathcal{N}(x_{t}; \sqrt{\alpha _{t}}x_{t-1}, \beta _{t} \boldsymbol{I}) \times \mathcal{N}(x_{t-1}; \sqrt{\bar{\alpha }_{t-1}}x_{0}, (1-\bar{\alpha }_{t-1})\boldsymbol{I})}{\mathcal{N}(x_{t}; \sqrt{\bar{\alpha }_{t}}x_{0}, (1-\bar{\alpha }_{t})\boldsymbol{I})}

\small \propto \exp \left ( -\frac{1}{2}\left ( \frac{(x_{t} - \sqrt{\alpha _{t}}x_{t-1})^{2}}{\beta _{t}} + \frac{(x_{t-1} - \sqrt{\bar{\alpha }_{t-1}}x_{0})^{2}}{1-\bar{\alpha }_{t-1}} - \frac{(x_{t} - \sqrt{\bar{\alpha }_{t}}x_{0})^{2}}{1 - \bar{\alpha }_{t}} \right ) \right )

\small = \exp \left ( -\frac{1}{2}\left ( \frac{x_{t}^{2} - 2 \sqrt{\alpha _{t}}x_{t}{\color{blue} x_{t-1}} + \alpha_{t}{\color{red} x_{t-1}^{2}}}{\beta _{t}} + \frac{{\color{red} x_{t-1}^{2}} - 2 \sqrt{\bar{\alpha }_{t-1}}x_{0}{\color{blue} x_{t-1}} + \bar{\alpha }_{t-1}x_{0}^{2}}{1-\bar{\alpha }_{t-1}} - \frac{(x_{t} - \sqrt{\bar{\alpha }_{t}}x_{0})^{2}}{1 - \bar{\alpha }_{t}} \right ) \right )

\small =\exp \left ( -\frac{1}{2}\left ( {\color{red} \left ( \frac{\alpha _{t}}{\beta _{t}} + \frac{1}{1 - \bar{\alpha }_{t-1}} \right )}x_{t-1}^{2} - {\color{blue} \left ( \frac{2\sqrt{\alpha _{t}}}{\beta _{t}} x_{t}+ \frac{2 \sqrt{\bar{\alpha }_{t-1}}}{1 - \bar{\alpha }_{t-1}}x_{0} \right )}x_{t-1} + \mathcal{C}\left ( x_{t}, x_{0} \right ) \right ) \right )

This matches the expanded form of the Gaussian density in the prerequisites, with \tilde{\beta }_{t} playing the role of the variance \sigma ^{2} and \tilde{\mu }(x_{t}, x_{0}) the role of the mean \mu, so the following two relations hold:

\frac{1}{\tilde{\beta }_{t}} = {\color{red} \frac{\alpha _{t}}{\beta _{t}} + \frac{1}{1 - \bar{\alpha} _{t-1}}} \hspace{1em} \text{and} \hspace{1em} \frac{2 \tilde{\mu }(x_{t}, x_{0}) }{\tilde{\beta }_{t}}={\color{blue} \frac{2\sqrt{\alpha _{t}}}{\beta _{t}}x_{t} + \frac{2\sqrt{\bar{\alpha }_{t-1}}}{1 - \bar{\alpha }_{t-1}}x_{0}}

From the first relation:

\frac{1}{\tilde{\beta }_{t}} = \frac{\alpha _{t}(1-\bar{\alpha }_{t-1}) + \textcolor[rgb]{0.55, 0, 0}{\beta _{t}}}{\beta _{t}(1 - \bar{\alpha }_{t-1})} = \frac{\alpha _{t} -\textcolor[rgb]{1, 0.55, 0}{ \alpha _{t}\bar{\alpha }_{t-1}} + \textcolor[rgb]{0.55, 0, 0}{ 1 - \alpha _{t}}}{\beta _{t}(1 - \bar{\alpha }_{t-1})} = \frac{1 - \textcolor[rgb]{1, 0.55, 0}{\bar{\alpha }_{t}}}{\beta _{t}(1 - \bar{\alpha }_{t-1})} \hspace{1em} \Rightarrow \hspace{1em} \tilde{\beta }_{t} = \frac{1-\bar{\alpha }_{t-1}}{1-\bar{\alpha }_{t}}\beta _{t}

From the second relation:

\small \tilde{\mu }(x_{t}, x_{0}) = \left ( \frac{\sqrt{\alpha _{t}}}{\beta _{t}}x_{t} + \frac{\sqrt{\bar{\alpha } _{t-1}}}{1-\bar{\alpha }_{t-1}}x_{0} \right )\times \textcolor[rgb]{0.55, 0, 0}{\tilde{\beta }_{t}} = \left ( \frac{\sqrt{\alpha _{t}}}{\beta _{t}}x_{t} + \frac{\sqrt{\bar{\alpha } _{t-1}}}{1-\bar{\alpha }_{t-1}}x_{0} \right )\times \textcolor[rgb]{0.55, 0, 0}{\frac{1-\bar{\alpha }_{t-1}}{1-\bar{\alpha }_{t}}\beta _{t}}

= \frac{\sqrt{\alpha _{t}}(1-\bar{\alpha }_{t-1})}{1 - \bar{\alpha }_{t}}x_{t} + \frac{\sqrt{\bar{\alpha }_{t-1}}}{1 - \bar{\alpha }_{t}}\beta _{t}\textcolor[rgb]{1, 0.55, 0}{x_{0}} = \frac{\sqrt{\alpha _{t}}(1-\bar{\alpha }_{t-1})}{1 - \bar{\alpha }_{t}}x_{t} + \frac{\sqrt{\bar{\alpha }_{t-1}}}{1 - \bar{\alpha }_{t}}\beta _{t}\textcolor[rgb]{1, 0.55, 0}{\frac{x_{t} - \sqrt{1 - \bar{\alpha }_{t}}\bar{z}_{t} }{\sqrt{\bar{\alpha }_{t}}}}

= \left ( \frac{\sqrt{\alpha _{t}}(1 - \bar{\alpha }_{t-1})}{1 - \bar{\alpha }_{t}} + \frac{{\color{magenta} \beta _{t}}\textcolor[rgb]{0, 0.39, 0}{\sqrt{\bar{\alpha }_{t-1}}}}{\textcolor[rgb]{0, 0.39, 0}{\sqrt{\bar{\alpha } _{t}}}(1 - \bar{\alpha }_{t})} \right )x_{t} - \frac{\textcolor[rgb]{0.63, 0.13, 0.94}{\sqrt{\bar{\alpha }_{t-1}}}\sqrt{1 - \bar{\alpha }_{t}}\beta _{t}\bar{z}_{t} }{\textcolor[rgb]{0.63, 0.13, 0.94}{\sqrt{\bar{\alpha } _{t}}}(1 - \bar{\alpha }_{t})}

= \left ( \frac{\sqrt{\alpha _{t}}(1-\bar{\alpha }_{t-1})}{1 - \bar{\alpha }_{t}} + \frac{{\color{magenta} 1 - \alpha _{t}}}{\textcolor[rgb]{0, 0.39, 0}{\sqrt{\alpha _{t}}}(1 - \bar{\alpha }_{t})} \right )x_{t} - \frac{\beta _{t}\bar{z}_{t} }{\textcolor[rgb]{0.63, 0.13, 0.94}{\sqrt{\alpha _{t}}}\sqrt{1 - \bar{\alpha }_{t}}}

= \frac{\alpha _{t}(1 - \bar{\alpha }_{t-1}) + 1 - \alpha _{t}}{\sqrt{\alpha _{t}}(1 - \bar{\alpha }_{t})}x_{t} - \frac{\beta _{t}\bar{z}_{t} }{\sqrt{\alpha _{t}}\sqrt{1-\bar{\alpha }_{t}}} = \frac{1 - \textcolor[rgb]{0.1, 0.1, 0.44}{\alpha _{t}\bar{\alpha }_{t-1}}}{\sqrt{\alpha _{t}}(1 - \bar{\alpha }_{t})}x_{t} - \frac{\beta _{t}\bar{z}_{t} }{\sqrt{\alpha _{t}}\sqrt{1-\bar{\alpha }_{t}}}

= \frac{1 - \textcolor{blue}{\bar{\alpha }_{t}}}{\sqrt{\alpha _{t}}(1 - \bar{\alpha }_{t})}x_{t} - \frac{\beta _{t}\bar{z}_{t} }{\sqrt{\alpha _{t}}\sqrt{1-\bar{\alpha }_{t}}} = \frac{1}{\sqrt{\alpha _{t}}}\left ( x_{t} - \frac{\beta _{t}}{\sqrt{1-\bar{\alpha }_{t}}}\bar{z}_{t} \right )

Therefore, given x_{0}, the mean of the true reverse distribution depends only on x_{t} and \bar{z}_{t}:

q(x_{t-1}\mid x_{t},x_{0}) = \mathcal{N}\left ( x_{t-1}; {\color{blue} \tilde{\mu }(x_{t},x_{0})}, {\color{red} \tilde{\beta }_{t}}\boldsymbol{I} \right )=\mathcal{N}\left ( x_{t-1}; {\color{blue} \frac{1}{\sqrt{\alpha _{t}}}\left ( x_{t} - \frac{\beta _{t}}{\sqrt{1-\bar{\alpha }_{t}}}\bar{z}_{t} \right )}, {\color{red} \frac{1-\bar{\alpha }_{t-1}}{1-\bar{\alpha }_{t}}\beta _{t}} \boldsymbol{I} \right )
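
The two expressions for the posterior mean, one in terms of (x_{t}, x_{0}) and one in terms of (x_{t}, \bar{z}_{t}), can be checked against each other numerically. The sketch below does this for scalars; the schedule, x_{0}, and the noise value are illustrative assumptions, and the function names are made up for this example.

```python
import math


def posterior_params(x_t, x0, t, betas):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for scalar inputs (t is 1-indexed)."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    abar_t = math.prod(1.0 - b for b in betas[:t])
    abar_prev = math.prod(1.0 - b for b in betas[:t - 1])   # empty product, i.e. 1, when t == 1
    var = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t
    mean = (math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t) * x_t
            + math.sqrt(abar_prev) * beta_t / (1.0 - abar_t) * x0)
    return mean, var


def posterior_mean_from_noise(x_t, noise, t, betas):
    """Same mean, written with the noise z_bar_t that produced x_t from x_0."""
    beta_t = betas[t - 1]
    abar_t = math.prod(1.0 - b for b in betas[:t])
    return (x_t - beta_t / math.sqrt(1.0 - abar_t) * noise) / math.sqrt(1.0 - beta_t)


betas = [0.1, 0.2, 0.3, 0.4]       # hypothetical schedule
x0, noise, t = 1.5, -0.7, 3        # illustrative values
abar_t = math.prod(1.0 - b for b in betas[:t])
x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * noise   # forward closed form

mean_from_x0, var = posterior_params(x_t, x0, t, betas)
mean_from_noise = posterior_mean_from_noise(x_t, noise, t, betas)
print(mean_from_x0, mean_from_noise, var)
```

The second form is the one that matters for training, because it lets a network predict the noise instead of x_{0}.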

Optimization objective

The goal is to obtain an x_{0} that is as realistic as possible, that is, to find the model parameters \theta that maximize the probability of x_{0}. This is a maximum likelihood estimation problem. The likelihood function is:

p\left ( x_{0}\mid \theta \right ) = p_{\theta }(x_{0}) = \int_{x_{1}}\int _{x_{2}}\cdots \int _{x_{T}}p_{\theta}(x_{0}, x_{1}, x_{2},\cdots ,x_{T}) dx_{1}dx_{2}\cdots dx_{T}

= \int_{x_{1}}\int _{x_{2}}\cdots \int _{x_{T}}\textcolor[rgb]{0, 0.55, 0}{q(x_{1:T}\mid x_{0}) }\frac{p_{\theta}(x_{0}, x_{1}, x_{2},\cdots ,x_{T})}{\textcolor[rgb]{0, 0.55, 0}{q(x_{1:T}\mid x_{0})}} dx_{1}dx_{2}\cdots dx_{T}

= \mathbb{E}_{ q(x_{1:T}\mid x_{0})}\left [\frac{\textcolor[rgb]{0.55, 0, 0}{p_{\theta}(x_{0:T})}}{ q(x_{1:T}\mid x_{0})} \right]

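By Jensen's inequality, for any convex function f the expectation of the function value is at least the function value of the expectation: \mathbb{E}[f(X)] \geq f(\mathbb{E}[X]). The logarithm is concave, so the inequality is reversed: \log \mathbb{E}[X] \geq \mathbb{E}[\log X]. Taking the logarithm of both sides of the expression above gives the log-likelihood, which satisfies: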

\log p_{\theta }(x_{0}) = \log \mathbb{E}_{ q(x_{1:T}\mid x_{0})}\left [ \frac{ p_{\theta}(x_{0:T})}{ q(x_{1:T}\mid x_{0})} \right] \geq \mathbb{E}_{ q(x_{1:T}\mid x_{0})}\left [\log \frac{ p_{\theta}(x_{0:T})}{ q(x_{1:T}\mid x_{0})}\right]
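
The direction of this inequality is easy to get wrong, so here is a small numerical illustration. The positive random variable below is a made-up stand-in for the ratio p_{\theta}(x_{0:T}) / q(x_{1:T}\mid x_{0}); its log-normal distribution is an arbitrary choice for this sketch.

```python
import math
import random

rng = random.Random(0)

# Hypothetical positive samples standing in for the ratio inside the expectation.
ratios = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(10_000)]

log_of_mean = math.log(sum(ratios) / len(ratios))             # log E[X]
mean_of_log = sum(math.log(r) for r in ratios) / len(ratios)  # E[log X]

# Because log is concave, log E[X] >= E[log X] holds for any set of positive samples.
print(log_of_mean, mean_of_log, log_of_mean >= mean_of_log)
```

The gap between the two quantities is what makes the right-hand side a bound rather than an equality.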

Negating both sides gives the negative log-likelihood, which satisfies:

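- \log p_{\theta }(x_{0}) \leq \mathbb{E}_{ q(x_{1:T}\mid x_{0})}\left [\log \frac{ q(x_{1:T}\mid x_{0})}{ p_{\theta}(x_{0:T})}\right]

The right-hand side is called the variational upper bound. Maximizing the log-likelihood can be converted into minimizing this bound. Expanding it with the Markov-chain factorization from the prerequisites gives: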

L_{VLB} = \mathbb{E}_{ q(x_{1:T}\mid x_{0})}\left [\log \frac{ \textcolor[rgb]{0, 0, 0.55}{q(x_{1:T}\mid x_{0})}}{ \textcolor[rgb]{0.55, 0, 0}{p_{\theta}(x_{0:T})}}\right] = \mathbb{E}_{ q(x_{1:T}\mid x_{0})}\left [\log \frac{ \textcolor[rgb]{0, 0, 0.55}{q(x_{1}\mid x_{0})q(x_{2}\mid x_{1})\cdots q(x_{T}\mid x_{T-1})}}{ \textcolor[rgb]{0.55, 0, 0}{p_{\theta}(x_{T})p_{\theta}(x_{T-1}\mid x_{T})\cdots p_{\theta}(x_{0}\mid x_{1})}}\right]

= \mathbb{E}_{ q(x_{1:T}\mid x_{0})}\left [\log \frac{\textcolor[rgb]{0, 0, 0.55}{\prod_{t=1}^{T} q(x_{t}\mid x_{t-1})}}{\textcolor[rgb]{0.55, 0, 0}{p_{\theta}(x_{T}) \prod_{t=1}^{T} p_{\theta}(x_{t-1}\mid x_{t})}}\right] = \mathbb{E}_{ q(x_{1:T}\mid x_{0})}\left [- \log p_{\theta}(x_{T}) + \sum_{t=1}^{T}\log \frac{q(x_{t}\mid x_{t-1})}{p_{\theta}(x_{t-1}\mid x_{t})}\right]

= \mathbb{E}_{ q(x_{1:T}\mid x_{0})}\left [- \log p_{\theta}(x_{T}) + \sum_{t=2}^{T}\log \frac{\textcolor[rgb]{0.63, 0.13, 0.94}{q(x_{t}\mid x_{t-1})}}{p_{\theta}(x_{t-1}\mid x_{t})} + \log \frac{q(x_{1}\mid x_{0})}{p_{\theta }(x_{0} \mid x_{1})}\right]

= \mathbb{E}_{ q(x_{1:T}\mid x_{0})}\left [- \log p_{\theta}(x_{T}) + \sum_{t=2}^{T}\log \frac{\textcolor[rgb]{0.63, 0.13, 0.94}{q(x_{t}\mid x_{t-1}, x_{0})}}{p_{\theta}(x_{t-1}\mid x_{t})} + \log \frac{q(x_{1}\mid x_{0})}{p_{\theta }(x_{0} \mid x_{1})}\right]

= \mathbb{E}_{ q(x_{1:T}\mid x_{0})}\left [- \log p_{\theta}(x_{T}) + \sum_{t=2}^{T}\log \frac{\textcolor[rgb]{0.63, 0.13, 0.94}{q(x_{t}, x_{t-1}, x_{0})}}{p_{\theta}(x_{t-1}\mid x_{t})\textcolor[rgb]{0.63, 0.13, 0.94}{q(x_{t-1},x_{0})}} + \log \frac{q(x_{1}\mid x_{0})}{p_{\theta }(x_{0} \mid x_{1})}\right]

= \mathbb{E}_{ q(x_{1:T}\mid x_{0})}\left [- \log p_{\theta}(x_{T}) + \sum_{t=2}^{T}\log \frac{\textcolor[rgb]{0.63, 0.13, 0.94}{q(x_{t-1}\mid x_{t},x_{0})q(x_{t}\mid x_{0})q(x_{0})}}{p_{\theta}(x_{t-1}\mid x_{t})\textcolor[rgb]{0.63, 0.13, 0.94}{q(x_{t-1}\mid x_{0})q(x_{0})}} + \log \frac{q(x_{1}\mid x_{0})}{p_{\theta }(x_{0} \mid x_{1})}\right]

= \mathbb{E}_{ q(x_{1:T}\mid x_{0})}\left [- \log p_{\theta}(x_{T}) + \sum_{t=2}^{T}\log \left ( \frac{q(x_{t-1}\mid x_{t},x_{0})}{p_{\theta}(x_{t-1}\mid x_{t})}\cdot \frac{q(x_{t}\mid x_{0})}{q(x_{t-1}\mid x_{0})} \right ) + \log \frac{q(x_{1}\mid x_{0})}{p_{\theta }(x_{0} \mid x_{1})}\right]

= \mathbb{E}_{ q(x_{1:T}\mid x_{0})}\left [- \log p_{\theta}(x_{T}) + \sum_{t=2}^{T}\log \frac{q(x_{t-1}\mid x_{t}, x_{0})}{p_{\theta}(x_{t-1}\mid x_{t})} + \color{red} {\sum_{t=2}^{T}\log \frac{q(x_{t}\mid x_{0})}{q(x_{t-1}\mid x_{0})}} + \log \frac{q(x_{1}\mid x_{0})}{p_{\theta }(x_{0} \mid x_{1})}\right]

= \mathbb{E}_{ q(x_{1:T}\mid x_{0})}\left [- \log p_{\theta}(x_{T}) + \sum_{t=2}^{T}\log \frac{q(x_{t-1}\mid x_{t}, x_{0})}{p_{\theta}(x_{t-1}\mid x_{t})} + {\color{red} \log \frac{q(x_{T}\mid x_{0})}{q(x_{1}\mid x_{0})}} + \log \frac{q(x_{1}\mid x_{0})}{p_{\theta }(x_{0} \mid x_{1})}\right]

= \mathbb{E}_{ q(x_{1:T}\mid x_{0})}\left [\log \frac{q(x_{T}\mid x_{0})}{p_{\theta}(x_{T})} + \sum_{t=2}^{T}\log \frac{q(x_{t-1}\mid x_{t}, x_{0})}{p_{\theta}(x_{t-1}\mid x_{t})} - \log p_{\theta}(x_{0}\mid x_{1})\right]

Since the expectation of a sum equals the sum of the expectations:

= \mathbb{E}_{ q(x_{1:T}\mid x_{0})}\left [\log \frac{q(x_{T}\mid x_{0})}{p_{\theta}(x_{T})} \right] + \sum_{t=2}^{T}\mathbb{E}_{ q(x_{1:T}\mid x_{0})}\left [\log \frac{q(x_{t-1}\mid x_{t}, x_{0})}{p_{\theta}(x_{t-1}\mid x_{t})} \right] - \mathbb{E}_{ q(x_{1:T}\mid x_{0})} \left [ \log p_{\theta}(x_{0}\mid x_{1}) \right ]

Each expectation can drop the time steps that its integrand does not depend on, which gives:

= \mathbb{E}_{ q(x_{T}\mid x_{0})}\left [\log \frac{q(x_{T}\mid x_{0})}{p_{\theta}(x_{T})} \right] + \sum_{t=2}^{T}\mathbb{E}_{ q(x_{t},x_{t-1}\mid x_{0})}\left [\log \frac{q(x_{t-1}\mid x_{t}, x_{0})}{p_{\theta}(x_{t-1}\mid x_{t})} \right] - \mathbb{E}_{ q(x_{1}\mid x_{0})} \left [ \log p_{\theta}(x_{0}\mid x_{1}) \right ]

The middle sum is rewritten using Bayes' rule from the prerequisites:

\small = \mathbb{E}_{ q(x_{T}\mid x_{0})}\left [\log \frac{q(x_{T}\mid x_{0})}{p_{\theta}(x_{T})} \right] + \sum_{t=2}^{T}\mathbb{E}_{ q(x_{t}\mid x_{0}) q(x_{t-1}\mid x_{t},x_{0})}\left [\log \frac{q(x_{t-1}\mid x_{t}, x_{0})}{p_{\theta}(x_{t-1}\mid x_{t})} \right] - \mathbb{E}_{ q(x_{1}\mid x_{0})} \left [ \log p_{\theta}(x_{0}\mid x_{1}) \right ]

= \textcolor[rgb]{1, 0.55, 0} {D_{KL}\left ( q(x_{T}| x_{0})\parallel p_{\theta}(x_{T}) \right )} \textcolor[rgb]{0, 0.39, 0}{ + \sum_{t=2}^{T} \mathbb{E}_{ q(x_{t}\mid x_{0})} \left [ D_{KL}(q(x_{t-1}| x_{t},x_{0})\parallel p_{\theta}(x_{t-1}| x_{t})) \right ]} \textcolor[rgb]{1, 0.84, 0}{ - \mathbb{E}_{ q(x_{1}\mid x_{0})} \left [ \log p_{\theta}(x_{0}| x_{1}) \right ]}

= \textcolor[rgb]{1, 0.55, 0} {L_{T}} \textcolor[rgb]{0, 0.4, 0}{ + L_{T-1} + \cdots} \textcolor[rgb]{1, 0.84, 0}{ + L_{0}}

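For L_{T}, the distribution q(x_{T}\mid x_{0}) is fixed by the forward process and p_{\theta} (x_{T}) is an isotropic Gaussian. Neither contains learnable parameters, and their KL divergence is approximately 0, so L_{T} can be ignored when minimizing the variational upper bound. For the middle sum, when t takes the value 1 the numerator becomes q(x_{0} \mid x_{1}, x_{0}), the distribution of x_{0} given that x_{0} is already known, which is deterministic. On comparison, L_{0} can therefore be merged into L_{t-1}, which simplifies the variational upper bound somewhat. At this point the optimization objective has become minimizing the KL divergence between the posterior distribution and the Gaussian distribution parameterized by the network.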

To simplify the computation, DDPM simplifies p_{\theta}(x_{t-1}\mid x_{t}) further by using a fixed variance: \Sigma _{\theta}(x_{t}, t) = \sigma_{t} ^{2}\boldsymbol{I}, where \sigma_{t} ^{2} is a constant that needs no training. The paper notes that setting it to \beta _{t} or to \tilde{\beta }_{t} gives similar results. Assuming \sigma_{t} ^{2} = \tilde{\beta }_{t} here, the two distributions in the KL divergence can be written as:

\small q(x_{t-1}\mid x_{t},x_{0}) = \mathcal{N}\left ( x_{t-1}; \tilde{\mu }(x_{t},x_{0}), {\color{red} \tilde{\beta }_{t}}\boldsymbol{I} \right )= \mathcal{N}\left ( x_{t-1}; \tilde{\mu }(x_{t},x_{0}), {\color{red} \sigma_{t}^{2}}\boldsymbol{I} \right )

\small p_{\theta}(x_{t-1}\mid x_{t}) = \mathcal{N}\left ( x_{t-1}; \mu _{\theta}(x_{t}, t), \color{red}{\Sigma _{\theta}(x_{t}, t)} \right ) = \mathcal{N}(x_{t-1}; \mu _{\theta}(x_{t}, t), \color{red} {\sigma_{t}^{2}}\boldsymbol{I})

Using the KL divergence formula for Gaussians from the prerequisites:

D_{KL}\left ( q(x_{t-1}\mid x_{t}, x_{0})\parallel p_{\theta}(x_{t-1}\mid x_{t}) \right ) = D_{KL}\left ( \mathcal{N}(x_{t-1}; \tilde{\mu }(x_{t},x_{0}), \sigma_{t}^{2}\boldsymbol{I}) \parallel \mathcal{N}(x_{t-1}; \mu _{\theta}(x_{t}, t), \sigma_{t}^{2}\boldsymbol{I}) \right )

= \log 1 + \frac{\sigma _{t}^{2} + \left \| \tilde{\mu }_{t}(x_{t},x_{0}) - \mu _{\theta}(x_{t}, t) \right \| ^{2}}{2 \sigma_{t} ^{2}} - \frac{1}{2} = \frac{1}{2\sigma_{t} ^{2}}\left \| \tilde{\mu }_{t}(x_{t},x_{0}) - \mu _{\theta}(x_{t}, t) \right \| ^{2}

Substituting this back into the variational upper bound, the objective L_{t-1} can be written as:

L_{t-1} = \mathbb{E}_{ q(x_{t}\mid x_{0})}\left [ \frac{1}{2\sigma_{t} ^{2}}\left \| \tilde{\mu }_{t}(x_{t},x_{0}) - \mu _{\theta}(x_{t}, t) \right \| ^{2} \right ]

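In other words, we want the mean \mu _{\theta}(x_{t},t) of the network-parameterized Gaussian to match the posterior mean \tilde{\mu }_{t}(x_{t},x_{0}). The expression can be developed further: in the reverse process, \tilde{\mu }_{t}(x_{t},x_{0}) can be written in terms of x_{t} and \bar{z}_{t}, and in the forward process x_{t} can in turn be written in terms of x_{0} and \bar{z}_{t}. Substituting both into the expression above gives: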

L_{t-1} = \mathbb{E}_{ x_{0}, \bar{z}_{t} \sim \mathcal{N}(0,\boldsymbol{I})}\left [ \frac{1}{2\sigma_{t} ^{2}}\left \| \frac{1}{\sqrt{\alpha _{t}}}\left ( x_{t}\left ( x_{0}, \bar{z}_{t}\right ) - \frac{\beta _{t}}{\sqrt{1-\bar{\alpha }_{t}}}\bar{z}_{t} \right ) - \mu _{\theta}\left ( x_{t}\left ( x_{0}, \bar{z}_{t}\right ), t \right ) \right \| ^{2} \right ]

Here the mean \mu _{\theta}(x_{t},t) of the parameterized Gaussian can be rewritten in the same form as the true mean \tilde{\mu }_{t}(x_{t},x_{0}):

\small \mu _{\theta}(x_{t}\left ( x_{0}, \bar{z}_{t}\right ), t) = \frac{1}{\sqrt{\alpha _{t}}}\left ( x_{t}\left ( x_{0}, \bar{z}_{t}\right ) - \frac{\beta _{t}}{\sqrt{1-\bar{\alpha }_{t}}}z_{\theta}(x_{t}\left ( x_{0}, \bar{z}_{t}\right ), t) \right )

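Here z_{\theta}(x_{t}\left ( x_{0}, \bar{z}_{t}\right ), t) is the term fitted by the neural network, so the objective changes from fitting the mean to fitting the noise. Substituting it into the expression for L_{t-1} gives: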

L_{t-1} = \mathbb{E}_{ x_{0}, \bar{z}_{t} \sim \mathcal{N}(0,\boldsymbol{I})} \left [ \frac{1}{2\sigma_{t} ^{2}}\left \| \frac{1}{\sqrt{\alpha _{t}}}\left ( x_{t}\left ( x_{0}, \bar{z}_{t}\right ) - \frac{\beta _{t}}{\sqrt{1-\bar{\alpha }_{t}}}\bar{z}_{t} \right ) - \frac{1}{\sqrt{\alpha _{t}}}\left ( x_{t} - \frac{\beta _{t}}{\sqrt{1-\bar{\alpha }_{t}}}z_{\theta}(x_{t}\left ( x_{0}, \bar{z}_{t}\right ), t) \right ) \right \| ^{2} \right ]

= \mathbb{E}_{ x_{0}, \bar{z}_{t} \sim \mathcal{N}(0,\boldsymbol{I})} \left [ \frac{1}{2\sigma_{t} ^{2}}\left \| \frac{x_{t}\left ( x_{0}, \bar{z}_{t}\right )}{\sqrt{\alpha _{t}}}-\frac{\beta _{t}}{\sqrt{\alpha _{t}} \sqrt{1-\bar{\alpha }_{t}}}\bar{z}_{t} - \frac{x_{t}\left ( x_{0}, \bar{z}_{t}\right )}{\sqrt{\alpha _{t}}} + \frac{\beta _{t}}{\sqrt{\alpha _{t}}\sqrt{1-\bar{\alpha }_{t}}}z_{\theta}(x_{t}\left ( x_{0}, \bar{z}_{t}\right ), t) \right \| ^{2} \right ]

= \mathbb{E}_{ x_{0}, \bar{z}_{t} \sim \mathcal{N}(0,\boldsymbol{I})} \left [ \frac{1}{2\sigma_{t} ^{2}}\left \| \frac{\beta _{t}}{\sqrt{\alpha _{t}} \sqrt{1-\bar{\alpha }_{t}}}\left (\bar{z}_{t} - z_{\theta}\left ( \color{magenta}{x_{t}(x_{0},\bar{z}_{t})}, t \right ) \right ) \right \| ^{2} \right ]

= \mathbb{E}_{ x_{0}, \bar{z}_{t} \sim \mathcal{N}(0,\boldsymbol{I})} \left [ \frac{\beta_{t}^{2}}{2\sigma_{t} ^{2} \alpha _{t}(1 - \bar{\alpha }_{t})}\left \| \bar{z}_{t} - z_{\theta}(\color{magenta}{\sqrt{\bar{\alpha }_{t}}x_{0} + \sqrt{1-\bar{\alpha }_{t}}\bar{z}_{t}}, t) \right \| ^{2} \right ]

The coefficient can be dropped for a further simplification:

L_{t-1}^{simple}= \mathbb{E}_{ x_{0}, \bar{z}_{t} \sim \mathcal{N}(0,\boldsymbol{I})} \left [ \left \| \bar{z}_{t} - z_{\theta}(\sqrt{\bar{\alpha }_{t}}x_{0} + \sqrt{1-\bar{\alpha }_{t}}\bar{z}_{t}, t) \right \| ^{2} \right ]

Although the derivation behind it is fairly involved, the final objective is very simple: make the noise predicted by the network match the true noise.
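
To make the objective concrete, the sketch below estimates the simplified loss by Monte Carlo and then runs the reverse process with \sigma_{t}^{2} = \tilde{\beta }_{t}. Several things here are assumptions rather than part of the article: the data is a single scalar value, so the ideal noise predictor has a closed form and stands in for a trained network z_{\theta}; the schedule is linear from 10^{-4} to 0.02 over T = 1000 steps; and t is drawn uniformly from 1..T.

```python
import math
import random


def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Assumed linear schedule; both lists are 1-indexed (index 0 is a placeholder)."""
    betas = [0.0] + [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]
    abars = [1.0]                               # alpha_bar_0 = 1 by convention
    for b in betas[1:]:
        abars.append(abars[-1] * (1.0 - b))
    return betas, abars


def simple_loss(predict, x0, betas, abars, rng, n=2000):
    """Monte Carlo estimate of the simplified loss, with t uniform on 1..T."""
    T = len(betas) - 1
    total = 0.0
    for _ in range(n):
        t = rng.randint(1, T)
        z = rng.gauss(0.0, 1.0)
        x_t = math.sqrt(abars[t]) * x0 + math.sqrt(1.0 - abars[t]) * z
        total += (z - predict(x_t, t)) ** 2
    return total / n


def sample(predict, betas, abars, rng):
    """Reverse process: start from noise, then draw x_{t-1} from p_theta(. | x_t) for t = T..1."""
    T = len(betas) - 1
    x = rng.gauss(0.0, 1.0)
    for t in range(T, 0, -1):
        mean = (x - betas[t] / math.sqrt(1.0 - abars[t]) * predict(x, t)) / math.sqrt(1.0 - betas[t])
        var = (1.0 - abars[t - 1]) / (1.0 - abars[t]) * betas[t]   # sigma_t^2 = beta_tilde_t
        x = mean + math.sqrt(var) * rng.gauss(0.0, 1.0)
    return x


rng = random.Random(0)
betas, abars = make_schedule(T=1000)
x0_true = 0.8   # the only "training image" in this toy setting


def oracle(x_t, t):
    """Stand-in for z_theta: the exact noise when every data point equals x0_true."""
    return (x_t - math.sqrt(abars[t]) * x0_true) / math.sqrt(1.0 - abars[t])


def always_zero(x_t, t):
    """A deliberately bad predictor, for comparison."""
    return 0.0


loss_oracle = simple_loss(oracle, x0_true, betas, abars, rng)
loss_zero = simple_loss(always_zero, x0_true, betas, abars, rng)
generated = sample(oracle, betas, abars, rng)
print(loss_oracle, loss_zero, generated)
```

With the ideal predictor the loss is essentially zero and the sampler returns the data value, while the zero predictor has a loss near 1, the variance of the noise it ignores. A real implementation would replace the oracle with a neural network trained by gradient descent on this loss.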

