
Diffusion Model Study Notes (1)

Diffusion Models (Formula Derivation)

This article draws heavily on "Diffusion Model 扩散模型-[通俗易懂+代码讲解](一) - 知乎 (zhihu.com)" as well as https://blog.csdn.net/m0_63642362/article/details/127586200 and https://segmentfault.com/a/1190000043744225.
All three articles are excellent.
This post is essentially the author's own study notes. Some steps have been expanded, but since I am just starting to learn this material, there are still points I do not fully understand.

Model Training

1. Forward Diffusion Process → add noise to the image;

2. Reverse Diffusion Process → remove the noise from the image.

Forward diffusion: keep adding Gaussian noise to the data.

Reverse diffusion: gradually restore the noisy data back to the original image.

Some preliminaries:

Given two independent normally distributed random variables $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$, their sum $Z = X + Y$ is also normally distributed: $Z \sim N(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)$. That is, the sum of two independent Gaussian random variables is Gaussian, with mean equal to the sum of the means and variance equal to the sum of the variances.
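As a quick sanity check of this fact, here is a minimal NumPy sketch (the means and variances are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

# X ~ N(1, 2^2), Y ~ N(-3, 1.5^2), independent
x = rng.normal(1.0, 2.0, size=1_000_000)
y = rng.normal(-3.0, 1.5, size=1_000_000)
z = x + y

print(z.mean())  # ~ -2.0   = 1 + (-3)
print(z.var())   # ~  6.25  = 2^2 + 1.5^2
```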

Forward Diffusion Process: Derivation

Define the forward process: given a sample drawn from the real data distribution, $x_0 \sim q(x)$ (the original data), add Gaussian noise to it step by step over $T$ steps (note that $T$ is a tunable parameter during training), yielding a sequence of noised samples $x_1, x_2, \dots, x_T$. The amount of noise added at each step is controlled by a variance schedule $\{\beta_t \in (0, 1)\}_{t=1}^T$.

(Goal: make the data distribution converge to a single simple distribution, usually the standard normal. The process can be represented as a Markov chain in which each state $x_t$ depends only on the previous state $x_{t-1}$.)

$$q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I)$$

Here $q$ denotes the distribution of $x_t$ given $x_{t-1}$; it is a multivariate Gaussian.

Its arguments are (the current state; the mean; the covariance matrix).
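A minimal sketch of one forward step as a NumPy function (the value of `beta_t` below is an arbitrary illustration, not a prescribed schedule):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_step(x_prev, beta_t):
    """Sample x_t ~ N(x_t; sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

x0 = rng.standard_normal((3, 32, 32))  # a toy "image"
x1 = forward_step(x0, beta_t=0.01)
```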

We want to express the joint distribution of all states $x_1, x_2, \dots, x_T$, starting from the initial state $x_0$, i.e. $q(x_{1:T} \mid x_0)$.

Recall the law of total probability:

$$P(A) = \sum_{i} P(A \mid B_i) P(B_i)$$

We also have

$$P(A, B) = P(A \mid B) P(B)$$

(the joint probability of $A$ and $B$, i.e. the probability that both occur).

For multiple events $X_1, X_2, \dots, X_n$, the joint distribution factorizes into a product of conditionals:

$$P(X_1, X_2, \dots, X_n) = P(X_1) P(X_2 \mid X_1) P(X_3 \mid X_1, X_2) \cdots P(X_n \mid X_1, X_2, \dots, X_{n-1})$$

By this chain rule, for the sequence of random variables $X_0, X_1, \dots, X_T$ we have:

$$q(x_0, x_1, \dots, x_T) = q(x_0)\, q(x_1 \mid x_0)\, q(x_2 \mid x_0, x_1) \cdots q(x_T \mid x_0, x_1, \dots, x_{T-1})$$

In a diffusion model, we assume each state $x_t$ depends only on the immediately preceding state $x_{t-1}$ (the Markov property):

$$q(x_t \mid x_0, x_1, \dots, x_{t-1}) = q(x_t \mid x_{t-1})$$

The joint distribution therefore simplifies to:

$$q(x_0, x_1, \dots, x_T) = q(x_0)\, q(x_1 \mid x_0)\, q(x_2 \mid x_1) \cdots q(x_T \mid x_{T-1})$$

Given the initial state $x_0$, the conditional joint distribution of all intermediate states is:

$$q(x_{1:T} \mid x_0) = q(x_1 \mid x_0)\, q(x_2 \mid x_1) \cdots q(x_T \mid x_{T-1}) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$
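As a toy numeric illustration of this product form (a hypothetical 2-state Markov chain, unrelated to images), the conditional probability of a whole path is just the product of one-step transition probabilities:

```python
import numpy as np

# P[i, j] = q(x_t = j | x_{t-1} = i) for a toy 2-state Markov chain
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

path = [0, 0, 1, 1]  # x_0 = 0, x_1 = 0, x_2 = 1, x_3 = 1
prob = np.prod([P[a, b] for a, b in zip(path[:-1], path[1:])])
print(prob)  # q(x_1, x_2, x_3 | x_0 = 0) = 0.9 * 0.1 * 0.8 = 0.072
```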

As the forward diffusion iterates over $t$ steps, the original data $x_0$ gradually loses its distinctive features; in the limit $T \rightarrow \infty$, $x_T$ is equivalent to isotropic Gaussian noise, as shown in the figure (moving right is forward diffusion; moving left is reverse diffusion).

Gaussian noise (also commonly called white noise): noise whose probability density function follows a Gaussian (normal) distribution.

We can sample $x_t$ at an arbitrary time step $t$. Define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$, and let $\epsilon_{t-1}, \epsilon_{t-2}, \dots \sim \mathcal{N}(0, I)$ be Gaussian noise samples.
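A minimal sketch of these quantities under a linear $\beta$ schedule (the endpoints 1e-4 and 0.02 and $T = 1000$ are common DDPM-style choices, assumed here for illustration):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # beta_1 ... beta_T
alphas = 1.0 - betas                  # alpha_t = 1 - beta_t
alphas_bar = np.cumprod(alphas)       # alpha_bar_t = alpha_1 * ... * alpha_t

print(alphas_bar[0], alphas_bar[-1])  # near 1 at t = 1, near 0 at t = T
```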

Now unroll the recursion:

$$x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon_{t-1}$$

(starting here, each step adds a little noise)

$$= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{\alpha_t - \alpha_t \alpha_{t-1}}\, \epsilon_{t-2} + \sqrt{1 - \alpha_t}\, \epsilon_{t-1}$$

The terms $\sqrt{1 - \alpha_t}\, \epsilon_{t-1}$ and $\sqrt{\alpha_t - \alpha_t \alpha_{t-1}}\, \epsilon_{t-2}$ are samples from $\mathcal{N}(0, \sigma_1^2 I)$ and $\mathcal{N}(0, \sigma_2^2 I)$ respectively: scaling by a constant scales the variance, giving $\sigma_1^2 = 1 - \alpha_t$ and $\sigma_2^2 = \alpha_t - \alpha_t \alpha_{t-1}$.

By the preliminary fact above, their sum is again Gaussian, $\bar{\epsilon}_{t-2} \sim \mathcal{N}(0, (\sigma_1^2 + \sigma_2^2) I)$, with variance $(\alpha_t - \alpha_t \alpha_{t-1}) + (1 - \alpha_t) = 1 - \alpha_t \alpha_{t-1}$.

$$x_t = \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\, \bar{\epsilon}_{t-2} = \dots = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

where $\bar{\epsilon}_{t-2}$ denotes the Gaussian obtained by merging $\epsilon_{t-1}$ and $\epsilon_{t-2}$.

The formula derived this way from the one-step update is

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

So, given $x_0$, the distribution of $x_t$ is:

$$q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I)$$

(this follows because the noise $\epsilon$ has mean $0$ and covariance $I$).
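This closed form is what makes training cheap: we can jump straight to any $x_t$ instead of simulating every intermediate step. A minimal sketch, reusing the linear schedule assumed above:

```python
import numpy as np

rng = np.random.default_rng(0)
alphas_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))

def q_sample(x0, t, eps):
    """x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps (t is 0-indexed)."""
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x0 = rng.standard_normal((3, 32, 32))
eps = rng.standard_normal(x0.shape)
x_500 = q_sample(x0, 500, eps)  # one jump straight to step 500
```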

Reverse Diffusion Process

The forward process turns data into noise; the reverse process is a denoising process. In reverse diffusion, we take Gaussian noise $x_T \sim \mathcal{N}(0, I)$ as input and sample from $q(x_{t-1} \mid x_t)$ to infer and reconstruct the real sample.

We train a model to predict these conditional distributions of the noise: the true noise generated at each forward step is recorded as the label, so that during reverse diffusion the model can predict the Gaussian noise produced by the forward process and, step by step, recover the original data.

The estimated posterior distribution for removing the Gaussian noise added to $x_{t-1}$ is:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$

where the parameters are values estimated by a neural network.

Although the distribution $q(x_{t-1} \mid x_t)$ cannot be computed directly, the posterior additionally conditioned on $x_0$, $q(x_{t-1} \mid x_t, x_0)$, is tractable:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}(x_t, x_0), \tilde{\beta}_t I)$$

(the left-hand side is the probability of $x_{t-1}$ given both $x_t$ and $x_0$; don't get the conditioning direction mixed up).

Derivation

  • The joint probability can be expressed via conditionals as:
    $$q(x_{t-1}, x_t \mid x_0) = q(x_t \mid x_{t-1}, x_0)\, q(x_{t-1} \mid x_0)$$

  • It can equally be expressed as:
    $$q(x_{t-1}, x_t \mid x_0) = q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)$$

Equating the two expressions gives:

$$q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1}, x_0)\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}$$

All three factors on the right-hand side are now known (note that by the Markov property, $q(x_t \mid x_{t-1}, x_0) = q(x_t \mid x_{t-1})$):

$$q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I) \quad \text{(known 1)}$$

$$q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I) \quad \text{(known 2)}$$

Then

$$q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1}, x_0)\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}$$

$$\propto \exp\left(-\frac{1}{2}\left( \frac{(x_t - \sqrt{\alpha_t}\, x_{t-1})^2}{\beta_t} + \frac{(x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\, x_0)^2}{1 - \bar{\alpha}_{t-1}} - \frac{(x_t - \sqrt{\bar{\alpha}_t}\, x_0)^2}{1 - \bar{\alpha}_t} \right)\right)$$

$$= \exp\left(-\frac{1}{2}\left( \frac{x_t^2 - 2\sqrt{\alpha_t}\, x_t x_{t-1} + \alpha_t x_{t-1}^2}{\beta_t} + \frac{x_{t-1}^2 - 2\sqrt{\bar{\alpha}_{t-1}}\, x_{t-1} x_0 + \bar{\alpha}_{t-1} x_0^2}{1 - \bar{\alpha}_{t-1}} - \frac{(x_t - \sqrt{\bar{\alpha}_t}\, x_0)^2}{1 - \bar{\alpha}_t} \right)\right)$$

$$= \exp\left(-\frac{1}{2}\left( \left(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}\right) x_{t-1}^2 - \left(\frac{2\sqrt{\alpha_t}}{\beta_t}\, x_t + \frac{2\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}}\, x_0\right) x_{t-1} + C(x_t, x_0) \right)\right)$$

This has the shape of a Gaussian; compare with the exponent of the Gaussian density:

$$\exp\left(-\frac{1}{2}\, \frac{(x_{t-1} - \mu)^2}{\sigma^2}\right)$$

Completing the square (write $A = \frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}$ for the quadratic coefficient and $B = \frac{\sqrt{\alpha_t}}{\beta_t}\, x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}}\, x_0$ for half the linear coefficient) turns the exponent into

$$A\left( x_{t-1}^2 - 2\,\frac{B}{A}\, x_{t-1} + \left(\frac{B}{A}\right)^2 \right) + \text{const} = A\left( x_{t-1} - \frac{B}{A} \right)^2 + \text{const}$$

Reading off against the Gaussian form, the variance and mean are:

Variance:

$$\tilde{\beta}_t = \frac{1}{A} = \frac{1}{\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}} = \frac{1}{\frac{\alpha_t (1 - \bar{\alpha}_{t-1}) + \beta_t}{\beta_t (1 - \bar{\alpha}_{t-1})}} = \frac{1}{\frac{1 - \beta_t - \alpha_t \bar{\alpha}_{t-1} + \beta_t}{\beta_t (1 - \bar{\alpha}_{t-1})}} = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t$$

(using $\alpha_t = 1 - \beta_t$ and $\alpha_t \bar{\alpha}_{t-1} = \bar{\alpha}_t$).

Every parameter in the variance is known: $\alpha_t$ and $\beta_t$ are schedule values we choose, varying with $t$.

Mean:

$$\tilde{\mu}_t(x_t, x_0) = \frac{B}{A} = \left( \frac{\sqrt{\alpha_t}}{\beta_t}\, x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}}\, x_0 \right) \frac{1}{\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}} = \left( \frac{\sqrt{\alpha_t}}{\beta_t}\, x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}}\, x_0 \right) \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t$$

$$= \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0 = \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})\, x_t + \sqrt{\bar{\alpha}_{t-1}}\, \beta_t\, x_0}{1 - \bar{\alpha}_t}$$

Since $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, we have $x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon\right)$.
Substituting:

$$\tilde{\mu}_t = \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} \cdot \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon\right)$$

$$= \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon \right)$$

Note that this $x_t$ depends on the particular $x_0$ (and noise draw), so it is not a fixed value from sample to sample.
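A minimal NumPy sketch of the posterior mean $\tilde{\mu}_t$ and variance $\tilde{\beta}_t$, again reusing the assumed linear schedule:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)

def posterior_mean_variance(x0, xt, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0), for 0-indexed t >= 1."""
    a_bar_t, a_bar_prev = alphas_bar[t], alphas_bar[t - 1]
    mean = (np.sqrt(alphas[t]) * (1 - a_bar_prev) * xt
            + np.sqrt(a_bar_prev) * betas[t] * x0) / (1 - a_bar_t)
    var = (1 - a_bar_prev) / (1 - a_bar_t) * betas[t]
    return mean, var
```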

Optimization Objective

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$

where the parameters are values estimated by a neural network.

What we now need to predict is $\mu_\theta(x_t, t)$.

First, look at the overall loss function:

$$-\log p_\theta(x_0)$$

We want the network parameters $\theta$ to make the probability of generating $x_0$ as large as possible.

Preliminary: the Kullback-Leibler divergence (relative entropy) measures the difference between two probability distributions. Specifically, it measures the information lost when a distribution $q$ is approximated by another distribution $p$.

That is, it quantifies how much information is lost when $p_\theta$ is used to approximate $q$:

$$D_{KL}(q \parallel p) = \sum_x q(x) \log \frac{q(x)}{p(x)}$$
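A tiny numeric illustration of the definition on discrete distributions (the probabilities are arbitrary):

```python
import numpy as np

q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])

kl = np.sum(q * np.log(q / p))  # D_KL(q || p), in nats
print(kl)                       # ~ 0.0253; equals 0 iff q == p
```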

Since the loss cannot be computed directly, we instead bound it (a KL divergence is always non-negative):

$$-\log p_\theta(x_0) \leq -\log p_\theta(x_0) + D_{KL}(q(x_{1:T} \mid x_0) \parallel p_\theta(x_{1:T} \mid x_0))$$

Expanding the KL term:

$$D_{KL}(q(x_{1:T} \mid x_0) \parallel p_\theta(x_{1:T} \mid x_0)) = \mathbb{E}_{x_{1:T} \sim q(x_{1:T} \mid x_0)}\left[ \log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{1:T} \mid x_0)} \right]$$

Since $p_\theta(x_{1:T} \mid x_0) = \frac{p_\theta(x_0, x_{1:T})}{p_\theta(x_0)} = \frac{p_\theta(x_{0:T})}{p_\theta(x_0)}$, the right-hand side of the bound becomes

$$-\log p_\theta(x_0) + \mathbb{E}_{x_{1:T} \sim q(x_{1:T} \mid x_0)}\left[ \log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T}) / p_\theta(x_0)} \right]$$

$$= -\log p_\theta(x_0) + \mathbb{E}_q\left[ \log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})} \right] + \log p_\theta(x_0)$$

$$= \mathbb{E}_q\left[ \log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})} \right]$$

(since $\log p_\theta(x_0)$ is a constant with respect to $x_{1:T}$, taking its expectation leaves it unchanged, and the two copies cancel).

(The derivation below again uses the identities
$q(x_{t-1}, x_t \mid x_0) = q(x_t \mid x_{t-1}, x_0)\, q(x_{t-1} \mid x_0)$ and
$q(x_{t-1}, x_t \mid x_0) = q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)$.)

Substituting the factorizations $q(x_{1:T} \mid x_0) = \prod_{t=1}^T q(x_t \mid x_{t-1})$ and $p_\theta(x_{0:T}) = p_\theta(x_T) \prod_{t=1}^T p_\theta(x_{t-1} \mid x_t)$:

$$\mathbb{E}_q\left[ \log \frac{\prod_{t=1}^{T} q(x_t \mid x_{t-1})}{p_\theta(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)} \right]$$

$$= \mathbb{E}_q\left[ -\log p_\theta(x_T) + \sum_{t=2}^{T} \log \frac{q(x_t \mid x_{t-1})}{p_\theta(x_{t-1} \mid x_t)} + \log \frac{q(x_1 \mid x_0)}{p_\theta(x_0 \mid x_1)} \right]$$

(splitting the $t = 1$ term out of the sum; by the Markov property and the identities above, $q(x_t \mid x_{t-1}) = q(x_t \mid x_{t-1}, x_0) = \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)}$)

$$= \mathbb{E}_q\left[ -\log p_\theta(x_T) + \sum_{t=2}^{T} \log \left( \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)} \cdot \frac{q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)} \right) + \log \frac{q(x_1 \mid x_0)}{p_\theta(x_0 \mid x_1)} \right]$$

$$= \mathbb{E}_q\left[ -\log p_\theta(x_T) + \sum_{t=2}^{T} \log \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)} + \sum_{t=2}^{T} \log \frac{q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)} + \log \frac{q(x_1 \mid x_0)}{p_\theta(x_0 \mid x_1)} \right]$$

(the second sum telescopes)

$$= \mathbb{E}_q\left[ -\log p_\theta(x_T) + \sum_{t=2}^{T} \log \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)} + \log \frac{q(x_T \mid x_0)}{q(x_1 \mid x_0)} + \log \frac{q(x_1 \mid x_0)}{p_\theta(x_0 \mid x_1)} \right]$$

$$= \mathbb{E}_q\left[ \log \frac{q(x_T \mid x_0)}{p_\theta(x_T)} + \sum_{t=2}^{T} \log \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)} - \log p_\theta(x_0 \mid x_1) \right]$$

$$= \mathbb{E}_q\left[ D_{KL}(q(x_T \mid x_0) \parallel p_\theta(x_T)) + \sum_{t=2}^{T} D_{KL}(q(x_{t-1} \mid x_t, x_0) \parallel p_\theta(x_{t-1} \mid x_t)) - \log p_\theta(x_0 \mid x_1) \right]$$

At this point the first term contains no learnable parameters: it involves only the fixed forward process, and $x_T$ is (approximately) standard Gaussian, so it is a constant.

The last term is dropped in the simplified objective.
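Each per-step KL term above compares two Gaussians, for which the KL divergence has a closed form; a minimal sketch of the standard diagonal-Gaussian formula (a textbook identity, not tied to any particular codebase):

```python
import numpy as np

def kl_gauss(mu_q, var_q, mu_p, var_p):
    """D_KL( N(mu_q, var_q) || N(mu_p, var_p) ) for (elementwise) diagonal Gaussians."""
    return 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

print(kl_gauss(0.0, 1.0, 0.0, 1.0))  # 0.0: identical Gaussians
print(kl_gauss(1.0, 1.0, 0.0, 1.0))  # 0.5: means differ by 1
```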

From the derivation above, we now have:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}(x_t, x_0), \tilde{\beta}_t I)$$

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$

True value:

$$\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon \right)$$

Parameterized prediction:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)$$
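A minimal sketch of one reverse sampling step built on this parameterization. Here `eps_model` stands in for a trained noise-prediction network (an assumption), and the posterior variance $\tilde{\beta}_t$ is used as $\Sigma_\theta$, one common choice:

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)

def p_sample(xt, t, eps_model):
    """One reverse step x_t -> x_{t-1}; t is 0-indexed, eps_model(x, t) predicts the noise."""
    eps_hat = eps_model(xt, t)
    mean = (xt - (1 - alphas[t]) / np.sqrt(1 - alphas_bar[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean  # no noise is added at the last step
    var = (1 - alphas_bar[t - 1]) / (1 - alphas_bar[t]) * betas[t]
    return mean + np.sqrt(var) * rng.standard_normal(xt.shape)
```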

The error between the true mean $\tilde{\mu}_t$ and the predicted mean $\mu_\theta$ is measured with a (weighted) mean squared error:

$$L_t = \frac{1}{2\sigma_t^2} \left\| \tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t) \right\|^2$$

Substituting the expressions for the two means (recall $1 - \alpha_t = \beta_t$):

$$= \frac{1}{2\sigma_t^2} \left\| \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon \right) - \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) \right\|^2$$

$$= \frac{\beta_t^2}{2\sigma_t^2\, \alpha_t (1 - \bar{\alpha}_t)} \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2$$

Empirically, the model works better if the leading coefficient is dropped entirely. This leaves a very simple final optimization objective:

$$\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2$$
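Putting it together, a minimal sketch of this simplified training objective (the `eps_model` interface is an assumption; any network mapping $(x_t, t)$ to a noise estimate with the same shape as $x_t$ would do):

```python
import numpy as np

rng = np.random.default_rng(0)
alphas_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))

def diffusion_loss(eps_model, x0):
    """Simplified objective || eps - eps_theta(x_t, t) ||^2 at a random step t."""
    t = rng.integers(0, len(alphas_bar))     # sample a step t uniformly
    eps = rng.standard_normal(x0.shape)      # the "label": the true injected noise
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return np.mean((eps - eps_model(xt, t)) ** 2)

dummy_model = lambda x, t: np.zeros_like(x)  # placeholder "network" for illustration
print(diffusion_loss(dummy_model, rng.standard_normal((3, 32, 32))))
```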
