
Denoising Diffusion Implicit Models: Reading Notes 2

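Denoising diffusion probabilistic models (DDPMs) generate samples by sampling from a Markov chain, which takes many iterations and is therefore slow. Denoising diffusion implicit models (DDIMs) were proposed to speed up sampling while reusing a network trained with DDPM.
The basic idea of accelerated sampling is as follows. The original generative process samples step by step along the sequence $[T,\cdots,1]$. To accelerate it, we sample along a subsequence $\{\tau_1, \dots, \tau_S\}$ with $\tau_1 > \tau_2 > \dots > \tau_S$ and every $\tau_i \in [1, T]$, skipping steps to reduce the number of sampling steps. For example, a DDPM network may be trained with 1000 steps, but at sampling time we can pick 50 evenly spaced steps out of the 1000 and generate the image with only those 50 steps.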
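As a minimal sketch of how such a subsequence might be chosen, the snippet below picks `S` evenly spaced timesteps from `[1, T]` with NumPy. The function name `make_subsequence` and the rounding rule are illustrative assumptions, not something fixed by the paper.

```python
import numpy as np

def make_subsequence(T: int, S: int) -> np.ndarray:
    """Pick S timesteps from [1, T], evenly spaced, in descending order.

    Hypothetical helper: the even spacing and rounding are one possible choice.
    """
    # Evenly spaced points between 1 and T (both included), rounded to integers.
    steps = np.linspace(1, T, S).round().astype(int)
    # Reverse so that tau_1 > tau_2 > ... > tau_S, matching the sampling order.
    return steps[::-1]

# Example from the text: 50 sampling steps out of 1000 training steps.
tau = make_subsequence(T=1000, S=50)
```

Sampling then visits only the timesteps in `tau`, from the largest to the smallest.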
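Both DDPM and DDIM can sample with skipped steps, which the authors also demonstrate in their experiments. The main contribution of DDIM is a generative form with adjustable variance that still reuses the DDPM-trained network. When the number of steps is small, using a small variance gives better generation quality.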

The DDIM paper uses different notation from the DDPM paper; these notes follow the notation of the DDPM paper.

Motivation

The DDPM training objective is
$$
\begin{aligned}
L_\text{VLB} &= \mathbb{E}_{q(\mathbf{x}_{0:T})} \Big[ \log\frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} \Big] \\
&= \mathbb{E}_q \Big[\underbrace{D_\text{KL}(q(\mathbf{x}_T \vert \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_T))}_{L_T} + \sum_{t=2}^T \underbrace{D_\text{KL}(q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t))}_{L_{t-1}} \underbrace{- \log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)}_{L_0} \Big]
\end{aligned}
$$
where $L_{t-1}= \mathbb{E}_q [D_\text{KL}(q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t))]$ is the term that the DDPM network optimizes.
By Bayes' rule,
$$
q(\pmb{x}_{t-1}|\pmb{x}_{t},\pmb{x}_0)=\frac{q(\pmb{x}_{t-1}|\pmb{x}_0)\,q(\pmb{x}_t|\pmb{x}_{t-1},\pmb{x}_0)}{q(\pmb{x}_{t}|\pmb{x}_0)}
$$
In DDPM's noise-prediction parameterization, $L_{t-1}$ reduces to a weighted squared error between the true and the predicted noise, evaluated at $\pmb{x}_t \sim q(\pmb{x}_t|\pmb{x}_0)$. So $L_{t-1}$ depends only on the marginal $q(\pmb{x}_t|\pmb{x}_0)$, not directly on the joint $q(\pmb{x}_{1:T}|\pmb{x}_0)$. We can therefore define a more flexible inference process: as long as its marginal $q(\pmb{x}_t|\pmb{x}_0)$ matches that of DDPM, the network optimized by DDPM can be reused.
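To make the role of the marginal concrete, here is a small NumPy sketch that draws $\pmb{x}_t$ directly from $q(\pmb{x}_t|\pmb{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\pmb{x}_0, (1-\bar{\alpha}_t)\pmb{I})$ without simulating the chain step by step. The linear $\beta$ schedule (from 1e-4 to 0.02 over 1000 steps) and the function names are assumptions made for illustration.

```python
import numpy as np

def alpha_bar_schedule(T: int = 1000, beta_start: float = 1e-4,
                       beta_end: float = 0.02) -> np.ndarray:
    """Return alpha_bar[0..T], with alpha_bar[0] = 1 (assumed linear betas)."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bar = np.cumprod(1.0 - betas)
    # Prepend alpha_bar_0 = 1 so the array can be indexed by timestep t directly.
    return np.concatenate([[1.0], alpha_bar])

def sample_marginal(x0: np.ndarray, t: int, alpha_bar: np.ndarray,
                    rng: np.random.Generator):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    # eps is returned because it is the regression target of the noise network.
    return xt, eps

alpha_bar = alpha_bar_schedule()
xt, eps = sample_marginal(np.ones(4), t=10, alpha_bar=alpha_bar,
                          rng=np.random.default_rng(0))
```

Training only ever needs pairs `(xt, eps)` of this kind, which is why any inference process with the same marginals can share the trained network.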

A non-Markovian forward process

In DDPM, the inference distribution $q(\mathbf x_{1:T}|\mathbf x_0)$ (the distribution that infers the latent variables $\mathbf x_{1:T}$ from the observed variable $\mathbf x_0$) is a fixed Markov chain: DDPM requires $q(\pmb{x}_t|\pmb{x}_{t-1},\pmb{x}_0) = q(\pmb{x}_t|\pmb{x}_{t-1}) := \mathcal{N}(\sqrt{1 - \beta_t}\pmb{x}_{t-1}, \beta_t \pmb{I})$. We now relax this restriction and no longer require the forward process to be Markovian, that is, we impose no particular form on $q(\pmb{x}_t|\pmb{x}_{t-1})$.

The authors define a family Q of inference distributions indexed by a real vector $\sigma \in \mathbb{R}^T_{\ge 0}$:
$$
q_\sigma (\pmb{x}_{1:T}|\pmb{x}_0) := q_\sigma(\pmb{x}_T|\pmb{x}_0)\prod_{t=2}^T q_\sigma(\pmb{x}_{t-1}|\pmb{x}_t, \pmb{x}_0)
$$
The only requirement is that the marginals match those of DDPM, i.e. $q_\sigma(\pmb{x}_t|\pmb{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\pmb{x}_0,(1-\bar{\alpha}_t)\pmb{I})$.
Using the method of undetermined coefficients (see [1]), one obtains a more flexible form for the reverse conditional:
$$
q_\sigma(\pmb{x}_{t-1}|\pmb{x}_t, \pmb{x}_0) := \mathcal{N}\Big(\sqrt{\bar{\alpha}_{t-1}}\pmb{x}_0 + \sqrt{1 - \bar{\alpha}_{t-1}-\sigma_t^2}\cdot \frac{\pmb{x}_t - \sqrt{\bar{\alpha}_t}\pmb{x}_0}{\sqrt{1 - \bar{\alpha}_t}},\ \sigma_t^2 \pmb{I}\Big)
$$
The corresponding forward process is also Gaussian, but it is no longer Markovian, because each step depends on $\mathbf x_0$:
$$
q_\sigma(\pmb{x}_t|\pmb{x}_{t-1}, \pmb{x}_0) = \frac{q_\sigma(\pmb{x}_{t-1}|\pmb{x}_t, \pmb{x}_0)\,q_\sigma(\pmb{x}_t|\pmb{x}_0)}{q_\sigma(\pmb{x}_{t-1}|\pmb{x}_0)}
$$
As the figure below shows, the inference process of DDIM is non-Markovian.
[Figure not available]
Note that DDIM constructs an inference distribution different from DDPM's, yet it optimizes the same objective as DDPM.

The reverse generative process

Given the inference process above, we define the generative process $p_\theta(\mathbf{x}_{0:T})$ to be learned, which makes use of $q_\sigma(\pmb{x}_{t-1}|\pmb{x}_t,\pmb{x}_0)$.
Intuitively, given $\pmb{x}_t$, we first predict the corresponding $\pmb{x}_0$ and then use the reverse conditional $q_\sigma(\pmb{x}_{t-1}|\pmb{x}_t, \pmb{x}_0)$ defined above to obtain $\pmb{x}_{t-1}$.
The prediction of $\pmb{x}_0$ is
$$
\hat{\pmb{x}}_0 = f_\theta^{(t)}(\pmb{x}_t) := \frac{\pmb{x}_t - \sqrt{1-\bar{\alpha}_t}\, \pmb{\epsilon}_\theta^{(t)}(\pmb{x}_t)}{\sqrt{\bar{\alpha}_t}}
$$
Plugging the predicted $\hat{\pmb{x}}_0$ into $q_\sigma(\pmb{x}_{t-1}|\pmb{x}_t, \pmb{x}_0)$ gives $\pmb{x}_{t-1}$:
$$
\hat{\pmb{x}}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\hat{\pmb{x}}_0 + \sqrt{1- \bar{\alpha}_{t-1}-\sigma_t^2}\cdot \frac{\pmb{x}_t - \sqrt{\bar{\alpha}_t}\hat{\pmb{x}}_0}{\sqrt{1 - \bar{\alpha}_t}} + \sigma_t \pmb{z}, \quad \pmb{z} \sim \mathcal{N}(\pmb{0}, \pmb{I})
$$
Written in terms of $\pmb{\epsilon}_\theta^{(t)}(\pmb{x}_t)$:
$$
\hat{\pmb{x}}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, \frac{\pmb{x}_t -\sqrt{1-\bar{\alpha}_t}\,\pmb{\epsilon}_\theta^{(t)}(\pmb{x}_t)}{\sqrt{\bar{\alpha}_t}} + \sqrt{1- \bar{\alpha}_{t-1}-\sigma_t^2}\cdot \pmb{\epsilon}_\theta^{(t)}(\pmb{x}_t) + \sigma_t \pmb{z}, \quad \pmb{z} \sim \mathcal{N}(\pmb{0}, \pmb{I})
$$
Different choices of $\sigma$ lead to different generative processes, but they all use the same $\epsilon_{\theta}$ model.
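The update above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: `eps_model` is a hypothetical callable standing in for the trained network $\pmb{\epsilon}_\theta^{(t)}$, `alpha_bar` is assumed to be an array indexed by timestep with `alpha_bar[0] = 1`, and the loop fixes $\sigma_t = 0$ for simplicity.

```python
import numpy as np

def predict_x0(xt, eps_pred, ab_t):
    """x0_hat = (x_t - sqrt(1 - ab_t) * eps) / sqrt(ab_t)."""
    return (xt - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)

def ddim_step(xt, eps_pred, ab_t, ab_prev, sigma_t=0.0, rng=None):
    """One reverse step from alpha_bar level ab_t to ab_prev (scalars)."""
    x0_hat = predict_x0(xt, eps_pred, ab_t)
    # Coefficient of the "direction pointing to x_t"; the max() only guards
    # against tiny negative values caused by floating-point round-off.
    dir_coef = np.sqrt(max(1.0 - ab_prev - sigma_t**2, 0.0))
    x_prev = np.sqrt(ab_prev) * x0_hat + dir_coef * eps_pred
    if sigma_t > 0.0:
        if rng is None:
            rng = np.random.default_rng()
        # Fresh Gaussian noise, scaled by sigma_t.
        x_prev = x_prev + sigma_t * rng.standard_normal(np.shape(xt))
    return x_prev

def ddim_sample(eps_model, x_T, alpha_bar, tau):
    """Deterministic (sigma = 0) sampling along a descending subsequence tau.

    eps_model(x, t) is a hypothetical stand-in for the trained noise network.
    """
    x = x_T
    # After the last timestep in tau we step to t = 0, where alpha_bar = 1.
    prev_steps = list(tau[1:]) + [0]
    for t, t_prev in zip(tau, prev_steps):
        eps_pred = eps_model(x, t)
        x = ddim_step(x, eps_pred, alpha_bar[t], alpha_bar[t_prev], sigma_t=0.0)
    return x
```

Because only `alpha_bar[t]` and `alpha_bar[t_prev]` enter each step, the same function works whether consecutive entries of `tau` are adjacent timesteps or far apart.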

DDPM vs. DDIM

$$
\begin{split}
\text{DDPM}:\ &q(\pmb{x}_{t-1}|\pmb{x}_t, \pmb{x}_0) = \mathcal{N}\Big(\frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}\pmb{x}_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 -\bar{\alpha}_t}\pmb{x}_t,\ \frac{\beta_t(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\pmb{I}\Big)\\
\text{DDIM}:\ &q_\sigma(\pmb{x}_{t-1}|\pmb{x}_t, \pmb{x}_0) := \mathcal{N}\Big(\sqrt{\bar{\alpha}_{t-1}}\pmb{x}_0 + \sqrt{1 - \bar{\alpha}_{t-1}-\sigma_t^2}\cdot \frac{\pmb{x}_t - \sqrt{\bar{\alpha}_t}\pmb{x}_0}{\sqrt{1 - \bar{\alpha}_t}},\ \sigma_t^2 \pmb{I}\Big)
\end{split}
$$
When
$$
\sigma_t = \sqrt{(1-\bar\alpha_{t-1})/(1-\bar\alpha_{t})}\,\sqrt{1-\bar\alpha_{t}/\bar\alpha_{t-1}} = \sqrt{\frac{\beta_t(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}}
$$
the DDPM posterior $q(\pmb{x}_{t-1}|\pmb{x}_t, \pmb{x}_0)$ and the DDIM conditional $q_\sigma(\pmb{x}_{t-1}|\pmb{x}_t,\pmb{x}_0)$ coincide, so the generative process becomes identical to that of DDPM. In addition, the DDIM forward process becomes Markovian.
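The claim that this particular $\sigma_t$ recovers the DDPM posterior can be checked numerically by writing both means as `c_x0 * x0 + c_xt * xt` and comparing the coefficients. The sketch below does this for one arbitrary pair of values; the function names and the test values $\bar\alpha_t = 0.5$, $\bar\alpha_{t-1} = 0.8$ are assumptions chosen for illustration.

```python
import numpy as np

def sigma_ddpm(ab_t, ab_prev):
    """The sigma_t for which the DDIM conditional equals the DDPM posterior."""
    return np.sqrt((1.0 - ab_prev) / (1.0 - ab_t)) * np.sqrt(1.0 - ab_t / ab_prev)

def ddpm_posterior_coeffs(ab_t, ab_prev):
    """(coefficient of x0, coefficient of x_t, variance) of the DDPM posterior."""
    alpha_t = ab_t / ab_prev          # alpha_t = alpha_bar_t / alpha_bar_{t-1}
    beta_t = 1.0 - alpha_t
    c_x0 = np.sqrt(ab_prev) * beta_t / (1.0 - ab_t)
    c_xt = np.sqrt(alpha_t) * (1.0 - ab_prev) / (1.0 - ab_t)
    var = beta_t * (1.0 - ab_prev) / (1.0 - ab_t)
    return c_x0, c_xt, var

def ddim_coeffs(ab_t, ab_prev, sigma_t):
    """Same three quantities for the DDIM conditional q_sigma."""
    k = np.sqrt(1.0 - ab_prev - sigma_t**2) / np.sqrt(1.0 - ab_t)
    # The mean is sqrt(ab_prev) * x0 + k * (x_t - sqrt(ab_t) * x0).
    c_x0 = np.sqrt(ab_prev) - k * np.sqrt(ab_t)
    c_xt = k
    return c_x0, c_xt, sigma_t**2

ab_t, ab_prev = 0.5, 0.8
same = np.allclose(ddim_coeffs(ab_t, ab_prev, sigma_ddpm(ab_t, ab_prev)),
                   ddpm_posterior_coeffs(ab_t, ab_prev))
```

Here `same` evaluates to `True`: with this $\sigma_t$ the two Gaussians share both mean coefficients and the variance.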

When $\sigma_t=0$, the coefficient of the random noise is zero, so the relation between $\mathbf x_0$ and $\mathbf x_T$ is deterministic; this is an implicit probabilistic model.
In this case each step of the generative process becomes
$$
\begin{aligned}
\hat{\pmb{x}}_{t-1} &= \sqrt{\bar{\alpha}_{t-1}}\, \frac{\pmb{x}_t - \sqrt{1-\bar{\alpha}_t}\,\pmb{\epsilon}_\theta^{(t)}(\pmb{x}_t)}{\sqrt{\bar{\alpha}_t}} + \sqrt{1- \bar{\alpha}_{t-1}}\cdot \pmb{\epsilon}_\theta^{(t)}(\pmb{x}_t)\\
&=\frac{1}{\sqrt{\alpha_{t}}}\Big(\pmb{x}_t - \big(\sqrt{1-\bar{\alpha}_t} - \sqrt{\alpha_{t}}\sqrt{1-\bar{\alpha}_{t-1}}\big)\, \pmb{\epsilon}_\theta^{(t)}(\pmb{x}_t)\Big)
\end{aligned}
$$
Compare this with each step of the DDPM generative process:
$$
\hat{\pmb{x}}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Big(\pmb{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\pmb{\epsilon}_\theta^{(t)}(\pmb{x}_t)\Big) + \sigma_t\pmb{z}
$$
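The two forms of the deterministic step are algebraically equal because $\sqrt{\bar\alpha_{t-1}}/\sqrt{\bar\alpha_t} = 1/\sqrt{\alpha_t}$. As a quick sanity check, the sketch below evaluates both forms on the same inputs and also exposes the coefficient that multiplies $\pmb{\epsilon}_\theta$ in the DDIM and DDPM updates, so the two can be compared side by side. All names and the sample values are illustrative assumptions.

```python
import numpy as np

def eps_coef_ddim(ab_t, ab_prev):
    """Coefficient of eps inside the parentheses of the sigma = 0 DDIM step."""
    alpha_t = ab_t / ab_prev
    return np.sqrt(1.0 - ab_t) - np.sqrt(alpha_t) * np.sqrt(1.0 - ab_prev)

def eps_coef_ddpm(ab_t, ab_prev):
    """Coefficient of eps inside the parentheses of the DDPM step."""
    alpha_t = ab_t / ab_prev
    return (1.0 - alpha_t) / np.sqrt(1.0 - ab_t)

def ddim_step_two_term(xt, eps, ab_t, ab_prev):
    """First form: sqrt(ab_prev) * x0_hat + sqrt(1 - ab_prev) * eps."""
    x0_hat = (xt - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps

def ddim_step_compact(xt, eps, ab_t, ab_prev):
    """Second form: (x_t - coef * eps) / sqrt(alpha_t)."""
    alpha_t = ab_t / ab_prev
    return (xt - eps_coef_ddim(ab_t, ab_prev) * eps) / np.sqrt(alpha_t)

xt = np.array([0.2, -1.3, 0.7])
eps = np.array([0.5, 0.1, -0.4])
forms_agree = np.allclose(ddim_step_two_term(xt, eps, 0.5, 0.8),
                          ddim_step_compact(xt, eps, 0.5, 0.8))
```

Both updates share the outer factor $1/\sqrt{\alpha_t}$; they differ in the weight given to the predicted noise and in whether fresh noise $\sigma_t \pmb{z}$ is added.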

[2] gives the following decomposition:
$$
\begin{aligned}
\mathbf{x}_{t-1} &= \sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t-1}}\boldsymbol{\epsilon}_{t-1} \\
&= \sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}\, \boldsymbol{\epsilon}_t + \sigma_t\boldsymbol{\epsilon} \\
&= \sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}\, \frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\mathbf{x}_0}{\sqrt{1 - \bar{\alpha}_t}} + \sigma_t\boldsymbol{\epsilon}
\end{aligned}
$$
The decomposition relies on the fact that the sum of two independent Gaussian random variables with distributions $\mathcal{N}(\mathbf{0}, \sigma_1^2\mathbf{I})$ and $\mathcal{N}(\mathbf{0}, \sigma_2^2\mathbf{I})$ is distributed as $\mathcal{N}(\mathbf{0}, (\sigma_1^2 + \sigma_2^2)\mathbf{I})$.
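A Monte Carlo check can make the variance bookkeeping tangible: draw $\mathbf{x}_t$ from the DDPM marginal, draw $\mathbf{x}_{t-1}$ from $q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$, and confirm that the empirical mean and variance of $\mathbf{x}_{t-1}$ are close to $\sqrt{\bar\alpha_{t-1}}\mathbf{x}_0$ and $1-\bar\alpha_{t-1}$. The scalar setting, the sample size, and the values of $\bar\alpha_t$, $\bar\alpha_{t-1}$, $\sigma_t$ below are arbitrary assumptions for illustration.

```python
import numpy as np

def sample_prev_given_xt_x0(xt, x0, ab_t, ab_prev, sigma_t, rng):
    """Draw x_{t-1} ~ q_sigma(x_{t-1} | x_t, x_0)."""
    mean = (np.sqrt(ab_prev) * x0
            + np.sqrt(1.0 - ab_prev - sigma_t**2)
            * (xt - np.sqrt(ab_t) * x0) / np.sqrt(1.0 - ab_t))
    return mean + sigma_t * rng.standard_normal(xt.shape)

rng = np.random.default_rng(0)
n = 200_000                       # number of independent scalar samples
x0 = 1.5                          # a fixed "data point"
ab_t, ab_prev, sigma_t = 0.5, 0.8, 0.3   # requires sigma_t**2 <= 1 - ab_prev

# Step 1: x_t from the marginal N(sqrt(ab_t) * x0, 1 - ab_t).
xt = np.sqrt(ab_t) * x0 + np.sqrt(1.0 - ab_t) * rng.standard_normal(n)
# Step 2: x_{t-1} from the DDIM conditional, using independent fresh noise.
x_prev = sample_prev_given_xt_x0(xt, x0, ab_t, ab_prev, sigma_t, rng)

emp_mean, emp_var = x_prev.mean(), x_prev.var()
target_mean, target_var = np.sqrt(ab_prev) * x0, 1.0 - ab_prev
```

Up to sampling error, `emp_mean` and `emp_var` should match `target_mean` and `target_var` for any admissible `sigma_t`, which is the marginal-matching property the construction is built around.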

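Each step of DDPM's reverse generative process can likewise be viewed as first estimating $\hat{\pmb{x}}_0$ and then computing $\hat{\pmb{x}}_{t-1}$.
The main difference between DDIM and DDPM is that DDIM constructs a more flexible process: $\sigma$ changes the size of the variance and the mean changes along with it, so that the process still matches the DDPM marginal $q(\pmb{x}_t|\pmb{x}_0)$.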

References

[1] https://kxz18.github.io/2022/06/21/DDIM/
[2] https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
