[Paper Close Reading - DDPM] Deep Unsupervised Learning using Nonequilibrium Thermodynamics

Published 2023-04-03 16:25:04 · Author: Be(CN₃H₃)₂

Mathematical Derivation [Reposted]

The mathematical derivation comes from Su Jianlin's (苏剑林) 《生成扩散模型漫谈》 series. Many thanks to Su for sharing it so generously, which lets someone with a weak math background like me appreciate the essence of what is currently the hottest model around.

Some steps in that series can be puzzling at first glance, so I expand them a little here, as notes for myself.

An intuitive picture: DDPM = demolition + reconstruction

A generative model is essentially: random noise \(\boldsymbol{z} \xrightarrow{\text{transform}}\) data sample \(\boldsymbol{x}\)

We split the "demolition" into \(T\) steps:

\[\begin{equation}\boldsymbol{x} = \boldsymbol{x}_0 \to \boldsymbol{x}_1 \to \boldsymbol{x}_2 \to \cdots \to \boldsymbol{x}_{T-1} \to \boldsymbol{x}_T = \boldsymbol{z}\end{equation} \]

If we can learn \(\boldsymbol{x}_{t-1}=\boldsymbol{\mu}(\boldsymbol{x}_t)\), then repeatedly applying \(\boldsymbol{x}_{T-1}=\boldsymbol{\mu}(\boldsymbol{x}_T),\,\boldsymbol{x}_{T-2}=\boldsymbol{\mu}(\boldsymbol{x}_{T-1}),\,\cdots,\,\boldsymbol{x}_0=\boldsymbol{\mu}(\boldsymbol{x}_1)\) recovers \(\boldsymbol{x}_0\).

How to demolish

DDPM models one step of "demolition" as:

\[\begin{equation}\boldsymbol{x}_t = \alpha_t \boldsymbol{x}_{t-1} + \beta_t \boldsymbol{\varepsilon}_t,\quad \boldsymbol{\varepsilon}_t\sim\mathcal{N}(\boldsymbol{0},\, \boldsymbol{I})\tag{1}\end{equation} \]

where \(\alpha_t,\beta_t > 0\) with \(\alpha_t^2 + \beta_t^2=1\); \(\beta_t\) is typically close to \(0\), and \(\boldsymbol{\varepsilon}_t\) is the injected noise.

Applying this demolition step repeatedly, we get:

\[\begin{equation}\begin{aligned} \boldsymbol{x}_t =&\, \alpha_t \boldsymbol{x}_{t-1} + \beta_t \boldsymbol{\varepsilon}_t \\ =&\, \alpha_t \big(\alpha_{t-1} \boldsymbol{x}_{t-2} + \beta_{t-1} \boldsymbol{\varepsilon}_{t-1}\big) + \beta_t \boldsymbol{\varepsilon}_t \\ =&\,\cdots\\ =&\,(\alpha_t\cdots\alpha_1) \boldsymbol{x}_0 + \underbrace{(\alpha_t\cdots\alpha_2)\beta_1 \boldsymbol{\varepsilon}_1 + (\alpha_t\cdots\alpha_3)\beta_2 \boldsymbol{\varepsilon}_2 + \cdots + \alpha_t\beta_{t-1} \boldsymbol{\varepsilon}_{t-1} + \beta_t \boldsymbol{\varepsilon}_t}_{\text{a sum of mutually independent Gaussian noises}} \end{aligned}\end{equation} \]

The part marked by the underbrace can be viewed as a single noise term. We use the additivity of independent Gaussian variables:

\[\begin{equation}X_1\sim \mathcal{N}(\mu_1,\,\sigma_1^2)\ \text{and}\ X_2\sim \mathcal{N}(\mu_2,\,\sigma_2^2)\ \text{independent}\Longrightarrow X_1+X_2\sim \mathcal{N}(\mu_1+\mu_2,\,\sigma_1^2+\sigma_2^2)\end{equation} \]

Each of these noise terms clearly has mean \(0\). Let us sum the squares of all the coefficients (including the one in front of \(\boldsymbol{x}_0\)):

\[\begin{equation}\begin{aligned}&\,(\alpha_t\cdots\alpha_1)^2+(\alpha_t\cdots\alpha_2)^2\beta_1^2 + \cdots + (\alpha_t\alpha_{t-1})^2\beta_{t-2}^2+\alpha_t^2\beta_{t-1}^2 + \beta_t^2\\=&\,(\alpha_t\cdots\alpha_1)^2+(\alpha_t\cdots\alpha_2)^2\beta_1^2 + \cdots + (\alpha_t\alpha_{t-1})^2\beta_{t-2}^2+\alpha_t^2\beta_{t-1}^2 -\alpha_t^2+1\\=&\,(\alpha_t\cdots\alpha_1)^2+(\alpha_t\cdots\alpha_2)^2\beta_1^2 + \cdots+(\alpha_t\alpha_{t-1})^2\beta_{t-2}^2 - (\alpha_t\alpha_{t-1})^2 + 1\\=&\,\cdots\\=&\,(\alpha_t\cdots\alpha_1)^2+(\alpha_t\cdots\alpha_2)^2\beta_1^2-(\alpha_t\cdots\alpha_2)^2+1\\=&\,(\alpha_t\cdots\alpha_1)^2-(\alpha_t\cdots\alpha_1)^2+1\\=&\,1\end{aligned}\end{equation} \]

The total is \(1\), so the merged noise has variance \(1-(\alpha_t\cdots\alpha_1)^2\), and in effect we have:

\[\begin{equation}\begin{aligned}\boldsymbol{x}_t =& \underbrace{(\alpha_t\cdots\alpha_1)}_{\text{denoted }\bar{\alpha}_t} \boldsymbol{x}_0 + \underbrace{\sqrt{1 - (\alpha_t\cdots\alpha_1)^2}}_{\text{denoted }\bar{\beta}_t} \bar{\boldsymbol{\varepsilon}}_t,\quad \bar{\boldsymbol{\varepsilon}}_t\sim\mathcal{N}(\boldsymbol{0},\, \boldsymbol{I})\\=&\bar{\alpha}_t\boldsymbol{x}_0+\bar{\beta}_t\bar{\boldsymbol{\varepsilon}}_t,\quad \bar{\boldsymbol{\varepsilon}}_t\sim\mathcal{N}(\boldsymbol{0},\, \boldsymbol{I})\end{aligned}\tag{2}\end{equation} \]

In addition, DDPM chooses the \(\alpha_t\) so that \(\bar{\alpha}_T\approx 0\): after \(T\) steps of demolition, essentially nothing of the building remains; it has been entirely reduced to raw material \(\boldsymbol{\varepsilon}\).
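To make the "demolition" concrete, here is a minimal NumPy sketch. It assumes the schedule \(\alpha_t = \sqrt{1 - 0.02t/T}\) discussed in the hyperparameter section below and a toy 1-D signal in place of an image; `q_sample` implements \(\boldsymbol{x}_t=\bar{\alpha}_t\boldsymbol{x}_0+\bar{\beta}_t\bar{\boldsymbol{\varepsilon}}_t\) directly:

```python
import numpy as np

T = 1000
ts = np.arange(1, T + 1)
alpha = np.sqrt(1 - 0.02 * ts / T)       # alpha_t, t = 1..T (schedule from the last section)
beta = np.sqrt(1 - alpha ** 2)           # beta_t, since alpha_t^2 + beta_t^2 = 1
alpha_bar = np.cumprod(alpha)            # \bar{alpha}_t = alpha_1 * ... * alpha_t
beta_bar = np.sqrt(1 - alpha_bar ** 2)   # \bar{beta}_t

def q_sample(x0, t, rng):
    """One-shot 'demolition': x_t = alpha_bar_t * x0 + beta_bar_t * eps."""
    eps = rng.standard_normal(x0.shape)
    return alpha_bar[t - 1] * x0 + beta_bar[t - 1] * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(64)             # toy 1-D stand-in for an image
x_T = q_sample(x0, T, rng)               # essentially pure noise
print(alpha_bar[-1])                     # ~ 0.006: the building is all but gone
```

The arrays defined here (`alpha`, `beta`, `alpha_bar`, `beta_bar`) are reused by the later sketches.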

How to rebuild

We now have \(\boldsymbol{x}_{t-1}\to \boldsymbol{x}_t\); what we want to learn is \(\boldsymbol{x}_t\to \boldsymbol{x}_{t-1}\). Let \(\boldsymbol{x}_{t-1}=\boldsymbol{\mu}(\boldsymbol{x}_t)\); then the learning scheme is to minimize the Euclidean distance:

\[\begin{equation}\|\boldsymbol{x}_{t-1} - \boldsymbol{\mu}(\boldsymbol{x}_t)\|^2\tag{3}\end{equation} \]

First, solving \((1)\) for \(\boldsymbol{x}_{t-1}\) gives \(\boldsymbol{x}_{t-1} = \dfrac{1}{\alpha_t}\left(\boldsymbol{x}_t - \beta_t \boldsymbol{\varepsilon}_t\right)\). So we can set \(\boldsymbol{\mu}(\boldsymbol{x}_t)\) to:

\[\begin{equation}\boldsymbol{\mu}(\boldsymbol{x}_t) = \dfrac{1}{\alpha_t}\left(\boldsymbol{x}_t - \beta_t \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,\,t)\right)\tag{4}\end{equation} \]

where \(\boldsymbol{\theta}\) denotes the trainable parameters. Substituting into \((3)\), the loss function becomes:

\[\begin{equation}\|\boldsymbol{x}_{t-1} - \boldsymbol{\mu}(\boldsymbol{x}_t)\|^2 =\|\dfrac{1}{\alpha_t}\left(\boldsymbol{x}_t - \beta_t \boldsymbol{\varepsilon}_t\right) - \dfrac{1}{\alpha_t}\left(\boldsymbol{x}_t - \beta_t \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,\,t)\right)\|^2= \dfrac{\beta_t^2}{\alpha_t^2}\| \boldsymbol{\varepsilon}_t - \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,\, t)\|^2\end{equation} \]

Ignoring the weight \(\dfrac{\beta_t^2}{\alpha_t^2}\), and combining \((1)\) and \((2)\), we can expand \(\boldsymbol{x}_t\) as:

\[\begin{equation}\boldsymbol{x}_t = \alpha_t\boldsymbol{x}_{t-1} + \beta_t \boldsymbol{\varepsilon}_t = \alpha_t\left(\bar{\alpha}_{t-1}\boldsymbol{x}_0 + \bar{\beta}_{t-1}\bar{\boldsymbol{\varepsilon}}_{t-1}\right) + \beta_t \boldsymbol{\varepsilon}_t = \bar{\alpha}_t\boldsymbol{x}_0 + \alpha_t\bar{\beta}_{t-1}\bar{\boldsymbol{\varepsilon}}_{t-1} + \beta_t \boldsymbol{\varepsilon}_t\end{equation} \]

The loss function then takes the form:

\[\begin{equation}\| \boldsymbol{\varepsilon}_t - \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,\, t)\|^2=\| \boldsymbol{\varepsilon}_t - \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\bar{\alpha}_t\boldsymbol{x}_0 + \alpha_t\bar{\beta}_{t-1}\bar{\boldsymbol{\varepsilon}}_{t-1} + \beta_t \boldsymbol{\varepsilon}_t,\,t)\|^2\tag{5}\end{equation} \]

Why do we need the step \(\boldsymbol{x}_t = \alpha_t\boldsymbol{x}_{t-1} + \beta_t \boldsymbol{\varepsilon}_t\) at all? Because \(\boldsymbol{\varepsilon}_t\) and \(\bar{\boldsymbol{\varepsilon}}_t\) are not mutually independent, we cannot regress on \(\boldsymbol{\varepsilon}_t\) while writing \(\boldsymbol{x}_t\) through \(\bar{\boldsymbol{\varepsilon}}_t\); we can only express \(\boldsymbol{x}_t\) via the independent pair \(\bar{\boldsymbol{\varepsilon}}_{t-1}\) and \(\boldsymbol{\varepsilon}_t\).
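As a sketch of what objective \((5)\) samples (with `eps_model` a hypothetical stand-in for the network \(\boldsymbol{\epsilon}_{\boldsymbol{\theta}}\), and reusing the schedule arrays above), one naive training draw looks like:

```python
def naive_loss(x0, t, eps_model, rng):
    """Objective (5): sample eps_bar_{t-1} and eps_t separately (assumes t >= 2)."""
    eps_bar_prev = rng.standard_normal(x0.shape)   # \bar{eps}_{t-1}
    eps_t = rng.standard_normal(x0.shape)          # eps_t, independent of the above
    x_t = (alpha_bar[t - 1] * x0
           + alpha[t - 1] * beta_bar[t - 2] * eps_bar_prev
           + beta[t - 1] * eps_t)
    return np.sum((eps_t - eps_model(x_t, t)) ** 2)

# e.g. with a dummy network: naive_loss(x0, 500, lambda x, t: np.zeros_like(x), rng)
```

Each draw touches four independent random quantities (\(\boldsymbol{x}_0\), \(t\), \(\bar{\boldsymbol{\varepsilon}}_{t-1}\), \(\boldsymbol{\varepsilon}_t\)), which is exactly the variance the next section reduces.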

Reducing the variance

In principle, \((5)\) is already enough to train DDPM, but it requires sampling four random variables \(\boldsymbol{x}_0,\,\bar{\boldsymbol{\varepsilon}}_{t-1},\, \boldsymbol{\varepsilon}_t,\,t\) separately, so in practice the variance of the loss estimate can be too large, leading to slow convergence and similar problems. We can merge \(\bar{\boldsymbol{\varepsilon}}_{t-1}\) and \(\boldsymbol{\varepsilon}_t\) into a single random variable to mitigate this.

First, let us derive how \(\bar{\beta}_{t-1}^2\) relates to \(\beta_t^2\) and \(\bar{\beta}_t^2\) (using \(\alpha_t^2=1-\beta_t^2\)):

\[\begin{equation}\bar{\beta}_{t-1}^2=1-\bar{\alpha}_{t-1}^2=1-\dfrac{\bar{\alpha}_{t}^2}{\alpha_t^2}=1-\dfrac{1-\bar{\beta}_{t}^2}{1-\beta_t^2}=\dfrac{\bar{\beta}_{t}^2-\beta_t^2}{1-\beta_t^2}\end{equation} \]

Then, just as before, we use the additivity of Gaussians:

\(\alpha_t\bar{\beta}_{t-1}\bar{\boldsymbol{\varepsilon}}_{t-1} + \beta_t \boldsymbol{\varepsilon}_t\) has mean \(\boldsymbol{0}\) and variance \(\alpha_t^2\bar{\beta}_{t-1}^2 + \beta_t^2=\alpha_t^2\dfrac{\bar{\beta}_{t}^2-\beta_t^2}{1-\beta_t^2} + \beta_t^2=\bar{\beta}_t^2\), so it is effectively \(\bar{\beta}_t\boldsymbol{\varepsilon}\) with \(\boldsymbol{\varepsilon}\sim \mathcal{N}(\boldsymbol{0},\, \boldsymbol{I})\);

\(\beta_t \bar{\boldsymbol{\varepsilon}}_{t-1} - \alpha_t\bar{\beta}_{t-1} \boldsymbol{\varepsilon}_t\) has mean \(\boldsymbol{0}\) and the same variance \(\beta_t^2 + \alpha_t^2\bar{\beta}_{t-1}^2=\bar{\beta}_t^2\), so it is effectively \(\bar{\beta}_t\boldsymbol{\omega}\) with \(\boldsymbol{\omega}\sim \mathcal{N}(\boldsymbol{0},\, \boldsymbol{I})\).

Let us verify that \(\boldsymbol{\varepsilon}\) and \(\boldsymbol{\omega}\) are two mutually independent Gaussian variables. Since they are jointly Gaussian, zero covariance is enough, so we compute \(\mathbb{E}[\boldsymbol{\varepsilon}\boldsymbol{\omega}^{\top}]\), starting from \(\mathbb{E}[(\bar{\beta}_t\boldsymbol{\varepsilon})(\bar{\beta}_t\boldsymbol{\omega})^{\top}]\):

\[\begin{equation}\begin{aligned}&\,\mathbb{E}[(\bar{\beta}_t\boldsymbol{\varepsilon})(\bar{\beta}_t\boldsymbol{\omega})^{\top}]\\=&\,\mathbb{E}[(\alpha_t\bar{\beta}_{t-1}\bar{\boldsymbol{\varepsilon}}_{t-1} + \beta_t \boldsymbol{\varepsilon}_t)(\beta_t \bar{\boldsymbol{\varepsilon}}_{t-1}^{\top} - \alpha_t\bar{\beta}_{t-1} \boldsymbol{\varepsilon}_t^{\top})]\\=&\,\mathbb{E}[\alpha_t\beta_t\bar{\beta}_{t-1}\bar{\boldsymbol{\varepsilon}}_{t-1}\bar{\boldsymbol{\varepsilon}}_{t-1}^{\top}-\alpha_t^2\bar{\beta}_{t-1}^2\bar{\boldsymbol{\varepsilon}}_{t-1}\boldsymbol{\varepsilon}_t^{\top}+\beta_t^2\boldsymbol{\varepsilon}_t\bar{\boldsymbol{\varepsilon}}_{t-1}^{\top}-\alpha_t\beta_t\bar{\beta}_{t-1}\boldsymbol{\varepsilon}_t\boldsymbol{\varepsilon}_t^{\top}]\\=&\,\alpha_t\beta_t\bar{\beta}_{t-1}\boldsymbol{I}-\boldsymbol{0}+\boldsymbol{0}-\alpha_t\beta_t\bar{\beta}_{t-1}\boldsymbol{I}\\=&\,\boldsymbol{0}\end{aligned}\end{equation} \]

Here we used the fact that \(\bar{\boldsymbol{\varepsilon}}_{t-1},\,\boldsymbol{\varepsilon}_t\sim \mathcal{N}(\boldsymbol{0},\, \boldsymbol{I})\) are independent, so \(\mathbb{E}[\bar{\boldsymbol{\varepsilon}}_{t-1}\bar{\boldsymbol{\varepsilon}}_{t-1}^{\top}]=\mathbb{E}[\boldsymbol{\varepsilon}_t\boldsymbol{\varepsilon}_t^{\top}]=\boldsymbol{I}\) and \(\mathbb{E}[\bar{\boldsymbol{\varepsilon}}_{t-1}\boldsymbol{\varepsilon}_t^{\top}]=\mathbb{E}[\boldsymbol{\varepsilon}_t\bar{\boldsymbol{\varepsilon}}_{t-1}^{\top}]=\boldsymbol{0}\).
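These claims are easy to sanity-check by simulation. The sketch below (an arbitrary \(t\), reusing the schedule arrays above) estimates the two variances and the cross-moment from samples:

```python
t = 500
a, b = alpha[t - 1], beta[t - 1]
bb_prev, bb = beta_bar[t - 2], beta_bar[t - 1]

rng = np.random.default_rng(1)
eps_bar_prev = rng.standard_normal(100_000)
eps_t = rng.standard_normal(100_000)

u = a * bb_prev * eps_bar_prev + b * eps_t   # should behave like beta_bar_t * eps
v = b * eps_bar_prev - a * bb_prev * eps_t   # should behave like beta_bar_t * omega

print(u.var(), v.var(), bb ** 2)             # both variances ≈ \bar{beta}_t^2
print(np.mean(u * v))                        # ≈ 0: eps and omega are uncorrelated
```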

Next, we solve this system back for \(\boldsymbol{\varepsilon}_t\):

\[\begin{equation}\begin{cases} \alpha_t\bar{\beta}_{t-1}\bar{\boldsymbol{\varepsilon}}_{t-1} + \beta_t \boldsymbol{\varepsilon}_t = \bar{\beta}_t\boldsymbol{\varepsilon}\\[5pt] \beta_t \bar{\boldsymbol{\varepsilon}}_{t-1} - \alpha_t\bar{\beta}_{t-1} \boldsymbol{\varepsilon}_t = \bar{\beta}_t\boldsymbol{\omega} \end{cases}\end{equation} \]

which gives:

\[\begin{equation}\boldsymbol{\varepsilon}_t = \dfrac{\alpha_t\bar{\beta}_{t-1}\bar{\beta}_t\boldsymbol{\omega}-\beta_t\bar{\beta}_t \boldsymbol{\varepsilon}}{- \alpha_t^2\bar{\beta}_{t-1}^2-\beta_t^2 }=\dfrac{(\alpha_t\bar{\beta}_{t-1} \boldsymbol{\omega}-\beta_t \boldsymbol{\varepsilon})\bar{\beta}_t}{-\alpha_t^2\frac{\bar{\beta}_{t}^2-\beta_t^2}{1-\beta_t^2}-\beta_t^2} =\dfrac{(\beta_t \boldsymbol{\varepsilon} - \alpha_t\bar{\beta}_{t-1} \boldsymbol{\omega})\bar{\beta}_t}{\bar{\beta}_t^2}= \dfrac{\beta_t \boldsymbol{\varepsilon} - \alpha_t\bar{\beta}_{t-1} \boldsymbol{\omega}}{\bar{\beta}_t}\end{equation} \]
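Continuing the numeric sketch above, we can also confirm the solve mechanically: recover \(\boldsymbol{\varepsilon}\) and \(\boldsymbol{\omega}\) from the two combinations and check that the formula reproduces \(\boldsymbol{\varepsilon}_t\) exactly:

```python
eps = u / bb                                 # recover eps from the first combination
omega = v / bb                               # recover omega from the second
eps_t_rec = (b * eps - a * bb_prev * omega) / bb
print(np.abs(eps_t_rec - eps_t).max())       # ~ 1e-16: the inversion is exact
```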

Substituting this into \((5)\) gives:

\[\begin{equation}\begin{aligned} &\,\mathbb{E}_{\bar{\boldsymbol{\varepsilon}}_{t-1},\, \boldsymbol{\varepsilon}_t\sim \mathcal{N}(\boldsymbol{0},\, \boldsymbol{I})}\left[\left\| \boldsymbol{\varepsilon}_t - \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\bar{\alpha}_t\boldsymbol{x}_0 + \alpha_t\bar{\beta}_{t-1}\bar{\boldsymbol{\varepsilon}}_{t-1} + \beta_t \boldsymbol{\varepsilon}_t,\, t)\right\|^2\right] \\ =&\,\mathbb{E}_{\boldsymbol{\omega},\, \boldsymbol{\varepsilon}\sim \mathcal{N}(\boldsymbol{0},\, \boldsymbol{I})}\left[\left\| \frac{\beta_t \boldsymbol{\varepsilon} - \alpha_t\bar{\beta}_{t-1} \boldsymbol{\omega}}{\bar{\beta}_t} - \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\bar{\alpha}_t\boldsymbol{x}_0 + \bar{\beta}_t\boldsymbol{\varepsilon},\, t)\right\|^2\right] \end{aligned}\end{equation} \]

Let us deal with \(\boldsymbol{\omega}\) first. Write \(A = -\dfrac{\alpha_t\bar{\beta}_{t-1}}{\bar{\beta}_t}\) (a scalar) and \(\boldsymbol{B} = \dfrac{\beta_t}{\bar{\beta}_t}\boldsymbol{\varepsilon} - \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\bar{\alpha}_t\boldsymbol{x}_0 + \bar{\beta}_t\boldsymbol{\varepsilon},\, t)\), neither of which involves \(\boldsymbol{\omega}\):

\[\begin{equation}\begin{aligned}&\,\mathbb{E}_{\boldsymbol{\omega},\, \boldsymbol{\varepsilon}\sim \mathcal{N}(\boldsymbol{0},\, \boldsymbol{I})}\left[\left\| \dfrac{\beta_t \boldsymbol{\varepsilon} - \alpha_t\bar{\beta}_{t-1} \boldsymbol{\omega}}{\bar{\beta}_t} - \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\bar{\alpha}_t\boldsymbol{x}_0 + \bar{\beta}_t\boldsymbol{\varepsilon},\, t)\right\|^2\right]\\=&\,\mathbb{E}_{\boldsymbol{\omega},\, \boldsymbol{\varepsilon}\sim \mathcal{N}(\boldsymbol{0},\, \boldsymbol{I})}\left[\left\| -\dfrac{\alpha_t\bar{\beta}_{t-1}}{\bar{\beta}_t} \boldsymbol{\omega}+\dfrac{\beta_t }{\bar{\beta}_t}\boldsymbol{\varepsilon} - \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\bar{\alpha}_t\boldsymbol{x}_0 + \bar{\beta}_t\boldsymbol{\varepsilon},\, t)\right\|^2\right]\\=&\,\mathbb{E}_{\boldsymbol{\omega},\, \boldsymbol{\varepsilon}\sim \mathcal{N}(\boldsymbol{0},\, \boldsymbol{I})}\left[\left\|A \boldsymbol{\omega}+\boldsymbol{B}\right\|^2\right]\\=&\,\mathbb{E}_{\boldsymbol{\omega},\, \boldsymbol{\varepsilon}\sim \mathcal{N}(\boldsymbol{0},\, \boldsymbol{I})}\left[A^2\left\|\boldsymbol{\omega}\right\|^2+2A\langle\boldsymbol{B},\,\boldsymbol{\omega}\rangle+\left\|\boldsymbol{B}\right\|^2\right]\end{aligned}\end{equation} \]

Expanding directly: \(\boldsymbol{\omega}\) has mean \(\boldsymbol{0}\) and is independent of \(\boldsymbol{B}\), so \(\mathbb{E}[2A\langle\boldsymbol{B},\,\boldsymbol{\omega}\rangle]=0\), while \(\mathbb{E}[A^2\|\boldsymbol{\omega}\|^2]\) is a constant that does not depend on \(\boldsymbol{\theta}\). The loss is therefore equivalent to:

\[\begin{equation}\dfrac{\beta_t^2}{\bar{\beta}_t^2}\mathbb{E}_{\boldsymbol{\varepsilon}\sim \mathcal{N}(\boldsymbol{0},\, \boldsymbol{I})}\left[\left\|\boldsymbol{\varepsilon} - \dfrac{\bar{\beta}_t}{\beta_t}\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\bar{\alpha}_t\boldsymbol{x}_0 + \bar{\beta}_t\boldsymbol{\varepsilon},\, t)\right\|^2\right]+\text{constant}\end{equation} \]

Ignoring the constant and the weight once more, we obtain the loss that DDPM finally uses:

\[\begin{equation}\mathbb{E}_{\boldsymbol{\varepsilon}\sim \mathcal{N}(\boldsymbol{0},\, \boldsymbol{I})}\left\|\boldsymbol{\varepsilon}-\dfrac{\bar{\beta}_t}{\beta_t}\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\bar{\alpha}_t\boldsymbol{x}_0 + \bar{\beta}_t\boldsymbol{\varepsilon},\, t)\right\|^2\end{equation} \]

Up to the factor \(\dfrac{\bar{\beta}_t}{\beta_t}\) in front of \(\boldsymbol{\epsilon}_{\boldsymbol{\theta}}\), which can be absorbed into the network, this matches \(L_{\mathrm{simple}}(\theta)\) in the original DDPM paper (whose \(\bar{\alpha}_t\) corresponds to our \(\bar{\alpha}_t^2\)):

\[\begin{equation}L_{\mathrm{simple}}(\theta):=\mathbb{E}_{t,\,\boldsymbol{x}_0,\,\boldsymbol{\epsilon}}\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\sqrt{\bar{\alpha}_t}\boldsymbol{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon},\, t)\right\|^2\end{equation} \]
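In code, the reduced-variance objective is a one-noise regression. A hedged sketch (again with a placeholder `eps_model`, folding the \(\bar{\beta}_t/\beta_t\) factor into the network as the paper does, and reusing the schedule arrays above):

```python
def simple_loss(x0, eps_model, rng):
    """L_simple-style objective: one uniform t, one merged noise eps."""
    t = int(rng.integers(1, T + 1))          # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(x0.shape)      # the single merged noise
    x_t = alpha_bar[t - 1] * x0 + beta_bar[t - 1] * eps
    return np.sum((eps - eps_model(x_t, t)) ** 2)
```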

Recursive generation

Once training is done, we can generate by starting from random noise \(\boldsymbol{x}_T\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\) and applying \((4)\) for \(T\) steps:

\[\begin{equation}\boldsymbol{x}_{t-1} = \dfrac{1}{\alpha_t}\left(\boldsymbol{x}_t - \beta_t \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,\, t)\right)\end{equation} \]

This corresponds to greedy search in autoregressive decoding. To do random sampling instead, we add back a noise term:

\[\begin{equation}\boldsymbol{x}_{t-1} = \dfrac{1}{\alpha_t}\left(\boldsymbol{x}_t - \beta_t \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,\, t)\right) + \sigma_t \boldsymbol{z},\quad \boldsymbol{z}\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\end{equation} \]

In general we can take \(\sigma_t=\beta_t\), i.e., keep the forward and reverse variances in sync.
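Putting the stochastic update together with \(\sigma_t=\beta_t\) gives the generation loop; a minimal sketch under the same placeholder model (skipping the noise on the final step is a common convention assumed here, not part of the derivation above):

```python
def sample(eps_model, shape, rng):
    """Generate by applying the reverse update T times, starting from pure noise."""
    x = rng.standard_normal(shape)           # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        x = (x - beta[t - 1] * eps_model(x, t)) / alpha[t - 1]
        if t > 1:                            # conventionally, no noise on the last step
            x = x + beta[t - 1] * rng.standard_normal(shape)   # sigma_t = beta_t
    return x
```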

Hyperparameter choices

In DDPM, \(T=1000\) and \(\alpha_t = \sqrt{1 - \dfrac{0.02t}{T}}\).

For reconstruction we used the Euclidean distance \((3)\) as the loss, but DDPM is generally used for image generation, and readers who have worked on image generation know that Euclidean distance is not a good measure of how realistic an image is: a VAE reconstructing under Euclidean loss tends to produce blurry results, and only when the input and output images are very close does it yield sharp ones. Choosing \(T\) as large as possible is precisely what keeps each input-output pair as close as possible, reducing the blur that the Euclidean loss would otherwise cause.

Why choose a monotonically decreasing \(\alpha_t\)? When \(t\) is small, \(\boldsymbol{x}_t\) is still close to a real image, so we want \(\boldsymbol{x}_{t-1}\) and \(\boldsymbol{x}_t\) to stay close together to suit the Euclidean loss \((3)\), hence a larger \(\alpha_t\); when \(t\) is large, \(\boldsymbol{x}_t\) is already close to pure noise, and Euclidean distance is harmless on noise, so a larger gap between \(\boldsymbol{x}_{t-1}\) and \(\boldsymbol{x}_t\) is acceptable, i.e., a smaller \(\alpha_t\). Could we just use a large \(\alpha_t\) throughout? We could, but then \(T\) would have to grow accordingly to keep \(\bar{\alpha}_T\approx 0\).

We said earlier that we should have \(\bar{\alpha}_T\approx 0\); let us use the expression for \(\alpha_t\) to estimate \(\bar{\alpha}_T\):

\[\displaystyle\begin{equation}\log \bar{\alpha}_T = \sum_{t=1}^T \log\alpha_t = \frac{1}{2} \sum_{t=1}^T \log\left(1 - \frac{0.02t}{T}\right) < \frac{1}{2} \sum_{t=1}^T \left(- \frac{0.02t}{T}\right) = -0.005(T+1)\end{equation} \]

This shows that \(T\) must be fairly large to reach the \(\approx 0\) standard. With \(T=1000\), the bound gives \(\bar{\alpha}_T < \mathrm{e}^{-5.005}\approx \mathrm{e}^{-5}\), which is small enough.
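The bound is easy to compare with the exact product (reusing `alpha_bar` from the first sketch):

```python
print(np.log(alpha_bar[-1]))   # exact log(alpha_bar_T): about -5.04 for T = 1000
print(-0.005 * (T + 1))        # the bound above: -5.005
```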

Finally, note that in the "rebuilding" model \(\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\bar{\alpha}_t\boldsymbol{x}_0 + \bar{\beta}_t\boldsymbol{\varepsilon}, t)\) we write \(t\) explicitly as an input. In principle, different \(t\) handle objects at different noise levels and should use different reconstruction models, i.e., there should be \(T\) separate models; instead, we share the parameters of all of them and pass \(t\) in as a condition. According to the appendix of the paper, \(t\) is turned into a positional encoding and added onto the residual blocks.

References

苏剑林 (Su Jianlin). (Jun. 13, 2022). 《生成扩散模型漫谈(一):DDPM = 拆楼 + 建楼》 [Blog post]. Retrieved from https://spaces.ac.cn/archives/9119