[Machine Learning] 2. Stochastic Variance Reduced Gradient Descent (SVRG)

Posted 2023-10-11 18:05:02 · Author: shiys22

ML theory is driving me insane!!!!!

Continuing from the previous post: GD converges at rate \(\frac 1T\) while SGD only achieves \(\frac 1{\sqrt T}\). There are many ways to speed up SGD's convergence. One family of algorithms makes the gradient variance decrease over the course of training, so that the final convergence rate matches GD's (rate \(\frac 1T\) for convex and \(L\)-smooth objectives; a linear rate for strongly convex and \(L\)-smooth ones). This post covers one such method, SVRG (stochastic variance reduced gradient).

The idea behind SVRG: suppose we want to estimate \(\mathbb EX\) for a random variable \(X\) by Monte Carlo, and we have another random variable \(Y\) that is strongly correlated with \(X\) and whose mean \(\mathbb EY\) is easy to compute. Then we can use

\[\theta_\alpha = \alpha(X - Y) + \mathbb EY \]

to estimate \(\mathbb EX\), where

\[\mathbb E\theta_\alpha = \alpha \mathbb EX + (1 - \alpha) \mathbb EY \]

\[\mathrm{Var}(\theta_\alpha) = \mathrm{Var}(\alpha(X - Y)) = \alpha^2\mathrm{Var}(X - Y) \]

When \(\alpha = 1\), \(\mathbb E\theta_\alpha = \mathbb EX\) and \(\mathrm{Var}(\theta_\alpha) = \mathrm{Var}(X - Y)\). The variance thus no longer depends on \(\mathrm{Var}(X)\), only on how strongly \(X\) and \(Y\) are correlated.
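As a quick demo of this control-variate trick (a minimal sketch; the random variables \(X = e^U\) and \(Y = 1 + U\) are made up for illustration, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# X = exp(U), U ~ Uniform(0, 1); pretend E[X] = e - 1 is what we want.
# Y = 1 + U is strongly correlated with X, and E[Y] = 3/2 is known exactly.
u = rng.uniform(size=n)
x = np.exp(u)
y = 1.0 + u

naive = x                 # plain Monte Carlo samples of X
cv = (x - y) + 1.5        # control-variate samples (alpha = 1)

print("true mean :", np.e - 1)
print("naive     : mean", naive.mean(), " var", naive.var())
print("ctrl var  : mean", cv.mean(), " var", cv.var())
```

Both estimators are unbiased, but the control-variate samples have far smaller variance because \(X - Y\) is nearly constant.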

We can design an algorithm accordingly: \(X\) is the stochastic gradient at the current iterate \(\bm w_j\), and \(Y\) is the stochastic gradient at a snapshot of \(\bm w_j\), namely \(\bm w_{\lfloor \frac jm \rfloor m}\). We choose \(m\) large enough that computing the full gradient at each snapshot \(\bm w_{im}\) is affordable; these full gradients serve as \(\mathbb EY\).

Intuitively, we use the difference between the stochastic gradient and the full gradient at the snapshot to estimate the corresponding difference at the current iterate. As in the figure: red and orange are the stochastic gradients at the two points, black is the known quantity (the full gradient at the snapshot), green is the one to be estimated, and we take black + orange − red. The error then comes not from the angle between orange and green, but from how close \(X_0\) and \(X_t\) are.

The complete algorithm is as follows:

SVRG algorithm

  • For \(s = 1, 2, \ldots\)
    • \(\widetilde {\bm w} = \widetilde {\bm w}_{s-1}\)
    • \(\widetilde {\bm u} = \frac 1N\sum_{i=1}^N \nabla l_i(\widetilde {\bm w}) = \nabla f(\widetilde {\bm w})\)
    • \({\bm w}_0 = \widetilde {\bm w}\)
    • For \(t = 1, 2, \ldots, m\)
      • Sample \(i \in [N]\) uniformly at random
      • \({\bm w}_t = {\bm w}_{t-1} - \eta(\nabla l_i({\bm w}_{t-1}) - \nabla l_i(\widetilde {\bm w}) + \widetilde {\bm u})\)
    • Pick \({\bm w}_t\) with \(t \in \{0, 1, \ldots, m - 1\}\) uniformly at random as \(\widetilde {\bm w}_s\)

The random choice in the last step exists so that the analysis can argue in terms of \(\mathbb E\). If you take \(\widetilde {\bm w}_s = {\bm w}_m\) instead, nobody has managed to prove anything..
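A minimal NumPy sketch of the loop above on a synthetic least-squares problem (the function name `svrg`, the problem setup, and all parameter values are my own choices for illustration):

```python
import numpy as np

def svrg(grad_i, full_grad, w0, eta, m, epochs, n, rng):
    """SVRG as described above: an outer loop over snapshots and an
    inner loop of variance-reduced stochastic gradient steps."""
    w_tilde = w0.copy()
    for _ in range(epochs):
        u_tilde = full_grad(w_tilde)              # full gradient at the snapshot
        w = w_tilde.copy()
        iterates = [w.copy()]
        for _ in range(m):
            i = rng.integers(n)                   # sample i uniformly from [N]
            v = grad_i(w, i) - grad_i(w_tilde, i) + u_tilde
            w = w - eta * v
            iterates.append(w.copy())
        # pick w_t, t in {0, ..., m-1}, uniformly at random as the next snapshot
        w_tilde = iterates[rng.integers(m)]
    return w_tilde

# Synthetic least-squares problem: l_i(w) = (a_i @ w - b_i)^2 / 2,
# f(w) = (1/N) * sum_i l_i(w).
rng = np.random.default_rng(0)
N, d = 200, 10
A = rng.normal(size=(N, d))
b = rng.normal(size=N)

grad_i = lambda w, i: (A[i] @ w - b[i]) * A[i]    # gradient of l_i
full_grad = lambda w: A.T @ (A @ w - b) / N       # gradient of f

w_star, *_ = np.linalg.lstsq(A, b, rcond=None)
w_hat = svrg(grad_i, full_grad, np.zeros(d), eta=0.01, m=2 * N, epochs=30, n=N, rng=rng)
print("distance to optimum:", np.linalg.norm(w_hat - w_star))
```

The snapshot update follows the pseudocode exactly, including the uniformly random choice of \(\bm w_t\); in practice many implementations simply take \(\bm w_m\), which works fine empirically even though the proof needs the random choice.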

Below is the convergence analysis for the \(\mu\)-strongly convex case.

Let \(\bm v_t = \nabla l_i(\bm w_{t-1}) - \nabla l_i(\widetilde {\bm w}) + \widetilde {\bm u}\). Note that the \(\|\cdot \|^2_2\) below means the argument does not necessarily hold under a general norm, whereas the earlier proofs did not need this restriction. Define

\[g_i(\bm w) = l_i(\bm w) - l_i(\bm w^*) - \langle \nabla l_i(\bm w^*), \bm w - \bm w^* \rangle \]

Since \(l_i(\bm w)\) is convex, \(g_i(\bm w)\) is convex as well, and \(\nabla g_i(\bm w^*) = 0\), so \(g_i(\bm w^*) = \min_{\bm w} g_i(\bm w)\).

\[\begin{aligned} 0 = g_i(\bm w^*) &\leq \min_{\eta} \{g_i(\bm w - \eta \nabla g_i(\bm w))\} \\ &\leq \min_\eta\{g_i(\bm w) - \eta \|\nabla g_i(\bm w)\|_2^2 + \frac L2 \eta^2 \|\nabla g_i(\bm w)\|_2^2\} && \text{Smoothness} \\ &= g_i(\bm w) - \frac 1{2L} \|\nabla g_i(\bm w)\|_2^2 \end{aligned}\]

This step is just smoothness: the quadratic upper bound is minimized at its vertex, \(\eta = \frac 1L\). Rearranging (and using \(\nabla g_i(\bm w) = \nabla l_i(\bm w) - \nabla l_i(\bm w^*)\)) gives

\[\|\nabla l_i(\bm w) - \nabla l_i(\bm w^*)\|_2^2 \leq 2L(l_i(\bm w) - l_i(\bm w^*) - \langle\nabla l_i(\bm w^*), \bm w - \bm w^*\rangle) \]

This looks a lot like a Lipschitz condition, except that the left-hand side is a squared norm.
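This is exactly co-coercivity of the gradient of a convex \(L\)-smooth function, and it holds for any pair of points, not just the minimizer. A throwaway numerical check (the logistic-style loss and the sampled points are my own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# l(w) = log(1 + exp(-w)) is convex and L-smooth with L = 1/4 (in 1-D).
l = lambda w: np.log1p(np.exp(-w))
dl = lambda w: -1.0 / (1.0 + np.exp(w))
L = 0.25

w_star = rng.normal()
for w in rng.normal(scale=3.0, size=5):
    lhs = (dl(w) - dl(w_star)) ** 2
    rhs = 2 * L * (l(w) - l(w_star) - dl(w_star) * (w - w_star))
    assert lhs <= rhs + 1e-12, (lhs, rhs)
print("inequality holds on all sampled points")
```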

Averaging over all \(l_i\) (the inner-product terms vanish because \(\frac 1N \sum_i \nabla l_i(\bm w^*) = \nabla f(\bm w^*) = 0\)) gives

\[\mathbb E \|\nabla l_i(\bm w) - \nabla l_i(\bm w^*)\|_2^2 \leq 2L(f(\bm w) - f(\bm w^*)) \]

Now bound the second moment of \(\bm v_t\):

\[\begin{aligned} \mathbb E\|\bm v_t\|^2_2 &\leq 2 \mathbb E\|\nabla l_i(\bm w_{t-1}) - \nabla l_i(\bm w^*)\|^2_2 + 2 \mathbb E\|\nabla l_i(\widetilde {\bm w}) - \nabla l_i(\bm w^*) - \nabla f(\widetilde {\bm w})\|^2_2 && \|a + b\|^2_2 \leq 2\|a\|^2_2 + 2\|b\|^2_2 \\ &= 2 \mathbb E\|\nabla l_i(\bm w_{t-1}) - \nabla l_i(\bm w^*)\|^2_2 + 2 \mathbb E\|\nabla l_i(\widetilde {\bm w}) - \nabla l_i(\bm w^*) - \mathbb E(\nabla l_i(\widetilde {\bm w}) - \nabla l_i(\bm w^*))\|^2_2 && \nabla f(\bm w^*) = 0 \\ &\leq 2\mathbb E\|\nabla l_i(\bm w_{t-1}) - \nabla l_i(\bm w^*)\|^2_2 + 2\mathbb E\|\nabla l_i(\widetilde {\bm w}) - \nabla l_i(\bm w^*)\|^2_2 && \mathbb E\|X - \mathbb EX\|^2_2 = \mathbb E \|X\|^2_2 - \|\mathbb EX\|^2_2 \leq \mathbb E\|X\|_2^2 \\ &\leq 4L(f(\bm w_{t-1}) - f(\bm w^*) + f(\widetilde{\bm w}) - f(\bm w^*)) \end{aligned}\]
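Before using this, note that conditioned on \(\bm w_{t-1}\), the update direction \(\bm v_t\) is an unbiased estimate of the full gradient:

\[\mathbb E \bm v_t = \nabla f(\bm w_{t-1}) - \nabla f(\widetilde{\bm w}) + \widetilde{\bm u} = \nabla f(\bm w_{t-1}) \]

since \(\widetilde{\bm u} = \nabla f(\widetilde{\bm w})\); this is what the second line of the next display uses.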

\[\begin{aligned} \mathbb E\|\bm w_t - \bm w^*\|_2^2 &= \|\bm w_{t-1} - \bm w^*\|_2^2 - 2 \eta\langle \bm w_{t-1} - \bm w^*, \mathbb E \bm v_t \rangle + \eta^2 \mathbb E\|\bm v_t\|_2^2 \\ &\leq \|\bm w_{t-1} - \bm w^*\|_2^2 - 2 \eta\langle \bm w_{t-1} - \bm w^*, \nabla f(\bm w_{t-1})\rangle + 4L\eta^2 (f(\bm w_{t-1}) - f(\bm w^*) + f(\widetilde{\bm w}) - f(\bm w^*)) \\ &\leq \|\bm w_{t-1} - \bm w^*\|_2^2 - 2 \eta(f(\bm w_{t-1}) - f(\bm w^*)) + 4L\eta^2 (f(\bm w_{t-1}) - f(\bm w^*) + f(\widetilde{\bm w}) - f(\bm w^*)) && \text{Convexity} \\ &= \|\bm w_{t-1} - \bm w^*\|_2^2 - 2 \eta(1 - 2L \eta)(f(\bm w_{t-1}) - f(\bm w^*)) + 4L \eta^2 (f(\widetilde{\bm w}) - f(\bm w^*)) \end{aligned}\]

The term \(4L \eta^2 (f(\widetilde{\bm w}) - f(\bm w^*))\) is the error part, and it shrinks as \(\widetilde{\bm w} \to \bm w^*\). Summing the bound over \(t = 1, \ldots, m\) and telescoping:

\[\begin{aligned} \mathbb E\|\bm w_m - \bm w^*\|_2^2 &\leq \mathbb E \|\bm w_0 - \bm w^*\|_2^2 - 2\eta(1 - 2 L \eta)\sum_{t=1}^m \mathbb E(f(\bm w_{t-1}) - f(\bm w^*)) + 4mL\eta^2(f(\widetilde{\bm w}) - f(\bm w^*)) \\ &= \mathbb E \|\bm w_0 - \bm w^*\|_2^2 - 2m\eta(1 - 2 L \eta)\mathbb E(f(\widetilde {\bm w}_s) - f(\bm w^*)) + 4mL\eta^2(f(\widetilde{\bm w}) - f(\bm w^*)) \end{aligned}\]

We have obtained the terms we want: since \(\widetilde {\bm w}_s\) is drawn uniformly from \(\{\bm w_0, \ldots, \bm w_{m-1}\}\), the sum equals \(m \, \mathbb E(f(\widetilde {\bm w}_s) - f(\bm w^*))\), which is how the second line follows. The right-hand side is now a linear combination of \(\mathbb E(f(\widetilde {\bm w}_s) - f(\bm w^*))\) and \(f(\widetilde{\bm w}) - f(\bm w^*)\), and the left-hand side can simply be lower-bounded by \(0\), so all that remains is to get rid of \(\mathbb E \|\bm w_0 - \bm w^*\|_2^2\).

\[\begin{aligned}\mathbb E \|\bm w_0 - \bm w^*\|_2^2 = \mathbb E \|\widetilde{\bm w} - \bm w^*\|_2^2 \leq \frac 2\mu \mathbb E(f(\widetilde{\bm w}) - f(\bm w^*)) && \text{Strong convexity} + \nabla f(\bm w^*) = 0\end{aligned} \]

Combining the last three displays and dividing by \(2m\eta(1 - 2L\eta)\) (recall \(\widetilde{\bm w} = \widetilde{\bm w}_{s-1}\)):

\[\begin{aligned} \mathbb E(f(\widetilde{\bm w}_s) - f(\bm w^*)) \leq \left(\frac 1{m\mu \eta (1 - 2L \eta)} + \frac {2L\eta}{1 - 2L\eta}\right) \mathbb E(f(\widetilde{\bm w}_{s-1}) - f(\bm w^*)) \end{aligned}\]

Therefore, when \(\eta < \frac 1{4L}\) (so that \(\frac{2L\eta}{1 - 2L\eta} < 1\)) and \(m\) is large enough, the coefficient is \(< 1\), and we get linear convergence.
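As a concrete sanity check of the constants (the particular numbers are my own, not from the original post): take \(\eta = \frac 1{10L}\) and \(m = \frac{50L}{\mu}\). Then \(m\mu\eta(1 - 2L\eta) = 5 \cdot 0.8 = 4\) and \(\frac{2L\eta}{1 - 2L\eta} = \frac{0.2}{0.8} = \frac 14\), so the contraction factor is \(\frac 14 + \frac 14 = \frac 12\): each outer epoch halves the expected suboptimality, at the cost of one full gradient plus \(m\) stochastic gradients.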