[Machine Learning] 3. Mirror Descent and Linear Coupling

Published 2023-10-14 20:57:51  Author: shiys22

ML Theory is absolutely maddening!!!!!

Let us consider faster descent algorithms.

For Gradient Descent on an \(L\)-smooth function there are two ways to look at the algorithm. One is the local view: stepping to a nearby point along the gradient direction is guaranteed to decrease the function value. The other is the global view: a quadratic function provides a lower bound for the whole of \(f\). When the local gradient norm is large, the function value drops quickly; when gradient norms are small globally, each lower bound becomes tighter. We therefore design one strategy from each view and then couple the two to obtain a faster rate.

To describe the two views semi-formally, we generalize Gradient Descent into what is called Mirror Descent. The name comes from the mirror map between the primal space and its dual. If \(f\) is defined on a primal space, then \(\nabla f\) takes values in the dual space, i.e., at each point it provides a linear functional. From this standpoint, the gradient descent update \(\bm x_{t+1} = \bm x_t - \eta \nabla f(\bm x_t)\) linearly combines an element of the primal space with an element of the dual space. Under the \(L_2\) norm this is fine because the space is self-dual, but under a general norm it becomes strange. Following the dual-space point of view, we should instead treat \(\nabla f(\bm x_t)\) as a linear functional, so that

\[f(\bm x_t) + (\nabla f(\bm x_t))(\bm x - \bm x_t) \]

is the linear approximation of \(f(\bm x)\).

The first view is the regularization view. We want \(f(\bm x)\) to be small, but we also do not want \(\bm x\) to stray too far from \(\bm x_t\), otherwise the approximation degrades, so we add the regularization term \(\frac 1{2\eta} \|\bm x - \bm x_t\|^2\):

\[\bm x_{t+1} = \operatorname{argmin}_{\bm x} \left\{f(\bm x_t) + (\nabla f(\bm x_t))(\bm x - \bm x_t) + \frac 1{2\eta} \|\bm x - \bm x_t\|^2\right\} \]
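To see where this minimizer lands, set the gradient of the regularized objective with respect to \(\bm x\) to zero:

\[\nabla f(\bm x_t) + \frac 1\eta (\bm x - \bm x_t) = \bm 0 \iff \bm x = \bm x_t - \eta \nabla f(\bm x_t). \]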

Its solution is therefore exactly \(\bm x_t - \eta \nabla f(\bm x_t)\). This regularized viewpoint avoids any direct contact between the primal space and the dual space. We now generalize the regularization term.

Definition. For a convex function \(w\), its Bregman divergence is \(V_{\bm x}(\bm y) = w(\bm y) - \langle \nabla w(\bm x), \bm y - \bm x \rangle - w(\bm x)\), and \(w\) is called the distance-generating function.

The convexity of \(w\) ensures that the resulting problem is still a convex optimization problem.
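Two standard choices of the distance-generating function illustrate the definition: the squared Euclidean norm and the negative entropy,

\[w(\bm x) = \tfrac 12\|\bm x\|_2^2 \Rightarrow V_{\bm x}(\bm y) = \tfrac 12\|\bm x - \bm y\|_2^2, \qquad w(\bm x) = \sum_i x_i\log x_i \Rightarrow V_{\bm x}(\bm y) = \sum_i y_i \log\frac{y_i}{x_i} - \sum_i y_i + \sum_i x_i, \]

the latter reducing to the KL divergence on the probability simplex.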

Proposition (triangle equality). For all \(\bm x, \bm y, \bm u\), \(\langle -\nabla V_{\bm x}(\bm y), \bm y - \bm u \rangle = V_{\bm x}(\bm u) - V_{\bm y}(\bm u) - V_{\bm x}(\bm y)\).

Note that \(\nabla V_{\bm x}(\bm y)\) denotes the gradient taken with respect to \(\bm y\), treating \(\bm x\) as a parameter.

Proof.

\[\begin{aligned} \langle -\nabla V_{\bm x}(\bm y), \bm y - \bm u \rangle &= \langle \nabla w(\bm x) - \nabla w(\bm y), \bm y - \bm u\rangle \\ &= (w(\bm u) - w(\bm x) - \langle \nabla w(\bm x), \bm u - \bm x \rangle) - (w(\bm u) - w(\bm y) - \langle \nabla w(\bm y), \bm u - \bm y \rangle) \\ &\hphantom{{}={}} -(w(\bm y) - w(\bm x) - \langle \nabla w(\bm x), \bm y - \bm x \rangle) \\ &= V_{\bm x}(\bm u) - V_{\bm y}(\bm u) - V_{\bm x}(\bm y) \end{aligned}\]

One instance of this already appeared in the Gradient Descent post:

\[\frac 1{\eta} \langle \bm w_i - \bm w_{i+1}, \bm w_i - \bm w^*\rangle - \frac 1{2\eta} \|\bm w_i - \bm w_{i+1}\|^2_2 = \frac 1{2\eta} (\|\bm w_i - \bm w^*\|^2_2 - \|\bm w_{i+1} - \bm w^*\|^2_2) \]

where \(\bm x = \bm w_{i+1}\), \(\bm y = \bm w_i\), \(\bm u = \bm w^*\), \(w(\bm x) = \frac 1{2\eta}\|\bm x\|^2_2\), and \(\nabla w(\bm x) = \frac 1\eta \bm x\).
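As a quick numerical sanity check of the triangle equality, here is a minimal sketch; the choice \(w(\bm x) = \frac 12\|\bm x\|_2^2\), the dimension, and the random vectors are illustrative assumptions:

```python
import numpy as np

# Numerical check of the triangle equality for the Bregman divergence,
# assuming w(x) = 1/2 ||x||_2^2, so that V_x(y) = 1/2 ||x - y||_2^2.
rng = np.random.default_rng(0)

def grad_w(x):                 # gradient of w(x) = 1/2 ||x||_2^2
    return x

def V(x, y):                   # V_x(y) = w(y) - <grad w(x), y - x> - w(x)
    return 0.5 * y @ y - grad_w(x) @ (y - x) - 0.5 * x @ x

x, y, u = rng.standard_normal((3, 5))
lhs = (grad_w(x) - grad_w(y)) @ (y - u)     # <-grad V_x(y), y - u>
rhs = V(x, u) - V(y, u) - V(x, y)
print(np.isclose(lhs, rhs))                 # True
```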

Define a Mirror step with step size \(\alpha\) as

\[\bm x_{k+1} = \mathrm{Mirr}_{\bm x}(\alpha \nabla f(\bm x_k)) \]

\[\mathrm{Mirr}_{\bm x}(\bm \xi) = \operatorname{argmin}_{\bm y} (\langle \bm \xi, \bm y - \bm x \rangle + V_{\bm x}(\bm y)) \]

where \(V\) plays the role of the regularizer. If \(\bm x = \bm x_k\) and \(V_{\bm x}(\bm y) = \frac 12\|\bm x - \bm y\|^2\), this is exactly the usual GD step.

The second view is the mirror-space view: a Mirror step can be regarded as gradient descent in the dual space, i.e., we declare a new gradient there and find the point it corresponds to. The procedure looks like this:

  • \(\bm x_k\) is mapped by the Mirror map to \(\bm \theta_k\) in the dual space;
  • \(\bm \theta_{k+1} = \bm \theta_k - \alpha \nabla f(\bm x_k)\);
  • \(\bm \theta_{k+1}\) is mapped back to \(\overline{\bm x}_{k+1}\) in the primal space;
  • \(\overline{\bm x}_{k+1}\) is projected onto the constraint set, using the Bregman divergence as the distance, i.e., \(\bm x_{k+1} = \operatorname{argmin}_{\bm y} V_{\overline{\bm x}_{k+1}}(\bm y)\), with the argmin taken over the constraint set.

Comparing with the formula for the Mirror step, the Mirror map is exactly \(\nabla w(\cdot)\). The actual procedure is therefore:

  • \(\bm \theta_k = \nabla w(\bm x_k)\).
  • \(\bm \theta_{k+1} = \bm \theta_k - \alpha \nabla f(\bm x_k)\).
  • \(\overline{\bm x}_{k+1} = (\nabla w)^{-1}(\bm \theta_{k+1})\).
  • \(\bm x_{k+1} = \operatorname{argmin}_{\bm y} V_{\overline{\bm x}_{k+1}}(\bm y)\).

This view makes one assumption: \((\nabla w)^{-1}(\bm \theta_{k+1})\) always exists, i.e., \(\{\nabla w(\bm x)\} = \mathbb R^n\) (the gradient map is onto). A concrete instance of the procedure is sketched below.
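As an illustration, here is a minimal sketch of the procedure on the probability simplex with the negative-entropy distance-generating function \(w(\bm x) = \sum_i x_i \log x_i\), for which the mirror map is \(\nabla w(\bm x) = 1 + \log \bm x\) and the Bregman projection onto the simplex is a renormalization; the objective, step size, and iteration count are illustrative assumptions:

```python
import numpy as np

def mirror_descent_simplex(grad_f, x0, alpha, T):
    """Mirror descent on the probability simplex with negative entropy
    (exponentiated gradient). grad_f(x) returns the gradient of f at x."""
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        theta = 1.0 + np.log(x)               # map to the dual space: theta_k = grad w(x_k)
        theta = theta - alpha * grad_f(x)     # gradient step in the dual space
        x_bar = np.exp(theta - 1.0)           # map back: x_bar = (grad w)^{-1}(theta)
        x = x_bar / x_bar.sum()               # Bregman projection onto the simplex
    return x

# Illustrative use: minimize f(x) = <c, x> over the simplex (minimum at the smallest c_i).
c = np.array([3.0, 1.0, 2.0])
print(mirror_descent_simplex(lambda x: c, np.ones(3) / 3, alpha=0.5, T=200))
```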

Let us briefly verify that the two views describe the same algorithm:

\[\begin{aligned} \bm x_{k+1} &= \operatorname{argmin}_{\bm y} V_{\overline{\bm x}_{k+1}}(\bm y) \\ &= \operatorname{argmin}_{\bm y} \{w(\bm y) - \langle \nabla w(\overline{\bm x}_{k+1}), \bm y - \overline{\bm x}_{k+1}\rangle - w(\overline{\bm x}_{k+1})\} \\ &= \operatorname{argmin}_{\bm y} \{w(\bm y) - \langle \nabla w(\overline{\bm x}_{k+1}), \bm y\rangle \} \\ &= \operatorname{argmin}_{\bm y} \{w(\bm y) - \langle \bm \theta_k - \alpha \nabla f(\bm x_k), \bm y\rangle \} \\ &= \operatorname{argmin}_{\bm y} \{ \alpha \langle\nabla f(\bm x_k), \bm y\rangle + w(\bm y) - \langle \nabla w(\bm x_k), \bm y\rangle \} \\ &= \operatorname{argmin}_{\bm y} \{ \alpha \langle\nabla f(\bm x_k), \bm y - \bm x_k\rangle + V_{\bm x_k}(\bm y) \} = \mathrm{Mirr}_{\bm x_k}(\alpha \nabla f(\bm x_k)) \end{aligned}\]

Proposition. If \(f\) is \(\rho\)-Lipschitz and \(w\) is \(1\)-strongly convex, then after \(T = O\left(\frac {\rho^2}{\epsilon^2}\right)\) rounds, \(f(\overline {\bm x}) - f(\bm x^*) < \epsilon\).

Proof.

\(1\)-strong convexity of \(w\) implies

\[V_{\bm x}(\bm y) \geq \frac 12 \|\bm x - \bm y\|^2_2 \]

We first carry out a step analogous to the \(L\)-smooth argument in the GD analysis, which does not use the convexity of \(f\).

\[\begin{aligned} \langle \alpha \nabla f(\bm x_k), \bm x_k - \bm u \rangle &= \langle \alpha \nabla f(\bm x_k), \bm x_k - \bm x_{k+1} \rangle + \langle \alpha \nabla f(\bm x_k), \bm x_{k+1} - \bm u \rangle \\ &= \langle \alpha \nabla f(\bm x_k), \bm x_k - \bm x_{k+1} \rangle + \langle - \nabla V_{\bm x_k}(\bm x_{k+1}), \bm x_{k+1} - \bm u \rangle && (\nabla (V_{\bm x_k}(\bm y) + \langle \alpha \nabla f(\bm x_k), \bm y - \bm x_k\rangle))(\bm x_{k+1}) = \bm 0 \\ &= \langle \alpha \nabla f(\bm x_k), \bm x_k - \bm x_{k+1} \rangle + V_{\bm x_k}(\bm u) - V_{\bm x_{k+1}}(\bm u) - V_{\bm x_k}(\bm x_{k+1}) && \langle -\nabla V_{\bm x}(\bm y), \bm y - \bm u \rangle = V_{\bm x}(\bm u) - V_{\bm y}(\bm u) - V_{\bm x}(\bm y) \\ &\leq \left(\langle \alpha \nabla f(\bm x_k), \bm x_k - \bm x_{k+1} \rangle - \frac 12 \|\bm x_k - \bm x_{k+1}\|^2_2\right) + V_{\bm x_k}(\bm u) - V_{\bm x_{k+1}}(\bm u) \\ &\leq \frac{\alpha^2}2 \|\nabla f(\bm x_k)\|_2^2 + V_{\bm x_k}(\bm u) - V_{\bm x_{k+1}}(\bm u) \\ &\leq \frac {\alpha^2\rho^2}2 + V_{\bm x_k}(\bm u) - V_{\bm x_{k+1}}(\bm u) \end{aligned}\]

\(\overline{\bm x} = \frac 1k\sum_{t=0}^{k-1} \bm x_t\),根据 \(f\) 的凸性,有

\[f(\overline {\bm x}) \leq \frac 1T \sum_{t=0}^{T-1} f(\bm x_t) \]

Summing the bound above over \(t = 0, \ldots, T-1\), using convexity \(f(\bm x_t) - f(\bm u) \leq \langle \nabla f(\bm x_t), \bm x_t - \bm u \rangle\), and telescoping the \(V\) terms, we get

\[\alpha T(f(\overline {\bm x}) - f(\bm u)) \leq T\frac{\alpha^2\rho^2}2 + V_{\bm x_0}(\bm u) - V_{\bm x_T}(\bm u) \]

Assume \(V_{\bm x_0}(\bm x^*) \leq \Theta\) for some constant \(\Theta\). Taking \(\bm u = \bm x^*\) and dropping \(V_{\bm x_T}(\bm x^*) \geq 0\), we have

\[f(\overline{\bm x}) - f(\bm x^*) \leq \frac{\alpha \rho^2}2 + \frac{\Theta}{\alpha T} \]

Taking

\[\alpha = \frac{\sqrt{2 \Theta}}{\rho \sqrt T}, \]

we obtain

\[f(\overline {\bm x}) - f(\bm x^*) \leq \frac{\rho\sqrt{2 \Theta}}{\sqrt T} \]
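Indeed, plugging this \(\alpha\) into the two terms \(\frac{\alpha\rho^2}2 + \frac{\Theta}{\alpha T}\) makes them equal:

\[\frac{\alpha\rho^2}2 = \frac{\Theta}{\alpha T} = \frac{\rho\sqrt{2\Theta}}{2\sqrt T}. \]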

\(T = O\left(\frac{\rho^2}{\epsilon^2}\right)\)

Recall that \(\rho\)-Lipschitz is equivalent to \(\|\nabla f(x)\|_2 \leq \rho\). Suppose we could split the analysis by a threshold \(K\) on \(\|\nabla f(x)\|_2\). While the gradient norm is at least \(K\), each GD step decreases \(f\) by \(\frac{\|\nabla f\|_2^2}{2L} \geq \frac {K^2}{2L}\), so (starting from an \(O(\epsilon)\) optimality gap) only \(O(\frac{\epsilon L}{K^2})\) rounds are needed; while it is below \(K\), MD needs only \(O(\frac{K^2}{\epsilon^2})\) rounds. Choosing \(K^2 = \epsilon\sqrt{L \epsilon}\) balances the two counts and gives \(T = O\left(\sqrt{\frac L\epsilon}\right)\); treating \(L\) as a constant, \(T = O\left(\frac 1{\sqrt{\epsilon}}\right)\), i.e., an \(O(\frac 1{T^2})\) convergence rate.
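Explicitly, the balancing computation behind this threshold is

\[\frac{\epsilon L}{K^2} = \frac{K^2}{\epsilon^2} \iff K^2 = \sqrt{L\epsilon^3} = \epsilon\sqrt{L\epsilon}, \qquad \text{so that}\quad \frac{\epsilon L}{K^2} = \frac{K^2}{\epsilon^2} = \sqrt{\frac L\epsilon}. \]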

However, MD's requirement on the gradient is global (the bound \(\rho\) must hold everywhere), whereas GD's progress is a per-point statement, so for a general function we cannot cleanly split the two regimes with a single global threshold \(K\). This brings us to today's main topic: linear coupling. When GD produces \(\bm y_k\) and MD produces \(\bm z_k\) as their next points, we take \(\bm x_{k+1} = \tau \bm z_k + (1 - \tau) \bm y_k\). Concretely (a small code sketch follows the list):

  • \(\bm x_0 = \bm y_0 = \bm z_0\).
  • \(\bm x_{k+1} = \tau \bm z_k + (1 - \tau)\bm y_k\).
  • \(\bm y_{k+1} = \bm x_{k+1} - \frac 1L \nabla f(\bm x_{k+1})\).
  • \(\bm z_{k+1} = \mathrm{Mirr}_{\bm z_k}(\alpha \nabla f(\bm x_{k+1}))\).
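A minimal sketch of the loop, assuming the Euclidean distance-generating function \(w(\bm x) = \frac 12\|\bm x\|_2^2\) (so the Mirror step is simply \(\bm z_{k+1} = \bm z_k - \alpha \nabla f(\bm x_{k+1})\)) and with \(\tau\) set by \(\frac{1-\tau}\tau = \alpha L\) as in the proof below; the test objective, \(\alpha\), and \(T\) are illustrative assumptions:

```python
import numpy as np

def linear_coupling(grad_f, L, x0, alpha, T):
    """Linear coupling of a gradient step and a mirror step, using the
    Euclidean mirror map (so Mirr_z(xi) = z - xi)."""
    tau = 1.0 / (1.0 + alpha * L)             # (1 - tau) / tau = alpha * L
    x = y = z = np.asarray(x0, dtype=float)
    xs = []
    for _ in range(T):
        x = tau * z + (1.0 - tau) * y         # couple the two sequences
        g = grad_f(x)
        y = x - g / L                         # gradient (primal) step
        z = z - alpha * g                     # mirror (dual) step, Euclidean case
        xs.append(x)
    return np.mean(xs, axis=0)                # the averaged iterate

# Illustrative use: f(x) = 1/2 ||A x - b||^2, which is L-smooth with L = ||A^T A||_2.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
L = np.linalg.norm(A.T @ A, 2)
x_bar = linear_coupling(lambda x: A.T @ (A @ x - b), L, np.zeros(5), alpha=0.1, T=500)
print(0.5 * np.linalg.norm(A @ x_bar - b) ** 2)
```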

Proposition. There exists \(\tau \in (0, 1)\) such that after \(T = O(\frac 1{\sqrt{\epsilon}})\) rounds, \(f(\overline {\bm x}) - f(\bm x^*) < \epsilon\).

Here we deliberately, and somewhat slyly, make the subscript of \(\mathrm{Mirr}\) a different point from the one at which the gradient is evaluated; this is exactly what the generalization buys us. In the MD derivation above we never strongly relied on the subscript matching the gradient's argument, and here decoupling them is needed only so that the terms telescope. Only after the generalization do we realize that the two arguments can be chosen freely; from the pure GD viewpoint this operation (taking a step at one point using the gradient from another) would look unreasonable.

Proof.

We try the same approach as in the MD proof, but instead of bounding \(\|\nabla f(\bm x)\|^2_2\) by \(\rho^2\), we use the GD step to turn it into function values at \(\bm y\), and then work out how to make the \(\bm y\) terms telescope as well.

\[\begin{aligned} \alpha \langle \nabla f(\bm x_{k+1}), \bm z_k - \bm u \rangle &\leq \frac{\alpha^2}2 \|\nabla f(\bm x_{k+1})\|_2^2 + V_{\bm z_k}(\bm u) - V_{\bm z_{k+1}}(\bm u) \\ &\leq \alpha^2 L(f(\bm x_{k+1}) - f(\bm y_{k+1})) + V_{\bm z_k}(\bm u) - V_{\bm z_{k+1}}(\bm u) && f(\bm x_{k+1}) - f(\bm y_{k+1}) \geq \frac 1{2L}\|\nabla f(\bm x_{k+1})\|_2^2 \end{aligned}\]

\[\begin{aligned} \alpha \langle \nabla f(\bm x_{k+1}), \bm x_{k+1} - \bm u \rangle - \alpha \langle \nabla f(\bm x_{k+1}), \bm z_k - \bm u \rangle &= \alpha \langle\nabla f(\bm x_{k+1}), \bm x_{k+1} - \bm z_k\rangle \\ &= \frac{(1 - \tau)\alpha}{\tau} \langle \nabla f(\bm x_{k+1}), \bm y_k - \bm x_{k+1} \rangle && \tau(\bm x_{k+1} - \bm z_k) = (1 - \tau)(\bm y_k - \bm x_{k+1}) \\ &\leq \frac{(1 - \tau)\alpha}{\tau} (f(\bm y_k) - f(\bm x_{k+1})) && \text{Convexity} \end{aligned}\]

Hence, setting \(\frac{1 - \tau}{\tau} = \alpha L\), we get

\[\alpha \langle \nabla f(\bm x_{k+1}), \bm x_{k+1} - \bm u \rangle \leq \alpha^2 L (f(\bm y_k) - f(\bm y_{k+1})) + V_{\bm z_k}(\bm u) - V_{\bm z_{k+1}}(\bm u) \]

Summing over \(k = 0, \ldots, T-1\) with \(\bm u = \bm x^*\), writing \(\overline{\bm x} = \frac 1T \sum_{k=1}^{T} \bm x_k\), and telescoping both the \(f(\bm y)\) terms and the \(V\) terms,

\[\begin{aligned} \alpha T (f(\overline {\bm x}) - f(\bm x^*)) &\leq \sum_{k=0}^{T - 1} \alpha \langle \nabla f(\bm x_{k+1}), \bm x_{k+1} - \bm x^* \rangle \\ &\leq \alpha^2 L(f(\bm y_0) - f(\bm y_T)) + V_{\bm z_0}(\bm x^*) - V_{\bm z_T}(\bm x^*) \end{aligned}\]

Assume \(f(\bm y_0) - f(\bm x^*) \leq d\) and \(V_{\bm x_0}(\bm x^*) \leq \Theta\) (recall \(\bm z_0 = \bm x_0\)); then, since \(f(\bm y_T) \geq f(\bm x^*)\) and \(V_{\bm z_T}(\bm x^*) \geq 0\),

\[f(\overline{\bm x}) - f(\bm x^*) \leq \frac 1T \left(\alpha Ld + \frac \Theta \alpha\right) \]

\(\alpha = \sqrt{\frac \Theta{Ld}}\) 即可得到

\[f(\overline{\bm x}) - f(\bm x^*) \leq \frac{2\sqrt{L \Theta d}}T \]

\(T = 4 \sqrt{\frac{L \Theta}d}\),则

\[f(\overline {\bm x}) - f(\bm x^*) \leq \frac d2 \]

At this point we reset \(\alpha\) according to the halved gap and run another \(T' = 4\sqrt{\frac{2L\Theta}{d}}\) rounds, halving the error once more; repeating this until the error reaches \(\epsilon\), the total number of rounds is

\[O\left(\sqrt{\frac {L \Theta}\epsilon} + \sqrt{\frac {L \Theta}{2 \epsilon}} + \sqrt{\frac {L \Theta}{4\epsilon}} + \ldots\right) = O\left(\sqrt{\frac{L \Theta}{\epsilon}}\right) \]
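A sketch of this halving schedule, reusing the linear_coupling sketch above; the initial gap bound d, the value of \(\Theta\) (assumed to keep bounding \(V_{\bm x_0}(\bm x^*)\) at each restart), and the stopping rule are illustrative assumptions:

```python
import numpy as np

def linear_coupling_restarts(grad_f, L, x0, d, Theta, eps):
    """Run linear coupling in stages; each stage halves the optimality-gap
    bound d, resetting alpha and T from the current d as in the text above."""
    x = np.asarray(x0, dtype=float)
    while d > eps:
        alpha = np.sqrt(Theta / (L * d))                 # alpha = sqrt(Theta / (L d))
        T = int(np.ceil(4.0 * np.sqrt(L * Theta / d)))   # T = 4 sqrt(L Theta / d)
        x = linear_coupling(grad_f, L, x, alpha, T)      # restart from the averaged iterate
        d = d / 2.0                                      # the gap bound is halved
    return x
```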