Recursion in optimization

In this blog post, I aim to provide a overview of the various recursive methods I have seen in convex optimization. Optimization methods often yield a sequence denoted as \(\{x_t\}_{t\geq 0}\), prompting a detailed analysis of the convergence properties inherent in these sequences.

Convergent sequence

The concept of convergent and divergent sequences is fundamental in advanced mathematical studies, such as the well-known Cauchy sequence. In the following discussion, I will narrow our focus to specific recursion techniques, shedding light on their convergence properties. To delve deeper into this subject, let's explore a particular recursion method:

\[\begin{equation} x_{t+1} = \gamma_t x_t + \varepsilon_t. \label{eq:base} \end{equation} \]

The formulation mentioned above holds universal significance in the field of optimization (see Polyak ch 2.2.3).

Now, let's delve into two fundamental lemmas that play a crucial role in determining the convergence of a sequence governed by Eq.\eqref{eq:base}.

Lemma 1 Let \(x_t \geq 0\) and

\[x_{t+1} = \gamma_t x_{t} + \varepsilon_t, \]

where \(0\leq \gamma_t < 1\), \(\varepsilon \geq 0\), \(\sum_{t=0}^{\infty} (1-\gamma_t) =\infty\), and \(\varepsilon_t/ (1-\gamma_t) \rightarrow 0\), then

\[x_{t+1}\to 0. \]

The condition \(\frac{\varepsilon}{1-\gamma_t} \to 0\) implies that \(\varepsilon = o(1-\gamma_t)\).

Lemma 2 Let \(x_t \geq 0\) and

\[x_{t+1} = (1+\alpha_t) x_{t} + \varepsilon_t, \]

where \(\sum_{t=0}^{\infty} \alpha_t < \infty\) and \(\sum_{t=0}^\infty \varepsilon_t < \infty\), then

\[x_{t+1}\to \bar x. \]

Drawing from these two lemmas, it becomes evident that various convergent recursions can be derived, as illustrated in Franci, B..

Convergence rate

When dealing with a convergent sequence, a critical question arises: How many iterations are needed to obtain an \(\epsilon\)-approximate solution? In other words, what is the convergence rate for this sequence? Understanding the convergence rate allows us to select the most efficient method based on the least number of iterations required.

To begin our analysis, let's consider a scenario where both \(\gamma_t \equiv \gamma\) and \(\varepsilon_t \equiv \varepsilon\) are fixed. While this violates the assumption in Lemma 1, it provides a starting point for our exploration. Expanding the recursion \eqref{eq:base}, we obtain:

\[ x_{t+1} \leq \gamma^{T+1} x_0 +\frac{\varepsilon}{1-\gamma}, \]

This indicates that \(x_{t+1}\) lies within the \(\frac{\varepsilon}{1-\gamma}\)-neighborhood of \(0\). Achieving a convergence guarantee necessitates driving \(\frac{\varepsilon}{1-\gamma}\) to approach \(0\), as outlined in Lemma 1.

As highlighted in Bottou thm4.6, employing a constant learning rate results in the linear convergence of expected objective values to a neighborhood of the optimal value. While a smaller stepsize might degrade the contraction constant in the convergence rate, it facilitates approaching closer to the optimal value.

To ensure convergence, we opt for a diminishing stepsize strategy, leading to a linear convergence rate of \(\mathcal{O}\left(\frac{1}{T}\right)\).

In general, we reformulate \eqref{eq:base} as follows:

\[x_{t+1} \leq (1-\eta_t) x_t + \eta_t^2\sigma^2. \]

Theorem 1 Let \(x_t \geq 0\) and

\[x_{t+1} \leq (1-\eta_t) x_t + \eta_t^2\sigma^2, \]

where \(\eta_t = \frac{b}{a + t}\), \(b, \sigma^2 > 0\), \(a\geq 0\), then

\[x_{T} \leq \frac{v}{a + T}, \]

where \(v=\max\{(a + 1)x_0, \frac{b^2\sigma^2}{b-1}\}\).

We prove it by induction. When \(t=1\), it obviously holds. Assume that it holds for \(t\), then

\[x_{t+1} \leq \frac{t + a - 1}{(t+a)^2}v - \frac{bv - 1}{(a+t)^2} v + \frac{b^2\sigma^2}{(t+a)^2}, \]

which leads to \(x_{t+1} \leq \frac{v}{t+1+a}\).

In Polyak lemma 2.2.4, there is the convergence rate for the recursion

\[x_{t+1} = (1-\frac{c}{t})x_t + \frac{d}{t^{p+1}}, \]

\[\begin{matrix} x_t &\leq &d(c-p)^{-1}t^{-p} + o(t^{-p}), & c > p,\\ x_t &= &\mathcal{O}(t^{-c}\log t), & p =c,\\ x_t &= &\mathcal{O}(t^{-c}), & p> c. \end{matrix} \]

As shown in Bottou thm5.1, when \(\varepsilon_t\) converges geometrically, the recursion exhibits a linear convergence rate with a constant stepsize, a technique often referred to as dynamic sampling.