[RL] Lecture 6: Stochastic Approximation and Stochastic Gradient Descent

Published 2023-08-12 16:38:14 · Author: 鸽鸽的书房


6.1 Motivating examples

Mean Estimation

Revisit the mean estimation problem:

  • Consider a random variable \(X\).
  • Our aim is to estimate \(\mathbb{E}[X]\).
  • Suppose that we collected a sequence of iid samples \(\left\{x_i\right\}_{i=1}^N\).
  • The expectation of \(X\) can be approximated by

\[\mathbb{E}[X] \approx \bar{x}:=\frac{1}{N} \sum_{i=1}^N x_i . \]

Sample N times, collect all the data, and take the average.

We already know from the last lecture:

  • This approximation is the basic idea of Monte Carlo estimation.
  • We know that \(\bar{x} \rightarrow \mathbb{E}[X]\) as \(N \rightarrow \infty\), i.e., \(\bar{x}\) gradually approaches the true value.

Why do we care about mean estimation so much?

  • Many values in RL, such as state/action values, are defined as means, and these means must be estimated from data.
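As a quick sketch (assuming, purely for illustration, samples drawn from an exponential distribution with \(\mathbb{E}[X]=2\)), Monte Carlo mean estimation is just an average:

```python
import numpy as np

# Monte Carlo mean estimation: average N iid samples.
# (Illustrative choice: an exponential distribution with E[X] = 2.0.)
x = np.random.default_rng(0).exponential(scale=2.0, size=100_000)
print(x.mean())  # ~2.0, approaching E[X] as N grows
```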

Computing the mean iteratively

Can we compute the mean in an incremental and iterative manner? That is, update the estimate as each sample arrives rather than waiting for all the data, which is more efficient.

Define the running average of the first \(k\) samples:

\[w_{k+1}=\frac{1}{k} \sum_{i=1}^k x_i, \quad k=1,2, \ldots \]

Since \(w_k=\frac{1}{k-1} \sum_{i=1}^{k-1} x_i\), we have \(w_{k+1}=\frac{1}{k}\left((k-1) w_k+x_k\right)\), which can be rearranged into the incremental form:

\[w_{k+1}=w_k-\frac{1}{k}\left(w_k-x_k\right) \]

\[w_k \rightarrow \mathbb{E}[X] \text { as } k \rightarrow \infty \]
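A minimal Python sketch of this update (assuming, for illustration, standard-normal samples shifted by a hypothetical true mean of 5.0), comparing the incremental estimate with the batch average:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = 5.0                       # hypothetical E[X] for this demo
samples = true_mean + rng.standard_normal(10_000)

# Incremental update: w_{k+1} = w_k - (1/k) * (w_k - x_k), with w_2 = x_1
w = samples[0]
for k in range(2, len(samples) + 1):
    x_k = samples[k - 1]              # x_k, using 1-based indexing as in the text
    w -= (1.0 / k) * (w - x_k)

print(f"incremental: {w:.4f}, batch: {samples.mean():.4f}")
# The two agree up to floating-point error, and both approach 5.0.
```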

6.2 Robbins-Monro algorithm

Stochastic approximation (SA)

  • SA refers to a broad class of stochastic iterative algorithms for solving root-finding or optimization problems.
  • Compared to many other root-finding algorithms such as gradient-based methods, SA is powerful in the sense that it requires neither an explicit expression of the objective function nor its derivative.

Problem statement

Suppose we would like to find the root of the equation

\[g(w)=0, \]

where \(w \in \mathbb{R}\) is the variable to be solved and \(g: \mathbb{R} \rightarrow \mathbb{R}\) is a function.

  • Many problems can eventually be converted to this root-finding problem. For example, suppose \(J(w)\) is an objective function to be minimized. Then, the optimization problem can be converted to

\[g(w)=\nabla_w J(w)=0 \]

i.e., finding where the gradient is zero.

  • Note that an equation like \(g(w)=c\) with \(c\) as a constant can also be converted to the above equation by rewriting \(g(w)-c\) as a new function.
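As a simple illustrative example (not from the lecture): take \(J(w)=\frac{1}{2}(w-3)^2\). Then

\[g(w)=\nabla_w J(w)=w-3, \]

and solving \(g(w)=0\) yields \(w^*=3\), which is exactly the minimizer of \(J\).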

The RM algorithm

The goal is to solve \(g(w)=0\).

The Robbins-Monro (RM) algorithm can solve this problem:

\[w_{k+1}=w_k-a_k \tilde{g}\left(w_k, \eta_k\right), \quad k=1,2,3, \ldots \]

where

  • \(w_k\) is the \(k\)th estimate of the root;
  • \(\tilde{g}\left(w_k, \eta_k\right)=g\left(w_k\right)+\eta_k\) is the \(k\)th noisy observation;
  • \(a_k\) is a positive coefficient.

The function \(g(w)\) is a black box! The algorithm relies only on data:

  • Input sequence: \(\left\{w_k\right\}\)
  • Noisy output sequence: \(\left\{\tilde{g}\left(w_k, \eta_k\right)\right\}\)

In the Robbins-Monro algorithm, if

  1. \(0<c_1 \leq \nabla_w g(w) \leq c_2\) for all \(w\);
  2. \(\sum_{k=1}^{\infty} a_k=\infty\) and \(\sum_{k=1}^{\infty} a_k^2<\infty\);
  3. \(\mathbb{E}\left[\eta_k \mid \mathcal{H}_k\right]=0\) and \(\mathbb{E}\left[\eta_k^2 \mid \mathcal{H}_k\right]<\infty\),

where \(\mathcal{H}_k=\left\{w_k, w_{k-1}, \ldots\right\}\), then \(w_k\) converges with probability 1 (w.p.1) to the root \(w^*\) satisfying \(g\left(w^*\right)=0\).
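A minimal Python sketch of the RM iteration, assuming a toy linear function \(g(w)=2w-3\) (so \(\nabla_w g \equiv 2\), satisfying condition 1), step sizes \(a_k=1/k\) (satisfying condition 2), and zero-mean Gaussian noise (satisfying condition 3); the true root is \(w^*=1.5\):

```python
import numpy as np

rng = np.random.default_rng(0)

def g_tilde(w):
    """Noisy observation g~(w, eta) = g(w) + eta, with g(w) = 2w - 3.
    The algorithm only sees this black-box output, never g itself."""
    return (2.0 * w - 3.0) + rng.standard_normal()

w = 0.0                         # arbitrary initial estimate w_1
for k in range(1, 5001):
    a_k = 1.0 / k               # sum a_k = inf, sum a_k^2 < inf
    w = w - a_k * g_tilde(w)    # RM update: w_{k+1} = w_k - a_k * g~(w_k, eta_k)

print(f"estimate: {w:.4f}  (true root: 1.5)")
```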