RL 基础 | Value Iteration 的收敛性证明

发布时间 2023-10-24 09:48:24作者: MoonOut

(其实是专业课作业? 感觉算法岗面试可能会问,来存一下档)

问题:证明 Value Iteration 收敛性

Please prove the convergence of Value Iteration algorithm in discounted reward setting, and analyze the performance between resulting policy and optimal policy. 证明 Value Iteration 的收敛性。

背景:

  • Policy Iteration:先得到一个 policy,然后算出它的 value function,基于该 value function 取 max_a Q(s,a) 作为新 policy,再算新 policy 的 value function…
  • Value Iteration:不停对 value function 使用贝尔曼算子,新 V = max_a [r(s,a) + γV(s')]。

0 Definitions - 定义

First, we define

  • Discrete state space \(S=\{s_1,\cdots,s_n\}\), discrete action space \(A=\{a_1,\cdots,a_n\}\).
  • Value function: \(V(s)=E_{a\sim\pi(a|s)}\big[r(s,a) + \gamma E_{s'\sim p(s'|s,a)}V(s')\big]\) .
  • Bellman operator: for all \(s\in S\) ,

\[\begin{aligned} B_\pi V(s) &\triangleq E_{a\sim\pi(a|s)}\big[r(s,a)+\gamma E_{s'\sim p(s'|s,a)}V(s') \big] \\ B_* V(s) &\triangleq \max_a \big[r(s,a)+\gamma E_{s'\sim p(s'|s,a)}V(s') \big] \end{aligned} \]

The Value Iteration algorithm is to conduct Bellman operator \(B_*\) on the value function \(V(s)\). We want to prove that the sequence will converge to the optimal value function \(V^*\), which should be the of the Bellman equation \(B_*V=V\) .

我们定义了状态空间、动作空间、值函数(value function)和贝尔曼算子(Bellman operator)。VI 算法,就是对 value function 一直使用 Bellman operator,希望最后能收敛到最优的 value function。这样,该最优 value function 必然是 Bellman equation \(B_*V=V\) 的不动点。

1 Bellman operator is a Contraction Mapping - 贝尔曼算子是压缩映射

Then, we need to prove that Bellman Operator is a Contraction Mapping. The definition of Contraction Mapping is as follows:

\[d(f(x_1),f(x_2))\le\gamma d(x_1,x_2), \]

where \(d\) is a measurement over a metric space, \(\gamma\in(0,1)\), and \(f\) is the Contraction Mapping. In our case, the definition of Contraction Mapping is as follows:

\[|f(x_1)-f(x_2)|_\infty \le \gamma|x_1-x_2|_\infty, \]

where \(||_\infty\) is the infinite norm (return the max value over all dimensions of a vector).

接下来,证明贝尔曼算子是压缩映射(Contraction Mapping)。在这里,压缩映射的定义是,两个变量的无穷范数(向量各维之差的最大绝对值)会被 \(f\) 映射压缩 \(\gamma\) 倍。

Let's prove that Bellman Operator \(B_*\) is a Contraction Mapping. The proof is as follows.

\[\begin{aligned} &|B_*V_1(s)-B_*V_2(s)| \\ &=\bigg|\max_a\big[r(s,a)+\gamma E_{s'~p(s'|s,a)}V_1(s')\big] - \max_a\big[r(s,a)+\gamma E_{s}V_2(s')\big] \bigg| \\ &\le \bigg| \max_a[r(s,a) + \gamma E_{s'}V_1(s') - r(s,a) - \gamma E_{s'}V_2(s')]\bigg| \\ &= \bigg|\gamma\max_a E_{s'}(V_1(s')-V_2(s'))\bigg| \\ &\le \gamma\max_a\bigg|E_{s'}(V_1(s')-V_2(s'))\bigg| \\ &= \gamma\max_a \bigg|\sum_{s'}p(s'|s,a)\big[V_1(s')-V_2(s')\big]\bigg| \\ &\le \gamma\max_a \sum_{s'}p(s'|s,a)\bigg|V_1(s')-V_2(s')\bigg| \\ &= \gamma \sum_{s'}p(s'|s,a_*(s))\bigg|V_1(s')-V_2(s')\bigg| \\ &\le \gamma \sum_{s'}p(s'|s,a_*(s))\max_s\bigg|V_1(s)-V_2(s)\bigg| \\ &= \gamma|V_1-V_2|_\infty \end{aligned} \]

贝尔曼算子是压缩映射的证明:一连串放缩,如上。

2 Contraction Mapping converges to the optimal value function - 压缩映射能收敛到最优值函数

If Bellman Operator is a Contraction Mapping, then we can use the Banach fixed point theorem to prove that, the optimal value function \(V^*\) is the unique fixed point of the Bellman Operator \(B_*\), and the sequence \(B_*(B_*(\cdots B_*V))\) converges to \(V^*\).

如果贝尔曼算子是压缩映射,则可以用巴拿赫不动点定理(Banach fixed point theorem),证明 \(V^*\) 是 Bellman Equation \(B_*V=V\) 的唯一不动点,并且一定会收敛到该不动点。

The first step is to prove that the fixed point \(V^*\) is unique. Use proof by contradiction. Suppose there are two converged value \(V_1^*,V_2^*\), then we have \(|B_*V_1^*-B_*V_2^*|_\infty=|V_1^*-V_2^*|_\infty\). However, Bellman operator \(B_*\) is a Contraction Mapping, so there is \(|B_*V_1^*-B_*V_2^*|_\infty\le\gamma|V_1^*-V_2^*|_\infty\), which contradicts the previous one. Thus, there can be only one converged value, and it is obviously the fixed point \(V^*\).

第一步,证明收敛到的不动点是唯一的。使用反证法,假设有两个不动点,那么它们俩的压缩映射都映射到原地(不动点嘛),距离没有 γ 倍压缩,矛盾,从而得证。

The second step is to prove that the fixed point \(V^*\) really exists. If we can prove that \(\{V,B_*V,B_*^2V,\cdots\}\) is a Cauchy sequence, the fixed point \(V^*\) exists. Take two value \(B_*^a,B_*^b\) from the sequence, where \(a\gg b\). Use the triangle inequality of \(||_\infty\) as a measurement, we can get

\[\begin{aligned} |B_*^a-B_*^b|_\infty &\le |B_*^a-B_*^{a-1}|_\infty + |B_*^{a-1}-B_*^b|_\infty \\ &\le |B_*^a-B_*^{a-1}|_\infty + |B_*^{a-1}-B_*^{a-2}|_\infty + |B_*^{a-2}-B_*^b|_\infty \\ &\le |B_*^a-B_*^{a-1}|_\infty + \cdots + |B_*^{b+1}-B_*^b|_\infty \\ &\le \gamma^b|B_*V-V|\sum_{i=0}^{a-b-1}\gamma^i \le \frac{\gamma^b}{1-\gamma}|B_*V-V| \end{aligned}. \]

So, if we can choose a large enough \(b\), the value of \(|B_*^a-B_*^b|_\infty\) can be less than any \(\epsilon\gt0\). Thus, there must exists an \(N\in\mathbb N\), such that \(|B_*^a-B_*^b|_\infty\lt\epsilon\), for all \(a,b\gt N\) . The sequence is a Cauchy sequence, and the fixed point \(V^*\) exists.

第二步,证明不动点 \(V^*\) 存在。去证 \(\{V,B_*V,B_*^2V,\cdots\}\) 是柯西序列(Cauchy sequence),如果是柯西序列,那么就存在收敛的不动点。柯西序列的定义是,对于序列里的 \(x_a,x_b\),存在正整数 N 使得对于任意 \(a,b\gt N\),度量 \(d(x_a,x_b)\lt\epsilon\),其中 \(\epsilon\) 是任意小的正实数。先使用三角不等式,将 \(x_a-x_b\) 的形式拆成 \((x_a-x_{a-1})+(x_{a-1}-x_{a-2})+\cdots+(x_{b+1}-x_b)\),然后使用压缩映射的性质,放缩成 γ 的等比序列求和,即可得到,它小于任意的小实数。

Reference - 参考资料