【RL】CH2-Bellman equation

Published 2023-08-13 16:14:12, author: 鸽鸽的书房

The discounted return

\[\begin{aligned} G_t & =R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\ldots \\ & =R_{t+1}+\gamma\left(R_{t+2}+\gamma R_{t+3}+\ldots\right) \\ & =R_{t+1}+\gamma G_{t+1} \end{aligned} \]
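The recursion \(G_t = R_{t+1} + \gamma G_{t+1}\) can be checked numerically. The following is a minimal sketch, assuming a finite reward sequence with all later rewards treated as zero (a truncation assumption that the text does not make). The function name `discounted_returns` and the sample rewards are hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] for a finite reward sequence.

    rewards[t] plays the role of R_{t+1}. Rewards beyond the end of the
    list are assumed to be zero (illustrative truncation assumption).
    """
    returns = [0.0] * len(rewards)
    g_next = 0.0  # G_T = 0 under the truncation assumption
    for t in reversed(range(len(rewards))):
        # Backward recursion: G_t = R_{t+1} + gamma * G_{t+1}
        g_next = rewards[t] + gamma * g_next
        returns[t] = g_next
    return returns


rewards = [1.0, 0.0, 2.0]  # hypothetical R_1, R_2, R_3
gamma = 0.5
returns = discounted_returns(rewards, gamma)

# Direct evaluation of the defining sum for G_0, for comparison.
g0_direct = sum(gamma**k * r for k, r in enumerate(rewards))
```

Computing the returns backward reuses \(G_{t+1}\) at every step, so the whole list costs one pass over the rewards instead of one sum per time step.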

State-value function (the state value of \(s\)): \(v_\pi(s)\)

\[\begin{aligned} v_\pi(s) & =\mathbb{E}\left[G_t \mid S_t=s\right] \\ & =\mathbb{E}\left[R_{t+1}+\gamma G_{t+1} \mid S_t=s\right] \\ & =\mathbb{E}\left[R_{t+1} \mid S_t=s\right]+\gamma \mathbb{E}\left[G_{t+1} \mid S_t=s\right] \end{aligned} \]

Bellman Equation

\[\begin{aligned} v_\pi(s) & =\mathbb{E}\left[R_{t+1} \mid S_t=s\right]+\gamma \mathbb{E}\left[G_{t+1} \mid S_t=s\right] \\ & =\underbrace{\sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{r \in \mathcal{R}} p(r \mid s, a) r}_{\text{mean of immediate rewards}}+\underbrace{\gamma \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s^{\prime} \in \mathcal{S}} p\left(s^{\prime} \mid s, a\right) v_\pi\left(s^{\prime}\right)}_{\text{mean of future rewards}} \\ & =\sum_{a \in \mathcal{A}} \pi(a \mid s)\left[\sum_{r \in \mathcal{R}} p(r \mid s, a) r+\gamma \sum_{s^{\prime} \in \mathcal{S}} p\left(s^{\prime} \mid s, a\right) v_\pi\left(s^{\prime}\right)\right], \quad \text{for all } s \in \mathcal{S} . \end{aligned} \]
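To see the Bellman equation at work, the sketch below evaluates its right-hand side on a tiny two-state, two-action problem and iterates until the values stop changing. Everything here is an assumption made for illustration: the tables `PI`, `P_R`, `P_S`, the value of `GAMMA`, and the choice of fixed-point iteration as the solution method, which the text above does not discuss.

```python
GAMMA = 0.9
STATES = (0, 1)
ACTIONS = (0, 1)

# Hypothetical policy pi(a | s): PI[s][a]
PI = {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0, 1: 0.0}}
# Hypothetical reward model p(r | s, a): P_R[(s, a)][r]
P_R = {(0, 0): {0.0: 1.0}, (0, 1): {1.0: 1.0},
       (1, 0): {1.0: 1.0}, (1, 1): {0.0: 1.0}}
# Hypothetical transition model p(s' | s, a): P_S[(s, a)][s']
P_S = {(0, 0): {0: 1.0}, (0, 1): {1: 1.0},
       (1, 0): {1: 1.0}, (1, 1): {0: 1.0}}


def bellman_rhs(v, s):
    """Right-hand side of the Bellman equation for state s, given values v."""
    total = 0.0
    for a in ACTIONS:
        # Mean of immediate rewards: sum_r p(r | s, a) * r
        mean_reward = sum(p * r for r, p in P_R[(s, a)].items())
        # Mean of future rewards: sum_s' p(s' | s, a) * v(s')
        mean_future = sum(p * v[s_next] for s_next, p in P_S[(s, a)].items())
        total += PI[s][a] * (mean_reward + GAMMA * mean_future)
    return total


def evaluate_policy(tol=1e-12, max_iters=10_000):
    """Fixed-point iteration v <- bellman_rhs(v); an illustrative solver."""
    v = {s: 0.0 for s in STATES}
    for _ in range(max_iters):
        new_v = {s: bellman_rhs(v, s) for s in STATES}
        if max(abs(new_v[s] - v[s]) for s in STATES) < tol:
            return new_v
        v = new_v
    return v


v_pi = evaluate_policy()
```

For this toy problem the equations can also be solved by hand: state 1 satisfies \(v_\pi(1)=1+0.9\,v_\pi(1)\), so \(v_\pi(1)=10\), and state 0 satisfies \(v_\pi(0)=0.45\,v_\pi(0)+0.5+4.5\), so \(v_\pi(0)=100/11\). The iteration converges to the same values.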

Two equivalent expressions of the Bellman equation

First

First, it follows from the law of total probability that

\[\begin{aligned} & p\left(s^{\prime} \mid s, a\right)=\sum_{r \in \mathcal{R}} p\left(s^{\prime}, r \mid s, a\right), \\ & p(r \mid s, a)=\sum_{s^{\prime} \in \mathcal{S}} p\left(s^{\prime}, r \mid s, a\right) . \end{aligned} \]

Then the Bellman equation above, referred to here as (2.7), can be rewritten as

\[v_\pi(s)=\sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s^{\prime} \in \mathcal{S}} \sum_{r \in \mathcal{R}} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma v_\pi\left(s^{\prime}\right)\right] \]
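The equivalence can be checked numerically for a single \((s, a)\) pair, since the outer sum over actions weighted by \(\pi(a \mid s)\) is the same in both forms. This is a minimal sketch: the joint distribution `JOINT`, the placeholder values `V`, and the discount `GAMMA` are all made up for illustration.

```python
GAMMA = 0.9

# Hypothetical joint distribution p(s', r | s, a) for one fixed (s, a).
JOINT = {("s1", 0.0): 0.2, ("s1", 1.0): 0.3, ("s2", 1.0): 0.5}
# Placeholder state values v_pi(s'); any numbers work for this check.
V = {"s1": 2.0, "s2": 4.0}


def marginal_next_state(joint):
    """p(s' | s, a) = sum_r p(s', r | s, a)."""
    out = {}
    for (s_next, _r), p in joint.items():
        out[s_next] = out.get(s_next, 0.0) + p
    return out


def marginal_reward(joint):
    """p(r | s, a) = sum_s' p(s', r | s, a)."""
    out = {}
    for (_s_next, r), p in joint.items():
        out[r] = out.get(r, 0.0) + p
    return out


p_s = marginal_next_state(JOINT)
p_r = marginal_reward(JOINT)

# Separated form: sum_r p(r|s,a) r + gamma * sum_s' p(s'|s,a) v(s')
separated = (sum(p * r for r, p in p_r.items())
             + GAMMA * sum(p * V[s_next] for s_next, p in p_s.items()))

# Joint form: sum_s' sum_r p(s', r | s, a) [r + gamma * v(s')]
joint_form = sum(p * (r + GAMMA * V[s_next])
                 for (s_next, r), p in JOINT.items())
```

Both quantities agree up to floating-point rounding, which is what the law of total probability guarantees.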

Second

Second, in some problems the reward \(r\) depends solely on the next state \(s^{\prime}\). We can then write the reward as \(r\left(s^{\prime}\right)\), so that \(p\left(r\left(s^{\prime}\right) \mid s, a\right)=p\left(s^{\prime} \mid s, a\right)\). Substituting this into the Bellman equation \((2.7)\) gives

\[v_\pi(s)=\sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s^{\prime} \in \mathcal{S}} p\left(s^{\prime} \mid s, a\right)\left[r\left(s^{\prime}\right)+\gamma v_\pi\left(s^{\prime}\right)\right] \]
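The substitution step can be spelled out through the joint form of the Bellman equation. The sketch assumes the reward is a deterministic function \(r\left(s^{\prime}\right)\) of the next state, so the joint distribution puts all of its mass at \(r=r\left(s^{\prime}\right)\). The indicator notation \(\mathbb{1}[\cdot]\) is introduced only for this illustration.

\[\begin{aligned} p\left(s^{\prime}, r \mid s, a\right) & =p\left(s^{\prime} \mid s, a\right) \mathbb{1}\left[r=r\left(s^{\prime}\right)\right], \\ \sum_{s^{\prime} \in \mathcal{S}} \sum_{r \in \mathcal{R}} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma v_\pi\left(s^{\prime}\right)\right] & =\sum_{s^{\prime} \in \mathcal{S}} p\left(s^{\prime} \mid s, a\right)\left[r\left(s^{\prime}\right)+\gamma v_\pi\left(s^{\prime}\right)\right] . \end{aligned} \]

Weighting both sides by \(\pi(a \mid s)\) and summing over \(a \in \mathcal{A}\) recovers the expression above.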