【RL】L7-Temporal-difference learning

Published 2023-08-13 16:54:55 | Author: 鸽鸽的书房

TD learning of state values

The data/experience required by the algorithm:

  • \(\left(s_0, r_1, s_1, \ldots, s_t, r_{t+1}, s_{t+1}, \ldots\right)\) or \(\left\{\left(s_t, r_{t+1}, s_{t+1}\right)\right\}_t\) generated by following the given policy \(\pi\).

The TD learning algorithm is

\[\begin{aligned} & v_{t+1}\left(s_t\right)=v_t\left(s_t\right)-\alpha_t\left(s_t\right)\left[v_t\left(s_t\right)-\left[r_{t+1}+\gamma v_t\left(s_{t+1}\right)\right]\right], \\ & v_{t+1}(s)=v_t(s), \quad \forall s \neq s_t \end{aligned} \]

where \(t=0,1,2, \ldots\). Here, \(v_t\left(s_t\right)\) is the estimate of the true state value \(v_\pi\left(s_t\right)\), and \(\alpha_t\left(s_t\right)\) is the learning rate for state \(s_t\) at time \(t\).

\(s\): an arbitrary state in the state space \(\mathcal{S}\).
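
As a concrete illustration, here is a minimal tabular TD(0) sketch in Python that consumes the \(\{(s_t, r_{t+1}, s_{t+1})\}_t\) samples described above. The environment interface `env_step`, the `policy` function, the `start_state`, and the constant learning rate `alpha` (standing in for \(\alpha_t(s_t)\)) are assumptions for illustration, not part of the original post.

```python
import numpy as np

def td0_state_values(env_step, policy, num_states, start_state=0,
                     gamma=0.9, alpha=0.1, num_episodes=500, max_steps=100):
    """Tabular TD(0) estimation of v_pi.

    Assumed interfaces (hypothetical, for illustration only):
      env_step(s, a) -> (r, s_next, done)
      policy(s) -> a
    """
    v = np.zeros(num_states)  # current estimate v_t(s), initialized to 0
    for _ in range(num_episodes):
        s = start_state
        for _ in range(max_steps):
            a = policy(s)                      # sample an action from pi
            r, s_next, done = env_step(s, a)   # observe (r_{t+1}, s_{t+1})
            # TD target: r_{t+1} + gamma * v_t(s_{t+1}); a terminal state has value 0
            target = r if done else r + gamma * v[s_next]
            # TD(0) update: v_{t+1}(s_t) = v_t(s_t) - alpha * [v_t(s_t) - target]
            v[s] -= alpha * (v[s] - target)
            if done:
                break
            s = s_next
    return v  # estimates for all states; states never visited keep their initial value
```

Note that a constant \(\alpha\) keeps tracking recent TD targets; the state- and time-dependent \(\alpha_t(s_t)\) in the formula above (decaying appropriately) is what the standard convergence guarantees assume.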