Sample-Based Learning and Search with Permanent and Transient Memories

Published: 2008 (ICML 2008)
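Key points: This paper proposes the Dyna-2 algorithm, which combines sample-based learning with sample-based search, and tests it on Go. The authors regard search as a transient process, a short-term memory that is used and then forgotten, whereas learning algorithms such as Sarsa build a long-term, permanent memory. They therefore maintain two memories at the same time, together with two value functions: \(Q\), the permanent value, and \(\bar Q\), which combines the permanent and transient memories.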
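For the linear function approximation used in the paper, the two value functions can be written roughly as below. This is my reading of the paper's notation, so treat it as a sketch: \(\phi\) and \(\bar\phi\) are the feature vectors of the permanent and transient memories, and \(\theta\) and \(\bar\theta\) are the corresponding weight vectors.

\[
Q(s,a) = \phi(s,a)^\top \theta, \qquad
\bar Q(s,a) = \phi(s,a)^\top \theta + \bar\phi(s,a)^\top \bar\theta
\]

Under this form, clearing the transient memory means setting \(\bar\theta = 0\), at which point \(\bar Q\) falls back to \(Q\).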

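Learning updates \(Q\), while search updates \(\bar Q\). During learning the permanent memory is never cleared, whereas the transient memory is cleared every episode. Specifically, the authors use Sarsa to update both value functions, and actions are selected \(\epsilon\)-greedily.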
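The interplay of the two memories can be sketched in a few lines of Python. This is a minimal illustration under my own assumptions, not the authors' implementation: the 6-state chain task, the `step` function (used both as the real world and as the sample model), the discount `GAMMA`, the step sizes, and the class and method names are all hypothetical. It uses one-hot (tabular) features and one-step Sarsa without eligibility traces. As I read the paper, actions are chosen from \(\bar Q\), real experience updates only the permanent weights, and simulated experience updates only the transient weights.

```python
import random

N_STATES = 6      # hypothetical toy chain: states 0..5, state 5 is terminal
ACTIONS = (0, 1)  # 0 = step left, 1 = step right
GAMMA = 0.9       # discount factor (an assumption of this sketch)


def step(s, a):
    """Toy dynamics, used here both as the real world and as the sample model."""
    s2 = max(0, s - 1) if a == 0 else s + 1
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done


class Dyna2Sketch:
    def __init__(self, alpha=0.1, alpha_bar=0.1, epsilon=0.1, seed=0):
        self.alpha, self.alpha_bar, self.epsilon = alpha, alpha_bar, epsilon
        self.rng = random.Random(seed)
        self.theta = {}      # permanent memory, kept across episodes
        self.theta_bar = {}  # transient memory, wiped for every episode

    def q(self, s, a):
        """Permanent value Q(s, a)."""
        return self.theta.get((s, a), 0.0)

    def q_bar(self, s, a):
        """Combined value Q-bar(s, a): permanent plus transient."""
        return self.q(s, a) + self.theta_bar.get((s, a), 0.0)

    def act(self, s):
        """Epsilon-greedy action selection with respect to Q-bar."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        best = max(self.q_bar(s, a) for a in ACTIONS)
        return self.rng.choice([a for a in ACTIONS if self.q_bar(s, a) == best])

    def new_episode(self):
        """Forget the transient memory; the permanent memory is untouched."""
        self.theta_bar.clear()

    def search(self, s0, n_sim=20, max_steps=50):
        """Simulate episodes from state s0; Sarsa updates theta_bar only."""
        for _ in range(n_sim):
            s, a = s0, self.act(s0)
            for _t in range(max_steps):
                s2, r, done = step(s, a)
                a2 = None if done else self.act(s2)
                target = r if done else r + GAMMA * self.q_bar(s2, a2)
                delta = target - self.q_bar(s, a)
                key = (s, a)
                self.theta_bar[key] = self.theta_bar.get(key, 0.0) + self.alpha_bar * delta
                if done:
                    break
                s, a = s2, a2

    def run_episode(self, max_steps=200):
        """One real episode: search before each move, Sarsa updates theta only."""
        self.new_episode()
        s = 0
        self.search(s)
        a = self.act(s)
        for t in range(1, max_steps + 1):
            s2, r, done = step(s, a)
            a2 = None
            if not done:
                self.search(s2)
                a2 = self.act(s2)
            target = r if done else r + GAMMA * self.q(s2, a2)
            self.theta[(s, a)] = self.q(s, a) + self.alpha * (target - self.q(s, a))
            if done:
                return t
            s, a = s2, a2
        return max_steps


if __name__ == "__main__":
    agent = Dyna2Sketch()
    # Number of real steps taken in each of the first 10 episodes.
    print([agent.run_episode() for _ in range(10)])
```

The two dictionaries are kept separate on purpose: `search` writes only to `theta_bar`, and `new_episode` discards it, so whatever the search worked out for one episode is thrown away while `theta` keeps accumulating across episodes.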
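Summary: This is a very early model-based paper, and today it does not feel especially novel. Still, thinking back to AlphaGo, isn't that also a permanent network plus a transient MCTS? This idea runs through Silver's entire research career.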
I also learned a bit about the difference between planning and search here. Planning is carried out inside the model, so its quality depends on how accurate the model is, whereas search starts its simulations from the real current state (Sample-based planning applies sample-based reinforcement learning methods to simulated experience. This requires a sample model of the world. In sample-based search, experience is simulated from the real state s, so as to identify the best action from this state.).
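Open question: the paper still relied on linear function approximation at the time and spends a lot of space on feature construction, which I could not follow.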