Heuristic-Guided Reinforcement Learning-526互联

发表时间：2021 (NeurIPS 2021)
文章要点：这篇文章提出了一个Heuristic-Guided Reinforcement Learning (HuRL)的框架，用domain knowledge或者offline data构建heuristic，将问题变成一个shorter-horizon的子问题，从而更容易解决。
具体的，就是将原始的MDP变换成一个新的reward和gamma的MDP，其中reward由原始reward和heuristic组成，然后gamma就可以变小了

所以就相当于缩短了horizon。这个方式相当于在reward和heuristic之间做trade off，HuRL effectively introduces horizon-based regularization that determines whether long-term value information should come from collected experiences or the heuristic.
然后作者举了个例子，就想说，如果heuristic很好，可以产生很好的policy，如果heuristic不够好，那么对训练是有害的，

接下来就是几个证明，没看明白。
总结：感觉什么都没说错，但是也什么都没说。可能就是提出了horizon-based regularization for RL这么一个观念吧。
疑问：有的时候真的不能理解，到底什么样的文章能中。