On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient
published: March 25, 2011, recorded: December 2010, views: 3520
Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (a) using the past experience only to estimate the gradient of the expected return at the current policy parameterization, rather than to obtain a more complete estimate, and (b) using only past experience gathered under the current policy, rather than all past experience, to improve the estimates. We present a new policy search method that leverages both of these observations as well as generalized baselines, a new technique that generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds.
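The connection stated in the abstract can be illustrated with a small numerical sketch. This is not the authors' algorithm or code. It assumes a hypothetical one-step problem with a Gaussian policy a ~ N(theta, 1) and a made-up reward function; the names `log_prob`, `reward`, `is_estimate`, and `likelihood_ratio_gradient` are introduced here for illustration only. The sketch builds an importance-sampled estimate of the expected return at a new parameter value from samples drawn under the current one, then checks that differentiating this estimate at the current parameter gives the same number as the usual likelihood ratio gradient estimate.

```python
import math
import random


def log_prob(a, theta):
    """Log density of the assumed 1-D Gaussian policy a ~ N(theta, 1)."""
    return -0.5 * (a - theta) ** 2 - 0.5 * math.log(2 * math.pi)


def reward(a):
    """Hypothetical reward for illustration, maximized at a = 2."""
    return -(a - 2.0) ** 2


def is_estimate(theta_new, theta_old, actions):
    """Importance-sampled estimate of the expected return at theta_new,
    using actions that were sampled under theta_old."""
    total = 0.0
    for a in actions:
        ratio = math.exp(log_prob(a, theta_new) - log_prob(a, theta_old))
        total += ratio * reward(a)
    return total / len(actions)


def likelihood_ratio_gradient(theta, actions):
    """Likelihood ratio gradient estimate: the sample mean of
    d/dtheta log p(a; theta) * R(a). For N(theta, 1) the score
    d/dtheta log p(a; theta) equals (a - theta)."""
    return sum((a - theta) * reward(a) for a in actions) / len(actions)


rng = random.Random(0)
theta = 0.5
actions = [rng.gauss(theta, 1.0) for _ in range(1000)]

# Central finite difference of the importance-sampled estimate,
# taken with respect to the new parameter and evaluated at theta.
eps = 1e-5
fd_grad = (
    is_estimate(theta + eps, theta, actions)
    - is_estimate(theta - eps, theta, actions)
) / (2 * eps)

lr_grad = likelihood_ratio_gradient(theta, actions)

print("finite-difference gradient of IS estimate:", fd_grad)
print("likelihood ratio gradient estimate:      ", lr_grad)
```

The two printed values should agree up to finite-difference error. In this sketch, `is_estimate` can also be evaluated at parameter values away from `theta`, which is the "more complete estimate" that point (a) of the abstract refers to; the likelihood ratio gradient uses only its slope at the current parameter.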
Download slides: nips2010_tang_cbi_01.pdf (129.8 KB)
Download article: nips2010_0796.pdf (493.0 KB)