Online Markov Decision Processes under Bandit Feedback
Published on Mar 25, 2011
3126 Views
We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete