
en-de
en-es
en-fr
en-sl
en
en-zh
0.25
0.5
0.75
1.25
1.5
1.75
2
Online Markov Decision Processes under Bandit Feedback
Published on Feb 4, 20253130 Views
We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete