# Finite-time Analysis of the Multiarmed Bandit Problem

```bibtex
@article{Auer2002FinitetimeAO,
  title   = {Finite-time Analysis of the Multiarmed Bandit Problem},
  author  = {Peter Auer and Nicol{\`o} Cesa-Bianchi and Paul Fischer},
  journal = {Machine Learning},
  year    = {2002},
  volume  = {47},
  pages   = {235--256}
}
```

Reinforcement learning policies face the exploration-versus-exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions and taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is, the loss incurred because the globally optimal policy is not followed at all times. One of the simplest examples of the exploration/exploitation dilemma is the multi…
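The paper is best known for the UCB1 index policy and its logarithmic finite-time regret bound. As a rough illustration (a minimal sketch, not the paper's exact pseudocode; the Bernoulli arms and their means below are assumptions for the example), each arm is played once and thereafter the arm maximizing the empirical mean plus a confidence term sqrt(2 ln t / n_i) is chosen:

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """Sketch of the UCB1 index rule: play each arm once, then always pull
    the arm with the largest upper confidence index."""
    counts = [0] * n_arms   # number of times each arm was played
    sums = [0.0] * n_arms   # cumulative reward per arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1     # initialization: try every arm once
        else:
            # empirical mean + exploration bonus sqrt(2 ln t / n_i)
            arm = max(range(n_arms),
                      key=lambda i: sums[i] / counts[i]
                                    + math.sqrt(2 * math.log(t) / counts[i]))
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts, sums

# Usage: two hypothetical Bernoulli arms with means 0.3 and 0.7;
# UCB1 concentrates its plays on the better arm as t grows.
random.seed(0)
means = [0.3, 0.7]
counts, sums = ucb1(lambda i: 1.0 if random.random() < means[i] else 0.0,
                    n_arms=2, horizon=2000)
```

The exploration bonus shrinks as an arm accumulates plays, so suboptimal arms are sampled only logarithmically often, which is the source of the O(log n) regret guarantee.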


#### 4,934 Citations

An asymptotically optimal policy for finite support models in the multiarmed bandit problem

- Mathematics, Computer Science
- Machine Learning
- 2011

The minimum empirical divergence (MED) policy is proposed and an upper bound on the finite-time regret is derived which meets the asymptotic bound for the case of finite support models.

Risk-Averse Explore-Then-Commit Algorithms for Finite-Time Bandits

- Computer Science, Mathematics
- 2019 IEEE 58th Conference on Decision and Control (CDC)
- 2019

Using a new notion of finite-time exploitation regret, an upper bound of order ln(1/ϵ) is found for the minimum number of experiments before commitment needed to guarantee a regret of at most ϵ.

Pure Exploration in Multi-armed Bandits Problems

- Mathematics, Computer Science
- ALT
- 2009

The main result is that the required exploration-exploitation trade-offs are qualitatively different, in view of a general lower bound on the simple regret in terms of the cumulative regret.

A Structured Multiarmed Bandit Problem and the Greedy Policy

- Computer Science
- IEEE Transactions on Automatic Control
- 2009

In the infinite horizon discounted reward setting, it is shown that the greedy and optimal policies eventually coincide, and both settle on the best arm; this is in contrast with the Incomplete Learning Theorem for the case of independent arms.

On the evolution of the expected gain of a greedy action in the bandit problem

- Mathematics
- 2008

The K-armed bandit problem is a well-known formalization of the exploration versus exploitation dilemma. In a K-armed bandit problem, a player is confronted with a gambling machine with K arms where…

Multi-armed bandit problems with heavy-tailed reward distributions

- Computer Science, Mathematics
- 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton)
- 2011

An approach based on a Deterministic Sequencing of Exploration and Exploitation (DSEE) is developed for constructing sequential arm selection policies, and it is shown that when the moment-generating functions of the arm reward distributions are properly bounded, the optimal logarithmic order of the regret can be achieved by DSEE.

Pure Exploration for Multi-Armed Bandit Problems

- Mathematics, Computer Science
- ArXiv
- 2008

The paper proves that the separable metric spaces are exactly the metric spaces on which these regrets can be minimized with respect to the family of all probability distributions with continuous mean-payoff functions.

Optimal Exploration-Exploitation in a Multi-Armed-Bandit Problem with Non-Stationary Rewards

- Computer Science, Mathematics
- ArXiv
- 2014

This paper fully characterizes the (regret) complexity of this class of MAB problems by establishing a direct link between the extent of allowable reward "variation" and the minimal achievable regret, and draws some connections between two rather disparate strands of literature.

A structured multiarmed bandit problem and the greedy policy

- Computer Science, Mathematics
- 2008 47th IEEE Conference on Decision and Control
- 2008

In the infinite horizon discounted reward setting, it is shown that both the greedy and optimal policies eventually coincide and settle on the best arm, in contrast with the Incomplete Learning Theorem for the case of independent arms.

A dynamic programming strategy to balance exploration and exploitation in the bandit problem

- Computer Science
- Annals of Mathematics and Artificial Intelligence
- 2010

The concept of "expected reward of greedy actions", which is based on the notion of probability of correct selection (PCS), is used in an original semi-uniform algorithm which relies on the dynamic programming framework and on estimation techniques to optimally balance exploration and exploitation.

#### References

Showing 1–10 of 30 references

Gambling in a rigged casino: The adversarial multi-armed bandit problem

- Computer Science, Mathematics
- Proceedings of IEEE 36th Annual Foundations of Computer Science
- 1995

A solution is given to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs.

Sample Mean Based Index Policies with O(log n) Regret for the Multi-Armed Bandit Problem

- Mathematics
- 1995

We consider a non-Bayesian infinite horizon version of the multi-armed bandit problem with the objective of designing simple policies whose regret increases slowly with time. In their seminal work on…

Q-Learning for Bandit Problems

- Mathematics, Computer Science
- ICML
- 1995

This paper suggests utilizing task-state-specific Q-learning agents to solve their respective restart-in-state-$i$ subproblems, and includes an example in which the online reinforcement learning approach is applied to a simple problem of stochastic scheduling.

Multi-Armed bandit problem revisited

- Mathematics
- 1994

In this paper, we revisit aspects of the multi-armed bandit problem in the earlier work (Ref. 1). An alternative proof of the optimality of the Gittins index rule is derived under the discounted…

Optimal Adaptive Policies for Sequential Allocation Problems

- Mathematics
- 1996

Consider the problem of sequential sampling from m statistical populations to maximize the expected sum of outcomes in the long run. Under suitable assumptions on the unknown parameters, it is…

Nonparametric bandit methods

- Mathematics
- 1991

Bandits are a finite collection of random variables. Bandit problems are Markov decision problems in which, at each decision time, the decision maker selects a random variable (referred to as a…

Reinforcement Learning: An Introduction

- Computer Science
- IEEE Transactions on Neural Networks
- 2005

This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, ranging from the history of the field's intellectual foundations to the most recent developments and applications.

Reinforcement Learning: A Survey

- Computer Science
- J. Artif. Intell. Res.
- 1996

Central issues of reinforcement learning are discussed, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state.

Learning in embedded systems

- Computer Science
- 1993

This dissertation addresses the problem of designing algorithms for learning in embedded systems using Sutton's techniques for linear association and reinforcement comparison, while the interval estimation algorithm uses the statistical notion of confidence intervals to guide its generation of actions.

Learning to Act Using Real-Time Dynamic Programming

- Computer Science
- Artif. Intell.
- 1995

An algorithm based on dynamic programming, called Real-Time DP, is introduced, by which an embedded system can improve its performance with experience, and which illuminates aspects of other DP-based reinforcement learning methods such as Watkins' Q-learning algorithm.