banditpylib.learners.mab_learner

Policies for the ordinary multi-armed bandit problem, with the goal of regret minimization.

We use the following notations.

Notations

  • \(T\) – game horizon

  • \(N\) – total number of arms

  • \(i_t\) – arm pulled at time \(t\)

  • \(X_i^t\) – empirical reward of arm \(i\) at time \(t\) if arm \(i\) is pulled

  • \(T_i(t)\) – number of times arm \(i\) is played before time \(t\)

  • \(\bar{\mu}_i(t)\) – empirical mean of arm \(i\) before time \(t\)

  • \(\bar{V}_i(t)\) – empirical variance of arm \(i\) before time \(t\)

Classes

class banditpylib.learners.mab_learner.MABLearner(arm_num: int, name: Optional[str])[source]

Abstract class for learners playing with the ordinary multi-armed bandit

Learners of this type aim to maximize the total collected reward.

Parameters
  • arm_num (int) – number of arms

  • name (Optional[str]) – alias name

Inheritance

Inheritance diagram of MABLearner
property arm_num: int

Number of arms

property goal: banditpylib.learners.utils.Goal

Goal of the learner

property running_environment: Union[type, List[type]]

Type of bandit environment the learner plays with

class banditpylib.learners.mab_learner.EpsGreedy(arm_num: int, eps: float = 1.0, name: Optional[str] = None)[source]

Epsilon-Greedy policy

At time \(t\), with probability \(\frac{\epsilon}{t}\), sample an arm uniformly at random; with the remaining probability, play the arm with the maximum empirical mean.
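
For illustration, a minimal NumPy sketch of this selection rule (not banditpylib's implementation; `means` is an assumed array of empirical means and `t` the current time, both maintained by the caller):

    import numpy as np

    def eps_greedy_arm(means: np.ndarray, eps: float, t: int,
                       rng: np.random.Generator) -> int:
        # With probability eps / t, explore uniformly at random.
        if rng.random() < eps / t:
            return int(rng.integers(len(means)))
        # Otherwise play the arm with the largest empirical mean.
        return int(np.argmax(means))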

Parameters
  • arm_num (int) – number of arms

  • eps (float) – the parameter \(\epsilon\) controlling the exploration probability

  • name (Optional[str]) – alias name

Inheritance

Inheritance diagram of EpsGreedy
actions(context: data_pb2.Context) → data_pb2.Actions[source]

Actions of the learner

Parameters

context – contextual information about the bandit environment

Returns

actions to take

reset()[source]

Reset the learner

Warning

This function should be called before the start of the game.

update(feedback: data_pb2.Feedback)[source]

Update the learner

Parameters

feedback – feedback returned by the bandit environment after actions() is executed

class banditpylib.learners.mab_learner.UCB(arm_num: int, alpha: float = 2.0, name: Optional[str] = None)[source]

Upper Confidence Bound policy [ACBF02]

At time \(t\), play arm

\[\mathrm{argmax}_{i \in \{0, \dots, N-1\}} \left\{ \bar{\mu}_i(t) + \sqrt{ \frac{\alpha \ln(t) }{T_i(t)} } \right\}\]
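
For illustration, a minimal NumPy sketch of this index rule (not banditpylib's implementation; `means` and `pulls` are assumed running statistics maintained by the caller):

    import numpy as np

    def ucb_arm(means: np.ndarray, pulls: np.ndarray,
                alpha: float, t: int) -> int:
        # Pull any never-played arm first so that T_i(t) > 0.
        if np.any(pulls == 0):
            return int(np.argmin(pulls))
        # Index = empirical mean + confidence radius.
        index = means + np.sqrt(alpha * np.log(t) / pulls)
        return int(np.argmax(index))
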
Parameters
  • arm_num (int) – number of arms

  • alpha (float) – the parameter \(\alpha\) scaling the confidence radius

  • name (Optional[str]) – alias name

Inheritance

Inheritance diagram of UCB
actions(context: data_pb2.Context) → data_pb2.Actions[source]

Actions of the learner

Parameters

context – contextual information about the bandit environment

Returns

actions to take

reset()[source]

Reset the learner

Warning

This function should be called before the start of the game.

update(feedback: data_pb2.Feedback)[source]

Update the learner

Parameters

feedback – feedback returned by the bandit environment after actions() is executed

class banditpylib.learners.mab_learner.ThompsonSampling(arm_num: int, prior_dist: str = 'beta', name: Optional[str] = None)[source]

Thompson Sampling policy [AG17]

Assume a prior distribution for every arm. At time \(t\), sample a virtual mean reward from the posterior distribution for every arm. Play the arm with the maximum sampled virtual mean reward.
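
For illustration, a minimal NumPy sketch of the beta-prior case (not banditpylib's implementation; `successes` and `failures` are assumed per-arm Bernoulli counts maintained by the caller):

    import numpy as np

    def ts_beta_arm(successes: np.ndarray, failures: np.ndarray,
                    rng: np.random.Generator) -> int:
        # Beta(1 + S_i, 1 + F_i) posterior per arm; assumes Bernoulli rewards.
        samples = rng.beta(1 + successes, 1 + failures)
        return int(np.argmax(samples))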

Parameters
  • arm_num (int) – number of arms

  • prior_dist (str) – prior distribution used by Thompson sampling. Only two priors are supported, i.e., beta and gaussian

  • name (Optional[str]) – alias name

Warning

Rewards should be Bernoulli when the beta prior is chosen.

Inheritance

Inheritance diagram of ThompsonSampling
actions(context: data_pb2.Context) → data_pb2.Actions[source]

Actions of the learner

Parameters

context – contextual information about the bandit environment

Returns

actions to take

reset()[source]

Reset the learner

Warning

This function should be called before the start of the game.

update(feedback: data_pb2.Feedback)[source]

Update the learner

Parameters

feedback – feedback returned by the bandit environment after actions() is executed

class banditpylib.learners.mab_learner.Uniform(arm_num: int, name: Optional[str] = None)[source]

Uniform policy

Play each arm in a round-robin way.
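
For illustration, the round-robin rule in a one-line sketch (a hypothetical helper, not banditpylib's implementation):

    def uniform_arm(t: int, arm_num: int) -> int:
        # Round-robin over arms 0, ..., arm_num - 1.
        return t % arm_num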

Parameters
  • arm_num (int) – number of arms

  • name (Optional[str]) – alias name

Inheritance

Inheritance diagram of Uniform
actions(context: data_pb2.Context) → data_pb2.Actions[source]

Actions of the learner

Parameters

context – contextual information about the bandit environment

Returns

actions to take

reset()[source]

Reset the learner

Warning

This function should be called before the start of the game.

update(feedback: data_pb2.Feedback)[source]

Update the learner

Parameters

feedback – feedback returned by the bandit environment after actions() is executed

class banditpylib.learners.mab_learner.UCBV(arm_num: int, b: float = 1.0, name: Optional[str] = None)[source]

UCBV policy [AMSzepesvari09]

At time \(t\), play arm

\[\mathrm{argmax}_{i \in \{0, \dots, N-1\}} \left\{ \bar{\mu}_i(t) + \sqrt{ \frac{ 2 \bar{V}_i(t) \ln(t) }{T_i(t)} }+ \frac{ b \ln(t) }{T_i(t)} \right\}\]
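
For illustration, a minimal NumPy sketch of this index rule (not banditpylib's implementation; `means`, `variances`, and `pulls` are assumed running statistics maintained by the caller):

    import numpy as np

    def ucbv_arm(means: np.ndarray, variances: np.ndarray,
                 pulls: np.ndarray, b: float, t: int) -> int:
        # Pull any never-played arm first so that T_i(t) > 0.
        if np.any(pulls == 0):
            return int(np.argmin(pulls))
        # Empirical-Bernstein index: mean + variance term + range term.
        index = (means
                 + np.sqrt(2 * variances * np.log(t) / pulls)
                 + b * np.log(t) / pulls)
        return int(np.argmax(index))
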
Parameters
  • arm_num (int) – number of arms

  • b (float) – upper bound of rewards

  • name (Optional[str]) – alias name

Note

Reward has to be bounded within \([0, b]\).

Inheritance

Inheritance diagram of UCBV
actions(context: data_pb2.Context) → data_pb2.Actions[source]

Actions of the learner

Parameters

context – contextual information about the bandit environment

Returns

actions to take

reset()[source]

Reset the learner

Warning

This function should be called before the start of the game.

update(feedback: data_pb2.Feedback)[source]

Update the learner

Parameters

feedback – feedback returned by the bandit environment after actions() is executed

class banditpylib.learners.mab_learner.MOSS(arm_num: int, horizon: int, name: Optional[str] = None)[source]

MOSS policy [AB09]

At time \(t\), play arm

\[\mathrm{argmax}_{i \in \{0, \dots, N-1\}} \left\{ \bar{\mu}_i(t) + \sqrt{\frac{\max\left( \ln\left( \frac{T}{N T_i(t)} \right), 0 \right) }{T_i(t)} } \right\}\]
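
For illustration, a minimal NumPy sketch of this index rule (not banditpylib's implementation; `means` and `pulls` are assumed running statistics maintained by the caller):

    import numpy as np

    def moss_arm(means: np.ndarray, pulls: np.ndarray, horizon: int) -> int:
        n = len(means)
        # Pull any never-played arm first so that T_i(t) > 0.
        if np.any(pulls == 0):
            return int(np.argmin(pulls))
        # The confidence radius uses the horizon T instead of the time t.
        radius = np.sqrt(np.maximum(np.log(horizon / (n * pulls)), 0) / pulls)
        return int(np.argmax(means + radius))
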
Parameters
  • arm_num (int) – number of arms

  • horizon (int) – total number of time steps

  • name (Optional[str]) – alias name

Note

MOSS uses the time horizon \(T\) in its confidence interval. Rewards have to be bounded in \([0, 1]\).

Inheritance

Inheritance diagram of MOSS
actions(context: data_pb2.Context) → data_pb2.Actions[source]

Actions of the learner

Parameters

context – contextual information about the bandit environment

Returns

actions to take

reset()[source]

Reset the learner

Warning

This function should be called before the start of the game.

update(feedback: data_pb2.Feedback)[source]

Update the learner

Parameters

feedback – feedback returned by the bandit environment after actions() is executed

class banditpylib.learners.mab_learner.EXP3(arm_num: int, gamma: float = 0.01, name: Optional[str] = None)[source]

EXP3 policy [ACesaBianchiFS02]

At time \(t\), with probability \(\gamma\), sample an arm uniformly at random to play. With the remaining probability, i.e., \((1 - \gamma)\), sample arm \(i\) to play with sampling weight \(w_i^t\) given by

\[w_i^t = \begin{cases} w_i^{t-1} & \text{if}~i_{t-1} \neq i \\ w_i^{t-1} \exp\left( \frac{\gamma}{N} \frac{X_i^{t-1}}{p_i^{t-1}} \right) & \text{if}~i_{t-1} = i \end{cases}\]

where \(w_i^{t-1}\) and \(p_i^{t-1}\) denote the weight of arm \(i\) and the probability of pulling arm \(i\) at time \(t-1\), respectively; initially we set \(w_i^0 = 1\) for every arm \(i \in \{0, \dots, N-1\}\).
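
For illustration, a minimal NumPy sketch of the sampling distribution and weight update (not banditpylib's implementation; `weights` and `probs` are assumed state maintained by the caller):

    import numpy as np

    def exp3_probs(weights: np.ndarray, gamma: float) -> np.ndarray:
        # Mix the normalized weights with uniform exploration.
        n = len(weights)
        return (1 - gamma) * weights / weights.sum() + gamma / n

    def exp3_update(weights: np.ndarray, probs: np.ndarray,
                    arm: int, reward: float, gamma: float) -> np.ndarray:
        # Importance-weighted exponential update for the pulled arm only.
        n = len(weights)
        weights[arm] *= np.exp(gamma / n * reward / probs[arm])
        return weights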

Parameters
  • arm_num (int) – number of arms

  • gamma (float) – probability to do uniform sampling

  • name (Optional[str]) – alias name

Inheritance

Inheritance diagram of EXP3
actions(context: data_pb2.Context) → data_pb2.Actions[source]

Actions of the learner

Parameters

context – contextual information about the bandit environment

Returns

actions to take

reset()[source]

Reset the learner

Warning

This function should be called before the start of the game.

update(feedback: data_pb2.Feedback)[source]

Update the learner

Parameters

feedback – feedback returned by the bandit environment after actions() is executed

class banditpylib.learners.mab_learner.ExploreThenCommit(arm_num: int, T_prime: int, name: Optional[str] = None)[source]

Explore-Then-Commit policy

During the first \(T' \leq T\) time steps (the exploration period), play each arm in a round-robin way. For the remaining time steps, consistently play the arm with the maximum empirical mean reward at the end of the exploration period.
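
For illustration, a minimal sketch of this rule (not banditpylib's implementation; `means` is an assumed array of empirical means maintained by the caller):

    import numpy as np

    def etc_arm(means: np.ndarray, t: int, t_prime: int,
                arm_num: int) -> int:
        if t < t_prime:
            # Exploration phase: round-robin over the arms.
            return t % arm_num
        # Commit phase: the empirically best arm from the exploration period.
        return int(np.argmax(means))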

Parameters
  • arm_num (int) – number of arms

  • T_prime (int) – number of time steps to explore

  • name (Optional[str]) – alias name

Inheritance

Inheritance diagram of ExploreThenCommit
actions(context: data_pb2.Context) → data_pb2.Actions[source]

Actions of the learner

Parameters

context – contextual information about the bandit environment

Returns

actions to take

reset()[source]

Reset the learner

Warning

This function should be called before the start of the game.

update(feedback: data_pb2.Feedback)[source]

Update the learner

Parameters

feedback – feedback returned by the bandit environment after actions() is executed

class banditpylib.learners.mab_learner.Softmax(arm_num: int, gamma: float = 1.0, name: Optional[str] = None)[source]

Softmax policy

At time \(t\), sample arm \(i\) to play with sampling weight

\[\exp\left( \bar{\mu}_i(t) / \gamma \right)\]

where \(\gamma\) is a temperature parameter controlling the amount of exploration.
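
For illustration, a minimal NumPy sketch of this sampling rule (not banditpylib's implementation; `means` is an assumed array of empirical means maintained by the caller):

    import numpy as np

    def softmax_arm(means: np.ndarray, gamma: float,
                    rng: np.random.Generator) -> int:
        # Boltzmann weights; subtract the max logit for numerical stability.
        logits = means / gamma
        weights = np.exp(logits - logits.max())
        probs = weights / weights.sum()
        return int(rng.choice(len(means), p=probs))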

Parameters
  • arm_num (int) – number of arms

  • gamma (float) – the temperature parameter \(\gamma\)

  • name (Optional[str]) – alias name

Note

When \(\gamma\) approaches 0, the learner selects the arm with the maximum empirical mean reward with increasing probability. When \(\gamma\) approaches infinity, the policy tends toward uniform sampling.

Inheritance

Inheritance diagram of Softmax
actions(context: data_pb2.Context) → data_pb2.Actions[source]

Actions of the learner

Parameters

context – contextual information about the bandit environment

Returns

actions to take

reset()[source]

Reset the learner

Warning

This function should be called before the start of the game.

update(feedback: data_pb2.Feedback)[source]

Update the learner

Parameters

feedback – feedback returned by the bandit environment after actions() is executed