banditpylib.learners.mab_learner¶
Policies for the ordinary multi-armed bandit with the goal of regret minimization.
The following notation is used throughout.
\(T\) | game horizon
\(N\) | total number of arms
\(i_t\) | arm pulled at time \(t\)
\(X_i^t\) | empirical reward of arm \(i\) at time \(t\) if arm \(i\) is pulled
\(T_i(t)\) | number of times arm \(i\) is played before time \(t\)
\(\bar{\mu}_i(t)\) | empirical mean of arm \(i\) before time \(t\)
\(\bar{V}_i(t)\) | empirical variance of arm \(i\) before time \(t\)
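The empirical quantities above can be maintained incrementally from observed rewards. Below is a minimal standalone sketch of such bookkeeping (illustrative only; the `EmpiricalStats` name and its layout are assumptions, not banditpylib's internal implementation):

```python
import numpy as np


class EmpiricalStats:
  """Illustrative tracker for T_i(t), empirical means and variances of N arms."""

  def __init__(self, arm_num: int):
    self.pulls = np.zeros(arm_num)            # T_i(t)
    self.reward_sums = np.zeros(arm_num)      # sum of rewards of arm i
    self.reward_sq_sums = np.zeros(arm_num)   # sum of squared rewards of arm i

  def update(self, arm: int, reward: float):
    self.pulls[arm] += 1
    self.reward_sums[arm] += reward
    self.reward_sq_sums[arm] += reward ** 2

  def means(self) -> np.ndarray:
    # \bar{\mu}_i(t); arms that were never pulled default to 0
    return np.divide(self.reward_sums, self.pulls,
                     out=np.zeros_like(self.reward_sums), where=self.pulls > 0)

  def variances(self) -> np.ndarray:
    # \bar{V}_i(t) computed as E[X^2] - (E[X])^2 over the observed rewards
    mean_sq = np.divide(self.reward_sq_sums, self.pulls,
                        out=np.zeros_like(self.reward_sq_sums), where=self.pulls > 0)
    return mean_sq - self.means() ** 2
```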
Classes¶
MABLearner
: Abstract class for learners playing with the ordinary multi-armed bandit
EpsGreedy
: Epsilon-Greedy policy
UCB
: Upper Confidence Bound policy [ACBF02]
ThompsonSampling
: Thompson Sampling policy [AG17]
Uniform
: Uniform policy
UCBV
: UCBV policy [AMSzepesvari09]
MOSS
: MOSS policy [AB09]
EXP3
: EXP3 policy [ACesaBianchiFS02]
ExploreThenCommit
: Explore-Then-Commit policy
Softmax
: Softmax policy
- class banditpylib.learners.mab_learner.MABLearner(arm_num: int, name: Optional[str])[source]¶
Abstract class for learners playing with the ordinary multi-armed bandit
Learners of this type aim to maximize the total collected reward.
- Parameters
arm_num (int) – number of arms
name (Optional[str]) – alias name
Inheritance
- property arm_num: int¶
Number of arms
- property goal: banditpylib.learners.utils.Goal¶
Goal of the learner
- property running_environment: Union[type, List[type]]¶
Type of bandit environment the learner plays with
- class banditpylib.learners.mab_learner.EpsGreedy(arm_num: int, eps: float = 1.0, name: Optional[str] = None)[source]¶
Epsilon-Greedy policy
With probability \(\frac{\epsilon}{t}\), sample an arm uniformly at random; with the remaining probability, play the arm with the maximum empirical mean.
- Parameters
arm_num (int) – number of arms
eps (float) – parameter \(\epsilon\); the exploration probability at time \(t\) is \(\frac{\epsilon}{t}\)
name (Optional[str]) – alias name
Inheritance
- actions(context: data_pb2.Context) → data_pb2.Actions[source]¶
Actions of the learner
- Parameters
context – contextual information about the bandit environment
- Returns
actions to take
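As an illustration of the rule above, the action selection could look like the following. This is a standalone sketch, not the EpsGreedy class itself, and assumes empirical means and the time index (starting from 1) are tracked externally:

```python
import numpy as np


def eps_greedy_action(means: np.ndarray, t: int, eps: float = 1.0,
                      rng: np.random.Generator = None) -> int:
  """Epsilon-Greedy: explore uniformly w.p. eps / t, otherwise exploit."""
  if rng is None:
    rng = np.random.default_rng()
  if rng.random() < eps / t:                # time t is assumed to start from 1
    return int(rng.integers(len(means)))    # uniform exploration
  return int(np.argmax(means))              # play the maximum empirical mean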
- class banditpylib.learners.mab_learner.UCB(arm_num: int, alpha: float = 2.0, name: Optional[str] = None)[source]¶
Upper Confidence Bound policy [ACBF02]
At time \(t\), play arm
\[\mathrm{argmax}_{i \in \{0, \dots, N-1\}} \left\{ \bar{\mu}_i(t) + \sqrt{ \frac{\alpha \ln(t) }{T_i(t)} } \right\}\]
- Parameters
arm_num (int) – number of arms
alpha (float) – parameter \(\alpha\) scaling the exploration bonus
name (Optional[str]) – alias name
Inheritance
- actions(context: data_pb2.Context) → data_pb2.Actions[source]¶
Actions of the learner
- Parameters
context – contextual information about the bandit environment
- Returns
actions to take
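A standalone sketch of this index (not the UCB class itself); it assumes each arm is pulled once before the index is applied, since \(T_i(t) = 0\) would make the bonus undefined:

```python
import numpy as np


def ucb_action(means: np.ndarray, pulls: np.ndarray, t: int, alpha: float = 2.0) -> int:
  """Play the arm maximizing mu_i + sqrt(alpha * ln(t) / T_i)."""
  if np.any(pulls == 0):
    return int(np.argmin(pulls))  # assumed initialization: try every arm once
  index = means + np.sqrt(alpha * np.log(t) / pulls)
  return int(np.argmax(index))
```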
- class banditpylib.learners.mab_learner.ThompsonSampling(arm_num: int, prior_dist: str = 'beta', name: Optional[str] = None)[source]¶
Thompson Sampling policy [AG17]
Assume a prior distribution for every arm. At time \(t\), sample a virtual mean reward from the posterior distribution for every arm. Play the arm with the maximum sampled virtual mean reward.
- Parameters
arm_num (int) – number of arms
prior_dist (str) – prior distribution used by Thompson Sampling. Only two priors are supported, i.e., beta and gaussian
name (Optional[str]) – alias name
Warning
Reward should be Bernoulli when Beta prior is chosen.
Inheritance
- actions(context: data_pb2.Context) → data_pb2.Actions[source]¶
Actions of the learner
- Parameters
context – contextual information about the bandit environment
- Returns
actions to take
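For the beta prior with Bernoulli rewards, the sampling step can be sketched as follows. This is illustrative only; the Beta(1, 1) prior and the success/failure bookkeeping are assumptions, not necessarily the class's exact implementation:

```python
import numpy as np


def thompson_sampling_action(successes: np.ndarray, failures: np.ndarray,
                             rng: np.random.Generator = None) -> int:
  """Sample a virtual mean per arm from Beta(1 + successes, 1 + failures)
  and play the arm with the maximum sample."""
  if rng is None:
    rng = np.random.default_rng()
  virtual_means = rng.beta(1 + successes, 1 + failures)
  return int(np.argmax(virtual_means))
```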
- class banditpylib.learners.mab_learner.Uniform(arm_num: int, name: Optional[str] = None)[source]¶
Uniform policy
Play each arm in a round-robin way.
- Parameters
arm_num (int) – number of arms
name (Optional[str]) – alias name
Inheritance
- actions(context: data_pb2.Context) → data_pb2.Actions[source]¶
Actions of the learner
- Parameters
context – contextual information about the bandit environment
- Returns
actions to take
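A one-line sketch of the round-robin rule (assuming time steps start from 1; not the Uniform class itself):

```python
def uniform_action(t: int, arm_num: int) -> int:
  """Round-robin: at time t play arm (t - 1) mod N."""
  return (t - 1) % arm_num
```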
- class banditpylib.learners.mab_learner.UCBV(arm_num: int, b: float = 1.0, name: Optional[str] = None)[source]¶
UCBV policy [AMSzepesvari09]
At time \(t\), play arm
\[\mathrm{argmax}_{i \in \{0, \dots, N-1\}} \left\{ \bar{\mu}_i(t) + \sqrt{ \frac{ 2 \bar{V}_i(t) \ln(t) }{T_i(t)} } + \frac{ b \ln(t) }{T_i(t)} \right\}\]
- Parameters
arm_num (int) – number of arms
b (float) – upper bound of rewards
name (Optional[str]) – alias name
Note
Reward has to be bounded within \([0, b]\).
Inheritance
- actions(context: data_pb2.Context) → data_pb2.Actions[source]¶
Actions of the learner
- Parameters
context – contextual information about the bandit environment
- Returns
actions to take
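A standalone sketch of the UCBV index (not the class itself), again assuming each arm is pulled once first so that \(T_i(t) > 0\):

```python
import numpy as np


def ucbv_action(means: np.ndarray, variances: np.ndarray, pulls: np.ndarray,
                t: int, b: float = 1.0) -> int:
  """Play the arm maximizing mu_i + sqrt(2 * V_i * ln(t) / T_i) + b * ln(t) / T_i."""
  if np.any(pulls == 0):
    return int(np.argmin(pulls))  # assumed initialization: try every arm once
  index = (means
           + np.sqrt(2 * variances * np.log(t) / pulls)
           + b * np.log(t) / pulls)
  return int(np.argmax(index))
```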
- class banditpylib.learners.mab_learner.MOSS(arm_num: int, horizon: int, name: Optional[str] = None)[source]¶
MOSS policy [AB09]
At time \(t\), play arm
\[\mathrm{argmax}_{i \in \{0, \dots, N-1\}} \left\{ \bar{\mu}_i(t) + \sqrt{ \frac{ \max\left( \ln\left( \frac{T}{N T_i(t)} \right), 0 \right) }{T_i(t)} } \right\}\]
- Parameters
arm_num (int) – number of arms
horizon (int) – total number of time steps
name (Optional[str]) – alias name
Note
MOSS uses the time horizon in its confidence interval. Rewards have to be bounded within \([0, 1]\).
Inheritance
- actions(context: data_pb2.Context) → data_pb2.Actions[source]¶
Actions of the learner
- Parameters
context – contextual information about the bandit environment
- Returns
actions to take
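A standalone sketch of the MOSS index (not the class itself), assuming each arm is pulled once before the index is used:

```python
import numpy as np


def moss_action(means: np.ndarray, pulls: np.ndarray, horizon: int) -> int:
  """Play the arm maximizing mu_i + sqrt(max(ln(T / (N * T_i)), 0) / T_i)."""
  if np.any(pulls == 0):
    return int(np.argmin(pulls))  # assumed initialization: try every arm once
  arm_num = len(means)
  bonus = np.sqrt(np.maximum(np.log(horizon / (arm_num * pulls)), 0.0) / pulls)
  return int(np.argmax(means + bonus))
```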
- class banditpylib.learners.mab_learner.EXP3(arm_num: int, gamma: float = 0.01, name: Optional[str] = None)[source]¶
EXP3 policy [ACesaBianchiFS02]
At time \(t\), with probability \(\gamma\), uniformly randomly sample an arm to play. With the remaining probability, i.e., \(1 - \gamma\), sample arm \(i\) to play with sampling weight
\[\begin{cases} w_i^{t-1} & \text{if } i_{t-1} \neq i \\ w_i^{t-1} \exp\left( \frac{\gamma}{N} \frac{X_i^{t-1}}{p_i^{t-1}} \right) & \text{if } i_{t-1} = i \end{cases}\]
where \(w_i^{t-1}\) and \(p_i^{t-1}\) denote the weight of arm \(i\) and the probability of pulling arm \(i\) at time \(t-1\), respectively. Initially, \(w_i^0 = 1\) for every arm \(i \in \{0, \dots, N-1\}\).
- Parameters
arm_num (int) – number of arms
gamma (float) – probability of doing uniform sampling
name (Optional[str]) – alias name
Inheritance
- actions(context: data_pb2.Context) → data_pb2.Actions[source]¶
Actions of the learner
- Parameters
context – contextual information about the bandit environment
- Returns
actions to take
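The two-case sampling above is equivalent to drawing from the mixture \(p_i = \frac{\gamma}{N} + (1 - \gamma) \frac{w_i}{\sum_j w_j}\). A standalone sketch of the sampling and weight-update steps (not the EXP3 class itself):

```python
import numpy as np


def exp3_action(weights: np.ndarray, gamma: float,
                rng: np.random.Generator = None):
  """Draw an arm from p_i = gamma / N + (1 - gamma) * w_i / sum(w).
  Returns the arm together with its sampling probability (needed for the update)."""
  if rng is None:
    rng = np.random.default_rng()
  probs = gamma / len(weights) + (1 - gamma) * weights / weights.sum()
  arm = int(rng.choice(len(weights), p=probs))
  return arm, probs[arm]


def exp3_update(weights: np.ndarray, arm: int, reward: float,
                prob: float, gamma: float) -> None:
  """Multiply the pulled arm's weight by exp(gamma / N * reward / prob)."""
  weights[arm] *= np.exp(gamma / len(weights) * reward / prob)
```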
- class banditpylib.learners.mab_learner.ExploreThenCommit(arm_num: int, T_prime: int, name: Optional[str] = None)[source]¶
Explore-Then-Commit policy
During the first \(T' \leq T\) time steps (exploration period), play each arm in a round-robin way. For the remaining time steps, consistently play the arm with the maximum empirical mean reward obtained during the exploration period.
- Parameters
arm_num (int) – number of arms
T_prime (int) – time steps to explore
name (Optional[str]) – alias name
Inheritance
- actions(context: data_pb2.Context) → data_pb2.Actions[source]¶
Actions of the learner
- Parameters
context – contextual information about the bandit environment
- Returns
actions to take
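A standalone sketch of the rule (not the class itself), assuming time steps start from 1 and the empirical means are frozen at the end of the exploration period:

```python
import numpy as np


def explore_then_commit_action(means: np.ndarray, t: int, T_prime: int) -> int:
  """Round-robin exploration for the first T_prime steps, then commit to the
  arm with the best empirical mean observed during exploration."""
  if t <= T_prime:
    return (t - 1) % len(means)    # exploration: round robin
  return int(np.argmax(means))     # commit to the empirically best arm
```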
- class banditpylib.learners.mab_learner.Softmax(arm_num: int, gamma: float = 1.0, name: Optional[str] = None)[source]¶
Softmax policy
At time \(t\), sample arm \(i\) to play with sampling weight
\[\exp\left( \bar{\mu}_i(t) / \gamma \right)\]
where \(\gamma\) is a parameter controlling the amount of exploration.
- Parameters
arm_num (int) – number of arms
gamma (float) – parameter \(\gamma\) controlling the amount of exploration
name (Optional[str]) – alias name
Note
As \(\gamma\) approaches 0, the learner selects the arm with the maximum empirical mean reward with increasing probability. As \(\gamma\) approaches infinity, the learner's policy tends toward uniform sampling.
Inheritance
- actions(context: data_pb2.Context) → data_pb2.Actions[source]¶
Actions of the learner
- Parameters
context – contextual information about the bandit environment
- Returns
actions to take
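A standalone sketch of the sampling rule (not the class itself); subtracting the maximum is only for numerical stability and does not change the distribution:

```python
import numpy as np


def softmax_action(means: np.ndarray, gamma: float = 1.0,
                   rng: np.random.Generator = None) -> int:
  """Sample an arm with probability proportional to exp(mu_i / gamma)."""
  if rng is None:
    rng = np.random.default_rng()
  logits = means / gamma
  weights = np.exp(logits - logits.max())  # stable; still proportional to exp(mu_i / gamma)
  probs = weights / weights.sum()
  return int(rng.choice(len(means), p=probs))
```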