banditpylib.learners.mnl_bandit_learner

Classes

class banditpylib.learners.mnl_bandit_learner.MNLBanditLearner(revenues: numpy.ndarray, reward: banditpylib.bandits.mnl_bandit_utils.Reward, card_limit: int, use_local_search: bool, random_neighbors: int, name: Optional[str])[source]

Abstract class for learners playing with an MNL bandit

Product 0 is reserved for the non-purchase option, and its preference parameter is assumed to be 1 (see the sketch after the parameter list).

Parameters
  • revenues (np.ndarray) – product revenues

  • reward (Reward) – reward the learner wants to maximize

  • card_limit (int) – cardinality constraint

  • use_local_search (bool) – whether to use local search for searching the best assortment

  • random_neighbors (int) – number of random neighbors to look up if local search is enabled

  • name (Optional[str]) – alias name
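
The revenue convention above can be made concrete with a short sketch (plain numpy; the value 0 for the non-purchase entry follows from non-purchase generating no revenue):

    import numpy as np

    # Index 0 is reserved for the non-purchase option, so its revenue is 0;
    # entries 1..3 are the revenues of the actual products.
    revenues = np.array([0.0, 0.45, 0.8, 0.9])

    # product_num excludes the non-purchase option
    product_num = len(revenues) - 1  # 3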

Inheritance

Inheritance diagram of MNLBanditLearner
property card_limit: int

Cardinality limit

property goal: banditpylib.learners.utils.Goal

Goal of the learner

property product_num: int

Number of products (not including product 0)

property random_neighbors: int

Number of random neighbors to look up when local search is enabled

property revenues: numpy.ndarray

Revenues of products (product 0 is included)

property reward: banditpylib.bandits.mnl_bandit_utils.Reward

Reward the learner wants to maximize

property running_environment: Union[type, List[type]]

Type of bandit environment the learner plays with

property use_local_search: bool

Whether local search is enabled
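
When use_local_search is enabled, the learner searches for the best assortment by local search instead of enumerating all feasible assortments. The following hill-climbing sketch illustrates the idea only; it is not banditpylib's implementation, and reward_of is a stand-in for the learner's estimated reward of an assortment:

    import random
    from typing import Callable, FrozenSet

    def local_search_sketch(
            product_num: int,
            card_limit: int,
            random_neighbors: int,
            reward_of: Callable[[FrozenSet[int]], float]) -> FrozenSet[int]:
        """Illustrative local search over assortments (not banditpylib's code)."""
        products = list(range(1, product_num + 1))  # product 0 excluded
        # Start from a random feasible assortment.
        current = frozenset(random.sample(products,
                                          k=min(card_limit, product_num)))
        improved = True
        while improved:
            improved = False
            # Look up a fixed number of random neighbors per round.
            for _ in range(random_neighbors):
                neighbor = set(current)
                # Perturb by toggling membership of one random product.
                product = random.choice(products)
                if product in neighbor:
                    neighbor.discard(product)
                else:
                    neighbor.add(product)
                if (neighbor and len(neighbor) <= card_limit
                        and reward_of(frozenset(neighbor)) > reward_of(current)):
                    current = frozenset(neighbor)
                    improved = True
        return current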

class banditpylib.learners.mnl_bandit_learner.UCB(revenues: numpy.ndarray, reward: banditpylib.bandits.mnl_bandit_utils.Reward, card_limit: int = inf, use_local_search: bool = False, random_neighbors: int = 10, name: Optional[str] = None)[source]

UCB policy [AAGZ19]

Parameters
  • revenues (np.ndarray) – product revenues

  • reward (Reward) – reward the learner wants to maximize

  • card_limit (int) – cardinality constraint

  • use_local_search (bool) – whether to use local search for searching the best assortment

  • random_neighbors (int) – number of random neighbors to look up if local search is enabled

  • name (Optional[str]) – alias name

Inheritance

Inheritance diagram of UCB
actions(context: data_pb2.Context) → data_pb2.Actions[source]

Actions of the learner

Parameters

context – contextual information about the bandit environment

Returns

actions to take

reset()[source]

Reset the learner

Warning

This function should be called before the start of the game.

update(feedback: data_pb2.Feedback)[source]

Update the learner

Parameters

feedback – feedback returned by the bandit environment after actions() is executed
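
Together, reset(), actions(), and update() form the learner's interaction protocol. The loop below is a sketch of one game; bandit is a hypothetical environment object, and its context() and feed() accessors are illustrative names only (banditpylib's protocol classes normally drive this loop):

    # Hypothetical interaction loop; accessor names on `bandit` are assumed.
    learner.reset()  # must be called before the start of the game
    for _ in range(horizon):
        context = bandit.context()          # hypothetical accessor
        actions = learner.actions(context)  # assortment(s) to serve
        feedback = bandit.feed(actions)     # hypothetical accessor
        learner.update(feedback)            # incorporate purchase observations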

class banditpylib.learners.mnl_bandit_learner.EpsGreedy(revenues: numpy.ndarray, reward: banditpylib.bandits.mnl_bandit_utils.Reward, card_limit: int = inf, use_local_search: bool = False, random_neighbors: int = 10, eps: float = 1.0, name: Optional[str] = None)[source]

Epsilon-Greedy policy

With probability \(\frac{\epsilon}{t}\) at time step \(t\), sample an assortment uniformly at random; with the remaining probability, serve the assortment with the maximum empirical reward (see the sketch after the parameter list).

Parameters
  • revenues (np.ndarray) – product revenues

  • reward (Reward) – reward the learner wants to maximize

  • card_limit (int) – cardinality constraint

  • use_local_search (bool) – whether to use local search for searching the best assortment

  • random_neighbors (int) – number of random neighbors to look up if local search is enabled

  • eps (float) – epsilon

  • name (Optional[str]) – alias name
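
The exploration rule is simple enough to state in code. A minimal sketch of the decision (illustrative, not banditpylib's implementation; t is the 1-based time step):

    import random

    def should_explore(eps: float, t: int) -> bool:
        """Epsilon-Greedy rule: explore with probability eps / t."""
        return random.random() < eps / t

With eps = 1.0 (the default), the exploration probability decays as 1/t, so the learner explores less as its empirical estimates sharpen.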

Inheritance

Inheritance diagram of EpsGreedy
actions(context: data_pb2.Context) → data_pb2.Actions[source]

Actions of the learner

Parameters

context – contextual information about the bandit environment

Returns

actions to take

reset()[source]

Reset the learner

Warning

This function should be called before the start of the game.

update(feedback: data_pb2.Feedback)[source]

Update the learner

Parameters

feedback – feedback returned by the bandit environment after actions() is executed

class banditpylib.learners.mnl_bandit_learner.ThompsonSampling(revenues: numpy.ndarray, horizon: int, reward: banditpylib.bandits.mnl_bandit_utils.Reward, card_limit: int = inf, use_local_search: bool = False, random_neighbors: int = 10, name: Optional[str] = None)[source]

Thompson sampling policy [AAGZ17]

Parameters
  • revenues (np.ndarray) – product revenues

  • horizon (int) – total number of time steps

  • reward (Reward) – reward the learner wants to maximize

  • card_limit (int) – cardinality constraint

  • use_local_search (bool) – whether to use local search for searching the best assortment

  • random_neighbors (int) – number of random neighbors to look up if local search is enabled

  • name (Optional[str]) – alias name
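
A minimal construction sketch follows. MeanReward is assumed here as a concrete Reward implementation; check banditpylib.bandits for the class actually exported:

    import numpy as np

    from banditpylib.bandits import MeanReward  # assumed concrete Reward class
    from banditpylib.learners.mnl_bandit_learner import ThompsonSampling

    revenues = np.array([0.0, 0.45, 0.8, 0.9])  # index 0: non-purchase
    learner = ThompsonSampling(
        revenues=revenues,
        horizon=1000,           # total number of time steps, needed up front
        reward=MeanReward(),
        card_limit=2,           # serve at most 2 products per assortment
        use_local_search=True,  # approximate the best-assortment search
        random_neighbors=10)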

Inheritance

Inheritance diagram of ThompsonSampling
actions(context: data_pb2.Context) → data_pb2.Actions[source]

Actions of the learner

Parameters

context – contextual information about the bandit environment

Returns

actions to take

reset()[source]

Reset the learner

Warning

This function should be called before the start of the game.

update(feedback: data_pb2.Feedback)[source]

Update the learner

Parameters

feedback – feedback returned by the bandit environment after actions() is executed