banditpylib.bandits

Functions

banditpylib.bandits.search_best_assortment(reward: banditpylib.bandits.mnl_bandit_utils.Reward, card_limit: int = inf) → Tuple[float, Set[int]][source]

Search for the assortment with the maximum reward

Parameters
  • reward – reward definition

  • card_limit – cardinality constraint

Returns

assortment with the maximum reward
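A minimal usage sketch, assuming the MeanReward class documented below; the preference parameters and revenues are illustrative only:

    import numpy as np

    from banditpylib.bandits import MeanReward, search_best_assortment

    reward = MeanReward()
    # Product 0 (non-purchase) is included; by convention its preference
    # parameter is 1 and its revenue is 0.
    reward.set_preference_params(np.array([1.0, 0.5, 0.3, 0.8]))
    reward.set_revenues(np.array([0.0, 1.0, 1.5, 0.7]))

    best_reward, best_assortment = search_best_assortment(reward=reward, card_limit=2)
    # best_assortment is a set of product ids, e.g. {1, 3}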

banditpylib.bandits.local_search_best_assortment(reward: banditpylib.bandits.mnl_bandit_utils.Reward, random_neighbors: int, card_limit: int, init_assortment: Optional[Set[int]] = None) → Tuple[float, Set[int]][source]

Local search for the assortment with the maximum reward

Warning

This method is not guaranteed to output the best assortment.

Todo

Implement this function with cppyy.

Parameters
  • reward – reward definition

  • random_neighbors – number of random neighbors to look up

  • card_limit – cardinality constraint

  • init_assortment – initial assortment to start from

Returns

local best assortment with its reward
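A sketch under the same illustrative setup as above; random_neighbors and the initial assortment are arbitrary choices:

    import numpy as np

    from banditpylib.bandits import MeanReward, local_search_best_assortment

    reward = MeanReward()
    reward.set_preference_params(np.array([1.0, 0.5, 0.3, 0.8]))
    reward.set_revenues(np.array([0.0, 1.0, 1.5, 0.7]))

    # Local search inspects a few random neighbors per step, so it may
    # return only a locally optimal assortment.
    local_reward, local_assortment = local_search_best_assortment(
        reward=reward, random_neighbors=5, card_limit=2, init_assortment={1})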

Classes

class banditpylib.bandits.Bandit[source]

Abstract class for bandit environments

Inheritance

Inheritance diagram of Bandit
abstract property context: data_pb2.Context

Contextual information about the bandit environment

abstract feed(actions: data_pb2.Actions) → data_pb2.Feedback[source]
Parameters

actions – actions for the bandit environment to execute

Returns

feedback after actions are executed

abstract property name: str

Bandit name

abstract regret(goal: banditpylib.learners.utils.Goal) → float[source]
Parameters

goal – goal of the learner

Returns

regret of the learner

abstract reset()[source]

Reset the bandit environment

Warning

This function should be called before the start of the game.

class banditpylib.bandits.MultiArmedBandit(arms: List[banditpylib.arms.utils.StochasticArm])[source]

Multi-armed bandit

Arms are indexed from 0 by default. Each pull of arm \(i\) will generate an i.i.d. reward from distribution \(\mathcal{D}_i\), which is unknown beforehand.

Parameters

arms (List[StochasticArm]) – available arms
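A construction sketch; BernoulliArm is assumed to be a concrete StochasticArm available from banditpylib.arms, and the means 0.3 and 0.7 are illustrative:

    from banditpylib.arms import BernoulliArm  # assumed concrete StochasticArm
    from banditpylib.bandits import MultiArmedBandit

    bandit = MultiArmedBandit(arms=[BernoulliArm(0.3), BernoulliArm(0.7)])
    bandit.reset()  # must be called before the game starts
    print(bandit.arm_num)  # 2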

Inheritance

Inheritance diagram of MultiArmedBandit
property arm_num: int

Total number of arms

property context: data_pb2.Context

Contextual information about the bandit environment

feed(actions: data_pb2.Actions) → data_pb2.Feedback[source]
Parameters

actions – actions for the bandit environment to execute

Returns

feedback after actions are executed

property name: str

Bandit name

regret(goal: banditpylib.learners.utils.Goal) → float[source]
Parameters

goal – goal of the learner

Returns

regret of the learner

reset()[source]

Reset the bandit environment

Warning

This function should be called before the start of the game.

class banditpylib.bandits.LinearBandit(features: List[numpy.ndarray], theta: numpy.ndarray, std: float = 1.0)[source]

Finite-armed linear bandit

Arms are indexed from 0 by default. Each pull of arm \(i\) will generate an i.i.d. reward from distribution \(\langle \theta, v_i \rangle + \epsilon\), where \(v_i\) is the feature vector of arm \(i\), \(\theta\) is the unknown parameter and \(\epsilon\) is a zero-mean noise.

Parameters
  • features (List[np.ndarray]) – feature vectors of the arms

  • theta (np.ndarray) – unknown parameter theta

  • std (float) – standard deviation of the noise
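A construction sketch with illustrative feature vectors and parameter:

    import numpy as np

    from banditpylib.bandits import LinearBandit

    # Three arms with 2-dimensional feature vectors
    features = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.5, 0.5])]
    theta = np.array([0.2, 0.8])  # unknown to the learner in a real run
    bandit = LinearBandit(features=features, theta=theta, std=1.0)
    bandit.reset()
    print(bandit.arm_num)  # 3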

Inheritance

Inheritance diagram of LinearBandit
property arm_num: int

Total number of arms

property context: data_pb2.Context

Contextual information about the bandit environment

property features: List[numpy.ndarray]

Feature vectors of the arms

feed(actions: data_pb2.Actions) → data_pb2.Feedback[source]
Parameters

actions – actions for the bandit environment to execute

Returns

feedback after actions are executed

property name: str

Bandit name

regret(goal: banditpylib.learners.utils.Goal) → float[source]
Parameters

goal – goal of the learner

Returns

regret of the learner

reset()[source]

Reset the bandit environment

Warning

This function should be called before the start of the game.

class banditpylib.bandits.Reward[source]

General reward class

Inheritance

Inheritance diagram of Reward
abstract calc(assortment: Set[int]) → float[source]
Parameters

assortment – assortment to calculate

Returns

reward of the assortment

abstract property name: str

Reward name

property preference_params: numpy.ndarray

Preference parameters (product 0 is included)

property revenues: numpy.ndarray

Revenues of products (product 0 is included)

set_preference_params(preference_params: numpy.ndarray)[source]
Parameters

preference_params – preference parameters of products

set_revenues(revenues: numpy.ndarray)[source]
Parameters

revenues – revenues of products

class banditpylib.bandits.MeanReward[source]

Mean reward

Inheritance

Inheritance diagram of MeanReward
calc(assortment: Set[int]) → float[source]
Parameters

assortment – assortment to calculate

Returns

reward of the assortment

property name: str

Reward name

class banditpylib.bandits.CvarReward(alpha: float)[source]

CVaR reward

Parameters

alpha (float) – percentile of CVaR
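A sketch of evaluating the CVaR reward of an assortment; all numbers are illustrative:

    import numpy as np

    from banditpylib.bandits import CvarReward

    reward = CvarReward(alpha=0.05)  # CVaR at the 5th percentile
    reward.set_preference_params(np.array([1.0, 0.5, 0.3]))
    reward.set_revenues(np.array([0.0, 1.0, 1.5]))
    print(reward.calc({1, 2}))  # CVaR reward of serving products 1 and 2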

Inheritance

Inheritance diagram of CvarReward
property alpha: float

Percentile of CVaR

calc(assortment: Set[int]) → float[source]
Parameters

assortment – assortment to calculate

Returns

reward of the assortment

property name: str

Reward name

class banditpylib.bandits.MNLBandit(preference_params: numpy.ndarray, revenues: numpy.ndarray, card_limit: int = inf, reward: Optional[banditpylib.bandits.mnl_bandit_utils.Reward] = None, zero_best_reward: bool = False)[source]

MNL bandit

There are a total of \(N\) products, numbered from 1 by default. During each time step \(t\), when an assortment \(S_t\), which is a subset of products, is served, the online customer makes a choice, i.e., either buys a product in \(S_t\) or purchases nothing. The choice is modeled by

\[\mathbb{P}(c_t = i) = \frac{v_i}{\sum_{j \in S_t \cup \{0\} } v_j}\]

where 0 is reserved for non-purchase and \(v_0 = 1\). It is also assumed that preference parameters are within the range \([0, 1]\).
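For instance, with illustrative parameters \(v_1 = 0.5\), \(v_2 = 0.3\) and the served assortment \(S_t = \{1, 2\}\), the customer purchases product 1 with probability \(0.5 / (1 + 0.5 + 0.3) \approx 0.28\), product 2 with probability \(0.3 / 1.8 \approx 0.17\), and purchases nothing with probability \(1 / 1.8 \approx 0.56\).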

Suppose the rewards are \((r_0, \dots, r_N)\), where \(r_0\) is always 0. Let \(F(S)\) be the cumulative distribution function of the reward when \(S\) is served. Let \(U\) be a quasiconvex function denoting the reward the learner wants to maximize. The regret is defined as

\[T U(F(S^*)) - \sum_{t = 1}^T U(F(S_t))\]

where \(S^*\) is the optimal assortment.

Parameters
  • preference_params (np.ndarray) – preference parameters (product 0 should be included)

  • revenues (np.ndarray) – revenues of products (product 0 should be included)

  • card_limit (int) – cardinality constraint of an assortment, i.e., the number of products offered at a time is no greater than this number

  • reward (Reward) – reward the learner wants to maximize. The default is the mean reward

  • zero_best_reward (bool) – whether to set the reward of the best assortment to 0. This is useful when the instance is too large to compute the best assortment. When the best reward is set to zero, the regret equals the negative of the total revenue.
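A construction sketch with illustrative parameters, combining the CvarReward class documented above:

    import numpy as np

    from banditpylib.bandits import CvarReward, MNLBandit

    preference_params = np.array([1.0, 0.5, 0.3, 0.8])  # product 0 included
    revenues = np.array([0.0, 1.0, 1.5, 0.7])           # product 0 included
    bandit = MNLBandit(preference_params=preference_params,
                       revenues=revenues,
                       card_limit=2,
                       reward=CvarReward(alpha=0.05))
    bandit.reset()
    print(bandit.product_num)  # 3 (product 0 is not counted)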

Inheritance

Inheritance diagram of MNLBandit
property card_limit: float

Cardinality limit

property context: data_pb2.Context

Contextual information about the bandit environment

feed(actions: data_pb2.Actions) → data_pb2.Feedback[source]
Parameters

actions – actions for the bandit environment to execute

Returns

feedback after actions are executed

property name: str

Bandit name

property product_num: int

Number of products (not including product 0)

regret(goal: banditpylib.learners.utils.Goal) → float[source]
Parameters

goal – goal of the learner

Returns

regret of the learner

reset()[source]

Reset the bandit environment

Warning

This function should be called before the start of the game.

property revenues: numpy.ndarray

Revenues of products (product 0 is included, which is always 0.0)

class banditpylib.bandits.ThresholdingBandit(arms: List[banditpylib.arms.utils.StochasticArm], theta: float, eps: float)[source]

Thresholding bandit environment

Arms are indexed from 0 by default. Each time the learner pulls arm \(i\), she obtains an i.i.d. reward generated from an unknown distribution \(\mathcal{D}_i\). Unlike the ordinary multi-armed bandit, there is a threshold parameter \(\theta\): the learner should try to infer whether an arm’s expected reward is above the threshold or not. Besides, the environment also accepts a parameter \(\epsilon \geq 0\), the radius of the indifference zone, meaning that the answers for arms with expected rewards within \([\theta - \epsilon, \theta + \epsilon]\) do not matter.

Parameters
  • arms (List[StochasticArm]) – arms in thresholding bandit

  • theta (float) – threshold

  • eps (float) – radius of the indifference zone
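A construction sketch; as above, BernoulliArm is an assumed concrete StochasticArm and all numbers are illustrative:

    from banditpylib.arms import BernoulliArm  # assumed concrete StochasticArm
    from banditpylib.bandits import ThresholdingBandit

    # Arm means 0.3 and 0.7, threshold 0.5, indifference radius 0.1
    bandit = ThresholdingBandit(arms=[BernoulliArm(0.3), BernoulliArm(0.7)],
                                theta=0.5,
                                eps=0.1)
    bandit.reset()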

Inheritance

Inheritance diagram of ThresholdingBandit
property arm_num: int

Total number of arms

property context: data_pb2.Context

Contextual information about the bandit environment

feed(actions: data_pb2.Actions) → data_pb2.Feedback[source]
Parameters

actions – actions for the bandit environment to execute

Returns

feedback after actions are executed

property name: str

Bandit name

regret(goal: banditpylib.learners.utils.Goal) → float[source]
Parameters

goal – goal of the learner

Returns

regret of the learner

reset()[source]

Reset the bandit environment

Warning

This function should be called before the start of the game.

class banditpylib.bandits.ContextualBandit(context_generator: banditpylib.bandits.contextual_bandit_utils.ContextGenerator)[source]

Finite-armed contextual bandit

Arms are indexed from 0 by default. At time \(t\), the environment generates a context and a list of rewards incurred by the different arms, denoted by \((X_t, \{r_i^t\}_i)\), where \(X_t\) is the context and \(r_i^t\) is the reward when arm \(i\) is pulled. After receiving the learner’s action \(a_t\), the reward \(r_{a_t}^t\) is revealed to the learner. The batched version can be defined in a similar way.

Parameters

context_generator (ContextGenerator) – context generator
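A construction sketch using the RandomContextGenerator documented below; the arm count and dimension are illustrative:

    from banditpylib.bandits import ContextualBandit, RandomContextGenerator

    # 3 arms with 5-dimensional contexts drawn at random
    bandit = ContextualBandit(
        context_generator=RandomContextGenerator(arm_num=3, dimension=5))
    bandit.reset()
    print(bandit.arm_num)  # expected: 3, matching the generator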

Inheritance

Inheritance diagram of ContextualBandit
property arm_num: int

Total number of arms

property context: data_pb2.Context

Contextual information about the bandit environment

feed(actions: data_pb2.Actions) → data_pb2.Feedback[source]
Parameters

actions – actions for the bandit environment to execute

Returns

feedback after actions are executed

property name: str

Bandit name

regret(goal: banditpylib.learners.utils.Goal) → float[source]
Parameters

goal – goal of the learner

Returns

regret of the learner

reset()[source]

Reset the bandit environment

Warning

This function should be called before the start of the game.

class banditpylib.bandits.ContextGenerator(arm_num: int, dimension: int)[source]

Abstract context generator class

This class is used to generate the contexts of a contextual bandit.

Parameters
  • arm_num (int) – number of actions

  • dimension (int) – dimension of the context

Inheritance

Inheritance diagram of ContextGenerator
property arm_num: int

Number of actions

abstract context() → Tuple[numpy.ndarray, numpy.ndarray][source]

Context

Returns

the context and the rewards corresponding to different actions

property dimension: int

Dimension of the context

abstract property name: str

Context generator name

abstract reset()[source]

Reset the context generator

class banditpylib.bandits.RandomContextGenerator(arm_num: int, dimension: int)[source]

Random context generator

Fills contexts and rewards with random numbers in \([0, 1]\).

Parameters
  • arm_num (int) – number of actions

  • dimension (int) – dimension of the context
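A usage sketch; the shapes noted in the comment follow from the documented dimension and arm_num but are an assumption about the return layout:

    from banditpylib.bandits import RandomContextGenerator

    generator = RandomContextGenerator(arm_num=3, dimension=5)
    generator.reset()
    context, rewards = generator.context()
    # context is expected to have 5 entries and rewards 3 entries,
    # all in [0, 1]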

Inheritance

Inheritance diagram of RandomContextGenerator
context() → Tuple[numpy.ndarray, numpy.ndarray][source]

Context

Returns

the context and the rewards corresponding to different actions

property name: str

Context generator name

reset()[source]

Reset the context generator