banditpylib.bandits
¶
Functions¶
search_best_assortment()
: Search for the assortment with the maximum reward

local_search_best_assortment()
: Local search for the assortment with the maximum reward
- banditpylib.bandits.search_best_assortment(reward: banditpylib.bandits.mnl_bandit_utils.Reward, card_limit: int = inf) → Tuple[float, Set[int]][source]¶
Search for the assortment with the maximum reward
- Parameters
reward – reward definition
card_limit – cardinality constraint
- Returns
assortment with the maximum reward
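The exhaustive search enumerates every assortment within the cardinality limit and keeps the best one. Below is a minimal NumPy sketch of that brute-force idea, using the MNL expected revenue as a stand-in for the library's `Reward.calc`; the function names are illustrative, not the library API:

```python
from itertools import combinations
from typing import Set, Tuple

import numpy as np


def expected_revenue(assortment: Set[int], preference_params: np.ndarray,
                     revenues: np.ndarray) -> float:
  # MNL expected revenue; index 0 is the non-purchase option with v_0 = 1 and r_0 = 0.
  items = [0] + sorted(assortment)
  probs = preference_params[items] / preference_params[items].sum()
  return float(np.dot(probs, revenues[items]))


def brute_force_best_assortment(preference_params: np.ndarray,
                                revenues: np.ndarray,
                                card_limit: int) -> Tuple[float, Set[int]]:
  # Enumerate every assortment within the cardinality limit and keep the best.
  products = range(1, len(revenues))
  best_reward, best_assortment = 0.0, set()
  for size in range(1, min(card_limit, len(revenues) - 1) + 1):
    for subset in combinations(products, size):
      reward = expected_revenue(set(subset), preference_params, revenues)
      if reward > best_reward:
        best_reward, best_assortment = reward, set(subset)
  return best_reward, best_assortment
```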
- banditpylib.bandits.local_search_best_assortment(reward: banditpylib.bandits.mnl_bandit_utils.Reward, random_neighbors: int, card_limit: int, init_assortment: Optional[Set[int]] = None) → Tuple[float, Set[int]][source]¶
Local search for the assortment with the maximum reward
Warning
This method is not guaranteed to output the best assortment.
Todo
Implement this function with cppyy.
- Parameters
reward – reward definition
random_neighbors – number of random neighbors to look up
card_limit – cardinality constraint
init_assortment – initial assortment to start from
- Returns
local best assortment with its reward
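A hedged sketch of the local-search idea: start from an initial assortment, sample random neighbors obtained by adding, removing, or swapping a single product, and move whenever the reward improves. This mirrors the description above but is not the library's exact routine:

```python
import random
from typing import Callable, Set, Tuple


def local_search_sketch(product_num: int,
                        reward_fn: Callable[[Set[int]], float],
                        random_neighbors: int,
                        card_limit: int,
                        init_assortment: Set[int]) -> Tuple[float, Set[int]]:
  # Hill-climb over assortments: repeatedly jump to a better random neighbor.
  current = set(init_assortment)
  current_reward = reward_fn(current)
  improved = True
  while improved:
    improved = False
    for _ in range(random_neighbors):
      neighbor = set(current)
      product = random.randint(1, product_num)
      if product in neighbor:
        neighbor.remove(product)                      # drop a product
      elif len(neighbor) < card_limit:
        neighbor.add(product)                         # add a product
      else:
        neighbor.remove(random.choice(sorted(neighbor)))
        neighbor.add(product)                         # swap a product in
      neighbor_reward = reward_fn(neighbor) if neighbor else float('-inf')
      if neighbor_reward > current_reward:
        current, current_reward = neighbor, neighbor_reward
        improved = True
        break
  return current_reward, current
```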
Classes¶
Bandit
: Abstract class for bandit environments

MultiArmedBandit
: Multi-armed bandit

LinearBandit
: Finite-armed linear bandit

Reward
: General reward class

MeanReward
: Mean reward

CvarReward
: CVaR reward

MNLBandit
: MNL bandit

ThresholdingBandit
: Thresholding bandit environment

ContextualBandit
: Finite-armed contextual bandit

ContextGenerator
: Abstract context generator class

RandomContextGenerator
: Random context generator
- class banditpylib.bandits.Bandit[source]¶
Abstract class for bandit environments
Inheritance
- abstract property context: data_pb2.Context¶
Contextual information about the bandit environment
- abstract feed(actions: data_pb2.Actions) → data_pb2.Feedback[source]¶
- Parameters
actions – actions for the bandit environment to execute
- Returns
feedback after actions are executed
- abstract property name: str¶
Bandit name
- abstract regret(goal: banditpylib.learners.utils.Goal) → float[source]¶
- Parameters
goal – goal of the learner
- Returns
regret of the learner
- class banditpylib.bandits.MultiArmedBandit(arms: List[banditpylib.arms.utils.StochasticArm])[source]¶
Multi-armed bandit
Arms are indexed from 0 by default. Each pull of arm \(i\) will generate an i.i.d. reward from distribution \(\mathcal{D}_i\), which is unknown beforehand.
- Parameters
arms (List[StochasticArm]) – available arms
Inheritance
- property arm_num: int¶
Total number of arms
- property context: data_pb2.Context¶
Contextual information about the bandit environment
- feed(actions: data_pb2.Actions) → data_pb2.Feedback[source]¶
- Parameters
actions – actions for the bandit environment to execute
- Returns
feedback after actions are executed
- property name: str¶
Bandit name
- regret(goal: banditpylib.learners.utils.Goal) → float[source]¶
- Parameters
goal – goal of the learner
- Returns
regret of the learner
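A minimal sketch of the reward model described above, with Bernoulli arms in plain NumPy. The library's interaction goes through protobuf Actions/Feedback; this only illustrates the underlying model and how pseudo-regret is computed:

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.3, 0.5, 0.7])   # unknown means of the arm distributions D_i

def pull(arm: int) -> float:
  # Each pull returns an i.i.d. Bernoulli reward with the arm's unknown mean.
  return float(rng.random() < means[arm])

# Pseudo-regret of a uniformly random learner over T rounds:
# T * (best mean) minus the sum of the means of the pulled arms.
T = 1000
actions = rng.integers(0, len(means), size=T)
rewards = [pull(a) for a in actions]            # realized rewards (not needed for pseudo-regret)
pseudo_regret = T * means.max() - means[actions].sum()
```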
- class banditpylib.bandits.LinearBandit(features: List[numpy.ndarray], theta: numpy.ndarray, std: float = 1.0)[source]¶
Finite-armed linear bandit
Arms are indexed from 0 by default. Each pull of arm \(i\) generates an i.i.d. reward \(\langle \theta, v_i \rangle + \epsilon\), where \(v_i\) is the feature vector of arm \(i\), \(\theta\) is the unknown parameter, and \(\epsilon\) is zero-mean noise.
- Parameters
features (List[np.ndarray]) – feature vectors of the arms
theta (np.ndarray) – unknown parameter theta
std (float) – standard deviation of the noise
Inheritance
- property arm_num: int¶
Total number of arms
- property context: data_pb2.Context¶
Contextual information about the bandit environment
- property features: List[numpy.ndarray]¶
Feature vectors of the arms
- feed(actions: data_pb2.Actions) → data_pb2.Feedback[source]¶
- Parameters
actions – actions for the bandit environment to execute
- Returns
feedback after actions are executed
- property name: str¶
Bandit name
- regret(goal: banditpylib.learners.utils.Goal) → float[source]¶
- Parameters
goal – goal of the learner
- Returns
regret of the learner
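A small NumPy sketch of the linear reward model above (made-up feature vectors and parameter, Gaussian noise assumed purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, 0.5])                    # unknown parameter
features = [np.array([1.0, 0.0]),
            np.array([0.0, 1.0]),
            np.array([0.6, 0.8])]               # feature vectors v_i
std = 1.0                                       # noise standard deviation

def pull(arm: int) -> float:
  # Reward is <theta, v_i> plus zero-mean Gaussian noise.
  return float(features[arm] @ theta + rng.normal(0.0, std))

best_mean = max(float(v @ theta) for v in features)   # benchmark used by pseudo-regret
```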
- class banditpylib.bandits.Reward[source]¶
General reward class
Inheritance
- abstract calc(assortment: Set[int]) → float[source]¶
- Parameters
assortment – assortment to calculate
- Returns
reward of the assortment
- abstract property name: str¶
Reward name
- property preference_params: numpy.ndarray¶
Preference parameters (product 0 is included)
- property revenues: numpy.ndarray¶
Revenues of products (product 0 is included, which is always 0.0)
- class banditpylib.bandits.MeanReward[source]¶
Mean reward
Inheritance
- calc(assortment: Set[int]) → float[source]¶
- Parameters
assortment – assortment to calculate
- Returns
reward of the assortment
- property name: str¶
Reward name
- class banditpylib.bandits.CvarReward(alpha: float)[source]¶
CVaR reward
- Parameters
alpha (float) – percentile of CVaR
Inheritance
- property alpha: float¶
Percentile of CVaR
- calc(assortment: Set[int]) → float[source]¶
- Parameters
assortment – assortment to calculate
- Returns
reward of the assortment
- property name: str¶
Reward name
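For a discrete reward distribution (e.g., the revenue received under a fixed assortment), one common convention for CVaR at level \(\alpha\) is the mean of the worst \(\alpha\)-probability mass of outcomes. The sketch below follows that convention; the exact convention used by CvarReward may differ:

```python
import numpy as np

def cvar(values: np.ndarray, probs: np.ndarray, alpha: float) -> float:
  # CVaR at level alpha: the mean of the worst alpha-probability mass of outcomes.
  order = np.argsort(values)
  values, probs = values[order], probs[order]
  tail, used = 0.0, 0.0
  for v, p in zip(values, probs):
    take = min(p, alpha - used)      # take probability mass from the lowest outcomes first
    if take <= 0:
      break
    tail += take * v
    used += take
  return tail / alpha

# Example: revenues of a served assortment and their MNL purchase probabilities.
cvar(np.array([0.0, 1.0, 2.0]), np.array([0.4, 0.3, 0.3]), alpha=0.5)   # -> 0.2
```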
- class banditpylib.bandits.MNLBandit(preference_params: numpy.ndarray, revenues: numpy.ndarray, card_limit: int = inf, reward: Optional[banditpylib.bandits.mnl_bandit_utils.Reward] = None, zero_best_reward: bool = False)[source]¶
MNL bandit
There are a total of \(N\) products, where products are numbered from 1 by default. During each time step \(t\), when an assortment \(S_t\), which is a subset of the products, is served, the online customer makes a choice, i.e., whether to buy a product or to purchase nothing. The choice is modeled by
\[\mathbb{P}(c_t = i) = \frac{v_i}{\sum_{j \in S_t \cup \{0\} } v_j}\]
where 0 is reserved for non-purchase and \(v_0 = 1\). It is also assumed that the preference parameters lie within the range \([0, 1]\).
Suppose the rewards are \((r_0, \dots, r_N)\), where \(r_0\) is always 0. Let \(F(S)\) be the cumulative distribution function of the reward when \(S\) is served. Let \(U\) be a quasiconvex function denoting the reward the learner wants to maximize. The regret is defined as
\[T U(F(S^*)) - \sum_{t = 1}^T U(F(S_t))\]
where \(S^*\) is the optimal assortment.
- Parameters
preference_params (np.ndarray) – preference parameters (product 0 should be included)
revenues (np.ndarray) – revenues of products (product 0 should be included)
card_limit (int) – cardinality constraint of an assortment, i.e., the number of products offered at a time is no greater than this number
reward (Reward) – reward the learner wants to maximize. The default is the mean of rewards
zero_best_reward (bool) – whether to set the reward of the best assortment to 0. This is useful when the data is too large to compute the best assortment. When the best reward is set to zero, the regret equals the negative of the total revenue.
Inheritance
- property card_limit: float¶
Cardinality limit
- property context: data_pb2.Context¶
Contextual information about the bandit environment
- feed(actions: data_pb2.Actions) → data_pb2.Feedback[source]¶
- Parameters
actions – actions for the bandit environment to execute
- Returns
feedback after actions are executed
- property name: str¶
Bandit name
- property product_num: int¶
Number of products (not including product 0)
- regret(goal: banditpylib.learners.utils.Goal) → float[source]¶
- Parameters
goal – goal of the learner
- Returns
regret of the learner
- reset()[source]¶
Reset the bandit environment
Warning
This function should be called before the start of the game.
- property revenues: numpy.ndarray¶
Revenues of products (product 0 is included, which is always 0.0)
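A minimal NumPy sketch of the MNL choice model described above: serving an assortment samples one purchase (or no purchase) according to the MNL probabilities. This only illustrates the model, not the feed protocol, and the numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
preference_params = np.array([1.0, 0.8, 0.5, 0.3])   # v_0 = 1 for non-purchase
revenues = np.array([0.0, 1.0, 1.5, 2.0])            # r_0 = 0

def serve(assortment):
  # The customer buys product i (or nothing, i = 0) with probability
  # v_i / (sum of v_j over S_t union {0}).
  items = [0] + sorted(assortment)
  probs = preference_params[items] / preference_params[items].sum()
  choice = int(rng.choice(items, p=probs))
  return choice, float(revenues[choice])

choice, revenue = serve({1, 3})
```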
- class banditpylib.bandits.ThresholdingBandit(arms: List[banditpylib.arms.utils.StochasticArm], theta: float, eps: float)[source]¶
Thresholding bandit environment
Arms are indexed from 0 by default. Each time the learner pulls arm \(i\), she obtains an i.i.d. reward generated from an unknown distribution \(\mathcal{D}_i\). Different from the ordinary MAB, there is a threshold parameter \(\theta\), and the learner should try to infer whether an arm's expected reward is above the threshold or not. Besides, the environment also accepts a parameter \(\epsilon \geq 0\), the radius of the indifference zone, meaning that answers about arms with expected rewards within \([\theta - \epsilon, \theta + \epsilon]\) do not matter.
- Parameters
arms (List[StochasticArm]) – arms in thresholding bandit
theta (float) – threshold
eps (float) – radius of the indifference zone
Inheritance
- property arm_num: int¶
Total number of arms
- property context: data_pb2.Context¶
Contextual information about the bandit environment
- feed(actions: data_pb2.Actions) → data_pb2.Feedback[source]¶
- Parameters
actions – actions for the bandit environment to execute
- Returns
feedback after actions are executed
- property name: str¶
Bandit name
- regret(goal: banditpylib.learners.utils.Goal) → float[source]¶
- Parameters
goal – goal of the learner
- Returns
regret of the learner
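The indifference zone can be read as follows: an answer is judged only on arms whose means are clearly above or clearly below the threshold. A small sketch of that acceptance rule, stated on known means purely for illustration (the learner never observes them):

```python
import numpy as np

def answer_is_acceptable(answers: np.ndarray, means: np.ndarray,
                         theta: float, eps: float) -> bool:
  # answers[i] == 1 claims arm i's mean is above theta, 0 claims it is below.
  # Arms with means inside [theta - eps, theta + eps] are never counted as errors.
  clearly_above = means > theta + eps
  clearly_below = means < theta - eps
  wrong = (clearly_above & (answers == 0)) | (clearly_below & (answers == 1))
  return not bool(wrong.any())
```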
- class banditpylib.bandits.ContextualBandit(context_generator: banditpylib.bandits.contextual_bandit_utils.ContextGenerator)[source]¶
Finite-armed contextual bandit
Arms are indexed from 0 by default. At time \(t\), the environment generates a context and a list of rewards incurred by the different arms, denoted by \((X_t, \{r_i^t\}_i)\), where \(X_t\) is the context and \(r_i^t\) is the reward when arm \(i\) is pulled. After receiving the learner's action \(a_t\), the reward \(r_{a_t}^t\) is revealed to the learner. The batched version can be defined in a similar way.
- Parameters
context_generator (ContextGenerator) – context generator
Inheritance
- property arm_num: int¶
Total number of arms
- property context: data_pb2.Context¶
Contextual information about the bandit environment
- feed(actions: data_pb2.Actions) → data_pb2.Feedback[source]¶
- Parameters
actions – actions for the bandit environment to execute
- Returns
feedback after actions are executed
- property name: str¶
Bandit name
- regret(goal: banditpylib.learners.utils.Goal) → float[source]¶
- Parameters
goal – goal of the learner
- Returns
regret of the learner
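A toy sketch of the interaction above with contexts and per-arm rewards drawn uniformly from [0, 1] (the behavior RandomContextGenerator below describes), comparing the chosen arm's reward with the best reward of each round. This is only an illustration of the model, not the library's Actions/Feedback protocol:

```python
import numpy as np

rng = np.random.default_rng(0)
arm_num, dimension, T = 3, 5, 100
regret = 0.0

for _ in range(T):
  # The environment draws the context X_t and per-arm rewards {r_i^t} for this round.
  context = rng.random(dimension)
  rewards = rng.random(arm_num)
  # A placeholder learner that ignores the context and always pulls arm 0;
  # only rewards[action] would be revealed to it.
  action = 0
  regret += rewards.max() - rewards[action]
```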
- class banditpylib.bandits.ContextGenerator(arm_num: int, dimension: int)[source]¶
Abstract context generator class
This class is used to generate contexts for a contextual bandit.
- Parameters
arm_num (int) – number of actions
dimension (int) – dimension of the context
Inheritance
- property arm_num: int¶
Number of actions
- abstract context() → Tuple[numpy.ndarray, numpy.ndarray][source]¶
Context
- Returns
the context and the rewards corresponding to different actions
- property dimension: int¶
Dimension of the context
- abstract property name: str¶
Context generator name
- class banditpylib.bandits.RandomContextGenerator(arm_num: int, dimension: int)[source]¶
Random context generator
Fills contexts and rewards with random numbers drawn from \([0, 1]\).
- Parameters
arm_num (int) – number of actions
dimension (int) – dimension of the context
Inheritance
- context() → Tuple[numpy.ndarray, numpy.ndarray][source]¶
Context
- Returns
the context and the rewards corresponding to different actions
- property name: str¶
Context generator name