Research
Research Interests
My research focuses on multi-armed bandit problems. More broadly, I am interested in reinforcement learning and its applications.
Papers
Note: * indicates alphabetical author ordering.
-
Variance-Dependent Best Arm Identification
Pinyan Lu*, Chao Tao*, Xiaojin Zhang*
UAI 2021
Abstract
We study the problem of identifying the best arm in a stochastic multi-armed bandit game. Given a set of \(n\) arms indexed from \(1\) to \(n\), each arm \(i\) is associated with an unknown reward distribution supported on \([0,1]\) with mean \(\theta_i\) and variance \(\sigma_i^2\). Assume \(\theta_1 > \theta_2 \geq \cdots \geq \theta_n\). We propose an adaptive algorithm that explores the gaps and variances of the rewards of the arms and makes future decisions based on the gathered information, using a novel approach called \(\textit{grouped median elimination}\). The proposed algorithm is guaranteed to output the best arm with probability at least \(1-\delta\) and uses at most \(O \left(\sum_{i = 1}^n \left(\frac{\sigma_i^2}{\Delta_i^2} + \frac{1}{\Delta_i}\right)(\ln \delta^{-1} + \ln \ln \Delta_i^{-1})\right)\) samples, where \(\Delta_i\) (\(i \geq 2\)) denotes the reward gap between arm \(i\) and the best arm and we define \(\Delta_1 = \Delta_2\). This achieves a significant advantage over variance-independent algorithms in some favorable scenarios and is the first result that removes the extra \(\ln n\) factor on the best arm compared with the state of the art. We further show that \(\Omega \left( \sum_{i = 1}^n \left( \frac{\sigma_i^2}{\Delta_i^2} + \frac{1}{\Delta_i} \right) \ln \delta^{-1} \right)\) samples are necessary for any algorithm to achieve the same goal, thereby illustrating that our algorithm is optimal up to doubly logarithmic terms.
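The grouped variant is specific to the paper, but its starting point, classic median elimination, is easy to sketch. Below is a minimal Python illustration of that classic subroutine (in the spirit of Even-Dar et al.), not the paper's algorithm; the function name, the `pull` interface, and the toy arm means are my own choices for illustration.

```python
import numpy as np

def median_elimination(pull, n, eps, delta):
    """Classic median elimination: with probability at least 1 - delta,
    returns an arm whose mean is within eps of the best arm's mean.
    `pull(i)` draws one [0, 1]-bounded reward from arm i."""
    arms = list(range(n))
    eps_l, delta_l = eps / 4.0, delta / 2.0
    while len(arms) > 1:
        # Sample every surviving arm equally often this round.
        t = int(np.ceil(4.0 / eps_l**2 * np.log(3.0 / delta_l)))
        means = {i: np.mean([pull(i) for _ in range(t)]) for i in arms}
        # Keep the better half of the arms (by empirical mean).
        arms = sorted(arms, key=lambda i: means[i], reverse=True)
        arms = arms[: (len(arms) + 1) // 2]
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2.0
    return arms[0]

# Toy usage with Bernoulli arms; the means below are made up.
rng = np.random.default_rng(0)
mus = [0.9, 0.8, 0.5, 0.3]
best = median_elimination(lambda i: rng.binomial(1, mus[i]),
                          len(mus), eps=0.2, delta=0.1)
print(best)  # an eps-optimal arm, here 0 or 1
```

Each elimination round halves the surviving arms while tightening the accuracy and confidence schedule, which is the basic mechanism the paper's grouped, variance-aware variant refines.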
-
Near-Optimal MNL Bandits Under Risk Criteria
Guangyu Xi, Chao Tao, Yuan Zhou
AAAI 2021
Abstract
We study MNL bandits, a variant of the traditional multi-armed bandit problem, under risk criteria. Unlike the ordinary expected revenue, risk criteria are more general objectives widely used in industry and business. We design algorithms for a broad class of risk criteria, including but not limited to the well-known conditional value-at-risk, Sharpe ratio, and entropy risk, and prove that they achieve near-optimal regret. As a complement, we also conduct experiments with both synthetic and real data to show the empirical performance of our proposed algorithms.
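To make one of the named criteria concrete, here is a small sketch of the empirical conditional value-at-risk; the lower-tail convention, function name, and the toy revenue distribution are my own assumptions for illustration, not code from the paper.

```python
import numpy as np

def empirical_cvar(samples, alpha):
    """Empirical conditional value-at-risk at level alpha: the mean of
    the worst alpha-fraction of outcomes (lower tail of revenues)."""
    x = np.sort(np.asarray(samples, dtype=float))
    k = max(1, int(np.ceil(alpha * len(x))))
    return x[:k].mean()

# Example with made-up revenue draws.
rng = np.random.default_rng(0)
revenues = rng.normal(loc=10.0, scale=2.0, size=10_000)
print(empirical_cvar(revenues, alpha=0.05))  # average of the worst 5%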
-
Thresholding Bandit with Optimal Aggregate Regret
Chao Tao, Saúl A. Blanco, Jian Peng, Yuan Zhou
NeurIPS 2019
Abstract
We consider the thresholding bandit problem, whose goal is to find the arms whose mean rewards lie above a given threshold \(\theta\), with a fixed budget of \(T\) trials. We introduce LSA, a new, simple, anytime algorithm that aims to minimize the aggregate regret (i.e., the expected number of misclassified arms). We prove that our algorithm is instance-wise asymptotically optimal. We also provide comprehensive empirical results demonstrating the algorithm's superior performance over existing algorithms under a variety of scenarios.
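For intuition about the objective, here is a sketch of the setup with a naive uniform-allocation baseline; this is not LSA, and all function names, arm means, and numbers are my own illustrative choices.

```python
import numpy as np

def uniform_thresholding(pull, n, theta, T):
    """Naive baseline: split the budget T evenly across the n arms,
    then label each arm by comparing its empirical mean to theta."""
    t = T // n
    means = np.array([np.mean([pull(i) for _ in range(t)])
                      for i in range(n)])
    return means >= theta  # predicted set of arms above the threshold

def misclassified(pred_above, mus, theta):
    """Count of arms whose predicted label disagrees with the truth;
    the aggregate regret is the expectation of this count."""
    return int(np.sum(pred_above != (np.asarray(mus) >= theta)))

# Toy run with Bernoulli arms; means and threshold are illustrative.
rng = np.random.default_rng(0)
mus, theta = [0.1, 0.4, 0.55, 0.9], 0.5
pred = uniform_thresholding(lambda i: rng.binomial(1, mus[i]),
                            len(mus), theta, T=4000)
print(misclassified(pred, mus, theta))
```

The hard arms are those with means close to \(\theta\) (here 0.4 and 0.55); an adaptive algorithm spends its budget on them rather than splitting it evenly.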
-
Collaborative Learning with Limited Interaction: Tight Bounds for Distributed Exploration in Multi-Armed Bandits
Chao Tao*, Qin Zhang*, Yuan Zhou*
FOCS 2019
Abstract
Best arm identification (or, pure exploration) in multi-armed bandits is a fundamental problem in machine learning. In this paper, we study the distributed version of this problem, where multiple agents collaborate to learn the best arm. We want to quantify the power of collaboration under limited interaction (or, communication steps), as interaction is expensive in many settings. We measure the running time of a distributed algorithm as the speedup over the best centralized algorithm, where there is only one agent. We give almost tight round-speedup tradeoffs for this problem, and along the way develop several new techniques for proving lower bounds on the number of communication steps under time or confidence constraints.
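To illustrate the interaction model (a round is one synchronized communication step), here is a toy single-process simulation of round-limited collaborative elimination; it is a sketch of the setting, not the paper's algorithm, and all parameters and names are invented for the example.

```python
import numpy as np

def collaborative_elimination(pull, n, agents, rounds, pulls_per_arm):
    """Toy R-round collaborative best-arm identification: each round,
    every agent samples the surviving arms (in parallel in the real
    setting), then one communication step averages the estimates and
    the worse half of the arms is eliminated."""
    arms = list(range(n))
    for _ in range(rounds):
        if len(arms) == 1:
            break
        local = np.array([[np.mean([pull(i) for _ in range(pulls_per_arm)])
                           for i in arms] for _ in range(agents)])
        means = local.mean(axis=0)  # the single communication step
        keep = np.argsort(means)[::-1][: max(1, len(arms) // 2)]
        arms = [arms[j] for j in keep]
    return arms[0]

# Illustrative run: 8 arms, 4 agents, 3 communication rounds.
rng = np.random.default_rng(0)
mus = [0.9, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
print(collaborative_elimination(lambda i: rng.binomial(1, mus[i]),
                                len(mus), agents=4, rounds=3,
                                pulls_per_arm=500))
```

Halving per round means roughly \(\log n\) rounds suffice here; the paper's tradeoffs quantify how much speedup is achievable when far fewer rounds are allowed.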
-
Best Arm Identification in Linear Bandits with Linear Dimension Dependency
Chao Tao, Saúl A. Blanco, Yuan Zhou
ICML 2018
Abstract
We study the best arm identification problem in linear bandits, where the mean reward of each arm depends linearly on an unknown \(d\)-dimensional parameter vector \(\theta\), and the goal is to identify the arm with the largest expected reward. We first design and analyze a novel randomized estimator of \(\theta\) based on the solution to the convex relaxation of an optimal \(G\)-allocation experiment design problem. Using this estimator, we describe an algorithm whose sample complexity depends linearly on the dimension \(d\), as well as an algorithm whose sample complexity depends on the reward gaps of the best \(d\) arms, matching the lower bound arising from the ordinary top-arm identification problem. Finally, we compare the empirical performance of our algorithms with other state-of-the-art algorithms in terms of both sample complexity and computational time.
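The relaxed \(G\)-allocation design has a classical computational route worth sketching: Frank-Wolfe on the \(D\)-optimal objective, whose optimum coincides with the \(G\)-optimal one by the Kiefer-Wolfowitz equivalence. The sketch below pairs it with a plain least-squares estimate of \(\theta\); it is not the paper's randomized estimator, and every name and number is my own illustrative choice.

```python
import numpy as np

def g_optimal_design(X, iters=2000):
    """Frank-Wolfe on the continuous relaxation: weights w maximizing
    log det A(w), with A(w) = sum_i w_i x_i x_i^T, also minimize
    max_i x_i^T A(w)^{-1} x_i (Kiefer-Wolfowitz equivalence)."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    for k in range(iters):
        A_inv = np.linalg.pinv(X.T @ (w[:, None] * X))
        scores = np.einsum('ij,jk,ik->i', X, A_inv, X)  # x_i^T A^{-1} x_i
        j = np.argmax(scores)          # Frank-Wolfe vertex
        step = 2.0 / (k + 2.0)
        w *= 1.0 - step
        w[j] += step
    return w

def estimate_theta(X, w, pull, budget):
    """Pull arms roughly in proportion to the design w and return the
    ordinary least-squares estimate of theta."""
    counts = np.round(w * budget).astype(int)
    rows, ys = [], []
    for i, c in enumerate(counts):
        rows += [X[i]] * c
        ys += [pull(i) for _ in range(c)]
    return np.linalg.lstsq(np.array(rows), np.array(ys), rcond=None)[0]

# Illustrative run: 20 random arms in 3 dimensions, noisy linear rewards.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
theta = np.array([1.0, -0.5, 0.25])
w = g_optimal_design(X)
pull = lambda i: X[i] @ theta + rng.normal(scale=0.1)
print(estimate_theta(X, w, pull, budget=2000))
```

The design typically concentrates its weight on a small subset of informative arms, which is what lets the sample complexity scale with \(d\) rather than with the number of arms.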
Talks Given
“Near-Optimal MNL Bandits Under Risk Criteria”, Shanghai Jiao Tong University, Remote from Bloomington, IN, 03/07/2021
“Near-Optimal MNL Bandits Under Risk Criteria”, Shandong University, Remote from Bloomington, IN, 12/29/2020
“Thresholding Bandit with Optimal Aggregate Regret”, Shandong University, Qingdao, China, 01/08/2020
“Optimal Maximum Gap Estimation in the Multi-armed Bandit”, INFORMS Annual Meeting, Houston, TX, 2017
Reading Group
Software
Services
Reviewer/Sub-reviewer for: ICDE (2021), SIGMOD (2021), SOSA (2021), VLDB (2020), ICML (2022, 2021, 2020), AAAI (2020), BigData (2019), ISAAC (2019, 2018), NeurIPS (2021, 2020, 2019), IJCAI (2019), ITCS (2018), TCS (2018)