Title: Conservative Bandits
Abstract: We study a novel multi-armed bandit problem that models the challenge faced by a company wishing to explore new strategies to maximize revenue whilst simultaneously maintaining their revenue above a fixed baseline, uniformly over time. While previous work addressed the problem under the weaker requirement of maintaining the revenue constraint only at a given fixed time in the future, the algorithms previously proposed are unsuitable due to their design under the more stringent constraints. We consider both the stochastic and the adversarial settings, where we propose, natural, yet novel strategies and analyze the price for maintaining the constraints. Amongst other things, we prove both high probability and expectation bounds on the regret, while we also consider both the problem of maintaining the constraints with high probability or expectation. For the adversarial setting the price of maintaining the constraint appears to be higher, at least for the algorithm considered. A lower bound is given showing that the algorithm for the stochastic setting is almost optimal. Empirical results obtained in synthetic environments complement our theoretical findings.
Bio. Csaba Szepesvari (PhD'99) is currently a Professor at the Department of Computing Science of the University of Alberta and a Principal Investigator of the Alberta Innovates Center for Machine Learning. Before coming to work in Canada, he was the head of the machine learning group at the Computer and Automation Research Institute of the Hungarian Academy of sciences, before which he held various leadership positions in technology companies developing and using AI and machine learning techniques in language, speech and vision applications. The software that he developed has been deployed in various countries and some of the software he developed is still in use today.
The coauthor of a book on nonlinear approximate adaptive controllers and the author of a recent book on reinforcement learning, he published about 150 journal and conference papers. He is best known for the UCT algorithm that he co-invented. UCT is a tree-search algorithm, which inspired much subsequent work in AI, search and optimization. Today, UCT provides the foundations for many applications, including the recent computer go program developed by Deepmind, a program that beat the European Go Champion 5:0. He serves as an action editor of the Journal of Machine Learning Research and the Machine Learning Journal. He served as a co-chair for the 2014 Conference on Learning Theory, and for the Algorithmic Learning Theory Conference in 2011, as well as a senior PC member for NIPS, ICML, AISTATS, AAAI and IJCAI multiple times. He is interested in designing principled learning algorithms for agents that learn from and control their environments.