Abstract: The formulations and theories of multi-armed bandit (MAB) problems provide fundamental tools for optimal sequential decision making and learning in uncertain environments. They have been widely applied to resource allocation, scheduling, and routing in communication networks, particularly in recent years, as the field is seeing an increasing focus on adaptive online learning mechanisms to enhance system performance in stochastic, dynamic, and distributed environments. In this talk, we will discuss some problems lying in this topic.
The first part is about MAB with linear rewards. As they are fundamentally about combinatorial optimization in unknown environments, one would indeed expect to find even broader use of multi-armed bandits. However, a barrier to their wider application in practice has been the limitation of the basic formulation and corresponding policies, which generally treat each arm as an independent entity. They are inadequate to deal with many combinatorial problems of practical interest in which there are large (exponential) numbers of arms. In such settings, it is important to consider and exploit any structure in terms of dependencies between the arms. I will show in this talk that when the dependencies take a linear form, they can be handled tractably with policies that have provably good performance in terms of regret as well as storage and computation.
The second part of the talk is about learning in decentralized settings. The desired objective is to develop decentralized online learning algorithms running at each secondary user to make a selection among multiple choices, where there is no information exchange, such that the sum-throughput of all distributed users is maximized. We make two contributions in this problem. First, we consider the setting where the users have a prioritized ranking, such that it is desired for the K-th ranked user to learn to access the arm offering the K-th highest mean reward. For this problem, we present the first distributed policy that yields regret that is uniformly logarithmic over time without requiring any prior assumption about the mean rewards. Second, we consider the case when a fair access policy is required, i.e., it is desired for all users to experience the same mean reward. For this problem, we present a distributed policy that yields order-optimal regret scaling with respect to the number of users and arms, better than previously proposed policies in the literature. We also propose a novel online learning policy for the case that the state of each arm evolves as a finite-state Markov chain with unknown parameters.
Besides, I will also briefly overview my work on learning with non-linear rewards and other settings.
Biography: Yi Gai is a PhD candidate in the Electrical Engineering Department at University of Southern California (USC). She received the B.E. and M.S. degrees in Electrical Engineering at Tsinghua University, China, in 2005 and 2007 respectively. Yi is a recipient of an Annenberg Graduate Fellowship, WiSE Merit Fellowship and Oakley Fellowship at USC. She received the USC center for applied mathematical sciences (CAMS) prize in 2011 for excellence in research with a substantial mathematical component. She has also been the recipient of IEEE SECON best paper runner-up in 2012.