When your big data seems too small: accurate inferences beyond the empirical distribution
We discuss two problems related to the general challenge of making accurate inferences about a complex distribution, in the regime in which the amount of data (i.e the sample size) is too small for the empirical distribution of the samples to be an accurate representation of the underlying distribution. The first problem is the basic task of learning a discrete distribution, given access to independent draws. We show that one can accurately recover the unlabelled vector of probabilities of all domain elements whose true probability is greater than 1/(n log n). Stated differently, one can learn–up to relabelling–the portion of the distribution consisting of elements with probability greater than 1/(n log n). This result has several curious implications, including leading to an optimal algorithm for “de-noising” the empirical distribution of the samples, and implying that one can accurately estimate the number of new domain elements that would be seen given a new larger sample, of size up to n * log n. (Extrapolation beyond this sample size is provable information theoretically impossible, without additional assumptions on the distribution.) While these results are applicable generally, we highlight an adaptation of this general approach to some problems in genomics (e.g. quantifying the number of unobserved protein coding variants).
The second problem we consider is the task of accurately estimating the eigenvalues of the covariance matrix of a (high dimensional real-valued) distribution–the “population spectrum”. (These eigenvalues contain basic information about the distribution, including the presence or lack of low-dimensional structure in the distribution and the applicability of many higher-level machine learning and multivariate statistical tools.) As we show, even in the regime where the sample size is linear or sublinear in the dimensionality of the distribution, and hence the eigenvalues and eigenvectors of the empirical covariance matrix are misleading, accurate approximations to the true population spectrum are possible.
This talk is based on three papers, which are joint works with Paul Valiant, with Paul Valiant and James Zou, and with Weihao Kong.
Greg Valiant joined the Computer Science Department at Stanford as an Assistant Professor in Fall 2013, after completing a postdoc at Microsoft Research, New England. His main research interests are in algorithms, learning, applied probability and statistics; he is also interested in game theory, and has enjoyed working on problems in database theory. Valiant graduated from Harvard with a BA in Math and an MS in Computer Science, and obtained his PhD in Computer Science from UC Berkeley in 2012.