### Schedule session 1

Join a track by clicking the track title. Clicking any paper will show the abstract, and links to paper and video.

Only users who registered for UAI 2020 can access videos and Zoom - to register, click here.

### 77: Regret Analysis of Bandit Problems with Causal Background Knowledge

Track 1: Bandits Yangyi Lu, Amirhossein Meisami, Ambuj Tewari, William Yan
[Paper] [Video (inline)] [Video (new tab)]

We study how to learn optimal interventions sequentially given causal information represented as a causal graph along with associated conditional distributions. Causal modeling is useful in real world problems like online advertisement where complex causal mechanisms underlie the relationship between interventions and outcomes. We propose two algorithms, causal upper confidence bound (C-UCB) and causal Thompson Sampling (C-TS), that enjoy improved cumulative regret bounds compared with algorithms that do not use causal information. We thus resolve an open problem posed by Lattimore et al. (2016). Further, we extend C-UCB and C-TS to the linear bandit setting and propose causal linear UCB (CL-UCB) and causal linear TS (CL-TS) algorithms. These algorithms enjoy a cumulative regret bound that only scales with the feature dimension. Our experiments show the benefit of using causal information. For example, we observe that even with a few hundreds of iterations, the regret of causal algorithms is less than that of standard algorithms by a factor of three. We also show that under certain causal structures, our algorithms scale better than the standard bandit algorithms as the number of interventions increases.

### 49: Randomized Exploration for Non-Stationary Stochastic Linear Bandits

Track 1: Bandits Baekjin Kim, Ambuj Tewari
[Paper] [Video (inline)] [Video (new tab)]

We investigate two perturbation approaches to overcome conservatism that optimism based algorithms chronically suffer from in practice. The first approach replaces optimism with a simple randomization when using confidence sets. The second one adds random perturbations to its current estimate before maximizing the expected reward. For non-stationary linear bandits, where each action is associated with a $d$-dimensional feature and the unknown parameter is time-varying with total variation $B_T$, we propose two randomized algorithms, Discounted Randomized LinUCB (D-RandLinUCB) and Discounted Linear Thompson Sampling (D-LinTS) via the two perturbation approaches. We highlight the statistical optimality versus computational efficiency trade-off between them in that the former asymptotically achieves the optimal dynamic regret $\tilde{\cO}(d ^{2/3}B_T^{1/3} T^{2/3})$, but the latter is oracle-efficient with an extra logarithmic factor in the number of arms compared to minimax-optimal dynamic regret. In a simulation study, both algorithms show the outstanding performance in tackling conservatism issue that Discounted LinUCB struggles with.

### 198: What You See May Not Be What You Get: UCB Bandit Algorithms Robust to ε-Contamination

Track 1: Bandits Laura Niss, Ambuj Tewari
[Paper] [Video (inline)] [Video (new tab)]

Motivated by applications of bandit algorithms in education, we consider a stochastic multi-armed bandit problem with ε-contaminated rewards. We allow an adversary to arbitrarily give unbounded contaminated rewards with full knowledge of the past and future. We impose only the constraint that at any time t the proportion of contaminated rewards for any actions is less than or equal to ε. We derive concentration inequalities for two robust mean estimators for sub-Gaussian distributions in the ε-contamination context. We define the ε-contaminated stochastic bandit problem and use our robust mean estimators to give two variants of a robust Upper Confidence Bound (UCB) algorithm, crUCB. Using regret derived from only the underlying stochastic rewards, both variants of crUCB achieve O(√KTlogT). Our simulations are designed to reflect reasonable settings a teacher would experience when implementing a bandit algorithm. We show that in certain adversarial regimes crUCB not only outperforms algorithms designed for stochastic (UCB1) and adversarial bandits (EXP3) but also those that have “best of both worlds” guarantees (EXP3++ and Tsallis-Inf) even when our constraint on the proportion of contaminated rewards is broken.

### 374: Semi-bandit Optimization in the Dispersed Setting

Track 1: Bandits Travis Dick, Wesley Pegden, Maria-Florina Balcan
[Paper] [Video (inline)] [Video (new tab)]

The goal of data-driven algorithm design is to obtain high-performing algorithms for specific application domains using machine learning and data. Across many fields in AI, science, and engineering, practitioners will often fix a family of parameterized algorithms and then optimize those parameters to obtain good performance on example instances from the application domain. In the online setting, we must choose algorithm parameters for each instance as they arrive, and our goal is to be competitive with the best fixed algorithm in hindsight. There are two major challenges in online data-driven algorithm design. First, it can be computationally expensive to evaluate the loss functions that map algorithm parameters to performance, which often require the learner to run a combinatorial algorithm to measure its performance. Second, the losses can be extremely volatile and have sharp discontinuities. However, we show that in many applications, evaluating the loss function for one algorithm choice can sometimes reveal the loss for a range of similar algorithms, essentially for free. We develop online optimization algorithms capable of using this kind of extra information by working in the semi-bandit feedback setting. Our algorithms achieve regret bounds that are essentially as good as algorithms under full-information feedback and are significantly more computationally efficient. We apply our semi-bandit results to obtain the first provable guarantees for data-driven algorithm design for linkage-based clustering and we improve the best regret bounds for designing greedy knapsack algorithms.

### 372: Zeroth Order Non-convex optimization with Dueling-Choice Bandits

Track 1: Bandits Yichong Xu, Aparna Joshi, Aarti Singh, Artur Dubrawski
[Paper] [Video (inline)] [Video (new tab)]

We consider a novel setting of zeroth order non-convex optimization, where in addition to querying the function value at a given point, we can also duel two points and get the point with the larger function value. We refer to this setting as optimization with dueling-choice bandits, since both direct queries and duels are available for optimization. We give the COMP-GP-UCB algorithm based on GP-UCB (Srinivas et al., 2009),, where instead of directly querying the point with the maximum Upper Confidence Bound (UCB), we perform constrained optimization and use comparisons to filter out suboptimal points. COMP-GP-UCB comes with theoretical guarantee of $O(\frac{\Phi}{\sqrt{T}})$ on simple regret where $T$ is the number of direct queries and $\Phi$ is an improved information gain stemming from a comparison-based constraint set that restricts the space for optimum search. In contrast, in the plain direct query setting, $\Phi$ depends on the entire domain. We discuss theoretical aspects and show experimental results to demonstrate efficacy of our algorithm.

### 202: Learning by Repetition: Stochastic Multi-armed Bandits under Priming Effect

Track 1: Bandits Priyank Agrawal, Theja Tulabandula
[Paper] [Video (inline)] [Video (new tab)]

We study the effect of persistence of engagement on learning in a stochastic multi-armed bandit setting. In advertising and recommendation systems, repetition effect includes a wear-in period, where the user's propensity to reward the platform via a click or purchase depends on how frequently they see the recommendation in the recent past. It also includes a counteracting wear-out period, where the user's propensity to respond positively is dampened if the recommendation was shown too many times recently. Priming effect can be naturally modelled as a temporal constraint on the strategy space, since the reward for the current action depends on historical actions taken by the platform. We provide novel algorithms that achieves sublinear regret in time and the relevant wear-in/wear-out parameters. The effect of priming on the regret upper bound is also additive, and we get back a guarantee that matches popular algorithms such as the UCB1 and Thompson sampling when there is no priming effect. Our work complements recent work on modeling time varying rewards, delays and corruptions in bandits, and extends the usage of rich behavior models in sequential decision making settings.

### 455: Generalized Policy Elimination: an efficient algorithm for Nonparametric Contextual Bandits

Track 1: Bandits Aurelien Bibaut, Antoine Chambaz, Mark van der Laan
[Paper] [Video (inline)] [Video (new tab)]

We propose the Generalized Policy Elimination (GPE) algorithm, an oracle-efficient contextual bandit (CB) algorithm inspired by the Policy Elimination algorithm of \cite{dudik2011}. We prove the first regret-optimality guarantee theorem for an oracle-efficient CB algorithm competing against a nonparametric class with infinite VC-dimension. Specifically, we show that GPE is regret-optimal (up to logarithmic factors) for policy classes with integrable entropy. For classes with larger entropy, we show that the core techniques used to analyze GPE can be used to design an $\varepsilon$-greedy algorithm with regret bound matching that of the best algorithms to date. We illustrate the applicability of our algorithms and theorems with examples of large nonparametric policy classes, for which the relevant optimization oracles can be efficiently implemented.

### 521: Bounded Rationality in Las Vegas: Probabilistic Finite Automata Play Multi-Armed Bandits

Track 1: Bandits Xinming Liu, Joseph Halpern
[Paper] [Video (inline)] [Video (new tab)]

While traditional economics assumes that humans are fully rational agents who always maximize their expected utility, in practice, we constantly observe apparently irrational behavior. One explanation is that people have limited computational power, so that they are, quite rationally, making the best decisions they can, given their computational limitations. To test this hypothesis, we consider the multi-armed bandit (MAB) problem. We examine a simple strategy for playing an MAB that can be implemented easily by a probabilistic finite automaton (PFA). Roughly speaking, the PFA sets certain expectations, and plays an arm as long as it meets them. If the PFA has sufficiently many states, it performs near-optimally. Its performance degrades gracefully as the number of states decreases. Moreover, the PFA acts in a "human-like" way, exhibiting a number of standard human biases, like an optimism bias and a negativity bias.

### 496: Distortion estimates for approximate Bayesian inference

Track 2: Inference Hanwen Xing, Geoff Nicholls, Jeong Lee
[Paper] [Video (inline)] [Video (new tab)]

Current literature on posterior approximation for Bayesian inference offers many alternative methods. Does our chosen approximation scheme work well on the observed data? The best existing generic diagnostic tools treating this kind of question by looking at performance averaged over data space, or otherwise lack diagnostic detail. However, if the approximation is bad for most data, but good at the observed data, then we may discard a useful approximation. We give graphical diagnostics for posterior approximation at the observed data. We estimate a distortion map'' that acts on univariate marginals of the approximate posterior to move them closer to the exact posterior, without recourse to the exact posterior.

### 39: Amortized variance reduction for doubly stochastic objective

Track 2: Inference Ayman Boustati, Sattar Vakili, James Hensman, ST John
[Paper] [Video (inline)] [Video (new tab)]

Approximate inference in complex probabilistic models such as deep Gaussian processes requires the optimisation of doubly stochastic objective functions. These objectives incorporate randomness both from mini-batch subsampling of the data and from Monte Carlo estimation of expectations. If the gradient variance is high, the stochastic optimisation problem becomes difficult with a slow rate of convergence. Control variates can be used to reduce the variance, but past approaches do not take into account how mini-batch stochasticity affects sampling stochasticity, resulting in sub-optimal variance reduction. We propose a new approach in which we use a recognition network to cheaply approximate the optimal control variate for each mini-batch, with no additional model gradient computations. We illustrate the properties of this proposal and test its performance on logistic regression and deep Gaussian processes.

### 518: Flexible Approximate Inference via Stratified Normalizing Flows

Track 2: Inference Chris Cundy, Stefano Ermon
[Paper] [Video (inline)] [Video (new tab)]

A major obstacle to forming posterior distributions in machine learning is the difficulty of evaluating partition functions. Monte-Carlo approaches are unbiased, but can suffer from high variance. Variational methods are biased, but tend to have lower variance. We develop an approximate inference procedure that allows explicit control of the bias/variance tradeoff, interpolating between the sampling and the variational regime. We use a normalizing flow to map the integrand onto a uniform distribution. We then randomly sample regions from a partition of this uniform distribution and fit simpler, local variational approximations in the image of these regions through the flow. When a partition with only one region is used, we recover standard variational inference, and in the limit of an infinitely fine partition we recover Monte-Carlo sampling. We show experiments validating the effectiveness of our approach.

### 177: High Dimensional Discrete Integration over the Hypergrid

Track 2: Inference Raj Kumar Maity, Arya Mazumdar, Soumyabrata Pal
[Paper] [Video (inline)] [Video (new tab)]

Recently Ermon et al. (2013) pioneered a way to practically compute approximations to large scale counting or discrete integration problems by using random hashes. The hashes are used to reduce the counting problem into many separate discrete optimization problems. The optimization problems then can be solved by an NP-oracle such as commercial SAT solvers or integer linear programming (ILP) solvers. In particular, Ermon et al. showed that if the domain of integration is $\{0,1\}^n$ then it is possible to obtain a solution within a factor of $16$ of the optimal (16-approximation) by this technique. In many crucial counting tasks, such as computation of partition function of ferromagnetic Potts model, the domain of integration is naturally $\{0,1,\dots, q-1\}^n, q>2$, the hypergrid. The straightforward extension of Ermon et al.'s method allows a $q^2$-approximation for this problem. For large values of $q$, this is undesirable. In this paper, we show an improved technique to obtain an approximation factor of $4+O(1/q^2)$ to this problem. We are able to achieve this by using an idea of optimization over multiple bins of the hash functions, that can be easily implemented by inequality constraints, or even in unconstrained way. The NP oracle in this setting can be simulated by using an ILP solver as in Ermon et. al. We provide simulation results to support the theoretical guarantees of our algorithms.

### 512: Symbolic Querying of Vector Spaces: Probabilistic Databases Meets Relational Embeddings

Track 2: Inference Tal Friedman, Guy Van den Broeck
[Paper] [Video (inline)] [Video (new tab)]

We propose unifying techniques from probabilistic databases and relational embedding models with the goal of performing complex queries on incomplete and uncertain data. We formalize a probabilistic database model with respect to which all queries are done. This allows us to leverage the rich literature of theory and algorithms from probabilistic databases for solving problems. While this formalization can be used with any relational embedding model, the lack of a well-defined joint probability distribution causes simple query problems to become provably hard. With this in mind, we introduce TractOR, a relational embedding model designed to be a tractable probabilistic database, by exploiting typical embedding assumptions within the probabilistic framework. Using a principled, efficient inference algorithm that can be derived from its definition, we empirically demonstrate that TractOR is an effective and general model for these querying tasks.

### 367: Ordering Variables for Weighted Model Integration

Track 2: Inference Vincent Derkinderen, Evert Heylen, Pedro Zuidberg Dos Martires, Samuel Kolb, Luc de Raedt
[Paper] [Video (inline)] [Video (new tab)]

State-of-the-art probabilistic inference algorithms, such as variable elimination and search-based approaches, rely heavily on the order in which variables are marginalized. Finding the optimal ordering is an NP-complete problem. This computational hardness has led to heuristics to find adequate variable orderings. However, these heuristics have mostly been targeting discrete random variables. We show how variable ordering heuristics from the discrete domain can be ported to the discrete-continuous domain. We equip the state-of-the-art F-XSDD(BR) solver for discrete-continuous problems with such heuristics. Additionally, we propose a novel heuristic called bottom-up min-fill (BU-MiF), yielding a solver capable of determining good variable orderings without having to rely on the user to provide such an ordering. We empirically demonstrate its performance on a set of benchmark problems.

### 486: On the Relationship Between Probabilistic Circuits and Determinantal Point Processes

Track 2: Inference Honghua Zhang, Steven Holtzen, Guy Van den Broeck
[Paper] [Video (inline)] [Video (new tab)]

Scaling probabilistic models to large realistic problems and datasets is a key challenge in machine learning. Central to this effort is the development of tractable probabilistic models (TPMs): models whose structure guarantees efficient probabilistic inference algorithms. The current landscape of TPMs is fragmented: there exist various kinds of TPMs with different strengths and weaknesses. Two of the most prominent classes of TPMs are determinantal point processes (DPPs) and probabilistic circuits (PCs). This paper provides the first systematic study of their relationship. We propose a unified analysis and shared language for discussing DPPs and PCs. Then we establish theoretical barriers for the unification of these two families, and prove that there are cases where DPPs have no compact representation as a class of PCs. We close with a perspective on the central problem of unifying these tractable models.

### 310: Complex Markov Logic Networks: Expressivity and Liftability

Track 2: Inference Ondrej Kuzelka
[Paper] [Video (inline)] [Video (new tab)]

We study expressivity of Markov logic networks (MLNs). We introduce complex MLNs, which use complex-valued weights, and show that, unlike standard MLNs with real-valued weights, complex MLNs are"fully expressive". We then observe that discrete Fourier transform can be computed using weighted first order model counting (WFOMC) with complex weights and use this observation to design an algorithm for computing "relational marginal polytopes" which needs substantially less calls to a WFOMC oracle than an existing recent algorithm.

### 180: Neural Likelihoods via Cumulative Distribution Functions

Track 3: Unsupervised learning Pawel Chilinski, Ricardo Silva
[Paper] [Video (inline)] [Video (new tab)]

We leverage neural networks as universal approximators of monotonic functions to build a parameterization of conditional cumulative distribution functions (CDFs). By the application of automatic differentiation with respect to response variables and then to parameters of this CDF representation, we are able to build black box CDF and density estimators. A suite of families is introduced as alternative constructions for the multivariate case. At one extreme, the simplest construction is a competitive density estimator against state-of-the-art deep learning methods, although it does not provide an easily computable representation of multivariate CDFs. At the other extreme, we have a flexible construction from which multivariate CDF evaluations and marginalizations can be obtained by a simple forward pass in a deep neural net, but where the computation of the likelihood scales exponentially with dimensionality. Alternatives in between the extremes are discussed. We evaluate the different representations empirically on a variety of tasks involving tail area probabilities, tail dependence and (partial) density estimation.

### 58: Nonparametric Fisher Geometry with Application to Density Estimation

Track 3: Unsupervised learning Andrew Holbrook, Shiwei Lan, Jeffrey Streets, Babak Shahbaba
[Paper] [Video (inline)] [Video (new tab)]

It is well known that the Fisher information induces a Riemannian geometry on parametric families of probability density functions. Following recent work, we consider the nonparametric generalization of the Fisher geometry. The resulting nonparametric Fisher geometry is shown to be equivalent to a familiar, albeit infinite-dimensional, geometric object---the sphere. By shifting focus away from density functions and toward square-root density functions, one may calculate theoretical quantities of interest with ease. More importantly, the sphere of square-root densities is much more computationally tractable. As discussed here, this insight leads to a novel Bayesian nonparametric density estimation model.

### 245: The Indian Chefs Process

Track 3: Unsupervised learning Patrick Dallaire, Luca Ambrogioni, Ludovic Trottier, Umut Güçlü, Max Hinne, Philippe Giguère, Marcel van Gerven, François Laviolette
[Paper] [Video (inline)] [Video (new tab)]

This paper introduces the Indian chefs process (ICP) as a Bayesian nonparametric prior on the joint space of infinite directed acyclic graphs (DAGs) and orders that generalizes the Indian buffet process. As our construction shows, the proposed distribution relies on a latent Beta process controlling both the orders and outgoing connection probabilities of the nodes, and yields a probability distribution on sparse infinite graphs. The main advantage of the ICP over previously proposed Bayesian nonparametric priors for DAG structures is its greater flexibility. To the best of our knowledge, the ICP is the first Bayesian nonparametric model supporting every possible DAG involving latent nodes. We demonstrate the usefulness of the ICP on learning the structure of deep generative sigmoid networks as well as convolutional neural networks.

### 115: Scalable and Flexible Clustering of Grouped Data via Parallel and Distributed Sampling in Versatile Hierarchical Dirichlet Processes

Track 3: Unsupervised learning Or Dinari, Oren Freifeld
[Paper] [Video (inline)] [Video (new tab)]

Adaptive clustering of grouped data is often done via the Hierarchical Dirichlet Process Mixture Model (HDPMM). That approach, however, is limited in its flexibility and usually does not scale well. As a remedy, we propose another, but closely related, hierarchical Bayesian nonparametric framework. Our main contributions are as follows. 1) a new model, called the Versatile HDPMM (vHDPMM), with two possible settings: full and reduced. While the latter is akin to the HDPMM's setting, the former supports not only global features (as HDPMM does) but also local ones. 2) An effective mechanism for detecting global features. 3) A new sampler that addresses the challenges posed by the vHDPMM and, in the reduced setting, scales better than HDPMM samplers. 4) An efficient, distributed, and easily-modifiable implementation that offers more flexibility (even in the reduced setting) than publicly-available HDPMM implementations. Finally, we show the utility of the approach in applications such as image cosegmentation, visual topic modeling, and clustering with missing data.

### 29: Kernel Conditional Moment Test via Maximum Moment Restriction

Track 3: Unsupervised learning Krikamol Muandet, Wittawat Jitkrittum, Jonas Kübler
[Paper] [Video (inline)] [Video (new tab)]

We propose a new family of specification tests called kernel conditional moment (KCM) tests. Our tests are built on a novel representation of conditional moment restrictions in a reproducing kernel Hilbert space (RKHS) called conditional moment embedding (CMME). After transforming the conditional moment restrictions into a continuum of unconditional counterparts, the test statistic is defined as the maximum moment restriction (MMR) within the unit ball of the RKHS. We show that the MMR not only fully characterizes the original conditional moment restrictions, leading to consistency in both hypothesis testing and parameter estimation, but also has an analytic expression that is easy to compute as well as closed-form asymptotic distributions. Our empirical studies show that the KCM test has a promising finite-sample performance compared to existing tests.

### 109: Testing Goodness of Fit of Conditional Density Models with Kernels

Track 3: Unsupervised learning Wittawat Jitkrittum, Heishiro Kanagawa, Bernhard Schölkopf
[Paper] [Video (inline)] [Video (new tab)]

We propose two nonparametric statistical tests of goodness of fit for conditional distributions: given a conditional probability density function p(y|x) and a joint sample, decide whether the sample is drawn from p(y|x)q(x) for some density q(x). Our tests, formulated with a Stein operator, can be applied to any differentiable conditional density model, and require no knowledge of the normalizing constant. We show that 1) our tests are consistent against any fixed alternative conditional model; 2) the statistics can be estimated easily, requiring no density estimation as an intermediate step; and 3) our second test offers an interpretable test result providing insight on where the conditional model does not fit well in the domain of the covariate. We demonstrate the interpretability of our test on a task of modeling the distribution of New York City's taxi drop-off location given a pick-up point. To our knowledge, our work is the first to propose such conditional goodness-of-fit tests that simultaneously have all these desirable properties.

### 340: Robust $k$-means++

Track 3: Unsupervised learning Amit Deshpande, Praneeth Kacham, Rameshwar Pratap
[Paper] [Video (inline)] [Video (new tab)]

A good seeding or initialization of cluster centers for the $k$-means method is important from both theoretical and practical standpoints. The $k$-means objective is inherently non-robust and sensitive to outliers. A popular seeding such as the $k$-means++ \cite{AV2007} that is more likely to pick outliers in the worst case may compound this drawback, thereby affecting the quality of clustering on noisy data. For any $0 < \delta \leq 1$, we show that using a mixture of $D^{2}$~\cite{AV2007} and uniform sampling, we can pick $O(k/\delta)$ candidate centers with the following guarantee: they contain some $k$ centers that give $O(1)$-approximation to the optimal robust $k$-means solution while discarding at most $\delta n$ more points than the outliers discarded by the optimal solution. That is, if the optimal solution discards its farthest $\beta n$ points as outliers, our solution discards its $(\beta + \delta) n$ points as outliers. The constant factor in our $O(1)$-approximation does not depend on $\delta$. This is an improvement over previous results for $k$-means with outliers based on LP relaxation and rounding \cite{Charikar} and local search \cite{GuptaKLMV17}. The $O(k/\delta)$ sized subset can be found in time $O(ndk)$. Our \emph{robust} $k$-means++ is also easily amenable to scalable, faster, parallel implementations of $k$-means++ \cite{Bahmani}. Our empirical results show a comparison of the above \emph{robust} variant of $k$-means++ with the usual $k$-means++, uniform random seeding, threshold $k$-means++~\cite{tkmeanspp} and local search on real world and synthetic data.

### 24: PAC-Bayesian Contrastive Unsupervised Representation Learning

Track 3: Unsupervised learning Kento Nozawa, Pascal Germain, Benjamin Guedj
[Paper] [Video (inline)] [Video (new tab)]

Contrastive unsupervised representation learning (CURL) is the state-of-the-art technique to learn representations (as a set of features) from unlabelled data. While CURL has collected several empirical successes recently, theoretical understanding of its performance was still missing. In a recent work, Arora et al. (2019) provide the first generalisation bounds for CURL, relying on a Rademacher complexity. We extend their framework to the flexible PAC-Bayes setting, allowing to deal with the non-iid setting. We present PAC-Bayesian generalisation bounds for CURL, which are then used to derive a new representation learning algorithm. Numerical experiments on real-life datasets illustrate that our algorithm achieves competitive accuracy, and yields non-vacuous generalisation bounds.