### Schedule session 4

Join a track by clicking the track title. Clicking any paper will show the abstract, and links to paper and video.

Only users who registered for UAI 2020 can access videos and Zoom - to register, click here.

### 339: Deep Sigma Point Processes

Track 1: Gaussian processes Martin Jankowiak, Geoff Pleiss, Jacob Gardner
[Paper] [Video (inline)] [Video (new tab)]

We introduce Deep Sigma Point Processes, a class of parametric models inspired by the compositional structure of Deep Gaussian Processes (DGPs). Deep Sigma Point Processes (DSPPs) retain many of the attractive features of (variational) DGPs, including mini-batch training and predictive uncertainty that is controlled by kernel basis functions. Importantly, since DSPPs admit a simple maximum likelihood inference procedure, the resulting predictive distributions are not degraded by any posterior approximations. In an extensive empirical comparison on univariate and multivariate regression tasks we find that the resulting predictive distributions are significantly better calibrated than those obtained with other probabilistic methods for scalable regression, including variational DGPs--often by as much as a nat per datapoint.

### 227: GPIRT: A Gaussian Process Model for Item Response Theory

Track 1: Gaussian processes JBrandon Duck-Mayr, Roman Garnett, Jacob Montgomery
[Paper] [Video (inline)] [Video (new tab)]

The goal of item response theoretic (IRT) models is to provide estimates of latent traits from binary observed indicators and at the same time to learn the item response funcitons (IRFs) that map from latent trait to observed response. However, in many cases observed behavior can deviate significantly from the parametric assumptions of traditional IRT models. Nonparametric IRT (NIRT) models overcome these challenges by relaxing assumptions about the form of the IRFs, but standard tools are unable to simultaneously estimate flexible IRFs and recover ability estimates for respondents. We propose a Bayesian nonparametric model that solves this problem by placing Gaussian process priors on the latent functions defining the IRFs. This allows us to simultaneously relax assumptions about the shape of the IRFs while preserving the ability to estimate latent traits. This in turn allows us to easily extend the model to further tasks such as active learning. GPIRT therefore provides a simple and intuitive solution to several longstanding problems in the IRT literature.

### 554: PoRB-Nets: Poisson Process Radial Basis Function Networks

Track 1: Gaussian processes Beau Coker, Melanie Fernandez Pradier, Finale Doshi-Velez
[Paper] [Video (inline)] [Video (new tab)]

Bayesian neural networks (BNNs) are flexible function priors well-suited to situations in which data are scarce and uncertainty must be quantified. Yet, common weight priors are able to encode little functional knowledge and can behave in undesirable ways. We present a novel prior over radial basis function networks (RBFNs) that allows for independent specification of functional amplitude variance and lengthscale (i.e., smoothness), where the inverse lengthscale corresponds to the concentration of radial basis functions. When the lengthscale is uniform over the input space, we prove consistency and approximate variance stationarity. This is in contrast to common BNN priors, which are highly nonstationary. When the input dependence of the lengthscale is unknown, we show how it can be inferred. We compare this model's behavior to standard BNNs and Gaussian processes using synthetic and real examples.

### 305: Generalized Bayesian Posterior Expectation Distillation for Deep Neural Networks

Track 1: Gaussian processes Meet Vadera, Brian Jalaian, Benjamin Marlin
[Paper] [Video (inline)] [Video (new tab)]

In this paper, we present a general framework for distilling expectations with respect to the Bayesian posterior distribution of a deep neural network classifier, extending prior work on the Bayesian Dark Knowledge framework. The proposed framework takes as input "teacher" and "student" model architectures and a general posterior expectation of interest. The distillation method performs an online compression of the selected posterior expectation using iteratively generated Monte Carlo samples. We focus on the posterior predictive distribution and expected entropy as distillation targets. We investigate several aspects of this framework including the impact of uncertainty and the choice of student model architecture. We study methods for student model architecture search from a speed-storage-accuracy perspective and evaluate down-stream tasks leveraging entropy distillation including uncertainty ranking and out-of-distribution detection.

### 485: Prediction Intervals: Split Normal Mixture from Quality-Driven Deep Ensembles

Track 1: Gaussian processes Tárik Saleh Salem, Helge Langseth, Heri Ramampiaro
[Paper] [Video (inline)] [Video (new tab)]

Prediction intervals are a machine- and human-interpretable way to represent predictive uncertainty in a regression analysis. In this paper, we present a method for generating prediction intervals along with point estimates from an ensemble of neural networks. We propose a multi-objective loss function fusing quality measures related to prediction intervals and point estimates, and a penalty function, which enforces semantic integrity of the results and stabilizes the training process of the neural networks. The ensembled prediction intervals are aggregated as a split normal mixture accounting for possible multimodality and asymmetricity of the posterior predictive distribution, and resulting in prediction intervals that capture aleatoric and epistemic uncertainty. Our results show that both our quality-driven loss function and our aggregation method contribute to well-calibrated prediction intervals and point estimates.

### 334: Batch simulations and uncertainty quantification in Gaussian process surrogate approximate Bayesian computation

Track 1: Gaussian processes Marko Jarvenpaa, Aki Vehtari, Pekka Marttinen
[Paper] [Video (inline)] [Video (new tab)]

The computational efficiency of approximate Bayesian computation (ABC) has been improved by using surrogate models such as Gaussian processes (GP). In one such promising framework the discrepancy between the simulated and observed data is modelled with a GP which is further used to form a model-based estimator for the intractable posterior. In this article we improve this approach in several ways. We develop batch-sequential Bayesian experimental design strategies to parallellise the expensive simulations. In earlier work only sequential strategies have been used. Current surrogate-based ABC methods also do not fully account the uncertainty due to the limited budget of simulations as they output only a point estimate of the ABC posterior. We propose a numerical method to fully quantify the uncertainty in, for example, ABC posterior moments. We also provide some new analysis on the GP modelling assumptions in the resulting improved framework called Bayesian ABC and discuss its connection to Bayesian quadrature (BQ) and Bayesian optimisation (BO). Experiments with toy and real-world simulation models demonstrate advantages of the proposed techniques.

### 124: Efficient Rollout Strategies for Bayesian Optimization

Track 1: Gaussian processes Eric Lee, David Eriksson, David Bindel, Bolong Cheng, Mike Mccourt
[Paper] [Video (inline)] [Video (new tab)]

Bayesian optimization (BO) is a class of sample-efficient global optimization methods, where a probabilistic model conditioned on previous observations is used to determine future evaluations via the optimization of an acquisition function. Most acquisition functions are myopic, meaning that they only consider the impact of the next function evaluation. Non-myopic acquisition functions consider the impact of the next h function evaluations and are typically computed through rollout, in which h steps of BO are simulated. These rollout acquisition functions are defined as h-dimensional integrals, and are expensive to compute and optimize. We show that a combination of quasi-Monte Carlo, common random numbers, and control variates significantly reduce the computational burden of rollout. We then formulate a policy-search based approach that removes the need to optimize the rollout acquisition function. Finally, we discuss the qualitative behavior of rollout policies in the setting of multi-modal objectives and model error.

### 329: Amortized Bayesian Optimization over Discrete Spaces

Track 1: Gaussian processes Kevin Swersky, Yulia Rubanova, David Dohan, Kevin Murphy
[Paper] [Video (inline)] [Video (new tab)]

Bayesian optimization is a principled approach for globally optimizing expensive, black-box functions by using a surrogate model of the objective. However, each step of Bayesian optimization involves solving an inner optimization problem, in which we maximize an acquisition function derived from the surrogate model to decide where to query next. This inner problem can be challenging to solve, particularly in discrete spaces, such as protein sequences or molecular graphs, where gradient-based optimization cannot be used. Our key insight is that we can train a generative model to generate candidates that maximize the acquisition function. This is faster than standard model-free local search methods, since we can amortize the cost of learning the model across multiple rounds of Bayesian optimization. We therefore call this Amortized Bayesian Optimization. On several challenging discrete design problems, we show this method generally outperforms other methods at optimizing the inner acquisition function, resulting in more efficient optimization of the outer black-box objective.

Track 2: Ranking

### 249: Spectral Methods for Ranking with Scarce Data

Track 2: Ranking Lalit Jain, Anna Gilbert, Umang Varma
[Paper] [Video (inline)] [Video (new tab)]

Given a number of pairwise preferences of items, a common task is to rank all the items.Examples include pairwise movie ratings, New Yorker cartoon caption contests, and many other consumer preferences tasks. What these settings have in common is two-fold: a scarcity of data (it may be costly to get comparisons for all the pairs of items) and additional feature information about the items (e.g., movie genre,director, and cast). In this paper we modify a popular and well studied method, RankCentrality for rank aggregation to account for few comparisons and that incorporates additional feature information. This method returns meaningful rankings even under scarce comparisons.Using diffusion based methods, we incorporate feature information that outperforms state-of-the-art methods in practice. We also provide improved sample complexity for RankCentrality in a variety of sampling schemes.

### 389: Risk Bounds for Low Cost Bipartite Ranking

Track 2: Ranking San Gultekin, John Paisley
[Paper] [Video (inline)] [Video (new tab)]

Bipartite ranking is an important supervised learning problem; however, unlike regression or classification, it has a quadratic dependence on the number of samples. To circumvent the prohibitive sample cost, many recent work focus on stochastic gradient-based methods. In this paper we consider an alternative approach, which leverages the structure of the widely-adopted pairwise squared loss, to obtain a stochastic and low cost algorithm that does not require stochastic gradients or learning rates. Using a novel uniform risk bound---based on matrix and vector concentration inequalities---we show that the sample size required for competitive performance against the all-pairs batch algorithm does not have a quadratic dependence. Generalization bounds for both the batch and low cost stochastic algorithms are presented. Experimental results show significant speed gain against the batch algorithm, as well as competitive performance against state-of-the-art bipartite ranking algorithms on real datasets.

### 95: On the design of consequential ranking algorithms

Track 2: Ranking Behzad Tabibian, Vicenç Gómez, Abir De, Bernhard Schölkopf, Manuel Gomez Rodriguez
[Paper] [Video (inline)] [Video (new tab)]

Ranking models are typically designed to optimize some measure of immediate utility to the users. As a result, they have been unable to anticipate an increasing number of undesirable long-term consequences of their proposed rankings, from fueling the spread of misinformation and increasing polarization to degrading social discourse. Can we design ranking models that anticipate the consequences of their proposed rankings and are able to avoid the undesirable ones? In this paper, we first introduce a joint representation of rankings and user dynamics using Markov decision processes. Then, we show that this representation greatly simplifies the construction of consequential ranking models that trade off the immediate utility and the long-term welfare. In particular, we can obtain optimal consequential rankings by applying weighted sampling on the rankings provided by models that maximize measures of immediate utility. However, in practice, such a strategy may be inefficient and impractical, specially in high dimensional scenarios. To overcome this, we introduce an efficient gradient-based algorithm to learn parameterized consequential ranking models that effectively approximate optimal ones. We illustrate our methodology using synthetic and real data gathered from Reddit and show that our consequential rankings may mitigate the spread of misinformation and improve the civility of online discussions.

### 240: A SUPER* Algorithm to Optimize Paper Bidding in Peer Review

Track 2: Ranking Tanner Fiez, Nihar Shah, Lillian Ratliff
[Paper] [Video (inline)] [Video (new tab)]

A number of applications involve sequential arrival of users, and require showing each user an ordering of items. A prime example is the bidding process in conference peer review where reviewers enter the system sequentially, each reviewer needs to be shown the list of submitted papers, and the reviewer then bids'' to review some papers. The order of the papers shown has a significant impact on the bids due to primacy effects. In deciding on the ordering of the list of papers to show, there are two competing goals: (i) obtaining sufficiently many bids for each paper, and (ii) satisfying reviewers by showing them relevant items. In this paper, we develop a framework to study this problem in a principled manner. We present an algorithm called SUPER*, inspired by the A* algorithm, for this goal. Theoretically, we show a local optimality guarantee of our algorithm and prove that popular baselines are considerably suboptimal. Moreover, under a community model for the similarities, we prove that SUPER* is near-optimal whereas the popular baselines are considerably suboptimal. In experiments on real data from ICLR 2018 and synthetic data, we find that SUPER* considerably outperforms baselines deployed in existing systems, consistently reducing the number of papers with fewer than requisite bids by 50-75% or more, and is also robust to various real world complexities.

### 73: Learning Behaviors with Uncertain Human Feedback

Track 2: Ranking Xu He, Haipeng Chen, Bo An
[Paper] [Video (inline)] [Video (new tab)]

Human feedback is widely used to train agents in many domains. However, previous works rarely consider the uncertainty when humans provide feedback, especially in cases that the optimal actions are not obvious to the trainers. For example, the reward of a sub-optimal action can be stochastic and sometimes exceeds that of the optimal action, which is common in games or real-world. Trainers are likely to provide positive feedback to sub-optimal actions, negative feedback to the optimal actions and even do not provide feedback in some confusing situations. Existing works, which utilize the Expectation Maximization (EM) algorithm and treat the feedback model as hidden parameters, do not consider uncertainties in the learning environment and human feedback. To address this challenge, we introduce a novel feedback model that considers the uncertainty of human feedback. However, this incurs intractable calculus in the EM algorithm. To this end, we propose a novel approximate EM algorithm, in which we approximate the expectation step with the Gradient Descent method. Experimental results in both synthetic scenarios and two real-world scenarios with human participants demonstrate the superior performance of our proposed approach.

### 173: Robust modal regression with direct gradient approximation of modal regression risk

Track 2: Ranking Hiroaki Sasaki, Tomoya Sakai, Takafumi Kanamori
[Paper] [Video (inline)] [Video (new tab)]

Modal regression is aimed at estimating the global mode (i.e., global maximum) of the conditional density function of the output variable given input variables, and has led to regression methods robust against a wide-range of noises. A typical approach for modal regression takes a two-step approach of firstly approximating the modal regression risk (MRR) and of secondly maximizing the approximated MRR with some gradient method. However, this two-step approach can be suboptimal in gradient-based maximization methods because a good MRR approximator does not necessarily give a good gradient approximator of MRR. In this paper, we take a novel approach of \emph{directly} approximating the gradient of MRR in modal regression. Based on the direct approach, we first propose a modal regression method with reproducing kernels where a new update rule to estimate the conditional mode is derived based on a fixed-point method. Then, the derived update rule is theoretically investigated. Furthermore, since our direct approach is compatible with recent sophisticated stochastic gradient methods (e.g., Adam), another modal regression method is also proposed based on neural networks. Finally, the superior performance of the proposed methods is demonstrated on various artificial and benchmark datasets.

### 176: Skewness Ranking Optimization for Personalized Recommendation

Track 2: Ranking Yu-Neng Chuang, Chih-Ming Chen, Chuan-Ju Wang, Ming-Feng Tsai
[Paper] [Video (inline)] [Video (new tab)]

In this paper, we propose a novel optimization criterion that leverages features of the skew normal distribution to better model the problem of personalized recommendation. Specifically, the developed criterion borrows the concept and the flexibility of the skew normal distribution, based on which three hyperparameters are attached to the optimization criterion. Furthermore, from a theoretical point of view, we not only establish the relation between the maximization of the proposed criterion and the shape parameter in the skew normal distribution, but also provide the analogies and asymptotic analysis of the proposed criterion to maximization of the area under the ROC curve. Experimental results conducted on a range of large-scale real-world datasets show that our model significantly outperforms the state of the art and yields consistently best performance on all tested datasets.

### 397: Multitask Soft Option Learning

Track 3: Reinforcement learning Maximilian Igl, Andrew Gambardella, Jinke He, Nantas Nardelli, N Siddharth, Wendelin Boehmer, Shimon Whiteson
[Paper] [Video (inline)] [Video (new tab)]

We present Multitask Soft Option Learning (MSOL), a hierarchical multitask framework based on Planning as Inference. MSOL extends the concept of options, using separate variational posteriors for each task, regularized by a shared prior. This “soft” version of options avoids several instabilities during training in a multitask setting, and provides a natural way to learn both intra-option policies and their terminations. Furthermore, it allows fine-tuning of options for new tasks without forgetting their learned policies, leading to faster training without reducing the expressiveness of the hierarchical policy. We demonstrate empirically that MSOL significantly outperforms both hierarchical and flat transfer-learning baselines.

### 66: Learning Intrinsic Rewards as a Bi-Level Optimization Problem

[Paper] [Video (inline)] [Video (new tab)]

We reinterpret the problem of finding intrinsic rewards in reinforcement learning (RL) as a bilevel optimization problem. Using this interpretation, we can make use of recent advancements in the hyperparameter optimization literature, mainly from Self-Tuning Networks (STN), to learn intrinsic rewards. To facilitate our methods, we introduces a new general conditioning layer: Conditional Layer Normalization (CLN). We evaluate our method on several continuous control benchmarks in the Mujoco physics simulator. On all of these benchmarks, the intrinsic rewards learned on the fly lead to higher final rewards.

### 497: Mutual Information Based Knowledge Transfer Under State-Action Dimension Mismatch

Track 3: Reinforcement learning Michael Wan, Tanmay Gangwani, Jian Peng
[Paper] [Video (inline)] [Video (new tab)]

Deep reinforcement learning (RL) algorithms have achieved great success on a wide variety of sequential decision-making tasks. However, many of these algorithms suffer from high sample complexity when learning from scratch using environmental rewards, due to issues such as credit-assignment and high-variance gradients, among others. Transfer learning, in which knowledge gained on a source task is applied to more efficiently learn a different but related target task, is a promising approach to improve the sample complexity in RL. Prior work has considered using pre-trained teacher policies to enhance the learning of the student policy, albeit with the constraint that the teacher and the student MDPs share the state-space or the action-space. In this paper, we propose a new framework for transfer learning where the teacher and the student can have arbitrarily different state- and action-spaces. To handle this mismatch, we produce embeddings which can systematically extract knowledge from the teacher policy and value networks, and blend it into the student networks. To train the embeddings, we use a task-aligned loss and show that the representations could be enriched further by adding a mutual information loss. Using a set of challenging simulated robotic locomotion tasks involving many-legged centipedes, we demonstrate successful transfer learning in situations when the teacher and student have different state- and action-spaces.

### 414: Active Model Estimation in Markov Decision Processes

Track 3: Reinforcement learning Jean Tarbouriech, Shubhanshu Shekhar, Matteo Pirotta, Mohammad Ghavamzadeh, Alessandro Lazaric
[Paper] [Video (inline)] [Video (new tab)]

We study the problem of efficient exploration in order to learn an accurate model of an environment, modeled as a Markov decision process (MDP). Efficient exploration in this problem requires the agent to identify the regions in which estimating the model is more difficult and then exploit this knowledge to collect more samples there. In this paper, we formalize this problem, introduce the first algorithm to learn an $\epsilon$-accurate estimate of the dynamics, and provide its sample complexity analysis. While this algorithm enjoys strong guarantees in the large-sample regime, it tends to have a poor performance in early stages of exploration. To address this issue, we propose an algorithm that is based on maximum weighted entropy, a heuristic that stems from common sense and our theoretical analysis. The main idea here is to cover the entire state-action space with the weight proportional to the noise in their transition functions. Using a number of simple domains with heterogeneous noise in their transitions, we show that our heuristic-based algorithm outperforms both our original algorithm and the maximum entropy algorithm in the small sample regime, while achieving similar asymptotic performance as that of the original algorithm.

### 193: Unknown mixing times in apprenticeship and reinforcement learning

Track 3: Reinforcement learning Tom Zahavy, Alon Cohen, Haim Kaplan, Yishay Mansour
[Paper] [Video (inline)] [Video (new tab)]

We derive and analyze learning algorithms for apprenticeship learning, policy evaluation and policy gradient for average reward criteria. Existing algorithms explicitly require an upper bound on the mixing time. In contrast, we build on ideas from Markov chain theory and derive sampling algorithms that do not require such an upper bound. For these algorithms, we provide theoretical bounds on their sample-complexity and running time.

### 235: Q* Approximation Schemes for Batch Reinforcement Learning: A Theoretical Comparison

Track 3: Reinforcement learning Tengyang Xie, Nan Jiang
[Paper] [Video (inline)] [Video (new tab)]

We prove performance guarantees of two algorithms for approximating Q* in batch reinforcement learning. Compared to classical iterative methods such as Fitted Q-Iteration---whose performance loss incurs quadratic dependence on horizon---these methods estimate (some forms of) the Bellman error and enjoy linear-in-horizon error propagation, a property established for the first time for algorithms that rely solely on batch data and output stationary policies. One of the algorithms uses a novel and explicit importance-weighting correction to overcome the infamous "double sampling" difficulty in Bellman error estimation, and does not use any squared losses. Our analyses reveal its distinct characteristics and potential advantages compared to classical algorithms.

### 480: Finite-Memory Near-Optimal Learning for Markov Decision Processes with Long-Run Average Reward

Track 3: Reinforcement learning Jan Kretinsky, Fabian Michel, Lukas Michel, Guillermo Perez
[Paper] [Video (inline)] [Video (new tab)]

We consider learning policies online in Markov decision processes with the long-run average reward (a.k.a. mean payoff). To ensure implementability of the policies, we focus on policies with finite memory. Firstly, we show that near optimality can be achieved almost surely, using an unintuitive gadget we call forgetfulness. Secondly, we extend the approach to a setting with partial knowledge of the system topology, introducing two optimality measures and providing near-optimal algorithms also for these cases.

### 26: Static and Dynamic Values of Computation in MCTS

Track 3: Reinforcement learning Eren Sezener, Peter Dayan
[Paper] [Video (inline)] [Video (new tab)]

Monte-Carlo Tree Search (MCTS) is one of the most-widely used methods for planning, and has powered many recent advances in artificial intelligence. In MCTS, one typically performs computations (i.e., simulations) to collect statistics about the possible future consequences of actions, and then chooses accordingly. Many popular MCTS methods such as UCT and its variants decide which computations to perform by trading-off exploration and exploitation. In this work, we take a more direct approach, and explicitly quantify the value of a computation based on its expected impact on the quality of the action eventually chosen. Our approach goes beyond the \emph{myopic} limitations of existing computation-value-based methods in two senses: (I) we are able to account for the impact of non-immediate (ie, future) computations (II) on non-immediate actions. We show that policies that greedily optimize computation values are optimal under certain assumptions and obtain results that are competitive with the state-of-the-art.