Join a track by clicking the track title. Clicking any paper will show the abstract, and links to paper and video.

Only users who registered for UAI 2020 can access videos and Zoom - to register, click here.

[Paper] [Video (inline)] [Video (new tab)]

A fundamental component of neural network verification is the computation of bounds on the values their outputs can take. Previous methods have either used off-the-shelf solvers, discarding the problem structure, or relaxed the problem even further, making the bounds unnecessarily loose. We propose a novel approach based on Lagrangian Decomposition. Our formulation admits an efficient supergradient ascent algorithm, as well as an improved proximal algorithm. Both the algorithms offer three advantages: (i) they yield bounds that are provably at least as tight as previous dual algorithms relying on Lagrangian relaxations; (ii) they are based on operations analogous to forward/backward pass of neural networks layers and are therefore easily parallelizable, amenable to GPU implementation and able to take advantage of the convolutional structure of problems; and (iii) they allow for anytime stopping while still providing valid bounds. Empirically, we show that we obtain bounds comparable with off-the-shelf solvers in a fraction of their running time, and obtain tighter bounds in the same time as previous dual algorithms. This results in an overall speed-up when employing the bounds for formal verification. Code for our algorithms is available at https://github.com/oval-group/decomposition-plnn-bounds.

[Paper] [Video (inline)] [Video (new tab)]

We study probabilistic safety for Bayesian Neural Networks (BNNs) under adversarial input perturbations. Given a compact set of input points, $T \subseteq \mathbb{R}^m$, we study the probability w.r.t. the BNN posterior that all the points in $T$ are mapped to the same region $S$ in the output space. In particular, this can be used to evaluate the probability that a network sampled from the BNN is vulnerable to adversarial attacks. We rely on relaxation techniques from non-convex optimization to develop a method for computing a lower bound on probabilistic safety for BNNs, deriving explicit procedures for the case of interval and linear function propagation techniques. We apply our methods to BNNs trained on a regression task, airborne collision avoidance, and MNIST, empirically showing that our approach allows one to certify probabilistic safety of BNNs with millions of parameters.

[Paper] [Video (inline)] [Video (new tab)]

Collective learning methods exploit relations among data points to enhance classification performance. However, such relations, represented as edges in the underlying graphical model, expose an extra attack surface to the adversaries. We study adversarial robustness of an important class of such graphical models, Associative Markov Networks (AMN), to structural attacks, where an attacker can modify the graph structure at test time. We formulate the task of learning a robust AMN classifier as a bi-level program, where the inner problem is a challenging non- linear integer program that computes optimal structural changes to the AMN. To address this technical challenge, we first relax the attacker problem, and then use duality to obtain a convex quadratic upper bound for the robust AMN problem. We then prove a bound on the quality of the resulting approximately optimal solutions, and experimentally demonstrate the e

[Paper] [Video (inline)] [Video (new tab)]

Online detection of instantaneous changes in the generative process of a data sequence generally focuses on retrospective inference of such change points without considering their future occurrences. We extend the Bayesian Online Change Point Detection algorithm to also infer the number of time steps until the next change point (i.e., the residual time). This enables to handle observation models which depend on the total segment duration, which is useful to model data sequences with temporal scaling. The resulting inference algorithm for segment detection can be deployed in an online fashion, and we illustrate applications to synthetic and to two medical real-world data sets.

[Paper] [Video (inline)] [Video (new tab)]

While state-of-the-art NLP explainability (XAI) methods focus on explaining per-sample decisions in supervised end or probing tasks, this is insufficient to explain and quantify model knowledge transfer during (un-)supervised training. Thus, for TX-Ray, we modify the established computer vision explainability principle of ‘visualizing preferred inputs of neurons’ to make it usable for both NLP and for transfer analysis. This allows one to analyze, track and quantify how self- or supervised NLP models first build knowledge abstractions in pretraining (1), and then transfer abstractions to a new domain (2), or adapt them during supervised fine tuning (3) – see Fig. 1. TX-Ray expresses neurons as feature preference distributions to quantify fine-grained knowledge transfer or adaptation and guide human analysis. We find that, similar to Lottery Ticket based pruning, TX-Ray based pruning can improve test set generalization and that it can reveal how early stages of self-supervision automatically learn linguistic abstractions like parts-of-speech.

[Paper] [Video (inline)] [Video (new tab)]

Test-time data augmentation—averaging the predictions of a machine learning model across multiple augmented samples of data—is a widely used technique that improves the predictive performance. While many advanced learnable data augmentation techniques have emerged in recent years, they are focused on the training phase. Such techniques are not necessarily optimal for test-time augmentation and can be outperformed by a policy consisting of simple crops and flips. The primary goal of this paper is to demonstrate that test-time augmentation policies can be successfully learned too. We introduce greedy policy search (GPS), a simple but high-performing method for learning a policy of test-time augmentation. We demonstrate that augmentation policies learned with GPS achieve superior predictive performance on image classification problems, provide better in-domain uncertainty estimation, and improve the robustness to domain shift.

[Paper] [Video (inline)] [Video (new tab)]

While the success of semi-supervised learning(SSL) is still not fully understood, Schölkopf et al. (2012) have established a link to the principle of independent causal mechanisms. They conclude that SSL should be impossible when predicting a target variable from its causes, but possible when predicting it from its effects. Since both these cases are restrictive, we extend their work by considering classification using cause and effect features at the same time, such as predicting a disease from both risk factors and symptoms. While standard SSL exploits information contained in the marginal distribution of all inputs (to improve the estimate of the conditional distribution of the target given in-puts), we argue that in our more general setting we should use information in the conditional distribution of effect features given causal features. We explore how this insight generalises the previous understanding, and how it relates to and can be exploited algorithmically for SSL.

490: Wicker et al.

119: Zhou et al.

137: Agudelo-España et al.

197: Rethmeier et al.

535: Lyzhov et al.

11: Kügelgen et al.

[Paper] [Video (inline)] [Video (new tab)]

Although with progress in introducing auxiliary amortized inference models, learning discrete latent variable models is still challenging. In this paper, we show that the annoying difficulty of obtaining reliable stochastic gradients for the inference model and the drawback of indirectly optimizing the target log-likelihood can be gracefully addressed in a new method based on stochastic approximation (SA) theory of the Robbins-Monro type. Specifically, we propose to directly maximize the target log-likelihood and simultaneously minimize the inclusive divergence between the posterior and the inference model. The resulting learning algorithm is called joint SA (JSA). To the best of our knowledge, JSA represents the first method that couples an SA version of the EM (expectation-maximization) algorithm (SAEM) with an adaptive MCMC procedure. Experiments on several benchmark generative modeling and structured prediction tasks show that JSA consistently outperforms recent competitive algorithms, with faster convergence, better final likelihoods, and lower variance of gradient estimates.

[Paper] [Video (inline)] [Video (new tab)]

We propose a novel iterative channel estimation (ICE) algorithm that essentially removes the critical known noisy channel assumption for universal discrete denoising problem. Our algorithm is based on Neural DUDE (N-DUDE), a recently proposed neural network-based discrete denoiser, and it estimates the channel transition matrix as well as the neural network parameters in an alternating manner until convergence. While we do not make any probabilistic assumption on the underlying clean data, our ICE resembles Expectation-Maximization (EM) with variational approximation, and it takes advantage of the property that N-DUDE can always induce a marginal posterior distribution of the clean data. We carefully validate the channel estimation quality of ICE, and with extensive experiments on several radically different types of data, we show the ICE equipped neural network-based denoisers can perform \emph{universally} well regardless of the uncertainties in both the channel and the clean source. Moreover, we show ICE becomes extremely robust to its hyperparameters, and show the denoisers with ICE significantly outperform the strong baseline that can handle the channel uncertainties for denoising, the widely used Baum-Welch (BW) algorithm for hidden Markov models (HMM).

[Paper] [Video (inline)] [Video (new tab)]

Recent advances in nonlinear Independent Component Analysis (ICA) provide a principled framework for unsupervised feature learning and disentanglement. The central idea in such works is that the latent components are assumed to be independent conditional on some observed auxiliary variables, such as the time-segment index. This requires manual segmentation of data into non-stationary segments which is computationally expensive, inaccurate and often impossible. These models are thus not fully unsupervised. We remedy these limitations by combining nonlinear ICA with a Hidden Markov Model, resulting in a model where a latent state acts in place of the observed segment-index. We prove identifiability of the proposed model for a general mixing nonlinearity, such as a neural network. We also show how maximum likelihood estimation of the model can be done using the expectation-maximization algorithm. Thus, we achieve a new nonlinear ICA framework which is unsupervised, more efficient, as well as able to model underlying temporal dynamics.

[Paper] [Video (inline)] [Video (new tab)]

The field of neural generative models is dominated by the highly successful Generative Adversarial Networks (GANs) despite their challenges, such as training instability and mode collapse. Auto-Encoders (AE) with regularized latent space provide an alternative framework for generative models, albeit their performance levels have not reached that of GANs. In this work, we hypothesise that the dimensionality of the AE model's latent space has a critical effect on the quality of generated data. Under the assumption that nature generates data by sampling from a ``true" generative latent space followed by a deterministic function, we show that the optimal performance is obtained when the dimensionality of the latent space of the AE-model matches with that of the ``true" generative latent space. Further, we propose an algorithm called the Mask Adversarial Auto-Encoder (MaskAAE), in which the dimensionality of the latent space of an adversarial auto encoder is brought closer to that of the ``true" generative latent space, via a procedure to mask the spurious latent dimensions. We demonstrate through experiments on synthetic and several real-world datasets that the proposed formulation yields betterment in the generation quality.

[Paper] [Video (inline)] [Video (new tab)]

Gaussian processes (GP) are a natural tool for estimating unknown functions, typically based on a collection of point-wise observations. Interestingly, the GP formalism can be used also with observations that are integrals of the unknown function along some known trajectories, which makes GPs a promising technique for inverse problems in a wide range of physical sensing problems. However, in many real world applications collecting data is laborious and time consuming. We provide tools for optimizing sensor locations for GPs using integral observations, extending both model-based and geometric strategies for GP sensor placement. We demonstrate the techniques in ultrasonic detection of fouling in closed pipes.

[Paper] [Video (inline)] [Video (new tab)]

Gaussian processes (GPs) are nonparametric priors over functions. Fitting a GP implies computing a posterior distribution of functions consistent with the observed data. Similarly, deep Gaussian processes (DGPs) should allow us to compute a posterior distribution of compositions of multiple functions giving rise to the observations. However, exact Bayesian inference is intractable for DGPs, motivating the use of various approximations. We show that the application of simplifying mean-field assumptions across the hierarchy leads to the layers of a DGP collapsing to near-deterministic transformations. We argue that such an inference scheme is suboptimal, not taking advantage of the potential of the model to discover the compositional structure in the data. To address this issue, we examine alternative variational inference schemes allowing for dependencies across different layers and discuss their advantages and limitations.

[Paper] [Video (inline)] [Video (new tab)]

We propose a novel Gaussian process kernel that takes advantage of a deep neural network (DNN) structure but retains good interpretability. The resulting kernel is capable of addressing four major issues of the previous works of similar art, i.e., the optimality, explainability, model complexity, and sample efficiency. Our kernel design procedure comprises three steps: (1) Derivation of an optimal kernel with a non-stationary dot product structure that minimizes the prediction/test mean-squared-error (MSE); (2) Decomposition of this optimal kernel as a linear combination of shallow DNN subnetworks with the aid of multi-way feature interaction detection; (3) Updating the hyper-parameters of the subnetworks via an alternating rationale until convergence. The designed kernel does not sacrifice interpretability for optimality. On the contrary, each subnetwork explicitly demonstrates the interaction of a set of features in a transformation function, leading to a solid path toward explainable kernel learning. We test the proposed kernel with both synthesized and real-world data sets, and the proposed kernel is superior to its competitors in terms of prediction performance in most cases. Moreover, it tends to maintain the prediction performance and be robust to data over-fitting issue, when reducing the number of samples.

[Paper] [Video (inline)] [Video (new tab)]

Correlated topic models (CTM) are useful tools for statistical analysis of documents. They explicitly capture the correlation between topics associated with each document. We propose an extension to CTM that models the evolution of both topic correlation and word co-occurrence over time. This allows us to identify the changes of topic correlations over time, e.g., in the machine learning literature, the correlation between the topics “stochastic gradient descent” and “variational inference” increased in the last few years due to advances in stochastic variational inference methods. Our temporal dynamic priors are based on Gaussian processes (GPs), allowing us to capture diverse temporal behaviours such as smooth, with long-term memory, temporally concentrated, and periodic. The evolution of topic correlations is modeled through generalised Wishart processes (GWPs). We develop a stochastic variational inference method, which enables us to handle large sets of continuous temporal data. Our experiments applied to real world data demonstrate that our model can be used to effectively discover temporal patterns of topic distributions, words associated to topics and topic relationships.

55: Ahn et al.

379: Hälvä et al.

281: Mondal et al.

411: Longi et al.

206: Ustyuzhaninov et al.

328: Dai et al.

365: Tomasi et al.

[Paper] [Video (inline)] [Video (new tab)]

Bayesian inference of the Bayesian network structure requires averaging over all possible directed acyclic graphs, DAGs, each weighted by its posterior probability. For approximate averaging, the most popular method has been Markov chain Monte Carlo, MCMC. It was recently shown that collapsing the sampling space from DAGs to suitably deﬁned ordered partitions of the nodes substantially expedites the chain’s convergence; this partition-MCMC is similar to order-MCMC on node orderings, but it avoids biasing the sampling distribution. Here, we further collapse the state space by merging some number of adjacent members of a partition into layers. This renders the computation of the (unnormalized) posterior probability of a state, called layering, more involved, for which task we give an efﬁcient dynamic programming algorithm. Our empirical studies suggest that the resulting layering-MCMC is superior to partition-MCMC in terms of mixing time and estimation accuracy.

[Paper] [Video (inline)] [Video (new tab)]

The linear Lyapunov equation of a covariance matrix parametrizes the equilibrium covariance matrix of a stochastic process. This parametrization can be interpreted as a new graphical model class, and we show how the model class behaves under marginalization and introduce a method for structure learning via $\ell_1$-penalized loss minimization. Our proposed method is demonstrated to outperform alternative structure learning algorithms in a simulation study, and we illustrate its application for protein phosphorylation network reconstruction.

[Paper] [Video (inline)] [Video (new tab)]

We consider the problem of structure learning for linear causal models based on observational data. We treat models given by possibly cyclic mixed graphs, which allow for feedback loops and effects of latent confounders. Generalizing related work on bow-free acyclic graphs, we assume that the underlying graph is simple. This entails that any two observed variables can be related through at most one direct causal effect and that (confounding-induced) correlation between error terms in structural equations occurs only in absence of direct causal effects. We show that, despite new subtleties in the cyclic case, the considered simple cyclic models are of expected dimension and that a previously considered criterion for distributional equivalence of bow-free acyclic graphs has an analogue in the cyclic case. Our result on model dimension justifies in particular score-based methods for structure learning of linear Gaussian mixed graph models, which we implement via greedy search.

[Paper] [Video (inline)] [Video (new tab)]

While feedback loops are known to play important roles in many complex systems, their existence is ignored in a large part of the causal discovery literature, as systems are typically assumed to be acyclic from the outset. When applying causal discovery algorithms designed for the acyclic setting on data generated by a system that involves feedback, one would not expect to obtain correct results. In this work, we show that---surprisingly---the output of the Fast Causal Inference (FCI) algorithm is correct if it is applied to observational data generated by a system that involves feedback. More specifically, we prove that for observational data generated by a simple and sigma-faithful Structural Causal Model (SCM), FCI is sound and complete, and can be used to consistently estimate (i) the presence and absence of causal relations, (ii) the presence and absence of direct causal relations, (iii) the absence of confounders, and (iv) the absence of specific cycles in the causal graph of the SCM. We extend these results to constraint-based causal discovery algorithms that exploit certain forms of background knowledge, including the causally sufficient setting (e.g., the PC algorithm) and the Joint Causal Inference setting (e.g., the FCI-JCI algorithm).

[Paper] [Video (inline)] [Video (new tab)]

We consider the task of causal structure learning over measurement dependence inducing latent (MeDIL) causal models. We show that this task can be framed in terms of the graph theoretic problem of finding edge clique covers,resulting in an algorithm for returning minimal MeDIL causal models (minMCMs). This algorithm is non-parametric, requiring no assumptions about linearity or Gaussianity. Furthermore, despite rather weak assumptions aboutthe class of MeDIL causal models, we show that minimality in minMCMs implies some rather specific and interesting properties. By establishing MeDIL causal models as a semantics for edge clique covers, we also provide a starting point for future work further connecting causal structure learning to developments in graph theory and network science.

[Paper] [Video (inline)] [Video (new tab)]

Estimation of information theoretic quantities such as mutual information and its conditional variant has drawn interest in recent times owing to their multifaceted applications. Newly proposed neural estimators for these quantities have overcome severe drawbacks of classical $k$NN-based estimators in high dimensions. In this work, we focus on conditional mutual information (CMI) estimation by utilizing its formulation as a \textit{minmax} optimization problem. Such a formulation leads to a joint training procedure similar to that of generative adversarial networks. We find that our proposed estimator provides better estimates than the existing approaches on a variety of simulated datasets comprising linear and non-linear relations between variables. As an application of CMI estimation, we deploy our estimator for conditional independence (CI) testing on real data and obtain better results than state-of-the-art CI testers.

[Paper] [Video (inline)] [Video (new tab)]

We propose a method to learn a common bias vector for a growing sequence of low-variance tasks. Unlike state-of-the-art approaches, our method does not require tuning any hyper-parameter. Our approach is presented in the non-statistical setting and can be of two variants. The “aggressive” one updates the bias after each datapoint, the “lazy” one updates the bias only at the end of each task. We derive an across-tasks regret bound for the method. When compared to state-of-the-art approaches, the aggressive variant returns faster rates, the lazy one recovers standard rates, but with no need of tuning hyper-parameters. We then adapt the methods to the statistical setting: the aggressive variant becomes a multi-task learning method, the lazy one a meta-learning method. Experiments confirm the effectiveness of our methods in practice.

407: Varando et al.

408: Amendola et al.

481: Mooij et al.

244: Markham et al.

358: Mondal et al.

368: Denevi et al.

**Sponsors**