Join a track by clicking the track title. Clicking any paper will show the abstract, and links to paper and video.

Only users who registered for UAI 2020 can access videos and Zoom - to register, click here.

[Paper] [Video (inline)] [Video (new tab)]

Spatio-temporal incident prediction is a central issue in law enforcement, with applications in fighting crimes like poaching, human trafficking, illegal fishing, burglaries and smuggling. However, state of the art approaches fail to account for evasion in response to predictive models, a common form of which is spatial shift in incident occurrence. We present a general approach for incident forecasting that is robust to spatial shifts. We propose two techniques for solving the resulting robust optimization problem: first, a constraint generation method guaranteed to yield an optimal solution, and second, a more scalable gradient-based approach. We then apply these techniques to both discrete-time and continuous-time robust incident forecasting. We evaluate our algorithms on two different real-world datasets, demonstrating that our approach is significantly more robust than conventional methods.

[Paper] [Video (inline)] [Video (new tab)]

In this paper, we show that the distribution of entities and relations in common knowledge graphs is highly skewed, with some entities and relations being much more popular than the rest. We show that while knowledge graph embedding models give state-of-the-art performance in many relational learning tasks such as link prediction, current evaluation metrics like hits@k and mrr are biased towards popular entities and relations. We propose two new evaluation metrics, strat-hits@k and strat-mrr, which are unbiased estimators of the true hits@k and mrr when the items follow a power-law distribution. Our new metrics are generalizations of hits@k and mrr that take into account the popularity of the entities and relations in the data, with a tuning parameter determining how much emphasis the metric places on popular vs. unpopular items. Using our metrics, we run experiments on benchmark datasets to show that the performance of embedding models degrades as the popularity of the entities and relations decreases, and that current reported results overestimate the performance of these models by magnifying their accuracy on popular items.

[Paper] [Video (inline)] [Video (new tab)]

Graphs are ubiquitous in modelling relational structures. Recent endeavours in machine learning for graph structured data have led to many architectures and learning algorithms. However, the graph used by these algorithms is often constructed based on inaccurate modelling assumptions and/or noisy data. As a result, it fails to represent the true relationships between nodes. A Bayesian framework which targets posterior inference of the graph by considering it as a random quantity can be beneficial. In this paper, we propose a novel non-parametric graph model for constructing the posterior distribution of graph adjacency matrices. The proposed model is flexible in the sense that it can effectively take into account the output of graph based learning algorithms that target specific tasks. In addition, model inference scales well to large graphs. We demonstrate the advantages of this model in three different problem settings: node classification, link prediction and recommendation.

[Paper] [Video (inline)] [Video (new tab)]

We address in this study the problem of modeling weighted networks through generalized stochastic block models. Stochastic block models, and their extensions through mixed-membership versions, are indeed popular methods for network analysis as they can account for the underlying classes/communities structuring real-world networks and can be used for different applications. Our goal is to develop such models to solve the weight prediction problem that consists in predicting weights on links in weighted networks. To do so, we introduce new mixed-membership stochastic block models that can efficiently be learned through a coupling of collapsed and stochastic variational inference. These models, that represent the first weighted mixed-membership stochastic block models to our knowledge, can be deployed on large networks comprising millions of edges. The experiments, conducted on diverse real-world networks, illustrate the good behavior of these new models.

[Paper] [Video (inline)] [Video (new tab)]

We propose a novel probabilistic framework to model continuously generated interaction events data. Our goal is to infer the \emph{implicit} community structure underlying the temporal interactions among entities, and also to exploit how the latent structure influence their interaction dynamics. To this end, we model the reciprocating interactions between individuals using mutually-exciting Hawkes processes. The base rate of the Hawkes process for each pair of individuals is built upon the latent representations inferred using the hierarchical gamma process edge partition model (HGaP-EPM). In particular, our model allows the interaction dynamics between each pair of individuals to be modulated by their respective affiliated communities. Moreover, our model can flexibly incorporate the auxiliary individuals' attributes, or covariates associated with interaction events. Efficient Gibbs sampling and Expectation-Maximization algorithms are developed to perform inference via P\'olya-Gamma data augmentation strategy. Experimental results on real-world datasets demonstrate that our model not only achieves competitive performance compared with state-of-the-art methods, but also discovers interpretable latent structure behind the observed temporal interactions.

[Paper] [Video (inline)] [Video (new tab)]

Integrity of elections is vital to democratic systems, but it is frequently threatened by malicious actors. The study of algorithmic complexity of the problem of manipulating election outcomes by changing its structural features is known as election control Rothe [2016]. One means of election control that has been proposed, pertinent to the spatial voting model, is to select a subset of issues that determine voter preferences over candidates. We study a variation of this model in which voters have judgments about relative importance of issues, and a malicious actor can manipulate these judgments. We show that computing effective manipulations in this model is NP-hard even with two candidates or binary issues. However, we demonstrate that the problem becomes tractable with a constant number of voters or issues. Additionally, while it remains intractable when voters can vote stochastically, we exhibit an important special case in which stochastic voting behavior enables tractable manipulation.

[Paper] [Video (inline)] [Video (new tab)]

Small datasets form a significant portion of releasable data in high sensitivity domains such as healthcare. But, providing differential privacy for small dataset release is a hard task, where current state-of-the-art methods suffer from severe utility loss. As a solution, we propose DPRP (Differentially Private Data Release via Random Projections), a reconstruction based approach for releasing differentially private small datasets. DPRP has several key advantages over the state-of-the-art. Using seven diverse real-life datasets, we show that DPRP outperforms the current state-of-the-art on a variety of tasks, under varying conditions, and for all privacy budgets.

[Paper] [Video (inline)] [Video (new tab)]

We propose a new method that satisfies approximate differential privacy for top-$k$ selection with unordered output in the unknown data domain setting, not relying on the full knowledge of the domain universe. Our algorithm only requires looking at the top-$\bar{k}$ elements for any given $\bar{k} \geq k$, thus, enforcing the principle of minimal privilege. Unlike previous methods, our privacy parameter $\varepsilon$ does not scale with $k$, giving improved applicability for scenarios of very large $k$. Moreover, our novel construction, which combines the sparse vector technique and stability efficiently, can be applied as a general framework to any type of query, thus being of independent interest. We extensively compare our algorithm to previous work of top-$k$ selection on the unknown domain, and show, both analytically and on experiments, settings where we outperform the current state-of-the-art.

440: Mohamed et al.

538: Pal et al.

276: Dulac et al.

201: Yang et al.

151: Estornell et al.

264: Gondara et al.

459: Carvalho et al.

Sponsor Presentation: Layer 6

[Paper] [Video (inline)] [Video (new tab)]

Counterfactual explanations are usually obtainedby identifying the smallest change made to an input to change a prediction made by a fixed model (hereafter called sparse methods). Recent work, however, has revitalized an old insight: there often does not exist one superior solution to a prediction problem with respect to commonly used measures of interest (e.g. error rate). In fact, often multiple different classifiers give almost equal solutions. This phenomenon is known as predictive multiplicity (Breiman, 2001; Marx et al., 2019). In this work, we derive a general upper bound for the costs of counterfactual explanations under predictive multiplicity. Most notably, it depends on a discrepancy notion between two classifiers, which describes how differently they treat negatively predicted individuals. We then compare sparse and data support approaches empirically on real-world data. The results show that data support methods are more robust to multiplicity of different models. At the same time, we show that those methods have provably higher cost of generating counterfactual explanations under one fixed model. In summary, our theoretical and empirical results challenge the commonly held view that counterfactual recommendations should be sparse in general.

[Paper] [Video (inline)] [Video (new tab)]

Recent years have seen many advances in methods for causal structure learning from data. The empirical assessment of such methods, however, is much less developed. Motivated by this gap, we pose the following question: how can one assess, in a given problem setting, the practical efficacy of one or more causal structure learning methods? We formalize the problem in a decision-theoretic framework, via a notion of expected loss or risk for the causal setting. We introduce a theoretical notion of causal risk as well as sample quantities that can be computed from data, and study the relationship between the two, both theoretically and through an extensive simulation study. Our results provide an assumptions-light framework for assessing causal structure learning methods that can be applied in a range of practical use-cases.

[Paper] [Video (inline)] [Video (new tab)]

Many classical algorithms output graphical representations of causal structures by testing conditional independence among a set of random variables. In dynamical systems, local independence can be used analogously as a testable implication of the underlying data-generating process. We suggest some inexpensive methods for causal screening which provide output with a sound causal interpretation under the assumption of ancestral faithfulness. The popular model class of linear Hawkes processes is used to provide an example of a dynamical causal model. We argue that for sparse causal graphs the output will often be close to complete. We give examples of this framework and apply it to a challenging biological system.

[Paper] [Video (inline)] [Video (new tab)]

The recent availability of huge, many-dimensional data sets, like those arising from genome-wide association studies (GWAS), provides many opportunities for strengthening causal inference. One popular approach is to utilize these many-dimensional measurements as instrumental variables (instruments) for improving the causal effect estimate between other pairs of variables. Unfortunately, searching for proper instruments in a many-dimensional set of candidates is a daunting task due to the intractable model space and the fact that we cannot directly test which of these candidates are valid, so most existing search methods either rely on overly stringent modeling assumptions or fail to capture the inherent model uncertainty in the selection process. We show that, as long as at least some of the candidates are (close to) valid, without knowing a priori which ones, they collectively still pose enough restrictions on the target interaction to obtain a reliable causal effect estimate. We propose a general and efficient causal inference algorithm that accounts for model uncertainty by performing Bayesian model averaging over the most promising many-dimensional instrumental variable models, while at the same time employing weaker assumptions regarding the data generating process. We showcase the efficiency, robustness and predictive performance of our algorithm through experimental results on both simulated and real-world data.

[Paper] [Video (inline)] [Video (new tab)]

We propose an approach to estimate the effect of multiple simultaneous interventions in the presence of hidden confounders. To overcome the problem of hidden confounding, we consider the setting where we have access to not only the observational data but also sets of single-variable interventions in which each of the treatment variables is intervened on separately. We prove identifiability under the assumption that the data is generated from a nonlinear continuous structural causal model with additive Gaussian noise. In addition, we propose a simple parameter estimation method by pooling all the data from different regimes and jointly maximizing the combined likelihood. We also conduct comprehensive experiments to verify the identifiability result as well as to compare the performance of our approach against a baseline on both synthetic and real-world data.

[Paper] [Video (inline)] [Video (new tab)]

In this paper, we consider the problem of estimating all possible causal effects from observational data with two types of background knowledge: direct causal information and non-ancestral information. Following the IDA framework, we first provide locally valid orientation rules for maximal partially directed acyclic graphs (PDAGs), which are widely used to represent background knowledge. Based on the proposed rules, we present a fully local algorithm to estimate all possible causal effects with direct causal information. Furthermore, we consider non-ancestral information and prove that it can be equivalently transformed into direct causal information, meaning that we can also locally estimate all possible causal effects with non-ancestral information. The test results on both synthetic and real-world data sets show that our methods are efficient and stable.

[Paper] [Video (inline)] [Video (new tab)]

It is clear that some causal effects cannot be identified from observational data when the causal directed acyclic graph is absent. In such cases, IDA is a useful framework which estimates all possible causal effects by adjusting for all possible parental sets. In this paper, we combine the adjustment set selection procedure with the original IDA framework. Our goal is to find a common set that can be subtracted from all possible parental sets without influencing the back-door adjustment. To this end, we first introduce graphical conditions to decide whether a treatment's neighbor or parent in a completed partially directed acyclic graph (CPDAG) can be subtracted and then provide a procedure to construct a subtractable set from those subtractable vertices. We next combine the procedure with the IDA framework and provide a fully local modification of IDA. Experimental results show that, with our modification, both the number of possible parental sets and the size of each possible parental set enumerated by the modified IDA decrease, making it possible to estimate all possible causal effects more efficiently.

[Paper] [Video (inline)] [Video (new tab)]

The paper introduces a novel conditional independence (CI) based method for linear and nonlinear, lagged and contemporaneous causal discovery from observational time series in the causally sufficient case. Existing CI-based methods such as the PC algorithm and also common methods from other frameworks suffer from low recall and partially inflated false positives for strong autocorrelation which is an ubiquitous challenge in time series. The novel method, PCMCI$^+$, extends PCMCI [Runge et al., 2019b] to include discovery of contemporaneous links. PCMCI$^+$ improves the reliability of CI tests by optimizing the choice of conditioning sets and even benefits from autocorrelation. The method is order-independent and consistent in the oracle case. A broad range of numerical experiments demonstrates that PCMCI$^+$ has higher adjacency detection power and especially more contemporaneous orientation recall compared to other methods while better controlling false positives. Optimized conditioning sets also lead to much shorter runtimes than the PC algorithm. PCMCI$^+$ can be of considerable use in many real world application scenarios where often time resolutions are too coarse to resolve time delays and strong autocorrelation is present.

93: Eigenmann et al.

132: Mogensen

439: Bucur et al.

131: Saengkyongam et al.

127: Fang et al.

129: Liu et al.

579: Runge

Sponsor Presentation: Layer 6

[Paper] [Video (inline)] [Video (new tab)]

We consider the problem of sequentially optimising the conditional expectation of an objective function, with both the conditional distribution and the objective function assumed to be fixed but unknown. Assuming that the objective function belongs to a reproducing kernel Hilbert space (RKHS), we provide a novel upper confidence bound (UCB) based algorithm CME-UCB via estimation of the conditional mean embeddings (CME), and derive its regret bound. Along the way, we derive novel approximation guarantees for the CME estimates. Finally, experiments are carried out in a synthetic example and in a likelihood-free inference application that highlight the useful insights of the proposed method.

[Paper] [Video (inline)] [Video (new tab)]

In this paper we discuss sparsification of worker-to-server communication in large distributed systems. We improve upon algorithms that fit the following template: a local gradient estimate is computed independently by each worker, then communicated to a master, which subsequently performs averaging. The average is broadcast back to the workers, which use it to perform a gradient-type step to update the local version of the model. We observe that the above template is fundamentally inefficient in that too much data is unnecessarily communicated from the workers to the server, which slows down the overall system. We propose a fix based on a new update-sparsification method we develop in this work, which we suggest be used on top of existing methods. Namely, we develop a new variant of parallel block coordinate descent based on independent sparsification of the local gradient estimates before communication. We demonstrate that with only $m/n$ blocks sent by each of $n$ workers, where $m$ is the total number of parameter blocks, the theoretical iteration complexity of the underlying distributed methods is essentially unaffected. As an illustration, this means that when $n=100$ parallel workers are used, the communication of 99% blocks is redundant, and hence a waste of time. Our theoretical claims are supported through extensive numerical experiments which demonstrate an almost perfect match with our theory on a number of synthetic and real datasets.

[Paper] [Video (inline)] [Video (new tab)]

This work proposes a novel momentum technique, the Amortized Nesterov's Momentum, for stochastic convex optimization. The proposed method can be regarded as a smooth transition between Nesterov's method and mirror descent. By tuning only a single parameter, users can trade Nesterov's acceleration for robustness, that is, the variance control of the stochastic noise. Motivated by the recent success of using momentum in deep learning, we conducted extensive experiments to evaluate this new momentum in deep learning tasks. The results suggest that it can serve as a favorable alternative for Nesterov's momentum.

[Paper] [Video (inline)] [Video (new tab)]

This work examines the convergence of stochastic gradient-based optimization algorithms that use early stopping based on a validation function. The form of early stopping we consider is that optimization terminates when the norm of the gradient of a validation function falls below a threshold. We derive conditions that guarantee this stopping rule is well-defined, and provide bounds on the expected number of iterations and gradient evaluations needed to meet this criterion. The guarantee accounts for the distance between the training and validation sets, measured with the Wasserstein distance. We develop the approach in the general setting of a first-order optimization algorithm, with possibly biased update directions subject to a geometric drift condition. We then derive bounds on the expected running time for early stopping variants of several algorithms, including stochastic gradient descent (SGD), decentralized SGD (DSGD), and the stochastic variance reduced gradient (SVRG) algorithm. Finally, we consider the generalization properties of the iterate returned by early stopping.

[Paper] [Video (inline)] [Video (new tab)]

Online learning in dynamic environments has recently drawn considerable attention, where dynamic regret is usually employed to compare decisions of online algorithms to dynamic comparators. In previous works, dynamic regret bounds are typically established in terms of regularity of comparators $C_T$ or that of online functions $V_T$. Recently, Jadbabaie et al. [2015] propose an algorithm that can take advantage of both regularities and enjoy an $\tilde{O}(\sqrt{1+D_T} + \min\{\sqrt{(1+D_T)C_T}, (1+D_T)^{1/3}V_T^{1/3}T^{1/3}\})$ dynamic regret, where $D_T$ is an additional quantity to measure the niceness of environments. The regret bound adapts to the smaller regularity of problem environments and is tighter than all existing dynamic regret guarantees. Nevertheless, their algorithm involves non-convex programming at each iteration, and thus requires burdensome computations. In this paper, we design a simple algorithm based on the online ensemble, which provably enjoys the same (even slightly stronger) guarantee as the state-of-the-art rate, yet is much more efficient because our algorithm does not involve any non-convex problem solving. Empirical studies also verify the efficacy and efficiency.

[Paper] [Video (inline)] [Video (new tab)]

In this paper, we address the problem of jointly estimating dependencies across samples and dependencies across multiple features, where each set of dependencies is modeled as an inverse covariance matrix. In particular, we study a matrix-variate Gaussian distribution with the Kronecker-sum of sample-wise and feature-wise inverse covariances. While this Kronecker-sum model has been studied as an intuitively more appealing convex alternative to the Kronecker-product of two inverse covariance matrices, the existing methods do not scale to large datasets. We introduce a highly-efficient optimization method for estimating the Kronecker-sum structured inverse covariance matrix from matrix-variate data. In addition, we describe an alternative simpler approach for handling the non-identifiability of parameters than the strategies proposed in previous works. Using simulated and real data, we demonstrate our approach leads to one or two orders-of-magnitude speedup of the previous methods.

[Paper] [Video (inline)] [Video (new tab)]

Dominant generalized eigenspace computation, concerned with how to find one of the top-k generalized eigenspaces of a pair of real symmetric matrices, is one of the fundamental problems in scientific computing, data analysis, and statistics. In this work, we propose a practical Riemannian algorithm based on the first-order optimization on generalized Stiefel manifolds while efficiently leveraging second-order information. Particularly, we use inexact Riemannian gradients which result from running a fast least-squares solver to approximate matrix multiplications for avoiding costly matrix inversions involved therein. We also conduct a theoretical analysis that is different than existing ones, achieving a unified linear convergence rate regardless of the conventional generalized eigenvalue gap which is the key parameter to the currently dichotomized analysis: gap-dependent or gap-free. The resulting linear rate, albeit not optimal, remains valid in full generality. Despite the simplicity, empirically, our algorithm as a block generalized eigensolver remarkably outperforms existing solvers.

[Paper] [Video (inline)] [Video (new tab)]

Dictionary learning is a classic representation learning method that has been widely applied in signal processing and data analytics. In this paper, we investigate a family of $\ell_p$-norm ($p>2,p \in \mathbb{N}$) maximization approaches for the complete dictionary learning problem from theoretical and algorithmic aspects. Specifically, we prove that the global maximizers of these formulations are very close to the true dictionary with high probability, even when Gaussian noise is present. Based on the generalized power method (GPM), an efficient algorithm is then developed for the $\ell_p$-based formulations. We further show the efficacy of the developed algorithm: for the population GPM algorithm over the sphere constraint, it first quickly enters the neighborhood of a global maximizer, and then converges linearly in this region. Extensive experiments will demonstrate that the $\ell_p$-based approaches enjoy a higher computational efficiency and better robustness than conventional approaches and $p=3$ performs the best.

404: Mishchenko et al.

108: Zhou et al.

32: Flynn et al.

175: Zhang et al.

507: Yoon et al.

354: Xu et al.

128: Shen et al.

Sponsor Presentation: Layer 6

**Sponsors**