There will be 230 papers presented at the conference. The list of papers is below, together with the OpenReview page links. (These are not the conference proceedings. Papers may be changed/removed, so do not use this list as permanent).
Statistical tests for dataset shift are susceptible to false alarms: they are sensitive to minor differences when there is in fact adequate sample coverage and predictive performance. We propose instead a framework to detect adverse shifts based on outlier scores, \texttt{D-SOS} for short. \texttt{D-SOS} holds that the new (test) sample is not substantively worse than the reference (training) sample, and not that the two are equal. The key idea is to reduce observations to outlier scores and compare contamination rates at varying weighted thresholds. Users can define what \emph{worse} means in terms of relevant notions of outlyingness, including proxies for predictive performance. Compared to tests of equal distribution, our approach is uniquely tailored to serve as a robust metric for model monitoring and data validation. We show how versatile and practical \texttt{D-SOS} is on a wide range of real and simulated data.
Causal discovery for purely observational, categorical data is a long-standing challenging problem. Unlike continuous data, the vast majority of existing methods for categorical data focus on inferring the Markov equivalence class only, which leaves the direction of some causal relationships undetermined. This paper proposes an identifiable ordinal causal discovery method that exploits the ordinal information contained in many real-world applications to uniquely identify the causal structure. The proposed method is applicable beyond ordinal data via data discretization. Through real-world and synthetic experiments, we demonstrate that the proposed ordinal causal discovery method combined with simple score-and-search algorithms has favorable and robust performance compared to state-of-the-art alternative methods in both ordinal categorical and non-categorical data. An accompanied R package OCD is freely available.
Sahra Ghalebikesabi (University of Oxford); Harrison Wilde (Twitter); Jack Jewson (Universitat Pompeu Fabra); Arnaud Doucet (Google DeepMind); Sebastian Josef Vollmer (University of Kaiserslautern); Christopher C. Holmes (University of Oxford)
Increasing interest in privacy-preserving machine learning has led to new and evolved approaches for generating private synthetic data from undisclosed real data. However, mechanisms of privacy preservation can significantly reduce the utility of synthetic data, which in turn impacts downstream tasks such as learning predictive models or inference. We propose several re-weighting strategies using privatised likelihood ratios that not only mitigate statistical bias of downstream estimators but also have general applicability to differentially private generative models. Through large-scale empirical evaluation, we show that private importance weighting provides simple and effective privacy-compliant augmentation for general applications of synthetic data.
Tuan-Duy Hien Nguyen (VinAI); Ngoc Bui (Hanoi University of Science and Technology); Duy Nguyen (VinAI Research); Man-Chung Yue (The University of Hong Kong); Viet Anh Nguyen (VinAI Research, Vietnam)
Algorithmic recourse aims to recommend an informative feedback to overturn an unfavorable machine learning decision. We introduce in this paper the Bayesian recourse, a model-agnostic recourse that minimizes the posterior probability odds ratio. Further, we present its min-max robust counterpart with the goal of hedging against future changes in the machine learning model parameters. The robust counterpart explicitly takes into account possible perturbations of the data in a Gaussian mixture ambiguity set prescribed using the optimal transport (Wasserstein) distance. We show that the resulting worst-case objective function can be decomposed into solving a series of two-dimensional optimization subproblems, and the min-max recourse finding problem is thus amenable to a gradient descent algorithm. Contrary to existing methods for generating robust recourses, the robust Bayesian recourse does not require a linear approximation step. The numerical experiment demonstrates the effectiveness of our proposed robust Bayesian recourse facing model shifts.
Data of general object images have two most common structures: (1) each object of a given shape can be rendered in multiple different views, and (2) shapes of objects can be categorized in such a way that the diversity of shapes is much larger across categories than within a category. Existing deep generative models can typically capture either structure, but not both. In this work, we introduce a novel deep generative model, called CIGMO, that can learn to represent category, shape, and view factors from image data. The model comprises multiple modules of shape representations that are each specialized to a particular category and disentangled from view representation, and can be learned using a group-based weakly supervised learning method. By empirical investigation, we show that our model can effectively discover categories of object shapes despite large view variation and quantitatively supersede various previous methods including the state-of-the-art invariant clustering algorithm. Further, we show that our approach using category-specialization can enhance the learned shape representation to better perform down-stream tasks such as one-shot object identification as well as shape-view disentanglement.
Sixie Yu (Washington University, St. Louis); P. Jeffrey Brantingham (University of California, Los Angeles); Matthew Valasik (Louisiana State University); Yevgeniy Vorobeychik (Washington University, St. Louis)
Network games are a natural modeling framework for strategic interactions of agents whose actions have local impact on others. Recently, a multi-scale network game model has been proposed to capture local effects at multiple network scales, such as among both individuals and groups. We propose a framework to learn the utility functions of binary multi-scale games from agents' behavioral data. Departing from much prior work in this area, we model agent behavior as following logit-response dynamics, rather than acting according to a Nash equilibrium. This defines a generative time-series model of joint behavior of both agents and groups, which enables us to naturally cast the learning problem as maximum likelihood estimation (MLE). We show that in the important special case of multi-scale linear-quadratic games, this MLE problem is convex. Extensive experiments using both synthetic and real data demonstrate that our proposed modeling and learning approach is effective in both game parameter estimation as well as prediction of future behavior, even when we learn the game from only a single behavior time series. Furthermore, we show how to use our framework to develop a statistical test for the existence of multi-scale structure in the game, and use it to demonstrate that real time-series data indeed exhibits such structure.
Martin Gubri (University of Luxemburg); Maxime Cordy (University of Luxemburg); Mike Papadakis (University of Luxemburg); Yves Le Traon (University of Luxemburg); Koushik Sen (University of California-Berkeley)
An established way to improve the transferability of black-box evasion attacks is to craft the adversarial examples on an ensemble-based surrogate to increase diversity. We argue that transferability is fundamentally related to uncertainty. Based on a state-of-the-art Bayesian Deep Learning technique, we propose a new method to efficiently build a surrogate by sampling approximately from the posterior distribution of neural network weights, which represents the belief about the value of each parameter. Our extensive experiments on ImageNet and CIFAR-10 show that our approach improves the transfer rates of four state-of-the-art attacks significantly (up to 62.1 percentage points), in both intra-architecture and inter-architecture cases. On ImageNet, our approach can reach 94% of transfer rate while reducing training computations from 11.6 to 2.4 exaflops, compared to an ensemble of independently trained DNNs. Our vanilla surrogate achieves 87.5% of the time higher transferability than 3 test-time techniques designed for this purpose. Our work demonstrates that the way to train a surrogate has been overlooked, although it is an important element of transfer-based attacks. We are, therefore, the first to review the effectiveness of several training methods in increasing transferability. We provide new directions to better understand the transferability phenomenon and offer a simple but strong baseline for future work.
Andreas Munk (University of British Columbia); Berend Zwartsenberg (Inverted AI); Adam Scibior (Inverted AI); Atilim Gunes Baydin (University of Oxford); Andrew Lawrence Stewart (University of British Columbia); Goran Fernlund (University of British Columbia); Anoush Poursartip (Convergent Manufacturing Technologies Inc); Frank Wood (University of British Columbia)
We present a framework for automatically structuring and training fast, approximate, deep neural surrogates of stochastic simulators. Unlike traditional approaches to surrogate modeling, our surrogates retain the interpretable structure and control flow of the reference simulator. Our surrogates target stochastic simulators where the number of random variables itself can be stochastic and potentially unbounded. Our framework further enables an automatic replacement of the reference simulator with the surrogate when undertaking amortized inference. The fidelity and speed of our surrogates allow for both faster stochastic simulation and accurate and substantially faster posterior inference. Using an illustrative yet non-trivial example we show our surrogates' ability to accurately model a probabilistic program with an unbounded number of random variables. We then proceed with an example that shows our surrogates are able to accurately model a complex structure like an unbounded stack in a program synthesis example. We further demonstrate how our surrogate modeling technique makes amortized inference in complex black-box simulators an order of magnitude faster. Specifically, we do simulator-based materials quality testing, inferring safety-critical latent internal temperature profiles of composite materials undergoing curing.
The intersection of learning to rank and choice modeling is an active area of research with applications in e-commerce, information retrieval and the social sciences. In some applications such as recommendation systems, the statistician is primarily interested in recovering the set of the top ranked items from a large pool of items as efficiently as possible using passively collected \emph{discrete choice data}, i.e., the user picks one item from a set of multiple items. Motivated by this practical consideration, we propose \emph{the choice-based Borda count algorithm} as a fast and accurate ranking algorithm for \emph{top $K$-recovery} i.e., correctly identifying all of the top $K$ items. We show that the choice-based Borda count algorithm has optimal sample complexity for top-$K$ recovery under a broad class of \emph{random utility models}. We prove that in the limit, the choice-based Borda count algorithm produces the same top-$K$ estimate as the commonly used Maximum Likelihood Estimate method but the former's speed and simplicity brings considerable advantages in practice. Experiments on both synthetic and real datasets show that the counting algorithm is competitive with commonly used ranking algorithms in terms of accuracy while being several orders of magnitude faster.
Fangting Zhou (Texas A&M); Kejun He (Renmin University of China); Yang Ni (Texas A&M)
We consider the problem of causal discovery (structure learning) from heterogeneous observational data. Most existing methods assume a homogeneous sampling scheme, which may lead to misleading conclusions when violated. We propose a novel approach that exploits data heterogeneity to infer possibly cyclic causal structures from causally insufficient systems. The core idea is to model the direct causal effects as functions of exogenous covariates that help explain data/sampling heterogeneity. We investigate the structure identifiability properties of the proposed model. Structure learning is carried out in a fully Bayesian fashion, which provides natural uncertainty quantification. We demonstrate its utility through extensive simulations and two real-world applications.
Shuang Wu (University of California, Los Angeles); Chi-Hua Wang (Purdue University); Yuantong Li (University of California, Los Angeles); Guang Cheng (University of California, Los Angeles)
We propose a new bootstrap-based online algorithm for stochastic linear bandit problems. The key idea is to adopt residual bootstrap exploration, in which the agent estimates the next step reward by re-sampling the residuals of mean reward estimate. Our algorithm, residual bootstrap exploration for stochastic linear bandit (\texttt{LinReBoot}), estimates the linear reward from its re-sampling distribution and pulls the arm with the highest reward estimate. In particular, we contribute a theoretical framework to demystify residual bootstrap-based exploration mechanisms in stochastic linear bandit problems. The key insight is that the strength of bootstrap exploration is based on collaborated optimism between the online-learned model and the re-sampling distribution of residuals. Such observation enables us to show that the proposed \texttt{LinReBoot} secure a high-probability $\Tilde{O}(d \sqrt{n})$ sub-linear regret under mild conditions. Our experiments support the easy generalizability of the \texttt{ReBoot} principle in the various formulations of linear bandit problems and show the significant computational efficiency of \texttt{LinReBoot}.
Jungtaek Kim (POSTECH); Seungjin Choi (Intellicode); Minsu Cho (POSTECH)
Bayesian optimization is a popular method for solving the problem of global optimization of an expensive-to-evaluate black-box function. It relies on a probabilistic surrogate model of the objective function, upon which an acquisition function is built to determine where next to evaluate the objective function. In general, Bayesian optimization with Gaussian process regression operates on a continuous space. When input variables are categorical or discrete, an extra care is needed. A common approach is to use one-hot encoded or Boolean representation for categorical variables which might yield a combinatorial explosion problem. In this paper we present a method for Bayesian optimization in a combinatorial space, which can operate well in a large combinatorial space. The main idea is to use a random mapping which embeds the combinatorial space into a convex polytope in a continuous space, on which all essential process is performed to determine a solution to the black-box optimization in the combinatorial space. We describe our combinatorial Bayesian optimization algorithm and present its regret analysis. Numerical experiments demonstrate that our method shows satisfactory performance compared to existing methods.
Sewhan Chun (Korea Advanced Institute of Science and Technology); Jae Young Lee (Korea Advanced Institute of Science and Technology); Junmo Kim (Korea Advanced Institute of Science and Technology)
In the recent studies of data augmentation of neural networks, the application of test time augmentation has been studied to extract optimal transformation policies to enhance performance with minimum cost. The policy search method with the best level of input data dependency involves training a loss predictor network to estimate suitable transformations for each of the given input image in independent manner, resulting in instance-level transformation extraction. In this work, we propose a method to utilize and modify the loss prediction pipeline to further improve the performance with the cyclic search for suitable transformations and the use of the entropy weight method. The cyclic usage of the loss predictor allows refining each input image with multiple transformations with a more flexible transformation magnitude. For cases where multiple augmentations are generated, we implement the entropy weight method to reflect the data uncertainty of each augmentation to force the final result to focus on augmentations with low uncertainty. The experimental result shows convincing qualitative outcome and robust performance for the corrupted conditions of data.
Tianshu Bao (Vanderbilt University); Shengyu Chen (University of Pittsburgh); Taylor T Johnson (Vanderbilt University); Peyman Givi (University of Pittsburgh); Shervin Sammak (University of Pittsburgh); Xiaowei Jia (University of Pittsburgh)
Direct numerical simulation (DNS) of turbulent flows is computationally expensive and cannot be applied to flows with large Reynolds numbers. Low-resolution large eddy simulation (LES) is a popular alternative, but it is unable to capture all of the scales of turbulent transport accurately. Reconstructing DNS from low-resolution LES is critical for large-scale simulation in many scientific and engineering disciplines, but it poses many challenges to existing super-resolution methods due to the complexity of turbulent flows and computational cost of generating frequent LES data. We propose a physics-guided neural network for reconstructing frequent DNS from sparse LES data by enhancing its spatial resolution and temporal frequency. Our proposed method consists of a partial differential equation (PDE)-based recurrent unit for capturing underlying temporal processes and a physics-guided super-resolution model that incorporates additional physical constraints. We demonstrate the effectiveness of both components in reconstructing the Taylor-Green Vortex using sparse LES data. Moreover, we show that the proposed recurrent unit can preserve the physical characteristics of turbulent flows by leveraging the physical relationships in the Navier-Stokes equation.
Qiang Zhou (Southeast University); Sinno Pan (Nanyang Technological University)
Linear coupling is recently proposed to accelerate first-order algorithms by linking gradient descent and mirror descent together, which is able to achieve the accelerated convergence rate for first-order algorithms. This work focuses on the convergence analysis of linear coupling for convex composite minimization when the proximal operator cannot be exactly computed. It is of particular interest to study the convergence of linear coupling because it not only achieves the accelerated convergence rate for first-order algorithm but also works for generic norms. We present convergence analysis of linear coupling by allowing the proximal operator to be computed up to a certain precision. Our analysis illustrates that the accelerated convergence rate of linear coupling with inexact proximal operator can be preserved if the error sequence of inexact proximal operator decreases in a sufficiently fast rate. More importantly, our analysis leads to better bounds than existing works on inexact proximal operator. Experiment results on several real-world datasets verify our theoretical results.
Zijun Cui (Rensselaer Polytechnic Institute); Hanjing Wang (Rensselaer Polytechnic Institute); Tian Gao (International Business Machines); Kartik Talamadupula (IBM, International Business Machines); Qiang Ji (Rensselaer Polytechnic Institute)
Maximum-A-Posteriori (MAP) inference is a fundamental task in probabilistic inference and belief propagation (BP) is a widely used algorithm for MAP inference. Though BP has been applied successfully to many different fields, it offers no performance guarantee and often performs poorly on loopy graphs. To improve the performance on loopy graphs and to scale up to large graphs, we propose a variational message passing neural network (V-MPNN), where we leverage both the power of neural networks in modeling complex functions and the well-established algorithmic theories on variational belief propagation. Instead of relying on a hand-crafted variational assumption, we propose a neural free energy where a general variational distribution is parameterized through a neural network. Message passing neural networks are utilized for the minimization of neural free energy. Training of MPNNs is thus guided by neural free energy, without requiring exact MAP configurations as annotations. We empirically demonstrate the effectiveness of the proposed V-MPNN by comparing against both state-of-the-art training-free methods and training-based methods.
Learning functions over Boolean variables is a fundamental problem in machine learning. But not much is known about learning such functions using neural networks. Here we focus on learning read-once DNFs under the uniform distribution with a convex neural network and gradient methods. We first observe empirically that gradient methods converge to compact solutions with neurons that are aligned with the terms of the DNF. This is despite the fact that there are many zero training error networks that do not have this property. Thus, the learning process has a clear inductive bias towards such logical formulas. Following recent results which connect the inductive bias of gradient flow (GF) to KKT points of minimum norm problems, we study these KKT points in our setting. We prove that zero training error solutions that memorize training points are not KKT points and therefore GF cannot converge to them. On the other hand, we prove that globally optimal KKT points correspond exactly to networks that are aligned with the DNF terms. These results suggest a strong connection between the inductive bias of GF and solutions that align with the DNF. We conclude with extensive experiments which verify our findings.
Ian Osband (DeepMind); Zheng Wen (Google DeepMind); Seyed Mohammad Asghari (DeepMind); Vikranth Dwaracherla (DeepMind); Xiuyuan Lu (Deepmind); Benjamin Van Roy (Google)
Most work on supervised learning research has focused on marginal predictions. In decision problems, joint predictive distributions are essential for good performance. Previous work has developed methods for assessing low-order predictive distributions with inputs sampled i.i.d. from the testing distribution. With low-dimensional inputs, these methods distinguish agents that effectively estimate uncertainty from those that do not. We establish that the predictive distribution order required for such differentiation increases greatly with input dimension, rendering these methods impractical. To accommodate high-dimensional inputs, we introduce \textit{dyadic sampling}, which focuses on predictive distributions associated with random \textit{pairs} of inputs. We demonstrate that this approach efficiently distinguishes agents in high-dimensional examples involving simple logistic regression as well as complex synthetic and empirical data.
Zun Li (University of Michigan); Feiran Jia (Pennsylvania State University); Aditya Mate (Harvard University); Shahin Jabbari (Drexel University); Mithun Chakraborty (University of Michigan); Milind Tambe (Harvard University); Yevgeniy Vorobeychik (Washington University, St. Louis)
From large-scale organizations to decentralized political systems, hierarchical strategic decision making is commonplace. We introduce a novel class of \emph{structured hierarchical games (SHGs)} that formally capture such hierarchical strategic interactions. In an SHG, each player is a node in a tree, and strategic choices of players are sequenced from root to leaves, with root moving first, followed by its children, then followed by their children, and so on until the leaves. A player's utility in an SHG depends on its own decision, and on the choices of its parent and \emph{all} the tree leaves. SHGs thus generalize simultaneous-move games, as well as Stackelberg games with many followers. We leverage the structure of both the sequence of player moves as well as payoff dependence to develop a novel gradient-based back propagation-style algorithm, which we call \emph{Differential Backward Induction (DBI)}, for approximating equilibria of SHGs. We provide a sufficient condition for convergence of DBI and demonstrate its efficacy in finding approximate equilibrium solutions to several SHG models of hierarchical policy-making problems.
Finding the features relevant to the difference in treatment effects is essential to unveil underlying causal mechanisms. Existing methods seek such features by measuring how greatly the feature attributes affect the degree of the {\it conditional average treatment effect} (CATE). However, these methods may overlook important features because CATE, a measure of an average treatment effect, cannot detect the difference of other distribution parameters than the mean (e.g., variance). In this paper, we propose a feature selection framework for discovering {\it distributional treatment effect modifiers}. To resolve the weakness of the existing methods, we formulate a feature importance measure that quantifies how strongly the feature attributes influence the discrepancy between potential outcome distributions. We derive its computationally efficient estimator and develop a feature selection algorithm that can control the type I error rate at some desired level. Experimental results show that our framework successfully discovers important features and outperforms the existing mean-based method.
Jean Feng (University of California, San Francisco); Gene Pennello (U.S. Food and Drug Administration); Nicholas Petrick (U.S. Food and Drug Administration); Berkman Sahiner (U.S. Food and Drug Administration); Romain Pirracchio (University of California, San Francisco); Alexej Gossmann (U.S. Food and Drug Administration)
After initial release of a machine learning algorithm, the model can be fine-tuned by retraining on subsequently gathered data, adding newly discovered features, or more. Each modification introduces a risk of deteriorating performance and must be validated on a test dataset. It may not always be practical to assemble a new dataset for testing each modification, especially when most modifications are minor or are implemented in rapid succession. Recent works have shown how one can repeatedly test modifications on the same dataset and protect against overfitting by (i) discretizing test results along a grid and (ii) applying a Bonferroni correction to adjust for the total number of modifications considered by an adaptive developer. However, the standard Bonferroni correction is overly conservative when most modifications are beneficial and/or highly correlated. This work investigates more powerful approaches using alpha-recycling and sequentially-rejective graphical procedures (SRGPs). We introduce novel extensions that account for correlation between adaptively chosen algorithmic modifications. In empirical analyses, the SRGPs control the error rate of approving unacceptable modifications and approve a substantially higher number of beneficial modifications than previous approaches.
Zhenyi Wang (State University of New York, Buffalo); Xiaoyang Wang (Tencent AI Lab); Li Shen (JD Explore Academy); Qiuling Suo (State University of New York, Buffalo); Kaiqiang Song (Tencent AI Lab); Dong Yu (Tencent AI Lab); Yan Shen (State University of New York, Buffalo); Mingchen Gao (University at Buffalo, SUNY)
Existing meta-learning works assume that each task has available training and testing data. However, there are many available pre-trained models without accessing their training data in practice. We often need a single model to solve different tasks simultaneously as this is much more convenient to deploy the models. Our work aims to meta-learn a model initialization from these pre-trained models without using corresponding training data. We name this challenging problem setting as Data-Free Learning To Learn (DFL2L). We propose a distributionally robust optimization (DRO) framework to learn a black-box model to fuse and compress all the pre-trained models into a single network to address this problem. To encourage good generalization to the unseen new tasks, the proposed DRO framework diversifies the learned task embedding associated with each pre-trained model to cover the diversity in the underlying training task distributions. A model initialization is sampled from the black-box network during meta-testing as the meta learned initialization. Extensive experiments on offline and online DFL2L settings and several real image datasets demonstrate the effectiveness of the proposed methods.
Margot Herin (LIP6); Patrice Perny (LIP6, UMR 7606); Nataliya Sokolovska (UMR S 1269)
This paper deals with preference elicitation within Choquet Expected Utility (CEU) theory for decision making under uncertainty. We consider the Savage's framework with a finite set of states and assume that preferences of the Decision Maker over acts are observable. The CEU model involves two parameters that must be tuned to the value system of the decision maker: a set function (capacity) modeling weights attached to events, of size exponential in the number of states, and a utility function defined on the space of outcomes. Our aim is to learn a sparse representation of the CEU model from preference data. We propose and test a preference learning approach based on a spline representation of utilities and the sparse learning of capacities to obtain CEU models achieving a good tradeoff between the aim of sparsity and the expressivity required by preference data.
Prasanth Sengadu Suresh (University of Georgia); Prashant Doshi (University of Georgia)
We consider the problem of learning the behavioral preferences of an expert engaged in a task from noisy and partially-observable demonstrations. This is motivated by real-world applications such as a line robot learning from observing a human worker, where some observations are occluded by environmental elements. Furthermore, robotic perception tends to be imperfect and noisy. Previous techniques for inverse reinforcement learning (IRL) take the approach of either omitting the missing portions or inferring it as part of expectation-maximization, which tends to be slow and prone to local optima. We present a new method that generalizes the well-known Bayesian maximum-a-posteriori (MAP) IRL method by marginalizing the occluded portions of the trajectory. This is then extended with an observation model to account for perception noise. This novel application of marginal MAP (MMAP) to IRL significantly improves on the previous IRL technique under occlusion in both formative evaluations on a toy problem and in a summative evaluation on a produce sorting line task by a physical robot.
Wen Jie Li (Nanjing university); Wasif Naeem (The Queen's University Belfast); Jia Liu (Nanjing University); Dequan Zheng (Nanjing University); Wei Hao (Nanjing University); lijun Chen (Nanjing University)
Accurate absolute pose regression is one of the key challenges in robotics and computer vision. Existing direct regression methods suffer from two limitations. First, some noisy scenarios such as poor illumination conditions are likely to result in the uncertainty of pose estimation. Second, the output n-dimensional feature vector in the Euclidean space $\mathbb{R}^n$ cannot be well mapped to $SE(3)$ manifold. In this work, we propose a deep dual quaternion network that performs the absolute pose regression on $SE(3)$. We first develop an antipodally symmetric probability distribution over the unit dual quaternion on $SE(3)$ to model uncertainties and then propose an intermediary differential representation space to replace the final output pose, which avoids the mapping problem from $\mathbb{R}^n$ to $SE(3)$. In addition, we introduce a backpropagation method that considers the continuousness and differentiability of the proposed intermediary space. Extensive experiments on the camera re-localization task on the Cambridge Landmarks and 7-Scenes datasets demonstrate that our method greatly improves the accuracy of the pose as well as the robustness in dealing with uncertainty and ambiguity, compared to the state-of-the-art.
Egor Shulgin (KAUST); Peter Richtarik (King Abdullah University of Science and Technology (KAUST))
Communication is one of the key bottlenecks in the distributed training of large-scale machine learning models, and lossy compression of exchanged information, such as stochastic gradients or models, is one of the most effective instruments to alleviate this issue. Among the most studied compression techniques is the class of unbiased compression operators with variance bounded by a multiple of the square norm of the vector we wish to compress. By design, this variance may remain high, and only diminishes if the input vector approaches zero. However, unless the model being trained is overparameterized, there is no a-priori reason for the vectors we wish to compress to approach zero during the iterations of classical methods such as distributed compressed {\sf SGD}, which has adverse effects on the convergence speed. Due to this issue, several more elaborate and seemingly very different algorithms have been proposed recently, with the goal of circumventing this issue. These methods are based on the idea of compressing the {\em difference} between the vector we would normally wish to compress and some auxiliary vector which changes throughout the iterative process. In this work we take a step back, and develop a unified framework for studying such methods, conceptually, and theoretically. Our framework incorporates methods compressing both gradients and models, using unbiased and biased compressors, and sheds light on the construction of the auxiliary vectors. Furthermore, our general framework can lead to the improvement of several existing algorithms, and can produce new algorithms. Finally, we performed several numerical experiments which illustrate and support our theoretical findings.
Anthony DiGiovanni (University of Michigan); Ambuj Tewari (University of Michigan Ann Arbor)
We study the problem of adaptability in repeated games: simultaneously guaranteeing low regret for several classes of opponents. We add the constraint that our algorithm is non-exploitable, in that the opponent lacks an incentive to use an algorithm against which we cannot achieve rewards exceeding some ??air??value. Our solution is an expert algorithm (LAFF), which searches within a set of sub-algorithms that are optimal for each opponent class, and punishes evidence of exploitation by switching to a policy that enforces a fair solution. With benchmarks that depend on the opponent class, we first show that LAFF has sublinear regret uniformly over these classes. Second, we show that LAFF discourages exploitation, because exploitative opponents have linear regret. To our knowledge, this work is the first to provide guarantees for both regret and non-exploitability in multi-agent learning.
Jakob Lindinger (Bosch Center for Artificial Intelligence); Barbara Rakitsch (Robert Bosch GmbH, Bosch); Christoph Lippert (Hasso Plattner Institute)
Gaussian process state-space models describe time series data in a probabilistic and non-parametric manner using a Gaussian process transition function. As inference is intractable, recent methods use variational inference and either rely on simplifying independence assumptions on the approximate posterior or learn the temporal states iteratively. The latter hampers optimization since the posterior over the presence can only be learned once the posterior governing the past has converged. We present a novel inference scheme that applies stochastic variational inference for the Gaussian process posterior and the Laplace approximation on the temporal states. This approach respects the conditional dependencies in the model and, through the Laplace approximation, treats the temporal states jointly, thereby avoiding their sequential learning. Our method is computationally efficient and leads to better calibrated predictions compared to state-of-the art alternatives on synthetic data and on a range of benchmark datasets.
Yajie Bao (Shanghai Jiao Tong University); Weidong Liu (Shanghai Jiao Tong University); Xiaojun Mao (Shanghai Jiao Tong University); Weijia Xiong
Communication cost and security issues are both important in large-scale distributed machine learning. In this paper, we investigate a multiclass sparse classification problem under two distributed systems. We propose two distributed multiclass sparse discriminant analysis algorithms based on mean-aggregation and median-aggregation under normal distributed system or Byzantine failure system. Both of them are computation and communication efficient. Several theoretical results, including estimation error bounds, and support recovery, are established. With moderate initial estimators, our iterative estimators achieve a (near-)optimal rate and exact support recovery after a constant number of rounds. Experiments on both synthetic and real datasets are provided to demonstrate the effectiveness of our proposed methods.
Seth Nabarro (Imperial College London); Stoil Krasimirov Ganev (University of Bristol); Adria Garriga-Alonso (University of Cambridge); Vincent Fortuin (Swiss Federal Institute of Technology); Mark van der Wilk (Imperial College London); Laurence Aitchison (University of Bristol)
"Bayesian neural networks that incorporate data augmentation implicitly use a ""randomly perturbed log-likelihood [which] does not have a clean interpretation as a valid likelihood function"" (Izmailov et al. 2021). Here, we provide several approaches to developing principled Bayesian neural networks incorporating data augmentation. We introduce a ""finite orbit"" setting which allows valid likelihoods to be computed exactly, and for the more usual ""full orbit"" setting we derive multi-sample bounds tighter than those used previously for Bayesian neural networks with data augmentation. These models cast light on the origin of the cold posterior effect. In particular, we find that the cold posterior effect persists even in these principled models incorporating data augmentation. This suggests that the cold posterior effect cannot be dismissed as an artifact of data augmentation using incorrect likelihoods."
Zhenhuan Yang (State University of New York, Albany); Shu Hu (Carnegie Mellon University); Yunwen Lei (University of Birmingham); Kush R. Varshney (International Business Machines); Siwei Lyu (State University of New York, Buffalo); Yiming Ying (State University of New York, Albany)
Stochastic gradient descent ascent (SGDA) and its variants have been the workhorse for solving minimax problems. However, in contrast to the well-studied stochastic gradient descent (SGD) with differential privacy (DP) constraints, there is little work on understanding the generalization (utility) of SGDA with DP constraints. In this paper, we use the algorithmic stability approach to establish the generalization (utility) of DP-SGDA in different settings. In particular, for the convex-concave setting, we prove that the DP-SGDA can achieve an optimal utility rate in terms of the weak primal-dual population risk in both smooth and non-smooth cases. To our best knowledge, this is the first-ever-known result for DP-SGDA in the non-smooth case. We further provide its utility analysis in the nonconvex-strongly-concave setting which is the first-ever-known result in terms of the primal population risk. The convergence and generalization results for this nonconvex setting are new even in the non-private setting. Finally, numerical experiments are conducted to demonstrate the effectiveness of DP-SGDA for both convex and nonconvex cases.
Ali Hasan (Duke University); Khalil Elkhalil (Duke University); Yuting Ng (Duke University); Joao M. Pereira (University of Texas, Austin); Sina Farsiu (Duke University); Jose Blanchet (Stanford University); Vahid Tarokh (Duke University)
We propose a novel neural network architecture that enables non-parametric calibration and generation of multivariate extreme value distributions (MEVs). MEVs arise from Extreme Value Theory (EVT) as the necessary class of models when extrapolating a distributional fit over large spatial and temporal scales based on data observed in intermediate scales. In turn, EVT dictates that $d$-max-decreasing, a stronger form of convexity, is an essential shape constraint in the characterization of MEVs. As far as we know, our proposed architecture provides the first class of non-parametric estimators for MEVs which preserve these essential shape constraints. We show that our architecture approximates the dependence structure encoded by MEVs at parametric rate. Moreover, we present a new method for sampling high-dimensional MEVs using a generative model. We demonstrate our methodology on a wide range of experimental settings, ranging from environmental sciences to financial mathematics and verify the structural properties of MEVs are retained compared to existing methods.
In statistical learning, many problem formulations have been proposed so far, such as multi-class learning, complementarily labeled learning, multi-label learning, multi-task learning, which provide theoretical models for various real-world tasks. Although they have been extensively studied, the relationship among them has not been fully investigated. In this work, we focus on a particular problem formulation called Multiple-Instance Learning (MIL), and show that various learning problems including all the problems mentioned above with some of new problems can be reduced to MIL with theoretically guaranteed generalization bounds, where the reductions are established under a new reduction scheme we provide as a by-product. The results imply that the MIL-reduction gives a simplified and unified framework for designing and analyzing algorithms for various learning problems. Moreover, we show that the MIL-reduction framework can be kernelized.
Yassir Fathullah (University of Cambridge); Mark Gales (University of Cambridge)
Deep learning is increasingly being applied in safety-critical domains. For these scenarios it is important to know the level of uncertainty in a model's prediction to ensure appropriate decisions are made by the system. Deep ensembles are the de-facto standard approach to obtaining various measures of uncertainty. However, ensembles often significantly increase the resources required in the training and/or deployment phases. Approaches have been developed that typically address the costs in one of these phases. In this work we propose a novel training approach, self-distribution distillation (S2D), which is able to efficiently train a single model that can estimate uncertainties. Furthermore it is possible to build ensembles of these models and apply hierarchical ensemble distillation approaches. Experiments on CIFAR-100 showed that S2D models outperformed standard models and Monte-Carlo dropout. Additional out-of-distribution detection experiments on LSUN, Tiny ImageNet, SVHN showed that even a standard deep ensemble can be outperformed using S2D based ensembles and novel distilled models.
Xiumin Xie (Guangxi Normal University); Chuanwen Hou (Guangxi Normal University); Zhixin Li (Guangxi Normal University)
Image-text retrieval relies on learning inter-modal correspondences. Most existing approaches focus on learning global or local correspondence and fail to explore fine-grained multi-level alignments. Moreover, it remains to be investigated how to infer more accurate similarity scores. In this paper, we propose a novel fine-grained matching with Multi-Perspective Similarity Modeling (MPSM) Network for cross-modal retrieval. Specifically, the Knowledge Graph Iterative Dissemination (KGID) module is designed to iteratively broadcast global semantic knowledge, enabling domain information to be integrated and relevant nodes to be associated, resulting in fine-grained modality representations. Subsequently, vector-based similarity representations are learned from multiple perspectives to model multi-level alignments comprehensively. The Relation Graph Reconstruction (RGR) module is further developed to enhance cross-modal correspondence by constructing similarity relation graphs and adaptively reconstructing them. Extensive experiments on the Flickr30K and MSCOCO datasets validate that our model significantly outperforms several state-of-the-art methods.
Motasem Alfarra (KAUST); Adel Bibi (University of Oxford); Philip Torr (University of Oxford); Bernard Ghanem (KAUST)
Randomized smoothing is a recent technique that achieves state-of-art performance in training certifiably robust deep neural networks. While the smoothing family of distributions is often connected to the choice of the norm used for certification, the parameters of these distributions are always set as global hyper parameters independent from the input data on which a network is certified. In this work, we revisit Gaussian randomized smoothing and show that the variance of the Gaussian distribution can be optimized at \emph{each} input so as to maximize the certification radius for the construction of the smooth classifier. Since the data dependent classifier does not directly enjoy sound certification with existing approaches, we propose a memory-enhanced data dependent smooth classifier that is certifiable by construction. This new approach is generic, parameter-free, and easy to implement. In fact, we show that our data dependent framework can be seamlessly incorporated into 3 randomized smoothing approaches, leading to consistent improved certified accuracy. When this framework is used in the training routine of these approaches followed by a data dependent certification, we achieve 9\% and 6\% improvement over the certified accuracy of the strongest baseline for a radius of 0.5 on CIFAR10 and ImageNet.
Tanya Braun (University of Muenster); Marcel Gehrke (University of Luebeck); Florian Lau (University of Luebeck); Ralf Moller (University of Luebeck)
A decentralised partially observable Markov decision problem (DecPOMDP) formalises collaborative multi-agent decision making. A solution to a DecPOMDP is a joint policy for the agents, fulfilling an optimality criterion such as maximum expected utility. A crux is that the problem is intractable regarding the number of agents. Inspired by lifted inference, this paper examines symmetries within the agent set for a potential tractability. Specifically, this paper contributes (i) specifications of counting and isomorphic symmetries, (ii) a compact encoding of symmetric DecPOMDPs as partitioned DecPOMDPs, and (iii) a formal analysis of their complexity and tractability. This work allows for solving a new optimisation problem that asks for the number of agents needed to satisfy a goal.
Shaocong Ma (University of Utah); Ziyi Chen (University of Utah); Yi Zhou (University of Utah); Kaiyi Ji (University of Michigan - Ann Arbor); Yingbin Liang (The Ohio State University)
Conventional machine learning applications typically assume that data samples are independently and identically distributed (i.i.d.). However, practical scenarios often involve a data-generating process that produces highly dependent data samples, which are known to heavily bias the stochastic optimization process and slow down the convergence of learning. In this paper, we conduct a fundamental study on how different stochastic data sampling schemes affect the sample complexity of online stochastic gradient descent (SGD) over highly dependent data. Specifically, with a ?-mixing model of data dependence, we show that online SGD with proper periodic data-subsampling achieves an improved sample complexity over the standard online SGD in the full spectrum of the data dependence level. Interestingly, even subsampling a subset of data samples can accelerate the convergence of online SGD over highly dependent data. Moreover, we show that online SGD with mini-batch sampling can further substantially improve the sample complexity over online SGD with periodic data subsampling over highly dependent data. Numerical experiments validate our theoretical results.
Joao Monteiro (ServiceNow Research); Mohamed Osama Ahmed (University of British Columbia); Hossein Hajimirsadeghi (Borealis AI); Greg Mori (Borealis AI)
We study settings where gradient penalties are used alongside risk minimization with the goal of obtaining predictors satisfying different notions of monotonicity. Specifically, we present two sets of contributions. In the first part of the paper, we show that different choices of penalties define the regions of the input space where the property is observed. As such, previous methods result in models that are monotonic only in a small volume of the input space. We thus propose an approach that uses mixtures of training instances and random points to populate the space and enforce the penalty in a much larger region. As a second set of contributions, we introduce regularization strategies that enforce other notions of monotonicity in different settings. In this case, we consider applications, such as image classification and generative modeling, where monotonicity is not a hard constraint but can help improve some aspects of the model. Namely, we show that inducing monotonicity can be beneficial in applications such as: (1) allowing for controllable data generation, (2) defining strategies to detect anomalous data, and (3) generating explanations for predictions. Our proposed approaches do not introduce relevant computational overhead while leading to efficient procedures that provide extra benefits over baseline models.
Hannes Eriksson (Chalmers University); Debabrota Basu (INRIA); Mina Alibeigi (Zenseact AB); Christos Dimitrakakis (University of Oslo, Norway)
In this paper, we consider risk-sensitive sequential decision-making in Reinforcement Learning (RL). Our contributions are two-fold. First, we introduce a novel and \emph{coherent} quantification of risk, namely \emph{composite risk}, which quantifies the joint effect of aleatory and epistemic risk during the learning process. Existing works considered either aleatory or epistemic risk individually, or as an additive combination. We prove that the additive formulation is a particular case of the composite risk when the epistemic risk measure is replaced with expectation. Thus, the composite risk is more sensitive to both aleatory and epistemic uncertainty than the individual and additive formulations. We also propose an algorithm, SENTINEL-K, based on ensemble bootstrapping and distributional RL for representing epistemic and aleatory uncertainty respectively. The ensemble of K learners uses Follow The Regularised Leader (FTRL) to aggregate the return distributions and obtain the composite risk. We experimentally verify that SENTINEL-K estimates the return distribution better, and while used with composite risk estimates, demonstrates higher risk-sensitive performance than state-of-the-art risk-sensitive and distributional RL algorithms.
Mao Ye (University of Texas, Austin); Qiang Liu (University of Texas, Austin)
Many modern machine learning applications, such as multi-task learning, require finding optimal model parameters to trade-off multiple objective functions that may conflict with each other. The notion of the Pareto set allows us to focus on the set of (often infinite number of) models that cannot be strictly improved. But it does not provide an actionable procedure for picking one or a few special models to return to practical users. In this paper, we consider \emph{optimization in Pareto set (OPT-in-Pareto)}, the problem of finding Pareto models that optimize an extra reference criterion function within the Pareto set. This function can either encode a specific preference from the users, or represent a generic diversity measure for obtaining a set of diversified Pareto models that are representative of the whole Pareto set. Unfortunately, despite being a highly useful framework, efficient algorithms for OPT-in-Pareto have been largely missing, especially for large-scale, non-convex, and non-linear objectives in deep learning. A naive approach is to apply Riemannian manifold gradient descent on the Pareto set, which yields a high computational cost due to the need for eigen-calculation of Hessian matrices. We propose a first-order algorithm that approximately solves OPT-in-Pareto using only gradient information, with both high practical efficiency and theoretically guaranteed convergence property. Empirically, we demonstrate that our method works efficiently for a variety of challenging multi-task-related problems.
Mengjing Chen (Tsinghua University, Tsinghua University); Pingzhong Tang (Tsinghua University); Zihe Wang (Renmin University of China); Shenke Xiao (Tsinghua University, Tsinghua University); Xiwang Yang (New York University)
Motivated by a new generation of Internet advertising that has emerged in the live streaming e-commerce markets (e.g., Tiktok) over the past five years, we study a variant of online bipartite matching problem: advertisers send ad requests to influencers (aka, key opinion leaders) on a social media platform. Each influencer has a maximum number of ad requests she can accommodate. We assign a fixed number of influencers to an advertiser when she enters the platform. The advertiser then matches with each of the assigned influencers with a probability, which can be thought of as a set of negotiations between the advertiser and the set of assigned influencers. Unlike the standard online assignment problems, the outcome of any of these matches is not revealed throughout the session (negotiations take time). Our goal is to maximize the expected number of matches between advertisers and influencers. We put forward a new deterministic algorithm with a competitive ratio of $1/2$ and prove that no deterministic algorithm can achieve a better competitive ratio. We also show that the competitive ratio can be improved when randomness is allowed. We then study a setting where a match is successful with either probability 0 or a fixed $p$. We present an optimal randomized algorithm that achieves a competitive ratio of $1-1/e$ in this setting.
Tim Reichelt (University of Oxford); Adam Golinski (University of Oxford); Luke Ong (Department of Computer Science, University of Oxford); Tom Rainforth (University of Oxford)
We show that the standard computational pipeline of probabilistic programming systems (PPSs) can be inefficient for estimating expectations and introduce the concept of expectation programming to address this. In expectation programming, the aim of the backend inference engine is to directly estimate expected return values of programs, as opposed to approximating their conditional distributions. This distinction, while subtle, allows us to achieve substantial performance improvements over the standard PPS computational pipeline by tailoring computation to the expectation we care about. We realize a particular instance of our expectation programming concept, Expectation Programming in Turing (EPT), by extending the PPS Turing to allow so-called target-aware inference to be run automatically. We then verify the statistical soundness of EPT theoretically, and show that it provides substantial empirical gains in practice.
Klim Kireev (Swiss Federal Institute of Technology Lausanne); Maksym Andriushchenko (Swiss Federal Institute of Technology Lausanne); Nicolas Flammarion (Swiss Federal Institute of Technology Lausanne)
The literature on robustness towards common corruptions shows no consensus on whether adversarial training can improve the performance in this setting. First, we show that, when used with an appropriately selected perturbation radius, $\ell_p$ adversarial training can serve as a strong baseline against common corruptions improving both accuracy and calibration. Then we explain why adversarial training performs better than data augmentation with simple Gaussian noise which has been observed to be a meaningful baseline on common corruptions. Related to this, we identify the $\sigma$-overfitting phenomenon when Gaussian augmentation overfits to a particular standard deviation used for training which has a significant detrimental effect on common corruption accuracy. We discuss how to alleviate this problem and then how to further enhance $\ell_p$ adversarial training by introducing an efficient relaxation of adversarial training with learned perceptual image patch similarity as the distance metric. Through experiments on CIFAR-10 and ImageNet-100, we show that our approach does not only improve the $\ell_p$ adversarial training baseline but also has cumulative gains with data augmentation methods such as AugMix, DeepAugment, ANT, and SIN, leading to state-of-the-art performance on common corruptions.
John Chen (Rice University); Samarth Sinha (University of Toronto, Toronto University); Anastasios Kyrillidis (Rice University)
Techniques combining multiple images as input/output have proven to be effective data augmentations for training convolutional neural networks. In this paper, we present StackMix: each input is presented as a concatenation of two images, and the label is the mean of the two one-hot labels. On its own, StackMix rivals other widely used methods in the ``Mix'' line of work. More importantly, unlike previous work, significant gains across a variety of benchmarks are achieved by combining StackMix with existing Mix augmentation, effectively mixing more than two images. E.g., by combining StackMix with CutMix, test error in the supervised setting is improved across a variety of settings over CutMix, including 0.8\% on ImageNet, 3\% on Tiny ImageNet, 2\% on CIFAR-100, 0.5\% on CIFAR-10, and 1.5\% on STL-10. Similar results are achieved with Mixup. We further show that gains hold for robustness to common input corruptions and perturbations at varying severities with a 0.7\% improvement on CIFAR-100-C, by combining StackMix with AugMix over AugMix. On its own, improvements with StackMix hold across different number of labeled samples on CIFAR-100, maintaining approximately a 2\% gap in test accuracy --down to using only 5\% of the whole dataset-- and is effective in the semi-supervised setting with a 2\% improvement with the standard benchmark $\Pi$-model. Finally, we perform an extensive ablation study to better understand the proposed methodology.
Marius Hobbhahn (Max Planck Institute for Intelligent Systems, Max-Planck Institute); Agustinus Kristiadi (University of Tuebingen); Philipp Hennig (University of Tuebingen)
In Bayesian Deep Learning, distributions over the output of classification neural networks are approximated by first constructing a Gaussian distribution over the weights, then sampling from it to receive a distribution over the softmax outputs. This is costly. We reconsider old work (Laplace Bridge) to construct a Dirichlet approximation of this softmax output distribution, which yields an analytic map between Gaussian distributions in logit space and Dirichlet distributions (the conjugate prior to the Categorical distribution) in the output space. Importantly, the vanilla Laplace Bridge comes with certain limitations. We analyze those and suggest a simple solution that compares favorably to other commonly used estimates of the softmax-Gaussian integral. We demonstrate that the resulting Dirichlet distribution has multiple advantages, in particular, more efficient computation of the uncertainty estimate and scaling to large datasets and networks like ImageNet and DenseNet. We further demonstrate the usefulness of this Dirichlet approximation by using it to construct a lightweight uncertainty-aware output ranking for ImageNet.
Hisayoshi Nanmo (Yokohama National University); Manabu Kuroki (Yokohama National University)
This paper assumes that cause-effect relationships among variables can be described with a linear structural equation model. Then, a situation is considered where a set of observed covariates satisfies the back-door criterion but the ordinary least squares method cannot be applied to estimate linear causal effects because of multicollinearity/high-dimensional data problems. In this situation, we propose a novel regression approach, the ``partially adaptive Lp-regularized multiple regression analysis'' (PALpMA) method for estimating the total effects. Different from standard regularized regression analysis, PALpMA provides a consistent or less-biased estimator of the linear causal effect. PALpMA is also applicable to evaluating direct effects through the single-door criterion. Given space constraints, the proofs, some numerical experiments, and an industrial case study on setting up painting conditions of car bodies are provided in the Supplementary Material.
Varun R. Embar (University of California, Santa Cruz); Sriram Srinivasan (University of California, Santa Cruz); Lise Getoor (University of Maryland College Park)
Templated graphical models (TGMs) encode model structure using rules that capture recurring relationships between multiple random variables. While the rules in TGMs are interpretable, it is not clear how they can be used to generate explanations for the individual predictions of the model. Further, learning these rules from data comes with high computational costs: it typically requires an expensive combinatorial search over the space of rules and repeated optimization over rule weights. In this work, we propose a new structure learning algorithm, Explainable Structured Model Search (ESMS), that learns a templated graphical model and an explanation framework for its predictions. ESMS uses a novel search procedure to efficiently search the space of models and discover models that trade-off predictive accuracy and explainability. We introduce the notion of relational stability and prove that our proposed explanation framework is stable. Further, our proposed piecewise pseudolikelihood (PPLL) objective does not require re-optimizing the rule weights across models during each iteration of the search. In our empirical evaluation on three realworld datasets, we show that our proposed approach not only discovers models that are explainable, but also significantly outperforms existing state-out-the-art structure learning approaches.
Harald Leisenberger (Graz University of Technology); Franz Pernkopf (Graz University of Technology); Christian Knoll (Graz University of Technology)
Belief propagation is an iterative method for reasoning in probabilistic graphical models. Its well-known relationship to a classical concept from statistical physics, the Bethe free energy, puts it on a solid theoretical foundation. If belief propagation fails to approximate the marginals, then this is often due to a failure of the Bethe approximation. In this work, we show how modifications in a graphical model can be a great remedy for fixing the Bethe approximation. Specifically, we analyze how the removal of edges influences and improves belief propagation, and demonstrate that this positive effect is particularly distinct for dense graphs.
Chen Dun (Rice University); Cameron R. Wolfe (Rice University); Chris Jermaine (Rice University); Anastasios Kyrillidis (Rice University)
We propose ResIST, a novel distributed training protocol for Residual Networks (ResNets). ResIST randomly decomposes a global ResNet into several shallow sub-ResNets that are trained independently in a distributed manner for several local iterations, before having their updates synchronized and aggregated into the global model. In the next round, new sub-ResNets are randomly generated and the process repeats until convergence. By construction, per iteration, ResIST communicates only a small portion of network parameters to each machine and never uses the full model during training. Thus, ResIST reduces the per-iteration communication, memory, and time requirements of ResNet training to only a fraction of the requirements of full-model training. In comparison to common protocols, like data-parallel training and data-parallel training with local SGD, ResIST yields a decrease in communication and compute requirements, while being competitive with respect to model performance.
Vladislav Polianskii (KTH Royal Institute of Technology); Giovanni Luca Marchetti (KTH Royal Institute of Technology); Alexander Kravberg (KTH Royal Institute of Technology); Anastasiia Varava (KTH Royal Institute of Technology); Florian T. Pokorny (KTH Royal Institute of Technology); Danica Kragic (KTH Royal Institute of Technology)
The Voronoi Density Estimator (VDE) is an established density estimation technique that adapts to the local geometry of data. However, its applicability has been so far limited to problems in two and three dimensions. This is because Voronoi cells rapidly increase in complexity as dimensions grow, making the necessary explicit computations infeasible. We define a variant of the VDE deemed Compactified Voronoi Density Estimator (CVDE), suitable for higher dimensions. We propose computationally efficient algorithms for numerical approximation of the CVDE and formally prove convergence of the estimated density to the original one. We implement and empirically validate the CVDE through a comparison with the Kernel Density Estimator (KDE). Our results indicate that the CVDE outperforms the KDE on sound and image data.
Tineke Blom (University of Amsterdam); Joris Mooij (University of Amsterdam)
Mathematical models of the real world are simplified representations of complex systems. A caveat to using mathematical models is that predicted causal effects and conditional independences may not be robust under model extensions, limiting applicability of such models. In this work, we consider conditions under which qualitative model predictions are preserved when two models are combined. Under mild assumptions, we show how to use the technique of causal ordering to efficiently assess the robustness of qualitative model predictions. We also characterize a large class of model extensions that preserve qualitative model predictions. For dynamical systems at equilibrium, we demonstrate how novel insights help to select appropriate model extensions and to reason about the presence of feedback loops. We apply our ideas to a viral infection model with immune responses.
Or Dinari (Ben Gurion University of the Negev); Oren Freifeld (Ben-Gurion University)
One popular deep-learning approach for the task of Out-Of-Distribution (OOD) detection is based on thresholding the values of per-class Gaussian likelihood of deep features. However, two issues arise with that approach: first, the distributions are often far from being Gaussian; second, many OOD data points fall within the effective support of the known classes??Gaussians. Thus, either way it is hard to find a good threshold. In contrast, our proposed solution for OOD detection is based on a new latent space where: 1) each known class is well captured by a nearly-isotropic Gaussian; 2) those Gaussians are far from each other and from the origin of the space (together, these properties effectively leave the area around the origin free for OOD data). Concretely, given a (possibly-trained) backbone deep net of choice, we use it to train a conditional variational model via a Kullback Leibler loss, a triplet loss, and a new distancing loss that pushes classes away from each other. During inference, the class-dependent log-likelihood values of a deep feature ensemble of the test point are also weighted based on reconstruction errors, improving further the decision rule. Experiments on popular benchmarks show our method yields state-of-the-art results, a feat achieved despite the fact that, unlike some competitors, we make no use of OOD data for training or hyperparameter tuning.
Paidamoyo Chapfuwa (Stanford University); Sherri Rose (Stanford University); Lawrence Carin (Duke University); Edward Meeds (VU University Toronto); Ricardo Henao (Duke University)
End-to-end learning of dynamical systems with black-box models, such as neural ordinary differential equations (ODEs), provides a flexible framework for learning dynamics from data without prescribing a mathematical model for the dynamics. Unfortunately, this flexibility comes at the cost of understanding the dynamical system, for which ODEs are used ubiquitously. Further, experimental data are collected under various conditions (inputs), such as treatments, or grouped in some way, such as part of sub-populations. Understanding the effects of these system inputs on system outputs is crucial to have any meaningful model of a dynamical system. To that end, we propose a structured latent ODE model that explicitly captures \emph{system input} variations within its latent representation. Building on a static latent variable specification, our model learns (independent) stochastic factors of variation for each input to the system, thus separating the effects of the system inputs in the latent space. This approach provides actionable modeling through the \emph{controlled generation} of time-series data for novel input combinations (or perturbations). Additionally, we propose a flexible approach for quantifying uncertainties, leveraging a quantile regression formulation. Experimental results on challenging biological datasets show consistent improvements over competitive baselines in the controlled generation of observational data and prediction of biologically meaningful system inputs.
Chenghan Zhou (University of Virginia, Charlottesville); Andrew Spivey (University of Oregon); Haifeng Xu (University of Virginia); Thanh Hong Nguyen (University of Oregon)
This paper studies the problem of information design in a general security game setting in which multiple independent self-interested defenders attempt to provide protection simultaneously on the same set of important targets against an unknown attacker. A principal, who can be one of the defenders, has access to certain private information (i.e., attacker type) whereas other defenders do not. We investigate the question of how that principal, with additional private information, can influence the decisions of the defenders by partially and strategically revealing her information. We focus on the algorithmic study of information design for private signaling in this game setting. In particular, we develop a polynomial-time ellipsoid algorithm to compute an optimal private signaling scheme. Our key finding is that the separation oracle in the ellipsoid approach can be carefully reduced to bipartite matching. Furthermore, we introduce a compact representation of any ex-ante persuasive signaling schemes by exploiting intrinsic security resource allocation structures, enabling us to compute an optimal scheme significantly faster. Our experiment results show that by strategically revealing private information, the principal can significantly enhance the protection effectiveness on the targets.
Tuan Nguyen (Monash University); Van Nguyen (Monash University); Trung Le (Monash University); He Zhao (Monash University); Quan Hung Tran (Adobe Systems); Dinh Phung (Monash University)
Unsupervised domain adaptation (UDA) aims to transfer knowledge from a model trained on a labeled source domain to an unlabeled target domain. To this end, we propose in this paper a novel cycle class-consistent model based on optimal transport (OT) and knowledge distillation. The model consists of two agents, a teacher and a student cooperatively working in a cycle process under the guidance of the distributional optimal transport and distillation manner. The OT distance is designed to bridge the gap between the distribution of the target data and a distribution over the source class-conditional distributions. The optimal probability matrix then provides pseudo labels to learn a teacher that achieves a good classification performance on the target domain. Knowledge distillation is performed in the next step in which the teacher distills and transfers its knowledge to the student. And finally, the student produces its prediction for the optimal transport step. This process forms a closed cycle in which the teacher and student networks are simultaneously trained to conduct transfer learning from the source to the target domain. Extensive experiments show that our proposed method outperforms existing methods, especially the class-aware and OT-based ones on benchmark datasets including Office-31, Office-Home, and ImageCLEF-DA.
Gaoyuan Zhang (International Business Machines); Songtao Lu (IBM Thomas J. Watson Research Center); Yihua Zhang (Michigan State University); Xiangyi Chen (University of Minnesota, Minneapolis); Pin-Yu Chen (International Business Machines); Quanfu Fan (MIT-IBM Watson AI Lab); Lee Martie (International Business Machines); Lior Horesh (International Business Machines); Mingyi Hong (Iowa State University); Sijia Liu (Michigan State University)
Current deep neural networks (DNNs) are vulnerable to adversarial attacks, where adversarial perturbations to the inputs can change or manipulate classification. To defend against such attacks, an effective and popular approach, known as adversarial training (AT), has been shown to mitigate the negative impact of adversarial attacks by virtue of a min-max robust training method. While effective, it remains unclear whether it can successfully be adapted to the distributed learning context. The power of distributed optimization over multiple machines enables us to scale up robust training over large models and datasets. Spurred by that, we propose distributed adversarial training (DAT), a large-batch adversarial training framework implemented over multiple machines. We show that DAT is general, which supports training over labeled and unlabeled data, multiple types of attack generation methods, and gradient compression operations favored for distributed optimization. Theoretically, we provide, under standard conditions in the optimization theory, the convergence rate of DAT to the first-order stationary points in general non-convex settings. Empirically, we demonstrate that DAT either matches or outperforms state-of-the-art robust accuracies and achieves a graceful training speedup (e.g., on ResNet-50 under ImageNet).
Xueying Zhan (City University of Hong Kong); Yaowei Wang (Sun Yat-sen University); Antoni B. Chan (City University of Hong Kong)
Active Learning (AL) aims to optimize basic learned model(s) iteratively by selecting and annotating unlabeled data samples that are deemed to best maximise the model performance with minimal required data. However, the learned model is easy to overfit due to the biased distribution (sampling bias and dataset shift) formed by nonuniform sampling used in AL. Considering AL as an iterative sequential optimization process, we first provide a perspective on AL in terms of statistical properties, i.e., asymptotic unbiasedness, consistency and asymptotic efficiency, with respect to basic estimators when the sample size (size of labeled set) becomes large, and in the limit as sample size tends to infinity. We then discuss how biases affect AL. Finally, we proposed a flexible AL framework that aims to mitigate the impact of bias in AL by minimizing generalization error and importance-weighted training loss simultaneously.
Difeng Cai (Emory University); Yuliang Ji (Emory University); Huan He (Emory University); Qiang Ye (University of Kentucky); Yuanzhe Xi (Emory University)
"Nonlinear monotone transformations are used extensively in normalizing flows to construct invertible triangular mappings from simple distributions to complex ones. In existing literature, monotonicity is usually enforced by restricting function classes or model parameters and the inverse transformation is often approximated by root-finding algorithms as a closed-form inverse is unavailable. In this paper, we introduce a new integral-based approach termed ""Atomic Unrestricted Time Machine (AUTM)"", equipped with unrestricted integrands and easy-to-compute explicit inverse. AUTM offers a versatile and efficient way to the design of normalizing flows with explicit inverse and unrestricted function classes or parameters. Theoretically, we present a constructive proof that AUTM is universal: all monotonic normalizing flows can be viewed as limits of AUTM flows. We provide a concrete example to show how to approximate any given monotonic normalizing flow using AUTM flows with guaranteed convergence. Our result implies that AUTM can be used to transform an existing flow into a new one equipped with explicit inverse and unrestricted parameters. The performance of the new approach is evaluated on high dimensional density estimation, variational inference and image generation."
Mingyang Yi (University of Chinese Academy of Sciences)
Batch normalization (BN) has become a critical component across diverse deep neural networks. The network with BN is invariant to positively linear re-scaling of weights, which makes there exist infinite functionally equivalent networks with different scales of weights. However, optimizing these equivalent networks with the first-order method such as stochastic gradient descent will converge to different local optima owing to different gradients across training. To alleviate this, we propose a quotient manifold PSI manifold, in which all the equivalent weights of the network with BN are regarded as the same one element. We then construct gradient descent and stochastic gradient descent on the proposed PSI manifold. The two algorithms guarantee that every group of equivalent weights (caused by positively rescaling) converge to the equivalent optima. Besides that, we characterize the convergence rate of the proposed algorithms on PSI manifold and justify that they accelerate training compared with the algorithms on the Euclidean weight space. Empirical results show that our algorithms can consistently achieve better performances in the sense of both convergence rate and generalization ability over various experimental settings.
Naiqi Li (Tsinghua University); Wenjie Li (Tsinghua University); Yong Jiang (Graduate School at Shenzhen, Tsinghua University); Shu-Tao Xia (Graduate School at Shenzhen, Tsinghua University)
In this paper we propose the deep Dirichlet process mixture (DDPM) model, which is an unsupervised method that simultaneously performs clustering and feature learning. The traditional Dirichlet process mixture model can infer the number of mixture components, but its capacity is restricted since the clustering is performed in the raw feature space. Our method alleviates this limitation by using the flow-based deep neural network to learn more expressive features. DDPM unifies Dirichlet processes and the flow-based model with Monte Carlo expectation-maximization, and uses Gibbs sampling to sample from the posterior. This combination allows our method to exploit the mutually beneficial relation between clustering and feature learning. The effectiveness of DDPM is demonstrated by thorough experiments in various synthetic and real-world datasets.
This paper focuses on the Matrix Factorization based Clustering (MFC) method which is one of the few closed-form algorithms for the subspace clustering algorithm. Despite being simple, closed-form, and computation-efficient, MFC can outperform the other sophisticated subspace clustering methods in many challenging scenarios. We reveal the connection between MFC and the Innovation Pursuit (iPursuit) algorithm which was shown to be able to outperform the other spectral clustering based methods with a notable margin especially when the span of clusters are close. A novel theoretical study is presented which sheds light on the key performance factors of both algorithms (MFC/iPursuit) and it is shown that both algorithms can be robust to notable intersections between the span of clusters. Importantly, in contrast to the theoretical guarantees of other algorithms which emphasized on the distance between the subspaces as the key performance factor and without making the innovation assumption, it is shown that the performance of MFC/iPursuit mainly depends on the distance between the innovative components of the clusters.
Zhongxiang Dai (National University of Singapore); Yizhou Chen (National University of Singapore); Haibin Yu (Tencent Data Platform); Bryan Kian Hsiang Low (National University of Singapore); Patrick Jaillet (Massachusetts Institute of Technology)
Bayesian optimization (BO) has become popular for sequential optimization of black-box functions. When BO is used to optimize a target function, we often have access to previous evaluations of potentially related functions. This begs the question as to whether we can leverage these previous experiences to accelerate the current BO task through meta-learning (meta-BO), while ensuring robustness against potentially harmful dissimilar tasks that could sabotage the convergence of BO. This paper introduces two scalable and provably robust meta-BO algorithms: robust meta-Gaussian process-upper confidence bound (RM-GP-UCB) and RM-GP-Thompson sampling (RM-GP-TS). We prove that both algorithms are asymptotically no-regret even when some or all previous tasks are dissimilar to the current task, and show that RM-GP-UCB enjoys a better theoretical robustness than RM-GP-TS. We also exploit the theoretical guarantees to optimize the weights assigned to individual previous tasks through regret minimization via online learning, which diminishes the impact of dissimilar tasks and hence further enhances the robustness. Empirical evaluations show that (a) RM-GP-UCB performs effectively and consistently across various applications, and (b) RM-GP-TS, despite being less robust than RM-GP-UCB both in theory and in practice, performs competitively in some scenarios with less dissimilar tasks and is more computationally efficient.
Antti Hyttinen (University of Helsinki, Helsinki Institute for Information Technology); Vitoria Barin Pacela (Montreal Institute for Learning Algorithms, University of Montreal); Aapo Hyvarinen (University of Helsinki)
We consider independent component analysis of binary data. While fundamental in practice, this case has been much less developed than ICA for continuous data. We start by assuming a linear mixing model in a continuous-valued latent space, followed by a binary observation model. Importantly, we assume that the sources are non-stationary; this is necessary since any non-Gaussianity would essentially be destroyed by the binarization. Interestingly, the model allows for closed-form likelihood by employing the cumulative distribution function of the multivariate Gaussian distribution. In stark contrast to the continuous-valued case, we prove non-identifiability of the model with few observed variables; our empirical results imply identifiability when the number of observed variables is higher. We present a practical method for binary ICA that uses only pairwise marginals, which are faster to compute than the full multivariate likelihood. Experiments give insight into the requirements for the number of observed variables, segments and latent sources, that allow for the model to be estimated.
Punit Pankaj Dubey (IIT Mandi); Bhisham Dev Verma (Indian Institute of Technology Mandi); Rameshwar Pratap (IIT Mandi); Keegan Kang (Singapore University of Technology and Design)
Computing the angular similarity between pairs of vectors is a core part of various machine learning algorithms. The seminal work due to Charikar~\citep{simhash} (\textit{a.k.a.} Sign-Random-Projection (SRP) or SimHash) provides an unbiased estimate for the same. However, SRP suffers from the following limitations: (i) large variance in the similarity estimation, (ii) and high running time while computing the sketch. There are improved variants that address these limitations. However, they are known to improve on only one aspect in their proposal, for \textit{e.g.}~\citep{CBE} suggest a faster algorithm, ~\citep{superbit, MLE} provide estimates with a smaller variance. In this work, we propose a sketching algorithm that addresses both aspects in one algorithm -- a faster algorithm along with a smaller variance in the similarity estimation. Moreover, our algorithm is space-efficient as well. We present a rigorous theoretical analysis of our proposal and complement it via experiments on synthetic and real-world datasets.
Huanhuan Yang (National University of Defense Technology); Dianxi Shi; Guojun Xie; Yingxuan Peng (National University of Defense Technology); Yi Zhang; Yantai Yang; Shaowu Yang (National University of Defense Technology)
Learning from raw, pixel images is quite important for the real-world application of deep reinforcement learning (RL). Standard model-free RL algorithms aim to learn policies directly from high-dimensional observations and simultaneously solve the representation learning and policy learning problems in an end-to-end training process. However, learning directly from images is sample-inefficiency and sensitive to hyper-parameters in practice under the single supervision of model-free reward signals. In this work, we present \textbf{S}elf-\textbf{S}upervision \textbf{R}epresentations (S2R) for multi-view reinforcement learning, a sample-efficient representation learning method for learning policies from high-dimensional images when combined with the RL algorithm. S2R introduces a representation learning framework and defines an auxiliary objective based on the multi-view image observations and the Conditional Entropy Bottleneck (CEB) principle to learn efficient representations which can discard task-irrelevant information and preserve task-relevant information. We demonstrate the effectiveness of S2R in the visual DeepMind Control (DMControl) Suite and show its better performance on the default DMControl tasks and their variants by replacing the task's background with a random image or natural video.
Federico Tomasi (Spotify); Mounia Lalmas (Spotify); Zhenwen Dai (Spotify)
Dynamic topic modeling is a well established tool for capturing the temporal dynamics of the topics of a corpus. Currently, dynamic topic models can only consider a small set of frequent words because of their computational complexity and insufficient data for less frequent words. In this work, we develop a scalable dynamic topic model by utilising the correlation among the words in the vocabulary. By correlating previously independent temporal processes for words, our new model allows us to reliably estimate the topic representations containing less frequent words. We develop an amortised variational inference method with self-normalised importance sampling approximation to the word distribution that dramatically reduces the computational complexity and the number of variational parameters in order to handle large vocabularies. With extensive experiments on text datasets, we show that our method significantly outperforms the previous works by modeling word correlations, and it is able to handle real world data with a large vocabulary (80K words) which could not be processed by previous continuous dynamic topic models. With qualitative analyses, we show that our method can perform inference on infrequent but representative keywords much more reliably than previous methods.
Xutong Liu (The Chinese University of Hong Kong); Haoru Zhao (Shanghai Jiaotong University); Tong Yu (Adobe Research); Shuai Li (John Hopcroft Center, Shanghai Jiao Tong University); John Lui (Chinese University of Hong Kong)
Contextual multi-armed bandit (MAB) is an important sequential decision-making problem in recommendation systems. A line of works, called the clustering of bandits (CLUB), utilizes the collaborative effect over users and dramatically improves the recommendation quality. Owing to the increasing scale of application and public concerns of privacy, there is a growing demand to keep user data decentralized and push bandit learning to the local server side. Existing CLUB algorithms, however, are designed under the centralized setting where data are available at a central server. We focus on the study of the federated clustering of bandit (FCLUB) problem, whose task is to minimize the total regret meanwhile satisfying privacy and communication considerations. For this problem, we design a new phase-based scheme for cluster detection and a novel asynchronous communication protocol for cooperative bandit learning. To protect users' privacy, previous differential privacy (DP) definitions are not very suitable and we propose a new DP notion which acts on the user cluster level. We provide rigorous proofs to show that our algorithm simultaneously achieves (clustered) DP, sublinear communication complexity and sublinear regret. Finally, experimental evaluations show our superior performance compared with benchmark algorithms.
Yichen Yang (Huazhong University of Science and Technology); Xiaosen Wang (Huazhong University of Science and Technology); Kun He (Huazhong University of Sceince and Technology)
We attribute the vulnerability of natural language processing models to the fact that similar inputs are converted to dissimilar representations in the embedding space, leading to inconsistent outputs, and propose a novel robust training method, termed Fast Triplet Metric Learning (FTML). Specifically, we argue that the original sample should have similar representation with its adversarial counterparts and distinguish its representation from other samples for better robustness. To this end, we adopt the triplet metric learning into the standard training to pull the words closer to their positive samples (i.e., synonyms) and push away their negative samples (i.e., non-synonyms) in the embedding space. Extensive experiments demonstrate that FTML can significantly promote the model robustness against various advanced adversarial attacks while keeping competitive classification accuracy on original samples. Besides, our method is efficient as it only needs to adjust the embedding and introduces very little overhead on the standard training. Our work shows the great potential of improving the textual robustness through robust word embedding.
Xiaosen Wang (Huazhong University of Science and Technology); Yifeng Xiong (Huazhong University of Science and Technology); Kun He (Huazhong University of Sceince and Technology)
A line of work has shown that natural text processing models are vulnerable to adversarial examples. Correspondingly, various defense methods are proposed to mitigate the threat of textual adversarial examples, e.g. adversarial training, input transformations, detection, etc. In this work, we treat the optimization process for synonym substitution based textual adversarial attacks as a specific sequence of word replacement, in which each word mutually influences other words. We identify that we could destroy such mutual interaction and eliminate the adversarial perturbation by randomly substituting a word with its synonyms. Based on this observation, we propose a novel textual adversarial example detection method, termed Randomized Substitution and Vote (RS&V), which votes the prediction label by accumulating the logits of k samples generated by randomly substituting the words in the input text with synonyms. The proposed RS&V is generally applicable to any existing neural networks without modification on the architecture or extra training, and it is orthogonal to prior work on making the classification network itself more robust. Empirical evaluations on three benchmark datasets demonstrate that our RS&V could detect the textual adversarial examples more successfully than the existing detection methods while maintaining the high classification accuracy on benign samples.
Lucas Maystre (Spotify); Tiffany Wu (Spotify); Roberto Sanchis Ojeda (Spotify); Tony Jebara (Columbia University)
Driven by applications in clinical medicine and business, we address the problem of modeling trajectories over multiple states. We build on well-known methods from survival analysis and introduce a family of sequence models based on localized Bayesian Markov chains. We develop inference and prediction algorithms, and we apply the model to real-world data, demonstrating favorable empirical results. Our approach provides a practical and effective alternative to plain Markov chains and to existing (finite) mixture models; It retains the simplicity and computational benefits of the former while matching or exceeding the predictive performance of the latter.
Mathieu Roget (ENS Lyon); Giuseppe Di Molfetta (Aix Marseille Univ); Hachem Kadri (Aix Marseille Univ)
Quantum machine learning algorithms could provide significant speed-ups over their classical counterparts; however, whether they could also achieve good generalization remains unclear. Recently, two quantum perceptron models which give a quadratic improvement over the classical perceptron algorithm using Grover?? search have been proposed by Wiebe et al. [2016]. While the first model reduces the complexity with respect to the size of the training set, the second one improves the bound on the number of mistakes made by the perceptron. In this paper, we introduce a hybrid quantum-classical perceptron algorithm with lower complexity and better generalization ability than the classical perceptron. We show a quadratic improvement over the classical perceptron in both the number of samples and the margin of the data. We derive a bound on the expected error of the hypothesis returned by our algorithm, which compares favorably to the one obtained with the classical online perceptron. We use numerical experiments to illustrate the trade-off between computational complexity and statistical accuracy in quantum perceptron learning and discuss some of the key practical issues surrounding the implementation of quantum perceptron models into near-term quantum devices, whose practical implementation represents a serious challenge due to inherent noise. However, the potential benefits make correcting this worthwhile.
Epistemic voting interprets votes as noisy signals about a ground truth. We consider contexts where the truth consists of a set of objective winners, knowing a lower and upper bound on its cardinality. A prototypical problem for this setting is the aggregation of multi-label annotations with prior knowledge on the size of the ground truth. We posit noise models, for which we define rules that output an optimal set of winners. We report on experiments on multi-label annotations (which we collected).
Zihao Li (Nanyang Technological University); Xiaohui Bei (Nanyang Technological University); Zhenzhen Yan (Nanyang Technological University)
We study a fair resource allocation problem with indivisible items. The agents' preferences over items are assumed to be ordinal and have uncertainties. We adopt stochastic dominance proportionality as our fairness notion and study a sequence of problems related to finding allocations that are fair with a high probability. We provide complexity analysis for each problem and efficient algorithms for some problems. Finally, we propose several heuristic algorithms to find an allocation that is fair with the highest probability. We thoroughly evaluate the performance of the algorithms on both synthetic and real datasets.
Pashupati Rajaram Hegde (Aalto University); Cagatay Yildiz (Aalto University); Harri Lahdesmaki (Aalto University); Samuel Kaski (University of Manchester); Markus Heinonen (Aalto University)
Recent machine learning advances have proposed black-box estimation of $\textit{unknown continuous-time system dynamics}$ directly from data. However, earlier works are based on approximative solutions or point estimates. We propose a novel Bayesian nonparametric model that uses Gaussian processes to infer posteriors of unknown ODE systems directly from data. We derive sparse variational inference with decoupled functional sampling to represent vector field posteriors. We also introduce a probabilistic shooting augmentation to enable efficient inference from arbitrarily long trajectories. The method demonstrates the benefit of computing vector field posteriors, with predictive uncertainty scores outperforming alternative methods on multiple ODE learning tasks.
Xinyuan Chen (Universiti Kuala Lumpur); Zhongmei Zhou (Minnan Normal University); Meichun Gao (Fuzhou Technology and Business University); Daya Shi; Mohd Nizam Husen (Malaysian Institute of Information Technology)
Knowledge models endeavor to improve representation and feature extraction capabilities while keeping low computational cost. Firstly, existing embedding models in hypercomplex spaces of non-Abelian group are optimized. Then a method for fast quaternion multiplication is proposed with proof, with which path semantics are computed and further integrated with the attention mechanism based on the idea semantic extraction of relation sequences could be regarded as a multiple rotational blending problem. A depth-wise atrous circular convolution framework is set up for better feature extraction. Experiments including Link Prediction and Path Query are conducted on benchmark datasets verifying our model holds advantages over state-of-the-art models like Rotate3D. Moreover, the model is tested on a biomedical dataset simulating real-world applications. An ablation study is also performed to explore the effectiveness of different components.
Mao Ye (University of Texas, Austin); Ruichen Jiang (University of Texas at Austin); Haoxiang Wang (University of Illinois, Urbana Champaign); Dhruv Choudhary (Facebook); Xiaocong Du (Arizona State University); Bhargav Bhushanam (Facebook); Aryan Mokhtari (University of Texas, Austin); Arun Kejariwal (Facebook); Qiang Liu (University of Texas, Austin)
One of the key challenges of learning an online recommendation model is the temporal domain shift, which causes the mismatch between the training and testing data distribution and hence domain generalization error. To overcome, we propose to learn a meta future gradient generator that forecasts the gradient information of the future data distribution for training so that the recommendation model can be trained as if we were able to look ahead at the future of its deployment. Compared with Batch Update, a widely used paradigm, our theory suggests that the proposed algorithm achieves smaller temporal domain generalization error measured by a gradient variation term in a local regret. We demonstrate the empirical advantage by comparing with various representative baselines.
Christophe Denis (University of Paris-Est); Charlotte Dion-Blanc (Sorbonne University); Laure Sansonnet (UCLouvain)
We investigate the multiclass classification problem where the features are event sequences. More precisely, the data are assumed to be generated by a mixture of simple linear Hawkes processes. In this new setting, the classes are discriminated by various triggering kernels. A challenge is then to build an efficient classification procedure. We derive the optimal Bayes rule and provide a two-step estimation procedure of the Bayes classifier. In the first step, the weights of the mixture are estimated; in the second step, an empirical risk minimization procedure is performed to estimate the parameters of the Hawkes processes. We establish the consistency of the resulting procedure and derive rates of convergence. Finally, the numerical properties of the data-driven algorithm are illustrated through a simulation study where the triggering kernels are assumed to belong to the popular parametric exponential family. It highlights the accuracy and the robustness of the proposed algorithm. In particular, even if the underlying kernels are misspecified, the procedure exhibits good performance.
Anthony Sicilia (University of Pittsburgh); Katherine Atwell (University of Pittsburgh); Malihe Alikhani (University of Pittsburgh); Seong Jae Hwang (Yonsei University)
Multiclass neural networks are a common tool in modern unsupervised domain adaptation, yet an appropriate theoretical description for their non-uniform sample complexity is lacking in the adaptation literature. To fill this gap, we propose the first PAC-Bayesian adaptation bounds for multiclass learners. We facilitate practical use of our bounds by also proposing the first approximation techniques for the multiclass distribution divergences we consider. For divergences dependent on a Gibbs predictor, we propose additional PAC-Bayesian adaptation bounds which remove the need for inefficient Monte-Carlo estimation. Empirically, we test the efficacy of our proposed approximation techniques as well as some novel design-concepts which we include in our bounds. Finally, we apply our bounds to analyze a common adaptation algorithm that uses neural networks.
Shantanu Das (International Institute of Information Technology, Hyderabad); Swapnil Dhamal (Telecom SudParis); Ganesh Ghalme (Technion, Technion); Shweta Jain (IIT Ropar); Sujit Gujar (IIIT Hyderabad)
We study fairness in the context of feature-based price discrimination in monopoly markets. We propose a new notion of individual fairness, namely, \alpha-fairness, which guarantees that individuals with similar features face similar prices. First, we study discrete valuation space and give an analytical solution for optimal fair feature-based pricing. We show that the cost of fair pricing is defined as the ratio of expected revenue in an optimal feature-based pricing to the expected revenue in an optimal fair feature-based pricing (CoF) can be arbitrarily large in general. When the revenue function is continuous and concave with respect to the prices, we show that one can achieve CoF strictly less than 2, irrespective of the model parameters. Finally, we provide an algorithm to compute fair feature-based pricing strategy that achieves this CoF.
Tianyi Liu (Georgia Institute of Technology); Weihao Gao (University of Illinois, Urbana Champaign); Zhirui Wang (Princeton University Plasma Physics Lab); Chong Wang (Princeton University)
"Sampling from a Boltzmann distribution to calculate important macro statistics is one of the central tasks in the study of large atomic and molecular systems. Recently, a one-shot configuration sampler, the Boltzmann generator (Noe et al., 2019), is introduced. Though a Boltzmann generator can directly generate independent metastable states, it lacks the ability to find transition pathways and describe the whole transition process. In this paper, we propose PathFlow that can function as a one-shot generator as well as a transition pathfinder. More specifically, a normalizing flow model is constructed to map the base distribution and linear interpolated path in the latent space to the Boltzmann distribution and a minimum (free) energy path in the configuration space simultaneously. PathFlow can be trained by standard gradient-based optimizers using the proposed gradient estimator with a theoretical guarantee. PathFlow, validated with the extensively studied examples including a synthetic M\""{u}ller potential and Alanine dipeptide, shows a remarkable performance. "
Machine learning models have recently shown promise in predicting molecular quantum chemical properties. However, the path to real-life adoption requires (1) learning under low-resource constraints and (2) out-of-distribution generalization to unseen, structurally diverse molecules. We observe that these two challenges can be addressed via abundant labels, which is often not the case in quantum chemistry. We hypothesize that pseudo-labeling on a vast array of unlabeled molecules can serve as gold-label proxies to expand the training labeled dataset significantly. The challenge in pseudo-labeling is to prevent the bad pseudo-labels from biasing the model. Motivated by the entropy minimization framework, we develop a simple and effective strategy Pseudo that can assign pseudo-labels, detect bad pseudo-labels through evidential uncertainty, and prevent them from biasing the model using adaptive weighting. Empirically, Pseudo improves quantum calculations accuracy in full data, low data, and out-of-distribution settings.
Anton Matsson (Chalmers University); Fredrik Daniel Johansson (Chalmers University of Technology)
Importance sampling (IS) is often used to perform off-policy evaluation but it is prone to several issues---especially when the behavior policy is unknown and must be estimated from data. Significant differences between target and behavior policies can result in uncertain value estimates due to, for example, high variance. Standard practices such as inspecting IS weights may be insufficient to diagnose such problems and determine for which type of inputs the policies differ in suggested actions and resulting values. To address this, we propose estimating the behavior policy for IS using prototype learning. The learned prototypes provide a condensed summary of the input-action space, which allows for describing differences between policies and assessing the support for evaluating a certain target policy. In addition, we can describe a value estimate in terms of prototypes to understand which parts of the target policy have the most impact on the estimate. We find that this provides new insights in the examination of a learned policy for sepsis management. Moreover, we study the bias resulting from restricting models to use prototypes, how bias propagates to IS weights and estimated values and how this varies with history length.
David Watson (University College London); Ricardo Silva (University College London)
Inferring causal relationships from observational data is rarely straightforward, but the problem is especially difficult in high dimensions. For these applications, causal discovery algorithms typically require parametric restrictions or extreme sparsity constraints. We relax these assumptions and focus on an important but more specialized problem, namely recovering a directed acyclic subgraph of variables known to be causally descended from some (possibly large) set of confounding covariates, i.e. a {\it confounder blanket}. This is useful in many settings, for example when studying a dynamic biomolecular subsystem with genetic data providing causally relevant background information. Under a structural assumption that, we argue, must be satisfied in practice if informative answers are to be found, our method accommodates graphs of low or high sparsity while maintaining polynomial time complexity. We derive a sound and complete algorithm for identifying causal relationships under these conditions and implement testing procedures with provable error control for linear and nonlinear systems. We demonstrate our approach on a range of simulation settings.
There has been an increasing interest in methods that exploit permutation reasoning to search for directed acyclic causal models, including the ``Ordering Search?? of Teyssier and Kohler and GSP of Solus, Wang and Uhler. We extend the methods of the latter by a permutation-based operation \textit{tuck}, and develop a class of algorithms, namely GRaSP, that are efficient and pointwise consistent under increasingly weaker assumptions than faithfulness. The most relaxed form of GRaSP outperforms many state-of-the-art causal search algorithms in simulation, allowing efficient and accurate search even for dense graphs and graphs with more than 100 variables We propose GRaSP, a hierarchy of causal search algorithms characterized by a permutation-based operation. GRaSP algorithms are correct under increasingly weaker version of faithfulness, and the final version outperforms existing algorithms.
Zhongjie Yu (TU Darmstadt); Fabrizio Ventola (TU Darmstadt); Nils Thoma (TU Darmstadt); Devendra Singh Dhami (CS Department, TU Darmstadt, TU Darmstadt); Martin Mundt (TU Darmstadt); Kristian Kersting (TU Darmstadt)
Recent developments have shown that modeling in the spectral domain improves the accuracy in time series forecasting. However, state-of-the-art neural spectral forecasters do not generally yield trustworthy predictions. In particular, they lack the means to gauge predictive likelihoods and provide uncertainty estimates. We propose predictive Whittle networks to bridge this gap, which exploit both the advances of neural forecasting in the spectral domain and leverage tractable likelihoods of probabilistic circuits. For this purpose, we propose a novel Whittle forecasting loss that makes use of these predictive likelihoods to guide the training of the neural forecasting component. We demonstrate how predictive Whittle networks improve real-world forecasting accuracy, while also allowing a transformation back into the time domain, in order to provide the necessary feedback of when the model's prediction may become erratic.
Adam Karczmarz (University of Warsaw); Tomasz Michalak (University of Warsaw); Anish Mukherjee (University of Warsaw); Piotr Sankowski (University of Warsaw); Piotr Wygocki (University of Warsaw)
The Shapley value -- a fundamental game-theoretic solution concept -- has recently become one of the main tools used to explain predictions of tree ensemble models. Another well-known game-theoretic solution concept is the Banzhaf value. Although the Banzhaf value is closely related to the Shapley value, its properties w.r.t. explaining predictions have not been understood equally well. This paper shows that, for tree ensemble models, the Banzhaf value offers some crucial advantages over the Shapley value while providing essentially the same explanations.
Andrea Agiollo (University of Bologna); Andrea Omicini (University of Bologna)
The success of Neural Networks (NNs) is tightly linked with their architectural design---a complex problem by itself. We here introduce a novel framework leveraging Graph Neural Networks to Generate Neural Networks (GNN2GNN) where powerful NN architectures can be learned out of a set of available architecture-performance pairs. GNN2GNN relies on a three-way adversarial training of GNN, to optimise a generator model capable of producing predictions about powerful NN architectures. Unlike Neural Architecture Search (NAS) techniques proposing efficient searching algorithms over a set of NN architectures, GNN2GNN relies on learning NN architectural design criteria. GNN2GNN learns to propose NN architectures in a single step -- i.e., training of the generator --, overcoming the recursive approach characterising NAS. Therefore, GNN2GNN avoids the expensive and inflexible search of efficient structures typical of NAS approaches. Extensive experiments over two state-of-the-art datasets prove the strength of our framework, showing that it can generate powerful architectures with high probability. Moreover, GNN2GNN outperforms possible counterparts for generating NN architectures, and shows flexibility against dataset quality degradation. Finally, GNN2GNN paves the way towards generalisation between datasets.
Rudrajit Das (University of Texas, Austin); Anish Acharya (University of Texas, Austin); Abolfazl Hashemi (Purdue University); sujay sanghavi (University of Texas, Austin); Inderjit S Dhillon (University of Texas, Austin); ufuk topcu (University of Texas, Austin)
We propose \texttt{FedGLOMO}, a novel federated learning (FL) algorithm with an iteration complexity of $\mathcal{O}(\epsilon^{-1.5})$ to converge to an $\epsilon$-stationary point (i.e., $\mathbb{E}[\|\nabla f(x)\|^2] \leq \epsilon$) for smooth non-convex functions -- under arbitrary client heterogeneity and compressed communication -- compared to the $\mathcal{O}(\epsilon^{-2})$ complexity of most prior works. Our key algorithmic idea that enables achieving this improved complexity is based on the observation that the convergence in FL is hampered by two sources of high variance: (i) the global server aggregation step with multiple local updates, exacerbated by client heterogeneity, and (ii) the noise of the local client-level stochastic gradients. The first issue is particularly detrimental to FL algorithms that perform plain averaging at the server. By modeling the server aggregation step as a generalized gradient-type update, we propose a variance-reducing momentum-based global update at the server, which when applied in conjunction with variance-reduced local updates at the clients, enables \texttt{FedGLOMO} to enjoy an improved convergence rate. Moreover, we derive our results under a novel and more realistic client heterogeneity assumption which we verify empirically -- unlike prior assumptions that are hard to verify. Our experiments illustrate the intrinsic variance reduction effect of \texttt{FedGLOMO}, which implicitly suppresses client-drift in heterogeneous data distribution settings and promotes communication efficiency.
Fan Ding (Purdue University); Yexiang Xue (Purdue University)
Inverse Reinforcement Learning (IRL) is a powerful way of learning from demonstrations. In this paper, we address IRL problems with the availability of prior knowledge that optimal policies will never violate certain constraints. Conventional approaches ignoring these constraints need many demonstrations to converge. We propose XOR-Maximum Entropy Constrained Inverse Reinforcement Learning (X-MEN), which is guaranteed to converge to the optimal policy in linear rate w.r.t. the number of learning iterations. X-MEN embeds XOR-sampling -- a provable sampling approach which transforms the #-P complete sampling problem into queries to NP oracles -- into the framework of maximum entropy IRL. X-MEN also guarantees the learned policy will never generate trajectories that violate constraints. Empirical results in navigation demonstrate that X-MEN converges faster to the optimal policies compared to baseline approaches and always generates trajectories that satisfy multi-state combinatorial constraints.
Lu Yin (Eindhoven University of Technology); Vlado Menkovski (Eindhoven University of Technology); Meng Fang (Eindhoven University of Technology); Tianjin Huang (Eindhoven University of Technology); Yulong Pei (Eindhoven University of Technology); Mykola Pechenizkiy (Eindhoven University of Technology); Decebal Constantin Mocanu (University of Twente); Shiwei Liu (Eindhoven University of Technology)
Recent works on sparse neural network training (sparse training) have shown that a compelling trade-off between performance and efficiency can be achieved by training intrinsically sparse neural networks from scratch. Existing sparse training methods usually strive to find the best sparse subnetwork possible in one single run, without involving any expensive dense or pre-training steps. For instance, dynamic sparse training (DST), as one of the most prominent directions, is capable of reaching a competitive performance of dense training by iteratively evolving the sparse topology during the course of training. In this paper, we argue that it is better to allocate the limited resources to create multiple low-loss sparse subnetworks and superpose them into a stronger one, instead of allocating all resources entirely to find an individual subnetwork. To achieve this, two desiderata are required: (1) efficiently producing many low-loss subnetworks, the so-called cheap tickets, within one training process limited to the standard training time used in dense training; (2) effectively superposing these cheap tickets into one stronger subnetwork without going over the constrained parameter budget. To corroborate our conjecture, we present a novel sparse training approach, termed Sup-tickets, which can satisfy the above two desiderata concurrently in a single sparse-to-sparse training process. Across various modern architectures on CIFAR-10/100 and ImageNet, we show that Sup-tickets integrates seamlessly with the existing sparse training methods and demonstrates consistent performance improvement.
Zhe Wang (University of Virginia); Jake Grigsby (University of Virginia); Arshdeep Sekhon (University of Virginia); Yanjun Qi (University of Virginia)
Optimization-based meta-learning typically assumes tasks are sampled from a single distribution - an assumption that oversimplifies and limits the diversity of tasks that meta-learning can model. Handling tasks from multiple distributions is challenging for meta-learning because it adds ambiguity to task identities. This paper proposes a novel method, ST-MAML, that empowers model-agnostic meta-learning (MAML) to learn from multiple task distributions. ST-MAML encodes tasks using a stochastic neural network module, that summarizes every task with a stochastic representation. The proposed Stochastic Task (ST) strategy learns a distribution of solutions for an ambiguous task and allows a meta-model to self-adapt to the current task. ST-MAML also propagates the task representation to enhance input variable encodings. Empirically, we demonstrate that ST-MAML outperforms the state-of-the-art on two few-shot image classification tasks, one curve regression benchmark, one image completion problem, and a real-world temperature prediction application.
Jackson A. Killian (Harvard University); Lily Xu (Harvard University); Arpita Biswas (Harvard University); Milind Tambe (Harvard University)
We introduce robustness in \textit{restless multi-armed bandits} (RMABs), a popular model for constrained resource allocation among independent stochastic processes (arms). Nearly all RMAB techniques assume stochastic dynamics are precisely known. However, in many real-world settings, dynamics are estimated with significant \textit{uncertainty}, e.g., via historical data, which can lead to bad outcomes if ignored. To address this, we develop an algorithm to compute minimax regret--robust policies for RMABs. Our approach uses a double oracle framework (oracles for \textit{agent} and \textit{nature}), which is often used for single-process robust planning but requires significant new techniques to accommodate the combinatorial nature of RMABs. Specifically, we design a deep reinforcement learning (RL) algorithm, DDLPO, which tackles the combinatorial challenge by learning an auxiliary ``$\lambda$-network'' in tandem with policy networks per arm, greatly reducing sample complexity, with guarantees on convergence. DDLPO, of general interest, implements our reward-maximizing agent oracle. We then tackle the challenging regret-maximizing nature oracle, a non-stationary RL challenge, by formulating it as a multi-agent RL problem between a policy optimizer and adversarial nature. This formulation is of general interest---we solve it for RMABs by creating a multi-agent extension of DDLPO with a shared critic. We show our approaches work well in three experimental domains.
Akram Erraqabi (University of Montreal); Marlos C. Machado (DeepMind); Mingde Zhao (McGill University); Sainbayar Sukhbaatar (Facebook); Alessandro Lazaric (Facebook); Ludovic Denoyer (Facebook); Yoshua Bengio (University of Montreal)
In reinforcement learning, the graph Laplacian has proved to be a valuable tool in the task-agnostic setting, with applications ranging from skill discovery to reward shaping. Recently, learning the Laplacian representation has been framed as the optimization of a temporally-contrastive objective to overcome its computational limitations in large (or continuous) state spaces. However, this approach requires uniform access to all states in the state space, overlooking the exploration problem that emerges during the representation learning process. In this work, we propose an alternative method that is able to recover, in a non-uniform-prior setting, the expressiveness and the desired properties of the Laplacian representation. We do so by combining the representation learning with a skill-based covering policy, which provides a better training distribution to extend and refine the representation. We also show that a simple augmentation of the representation objective with the learned temporal abstractions improves dynamics-awareness and helps exploration. We find that our method succeeds as an alternative to the Laplacian in the non-uniform setting and scales to challenging continuous control environments. Finally, even if our method is not optimized for skill discovery, the learned skills can successfully solve difficult continuous navigation tasks with sparse rewards, where standard skill discovery approaches are no so effective.
Collusion in a competitive multi-agent game occurs when two or more agents co-operate covertly to the disadvantage of others. Most competitive multi-agent games do not allow players to share information and explicitly prohibit collusion. In this paper, we present a novel way of detecting collusion using a domain-independent information-theoretic approach. Specifically, we show that the use of mutual information between actions of the agents provides a good indication of collusive behavior. Our experiments show that our method can detect varying levels of collusion in repeated simultaneous games like iterated Rock Paper Scissors. We further extend the detection to partially observable sequential games like poker and show the effectiveness of our methodology.
In recent years, the vulnerability of network attracts the attention of researchers. However, in these methods, the impact of video compression coding on the added adversarial perturbation, i.e., the robustness of video adversarial sample, is not considered. When an adversarial sample is just generated, its attack capability is the strongest, but with multiple video encoding and video decoding in the process of Internet transmission, the added adversarial disturbance will be continuously eliminated, eventually leading to the attack of the adversarial sample performance disappears. We define this phenomenon as the decay of the lifetime of adversarial examples. To resist this performance degradation, we propose an adversarial attack method based on optimized integer space. The robustness of anti-coding, the visual concealment and the attack success rate are all considered during the process of attack. In addition, we have also reduced the rounding loss caused by normalization in the deep neural network model process. The contributions of our methods are: 1) We show the performance degradation caused by video compression coding on existing video adversarial attack methods, which seems an effective way for detecting of defensing video adversarial examples. 2) A robust video adversarial attack method is proposed to resist video compression coding. The experiment shows that our method achieves better performance on the robustness of anti-coding, the visual concealment, and the attack success rate.
Qi Chen (University of Science and Technology of China); Kai Liu (University of Science and Technology of China); Ruilong Yao (University of Science and Technology of China); Hu Ding (University of Science and Technology of China, Tsinghua University)
Greedy selection is a widely used idea for solving many machine learning problems. But greedy selection algorithms often have high complexities and thus may be prohibitive for large-scale data. In this paper, we consider two fundamental optimization problems in machine learning: -center clustering and convex hull approximation, where they both can be solved via greedy selection. We propose sublinear time algorithms for them through combining the strategies of randomization and greedy selection. Our results are similar in spirit to the linear time stochastic greedy selection algorithms for submodular maximization [Mirzasoleiman et al., AAAI 2015, Hassidim and Singer, ICML 2017], but with several important differences. Our runtimes are independent of the number of input data items . In particular, our runtime for -center clustering significantly improves upon that of the uniform sampling approach [Huang et al, FOCS 2018], especially when the dimensionality is high. Moreover, our algorithms are particularly suitable for the scenario that we cannot directly access the whole input data (due to the reasons like privacy preserving, data storage and transmission) and can only take a small sample via an oracle each time. Our sublinear algorithms yield the improvement on the efficiency for various applications, such as data selection and compression, active learning, topic modeling, {\em etc}.
Sachit Menon (Columbia University); David Blei (Columbia University); Carl Vondrick (Columbia University)
Variational autoencoders (VAEs) suffer from posterior collapse, where the powerful neural networks used for modeling and inference optimize the objective without meaningfully using the latent representation. We introduce inference critics that detect and incentivize against posterior collapse by requiring correspondence between latent variables and the observations. By connecting the critic?? objective to the literature in self-supervised contrastive representation learning, we show both theoretically and empirically that optimizing inference critics increases the mutual information between observations and latents, mitigating posterior collapse. This approach is straightforward to implement and requires significantly less training time than prior methods, yet obtains competitive results on three established datasets. Overall, the approach lays the foundation to bridge the previously disconnected frameworks of contrastive learning and probabilistic modeling with variational autoencoders, underscoring the benefits both communities may find at their intersection.
Pratyush Maini (Carnegie Mellon University); Xinyun Chen (University of California Berkeley); Bo Li (University of Illinois, Urbana Champaign); Dawn Song (University of California Berkeley)
Recent works in adversarial robustness have proposed defenses to improve the robustness of a single model against the union of multiple perturbation types. However, these methods still suffer significant trade-offs compared to the ones specifically trained to be robust against a single perturbation type. In this work, we introduce the problem of categorizing adversarial examples based on their perturbation types. We first theoretically show on a toy task that adversarial examples of different perturbation types constitute different distributions --- making it possible to distinguish them. We support these arguments with experimental validation on multiple $\ell_p$ attacks and common corruptions. Instead of training a single predictor, we propose PROTECTOR, a two-stage pipeline that first categorizes the perturbation type of the input, and then makes the final prediction using the predictor specifically trained against the predicted perturbation type. We theoretically show that at test time the adversary faces a natural trade-off between fooling the perturbation classifier and the succeeding predictor optimized with perturbation-specific adversarial training. This makes it challenging for an adversary to plant strong attacks against the whole pipeline. Experiments on MNIST and CIFAR-10 show that PROTECTOR outperforms prior adversarial training-based defenses by over 5% when tested against the union of $\ell_1, \ell_2, \ell_\infty$ attacks. Our method naturally extends to a more diverse attack suite, also showing large robustness gains against multiple $\ell_p$, spatial and recolor attacks.
Learning physics models in the form of Partial Differential Equations (PDEs) is carried out through back-propagation to match the simulations of the physics model with experimental observations. Nevertheless, such matching involves computation over billions of elements, presenting a significant computational overhead. We notice many PDEs in real world problems are sparse and decomposable, where the temporal updates and the spatial features are sparsely concentrated on small interface regions. We propose Rapid-PDE, an algorithm to expedite the learning of sparse and decomposable PDEs. Our Rapid-PDE first uses random projection to compress the high dimensional sparse updates and features into low dimensional representations and then use these compressed signals during learning. Crucially, such a conversion is only carried out once prior to learning and the entire learning process is conducted in the compressed space. Theoretically, we derive a constant factor approximation between the projected loss function and the original one with logarithmic number of projected dimensions. Empirically, we demonstrate Rapid-PDE with data compressed to 0.05% of its original size learns similar models compared with uncompressed algorithms in learning a set of phase-field models which govern the spatial-temporal dynamics of nano-scale structures in metallic materials.
Xinwei Shen (The Hong Kong University of Science and Technology); Shengyu Zhu (Huawei Noah's Ark Lab); Jiji Zhang (Hong Kong Baptist University); Shoubo Hu (Huawei Technologies Ltd.); Zhitang Chen (Huawei Technologies Ltd.)
In a nonparametric setting, the causal structure is often identifiable only up to Markov equivalence, and for the purpose of causal inference it is useful to learn a graphical representation of the Markov equivalence class (MEC). In this paper, we revisit the Greedy Equivalence Search (GES) algorithm, which is widely cited as a score-based algorithm for learning the MEC of the underlying causal structure. We observe that in order to make the GES algorithm consistent in a nonparametric setting, it is not necessary to design a scoring metric that evaluates graphs. Instead, it suffices to plug in a consistent estimator of a measure of conditional dependence to guide the search. We therefore present a reframing of the GES algorithm, which is more flexible than the standard score-based version and readily lends itself to the nonparametric setting with a general measure of conditional dependence. In addition, we propose a neural conditional dependence (NCD) measure, which utilizes the expressive power of deep neural networks to characterize conditional independence in a nonparametric manner. We establish the optimality of the reframed GES algorithm under standard assumptions and the consistency of using our NCD estimator to decide conditional independence. Together these results justify the proposed approach. Experimental results demonstrate the effectiveness of our method in causal discovery, as well as the advantages of using our NCD measure over kernel-based measures.
We study policy optimization in an infinite horizon, $\gamma$-discounted constrained Markov decision process (CMDP). Our objective is to return a policy that achieves large expected reward with a small constraint violation. We consider the online setting with linear function approximation and assume global access to the corresponding features. We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms in terms of their primal and dual regret on online linear optimization problems. We instantiate this framework to use coin-betting algorithms and propose the Coin Betting Politex (CBP) algorithm. Assuming that the action-value functions are $\varepsilon_{\text{\tiny{b}}}$-close to the span of the $d$-dimensional state-action features and no sampling errors, we prove that $T$ iterations of CBP result in an $O\left(\frac{1}{(1 - \gamma)^3 \sqrt{T}} + \frac{\varepsilon_{\text{\tiny{b}}} \sqrt{d}}{(1 - \gamma)^2} \right)$ reward sub-optimality and an $O\left(\frac{1}{(1 - \gamma)^2 \sqrt{T}} + \frac{\varepsilon_{\text{\tiny{b}}} \sqrt{d}}{1 - \gamma} \right)$ constraint violation. Importantly, unlike gradient descent-ascent and other recent methods, CBP does not require extensive hyperparameter tuning. Via experiments on synthetic and Cartpole environments, we demonstrate the effectiveness and robustness of CBP.
Tongyi Tang (University of California, Davis); Krishna Balasubramanian (University of California, Davis); Thomas Chun Man Lee (University of California, Davis)
We develop and analyze robust Stochastic Frank-Wolfe type algorithms for projection-free stochastic convex optimization problems with heavy-tailed stochastic gradients. Existing works on the oracle complexity of such algorithms require a uniformly bounded variance assumption, and hold only in expectation. We develop tight high-probability bounds for robust versions of Stochastic Frank-Wolfe type algorithm under heavy-tailed assumptions, including infinite variance, on the stochastic gradient. Our methodological construction of the robust Stochastic Frank-Wolfe type algorithms leverage techniques from the robust statistic literature. Our theoretical analysis highlights the need to utilize robust versions of Stochastic Frank-Wolfe type algorithm for dealing with heavy-tailed data arising in practice.
Rachel Luo (Stanford University); Aadyot Bhatnagar (SalesForce.com); Yu Bai (Salesforce Research); Shengjia Zhao (Stanford University); Huan Wang (Yale University); Caiming Xiong (Salesforce Research); Silvio Savarese (Salesforce); Edward Schmerling (Stanford University); Marco Pavone (NVIDIA)
Probabilistic classifiers output confidence scores along with their predictions, and these confidence scores should be calibrated, i.e., they should reflect the reliability of the prediction. Confidence scores that minimize standard metrics such as the expected calibration error (ECE) accurately measure the reliability on average across the entire population. However, it is in general impossible to measure the reliability of an individual prediction. In this work, we propose the local calibration error (LCE) to span the gap between average and individual reliability. For each individual prediction, the LCE measures the average reliability of a set of similar predictions, where similarity is quantified by a kernel function on a pretrained feature space and by a binning scheme over predicted model confidences. We show theoretically that the LCE can be estimated sample-efficiently from data, and empirically find that it reveals miscalibration modes that are more fine-grained than the ECE can detect. Our key result is a novel local recalibration method LoRe, to improve confidence scores for individual predictions and decrease the LCE. Experimentally, we show that our recalibration method produces more accurate confidence scores, which improves downstream fairness and decision making on classification tasks with both image and tabular data.
Rohit Bhattacharya (Williams College); Razieh Nabi (Emory University)
The front-door criterion can be used to identify and compute causal effects despite the existence of unmeasured confounders between a treatment and outcome. However, the key assumptions -- (i) the existence of a variable (or set of variables) that fully mediates the effect of the treatment on the outcome, and (ii) which simultaneously does not suffer from similar issues of confounding as the treatment-outcome pair -- are often deemed implausible. This paper explores the testability of these assumptions. We show that under mild conditions involving an auxiliary variable, the assumptions encoded in the front-door model (and simple extensions of it) may be tested via generalized equality constraints a.k.a Verma constraints. We propose two goodness-of-fit tests based on this observation, and evaluate the efficacy of our proposal on real and synthetic data. We also provide theoretical and empirical comparisons to instrumental variable approaches to handling unmeasured confounding.
This paper considers the problem of estimating unknown intervention targets in causal directed acyclic graphs from observational and interventional data in the presence of latent variables. The focus is on linear structural equation models with soft interventions. The existing approaches to this problem involve extensive conditional independence tests, and they estimate the unknown intervention targets alongside learning the structure of the causal model in its entirety. This joint learning approach to estimating the intervention targets results in algorithms that are not scalable as graph sizes grow. This paper proposes an approach that does not necessitate learning the entire causal model and focuses on learning only the intervention targets. The key idea of this approach is the property that interventions impose sparse changes in the precision matrix of a linear model. Leveraging this property, the proposed framework consists of a sequence of precision difference estimation steps. Furthermore, the necessary knowledge to refine an observational Markov equivalence class (MEC) to an interventional MEC is inferred. Simulation results are provided to illustrate the scalability of the proposed algorithm and compare it with those of the existing approaches.
Tongzheng Ren (University of Texas, Austin); Tianjun Zhang (University of California Berkeley); Csaba Szepesvari (University of Alberta); Bo Dai (Google Brain)
Representation learning lies at the heart of the em- pirical success of deep learning for dealing with the curse of dimensionality. However, the power of representation learning has not been fully exploited yet in reinforcement learning (RL), due to i), the trade-off between expressiveness and tractability; and ii), the coupling between exploration and rep- resentation learning. In this paper, we first reveal the fact that under some noise assumption in the stochastic control model, we can obtain the lin- ear spectral feature of its corresponding Markov transition operator in closed-form for free. Based on this observation, we propose Spectral Dynam- ics Embedding (SPEDE), which breaks the trade- off and completes optimistic exploration for rep- resentation learning by exploiting the structure of the noise. We provide rigorous theoretical analysis of SPEDE, and demonstrate the practical superior performance over the existing state-of-the-art em- pirical algorithms on several benchmarks.
Sakshi Agarwal (University of California, Irvine); Kalev Kask (University of California, Irvine); Alexander Ihler (University of California, Irvine); Rina Dechter (University of California, Irvine)
A major limiting factor in graphical model inference is the complexity of computing the partition function. Exact message-passing algorithms such as Bucket Elimination (BE) require exponentially high levels of memory to compute the partition function, therefore approximations are necessary. In this paper, we build upon a recently introduced methodology called Deep Bucket Elimination (DBE) that uses classical Neural Networks (NNs) to approximate messages generated by BE when buckets have large memory requirements. The main feature of our new scheme called NeuroBE is that it customizes the architecture and learning of the NNs to the message size and its distribution. We also explore a new loss function for training taking into account the estimated message cost distribution. Our experiments demonstrate significant improvements in accuracy and time compared with DBE.
Jeffrey Smith (Australian National University); Jesse Cranney (Australian National University); Charles Gretton (Australian National University); Damien Gratadour (Australian National University)
We aim to significantly enhance the science return of astronomical observatories, and in particular giant terrestrial optical telescopes. Observatories employ Adaptive Optics (AO) systems in order to acquire high sensitivity diffraction limited im- ages of the sky. The incumbent ??orkhorse??for control of AO systems employs a linear real-time controller in a closed loop, with sensing of state performed via a (Shack-Hartmann) wavefront sen- sor (WFS). The actuators of a deformable mirror (DM) are driven, with the action performed in each iteration having a continuous representation as an array of DC voltages. The typical control regime is practical and scalable, nonetheless, there remains a residual uncompensated turbulence that leads to optical aberrations limiting the class of scientific assets that can be acquired. We have developed and trained a translational GAN model that accurately estimates residual perturbations from WFS images. Model inference occurs in 0.34 milliseconds using off-the-shelf GPU hardware, and is applicable for use in AO control where the control loop might be running at 500Hz. We develop an AO control regime with a second controller stage actuating a second DM controlled in an open loop according to the estimated residual turbulence. Using the open- source COMPASS tool for simulation, we are able to significantly improve the performance using our new regime.
Jacob Imola (University of California, San Diego, University of California, San Diego); Shiva Kasiviswanathan (Amazon); Stephen White (Georgia Institute of Technology); Abhinav Aggarwal (Amazon); Nathanael Teissier (Illinois Institute of Technology)
Metric differential privacy (mDP) is a modification of differential privacy that is more suitable when records can be represented in a general metric space, such as text data represented as word embeddings or geographical coordinates on a map. We consider the task of releasing elements of the metric space under metric differential privacy where utility is measured as the distance of the released element to the original element. Linear programming (LP) can be used to construct a mechanism that achieves the optimal utility for a particular mDP constraint. However, these LPs suffer from a polynomial explosion of variables and constraints that render them impractical for solving real-world problems. An important question is how to design rigorous mDP mechanisms that balance the utility-scalability tradeoff. Our main contribution is a new method for reducing the LP size used to generate mDP mechanisms by constraining the search space such that certain input and output pairs have transition probabilities derived from the exponential mechanism. Our method produces mDP mechanisms whose LPs are smaller that all prior work in this area. We also provide a lower bound on the best possible mechanism utility. Our experiments on real-world metric spaces highlight the superior utility-scalability tradeoff of our mechanism.
Subhojyoti Mukherjee (University of Wisconsin, Madison)
In this paper, we consider the setting of piecewise i.i.d. bandits under a safety constraint. In this piecewise i.i.d. setting, there exists a finite number of changepoints where the mean of some or all arms change simultaneously. We introduce the safety constraint studied in Wu et al. (2016) to this setting such that at any round the cumulative reward is above a constant factor of the default action reward. We propose two actively adaptive algorithms for this setting that satisfy the safety constraint, detect changepoints, and restart without the knowledge of the number of changepoints or their locations. We provide regret bounds for our algorithms and show that the bounds are comparable to their counterparts from the safe bandit and piecewise i.i.d. bandit literature. We also provide the first matching lower bounds for this setting. Empirically, we show that our safety-aware algorithms match the performance of the state-of-the-art actively adaptive algorithms that do not satisfy the safety constraint.
Yangchen Pan (University of Alberta); Jincheng Mei (University of Alberta); Amir-massoud Farahmand (Department of Computer Science, University of Toronto); Martha White (University of Alberta); Hengshuai Yao (University of Alberta); Mohsen Rohani (Huawei Technologies Ltd.); Jun Luo (Huawei Technologies Ltd.)
Prioritized Experience Replay (ER) has been empirically shown to improve sample efficiency across many domains and attracted great attention; however, there is little theoretical understanding of why such prioritized sampling helps and its limitations. In this work, we take a deep look at the prioritized ER. In a supervised learning setting, we show the equivalence between the error-based prioritized sampling method for mean squared error and uniform sampling for cubic power loss. We then provide theoretical insight into why it improves convergence rate upon uniform sampling during early learning. Based on the insight, we further point out two limitations of the prioritized ER method: 1) outdated priorities and 2) insufficient coverage of the sample space. To mitigate the limitations, we propose our model-based stochastic gradient Langevin dynamics sampling method. We show that our method does provide states distributed close to an ideal prioritized sampling distribution estimated by the brute-force method, which does not suffer from the two limitations. We conduct experiments on both discrete and continuous control problems to show our approach's efficacy and examine the practical implication of our method in an autonomous driving application.
Ruihan Wu (Cornell University); Xin Yang (University of Washington); Yuanshun Yao (ByteDance AI Lab); Jiankai Sun (ByteDance Inc.); Tianyi Liu (Georgia Institute of Technology); Kilian Q Weinberger (Cornell University); Chong Wang (Princeton University)
Differentially Private (DP) data release is a promising technique to disseminate data without compromising the privacy of data subjects. However the majority of prior work has focused on scenarios where a single party owns all the data. In this paper we focus on the multi-party setting, where different stakeholders own disjoint sets of attributes belonging to the same group of data subjects. We propose two differentially private algorithms within the context of linear regression that allow all parties to train models on the complete data without the ability to infer private attributes or identities of individuals. We prove that our DP algorithms asymptotically converge to the optimal (non-private) solutions with increasing dataset size and substantiate these results through experiments on both artificial and real-world datasets.
Tobias Hatt (Swiss Federal Institute of Technology); Daniel Tschernutter (Swiss Federal Institute of Technology); Stefan Feuerriegel (LMU Munich)
Learning personalized decision policies that generalize to the target population is of great relevance. Since training data is often not representative of the target population, standard policy learning methods may yield policies that do not generalize target population. To address this challenge, we propose a novel framework for learning policies that generalize to the target population. For this, we characterize the difference between the training data and the target population as a sample selection bias using a selection variable. Over an uncertainty set around this selection variable, we optimize the minimax value of a policy to achieve the best worst-case policy value on the target population. In order to solve the minimax problem, we derive an efficient algorithm based on a convex-concave procedure and prove convergence for parametrized spaces of policies such as logistic policies. We prove that, if the uncertainty set is well-specified, our policies generalize to the target population as they can not do worse than on the training data. Using simulated data and a clinical trial, we demonstrate that, compared to standard policy learning methods, our framework improves the generalizability of policies substantially.
Tomas Brazdil (Masaryk University); David Klaska (Masaryk University); Antonin Kucera (Masaryk University); Vit Musil (Masaryk University); Petr Novotny(Masaryk University); Vojtech Reha (Masaryk University)
We consider the problem of adaptation of patrolling strategies in a changing environment where the topology of Defender's moves and the importance of guarded targets unpredictably change. The Defender must instantly switch to a new strategy optimized for the new environment, not disrupting the ongoing patrolling task, and the new strategy must be computed promptly under all circumstances. Since strategy switching may cause unintended security risks compromising the achieved protection, our solution includes mechanisms for detecting and mitigating this problem. The efficiency of our framework is evaluated experimentally.
Iker Perez (Featurespace); Piotr Skalski (Featurespace); Alec Barns-Graham; Jason Wong (Featurespace); David Sutton (University of Cambridge)
Predictive uncertainties in classification tasks are often a consequence of model inadequacy or insufficient training data. In popular applications, such as image processing, we are often required to scrutinise these uncertainties by meaningfully attributing them to input features. This helps to improve interpretability assessments. However, there exist few effective frameworks for this purpose. Vanilla forms of popular methods for the provision of saliency masks, such as SHAP or integrated gradients, adapt poorly to target measures of uncertainty. Thus, state-of-the-art tools instead proceed by creating counterfactual or adversarial feature vectors, and assign attributions by direct comparison to original images. In this paper, we present a novel framework that combines path integrals, counterfactual explanations and generative models, in order to procure attributions that contain few observable artefacts or noise. We evidence that this outperforms existing alternatives through quantitative evaluations with popular benchmarking methods and data sets of varying complexity.
Zizhen Liu (Institute of Computing Technology, Chinese Academy of Sciences); Si Chen (Uppsala University); Jing Justin Ye (Institute of Computing Technology, Chinese Academy of Sciences); Junfeng Fan; Huawei Li (Institute of Computing Technology, Chinese Academy of Sciences); Xiaowei Li (Institute of Computing Technology, Chinese Academy of Sciences)
To prevent private training data leakage in Federated Learning systems, we propose a novel secure aggregation scheme based on seed homomorphic pseudo-random generator (SHPRG), named SASH. SASH leverages the homomorphic property of SHPRG to simplify the masking and demasking scheme, which for each of the clients and for the server, entails a overhead linear w.r.t model size and constant w.r.t number of clients. We prove that even against worst-case colluding adversaries, SASH preserves training data privacy, while being resilient to dropouts without extra overhead. We experimentally demonstrate SASH significantly improves the efficiency to 20? over baseline, especially in the more realistic case where the numbers of clients and model size become large, and a certain percentage of clients drop out from the system.
Yao Shu (National University of Singapore); Yizhou Chen (National University of Singapore); Zhongxiang Dai (National University of Singapore); Bryan Kian Hsiang Low (National University of Singapore)
Recently, neural architecture search (NAS) has been applied to automate the design of neural networks in real-world applications. A large number of algorithms have been developed to improve the search cost or the performance of the final selected architectures in NAS. Unfortunately, these NAS algorithms aim to select only one single well-performing architecture from their search spaces and thus have overlooked the capability of neural network ensemble (i.e., an ensemble of neural networks with diverse architectures) in achieving improved performance over a single final selected architecture. To this end, we introduce a novel neural ensemble search algorithm, called neural ensemble search via Bayesian sampling (NESBS), to effectively and efficiently select well-performing neural network ensembles from a NAS search space. In our extensive experiments, NESBS algorithm is shown to be able to achieve improved performance over state-of-the-art NAS algorithms while incurring a comparable search cost, indicating the superior of our NESBS algorithm over these conventional NAS algorithms in practice.
Giorgia Ramponi (ETHZ - ETH Zurich); Marcello Restelli (Politecnico di Milano)
In this paper, we study the learning problem in two-player general-sum Markov Games. We consider the online setting where we control a single player, playing against an arbitrary opponent to minimize the regret. Previous works only consider the zero-sum Markov Games setting, in which the two agents are completely adversarial. However, in some cases, the two agents may have different reward functions without having conflicting objectives. This involves a stronger notion of regret than the one used in previous works. This class of games, called general-sum Markov Games is far to be well understood and studied. We show that the new regret minimization problem is significantly harder than in standard Markov Decision Processes and zero-sum Markov Games. To do this, we derive a lower bound on the expected regret of any ``good'' learning strategy which shows the constant dependencies with the number of deterministic policies, which is not present in zero-sum Markov Games and Markov Decision Processes. Then we propose a novel optimistic algorithm that nearly matches the proposed lower bound. Proving these results requires overcoming several new challenges that are not present in Markov Decision Processes or zero-sum Markov Games.
Or Dinari (Ben Gurion University of the Negev); Oren Freifeld (Ben-Gurion University)
DP-means, a nonparametric generalization of K- means, extends the latter to the case where K, the number of clusters, is unknown. Unlike K-means, however, DP-means is hard to parallelize, a limitation hindering its usage in large-scale tasks. This work bridges this practicality gap by rendering the DP-means approach a viable, fast, and highly- scalable solution. First, we study the strengths and weaknesses of previous attempts to parallelize the DP-means algorithm. Next, we propose a new parallel algorithm, called PDC-DP-Means (Parallel Delayed Cluster DP-Means), based in part on delayed creation of clusters. Compared with DP-Means, PDC-DP-Means provides not only a major speedup but also performance gains. Finally, we propose two extensions of our PDC-DP-Means. The first combines it with an existing method, leading to further speedups. The second extends PDC-DP-Means to a Mini-Batch setting (with an optional support for an online mode), allowing for another major speedup (with only a minor performance drop). We verify the utility of the proposed methods on multiple datasets. We also show that the proposed methods outperform other nonparametric methods (e.g., DBSCAN).
Aurghya Maiti (Adobe Systems); Vineet Nair (Google Research India); Gaurav Sinha (California Institute of Technology)
We study the problem of determining the best atomic intervention in a Causal Bayesian Network (CBN) specified only by its causal graph. We model this as a stochastic multi-armed bandit (MAB) problem with side-information, where interventions on CBN correspond to arms of the bandit instance. First, we propose a simple regret minimization algorithm that takes as input a causal graph with observable and unobservable nodes and in $T$ exploration rounds achieves $\tilde{O}(\sqrt{m(\mathcal{C})/T})$ expected simple regret. Here $m(\mathcal{C})$ is a parameter dependent on the input CBN $\mathcal{C}$ and could be much smaller than the number of arms. We also show that this is almost optimal for CBNs whose causal graphs have an $n$-ary tree structure. Next, we propose a cumulative regret minimization algorithm that takes as input a causal graph with observable nodes and performs better than the optimal MAB algorithms that do not use causal side-information. We experimentally compare both our algorithms with the best known algorithms in the literature.
Tycho F.A. van der Ouderaa (Imperial College London); Mark van der Wilk (Imperial College London)
Assumptions about invariances or symmetries in data can significantly increase the predictive power of statistical models. Many commonly used models in machine learning are constraint to respect certain symmetries in the data, such as translation equivariance in convolutional neural networks, and incorporation of new symmetry types is actively being studied. Yet, learning invariances from the data itself remains an open research problem. It has been shown that marginal likelihood offers a principled way to learn invariances in Gaussian Processes. We propose a weight-space equivalent to this approach, by minimizing a lower bound on the marginal likelihood to learn invariances in neural networks resulting in naturally higher performing models.
We often see undesirable tradeoffs in robust machine learning where out-of-distribution (OOD) accuracy is at odds with in-distribution (ID) accuracy. A robust classifier obtained via specialized techniques such as removing spurious features often has better OOD but worse ID accuracy compared to a standard classifier trained via vanilla ERM. In this paper, we find that a simple approach of ensembling the standard and robust models, after calibrating on only ID data, outperforms prior state-of-the-art both ID and OOD. On ten natural distribution shift datasets, ID-calibrated ensembles get the best of both worlds: strong ID accuracy of the standard model and OOD accuracy of the robust model. We analyze this method in stylized settings, and identify two important conditions for ensembles to perform well on both ID and OOD: (1) standard and robust models should be calibrated (on ID data, because OOD data is unavailable), (2) OOD has no anticorrelated spurious features.
Cause-effect relationships are typically evaluated by comparing outcome responses to binary treatment values, representing two arms of a hypothetical randomized controlled trial. However, in certain applications, treatments of interest are continuous and multidimensional. For example, understanding the causal relationship between severity of radiation therapy, summarized by a multidimensional vector of radiation exposure values and post-treatment side effects is a problem of clinical interest in radiation oncology. An appropriate strategy for making interpretable causal conclusions is to reduce the dimension of treatment. If individual elements of a multidimensional treatment vector weakly affect the outcome, but the overall relationship between treatment and outcome is strong, careless approaches to dimension reduction may not preserve this relationship. Further, methods developed for regression problems do not directly transfer to causal inference due to confounding complications. In this paper, we use semiparametric inference theory for structural models to give a general approach to causal sufficient dimension reduction of a multidimensional treatment such that the cause-effect relationship between treatment and outcome is preserved. We illustrate the utility of our proposals through simulations and a real data application in radiation oncology.
Omar Chehab (INRIA); Alexandre Gramfort (Inria); Aapo Hyvarinen (University of Helsinki)
Learning a parametric model of a data distribution is a well-known statistical problem that has seen renewed interest as it is brought to scale in deep learning. Framing the problem as a self-supervised task, where data samples are discriminated from noise samples, is at the core of state-of-the-art methods, beginning with Noise-Contrastive Estimation (NCE). Yet, such contrastive learning requires a good noise distribution, which is hard to specify; domain-specific heuristics are therefore widely used. While a comprehensive theory is missing, it is widely assumed that the optimal noise should in practice be made equal to the data, both in distribution and proportion. This setting underlies Generative Adversarial Networks (GANs) in particular. Here, we empirically and theoretically challenge this assumption on the optimal noise. We show that deviating from this assumption can actually lead to better statistical estimators, in terms of asymptotic variance. In particular, the optimal noise distribution is different from the data?? and even from a different family.
Tsviel Ben Shabat (Technion - Israel Institute of Technology, Technion - Israel Institute of Technology); Reshef Meir (Technion, Technion); David Azriel (Technion - Israel Institute of Technology, Technion)
When aggregating information from conflicting sources, one's goal is to find the truth. Most continuous-data Truth Discovery (TD) algorithms try to achieve this goal by estimating the trustworthiness of each source and then aggregating the conflicting information by weighing each source's answer proportionally to her trustworthiness. However, each of those algorithms requires more than a single source for such estimation and usually does not consider different estimation methods other than a weighted mean. Therefore, in this work we formulate, prove, and empirically test the conditions for an Empirical Bayes Estimator (EBE) to dominate the weighted mean aggregation. We further show that EBE is also a solution to the single source TD problem and we demonstrate that EBE, under mild conditions, can be used as a second step of any TD algorithm which can only improve the MSE.
Rui Yan (University of Oxford); Gabriel Santos (Department of Computer Science, University of Oxford); Xiaoming Duan (Shanghai Jiaotong University); David Parker (Birmingham University); Marta Kwiatkowska (Department of Computer Science, University of Oxford)
We present novel techniques for neuro-symbolic concurrent stochastic games, a recently proposed modelling formalism to represent a set of agents operating in a probabilistic, continuous-space environment using a combination of neural network based perception mechanisms and traditional symbolic methods. To date, only zero-sum variants of the model were studied, which is too restrictive when agents have distinct objectives. We formalise notions of equilibria for these models and present algorithms to synthesise them. Focusing on the finite-horizon setting, and (global) social welfare subgame-perfect optimality, we consider two distinct types: Nash equilibria and correlated equilibria. We first show that an exact solution based on backward induction may yield arbitrarily bad equilibria. We then propose an approximation algorithm called frozen subgame improvement, which proceeds through iterative solution of nonlinear programs. We develop a prototype implementation and demonstrate the benefits of our approach on two case studies: an automated car-parking system and an aircraft collision avoidance system.
Shoujian Yang (Peking University); Lian Yu (Peking University)
Existing BERT-based models for Chinese spelling correction (CSC) have three issues. 1) Bert tends to correct a correct low-frequency collocation into a high-frequency and leads to over-correcting. 2) The current learned knowledge for CSC ignores the phonic and glyph aspects of each character and unable to differentiate a near-phonic or a near-visual conversion. 3) Two-dimensional glyph information of Chinese characters is overlooked and fails to discover near-visual misused characters. This paper proposes a hybrid approach, CoSPA, to address these issues. 1) This paper proposes an alterable copy mechanism to alleviate over-correcting by jointly learning to copy a correct character from input sentence, or generate a character from BERT. No method has used copy mechanism in BERT for CSC. 2) The attention mechanism is further applied on the phonic and shape representation of each character at the output layer. 3) Shape representation is enhanced by mining character glyph with ResNet, and fused with stroke representation via an adaptive gating unit. The experimental results show that CoSPA outperforms the previous state-of-the-art methods on SIGHAN2015 datasets.
Sergey Bartunov (Charm Therapeutics); Fabian Bernd Fuchs (University of Oxford); Timothy P Lillicrap (DeepMind)
Processing sets or other unordered, potentially variable-sized inputs in neural networks is usually handled by \emph{aggregating} a number of input tensors into a single representation. While a number of aggregation methods already exist from simple sum pooling to multi-head attention, they are limited in their representational power both from theoretical and empirical perspectives. On the search of a principally more powerful aggregation strategy, we propose an optimization-based method called Equilibrium Aggregation. We show that many existing aggregation methods can be recovered as special cases of Equilibrium Aggregation and that it is provably more efficient in some important cases. Equilibrium Aggregation can be used as a drop-in replacement in many existing architectures and applications. We validate its efficiency on three different tasks: median estimation, class counting, and molecular property prediction. In all experiments, Equilibrium Aggregation achieves higher performance than the other aggregation techniques we test.
Rosanna Turrisi (University of Genova); Remi Flamary (Ecole polytechnique); Alain Rakotomamonjy (University of Rouen Normandy); Massimiliano Pontil (Istituto Italiano di Tecnologia & University College London)
This work addresses the problem of domain adaptation on an unlabeled target dataset using knowledge from multiple labelled source datasets. Most current approaches tackle this problem by searching for an embedding that is invariant across source and target domains, which corresponds to searching for a universal classifier that works well on all domains. In this paper, we address this problem from a new perspective: instead of crushing diversity of the source distributions, we exploit it to adapt better to the target distribution. Our method, named Multi-Source Domain Adaptation via Weighted Joint Distribution Optimal Transport (MSDA-WJDOT), aims at finding simultaneously an Optimal Transport-based alignment between the source and target distributions and a re-weighting of the sources distributions. We discuss the theoretical aspects of the method and propose a conceptually simple algorithm. Numerical experiments indicate that the proposed method achieves state-of-the-art performance on simulated and real datasets.
Samuel Daulton (University of Oxford); David Eriksson (Facebook); Maximilian Balandat (Facebook); Eytan Bakshy (Facebook)
Many real-world scientific and industrial applications require optimizing multiple competing black-box objectives. When the objectives are expensive-to-evaluate, multi-objective Bayesian optimization (BO) is a popular approach because of its high simple efficiency. However, even with recent methodological advances, most existing multi-objective BO methods perform poorly on search spaces with more than a few dozen parameters and rely on global surrogate models that scale cubically with the number of observations. In this work we propose MORBO, a scalable method for multi-objective BO over high-dimensional search spaces. MORBO identifies diverse globally optimal solutions by performing BO in multiple local regions of the design space in parallel using a coordinated strategy. We show that MORBO significantly advances the state-of-the-art in sample efficiency for several high-dimensional synthetic problems and real world applications, including an optical display design problem and a vehicle design problem with 146 and 222 parameters, respectively. On these problems, where existing BO algorithms fail to scale and perform well, MORBO provides practitioners with order-of-magnitude improvements in sample-efficiency over the current approach.
Thomas Mortier (Universiteit Gent); Krzysztof Dembczynski (Yahoo! Research (Verizon Media)); Eyke Hullermeier (Ludwig Maximilian University of Munich); Willem Waegeman (Ghent University)
Set-valued prediction is a well-known concept in multi-class classification. When a classifier is uncertain about the class label for a test instance, it can predict a set of classes instead of a single class. In this paper, we focus on hierarchical multi-class classification problems, where valid sets (typically) correspond to internal nodes of the hierarchy. We argue that this is a very strong restriction, and we propose a relaxation by introducing the notion of representation complexity for a predicted set. In combination with probabilistic classifiers, this leads to a challenging inference problem for which specific combinatorial optimization algorithms are needed. We propose three methods and evaluate them on benchmark datasets: a na?ve approach that is based on matrix-vector multiplication, a reformulation as a knapsack problem with conflict graph, and a recursive tree search method. Experimental results demonstrate that the last method is computationally more efficient than the other two approaches, due to a hierarchical factorization of the conditional class distribution.
John Isak Texas Falk (University College London); Carlo Ciliberto (University College London); Massimiliano Pontil (Istituto Italiano di Tecnologia & University College London)
Meta-learning algorithms have made significant progress in the context of meta-learning for image classification but less attention has been given to the regression setting. In this paper we propose to learn the probability distribution representing a random feature kernel that we wish to use within kernel ridge regression (KRR). We introduce two instances of this meta-learning framework, learning a neural network pushforward for a translation-invariant kernel and an affine pushforward for a neural network random feature kernel, both mapping from a Gaussian latent distribution. We learn the parameters of the pushforward by minimizing a meta-loss associated to the KRR objective. Since the resulting kernel does not admit an analytical form, we adopt a random feature sampling approach to approximate it. We call the resulting method Implicit Kernel Meta-Learning (IKML). We derive a meta-learning bound for IKML, which shows the role played by the number of tasks $T$, the task sample size $n$, and the number of random features $M$. In particular the bound implies that $M$ can be the chosen independently of $T$ and only mildly dependent on $n$. We introduce one synthetic and two real-world meta-learning regression benchmark datasets. Experiments on these datasets show that IKML performs best or close to best when compared against competitive meta-learning methods.
Gabriella Chouraqui (Ben-Gurion University of the Negev); Liron Cohen (Ben-Gurion University of the Negev); Gil Einziger (University of Massachusetts at Amherst); Liel leman (Ben Gurion University)
Machine learning classifiers are probabilistic in nature, and thus inevitably involve uncertainty. Predicting the probability of a specific input to be correct is called uncertainty (or confidence) estimation and is crucial for risk management. Post-hoc model calibrations can improve models??uncertainty estimations without the need for retraining, and without changing the model. Our work puts forward a geometric-based approach for uncertainty estimation. Roughly speaking, we use the geometric distance of the current input from the existing training inputs as a signal for estimating uncertainty and then calibrate that signal (instead of the model?? estimation) using standard post-hoc calibration techniques. We show that our method yields better uncertainty estimations than recently proposed approaches by extensively evaluating multiple datasets and models. In addition, we also demonstrate the possibility of performing our approach in near real-time applications. Our code is available at [anonymous-Git].
Thanh Vinh Vo (National University of Singapore); Young Lee (Harvard University); Trong Nghia Hoang (Washington State University); Tze-Yun Leong (National University of Singapore)
We propose a Bayesian framework for estimating causal effects from federated observational data sources. Bayesian causal inference is an important approach to learning the distribution of the causal estimands and understanding the uncertainty of causal effects. Our framework estimates the posterior distributions of the causal effects to compute the higher-order statistics that capture the uncertainty. We integrate local causal effects from different data sources without centralizing them. We then estimate the treatment effects from observational data using a non-parametric reformulation of the classical potential outcomes framework. We model the potential outcomes as a random function distributed by Gaussian processes, with defining parameters that can be efficiently learned from multiple data sources. Our method avoids exchanging raw data among the sources, thus contributing towards privacy-preserving causal learning. The promise of our approach is demonstrated through a set of simulated and real-world examples.
Bobak Pezeshki (University of California, Irvine); Radu Marinescu; Alexander Ihler (University of California, Irvine); Rina Dechter (University of California, Irvine)
The importance of designing proteins, such as high affinity antibodies, has become ever more apparent. Computational Protein Design can cast such design problems as optimization tasks with the objective of maximizing K*, an approximation of binding affinity. Here we lay out a graphical model framework for K* optimization that enables use of compact AND/OR search spaces. We introduce two distinct graphical model formulations, a new K* heuristic, AOBB-K* - an efficient depth-first branch-and-bound algorithm, and modifications that improve performance with theoretical guarantees. As AOBB-K* is inspired by algorithms from the well studied task of Marginal MAP, this work provides a foundation for adaptation of state-of-the-art mixed inference schemes to protein design.
Anirudh Kakarlapudi (University of Georgia); Gayathri Anil (University of Georgia); Adam Eck (Oberlin College); Prashant Doshi (University of Georgia); Leen-Kiat Soh (University of Nebraska-Lincoln)
In open multiagent systems, the set of agents operating in the environment changes over time and in ways that are nontrivial to predict. For example, if collaborative robots were tasked with fighting wildfires, they may run out of suppressants and be temporarily unavailable to assist their peers. Because an agent's optimal action depends on the actions of others, each agent must not only predict the actions of its peers, but, before that, reason whether they are even present to perform an action. Addressing openness thus requires agents to model each other?? presence, which can be enhanced through agents communicating about their presence in the environment. At the same time, communicative acts can also incur costs (e.g., consuming limited bandwidth), and thus an agent must tradeoff the benefits of enhanced coordination with the costs of communication. We present a new principled, decision-theoretic method in the context provided by the recent communicative interactive POMDP framework for planning in open agent settings that balances this tradeoff. Simulations of multiagent wildfire suppression problems demonstrate how communication can improve planning in open agent environments, as well as how agents tradeoff the benefits and costs of communication under different scenarios.
Hongwei Jin (University of Illinois, Chicago); Zishun Yu (University of Illinois, Chicago); Xinhua Zhang (University of Illinois, Chicago)
Comparing structured data from possibly different metric-measure spaces is a fundamental task in machine learning, with applications in, e.g., graph classification. The Gromov-Wasserstein (GW) discrepancy formulates a coupling between the structured data based on optimal transportation, tackling the incomparability between different structures by aligning the intra-relational geometries. Although efficient local solvers such as conditional gradient and Sinkhorn are available, the inherent non-convexity still prevents a tractable evaluation. To address this issue, we take inspiration from the connection with the quadratic assignment problem and propose the orthogonal Gromov-Wasserstein (OGW) discrepancy that admits an efficient and closed-form lower bound with the complexity of $\mathcal{O}(n^3)$. It also directly extends to the fused Gromov-Wasserstein (FGW) distance, incorporating node features into the coupling. Extensive experiments on both the synthetic and real-world datasets show the effectiveness and efficiency of our methods.
Giuseppe Spallitta (University of Trento); Gabriele Masina (University of Trento); Paolo Morettin (KU Leuven); Andrea Passerini (University of Trento); Roberto Sebastiani (University of Trento)
Weighted Model Integration (WMI) is a popular formalism aimed at unifying approaches for probabilistic inference in hybrid domains, involving logical and algebraic constraints. Despite a considerable amount of recent work, allowing WMI algorithms to scale with the complexity of the hybrid problem is still a challenge. In this paper we highlight some substantial limitations of existing state-of-the-art solutions, and develop an algorithm that combines SMT-based enumeration, an efficient technique in formal verification, with an effective encoding of the problem structure. This allows our algorithm to avoid generating redundant models, resulting in substantial computational savings. An extensive experimental evaluation on both synthetic and real-world datasets confirms the advantage of the proposed solution over existing alternatives.
Roman Andriushchenko (Brno University of Technology); Milan Ceska (Brno University of Technology); Sebastian Junges (Radboud University); Joost-Pieter Katoen (RWTH Aachen University)
We present a novel learning framework to obtain finite-state controllers (FSCs) for partially observable Markov decision processes and illustrate its applicability for indefinite-horizon specifications. Our framework builds on oracle-guided inductive synthesis to explore a design space compactly representing available FSCs. The inductive synthesis approach consists of two stages: The outer stage determines the design space, i.e., the set of FSC candidates, while the inner stage efficiently explores the design space. This framework is easily generalisable and shows promising results when compared to existing approaches. Experiments indicate that our technique is (i) competitive to state-of-the-art belief-based approaches for indefinite-horizon properties, (ii) yields smaller FSCs than existing methods for several models, and (iii) naturally treats multi-objective specifications.
Daira Pinto Prieto (University of Amsterdam, University of Amsterdam); Ronald de Haan (University of Amsterdam)
Dempster's rule of combination allows us to combine various independent pieces of evidence that each have a certain degree of uncertainty. This provides a useful way for dealing with uncertain evidence, but the rule is computationally intractable. In this paper, we analyze the complexity of this rule for differently structured bodies of evidence and we consider a known algorithm by Shafer and Logan to compute this rule efficiently over a hierarchical set of evidence. We show that one can check in polynomial time whether an arbitrary set of evidence has a hierarchical shape, enabling the use of Shafer and Logan's algorithm. Moreover, we consider two different approaches to deal with non-hierarchical sets of evidence: (i) considering hierarchical subsets and (ii) taking advantage of internal hierarchical structures in the overall set. For the former case, we conclude that getting different hierarchies from an arbitrary set of pieces of evidence corresponds to the VERTEX COVER problem and we present algorithms for obtaining these hierarchies based on this correspondence. For the latter case, we present a fixed-parameter tractable algorithm which computes the belief function of any piece of evidence included in the set.
Marcel Wienobst (University of Luebeck); Max Bannach (University of Luebeck); Maciej Liskiewicz (University of Luebeck)
Ancestral graphs are an important tool for encoding causal knowledge as they represent uncertainty about the presence of latent confounding and selection bias, and they can be inferred from data. As for other graphical models, several maximal ancestral graphs (MAGs) may encode the same statistical information in the form of conditional independencies. Such MAGs are said to be \emph{Markov equivalent}. This work concerns graphical characterizations and computational aspects of Markov equivalence between MAGs. These issues have been studied in past years leading to several criteria and methods to test Markov equivalence. The state-of-the-art algorithm, provided by Hu and Evans [UAI 2020], runs in time $O(n^5)$ for instances with $n$ vertices. We propose a new constructive graphical criterion for the Markov equivalence of MAGs, which allows us to develop a practically effective equivalence test with worst-case runtime $O(n^3)$. Additionally, our criterion is expressed in terms of natural graphical concepts, which is of independent value.
Yurong Ling (University College London, University of London); Jing-Hao Xue (University College London)
Dimension reduction for high-dimensional count data with a large proportion of zeros is an important task in various applications. As a large number of dimension reduction methods rely on the proximity measure, we develop a dissimilarity measure that is well-suited for small counts based on the Kullback-Leibler divergence. We compare the proposed measure with other widely used dissimilarity measures and show that the proposed one has superior discrimination ability when applied to high-dimensional count data having an excess of zeros. Extensive empirical results, on both simulated and publicly-available real-world datasets that contain many zeros, demonstrate that the proposed dissimilarity measure can improve a wide range of dimension reduction methods.
Gaurush Hiranandani (University of Illinois, Urbana Champaign); Jatin Mathur (University of Illinois, Urbana-Champaign); Harikrishna Narasimhan (Google); Oluwasanmi O Koyejo (Stanford University)
Metric elicitation is a recent framework for eliciting classification performance metrics that best reflect implicit user preferences based on the task and context. However, available elicitation strategies have been limited to linear (or quasi-linear) functions of predictive rates, which can be practically restrictive for many applications including fairness. This paper develops a strategy for eliciting more flexible multiclass metrics defined by quadratic functions of rates, designed to reflect human preferences better. We show its application in eliciting quadratic violation-based group-fair metrics. Our strategy requires only relative preference feedback, is robust to noise, and achieves near-optimal query complexity. We further extend this strategy to eliciting polynomial metrics -- thus broadening the use cases for metric elicitation.
Svenja Uhlemeyer (University of Wuppertal); Matthias Rottmann (University of Wuppertal); Hanno Gottschalk (Wuppertal University)
For the semantic segmentation of images, state-of-the-art deep neural networks (DNNs) achieve high segmentation accuracy if that task is restricted to a closed set of classes. However, as of now DNNs have limited ability to operate in an open world, where they are tasked to identify pixels belonging to unknown objects and eventually to learn novel classes, incrementally. Humans have the capability to say: ``I don't know what that is, but I've already seen something like that''. Therefore, it is desirable to perform such an incremental learning task in an unsupervised fashion. We introduce a method where unknown objects are clustered based on visual similarity. Those clusters are utilized to define new classes and serve as training data for unsupervised incremental learning. More precisely, the connected components of a predicted semantic segmentation are assessed by a segmentation quality estimate. Connected components with a low estimated prediction quality are candidates for a subsequent clustering. Additionally, the component-wise quality assessment allows for obtaining predicted segmentation masks for the image regions potentially containing unknown objects. The respective pixels of such masks are pseudo-labeled and afterwards used for re-training the DNN, i.e., without the use of ground truth generated by humans. In our experiments we demonstrate that, without access to ground truth and even with few data, a DNN's class space can be extended by a novel class, achieving considerable segmentation accuracy.
Eyke Hullermeier (Ludwig Maximilian University of Munich); Sebastien Destercke (University of Technology of Compiegne); Mohammad Shaker
The representation and quantification of uncertainty has received increasing attention in machine learning in the recent past. The formalism of credal sets provides an interesting alternative in this regard, especially as it combines the representation of epistemic (lack of knowledge) and aleatoric (statistical) uncertainty in a rather natural way. In this paper, we elaborate on uncertainty measures for credal sets from the perspective of machine learning. More specifically, we provide an overview of proposals, discuss existing measures in a critical way, and also propose a new measure that is more tailored to the machine learning setting. Based on an experimental study, we conclude that theoretically well-justified measures also lead to better performance in practice. Besides, we corroborate the difficulty of the disaggregation problem, that is, of decomposing the amount of total uncertainty into aleatoric and epistemic uncertainty in a sound manner, thereby complementing theoretical findings with empirical evidence.
Dexun Li (Singapore Management University); Pradeep Varakantham (Singapore Management University)
Restless Multi-Armed Bandits (RMAB) is an apt model to represent decision-making problems in public health interventions (e.g., tuberculosis, maternal, and child care), anti-poaching planning, sensor monitoring, personalized recommendations and many more. Existing research in RMAB has contributed mechanisms and theoretical results to a wide variety of settings, where the focus is on maximizing expected value. In this paper, we are interested in ensuring that RMAB decision making is also fair to different arms while maximizing expected value. In the context of public health settings, this would ensure that different people and/or communities are fairly represented while making public health intervention decisions. To achieve this goal, we formally define the fairness constraints in RMAB and provide planning and learning methods to solve RMAB in a fair manner. We demonstrate key theoretical properties of fair RMAB and experimentally demonstrate that our proposed methods handle fairness constraints without sacrificing significantly on solution quality.
Kwanghee Choi (Sogang University); Siyeong Lee (Naver Labs)
Notable progress has been made in numerous fields of machine learning based on neural network-driven mutual information (MI) bounds. However, utilizing the conventional MI-based losses is often challenging due to their practical and mathematical limitations. In this work, we first identify the symptoms behind their instability: (1) the neural network not converging even after the loss seemed to converge, and (2) saturating neural network outputs causing the loss to diverge. We mitigate both issues by adding a novel regularization term to the existing losses. We theoretically and experimentally demonstrate that added regularization stabilizes training. Finally, we present a novel benchmark that evaluates MI-based losses on both the MI estimation power and its capability on the downstream tasks, closely following the pre-existing supervised and contrastive learning settings. We evaluate six different MI-based losses and their regularized counterparts on multiple benchmarks to show that our approach is simple yet effective.
Yaroslav Kivva (Swiss Federal Institute of Technology Lausanne (EPFL)); Ehsan Mokhtarian (Swiss Federal Institute of Technology Lausanne); Jalal Etesami (Swiss Federal Institute of Technology Lausanne); Negar Kiyavash (Swiss Federal Institute of Technology Lausanne)
We revisit the problem of general identifiability originally introduced in [Lee et al., 2019] for causal inference and note that it is necessary to add positivity assumption of observational distribution to the original definition of the problem. We show that without such an assumption the rules of do-calculus and consequently the proposed algorithm in [Lee et al., 2019] are not sound. Moreover, adding the assumption will cause the completeness proof in [Lee et al., 2019] to fail. Under positivity assumption, we present a new algorithm that is provably both sound and complete. A nice property of this new algorithm is that it establishes a connection between general identifiability and classical identifiability by Pearl [1995] through decomposing the general identifiability problem into a series of classical identifiability sub-problems.
Kenshi Abe (CyberAgent, Inc.); Mitsuki Sakamoto (University of Electro-Communications, Tokyo Institute of Technology); Atsushi Iwasaki (University of Electro-Communications)
In this study, we consider a variant of the Follow the Regularized Leader (FTRL) dynamics in two-player zero-sum games. FTRL is guaranteed to converge to a Nash equilibrium when time-averaging the strategies, while many variants suffer from the issue of limit cycling behavior, i.e., lacks the last-iterate convergence guarantee. To resolve this issue, we propose a mutation-driven FTRL (M-FTRL), an algorithm that introduces mutation for the perturbation of action probabilities. We then investigate the continuous-time dynamics of M-FTRL and provide the strong convergence guarantees toward stationary points which approximate Nash equilibria under full-information feedback. Furthermore, our simulation demonstrates that M-FTRL can enjoy faster convergence rates than FTRL and optimistic FTRL under full-information feedback and surprisingly exhibits clear convergence under bandit feedback.
Zehao Su (ETHZ - ETH Zurich); Leonard Henckel (Copenhagen University)
Suppose we want to estimate a total effect with covariate adjustment in a linear structural equation model. We have a causal graph to decide what covariates to adjust for, but are uncertain about the graph. Here, we propose a testing procedure, that exploits the fact that there are multiple valid adjustment sets for the target total effect in the causal graph, to perform a robustness check on the graph. If the test rejects, it is a strong indication that we should not rely on the graph. We discuss what mistakes in the graph our testing procedure can detect and which ones it cannot and develop two strategies on how to select a list of valid adjustment sets for the procedure. We also connect our result to the related econometrics literature on coefficient stability tests.
Kaiyu Song; Kun Yue (Yunnan University); Liang Duan (Yunnan University); Mingze Yang; Angsheng Li
In the deep neural network based few-shot learning, the limited training data may make the neural network extract ineffective features, which leads to inaccurate results. By Bayesian graph neural network (BGNN), the probability distributions on hidden layers imply useful features, and the few-shot learning could improved by establishing the correlation among features. Thus, in this paper, we incorporate mutual information (MI) into BGNN to describe the correlation, and propose an innovative framework by adopting the Bayesian network with continuous variables (BNCV) for effective calculation of MI. First, we build the BNCV simultaneously when calculating the probability distributions of features from the Dropout in hidden layers of BGNN. Then, we approximate the MI values efficiently by probabilistic inferences over BNCV. Finally, we give the correlation based loss function and training algorithm of our BGNN model. Experimental results show that our MI based BGNN framework is effective for few-shot learning and outperforms some state-of-the-art competitors by large margins on accuracy.
Nina Laura Corvelo Benz (MPI-SWS); Manuel Gomez Rodriguez (MPI-SWS)
Automated decision support systems that are able to infer second opinions from experts can potentially facilitate a more efficient allocation of resources??hey can help decide when and from whom to seek a second opinion. In this paper, we look at the design of this type of support systems from the perspective of counterfactual inference. We focus on a multiclass classification setting and first show that, if experts make predictions on their own, the underlying causal mechanism generating their predictions needs to satisfy a desirable set invariant property. Further, we show that, for any causal mechanism satisfying this property, there exists an equivalent mechanism where the predictions by each expert are generated by independent sub-mechanisms governed by a common noise. This motivates the design of a set invariant Gumbel-Max structural causal model where the structure of the noise governing the sub-mechanisms underpinning the model depends on an intuitive notion of similarity between experts which can be estimated from data. Experiments on both synthetic and real data show that our model can be used to infer second opinions more accurately than its non-causal counterpart.
Yingzhen Yang (Arizona State University); Ping Li (Baidu)
High-dimensional data often lie in or close to low-dimensional subspaces. Sparse subspace clustering methods with sparsity induced by $\ell^{0}$-norm, such as $\ell^{0}$-Sparse Subspace Clustering ($\ell^{0}$-SSC) \citep{YangFJYH16-L0SSC}, are demonstrated to be more effective than its $\ell^{1}$ counterpart such as Sparse Subspace Clustering (SSC) \citep{ElhamifarV13}. However, the theoretical analysis of $\ell^{0}$-SSC is restricted to clean data that lie exactly in subspaces. Real data often suffer from noise and they may lie close to subspaces. In this paper, we show that an optimal solution to the optimization problem of noisy $\ell^{0}$-SSC achieves subspace detection property (SDP), a key element with which data from different subspaces are separated, under deterministic and semi-random model. Our results provide theoretical guarantee on the correctness of noisy $\ell^{0}$-SSC in terms of SDP on noisy data for the first time, which reveals the advantage of noisy $\ell^{0}$-SSC in terms of much less restrictive condition on subspace affinity. In order to improve the efficiency of noisy $\ell^{0}$-SSC, we propose Noisy-DR-$\ell^{0}$-SSC which provably recovers the subspaces on dimensionality reduced data. Noisy-DR-$\ell^{0}$-SSC first projects the data onto a lower dimensional space by random projection, then performs noisy $\ell^{0}$-SSC on the dimensionality reduced data for improved efficiency. Experimental results demonstrate the effectiveness of Noisy-DR-$\ell^{0}$-SSC.
Victor Picheny (Prowler.io); Henry Moss (Secondmind); Leonard Torossian (INRIA); Nicolas Durrande (Prowler.io)
Bayesian optimisation (BO) is widely used to optimise stochastic black box functions. While most BO approaches focus on optimising conditional expectations, many applications require risk-averse strategies and alternative criteria accounting for the distribution tails need to be considered. In this paper, we propose new variational models for Bayesian quantile and expectile regression that are well-suited for heteroscedastic noise settings. Our models consist of two latent Gaussian processes accounting respectively for the conditional quantile (or expectile) and the scale parameter of an asymmetric likelihood functions. Furthermore, we propose two BO strategies based on entropy search and Thompson sampling, that are tailored to such models and that can accommodate large batches of points. Contrary to existing BO approaches for risk-averse optimisation, our strategies can directly optimise for the quantile and expectile, without requiring replicating observations or assuming a parametric form for the noise. As illustrated in the experimental section, the proposed approach clearly outperforms the state of the art in the heteroscedastic, non-Gaussian case.
Petra Berenbrink (Simon Fraser University); Colin Cooper; Cristina Gava (King's College London, University of London); David Kohan Marzagao (University of Oxford); Frederik Mallmann-Trenn (King's College London); Tomasz Radzik (King's College London)
We consider a population protocol version of the SIR model. In every round, an individual is chosen uniformly at random. If the individual is susceptible, then it becomes infected w.p. $\beta I_t/N$, where $I_t$ is the number of infections at time $t$ and $N$ is the total number of individuals. If the individual is infected, then it recovers w.p. $\gamma$, whereas, if the individual is already recovered, nothing happens. We prove sharp bounds on the probability of the disease becoming pandemic vs extinguishing early (dying out quickly). The probability of extinguishing early, $\Pr{\mathcal{E}_{ext}}$, is typically neglected in prior work since most use (deterministic) differential equations. Leveraging on this, using $\Pr{\mathcal{E}_{ext}}$, we proceed by bounding the expected size of the population that contracts the disease $\mathbf{E}\left[R_\infty\right]$. Prior work only calculated $\mathbf{E}\left[R_\infty~|~\overline{\mathcal{E}_{ext}}\right]$, or obtained non-closed form solutions. We then study the two-country model also accounting for the role of $\Pr{\mathcal{E}_{ext}}$. We assume that both countries have different infection rates $\beta^{(i)}$, but share the same recovery rate $\gamma$. In this model, each round has two steps: First, an individual is chosen u.a.r. and travels w.p. $p_{travel}$ to the other country. Afterwards, the process continues as before with the respective infection rates. Finally, using simulations, we characterise the influence of $p_{travel}$ on the total number of infections. Our simulations show that, depending on the $\beta^{(i)}$, increasing $p_{travel}$ can decrease or increase the expected total number of infections $\mathbf{E}\left[R_\infty\right]$.
Sungryull Sohn (University of Michigan); Hyunjae Woo (University of Michigan); Jongwook Choi (University of Michigan); Lyubing Qiang (University of Michigan); Izzeddin Gur (Google); Aleksandra Faust (Google Brain); Honglak Lee (LG AI Research)
We tackle real-world problems with complex structures beyond the pixel-based game or simulator. We formulate it as a few-shot reinforcement learning problem where a task is characterized by a subtask graph that defines a set of subtasks and their dependencies that are unknown to the agent. Different from the previous meta-RL methods trying to directly infer the unstructured task embedding, our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks, and use it as a prior to improve the task inference in testing. Our experiment results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to the unseen tasks than various existing algorithms such as meta reinforcement learning, hierarchical reinforcement learning, and other heuristic agents.
Vaidyanathan Peruvemba Ramaswamy (TU Wien Vienna University of Technology); Stefan Szeider (TU Wien Vienna University of Technology)
We propose a new score-based algorithm for learning the structure of a Bayesian Network (BN). It is the first algorithm that simultaneously supports the requirements of (i) learning a BN of bounded treewidth (ii) satisfying expert constraints including positive and negative ancestry properties between nodes, and (iii) scaling up to BNs with several thousand nodes. The algorithm operates in two phases. In Phase 1, we utilize a modified version of the k-greedy algorithm by Scanagatta et al. (NeurIPS'16), modified to generate an initial DAG that supports a portion of the given constraints. In Phase 2, we follow the BN-SLIM framework, introduced by Peruvemba Ramaswamy and Szeider (AAAI'21). We improve the initial DAG by repeatedly running a MaxSAT solver on selected local parts. The MaxSAT encoding entails local versions of the expert constraints as hard constraints. We evaluate a prototype implementation of our algorithm on several standard benchmark sets. The encouraging results demonstrate the power and flexibility of the BN-SLIM framework. It boosts the score while increasing the number of satisfied expert constraints.
Charles K. Assaad (EasyVista); Emilie Devijver (CNRS); Eric Gaussier (University Grenoble Alpes)
This study addresses the problem of learning an extended summary causal graph on time series. The algorithms we propose fit within the well-known constraint-based framework for causal discovery and make use of information-theoretic measures to determine (in)dependencies between time series. We first introduce generalizations of the causation entropy measure to any lagged or instantaneous relations, prior to using this measure to construct extended summary causal graphs by adapting two well-known algorithms, namely PC and FCI. The behavior of our methods is illustrated through several experiments run on several simulated as well real datasets.
Gal Yona (Weizmann Institute); Shay Moran (Technion, Technion); Gal Elidan (Google); Amir Globerson (Tel Aviv University)
Supervised learning typically relies on manual annotation of the true labels. When there are many potential classes, searching for the best one can be prohibitive for a human annotator. On the other hand, comparing two candidate labels is often much easier. We focus on this type of pairwise supervision and ask how it can be used effectively in learning, and in particular in active learning. We obtain several insightful results in this context. In principle, finding the best of $k$ labels can be done with $k-1$ active queries. We show that there is a natural class where this approach is sub-optimal, and that there is a more comparison-efficient active learning scheme. A key element in our analysis is the ``label neighborhood graph'' of the true distribution, which has an edge between two classes if they share a decision boundary. We also show that in the PAC setting, pairwise comparisons cannot provide improved sample complexity in the worst case. We complement our theoretical results with experiments, clearly demonstrating the effect of the neighborhood graph on sample complexity.
Sahil Sidheekh (Verisk Analytics); Chris Barton Dock (Tufts University); Tushar Jain (New York University); Radu Balan (University of Maryland, College Park); Maneesh Kumar Singh (Verisk Analytics)
Normalizing flows provide an elegant approach to generative modeling that allows for efficient sampling and exact density evaluation of unknown data distributions. However, current techniques have significant limitations in their expressivity when the data distribution is supported on a low-dimensional manifold or has a non-trivial topology. We introduce a novel statistical framework for learning a mixture of local normalizing flows as ``chart maps'' over the data manifold. Our framework augments the expressivity of recent approaches while preserving the signature property of normalizing flows, that they admit exact density evaluation. We learn a suitable atlas of charts for the data manifold via a vector quantized auto-encoder (VQ-AE) and the distributions over them using a conditional flow. We validate experimentally that our probabilistic framework enables existing approaches to better model data distributions over complex manifolds.
Amartya Sanyal (Swiss Federal Institute of Technology); Yaxi Hu (ETHZ - ETH Zurich); Fanny Yang (Swiss Federal Institute of Technology)
As machine learning algorithms are increasingly deployed on sensitive data in critical decision making processes, it is important that they are also private and fair. When the data comprises multiple small subpopulations, in a long-tailed distribution, we prove that private learning algorithms with high average accuracy result in high error on the minority group with high probability. We further prove that relaxing overall accuracy can lead to good fairness even with strict privacy requirements. We then provide an extensive set of experiments that demonstrate how our theoretical results are reflected in a variety of differentially private algorithms (DP-SGD and DP-Random Forests) on synthetic, real-world vision (CIFAR-10 and CelebA), and tabular (Law school) datasets.
Weizhi Li (Arizona State University); Gautam Dasarathy (Arizona State University); Karthikeyan Natesan Ramamurthy (International Business Machines); Visar Berisha (Arizona State University)
Two-sample tests evaluate whether two samples are realizations of the same distribution (the null hypothesis) or two different distributions (the alternative hypothesis). We consider a new setting for this problem where sample features are easily measured whereas sample labels are unknown and costly to obtain. Accordingly, we devise a three-stage framework in service of performing an effective two-sample test with only a small number of sample label queries: first, a classifier is trained with samples uniformly labeled to model the posterior probabilities of the labels; second, a novel query scheme dubbed bimodal query is used to query labels of samples from both classes, and last, the classical Friedman-Rafsky (FR) two-sample test is performed on the queried samples. Theoretical analysis and extensive experiments performed on several datasets demonstrate that the proposed test controls the Type I error and has decreased Type II error relative to uniform querying and certainty-based querying.
You Heng (Tianjin University); Tianpei Yang (University of Alberta); Yan Zheng (Tianjin Unibersity); Jianye HAO (Tianjin University); Matthew E. Taylor (University of Alberta)
Despite the impressive success achieved in various domains, deep reinforcement learning (DRL) is still faced with the sample inefficiency problem. Transfer learning (TL), which leverages prior knowledge from different but related tasks to accelerate the target task learning, has emerged as a promising direction to improve RL efficiency. The majority of prior work considers TL across tasks with the same state-action spaces, while transferring across domains with different state-action spaces is relatively unexplored. Furthermore, such existing cross-domain transfer approaches only enable transfer from a single source policy, leaving open the important question of how to best transfer from multiple source policies. This paper proposes a novel framework called Cross-domain Adaptive Transfer (CAT) to accelerate DRL. CAT learns the state-action correspondence from each source task to the target task and adaptively transfers knowledge from multiple source task policies to the target policy. CAT can be easily combined with existing DRL algorithms and experimental results show that CAT significantly accelerates learning and outperforms other cross-domain transfer methods on multiple continuous action control tasks.
Supriyo Ghosh (Microsoft); Laura Wynter (IBM Research); Shiau Hong Lim (IBM Research); Duc Thien Nguyen (IBM Research)
We propose a framework, called neural-progressive hedging (NP), that leverages stochastic programming during the online phase of executing a reinforcement learning (RL) policy. The goal is to ensure feasibility with respect to constraints and risk-based objectives such as conditional value-at-risk (CVaR) during the execution of the policy, using probabilistic models of the state transitions to guide policy adjustments. The framework is particularly amenable to the class of sequential resource allocation problems since feasibility with respect to typical resource constraints cannot be enforced in a scalable manner. The NP framework provides an alternative that adds modest overhead during the online phase. Experimental results demonstrate the efficacy of the NP framework on two continuous real-world tasks: (i) the portfolio optimization problem with liquidity constraints for financial planning, characterized by non-stationary state distributions; and (ii) the dynamic repositioning problem in bike sharing systems, that embodies the class of supply-demand matching problems. We show that the NP framework produces policies that are better than deep RL and other baseline approaches, adapting to non-stationarity, whilst satisfying structural constraints and accommodating risk measures in the resulting policies. Additional benefits of the NP framework are ease of implementation and better explainability of the policies.
Subhojyoti Mukherjee (University of Wisconsin, Madison); Josiah P. Hanna (University of Wisconsin - Madison); Robert D Nowak (University of Wisconsin, Madison)
This paper studies the problem of data collection for policy evaluation in Markov decision processes (MDPs). In policy evaluation, we are given a \textit{target} policy and asked to estimate the expected cumulative reward it will obtain in an environment formalized as an MDP. We develop theory for optimal data collection within the class of tree-structured MDPs by first deriving an oracle exploration strategy that uses knowledge of the variance of the reward distributions. We then introduce the \textbf{Re}duced \textbf{Var}iance Sampling (\rev\!) algorithm that approximates the oracle strategy when the reward variances are unknown a priori and bound its sub-optimality compared to the oracle strategy. Finally, we empirically validate that \rev leads to policy evaluation with mean squared error comparable to the oracle strategy and significantly lower than simply running the target policy.
Yuyang Shi (University of Oxford); Valentin De Bortoli (University of Oxford); George Deligiannidis (Oxofrd, University of Oxford); Arnaud Doucet (Google DeepMind)
Denoising diffusion models have recently emerged as a powerful class of generative models. They provide state-of-the-art results, not only for unconditional simulation, but also when used to solve conditional simulation problems arising in a wide range of inverse problems such as image inpainting or deblurring. A limitation of these models is that they are computationally intensive at generation time as they require simulating a diffusion process over a long time horizon. When performing unconditional simulation, a Schr?dinger bridge formulation of generative modeling leads to a theoretically grounded algorithm shortening generation time which is complementary to other proposed acceleration techniques. We extend here the Schr?dinger bridge framework to conditional simulation. We demonstrate this novel methodology on various applications including image super-resolution and optimal filtering for state-space models.
Mayee F Chen (Stanford University); Daniel Yang Fu (Stanford University); Dyah Adila (University of Wisconsin, Madison); Michael Zhang (Stanford University); Frederic Sala (University of Wisconsin, Madison); Kayvon Fatahalian (Carnegie-Mellon University); Christopher Re (University of Wisconsin-Madison)
Foundation models offer an exciting new paradigm for constructing models with out-of-the-box embeddings and a few labeled examples. However, it is not clear how to best apply foundation models without labeled data. A potential approach is to fuse foundation models with weak supervision frameworks, which use weak label sources??re-trained models, heuristics, crowd-workers??o construct pseudolabels. The challenge is building a combination that best exploits the signal available in both foundation models and weak sources. We propose LIGER, a combination that uses foundation model embeddings to improve two crucial elements of existing weak supervision techniques. First, we produce finer estimates of weak source quality by partitioning the embedding space and learning per-part source accuracies. Second, we improve source coverage by extending source votes in embedding space. Despite the black-box nature of foundation models, we prove results characterizing how our approach improves performance and show that lift scales with the smoothness of label distributions in embedding space. On six benchmark NLP and video tasks, LIGER outperforms vanilla weak supervision by 14.1 points, weakly-supervised kNN and adapters by 11.8 points, and kNN and adapters supervised by traditional hand labels by 7.2 points.
We consider the problem of constrained Markov Decision Process (CMDP) where an agent interacts with an ergodic Markov Decision Process. At every interaction, the agent obtains a reward. Further, there are $K$ cost functions. The agent aims to maximize the long-term average reward while simultaneously keeping the $K$ long-term average costs lower than a certain threshold. In this paper, we propose \textsc{CMDP-PSRL}, a posterior sampling-based algorithm using which the agent can learn optimal policies to interact with the CMDP. Further, for MDP with $S$ states, $A$ actions, and diameter $D$, we prove that following \textsc{CMDP-PSRL} algorithm, the agent can bound the regret of not accumulating rewards from an optimal policy by $\Tilde{O}(poly(DSA)\sqrt{T})$. Further, we show that the violations for any of the $K$ constraints is also bounded by $\Tilde{O}(poly(DSA)\sqrt{T})$. To the best of our knowledge, this is the first work that obtains a $\Tilde{O}(\sqrt{T})$ regret bounds for ergodic MDPs with long-term average constraints.
Yizuo Chen (University of California, Los Angeles); Adnan Darwiche (University of California, Los Angeles)
Causal treewidth is a recent notion which allows one to speed up Bayesian network inference and to bound its complexity in the presence of functional dependencies (causal mechanisms) whose identities are unknown. Causal treewidth is no greater than treewidth and can be bounded even when treewidth is unbounded. The utility of causal treewidth has been illustrated recently in the context of causal inference and model-based supervised learning. However, the current definition of causal treewidth is descriptive rather than perspective, therefore limiting its full exploitation in a practical setting. We provide an extensive study of causal treewidth in this paper which moves us closer to realizing the full computational potential of this notion both theoretically and practically.
Kamalika Chaudhuri (Facebook); Chuan Guo (Facebook AI Research); Michael Rabbat (McGill University)
Federated data analytics is a framework for distributed data analysis where a server compiles noisy responses from a group of distributed low-bandwidth user devices to estimate aggregate statistics. Two major challenges in this framework are privacy, since user data is often sensitive, and compression, since the user devices have low network bandwidth. Prior work has addressed these challenges separately by combining standard compression algorithms with known privacy mechanisms. In this work, we take a holistic look at the problem and design a family of privacy-aware compression mechanisms that work for any given communication budget. We first propose a mechanism for transmitting a single real number that has optimal variance under certain conditions. We then show how to extend it to metric differential privacy for location privacy use-cases, as well as vectors, for application to federated learning. Our experiments illustrate that our mechanism can lead to better utility vs. compression trade-offs for the same privacy loss in a number of settings.
Alexandru Tifrea (Swiss Federal Institute of Technology); Eric Petru Stavarache (Swiss Federal Institute of Technology); Fanny Yang (Swiss Federal Institute of Technology)
Deep neural networks often predict samples with high confidence even when they come from un- seen classes and should instead be flagged for expert evaluation. Current novelty detection algorithms cannot reliably identify such near OOD points unless they have access to labeled data that is similar to these novel samples. In this paper, we develop a new ensemble-based procedure for semi-supervised novelty detection (SSND) that successfully leverages a mixture of unlabeled ID and novel-class samples to achieve good detection performance. In particular, we show how to achieve disagreement only on OOD data using early stopping regularization. While we prove this fact for a simple data distribution, our extensive experiments suggest that it holds true for more complex scenarios: our approach significantly outperforms state-of-the-art SSND methods on standard image data sets (SVHN/CIFAR-10/CIFAR-100) and medical image data sets with only a negligible increase in computation cost.
Haohan Wang (Carnegie Mellon University); Zeyi Huang (University of Wisconsin - Madison); Hanlin Zhang (School of Computer Science, Carnegie Mellon University); Yong Jae Lee (Department of Computer Science, University of Wisconsin - Madison); Eric Xing (Mohamed bin Zayed Univeristy of AI)
Machine learning has demonstrated remarkable prediction accuracy over i.i.d data, but the accuracy often drops when tested with data from another distribution. In this paper, we aim to offer another view of this problem in a perspective assuming the reason behind this accuracy drop is the reliance of models on the features that are not aligned well with how a data annotator considers similar across these two datasets. We refer to these features as misaligned features. We extend the conventional generalization error bound to a new one for this setup with the knowledge of how the misaligned features are associated with the label. Our analysis offers a set of techniques for this problem, and these techniques are naturally linked to many previous methods in robust machine learning literature. We also compared the empirical strength of these methods demonstrated the performance when these previous techniques are combined.
Daniel Alexander Nemirovsky (The Technical University of Catalonia); Nicolas Kevin Thiebaut (Paris-Sud University (Paris XI)); Ye Xu (Department of Computer Science, Dartmouth College); Abhishek Gupta
Model interpretability, fairness, and recourse for end users have increased as machine learning models have become increasingly popular in areas including criminal justice, finance, healthcare, and job marketplaces. This work presents a novel method of addressing these issues by producing meaningful counterfactuals that are aimed at providing recourse to users and highlighting potential model biases. A meaningful counterfactual is a reasonable alternative scenario that illustrates how input data perturbations can influence the model's output. The CounteRGAN method generates meaningful counterfactuals for a target classifier by utilizing a novel Residual Generative Adversarial Network (RGAN). We compare our method against leading state-of-the-art approaches on image and tabular datasets over a variety of performance metrics. The results indicate a significant improvement over existing techniques in combined metric performance, with a latency reduction of 2 to 7 orders of magnitude which enables providing real-time recourse to users.
Changlin Wan (Purdue University); Pengtao Dang (Purdue University); Tong Zhao (Amazon); Yong Zang; Chi Zhang (Indiana University, Bloomington); Sha Cao (Indiana University, School of Medicine)
Boolean matrix factorization (BMF) is a combinatorial problem arising from a wide range of applications including recommendation systems, collaborative filtering, and dimensionality reduction. Currently, the noise model of existing BMF methods is often assumed to be homoscedastic; however, in real-world data scenarios, the deviations of observed data from their true values are almost surely diverse due to stochastic noises, making each data point not equally suitable for fitting a model. In this case, it is not ideal to treat all data points equally distributed. Motivated by such observations, we introduce a probabilistic BMF model that recognizes the object- and feature-wise bias distribution respectively, called bias aware BMF (BABF).To the best of our knowledge, BABF is the first approach for Boolean decomposition with consideration of the feature-wise and object-wise bias in binary data. We conducted experiments on datasets with different levels of background noise, bias level, and sizes of the signal patterns, to test the effectiveness of our method in various scenarios. We demonstrated that our model outperforms the state-of-the-art factorization methods in both accuracy and efficiency in recovering the original datasets, and the inferred bias level is highly significantly correlated with true existing bias in both simulated and real-world datasets.
Mean-Field Control (MFC) is a powerful tool to solve Multi-Agent Reinforcement Learning (MARL) problems. Recent studies have shown that MFC can well-approximate MARL when the population size is large and the agents are exchangeable. Unfortunately, the presumption of exchangeability implies that all agents uniformly interact with one another which is not true in many practical scenarios. In this article, we relax the assumption of exchangeability and model the interaction between agents via an arbitrary doubly stochastic matrix. As a result, in our framework, the mean-field `seen' by different agents are different. We prove that, if the reward of each agent is an affine function of the mean-field seen by that agent, then one can approximate such a non-uniform MARL problem via its associated MFC problem within an error of $e=\mathcal{O}(\frac{1}{\sqrt{N}}[\sqrt{|\mathcal{X}|} + \sqrt{|\mathcal{U}|}])$ where $N$ is the population size and $|\mathcal{X}|$, $|\mathcal{U}|$ are the sizes of state and action spaces respectively. Finally, we develop a Natural Policy Gradient (NPG) algorithm that can provide a solution to the non-uniform MARL with an error $\mathcal{O}(\max\{e,\epsilon\})$ and a sample complexity of $\mathcal{O}(\epsilon^{-3})$ for any $\epsilon >0$.
Misha Glazunov (Delft University of Technology); Apostolis Zarras (Delft University of Technology)
The problem of detecting the out-of-distribution (OoD) inputs is of prominent importance for deep neural networks (DNNs). It has been previously shown that even deep generative models (DGMs) that allow estimating the density of the inputs may not be reliable and often tend to make over-confident predictions for OoDs, assigning to them a higher density than to the in-distribution data. This over-confidence of a single model can be potentially mitigated with Bayesian inference over the model parameters that take into account epistemic uncertainty. In this paper, we investigate three different approaches to Bayesian inference: stochastic gradient Markov chain Monte Carlo, Bayes by backprop, and Stochastic Weight Averaging-Gaussian. The inference is implemented over the weights of the deep neural networks that parameterize the likelihood of the variational autoencoder (VAE). We empirically evaluate the approaches against several benchmarks that are often used for OoD detection: estimation of the marginal likelihood utilizing sampled model ensemble, typicality test, disagreement score, and Watanabe-Akaike Information Criterion. Finally, we introduce a simple test based on the idea of the information entropy that demonstrates the state-of-the-art performance.
Soumyajit Gupta (University of Texas, Austin); Gurpreet Singh (University of Texas, Austin); Raghu Bollapragada; Matthew Lease (Amazon)
Multi-objective optimization (MOO) problems require balancing competing objectives, often under constraints. The Pareto optimal solution set defines all possible optimal trade-offs over such objectives. In this work, we present a novel method for Pareto-front learning: inducing the full Pareto manifold at train-time so users can pick any desired optimal trade-off point at run-time. Our key insight is to exploit Fritz-John Conditions for a novel guided double gradient descent strategy. Evaluation on synthetic benchmark problems allows us to vary MOO problem difficulty in controlled fashion and measure accuracy v.s. known analytic solutions. We further test scalability and generalization in learning optimal neural model parameterizations for Multi-Task Learning (MTL) on image classification. Results show consistent improvement in accuracy and efficiency over prior MTL methods as well as techniques from operations research.
We consider an online optimization problem in a bandit setting in which a learner chooses decisions from a continuous decision set at discrete decision epochs, and receives noisy rewards from the environment in response. While the noise samples are assumed to be independent and sub-Gaussian, the mean reward at each epoch is a fixed but unknown linear function of a feature vector, which depends on the decision through a known (and possibly nonlinear) feature map. We study the problem within the framework of best-arm identification with fixed confidence, and provide a template algorithm for approximately learning the optimal decision in a PAC setting. More precisely, the template algorithm samples the decision space till a stopping condition is met, and returns a subset of decisions such that, with the required confidence, every element of the subset is approximately optimal for the unknown mean reward function. We provide a sample complexity bound for the template algorithm and then specialize it to the case where the mean-reward function is a univariate polynomial of a single decision variable. We provide an implementable algorithm for this case by explicitly instantiating all the steps in the template algorithm. Finally, we provide experimental results to demonstrate the efficacy of our algorithms.
Leena Chennuru Vankadara (University of Tuebingen); Philipp Michael Faller (Amazon); Michaela Hardt (Amazon); Lenon Minorics (Amazon Development Center Germany); Debarghya Ghoshdastidar (Technical University Munich); Dominik Janzing (Amazon)
Despite the increasing relevance of forecasting methods, the causal implications of these algorithms remain largely unexplored. This is concerning considering that, even under simplifying assumptions such as causal sufficiency, the statistical risk of a model can differ significantly from its \textit{causal risk}. Here, we study the problem of \textit{causal generalization}---generalizing from the observational to interventional distributions---in forecasting. Our goal is to find answers to the question: How does the efficacy of an autoregressive (VAR) model in predicting statistical associations compare with its ability to predict under interventions? To this end, we introduce the framework of \textit{causal learning theory} for forecasting. Using this framework, we obtain a characterization of the difference between statistical and causal risks, which helps identify sources of divergence between them. Under causal sufficiency, the problem of causal generalization amounts to learning under covariate shifts albeit with additional structure (restriction to interventional distributions). This structure allows us to obtain uniform convergence bounds on causal generalizability for the class of VAR models. To the best of our knowledge, this is the first work that provides theoretical guarantees for causal generalization in the time-series setting.
Michel Besserve (MPI for Intelligent Systems); Bernhard Scholkopf (Max-Planck Institute)
Complex systems often contain feedback loops that can be described as cyclic causal models. Intervening in such systems may lead to counterintuitive effects, which cannot be inferred directly from the graph structure. After establishing a framework for differentiable interventions based on Lie groups, we take advantage of modern automatic differentiation techniques and their application to implicit functions in order to optimize interventions in cyclic causal models. We illustrate the use of this framework by investigating scenarios of transition to sustainable economies.
Grace Deng (Cornell University); David S. Matteson (Cornell University)
We present Bayesian Spillover Graphs (BSG), a novel method for learning temporal relationships, identifying critical nodes, and quantifying uncertainty for multi-horizon spillover effects in a dynamic system. BSG leverages both an interpretable framework via forecast error variance decompositions (FEVD) and comprehensive uncertainty quantification via Bayesian time series models to contextualize temporal relationships in terms of systemic risk and prediction variability. Forecast horizon hyperparameter $h$ allows for learning both short-term and equilibrium state network behaviors. Experiments for identifying source and sink nodes under various graph and error specifications show significant performance gains against state-of-the-art Bayesian Networks and deep-learning baselines. Applications to real-world systems also showcase BSG as an exploratory analysis tool for uncovering indirect spillovers and quantifying risk.
Huaqing Xiong (Ohio State University); Tengyu Xu (Ohio State University); Lin Zhao (National University of Singapore); Yingbin Liang (The Ohio State University); Wei Zhang (Southern University of Science and Technology of China)
The deterministic policy gradient (DPG) method proposed in Silver et al. (2014) has been demonstrated to exhibit superior performance particularly for applications with multi-dimensional and continuous action spaces. However, it remains unclear whether DPG converges, and if so, how fast it converges and whether it converges as efficiently as other PG methods. In this paper, we provide a theoretical analysis of DPG to answer those questions. We study the single timescale DPG (often the case in practice) in both on-policy and off-policy settings, and show that both algorithms attain an $\epsilon$-accurate stationary policy with a sample complexity of $\mathcal{O}(\epsilon^{-2})$. Moreover, we establish the convergence rate for DPG under Gaussian noise exploration, which is widely adopted in practice to improve the performance of DPG. To our best knowledge, this is the first non-asymptotic convergence characterization for DPG methods.
Jinwoo Go (Georgia Institute of Technology); Tobin Isaac (Georgia Tech Research Corporation)
The ranking of experiments by expected information gain (EIG) in Bayesian experimental design is sensitive to changes in the model's prior distribution, and the approximation of EIG yielded by sampling will have errors similar to the use of a perturbed prior. We define and analyze \emph{robust expected information gain} (REIG), a modification of the objective in EIG maximization by minimizing an affine relaxation of EIG over an ambiguity set of distributions that are close to the original prior in KL-divergence. We show that, when combined with a sampling-based approach to estimating EIG, REIG corresponds to a `log-sum-exp' stabilization of the samples used to estimate EIG, meaning that it can be efficiently implemented in practice. Numerical tests combining REIG with variational nested Monte Carlo (VNMC), adaptive contrastive estimation (ACE) and mutual information neural estimation (MINE) suggest that in practice REIG also compensates for the variability of under-sampled estimators.
Han Wu (Stanford University); Stefan Wager (Stanford University)
We consider the problem of deciding how best to target and prioritize existing vaccines that may offer protection against new variants of an infectious disease. Sequential experiments are a promising approach; however, challenges due to delayed feedback and the overall ebb and flow of disease prevalence make available method inapplicable for this task. We present a method, partial likelihood Thompson sampling, that can handle these challenges. Our method involves running Thompson sampling with belief updates determined by partial likelihood each time we observe an event. To test our approach, we ran a semi-synthetic experiment based on 200 days of COVID-19 infection data in the US.
Tristan Deleu (University of Montreal); Antonio Gois (Montreal Institute for Learning Algorithms, University of Montreal); Chris Chinenye Emezue (Technical University Munich); Mansi Rankawat (Georgia Institute of Technology); Simon Lacoste-Julien (University of Montreal); Stefan Bauer (Swiss Federal Institute of Technology); Yoshua Bengio (University of Montreal)
In Bayesian structure learning, we are interested in inferring a distribution over the directed acyclic graph (DAG) structure of Bayesian networks, from data. Defining such a distribution is very challenging, due to the combinatorially large sample space, and approximations based on MCMC are often required. Recently, a novel class of probabilistic models, called Generative Flow Networks (GFlowNets), have been introduced as a general framework for generative modeling of discrete and composite objects, such as graphs. In this work, we propose to use a GFlowNet as an alternative to MCMC for approximating the posterior distribution over the structure of Bayesian networks, given a dataset of observations. Generating a sample DAG from this approximate distribution is viewed as a sequential decision problem, where the graph is constructed one edge at a time, based on learned transition probabilities. Through evaluation on both simulated and real data, we show that our approach, called DAG-GFlowNet, provides an accurate approximation of the posterior over DAGs, and it compares favorably against other methods based on MCMC or variational inference.
Yuchen Zhu (University College London); Limor Gultchin (University of Oxford); Arthur Gretton (University College London); Matt Kusner (University College London); Ricardo Silva (University College London)
We propose a kernel-based nonparametric estimator for the causal effect when the cause is corrupted by error. We do so by generalizing estimation in the instrumental variable setting. Despite significant work on regression with measurement error, additionally handling unobserved confounding in the continuous setting is non-trivial: we have not seen any prior work. As a by-product of our investigation, we establish a connection between mean embeddings and characteristic functions, and how learning one simultaneously allows one to learn the other. This opens the way for kernel method research to leverage existing results in characteristic function estimation. Finally, we empirically show that our proposed method, MEKIV, improves over baselines and is robust under changes in the strength of measurement error and to the type of error distributions.
Ian A. Kash (University of Illinois, Chicago); Zhongkai Wen (University of Illinois, Chicago); Lenore Zuck (University of Illinois, Chicago)
To address spatial imbalances in the supply and demand of drivers, ridesharing platforms can make use of policies to direct driver relocation. We study a simple model of this problem, which allows us to give a constructive characterization of the unique fixpoint of system dynamics. Using this construction, we design a dynamic policy that provides stronger, than previous work, guarantees about its rate of convergence to the fixpoint. Simulations demonstrate the benefits of our approach.
Pola Schwobel (Technical University of Denmark); Frederik Rahbak Warburg (Technical University of Denmark); Martin Jorgensen (University of Oxford); Kristoffer Hougaard Madsen (Technical University of Denmark); Soren Hauberg (Technical University of Denmark)
Spatial Transformer Networks (STNs) estimate image transformations that can improve downstream tasks by ''zooming in'' on relevant regions in an image. However, STNs are hard to train and sensitive to mis-predictions of transformations. To circumvent these limitations, we propose a probabilistic extension that estimates a stochastic transformation rather than a deterministic one. Marginalising transformations allows us to consider each image at multiple poses, which makes the localisation task easier and the training more robust. As an additional benefit, the stochastic transformations act as a localised, learned data augmentation which improves the downstream tasks. We show across standard imaging benchmarks and on a challenging real world dataset that these two properties lead to improved classification performance, robustness and model calibration. We further demonstrate that the approach generalises to non-visual domains by improving model performance on time-series data.
Krishna C Kalagarla (University of Southern California); Dhruva Kartik (University of Southern California); Rahul Jain (University of Southern California); Pierluigi Nuzzo (University of Southern California); Ashutosh Nayyar (University of Southern California); Dongming Shen (University of Southern California)
Autonomous agents often operate in environments where the state is partially observed. In addition to maximizing their cumulative reward, agents must execute complex tasks with rich temporal and logical structures. These tasks can be expressed using temporal logic languages like finite linear temporal logic. This paper, for the first time, provides a structured framework for designing agent policies that maximize the reward while ensuring that the probability of satisfying the temporal logic specification is sufficiently high. We reformulate the problem as a constrained partially observable Markov decision process (POMDP) and provide a novel approach that can leverage off-the-shelf unconstrained POMDP solvers for solving it. Our approach guarantees approximate optimality and constraint satisfaction with high probability. We demonstrate its effectiveness by implementing it on several models of interest.
We propose an end-to-end online multi-object tracking (MOT) framework with a systematized event-aware loss, which is designed to control possible occurrences in an online MOT situation and compel the tracker to take appropriate actions when such events occur. Training samples from real candidates using a simulation tracker are generated, and a systematized event-aware association matrix is constructed for every frame to enable the tracker to learn the ideal action in a running environment. Several experiments, including ablation studies on various public MOT benchmark datasets, are conducted. The experimental results verify that each event affecting the tracking measure can be controlled, and the proposed method presents optimal results compared with recent state-of-the-art MOT methods.
Maxim Samarin (University of Basel); Volker Roth (University of Basel); David Belius (University of Basel)
The Neural Tangent Kernel is an important milestone in the ongoing effort to build a theory for deep learning. Its prediction that sufficiently wide neural networks behave as kernel methods, or equivalently as random feature models arising from linearized networks, has been confirmed empirically for certain wide architectures. In this paper, we compare the performance of two common finite-width convolutional neural networks, LeNet and AlexNet, to their linearizations on common benchmark datasets like MNIST and modified versions of it, CIFAR-10 and an ImageNet subset. We demonstrate empirically that finite-width neural networks, generally, greatly outperform the finite-width linearization of these architectures. When increasing the problem difficulty of the classification task, we observe a larger gap which is in line with common intuition that finite-width neural networks perform feature learning which finite-width linearizations cannot. At the same time, finite-width linearizations improve dramatically with width, approaching the behavior of the wider standard networks which in turn perform slightly better than their standard width counterparts. Therefore, it appears that feature learning for non-wide standard networks is important but becomes less significant with increasing width. We furthermore identify cases where both standard and linearized networks match in performance, in agreement with NTK theory, and a case where a wide linearization outperforms its standard width counterpart.
Keyang He (University of Georgia); Prashant Doshi (University of Georgia); Bikramjit Banerjee (University of Southern Mississippi)
Recent renewed interest in multi-agent reinforcement learning (MARL) has generated an impressive array of techniques that leverage deep reinforcement learning, primarily actor-critic architectures, and can be applied to a limited range of settings in terms of observability and communication. However, a continuing limitation of much of this work is the curse of dimensionality when it comes to representations based on joint actions, which grow exponentially with the number of agents. In this paper, we squarely focus on this challenge of scalability. We apply the key insight of action anonymity to a recently presented actor-critic based MARL algorithm, Interactive A2C. We introduce a Dirichlet-multinomial model based method for maintaining beliefs over the agent population when agents' actions are not perfectly observable. We show that the posterior is a mixture of Dirichlet distributions that we approximate as a single component for tractability. We also show that the prediction accuracy of this method increases with more agents. Finally we show empirically that our method can learn optimal behaviors in two recently introduced pragmatic domains with large agent population, and demonstrates robustness in partially observable environments.
Jinglin Chen (University of Illinois, Urbana Champaign); Nan Jiang (University of Illinois, Urbana Champaign)
We consider a challenging theoretical problem in offline reinforcement learning (RL): obtaining sample-efficiency guarantees with a dataset lacking sufficient coverage, under only realizability-type assumptions for the function approximators. While the existing theory has addressed learning under realizability and under non-exploratory data separately, no work has been able to address both simultaneously (except for a concurrent work which we compare to in detail). Under an additional gap assumption, we provide guarantees to a simple pessimistic algorithm based on a version space formed by marginalized importance sampling, and the guarantee only requires the data to cover the optimal policy and the function classes to realize the optimal value and density-ratio functions. While similar gap assumptions have been used in other areas of RL theory, our work is the first to identify the utility and the novel mechanism of gap assumptions in offline RL.
Chien Lu (Tampere University); Jaakko Peltonen (Tampere University); Timo Nummenmaa (Tampere University); Jyrki Nummenmaa (Tampere University)
In graph data, each node often serves multiple functionalities. However, most graph embedding models assume that each node can only possess one representation. We address this issue by proposing a nonparametric graph embedding model. The model allows each node to learn multiple representations where they are needed to represent the complexity of random walks in the graph. It extends the Exponential family graph embedding model with two nonparametric prior settings, the Dirichlet process, and the uniform process. The model combines the ability of Exponential family graph embedding to take the number of occurrences of context nodes into account with nonparametric priors giving it the flexibility to learn more than one latent representation for each node. The learned embedding outperforms other state-of-the-art approaches in node classification and link prediction tasks.
Andrea Baisero (Northeastern University); Brett Daley (Northeastern University); Christopher Amato (Northeastern University)
Offline training in simulated partially observable environments allows reinforcement learning methods to exploit privileged state information through a mechanism known as asymmetry. Such privileged information has the potential to greatly improve the optimal convergence properties, if used appropriately. However, current research in asymmetric reinforcement learning is often heuristic in nature, with few connections to underlying theory or theoretical guarantees, and is primarily tested through empirical evaluation. In this work, we develop the theory of \emph{asymmetric policy iteration}, an exact model-based dynamic programming solution method, and then apply relaxations which eventually result in \emph{asymmetric DQN}, a model-free deep reinforcement learning algorithm. Our theoretical findings are complemented and validated by empirical experimentation performed in environments which exhibit significant amounts of partial observability, and require both information gathering strategies and memorization.
Esther Rolf (University of California, Berkeley); Nikolay Malkin (Mila - Quebec AI institute); Alexandros Graikos (State University of New York, Stony Brook); Ana Jojic (University of Washington); Caleb Robinson (Microsoft); Nebojsa Jojic (Microsoft Research)
We propose a method for jointly inferring labels across a collection of data samples, where each sample consists of an observation and a prior belief about the label. By implicitly assuming the existence of a generative model for which a differentiable predictor is the posterior, we derive a training objective that allows learning under weak beliefs. This formulation unifies various machine learning settings; the weak beliefs can come in the form of noisy or incomplete labels, likelihoods given by a different prediction mechanism on auxiliary input, or common-sense priors reflecting knowledge about the structure of the problem at hand. We demonstrate the proposed algorithms on diverse problems: classification with negative training examples, learning from rankings, weakly and self-supervised aerial imagery segmentation, co-segmentation of video frames, and coarsely supervised text classification.
Tom Claassen (Radboud University Nijmegen); Ioan Gabriel Bucur (Radboud University Nijmegen)
We present Greedy PAG Search (GPS) for score-based causal discovery over equivalence classes, similar to the famous Greedy Equivalence Search algorithm [Chickering,2002], except now in the presence of latent confounders. It is based on a novel characterization of Markov equivalence classes for MAGs, that not only improves on state-of-the-art identification of Markov equivalence between MAGs, but also allows for efficient traversal over equivalence classes in the space of all MAGs. The resulting GPS algorithm is evaluated against several existing alternatives and found to show promising performance, both in terms of speed and accuracy.
Ragib Ahsan (University of Illinois, Chicago); Zahra Fatemi (University of Illinois, Chicago); David Arbour (Adobe Systems); Elena Zheleva (University of Illinois, Chicago)
Independence testing plays a central role in statistical and causal inference from observational data. Standard independence tests assume that the data samples are independent and identically distributed (i.i.d.) but that assumption is violated in many real- world datasets and applications centered on relational systems. This work examines the problem of estimating independence in data drawn from relational systems by defining sufficient representations for the sets of observations influencing individual instances. Specifically, we define marginal and conditional independence tests for relational data by considering the kernel mean embedding as a flexible aggregation function for relational variables. We propose a consistent, non-parametric, scalable kernel test to operationalize the relational independence test for non-i.i.d. observational data under a set of structural assumptions. We empirically evaluate our proposed method on a variety of synthetic and semi-synthetic networks and demonstrate its effectiveness compared to state-of-the-art kernel-based independence tests.
Shujian Yu (UiT - The Arctic University of Norway); Francesco Alesiani (NEC); Wenzhe Yin (University of Amsterdam); Robert Jenssen (UiT The Arctic University of Norway); Jose C Principe (University of Florida)
Graph sparsification aims to reduce the number of edges of a graph while maintaining its structural properties. In this paper, we propose the first general and effective information-theoretic formulation of graph sparsification, by taking inspiration from the Principle of Relevant Information (PRI). To this end, we extend the PRI from a standard scalar random variable setting to structured data (i.e., graphs). Our Graph-PRI objective is achieved by operating on the graph Laplacian, made possible by expressing the graph Laplacian of a subgraph in terms of a sparse edge selection vector $\mathbf{w}$. We provide both theoretical and empirical justifications on the validity of our Graph-PRI. We also analyze its analytical solutions in a few special cases. We finally present three representative real-world applications, namely graph sparsification, graph regularized multi-task learning, and medical imaging-derived brain network classification, to demonstrate the effectiveness, the versatility and the enhanced interpretability of our approach over prevalent sparsification techniques.
"The Bradley-Terry-Luce (BTL) model is a popular statistical approach for estimating the global ranking of a collection of items of interest using pairwise comparisons. To ensure accurate ranking, it is essential to obtain precise estimates of the model parameters in the $\ell_{\infty}$-loss. The difficulty of this task depends crucially on the topology of the pairwise comparison graph over the given items. However, beyond very few well-studied cases, such as the complete and Erd\""os-R\'enyi comparison graphs, little is known about the performance of the maximum likelihood estimator (MLE) of the BTL model parameters in the $\ell_{\infty}$-loss under more general graph topologies. In this paper, we derive novel, general upper bounds on the $\ell_{\infty}$ estimation error of the BTL MLE that depend explicitly on the algebraic connectivity of the comparison graph, the maximal performance gap across items and the sample complexity. We demonstrate that the derived bounds perform well and in some cases are sharper compared to known results obtained using different loss functions and more restricted assumptions and graph topologies. We carefully compare our results to Yan et al. (2012), which is closest in spirit to our work. We further provide minimax lower bounds under $\ell_{\infty}$-error that nearly match the upper bounds over a class of sufficiently regular graph topologies. Finally, we study the implications of our $\ell_{\infty}$-bounds for efficient tournament design. We illustrate and discuss our findings through various examples and simulations."
Wesley Maddox (New York University); Andres Potapcynski (New York University); Andrew Gordon Wilson (New York University)
Low precision arithmetic has had a transformative effect on the training of neural networks, reducing computation, memory and energy requirements. However, despite their promise, low precision operations have received little attention for Gaussian process (GP) training, largely because GPs require sophisticated linear algebra routines that are unstable in low precision. We study the different failure modes that can occur when training GPs in half-precision. To circumvent these failure modes, we propose a multi-faceted approach involving conjugate gradients with re-orthogonalization, mixed precision, compact kernels, and preconditioners. Our approach significantly improves the numerical stability and practical performance of conjugate gradients in low precision over a wide range of settings, and reduces the runtime of $1.8$ million data points to $10$ hours on a single GPU, without requiring any sparse approximations.
Yikai Zhang (Morgan Stanley); Wenjia Zhang (Rutgers University); Sammy Bald (City University of New York, City University of New York); Vamsi Pritham Pingali (State University of New York, Stony Brook); Chao Chen (State University of New York, Stony Brook); Mayank Goswami (CUNY Queens College)
Stochastic Gradient Descent (SGD) based meth-ods have been widely used for training large-scalemachine learning models that also generalize wellin practice. Several explanations have been offeredfor this generalization performance, a prominentone being algorithmic stability [Hardt et al., 2016].However, there are no known examples of smoothloss functions for which the analysis can be shownto be tight. Furthermore, apart from properties ofthe loss function, data distribution has also beenshown to be an important factor in generalizationperformance. This raises the question: is the stabil-ity analysis of [Hardt et al., 2016] tight for smoothfunctions, and if not, for what kind of loss func-tions and data distributions can the stability analy-sis be improved?In this paper we first settle open questions regard-ing tightness of bounds in the data-independentsetting: we show that for general datasets, the ex-isting analysis for convex and strongly-convex lossfunctions is tight, but it can be improved for non-convex loss functions. Next, we give novel and im-proved data-dependent bounds: we show stabilityupper bounds for a large class of convex regular-ized loss functions, withnegligible regularizationparameters, and improve existing data-dependentbounds in the non-convex setting. We hope that ourresults will initiate further efforts to better under-stand the data-dependent setting under non-convexloss functions, leading to an improved understand-ing of generalization abilities of deep networks
Alexander K. Lew (Massachusetts Institute of Technology); Marco Cusumano-Towner (Massachusetts Institute of Technology); Vikash Mansinghka (Massachusetts Institute of Technology)
A key challenge in applying Monte Carlo and variational inference (VI) is the design of proposals and variational families that are ??xible enough to closely approximate the posterior, but simple enough to admit tractable densities and variational bounds. This paper presents recursive auxiliaryvariable inference (RAVI), a new framework for exploiting ??xible proposals, for example based on involved simulations or stochastic optimization, within Monte Carlo and VI algorithms. The key idea is to estimate intractable proposal densities via meta-inference: additional Monte Carlo or variational inference targeting the proposal, rather than the model. RAVI generalizes and uni??s several existing methods for inference with expressive approximating families, which we show correspond to speci?? choices of meta-inference algorithm, and provides new theory for analyzing their bias and variance. We illustrate RAVI?? design framework and theorems by using them to analyze and improve upon Salimans et al.?? Markov Chain Variational Inference, and to design a novel sampler for Dirichlet process mixtures, achieving state-of-the-art results on a standard benchmark dataset from astronomy and on a challenging data-cleaning task with Medicare hospital data.
Sampling and Variational Inference (VI) are two large families of methods for approximate inference that have complementary strengths. Sampling methods excel at approximating arbitrary probability distributions, but can be inefficient. VI methods are efficient, but may misrepresent the true distribution. Here, we develop a general framework where approximations are stochastic mixtures of simple component distributions. Both sampling and VI can be seen as special cases: in sampling, each mixture component is a delta-function and is chosen stochastically, while in standard VI a single component is chosen to minimize divergence. We derive a practical method that interpolates between sampling and VI by solving an optimization problem over a mixing distribution. Intermediate inference methods then arise by varying a single parameter. Our method provably improves on sampling (reducing variance) and on VI (reducing bias+variance despite increasing variance). We demonstrate our method's bias/variance trade-off in practice on reference problems, and we compare outcomes to commonly used sampling and VI methods. This work takes a step towards a highly flexible yet simple family of inference methods that combines the complementary strengths of sampling and VI.
We study the problem of characterizing the expected hitting times for a robust generalization of continuous-time Markov chains. This generalization is based on the theory of imprecise probabilities, and the models with which we work essentially constitute sets of stochastic processes. Their inferences are tight lower- and upper bounds with respect to variation within these sets. We consider three distinct types of these models, corresponding to different levels of generality and structural independence assumptions on the constituent processes. Our main results are twofold; first, we demonstrate that the hitting times for all three types are equivalent. Moreover, we show that these inferences are described by a straightforward generalization of a well-known linear system of equations that characterizes expected hitting times for traditional time-homogeneous continuous-time Markov chains.
Kira A. Selby (University of Waterloo); Ahmad Rashid (Huawei Technologies Ltd.); Ivan Kobyzev (Huawei Noah's Ark Lab); Mehdi Rezagholizadeh (McGill University); Pascal Poupart (University of Waterloo)
We propose a general deep architecture for learning functions on multiple permutation-invariant sets. We also show how to generalize this architecture to sets of elements of any dimension by dimension equivariance. We demonstrate that our architecture is a universal approximator of these functions, and show superior results to existing methods on a variety of tasks including counting tasks, alignment tasks, distinguishability tasks and statistical distance measurements. This last task is quite important in Machine Learning. Although our approach is quite general, we demonstrate that it can generate approximate estimates of KL divergence and mutual information that are more accurate than previous techniques that are specifically designed to approximate those statistical distances.
kareem ahmed (University of California, Los Angeles); Eric Wang (University of California, Los Angeles); Kai-Wei Chang (University of California, Los Angeles); Guy Van den Broeck (University of California, Los Angeles)
In structured prediction, the goal is to jointly predict many output variables that together encode a structured object -- a path in a graph, an entity-relation triple, or an ordering of objects. Such a large output space makes learning hard and requires vast amounts of labeled data. Different approaches leverage alternate sources of supervision. One approach -- entropy regularization -- posits that decision boundaries should lie in low-probability regions. It extracts supervision from unlabeled examples, but remains agnostic to the structure of the output space. Conversely, neuro-symbolic approaches exploit the knowledge that not every prediction corresponds to a \emph{valid} structure in the output space. Yet, they do not further restrict the learned output distribution. This paper introduces a framework that unifies both approaches. We propose a loss, neuro-symbolic entropy regularization, that encourages the model to confidently predict a valid object. It is obtained by restricting entropy regularization to the distribution over only the valid structures. This loss can be computed efficiently when the output constraint is expressed as a tractable logic circuit. Moreover, it seamlessly integrates with other neuro-symbolic losses that eliminate invalid predictions. We demonstrate the efficacy of our approach on a series of semi-supervised and fully-supervised structured-prediction experiments, where it leads to models whose predictions are more accurate as well as more likely to be valid.
Ming Yin (UC, Santa Barbara); Wenjing Chen (Texas A&M University - College Station); Mengdi Wang (Princeton University); Yu-Xiang Wang (UC Santa Barbara)
Goal-oriented Reinforcement Learning, where the agent needs to reach the goal state while simultaneously minimizing the cost, has received significant attention in real-world applications. Its theoretical formulation, stochastic shortest path (SSP), has been intensively researched in the online setting. Nevertheless, it remains understudied when such an online interaction is prohibited and only historical data is provided. In this paper, we consider the offline stochastic shortest path problem when the state space and the action space are finite. We design the simple value iteration-based algorithms for tackling both offline policy evaluation (OPE) and offline policy learning tasks. Notably, our analysis over these simple algorithms yields strong instance-dependent bounds which can imply worst-case bounds that are near-minimax optimal. We hope our study could help illuminate the fundamental statistical limits of the offline SSP problem and motive its further studies beyond the scope of current consideration.
hanqi yan (The university of Warwick); Lin Gui (University of Warwick); Wenjie Li (The Hong Kong Polytechnic University, The Hong Kong Polytechnic University); Yulan He (The university of Warwick)
Token uniformity is commonly observed in transformer-based models, in which different tokens share a large proportion of similar information after going through stacked multiple self-attention layers in a transformer. In this paper, we propose to use the distribution of singular values of outputs of each transformer layer to characterise the phenomenon of token uniformity and empirically illustrate that a less skewed singular value distribution can alleviate the token uniformity problem. Base on our observations, we define several desirable properties of singular value distributions and propose a novel transformation function for updating the singular values. We show that apart from alleviating token uniformity, the transformation function should preserve the local neighbourhood structure in the original embedding space. Our proposed singular value transformation function is applied to a range of transformer-based language models such as BERT, ALBERT, RoBERTa and DistilBERT, and improved performance is observed in semantic textual similarity evaluation and a range of GLUE tasks.
Rohith Peddi (University of Texas, Dallas); Tahrima Rahman (University of Texas, Dallas); Vibhav Giridhar Gogate (University of Texas at Dallas)
Tractable probabilistic models (TPMs) compactly represent a joint probability distribution over a large number of random variables and admit polynomial time computation of (1) exact likelihoods; (2) marginal probability distributions over a small subset of variables given evidence; and (3) in some cases most probable explanations overall non-observed variables given observations. In this paper, we leverage these tractability properties to solve the robust maximum likelihood parameter estimation task in TPMs under the assumption that a TPM structure and complete training data is provided as input. Specifically, we show that TPMs learned by optimizing the likelihood perform poorly when data is subject to adversarial attacks/noise/perturbations/corruption and we can address this issue by optimizing robust likelihood. To this end, we develop an efficient approach for constructing uncertainty sets that model data corruption in TPMs and derive an efficient gradient-based local search method for learning TPMs that are robust against these uncertainty sets. We empirically demonstrate the efficacy of our proposed approach on a collection of benchmark datasets.
Elita Lobo (University of New Hampshire); Harvineet Singh (New York University); Marek Petrik (, University of New Hampshire); Cynthia Rudin (Duke University); Himabindu Lakkaraju (Harvard University)
Off-policy Evaluation (OPE) methods are a crucial tool for evaluating policies in high-stakes domains such as healthcare, where exploration is often infeasible or expensive. However, the extent to which such methods can be trusted under adversarial threats to data quality is largely unexplored. In this work, we make the first attempt at investigating the sensitivity of OPE methods to marginal adversarial perturbations in the data. We design a generic data poisoning attack framework leveraging influence functions from robust statistics to carefully construct perturbations that maximize error in the policy value estimates. We carry out extensive experimentation with multiple healthcare and control datasets. Our results demonstrate that many existing OPE methods are highly prone to generating value estimates with large errors when subject to data poisoning attacks, even for small adversarial perturbations. To combat this problem, we suggest ways to identify and improve the robustness of OPE methods.
Daniele Tramontano (The Technical University of Munich); Anthea Monod (Imperial College London, Imperial College London); Mathias Drton (The Technical University of Munich)
In the context of graphical causal discovery, we adapt the versatile framework of linear non-Gaussian acyclic models (LiNGAMs) to propose new algorithms to efficiently learn graphs that are polytrees. Our approach combines the Chow--Liu algorithm, which first learns the undirected tree structure, with novel schemes to orient the edges by imposing an algebraic condition on moments of the data-generating distribution. This algebraic edge orientation gains significant efficiency over current algorithms that assess conditional independence. We establish high-dimensional consistency results and compare the algorithms in numerical experiments.
Yao Liu (Computer Science Department, Stanford University); Yannis Flet-Berliac (Inria (SequeL team) / Univ. Lille); Emma Brunskill (Stanford University)
Offline policy optimization has a critical impact on many real-world decision-making problems, as online learning is costly and concerning in many applications. Importance sampling and its variants are a widely used type of estimator in offline policy evaluation. They can be helpful to remove as- sumptions on the chosen function approximations used to represent value functions and process mod- els. In this paper, we identify an important overfitting phenomenon in optimizing the importance weighted return, and propose an algorithm to avoid this overfitting. We provide a theoretical justification of the proposed algorithm through a better per-state-neighborhood normalization constraint and show the limitations of previous attempts to this approach. We test our proposed method in a healthcare-inspired simulator, a logged dataset collected from real hospitals and a continuous control task. These experiments show the proposed method yields less overfitting and better test performance compared to state-of- the-art batch reinforcement learning algorithms.
Guanyu Nie (Iowa State University); Mridul Agarwal (Purdue University); Abhishek Kumar Umrawal (Purdue University); Vaneet Aggarwal (Purdue University); Christopher John Quinn (Iowa State University)
We investigate the problem of combinatorial multi-armed bandits with stochastic submodular (in expectation) rewards and full-bandit feedback, where no extra information other than the reward of selected action at each time step $t$ is observed. We propose a simple algorithm, Explore-Then-Commit Greedy (ETCG) and prove that it achieves a $(1-1/e)$-regret upper bound of $\mathcal{O}(n^\frac{1}{3}k^\frac{4}{3}T^\frac{2}{3}\log(T)^\frac{1}{2})$ for a horizon $T$, number of base elements $n$, and cardinality constraint $k$. We also show in experiments with synthetic and real-world data that the ETCG empirically outperforms other full-bandit methods.
Haydn Jones (Los Alamos National Laboratory); Jacob M. Springer (Los Alamos National Laboratory); Garrett T. Kenyon (Los Alamos National Laboratory); Juston Moore (Los Alamos National Laboratory)
Previous work has shown that commonly-used metrics for comparing representations between neural networks overestimate similarity due to correlations between data points. We show that intra-example feature correlations also causes significant overestimation of network similarity and propose an image inversion technique to analyze only the features used by a network. With this technique, we find that similarity across architectures is significantly lower than commonly understood, but we surprisingly find that similarity between models with different architectures increases as the adversarial robustness of the models increase. Our findings indicate that robust networks tend towards a universal set of representations, regardless of architecture, and that the robust training criterion is a strong prior constraint on the functions that can be learned by diverse modern architectures. We also find that the representations learned by a robust network of any architecture have an asymmetric overlap with non-robust networks of many architectures, indicating that the representations used by robust neural networks are highly entangled with the representations used by non-robust networks.
Bingshan Hu (University of Alberta); Nidhi Hegde (University of Alberta)
We address differentially private stochastic bandits. We present two optimal Thompson Sampling-based learning algorithms: DP-TS and Lazy-DP-TS. The core idea in achieving optimality of our algorithms is the principle of optimism in the face of uncertainty. We reshape the posterior distribution in an optimistic way as compared to the non-private Thompson Sampling. Our DP-TS achieves a $\sum\limits_{j \in \mathcal{A}: \Delta_j > 0} O \left(\frac{\log(T)}{\min \left\{\epsilon, \Delta_j \right\} )} \log \left(\frac{\log(T)}{\epsilon \cdot \Delta_j} \right) \right)$ regret bound, where $\mathcal{A}$ is an arm set, $\Delta_j$ is the sub-optimality gap of a sub-optimal arm $j$, and $\epsilon$ is the privacy parameter. Our Lazy-DP-TS gets rid of the extra $\log$ factor by using the idea of dropping observations. The regret of Lazy-DP-TS is at most $ \sum\limits_{j \in \mathcal{A}: \Delta_j > 0} O \left(\frac{\log(T)}{\min \left\{\epsilon, \Delta_j \right\}} \right)$, which matches the regret lower bound. Additionally, we conduct experiments to compare the empirical performance of our proposed algorithms with the other two optimal algorithms for differentially private stochastic bandits.
Runxiang Cheng (University of Illinois at Urbana-Champaign); Gargi Balasubramaniam (University of Illinois, Urbana Champaign); Yifei He (University of Illinois at Urbana-Champaign); Yao-Hung Hubert Tsai (Carnegie Mellon University); Han Zhao (University of Illinois, Urbana Champaign)
Multimodal learning considers learning from multi-modality data, aiming to fuse heterogeneous sources of information. However, it is not always feasible to leverage all available modalities due to memory constraints. Further, training on all the modalities may be inefficient when redundant information exists within data, such as different subsets of modalities providing similar performance. In light of these challenges, we study modality selection, intending to efficiently select the most informative and complementary modalities under certain computational constraints. We formulate a theoretical framework for optimizing modality selection in multimodal learning and introduce a utility measure to quantify the benefit of selecting a modality. For this optimization problem, we present efficient algorithms when the utility measure exhibits monotonicity and approximate submodularity. We also establish a novel correspondence between the utility measure and existing marginal contribution scores based on the Shapely value. Last, we demonstrate the efficacy of our algorithm on synthetic (Patch-MNIST) and real-world (PEMS-SF Traffic) datasets.
Zhipeng Huang (University of Toledo); Hadeel Soliman (University of Toledo); Subhadeep Paul (Ohio State University); Kevin S. Xu (University of Toledo)
Networks and temporal point processes serve as fundamental building blocks for modeling complex dynamic relational data in various domains. We propose the latent space Hawkes (LSH) model, a novel generative model for continuous-time networks of relational events, using a latent space representation for nodes. We model relational events between nodes using mutually exciting Hawkes processes with baseline intensities dependent upon the distances between the nodes in the latent space and sender and receiver specific effects. We propose an alternating minimization algorithm to jointly estimate the latent positions of the nodes and other model parameters. We demonstrate that our proposed LSH model can replicate many features observed in real temporal networks including reciprocity and transitivity, while also achieves superior prediction accuracy and provides more interpretability compared to existing models.
Elliot Nelson (International Business Machines); Debarun Bhattacharjya (Stanford University); Tian Gao (International Business Machines); Miao Liu (International Business Machines); Djallel Bouneffouf (IBM); Pascal Poupart (University of Waterloo)
In many real-world applications of multi-armed bandit problems, both rewards and contexts are often influenced by confounding latent variables which evolve stochastically over time. While the observed contexts and rewards are nonlinearly related, we show that prior knowledge of latent causal structure can be used to reduce the problem to the linear bandit setting. We develop two algorithms, Latent Linear Thompson Sampling (L2TS) and Latent Linear UCB (L2UCB), which use online EM algorithms for hidden Markov models to learn the latent transition model and maintain a posterior belief over the latent state, and then use the resulting posteriors as context features in a linear bandit problem. We upper bound the error in reward estimation in the presence of a dynamical latent state, and derive a novel problem-dependent regret bound for linear Thompson sampling with non-stationarity and unconstrained reward distributions, which we apply to L2TS under certain conditions. Finally, we demonstrate the superiority of our algorithms over related bandit algorithms through experiments.
Data-heterogeneous federated learning (FL) systems suffer from two significant sources of convergence error: 1) client drift error caused by performing multiple local optimization steps at clients, and 2) partial client participation error caused by the fact that only a small subset of the edge clients participate in every training round. We find that among these, only the former has received significant attention in the literature. To remedy this we propose FedVARP, a novel server-based variance reduction algorithm that eliminates error due to partial client participation. To do so, the server simply maintains in memory the most recent update for each client and uses these as surrogate updates for the non-participating clients in every round. Further, to alleviate the memory requirement at the server, we propose a novel clustering-based variance reduction algorithm ClusterFedVARP. Unlike previously proposed methods, both FedVARP and ClusterFedVARP do not require additional computation at clients or communication of additional optimization parameters. Through extensive experiments, we show that FedVARP outperforms state-of-the-art methods, and ClusterFedVARP achieves performance comparable to FedVARP with much less memory requirements.
Yiting Cao (University of Oklahoma); Chao Lan (University of Oklahoma)
Existing studies on individual fairness focus on the passive setting and typically require $O(\frac{1}{\varepsilon^2})$ labeled instances to achieve an $\varepsilon$ bias budget. In this paper, we build on the elegant Approximately Metric-Fair (AMF) learning framework and propose an active AMF learner that can provably achieve the same budget with only $O(\log \frac{1}{\varepsilon})$ labeled instances. To our knowledge, this is a first and substantial improvement of the existing sample complexity for achieving individual fairness. Through experiments on three data sets, we show the proposed active AMF learner improves fairness on linear and non-linear models more efficiently than its passive counterpart as well as state-of-the-art active learners, while maintaining a comparable accuracy. To facilitate algorithm design and analysis, we also design a provably equivalent form of the approximate metric fairness based on uniform continuity instead of the existing almost Lipschitz continuity.
Qiyang Li (University of California Berkeley); Ajay Jain (Google); Pieter Abbeel (Covariant)
Autoregressive generative models can estimate complex continuous data distributions, like trajectory rollouts in an RL environment, image intensities, and audio. Most state-of-the-art models discretize continuous data into several bins and use categorical distributions over the bins to approximate the continuous data distribution. The advantage is that the categorical distribution can easily express multiple modes and are straightforward to optimize. However, such approximation cannot express sharp changes in density without using significantly more bins, which makes it parameter inefficient. We propose an efficient, expressive, multimodal parameterization called Adaptive Categorical Discretization (AdaCat). AdaCat discretizes each dimension of an autoregressive model adaptively, which allows the model to allocate density to fine intervals of interest, improving parameter efficiency. AdaCat generalizes both categoricals and quantile-based regression. AdaCat is a simple add-on to any discretization-based distribution estimator. In experiments, AdaCat improves density estimation for real-world tabular data, images, audio, and trajectories, and improves planning in model-based offline RL.
Estimating transfer entropy (TE) between time series is a difficult, yet highly impactful problem in fields such as finance and neuroscience. The well known nearest neighbor estimator of TE potentially fails if temporal dependencies are noisy and long ranged, primarily because it relies on the estimation of joint entropy terms in high dimensions, which is a hard problem in itself. Other estimators, such as those based on Copula entropy or conditional mutual information have similar limitations. Leveraging the successes of modern discriminative models that operate in high dimensional (noisy) feature spaces, we express TE as a difference of two conditional entropy terms, which we directly estimate from conditional likelihoods computed in-sample from any discriminator trained per maximum likelihood principle. The challenge here is to ensure that the in-sample log likelihood estimates are not overfit to the data. Inspired by recent work on adversarial robustness of deep neural nets, we propose a novel perturbation model based on locality sensitive hash (LSH) functions, which regularizes a discriminative model to have smooth functional outputs within local neighborhoods of the input space. We theoretically prove that our estimator is consistent under some mild regularity conditions, and its variance reduces linearly in sample size. We also demonstrate its superiority w.r.t. state-of-the-art estimators through empirical evaluations on a synthetic as well as real world datasets from the neuroscience and finance domains.
Karthik Abinav Sankararaman (Facebook); Anand Louis (Indian Institute of Science); Navin Goyal (Microsoft)
We consider the problem of robust parameter estimation from observational data in the context of linear structural equation models (LSEMs). Under various conditions on LSEMs and the model parameters the prior work provides efficient algorithms to recover the parameters. However, these results are often about generic identifiability. In practice, generic identifiability is not sufficient and we need robust identifiability: small changes in the observational data should not affect the parameters by a huge amount. Robust identifiability has received far less attention and remains poorly understood. Sankararaman et al. (2019) recently provided a set of sufficient conditions on parameters under which robust identifiability is feasible. However, a limitation of their work is that their results only apply to a small sub-class of LSEMs, called ``bow-free paths.'' In this work, we show that for \emph{any} ``bow-free model'', in all but $\frac{1}{\poly(n)}$-measure of instances \emph{robust identifiability} holds. Moreover, whenever an instance is robustly identifiable, the algorithm proposed in Foygel et al., (2012) can be used to recover the parameters in a robust fashion. In contrast, for generic identifiability Foygel et al., (2012) proved that with measure $1$, instances are generically identifiable. Thus, we show that robust identifiability is a \emph{strictly} harder problem than generic identifiability. Finally, we validate our results on both simulated and real-world datasets.
Zhiyang Teng (Westlake University); Chenhua Chen; Yan Zhang (National University of Singapore); Yue Zhang (Westlake University)
Deep latent variable models such as variational autoencoders and energy-based models are widely used for neural text generation. Most of them focus on matching the prior distribution with the posterior distribution of the latent variable for text reconstruction. In addition to instance-level reconstruction, this paper aims to integrate contrastive learning in the latent space, forcing the latent variables to learn high-level semantics by exploring inter-instance relationships. Experiments on various text generation benchmarks show the effectiveness of our proposed method. We also empirically show that our method can mitigate the posterior collapse issue for latent variable based text generation models.
Vishal Sharma (Indian Institute of Technology Delhi); Daman Arora; Florian Geisser; Mausam (Indian Institute of Technology Delhi); Parag Singla (Indian Institute of Technology Delhi)
Relational MDPs (RMDPs) compactly represent an infinite set of MDPs with an unbounded number of objects. Solving an RMDP requires a generalized policy that applies to all instances of a domain. Recently, Garg et al. proposed SymNet for this task -- it constructs a graph neural network that shares parameters across all instances in a domain, thus making it applicable to any instance in a zero-shot manner. Our analysis of SymNet reveals that it performs no better than random on 1/4th of planning competition domains. The key reasons are its design choices: it misses important information during graph construction, leading to (1) poor generalizability, and (2) potential non-identifiability of different actions. In response, our solution, ENFANet, substantially augments SymNet's graph construction approach by introducing additional nodes and edges which allow a better transfer of important information about a domain. It also improves SymNet's action decoders with relevant information from objects to make different actions identifiable during scoring. Extensive experiments on twelve competition domains, where we use imitation learning over data generated from the PROST planner, demonstrate that ENFANet performs vastly better than SymNet. Interestingly, even though ENFANet is trained over data from PROST, it outperforms the planner on several test instances due to former's ability to scale to large instances in a zero-shot manner.
Niklas Pfister (University of Copenhagen); Jonas Peters (University of Copenhagen)
Exogenous heterogeneity, for example, in the form of instrumental variables can help us learn a system?? underlying causal structure and predict the outcome of unseen intervention experiments. In this paper, we consider linear models in which the causal effect from covariates X on a response Y is sparse. We prove that the causal coefficient becomes identifiable under weak conditions and may even be identified in models, where the number of instruments is as small as the number of causal parents. We also develop graphical criteria under which the identifiability holds with probability one if the edge coefficients are sampled randomly from a distribution that is absolutely continuous with respect to Lebesgue measure. As an estimator, we propose spaceIV and prove that it consistently estimates the causal effect if the model is identifiable and evaluate its performance on simulated data.
Houston Warren (The University of Sydney, University of Sydney); Rafael Oliveira (University of Sydney); Fabio Ramos (University of Sydney)
Bayesian probabilistic integration, or Bayesian quadrature (BQ), has arisen as a popular means of numerical integral estimation with quantified uncertainty for problems where computational cost limits data availability. BQ leverages flexible Gaussian processes (GPs) to model an integrand which can be subsequently analytically integrated through properties of Gaussian distributions. However, BQ is inherently limited by the fact that the method relies on the use of a strict set of kernels for use in the GP model of the integrand, reducing the flexibility of the method in modeling varied integrand types. In this paper, we present spectral Bayesian quadrature, a form of Bayesian quadrature that allows for the use of any shift-invariant kernel in the integrand GP model while still maintaining the analytical tractability of the integral posterior, increasing the flexibility of BQ methods to address varied problem settings. Additionally our method enables integration with respect to a uniform expectation, effectively computing definite integrals of challenging integrands. We derive the theory and error bounds for this model, as well as demonstrate GBQ's improved accuracy, flexibility, and data efficiency, compared to traditional BQ and other numerical integration methods, on a variety of quadrature problems.
Sujay Bhatt (Baidu); Guanhua Fang (Baidu); Ping Li (Baidu)
In this work, we propose a non-parametric and robust change detection algorithm to detect multiple change points in time series data under non-adversarial contamination. The algorithm is de-signed for the offline setting, where the objective is to detect changes when all data are received. We only make weak moment assumptions on the inliers (uncorrupted data) to handle a large class of distributions. The robust scan statistic in the change detection algorithm is fashioned using mean estimators based on influence functions. We establish the consistency of the estimated change point indexes as the number of samples increases, and provide empirical evidence to support the consistency results.
The list above is provisional. Conference proceedings will be published by PMLR.