Testing for Conditional Mean Independence with Covariates through Martingale Difference Divergence
Ze Jin, Xiaohan Yan, David S. Matteson
A crucial problem in statistics is deciding whether additional variables are needed in a regression model. We propose a new multivariate test to investigate the conditional mean independence of Y given X conditioning on some known effect Z, i.e., E(Y|X, Z) = E(Y|Z). Assuming that E(Y|Z) and Z are linearly related, we reformulate an equivalent notion of conditional mean independence through transformation, which is approximated in practice. We apply the martingale difference divergence (Shao and Zhang, 2014) to measure conditional mean dependence, and show that the estimation error from this approximation is negligible: it has no impact on the asymptotic distribution of the test statistic under some regularity assumptions. The implementation of our test is demonstrated by both simulations and a financial data example.
ID: 14
Analysis of Thompson Sampling for Graphical Bandits Without the Graphs
Fang Liu, Zizhan Zheng, Ness Shroff
We study multi-armed bandit problems with graph feedback, in which the decision maker is allowed to observe the neighboring actions of the chosen action, in a setting where the graph may vary over time and is never fully revealed to the decision maker. We show that when the feedback graphs are undirected, the original Thompson Sampling achieves the optimal (within logarithmic factors) regret $\tilde{O}\left(\sqrt{\beta_0(G)T}\right)$ over time horizon $T$, where $\beta_0(G)$ is the average independence number of the latent graphs. To the best of our knowledge, this is the first result showing that the original Thompson Sampling is optimal for graphical bandits in the undirected setting. A slightly weaker regret bound for Thompson Sampling in the directed setting is also presented. To fill this gap, we propose a variant of Thompson Sampling that attains the optimal regret in the directed setting within a logarithmic factor. Both algorithms can be implemented efficiently and do not require knowledge of the feedback graphs at any time.
We investigate structured sparsity methods for variable selection in regression problems where the target depends nonlinearly on the inputs. We focus on general nonlinear functions not limiting a priori the function space to additive models. We propose two new regularizers based on partial derivatives as nonlinear equivalents of group lasso and elastic net. We formulate the problem within the framework of learning in reproducing kernel Hilbert spaces and show how the variational problem can be reformulated into a more practical finite dimensional equivalent. We develop a new algorithm derived from the ADMM principles that relies solely on closed forms of the proximal operators. We explore the empirical properties of our new algorithm for Nonlinear Variable Selection based on Derivatives (NVSD) on a set of experiments and confirm favourable properties of our structured-sparsity models and the algorithm in terms of both prediction and variable selection accuracy.
ID: 23
Identification of Strong Edges in AMP Chain Graphs
Jose M. Peña
The essential graph is a distinguished member of a Markov equivalence class of AMP chain graphs. However, the directed edges in the essential graph are not necessarily strong or invariant, i.e. they may not be shared by every member of the equivalence class. Likewise for the undirected edges. In this paper, we develop a procedure for identifying which edges in an essential graph are strong. We also show how this makes it possible to bound some causal effects when the true chain graph is unknown.
ID: 32
A Univariate Bound of Area Under ROC
Siwei Lyu
Area under ROC (AUC) is an important metric for binary classification and bipartite ranking problems. However, it is difficult to directly optimize AUC as a learning objective, so most existing algorithms are based on optimizing a surrogate loss for AUC. One significant drawback of these surrogate losses is that they require pairwise comparisons among training data, which leads to slow running time and increased local storage for online learning. In this work, we describe a new surrogate loss based on a reformulation of the AUC risk, which requires not pairwise comparisons but rankings of the predictions. We further show that the ranking operation can be avoided, and that the learning objective obtained from this surrogate enjoys linear complexity in time and storage. We perform experiments to demonstrate the effectiveness of the online and batch algorithms for AUC optimization based on the proposed surrogate loss.
ID: 34
Efficient Bayesian Inference for a Gaussian Process Density Model
Christian Donner, Manfred Opper
We reconsider a nonparametric density model based on Gaussian processes. By augmenting the model with latent Pólya–Gamma random variables and a latent marked Poisson process we obtain a new likelihood which is conjugate to the model’s Gaussian process prior. The augmented posterior allows for efficient inference by Gibbs sampling and an approximate variational mean field approach. For the latter we utilise sparse GP approximations to tackle the infinite dimensionality of the problem. The performance of both algorithms and comparisons with other density estimators are demonstrated on artificial and real datasets with up to several thousand data points.
ID: 35
Comparing Direct and Indirect Temporal-Difference Methods for Estimating the Variance of the Return
Craig Sherstan, Brendan Bennett, Dylan R. Ashley, Kenny Young, Adam White, Martha White, Richard S. Sutton
Temporal-difference (TD) learning methods are widely used in reinforcement learning to estimate the expected return for each state, without a model, because of their significant advantages in computational and data efficiency. For many applications involving risk mitigation it would also be useful to estimate the variance of the return by TD methods. In this paper we describe a way of doing this that is substantially simpler than those proposed by Tamar, Di Castro, and Mannor (2012) or by White and White (2016). We show that expectation and variance estimates can be learned by two TD learners operating in series. The trick is to use the square of the TD error of the expectation learner as the reward of the variance learner, and the square of the expectation learner's discount rate as the discount rate of the variance learner. With these two modifications, the variance learning problem becomes a conventional TD learning problem to which standard theoretical results can be applied. Our formal results are limited to the table lookup case, for which our method is still novel, but the extension to function approximation is immediate and also given here, along with empirical results. Our experimental results show that our direct method behaves just as well as a comparable indirect method, but is generally more robust.
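The two-learner construction described above can be sketched in the table lookup case as follows. This is a minimal illustration under assumed transitions and step sizes, not the authors' code:

```python
import numpy as np

def td_mean_and_variance(transitions, n_states, gamma=0.9,
                         alpha_v=0.1, alpha_m=0.1):
    """Direct variance estimation via two TD learners in series.

    `transitions` is a list of (s, r, s_next) tuples from a fixed policy.
    The variance learner uses the squared TD error of the value learner
    as its reward, and gamma**2 as its discount rate.
    """
    V = np.zeros(n_states)   # estimate of the expected return
    M = np.zeros(n_states)   # estimate of the variance of the return
    for s, r, s_next in transitions:
        delta = r + gamma * V[s_next] - V[s]          # TD error of value learner
        V[s] += alpha_v * delta
        # variance learner: reward = delta**2, discount = gamma**2
        delta_var = delta**2 + gamma**2 * M[s_next] - M[s]
        M[s] += alpha_m * delta_var
    return V, M
```

Both updates are ordinary tabular TD(0) rules; only the reward and discount of the second learner are changed, which is what reduces variance estimation to a conventional TD problem.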
ID: 37
How well does your sampler really work?
Ryan Turner, Brady Neal
We present a new data-driven benchmark system to evaluate the performance of new MCMC samplers. Taking inspiration from the COCO benchmark in optimization, we view this task as having critical importance to machine learning and statistics given the rate at which new samplers are proposed. The common hand-crafted examples to test new samplers are unsatisfactory; we take a meta-learning-like approach to generate benchmark examples from a large corpus of data sets and models. Surrogates of posteriors found in real problems are created using highly flexible density models including modern neural network based approaches. We provide new insights into the real effective sample size of various samplers per unit time and the estimation efficiency of the samplers per sample. Additionally, we provide a meta-analysis to assess the predictive utility of various MCMC diagnostics and perform a nonparametric regression to combine them.
ID: 39
Learning Deep Hidden Nonlinear Dynamics from Aggregate Data
Yisen Wang, Bo Dai, Lingkai Kong, Hongyuan Zha, Xingjun Ma, Sarah Monazam Erfani, James Bailey, Le Song, Shu-Tao Xia
Learning nonlinear dynamics from diffusion data is a challenging problem, since the individuals observed may be different at different time points, generally following an aggregate behaviour. Existing work cannot handle such tasks well, since it either models the dynamics directly on observations or requires the availability of complete longitudinal individual-level trajectories. However, in most practical applications these requirements are unrealistic: the evolving dynamics may be too complex to be modeled directly on observations, and individual-level trajectories may not be available due to technical limitations, experimental costs, and/or privacy issues. To address these challenges, we formulate a model of diffusion dynamics as the {\em hidden stochastic process} via the introduction of hidden variables for flexibility, and learn the hidden dynamics directly on {\em aggregate observations} without any requirement for individual-level trajectories. We propose a dynamic generative model with Wasserstein distance for LEarninG dEep hidden Nonlinear Dynamics (LEGEND) and prove its theoretical guarantees as well. Experiments on a range of synthetic and real-world datasets illustrate that LEGEND has very strong performance compared to state-of-the-art baselines.
ID: 40
Revisiting differentially private linear regression: optimal and adaptive prediction and estimation in unbounded domain
Yu-Xiang Wang
We revisit the problem of linear regression under a differential privacy constraint. By consolidating existing pieces in the literature, we clarify the correct dependence of the optimization error and estimation error on the feature, label, and coefficient domains, hence revealing the delicate price of differential privacy in statistical estimation and statistical learning. Moreover, we propose simple modifications of two existing DP algorithms, (a) posterior sampling and (b) sufficient statistics perturbation, and show that they can be upgraded into adaptive algorithms (AdaOPS and AdaSSP) that are able to exploit data-dependent quantities and behave nearly optimally for every instance. Extensive experiments on both simulated and real data show that AdaOPS and AdaSSP outperform the existing techniques on nearly all 36 data sets that we test on.
ID: 42
Imaginary Kinematics
Sabina Marchetti
We introduce a class of adjustment rules for a collection of beliefs, extending Lewis' Imaging to absorb probabilistic evidence in generalized settings. Unlike standard tools for belief revision, our proposals may be used when information is inconsistent with an agent's belief base. We show the functionals we introduce are based on the Imaginary counterparts of Probability Kinematics for standard belief revision, and prove they satisfy all standard postulates for belief revision, under certain conditions.
ID: 43
From Deterministic ODEs to Dynamic Structural Causal Models
Paul K. Rubenstein, Stephan Bongers, Joris M. Mooij, Bernhard Schoelkopf
Structural Causal Models are widely used in causal modelling, but how they relate to other modelling tools is poorly understood. In this paper we provide a novel perspective on the relationship between Ordinary Differential Equations and Structural Causal Models. We show how, under certain conditions, the asymptotic behaviour of an Ordinary Differential Equation under non-constant interventions can be modelled using Dynamic Structural Causal Models. In contrast to earlier work, we study not only the effect of interventions on equilibrium states; rather, we model asymptotic behaviour that is dynamic under interventions that vary in time, and include as a special case the study of static equilibria.
ID: 45
Frank-Wolfe Optimization for Symmetric-NMF under Simplicial Constraint
Han Zhao, Geoff Gordon
Symmetric nonnegative matrix factorization has found abundant applications in various domains by providing a symmetric low-rank decomposition of nonnegative matrices. In this paper we propose a Frank-Wolfe (FW) solver to optimize the symmetric nonnegative matrix factorization problem under a simplicial constraint, which has recently been proposed for probabilistic clustering. Compared with existing solutions, this algorithm is extremely simple to implement, and has almost no hyperparameters to be tuned. Building on the recent advances of FW algorithms in nonconvex optimization, we prove an $O(1/\epsilon^2)$ convergence rate to $\epsilon$-approximate KKT points, via a tight bound $\Theta(n^2)$ on the curvature constant, which matches the best known result in the unconstrained nonconvex setting using gradient methods. Numerical results demonstrate the effectiveness of our algorithm. As a side contribution, we construct a simple nonsmooth convex problem where the FW algorithm fails to converge to the optimum. This result raises an interesting question about necessary conditions for the success of the FW algorithm on convex problems.
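As a sketch of the setting (not the paper's exact solver), Frank-Wolfe iterations for minimizing $\|A - HH^\top\|_F^2$ with each row of $H$ constrained to the probability simplex can be written as follows; the objective form and the standard $2/(t+2)$ step size are illustrative assumptions:

```python
import numpy as np

def fw_symnmf(A, k, n_iters=200, seed=0):
    """Frank-Wolfe for min_H ||A - H H^T||_F^2 subject to each row of H
    lying on the probability simplex (a sketch of the constrained
    symmetric NMF setting, with the standard 2/(t+2) step size)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    H = rng.random((n, k))
    H /= H.sum(axis=1, keepdims=True)        # start on the simplex
    for t in range(n_iters):
        R = A - H @ H.T
        G = -4.0 * R @ H                     # gradient for symmetric A
        # linear minimization oracle: per-row vertex of the simplex
        S = np.zeros_like(H)
        S[np.arange(n), G.argmin(axis=1)] = 1.0
        gamma = 2.0 / (t + 2.0)
        H = (1 - gamma) * H + gamma * S      # convex update stays feasible
    return H
```

Because the linear minimization oracle over a product of simplices reduces to a per-row argmin, each iteration avoids any projection step, which is part of what makes the FW approach so simple to implement here.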
ID: 50
Learning Time Series Segmentation Models from Temporally Imprecise Labels
Roy Adams, Benjamin M. Marlin
This paper considers the problem of learning time series segmentation models when the labeled data is subject to temporal uncertainty or noise. Our approach augments the semi-Markov conditional random field (semi-CRF) model with a probabilistic model of the label observation process. This augmentation allows us to estimate the parameters of the semi-CRF from timestamps corresponding roughly to the occurrence of segment transitions. We show how exact marginal inference can be performed in polynomial time enabling learning based on marginal likelihood maximization. Our experiments on two activity detection problems show that the proposed approach can learn models from temporally imprecise labels, and can successfully refine imprecise segmentations through posterior inference. Finally, we show how inference complexity can be reduced by a factor of 40 using static and model-based pruning of the inference dynamic program.
ID: 53
Multi-Target Optimisation via Bayesian Optimisation and Linear Programming
Alistair Shilton, Santu Rana, Sunil Gupta, Svetha Venkatesh
In Bayesian Multi-Objective optimisation, expected hypervolume improvement is often used to measure the goodness of candidate solutions. However when there are many objectives the calculation of expected hypervolume improvement can become computationally prohibitive. An alternative approach measures the goodness of a candidate based on the distance of that candidate from the Pareto front in objective space. In this paper we present a novel distance-based Bayesian Many-Objective optimisation algorithm. We demonstrate the efficacy of our algorithm on three problems, namely the DTLZ2 benchmark problem, a hyper-parameter selection problem, and high-temperature creep-resistant alloy design.
ID: 54
Stochastic Learning for Sparse Discrete Markov Random Fields with Controlled Gradient Approximation Error
Sinong Geng, Zhaobin Kuang, Jie Liu, Stephen Wright, David Page
We study the L1-regularized maximum likelihood estimation (MLE) problem for discrete Markov random fields (MRFs), where efficient and scalable learning requires both sparse regularization and approximate inference. To address these challenges, we consider a stochastic learning framework called stochastic proximal gradient (SPG; Honorio, 2012a; Atchade et al., 2014; Miasojedow and Rejchel, 2016). SPG is an inexact proximal gradient algorithm (Schmidt et al., 2011) whose inexactness stems from the stochastic oracle (Gibbs sampling) used for gradient approximation; exact gradient evaluation is infeasible in general due to the NP-hard inference problem for discrete MRFs (Koller and Friedman, 2009). Theoretically, we provide novel verifiable bounds to inspect and control the quality of the gradient approximation. Empirically, we propose the tighten asymptotically (TAY) learning strategy based on these verifiable bounds to boost the performance of SPG.
ID: 57
Active Information Acquisition for Linear Optimization
Shuran Zheng, Bo Waggoner, Yang Liu, Yiling Chen
We consider partially-specified optimization problems where the goal is to actively, but efficiently, acquire missing information about the problem in order to solve it. An algorithm designer wishes to solve a linear program (LP), $\max \mathbf{c}^T \mathbf{x}$ s.t. $\mathbf{A}\mathbf{x} \leq \mathbf{b}, \mathbf{x} \ge \mathbf{0}$, but does not initially know some of the parameters. The algorithm can iteratively choose an unknown parameter and gather information in the form of a noisy sample centered at the parameter's (unknown) value. The goal is to find an approximately feasible and optimal solution to the underlying LP with high probability while drawing a small number of samples. We focus on two cases.
(1) When the parameters $\mathbf{b}$ of the constraints are initially unknown, we propose an efficient algorithm combining techniques from the ellipsoid method for LP and confidence-bound approaches from bandit algorithms. The algorithm adaptively gathers information about constraints only as needed in order to make progress. We give sample complexity bounds for the algorithm and demonstrate its improvement over a naive approach via simulation.
(2) When the parameters $\mathbf{c}$ of the objective are initially unknown, we take an information-theoretic approach and give roughly matching upper and lower sample complexity bounds, with an (inefficient) successive-elimination algorithm.
ID: 61
Transferable Meta Learning Across Domains
Bingyi Kang, Jiashi Feng
Meta learning algorithms are effective at obtaining meta models with the capability of solving new tasks quickly. However, they critically require sufficient tasks for meta model training, and the resulting model can only solve new tasks similar to the training ones. These limitations cause performance to decline in the presence of insufficient training tasks in target domains and of task heterogeneity, where the source (model training) tasks present different characteristics from the target (model application) tasks. To overcome these two significant limitations of existing meta learning algorithms, we introduce the cross-domain meta learning framework and propose a new transferable meta learning (TML) algorithm. TML performs meta task adaptation jointly with meta model learning, which effectively narrows the divergence between source and target tasks and enables transferring source meta-knowledge to solve target tasks. Thus, the resulting transferable meta model can quickly solve new learning tasks in new domains. We apply the proposed TML to cross-domain few-shot classification problems and evaluate its performance on multiple benchmarks. It performs significantly better and faster than well-established meta learning algorithms and fine-tuned domain-adapted models.
ID: 65
Learning the Causal Structure of Copula Models with Latent Variables
Ruifei Cui, Perry Groot, Moritz Schauer, Tom Heskes
A common goal in psychometrics, sociology, and econometrics is to uncover causal relations among latent variables representing hypothetical constructs that cannot be measured directly, such as attitude, intelligence, and motivation. Through measurement models, these constructs are typically linked to measurable indicators, e.g., responses to questionnaire items. This paper addresses the problem of causal structure learning among such latent variables and other observed variables. We propose the 'Copula Factor PC' algorithm as a novel two-step approach. It first draws samples of the underlying correlation matrix in a Gaussian copula factor model via a Gibbs sampler on rank-based data. These are then translated into an average correlation matrix and an effective sample size, which are taken as input to the standard PC algorithm for causal discovery in the second step. We prove the consistency of our 'Copula Factor PC' algorithm, and demonstrate that it outperforms the PC-MIMBuild algorithm and a greedy step-wise approach. We illustrate our method on a real-world data set about children with Attention Deficit Hyperactivity Disorder.
ID: 68
$f_{BGD}$: Learning Embeddings From Positive-Only Data with BGD
Learning embeddings from sparse positive-only data is a fundamental task in several domains, such as natural language processing (NLP), computer vision (CV), and information retrieval (IR). By far, the most widely used optimization methods rely on stochastic gradient descent (SGD) with negative sampling (NS), particularly for learning from large-scale data. However, the convergence and effectiveness of SGD depend largely on the sampling distribution of the negative examples. Moreover, SGD suffers from dramatic fluctuation due to its one-sample learning scheme. To address these common issues of existing embedding methods, we present a generic batch gradient descent optimizer ($f_{BGD}$) to learn embeddings from \emph{all} training examples without sampling. Our main contribution is that we accelerate $f_{BGD}$ by several orders of magnitude, bringing its time complexity to the same level as NS-based SGD. We evaluate $f_{BGD}$ on three well-known tasks across domains, namely, word embedding (NLP), image classification (CV), and item recommendation (IR). Experiments show that $f_{BGD}$ significantly outperforms NS-based SGD models on all three tasks with comparable efficiency. Code will be made available.
ID: 70
Soft-Robust Actor-Critic Policy-Gradient
Esther Derman, Daniel J Mankowitz, Timothy A Mann, Shie Mannor
Robust Reinforcement Learning aims to derive an optimal behavior that accounts for model uncertainty in dynamical systems. However, previous studies have shown that by considering the worst-case scenario, robust policies can be overly conservative. Our soft-robust framework is an attempt to overcome this issue. In this paper, we present a novel Soft-Robust Actor-Critic algorithm (SR-AC). It learns an optimal policy with respect to a distribution over an uncertainty set and stays robust to model uncertainty but avoids the conservativeness of robust strategies. We show the convergence of SR-AC and test the efficiency of our approach on two different domains by comparing it against regular learning methods and their robust formulations.
ID: 71
Constant Step Size Stochastic Gradient Descent for Probabilistic Modeling
Dmitry Babichev, Francis Bach
Stochastic gradient methods enable learning probabilistic models from large amounts of data. While large step sizes (learning rates) have been shown to be best for least-squares (e.g., Gaussian noise) once combined with parameter averaging, they do not lead to convergent algorithms in general. In this paper, we consider generalized linear models, that is, conditional models based on exponential families. We propose averaging moment parameters instead of natural parameters for constant-step-size stochastic gradient descent.
For finite-dimensional models, we show that this can sometimes (and surprisingly) lead to better predictions than the best linear model.
For infinite-dimensional models, we show that it always converges to optimal predictions, while averaging natural parameters never does.
We illustrate our findings with simulations on synthetic data and classical benchmarks.
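For logistic regression (a generalized linear model), the contrast between the two averaging schemes can be sketched as follows. The data, step size, and iteration count below are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def constant_step_sgd_predictions(X, y, X_test, step=0.5, seed=0):
    """Constant-step-size SGD for logistic regression, contrasting two
    averaging schemes: averaging natural parameters (the iterates
    theta_t) vs averaging moment parameters (the predicted
    probabilities sigmoid(x^T theta_t))."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    theta_sum = np.zeros(d)
    moment_sum = np.zeros(X_test.shape[0])
    T = 5000
    for t in range(T):
        i = rng.integers(n)                            # one-sample update
        grad = (sigmoid(X[i] @ theta) - y[i]) * X[i]
        theta -= step * grad
        theta_sum += theta
        moment_sum += sigmoid(X_test @ theta)
    natural_avg_pred = sigmoid(X_test @ (theta_sum / T))  # average natural params
    moment_avg_pred = moment_sum / T                      # average moment params
    return natural_avg_pred, moment_avg_pred
```

The key difference is where the nonlinearity sits: natural-parameter averaging applies the sigmoid once to the averaged iterate, while moment-parameter averaging averages the per-iterate predicted probabilities themselves.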
ID: 75
Discrete Sampling using Semigradient-based Product Mixtures
Alkis Gotovos, Stefanie Jegelka, Hamed Hassani, Andreas Krause
We consider the problem of inference in discrete probabilistic models, that is, distributions over subsets of a finite ground set. These encompass a range of well-known models in machine learning, such as determinantal point processes and Ising models. Locally-moving Markov chain Monte Carlo algorithms, such as the Gibbs sampler, are commonly used for inference in such models, but their convergence is, at times, prohibitively slow. This is often caused by state-space bottlenecks that greatly hinder the movement of such samplers. We propose a novel sampling strategy that uses a specific mixture of product distributions to propose global moves and, thus, accelerate convergence. Furthermore, we show how to construct such a mixture using semigradient information. We illustrate the effectiveness of combining our sampler with existing ones, both theoretically on an example model, as well as practically on three models learned from real-world data sets.
ID: 83
Combining Knowledge and Reasoning through Probabilistic Soft Logic for Image Puzzle Solving
The uncertainty associated with human perception is often reduced by one's extensive prior experience and knowledge. Current datasets and systems do not emphasize the necessity and benefit of using such knowledge. In this work, we propose the task of solving a genre of image puzzles ("image riddles") that requires both visual detection capabilities (including object and activity recognition) and knowledge-based or commonsense reasoning. Each puzzle involves a set of images and the question "what word connects these images?". We compile a dataset of over 3k riddles, where each riddle consists of 4 images and a ground truth answer. The annotations are validated using crowd-sourced evaluation. We also define an automatic evaluation metric to track future progress. Our task bears similarity to commonly known IQ tasks, such as analogy solving and sequence filling, that are often used to test intelligence. We develop a probabilistic reasoning-based approach that utilizes commonsense knowledge about words and phrases to answer these riddles with reasonable accuracy. Our approach achieves promising results on these riddles and provides a strong baseline for future attempts. We make the entire dataset and related materials publicly available to the community (bit.ly/22f9Ala).
ID: 92
Nesting Probabilistic Programs
Tom Rainforth
We formalize the notion of nesting probabilistic programming queries and investigate the resulting statistical implications. We demonstrate that query nesting allows the definition of models which could not otherwise be expressed, such as those involving agents reasoning about other agents, but that existing systems take approaches that lead to inconsistent estimates. We show how to correct this by delineating possible ways one might want to nest queries and asserting the respective conditions required for convergence. We further introduce, and prove the correctness of, a new online nested Monte Carlo estimation method that makes it substantially easier to ensure these conditions are met, thereby providing a simple framework for designing statistically correct inference engines.
ID: 99
Scalable Algorithms for Learning High-Dimensional Linear Mixed Models
Linear mixed models (LMMs) are used extensively to model observations that are not independent. Parameter estimation for LMMs can be computationally prohibitive on big data. State-of-the-art learning algorithms require computational complexity which depends at least linearly on the dimension $p$ of the covariates, and often use heuristics that do not offer theoretical guarantees. We present scalable algorithms for learning high-dimensional LMMs with sublinear computational complexity dependence on $p$. Key to our approach are novel dual estimators which use only kernel functions of the data, and fast computational techniques based on the subsampled randomized Hadamard transform. We provide theoretical guarantees for our learning algorithms, demonstrating the robustness of parameter estimation. Finally, we complement the theory with experiments on large synthetic and real data.
ID: 117
Constraint-based Causal Discovery for Non-Linear Structural Causal Models with Cycles and Latent Confounders
Patrick Forré, Joris M. Mooij
We address the problem of causal discovery from data, making use of the recently proposed causal modeling framework of modular structural causal models (mSCM). We introduce σ-connection graphs (σ-CG), a new class of mixed graphs (containing undirected, bidirected and directed edges) with additional structure, and extend the concept of σ-separation, a recently proposed generalization of the well-known notion of d-separation, to σ-CGs. We prove the closedness of σ-separation under marginalisation and conditioning and exploit this for implementing a simple test of σ-separation on a σ-CG. Based on this, we propose the first causal discovery algorithm that can handle non-linear functional relations, latent confounders, cyclic dependencies due to feedback loops, and data from different (stochastic) perfect interventions. As a proof of concept, we show on synthetic data that our algorithm can recover the causal graph of a modular structural causal model with sufficiently many observational and interventional data.
ID: 118
Marginal Weighted Maximum Log-likelihood for Efficient Learning of Perturb-and-Map models
Tatiana Shpakova, Francis Bach, Anton Osokin
We consider the structured-output prediction problem through probabilistic approaches and generalize the "perturb-and-MAP" framework to more challenging weighted Hamming losses, which are crucial in applications. While in principle our approach is a straightforward marginalization, it requires solving many related MAP inference problems. We show that for log-supermodular pairwise models these operations can be performed efficiently using the machinery of dynamic graph cuts. We also propose to use \emph{double} stochastic gradient descent, both on the data and on the perturbations, for efficient learning. Our framework can naturally take weak supervision (e.g., partial labels) into account. We conduct a set of experiments on large-scale character recognition and image segmentation, showing the benefits of our algorithms.
ID: 119
Variational Inference for Gaussian Process with Panel Count Data
Hongyi Ding, Young Lee, Issei Sato, Masashi Sugiyama
We present the first framework for Gaussian-process-modulated Poisson processes when the temporal data appear in the form of panel counts. Panel count data frequently arise when experimental subjects are observed only at discrete time points and only the numbers of occurrences of the events between subsequent observation times are available. The exact occurrence timestamps of the events are unknown. We present a method for efficient variational inference based on the assumption of a Gaussian-process-modulated intensity function. We derive a tractable lower bound to alleviate the problem of the intractable evidence lower bound inherent in the variational inference framework. Our algorithm outperforms classical methods on both synthetic data and three real panel count data sets.
ID: 123
A unified probabilistic model for learning latent factors and their connectivities from high-dimensional data
Ricardo Pio Monti, Aapo Hyvarinen
Connectivity estimation is challenging in the context of high-dimensional data. A useful preprocessing step is to group variables into clusters, however, it is not always clear how to do so from the perspective of connectivity estimation. Another practical challenge is that we may have data from multiple related classes (e.g., multiple subjects or conditions) and wish to incorporate constraints on the similarities across classes. We propose a probabilistic model which simultaneously performs both a grouping of variables (i.e., detecting community structure) and estimation of connectivities between the groups which correspond to latent variables. The model is essentially a factor analysis model where the factors are allowed to have arbitrary correlations, while the factor loading matrix is constrained to express a community structure. The model can be applied on multiple classes so that the connectivities can be different between the classes, while the community structure is the same for all classes. We propose an efficient estimation algorithm based on score matching, and prove the identifiability of the model. Finally, we present an extension to directed (causal) connectivities over latent variables. Simulations and experiments on fMRI data validate the practical utility of the method.
ID: 125
Improved Stochastic Trace Estimation using Mutually Unbiased Bases
JK Fitzsimons, MA Osborne, SJ Roberts, JF Fitzsimons
The paper begins by introducing the definition and construction of mutually unbiased bases, a widely used concept in quantum information processing that has received little to no attention in the machine learning and statistics literature. We demonstrate their usefulness by using them to create a new sampling technique that improves on the previously well-established bounds for stochastic trace estimation. This approach achieves a new state-of-the-art single-shot sampling variance while requiring only $\mathcal{O}(\log(n))$ random bits for $x \in \mathbb{R}^{n}$, significantly improving on traditional methods such as fixed-basis, Hutchinson's, and Gaussian estimators in terms of both the number of random bits required and the worst-case sample variance.
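The mutually-unbiased-bases construction itself is beyond the scope of an abstract, but the classical baseline it is benchmarked against is easy to state. A minimal sketch of Hutchinson's estimator, which estimates $\mathrm{tr}(A)$ as the average of $z^{\top} A z$ over random Rademacher vectors $z$ (the matrix and sample count below are purely illustrative):

```python
import random

def hutchinson_trace(matvec, n, num_samples=2000, seed=0):
    """Estimate tr(A) for an implicit n x n matrix A, given only its
    matrix-vector product, via Hutchinson's estimator:
    tr(A) = E[z^T A z] for random Rademacher vectors z (entries +/- 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        az = matvec(z)
        total += sum(zi * ai for zi, ai in zip(z, az))
    return total / num_samples

# Toy example: a symmetric 3 x 3 matrix with trace 1 + 2 + 3 = 6.
A = [[1.0, 0.2, 0.0],
     [0.2, 2.0, 0.1],
     [0.0, 0.1, 3.0]]
matvec = lambda z: [sum(a * zi for a, zi in zip(row, z)) for row in A]
estimate = hutchinson_trace(matvec, 3)
```

The estimator's variance scales with the squared off-diagonal mass of $A$, which is exactly the quantity the mutually-unbiased-bases sampling scheme targets.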
ID: 128
Unsupervised Multi-view Nonlinear Graph Embedding
Jiaming Huang, Zhao Li, Vincent W. Zheng, Wen Wen, Yifan Yang, Yuanmi Chen
In this paper, we study the unsupervised multi-view graph embedding (UMGE) problem, which aims to learn graph embedding from multiple perspectives in an unsupervised manner. However, the vast majority of multi-view learning work focuses on non-graph data, and surprisingly there is little work on UMGE. By systematically analyzing different existing methods for UMGE, we discover that cross-view information and nonlinearity play a vital role in improving graph embedding quality. Motivated by this insight, we develop an unsupervised Multi-viEw nonlineaR Graph Embedding (MERGE) approach to model relational multi-view consistency. Experimental results on five benchmark datasets demonstrate that MERGE significantly outperforms the state-of-the-art baselines in terms of accuracy on node classification tasks without sacrificing computational efficiency.
ID: 132
Graph-based Clustering under Differential Privacy
Rafael Pinot, Anne Morvan, Florian Yger, Cedric Gouy-Pailler, Jamal Atif
In this paper, we present the first differentially private clustering method for arbitrary-shaped node clusters in a graph. This algorithm takes as input only an approximate Minimum Spanning Tree (MST) $\mathcal{T}$ released under weight differential privacy constraints from the graph. Then, the underlying nonconvex clustering partition is successfully recovered by performing optimal cuts on $\mathcal{T}$. As opposed to existing methods, our algorithm is theoretically well-motivated. Experiments support our theoretical findings.
ID: 139
GaAN: Gated Attention Networks for Learning on Large and Spatiotemporal Graphs
We propose a new network architecture, Gated Attention Networks (GaAN), for learning on graphs. Unlike the traditional multi-head attention mechanism, which equally consumes all attention heads, GaAN uses a convolutional sub-network to control each attention head's importance. We demonstrate the effectiveness of GaAN on the inductive node classification problem. Moreover, with GaAN as a building block, we construct the Graph Gated Recurrent Unit (GGRU) to address the traffic speed forecasting problem. Extensive experiments on three real-world datasets show that our GaAN framework achieves state-of-the-art results on both tasks.
ID: 142
Causal Learning for Partially Observed Stochastic Dynamical Systems
Søren Wengel Mogensen, Daniel Malinsky, Niels Richard Hansen
Many models of dynamical systems have causal interpretations that support reasoning about the consequences of interventions, suitably defined. Furthermore, local independence has been suggested as a useful independence concept for stochastic dynamical systems. There is, however, no well-developed theoretical framework for causal learning based on this notion of independence. We study independence models induced by directed mixed graphs (DMGs) and provide abstract graphoid properties that guarantee that an independence model has the global Markov property w.r.t. a DMG. We apply these results to Ito diffusions and event processes. Under a faithfulness assumption, we develop a sound and complete learning algorithm of the directed mixed equivalence graph (DMEG) representing the local independence model of a partially observed stochastic dynamical system.
ID: 148
Variational zero-inflated Gaussian processes with sparse kernels
Pashupati Hegde, Markus Heinonen, Samuel Kaski
Zero-inflated datasets, which have an excess of zero outputs, are commonly encountered in problems such as climate or rare event modelling. Conventional machine learning approaches tend to overestimate the non-zeros, leading to poor performance. We propose a novel model family of zero-inflated Gaussian processes (ZiGP) for such zero-inflated datasets, produced by sparse kernels through learning a latent probit Gaussian process that can zero out kernel rows and columns whenever the signal is absent. The ZiGPs are particularly useful for making the powerful Gaussian process networks more interpretable. We introduce sparse GP networks where variable-order latent modelling is achieved through sparse mixing signals. We derive the non-trivial stochastic variational inference tractably for scalable learning of the sparse kernels in both models. The novel output-sparse approach improves both prediction of zero-inflated data and interpretability of latent mixing models.
ID: 149
KBlrn: End-to-End Learning of Knowledge Base Representations with Latent, Relational, and Numerical Features
Alberto Garcia-Duran, Mathias Niepert
We present KBLRN, a framework for end-to-end learning of knowledge base representations from latent, relational, and numerical features. KBLRN integrates feature types with a novel combination of neural representation learning and probabilistic product of experts models. To the best of our knowledge, KBLRN is the first approach that learns representations of knowledge bases by integrating latent, relational, and numerical features. We show that instances of KBLRN outperform existing methods on a range of knowledge base completion tasks. We contribute a novel data set enriching commonly used knowledge base completion benchmarks with numerical features. We also investigate the impact numerical features have on KB completion performance.
ID: 151
Probabilistic AND-OR Attribute Grouping for Zero-Shot Learning
Yuval Atzmon, Gal Chechik
Learning with semantic features could benefit from taking into account the structure of the input space, but learning this structure from limited data is very challenging. This problem commonly arises in zero-shot learning (ZSL) - the problem of categorizing objects to classes that have no training examples. In vision, ZSL is often addressed by learning to detect visual attributes (like "green wing"), and then using predefined knowledge to map from attributes to new classes. This mapping is typically hard because data is not sufficient for training rich attribute-based classifiers, and flat models over attributes are often too limited.
Here we describe LAGO, a probabilistic model designed to capture natural soft and-or relations across groups of attributes, and show how this model can be learned in an end-to-end manner with a deep attribute detection model. The model also learns a soft group structure, and readily benefits from prior knowledge about groups if available. We find that the soft and-or structure succeeds in capturing meaningful and predictive structure, improving the accuracy of zero-shot learning on two machine vision ZSL benchmark datasets.
ID: 156
Sylvester Normalizing Flows for Variational Inference
Rianne van den Berg, Leonard Hasenclever, Jakub Tomczak, Max Welling
Variational inference relies on flexible approximate posterior distributions. Normalizing flows provide a general recipe to construct flexible variational posteriors. We introduce Sylvester normalizing flows, which can be seen as a generalization of planar flows. Sylvester normalizing flows remove the well-known single-unit bottleneck from planar flows, making a single transformation much more flexible. We compare the performance of Sylvester normalizing flows against planar flows and inverse autoregressive flows and demonstrate that they compare favorably on several datasets.
ID: 163
Holistic Representations for Memorization and Inference
Yunpu Ma, Marcel Hildebrandt, Volker Tresp, Stephan Baier
In this paper we introduce a novel holographic memory model for the distributed storage of complex association patterns and apply it to knowledge graphs, which consist of directed labelled links between nodes. A link connects a subject node with an object node, jointly forming subject-predicate-object triples. In the presented work, nodes and links have initial random representations, plus holistic representations derived from the initial representations of nodes and links in their local neighbourhoods. A memory trace is represented in the same vector space as the holistic representations themselves. To reduce the interference between stored information, the initial random vectors are required to be nearly pairwise orthogonal. We show that this pairwise quasi-orthogonality can be improved by drawing vectors from heavy-tailed distributions, e.g. a Cauchy distribution, and that as a consequence the memory capacity of holistic representations can be significantly improved. Furthermore, we show that, in combination with a very simple neural network, the presented holistic representation approach is superior to other methods for link prediction on knowledge graphs.
ID: 167
Simple and practical algorithms for $\ell_p$-norm low-rank approximation
Anastasios Kyrillidis
We propose practical algorithms for entrywise $\ell_p$-norm low-rank approximation, for $p = 1$ or $p = \infty$.
The proposed framework, which is non-convex and gradient-based, is easy to implement and typically attains
better approximations, faster, than the state of the art.
From a theoretical standpoint, we show that the proposed scheme can attain $(1 + \varepsilon)$-OPT approximations.
Our algorithms are not hyperparameter-free: they achieve the desiderata only assuming the algorithm's hyperparameters are known {\em a priori}, or are at least approximable.
\emph{I.e.}, our theory indicates which problem quantities need to be known in order to obtain a good solution within polynomial time, and does not contradict recent inapproximability results, as in \citep{song2017low}.
ID: 169
Quantile-Regret Minimisation in Infinitely Many-Armed Bandits
Arghya Roy Chaudhuri, Shivaram Kalyanakrishnan
The stochastic multi-armed bandit is a well-studied abstraction of decision making in the face of uncertainty.
We consider the setting in which the number of bandit arms is much larger than the possible number of pulls,
and can even be infinite. With the aim of minimising regret with respect to an optimal arm, existing methods
for this setting either assume some structure over the set of arms (Kleinberg et al., 2008, Ray Chowdhury and Gopalan, 2017),
or some property of the reward distribution (Wang et al., 2008). Invariably, the validity of such assumptions, and therefore
the performance of the corresponding methods, depends on instance-specific parameters, which might not be known
beforehand.
We propose a conceptually simple, parameter-free, and practically effective alternative.
Specifically, we introduce a notion of regret with respect to the top quantile of a probability distribution over the
expected reward of randomly drawn arms. Our main contribution is an algorithm that achieves
sublinear “quantile-regret”, both (1) when a quantile is specified, and (2) when the quantile can
be any (unknown) positive value. The algorithm needs no side information about the arms or about
the structure of their reward distributions: it relies on random sampling to reach arms in the top
quantile. Experiments show that our algorithm outperforms several previous methods (in terms of
conventional regret) when the latter are not tuned well, and often even when they are.
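The random-sampling intuition behind quantile regret admits a one-line calculation: if arms are drawn uniformly at random, the chance that $k$ draws all miss the top-$\rho$ quantile is $(1-\rho)^k$. A small sketch of this bound (the paper's actual subroutine and constants differ; this is only the underlying sampling argument):

```python
import math

def arms_needed(quantile, delta):
    """Smallest k such that k uniformly random arm draws include at least
    one arm from the top `quantile` fraction with probability >= 1 - delta:
    (1 - quantile)^k <= delta  =>  k >= log(delta) / log(1 - quantile)."""
    return math.ceil(math.log(delta) / math.log(1.0 - quantile))

# Reaching the top 1% of arms with 95% confidence:
k = arms_needed(0.01, 0.05)
```

Notably, $k$ depends only on the quantile and confidence level, not on the total number of arms, which is why the approach extends to infinitely many arms.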
ID: 171
Variational Inference for Gaussian Process Models for Survival Analysis
Minyoung Kim, Vladimir Pavlovic
Gaussian process survival analysis model (GPSAM) was recently proposed to address key deficiencies of the Cox proportional hazard model, namely the need to account for uncertainty in the hazard function modeling while, at the same time, relaxing the time-covariates factorized assumption of the Cox model. However, the existing MCMC inference algorithms for GPSAM have proven to be slow in practice. In this paper we propose novel and scalable variational inference algorithms for GPSAM that reduce the time complexity of the sampling approaches and improve scalability to large datasets. We accomplish this by employing two effective strategies in scalable GP: i) using pseudo inputs and ii) approximation via random feature expansions. In both setups, we derive the full and partial likelihood formulations, typically considered in survival analysis settings. The proposed approaches are evaluated on two clinical datasets and a divorce-marriage benchmark dataset, where we demonstrate improvements in prediction accuracy over the existing survival analysis methods, while reducing the complexity of inference compared to the recent state-of-the-art MCMC-based algorithms.
ID: 179
A Cost-Effective Framework for Preference Elicitation and Aggregation
Zhibing Zhao, Haoming Li, Junming Wang, Jeffrey O. Kephart, Nicholas Mattei, Hui Su, Lirong Xia
We propose a cost-effective framework for preference elicitation and aggregation under Plackett-Luce model with features. Given a budget, our framework iteratively computes the most cost-effective elicitation questions in order to help the agents make better group decisions.
We illustrate the viability of the framework with an experiment on Amazon Mechanical Turk, which estimates the cost of answering different types of elicitation questions. We compare the prediction accuracy of our framework when adopting various information criteria that evaluate the expected information gain from a question. Our experiments show that carefully designed information criteria are much more efficient than randomly asking questions under a budget constraint.
ID: 181
Incremental Learning-to-Learn with Statistical Guarantees
Giulia Denevi, Carlo Ciliberto, Dimitris Stamos, Massimiliano Pontil
In learning-to-learn the goal is to infer a learning algorithm that works well on a class of tasks sampled from an unknown meta-distribution. In contrast to previous work on batch learning-to-learn, we consider a scenario where tasks are presented sequentially and the algorithm needs to adapt incrementally to improve its performance on future tasks. Key to this setting is for the algorithm to rapidly incorporate new observations into the model as they arrive, without keeping them in memory. We focus on the case where the underlying algorithm is ridge regression parameterized by a positive semidefinite matrix. We propose to learn this matrix by applying a stochastic strategy to minimize the empirical error incurred by ridge regression on future tasks sampled from the meta-distribution. We study the statistical properties of the proposed algorithm and prove non-asymptotic bounds on its excess transfer risk, that is, the generalization performance on new tasks from the same meta-distribution. We compare our online learning-to-learn approach with a state of the art batch method, both theoretically and empirically.
ID: 182
Bandits with Side Observations: Bounded vs. Logarithmic Regret
Rémy Degenne, Evrard Garcelon, Vianney Perchet
We consider the classical stochastic multi-armed bandit but where, from time to time and roughly with frequency $\epsilon$, an extra observation is gathered by the agent for free. We prove that, no matter how small $\epsilon$ is, the agent can ensure a regret uniformly bounded in time.
More precisely, we construct an algorithm with a regret smaller than $\sum_i \frac{\log(1/\epsilon)}{\Delta_i}$, up to multiplicative constant and $\log\log$ terms. We also prove a matching lower-bound, stating that no reasonable algorithm can outperform this quantity.
ID: 185
Sampling and Inference for Beta Neutral-to-the-Left Models of Sparse Networks
Benjamin Bloem-Reddy, Adam Foster, Emile Mathieu, Yee Whye Teh
Empirical evidence suggests that heavy-tailed degree distributions occurring in many real networks are well-approximated by power laws with exponents $\eta$ that take values both less than and greater than two. Models based on various forms of exchangeability are able to capture power laws with $\eta < 2$, and admit tractable inference algorithms; we draw on existing results to show that $\eta > 2$ cannot be generated by the forms of exchangeability used in random graph models. Preferential attachment models generate power law exponents greater than two, but have been of limited use as statistical models due to the inherent difficulty of performing inference in non-exchangeable models. Motivated by this gap, we design and implement inference algorithms for a recently proposed class of models that generates $\eta$ of all possible values. We show that although they are not exchangeable, these models exhibit probabilistic structure amenable to estimation and inference. Our inference methods make a large class of previously intractable models useful for statistical inference.
ID: 186
Clustered Fused Graphical Lasso
Yizhi Zhu, Oluwasanmi Koyejo
Estimating the dynamic connectivity structure among a system of entities has garnered much attention in recent years. While usual methods are designed to take advantage of temporal consistency to overcome noise, they conflict with the detectability of anomalies. We propose Clustered Fused Graphical Lasso (CFGL), a method using precomputed clustering information to improve the signal detectability as compared to typical Fused Graphical Lasso methods. We evaluate our method in both simulated and real-world datasets and conclude that, in many cases, CFGL can significantly improve the sensitivity to signals without a significant negative effect on the temporal consistency.
ID: 191
Unsupervised Learning of Latent Physical Properties Using Perception-Prediction Networks
David Zheng, Vinson Luo, Jiajun Wu, Joshua Tenenbaum
We propose a framework for the completely unsupervised learning of latent object properties from their interactions: the perception-prediction network (PPN). Consisting of a perception module that extracts representations of latent object properties and a prediction module that uses those extracted properties to simulate system dynamics, the PPN can be trained in an end-to-end fashion purely from samples of object dynamics. We find that the representations of latent object properties learned by PPNs not only are sufficient to accurately simulate the dynamics of systems comprised of previously unseen objects, but also can be translated directly into human-interpretable properties (e.g. mass, coefficient of restitution) in an entirely unsupervised manner. Crucially, PPNs also generalize to novel scenarios: their gradient-based training can be applied to many dynamical systems and their graph-based structure functions over systems comprised of different numbers of objects. Our results demonstrate the efficacy of graph-based neural architectures in object-centric inference and prediction tasks, and our model has the potential to discover relevant object properties in systems that are not yet well understood.
Stochastic variance-reduced gradient Langevin dynamics (SVRG-LD) was recently proposed to improve the performance of stochastic gradient Langevin dynamics (SGLD) by reducing the variance in the stochastic gradient. In this paper, we study a variant of SVRG-LD, namely SVRG-LD$^+$, which replaces the full gradient in each epoch by a subsampled one. We provide a nonasymptotic analysis of the convergence of SVRG-LD$^+$ in $2$-Wasserstein distance, and show that SVRG-LD$^+$ enjoys a lower gradient complexity than SVRG-LD, when the sample size is large or the target accuracy requirement is moderate. Our analysis directly implies a sharper convergence rate for SVRG-LD, which improves the existing convergence rate by a factor of $\kappa^{1/6}n^{1/6}$, where $\kappa$ is the condition number of the log-density function and $n$ is the sample size. Experiments on both synthetic and real-world datasets validate our theoretical results.
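For context, the baseline that SVRG-LD and SVRG-LD$^+$ refine is plain stochastic gradient Langevin dynamics, whose update adds Gaussian noise to a gradient step. A minimal sketch on a one-dimensional standard-normal target (step size and iteration counts are illustrative; the variance-reduced gradient estimate from the paper is not shown):

```python
import math
import random

def sgld(grad_logp, theta0, step, iters, seed=0):
    """Langevin dynamics with the Euler discretization:
    theta <- theta + (step / 2) * grad log p(theta) + sqrt(step) * N(0, 1).
    SVRG-LD and SVRG-LD+ replace grad_logp by cheaper variance-reduced
    stochastic estimates; this baseline uses the exact gradient."""
    rng = random.Random(seed)
    theta = theta0
    samples = []
    for _ in range(iters):
        noise = rng.gauss(0.0, 1.0)
        theta = theta + 0.5 * step * grad_logp(theta) + math.sqrt(step) * noise
        samples.append(theta)
    return samples

# Target: standard normal, so grad log p(theta) = -theta.
samples = sgld(lambda t: -t, theta0=5.0, step=0.05, iters=20000)
burn_in = samples[5000:]
mean = sum(burn_in) / len(burn_in)
var = sum((x - mean) ** 2 for x in burn_in) / len(burn_in)
```

After burn-in, the chain's sample mean and variance should be close to those of the $\mathcal{N}(0,1)$ target, up to discretization bias that shrinks with the step size.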
ID: 195
Finite-State Controllers of POMDPs using Parameter Synthesis
Sebastian Junges, Nils Jansen, Ralf Wimmer, Tim Quatmann, Leonore Winterer, Joost-Pieter Katoen, Bernd Becker
We study finite-state controllers (FSCs) for partially observable Markov decision processes (POMDPs) that are provably correct with respect to given specifications. The key insight is that computing (randomised) FSCs on POMDPs is equivalent to (and computationally as hard as) synthesis for parametric Markov chains (pMCs). This correspondence enables using black-box techniques to compute correct-by-construction FSCs for POMDPs for a wide range of properties. Our experimental evaluation on typical POMDP problems shows that we are competitive to state-of-the-art POMDP solvers.
ID: 198
Identification of Personalized Path-Specific Effects
Ilya Shpitser, Eli Sherman
Unlike classical causal inference, where the goal is to estimate average causal effects within a population, in settings such as personalized medicine, the goal is to map a unit’s characteristics to a treatment tailored to maximize the expected outcome for that unit. Obtaining high-quality mappings of this type is the goal of the dynamic regime literature. In healthcare settings, optimizing policies with respect to a particular causal pathway is often of interest as well. For example, one may wish to maximize the chemical effect of a drug given data from an observational study where chemical effect is entangled with differential adherence. In such cases, one may wish to optimize the direct effect of a drug, while keeping the indirect effect mediated by adherence to that of some reference treatment. In the context of average treatment effects, direct and indirect effects are considered in the mediation analysis literature.
In this paper, we combine mediation analysis and dynamic treatment regime ideas and consider how unit characteristics may be used to tailor a treatment strategy that maximizes a particular path-specific effect. In particular, we define counterfactuals associated with the path-specific effects of a policy, give a general identification algorithm for these counterfactuals, and prove completeness of the algorithm for unrestricted policies. A corollary of our results is that the identification algorithm for responses to policies given in [13] is complete for arbitrary policies.
ID: 201
Fast Counting in Machine Learning Applications
Subhadeep Karan, Matthew Eichhorn, Blake Hurlburt, Grant Iraci, Jaroslaw Zola
We propose scalable methods to execute counting queries in machine learning applications. To achieve memory and computational efficiency, we abstract counting queries and their context such that the counts can be aggregated as a stream. We demonstrate performance and scalability of the resulting approach on random queries, and through extensive experimentation using Bayesian networks learning and association rule mining. Our methods significantly outperform commonly used ADtrees and hash tables, and are practical alternatives for processing large-scale data.
ID: 204
A Dual Approach to Scalable Verification of Deep Networks
Krishnamurthy Dvijotham, Robert Stanforth, Sven Gowal, Timothy Mann, Pushmeet Kohli
This paper addresses the problem of formally verifying desirable properties of neural networks, \emph{i.e.}, obtaining provable guarantees that the outputs of the neural network will always behave in a certain way for a given class of inputs. Most previous work on this topic was limited in its applicability by the size of the network, the network architecture and the complexity of the properties to be verified. In contrast, our framework applies to a much more general class of activation functions and specifications on neural network inputs and outputs. We formulate verification as an optimization problem and solve a Lagrangian relaxation of it to obtain a bound on the verification objective. Our approach is \emph{anytime}, i.e., it can be stopped at any time and a valid bound on the objective can be obtained. We provide theoretical guarantees for the tightness of our verification approach (i.e., how loose our bounds could be) for certain special classes of constraints and demonstrate its practical significance on a variety of verification tasks.
ID: 207
Understanding Measures of Uncertainty for Adversarial Example Detection
Lewis Smith, Yarin Gal
Measuring uncertainty is a promising technique for detecting adversarial examples, crafted inputs on which the model predicts an incorrect class with high confidence. But many measures of uncertainty exist, including predictive entropy and mutual information, each capturing different types of uncertainty. We study these measures, and shed light on why mutual information seems to be effective at the task of adversarial example detection. We highlight failure modes for MC dropout, a widely used approach for estimating uncertainty in deep models. This leads to an improved understanding of the drawbacks of current methods, and a proposal to improve the quality of uncertainty estimates using probabilistic model ensembles. We give illustrative experiments using MNIST to demonstrate the intuition underlying the different measures of uncertainty, as well as experiments on a real-world Kaggle dogs vs cats classification dataset.
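The two measures named above are straightforward to compute from stochastic forward passes. A sketch, assuming we already have $T$ predictive distributions per input (e.g. from MC dropout): predictive entropy is the entropy of the mean distribution, while mutual information subtracts the mean per-sample entropy, so it is high only when the samples disagree:

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def uncertainty_measures(prob_samples):
    """Given T predictive distributions for one input (e.g. T stochastic
    forward passes under MC dropout), return
      predictive entropy   H[ mean_t p_t ]   and
      mutual information   H[ mean_t p_t ] - mean_t H[ p_t ]."""
    T = len(prob_samples)
    K = len(prob_samples[0])
    mean_p = [sum(s[k] for s in prob_samples) / T for k in range(K)]
    pred_entropy = entropy(mean_p)
    mutual_info = pred_entropy - sum(entropy(s) for s in prob_samples) / T
    return pred_entropy, mutual_info

# Confident but disagreeing passes: both measures are maximal (ln 2).
pe, mi = uncertainty_measures([[1.0, 0.0], [0.0, 1.0]])
# Identical uniform passes: high predictive entropy, zero mutual information.
pe2, mi2 = uncertainty_measures([[0.5, 0.5], [0.5, 0.5]])
```

The second case illustrates the distinction the abstract exploits: mutual information isolates disagreement between passes, which is the signature adversarial examples tend to trigger.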
ID: 208
Causal Discovery in the Presence of Measurement Error
Tineke Blom, Sara Magliacane, Anna Klimovskaia, Joris M. Mooij
Causal discovery algorithms can infer causal relationships based on several assumptions, which include the absence of measurement error. However, this assumption is most likely violated in practical applications, resulting in erroneous, irreproducible results. In this work we show how an upper-bound for the variance of random measurement error can be obtained from the covariance matrix of measured variables. We demonstrate a practical application of our approach on real-world protein signaling data.
ID: 212
IDK Cascades: Fast Deep Learning by Learning not to Overthink
Xin Wang, Yujia Luo, Daniel Crankshaw, Alexey Tumanov, Fisher Yu, Joseph E. Gonzalez
Advances in deep learning have led to substantial increases in prediction accuracy but have been accompanied by increases in the cost of rendering predictions. We conjecture that for a majority of real-world inputs, the recent advances in deep learning have created models that effectively “over-think” on simple inputs. In this paper we revisit the classic question of building model cascades that primarily leverage class asymmetry to reduce cost. We introduce the “I Don’t Know” (IDK) prediction cascades framework, a general framework to systematically compose a set of pre-trained models to accelerate inference without a loss in prediction accuracy. We propose two search-based methods for constructing cascades as well as a new cost-aware objective within this framework. The proposed IDK cascade framework can be easily adopted in existing model serving systems without additional model re-training. We evaluate the proposed techniques on a range of benchmarks to demonstrate the effectiveness of the proposed framework.
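The cascade idea can be sketched in a few lines: try cheap models first and fall through whenever confidence is below a threshold. The model interface below is hypothetical, and the paper additionally searches over cascade composition with a cost-aware objective:

```python
def idk_cascade(models, x, threshold=0.9):
    """Run models cheapest-first; return the first prediction whose
    confidence clears `threshold`, else the last model's answer.
    Each model maps an input to a (label, confidence) pair."""
    for predict in models[:-1]:
        label, conf = predict(x)
        if conf >= threshold:
            return label
    return models[-1](x)[0]

# Toy models: a fast model that is only confident on "easy" inputs,
# and a slow model that is always confident.
fast = lambda x: ("cat", 0.95 if x == "easy" else 0.55)
slow = lambda x: ("dog", 0.99)
```

On "easy" inputs the expensive model is never invoked, which is the source of the inference speedup.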
ID: 217
Learning Fast Optimizers for Contextual Stochastic Integer Programs
Vinod Nair, Dj Dvijotham, Iain Dunning, Oriol Vinyals
We present a novel RL-based approach to learning a fast and highly scalable solver for a two-stage stochastic integer program in the large-scale data setting. Branch and bound solvers such as SCIP do not scale to large datasets for this problem class. Additionally, they solve each instance independently, without any knowledge transfer across instances. We address these limitations with a learnable local search solver that jointly learns two policies, one to generate an initial solution and another to iteratively improve it with local moves. The policies use contextual features for a problem instance as input, which enables learning across instances and generalization to new ones. We also propose learning a policy to estimate a bound on the objective using dual decomposition. Benchmark results show that on test instances our approach rapidly achieves approximately 30% to 2000% better objective value, which SCIP requires more than an order of magnitude more running time to match. Our approach also achieves better solution quality on seven out of eight benchmark problems than more scalable baselines such as Tabu Search and Progressive Hedging.
ID: 221
Differential Analysis of Directed Networks
Min Ren, Dabao Zhang
We developed a novel statistical method to identify structural differences between two networks characterized by structural equation models (SEMs). We propose to reparameterize the model to separate the differential structures from common structures, and then design an algorithm with calibration and construction stages to identify these differential structures.
The calibration stage serves to obtain consistent predictions by building the $\ell_2$-regularized regression of each endogenous variable against prescreened exogenous variables, correcting for the potential endogeneity issue. The construction stage consistently selects and estimates both common and differential effects by undertaking $\ell_1$-regularized regression of each endogenous variable against the predictions of other endogenous variables as well as its anchoring exogenous variables. Our method allows easy parallel computation at each stage. Theoretical results establish non-asymptotic error bounds on predictions and estimates at both stages, as well as the consistency of the identified common and differential effects. Our studies on synthetic data demonstrate that our proposed method performs much better than constructing the networks independently. A real data set is analyzed to illustrate the applicability of our method.
ID: 225
Sparse-Matrix Belief Propagation
Reid Bixler, Bert Huang
We propose sparse-matrix belief propagation, which executes loopy belief propagation in Markov random fields by replacing indexing over graph neighborhoods with sparse-matrix operations. This abstraction allows for seamless integration with optimized sparse linear algebra libraries, including those that perform matrix and tensor operations on modern hardware such as graphical processing units (GPUs). The sparse-matrix abstraction allows the implementation of belief propagation in a high-level language (e.g., Python) that is also able to leverage the power of GPU parallelization. We demonstrate sparse-matrix belief propagation by implementing it in a modern deep learning framework (PyTorch), measuring the resulting massive improvement in running time, and facilitating future integration into deep learning models.
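The key observation, that a sum-product message is a (sparse) matrix-vector product, can be illustrated on a two-node pairwise MRF with toy potentials; the paper batches such products over all edges of a loopy graph:

```python
def normalize(v):
    s = sum(v)
    return [x / s for x in v]

# Pairwise MRF on two binary nodes: p(x1, x2) ∝ phi1[x1] * phi2[x2] * psi[x1][x2].
phi1, phi2 = [0.6, 0.4], [0.3, 0.7]
psi = [[1.0, 0.5],
       [0.5, 2.0]]

# Sum-product message m_{1->2}[x2] = sum_{x1} phi1[x1] * psi[x1][x2],
# i.e. a matrix-vector product psi^T phi1 -- the operation that
# sparse-matrix BP hands off to optimized linear algebra.
m12 = [sum(phi1[i] * psi[i][j] for i in range(2)) for j in range(2)]
belief2 = normalize([phi2[j] * m12[j] for j in range(2)])

# Brute-force marginal of x2 for comparison (BP is exact on trees).
joint = [[phi1[i] * phi2[j] * psi[i][j] for j in range(2)] for i in range(2)]
Z = sum(sum(row) for row in joint)
exact2 = [sum(joint[i][j] for i in range(2)) / Z for j in range(2)]
```

On this tree-structured example the message-passing belief matches the brute-force marginal exactly; on loopy graphs the same matrix operations are iterated to a fixed point.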
ID: 233
Sequential Learning under Probabilistic Constraints
Amirhossein Meisami, Henry Lam, Chen Dong, Abhishek Pani
We provide the first study on online learning problems under stochastic constraints that are "soft", i.e., need to be satisfied with high probability. These constraints are imposed on all or some stages of the time horizon so that the stage decision probabilistically satisfies a safety condition that is realized after the decision is made. The distributions that govern these conditions are learned through the collected observations. Under a Bayesian framework, we study a scheme that provides statistical feasibility guarantees through the time horizon, by using posterior Monte Carlo samples to form sampled constraints that leverage the scenario generation approach in chance-constrained programming. We demonstrate how our scheme can be integrated into Thompson sampling and illustrate it with an application in online advertisement.
ID: 234
Abstraction Sampling in Graphical Models
Rina Dechter, Filjor Broka, Kalev Kask, Alexander Ihler
We present a new sampling scheme for approximating hard-to-compute queries over graphical models, such as computing the partition function. The scheme builds upon exact algorithms that traverse a weighted directed state-space graph representing a global function over a graphical model (e.g., a probability distribution). With the aid of an abstraction function and randomization, the state space can be compacted (or trimmed) to facilitate tractable computation, yielding a Monte Carlo estimate that is unbiased. We present the general idea and analyze its properties analytically and empirically, investigating two specific ideas for picking abstractions: targeting reduction of variance or of search space size.
ID: 235
Meta Reinforcement Learning with Latent Variable Gaussian Processes
Steindor Saemundsson, Katja Hofmann, Marc Peter Deisenroth
Data efficiency, i.e., learning from small data sets, is critical in many practical applications where data collection is time consuming or expensive, e.g., robotics, animal experiments or drug design. Meta learning is one way to increase the data efficiency of learning algorithms by generalizing learned concepts from a set of training tasks to unseen, but related, tasks. Often, this relationship between tasks is hard coded or relies in some other way on human expertise. In this paper, we propose to automatically learn the relationship between tasks using a latent variable model. Our approach finds a variational posterior over tasks and averages over all plausible (according to this posterior) tasks when making predictions. We apply this framework within a model-based reinforcement learning setting for learning dynamics models and controllers of many related tasks, and show that our model effectively generalizes to novel tasks, and that it reduces the average interaction time needed to solve tasks by up to 60% compared to strong baselines.
ID: 236
Non-Parametric Path Analysis in Structural Causal Models
Junzhe Zhang, Elias Bareinboim
One of the fundamental tasks in causal inference is to decompose the observed association between a decision X and an outcome Y into its most basic structural mechanisms. In this paper, we introduce counterfactual estimands for effects along a specific mechanism, represented as a path from X to Y in an arbitrary structural causal model. We introduce a novel non-parametric decomposition formula which expresses the covariance of X and Y as a sum over unblocked paths from X to Y contained in an arbitrary causal model. This formula allows a fine-grained path analysis without requiring a commitment to any particular parametric form, and can be seen as a generalization of Wright's decomposition method in linear systems (1923, 1932) and Pearl's non-parametric mediation formula (2001).
ID: 238
Stochastic Layer-Wise Precision in Deep Neural Networks
Griffin Lacey, Graham W. Taylor, Shawki Areibi
Low precision weights, activations, and gradients have been proposed as a way to improve the computational efficiency and memory footprint of deep neural networks. Recently, low precision networks have even shown to be more robust to adversarial attacks. However, typical implementations of low precision DNNs use uniform precision across all layers of the network. In this work, we explore whether a heterogeneous allocation of precision across a network leads to improved performance, and introduce a learning scheme where a DNN stochastically explores multiple precision configurations through learning. This permits a network to learn an optimal precision configuration. We show on convolutional neural networks trained on MNIST and ILSVRC12 that even though these nets learn a uniform or near-uniform allocation strategy respectively, stochastic precision leads to a favourable regularization effect improving generalization.
ID: 239
Estimation of Optimal Path-Specific Policies
Razieh Nabi, Phyllis Kanki, Ilya Shpitser
The goal of personalized decision making is to map a unit’s characteristics to an action tailored to maximize the expected outcome for that unit. Obtaining high-quality mappings of this type is the goal of the dynamic regime literature. In healthcare settings, optimizing policies with respect to a particular causal pathway may be of interest as well. For example, we may wish to maximize the chemical effect of a drug given data from an observational study where the chemical effect of the drug on the outcome is entangled with the indirect effect mediated by differential adherence. In such cases, we may wish to optimize the direct effect of a drug, while keeping the indirect effect to that of some reference treatment.
In a companion paper, we showed how to combine mediation analysis and dynamic treatment regime ideas to yield path specific policies and counterfactual responses to these policies. In this paper, we derive a variety of methods for learning high quality policies of this type from data, in a causal model corresponding to a longitudinal setting of practical importance. We illustrate our methods via a dataset of HIV patients undergoing therapy, gathered in the Nigerian PEPFAR program.
ID: 245
High-confidence error estimates for learned value functions
Touqir Sajed, Wesley Chung, Martha White
Estimating the value function for a fixed policy is a fundamental problem in reinforcement learning. Policy evaluation algorithms---to estimate value functions---continue to be developed, to improve convergence rates, improve stability and handle variability, particularly for off-policy learning. To understand the properties of these algorithms, the experimenter needs high-confidence estimates of the accuracy of the learned value functions. For environments with small, finite state-spaces, like chains, the true value function can be computed exactly, making accuracy easy to measure. For large or continuous state-spaces, however, this is no longer feasible. In this paper, we address the largely open problem of how to obtain these high-confidence estimates for general state-spaces. We provide a high-confidence bound relating an empirical estimate of the value error to the true value error. We use this bound to design an offline sampling algorithm, which stores the required quantities to repeatedly compute value error estimates for any learned value function. We provide experiments investigating the number of samples required by this offline algorithm in simple benchmark reinforcement learning domains, and highlight that there are still many open questions to be solved for this important problem.
ID: 247
Combinatorial Bandits for Incentivizing Agents with Dynamic Preferences
Design of incentives or recommendations to users is becoming more common as platform providers continually emerge. We propose a multi-armed bandit approach to matching incentives to users, whose preferences are unknown a priori and evolve dynamically in time, in a resource-constrained environment. We propose an algorithm that combines ideas from three distinct domains: (i) a greedy matching paradigm, (ii) the traditional upper confidence bound (UCB) algorithm for bandits, and (iii) mixing times from the theory of Markov chains. For this algorithm, we provide theoretical bounds on the regret and we demonstrate its performance via both synthetic and realistic (matching supply and demand in a bike-sharing platform) examples.
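Of the three ingredients, the UCB component is standard; a minimal sketch of the classic UCB1 index (not the paper's full matching algorithm) looks like:

```python
import numpy as np

def ucb1_choice(pulls, rewards, t):
    """UCB1: play each arm once, then pick the arm maximizing
    empirical mean + sqrt(2 ln t / n_pulls)."""
    pulls = np.asarray(pulls, dtype=float)
    if np.any(pulls == 0):
        return int(np.argmin(pulls))          # try each arm once first
    means = np.asarray(rewards, dtype=float) / pulls
    bonus = np.sqrt(2.0 * np.log(t) / pulls)  # optimism in the face of uncertainty
    return int(np.argmax(means + bonus))
```

The bonus shrinks as an arm accumulates pulls, so exploration concentrates on under-sampled arms.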
ID: 250
Sparse Multi-Prototype Classification
Vikas K. Garg, Lin Xiao, Ofer Dekel
We present a new class of sparse multi-prototype classifiers, designed to combine the computational advantages of sparse predictors with the non-linear power of prototype-based classification techniques. This combination makes sparse multi-prototype models especially well-suited for resource constrained computational platforms, such as those found in IoT devices. We cast our supervised learning problem as a convex-concave saddle point problem and design a provably-fast algorithm to solve it. We complement our theoretical analysis with an empirical study that demonstrates the power of our methodology.
ID: 252
Fast Stochastic Quadrature for Approximate Maximum-Likelihood Estimation
Nico Piatkowski, Katharina Morik
Recent stochastic quadrature techniques for undirected graphical models rely on near-minimax degree-$k$ polynomial approximations to the model's potential function for inferring the partition function. While providing desirable statistical guarantees, typical constructions of such approximations are themselves not amenable to efficient inference. Here, we develop a class of Monte Carlo sampling algorithms for efficiently approximating the value of the partition function, as well as the associated pseudo-marginals. More precisely, for pairwise models with $n$ vertices and $m$ edges, the complexity can be reduced from $O(d^k)$ to $O(k^4 + kn + m + k!)$, where $d > 4m$ is the parameter dimension. We also consider the use of stochastic quadrature for the problem of maximum likelihood (ML) parameter estimation. For a completely observed model, our analysis gives rise to a probabilistic bound on the log likelihood of the data. Maximizing this bound yields an approximate ML estimate which, in analogy to the moment-matching of exact ML estimation, can be interpreted in terms of pseudo-moment-matching. We present experimental results illustrating the behavior of this approximate ML estimator.
ID: 253
Finite-sample Bounds for Marginal MAP
Qi Lou, Rina Dechter, Alexander Ihler
Marginal MAP is a key task in Bayesian inference and decision-making, and is known to be very challenging in general. In this paper, we present an algorithm that blends heuristic search and importance sampling to provide anytime finite-sample bounds for marginal MAP along with predicted MAP solutions. We convert bounding marginal MAP to a surrogate task of bounding a series of summation problems of an augmented graphical model, and then adapt dynamic importance sampling, a recent advance in bounding the partition function, to provide finite-sample bounds for the surrogate task. These bounds are guaranteed to be tight given enough time, and the values of the predicted MAP solutions converge to the optimum. Our algorithm runs in an anytime/anyspace manner, which gives flexible trade-offs between memory, time, and solution quality. We demonstrate the effectiveness of our approach empirically on multiple challenging benchmarks in comparison with state-of-the-art search algorithms.
ID: 255
Acyclic Linear SEMs Obey the Nested Markov Property
Ilya Shpitser, Robin Evans, Thomas S. Richardson
The conditional independence structure induced on the observed marginal distribution by a hidden variable directed acyclic graph (DAG) may be represented by a graphical model based on a class of mixed graphs called maximal ancestral graphs (MAGs). This model has a number of desirable properties; in particular, the set of Gaussian distributions can be parameterized by viewing the graph as a path diagram. Models represented by MAGs have been used for causal discovery [21] and for identification theory for causal effects [27].
In addition to ordinary conditional independence constraints, hidden variable DAGs also induce generalized independence constraints. These constraints form the nested Markov property [19]. We first show that acyclic linear SEMs obey this property. Further we show that a natural parameterization for all Gaussian distributions obeying the nested Markov property arises from a generalization of maximal ancestral graphs that we call maximal arid graphs (MArG). We show that every nested Markov model can be associated with a MArG; viewed as a path diagram this MArG parametrizes the Gaussian nested Markov model. This leads directly to methods for ML fitting and computing BIC scores for Gaussian nested models.
ID: 263
A Unified Particle-Optimization Framework for Scalable Bayesian Sampling
Changyou Chen, Ruiyi Zhang, Wenlin Wang, Bai Li, Liqun Chen
There has been recent interest in developing scalable Bayesian sampling methods for big-data analysis, such as stochastic gradient MCMC (SG-MCMC) and Stein variational gradient descent (SVGD). A standard SG-MCMC algorithm simulates samples from a discrete-time Markov chain to approximate a target distribution; thus samples can be highly correlated, an undesired property for SG-MCMC. In contrast, SVGD directly optimizes a set of particles to approximate a target distribution, and is thus able to obtain good approximations with relatively few samples. In this paper, we propose a principled particle-optimization framework based on Wasserstein gradient flows to unify SG-MCMC and SVGD, and to allow new algorithms to be developed. Our framework interprets SG-MCMC as particle optimization, revealing strong connections between SG-MCMC and SVGD. The key component of our framework is a set of particle-approximation techniques to efficiently solve the original partial differential equations on the space of probability measures. Extensive experiments on both synthetic data and deep neural networks demonstrate the effectiveness and efficiency of our framework for scalable Bayesian sampling.
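For concreteness, the SVGD side of this picture updates a particle set with a kernelized gradient flow; a minimal sketch with an RBF kernel and a fixed bandwidth (the standard SVGD update, not the paper's unified scheme):

```python
import numpy as np

def svgd_step(x, grad_logp, h=1.0, eps=0.1):
    """One SVGD update: each particle moves along
    phi(x_i) = (1/n) sum_j [ k(x_j, x_i) grad log p(x_j) + grad_{x_j} k(x_j, x_i) ],
    a driving term toward high density plus a repulsive term between particles."""
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]            # (n, n, d): x_i - x_j
    K = np.exp(-(diff ** 2).sum(-1) / (2.0 * h))    # RBF kernel matrix
    G = grad_logp(x)                                # (n, d) score at each particle
    phi = (K @ G + (K[:, :, None] * diff).sum(axis=1) / h) / n
    return x + eps * phi
```

Because particles repel each other through the kernel gradient, the resulting samples are far less correlated than a Markov chain's, which is the contrast the abstract draws with SG-MCMC.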
ID: 265
An Efficient Quantile Spatial Scan Statistic for Finding Unusual Regions in Continuous Spatial Data with Covariates
Travis Moore, Weng-Keen Wong
Domains such as citizen science biodiversity monitoring and real estate are producing spatial data with a continuous response and an associated vector of covariates (i.e., features) at each spatial data point. A common data analysis task involves finding unusual regions that differ from the surrounding area. Existing techniques are intended for simpler data (e.g., point process data) and they compare regions according to the means of their distributions to measure "unusualness". Comparing means is not only vulnerable to outliers, but it is also restrictive, as an analyst may want to compare other parts of the probability distributions. For instance, a real-estate analyst interested in unusual areas for high-end homes would be more interested in the 90th percentile of home sales prices than in the mean. We introduce the Quantile Spatial Scan Statistic (QSSS), which finds unusual regions in spatial data by comparing quantiles of data distributions while accounting for covariates at each data point. We also develop an exact incremental update of the underlying hypothesis test used by the QSSS, which results in a massive speedup over a naive implementation.
ID: 268
Stable Gradient Descent
Yingxue Zhou, Sheng Chen, Arindam Banerjee
Given a finite training set, the goal of many machine learning tasks is to learn a model that has small population risk. Gradient descent (GD) for empirical risk minimization (ERM) and stochastic gradient descent (SGD) are two popular methods for achieving this goal. In practice, the stopping criterion for GD is not clear, and it often runs for excessively many iterations, which may lead to overfitting, while SGD suffers from high variance. To address these issues, in this paper we propose a Stable Gradient Descent algorithm (StGD) based on adaptive data analysis and differential privacy. StGD provides an upper bound on the number of iterations, and our theoretical analysis shows high-probability convergence to the population risk minimizer under suitable conditions. Experimental results illustrate the superiority of the proposed algorithm.
ID: 269
Learning to select computations
Frederick Callaway, Sayan Gul, Paul M. Krueger, Thomas L. Griffiths, Falk Lieder
The efficient use of limited computational resources is an essential ingredient of intelligence. Selecting computations optimally according to rational metareasoning would achieve this, but rational metareasoning is computationally intractable. Inspired by psychology and neuroscience, we propose the first learning algorithm for approximating the optimal selection of computations: Bayesian metalevel policy search (BMPS). We derive this general, sample-efficient search algorithm for a computation-selecting meta-policy based on the insight that the value of information lies between the myopic value of information and the value of perfect information. We evaluate BMPS on three increasingly difficult metareasoning problems: when to terminate computation, how to allocate computation between competing options, and planning. Across all three domains, BMPS achieved near-optimal performance and compared favorably to previously proposed metareasoning heuristics. Finally, we demonstrate the practical utility of BMPS in an emergency management scenario, even accounting for the overhead of metareasoning.
ID: 282
Per-Decision Multi-step Temporal Difference Learning with Control Variates
Kristopher De Asis, Richard S. Sutton
Multi-step temporal difference (TD) learning is an important approach in reinforcement learning, as it unifies one-step TD learning with Monte Carlo methods in a way where intermediate algorithms can outperform either extreme. These methods address a bias-variance trade-off between reliance on current estimates, which could be poor, and incorporating longer sampled reward sequences into the updates. Especially in the off-policy setting, where the agent aims to learn about a policy different from the one generating its behaviour, the variance in the updates can cause learning to diverge as the number of sampled rewards used in the estimates increases. In this paper, we introduce per-decision control variates for multi-step TD algorithms, and compare them to existing methods. Our results show that including the control variates can greatly improve performance on both on- and off-policy multi-step temporal difference learning tasks.
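The per-decision control variate modifies the recursive off-policy return so that when the importance ratio ρ is far from one, the target falls back toward the current value estimate rather than collapsing; a minimal sketch in our own notation (the paper's experiments build on this form of target):

```python
def cv_return(rewards, rhos, values, gamma):
    """n-step off-policy return with per-decision control variates:
        G_t = rho_t * (R_{t+1} + gamma * G_{t+1}) + (1 - rho_t) * V(S_t),
    computed backwards, bootstrapping from values[-1] = V(S_{t+n}).
    rewards and rhos have length n; values has length n + 1."""
    G = values[-1]
    for k in reversed(range(len(rewards))):
        G = rhos[k] * (rewards[k] + gamma * G) + (1.0 - rhos[k]) * values[k]
    return G
```

With all ρ = 1 this reduces to the ordinary on-policy n-step return; with ρ = 0 the target degrades gracefully to V(S_t) instead of zero, which is what lowers the variance.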
ID: 289
The Indian Buffet Hawkes Process to Model Evolving Latent Influences
Xi Tan, Vinayak Rao, Jennifer Neville
Temporal events in the real world often exhibit reinforcing dynamics, where earlier events trigger follow-up activity in the near future. Such dynamics have been represented using models of reinforcement and reciprocation, a canonical example being the Hawkes process (HP). However, previous HP models do not capture the rich dynamics of real-world activity, which can be driven by multiple latent triggering factors shared by past and future events, with the latent features themselves exhibiting temporal dependency structures. For instance, rather than view a new document just as a response to other documents in the recent past, it is important to account for the factor-structure underlying all previous documents. This structure itself is not fixed, with the influence of earlier documents decaying with time. To this end, we propose a novel Bayesian non-parametric stochastic point process model, the Indian Buffet Hawkes Process (IBHP), to learn multiple latent triggering factors underlying streaming document/message data. The IBP facilitates the inclusion of multiple triggering factors in the HP, and the HP allows for modeling latent factor evolution in the IBP. We develop an efficient and scalable learning algorithm for the IBHP based on Sequential Monte Carlo and demonstrate the effectiveness of the model, both quantitatively and qualitatively, in experiments on synthetic and real data.
ID: 290
Battle of Bandits
Aadirupa Saha, Aditya Gopalan
We introduce Battling-bandits - an online learning framework where the learner chooses a fixed-size subset of k arms from n arms to be compared in each trial, and observes a single stochastic feedback indicating the winner of the chosen set of arms. This framework generalizes the standard dueling bandit framework and applies to several practical situations, such as medical treatment preference and customer preference of products, where it is easier and more efficient to provide feedback for multiple arms at once. We develop a novel class of stochastic subset choice models, called pairwise-subset choice models, for this purpose and propose three algorithms - Battling-Doubler, Battling-MultiSBM and Battling-Duel - for the problem. While the first two algorithms are based on the corresponding dueling bandits algorithms designed for linear-link based subset choice models, the third algorithm is more general and works for any pairwise-subset choice model that has a Condorcet winner. We provide regret guarantees for all three algorithms and also prove a matching lower bound of $\Omega(n \log T)$ for the Battling problem, showing the optimality of our proposed algorithms. The efficacy of the algorithms is also demonstrated through extensive experimental evaluations on a variety of synthetic and real-world datasets.
ID: 291
Adaptive Stochastic Dual Coordinate Ascent for Conditional Random Fields
Rémi Le Priol, Alexandre Piché, Simon Lacoste-Julien
This work investigates the training of conditional random fields (CRFs) via the stochastic dual coordinate ascent (SDCA) algorithm of Shalev-Shwartz and Zhang (2016). SDCA enjoys a linear convergence rate and a strong empirical performance for binary classification problems. However, it has never been used to train CRFs. Yet it benefits from an “exact” line search with a single marginalization oracle call, unlike previous approaches. In this paper, we adapt SDCA to train CRFs, and we enhance it with an adaptive non-uniform sampling strategy based on block duality gaps. We perform experiments on four standard sequence prediction tasks. SDCA demonstrates performance on par with the state of the art, and improves over it on three of the four datasets, which have in common the use of sparse features.
ID: 292
Adaptive Stratified Sampling for Precision-Recall Estimation
Ashish Sabharwal, Yexiang Xue
We propose a new algorithm for computing a constant-factor approximation of precision-recall (PR) curves for massive noisy datasets produced by generative models. Assessing validity of items in such datasets requires human annotation, which is costly and must be minimized. Our algorithm, AdaStrat, is the first data-aware method for this task. It chooses the next point to query on the PR curve adaptively, based on previous observations. It then selects specific items to annotate using stratified sampling. Under a mild monotonicity assumption, AdaStrat outputs a guaranteed approximation of the underlying precision function, while using a number of annotations that scales very slowly with N, the dataset size. For example, when the minimum precision is bounded by a constant, it issues only log log N precision queries. In general, it has a regret of no more than log log N w.r.t. an oracle that issues queries at data-dependent (unknown) optimal points. On a scaled-up NLP dataset of 3.5M items, AdaStrat achieves a remarkably close approximation of the true precision function using only 18 precision queries, 13x fewer than best previous approaches.
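The stratified-sampling building block here is classical: weight each stratum's sample mean by that stratum's share of the population. A minimal sketch of that estimator (the adaptivity in AdaStrat lies in choosing where on the PR curve to query, which is not sketched here):

```python
def stratified_estimate(strata_sizes, strata_samples):
    """Stratified estimate of a population proportion: the weighted sum of
    per-stratum sample means, with weights proportional to stratum sizes."""
    total = float(sum(strata_sizes))
    return sum((n / total) * (sum(s) / len(s))
               for n, s in zip(strata_sizes, strata_samples))
```

Allocating annotations across strata rather than uniformly is what keeps the number of human labels small for a fixed estimation error.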
ID: 295
Fast Kernel Approximations for Latent Force Models and Convolved Multiple-Output Gaussian processes
Cristian Guarnizo, Mauricio Álvarez
A latent force model is a Gaussian process with a covariance function inspired by a differential operator. Such a covariance function is obtained by performing convolution integrals between Green's functions associated with the differential operators and covariance functions associated with latent functions. In the classical formulation of latent force models, the covariance functions are obtained analytically by solving a double integral, leading to expressions that involve numerical solutions of different types of error functions. As a consequence, the covariance matrix calculation is considerably expensive, because it requires the evaluation of one or more of these error functions. In this paper, we use random Fourier features to approximate the solution of these double integrals, obtaining simpler analytical expressions for such covariance functions. We show experimental results using ordinary differential operators and provide an extension to build general kernel functions for convolved multiple-output Gaussian processes.
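The random-feature trick follows the standard Rahimi-Recht recipe; a minimal sketch for the RBF kernel (the paper applies the same idea inside the latent-force convolution integrals, which we do not reproduce):

```python
import numpy as np

def rff_features(X, D, lengthscale=1.0, rng=None):
    """Random Fourier features z(x) with z(x)^T z(y) ~ exp(-||x-y||^2 / (2 l^2))."""
    rng = np.random.default_rng(0) if rng is None else rng
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / lengthscale, size=(d, D))  # spectral frequency samples
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)            # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)
```

The approximation error decays as O(1/sqrt(D)), so a few thousand features typically suffice, and downstream covariance matrices become inner products of explicit feature maps.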
ID: 302
Fast Policy Learning using Imitation and Reinforcement
Imitation learning (IL) consists of a set of tools that leverage expert demonstrations to speed up the process of training policies. While these strategies provide fast convergence, the performance of IL usually varies with the quality of the expert policy. If the expert policy is suboptimal, IL can yield inferior performance compared with policies learned with policy gradient methods. In this paper, we address this problem in a mirror descent framework and propose an elegant randomized algorithm, which we name LOKI. LOKI first runs an IL algorithm for a small but random number of iterations, and then switches to a policy gradient method. We show that if the switching time is properly randomized, LOKI can learn to outperform a suboptimal expert and converge faster than running policy gradient methods from scratch. Finally, we evaluate the performance of LOKI experimentally in several simulated environments.
ID: 309
Hyperspherical Variational Auto-Encoders
Tim Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, Jakub M. Tomczak
The Variational Auto-Encoder (VAE) is one of the most widely used unsupervised machine learning models. Although the default choice of a Gaussian distribution for both the prior and the posterior is mathematically convenient and often leads to competitive results, we show that this parameterization fails to model hyperspherical data. To address this issue we propose using a von Mises-Fisher (vMF) distribution instead, leading to a hyperspherical latent space. Through a series of experiments, we show how such a hyperspherical VAE, or S-VAE, is more suitable for capturing data with a hyperspherical latent structure, while outperforming a normal VAE (N-VAE) in low dimensions on other data types.
ID: 312
Dissociation-Based Oblivious Bounds for Weighted Model Counting
Li Chou, Wolfgang Gatterbauer, Vibhav Gogate
We consider the weighted model counting task which includes important tasks in graphical models such as computing the partition function and probability of evidence as special cases. We propose a partition-based bounding algorithm that exploits logical structure and gives rise to a novel set of inequalities from which lower (and upper) bounds can be derived efficiently. The bounds come with correctness guarantees and are oblivious in that they can be computed by minimally manipulating the parameters of the model. We experimentally compare our bounds with the mini-bucket scheme (which is also oblivious) and show that they are often superior to the latter on a wide variety of benchmark networks.
ID: 313
Averaging Weights Leads to Wider Optima and Better Generalization
Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, Andrew Gordon Wilson
Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training, with essentially no computational overhead. We interpret this result by analyzing the geometry of SGD trajectories over the loss surfaces of deep neural networks. Moreover, we show that this Stochastic Weight Averaging (SWA) procedure finds much broader optima than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model. Using SWA we achieve notable improvement in test accuracy over conventional SGD training on a range of state-of-the-art residual networks, PyramidNets, DenseNets, and ShakeShake networks on CIFAR-10, CIFAR-100, and ImageNet. In short, SWA is extremely easy to implement, improves generalization in deep learning, and has almost no computational overhead.
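The procedure itself is just a running equal average of weight snapshots along the trajectory; a minimal sketch (the batch-normalization statistics that SWA recomputes after averaging are omitted):

```python
import numpy as np

class WeightAverager:
    """Running average w_avg <- w_avg + (w - w_avg) / n over weight snapshots
    taken along the (cyclical or constant learning rate) SGD trajectory."""
    def __init__(self):
        self.avg = None
        self.n = 0

    def update(self, w):
        w = np.asarray(w, dtype=float)
        self.n += 1
        if self.avg is None:
            self.avg = w.copy()
        else:
            self.avg += (w - self.avg) / self.n
        return self.avg
```

In practice one snapshot is taken per epoch (or per learning-rate cycle), so the extra cost over plain SGD is a single stored copy of the weights.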
ID: 317
Block Value Symmetries in Probabilistic Graphical Models
Gagan Madan, Ankit Anand, Mausam, Parag Singla
Several lifted inference algorithms for probabilistic graphical models first merge symmetric states into a single cluster (orbit) and then use these for downstream inference, via variations of orbital MCMC. These orbits are represented compactly using permutations over variables, and variable-value (VV) pairs, but these can miss several state symmetries in a domain.
We define the notion of permutations over block-value (BV) pairs, where a block is a set of variables. BV strictly generalizes VV symmetries, and can compute many more symmetries for increasing block sizes. To operationalize use of BV permutations in lifted inference, we describe (1) an algorithm to compute BV permutations given a block partition of the variables, (2) BV-MCMC, an extension of orbital MCMC that can sample from BV orbits, and (3) a heuristic to suggest good block partitions. Our experiments show that BV-MCMC can mix much faster compared to vanilla MCMC and orbital MCMC over VV permutations.
ID: 320
Reasoning in Latent Space with the Bayes Factor
Rahul G. Krishnan, Arjun Khandelwal, Rajesh Ranganath, David Sontag
We propose a novel way to answer probabilistic queries that span multiple datapoints using deep generative models that enforce a separation of representation learning and reasoning. We give a construction for a query-conditional \emph{latent reasoning network}, a parametric distribution over the latent space of a deep generative model. The reasoning network \emph{learns to compile} the posterior distributions of the set of query points into a single distribution via a novel neural architecture. With labeled data, we show how to learn the latent reasoning network using a max-margin learning algorithm. The score function we use is given by the Bayes factor of points given the query versus the points unconditionally. We show how to differentiate \emph{through} the latent reasoning model to \emph{fine-tune} the latent space of the deep generative model. We show how the algorithm may be used to focus the data variations captured by the generative model and to build new models for few-shot learning that perform comparably to several recently proposed state-of-the-art methods.
ID: 321
Densified Winner Take All (WTA) Hashing for Sparse Datasets
Beidi Chen, Anshumali Shrivastava
WTA (Winner Take All) hashing has been successfully applied in many large-scale vision applications. This hashing scheme was tailored to take advantage of comparative reasoning (or order-based information), which showed significant accuracy improvements. In this paper, we identify a subtle issue with WTA, which grows with the sparsity of the datasets. This issue limits the discriminative power of WTA. We then propose a solution to this problem based on the idea of Densification, which makes use of 2-universal hash functions in a novel way. Our experiments show that Densified WTA Hashing consistently and significantly outperforms vanilla WTA Hashing in both image retrieval and classification tasks.
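A vanilla WTA hash is easy to state, and the sketch also shows why sparsity hurts: when most coordinates are zero, the first k permuted entries often tie at zero and the argmax carries no rank information. A minimal sketch without the proposed densification:

```python
import numpy as np

def wta_hash(x, k, n_hashes, rng):
    """Vanilla WTA hashing: for each hash, draw a random permutation of the
    feature indices and emit the argmax position among the first k permuted
    coordinates."""
    x = np.asarray(x)
    return [int(np.argmax(x[rng.permutation(len(x))[:k]]))
            for _ in range(n_hashes)]
```

The codes depend only on rank order, so any monotone transform of the features leaves them unchanged - the comparative-reasoning property the abstract refers to.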
Lifted inference reduces the complexity of inference in relational probabilistic models by identifying groups of constants (or atoms) which behave symmetrically to each other. A number of techniques have been proposed in the literature for lifting marginal as well as MAP inference. We present the first application of lifting rules for marginal-MAP (MMAP), an important inference problem in models having latent variables. Our main contribution is two-fold: (1) we define a new equivalence class of variables, called Single Occurrence for MAX (SOM), and show that the solution lies at an extreme with respect to the SOM variables, i.e., predicate groundings differing only in the instantiation of the SOM variables take the same truth value; (2) we define a sub-class SOM-R (SOM Reduce) and exploit properties of extreme assignments to show that MMAP inference can be performed by reducing the domain of SOM-R variables to a single constant. We refer to our lifting technique as the SOM-R rule for lifted MMAP. Combined with existing rules such as decomposer and binomial, this results in a powerful framework for lifted MMAP. Experiments on three benchmark domains show significant gains in both time and memory compared to ground inference as well as lifted approaches not using SOM-R.
ID: 325
PAC-Reasoning in Relational Domains
Ondrej Kuzelka, Yuyi Wang, Jesse Davis, Steven Schockaert
We consider the problem of predicting plausible missing facts in relational data, given a set of imperfect logical rules. In particular, our aim is to provide bounds on the (expected) number of incorrect inferences that are made in this way. Since for classical inference it is in general impossible to bound this number in a non-trivial way, we consider two inference relations that weaken, but remain close in spirit to classical inference.
ID: 332
Pure Exploration of Multi-Armed Bandits with Heavy-Tailed Payoffs
Xiaotian Yu, Han Shao, Michael R. Lyu, Irwin King
Inspired by heavy-tailed data distributions in real scenarios, we investigate the problem of pure exploration in Multi-Armed Bandits (MAB) with heavy-tailed payoffs, relaxing the classic sub-Gaussian assumption in MAB and assuming instead that stochastic payoffs have a bounded p-th moment, where p ∈ (1, +∞). The main contributions of this paper are three-fold. First, we analyze the tail probabilities of the empirical average and the truncated empirical average (TEA) for estimating expected payoffs in sequential decisions with heavy-tailed noise. Second, we propose two effective bandit algorithms, based on different prior information (i.e., fixed confidence or fixed budget), for pure exploration in MAB with payoffs of finite p-th moment. Third, we derive theoretical guarantees for the two proposed algorithms and demonstrate their effectiveness in pure exploration of MAB with heavy-tailed payoffs.
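A truncated empirical average of the kind mentioned above can be sketched as follows; the threshold schedule `(u * t) ** (1 / p)` is one common choice when the p-th raw moment is bounded by `u`, not necessarily the exact schedule used in the paper.

```python
import numpy as np

def truncated_empirical_average(samples, u, p):
    """Estimate a heavy-tailed mean by zeroing out samples whose magnitude
    exceeds a growing threshold (u * t) ** (1 / p), where t is the 1-based
    sample index and u bounds the p-th moment. Illustrative threshold
    choice; the paper's schedule may differ."""
    samples = np.asarray(samples, dtype=float)
    t = np.arange(1, len(samples) + 1)
    thresholds = (u * t) ** (1.0 / p)
    kept = np.where(np.abs(samples) <= thresholds, samples, 0.0)
    return kept.mean()
```

Intuitively, early extreme draws (the ones most damaging under heavy tails) are discarded, while the threshold loosens as more samples accumulate.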
ID: 334
Counterfactual Normalization: Using Knowledge of Causal Mechanisms to Identify and Protect Against Spurious Associations
Adarsh Subbaswamy, Suchi Saria
Statistical prediction models are sensitive to learning spurious correlations that result from selection bias or domain-specific factors in the training data, which leads to poor generalization when these biases change. In this paper, we propose an adjustment-based method for removing these biases from a prediction problem without samples from the target distribution. Specifically, we use knowledge of causal mechanisms between variables in a prediction problem to identify misleading correlations. We also develop a node-splitting operation which allows us to graphically identify latent counterfactual variables that account for causal mechanisms, and show how we can estimate these variables. Models using the latent variables instead of the observed variables are protected from learning spurious correlations without sacrificing accuracy. We empirically evaluate the proposed method on simulated and real-world data to demonstrate its effectiveness.
ID: 342
Decentralized Planning for Non-dedicated Agent Teams with Submodular Rewards in Uncertain Environments
Pritee Agrawal, Pradeep Varakantham, William Yeoh
Planning under uncertainty arises in many domains, such as disaster rescue, sensor networks, and security patrolling, where individual agents work in a decentralized yet cooperative manner and are tied together through a global reward function. Existing research has made significant progress by providing a rich framework to represent cooperative and decentralized planning under uncertainty. This class of problems also involves non-dedication of team members, where agents may leave due to high-priority tasks (e.g., emergencies, accidents) or due to damage to the agent. However, there is very limited literature on handling non-dedication in agent teams in decentralized settings. In this paper, we provide a general model to represent such problems, along with solution approaches for cooperative and decentralized planning under uncertainty for non-dedicated agent teams. We specifically provide two greedy approaches (an offline one and an offline-online one) that deal with agents leaving the team effectively and efficiently by exploiting the submodularity property. We provide a detailed evaluation of our approaches on existing benchmark problems and demonstrate that they obtain more than 90% of the optimal solution quality on benchmark problems from the literature.
ID: 343
A Forest Mixture Bound for Block-Free Parallel Inference
Neal Lawton, Aram Galstyan, Greg Ver Steeg
Coordinate ascent variational inference is an important algorithm for inference in probabilistic models, but it is slow because it updates only a single variable at a time. Block coordinate methods perform inference faster by updating blocks of variables in parallel. However, the speed and stability of these algorithms depends on how the variables are partitioned into blocks. In this paper, we give a stable parallel algorithm for inference in deep exponential families that doesn't require the variables to be partitioned into blocks. We achieve this by lower bounding the ELBO by a new objective we call the forest mixture bound (FM bound) that separates the inference problem for variables within a hidden layer. We apply this to the simple case when all random variables are Gaussian and show empirically that the algorithm converges faster for models that are inherently more forest-like.
ID: 346
Causal Identification under Markov Equivalence
Amin Jaber, Jiji Zhang, Elias Bareinboim
Assessing the magnitude of cause-and-effect relations is one central challenge found throughout the sciences. The problem of identification of causal effects is concerned with determining whether a causal effect can be computed from a combination of observational data and substantive knowledge about the domain under investigation, which is usually assumed to be articulated in the form of a causal graph. A common challenge found in practice, however, is that substantive knowledge about the domain may not be ``strong'' enough so that a unique causal graph can be specified. Another line of investigation attempts to learn a qualitative description of the domain from purely observational data. Reality here is demanding as well, and not rarely, the underlying causal model cannot be pinned down, but only a collection of models that share the same set of observed features can be identified (usually called a Markov equivalence class). In this paper, we marry both approaches and study the problem of causal identification from an equivalence class, represented by a partial ancestral graph (PAG). We start by deriving a set of graphical properties of PAGs that are carried over to its induced subgraphs. We then develop an algorithm to compute the effect of an arbitrary set of interventional variables on an arbitrary outcome set. We show that the new algorithm is strictly more powerful than the current state-of-the-art found in the literature.
ID: 351
The Variational Homoencoder: Learning to learn high capacity generative models from few examples
Luke Hewitt, Maxwell Nye, Andrea Gane, Tommi Jaakkola, Joshua B. Tenenbaum
Hierarchical Bayesian methods can unify many related tasks (e.g. k-shot classification, conditional and unconditional generation) as inference within a single generative model. However, when this generative model is expressed as a powerful neural network such as a PixelCNN, we show that existing learning techniques typically fail to effectively use latent variables. To address this, we develop a modification of the Variational Autoencoder in which encoded observations are decoded to new elements from the same class. This technique, which we call a Variational Homoencoder (VHE), produces a hierarchical latent variable model which better utilises latent variables. We use the VHE framework to learn a hierarchical PixelCNN on the Omniglot dataset, which outperforms all existing models on test set likelihood and achieves both strong one-shot generation and near human-level classification. We additionally validate the VHE on natural images from the YouTube Faces database. Finally, we develop extensions of the model that apply to richer dataset structures such as factorial and hierarchical categories.
ID: 354
Probabilistic Collaborative Representation Learning for Personalized Item Recommendation
Aghiles Salah, Hady W. Lauw
We present Probabilistic Collaborative Representation Learning (PCRL), a new generative model of user preferences and item contexts. PCRL builds on the assumption that relationships among items within contexts (e.g., browsing session, shopping cart, etc.) may underlie various aspects that guide the choices people make. Intuitively, PCRL seeks representations of items reflecting various regularities between them that might be useful at explaining user preferences. Formally, it relies on Bayesian Poisson Factorization to model user-item interactions, and uses a multilayered latent variable architecture to learn representations of items from their contexts. PCRL seamlessly integrates both tasks within a joint framework. However, inference and learning under the proposed model are challenging due to several sources of intractability. Relying on recent advances in approximate inference/learning, we derive an efficient variational algorithm to estimate model parameters from observations. We further conduct experiments on several real-world datasets to showcase the benefits of our proposed model.
ID: 356
Reforming Generative Autoencoders Using Parametric Hypothesis Testing
Aaron Palmer, Dipak Dey, Jinbo Bi
Generative models, while not new, have taken the deep learning field by storm. However, the widely used training methods have not exploited the substantial statistical literature concerning parametric distributional testing. Having sound theoretical foundations, these goodness-of-fit tests enable parts of the black box to be stripped away. In this paper we use the Shapiro-Wilk test, and propose a new multivariate generalization of it, to test respectively for univariate and multivariate normality of the code layer of a generative autoencoder. By replacing the discriminator in traditional deep models with these hypothesis tests, we gain several advantages: no hyper-parameter tuning is needed; we can objectively evaluate whether the encoder is actually embedding data onto a normal manifold; we can accurately define when convergence happens; and we can explicitly balance between reconstruction and encoding during training. Not only does our method produce comparable results, but it does so in a fraction of the time. We highlight the fact that the hypothesis tests used in our model asymptotically lead to the same solution as the $L_2$-Wasserstein distance metrics used by several generative models today.
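A minimal sketch of the univariate half of this idea, assuming encoder activations are available as a NumPy array (the multivariate generalization proposed in the paper is not reproduced here):

```python
import numpy as np
from scipy import stats

def code_layer_normality_pvalues(codes):
    """Run the univariate Shapiro-Wilk test on each latent dimension of
    the encoder's code layer; small p-values flag dimensions whose
    marginals deviate from normality."""
    codes = np.asarray(codes, dtype=float)
    return np.array([stats.shapiro(codes[:, j])[1]
                     for j in range(codes.shape[1])])
```

In a training loop, such per-dimension p-values could serve as an objective signal of whether the encoder is mapping data onto a normal manifold, in place of a learned discriminator.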
ID: 359
Towards Flatter Loss Surface via Nonmonotonic Learning Rate Scheduling
Sihyeon Seong, Yegang Lee, Youngwook Kee, Dongyoon Han and Junmo Kim
Whereas optimizing deep neural networks using stochastic gradient descent has shown great performance in practice, the rule for setting the step size (i.e., learning rate) of gradient descent is not well studied. Although some intriguing learning rate rules such as ADAM (Kingma and Ba, 2014) have been developed, they concentrate on improving convergence, not on improving generalization capability. Recently, the improved generalization property of flat minima was revisited, and this research guides us towards promising solutions to many current optimization problems. In this paper, we analyze the flatness of loss surfaces through the lens of robustness to input perturbations and advocate that gradient descent should be guided to reach flatter regions of loss surfaces to achieve generalization. Finally, we suggest a learning rate rule for escaping sharp regions of loss surfaces, and we demonstrate the capacity of our approach through numerous experiments.
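The abstract does not give the authors' exact rule, but the flavor of a nonmonotonic schedule can be illustrated with a hypothetical example: a geometrically decaying rate with periodic upward spikes intended to kick the iterate out of sharp basins.

```python
def nonmonotonic_lr(step, base_lr=0.1, decay=0.99,
                    spike_every=100, spike_factor=5.0):
    """Illustrative nonmonotonic schedule (not the authors' exact rule):
    geometrically decay the learning rate, but every `spike_every` steps
    multiply it by `spike_factor` to allow escape from sharp minima."""
    lr = base_lr * (decay ** step)
    if step > 0 and step % spike_every == 0:
        lr *= spike_factor
    return lr
```

The spike steps make the schedule nonmonotonic: the rate at step 100 exceeds the rate at step 99 even though the base rate is strictly decreasing.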
ID: 361
A Lagrangian Perspective on Latent Variable Generative Models
Shengjia Zhao, Jiaming Song, Stefano Ermon
A variety of learning objectives have been recently proposed for latent variable generative models. We show that many of them, including InfoGAN, ALI/BiGAN, ALICE, CycleGAN, VAE, $\beta$-VAE, adversarial autoencoders, AVB, AS-VAE, and InfoVAE, are Lagrangian duals of the same primal optimization problem for fixed values of the Lagrange multipliers. Based on this observation, we provide an exhaustive characterization of all the objectives that belong to this class. Next, we propose a dual optimization method where we optimize the model parameters as well as the Lagrange multipliers. This method achieves Pareto near-optimal solutions in terms of optimizing primal objectives and modeling correct distributions.
ID: 362
Bayesian optimization and attribute adjustment
Stephan Eismann, Daniel Levy, Rui Shu, Stefano Ermon
Automatic design via Bayesian optimization holds great promise given the constant increase of available data across domains. However, it faces difficulties from high-dimensional, potentially discrete, search spaces. We propose to probabilistically embed inputs into a lower-dimensional, continuous latent space, where we perform gradient-based optimization guided by a Gaussian process. Building on variational autoencoders, we use both labeled and unlabeled data to guide the encoding and increase its accuracy. In addition, we propose an adversarial extension to render the latent representation invariant with respect to specific design attributes, which allows us to transfer these attributes across structures. We apply the framework to a functional-protein dataset and to the optimization of drag coefficients directly over high-dimensional shapes, without incorporating domain knowledge or handcrafted features.
ID: 367
Join Graph Decomposition Bounds for Influence Diagrams
Junkyu Lee, Alexander Ihler, Rina Dechter
We introduce a new decomposition method for bounding the maximum expected utility of influence diagrams (IDs). The main goal is to devise an approximation scheme that is free from the translations required by existing variational approaches, since a naive reduction from influence diagrams to Bayesian networks produces practically infeasible problems. In this work, we extend decomposition methods for probabilistic inference by using an algebraic framework called valuation algebra, which effectively captures both the multiplicative and additive local structure present in IDs. Empirical evaluation on four benchmark sets demonstrates the effectiveness of our approach compared to translation-based methods. In addition, the proposed decomposition method significantly improves on partition-based approximation schemes with lower computational resources.
ID: 372
Causal Discovery with Linear Non-Gaussian Models under Measurement Error: Structural Identifiability Results
Kun Zhang, Mingming Gong, Joseph Ramsey, Kayhan Batmanghelich, Peter Spirtes, Clark Glymour
Causal discovery methods aim to recover the causal process that generated purely observational data. Despite their successes on a number of real problems, the presence of measurement error in the observed data can produce serious mistakes in the output of various causal discovery methods. Given the ubiquity of measurement error caused by instruments or proxies used in the measuring process, this problem is one of the main obstacles to reliable causal discovery. It is still unknown to what extent the causal structure of relevant variables can be identified in principle. This study aims to take a step towards filling that void. We assume that the underlying process, i.e., the measurement-error-free variables, follows a linear, non-Gaussian causal model, and show that the so-called ordered group decomposition of the causal model, which contains major causal information, is identifiable. The identifiability of the causal structure is further improved with different types of sparsity constraints on the causal structure. Finally, we give rather mild conditions under which the whole causal structure is fully identifiable.