UAI 2017 - Accepted Papers

ID: 65

A Fast Algorithm for Matrix Eigen-decomposition

Zhiqiang Xu; Yiping Ke; Xin Gao

We propose a fast stochastic Riemannian gradient eigensolver for a real and symmetric matrix, and prove its local, eigengap-dependent and linear convergence. The fast convergence is brought by deploying the variance reduction technique which was originally developed for the Euclidean strongly convex problems. In this paper, this technique is generalized to Riemannian manifolds for solving the (geodesically) non-convex problem of finding a group of top eigenvectors of the matrix. Specifically, we first propose the general variance reduction form, SVRRG, in the stochastic Riemannian gradient optimization framework, and it turns out to necessitate the operation of vector transport in addition to using Riemannian gradients. We then specialize it to the problem in question resulting in our SVRRG-EIGS algorithm. We are among the first to propose and analyze the generalization of the stochastic variance reduced gradient (SVRG) to Riemannian manifolds. Meanwhile, as the extension of the VR-PCA to the Riemannian setting, it is significant and nontrivial in the sense of theoretically achieving a faster convergence and empirically also making a difference as demonstrated in experiments, due to our respect to the inherent geometry of the problem.

ID: 171

A Practical Method for Solving Contextual Bandit Problems Using Decision Trees

Adam N. Elmachtoub; Ryan McNellis; Marek Petrik; Sechan Oh

Many efficient algorithms with strong theoretical guarantees have been proposed for the contextual multi-armed bandit problem. However, applying these algorithms in practice can be difficult because they require domain expertise to build appropriate features and to tune their parameters. We propose a new method for the contextual bandit problem that is simple, practical, and can be applied with little or no domain expertise. Our algorithm relies on decision trees to model the context-reward relationship. Decision trees are non-parametric, interpretable, and work well without hand-crafted features. To guide the exploration-exploitation trade-off, we use a bootstrapping approach which abstracts Thompson sampling to non-Bayesian settings. We also discuss several computational heuristics and demonstrate the performance of our method on several datasets.

ID: 278

A Probabilistic Framework for Multi-Label Learning with Unseen Labels

Abhilash Gaure; Piyush Rai

We present a probabilistic framework for multi-label learning for a challenging setting when the test data may require predicting the presence/absence of labels that were not available at the training time. To address this, we develop a probabilistic model that leverages the co-occurrence statistics of the labels via a joint generative model for the label matrix (which denotes the label presence/absence for each example) and for the label co-occurrence matrix (which denotes how many times a pair of labels co-occurs with each other). The label co-occurrence statistics can be obtained from external data sources in an unsupervised manner. In addition to handling the unseen labels at test time, leveraging the co-occurrence information may also help in the standard multi-label learning setting, especially if the number of training examples is very small and/or the label matrix of training examples has a large fraction of missing entries. We present our experimental results on a number of benchmark data sets, comparing our model with various baselines, and show that our model performs favorably when handling unseen labels.

ID: 209

A Reinforcement Learning Approach to Weaning of Mechanical Ventilation in Intensive Care Units

Niranjani Prasad; Li-Fang Cheng; Corey Chivers; Michael Draugelis; Barbara E Engelhardt

The management of invasive mechanical ventilation, and the regulation of sedation and analgesia during ventilation, constitutes a major part of the care of patients admitted to intensive care units. Both prolonged dependence on mechanical ventilation and premature extubation are associated with increased risk of complications and higher hospital costs, but clinical opinion on the best protocol for weaning patients off the ventilator varies. This work looks to develop a decision support tool that uses available information to predict time to extubation readiness and recommend a personalized regime of sedation dosage and ventilator support. To this end, we employ off-policy reinforcement learning algorithms to determine the best action at a given patient state from sub-optimal historical ICU data. We compare treatment policies from fitted Q-iteration with extremely randomized trees and with feedforward neural networks, and demonstrate that the policies learnt show promise in recommending weaning protocols with improved outcomes, in terms of minimizing the rate of reintubation and closer regulation of physiological stability.

ID: 220

A Tractable Probabilistic Model for Subset Selection

Yujia Shen; Arthur Choi; Adnan Darwiche

Subset selection tasks, such as top-\(k\) ranking, induce datasets where examples have cardinalities that are known a priori. In this paper, we propose a tractable probabilistic model for subset selection and show how it can be learned from data. Our proposed model is interpretable and subsumes a previously introduced model based on logistic regression. We show how the parameters of our model can be estimated in closed form given complete data, and propose an algorithm for learning its structure in an interpretable space. We highlight the intuitive structures that we learn via case studies. We finally show how our proposed model can be viewed as an instance of the recently proposed Probabilistic Sentential Decision Diagram.

ID: 306

Adversarial Sets for Regularising Neural Link Predictors

Pasquale Minervini; Thomas Demeester; Tim Rocktäschel; Sebastian Riedel

In adversarial training, a set of agents learn together by pursuing competing goals. However, such goals are defined on single data samples; for instance, in GANs, a generator learns to synthesise data samples, while a discriminator learns to distinguish between real and synthesised samples. However, in relational learning and other non-i.i.d domains, further goals can be defined over sets of samples. For example, a link predictor for the $\rel{is-a}$ relation needs to be consistent with its transitivity property: if is-a(x, y) and is-a(y, z) hold, is-a(x, z) needs to hold as well. We propose a strategy for generating adversarial sets of inputs that violate user-defined relational assumptions, such as transitivity or symmetry, formulated using First-order Logic clauses. We show how to map such clauses to an adversarial training objective, and how to optimise such an objective. In the proposed approach, the classifier is trained by jointly minimising a data and an inconsistency loss defined on an adversarial set of samples, generated by an adversary. Our approach combines the advantages of previous approaches for incorporating logic clauses: it is applicable to a variety of neural link prediction models, and can encode complex types of clauses. This is empirically supported with results on two link prediction tasks.

ID: 277

Algebraic Equivalence Class Selection for Linear Structural Equation Models

Thijs van Ommen; Joris M. Mooij

Despite their popularity, many questions about the algebraic constraints imposed by linear structural equation models remain open problems. For causal discovery, two of these problems are especially important: the enumeration of the constraints imposed by a model, and deciding whether two graphs define the same statistical model. We show how the half-trek criterion can be used to make progress in both these problems. We apply our theoretical results to a small-scale model selection problem, and find that taking the additional algebraic constraints into account may lead to significant improvements in model selection accuracy.

ID: 54

An Efficient Minibatch Acceptance Test for Metropolis-Hastings

Daniel Seita; Xinlei Pan; Haoyu Chen; John Canny

We present a novel Metropolis-Hastings method for large datasets that uses small expected-size minibatches of data. Previous work on reducing the cost of Metropolis- Hastings tests yield variable data consumed per sample, with only constant factor reductions versus using the full dataset for each sample. Here we present a method that can be tuned to provide arbitrarily small batch sizes, by adjusting either proposal step size or temperature. Our test uses the noise-tolerant Barker acceptance test with a novel additive correction variable. The resulting test has similar cost to a normal SGD update. Our experiments demonstrate several order-of-magnitude speedups over previous work.

ID: 180

Analysis of Thompson Sampling for Stochastic Sleeping Bandits

Aritra Chatterjee; Ganesh Ghalme; Shweta Jain; Rohit Vaish; Y. Narahari

We study a variant of the stochastic multi-armed bandit problem where the set of available arms varies adversarially with time (also known as the sleeping bandit problem). We focus on the Thompson Sampling algorithm, and consider a regret notion defined with respect to the best available arm. Our main result is an O(log T) regret bound for Thompson Sampling, which generalizes a similar bound known for this algorithm from the classical bandit setting. Our bound also matches (up to constants) the best-known lower bound for the sleeping bandit problem. We show via simulations that Thompson Sampling outperforms the UCB-style AUER algorithm for sleeping bandit problem

ID: 146

Approximate Evidential Reasoning Using Local Conditioning and Conditional Belief Functions

Van Nguyen

We propose a new message-passing belief propagation method that approximates belief updating on evidential networks with conditional belief functions. By means of local conditioning, the method is able to propagate beliefs on the original multiply-connected network structure using local computations, facilitating reasoning in a distributed and dynamic context. Further, by use of conditional belief functions in the form of partially defined plausibility and basic plausibility assignment functions, belief updating can be efficiently approximated using only partial information of the belief functions involved. Experiments show that the method produces results with high degree of accuracy whilst achieving a significant decrease in computational and space complexity (compared to exact methods).

ID: 109

Approximation Complexity of Maximum A Posteriori Inference in Sum-Product Networks

Diarmaid Conaty; Cassio P. de Campos; Denis D. Maua

We discuss the computational complexity of approximating maximum a posteriori inference in sum-product networks. We first show NP-hardness in trees of height two by a reduction from maximum independent set; this implies non-approximability within a sublinear factor. We show that this is a tight bound, as we can find an approximation within a linear factor in networks of height two. We then show that, in trees of height three, it is NP-hard to approximate the problem within a factor $2^{f(n)}$ for any sublinear function $f$ of the size of the input $n$. Again, this bound is tight, as we prove that the usual max-product algorithm finds (in any network) approximations within factor $2^{c \cdot n}$ for some constant $c < 1$. Last, we present a simple algorithm, and show that it provably produces solutions at least as good as, and potentially much better than, the max-product algorithm. We empirically analyze the proposed algorithm against max-product using synthetic and real-world data.

ID: 50

AutoGP: Exploring the Capabilities and Limitations of Gaussian Process Models

Karl Krauth; Edwin V. Bonilla; Kurt Cutajar; Maurizio Filippone

We investigate the capabilities and limitations of Gaussian process models by jointly exploring three complementary directions: (i) scalable and statistically efficient inference; (ii) flexible kernels; and (iii) objective functions for hyperparameter learning alternative to the marginal likelihood. Our approach outperforms all previously reported GP methods on the standard MNIST dataset; performs comparatively to previous kernel-based methods using the RECTANGLES-IMAGE dataset; and breaks the 1% error-rate barrier in GP models using the MNIST8M dataset, showing along the way the scalability of our method at unprecedented scale for GP models (8 million observations) in classification problems. Overall, our approach represents a significant breakthrough in kernel methods and GP models, bridging the gap between deep learning approaches and kernel machines.

ID: 69

Balanced Mini-batch Sampling for SGD Using Determinantal Point Processes

Cheng Zhang; Hedvig Kjellstrom; Stephan Mandt

We study a mini-batch diversification scheme for stochastic gradient descent (SGD) in machine learning. While SGD uniformly subsamples mini-batches from the data, we diversify our mini-batch by using the Determinantal Point Processes (DPPs). DPPs repulsively penalize data points in the same mini-batch which show a high degree of similarity, where similarity is measured in terms of a kernel. This scheme allows us to balance our original dataset, and leads to stochastic gradients with lower variance. We term this approach Balanced Mini-batch SGD (BM-SGD). We show theoretically that this approach contains both regular SGD and stratified sampling as limiting cases, and that more generally BM-SGD is a generalization of stratified sampling to cases where no discrete features exist to bin the data into groups. We show experimentally that our method results more interpretable and diverse features in unsupervised setups, and in better classification accuracies in supervised setups.

ID: 70

Bayesian Inference of Log Determinants

J Fitzsimons; K Cutajar; M Filippone; M Osborne; S Roberts

The log-determinant of a kernel matrix appears in a variety of machine learning problems, ranging from determinantal point processes and generalized Markov random fields, through to the training of Gaussian processes. Exact calculation of this term is often intractable when the size of the kernel matrix exceeds a few thousands. In the spirit of probabilistic numerics, we reinterpret the problem of computing the log-determinant as a Bayesian inference problem. In particular, we combine prior knowledge in the form of bounds from matrix theory and evidence derived from stochastic trace estimation to obtain probabilistic estimates for the log-determinant and its associated uncertainty within a given computational budget. Beyond its novelty and theoretic appeal, the performance of our proposal is competitive with state-of-the-art approaches to approximating the log-determinant, while also quantifying the uncertainty due to budget-constrained evidence.

ID: 78

Branch and Bound for Regular Bayesian Network Structure Learing

Joe Suzuki; Jun Kawahara

We consider efficient Bayesian network structure learning (BNSL) using branch and bound. Thus far, it is known that efficient pruning rules have been found for minimizing the description length (MDL) rather than for maximizing the posterior probability such as the Bayesian Dirichlet equivalent uniform (BDeu) that has been used most often. In any BNSL, given data, the simplest structure should be chosen if structures are equally fit to the data. We show that quotient scores satisfy such regularity while the BDeu does not. In this paper, by utilizing the regularity, we propose an pruning rule for maximizing the posterior probability based on the quotient scores, and prove that the bound is tighter than existing one. We examine by experiments that the proposed bound is applied significantly more often than existing one for the BDeu as well. Besides, we prove that the MDL satisfies regularity and that the proposed bound behaves like that of the MDL, and observe that the proposed bound is almost as efficient as one for the MDL. Our novel insight is that BNSL can be efficient by utilizing regularity for constructing a pruning rule.

ID: 11

Causal Consistency of Structural Equation Models

Paul K. Rubenstein*; Sebastian Weichwald*; Stephan Bongers; Joris M. Mooij; Dominik Janzing; Moritz Grosse-Wentrup; Bernhard Schoelkopf

Complex systems can be modelled at various levels of detail. Ideally, causal models of the same system should be consistent with one another in the sense that they agree in their predictions of the effects of interventions. We formalise this notion of consistency in the case of Structural Equation Models (SEMs) by introducing exact transformations between SEMs. This provides a general language to consider, for instance, the different levels of description in the following three scenarios: (a) models with large numbers of variables versus models in which the `irrelevant' or unobservable variables have been marginalised out; (b) micro-level models versus macro-level models in which the macro-variables are aggregate features of the micro-variables; (c) dynamical time series models versus models of their stationary behaviour. Our analysis stresses the importance of well specified interventions in the causal modelling process and sheds light on the interpretation of cyclic SEMs.

ID: 269

Causal Discovery from Temporally Aggregated Time Series

Mingming Gong; Kun Zhang; Bernhard Schölkopf; Clark Glymour; Dacheng Tao

Discovering causal structure of a dynamical system from observed time series is a traditional and important problem. In many practical applications, observed data are obtained by applying subsampling or temporally aggregation to the original causal processes, making it difficult to discover the underlying causal relations. Subsampling refers to the procedure that for every $k$ consecutive observations, one is kept, the rest being skipped, and recently some advances have been made in causal discovery from such data. With temporal aggregation, the local averages or sums of $k$ consecutive, non-overlapping observations in the causal process are computed as new observations, and causal discovery from such data is even harder. In this paper, we investigate how to recover causal relations at the original causal frequency from temporally aggregated data. Assuming the time series at the causal frequency follows a vector autoregressive (VAR) model, we show that the causal structure at the causal frequency is identifiable from aggregated time series if the noise terms are independent and non-Gaussian and some other technical conditions hold. We then present an estimation method based on non-Gaussian state-space modeling and evaluate its performance on both synthetic and real data.

ID: 286

Communication-Efficient Distributed Primal-Dual Algorithm for Saddle Point Problem

Yaodong Yu; Sulin Liu; Sinno Jialin Pan

Recently, primal-dual algorithms, which are proposed to solve reformulated convex-concave saddle point problem, have proven to be effective for solving a generic class of convex optimization problems, especially when the problems are ill-conditioned. However, the saddle point problem still lacks a distributed optimization framework where primal-dual algorithms can be employed. In this paper, we propose a novel communication-efficient distributed framework to solve the convex-concave saddle point problem based on primal-dual methods. We carefully design local subproblems and a central problem such that our proposed distributed optimization framework is communication-efficient. We provide convergence analysis of our proposed algorithm, and extend it to address non-smooth and non-strongly convex loss functions. We conduct extensive experiments on several real-world datasets to demonstrate competitive performance of the proposed method, especially on ill-conditioned problems.

ID: 97

Complexity Problems for Markov Equivalence of DAG Models

Adityanarayanan Radhakrishnan; Liam Solus; Caroline Uhler

Combinatorially, two DAG models, or Bayesian networks, are the same if their associated DAGs have the same underlying undirected graph (i.e. skeleton) and the same set of immoralities. Newly developed causal inference algorithms that search over the space of DAGs are observably more efficient whenever the sizes of the Markov equivalence classes (MECs) on a given skeleton are small. In order to assess which families of graphs satisfy this criterion for efficiency, it is desirable to be able to compute the size and number of MECs on a given skeleton. In this paper, we study a pair of generating functions that encode the number and size of MECs on a skeleton G. The first is a graph polynomial that counts the number of MECs on G according to their number of immoralities. In analogy to the independent set problem, we observe that computing an MEC on $G$ with degree-many immoralities is an NP-hard problem. The second generating function is an arithmetic function encoding the number of MECs on G counted according to their size. Via computer enumeration, we observe that this generating function is distinct for every connected graph on p nodes for all $p\leq 10$. Collectively, these results illustrate the complexity of enumerating MECs on a fixed skeleton from the perspective of graphical generating functions and combinatorial optimization.

ID: 64

Complexity of Solving Decision Trees with Skew-Symmetric Bilinear Utility

Hugo Gilbert; Olivier Spanjaard

This paper is devoted to the complexity analysis of solving decision trees with a Skew- Symmetric Bilinear (SSB) utility function. The SSB model is an extension of the Expected Utility (EU) model with enhanced descriptive possibilities. Unlike EU, the optimality principle does not hold for SSB, which makes its optimization trickier. We show that determining an SSB optimal plan is NP-hard if one only consider deterministic plans while it is polynomial time if one allows randomized plans. With the Weighted Expected Utility model (a special case of the SSB model), we show that determining an optimal plan is a polynomial time problem in both settings. Finally, we present the results of numerical tests that show the operationality of the proposed methods.

ID: 163

Composing inference algorithms as program transformations

Robert Zinkov; Chung-chieh Shan

Probabilistic inference procedures are usually coded painstakingly from scratch, for each target model and each inference algorithm. We reduce this coding effort by generating inference procedures from models automatically. We make this code generation modular by decomposing inference algorithms into reusable program transformations. These source-to-source transformations perform exact inference as well as generate probabilistic programs that compute expectations, densities, and MCMC samples. The resulting inference procedures run in time comparable to that of handwritten procedures.

ID: 173

Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data

Gintare Karolina Dziugaite; Daniel M. Roy

One of the defining properties of deep learning is that models are chosen to have many more parameters than available training data. In light of this capacity for overfitting, it is remarkable that simple algorithms like SGD reliably return solutions with low test error. One roadblock to explaining these phenomena in terms of implicit regularization, structural properties of the solution, and/or easiness of the data is that many learning bounds are quantitatively vacuous in this "deep learning" regime. In order to explain generalization, we need nonvacuous bounds. We return to an idea by Langford and Caruana (2001), who used PAC-Bayes bounds to compute nonvacuous numerical bounds on generalization error for \emph{stochastic} two-layer two-hidden-unit neural networks via a sensitivity analysis. By optimizing the PAC-Bayes bound directly, we are able to extend their approach and obtain nonvacuous generalization bounds for deep stochastic neural network classifiers with millions of parameters trained on only tens of thousands of examples. We connect our findings to recent and old work on flat minima and MDL-based explanations of generalization.

ID: 289

Continuously tempered Hamiltonian Monte Carlo

Matthew M Graham; Amos J Storkey

Hamiltonian Monte Carlo (HMC) is a powerful Markov chain Monte Carlo (MCMC) method for performing approximate inference in complex probabilistic models of continuous variables. In common with many MCMC methods however the standard HMC approach performs poorly in distributions with multiple isolated modes. We present a method for augmenting the Hamiltonian system with an extra continuous temperature control variable which allows the dynamic to bridge between sampling a complex target distribution and a simpler unimodal base distribution. This augmentation both helps improve mixing in multimodal targets and allows the normalisation constant of the target distribution to be estimated. The method is simple to implement within existing HMC code, requiring only a standard leapfrog integrator. We illustrate experimentally that the method is competitive with annealed importance sampling and simulating tempering methods at sampling from challenging multimodal distributions and estimating their normalising constants.

ID: 185

Convex-constrained Sparse Additive Modeling and Its Extensions

Junming Yin; Yaoliang Yu

Sparse additive modeling is a class of effective methods for performing high-dimensional nonparametric regression. In this work we show how shape constraints such as convexity/concavity and their extensions, can be integrated into additive models. The proposed sparse difference of convex additive models (SDCAM) can estimate most continuous functions without any a priori smoothness assumption. Motivated by a characterization of difference of convex functions, our method incorporates a natural regularization functional to avoid overfitting and to reduce model complexity. Computationally, we develop an efficient backfitting algorithm with linear per-iteration complexity. Experiments on both synthetic and real data verify that our method is competitive against state-of-the-art sparse additive models, with improved performance in most scenarios.

ID: 141

Coupling Adaptive Batch Sizes with Learning Rates

Lukas Balles; Javier Romero; Philipp Hennig

Mini-batch stochastic gradient descent and variants thereof have become standard for large-scale empirical risk minimization like the training of neural networks. These methods are usually used with a constant batch size chosen by simple empirical inspection. The batch size significantly influences the behavior of the stochastic optimization algorithm, though, since it determines the variance of the gradient estimates. This variance also changes over the optimization process; when using a constant batch size, stability and convergence is thus often enforced by means of a (manually tuned) decreasing learning rate schedule. We propose a practical method for dynamic batch size adaptation. It estimates the variance of the stochastic gradients and adapts the batch size to decrease the variance proportionally to the value of the objective function, removing the need for the aforementioned learning rate decrease. In contrast to recent related work, our algorithm couples the batch size to the learning rate, directly reflecting the known relationship between the two. On popular image classification benchmarks, our batch size adaptation yields faster optimization convergence, while simultaneously simplifying learning rate tuning. A TensorFlow implementation is available.

ID: 93

Data-Dependent Sparsity for Subspace Clustering

Bo Xin; Yizhou Wang; Wen Gao and David Wipf

Subspace clustering is the process of assigning subspace memberships to a set of unlabeled data points assumed to have been drawn from the union of an unknown number of low-dimensional subspaces, possibly interlaced with outliers or other data corruptions. By exploiting the fact that each inlier point has a sparse representation with respect to a dictionary formed by all the other points, an $\ell_1$ regularized sparse subspace clustering (SSC) method has recently shown state-of-the-art robustness and practical extensibility in a variety of applications. But there remain important lingering weaknesses. In particular, the $\ell_1$ norm solution is highly sensitive, often in a detrimental direction, to the very types of data structures that motivate interest in subspace clustering to begin with, sometimes leading to poor segmentation accuracy. However, as an alternative source of sparsity, we argue that a certain data-dependent, non-convex penalty function can compensate for dictionary structure in a way that is especially germane to subspace clustering problems. For example, we demonstrate that this proposal displays a form of invariance to feature-space transformations and affine translations that commonly disrupt existing methods, and moreover, in important settings we reveal that its performance quality is lower bounded by the $\ell_1$ solution. Finally, we provide empirical comparisons on popular benchmarks that corroborate our theoretical findings and demonstrate superior performance when compared to recent state-of-the-art models.

ID: 147

Decoupling Homophily and Reciprocity with Latent Space Network Models

Jiasen Yang; Vinayak Rao; Jennifer Neville

Networks form useful representations of data arising in various physical and social domains. In this work, we consider dynamic networks such as communication networks in which links connecting pairs of nodes appear over continuous time. We adopt a point process modeling approach, and study latent space models which embed network nodes into Euclidean space. We propose models to capture two different aspects of dynamic network data: (i) that communication occurs at a higher rate between individuals with similar features (homophily), and (ii) that individuals tend to reciprocate communications from other nodes, but in a manner that varies across individuals. Our framework marries ideas from point process models, like Poisson and Hawkes processes, with ideas from latent space models of static networks. We evaluate our models over a range of tasks on real-world datasets and show that a dual latent space, which accounts for heterogeneity in both reciprocity and homophily, significantly improves performance for both dynamic and static link prediction.

ID: 152

Differentially Private Variational Inference for Non-conjugate Models

Joonas Jälkö; Antti Honkela; Onur Dikmen

Many machine learning applications are based on data collected from people, such as their tastes and behaviour as well as biological traits and genetic data. Regardless of how important the application might be, one has to make sure individuals' identities or the privacy of the data are not compromised in the analysis. Differential privacy constitutes a powerful framework that prevents breaching of data subject privacy from the output of a computation. Differentially private versions of many important Bayesian inference methods have been proposed, but there is a lack of an efficient unified approach applicable to arbitrary models. In this contribution, we propose a differentially private variational inference method with a very wide applicability. It is built on top of doubly stochastic variational inference, a recent advance which provides a variational solution to a large class of models. We add differential privacy into doubly stochastic variational inference by clipping and perturbing the gradients. The algorithm is made more efficient through privacy amplification from subsampling. We demonstrate the method can reach an accuracy close to non-private level under reasonably strong privacy guarantees, clearly improving over previous sampling-based alternatives especially in the strong privacy regime.

ID: 81

Effective sketching methods for value function approximation

Yangchen Pan; Erfan Sadeqi Azer; Martha White

High-dimensional representations, such as radial basis function networks or tile coding, are common choices for policy evaluation in reinforcement learning. Learning with such high-dimensional representations, however, can be expensive, particularly for matrix methods, such as least-squares temporal difference learning or quasi-Newton methods that approximate matrix step-sizes. In this work, we explore the utility of sketching for these two classes of algorithms. We highlight issues with sketching the high-dimensional features directly, which can incur significant bias. As a remedy, we demonstrate how to use sketching more sparingly, with only a left-sided sketch, that can still enable significant computational gains and the use of these matrix-based learning algorithms that are less sensitive to parameters. We empirically investigate these algorithms, in four domains with a variety of representations. Our aim is to provide insights into effective use of sketching in practice.

ID: 83

Efficient Online Learning for Optimizing Value of Information: Theory and Application to Interactive Troubleshooting

Yuxin Chen; Jean-Michel Renders; Morteza Haghir Chehreghani; Andreas Krause

We consider the optimal value of information (VoI) problem, where the goal is to sequentially select a set of tests with a minimal cost, so that one can efficiently make the best decision based on the observed outcomes. Existing algorithms are either heuristics with no guarantees, or scale poorly (with exponential run time in terms of the number of available tests). Moreover, these methods assume a known distribution over the test outcomes, which is often not the case in practice. We propose an efficient sampling-based online learning framework to address the above issues. First, assuming the distribution over hypotheses is known, we propose a dynamic hypothesis enumeration strategy, which allows efficient information gathering with strong theoretical guarantees. We show that with sufficient amount of samples, one can identify a near-optimal decision with high probability. Second, when the parameters of the hypotheses distribution are unknown, we propose an algorithm which learns the parameters progressively via posterior sampling in an online fashion. We further establish a rigorous bound on the expected regret. We demonstrate the effectiveness of our approach on a real-world interactive troubleshooting application and show that one can efficiently make high-quality decisions with low cost.

ID: 280

Efficient solutions for Stochastic Shortest Path Problems with Dead Ends

Felipe Trevizan; Florent Teichteil-Königsbuch; Sylvie Thiebaux

Many planning problems require maximizing the probability of goal satisfaction as well as minimizing the expected cost to reach the goal. To model and solve such problems, there has been several attempts at extending stochastic shortest path problems (SSPs) to deal with dead-ends and optimize a dual optimization criterion. Unfortunately these extensions lack either theoretical robustness or practical efficiency. We study a new, perhaps even more natural optimization criterion capturing these problems, the Min-Cost given Max-Prob (MCMP) criterion. This criterion leads to the minimum expected cost policy among those with maximum success probability, and accurately accounts for the cost and risk of reaching dead-ends. Moreover, it lends itself to efficient solution methods that build on recent heuristic search algorithms for the dual representation of stochastic shortest paths problems. Our experiments show up to one order of magnitude speed-up over the state of the art.

ID: 135

Embedding Senses via Dictionary Bootstrapping

Byungkon Kang; Kyung-Ah Sohn

This paper addresses the problem of embedding senses of a plain word according to its context. Natural language is inherently ambiguous, due to the presence of many multi-sensed words. Such ambiguity might have undesirable influence over many text-mining tools, including word embedding. Traditional word embedding techniques have focused on identifying which words tend to co-occur with one another in order to derive close embeddings for such words. However, the effectiveness of this approach is largely susceptible to the validity and neutrality of the training corpus. To address this problem, we propose to use the dictionary as the authoritative corpus for computing the word embeddings. The basic idea is to simultaneously embed the definition sentence while disambiguating words. Since dictionaries list a set of disambiguated senses, the proposed procedure yields sense embeddings which exhibit semantic characteristics comparable to plain word embeddings.

ID: 242

Exact Inference for Relational Graphical Models with Interpreted Functions: Lifted Probabilistic Inference Modulo Theories

Rodrigo de Salvo Braz; Ciaran O'Reilly

Probabilistic Inference Modulo Theories (PIMT) is a recent framework that expands exact inference on graphical models to use richer languages that include arithmetic, equalities, and inequalities on both integers and real numbers. In this paper, we expand PIMT to a lifted version that also processes random functions and relations. This enhancement is achieved by adapting Inversion, a method from Lifted First-Order Probabilistic Inference literature, to also be modulo theories. This results in the first algorithm for exact probabilistic inference that efficiently and simultaneously exploits random relations and functions, arithmetic, equalities and inequalities.

ID: 12

FROSH: FasteR Online Sketching Hashing

Xixian Chen; Irwin King; Michael R. Lyu

Many hashing methods, especially those that are in the data-dependent category with good learning accuracy are still inefficient when dealing with three critical problems in modern data analysis. First, data usually come in a streaming fashion, but most of the existing hashing methods are batch-based models. Second, when data become huge, the extensive computational time, large space requirement, and multiple passes to load the data into memory will not be allowed. Third, data often lack sufficient label information. Although the recently proposed Online Sketching Hashing (OSH) is promising to alleviate all three issues mentioned above, its training procedure still suffers from a high time complexity. In this paper, we propose a FasteR Online Sketching Hashing (FROSH) method to make the training process faster. Compared with OSH, we leverage fast transform to sketch data more compactly. Particularly, we derive independent transformations to guarantee the sketching accuracy, and design a novel implementation to make such transformations applicable to online data sketching without increasing the space cost. We rigorously prove that our method can yield a comparable learning accuracy with a lower time complexity and an equal space cost compared with OSH. Finally, extensive experiments on synthetic and real-world datasets demonstrate the excellent performance of our method.

ID: 207

Fair Optimal Stopping Policy for Matching with Mediator

Yang Liu

In this paper we study an optimal stopping policy for a multi-agent delegated matching system with fairness constraints. We consider a setting where a mediator/decision maker matches a sequence of arriving assignments to a group of agents, aiming to maximize total rewards that can be collected from above matching process (from all agents), while making the matching fair to all agents. Without fairness constraint, the decision maker's matching problem can be formulated as a standard optimal stopping question. However it is possible that the decision maker will frequently skip possible matching for certain agents in order to maximize the overall reward of return. The above solution raises serious fairness concerns as agents would face unfair chance of matches. In this paper we study the optimal stopping policy (for matching) with fairness constraints. We discuss two types of fairness constraints: (i) each agent has a certain expected deadline before which her match needs to happen. (ii) For the second one, each agent would like to have a guaranteed share of average reward from the matching. We first show this fairness problems can be formulated into constrained optimal stopping problems. Then we establish the equivalence between the optimal matching strategies and the solutions to a set of constrained optimization problems. We present the exact characterization of optimal strategies. Example is provided to demonstrate the computation efficiency of our solution.

ID: 75

Fast Amortized Inference and Learning in Log-linear Models with Randomly Perturbed Nearest Neighbor Search

Stephen Mussmann; Daniel Levy; Stefano Ermon

Inference in log-linear models scales linearly in the size of output space in the worst-case. This is a bottleneck in NLP and computer vision where the output space is often enumerable but very large. We propose a method to perform inference in log-linear models with sublinear amortized cost. Our idea hinges on using Gumbel random variable perturbations and a pre-computed Maximum Inner Product Search data structure to access the most-likely elements in sublinear amortized time. Our method yields provable runtime and accuracy guarantees. Furthermore, we present empirical experiments on ImageNet and Word Embeddings showing significant speedups for sampling, inference, and learning in log-linear models.

ID: 250

Feature-to-Feature Regression for a Two-Step Conditional Independence Test

Qinyi Zhang; Sarah Filippi; Seth Flaxman; Dino Sejdinovic

The algorithms for causal discovery and more broadly for learning the structure of graphical models require well calibrated and consistent conditional independence (CI) tests. We revisit the CI tests which are based on two-step procedures and involve regression with subsequent (unconditional) independence test (RESIT) on regression residuals and investigate the assumptions under which these tests operate. In particular, we demonstrate that when going beyond simple functional relationships with additive noise, such tests can lead to an inflated number of false discoveries. We study the relationship of these tests with those based on dependence measures using reproducing kernel Hilbert spaces (RKHS) and propose an extension of RESIT which uses RKHS-valued regression. The resulting test inherits the simple two-step testing procedure of RESIT, while giving correct Type I control and competitive power. When used as a component of the PC algorithm, the proposed test is more robust to the case where hidden variables induce a switching behaviour in the associations present in the data.

ID: 142

Green Generative Modeling: Recycling Dirty Data using Recurrent Variational Autoencoders

Yu Wang; Bin Dai; Gang Hua; John Aston; David Wipf

This paper explores two useful modifications of the recent variational autoencoder (VAE), a popular deep generative modeling framework that dresses traditional autoencoders with probabilistic attire. The first involves a specially-tailored form of conditioning that allows us to simplify the VAE decoder structure while simultaneously introducing robustness to outliers. In a related vein, a second, complementary alteration is proposed to further build invariance to contaminated or dirty samples via a data augmentation process that amounts to recycling. In brief, to the extent that the VAE is legitimately a representative generative model, then each output from the decoder should closely resemble an authentic sample, which can then be resubmitted as a novel input ad infinitum. Moreover, this can be accomplished via special recurrent connections without the need for additional parameters to be trained. We evaluate these proposals on multiple practical outlier-removal and generative modeling tasks, demonstrating considerable improvements over existing algorithms.

ID: 267

Holographic Feature Representations of Deep Networks

Martin A. Zinkevich; Alex Davies; Dale Schuurmans

It is often asserted that deep networks learn ``"features", traditionally expressed by the activations of intermediate nodes. We explore an alternative concept by defining features as partial derivatives of model output with respect to model parameters---extending a simple yet powerful idea from generalized linear models. The resulting features are not equivalent to node activations, and we show that they can induce a holographic representation of the complete model: the network's output on given data can be exactly replicated by a simple linear model over such features extracted from any ordered cut. We demonstrate useful advantages for this feature representation over standard representations based on node activations.

ID: 255

How Good Are My Predictions? Efficiently Approximating Precision-Recall Curves for Massive Datasets

Ashish Sabharwal; Hanie Sedghi

Large scale machine learning produces massive datasets whose items are often associated with a confidence level and can thus be ranked. However, computing the precision of these resources requires human annotation, which is prohibitively expensive and is often skipped. We consider the problem of cost-effectively approximating precision-recall (PR) or ROC curves for such systems. Our novel approach, called PAL, provides theoretically guaranteed lower and upper bounds on the underlying precision function while relying on only $\bigOh(\log n)$ annotations for a resource with $n$ items. This contrasts favorably with $\Theta(\sqrt{n \log n})$ annotations needed by commonly used sampling based methods. Our key insight is to capitalize on a natural monotonicity property of the underlying confidence-based ranking. PAL provides tight bounds for PR curves using, e.g., only 17K annotations for resources with 200K items and 48K annotations for resources with 2B items. We illustrate our approach by evaluating the much utilized PPDB paraphrase database and a recently introduced Science KB.

ID: 297

Hybrid Deep Discriminative/Generative Models for Semi-Supervised Learning

Volodymyr Kuleshov; Stefano Ermon

Discriminative models are often the method of choice in machine learning. Generative mod- els, however, are emerging as a powerful alternative because of their ability to handle unlabeled data, especially in semi-supervised settings. We propose a framework to combine a broad class of discriminative and generative models, interpolating between the two extremes with a multi- conditional likelihood objective. Unlike existing methods, we couple the two components through shared latent variables, and train using recent advances in variational inference. Our framework offers great modeling flexibility, is compatible with modern deep architectures, and results in improvements over the state of the art on the SVHN dataset in the semi-supervised setting.

ID: 138

Importance Sampled Stochastic Optimization for Variational Inference

Joseph Sakaya; Arto Klami

Variational inference approximates the posterior distribution of a probabilistic model with a parameterized density by maximizing a lower bound for the model evidence. Modern solutions fit a flexible approximation with stochastic gradient descent, using Monte Carlo approximation for the gradients. This enables variational inference for arbitrary differentiable probabilistic models, and consequently makes variational inference feasible for probabilistic programming languages. In this work we develop more efficient inference algorithms for the task by considering importance sampling estimates for the gradients. We show how the gradient with respect to the approximation parameters can often be evaluated efficiently without needing to re-compute gradients of the model itself, and then proceed to derive practical algorithms that use importance sampled estimates to speed up computation. We present importance sampled stochastic gradient descent that outperforms standard stochastic gradient descent by a clear margin for a range of models, and provide a justifiable variant of stochastic average gradients for variational inference.

ID: 225

Importance Sampling for Fair Policy Selection

Shayan Doroudi; Philip Thomas; Emma Brusnkill

We consider the problem of off-policy policy selection in reinforcement learning: using historical data generated from running one policy to compare two or more policies. We show that approaches based on importance sampling can be unfair—they can select the worse of the two policies more often than not. We give two examples where the unfairness of importance sampling could be practically concerning. We then present sufficient conditions to theoretically guarantee fairness and a related notion of safety. Finally, we provide a practical importance sampling-based estimator to help mitigate one of the systematic sources of unfair- ness resulting from using importance sampling for policy selection.

ID: 259

Improving Optimization-Based Approximate Inference by Clamping Variables

Junyao Zhao; Josip Djolonga; Sebastian Tschiatschek; Andreas Krause

While central to the application of probabilistic models to discrete data, the problem of marginal inference is in general intractable and efficient approximation schemes need to exploit the problem structure. Recently, there have been efforts to develop inference techniques that do not necessarily make factorization assumptions about the distribution, but rather exploit the fact that sometimes there exist efficient algorithms for finding the MAP configuration. In this paper, we theoretically prove that for discrete multi-label models the bounds on the partition function obtained by two of these approaches, Perturb-and-MAP and the bound from the infinite Renyi divergence, can be only improved by clamping any subset of the variables. For the case of log-supermodular models we provide a more detailed analysis and develop a set of efficient strategies for choosing the order in which the variables should be clamped. Finally, we present a number of numerical experiments showcasing the improvements obtained by the proposed methods on several models.

ID: 292

Interpreting Lion Behaviour as Probabilistic Programs

Neil Dhir; Matthjis Vakar; Matthew Wijers; Andrew Markham; Frank Wood

We consider the problem of unsupervised learning of meaningful behavioural segments of high-dimensional time-series observations, collected from a pride of African lions. We demonstrate, by way of probabilistic programming, a methodology which allows for quick iteration over models and Bayesian inferences, which enables us to learn meaningful behavioural segments. We introduce a new Bayesian nonparametric (BNP) state-space model, which extends the HDP-HMM with an explicit BNP treatment of duration distributions, to deal with different levels of granularity of the latent behavioural space of the lions. Furthermore, we combine this approach with unsupervised feature learning, using variational autoencoders.

ID: 120

Interpreting and using CPDAGs with background knowledge

Emilija Perkovic; Markus Kalisch; Marloes H. Maathuis

We consider maximally oriented partially directed acyclic graphs (maximal PDAGs). These graphs occur, for example, as a result of applying background knowledge of certain edge orientations to a completed partially directed acyclic graph (CPDAG). While background knowledge is often available, current causal methods developed for CPDAGs cannot be applied to the resulting graphical output. In this paper, we solve two problems for maximal PDAGs: we develop methodology to read off possible ancestral relationships and we adapt some existing graphical methods for estimating total causal effects. Specifically, we adapt the graphical criteria for covariate adjustment, as well as the IDA and joint-IDA frameworks. We also present a simulation study that illustrates the gain in identifiability of total causal effects as the background knowledge increases.

ID: 48

Inverse Reinforcement Learning via Deep Gaussian Process

Ming Jin; Andreas Damianou; Pieter Abbeel; Costas Spanos

We propose a new approach to inverse reinforcement learning (IRL) based on the deep Gaussian process (deep GP) model, which is capable of learning complicated reward structures with few demonstrations. Our model stacks multiple latent GP layers to learn abstract representations of the state feature space, which is linked to the demonstrations through the Maximum Entropy learning framework. Incorporating the IRL engine into the nonlinear latent structure renders existing deep GP inference approaches intractable. To tackle this, we develop a non-standard variational approximation framework which extends previous inference schemes. This allows for approximate Bayesian treatment of the feature space and guards against overfitting. Carrying out representation and inverse reinforcement learning simultaneously within our model outperforms state-of-the-art approaches, as we demonstrate with experiments on standard benchmarks (``"object world",``"highway driving") and a new benchmark (``"binary world").

ID: 197

Iterative Decomposition Guided Variable Neighborhood Search for Graphical Model Energy Minimization

Abdelkader Ouali; David Allouche; Simon de Givry; Samir Loudni; Yahia Lebbah; Lakhdar Loukil

Graphical models factorize a global probability distribution/energy function as the product/sum of local functions. A major inference task, known as MAP in Markov Random Fields and MPE in Bayesian Networks, is to find a global assignment of all the variables with maximum a posteriori probability/minimum energy. A usual distinction on MAP solving methods is complete/incomplete, i.e. the ability to prove optimality or not. Most complete methods rely on tree search, while incomplete methods rely on local search. Among them, we study Variable Neighborhood Search for graphical models. By using (partial) tree search inside its local neighborhood exploration, we show it is possible to restore completeness. The resulting hybrid method has a much better anytime behavior than existing tree search methods and is still competitive for proving optimality. We further propose a parallel version of our method improving its anytime behavior on difficult instances coming from a large graphical model benchmark. Last we experiment on the challenging minimum energy problem found in Computational Protein Design, showing the practical benefit of our parallel version.

ID: 188

Learning Approximately Objective Priors

Eric Nalisnick; Padhraic Smyth

In modern probabilistic learning we often wish to perform automatic inference for Bayesian models. However, informative prior distributions can be costly and difficult to elicit, and, as a consequence, flat priors are often chosen with the hope that they are reasonably uninformative. Objective priors such as the Jeffreys and reference priors are generally preferable over flat priors but are not tractable to derive for many models of interest. In this paper we address this issue by proposing techniques for learning reference prior approximations: we select a parametric family and optimize a lower bound on the reference prior objective, which is the mutual information between data and parameters. Moreover, optimization can be made derivation-free via differentiable Monte Carlo expectations. We experimentally demonstrate the method’s effectiveness by recovering Jeffreys priors and learning the Variational Autoencoder’s reference prior.

ID: 266

Learning Treatment-Response Models from Multivariate Longitudinal Data

Hossein Soleimani; Adarsh Subbaswamy; Suchi Saria

Treatment effects can be estimated from observational data as the difference in potential outcomes. In this paper, we address the challenge of estimating the potential outcome when treatment-dose levels can vary continuously over time. Further, the outcome variable may not be measured at a regular frequency. Our proposed solution represents the treatment response curves using linear time-invariant dynamical systems---this provides a flexible means for modeling response over time to highly variable dose curves. Moreover, for multivariate data, the proposed method: uncovers shared structure in treatment response and the baseline across multiple markers; and, flexibly models challenging correlation structure both across and within signals over time. For this, we build upon the framework of multiple-output Gaussian Processes. On simulated and a challenging clinical dataset, we show significant gains in accuracy over state-of-the-art models.

ID: 291

Learning the Structure of Probabilistic Sentential Decision Diagrams

Yitao Liang; Jessa Bekker; Guy Van den Broeck

The probabilistic sentential decision diagram (PSDD) was recently introduced as a tractable representation for probability distributions that are subject to strong logical constraints. Meanwhile, the field of tractable learning saw tremendous success in inducing complex joint distributions from data without constraints, while guaranteeing support for efficient exact inference.This paper studies the efficacy of PSDDs for tractable learning without constraints, for which we develop the first PSDD structure learning algorithm, called LearnPSDD. Experimentals on standard benchmarks show competitive performance, despite the fact that PSDDs are more tractable and restrictive than their alternatives. LearnPSDD compares favorably to sum-product network learners, particularly in terms of model size, which is a proxy for tractability. On four datasets, we report results that to the best of our knowledge are state of the art in tractable learning. Moreover, our algorithm retains the ability to effectively learn PSDDs in structured probability spaces, which is beyond the reach of other representations.

ID: 237

Learning to Acquire Information

Yewen Pu; Leslie Pack Kaelbling; Armando Solar-Lezama

We consider the problem of diagnosis where a set of simple observations are used to infer a potentially complex hidden hypothesis. Finding the optimal subset of observations is intractable in general, thus we focus on the problem of active diagnosis, where the agent selects the next most-informative observation based on the results of previous observations. We show that under the assumption of uniform observation entropy, one can build an \emph{implication model} which directly predicts the outcome of the potential next observation conditioned on the results of past observations, and selects the observation with the maximum entropy. This approach enjoys reduced computation complexity by bypassing the complicated hypothesis space, and can be trained on observation data alone, learning how to query without knowledge of the hidden hypothesis.

ID: 206

Learning to Draw Samples with Amortized Stein Variational Gradient Descent

Yihao Feng; Dilin Wang; Qiang Liu

We propose a simple algorithm to train stochastic neural networks to draw samples from given target distributions for probabilistic inference. Our method is based on iteratively adjusting the neural network parameters so that the output changes along a Stein variational gradient direction (Liu & Wang, 2016) that maximally decreases the KL divergence with the target distribution. Our method works for any target distribution specified by their unnormalized density function, and can train any black-box architectures that are differentiable in terms of the parameters we want to adapt. We demonstrate our method with a number of applications, including variational autoencoder (VAE) with expressive encoders to model complex latent space structures, and hyper-parameter learning of MCMC samplers that allows Bayesian inference to adaptively improve itself when seeing more data.

ID: 35

Learning with Confident Examples: Rank Pruning for Robust Classification with Noisy Labels

Curtis G. Northcutt; Tailin Wu; Isaac L. Chuang

Noisy PN learning is the problem of binary classification when training examples may be mislabeled (flipped) uniformly with noise rate rho1 for positive examples and rho0 for negative examples. We propose Rank Pruning (RP) to solve noisy PN learning and the open problem of estimating the noise rates. Unlike prior solutions, RP is efficient and general, requiring O(T) for any unrestricted choice of probabilistic classifier with T fitting time. We prove RP achieves consistent noise estimation and equivalent empirical risk as learning with uncorrupted labels in ideal conditions, and derive closed-form solutions when conditions are non-ideal. RP achieves state-of-the-art noise rate estimation and F1, error, and AUC-PR on the MNIST and CIFAR datasets, regardless of noise rates. To highlight, RP with a CNN classifier can predict if a MNIST digit is a "1" or "not 1" with only 0.25% error, and 0.46% error across all digits, even when 50% of positive examples are mislabeled and 50% of observed positive labels are mislabeled negative examples.

ID: 37

Monte-Carlo Tree Search using Batch Value of Perfect Information

Shahaf S. Shperberg; Solomon Eyal Shimony; Ariel Felner

This paper focuses on the selection phase of Monte-Carlo Tree Search (MCTS). We define batch value of perfect information (BVPI) in game trees as a generalization of value of computation as proposed by Russell and Wefald, and use it for selecting nodes to sample in MCTS. We show that computing the BVPI is NP-hard, but it can be approximated in polynomial time. In addition, we pro- pose methods that intelligently find sets of fringe nodes with high BVPI, and quickly select nodes to sample from these sets. We apply our new BVPI methods to partial game trees, both in a stand-alone set of tests, and as a component of a full MCTS algorithm. Empirical results show that our BVPI methods outperform existing node- selection methods for MCTS in different scenarios.

ID: 155

Multi-dueling Bandits with Dependent Arms

Yanan Sui; Vincent Zhuang; Joel W. Burdick; Yisong Yue

The dueling bandits problem is an online learning framework for learning from pairwise preference feedback, and is particularly well-suited for modeling settings that elicit subjective or implicit human feedback. In this paper, we study the problem of {multi-dueling bandits with dependent arms}, which extends the original dueling bandits setting by simultaneously dueling multiple arms as well as modeling dependencies between arms. These extensions capture key characteristics found in many real-world applications, and allow for the opportunity to develop significantly more efficient algorithms than were possible in the original setting. We propose the SelfSparring algorithm, which reduces the multi-dueling bandits problem to a conventional bandit setting that can be solved using a stochastic bandit algorithm such as Thompson Sampling, and can naturally model dependencies using a Gaussian process prior. We present a no-regret analysis for multi-dueling setting, and demonstrate the effectiveness of our algorithm empirically on a wide range of simulation settings.

ID: 62

Near-Optimal Interdiction of Factored MDPs

Swetasudha Panda; Yevgeniy Vorobeychik

Stackelberg games have been widely used to model interactions between attackers and defenders in a broad array of security domains. One related approach involves plan interdiction, whereby a defender chooses a subset of actions to block (remove), and the attacker constructs an optimal plan in response. In previous work, this approach has been introduced in the context of Markov decision processes (MDPs). The key challenge, however, is that the state space of MDPs grows exponentially in the number of state variables. We propose a novel scalable MDP interdiction framework which makes use of factored representation of state, using Fourier representation for representing a value function over a boolean space. We demonstrate that our approach is significantly more scalable, while resulting in near-optimal interdiction decisions.

ID: 6

Near-Orthogonality Regularization in Kernel Methods

Pengtao Xie; Barnabas Poczos; Eric Xing

Kernel methods perform nonlinear learning in a high-dimensional reproducing kernel Hilbert space. Their large model-capacity leads to high representational power, but also incurs substantial risk of overfitting. To alleviate this problem, we propose a new regularization approach -- near-orthogonality regularization, which encourages the RKHS functions to be close to orthogonal. This effectively imposes a structural constraint over the function space, which reduces model complexity and improves generalization performance. Encouraging orthogonality can also reduce model size without compromising modeling power and is capable of capturing infrequent patterns. We define a family of orthogonality-promoting regularizers by encouraging the Gram matrix of the RKHS functions to be close to an identity matrix, where the closeness is measured by Bregman matrix divergence. We apply these regularizers to kernel distance metric learning and kernel sparse coding, and develop an efficient ADMM based algorithm to solve the regularized optimization problems. We analyze how near-orthogonality affects the generalization performance of kernel methods. The results suggest that the closer the functions are to orthogonality, the smaller the generalization error is. Experiments demonstrate the efficacy of near-orthogonality regularization in kernel methods.

ID: 244

Neighborhood Regularized $\ell^1$-Graph

Yingzhen Yang; Jiashi Feng; Jiahui Yu; Jianchao Yang; Thomas S. Huang

$\ell^{1}$-Graph, which learns a sparse graph over the data by sparse representation, has been demonstrated to be effective in clustering especially for high dimensional data. Although it achieves compelling performance, the sparse graph generated by $\ell^{1}$-Graph ignores the geometric information of the data by sparse representation for each datum separately. To obtain a sparse graph that is aligned to the underlying manifold structure of the data, we propose the novel Neighborhood Regularized $\ell^{1}$-Graph (NR$\ell^{1}$-Graph). NR$\ell^{1}$-Graph learns sparse graph with locally consistent neighborhood by encouraging nearby data to have similar neighbors in the constructed sparse graph. We present the optimization algorithm of NR$\ell^{1}$-Graph with theoretical guarantee on the convergence and the gap between the sub-optimal solution and the globally optimal solution in each step of the coordinate descent, which is essential for the overall optimization of NR$\ell^{1}$-Graph. Its provable accelerated version, NR$\ell^{1}$-Graph by Random Projection (NR$\ell^{1}$-Graph-RP) that employs randomized data matrix decomposition, is also presented to improve the efficiency of the optimization of NR$\ell^{1}$-Graph. Experimental results on various real data sets demonstrate the effectiveness of both NR$\ell^{1}$-Graph and NR$\ell^{1}$-Graph-RP.

ID: 160

On Loopy Belief Propagation -- Local Stability Analysis for Non-Vanishing Fields

Christian Knoll; Franz Pernkopf

We obtain all fixed points of belief propagation and provide a local stability analysis. In this work we consider pairwise interactions and binary random variables and investigates the influence of non-vanishing fields and finite-size graphs on the performance of belief propagation. Local stability is heavily influenced by these properties. We show how non-vanishing fields help to achieve convergence and to increase the accuracy of belief propagation. In the antiferromagnetic case we show close connections between the existence of multiple solutions, the capability of belief propagation (with damping) to converge, and the underlying graph structure. Further we provide insights into why finite-size graphs behave better than infinite-size graphs.

ID: 105

On the Complexity of Nash Equilibrium Reoptimization

Andrea Celli; Alberto Marchesi; Nicola Gatti

We provide, to the best of our knowledge, the first study about reoptimization complexity of game-theoretical solutions. In a reoptimization problem, we are given an instance, its optimal solution, and a local modification, and we are asked to find the exact or an approximate solution to the modified instance. Reoptimization is crucial whenever an instance needs to be solved repeatedly and, at each repetition, its parameters may slightly change. In this paper, we focus on Nash equilibrium, being the central game-theoretical solution. We study the reoptimization of Nash equilibria satisfying some properties (i.e., maximizing/minimizing the social welfare, the utility of a player or the support size) for some different local modifications of the game (i.e., modification of a payoff or addition/removal of an action), showing that such problems are NP-hard. Furthermore, we assess the approximation complexity of the aforementioned problems, showing that it matches the complexity of the original (non-reoptimization) problems. Finally, we show that, when finding a Nash equilibrium is thought as an optimization problem, reoptimization is useful for finding approximate solutions. Specifically, it allows one to find ε-Nash equilibria with smaller ε than that of the solutions returned by the best known approximation algorithms.

ID: 276

Online Constrained Model-based Reinforcement Learning

Benjamin van Niekerk; Andreas Damianou; Benjamin Rosman

Applying reinforcement learning to robotic systems poses a number of challenging problems. A key requirement is the ability to handle continuous state and action spaces while remaining within a limited time and resource budget. Additionally, for safe operation, the system must make robust decisions under hard constraints. To address these challenges, we propose a model based approach that combines Gaussian Process regression and Receding Horizon Control. Using sparse spectrum Gaussian Processes, we extend previous work by updating the dynamics model incrementally from a stream of sensory data. This results in an agent that can learn and plan in real-time under non-linear constraints. We test our approach on a cart pole swing-up environment and demonstrate the benefits of online learning on an autonomous racing task. The environment’s dynamics are learned from limited training data and can be reused in new task instances without retraining.

ID: 281

Probabilistic Program Abstractions

Steven Holtzen; Todd Millstein; Guy Van den Broeck

Abstraction is a fundamental tool for reasoning about a complex system. Program abstraction has been utilized to great effect for analyzing deterministic programs. At the heart of a program abstraction is a connection between the abstract program, which is simple to analyze, and the concrete program, which may be extremely complex. Program abstractions, however, are typically not probabilistic. We generalize non-deterministic program abstractions to probabilistic program abstractions. The same connections which drove program abstraction in the non-deterministic setting have probabilistic analogies with similar simplifying effects. We show that these abstractions are themselves families of probabilistic programs with particular properties, providing avenues for utilizing abstraction techniques from the programming languages community to improve the analysis of probabilistic programs.

ID: 32

Provable Inductive Robust PCA via Iterative Hard Thresholding

U.N. Niranjan; Arun Rajkumar; Theja Tulabandhula

The robust PCA problem, wherein, given an input data matrix that is the superposition of a low-rank matrix and a sparse matrix, we aim to separate out the low-rank and sparse components, is a well-studied problem in machine learning. One natural question that arises is that, as in the inductive setting, if features are provided as input as well, can we hope to do better? Answering this in the affirmative, the main goal of this paper is to study the robust PCA problem while incorporating feature information. In contrast to previous works in which recovery guarantees are based on the convex relaxation of the problem, we propose a simple iterative algorithm based on hard-thresholding of appropriate residuals. Under weaker assumptions than previous works, we prove the global convergence of our iterative procedure; moreover, it admits a much faster convergence rate and lesser computational complexity per iteration. In practice, through systematic synthetic and real data simulations, we confirm our theoretical findings regarding improvements obtained by using feature information.

ID: 130

Real-Time Resource Allocation for Tracking Systems

Yash Satsangi; Shimon Whiteson; Frans Oliehoek; Henri Bouma

Automated tracking is key to many computer vision applications. However, many tracking systems struggle to perform in real-time due to the high computational cost of detecting people, especially in ultra high resolution images. We propose a new algorithm called PartiMax that greatly reduces this cost by applying the person detector only to the relevant parts of the image. PartiMax exploits information in the particle filter to select k of the n candidate pixel boxes in the image. We prove that PartiMax is guaranteed to make a near-optimal selection with error bounds that are independent of the problem size. Furthermore, empirical results on a real-life dataset show that our system runs in real-time by processing only 10% of the pixel boxes in the image while still retaining 80\% of the original tracking performance achieved when processing all pixel boxes.

ID: 100

Regret Minimization Algorithms for the Follower’s Behaviour Identification in Leadership Games

Lorenzo Bisi; Giuseppe De Nittis; Francesco Trov`ò; Marcello Restelli; Nicola Gatti

We study for the first time, to the best of our knowledge, a leadership game in which one agent, acting as leader, faces another agent, acting as follower, whose behaviour is not known a priori by the leader, being one among a set of possible behavioural profiles. The main motivation is that in real-world applications the common game-theoretical assumption of perfect rationality is rarely met, and any specific assumption on bounded rationality models, if wrong, could lead to a significant loss for the leader. The question we pose is whether and how the leader can learn the behavioural profile of a follower in leadership games. This is a “natural” online identification problem: in fact, the leader aims at identifying the follower’s behavioural profile to exploit at best the potential non-rationality of the opponent, while minimizing the regret due to the initial lack of information. We propose two algorithms based on different approaches and we provide a regret analysis. Furthermore, we experimentally evaluate the pseudo-regret of the algorithms in concrete leadership games inspired by security domains, showing that our algorithms dramatically outperform the online learning algorithms available in the state of the art.

ID: 298

Robust Model Equivalence using Stochastic Bisimulation for N-Agent Interactive DIDs

Muthukumaran Chandrasekaran; Junhuan Zhang; Prashant Doshi; Yifeng Zeng

I-DIDs suffer disproportionately from the curse of dimensionality dominated by the exponential growth in the number of models over time. Toward this, previous methods for scaling I-DIDs identify notions of equivalence between models, such as behavioral equivalence (BE). But, this requires that the models be solved first. Also, model space compression across agents has not been previously investigated. We present a way to compress the space of models across agents, possibly with different frames, and do so without having to solve them first, using stochastic bisimulation. We test our approach on two non-cooperative partially observable domains with up to 20 agents.

ID: 234

SAT-Based Causal Discovery under Weaker Assumptions

Zhalama; Jiji Zhang; Frederick Eberhardt; Wolfgang Mayer

Using the flexibility of recently developed methods for causal discovery based on Boolean satisfiability (SAT) solvers, we encode a variety of assumptions that weaken the causal Faithfulness assumption. The encoding results in a number of SAT-based algorithms whose asymptotic correctness relies on weaker conditions than are standardly assumed. This implementation of a whole set of assumptions in the same platform enables us to systematically explore the effect of weakening the Faithfulness assumption on causal discovery. An important effect, suggested by simulation results, is that adopting weaker assumptions significantly alleviates the problem of handling conflicting constraints and substantially shortens solving time. As a result, SAT-based causal discovery is potentially more scalable under weaker assumptions.

ID: 108

Safe Semi-Supervised Learning of Sum-Product Networks

Martin Trapp; Tamas Madl; Robert Peharz; Franz Pernkopf; Robert Trappl

In several domains obtaining class annotations is expensive while at the same time unlabelled data are abundant. While most semi-supervised approaches enforce restrictive assumptions on the data distribution, recent work has managed to learn semi-supervised models in a non-restrictive regime. However, so far such approaches have only been proposed for linear models. In this work, we introduce semi-supervised parameter learning for Sum-Product Networks (SPNs). SPNs are deep probabilistic models admitting inference in linear time in number of network edges. Our approach has several advantages, i.e. it allows generative and discriminative semi-supervised learning, guarantees that adding unlabelled data can increase, but not degrade, the performance, is computationally efficient and does not enforce restrictive assumptions on the data distribution. We show on a variety of data sets that safe semi-supervised learning with SPNs is competitive compared to state-of-the-art and can lead to a better generative and discriminative objective value than a purely supervised approach.

ID: 16

Self-Discrepancy Conditional Independence Test

Sanghack Lee; Vasant Honavar

Testing random variables for conditional independence (CI) is an important and challenging task with a broad range of applications in machine learning and causal discovery. Of particular interest are kernel-based CI tests. Recent work has claimed that Kernel CI Permutation Test (KCIPT), which uses a single learned permutation to reduce the CI test to an easier two-sample test, has power competitive with existing kernel-based tests and is well-calibrated compared to other alternatives. We show that KCIPT suffers from a loss of calibratedness with increase in power (obtained by increasing the number of bootstrap samples). We propose a novel CI test, called Self-Discrepancy Conditional Independence Test (SDCIT). Our test statistic is a modified unbiased estimate of maximum mean discrepancy between a given sample and its permuted counterpart where permutation is obtained as in KCIPT. We approximate the null distribution by computing the test statistics on half-samples obtained using the sampling without replacement on the permuted sample. We present results of experiments that demonstrate the superiority of SDCIT over existing kernel-based CI tests with respect to both power and calibratedness.

ID: 26

Shortest Path under Uncertainty: Exploration versus Exploitation

Zhan Wei Lim; David Hsu; Wee Sun Lee

In the Canadian Traveler Problem (CTP), a traveler seeks a shortest path to a destination through a road network, but unknown to the traveler, some roads may be blocked. This paper studies the Bayesian CTP (BCTP), in which road states are correlated with known prior probabilities and the traveler can infer the states of an unseen road from past observations of other correlated roads. As generalized shortest-path problems, CTP and BCTP have important practical applications. We show that BCTP is NP-complete and give a polynomial-time approximation algorithm, Hedged Shortest Path under Determinization (HSPD), which approximates an optimal solution with a polylogarithmic factor. Preliminary experiments show promising results. HSPD outperforms a widely used greedy algorithm and a state-of-the-art UCT-based search algorithm for CTP, especially when significant exploration is required.

ID: 217

Stein Variational Adaptive Importance Sampling

Jun Han; Qiang Liu

We propose a novel adaptive importance sampling algorithm which incorporates Stein variational gradient decent algorithm(SVGD) with importance sampling(IS). Our algorithm leverages the nonparametric transforms defined in SVGD to iteratively decrease the KL divergence between our importance proposal and the target distribution. The advantages of this idea are two folds: first, it enables our algorithm to inherit more theoretical interpretability of IS than SVGD; second, we do not restrict the choice of our importance proposal from predefined distribution families as in traditional adaptive IS. Empirical experiments demonstrate that our algorithm performs better than the traditional adaptive IS. In addition, our algorithm achieves comparable results with Hamiltonian annealing importance sampling when employed to estimate the partition functions of graphical models and evaluate the log-likelihoods of deep generative models.

ID: 239

Stein Variational Policy Gradient

Yang Liu; Prajit Ramachandran; Qiang Liu; Jian Peng

Policy gradient methods have been successfully applied to many complex reinforcement learning problems. The major drawbacks of policy gradient methods include high variability, slow convergence, and exploration inefficiency. We introduce a maximum entropy policy optimization framework which explicitly encourages parameter exploration. We show that this optimization framework can be reduced to a Bayesian inference problem and then propose a novel Stein variational policy gradient method (SVPG) that utilizes existing policy gradient methods and a repulsive functional to generate a set of diverse but well-behaved policies. On continuous control problems, we find that the SVPG implementations of REINFORCE and advantage actor-critic algorithms greatly boost the performance of their original versions, in terms of both data efficiency and average return. SVPG is also robust against the initialization variation and can be easily implemented in a parallel manner.

ID: 137

Stochastic Bandit Models for Delayed Conversions

Claire Vernade; Olivier Cappé; Vianney Perchet

Online advertising and product recommendation are important domains of applications for multi-armed bandit methods. In these fields, the reward that is immediately available is most often only a proxy for the actual outcome of interest, which we refer to as a 'conversion'. For instance, in web advertising, clicks can be observed within a few seconds after an ad display but the corresponding sale --if any-- will take hours, if not days to happen. This paper proposes and investigates a new stochastic multi-armed bandit model in the framework proposed by Chapelle (2014) --based on empirical studies in the field of web advertising-- in which each action may trigger a future reward that will then happen with a stochastic delay. We assume that the probability of conversion associated with each action is unknown while the distribution of the conversion delay is known, distinguishing between the (idealized) case where the conversion events may be observed whatever their delay and the more realistic setting in which late conversions are censored. We provide performance lower bounds as well as two simple but efficient algorithms based on the UCB and KLUCB frameworks. The latter algorithm, which is preferable when conversion rates are low, is based on a Poissonization argument, of independent interest in other settings where aggregation of Bernoulli observations with different success probabilities is required.

ID: 86

Stochastic L-BFGS Revisited: Improved Convergence Rates and Practical Acceleration Strategies

Renbo Zhao; William B. Haskell; Vincent Y. F. Tan

We revisit the stochastic limited-memory BFGS (L-BFGS) algorithm. By proposing a new framework for analyzing convergence, we theoretically improve the (linear) convergence rates and computational complexities of the stochastic L-BFGS algorithms in previous works. In addition, we propose several practical acceleration strategies to speed up the empirical performance of such algorithms. We also provide theoretical analyses for most of the strategies. Experiments on large-scale logistic and ridge regression problems demonstrate that our proposed strategies yield significant improvements via-`a-vis competing state-of-the-art algorithms.

ID: 72

Stochastic Segmentation Trees for Multiple Ground Truths

Jake Snell; Richard S. Zemel

A strong machine learning system should aim to produce the full range of valid interpretations rather than a single mode in tasks involving inherent ambiguity. This is particularly true for image segmentation, in which there are many sensible ways to partition an image into regions. We formulate a tree-structured probabilistic model, the stochastic segmentation tree, that represents a distribution over segmentations of a given image. We train this model by optimizing a novel objective that quantifies the degree of match between statistics of the model and ground truth segmentations. Our method allows learning of both the parameters in the tree and the structure itself. We demonstrate on two datasets, including the challenging Berkeley Segmentation Dataset, that our model is able to successfully capture the range of ground truths and to produce novel plausible segmentations beyond those found in the data.

ID: 229

Structure Learning of Linear Gaussian Structural Equation Models with Weak Edges

Marco Eigenmann; Preetam Nandy; Marloes Maathuis

We consider structure learning of linear Gaussian structural equation models with weak edges. Since the presence of weak edges can lead to a loss of edge orientations in the true underlying CPDAG, we define a new graphical object that can contain more edge orientations. We show that this object can be recovered from observational data under a type of strong faithfulness assumption. We present a new algorithm for this purpose, called aggregated greedy equivalence search (AGES), that aggregates the solution path of the greedy equivalence search (GES) algorithm for varying values of the penalty parameter. We prove consistency of AGES and demonstrate its performance in a simulation study and on single cell data from Sachs et al. (2005).

ID: 45

Submodular Variational Inference for Network Reconstruction

Lin Chen; Forrest W. Crawford; Amin Karbasi

In real-world and online social networks, individuals receive and transmit information in real time. Cascading information transmissions (e.g. phone calls, text messages, social media posts) may be understood as a realization of a diffusion process operating on the network, and its branching path can be represented by a directed tree. The process only traverses and thus reveals a limited portion of the edges. The network reconstruction/inference problem is to infer the unrevealed connections. Most existing approaches derive a likelihood and attempt to find the network topology maximizing the likelihood, a problem that is highly intractable. In this paper, we focus on the network reconstruction problem for a broad class of real-world diffusion processes, exemplified by a network diffusion scheme called respondent-driven sampling (RDS). We prove that under realistic and general models of network diffusion, the posterior distribution of an observed RDS realization is a Bayesian log-submodular model.We then propose VINE (Variational Inference for Network rEconstruction), a novel, accurate, and computationally efficient variational inference algorithm, for the network reconstruction problem under this model. Crucially, we do not assume any particular probabilistic model for the underlying network. VINE recovers any connected graph with high accuracy as shown by our experimental results on real-life networks.

ID: 106

Supervised Restricted Boltzmann Machines

Tu Dinh Nguyen; Dinh Phung; Viet Huynh; Trung Le

Restricted Boltzmann Machines (RBMs) form an important class of generative deep learning models. They are fast, versatile and flexible to learn features or to be used as the building blocks to construct more expressive probabilistic deep networks. They have been successfully applied mostly for unsupervised learning. For supervised learning, RBMs have been typically employed in two separate steps: first to either extract feature or pretrain models whose outputs are then fed to a separate classifier or regressor. Arguably, separating the feature learning phase from the discriminative training phase is suboptimal. And it is desirable to leverage on the strength of RBMs in learning the data representation together with the supervised learning task in a single probabilistic framework. To this end, we propose the supervised Restricted Boltzmann machine (sRBM), a unified framework which combines the versatility of RBM to simultaneously learn the data representation and to perform supervised learning (i.e., a nonlinear classifier or a nonlinear regressor). Unlike the current state-of-the-art classification formulation proposed for RBM in (Larochelle et al., 2012), our model is a hybrid probabilistic graphical model consisting of a distinguished generative component for data representation and a discriminative component for prediction. While the work of (Larochelle et al., 2012) typically incurs no extra difficulty in inference compared with a standard RBM, our discriminative component, modeled as a directed graphical model, renders MCMC-based inference (e.g., CD or PCD) very slow and unpractical for use. To this end, we further develop scalable variational inference for the proposed sRBM for both classification and regression cases. Extensive experiments on real-world datasets show that our sRBM achieves better predictive performance than baseline methods. At the same time, our proposed framework yields learned representations which are more discriminative, hence interpretable, than those of its counterparts. Besides, our method is probabilistic and capable of generating meaningful data conditioning on specific classes – a topic which is of current great interest in deep learning aiming at data generation.

ID: 134

Synthesis of strategies in influence diagrams

Manuel Luque; Manuel Arias; Francisco Javier Díez

Influence diagrams (IDs) are a powerful tool for representing and solving decision problems under uncertainty. The objective of evaluating an ID is to compute the expected utility and an optimal strategy, which consists of a policy for each decision. Every policy is usually represented as a table containing a column for each decision scenario, i.e., for each configuration of the variables on which it depends. The no-forgetting assumption, which implies that the decision maker always remembers all past observations and decisions, makes the policies grow exponentially with the number of variables in the ID. For human experts it is very difficult to understand the strategy contained in huge policy tables, not only for their size, but also because the vast majority of columns correspond to suboptimal or impossible scenarios and are hence irrelevant. This makes it difficult to extract the rules of action, to debug the model, and to convince the experts that the recommendations of the ID are reasonable. In this paper, we propose a method that presents the strategy in the form of a compact tree. It has been implemented in OpenMarkov, an open-source software tool for probabilistic graphical models. This facility was essential when evaluating an influence diagram for the mediastinal staging of non-small cell lung cancer; the optimal strategy, whose biggest policy table contained more than 15,000 columns, was synthesized into a tree of only 5 leaves.

ID: 87

The Binomial Block Bootstrap Estimator for Evaluating Loss on Dependent Clusters

Matt Barnes; Artur Dubrawski

In this paper, we study the non-IID learning setting where samples exhibit dependency within latent clusters. Our goal is to estimate a learner's loss on new clusters, an extension of the out-of-bag error. Previously developed cross-validation estimators are well suited to the case where the clustering of observed data is known a priori. However, as is often the case in real world problems, we are only given a noisy approximation of this clustering, likely the result of some clustering algorithm. This subtle yet potentially significant issue afflicts domains ranging from image classification to medical diagnostics, where naive cross-validation is an optimistically biased estimator. We present a novel bootstrap technique and corresponding cross-validation method that, somewhat counterintuitively, injects additional dependency to asymptotically recover the loss in the independent setting.

ID: 56

The total belief theorem

Chunlai Zhou; Fabio Cuzzolin

In this paper, motivated by the treatment of conditional constraints in the data association problem, we state and prove the generalisation of the law of total probability to belief functions, as finite random sets. Our results apply to the case in which Dempster's conditioning is employed. We show that the solution to the resulting total belief problem is in general not unique, whereas it is unique when the a-priori belief function is Bayesian. Examples and case studies underpin the theoretical contributions. Finally, our results are compared to previous related work on the generalisation of Jeffrey's rule by Spies and Smets.

ID: 49

Towards Conditional Independence Test for Relational Data

Sanghack Lee; Vasant Honavar

Determining conditional independence (CI) between random variables from sampled data is important for many applications in statistical machine learning. Most existing CI tests assume that the samples are independently and identically distributed (i.i.d.). However, this assumption often does not hold in practice, e.g., in the case of relational data. We define Relational Conditional Independence (RCI), a generalization of CI to the relational setting. We show how, under a set of structural assumptions, we can test for RCI by reducing the task of testing for RCI on non-i.i.d. data to the problem of testing for CI on several data sets each of which consists of i.i.d. samples. We develop Kernel Relational CI test (KRCIT), a nonparametric test as a practical approach to testing for RCI by relaxing the structural assumptions used in our analysis of RCI. We describe results of experiments with synthetic relational data that show the benefits of KRCIT relative to traditional CI tests that don't account for the non-i.i.d. nature of relational data.

ID: 136

Triply Stochastic Gradients on Multiple Kernel Learning

Xiang Li; Bin Gu; Shuang Ao; Huaimin Wang; Charles X. Ling

Multiple Kernel Learning (MKL) is highly useful for learning complex data with multiple cues or representations. However, MKL is known to have poor scalability because of the expensive kernel computation. Dai et al. (2014) proposed to use a doubly Stochastic Gradient Descent algorithm (doubly SGD) to greatly improve the scalability of kernel methods. However, the algorithm is not suitable for MKL because it cannot learn the kernel weights. In this paper, we provide a novel extension to doubly SGD for MKL so that both the decision functions and the kernel weights can be learned simultaneously. To achieve this, we develop the triply Stochastic Gradient Descent (triply SGD) algorithm which involves three sources of randomness -- the data points, the random features, and the kernels, which was not considered in previous work. We prove that our algorithm enjoys similar convergence rate as that of doubly SGD. Comparing to several traditional MKL solutions, we show that our method has faster convergence speed and achieved better accuracy. Most importantly, our method makes it possible to learn MKL problems with millions of data points on a normal desktop PC.

ID: 162

Value Directed Exploration in Multi-Armed Bandits with Structured Priors

Bence Cserna; Marek Petrik; Reazul Hasan Russel; Wheeler Ruml

Multi-armed bandits are a quintessential machine learning problem requiring the balancing of exploration and exploitation. While there has been progress in developing algorithms with strong theoretical guarantees, there has been less focus on practical near-optimal finite-time performance. In this paper, we propose an algorithm for Bayesian multi-armed bandits that utilizes value-function-driven online planning techniques. Building on previous work on UCB and Gittins index, we introduce linearly-separable value functions that take both the expected return and the benefit of exploration into consideration to perform n-step lookahead. The algorithm enjoys a sub-linear performance guarantee and we present simulation results that confirm its strength in problems with structured priors. The simplicity and generality of our approach makes it a strong candidate for analyzing more complex multi-armed bandit problems.

ID: 132

Weighted Model Counting With Function Symbols

Vaishak Belle

Probabilistic relational languages lift the syntax of relational logic for the specification of large probabilistic graphical models, often yielding concise descriptions for many interacting random variables over classes, hierarchies and constraints. However, much of this work has been limited to an essentially propositional setting: the logical model is understood in terms of ground formulas over a fixed and finite domain; no infinite domains, and certainly no function symbols (other than constants). On the one hand, this is not surprising, because such features are very problematic from a decidability viewpoint, but on the other, they turn out to be very attractive from the point of view of machine learning applications when there is uncertainty about the existence and identity of objects. In this paper, we reconsider the problem of of probabilistic reasoning in a logical language with function symbols, and establish some key results that permit effective algorithms.

ID: 263

Why Rules are Complex: Real-Valued Probabilistic Logic Programs are not Fully Expressive

David Buchman; David Poole

This paper explores what can and cannot be represented by probabilistic logic programs. Propositional logic programs can represent any distribution, because they can be acyclic. For relational domains with fixed populations, the probabilistic parameters can be derived as the solutions to polynomial equations. Unfortunately, sometimes they only have complex-valued solutions. This proves that probabilistic logic programs, even with arbitrarily real-valued parameters, cannot represent all distributions. Moreover, they cannot approximate all distributions. Allowing the parameters to be complex numbers, we present a natural truly-cyclic canonical representation that with probability 1 can represent all distributions for relational domains with fixed populations, and, unlike standard representations, has no redundant parameters. Our results provide strong negative results on what probabilistic logic programs can represent using real numbers.

Golden Sponsor

Golden Sponsor

Golden Sponsor

Bronze Sponsor

Startup Sponsor

Media Sponsor