Sunday, August 4, 1996

Usama Fayyad and Eric Horvitz,
Microsoft Research

Usama Fayyad

Microsoft Research

One Microsoft Way

Redmond, WA 98052, USA

fayyad@microsoft.com

Gregory Piatetsky-Shapiro

GTE Laboratories, MS 44

Waltham, MA 02154, USA

gps@gte.com

Padhraic Smyth

Information and Computer Science

University of California

Irvine, CA 92717-3425, USA

smyth@ics.uci.edu

This paper presents a first step towards a unifying framework for Knowledge Discovery in Databases. We describe links between data mining, knowledge discovery, and other related fields. We then define the KDD process and basic data mining algorithms, discuss application issues and conclude with an analysis of challenges facing practitioners in the field.

Fayyad, U.M., Piatetsky-Shapiro, G. and Smyth, P., Knowledge Discovery and Data Mining: Toward a Unifying Framework, *Proceedings of the Second International Conference on Knowledge Discovery and Data Mining*, AAAI Press: Menlo Park, CA. August 1996.

This paper is available in PostScript format.

David M. Chickering

Microsoft Research

e-mail: dmax@microsoft.com

David Heckerman

Microsoft Research

e-mail: heckerma@microsoft.com

We discuss Bayesian methods for learning Bayesian networks when data sets are incomplete. In particular, we examine asymptotic approximations for the marginal likelihood of incomplete data given a Bayesian network. We consider the Laplace approximation and the less accurate but more efficient BIC/MDL approximation. We also consider approximations proposed by Draper (1993) and by Cheeseman and Stutz (1995). These approximations are as efficient as BIC/MDL, but their accuracy has not been studied in any depth. We compare the accuracy of these approximations under the assumption that the Laplace approximation is the most accurate. In experiments using synthetic data generated from discrete naive-Bayes models having a hidden root node, we find that the Cheeseman-Stutz (CS) measure is the most accurate.
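As a toy illustration of the BIC/MDL idea (an editorial sketch, not code from the paper), one can compare the BIC approximation against the exact log marginal likelihood for the simplest possible model: a single Bernoulli parameter with a uniform Beta(1,1) prior, where the exact marginal is available in closed form. The sample counts below are arbitrary.

```python
import math

def log_lik(heads, n, p):
    # Log-likelihood of n Bernoulli observations at parameter p
    tails = n - heads
    return heads * math.log(p) + tails * math.log(1 - p)

def bic_log_marginal(heads, n):
    # BIC/MDL approximation: log p(D | theta_hat) - (d/2) log N,
    # with d = 1 free parameter for a Bernoulli model
    p_hat = heads / n
    return log_lik(heads, n, p_hat) - 0.5 * math.log(n)

def exact_log_marginal(heads, n):
    # Exact log marginal likelihood under a uniform Beta(1,1) prior:
    # p(D) = heads! * tails! / (n + 1)!
    tails = n - heads
    return (math.lgamma(heads + 1) + math.lgamma(tails + 1)
            - math.lgamma(n + 2))

n, heads = 1000, 630
print(bic_log_marginal(heads, n))   # approximation
print(exact_log_marginal(heads, n)) # exact; close to BIC for large n
```

For this one-parameter model the two values agree to within a constant of order one, which is the sense in which BIC is asymptotically correct but less accurate than Laplace.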

Chickering, D.M. and Heckerman, D., Efficient Approximations for the Marginal Likelihood of Incomplete Data Given a Bayesian Network, in E. Horvitz and F. Jensen, eds., *Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence*, Morgan Kaufmann Publishers: San Francisco, CA. August 1996, pp. 158-168.

This paper is available in PostScript format.

Padhraic Smyth

Information and Computer Science

University of California

Irvine, CA 92717-3425, USA

smyth@ics.uci.edu

Finding the "right" number of clusters, $k$, for a data set is a difficult, and often ill-posed, problem. In a probabilistic clustering context, likelihood-ratios, penalized likelihoods, and Bayesian techniques are among the more popular techniques. In this paper a new cross-validated likelihood criterion is investigated for determining cluster structure. A practical clustering algorithm based on *Monte Carlo cross-validation* (MCCV) is introduced. The algorithm permits the data analyst to judge if there is strong evidence for a particular $k$, or perhaps weaker evidence over a sub-range of $k$ values. Experimental results with Gaussian mixtures on real and simulated data suggest that MCCV provides genuine insight into cluster structure. v-fold cross-validation appears inferior to the penalized likelihood method (BIC), a Bayesian algorithm (AutoClass v2.0), and the new MCCV algorithm. Overall, MCCV and AutoClass appear the most reliable of the methods. MCCV provides the data-miner with a useful data-driven tool which complements the fully Bayesian approach: in particular, the method is conceptually simpler to adapt to more complex clustering models beyond Gaussian mixtures.
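The MCCV procedure can be sketched as follows (an illustrative reconstruction, not the authors' code; the one-dimensional EM routine, split fraction, iteration counts, and seeds are arbitrary choices): repeatedly split the data at random, fit a mixture for each candidate $k$ on the training half, and average the held-out log-likelihood over splits.

```python
import math, random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_mixture(data, k, iters=60, seed=0):
    # Basic EM for a one-dimensional k-component Gaussian mixture.
    rng = random.Random(seed)
    mus = rng.sample(data, k)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: component responsibilities for each point
        resp = []
        for x in data:
            probs = [w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, mus, variances)]
            s = sum(probs) or 1e-300
            resp.append([p / s for p in probs])
        # M-step: re-estimate weights, means, variances (with a variance floor)
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-9)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(sum(r[j] * (x - mus[j]) ** 2
                                   for r, x in zip(resp, data)) / nj, 1e-3)
            weights[j] = nj / len(data)
    return weights, mus, variances

def log_lik(data, params):
    weights, mus, variances = params
    return sum(math.log(sum(w * normal_pdf(x, m, v)
                            for w, m, v in zip(weights, mus, variances)) + 1e-300)
               for x in data)

def mccv_choose_k(data, k_values, n_splits=10, test_frac=0.5, seed=1):
    # Monte Carlo cross-validation: average held-out log-likelihood
    # over repeated random train/test splits, for each candidate k.
    rng = random.Random(seed)
    scores = {k: 0.0 for k in k_values}
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * (1 - test_frac))
        train, test = shuffled[:cut], shuffled[cut:]
        for k in k_values:
            scores[k] += log_lik(test, fit_mixture(train, k)) / n_splits
    return max(scores, key=scores.get), scores

rng = random.Random(42)
data = ([rng.gauss(-3, 1) for _ in range(150)]
        + [rng.gauss(3, 1) for _ in range(150)])
best_k, scores = mccv_choose_k(data, [1, 2, 3])
print(best_k, scores)
```

On this well-separated two-cluster sample, the held-out score for $k=1$ is clearly worst, while the gap between $k=2$ and $k=3$ reflects the "strong versus weak evidence" distinction the abstract mentions.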

Smyth, P., Clustering using Monte Carlo Cross-Validation, *Proceedings of the Second International Conference on Knowledge Discovery and Data Mining*, AAAI Press: Menlo Park, CA. August 1996.

This paper is available in PostScript format.

David M. Chickering

Microsoft Research

e-mail: dmax@microsoft.com

Approaches to learning Bayesian networks from data typically combine a scoring function with a heuristic search procedure. Given a Bayesian network structure, many of the scoring functions derived in the literature return a score for the entire equivalence class to which the structure belongs. When using such a scoring function, it is appropriate for the heuristic search algorithm to search over equivalence classes of Bayesian networks as opposed to individual structures. We present the general formulation of a search space for which the states of the search correspond to equivalence classes of structures. Using this space, any one of a number of heuristic search algorithms can easily be applied. We compare greedy search performance in the proposed search space to greedy search performance in a search space for which the states correspond to individual Bayesian network structures.
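The equivalence classes referred to above rest on the standard characterization (Verma and Pearl) that two DAGs are Markov-equivalent exactly when they share the same skeleton and the same v-structures. A minimal sketch of that test (an editorial illustration, not the paper's search code; DAGs are represented as hypothetical parent-set dictionaries):

```python
def skeleton(dag):
    # dag: dict mapping node -> set of parent nodes
    edges = set()
    for child, parents in dag.items():
        for p in parents:
            edges.add(frozenset((p, child)))
    return edges

def v_structures(dag):
    # Colliders a -> c <- b where a and b are non-adjacent
    skel = skeleton(dag)
    vs = set()
    for child, parents in dag.items():
        ps = sorted(parents)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                a, b = ps[i], ps[j]
                if frozenset((a, b)) not in skel:
                    vs.add((a, child, b))
    return vs

def markov_equivalent(d1, d2):
    # Verma & Pearl: same skeleton and same v-structures
    return skeleton(d1) == skeleton(d2) and v_structures(d1) == v_structures(d2)

chain    = {"X": set(), "Y": {"X"}, "Z": {"Y"}}       # X -> Y -> Z
fork     = {"X": {"Y"}, "Y": set(), "Z": {"Y"}}       # X <- Y -> Z
collider = {"X": set(), "Z": set(), "Y": {"X", "Z"}}  # X -> Y <- Z
print(markov_equivalent(chain, fork))      # True: same score, same class
print(markov_equivalent(chain, collider))  # False: different v-structures
```

A scoring function that assigns one score per class would give `chain` and `fork` identical scores, which is why searching over classes rather than individual DAGs avoids redundant states.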

Chickering, D.M., Learning Equivalence Classes of Bayesian Network
Structures, in E. Horvitz and F. Jensen, eds., *Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence*,
Morgan Kaufmann Publishers: San Francisco, CA. August 1996, pp. 150-157.

This paper is available in PostScript format.

Nir Friedman

Dept. of Computer Science

Stanford University

Gates Building, 1A

Stanford, CA 94305-9010, USA

e-mail: nir@cs.stanford.edu

Phone: (415) 725-8782

FAX : (415) 725-1449

Moises Goldszmidt

SRI International

333 Ravenswood Way, EK329

Menlo Park, CA 94025, USA

e-mail: moises@erg.sri.com

Phone: (415) 859-4319

FAX : (415) 859-4812

In this paper we examine a novel addition to the known methods for learning Bayesian networks from data that improves the quality of the learned networks. Our approach explicitly represents and learns the *local structure* in the *conditional probability tables* (CPTs) that quantify these networks. This increases the space of possible models, enabling the representation of CPTs with a variable number of parameters that depends on the learned local structures. The resulting learning procedure is capable of inducing models that better emulate the real complexity of the interactions present in the data. We describe the theoretical foundations and practical aspects of learning local structures, as well as an empirical evaluation of the proposed method. This evaluation indicates that learning curves characterizing the procedure that exploits the local structure converge faster than those of the standard procedure. Our results also show that networks learned with local structure tend to be more complex (in terms of arcs), yet require fewer parameters.
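The parameter savings from local structure can be seen in a small example (an editorial sketch with invented numbers, not the paper's representation): for a hypothetical binary child with three binary parents, a tree-structured CPT that encodes context-specific independence needs fewer leaves than the full table has rows.

```python
# Hypothetical binary child with binary parents A, B, C.
# A full (tabular) CPT stores one parameter per parent configuration:
full_cpt_params = 2 ** 3  # 8

# A tree-structured CPT exploiting context-specific independence:
# when A = 1, the child's distribution does not depend on B or C,
# so a single leaf covers four parent configurations.
tree_cpt = {
    ("A=1",): 0.9,
    ("A=0", "B=0"): 0.2,
    ("A=0", "B=1", "C=0"): 0.5,
    ("A=0", "B=1", "C=1"): 0.7,
}
tree_cpt_params = len(tree_cpt)  # 4 leaves instead of 8 rows

def p_child_true(a, b, c):
    # Walk the tree: split on A first, then B, then C.
    if a == 1:
        return tree_cpt[("A=1",)]
    if b == 0:
        return tree_cpt[("A=0", "B=0")]
    return tree_cpt[("A=0", "B=1", "C=%d" % c)]

print(full_cpt_params, tree_cpt_params, p_child_true(1, 1, 0))
```

This is the sense in which a learned network can have more arcs yet fewer parameters: each extra parent doubles the full table but may add only one or two tree leaves.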

Friedman, N., Goldszmidt, M., Learning Bayesian Networks with
Local Structure, in E. Horvitz and F. Jensen, eds., *Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence*,
Morgan Kaufmann Publishers: San Francisco, CA. August 1996, pp. 252-262.

This paper is available in PostScript format.

Ron Musick

Lawrence Livermore National Laboratory

P.O. Box 808, L-306

Livermore, CA 94551

rmusick@llnl.gov

Belief networks are a powerful tool for knowledge discovery that provide concise, understandable probabilistic models of data. There are methods grounded in probability theory to incrementally update the relationships described by the belief network when new information is seen, to perform complex inferences over any set of variables in the data, to incorporate domain expertise and prior knowledge into the model, and to automatically learn the model from data. This paper concentrates on part of the belief network induction problem, that of learning the quantitative structure (the conditional probabilities), given the qualitative structure. In particular, the current practice of rote learning the probabilities in belief networks can be significantly improved upon. We advance the idea of applying any learning algorithm to the task of conditional probability learning in belief networks, discuss potential benefits, and show results of applying neural networks and other algorithms to a medium-sized car insurance belief network. The results demonstrate from 10 to 100% improvements in model error rates over the current approaches.
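The contrast with rote (relative-frequency) learning can be made concrete in a small simulation (an editorial sketch, not the paper's experiment; logistic regression stands in for the neural networks used in the paper, and the ground-truth parameters, sample size, and learning rate below are invented): both methods estimate P(child | parents), but the learned model shares parameters across parent configurations instead of estimating each configuration independently.

```python
import math, random

rng = random.Random(7)
TRUE_W, TRUE_B = [1.5, -2.0, 1.0, 0.5], -0.5  # hypothetical ground truth

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def true_prob(x):
    return sigmoid(sum(w * xi for w, xi in zip(TRUE_W, x)) + TRUE_B)

# Simulate a small data set of (parent configuration, child value) pairs.
data = []
for _ in range(200):
    x = tuple(rng.randint(0, 1) for _ in range(4))
    data.append((x, 1 if rng.random() < true_prob(x) else 0))

# Rote learning: an independent relative-frequency estimate per configuration.
counts = {}
for x, y in data:
    ones, total = counts.get(x, (0, 0))
    counts[x] = (ones + y, total + 1)

def rote_prob(x):
    ones, total = counts.get(x, (0, 0))
    return (ones + 1) / (total + 2)  # Laplace smoothing for unseen configs

# Learned model: logistic regression fit by batch gradient descent; its
# five parameters are shared across all sixteen parent configurations.
w, bias = [0.0] * 4, 0.0
for _ in range(3000):
    gw, gb = [0.0] * 4, 0.0
    for x, y in data:
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bias) - y
        for i in range(4):
            gw[i] += err * x[i]
        gb += err
    w = [wi - 0.5 * gi / len(data) for wi, gi in zip(w, gw)]
    bias -= 0.5 * gb / len(data)

def model_prob(x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bias)

configs = [(a, b, c, d) for a in (0, 1) for b in (0, 1)
           for c in (0, 1) for d in (0, 1)]
mse_rote = sum((rote_prob(x) - true_prob(x)) ** 2 for x in configs) / 16
mse_model = sum((model_prob(x) - true_prob(x)) ** 2 for x in configs) / 16
print(mse_rote, mse_model)
```

Because each configuration gets only a dozen or so samples, the per-configuration counts are noisy, while the shared-parameter model pools evidence across configurations.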

Musick, R., Rethinking the Learning of Belief Network Probabilities, *Proceedings of the Second International Conference on Knowledge Discovery and Data Mining*, AAAI Press: Menlo Park, CA. August 1996.

This paper is not yet available online.

Kathryn Blackmond Laskey

Department of Systems Engineering

George Mason University

Fairfax, VA 22030-4444 USA

e-mail: klaskey@gmu.edu

Phone: (703) 993-1644

FAX: (703) 993-1706

Laura Martignon

Max Planck Institute für Psychologische Forschung

Center for Adaptive Behavior and Cognition

80802 München, Germany

e-mail: laura@mpipf-muenchen.mpg.de

Phone: 49-89-38-602-253

FAX: 49-89-38-602-252

This paper presents a Bayesian approach to learning the connectivity structure of a group of neurons from data on configuration frequencies. A major objective of the research is to provide statistical tools for detecting changes in firing patterns with changing stimuli. Our framework is not restricted to the well-understood case of pair interactions, but generalizes the Boltzmann machine model to allow for higher order interactions. The paper applies a Markov Chain Monte Carlo Model Composition (MC3) algorithm to search over connectivity structures and uses Laplace's method to approximate posterior probabilities of structures. Performance of the methods was tested on synthetic data. The models were also applied to data obtained by Vaadia on multi-unit recordings of several neurons in the visual cortex of a rhesus monkey in two different attentional states. Results confirmed the experimenters' conjecture that different attentional states were associated with different interaction structures.
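The MC3 idea, a Metropolis walk over candidate interaction structures, can be sketched on a toy posterior (an editorial illustration, not the authors' code: the score function below is invented and merely stands in for the Laplace-approximated log posterior over connectivity structures; the candidate pairs and "true" structure are arbitrary).

```python
import math, random

# Candidate pairwise interactions among 4 units, and an assumed true set.
CANDIDATES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
TRUE = {(0, 1), (2, 3)}

def log_posterior(structure):
    # Toy stand-in for the Laplace-approximated log p(structure | data):
    # reward true interactions, penalize spurious ones.
    hits = len(structure & TRUE)
    misses = len(structure - TRUE)
    return 4.0 * hits - 3.0 * misses

def mc3(n_steps=5000, seed=3):
    # MC3: a Metropolis walk over structures, proposing to toggle one
    # candidate interaction at a time and accepting with the usual ratio.
    rng = random.Random(seed)
    current = set()
    score = log_posterior(current)
    visits = {}
    for _ in range(n_steps):
        edge = rng.choice(CANDIDATES)
        proposal = current ^ {edge}  # toggle one interaction
        new_score = log_posterior(proposal)
        if math.log(rng.random()) < new_score - score:
            current, score = proposal, new_score
        key = frozenset(current)
        visits[key] = visits.get(key, 0) + 1
    return max(visits, key=visits.get), visits

mode, visits = mc3()
print(sorted(mode))
```

Visit frequencies approximate posterior probabilities of structures, so the chain spends most of its time on high-posterior connectivity patterns rather than committing to a single best model.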

Laskey, K., Martignon, L., Bayesian Learning of Loglinear Models for Neural Connectivity, in E. Horvitz and F. Jensen, eds., *Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence*, Morgan Kaufmann Publishers: San Francisco, CA. August 1996, pp. 373-380.

This paper is available in PostScript format.

Paul E. Stolorz

Jet Propulsion Laboratory

California Institute of Technology

Pasadena, CA 91109

pauls@aig.jpl.nasa.gov

Philip C. Chew

University of Pennsylvania

Philadelphia PA

chew@upenn5.hep.upenn.edu

The Monte Carlo method is recognized as a useful tool in learning and probabilistic inference methods common to many data mining problems. Generalized Hidden Markov Models and Bayes nets are especially popular applications. However, the presence of multiple modes in many relevant integrands and summands often renders the method slow and cumbersome. Recent mean field alternatives designed to speed things up have been inspired by experience gleaned from physics. The current work adopts an approach very similar to this in spirit, but focuses instead upon dynamic programming notions as a basis for producing systematic Monte Carlo improvements. The idea is to approximate a given model by a dynamic programming-style decomposition, which then forms a scaffold upon which to build successively more accurate Monte Carlo approximations. Dynamic programming ideas alone fail to account for non-local structure, while standard Monte Carlo methods essentially ignore all structure. However, suitably-crafted hybrids can successfully exploit the strengths of each method, resulting in algorithms that combine speed with accuracy. The approach relies on the presence of significant "local" information in the problem at hand. This turns out to be a plausible assumption for many important applications. Example calculations are presented, and the overall strengths and weaknesses of the approach are discussed.
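The hybrid can be illustrated on a toy chain model (an editorial sketch, not the authors' algorithm; the chain length, couplings, and sample counts are invented): dynamic programming handles the nearest-neighbour couplings exactly via backward messages, and Monte Carlo corrects for a non-local end-to-end term by importance weighting samples drawn from the DP scaffold.

```python
import math, random, itertools

N = 8    # binary chain of length 8
J = 0.8  # local (nearest-neighbour) coupling: handled exactly by DP
C = 1.2  # non-local end-to-end coupling: handled by Monte Carlo

def local_energy(x):
    return J * sum(1.0 if x[i] == x[i + 1] else -1.0 for i in range(N - 1))

def nonlocal_energy(x):
    return C * (1.0 if x[0] == x[-1] else -1.0)

def exact_Z():
    # Brute-force partition function, feasible only for tiny chains.
    return sum(math.exp(local_energy(x) + nonlocal_energy(x))
               for x in itertools.product((0, 1), repeat=N))

def backward_messages():
    # DP scaffold: m[i][s] sums exp(local couplings) over x_{i+1..N-1}
    # given x_i = s.
    m = [[1.0, 1.0] for _ in range(N)]
    for i in range(N - 2, -1, -1):
        for s in (0, 1):
            m[i][s] = sum(math.exp(J * (1.0 if s == t else -1.0)) * m[i + 1][t]
                          for t in (0, 1))
    return m

def sample_local(rng, m):
    # Exact forward sampling from the local chain model using the messages.
    x = [1 if rng.random() < m[0][1] / (m[0][0] + m[0][1]) else 0]
    for i in range(1, N):
        w = [math.exp(J * (1.0 if x[-1] == t else -1.0)) * m[i][t]
             for t in (0, 1)]
        x.append(1 if rng.random() < w[1] / (w[0] + w[1]) else 0)
    return x

def hybrid_Z(n_samples=4000, seed=5):
    # Importance sampling: proposal = exact local-chain distribution (DP);
    # weight = exp(non-local energy), so Z = Z_local * E_local[weight].
    rng = random.Random(seed)
    m = backward_messages()
    z_local = m[0][0] + m[0][1]
    mean_w = sum(math.exp(nonlocal_energy(sample_local(rng, m)))
                 for _ in range(n_samples)) / n_samples
    return z_local * mean_w

print(exact_Z(), hybrid_Z())
```

Because the proposal already captures the local couplings exactly, the importance weights depend only on the single non-local term and have low variance, which is the speed-with-accuracy trade the abstract describes.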

Stolorz, P.E. and Chew, P.C., Harnessing Graphical Structure in Markov Chain Monte Carlo Learning, *Proceedings of the Second International Conference on Knowledge Discovery and Data Mining*, AAAI Press: Menlo Park, CA. August 1996.

This paper is not yet available online.