772s Data.mining.concepts.and.Techniques.2nd.ed

Ralf Guting and Markus Schneider

Joe Celko's SQL Programming Style

Joe Celko

Data Mining: Practical Machine Learning Tools and Technique

Ian Witten and Eibe Frank

Fuzzy Modeling and Genetic Algorithms for Data Mining and

Earl Cox

Data Modeling Essentials, Third Edition

Graeme C. Simsion and Graham C. Witt

Location-Based Services

Jochen Schiller and Agnes Voisard

Database Modeling with Microsoft Visio for Enterprise Archite

Terry Halpin, Ken Evans, Patrick Hallock, Bill Maclean

Designing Data-Intensive Web Applications

Stefano Ceri, Piero Fraternali, Aldo Bongio, Marco Brambill

Mining the Web: Discovering Knowledge from Hypertext Data

Soumen Chakrabarti

Advanced SQL:1999: Understanding Object-Relational and

Jim Melton

Database Tuning: Principles, Experiments, and Troubleshootin

Dennis Shasha and Philippe Bonnet

SQL:1999: Understanding Relational Language Components

Jim Melton and Alan R. Simon

Information Visualization in Data Mining and Knowledge Dis

Edited by Usama Fayyad, Georges G. Grinstein, and Andreas

Transactional Information Systems: Theory, Algorithms, and P Control and Recovery

Gerhard Weikum and Gottfried Vossen

Spatial Databases: With Application to GIS

Philippe Rigaux, Michel Scholl, and Agnes Voisard

Information Modeling and Relational Databases: From Concep

Terry Halpin

Component Database Systems

Edited by Klaus R. Dittrich and Andreas Geppert

Joe Celko's SQL for Smarties: Advanced SQL Program

Joe Celko

Joe Celko's Data and Databases: Concepts in Practice

Joe Celko

Developing Time-Oriented Database Applications in S

Richard T. Snodgrass

Web Farming for the Data Warehouse

Richard D. Hackathorn

Management of Heterogeneous and Autonomous Data

Edited by Ahmed Elmagarmid, Marek Rusinkiewicz,

Object-Relational DBMSs: Tracking the Next Great W

Michael Stonebraker and Paul Brown, with Dorothy

A Complete Guide to DB2 Universal Database

Don Chamberlin

Universal Database Management: A Guide to Object/

Cynthia Maro Saracco

Readings in Database Systems, Third Edition

Edited by Michael Stonebraker and Joseph M. Heller

Understanding SQL's Stored Procedures: A Complete

Jim Melton

Principles of Multimedia Database Systems

V. S. Subrahmanian

Principles of Database Query Processing for Advanced

Clement T. Yu and Weiyi Meng

Advanced Database Systems

Carlo Zaniolo, Stefano Ceri, Christos Faloutsos, Rich V. S. Subrahmanian, and Roberto Zicari

Principles of Transaction Processing

Philip A. Bernstein and Eric Newcomer

Using the New DB2: IBM's Object-Relational Databas

Don Chamberlin

Distributed Algorithms

Nancy A. Lynch

A Guide to Developing Client/Server SQL Applications

Setrag Khoshafian, Arvola Chan, Anna Wong, and Harry K.

The Benchmark Handbook for Database and Transaction Proce

Edited by Jim Gray

Camelot and Avalon: A Distributed Transaction Facility

Edited by Jeffrey L. Eppinger, Lily B. Mummert, and Alfred Z

Readings in Object-Oriented Database Systems

Edited by Stanley B. Zdonik and David Maier

University of I

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Designations used by companies to distinguish their product registered trademarks. In all instances in which Morgan Kauf the product names appear in initial capital or all capital letter the appropriate companies for more complete information r registration.

No part of this publication may be reproduced, stored in a re form or by any means, electronic, mechanical, photocopyin prior written permission of the publisher.

Permissions may be sought directly from Elsevier's Science & Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853 [email protected]. You may also complete your reque (http://elsevier.com) by selecting Customer Support and th

Library of Congress Cataloging-in-Publication Data

Application submitted

ISBN 13: 978-1-55860-901-3

ISBN 10: 1-55860-901-6

For information on all Morgan Kaufmann publications, visit www.mkp.com or www.books.elsevier.com

Printed in the United States of America 06 07 08 09 10 5 4 3 2 1

To Erik, Kevan, Kian,

Preface

Chapter 1 Introduction

1.1 What Motivated Data Mining?

1.2 So, What Is Data Mining?

1.3 Data Mining: On What Kind of

1.3.1 Relational Databases

1.3.2 Data Warehouses

1.3.3 Transactional Databases

1.3.4 Advanced Data and Informatio Applications

1.4 Data Mining Functionalities: W Mined?

1.4.1 Concept/Class Description: Ch Discrimination

1.4.2 Mining Frequent Patterns, Asso

1.4.3 Classification and Prediction

1.4.4 Cluster Analysis

1.4.5 Outlier Analysis

1.4.6 Evolution Analysis

1.5 Are All of the Patterns Interesti

1.6 Classification of Data Mining Sy

1.7 Data Mining Task Primitives

1.8 Integration of a Data Mining Sy a Database or Data Warehouse

1.9 Major Issues in Data Mining

2.3.2 Noisy Data

2.3.3 Data Cleaning as a Process

2.4 Data Integration and Transformation

2.4.1 Data Integration

2.4.2 Data Transformation

2.5 Data Reduction

2.5.1 Data Cube Aggregation

2.5.2 Attribute Subset Selection

2.5.3 Dimensionality Reduction

2.5.4 Numerosity Reduction

2.6 Data Discretization and Concept Hier

2.6.1 Discretization and Concept Hierarchy Numerical Data

2.6.2 Concept Hierarchy Generation for Ca

2.7 Summary

Exercises

Bibliographic Notes

Chapter 3 Data Warehouse and OLAP Technology: An

3.1 What Is a Data Warehouse?

3.1.1 Differences between Operational Dat and Data Warehouses

3.1.2 But, Why Have a Separate Data Ware

3.2 A Multidimensional Data Model

3.2.1 From Tables and Spreadsheets to Data

3.2.2 Stars, Snowflakes, and Fact Constellati Schemas for Multidimensional Databa

3.2.3 Examples for Defining Star, Snowflake, and Fact Constellation Schemas

3.4 Data Warehouse Implementati

3.4.1 Efficient Computation of Data

3.4.2 Indexing OLAP Data

3.4.3 Efficient Processing of OLAP Q

3.5 From Data Warehousing to Dat

3.5.1 Data Warehouse Usage

3.5.2 From On-Line Analytical Proce to On-Line Analytical Mining

3.6 Summary

Exercises

Bibliographic Notes

Chapter 4 Data Cube Computation and Data G

4.1 Efficient Methods for Data Cub

4.1.1 A Road Map for the Materializ of Cubes

4.1.2 Multiway Array Aggregation fo

4.1.3 BUC: Computing Iceberg Cub Downward

4.1.4 Star-cubing: Computing Iceber a Dynamic Star-tree Structure

4.1.5 Precomputing Shell Fragments OLAP

4.1.6 Computing Cubes with Comp

4.2 Further Development of Data C Technology

4.2.1 Discovery-Driven Exploration

4.2.2 Complex Aggregation at Multi Multifeature Cubes

4.2.3 Constrained Gradient Analysis

Chapter 5 Mining Frequent Patterns, Associations, and

5.1 Basic Concepts and a Road Map

5.1.1 Market Basket Analysis: A Motivating E

5.1.2 Frequent Itemsets, Closed Itemsets, an

5.1.3 Frequent Pattern Mining: A Road Map

5.2 Efficient and Scalable Frequent Itemse

5.2.1 The Apriori Algorithm: Finding Freque Candidate Generation

5.2.2 Generating Association Rules from Fre

5.2.3 Improving the Efficiency of Apriori

5.2.4 Mining Frequent Itemsets without Can

5.2.5 Mining Frequent Itemsets Using Vertic

5.2.6 Mining Closed Frequent Itemsets

5.3 Mining Various Kinds of Association R

5.3.1 Mining Multilevel Association Rules

5.3.2 Mining Multidimensional Association R from Relational Databases and Data

5.4 From Association Mining to Correlatio

5.4.1 Strong Rules Are Not Necessarily Inte

5.4.2 From Association Analysis to Correlati

5.5 Constraint-Based Association Mining

5.5.1 Metarule-Guided Mining of Associatio

5.5.2 Constraint Pushing: Mining Guided by

5.6 Summary

Exercises

Bibliographic Notes

6.4.1 Bayes' Theorem

6.4.2 Naïve Bayesian Classification

6.4.3 Bayesian Belief Networks

6.4.4 Training Bayesian Belief Netwo

6.5 Rule-Based Classification

6.5.1 Using IF-THEN Rules for Class

6.5.2 Rule Extraction from a Decisio

6.5.3 Rule Induction Using a Sequen

6.6 Classification by Backpropagati

6.6.1 A Multilayer Feed-Forward Ne

6.6.2 Defining a Network Topology

6.6.3 Backpropagation

6.6.4 Inside the Black Box: Backprop

6.7 Support Vector Machines

6.7.1 The Case When the Data Are

6.7.2 The Case When the Data Are

6.8 Associative Classification: Class Rule Analysis

6.9 Lazy Learners (or Learning fro

6.9.1 k-Nearest-Neighbor Classifier

6.9.2 Case-Based Reasoning

6.10 Other Classification Methods

6.10.1 Genetic Algorithms

6.10.2 Rough Set Approach

6.10.3 Fuzzy Set Approaches

6.11 Prediction

6.11.1 Linear Regression

6.11.2 Nonlinear Regression

6.11.3 Other Regression-Based Meth

6.15.2 ROC Curves

6.16 Summary

Exercises

Bibliographic Notes

Chapter 7 Cluster Analysis

7.1 What Is Cluster Analysis?

7.2 Types of Data in Cluster Analysis

7.2.1 Interval-Scaled Variables

7.2.2 Binary Variables

7.2.3 Categorical, Ordinal, and Ratio-Scaled

7.2.4 Variables of Mixed Types

7.2.5 Vector Objects

7.3 A Categorization of Major Clustering

7.4 Partitioning Methods

7.4.1 Classical Partitioning Methods: k-Mean

7.4.2 Partitioning Methods in Large Databas

k-Medoids to CLARANS

7.5 Hierarchical Methods

7.5.1 Agglomerative and Divisive Hierarchic

7.5.2 BIRCH: Balanced Iterative Reducing an Using Hierarchies

7.5.3 ROCK: A Hierarchical Clustering Algo Categorical Attributes

7.5.4 Chameleon: A Hierarchical Clustering

Using Dynamic Modeling

7.6 Density-Based Methods

7.6.1 DBSCAN: A Density-Based Clustering Connected Regions with Sufficiently H

7.9.1 CLIQUE: A Dimension-Growt

7.9.2 PROCLUS: A Dimension-Redu Method

7.9.3 Frequent Pattern-Based Cluste

7.10 Constraint-Based Cluster Analy

7.10.1 Clustering with Obstacle Obje

7.10.2 User-Constrained Cluster Anal

7.10.3 Semi-Supervised Cluster Analy

7.11 Outlier Analysis

7.11.1 Statistical Distribution-Based O

7.11.2 Distance-Based Outlier Detect

7.11.3 Density-Based Local Outlier D

7.11.4 Deviation-Based Outlier Detec

7.12 Summary

Exercises

Bibliographic Notes

Chapter 8 Mining Stream, Time-Series, and Seq

8.1 Mining Data Streams

8.1.1 Methodologies for Stream Dat Stream Data Systems

8.1.2 Stream OLAP and Stream Dat

8.1.3 Frequent-Pattern Mining in Da

8.1.4 Classification of Dynamic Data

8.1.5 Clustering Evolving Data Strea

8.2 Mining Time-Series Data

8.2.1 Trend Analysis

8.2.2 Similarity Search in Time-Serie

Chapter 9 Graph Mining, Social Network Analysis, and

Data Mining

9.1 Graph Mining

9.1.1 Methods for Mining Frequent Subgrap

9.1.2 Mining Variant and Constrained Substr

9.1.3 Applications: Graph Indexing, Similarit and Clustering

9.2 Social Network Analysis

9.2.1 What Is a Social Network?

9.2.2 Characteristics of Social Networks

9.2.3 Link Mining: Tasks and Challenges

9.2.4 Mining on Social Networks

9.3 Multirelational Data Mining

9.3.1 What Is Multirelational Data Mining?

9.3.2 ILP Approach to Multirelational Classifi

9.3.3 Tuple ID Propagation

9.3.4 Multirelational Classification Using Tup

9.3.5 Multirelational Clustering with User G

9.4 Summary

Exercises

Bibliographic Notes

Chapter 10 Mining Object, Spatial, Multimedia, Text, an

10.1 Multidimensional Analysis and Descrip Data Objects

10.1.1 Generalization of Structured Data

10.1.2 Aggregation and Approximation in Sp Generalization

10.3 Multimedia Data Mining

10.3.1 Similarity Search in Multimedia

10.3.2 Multidimensional Analysis of M

10.3.3 Classification and Prediction A

10.3.4 Mining Associations in Multime

10.3.5 Audio and Video Data Mining

10.4 Text Mining

10.4.1 Text Data Analysis and Inform

10.4.2 Dimensionality Reduction for

10.4.3 Text Mining Approaches

10.5 Mining the World Wide Web

10.5.1 Mining the Web Page Layout S

10.5.2 Mining the Web's Link Structu

Authoritative Web Pages

10.5.3 Mining Multimedia Data on the

10.5.4 Automatic Classification of We

10.5.5 Web Usage Mining

10.6 Summary

Exercises

Bibliographic Notes

Chapter 11 Applications and Trends in Data Mini

11.1 Data Mining Applications

11.1.1 Data Mining for Financial Data

11.1.2 Data Mining for the Retail Indu

11.1.3 Data Mining for the Telecomm

11.1.4 Data Mining for Biological Dat

11.1.5 Data Mining in Other Scientific

11.1.6 Data Mining for Intrusion Dete

11.6 Summary

Exercises

Bibliographic Notes

Appendix An Introduction to Microsoft's OLE D

Data Mining

A.1 Model Creation

A.2 Model Training

A.3 Model Prediction and Browsing

Bibliography

Index

We are deluged by data: scientific data, medica and marketing data. People have no time to l become the precious resource. So, we must find to automatically classify it, to automatically sum characterize trends in it, and to automatically active and exciting areas of the database research ing statistics, visualization, artificial intelligence to this field. The breadth of the field makes it dif over the last few decades.

Six years ago, Jiawei Han's and Micheline K presented Data Mining. It heralded a golden ag of their book reflects that progress; more than h are to recent work. The field has matured with has broadened to include many more datatypes geospatial, audio, images, and video. We are cer indeed research and commercial interest in dat all fortunate to have this modern compendium

The book gives quick introductions to da particular emphasis on data analysis. It then cov cepts and techniques that underlie classification These topics are presented with examples, a to lem class, and with pragmatic rules of thumb a Socratic presentation style is both very readable a lot from reading the first edition and got re-ed edition.

Jiawei Han and Micheline Kamber have be research. This is the text they use with their stu

Contributing factors include the computerizatio transactions; the widespread use of digital cam most commercial products; and advances in da text and image platforms to satellite remote se of the World Wide Web as a global informatio dous amount of data and information. This exp has generated an urgent need for new techniqu gently assist us in transforming the vast amou knowledge.

This book explores the concepts and techn flourishing frontier in data and information syst also popularly referred to as knowledge discover convenient extraction of patterns representing in large databases, data warehouses, the Web, ot data streams.

Data mining is a multidisciplinary field, dra technology, machine learning, statistics, patt neural networks, knowledge-based systems, a computing, and data visualization. We present hidden in large data sets, focusing on issues rela tiveness, and scalability. As a result, this book database systems, machine learning, statistics, vide the background necessary in these areas in hension of their respective roles in data minin introduction to data mining, presented with eff It should be useful for computing science stude professionals, as well as researchers involved in

Data mining emerged during the late 1980s, continues to flourish into the new millennium of the field, introducing interesting data minin

numerous enhancements and a reorganization of the book. In addition, several new chapters are included to mining complex types of data, including stream data, data, social network data, and multirelational data.

The chapters are described briefly as follows, with e Chapter 1 provides an introduction to the multid It discusses the evolutionary path of database techno

for data mining, and the importance of its applicatio to be mined, including relational, transactional, and complex types of data such as data streams, time-serie works, multirelational data, spatiotemporal data, mult data. The chapter presents a general classification of different kinds of knowledge to be mined. In compa new sections are introduced: Section 1.7 is on data users to interactively communicate with data mining mining process, and Section 1.8 discusses the issues r mining system with a database or data warehouse sy sent the condensed materials of Chapter 4, Data M Architectures, in the first edition. Finally, major chall

Chapter 2 introduces techniques for preprocessin corresponds to Chapter 3 of the first edition. Because construction of data warehouses, we address this topi introduction to data warehouses in the subsequent cha ious statistical methods for descriptive data summariz central tendency and dispersion of data. The descripti been enhanced. Methods for data integration and transf discussed, including the use of concept hierarchies for d The automatic generation of concept hierarchies is also

Chapters 3 and 4 provide a solid introduction to d Analytical Processing), and data generalization. The Chapters 2 and 5 of the first edition, but with substan

quent itemset mining are presented in an org Apriori algorithm and its variations to more a ciency, including the frequent-pattern growth vertical data format, and mining closed frequent niques for mining multilevel association rules, quantitative association rules. In comparison w placed greater emphasis on the generation of rules. Strategies for constraint-based mining an focus the rule search are also described.

Chapter 6 describes methods for data classifi tree induction, Bayesian classification, rule-base nique of backpropagation, support vector mac neighbor classifiers, case-based reasoning, genet set approaches. Methods of regression are introd to choose the best classifier or predictor are di sponding chapter in the first edition, the section vector machines are new, and the discussion of and prediction accuracy has been greatly expan

Cluster analysis forms the topic of Chapter 7. are presented, including partitioning method methods, grid-based methods, and model-base introduce techniques for clustering high-dime based cluster analysis. Outlier analysis is also d

Chapters 8 to 10 treat advanced topics in materials on recent progress in this frontier. Th vious single chapter on advanced topics. Cha data, time-series data, and sequence data (cov biological sequences). The basic data mining te ing, classification, clustering, and constraint-ba of data. Chapter 9 discusses methods for grap network analysis and multirelational data mi

This book has several strong features that set it apa ing. It presents a very broad yet in-depth coverage fro especially regarding several recent research topics on ing, social network analysis, and multirelational data the advanced topics are written to be as self-contained in order of interest by the reader. All of the major m sented. Because we take a database point of view to dat many important topics in data mining, such as scalable a OLAP analysis, that are often overlooked or minimally

To the Instructor

This book is designed to give a broad, yet detailed overv can be used to teach an introductory course on data mini level or at the first-year graduate level. In addition, it can course on data mining.

If you plan to use the book to teach an introducto materials in Chapters 1 to 7 are essential, among which do not plan to cover the implementation methods for d processing in depth. Alternatively, you may omit some use Chapter 11 as the final coverage of applications and

If you plan to use the book to teach an advanced cou Chapters 8 through 11. Moreover, additional materials may supplement selected themes from among the adva Individual chapters in this book can also be used f in related courses, such as database systems, machine leintelligent data analysis.

Each chapter ends with a set of exercises, suitable as a are either short questions that test basic mastery of the m that require analytical thinking, or implementation pro

make the book more enjoyable and reader-frien a textbook, we have tried to organize it so that i book or handbook, should you later decide to fields or pursue a career in data mining.

What do you need to know in order to read

You should have some knowledge of the co database systems, statistics, and machine le enough background of the basics in these fie these fields or your memory is a bit rusty, discussions in the book.

You should have some programming experi read pseudo-code and understand simple d arrays.

To the Professional

This book was designed to cover a wide range o result, it is an excellent handbook on the subjec as stand-alone as possible, you can focus on the can be used by application programmers and in learn about the key ideas of data mining on thei technical data analysis staff in banking, insuranc are interested in applying data mining solution may serve as a comprehensive survey of the da researchers who would like to advance the sta the scope of data mining applications.

The techniques and algorithms presented ar ing algorithms that perform well on small t in the book are geared for the discovery of p

plemental materials for readers of this book or anyo mining. The resources include:

Slide presentations per chapter. Lecture notes in available for each chapter.

Artwork of the book. This may help you to make room teaching.

Instructor's manual. This complete set of answers available only to instructors from the publisher's

Course syllabi and lecture plan. These are given f versions of introductory and advanced courses on and slides.

Supplemental reading lists with hyperlinks. Semin ing are organized per chapter.

Links to data mining data sets and software. We w mining data sets and sites containing interestin ages, such as IlliMine from the University of (http://illimine.cs.uiuc.edu).

Sample assignments, exams, course projects. A se and course projects will be made available to i website.

Table of contents of the book in PDF.

Errata on the different printings of the book. We errors in the book. Once the error is confirmed, w include acknowledgment of your contribution.

Comments or suggestions can be sent to hanj@cs. hear from you.

Itskevitch, Wen Jin, Tiko Kameda, Hiroyuki K

Kim, Krzysztof Koperski, Hans-Peter Kriegel, Joyce Man Lam, James Lau, Deyi Li, George ( Liao, Gang Liu, Junqiang Liu, Ling Liu, Alan (Y Xuebin Lu, Wo-Shun Luk, Heikki Mannila, Ru Alberto Mendelzon, Tim Merrett, Harvey Mill Richard Muntz, Raymond T. Ng, Vicent Ng, Ozsu, Jian Pei, Gregory Piatetsky-Shapiro, H hamed Rajan, Peter Scheuermann, Shashi Sh Evangelos Simoudis, Nebojsa Stefanovic, Yin J Dick Tsur, Anthony K. H. Tung, Ke Wang, Wei Winstone, Ju Wu, Betty (Bin) Xia, Cindy M. X Clement Yu, Jeffrey Yu, Philip S. Yu, Osmar R Zhong Zhang, Yvonne Zheng, Xiaofang Zhou Jean Hou, Helen Pinto, Lara Winstone, and H original figures in this book, and to Eugene each chapter.

We also wish to thank Diane Cerra, our Publishers, for her enthusiasm, patience, and s as well as Howard Severson, our Production tious efforts regarding production. We are in invaluable feedback. Finally, we thank our fa throughout this project.

Acknowledgments for the Seco

We would like to express our grateful thanks t bers of the Data Mining Group at UIUC, th Information Systems (DAIS) Laboratory in t the University of Illinois at Urbana-Champa

Wu, Tianyi Wu, Dong Xin, Xifeng Yan, Jiong Yang, X. Yu, Philip S. Yu, Maria Zemankova, ChengXiang Zou. Deng Cai and ChengXiang Zhai have contribut mining sections, Xifeng Yan to the graph mining secti tirelational data mining section. Hong Cheng, Chario David J. Hill, Chulyun Kim, Sangkyum Kim, Chao Li Tianyi Wu, Xifeng Yan, and Xiaoxin Yin have contrib individual chapters of the manuscript.

We also wish to thank Diane Cerra, our Publis lishers, for her constant enthusiasm, patience, and su book. We are indebted to Alan Rose, the book Prod tireless and ever prompt communications with us to duction process. We are grateful for the invaluable fee Finally, we thank our families for their wholehearted s

This book is an introduction to a young and promising discovery from data. The material in this book i where emphasis is placed on basic data mining interesting data patterns hidden in large data cussed are particularly oriented toward the de mining tools. In this chapter, you will learn evolution of database technology, why data mi You will learn about the general architecture insight into the kinds of data on which mining that can be found, and how to tell which pa will study data mining primitives, from whic designed. Issues regarding how to integrate a data warehouse are also discussed. In addition ing systems, you will read about challenging r tools of the future.

1.1 What Motivated Data Mining?

Necessity is the mother of

Data mining has attracted a great deal of atten society as a whole in recent years, due to the w and the imminent need for turning such data The information and knowledge gained can be ket analysis, fraud detection, and customer rete exploration.

Data mining can be viewed as a result o technology. The database system industry has development of the following functionalities ( creation, data management (including data

Figure 1.1 The evolution of database system technology.

faces, optimized query processing, and trans for on-line transaction processing (OLTP), w transaction, have contributed substantially to relational technology as a major tool for effici of large amounts of data.

Database technology since the mid-1980s adoption of relational technology and an activities on new and powerful database syste advanced data models such as extended-relati and deductive models. Application-oriented d poral, multimedia, active, stream, and sensor, a knowledge bases, and office information base distribution, diversification, and sharing of dat geneous database systems and Internet-based World Wide Web (WWW) have also emerged industry.

The steady and amazing progress of com three decades has led to large supplies of po collection equipment, and storage media. Thi the database and information industry, and information repositories available for transacti and data analysis.

Data can now be stored in many differen repositories. One data repository architecture (Section 1.3.2), a repository of multiple hetero unified schema at a single site in order to facili warehouse technology includes data cleaning, processing (OLAP), that is, analysis techniqu rization, consolidation, and aggregation as wel different angles. Although OLAP tools suppo sion making, additional data analysis tools are

Figure 1.2 We are data rich, but information poor.

data classification, clustering, and the characterizatio addition, huge volumes of data can be accumulated b houses. Typical examples include the World Wide W flow in and out like streams, as in applications like vid tion, and sensor networks. The effective and efficient forms becomes a challenging task.

The abundance of data, coupled with the need for been described as a data rich but information poor situ dous amount of data, collected and stored in large and far exceeded our human ability for comprehension wit As a result, data collected in large data repositories beco that are seldom visited. Consequently, important decisi the information-rich data stored in data repositories, intuition, simply because the decision maker does not able knowledge embedded in the vast amounts of da system technologies, which typically rely on users or d knowledge into knowledge bases. Unfortunately, this errors, and is extremely time-consuming and costly. analysis and may uncover important data patterns,

emphasis on mining from large amounts of da characterizing the process that finds a small set raw material (Figure 1.3). Thus, such a misno ing became a popular choice. Many other te meaning to data mining, such as knowledge m data/pattern analysis, data archaeology, and d

Many people treat data mining as a synonym edge Discovery from Data, or KDD. Alternativ

Figure 1.3 Data mining: searching for knowledge (interesti

Figure 1.4 Data mining as a step in the process of knowledge discove

based on some interestingness measures; Se

7. Knowledge presentation (where visualizati niques are used to present the mined knowl

Steps 1 to 4 are different forms of data pre for mining. The data mining step may interact interesting patterns are presented to the user a the knowledge base. Note that according to this entire process, albeit an essential one because it

We agree that data mining is a step in the kn industry, in media, and in the database research more popular than the longer term of knowledg book, we choose to use the term data mining. functionality: data mining is the process of disc amounts of data stored in databases, data wareh

Based on this view, the architecture of a ty following major components (Figure 1.5):

Database, data warehouse, World Wide We is one or a set of databases, data warehouses, tion repositories. Data cleaning and data in on the data.

Database or data warehouse server: The dat sible for fetching the relevant data, based on

1 A popular trend in the information industry is to pe preprocessing step, where the resulting data are stored i

2 Sometimes data transformation and consolidation ar particularly in the case of data warehousing. Data reduc representation of the original data without sacrificing it

Figure 1.5 Architecture of a typical data mining system. Diagram labels: data cleaning, integration and selection; Database; Data Warehouse; World Wide Web; Other Repos

Knowledge base: This is the domain knowledge th evaluate the interestingness of resulting patterns. S cept hierarchies, used to organize attributes or attrib abstraction. Knowledge such as user beliefs, which interestingness based on its unexpectedness, may a of domain knowledge are additional interestingnes metadata (e.g., describing data from multiple hetero

Data mining engine: This is essential to the data min a set of functional modules for tasks such as characte tion analysis, classification, prediction, cluster analys analysis.

Pattern evaluation module: This component typica sures (Section 1.5) and interacts with the data mi search toward interesting patterns. It may use int out discovered patterns. Alternatively, the pattern grated with the mining module, depending on t mining method used. For efficient data mining, it

incorporating more advanced techniques for d

Although there are many data mining syst perform true data mining. A data analysis syste data should be more appropriately categorized data analysis tool, or an experimental system form data or information retrieval, including fi deductive query answering in large databases s as a database system, an information retrieval s

Data mining involves an integration of tech database and data warehouse technology, statist computing, pattern recognition, neural net retrieval, image and signal processing, and spa a database perspective in our presentation of d sis is placed on efficient and scalable data min scalable, its running time should grow approxi of the data, given the available system resource By performing data mining, interesting knowle tion can be extracted from databases and viewe discovered knowledge can be applied to decisio management, and query processing. Therefore, important frontiers in database and informatio interdisciplinary developments in the informat

1.3 Data Mining: On What Kind o

In this section, we examine a number of diffe can be performed. In principle, data mining s repository, as well as to transient data, such examination of data repositories will include transactional databases, advanced database sy

manage and access the data. The software programs in tion of database structures; for data storage; for concu access; and for ensuring the consistency and security o system crashes or attempts at unauthorized access.

A relational database is a collection of tables, each of Each table consists of a set of attributes (columns or fie of tuples (records or rows). Each tuple in a relational tab by a unique key and described by a set of attribute valu as an entity-relationship (ER) data model, is often con An ER data model represents the database as a set of en Consider the following example.

Example 1.1 A relational database for AllElectronics. The AllElectro following relation tables: customer, item, employee, and described here are shown in Figure 1.6.

The relation customer consists of a set of attribute identity number (cust ID), customer name, address, credit information, category, and so on.

Similarly, each of the relations item, employee, and br describing their properties.

Tables can also be used to represent the relationsh relation tables. For our example, these include purc creating a sales transaction that is handled by an items sold in a given transaction), and works at (AllElectronics).

Relational data can be accessed by database queri language, such as SQL, or with the assistance of graph the user may employ a menu, for example, to specify query, and the constraints on these attributes. A given

Figure 1.6 Fragments of relations from a relational database

relational operations, such as join, selection, a efficient processing. A query allows retrieval of s your job is to analyze the AllElectronics data. T can ask things like Show me a list of all items tional languages also include aggregate function (maximum), and min (minimum). These allow sales of the last month, grouped by branch, or in the month of December? or Which sales pe
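As a rough illustration of the aggregate queries mentioned above, the following sketch uses Python's built-in sqlite3 module. The sales table, its columns, and its rows are invented for this example and are not the AllElectronics schema of Figure 1.6.

```python
import sqlite3

# Hypothetical table and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (branch TEXT, month TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Vancouver", "2004-11", 120.0),
        ("Vancouver", "2004-12", 300.0),
        ("Toronto", "2004-12", 250.0),
        ("Toronto", "2004-12", 50.0),
    ],
)

# "Show me the total sales of a month, grouped by branch."
totals_by_branch = dict(
    conn.execute(
        "SELECT branch, SUM(amount) FROM sales "
        "WHERE month = ? GROUP BY branch",
        ("2004-12",),
    ).fetchall()
)

# "How many sales transactions occurred in the month of December?"
december_count = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE month = ?", ("2004-12",)
).fetchone()[0]
```

Queries of this kind retrieve and summarize stored data; they do not by themselves discover patterns, which is where data mining goes further.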

Suppose that AllElectronics is a successful international the world. Each branch has its own set of databases. T asked you to provide an analysis of the company's sales third quarter. This is a difficult task, particularly since over several databases, physically located at numerous

If AllElectronics had a data warehouse, this task house is a repository of information collected from a unified schema, and that usually resides at a single structed via a process of data cleaning, data integra loading, and periodic data refreshing. This process is Figure 1.7 shows the typical framework for constructi for AllElectronics.

Figure 1.7 Typical framework of a data warehouse for AllElectronics. Data sources in Chicago, New York, Toronto, and Vancouver feed the data warehouse through clean, integrate, transform, load, and refresh steps.

Example 1.2 A data cube for AllElectronics. A data cube for is presented in Figure 1.8(a). The cube has thr

Chicago, New York, Toronto, Vancouver), time (w item (with item type values home entertainment, value stored in each cell of the cube is sales amou sales for the first quarter, Q1, for items relating to as stored in cell ⟨Vancouver, Q1, security⟩. Additi sums over each dimension, corresponding to the SQL group-bys (e.g., the total sales amount per per quarter and item, or per each individual dim

I have also heard about data marts. What is th a data mart? you may ask. A data warehouse span an entire organization, and thus its scope other hand, is a department subset of a data w and thus its scope is department-wide.

By providing multidimensional data views data, data warehouse systems are well suited OLAP. OLAP operations use background kn data being studied in order to allow the pr abstraction. Such operations accommodate d OLAP operations include drill-down and rol data at differing degrees of summarization, as i we can drill down on sales data summarized by month. Similarly, we can roll up on sales d summarized by country.
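Roll-up and drill-down can be sketched as re-aggregating the same base cells at a coarser or finer level. The cell values and the city-to-country and month-to-quarter mappings below are invented for illustration; a real OLAP engine would precompute and index such summaries rather than recompute them.

```python
from collections import defaultdict

# Hypothetical base cells: (city, month) -> sales amount.
sales = {
    ("Vancouver", "Jan"): 10, ("Vancouver", "Feb"): 20, ("Vancouver", "Apr"): 5,
    ("Toronto", "Jan"): 7, ("Chicago", "Mar"): 3, ("New York", "Apr"): 4,
}
# Assumed concept hierarchies for the two dimensions.
COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada",
           "Chicago": "USA", "New York": "USA"}
QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}

def aggregate(cells, place_map=None, time_map=None):
    """Sum cells after mapping each dimension value to a coarser level."""
    out = defaultdict(int)
    for (place, time), amount in cells.items():
        key = (place_map[place] if place_map else place,
               time_map[time] if time_map else time)
        out[key] += amount
    return dict(out)

def drill_down(cells, place, quarter):
    """Return the monthly cells that make up one (place, quarter) summary."""
    return {t: a for (p, t), a in cells.items()
            if p == place and QUARTER[t] == quarter}

by_city_quarter = aggregate(sales, time_map=QUARTER)
by_country_quarter = aggregate(sales, COUNTRY, QUARTER)    # roll up on address
vancouver_q1_months = drill_down(sales, "Vancouver", "Q1")  # drill down on time
```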

Although data warehouse tools help suppor mining are required to allow more in-depth a data warehouse and OLAP technology is provid ing data warehouse and OLAP implementation Chapter 4.

Figure 1.8 panels: (a) a cube with dimensions address (cities: Chicago, New York, Toronto, Vancouver), time (quarters: Q1 to Q4), and item (types: home entertainment, computer, phone, security); (b) drill-down on time data for Q1 (months: Jan, Feb, March) and roll-up on address (countries).

Figure 1.8 A multidimensional data cube, commonly used for data rized data for AllElectronics and (b) showing summarized roll-up operations on the cube in (a). For improved read values are shown.

1.3.3 Transactional Databases

In general, a transactional database consists of a file whe action. A transaction typically includes a unique transa and a list of the items making up the transaction (suc

and so on.

Example 1.3 A transactional database for AllElectronics. Tr one record per transaction. A fragment of a is shown in Figure 1.9. From the relational d Figure 1.9 is a nested relation because the attrib Because most relational database systems do not transactional database is usually either stored i the table in Figure 1.9 or unfolded into a stand the items sold table in Figure 1.6.

As an analyst of the AllElectronics database purchased by Sandy Smith or How many Answering such queries may require a scan of t

Suppose you would like to dig deeper into th together? This kind of market basket data analy items together as a strategy for maximizing sale printers are commonly purchased together with model of printers at a discount to customers bu selling more of the expensive printers. A regular queries like the one above. However, data mini so by identifying frequent itemsets, that is, sets o The mining of such frequent patterns for trans
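To make the idea of frequent itemsets concrete, here is a minimal brute-force sketch that counts how often each pair of items is sold together. The transactions and the minimum count are invented, and the exhaustive enumeration is only an illustration; it is not one of the scalable mining algorithms covered later in the book.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket data: one set of items per transaction.
transactions = [
    {"computer", "printer", "software"},
    {"computer", "printer"},
    {"computer", "software"},
    {"printer", "memory card"},
]

def frequent_itemsets(transactions, size, min_count):
    """Count every itemset of the given size and keep the frequent ones."""
    counts = Counter()
    for items in transactions:
        # Sorting gives each itemset one canonical tuple representation.
        for combo in combinations(sorted(items), size):
            counts[combo] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_count}

frequent_pairs = frequent_itemsets(transactions, size=2, min_count=2)
```

Enumerating all combinations grows quickly with transaction length, which is why practical methods prune the search space.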

1.3.4 Advanced Data and Information S Advanced Applications

Relational database systems have been widely progress of database technology, various kinds tems have emerged and are undergoing develop applications.

text and multimedia database systems, heterogeneous a stream management systems, and Web-based global inf While such databases or information repositories r efficiently store, retrieve, and update large amounts of fertile grounds and raise many challenging research an mining. In this section, we describe each of the advanc

Object-Relational Databases

Object-relational databases are constructed based on This model extends the relational model by providing a plex objects and object orientation. Because most sop need to handle complex objects and structures, object ing increasingly popular in industry and applications.

Conceptually, the object-relational data model in object-oriented databases, where, in general terms, object. Following the AllElectronics example, objects c tomers, or items. Data and code relating to an objec unit. Each object has associated with it the following:

A set of variables that describe the objects. These entity-relationship and relational models.

A set of messages that the object can use to commu the rest of the database system.

A set of methods, where each method holds the cod receiving a message, the method returns a value in re for the message get_photo(employee) will retrieve a employee object.

Objects that share a common set of properties can Each object is an instance of its class. Object classes can

A temporal database typically stores relational These attributes may involve several timesta A sequence database stores sequences of orde notion of time. Examples include customer sho biological sequences. A time-series database sto over repeated measurements of time (e.g., hour collected from the stock exchange, inventory phenomena (like temperature and wind).

Data mining techniques can be used to find t the trend of changes for objects in the database. sion making and strategy planning. For instanc the scheduling of bank tellers according to the vo data can be mined to uncover trends that could when is the best time to purchase AllElectronic defining multiple granularity of time. For exam to fiscal years, academic years, or calendar year quarters or months.

Spatial Databases and Spatiote

Spatial databases contain spatial-related info (map) databases, very large-scale integration (V and medical and satellite image databases. Spat mat, consisting of n-dimensional bit maps or image may be represented as raster data, where area. Maps can be represented in vector form lakes are represented as unions or overlays of ba lines, polygons, and the partitions and network

Geographic databases have numerous appli ogy planning to providing public service inform and electric cables, pipes, and sewage systems

more, spatial data cubes may be constructed to orga structures and hierarchies, on which OLAP operations can be performed.

A spatial database that stores spatial objects tha spatiotemporal database, from which interesting infor ple, we may be able to group the trends of moving obj moving vehicles, or distinguish a bioterrorist attack fr based on the geographic spread of a disease with time.

Text Databases and Multimedia Dat

Text databases are databases that contain word descr descriptions are usually not simple keywords but rath such as product specifications, error or bug reports, war notes, or other documents. Text databases may be hig Web pages on the World Wide Web). Some text databa that is, semistructured (such as e-mail messages and whereas others are relatively well structured (such as li databases with highly regular structures typically can database systems.

What can data mining on text databases uncover? uncover general and concise descriptions of the text associations, as well as the clustering behavior of text o mining methods need to be integrated with informat construction or use of hierarchies specifically for text d sauruses), as well as discipline-oriented term classificati stry, medicine, law, or economics).

Multimedia databases store image, audio, and vid cations such as picture content-based retrieval, voice-systems, the World Wide Web, and speech-based user commands. Multimedia databases must support large o

queries. Objects in one component database component databases, making it difficult to as heterogeneous database.

Many enterprises acquire legacy databases mation technology development (including th operating systems). A legacy database is a gro bines different kinds of data systems, such as hierarchical databases, network databases, spr systems. The heterogeneous databases in a lega or inter-computer networks.

Information exchange across such databas precise transformation rules from one represe semantics. Consider, for example, the proble student academic performance among differen computer system and use its own curriculum a adopt a quarter system, offer three courses on d A+ to F, whereas another may adopt a semester and assign grades from 1 to 10. It is very diffic transformation rules between the two universi ficult. Data mining techniques may provide an exchange problem by performing statistical da and transforming the given data into higher, m as fair, good, or excellent for student grades), fro more easily be performed.
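One simple way to picture the mapping of grades onto shared higher-level concepts is to rank each grade within its own university's distribution. The sketch below is only an assumption about how such a transformation might look: the function name, the three labels, the equal-width rank bands, and the sample grades are all invented for illustration.

```python
def to_concept(value, population, labels=("fair", "good", "excellent")):
    """Map a grade to a label by its rank within its own grade distribution."""
    below = sum(1 for v in population if v < value)
    # Integer arithmetic splits the rank range into len(labels) equal bands.
    return labels[below * len(labels) // len(population)]

# Hypothetical grade distributions on two incompatible scales.
ten_point_scale = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
percent_scale = [55, 62, 70, 78, 85, 91]

a = to_concept(8, ten_point_scale)   # high rank on the 1-to-10 scale
b = to_concept(91, percent_scale)    # high rank on the percentage scale
c = to_concept(5, ten_point_scale)   # middle rank
```

Because both grades are expressed as the same concept, they can be compared without a hand-written rule that converts one scale into the other.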

Data Streams

Many applications involve the generation and an data, where data flow in and out of an observa Such data streams have the following unique f dynamically changing, flowing in and out in a

changes within stream data. For example, we may like t network based on the anomaly of message flow, which data streams, dynamic construction of stream models, o patterns with that at a certain previous time. Most strea of abstraction, whereas analysts are often more interes of abstraction. Thus, multilevel, multidimensional on-be performed on stream data as well.

The World Wide Web

The World Wide Web and its associated distribute Yahoo!, Google, America Online, and AltaVista, provid mation services, where data objects are linked togeth Users seeking information of interest traverse from Such systems provide ample opportunities and challe ple, understanding user access patterns will not only providing efficient access between highly correlated o marketing decisions (e.g., by placing advertisements i or by providing better customer/user classification an user access patterns in such distributed information e mining (or Weblog mining).

Although Web pages may appear fancy and informat highly unstructured and lack a predefined schema, type computers to understand the semantic meaning of diver in an organized way for systematic information retriev that provide keyword-based searches without understan pages can only offer limited help to users. For example keyword may return hundreds of Web page pointers c of the pointers will be very weakly related to what the can often provide additional help here than Web searc tative Web page analysis based on linkages among We

Can Be Mined?

We have observed various types of databases an mining can be performed. Let us now examin mined.

Data mining functionalities are used to spe data mining tasks. In general, data mining tas descriptive and predictive. Descriptive mining of the data in the database. Predictive mining tas in order to make predictions.

In some cases, users may have no idea regar may be interesting, and hence may like to search parallel. Thus it is important to have a data mini patterns to accommodate different user expecta mining systems should be able to discover patte levels of abstraction). Data mining systems sh guide or focus the search for interesting patter for all of the data in the database, a measure of associated with each discovered pattern.

Data mining functionalities, and the kinds of below.

1.4.1 Concept/Class Description: Chara Discrimination

Data can be associated with classes or concepts classes of items for sale include computers and pr bigSpenders and budgetSpenders. It can be usefu cepts in summarized, concise, and yet precise a concept are called class/concept description(1) data characterization, by summarizing the d

attribute-oriented induction technique can be used to characterization without step-by-step user interaction Chapter 4.

The output of data characterization can be presen include pie charts, bar charts, curves, multidimension sional tables, including crosstabs. The resulting descr generalized relations or in rule form (called characteris forms and their transformations are discussed in Chap

Example 1.4 Data characterization. A data mining system should summarizing the characteristics of customers who sp AllElectronics. The result could be a general profile of 4050 years old, employed, and have excellent credit r users to drill down on any dimension, such as on oc customers according to their type of employment.

Data discrimination is a comparison of the general f with the general features of objects from one or a set and contrasting classes can be specified by the user, an retrieved through database queries. For example, the u eral features of software products whose sales increased whose sales decreased by at least 30% during the same p discrimination are similar to those used for data charac

How are discrimination descriptions output? The f similar to those for characteristic descriptions, altho should include comparative measures that help disting trasting classes. Discrimination descriptions expressed discriminant rules.

Example 1.5 Data discrimination. A data mining system should be AllElectronics customers, such as those who shop for co

Frequent patterns, as the name suggests, are pat are many kinds of frequent patterns, includin tures. A frequent itemset typically refers to a set in a transactional data set, such as milk and bre such as the pattern that customers tend to purc era, and then a memory card, is a (frequent) seq to different structural forms, such as graphs, tr with itemsets or subsequences. If a substructure structured pattern. Mining frequent patterns lea ations and correlations within data.

Example 1.6 Association analysis. Suppose, as a marketing m determine which items are frequently purchase An example of such a rule, mined from the AllE

buys(X, computer) ⇒ buys(X, software)

where X is a variable representing a customer. A that if a customer buys a computer, there is a as well. A 1% support means that 1% of all of that computer and software were purchased t single attribute or predicate (i.e., buys) that repe predicate are referred to as single-dimensional a notation, the above rule can be written simply
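Support and confidence for a rule such as the one above can be computed directly from transaction data. The following sketch uses four invented transactions; the function names are illustrative and not taken from any particular library.

```python
# Hypothetical transactions, one set of purchased items each.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
]

def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    # Assumes the antecedent occurs in at least one transaction.
    return both / support(transactions, antecedent)

# Rule: buys(X, computer) => buys(X, software)
rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
```

In this toy data set, two of the four transactions contain both items, and two of the three computer purchases also include software.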

Suppose, instead, that we are given the AllE purchases. A data mining system may find asso

age(X, 20...29) ∧ income(X, 20K

[support = 2%, confidence = 60%]

The rule indicates that of the AllElectronics 29 years of age with an income of 20,000 to 2

pattern mining and structured pattern mining are cons discussed in Chapters 8 and 9, respectively.

1.4.3 Classification and Prediction

Classification is the process of finding a model (or fun guishes data classes or concepts, for the purpose of bein the class of objects whose class label is unknown. The de ysis of a set of training data (i.e., data objects whose cla

How is the derived model presented? The derived m ous forms, such as classification (IF-THEN) rules, decis or neural networks (Figure 1.10). A decision tree is a flo each node denotes a test on an attribute value, each bra test, and tree leaves represent classes or class distributi converted to classification rules. A neural network, wh cally a collection of neuron-like processing units with w units. There are many other methods for constructing cl Bayesian classification, support vector machines, and k
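The correspondence between a decision tree and classification rules can be sketched as follows. The tree, its attributes, and its class labels are invented for this example; each path from the root to a leaf becomes one IF-THEN rule.

```python
# A hypothetical decision tree: an internal node is (attribute, {value: subtree}),
# and a leaf is a class label string.
tree = ("age", {
    "youth": ("student", {"yes": "buys", "no": "does_not_buy"}),
    "middle_aged": "buys",
    "senior": ("credit", {"fair": "does_not_buy", "excellent": "buys"}),
})

def classify(tree, record):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[record[attribute]]
    return tree

def to_rules(tree, conditions=()):
    """Convert every root-to-leaf path into one IF-THEN rule."""
    if isinstance(tree, str):
        lhs = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {lhs} THEN class = {tree}"]
    attribute, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules.extend(to_rules(subtree, conditions + ((attribute, value),)))
    return rules

rules = to_rules(tree)
label = classify(tree, {"age": "senior", "credit": "excellent"})
```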

Whereas classification predicts categorical (discret models continuous-valued functions. That is, it is use able numerical data values rather than class labels. Al refer to both numeric prediction and class label predicti primarily to numeric prediction. Regression analysis is most often used for numeric prediction, although other also encompasses the identification of distribution tren

Classification and prediction may need to be prece attempts to identify attributes that do not contribute t process. These attributes can then be excluded.

Example 1.7 Classification and prediction. Suppose, as sales mana like to classify a large set of items in the store, based

Figure 1.10 A classification model can be represented in vario

(b) a decision tree, or (c) a neural network.

sales campaign: good response, mild response, a a model for each of these three classes based such as price, brand, place made, type, and cate maximally distinguish each class from the other data set. Suppose that the resulting classificatio tree. The decision tree, for instance, may identif distinguishes the three classes. The tree may re help further distinguish objects of each class fro Such a decision tree may help you understand th design a more effective campaign for the future Suppose instead, that rather than predicting item, you would like to predict the amount of re an upcoming sale at AllElectronics, based on pr (numeric) prediction because the model constfunction, or ordered value.

Chapter 6 discusses classification and predic

1.4.4 Cluster Analysis

What is cluster analysis? Unlike classification an data objects, clustering analyzes data objects

Figure 1.11 A 2-D plot of customer data with respect to customer loc clusters. Each cluster center is marked with a +.

In general, the class labels are not present in the traini not known to begin with. Clustering can be used to gen clustered or grouped based on the principle of maxim minimizing the interclass similarity. That is, clusters of o within a cluster have high similarity in comparison to on to objects in other clusters. Each cluster that is formed c from which rules can be derived. Clustering can also fac is, the organization of observations into a hierarchy of together.

Example 1.8 Cluster analysis. Cluster analysis can be performed o order to identify homogeneous subpopulations of cust sent individual target groups for marketing. Figure 1.1 with respect to customer locations in a city. Three clust
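As a small illustration of grouping unlabeled points by similarity, the sketch below applies a basic k-means procedure to invented 2-D customer locations. K-means is used here only as one familiar clustering technique; the text does not state which method produced Figure 1.11, and the naive initialization (the first k points) is an assumption made to keep the example deterministic.

```python
import math

def kmeans(points, k, iterations=20):
    """Alternate between assigning points to the nearest center and recentering."""
    centers = [points[i] for i in range(k)]  # naive deterministic initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Each new center is the mean of its cluster; empty clusters keep theirs.
        new_centers = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical customer locations forming two visible groups.
customers = [(1, 1), (1, 2), (8, 8), (9, 8), (2, 1), (8, 9)]
centers, clusters = kmeans(customers, k=2)
```

No class labels are supplied; the two groups emerge only from the distances between points.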

Cluster analysis forms the topic of Chapter 7.

1.4.5 Outlier Analysis

A database may contain data objects that do not com model of the data. These data objects are outliers. Mo

Outlier analysis is also discussed in Chapter

1.4.6 Evolution Analysis

Data evolution analysis describes and models behavior changes over time. Although this m tion, association and correlation analysis, classi related data, distinct features of such an ana sequence or periodicity pattern matching, and

Example 1.10 Evolution analysis. Suppose that you have th of the last several years available from the Ne like to invest in shares of high-tech industrial c exchange data may identify stock evolution re stocks of particular companies. Such regularitie market prices, contributing to your decision m

Data evolution analysis is discussed in Chap

1.5 Are All of the Patterns Interest

A data mining system has the potential to gen terns, or rules.

So, you may ask, are all of the patterns inter tion of the patterns potentially generated would

This raises some serious questions for data pattern interesting? Can a data mining system ge a data mining system generate only interesting pa

To answer the first question, a pattern is in humans, (2) valid on new or test data with some

confidence(X ⇒ Y) = P(Y | X)

In general, each interestingness measure is associate controlled by the user. For example, rules that do not say, 50% can be considered uninteresting. Rules below exceptions, or minority cases and are probably of less v

Although objective measures help identify interesti unless combined with subjective measures that reflect ticular user. For example, patterns describing the chara frequently at AllElectronics should interest the marketi interest to analysts studying the same database for pat Furthermore, many patterns that are interesting by o common knowledge and, therefore, are actually unint ness measures are based on user beliefs in the data. Th esting if they are unexpected (contradicting a user's beli on which the user can act. In the latter case, such patte Patterns that are expected can be interesting if they co wished to validate, or resemble a user's hunch.
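Applying user-controlled thresholds to objective measures amounts to a simple filter over the discovered rules. The rules and threshold values below are invented for illustration, and the function name is hypothetical.

```python
# Hypothetical mined rules with their objective interestingness measures.
rules = [
    {"rule": "computer => software", "support": 0.01, "confidence": 0.50},
    {"rule": "printer => paper", "support": 0.03, "confidence": 0.35},
    {"rule": "camera => memory card", "support": 0.02, "confidence": 0.60},
]

def filter_rules(rules, min_support, min_confidence):
    """Keep only rules that meet both user-specified thresholds."""
    return [
        r for r in rules
        if r["support"] >= min_support and r["confidence"] >= min_confidence
    ]

kept = filter_rules(rules, min_support=0.01, min_confidence=0.50)
kept_names = [r["rule"] for r in kept]
```

Such a filter captures only the objective side; whether a surviving rule is unexpected or actionable for a particular user still requires subjective judgment.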

The second question, "Can a data mining syste patterns?", refers to the completeness of a data min alistic and inefficient for data mining systems to gene Instead, user-provided constraints and interestingness the search. For some mining tasks, such as association the completeness of the algorithm. Association rule mi of constraints and interestingness measures can ensure methods involved are examined in detail in Chapter 5.

Finally, the third question, "Can a data mining sys terns?", is an optimization problem in data mining. It ing systems to generate only interesting patterns. This w users and data mining systems, because neither would terns generated in order to identify the truly interesting this direction; however, such optimization remains a ch

1.6 Classification of Data Mining Sy

Data mining is an interdisciplinary field, the c ing database systems, statistics, machine learnin (Figure 1.12). Moreover, depending on the data other disciplines may be applied, such as neural knowledge representation, inductive logic progr ing. Depending on the kinds of data to be mined the data mining system may also integrate techn tion retrieval, pattern recognition, image analys Web technology, economics, business, bioinfor

Because of the diversity of disciplines contribu is expected to generate a large variety of data mi provide a clear classification of data mining syst tinguish between such systems and identify those systems can be categorized according to various

Disciplines shown in Figure 1.12: database technology, statistics, information science, visualization, and other disciplines, all converging on data mining.

Figure 1.12 Data mining as a confluence of multiple disciplin

relation analysis, classification, prediction, clusterin analysis. A comprehensive data mining system usual grated data mining functionalities.

Moreover, data mining systems can be distingui levels of abstraction of the knowledge mined, inclu high level of abstraction), primitive-level knowledge at multiple levels (considering several levels of abstra system should facilitate the discovery of knowledge

Data mining systems can also be categorized as (commonly occurring patterns) versus those that exceptions, or outliers). In general, concept descrip analysis, classification, prediction, and clustering mi liers as noise. These methods may also help detect o

Classification according to the kinds of techniques ut be categorized according to the underlying data min techniques can be described according to the degree autonomous systems, interactive exploratory syste methods of data analysis employed (e.g., databas oriented techniques, machine learning, statistics, vi neural networks, and so on). A sophisticated data multiple data mining techniques or work out an eff combines the merits of a few individual approaches

Classification according to the applications adapted: categorized according to the applications they ad systems may be tailored specifically for finance, te markets, e-mail, and so on. Different applications application-specific methods. Therefore, a generic, may not fit domain-specific mining tasks.

In general, Chapters 4 to 7 of this book are organize of knowledge mined. In Chapters 8 to 10, we discuss

the mining process, or examine the findings fr mining primitives specify the following, as illus

The set of task-relevant data to be mined: T or the set of data in which the user is interes or data warehouse dimensions of interest dimensions).

The kind of knowledge to be mined: This spe formed, such as characterization, discrimina classification, prediction, clustering, outlier

The background knowledge to be used in the the domain to be mined is useful for guidi for evaluating the patterns found. Concept ground knowledge, which allow data to be An example of a concept hierarchy for the a Figure 1.14. User beliefs regarding relationsh ground knowledge.

The interestingness measures and thresholds f to guide the mining process or, after discov Different kinds of knowledge may have diffe ple, interestingness measures for associatio Rules whose support and confidence values considered uninteresting.

The expected representation for visualizing t form in which discovered patterns are to be d charts, graphs, decision trees, and cubes.

A data mining query language can be de allowing users to flexibly interact with data min language provides a foundation on which user-

Background knowledge: concept hierarchies; user beliefs about relationships

Pattern interestingness measures: simplicity; certainty (e.g., confidence); utility (e.g., support); novelty

Visualization of discovered patterns: rules, tables, reports, charts, and cubes; drill-down and roll-up

Figure 1.13 Primitives for specifying a data mining task.

This facilitates a data mining system's communication and its integration with the overall information proces Designing a comprehensive data mining language is covers a wide spectrum of tasks, from data characteriz task has different requirements. The design of an effec requires a deep understanding of the power, limitation the various kinds of data mining tasks.

general abstraction level, denoted as all.

There are several proposals on data mining we use a data mining query language known as which was designed as a teaching tool, based o use to specify data mining queries appear thro an SQL-like syntax, so that it can easily be integ SQL. Let's look at how it can be used to specify

Example 1.11 Mining classification rules. Suppose, as a m would like to classify customers based on th interested in those customers whose salary i bought more than $1,000 worth of items, ea $100. In particular, you are interested in the cu purchased, the purchase location, and where to view the resulting classification in the for expressed in DMQL3 as follows, where each li aid in our discussion.

(1) use database AllElectronics_db

(2) use hierarchy location_hierarchy for T.

(3) mine classification as promising_custo

(4) in relevance to C.age, C.income, I.type

(5) from customer C, item I, transaction T

(6) where I.item_ID = T.item_ID and C.cu and C.income ≥ 40,000 and I.price

(7) group by T.cust_ID

3 Note that in this book, query language keywords are

However, in this example, the two classes are implici retrieved and considered examples of promising cus customers in the customer table are considered as n sification is performed based on this training set. L results are to be displayed as a set of rules. Several de introduced in Chapter 6.

There is no standard data mining query language as industry have been making good progress in this dir Data Mining (described in the appendix of this book) data mining language. Other standardization efforts inc Model Markup Language) and CRISP-DM (CRoss-Ind Mining).

1.8 Integration of a Data Mining System a Database or Data Warehouse Syst

Section 1.2 outlined the major components of the arch system (Figure 1.5). A good system architecture will fac make best use of the software environment, accomplish and timely manner, interoperate and exchange informa tems, be adaptable to users' diverse requirements, and

A critical question in the design of a data mining ( or couple the DM system with a database (DB) system system. If a DM system works as a stand-alone system program, there are no DB or DW systems with which it scheme is called no coupling, where the main focus of th effective and efficient algorithms for mining the availabl system works in an environment that requires it to com system components, such as DB and DW systems, poss

ond, there are many tested, scalable algorith DB and DW systems. It is feasible to realize such systems. Moreover, most data have be Without any coupling of such systems, a D extract data, making it difficult to integrate cessing environment. Thus, no coupling rep

Loose coupling: Loose coupling means that a DB or DW system, fetching data from a dat performing data mining, and then storing t designated place in a database or data wareh Loose coupling is better than no coupling stored in databases or data warehouses by usi system facilities. It incurs some advantages o tures provided by such systems. However, m main memory-based. Because mining does

optimization methods provided by DB or pling to achieve high scalability and good pe

Semitight coupling: Semitight coupling mea a DB/DW system, efficient implementation itives (identified by the analysis of frequen can be provided in the DB/DW system. The ing, aggregation, histogram analysis, multi essential statistical measures, such as sum, c so on. Moreover, some frequently used inte puted and stored in the DB/DW system. Be are either precomputed or can be computed performance of a DM system.

Tight coupling: Tight coupling means tha into the DB/DW system. The data mining

desirable, but its implementation is nontrivial and mo Semitight coupling is a compromise between loose and identify commonly used data mining primitives and p of such primitives in DB or DW systems.

1.9 Major Issues in Data Mining

The scope of this book addresses major issues in data mi logy, user interaction, performance, and diverse data ty below:

Mining methodology and user interaction issues: The mined, the ability to mine knowledge at multiple knowledge, ad hoc mining, and knowledge visualiza

Mining different kinds of knowledge in databas be interested in different kinds of knowledge, d spectrum of data analysis and knowledge disco acterization, discrimination, association and co prediction, clustering, outlier analysis, and evo trend and similarity analysis). These tasks may ent ways and require the development of nume

Interactive mining of knowledge at multiple lev difficult to know exactly what can be discove mining process should be interactive. For datab of data, appropriate sampling techniques can fi active data exploration. Interactive mining al for patterns, providing and refining data min results. Specifically, knowledge should be mine

vein, high-level data mining query langu to describe ad hoc data mining tasks by vant sets of data for analysis, the domai be mined, and the conditions and const patterns. Such a language should be integ query language and optimized for efficie

Presentation and visualization of data min be expressed in high-level languages, vis forms so that the knowledge can be ea humans. This is especially crucial if the This requires the system to adopt expressi such as trees, tables, rules, graphs, charts

Handling noisy or incomplete data: The da exceptional cases, or incomplete data obje objects may confuse the process, causin overfit the data. As a result, the accuracy Data cleaning methods and data analy required, as well as outlier mining met exceptional cases.

Pattern evaluation: the interestingness pro thousands of patterns. Many of the patt the given user, either because they repr elty. Several challenges remain regarding the interestingness of discovered pattern measures that estimate the value of pat based on user beliefs or expectations. user-specified constraints to guide the d space is another active area of research.

some data mining methods are factors motivatin distributed data mining algorithms. Such algo titions, which are processed in parallel. The res merged. Moreover, the high cost of some data mi for incremental data mining algorithms that inc out having to mine the entire data again from sc knowledge modification incrementally to amend ously discovered.

Issues relating to the diversity of database types:

Handling of relational and complex types of data: data warehouses are widely used, the developme mining systems for such data is important. Howe complex data objects, hypertext and multimedia or transaction data. It is unrealistic to expect data, given the diversity of data types and differe data mining systems should be constructed fo Therefore, one may expect to have different da kinds of data.

Mining information from heterogeneous database

Local- and wide-area computer networks (such sources of data, forming huge, distributed, and h covery of knowledge from different sources of unstructured data with diverse data semantics mining. Data mining may help disclose high-le heterogeneous databases that are unlikely to be tems and may improve information exchange a neous databases. Web mining, which uncovers i contents, Web structures, Web usage, and Web lenging and fast-evolving field in data mining.

database management systems with quer progress has led to the increasing demand analysis tools. This need is a result of the expl cations, including business and manageme and engineering, and environmental contro

Data mining is the task of discovering interes where the data can be stored in databases, data itories. It is a young interdisciplinary field, d tems, data warehousing, statistics, machine retrieval, and high-performance computing. networks, pattern recognition, spatial data an and many application fields, such as busines

A knowledge discovery process includes da tion, data transformation, data mining, presentation.

The architecture of a typical data mining warehouse and their appropriate servers, a tion module (both of which interact with a interface. Integration of the data mining c or data warehouse system can involve eithe coupling, or tight coupling. A well-designed semitight coupling with a database and/or d

Data patterns can be mined from many differ databases, data warehouses, and transaction esting data patterns can also be extracted fro ries, including spatial, time-series, sequence data streams, and the World Wide Web.

A data warehouse is a repository for long-ter organized so as to facilitate management dec

Data mining systems can be classified according to kinds of knowledge mined, the techniques used, or

We have studied five primitives for specifying a data mining query. These primitives are the specificatio data set to be mined), the kind of knowledge to be (typically in the form of concept hierarchies), inter edge presentation and visualization techniques to b ered patterns.

Data mining query languages can be designed to su mining. A data mining query language, such as DM for specifying each of the data mining primitives. based and may eventually form a standard on which mining can be based.

Efficient and effective data mining in large databas and great challenges to researchers and developers. mining methodology, user interaction, performance ing of a large variety of data types. Other issues inclu applications and their social impacts.

Exercises

1.1 What is data mining? In your answer, address the follo

(a) Is it another hype?

(b) Is it a simple transformation of technology develop machine learning?

(c) Explain how the evolution of database technology

(d) Describe the steps involved in data mining when v discovery.

1.6 Define each of the following data mining func tion, association and correlation analysis, classi lution analysis. Give examples of each data mini with which you are familiar.

1.7 What is the difference between discrimination zation and clustering? Between classification a tasks, how are they similar?

1.8 Based on your observation, describe another po discovered by data mining methods but has not a mining methodology that is quite different fr

1.9 List and describe the five primitives for specifyi

1.10 Describe why concept hierarchies are useful in d

1.11 Outliers are often discarded as noise. However, treasure. For example, exceptions in credit card ulent use of credit cards. Taking fraudulence det ods that can be used to detect outliers and disc

1.12 Recent applications pay special attention to sp poral data stream contains spatial information t of stream data (i.e., the data flow in and out lik

(a) Present three application examples of spati

(b) Discuss what kind of interesting knowledg with limited time and resources.

(c) Identify and discuss the major challenges i

(d) Using one application example, sketch a from such stream data efficiently.

1.13 Describe the differences between the following mining system with a database or data wareh

Bibliographic Notes

The book Knowledge Discovery in Databases, edited b [PSF91], is an early collection of research papers on kno book Advances in Knowledge Discovery and Data Mini

Shapiro, Smyth, and Uthurusamy [FPSSe96], is a colle knowledge discovery and data mining. There have bee lished in recent years, including Predictive Data Mining Data Mining Solutions: Methods and Tools for Solving R and Blaxton [WB98], Mastering Data Mining: The Ar tionship Management by Berry and Linoff [BL99], Build CRM by Berson, Smith, and Thearling [BST99], Data M Tools and Techniques with Java Implementations by Witt of Data Mining (Adaptive Computation and Machine L Smyth [HMS01], The Elements of Statistical Learning man [HTF01], Data Mining: Introductory and Advanc Data Mining: Multimedia, Soft Computing, and Bioin [MA03], and Introduction to Data Mining by Tan, Stein are also books containing collections of papers on discovery, such as Machine Learning and Data Mining: by Michalski, Bratko, and Kubat [MBK98], and Re Dzeroski and Lavrac [De01], as well as many tutorial database, data mining, and machine learning conferen

KDnuggets News, moderated by Piatetsky-Shapiro s tronic newsletter containing information relevant to da ery. The KDnuggets website, located at www.kdnuggets.c information relating to data mining.

The data mining community started its first interna discovery and data mining in 1995 [Fe95]. The confere national workshops on knowledge discovery in databas PS91a, FUe93, Fe94]. ACM-SIGKDD, a Special Interest

nals on databases, statistics, machine learning, a sources are listed below.

Popular textbooks on database systems inclu by Garcia-Molina, Ullman, and Widom [GMU Ramakrishnan and Gehrke [RG03], Database and Sudarshan [SKS02], and Fundamentals of D [EN03]. For an edited collection of seminal ar in Database Systems by Hellerstein and Stonebr house technology, systems, and applications hav such as The Data Warehouse Toolkit: The Com

Kimball and M. Ross [KR02], The Data Wareh Designing, Developing, and Deploying Data Wa [KRRT98], Mastering Data Warehouse Design: R

Imhoff, Galemmo, and Geiger [IGG03], Buildin and OLAP Solutions: Building Multidimensio

[Tho97]. A set of research papers on materialize tions were collected in Materialized Views: Techn by Gupta and Mumick [GM99]. Chaudhuri and overview of data warehouse technology.

Research results relating to data mining and the proceedings of many international data

SIGMOD International Conference on Managem Conference on Very Large Data Bases (VLDB), th posium on Principles of Database Systems (POD Engineering (ICDE), the International Confere (EDBT), the International Conference on Databa ference on Information and Knowledge Managem on Database and Expert Systems Applications (D on Database Systems for Advanced Applications also published in major database journals, such

Data Engineering (TKDE), ACM Transactions

Research in statistics is published in the proceedings ences, including Joint Statistical Meetings, International Society, and Symposium on the Interface: Computing Sci of publication include the Journal of the Royal Statistica Journal of the American Statistical Association, Technometri

Textbooks and reference books on machine learnin

Artificial Intelligence Approach, Vols. 1-4, edited by Mi KM90, MT94], C4.5: Programs for Machine Learning Machine Learning by Langley [Lan96], and Machine Le book Computer Systems That Learn: Classification and P Neural Nets, Machine Learning, and Expert Systems by compares classification and prediction methods from edited collection of seminal articles on machine learnin ing by Shavlik and Dietterich [SD90].

Machine learning research is published in the proc learning and artificial intelligence conferences, includin

Machine Learning (ML), the ACM Conference on Compu the International Joint Conference on Artificial Intelligenc ciation of Artificial Intelligence Conference (AAAI). Oth major machine learning, artificial intelligence, patte system journals, some of which have been mentioned

Learning (ML), Artificial Intelligence Journal (AI), IEEE and Machine Intelligence (PAMI), and Cognitive Science

Pioneering work on data visualization techniques is of Quantitative Information [Tuf83], Envisioning Inform nations: Images and Quantities, Evidence and Narrative to Graphics and Graphic Information Processing by Ber Cleveland [Cle93], and Information Visualization in D covery edited by Fayyad, Grinstein, and Wierse [FGW0 posiums on visualization include ACM Human Facto Visualization, and the International Symposium on Inf

at www.dmg.org, and CRISP-DM (CRoss-Indu described at www.crisp-dm.org.

Architectures of data mining systems have be ference panels and meetings. The recent design IV99, Cor00, Ras04], the proposal of on-line a the study of optimization of data mining queri can be viewed as steps toward the tight integrati systems and data warehouse systems. For rela mining primitives as proposed by Sarawagi, Th as building blocks for the efficient implemen systems.

Today's real-world databases are highly susceptible to n to their typically huge size (often several gigaby multiple, heterogeneous sources. Low-quality da

How can the data be preprocessed in order to consequently, of the mining results? How can the efficiency and ease of the mining process?

There are a number of data preprocessing tec remove noise and correct inconsistencies in the multiple sources into a coherent data store, suc tions, such as normalization, may be applied. F the accuracy and efficiency of mining algorithm reduction can reduce the data size by aggregating tering, for instance. These techniques are not mu For example, data cleaning can involve transfo by transforming all entries for a date field to a niques, when applied before mining, can substa patterns mined and/or the time required for th
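The date-field example above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `to_iso_date`, the list of candidate input formats, and the choice of ISO 8601 as the common target format are all hypothetical and do not come from the text.

```python
from datetime import datetime
from typing import Optional

# Hypothetical list of formats that might appear in a dirty date column.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y")

def to_iso_date(text: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no candidate format matches."""
    cleaned = text.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None

raw_dates = ["2004-03-15", "15/03/2004", "March 15, 2004", "unknown"]
cleaned_dates = [to_iso_date(d) for d in raw_dates]
```

Entries that match none of the formats come back as `None`, so they can be flagged as missing values rather than silently guessed. Ambiguous inputs (for example, day/month versus month/day order) cannot be resolved by parsing alone and would need knowledge of the data source.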

In this chapter, we introduce the basic conce Section 2.2 presents descriptive data summariz data preprocessing. Descriptive data summariz teristics of the data and identify the presence successful data cleaning and data integration. organized into the following categories: data cle transformation (Section 2.4), and data reductio be used in an alternative form of data reductio as raw values for age) with higher-level concept This form of data reduction is the topic of Secti eneration of concept hierarchies from num techniques. The automatic generation of concep described.

niques are incomplete (lacking attribute values or cert taining only aggregate data), noisy (containing errors, o the expected), and inconsistent (e.g., containing discre used to categorize items). Welcome to the real world!

Incomplete, noisy, and inconsistent data are comm world databases and data warehouses. Incomplete dat sons. Attributes of interest may not always be availabl for sales transaction data. Other data may not be inc considered important at the time of entry. Relevant da misunderstanding, or because of equipment malfuncti with other recorded data may have been deleted. Furth tory or modifications to the data may have been overl for tuples with missing values for some attributes, may

There are many possible reasons for noisy data (havin data collection instruments used may be faulty. There m errors occurring at data entry. Errors in data transmiss technology limitations, such as limited buffer size for transfer and consumption. Incorrect data may also resu conventions or data codes used, or inconsistent form Duplicate tuples also require data cleaning.

Data cleaning routines work to clean the data by ing noisy data, identifying or removing outliers, and r believe the data are dirty, they are unlikely to trust th has been applied to it. Furthermore, dirty data can cau cedure, resulting in unreliable output. Although most cedures for dealing with incomplete or noisy data, the they may concentrate on avoiding overfitting the data Therefore, a useful preprocessing step is to run your d routines. Section 2.3 discusses methods for cleaning up

Getting back to your task at AllElectronics, suppos data from multiple sources in your analysis. This wo

Getting back to your data, you have decided, based mining algorithm for your analysis, suc classifiers, or clustering.1 Such methods provi lyzed have been normalized, that is, scaled to customer data, for example, contain the attrib salary attribute usually takes much larger values left unnormalized, the distance measurements t weigh distance measurements taken on age. F analysis to obtain aggregate information as to th that is not part of any precomputed data cube i that data transformation operations, such as n tional data preprocessing procedures that wou mining process. Data integration and data tran
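The effect of unnormalized attributes on distance measurements can be seen in a small sketch. It assumes min-max scaling to the range [0, 1], which is one common choice and not necessarily the method intended here; the customer values and the helper names `min_max_scale` and `euclidean` are made up for illustration.

```python
import math

def min_max_scale(column):
    """Scale a list of numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: no spread to preserve
    return [(v - lo) / (hi - lo) for v in column]

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical customers: (age in years, annual salary in dollars).
ages = [25.0, 45.0, 65.0]
salaries = [30000.0, 31000.0, 90000.0]

raw = list(zip(ages, salaries))
scaled = list(zip(min_max_scale(ages), min_max_scale(salaries)))

raw_distance = euclidean(raw[0], raw[1])           # dominated by the salary gap
scaled_distance = euclidean(scaled[0], scaled[1])  # age now contributes visibly
```

Before scaling, the 20-year age difference between the first two customers is swamped by their 1,000-dollar salary difference. After scaling, both attributes lie in [0, 1] and the age difference accounts for most of the distance.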

Hmmm, you wonder, as you consider you selected for analysis is HUGE, which is sure to sl way I can reduce the size of my data set, witho

Data reduction obtains a reduced representati in volume, yet produces the same (or almost t number of strategies for data reduction. These data cube), attribute subset selection (e.g., remov tion analysis), dimensionality reduction (e.g., us length encoding or wavelets), and numerosity alternative, smaller representations such as clus tion is the topic of Section 2.5. Data can also use of concept hierarchies, where low-level con are replaced with higher-level concepts, such a hierarchy organizes the concepts into varying le

1 Neural networks and nearest-neighbor classifiers are de in Chapter 7.

Figure 2.1 Forms of data preprocessing. [Figure omitted. Surviving panel labels: data transformation; data reduction (attributes A1, A2, A3, ..., A126 and transactions T1, T2, T3, T4, ..., T2000).]

a form of data reduction that is very useful for the auto archies from numerical data. This is described in Secti generation of concept hierarchies for categorical data.

Figure 2.1 summarizes the data preprocessing step above categorization is not mutually exclusive. For exa data may be seen as a form of data cleaning, as well as

In summary, real-world data tend to be dirty, inc preprocessing techniques can improve the quality of the the accuracy and efficiency of the subsequent mining

For many data preprocessing tasks, users w istics regarding both central tendency and dis tendency include mean, median, mode, and mid include quartiles, interquartile range (IQR), and of great help in understanding the distributio studied extensively in the statistical literature. need to examine how they can be computed ef it is necessary to introduce the notions of dist holistic measure. Knowing what kind of measur an efficient implementation for it.

2.2.1 Measuring the Central Tendency

In this section, we look at various ways to me most common and most effective numerical m the (arithmetic) mean. Let $x_1, x_2, \ldots, x_N$ be a set some attribute, like salary. The mean of this set

$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$

This corresponds to the built-in aggregate func relational database systems.

A distributive measure is a measure (i.e., given data set by partitioning the data into s for each subset, and then merging the results i for the original (entire) data set. Both sum() because they can be computed in this mann min(). An algebraic measure is a measure that braic function to one or more distributive m an algebraic measure because it can be compu
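The distinction can be made concrete with a short sketch: each partition reports only its own sum and count (distributive measures), and the mean of the whole data set is then obtained algebraically from the merged results. The function names and the salary values (in thousands) are illustrative assumptions.

```python
from typing import Iterable, List, Tuple

def partial_sum_count(partition: Iterable[float]) -> Tuple[float, int]:
    """Distributive step: compute sum() and count() for one partition."""
    total, n = 0.0, 0
    for x in partition:
        total += x
        n += 1
    return total, n

def merged_mean(partitions: List[List[float]]) -> float:
    """Algebraic step: mean = (merged sum) / (merged count)."""
    parts = [partial_sum_count(p) for p in partitions]
    total = sum(s for s, _ in parts)
    n = sum(c for _, c in parts)
    if n == 0:
        raise ValueError("mean of an empty data set is undefined")
    return total / n

# Hypothetical salary values, split into three partitions of unequal size.
salaries = [30.0, 36.0, 47.0, 50.0, 52.0, 52.0, 56.0, 60.0, 63.0, 70.0, 70.0, 110.0]
partitions = [salaries[:5], salaries[5:9], salaries[9:]]
overall_mean = merged_mean(partitions)
```

Note that averaging the three partition means directly would give a different, incorrect result whenever the partitions differ in size; carrying the sum and count separately avoids that.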

Although the mean is the single most useful quantity always the best way of measuring the center of the data. is its sensitivity to extreme (e.g., outlier) values. Even a can corrupt the mean. For example, the mean salary at pushed up by that of a few highly paid managers. Simi in an exam could be pulled down quite a bit by a few ve caused by a small number of extreme values, we can which is the mean obtained after chopping off values a example, we can sort the values observed for salary and before computing the mean. We should avoid trimm 20%) at both ends as this can result in the loss of valua
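A trimmed mean can be sketched as follows. The function name, the default trim fraction, and the sample values are assumptions chosen for illustration; the same fraction is removed from each end of the sorted data.

```python
def trimmed_mean(values, trim_fraction=0.02):
    """Mean after dropping the lowest and highest trim_fraction of the sorted values."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    if not ordered:
        raise ValueError("trimmed mean of an empty data set is undefined")
    k = int(len(ordered) * trim_fraction)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical salaries (in thousands) with one extreme value.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 1000]
plain = trimmed_mean(data, 0.0)    # ordinary mean, pulled up by the outlier
trimmed = trimmed_mean(data, 0.1)  # drops one value from each end
```

With ten values and a 10% trim, one value is removed from each end, so the single extreme salary no longer dominates the result. With very small data sets and a small trim fraction, `k` rounds down to zero and nothing is trimmed.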

For skewed (asymmetric) data, a better measure of Suppose that a given data set of N distinct values is sorte then the median is the middle value of the ordered set; median is the average of the middle two values.

A holistic measure is a measure that must be com whole. It cannot be computed by partitioning the give the values obtained for the measure in each subset. The tic measure. Holistic measures are much more expens measures such as those listed above.

We can, however, easily approximate the median value grouped in intervals according to their $x_i$ data values an of data values) of each interval is known. For example, p to their annual salary in intervals such as 10-20K, 20-30 contains the median frequency be the median interval. of the entire data set (e.g., the median salary) by interp

$$\text{median} = L_1 + \left( \frac{N/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}} \right) \text{width}$$

2 Data cube computation is described in detail in Chapters 3 and

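where $L_1$ is the lower boundary of the median interval, $N$ is the number of values in the entire data set, $\left(\sum \text{freq}\right)_l$ is the sum of the frequencies of all of the intervals that are lower than the median interval, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the width of the median interval.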
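The interpolation can be sketched as follows, assuming the grouped data arrive as contiguous, sorted `(lower, upper, frequency)` triples. The function name and the interval counts are hypothetical.

```python
def approximate_median(intervals):
    """Interpolated median for grouped data.

    intervals: list of (lower, upper, frequency) tuples, sorted by lower bound.
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("no observations")
    cumulative = 0  # sum of frequencies of intervals below the current one
    for lower, upper, freq in intervals:
        if freq > 0 and cumulative + freq >= n / 2:
            # This is the median interval: interpolate inside it.
            return lower + (n / 2 - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("intervals do not cover the data")

# Hypothetical salary groups (in thousands): 10-20K, 20-30K, 30-40K, 40-50K.
groups = [(10, 20, 200), (20, 30, 450), (30, 40, 300), (40, 50, 50)]
estimate = approximate_median(groups)
```

Here N = 1000, so the median interval is 20-30K (the cumulative count passes 500 there), and the estimate is 20 + (500 - 200) / 450 × 10, a little under 26.7. Only the interval counts are needed, so the estimate can be computed without sorting the individual values.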

Another measure of central tendency is the value that occurs most frequently in the set. It correspond to several different values, which r with one, two, or three modes are respectively c In general, a data set with two or more modes each data value occurs only once, then there is

For unimodal frequency curves that are mo the following empirical relation:

$$\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$$

This implies that the mode for unimodal frequ can easily be computed if the mean and median In a unimodal frequency curve with perfect median, and mode are all at the same center va data in most real applications are not symmetr skewed, where the mode occurs at a value that is or negatively skewed, where the mode occu (Figure 2.2(c)).

The midrange can also be used to assess th average of the largest and smallest values in th compute using the SQL aggregate functions, m
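The mode, the empirical mean-median-mode relation, and the midrange can be sketched together. The helper names and the salary values (in thousands) are illustrative assumptions; `approximate_mode` simply rearranges the relation mean - mode ≈ 3 × (mean - median) to solve for the mode.

```python
from collections import Counter
from statistics import mean, median

def modes(values):
    """All most-frequent values; an empty list if every value occurs only once."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)

def approximate_mode(values):
    """Mode estimated from the empirical relation: mode = mean - 3 * (mean - median)."""
    m = mean(values)
    return m - 3 * (m - median(values))

def midrange(values):
    """Average of the largest and smallest values."""
    return (max(values) + min(values)) / 2

# Hypothetical salaries (in thousands).
salaries = [30.0, 36.0, 47.0, 50.0, 52.0, 52.0, 56.0, 60.0, 63.0, 70.0, 70.0, 110.0]
found_modes = modes(salaries)         # two modes, so the data are bimodal
estimated = approximate_mode(salaries)
center = midrange(salaries)
```

The sample is bimodal (52 and 70 each occur twice), so the empirical estimate is shown only to illustrate the arithmetic; the relation is stated for unimodal, moderately skewed frequency curves and should not be expected to hold on a small multimodal sample.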

2.2.2 Measuring the Dispersion of Data

The degree to which numerical data tend to spre the data. The most common measures of data d mary (based on quartiles), the interquartile ran

75th percentile. The quartiles, including the median, gi spread, and shape of a distribution. The distance betwe a simple measure of spread that gives the range covere This distance is called the interquartile range (IQR) an

$$\text{IQR} = Q_3 - Q_1.$$

Based on reasoning similar to that in our analysis of the conclude that Q1 and Q3 are holistic measures, as is IQ No single numerical measure of spread, such as I skewed distributions. The spreads of two sides of a s (Figure 2.2). Therefore, it is more informative to also p Q3, along with the median. A common rule of thumb f is to single out values falling at least 1.5 × IQR above th

quartile.

Because Q1, the median, and Q3 together contain no (e.g., tails) of the data, a fuller summary of the shape by providing the lowest and highest data values as well. summary. The five-number summary of a distribution tiles Q1 and Q3, and the smallest and largest individual

Minimum, Q1, Median, Q3, Maximum.
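A five-number summary can be sketched as below. Quartile conventions differ between textbooks and software packages; this sketch assumes the "median of each half" convention (the overall median is excluded from both halves when N is odd), which may not match the convention used elsewhere. The function name and data are illustrative.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using median-of-halves quartiles."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]
    # For odd n, skip the middle element so it belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

# Hypothetical salaries (in thousands).
salaries = [30.0, 36.0, 47.0, 50.0, 52.0, 52.0, 56.0, 60.0, 63.0, 70.0, 70.0, 110.0]
minimum, q1, med, q3, maximum = five_number_summary(salaries)
iqr = q3 - q1
```

For this sample the summary is 30, 48.5, 54, 66.5, 110, giving an IQR of 18. Because the quartiles require the sorted data, they are holistic measures and cannot be merged from per-partition quartiles.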

Boxplots are a popular way of visualizing a distribu five-number summary as follows:

Typically, the ends of the box are at the quartiles, so th tile range, IQR.

The median is marked by a line within the box.

Two lines (called whiskers) outside the box extend largest (Maximum) observations.

When dealing with a moderate number of observ potential outliers individually. To do this in a boxplo

Figure 2.3 Boxplot for the unit price data for items sold at fo time period. [Figure omitted. Surviving axis labels: Branch 1, Branch 2, Branch 3; tick values 20, 40, 60.]

the extreme low and high observations only beyond the quartiles. Otherwise, the whiskers vations occurring within 1.5 × IQR of the qu individually. Boxplots can be used in the co data. Figure 2.3 shows boxplots for unit price AllElectronics during a given time period. For of items sold is $80, Q1 is $60, Q3 is $100. N this branch were plotted individually, as their 1.5 times the IQR here of 40. The efficient comp boxplots (based on approximates of the challenging issue for the mining of large data
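The 1.5 × IQR rule can be checked against the quartiles quoted above (Q1 = $60, Q3 = $100, so IQR = 40). The function names and the list of unit prices are hypothetical; only the two quartile values come from the text.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (low, high) cutoffs located k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def split_outliers(values, q1, q3, k=1.5):
    """Separate values lying strictly inside the fences from suspected outliers."""
    low, high = outlier_fences(q1, q3, k)
    inside = [v for v in values if low < v < high]
    outside = [v for v in values if v <= low or v >= high]
    return inside, outside

low, high = outlier_fences(60, 100)  # quartiles quoted in the text

# Hypothetical unit prices for one branch.
prices = [45, 60, 72, 80, 95, 100, 150, 165, 190]
whisker_range, plotted_individually = split_outliers(prices, 60, 100)
```

With these quartiles the fences fall at $0 and $160, so any unit price of $160 or more would be drawn as an individual point and the upper whisker would stop at the largest price below that cutoff.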

Variance and Standard Deviation

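The variance of N observations, $x_1, x_2, \ldots, x_N$, is

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 = \frac{1}{N} \left[ \sum_{i=1}^{N} x_i^2 - \frac{1}{N} \left( \sum_{i=1}^{N} x_i \right)^2 \right]$$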

where $\bar{x}$ is the mean value of the observations, as deviation, $\sigma$, of the observations is the square r
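The variance can be computed directly from its definition, or from three running totals using the identity σ² = (1/N)[Σxᵢ² − (1/N)(Σxᵢ)²]. The sketch below shows both; the function names and sample values are illustrative assumptions. Because the count, the sum, and the sum of squares are each distributive, the second form lets the variance be assembled from per-partition totals in a single pass.

```python
import math

def variance_two_pass(values):
    """Population variance from the definition: mean of squared deviations."""
    n = len(values)
    if n == 0:
        raise ValueError("variance of an empty data set is undefined")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def variance_from_sums(n, total, total_sq):
    """Population variance from count, sum, and sum of squares."""
    if n == 0:
        raise ValueError("variance of an empty data set is undefined")
    return (total_sq - total * total / n) / n

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
direct = variance_two_pass(values)
from_totals = variance_from_sums(
    len(values), sum(values), sum(x * x for x in values)
)
std_dev = math.sqrt(direct)
```

Both routes give a variance of 4 and a standard deviation of 2 for this sample. The sums-based form subtracts two potentially large numbers, so it can lose precision in floating point when the mean is large relative to the spread; the two-pass form is the safer choice when the data fit in memory.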

2.2.3