Introduction to Data Mining
Credit: Natasha Balac, Ph.D.
© Copyright 2006, Natasha Balac
Outline
Motivation: Why Data Mining?
What is Data Mining?
History of Data Mining
Data Mining Functionality and Terminology
Data Mining Applications
Are all the Patterns Interesting?
Issues in Data Mining
Evolution of Database Technology
1960s: Data collection, database creation, IMS and network DBMS
1970s: Relational data model, relational DBMS implementation
1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s-2000s: Data mining and data warehousing, multimedia databases, and Web databases
Necessity is the Mother of Invention
Data explosion
Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
We are drowning in data, but starving for knowledge!
Solution - Data Mining
Data Warehousing and online analytical processing
Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
Data Mining Main Objectives
Identification of data as a source of useful information
Use of discovered information for competitive advantage in a business environment
Why DATA MINING?
Huge amounts of data
Electronic records of our decisions
Choices in the supermarket
Financial records
Our comings and goings
We swipe our way through the world – every swipe is a record in a database
Data rich – but information poor
Lying hidden in all this data is information!
Data vs. Information
Society produces massive amounts of data: business, science, medicine, economics, sports, …
Potentially valuable resource
Raw data is useless: we need techniques to automatically extract information from it
Data: recorded facts (as in databases)
Information: patterns underlying the data
Patterns must be discovered automatically
Data – Information - Knowledge
What is DATA MINING?
Extracting or “mining” knowledge from large amounts of data
Data-driven discovery and modeling of hidden patterns (that we never knew existed) in large volumes of data
A process for extracting implicit, previously unknown and unexpected, potentially extremely useful information from data
What Is Data Mining?
Data mining:
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
Data mining: a misnomer?
Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Data Mining is NOT
Data Warehousing
(Deductive) query processing
SQL/ Reporting
Software Agents
Expert Systems
Online Analytical Processing (OLAP)
Statistical Analysis Tool
Data visualization
Data Mining
Programs that detect patterns and rules in the data
Strong patterns can be used to make non-trivial predictions on new data
Data Mining Challenges
Problem 1: most patterns are not interesting
Problem 2: patterns may be inexact or completely spurious when the data is noisy
Machine Learning Techniques
Technical basis for data mining: algorithms for acquiring structural descriptions from examples
Methods originate from artificial intelligence, statistics, and research on databases
Machine Learning Techniques
Structural descriptions represent patterns explicitly and can be used to:
predict the outcome in a new situation
understand and explain how the prediction is derived (maybe even more important)
Multidisciplinary Field
Data Mining
Database Technology
Statistics
Other Disciplines
Artificial Intelligence
Machine Learning
Visualization
Multidisciplinary Field
Database technology
Artificial Intelligence
Machine Learning including Neural Networks
Statistics
Pattern recognition
Knowledge-based systems/acquisition
High-performance computing
Data visualization
History of Data Mining
History
Emerged in the late 1980s
Flourished in the 1990s
Roots traced back along three family lines
Classical Statistics
Artificial Intelligence
Machine Learning
Statistics
Foundation of most DM technologies
Regression analysis, standard distribution/deviation/variance, cluster analysis, confidence intervals
Building blocks
Significant role in today’s data mining, but alone is not powerful enough
Artificial Intelligence
Heuristics vs. Statistics
Human-thought-like processing
Requires vast computer processing power
Supercomputers
Machine Learning
Union of statistics and AI
Blends AI heuristics with advanced statistical analysis
Machine learning lets computer programs learn about the data they study, so that they make different decisions based on the quality of the studied data, using statistics for fundamental concepts and adding more advanced AI heuristics and algorithms
Data Mining
Adaptation of machine learning techniques to real-world problems
Union: Statistics, AI, Machine learning
Used to find previously hidden trends or patterns
Finding increasing acceptance in science and business areas that need to analyze large amounts of data to discover trends that could not be found otherwise
Terminology
Gold mining: turning data tombs into golden nuggets of knowledge, i.e. finding small sets of precious nuggets in a great deal of raw material
Knowledge mining from databases
Knowledge extraction
Data/pattern analysis
Knowledge Discovery in Databases (KDD)
Information harvesting
Business intelligence
Data Mining: A KDD Process
Data Mining: Concepts and Techniques, September 6, 2014
Data mining: the core of the knowledge discovery process.
[Figure: KDD process flow: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection and Transformation -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Steps of a KDD Process
Learning the application domain: relevant prior knowledge and goals of the application
Creating a target data set: data selection
Data cleaning and preprocessing: remove noise (may take 60% of the effort!)
Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
Choosing the functions of data mining: summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
Data Mining and Business Intelligence
[Figure: pyramid of layers, with increasing potential to support business decisions from bottom to top]
Data Sources: paper, files, information providers, database systems, OLTP
Data Warehouses / Data Marts
OLAP, MDA
Data Exploration: statistical analysis, querying and reporting
Data Mining: information discovery
Data Presentation: visualization techniques
Making Decisions
Roles shown beside the pyramid, from bottom to top: DBA, Data Analyst, Business Analyst, End User
Architecture of a Typical Data Mining System
[Figure: layered architecture, from bottom to top]
Databases and data warehouse (data cleaning and data integration, filtering)
Database or data warehouse server
Data mining engine
Pattern evaluation
Graphical user interface
Knowledge base: guides the search and evaluates the interestingness of patterns
Interestingness measures focus the search on interesting patterns
Data Mining: On What Kind of Data?
Should be applicable to any kind of data repository, and to transient data (data streams)
Types of databases and information repositories on which data mining can be performed
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
Object-oriented and object-relational databases
Spatial databases
Time-series data and temporal data
Text databases and multimedia databases
Heterogeneous and legacy databases
WWW
LEARNING ALGORITHMS
Fundamental idea: learn rules/patterns/relationships automatically from the data
Data Mining Tasks
Exploratory Data Analysis
Predictive Modeling: inference on current data to make predictions
Classification and Regression
Descriptive Modeling: characterise general properties of the data
Cluster analysis/segmentation
Discovering Patterns and Rules
Association/Dependency rules
Sequential patterns
Temporal sequences
Deviation detection
Data Mining Tasks
Data is associated with classes (e.g. computers, printers) or concepts (e.g. customer types)
Concept/Class description: Characterization and discrimination
Generalize, summarize (target class), and contrast data characteristics, e.g., dry vs. wet regions (contrasting class)
Frequent patterns: patterns occurring frequently in the data
Frequent itemsets: sets of items frequently appearing together in transactional data
Association (correlation and causality)
Multi-dimensional or single-dimensional association
age(X, “20-29”) ^ income(X, “60-90K”) => buys(X, “TV”)
age(X, “20..29”) ^ income(X, “20..29K”) => buys(X, “PC”) [support = 2%, confidence = 60%]
buys(T, “computer”) => buys(T, “software”) [1%, 50%]
An association rule is discarded as uninteresting if it does not satisfy both a minimum support threshold and a minimum confidence threshold
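The support and confidence thresholds on this slide can be illustrated with a minimal sketch. The transactions, the function names and the threshold values below are all hypothetical and chosen only for illustration; they are not taken from the slides.

```python
# Toy transactions (hypothetical data, not from the slides).
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer", "software"},
    {"computer", "software", "printer"},
    {"printer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent).

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

def is_interesting(antecedent, consequent, transactions,
                   min_support, min_confidence):
    """Keep a rule only if it meets both thresholds."""
    s = support(antecedent | consequent, transactions)
    c = confidence(antecedent, consequent, transactions)
    return s >= min_support and c >= min_confidence

# Rule: buys "computer" => buys "software"
rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(rule_support, rule_confidence)
```

With this toy data, two of the five transactions contain both items and three contain "computer", so the rule passes a 30% support threshold but is discarded at 50%.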
Data Mining Tasks
Classification
Finding models (functions) that describe and distinguish classes or concepts for future prediction
Example: classify countries based on climate, or classify cars based on gas mileage
The model is derived from analysis of a set of training data (objects of known class label)
The model may be represented in various forms: IF-THEN rules, decision trees, classification rules, neural networks
Prediction
Models continuous-valued functions: predict unknown or missing numerical values (e.g. regression analysis)
Relevance analysis: identify attributes that do not contribute to classification/prediction and can therefore be excluded
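As one possible illustration of the prediction task, the sketch below fits a straight line by ordinary least squares and uses it to predict a missing numerical value. The data points and the function names `fit_line` and `predict` are assumptions made for this example; the slides mention regression analysis only by name.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Predict the numerical target for a new x with the fitted line."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical training data: known (x, y) pairs.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

model = fit_line(xs, ys)
print(model, predict(model, 5.0))
```

The training pairs here lie exactly on a line, so the fit recovers it; real data would leave residual error that should be measured on held-out examples.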
Data Mining Tasks
Cluster analysis
Class label is unknown: group data to form new classes
Example: cluster houses to find distribution patterns
Clustering is based on the principle of maximizing the intra-class similarity and minimizing the interclass similarity
Each cluster formed may be viewed as a class of objects
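The clustering principle above can be sketched with k-means, one common technique that groups points by minimizing the distance to each cluster's centre. The slides do not name a specific algorithm, so k-means is an assumption here, as are the house coordinates and the choice to seed the centroids with the first k points.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means on 2-D points; the first k points seed the centroids."""
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, clusters

# Hypothetical house locations forming two visible groups.
houses = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 1.5), (9.5, 8.0)]
centroids, clusters = kmeans(houses, k=2)
print(centroids)
print(clusters)
```

Each resulting cluster can then be treated as a new class of objects, as the slide notes. In practice the result depends on the initial centroids and on the chosen k.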
Data Mining Tasks
Outlier analysis
Outlier: a data object that does not comply with the general behavior of the data
Mostly considered as noise or an exception, but quite useful in fraud detection and rare-event analysis
Trend and evolution analysis: model regularities or trends for objects whose behaviour changes with time
Trend and deviation: regression analysis
Sequential pattern mining, periodicity analysis
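For the outlier analysis task described on this slide, a minimal sketch is to flag values that lie far from the mean in units of standard deviation (a z-score test). This is only one simple approach and is not taken from the slides; the transaction amounts and the threshold of 2.0 are hypothetical.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts with one unusual purchase.
amounts = [20, 22, 19, 21, 23, 20, 250]
print(zscore_outliers(amounts))
```

A single extreme value inflates both the mean and the standard deviation, so on small samples this test can miss outliers; robust statistics such as the median are a common alternative.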
Data Mining: Classification Schemes
General functionality
Descriptive data mining vs. predictive data mining
Descriptive data mining (DDM): characterise general properties of the data
Predictive data mining (PDM): perform inference on the data in order to make predictions
Different views - different classifications
Kinds of databases to be mined
Kinds of knowledge to be discovered
Kinds of techniques employed
Kinds of applications
A Multi-Dimensional View of Data Mining Classification
Databases to be mined
Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multimedia, WWW, etc.
Knowledge to be mined
Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
Multiple/integrated functions
Mining at multiple levels of abstraction
A Multi-Dimensional View of Data Mining Classification
Techniques utilized
Decision/Regression trees, clustering, neural networks, etc.
Applications adapted
Retail, telecom, banking, DNA mining, stock market analysis, Web mining
Data Mining Applications
Science: Chemistry, Physics, Medicine
Biochemical analysis
Remote sensors on a satellite
Telescopes – star galaxy classification
Medical Image analysis
Data Mining Applications
Bioscience
Sequence-based analysis
Protein structure and function prediction
Protein family classification
Microarray gene expression
Data Mining Applications
Pharmaceutical companies, Insurance and Health care, Medicine
Drug development
Identify successful medical therapies
Claims analysis, fraudulent behavior
Medical diagnostic tools
Predict office visits
Data Mining Applications
Financial Industry, Banks, Businesses, E-commerce
Stock and investment analysis
Identify loyal customers vs. risky customers
Predict customer spending
Risk management
Sales forecasting
Data Mining Applications
Retail and Marketing
Customer buying patterns/demographic characteristics
Mailing campaigns
Market basket analysis
Trend analysis
Data Mining Applications
Database analysis and decision support
Market analysis and management
Target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
Risk analysis and management
Forecasting, customer retention, improved underwriting, quality control, competitive analysis
Fraud detection and management
Data Mining Applications
Sports and Entertainment
IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
Astronomy
JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
DATA MINING EXAMPLES
Grocery store
NBA
Banking and Credit Card scoring
Fraud detection
Personalization & Customer Profiling
Campaign Management and Database Marketing
Data mining at work:
Case study 1
Processing Loan Applications
Given: questionnaire with financial and personal information
Problem: should money be lent?
Borderline cases referred to loan officers
But: 50% of accepted borderline cases defaulted!
Solution: reject all borderline cases?
Borderline cases are most active customers!
Enter Machine Learning
Given: 1000 training examples of borderline cases
20 attributes, including:
age, years with current employer, years at current address, years with the bank, years at current job, other credit cards
Learned rules predicted 2/3 of borderline cases correctly!
Rules could be used to explain decisions to customers
Case study 2:
Screening images
Given:
radar satellite images of coastal waters
Problem: detecting oil slicks in those images
Oil slicks = dark regions with changing size and shape
Look-alike dark regions can be caused by weather conditions (e.g. high wind)
Expensive process requiring highly trained personnel
Enter Machine Learning
Dark regions are extracted from the normalized image
Attributes: size of region, shape, area, intensity, sharpness and jaggedness of boundaries, proximity of other regions, info about background
Constraints:
Scarcity of training examples (oil slicks are rare!)
Unbalanced data: most dark regions aren’t oil slicks
Regions from the same image form a batch
Requirement: adjustable false-alarm rate
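One way to read the adjustable false-alarm requirement is as a tunable score threshold: the learned model assigns each dark region a score, and the operator picks how many look-alike regions may be flagged. The sketch below assumes such a scoring model exists; the scores, function names and the 20% limit are hypothetical and do not describe the system used in the case study.

```python
def threshold_for_false_alarm_rate(negative_scores, max_rate):
    """Pick a score threshold so that at most `max_rate` of the negative
    (non-slick) examples score strictly above it."""
    ranked = sorted(negative_scores, reverse=True)
    allowed = int(max_rate * len(ranked))  # how many negatives may be flagged
    if allowed >= len(ranked):
        return float("-inf")  # every region may be flagged
    return ranked[allowed]

def false_alarm_rate(negative_scores, threshold):
    """Fraction of negative examples flagged at the given threshold."""
    flagged = sum(1 for s in negative_scores if s > threshold)
    return flagged / len(negative_scores)

# Hypothetical model scores for dark regions known NOT to be oil slicks.
look_alike_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]

threshold = threshold_for_false_alarm_rate(look_alike_scores, 0.2)
print(threshold, false_alarm_rate(look_alike_scores, threshold))
```

Lowering the allowed rate raises the threshold, which reduces false alarms at the cost of missing more real slicks.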
Data Mining Challenges
Computationally expensive to investigate all possibilities
Dealing with noise/missing information and errors in data
Choosing appropriate attributes/input representation
Finding the minimal attribute space
Finding adequate evaluation function(s)
Extracting meaningful information
Not overfitting
Are All the “Discovered” Patterns Interesting?
A DM system may generate thousands or millions of patterns/rules
Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
support: percentage of transactions that the given rule satisfies
support(X => Y) = P(X ∪ Y)
confidence: degree of certainty of the detected association
confidence(X => Y) = P(Y | X)
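The two measures can be written out in terms of transaction counts. In this notation, P(X ∪ Y) is the probability that a transaction contains the union of the itemsets X and Y, that is, every item of both, not the probability that it contains X or Y. The count notation below is an assumption introduced for illustration.

```latex
\begin{align*}
\mathrm{support}(X \Rightarrow Y) &= P(X \cup Y)
  = \frac{\#\{\text{transactions containing all items of } X \text{ and } Y\}}
         {\#\{\text{all transactions}\}} \\[4pt]
\mathrm{confidence}(X \Rightarrow Y) &= P(Y \mid X)
  = \frac{P(X \cup Y)}{P(X)}
  = \frac{\mathrm{support}(X \Rightarrow Y)}{P(X)}
\end{align*}
```

As a worked reading of the computer and software rule annotated [1%, 50%] on the earlier association slide: 1% of all transactions contain both items, and half of the transactions that contain a computer also contain software, so computers must appear in 0.01 / 0.50 = 2% of all transactions.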
Are All the “Discovered” Patterns Interesting?
Objective vs. subjective measures:
Objective: based on statistics and structures of patterns
support and confidence
Subjective: based on the user’s belief in the data
unexpectedness, novelty, actionability, confirming a hypothesis the user wishes to validate, or resembling a hunch
Can We Find All and Only Interesting Patterns?
An optimisation problem: completeness of the data mining algorithm
Completeness: find all the interesting patterns
Can a data mining system find all the interesting patterns?
Association vs. classification vs. clustering
Can We Find All and Only Interesting Patterns?
Optimization: search for only interesting patterns
Can a data mining system find only the interesting patterns?
Approaches
First generate all the patterns and then filter out the uninteresting ones
Mining query optimization: improve search efficiency by pruning away subsets of the pattern space that do not satisfy the interestingness constraints
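The pruning approach can be sketched with an Apriori-style level-wise search for frequent itemsets, where minimum support plays the role of the interestingness constraint. The slides do not name an algorithm, so Apriori is an assumption here; the basket data and the function name are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search that prunes any candidate with an infrequent subset."""
    n = len(transactions)

    def is_frequent(itemset):
        return sum(1 for t in transactions if itemset <= t) / n >= min_support

    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items if is_frequent(frozenset([i]))]
    frequent = list(level)
    size = 2
    while level:
        previous = set(level)
        # Join step: unions of frequent itemsets that have exactly `size` items.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: skip counting a candidate if any of its
        # (size - 1)-item subsets is already known to be infrequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, size - 1))
        }
        level = [c for c in candidates if is_frequent(c)]
        frequent.extend(level)
        size += 1
    return frequent

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
result = frequent_itemsets(baskets, min_support=0.5)
print(sorted(sorted(s) for s in result))
```

In this example the three-item candidate is never counted against the data, because its subset of milk and butter already fails the support threshold. Generating every itemset first and filtering afterwards would give the same answer but examine many more candidates.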
Major Issues in Data Mining
Mining methodology and user interaction
Mining different kinds of knowledge in databases
Incorporation of background knowledge: allows discovered patterns to be expressed in concise terms and at different levels of abstraction
Handling noise and incomplete data, which may confuse the process
Pattern evaluation: the interestingness problem
Expression and visualization of data mining results in expressive forms, so that the knowledge is easily understood and usable by humans
Major Issues in Data Mining
Performance and scalability: huge data volumes
Efficiency of data mining algorithms
Parallel, distributed and incremental mining methods
Issues relating to the diversity of data types
Handling relational and complex types of data
Mining information from diverse databases
Unrealistic to expect one system to mine all kinds of data
Major Issues in Data Mining
Issues related to applications and social impacts
Application of discovered knowledge
Domain-specific data mining tools
Intelligent query answering
Expert systems
Process control and decision making
A knowledge fusion problem
Protection of data security, integrity, and privacy
Summary
Data mining: discovering interesting patterns from large amounts of data
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Summary
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, association, classification, clustering, outlier and trend analysis, etc.
Classification of data mining systems
Major issues in data mining
Kinds of Data Mining
Decision Tree Learning
Clustering
Neural Networks
Association Rules
Support Vector Machines
Genetic Algorithms
Nearest Neighbor Method
Decision Tree Example
[Figure: decision tree; surviving labels: “Grandparents”, “A lot”, “A little”]
DECISION TREE FOR THE CONCEPT “Play Tennis”
Day  Outlook   Temp  Humidity  Wind    PlayTennis
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No
[Mitchell, 1997]
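To show how a decision tree learner could choose its first split on this table, the sketch below computes the information gain of each attribute. The slides do not state which splitting criterion is used; information gain (as in ID3) is assumed here, and the function names are illustrative.

```python
from collections import Counter
from math import log2

COLUMNS = ["Outlook", "Temp", "Humidity", "Wind", "PlayTennis"]
ROWS = [  # D1..D14 from the table above
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute_index, label_index=-1):
    """Entropy of the labels minus the weighted entropy after splitting
    the rows on the attribute at `attribute_index`."""
    labels = [r[label_index] for r in rows]
    gain = entropy(labels)
    for value in {r[attribute_index] for r in rows}:
        subset = [r[label_index] for r in rows if r[attribute_index] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

gains = {name: information_gain(ROWS, i) for i, name in enumerate(COLUMNS[:-1])}
best = max(gains, key=gains.get)
for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {gain:.3f}")
```

Under this criterion Outlook has the largest gain (about 0.247 bits, against about 0.152 for Humidity), so it would be placed at the root of the tree; the same calculation is then repeated on each branch's subset of rows.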