Taken from © Copyright 2007, Natasha B
Introduction to Data Mining
Informatics
Outline
Motivation: Why Data Mining?
What is Data Mining?
Data Mining Applications
Issues in Data Mining
Data vs. Information
Society produces massive amounts of data
  business, science, medicine, economics, sports, …
Potentially valuable resource
Raw data is useless
  need techniques to automatically extract information
  Data: recorded facts
  Information: patterns underlying the data
Multidisciplinary Field
Data Mining
Database Technology
Statistics
Other Disciplines
Artificial Intelligence (Machine Learning – Neural Network)
Machine Learning
Visualization
Terminology
Gold Mining
Knowledge mining from databases
Knowledge extraction
Data/pattern analysis
Knowledge Discovery in Databases (KDD)
Information harvesting
Business intelligence
KDD Process
Database -> Selection, Transformation (Data Preparation) -> Training Data -> Data Mining -> Model, Patterns -> Evaluation, Verification
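The stages of the KDD process can be sketched as a chain of small functions. This is only an illustrative sketch: the records, the field names, and the one-threshold "model" are all invented here and are not part of the original slides.

```python
# Hypothetical toy records; field names and values are invented for illustration.
raw_records = [
    {"id": 1, "age": 25, "income": 30000, "buys": "no"},
    {"id": 2, "age": 47, "income": 82000, "buys": "yes"},
    {"id": 3, "age": 35, "income": None, "buys": "yes"},  # missing value
    {"id": 4, "age": 52, "income": 91000, "buys": "yes"},
    {"id": 5, "age": 23, "income": 28000, "buys": "no"},
    {"id": 6, "age": 41, "income": 45000, "buys": "no"},
]

def select(records):
    """Selection: keep complete records and only the fields of interest."""
    return [(r["income"], r["buys"]) for r in records if r["income"] is not None]

def transform(rows):
    """Transformation: rescale income to the range [0, 1]."""
    incomes = [income for income, _ in rows]
    lo, hi = min(incomes), max(incomes)
    return [((income - lo) / (hi - lo), label) for income, label in rows]

def mine(training):
    """Data mining: pick the threshold t for the rule 'income >= t => yes'
    that classifies the most training rows correctly."""
    best_t, best_correct = None, -1
    for t, _ in training:
        correct = sum((x >= t) == (label == "yes") for x, label in training)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

def evaluate(threshold, rows):
    """Evaluation: fraction of rows the rule classifies correctly."""
    hits = sum((x >= threshold) == (label == "yes") for x, label in rows)
    return hits / len(rows)

prepared = transform(select(raw_records))
threshold = mine(prepared)
accuracy = evaluate(threshold, prepared)
print(f"threshold={threshold:.2f}, accuracy={accuracy:.2f}")
```

The sketch evaluates on the same rows it trained on, which gives an optimistic score; a real evaluation step would use data held out from training.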
Data Mining Tasks
Exploratory Data Analysis
Predictive Modeling: Classification and Regression
Descriptive Modeling
  Cluster analysis/segmentation
Discovering Patterns and Rules
  Association/Dependency rules
  Sequential patterns
  Temporal sequences
Deviation detection
Data Mining Tasks
Concept/Class description: Characterization and discrimination
  Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Association (correlation and causality)
  Multi-dimensional or single-dimensional association
  age(X, “20-29”) ^ income(X, “60-90K”) → buys(X, “TV”)
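An association rule such as the one above is usually judged by its support (how often the whole rule holds in the data) and its confidence (how often the right-hand side holds when the left-hand side does). The slide does not give these numbers, so the sketch below computes them on a handful of invented customer records purely for illustration.

```python
# Hypothetical customer records, invented for illustration only.
customers = [
    {"age": "20-29", "income": "60-90K", "buys": {"TV", "phone"}},
    {"age": "20-29", "income": "60-90K", "buys": {"TV"}},
    {"age": "20-29", "income": "60-90K", "buys": {"laptop"}},
    {"age": "30-39", "income": "60-90K", "buys": {"TV"}},
    {"age": "20-29", "income": "30-60K", "buys": {"phone"}},
]

def antecedent(c):
    """Left-hand side: age(X, "20-29") ^ income(X, "60-90K")."""
    return c["age"] == "20-29" and c["income"] == "60-90K"

def consequent(c):
    """Right-hand side: buys(X, "TV")."""
    return "TV" in c["buys"]

n_total = len(customers)
n_antecedent = sum(antecedent(c) for c in customers)
n_both = sum(antecedent(c) and consequent(c) for c in customers)

support = n_both / n_total          # share of all customers matching the whole rule
confidence = n_both / n_antecedent  # share of matching customers who also buy a TV

print(f"support={support:.2f}, confidence={confidence:.2f}")
```

On this toy data, 3 customers match the left-hand side and 2 of them buy a TV.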
Data Mining Tasks
Classification and Prediction
Finding models (functions) that describe and distinguish classes or concepts for future prediction
Example: classify countries based on climate, or classify cars based on gas mileage
Presentation: IF-THEN rules, decision tree, classification rule, neural network
Prediction: Predict some unknown or missing numerical values
Data Mining Tasks
Cluster analysis
  Class label is unknown: Group data to form new classes
  Example: cluster houses to find distribution patterns
  Clustering based on the principle: maximizing the intra-class similarity and minimizing the inter-class similarity
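One common way to follow this principle is k-means, which repeatedly assigns each point to its nearest center and then moves each center to the mean of its points. The slides do not name a specific algorithm here, so k-means is an assumption of this sketch, and the house coordinates are invented.

```python
import math

# Hypothetical house coordinates, invented for illustration.
houses = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points serve as the initial centers."""
    centers = list(points[:k])
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, clusters

centers, clusters = kmeans(houses, k=2)
for center, members in zip(centers, clusters):
    print(center, members)
```

No class labels are supplied; the two groups emerge only from the distances between points, which is what distinguishes clustering from classification.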
Data Mining Applications
Science: Chemistry, Physics, Medicine
  Biochemical analysis
  Remote sensors on a satellite
  Telescopes – star galaxy classification
  Medical Image analysis
Data Mining Applications
Bioscience
  Sequence-based analysis
  Protein structure and function prediction
  Protein family classification
  Microarray gene expression
Data Mining Applications
Pharmaceutical companies, Insurance and Health care, Medicine
  Drug development
  Identify successful medical therapies
  Claims analysis, fraudulent behavior
  Medical diagnostic tools
  Predict office visits
Data Mining Applications
Financial Industry, Banks, Businesses, E-commerce
  Stock and investment analysis
  Identify loyal customers vs. risky customers
  Predict customer spending
  Risk management
  Sales forecasting
Data Mining Applications
Retail and Marketing
  Customer buying patterns/demographic characteristics
  Mailing campaigns
  Market basket analysis
  Trend analysis
Data Mining Applications
Database analysis and decision support
  Market analysis and management
    target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
  Risk analysis and management
    Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  Fraud detection and management
Data Mining Applications
Sports and Entertainment
  IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
Astronomy
  JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
DATA MINING EXAMPLES
Grocery store
NBA
Banking and Credit Card scoring
Fraud detection
Personalization & Customer Profiling
Campaign Management and Database Marketing
Data Mining Challenges
Computationally expensive to investigate all possibilities
Dealing with noise/missing information and errors in data
Choosing appropriate attributes/input representation
Finding the minimal attribute space
Finding adequate evaluation function(s)
Extracting meaningful information
Not overfitting
Summary
Data mining: discovering interesting patterns from large amounts of data
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Summary
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, association, classification, clustering, outlier and trend analysis, etc.
Classification of data mining systems
Major issues in data mining
Kinds of Data Mining
Decision Tree Learning
Clustering
Neural Networks
Association Rules
Support Vector Machines
Genetic Algorithms
Nearest Neighbor Method
DECISION TREE FOR THE CONCEPT “Play Tennis”

Day  Outlook   Temp  Humidity  Wind    PlayTennis
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No

[Mitchell, 1997]
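A decision tree learner has to decide which attribute to test first. The slide does not say how; the sketch below assumes information gain, the measure used by the ID3 algorithm in Mitchell (1997), and applies it to the 14 examples from the table. The function and variable names are this sketch's own.

```python
import math
from collections import Counter

COLUMNS = ("Outlook", "Temp", "Humidity", "Wind", "PlayTennis")
ROWS = [  # D1..D14 from the table, in order
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute):
    """Entropy of the labels minus the weighted entropy after splitting."""
    col = COLUMNS.index(attribute)
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

gains = {a: information_gain(ROWS, a) for a in COLUMNS[:-1]}
best = max(gains, key=gains.get)
for attribute, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{attribute:<9} {gain:.3f}")
print("best first split:", best)
```

Outlook has the highest gain on this data, so it becomes the root test under this measure.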
DECISION TREE FOR THE CONCEPT “Play Tennis”
[Mitchell, 1997]
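The tree that Mitchell (1997) gives for the PlayTennis data tests Outlook at the root, then Humidity under Sunny and Wind under Rain. Assuming that is the tree this slide shows, it can be written as IF-THEN rules and checked against the 14 training examples. The function name and data layout below are this sketch's own.

```python
def play_tennis(outlook, humidity, wind):
    """PlayTennis tree from Mitchell (1997), written as IF-THEN rules."""
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    # Remaining case: outlook == "Rain"
    return "Yes" if wind == "Weak" else "No"

# (Outlook, Humidity, Wind, PlayTennis) for D1..D14; Temp is unused by the tree.
EXAMPLES = [
    ("Sunny", "High", "Weak", "No"),
    ("Sunny", "High", "Strong", "No"),
    ("Overcast", "High", "Weak", "Yes"),
    ("Rain", "High", "Weak", "Yes"),
    ("Rain", "Normal", "Weak", "Yes"),
    ("Rain", "Normal", "Strong", "No"),
    ("Overcast", "Normal", "Strong", "Yes"),
    ("Sunny", "High", "Weak", "No"),
    ("Sunny", "Normal", "Weak", "Yes"),
    ("Rain", "Normal", "Weak", "Yes"),
    ("Sunny", "Normal", "Strong", "Yes"),
    ("Overcast", "High", "Strong", "Yes"),
    ("Overcast", "Normal", "Weak", "Yes"),
    ("Rain", "High", "Strong", "No"),
]

# Collect every example where the tree disagrees with the recorded label.
mismatches = [
    (day, row) for day, row in enumerate(EXAMPLES, start=1)
    if play_tennis(row[0], row[1], row[2]) != row[3]
]
print("misclassified training examples:", mismatches)
```

The tree reproduces all 14 training labels. That shows it is consistent with the table, not that it will generalize to new days.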