Data Mining – Day 1
Fabiano Dalpiaz
Department of Information and
Communication Technology
University of Trento - Italy
http://www.dit.unitn.it/~dalpiaz
Database and Business Intelligence
Academic Year 2007-2008
© P. Giorgini, F. Dalpiaz 2
Acknowledgements
This presentation is partially based on the slides for the book:
Data Mining: Concepts and Techniques, 2nd ed., Jiawei Han and Micheline Kamber
Two-day outline
  • Data Mining and KDD
  • Why Data Mining
  • Applications of Data Mining
  • Data Preprocessing
  • Data Mining techniques
  • Visualization of the results
  • Summary
Data Mining and KDD
[Figure: KDD conference logo]
Looking for knowledge
The explosive growth of data
  • The World Wide Web
  • Business: e-commerce, transactions, stocks, …
  • Science: remote sensing, bioinformatics, scientific simulation
  • Society and everyone: news, digital cameras, YouTube, forums, blogs, Google & Co
We are drowning in data, but starving for knowledge!
Avoid data tombs
“Necessity is the mother of invention”: data mining is the automated analysis of massive data sets.
What is Data Mining?
Data mining (knowledge discovery from data)
  • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
Alternative names
  • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Are simple search engines data mining?
Are queries data mining?
Are expert systems data mining?
Knowledge Discovery (KDD) Process
[Figure: the KDD process. Data sources → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Data Mining and Business Intelligence
[Figure: pyramid of layers. Axis labels: "Increasing potential to support business decisions" (toward the top) and "Quantity of data"]
Layers, from top to bottom:
  • Decision Making
  • Data Presentation: Visualization Techniques
  • Data Mining: Information Discovery
  • Data Exploration: Statistical Summary, Querying, and Reporting
  • Data Preprocessing/Integration, Data Warehouses
  • Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
Data Mining: confluence of multiple disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithms, Visualization and Other Disciplines]
Why Data Mining?
Why is Data Mining so complex?
A matter of data dimensions
Tremendous amount of data
  • Walmart – Customer buying patterns – a 7.5-terabyte data warehouse in 1995
  • VISA – Detecting credit card interoperability issues – 6800 payment transactions per second
High dimensionality of data
  • Many dimensions to be combined together
  • Data cube example: time, location, product sales
High complexity of data
  • Time-series data, temporal data, sequence data
  • Structured data, graphs, social networks and multi-linked data
  • Spatial, spatiotemporal, multimedia, text and Web data
What does Data Mining provide me with? (1)
Multidimensional concept description: characterization and discrimination
  • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
  • Characterization describes things in the same class, discrimination describes how to separate different classes
Frequent patterns, association, correlation vs. causality
  • Wine → Spaghetti [0.3% of all basket cases, 75% of cases when tomato sauce is bought]
  • Is this correlation or not?
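The two bracketed percentages in the rule above read like the usual support and confidence of an association rule, although the slide does not name them. Under that assumption, a minimal sketch of how the two measures are computed is given here; the baskets and function names are hypothetical and serve only as an illustration.

```python
def count_containing(baskets, items):
    """Number of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= basket)


def support(baskets, antecedent, consequent):
    """Fraction of all baskets that contain antecedent and consequent together."""
    both = set(antecedent) | set(consequent)
    return count_containing(baskets, both) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Among baskets with the antecedent, the fraction that also has the consequent."""
    both = set(antecedent) | set(consequent)
    return count_containing(baskets, both) / count_containing(baskets, antecedent)


# Hypothetical market baskets (not taken from the slides).
baskets = [
    {"wine", "spaghetti", "tomato sauce"},
    {"wine", "spaghetti"},
    {"wine", "bread"},
    {"spaghetti", "tomato sauce"},
    {"wine", "spaghetti", "cheese"},
]

rule_support = support(baskets, {"wine"}, {"spaghetti"})        # 3 of 5 baskets
rule_confidence = confidence(baskets, {"wine"}, {"spaghetti"})  # 3 of the 4 wine baskets
print(rule_support, rule_confidence)
```

A high confidence alone does not show that wine causes the purchase of spaghetti, which is the point of the question on the slide.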
What does Data Mining provide me with? (2)
Classification and prediction
  • Construct models (functions) that describe and distinguish classes or concepts for future prediction
  • E.g., classify countries based on climate, or classify cars based on gas mileage
  • Predict some unknown or missing numerical values
Cluster analysis
  • Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  • Maximizing intra-class similarity and minimizing inter-class similarity
What does Data Mining provide me with? (3)
Outlier analysis
  • Outlier: data object that does not comply with the general behavior of the data
  • Fraud detection is the main application area
  • Noise or exception?
Trend and evolution analysis
  • Trend and deviation: e.g., regression analysis
  • Sequential pattern mining: e.g., digital camera → large SD memory
  • Periodicity analysis
  • Similarity-based analysis
Applications of Data Mining
Market Analysis and Management
Data sources:
  • credit card transactions, loyalty cards, smart cards, discount coupons, ...
Target marketing
  • Find clusters of “model” customers who share the same characteristics:
    • Geographics (lives in Rome, lives in Trentino)
    • Demographics (married, between 21 and 35, at least one child, family income more than €40,000/year)
    • Psychographics (likes new products, consistently uses the Web)
    • Behaviors (searches for information on the Internet, always defends her decisions)
  • Determine customer purchasing patterns over time
Applications of Data Mining
Market Analysis and Management
Cross-market analysis
  • Find associations between product sales, and predict based on such associations
  • Compare the sales in the US and in Italy, find associations in old products and predict whether new ones will succeed
Customer profiling
  • What types of customers buy what products
  • Customers aged between 20 and 30 with income > 20K€ will buy product A
Customer requirement analysis
  • Identify the best products for different groups of customers
  • Predict what factors will attract new customers
Applications of Data Mining
Corporate Analysis
Finance planning and asset evaluation
  • Cash flow prediction and analysis
  • Cross-sectional and time-series analysis (financial ratio, trend analysis)
Resource planning
  • Summarize and compare the resources and spending
Competition
  • Monitor competitors and market directions
  • Group customers into classes and define a class-based pricing procedure
  • Set pricing strategy in a highly competitive market
Other examples?
What’s next?
Data Preprocessing
  • Why is it needed?
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy
Data Mining techniques
  • Frequent patterns, association rules
  • Classification and prediction
  • Cluster analysis
Visualization of the results
Summary
Are you sleeping?
Data Preprocessing
Why Data Preprocessing?
Data in the real world is dirty
  • Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • e.g., occupation=“ ”, birthdate=“31/12/2099”
  • Noisy: containing errors or outliers
    • e.g., Salary=“-10”
  • Inconsistent: containing discrepancies in codes or names
    • e.g., Age=“42” Birthday=“03/07/1997” (we are in 2007!!)
    • e.g., was rating “1, 2, 3”, now rating “A, B, C”
    • e.g., discrepancy between duplicate records: in one copy of the data customer A has to pay €200,000, in the second copy A does not have to pay anything
Why is data dirty?
Incomplete data may come from
  • “Not applicable” data value when collected
  • Different considerations between the time when the data was collected and when it is analyzed
  • Human/hardware/software problems
Noisy data (incorrect values) may come from
  • Faulty data collection instruments
  • Human or computer error at data entry
  • Errors in data transmission
Inconsistent data may come from
  • Different data sources
  • Functional dependency violation (e.g., modifying some linked data)
Why Is Data Preprocessing Important?
Data Preprocessing
1. Data cleaning – missing values
“Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)
Fill in missing values
  • Name=“John”, Occupation=“Lawyer”, Age=“28”, Salary=“”
  • Ignore the record (is it always feasible?)
  • Manually fill in the missing attributes
  • Automatically insert a constant
  • Automatically insert the mean value (relative to the record class)
  • Most probable value: make some inference!
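As an illustration of the "mean value relative to the record class" option, the following is a minimal sketch. It assumes that the class of a record is its Occupation and that a missing Salary is stored as `None`; the records other than John and the function name are hypothetical.

```python
from statistics import mean


def fill_missing_with_class_mean(records, class_key, value_key):
    """Return copies of the records where a missing value (None) is replaced
    by the mean of the known values in the same class."""
    known = {}
    for record in records:
        if record[value_key] is not None:
            known.setdefault(record[class_key], []).append(record[value_key])
    class_means = {cls: mean(values) for cls, values in known.items()}

    filled = []
    for record in records:
        record = dict(record)  # copy, so the input is left untouched
        if record[value_key] is None and record[class_key] in class_means:
            record[value_key] = class_means[record[class_key]]
        filled.append(record)
    return filled


# John comes from the slide; the other records are made up for the example.
records = [
    {"Name": "John", "Occupation": "Lawyer", "Age": 28, "Salary": None},
    {"Name": "Anna", "Occupation": "Lawyer", "Age": 35, "Salary": 50000},
    {"Name": "Marco", "Occupation": "Lawyer", "Age": 41, "Salary": 70000},
    {"Name": "Sara", "Occupation": "Teacher", "Age": 30, "Salary": 30000},
]

cleaned = fill_missing_with_class_mean(records, "Occupation", "Salary")
print(cleaned[0])
```

A record whose class has no known values is left unchanged, which corresponds to falling back to one of the other options in the list.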
Data Preprocessing
1. Data cleaning – binning
Handle noisy data
  • Binning, clustering, regression (no details)
Binning
1. Sort data by price (€): 4, 8, 9, 15, 21, 21, 24, 25, 26
2. Partition into equal-frequency (equi-depth) bins:
   Bin 1: 4, 8, 9
   Bin 2: 15, 21, 21
   Bin 3: 24, 25, 26
3. Smoothing by bin means:
   Bin 1: 7, 7, 7
   Bin 2: 19, 19, 19
   Bin 3: 25, 25, 25
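The three binning steps can be sketched in a few lines. This is a minimal illustration with hypothetical function names, and it assumes that the number of values is divisible by the number of bins, as in the price example.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    if len(values) % n_bins != 0:
        raise ValueError("this sketch assumes len(values) is divisible by n_bins")
    ordered = sorted(values)
    size = len(ordered) // n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_bin_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26]
bins = equal_frequency_bins(prices, 3)
smoothed = smooth_by_bin_means(bins)
print(bins)
print(smoothed)
```

With the prices from the slide, the bins and the smoothed values match the ones listed in steps 2 and 3.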
Data Preprocessing
1. Data cleaning – clustering
[Figure: clusters of data points; the points that fall outside the clusters are labeled "noise"]
Data Preprocessing
2. Integration and transformation
Data integration combines data from multiple sources into a coherent store
Schema integration
  • Integrate metadata from different sources
  • A.cust-id ≡ B.cust-number
Entity identification problem
  • Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
  • For the same real-world entity, attribute values from different sources are different (e.g., cm vs. inch)
[Figure: data sources D1, D2 and D3 combined into a single store D1,2,3]
Data Preprocessing
2. Integration and transformation
Data integration can lead to redundant attributes
  • Same object (A.house = B.residence)
  • Derived attributes (A.annualIncome = B.salary + C.rentalIncome)
Redundant attributes can be discovered via correlation analysis
  • A mathematical method for detecting the correlation between two attributes
  • Correlation coefficient (Pearson’s product moment coefficient): the closer its absolute value is to 1, the stronger the correlation between attributes
  • χ² (chi-square) test
  • No details on these methods here
Data Preprocessing
2. Integration and transformation
Aggregation
  • Sum the sales of different branches (in different data sources) to compute the company sales
Generalization: concept hierarchy climbing
  • From the integer attribute age to classes of age (children, adult, old)
Normalization: values are scaled to fall within a small, specified range
  • Change the range from [-∞, +∞] to [-1, +1]
  • {-13, -6, -3, 10, 100} → {-0.13, -0.06, -0.03, 0.1, 1}
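The slide does not name the normalization method. The numeric example is consistent with dividing every value by the largest absolute value (here 100), so the sketch below makes that assumption; the function name is hypothetical.

```python
def scale_by_max_abs(values):
    """Divide every value by the largest absolute value, so that the
    results fall within [-1, +1]."""
    largest = max(abs(v) for v in values)
    if largest == 0:
        return [0.0 for _ in values]  # all values are zero, nothing to scale
    return [v / largest for v in values]


values = [-13, -6, -3, 10, 100]
scaled = scale_by_max_abs(values)
print(scaled)
```

Other schemes, such as min-max or z-score normalization, map to a bounded or standardized range in a different way and would give different numbers for the same input.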
Data Preprocessing
3. Data reduction
Data reduction
  • Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
  • Different reduction types (dimensionality, numerosity, discretization)
Dimensionality: attribute subset selection
  • Example with a decision tree (left branches True, right False)
  • Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Figure: decision tree with internal nodes A4?, A1? and A6?, and leaves labeled Class 1 or Class 2]
  • Reduced attribute set: {A1, A4, A6}
Data Preprocessing
3. Data reduction
Dimensionality: Principal Components Analysis
  • Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
  • Works for numeric data only
  • Used when the number of dimensions is large
Numerosity: clustering
  • Partition the data set into clusters based on similarity, and store the cluster representation (e.g., centroid and diameter) only
[Figure: two panels. One shows 2 clusters; the other shows sparse data, which leads to many clusters and is not effective]
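To make the idea of storing only a cluster representation concrete, here is a minimal sketch that reduces one cluster to its centroid and diameter. It assumes that the diameter is the maximum distance between any two points of the cluster, which is one common definition; the points and the function name are hypothetical.

```python
from itertools import combinations
from math import dist


def summarize_cluster(points):
    """Return (centroid, diameter) for a non-empty list of points,
    each point being a tuple of coordinates."""
    n = len(points)
    dims = len(points[0])
    centroid = tuple(sum(p[d] for p in points) / n for d in range(dims))
    # Assumption: diameter = largest pairwise Euclidean distance.
    diameter = max((dist(p, q) for p, q in combinations(points, 2)), default=0.0)
    return centroid, diameter


cluster = [(0.0, 0.0), (3.0, 4.0), (3.0, 0.0)]
centroid, diameter = summarize_cluster(cluster)
print(centroid, diameter)
```

The three points are replaced by one centroid and one number, which is where the reduction in data volume comes from.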
Data Preprocessing
3. Data reduction
Numerosity: sampling
  • Obtain a small sample s to represent the whole data set N
  • Problem: how to select a representative sampling set
  • Random sampling is not enough: representative samples should be preserved
  • Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database
[Figure: random sampling vs. stratified sampling; under random sampling one region of the data is marked "No samples from here"]
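Stratified sampling can be sketched as follows: group the records by class, then draw the same fraction from every group. This is a minimal illustration; the population, the fixed seed and the rule of keeping at least one record per class are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(records, class_of, fraction, seed=0):
    """Draw roughly `fraction` of the records from every class, so that each
    class keeps approximately its share of the whole data set."""
    rng = random.Random(seed)  # fixed seed only to make the example repeatable
    strata = defaultdict(list)
    for record in records:
        strata[class_of(record)].append(record)

    sample = []
    for members in strata.values():
        # Assumption: keep at least one record from every class.
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 90 records of class "A" and 10 of class "B".
population = [("A", i) for i in range(90)] + [("B", i) for i in range(10)]
sample = stratified_sample(population, class_of=lambda r: r[0], fraction=0.1)
print(len(sample))
```

A plain random sample of 10 records from this population could easily contain no record of class "B", which is the situation marked in the figure.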
Data Preprocessing
4. Discretization - concept hierarchy
Three types of attributes
  • Nominal: values from an unordered set (color, profession)
  • Ordinal: values from an ordered set (military or academic rank)
  • Continuous: numbers (integer or real numbers)
Discretization
  • Divide the range of a continuous attribute into intervals
  • Reduces data size and its complexity
  • Some data mining algorithms do not support continuous types, and in those cases discretization is mandatory
Some useful methods
  • Binning, clustering (already presented)
  • Entropy-based discretization (no details here)
Data Preprocessing
4. Discretization - concept hierarchy
Concept hierarchy generation for categorical data
  • Specification of an ordering between attributes (schema level)
    • street < city < state < country
  • Specification of a hierarchy of values (data level)
    • {Urbana, Champaign, Chicago} < Illinois
  • Automatic generation using the number of distinct values
    • For the set of attributes: {street, city, state, country}
    • IF: |street| = 600,000, |city| = 3,000, |state| = 300, |country| = 15
    • THEN: street < city < state < country
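The automatic generation rule amounts to sorting the attributes by their number of distinct values: the attribute with the most distinct values becomes the lowest level of the hierarchy. A minimal sketch, with a hypothetical function name and the counts from the slide:

```python
def hierarchy_from_distinct_counts(distinct_counts):
    """Order attribute names from the lowest hierarchy level (most distinct
    values) to the highest level (fewest distinct values)."""
    return sorted(distinct_counts, key=distinct_counts.get, reverse=True)


# Distinct-value counts from the slide, deliberately given in arbitrary order.
counts = {"country": 15, "street": 600_000, "state": 300, "city": 3_000}
levels = hierarchy_from_distinct_counts(counts)
print(" < ".join(levels))
```

The heuristic is only a starting point: an attribute such as weekday has few distinct values but does not sit above month or year, so the generated order may need manual correction.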
Day 1 Summary
  • Data Mining and KDD
  • Why Data Mining
  • Applications of Data Mining
  • Data Preprocessing
    • Data Cleaning
    • Data Integration and Transformation
    • Data Reduction
    • Discretization and concept hierarchy
Tomorrow?
  • Data Mining techniques
  • Results visualization
  • Summary
Questions?