Naeem Ahmed
Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro
Email: [email protected]
Data Mining Concepts & Techniques Lecture No. 02
Data Preprocessing, Data Mining
Outline
• Data Preprocessing
• Data Cleaning, Transformation
• Data Mining
• Data Mining Tasks, Applications, Challenges
Acknowledgements: Introduction to Data Mining © Tan, Steinbach, Kumar and George Kollios, Homepage: http://www.cs.bu.edu/fac/gkollios/dm07
Data
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – An attribute is also known as a variable, field, characteristic, or feature
• A collection of attributes describes an object
  – An object is also known as a record, point, case, sample, entity, or instance
Example (columns are attributes; rows are objects):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data: Types of Attributes
• There are different types of attributes
  – Nominal. Examples: ID numbers, eye color, zip codes
  – Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
  – Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit
  – Ratio. Examples: temperature in Kelvin, length, time, counts
Data: Properties of Attribute Values
• The type of an attribute depends on which of the following properties it possesses:
  – Distinctness: = ≠
  – Order: < >
  – Addition: + −
  – Multiplication: * /
• Nominal attribute: distinctness
• Ordinal attribute: distinctness & order
• Interval attribute: distinctness, order & addition
• Ratio attribute: all four properties
Discrete and Continuous Attributes
• Discrete Attribute
  – Has only a finite or countably infinite set of values
  – Examples: zip codes, counts, or the set of words in a collection of documents
  – Often represented as integer variables
  – Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
  – Has real numbers as attribute values
  – Examples: temperature, height, or weight
  – Practically, real values can only be measured and represented using a finite number of digits
  – Continuous attributes are typically represented as floating-point variables
Types of Datasets
• Record
  – Data Matrix
  – Document Data
  – Transaction Data
• Graph
  – World Wide Web
  – Molecular Structures
• Ordered
  – Spatial Data
  – Temporal Data
  – Sequential Data
  – Genetic Sequence Data
Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Matrix
• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
• Such a data set can be represented by an m-by-n matrix, with m rows, one for each object, and n columns, one for each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
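As a minimal sketch, such a matrix can be held as a plain list of rows, one per object, and any two objects compared as points in attribute space. The Euclidean distance function below is a standard illustration, not something the slide prescribes:

```python
import math

# The 2x5 data matrix above: 2 objects (rows) x 5 numeric attributes (columns).
data_matrix = [
    # x-load, y-load, distance, load, thickness
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

def euclidean(p, q):
    """Euclidean distance between two objects viewed as points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(round(euclidean(data_matrix[0], data_matrix[1]), 3))  # 2.842
```

Because each dimension is a distinct attribute, any geometric notion (distance, angle, projection) applies directly to the rows.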
Document Data
• Each document becomes a 'term' vector,
  – each term is a component (attribute) of the vector,
  – the value of each component is the number of times the corresponding term occurs in the document

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
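A minimal sketch of building such a term vector. Only the vocabulary comes from the table above; the sample document text is made up for illustration:

```python
from collections import Counter

# Vocabulary = the attributes (components) of every term vector.
vocab = ["team", "coach", "play", "ball", "score",
         "game", "win", "lost", "timeout", "season"]

def term_vector(text, vocab):
    """Count how many times each vocabulary term occurs in the document."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]

doc = "the team lost the game after a timeout but the coach liked the play"
print(term_vector(doc, vocab))  # [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
```

Words outside the vocabulary are simply ignored, so every document maps onto a vector of the same fixed length.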
Transaction Data
• A special type of record data, where
  – each record (transaction) involves a set of items.
  – For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
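As an illustration, the support count of each item (how many transactions contain it) can be tallied directly from the table above:

```python
from collections import Counter

# The five transactions from the grocery-store example, as sets of items.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

# Support count: the number of transactions in which each item appears.
support = Counter(item for t in transactions for item in t)
print(support["Milk"])  # 4 (Milk appears in 4 of the 5 transactions)
```

Counts like these are the starting point for the dependency/association tasks discussed later in the lecture.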
Graph Data • Examples: Generic graph and HTML Links
(Figure: generic graph with numbered nodes omitted.)
<a href="papers/papers.html#bbbb">Data Mining </a>
<li><a href="papers/papers.html#aaaa">Graph Partitioning </a>
<li><a href="papers/papers.html#aaaa">Parallel Solution of Sparse Linear System of Equations </a>
<li><a href="papers/papers.html#ffff">N-Body Computation and Dense Linear System Solvers
Ordered Data
• Sequences of transactions: each element of the sequence is a set of items/events
Ordered Data • Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCCCGCAGGGCCCGCCCCGCGCCGTCGAGAAGGGCCCGCCTGGCGGGCGGGGGGAGGCGGGGCCGCCCGAGCCCAACCGAGTCCGACCAGGTGCCCCCTCTGCTCGGCCTAGACCTGAGCTCATTAGGCGGCAGCGGACAGGCCAAGTAGAACACGCGAAGCGCTGGGCTGCCTGCTGCGACCAGGG
Data Quality
• What kinds of data quality problems are there?
• How can we detect problems with the data?
• What can we do about these problems?
• Examples of data quality problems:
  – Noise and outliers
  – Missing values
  – Duplicate data
Why Data Preprocessing?
• Data in the real world is dirty
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – noisy: containing errors or outliers
  – inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
  – Quality decisions must be based on quality data
  – A data warehouse needs consistent integration of quality data
  – Required for both OLAP and Data Mining!
Why Data Preprocessing?
• Why can data be incomplete?
  – Attributes of interest are not available (e.g., customer information for sales transaction data)
  – Data were not considered important at the time of the transactions, so they were not recorded!
  – Data were not recorded because of misunderstandings or malfunctions
  – Data may have been recorded and later deleted!
  – Missing/unknown values for some data
Why Data Preprocessing?
• Why can data be noisy/inconsistent?
  – Faulty instruments for data collection
  – Human or computer errors
  – Errors in data transmission
  – Technology limitations (e.g., sensor data arrive at a faster rate than they can be processed)
  – Inconsistencies in naming conventions or data codes (e.g., 4/1/2015 could mean 4 January 2015 or 1 April 2015)
  – Duplicate tuples, e.g., records received twice, should also be removed
Major Tasks in Data Preprocessing
• Data cleaning
  – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  – Integration of multiple databases, data cubes, or files
• Data transformation
  – Normalization and aggregation
• Data reduction
  – Obtains a reduced representation, smaller in volume, that produces the same or similar analytical results
• Data discretization
  – Part of data reduction, of particular importance for numerical data
Data Cleaning
• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
Data Cleaning
• How to handle missing data?
  – Ignore the tuple: usually done when the class label is missing (assuming a classification task); not effective when the percentage of missing values per attribute varies considerably
  – Fill in the missing value manually: tedious and often infeasible
  – Use a global constant to fill in the missing value: e.g., "unknown" (a new class?!)
  – Use the attribute mean to fill in the missing value
  – Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
  – Use the most probable value to fill in the missing value: inference-based, e.g., a Bayesian formula or a decision tree
Data Cleaning
• How to handle missing data?

Age  Income  Religion   Gender
23   24,200  Muslim     M
39   ?       Christian  F
45   45,390  ?          F

Fill missing values using aggregate functions (e.g., the average) or probabilistic estimates over the global value distribution. E.g., fill in the average income, or the most probable income given that the person is 39 years old; for the missing religion, fill in the most frequent value.
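A minimal sketch of the attribute-mean strategy applied to the income column above. The `rows` structure and its field names are hypothetical; only the values come from the table:

```python
# Missing entries are marked with None, as in the table's "?" cells.
rows = [
    {"age": 23, "income": 24200, "religion": "Muslim"},
    {"age": 39, "income": None,  "religion": "Christian"},
    {"age": 45, "income": 45390, "religion": None},
]

# Compute the attribute mean over the known values, then fill the gaps.
known = [r["income"] for r in rows if r["income"] is not None]
mean_income = sum(known) / len(known)
for r in rows:
    if r["income"] is None:
        r["income"] = mean_income

print(rows[1]["income"])  # 34795.0
```

The class-conditional variant works the same way, except the mean is computed only over rows sharing the tuple's class label.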
Data Cleaning
• Noisy Data
  – Noise: random error or variance in a measured variable
  – Incorrect attribute values may exist due to
    • faulty data collection instruments
    • data entry problems
    • data transmission problems
    • technology limitations
    • inconsistency in naming conventions
  – Other data problems that require data cleaning
    • duplicate records
    • incomplete data
    • inconsistent data
Data Cleaning
• How to handle noisy data? Smoothing techniques:
  – Binning method:
    • first sort the data and partition it into (equi-depth) bins
    • then smooth by bin means, bin medians, bin boundaries, etc.
  – Clustering
    • detect and remove outliers
  – Combined computer and human inspection
    • the computer detects suspicious values, which are then checked by humans
  – Regression
    • smooth by fitting the data to regression functions
Data Cleaning
• Simple Discretization Methods: Binning
  – Equal-width (distance) partitioning:
    • Divides the range into N intervals of equal size: a uniform grid
    • If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B − A)/N
    • The most straightforward approach
    • But outliers may dominate the presentation
    • Skewed data is not handled well
  – Equal-depth (frequency) partitioning:
    • Divides the range into N intervals, each containing approximately the same number of samples
    • Good data scaling and good handling of skewed data
Simple Discretization Methods: Binning
Example: customer ages (figure: histograms of the number of values per bin)

Equi-width binning:
0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80

Equi-depth binning:
0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80
Example: Smoothing Using Binning Methods
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries: [4,15], [21,25], [26,34]
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
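A sketch of the two smoothing rules applied to the bins above. Two conventions are assumed that the slide does not specify: means are rounded to the nearest integer, and a value exactly halfway between the bin boundaries goes to the lower one:

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value by the closer of its bin's min/max boundary."""
    out = []
    for b in bins:
        lo, hi = min(b), max(b)
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

# The equi-depth bins from the price example above.
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```

Both outputs reproduce the slide's worked example exactly (the bin-2 mean 22.75 rounds to 23).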
Data Cleaning: Regression
(Figure: example of linear regression; a scatter of points (x, y), e.g., age vs. salary, smoothed by the fitted line y = x + 1.)
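A minimal ordinary-least-squares sketch of regression smoothing. The sample points are chosen to lie exactly on the slide's line y = x + 1, so the fit recovers it; nothing here comes from a specific library:

```python
def fit_line(xs, ys):
    """Simple linear regression y = a*x + b by ordinary least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Points on y = x + 1; with noisy data, replacing each y by a*x + b smooths it.
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 4, 5, 6]
a, b = fit_line(xs, ys)
print(a, b)  # 1.0 1.0
```

Once the line is fitted, smoothing means replacing each observed y with its predicted value a*x + b.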
Data Cleaning
• Inconsistent Data
  – Inconsistent data are handled by:
    • Manual correction (expensive and tedious)
    • Routines designed to detect inconsistencies so they can be corrected manually. E.g., a routine may check global constraints (age > 10) or functional dependencies
    • Other inconsistencies (e.g., between names of the same attribute) can be corrected during the data integration process
Data Integration
• Data integration:
  – combines data from multiple sources into a coherent store
• Schema integration
  – integrate metadata from different sources
    • metadata: data about the data (i.e., data descriptors)
  – Entity identification problem: identify real-world entities across multiple data sources, e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
  – for the same real-world entity, attribute values from different sources differ (e.g., J. D. Smith and John Smith may refer to the same person)
  – possible reasons: different representations, different scales, e.g., metric vs. British units (cm vs. inches)
Data Integration
• How to handle redundant data in data integration?
  – Redundant data occur often when integrating multiple databases
    • The same attribute may have different names in different databases
    • One attribute may be a "derived" attribute in another table, e.g., annual revenue
  – Redundant data may be detected by correlation analysis
  – Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
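The correlation analysis mentioned above can be sketched with the Pearson coefficient. The monthly/annual revenue figures below are invented to show a perfectly correlated (hence redundant) derived attribute:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# annual = 12 * monthly, so after integration one of the two is redundant.
monthly = [10, 20, 30, 40]
annual = [120, 240, 360, 480]
print(pearson(monthly, annual))  # ~1.0: perfectly correlated
```

A coefficient near +1 or -1 flags one attribute as (approximately) derivable from the other and therefore a candidate for removal.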
Data Transformation
• Smoothing: remove noise from the data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scale values to fall within a small, specified range
  – min-max normalization
  – z-score normalization
  – normalization by decimal scaling
• Attribute/feature construction
  – New attributes constructed from the given ones
Data Transformation
• Why normalization?
  – Speeds up some learning techniques (e.g., neural networks)
  – Helps prevent attributes with large ranges from outweighing ones with small ranges
• Example:
  – income has range 3,000-200,000
  – age has range 10-80
  – gender has domain M/F
Data Transformation
• Normalization
  – min-max normalization:
    v' = (v − min_A) / (max_A − min_A) * (new_max_A − new_min_A) + new_min_A
    • e.g., convert age = 30 to the range 0-1, with min = 10 and max = 80:
      new_age = (30 − 10) / (80 − 10) = 2/7
  – z-score normalization:
    v' = (v − mean_A) / stand_dev_A
  – normalization by decimal scaling:
    v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
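The three formulas translate directly into code. This sketch uses the slide's age example for min-max; the values passed to decimal scaling are invented for illustration:

```python
def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    """Min-max normalization: map v from [lo, hi] onto [new_lo, new_hi]."""
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mean, std):
    """Z-score normalization: how many standard deviations v is from the mean."""
    return (v - mean) / std

def decimal_scaling(values):
    """Divide by the smallest power of 10 that makes every |v'| < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

# The slide's example: age = 30 with min = 10, max = 80 maps to 2/7.
print(round(min_max(30, 10, 80), 4))  # 0.2857
```

Min-max needs the observed range up front; z-score does not, which makes it the usual choice when future out-of-range values are expected.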
Essential Terms
• Data: a set of facts (items) D, usually stored in a database
• Pattern: an expression E in a language L that describes a subset of the facts
• Attribute: a field in an item i in D
• Interestingness: a function I_{D,L} that maps an expression E in L into a measure space M
Data Mining
• Data Mining
  – The efficient discovery of previously unknown, valid, potentially useful, understandable patterns in large datasets
  – The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner
• The Data Mining Task
  – Given a dataset D, a language of facts L, an interestingness function I_{D,L}, and a threshold c, efficiently find the expressions E such that I_{D,L}(E) > c
Data Mining
• What is Data Mining?
  – Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly... in the Boston area): a pattern
  – Grouping together similar documents returned by a search engine according to their context (e.g., Amazon rainforest vs. Amazon.com): clustering
• What is not Data Mining?
  – Looking up a phone number in a phone directory
  – Querying a Web search engine for information about "Amazon"
Data Mining
• How is Data Mining used?
  1) Identify the problem
  2) Use data mining techniques to transform the data into information
  3) Act on the information
  4) Measure the results
Data Mining
• The Data Mining Process
  1) Understand the domain
  2) Create a dataset
     – Select the interesting attributes
     – Data cleaning and preprocessing
  3) Choose the data mining task and the specific algorithm
  4) Interpret the results, and possibly return to step 2
Data Mining
• The origin of Data Mining
  – Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
  – Data Mining must address
    • The enormity of the data
    • The high dimensionality of the data
    • The heterogeneous, distributed nature of the data

(Figure: Venn diagram placing Data Mining at the intersection of Machine Learning/Pattern Recognition, Statistics/AI, and Database Systems.)
Data Mining Tasks
• Prediction Methods
  – Use some variables to predict unknown or future values of other variables
• Description Methods
  – Find human-interpretable patterns that describe the data
Data Mining Tasks
• Classification [Predictive]: learn a function that maps an item into one of a set of predefined classes
• Regression [Predictive]: learn a function that maps an item to a real value
• Clustering [Descriptive]: identify a set of groups of similar items
• Dependencies and associations [Descriptive]: identify significant dependencies between data attributes
• Summarization [Descriptive]: find a compact description of the dataset or a subset of the dataset
Data Mining Applications
• Fraud detection: credit cards, phone cards
• Marketing: customer targeting
• Data Warehousing: business enterprises
• Astronomy
• Molecular biology
Challenges of Data Mining
• Scalability
• Dimensionality
• Complex and Heterogeneous Data
• Data Quality
• Data Ownership and Distribution
• Privacy Preservation
• Streaming Data