Naeem Ahmed
Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro
Email: [email protected]

Data Mining Concepts & Techniques
Lecture No. 02: Data Processing, Data Mining



Outline
•  Data Preprocessing
•  Data Cleaning, Transformation
•  Data Mining
•  Data Mining Tasks, Applications, Challenges

Acknowledgements: Introduction to Data Mining © Tan, Steinbach, Kumar and George Kollios, Homepage: http://www.cs.bu.edu/fac/gkollios/dm07

Data
•  Collection of data objects and their attributes
•  An attribute is a property or characteristic of an object
   –  Examples: eye color of a person, temperature, etc.
   –  An attribute is also known as a variable, field, characteristic, or feature
•  A collection of attributes describes an object
   –  An object is also known as a record, point, case, sample, entity, or instance

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

In the table, each row is an object and each column is an attribute.

Data: Types of Attributes
•  There are different types of attributes
   –  Nominal
      •  Examples: ID numbers, eye color, zip codes
   –  Ordinal
      •  Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
   –  Interval
      •  Examples: calendar dates, temperatures in Celsius or Fahrenheit
   –  Ratio
      •  Examples: temperature in Kelvin, length, time, counts

Data: Properties of Attribute Values
•  The type of an attribute depends on which of the following properties it possesses:
   –  Distinctness: = ≠
   –  Order: < >
   –  Addition: + -
   –  Multiplication: * /
•  Each attribute type supports a growing subset of these properties:
   –  Nominal attribute: distinctness
   –  Ordinal attribute: distinctness & order
   –  Interval attribute: distinctness, order & addition
   –  Ratio attribute: all 4 properties
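
The difference between interval and ratio attributes can be checked numerically. The following is a minimal sketch with illustrative temperatures (the values and the helper `celsius_to_kelvin` are assumptions, not part of the slides): differences agree on both scales, but a ratio is only meaningful on the Kelvin scale, which has a true zero.

```python
# Illustrative values only; they do not come from the slides.
def celsius_to_kelvin(c: float) -> float:
    """Convert a Celsius reading to Kelvin."""
    return c + 273.15

low_c, high_c = 10.0, 20.0
low_k, high_k = celsius_to_kelvin(low_c), celsius_to_kelvin(high_c)

# Addition/subtraction is meaningful for interval attributes:
# the difference is 10 degrees on both scales.
diff_c = high_c - low_c
diff_k = high_k - low_k

# Multiplication/division is only meaningful for ratio attributes:
# 20 C is not "twice as hot" as 10 C, as the Kelvin ratio shows.
ratio_c = high_c / low_c  # 2.0, but physically meaningless
ratio_k = high_k / low_k  # roughly 1.035

print(diff_c, diff_k, ratio_c, ratio_k)
```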

Discrete and Continuous Attributes
•  Discrete Attribute
   –  Has only a finite or countably infinite set of values
   –  Examples: zip codes, counts, or the set of words in a collection of documents
   –  Often represented as integer variables
   –  Note: binary attributes are a special case of discrete attributes
•  Continuous Attribute
   –  Has real numbers as attribute values
   –  Examples: temperature, height, or weight
   –  Practically, real values can only be measured and represented using a finite number of digits
   –  Continuous attributes are typically represented as floating-point variables

Types of Dataset
•  Record
   –  Data Matrix
   –  Document Data
   –  Transaction Data
•  Graph
   –  World Wide Web
   –  Molecular Structures
•  Ordered
   –  Spatial Data
   –  Temporal Data
   –  Sequential Data
   –  Genetic Sequence Data

Record Data
•  Data that consists of a collection of records, each of which consists of a fixed set of attributes
•  Example: the Tid / Refund / Marital Status / Taxable Income / Cheat table shown on the "Data" slide above, where every record has the same five attributes

Data Matrix
•  If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
•  Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x load  Projection of y load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
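
As a rough sketch, the two-object example above can be stored as an m by n list of lists, and each row treated as a point in n-dimensional space. The snake_case attribute names are assumptions made for illustration, and `math.dist` requires Python 3.8 or later.

```python
import math

# Column names are illustrative renderings of the slide's attribute labels.
attributes = ["projection_x_load", "projection_y_load", "distance", "load", "thickness"]

# One row per object, one column per attribute.
data_matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

m = len(data_matrix)      # number of objects (rows)
n = len(data_matrix[0])   # number of attributes (columns)

# Because every object has the same numeric attributes, the rows are
# points in n-dimensional space and a distance between them is defined.
separation = math.dist(data_matrix[0], data_matrix[1])

print(m, n, round(separation, 3))
```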

Document Data
•  Each document becomes a 'term' vector,
   –  each term is a component (attribute) of the vector,
   –  the value of each component is the number of times the corresponding term occurs in the document

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
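
A term vector of this kind can be built with a simple word count. The sketch below assumes whitespace tokenization and a made-up sentence (neither comes from the slides); the vocabulary is the one used in the table above.

```python
from collections import Counter

# Vocabulary taken from the slide's term-document table.
vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

def term_vector(text, vocab):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())  # naive whitespace tokenizer
    return [counts[term] for term in vocab]

# Hypothetical document, used only to illustrate the idea.
document = "the team lost the game but the coach praised the team play"
vector = term_vector(document, vocabulary)

print(vector)
```

Words outside the vocabulary (such as "the") are simply ignored, so every document maps to a vector of the same length.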

Transaction Data
•  A special type of record data, where
   –  each record (transaction) involves a set of items.
   –  For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
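
One possible in-memory representation of the transactions above is a mapping from TID to a set of items. This is only a sketch; the choice of `dict` and `set` is an assumption, not something the slides prescribe.

```python
from collections import Counter

# The five transactions from the slide: TID -> set of purchased items.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Count in how many transactions each item appears.
item_counts = Counter(item for items in transactions.values() for item in items)

print(item_counts.most_common())
```

Using sets reflects that a transaction records which items were bought, not their order.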

Graph Data
•  Examples: generic graph and HTML links

[Figure: a generic graph with numbered labels]

<a href="papers/papers.html#bbbb">Data Mining </a>
<li><a href="papers/papers.html#aaaa">Graph Partitioning </a>
<li><a href="papers/papers.html#aaaa">Parallel Solution of Sparse Linear System of Equations </a>
<li><a href="papers/papers.html#ffff">N-Body Computation and Dense Linear System Solvers

Ordered Data
•  Sequences of transactions

[Figure: a sequence of transactions; each element of the sequence is a set of items/events]

Ordered Data
•  Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCCCGCAGGGCCCGCCCCGCGCCGTCGAGAAGGGCCCGCCTGGCGGGCGGGGGGAGGCGGGGCCGCCCGAGCCCAACCGAGTCCGACCAGGTGCCCCCTCTGCTCGGCCTAGACCTGAGCTCATTAGGCGGCAGCGGACAGGCCAAGTAGAACACGCGAAGCGCTGGGCTGCCTGCTGCGACCAGGG

Data Quality
•  What kinds of data quality problems are there?
•  How can we detect problems with the data?
•  What can we do about these problems?
•  Examples of data quality problems:
   –  Noise and outliers
   –  Missing values
   –  Duplicate data

Why Data Processing?
•  Data in the real world is dirty
   –  incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
   –  noisy: containing errors or outliers
   –  inconsistent: containing discrepancies in codes or names
•  No quality data, no quality mining results!
   –  Quality decisions must be based on quality data
   –  A data warehouse needs consistent integration of quality data
   –  Required for both OLAP and Data Mining!

Why Data Processing?
•  Why can data be incomplete?
   –  Attributes of interest are not available (e.g., customer information for sales transaction data)
   –  Data were not considered important at the time of the transactions, so they were not recorded
   –  Data were not recorded because of misunderstanding or malfunctions
   –  Data may have been recorded and later deleted
   –  Missing/unknown values for some data

Why Data Processing?
•  Why can data be noisy/inconsistent?
   –  Faulty instruments for data collection
   –  Human or computer errors
   –  Errors in data transmission
   –  Technology limitations (e.g., sensor data arrive at a faster rate than they can be processed)
   –  Inconsistencies in naming conventions or data codes (e.g., 4/1/2015 could be 4 January 2015 or 1 April 2015)
   –  Duplicate tuples, i.e., tuples that were received twice, should also be removed
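
The date ambiguity mentioned above is easy to reproduce. In this minimal sketch the same string is parsed under two conventions with the standard `datetime.strptime`; which convention a given source actually uses is an assumption that has to be settled during cleaning.

```python
from datetime import datetime

raw = "4/1/2015"

# Day-first convention: 4 January 2015.
day_first = datetime.strptime(raw, "%d/%m/%Y")

# Month-first convention: 1 April 2015.
month_first = datetime.strptime(raw, "%m/%d/%Y")

print(day_first.date(), month_first.date())
```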

Major Tasks in Data Preprocessing
•  Data cleaning
   –  Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
•  Data integration
   –  Integration of multiple databases, data cubes, or files
•  Data transformation
   –  Normalization and aggregation
•  Data reduction
   –  Obtains a representation that is reduced in volume but produces the same or similar analytical results
•  Data discretization
   –  Part of data reduction but with particular importance, especially for numerical data


Data Cleaning
•  Data cleaning tasks
   –  Fill in missing values
   –  Identify outliers and smooth out noisy data
   –  Correct inconsistent data

Data Cleaning
•  How to handle missing data?
   –  Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
   –  Fill in the missing value manually: tedious and often infeasible
   –  Use a global constant to fill in the missing value: e.g., “unknown” (effectively a new class?)
   –  Use the attribute mean to fill in the missing value
   –  Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
   –  Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree
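
Two of the strategies above, the global attribute mean and the per-class attribute mean, can be sketched as follows. The records, class labels, and function names are hypothetical and chosen only to make the difference visible.

```python
from statistics import mean

# Hypothetical records: (class_label, income); None marks a missing value.
records = [
    ("yes", 50.0), ("yes", None), ("yes", 70.0),
    ("no", 20.0), ("no", None), ("no", 40.0),
]

def fill_with_global_mean(rows):
    """Replace missing values with the mean over all known values."""
    overall = mean(v for _, v in rows if v is not None)
    return [(c, overall if v is None else v) for c, v in rows]

def fill_with_class_mean(rows):
    """Replace missing values with the mean of the same class."""
    known = {}
    for c, v in rows:
        if v is not None:
            known.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in known.items()}
    return [(c, class_means[c] if v is None else v) for c, v in rows]

global_filled = fill_with_global_mean(records)
class_filled = fill_with_class_mean(records)

print(global_filled[1], class_filled[1], class_filled[4])
```

With these numbers the global mean fills both gaps with 45.0, while the class means fill them with 60.0 and 30.0, which is why the slide calls the per-class variant "smarter".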

Data Cleaning
How to handle missing data?

Age  Income  Religion   Gender
23   24,200  Muslim     M
39   ?       Christian  F
45   45,390  ?          F

•  Fill missing values using aggregate functions (e.g., average) or probabilistic estimates on the global value distribution
   –  E.g., put the average income in place of the missing income, or put the most probable income given that the person is 39 years old
   –  E.g., put the most frequent religion in place of the missing religion

Data Cleaning
•  Noisy Data
   –  Noise: random error or variance in a measured variable
   –  Incorrect attribute values may exist due to
      •  faulty data collection instruments
      •  data entry problems
      •  data transmission problems
      •  technology limitations
      •  inconsistency in naming conventions
   –  Other data problems which require data cleaning
      •  duplicate records
      •  incomplete data
      •  inconsistent data

Data Cleaning
•  How to handle noisy data? Smoothing techniques
   –  Binning method:
      •  first sort the data and partition it into (equi-depth) bins
      •  then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
   –  Clustering
      •  detect and remove outliers
   –  Combined computer and human inspection
      •  computer detects suspicious values, which are then checked by humans
   –  Regression
      •  smooth by fitting the data to regression functions

Data Cleaning
•  Simple Discretization Methods: Binning
   –  Equal-width (distance) partitioning:
      •  It divides the range into N intervals of equal size: uniform grid
      •  If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B-A)/N
      •  The most straightforward approach
      •  But outliers may dominate the presentation
      •  Skewed data is not handled well
   –  Equal-depth (frequency) partitioning:
      •  It divides the range into N intervals, each containing approximately the same number of samples
      •  Good data scaling; good handling of skewed data
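
The two partitioning schemes can be sketched in a few lines. The function names are assumptions, the sample values are the price list used in this lecture's smoothing example, and the equal-width version assumes the values are not all identical (otherwise W would be zero).

```python
def equal_width_bins(values, n_bins):
    """Split [min, max] into n_bins intervals of width W = (B - A) / N."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The maximum value would land in bin n_bins, so clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins groups of (almost) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width_bins = equal_width_bins(prices, 3)
depth_bins = equal_depth_bins(prices, 3)

print(width_bins)
print(depth_bins)
```

On this list the equal-width bins hold 3, 3, and 6 values, while the equal-depth bins hold 4 values each.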

Simple Discretization Methods: Binning
Example: customer ages
•  Equi-width binning: 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80
•  Equi-depth binning: 0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80

[Figure: histograms showing the number of values in each bin]

Example: Smoothing using Binning Methods
*  Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
*  Partition into (equi-depth) bins:
   -  Bin 1: 4, 8, 9, 15
   -  Bin 2: 21, 21, 24, 25
   -  Bin 3: 26, 28, 29, 34
*  Smoothing by bin means (rounded to the nearest dollar):
   -  Bin 1: 9, 9, 9, 9
   -  Bin 2: 23, 23, 23, 23
   -  Bin 3: 29, 29, 29, 29
*  Smoothing by bin boundaries: [4,15], [21,25], [26,34]
   -  Bin 1: 4, 4, 4, 15
   -  Bin 2: 21, 21, 25, 25
   -  Bin 3: 26, 26, 26, 34
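
The worked example can be reproduced with a short sketch. The function names are assumptions; rounding the means to whole dollars and sending ties to the lower boundary are choices made here for illustration (no tie occurs in this data).

```python
# The three equi-depth bins from the example.
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

def smooth_by_means(bins):
    """Replace every value by the (rounded) mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)

print(by_means)
print(by_boundaries)
```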

Data Cleaning: Regression

[Figure: example of linear regression; data points plotted on axes x and y (labelled age and salary), with the fitted line y = x + 1 and a marked point (X1, Y1)]
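
Smoothing by regression can be sketched with an ordinary least-squares line fit. The data points below are hypothetical, scattered around the line y = x + 1 named in the figure, and `fit_line` is an illustrative helper rather than a library function.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical noisy observations near y = x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 2.9, 4.1, 4.9]

slope, intercept = fit_line(xs, ys)

# Smoothing: replace each observed y by the value on the fitted line.
smoothed = [slope * x + intercept for x in xs]

print(round(slope, 2), round(intercept, 2))
```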

Data Cleaning
•  Inconsistent Data
   –  Inconsistent data are handled by:
      •  Manual correction (expensive and tedious)
      •  Routines designed to detect inconsistencies, which are then corrected manually. E.g., a routine may check global constraints (age > 10) or functional dependencies
      •  Other inconsistencies (e.g., between names of the same attribute) can be corrected during the data integration process
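
A detection routine of the kind described above might look like the following sketch. The records, the field names, and the dependency zip → city are hypothetical; only the constraint age > 10 comes from the slide.

```python
# Hypothetical records used to illustrate the two checks.
records = [
    {"id": 1, "age": 34, "zip": "A1", "city": "X"},
    {"id": 2, "age": 7, "zip": "A1", "city": "Y"},
    {"id": 3, "age": 52, "zip": "B2", "city": "Z"},
]

def violates_min_age(record, min_age=10):
    """Global constraint from the slide: age must be greater than 10."""
    return not record["age"] > min_age

def fd_violations(rows, lhs, rhs):
    """Return lhs values that map to more than one rhs value (lhs -> rhs broken)."""
    seen = {}
    for row in rows:
        seen.setdefault(row[lhs], set()).add(row[rhs])
    return {key: values for key, values in seen.items() if len(values) > 1}

age_flags = [r["id"] for r in records if violates_min_age(r)]
zip_flags = fd_violations(records, "zip", "city")

print(age_flags, zip_flags)
```

The routine only flags suspicious records; deciding which value is correct is left to manual review, as on the slide.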

Data Integration
•  Data integration:
   –  combines data from multiple sources into a coherent store
•  Schema integration
   –  integrate metadata from different sources
      •  metadata: data about the data (i.e., data descriptors)
   –  Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
•  Detecting and resolving data value conflicts
   –  for the same real-world entity, attribute values from different sources are different (e.g., J.D. Smith and John Smith may refer to the same person)
   –  possible reasons: different representations, different scales, e.g., metric vs. British units (cm vs. inches)

Data Integration
•  How to handle redundant data in data integration?
   –  Redundant data often occur when multiple databases are integrated
      •  The same attribute may have different names in different databases
      •  One attribute may be a “derived” attribute in another table, e.g., annual revenue
   –  Redundant data can often be detected by correlation analysis
   –  Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
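
Correlation analysis for redundancy can be sketched with the Pearson coefficient. The revenue figures below are hypothetical, and the annual column is deliberately derived from the monthly one so that the correlation comes out as 1.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical attributes from two sources; annual is derived from monthly.
monthly_revenue = [10.0, 12.0, 9.0, 15.0]
annual_revenue = [12 * m for m in monthly_revenue]

r = pearson(monthly_revenue, annual_revenue)

print(round(r, 6))
```

A coefficient close to +1 or -1 suggests that one of the two attributes is a candidate for removal; the cut-off to use is a judgment call.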

Data Transformation
•  Smoothing: remove noise from data
•  Aggregation: summarization, data cube construction
•  Generalization: concept hierarchy climbing
•  Normalization: scaled to fall within a small, specified range
   –  min-max normalization
   –  z-score normalization
   –  normalization by decimal scaling
•  Attribute/feature construction
   –  New attributes constructed from the given ones

Data Transformation
•  Why normalization?
   –  Speeds up some learning techniques (e.g., neural networks)
   –  Helps prevent attributes with large ranges from outweighing ones with small ranges
      •  Example:
         –  income has range 3000-200000
         –  age has range 10-80
         –  gender has domain M/F

Data Transformation
•  Normalization
   –  min-max normalization
      $v' = \dfrac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$
      •  e.g., convert age = 30 to the range 0-1, when min = 10 and max = 80: new_age = (30-10)/(80-10) = 2/7
   –  z-score normalization
      $v' = \dfrac{v - \mathit{mean}_A}{\mathit{stand\_dev}_A}$
   –  normalization by decimal scaling
      $v' = \dfrac{v}{10^{j}}$, where j is the smallest integer such that $\max(|v'|) < 1$
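
The three normalizations can be sketched directly from their definitions. The function names and the sample values passed to `z_score` and `decimal_scaling` are illustrative assumptions, and the decimal-scaling search starts at j = 0, which assumes j is non-negative.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, stand_dev_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / stand_dev_a

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest j >= 0 such that max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

new_age = min_max(30, 10, 80)            # the slide's example: 2/7
z = z_score(30, 20, 5)                   # hypothetical mean and deviation
scaled, j = decimal_scaling([-986, 917]) # hypothetical values

print(new_age, z, scaled, j)
```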

Essential Terms
•  Data: a set of facts (items) D, usually stored in a database
•  Pattern: an expression E in a language L, that describes a subset of facts
•  Attribute: a field in an item i in D
•  Interestingness: a function $I_{D,L}$ that maps an expression E in L into a measure space M

Data Mining
•  Data Mining
   –  The efficient discovery of previously unknown, valid, potentially useful, understandable patterns in large datasets
   –  The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner
•  The Data Mining Task
   –  Given a dataset D, language of facts L, interestingness function $I_{D,L}$ and threshold c, find the expression E such that $I_{D,L}(E) > c$ efficiently
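
One way to make this definition concrete is a toy instance, sketched below under stated assumptions: D is the grocery transaction data from earlier in the lecture, L is the set of all item pairs, the interestingness function is the fraction of transactions containing the pair, and c = 0.5. None of these choices is prescribed by the slide, and the brute-force search ignores the "efficiently" part of the task.

```python
from itertools import combinations

# D: the transactions from the Transaction Data slide.
D = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def interestingness(expression, dataset):
    """Assumed measure: fraction of transactions containing the item set."""
    return sum(1 for t in dataset if expression <= t) / len(dataset)

# L: assumed language of expressions, here all pairs of items.
items = sorted(set().union(*D))
L = [frozenset(pair) for pair in combinations(items, 2)]

c = 0.5  # assumed threshold
interesting = sorted(tuple(sorted(E)) for E in L if interestingness(E, D) > c)

print(interesting)
```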

Data Mining
•  What is Data Mining?
   –  Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area): pattern
   –  Grouping together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com): clustering
•  What is not Data Mining?
   –  Looking up a phone number in a phone directory
   –  Querying a Web search engine for information about “Amazon”

Data Mining
•  How is Data Mining used?
   1)  Identify the problem
   2)  Use data mining techniques to transform the data into information
   3)  Act on the information
   4)  Measure the results

Data Mining
•  Data Mining Process
   1)  Understand the domain
   2)  Create a dataset
       –  Select the interesting attributes
       –  Data cleaning and preprocessing
   3)  Choose the data mining task and the specific algorithm
   4)  Interpret the results, and possibly return to step 2

Data Mining
•  The origin of Data Mining
   –  Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
   –  Data Mining must address
      •  Enormity of data
      •  High dimensionality of data
      •  Heterogeneous, distributed nature of data

[Figure: Data Mining at the intersection of Machine Learning/Pattern Recognition, Statistics/AI, and Database systems]

Data Mining Tasks
•  Prediction Methods
   –  Use some variables to predict unknown or future values of other variables
•  Description Methods
   –  Find human-interpretable patterns that describe the data

Data Mining Tasks
•  Classification [Predictive]: learning a function that maps an item into one of a set of predefined classes
•  Regression [Predictive]: learning a function that maps an item to a real value
•  Clustering [Descriptive]: identify a set of groups of similar items
•  Dependencies and associations [Descriptive]: identify significant dependencies between data attributes
•  Summarization [Descriptive]: find a compact description of the dataset or a subset of the dataset

Data Mining Applications
•  Fraud detection: credit cards, phone cards
•  Marketing: customer targeting
•  Data Warehousing: Business Enterprises
•  Astronomy
•  Molecular biology

Challenges of Data Mining
•  Scalability
•  Dimensionality
•  Complex and Heterogeneous Data
•  Data Quality
•  Data Ownership and Distribution
•  Privacy Preservation
•  Streaming Data
