Naeem Ahmed
Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro
Email: [email protected]
Data Mining Concepts & Techniques Lecture No. 02
Data Preprocessing, Data Mining
Outline
• Data Preprocessing
• Data Cleaning, Transformation
• Data Mining
• Data Mining Tasks, Applications, Challenges
Acknowledgements: Introduction to Data Mining © Tan, Steinbach, Kumar and George Kollios, Homepage: http://www.cs.bu.edu/fac/gkollios/dm07
Data
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – An attribute is also known as a variable, field, characteristic, or feature
• A collection of attributes describes an object
  – An object is also known as a record, point, case, sample, entity, or instance
Example (columns are attributes; rows are objects):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data: Types of Attributes
• There are different types of attributes
  – Nominal. Examples: ID numbers, eye color, zip codes
  – Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
  – Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit
  – Ratio. Examples: temperature in Kelvin, length, time, counts
Data: Properties of Attribute Values
• The type of an attribute depends on which of the following properties it possesses:
  – Distinctness: = ≠
  – Order: < >
  – Addition: + −
  – Multiplication: * /
• Nominal attribute: distinctness
• Ordinal attribute: distinctness & order
• Interval attribute: distinctness, order & addition
• Ratio attribute: all four properties
Discrete and Continuous Attributes
• Discrete Attribute
  – Has only a finite or countably infinite set of values
  – Examples: zip codes, counts, or the set of words in a collection of documents
  – Often represented as integer variables
  – Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
  – Has real numbers as attribute values
  – Examples: temperature, height, or weight
  – Practically, real values can only be measured and represented using a finite number of digits
  – Continuous attributes are typically represented as floating-point variables
Types of Datasets
• Record
  – Data Matrix
  – Document Data
  – Transaction Data
• Graph
  – World Wide Web
  – Molecular Structures
• Ordered
  – Spatial Data
  – Temporal Data
  – Sequential Data
  – Genetic Sequence Data
Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Matrix
• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
• Such a data set can be represented by an m-by-n matrix, with m rows, one for each object, and n columns, one for each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
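As a minimal sketch, such a matrix can be held as a plain list of rows, one per object, and any two objects compared as points in attribute space. The Euclidean distance function below is a standard illustration, not something the slide prescribes:

```python
import math

# The 2x5 data matrix above: 2 objects (rows) x 5 numeric attributes (columns).
data_matrix = [
    # x-load, y-load, distance, load, thickness
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

def euclidean(p, q):
    """Euclidean distance between two objects viewed as points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(round(euclidean(data_matrix[0], data_matrix[1]), 3))  # 2.842
```

Because each dimension is a distinct attribute, any geometric notion (distance, angle, projection) applies directly to the rows.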
Document Data
• Each document becomes a 'term' vector,
  – each term is a component (attribute) of the vector,
  – the value of each component is the number of times the corresponding term occurs in the document

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
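A minimal sketch of building such a term vector. Only the vocabulary comes from the table above; the sample document text is made up for illustration:

```python
from collections import Counter

# Vocabulary = the attributes (components) of every term vector.
vocab = ["team", "coach", "play", "ball", "score",
         "game", "win", "lost", "timeout", "season"]

def term_vector(text, vocab):
    """Count how many times each vocabulary term occurs in the document."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]

doc = "the team lost the game after a timeout but the coach liked the play"
print(term_vector(doc, vocab))  # [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
```

Words outside the vocabulary are simply ignored, so every document maps onto a vector of the same fixed length.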
Transaction Data
• A special type of record data, where
  – each record (transaction) involves a set of items.
  – For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
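As an illustration, the support count of each item (how many transactions contain it) can be tallied directly from the table above:

```python
from collections import Counter

# The five transactions from the grocery-store example, as sets of items.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

# Support count: the number of transactions in which each item appears.
support = Counter(item for t in transactions for item in t)
print(support["Milk"])  # 4 (Milk appears in 4 of the 5 transactions)
```

Counts like these are the starting point for the dependency/association tasks discussed later in the lecture.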
Graph Data • Examples: Generic graph and HTML Links
(Figure: generic graph with numbered nodes omitted.)
<a href="papers/papers.html#bbbb">Data Mining </a>
<li><a href="papers/papers.html#aaaa">Graph Partitioning </a>
<li><a href="papers/papers.html#aaaa">Parallel Solution of Sparse Linear System of Equations </a>
<li><a href="papers/papers.html#ffff">N-Body Computation and Dense Linear System Solvers
Ordered Data
• Sequences of transactions: each element of the sequence is a set of items/events
Ordered Data • Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCCCGCAGGGCCCGCCCCGCGCCGTCGAGAAGGGCCCGCCTGGCGGGCGGGGGGAGGCGGGGCCGCCCGAGCCCAACCGAGTCCGACCAGGTGCCCCCTCTGCTCGGCCTAGACCTGAGCTCATTAGGCGGCAGCGGACAGGCCAAGTAGAACACGCGAAGCGCTGGGCTGCCTGCTGCGACCAGGG
Data Quality
• What kinds of data quality problems are there?
• How can we detect problems with the data?
• What can we do about these problems?
• Examples of data quality problems:
  – Noise and outliers
  – Missing values
  – Duplicate data
Why Data Preprocessing?
• Data in the real world is dirty
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – noisy: containing errors or outliers
  – inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
  – Quality decisions must be based on quality data
  – A data warehouse needs consistent integration of quality data
  – Required for both OLAP and Data Mining!
Why Data Preprocessing?
• Why can data be incomplete?
  – Attributes of interest are not available (e.g., customer information for sales transaction data)
  – Data were not considered important at the time of the transactions, so they were not recorded!
  – Data were not recorded because of misunderstandings or malfunctions
  – Data may have been recorded and later deleted!
  – Missing/unknown values for some data
Why Data Preprocessing?
• Why can data be noisy/inconsistent?
  – Faulty instruments for data collection
  – Human or computer errors
  – Errors in data transmission
  – Technology limitations (e.g., sensor data arrive at a faster rate than they can be processed)
  – Inconsistencies in naming conventions or data codes (e.g., 4/1/2015 could mean 4 January 2015 or 1 April 2015)
  – Duplicate tuples, e.g., records received twice, should also be removed
Major Tasks in Data Preprocessing
• Data cleaning
  – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  – Integration of multiple databases, data cubes, or files
• Data transformation
  – Normalization and aggregation
• Data reduction
  – Obtains a reduced representation, smaller in volume, that produces the same or similar analytical results
• Data discretization
  – Part of data reduction, of particular importance for numerical data
Data Cleaning
• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
Data Cleaning
• How to handle missing data?
  – Ignore the tuple: usually done when the class label is missing (assuming a classification task); not effective when the percentage of missing values per attribute varies considerably
  – Fill in the missing value manually: tedious and often infeasible
  – Use a global constant to fill in the missing value: e.g., "unknown" (a new class?!)
  – Use the attribute mean to fill in the missing value
  – Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
  – Use the most probable value to fill in the missing value: inference-based, e.g., a Bayesian formula or a decision tree
Data Cleaning
• How to handle missing data?

Age  Income  Religion   Gender
23   24,200  Muslim     M
39   ?       Christian  F
45   45,390  ?          F

Fill missing values using aggregate functions (e.g., the average) or probabilistic estimates over the global value distribution. E.g., fill in the average income, or the most probable income given that the person is 39 years old; for the missing religion, fill in the most frequent value.
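A minimal sketch of the attribute-mean strategy applied to the income column above. The `rows` structure and its field names are hypothetical; only the values come from the table:

```python
# Missing entries are marked with None, as in the table's "?" cells.
rows = [
    {"age": 23, "income": 24200, "religion": "Muslim"},
    {"age": 39, "income": None,  "religion": "Christian"},
    {"age": 45, "income": 45390, "religion": None},
]

# Compute the attribute mean over the known values, then fill the gaps.
known = [r["income"] for r in rows if r["income"] is not None]
mean_income = sum(known) / len(known)
for r in rows:
    if r["income"] is None:
        r["income"] = mean_income

print(rows[1]["income"])  # 34795.0
```

The class-conditional variant works the same way, except the mean is computed only over rows sharing the tuple's class label.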
Data Cleaning
• Noisy Data
  – Noise: random error or variance in a measured variable
  – Incorrect attribute values may exist due to
    • faulty data collection instruments
    • data entry problems
    • data transmission problems
    • technology limitations
    • inconsistency in naming conventions
  – Other data problems that require data cleaning
    • duplicate records
    • incomplete data
    • inconsistent data
Data Cleaning
• How to handle noisy data? Smoothing techniques:
  – Binning method:
    • first sort the data and partition it into (equi-depth) bins
    • then smooth by bin means, bin medians, bin boundaries, etc.
  – Clustering
    • detect and remove outliers
  – Combined computer and human inspection
    • the computer detects suspicious values, which are then checked by humans
  – Regression
    • smooth by fitting the data to regression functions
Data Cleaning
• Simple Discretization Methods: Binning
  – Equal-width (distance) partitioning:
    • Divides the range into N intervals of equal size: a uniform grid
    • If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B − A)/N
    • The most straightforward approach
    • But outliers may dominate the presentation
    • Skewed data is not handled well
  – Equal-depth (frequency) partitioning:
    • Divides the range into N intervals, each containing approximately the same number of samples
    • Good data scaling and good handling of skewed data
Simple Discretization Methods: Binning
Example: customer ages (figure: histograms of the number of values per bin)

Equi-width binning:
0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80

Equi-depth binning:
0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80
Example: Smoothing Using Binning Methods
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries: [4,15], [21,25], [26,34]
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
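A sketch of the two smoothing rules applied to the bins above. Two conventions are assumed that the slide does not specify: means are rounded to the nearest integer, and a value exactly halfway between the bin boundaries goes to the lower one:

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value by the closer of its bin's min/max boundary."""
    out = []
    for b in bins:
        lo, hi = min(b), max(b)
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

# The equi-depth bins from the price example above.
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```

Both outputs reproduce the slide's worked example exactly (the bin-2 mean 22.75 rounds to 23).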
Data Cleaning: Regression
(Figure: example of linear regression; a scatter of points (x, y), e.g., age vs. salary, smoothed by the fitted line y = x + 1.)
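A minimal ordinary-least-squares sketch of regression smoothing. The sample points are chosen to lie exactly on the slide's line y = x + 1, so the fit recovers it; nothing here comes from a specific library:

```python
def fit_line(xs, ys):
    """Simple linear regression y = a*x + b by ordinary least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Points on y = x + 1; with noisy data, replacing each y by a*x + b smooths it.
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 4, 5, 6]
a, b = fit_line(xs, ys)
print(a, b)  # 1.0 1.0
```

Once the line is fitted, smoothing means replacing each observed y with its predicted value a*x + b.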
Data Cleaning
• Inconsistent Data
  – Inconsistent data are handled by:
    • Manual correction (expensive and tedious)
    • Routines designed to detect inconsistencies so they can be corrected manually. E.g., a routine may check global constraints (age > 10) or functional dependencies
    • Other inconsistencies (e.g., between names of the same attribute) can be corrected during the data integration process
Data Integration
• Data integration:
  – combines data from multiple sources into a coherent store
• Schema integration
  – integrate metadata from different sources
    • metadata: data about the data (i.e., data descriptors)
  – Entity identification problem: identify real-world entities across multiple data sources, e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
  – for the same real-world entity, attribute values from different sources differ (e.g., J. D. Smith and John Smith may refer to the same person)
  – possible reasons: different representations, different scales, e.g., metric vs. British units (cm vs. inches)
Data Integration
• How to handle redundant data in data integration?
  – Redundant data occur often when integrating multiple databases
    • The same attribute may have different names in different databases
    • One attribute may be a "derived" attribute in another table, e.g., annual revenue
  – Redundant data may be detected by correlation analysis
  – Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
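The correlation analysis mentioned above can be sketched with the Pearson coefficient. The monthly/annual revenue figures below are invented to show a perfectly correlated (hence redundant) derived attribute:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# annual = 12 * monthly, so after integration one of the two is redundant.
monthly = [10, 20, 30, 40]
annual = [120, 240, 360, 480]
print(pearson(monthly, annual))  # ~1.0: perfectly correlated
```

A coefficient near +1 or -1 flags one attribute as (approximately) derivable from the other and therefore a candidate for removal.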
Data Transformation
• Smoothing: remove noise from the data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scale values to fall within a small, specified range
  – min-max normalization
  – z-score normalization
  – normalization by decimal scaling
• Attribute/feature construction
  – New attributes constructed from the given ones
Data Transformation
• Why normalization?
  – Speeds up some learning techniques (e.g., neural networks)
  – Helps prevent attributes with large ranges from outweighing ones with small ranges
• Example:
  – income has range 3,000-200,000
  – age has range 10-80
  – gender has domain M/F
Data Transformation
• Normalization
  – min-max normalization:
    v' = (v − min_A) / (max_A − min_A) * (new_max_A − new_min_A) + new_min_A
    • e.g., convert age = 30 to the range 0-1, with min = 10 and max = 80:
      new_age = (30 − 10) / (80 − 10) = 2/7
  – z-score normalization:
    v' = (v − mean_A) / stand_dev_A
  – normalization by decimal scaling:
    v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
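The three formulas translate directly into code. This sketch uses the slide's age example for min-max; the values passed to decimal scaling are invented for illustration:

```python
def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    """Min-max normalization: map v from [lo, hi] onto [new_lo, new_hi]."""
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mean, std):
    """Z-score normalization: how many standard deviations v is from the mean."""
    return (v - mean) / std

def decimal_scaling(values):
    """Divide by the smallest power of 10 that makes every |v'| < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

# The slide's example: age = 30 with min = 10, max = 80 maps to 2/7.
print(round(min_max(30, 10, 80), 4))  # 0.2857
```

Min-max needs the observed range up front; z-score does not, which makes it the usual choice when future out-of-range values are expected.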
Essential Terms
• Data: a set of facts (items) D, usually stored in a database
• Pattern: an expression E in a language L that describes a subset of the facts
• Attribute: a field in an item i in D
• Interestingness: a function I_{D,L} that maps an expression E in L into a measure space M
Data Mining
• Data Mining
  – The efficient discovery of previously unknown, valid, potentially useful, understandable patterns in large datasets
  – The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner
• The Data Mining Task
  – Given a dataset D, a language of facts L, an interestingness function I_{D,L}, and a threshold c, efficiently find the expressions E such that I_{D,L}(E) > c
Data Mining
• What is Data Mining?
  – Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly... in the Boston area): a pattern
  – Grouping together similar documents returned by a search engine according to their context (e.g., Amazon rainforest vs. Amazon.com): clustering
• What is not Data Mining?
  – Looking up a phone number in a phone directory
  – Querying a Web search engine for information about "Amazon"
Data Mining
• How is Data Mining used?
  1) Identify the problem
  2) Use data mining techniques to transform the data into information
  3) Act on the information
  4) Measure the results
Data Mining
• The Data Mining Process
  1) Understand the domain
  2) Create a dataset
     – Select the interesting attributes
     – Data cleaning and preprocessing
  3) Choose the data mining task and the specific algorithm
  4) Interpret the results, and possibly return to step 2
Data Mining
• The origin of Data Mining
  – Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
  – Data Mining must address
    • The enormity of the data
    • The high dimensionality of the data
    • The heterogeneous, distributed nature of the data

(Figure: Venn diagram placing Data Mining at the intersection of Machine Learning/Pattern Recognition, Statistics/AI, and Database Systems.)
Data Mining Tasks
• Prediction Methods
  – Use some variables to predict unknown or future values of other variables
• Description Methods
  – Find human-interpretable patterns that describe the data
Data Mining Tasks
• Classification [Predictive]: learn a function that maps an item into one of a set of predefined classes
• Regression [Predictive]: learn a function that maps an item to a real value
• Clustering [Descriptive]: identify a set of groups of similar items
• Dependencies and associations [Descriptive]: identify significant dependencies between data attributes
• Summarization [Descriptive]: find a compact description of the dataset or a subset of the dataset
Data Mining Applications
• Fraud detection: credit cards, phone cards
• Marketing: customer targeting
• Data Warehousing: business enterprises
• Astronomy
• Molecular biology
Challenges of Data Mining
• Scalability
• Dimensionality
• Complex and Heterogeneous Data
• Data Quality
• Data Ownership and Distribution
• Privacy Preservation
• Streaming Data