40
SLIDES BY: SHREE JASWAL Introduction to Data Mining

Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

S L I D E S B Y : S H R E E J A S W A L

Introduction to Data Mining

Page 2: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

Books

Chp1 Slides by: Shree Jaswal

2

Which Chapter from which Text Book ?

Chapter 1: Introduction from Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann 3nd Edition

Page 3: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

Course objective and Course Outcome

Chp1 Slides by: Shree Jaswal

3

Course objective 1: To introduce the concept of data Mining as an important tool for enterprise data management and as a cutting edge technology for building competitive advantage.

Course Outcome TEITC604.1: Demonstrate an understanding of the importance of data mining and the principles of business intelligence

Page 4: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

Topics to be covered

Chp1 Slides by: Shree Jaswal

4

What is Data Mining;

Kind of patterns to be mined;

Technologies used;

Major issues in Data Mining

Page 5: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

Evolution of database system technology

Chp1 Slides by: Shree Jaswal

5

Page 6: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

Data Warehouse Definition

Chp1 Slides by: Shree Jaswal

6

A data warehouse is a

subject-oriented

integrated

time-varying

non-volatile

collection of data that is used primarily in

organizational decision making.

-- Bill Inmon, Building the Data Warehouse 1996

Page 7: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

What is a data warehouse?

Chp1 Slides by: Shree Jaswal

7

Page 8: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

Example of a datawarehouse framework used by an electronics store

Chp1 Slides by: Shree Jaswal

8

Page 9: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

Chp1 Slides by: Shree Jaswal 9

Page 10: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

Why Data Mining?

Chp1 Slides by: Shree Jaswal

10

The Explosive Growth of Data: from terabytes to petabytes

Data collection and data availability

Automated data collection tools, database systems, Web,

computerized society

Major sources of abundant data

Business: Web, e-commerce, transactions, stocks, …

Science: Remote sensing, bioinformatics, scientific simulation, …

Society and everyone: news, digital cameras, YouTube

We are drowning in data, but starving for knowledge!

“Necessity is the mother of invention”—Data mining—Automated analysis

of massive data sets

Page 11: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

11

Why Data Mining?—Potential Applications

Data analysis and decision support

Market analysis and management

Target marketing, customer relationship management (CRM),

market basket analysis, cross selling, market segmentation

Risk analysis and management

Forecasting, customer retention, improved underwriting, quality

control, competitive analysis

Fraud detection and detection of unusual patterns (outliers)

Other Applications

Text mining (news group, email, documents) and Web mining

Stream data mining

Bioinformatics and bio-data analysis

Chp1 Slides by: Shree Jaswal

Page 12: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

12

Ex. 1: Market Analysis and Management

Where does the data come from?—Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies

Target marketing

Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.

Determine customer purchasing patterns over time

Cross-market analysis—Find associations/co-relations between product sales, & predict based on such association

Customer profiling—What types of customers buy what products (clustering or classification)

Customer requirement analysis

Identify the best products for different groups of customers

Predict what factors will attract new customers

Provision of summary information

Multidimensional summary reports

Statistical summary information (data central tendency and variation)

Chp1 Slides by: Shree Jaswal

Page 13: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

13

Ex. 2: Corporate Analysis & Risk Management

Finance planning and asset evaluation

cash flow analysis and prediction

contingent claim analysis to evaluate assets

cross-sectional and time series analysis (financial-ratio, trend

analysis, etc.)

Resource planning

summarize and compare the resources and spending

Competition

monitor competitors and market directions

group customers into classes and a class-based pricing procedure

set pricing strategy in a highly competitive market

Chp1 Slides by: Shree Jaswal

Page 14: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

14

Ex. 3: Fraud Detection & Mining Unusual Patterns

Approaches: Clustering & model construction for frauds, outlier analysis

Applications: Health care, retail, credit card service, telecomm.

Auto insurance: ring of collisions

Money laundering: suspicious monetary transactions

Medical insurance

Professional patients, ring of doctors, and ring of references

Unnecessary or correlated screening tests

Telecommunications: phone-call fraud

Phone call model: destination of the call, duration, time of day or week.

Analyze patterns that deviate from an expected norm

Retail industry

Analysts estimate that 38% of retail shrink is due to dishonest employees

Anti-terrorism

Chp1 Slides by: Shree Jaswal

Page 15: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

What Is Data Mining?

Chp1 Slides by: Shree Jaswal

15

Data mining (knowledge discovery from data)

Extraction of interesting (non-trivial, implicit, previously unknown

and potentially useful) patterns or knowledge from huge amount of

data

Data mining: a misnomer?

Alternative names

Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Watch out: Is everything “data mining”?

Simple search and query processing

(Deductive) expert systems

Page 16: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

Examples: What is (not) Data Mining?

What is not Data Mining?

– Look up phone number in phone directory

– Query a Web search engine for information about “Amazon”

What is Data Mining?

– Certain names are more prevalent in certain Mumbai locations (D’Souza, D’Mello, D’Silva… in Vasai area)

– Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com,)

Page 17: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

17

KDD Process: Several Key Steps

Learning the application domain

relevant prior knowledge and goals of application

Creating a target data set: data selection

Data cleaning and preprocessing: (may take 60% of effort!)

Data reduction and transformation

Find useful features, dimensionality/variable reduction, invariant

representation

Choosing functions of data mining

summarization, classification, regression, association, clustering

Choosing the mining algorithm(s)

Data mining: search for patterns of interest

Pattern evaluation and knowledge presentation

visualization, transformation, removing redundant patterns, etc.

Use of discovered knowledge Chp1 Slides by: Shree Jaswal

Page 18: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

18

Knowledge Discovery (KDD) Process

This is a view from typical database systems and data warehousing communities

Data mining plays an essential role in the knowledge discovery process

Data Cleaning

Data Integration

Databases

Data Warehouse

Task-relevant Data

Selection and transformation

Data Mining

Pattern Evaluation

Chp1 Slides by: Shree Jaswal

Page 19: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

Example: A Web Mining Framework

Chp1 Slides by: Shree Jaswal

19

Web mining usually involves

Data cleaning

Data integration from multiple sources

Warehousing the data

Data cube construction

Data selection for data mining

Data mining

Presentation of the mining results

Patterns and knowledge to be used or stored into knowledge-

base

Page 20: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

20

Data Mining in Business Intelligence

Increasing potential

to support

business decisions End User

Business

Analyst

Data

Analyst

DBA

Decision Making

Data Presentation

Visualization Techniques

Data Mining Information Discovery

Data Exploration

Statistical Summary, Querying, and Reporting

Data Preprocessing/Integration, Data Warehouses

Data Sources

Paper, Files, Web documents, Scientific experiments, Database Systems Chp1 Slides by: Shree Jaswal

Page 21: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

21

KDD Process: A Typical View from ML and Statistics

Input Data Data Mining

Data Pre-Processing

Post-Processing

This is a view from typical machine learning and statistics communities

Data integration

Normalization

Feature selection

Dimension reduction

Pattern discovery Association & correlation Classification Clustering Outlier analysis … … … …

Pattern evaluation

Pattern selection

Pattern interpretation

Pattern visualization

Chp1 Slides by: Shree Jaswal

Page 22: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

22

Decisions in Data Mining(1)

Data to be mined

Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks

Knowledge to be mined (or: Data mining functions)

Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.

Descriptive vs. predictive data mining : Descriptive mining tasks characterize properties of data in a target data set & predictive mining tasks perform induction on current data in order to make predictions

Multiple/integrated functions and mining at multiple levels

Chp1 Slides by: Shree Jaswal

Page 23: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

Decisions in Data Mining(2)

Chp1 Slides by: Shree Jaswal

23

Techniques utilized

Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.

Applications adapted

Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Page 24: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

What Kinds of Patterns Can Be Mined?

Chp1 Slides by: Shree Jaswal

24

1. Generalization

2. Association and Correlation Analysis

3. Classification

4. Cluster Analysis

5. Outlier Analysis

Page 25: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

25

Data Mining Function: (1) Generalization

Multidimensional concept description: Characterization and discrimination

Generalize, summarize, and contrast data characteristics, e.g., summarize the characteristics of customers who spend more than Rs. 50,000 a year at an electronics store

Data characterization is a summarization of the general characteristics or features of a target class of data

Data cube technology for computing

Scalable methods for computing (i.e., materializing) multidimensional aggregates

OLAP (online analytical processing)

Examples of Output forms : pie charts, MDD cubes, bar

charts, curves etc.

Chp1 Slides by: Shree Jaswal

Page 26: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

Data Mining Function: (1) Generalization contd.

Chp1 Slides by: Shree Jaswal

26

Data discrimination is a comparison of the general features of the target class data objects against the general features of objects from one or multiple contrasting classes Eg. Compare 2 groups of customers- those who shop for computer

products regularly(more than twice a month) and those who rarely shop for such products(less than 3 times a year)

Data cube technology for computing Drill down on any dimension

Discriminant rules: Discrimination descriptions expressed in the form of rules

Output forms : same as that of data characterization along with discrimination descriptions

Page 27: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

27

Data Mining Function: (2) Association and Correlation Analysis

Frequent patterns (or frequent itemsets)

What items are frequently purchased together in your mart? Eg. Milk & bread

Association, correlation vs. causality

A typical association rule

Computer software [1%, 50%] (support, confidence)

Confidence means that if one buys a computer there is a 50% chance that she will buy software too. A 1% support means that 1% of all transactions under analysis show that computer & software are purchased together

Association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold

Chp1 Slides by: Shree Jaswal

Page 28: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

28

Data Mining Function: (3) Classification

Classification and label prediction

Construct models (functions) based on some training examples

Describe and distinguish classes or concepts for future prediction

E.g., classify countries based on (climate), or classify cars based

on (gas mileage)

Predict some unknown class labels

Typical methods

Decision trees, naïve Bayesian classification, support vector

machines, neural networks, rule-based classification, pattern-based

classification, logistic regression, …

Typical applications:

Credit card fraud detection, direct marketing, classifying stars,

diseases, web-pages, …

Chp1 Slides by: Shree Jaswal

Page 29: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

Various forms of a classification model

Chp1 Slides by: Shree Jaswal

29

Page 30: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

30

Data Mining Function: (4) Cluster Analysis

Unsupervised learning (i.e., Class label is unknown)

Group data to form new categories (i.e., clusters), e.g.,

cluster houses to find distribution patterns

Data objects are clustered or grouped based on the principle

of maximizing intraclass similarity and

minimizing interclass similarity

Many methods and applications

Chp1 Slides by: Shree Jaswal

Page 31: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

Chp1 Slides by: Shree Jaswal 31

A 2D plot of customer data with respect to customer locations in a city showing 3 data clusters

Page 32: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

32

Data Mining Function: (5) Outlier Analysis

Outlier analysis (anomaly mining)

Outlier: A data object that does not comply with the general behavior

of the data

Noise or exception? ― One person’s garbage could be another

person’s treasure

Methods: by product of clustering or regression analysis, …

Useful in fraud detection, rare events analysis

Chp1 Slides by: Shree Jaswal

Page 33: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

33

Are All the “Discovered” Patterns Interesting?

Data mining may generate thousands of patterns: Not all of them are

interesting

Suggested approach: Human-centered, query-based, focused mining

Interestingness measures

A pattern is interesting if it is easily understood by humans, valid on new

or test data with some degree of certainty, potentially useful, novel, or

validates some hypothesis that a user seeks to confirm

Objective vs. subjective interestingness measures

Objective: based on statistics and structures of patterns, e.g., support,

confidence, etc.

Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty,

actionability, etc. Eg. A large Earthquake often follows a cluster of small

quakes

Chp1 Slides by: Shree Jaswal

Page 34: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

34

Find All and Only Interesting Patterns?

Find all the interesting patterns: Completeness

Can a data mining system find all the interesting patterns? Do we

need to find all of the interesting patterns?

Heuristic vs. exhaustive search

Association vs. classification vs. clustering

Search for only interesting patterns: An optimization problem

Can a data mining system find only the interesting patterns?

Approaches

First generate all the patterns and then filter out the uninteresting

ones

Generate only the interesting patterns—mining query optimization

Chp1 Slides by: Shree Jaswal

Page 35: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

35

Data Mining: Confluence of Multiple Disciplines

Data Mining

Machine Learning

Statistics

Applications

Algorithm

Pattern Recognition

High-Performance Computing

Visualization

Database Technology

Chp1 Slides by: Shree Jaswal

Page 36: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

36

Why Confluence of Multiple Disciplines?

Tremendous amount of data

Algorithms must be highly scalable to handle such as tera-bytes of data

High-dimensionality of data

Micro-array may have tens of thousands of dimensions

High complexity of data

Data streams and sensor data

Time-series data, temporal data, sequence data

Structure data, graphs, social networks and multi-linked data

Heterogeneous databases and legacy databases

Spatial, spatiotemporal, multimedia, text and Web data

Software programs, scientific simulations

New and sophisticated applications

Chp1 Slides by: Shree Jaswal

Page 37: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

37

Applications of Data Mining

Web page analysis: from web page classification, clustering to

PageRank & HITS algorithms

Collaborative analysis & recommender systems

Basket data analysis to targeted marketing

Biological and medical data analysis: classification, cluster analysis

(microarray data analysis), biological sequence analysis, biological

network analysis

Data mining and software engineering (e.g., IEEE Computer, Aug.

2009 issue)

From major dedicated data mining systems/tools (e.g., SAS, MS SQL-

Server Analysis Manager, Oracle Data Mining Tools) to invisible data

mining

Chp1 Slides by: Shree Jaswal

Page 38: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

38

Major Issues in Data Mining (1)

Mining Methodology

Mining various and new kinds of knowledge

Mining knowledge in multi-dimensional space

Data mining: An interdisciplinary effort

Boosting the power of discovery in a networked environment

Handling noise, uncertainty, and incompleteness of data

Pattern evaluation and pattern- or constraint-guided mining

User Interaction

Interactive mining

Incorporation of background knowledge

Presentation and visualization of data mining results

Chp1 Slides by: Shree Jaswal

Page 39: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

39

Major Issues in Data Mining (2)

Efficiency and Scalability

Efficiency and scalability of data mining algorithms

Parallel, distributed, stream, and incremental mining methods

Diversity of data types

Handling complex types of data

Mining dynamic, networked, and global data repositories

Data mining and society

Social impacts of data mining

Privacy-preserving data mining

Invisible data mining

Chp1 Slides by: Shree Jaswal

Page 40: Introduction to Data Miningssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 11

40

Summary

Data mining: Discovering interesting patterns and knowledge from

massive amount of data

A natural evolution of database technology, in great demand, with

wide applications

A KDD process includes data cleaning, data integration, data selection,

transformation, data mining, pattern evaluation, and knowledge

presentation

Mining can be performed in a variety of data

Data mining functionalities: characterization, discrimination,

association, classification, clustering, outlier and trend analysis, etc.

Data mining technologies and applications

Major issues in data mining

Chp1 Slides by: Shree Jaswal