Introduction to Data Mining (ssjaswal.com/wp-content/uploads/2017/01/DMBI_chp1.pdf)

Slides by: Shree Jaswal

Introduction to Data Mining

Books

Which Chapter from which Text Book ?

Chapter 1: Introduction, from Han, Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 3rd Edition

Course objective and Course Outcome

Course objective 1: To introduce the concept of data mining as an important tool for enterprise data management and as a cutting-edge technology for building competitive advantage.

Course Outcome TEITC604.1: Demonstrate an understanding of the importance of data mining and the principles of business intelligence

Topics to be covered

What is Data Mining;

Kinds of patterns to be mined;

Technologies used;

Major issues in Data Mining

Evolution of database system technology

Data Warehouse Definition

A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.

-- Bill Inmon, Building the Data Warehouse 1996

What is a data warehouse?

Example of a data warehouse framework used by an electronics store

Why Data Mining?

The Explosive Growth of Data: from terabytes to petabytes

Data collection and data availability

Automated data collection tools, database systems, Web, computerized society

Major sources of abundant data

Business: Web, e-commerce, transactions, stocks, …

Science: Remote sensing, bioinformatics, scientific simulation, …

Society and everyone: news, digital cameras, YouTube

We are drowning in data, but starving for knowledge!

“Necessity is the mother of invention”: data mining, the automated analysis of massive data sets

Why Data Mining?—Potential Applications

Data analysis and decision support

Market analysis and management

Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation

Risk analysis and management

Forecasting, customer retention, improved underwriting, quality control, competitive analysis

Fraud detection and detection of unusual patterns (outliers)

Other Applications

Text mining (news group, email, documents) and Web mining

Stream data mining

Bioinformatics and bio-data analysis

Ex. 1: Market Analysis and Management

Where does the data come from?—Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies

Target marketing

Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.

Determine customer purchasing patterns over time

Cross-market analysis: find associations/correlations between product sales, and predict based on such associations

Customer profiling—What types of customers buy what products (clustering or classification)

Customer requirement analysis

Identify the best products for different groups of customers

Predict what factors will attract new customers

Provision of summary information

Multidimensional summary reports

Statistical summary information (data central tendency and variation)

Ex. 2: Corporate Analysis & Risk Management

Finance planning and asset evaluation

cash flow analysis and prediction

contingent claim analysis to evaluate assets

cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)

Resource planning

summarize and compare the resources and spending

Competition

monitor competitors and market directions

group customers into classes and set up a class-based pricing procedure

set pricing strategy in a highly competitive market

Ex. 3: Fraud Detection & Mining Unusual Patterns

Approaches: Clustering & model construction for frauds, outlier analysis

Applications: Health care, retail, credit card service, telecomm.

Auto insurance: ring of collisions

Money laundering: suspicious monetary transactions

Medical insurance

Professional patients, ring of doctors, and ring of references

Unnecessary or correlated screening tests

Telecommunications: phone-call fraud

Phone call model: destination of the call, duration, time of day or week.

Analyze patterns that deviate from an expected norm

Retail industry

Analysts estimate that 38% of retail shrink is due to dishonest employees

Anti-terrorism

What Is Data Mining?

Data mining (knowledge discovery from data)

Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data

Data mining: a misnomer?

Alternative names

Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Watch out: Is everything “data mining”?

Simple search and query processing

(Deductive) expert systems

Examples: What is (not) Data Mining?

What is not Data Mining?

– Look up phone number in phone directory

– Query a Web search engine for information about “Amazon”

What is Data Mining?

– Certain names are more prevalent in certain Mumbai locations (D’Souza, D’Mello, D’Silva… in Vasai area)

– Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)

KDD Process: Several Key Steps

Learning the application domain

relevant prior knowledge and goals of application

Creating a target data set: data selection

Data cleaning and preprocessing: (may take 60% of effort!)

Data reduction and transformation

Find useful features, dimensionality/variable reduction, invariant representation

Choosing functions of data mining

summarization, classification, regression, association, clustering

Choosing the mining algorithm(s)

Data mining: search for patterns of interest

Pattern evaluation and knowledge presentation

visualization, transformation, removing redundant patterns, etc.

Use of discovered knowledge
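The steps listed above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the purchase records, the function names, and the `min_count` threshold are all hypothetical, and counting item frequencies stands in for a real mining algorithm.

```python
from collections import Counter

# Hypothetical raw purchase records; values are made up for illustration.
raw = [
    {"customer": "c1", "item": "Milk ", "qty": 2},
    {"customer": "c2", "item": "bread", "qty": 1},
    {"customer": "c3", "item": None, "qty": 1},   # missing value
    {"customer": "c4", "item": "milk", "qty": 1},
    {"customer": "c5", "item": "BREAD", "qty": 3},
    {"customer": "c6", "item": "milk", "qty": 1},
]

def select(records):
    """Data selection: keep only the task-relevant attribute."""
    return [{"item": r["item"]} for r in records]

def clean(records):
    """Data cleaning: drop records with a missing item."""
    return [r for r in records if r["item"] is not None]

def transform(records):
    """Transformation: normalize item names to a canonical form."""
    return [r["item"].strip().lower() for r in records]

def mine(items):
    """Data mining (stand-in): count how often each item occurs."""
    return Counter(items)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep only patterns that occur often enough."""
    return {item: n for item, n in patterns.items() if n >= min_count}

knowledge = evaluate(mine(transform(clean(select(raw)))), min_count=3)
print(knowledge)
```

Each stage consumes the output of the previous one, which mirrors the order of the steps; in practice the process is iterative and earlier steps are often revisited.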

Knowledge Discovery (KDD) Process

This is a view from typical database systems and data warehousing communities

Data mining plays an essential role in the knowledge discovery process

[Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection and Transformation → Task-relevant Data → Data Mining → Pattern Evaluation]

Example: A Web Mining Framework

Web mining usually involves

Data cleaning

Data integration from multiple sources

Warehousing the data

Data cube construction

Data selection for data mining

Data mining

Presentation of the mining results

Patterns and knowledge to be used or stored into a knowledge base

Data Mining in Business Intelligence

[Figure: pyramid of layers; the potential to support business decisions increases from bottom to top. The typical user of each layer is shown in parentheses.]

Decision Making (End User)

Data Presentation: Visualization Techniques (Business Analyst)

Data Mining: Information Discovery (Data Analyst)

Data Exploration: Statistical Summary, Querying, and Reporting (Data Analyst)

Data Preprocessing/Integration, Data Warehouses (DBA)

Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)


KDD Process: A Typical View from ML and Statistics

Input Data → Data Pre-Processing → Data Mining → Post-Processing

This is a view from typical machine learning and statistics communities

Data Pre-Processing: data integration, normalization, feature selection, dimension reduction

Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …

Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization


Decisions in Data Mining (1)

Data to be mined

Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks

Knowledge to be mined (or: Data mining functions)

Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.

Descriptive vs. predictive data mining: descriptive mining tasks characterize properties of the data in a target data set, while predictive mining tasks perform induction on current data in order to make predictions

Multiple/integrated functions and mining at multiple levels

Decisions in Data Mining (2)

Techniques utilized

Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.

Applications adapted

Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

What Kinds of Patterns Can Be Mined?

1. Generalization

2. Association and Correlation Analysis

3. Classification

4. Cluster Analysis

5. Outlier Analysis

Data Mining Function: (1) Generalization

Multidimensional concept description: Characterization and discrimination

Generalize, summarize, and contrast data characteristics, e.g., summarize the characteristics of customers who spend more than Rs. 50,000 a year at an electronics store

Data characterization is a summarization of the general characteristics or features of a target class of data

Data cube technology for computing

Scalable methods for computing (i.e., materializing) multidimensional aggregates

OLAP (online analytical processing)

Examples of output forms: pie charts, MDD cubes, bar charts, curves, etc.
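As a rough illustration of data characterization, the sketch below summarizes the target class "customers who spend more than Rs. 50,000 a year". The customer records and attribute names (`age`, `occupation`, `annual_spend`) are hypothetical, and a real system would compute such aggregates with data cube or OLAP operations rather than in-memory lists.

```python
from collections import Counter
from statistics import mean

# Hypothetical customer records; annual_spend is in rupees.
customers = [
    {"age": 42, "occupation": "engineer", "annual_spend": 72000},
    {"age": 23, "occupation": "student", "annual_spend": 15000},
    {"age": 38, "occupation": "engineer", "annual_spend": 64000},
    {"age": 45, "occupation": "manager", "annual_spend": 51000},
    {"age": 31, "occupation": "teacher", "annual_spend": 30000},
]

# Target class: customers spending more than Rs. 50,000 a year.
target = [c for c in customers if c["annual_spend"] > 50000]

# Characterization: summarize the general features of the target class.
summary = {
    "count": len(target),
    "mean_age": mean(c["age"] for c in target),
    "occupations": Counter(c["occupation"] for c in target),
}
print(summary)
```

The resulting counts and averages are the kind of values that would feed a pie chart or bar chart.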

Data Mining Function: (1) Generalization contd.

Data discrimination is a comparison of the general features of the target class data objects against the general features of objects from one or multiple contrasting classes. E.g., compare two groups of customers: those who shop for computer products regularly (more than twice a month) and those who rarely shop for such products (less than 3 times a year)

Data cube technology for computing; drill down on any dimension

Discriminant rules: Discrimination descriptions expressed in the form of rules

Output forms : same as that of data characterization along with discrimination descriptions
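A minimal sketch of data discrimination, contrasting the two customer groups from the example above. The records, the `age_group` attribute, and the cut-offs expressed as purchases per year (more than 24 for "more than twice a month", fewer than 3 for "rarely") are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical customers with yearly purchase counts of computer products.
customers = [
    {"age_group": "20-29", "purchases_per_year": 30},
    {"age_group": "20-29", "purchases_per_year": 36},
    {"age_group": "30-39", "purchases_per_year": 26},
    {"age_group": "20-29", "purchases_per_year": 1},
    {"age_group": "40-49", "purchases_per_year": 2},
    {"age_group": "40-49", "purchases_per_year": 0},
    {"age_group": "30-39", "purchases_per_year": 10},  # in neither class
]

def share(group, key):
    """Fraction of the group that takes each value of the given attribute."""
    counts = Counter(c[key] for c in group)
    return {value: n / len(group) for value, n in counts.items()}

# Target class: frequent shoppers; contrasting class: rare shoppers.
frequent = [c for c in customers if c["purchases_per_year"] > 24]
rare = [c for c in customers if c["purchases_per_year"] < 3]

frequent_share = share(frequent, "age_group")
rare_share = share(rare, "age_group")
print("frequent:", frequent_share)
print("rare:", rare_share)
```

Comparing the two distributions side by side gives the raw material for a discriminant rule, such as a particular age group being over-represented among frequent shoppers.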

Data Mining Function: (2) Association and Correlation Analysis

Frequent patterns (or frequent itemsets)

What items are frequently purchased together in your mart? E.g., milk and bread

Association, correlation vs. causality

A typical association rule

computer ⇒ software [1%, 50%] (support, confidence)

Confidence means that if a customer buys a computer, there is a 50% chance that she will buy software too. A 1% support means that 1% of all transactions under analysis show that computer and software are purchased together

Association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold
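To make support and confidence concrete, here is a small sketch that computes both for the rule computer ⇒ software and applies the two thresholds. The transactions and the threshold values are made up for illustration, so the numbers differ from the [1%, 50%] figures quoted above.

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    with_antecedent = sum(1 for t in transactions if antecedent <= t)
    with_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    return with_both / with_antecedent if with_antecedent else 0.0

# Hypothetical market-basket transactions.
transactions = [
    frozenset({"computer", "software"}),
    frozenset({"computer"}),
    frozenset({"milk", "bread"}),
    frozenset({"computer", "software", "printer"}),
    frozenset({"computer", "printer"}),
]

rule_support = support(transactions, frozenset({"computer", "software"}))
rule_confidence = confidence(
    transactions, frozenset({"computer"}), frozenset({"software"})
)

# Assumed thresholds; a rule must satisfy both to be kept.
min_support, min_confidence = 0.3, 0.5
is_interesting = rule_support >= min_support and rule_confidence >= min_confidence
print(rule_support, rule_confidence, is_interesting)
```

Here two of five transactions contain both items (support 0.4) and two of the four computer purchases include software (confidence 0.5), so the rule passes both thresholds.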

Data Mining Function: (3) Classification

Classification and label prediction

Construct models (functions) based on some training examples

Describe and distinguish classes or concepts for future prediction

E.g., classify countries based on (climate), or classify cars based on (gas mileage)

Predict some unknown class labels

Typical methods

Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …

Typical applications:

Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages, …
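As a hedged illustration of building a model from training examples, the sketch below learns a one-level decision tree (a decision stump) that classifies cars by gas mileage. The mileage values, the class labels "economy" and "standard", and the function names are hypothetical; practical decision tree learners handle many attributes and use measures such as information gain.

```python
def train_stump(examples):
    """Pick the mileage threshold that makes the fewest errors on the training data.

    Returns None if all training examples share the same mileage value.
    """
    values = sorted({mileage for mileage, _ in examples})
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_threshold, best_errors = None, len(examples) + 1
    for threshold in candidates:
        errors = sum(
            1
            for mileage, label in examples
            if (label == "economy") != (mileage >= threshold)
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def predict(threshold, mileage):
    """Apply the learned rule to a car whose class label is unknown."""
    return "economy" if mileage >= threshold else "standard"

# Hypothetical labelled training examples: (mileage in km per litre, class).
training = [
    (12.0, "standard"), (14.5, "standard"), (16.0, "standard"),
    (21.0, "economy"), (24.0, "economy"), (27.5, "economy"),
]

threshold = train_stump(training)
print(threshold, predict(threshold, 30.0), predict(threshold, 10.0))
```

The learned threshold plays the role of the model: it describes the two classes and is then used to predict labels for unseen cars.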

Various forms of a classification model

Data Mining Function: (4) Cluster Analysis

Unsupervised learning (i.e., Class label is unknown)

Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns

Data objects are clustered or grouped based on the principle of maximizing intraclass similarity and minimizing interclass similarity

Many methods and applications
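One of the many clustering methods is k-means, sketched below on 2D customer locations. The choice of k-means, the coordinates, and the initial centroids are assumptions for illustration; the slides do not prescribe a particular algorithm.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                # New centroid is the coordinate-wise mean of the cluster.
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters

# Hypothetical customer locations in a city (x, y), forming three groups.
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
    (8.0, 8.0), (8.5, 9.0), (9.0, 8.5),
    (1.0, 9.0), (1.5, 8.5), (2.0, 9.5),
]
initial = [points[0], points[3], points[6]]  # assumed starting centroids

centroids, clusters = kmeans(points, initial)
print(centroids)
print([len(c) for c in clusters])
```

No class labels are supplied; the groups emerge purely from distances, with points inside a cluster close together and different clusters far apart.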

A 2D plot of customer data with respect to customer locations in a city showing 3 data clusters

Data Mining Function: (5) Outlier Analysis

Outlier analysis (anomaly mining)

Outlier: A data object that does not comply with the general behavior of the data

Noise or exception? One person’s garbage could be another person’s treasure

Methods: by-product of clustering or regression analysis, …

Useful in fraud detection, rare events analysis
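A very simple way to flag objects that deviate from the general behavior is a z-score test, sketched below on phone-call durations. This is a swapped-in statistical technique rather than the clustering or regression by-product mentioned above, and the durations and the threshold of 2 standard deviations are assumptions for illustration.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical call durations in minutes for one subscriber.
durations = [3, 4, 5, 4, 3, 5, 4, 120]

outliers = zscore_outliers(durations)
print(outliers)
```

Whether the 120-minute call is noise or a sign of fraud is a question for the analyst; the test only points out that it does not fit the usual pattern. With small samples a single extreme value inflates the standard deviation, so robust alternatives may be preferable in practice.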

Are All the “Discovered” Patterns Interesting?

Data mining may generate thousands of patterns: Not all of them are interesting

Suggested approach: Human-centered, query-based, focused mining

Interestingness measures

A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm

Objective vs. subjective interestingness measures

Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.

Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc. E.g., a large earthquake often follows a cluster of small quakes

Find All and Only Interesting Patterns?

Find all the interesting patterns: Completeness

Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?

Heuristic vs. exhaustive search

Association vs. classification vs. clustering

Search for only interesting patterns: An optimization problem

Can a data mining system find only the interesting patterns?

Approaches

First generate all the patterns and then filter out the uninteresting ones

Generate only the interesting patterns—mining query optimization

Data Mining: Confluence of Multiple Disciplines

[Figure: data mining at the confluence of machine learning, statistics, applications, algorithms, pattern recognition, high-performance computing, visualization, and database technology]

Why Confluence of Multiple Disciplines?

Tremendous amount of data

Algorithms must be highly scalable to handle data at the scale of terabytes

High-dimensionality of data

Micro-array may have tens of thousands of dimensions

High complexity of data

Data streams and sensor data

Time-series data, temporal data, sequence data

Structured data, graphs, social networks and multi-linked data

Heterogeneous databases and legacy databases

Spatial, spatiotemporal, multimedia, text and Web data

Software programs, scientific simulations

New and sophisticated applications

Applications of Data Mining

Web page analysis: from web page classification, clustering to PageRank & HITS algorithms

Collaborative analysis & recommender systems

Basket data analysis to targeted marketing

Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis

Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)

From major dedicated data mining systems/tools (e.g., SAS, MS SQL Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining

Major Issues in Data Mining (1)

Mining Methodology

Mining various and new kinds of knowledge

Mining knowledge in multi-dimensional space

Data mining: An interdisciplinary effort

Boosting the power of discovery in a networked environment

Handling noise, uncertainty, and incompleteness of data

Pattern evaluation and pattern- or constraint-guided mining

User Interaction

Interactive mining

Incorporation of background knowledge

Presentation and visualization of data mining results

Major Issues in Data Mining (2)

Efficiency and Scalability

Efficiency and scalability of data mining algorithms

Parallel, distributed, stream, and incremental mining methods

Diversity of data types

Handling complex types of data

Mining dynamic, networked, and global data repositories

Data mining and society

Social impacts of data mining

Privacy-preserving data mining

Invisible data mining

Summary

Data mining: Discovering interesting patterns and knowledge from massive amounts of data

A natural evolution of database technology, in great demand, with wide applications

A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation

Mining can be performed on a variety of data

Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.

Data mining technologies and applications

Major issues in data mining
