CIT 858: Data Mining and Data Warehousing Course Instructor: Bajuna Salehe Email: [email protected] Web: www.ifm.ac.tz/staff/bajuna/ courses /


CIT 858: Data Mining and Data Warehousing

Course Instructor: Bajuna Salehe
Email: bajunar@yahoo.com
Web: www.ifm.ac.tz/staff/bajuna/courses/


Introduction to Data Mining and Data Warehousing


Data Mining and Data Warehousing Agenda

What is Data Mining?
What is Data Warehousing?
The source of invention of Data Mining and Data Warehousing.
Drowning in Data, Starving for Knowledge.
Evolution of Database Technology to the current state. (Homework)


What Is Data Mining?

Data mining (knowledge discovery from data)
  Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data

Data mining: a misnomer?
  It should arguably have been named “knowledge mining from data”, which is too long, or “knowledge mining”, which does not reflect the emphasis on mining from huge amounts of data


What Is Data Mining?

Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data/Databases (KDD).

The KDD process is depicted below:


The KDD Process

[Figure: Databases -> Cleaning & Integration -> Data Warehouse -> Selection & Transformation -> Data Mining -> Evaluation & Presentation -> Knowledge]


KDD Process

1) Data cleaning: to remove noise and inconsistent data

2) Data integration: where multiple data sources may be combined

3) Data selection: where data relevant to the analysis task are retrieved from the database

4) Data transformation: where data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations

5) Data mining: an essential process where intelligent methods are applied in order to extract data patterns

6) Pattern evaluation: to identify the truly interesting patterns representing knowledge

7) Knowledge presentation: where visualization and knowledge representation techniques are used to present the mined knowledge to the users

8) Use of discovered knowledge
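
The steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the two source lists, the record layout, the function names, and the minimum count of 2 are all hypothetical and do not come from the course material. The "mining" step here is a simple co-occurrence count, chosen for brevity rather than as a recommended method.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records from two sources; None marks a noisy (missing) value.
store_a = [
    {"customer": "c1", "items": ["bread", "cheese"]},
    {"customer": "c2", "items": ["bread", None]},
]
store_b = [
    {"customer": "c3", "items": ["cheese", "bread", "milk"]},
    {"customer": "c4", "items": []},
]

def clean(records):
    """Step 1: remove noisy values and drop records left with no items."""
    cleaned = []
    for record in records:
        items = [item for item in record["items"] if item is not None]
        if items:
            cleaned.append({"customer": record["customer"], "items": items})
    return cleaned

def integrate(*sources):
    """Step 2: combine multiple data sources into one collection."""
    return [record for source in sources for record in source]

def select(records):
    """Step 3: keep only the attribute relevant to the task (the item lists)."""
    return [record["items"] for record in records]

def transform(baskets):
    """Step 4: consolidate each basket into a sorted tuple of unique items."""
    return [tuple(sorted(set(basket))) for basket in baskets]

def mine(baskets):
    """Step 5: count how often each pair of items occurs in the same basket."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(basket, 2))
    return counts

def evaluate(pair_counts, min_count=2):
    """Step 6: keep only pairs seen at least min_count times (assumed threshold)."""
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}

def present(patterns):
    """Step 7: render the surviving patterns as readable lines."""
    return [f"{a} and {b} appear together in {n} baskets"
            for (a, b), n in sorted(patterns.items())]

records = integrate(clean(store_a), clean(store_b))
patterns = evaluate(mine(transform(select(records))))
report = present(patterns)
print(report)
```

Step 8, the use of discovered knowledge, happens outside the code: a person or downstream system acts on the report.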


Data Mining: On What Kinds Of Data?

Relational database

Data warehouse

Transactional database

Advanced database and information repository
  Spatial and temporal data
  Stream data
  Multimedia database
  Text databases & WWW


Data Mining Functionalities

Association (correlation and causality)
  e.g., Cheese & Bread

Classification and Prediction
  Construct models that describe and distinguish classes or concepts for future prediction
  Predict some unknown or missing numerical values
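
To make the Cheese & Bread association concrete, here is a minimal sketch that computes support and confidence, two standard measures for association rules that the slide itself does not name. The five transactions and the function names are hypothetical, chosen only for illustration.

```python
# Hypothetical market baskets; each set is one customer transaction.
transactions = [
    {"cheese", "bread", "milk"},
    {"cheese", "bread"},
    {"bread", "butter"},
    {"cheese", "milk"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support of both over support of antecedent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

# Rule under test: cheese -> bread
rule_support = support({"cheese", "bread"}, transactions)
rule_confidence = confidence({"cheese"}, {"bread"}, transactions)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

In this toy data, cheese and bread occur together in 2 of 5 baskets, and 2 of the 3 baskets containing cheese also contain bread. A high confidence alone indicates co-occurrence, not causality.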


Data Mining Functionalities (cont…)

Cluster analysis
  Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns

Outlier analysis
  Outlier: a data object that does not comply with the general behavior of the data
  Noise or exception? No! Outliers are useful in fraud detection and rare event analysis
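
As one possible illustration of outlier analysis, the sketch below flags values that lie far from the median, using the modified z-score based on the median absolute deviation (MAD). This is just one common technique, not a method prescribed by the course; the transaction amounts, the function name, and the 3.5 cutoff (a conventional choice) are assumptions.

```python
from statistics import median

def find_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds threshold.

    Modified z-score = 0.6745 * |x - median| / MAD, where MAD is the
    median of absolute deviations from the median.
    """
    center = median(values)
    mad = median(abs(x - center) for x in values)
    if mad == 0:
        # More than half the values are identical; this score is undefined.
        return []
    return [x for x in values if 0.6745 * abs(x - center) / mad > threshold]

# Hypothetical card transaction amounts; one is far from the usual range.
amounts = [20, 22, 19, 21, 23, 20, 500]
suspicious = find_outliers(amounts)
print(suspicious)
```

In a fraud-detection setting the flagged value would be a candidate for review rather than something to discard as noise.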


Necessity Is The Mother Of Invention

Data explosion problem
  Automated data collection tools and mature database technology lead to huge amounts of data being accumulated

We are drowning in data, but starving for knowledge!

Solution: Data warehousing and data mining
  Data warehousing and on-line analytical processing
  Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases


Evolution Of Database Technology

1960s: Data collection, database creation, IMS and network DBMS

1970s: Relational data model, relational DBMS implementation

1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s: Data mining, data warehousing, multimedia databases, and Web databases

2000s: Stream data management and mining; data mining with a variety of applications; Web technology and global information systems


Potential Applications

Data analysis and decision support
  Market analysis and management
  Risk analysis and management
  Fraud detection and detection of unusual patterns

Other applications
  Text mining (email, documents) and Web mining
  Stream data mining
  DNA and bio-data analysis


Fraud Detection & Mining Unusual Patterns

Applications: Health care, retail, credit card service, telecommunications
  Auto insurance: ring of collisions
  Money laundering: suspicious monetary transactions
  Medical insurance
    Professional patients, ring of doctors, and ring of references
    Unnecessary or correlated screening tests
  Telecommunications: phone-call fraud
    Phone call model: destination of the call, duration, time of day or week
    Analyze patterns that deviate from an expected norm
  Retail industry
    Analysts estimate that 38% of retail shrink is due to dishonest employees
  Anti-terrorism

Approaches: Clustering, model construction, outlier analysis, etc.


Other Applications

Sports
  IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat

Internet Web Surf-Aid
  IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, which helps in analyzing the effectiveness of Web marketing, improving Web site organization, etc.


What Is a Data Warehouse?

Defined in many different ways, but not rigorously
  A decision support database that is maintained separately from the organization’s operational database
  Supports information processing by providing a solid platform of consolidated, historical data for analysis

“A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process”

—Bill Inmon
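
The four properties in the definition above can be illustrated with a toy in-memory store. This is a rough sketch, not how a real warehouse is built: the class names, field names, source layouts, and dates are all hypothetical. Facts are organized around one subject (sales), mapped from differently shaped sources into one schema (integrated), stamped with a date (time-variant), and never modified after loading (non-volatile).

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)  # frozen: a loaded fact cannot be modified afterwards
class SaleFact:
    day: date      # every fact carries a time element
    product: str
    amount: float

class SalesWarehouse:
    """Hypothetical store organized around a single subject: sales."""

    def __init__(self):
        self._facts = []

    def load(self, source_rows, product_key, amount_key):
        """Append rows from one source system, mapping its field names to the common schema."""
        for row in source_rows:
            self._facts.append(
                SaleFact(row["day"], row[product_key], float(row[amount_key]))
            )

    def total_by_month(self):
        """Consolidate the historical facts into monthly totals for analysis."""
        totals = defaultdict(float)
        for fact in self._facts:
            totals[(fact.day.year, fact.day.month)] += fact.amount
        return dict(totals)

warehouse = SalesWarehouse()
# Two operational sources that name the same things differently.
warehouse.load([{"day": date(2024, 1, 5), "item": "bread", "price": 2.0}],
               product_key="item", amount_key="price")
warehouse.load([{"day": date(2024, 1, 9), "sku": "cheese", "total": "5.5"},
                {"day": date(2024, 2, 1), "sku": "bread", "total": "2.0"}],
               product_key="sku", amount_key="total")
monthly = warehouse.total_by_month()
print(monthly)
```

The operational systems keep their own data and formats; the warehouse holds a separate, consolidated copy meant for analysis.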


The Source of Invention of DW and Data Mining

Data explosion problem
  Automated data collection tools and mature database technology lead to huge amounts of data being accumulated

We are drowning in data, but starving for knowledge!

Solution: Data warehousing and data mining
  Data warehousing and on-line analytical processing
  Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases


Drowning In Data, Starving For Knowledge

[Figure labels: DATA, KNOWLEDGE]


Importance of Data Mining

By performing data mining, interesting knowledge, regularities, or high-level information can be extracted from databases and viewed or browsed from different angles.

The discovered knowledge can then be applied to the decision-making process.