Dr. M. Sulaiman Khan ([email protected]) Dept. of Computer Science University of Liverpool 2010

Dr. M. Sulaiman Khan

([email protected])

Dept. of Computer Science

University of Liverpool

2010

COMP207: Data Mining

Introduction to Data Mining

COMP207:Data Mining

What is Data Mining? Definitions KDD: Knowledge Discovery in Databases KDD Process Differences with Statistics Views on the Process Basic Functions

Why would you do this? Motivations Applications

Summary

Today's Topics


COMP207:Data Mining

Some Definitions: “The nontrivial extraction of implicit, previously unknown, and

potentially useful information from data” (Piatetsky-Shapiro) "...the automated or convenient extraction of patterns

representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, ... or data streams." (Han, pg xxi)

“...the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful...” (Witten, pg 5)

“...finding hidden information in a database.” (Dunham, pg 3) “...the process of employing one or more computer learning

techniques to automatically analyse and extract knowledge from data contained within a database.” (Roiger, pg 4)

What is Data Mining?


COMP207:Data Mining

Keywords from each definition: “The nontrivial extraction of implicit, previously unknown, and

potentially useful information from data” (Piatetsky-Shapiro) "...the automated or convenient extraction of patterns

representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, ... or data streams." (Han, pg xxi)

“...the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful...” (Witten, pg 5)

“...finding hidden information in a database.” (Dunham, pg 3) “...the process of employing one or more computer learning

techniques to automatically analyze and extract knowledge from data contained within a database.” (Roiger, pg 4)

What is Data Mining?


COMP207:Data Mining

Many texts treat KDD and Data Mining as the same process, but it is also possible to think of Data Mining as the discovery part of KDD.

Dunham: KDD is the process of finding useful information and patterns in data.Data Mining is the use of algorithms to extract information and patterns derived by the KDD process.

For this course, we will discuss the entire process (KDD) but focus mostly on the algorithms used for discovery.

KDD: Knowledge Discovery in Databases


COMP207:Data Mining

KDD (Knowledge Discovery in Databases) is the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data. (Fayyad, Shapiro, & Smyth, CACM 96)

Or KDD : non-trivial extraction of implicit, previously unknown and potentially useful information

Data mining is just a part of the KDD process

Data mining applies algorithms to large data to produce models or patterns interesting to the user.

KDD: Knowledge Discovery in Databases


COMP207:Data Mining

The Data Mining (KDD) Process


COMP207:Data Mining

Operational Data- Day-to-day data used to run business

Clean, collect and summarise- Most Data is not suitable for data mining- Errors or Noise, missing data, invalid formats

Data warehouse- Mega store of clean (analysis) data

Data Preparation- Validating the data for mining (e.g. remove noise, formatting, running validation routines etc.)

Training Data – Data used as test case for mining Data Mining – the process of applying mining algorithms

on data to produce interesting patterns

KDD Process Components


COMP207:Data Mining

Differences with Statistics


COMP207:Data Mining

Data Mining• Algorithms scale to large

data• Data is used secondary for

Data mining• DM–tools use background

knowledge for End-User• Strategy :

– Exploration– Cyclic

Statistics• Many algorithms with

quadratic run time.• Data is used for the Statistic

(primary)• Statistical background is

often required• Strategy:

– Conformational– Verifying– Few loops

Piatetsky-Shapiro View


COMP207:Data Mining

Initial Data

Target Data

Preprocessed Data

Transformed Data

Data Model

Knowledge

Selection

Preprocessing

Transformation

Data Mining

Interpretation

(As tweaked by Dunham)

CRISP-DM View


COMP207:Data Mining

All Data Mining functions can be thought of as attempting to find a model to fit the data.

Each function needs Criteria to create one model over another.

Each function needs a technique to Compare the data.

Two types of model: Predictive models predict unknown values based on

known data Descriptive models identify patterns in data

Each type has several sub-categories, each of which has many algorithms. We won't have time to look at ALL of them in detail.

Data Mining Functions


COMP207:Data Mining

Data Mining Functions


COMP207:Data Mining

Data Mining

Predictive

Descriptive

Classification: Maps data into predefined classesRegression: Maps data into a functionPrediction: Predict future data statesTime Series Analysis: Analyze data over time

(Supervised Learning)

Clustering: Find groups of similar itemsAssociation Rules: Find relationships between itemsCharacterisation: Derive representative information Sequence Discovery: Find sequential patterns

(Unsupervised Learning)

The aim of classification is to create a model that can predict the 'type' or some category for a data instance that doesn't have one.

Two phases:1. Given labelled data instances, learn model for

how to predict the class label for them. (Training)2. Given an unlabelled, unseen instance, use the

model to predict the class label. (Prediction)

Some algorithms predict only a binary split (yes/no), some can predict 1 of N classes, some give probabilities for each of N classes.

Classification


COMP207:Data Mining

The aim of clustering is similar to classification, but without predefined classes.

Clustering attempts to find clusters of data instances which are more similar to each other than to instances outside of the cluster.

Unsupervised Learning: learning by observation, rather than by example.

Some algorithms must be told how many clusters to find, others try to find an 'appropriate' number of clusters.

Clustering


COMP207:Data Mining

The aim of association rule mining is to find patterns that occur in the data set frequently enough to be interesting. Hence the association or correlation of data attributes within instances, rather than between instances.

These correlations are then expressed as rules – if X and Y appear in an instance, then Z also appears.

Most algorithms are extensions of a single base algorithm known as 'A Priori', however a few others also exist.

Association Rule Mining


COMP207:Data Mining

That all sounds ... complicated. Why should I learn about Data Mining?

What's wrong with just a relational database? Why would I want to go through these extra [complicated] steps?

Isn't it expensive? It sounds like it takes a lot of skill, programming, computational time and storage space. Where's the benefit?

Data Mining isn't just a cute academic exercise, it has very profitable real world uses. Practically all large companies and many governments perform data mining as part of their planning and analysis.

Why?


COMP207:Data Mining

We are Data rich but knowledge poor Computing affordable - Storage, CPU, networking Data is too large to analyse (Very Large Databases

(VLBD)- Dimensionality (size)- distributed (location spread)- heterogeneous (different types of data)

Traditional techniques infeasible- Statistics, databases

Competitive pressure in business enterprises- Customer profiling (Need to know who is a good customer)- Business to Business (B2B – Being “old” is not profitable)

Why Data Mining? Some general reasons


COMP207:Data Mining

Relational database—A commodity of every enterprise Huge data warehouses are under construction POS (Point of Sales): Transactional DBs in terabytes Object, relational, distributed, heterogeneous and

legacy databases Spatial databases (GIS), remote sensing database

(EOS), and scientific/engineering databases (Genetic data etc)

Time-series data (e.g., stock trading) and temporal data Text (documents, emails) and multimedia databases WWW:A huge, hyper-linked, dynamic, global information

system (XML, Web content and Web usage data) Crime data – terrorist data more recent applications

Data is Everywhere!


COMP207:Data Mining

The rate of data creation is accelerating each year. In 2003, UC Berkeley estimated that the previous year generated 5 exabytes of data, of which 92% was stored on electronically accessible media.Mega < Giga < Tera < Peta < Exa ... All the data in all the books in the US Library of Congress is ~136 Terabytes. So 37,000 New Libraries of Congress in 2002.

VLBI Telescopes produce 16 Gigabytes of data every second.

Each engine of each plane of each company produces ~1 Gigabyte of data every trans-atlantic length journey.

Google searches 18 billion+ accessible web pages.

The Data Explosion


COMP207:Data Mining

As the amount of data increases, the proportion of information decreases.

As more and more data is generated automatically, we need to find automatic solutions to turn those stored raw results into information.

Companies need to turn stored data into profit ... otherwise why are they storing it?

Let's look at some real world examples.

Data Explosion Implications


COMP207:Data Mining

The data generated by airplane engines can be used to determine when it needs to be serviced. By discovering the patterns that are indicative of problems, companies can service working engines less often (increasing profit) and discover faults before they materialise (increasing safety).

Loan companies can “give you results in minutes” by classifying you into a good credit risk or a bad risk, based on your personal information and a large supply of previous, similar customers.

Cell phone companies can classify customers into those likely to leave, and hence need enticement, and those that are likely to stay regardless.

Classification


COMP207:Data Mining

Discover previously unknown groups of customers/items.By finding clusters of customers, companies can then

determine how best to handle that particular cluster.

For example, this could be used for targeted advertising, special offers, transferring information gathered by association rule mining to other members of the cluster, and so forth.

The concept of 'Similarity' is often used for determining other items that you might be interested in, eg 'More Like This' links.

Clustering


COMP207:Data Mining

By finding association rules from shopping baskets, supermarkets can use this information for many things, including:

Product placement in the store What to put on sale What to create as 'joint special offers' What to offer the customer in terms of coupons What to advertise together

It shouldn't be surprising that your Tesco coupons are for things that you sometimes buy, rather than things you always or never buy.

Wal-Mart in the US records every transaction at every store -- petabytes of information to sift through. (TeraData)

Association Rule Mining


COMP207:Data Mining

Note well that data mining applications have no wisdom. They cannot apply the knowledge that they discover appropriately.

For example, a data mining application may tell you that there is a correlation between buying music magazines and beer, but it doesn't tell you how to use that knowledge. Should you put the two close together to reinforce the tendency, or should you put them far apart as people will buy them anyway and thus stay in the store longer?

Data mining can help managers plan strategies for a company, it does not give them the strategies.

Data/Information/Knowledge/Wisdom


COMP207:Data Mining

What is data mining?KDD - knowledge discovery in databases: nontrivial extraction of implicit, previously unknown and potentially useful information

Why do we need data mining?- Very large data - data explosion,- Dimensionality of data- Heterogeneity of data- Technology rich- Traditional techniques infeasible

Summary


COMP207:Data Mining

Documents

Dr. M. Sulaiman Khan ([email protected]) Dept. of Computer Science University of Liverpool 2010