Data Mining 1 - Linköping Universitystaffjimjo/courses/TNM048/lectures/... · 2019-01-28 ·...

Information Visualization: Data Mining - 1

Matt Cooper

Big Data

• “Principles of Data Mining”

• David Hand

• Heikki Mannila

• Padhraic Smyth

• Mostly about data mining algorithms

• “Data Preparation for Data Mining”

• Dorian Pyle

• Concentrates on data preparation

Books

• Part 1: What is the problem?

• Motivation: what is the goal of data mining?

• What is data mining?

• How is it used?

• How does data mining relate to:

• InfoViz

• Knowledge discovery

• VDM – Visual Data Mining

Part 1

• Q. What is Visualization?

• A. Using some medium/media to convey a representation of some data so that the user can form a cognitive understanding of the data

• It is *not* making pictures!

What is InfoViz

• Often displayed like this

• Transform=data filtering

• Mapping?

• Representation?

[Figure: visualization pipeline; labels: Data, New data, Transform, Mapping, Representation, Perception, Display]

Visualization

• Representation: false ‘picture’ of physical qualities

• Molecules

• Fluid flows

• Body bits

• Primarily 3D -> volume displays

• Sometimes with time -> ‘animation’

• Very occasionally higher dimensionality

For Scientific Visualization:

• Data has no ‘real’ representation

• Data isn’t 3D - it’s often quite abstract

• No ‘spatial’ relationships at all

• Data items comprise many different fields

• Imagine characterizing a person

• Sciviz – 3D or maybe 4D

• InfoViz – A zillion dimensions

• What representation?

For InfoViz

• Having an (enormous) amount of data

• Wonder what it can tell us

• Isolate (unexpected) relationships

• (Hopefully) find some which are

• Interesting

• Novel

• Informative

• Helpful

• “Secondary data analysis”

Data Mining

• We generate enormous amounts of data.

• Every time we:

• Bank

• Shop

• Vote

• Drive

• Fly

• Phone…

• This data is collected.

Data gathering

• All this data is collectable!

• Easy to collect and believed to have value

• We never throw anything away!

• Easy to keep and believed to have value.

• Technologies to gather new information are growing rapidly.

Data gathering (2)

• 2011 UK census

• ~63 Million people

• ~35 questions each

• more than three pages

• ~2+ Billion data items

e.g. census data
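
As a rough check of the census figures, the arithmetic can be sketched as follows. Both inputs are the slide's rounded estimates, not official statistics.

```python
# Approximate figures taken from the slide; both are rounded estimates.
people = 63_000_000          # ~63 million respondents
questions_per_person = 35    # ~35 questions each

data_items = people * questions_per_person
print(f"{data_items:,} data items")  # prints: 2,205,000,000 data items
```

This is consistent with the "~2+ Billion data items" quoted above.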

• ‘Statistics’ versus ‘data mining’

• Statistics

• Want to know the answer to a question

• Gather suitable data (ask the question)

• Analyse the answers

• Gain (probabilistic?) insight into the answer

What is ‘Data Mining’

• Given a database of shoe-buyers…

• Database: What size shoes do people in the income bracket 20000Kr-25000Kr buy?

• Data mining: What common factors (if any) affect the size of shoes people buy?

Database Query & Data mining
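
The contrast between the two questions can be sketched in a few lines. The records below are invented for illustration, and Pearson correlation is only a crude stand-in for "common factors"; real data mining methods are considerably richer.

```python
from math import sqrt

# Hypothetical shoe-buyer records (invented for illustration only).
buyers = [
    {"income": 21000, "age": 34, "height_cm": 165, "shoe_size": 38},
    {"income": 24000, "age": 51, "height_cm": 182, "shoe_size": 44},
    {"income": 48000, "age": 29, "height_cm": 175, "shoe_size": 42},
    {"income": 23000, "age": 45, "height_cm": 170, "shoe_size": 40},
    {"income": 61000, "age": 38, "height_cm": 188, "shoe_size": 46},
    {"income": 35000, "age": 62, "height_cm": 160, "shoe_size": 37},
]

# Database query: the question is fixed in advance, the answer is a lookup.
sizes_in_bracket = [b["shoe_size"] for b in buyers if 20000 <= b["income"] <= 25000]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Data mining (very crude): no attribute is singled out beforehand;
# every attribute is scored against shoe size and ranked by strength.
shoe = [b["shoe_size"] for b in buyers]
strength = {
    field: pearson([b[field] for b in buyers], shoe)
    for field in ("income", "age", "height_cm")
}
ranked = sorted(strength, key=lambda f: abs(strength[f]), reverse=True)
print(sizes_in_bracket)
print(ranked)
```

The query returns exactly what was asked for; the ranking may surface a relationship nobody asked about.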

• “Everyone spoke of an information overload but what there was in fact was a non-information overload”

• Richard Saul Wurman, “What-If, Could-be”, Philadelphia, 1976.

• (Wrote the book “Information Anxiety”)

Motivation

• Extraction of interesting (non-trivial), previously unknown (and potentially useful) information or patterns from data in ((very) large) databases.

• Inmon (slightly paraphrased)

What is data mining?

• Knowledge discovery in databases (KDD)

• Knowledge extraction

• Data/pattern analysis

• Data archeology

• Information harvesting

• Business intelligence

Alternative names

• (Deductive) query processing.

• Expert systems

• Statistical analysis

What is not data mining?

• Relational databases

• Transactional databases

• Advanced DB and information repositories:

• Object-oriented and object-relational databases

• Time-series data and temporal data

• Text databases and multimedia databases

• Heterogeneous and legacy databases

• WWW

• Security data (images? video?...)

• Data warehouses

Data Mining: What Data?

• Each of a (large) number (n) of data items is a ‘tuple’

• Sometimes called a ‘feature vector’

• Tuple: a (large?) number (p) of items

• Each item may be:

• Numeric

• Textual

• Another tuple (e.g. fingerprints, images, etc.)

• May be discrete or continuous

• Result is n points in a p-dimensional space

What are the characteristics of the data?
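
A minimal sketch of this view of the data, assuming a hypothetical `Person` tuple with three fields (the field names and height values are made up for illustration):

```python
from typing import NamedTuple

class Person(NamedTuple):
    # Field names are illustrative; any fixed set of p items would do.
    age: int            # numeric, discrete
    height_m: float     # numeric, continuous
    education: str      # textual

data = [
    Person(54, 1.80, "School"),
    Person(32, 1.65, "Degree"),
    Person(85, 1.58, "PhD"),
]

n = len(data)               # number of tuples (points)
p = len(Person._fields)     # items per tuple (dimensions)
print(f"{n} points in a {p}-dimensional space")
```

Real data sets differ mainly in scale: n may run to millions and p to hundreds or more.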

ID   AGE  SEX  Education   Income
248  54   M    School      100 000
249  ??   F    Degree      127 831
250  9    M    Incomplete  0
251  85   F    PhD         56 348
252  32   ??   Degree      48 326
253  45   M    ??          ??

Example data set

• Holes

• Missing data values

• Errors and ‘estimates’

• Income of *exactly* 100000?

• Sample inconsistencies:

• E.g. medical records with different numbers of readings for the same person

Problems with data
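
Two of these problems can be detected mechanically. The sketch below uses the rows of the example data set, with `None` standing for the "??" entries; the "suspiciously round income" rule is an arbitrary heuristic assumed for illustration, not a standard test.

```python
# Rows transcribed from the example data set; None stands for the "??" entries.
FIELDS = ("id", "age", "sex", "education", "income")
rows = [
    (248, 54,   "M",  "School",     100_000),
    (249, None, "F",  "Degree",     127_831),
    (250, 9,    "M",  "Incomplete", 0),
    (251, 85,   "F",  "PhD",        56_348),
    (252, 32,   None, "Degree",     48_326),
    (253, 45,   "M",  None,         None),
]

# Holes: count missing values per field.
missing = {f: sum(row[i] is None for row in rows) for i, f in enumerate(FIELDS)}

# Possible 'estimates': non-zero incomes that are exact multiples of 10,000.
# The threshold is an assumption made for this sketch.
suspicious = [row[0] for row in rows if row[4] and row[4] % 10_000 == 0]

print(missing)
print(suspicious)
```

Flagging a value is not the same as knowing it is wrong: an income of exactly 100 000 may be genuine, which is why such checks only direct attention.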

Objectives of DM

• Identifying patterns in data:

• For representation

• Because they are ‘interesting’

• Unexpected!

1. Exploratory Data Analysis

2. Descriptive Modelling

3. Predictive Modelling

• Classification and Regression

4. Discovering Patterns and Rules

5. Retrieval by content

Data Mining tasks

• Model:

• A global summary of an entire data set.

• Makes statements about any point in the full measurement space.

• Pattern:

• Makes statements about relationships between variables only in localized regions of the measurement space.

Aside: Models and Patterns

• Pure data mining

• “Explore the data with no clear idea of what we are looking for”

• Typically very visual approach

• Very tied to ‘Visual Data Mining’

• Problems with:

• Large number of data points

• Large numbers of dimensions in data

1. Exploratory Data Analysis

• Attempt to describe all of the data

• Perhaps use:

• Model of overall probability distribution in the p-dimensional space

• Partitioning into groups e.g.:

• Cluster analysis for natural grouping

• Segmentation for user-desired groups

2. Descriptive Modelling

Descriptive modelling (2)
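
The slides do not name a clustering algorithm. As one common choice, k-means is sketched below on made-up 2-D points, with the starting centroids fixed by hand so the run is deterministic.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for pt in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(pt, centroids[i]))
            clusters[nearest].append(pt)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2-D data with two visually obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

The result describes every point in the data set by the group it belongs to, which is what makes this descriptive rather than predictive.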

• Form a model of the data set which allows prediction of a variable based on the known values of the others

• Classification

• Prediction of a discrete variable

• Regression analysis

• Prediction of a continuous variable

• (Prediction does not mean future here)

3. Predictive modelling
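
The two flavours can be sketched side by side. The training data is hypothetical, and least squares and 1-nearest-neighbour are simply two familiar methods chosen for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x (regression: continuous target)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def nearest_label(x, examples):
    """1-nearest-neighbour on a single feature (classification: discrete target)."""
    return min(examples, key=lambda ex: abs(ex[0] - x))[1]

# Hypothetical training data: years of education against income and a class label.
years = [9, 12, 15, 18]
income = [20_000, 26_000, 32_000, 38_000]
a, b = fit_line(years, income)
predicted_income = a + b * 14      # value for an unseen record, not a future one

labelled = [(9, "School"), (12, "School"), (15, "Degree"), (18, "Degree")]
predicted_class = nearest_label(16, labelled)
print(predicted_income, predicted_class)
```

In both cases one variable is estimated from the known values of the others; only the type of the target differs.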

Predictive modelling (2)

• Q: “Why is PM (predictive modelling) not the same as DM (descriptive modelling)?”

• Strong similarities, some similar methods

• A: The goals are subtly different:

• DM is about grouping in the variable space and identifying the groups.

• PM is about predicting one variable.

Descriptive and Predictive Modelling

• Concerned with the identification of local patterns in sub-sets of the space.

• Examples:

• Frequently occurring sets of transactions

• Finding patterns of action indicating fraud

4. Discovering Rules and Patterns
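
A minimal sketch of finding frequently co-occurring items, assuming hypothetical shopping baskets and an arbitrary support threshold. It counts pairs by brute force, which is fine for a toy example but does not scale the way dedicated algorithms do.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "crisps"},
    {"bread", "butter"},
    {"bread", "milk", "crisps"},
]

min_support = 3  # a pair must occur in at least this many transactions

# Count every unordered pair of items that appears together in a basket.
pair_counts = Counter(
    pair
    for basket in transactions
    for pair in combinations(sorted(basket), 2)
)
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_support}
print(frequent_pairs)
```

The pattern found is local: it says something about the baskets containing those items and nothing about the rest of the data.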

• Using a pattern of interest to locate similar patterns

• Examples: Automatically…

• Finding images with similar content

• Finding text documents with similar content

5. Retrieval by content
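
For text, one simple way to rank documents against a pattern of interest is cosine similarity over word counts. The sketch below assumes a tiny made-up collection and whitespace tokenisation; real systems use far more careful text processing.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def bag(text):
    """Lower-cased bag of words, split on whitespace."""
    return Counter(text.lower().split())

# Hypothetical document collection.
documents = {
    "d1": "data mining finds patterns in large databases",
    "d2": "the cholera map is an early visual analysis",
    "d3": "mining large databases for interesting patterns",
}

query = bag("patterns in large databases")
ranking = sorted(documents, key=lambda d: cosine(query, bag(documents[d])), reverse=True)
print(ranking)
```

Images need a different representation, but the shape of the task is the same: score every item against the query and return the best matches.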

• All of the preceding classes of task share a common feature:

• The notion of “is like” or “similarity”

• Or difference (dissimilarity)

• Defined through a ‘scoring function’

• In numerical or categorical data this is often easy

• In general it is not…

Score functions
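
For the easy cases, two common dissimilarity measures are sketched below: Euclidean distance for numeric tuples and the fraction of mismatched fields for categorical ones. The function names and sample values are illustrative assumptions.

```python
from math import dist

def numeric_dissimilarity(a, b):
    """Euclidean distance between two numeric tuples."""
    return dist(a, b)

def categorical_dissimilarity(a, b):
    """Fraction of fields on which two categorical tuples disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

d_num = numeric_dissimilarity((54, 100.0), (51, 104.0))           # 5.0
d_cat = categorical_dissimilarity(("M", "School"), ("M", "PhD"))  # 0.5
print(d_num, d_cat)
```

Even here there are choices to make: numeric fields measured in different units usually need rescaling before a distance over them means anything.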

• Is an orange like an apple?

• Yes:

• Both are fruit.

• Both grow on trees.

• No:

• One is citrus, one isn’t.

• One is orange, one is green/red

Scoring functions (2)
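
The orange/apple question can be made concrete with a set-based score. The sketch uses Jaccard similarity over the attributes listed on the slide; treating them as equally weighted set members is an assumption of this example.

```python
def jaccard(a, b):
    """Jaccard similarity: shared attributes divided by all attributes."""
    return len(a & b) / len(a | b)

# Attributes taken from the slide's yes/no arguments.
orange = {"fruit", "grows on trees", "citrus", "orange"}
apple = {"fruit", "grows on trees", "green/red"}

all_features = jaccard(orange, apple)      # 2 shared out of 5 attributes

# Restricting the comparison to a chosen subset of attributes changes the verdict.
subset = {"fruit", "grows on trees"}
subset_only = jaccard(orange & subset, apple & subset)   # identical sets
print(all_features, subset_only)
```

The same two objects score 0.4 or 1.0 depending on which attributes are compared, so the answer to "is it like?" depends on the scoring function, not just the data.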

• Is this picture

• Like this one?

Scoring functions (3)

• Specification of the scoring function(s) is crucial to the effectiveness of the system.

• One of the biggest contributions the user has to make!

Scoring functions (4)

• Segmentation of sales data is extensively used to classify customers by purchasing patterns and demographic data (age, income etc.)

• Used to target marketing

• Example of descriptive modelling

Example applications (1)

• The Advanced Scout system

• Analyses basketball game logs

• Identifies features of players’ behaviour

• Circumstances when they play well/badly

• Which opposing players they are good or bad against

• An example of discovering rules and patterns

Example applications (2)

• Dr. John Snow’s Cholera diagram

• Example of Exploratory Data Analysis

• Also Visual Data Mining

• Done without knowing what caused Cholera!

Example applications (3)

• SKICAT

• Classifies stars and galaxies automatically from digital image data

• Uses a 40-dimensional feature vector

• Works as well as human experts

• Predictive modelling

Example applications (4)

• Image searching on the web

• Both Altavista and Google had such functions ~2000

• Both removed them

• Google now has one again (2014)

• Face recognition for security (spotting terrorists)

• Been trialled at several airports in various countries

• Some limited success to date

• Both examples of retrieval by content.

Example Applications (5)

[Screenshots: Altavista Image Search (2000) and Google Image Search (2015), with results labelled 2nd, 5th and 15th]

• Searching text documents for lies on CVs

• Example of a retrieval by content method

Example applications (6)

• Detecting inappropriate medical treatment

• The Australian Health Insurance Commission identified that blanket screening tests were being requested in many cases (saving Australia $1m/yr).

• Example of Descriptive/Predictive modelling

Fraud Detection and Management

• Data mining: discovering interesting models and patterns in data

• ‘Simplifications’ enabling understanding!

• A natural evolution of database technology, in great demand, with wide applications

• Mining can be performed in a variety of information repositories

Summary (1)

• Information expert’s input still vital

• Defining methods

• Defining scoring functions

Summary (2)

• End of Part 1
