
Data Mining 1 - Linköping Universitystaffjimjo/courses/TNM048/lectures/... · 2019-01-28 · Information Visualization: Data Mining - 1 Matt Cooper Big Data 2 •“Principles of


Information Visualization: Data Mining - 1

Matt Cooper

Big Data

• “Principles of Data Mining”

• David Hand

• Heikki Mannila

• Padhraic Smyth

• Mostly about data mining algorithms

• “Data Preparation for Data Mining”

• Dorian Pyle

• Concentrates on data preparation


Books

• Part 1: What is the problem?

• Motivation: what is the goal of data mining?

• What is data mining?

• How is it used

• How does data mining relate to:

• InfoViz

• Knowledge discovery

• VDM – Visual Data Mining

Part 1


• Q. What is Visualization?

• A. Using some medium/media to convey a representation of some data so that the user can form a cognitive understanding of the data

• It is *not* making pictures!

What is InfoViz

• Often displayed like this

• Transform=data filtering

• Mapping?

• Representation?

[Figure: visualization pipeline diagram with stages labelled Data, Transform, New data, Mapping, Representation, Display, Perception]

Visualization


• Representation: false ‘picture’ of physical qualities

• Molecules

• Fluid flows

• Body bits

• Primarily 3D -> volume displays

• Sometimes with time -> ‘animation’

• Very occasionally higher dimensionality

For Scientific Visualization:


• Data has no ‘real’ representation

• Data isn’t 3D - it’s often quite abstract

• No ‘spatial’ relationships at all

• Data items comprise many different fields

• Imagine characterizing a person

• SciViz – 3D or maybe 4D

• InfoViz – A zillion dimensions

• What representation?

For InfoViz


• Having an (enormous) amount of data

• Wonder what it can tell us

• Isolate (unexpected) relationships

• (Hopefully) find some which are

• Interesting

• Novel

• Informative

• Helpful

• “Secondary data analysis”

Data Mining


• We generate enormous amounts of data.

• Every time we:

• Bank

• Shop

• Vote

• Drive

• Fly

• Phone…

• This data is collected.

Data gathering

• All this data is collectable!

• Easy to collect and believed to have value

• We never throw anything away!

• Easy to keep and believed to have value.

• Technologies to gather new information are growing rapidly.

Data gathering (2)

• 2011 UK census

• ~63 Million people

• ~35 questions each

• more than three pages

• ~2+ Billion data items

e.g. census data
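
The item count on this slide is a simple multiplication. A quick check, using only the rounded figures quoted above (so the result is approximate too):

```python
# Rough size of the 2011 UK census, using the approximate slide figures.
people = 63_000_000        # ~63 million respondents
questions_per_person = 35  # ~35 questions each

data_items = people * questions_per_person
print(f"{data_items:,} data items (~{data_items / 1e9:.1f} billion)")
```

This gives a little over two billion items, which matches the "~2+ Billion" estimate.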


• ‘Statistics’ versus ‘data mining’

• Statistics

• Want to know the answer to a question

• Gather suitable data (ask the question)

• Analyse the answers

• Gain (probabilistic?) insight into the answer

What is ‘Data Mining’

• Given a database of shoe-buyers…

• Database: What size shoes do people in the income bracket 20000Kr-25000Kr buy?

• Data mining: What common factors (if any) affect the size of shoes people buy?

Database Query & Data mining
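
The contrast between the two questions can be sketched in a few lines. This is only an illustration: the records, the field names and the use of a Pearson correlation as a crude "common factor" detector are all assumptions made for the example, not something the slides specify.

```python
from statistics import mean

# Hypothetical shoe-buyer records: (income in Kr, height in cm, shoe size).
# All values are invented for illustration only.
buyers = [
    (21_000, 165, 38),
    (24_000, 180, 43),
    (30_000, 172, 41),
    (22_500, 190, 46),
    (45_000, 160, 37),
    (18_000, 175, 42),
]

# Database query: a specific question whose shape is known in advance.
sizes_in_bracket = [size for income, _, size in buyers if 20_000 <= income <= 25_000]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Data mining (very crudely): ask which fields, if any, vary with shoe size.
incomes, heights, sizes = zip(*buyers)
correlations = {
    "income": pearson(incomes, sizes),
    "height": pearson(heights, sizes),
}
strongest = max(correlations, key=lambda k: abs(correlations[k]))
print(sizes_in_bracket, strongest)
```

The query returns exactly what was asked for. The second half does not know in advance which field matters and has to search for it.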


• “Everyone spoke of an information overload but what there was in fact was a non-information overload”

• Richard Saul Wurman, “What-If, Could-be”, Philadelphia, 1976.

• (Wrote the book “Information Anxiety”)

Motivation


• Extraction of interesting (non-trivial), previously unknown (and potentially useful) information or patterns from data in ((very) large) databases.

• Inmon (slightly paraphrased)

What is data mining?

• Knowledge discovery in databases (KDD)

• Knowledge extraction

• Data/pattern analysis

• Data archeology

• Information harvesting

• Business intelligence

Alternative names


• (Deductive) query processing.

• Expert systems

• Statistical analysis

What is not data mining?


• Relational databases

• Transactional databases

• Advanced DB and information repositories:

• Object-oriented and object-relational databases

• Time-series data and temporal data

• Text databases and multimedia databases

• Heterogeneous and legacy databases

• WWW

• Security data (images? video?...)

• Data warehouses

Data Mining: What Data?


• Each of a (large) number (n) of data items is a ‘tuple’

• Sometimes called a ‘feature vector’

• Tuple: a (large?) number (p) of items

• Each item may be:

• Numeric

• Textual

• other tuple (e.g. fingerprints, images, etc.)

• May be discrete or continuous

• Result is n points in a p-dimensional space

What are the characteristics of the data?


ID   AGE  SEX  Education   Income
248  54   M    School      100 000
249  ??   F    Degree      127 831
250  9    M    Incomplete  0
251  85   F    PhD         56 348
252  32   ??   Degree      48 326
253  45   M    ??          ??

Example data set

• Holes

• Missing data values

• Errors and ‘estimates’

• Income of *exactly* 100000?

• Sample inconsistencies:

• E.g. medical records with different numbers of readings for the same person

Problems with data
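
The first two problems can be checked mechanically on the example data set. The sketch below represents each row as a tuple with `None` standing in for `??`. The field names and the "suspiciously round" threshold of 10 000 are assumptions chosen for the illustration.

```python
# The example data set from the slide, with None standing in for '??'.
FIELDS = ("id", "age", "sex", "education", "income")
rows = [
    (248, 54, "M", "School", 100_000),
    (249, None, "F", "Degree", 127_831),
    (250, 9, "M", "Incomplete", 0),
    (251, 85, "F", "PhD", 56_348),
    (252, 32, None, "Degree", 48_326),
    (253, 45, "M", None, None),
]

# Holes: count missing values per field.
missing = {
    field: sum(1 for row in rows if row[i] is None)
    for i, field in enumerate(FIELDS)
}

# Possible 'estimates': non-zero incomes that are suspiciously round.
# The 10 000 threshold is an arbitrary choice for this sketch.
suspect_ids = [
    row[0] for row in rows
    if row[4] is not None and row[4] > 0 and row[4] % 10_000 == 0
]

print(missing)
print(suspect_ids)
```

A check like this only flags candidates. Whether an income of exactly 100 000 is an error or an honest answer still needs human judgement.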

Objectives of DM

• Identifying patterns in data:

• For representation

• Because they are ‘interesting’

• Unexpected!


1. Exploratory Data Analysis

2. Descriptive Modelling

3. Predictive Modelling

• Classification and Regression

4. Discovering Patterns and Rules

5. Retrieval by content

Data Mining tasks


• Model:

• A global summary of an entire data set.

• Makes statements about any point in the full measurement space.

• Pattern:

• Makes statements about relationships between variables only in localized regions of the measurement space.

Aside: Models and Patterns

• Pure data mining

• “Explore the data with no clear idea of what we are looking for”

• Typically very visual approach

• Very tied to ‘Visual Data Mining’

• Problems with:

• Large number of data points

• Large numbers of dimensions in data

1. Exploratory Data Analysis

• Attempt to describe all of the data

• Perhaps use:

• Model of overall probability distribution in the p-dimensional space

• Partitioning into groups e.g.:

• Cluster analysis for natural grouping

• Segmentation for user-desired groups

2. Descriptive Modelling

Descriptive modelling (2)
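
As one possible illustration of cluster analysis, here is a minimal k-means sketch. k-means is a common choice but is not named in the slides, and the data, the seeding rule and the fixed iteration count are all simplifying assumptions.

```python
from math import dist

def k_means(points, k, iterations=20):
    """Plain k-means on a list of equal-length numeric tuples.

    Centres are seeded with the first k points. This keeps the sketch
    deterministic but is a simplification of real seeding strategies.
    """
    centres = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre.
        labels = [min(range(k), key=lambda c: dist(p, centres[c])) for p in points]
        # Update step: each centre moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centre if a cluster empties
                centres[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centres, labels

# Invented (age, income in thousands) tuples forming two loose groups.
people = [(25, 30), (60, 82), (27, 33), (58, 80), (24, 29), (62, 85)]
centres, labels = k_means(people, k=2)
print(labels)
```

Note that the grouping here depends entirely on Euclidean distance in the raw units, so rescaling one field would change the clusters.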

• Form a model of the data set which allows prediction of a variable based on the known values of the others

• Classification

• Prediction of a discrete variable

• Regression analysis

• Prediction of a continuous variable

• (Prediction does not mean future here)

3. Predictive modelling
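
The two flavours of predictive modelling can be sketched side by side. The techniques used here (a least-squares line for regression, a one-nearest-neighbour rule for classification) and all data values are assumptions picked for brevity. The slides do not prescribe particular methods.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def nearest_neighbour(train, x):
    """Return the label of the training example whose value is closest to x."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

# Regression: predict a continuous variable (income) from age.
ages = [20, 30, 40, 50]
incomes = [20_000, 30_000, 40_000, 50_000]   # invented, exactly linear
slope, intercept = fit_line(ages, incomes)
predicted_income = slope * 35 + intercept

# Classification: predict a discrete variable (education) from age.
labelled = [(9, "Incomplete"), (19, "School"), (32, "Degree"), (54, "School")]
predicted_class = nearest_neighbour(labelled, 30)

print(predicted_income, predicted_class)
```

In both cases "prediction" means filling in an unknown value from known ones, not forecasting the future.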


Predictive modelling (2)


• Q: “Why is PM (predictive modelling) not the same as DM (descriptive modelling)?”

• Strong similarities, some similar methods

• A: The goals are subtly different:

• DM is about grouping in the variable space and identifying the groups.

• PM is about predicting one variable.

Descriptive and Predictive Modelling

• Concerned with the identification of local patterns in sub-sets of the space.

• Examples:

• Frequently occurring sets of transactions

• Finding patterns of action indicating fraud

4. Discovering Rules and Patterns
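
One simple reading of the first example is finding items that frequently occur together in transactions. The sketch below counts co-occurring pairs. The baskets and the support threshold are invented, and counting only pairs is a simplification of full frequent-itemset mining.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Return item pairs that occur together in at least min_support transactions."""
    counts = Counter()
    for basket in transactions:
        # Sort and de-duplicate so (a, b) and (b, a) count as the same pair.
        counts.update(combinations(sorted(set(basket)), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Invented shopping baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["beer", "crisps"],
    ["bread", "butter"],
    ["milk", "bread", "crisps"],
]
print(frequent_pairs(baskets, min_support=2))
```

The result describes only a local region of the data (baskets containing bread) and says nothing about the rest, which is what distinguishes a pattern from a model.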

• Using a pattern of interest to locate similar patterns

• Examples: Automatically…

• Finding images with similar content

• Finding text documents with similar content

5. Retrieval by content
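
For text, one widely used way to find documents "with similar content" is to compare word-count vectors by cosine similarity. This is a minimal sketch under that assumption. The corpus is invented, and real systems add stemming, stop-word removal and term weighting.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents):
    """Rank documents from most to least similar to the query."""
    return sorted(documents, key=lambda doc: cosine_similarity(query, doc), reverse=True)

# Invented mini-corpus.
docs = [
    "stars and galaxies in digital images",
    "basketball game logs and player behaviour",
    "classifying galaxies from image data",
]
ranked = retrieve("galaxies in images", docs)
print(ranked[0])
```

Note that "image" and "images" are treated as unrelated words here, a reminder that the quality of retrieval depends heavily on how similarity is defined.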


• All of the preceding classes of task share a common feature:

• The notion of “is like” or “similarity”

• Or difference (dissimilarity)

• Defined through a ‘scoring function’

• In numerical or categorical data this is often easy

• In general it is not…

Score functions
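
For the "easy" cases, two standard dissimilarity measures are sketched below: Euclidean distance for numeric fields and the fraction of mismatching fields for categorical ones. The function names and sample values are illustrative assumptions.

```python
from math import sqrt

def euclidean(a, b):
    """Dissimilarity for numeric tuples: straight-line distance."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mismatch_fraction(a, b):
    """Dissimilarity for categorical tuples: fraction of fields that differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Numeric fields (age, income in thousands); values invented.
print(euclidean((54, 100), (51, 96)))                       # 5.0
# Categorical fields (sex, education).
print(mismatch_fraction(("M", "School"), ("F", "School")))  # 0.5
```

Combining the two into a single score for mixed tuples already requires a judgement about how to weight the fields, which is where the "in general it is not" easy part begins.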


• Is an orange like an apple?

• Yes:

• Both are fruit.

• Both grow on trees.

• No:

• One is citrus, one isn’t.

• One is orange, one is green/red

Scoring functions (2)
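
The orange/apple question can be made concrete with a set-based score. The sketch below uses Jaccard similarity, which is one reasonable choice rather than anything the slides specify, and the attribute strings are taken loosely from the yes/no arguments above.

```python
def jaccard(a, b):
    """Jaccard similarity: shared attributes divided by all attributes seen."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Attribute sets taken from the slide's yes/no arguments.
orange = {"fruit", "grows on trees", "citrus", "orange-coloured"}
apple = {"fruit", "grows on trees", "green/red"}

# Scoring on every listed attribute.
overall = jaccard(orange, apple)

# Scoring only on what kind of thing it is and where it grows.
kind_only = {"fruit", "grows on trees"}
restricted = jaccard(orange & kind_only, apple & kind_only)

print(overall, restricted)
```

The same two objects score 0.4 or 1.0 depending on which attributes are allowed to count, so the answer to "is it like?" is set by the scoring function, not by the data alone.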

• Is this picture

• Like this one?

Scoring functions (3)


• Specification of the scoring function(s) is crucial to the effectiveness of the system.

• One of the biggest contributions the user has to make!

Scoring functions (4)

• Segmentation of sales data is extensively used to classify customers by purchasing patterns and demographic data (age, income etc.)

• Use to target marketing

• Example of descriptive modelling

Example applications (1)

• The Advanced Scout system

• Analyses Basketball game logs

• Identifies features of players’ behaviour

• Circumstances when they play well/badly

• Which opposing players they are good or bad against

• An example of discovering rules and patterns

Example applications (2)

• Dr. John Snow’s Cholera diagram

• Example of Exploratory Data Analysis

• Also Visual Data Mining

• Done without knowing what caused Cholera!

Example applications (3)

• SKICAT

• Classifies stars and galaxies automatically from digital image data

• Uses a 40-dimensional feature vector

• Works as well as human experts

• Predictive modelling

Example applications (4)

• Image searching on the web

• Both Altavista and Google had such functions ~2000

• Both removed them

• Google now has one again (2014)

• Face recognition for security (spotting terrorists)

• Been trialled at several airports in various countries

• Some limited success to date

• Both examples of retrieval by content.

Example Applications (5)

[Figures: screenshots of Altavista Image Search (2000) and Google Image Search (2015); the Google screenshots are labelled 2nd, 5th and 15th]

• Searching text documents for lies on CVs

• Example of a retrieval-by-content method

Example applications (6)

• Detecting inappropriate medical treatment

• The Australian Health Insurance Commission identified that in many cases blanket screening tests were being requested (saving Australia $1m/yr).

• Example of Descriptive/Predictive modelling

Fraud Detection and Management

• Data mining: discovering interesting models and patterns in data

• ‘Simplifications’ enabling understanding!

• A natural evolution of database technology, in great demand, with wide applications

• Mining can be performed in a variety of information repositories

Summary (1)

• Information expert’s input still vital

• Defining methods

• Defining scoring functions

Summary (2)


• End of Part 1
