
Indexing and Data Mining in Multimedia Databases

Christos Faloutsos
CMU

www.cs.cmu.edu/~christos


U. of Alberta, 2001 C. Faloutsos 2

Outline

Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Resources


Problem

Given a large collection of (multimedia) records, find similar/interesting things, i.e.:

• Allow fast, approximate queries, and
• Find rules/patterns


Sample queries

• Similarity search
  – Find pairs of branches with similar sales patterns
  – Find medical cases similar to Smith's
  – Find pairs of sensor series that move in sync


Sample queries – cont’d

• Rule discovery
  – Clusters (of patients; of customers; ...)
  – Forecasting (total sales for next year?)
  – Outliers (e.g., fraud detection)


Outline

Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Resources


Indexing - Multimedia

Problem:
• given a set of (multimedia) objects,
• find the ones similar to a desirable query object (quickly!)


[Figure: three time-sequence plots of $price vs. day (day 1 to 365)]

distance function: by expert

‘GEMINI’ - Pictorially

[Figure: sequences S1 ... Sn (day 1 to 365) and their feature points F(S1) ... F(Sn); feature axes: e.g., avg and e.g., std]

off-the-shelf SAMs (Spatial Access Methods)


‘GEMINI’

• fast; ‘correct’ (= no false dismissals)
• used for
  – images (e.g., QBIC) (2x, 10x faster)
  – shapes (27x faster)
  – video (e.g., InforMedia)
  – time sequences ([Rafiei+Mendelzon], ++)
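
The filter-and-refine idea behind ‘GEMINI’ can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the features (avg, std) follow the example on the earlier slide, a linear scan stands in for the spatial access method, and the function name `gemini_range_query` is hypothetical. For length-n sequences, sqrt(n) times the distance in (avg, std) space never exceeds the Euclidean distance between the sequences, so the filter step cannot cause false dismissals.

```python
import numpy as np

def gemini_range_query(db, query, eps):
    """Return indices of sequences within Euclidean distance eps of query.

    db    : (N, n) array, one length-n sequence per row
    query : length-n sequence
    """
    db = np.asarray(db, dtype=float)
    query = np.asarray(query, dtype=float)
    n = db.shape[1]

    # Feature extraction: each sequence becomes a 2-d point (avg, std).
    feats = np.column_stack([db.mean(axis=1), db.std(axis=1)])
    q_feat = np.array([query.mean(), query.std()])

    # Filter: cheap search in feature space (a real system would use an
    # R-tree or similar here; the linear scan is an assumption of this sketch).
    # The radius is shrunk by sqrt(n) so that the feature distance
    # lower-bounds the true distance.
    feat_dist = np.linalg.norm(feats - q_feat, axis=1)
    candidates = np.nonzero(feat_dist <= eps / np.sqrt(n) + 1e-9)[0]

    # Refine: discard false alarms using the true distance.
    return [int(i) for i in candidates
            if np.linalg.norm(db[i] - query) <= eps]
```

The filter may return false alarms, which the refine step removes; what matters for correctness is that it never drops a true answer.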


Remaining issues

• how to extract features automatically?
• how to merge similarity scores from different media


Outline

Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
  – Visualization: Fastmap
  – Relevance feedback: FALCON
• Data Mining / Fractals
• Conclusions


FastMap

      O1   O2   O3   O4   O5
O1     0    1    1  100  100
O2     1    0    1  100  100
O3     1    1    0  100  100
O4   100  100  100    0    1
O5   100  100  100    1    0

[Figure annotations: ~100, ~1, ??]


FastMap

• Multi-dimensional scaling (MDS) can do that, but in O(N**2) time

• We want a linear algorithm: FastMap [SIGMOD95]
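
A rough sketch of the FastMap projection step is given below. It is an illustration under assumptions, not the published code: it takes a full distance matrix for simplicity (so this version is not linear in N, whereas the real algorithm only evaluates the distance function against a few pivots per axis), and the pivot heuristic is a simplified one. The function name `fastmap` and its signature are hypothetical.

```python
import numpy as np

def fastmap(D, k):
    """Map N objects to k-d points, given an (N, N) distance matrix D."""
    D2 = np.asarray(D, dtype=float) ** 2      # squared (residual) distances
    n = D2.shape[0]
    X = np.zeros((n, k))
    for axis in range(k):
        # Pivot heuristic: from object 0 jump to its farthest object b,
        # then to the object a farthest from b.
        b = int(np.argmax(D2[0]))
        a = int(np.argmax(D2[b]))
        dab2 = D2[a, b]
        if dab2 <= 1e-12:                     # all residual distances are ~0
            break
        # Cosine law: coordinate of every object on the line through a and b.
        x = (D2[a] + dab2 - D2[b]) / (2.0 * np.sqrt(dab2))
        X[:, axis] = x
        # Residual distances on the hyperplane orthogonal to that line.
        D2 = np.maximum(D2 - (x[:, None] - x[None, :]) ** 2, 0.0)
    return X
```

Applied to the 5x5 distance matrix of the previous slide, the first coordinate places O1, O2, O3 near 0 and O4, O5 near 100, which recovers the two groups.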


Applications: time sequences

• given n co-evolving time sequences
• visualize them + find rules [ICDE00]

[Figure: rate vs. time for HKD, JPY, DEM]


Applications - financial

• currency exchange rates [ICDE00]

[Figure: scatter plot with points labeled USD(t), USD(t-5), FRF, GBP, JPY, HKD]


Applications - financial

• currency exchange rates [ICDE00]

[Figure: scatter plot with points labeled USD, HKD, JPY, FRF, DEM, GBP, USD(t), USD(t-5)]


Outline

Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
  – Visualization: Fastmap
  – Relevance feedback: FALCON
• Data Mining / Fractals
• Conclusions


Merging similarity scores

• e.g., video: text, color, motion, audio
  – weights change with the query!
• solution 1: user specifies weights
• solution 2: user gives examples
  – and we ‘learn’ what he/she wants: rel. feedback (Rocchio, MARS, MindReader)
  – but: how about disjunctive queries?


DEMO

server demo


‘FALCON’

[Figure labels: Inverted Vs; Vs]

Trader wants only ‘unstable’ stocks


‘FALCON’

[Figure labels: Inverted Vs; Vs]

average: is flat!


“Single query point” methods

Rocchio

[Figure: positive examples (+) and a single query point (x); axes: avg, std]


“Single query point” methods

[Figure: three panels labeled Rocchio, MARS, and MindReader, each with positive examples (+) and a single query point (x)]

The averaging effect in action...

Main idea: FALCON Contours

[Figure: positive examples (+); axes: feature1 (e.g., avg), feature2 (e.g., std)]

[Wu+, vldb2000]


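A: Aggregate Dissimilarity

$D_G(x) = \left( \frac{1}{k} \sum_{i=1}^{k} d(x, g_i)^{\alpha} \right)^{1/\alpha}$

α: parameter (~ -5 ~ ‘soft OR’)

[Figure: good examples g1, g2 (+) and a candidate point x]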
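
A small numeric sketch may help show why a negative exponent acts like a ‘soft OR’. It assumes the aggregate dissimilarity is the power mean of the distances d(x, g_i) from a candidate x to the k good examples g_i, with exponent α, and that d is Euclidean; the function name and the sample points are illustrative only.

```python
import numpy as np

def aggregate_dissimilarity(x, good, alpha=-5.0):
    """Power mean (exponent alpha) of distances from x to the good examples."""
    x = np.asarray(x, dtype=float)
    good = np.asarray(good, dtype=float)
    d = np.linalg.norm(good - x, axis=1)      # d(x, g_i), Euclidean by assumption
    if alpha < 0 and np.any(d == 0):
        return 0.0                            # limit value when x coincides with some g_i
    return float(np.mean(d ** alpha) ** (1.0 / alpha))

good = [[0.0, 0.0], [10.0, 0.0]]              # two far-apart good examples
near_g1 = [1.0, 0.0]                          # close to the first example
midpoint = [5.0, 0.0]                         # the "average" query point

# alpha = -5: being close to ANY good example gives a small score.
soft_or_near = aggregate_dissimilarity(near_g1, good, alpha=-5.0)
soft_or_mid = aggregate_dissimilarity(midpoint, good, alpha=-5.0)

# alpha = 1: plain averaging cannot tell the two candidates apart.
avg_near = aggregate_dissimilarity(near_g1, good, alpha=1.0)
avg_mid = aggregate_dissimilarity(midpoint, good, alpha=1.0)
```

With α = -5 the candidate next to one good example scores far better than the midpoint, while with α = 1 both get the same score; this is the disjunctive behaviour that single-query-point methods miss.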

FALCON

• converges quickly (~5 iterations)
• good precision/recall
• is fast (can use off-the-shelf ‘spatial/metric access methods’)


Conclusions for indexing + visualization

• GEMINI: fast indexing, exploiting off-the-shelf SAMs

• FastMap: automatic feature extraction in O(N) time

• FALCON: relevance feedback for disjunctive queries


Outline

Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Resources


Data mining & fractals – Road map

• Motivation – problems / case study
• Definition of fractals and power laws
• Solutions to posed problems
• More examples


Problem #1 - spatial d.m.

Galaxies (Sloan Digital Sky Survey w/ B. Nichol) - ‘spiral’ and ‘elliptical’ galaxies

(stores & households; healthy & ill subjects)

- patterns? (not Gaussian; not uniform)
- attraction/repulsion?
- separability??


Problem#2: dim. reduction

• given attributes x1, ... xn
  – possibly, non-linearly correlated
• drop the useless ones

(Q: why? A: to avoid the ‘dimensionality curse’)

[Figure: scatter plot; axes: engine size, mpg]


Answer:

• Fractals / self-similarities / power laws


What is a fractal?

= self-similar point set, e.g., Sierpinski triangle:

...zero area;

infinite length!


Definitions (cont’d)

• Paradox: Infinite perimeter; Zero area!
• ‘dimensionality’: between 1 and 2
• actually: Log(3)/Log(2) = 1.58… (long story)
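
A short version of the ‘long story’, using the standard self-similarity argument (the symbols N, s, A_n, P_n are introduced here for illustration): a set made of N copies of itself, each shrunk by a factor s, has dimension D with N = s^D. Each construction step of the Sierpinski triangle keeps 3 half-size copies, so the area shrinks by 3/4 per step while the total perimeter grows by 3/2.

```latex
N = s^{D} \;\Rightarrow\; D = \frac{\log N}{\log s},
\qquad N = 3,\ s = 2 \;\Rightarrow\; D = \frac{\log 3}{\log 2} \approx 1.585

A_n = \left(\tfrac{3}{4}\right)^{n} A_0 \to 0,
\qquad
P_n = \left(\tfrac{3}{2}\right)^{n} P_0 \to \infty
\quad (n \to \infty)
```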


Intrinsic (‘fractal’) dimension

• Q: fractal dimension of a line?

Eg: #cylinders; miles / gallon

x  y
5  1
4  2
3  3
2  4


Intrinsic (‘fractal’) dimension

• Q: fractal dimension of a line?

• A: nn ( <= r ) ~ r^1

• Q: fd of a plane?
• A: nn ( <= r ) ~ r^2

fd == slope of (log(nn) vs log(r))
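
The slope-based definition can be tried numerically. The sketch below is a brute-force illustration with assumed names (`correlation_dimension`, the radii range, the sample sizes); it counts pairs within distance r for several radii and fits a line in log-log scale. It uses O(N^2) memory, so it is only meant for small point sets.

```python
import numpy as np

def correlation_dimension(points, radii):
    """Slope of log(#pairs within <= r) vs. log(r), fitted by least squares."""
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    d = dist[np.triu_indices(len(pts), k=1)]          # each pair counted once
    counts = np.array([(d <= r).sum() for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return float(slope)

rng = np.random.default_rng(0)
radii = np.logspace(-1.5, -1.0, 6)

t = rng.uniform(size=1000)
line = np.column_stack([t, 2.0 * t])                  # points on a line in 2-d
plane = rng.uniform(size=(1000, 2))                   # points filling a square

fd_line = correlation_dimension(line, radii)          # expected to be close to 1
fd_plane = correlation_dimension(plane, radii)        # expected to be close to 2
```

The estimates come out slightly below 1 and 2 because points near the boundary have fewer neighbors; choosing radii well inside the data extent keeps that effect small.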

Sierpinski triangle

[Figure: log(#pairs within <= r) vs. log(r), == ‘correlation integral’; annotation: 1.58]


Observations

self-similarity ->
• <=> fractals
• <=> scale-free
• <=> power-laws (y=x^a, F=C*r^(-2))

[Figure: log(#pairs within <= r) vs. log(r); annotation: 1.58]


Road map

• Motivation – problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
• Conclusions


Solution#1: spatial d.m.

Galaxies (Sloan Digital Sky Survey w/ B. Nichol - ‘BOPS’ plot - [sigmod2000])

• clusters?
• separable?
• attraction/repulsion?
• data ‘scrubbing’ – duplicates?


Solution#1: spatial d.m.

[Figure: log(#pairs within <= r) vs. log(r) for spi-spi, spi-ell, and ell-ell pairs; annotations: 1.8 slope, plateau!, repulsion!]

[w/ Seeger, Traina, Traina, SIGMOD00]


spatial d.m.

[Figure labels: r1, r2]

Heuristic on choosing # of clusters


Solution#1: spatial d.m.

[Figure: log(#pairs within <= r) vs. log(r) for spi-spi, spi-ell, and ell-ell pairs; annotations: 1.8 slope, plateau!, repulsion!!, duplicates]


Problem #2: Dim. reduction

[Figure: three datasets plotted as y vs. x on [0, 1]: (a) Quarter-circle, (b) Line, (c) Spike]


Solution:

• drop the attributes that don’t increase the ‘partial f.d.’ PFD

• dfn: PFD of attribute set A is the f.d. of the projected cloud of points [w/ Traina, Traina, Wu, SBBD00]
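
The idea of dropping attributes that do not raise the partial fractal dimension can be sketched as below. This is a simplified greedy forward pass written for illustration, not the algorithm of the cited paper; the threshold `min_gain`, the radii, and the function names are assumptions, and the fractal dimension is estimated with a brute-force pair count.

```python
import numpy as np

def fractal_dim(points, radii):
    """Slope of log(#pairs within <= r) vs. log(r) for the given point cloud."""
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    d = dist[np.triu_indices(len(pts), k=1)]
    counts = np.array([(d <= r).sum() for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return float(slope)

def select_attributes(data, radii, min_gain=0.3):
    """Keep an attribute only if adding it raises the PFD by more than min_gain."""
    data = np.asarray(data, dtype=float)
    kept, current_fd = [], 0.0
    for j in range(data.shape[1]):
        fd = fractal_dim(data[:, kept + [j]], radii)   # PFD of the projection
        if fd - current_fd > min_gain:
            kept.append(j)
            current_fd = fd
    return kept

rng = np.random.default_rng(1)
x = rng.uniform(size=800)
z = rng.uniform(size=800)
data = np.column_stack([x, x ** 2, z])   # attribute 1 is a non-linear function of attribute 0
kept = select_attributes(data, np.logspace(-1.5, -1.0, 6))
```

In this synthetic example the second attribute is fully determined by the first, so it adds almost nothing to the PFD and is dropped, even though the dependence is non-linear.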


Problem #2: dim. reduction

[Figure: (a) Quarter-circle, (b) Line, (c) Spike, each y vs. x on [0, 1]; annotations: PFD~1, PFD~1, global FD=1, PFD=1, PFD=0, PFD=1]


Problem #2: dim. reduction

[Figure: (a) Quarter-circle, (b) Line, (c) Spike, each y vs. x on [0, 1], with PFD annotations]

Notice: ‘max variance’ would fail here


Problem #2: dim. reduction

[Figure: (a) Quarter-circle, (b) Line, (c) Spike, each y vs. x on [0, 1], with PFD annotations]

Notice: SVD would fail here


Currency dataset

USD HKD JPY …
Day1 1.62
Day2 1.58


self-similar?

[Figure: two plots of S(r) vs. log(radii). Currency dataset: slope = 1.9807 (fd = 1.98). Eigenfaces dataset: slope = 4.2506 (fd = 4.25)]


FDR on the ‘currency’ dataset

[Figure: y-axis ticks 0.8 to 2 vs. #Attributes considered (1 to 6); points labeled American Dollar, German Mark, British Pound, French Franc, Japanese Yen; reference line: ‘if unif + indep.’]

• HKD: “useless”
• >1.98 axes are needed


Road map

• Motivation – problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
• Conclusions


App. : traffic

• disk traces: self-similar (also: web traffic; comm. errors; etc)

[Figure: #bytes vs. time]


More apps: Brain scans

• Oct-trees; brain-scans

[Figure: Log(#octants) vs. octree levels; 2.63 = fd]


More fractals:

• stock prices (LYCOS) - random walks: 1.5

[Figure: time axis labels: 1 year, 2 years]


More fractals:

• coast-lines: 1.1-1.2 (up to 1.58)


Examples: MG county

• Montgomery County of MD (road end-points)


Examples: LB county

• Long Beach county of CA (road end-points)


More power laws: Zipf’s law

• Bible - rank vs frequency (log-log)

[Figure: log(freq) vs. log(rank); labeled points: “the”, “a”]
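
A rank-frequency plot like this one is usually summarized by the slope of a straight-line fit in log-log scale. The sketch below shows one way to compute it; the frequencies are synthetic (an exact 1/rank law made up for the example, not Bible word counts), and the function name is hypothetical.

```python
import numpy as np

def fit_rank_frequency(freqs):
    """Least-squares line through (log10 rank, log10 freq); returns (slope, intercept)."""
    f = np.sort(np.asarray(freqs, dtype=float))[::-1]   # most frequent item gets rank 1
    rank = np.arange(1, len(f) + 1)
    slope, intercept = np.polyfit(np.log10(rank), np.log10(f), 1)
    return float(slope), float(intercept)

# Synthetic frequencies following freq = 1000 / rank exactly.
synthetic = 1000.0 / np.arange(1, 201)
slope, intercept = fit_rank_frequency(synthetic)
```

For this synthetic input the fit recovers slope -1 and intercept 3 (that is, log10 of 1000); real word counts only follow the line approximately.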


More power laws

• Freq. distr. of first names; last names (Mandelbrot)


Internet

• Internet routers: how many neighbors within h hops?

[Figure label: U of Alberta]


Internet topology

• Internet routers: how many neighbors within h hops? [SIGCOMM 99]

Reachability function: number of neighbors within r hops, vs r (log-log).

[Figure: Mbone routers, 1995; log(#pairs) vs. log(hops); annotation: 2.8]
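
The reachability function can be computed with one breadth-first search per node. The sketch below is illustrative: the adjacency-dict representation, the function name, and the tiny path graph are assumptions, and each pair of nodes is counted once.

```python
from collections import deque

def pairs_within_hops(adj, max_hops):
    """counts[h-1] = number of unordered node pairs at most h hops apart."""
    counts = [0] * max_hops
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:                       # BFS, cut off at max_hops
            u = queue.popleft()
            if dist[u] == max_hops:
                continue
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for node, h in dist.items():
            if node != src:
                for k in range(h - 1, max_hops):
                    counts[k] += 1         # a pair h hops apart is within k+1 >= h hops
    return [c // 2 for c in counts]        # every pair was seen from both ends

# Toy undirected graph: a path 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
reach = pairs_within_hops(path, 3)
```

Plotting these counts against the hop count on log-log axes gives the kind of curve shown on the slide; a roughly straight line indicates a power law.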


More power laws: areas – Korcak’s law

Scandinavian lakes

([icde99], w/ Proietti)


More power laws: areas – Korcak’s law

Scandinavian lakes area vs complementary cumulative count (log-log axes)

[Figure: log(count( >= area)) vs. log(area)]


Olympic medals:

[Figure: log(# medals) vs. log rank, with linear fit y = -0.9676x + 2.3054, R² = 0.9458]


More power laws

• Energy of earthquakes (Gutenberg-Richter law) [simscience.org]

[Figures: log(count) vs. magnitude; amplitude vs. day]


Even more power laws:

• Income distribution (Pareto’s law)
• sales distributions
• duration of UNIX jobs
• distribution of UNIX file sizes
• publication counts (Lotka’s law)


Even more power laws:

• web hit frequencies ([Huberman])
• hyper-link distribution [Barabasi], ++


Overall Conclusions:

‘Find similar/interesting things’ in multimedia databases

• Indexing: feature extraction (‘GEMINI’)
  – automatic feature extraction: FastMap
  – Relevance feedback: FALCON


Conclusions - cont’d

• New tools for Data Mining: Fractals/power laws:
  – appear everywhere
  – lead to skewed distributions (as opposed to Gaussian, Poisson, uniformity, independence)
  – ‘correlation integral’ for separability/cluster detection
  – PFD for dimensionality reduction


Conclusions - cont’d

– can model bursty time sequences (buffering/prefetching)

– selectivity estimation (‘how many neighbors within x km?’)

– dim. curse diagnosis (it’s the fractal dim. that matters! [ICDE2000])


Resources:

• Software and papers:
  – http://www.cs.cmu.edu/~christos
  – Fractal dimension (FracDim)
  – Separability (sigmod 2000)
  – Relevance feedback for query by content (FALCON – vldb 2000)