21
1 CMU SCS 15-826 (c) C. Faloutsos and J-Y Pan (2010) #1 15-826: Multimedia Databases and Data Mining BONUS LECTURE#2: Independent Component Analysis (ICA) Jia-Yu Pan and Christos Faloutsos NOT in Final exam CMU SCS 15-826 (c) C. Faloutsos and J-Y Pan (2010) #2 Outline • Motivation • Formulation PCA and ICA Example applications • Conclusion CMU SCS 15-826 (c) C. Faloutsos and J-Y Pan (2010) #3 Motivation: (Q1) Find patterns in data Motion capture data: broad jumps Left Knee Right Knee Energy exerted Take-off Landing Energy exerted

15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS-pdf/620_ICA.pdf15-826 (c) C. Faloutsos and J-Y Pan (2010) #34 Step 3: Evaluate the patterns ID True Topic

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: 15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS-pdf/620_ICA.pdf15-826 (c) C. Faloutsos and J-Y Pan (2010) #34 Step 3: Evaluate the patterns ID True Topic

1

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #1

15-826: Multimedia Databases and Data Mining

BONUS LECTURE#2: Independent Component Analysis (ICA)

Jia-Yu Pan and Christos Faloutsos

NOT in Final exam

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #2

Outline •  Motivation •  Formulation •  PCA and ICA •  Example applications •  Conclusion

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #3

Motivation: (Q1) Find patterns in data

•  Motion capture data: broad jumps

Left Knee

Right Knee

Energy exerted

Take-off Landing

Energy exerted

Page 2: 15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS-pdf/620_ICA.pdf15-826 (c) C. Faloutsos and J-Y Pan (2010) #34 Step 3: Evaluate the patterns ID True Topic

2

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #4

Motivation: (Q1) Find patterns in data

•  Human would say – Pattern 1: along

diagonal – Pattern 2: along

vertical axis •  How to find these

automatically? Each point is the measurement at a time tick (total 550 points).

Take-off

Landing R:L=60:1

R:L=1:1

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #5

Motivation: (Q2) Find hidden variables

Hidden variables

“Internet bubble”

Alcoa

American Express

Boeing

Citi Group

“General trend”

Stock prices

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #6

Motivation: (Q2) Find hidden variables

“General trend” “Internet bubble”

0.94

0.03 0.63

0.64

Caterpillar Intel

Hidden variables

Page 3: 15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS-pdf/620_ICA.pdf15-826 (c) C. Faloutsos and J-Y Pan (2010) #34 Step 3: Evaluate the patterns ID True Topic

3

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #7

Motivation: (Q2) Find hidden variables

Hidden variable 1 Hidden variable 2

B1,CAT

B1,INTC B2,CAT

B2,INTC ?

? ?

Caterpillar Intel

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #8

Motivation: Find hidden variables

•  There are two sound sources in a cocktail party…

Also called the “blind source separation” problem. “Blind” because we don’t know the sources, nor how they are mixed.

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #9

Outline •  Motivation •  Formulation •  PCA and ICA •  Example applications •  Conclusion

Page 4: 15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS-pdf/620_ICA.pdf15-826 (c) C. Faloutsos and J-Y Pan (2010) #34 Step 3: Evaluate the patterns ID True Topic

4

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #10

Formulation: Finding patterns

Given n data points, each with m attributes.

Find patterns that describe data properties the best.

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #11

Linear representation

•  Find patterns that are vectors that describe the data set the best. •  Each point is described as a linear combination of the vectors

(patterns):

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #12

Patterns as data “vocabulary”

Good pattern ≈ sparse coding

(Q) Given data xi’s, compute hi,j’s and bi’s that are “sparse”?

Only b1 is needed to describe xi.

Page 5: 15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS-pdf/620_ICA.pdf15-826 (c) C. Faloutsos and J-Y Pan (2010) #34 Step 3: Evaluate the patterns ID True Topic

5

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #13

Patterns in motion capture data

b2

b1

n=550 ticks Data matrix

Left Right

Basis vectors

Hidden variables

? ?

Sparse ~ non-Gaussian ~ “Independent”

“Independent”: e.g., minimize mutual information.

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #14

Outline •  Motivation •  Formulation •  PCA and ICA •  Example applications •  Conclusion

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #15

Basis vectors and hidden variables

•  Goal: Knowing X, find H and B, where X = H B

•  Problem: Under-constrained – Need additional assumptions/constraints

X: data set H: hidden variables B: basis vectors

Page 6: 15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS-pdf/620_ICA.pdf15-826 (c) C. Faloutsos and J-Y Pan (2010) #34 Step 3: Evaluate the patterns ID True Topic

6

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #16

PCA and ICA

•  PCA vectors: major variations –  Together = good “low-rank approximation”/dimensional reduction –  Individually ≠ meaningful patterns

•  Luckily, ICA detects the major meaningful patterns.

PCA Vectors ICA Vectors

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #17

PCA •  PCA (Principal Component Analysis)

– Choose vectors which are orthonormal and –  give smallest projection error L2

•  Matrices H and B can be solved by – SVD, neural networks, or many optimization

methods

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #18

PCA

•  Extremely popular – Latent Semantic Indexing [Deerwester+90] – KL transform [Duda,Hart,Stork00] – EigenFaces [Turk,Pentlend91] – Multiple time series correlation

[Guha,Gunopulos,Koudas03] •  But, there is room for improvement.

Page 7: 15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS-pdf/620_ICA.pdf15-826 (c) C. Faloutsos and J-Y Pan (2010) #34 Step 3: Evaluate the patterns ID True Topic

7

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #19

ICA •  ICA (Independent Component Analysis)

– Make hidden variables hi’s (columns of H) mutually independent.

•  Many implementations – Many ways to define “independence” – Many ways to find the most independent H. –  (B is found at the same time, since X=HB.)

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #20

ICA

•  Define “Independence”: – Zero mutual information – Non-Gaussianity, max. absolute Kurtosis

•  To solve for H,B: – Neural networks, optimization methods

(gradient ascent, fixed-point, …)

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #21

An non-gaussian distribution: Laplacian pdf

Sharper at 0, and more heavy tail than Gaussian pdf

Page 8: 15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS-pdf/620_ICA.pdf15-826 (c) C. Faloutsos and J-Y Pan (2010) #34 Step 3: Evaluate the patterns ID True Topic

8

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #22

Maximizing non-gaussianity •  Assume hi ~ non-gaussian pdf (e.g. Laplacian pdf)

–  Fixed hi values, what is the most likely “B” ? •  (data point x is given and fixed)

–  Find B, s.t. likelihood P(x|B) is maximized.

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #23

Maximize likelihood •  Likelihood P(x|B) is a function of B, f(B) •  Gradient ascent

–  To find B which maximizes P(x|B)

f(B)

B B*

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #24

ICA by maximum likelihood

H B

X: static

Page 9: 15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS-pdf/620_ICA.pdf15-826 (c) C. Faloutsos and J-Y Pan (2010) #34 Step 3: Evaluate the patterns ID True Topic

9

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #25

Convergence of ICA

ICA

Start with the matrix B as an identity matrix.

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #26

Outline •  Motivation •  Formulation •  PCA and ICA •  Example applications

– Find topics in documents – Hidden variables in stock prices – Visual vocabulary for retinal images

•  Conclusion

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #27

Pattern discovery with ICA: AutoSplit [PAKDD 04][WIRI 05]

Step 1: Data points (matrix)

Step 2: Compute patterns

Step 3: Interpret patterns

or

or

Data mining (Case studies)

(Q) What pattern?

(Q) Different modalities

(Q) How?

Video frames

Stock prices

Text documents

Page 10: 15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS-pdf/620_ICA.pdf15-826 (c) C. Faloutsos and J-Y Pan (2010) #34 Step 3: Evaluate the patterns ID True Topic

10

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #28

Finding patterns in high-dimensional data

PCA finds the hyperplane. ICA finds the correct patterns.

Dimensionality reduction

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #29

Outline •  Motivation •  Formulation •  PCA and ICA •  Example applications

– Find topics in documents – Hidden variables in stock prices – Visual vocabulary for retinal images

•  Conclusion

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #30

Topic discovery on text streams •  Data: CNN headline news (Jan.-Jun. 1998) •  Documents of 10 topics in one single text

stream –  Documents are sorted by date/time –  Subsequent documents may have different

topics

Date/Time Topic 1 Topic 3 Topic 1 …

Page 11: 15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS-pdf/620_ICA.pdf15-826 (c) C. Faloutsos and J-Y Pan (2010) #34 Step 3: Evaluate the patterns ID True Topic

11

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #31

Topic discovery on text streams

•  Known: number of topics = 10 •  Unknown: (1) topic of each document (2)

topic description

Date/Time Topic 1 Topic 3 Topic 1 …

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #32

Topic discovery in documents

X[nxm]

xi = [1, 5, …, 0]

Step 1

Windowing

Step 2

X[nxm] = H[nxm] B[mxm]

(1)  Find hyperplane (m=10)

(2)  Find patterns

aaron

zoo

m=3887 (dictionary size)

(n=1659) New stories

(30 words)

Step 3

b’i = [0, 0.7, …, 0.6] aaron

zoo animal

B’[10x3887] (Q) What does b’i mean?

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #33

Step 3: Interpret the patterns

b’i = [0, 0.7, …, 0.6]

aaron

zoo animal

Topics found

A hidden topic!

Top words : “animal”, “zoo”, …

ID Sorted word list

A Mckinne Sergeant sexual Major Armi

B bomb Rudolph Clinic Atlanta Birmingham

C Winfrei Beef Texa Oprah Cattl

D Viagra Drug Impot Pill Doctor

E Zamora Graham Kill Former Jone

F Medal Olymp Gold Women Game

G Pope Cube Castro Cuban Visit

H Asia Economi Japan Econom Asian

I Super Bowl Game Team Re

J Peopl Tornado Florida Re bomb

General idea: related to the data attributes

m=3887 (dictionary size)

Page 12: 15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS-pdf/620_ICA.pdf15-826 (c) C. Faloutsos and J-Y Pan (2010) #34 Step 3: Evaluate the patterns ID True Topic

12

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #34

Step 3: Evaluate the patterns ID True Topic 1 Sgt. Gene Mckinney is on trial for alleged sexual misconduct 2 A bomb explodes in a Birmingham, AL abortion clinic 3 The Cattle Industry in Texas sues Oprah Winfrey for defaming beef 4 New impotency drug Viagra is approved for use 5 Diane Zamora is convicted of helping to murder her lover’s girlfriend 6 1998 Winter Olympic games 7 The Pope’s historic visit to Cube in Winter 1998 8 The economic crisis in Asia 9 Superbowl XXXII 10 Tornado in Florida

AutoSplit finds correct topics.

ID Sorted word list A mckinne sergeant sexual major armi B bomb rudolph clinic atlanta birmingham C winfrei beef texa oprah cattl D viagra drug Impot pill doctor E zamora graham kill former jone

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #35

Step 3: Evaluate the patterns ID AutoSplit A mckinne sergeant sexual major armi B bomb rudolph clinic atlanta birmingham C winfrei beef texa oprah cattl D viagra drug Impot pill doctor E zamora graham kill former jone

AutoSplit’s topics are better than PCA.

ID PCA A’ mckinne bomb women sexual sergeant B’ bomb mckinne rudolph clinic atlanta C’ winfrei viagra texa beef oprah D’ viagra winfrei drug texa beef E’ zamora viagra winfrei graham olymp

ID PCA A’ mckinne bomb women sexual sergeant B’ bomb mckinne rudolph clinic atlanta C’ winfrei viagra texa beef oprah D’ viagra winfrei drug texa beef E’ zamora viagra winfrei graham olymp

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #36

Step 3: Evaluate the patterns AutoSplit

A B C D E

AutoSplit’s topics are better than PCA.

PCA A’ B’ C’ D’ E’

PCA vectors mix the topics.

Topic 1

Topic 2

Page 13: 15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS-pdf/620_ICA.pdf15-826 (c) C. Faloutsos and J-Y Pan (2010) #34 Step 3: Evaluate the patterns ID True Topic

13

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #37

Outline •  Motivation •  Formulation •  PCA and ICA •  Example applications

– Find topics in documents – Hidden variables in stock prices – Visual vocabulary for retinal images

•  Conclusion

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #38

Find hidden variables (DJIA stocks)

•  Weekly DJIA closing prices –  01/02/1990-08/05/2002, n=660 data points – A data point: prices of 29 companies at the time

Alcoa

American Express

Boeing

Caterpillar

Citi Group

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #39

Formulation: Find hidden variables

AA1, …, XOM1

AAn, …, XOMn

H11, H12, …, H1m

Hn1, Hn2, …, Hnm

=

B11, B12, …, B1m

Bm1, Bm2, …, Bmm ? ?

Date Hidden variable

Date

Page 14: 15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS-pdf/620_ICA.pdf15-826 (c) C. Faloutsos and J-Y Pan (2010) #34 Step 3: Evaluate the patterns ID True Topic

14

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #40

Characterize hidden variable by the companies it influences

“General trend” “Internet bubble”

0.94

0.63 0.03

0.64

Caterpillar Intel

B1,CAT

B1,INTC B2,CAT

B2,INTC

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #41

Companies related to hidden variable 1

B1,j

Highest Lowest

Caterpillar 0.938512 AT&T 0.021885

Boeing 0.911120 WalMart 0.624570

MMM 0.906542 Intel 0.638010

Coca Cola 0.903858 Home Depot 0.647774

Du Pont 0.900317 Hewlett-Packard 0.658768

“General trend”

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #42

Companies related to hidden variable 1

B1,j

Highest Lowest

Caterpillar 0.938512 AT&T 0.021885

Boeing 0.911120 WalMart 0.624570

MMM 0.906542 Intel 0.638010

Coca Cola 0.903858 Home Depot 0.647774

Du Pont 0.900317 Hewlett-Packard 0.658768

All companies are affected by the “general trend” variable (with weights 0.6~0.9), except AT&T.

Page 15: 15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS-pdf/620_ICA.pdf15-826 (c) C. Faloutsos and J-Y Pan (2010) #34 Step 3: Evaluate the patterns ID True Topic

15

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #43

General trend (and outlier)

“General trend”

AT&T

United Technologies

Walmart

Exxon Mobil

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #44

Companies related to hidden variable 2

B2,j

Highest Lowest Intel 0.641102 Philip Morris -0.194843

Hewlett-Packard 0.621159 International Paper -0.089569 GE 0.509164 Caterpillar 0.031678

American Express 0.504871 Procter and Gamble 0.109576 Disney 0.490529 Du Pont 0.133337

2000-2001 “Internet bubble”

Tech company

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #45

Companies related to hidden variable 2

B2,j

Highest Lowest Intel 0.641102 Philip Morris -0.194843

Hewlett-Packard 0.621159 International Paper -0.089569 GE 0.509164 Caterpillar 0.031678

American Express 0.504871 Procter and Gamble 0.109576 Disney 0.490529 Du Pont 0.133337

Companies affected by the “internet bubble” variable (with weights 0.5~0.6) are tech-related. Other companies are un-related (weights < 0.15).

Tech company

Page 16: 15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS-pdf/620_ICA.pdf15-826 (c) C. Faloutsos and J-Y Pan (2010) #34 Step 3: Evaluate the patterns ID True Topic

16

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #46

Outline •  Motivation •  Formulation •  PCA and ICA •  Example applications

– Find topics in documents – Hidden variables in stock prices – Visual vocabulary for retinal images

•  Conclusion

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #47

Mining cat retinal images [ICDM 05]

Normal 1 day after detachment

7 days after detachment

Retina

28 days after detachment

3d28dr 1h3dr 1d6dO2

Treatment

Detachment Development Distribution of 2 proteins

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #48

“Vocabulary” for biomedical images?

•  How to describe biomedical images? •  Analogy: the topics for text

•  Football reports: “touchdown”, “punt”, etc. •  DB papers: “query”, “optimization”, etc.

•  How to derive “visual vocabulary terms”?

Normal 7 days after detachment

“spongy”

Page 17: 15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS-pdf/620_ICA.pdf15-826 (c) C. Faloutsos and J-Y Pan (2010) #34 Step 3: Evaluate the patterns ID True Topic

17

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #49

Visual Vocabulary (ViVo) by AutoSplit

Step 1: Tile image Step 2:

Extract tile features

Step 3: ViVo

generation

Visual���vocabulary

V1

V2

Feature 1

Feat

ure

2

8x12 tiles

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #50

Finding ViVos

Each point is a tile. Projected to the 1st and 2nd PCA vectors. (Feature vector: 512 color structure features.)

Red lines indicate ViVos.

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #51

Bio-mining with “ViVo”

•  Visual Vocabulary for retinal images –  using AutoSplit

•  Evaluation – Qualitative: biological meanings of ViVos – Data mining: highlight “interesting” regions

Page 18: 15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS-pdf/620_ICA.pdf15-826 (c) C. Faloutsos and J-Y Pan (2010) #34 Step 3: Evaluate the patterns ID True Topic

18

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #52

Biological interpretation of ViVos ID ViVo Description Condition

V1 GFAP in inner retina (Müller cells) Healthy

V10 Healthy outer segments of rod photoreceptors Healthy

V8 Redistribution of rod opsin into cell bodies of rod photoreceptors Detached

V11 Co-occurring processes: Müller cell hypertrophy and rod opsin redistribution

Detached

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #53

Biological interpretation of ViVos ID ViVo Description Condition

2 GFAP in hypertrophy Müller cells

Morphological changes in inner retina

3 GFAP in hypertrophy Müller cells

Morphological changes in inner retina

4 GFAP in hypertrophy Müller cells

Morphological changes in inner retina

5

Healthy outer segments of rod photoreceptors (rod opsin)

Healthy

ID ViVo Description Condition

6 Rod photoreceptor cell body

Background labeling

7 GFAP in hypertrophy Müller cells

Morphological changes in inner retina

9 Outer segment degeneration (rod opsin)

Detached

12 GFAP in hypertrophy Müller cells

Morphological changes in inner retina

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #54

Bio-mining with “ViVo”

•  Visual Vocabulary for retinal images –  using AutoSplit

•  Evaluation – Qualitative: biological meanings of ViVos – Data mining: highlight “interesting” regions

Page 19: 15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS-pdf/620_ICA.pdf15-826 (c) C. Faloutsos and J-Y Pan (2010) #34 Step 3: Evaluate the patterns ID True Topic

19

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #55

Finding “distinguishing ViVos” •  Given: Images of two classes

– Find the class-distinguishing ViVo (“DiVo”) – Highlight distinguishing regions

Normal Detached 3 days

DiVo: “spongy”

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #56

Summary: a system viewpoint

Our system

Input Output

(1) Left: “n”; Right: “3d”

Accurate classification

DiVo analysis

V8: “spongy”

(2) Regions shown (V8): “cells of rod photoreceptors” (3) Description: “Detachment occurs!”

“Rod opsin distributes from outer segment into cell bodies.”

ViVo interpretation

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #57

Outline •  Motivation •  Formulation •  PCA and ICA •  Example applications

– Find topics in documents – Hidden variables in stock prices – Visual vocabulary for retinal images

•  Conclusion

Page 20: 15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS-pdf/620_ICA.pdf15-826 (c) C. Faloutsos and J-Y Pan (2010) #34 Step 3: Evaluate the patterns ID True Topic

20

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #58

Conclusion •  ICA: more flexible than PCA in finding

patterns. •  Many applications

– Find topics and “vocabulary” for images – Find hidden variables in time series (e.g., stock

prices) – Blind source separation

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #59

Vocabulary for embryo gene expressions

with André Balan, Christos Faloutsos, Eric P. Xing

Vocabulary

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #60

References •  Jia-Yu Pan, Andre Guilherme Ribeiro Balan, Eric P. Xing, Agma Juci

Machado Traina, and Christos Faloutsos. Automatic Mining of Fruit Fly Embryo Images. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2006.

•  Arnab Bhattacharya, Vebjorn Ljosa, Jia-Yu Pan, Mark R. Verardo, Hyungjeong Yang, Christos Faloutsos, and Ambuj K. Singh. ViVo: Visual Vocabulary Construction for Mining Biomedical Images. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM), 2005.

•  Masafumi Hamamoto, Hiroyuki Kitagawa, Jia-Yu Pan, and Christos Faloutsos. A Comparative Study of Feature Vector-Based Topic Detection Schemes for Text Streams. In Proceedings of International Workshop on Challenges in Web Information Retrieval and Integration (WIRI), 2005, pp.125-130.

•  Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos, and Masafumi Hamamoto. AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases. In Proceedings of the The Eighth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2004.

Page 21: 15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS-pdf/620_ICA.pdf15-826 (c) C. Faloutsos and J-Y Pan (2010) #34 Step 3: Evaluate the patterns ID True Topic

21

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #61

Acknowledgement

•  Prof. Tai Sing Lee •  Prof. Hiroyuki

Kitagawa •  Prof. HyungJeong

Yang •  Masafumi Hamamoto •  Prof. Nancy Pollard •  Prof. Jessica Hodgins

•  Prof. Michael Lewicki •  Prof. Eric Xing •  CMU Informedia

project •  UCSB DB Lab •  CMU bio-imaging

center •  CMU graphics lab

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #62

Citation

•  AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases, Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos and Masafumi Hamamoto

PAKDD 2004, Sydney, Australia

CMU SCS

15-826 (c) C. Faloutsos and J-Y Pan (2010) #63

References

•  Aapo Hyvärinen, Juha Karhunen, Erkki Oja: Independent Component Analysis, John Wiley & Sons, 2001