15-826 (c) C. Faloutsos and J-Y Pan (2007) #1
CMU SCS
15-826: Multimedia Databases
and Data Mining
Independent Component Analysis (ICA)
Jia-Yu Pan and Christos Faloutsos
Outline
• Motivation
• Formulation
• PCA and ICA
• Example applications
• Conclusion
Motivation:
(Q1) Find patterns in data
• Motion capture data: broad jumps
[Figure: left-knee and right-knee angle series during broad jumps; energy is exerted at take-off and at landing]
Motivation:
(Q1) Find patterns in data
• Human would say
– Pattern 1: along
diagonal
– Pattern 2: along
vertical axis
• How to find these automatically?
[Figure: scatter plot of left-knee vs. right-knee measurements; each point is the measurement at a time tick (550 points total). The take-off cluster has ratio R:L = 60:1; the landing cluster has R:L = 1:1]
Motivation:
(Q2) Find hidden variables
[Figure: stock price series (Alcoa, American Express, Boeing, Citi Group, …) explained by two hidden variables: the “general trend” and the “Internet bubble”]
Motivation:
(Q2) Find hidden variables
[Figure: Caterpillar and Intel price series decomposed into two hidden variables, with weights B1,CAT = 0.94, B1,INTC = 0.63 (“general trend”) and B2,CAT = 0.03, B2,INTC = 0.64 (“Internet bubble”)]
Motivation:
(Q2) Find hidden variables
[Figure: Caterpillar and Intel price series; the weights B1,CAT, B1,INTC (hidden variable 1) and B2,CAT, B2,INTC (hidden variable 2) are the unknowns (“?”) to be found]
Motivation:
Find hidden variables
• There are two sound sources in a cocktail party…
Also called the “blind source separation” problem.
“Blind” because we don’t know the sources,
nor how they are mixed.
Formulation: Finding patterns
Given n data points, each with m attributes,
find the patterns that best describe the data.
Linear representation
• Find patterns: vectors that best describe the data set.
• Each point is described as a linear combination of the vectors
(patterns):
$\vec{x}_i = h_{i,1}\,\vec{b}_1 + h_{i,2}\,\vec{b}_2$
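A toy numeric check of this linear representation (all values below are hypothetical, chosen only to illustrate the formula):

```python
# Toy sketch: a 2-D point x_i written as a linear combination of two
# basis vectors b1, b2, with hidden coordinates h_{i,1}, h_{i,2}.
b1 = (1.0, 1.0)   # e.g., a "diagonal" pattern
b2 = (0.0, 1.0)   # e.g., a "vertical" pattern

def combine(h1, h2):
    """x_i = h1*b1 + h2*b2"""
    return (h1 * b1[0] + h2 * b2[0], h1 * b1[1] + h2 * b2[1])

x = combine(2.0, 3.0)   # the point with hidden coordinates (2, 3)
```

With coordinates (1, 0) the point falls exactly on b1, which is the "sparse" case the next slide aims for.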
Patterns as data “vocabulary”
Good pattern ≈ sparse coding
(Q) Given the data points x_i, can we compute h_{i,j}’s and b_i’s that are “sparse”?
Only b1 is needed
to describe xi.
Patterns in motion capture data
[Figure: the two patterns drawn as vectors b⃗1 and b⃗2 on the knee scatter plot; n = 550 ticks]

Data matrix (columns: Left, Right knee):

$$
X_{n\times 2}
=
\begin{bmatrix}
x_{1,1} & x_{1,2}\\
x_{2,1} & x_{2,2}\\
\vdots & \vdots\\
x_{n,1} & x_{n,2}
\end{bmatrix}
=
\begin{bmatrix}
h_{1,1} & h_{1,2}\\
h_{2,1} & h_{2,2}\\
\vdots & \vdots\\
h_{n,1} & h_{n,2}
\end{bmatrix}
\begin{bmatrix}
\text{---}\ \vec{b}_1\ \text{---}\\
\text{---}\ \vec{b}_2\ \text{---}
\end{bmatrix}
= H_{n\times 2}\,B_{2\times 2}
$$

Both H (hidden variables) and B (basis vectors) are unknown (“? ?”).
Sparse ~ non-Gaussian ~ “Independent”
“Independent”: e.g., minimize mutual information.
Basis vectors and hidden variables
• Goal: Knowing X, find H and B, where
X = H B
• Problem: Under-constrained
– Need additional assumptions/constraints
X: data set
H: hidden variables
B: basis vectors
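Why X = HB alone is under-constrained can be seen with a tiny check: for any invertible M, the pair (HM, M⁻¹B) reproduces X exactly, so some extra assumption is needed to pin the factors down. The matrices below are toy values:

```python
# Demonstrate non-uniqueness of the factorization X = H B.
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

H = [[1, 2], [3, 4]]        # "hidden variables" (toy)
B = [[1, 0], [1, 1]]        # "basis vectors" (toy)
M = [[2, 0], [0, 1]]        # any invertible matrix
Minv = [[0.5, 0], [0, 1]]   # its inverse

X1 = matmul(H, B)
X2 = matmul(matmul(H, M), matmul(Minv, B))
# X1 == X2 although the factors (H, B) and (H M, M^-1 B) differ
```

PCA resolves the ambiguity with orthonormality; ICA resolves it with independence.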
PCA and ICA
• PCA vectors: major variations
– Together = good “low-rank approximation” / dimensionality reduction
– Individually ≠ meaningful patterns
• Luckily, ICA detects the major meaningful patterns.
[Figure: PCA vectors vs. ICA vectors on the motion-capture data]
PCA
• PCA (Principal Component Analysis)
– Choose orthonormal vectors that
– give the smallest L2 representation error for
dimensionality reduction
• Matrices H and B can be solved by
– SVD, neural networks, or many optimization
methods
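A minimal PCA sketch, assuming 2-D data so the leading eigenvector of the covariance matrix has a closed form (the SVD route the slide mentions generalizes this; the data points are toy values):

```python
import math

# First principal component of 2-D toy data stretched along the diagonal.
pts = [(0, 0), (1, 1), (2, 2), (3, 3), (0, 1), (0, 2)]
n = len(pts)
mx = sum(p[0] for p in pts) / n
my = sum(p[1] for p in pts) / n
cxx = sum((p[0] - mx) ** 2 for p in pts) / n
cyy = sum((p[1] - my) ** 2 for p in pts) / n
cxy = sum((p[0] - mx) * (p[1] - my) for p in pts) / n
# Leading eigenvalue of the covariance matrix [[cxx, cxy], [cxy, cyy]]:
lam = (cxx + cyy) / 2 + math.sqrt(((cxx - cyy) / 2) ** 2 + cxy ** 2)
v = (cxy, lam - cxx)   # (unnormalized) first PCA vector
```

On this toy set the first PCA vector points along the diagonal, as expected for diagonally stretched data.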
PCA
• Extremely popular
– Latent Semantic Indexing [Deerwester+90]
– KL transform [Duda,Hart,Stork00]
– EigenFace [Turk,Pentland91]
– Multiple time series correlation
[Guha,Gunopulos,Koudas03]
• But, there is room for improvement.
ICA
• ICA (Independent Component Analysis)
– Make the hidden variables h_i (columns of H)
mutually independent.
• Many implementations
– Many ways to define “independence”
– Many ways to find the most independent H.
– (B is found at the same time, since X=HB.)
ICA
• Defining “independence”:
– Zero mutual information
– Non-Gaussianity, e.g., maximum absolute kurtosis
$p(h_i, h_j) = p(h_i)\,p(h_j)$
• To solve for H, B:
– Neural networks, optimization methods
(gradient ascent, fixed-point, …)
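Kurtosis, one of the non-Gaussianity scores above, can be estimated directly from samples (the Laplacian sampler below, a symmetrized exponential, is a toy choice for the check):

```python
import random

# Excess kurtosis: kurt(x) = E[(x - mu)^4] / sigma^4 - 3.
# It is 0 for a Gaussian and positive for sharp-peaked,
# heavy-tailed pdfs such as the Laplacian (true value: 3).
def excess_kurtosis(xs):
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    m4 = sum((x - mu) ** 4 for x in xs) / n
    return m4 / var ** 2 - 3

random.seed(0)
lap = [random.expovariate(1.0) * random.choice([-1, 1])
       for _ in range(20000)]
k = excess_kurtosis(lap)   # clearly positive, near 3
```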
A non-Gaussian distribution:
the Laplacian pdf
$P(x) = \frac{\lambda}{2}\,e^{-\lambda|x|}$

Sharper at 0, and heavier-tailed than the Gaussian pdf.
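Both properties can be checked numerically by comparing the unit-variance Laplacian (λ = √2) against the standard Gaussian:

```python
import math

# Laplacian vs. Gaussian at equal (unit) variance.
def laplace_pdf(x, lam=math.sqrt(2.0)):
    return lam / 2 * math.exp(-lam * abs(x))

def gauss_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

peak_ratio = laplace_pdf(0) / gauss_pdf(0)   # > 1: sharper at 0
tail_ratio = laplace_pdf(4) / gauss_pdf(4)   # > 1: heavier tail
```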
Maximizing non-Gaussianity
• Assume h_i ~ a non-Gaussian pdf (e.g., the Laplacian)
– Given the h_i values, what is the most likely B?
• (the data point x is given and fixed)
– Find B s.t. the likelihood P(x|B) is maximized.
$$
\vec{x} = \vec{h}\,B \;\Rightarrow\; \vec{h} = \vec{x}\,B^{-1}
$$

$$
P(\vec{x}) = \frac{P_{\mathrm{Laplacian}}(\vec{h})}{|\det(B)|}
\;\Rightarrow\;
P(\vec{x}\mid B) = \frac{P_{\mathrm{Laplacian}}(\vec{x}\,B^{-1})}{|\det(B)|}
$$
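The likelihood above, coded for 2-D with a λ = 1 Laplacian prior on each hidden variable (the matrices and points fed to it are toy values):

```python
import math

# log P(x|B) = sum_j log P_lap(h_j) - log|det B|,  with h = x B^{-1}.
def det2(B):
    return B[0][0] * B[1][1] - B[0][1] * B[1][0]

def inv2(B):
    d = det2(B)
    return [[B[1][1] / d, -B[0][1] / d],
            [-B[1][0] / d, B[0][0] / d]]

def vecmat2(x, B):
    return (x[0] * B[0][0] + x[1] * B[1][0],
            x[0] * B[0][1] + x[1] * B[1][1])

def loglik(x, B, lam=1.0):
    h = vecmat2(x, inv2(B))
    log_p_h = sum(math.log(lam / 2) - lam * abs(hj) for hj in h)
    return log_p_h - math.log(abs(det2(B)))
```

With B = I the formula reduces to the Laplacian log-density of x itself, a handy sanity check.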
Maximize likelihood
• Likelihood P(x|B) is a function of B, f(B)
• Gradient ascent
– To find B which maximizes P(x|B)
[Figure: gradient ascent climbing f(B) = P(x|B) to its maximum at B*]
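The picture in one dimension, as a sketch (the function f(b) = −(b − 3)² and the step size are toy choices):

```python
# Gradient ascent: repeatedly step uphill along f'(b) toward b* = 3.
def grad(b):
    return -2 * (b - 3)   # derivative of f(b) = -(b - 3)**2

b, eps = 0.0, 0.1
for _ in range(200):
    b += eps * grad(b)    # b converges to the maximizer b* = 3
```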
ICA by maximum likelihood
X: static; iterate on H and B:

$$
H = X B^{-1}, \qquad Z = -\mathrm{sign}(H)
$$

$$
\Delta B \propto -\left(H^{T}Z + nI\right)B, \qquad B \leftarrow B + \varepsilon\,\Delta B
$$
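A self-contained 2x2 sketch of this update loop, assuming a λ = 1 Laplacian prior; the mixing matrix B_true, sample size, and step size are toy choices, not values from the lecture:

```python
import math
import random

def det2(B):
    return B[0][0] * B[1][1] - B[0][1] * B[1][0]

def inv2(B):
    d = det2(B)
    return [[B[1][1] / d, -B[0][1] / d],
            [-B[1][0] / d, B[0][0] / d]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2))
             for j in range(2)] for i in range(len(A))]

def sign(v):
    return (v > 0) - (v < 0)

# Synthesize X = H B from Laplacian hidden variables and a toy mixing B.
random.seed(1)
n = 500
H_true = [[random.expovariate(1.0) * random.choice([-1, 1])
           for _ in range(2)] for _ in range(n)]
B_true = [[1.0, 0.6], [0.2, 1.0]]
X = matmul(H_true, B_true)

def loglik(B):
    # log-likelihood of X under B (up to an additive constant)
    H = matmul(X, inv2(B))
    return (sum(-abs(h) for row in H for h in row)
            - n * math.log(abs(det2(B))))

B = [[1.0, 0.0], [0.0, 1.0]]   # start from the identity
before = loglik(B)
eps = 0.1
for _ in range(200):
    H = matmul(X, inv2(B))     # H = X B^{-1}
    # Natural-gradient step: dB = ((1/n) H^T sign(H) - I) B,
    # i.e., dB = -(H^T Z + nI) B / n with Z = -sign(H).
    G = [[sum(H[t][i] * sign(H[t][j]) for t in range(n)) / n - (i == j)
          for j in range(2)] for i in range(2)]
    dB = matmul(G, B)
    for i in range(2):
        for j in range(2):
            B[i][j] += eps * dB[i][j]
after = loglik(B)              # the likelihood increased
```

The stationary point satisfies (1/n) Hᵀ sign(H) = I, i.e., the recovered hidden variables look like unit-scale independent Laplacians.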
Convergence of ICA
[Figure: convergence curve of ICA]
Start with B as the identity matrix.
Outline
• Motivation
• Formulation
• PCA and ICA
• Example applications
– Find topics in documents
– Hidden variables in stock prices
– Visual vocabulary for retinal images
• Conclusion
Pattern discovery with ICA: AutoSplit [PAKDD 04][WIRI 05]
Step 1: Data points (matrix) — video frames, stock prices, or text documents. (Q) Different modalities?
Step 2: Compute patterns. (Q) How?
Step 3: Interpret patterns. (Q) What pattern? — data mining (case studies)
Finding patterns in high-dimensional data
PCA finds the hyperplane. ICA finds the correct patterns.
Dimensionality
reduction
Topic discovery on text streams
• Data: CNN headline news (Jan.-Jun. 1998)
• Documents of 10 topics in one single text
stream
– Documents are sorted by date/time
– Subsequent documents may have different
topics
[Figure: one text stream over date/time: Topic 1, Topic 3, Topic 1, …]
Topic discovery on text streams
• Known: number of topics = 10
• Unknown: (1) topic of each document (2)
topic description
Topic discovery in documents
Step 1: Windowing — cut the stream into stories of 30 words; each story becomes a term-count vector x_i = [1, 5, …, 0] over the m = 3887 dictionary words (aaron, …, zoo). Result: data matrix X[n×m] with n = 1659 news stories.

Step 2: X[n×m] = H[n×k] B[k×m] — (1) find the hyperplane (k = 10 topics), (2) find the patterns.

Step 3: Interpret each row b′_i of the pattern matrix B′[10×3887] = [b′_1; b′_2; …; b′_10], e.g., b′_i = [0, 0.7, …, 0.6] over (aaron, …, animal, …, zoo).

(Q) What does b′_i mean?
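Step 1 (turning stories into term-count vectors) can be sketched as follows; the vocabulary and stories are toy stand-ins for the m = 3887 words and n = 1659 stories of the lecture:

```python
from collections import Counter

# Build the term-count matrix X: X[i][j] = count of vocab[j] in story i.
vocab = ["bomb", "viagra", "beef", "olymp"]
stories = ["bomb explodes at clinic bomb",
           "viagra drug approved viagra viagra",
           "beef cattle beef lawsuit"]
X = [[Counter(s.split())[w] for w in vocab] for s in stories]
```

Each row of X is one x_i; stacking them gives the matrix that ICA factors into H and B.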
Step 3: Interpret the patterns
b′_i = [0, 0.7, …, 0.6] over (aaron, …, animal, …, zoo) — the topics found.

A hidden topic! Top words: “animal”, “zoo”, …

General idea: interpret a pattern via the data attributes (here, words) it weights most. Sorted (stemmed) word lists of the patterns, B′ = [b′_1; …; b′_10], m = 3887 (dictionary size):

ID | Sorted word list
A | mckinne, sergeant, sexual, major, armi
B | bomb, rudolph, clinic, atlanta, birmingham
C | winfrei, beef, texa, oprah, cattl
D | viagra, drug, impot, pill, doctor
E | zamora, graham, kill, former, jone
F | medal, olymp, gold, women, game
G | pope, cube, castro, cuban, visit
H | asia, economi, japan, econom, asian
I | super, bowl, game, team, re
J | peopl, tornado, florida, re, bomb
Step 3: Evaluate the patterns
ID | True topic
1 | Sgt. Gene Mckinney is on trial for alleged sexual misconduct
2 | A bomb explodes in a Birmingham, AL abortion clinic
3 | The cattle industry in Texas sues Oprah Winfrey for defaming beef
4 | New impotency drug Viagra is approved for use
5 | Diane Zamora is convicted of helping to murder her lover’s girlfriend
6 | 1998 Winter Olympic games
7 | The Pope’s historic visit to Cuba in Winter 1998
8 | The economic crisis in Asia
9 | Superbowl XXXII
10 | Tornado in Florida

AutoSplit finds the correct topics.

ID | Sorted word list
A | mckinne, sergeant, sexual, major, armi
B | bomb, rudolph, clinic, atlanta, birmingham
C | winfrei, beef, texa, oprah, cattl
D | viagra, drug, impot, pill, doctor
E | zamora, graham, kill, former, jone
Step 3: Evaluate the patterns
AutoSplit:
ID | Sorted word list
A | mckinne, sergeant, sexual, major, armi
B | bomb, rudolph, clinic, atlanta, birmingham
C | winfrei, beef, texa, oprah, cattl
D | viagra, drug, impot, pill, doctor
E | zamora, graham, kill, former, jone

PCA:
ID | Sorted word list
A′ | mckinne, bomb, women, sexual, sergeant
B′ | bomb, mckinne, rudolph, clinic, atlanta
C′ | winfrei, viagra, texa, beef, oprah
D′ | viagra, winfrei, drug, texa, beef
E′ | zamora, viagra, winfrei, graham, olymp

AutoSplit’s topics are better than PCA’s.
Step 3: Evaluate the patterns

[Figure: documents plotted against Topic 1 and Topic 2; the AutoSplit vectors (A–E) line up with the topic clusters, while the PCA vectors (A′–E′) cut across them]

AutoSplit’s topics are better than PCA’s: the PCA vectors mix the topics.
Find hidden variables (DJIA stocks)
• Weekly DJIA closing prices
– 01/02/1990–08/05/2002, n = 660 data points
– A data point: the prices of 29 companies at that time
Alcoa, American Express, Boeing, Caterpillar, Citi Group, …
Formulation: Find hidden variables
$$
\underbrace{\begin{bmatrix}
AA_1 & \cdots & XOM_1\\
\vdots & & \vdots\\
AA_n & \cdots & XOM_n
\end{bmatrix}}_{\text{date}\times\text{company}}
=
\underbrace{\begin{bmatrix}
H_{11} & H_{12} & \cdots & H_{1m}\\
\vdots & & & \vdots\\
H_{n1} & H_{n2} & \cdots & H_{nm}
\end{bmatrix}}_{\text{date}\times\text{hidden variable}}
\underbrace{\begin{bmatrix}
B_{11} & B_{12} & \cdots & B_{1m}\\
\vdots & & & \vdots\\
B_{m1} & B_{m2} & \cdots & B_{mm}
\end{bmatrix}}_{?\;?}
$$

H and B are the unknowns (“? ?”).
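The model being solved for says each week's price is a weighted sum of the hidden variables. A sketch with the Caterpillar weights from the slides (B1,CAT = 0.94, B2,CAT = 0.03); the two hidden series themselves are toy data:

```python
# price[t] = B1,CAT * trend[t] + B2,CAT * bubble[t]
trend = [1.0, 1.1, 1.2]    # hidden variable 1, "general trend" (toy)
bubble = [0.0, 0.5, 2.0]   # hidden variable 2, "Internet bubble" (toy)
b1_cat, b2_cat = 0.94, 0.03
cat = [b1_cat * h1 + b2_cat * h2 for h1, h2 in zip(trend, bubble)]
# Caterpillar mostly follows the trend and barely the bubble.
```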
Characterize hidden variable by
the companies it influences
[Figure: hidden variable 1 (“general trend”) and hidden variable 2 (“Internet bubble”) over time, with weights B1,CAT = 0.94, B1,INTC = 0.63, B2,CAT = 0.03, B2,INTC = 0.64 linking them to Caterpillar and Intel]
Companies related to hidden variable 1
B1,j:
Highest | Lowest
Caterpillar 0.938512 | AT&T 0.021885
Boeing 0.911120 | WalMart 0.624570
MMM 0.906542 | Intel 0.638010
Coca Cola 0.903858 | Home Depot 0.647774
Du Pont 0.900317 | Hewlett-Packard 0.658768

“General trend”
All companies are affected by the “general trend” variable (with weights 0.6–0.9), except AT&T.
General trend (and outlier)
[Figure: hidden variable 1 (“general trend”) plotted with the price series of United Technologies, Walmart, and Exxon Mobil, which follow it, and AT&T, which does not (the outlier)]
Companies related to hidden variable 2
B2,j:
Highest | Lowest
Intel 0.641102 | Philip Morris -0.194843
Hewlett-Packard 0.621159 | International Paper -0.089569
GE 0.509164 | Caterpillar 0.031678
American Express 0.504871 | Procter and Gamble 0.109576
Disney 0.490529 | Du Pont 0.133337

2000-2001 “Internet bubble” — the top-weighted companies are tech companies.
Companies affected by the “Internet bubble” variable (weights 0.5–0.6) are tech-related; the other companies are unrelated (weights < 0.15).
Mining cat retinal images [ICDM 05]
[Figure: cat retina micrographs — normal, and 1, 7, and 28 days after detachment; experimental timeline of detachment, development, and treatment (O2), showing the distribution of 2 proteins]
“Vocabulary” for biomedical images?
• How to describe biomedical images?
• Analogy: the topics for text
• Football reports: “touchdown”, “punt”, etc.
• DB papers: “query”, “optimization”, etc.
• How to derive “visual vocabulary terms”?
[Figure: normal vs. 7-days-after-detachment retina; the detached tissue looks “spongy”]
Visual Vocabulary (ViVo) by AutoSplit

Step 1: Tile the image (8x12 tiles)
Step 2: Extract tile features
Step 3: ViVo generation — visual vocabulary terms V1, V2, … in feature space (feature 1 vs. feature 2)
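Step 1 (cutting an image into a grid of tiles) can be sketched as follows; the grid shape passed in below is a toy choice, while the slide uses an 8x12 tiling:

```python
# Cut a 2-D array "image" into a rows x cols grid of tiles.
def tile(img, rows, cols):
    h, w = len(img), len(img[0])
    th, tw = h // rows, w // cols
    return [[[r[c * tw:(c + 1) * tw] for r in img[ri * th:(ri + 1) * th]]
             for c in range(cols)] for ri in range(rows)]

img = [[x + y for x in range(12)] for y in range(8)]   # toy 8x12 "image"
tiles = tile(img, 2, 3)   # a 2x3 grid of 4x4 tiles
```

Each tile then gets a feature vector (Step 2), and the tile-by-feature matrix is what AutoSplit factors.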
Finding ViVos
Each point is a tile, projected onto the 1st and 2nd PCA vectors.
(Feature vector: 512 color structure features.)
Red lines indicate ViVos.
Bio-mining with “ViVo”
• Visual Vocabulary for retinal images
– using AutoSplit
• Evaluation
– Qualitative: biological meanings of ViVos
– Data mining: highlight “interesting” regions
Biological interpretation of ViVos
ViVo | Condition | Description
V1 | Healthy | GFAP in inner retina (Müller cells)
V8 | Detached | Redistribution of rod opsin into cell bodies of rod photoreceptors
V10 | Healthy | Healthy outer segments of rod photoreceptors
V11 | Detached | Co-occurring processes: Müller cell hypertrophy and rod opsin redistribution
Biological interpretation of ViVos
ViVo | Condition | Description
V2 | Morphological changes in inner retina | GFAP in hypertrophied Müller cells
V3 | Morphological changes in inner retina | GFAP in hypertrophied Müller cells
V4 | Morphological changes in inner retina | GFAP in hypertrophied Müller cells
V5 | Healthy | Healthy outer segments of rod photoreceptors (rod opsin)
V6 | Background labeling | Rod photoreceptor cell body
V7 | Morphological changes in inner retina | GFAP in hypertrophied Müller cells
V9 | Detached | Outer segment degeneration (rod opsin)
V12 | Morphological changes in inner retina | GFAP in hypertrophied Müller cells
Finding “distinguishing ViVos”
• Given: images of two classes
– Find the class-distinguishing ViVo (“DiVo”)
– Highlight the distinguishing regions
[Figure: normal vs. 3-days-detached images with the distinguishing regions highlighted; DiVo: “spongy”]
Summary: a system viewpoint
Input: two images — left: “n” (normal); right: “3d” (3 days detached).
Output of our system:
(1) Accurate classification
(2) DiVo analysis: V8 (“spongy”); regions shown (V8): “cells of rod photoreceptors”
(3) ViVo interpretation — description: “Detachment occurs! Rod opsin distributes from the outer segment into cell bodies.”
Conclusion
• ICA: more flexible than PCA in finding
patterns.
• Many applications
– Find topics and “vocabulary” for images
– Find hidden variables in time series (e.g., stock
prices)
– Blind source separation
Vocabulary for embryo gene expressions
with André Balan, Christos Faloutsos, Eric P. Xing
References
• Jia-Yu Pan, Andre Guilherme Ribeiro Balan, Eric P. Xing, Agma Juci Machado Traina, and Christos Faloutsos. Automatic Mining of Fruit Fly Embryo Images. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2006.
• Arnab Bhattacharya, Vebjorn Ljosa, Jia-Yu Pan, Mark R. Verardo, Hyungjeong Yang, Christos Faloutsos, and Ambuj K. Singh. ViVo: Visual Vocabulary Construction for Mining Biomedical Images. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM), 2005.
• Masafumi Hamamoto, Hiroyuki Kitagawa, Jia-Yu Pan, and Christos Faloutsos. A Comparative Study of Feature Vector-Based Topic Detection Schemes for Text Streams. In Proceedings of the International Workshop on Challenges in Web Information Retrieval and Integration (WIRI), 2005, pp. 125-130.
• Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos, and Masafumi Hamamoto. AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases. In Proceedings of the Eighth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2004.
Acknowledgement
• Prof. Tai Sing Lee
• Prof. Hiroyuki Kitagawa
• Prof. HyungJeong Yang
• Masafumi Hamamoto
• Prof. Nancy Pollard
• Prof. Jessica Hodgins
• Prof. Michael Lewicki
• Prof. Eric Xing
• CMU Informedia project
• UCSB DB Lab
• CMU bio-imaging center
• CMU graphics lab
Independence
[Figure: scatter plots of the sources (s1, s2), which are independent, vs. the mixtures (r1, r2), which are dependent]
Non-Gaussianity and Independence
[Figure: P(s1), a source, is less Gaussian; P(r1), a mixture, is more Gaussian]
Citation
• AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases, Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos and Masafumi Hamamoto. PAKDD 2004, Sydney, Australia.
References
• Aapo Hyvärinen, Juha Karhunen, Erkki Oja: Independent Component Analysis, John Wiley & Sons, 2001.