Computational BioMedical Informatics


1

Computational BioMedical Informatics

SCE 5095: Special Topics Course

Instructor: Jinbo Bi, Computer Science and Engineering Dept.

2

Course Information

Instructor: Dr. Jinbo Bi
– Office: ITEB 233
– Phone: 860-486-1458
– Email: jinbo@engr.uconn.edu
– Web: http://www.engr.uconn.edu/~jinbo/
– Time: Mon / Wed 2:00pm – 3:15pm
– Location: CAST 204
– Office hours: Mon. 3:30–4:30pm

HuskyCT
– http://learn.uconn.edu
– Login with your NetID and password
– Illustration

3

Review of previous classes

Introduced unsupervised learning – particularly, cluster analysis techniques

Discussed one important application of cluster analysis – cardiac ultrasound view recognition

Review papers on medical, health or public health topics

In this class, we start to discuss dimension reduction

4

Topics today

What is feature reduction?
Why feature reduction?
More general motivations of component analysis
Feature reduction algorithms
– Principal component analysis
– Canonical correlation analysis
– Independent component analysis

5

What is feature reduction?

Feature reduction refers to the mapping of the original high-dimensional data onto a lower-dimensional space.

– The criterion for feature reduction can differ depending on the problem setting.

Unsupervised setting: minimize the information loss
Supervised setting: maximize the class discrimination

Given a set of data points $x_1, x_2, \ldots, x_n$ of $p$ variables

Compute the linear transformation (projection)

$G \in \mathbb{R}^{p \times d}: \; x \in \mathbb{R}^p \mapsto y = G^T x \in \mathbb{R}^d \quad (d \ll p)$
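To make the mapping concrete, here is a minimal numpy sketch (the data, the dimensions, and the choice of G are illustrative, not from the slides): it projects p-dimensional points onto a d-dimensional subspace through an arbitrary orthonormal G; later slides discuss how PCA and related methods choose G.

```python
import numpy as np

# Minimal sketch of the linear projection y = G^T x (names are illustrative).
rng = np.random.default_rng(0)

p, d, n = 100, 5, 200          # original dimension, reduced dimension, sample count
X = rng.normal(size=(p, n))    # data matrix: one column per data point

# Any p x d matrix with orthonormal columns defines a valid linear projection;
# here we take one from a QR factorization of a random matrix.
G, _ = np.linalg.qr(rng.normal(size=(p, d)))

Y = G.T @ X                    # reduced data: d x n
print(Y.shape)                 # (5, 200)
```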

6

What is feature reduction?

Linear transformation from the original data to the reduced data:

$Y = G^T X \in \mathbb{R}^d, \quad G \in \mathbb{R}^{p \times d}, \; X \in \mathbb{R}^p$

Original data $X$ → reduced data $Y$

7

High-dimensional data

Gene expression
Face images
Handwritten digits

8

Feature reduction versus feature selection

Feature reduction
– All original features are used
– The transformed features are linear combinations of the original features

Feature selection
– Only a subset of the original features is used

9

Why feature reduction?

Most machine learning and data mining techniques may not be effective for high-dimensional data
– Curse of dimensionality
– Query accuracy and efficiency degrade rapidly as the dimension increases

The intrinsic dimension may be small
– For example, the number of genes responsible for a certain type of disease may be small

10

Understanding: reason about or obtain insights from data; combinations of observed variables may be more effective bases for insights, even if their physical meaning is obscure

Visualization: projection of high-dimensional data onto 2D or 3D when there is too much noise in the data

Data compression: "reduce" the data to a smaller set of factors for efficient storage and retrieval; better representation of the data without losing much information

Noise removal: build more effective data analyses on the reduced-dimensional space: classification, clustering, pattern recognition (positive effect on query accuracy)

Why feature reduction?

11

Application of feature reduction

Face recognition
Handwritten digit recognition
Text mining
Image retrieval
Microarray data analysis
Protein classification

12

• We study phenomena that cannot be directly observed
– ego, personality, intelligence in psychology
– underlying factors that govern the observed data

• We want to identify and operate with underlying latent factors rather than the observed data
– E.g. topics in news articles
– Transcription factors in genomics

• We want to discover and exploit hidden relationships
– "beautiful car" and "gorgeous automobile" are closely related
– So are "driver" and "automobile"
– But does your search engine know this?
– Reduces noise and error in results

More general motivations

13

• Discover a new set of factors/dimensions/axes against which to represent, describe or evaluate the data
– For more effective reasoning, insights, or better visualization
– Reduce noise in the data
– Typically a smaller set of factors: dimension reduction
– Better representation of data without losing much information
– Can build more effective data analyses on the reduced-dimensional space: classification, clustering, pattern recognition

• Factors are combinations of observed variables
– May be more effective bases for insights, even if their physical meaning is obscure
– Observed data are described in terms of these factors rather than in terms of the original variables/dimensions

More general motivations

14

Feature reduction algorithms

Unsupervised
– Principal component analysis (PCA)
– Canonical correlation analysis (CCA)
– Independent component analysis (ICA)

Supervised
– Linear discriminant analysis (LDA)
– Sparse support vector machine (SSVM)

Semi-supervised
– Research topic

15

Feature reduction algorithms

Linear
– Principal Component Analysis (PCA)
– Linear Discriminant Analysis (LDA)
– Canonical Correlation Analysis (CCA)
– Independent Component Analysis (ICA)

Nonlinear
– Nonlinear feature reduction using kernels: kernel PCA, kernel CCA, …
– Manifold learning: LLE, Isomap

16

Basic Concept

Areas of variance in the data are where items can be best discriminated and key underlying phenomena observed
– Areas of greatest "signal" in the data

If two items or dimensions are highly correlated or dependent
– They are likely to represent highly related phenomena
– If they tell us about the same underlying variance in the data, combining them to form a single measure is reasonable: parsimony, reduction in error

So we want to combine related variables, and focus on uncorrelated or independent ones, especially those along which the observations have high variance

We want a smaller set of variables that explains most of the variance in the original data, in a more compact and insightful form

17

Basic Concept

What if the dependences and correlations are not so strong or direct?

And suppose you have 3 variables, or 4, or 5, or 10000?

Look for the phenomena underlying the observed covariance/co-dependence in a set of variables

– Once again, phenomena that are uncorrelated or independent, and especially those along which the data show high variance

These phenomena are called “factors” or “principal components” or “independent components,” depending on the methods used

– Factor analysis: based on variance/covariance/correlation
– Independent Component Analysis: based on independence

18

What is Principal Component Analysis?

Principal component analysis (PCA)
– Reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables
– Retains most of the sample's information
– Useful for the compression and classification of data

By information we mean the variation present in the sample, given by the correlations between the original variables.

19

Principal Component Analysis

Most common form of factor analysis

The new variables/dimensions
– are linear combinations of the original ones
– are uncorrelated with one another (orthogonal in the original dimension space)
– capture as much of the original variance in the data as possible
– are called Principal Components
– are ordered by the fraction of the total information each retains

20

Some Simple Demos

http://www.cs.mcgill.ca/~sqrt/dimr/dimreduction.html

21

What are the new axes?

[Figure: scatter of data over the axes "Original Variable A" and "Original Variable B", with the new axes PC 1 and PC 2 overlaid]

• Orthogonal directions of greatest variance in the data
• Projections along PC 1 discriminate the data most along any one axis

22

Principal Components

The first principal component is the direction of greatest variability (covariance) in the data

The second is the next orthogonal (uncorrelated) direction of greatest variability
– So first remove all the variability along the first component, then find the next direction of greatest variability

And so on …

23

Principal Components Analysis (PCA)

Principle
– Linear projection method to reduce the number of parameters
– Transform a set of correlated variables into a new set of uncorrelated variables
– Map the data into a space of lower dimensionality
– A form of unsupervised learning

Properties
– It can be viewed as a rotation of the existing axes to new positions in the space defined by the original variables
– The new axes are orthogonal and represent the directions of maximum variability

24

Computing the Components

Data points are vectors in a multidimensional space

The projection of a vector $x$ onto an axis (dimension) $u$ is $u^T x$

The direction of greatest variability is the one in which the average square of the projection is greatest
– I.e. $u$ such that $E((u^T x)^2)$ over all $x$ is maximized
– (For simplicity, we subtract the mean along each dimension, centering the original axis system at the centroid of all data points)
– This direction of $u$ is the direction of the first Principal Component

25

Computing the Components

$E((u^T x)^2) = E((u^T x)(u^T x)) = E(u^T x \, x^T u) = u^T E(x x^T) u$

The matrix $C = E(x x^T)$ (the covariance matrix of the centered data) captures the correlations (similarities) of the original axes based on how the data values project onto them

So we are looking for the $u$ that maximizes $u^T C u$, subject to $u$ being unit-length

It is maximized when $u$ is the principal eigenvector of the matrix $C$, in which case
– $u^T C u = u^T \lambda u = \lambda$ if $u$ is unit-length, where $\lambda$ is the principal eigenvalue of the covariance matrix $C$
– The eigenvalue denotes the amount of variability captured along that dimension
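The steps above can be sketched directly in numpy; the data here are synthetic and the names are my own, but the block follows the slide: center the data, form the covariance matrix, and take its principal eigenvector.

```python
import numpy as np

# Minimal sketch of PCA via the eigendecomposition described above.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 samples, 10 variables (rows = samples)

Xc = X - X.mean(axis=0)                 # center each variable
C = (Xc.T @ Xc) / (len(Xc) - 1)         # sample covariance matrix, 10 x 10

eigvals, eigvecs = np.linalg.eigh(C)    # eigh: C is symmetric
order = np.argsort(eigvals)[::-1]       # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

u1 = eigvecs[:, 0]                      # first principal direction
# The variance of the projection equals the principal eigenvalue:
print(np.allclose(np.var(Xc @ u1, ddof=1), eigvals[0]))   # True
```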

26

Why the Eigenvectors?

Maximise $u^T x x^T u$ s.t. $u^T u = 1$

Construct the Lagrangian $u^T x x^T u - \lambda u^T u$

Set the vector of partial derivatives to zero:
$x x^T u - \lambda u = (x x^T - \lambda I) u = 0$

Since $u \neq 0$, $u$ must be an eigenvector of $x x^T$ with eigenvalue $\lambda$

27

Singular Value Decomposition

The first root is called the principal eigenvalue, which has an associated orthonormal ($u^T u = 1$) eigenvector $u$

Subsequent roots are ordered such that $\lambda_1 > \lambda_2 > \ldots > \lambda_M$, with rank($D$) non-zero values

The eigenvectors form an orthonormal basis, i.e. $u_i^T u_j = \delta_{ij}$

The eigenvalue decomposition: $x x^T = U \Sigma U^T$, where $U = [u_1, u_2, \ldots, u_M]$ and $\Sigma = \mathrm{diag}[\lambda_1, \lambda_2, \ldots, \lambda_M]$

Similarly, the eigenvalue decomposition of $x^T x = V \Sigma V^T$

The SVD is closely related to the above: $x = U \Sigma^{1/2} V^T$

The left eigenvectors $U$, the right eigenvectors $V$, and the singular values are the square roots of the eigenvalues.
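The relationship between the SVD and the eigendecomposition stated on this slide can be checked numerically; the sketch below uses a small synthetic data matrix (an illustration only).

```python
import numpy as np

# Check the SVD / eigendecomposition relation for a data matrix X.
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 40))                     # 6 variables, 40 samples

U, s, Vt = np.linalg.svd(X, full_matrices=False) # X = U diag(s) V^T
eigvals = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]

# Singular values = square roots of the eigenvalues of X X^T:
print(np.allclose(s, np.sqrt(eigvals)))          # True

# And X X^T = U diag(s^2) U^T:
print(np.allclose(X @ X.T, U @ np.diag(s**2) @ U.T))   # True
```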

28

Computing the Components

Similarly for the next axis, etc.

So the new axes are the eigenvectors of the covariance matrix of the original variables, which captures the similarities of the original variables based on how the data samples project onto them

• Geometrically: centering followed by rotation
– A linear transformation

29

PCs, Variance and Least-Squares

The first PC retains the greatest amount of variation in the sample

The kth PC retains the kth greatest fraction of the variation in the sample

The kth largest eigenvalue of the correlation matrix C is the variance in the sample along the kth PC

The least-squares view: PCs are a series of linear least squares fits to a sample, each orthogonal to all previous ones

30

How Many PCs?

For n original dimensions, the correlation matrix is n×n and has up to n eigenvectors. So n PCs.

Where does dimensionality reduction come from?

31

We can ignore the components of lesser significance.

You do lose some information, but if the eigenvalues are small, you don't lose much
– p dimensions in the original data
– calculate p eigenvectors and eigenvalues
– choose only the first d eigenvectors, based on their eigenvalues
– the final data set has only d dimensions

[Figure: scree plot of the variance (%) explained by PC1 through PC10]

Dimensionality Reduction
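A common way to pick d from the scree is to keep enough components to reach a target fraction of explained variance; the sketch below uses synthetic data and an illustrative 95% threshold (both are my own choices).

```python
import numpy as np

# Choose d from the eigenvalue spectrum: smallest d explaining >= 95% variance.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))   # correlated 10-D data

Xc = X - X.mean(axis=0)
eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]

explained = eigvals / eigvals.sum()             # fraction of variance per PC
cumulative = np.cumsum(explained)
d = int(np.searchsorted(cumulative, 0.95)) + 1  # smallest d reaching 95%
print(explained.round(3), d)
```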

32

Eigenvectors of a Correlation Matrix

33

Geometric picture of principal components

[Figure: data cloud with the first two principal component axes $z_1$ and $z_2$]

• The 1st PC $z_1$ is a minimum-distance fit to a line in X space
• The 2nd PC $z_2$ is a minimum-distance fit to a line in the plane perpendicular to the 1st PC

PCs are a series of linear least squares fits to a sample, each orthogonal to all the previous ones.

34

Optimality property of PCA

Original data: $X \in \mathbb{R}^{p \times n}$

Dimension reduction: $Y = G^T X \in \mathbb{R}^{d \times n}$, with $G \in \mathbb{R}^{p \times d}$

Reconstruction: $\tilde{X} = G Y = G G^T X \in \mathbb{R}^{p \times n}$

35

Optimality property of PCA

Main theoretical result:

The matrix G consisting of the first d eigenvectors of the covariance matrix S solves the following minimization problem:

$\min_{G \in \mathbb{R}^{p \times d}} \; \|X - G G^T X\|_F^2 \quad \text{subject to} \quad G^T G = I_d$

where $\|X - G G^T X\|_F^2$ is the reconstruction error.

PCA projection minimizes the reconstruction error among all linear projections of size d.
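The optimality property can be illustrated numerically: on synthetic data, the top-d eigenvector basis gives a reconstruction error no larger than that of an arbitrary orthonormal basis of the same size. The sketch below is an illustration, not a proof.

```python
import numpy as np

# Compare the reconstruction error of the PCA basis with a random orthonormal basis.
rng = np.random.default_rng(3)
p, n, d = 20, 300, 3
X = rng.normal(size=(p, p)) @ rng.normal(size=(p, n))   # p x n, columns are samples
X = X - X.mean(axis=1, keepdims=True)

S = (X @ X.T) / (n - 1)                                  # covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)
G_pca = eigvecs[:, np.argsort(eigvals)[::-1][:d]]        # first d eigenvectors

G_rand, _ = np.linalg.qr(rng.normal(size=(p, d)))        # random orthonormal basis

def recon_error(G):
    return np.linalg.norm(X - G @ (G.T @ X), 'fro')**2

print(recon_error(G_pca) <= recon_error(G_rand))         # True
```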

36

Applications of PCA

Eigenfaces for recognition. Turk and Pentland. 1991.

Principal Component Analysis for clustering gene expression data. Yeung and Ruzzo. 2001.

Probabilistic Disease Classification of Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum. Lilien. 2003.

37

PCA applications -Eigenfaces

the principal eigenface looks like a bland, androgynous average human face

http://en.wikipedia.org/wiki/Image:Eigenfaces.png

38

Eigenfaces – Face Recognition

When properly weighted, eigenfaces can be summed together to create an approximate gray-scale rendering of a human face.

Remarkably few eigenvector terms are needed to give a fair likeness of most people's faces

Hence eigenfaces provide a means of applying data compression to faces for identification purposes.

Similarly, Expert Object Recognition in Video

39

PCA for image compression

[Figure: reconstructions of an image using d = 1, 2, 4, 8, 16, 32, 64, 100 principal components, alongside the original image]
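A rough sketch of the idea behind the figure: treat the rows of a grayscale image as samples, keep the top d components, and reconstruct. The "image" below is a random placeholder; with a real image the reconstructions improve visibly as d grows.

```python
import numpy as np

# PCA/low-rank image compression sketch: keep d components and reconstruct.
rng = np.random.default_rng(4)
img = rng.random((256, 256))                 # placeholder grayscale image

mean = img.mean(axis=0)                      # treat rows as samples, center columns
A = img - mean
U, s, Vt = np.linalg.svd(A, full_matrices=False)

for d in (1, 2, 4, 8, 16, 32, 64, 100):
    approx = U[:, :d] @ np.diag(s[:d]) @ Vt[:d, :] + mean
    err = np.linalg.norm(img - approx) / np.linalg.norm(img)
    print(d, round(err, 4))                  # relative error shrinks as d grows
```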

40

Topics today

Feature reduction algorithms
– Principal component analysis
– Canonical correlation analysis
– Independent component analysis

41

Canonical correlation analysis (CCA)

CCA was first developed by H. Hotelling.
– H. Hotelling. Relations between two sets of variates. Biometrika, 28:321-377, 1936.

CCA measures the linear relationship between two multidimensional variables.

CCA finds two bases, one for each variable, that are optimal with respect to correlations.

Applications in economics, medical studies, bioinformatics and other areas.

42

Canonical correlation analysis (CCA)

Two multidimensional variables
– Two different measurements on the same set of objects

Web images and associated text
Protein (or gene) sequences and related literature (text)
Protein sequences and corresponding gene expression
In classification: feature vector and class label

– Two measurements on the same object are likely to be correlated, though this may not be obvious in the original measurements. Find the maximum correlation in a transformed space.

43

Canonical Correlation Analysis (CCA)

[Diagram: the two measurements X and Y are transformed to $W_X^T X$ and $W_Y^T Y$, and the correlation is computed between the transformed data]

44

Problem definition

Find two sets of basis vectors, one for x and the other for y, such that the correlations between the projections of the variables onto these basis vectors are maximized.

Given the two variables $x$ and $y$, compute the two basis vectors $w_x$ and $w_y$ and form the projections $w_x^T x$ and $w_y^T y$.

45

Problem definition

Compute the two basis vectors so that the correlations of the projections onto these vectors are maximized.

46

Algebraic derivation of CCA

The optimization problem is equivalent to

$\max_{w_x, w_y} \; \dfrac{w_x^T C_{xy} w_y}{\sqrt{w_x^T C_{xx} w_x \cdot w_y^T C_{yy} w_y}}$

where $C_{xx} = X X^T$, $C_{xy} = X Y^T$, $C_{yx} = Y X^T$, $C_{yy} = Y Y^T$

47

Geometric interpretation of CCA

The Geometry of CCA

$a_j = X^T w_{x_j}, \quad b_j = Y^T w_{y_j}$

$a_j^T a_j = w_{x_j}^T X X^T w_{x_j} = 1, \quad b_j^T b_j = w_{y_j}^T Y Y^T w_{y_j} = 1$

$\|a_j - b_j\|^2 = a_j^T a_j + b_j^T b_j - 2 a_j^T b_j = 2 - 2\,\mathrm{corr}(a_j, b_j)$

Maximization of the correlation is equivalent to the minimization of the distance.

48

Algebraic derivation of CCA

The optimization problem is equivalent to

$\max_{w_x, w_y} \; w_x^T C_{xy} w_y \quad \text{s.t.} \quad w_x^T C_{xx} w_x = 1, \; w_y^T C_{yy} w_y = 1$

49

Algebraic derivation of CCA

Applying Lagrange multipliers (with $C_{yx} = C_{xy}^T$ and a common multiplier $\lambda_x = \lambda_y = \lambda$) gives

$C_{xy} w_y = \lambda C_{xx} w_x, \quad C_{yx} w_x = \lambda C_{yy} w_y$
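One way to solve these equations in practice is as a generalized symmetric eigenvalue problem; the sketch below does this on synthetic two-view data (the small ridge term added to the covariance blocks, and the data themselves, are my own illustrative choices).

```python
import numpy as np
from scipy.linalg import eigh

# CCA on synthetic two-view data via a generalized eigenvalue problem.
rng = np.random.default_rng(5)
n = 500
z = rng.normal(size=(2, n))                       # shared latent signal
X = rng.normal(size=(5, 2)) @ z + 0.3 * rng.normal(size=(5, n))
Y = rng.normal(size=(4, 2)) @ z + 0.3 * rng.normal(size=(4, n))
X -= X.mean(axis=1, keepdims=True)
Y -= Y.mean(axis=1, keepdims=True)

Cxx = X @ X.T + 1e-6 * np.eye(5)                  # small ridge for stability
Cyy = Y @ Y.T + 1e-6 * np.eye(4)
Cxy = X @ Y.T

# Cxy Cyy^{-1} Cyx w_x = lambda^2 Cxx w_x, solved as a generalized problem
M = Cxy @ np.linalg.solve(Cyy, Cxy.T)
vals, vecs = eigh(M, Cxx)                         # generalized symmetric eig
wx = vecs[:, -1]                                  # top canonical direction for x
wy = np.linalg.solve(Cyy, Cxy.T @ wx)             # corresponding direction for y

a, b = wx @ X, wy @ Y
print(round(np.corrcoef(a, b)[0, 1], 3))          # close to sqrt(vals[-1])
```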

50

Applications in bioinformatics

CCA can be extended to multiple views of the data
– Multiple (more than 2) data sources

Two different ways to combine different data sources
– Multiple CCA: consider all pairwise correlations
– Integrated CCA: divide into two disjoint sources

51

Applications in bioinformatics

Source: Extraction of Correlated Gene Clusters from Multiple Genomic Data by Generalized Kernel Canonical Correlation Analysis. ISMB’03 http://cg.ensmp.fr/~vert/publi/ismb03/ismb03.pdf

52

Topics today

Feature reduction algorithms
– Principal component analysis
– Canonical correlation analysis
– Independent component analysis

53

Independent component analysis (ICA)

PCA (orthogonal coordinates)

ICA (non-orthogonal coordinates)

54

Cocktail-party problem

Multiple sound sources in a room (independent)
Multiple sensors receiving signals which are mixtures of the original signals
Estimate the original source signals from the mixture of received signals
Can be viewed as Blind Source Separation, as the mixing parameters are not known

55

http://www.cis.hut.fi/projects/ica/cocktail/cocktail_en.cgi

DEMO: Blind Source Separation

56

Cocktail party or Blind Source Separation (BSS) problem
– An ill-posed problem, unless assumptions are made!

The most common assumption is that the source signals are statistically independent: knowing the value of one of them gives no information about the others.

Methods based on this assumption are called Independent Component Analysis methods
– statistical techniques for decomposing a complex data set into independent parts.

It can be shown that under some reasonable conditions, if the ICA assumption holds, the source signals can be recovered up to permutation and scaling.

BSS and ICA

57

Source Separation Using ICA

[Diagram: the Microphone 1 and Microphone 2 signals are weighted by W11, W21, W12, W22 and summed to produce Separation 1 and Separation 2]

58

Original signals (hidden sources) s1(t), s2(t), s3(t), s4(t), t=1:T

59

The ICA model

[Diagram: sources s1, s2, s3, s4 are mixed through weights a11, a12, a13, a14, … to give the observations x1, x2, x3, x4]

$x_i(t) = a_{i1} s_1(t) + a_{i2} s_2(t) + a_{i3} s_3(t) + a_{i4} s_4(t)$

Here, i = 1:4. In vector-matrix notation, and dropping the index t, this is $x = A s$
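The mixing model x = A s can be simulated in a few lines; the sources and the mixing matrix below are synthetic examples, not from the slides.

```python
import numpy as np

# Simulate the mixing model x = A s with four independent sources.
rng = np.random.default_rng(6)
T = 1000
t = np.arange(T)

S = np.vstack([                       # 4 x T matrix of source signals s_i(t)
    np.sin(0.02 * t),                 # sinusoid
    np.sign(np.sin(0.05 * t)),        # square wave
    ((0.01 * t) % 1.0) - 0.5,         # sawtooth
    rng.laplace(size=T),              # noise-like source
])

A = rng.uniform(0.5, 2.0, size=(4, 4))   # unknown mixing matrix a_ij
X = A @ S                                 # observed mixtures x_i(t)
print(X.shape)                            # (4, 1000)
```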

60

This is recorded by the microphones: a linear mixture of the sources

$x_i(t) = a_{i1} s_1(t) + a_{i2} s_2(t) + a_{i3} s_3(t) + a_{i4} s_4(t)$

61

Recovered signals

62

BSS

If we knew the mixing parameters $a_{ij}$, we would just need to solve a linear system of equations.

But we know neither $a_{ij}$ nor $s_i$.

ICA was initially developed to deal with problems closely related to the cocktail party problem. Later it became evident that ICA has many other applications
– e.g. recovering underlying components of brain activity from electrical recordings at different locations on the scalp (EEG signals)

Problem: determine the source signals s, given only the mixtures x.

63

ICA Solution and Applicability

ICA is a statistical method, the goal of which is to decompose given multivariate data into a linear sum of statistically independent components.

For example, given the two-dimensional vector $x = [x_1 \; x_2]^T$, ICA aims at finding the following decomposition:

$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} a_{11} \\ a_{21} \end{bmatrix} s_1 + \begin{bmatrix} a_{12} \\ a_{22} \end{bmatrix} s_2, \qquad \text{i.e. } x = a_1 s_1 + a_2 s_2$

where $a_1$, $a_2$ are basis vectors and $s_1$, $s_2$ are basis coefficients.

Constraint: the basis coefficients $s_1$ and $s_2$ are statistically independent.
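To illustrate the "recovery up to permutation and scaling" point, here is a sketch using scikit-learn's FastICA (one particular ICA algorithm; the sources and mixing matrix are synthetic, assumed for illustration).

```python
import numpy as np
from sklearn.decomposition import FastICA   # one common ICA implementation

# Recover independent sources from mixtures X = A @ S (synthetic example).
rng = np.random.default_rng(7)
T = 2000
t = np.arange(T)
S = np.vstack([np.sin(0.02 * t), np.sign(np.sin(0.05 * t))])  # 2 sources
A = np.array([[1.0, 0.5], [0.7, 1.2]])                         # mixing matrix
X = A @ S                                                      # 2 x T mixtures

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X.T).T      # sklearn expects samples in rows

# Recovery is only up to permutation and scaling, so compare via correlation.
corr = np.corrcoef(np.vstack([S, S_hat]))[:2, 2:]
print(np.abs(corr).round(2))          # each row has one entry near 1
```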

64

Approaches to ICA

Unsupervised Learning
– Factorial coding (minimum entropy coding, redundancy reduction)
– Maximum likelihood learning
– Nonlinear information maximization (entropy maximization)
– Negentropy maximization
– Bayesian learning

Statistical Signal Processing
– Higher-order moments or cumulants
– Joint approximate diagonalization
– Maximum likelihood estimation

65

Application domains of ICA

Blind source separation
Image denoising
Medical signal processing – fMRI, ECG, EEG
Modelling of the hippocampus and visual cortex
Feature extraction, face recognition
Compression, redundancy reduction
Watermarking
Clustering
Time series analysis (stock market, microarray data)
Topic extraction
Econometrics: finding hidden factors in financial data

66

Image denoising

[Figure: original image, noisy image, and denoising results using Wiener filtering and ICA filtering]
