BINF6201/8201 Principal component analysis (PCA): Visualization of amino acids using their physico-chemical properties 08-31-2010


Physico-chemical properties and amino acids

The physico-chemical properties of the amino acids in a peptide chain determine its folding process, and therefore its 3D structure and function.

A full understanding of these properties is necessary for understanding the folding process, and therefore for designing algorithms that predict protein 3D structures from amino acid sequences.

Commonly known physico-chemical properties of amino acids that affect protein folding:

1. Volume/size:

• Sum of the van der Waals volumes of the atoms;

• Partial volume: the increase in water volume when the amino acid is dissolved in water;

• Bulkiness: the ratio of the side chain volume to its length, i.e. its cross-sectional area.

2. Polarity index: the electrostatic force that the amino acid exerts on its surroundings at a distance of 10 Å, which combines the force due to electrical charge and the force due to the dipole moment of the polarized amino acid.

[Figure: an amino acid and a test charge placed 10 Å apart.]

3. Isoelectric point (pI): the pH of the solution at which the net charge on the amino acid is zero.

[Figure: charge state of an amino acid at pH < pI, pH = pI, and pH > pI.]

4. Hydrophobicity: a measure of the solubility of an amino acid (AA) in water.

Hydrophobic: difficult to dissolve in water

Hydrophilic: easy to dissolve in water

Hydrophobicity scales:

Kyte and Doolittle scale: based on the free energy cost of moving an AA from the inside of a protein to its surface.

Engelman, Steitz and Goldman scale: based on the free energy cost of moving an AA from a lipid bilayer membrane to water.

Hydrophobic AAs have positive hydrophobicity values, and hydrophilic AAs have negative hydrophobicity values.

5. Water-accessible area: related to the portion of the side chain that is buried in a folded protein.

How can we visualize and quantitatively analyze the similarities and differences among AAs based on these, and even more, physico-chemical properties?


Analysis of amino acids based on their physico-chemical properties

A simple visualization analysis of amino acids using molecular volume and isoelectric point (pI)

Such a result is not satisfactory, and more sophisticated methods are needed.

Principal component analysis (PCA)

Simply speaking, PCA rescales and transforms the data based on the relationships among the properties of the data, such that the data points can be separated using a few newly computed properties (the principal components).

[Figure: schematic of the PCA steps for 11 data points (rows d1, d2, …, d11) with 3 properties (columns p1, p2, p3), together with a scatter plot of the data points on the new axes y1 and y2.]

\[
\underbrace{(x_{i,j})}_{11 \times 3;\ p_1, p_2, p_3}
\xrightarrow{\text{Rescale}}
\underbrace{(z_{i,j})}_{11 \times 3;\ p_1, p_2, p_3}
\xrightarrow{\text{Transform}}
\underbrace{(y_{i,j})}_{11 \times 3;\ c_1, c_2, c_3}
\xrightarrow{\text{Dimension reduction}}
\underbrace{(y_{i,1},\ y_{i,2})}_{11 \times 2;\ c_1, c_2}
\]

A general dataset can be represented as an N (rows) × P (columns) matrix X:

\[
X =
\begin{pmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,j} & \cdots & x_{1,P} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,j} & \cdots & x_{2,P} \\
\vdots  & \vdots  &        & \vdots  &        & \vdots  \\
x_{i,1} & x_{i,2} & \cdots & x_{i,j} & \cdots & x_{i,P} \\
\vdots  & \vdots  &        & \vdots  &        & \vdots  \\
x_{N,1} & x_{N,2} & \cdots & x_{N,j} & \cdots & x_{N,P}
\end{pmatrix}
\]

Row i holds data point d_i and column j holds property p_j, where N is the number of objects/data points (e.g. 20 amino acids), and P is the number of properties that each object has (e.g. 8 properties of AAs).

Rescale the data by normalization:

1. Compute the mean of each column j:

\[ m_j = \frac{1}{N} \sum_{i=1}^{N} x_{ij}. \]

2. Compute the standard deviation of each column j:

\[ s_j = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_{ij} - m_j)^2}. \]

3. Normalization:

\[ z_{ij} = \frac{x_{ij} - m_j}{s_j}. \]

Each column of X thus has its own mean and standard deviation: m_1, m_2, …, m_j, …, m_P and s_1, s_2, …, s_j, …, s_P.

Rescaling the data (matrix X) by normalization generates matrix Z:

\[ z_{ij} = \frac{x_{ij} - m_j}{s_j}. \]

Matrix X has column means m_1, m_2, …, m_j, …, m_P and column standard deviations s_1, s_2, …, s_j, …, s_P. After normalization, every column j of matrix Z has mean m_j = 0 and standard deviation s_j = 1.
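The normalization can be sketched in a few lines of NumPy. This is only an illustration: the function name `standardize` and the small toy matrix are assumptions made here, not part of the course material. Note that `np.std` divides by N by default, which matches the 1/N convention of the formulas.

```python
import numpy as np

def standardize(X):
    """Rescale each column of X to zero mean and unit standard deviation."""
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)   # column means m_j
    s = X.std(axis=0)    # column standard deviations s_j (divides by N)
    return (X - m) / s, m, s

# Hypothetical toy data: 5 objects (rows) x 3 properties (columns).
X = np.array([[1.0, 10.0, 0.5],
              [2.0, 14.0, 0.1],
              [3.0,  9.0, 0.9],
              [4.0, 20.0, 0.4],
              [5.0, 12.0, 0.6]])

Z, m, s = standardize(X)
print(Z.mean(axis=0))  # each entry should be numerically zero
print(Z.std(axis=0))   # each entry should be numerically one
```

A column with zero standard deviation would cause a division by zero; the sketch assumes every property varies across the objects.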

Transform the matrix Z into a new matrix Y by multiplying Z by a special matrix V:

\[
Y_{N \times P} = Z_{N \times P} \, V_{P \times P},
\qquad
y_{ij} = \sum_{k=1}^{P} z_{ik} v_{kj},
\]

where Y = (y_{ij}) and Z = (z_{ij}) have N rows and P columns, and V = (v_{ij}) has P rows and P columns.

V is computed from the relationships among the properties of the data, and has the following properties:

1. Any two distinct columns j and k of V are orthogonal:

\[ \sum_{i=1}^{P} v_{ij} v_{ik} = 0, \quad j \neq k. \]

2. Each column of V is a unit vector:

\[ \sum_{i=1}^{P} v_{ij}^2 = 1. \]
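The two properties of V and the element-wise form of the product can be checked numerically. In the sketch below, V is a 2 × 2 rotation matrix chosen purely for illustration (an assumption; it is not the V that PCA would compute), and Z is a made-up 3 × 2 matrix.

```python
import numpy as np

theta = np.pi / 6  # hypothetical rotation angle, for illustration only
V = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Entry (j, k) of V^T V is sum_i v_ij * v_ik, so orthogonal unit-length
# columns make this product the identity matrix.
gram = V.T @ V

# Made-up data matrix Z (3 objects x 2 properties).
Z = np.array([[ 1.0, -1.0],
              [ 0.0,  2.0],
              [-1.0, -1.0]])
Y = Z @ V

# The same product written element by element: y_ij = sum_k z_ik * v_kj.
Y_loops = np.zeros_like(Y)
for i in range(Z.shape[0]):
    for j in range(V.shape[1]):
        Y_loops[i, j] = sum(Z[i, k] * V[k, j] for k in range(Z.shape[1]))

print(np.allclose(gram, np.eye(2)), np.allclose(Y, Y_loops))
```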

\[
Y =
\begin{pmatrix}
y_{1,1} & y_{1,2} & \cdots & y_{1,j} & \cdots & y_{1,P} \\
y_{2,1} & y_{2,2} & \cdots & y_{2,j} & \cdots & y_{2,P} \\
\vdots  & \vdots  &        & \vdots  &        & \vdots  \\
y_{i,1} & y_{i,2} & \cdots & y_{i,j} & \cdots & y_{i,P} \\
\vdots  & \vdots  &        & \vdots  &        & \vdots  \\
y_{N,1} & y_{N,2} & \cdots & y_{N,j} & \cdots & y_{N,P}
\end{pmatrix}
\]

Each column of Y is called a principal component of the data.

If we rank the principal components according to their variance, a few columns predominate over the others, and they can be used to visualize the data in a reduced-dimensional space. Because the columns of Y have zero mean, the variance of column j is

\[ \frac{1}{N} \sum_{i=1}^{N} y_{ij}^2, \qquad j = 1, 2, \ldots, P. \]
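Ranking the components and keeping the leading ones can be sketched as follows. The helper name `top_components` and the small matrix Y are hypothetical; the sketch assumes the columns of Y already have zero mean, so that the mean of squares is the variance.

```python
import numpy as np

def top_components(Y, n_keep=2):
    """Return the n_keep columns of Y with the largest variance."""
    variances = np.mean(Y ** 2, axis=0)   # (1/N) * sum_i y_ij^2 per column
    order = np.argsort(variances)[::-1]   # column indices, largest variance first
    keep = order[:n_keep]
    return Y[:, keep], variances[keep], keep

# Made-up matrix with zero-mean columns (4 objects x 3 components).
Y = np.array([[ 0.1,  2.0, -0.5],
              [-0.1, -2.0,  0.5],
              [ 0.2,  1.0,  1.0],
              [-0.2, -1.0, -1.0]])

Y_reduced, kept_var, kept_idx = top_components(Y)
print(kept_idx, kept_var)
```

The two retained columns of `Y_reduced` are the coordinates one would use for a 2D scatter plot of the objects.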

Most of the variation in the data can be visualized in the reduced space.

[Figure: plot of the 20 AAs on the first two principal components of the PCA, i.e. the two columns of Y with the largest variance, obtained from the corresponding two vectors in V.]

[Figure: plot of the 20 AAs on molecular volume and isoelectric point (pI), for comparison.]

To compute matrix V, we first compute the P × P matrix C of correlation coefficients between each pair of columns of X (or Z):

\[
c_{jk}
= \frac{\frac{1}{N} \sum_{i=1}^{N} (x_{ij} - m_j)(x_{ik} - m_k)}{s_j s_k}
= \frac{1}{N} \sum_{i=1}^{N} \frac{(x_{ij} - m_j)}{s_j} \frac{(x_{ik} - m_k)}{s_k}
= \frac{1}{N} \sum_{i=1}^{N} z_{ij} z_{ik}.
\]

That is,

\[ c_{jk} = \frac{1}{N} \sum_{i=1}^{N} z_{ij} z_{ik}. \]
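In matrix form the formula for c_jk is simply Z^T Z / N. The sketch below computes C this way on synthetic random data (an assumption for illustration, not the amino acid dataset) and compares it with NumPy's built-in `np.corrcoef`.

```python
import numpy as np

# Synthetic data for illustration only: N = 20 objects, P = 4 properties.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
N, P = X.shape

# Standardize the columns with the 1/N convention (np.std default).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# c_jk = (1/N) * sum_i z_ij * z_ik, computed for all pairs (j, k) at once.
C = Z.T @ Z / N

# Cross-check against NumPy's Pearson correlation, treating columns as variables.
C_ref = np.corrcoef(X, rowvar=False)
print(np.allclose(C, C_ref))
```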

The correlation coefficient matrix is symmetric, c_ij = c_ji, and c_ii = 1.

[Figure: the correlation coefficient matrix of the 20 AAs based on the 8 properties.]

\[
C =
\begin{pmatrix}
1       & c_{1,2} & \cdots & c_{1,k} & \cdots & c_{1,P} \\
c_{2,1} & 1       & \cdots & c_{2,k} & \cdots & c_{2,P} \\
\vdots  & \vdots  &        & \vdots  &        & \vdots  \\
c_{j,1} & c_{j,2} & \cdots & 1       & \cdots & c_{j,P} \\
\vdots  & \vdots  &        & \vdots  &        & \vdots  \\
c_{P,1} & c_{P,2} & \cdots & c_{P,k} & \cdots & 1
\end{pmatrix}
\]

It can be shown that the columns of V are eigenvectors of C, i.e.,

\[
C \mathbf{v}_n = \lambda_n \mathbf{v}_n,
\quad \text{or} \quad
\sum_{j=1}^{P} c_{kj} v_{jn} = \lambda_n v_{kn},
\]

where \mathbf{v}_n = (v_{1,n}, v_{2,n}, \ldots, v_{i,n}, \ldots, v_{P,n})^T is the n-th column of V and \lambda_n is the corresponding eigenvalue.
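One way to obtain such a V in practice is a symmetric eigen-solver. The sketch below uses `np.linalg.eigh` on synthetic random data (an assumption for illustration; the course may have used a different tool) and checks the eigenvector equation for every column at once.

```python
import numpy as np

# Synthetic data for illustration only: 20 objects x 4 properties.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0)
C = Z.T @ Z / X.shape[0]

# eigh is intended for symmetric matrices: it returns the eigenvalues in
# ascending order and the matching unit-length eigenvectors as columns of V.
eigvals, V = np.linalg.eigh(C)

# Reorder so that the first column of V belongs to the largest eigenvalue.
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

# Residual of C v_n = lambda_n v_n; broadcasting scales column n by eigvals[n].
residual = C @ V - V * eigvals
print(np.abs(residual).max())  # should be close to machine precision
```

The sign of each eigenvector is arbitrary, so a plot built from these columns may appear mirrored relative to one produced by other software.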

It can be shown that the variance of the n-th component (column) of the matrix Y is equal to the eigenvalue corresponding to the n-th eigenvector:

\[
\begin{aligned}
\frac{1}{N} \sum_{i=1}^{N} y_{in}^2
&= \frac{1}{N} \sum_{i=1}^{N} \Big( \sum_{j=1}^{P} z_{ij} v_{jn} \Big) \Big( \sum_{k=1}^{P} z_{ik} v_{kn} \Big) \\
&= \sum_{j=1}^{P} \sum_{k=1}^{P} \Big( \frac{1}{N} \sum_{i=1}^{N} z_{ij} z_{ik} \Big) v_{jn} v_{kn}
 = \sum_{j=1}^{P} v_{jn} \sum_{k=1}^{P} c_{jk} v_{kn} \\
&= \sum_{j=1}^{P} v_{jn} \lambda_n v_{jn}
 = \lambda_n \sum_{j=1}^{P} v_{jn}^2
 = \lambda_n.
\end{aligned}
\]

The total variance of the data is P, so the fraction of the variance carried by the first few (e.g. 2) components measures how well these principal components represent the dataset.

In our PCA of the 20-AA dataset,

\[ \frac{\lambda_1 + \lambda_2}{P} = \frac{3.57 + 2.81}{8} = 79.7\%. \]
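The identity between component variances and eigenvalues, and the resulting variance fraction, can be verified numerically. The sketch below runs on synthetic correlated data with P = 8 columns (an assumption for illustration; it is not the amino acid dataset, so its fraction will differ from the one quoted above). The last line only redoes the arithmetic with the two eigenvalues given in the text.

```python
import numpy as np

# Synthetic, deliberately correlated data: 20 objects x 8 properties.
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 8)) @ rng.normal(size=(8, 8))
N, P = X.shape

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # normalization (1/N convention)
C = Z.T @ Z / N                            # correlation coefficient matrix
eigvals, V = np.linalg.eigh(C)             # eigenvalues ascending, eigenvectors in columns
order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
eigvals, V = eigvals[order], V[:, order]

Y = Z @ V                                  # principal components
var_Y = np.mean(Y ** 2, axis=0)            # columns of Y have zero mean

print(np.allclose(var_Y, eigvals))         # variance of column n equals lambda_n
print(np.isclose(eigvals.sum(), P))        # total variance equals P

fraction_first_two = eigvals[:2].sum() / P
print(fraction_first_two)

# Arithmetic with the eigenvalues quoted in the text (3.57 and 2.81, P = 8).
slide_fraction = (3.57 + 2.81) / 8
print(slide_fraction)  # about 0.80, in line with the roughly 79.7% quoted
```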