Protein Evolution: SARS coronavirus as an example

CZ5225 Methods in Computational Biology

Lecture 2-3: Protein Families and Family Prediction Methods

Prof. Chen Yu Zong

Tel: 6874-6877
Email: csccyz@nus.edu.sg
http://xin.cz3.nus.edu.sg

Room 07-24, level 7, SOC1, NUS
August 2004


SARS Coronavirus

A novel coronavirus identified as the cause of severe acute respiratory syndrome (SARS)

SARS Infection

How the SARS coronavirus enters a cell and reproduces

Protein Evolution

Generation of different species

Protein Families

• Sequence alignment-based families.
– Based on the principle of the sequence-structure-function relationship
– Derived by multiple sequence alignment
– Database: PFAM (Nucleic Acids Res. 30:276-280)

• Structure-based families.
– Derived by visual inspection and comparison of structures
– Database: SCOP (J. Mol. Biol. 247, 536-540)

• Functional families.
– Databases:
  • G-protein coupled receptors: GPCRDB (Nucleic Acids Res. 29: 346-349), ORDB (Nucleic Acids Res. 30:354-360)
  • Nuclear receptors: NucleaRDB (Nucleic Acids Res. 29: 346-349)
  • Enzymes: BRENDA (Nucleic Acids Res. 30, 47-49)
  • Transporters: TC-DB (Microbiol Mol Biol Rev. 64:354-411)
  • Ligand-gated ion channels: LGICdb (Nucleic Acids Res. 29: 294-295)
  • Therapeutic targets: TTD (Nucleic Acids Res. 30, 412-415)
  • Drug side-effect targets: DART (Drug Safety 26: 685-690)

Protein Families

Sequence families ≠ Structural families ≠ Functional families

Sequence similar, structure different

Sequence different, structure similar

Sequence similar, function different (distantly related proteins)

Sequence different, function similar

Homework: find examples

Protein Family Prediction Methods

Sequence alignment-based families:

• Multiple sequence alignment (HMM): HMMER; JMB 235, 1501-153; JMB 301, 173-190

Structure-based families:

• Visual inspection and comparison of structures

Functional families:

• Statistical learning methods:
– Neural network: ProtFun (Bioinformatics, 19:635-642)
– Support vector machines: SVMProt (Nucleic Acids Res., 31: 3692-3697)

Sequence Comparison as a Mathematical Problem

Example:

Sequence a: ATTCTTGC

Sequence b: ATCCTATTCTAGC

Best alignment: sequence a aligned against sequence b with a single gap.

Bad alignment: sequence a split into AT, TCTT and GC, aligned against sequence b with two gaps.

Construction of many alignments => which is the best?

How to rate an alignment?

• Match: +8 (w(x, y) = 8, if x = y)

• Mismatch: -5 (w(x, y) = -5, if x ≠ y)

• Each gap symbol: -3 (w(-,x) = w(x,-) = -3)

 C  -  -  -  T  T  A  A  C  T
 C  G  G  A  T  C  A  -  -  T

+8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12

Alignment score: +12
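The scoring rule above can be checked with a few lines of Python. This is a minimal sketch: the function names `w` and `alignment_score` are illustrative choices, not part of the lecture, and the two aligned rows are assumed to be written with `-` as the gap symbol.

```python
# Scores taken from the slide: match +8, mismatch -5, gap symbol -3.
MATCH, MISMATCH, GAP = 8, -5, -3


def w(x, y):
    """Score one aligned column; '-' is the gap symbol."""
    if x == "-" or y == "-":
        return GAP
    return MATCH if x == y else MISMATCH


def alignment_score(row_a, row_b):
    """Sum the column scores of two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(w(x, y) for x, y in zip(row_a, row_b))


# The alignment shown on the slide.
print(alignment_score("C---TTAACT", "CGGATCA--T"))  # 12
```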

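Alignment Graph

Sequence a: CTTAACT

Sequence b: CGGATCAT

The graph is a grid with the residues of b (C G G A T C A T) along the top and the residues of a (C T T A A C T) down the side. Each path from the top-left corner to the bottom-right corner is an alignment, for example:

C---TTAACT
CGGATCA--T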

An optimal alignment: the alignment of maximum score

• Let A = a1a2…am and B = b1b2…bn.

• Si,j: the score of an optimal alignment between a1a2…ai and b1b2…bj

• With proper initializations, Si,j can be computed as follows.

$$
S_{i,j} = \max
\begin{cases}
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j) \\
S_{i-1,j-1} + w(a_i, b_j)
\end{cases}
$$
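The recurrence can be turned into a table-filling routine. The sketch below assumes the scores used elsewhere in the lecture (match +8, mismatch -5, gap -3) and initializes the first row and column with cumulative gap penalties; the function name `global_alignment_matrix` is a hypothetical choice.

```python
def global_alignment_matrix(a, b, match=8, mismatch=-5, gap=-3):
    """Fill the table S, where S[i][j] is the best score for a[:i] vs b[:j]."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]

    # Initialization: aligning a prefix against nothing costs one gap per residue.
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j] + gap,      # a_i aligned with a gap
                S[i][j - 1] + gap,      # b_j aligned with a gap
                S[i - 1][j - 1] + sub,  # a_i aligned with b_j
            )
    return S


S = global_alignment_matrix("CTTAACT", "CGGATCAT")
print(S[3][5])    # score for CTT vs CGGAT
print(S[-1][-1])  # optimal global score
```

The table needs O(mn) time and space; only the last cell is needed for the score, but the whole table is kept so that an alignment can be traced back.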

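Computing Si,j

Cell (i, j) is reached from cell (i-1, j) with score w(ai,-), from cell (i, j-1) with score w(-,bj), or from cell (i-1, j-1) with score w(ai,bj). The bottom-right cell holds Sm,n.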

Initializations

           C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3
T   -6
T   -9
A  -12
A  -15
C  -18
T  -21

S3,5 = ?

           C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5    2   -1   -4   -7  -10  -13
T   -6    5    3    0   -3    7    4    1   -2
T   -9    2    0   -2   -5    ?
A  -12
A  -15
C  -18
T  -21

S3,5 = 5

           C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5    2   -1   -4   -7  -10  -13
T   -6    5    3    0   -3    7    4    1   -2
T   -9    2    0   -2   -5    5    2   -1    9
A  -12   -1   -3   -5    6    3    0   10    7
A  -15   -4   -6   -8    3    1   -2    8    5
C  -18   -7   -9  -11    0   -2    9    6    3
T  -21  -10  -12  -14   -3    8    6    4   14

Optimal score: S7,8 = 14

C T T A A C - T
C G G A T C A T

           C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5    2   -1   -4   -7  -10  -13
T   -6    5    3    0   -3    7    4    1   -2
T   -9    2    0   -2   -5    5    2   -1    9
A  -12   -1   -3   -5    6    3    0   10    7
A  -15   -4   -6   -8    3    1   -2    8    5
C  -18   -7   -9  -11    0   -2    9    6    3
T  -21  -10  -12  -14   -3    8    6    4   14

8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14
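An alignment with the optimal score is recovered by walking back from the bottom-right cell to the top-left cell, at each step choosing a predecessor that explains the current value. The following is a sketch under the lecture's scoring scheme; `global_align` is a hypothetical name, and when several predecessors tie it prefers the diagonal move, so it returns one optimal alignment, not necessarily the one drawn on a slide.

```python
def global_align(a, b, match=8, mismatch=-5, gap=-3):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j] + gap, S[i][j - 1] + gap, S[i - 1][j - 1] + sub)

    # Traceback: rebuild the alignment from the last column to the first.
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sub:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1])  # a_i against a gap
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")       # b_j against a gap
            row_b.append(b[j - 1])
            j -= 1
    return S[m][n], "".join(reversed(row_a)), "".join(reversed(row_b))


score, top, bottom = global_align("CTTAACT", "CGGATCAT")
print(score)
print(top)
print(bottom)
```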

Global Alignment vs. Local Alignment

• global alignment:

• local alignment:

An optimal local alignment

• Si,j: the score of an optimal local alignment ending at ai and bj

• With proper initializations, Si,j can be computed as follows.

$$
S_{i,j} = \max
\begin{cases}
0 \\
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j) \\
S_{i-1,j-1} + w(a_i, b_j)
\end{cases}
$$
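The only changes from the global case are the extra 0 in the maximum and the zero-filled first row and column, and the best score can now appear anywhere in the table. A minimal sketch, again assuming match +8, mismatch -5 and gap -3, with the hypothetical name `local_alignment_score`:

```python
def local_alignment_score(a, b, match=8, mismatch=-5, gap=-3):
    """Return (best_score, (i, j), S) for the best local alignment of a and b."""
    m, n = len(a), len(b)
    # Row 0 and column 0 stay at 0: a local alignment may start anywhere.
    S = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(
                0,                      # start a fresh alignment here
                S[i - 1][j] + gap,
                S[i][j - 1] + gap,
                S[i - 1][j - 1] + sub,
            )
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)
    return best, best_pos, S


best, end, S = local_alignment_score("CTTAACT", "CGGATCAT")
print(best, end)
```

Tracing back from the cell with the best score until a 0 is reached yields the aligned segments.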

Local alignment

Match: 8
Mismatch: -5
Gap symbol: -3

           C    G    G    A    T    C    A    T
     0    0    0    0    0    0    0    0    0
C    0    8    5    2    0    0    8    5    2
T    0    5    3    0    0    8    5    3   13
T    0    2    0    0    0    8    5    2   11
A    0    0    0    0    8    5    3    ?
A    0
C    0
T    0

Local alignment

           C    G    G    A    T    C    A    T
     0    0    0    0    0    0    0    0    0
C    0    8    5    2    0    0    8    5    2
T    0    5    3    0    0    8    5    3   13
T    0    2    0    0    0    8    5    2   11
A    0    0    0    0    8    5    3   13   10
A    0    0    0    0    8    5    2   11    8
C    0    8    5    2    5    3   13   10    7
T    0    5    3    0    2   13   10    8   18

The best score: 18

A - C - T
A T C A T

8 - 3 + 8 - 3 + 8 = 18

Multiple sequence alignment (MSA)

• The multiple sequence alignment problem is to simultaneously align more than two sequences.

Seq1: GCTC

Seq2: AC

Seq3: GATC

GC-TC

A---C

G-ATC

How to score an MSA?

• Sum-of-Pairs (SP-score)

      GC-TC
Score A---C  =  Score GC-TC  +  Score GC-TC  +  Score A---C
      G-ATC           A---C           G-ATC           G-ATC
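The SP-score can be sketched as a sum over all pairs of rows. The slide does not state the column scores for the MSA case, so the sketch below assumes the pairwise scheme used earlier (match +8, mismatch -5, gap -3) and additionally assumes that a column pairing two gap symbols scores 0; both are labeled assumptions, as are the function names.

```python
from itertools import combinations


def column_score(x, y, match=8, mismatch=-5, gap=-3):
    """Score one column of a pairwise projection of the MSA."""
    if x == "-" and y == "-":
        return 0  # assumption: gap against gap is neutral
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def pair_score(row_a, row_b):
    """Score of the pairwise alignment induced by two MSA rows."""
    return sum(column_score(x, y) for x, y in zip(row_a, row_b))


def sp_score(rows):
    """Sum-of-Pairs score: add the induced scores of every pair of rows."""
    return sum(pair_score(r, s) for r, s in combinations(rows, 2))


msa = ["GC-TC", "A---C", "G-ATC"]
print(sp_score(msa))  # -3 + 18 + -3 = 12
```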

Functional Classification by SVM

• A protein is classified as either belonging (+) or not belonging (-) to a functional family.

• By screening against all families, the function of the protein can be identified (example: SVMProt).

• What is SVM? Support vector machines are a statistical machine learning method that learns from examples and classifies objects into one of two classes.

• Advantages of SVM: accommodates diverse class members; uses sequence-derived physico-chemical features as the basis for classification; suitable for functional family classification.
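To make the two-class idea concrete, here is a toy linear soft-margin SVM trained by stochastic subgradient descent on the hinge loss. It is only an illustration: the 2-D feature vectors are made up, the hyperparameters `lam`, `lr` and `epochs` are arbitrary, and this is not the kernel SVM or the feature set that SVMProt uses.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Learn w, b so that sign(w.x + b) separates labels +1 and -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wk * xk for wk, xk in zip(w, xi)) + b)
            if margin < 1:
                # Inside the margin or misclassified: hinge loss is active.
                w = [wk - lr * (lam * wk - yi * xk) for wk, xk in zip(w, xi)]
                b += lr * yi
            else:
                # Correct with enough margin: only the regularizer acts.
                w = [wk - lr * lam * wk for wk in w]
    return w, b


def predict(w, b, x):
    """Return +1 (member of the family) or -1 (not a member)."""
    return 1 if sum(wk * xk for wk, xk in zip(w, x)) + b >= 0 else -1


# Made-up feature vectors: three family members (+1) and three non-members (-1).
X = [(2, 3), (3, 3), (3, 2), (-2, -3), (-3, -3), (-3, -2)]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
```

Screening one protein against many families, as the slide describes, would amount to training one such binary classifier per family.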

SVM References

• C. Burges, "A tutorial on support vector machines for pattern recognition", Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1998 (on-line).

• R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley, 2nd edition, 2001 (section 5.11, hard copy).

• S. Gong et al., Dynamic Vision: From Images to Face Recognition, Imperial College Press, 2001 (sections 3.6.2, 3.7.2, hard copy).

• Online lecture notes

Introduction to Machine Learning

Goal:

To “improve” (gaining knowledge, enhancing computing capability)

Tasks:

• Forming concepts by data generalization.
• Compiling knowledge into compact form.
• Finding useful explanations for valid concepts.
• Clustering data into classes.

Reference:

Machine Learning in Molecular Biology Sequence Analysis.

Internet links:

http://www.ai.univie.ac.at/oefai/ml/ml-resources.html

Introduction to Machine Learning

Categories:

• Inductive learning.
– Forming concepts from data without much domain knowledge (learning from examples).

• Analytic learning.
– Use of existing knowledge to derive new useful concepts (explanation-based learning).

• Connectionist learning.
– Use of artificial neural networks in searching for or representing concepts.

• Genetic algorithms.
– Searching for the most effective concept by means of Darwin’s “survival of the fittest” approach.

Machine Learning Methods

Inductive learning:

Concept learning and example-based learning

Concept learning:

Machine Learning Methods

Analytic learning:

Machine Learning Methods

Neural network:

Machine Learning Methods

Genetic algorithms:

Figure labels: Strength, Pattern, Classification

SVM


SVM for Classification of Proteins

How to represent a protein?

• Each sequence is represented by a specific feature vector assembled from encoded representations of tabulated residue properties:
– amino acid composition
– hydrophobicity
– normalized Van der Waals volume
– polarity
– polarizability
– charge
– surface tension
– secondary structure
– solvent accessibility

• Three descriptors, composition (C), transition (T), and distribution (D), are used to describe the global composition of each of these properties.

Nucleic Acids Res., 31: 3692-3697

SVM for Classification of Proteins

Descriptors for amino acid composition of a protein:

C = (53.33, 46.67)

T = (51.72)

D = (3.33, 16.67, 40.0, 66.67, 96.67, 6.67, 26.67, 60.0, 76.67, 100.0)

Nucleic Acids Res., 31: 3692-3697
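The three descriptors can be reproduced with a short routine. The sketch below rests on assumptions that are not on the slide: the 30-residue two-letter sequence is a hypothetical example chosen because it yields exactly the numbers above, composition is the percentage of each class, transition is the percentage of adjacent pairs that switch class, and distribution gives the chain position (in percent) of the first residue and of the residues at 25%, 50%, 75% and 100% of each class, rounding the rank down. The name `ctd_descriptors` is illustrative.

```python
from itertools import combinations


def ctd_descriptors(classes):
    """Composition, transition and distribution for a string of class labels."""
    n = len(classes)
    labels = sorted(set(classes))

    # C: percentage of residues in each class.
    composition = tuple(round(100 * classes.count(c) / n, 2) for c in labels)

    # T: for each pair of classes, percentage of adjacent pairs that switch
    # between them (in either direction).
    transition = tuple(
        round(100 * sum(1 for x, y in zip(classes, classes[1:]) if {x, y} == {c, d}) / (n - 1), 2)
        for c, d in combinations(labels, 2)
    )

    # D: chain position (percent) of the first, 25%, 50%, 75% and last residue
    # of each class; ranks are rounded down (an assumption of this sketch).
    distribution = []
    for c in labels:
        positions = [i for i, x in enumerate(classes, start=1) if x == c]
        k = len(positions)
        ranks = [1] + [max(1, k * p // 100) for p in (25, 50, 75, 100)]
        distribution.extend(round(100 * positions[r - 1] / n, 2) for r in ranks)

    return composition, transition, tuple(distribution)


# Hypothetical sequence with 16 residues of class A and 14 of class E.
sequence = "AEAAAEAEEAAAAAEAEEEAAEEAEEEAAE"
C, T, D = ctd_descriptors(sequence)
print(C)  # (53.33, 46.67)
print(T)  # (51.72,)
print(D)  # (3.33, 16.67, 40.0, 66.67, 96.67, 6.67, 26.67, 60.0, 76.67, 100.0)
```

In a full feature vector the same three descriptors would be computed for each residue property after mapping every amino acid to its property class.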

CZ5225 Methods in Computational Biology

Assignment 1

• Project 1: Protein family classification by SVM
– Construction of training and testing datasets
– Generating feature vectors
– SVM classification and analysis
– Write a report and include a softcopy of your datasets

• Project 2: Develop a program for pair-wise sequence alignment using a simple scoring scheme.
– Write the code in any programming language
– Test it on a few examples (such as estrogen receptor and progesterone receptor)
– Can you extend your program to multiple alignment?
– Write a report and include a softcopy of your program
