
Page 1: Introduction to Pattern Recognition

Introduction to Pattern Recognition

Prediction in Bioinformatics

• What do we want to predict?
  – Features from sequence
  – Data mining

• How can we predict?
  – Homology / Alignment
  – Pattern Recognition / Statistical Methods / Machine Learning

• What is prediction?
  – Generalization / Overfitting
  – Preventing overfitting: Homology reduction

• How do we measure prediction?
  – Performance measures
  – Threshold selection

Henrik Nielsen
Center for Biological Sequence Analysis
Technical University of Denmark

Page 2: Introduction to Pattern Recognition

Sequence → structure → function

Page 3: Introduction to Pattern Recognition

Prediction from DNA sequence

• Protein-coding genes
  – transcription factor binding sites
  – transcription start/stop
  – translation start/stop
  – splicing: donor/acceptor sites

• Non-coding RNA
  – tRNAs
  – rRNAs
  – miRNAs

• General features
  – Structure (curvature/bending)
  – Binding (histones etc.)

Page 4: Introduction to Pattern Recognition

Prediction from amino acid sequence

• Folding / structure

• Post-Translational Modifications
  – Attachment: phosphorylation, glycosylation, lipid attachment
  – Cleavage: signal peptides, propeptides, transit peptides
  – Sorting: secretion, import into various organelles, insertion into membranes

• Interactions

• Function
  – Enzyme activity
  – Transport
  – Receptors
  – Structural components
  – etc.

Page 5: Introduction to Pattern Recognition

Protein sorting in eukaryotes

• Proteins belong in different organelles of the cell – and some even have their function outside the cell

• Günter Blobel was awarded the 1999 Nobel Prize in Physiology or Medicine for the discovery that "proteins have intrinsic signals that govern their transport and localization in the cell"

Page 6: Introduction to Pattern Recognition

Data: UniProt annotation of protein sorting

Annotations relevant for protein sorting are found in:
– the CC (comments) lines
– cross-references (DR lines) to GO (Gene Ontology)
– the FT (feature table) lines

ID   INS_HUMAN      Reviewed;      110 AA.
AC   P01308;
...
DE   Insulin precursor [Contains: Insulin B chain; Insulin A chain].
GN   Name=INS;
...
CC   -!- SUBCELLULAR LOCATION: Secreted.
...
DR   GO; GO:0005576; C:extracellular region; IC:UniProtKB.
...
FT   SIGNAL        1     24

3 types of non-experimental qualifiers in the CC and FT lines:
– Potential: Predicted by sequence analysis methods
– Probable: Inconclusive experimental evidence
– By similarity: Predicted by alignment to proteins with known location

Page 7: Introduction to Pattern Recognition

Problems in database parsing

Extreme example: A4_HUMAN, Alzheimer disease amyloid protein

CC   -!- SUBCELLULAR LOCATION: Membrane; Single-pass type I membrane
CC       protein. Note=Cell surface protein that rapidly becomes
CC       internalized via clathrin-coated pits. During maturation, the
CC       immature APP (N-glycosylated in the endoplasmic reticulum) moves
CC       to the Golgi complex where complete maturation occurs (O-
CC       glycosylated and sulfated). After alpha-secretase cleavage,
CC       soluble APP is released into the extracellular space and the C-
CC       terminal is internalized to endosomes and lysosomes. Some APP
CC       accumulates in secretory transport vesicles leaving the late Golgi
CC       compartment and returns to the cell surface. Gamma-CTF(59) peptide
CC       is located to both the cytoplasm and nuclei of neurons. It can be
CC       translocated to the nucleus through association with Fe65. Beta-
CC       APP42 associates with FRPL1 at the cell surface and the complex is
CC       then rapidly internalized. APP sorts to the basolateral surface in
CC       epithelial cells. During neuronal differentiation, the Thr-743
CC       phosphorylated form is located mainly in growth cones, moderately
CC       in neurites and sparingly in the cell body. Casein kinase
CC       phosphorylation can occur either at the cell surface or within a
CC       post-Golgi compartment.
...
DR   GO; GO:0009986; C:cell surface; IDA:UniProtKB.
DR   GO; GO:0005576; C:extracellular region; TAS:ProtInc.
DR   GO; GO:0005887; C:integral to plasma membrane; TAS:ProtInc.
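Extracting a localisation label from such free-text CC blocks is exactly the parsing problem the slide illustrates. A minimal sketch of one way to do it (the function name is mine, and real UniProt parsing would better use an existing library such as Biopython's SwissProt module):

def subcellular_location(record_text):
    """Collect the free text of the '-!- SUBCELLULAR LOCATION' comment
    from a SwissProt-style flat-file record."""
    parts, capture = [], False
    for line in record_text.splitlines():
        if not line.startswith("CC"):
            continue
        body = line[2:].strip()
        if body.startswith("-!-"):                        # a new comment topic begins
            capture = body.startswith("-!- SUBCELLULAR LOCATION")
            body = body[3:].strip()
        if capture:
            parts.append(body)
    return " ".join(parts)

For A4_HUMAN this returns one long string mixing membrane, cell surface, extracellular, endosomal, Golgi and nuclear locations, which is why naive keyword matching gives contradictory labels.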

Page 8: Introduction to Pattern Recognition

Prediction methods

• Homology / Alignment

• Simple pattern recognition
  – Example: PROSITE entry PS00014, ER_TARGET:
    Endoplasmic reticulum targeting sequence.
    Pattern: [KRHQSA]-[DENQ]-E-L> (translated to a regular expression in the sketch after this list)

• Statistical methods
  – Weight matrices: calculate amino acid probabilities (see the sketch after this list)
  – Other examples: Regression, variance analysis, clustering

• Machine learning
  – Like statistical methods, but parameters are estimated by iterative training rather than direct calculation
  – Examples: Neural Networks (NN), Hidden Markov Models (HMM), Support Vector Machines (SVM)
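To make the pattern and weight matrix bullets concrete, here is a minimal Python sketch (the toy sequences and the pseudocount choice are mine, not from the slides): PS00014 rendered as a regular expression, where the trailing '>' in PROSITE syntax means the match must sit at the C-terminus, plus per-position amino acid probabilities estimated from a handful of aligned sites.

import re
from collections import Counter

# PROSITE PS00014 (ER_TARGET): [KRHQSA]-[DENQ]-E-L>  ('>' becomes '$')
ER_TARGET = re.compile(r"[KRHQSA][DENQ]EL$")
print(bool(ER_TARGET.search("MKTAYIAKQRKDEL")))   # True: sequence ends in KDEL

def weight_matrix(sites, alphabet="ACDEFGHIKLMNPQRSTVWY", pseudocount=1.0):
    """Per-position amino acid probabilities from equal-length aligned
    sites, with a pseudocount so no probability is exactly zero."""
    matrix = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        total = len(sites) + pseudocount * len(alphabet)
        matrix.append({aa: (counts[aa] + pseudocount) / total for aa in alphabet})
    return matrix

pwm = weight_matrix(["KDEL", "HDEL", "KEEL"])     # toy ER-retention signals
print(round(pwm[0]["K"], 2))                      # probability of K at position 1

Unlike the all-or-nothing regex, the weight matrix yields a graded score, which is what makes threshold selection (discussed later) necessary.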

Page 9: Introduction to Pattern Recognition

Prediction of subcellular localisation from sequence

• Homology: threshold 30%–70% identity

• Sorting signals (“zip codes”)
  – N-terminal: secretory (ER) signal peptides, mitochondrial & chloroplast transit peptides
  – C-terminal: peroxisomal targeting signal 1, ER-retention signal
  – internal: nuclear localisation signals, nuclear export signals

• Global properties
  – amino acid composition, aa pair composition
  – composition in limited regions
  – predicted structure
  – physico-chemical parameters

• Combined approaches

Page 10: Introduction to Pattern Recognition

Signal-based prediction

• Signal peptides
  – von Heijne 1983, 1986 [WM]
  – SignalP (Nielsen et al. 1997, 1998; Bendtsen et al. 2004) [NN, HMM]

• Mitochondrial & chloroplast transit peptides
  – MitoProt (Claros & Vincens 1996) [linear discriminant using physico-chemical parameters]
  – ChloroP, TargetP* (Emanuelsson et al. 1999, 2000) [NN]
  – iPSORT* (Bannai et al. 2002) [decision tree using physico-chemical parameters]
  – Protein Prowler* (Hawkins & Bodén 2006) [NN]

  * = also includes signal peptides

• Nuclear localisation signals
  – PredictNLS (Cokol et al. 2000) [regex]
  – NucPred (Heddad et al. 2004) [regex, GA]

Page 11: Introduction to Pattern Recognition

Composition-based prediction

• Nakashima and Nishikawa 1994 [2 categories; odds-ratio statistics]
• ProtLock (Cedano et al. 1997) [5 categories; Mahalanobis distance]
• Chou and Elrod 1998 [12 categories; covariant discriminant]
• NNPSL (Reinhardt and Hubbard 1998) [4 categories; NN]
• SubLoc (Hua and Sun 2001) [4 categories; SVM]
• PLOC (Park and Kanehisa 2003) [12 categories; SVM]
• LOCtree (Nair & Rost 2005) [6 categories; SVM incl. regions, structure and profiles]
• BaCelLo (Pierleoni et al. 2006) [5 categories; SVM incl. regions and profiles]

Pro:
• does not require knowledge of signals
• works even if the N-terminus is wrong

Con:
• cannot identify isoform differences
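All of these methods share one preprocessing step: turning a sequence of arbitrary length into a fixed-length feature vector. A minimal sketch of the simplest variant, single amino acid composition (the alphabet ordering is an arbitrary convention):

from collections import Counter

AA = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids

def composition(seq):
    """Fraction of each amino acid in the sequence: a 20-dimensional
    feature vector regardless of sequence length."""
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AA]

# Any of the classifiers above (NN, SVM, discriminant) is then
# trained on such vectors rather than on the raw sequence.
print(composition("ALAKAAAAM"))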

Page 12: Introduction to Pattern Recognition

A simple statistical method: Linear regression

Observations (training data): a set of x values (input) and y values (output).

Model: y = ax + b (2 parameters, which are estimated from the training data)

Prediction: Use the model to calculate a y value for a new x value

Note: the model does not fit the observations exactly. Can we do better than this?
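Since the slide stops at the idea, here is a minimal sketch of the three steps in Python (the data values are toy numbers, not from the slide):

import numpy as np

# Observations (training data): x values (input) and y values (output)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Model: estimate the 2 parameters of y = a*x + b by least squares
a, b = np.polyfit(x, y, deg=1)

# Prediction: apply the fitted model to a new x value
print(a * 5.0 + b)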

Page 13: Introduction to Pattern Recognition

Overfitting

y = ax + b
2-parameter model
Good description, poor fit

y = ax⁶ + bx⁵ + cx⁴ + dx³ + ex² + fx + g
7-parameter model
Poor description, good fit

Note: It is not interesting that a model can fit its observations (training data) exactly.

To function as a prediction method, a model must be able to generalize, i.e. produce sensible output on new data.
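This is easy to reproduce: given seven points, the 7-parameter polynomial can pass through all of them exactly, yet predicts poorly on a new x. A sketch with synthetic noisy-line data (all values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 6.0, 7)
y = 1.5 * x + 1.0 + rng.normal(scale=0.5, size=7)   # noisy straight line

line = np.polyfit(x, y, deg=1)   # 2-parameter model: good description
poly = np.polyfit(x, y, deg=6)   # 7-parameter model: passes through all 7 points
                                 # (numpy may warn that this fit is poorly conditioned)

x_new = 7.0                       # outside the training range
print(np.polyval(line, x_new))    # close to the true 1.5*7 + 1 = 11.5
print(np.polyval(poly, x_new))    # typically far off: the model memorized the noise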

Page 14: Introduction to Pattern Recognition

A classification problem

How complex a model should we choose? This depends on:

• The real complexity of the problem

• The size of the training data set

• The amount of noise in the data set

Page 15: Introduction to Pattern Recognition

How to estimate parameters for prediction?

Page 16: Introduction to Pattern Recognition

Model selection

(Figure: three candidate models fitted to the same data: Linear Regression, Quadratic Regression, Join-the-dots)

Pages 17–21: Introduction to Pattern Recognition

The test set method (figure sequence)
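The figures are not reproduced here, but the method itself is simple: hold part of the data out of training and measure error only on the held-out points. A minimal sketch (70/30 split on toy data):

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 30)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=30)

# Randomly hold out 30% of the examples as a test set
idx = rng.permutation(30)
train, test = idx[:21], idx[21:]

a, b = np.polyfit(x[train], y[train], deg=1)     # fit on training data only
mse = np.mean((a * x[test] + b - y[test]) ** 2)  # error measured on unseen data
print(mse)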

Pages 22–28: Introduction to Pattern Recognition

Cross Validation (figure sequence)
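Again the figures are omitted; the procedure is to split the data into k folds, train on k−1 of them, test on the remaining fold, and rotate. A minimal 5-fold sketch on the same kind of toy data:

import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 30)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=30)

k = 5
folds = np.array_split(rng.permutation(30), k)

errors = []
for i in range(k):
    test = folds[i]
    train = np.concatenate([folds[j] for j in range(k) if j != i])
    a, b = np.polyfit(x[train], y[train], deg=1)
    errors.append(np.mean((a * x[test] + b - y[test]) ** 2))

print(np.mean(errors))   # cross-validated mean squared error

With k equal to the number of examples this becomes leave-one-out, the jackknife mentioned on the next page.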

Page 29: Introduction to Pattern Recognition

Which kind of Cross Validation?

Note: Leave-one-out cross validation is also known as the jackknife

Page 30: Introduction to Pattern Recognition

Problem: sequences are related

• If the sequences in the test set are closely related to those in the training set, we cannot measure true generalization performance

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV

Page 31: Introduction to Pattern Recognition

Solution: Homology reduction

The Hobohm algorithm:

• Calculate all pairwise similarities in the data set

• Define a threshold for being “neighbours” (too closely related)

• Count the neighbours of each example, and remove the example with the most neighbours

• Repeat until there are no examples with neighbours left

Alternative: Homology partitioning
• keep all examples, but cluster them so that no neighbours end up in the same fold
• should be combined with weighting
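The bullets above translate almost line for line into code. A minimal sketch, assuming a caller-supplied is_neighbour(a, b) test (the example threshold below is illustrative only):

def hobohm(examples, is_neighbour):
    """Greedy homology reduction: repeatedly drop the example
    with the most neighbours until none are left."""
    kept = list(examples)
    while kept:
        counts = [sum(1 for j, b in enumerate(kept) if j != i and is_neighbour(a, b))
                  for i, a in enumerate(kept)]
        worst = max(range(len(kept)), key=counts.__getitem__)
        if counts[worst] == 0:
            break
        del kept[worst]
    return kept

# Illustrative neighbour test: >25% identical positions (equal-length toy sequences)
def is_neighbour(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a) > 0.25

print(hobohm(["ALAKAAAAM", "ALAKAAAAN", "GMNERPILT"], is_neighbour))

Removing the worst offender first keeps as many examples as possible, which is why this greedy order is preferred over removing arbitrary members of each neighbour pair.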

Page 32: Introduction to Pattern Recognition

Defining a threshold for homology reduction

First approach: two sequences are too closely related if the prediction problem can be solved by alignment.

The Sander/Schneider curve: for protein structure prediction, 70% identically classified secondary structure means that prediction by alignment is possible. This corresponds to 25% identical amino acids in a local alignment of more than 80 positions.
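This rule of thumb turns directly into a neighbour test for homology reduction. A minimal sketch, assuming the identity count and alignment length have already been computed by an alignment program:

def too_closely_related(identical_positions, alignment_length):
    """Sander/Schneider-style test: more than 25% identity in a
    local alignment longer than 80 positions counts as a neighbour."""
    if alignment_length <= 80:
        return False   # the published curve rises steeply for short alignments; not modelled here
    return identical_positions / alignment_length > 0.25

print(too_closely_related(30, 100))   # True: 30% identity over 100 aligned positions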

Page 33: Introduction to Pattern Recognition

Defining a threshold for homology reduction

Second approach: two sequences are too closely related if their homology is statistically significant.

The Pedersen/Nielsen/Wernersson curve: use the extreme value distribution to define the BLAST score at which the similarity is stronger than random.
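Alignment scores between unrelated sequences follow an extreme value (Gumbel) distribution, so a significance cutoff can be read off its tail. A minimal sketch with scipy; the location and scale values are placeholders, not fitted parameters:

from scipy.stats import gumbel_r

# Placeholder Gumbel parameters; in practice they would be fitted to
# BLAST scores obtained from shuffled (random) sequence pairs.
loc, scale = 25.0, 6.0

# The score that only 0.1% of random alignments would reach by chance;
# pairs scoring above it are treated as significantly homologous.
cutoff = gumbel_r.ppf(1.0 - 0.001, loc=loc, scale=scale)
print(cutoff)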