
Function, evolution, motifs and hierarchy Ashwin Sivakumar



Today’s outline…

• A fairly short presentation

• Quick wrap-up of what we learnt yesterday

• Some basic methodological and biological background for today’s practice session.

• After a short break, we do the “real” stuff: the practicals.

• As before, I practice along with you, and we work together to script a nice ‘story’, building upon what we scripted yesterday.

• If all goes as planned, we will finish early with a good take home message.


Cement, bricks, and accessories such as windows, doors and roofs represent the functions of a house or apartment at different levels.

Likewise, sequence motifs/signatures/patterns represent different levels of function at the four levels of structural hierarchy.


What is function?

Function, like structure, is hierarchical.

-Molecular functions (metabolic reactions, fit into structural associates)

-Post-translational modifications (e.g. glycosylation sites)

-Phenotype (physiological sub-systems and influence of environmental factors) (phenotype property/disease).

-Physiological function (set of proteins) (metabolic pathway, signal transduction)

A change in function can broadly be a change in biochemistry, structure, gene network or phenotype.


Inference of function: hypothesis and assumptions

Since a large value of the selection intensity S at an amino acid residue means functional importance (low evolutionary rate), and vice versa (Kimura, 1983), a site-specific change in evolutionary rate (or selection intensity S) can be naturally interpreted as a ‘change of functional importance’.


Function of a domain is not function of a sequence

Repercussions on public databases

The annotations in publicly available databases can be erroneous because:

a) The annotations are based on the submitter’s discretion. At times the annotation is that of the domain; in other cases it is that of the whole sequence.

b) Thus homology-based function assignment through public databases might propagate errors.

c) Sequence similarity does not mean functional similarity.


The assumptions

Even limited sequence identity (~20%) might be enough to place unknown proteins into enzyme super-families for which the catalytic strategy is known.

Functional importance is taken to be directly proportional to evolutionary conservation: F ≈ E. Thus ΔF = ΔE.

There are two types of change that can occur in evolutionary conservation (E):

TYPE I: change in evolutionary constraints (evolutionary rate), S0

TYPE II: change in amino acid properties, e.g. +ve vs –ve charge, A0


Cont…

A change in function can broadly be a change in biochemistry, structure, gene network or phenotype.

Functional divergence at a residue can be of two types:

I) TYPE I would involve a site-specific rate difference (a residue is conserved in one sub-family, variable in the other) (DIVERGE).

II) TYPE II would involve a site-specific amino acid type difference (positive vs negative charges) (SequenceSpace/NMF/Evol. trace etc.)
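The two types of divergence can be illustrated with a small sketch. It is not the method used by DIVERGE or the other tools named above; it simply scores each alignment column by Shannon entropy within two sub-families. The function names, the toy sequences and the entropy thresholds are all assumptions made for illustration.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of the residues in one column, gaps ignored."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(residues).values())

def consensus(column):
    """Most common residue in a column, or None if the column is all gaps."""
    residues = [r for r in column if r != "-"]
    return Counter(residues).most_common(1)[0][0] if residues else None

def classify_sites(subfamily_a, subfamily_b, conserved_max=0.0, variable_min=1.0):
    """Label each column 'type I', 'type II' or None.

    The thresholds are illustrative assumptions, not published cut-offs.
    """
    labels = []
    for col_a, col_b in zip(zip(*subfamily_a), zip(*subfamily_b)):
        h_a, h_b = column_entropy(col_a), column_entropy(col_b)
        cons_a, cons_b = consensus(col_a), consensus(col_b)
        if cons_a is None or cons_b is None:
            labels.append(None)
        elif h_a <= conserved_max and h_b <= conserved_max:
            # Conserved in both sub-families: Type II only if the residue differs.
            labels.append("type II" if cons_a != cons_b else None)
        elif (h_a <= conserved_max and h_b >= variable_min) or \
             (h_b <= conserved_max and h_a >= variable_min):
            # Conserved in one sub-family, variable in the other: Type I.
            labels.append("type I")
        else:
            labels.append(None)
    return labels

# Toy sub-family alignments (hypothetical sequences).
fam_a = ["MKDL", "MKDV", "MKDI"]
fam_b = ["MEDL", "MEAL", "MEGL"]
labels = classify_sites(fam_a, fam_b)
print(labels)
```

In this toy case the second column (K in one sub-family, E in the other) is flagged as Type II, while the third and fourth columns are flagged as Type I.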


Enzyme super-families

Most super-families adopt common catalytic strategies

S: substrate, P: product


Specificity determining residues (SDPpred)

The most suitable method depends on the test set.

SDPpred: for prediction of residues in protein sequences that determine functional differences between proteins having the same general biochemical function.

Basis: amino acid residues that determine differences in protein functional specificity, and account for correct recognition of interaction partners, are usually thought to correspond to those positions of a protein multiple alignment where the distribution of amino acids is closely associated with the grouping of proteins by specificity.

SDPpred can analyze alignments of length up to 2000 positions, containing at most 1000 proteins. There can be up to 1000 specificity groups.

The predicted SDPs are mapped on to the multiple alignment of the family.
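One simple way to quantify how closely the amino acid distribution in a column is associated with the specificity grouping is mutual information. The sketch below is only an illustration of that idea, not a reproduction of the SDPpred server; the function names and the toy alignment are assumptions.

```python
import math
from collections import Counter

def mutual_information(column, groups):
    """Mutual information (bits) between residue and group label in one column."""
    pairs = [(r, g) for r, g in zip(column, groups) if r != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    residue_counts = Counter(r for r, _ in pairs)
    group_counts = Counter(g for _, g in pairs)
    return sum((c / n) * math.log2(c * n / (residue_counts[r] * group_counts[g]))
               for (r, g), c in joint.items())

def score_columns(alignment, groups):
    """Return one mutual-information score per alignment column."""
    return [mutual_information(col, groups) for col in zip(*alignment)]

# Toy alignment (hypothetical): column 2 separates the two specificity groups.
alignment = ["AKT", "AKS", "ART", "ARS"]
groups = ["g1", "g1", "g2", "g2"]
scores = score_columns(alignment, groups)
print(scores)
```

Here the fully conserved first column and the variable but group-independent third column both score zero, while the second column scores highest. With small groups, raw mutual information is biased towards variable columns, so a real analysis would need some statistical correction.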


PHYLOGENY


When would phylogeny work?

Provided your sequences share reasonable homology & similarity:

a) Place the query sequence in its respective family (e.g. based on ADDA).

b) Get a reliable and consistent multiple sequence alignment, usually using progressive alignment, which is best suited for tree building and making phylogenetic inferences.

c) Adjust your alignment manually. When it comes to phylogeny, there is no strict definition of a “good” alignment.

d) Choose an appropriate phylogenetic method.


There are a number of phylogenetic packages…

• Clustal W/X (quick and dirty tree)

• MEGA (integrated package with an intuitive interface)

• Phylip (arguably the most popular phylogenetic tool)

• Alibee (automated improvement over Clustal alignment and subsequent tree building)

• PAUP

• Beast and MrBayes (Bayesian inference of phylogeny)

• Bete


Some basics…

Most phylogenetic methods assume that each position in a sequence can change independently from the other positions.

Gaps in alignments represent mutations in sequences such as insertions, deletions and genetic rearrangements.

Gaps are treated in various ways by the phylogenetic methods. Most of them ignore gaps.


METHODS (Maximum Likelihood)

In this method, the bases (nucleotides or amino acids) of all sequences at each site are considered separately (as independent), and the log-likelihood of having these bases is computed for a given topology by using a particular probability model.

The log-likelihoods are added over all sites, and their sum is maximized to estimate the branch lengths of the tree.

This procedure is repeated for all possible topologies, and the topology that shows the highest likelihood is chosen as the final tree.

Notes:

• ML estimates the branch lengths of the final tree.

• ML methods are usually consistent.

• ML can be extended to allow differences between the rates of transition and transversion.

Drawback: it needs a long computation time to construct a tree.
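The core step, summing per-site log-likelihoods and maximizing over a branch length, can be sketched for the simplest possible case: two nucleotide sequences joined by a single branch. The Jukes-Cantor (JC69) model and the grid search are assumptions chosen for brevity; a real ML program repeats this over many topologies with better optimizers. All names and sequences are illustrative.

```python
import math

def site_log_likelihood(x, y, d):
    """Log-likelihood of one aligned site under JC69 at distance d."""
    e = math.exp(-4.0 * d / 3.0)
    # Probability that base x is replaced by base y over distance d.
    p = 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e
    # 0.25 is the equilibrium frequency of the starting base.
    return math.log(0.25 * p)

def log_likelihood(seq1, seq2, d):
    """Sites are treated as independent, so their log-likelihoods add up."""
    return sum(site_log_likelihood(x, y, d) for x, y in zip(seq1, seq2))

def ml_distance(seq1, seq2, step=0.001, max_d=3.0):
    """Grid search for the branch length with the highest log-likelihood."""
    grid = [step * i for i in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: log_likelihood(seq1, seq2, d))

# Toy sequences (hypothetical): 2 differences out of 10 sites.
seq1 = "ACGTACGTAC"
seq2 = "ACGTACGTTT"
d_hat = ml_distance(seq1, seq2)

# Closed-form JC69 estimate for comparison: d = -3/4 * ln(1 - 4p/3).
p_diff = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
d_analytic = -0.75 * math.log(1.0 - 4.0 * p_diff / 3.0)
print(d_hat, d_analytic)
```

The grid estimate should agree with the closed-form JC69 distance to within the grid step, which is a useful sanity check on the likelihood function.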


METHODS (Maximum Parsimony)

Parsimony criterion: it consists of determining the minimum number of changes (substitutions) required to transform a sequence to its nearest neighbor.

Maximum parsimony: the algorithm searches for the minimum number of genetic events (nucleotide substitutions or amino acid changes) to infer the most parsimonious tree from a set of sequences.

The best tree is the one which needs the fewest changes.

Problems:

• Within practical computational limits, this often leads to the generation of tens or more "equally most parsimonious trees", which makes it difficult to justify the choice of a particular tree.

• Long computation time to construct a tree.
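Counting the minimum number of changes on a given tree can be sketched with the Fitch algorithm, a standard way to score a fixed topology under parsimony. The tree encoding (nested tuples of leaf names), the function names and the toy alignment are assumptions for illustration; a real program would also search over topologies.

```python
def fitch(tree, site_states):
    """Return (possible ancestral states, minimum changes) for one site.

    tree: a leaf name (str) or a pair (left, right) of subtrees.
    site_states: dict mapping leaf name to the residue at this site.
    """
    if isinstance(tree, str):
        return {site_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, site_states)
    right_set, right_cost = fitch(right, site_states)
    common = left_set & right_set
    if common:
        # The children can share a state: no extra change is needed here.
        return common, left_cost + right_cost
    # No shared state: one substitution is required on this branch pair.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all sites of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )

# Toy alignment (hypothetical) and the three rooted groupings of four taxa.
alignment = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CAT"}
trees = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "D"), ("B", "C")),
]
scores = [parsimony_score(t, alignment) for t in trees]
best_tree = trees[scores.index(min(scores))]
print(scores, best_tree)
```

For this toy data the grouping of A with B and C with D needs two changes, while the other two groupings need four each, so it is the most parsimonious tree.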


METHODS (Distance matrix)

Distance matrix methods (UPGMA, NJ, Fitch...) convert sequence data into a set of discrete pairwise distance values, arranged into a matrix. Distance methods fit a tree to this matrix.

The tree topology is constructed by using a cluster analysis method (like the UPGMA or NJ methods).

The phylogeny gives an estimate of the distance for each pair as the sum of branch lengths in the path from one sequence to another through the tree.

Advantages: easy to perform; quick calculation; fit for sequences having high similarity scores.

Drawbacks:

• The sequences are not considered as such (loss of information).

• All sites are generally treated equally (differences in substitution rates are not taken into account).

• Not applicable to distantly divergent sequences.
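A distance method can be sketched in two steps: build the pairwise distance matrix, then cluster it. The example below uses the uncorrected p-distance and a bare-bones UPGMA that returns only the topology (no branch lengths). Function names and sequences are assumptions for illustration, and real packages apply corrected distances and more careful tie handling.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of differing positions, ignoring columns with a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)

def upgma(seqs):
    """Cluster aligned sequences by UPGMA; returns a nested-tuple topology.

    seqs: dict mapping sequence name to aligned sequence.
    """
    clusters = {name: 1 for name in seqs}  # cluster -> number of leaves
    dist = {frozenset((a, b)): p_distance(seqs[a], seqs[b])
            for a, b in combinations(seqs, 2)}
    while len(clusters) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        for c in clusters:
            # New distance is the size-weighted average of the old distances.
            dist[frozenset((merged, c))] = (
                size_a * dist[frozenset((a, c))] + size_b * dist[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))

# Toy alignment (hypothetical).
seqs = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "TCGAACTT",
    "D": "TCGAACTA",
}
tree = upgma(seqs)
print(tree)
```

Once the matrix is built, the sequences themselves are no longer consulted, which is the loss of information noted in the drawbacks.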


Levels of functional annotation

• Gene product function

• Domain annotation for sequences

• Evolutionary hypothesis through similarity

• Functional motifs/signatures

• Sub-cellular localization


Sub-cellular localization (cont…)

Intracellular, extracellular, membrane related…

Proteins that sit on the inner or outer surface of the membrane are called extrinsic or peripheral, and have a large percentage of hydrophobic amino acids in the portion of the molecule that is close to the hydrophobic membrane structure.


The eukaryotic cell

Image source: http://ridge.icu.ac.jp/gen-ed/cell-lect-gifs/04-eucaryote-plant-cell.GIF

Top left: plant cell

Top right: animal cell

Bottom left: prokaryotes


Sub-cellular localization of proteins

Subcellular localization is a key functional characteristic of proteins. To perform a common physiological role, proteins must be localized in the same cellular compartment.

(plasma membrane… extracellular… cytoplasmic… mitochondrial… chloroplast… endoplasm… peroxisomal…)


Methods for predicting sub-cellular locations

Homology-based assignments: with growing sequence data, these become impractical as well as error-prone.

Machine learning combining amino acid properties/composition and sequence signals.

Applications in a biological context. For example, in a search for virulence factors of pathogenic bacteria or easily accessible entry points for pharmaceutical drugs, extracellular proteins are good candidates, while proteins at other subcellular locations may, at the beginning, not be considered for such purposes.


TargetP (Olof Emanuelsson, Henrik Nielsen, Søren Brunak, and Gunnar von Heijne)

Neural-network-based classifier for sub-cellular location in eukaryotes.

The further from the N-terminus the sequence starts, the less reliable the predictions are.


ProtComp/ProtCompB

A combination of a number of methods:

• neural network-based prediction;

• direct comparison with an updated base of homologous proteins of known localization;

• comparisons of tetramer distributions calculated for query and DB sequences;

• prediction of certain functional peptide sequences, such as signal peptides, signal-anchors, GPI-anchors, transit peptides of mitochondria and chloroplasts, and transmembrane segments;

• search for certain localization-specific motifs.


Sub-cellular locations:

Bacterial compartments: Cytoplasmic, Membrane, Periplasmic and Extracellular (Secreted).

Please note: there are separate server pages for a) Animals/fungi, b) Plants, c) Bacteria.


Interpretation

Prediction accuracy depends on various factors: the biological environment of compartments, known homologues, strength of signals/patterns etc.

Nucleus: 91%

Plasma Membrane: 100%

Extra-cellular: 86%

Cytoplasm: 88%

Mitochondria: 89%

Endoplasmic reticulum: 89%

Peroxisome: 91%

Lysosome: 100%

Golgi bodies: 91%


Interpretation (cont…)

The first check would of course be the prediction reliability statistics of the various compartments (previous slide).

Terminologies: ProtLocDB, LocDB, NeuralNets, Tetramers, Integral


Making sense out of the numbers

Neural Nets statistics are based on preferential weights, thus they should be looked at seriously if there are no statistical pointers from the other three sources.

If both neural networks and homology predictions point to the same compartment, this is a very reliable prediction.

In the case of NN, the predictions are more reliable if the second-best hit gets a much lower weight than the one with the highest probability.

NN should be the last option when there is no clear picture from Integral and other homology statistics.


Interpretation cont…

Rules of thumb:

• First look for supporting evidence from LocDB. This is the strongest evidence.

• Else, look at the Integral support statistics.

• If the Integral statistics are conflicting, look at information from ProtLocDB.

• In the absence of other evidence, you can use the weighted statistics from NN to make a hypothesis.

• In case NN and the others point towards the same compartment, it is obviously very strong evidence.
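The order of precedence described above can be written down as a small decision function. This is only a sketch of the rule of thumb, not part of ProtComp: the function name, the argument names and the convention of passing `None` for absent or conflicting evidence are all assumptions.

```python
def pick_localization(locdb=None, integral=None, protlocdb=None, nn_scores=None):
    """Return (compartment, evidence source, NN agrees?).

    locdb, integral, protlocdb: predicted compartment, or None when that
    source gives no call or a conflicting one (the caller decides this).
    nn_scores: dict mapping compartment to neural-network weight, or None.
    """
    nn_best = max(nn_scores, key=nn_scores.get) if nn_scores else None
    # Check the sources in decreasing order of strength.
    for source, call in (("LocDB", locdb),
                         ("Integral", integral),
                         ("ProtLocDB", protlocdb)):
        if call is not None:
            return call, source, call == nn_best
    if nn_best is not None:
        # Last resort: a hypothesis from the neural-network weights alone.
        return nn_best, "NeuralNets", True
    return None, None, False

# Hypothetical outputs for two query proteins.
with_integral = pick_localization(
    integral="Nucleus", nn_scores={"Nucleus": 2.1, "Cytoplasm": 0.4})
nn_only = pick_localization(
    nn_scores={"Mitochondria": 1.5, "Cytoplasm": 1.4})
print(with_integral, nn_only)
```

In the second call the two NN weights are close, so by the earlier slide the hypothesis should be treated with caution even though a compartment is returned.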


We recommend adjusting your alignment so that a reference sequence (a query sequence) has no gaps or deletions in the original alignment file. Some editing, in particular removing sequences with gaps, removing unknown residues and removing redundant sequences, can be done on the SRP server using the "Filter your alignment" page. We recommend removing sequences with more than 10-20% gaps. We also recommend removing sequences with similarity of 90-95% or higher to other sequences in the alignment. Sometimes the reference sequence may have long N- and C-termini with gappy columns. One can remove these gappy columns first by using the "remove column" button, and then adjust the rest of the alignment by removing gappy and redundant sequences.
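The gap and redundancy filters recommended above can also be applied offline. The sketch below is not the server's implementation; it uses percent identity as a stand-in for similarity, and the function names, thresholds and toy sequences are assumptions.

```python
def gap_fraction(seq):
    """Fraction of gap characters in an aligned sequence."""
    return seq.count("-") / len(seq)

def identity(a, b):
    """Fraction of identical residues over columns where neither has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x == y for x, y in pairs) / len(pairs) if pairs else 0.0

def filter_alignment(alignment, reference, max_gaps=0.2, max_identity=0.95):
    """Drop gappy and redundant sequences; the reference is always kept.

    alignment: dict mapping sequence name to aligned sequence.
    """
    kept = {reference: alignment[reference]}
    for name, seq in alignment.items():
        if name == reference:
            continue
        if gap_fraction(seq) > max_gaps:
            continue  # too many gaps
        if any(identity(seq, other) >= max_identity for other in kept.values()):
            continue  # nearly identical to a sequence already kept
        kept[name] = seq
    return kept

# Toy alignment (hypothetical).
alignment = {
    "query":   "MKTAYIAKQR",
    "dup":     "MKTAYIAKQR",
    "gappy":   "MK---IA-QR",
    "homolog": "MRTGYLAKQK",
}
filtered = filter_alignment(alignment, reference="query")
print(list(filtered))
```

The duplicate and the gappy sequence are dropped, leaving the query and one informative homolog.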

http://consurf.tau.ac.il/results/1164197646/index.htm