75
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Master Course Sequence Alignment Lecture 9 Database searching (3)

Master Course Sequence Alignment Lecture 9 Database searching (3)

Embed Size (px)

DESCRIPTION

C. E. N. T. E. R. F. O. R. I. N. T. E. G. R. A. T. I. V. E. B. I. O. I. N. F. O. R. M. A. T. I. C. S. V. U. Master Course Sequence Alignment Lecture 9 Database searching (3). Dot-plots a simple way to visualise sequence similarity. Filter: - PowerPoint PPT Presentation

Citation preview

Page 1: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

CENTR

FORINTEGRATIVE

BIOINFORMATICSVU

E

Master Course Sequence Alignment

Lecture 9

Database searching (3)

Page 2: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Dot-plotsDot-plotsa simple way to a simple way to

visualise sequence visualise sequence similaritysimilarity

Can be a bit messy, though...Filter: 6/10 residues have to match...

Page 3: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Dot-plots, what about...Dot-plots, what about...

• Insertions/deletions -- DNA and proteins

• Duplications (e.g. tandem repeats) – DNA and proteins

• Inversions -- DNA Dot plots are calculated using a diagonal window of preset length that is slid through the search matrix --typically the central cell holds the window score (e.g. sum, average)

Page 4: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Dot-plots, self-comparisonDot-plots, self-comparison

Direct repeatDirect repeat

Tandem repeatTandem repeat

Inverted repeatInverted repeat

Page 5: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 6: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 7: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

charge

Page 8: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 9: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

(cysteine bridge)

Page 10: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 11: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

PRIMARY STRUCTURE (amino acid sequence)

QUATERNARY STRUCTURE (oligomers)

SECONDARY STRUCTURE (helices, strands)

TERTIARY STRUCTURE (fold)

Protein structure hierarchical levelsProtein structure hierarchical levels

Page 12: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 13: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Globin fold proteinmyoglobinPDB: 1MBN

Helices are labelled ‘A’ (blue) to ‘H’ (red). D helix can be missing in some globins: what happens with the alignment?

Page 14: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

sandwich proteinimmunoglobulinPDB: 7FAB

Page 15: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

TIM barrel / proteinTriosephosphate IsoMerasePDB: 1TIM

Page 16: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Pyruvate kinasePhosphotransferase

barrel regulatory domain

barrel catalytic substrate binding domain

nucleotide binding domain

Page 17: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

What does this mean for What does this mean for alignment?alignment?

• Alignments need to be able to skip secondary structural elements to complete domains (i.e. putting gaps opposite these motifs in the shorter sequence).

• Depending on gap penalties chosen, the algorithm might have difficulty with making such long gaps (for example when using high affine gap penalties), resulting in incorrect alignment.

Page 18: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

What does this mean for What does this mean for homology searching?homology searching?

• Database searching algorithms just need to decide if the alignment score is good enough for inferring homology

• Sometimes, alignments can be incorrect but the score can be close enough for the database searching method to correctly identify the DB sequence as a homolog (or not)

• However, for distant hits alignments become crucial

Page 19: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 20: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Sequence Analysis/Database SearchingFinding relationships between genes and gene products of different species, including those at large evolutionary distances

Page 21: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 22: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 23: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 24: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Compared to the preceding plot, RMSD is better able to pin-point relationships between more divergent sequences (RMSD stays relatively small for a longer time as compared to PAM distance) – Structure more conserved than sequence. Note that the spread around RMSD is larger

Page 25: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Structural superpositioningStructural superpositioning

RMSD: how far are equivalenced Cα atoms separated on average?

Page 26: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

C5 anaphylatoxin -- human (PDB code 1kjs) and pig (1c5a)) proteins are superposed

Two superposed protein structures with two well-superposed helices

Red: well superposed

Blue: low match quality

Page 27: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 28: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

How to assess homology search How to assess homology search methodsmethods

• We need an annotated database, so we know which sequences belong to what homologous (super)families

• Examples of databases of homologous families are PFAM, Homstrad or Astral

• The idea is to take a protein sequence from a given homologous family, then run the search method, and then assess how well the method has carried out the search

• This should be repeated for many query sequences and then the overall performance can be measured

Page 29: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

C; family: zinc finger -- CCHH-type C; class: small C; reordered by kitschorder 1.0a C; reordered by kitschorder 1.0a C; last update 7/9/98 >P1;1zaa1 structureX:1zaa: 3 :C: 33 :C:zinc-finger (ZIF268, domain 1):Mus musculus:2.10:18.20 ------RPYACPVESCDRRFSRSDELTRHI-RI-HTGQK* >P1;1zaa2 structureX:1zaa: 34 :C: 61 :C:zinc-finger (ZIF268, domain 2):Mus musculus:2.10:18.20 -------PFQCRI--CMRNFSRSDHLTTHI-RT-HTGEK* >P1;1zaa3 structureX:1zaa: 62 :C: 87 :C:zinc-finger (ZIF268, domain 3):Mus musculus:2.10:18.20 -------PFACDI--CGRKFARSDERKRHT-KI-HLR--* >P1;1ard structureN:1ard: 102 : : 130 : :zinc-finger (transcription factor ADR1):Saccharomyces cerevisiae:-1.00:-1.00 ------RSFVCEV--CTRAFARQEHLKRHY-RS-HTNEK* >P1;1znf structureN:1znf: 1 : : 25 : :zinc-finger (XFIN, 31st domain):Xenopus laevis:-1.00:-1.00 --------YKCGL--CERSFVEKSALSRHQ-RV-HKN--* >P1;2drp2 structureX:2drp: 137 :A: 165:A:zinc-finger (tramtrack, domain 2):Drosophila melanogaster:2.80:19.30 ----NVKVYPCPF--CFKEFTRKDNMTAHV-KIIHK---* >P1;3znf structureN:3znf: 1 : : 30 : :zinc-finger (enhancer binding protein):Homo sapiens:-1.00:-1.00 ------RPYHCSY--CNFSFKTKGNLTKHMKSKAHSKK-* >P1;5znf structureN:5znf: 1 : : 30 : :zinc-finger (ZFY-6T):Homo sapiens:-1.00:-1.00 ------KTYQCQY--CEYRSADSSNLKTHIKTK-HSKEK*

ExampleYou can also look at superposed structures..

Page 30: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Sequence searchingSequence searchingQUERY

DATABASE

True Positive

True Negative

True Positive

False Positive

True Negative False Negative

T

POSITIVES

NEGATIVES

Page 31: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

So what have we gotSo what have we got

TP

TN

FP

FN

Observed

Pre

dic

ted

P

P

N

N

Page 32: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Sensitivity and Specificity – medical worldSensitivity and Specificity – medical world 

 +  -

 Test

 +

9990 True

Positive(TP)

990 False

Positive(FP)

 All with Positive Test

TP+FP

 Positive Predictive Value=

TP/(TP+FP)9990/(9990+990)

=91%

 -

10 False

Negative(FN)

989,010 True

Negative(TN)

 All with Negative Test

FN+TN

 Negative Predictive Value=

TN/(FN+TN)989,010/(10+989,0

10)=99.999%

 

 All with Disease10,000

 All without Disease999,000

Everyone=TP+FP+FN+TN

 Sensitivity=TP/

(TP+FN)

9990/(9990+1

0)

 Specificity=TN/(FP+TN)

989,010/(989,010+99

0)

Pre-Test Probability=(TP+FN)/(TP+FP+FN+TN)(in this case = prevalence)

10,000/1,000,000 = 1%

Page 33: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Receiver Operator Curve Receiver Operator Curve (ROC)(ROC)

• Plot Sensitivity (TP/(TP+FN)) against 1-Specificity (1 - TN/(FP+TN)), where the latter is called error

Error = 1 - specificity

Sen

siti

vity

Sensitivity is also called Coverage

Page 34: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 35: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 36: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 37: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Database Search Algorithms:Database Search Algorithms:Sensitivity, SelectivitySensitivity, Selectivity

• Sensitivity – the ability to detect weak similarities between sequences (often due to long evolutionary separation). Increasing sensitivity reduces false negatives, i.e. those database sequences similar to the query, but rejected.

Sensitivity (or Coverage) = TP / (TP+FN)

• Selectivity – the ability to screen out similarities due to chance. Increasing selectivity reduces false positives, those sequences recognized as similar when they are not. Selectivity (or Positive Prediction Value) = TP / (TP + FP)

• Specificity also describes the ability of the method to select proper hitsSpecificity = TN / (TN + FP)

SensitivitySensitivity

Selectivity, SpecificitySelectivity, Specificity

Courtesy of Gary Benson (ISSCB 2003)

Page 38: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 39: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 40: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 41: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

COG – Cluster of Orthologous COG – Cluster of Orthologous GroupsGroups

•Orthologues found using bi-directional best hit searching with PSI-BLAST

•All COG family members are supposed to have the same function

•Searching with an unknown sequence only needs to hit a single member of a COG family, annotation can then be transferred

COG2813

http://www.ncbi.nlm.nih.gov/COG/

Page 42: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Structure-based function predictionStructure-based function prediction • SCOP (http://scop.berkeley.edu/) is a protein structure

classification database where proteins are grouped into a hierarchy of families, superfamilies, folds and classes, based on their structural and functional similarities

Page 43: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Structure-based function predictionStructure-based function prediction• SCOP hierarchy – the top level: 11 classes

Page 44: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Structure-based function predictionStructure-based function predictionAll-alpha protein

Coiled-coil proteinAll-beta protein

Alpha-beta proteinmembrane protein

Page 45: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Structure-based function predictionStructure-based function prediction• SCOP hierarchy – the second level: 800 folds

Page 46: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Structure-based function predictionStructure-based function prediction• SCOP hierarchy - third level: 1294 superfamilies

Page 47: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Structure-based function predictionStructure-based function prediction• SCOP hierarchy - third level: 2327 families

Page 48: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Structure-based function predictionStructure-based function prediction• Using sequence-structure alignment method, one can predict a

protein belongs to a – SCOP family, superfamily or fold

• Proteins predicted to be in the same SCOP family are orthologous

• Proteins predicted to be in the same SCOP superfamily are homologous

• Proteins predicted to be in the same SCOP fold are structurally analogous

folds

superfamilies

families

Page 49: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 50: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Profile wander

Page 51: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

A B

B C

C D

Page 52: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Multi-domainMulti-domain Proteins (cont.)Proteins (cont.)• A common conserved protein domain such as the

tyrosine kinase domain can obscure weak but relevant matches to other domain types (e.g. only appearing after 5000 kinase hits)

• Sequences containing low-complexity regions, such as coiled coils and transmembrane regions, can cause an explosion of the search rather than convergence because of the absence of any strong sequence signals.

• Conversely, some searches may lead to premature convergence; this occurs when the PSSM is too strict only allowing matches to very similar proteins, i.e., sequences with the same domain organization as the query are detected but no homologues with different domain combinations.

Page 53: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Multi-domainMulti-domain Proteins - DOMAINATIONProteins - DOMAINATION

George R.A. and Heringa J. (2002) Protein domain identification and improved sequence similarity searching using PSI-BLAST, Proteins: Struct. Func. Gen. 48, 672-681.

Iterate PSI-BLAST searches and domain delineation

DOMAINATION uses sequence signals to identify domain boundaries

Page 54: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Multi-domainMulti-domain Proteins – DOMAINATION Proteins – DOMAINATION method method

query

Strategy: Combine C- and N-termini of local alignments to delineate domain boundaries

Count start and stops of alignments

P(b

ou

nd

ary)

Page 55: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

DOMAINATION: Identifying domain boundaries

Sum N- and C-termini ofgapped local alignments

True N- and C- termini are counted twice (within 10 residues)

Boundaries are smoothed using twowindows (15 residues long)

Combine scores using biased protocol:

if Ni x Ci = 0then Si = Ni + Cielse Si = Ni + Ci +(Ni x Ci)/(Ni + Ci)

Page 56: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

DOMAINATION: identifying domain deletions

• Deletions in the query (or insertion in the DB sequences) are identified by– two adjacent segments in the query align to the

same DB sequences (>70% overlap), which have a region of >35 residues not aligned to the query. (remove N- and C- termini)

DBQuery

Page 57: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

DOMAINATION: identifying domain permutations

• A domain shuffling event is declared – when two local alignments (>35 residues)

within a single DB sequence match two separate segments in the query (>70% overlap), but have a different sequential order.

DB

Query

b a

a b

Page 58: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

DOMAINATION: identifying continuous and discontinuous domains

•Each segment is assigned an independence score (In). If In>10% the segment is assigned as a continuous domain.•An association score is calculated between non-adjacent fragments by assessing the shared sequence hits to the segments. If score > 50% then segments are considered asdiscontinuous domains and joined.

Page 59: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 60: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Low Complexity segmentsLow Complexity segments

• A sequence of L residues of N types can have L!/N na! different sequences of that same composition, where the composition vector = (n1,.., na,.., N) and N na! = n1! * n2! * .. * nN!

• If Rc is a vector of length N, where the vector numbers correspond to the number of residues with a given frequency (e.g. there are 5 amino acid types with 0 abundance, 3 amino acid types with abundance 1, etc., in the sequence), then the total number of distinct sequences corresponding to a particular complexity state-vector is (L! / N na!) * (N! / L rc!), where L rc! = r0! * r1! * .. * rL-1! * rL!

• Based on this, the final complexity score calculated by the SEG program is PSEG = (1/NL) * (L! / N na!) * (N! / L rc!)

Page 61: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

DOMAINATION: Post-processing low complexity regions in database sequences

Remove local fragments with > 15% LC

Page 62: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 63: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 64: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 65: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Conserved hypotheticals>P00001 Conserved hypotheticalA substantial fraction of genes in sequenced genomes encodes 'conserved hypothetical' proteins, i.e. those that are found in organisms from several phylogenetic lineages but have not been functionally characterized.

Page 66: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Profile wander (or matrix migration)Profile wander (or matrix migration)

• Permissive iterative searching user higher E-values can lead to incorrect hits (false positives) that become included into the profile. More incorrect hits can then be added in subsequent iterations, and true homologues can be lost. Also, the search can explode, leading to large numbers of spurious hits.

• A further loss of information can be incurred with PSIBLAST, because PSI-BLAST PSSMs are trimmed to only use the highest scoring region in a search, ignoring less conserved regions

Page 67: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 68: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 69: Master Course  Sequence Alignment  Lecture  9 Database searching (3)

Sequence identity scoring zonesSequence identity scoring zones

• >25-30%: homology zone

• 15-25%: twilight zone

• <15%: midnight zone (Rost, 1999)

Is midnight zone properly definable?

Page 70: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 71: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 72: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 73: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 74: Master Course  Sequence Alignment  Lecture  9 Database searching (3)
Page 75: Master Course  Sequence Alignment  Lecture  9 Database searching (3)