49
Sequence Alignments and Database Searching 08/20/07

Sequence Alignments and Database Searching 08/20/07

  • Upload
    radley

  • View
    45

  • Download
    1

Embed Size (px)

DESCRIPTION

Sequence Alignments and Database Searching 08/20/07. Protein A of interest to you. ornithine decarboxylase?. Why compare protein sequences?. Significant sequence similarities allow associations based upon known functions. Homology vs. similarity. - PowerPoint PPT Presentation

Citation preview

Page 1: Sequence Alignments and Database Searching 08/20/07

Sequence Alignments and Database Searching08/20/07

Page 2: Sequence Alignments and Database Searching 08/20/07

Why compare protein sequences?

Significant sequence similarities allow associations based upon

known functions.

Protein A of interest to you.

ornithine decarboxylase?

Page 3: Sequence Alignments and Database Searching 08/20/07

Extracted from ISMB2000 tutorial,WR Pearson, U. of Virginia

Possible for proteins to possess high sequence identity between segments and not be homologous

In this example, cytochrome c4, has reasonably high sequence similarity with trypsins, yet does not have common ancestor, nor common fold.

Also, subtilisin has same spatial arrangement of active site residues, but is not related to trypsins

Homology vs. similarity

Page 4: Sequence Alignments and Database Searching 08/20/07

Homologous proteins always share a common three-dimensional fold, often with common active or binding site.

Proteins that share a common ancestor are homologous.

Proteins that possess >25% identity across entire length generally will be homologous.

Proteins with <20% identity are not necessarily homologous

Homology vs. similarity

Page 5: Sequence Alignments and Database Searching 08/20/07

Homologous sequences are either: 1) orthologous, or 2) paralogous

•For orthologs - sequence divergence and evolutionary relationships will agree.•For paralogs - no necessary linkage between sequence divergence and speciation.

Orthologous cyctochrome c isozymes

Hemoglobins contain both orthologs and paralogs

Orthologs - sequence differences arises from divergence in different species (i.e. cyctochrome c) Paralogs - sequence differences arise after gene duplication within a given species (i.e. GPCRs, hemoglobins)

Extracted from ISMB2000 tutorial,WR Pearson, U. of Virginia

Page 6: Sequence Alignments and Database Searching 08/20/07

We’ve all seen and/or used sequence alignments, but howare they accomplished?

Sequence searches and alignments using DNA/RNA are usually not asinformative as searches and alignments using protein sequences. However.DNA/RNA searches are intuitively easier to understand:

AGGCTTAGCAAA........TCAGGGCCTAATGCG|||||||| ||| ||||||||||| |||AGGCTTAGGAAACTTCCTAGTCAGGGCCTAAAGCG

The above alignment could be scored giving a “1” for each identical nucleotide,A zero for a mismatch, and a -4 for “opening a “gap” and a -1 for each extensionof the gap. So score = 25 – 11= 14

Page 7: Sequence Alignments and Database Searching 08/20/07

Protein sequence alignments are much more complicated. How would this alignment be scored?

ARDTGQEPSSFWNLILMY.........DSCVIVHKKMSLEIRVH| | | | | ||| | | || |||AKKSAEQPTSYWDIVILYESTDKNDSGDSCTLVKKRMSIQLRVH

Unlike nucleotide sequence alignments, which are either identical ornot identical at a given position, protein sequence alignments include“shades of grey” where one might acknowledge that a T is sort of equivalent to an S etc. But how equivalent? What number would youassign to an S-T mismatch? And what about gaps? Since alanine isa common amino acid, couldn’t the A-A match be by chance? SinceTrp and Cys are uncommon, should those matches be given higherscores?

Do you see that accurately aligning sequences and accuratelyfinding related sequences are the same problem?

Page 8: Sequence Alignments and Database Searching 08/20/07

Databases

Nucleotide: GenBank (NCBI), EMBL, DDBJ

Protein: SwissProt, TrEMBL, GenPept(GenBank)

Huge databases – share much information. Many entries linked to other databases (e.g. PDB). SwissProt small but well “curated”. NCBI non-redundant(nr) protein sequence database is very large but sometimes confusing.

These databases can be searched in a number of ways. Can search only human or metazoan sequences. Can eliminate entries made before a givenDate. Etc.

Page 9: Sequence Alignments and Database Searching 08/20/07

Protein Sequence Records from PRF A series of digits (often six or seven)followed by a letter, e.g.:

1901178A

RefSeq Nucleotide Sequence Records Two letters, an underscore bar, and six digits, e.g.:mRNA records (NM_*):NM_000492genomic DNA contigs (NT_*):NT_000347 complete genome or chromosome (NC_*):NT_000907 genomic region (NG_*):

NG000019

Type of Record Sample accession format

GenBank/EMBL/DDBJ Nucleotide

SequenceRecords One letter followed by five digits, e.g.:U12345Two letters followed by six digits, e.g.:

AY123456, AF123456 GenPept Sequence Records(which contain the amino acid translations from GenBank/EMBL/DDBJ records that have a coding region feature annotated on them)

Three letters and five digits, e.g.:

AAA12345 Protein Sequence Records from SWISS-PROT

and PIR All are six characters:Character/Format1 [O,P,Q]2 [0-9]3 [A-Z,0-9]4 [A-Z,0-9]5 [A-Z,0-9]6 [0-9]e.g.:P12345 and Q9JJS7

What do all those numbers mean?

NC

BI

Page 10: Sequence Alignments and Database Searching 08/20/07

RefSeq Protein Sequence Records Two letters (NP), an underscore bar, and six digits, e.g.:

NP_000483 RefSeq Model (predicted) Sequence Records from the Human Genome annotation process

Two letters (XM, XP, or XT), an underscore bar, and six digits, e.g.:

XM_000583 Protein Structure Records PDB accessions generally contain one digit

followed by three letters, e.g.:1TUPMMDB ID numbers generally contain four digits, e.g.:3973The record for the Tumor Suppressor P53 Complexed With DNA can be retrieved by either number above

Continued….

GI numbers:a series of digits that are assigned consecutively by NCBI to each sequence it processes. Version numbers:consist of the accession number followed by a dot and a version number.

Nucleotide sequence: GI: 6995995VERSION: NM_000492.2

Protein translation: GI: 6995996VERSION: NP_000483.2

>gi|897557|gb|AAA98443.1| TIAM1 protein

NC

BI

http://www.ornl.gov/sci/techresources/Human_Genome/posters/chromosome/geneguide.shtml

Page 11: Sequence Alignments and Database Searching 08/20/07

We’ve got the data, now how do we score/search?First, we need a way to assign numbers to “shadesof grey” matches.

Genetic code scoring system – This assumes that changes in proteinsequence arise from mutations. If only one point mutation is neededto change a given AA to another (at a specific position in alignment),the two amino-acids are more closely related than if two point mutationswere required.

Physicochemical scoring system – a Thr is like a Ser, a Trp is not likean Ala……

These systems are seldom used because they have problems. Why try to second guess Nature? Since there are many related sequences out there, we can look at some (trusted) alignments to SEE which sub-stitutions have occurred and the frequency with which they occur.

Page 12: Sequence Alignments and Database Searching 08/20/07

Observed substitution scoring schemes

PAM (percent acceptable mutation) matrices are derived from studyingglobal alignments of well-characterized protein families. Use 1% residuechange (short evolutionary distance) to get PAM1 matrix. Raise this to 250 power to get 250% change (greater evolutionary distance). Therefore a PAM 30 would be used to analyze more closely related proteins, PAM 400 is used for finding and analyzing distantly related proteins.

Block substitution matrices (BLOSUM) are derived from studying local alignments (blocks) of sequences from related proteins. In other words,one might use the portions of aligned sequences from related proteinsthat have >62% identity (in the portions or blocks) to derive the BLOSUM62 scoring matrix. One might use only the blocks that have >80% identityto derive the BLOSUM 80 matrix.

Page 13: Sequence Alignments and Database Searching 08/20/07

Extracted from ISMB2000 tutorial,WR Pearson, U. of VirginiaPAM250 matrix

Note that for identical matches, scores vary depending upon observed frequencies. That is, rare amino acid (i.e. Trp) that are not substituted have high scores; frequently occuring amino acids (i.e. Ala) are down-weighted because of the high probability of aligning by chance.

Amino acid substitution matrices

•Negative scores - unlikely substitutions

Page 14: Sequence Alignments and Database Searching 08/20/07

In general, Blosum62 matrix is more accurate than PAM.However, should be aware that search performance will depend on underlying matrix

Q. Which are more divergent: PAM120 or PAM250; Blosum45 or Blosum62?

PNAS 89, 10915 (1992).

Amino acid substitution matrices

Page 15: Sequence Alignments and Database Searching 08/20/07

Gap penalties – Intuitively one recognizes that there should be a penaltyfor introducing (requiring) a gap during identification/alignment of a givensequence. But if two sequences are related, the gaps may well be locatedIn loop regions which are more tolerant of mutational events and probablyhave little impact on structure. Therefore, a new gap should be penalized, but extending an existing gap should be penalized very little.

Filtering – many proteins and nucleotides contain simple repeats or regions of low sequence complexity. These must be excluded from searches and alignments. Why?

Significance of a “hit” during a search - More important than an arbitraryscore is an estimation of the likelihood of finding a hit through pure chance. Ergo the “Expectation value” or E-value. E-values can be as low as 10-70.

Page 16: Sequence Alignments and Database Searching 08/20/07

* - Similarity score distribution expected based upon random chance using given searched database.

= - distribution of normalized similarity scores (observed) for a search using a proton ATPase against the samedatabase.

Statistics

Page 17: Sequence Alignments and Database Searching 08/20/07

E-valueSo, for sufficiently large databases (so can apply statistics):

E = Kmne-S

m- query lengthn - database lengthE - expectation valueK - scale factor for search space (database) - scale factor for scoring systemS - score, dependent on substitution matrix, gap-penalties, etc.

Doubling either sequence string doubles number of sequences with a given expectation value; similarly, double the score and expectation value decreases exponentiallyExpectation value - probability that given score will occur by chance given the query AND database strings

Page 18: Sequence Alignments and Database Searching 08/20/07

Extracted from ISMB2000 tutorial,WR Pearson, U. of Virginia

Must account for increases in similarity score due to increase in sequence length searched.

Statistics

Page 19: Sequence Alignments and Database Searching 08/20/07

1) Break query up into “words” e.g. ASTGHKDLLV AST

WORDS STG TGH2) Generate expanded list of words that would match with (i.e. PAM250) a score of at least T – You’re acknowledging that you may not have any

exact matches with original list of words.

3) Use expanded list of words to search database for exact matches.

4) Extend alignments from where word(s) found exact match.

Basic local alignment search tool (BLAST)

Heuristic algorithm – Uses guesses. Increases speed without a greatloss of accuracy (BLASTP, FASTA (local Hueristic), S-W local rigorous,Needleman-Wunsch global, rigorous)

Page 20: Sequence Alignments and Database Searching 08/20/07

Extracted from ISMB2000 tutorial,WR Pearson, U. of Virginia

Global scores require alignment of entire sequence length.Cannot be used to detect relationships between

domains in mosaic proteins.

Global versus local alignments

Local alignments are necessary to detect domains within mosaic proteins, internal duplications.

Page 21: Sequence Alignments and Database Searching 08/20/07

Pictorial representation of BLAST algorithm.

Query sequence

Words (they overlap)

Expand list of words

Search database, find exact hits, extend alignments

Report sorted list of hits

Page 22: Sequence Alignments and Database Searching 08/20/07

Nucleotide BLAST looks for exact matches

Protein BLAST requires two hits

GTQITVEDLFYNI

SEI YYN

ATCGCCATGCTTAATTGGGCTT

CATGCTTAATT

neighborhood words

exact word match

one hit

two hits

NCBI

BLAST

Page 23: Sequence Alignments and Database Searching 08/20/07

FASTA

Instead of breaking up query into words (and then generating a list of similar words), find all sequences in the database that containshort sequences that are exact or nearly exact matches for sequenceswithin the query. Score these and sort. Sort of reverse methodology toBLAST

Que

ry s

eque

nce

Database sequence

Page 24: Sequence Alignments and Database Searching 08/20/07
Page 25: Sequence Alignments and Database Searching 08/20/07

Protein database

Page 26: Sequence Alignments and Database Searching 08/20/07
Page 27: Sequence Alignments and Database Searching 08/20/07

mouse over

Page 28: Sequence Alignments and Database Searching 08/20/07

link to entrez

sorted by e values

5 X 10-98

LocusLink

Page 29: Sequence Alignments and Database Searching 08/20/07

Low complexity filter

Page 30: Sequence Alignments and Database Searching 08/20/07

Identifying distant homologies (use several different query sequences)

Examine output carefully. A lack of statistical significance doesn’t necessarily mean a lack of homology!

Extracted from ISMB2000 tutorial,WR Pearson, U. of Virginia

Also remember - If A is homologous to B, and B to C, then A should be homologous to C

Page 31: Sequence Alignments and Database Searching 08/20/07

PSI-BLAST

Very sensitive, but must not include a non-member sequence!

1) Regular BLAST search2) Sequences above a certain threshold (< specified E-value) are included. Assumed to be related proteins. This group of sequences is used to define a “profile” that contains the essence of the “family”.3) Now with the important sequence positions highlighted, can look for more distantly related sequences that should still have the essence of the protein family.4) Inclusion of more distantly related sequences modifies the profile further (further defines the essence) and allows for identification of even more distantly related sequences. Etc.

Note: PSI-BLAST may find and then subsequently lose a homologous sequence during the iteration process! “Drifting” of the program, would be the gradual loss of close homologs during the iteration process.

Page 32: Sequence Alignments and Database Searching 08/20/07

Active site serineWeakly conserved serine

Position specific scoring matrix (PSSM)(learning from your “hits”)

Page 33: Sequence Alignments and Database Searching 08/20/07

A R N D C Q E G H I L K M F P S T W Y V 206 D 0 -2 0 2 -4 2 4 -4 -3 -5 -4 0 -2 -6 1 0 -1 -6 -4 -1 207 G -2 -1 0 -2 -4 -3 -3 6 -4 -5 -5 0 -2 -3 -2 -2 -1 0 -6 -5 208 V -1 1 -3 -3 -5 -1 -2 6 -1 -4 -5 1 -5 -6 -4 0 -2 -6 -4 -2 209 I -3 3 -3 -4 -6 0 -1 -4 -1 2 -4 6 -2 -5 -5 -3 0 -1 -4 0 210 S -2 -5 0 8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5 1 -3 -7 -5 -6 211 S 4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1 4 3 -6 -5 -3 212 C -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5 0 -7 -4 -4 -5 0 -4 213 N -2 0 2 -1 -6 7 0 -2 0 -6 -4 2 0 -2 -5 -1 -3 -3 -4 -3 214 G -2 -3 -3 -4 -4 -4 -5 7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6 215 D -5 -5 -2 9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7 216 S -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4 7 -2 -6 -5 -5 217 G -3 -6 -4 -5 -6 -5 -6 8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7 218 G -3 -6 -4 -5 -6 -5 -6 8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7 219 P -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7 9 -4 -4 -7 -7 -6 220 L -4 -6 -7 -7 -5 -5 -6 -7 0 -1 6 -6 1 0 -6 -6 -5 -5 -4 0 221 N -1 -6 0 -6 -4 -4 -6 -6 -1 3 0 -5 4 -3 -6 -2 -1 -6 -1 6 222 C 0 -4 -5 -5 10 -2 -5 -5 1 -1 -1 -5 0 -1 -4 -1 0 -5 0 0 223 Q 0 1 4 2 -5 2 0 0 0 -4 -2 1 0 0 0 -1 -1 -3 -3 -4 224 A -1 -1 1 3 -4 -1 1 4 -3 -4 -3 -1 -2 -2 -3 0 -2 -2 -2 -3

Serine scored differently in these two positions

Active site nucleophile

Position specific scoring matrix (PSSM)

Page 34: Sequence Alignments and Database Searching 08/20/07

>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE AMINOHMAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFLAKFDYYVIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDLVNQGLQEQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYEGAVKNGRTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAWDPKTTHVRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKKELLERLY

e value cutoff for PSSM

PSI-BLAST: initial run

Page 35: Sequence Alignments and Database Searching 08/20/07

PSI-BLAST: initial run NCBI

Page 36: Sequence Alignments and Database Searching 08/20/07

Other purine nucleotide metabolizing enzymes not found by ordinary BLAST

PSI-BLAST: first PSSM search

Page 37: Sequence Alignments and Database Searching 08/20/07

iteration 1

iteration 2

PSI-Blast ofhuman Tiam1

PSI-BLAST: importance of original query (remember, if A is like B….)

Page 38: Sequence Alignments and Database Searching 08/20/07

iteration 2

iteration 1

iteration 3

Ras-binding domains

PSI-Blast ofmouse Tiam2 (~90% identity with human Tiam1)

PSI-BLAST: importance of original query

Page 39: Sequence Alignments and Database Searching 08/20/07

Three-dimensional Position SpecificScoring Matrix (3D-PSSM)

Extremely sensitive, but the structure of a homolog must exist!Uses a Library of structures that represent all the known folds*and a non-redundant sequence database.

Preparing the 3D-PSSM database

1) 1D-PSSM generation. For every entry in the Library of structures, perform 20 iteration of PSI-BLAST against the NR database. Use E-value cutoff of 0.0005. Keep intermediate results from 1st through20th iteration. Recombine these intermediates at the end. Generate a PSSM (1D-PSSM) from the results.

(A 1D-PSSM for a protein of length L will have dimensions L X 20 )

2) For each Library entry, assign 2ndry structure (Helix, Strand, Coil)

Page 40: Sequence Alignments and Database Searching 08/20/07

3) Perform 3D structural superposition between each entry in the Library and all other members of its fold superfamily. Use cutoff criteria. Use

the “residue equivalencies” from the superpositions to augment the 1D-PSSMs for Library members in a given superfamily. (Key here is thatstructural alignment reduces possibility of miss-alignment of sequences).

4) Use the structural info from the whole Library to assign “solvation potentials” for each residue type. e.g. Alanines with only 5%solvent exposure are seen 122 times. The total number of residuesIn the Library with 5% exposure is 3246. So the solvation potentialwould be 122/3246=0.038 for an alanine with 5% exposure. Do this for Ala at 10%, 15%, …95%, 100%. Do for all 20 AAs.

Enter query sequence

5) Use Psi-BLAST to generate 1D-PSSM for query (nr database)

6) Perform 2ndry structure prediction for query

Page 41: Sequence Alignments and Database Searching 08/20/07

7) Align the query sequence against each member of the Library using a “3 pass” approach:

I) query is aligned against Library member using the 1D-PSSM of the Library entry2) query is aligned to the 3D-PSSM of the Library entry3) Library entry is aligned to the query’s 1D-PSSM

During these procedures the 2ndry structure matching and solvation potentials are being used but are constant. The highest scoring of the 3 passes is taken as the final result.

So, how good is 3D-PSSM?

Page 42: Sequence Alignments and Database Searching 08/20/07

Three papers report the initial characterization of PLC-e(what, no PH domain???)

JBC, 276, 2758 (2001)EMBO J 20, 743 (2001)JBC 276, 2752 (2001)

Page 43: Sequence Alignments and Database Searching 08/20/07

A fourth paper quickly follows….(PLC-es share architecture of PLC-b isozymes. How’d they do that???)

Wing M. et al., JBC 2002

Page 44: Sequence Alignments and Database Searching 08/20/07

3D-PSSM

PLCsequence entered as query

Page 45: Sequence Alignments and Database Searching 08/20/07

3D-PSSM

PDB entry(for existing structure)

Sequence alignment(between query and existing structure)

3-D model(with sidechains!)

Expectation value

Fold

Page 46: Sequence Alignments and Database Searching 08/20/07

A very simple HMM for a protein with 4 amino acids

The square boxes are called “match states” – these will emit a amino acid with a set probability for each AA. Diamond boxes are for insertions between match states, and the circles are for deletions.

Not only are there emission probabilities for the set and insert states,there are probabilities for the transitions between states. There aremany possible paths through the Model!

Random transitions through the Model and emissions from the statesare guided by probabilities. All you see at the end is the generated sequence. The model that generated the sequence is “hidden”. But the resulting sequence is related to those sequences used to construct themodel. IT IS POSSIBLE TO CALCULATE THE PROBABILITYTHAT A GIVEN SEQUENCE WAS GENERATED BY THE MODEL!

Page 47: Sequence Alignments and Database Searching 08/20/07

Multiple sequence alignments (MSAs)

In this example, an MSA is used to identify regions of high sequence conservation presumably reflecting structural and functional constraints. Useful for delimiting known domains and potential new functional regions (e.g. the Ras-binding domain in yellow and the blue box of currently unknown function).

Page 48: Sequence Alignments and Database Searching 08/20/07

Fun with MSA...

MSA used to locate functional residues and domain boundries in homologs of Dbl-proteins with known structure (Dbs and Tiam1).

Red amino acids directly interact with GTPases. Blue residues directly interact with phosphoinositides.

Page 49: Sequence Alignments and Database Searching 08/20/07

What you should know

The general approaches to finding related sequences – i.e. the methodology the terminology, how they differ.

Some of the definitions (e.g. what factors affect the E-value?, what’s paralogous?)