Blosum Substitution Matrix Pab


July 23, 2003 Patricia Babbitt, PhD, Univ. of Calif., San Francisco

Introduction to Bioinformatics: Protein Informatics

7/23/03 NHLBI Symposium: From Genome to Disease

Patricia C. Babbitt, University of California, San Francisco

babbitt@cgl.ucsf.edu

“–mastics, –omens & omics” (courtesy of Cambridge Healthtech Institute: 50 & counting...)

• Biome • Celluome • Chronome • Clinome • Complexome • Crystallome • Cytome • Diagnome • Enzymome • Epigenome • Fluxome • Foldome • Functome • Genome • Glycome • Infectuome

• Immunome • Interactome • Localizome • Metabolome • Methylome • Microbiome • Morphome • Operome • ORFeome • Pathogenome • Peptidome • Pharmacogenomics • Phenome • Phylogenome • Physiome

• Promoterome • Proteome • Pseudogenome • Regulome • Resistome • Ribonome • Secretome • Signalome • Somatonome • Toxicome • Transcriptome • Translatome • Unknome • Vaccinomics • Variome

Applications of Protein Informatics

• deduction of function
• tracing ancestral connections
• understanding enzyme mechanisms
• structural analysis of receptors, molecules involved in cell signaling
• identification of molecular surfaces in protein-protein, protein-DNA interactions
• protein engineering
• clustering of families, superfamilies
• metabolic computing/comparative genome analysis

Tools/Approaches for Protein Informatics

• database searching/pairwise alignments
• pattern searching and motif analysis
• multiple alignments
• phylogenetic tree construction
• sequence and structure comparison
• comparative genomics
• “metabolic computing”
• transmembrane/2° structure prediction
• 3D structure prediction/modeling
• visualization
• composition/pI/mass analysis

Protein vs. nucleic acid sequence analysis?

• Protein sequence analysis is more specific and less noisy than nucleic acid analysis due to the inherent differences in the message content of nucleic acid and amino acid codes

• 20-letter code vs 4-letter code, degeneracy of codon messaging

• But searches for many functional genomics experiments must be done at nucleotide level...

Outline: Performing your own Analyses in Protein Informatics

• Ins and Outs of database searching
– underlying assumptions
– scoring, optimization, statistical significance, caveats
• Fasta, Blast & PsiBlast
• Pattern searching & motif analysis
• Pre-computed analyses for protein families using sequence and structure information, motif databases

Database searching

• The first and most common operation in protein informatics...and the only way to access the information in large databases

• Primary tool for inference of homologous structure and function

• Improved algorithms to handle large databases quickly

• Provides an estimate of statistical significance

• Generates alignments

• Definitions of similarity can be tuned using different scoring matrices and algorithm-specific parameters

The underlying assumption used in functional inference...

…requires comparison of sequences

Sequence Conservation

Structure Conservation

Function Conservation

Formalizing the Problem

• Given: two sequences that you want to align
• Goal: find the best alignment that can be obtained by sliding one sequence along the other
• Requirements:
– a scheme for evaluating matches/mismatches between any two characters
– a score for insertions/deletions
– a method for optimization of the total score
– a method for evaluating the significance of the alignment

Scoring Systems

• The degree of match between two letters can be represented in a matrix

• Changing the matrix can change the alignment
– Simplest: Identity (unitary) matrix
– Better: Definitions of similarity based on inferences about chemical or biological properties
– Examples: PAM, Blosum, Gonnet matrices

• The score should be based on the odds ratio p_ab / (q_a q_b), in practice its logarithm, where p_ab is the probability that residue a is substituted by residue b, and q_a and q_b are the background probabilities for residues a and b respectively.

• Handling gaps remains an incompletely solved problem...
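As a rough illustration of the odds-ratio score described on this slide, the sketch below takes the logarithm of p_ab / (q_a q_b) and rounds it to an integer, the way published matrices are usually tabulated. The function name `log_odds_score`, the half-bit scaling (`scale=2.0`), and all probabilities are assumptions chosen for illustration; they are not values from any real matrix.

```python
import math


def log_odds_score(p_ab, q_a, q_b, scale=2.0):
    """Scaled, rounded log2 of the odds ratio p_ab / (q_a * q_b).

    p_ab     : probability of seeing residues a and b aligned (hypothetical)
    q_a, q_b : background probabilities of residues a and b (hypothetical)
    scale    : 2.0 gives half-bit units (an assumption for this sketch)
    """
    ratio = p_ab / (q_a * q_b)
    return round(scale * math.log2(ratio))


# Made-up numbers: a pair seen twice as often as chance scores positive,
# a pair seen half as often as chance scores negative.
favoured = log_odds_score(0.02, 0.1, 0.1)
disfavoured = log_odds_score(0.005, 0.1, 0.1)
```

A positive score means the pair is aligned more often than expected by chance, a negative score less often, and zero means no preference either way.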

BLOSUM (BLOcks SUbstitution) Matrices

• Derived from the BLOCKS database, which, in turn, is derived from the PROSITE library (see http://blocks.fhcrc.org/blocks/; http://www.expasy.ch/prosite/)

• BLOCKS generated from multiply aligned sequence segments without gaps, clustered at various similarity thresholds and corrected to avoid sampling bias

• Derived from data representing highly conserved sequence segments from divergent proteins rather than data based on very similar sequences (as with PAM matrices)

Derivation of BLOSUM matrices*

• Many sequences from aligned families are used to generate the matrices

• Sequences identical at >X% are eliminated to avoid bias from proteins over-represented in the database

• Specific matrices refer to these clustering cut-offs, i.e., BLOSUM62 reflects observed substitutions between segments <62% identical

• These matrices have become the default scoring schemes used at most primary internet search sites

• Different matrices can make a difference to your results!

*adapted from Ewens & Grant, Statistical Methods in Bioinformatics
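The pair-counting step behind this derivation can be sketched on a toy block. The example below counts residue pairs column by column in an ungapped block, derives background frequencies from those counts, and converts them to rounded log-odds scores. It deliberately omits the >X% clustering step described above, and the three-sequence block, the function names, and the half-bit scaling are all assumptions for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def pair_frequencies(block):
    """Observed frequency of each unordered residue pair, counted per column."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def background(pair_freqs):
    """Background q_a implied by the pairs: p_aa plus half of each p_ab (a != b)."""
    q = Counter()
    for (a, b), p in pair_freqs.items():
        if a == b:
            q[a] += p
        else:
            q[a] += p / 2
            q[b] += p / 2
    return dict(q)


def log_odds(pair_freqs, q, scale=2.0):
    """Rounded, scaled log2(observed / expected) for every observed pair."""
    scores = {}
    for (a, b), p in pair_freqs.items():
        # Unordered mixed pairs can arise two ways, hence the factor of 2.
        expected = q[a] * q[b] if a == b else 2 * q[a] * q[b]
        scores[(a, b)] = round(scale * math.log2(p / expected))
    return scores


# Hypothetical ungapped block of three aligned segments.
toy_block = ["AVL", "AIL", "SVL"]
freqs = pair_frequencies(toy_block)
q = background(freqs)
scores = log_odds(freqs, q)
```

With so few sequences the scores are not meaningful biologically; the point is only the bookkeeping from counted pairs to a score table.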

• Scoring matrices are tailored to degree of divergence and may require a specific query length for optimal performance*

Query Length    Substitution Matrix
<35             PAM-30
35-50           PAM-70
50-85           BLOSUM-80
>85             BLOSUM-62

*adapted from information available at the NCBI Blast web site
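The table can be expressed as a small lookup, shown here as a sketch. The function name `suggest_matrix` is hypothetical, and because the table's ranges share their endpoints (50 and 85), assigning each boundary length to the lower range is an assumption of this example.

```python
def suggest_matrix(query_length):
    """Map a query length (in residues) to the matrix suggested in the table.

    Boundary lengths 50 and 85 are assigned to the lower range here; the
    table itself does not say which side they belong to.
    """
    if query_length < 35:
        return "PAM-30"
    if query_length <= 50:
        return "PAM-70"
    if query_length <= 85:
        return "BLOSUM-80"
    return "BLOSUM-62"


# Example: a 255-residue query falls in the ">85" row.
matrix_for_long_query = suggest_matrix(255)
```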

Scoring and optimization

[Figure: dot matrix plot of SEQUENCEHOMOLOG against SEQUENCEANALOG, with a dot wherever the two letters match]

• Dot matrix plots: a simple description of alignment operations illustrating types of relationships between a sequence pair

• The signal-to-noise ratio can be improved using filtering techniques designed to minimize the composition-dependent background

• Example of common filters: overlapping, fixed-length "windows" for sequence comparison

• To be counted, a comparison must achieve a minimum threshold score summed over the window, derived empirically or from a statistical or evolutionary model of sequence similarity

• The window size and minimum threshold score (often termed "stringency") at which the score is counted can be user-defined

Seq1 = SEQUENCEHOMOLOG
Seq2 = SEQUENCEANALOG
Window = 7, Stringency = 42% (3/7 matches)

SEQUENC
SEQUENCEANALOG (7/7 matches)

 SEQUENC
SEQUENCEANALOG (0/7 matches)

...

      CEHOMOL
SEQUENCEANALOG (2/7 matches)

...

       HOMOLOG
SEQUENCEANALOG (3/7 matches)
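The window comparison in this example can be reproduced with a few lines of code. The sketch below slides every 7-letter window of Seq1 over every 7-letter window of Seq2 and keeps the pairs that reach the stringency of 3 identities; the function name `windowed_dot_plot` and the 0-based `(i, j, matches)` output format are assumptions of this example.

```python
def windowed_dot_plot(seq1, seq2, window=7, min_matches=3):
    """Return (i, j, matches) for every window pair meeting the stringency.

    i and j are 0-based start positions of the window in seq1 and seq2.
    """
    hits = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(
                a == b
                for a, b in zip(seq1[i:i + window], seq2[j:j + window])
            )
            if matches >= min_matches:
                hits.append((i, j, matches))
    return hits


hits = windowed_dot_plot("SEQUENCEHOMOLOG", "SEQUENCEANALOG")
```

Each hit corresponds to one dot in the filtered plot; raising `min_matches` removes more of the background at the cost of weaker true matches.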

Window = 30; Stringency = 2

Window = 30; Stringency = 11

• To measure the local similarity between 2 sequences, scores can be used in the matrix instead of dots for a sliding window comparison
– Summing the identities/similarities at each position
– For a window of 5 residues and storing the score in the position corresponding to the center of the window:

Window (residues 1-5): P R I M E
Sequence: S E Q U E N C E A N A L Y S I S P R I M E R

P R I M E vs. S E Q U E: 1 - 1 - 2 + 0 + 4 = +2
...
P R I M E vs. P R I M E: 6 + 6 + 5 + 6 + 4 = +27
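The two window sums on this slide can be reproduced with a short sketch. The pair scores below are read off the slide's arithmetic (they look like PAM250 values, but that is an assumption), every pair not listed is scored 0 purely for illustration, and the names `SCORES`, `pair_score`, and `window_scores` are hypothetical.

```python
# Pair scores taken from the two worked windows on the slide; any pair not
# listed here scores 0, which is a simplification for this sketch only.
SCORES = {
    ("P", "S"): 1, ("R", "E"): -1, ("I", "Q"): -2, ("M", "U"): 0,
    ("E", "E"): 4, ("P", "P"): 6, ("R", "R"): 6, ("I", "I"): 5,
    ("M", "M"): 6,
}


def pair_score(a, b):
    """Symmetric lookup: (a, b) and (b, a) share one score."""
    return SCORES.get((a, b), SCORES.get((b, a), 0))


def window_scores(query, target):
    """Sum of pair scores for the query placed at each offset of the target."""
    w = len(query)
    return [
        sum(pair_score(q, t) for q, t in zip(query, target[i:i + w]))
        for i in range(len(target) - w + 1)
    ]


scores = window_scores("PRIME", "SEQUENCEANALYSISPRIMER")
```

Here `scores[i]` is indexed by the window's start position; to store it at the window center as the slide describes, use position `i + len(query) // 2`.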

Statistical Significance

• A good way to determine if the alignment score has statistical meaning is to compare it with the score generated from the alignment of two random sequences

• A model of ‘random’ sequences is needed. The simplest model chooses the amino acid residues in a sequence independently, with background probabilities

• For an un-gapped alignment, the score of a match to a random sequence is the sum of many similar random variables, so the sum can be approximated by a normal distribution.

– Comparing a query sequence to a set of random sequences of uniform length results in scores that obey an extreme value distribution rather than a normal distribution, i.e., assuming a normal distribution can lead to overestimation of an alignment’s significance (see Altschul et al, 1994)
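One simple way to see the random-sequence comparison in practice is to shuffle the target, which keeps its composition, and ask how often a shuffled copy scores as well as the real one. The sketch below does this with an identity score and un-gapped windows; the function names, the shuffle count, and the fixed seed are assumptions, and real search tools use analytic extreme value statistics rather than this brute-force estimate.

```python
import random


def identity_score(a, b):
    """Un-gapped score: number of identical aligned positions."""
    return sum(x == y for x, y in zip(a, b))


def best_ungapped_score(query, target):
    """Best identity score over all placements of the query on the target.

    Assumes the target is at least as long as the query.
    """
    w = len(query)
    return max(
        identity_score(query, target[i:i + w])
        for i in range(len(target) - w + 1)
    )


def empirical_p_value(query, target, n_shuffles=1000, seed=0):
    """Fraction of composition-preserving shuffles scoring >= the real target."""
    rng = random.Random(seed)
    observed = best_ungapped_score(query, target)
    letters = list(target)
    at_least = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if best_ungapped_score(query, "".join(letters)) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least + 1) / (n_shuffles + 1)


observed, p_value = empirical_p_value("PRIME", "SEQUENCEANALYSISPRIMER", n_shuffles=200)
```

Because the reported score is the best of many placements, its distribution under shuffling is that of a maximum, which is why the extreme value distribution is the appropriate reference.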

Caveats

• For database searches, the ONLY criterion available to judge the likelihood of a structural or evolutionary relationship between 2 sequences is an estimate of statistical significance

• Statistical significance and biological significance are NOT necessarily the same

Query= /phosphonatase/phosSt.gcg (255 letters) (10/20/99/pcb)
Database: /mol/seq/blast/db/swissprot
78,725 sequences; 28,368,147 total letters

Sequences producing significant alignments:

Sequence              Begin  End   Score   E      Description
                                   (bits)  Value
sp|O06995|PGMB_BACSU  93     204   38      0.020  PUTATIVE BETA-PHOSPHOGLUCOMUTASE (BETA-PGM)
sp|P31467|YIEH_ECOLI  1      180   36      0.10   HYPOTHETICAL 24.7 KD PROTEIN IN TNAB-BGLB I...
sp|O14165|YDX1_SCHPO  34     201   31      2.6    HYPOTHETICAL 27.1 KD PROTEIN C4C5.01 IN CHR...
sp|P41277|GPP1_YEAST  133    200   30      4.4    (DL)-GLYCEROL-3-PHOSPHATASE 1
sp|Q39565|DYHB_CHLRE  3911   4032  29      7.6    DYNEIN BETA CHAIN, FLAGELLAR OUTER ARM
sp|P77625|YFBT_ECOLI  143    187   29      10.0   HYPOTHETICAL 23.7 KD PROTEIN IN LRHA-ACKA I...
sp|Q40297|FCPA_MACPY  146    176   29      13     FUCOXANTHIN-CHLOROPHYLL A-C BINDING PROTEIN...
sp|P40853|GPHP_ALCEU  94     188   29      13     PHOSPHOGLYCOLATE PHOSPHATASE, PLASMID (PGP)
sp|Q40296|FCPB_MACPY  146    176   29      13     FUCOXANTHIN-CHLOROPHYLL A-C BINDING PROTEIN...
sp|P52183|ANNU_SCHAM  119    168   29      13     ANNULIN (PROTEIN-GLUTAMINE GAMMA-GLUTAMYLTR...
sp|P40106|GPP2_YEAST  133    200   28      17     (DL)-GLYCEROL-3-PHOSPHATASE 2
sp|P37934|MAY3_SCHCO  435    552   27      29     MATING-TYPE PROTEIN A-ALPHA Y3
sp|O06219|MURE_MYCTU  255    371   27      29     UDP-N-ACETYLMURAMOYLALANYL-D-GLUTAMATE--2,6...
sp|P08419|EL2_PIG     182    245   27      38     ELASTASE 2 PRECURSO
sp|Q11034|Y07S_MYCTU  163    218   27      38     HYPOTHETICAL 69.5 KD PROTEIN CY02B10.28C
sp|P00577|RPOC_ECOLI  1290   1401  27      38     DNA-DIRECTED RNA POLYMERASE BETA' CHAIN (T
sp|P32662|GPH_ECOLI   20     49    27      38     PHOSPHOGLYCOLATE PHOSPHATASE (PGP)
sp|P32662|GPH_ECOLI   116    224   27      28     PHOSPHOGLYCOLATE PHOSPHATASE (PGP)
sp|P32282|RIR1_BPT4   239    266   27      50     RIBONUCLEOSIDE-DIPHOSPHATE REDUCTASE ALPHA C...
sp|P17346|LEC2_MEGRO  36     121   27      50     LECTIN BRA-2
sp|P54947|YXEH_BACSU  24     51    27      50     HYPOTHETICAL 30.2 KD PROTEIN IN IDH-DEOR IN...
sp|P77366|PGMB_ECOLI  95     190   27      50     PUTATIVE BETA-PHOSPHOGLUCOMUTASE (BETA-PGM)
sp|P30139|THIG_ECOLI  43     79    27      50     THIG PROTEIN
sp|P95649|CBBY_RHOSH  96     189   27      50     CBBY PROTEIN
sp|Q43154|GSHC_SPIOL  228    327   26      66     GLUTATHIONE REDUCTASE, CHLOROPLAST PRECURSO...
sp|P34132|NT6A_HUMAN  191    215   26      66     NEUROTROPHIN-6 ALPHA (NT-6 ALPHA)
sp|P34134|NT6G_HUMAN  115    144   26      66     NEUROTROPHIN-6 GAMMA (NT-6 GAMMA)
sp|P95650|GPH_RHOSH   48     114   26      66     PHOSPHOGLYCOLATE PHOSPHATASE (PGP)
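The bit scores and E values in this output are related, approximately, by E = m × n × 2^(-S'), where S' is the bit score, m the query length, and n the database length. The sketch below applies that relation to the query (255 letters) and database (28,368,147 letters) reported above. It uses raw rather than effective lengths, which is an assumption, so it only roughly tracks the E values BLAST printed; the function name `approx_e_value` is hypothetical.

```python
def approx_e_value(bit_score, query_len, db_len):
    """Approximate E value from a bit score: E = m * n * 2**(-S').

    Uses raw sequence lengths; BLAST itself applies edge-effect corrections
    (effective lengths), so reported E values are somewhat smaller.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


QUERY_LEN = 255          # letters in the query, from the output header
DB_LEN = 28_368_147      # total letters in the database, from the output header

top_hit = approx_e_value(38, QUERY_LEN, DB_LEN)     # reported as 0.020
second_hit = approx_e_value(36, QUERY_LEN, DB_LEN)  # reported as 0.10
```

Each drop of one bit in score doubles the expected number of chance hits, which is why the list moves from E = 0.020 to E = 66 over a span of only 12 bits.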

• Different proteins evolve at different rates

[Figure: changes/100 amino acids (y-axis, 0 to 200) plotted against millions of years since divergence (x-axis, 0 to 1000) for Hemoglobin, Fibrinopeptides, and Cytochrome C]

• Different domains within a single protein evolve at different rates

[Figure: Proinsulin (B-chain, C-peptide, A-chain) and Mature insulin (A-chain, B-chain), with the C-peptide shown separately. Rates shown: r = 0.13 x 10^-9/site/year and r = 0.97 x 10^-9/site/year]

FASTA

• "Fast" search algorithm generates global alignments, allows gaps (see http://www.ebi.ac.uk/fasta33/)

• Extensively updated since first release
– added statistical analysis
– multiple variants available
– FASTA3 is the current implementation

FASTA flavors (see http://fasta.bioch.virginia.edu/)

• FASTA: Compares protein vs protein or DNA vs DNA

• FASTX/FASTY: Compares DNA query to protein sequence db, DNA translated in 3 forward (or reverse) frames; allows frameshifts

• TFASTX: Compares protein query vs DNA sequence or db, translated in all 6 reading frames; no accommodation for introns

• FASTS: Compares a set of short peptide fragments derived from mass spectrometric proteomic analysis vs protein or DNA db

BLAST

• Original "fast" search algorithm generates local alignments without gaps (Blast 1.4)

• Newer versions (Blast 2.0x) accommodate gaps

• Access at NCBI and other sites: http://www.ncbi.nlm.nih.gov/BLAST/

• Documentation
– Manual: http://www.ncbi.nlm.nih.gov/BLAST/blast_help.html
– FAQs: http://www.ncbi.nlm.nih.gov/BLAST/blast_FAQs.html
– Tutorial: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

BLAST flavors

• blastp compares an amino acid query sequence against a protein sequence database

• blastn compares a nucleotide query sequence against a nucleotide sequence database

• blastx compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database

• tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands)

• tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database
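The choice among these five programs depends only on the type of the query and of the database, which can be summarized as a lookup. The helper below is a sketch; `choose_blast_program`, its argument names, and the `translate_both` flag are hypothetical and are not part of any BLAST interface.

```python
# (query type, database type) -> program, following the list above.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast_program(query_type, db_type, translate_both=False):
    """Pick a BLAST flavor from the query and database types.

    translate_both=True requests a translated-vs-translated comparison
    (tblastx), which only makes sense for nucleotide vs nucleotide.
    """
    if translate_both:
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError(
                "tblastx needs a nucleotide query and a nucleotide database"
            )
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, db_type)]


program = choose_blast_program("protein", "nucleotide")
```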

Some Generalities about Fasta, Blast

• These methods are so widely used because they really are that good...

• BUT, there are some disadvantages:
– Loss of sub-optimal alignments
– Pairwise comparisons limit information content
– Many biologically significant relationships may be lost in the "noise," i.e., hits that are not statistically significant

• BLAST is not “better” than FASTA

Psi-Blast: Extending our reach...

• Generalizes BLAST algorithm to use a position-specific score matrix in place of a query sequence and associated substitution matrix for searching the databases

• Position-specific score matrix generated from the output of a gapped Blast search, i.e., uses a profile or motif defined in the initial Blast search in place of a single query sequence and matrix for subsequent searches of the database

• Results in a database search “tuned” to the specific sequence characteristics of interest

Steps in a Psi-Blast search*

1. Constructs a multiple alignment from a Gapped Blast search and generates a profile from any significant local alignments found

2. The profile is compared to the protein database and PSI-BLAST estimates the statistical significance of the local alignments found, using "significant" hits to extend the profile for the next round

3. PSI-BLAST iterates step 2 an arbitrary number of times or until convergence

*Adapted from the PSI-BLAST tutorial at NCBI
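The iterate-until-convergence idea in these steps can be illustrated with a heavily simplified toy. The sketch below builds a position-specific score matrix from the current set of matched segments, scans a small in-memory "database" without gaps, adds segments scoring above a fixed threshold, and repeats until nothing new is found. Everything here is an assumption made for illustration: the uniform background, the pseudocount, the raw-score threshold (PSI-BLAST itself uses gapped alignments and E values), the function names, and the example sequences.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies


def build_pssm(aligned, pseudocount=1.0):
    """Column-wise log2-odds scores from equal-length, ungapped segments."""
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in ALPHABET
        })
    return pssm


def best_window(pssm, seq):
    """(score, segment) for the best ungapped placement of the profile on seq."""
    w = len(pssm)
    best = (float("-inf"), "")
    for i in range(len(seq) - w + 1):
        segment = seq[i:i + w]
        score = sum(col[aa] for col, aa in zip(pssm, segment))
        best = max(best, (score, segment))
    return best


def iterative_search(query, database, threshold, max_rounds=5):
    """Toy profile iteration: rebuild the profile from hits until it stops growing."""
    members = [query]
    for _ in range(max_rounds):
        pssm = build_pssm(members)
        hits = set()
        for seq in database:
            if len(seq) < len(query):
                continue
            score, segment = best_window(pssm, seq)
            if score >= threshold:
                hits.add(segment)
        new_members = sorted({query} | hits)
        if new_members == members:
            break  # convergence: no new segments were recruited
        members = new_members
    return members


# Hypothetical database; only standard residue letters are supported.
toy_db = ["GGACDEFGG", "KKACDEYKK", "TTACDMYTT", "WWWWWWWWW"]
one_round = iterative_search("ACDEF", toy_db, threshold=3.0, max_rounds=1)
converged = iterative_search("ACDEF", toy_db, threshold=3.0)
```

In this toy, the segment ACDMY is too dissimilar to the query to be found in the first round, but it is recruited in the second round once ACDEY has been folded into the profile, which is the "extending our reach" effect in miniature.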

PSI-BLAST information on the web

• Access at http://www.ncbi.nlm.nih.gov/BLAST/

• Tutorial at http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html

• A short explanation of PSI-BLAST statistics at http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-3.html

• See also: Park J et al, “Sequence comparisons using multiple sequences detect three times as many remote homologs as pairwise methods,” JMB 284:1201-10, 1998

Other alternatives

• Many, many other DB searching algorithms are available
– Smith-Waterman
– Methods based on probabilistic models/profiles, e.g., Hidden Markov models
– Motif searching

• Or, you can use (or start with) pre-computed analyses of protein families

Why do motif analysis?

• Identification of very distant homologs
• May point to important functional units in a protein
• Can be used to "anchor" a multiple alignment
• Databases of motifs can be used to develop other informatics applications

Example: BLOCKS → Blosum matrices

Motif analysis

• Focuses on conserved patterns among two or more sequences to determine relationships

• Many variants of motif searching available
– Consensus-based, e.g., Prosite: http://expasy.nhri.org.tw/prosite/
– Manually annotated motifs, distant relationships, e.g., PRINTS: http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
– Statistical, e.g., MEME (Multiple EM for Motif Elicitation): http://meme.sdsc.edu/meme/website/
– Database searching, e.g., PHI-BLAST: http://www.ncbi.nlm.nih.gov/BLAST/
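Consensus-based motifs of the Prosite kind are written in a compact pattern syntax that maps closely onto regular expressions. The sketch below translates the common elements (`x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends) and scans a sequence. The function name `prosite_to_regex` is hypothetical, the test sequences are made up, and less common constructs (for example `>` inside square brackets) are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a Prosite-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P}    -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                     # x      -> any residue
        element = element.replace("<", "^").replace(">", "$")   # sequence ends
        parts.append(element)
    return "".join(parts)


# N-glycosylation site pattern in Prosite notation.
glyco = re.compile(prosite_to_regex("N-{P}-[ST]-{P}"))

# Made-up sequences for illustration.
match = glyco.search("AANGSAA")       # N, not-P, S, not-P is present
no_match = glyco.search("AANPSAA")    # P after N violates the pattern
```

Short consensus patterns like this one match by chance quite often, so a hit is a prompt for closer inspection rather than evidence of function.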

Meme & Mast

• Meme: motif discovery tool http://meme.sdsc.edu/meme/website/intro.html
– motifs represented as position-dependent letter-probability matrices which describe the probability of each possible letter at each position in the pattern
– output can be converted to BLOCKS which can then be converted to PSSMs (position-specific scoring matrices)

• Mast: database searching tool using one or more motifs as queries
– provides a match score, reported as a p-value, for each sequence in the database compared with each of the motifs in the group of motifs provided
– provides probable order and spacing of occurrences of the motifs in the sequence hits

Some pre-calculated motif/family compilations

• Prosite: Protein families/domains showing biologically important patterns (1637 different patterns, rules and profiles/matrices as of 6/03) http://us.expasy.org/prosite/

• Pfam: Multiple sequence alignments and HMMs for many protein domains (5724 families as of 5/03) http://pfam.wustl.edu/

• Prints: Conserved motifs characterizing protein families (1800 entries, encoding 10,931 individual motifs as of 4/03) http://bioinf.man.ac.uk/dbbrowser/PRINTS/

• Compilation of specific protein family websites at the MRC http://www.hgmp.mrc.ac.uk/GenomeWeb/prot-family.html

Laboratory Exercises & Resources from Baygenomics

http://baygenomics.ucsf.edu/PGAConference2003/

• Using the LDL receptor as an example
– DB searching
– TMD prediction
– Prosite, Pfam, Prints, Motif analysis
– Multiple alignment generation and interpretation
– Tree building/visualization
– 2° structure/TMD prediction
– 3D structure visualization

• Part of a 2-day hands-on workshop (and an online version)
– extensive help files
– detailed answer keys