liuz
Bioinformatics Methods for Inheriting Structural and Functional Annotations for Gene Sequences
If a related sequence has a known function, can you inherit functional properties?
If a related sequence has a known structure, can you model the unknown structure using the known?
Structural information can often provide additional clues as to the function
What are the best methods to use?
What thresholds should be used for safe inheritance of functional properties?
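Homologues are related sequences:
[Figure: an ancestral gene a is duplicated to give a and b; speciation then gives a and b in species 1 and a and b in species 2. Genes separated by duplication are paralogs; genes separated by speciation are orthologs.]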
Protein Sequence and Structure Databases
GenBank sequence database in the States has over 120 million sequences - some partial. More than a million non-identical sequences
DNA database of Japan (DDBJ)
UniProt (SWISS-PROT) database has > a million non-identical sequences - validated gene sequences
Protein Structure Databank (PDB - States, ePDB - UK) has >70,000 entries
Web Based Public Resources containing Functional Annotations
• Protein Family and Function databases: Pfam, InterPro, PROSITE, PRINTS, PANTHER, SMART, SCOP, CATH, HOMSTRAD
• Databases of biochemical pathways and biological function: KEGG, WIT, GO, FunCat, EC
• Databases of Protein-Ligand Interactions: IntAct, MIPS, RELIBASE, BIND, DIP, iRefIndex
• Species Databases: ENSEMBL, FlyBase, YPD, WORMDb, GenProtEC, EcoCyc
Evolution of Protein Sequences
substitutions due to single base mutations
insertions or deletions (indels) of residues - usually not in the secondary structures but in the connecting loops
insertions/deletions (indels) can make it harder to compare sequences - have to line up the equivalent regions and put gaps where there are indels
Evolution of Protein Sequences
Human Hemoglobin: Alpha and Beta Chains (compared without gaps)
Sequence A (alpha chain) above, Sequence B (beta chain) below:
A  VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT
B  VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQ

A  KTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPN
B  RFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL

A  ALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEF
B  DNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHH

A  TPAVHASLDKFLASVSTVLTSKYR
B  FGKEFTPPVQAAYQKVVAGVANALAHKYH
Human Hemoglobin: Alpha and Beta Chains (aligned with gaps, shown as -)
A  -VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT
B  VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWT

A  KTYFPHF-DLSH-----GSAQVKGHGKKVADALTNAVAHV
B  QRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL

A  DDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAH
B  DNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHH

A  LPAEFTPAVHASLDKFLASVSTVLTSKYR
B  FGKEFTPPVQAAYQKVVAGVANALAHKYH
Percentage Sequence Identity
= (number of identical residues x 100) / (number of residues in the smaller protein)
For globin example:
without gaps ~9%
with gaps ~41%
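The identity formula can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `percent_identity` and the use of `-` as the gap character are assumptions, and the input is expected to be two rows of an existing alignment.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage identity of two aligned rows, relative to the smaller protein."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Count columns where both rows carry the same residue (gaps never count).
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    # Protein lengths are the numbers of non-gap characters in each row.
    len_a = sum(1 for x in aligned_a if x != gap)
    len_b = sum(1 for y in aligned_b if y != gap)
    return 100.0 * identical / min(len_a, len_b)


# Toy example (hypothetical fragments, not the globin sequences).
print(percent_identity("ACGT", "AGGT"))
```

Dividing by the smaller protein keeps a short domain that matches perfectly inside a longer sequence from being penalised for the unmatched remainder.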
Searching for Homologues with Related Functions
How do you handle the evolutionary changes?
How similar do the sequences need to be to inherit structural and functional properties?
How do you cope with the volume of data, i.e. millions of sequences to search?
Searching Sequence Databases
Can you inherit functional information?
Do fast scans using approximate methods, e.g. BLAST or PSI-BLAST
Align proteins carefully using a dynamic programming method: Needleman & Wunsch, Smith & Waterman
Scan against sequence profiles (or HMMs) in secondary databases, e.g. Pfam, InterPro, Gene3D
Align query sequence against family relatives using: ClustalW, Jalview, MUSCLE, MAFFT
[Figure: dot plot with Sequence A (V I L S T R I V H V N S I L P S T N) across the top and Sequence B (V I L S T R I V I L P E F S T) down the side; diagonal lines give equivalent residues]
Dot Plots, Path Matrices, Score Matrices
[Figure: score matrix with Sequence A (V I L S T R I V H V N S I L P S T N) across the top and Sequence B (V I L S T R I V I L P E F S T) down the side]
identical residues score 1
highest scoring path across the matrix gives best alignment
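A dot plot of this kind is simple to build. The sketch below is illustrative only: `dot_matrix` and `render` are hypothetical helper names, and the text rendering with `x` and `.` is just one possible display of the identity scores.

```python
def dot_matrix(seq_a: str, seq_b: str) -> list[list[int]]:
    """Rows follow seq_b, columns follow seq_a; identical residues score 1."""
    return [[1 if a == b else 0 for a in seq_a] for b in seq_b]


def render(seq_a: str, seq_b: str) -> str:
    """Text rendering: 'x' marks an identical pair, '.' anything else."""
    lines = ["  " + " ".join(seq_a)]
    for residue, row in zip(seq_b, dot_matrix(seq_a, seq_b)):
        lines.append(residue + " " + " ".join("x" if v else "." for v in row))
    return "\n".join(lines)


# The two sequences from the slide.
print(render("VILSTRIVHVNSILPSTN", "VILSTRIVILPEFST"))
```

Runs of `x` along a diagonal correspond to the diagonal lines of equivalent residues in the figure.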
[Figure: dot plot with Sequence A (V I L S L V I L P Q R S L V V I L S L V I L A L T V) across the top and Sequence B (S T V I L S L V R N V I L P Q R I L S L V I S L A L) down the side, showing runs (tuples) of 3 residues or more; diagonal runs are labelled with their lengths (6, 6, 5, 6, 3, 3, 3, 6)]
gap penalty = 3
SCORE = 20 - 9 = 11
Alignment from Dot Plot
A  --VILSLV---ILPQRSLVVILSLVI-LALTV
B  STVILSLVRNVILPQR----ILSLVISLAL--
score = 20
sequence identity = 20/26 = 77%
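The score arithmetic (20 identities, three gaps at a penalty of 3 each, giving 11) can be checked with a short sketch. The gapped rows below are a reconstruction that assumes `-` as the gap character and unpenalised end overhangs; the function names are hypothetical.

```python
def count_internal_gaps(aligned: str, gap: str = "-") -> int:
    """Number of gap runs inside a row, ignoring leading/trailing overhangs."""
    core = aligned.strip(gap)
    runs, in_gap = 0, False
    for ch in core:
        if ch == gap and not in_gap:
            runs += 1          # a new run of gap characters starts here
        in_gap = ch == gap
    return runs


def score_alignment(aligned_a: str, aligned_b: str, gap_penalty: int = 3) -> int:
    """Identities score 1 each; every internal gap run costs gap_penalty."""
    identities = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    gaps = count_internal_gaps(aligned_a) + count_internal_gaps(aligned_b)
    return identities - gap_penalty * gaps


ROW_A = "--VILSLV---ILPQRSLVVILSLVI-LALTV"
ROW_B = "STVILSLVRNVILPQR----ILSLVISLAL--"
print(score_alignment(ROW_A, ROW_B))  # 20 identities - 3 gaps x 3 = 11
```

Here each gap is charged once regardless of its length, which is what the slide's 20 - 9 implies; affine penalties (opening plus extension) are a common alternative.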
Dynamic Programming Methods
Global alignment: Needleman & Wunsch
Local alignment: Smith & Waterman
[Figure: alignment paths across matrices of Sequence A (across) against Sequence B (down)]
Significance of sequence similarity - length dependence
[Figure: plot of sequence identity (%) (axis 0 to 40) against length (axis 0 to 400), with the region of homologous pairs marked]
protein pairs having > 150 residues can be considered homologous if the sequence identity is > 25%
short proteins/fragments of 20-40 residues - 30% sequence identity frequently occurs by chance
If proteins are homologous they are likely to have similar structures and functions
Sequence identity between homologues required for inheriting structure or function:
• Modelling a structure based on the structure of a homologue: >= 30%
• Inheriting functional properties from a homologue: >= 60%
The structures of proteins in a family tend to be much more highly conserved during evolution than the sequences (and, in some families, the function)
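As a rough illustration, the thresholds quoted in these slides can be combined into a single check. The function `inheritance_advice` is hypothetical, and the way the length rule (> 150 residues, > 25% identity) is combined with the 30% and 60% cut-offs is an assumption made for this sketch, not a rule stated by the author.

```python
def inheritance_advice(identity_pct: float, aligned_length: int) -> dict:
    """Apply the slide thresholds to one pairwise comparison."""
    # Length dependence: short fragments reach ~30% identity by chance.
    likely_homologous = aligned_length > 150 and identity_pct > 25
    return {
        "likely_homologous": likely_homologous,
        # >= 30% identity: structure can be modelled on the homologue.
        "model_structure": likely_homologous and identity_pct >= 30,
        # >= 60% identity: functional properties can be inherited.
        "inherit_function": likely_homologous and identity_pct >= 60,
    }


print(inheritance_advice(51, 327))
print(inheritance_advice(30, 35))
```

The first call corresponds to a long alignment at 51% identity, which clears the structural threshold but not the functional one.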
Residue Substitution Matrices
a substitution matrix is a 20 x 20 matrix which scores each possible comparison of residues
Identity Matrix
simplest scoring scheme - amino acids are either identical (score 1) or non-identical (score 0)
Physicochemical Properties Matrix
score residue pairs according to similarities in their physico-chemical properties, e.g. val->leu scores well, val->arg scores low
Evolutionary Matrices
score residue pairs according to how frequently the mutation is observed to occur in evolution, e.g. Dayhoff (PAM), BLOSUM matrices
Dayhoff Matrix (PAM or MDM)
based on evolutionary relationships, it is derived by analysing the substitutions observed in closely related sequences (>80% identity)
the method measures evolutionary distance by determining the number of point accepted mutations, where:
1 PAM = one accepted point mutation per 100 residues
for distant relatives in the twilight zone (<25% identity), generally use a PAM250 matrix
for database searches generally use a PAM120 matrix
BLOSUM Substitution Matrices
• matrix is derived from analysing substitution patterns in more distant relatives (i.e. < 85% identity)
• for clusters of related sequences (e.g. 60% ID, 80% ID) derive multiple alignments without gaps, for short regions of related sequences
• use the alignments to calculate residue substitution frequencies
Henikoff & Henikoff (1993)
Which Matrix Should be Used?
• Matrices derived from observed substitution data (e.g. DAYHOFF, BLOSUM) are better than the identity matrix or those based on physical properties
• various studies suggest that PAM250 gives the best result when aligning distant proteins using dynamic programming algorithms
• in database searching it may be better to use PAM120 or BLOSUM62
BLAST
Basic Local Alignment Search Tool
Altschul et al (1990)
• A highest scoring segment pair (HSP) is found between two sequences
the sequences may be related if HSP score > cutoff
matches significant 'words' or segments and then extends these matches using local dynamic programming
BLAST
Step 1: match significant words
[Figure: query sequence of length L]
For each sequence find the 'words' with significant scores
Step 2: compare the word list to the database and identify exact matches
Step 3: for each word match, extend the alignment using a PAM matrix and dynamic programming
• searches for 2 non-overlapping segments on same diagonal
• must be within a certain distance of each other before extension is invoked
• can also allow gaps so that the method joins segments on different diagonals
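The seed-and-extend idea behind these steps can be sketched as a toy. This is not the BLAST implementation: it uses exact query words only (no scored neighbourhood words), identity scores of +1/-1 instead of a PAM matrix, ungapped extension, and a simple drop-off rule. The names `find_seeds`, `extend` and the parameter `x_drop` are assumptions for this example.

```python
def find_seeds(query: str, subject: str, w: int = 3) -> list[tuple[int, int]]:
    """Steps 1-2: index the query words, then find exact matches in the subject."""
    index: dict[str, list[int]] = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    return [
        (qi, sj)
        for sj in range(len(subject) - w + 1)
        for qi in index.get(subject[sj:sj + w], [])
    ]


def extend(query, subject, qi, sj, w=3, match=1, mismatch=-1, x_drop=2):
    """Step 3 (ungapped): grow a seed both ways, stop when the score drops."""
    best = w * match
    spans = []
    for step in (1, -1):                      # extend right, then left
        score, best_len, k = best, 0, 0
        q0 = qi + w if step == 1 else qi - 1
        s0 = sj + w if step == 1 else sj - 1
        while 0 <= q0 + step * k < len(query) and 0 <= s0 + step * k < len(subject):
            same = query[q0 + step * k] == subject[s0 + step * k]
            score += match if same else mismatch
            k += 1
            if score > best:
                best, best_len = score, k     # new best end point
            elif best - score > x_drop:
                break                         # fallen too far below the best
        spans.append(best_len)
    right, left = spans
    return best, query[qi - left: qi + w + right]


query, subject = "VILSLVILPQR", "STVILSLVNVILPQR"
seeds = find_seeds(query, subject)
print(max(extend(query, subject, qi, sj) for qi, sj in seeds))
```

Indexing the query words once and scanning each database sequence against that index is what makes this kind of search fast compared with full dynamic programming.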
Assessing the Significance of Sequence Match
• length - can get artificially high scores between small sequences
• composition - if sequences are rich in particular amino acid residues can get high scores for unrelated proteins
• to assess the significance of a match it is necessary to compare the score with that returned by random or unrelated sequences
• if the database is small or when considering a pair-wise comparison, the sequences can be shuffled to generate random sequences
Assessing the Significance of Scores Returned from a Database Scan
[Figure: frequency distribution of scores for unrelated sequences, marking the mean (m), the standard deviation (s.d.) and the probe score S]
Z score = (score (S) - mean for unrelated (m)) / standard deviation (s.d.)
Z value > 3 s.d.: related sequences
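A Z score against shuffled sequences can be estimated as in the sketch below. It is a minimal example under stated assumptions: the score is a plain ungapped identity count (a real scan would use the alignment score), and `z_score`, `n_shuffles` and `seed` are hypothetical names and parameters.

```python
import random
import statistics


def identity_score(a: str, b: str) -> int:
    """Ungapped score: number of positions with identical residues."""
    return sum(1 for x, y in zip(a, b) if x == y)


def z_score(a: str, b: str, n_shuffles: int = 200, seed: int = 0) -> float:
    """(S - mean) / s.d., with the background taken from shuffled copies of b."""
    rng = random.Random(seed)
    observed = identity_score(a, b)
    residues = list(b)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)       # keeps length and composition of b
        background.append(identity_score(a, "".join(residues)))
    mean = statistics.mean(background)
    sd = statistics.pstdev(background)
    if sd == 0:
        raise ValueError("background scores have zero spread")
    return (observed - mean) / sd


ALPHA = "VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT"
print(z_score(ALPHA, ALPHA) > 3)
```

Shuffling preserves both length and composition, so the background reflects the two sources of artificially high scores noted in these slides.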
BLAST results
BLAST best hit
>gi|17472322|ref|XP_061555.1| (XM_061555) similar to orphan G protein-coupled receptor GPR26
[Homo sapiens]
Length = 337
Score = 298 bits (762), Expect = 8e-80
Identities = 168/327 (51%)
Query: 1 MGPGEALLAGLLVMVLAVALLSNALVLLCCAYSAELRTRASGVLLVNLSLGHLLLAALDM 60
M A LAGLLV + V+LLSNALVLLC +SA++R +A + +NL+ G+LL ++M
Sbjct: 1 MNSWNAGLAGLLVGTIGVSLLSNALVLLCLLHSADIRRQAPALFTLNLTCGNLLCTVVNM 60
Query: 61 PFTLLGVMRGRTPSAPGACQVIGFLDTFLASNAALSVAALSADQWLAVGFPLRYAGRLRP 120
P TL GV+ R P+ C++ FLDTFLA+N+ LS+AALS D+W+AV FPL Y ++R
Sbjct: 61 PLTLAGVVAQRQPAGDRLCRLAAFLDTFLAANSMLSMAALSIDRWVAVVFPLSYRAKMRL 120
Query: 121 RYAGLLLGCAWGQSLAFSGAALGCSWLGYSSAFASCSLRLPPEPERPRFAAFTATLHAVG 180
R A L++ W +L F AAL SWLG+ +ASC+L ER RFA FT HA+
Sbjct: 121 RDAALMVAYTWLHALTFPAAALALSWLGFHQLYASCTLCSRRPDERLRFAVFTGAFHALS 180
S - score for the pairwise alignment.
E value - number of hits you would expect by chance with score S or higher, given the size of the database and the length of the alignment
Good Match: E < 1 x 10^-50
Possible Match: E from 1 x 10^-50 to 1 x 10^-2
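These E-value bands can be applied mechanically when triaging a list of hits. The helper below is hypothetical; the label for E-values above 1 x 10^-2 is an assumption, since the slide only names the two bands.

```python
def classify_e_value(e_value: float) -> str:
    """Bucket a BLAST E-value using the thresholds quoted on the slide."""
    if e_value < 1e-50:
        return "good match"
    if e_value <= 1e-2:
        return "possible match"
    # Not named on the slide: anything above 1e-2 is treated as unsupported.
    return "no confident match"


# The best hit shown earlier has Expect = 8e-80.
print(classify_e_value(8e-80))
```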
Needleman & Wunsch
Match matrix (identical residues score 1), Sequence A across, Sequence B down:

     A  H  C  N  I  R  Q  C  L  C  R  P  M
A    1  0  0  0  0  0  0  0  0  0  0  0  0
I    0  0  0  0  1  0  0  0  0  0  0  0  0
C    0  0  1  0  0  0  0  1  0  1  0  0  0
I    0  0  0  0  1  0  0  0  0  0  0  0  0
N    0  0  0  1  0  0  0  0  0  0  0  0  0
R    0  0  0  0  0  1  0  0  0  0  1  0  0
C    0  0  1  0  0  0  0  1  0  1  0  0  0
K    0  0  0  0  0  0  0  0  0  0  0  0  0
C    0  0  1  0  0  0  0  1  0  1  0  0  0
R    0  0  0  0  0  1  0  0  0  0  1  0  0
H    0  1  0  0  0  0  0  0  0  0  0  0  0
P    0  0  0  0  0  0  0  0  0  0  0  1  0
Needleman & Wunsch Algorithm
• Accumulate the matrix by adding to each cell the highest score in the column or row to the right and below it
• Find the highest scoring path in the matrix by:
  • starting in the top left corner
  • moving down across the matrix from cell to cell
  • choosing the highest scoring cell at each move
• The path cannot go back on itself or cross the same row or column twice
Accumulating the Matrix
• Add to the score in the cell the highest score from a cell in the row or column to right and below
[Figure: cell i,j and the cells whose scores can be added to it, labelled i-1,j-1; i-n,j-1; i-1,j-m]
Accumulated matrix, Sequence A across, Sequence B down:

     A  H  C  N  I  R  Q  C  L  C  R  P  M
A    8  7  6  6  5  4  4  3  3  2  1  0  0
I    7  7  6  6  6  4  4  3  3  2  1  0  0
C    6  6  7  6  5  4  4  4  3  3  1  0  0
I    6  6  6  5  6  4  4  3  3  2  1  0  0
N    5  5  5  6  5  5  4  3  3  3  1  0  0
R    4  4  4  4  4  5  4  3  3  2  2  0  0
C    3  3  4  3  3  3  3  4  3  3  1  0  0
K    3  3  3  3  3  3  3  3  3  2  1  0  0
C    2  2  3  2  2  2  2  3  2  3  1  0  0
R    2  1  1  1  1  2  1  1  1  1  2  0  0
H    1  2  1  1  1  1  1  1  1  1  1  0  0
P    0  0  0  0  0  0  0  0  0  0  0  1  0
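The accumulation step can be sketched directly from the rule stated for this slide: working from the bottom right, each cell gains the highest score found in the next row (to its right) or the next column (below it). This is an illustrative version with identity scores and no gap penalty; `match_matrix` and `accumulate` are hypothetical names, and individual cells need not agree with every value drawn on the slide.

```python
SEQ_A = "AHCNIRQCLCRPM"   # across the top of the matrix
SEQ_B = "AICINRCKCRHP"    # down the side of the matrix


def match_matrix(seq_a: str, seq_b: str) -> list[list[int]]:
    """Identity scores: 1 where the residues are identical, else 0."""
    return [[1 if a == b else 0 for a in seq_a] for b in seq_b]


def accumulate(matrix: list[list[int]]) -> list[list[int]]:
    """Fill from bottom right to top left, leaving the last row/column as is."""
    rows, cols = len(matrix), len(matrix[0])
    acc = [row[:] for row in matrix]
    for i in range(rows - 2, -1, -1):
        for j in range(cols - 2, -1, -1):
            next_row = acc[i + 1][j + 1:]                        # row below, to the right
            next_col = [acc[k][j + 1] for k in range(i + 1, rows)]  # next column, below
            acc[i][j] += max(next_row + next_col)
    return acc


acc = accumulate(match_matrix(SEQ_A, SEQ_B))
for residue, row in zip(SEQ_B, acc):
    print(residue, " ".join(str(v) for v in row))
print("top-left score:", acc[0][0])
```

The top-left cell then holds the number of matches on the best path, and the path itself is recovered by repeatedly stepping to the highest scoring cell in the next row or column.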
Possible Moves in Finding a Path across the Matrix
• start in the leftmost column or topmost row
• move to the highest scoring cell in row or column to right and below
[Figure: the accumulated matrix for Sequence A (A H C N I R Q C L C R P M, across) against Sequence B (A I C I N R C K C R H P, down), showing the moves possible from cell i,j, labelled i-1,j-1; i-n,j-1; i-1,j-m, and the path across the matrix]
Resulting alignment:
A H C N I - R Q C L C R - P M
A I C - I N R - C K C R H P M