Bioinformatics Methods for Inheriting Structural and Functional annotations for Gene Sequences


If a related sequence has a known function, can you inherit its functional properties?

If a related sequence has a known structure, can you model the unknown structure using the known one?

Structural information can often provide additional clues as to the function.

What are the best methods to use?

What thresholds should be used for safe inheritance of functional properties?

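Homologues are related sequences:

[Figure: an ancestral gene a undergoes duplication to give genes a and b; after speciation, species 1 and species 2 each carry a and b. Genes related through the duplication are paralogs; genes related through the speciation are orthologs.]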

Protein Sequence and Structure Databases

GenBank sequence database in the States has over 120 million sequences, some partial. More than a million are non-identical sequences.

DNA Data Bank of Japan (DDBJ)

UniProt (SWISS-PROT) database has more than a million non-identical sequences - validated gene sequences

Protein Data Bank (PDB - States, ePDB - UK) has >70,000 entries

Web Based Public Resources containing Functional Annotations

• Protein Family and Function databases: Pfam, InterPro, PROSITE, PRINTS, PANTHER, SMART, SCOP, CATH, HOMSTRAD
• Databases of biochemical pathways and biological function: KEGG, WIT, GO, FunCat, EC
• Databases of Protein-Ligand Interactions: IntAct, MIPS, RELIBASE, BIND, DIP, iRefIndex
• Species Databases: ENSEMBL, FlyBase, YPD, WORMDb, GenProtEC, EcoCyc

Evolution of Protein Sequences

substitutions due to single base mutations

insertions or deletions (indels) of residues - usually not in the secondary structures but in the connecting loops

insertions/deletions (indels) can make it harder to compare sequences - have to line up the equivalent regions and put gaps where there are indels

Evolution of Protein Sequences

Human Hemoglobin: Alpha and Beta Chains

Sequence A (alpha chain) and Sequence B (beta chain), written out without gaps:

A  VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT
B  VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQ

A  KTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPN
B  RFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL

A  ALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEF
B  DNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHH

A  TPAVHASLDKFLASVSTVLTSKYR
B  FGKEFTPPVQAAYQKVVAGVANALAHKYH

Human Hemoglobin: Alpha and Beta Chains (aligned, with gaps)

A  VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT
B  VHLTPEEKSAVTALWGKV NVDEVGGEALGRLLVVYPWT

A  KTYFPHF DLSH GSAQVKGHGKKVADALTNAVAHV
B  QRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL

A  DDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAH
B  DNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHH

A  LPAEFTPAVHASLDKFLASVSTVLTSKYR
B  FGKEFTPPVQAAYQKVVAGVANALAHKYH

Percentage Sequence Identity

Percentage sequence identity = (number of identical residues x 100) / (number of residues in the smallest protein)

For the globin example:

without gaps ~9%

with gaps ~41%
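The percentage identity formula above can be sketched in a few lines of Python. This is only an illustration: the function name `percent_identity`, the use of `-` as the gap character, and the toy sequences are assumptions, not part of the slides.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Identical residues x 100, divided by the length of the shorter protein."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Count columns where both sequences carry the same residue (never a gap).
    identical = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap)
    # Protein lengths are measured without the gap characters.
    len_a = len(aligned_a.replace(gap, ""))
    len_b = len(aligned_b.replace(gap, ""))
    return 100.0 * identical / min(len_a, len_b)


# Toy sequences (not the globins): first without gaps, then with gaps inserted.
print(percent_identity("VILSTRIV", "VILPTRLV"))
print(percent_identity("VIL-STRIV", "VILPST-IV"))
```

As with the globin example, placing gaps so that equivalent regions line up raises the identity measured for the same pair of sequences.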

Searching for Homologues with Related Functions

How do you handle the evolutionary changes?

How similar do the sequences need to be to inherit structural and functional properties?

How do you cope with the volume of data, i.e. millions of sequences to search?

Searching Sequence Databases

Can you inherit functional information?

Do fast scans using approximate methods, e.g. BLAST or PSI-BLAST

Align proteins carefully using a dynamic programming method: Needleman & Wunsch, Smith & Waterman

Scan against sequence profiles (or HMMs) in secondary databases, e.g. Pfam, InterPro, Gene3D

Align query sequence against family relatives using: ClustalW, Jalview, MUSCLE, MAFFT

Dot Plots, Path Matrices, Score Matrices

[Figure: dot plot of Sequence A (VILSTRIVHVNSILPSTN) against Sequence B (VILSTRIVILPEFST); diagonal lines give equivalent residues.]

[Figure: the same comparison drawn as a score matrix; identical residues score 1, and the highest scoring path across the matrix gives the best alignment.]
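A dot plot of this kind is simple to build. The sketch below uses the two sequences read from the slide; the function names `dot_matrix` and `diagonal_runs` and the minimum run length of 3 are illustrative assumptions.

```python
def dot_matrix(seq_a, seq_b):
    """Rows follow seq_b, columns follow seq_a; identical residues score 1."""
    return [[1 if a == b else 0 for a in seq_a] for b in seq_b]


def diagonal_runs(matrix, min_len=3):
    """Return (row, column, length) for each diagonal run of at least min_len dots."""
    runs = []
    rows, cols = len(matrix), len(matrix[0])
    for i in range(rows):
        for j in range(cols):
            # A run starts where the cell up and to the left is not a dot.
            if matrix[i][j] and not (i and j and matrix[i - 1][j - 1]):
                length = 0
                while (i + length < rows and j + length < cols
                       and matrix[i + length][j + length]):
                    length += 1
                if length >= min_len:
                    runs.append((i, j, length))
    return runs


seq_a = "VILSTRIVHVNSILPSTN"
seq_b = "VILSTRIVILPEFST"
matrix = dot_matrix(seq_a, seq_b)
print(diagonal_runs(matrix))
```

The long run starting at the top left corresponds to the shared VILSTRIV segment; the shorter runs off the main diagonal are the candidates for equivalent regions separated by indels.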

[Figure: dot plot of Sequence A (VILSLVILPQRSLVVILSLVILALTV) against Sequence B (STVILSLVRNVILPQRILSLVISLAL), with diagonal runs (tuples) of matching residues labelled by their lengths, from 3 to 6 residues.]

gap penalty = 3

SCORE = 20 - 9 = 11

Alignment from Dot Plot

VILSLV ILPQRSLVVILSLVI LALTV
STVILSLVNVILPQR ILSLVISLAL

score = 20

sequence identity = 20/26 = 77%
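The scoring used for this alignment (one point per identical residue, minus a fixed penalty for each gap) can be written as a small function. This is a sketch under assumptions: the name `gapped_score` is made up, `-` stands for a gap, and each contiguous run of gap characters is charged once.

```python
import re


def gapped_score(aligned_a, aligned_b, gap_penalty=3, gap="-"):
    """Identities minus gap_penalty for every contiguous run of gap characters."""
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap)
    gap_pattern = re.escape(gap) + "+"
    gap_runs = sum(len(re.findall(gap_pattern, s)) for s in (aligned_a, aligned_b))
    return matches - gap_penalty * gap_runs


# Toy alignment: 9 identical residues and one gap of penalty 3.
print(gapped_score("VILSLV-ILP", "VILSLVNILP"))
```

With 20 identities and three gaps at a penalty of 3 each, the same arithmetic gives the 20 - 9 = 11 quoted for the dot plot path.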

Dynamic Programming Methods

Global alignment: Needleman & Wunsch

Local alignment: Smith & Waterman

[Figure: path matrices for Sequence A against Sequence B, illustrating a global and a local alignment.]
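The difference between global and local alignment shows up clearly in the scores. The sketch below uses the usual textbook recurrences with a linear gap penalty; the scoring values (match +1, mismatch -1, gap -2), the function names and the toy sequences are assumptions chosen for illustration.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman & Wunsch style: every residue of both sequences is aligned."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith & Waterman style: scores never drop below 0, best cell wins."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


# Two toy sequences sharing only the segment VILSTR.
a, b = "HHHVILSTRPPP", "KKVILSTRGG"
print(global_score(a, b), local_score(a, b))
```

The local score reports only the shared segment, while the global score is pulled down by the unrelated flanking residues.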

Significance of sequence similarity – length dependence

[Figure: plot of sequence identity (%) (axis 0 to 40) against length (axis 0 to 400), marking the region occupied by homologous pairs.]

Protein pairs having > 150 residues are homologous if the sequence identity is > 25%.

Short proteins/fragments of 20-40 residues: 30% sequence identity frequently occurs by chance.

If proteins are homologous they are likely to have similar structures and functions.

Sequence identity between homologues required for inheriting structure or function:

• Modelling a structure based on the structure of a homologue: >= 30%
• Inheriting functional properties from a homologue: >= 60%

The structures of proteins in a family tend to be much more highly conserved during evolution than the sequences (and, in some families, the function).
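The rule-of-thumb thresholds quoted on these slides can be collected into one helper. This is a rough sketch: the function name `inheritance_guide` and the dictionary keys are made up, and the structure and function thresholds assume that homology has already been established by other means.

```python
def inheritance_guide(identity_pct: float, length: int) -> dict:
    """Apply the slide thresholds to a pairwise comparison."""
    return {
        # > 25% identity over more than 150 residues implies homology.
        "homology_by_25pct_rule": length > 150 and identity_pct > 25,
        # >= 30% identity to a homologue: model the structure.
        "may_model_structure": identity_pct >= 30,
        # >= 60% identity to a homologue: inherit functional properties.
        "may_inherit_function": identity_pct >= 60,
    }


print(inheritance_guide(41, 300))
print(inheritance_guide(30, 30))
```

The second call shows why length matters: 30% identity over a 30-residue fragment passes the structure threshold numerically, but it fails the homology rule because such matches frequently occur by chance.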

Residue Substitution Matrices

A substitution matrix is a 20 x 20 matrix which scores each possible comparison of residues.

Identity Matrix
simplest scoring scheme - amino acids are either identical (score 1) or non-identical (score 0)

Physicochemical Properties Matrix
score residue pairs according to similarities in their physico-chemical properties, e.g. val->leu scores well, val->arg scores low

Evolutionary Matrices
score residue pairs according to how frequently the mutation is observed to occur in evolution, e.g. Dayhoff (PAM), BLOSUM matrices
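The first two kinds of matrix are easy to build by hand, which makes the contrast concrete. In the sketch below the residue grouping in `GROUPS` and the scores 2/1/0 are hypothetical choices made for illustration; they are not a published physicochemical matrix.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical grouping by side-chain character, for illustration only.
GROUPS = ["AVLIMC", "FWY", "ST", "NQ", "KRH", "DE", "G", "P"]


def identity_matrix():
    """20 x 20 matrix: identical residues score 1, all others 0."""
    return {(a, b): int(a == b) for a in AMINO_ACIDS for b in AMINO_ACIDS}


def property_matrix(same=2, similar=1, different=0):
    """20 x 20 matrix rewarding substitutions within the same property group."""
    group_of = {aa: g for g, members in enumerate(GROUPS) for aa in members}
    return {
        (a, b): same if a == b else similar if group_of[a] == group_of[b] else different
        for a in AMINO_ACIDS
        for b in AMINO_ACIDS
    }


def score_pair(seq_a, seq_b, matrix):
    """Sum the matrix scores over an ungapped, position-by-position comparison."""
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))


print(score_pair("VLK", "LVR", identity_matrix()))
print(score_pair("VLK", "LVR", property_matrix()))
```

The identity matrix sees nothing in common between the two toy tripeptides, while the property-based matrix gives credit for the conservative val/leu and lys/arg substitutions.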

Dayhoff Matrix (PAM or MDM)

based on evolutionary relationships, it is derived by analysing the substitutions observed in closely related sequences (>80% identity)

the method measures evolutionary distance by determining the number of point accepted mutations (PAM), where:

1 PAM = one accepted point mutation per 100 residues

for distant relatives in the twilight zone (<25% identity), generally use a PAM250 matrix

for database searches generally use a PAM120 matrix
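Matrices for larger PAM distances are obtained by multiplying the 1 PAM mutation probability matrix by itself. The sketch below shows this on a made-up two-letter alphabet with a 1% chance of change per PAM step; the helper names and the toy matrix are assumptions, and a real Dayhoff matrix is 20 x 20 with residue-specific probabilities.

```python
def mat_mul(x, y):
    """Plain matrix product of two square lists of lists."""
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated squaring."""
    size = len(m)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    while n:
        if n & 1:
            result = mat_mul(result, m)
        m = mat_mul(m, m)
        n >>= 1
    return result


# Toy 1 PAM matrix: each residue has a 1% chance of changing per step.
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]
pam250 = mat_pow(pam1, 250)
print(round(pam250[0][0], 4))
```

Because repeated mutations can hit the same site and can revert, 250 PAM does not mean that every residue has changed, which is why a PAM250 matrix is still informative for distant relatives.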

BLOSUM Substitution Matrices

• matrix is derived from analysing substitution patterns in more distant relatives (i.e. < 85% identity)
• for clusters of related sequences (e.g. 60% ID, 80% ID) derive multiple alignments without gaps, for short regions of related sequences
• use the alignments to calculate residue substitution frequencies

Henikoff & Henikoff (1993)
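The step from an ungapped block to substitution scores can be sketched as a log-odds calculation: observed pair frequencies are compared with the frequencies expected from the residue composition. The function name `blosum_like_scores` and the toy block are assumptions, and the sketch leaves out the clustering of sequences by percentage identity that distinguishes, for example, BLOSUM62 from BLOSUM80.

```python
from collections import Counter
from itertools import combinations
from math import log2


def blosum_like_scores(block):
    """Log-odds scores (in half-bits, rounded) from an ungapped alignment block."""
    pair_counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1
    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Residue frequencies implied by the observed pairs.
    freq = Counter()
    for (x, y), q in observed.items():
        if x == y:
            freq[x] += q
        else:
            freq[x] += q / 2
            freq[y] += q / 2

    scores = {}
    for (x, y), q in observed.items():
        expected = freq[x] ** 2 if x == y else 2 * freq[x] * freq[y]
        scores[(x, y)] = round(2 * log2(q / expected))
    return scores


# Toy block of four short, ungapped, related segments.
block = ["ASALL", "ASALL", "ASALL", "ASSLL"]
print(blosum_like_scores(block))
```

Pairs seen more often than expected by chance receive positive scores, and pairs seen less often receive negative scores.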

Which Matrix Should be Used?

• Matrices derived from observed substitution data (e.g. DAYHOFF, BLOSUM) are better than the identity matrix or those based on physical properties
• various studies suggest that PAM250 gives the best result when aligning distant proteins using dynamic programming algorithms
• in database searching it may be better to use PAM120 or BLOSUM62

BLAST
Basic Local Alignment Search Tool

Altschul et al. (1990)

• A highest scoring segment pair (HSP) is found between two sequences
• the sequences may be related if HSP score > cutoff
• matches significant ‘words’ or segments and then extends these matches using local dynamic programming

BLAST

Step 1: match significant words

query sequence of length L

For each sequence find the ‘words’ with significant scores
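Step 1 can be illustrated by listing the overlapping words of a query and, for each word, the neighbouring words that score at or above a threshold. In this sketch the word length of 3, the threshold, and the identity scoring (1 for a match, 0 otherwise) are stand-in assumptions; a real search scores words with a substitution matrix.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def query_words(query, w=3):
    """All overlapping words of length w, with their offsets in the query."""
    return [(i, query[i:i + w]) for i in range(len(query) - w + 1)]


def neighbourhood(word, threshold, score=lambda a, b: 1 if a == b else 0):
    """Every possible word whose score against `word` reaches the threshold."""
    return [
        "".join(candidate)
        for candidate in product(AMINO_ACIDS, repeat=len(word))
        if sum(score(a, b) for a, b in zip(word, candidate)) >= threshold
    ]


print(query_words("VILST"))
print(len(neighbourhood("VIL", threshold=2)))
```

A query of length L yields L - w + 1 words, and each word expands into a list of high-scoring neighbours that is then matched against the database.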

BLAST

Step 2: compare the word list to the database and identify exact matches
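Exact word matching is fast because the database words can be indexed once and then looked up directly. The sketch below uses a Python dictionary as the index; the sequence names, the toy database and the function names are illustrative assumptions.

```python
from collections import defaultdict


def index_database(database, w=3):
    """Map every word of length w to the (sequence name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for j in range(len(seq) - w + 1):
            index[seq[j:j + w]].append((name, j))
    return index


def find_hits(query, index, w=3):
    """Return (sequence name, query offset, subject offset) for each exact word match."""
    hits = []
    for i in range(len(query) - w + 1):
        for name, j in index.get(query[i:i + w], []):
            hits.append((name, i, j))
    return hits


database = {"seq1": "STVILSLV", "seq2": "MKKLPQR"}
index = index_database(database)
print(find_hits("VILPQR", index))
```

Each hit records where the word sits in the query and in the database sequence, which is what the extension step needs.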

BLAST

Step 3: for each word match, extend the alignment using a PAM matrix and dynamic programming

• searches for 2 non-overlapping segments on the same diagonal
• must be within a certain distance of each other before extension is invoked
• can also allow gaps so that the method joins segments on different diagonals
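The two ideas in this step, requiring two nearby hits on one diagonal and then extending a hit, can be sketched as follows. This is a simplified illustration: it extends without gaps using match/mismatch scores of +1/-1 instead of a PAM matrix, and the window size, the drop-off value `x_drop` and the function names are assumptions.

```python
def two_hit_pairs(hits, w=3, window=40):
    """hits are (query offset, subject offset); pair up non-overlapping hits
    that lie on the same diagonal within `window` residues of each other."""
    by_diagonal = {}
    for q, s in sorted(hits):
        by_diagonal.setdefault(s - q, []).append(q)
    pairs = []
    for diagonal, offsets in by_diagonal.items():
        for first, second in zip(offsets, offsets[1:]):
            if w <= second - first <= window:
                pairs.append((diagonal, first, second))
    return pairs


def extend_ungapped(query, subject, q, s, w=3, match=1, mismatch=-1, x_drop=2):
    """Extend a word hit along its diagonal, right then left, stopping once the
    running score falls more than x_drop below the best score seen."""
    def pair(k):
        return match if query[q + k] == subject[s + k] else mismatch

    best = sum(pair(k) for k in range(w))
    start, end = 0, w  # best segment, relative to the hit position
    score, k = best, w
    while q + k < len(query) and s + k < len(subject):
        score += pair(k)
        if score > best:
            best, end = score, k + 1
        elif best - score > x_drop:
            break
        k += 1
    score, k = best, -1
    while q + k >= 0 and s + k >= 0:
        score += pair(k)
        if score > best:
            best, start = score, k
        elif best - score > x_drop:
            break
        k -= 1
    return best, q + start, q + end


print(two_hit_pairs([(0, 2), (5, 7), (1, 9), (6, 8)]))
print(extend_ungapped("AAVILSTRGG", "KKKVILSTRPP", q=3, s=4))
```

The second call starts from the word ILS and grows it into the full shared segment VILSTR, returning the segment score and its start and end in the query.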

BLAST

Assessing the Significance of Sequence Match

• length - artificially high scores can arise between short sequences
• composition - sequences rich in particular amino acid residues can give high scores for unrelated proteins
• to assess the significance of a match it is necessary to compare the score with that returned by random or unrelated sequences
• if the database is small, or when considering a pair-wise comparison, the sequences can be shuffled to generate random sequences

Assessing the Significance of Scores Returned from a Database Scan

[Figure: frequency distribution of scores from a database scan (frequency against score), marking the mean and standard deviation (s.d.) of the scores for unrelated sequences and the probe score S, which lies S - m above the mean.]

Z score = (score S - mean for unrelated sequences m) / standard deviation (s.d.)

Z value > 3 s.d.: related sequences
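For a pairwise comparison, the background distribution can be generated by shuffling one of the sequences, which keeps its length and composition. The sketch below is one way to do this; the simple identity count used as the score, the 200 shuffles and the fixed random seed are all assumptions made for illustration.

```python
import random
import statistics


def identity_score(a, b):
    """Number of identical residues in an ungapped, position-by-position comparison."""
    return sum(1 for x, y in zip(a, b) if x == y)


def z_score(probe, target, score=identity_score, n_shuffles=200, seed=0):
    """(S - m) / s.d., with m and s.d. taken from scores against shuffled targets."""
    rng = random.Random(seed)
    observed = score(probe, target)
    residues = list(target)
    random_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # same length and composition, random order
        random_scores.append(score(probe, "".join(residues)))
    mean = statistics.mean(random_scores)
    sd = statistics.pstdev(random_scores)
    if sd == 0:
        raise ValueError("shuffled scores show no spread; cannot compute a Z score")
    return (observed - mean) / sd


segment = "VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT"
print(z_score(segment, segment) > 3)
```

Because the shuffled sequences share the composition of the original, a high Z score cannot be explained by compositional bias alone.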

BLAST results

BLAST best hit

>gi|17472322|ref|XP_061555.1| (XM_061555) similar to orphan G protein-coupled receptor GPR26

[Homo sapiens]

Length = 337

Score = 298 bits (762), Expect = 8e-80

Identities = 168/327 (51%)

Query: 1 MGPGEALLAGLLVMVLAVALLSNALVLLCCAYSAELRTRASGVLLVNLSLGHLLLAALDM 60

M A LAGLLV + V+LLSNALVLLC +SA++R +A + +NL+ G+LL ++M

Sbjct: 1 MNSWNAGLAGLLVGTIGVSLLSNALVLLCLLHSADIRRQAPALFTLNLTCGNLLCTVVNM 60

Query: 61 PFTLLGVMRGRTPSAPGACQVIGFLDTFLASNAALSVAALSADQWLAVGFPLRYAGRLRP 120

P TL GV+ R P+ C++ FLDTFLA+N+ LS+AALS D+W+AV FPL Y ++R

Sbjct: 61 PLTLAGVVAQRQPAGDRLCRLAAFLDTFLAANSMLSMAALSIDRWVAVVFPLSYRAKMRL 120

Query: 121 RYAGLLLGCAWGQSLAFSGAALGCSWLGYSSAFASCSLRLPPEPERPRFAAFTATLHAVG 180

R A L++ W +L F AAL SWLG+ +ASC+L ER RFA FT HA+

Sbjct: 121 RDAALMVAYTWLHALTFPAAALALSWLGFHQLYASCTLCSRRPDERLRFAVFTGAFHALS 180

S - score for the pairwise alignment.

E value - number of hits you would expect by chance with score S or higher, given the size of the database and the length of the alignment

Good match: E < 1 x 10^-50

Possible match: E between 1 x 10^-50 and 1 x 10^-2
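The E value bands above are easy to apply in code. In the sketch below the label "not significant" for E values above 1 x 10^-2 is an assumption, since the slide does not name that band. The second function uses the standard bit-score form of the expectation, E = m x n x 2^(-S'), with plain sequence lengths; real searches use effective lengths, so the numbers it gives are only indicative.

```python
def classify_evalue(e_value):
    """Sort a hit into the bands quoted on the slide."""
    if e_value < 1e-50:
        return "good match"
    if e_value <= 1e-2:
        return "possible match"
    return "not significant"  # label assumed; the slide gives no name for this band


def evalue_from_bit_score(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)


print(classify_evalue(8e-80))
print(evalue_from_bit_score(10, 100, 1000))
```

The first call uses the Expect value reported for the best hit in the BLAST output shown earlier, which falls in the good match band.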

Needleman & Wunsch

[Figure: initial score matrix comparing Sequence A (AHCNIRQCLCRPM, across) with Sequence B (AICINRCKCRHP, down). Cells where the two residues are identical score 1; all other cells score 0.]

Needleman & Wunsch Algorithm

• Accumulate the matrix by adding to each cell the highest score in the column or row to the right and below it
• Find the highest scoring path in the matrix by:
    • starting in the top left corner
    • moving down across the matrix from cell to cell
    • choosing the highest scoring cell at each move
    • the path cannot go back on itself or cross the same row or column twice
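The two stages described here, accumulating the matrix and then following the highest scoring path, can be sketched as below. The code follows the slide's scheme of identity scores with no gap penalty rather than the more common formulation with penalties. The function names, the tie-breaking rule (prefer the diagonal move) and the final M added to Sequence B, taken from the alignment shown at the end of the worked example, are assumptions.

```python
def accumulate(seq_a, seq_b):
    """Score matrix with seq_a across (columns) and seq_b down (rows)."""
    rows, cols = len(seq_b), len(seq_a)
    score = [[0] * cols for _ in range(rows)]
    # Work up from the bottom right so the cells below and to the right are ready.
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            best = 0
            if i + 1 < rows and j + 1 < cols:
                next_row = score[i + 1][j + 1:]
                next_col = [score[k][j + 1] for k in range(i + 1, rows)]
                best = max(max(next_row), max(next_col))
            score[i][j] = int(seq_a[j] == seq_b[i]) + best
    return score


def trace_path(score):
    """Start at the best cell of the top row or left column, then keep moving to
    the highest scoring cell in the next row (to the right) or next column (below)."""
    rows, cols = len(score), len(score[0])
    starts = [(0, j) for j in range(cols)] + [(i, 0) for i in range(1, rows)]
    i, j = max(starts, key=lambda cell: score[cell[0]][cell[1]])
    path = [(i, j)]
    while i + 1 < rows and j + 1 < cols:
        moves = [(i + 1, k) for k in range(j + 1, cols)]
        moves += [(k, j + 1) for k in range(i + 2, rows)]
        i, j = max(moves, key=lambda cell: score[cell[0]][cell[1]])
        path.append((i, j))
    return path


def path_to_alignment(seq_a, seq_b, path):
    """Write out both sequences, inserting '-' wherever the path skips residues."""
    out_a, out_b = [], []
    prev_i, prev_j = -1, -1
    end = (len(seq_b), len(seq_a))  # sentinel that flushes trailing residues
    for i, j in path + [end]:
        for k in range(prev_j + 1, j):
            out_a.append(seq_a[k])
            out_b.append("-")
        for k in range(prev_i + 1, i):
            out_a.append("-")
            out_b.append(seq_b[k])
        if (i, j) != end:
            out_a.append(seq_a[j])
            out_b.append(seq_b[i])
        prev_i, prev_j = i, j
    return "".join(out_a), "".join(out_b)


seq_a = "AHCNIRQCLCRPM"
seq_b = "AICINRCKCRHPM"
score = accumulate(seq_a, seq_b)
aligned_a, aligned_b = path_to_alignment(seq_a, seq_b, trace_path(score))
print(aligned_a)
print(aligned_b)
```

Where several cells tie for the highest score, more than one path is equally good, so the alignment printed here may place gaps differently from the one on the slide while containing the same number of identities.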

Accumulating the Matrix

• Add to the score in each cell the highest score from a cell in the row or column to the right and below

[Figure: cell i,j shown with the cells i-1,j-1, i-n,j-1 and i-1,j-m, alongside the accumulated score matrix for Sequence A against Sequence B.]

Possible Moves in Finding a Path across the Matrix

• start in the leftmost column or topmost row
• move to the highest scoring cell in the row or column to the right and below

[Figure: cell i,j shown with the cells i-1,j-1, i-n,j-1 and i-1,j-m, alongside the accumulated score matrix for Sequence A against Sequence B.]

Accumulated score matrix, with Sequence A across and Sequence B down:

     A  H  C  N  I  R  Q  C  L  C  R  P  M
A    8  7  6  6  5  4  4  3  3  2  1  0  0
I    7  7  6  6  6  4  4  3  3  2  1  0  0
C    6  6  7  6  5  4  4  4  3  3  1  0  0
I    6  6  6  5  6  4  4  3  3  2  1  0  0
N    5  5  5  6  5  5  4  3  3  3  1  0  0
R    4  4  4  4  4  5  4  3  3  2  2  0  0
C    3  3  4  3  3  3  3  4  3  3  1  0  0
K    3  3  3  3  3  3  3  3  3  2  1  0  0
C    2  2  3  2  2  2  2  3  2  3  1  0  0
R    2  1  1  1  1  2  1  1  1  1  2  0  0
H    1  2  1  1  1  1  1  1  1  1  1  0  0
P    0  0  0  0  0  0  0  0  0  0  0  1  0

Resulting alignment:

A H C N I - R Q C L C R - P M
A I C - I N R - C K C R H P M
