Remarks About Homework
Write detailed answers
Pay attention to details in the questions
“… nor can the shy man learn…”
Multiple Sequence Alignment (MSA) and Phylogeny
One of the options to get a multiple sequence Fasta file
MSA input: multiple sequence Fasta file
>gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLT
KGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLT
LTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKIDIVVLAFQKASSI
VYKKEGEQVEFSFPLAFTVEKLTGSGELWWQAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPL
HLTLPQALPQYAGSGNLTLALEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAK
VSKREKAVWVLNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCV
RCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
>gi|57113961|ref|NP_001009043.1| CD4 antigen [Pan troglodytes]
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQTKILGNQGSFLT
KGPSKLNDRVDSRRSLWDQGNFTLIIKNLKIEDSDTYICEVGDQKEEVQLLVFGLTANSDTHLLQGQSLT
LTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKIDIVVLAFQKASSI
VYKKEGEQVEFSFPLAFTVEKLTGSGELWWQAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPL
HLTLPQALPQYAGSGNLTLALEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAK
VSKREKAVWVLNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCV
RCRHRRRQAQRMSQIKRLLSEKKTCQCPHRFQKTCSPI
>gi|50054438|ref|NP_001001908.1| CD4 antigen [Sus scrofa]
MDPGTSLRHLFLVLQLAMLPAASGTQEKYLVLGKAGDLAELPCHSSQKKNLPFNWKNSNQTKILGGHGSF
WHTASVTELTSRLDSKKNMWDHGSFPLIIKNLEVTDSGIYICEVEDKRIEVQLLVFRLTASVTRVLLGQS
LTLTLEGPSGSHPTVQWKGPGNKSKNDVKSLLLPQVGLEDSGLWTCTVSQDQKTLVFRSNIFVLAFQKVP
STVYVKEGDQVALSFPLTFEAESLSGELMWRQTKGASSPQSWITFSLKDRKVTVQKSLQNLKLRMAEKLP
LQITLLQALPQYAGSGNLTLVLPEGRLHREVNLVVMRATQSKNEVTCEVLGPTPPKVVLSLKLGNQSMKV
SDQQKLVTVLDPEAGMWRCLLRDKDKVLLESQVEVLPTAFTRAWPELLASVIGGIIGLLFLAGFCIACVK
CWHRRRRAERMSQIKRLLSEKKTCQCAHRQQKNYSLT
>gi|6978631|ref|NP_036837.1| Cd4 molecule [Rattus norvegicus]
MCRGFSFRHLLPLLLLQLSKLLVVTQGKTVVLGKEGGSAELPCESTSRRSASFAWKSSDQKTILGYKNKL
LIKGSLELYSRFDSRKNAWERGSFPLIINKLRMEDSQTYVCELENKKEEVELWVFRVTFNPGTRLLQGQS
LTLILDSNPKVSDPPIECKHKSSNIVKDSKAFSTHSLRIQDSGIWNCTVTLNQKKHSFDMKLSVLGFAST
SITAYKSEGESAEFSFPLNLGEESLQGELRWKAEKAPSSQSWITFSLKNQKVSVQKSTSNPKFQLSETLP
LTLQIPQVSLQFAGSGNLTLTLDRGILYQEVNLVVMKVTQPDSNTLTCEVMGPTSPKMRLILKQENQEAR
VSRQEKVIQVQAPEAGVWQCLLSEGEEVKMDSKIQVLSKGLNQTMFLAVVLGSAFSFLVFTGLCILFCVR
CRHQQRQAARMSQIKRLLSEKKTCQCSHRMQKSHNLI
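To make the structure of a multiple sequence Fasta file concrete, here is a minimal Python sketch that reads one into a dictionary. The function name `parse_fasta` and the two shortened toy records are assumptions made for illustration; they are not part of the original slides, and real files are usually read with an established library.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for a multi-sequence Fasta given as lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # a ">" line starts a new record
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence lines are joined together
    return records


# Toy input (hypothetical, shortened records used only for this sketch)
example = """>seq_one demo record
MNRGVPFRHL
LLVLQLALLP
>seq_two demo record
MDPGTSLRHL
""".splitlines()

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

Each record is a header line starting with ">" followed by one or more sequence lines, which is why the sketch concatenates everything up to the next ">".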
Clustal X
Step 1: Load the sequences
Uploaded sequences
A little unclear…
Edit Fasta headers…
Replace each long header with a short name; the sequences themselves stay unchanged:
>gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens] → >Homo_sapiens_CD4
>gi|57113961|ref|NP_001009043.1| CD4 antigen [Pan troglodytes] → >Pan_troglodytes_CD4
>gi|50054438|ref|NP_001001908.1| CD4 antigen [Sus scrofa] → >Sus_scrofa_CD4
>gi|6978631|ref|NP_036837.1| Cd4 molecule [Rattus norvegicus] → >Rattus_norvegicus_CD4
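Editing the headers by hand works for four sequences; for larger files the same renaming could be scripted. The sketch below is one possible way to do it, assuming every header ends with the species name in square brackets; the helper name `short_header` and the fixed gene label are hypothetical.

```python
import re


def short_header(ncbi_header: str, gene: str = "CD4") -> str:
    """Turn '>gi|...| description [Genus species]' into '>Genus_species_GENE'."""
    match = re.search(r"\[([^\]]+)\]\s*$", ncbi_header)
    if match is None:
        raise ValueError(f"no [species] tag in header: {ncbi_header!r}")
    species = match.group(1).strip().replace(" ", "_")
    return f">{species}_{gene}"


print(short_header(">gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]"))
# prints ">Homo_sapiens_CD4"
```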
Uploaded sequences
Much better
Step 2: Perform alignment
Multiple Sequence Alignment and conservation view
Step 3: Create tree
The Newick tree format is used to represent trees as strings
[Figure: a tree with the leaves A, C, B, and D]
In Newick format: ((A,C),(B,D));
• Each pair of parentheses () encloses a clade in the tree
• A comma “,” separates the members of the corresponding clade
• A semicolon “;” is always the last character
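The three rules above are enough to read simple Newick strings. The following Python sketch parses such a string into nested tuples. It is an illustration only: the name `parse_newick` is hypothetical, and the sketch assumes plain leaf names without branch lengths, bootstrap labels, or whitespace, which real Newick files usually contain.

```python
def parse_newick(text: str):
    """Parse a simple Newick string (leaf names only, no branch lengths).

    Leaves become strings; each clade becomes a tuple of its members.
    """
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume "(" that opens a clade
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # consume "," between clade members
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")" that closes the clade
            return tuple(children)
        start = pos
        while text[pos] not in "(),;":
            pos += 1
        if start == pos:
            raise ValueError(f"expected a leaf name at position {pos}")
        return text[start:pos]

    tree = parse_node()
    if pos != len(text) - 1:
        raise ValueError(f"unexpected text after the tree at position {pos}")
    return tree


tree = parse_newick("((A,C),(B,D));")
print(tree)  # prints (('A', 'C'), ('B', 'D'))
```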
Step 4: View tree with NJPlot
Note: unrooted tree
[Figure: the same unrooted tree with leaves A, B, and C drawn in several equivalent layouts]
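Rooted vs. unrooted trees
[Figure: an unrooted tree with leaves A, B, and C and positions numbered 1, 2, and 3, next to three rooted trees labeled 1, 2, and 3 with leaf orders C B A, B C A, and A B C; the rooted trees differ from one another (≠)]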
How would each tree look in Newick format?
[Figure: the unrooted tree and the three rooted trees from the previous slide]
Rooted trees: ((C,B),A)   ((A,B),C)   ((A,C),B)
Unrooted tree: (A,B,C)
Step 4.5: defining an outgroup
Step 4: View tree with NJPlot
Note: The order inside a split doesn’t matter
[Figure: four drawings of the same Human, Chimp, Gorilla tree with the leaves in different orders]
(Gorilla,(Human,Chimp)) = (Gorilla,(Chimp,Human)) = ((Human,Chimp),Gorilla) = ((Chimp,Human),Gorilla)
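One way to check this equivalence programmatically is to bring each tree into a canonical form by sorting the members of every split, and then compare the results. The sketch below assumes trees are written as nested Python tuples (leaves as strings); the helper name `canonical` is hypothetical.

```python
def canonical(tree):
    """Return a copy of a nested-tuple tree with the members of every split sorted."""
    if isinstance(tree, str):
        return tree  # a leaf has nothing to reorder
    children = [canonical(child) for child in tree]
    # repr() gives a sort key that works for both leaves and sub-tuples
    return tuple(sorted(children, key=repr))


a = ("Gorilla", ("Human", "Chimp"))
b = ("Gorilla", ("Chimp", "Human"))
c = (("Human", "Chimp"), "Gorilla")
d = (("Chimp", "Human"), "Gorilla")
different = (("Gorilla", "Human"), "Chimp")  # a different topology

print(canonical(a) == canonical(b) == canonical(c) == canonical(d))  # prints True
print(canonical(a) == canonical(different))  # prints False
```

Reordering inside a split never changes which leaves are grouped together, so all four spellings collapse to the same canonical tuple, while a tree that groups Gorilla with Human does not.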
How robust is our tree?
We need some statistical way to estimate the confidence in the tree topology (like we need the E-value to estimate the confidence of a blast hit)
But we don’t know anything about the distribution of tree topologies
The only data source we have is our data (MSA)
So, we must rely on our own resources: “pull up by your own bootstraps”
Bootstrap
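Bootstrap
1. Create n (100-1000) new MSAs (pseudo-datasets) by randomly sampling K positions from our original MSA with replacement

Original MSA (positions 1 2 3 4 5 … K):
1 : ATCTG…A
2 : ATCTG…C
3 : ACTTA…C
4 : ACCTA…T

Pseudo-dataset (sampled positions 1 1 2 4 4 … 3):
1 : AATTT…C
2 : AATTT…C
3 : AACTT…T
4 : AACTT…C

Pseudo-dataset (sampled positions 9 7 4 7 8 … 10):
1 : TTTTA…T
2 : CATAC…A
3 : CATAC…T
4 : AGTGG…A

Pseudo-dataset (sampled positions 5 1 5 7 8 … 12):
1 : GAGTA…T
2 : GAGAC…G
3 : AAAAC…A
4 : AAAGG…C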
Bootstrap
2. Reconstruct a pseudo-tree from each pseudo-dataset using the same method used for reconstructing the original tree
[Figure: each of the three pseudo-datasets shown next to the pseudo-tree over Sp1, Sp2, Sp3, and Sp4 reconstructed from it]
Bootstrap
3. For each node in our original tree, we count the number of times it appeared in the pseudo-trees
[Figure: three pseudo-trees over Sp1, Sp2, Sp3, and Sp4, and the original tree with bootstrap values of 67% and 100% on its nodes]
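The three bootstrap steps can be sketched in Python as follows. This is a simplified illustration, not ClustalX's implementation: the function names are hypothetical, trees are nested tuples compared as rooted clades (tools working with unrooted trees compare splits instead), and step 2 is not implemented, so the three pseudo-trees are hand-written examples rather than trees reconstructed from data.

```python
import random
from typing import Dict, FrozenSet, List, Set


def bootstrap_msa(msa: List[str], rng: random.Random) -> List[str]:
    """Step 1: sample as many columns as the MSA has, with replacement."""
    n_cols = len(msa[0])
    if any(len(row) != n_cols for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[c] for c in cols) for row in msa]


def clades(tree) -> Set[FrozenSet[str]]:
    """Collect the set of leaves under every internal node of a nested-tuple tree."""
    found: Set[FrozenSet[str]] = set()

    def leaves(node) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found


def bootstrap_support(original, pseudo_trees: List) -> Dict[FrozenSet[str], float]:
    """Step 3: percentage of pseudo-trees that contain each clade of the original tree."""
    pseudo_clades = [clades(t) for t in pseudo_trees]
    original_clades = clades(original)
    n_leaves = max((len(c) for c in original_clades), default=0)
    support = {}
    for clade in original_clades:
        if len(clade) == n_leaves:
            continue  # the clade holding every leaf is trivially in all trees
        hits = sum(clade in pc for pc in pseudo_clades)
        support[clade] = 100.0 * hits / len(pseudo_trees)
    return support


# Step 1 on the first five columns of the example MSA
msa = ["ATCTG", "ATCTG", "ACTTA", "ACCTA"]
rng = random.Random(0)  # fixed seed so the run is repeatable
pseudo_msa = bootstrap_msa(msa, rng)

# Step 2 is out of scope here: hypothetical, hand-written pseudo-trees
original = (("Sp1", "Sp2"), ("Sp3", "Sp4"))
pseudo_trees = [
    (("Sp1", "Sp2"), ("Sp3", "Sp4")),
    (("Sp2", "Sp1"), ("Sp4", "Sp3")),
    ((("Sp1", "Sp2"), "Sp3"), "Sp4"),
]
support = bootstrap_support(original, pseudo_trees)
for clade, pct in sorted(support.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(clade), f"{pct:.0f}%")
# prints ['Sp1', 'Sp2'] 100% and ['Sp3', 'Sp4'] 67%
```

In this toy run the Sp1/Sp2 grouping appears in all three pseudo-trees, while the Sp3/Sp4 grouping appears in two of three, which gives the kind of 100% and 67% labels a tree viewer displays on the nodes.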
Step 3.5 - Bootstrap
Bootstrap values on NJPlot
Note: ClustalX saves trees with a .ph extension. Trees with bootstrap values are saved with a .phb extension
Reconstructing the tree of life
Darwin’s vision of the tree of life from the Origin of Species
Based on molecular data (SSU rRNA), branching of several kingdoms remains in dispute
Lateral Gene Transfer (LGT) Challenges the Conceptual Basis of Phylogenetic Classification
Science, 3 March 2006: Vol. 311, no. 5765, pp. 1283-1287
Toward Automatic Reconstruction of a Highly Resolved Tree of Life
Methodology
• Started with 36 genes universally present in 191 species (spanning all 3 domains of life), for which orthologs could be unambiguously identified
• Eliminated 5 genes that are LGT suspects (mostly tRNA synthetases)
• Constructed an MSA for each of the 31 orthogroups
• Concatenated all 31 MSAs to a super-MSA of 8090 columns
• The phylogeny was reconstructed based on the super-MSA using the maximum likelihood approach
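The concatenation step can be illustrated with a short Python sketch. It is not the pipeline used in the paper: the function name `concatenate_msas`, the toy alignments, and the gap-padding of missing species are assumptions made for this example (in the study above the genes were present in all species, so no padding would be needed).

```python
from typing import Dict, List


def concatenate_msas(msas: List[Dict[str, str]]) -> Dict[str, str]:
    """Join per-gene alignments species by species into one super-MSA."""
    species = sorted(set().union(*(msa.keys() for msa in msas)))
    parts: Dict[str, List[str]] = {name: [] for name in species}
    for msa in msas:
        lengths = {len(seq) for seq in msa.values()}
        if len(lengths) != 1:
            raise ValueError("rows of one MSA must all have the same length")
        width = lengths.pop()
        for name in species:
            # assumption: a species missing from a gene is padded with gaps
            parts[name].append(msa.get(name, "-" * width))
    return {name: "".join(chunks) for name, chunks in parts.items()}


# Two toy per-gene alignments (hypothetical data)
gene1 = {"sp1": "ACG-", "sp2": "ACGT"}
gene2 = {"sp1": "TT", "sp2": "TA"}

super_msa = concatenate_msas([gene1, gene2])
print(super_msa)  # prints {'sp1': 'ACG-TT', 'sp2': 'ACGTTA'}
```

The number of columns in the super-MSA is the sum of the column counts of the individual alignments, here 4 + 2 = 6.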
[Figure: the reconstructed tree of life, with the Archaea, Eukaryota, and Bacteria labeled]
Tree support
• 81.7% of the branches show bootstrap support of over 80%
• 65% of the branches show bootstrap support of 100%
• However, several deep branchings show low support