Biochem 711 – 2008 1

Alignment & Evolution with MEGA

L11: Alignments Evolution: MEGA

Table of Contents

Introduction

Acknowledgements

L11 Exercise A: Set up
  1. Launch MEGA
  2. Retrieve Sequence

L11 Exercise B: BLAST and Align within MEGA
  1. Launch MEGA web browser
  2. BLAST search within MEGA
    2.1. Paste sequence
    2.2. Select the database to be searched
    2.3. Optimization algorithm: blastn
    2.4. Press BLAST
    2.5. BLAST results
    2.6. Selecting results for alignment
  3. Preparing the Alignment within MEGA
    3.1. Add first sequence to alignment
    3.2. Add additional sequences to be aligned
    3.3. Save the current list
  4. Create the Alignment
    4.1. Algorithm: ClustalW
    4.2. Perform the alignment
    4.3. Adjustments to the Aligned Sequences
    4.4. Adjustments to the Alignment

L11 Exercise C: Calculate a Neighbor-Joining Tree
  1. Open alignment file
  2. Activate Neighbor-Joining

L11 Exercise D: Precision in Acquiring and Aligning Sequences
  1. Acquiring Query Sequence
  2. BLAST within MEGA
    2.1. Set-up
    2.2. BLAST results
  3. Build the alignment list
    3.1. Edit Sequence Names
    3.2. Edit START codons
  4. Translate to Protein
  5. Set parameters and calculate protein alignment
  6. Alignment adjustments
  7. Export alignment in DNA and protein forms
  8. Eliminate duplicate sequences
  9. Eliminate inadequate sequences
    9.1. Biology and Structure of the protein
    9.2. Remove sequence
  10. Estimate reliability of alignment with Average AA Identity

L11 Exercise E: Neighbor-Joining Phylogenetic Tree, Rooting
  1. Create Neighbor-Joining Tree
  2. Estimating the reliability of a tree: Bootstrapping
  3. Tree Rooting
    3.1. Finding an outgroup
    3.2. Rooting the tree

L11 Exercise F: End of laboratory

Introduction

✔ INFO http://en.wikipedia.org/wiki/ /Evolution, /Phylogenetics, /Phylogenetic_tree

In biology, evolution refers to changes in the inherited traits of a population of organisms from one generation to the next. Genes that are passed on to an organism's offspring produce the inherited traits that are the basis of evolution. Phylogenetics is the study of evolutionary relatedness among various groups of organisms (e.g., species, populations). A phylogenetic tree, also called an evolutionary tree, is a tree showing the evolutionary relationships among various biological species that are believed to have a common ancestor. In a phylogenetic tree, each node with descendants represents the most recent common ancestor of the descendants, and the edge lengths in some trees correspond to time estimates. Each node is called a taxonomic unit. Taxonomy is the classification of organisms according to similarity. Although phylogenetic trees produced on the basis of sequenced genes or genomic data in different species can provide evolutionary insight, they have important limitations. They do not necessarily represent the species' evolutionary history accurately.

✔ INFO http://en.wikipedia.org/wiki/Multiple_sequence_alignment

A multiple sequence alignment is a sequence alignment of three or more biological sequences. In general, the input set of query sequences is assumed to have an evolutionary relationship by which the sequences share a lineage and are descended from a common ancestor. From the resulting multiple sequence alignment, sequence homology can be inferred (homology refers to any similarity between characters that is due to their shared ancestry) and phylogenetic analysis can be conducted to assess the shared evolutionary origins amongst the sequences. In practical terms, a tree is constructed from a multiple alignment of homologous sequences. The quality of the alignment is the most influential factor for the calculated trees. In these exercises we will use the MEGA software, which can retrieve sequences, create a multiple sequence alignment with the Clustal algorithm and calculate a tree with various methods.

Quoting from the web site http://www.megasoftware.net/:

MEGA 4: Molecular Evolutionary Genetics Analysis

MEGA is an integrated tool for conducting automatic and manual sequence alignment, inferring phylogenetic trees, mining web-based databases, estimating rates of molecular evolution, and testing evolutionary hypotheses.

References:

Kumar S, Dudley J, Nei M & Tamura K (2008) MEGA: A biologist-centric software for evolutionary analysis of DNA and protein sequences. Briefings in Bioinformatics 9: 299-306.

Tamura K, Dudley J, Nei M & Kumar S (2007) MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Molecular Biology and Evolution 24: 1596-1599.

Acknowledgements

This laboratory is loosely inspired by Barry G. Hall’s book “Phylogenetic Trees Made Easy: A How-to Manual, Third Edition.”

L11 Exercise A: Set up

MEGA 4 has been tested on the following Microsoft Windows® operating systems: Windows 95/98, NT, 2000, XP, and Vista. If you are working from home you can download MEGA from http://www.megasoftware.net/

Your DMC computer should be running in Windows mode. If not, ask for help.

1. Launch MEGA

✔ TASK Launch MEGA with the menu cascade Start > All Programs > Mega 4

2. Retrieve Sequence

✔ INFO The first complete E. coli genome was announced on Sept 5th 1997 in the journal Science¹ as a milestone in complete genome elucidation. In the following example we will use the nuoL gene, defined as “NADH:ubiquinone oxidoreductase, membrane subunit L” within the complete genome entry.

Note: database searches now often find complete genomes. The reference to a particular gene is obtained by the “begin” and “end” values specifying a sequence “region.” For example, the nuoL gene is defined for the complete genome of K12, which has the accession value of CP000948:

LOCUS       CP000948     1842 bp    DNA     linear   BCT 05-JUN-2008
DEFINITION  Escherichia coli str. K12 substr. DH10B, complete genome.
ACCESSION   CP000948 REGION: 2482992..2484833
VERSION     CP000948.1  GI:169887498
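As a quick check of how the “begin” and “end” values relate to the 1842 bp reported on the LOCUS line, the sketch below treats the region as 1-based and inclusive, which is how GenBank coordinates are normally read. The function names are illustrative only and are not part of MEGA or NCBI tools.

```python
def region_length(begin: int, end: int) -> int:
    """Length of a 1-based, inclusive sequence region."""
    if begin < 1 or end < begin:
        raise ValueError("expected 1 <= begin <= end")
    return end - begin + 1


def extract_region(genome: str, begin: int, end: int) -> str:
    """Return bases begin..end (1-based, inclusive) of a genome string."""
    region_length(begin, end)  # validates the coordinates
    return genome[begin - 1:end]


# REGION: 2482992..2484833 from the record above
print(region_length(2482992, 2484833))  # 1842, matching the LOCUS line
```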

✔ TASK

The nuoL sequence will be provided to you as a text file, either within the “Classroom Scratch” directory or in the resources section of virology.wisc.edu/acp.

Open the file, highlight and copy the sequence to the clipboard. This is the “Query sequence.”

1 Blattner FR et al. The complete genome sequence of Escherichia coli K-12. Science. 1997 Sep 5; 277(5331):1453-74.

L11 Exercise B: BLAST and Align within MEGA

1. Launch MEGA web browser

✔ TASK MEGA has a built-in web browser that we will use to find and retrieve sequences. Follow the menu Alignment > Do BLAST Search to launch the internal browser that goes directly to NCBI BLAST.

2. BLAST search within MEGA

2.1. Paste sequence

✔ TASK Paste the query sequence from the clipboard into the “Enter Query Sequence” window.

2.2. Select the database to be searched

✔ TASK The default database is the Human genome. Change to the non-redundant (nr) complete nucleotide database collection.

2.3. Optimization algorithm: blastn

Note: The default search should be blastn.

2.4. Press BLAST

Before pressing the BLAST button you can optionally choose to “Show results in a new window.” Verify that the parameters are: Search database nr using Blastn (Optimize for somewhat similar sequences)

✔ Press BLAST. A new job page will appear and update within seconds. If a warning is presented, simply press OK.

2.5. BLAST results

✔ READ The result page first shows a graphical overview of the finds. The length of the bar is an indication of the region of similarity with the query sequence.

The bar is colored by alignment score according to the color key above. For example, bars colored red indicate a similarity score above 200. The graphical output is followed by a table showing the sequence accession number, description and scores. Databases are updated constantly and your results will likely be somewhat different. Shown below are the top and bottom portions of the output as of this writing. The Max score is the bit score of the best-scoring aligned segment for that database sequence. For a nucleotide search (blastn) it is computed from match rewards and mismatch and gap penalties; a comparison matrix such as BLOSUM62 applies only to protein searches. The E-value is defined at http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html Note: the “Max ident” (maximum identity) can be at 100% at the bottom of the table as well, but in these cases the “Query coverage” is much lower and the E-values are very high.
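To make the link between score and E-value concrete, here is a minimal sketch of the relationship given in the tutorial linked above, E = m × n × 2^(−S′), where S′ is the bit score and m and n are the query and database lengths. BLAST itself uses length-corrected ("effective") values, so numbers computed this way are only approximate; the function name is an illustration, not a BLAST API.

```python
def expect_value(bit_score: float, query_length: int, database_length: int) -> float:
    """Approximate E-value: chance hits expected with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Each additional bit of score halves the expected number of chance hits.
print(expect_value(40.0, 300, 10**9) / expect_value(41.0, 300, 10**9))  # 2.0
```

The practical reading is the one used in the text: a low E-value means the hit is unlikely to have arisen by chance, and a short match can be 100% identical yet still have a high E-value.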

//

Finally the pair-wise alignments are shown, sorted by increasing E-value from the most significant to the least significant (the last one is shown below):

>gi|60650283|gb|BT021192.1| Bos taurus SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily b, member 1 (SMARCB1), mRNA, complete cds
Length=1491

GENE ID: 537412 SMARCB1 | SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily b, member 1 [Bos taurus] (10 or fewer PubMed links)

Score = 41.1 bits (21), Expect = 3.0
Identities = 29/33 (87%), Gaps = 0/33 (0%)
Strand=Plus/Plus

Query  77  CCGATACTCGCTTCTGCCGCCGCGAGGCTGATG  109
           ||||| ||||||||||||||||| | | |||||
Sbjct  20  CCGATCCTCGCTTCTGCCGCCGCAATGATGATG  52
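The “Identities = 29/33” figure for the pairwise alignment above can be reproduced by counting the columns in which the two aligned rows carry the same base. The sketch below is only an illustration of that count (the function name is made up); it is not how BLAST is implemented.

```python
def count_identities(query: str, subject: str) -> tuple[int, int]:
    """Return (identical columns, alignment length) for two aligned rows."""
    if len(query) != len(subject):
        raise ValueError("aligned rows must have the same length")
    matches = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    return matches, len(query)


query = "CCGATACTCGCTTCTGCCGCCGCGAGGCTGATG"
sbjct = "CCGATCCTCGCTTCTGCCGCCGCAATGATGATG"
matches, length = count_identities(query, sbjct)
print(matches, length, int(100 * matches / length))  # 29 33 87
```

Note that 29/33 is about 87.9%; the report above shows it as 87%.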

2.6. Selecting results for alignment

✔ READ To create a phylogenetic tree we only want to include homologs: sequences that have a common ancestor. This is the general assumption in phylogenetics. We can select which sequences we want to include as judged by the pair-wise sequence alignment.

✔ TASK Click on the Max Score for the first result in the list, here 583 for “Escherichia coli str. K12 substr. DH10B, complete genome.” This is a direct link to the alignment:

>gi|169887498|gb|CP000948.1| Escherichia coli str. K12 substr. DH10B, complete genome
Length=4686137

Features in this part of subject sequence:
  NADH:ubiquinone oxidoreductase, membrane subunit L
  NADH:ubiquinone oxidoreductase, membrane subunit K

Score = 583 bits (303), Expect = 2e-163
Identities = 303/303 (100%), Gaps = 0/303 (0%)
Strand=Plus/Plus

Query  1        TCATCCGCGCATCTCACTTACTGAATCGATGTTCAGGTTCTGGCGACGACGGTGAAGTTG  60
                ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2484830  TCATCCGCGCATCTCACTTACTGAATCGATGTTCAGGTTCTGGCGACGACGGTGAAGTTG  2484889

Query  61       CAGCAGCAGCGCAAGGCCGATACTCGCTTCTGCCGCCGCGAGGCTGATGGCGAGAATGTA  120
                ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2484890  CAGCAGCAGCGCAAGGCCGATACTCGCTTCTGCCGCCGCGAGGCTGATGGCGAGAATGTA  2484949

Query  121      CATCACCTGACCGTCGGTCTGGCCCCAGTAGCTTCCGGCGACCACGAAGGCCAGCGCGGA  180
                ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2484950  CATCACCTGACCGTCGGTCTGGCCCCAGTAGCTTCCGGCGACCACGAAGGCCAGCGCGGA  2485009

Query  181      GGCGTTAATCATGATTTCCAGACCAATCAACATAAACAGCAGATTGCGACGGATAACCAG  240
                ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2485010  GGCGTTAATCATGATTTCCAGACCAATCAACATAAACAGCAGATTGCGACGGATAACCAG  2485069

Query  241      ACCGGTTAAGCCAAGAACGAATAAGATTGCCGCGAGGATCAGTCCATGTTGTAAGGGGAT  300
                ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2485070  ACCGGTTAAGCCAAGAACGAATAAGATTGCCGCGAGGATCAGTCCATGTTGTAAGGGGAT  2485129

Query  301      CAT  303
                |||
Sbjct  2485130  CAT  2485132

This is a perfect match: 303 out of 303 and no gaps. It is in fact the query sequence that we will want to include in the tree. With a right-mouse click, open the subunit L in a new browser window.

The resulting window will show the information relevant to the gene as taken out of the complete genome sequence thanks to the “Range: from” option which is filled automatically when opening the link:

LOCUS       CP000948     1842 bp    DNA     linear   BCT 05-JUN-2008
DEFINITION  Escherichia coli str. K12 substr. DH10B, complete genome.
ACCESSION   CP000948 REGION: 2482992..2484833
VERSION     CP000948.1  GI:169887498
PROJECT     GenomeProject:20079
KEYWORDS    .
SOURCE      Escherichia coli str. K12 substr. DH10B
  ORGANISM  Escherichia coli str. K12 substr. DH10B
            Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
            Enterobacteriaceae; Escherichia.
REFERENCE   1  (bases 1 to 1842)
  AUTHORS   Durfee,T., Nelson,R., Baldwin,S., Plunkett,G. III, Burland,V.,
            Mau,B., Petrosino,J.F., Qin,X., Muzny,D.M., Ayele,M., Gibbs,R.A.,
            Csorgo,B., Posfai,G., Weinstock,G.M. and Blattner,F.R.
  TITLE     The complete genome sequence of Escherichia coli DH10B: insights
            into the biology of a laboratory workhorse
  JOURNAL   J. Bacteriol. 190 (7), 2597-2606 (2008)
   PUBMED   18245285
REFERENCE   2  (bases 1 to 1842)
  AUTHORS   Plunkett,G. III.
  TITLE     Direct Submission
  JOURNAL   Submitted (20-FEB-2008) Department of Genetics and Biotechnology,
            University of Wisconsin, 425G Henry Mall, Madison, WI 53706, USA
COMMENT     DH10B and DH10B-T1R are available from Invitrogen Corporation
            (http://www.invitrogen.com).
FEATURES             Location/Qualifiers
     source          1..1842
                     /organism="Escherichia coli str. K12 substr. DH10B"
                     /mol_type="genomic DNA"
                     /strain="K-12"
                     /sub_strain="DH10B"
                     /db_xref="taxon:316385"
     gene            complement(1..1842)
                     /gene="nuoL"
                     /locus_tag="ECDH10B_2440"
     CDS             complement(1..1842)
                     /gene="nuoL"
                     /locus_tag="ECDH10B_2440"
                     /codon_start=1
                     /transl_table=11
                     /product="NADH:ubiquinone oxidoreductase, membrane subunit L"
                     /protein_id="ACB03437.1"
                     /db_xref="GI:169889730"
                     /db_xref="ASAP:AEC-0002144"
                     /translation="MNMLALTIILPLIGFVLLAFSRGRWSENVSAIVGVGSVGLAALV
                     TAFIGVDFFANGEQTYSQPLWTWMSVGDFNIGFNLVLDGLSLTMLSVVTGVGFLIHMY
                     ASWYMRGEEGYSRFFAYTNLFIASMVVLVLADNLLLMYLGWEGVGLCSYLLIGFYYTD
                     PKNGAAAMKAFVVTRVGDVFLAFALFILYNELGTLNFREMVELAPAHFADGNNMLMWA
                     TLMLLGGAVGKSAQLPLQTWLADAMAGPTPVSALIHAATMVTAGVYLIARTHGLFLMT
                     PEVLHLVGIVGAVTLLLAGFAALVQTDIKRVLAYSTMSQIGYMFLALGVQAWDAAIFH
                     LMTHAFFKALLFLASGSVILACHHEQNIFKMGGLRKSIPLVYLCFLVGGAALSALPLV
                     TAGFFSKDEILAGAMANGHINLMVAGLVGAFMTSLYTFRMIFIVFHGKEQIHAHAVKG
                     VTHSLPLIVLLILSTFVGALIVPPLQGVLPQTTELAHGSMLTLEITSGVVAVVGILLA
                     AWLWLGKRTLVTSIANSAPGRLLGTWWYNAWGFDWLYDKVFVKPFLGIAWLLKRDPLN
                     SMMNIPAVLSRFAGKGLLLSENGYLRWYVASMSIGAVVVLALLMVLR"
     gene            complement(1839..>1842)
                     /gene="nuoK"
                     /locus_tag="ECDH10B_2441"
     CDS             complement(1839..>1842)
                     /gene="nuoK"
                     /locus_tag="ECDH10B_2441"
                     /codon_start=1
                     /transl_table=11
                     /product="NADH:ubiquinone oxidoreductase, membrane subunit K"
                     /protein_id="ACB03438.1"
                     /db_xref="GI:169889731"
                     /db_xref="ASAP:AEC-0002145"
                     /translation="MIPLQHGLILAAILFVLGLTGLVIRRNLLFMLIGLEIMINASAL
                     AFVVAGSYWGQTDGQVMYILAISLAAAEASIGLALLLQLHRRRQNLNIDSVSEMRG"
ORIGIN
        1 tcaacgcagt accatcaaca gtgccagcac cacgaccgca ccgatgctca tggatgccac
       61 ataccagcgc agatagccgt tctcacttaa cagcagacct ttacctgcaa agcgggaaag
      121 gacagccggg atgttcatca ttgagttcag cggatcgcgt ttcagcaacc aggcaatacc
      181 caggaacggc ttgacgaaca ctttgtcata cagccagtca aatccccagg cgttgtacca
      241 ccaggtgccc agcagacggc ccggcgcact gttggcgatg gaggtcacca gagtacgttt
      301 acccagccac agccaggctg ccagcagaat gccgaccacc gcgaccacgc cagaggtaat
      361 ttccagggtc aacatgctgc cgtgcgccag ttccgtcgtt tgcggaagca cgccctgcag
      421 cggcggtaca atcagtgcgc caacgaaggt ggaaaggatc agcagcacaa tcagcggcag
      481 gctgtgagtt acccctttca cggcgtgagc gtgaatttgt tcttttccgt ggaagacgat
      541 gaaaatcata cggaaggtgt agagcgaggt cataaacgca ccgaccagac ctgccaccat
      601 cagattgata tgaccattcg ccatcgcacc cgcgaggatc tcatccttac tgaagaagcc
      661 cgcagtgacc agcggtagtg ccgacagtgc tgcgccgccc accaggaagc agagataaac
      721 cagcggaata gatttacgca gaccgcccat cttgaagatg ttctgttcgt gatggcaggc
      781 cagaatgacg gaaccggatg ccaggaacag cagcgcttta aagaacgcgt gggtcatcaa
      841 gtggaaaatc gccgcatccc atgcctgcac gccaagcgcg aggaacatgt agccaatctg
      901 gctcatggta gagtaagcga gaacacgttt gatgtcggtc tgtaccagcg cggcaaaacc
      961 ggccagcagc agcgtaaccg ccccgacaat acccaccaga tgcagaactt ccggcgtcat
     1021 caggaacagg ccgtgggtac gggcgatcag gtagacaccc gcggttacca tggttgcggc
     1081 gtggatcagc gcggagacag gcgtcgggcc cgccatcgcg tcggcaagcc atgtctgcaa
     1141 cggcaactgc gcagatttac cgaccgcacc gcccagcagc atcagcgtcg cccacatcag
     1201 catgttattg ccgtcagcaa agtgcgctgg tgccagttcc accatttcgc ggaagttcag
     1261 ggtgcccagt tcgttgtaaa gaatgaacag tgcgaaagcg aggaacacgt cacccacacg
     1321 ggtcacgacg aacgctttca ttgccgctgc gccattcttc ggatcggtgt aatagaaccc
     1381 gatcagcaga taggagcaca ggcccacgcc ttcccagccg aggtacatca gcagcaggtt
     1441 gtcggcaagc accagaacca ccatgctggc gatgaacagg ttggtgtaag cgaagaagcg
     1501 agagtagccc tcttcaccgc gcatatacca ggaggcgtac atgtgaataa ggaaacccac
     1561 accagtgacc accgagagca tggtcagcga caggccgtcc agcaccaggt taaaaccgat
     1621 gttaaagtcg cctaccgaca tccacgtcca cagcggctgg ctgtatgtct gctcgccgtt
     1681 agcgaagaaa tcaacgccga taaaggcggt taccagcgcc gccaggccca cagagcctac
     1741 gccgacgatc gccgagacgt tttcagacca gcgcccacgg gagaatgcca gcaggacgaa
     1801 gccaatcaat ggcaaaataa tggttaaggc aagcatgttc at
//


3. Preparing the Alignment within MEGA

3.1. Add first sequence to alignment

✔ TASK Click on the button with the red + sign (Add to Alignment). The addition will be acknowledged by MEGA and will start the list of sequences to be aligned. Click OK. The MEGA AlnExplorer will start, perhaps behind the current browser window. Close or move the browser window to reveal the M4 Alignment Explorer.

3.2. Add additional sequences to be aligned

✔ TASK Add new sequences to the alignment list. In a similar manner, explore the alignments provided by BLAST. Check the E-value, and choose sequences that are at least 50% similar over the length of the query sequence. Choose one sequence per species. There would be no gain of evolutionary information from including too many identical E. coli sequences or sequences that are 100% identical. Concentrate on chains named L. The nucleotide sequence should be in the 1800 bases range. If you make a mistake, you can remove a sequence already sent to the M4 Alignment Explorer by right-clicking on the sequence and selecting Cut from the pull-down menu.
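The selection rules above (at least 50% similarity, one sequence per species, length near 1800 bases) can be written down as a small filter. This is only a sketch of the reasoning: the `Hit` record, the species labels and the 1700 to 1900 base window are assumptions made for the example, and in the laboratory the selection is done by eye in the BLAST result page.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST hit; a hypothetical summary, not MEGA or NCBI output."""
    species: str
    percent_identity: float
    length: int


def select_hits(hits, min_identity=50.0, min_length=1700, max_length=1900):
    """Keep at most one hit per species that passes the identity and length filters."""
    chosen = {}
    for hit in hits:
        if hit.percent_identity < min_identity:
            continue
        if not min_length <= hit.length <= max_length:
            continue
        # The first qualifying hit per species is kept; later ones are ignored.
        chosen.setdefault(hit.species, hit)
    return list(chosen.values())


hits = [
    Hit("Escherichia coli", 100.0, 1842),
    Hit("Escherichia coli", 100.0, 1842),    # same species again, skipped
    Hit("Species B (made up)", 78.0, 1830),
    Hit("Species C (made up)", 42.0, 1835),  # below 50% identity
    Hit("Species D (made up)", 81.0, 960),   # too short
]
selected = select_hits(hits)
print([h.species for h in selected])
```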

Note: instead of the “Open in New Window” you can also simply click on the score link and click the back arrow to go back to the list. When you are done, your M4 Alignment Explorer should look similar to the following which exhibits 15 sequences: Note: The base background coloring can be toggled on/off from the Display menu.

3.3. Save the current list

✔ TASK Save the alignment list into a file with the menu cascade: Data > Export Alignment > MEGA Format. Enter a file name, e.g. nuoL; the .meg filename extension is supplied by MEGA.

Save the file on the Desktop. When prompted, give a title in answer to the first question and answer YES to confirm that these are protein-coding sequences.

✔ TASK Close the MEGA web browser

Note: if you prefer you can copy the file NuoL.meg from the Classes Scratch drive or download it from http://virology.wisc.edu/acp

4. Create the Alignment

✔✔ READ Defining an alignment. Excerpt from: http://en.wikipedia.org/wiki/Sequence_alignment A sequence alignment is a way of arranging the primary sequences (here as DNA) to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Since we want to build a tree, we assume that the sequences are homologs and derive from a common ancestor. During the alignment gaps are inserted between the residues so that residues with identical or similar characters are aligned in successive columns. If two sequences in an alignment share a common ancestor, mismatches can be interpreted as point mutations and gaps can be interpreted as indels (insertion or deletion mutations) introduced in one or both lineages in the time since they diverged from one another.
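To illustrate how gaps are inserted so that matching residues line up in columns, here is a minimal sketch of a classic pairwise global alignment (Needleman-Wunsch dynamic programming). It is not the ClustalW implementation embedded in MEGA, and the scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions chosen for the example.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Needleman-Wunsch global alignment; returns the two gapped rows."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")  # gap in b: an indel relative to a
            i -= 1
        else:
            row_a.append("-")  # gap in a
            row_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b))


top, bottom = global_align("GATTACA", "GATACA")
print(top)
print(bottom)
```

In the printed result the shorter sequence carries a single gap character, which would be interpreted as an indel if the two sequences share a common ancestor.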

4.1. Algorithm: ClustalW

MEGA uses an embedded version of ClustalW (command line interface) to perform the alignment. Clustal creates the alignment in 3 main steps (see http://en.wikipedia.org/wiki/Clustal):

- Perform pair-wise alignment of all sequences.
- Create a simple phylogenetic tree based on similarity distance.
- Use the phylogenetic tree to carry out a multiple alignment.

The basis of similarity is a comparison table (scoring matrix): BLOSUM62 for proteins and mostly identities for nucleic acids.

4.2. Perform the alignment

✔ TASK From the Alignment menu choose Align by ClustalW

Confirm that you want to select all the sequences for the alignment: click OK

This will bring up the ClustalW parameters window. For now keep the suggested values and press OK.

The progress bar shows the advance of the Clustal algorithm from pairwise alignments to multiple alignment.

When the alignment is done, the main window will flash and refresh with the new alignment, shown below at the 5’ and 3’ ends

//

This “default alignment” is the best alignment that ClustalW can create given the default parameters. It is unlikely to be the “best” alignment, and manual adjustments would have to be made to compensate for some of the flaws introduced by the algorithm, such as the splitting of codons. Below we will perform a few adjustments, and a better alignment will be carried out in a later exercise.

4.3. Adjustments to the Aligned Sequences

✔✔ READ The following adjustments are made ad hoc for this example and may or may not apply in the exact same way if you have selected different sequences.

4.3.1. Remove 2 sequences

In this example 2 sequences (Klebsiella pneumoniae 342 and Erwinia tasmaniensis strain ET1/99) appear shorter (and indeed do not match until base 960) and will be removed with the right-click and Cut menu options. The example will remain with 13 sequences.

If you have inadvertently incorporated sequences that are shorter, you may wish to remove them at this time. Note: now that the 2 short sequences are removed, asterisks (*) are shown on some of the top squares, indicating columns where all bases are identical.

4.3.2. Reverse complement sequences

The sequences that are aligned are the reverse complement of the coding sequence. The sequences will therefore be changed within MEGA.

Hint: check the 3’ end of your sequences. If they end with CAT (the reverse complement of ATG, the DNA form of the AUG methionine start codon), then this is likely the case.
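The hint can be checked with a few lines of code. The sketch below reverse-complements a DNA string and tests whether a sequence ends with CAT; the function names are illustrative and do not correspond to anything in MEGA, which performs this operation from its menus.

```python
# Complement of each base; U is accepted so that RNA input is also handled.
_COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the string."""
    return dna.translate(_COMPLEMENT)[::-1]


def looks_reversed(dna: str) -> bool:
    """True if the sequence ends with CAT, the reverse complement of an ATG start."""
    return dna.upper().endswith("CAT")


print(reverse_complement("CAT"))      # ATG
print(looks_reversed("aagcatgttcat"))  # True: the last bases of the nuoL region
```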

✔ TASK Follow these directions if your sequences are reversed: Edit > Select All, then Data > Reverse Complement (result)

The ATG is highlighted at the 5’ end of the converted sequences. Asterisks (*) mark columns in which all bases are identical.

4.4. Adjustments to the Alignment

If you have reversed your sequences they should all start with ATG, and all ATGs should be aligned within the same column. If they are not, you can adjust this part of the alignment in the same manner as we will adjust the 3’ end below.

4.4.1. Adjusting the 3’ end

These are coding sequences and each ends with a terminator codon. All three terminator codons are represented, in their DNA form: UAA (Ochre), UAG (Amber) and UGA (Opal). By definition these are the termination signals, and these 3 columns should be aligned.
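As a rough illustration, the sketch below names the terminator codon at the 3’ end of a coding sequence written as DNA (TAA, TAG, TGA) and checks the ATG start expected in this exercise. The helper names are made up for the example, and the check is deliberately simple: alternative start codons exist in bacteria, so a failed check is a prompt to look at the sequence, not proof of an error.

```python
# DNA forms of the three terminator codons and their traditional names.
STOP_CODONS = {"TAA": "Ochre", "TAG": "Amber", "TGA": "Opal"}


def terminator_name(coding_dna: str):
    """Return the name of the stop codon that ends the sequence, or None."""
    return STOP_CODONS.get(coding_dna[-3:].upper())


def is_plausible_cds(coding_dna: str) -> bool:
    """Whole number of codons, ATG at the 5' end, stop codon at the 3' end."""
    seq = coding_dna.upper()
    return len(seq) % 3 == 0 and seq.startswith("ATG") and seq[-3:] in STOP_CODONS


print(terminator_name("ATGAAATGA"))   # Opal
print(is_plausible_cds("ATGAAATAA"))  # True
```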

✔ TASK Select the terminator bases for as many adjacent sequences as possible. Then move the selected bases towards the 3’ end of the sequence by repeatedly clicking on the toolbar button that moves selected blocks rightward. Repeat for the remaining terminator sequences until all 3’ ends are aligned.

Save the file again (Data > Export Alignment), perhaps with a new name, e.g. NuoL_edited.meg. Answer YES to the question about coding proteins. We now have an alignment from which we can build a tree. See next Exercise.

✔ TASK You can save the alignment in the .mas binary format with the following menu cascade: Data > Save Session. Call the new file e.g. NuoL_Edited.mas. The .mas format is binary, while the .meg format is plain text. If you are curious you can open them with WordPad.

For the next segment we will need the .meg file.

✔ TASK Close the Alignment Explorer.

Note: if you prefer you can copy the file NuoL_Edited.meg from the Classes Scratch drive or download it from http://virology.wisc.edu/acp

L11 Exercise C: Calculate a Neighbor-Joining Tree

✔ INFO http://en.wikipedia.org/wiki/Neighbor_joining

The neighbor-joining iterative algorithm requires knowledge of the distance between each pair of sequences in the tree, which is constructed in a step-wise fashion. Each iteration consists of the following steps:

1) From the current pairwise distances, calculate a matrix Q with one value for each pair of sequences.
2) Find the pair of sequences in Q with the lowest value. Create a node on the tree that joins these two sequences (i.e. join the closest neighbors, as the algorithm name implies).
3) Calculate the distance of each of the sequences in the pair to this new node.
4) Calculate the distance of all sequences outside of this pair to the new node.
5) Start the algorithm again, considering the pair of joined neighbors as a single sequence (taxon) and using the distances calculated in the previous step.

Neighbor-joining is based on the minimum-evolution criterion for phylogenetic trees, i.e. the topology that gives the least total branch length is preferred at each step of the algorithm. This algorithm has been extensively tested and is statistically consistent under many models of evolution. Hence, given data of sufficient length, neighbor-joining will reconstruct the true tree with high probability.
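The five steps can be followed on a small example. The sketch below performs one neighbor-joining iteration using the standard textbook formulas, Q(i,j) = (n − 2)·d(i,j) − Σd(i,k) − Σd(j,k) for step 1, and the usual branch-length and distance updates for steps 3 and 4. It is not MEGA's code, and the five-taxon distance matrix is an illustrative assumption, not data from this laboratory.

```python
def nj_step(names, dist):
    """One neighbor-joining iteration on a symmetric distance matrix."""
    n = len(names)
    if n < 3:
        raise ValueError("need at least three taxa")
    totals = [sum(row) for row in dist]

    # Steps 1-2: compute Q for every pair and keep the pair with the lowest value.
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best

    # Step 3: branch lengths from each member of the pair to the new node.
    len_i = 0.5 * dist[i][j] + (totals[i] - totals[j]) / (2 * (n - 2))
    len_j = dist[i][j] - len_i

    # Step 4: distance from every other taxon to the new node.
    others = [k for k in range(n) if k not in (i, j)]
    to_new = [0.5 * (dist[i][k] + dist[j][k] - dist[i][j]) for k in others]

    # Step 5: reduced matrix with the joined pair treated as a single taxon.
    new_names = [f"({names[i]},{names[j]})"] + [names[k] for k in others]
    new_dist = [[0.0] + to_new]
    for idx, k in enumerate(others):
        new_dist.append([to_new[idx]] + [dist[k][m] for m in others])
    return (names[i], names[j]), (len_i, len_j), new_names, new_dist


names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair, lengths, new_names, new_dist = nj_step(names, dist)
print(pair, lengths)   # ('a', 'b') (2.0, 3.0)
print(new_names)       # ['(a,b)', 'c', 'd', 'e']
```

Note that the pair joined first (a, b) is not the pair with the smallest raw distance (d, e at distance 3): Q also accounts for how far each sequence is from all the others. Calling `nj_step` again on `new_names` and `new_dist` continues the construction.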

1. Open alignment file

✔ TASK From the main MEGA window click on “Click me to activate a data file” and open the previously saved or retrieved NuoL_Edited.meg file. The file will open within the MEGA Sequence Data Explorer as shown below.

Note that by default the sequence is shown only for positions that are different from the first listed sequence (E. coli str. K12). This can be toggled on and off with the special button:

Note that this window is separate from the main MEGA window that might be behind this view.

2. Activate Neighbor-Joining

✔ TASK From the main MEGA window follow the menu cascade: Phylogeny > Construct Phylogeny > Neighbor-Joining (NJ)...

This will open the M4: Analysis Preference window where current parameters and selections are summarized: we can see that we are working on a nucleotide file and that the method currently chosen is Neighbor-Joining. Green squares at the end of lines indicate parameters that could be altered and will be reviewed later.

✔ TASK Click the button to start the calculation. This will bring up the M4: Tree Explorer window.

Save the current tree:

We have now created a valid tree based on the alignment. We will explore how to change tree viewing and drawing options in a further exercise.

✔ TASK Close all MEGA windows: the MEGA main window and any associated MEGA Explorer window.

L11 Exercise D: Precision in Acquiring and Aligning Sequences

✔ READ Our accomplishment so far is to have used MEGA to retrieve sequences, align them and make a tree. However, we have not worked with precision either in the selection of the files or in making the tree. This section will go over finding homologous sequences to the best of our judgment. There is no meaning in placing an unrelated sequence in the tree because the purpose of the tree is to depict the level of ancestry between the sequences.

Restating our purpose: why do we make a tree? A tree is a graphical representation of the relationship between sequences believed to derive from a common ancestor and serves as a tool to illustrate a concept or a hypothesis. The level of detail may relate to the problem we want to illustrate. For example, we may want to study variants of a protein within a single species, or, on the other hand, find the evolutionary relationship of a given protein amongst all known species.

In evolutionary biology, homology refers to any similarity between characters that is due to their shared ancestry. Homology among proteins and DNA is often concluded on the basis of sequence similarity. However, sequence similarity does not always reflect shared ancestry: short sequences may be similar by chance, and sequences may be similar because both were selected to bind to a particular protein. When similarity is very high over a reasonable length there is little doubt that the sequences are homologs. However, how do we find sequences that are distantly related? A DNA sequence shares roughly 25% identity with a random sequence of the same composition simply because there are only 4 choices at each position (A, C, G, T/U). With proteins, on the other hand, there are 20 choices at each position, so the chance level of identity is only about 5% and homology can often be detected at much lower levels of similarity than with DNA.
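The 25% figure for DNA and the corresponding figure for 20 amino acids follow from a simple probability argument: if two unrelated sequences draw each position independently from the same residue frequencies, the chance that a column matches is the sum of the squared frequencies. The sketch below computes this; equal frequencies are an idealization, and real sequences with skewed composition give somewhat higher values.

```python
def expected_chance_identity(frequencies):
    """Probability that two independent draws from the same frequencies match."""
    total = sum(frequencies)
    return sum((f / total) ** 2 for f in frequencies)


print(expected_chance_identity([1] * 4))   # 4 equally likely bases: 0.25
print(expected_chance_identity([1] * 20))  # 20 equally likely amino acids, about 0.05
```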

✔ READ Therefore in the next section we will acquire sequences based on a protein search. However, a score of 100% between 2 (identical) protein sequences does not necessarily signify that the coding DNA would also score 100%. Because the genetic code is degenerate, most amino acids can be encoded by more than one codon, so silent substitutions can alter the DNA/RNA sequence without changing the protein. For this exercise we are interested in the finest tree structure possible, and we will retrieve the actual coding DNA sequences. As a first step we will allow different DNA sequences that encode identical proteins to be part of the list; true duplicates will be removed at a later stage.
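A short example shows how two different DNA sequences can encode the same protein. The sketch below builds the standard genetic code table from the conventional TCAG ordering and translates two made-up nine-base sequences; the sequences are invented for illustration, and MEGA has its own translation function that should be used in the exercise.

```python
BASES = "TCAG"
# Amino acids of the standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate a coding sequence codon by codon; '*' marks a stop codon."""
    seq = dna.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing incomplete codon
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


seq_1 = "ATGCTGAAA"  # made-up example
seq_2 = "ATGTTAAAG"  # differs from seq_1 at three positions
print(translate(seq_1), translate(seq_2))  # MLK MLK
```

The two DNA sequences are only 6/9 identical, yet the encoded proteins are 100% identical, which is why the DNA sequences carry finer information for the tree.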

✔ READ For this exercise we will embark on finding sequences related to the bacterial KcsA potassium channel.

1. Acquiring Query Sequence

Potassium channels found in bacteria are amongst the most studied of ion channels in terms of their molecular structure. They have a tetrameric structure, which has been solved by X-ray crystallography and NMR. Therefore we can retrieve the sequence from the PDB web site.



✔ TASK
Open a web browser.
Point the browser to the PDB web site: http://www.rcsb.org/pdb
In the search box enter 1F6G

(1F6G is the PDB ID code for the bacterial KcsA potassium channel.)

Click Site Search or press return.
On the left-hand side panel, under the Structure tab, click on FASTA Sequence.
The browser will then ask whether to save or open the file. Choose Save and direct the file to the desktop if asked (default with Firefox).
Accept the default name of 1F6G.fasta.txt
Navigate to the desktop.

Right-Click on the file. Choose Open With Select WordPad (or Microsoft Office Word) to open the file.

Note: if you double-click on the file icon, it will open with Notepad and show only one single long line, because Notepad does not recognize the end-of-line characters used in this file:



The structure is composed of four identical protein sequences labeled A to D. Copy the sequence of one of the chains, e.g. chain A, to the clipboard. This will be our query sequence to use with BLAST.
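The same extraction can be done with a short script instead of a text editor. The sketch below is a minimal FASTA reader; the header lines in the example are hypothetical placeholders (the real headers in 1F6G.fasta.txt may look different), and the sequences are shortened. Because it uses `splitlines()`, it copes with either style of end-of-line.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():  # handles both \n and \r\n line endings
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical, shortened example with two identical chains.
example = (
    ">chain_A (hypothetical header)\n"
    "MPPMLSGLLA\n"
    "RLVKLLLGRH\n"
    ">chain_B (hypothetical header)\n"
    "MPPMLSGLLARLVKLLLGRH\n"
)

records = parse_fasta(example)
query = records[0][1]  # sequence of the first chain
print(query)
print("all chains identical:", len({seq for _, seq in records}) == 1)
```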

2. BLAST within MEGA

2.1. Set-up

✔ TASK
Launch MEGA (Windows: Start > All Programs)
Launch the MEGA BLAST web browser (MEGA: Alignment > Do BLAST Search)
(The default is blastn with nucleotides and we need to switch to proteins.)
Click Home
Scroll down the page.
Click protein blast

Paste the sequence in the appropriate window. (Right-click with the mouse or press Control and V together on the keyboard.)

Verify that the database is nr.
Verify that the algorithm is blastp (protein-protein BLAST).
Press the BLAST button at the bottom of the page.



✔ OPTIONAL Before pressing the BLAST button you can open and explore the Algorithm Parameters by clicking-open the triangle.

2.2. BLAST results

As previously, the result page is organized with a graphical summary (max 100), a description list (up to the limit preset in Algorithm Parameters; default is 100) and finally the alignments. Within the “Description” and “Alignment” lists, entries that are derived from the Protein Data Bank, and are therefore solved structures, are shown with the symbol s; entries linked to “Entrez Gene” (a searchable database of genes at NCBI from RefSeq genomes) are shown with the symbol G.

✔ INFO http://www.ncbi.nlm.nih.gov/RefSeq/ G = Entrez Gene. RefSeq = The Reference Sequence collection. RefSeq aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences. Each RefSeq represents a single, naturally occurring molecule from one organism. Similar to a review article in the literature, a RefSeq is a synthesis of information representing the consolidation of information by a particular group at a particular time. RefSeq has been built using data from public archival databases only.



2.2.1. Explore first hit

✔ TASK Click on the score (here 317) to jump to the first hit, or scroll down the page. Typically the very first hit is of course the query sequence itself. The original sequence was the PDB sequence and that is what we found. Furthermore, since the channel has a tetrameric structure, the 4 identical chains are listed together in one entry, followed by the symbol s to signify that it is a structural entry. Right-click and open this entry in a new window.

The text of the entry is shown below. DO NOT CLICK the ++ Add to Alignment button at this point!

LOCUS       1F6G_A                   160 aa            linear   BCT 24-SEP-2008
DEFINITION  Chain A, Potassium Channel (Kcsa) Full-Length Fold.
ACCESSION   1F6G_A
VERSION     1F6G_A  GI:13399712
DBSOURCE    pdb: molecule 1F6G, chain 65, release Aug 27, 2007;
            deposition: Jun 21, 2000;
            class: Proton Transport, Membrane Protein;
            source: Mol_id: 1;
            Organism_scientific: Streptomyces Lividans;
            Organism_common: Bacteria;
            Expression_system: Escherichia Coli;
            Expression_system_common: Bacteria;
            Expression_system_vector_type: Plasmid;
            Expression_system_plasmid: Pqe32;
            Exp. method: Nmr, 8 Structures.
KEYWORDS    .
SOURCE      Streptomyces lividans
  ORGANISM  Streptomyces lividans
            Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales;
            Streptomycineae; Streptomycetaceae; Streptomyces.
REFERENCE   1  (residues 1 to 160)
  AUTHORS   Cortes,D.M., Cuello,L.G. and Perozo,E.
  TITLE     Molecular architecture of full-length KcsA: role of cytoplasmic
            domains in ion permeation and activation gating
  JOURNAL   J. Gen. Physiol. 117 (2), 165-180 (2001)
   PUBMED   11158168
REFERENCE   2  (residues 1 to 160)
  AUTHORS   Cortes,D.M. and Perozo,E.
  TITLE     Direct Submission
  JOURNAL   Submitted (21-JUN-2000)
COMMENT     SEQRES.
FEATURES             Location/Qualifiers
     source          1..160
                     /organism="Streptomyces lividans"
                     /db_xref="taxon:1916"
     Region          1..126
                     /region_name="Domain 1"
                     /note="NCBI Domains"
     SecStr          2..20
                     /sec_str_type="helix"
                     /note="helix 1"
     SecStr          22..45
                     /sec_str_type="helix"
                     /note="helix 2"
     SecStr          63..74
                     /sec_str_type="helix"
                     /note="helix 3"
     SecStr          86..113
                     /sec_str_type="helix"
                     /note="helix 4"
     SecStr          114..122
                     /sec_str_type="helix"
                     /note="helix 5"
     SecStr          130..147
                     /sec_str_type="helix"
                     /note="helix 6"
ORIGIN
        1 mppmlsglla rlvklllgrh gsalhwaaag aatvllvivl lagsylavla ergapgaqli
       61 typaalwwsv etattvgygd lypvtlwgrc vavvvmvagi tsfglvtaal atwfvgreqe
      121 rrghfvrhse kaaeeaytrt tralherfdr lermlddnrr
//

This particular sequence was expressed through a plasmid expression vector after cloning. The DNA sequence linked to this protein sequence is not directly available. In addition, since this sequence was cloned for structural expression, we do not know at this point if any mutation (even minor) was introduced into the sequence to facilitate the structural study. Since structural sequences do not have a link to their DNA coding sequence, and since we do not know if these sequences are naturally occurring, as a general rule we will not use any of the structural protein sequences.

✔ TASK Close this window (if you used the right-click method) or click the back button if you did not use the right-click method.

2.2.2. Explore second hit

✔ READ Before opening the second hit, note that there are also multiple links for this sequence. Unlike for the first hit, this is not related to the tetrameric nature of the structure, but rather to the fact that the sequence is represented in multiple databases:

Query  1    MPPMLSGLLARLVKLLLGRHGSALHWAAAGAATVLLVIVLLAGSYLAVLAERGAPGAQLI  60
            MPPMLSGLLARLVKLLLGRHGSALHW.AAGAATVLLVIVLLAGSYLAVLAERGAPGAQLI
Sbjct  1    MPPMLSGLLARLVKLLLGRHGSALHWRAAGAATVLLVIVLLAGSYLAVLAERGAPGAQLI  60

Query  61   TYPAALWWSVETATTVGYGDLYPVTLWGRCVAVVVMVAGITSFGLVTAALATWFVGREQE  120
            TYP.ALWWSVETATTVGYGDLYPVTLWGR.VAVVVMVAGITSFGLVTAALATWFVGREQE
Sbjct  61   TYPRALWWSVETATTVGYGDLYPVTLWGRLVAVVVMVAGITSFGLVTAALATWFVGREQE  120

Query  121  RRGHFVRHSEKAAEEAYTRTTRALHERFDRLERMLDDNRR  160
            RRGHFVRHSEKAAEEAYTRTTRALHERFDRLERMLDDNRR
Sbjct  121  RRGHFVRHSEKAAEEAYTRTTRALHERFDRLERMLDDNRR  160

This sequence differs from the query at 3 amino acid positions (marked with a “.” in the middle line above) and is 98% identical to the query sequence.
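The count of 3 differences and the 98% figure can be verified directly from the two sequences shown in the alignment. The sketch below simply compares them position by position; it assumes an ungapped alignment of equal-length sequences, which is the case here, and the function name is illustrative.

```python
QUERY = (
    "MPPMLSGLLARLVKLLLGRHGSALHWAAAGAATVLLVIVLLAGSYLAVLAERGAPGAQLI"
    "TYPAALWWSVETATTVGYGDLYPVTLWGRCVAVVVMVAGITSFGLVTAALATWFVGREQE"
    "RRGHFVRHSEKAAEEAYTRTTRALHERFDRLERMLDDNRR"
)
SBJCT = (
    "MPPMLSGLLARLVKLLLGRHGSALHWRAAGAATVLLVIVLLAGSYLAVLAERGAPGAQLI"
    "TYPRALWWSVETATTVGYGDLYPVTLWGRLVAVVVMVAGITSFGLVTAALATWFVGREQE"
    "RRGHFVRHSEKAAEEAYTRTTRALHERFDRLERMLDDNRR"
)


def differing_positions(a, b):
    """1-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y]


positions = differing_positions(QUERY, SBJCT)
identity = 100 * (len(QUERY) - len(positions)) / len(QUERY)

for pos in positions:
    print(f"position {pos}: {QUERY[pos - 1]} -> {SBJCT[pos - 1]}")
print(f"identity: {identity:.1f}%")
```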

✔ TASK
Using the right-click method, open the following 3 entries, each in a new window:
ref|NP_631700.1| G
emb|CAA86025.1|
emb|CAC16993.1| G
Scroll down to the FEATURES section.

ref|NP_631700.1| voltage-gated potassium channel [Streptomyces coelicolor A3(2)]

emb|CAC16993.1| voltage-gated potassium channel [Streptomyces coelicolor A3(2)]

emb|CAA86025.1| potassium channel protein [Streptomyces lividans]

Interestingly, these 3 entries are linked together but belong to 2 different species: Streptomyces coelicolor A3(2) and Streptomyces lividans. Furthermore, upon close inspection it can be seen that the sequences within the 3 entries are 100% identical to each other, which means that the query sequence might have had some errors, and therefore it is indeed best not to include structure sequences in our tree. Both sequences from the emb database contain information on related PDB structures, and indeed emb|CAA86025.1| is shown linked to the PDB entry 1F6G (highlighted on the figure above), our original query sequence. This was the query sequence, yet 3 amino acid differences were detected. It is possible that the structure was determined on a slight variant which might have been obtained by cloning, and therefore these would not be natural mutations.

This will be our first entry:

✔ TASK Click the top red cross button: ++ Add to Alignment

2.2.3. Checking for errors

Scroll down to the next entry that is not a structural entry (s):
>ref|YP_002206049.1| G voltage-gated potassium channel [Streptomyces sviceus ATCC 29083]
gb|EDY57125.1| G voltage-gated potassium channel [Streptomyces sviceus ATCC 29083]
Length=162

✔ TASK
Open ref|YP_002206049.1| in a new window
Click on the CDS link highlighted on the RefSeq version. This action will open a new NCBI window with the DNA coding sequence.
Scroll to the bottom
Observe the sequence and the /translation. Check whether the DNA begins with the START codon (atg) and ends with the OPAL terminator codon (tga).

/translation="MAPCARPAPGLRGVSMLPGFLARMVELMRRRDGRSLHVKAAGGA
TAVLLVVMLTGSWAVLVAEEGARGASLTSYPKALWWSVETATTVGYGDFYPVTWWGRV
VGTVVMVVGITTYGMVTAALATWFVARQQKRAHPVGAETLHALHERFDRLEELLGAKK
KG"
ORIGIN
1 ctggccccgt gcgcccgccc cgcccccggc cttcgaggag tgagcatgct gcccggattc
[...]
//

✔ READ The reported translation starts with Methionine (M), but the reported DNA sequence starts with a C rather than an A. Translating this sequence would in fact give a Leucine as the first amino acid. Therefore there is a mistake in the DNA sequence. The same mistake can be found in the other link (gb|EDY57125.1|). Note that the next entry is from the same strain (Streptomyces sviceus ATCC 29083), has a correct ATG start and is therefore a better sequence. Ignore the sequence with the wrong start and click the top red cross button ++ Add to Alignment for ref|ZP_03195532.1|
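The visual check of the two ends of a coding sequence can be sketched in a few lines. This is only an illustration of the check described above: the function name `check_cds` is hypothetical, the stop codon names follow the ones used in this tutorial, and the example sequences are shortened, made-up strings that merely mimic the start of the record above.

```python
STOP_CODONS = {"TAA": "ochre", "TAG": "amber", "TGA": "opal"}


def check_cds(dna):
    """Return a list of problems found at the ends of a coding sequence."""
    dna = dna.upper().replace(" ", "")
    problems = []
    if len(dna) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if dna[:3] != "ATG":
        problems.append(f"starts with {dna[:3]} instead of ATG")
    if dna[-3:] not in STOP_CODONS:
        problems.append(f"ends with {dna[-3:]}, which is not a stop codon")
    return problems


# Shortened, invented examples: the first mimics the wrong start seen above.
print(check_cds("ctggcc ccgtga"))  # reports the CTG start
print(check_cds("atggcc ccgtga"))  # no problem found
```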

2.2.4. Reverse Complement?

Some sequences are reported as the negative strand and need to be reverse complemented either before adding them to MEGA or within MEGA. The next entry as we follow along the Description list, skipping s entries, is:
>ref|YP_909829.1| G voltage-gated potassium channel protein [Bifidobacterium adolescentis ATCC 15703]
dbj|BAF39747.1| G possible voltage-gated potassium channel protein [Bifidobacterium adolescentis ATCC 15703]
Length=228

✔ TASK
Open ref|YP_909829.1| in a new window
Observe the CDS record:

CDS 1..228
/locus_tag="BAD_0966"
/coded_by="complement(NC_008618.1:1203706..1204392)"
/note="COG family: Kef-type K+ transport systems_predicted

Note that the coded_by entry states “complement”, which means that the coding sequence is the reverse complement of the deposited sequence.

Click CDS
Scroll to the sequence at the bottom
Observe the sequence at the 5’ and 3’ ends

ORIGIN
1 ctatggtctg cgcagccgcg cgatgtcgtc acgtagctgg ttgacggtgt ccgtcagttc
[...]
661 cagaaacacc aacgccaatg cggtcat
//

It can easily be seen that the sequence “cat” at the 3’ end is the reverse complement of “atg”, the START codon. Similarly, the 5’ end sequence “cta” is the reverse complement of the Amber STOP codon “tag” (“uag” in RNA). There are 2 ways to reverse complement the DNA sequence: within the browser or within MEGA.

Within the browser:
Click the square next to “Reverse complemented strand”
Press the “Refresh” button.
Add the sequence to MEGA: press ++ Add to Alignment

Within MEGA:
Add the sequence to MEGA: press ++ Add to Alignment
Right-click on the sequence name
Select “Reverse Complement”

✔ TASK Choose one method and add this sequence in the proper orientation
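For readers who want to see what the “Reverse Complement” operation does, here is a minimal sketch. It handles only the four unambiguous bases (other characters are left unchanged), which is an assumption of this example rather than a description of how MEGA or the NCBI page implement it.

```python
# Translation table mapping each base to its complement (both cases).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Complement every base, then reverse the order."""
    return dna.translate(COMPLEMENT)[::-1]


# The two ends discussed above:
print(reverse_complement("cat"))  # 3' end of the record
print(reverse_complement("cta"))  # 5' end of the record
```

Applying the function twice returns the original sequence, which is a quick way to convince oneself that no information is lost by the operation.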

2.2.5. Long sequences

In some cases clicking the CDS link will bring up the complete genome. In this case, press the back button and check, next to the CDS entry, the begin and end numbers along the sequence that contain the gene. Example for:
>ref|YP_832549.1| G Ion transport 2 domain-containing protein [Arthrobacter sp. FB24]
gb|ABK04449.1| G Ion transport 2 domain protein [Arthrobacter sp. FB24]
Length=244

✔ TASK
Open ref|YP_832549.1| in a new window
Observe the CDS record:

CDS 1..244
/locus_tag="Arth_3070"
/coded_by="NC_008541.1:3441675..3442409"
/note="PFAM: Ion transport protein; Ion transport 2 domain protein,;

Note the coded_by entry: 3441675..3442409 are the begin and end values for the gene within the NC_008541.1 record.



Click the CDS link again.

Replace the begin and end values with 3441675 and 3442409 Press Refresh

Note that a new button for showing the whole sequence is now being displayed.

Press ++ Add to Alignment
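The begin and end values in a coded_by entry also give a quick consistency check: the range should cover the protein length in codons plus one stop codon. The sketch below parses the two coded_by strings seen in this exercise. It is an illustration only; it assumes the simple forms shown here and does not handle more complex locations (for example joined ranges).

```python
import re

# Matches 'ACCESSION:begin..end', optionally wrapped in 'complement(...)'.
CODED_BY = re.compile(r"(complement\()?([\w.]+):(\d+)\.\.(\d+)\)?")


def parse_coded_by(value):
    """Return (accession, begin, end, is_complement) for a simple coded_by."""
    match = CODED_BY.fullmatch(value)
    if match is None:
        raise ValueError(f"unsupported coded_by value: {value}")
    is_complement = match.group(1) is not None
    return match.group(2), int(match.group(3)), int(match.group(4)), is_complement


def expected_protein_length(begin, end):
    """Number of amino acids if the range holds a full CDS with its stop codon."""
    nucleotides = end - begin + 1
    return nucleotides // 3 - 1


for value in (
    "NC_008541.1:3441675..3442409",
    "complement(NC_008618.1:1203706..1204392)",
):
    accession, begin, end, is_complement = parse_coded_by(value)
    print(accession, begin, end, is_complement, expected_protein_length(begin, end))
```

The computed lengths agree with the CDS ranges 1..244 and 1..228 reported in the two protein records.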

3. Build the alignment list

✔ READ Go down the Description list and choose sequences to be added:

Time & patience required!

For each hit with an acceptable E-value, select one of the links (in the previous hits there were 5 links); prefer links from RefSeq if there is a choice. In any case, choose only those that have a CDS link.

Click on the CDS link and only then click ++ Add to Alignment. Skip all structure entries (s).

What counts as an acceptable E-value is a cut-off decision. We can decide that anything higher than 1 × 10⁻³ is too high, but that cut-off may vary and is in the end rather subjective. Also, we can eliminate sequences that come from the same strain, proteins labeled hypothetical, putative, possible or predicted, as well as synthetic constructs. The general “algorithm” that we follow therefore is:

E-value < threshold? → Different strain? → Open in new window → CDS available? → Check sequence and translation → Reverse complement?
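The first questions of this decision chain are mechanical and can be written down as a filter. The sketch below is one possible reading of the flowchart, not a tool used in the course: the `Hit` record, its field names and the example hits are all hypothetical, and the last two steps (checking the sequence against its translation, reverse complementing) are left to manual inspection.

```python
from dataclasses import dataclass

# Labels that the text suggests excluding (matched case-insensitively).
EXCLUDED_LABELS = ("hypothetical", "putative", "possible", "predicted",
                   "synthetic construct")


@dataclass
class Hit:
    description: str    # e.g. "voltage-gated potassium channel"
    strain: str         # organism and strain name
    e_value: float
    is_structure: bool  # True for entries marked as structures (s)
    has_cds: bool       # True if the entry offers a CDS link


def select_hits(hits, threshold=1e-3):
    """Keep hits that pass the E-value, structure, CDS, label and strain tests."""
    selected = []
    seen_strains = set()
    for hit in hits:
        if hit.e_value > threshold:
            continue
        if hit.is_structure or not hit.has_cds:
            continue
        if any(label in hit.description.lower() for label in EXCLUDED_LABELS):
            continue
        if hit.strain in seen_strains:
            continue
        seen_strains.add(hit.strain)
        selected.append(hit)
    return selected


# Invented hits, loosely modeled on the entries discussed above.
hits = [
    Hit("potassium channel (structure)", "strain A", 1e-85, True, False),
    Hit("voltage-gated potassium channel", "strain B", 1e-84, False, True),
    Hit("possible voltage-gated potassium channel", "strain C", 1e-20, False, True),
    Hit("voltage-gated potassium channel", "strain B", 1e-80, False, True),
    Hit("ion transport protein", "strain D", 0.5, False, True),
]

for hit in select_hits(hits):
    print(hit.strain, hit.description, hit.e_value)
```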



3.1. Edit Sequence Names

MEGA creates an entry name of up to 40 characters based on the database name. Long names are hard to read and make for untidy trees. Therefore we shall now give the sequences better names within MEGA. Each software and each operating system has restrictions on special characters. For example, Unix does not handle blanks or quotes well, and programs reading the Nexus file format may fail if dashes are part of a name.

Guidelines:
Choose a name that is unique
Use underscore _ rather than spaces
Keep the name short (some software limits names to 10 characters)
Only use letters, numbers, underscore and period
Do not use blank spaces or colons (:)
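The guidelines above can be expressed as a small helper, shown here as a sketch. The function names and the default length limit are assumptions made for this example; in the exercise itself the names are edited by hand within MEGA.

```python
import re


def clean_name(name, max_length=40):
    """Apply the naming guidelines to a single sequence name."""
    name = re.sub(r"\s+", "_", name.strip())    # underscores instead of blanks
    name = re.sub(r"[^A-Za-z0-9_.]", "", name)  # letters, digits, _ and . only
    return name[:max_length]


def clean_names(names, max_length=40):
    """Clean a list of names and refuse to continue if two become identical."""
    cleaned = [clean_name(n, max_length) for n in names]
    if len(set(cleaned)) != len(cleaned):
        raise ValueError("names are no longer unique after cleaning")
    return cleaned


print(clean_names(["Streptomyces lividans", "K+ channel: No-1"]))
print(clean_name("Streptomyces lividans", max_length=10))
```

The uniqueness check matters most when a short length limit is used, because truncation can turn two different names into the same one.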

✔ TASK Double-click name within MEGA and edit names of sequence.

Before: K+No1.mas After: K+No1-renamed.mas

Click on the blank cell above the name list (marked above) to adjust the cell size for all names.



If you are unsure about your gathered sequences, you can retrieve the file(s) K+No1.mas or K+No1-renamed.mas from the “Classroom Scratch” directory or from the resources section of virology.wisc.edu/acp.

3.2. Edit START codons

✔ READ Earlier we eliminated a sequence because its START codon was not correct, and we noticed that it was a duplicate sequence. Four sequences were introduced in the list with a START codon that has the wrong first base, although the translated protein was reported as starting with a Methionine (as it should). Because there are advantages to working with the protein sequence for creating the alignment, the DNA sequences will be translated into protein at a later stage. If the START codon is wrong, the translated protein will not reflect the actual protein sequence. Therefore we need to edit the START codons that have an erroneous base in: M_vanbaalenii, S_arenicola, M_sp_Mjls_4829, S_pneumoniae.

✔ TASK Adjust the 4 sequences named above one by one

Right-click on first base Select “cut” from pull down menu Type letter “A” to replace cut base

The first 3 letters of every sequence should now read: ATG
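The manual edit above amounts to replacing the first base with A when the next two bases are already TG. A minimal sketch of that rule follows; the function name is hypothetical, and the function deliberately refuses to guess when the start does not look like a one-base error.

```python
def fix_start_codon(dna):
    """Return the sequence with its first base set to A if it reads ?TG."""
    if dna[:3].upper() == "ATG":
        return dna  # already correct
    if dna[1:3].upper() == "TG":
        return "A" + dna[1:]
    raise ValueError(f"start {dna[:3]!r} is not a one-base variant of ATG")


print(fix_start_codon("CTGGCCCCG"))  # first base replaced
print(fix_start_codon("ATGGCCCCG"))  # left unchanged
```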

4. Translate to Protein

✔ READ MEGA works with either DNA or protein sequences. However, when MEGA is working with protein sequence, the alignment is transferred back to the DNA sequence. Previously we discussed the fact that there are 4 choices at each position for DNA while there are 20 choices for amino acids in protein sequences. This fact results in more precise alignments if performed at the protein level. In addition, if the alignment was performed at the DNA level, Clustal would add gaps to maximize the alignment score and would undoubtedly split codons, resulting in an aberrant protein translation.



The alignment transferred to the DNA will keep codons intact, create gaps in multiples of 3 and therefore eliminate frameshift artifacts. It has also been shown that trees created from such protein alignments are more accurate than trees created from the DNA sequences².
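The transfer of a protein alignment back onto the DNA can be pictured as follows: every aligned amino acid is replaced by its own codon, and every gap by three gap characters. This is a sketch of the principle only, with invented sequences, and makes no claim about how MEGA implements it internally.

```python
def protein_to_codon_alignment(aligned_protein, dna):
    """Thread an unaligned coding sequence onto its aligned protein."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    residues = sum(1 for aa in aligned_protein if aa != "-")
    if residues > len(codons):
        raise ValueError("the DNA is too short for this protein")
    aligned, index = [], 0
    for aa in aligned_protein:
        if aa == "-":
            aligned.append("---")  # one protein gap = three DNA gaps
        else:
            aligned.append(codons[index])
            index += 1
    return "".join(aligned)


# Invented example: the protein alignment placed one gap after the first residue.
print(protein_to_codon_alignment("M-PL", "ATGCCGCTG"))
```

Because gaps are only ever inserted between whole codons, the reading frame of every sequence is preserved.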

✔ TASK Click on tab “Translated Proteins Sequences”

5. Set parameters and calculate protein alignment

✔ TASK With the protein being displayed follow the menu cascades: Edit > Select All then Alignment > Align by ClustalW

This will bring up the protein parameters window. There are 2 sets of gap penalties corresponding to the 2-step process of the Clustal algorithm. Hall³ reports that changing the Multiple Alignment parameters of Gap Opening and Gap Extension from their defaults of 10 and 0.2 to values of 3 and 1.8 respectively improves alignments significantly.

✔ TASK Change the defaults values as explained above

Defaults Change to

Click OK to activate the alignment.

2 Hall BG. Comparison of the accuracies of several phylogenetic methods using protein and DNA sequences. Mol Biol Evol. 2005; 22(3):792-802. Erratum in: Mol Biol Evol. 2005; 22(4):1160.
3 Hall BG. Phylogenetic Trees Made Easy. 2008; 3rd Edition; Sinauer Associates, Inc.



MEGA will report the progress of the alignment status in the ClustalW Progress window showing the 2-step process.

6. Alignment adjustments

✔ READ Since the alignment is the basis for the calculation of the tree, one would expect that adjusting the alignment as well as possible will yield the best tree. However, it was shown that as long as the alignment is more than 50% accurate, increasing the alignment accuracy (mostly by manual adjustment) has little effect on the resulting tree accuracy⁴.

7. Export alignment in DNA and protein forms

Export the alignment as a .meg file from the Export Alignment menu
Name the file e.g. K+AlignAA_No1.meg

Now we will export the alignment in its DNA form. Click “DNA Sequences” above the file names to return to the DNA form which is now aligned according to the amino acids.

Export the alignment as a .meg file Name file e.g. K+AlignDNA_No1.meg Answer YES when asked if coding sequences

8. Eliminate duplicate sequences

During the selection process we did not know whether any of the sequences had silent mutations. The following process will find duplicate sequences if any are present.

4 Ogden TH, Rosenberg MS. Multiple sequence alignment accuracy and phylogenetic inference. Syst Biol. 2006; 55(2):314-328.



✔ TASK

On the main MEGA window click line: “Click me to activate data file” Choose the newly saved DNA alignment K+AlignDNA_No1.meg The alignment will open within the Sequence Data Explorer. On the main MEGA window follow the menu cascade: Distances > Choose Model

In the opening M4: Analysis Preferences window verify that the first line indicates Data Type: Nucleotide (Coding). On the line named -> Model click on the green square at right.

The green square will change to a gray square with 3 dots.

Click on that gray square and follow the menu cascade: Nucleotide > No. of Differences as illustrated below

Click OK
The -> Model line will update to show the newly selected model.

On the main MEGA window follow the menu cascade: Distances > Compute Pairwise...



The model we have just chosen should still be active for this calculation. Click Compute button

Alter the look of the computed pairwise table: change the number of decimal places with the down arrow (circled) and optionally shrink the cell size with the cursor (arrow). Identical sequences would show a difference of zero. In our example the closest pair, M. ulcerans and M. marinum, differs by 1 base, as shown on this blow-up of the calculated distance matrix. Therefore there are no duplicate sequences.
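The “No. of Differences” table is simply a count of mismatching positions for every pair of aligned sequences. The sketch below reproduces that idea on invented sequences; it counts a gap against a base as a difference, which is a simplification and may not match the gap handling options chosen in MEGA.

```python
from itertools import combinations


def count_differences(a, b):
    """Number of aligned positions at which two sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def find_duplicates(alignment):
    """Return the pairs of names whose aligned sequences are identical."""
    return [
        (name1, name2)
        for (name1, seq1), (name2, seq2) in combinations(alignment.items(), 2)
        if count_differences(seq1, seq2) == 0
    ]


# Invented alignment: seq_b and seq_c are duplicates, seq_a is 1 base away.
alignment = {
    "seq_a": "ATGCCGCTG",
    "seq_b": "ATGCCACTG",
    "seq_c": "ATGCCACTG",
}

print(count_differences(alignment["seq_a"], alignment["seq_b"]))
print(find_duplicates(alignment))
```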

9. Eliminate inadequate sequences

✔ TASK Activate the DNA alignment from within the main MEGA window and press the “Translated Protein Sequences” button.

9.1. Biology and Structure of the protein

✔ READ Potassium ion channels remove the hydration shell from the ion when it enters the selectivity filter. In prokaryotic species the selectivity filter is formed by five residues TVGYG in the “P loop” from each of the 4 subunits as illustrated below for KcsA:



Indeed this pattern showed in all BLAST results. For example, in Thermoplasma volcanium the TVGYG motif is present in both the query and the subject:

Query  36   LVIVLLAGSYLAVLAERGAPGAQLITYPAALWWSVETATTVGYGDLYPVTLWGRCVAVVV  95
             +IV+L GSYL  L +R    +++  Y  A+W+++ET TTVGYGD+ PV+  GR VA+++
Sbjct  26   FIIVVLIGSYLEFLTQRNVKYSEIKNYFTAIWFTMETVTTVGYGDVVPVSNLGRVVAMLI  85
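Locating the filter motif in a sequence is a plain substring search. The sketch below applies it to the two fragments shown in the BLAST alignment, using the start positions reported by BLAST (36 and 26) so that the results are numbered along the full-length sequences. The function name and its parameters are illustrative.

```python
def find_motif(sequence, motif="TVGYG", first_position=1):
    """Return the positions at which the motif starts.

    first_position is the sequence position of the first character,
    so that fragments can be numbered like the full-length sequence.
    """
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(first_position + start)
        start = sequence.find(motif, start + 1)
    return positions


# Fragments copied from the BLAST alignment above.
query_fragment = "LVIVLLAGSYLAVLAERGAPGAQLITYPAALWWSVETATTVGYGDLYPVTLWGRCVAVVV"
sbjct_fragment = "FIIVVLIGSYLEFLTQRNVKYSEIKNYFTAIWFTMETVTTVGYGDVVPVSNLGRVVAMLI"

print(find_motif(query_fragment, first_position=36))
print(find_motif(sbjct_fragment, first_position=26))
```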

Thermoplasma volcanium is the only sequence that does not seem in register with the rest of the alignment in the region of the specificity filter (starting at alignment position 392), as shown here with colors removed for clarity. It can also be noted that the TVGYG sequence is split by 2 gaps in the misaligned sequence.

The specificity filter sequence of Thermoplasma volcanium appears much earlier in the alignment, at position 231, and is also misaligned.

This sequence is defined as Kef-type K+ transporter NAD-binding component [Thermoplasma volcanium GSS1] from a paper titled “Archaeal adaptation to higher temperatures revealed by genomic sequence of Thermoplasma volcanium” (Kawashima T. et al., Proc Natl Acad Sci U S A. 2000; 97(26):14257-14262); the organism has an optimum growth temperature of 60ºC. The length of the volcanium sequence (348) is more than twice that of the query sequence (160). In addition, the description of the volcanium protein as both K+ transporter and NAD-binding suggests that this protein contains 2 domains. A quick search on the PFAM database (http://pfam.sanger.ac.uk/, a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models) reveals that indeed the volcanium protein contains 2 domains: the Ion_trans_2 domain and the NAD binding TrkA-N domain:

Pfam domains found in the query sequence and in Thermoplasma volcanium:

                                         Sequence      HMM
Pfam-A       Description    Entry type   Start  End    From  To    E-value

Query Sequence KcsA
Ion_trans_2  Ion channel    Domain       34     116    1     81    2.6e-18

Thermoplasma volcanium
Ion_trans_2  Ion channel    Domain       25     106    1     81    3.4e-21
TrkA_N       TrkA-N domain  Domain       123    247    1     121   3.2e-15

✔ INFO The Ion_trans_2 pattern at the Pfam database has an Isoleucine insertion within the filter sequence.

#HMM *->ivlllvlifgtvyysl....epeeg.wewsfldalYFsfvTlTTiGYGDivPlstdaGRlftivyiliGiplfllllavlgrflte<-*
#MATCH ++l++vl++g+ + l ++++p+ ++ al++s+ T TT+GYGD++P+ t +GR +++v+++ Gi+ f+l++a l+++++
#SEQ VLLVIVLLAGSYLAVLaergAPGAQlI--TYPAALWWSVETATTVGYGDLYPV-TLWGRCVAVVVMVAGITSFGLVTAALATWFVG 116

In addition, it is interesting to note that the filter sequence is not registered as a pattern in the Prosite database http://www.expasy.ch/prosite/

9.2. Remove sequence

✔ READ At this point we can make the decision to either remove the NAD binding domain portion of the volcanium sequence that is obviously not homologous to our query sequence, or we can decide to remove that sequence altogether. For simplicity we will remove that sequence from the alignment and delete the orphaned gaps.



✔ TASK
Return to the DNA alignment view: press the DNA Sequences button
Right-click the T_volcanium sequence
Select Delete from the pull-down menu
We will now remove the many orphaned gaps that remain:
Edit > Select All
Alignment > Delete Gap-Only Sites
This will remove columns with only gaps and no sequence.
Export the alignment in DNA form: Data > Export Alignment > MEGA format
Enter a title and answer YES to the protein-coding question.
Name the file K+AlignDNA_No2.meg
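What “Delete Gap-Only Sites” does can be sketched as follows: after a sequence is removed, any column in which every remaining sequence has a gap carries no information and is dropped. The alignment below is invented, and the function is an illustration of the principle rather than MEGA's own code.

```python
def delete_gap_only_sites(alignment):
    """Remove the columns in which every sequence has a gap ('-')."""
    names = list(alignment)
    columns = zip(*(alignment[name] for name in names))
    kept = [column for column in columns if any(c != "-" for c in column)]
    return {
        name: "".join(column[i] for column in kept)
        for i, name in enumerate(names)
    }


# Invented alignment: the gaps in columns 3 and 4 were only needed for a
# sequence that has since been deleted.
alignment = {
    "seq_a": "AT--G",
    "seq_b": "A---G",
}

print(delete_gap_only_sites(alignment))
```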

10. Estimate reliability of alignment with Average AA Identity

✔ READ This is the final stage before calculating the tree. It has been shown that when the percent identity of amino acids falls below 20%, the resulting sequence alignment has less than 50% of the amino acids correctly aligned⁵. If the percent identity is between 20 and 30%, the number of correctly aligned amino acids rises to 80%, and for percent identities above 30% the number of correctly aligned amino acids rises to 90%. As discussed previously, the tree accuracy varies little if more than 50% of the amino acids are correctly aligned.

✔ TASK

Return to protein alignment view: press Translated Proteins button.

5 Thompson JD, Plewniak F, Poch O. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 1999; 27(13):2682-2690.



Save the protein alignment into a file named e.g. K+AlignAA_No2.meg

On the MEGA main window activate this file by clicking “click to activate a data file”. The file will open in the M4: Sequence Data Explorer. On the MEGA main window follow the menu cascade: Distances > Compute Overall Mean... In the opening Analysis Preferences window click on the green square on the -> Model line, follow the menu cascade: Amino Acid > p-distance:

The p-distance is 1 – amino acid identity, therefore the 20% identity limit discussed above corresponds to a p-distance of 0.8. A p-distance lower than 0.8 is acceptable; if it is above 0.8 the alignment is unreliable. Here we find 0.589 and therefore we can use this alignment to compute a tree or a family of trees. Note: if the value is shown as 1, simply press the up or down arrow above to increase the number of decimal places.
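The overall mean p-distance is the average, over all pairs of sequences, of the proportion of differing positions. The sketch below shows the calculation on invented peptides. It ignores positions where either sequence has a gap, which is one common convention and an assumption of this example; MEGA offers its own gap handling options.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing positions, ignoring positions with a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no position is shared by the two sequences")
    return sum(1 for x, y in pairs if x != y) / len(pairs)


def overall_mean_distance(sequences):
    """Average p-distance over all pairs of aligned sequences."""
    distances = [p_distance(a, b) for a, b in combinations(sequences, 2)]
    return sum(distances) / len(distances)


def alignment_is_usable(mean_distance, limit=0.8):
    """Apply the 20% identity rule (p-distance below 0.8) discussed above."""
    return mean_distance < limit


# Invented aligned peptides.
sequences = ["MKVL", "MKIL", "MRIL"]
mean = overall_mean_distance(sequences)
print(f"overall mean p-distance: {mean:.3f}")
print("usable:", alignment_is_usable(mean))
print("value found in this exercise usable:", alignment_is_usable(0.589))
```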

✔ INFO For non-coding DNA sequences we cannot use the protein alignment as a guide as we have done above. Instead of a threshold of 20% amino acid identity, the minimum is 66% DNA identity to reach the threshold of 50% alignment accuracy⁶.

6 Kumar S, Filipski A. (Review) Multiple sequence alignment: in pursuit of homologous DNA positions. Genome Res. 2007; 17(2): 127-135.



L11 Exercise E: Neighbor-Joining Phylogenetic Tree, Rooting

✔ INFO http://en.wikipedia.org/wiki/ /Phylogenetics, /Computational_phylogenetics /Phylogenetic_tree Phylogenetics is the study of evolutionary relatedness among various groups of organisms. Computational phylogenetics is the application of computational algorithms to phylogenetic analyses. The goal is to assemble a phylogenetic tree representing a hypothesis about the evolutionary ancestry of a set of genes or sequences. Computational algorithms: Distance-matrix methods such as neighbor-joining or UPGMA, which calculate genetic distance from multiple sequence alignments, are simplest to implement, but do not invoke an evolutionary model. Maximum parsimony is another simple method of estimating phylogenetic trees, but implies an implicit model of evolution (i.e. parsimony). More advanced methods use the optimality criterion of maximum likelihood, often within a Bayesian Framework, and apply an explicit model of evolution to phylogenetic tree estimation.

1. Create Neighbor-Joining Tree

In a previous exercise we used the neighbor-joining distance method to calculate the tree and we will use that method again. As shown in the INFO table above there are many other methods, some of which require long set-ups, statistical modeling and long calculations even with a fast computer. We created a suitable alignment in the previous exercise and we assessed its suitability by computing the overall mean distance.

✔Preliminary: This is a continuation of the previous exercise. If you are starting from here click on the phrase “click here to activate a data file” on MEGA main window and open the previously created DNA alignment K+AlignDNA_No2.meg If you are continuing from above make sure you are looking at the DNA rather than the translated protein sequences.

✔ TASK

From the main MEGA window follow the menu cascade: Phylogeny > Construct Phylogeny > Neighbor-Joining (NJ)...



In the Analysis Preferences window, adjust the “-> Model” line to read “Maximum Composite Likelihood” (click the green square, then choose Nucleotide > Maximum Composite Likelihood).

Click Compute to calculate the tree.

2. Estimating the reliability of a tree: Bootstrapping

✔ INFO In computational phylogenetics, bootstrapping refers to creating multiple pseudo-alignments from randomly chosen columns (sampling with replacement) of the original alignment until each random pseudo-alignment has the same length as the original alignment. For each random alignment a tree is calculated with the same parameters as the tree calculated for the original alignment. The tree is then assessed for the presence (score of 1) or absence (score of 0) of each clade that was present on the original tree, and the scores are recorded. The next bootstrap cycle is then initiated. The number of cycles may depend on the computing time or the desired precision. 100 to 2000 bootstrap replicates are typical, and the calculation time increases with the number of cycles. We can be most confident in clades with 90 to 100% bootstrap values; the confidence level decreases with the calculated bootstrap values. Vocabulary:
• A clade is a taxonomic group comprising a single common ancestor and all the descendants of that ancestor.
• A strap is a long narrow strip of pliant material such as leather. The computer term bootstrap began as a 1950s metaphor derived from using a strap to pull on leather boots without outside help.
• In computing, bootstrapping ("to pull oneself up by one's bootstraps") refers to techniques that allow a simple system to activate a more complicated system.
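The resampling step and the scoring step described above can be sketched in a few lines. The tree-building step itself is left out: in the sketch the clades found in each replicate are simply given as sets of names, which is an invented stand-in for the output of a tree program. All names are illustrative.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the same length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(sequence[i] for i in columns)
        for name, sequence in alignment.items()
    }


def clade_support(replicate_clades, clade):
    """Percentage of replicate trees in which the clade is present."""
    present = sum(1 for clades in replicate_clades if clade in clades)
    return 100 * present / len(replicate_clades)


# Invented alignment; a fixed seed makes the example repeatable.
alignment = {"seq_a": "ATGCCG", "seq_b": "ATGCCA", "seq_c": "TTGACA"}
replicate = bootstrap_alignment(alignment, random.Random(0))
print(replicate)

# Invented clade sets standing in for 4 replicate trees.
ab = frozenset({"seq_a", "seq_b"})
replicate_clades = [{ab}, set(), {ab}, {ab}]
print(f"support for (seq_a, seq_b): {clade_support(replicate_clades, ab):.0f}%")
```

Every column of a replicate is a whole column of the original alignment, so the relationship between the sequences at each site is preserved; only the weight given to each site changes from one replicate to the next.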

✔ TASK Follow the menu cascade from the main MEGA window:

Phylogeny > Bootstrap Test of Phylogeny > Neighbor-joining...

Click on the Test of Phylogeny tab. Keep the default number of cycles to 500. The random seed varies each time and can be accepted as-is.

Click √ to close the window. Click Compute to calculate the tree. In the window where the tree is displayed follow the menu cascade: View > Topology Only. This will change the display of the tree so that the branching and the bootstrap values can be read more clearly.



✔ OPTIONAL 1 Save Current Session in .mts format and / or Export Current Tree (Newick)

✔ OPTIONAL 2 Explore the various tree display options

3. Tree Rooting

✔ READ In spite of its appearance, the Neighbor-Joining tree (with or without bootstrapping) is unrooted. An unrooted tree illustrates relatedness without making assumptions about common ancestry, while a rooted tree is a directed tree with a unique node corresponding to the most recent common ancestor (usually calculated). The most common method for rooting trees is the use of an uncontroversial outgroup: close enough to allow inference from sequence or trait data, but far enough to be a clear outgroup.

3.1. Finding an outgroup

The outgroup will allow the rooting of the tree. The evolutionary assumption is that the outgroup branched from the parent group before the other groups branched from each other.

✔ READ You might have noted that there are 2 longer sequences among the sequences selected from the BLAST results: P_physalis and P_penicillatus. Indeed these 2 sequences are Eukaryotic sequences, as shown within the NCBI sequence text:

DEFINITION voltage-gated potassium channel [Physalia physalis].
ACCESSION ABD59027
VERSION ABD59027.1 GI:88976032
DBSOURCE accession DQ385496.1
ORGANISM Physalia physalis
Eukaryota; Metazoa; Cnidaria; Hydrozoa; Siphonophora; Cystonectae; Physaliidae; Physalia.

DEFINITION Polyorchis penicillatus potassium channel homolog (jShak1) mRNA, complete cds.
ACCESSION U32922
VERSION U32922.1 GI:987508
KEYWORDS .
SOURCE Polyorchis penicillatus (penicillate jellyfish)
ORGANISM Polyorchis penicillatus
Eukaryota; Metazoa; Cnidaria; Hydrozoa; Hydroida; Anthomedusae; Polyorchidae; Polyorchis.

Both sequences are reported as potassium channels and we can assume that they are homologs. In addition, these sequences fit on the alignment, including within the selectivity filter region, and are grouped together with a value of 100 in the bootstrap calculation.

3.2. Rooting the tree

✔ TASK On the tree window click the rooting icon as shown here and located at the top left of the tree window.

Click on the node connecting the 2 Eukaryotic sequences to root the tree at this node.


The tree is now a rooted tree.



L11 Exercise F: End of laboratory

1) Save files that you wish to keep 2) quit MEGA 3) Close all windows.

- e -
