Applied Bioinformatics Week 8 Jens Allmer. Practice I

Practice I

Topic

• Multiple Sequence Alignment Review

– Building an MSA

– Editing an MSA

• Dendrograms

• Phylogenetic Trees

Choosing Sequences

• How many?

– 10 – 15 (fewer than 50 would be good)

• Seqs should be >30% and <90% identical

• Prefer seqs of similar length

• Prefer seqs without internal repeats, or cut the repeats out before aligning
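
The identity window above can be checked with a few lines of code. This is a minimal sketch, assuming the sequences are already aligned to equal length with "-" as the gap character; the function names `percent_identity` and `pairs_outside_range` are hypothetical, and counting only columns where neither sequence has a gap is one of several possible conventions.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent identity of two aligned sequences, skipping columns with a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # assumption: gapped columns are not counted
        compared += 1
        matches += x == y
    return 100.0 * matches / compared if compared else 0.0


def pairs_outside_range(seqs: dict[str, str], low: float = 30.0, high: float = 90.0):
    """Return (name1, name2, identity) for pairs not strictly between low and high."""
    flagged = []
    for (n1, s1), (n2, s2) in combinations(seqs.items(), 2):
        pid = percent_identity(s1, s2)
        if not (low < pid < high):
            flagged.append((n1, n2, pid))
    return flagged
```

Pairs returned by `pairs_outside_range` are candidates for removal: near-identical pairs add little information, and very distant pairs are hard to align reliably.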

Choosing Sequences

• While choosing your sequences, give them good names

• Some sequences should be well annotated

Create an MSA

• This time use 20 – 50 sequences

– From different species

• Use ClustalW for alignment

• Most ClustalW servers display a dendrogram

• Confirm this by using a few of them

Gathering Sequences

• Download the sequences as a FASTA file as well

• Most programs will support this format
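
FASTA is simple enough to read without any library. A minimal sketch, assuming each record starts with a ">" header line and that the first word of the header is the sequence name; `parse_fasta` is a hypothetical helper, not part of any tool mentioned here.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {name: sequence}; the name is the first word after '>'."""
    records: dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("empty FASTA header")
            name = header[0]
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data before first header")
        else:
            records[name] += line  # sequences may span several lines
    return records
```

Rejecting duplicate names early is a design choice: most alignment and tree programs identify sequences by name only.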

Output Formats

• Many different formats

– FASTA: widely supported

– PDF: only for printing/storing/sharing

– PIR: similar to FASTA

– MSF: common MSA format

– ALN: subset of MSF

Converting Formats

• http://bioweb.pasteur.fr/seqanal/interfaces/fmtseq.html

• Names (>…) no longer than 15 characters

• Different formats maintain different data

• Converting between formats can lose data

• Make sure to have a master copy
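
Shortening names by hand is error-prone, so a small script can help. A minimal sketch, assuming the 15-character limit from above; `shorten_names` is a hypothetical helper that keeps a mapping back to the original names, so the master copy can always be matched to the converted file.

```python
def shorten_names(names: list[str], max_len: int = 15) -> dict[str, str]:
    """Map each original name to a unique name of at most max_len characters."""
    mapping: dict[str, str] = {}
    used: set[str] = set()
    for original in names:
        candidate = original[:max_len]
        counter = 1
        # If truncation collides with an earlier name, append _1, _2, ...
        while candidate in used:
            suffix = f"_{counter}"
            candidate = original[: max_len - len(suffix)] + suffix
            counter += 1
        used.add(candidate)
        mapping[original] = candidate
    return mapping
```

Saving the returned mapping next to the master copy makes it possible to restore the full names on the final tree.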

Editing Alignments

• http://www.jalview.org

• Start the program

• Choose File – Input Alignment – from Textbox

• Copy and paste the ClustalW alignment

Dendrogram

• Jalview also allows you to view different types of Dendrograms based on different similarity measures

• Use Jalview and compare the trees that are constructed based on the different measures

End Practice I

• 15 min break

Theory I

Phylogeny

• Sources

– Sequences

– Clades

– Organisms

• Why

– Understand evolution

– Strain diversity

– Epidemiology

– Gene prediction

Dendrogram

http://en.wikipedia.org/wiki/Dendrogram

Phylogenetic Tree

Tree Terminology

• All circled elements (e.g.: a) are called node(s)

• The connections between them are called edge(s) or branch(es)

• The first node that forms the tree is called root (here abcdef)

• Terminal nodes that have only one connection are called leaf(ves) (e.g.: a)
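
The terminology maps directly onto a small data structure. A minimal sketch, assuming a rooted tree in which every node knows its children; the class `Node` and its methods are hypothetical names chosen for illustration, and `to_newick` writes the topology only (no branch lengths).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        """A leaf is a terminal node: it has no children."""
        return not self.children

    def leaves(self) -> list[str]:
        """Names of all leaves below this node, left to right."""
        if self.is_leaf():
            return [self.name]
        return [leaf for child in self.children for leaf in child.leaves()]

    def edges(self) -> list[tuple[str, str]]:
        """All (parent, child) connections below this node."""
        result = []
        for child in self.children:
            result.append((self.name, child.name))
            result.extend(child.edges())
        return result

    def to_newick(self) -> str:
        """Topology in Newick notation, without the trailing semicolon."""
        if self.is_leaf():
            return self.name
        return "(" + ",".join(c.to_newick() for c in self.children) + ")"
```

For example, `Node("abc", [Node("a"), Node("bc", [Node("b"), Node("c")])])` has the root "abc", the leaves a, b and c, and four edges.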

Unrooted Trees (remove red root)

Branch Length

• Arbitrary

• Similarity

• Evolutionary Time

Tree types

• A dendrogram is a broad term for the diagrammatic representation of a phylogenetic tree.

• A cladogram is a tree formed using cladistic methods. This type of tree only represents a branching pattern, i.e., its branch lengths do not represent time.

• A phylogram is a phylogenetic tree that explicitly represents number of character changes through its branch lengths.

• A chronogram is a phylogenetic tree that explicitly represents evolutionary time through its branch lengths.

Sequences

• DNA

– Sensitive but quite divergent at longer distances

– Use for very closely related organisms

• cDNA

– Still sensitive but less divergent (e.g. no introns)

– Use for closely related families

• Protein

– Least sensitive but most useful for more distant relationships

– Use for distantly related species

• 16S rRNA

– Exists in all prokaryotes (eukaryotes have the related 18S rRNA)

– Highly conserved

Overall Process

• Get Sequences

• Construct MSA

• Compute pairwise distances (for some methods)

• Build Tree

– Topology

– Branch Lengths

• Estimate accuracy, reliability

– Build several different trees for that

• Visualize the tree

Computational Tree Formation

• Distance Methods

– Neighbor-Joining

– Least-Squares

– UPGMA

• Parsimony

– Least number of evolutionary steps

• Maximum Likelihood

– The tree with the highest likelihood under the assumed model of evolution is constructed

Neighbor Joining

• Bottom-up clustering method

1. Create distance matrix

2. Join the closest pair of nodes (after correcting each distance for the nodes' total distance to all others)

3. Repeat (1-2) until fully joined

http://en.wikipedia.org/wiki/Neighbor_joining
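
The three steps can be sketched in code. This is a minimal illustration, assuming a symmetric distance matrix given as a list of lists in the same order as the names; `neighbor_joining` is a hypothetical function name. The pair to join is chosen with the standard Q criterion, which corrects the raw distance for each node's total distance to all other nodes, and the result is a Newick string with branch lengths.

```python
def neighbor_joining(names: list[str], d: list[list[float]]) -> str:
    """Build a tree by neighbor joining; needs at least two uniquely named taxa."""
    labels = list(names)  # current nodes, stored as Newick fragments
    dist = {(a, b): d[i][j] for i, a in enumerate(names) for j, b in enumerate(names)}
    while len(labels) > 2:
        n = len(labels)
        total = {a: sum(dist[a, b] for b in labels) for a in labels}
        # Q criterion: minimize (n - 2) * d(a, b) - total(a) - total(b)
        a, b = min(
            ((x, y) for i, x in enumerate(labels) for y in labels[i + 1:]),
            key=lambda p: (n - 2) * dist[p] - total[p[0]] - total[p[1]],
        )
        # Branch lengths from the new internal node to a and to b
        la = 0.5 * dist[a, b] + (total[a] - total[b]) / (2 * (n - 2))
        lb = dist[a, b] - la
        new = f"({a}:{la:g},{b}:{lb:g})"
        # Distances from the new node to every remaining node
        for c in labels:
            if c not in (a, b):
                dist[new, c] = dist[c, new] = 0.5 * (dist[a, c] + dist[b, c] - dist[a, b])
        dist[new, new] = 0.0
        labels = [c for c in labels if c not in (a, b)] + [new]
    a, b = labels
    # The last two nodes are connected by a single edge
    return f"({a},{b}:{dist[a, b]:g});"
```

The returned string can be pasted into any tree viewer that reads Newick. Neighbor joining yields an unrooted tree, so the outermost parentheses do not imply a root.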

Least Squares

• Standard approximation approach

– Minimizes the sum of squared errors

• Example PGLS – Phylogenetic Generalized Least Squares

– Needs additional data (traits)

http://www.dynamicgeometry.com/General_Resources/Advanced_Sketch_Gallery/Other_Explorations/Statistics_Collection/Least_Squares.html

UPGMA

• Unweighted Pair Group Method with Arithmetic Mean

– Agglomerative hierarchical clustering method

– Assumes constant rate of evolution
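
UPGMA is short enough to write out in full. A minimal sketch, assuming a symmetric distance matrix as a list of lists in the same order as the names; `upgma` is a hypothetical function name. Because a constant rate of evolution is assumed, each merge is placed at half the distance between the two clusters, so all leaves end up at the same distance from the root.

```python
def upgma(names: list[str], d: list[list[float]]) -> str:
    """Cluster with UPGMA and return a rooted Newick string with branch lengths."""
    sizes = {name: 1 for name in names}        # cluster label -> number of leaves
    heights = {name: 0.0 for name in names}    # cluster label -> height above leaves
    dist = {(a, b): d[i][j] for i, a in enumerate(names) for j, b in enumerate(names)}
    while len(sizes) > 1:
        labels = list(sizes)
        # Join the two clusters with the smallest average distance
        a, b = min(
            ((x, y) for i, x in enumerate(labels) for y in labels[i + 1:]),
            key=lambda p: dist[p],
        )
        h = dist[a, b] / 2
        new = f"({a}:{h - heights[a]:g},{b}:{h - heights[b]:g})"
        size = sizes[a] + sizes[b]
        # Arithmetic mean over all leaf pairs, weighted by cluster size
        for c in labels:
            if c not in (a, b):
                dist[new, c] = dist[c, new] = (
                    sizes[a] * dist[a, c] + sizes[b] * dist[b, c]
                ) / size
        heights[new] = h
        del sizes[a], sizes[b]
        sizes[new] = size
    return next(iter(sizes)) + ";"
```

If the rates of evolution differ strongly between lineages, the constant-rate assumption is violated and the resulting topology can be misleading.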

Similarity Measures

• Sequence

– Number of different positions

– Weighted differences (substitution matrices)

– Pairwise alignments (NW, SW, ..)

• Additional measurements or knowledge

– Traits

• Parsimony

– Number of changes for tree paths
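
The parsimony measure can be illustrated by counting changes on a fixed tree. This is a minimal sketch using Fitch's algorithm, assuming a rooted binary tree written as nested 2-tuples of leaf names and an alignment given as {name: sequence}; `fitch_changes` and `parsimony_score` are hypothetical function names.

```python
def fitch_changes(tree, states: dict[str, str]) -> int:
    """Minimum number of changes for one alignment column on a fixed binary tree."""
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}  # leaf: its observed character
        left, right = (visit(child) for child in node)
        common = left & right
        if common:
            return common  # children agree: no change needed here
        changes += 1       # children disagree: one change on this path
        return left | right

    visit(tree)
    return changes


def parsimony_score(tree, alignment: dict[str, str]) -> int:
    """Sum of minimum changes over all columns of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_changes(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(length)
    )
```

Scoring several candidate trees with `parsimony_score` and keeping the one with the lowest score is the idea behind the parsimony method.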

Tree Accuracy

• Bootstrapping

– Resample

– Recompute

– Do many times

– Compare results
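
The resampling step draws alignment columns at random with replacement. A minimal sketch, assuming an alignment given as {name: sequence}; `bootstrap_alignment` and `bootstrap_support` are hypothetical names, and `build_tree` stands for any tree-building function whose results can be compared with `==`. Note the simplification: real bootstrap values are reported per branch, whereas this sketch only reports how often the whole tree is reproduced.

```python
import random


def bootstrap_alignment(alignment: dict[str, str], rng: random.Random) -> dict[str, str]:
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def bootstrap_support(alignment, build_tree, replicates: int = 100, seed: int = 0) -> float:
    """Fraction of replicates whose tree equals the tree from the original data."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    reference = build_tree(alignment)
    hits = sum(
        build_tree(bootstrap_alignment(alignment, rng)) == reference
        for _ in range(replicates)
    )
    return hits / replicates
```

The same columns are picked for every sequence, so each replicate is still a valid alignment.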

http://www.sciencedirect.com/science/article/pii/S0191814107000156

http://goergen.deviantart.com/art/Magic-Forrest-Wallpaper-139108299

End Theory I

• Mindmap

• Break

Practice II

Where to get Trees

• Most servers that allow for MSA will also provide at least the guide tree which was used to construct the alignment

• If that’s all you are interested in you don’t need to go any further

Edit your MSA

• Remove blocks consisting of mostly gaps (using Jalview)

• Remove N- and C-termini if not well conserved
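
Gap-rich columns can also be removed with a script instead of by hand. A minimal sketch, assuming an alignment given as {name: sequence} with "-" as the gap character; `drop_gappy_columns` is a hypothetical helper, and the default threshold of 0.5 is an assumed reading of "mostly gaps".

```python
def drop_gappy_columns(alignment: dict[str, str], max_gap_fraction: float = 0.5) -> dict[str, str]:
    """Remove columns in which more than max_gap_fraction of the sequences have a gap."""
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    if len({len(seq) for seq in seqs}) > 1:
        raise ValueError("sequences must be aligned to the same length")
    keep = [
        i
        for i, column in enumerate(zip(*seqs))
        if column.count("-") / len(column) <= max_gap_fraction
    ]
    return {name: "".join(alignment[name][i] for i in keep) for name in names}
```

This works column by column, so it will not recognize poorly conserved termini; those still need to be trimmed by inspection.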

Easy Tree

• www.ebi.ac.uk/clustalw/

• Paste your alignment

• Select a tree type

• Other options need to be set (see right)

• Press run

• Make a screen shot

• You can paste it where needed

Phylip (More elaborate tree)

• http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html

• Choose protdist from the page

• Paste the MSA

• Bootstrapping e.g.:

Phylip

• Run the query

• Click further analysis

Click Run

Select full screen view

There is your tree

Ugly Tree

• Let’s face it, the tree is quite ugly

• http://iubio.bio.indiana.edu/treeapp/treeprint-form.html

• Select the consense.outtree from the previous website and paste it into the box

• Select submit to create the tree

• Play around with the formats and settings

Tree Topologies

Other Resources

• http://en.wikipedia.org/wiki/List_of_phylogenetics_software

• http://itol.embl.de/
