Assignment on Phylogeny

Bioinformatics - Assignment 1 Report

Phylogeny Construction for Influenza Viruses Based on Hemagglutinin Sequences

Reihaneh Rabbany k.

Contents

Assignment Description
First Step, Collecting Protein Sequences
    Influenza Virus
    Collecting Sequences
    FASTA Format
Step 2, Computing Pairwise Distances and Multiple Alignment
    Scoring scheme
    ClustalW
Step 3, Phylogeny Construction
    Distance-based phylogeny
    Character-based phylogeny
Step 4, Evaluation
    Consistency
    Bootstrapping
References

Figures

Figure 1- Annotated phylogeny tree by distance method
Figure 2- Annotated phylogeny tree obtained by parsimony
Figure 3- Zoomed branch of distance tree (left) and parsimony tree (right)
Figure 4- Comparing the resulting cladogram from the distance method (right) with the reported cladogram for Influenza A virus by Yoshiyuki Suzuki et al. (left)
Figure 5- Bootstrap values: the upper branch corresponds to the parsimony method and the bottom one to the distance method




Assignment Description

The following is an H1N1 influenza virus hemagglutinin protein sequence, in FASTA format.

>gi|89903075|gb|ABD79112| /Human/4(HA)/H1N1//1946/// hemagglutinin [Influenza A virus A/Cam/46(H1N1))]
MKAKLLILLCALSATDADTICIGYHANNSTDTVDTVLEKNVTVTHSVNLLEDSHNGKLCRLKGIAPLQLG
KCNIAGWILGNPECESLLSKRSWSYIAETPNSENGACYPGDFADYEELREQLSSVSSFERFEIFPKGRSW
PEHNIDIGVTAACSHAGKSSFYKNLLWLTEKDGSYPNLNKSYVNKKEKEVLILWGVHHPPNIENQKTLYR
KENAYVSVVSSNYNRRFTPEIAERPKVRGQAGRINYYWTLLEPGDTIIFEANGNLIAPWYAFALNRGIGS
GIITSNASMDECDTKCQTPQGAINSSLPFQNIHPFTIGECPKYVRSTKLRMVTGLRNIPSIQSRGLFGAI
AGFIEGGWDGMIDGWYGYHHQNEQGSGYAADQKSTQNAINGITNKVNSVIEKMNTQFTAVGKEFNKLEKR
MENLNKKVDDGFLDIWTYNAELLVLLENERTLDFHDSNVKNLYEKVKNQLRNNAKEIGNGCFEFYHKCNN
ECMESVKNGTYDYPKFSEESKLNREKIDGVKLESMGVYQILAIYSTVASSLVLLVSLGAISFWMCSNGSL
QCRICI

Detailed tasks:

1. Search for at least 100 other hemagglutinin protein sequences for influenza viruses, such that they are distributed well in all 16 subtypes (H1–H16).

2. Use an appropriate scoring scheme to compute the pairwise distances between every pair of sequences in the above; using the same scoring scheme, construct a multiple sequence alignment for these sequences.

3. Use a distance-based and a character-based phylogeny construction method, together with an out-group, to build two phylogenies for these sequences.

4. Evaluate the constructed phylogenies.

Note that detailed descriptions of the steps you perform and the consequences of these operations must be reported (for example, the number of sequences you collected from each database, each tool you have called and its availability).




First Step, Collecting Protein Sequences

In the first step, I search for at least 100 other hemagglutinin protein sequences for influenza viruses, such that they are well distributed across all 16 subtypes (H1–H16). To do this, I first need to become familiar with the influenza virus.

Influenza Virus

“The influenza virus is an RNA virus [of the family Orthomyxoviridae, which] comprises five genera: Influenzavirus A, Influenzavirus B, Influenzavirus C, Isavirus, and Thogotovirus. The type A viruses are the most virulent human pathogens and cause the most severe disease. The Influenza A genome encodes 11 proteins: hemagglutinin (HA), neuraminidase (NA), nucleoprotein (NP), M1, M2, NS1, NS2(NEP), PA, PB1, PB1-F2 and PB2” [1].

“HA and NA are large glycoproteins on the outside of the viral particles; these proteins are targets for antiviral drugs [and] are antigens to which antibodies can be raised. Influenza A viruses are classified into subtypes based on antibody responses to HA and NA, forming the basis of the H and N distinctions in, for example, H5N1” [1]. “There are 16 different HA antigens (H1 to H16) and nine different NA antigens (N1 to N9) for influenza A” [2].

Naming

Each subtype has mutated into a variety of strains (footnote 1) [2]. Generally, influenza A variants are identified according to the isolate that they resemble and with which they are therefore presumed to share lineage (for example, Fujian flu virus-like); according to their typical host (for example, bird flu, human flu, swine flu, horse flu, dog flu); according to their subtype, an H number (for hemagglutinin) and an N number (for neuraminidase) (for example, H3N2); and according to their deadliness (for example, LP, low pathogenic) [2,3].

1 A strain is a genetic variant or subtype of a microorganism (e.g. virus). For example, a "flu strain" is a certain biological form of the influenza or "flu" virus.




Collecting Sequences

I used the Influenza Virus Resource at NCBI (footnote 2) to retrieve the HA protein sequences for influenza. It contains more than 11000 viruses. I requested all complete HA sequences of Influenza A from the database, chose 7 of each subtype, and downloaded them in FASTA format. (The selected sequences are mostly from the USA and from between 2000 and 2008, unless there were not enough such sequences in those years.)

Despite the large number of influenza sequences in NCBI, it contains only 2 H14 and 5 H15 sequences. Therefore, I used UniProt (footnote 3) to find more sequences of these subtypes, and I found 4 H14 and 7 H15 sequences there. Further, I searched BioHealthBase (footnote 4) and found 8 H15 and 2 H14 sequences there.

All these results were integrated and saved as “data/sequences.fasta”.

FASTA Format

FASTA is a text-based format for representing nucleotide or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes.

The description line begins with the “>” symbol. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the ">" and the first letter of the identifier. In this dataset the descriptions contain each virus's location, host, year and subtype.
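The layout described above can be read with a few lines of code. The following is only a minimal sketch, not the tool used for this report: the function name `parse_fasta` and the two short example records are made up for illustration.

```python
from typing import Dict, Iterable, Tuple


def parse_fasta(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Return {identifier: (description, sequence)} from FASTA-formatted lines."""
    records: Dict[str, Tuple[str, str]] = {}
    identifier, description, chunks = None, "", []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if identifier is not None:
                records[identifier] = (description, "".join(chunks))
            header = line[1:].split(None, 1)
            identifier = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)  # sequence data may span several lines
    if identifier is not None:
        records[identifier] = (description, "".join(chunks))
    return records


# Hypothetical two-record example (not taken from the collected dataset).
example = """>seq1 /Human/H1N1/1946
MKAKLLILLC
ALSATDADTI
>seq2 /Avian/H5N1/2005
MEKIVLLFAI
""".splitlines()

records = parse_fasta(example)
```

Sequence lines belonging to one record are concatenated, so the line width used in the file does not matter.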

Amino acid codes

The amino acid codes supported are:

Code  Meaning
A     Alanine
B     Aspartic acid or Asparagine
C     Cysteine
D     Aspartic acid
E     Glutamic acid
F     Phenylalanine
G     Glycine
H     Histidine
I     Isoleucine
K     Lysine
L     Leucine
M     Methionine
N     Asparagine
O     Pyrrolysine
P     Proline
Q     Glutamine
R     Arginine
S     Serine
T     Threonine
U     Selenocysteine
V     Valine
W     Tryptophan
Y     Tyrosine
Z     Glutamic acid or Glutamine
X     Any
*     Translation stop
-     Gap of indeterminate length

2 http://www.ncbi.nlm.nih.gov/genomes/FLU/Database/select.cgi?go=1

3 http://www.uniprot.org/

4 http://www.biohealthbase.org/GSearch/home.do?decorator=Influenza




Step 2, Computing Pairwise Distances and Multiple Alignment

Here I use an appropriate scoring scheme to compute the pairwise distances between every pair of the collected sequences and then, using the same scoring scheme, construct a multiple sequence alignment for these sequences.

Scoring scheme

A scoring scheme encodes the biological information that determines how an alignment is scored. It consists of a substitution matrix (which assigns scores to amino-acid matches and mismatches) and gap penalties (for aligning an amino acid in one sequence to a gap in the other) [7, 8].

The two common families of substitution matrices are the PAM series and the BLOSUM series. When comparing closely related proteins, one should use a lower-numbered PAM or a higher-numbered BLOSUM matrix; for distantly related proteins, a higher-numbered PAM or a lower-numbered BLOSUM matrix [7].
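To illustrate how a substitution matrix and gap penalties combine into one score, here is a minimal sketch that scores an already-aligned pair of sequences. Several things in it are assumptions made for illustration: the toy substitution function (+2 for a match, -1 for a mismatch) stands in for a real PAM or BLOSUM matrix, and the gap convention (the first gap character in a run costs `gap_open`, each further one costs `gap_extend`) is only one of several conventions in use.

```python
def score_alignment(a, b, substitution, gap_open=10.0, gap_extend=0.2):
    """Score two equal-length gapped sequences under an affine gap penalty."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0.0
    prev_gap = None  # which sequence the previous gap character was in
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # a column with two gaps carries no information
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            # Opening a new gap is expensive; extending a running gap is cheap.
            score -= gap_extend if prev_gap == row else gap_open
            prev_gap = row
        else:
            score += substitution(x, y)
            prev_gap = None
    return score


def toy_matrix(x, y):
    """Hypothetical stand-in for a PAM/BLOSUM lookup."""
    return 2 if x == y else -1


single_gap = score_alignment("MKAK-LL", "MKGKTLL", toy_matrix)
long_gap = score_alignment("MK---LL", "MKAAALL", toy_matrix)
```

With these toy values a gap of three residues costs only slightly more than a gap of one, which is the intended effect of using a small extension penalty next to a large opening penalty.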

ClustalW

For performing multiple sequence alignment, I used ClustalW via the Mobyle (footnote 5) web service (a portal for bioinformatics analyses). I also tried ClustalX, which is a graphical (windowed) interface for the ClustalW multiple sequence alignment program, but as there is no difference in functionality, I kept using the web service.

“ClustalW is a progressive method that generates a multiple sequence alignment by first aligning the most similar sequences and then adding successively less related sequences or groups to the alignment until the entire query set has been incorporated into the solution. The initial tree describing the sequence relatedness is based on pairwise comparisons.” [8]

For its scoring scheme I selected the following settings for both the pairwise alignment parameters and the protein parameters of the multiple sequence alignment (footnote 6):

Gap opening penalty: 10

Gap extension penalty: 0.2

Gap separation penalty range: 8

Delay divergent sequences: 30% identity for delay

Protein weight matrix: PAM series

Protein weight matrix for pairwise alignment: PAM350
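The settings listed above could be passed to a local ClustalW installation roughly as follows. This is a hedged sketch that only builds and prints the command line; it does not run anything. The option names (`-gapopen`, `-gapext`, `-gapdist`, `-maxdiv`, `-matrix`, `-pwmatrix`) are given as I understand them from ClustalW's command-line help and should be checked against `clustalw -help` for the installed version. The output name `PAMAligned` and the helper `clustalw_command` are hypothetical.

```python
import shlex

# Settings from the list above, keyed by the assumed ClustalW option names.
settings = {
    "gapopen": 10,     # gap opening penalty
    "gapext": 0.2,     # gap extension penalty
    "gapdist": 8,      # gap separation penalty range
    "maxdiv": 30,      # % identity for delaying divergent sequences
    "matrix": "pam",   # protein weight matrix series
    "pwmatrix": "pam", # pairwise protein weight matrix
}


def clustalw_command(infile, outfile, options):
    """Build the argument list for a protein alignment run (not executed here)."""
    args = ["clustalw", "-align", f"-infile={infile}", "-type=protein"]
    args += [f"-{name}={value}" for name, value in options.items()]
    args.append(f"-outfile={outfile}")
    return args


cmd = clustalw_command("sequences.fasta", "PAMAligned", settings)
print(shlex.join(cmd))
```

Keeping the settings in one dictionary makes it easy to rerun the alignment with a different matrix series while leaving the gap penalties unchanged.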

5 http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=clustalw-multialign

6 The command for rerunning it is:

clustalw -align -infile=sequences.fasta -type=protein -matrix=blosum -nopgap -nohgap -hgapresidues="RNDQEGKPS" -pwmatrix=blosum -outfile=BlosumAligned




All the results of this section are reported under the directory “analysis\*\clustalw-multialign”. It includes “sequences.aln”, which contains the multiple sequence alignment; “sequences.dnd”, which contains the resulting guide tree; and “clustalw-multialign.out”, which shows the progress of the algorithm, including the pairwise scores between each pair of sequences.




Step 3, Phylogeny Construction

In this step I use a distance-based and a character-based phylogeny construction method, together with an out-group (which specifies which species is to have the root of the tree be on the line leading to it), to build two phylogenies for these sequences.

Distance-based phylogeny

In distance-based tree reconstruction, we reconstruct an evolutionary tree from a distance matrix. As most distance measures are not guaranteed to produce an additive matrix, we usually use a hierarchical clustering method for building the tree, such as UPGMA (Unweighted Pair Group Method with Arithmetic Mean), which produces an ultrametric rooted tree (the root corresponds to the cluster created last), or NJ (Neighbor Joining) [9].

For constructing this phylogenetic tree I used “Protdist” (Protein Sequence Distance Method) from the PHYLIP toolbox via the Mobyle web service (footnote 7). This program computed a distance matrix (based on the given multiple sequence alignment), which I then fed into the NJ algorithm (again via the Mobyle web service, footnote 8) to obtain the corresponding phylogenetic tree. Protdist's distance matrix is reported in “analysis\*\Distance\protdist\protdist.outfile” and NJ's tree is reported under “analysis\*\Distance\neighbor\neighbor.outtree”. I plotted its cladogram and tree using the “drawgram” and “drawtree” programs in PHYLIP. The results are “analysis\*\Distance\cladogram.pdf” and “analysis\*\Distance\tree.pdf”.

Figure 1- Annotated phylogeny tree by distance method

7 http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=protdist With command: ln -s PAMali.phylipi infile && protdist <protdist.params && mv outfile protdist.outfile

8 http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=neighbor With command: ln -s protdist.outfile infile && neighbor <neighbor.params && mv outfile neighbor.outfile && mv outtree neighbor.outtree
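For readers who want to see what the neighbor-joining step does with a distance matrix, here is a compact sketch of the textbook algorithm. It is not the PHYLIP "neighbor" implementation; the function name, the `nodeN` labels for internal nodes, and the five-taxon example matrix are all illustrative assumptions.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Textbook NJ. dist maps frozenset({a, b}) -> distance.

    Returns a list of (node, neighbor, branch_length) edges of the unrooted tree.
    """
    nodes = list(names)
    d = dict(dist)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Sum of distances from each node to all other current nodes.
        total = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Join the pair that minimises the Q criterion.
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - total[p[0]] - total[p[1]],
        )
        dij = d[frozenset((i, j))]
        u = f"node{next_id}"
        next_id += 1
        # Branch lengths from the new internal node to the joined pair.
        li = 0.5 * dij + (total[i] - total[j]) / (2 * (n - 2))
        lj = dij - li
        edges += [(u, i, li), (u, j, lj)]
        # Distances from the new node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                )
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


# Small additive example matrix (illustrative values only).
pairs = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
dist = {frozenset(p): float(v) for p, v in pairs.items()}
edges = neighbor_joining(["a", "b", "c", "d", "e"], dist)
lengths = {frozenset((x, y)): length for x, y, length in edges}
```

Because the example matrix is additive, the path lengths in the resulting tree reproduce the input distances exactly; with real Protdist output this is generally only approximately true.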




Character-based phylogeny

Instead of computing distances from the alignment matrix (n × m) and using the resulting distance matrix (n × n) to construct the tree, we can use the alignment matrix directly to build the evolutionary tree with character-based methods. These methods try to find the character strings for internal nodes that best explain their descendant species (here the character string of a species is its amino-acid string, i.e. the protein sequence of that species). An example is the maximum parsimony method, which defines the best tree as the one that needs the minimum number of changes [9].

For constructing this phylogenetic tree I used “ProtPars” (Protein Sequence Parsimony Method) from the PHYLIP toolbox via the Mobyle web service (footnote 9). The results are reported under the “analysis\*\Parisomy\protpars\” folder. I also plotted its cladogram and tree using the “drawgram” and “drawtree” programs in PHYLIP. The results are “analysis\*\Parisomy\cladogram.pdf” and “analysis\*\Parisomy \tree.pdf”.

Figure 2- Annotated phylogeny tree obtained by parsimony

9 http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=protpars With command: ln -s BlosumAligned.phylipi infile && protpars <protpars.params && mv outfile protpars.outfile && mv outtree protpars.outtree
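The idea of counting the minimum number of changes can be made concrete with Fitch's algorithm, which solves the small parsimony problem for one fixed tree. The sketch below is a simplification and not what ProtPars does internally (ProtPars uses its own step-counting rules for amino acids). The nested-tuple tree encoding, the function names, and the four short sequences are assumptions for illustration.

```python
def fitch_site(tree, states):
    """Return (candidate_states, changes) for one alignment column.

    tree is a leaf name (str) or a pair (left, right) of subtrees.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_site(left, states)
    right_set, right_changes = fitch_site(right, states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_changes + right_changes
    # Otherwise one change is required on one of the two branches.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Sum the Fitch change counts over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


# Hypothetical aligned sequences and two candidate rooted topologies.
alignment = {"A": "MKA", "B": "MKA", "C": "MRA", "D": "MRG"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
score_1 = parsimony_score(tree_1, alignment)
score_2 = parsimony_score(tree_2, alignment)
```

A full maximum parsimony search would evaluate this score over many candidate topologies and keep the lowest-scoring one; here `tree_1` needs fewer changes than `tree_2`.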




Step 4, Evaluation

For evaluating the constructed phylogenies, I took two approaches: checking their consistency with biological information, and bootstrapping.

Consistency

To evaluate the resulting trees, I compare how consistent they are with the taxonomy or biological data I have, both what is given in the protein sequences' descriptions and what is reported in others' results. To do this, I first renamed the sequences so that their subtype plus their year and location becomes their identifier, using “\RenameSequences\Renaming\Rename.java”. In this way the resulting tree becomes meaningful and readable (Figures 1 and 2). I used these renamed sequences to build the phylogenetic trees with both the distance-based and the parsimony methods; the resulting cladograms are “analysis\Consistence\Parisomy\cladogram.pdf” and “analysis\ Consistence\Distance\cladogram.pdf”. In these trees the viruses of the same subtype are grouped in the same clade, which shows the consistency of these algorithms with biological information. Moreover, most of the branches are consistent with the corresponding virus year. To illustrate this, I zoomed in on branch H7 of both trees:

Figure 3- Zoomed branch of distance tree (left) and parsimony tree (right)

With this scoring scheme, Figures 1, 2, and 3 suggest that the parsimony method represents the evolutionary distances between closely related sequences better than the distance method.

Further, I compared the resulting cladograms with the trees reported by Yoshiyuki Suzuki et al. [11], and there is high agreement between my results and the results presented in that paper about the divergence of influenza A virus subtypes (see Figure 4).




Figure 4- Comparing the resulting cladogram from the distance method (right) with the reported cladogram for Influenza A virus by Yoshiyuki Suzuki et al. (left). One can see that the clades are mostly identical.

Bootstrapping

Apart from consistency, I evaluated the resulting trees by bootstrapping. To do this, I set the parameters in “ProtPars” and in “ProtDist”+“NJ” to perform bootstrapping, so that they produced 100 bootstrapped trees, and then I fed these trees into the “consense” consensus tree program in the PHYLIP toolbox (via the Mobyle web service, footnote 10). This program generated a consensus tree with bootstrap values on its branches, which show the agreement on each branch between all bootstrapped trees (see “\Boostrap\Parisomy\TreeWithBootstrapValues.txt” and “\Boostrap\Distance\TreeWithBootstrapValues.txt”) [10]. One technical point is that, for the distance method, the “Protdist” web service is extremely slow; therefore, I tried the PHYLIP package directly. I used “seqboot” in the PHYLIP package to generate 100 bootstrapped MSAs. Using these bootstrapped samples I produced 100 trees with “neighbor” and finally obtained the consensus tree with bootstrap values with “consense”. This is still slow but much faster than the web service. Not surprisingly, the results are identical.

10 http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=consense With command: ln -s protpars.outtree intree && consense <consense.params && mv outfile consense.outfile && mv outtree consense.outtree
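The resampling that produces bootstrapped alignments can be sketched in a few lines: columns of the multiple sequence alignment are drawn with replacement until a new alignment of the same length is obtained. This is only an illustration of the idea behind "seqboot", not its implementation; the function name, the fixed random seed, and the tiny alignment are assumptions.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping rows in step."""
    length = len(next(iter(alignment.values())))
    # The same column indices are used for every sequence, so columns stay intact.
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[c] for c in columns) for name, seq in alignment.items()
    }


# Hypothetical aligned sequences; a real run would use the ClustalW output.
alignment = {"A": "MKAKLL", "B": "MKGKLL", "C": "MRAKIL", "D": "MRG-IL"}
rng = random.Random(0)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
```

Each replicate is then run through the same tree-building method as the original alignment, giving 100 trees to summarize.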




Comparing the bootstrap values in the consensus trees of the distance and parsimony methods revealed that the parsimony method is by far better than the distance method in this specific task and with this dataset and these settings: most of the branches in its tree have 100% or high bootstrap values, while the distance method produces relatively poor bootstrap values on its branches. For example, compare these two branches with similar species and different bootstrap values. The upper one is from the parsimony method (note that bootstrap values are between 0 and 1, i.e. 1 means 100%) and the lower one is from the distance method.

Figure 5- Bootstrap values: the upper branch corresponds to the parsimony method and the bottom one to the distance method
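The bootstrap values compared above are, in essence, the fraction of bootstrapped trees that contain a given group of sequences. The following sketch computes such fractions for rooted trees written as nested tuples. It is a simplification: PHYLIP's "consense" also builds the consensus topology and can treat trees as unrooted, which this sketch does not attempt. The function names and the four example trees are illustrative assumptions.

```python
def clades(tree):
    """Return (leaf_set, clade_set) for a rooted tree given as nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)  # the leaves below this internal node form one clade
    return leaves, found


def support(reference, replicate_trees):
    """Fraction of replicate trees containing each clade of the reference tree."""
    _, reference_clades = clades(reference)
    replicate_clades = [clades(t)[1] for t in replicate_trees]
    return {
        clade: sum(clade in rc for rc in replicate_clades) / len(replicate_trees)
        for clade in reference_clades
    }


# Hypothetical reference tree and four bootstrapped trees.
reference = (("A", "B"), ("C", "D"))
bootstrapped = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    (("B", "A"), ("D", "C")),
    (("A", "C"), ("B", "D")),
]
values = support(reference, bootstrapped)
```

As in the report, the values are on a 0 to 1 scale, so a clade found in three of four replicates gets 0.75.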




References

[1] http://en.wikipedia.org/wiki/Influenza

[2] http://en.wikipedia.org/wiki/Influenzavirus_A

[3] http://en.wikipedia.org/wiki/Influenza_Genome_Sequencing_Project

[4] http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html

[5] http://www.ncbi.nlm.nih.gov/genomes/FLU/Database/select.cgi?go=1

[6] http://www.uniprot.org/

[7] http://en.wikipedia.org/wiki/Substitution_matrix

[8] http://en.wikipedia.org/wiki/Sequence_alignment

[9] N. C. Jones and P. A. Pevzner. "An Introduction to Bioinformatics Algorithms". MIT Press. 2004

[10] http://bioweb2.pasteur.fr/docs/phylip/doc/consense.html

[11] Yoshiyuki Suzuki et al., "Origin and Evolution of Influenza Virus Hemagglutinin Genes", Molecular Biology and Evolution, Oxford University Press, April 1, 2002
