
20080110 Genome exploration in A-T G-C space: an introduction to DNA walking


DESCRIPTION

Presentation of my MSc project to my new research group in Nottingham.


Page 1: 20080110 Genome exploration in A-T G-C space: an introduction to DNA walking

Genome exploration in A-T G-C space: an introduction to ‘DNA walking’

Jonathan Blakes

Submitted for the degree MSc Biotechnology and Computation

Title image: Julie Newdoll, Dawn of the Double Helix, Oil/Mixed, 2002

Page 2

Exponential growth of DNA sequences

GenBank release notes, December 2007

“from 1982 to the present, the number of bases in GenBank has doubled approximately every 18 months.”

83,874,179,730 base pairs in 80,388,382 entries

Page 3

Genome Browsers

EnsEMBL: EMBL-EBI and the Wellcome Trust Sanger Institute

UCSC: University of California at Santa Cruz

Page 4

Can we understand this information a priori, summarise it and preserve fine structure?

Page 5
Page 6

GraphDNA – DNA Walker

Page 7
Page 8
Page 9

Mapping

Page 10

Mapping

4 rotations for each of the 3 previous mappings

2 reflections of each of those

24 possible combinations of cardinal vectors

These are the 3 most parsimonious mappings of those 24
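
The count of 24 can be checked by brute force. The sketch below assumes each base is assigned to one of the four unit steps on a 2D grid; the function names are hypothetical and only illustrate the counting argument, they are not taken from GraphDNA.

```python
from itertools import permutations

# Four unit steps on the 2D grid, in the order: up, right, down, left.
DIRECTIONS = [(0, 1), (1, 0), (0, -1), (-1, 0)]


def all_mappings():
    """Every assignment of the bases A, C, G, T to the four unit steps."""
    return [dict(zip(bases, DIRECTIONS)) for bases in permutations("ACGT")]


def opposing_pairs(mapping):
    """Name a mapping by which bases point in opposite directions."""
    pairs = set()
    for a, (ax, ay) in mapping.items():
        for b, (bx, by) in mapping.items():
            if a < b and (ax + bx, ay + by) == (0, 0):
                pairs.add(a + "-" + b)
    return " ".join(sorted(pairs))


# Group the 24 assignments by their opposing pairs.
groups = {}
for mapping in all_mappings():
    groups.setdefault(opposing_pairs(mapping), []).append(mapping)

for name, members in sorted(groups.items()):
    print(name, len(members))
```

Grouping by opposing pairs leaves three families of eight, one per pairing, which matches 4 rotations times 2 reflections for each of the 3 mappings. The label "A-T C-G" printed by this sketch is the same pairing the slides call A-T G-C.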

Page 11

A-T G-C

Page 12

A-G C-T

Page 13

A-C G-T

Page 14

A-T G-C

Page 15

A-T G-C is consistently smallest

A-T G-C walks
• contain more information in less space
• are simply easier to print
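
A comparison of this kind could be reproduced roughly as follows. This is a minimal sketch: the slides only say which bases oppose each other in each mapping, so the concrete directions below (for example A up and T down) are assumptions, as are the names `dna_walk` and `bounding_box`.

```python
# Assumed direction assignments; only the opposing pairs come from the slides.
MAPPINGS = {
    "A-T G-C": {"A": (0, 1), "T": (0, -1), "G": (1, 0), "C": (-1, 0)},
    "A-G C-T": {"A": (0, 1), "G": (0, -1), "C": (1, 0), "T": (-1, 0)},
    "A-C G-T": {"A": (0, 1), "C": (0, -1), "G": (1, 0), "T": (-1, 0)},
}


def dna_walk(sequence, mapping):
    """Return the list of (x, y) points visited, starting at the origin."""
    x = y = 0
    points = [(0, 0)]
    for base in sequence.upper():
        step = mapping.get(base)
        if step is None:  # skip N and other ambiguity codes
            continue
        x, y = x + step[0], y + step[1]
        points.append((x, y))
    return points


def bounding_box(points):
    """Width and height of the smallest axis-aligned box containing the walk."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs), max(ys) - min(ys))


sequence = "ATGCGCGATATTAGC"  # made-up example sequence
for name, mapping in MAPPINGS.items():
    print(name, bounding_box(dna_walk(sequence, mapping)))
```

Running the loop over real chromosome sequences and comparing the box sizes is one way to test whether A-T G-C is the most compact mapping for a given genome.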

Page 16
Page 17
Page 18
Page 19
Page 20

Genome Exploration

Page 21

Human chromosome 1: 250,000,000 bases

Page 22

S. cerevisiae chromosome 1

Page 23

EnsEMBL annotation

Page 24
Page 25

Duplications

Can detect duplications by eye or algorithmically:

small (~4 base) sequences occur several times in each larger sequence

for each small sequence, calculate the line between occurrences

if the angle and length of 2 or more lines are consistent, then draw the lines to reveal possible duplications
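
The algorithmic steps can be sketched as below. This is not the GraphDNA implementation: it assumes an upper-case sequence containing only A, C, G and T, an assumed direction assignment for the A-T G-C mapping, and it treats two lines as consistent when their displacement vectors on the integer grid are identical (same angle and same length). The name `candidate_duplications` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Assumed directions for the A-T G-C mapping.
STEPS = {"A": (0, 1), "T": (0, -1), "G": (1, 0), "C": (-1, 0)}


def walk_points(sequence):
    """Walk coordinates; points[i] is the position before base i is read."""
    x = y = 0
    points = [(0, 0)]
    for base in sequence:
        dx, dy = STEPS[base]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def candidate_duplications(sequence, k=4, min_lines=2):
    """Map each displacement vector to the occurrence pairs that share it."""
    points = walk_points(sequence)

    # Index the start positions of every k-base subsequence.
    occurrences = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        occurrences[sequence[i:i + k]].append(i)

    # One line per pair of occurrences of the same subsequence.
    lines = defaultdict(list)
    for positions in occurrences.values():
        for i, j in combinations(positions, 2):
            (x1, y1), (x2, y2) = points[i], points[j]
            lines[(x2 - x1, y2 - y1)].append((i, j))

    # Keep vectors supported by at least min_lines lines.
    return {vec: pairs for vec, pairs in lines.items() if len(pairs) >= min_lines}


print(candidate_duplications("AAGGAAAGGA"))
```

Enumerating every pair of occurrences is quadratic in the number of hits per subsequence, so a real chromosome would need a larger `k` or a cap on pairs per subsequence.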

Page 26
Page 27

Comparison with published data

This is a 7-fold contiguous duplication in the male Y chromosome: members of the TSPY (Testis-specific Y-encoded proteins) family, identified by Skaletsky et al., Nature 423 (2003), using a combination of a whole-chromosome dotplot with a 2-kb window and a custom Perl script running BLAST alignments of all 5-kb sequence segments, in 2-kb steps, of the entire MSY (Male Specific Y).

Figure legend: exons, introns

Page 28

Phylogenetics

• Phylogenetics is the reconstruction of evolutionary relatedness from primary sequence information: DNA or protein
• Traditional phylogenetic methods all start from a multiple sequence alignment, produced by tools such as ClustalW and TCoffee
• It is hard to find the optimal alignment for many long sequences
• Want to use simple measures such as Manhattan or Euclidean distance derived from DNA walks to produce phylogenies without alignment
• Are these comparable to alignment methods?
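
One way such a distance could be defined is sketched here. The slides do not say how the walk distance was computed, so this is an assumption: the two walks are compared point by point, the per-step Manhattan or Euclidean distances are summed, and the longer walk is truncated to the length of the shorter. The names `walk_distance` and `distance_matrix` are hypothetical.

```python
import math

# Assumed directions for the A-T G-C mapping.
STEPS = {"A": (0, 1), "T": (0, -1), "G": (1, 0), "C": (-1, 0)}


def walk_points(sequence):
    """Cumulative (x, y) positions of the walk, starting at the origin."""
    x = y = 0
    points = [(0, 0)]
    for base in sequence:
        dx, dy = STEPS[base]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def walk_distance(seq_a, seq_b, metric="manhattan"):
    """Sum of point-to-point distances between two walks (no alignment)."""
    total = 0.0
    # zip stops at the end of the shorter walk.
    for (x1, y1), (x2, y2) in zip(walk_points(seq_a), walk_points(seq_b)):
        dx, dy = x1 - x2, y1 - y2
        if metric == "manhattan":
            total += abs(dx) + abs(dy)
        else:
            total += math.hypot(dx, dy)
    return total


def distance_matrix(sequences):
    """Pairwise walk distances for a dict of name -> sequence."""
    names = list(sequences)
    matrix = [[walk_distance(sequences[a], sequences[b]) for b in names] for a in names]
    return names, matrix
```

A matrix built this way has the same shape as one derived from an alignment, so it can be passed to the same tree-building methods for comparison.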

Page 29
Page 30

Phylogeny algorithms

Published distance matrix from Gilfillan GD, et al. Microbiology 144 (1998) 829-838, of 7 aligned, 1798-nucleotide-long small rRNA sequences of Candida and Saccharomyces species

Panels: neighbour joining, UPGMA

Page 31

UPGMA method

Page 32

Tree construction

Distance Matrix

Output

Newick format string representation of a tree:

(Bovine:0.69395,(Gibbon:0.36079,(Orang:0.33636,(Gorilla:0.17147,(Chimp:0.19268,Human:0.11927):0.08386):0.06124):0.15057):0.54939,Mouse:1.21460);
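
The path from a distance matrix to a Newick string can be illustrated with a textbook UPGMA (average linkage) sketch. This is not necessarily the code used in the project; the function name `upgma` and the three-taxon matrix are made up for illustration, and the matrix is not data from the slides.

```python
def upgma(names, matrix):
    """Cluster by average linkage and return the tree as a Newick string."""
    n = len(names)
    # Each cluster: id -> (Newick text, number of leaves, height above the leaves).
    clusters = {i: (names[i], 1, 0.0) for i in range(n)}
    # Pairwise distances keyed by (smaller id, larger id).
    dist = {(i, j): float(matrix[i][j]) for i in range(n) for j in range(i + 1, n)}
    next_id = n

    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        (a, b), d = min(dist.items(), key=lambda item: item[1])
        text_a, size_a, height_a = clusters.pop(a)
        text_b, size_b, height_b = clusters.pop(b)
        height = d / 2  # the new node sits halfway between the two clusters
        merged = f"({text_a}:{height - height_a:.5f},{text_b}:{height - height_b:.5f})"

        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the two old distances.
        updated = {}
        for c in clusters:
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            updated[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)

        dist = {pair: v for pair, v in dist.items() if a not in pair and b not in pair}
        dist.update(updated)
        clusters[next_id] = (merged, size_a + size_b, height)
        next_id += 1

    return next(iter(clusters.values()))[0] + ";"


# Made-up distances: A and B are close, C is distant from both.
print(upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]]))
# prints (C:3.00000,(A:1.00000,B:1.00000):2.00000);
```

UPGMA assumes a constant rate of change along all branches, so every leaf ends at the same height; neighbour joining does not make that assumption and can give a different tree from the same matrix.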

Page 33

Phylogenies with each mapping

Page 34

Can summing 3 mappings eliminate bias?

No.

Possible improvements:
• A more complex distance measure, perhaps a composite of small sequence distances
• Larger sequences – human / chimpanzee chromosome – where mapping bias may be informative rather than destructive

Page 35

Conclusion

DNA walks can
• summarise information about nucleotide content in DNA sequences
• visualise tandem repeats
• uncover more distant relationships such as duplications and retroviral genomes without expert knowledge or complex algorithms
• be overlaid with annotations from Ensembl to function as an alternative to linear genome browsers
• be used to construct phylogenetic relationships, but the resulting trees are inaccurate

A-T G-C is the most useful mapping for viewing 2D walks

Page 36

Future

3D walks
• mappings where one nucleotide opposes another result in information loss, which can be helpful, but we can’t know what we are missing
• use a tetrahedral mapping:
• each step starts from the centre and proceeds to a corner
• should produce 3D structure like proteins but much bigger
• can recover 2D walk by viewing orientation
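
The tetrahedral idea can be sketched as follows. The assignment of bases to corners is an assumption chosen for illustration: the four corners are alternate vertices of a cube, so no two steps cancel in 3D, while dropping the z coordinate gives back a 2D walk in which A opposes T and G opposes C (rotated onto the diagonals). The names `walk_3d` and `project_xy` are hypothetical.

```python
# Assumed corner assignment: alternate vertices of a cube form a regular tetrahedron.
TETRAHEDRON = {
    "A": (1, 1, 1),
    "T": (-1, -1, 1),
    "G": (1, -1, -1),
    "C": (-1, 1, -1),
}


def walk_3d(sequence):
    """Cumulative (x, y, z) positions, each step going from centre to a corner."""
    x = y = z = 0
    points = [(0, 0, 0)]
    for base in sequence:
        dx, dy, dz = TETRAHEDRON[base]
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


def project_xy(points):
    """View the 3D walk along the z axis to recover a 2D walk."""
    return [(x, y) for x, y, _ in points]


points = walk_3d("AT")
print(points)              # A then T no longer returns to the origin in 3D
print(project_xy(points))  # but does in the 2D projection
```

Projecting along the x or y axis instead pairs the bases differently, so one 3D walk contains all three 2D pairings.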

Page 37

Acknowledgments

Biosciences

Dr. Gary Robinson (supervisor Biosciences)

Dr. Jürgen Schmidt (course convenor)

Dr. Anthony Baines (Bioinformatics lecturer)

Computing

Dr. Colin Johnson (supervisor Computing)