Presentation of my MSc project to my new research group in Nottingham.
Genome exploration in A-T G-C space
An introduction to ‘DNA walking’
Jonathan Blakes
Submitted for the degree MSc Biotechnology and Computation
Image: Julie Newdoll, Dawn of the Double Helix, Oil/Mixed, 2002
Exponential growth of DNA sequences
GenBank release notes, December 2007
“from 1982 to the present, the number of bases in GenBank has doubled approximately every 18 months.”
83,874,179,730 base pairs in 80,388,382 entries
Genome Browsers
EnsEMBL: EMBL-EBI, Sanger Institute, Wellcome Trust
UCSC: University of California at Santa Cruz
Can we understand this information a priori, summarise and preserve fine structure?
GraphDNA – DNA Walker
Mapping
4 rotations for each of the 3 previous mappings
2 reflections of each of those
24 possible combinations of cardinal vectors
These are the 3 most parsimonious mappings of those 24
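The count of 24 can be checked by brute force. The sketch below assumes the four cardinal vectors are unit steps north, east, south and west, and groups every assignment of A, C, G, T to those vectors by which nucleotides end up opposing each other. The function name `pairing` and the family labels are illustrative only; the label "A-T C-G" corresponds to the mapping the slides call A-T G-C.

```python
from collections import defaultdict
from itertools import permutations

# Assumed cardinal unit vectors: north, east, south, west.
CARDINALS = [(0, 1), (1, 0), (0, -1), (-1, 0)]


def pairing(mapping):
    """Label a mapping by its two pairs of opposing nucleotides."""
    pairs = set()
    for a, (ax, ay) in mapping.items():
        for b, (bx, by) in mapping.items():
            # Two nucleotides oppose each other when their steps cancel.
            if a < b and (ax + bx, ay + by) == (0, 0):
                pairs.add(f"{a}-{b}")
    return " ".join(sorted(pairs))


# Every way of assigning the four nucleotides to the four vectors.
all_mappings = [dict(zip("ACGT", perm)) for perm in permutations(CARDINALS)]

# Group the assignments by opposing pairs.
families = defaultdict(list)
for mapping in all_mappings:
    families[pairing(mapping)].append(mapping)
```

Each of the 3 families holds 8 assignments (4 rotations times 2 reflections), which differ only in how the same walk is oriented on the page.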
A-T G-C
A-G C-T
A-C G-T
A-T G-C
A-T G-C is consistently smallest
A-T G-C walks
• contain more information in less space
• are simply easier to print
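A minimal sketch of a 2D DNA walk and of one way to compare how much space each mapping takes. The orientation of each mapping (which nucleotide points up, right, and so on) and the use of bounding-box area as the size measure are assumptions for illustration, not necessarily what GraphDNA does.

```python
# Three mappings named by their opposing pairs; orientations are assumed.
MAPPINGS = {
    "A-T G-C": {"A": (0, 1), "T": (0, -1), "G": (1, 0), "C": (-1, 0)},
    "A-G C-T": {"A": (0, 1), "G": (0, -1), "C": (1, 0), "T": (-1, 0)},
    "A-C G-T": {"A": (0, 1), "C": (0, -1), "G": (1, 0), "T": (-1, 0)},
}


def dna_walk(seq, steps):
    """Return the list of (x, y) points visited, starting at the origin."""
    x = y = 0
    points = [(0, 0)]
    for base in seq.upper():
        # Bases without a step (for example N) leave the position unchanged.
        dx, dy = steps.get(base, (0, 0))
        x += dx
        y += dy
        points.append((x, y))
    return points


def bounding_box_area(points):
    """Area, in grid cells, of the smallest axis-aligned box around a walk."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)


def areas_by_mapping(seq):
    """Bounding-box area of the same sequence under each mapping."""
    return {name: bounding_box_area(dna_walk(seq, steps))
            for name, steps in MAPPINGS.items()}
```

Calling `areas_by_mapping` on a chromosome-sized sequence gives one number per mapping, which is one way to test the observation that A-T G-C is smallest.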
Genome Exploration
Human chromosome 1: 250,000,000 bases
S. cerevisiae chromosome 1
EnsEMBL annotation
Duplications
small (~4 base) sequences occur several times in each larger sequence
for each small sequence, calculate the line between each pair of occurrences
if the angle and length of 2 or more lines are consistent, then draw the lines to reveal possible duplications
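The steps above can be sketched as follows. This is an illustrative reading, not the project's code: it assumes an A-T G-C walk with A up, T down, G right and C left, and it treats two lines as consistent when their displacement vectors are identical, which on an integer grid is the same as having equal angle and length. The names `walk` and `candidate_duplications` are hypothetical.

```python
from collections import defaultdict

# Assumed orientation of the A-T G-C mapping.
STEPS = {"A": (0, 1), "T": (0, -1), "G": (1, 0), "C": (-1, 0)}


def walk(seq):
    """Walk coordinates; points[i] is the position before base i."""
    x = y = 0
    points = [(0, 0)]
    for base in seq:
        dx, dy = STEPS[base]
        x += dx
        y += dy
        points.append((x, y))
    return points


def candidate_duplications(seq, k=4, min_lines=2):
    """Group lines between repeated k-mers by their displacement vector."""
    points = walk(seq)

    # Record where every k-base sequence starts.
    occurrences = defaultdict(list)
    for i in range(len(seq) - k + 1):
        occurrences[seq[i:i + k]].append(i)

    # One line per pair of occurrences, keyed by (dx, dy) in walk space.
    lines = defaultdict(list)
    for starts in occurrences.values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                i, j = starts[a], starts[b]
                (x1, y1), (x2, y2) = points[i], points[j]
                lines[(x2 - x1, y2 - y1)].append((i, j))

    # Keep displacements supported by at least min_lines lines.
    return {vec: pairs for vec, pairs in lines.items() if len(pairs) >= min_lines}
```

A duplicated segment traces the same shape twice, so every small sequence inside it contributes a line with the same displacement. The pairwise loop is quadratic in the number of occurrences, so a real chromosome would need a longer k or some filtering.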
Can detect duplications by eye
or algorithmically:
Comparison with published data
This is a 7-fold contiguous duplication in the male Y chromosome. Members of the TSPY (Testis-specific Y-encoded proteins) family were identified by Skaletsky et al., Nature 423 (2003), using a combination of a whole-chromosome dotplot with a 2-kb window and a custom Perl script running BLAST alignments of all 5-kb sequence segments, in 2-kb steps, of the entire MSY (Male Specific Y).
exons introns
Phylogenetics
• Phylogenetics is the reconstruction of evolutionary relatedness from primary sequence information: DNA or protein
• Traditional phylogenetic workflows, built on tools such as ClustalW and T-Coffee, all start from a multiple sequence alignment
• It is hard to find an optimal alignment for many long sequences
• We want to use simple measures, such as Manhattan or Euclidean distance derived from DNA walks, to produce phylogenies without alignment
• Are these comparable to alignment methods?
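The slides do not say exactly how a distance is derived from two walks. One simple possibility, sketched here as an assumption, is to compare the walks point by point up to the length of the shorter one and average the Manhattan or Euclidean distance between corresponding points. The names `walk_distance` and `distance_matrix` are hypothetical.

```python
import math

# Assumed orientation of the A-T G-C mapping.
STEPS = {"A": (0, 1), "T": (0, -1), "G": (1, 0), "C": (-1, 0)}


def walk(seq):
    """Return the (x, y) points of an A-T G-C walk starting at the origin."""
    x = y = 0
    points = [(0, 0)]
    for base in seq:
        dx, dy = STEPS[base]
        x += dx
        y += dy
        points.append((x, y))
    return points


def walk_distance(seq_a, seq_b, metric="manhattan"):
    """Mean point-to-point distance between two walks (assumed measure)."""
    points_a, points_b = walk(seq_a), walk(seq_b)
    n = min(len(points_a), len(points_b))
    total = 0.0
    # zip stops at the shorter walk, so extra bases are ignored.
    for (x1, y1), (x2, y2) in zip(points_a, points_b):
        if metric == "manhattan":
            total += abs(x1 - x2) + abs(y1 - y2)
        elif metric == "euclidean":
            total += math.hypot(x1 - x2, y1 - y2)
        else:
            raise ValueError(f"unknown metric: {metric}")
    return total / n


def distance_matrix(seqs, metric="manhattan"):
    """Symmetric matrix of pairwise walk distances, no alignment needed."""
    return [[walk_distance(a, b, metric) for b in seqs] for a in seqs]
```

The resulting matrix can be passed to any distance-based tree method. Because the walk is cumulative, an early insertion or deletion shifts every later point, which is one reason such a measure may disagree with alignment-based distances.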
Phylogeny algorithms
Published distance matrix from Gilfillan GD, et al., Microbiology 144 (1998) 829-838, of 7 aligned 1798-nucleotide-long small rRNA sequences of Candida and Saccharomyces species
neighbour joining UPGMA
UPGMA method
Tree construction
Distance Matrix
Output
Newick format string representation of a tree:
(Bovine:0.69395, (Gibbon:0.36079, (Orang:0.33636, (Gorilla:0.17147, (Chimp:0.19268, Human:0.11927):0.08386):0.06124):0.15057):0.54939, Mouse:1.21460);
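A compact sketch of the UPGMA method, going from a distance matrix to a Newick string. It is a textbook version written for illustration, not the project's implementation; the function name `upgma` and the three-decimal branch lengths are arbitrary choices.

```python
def upgma(names, matrix):
    """Build a rooted tree by UPGMA and return it as a Newick string."""
    n = len(names)
    # Each cluster id maps to (newick text, number of leaves, height).
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    # Pairwise distances keyed by (smaller id, larger id).
    dist = {(i, j): matrix[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n

    while len(clusters) > 1:
        # Join the closest pair of clusters.
        (i, j), d_ij = min(dist.items(), key=lambda item: item[1])
        text_i, n_i, h_i = clusters.pop(i)
        text_j, n_j, h_j = clusters.pop(j)
        height = d_ij / 2
        text = f"({text_i}:{height - h_i:.3f},{text_j}:{height - h_j:.3f})"

        # Distance to every other cluster is the size-weighted average.
        new_dist = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            new_dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)

        # Drop distances involving the two merged clusters.
        dist = {pair: d for pair, d in dist.items()
                if i not in pair and j not in pair}
        dist.update(new_dist)
        clusters[next_id] = (text, n_i + n_j, height)
        next_id += 1

    (text, _, _), = clusters.values()
    return text + ";"
```

UPGMA assumes a constant rate of change along all branches, so every leaf ends up at the same distance from the root. Neighbour joining does not make that assumption.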
Phylogenies with each mapping
Can summing 3 mappings eliminate bias?
No.
Possible improvements:
• A more complex distance measure, perhaps a composite of small sequence distances
• Larger sequences – human / chimpanzee chromosome – where mapping bias may be informative rather than destructive
Conclusion
DNA walks can
• summarise information about nucleotide content in DNA sequences
• visualise tandem repeats
• uncover more distant relationships such as duplications and retroviral genomes without expert knowledge or complex algorithms
• be overlaid with annotations from Ensembl to function as an alternative to linear genome browsers
• be used to construct phylogenetic relationships, but these are inaccurate
A-T G-C is the most useful mapping for viewing 2D walks
Future
3D walks
• mappings where one nucleotide opposes another result in information loss, which can be helpful, but we can’t know what we are missing
• use a tetrahedral mapping:
• each step starts from the centre and proceeds to a corner
• should produce 3D structure like proteins but much bigger
• can recover the 2D walk by choosing the viewing orientation
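A minimal sketch of a tetrahedral walk, under assumptions: the four corners are taken as alternate corners of a cube, and the assignment of nucleotides to corners is chosen so that looking straight down the z axis makes A oppose T and G oppose C. Neither choice comes from the slides.

```python
# Assumed assignment of nucleotides to the corners of a tetrahedron.
# The four vectors sum to zero, and no two of them are opposite in 3D.
TETRAHEDRON = {
    "A": (1, 1, 1),
    "T": (-1, -1, 1),
    "G": (1, -1, -1),
    "C": (-1, 1, -1),
}


def walk_3d(seq):
    """Return the (x, y, z) points visited, starting at the origin."""
    x = y = z = 0
    points = [(0, 0, 0)]
    for base in seq:
        dx, dy, dz = TETRAHEDRON[base]
        x += dx
        y += dy
        z += dz
        points.append((x, y, z))
    return points


def project_xy(points):
    """View the walk along the z axis by dropping the z coordinate."""
    return [(x, y) for x, y, _ in points]
```

With this assignment, `project_xy` gives a diagonal version of the A-T G-C walk, since A and T cancel in the x-y plane and so do G and C. In 3D the pair A, T still moves the walk along z, so the information lost in the 2D mapping is kept.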
Acknowledgments
Biosciences
Dr. Gary Robinson (supervisor Biosciences)
Dr. Jürgen Schmidt (course convenor)
Dr. Anthony Baines (Bioinformatics lecturer)
Computing
Dr. Colin Johnson (supervisor Computing)