DNA is Data - BioInfoSummer 2012 (Dave Adelson)

1

DNA is Data

Dave Adelson

BioInfoSummer 2012, Dec. 3, 2012


2

What is Bioinformatics?

• The mathematical, statistical and computing methods that aim to solve biological problems using DNA and protein sequences and related information.

• My main interest is genome analysis of mammals.

3

G-gnome vs Genome

Thanks to Ernie Bailey

4

What is a Genome?

• The genome is the total genetic content of the individual/cell.

• All mammal genomes are about the same size.

• Made up of chromosomes, each of which is a single molecule of DNA.

• Total genome length is about 3,000,000,000 base pairs.

Image courtesy NHGRI

5

Central paradigm of Molecular Biology

DNA → RNA → Protein → Phenotype

DNA bases: Guanine - G, Adenine - A, Thymine - T, Cytosine - C

RNA bases: Guanine - G, Adenine - A, Uracil - U, Cytosine - C

G Glycine Gly

P Proline Pro

A Alanine Ala

V Valine Val

20 amino acids
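The DNA-to-RNA step above can be sketched in a few lines. This is only an illustration, assuming the input is the coding (sense) strand, so transcription amounts to swapping thymine for uracil; the function name and the example sequence are made up for the sketch.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a DNA coding (sense) strand.

    Assumption: the input is the coding strand, so the mRNA has the
    same sequence with T replaced by U.
    """
    return coding_strand.upper().replace("T", "U")


print(transcribe("ATGGGCCCAGCAGTT"))  # AUGGGCCCAGCAGUU
```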

6

Central paradigm of Molecular Biology

7

Gene vs Genome

• Each chromosome is a single, long DNA molecule.

• Genes are the basic unit of heredity.

• Genes are specific DNA sequences located on chromosomes.

• Genome contains approximately 20,000 protein coding genes.

• The 20,000 genes fill up about 2% of the genome.
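The two figures above imply a rough amount of protein-coding sequence per gene. The arithmetic below is a back-of-envelope sketch that takes the slide's round numbers (3,000,000,000 bp, 20,000 genes, 2%) at face value; the variable names are illustrative.

```python
genome_bp = 3_000_000_000   # total genome length from the earlier slide
gene_count = 20_000         # approximate number of protein-coding genes
coding_percent = 2          # share of the genome they fill

coding_bp = genome_bp * coding_percent // 100
bp_per_gene = coding_bp // gene_count

print(coding_bp)    # 60000000
print(bp_per_gene)  # 3000
```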

8

DNA Sequences - three bases and stop codons

http://www.genome.gov/EdKit/bio2b.html

9

Genetic Code

http://plato.stanford.edu/entries/information-biological/GeneticCode.png
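As a rough illustration of how the genetic code is applied, the sketch below builds the standard codon table and translates a coding-strand DNA sequence three bases at a time until it reaches a stop codon. The function name and the example sequence are assumptions made for this sketch; only the standard code is covered.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding-strand DNA sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":  # stop codon: TAA, TAG or TGA
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGGCCCAGCAGTTTAA"))  # MGPAV
```

The example encodes methionine followed by glycine, proline, alanine and valine, the four amino acids listed on the earlier slide.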

10

Sense Strand / Antisense Strand

http://www.genome.gov/EdKit/bio2c.html
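The relationship between the two strands can be sketched as a reverse complement: each base is swapped for its pairing partner and the result is read in the opposite direction. The helper name below is illustrative, and the sketch assumes an unambiguous A/C/G/T sequence.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))  # GCAT
```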

11

Open reading frames

http://www.genome.gov/EdKit/bio2d.html

12

Reading frames

http://www.genome.gov/EdKit/bio2e.html
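A minimal sketch of scanning the three forward reading frames for open reading frames follows. It assumes an ORF runs from an ATG to the first in-frame stop codon; the function name, the `min_codons` parameter and the toy sequence are assumptions for illustration. Scanning the reverse complement in the same way would cover the other three frames.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 2):
    """Return (frame, start, end) for each ATG...stop ORF on the given strand.

    start and end are 0-based slice coordinates; end includes the stop codon.
    min_codons counts codons from ATG up to, but not including, the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None
    return orfs


toy = "CCATGGCTTAAGG"
print(find_orfs(toy))  # [(2, 2, 11)]
print(toy[2:11])       # ATGGCTTAA
```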

13

Exons and Introns

http://www.genome.gov/EdKit/bio2i.html

14

Genes from different animals are similar

Query = human actin

Subject = fruit fly actin
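Similarity between two already-aligned sequences is often summarised as percent identity. The sketch below is a simple illustration using made-up toy strings, not the actin sequences from the slide; it assumes the two inputs are the same length and counts gap columns as mismatches.

```python
def percent_identity(query: str, subject: str) -> float:
    """Percent of aligned columns where both sequences carry the same residue."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(q == s and q != "-" for q, s in zip(query, subject))
    return 100.0 * matches / len(query)


# Toy example: 5 of 7 columns match.
print(round(percent_identity("GATTACA", "GACTATA"), 1))  # 71.4
```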

15

Bioinformatics: what we do

16

Bioinformatics: What we really do

• Once the sequencing has been done, every other part of the process is bioinformatics.

– Genome Assembly

– Gene Prediction

– Sequence Analysis

17

Bioinformatics: Why do we do it?

• It’s the only way to make sense of billions of base pairs of DNA sequence.

• To understand the mechanistic basis of biological trait determination.

18

Cost of DNA sequencing

GenBank: 1982-2008

• The number of entries in databases of gene sequences has increased exponentially

19

GenBank: latest release

• On October 15, 2012, Release 192.0:

– 145,430,961,262 bases

– 157,889,737 reported sequences
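Dividing the two release figures gives a feel for how short a typical database entry is. This is plain arithmetic on the numbers quoted above; the variable names are illustrative.

```python
bases = 145_430_961_262      # bases in Release 192.0
sequences = 157_889_737      # reported sequences in Release 192.0

mean_length = bases / sequences
print(f"{mean_length:.0f} bases per sequence on average")  # 921 bases per sequence on average
```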

20

21

Growth of GenBank

Nucleic Acids Res. 2011 Jan;39(Database issue):D32-7.

Current status of genome projects

http://www.genomesonline.org/cgi-bin/GOLD/index.cgi?page_requested=Statistics

23

Mammalian Genes

Unique genes

Known genes

Cow Dog Man Mouse Rat Opossum Platypus

24

Mammal Family Tree Based on Genes

25

Key things about DNA sequencing

• Only short sequences can be generated (up to 1000bp long, depending on technology)

• Typical mammalian genome is 3 × 10^9 bp.

• Sequencing a genome means stitching together millions of short reads.

• To assemble reads, one must be able to identify overlap by aligning sequences.

• Sequence alignment tools are fundamental to bioinformatics.
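The "millions of short reads" figure follows from simple arithmetic. The sketch below assumes every read has the same length and that the genome is sequenced to a chosen average depth (coverage); the coverage values and the function name are assumptions for illustration, not figures from the talk.

```python
import math


def reads_needed(genome_bp: int, read_bp: int, coverage: float = 1.0) -> int:
    """Number of reads so that total sequenced bases = coverage x genome size."""
    return math.ceil(genome_bp * coverage / read_bp)


genome_bp = 3_000_000_000   # 3 x 10^9 bp mammalian genome
print(reads_needed(genome_bp, 1000))               # 3000000 reads at 1x
print(reads_needed(genome_bp, 1000, coverage=10))  # 30000000 reads at 10x
```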

26

Shotgun sequencing

1. Create libraries of the whole genome.

2. Sequence millions of fragments.

3. Look for overlap between reads.

4. Assemble reads based on overlaps into contigs.
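Steps 3 and 4 can be sketched with exact suffix-prefix overlaps and a greedy merge. This is a toy illustration only: it assumes error-free reads from one strand, and real assemblers use approximate alignment and graph-based methods rather than this quadratic greedy loop. The function names, the `min_len` parameter and the toy reads are all made up for the sketch.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # no overlaps left: what remains are the contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ATGGCGTA", "CGTACGTT", "CGTTAGC"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```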

E.D. Green (2001) Nature Reviews Genetics vol. 2 (8) pp. 573-583

27

Shotgun assembly steps

• Remove bad sequences, trim adapters from reads.

• Identify repeats.

• Identify overlaps by sequence alignment (excluding repeats).

• Build contigs from overlapping sequences.

• Use paired-end reads to assemble contigs into scaffolds.

• Use additional marker information to order and orient scaffolds into super-scaffolds (chromosomes).
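The first step in the list, trimming adapters from reads, can be sketched as a search for the adapter at the 3' end of a read. This sketch assumes an exact, full-length adapter match; real trimming tools also handle partial adapters, mismatches and base-quality filtering. The adapter string and read below are only examples.

```python
def trim_adapter(read: str, adapter: str) -> str:
    """Cut the read at the first exact occurrence of the adapter, if any."""
    position = read.find(adapter)
    return read if position == -1 else read[:position]


ADAPTER = "AGATCGGAAGAGC"  # example adapter sequence, assumed for this sketch
print(trim_adapter("ACGTACGT" + ADAPTER + "TTT", ADAPTER))  # ACGTACGT
print(trim_adapter("ACGTACGT", ADAPTER))                    # ACGTACGT
```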

28

Shotgun sequencing problems

• Leaves gaps.

• Contigs have to be ordered and oriented.

• Occasional misassembled contigs.

E.D. Green (2001) Nature Reviews Genetics vol. 2 (8) pp. 573-583

29

Old style paired end libraries

• To take advantage of information from paired ends, multiple libraries are made:

– Small insert ~2kb (plasmid)

– Medium insert ~10kb (plasmid)

– Large insert ~40kb (fosmid)

• Tight control of insert size is paramount. Use random shearing, not restriction digest, to generate inserts. Small inserts may well be sequenced through with overlap from ends.

30

Scaffold (long range) assembly

E. Myers et al. (2000) Science, v287, p2196
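Why insert size matters for scaffolding can be shown with a small calculation: if the two ends of one insert land on different contigs, the known insert length bounds the gap between them. The sketch below assumes one read maps to the forward strand near the end of contig A, its mate maps to the reverse strand near the start of contig B, and the insert size is known exactly; all names and numbers are illustrative.

```python
def estimate_gap(insert_size: int, contig_a_len: int,
                 read_start_on_a: int, mate_end_on_b: int) -> int:
    """Estimate the gap between contig A and contig B from one read pair.

    insert_size     : expected distance between the outer ends of the pair
    read_start_on_a : 0-based start of the forward read on contig A
    mate_end_on_b   : end coordinate of the reverse mate on contig B
    """
    covered_on_a = contig_a_len - read_start_on_a
    return insert_size - covered_on_a - mate_end_on_b


# A ~2kb insert with 800 bp on contig A and 700 bp on contig B leaves ~500 bp.
print(estimate_gap(2000, contig_a_len=5000, read_start_on_a=4200, mate_end_on_b=700))  # 500
```

A negative estimate would suggest the two contigs actually overlap, and a wide spread of insert sizes makes the estimate correspondingly loose.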

31

Current shotgun sequencing

32

Repeat sequences cause problems

“Junk DNA”, an unfortunate choice of words

http://www.junkdna.com/ohno.html

Used to describe the mostly repetitive DNA between genes

34

LINEs and SINEs

These are typical of Eukaryotes, in particular mammals.

Intact autonomous elements are about 6kb long.

Non-autonomous truncated (SINE) elements that share the same tail make use of the autonomous elements' insertion machinery.

Retrotransposition

(Figure labels: Cell, Cytoplasm, Nucleus)

• Retrotransposons are ancient, retrovirus-like pieces of DNA that copy themselves around the genome.

• They cannot “infect” other individuals or cells because they lack key components that viruses have.

36

Repeats and genome assembly

• Repeats can align to many places in the genome.

• Many repeats are longer than the sequence reads produced by current sequencers.

• To avoid many to many mapping, leading to incorrect contig assembly, repeats must be identified and masked prior to alignment.
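The idea of masking repeats before alignment can be sketched with a naive k-mer count. Production pipelines typically mask against curated repeat libraries; this sketch swaps in a much simpler stand-in that replaces any k-mer seen more than `max_count` times with N. The function name, parameters and toy sequence are assumptions for illustration.

```python
from collections import Counter


def mask_repeats(seq: str, k: int = 7, max_count: int = 1) -> str:
    """Replace every base covered by an over-represented k-mer with 'N'."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    masked = list(seq)
    for i, kmer in enumerate(kmers):
        if counts[kmer] > max_count:
            masked[i:i + k] = "N" * k
    return "".join(masked)


# GATTACA occurs twice; the unique flanking sequence is left untouched.
print(mask_repeats("ACGGATTACATCTGATTACAGG"))  # ACGNNNNNNNTCTNNNNNNNGG
```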

37

Repeats affect anything requiring alignment.

• Any sequence data needing to be aligned must have repeats masked.

– Transcriptome data

– Structural variation (SV)/mutation mapping

38

Resequencing is the norm

• Sequencing of patient samples to determine mutations underlying disease.

• Must be able to detect a range of mutation events (of various sizes).

• Applies to germ line mutations/variations or somatic mutations/variations (e.g. cancer).
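The smallest class of mutation event, a single-base substitution, can be found by comparing a sample sequence against the reference position by position. This sketch assumes the two sequences are already aligned and of equal length; insertions, deletions and larger structural variants need proper alignment and are not handled. The names and sequences are illustrative.

```python
def find_substitutions(reference: str, sample: str):
    """Return (position, reference_base, sample_base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to equal length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


print(find_substitutions("ACGTACGT", "ACGAACGT"))  # [(3, 'T', 'A')]
```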

39

Different classes of mutation operating in the human genome.

Freeman J L et al. Genome Res. 2006;16:949-961

Copyright © 2006, Cold Spring Harbor Laboratory Press

40

Genome resequencing for SV

http://www.sciencemag.org/cgi/content/full/318/5849/420

Summary and Challenges Ahead

• DNA sequencing is becoming faster and cheaper at a pace far outstripping Moore’s law (the rate at which computing gets faster and cheaper).

• The ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data.

• Data handling is now the bottleneck.

• It costs more to analyze a genome than to sequence a genome.

• The cost of sequencing a human genome (all three billion bases of DNA in a set of human chromosomes) plunged to under $10,000 this year from $8.9 million in July 2007.
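The comparison with Moore's law can be made concrete by turning the two cost figures into a halving time. The sketch below assumes roughly 60 months between the two price points and a 24-month halving period for computing costs; both of those numbers are assumptions for illustration, not figures from the talk.

```python
import math

cost_2007 = 8_900_000      # dollars, July 2007
cost_now = 10_000          # dollars, "this year"
months_elapsed = 60        # assumption: about five years between the two prices
moore_halving_months = 24  # assumption: a common reading of Moore's law

fold_drop = cost_2007 / cost_now
halvings = math.log2(fold_drop)
sequencing_halving_months = months_elapsed / halvings

print(f"cost fell {fold_drop:.0f}-fold")
print(f"sequencing cost halved about every {sequencing_halving_months:.1f} months")
print(f"about {moore_halving_months / sequencing_halving_months:.1f} times faster than the assumed Moore's law pace")
```

Under these assumptions the sequencing cost halves roughly every six months, several times faster than the assumed pace for computing.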

Summary and Challenges Ahead

• Storage of and access to data cause issues.

– Not all data is in GenBank or in a format that can be easily accessed.

• Demand from health care system for tools to visualize, understand and interpret patient genomic data.

Biggest driver for bioinformatics

43
