
Genome Assembly With Next Generation Sequencers

3 May, 2011

Jongsun Park

Personal Genomics Institute

Table of Contents

1 Central Dogma and –Omics Studies

2 History of Sequencing Technologies

3 Genome Assembly Processes With NGS Sequences

4 Current Status of Plant Genomes

Central Dogma and -Omics Studies

Central Dogma in Molecular Biology and Bioinformatics

DNA RNA Proteins

Genomics Transcriptomics Proteomics

Advances of Biotechnologies

Bioinformatics

Further –Omics Studies With Bioinformatics

Genomics Population Genomics

Phylogenomics

Transcriptomics

Gene Expression

Regulatory Network

Non-coding RNAs

Epigenomics

Proteomics Metabolomics

Pathway Analysis

3D structure

Position of Sequences in Central Dogma!

AGCUACGUGAGAGACGUACUGUAC…

AGCTACGTGAGAGACGTACTGTAC…

MASTWTSWAMTCCAAMST…

Target for “Sequencing”

Predicted from sequences

History of Sequencing Technologies

Classical Method: Sanger Sequencing

- Using dideoxynucleotides (ddNTPs), DNA synthesis is terminated at random positions in four separate reaction tubes, one per base.

[Figure: four reaction tubes, each with DNA template and primer plus one terminator (+ddATP, +ddTTP, +ddGTP, or +ddCTP), yielding new DNA strands terminated at A, T, G, and C, respectively.]

http://en.wikipedia.org/wiki/File:Sequencing.jpg

http://upload.wikimedia.org/wikipedia/commons/c/cb/Sequencing.jpg

Automated Method: Sanger Sequencing

- Using four fluorescent dyes, a scanner can read the sequence directly.

- With a 96- or 384-capillary array, the machine can read 96 or 384 different samples at one time.

[Figure: four reactions, each with DNA template and primer plus one dye-labeled terminator (+ddATP with dye, +ddTTP with dye, +ddGTP with dye, or +ddCTP with dye), yielding new DNA strands terminated at A, T, G, and C, respectively.]

http://en.wikipedia.org/wiki/File:Radioactive_Fluorescent_Seq.jpg


Capacity of Automated Sanger Sequencing

- Per capillary, 700 bases (with Quality Value higher than 20) can be read in one hour.

- Per machine, an ABI3730 with a 384-capillary kit can produce 700 * 384 * 24 = 6,451,200 bp per day, without considering the sample preparation process.

https://products.appliedbiosystems.com/ab/en/US/adirect/ab?cmd=catNavigate2&catID=600533&tab=Overview

- To obtain 1x coverage of the human genome with one such machine, it would take 465 days (3,000,000,000 bp / 6,451,200 bp = 465.029..).

- For de novo assembly, 6x to 10x coverage is usually required: at 10x, it would take 4,650 days to obtain enough sequences for a genome project.
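
The throughput arithmetic on this slide can be reproduced with a few lines of Python. This is only a sketch: the constant and function names are made up for illustration, and the human genome size of 3,000,000,000 bp is the rounded figure implied by the 465-day result.

```python
# Figures taken from the slide; names are illustrative only.
BASES_PER_CAPILLARY_PER_HOUR = 700
CAPILLARIES = 384
HOURS_PER_DAY = 24
HUMAN_GENOME_BP = 3_000_000_000  # assumption: rounded haploid genome size


def bases_per_day():
    """Daily output of one machine, ignoring sample preparation."""
    return BASES_PER_CAPILLARY_PER_HOUR * CAPILLARIES * HOURS_PER_DAY


def days_for_coverage(coverage):
    """Days one machine needs to reach the given fold coverage."""
    return coverage * HUMAN_GENOME_BP / bases_per_day()


for fold in (1, 6, 10):
    print(f"{fold}x coverage: {days_for_coverage(fold):,.1f} days")
```

Changing `HUMAN_GENOME_BP` to another genome size gives the equivalent estimate for that species.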

Next Generation Sequencing (NGS) Technologies

- Next (or current) generation sequencing technologies have accelerated genome sequencing projects and have broadened the range of applications of genome sequences.

Solexa; Illumina

SOLiD; ABI

GS-Titanium; Roche 454

SMRT; Pacific Biosciences

Helicos; Helicos BioSciences

NGS: 454 Technology

NGS: Solexa Technology

NGS: SOLiD Technology (1)



Capacities of Next Generation Sequencers

ABI 3730; ABI: 384 x 700 bp = 268,800 bp = 269 Kb (per one reaction / 1 hr)

GS-Titanium; Roche 454: 950,000 x 450 bp = 427,500,000 bp = 427.5 Mb (per one reaction / 2-3 days)

Solexa GA2; Illumina: 35,000,000 x 7 x (151 x 2) bp = 73,990,000,000 bp = 74.0 Gb (per one reaction / 12 days)

SOLiD 4; ABI: 1,400,000,000 x 75 bp (50+25) = 105,000,000,000 bp = 105.0 Gb (per one reaction / 11 days)

HiSeq2000; Illumina: 70,000,000 x 14 x (101 x 2) bp = 197,960,000,000 bp = 198.0 Gb (per one reaction / 8 days)
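
Each per-run yield on this slide is a product of read count, a multiplier, and bases per read, so it can be recomputed directly. In the sketch below the factors are copied from the slide. The dictionary layout, the function name, and the reading of the middle factor (7 and 14) as a number of lanes are assumptions, since the slide does not say what that factor is.

```python
# (reads, multiplier, bases per read) as written on the slide.
# Assumption: the multiplier (7, 14) is a lane count; the slide does not say.
PLATFORMS = {
    "ABI 3730": (384, 1, 700),
    "GS-Titanium": (950_000, 1, 450),
    "Solexa GA2": (35_000_000, 7, 151 * 2),
    "SOLiD 4": (1_400_000_000, 1, 50 + 25),
    "HiSeq2000": (70_000_000, 14, 101 * 2),
}


def yield_bp(reads, multiplier, bases_per_read):
    """Total bases produced by one run."""
    return reads * multiplier * bases_per_read


yields = {name: yield_bp(*factors) for name, factors in PLATFORMS.items()}
for name, total in yields.items():
    print(f"{name}: {total:,} bp")
```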

Single Read and Mate-Pair (Pair-end) Sequences

- Single read is the simplest method: each sample (cluster or bead) is read in one direction only.

- The mate-pair (or paired-end) method generates short reads from both ends of each fragment, similar to BAC-end, fosmid-end, and cosmid-end sequences.

- Mate-pair (paired-end) methods are useful for building larger sequences that contain gaps (this is called scaffolding).
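
One way read pairs help scaffolding is by giving a rough size for the gap between two contigs. The following is a minimal sketch under stated assumptions, not the method of any particular assembler: the function name, the coordinate convention, and the example numbers are all hypothetical. It assumes the first read of each pair maps near the end of contig A, its mate maps near the start of contig B, and the library insert size is known.

```python
from statistics import median


def estimate_gap(contig_a_len, pairs, insert_size):
    """Estimate the gap between contig A and contig B.

    pairs: list of (start_on_a, end_on_b) tuples, where start_on_a is where
    the first read begins on contig A and end_on_b is where its mate ends on
    contig B (both counted from the start of their own contig).
    Each pair spans (contig_a_len - start_on_a) + gap + end_on_b bases, which
    should equal the insert size, so the gap is what remains.
    """
    gaps = [
        insert_size - (contig_a_len - start_on_a) - end_on_b
        for start_on_a, end_on_b in pairs
    ]
    # The median is less sensitive than the mean to a few chimeric pairs.
    return median(gaps)


# Hypothetical example: a 1,000 bp contig A and a 500 bp insert library.
print(estimate_gap(1000, [(900, 150), (850, 90), (920, 170)], 500))
```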

Summary of Costs For NGS Technologies

Sequence explosion, Nature, 464, 670-671 (2010)

Pros and Cons of NGS Technologies

Pros

- Large number of reads per run

- Extremely low cost for generating huge amounts of sequence

- Diverse applications, not only for genomics but also for transcriptomics, small RNAs, epigenomics, etc., with small modifications of protocols

Cons

- Short read length (36 bp to 151 bp, or 450 bp)

- Different types of sequencing quality (GS, Solexa, and SOLiD)

- Difficult to handle large amounts of sequence with conventional programs, including de novo assemblers

- Impossible to obtain a small amount of sequence at low cost

Genome Assembly Processes With NGS Sequences

Whole-Genome Shotgun Sequencing Strategy

- The assembly process is essential for a genome project because each read is shorter than 1 kb.

- For Sanger sequences, assembly was conducted with several popular programs, such as phrap and PCAP3.

Genome Assembly

Scaffolding

Assembled genome

Genome Assembly Process

- We can perform genome assembly manually!

23 sequences must each be compared with every other one!

23C2 = 23*22 / 2 = 253 comparisons!
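
The number of all-against-all comparisons grows quadratically with the number of reads, which a short Python sketch makes concrete. The function name is illustrative; the read counts are the ones quoted in this deck (23 here, 6,200,000 for the Vitis vinifera example, and 563,466,202 for the short-read example).

```python
from math import comb


def pairwise_comparisons(n_reads):
    """Number of unordered read pairs, nC2 = n * (n - 1) / 2."""
    return comb(n_reads, 2)


for n in (23, 6_200_000, 563_466_202):
    print(f"{n:,} reads -> {pairwise_comparisons(n):,} comparisons")
```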

How to Find Overlapped Sequences?

- Using a dynamic programming algorithm, we can write a program that finds similar sequences.

- The complexity of this algorithm is O(n^3) when arbitrary gap penalties are allowed (O(n^2) with a fixed per-gap penalty).

http://www.avatar.se/molbioinfo2001/dynprog/dynamic.html
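
To illustrate the dynamic programming idea, here is a minimal sketch of a local alignment score in the Smith-Waterman style. It is a simplified variant, not the code behind the linked page: the function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions, and with this fixed per-gap penalty the table has one cell per pair of positions, so the run time is proportional to the product of the two read lengths.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between sequences a and b."""
    previous = [0] * (len(b) + 1)  # scores for the previous row of the table
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


# Two reads that overlap by the seven bases ACACTTA.
print(local_alignment_score("GCAAAACACTTA", "ACACTTATTCGT"))
```

A high score relative to read length suggests that two reads overlap and are candidates for merging.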

Classification of Alignments (1)

Considering whole sequences for

alignment?

Global Alignment

Yes

No

Local alignment

Classification of Alignments (2)

How many sequences are considered for

alignment?

Pair-wise alignment

2 sequences!

More than 2 sequences!

Multiple alignment

Famous Bioinformatics Tools for Alignments

- Global alignment: ClustalW, T-Coffee, and MUSCLE

- Local alignment: FASTA and BLAST (Basic Local Alignment Search Tool, provided by NCBI)

- Pair-wise alignment: BLAST and FASTA

- Multiple sequence alignment: ClustalW, T-Coffee, MUSCLE, etc.

Example of Genome Assembly: Vitis vinifera

Pair-wise comparison of 6,200,000 reads

6,200,000C2 = 19,219,996,900,000 comparisons

Genome Assemblers

http://www.phrap.org/

Short-read Sequence Assembly (1)

- Short-read sequences generated by NGS machines cause several problems for already well-established genome assemblers.

- Too many reads require a near-infinite amount of computational power.

- Reads that are too short cannot provide reliable overlaps for building long contigs.

- To reduce the computational power required, a new algorithm was needed.

- Short reads require another strategy to make reliable contig sequences.

- Dealing with a large number of sequences also caused several technical problems, such as limits on physical memory, hard disk space, and computing power.

Short-read Sequence Assembly (2)

563,466,202C2 = 158,747,080,116,419,301 comparisons?!

563,466,202

De Bruijn Graph: Alternative Method For Alignment

De Bruijn Graph Algorithm For Alignment (1)

- This algorithm has been utilized for finding overlapping short-read sequences quickly.

- This algorithm consists of three parts:

i) Generating k-mer sequences

ii) Constructing the de Bruijn graph

iii) Resolving the graph to generate sequences

K = 3

Read 1: GCAAAACACTTA…

k-mers (nodes 1-1 to 1-10): GCA, CAA, AAA, AAA, AAC, ACA, CAC, ACT, CTT, TTA

Read 2: ACACTTATTCGT

k-mers (nodes 2-1 to 2-10): ACA, CAC, ACT, CTT, TTA, TAT, ATT, TTC, TCG, CGT

k-mers shared by both reads: ACA, CAC, ACT, CTT, TTA

Merged sequence: GCAAAACACTTATTCGT
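
The three steps (generating k-mers, constructing the graph, resolving it into a sequence) can be sketched in Python using the two reads from the slides. This is a toy illustration, not how production assemblers are implemented: all function names are hypothetical, each distinct k-mer is used once, and there is no error correction or repeat handling. The slides use K = 3 for readability, but for these reads the repeated k-mer AAA makes the K = 3 graph ambiguous, so the assembly call below assumes K = 5.

```python
from collections import defaultdict


def kmers(read, k):
    """Step i: list every k-mer of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads, k):
    """Step ii: each distinct k-mer is an edge from its prefix to its suffix."""
    distinct = set()
    for read in reads:
        distinct.update(kmers(read, k))
    graph = defaultdict(list)
    for kmer in sorted(distinct):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Step iii: walk every edge exactly once (Hierholzer's algorithm)."""
    in_degree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            in_degree[node] += 1
    nodes = sorted(set(graph) | set(in_degree))
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in nodes if len(graph.get(n, [])) - in_degree.get(n, 0) == 1),
        nodes[0],
    )
    remaining = {node: list(successors) for node, successors in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along the path: first node plus one base per step."""
    path = eulerian_path(build_graph(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["GCAAAACACTTA", "ACACTTATTCGT"]
print(kmers(reads[0], 3))
print(assemble(reads, 5))
```

With K = 5 the graph for these two reads is a single unbranched path, so the walk reproduces the merged sequence GCAAAACACTTATTCGT shown on the slide.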

Genome Assemblers For Short Read Sequences

Examples of Plant Genome de novo Assembly

5,937,915,739 bp

# of contigs 950 ea

Total length 976,089 bp

Maximum length 12,606 bp

Average length 1,027.46 bp

N50 length 3,061 bp

Species              # of contigs  Total length  Maximum length  Average length  N50 length

Lithocarpus hancei   462,868       112,614,098   1,748           243.30          237

Ficus altissima      355,052       87,502,701    1,090           246.45          239

Ficus altissima      132,590       33,293,636    1,688           251.10          245

Ficus altissima      247,376       61,369,608    1,334           248.08          241

Ficus tinctoria      337,777       87,427,716    1,274           258.83          248

Ficus tinctoria      476,937       116,554,688   1,578           244.38          3061
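
The statistics reported in these tables can be computed from a list of contig lengths. The sketch below uses the common definition of N50 (the length of the contig at which the running total, counted from the longest contig down, first reaches half of the total assembly length). The function names and the example lengths are illustrative, not data from the slides.

```python
def n50(lengths):
    """Contig length at which the cumulative sum reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def summarize(lengths):
    """Contig count, total, maximum, average, and N50 for one assembly."""
    return {
        "contigs": len(lengths),
        "total": sum(lengths),
        "max": max(lengths),
        "average": sum(lengths) / len(lengths),
        "n50": n50(lengths),
    }


# Hypothetical contig lengths in bp.
print(summarize([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because N50 is always the length of one of the contigs, it can never exceed the maximum contig length of the same assembly.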

Giant Panda Genome Project

Sordaria macrospora Genome Project

Large Scale Genome Projects: 1000 Human Genomes

http://www.1000genomes.org/page.php

http://en.wikipedia.org/wiki/File:Genetic_Variation.jpg

http://upload.wikimedia.org/wikipedia/en/e/eb/Genetic_Variation.jpg

Large Scale Genome Projects: BGI 1000 Genomes

http://www.ldl.genomics.cn/page/index.jsp

Large Scale Genome Projects: Genome 10K Project

http://genome10k.org/

Current Status of Plant Genome Projects

Species name Method Size (Mb) # of contigs # of transcripts

Arabidopsis lyrata WGS 206.67 695 32,670

Medicago truncatula BAC, WGS 278.69 9 38,334

Selaginella moellendorffii WGS 212.76 768 22,285

Lycopersicon esculentum WGS, BAC 794.60 7,409 49,389

Solanum phureja WGS 702.58 57,681 110,512

Ricinus communis WGS 362.47 28,518 38,613

Mimulus guttatus WGS 416.66 11,243 47,442

Manihot esculenta WGS 321.73 2,216 27,501

Phoenix dactylifera WGS 284.68 234,704 -

Prunus persica WGS 227.25 202 28,689

Oryza glaberrima WGS 316.08 5,309 -

11 Unpublished Higher Plant Genomes

All pictures are from Wikipedia.

http://en.wikipedia.org/wiki/File:Medicago_truncatula_A17_branch.JPG

http://en.wikipedia.org/wiki/File:Smo2-72.jpg

http://en.wikipedia.org/wiki/File:Bright_red_tomato_and_cross_section02.jpg

http://en.wikipedia.org/wiki/File:Castor_bean_in_distubred_area.jpg

http://en.wikipedia.org/wiki/File:Mimulus_guttatus_5639.JPG

http://en.wikipedia.org/wiki/File:Manihot_esculenta_dsc07325.jpg

Species name Journal Method Size (Mb) # of contigs # of transcripts

Arabidopsis thaliana Nature, 2000 BAC +α 119.19 5 32,615

Oryza sativa japonica Science, 2002; Nature, 2005 BAC +α 372.08 12 66,710

Oryza sativa indica Science, 2002; PLoS Biology, 2005 WGS 426.32 10,267 49,710

Oryza sativa japonica (syngenta) PLoS Biology, 2005 WGS 391.14 7,777 45,824

Populus trichocarpa Science, 2006 WGS 485.51 22,012 45,555

Vitis vinifera Nature, 2007 WGS, Complete 497.51 35 30,434

Carica papaya Nature, 2008 WGS 369.69 17,677 28,589

Lotus japonicus DNA Research, 2008 WGS 323.24 110,945 26,700

Sorghum bicolor Nature, 2009 WGS 738.54 3,304 36,338

Zea mays Science, 2009 BAC, WGS 2,061.02 11 53,764

Cucumis sativus Nature Genetics, 2009 WGS 243.57 47,488 26,682

Glycine max Nature, 2010 WGS 996.90 4,262 62,199

Brachypodium distachyon Nature, 2010 WGS 273.27 197 32,255

Malus x domestica Nature Genetics, 2010 WGS 871.56 144,621 57,386*

14 Published Higher Plant Genomes from 11 Species


http://en.wikipedia.org/wiki/File:Arabidopsis_thaliana.jpg

http://en.wikipedia.org/wiki/File:US_long_grain_rice.jpg

http://en.wikipedia.org/wiki/File:Populus_trichocarpa_05897.JPG

http://upload.wikimedia.org/wikipedia/commons/e/e9/Vitis-vinifera.JPG

http://en.wikipedia.org/wiki/File:Koeh-029.jpg

http://en.wikipedia.org/wiki/File:Lotus_japonicus.jpg

http://en.wikipedia.org/wiki/File:Sorghum_bicolor03.jpg

http://en.wikipedia.org/wiki/File:ARS_cucumber.jpg

http://en.wikipedia.org/wiki/File:Koeh-283.jpg

http://en.wikipedia.org/wiki/File:Soybean.USDA.jpg

http://en.wikipedia.org/wiki/File:Brachypodium_distachyon_Bd21-3.jpg

Species name Journal Method Size (Mb) # of contigs # of transcripts

Chlamydomonas reinhardtii Science, 2007 WGS 112.31 88 16,709

Micromonas pusilla CCMP1545 Science, 2009 WGS 22.04 27 10,547

Micromonas sp. RCC299 Science, 2009 WGS 20.99 17 10,108

Ostreococcus lucimarinus CCE9901 PNAS, 2007 WGS 13.2 21 7,488

Ostreococcus sp. RCC809 Not published yet WGS 13.41 22 7,492

Ostreococcus tauri PNAS, 2007 WGS 12.58 118 7,725

Coccomyxa sp. C169 Not published yet WGS 48.95 45 9,629

7 Unicellular Plant Genomes


http://en.wikipedia.org/wiki/File:Chlamydomonas6-1.jpg

Distribution of Genome Sizes of 32 Plants

[Figure: bar chart of genome size (Mb) for 32 plant genomes in ascending order, with the unicellular plants marked. Bar labels: 12.6, 13.2, 13.4, 21.0, 22.0, 49.0, 112.3, 119.7, 206.7, 212.8, 225.9, 227.3, 273.3, 284.7, 307.5, 316.1, 321.7, 323.2, 362.5, 369.7, 372.3, 391.1, 416.7, 426.3, 485.5, 497.5, 702.6, 738.5, 781.8, 871.6, 996.9, 2061.0.]

Distribution of Number of Transcripts of 29 Plants

[Figure: bar chart of # of transcripts for 29 plants in ascending order, with the unicellular plants marked. Bar labels: 7,488, 7,492, 7,725, 9,629, 10,108, 10,547, 16,709, 22,285, 26,682, 26,700, 27,501, 28,589, 28,689, 30,434, 32,255, 32,670, 33,410, 36,338, 38,613, 45,555, 45,824, 47,442, 49,483, 49,710, 53,423, 53,764, 62,199, 67,393, 110,512.]

Relationship between Genome Size and Transcripts

[Figure: bar chart of # of transcripts / genome size (Mb) for 29 plants in ascending order, with the unicellular plants marked. Bar labels: 26.1, 49.2, 61.2, 62.4, 63.3, 77.3, 82.6, 85.5, 93.8, 104.7, 106.5, 113.9, 116.6, 117.2, 118.0, 118.1, 126.2, 148.8, 157.3, 158.1, 173.7, 181.0, 196.7, 279.2, 478.5, 481.6, 558.7, 567.3, 614.1.]

Comparisons with Genomes in Other Kingdoms

[Figure: bar chart of genome size (Mb) by group. Groups: Bacteria (5 species), Oomycota (6 species), Eukaryota (23 species), Fungi (256 species), Metazoa (90 species), Viridiplantae (32 species). Bar labels (Mb): 5.0, 102.9, 23.8, 24.1, 1587.1, 392.2.]

Cucumber Genome Sequences

Thank you for your attention!