
Pamela Ferretti

Laboratory of Computational Metagenomics

Centre for Integrative Biology, University of Trento

Italy

Microbial Genome Assembly


Outline-summary

1. QUICK INTRODUCTION

2. GENOME ASSEMBLY

3. ASSEMBLY STRATEGIES

4. CASE STUDY


DNA packaging



Next Generation Sequencing

[Figure: example sequences: TCTTATTGTGACC, TAGGCTAGCTTAG, GCAATGCAGTAAC, TCCAGCTAGGTTC, ACGTAGGCTAGCGTTAGCGA ........ CTGCAT C]


Genome Assembly

1. GENOME SEQUENCING
2. PRELIMINARY ANALYSIS
3. ASSEMBLY
4. ADVANCED BIOINFORMATIC ANALYSIS

OVERLAPPING SEQUENCE ALIGNMENT


On the feasibility of sequence assembly

Sequencing the human genome with shotgun sequencing + assembly is the only feasible strategy

Weber, James L., and Eugene W. Myers. "Human whole-genome shotgun sequencing." Genome Research 7.5 (1997): 401-409.

Computational assembly of shotgun sequencing data is simply unfeasible, and a bad idea anyway

Green, Philip. "Against a whole-genome shotgun." Genome Research 7.5 (1997): 410-417.

They were both right! (...well, Weber and Myers were a bit more right from the practical viewpoint...)


Genome assembly strategies

Greedy approach → SSAKE

De Bruijn graph (DBG) → Velvet, SOAPdenovo

Overlap Layout Consensus (OLC) → MIRA

Mixed approaches → MaSuRCA
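
The greedy idea in the list above can be illustrated with a toy example: repeatedly merge the two reads that share the longest suffix-prefix overlap. This is only a sketch under simplifying assumptions (error-free reads, one strand, no repeats); the function names and the `min_len` threshold are hypothetical, and it is not how SSAKE is implemented.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(contigs, 2):
            n = overlap(a, b, min_len)
            if n > best_len:
                best_len, best_a, best_b = n, a, b
        if best_len == 0:  # nothing left to merge
            break
        contigs.remove(best_a)
        contigs.remove(best_b)
        contigs.append(best_a + best_b[best_len:])
    return contigs


reads = ["ATGCGT", "GCGTAC", "GTACCT", "ACCTGA"]
print(greedy_assemble(reads))
```

On these four toy reads the merges collapse into a single contig. With repeats in the genome, the locally best merge can be globally wrong, which is the usual weakness of greedy assemblers.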


Genome assembly strategies

DE BRUIJN GRAPH APPROACH (DBG)

Velvet, SOAPdenovo2

Nodes = overlapping sequences of reads of uniform length
Edges = k-mers (unique subsequences within reads)

EULERIAN PATH
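
A minimal sketch of the de Bruijn idea, assuming the common formulation in which each k-mer is an edge joining its (k-1)-mer prefix to its (k-1)-mer suffix, and the assembly is read off an Eulerian path found with Hierholzer's algorithm. The function names are hypothetical, reads are assumed error-free, and real assemblers such as Velvet add error correction and graph simplification that are not shown here.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build the graph: each distinct k-mer is one edge, stored once."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: a walk that uses every edge exactly once."""
    if not graph:
        return []
    edges = {node: list(targets) for node, targets in graph.items()}
    in_deg = defaultdict(int)
    for targets in edges.values():
        for t in targets:
            in_deg[t] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((n for n in edges if len(edges[n]) - in_deg[n] == 1),
                 next(iter(edges)))
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if edges.get(node):
            stack.append(edges[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Turn a path of (k-1)-mers back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:]) if path else ""


reads = ["ATGCGT", "GCGTAC", "GTACCT", "ACCTGA"]
print(spell(eulerian_path(de_bruijn(reads, k=4))))
```

Note that the reads are never compared pairwise: overlaps are implicit in shared (k-1)-mers.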


Genome assembly strategies

OVERLAP LAYOUT CONSENSUS (OLC)

MIRA

Nodes = reads
Edges = overlap between reads

1. OVERLAP
2. LAYOUT
3. CONSENSUS

HAMILTONIAN PATH
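
The three OLC steps can be sketched on toy data as follows. This is an illustration under strong assumptions (error-free reads, one strand, brute-force search), with hypothetical function names; it is not MIRA's implementation. The layout step tries every ordering of the reads, which shows why the Hamiltonian-path formulation does not scale.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads, min_len=3):
    """Step 1 (overlap): nodes are reads, edges are suffix-prefix overlaps."""
    return {(a, b): n for a, b in permutations(reads, 2)
            if (n := overlap(a, b, min_len)) > 0}


def layout(reads, graph):
    """Step 2 (layout): brute-force Hamiltonian path, visiting each read once."""
    for order in permutations(reads):
        if all((a, b) in graph for a, b in zip(order, order[1:])):
            return list(order)
    return None


def consensus(order, graph):
    """Step 3 (consensus): with error-free reads, just merge along the path."""
    seq = order[0]
    for a, b in zip(order, order[1:]):
        seq += b[graph[(a, b)]:]
    return seq


reads = ["GTACCT", "ATGCGT", "ACCTGA", "GCGTAC"]
graph = overlap_graph(reads)
order = layout(reads, graph)
print(order, consensus(order, graph))
```

The number of orderings grows factorially with the number of reads, so real OLC assemblers rely on heuristics for the layout step and on a multiple alignment for the consensus.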


Genome assembly strategies

DBG vs OLC

ADVANTAGES
DBG: very sensitive to repeats; each k-mer stored just once; Eulerian cycle; never explicitly computes pairwise overlaps
OLC: modular algorithmic design; flexibility and robustness

DISADVANTAGES
DBG: sensitive to sequencing errors (new k-mers); large memory requirements
OLC: Hamiltonian cycle; overlap stage is time-consuming; genome-size limitations


Genome Assemblers

Average Coverage
Number of Contigs
Number of Contigs > 1Kb
N50 contig size
Fraction of reads assembled
Total consensus (in nt)
Number of scaffolds
N50 scaffolds size

Ion Torrent PGM → MIRA 3.9

Illumina → MaSuRCA

MIRA 3.9 also produced good-quality results, but it has a longer execution time and becomes unstable with large numbers of short reads
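
Two of the metrics used to compare assemblers, N50 and average coverage, can be computed as in the sketch below. It assumes the usual definitions: N50 is the length of the contig at which the running total of the sorted contig lengths (longest first) reaches half of the assembly size, and average coverage is the total number of sequenced bases divided by the genome size. The function names and the example numbers are hypothetical.

```python
def n50(lengths):
    """Contig length at which the longest contigs cover half of the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def average_coverage(n_reads, read_length, genome_size):
    """Total sequenced bases divided by the genome size."""
    return n_reads * read_length / genome_size


contigs = [80, 70, 50, 40, 30, 20]          # hypothetical contig lengths
print(n50(contigs))                          # N50 of the toy assembly
print(average_coverage(1_000_000, 100, 5_000_000))
```

The same `n50` function applies to scaffold lengths.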


Mycobacteria Assembly: Case Study

Responsible for many animal and human diseases
M. tuberculosis and M. leprae (TM)
M. fortuitum (NTM) outbreak (nail salon, 2002)
M. chelonae (NTM) outbreak (face lifts, 2004)

Illumina HiSeq sequencing (NGS Facility – CIBIO/UNITN)
Twenty mycobacterial strains
From 20 different Mycobacteria species

→ MaSuRCA

Novel mycobacteria detection clinical tests


Raw data quality assessment and pre-processing

Fastq-mcf tool, used to remove:

• poor-quality ends of reads
• Ns, duplicates and sequencing adapters
• reads that are too short

Reduction up to 73%
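
The kind of filtering described on this slide can be sketched for a single read as follows. This is a generic illustration, not fastq-mcf's actual algorithm: the thresholds `min_q` and `min_len` are hypothetical, qualities are assumed to be Phred+33 encoded, and adapter and duplicate removal are not covered.

```python
def trim_read(seq, qual, min_q=20, min_len=30, phred_offset=33):
    """Trim low-quality bases from both ends of a read.

    Returns (sequence, quality) after trimming, or None if the read
    should be discarded (too short or still containing N).
    """
    scores = [ord(c) - phred_offset for c in qual]
    start, end = 0, len(seq)
    while start < end and scores[start] < min_q:
        start += 1
    while end > start and scores[end - 1] < min_q:
        end -= 1
    trimmed = seq[start:end]
    if len(trimmed) < min_len or "N" in trimmed.upper():
        return None
    return trimmed, qual[start:end]


# '#' encodes Phred 2 and 'I' encodes Phred 40 in Phred+33.
print(trim_read("ACGTACGTAC", "##IIIIII##", min_len=5))
print(trim_read("ACGNACGT", "IIIIIIII", min_len=5))
```

The first read keeps only its high-quality core; the second is discarded because it contains an N.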

Page 20: Pamela Ferretti Laboratory of Computational Metagenomics Centre for Integrative Biology University of Trento Italy Microbial Genome Assembly 1

20

Assembly parameters setting

K-mers: strings of a particular length k, which are shorter than entire reads

Best empirical k-mer length: 91 bases

High coverage
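
To make the k-mer definition concrete, the sketch below extracts and counts the k-mers of a set of reads. The function names and the toy reads are hypothetical; a read of length L contains L - k + 1 k-mers, so k must not exceed the read length.

```python
from collections import Counter


def kmers(read, k):
    """All substrings of length k, in order of appearance."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def kmer_counts(reads, k):
    """How many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


print(kmers("ACGTAC", 3))
print(kmer_counts(["ACGT", "CGTA"], 3))
```

With high coverage, genuine k-mers are seen many times while k-mers created by sequencing errors stay rare, which is one reason the choice of k interacts with coverage.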


MaSuRCA results of Mycobacteria

Abnormal GC content

Genome size too high


Examples of environmental contamination

GC-content-based quality analysis

Staphylococcus epidermidis
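
A GC-content check of this kind can be sketched as below: compute the GC fraction of each contig and flag those outside the range expected for the target organism. Mycobacterial genomes are GC-rich while Staphylococcus epidermidis is GC-poor, so contaminating contigs tend to stand out. The function names, contig names and the range used in the example are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def flag_by_gc(contigs, low, high):
    """Names of contigs whose GC fraction falls outside [low, high]."""
    return [name for name, seq in contigs.items()
            if not low <= gc_content(seq) <= high]


contigs = {"contig_1": "GGCCGCAT", "contig_2": "ATATATGC"}  # toy sequences
print({name: gc_content(seq) for name, seq in contigs.items()})
print(flag_by_gc(contigs, low=0.6, high=0.8))
```

A flagged contig is only a candidate for contamination; confirming it would need a comparison against reference sequences.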


Thanks


http://gcat.davidson.edu/phast/#methods