NGS applications and databases. Alla L Lapidus, Ph.D., SPbAU, SPbSU, St. Petersburg

NGS applications and databases - Institute … bioinformaticsinstitute.ru/sites/default/files/lapidus_2.pdf


Page 1

NGS applications and databases

Alla L Lapidus, Ph.D.

SPbAU, SPbSU,

St. Petersburg

Page 2

Some history

Walter Gilbert and Allan Maxam at Harvard also developed sequencing methods. One of them, "DNA sequencing by chemical degradation", was published in 1977 (radioactive, no cloning; the fragment to be sequenced must be purified).

Frederick Sanger (MRC Centre, Cambridge, UK) published "DNA sequencing with chain-terminating inhibitors" in 1977. The method is radioactive but less toxic, so it was chosen and automated.

Page 3

Sanger vs NGS (1)

"Sanger sequencing" was the dominant DNA sequencing method for about 30 years, but the need for greater sequencing throughput and more economical sequencing technology was obvious.

Page 4

"REVOLUTIONARY GENOME SEQUENCING TECHNOLOGIES: THE $1000 GENOME" (Department of Health and Human Services (DHHS))

2004 - develop novel technologies that will enable extremely low-cost, high-quality DNA sequencing

2009 - the cost to sequence an entire individual human genome is to be $1,000 by the end of 2009, and the time required for sequencing less than one week. We are not there yet but very close: it is about $5,000 these days.

2012 - The NIH awarded $5.7 million in funding for research projects that explore ways to use genome sequencing in clinical care, and $800,000 to fund a coordinating center to support these studies.

Page 5

Sequencing Technology at a Glance

Page 6

Library Preparation - new

(Illumina)

(Not for PacBio)

Page 7

Sanger vs NGS (2)

• NGS can process millions of sequence reads in parallel rather than 96 or 384 at a time (1/6 of the cost or even less)

• No cloning bias

Objections: fidelity, read length, infrastructure cost, the need to handle large volumes of data, the need for bioinformatics support

Page 8

454 data quality

[Figure, three panels: "Percent 454 Mismatch Bases By Error Type" (percent of total, 0-50%, by error type); "Error Rate by HMP Length" (error rate vs. length of HMP base, shown for A, T, C, G); "Homopolymer Errors by HMP Length" (number of errors vs. length of HMP base, shown for A, T, C, G).]

Page 9

Illumina (MiSeq) data quality

• Some bias in coverage when dealing with extremely high-GC and AT-rich genomes

• Error rate is < 0.4%

• Produces errors after long (>20 bp) homopolymers

• Strand-dependent errors due to the GGC motif (followed by a GC-rich extension like GGG)

Page 10

Illumina read quality

Page 11

Illumina read quality by library

[Figure: "Quality scores", Phred quality (0-40) vs. base position (1-148), one curve per library: GONW std, GONU std, GONY std, GUYA std, GUYB std, GUYC std, GUYF std, GUYG std, GUYH std, GUYI std, GOYZ jmp, GUPO jmp, GUPN jmp, GWHN jmp.]
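The Phred quality plotted for these libraries is related to the estimated probability P of a wrong base call by Q = -10 log10(P). The following is a minimal sketch of that conversion, not taken from the slides; the function names are illustrative, and the ASCII offset of 33 is an assumption that holds for the common Phred+33 FASTQ encoding.

```python
import math
from typing import List


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q into an estimated error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P into a Phred quality score."""
    return -10 * math.log10(p)


def decode_fastq_quality(qual: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into per-base Phred scores.

    offset=33 is assumed here (Phred+33 encoding); older data may use 64.
    """
    return [ord(ch) - offset for ch in qual]
```

Under this definition, Q20 corresponds to one expected error in 100 bases and Q30 to one in 1000.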

Page 12

Ion Torrent (PGM) data quality

• Very biased coverage when dealing with extremely AT-rich genomes

• Error rate is 1.78%

• Does not generate reads for long (>14 bases) homopolymers

• Cannot predict the correct number of bases in homopolymers >8 bases long
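Because homopolymer length drives these errors, it can help to locate long homopolymer runs in a reference or assembly before trusting the reads that cover them. The sketch below is one simple way to do that, assuming a plain DNA string as input; the function name and the default threshold are illustrative rather than part of any platform's tooling.

```python
from itertools import groupby
from typing import List, Tuple


def homopolymer_runs(seq: str, min_len: int = 2) -> List[Tuple[int, str, int]]:
    """Return (start, base, length) for every run of identical bases
    that is at least min_len long. Positions are 0-based."""
    runs = []
    pos = 0
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs
```

For example, calling `homopolymer_runs(reference, min_len=9)` lists the runs longer than 8 bases, the range in which the slide reports that the base count cannot be predicted correctly.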

Page 13

PacBio read quality

• Some coverage bias towards high GC

• Error rate: 13% (!)

• The longest read length; useful for de novo assembly and scaffolding

Page 14

Platform-specific errors: GGCGGG

Page 15

Page 16

Problems caused by

Sequencing approach:
- platform-specific errors and features
- number of PCR steps
- enzyme used

Genome nature:
- ability to get a sufficient amount of good-quality gDNA
- single-cell genomes
- GC content
- repeats (total number, variability, length)
- IS elements
- metagenomes

Page 17

Applications

Microbial genomics

de novo sequencing

Re-sequencing previously published reference strains (whole-genome re-sequencing)

Expand the number of available genomes

Comparative studies

Ecology

DNA mixtures from diverse ecosystems (metagenomics)

Food, agriculture, forest

Sequencing extremely large genomes, crop plants

Clinical studies (personalized medicine)

Targeted sequencing (regions, genes, exomes)

Microbiota

ChIP-seq: protein-DNA interactions

Epigenomics

Transcriptome

DNA methylation

Pharmaceuticals, drug discovery

molecular basis of drug resistance

vaccine development

disease diagnostics

Ancient DNA

Forensics

• ……

Page 18

http://www.genomesonline.org

Page 19

Molecular Medicine: let’s move to Personalized medicine

•Improved diagnosis of disease

•Earlier detection of genetic predispositions to disease

•Rational drug design

•Gene therapy and control systems for drugs

•Pharmacogenomics "custom drugs"

Page 20

Cancer is a disease of genome alterations.

Which alterations can be detected:

M. Meyerson, S.Gabriel, G.Getz, Nature Reviews Genetics 11, 685-696

Page 21

Somatic mutations

Modified from: M. Meyerson, S.Gabriel, G.Getz, Nature Reviews Genetics 11, 685-696

Page 22

COSMIC – Cancer Database: designed to collect and display information on somatic mutations in cancer

Some key features of COSMIC:

Contains information on publications, samples and mutations in different cancer types.

Samples entered include benign neoplasms and other benign proliferations, in situ and invasive tumours, recurrences, metastases and cancer cell lines.

The mutation data and associated information are extracted from the primary literature and entered into the COSMIC database.

To provide a consistent view of the data, a histology and tissue ontology has been created and all mutations are mapped to a single version of each gene.

The data can be queried by tissue, histology or gene and displayed as a graph, as a table or exported in various formats.
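To make the idea of querying by tissue, histology or gene concrete, here is a toy sketch of such a lookup over in-memory records. It is not COSMIC's schema or API: the `MutationRecord` fields, the `query` helper and every record value are hypothetical placeholders chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MutationRecord:
    """Hypothetical, simplified mutation record (not the COSMIC schema)."""
    gene: str
    tissue: str
    histology: str
    mutation: str


def query(records: List[MutationRecord], **criteria: str) -> List[MutationRecord]:
    """Return the records whose fields equal all of the given criteria."""
    return [
        r for r in records
        if all(getattr(r, field) == value for field, value in criteria.items())
    ]


# Placeholder records; the names carry no biological meaning.
RECORDS = [
    MutationRecord("GENE_A", "lung", "carcinoma", "MUT_1"),
    MutationRecord("GENE_A", "skin", "melanoma", "MUT_2"),
    MutationRecord("GENE_B", "lung", "carcinoma", "MUT_3"),
]
```

With these placeholders, `query(RECORDS, tissue="lung")` returns the two lung records, and adding `gene="GENE_A"` narrows the result to one.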

Page 23

The UCSC Genome Browser database

The University of California, Santa Cruz Genome

Browser (http://genome.ucsc.edu) -

offers online access to a database of genomic sequence

and annotation data for a wide variety of organisms.

The Browser also has many tools for visualizing,

comparing and analyzing both publicly available and

user-generated genomic data sets, aligning

sequences and uploading user data.

Page 24

Genome Browser display on the hg19 human assembly showing the gene search box in use.

Fujita P A et al. Nucl. Acids Res. 2010;nar.gkq963

© The Author(s) 2010. Published by Oxford University Press.

Page 25

Genome Browser image on the hg18 human assembly showing the UCSC Genes, Conservation and Neandertal tracks (Human-Chimp coding differences, regions with the 5%

lowest S, SNPs used to calculate S and alignments of Neandertal sequence reads).

Fujita P A et al. Nucl. Acids Res. 2010;nar.gkq963

© The Author(s) 2010. Published by Oxford University Press.

Page 26

Human microbiome

The human microbiome includes viruses, fungi and bacteria, their genes and their

environmental interactions, and is known to influence human physiology.

There’s very broad variation in these bacteria in different people and that severely

limits our ability to create a “normal” microflora profile for comparison among healthy

people and those with any kind of health issues.

Example 1: autistic children have microbiomes that differ from those of other kids.

The strong correlation of gastrointestinal symptoms with autism severity indicates

that children with more severe autism are likely to have more severe

gastrointestinal symptoms and vice versa. It is possible that autism symptoms are

exacerbated or even partially due to the underlying gastrointestinal problems.

Example 2: You are what you eat!

Page 27

SILVA - Good for metagenome analysis

Page 28

Artemis: Genome Browser - Wellcome Trust Sanger Institute

Page 29

Ancient Genomes

• Degraded state of the sample: mtDNA sequencing

• Small amount of DNA

• Nuclear genomes of ancient remains: cave bear, mammoth, Neanderthal (10^6 bp)

Problems: contamination with modern human and bacterial DNA

Page 30

Ancient Genome DataBase: Saqqaq

The ancient genome database is an ongoing effort to build a genotype-phenotype catalogue and to reference available ancient genome data to it. The Saqqaq genome was the first ancient nuclear genome sequenced at high coverage, and it is what the database currently references.

Page 31

Forensic Genomics: use of NGS for crime investigations and missing person identification, kinship testing and ancestry investigation

http://blog.illumina.com

Task: reveal forensic DNA evidence from tiny and highly mixed samples.

Study:
- short tandem repeat (STR) typing (repeating units of 2–6 nucleotides)
- mitochondrial DNA analysis
- dense panels of single nucleotide polymorphisms (SNPs) offering
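As an illustration of what STR typing measures, the sketch below counts the longest run of consecutive copies of a repeat unit in a sequence. It is a simplification under stated assumptions: it only finds exact copies of a known motif, whereas real STR callers also handle sequencing errors, flanking regions and mixtures. The function name and the example motif are illustrative.

```python
import re


def longest_tandem_repeat(seq: str, motif: str) -> int:
    """Return the largest number of consecutive exact copies of motif in seq."""
    pattern = re.compile("(?:" + re.escape(motif) + ")+")
    best = 0
    for match in pattern.finditer(seq):
        copies = len(match.group(0)) // len(motif)
        best = max(best, copies)
    return best
```

An allele at an STR locus is then usually reported as this repeat count, for instance 7 for a read containing seven consecutive copies of a 4-nucleotide unit.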

Page 32

Summary of process used for forensic DNA typing with STR markers

The human identity testing community has focused on 22 autosomal STR loci (Table) and about a dozen Y-chromosome STR markers that are present in commercial kits. The use of core sets of loci enables common information to be included in criminal DNA databases.

The chromosomal locations, repeat motifs, allele ranges, PCR product sizes, and random match probabilities for these common autosomal STR loci are stored in the database.
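A random match probability for a multi-locus STR profile is commonly estimated with the product rule: compute each locus's genotype frequency from allele frequencies, then multiply across loci. The sketch below assumes Hardy-Weinberg proportions and independence between loci, which are modelling assumptions not stated on the slide; the function names are illustrative, and any frequencies passed in must come from a real population database.

```python
from typing import Iterable, Optional


def genotype_frequency(p: float, q: Optional[float] = None) -> float:
    """Hardy-Weinberg genotype frequency at one locus.

    One allele frequency p gives a homozygote (p * p);
    two allele frequencies p and q give a heterozygote (2 * p * q).
    """
    if q is None:
        return p * p
    return 2 * p * q


def random_match_probability(genotype_freqs: Iterable[float]) -> float:
    """Multiply per-locus genotype frequencies (assumes independent loci)."""
    rmp = 1.0
    for freq in genotype_freqs:
        rmp *= freq
    return rmp
```

The value shrinks multiplicatively with every locus added, which is why core sets of many loci give profiles their discriminating power.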

Page 33

NCBI

• NCBI - Established in 1988 as a national

resource for molecular biology information,

NCBI creates public databases, conducts

research in computational biology, develops

software tools for analyzing genome data, and

disseminates biomedical information - all for

the better understanding of molecular

processes affecting human health and disease.

www.ncbi.nlm.nih.gov/Genbank

Page 34

NCBI - databases and services

• sequence databases

GenBank, ESTs, SNPs, etc.

• PubMed - literature database

• Entrez

http://www.ncbi.nlm.nih.gov/entrez/

retrieval system connecting together plethora of

databases including PubMed, genomes, ontologies

• Blast - basic local alignment search tool

• Science primer

http://www.ncbi.nlm.nih.gov/About/primer/

introductions into molecular biology and

bioinformatics
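Entrez can also be queried programmatically. The sketch below builds a search URL for NCBI's E-utilities `esearch.fcgi` endpoint, which is not mentioned on the slide and is used here as an assumption about how one might script such a query. The helper name `esearch_url` is illustrative, and the code only constructs the URL without sending a request.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (an assumption, not from the slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for an Entrez database such as 'pubmed'."""
    params = {"db": db, "term": term, "retmax": retmax}
    return EUTILS_BASE + "/esearch.fcgi?" + urlencode(params)
```

The resulting URL can be opened with any HTTP client; NCBI asks users to respect its published usage limits when sending requests.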

Page 35

EMBL

EBI - The European Bioinformatics Institute (EBI) is a non-profit academic organisation that forms part of the European Molecular Biology Laboratory (EMBL)

EBI databases

• EMBL nucleotide database - http://www.ebi.ac.uk/embl/

• UniProt (together with Expasy and PIR)

• ArrayExpress - http://www.ebi.ac.uk/arrayexpress/ - public repository for microarray data

• Ensembl - http://www.ensembl.org/ - genomes and annotation for metazoa

www.ebi.ac.uk/embl

Page 36

DNA Data Bank of Japan

DDBJ - DDBJ (DNA Data Bank of Japan) began DNA data bank

activities in earnest in 1986 at the National Institute of

Genetics (NIG).

DDBJ has been functioning as the international nucleotide

sequence database in collaboration with EBI/EMBL and

NCBI/GenBank.

DNA sequence records organismic evolution more directly

than other biological materials and thus is invaluable not only

for research in life sciences but also human welfare in general.

The databases are, so to speak, a common treasure of human

beings. With this in mind, we make the databases online

accessible to anyone in the world.

www.ddbj.nig.ac.jp

Page 37

DNA variations – SNPs, CNVs

Page 38

Gene prediction

Open-source and on-line gene prediction

• Glimmer - bacteria, archaea, viruses

http://cbcb.umd.edu/software/glimmer/

• GlimmerHMM - eukaryotic genes

http://cbcb.umd.edu/software/GlimmerHMM/

• GeneZilla (TIGRscan) - eukaryotic genes

http://www.genezilla.org/

• GenScan - human genes

http://genes.mit.edu/GENSCAN.html

• software lists

http://www.genefinding.org/

Page 39

Nucleic structures

RNAs and 3D nucleic structural databases

• 3D structures of nucleic acids

RNABase - http://www.rnabase.org/

NDB nucleic acids database- http://ndbserver.rutgers.edu/

• SCOR - structural classification of RNA - http://scor.berkeley.edu/

RNA motifs, structures and interactions

• other databases

Small RNA database - http://condor.bcm.tmc.edu/smallRNA/

Noncoding RNA database - http://biobases.ibch.poznan.pl/ncRNA/

Page 40

ExPASy (expert protein analysis system)

• UniProt - the universal protein resource

http://www.expasy.uniprot.org/

- knowledgebase, reference clusters, archives

• Swiss-Prot - http://www.expasy.ch/sprot/

- database of protein sequences together with annotations

- structure and function of proteins

• PROSITE

http://www.expasy.ch/prosite/

- documentation on protein domains, folds, families

Page 41

Moving from second-generation to third-generation sequencing strategies

Advanced Technological Approaches Generate Genomic Data Better, Faster, and Cheaper

In the next-gen sequencing arena, the focus over the past several years has been on technological advances, moving from second-generation sequencing (SGS) to third-generation sequencing (TGS) strategies and producing research instruments capable of delivering whole-genome sequences in parallel at increasing speed.

More recently, as read lengths and coverage continue to increase, throughputs rise, and costs decline, the expanding range of applications of NGS has taken center stage.

Page 42

[Figure: sequence map with a scale from 1 to 9000 (ticks at 1, 3000, 6000, 9000) showing a "hypothetical" gene (178 aa).]

Page 43

Assembly problem -1

Page 44

THANK YOU!