NGS applications and databases - Institute …bioinformaticsinstitute.ru/sites/default/files/lapidus_2.pdf

NGS applications and databases

Alla L Lapidus, Ph.D.

SPbAU, SPbSU,

St. Petersburg

Walter Gilbert and Allan Maxam at Harvard also developed sequencing methods. One of them is "DNA sequencing by chemical degradation", published in 1977 (radioactive, no cloning, requires purifying the fragment to be sequenced).

Frederick Sanger (MRC Centre, Cambridge, UK): "DNA sequencing with chain-terminating inhibitors", published in 1977. Radioactive but less toxic, so it was chosen and automated.

Some history


Sanger vs NGS (1)

‘Sanger sequencing’ was the only DNA sequencing method for 30 years, but the need for greater sequencing throughput and more economical sequencing technology was obvious.

2004 - develop novel technologies that will enable extremely low-cost, high-quality DNA sequencing

2009 - bring the cost of sequencing an entire individual human genome down to $1,000 by the end of 2009, and the time required for sequencing to less than one week

We are not there yet, but very close: it is about $5,000 these days

2012 - The NIH awarded $5.7 million in funding for research projects that explore ways to use genome sequencing in clinical care, and $800,000 to fund a coordinating center to support these studies.

“REVOLUTIONARY GENOME SEQUENCING TECHNOLOGIES - THE $1000 GENOME” (Department of Health and Human Services (DHHS))

Sequencing Technology at a Glance

Library Preparation - new

(Illumina)

(Not for PacBio)

• NGS can process millions of sequence reads in parallel rather than 96 or 384 at a time (1/6 of the cost or even less)

• No cloning bias

Objections: fidelity, read length, infrastructure cost, handling large volumes of data, need for bioinformatics support

Sanger vs NGS (2)

454 data quality

[Figure: Percent 454 Mismatch Bases By Error Type. x-axis: Error Type; y-axis: Percent Total (0% to 50%)]

[Figure: Error Rate by HMP Length. x-axis: Length of HMP Base (1 to 9); y-axis: Error Rate (0 to 0.7); series: A, T, C, G]

[Figure: Homopolymer Errors by HMP Length. x-axis: Length of HMP Base (2 to 10); y-axis: Number of Errors (0 to 400); series: A, T, C, G]

• Some bias in coverage when dealing with extremely high-GC and AT-rich genomes

• Error rate is < 0.4%

• Produces errors after long (>20 bp) homopolymers

• Strand-dependent errors due to the GGC motif (followed by a GC-rich extension like GGG)

Illumina (MiSeq) data quality

Illumina read quality

[Figure: Quality scores. x-axis: base position (1 to 148); y-axis: phred quality (0 to 40); one curve per library: GONW, GONU, GONY, GUYA, GUYB, GUYC, GUYF, GUYG, GUYH, GUYI (std) and GOYZ, GUPO, GUPN, GWHN (jmp)]

Illumina read quality by library
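The phred quality on the y-axis of the plot above follows the standard definition Q = -10 * log10(P), where P is the probability that a base call is wrong. The sketch below converts between the two and sums the expected number of errors in a read. The function names are illustrative and do not come from the slides.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a phred quality score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P (0 < P <= 1) to a phred score Q = -10*log10(P)."""
    return -10 * math.log10(p)


def expected_errors(quals) -> float:
    """Expected number of wrong bases in a read, given its per-base phred scores."""
    return sum(phred_to_error_prob(q) for q in quals)


if __name__ == "__main__":
    # Q20 means one expected error per 100 bases, Q30 one per 1000.
    print(phred_to_error_prob(20), phred_to_error_prob(30))
    print(expected_errors([30] * 150))
```

A read of 150 bases that all sit at Q30 is expected to carry about 0.15 errors, which is one way to read the quality curves above.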

• Very biased coverage when dealing with extremely AT-rich genomes

• Error rate is 1.78%

• Does not generate reads for long (>14 base) homopolymers

• Cannot predict the correct number of bases in homopolymers >8 bases long

Ion Torrent (PGM) data quality
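Since the homopolymer limits above depend on run length, it can help to list the long runs in a reference before judging a dataset. The following is a minimal sketch that reports every run of a single base at or above a chosen length. The default of 9 mirrors the ">8 bases" threshold on the slide; the function name and return layout are assumptions made for illustration.

```python
from itertools import groupby


def homopolymer_runs(seq: str, min_len: int = 9):
    """Return (start, base, length) for each run of one base with length >= min_len.

    Positions are 0-based offsets into seq.
    """
    runs = []
    pos = 0
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


if __name__ == "__main__":
    example = "ACG" + "A" * 9 + "TTGC" + "G" * 15 + "A"
    for start, base, length in homopolymer_runs(example):
        print(f"{base} x {length} at offset {start}")
```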

• Some coverage bias towards high GC

• Error rate: 13% (!)

• The longest reads, useful for de novo assembly and scaffolding

PacBio read quality

Platform Specific errors GGCGGG

Sequencing approach:

-platform-specific errors and features

-number of PCR steps

-enzyme used

Genome nature:

-ability to get sufficient amount of good quality gDNA

-single cell genomes

-GC content

-repeats (total number, variability, length)

-IS elements

-metagenomes

Problems caused by
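GC content appears in the list above as a genome property that causes problems, and it is simple to measure. The sketch below computes the GC fraction of a sequence and of fixed-size windows along it, so that extreme regions stand out. The window and step sizes are arbitrary illustrative defaults, not values from the slides.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def windowed_gc(seq: str, window: int = 100, step: int = 100):
    """Return (offset, gc_fraction) for consecutive windows along seq.

    A sequence shorter than one window yields a single entry for the whole sequence.
    """
    last_start = max(len(seq) - window + 1, 1)
    return [(i, gc_content(seq[i:i + window])) for i in range(0, last_start, step)]


if __name__ == "__main__":
    toy = "GGGGCCCC" * 10 + "ATATATAT" * 10
    for offset, gc in windowed_gc(toy, window=40, step=40):
        print(offset, round(gc, 2))
```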


Microbial genomics

de novo sequencing

Re-sequencing previously published reference strains (Whole genome re-sequencing)

Expand the number of available genomes

Comparative studies

Ecology

DNA mixtures from diverse ecosystems (Metagenomics)

Food, agriculture, forest

Sequencing extremely large genomes, crop plants

Clinical Studies (personalized medicine).

Targeted sequencing (regions, genes, exomes)

Microbiota

ChIP-seq: protein-DNA interactions

Epigenomics

Transcriptome

DNA Methylation

Pharmaceuticals drug discovery

molecular basis of drug resistance

vaccine development

disease diagnostics

Ancient DNA

Forensic

• ……

Applications

http://www.genomesonline.org


Molecular Medicine: let’s move to Personalized medicine

•Improved diagnosis of disease

•Earlier detection of genetic predispositions to disease

•Rational drug design

•Gene therapy and control systems for drugs

•Pharmacogenomics "custom drugs"

Cancer is a disease of genome alterations.

Which alterations can be detected:

M. Meyerson, S.Gabriel, G.Getz, Nature Reviews Genetics 11, 685-696

Somatic mutations

Modified from: M. Meyerson, S.Gabriel, G.Getz, Nature Reviews Genetics 11, 685-696

COSMIC – Cancer Database

Some key features of COSMIC:

Contains information on publications, samples and mutations in different cancer types.

Samples entered include benign neoplasms and other benign proliferations, in situ and invasive tumours, recurrences, metastases and cancer cell lines.

The mutation data and associated information are extracted from the primary literature and entered into the COSMIC database.

To provide a consistent view of the data, a histology and tissue ontology has been created and all mutations are mapped to a single version of each gene. The data can be queried by tissue, histology or gene and displayed as a graph, as a table or exported in various formats.

COSMIC was designed to collect and display information on somatic mutations in cancer.
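Querying by tissue, histology or gene can be illustrated on a local table of mutation records. This is only a sketch: the record layout, the field names and the placeholder values are assumptions for illustration and do not reflect the real COSMIC schema or its export formats.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MutationRecord:
    """Hypothetical record layout; not the actual COSMIC schema."""
    gene: str
    tissue: str
    histology: str
    mutation: str


# Placeholder records with made-up values; these are not real COSMIC entries.
TOY_RECORDS = [
    MutationRecord("GENE_A", "lung", "carcinoma", "mut_1"),
    MutationRecord("GENE_A", "skin", "melanoma", "mut_2"),
    MutationRecord("GENE_B", "lung", "carcinoma", "mut_3"),
]


def query(records, gene=None, tissue=None, histology=None):
    """Keep the records that match every criterion that was supplied."""
    wanted = {"gene": gene, "tissue": tissue, "histology": histology}
    return [
        r for r in records
        if all(v is None or getattr(r, k) == v for k, v in wanted.items())
    ]


def count_by_tissue(records):
    """Tabulate how many records fall into each tissue."""
    return Counter(r.tissue for r in records)


if __name__ == "__main__":
    for rec in query(TOY_RECORDS, gene="GENE_A"):
        print(rec)
    print(count_by_tissue(TOY_RECORDS))
```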

The UCSC Genome Browser database

The University of California, Santa Cruz Genome Browser (http://genome.ucsc.edu) offers online access to a database of genomic sequence and annotation data for a wide variety of organisms. The Browser also has many tools for visualizing, comparing and analyzing both publicly available and user-generated genomic data sets, aligning sequences and uploading user data.

Genome Browser display on the hg19 human assembly showing the gene search box in use.

Fujita P A et al. Nucl. Acids Res. 2010;nar.gkq963

© The Author(s) 2010. Published by Oxford University Press.

Genome Browser image on the hg18 human assembly showing the UCSC Genes, Conservation and Neandertal tracks (Human-Chimp coding differences, regions with the 5% lowest S, SNPs used to calculate S and alignments of Neandertal sequence reads).

Fujita P A et al. Nucl. Acids Res. 2010;nar.gkq963

© The Author(s) 2010. Published by Oxford University Press.

Human microbiome

The human microbiome includes viruses, fungi and bacteria, their genes and their environmental interactions, and is known to influence human physiology.

There is very broad variation in these bacteria between people, which severely limits our ability to create a “normal” microflora profile for comparing healthy people with those who have any kind of health issue.

Example 1: autistic children have microbiomes that differ from those of other kids. The strong correlation of gastrointestinal symptoms with autism severity indicates that children with more severe autism are likely to have more severe gastrointestinal symptoms and vice versa. It is possible that autism symptoms are exacerbated by, or even partially due to, the underlying gastrointestinal problems.

Example 2: You are what you eat!

SILVA - Good for metagenome analysis

Artemis: Genome Browser - Wellcome Trust Sanger Institute

• Degraded state of the sample: mtDNA sequencing

• Small amount of DNA

• Nuclear genomes of ancient remains: cave bear, mammoth, Neanderthal (10^6 bp)

Problems: contamination with modern human and bacterial DNA

Ancient Genomes

Ancient Genome DataBase: Saqqaq

The ancient genome database is an ongoing effort to build a genotype-phenotype catalogue and to reference available ancient genome data against it. The Saqqaq genome was the first ancient nuclear genome sequenced at high coverage, and it is what the database currently references.

Forensic Genomics:

use of NGS for crime investigations and missing

person identification, kinship testing and

ancestry investigation

http://blog.illumina.com

Task: recover forensic DNA evidence from tiny and highly mixed samples.

Study:

-short tandem repeat (STR) typing (repeating units of 2-6 nucleotides)

-mitochondrial DNA analysis

-dense panels of single nucleotide polymorphisms (SNPs) offering

Summary of process used for forensic DNA typing with STR markers

The human identity testing community has focused on 22 autosomal STR loci (Table) and about a dozen Y-chromosome STR markers that are present in commercial kits. The use of core sets of loci enables common information to be included in criminal DNA databases.

The chromosomal locations, repeat motifs, allele ranges, PCR product sizes, and random match probabilities for these common autosomal STR loci are stored in the database.
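An STR allele is usually described by how many times its repeat motif occurs in tandem. As a rough sketch of that idea, the function below counts the longest uninterrupted run of a motif in a sequence. The motif "GATA" and the toy sequence are illustrative only. Real forensic typing relies on PCR product sizes or dedicated NGS callers and handles partial and compound repeats, which this does not.

```python
import re


def longest_tandem_repeat(seq: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of motif found in seq.

    The motif length is restricted to 2-6 nucleotides, matching the STR definition.
    """
    if not 2 <= len(motif) <= 6:
        raise ValueError("STR motifs are 2-6 nucleotides long")
    pattern = re.compile(f"(?:{re.escape(motif.upper())})+")
    best = 0
    for match in pattern.finditer(seq.upper()):
        best = max(best, len(match.group(0)) // len(motif))
    return best


if __name__ == "__main__":
    toy_read = "TT" + "GATA" * 7 + "CC" + "GATA" * 3
    print(longest_tandem_repeat(toy_read, "GATA"))
```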

NCBI

• NCBI - Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information, all for the better understanding of molecular processes affecting human health and disease.

www.ncbi.nlm.nih.gov/Genbank

http://www.ncbi.nlm.nih.gov/

NCBI - databases and services

• sequence databases

GenBank, ESTs, SNPs, etc.

• PubMed - literature database

• Entrez

http://www.ncbi.nlm.nih.gov/entrez/

retrieval system connecting a plethora of databases, including PubMed, genomes and ontologies

• Blast - basic local alignment search tool

• Science primer

http://www.ncbi.nlm.nih.gov/About/primer/

introductions to molecular biology and bioinformatics
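Entrez can also be reached programmatically through NCBI's E-utilities web interface, which the slide does not mention and is brought in here as an assumption. The sketch below only builds an ESearch request URL and sends nothing over the network. The search term and the helper name are illustrative.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (an addition to the slide content).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL that asks database db for records matching term."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Opening this URL in a browser returns the matching PubMed identifiers as XML.
    print(esearch_url("pubmed", "next generation sequencing", retmax=5))
```

Keeping URL construction separate from the network call makes the query easy to inspect before it is sent.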

EMBL

EBI - The European Bioinformatics Institute (EBI) is a non-profit academic organisation that forms part of the European Molecular Biology Laboratory (EMBL)

EBI databases

• EMBL nucleotide database - http://www.ebi.ac.uk/embl/

• UniProt (together with Expasy and PIR)

• ArrayExpress - http://www.ebi.ac.uk/arrayexpress/

public repository for microarray data

• Ensembl - http://www.ensembl.org/ - genomes and annotation for metazoa

www.ebi.ac.uk/embl

http://www.ebi.ac.uk/

http://www.ensembl.org/

DNA Data Bank of Japan

DDBJ (DNA Data Bank of Japan) began DNA data bank activities in earnest in 1986 at the National Institute of Genetics (NIG).

DDBJ has been functioning as the international nucleotide sequence database in collaboration with EBI/EMBL and NCBI/GenBank.

DNA sequence records organismic evolution more directly than other biological materials and is thus invaluable not only for research in life sciences but also for human welfare in general. The databases are, so to speak, a common treasure of human beings. With this in mind, we make the databases accessible online to anyone in the world.

www.ddbj.nig.ac.jp

http://www.ddbj.nig.ac.jp/

DNA variations – SNPs, CNVs

Gene prediction

Open-source and on-line gene prediction

• Glimmer - bacteria, archaea, viruses

http://cbcb.umd.edu/software/glimmer/

• GlimmerHMM - eukaryotic genes

http://cbcb.umd.edu/software/GlimmerHMM/

• GeneZilla (TIGRscan) - eukaryotic genes

http://www.genezilla.org/

• GenScan - human genes

http://genes.mit.edu/GENSCAN.html

• software lists

http://www.genefinding.org/
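The tools listed above use statistical models, but the simplest form of gene prediction is an open reading frame (ORF) scan. The sketch below is a deliberately naive illustration, not the method of Glimmer, GlimmerHMM, GeneZilla or GenScan. It scans only the forward strand, accepts only ATG as a start codon, and uses an arbitrary minimum length.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 30):
    """Return (start, end, frame) for forward-strand ORFs from ATG to a stop codon.

    start is the 0-based offset of the ATG, end is one past the stop codon,
    and min_codons counts the coding codons (ATG included, stop excluded).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the ORF in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ATG after the stop
    return orfs


if __name__ == "__main__":
    toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "G"
    print(find_orfs(toy, min_codons=5))
```

A real gene finder would also scan the reverse complement and score candidate ORFs by codon usage.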

Nucleic structures

RNAs and 3D nucleic structural databases

• 3D structures of nucleic acids

RNABase - http://www.rnabase.org/

NDB nucleic acids database - http://ndbserver.rutgers.edu/

• SCOR - structural classification of RNA - http://scor.berkeley.edu/

RNA motifs, structures and interactions

• other databases

Small RNA database - http://condor.bcm.tmc.edu/smallRNA/

Noncoding RNA database - http://biobases.ibch.poznan.pl/ncRNA/

• UniProt - the universal protein resource

http://www.expasy.uniprot.org/

- knowledgebase, reference clusters, archives

• Swiss-Prot - http://www.expasy.ch/sprot/

- database of protein sequences together with annotations

- structure and function of proteins

• PROSITE

http://www.expasy.ch/prosite/

- documentation on protein domains, folds, families

ExPASy (expert protein analysis system)

Moving from second-generation to third-generation sequencing strategies

Advanced Technological Approaches Generate Genomic Data Better, Faster, and Cheaper

In the next-gen sequencing arena, the focus over the past several years has been on technological advances, moving from second-generation (SGS) to third-generation sequencing (TGS) strategies and producing research instruments capable of delivering whole-genome sequences in parallel at increasing speed.

More recently, as read lengths and coverage continue to increase, throughputs rise, and costs decline, the expanding range of applications of NGS has taken center stage.

[Figure: sequence region with axis ticks at 1, 3000, 6000 and 9000, showing a “hypothetical” gene (178 aa)]

Assembly problem -1

THANK YOU!
