NGS applications
and databases
Alla L Lapidus, Ph.D.
SPbAU, SPbSU,
St. Petersburg
Some history
Walter Gilbert and Allan Maxam at Harvard also
developed sequencing methods. One of them,
"DNA sequencing by chemical degradation", was
published in 1977 (radioactive, no cloning,
requires purifying the fragment to be sequenced).
Frederick Sanger (MRC Centre, Cambridge, UK):
"DNA sequencing with chain-terminating
inhibitors", published in 1977. Radioactive but
less toxic, so it was chosen and automated.
Sanger vs NGS (1)
‘Sanger sequencing’ was the only DNA sequencing
method in wide use for 30 years, but the need for
greater sequencing throughput and more
economical sequencing technology was obvious.
“REVOLUTIONARY GENOME SEQUENCING TECHNOLOGIES
THE $1000 GENOME”
(Department of Health and Human Services (DHHS))
2004 - develop novel technologies that will enable extremely low-cost,
high-quality DNA sequencing
2009 - goal: bring the cost of sequencing an entire individual human genome
down to $1,000 by the end of 2009, with less than one week required
for sequencing
We are not there yet, but very close: it is about $5,000 these days
2012 - The NIH awarded $5.7 million in funding for research
projects that explore ways to use genome sequencing in clinical
care, and $800,000 to fund a coordinating center to support
these studies.
Sequencing Technology at a Glance
Library Preparation - new
(Illumina)
(Not for PacBio)
Sanger vs NGS (2)
• NGS can process millions of sequence reads in parallel
rather than 96 or 384 at a time (1/6 of the cost or even less)
• No cloning bias
Objections: fidelity, read length, infrastructure cost,
handling large volumes of data, need for bioinformatics support
454 data quality
[Figure, three panels:
"Percent 454 Mismatch Bases By Error Type" (x: error type, y: percent of total);
"Error Rate by HMP Length" (x: homopolymer (HMP) length, y: error rate, one series per base A, T, C, G);
"Homopolymer Errors by HMP Length" (x: homopolymer (HMP) length, y: number of errors, one series per base A, T, C, G)]
Illumina (MiSeq) data quality
• Some bias in coverage when dealing with extremely
high-GC and AT-rich genomes
• Error rate is < 0.4%
• Produces errors after long (>20 bp) homopolymers
• Strand-dependent errors due to the GGC motif
(followed by a GC-rich extension such as GGG)
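Coverage bias is usually discussed relative to GC content, so a small helper for measuring it may be useful here. This is only a minimal sketch: the function names `gc_content` and `windowed_gc` and the window parameters are illustrative assumptions, not part of any sequencing vendor's software.

```python
from __future__ import annotations


def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq: str, window: int, step: int) -> list[tuple[int, float]]:
    """Return (start, gc_fraction) for every full window along seq.

    Windows that would run past the end of the sequence are skipped.
    """
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Toy sequence (made up): an AT-only half followed by a GC-only half.
    for start, gc in windowed_gc("AAAATTTTGGGGCCCC", window=8, step=8):
        print(start, gc)
```

Plotting read depth against the windowed GC fraction of a reference is one common way to make this kind of bias visible.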
Illumina read quality
[Figure: "Illumina read quality by library" (quality scores): Phred quality (0-40) versus base position (1-148), one series per library: GONW, GONU, GONY, GUYA, GUYB, GUYC, GUYF, GUYG, GUYH, GUYI (std) and GOYZ, GUPO, GUPN, GWHN (jmp)]
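The Phred quality on the plot's y-axis is defined as Q = -10 * log10(P), where P is the estimated probability that the base call is wrong. The sketch below shows the conversion in both directions and decodes a FASTQ quality string. It assumes the common ASCII offset of 33; the function names are illustrative.

```python
from __future__ import annotations

import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred quality score."""
    return -10 * math.log10(p)


def decode_fastq_quality(qual: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into integer Phred scores.

    The offset of 33 is an assumption (the widely used Sanger-style encoding).
    """
    return [ord(ch) - offset for ch in qual]


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
```

A score of 20 therefore corresponds to a 1% chance of error and a score of 30 to 0.1%.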
Ion Torrent (PGM) data quality
• Very biased coverage when dealing with extremely
AT-rich genomes
• Error rate is 1.78%
• Does not generate reads for long (>14 base) homopolymers
• Cannot predict the correct number of bases in
homopolymers >8 bases long
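Since homopolymers longer than 8 bases are the problem regions named above, it can help to locate them in a reference before interpreting the data. This is a minimal sketch using a regular expression. The function name `homopolymer_runs` and the default threshold of 9 are illustrative assumptions.

```python
from __future__ import annotations

import re


def homopolymer_runs(seq: str, min_len: int = 9) -> list[tuple[int, str, int]]:
    """Find runs of a single base that are at least min_len long.

    Returns (start, base, length) tuples with 0-based start positions.
    min_len must be at least 1.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    # One base followed by at least (min_len - 1) copies of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [
        (m.start(), m.group(1), m.end() - m.start())
        for m in pattern.finditer(seq.upper())
    ]


if __name__ == "__main__":
    # Toy sequence (made up): a 10-base A run flanked by other bases.
    print(homopolymer_runs("ACGT" + "A" * 10 + "GT"))
```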
PacBio read quality
• Some coverage bias towards high GC
• Error rate: 13% (!)
• The longest reads, useful for de novo
assembly scaffolding
Platform Specific errors GGCGGG
Problems caused by
Sequencing approach:
-platform-specific errors and features
-number of PCR steps
-enzyme used
Genome nature:
-ability to get a sufficient amount of good-quality gDNA
-single cell genomes
-GC content
-repeats (total number, variability, length)
-IS elements
-metagenomes
Microbial genomics
de novo sequencing
Re-sequencing previously published reference strains (Whole genome re-sequencing)
Expand the number of available genomes
Comparative studies
Ecology
DNA mixtures from diverse ecosystems (Metagenomics)
Food, agriculture, forest
Sequencing extremely large genomes, crop plants
Clinical studies (personalized medicine)
Targeted sequencing (regions, genes, exomes)
Microbiota
ChIP-seq: protein-DNA interactions
Epigenomics
Transcriptome
DNA Methylation
Pharmaceuticals drug discovery
molecular basis of drug resistance
vaccine development
disease diagnostics
Ancient DNA
Forensic
• ……
Applications
http://www.genomesonline.org
Molecular Medicine: let’s move to Personalized medicine
•Improved diagnosis of disease
•Earlier detection of genetic predispositions to disease
•Rational drug design
•Gene therapy and control systems for drugs
•Pharmacogenomics "custom drugs"
Cancer is a disease of genome alterations.
Which alterations can be detected:
M. Meyerson, S.Gabriel, G.Getz, Nature Reviews Genetics 11, 685-696
Somatic mutations
Modified from: M. Meyerson, S.Gabriel, G.Getz, Nature Reviews Genetics 11, 685-696
COSMIC – Cancer Database
COSMIC was designed to collect and display information on
somatic mutations in cancer.
Some key features of COSMIC:
Contains information on publications, samples and mutations in
different cancer types.
Samples entered include benign neoplasms and other benign
proliferations, in situ and invasive tumours, recurrences,
metastases and cancer cell lines.
The mutation data and associated information are extracted from
the primary literature and entered into the COSMIC database.
In order to provide a consistent view of the data, a histology and
tissue ontology has been created and all mutations are
mapped to a single version of each gene.
The data can be queried by tissue, histology or gene and
displayed as a graph, as a table or exported in various formats.
The UCSC Genome Browser database
The University of California, Santa Cruz Genome
Browser (http://genome.ucsc.edu) -
offers online access to a database of genomic sequence
and annotation data for a wide variety of organisms.
The Browser also has many tools for visualizing,
comparing and analyzing both publicly available and
user-generated genomic data sets, aligning
sequences and uploading user data.
Genome Browser display on the hg19 human assembly showing the gene search box in use.
Fujita P A et al. Nucl. Acids Res. 2010;nar.gkq963
© The Author(s) 2010. Published by Oxford University Press.
Genome Browser image on the hg18 human assembly showing the UCSC Genes, Conservation and Neandertal tracks (Human-Chimp coding differences, regions with the 5%
lowest S, SNPs used to calculate S and alignments of Neandertal sequence reads).
Fujita P A et al. Nucl. Acids Res. 2010;nar.gkq963
Human microbiome
The human microbiome includes viruses, fungi and bacteria, their genes and their
environmental interactions, and is known to influence human physiology.
There’s very broad variation in these bacteria in different people and that severely
limits our ability to create a “normal” microflora profile for comparison among healthy
people and those with any kind of health issues.
Example 1: autistic children have microbiomes that differ from those of other kids.
The strong correlation of gastrointestinal symptoms with autism severity indicates
that children with more severe autism are likely to have more severe
gastrointestinal symptoms and vice versa. It is possible that autism symptoms are
exacerbated or even partially due to the underlying gastrointestinal problems.
Example 2: You are what you eat!
SILVA - Good for metagenome analysis
Artemis: Genome Browser - Wellcome Trust Sanger Institute
Ancient Genomes
• Degraded state of the sample; mtDNA sequencing
• Small amount of DNA
• Nuclear genomes of ancient remains: cave bear, mammoth, Neanderthal (10^6 bp)
Problems: contamination with modern human and bacterial DNA
Ancient Genome DataBase: Saqqaq
The ancient genome database is an ongoing effort to build a genotype-phenotype
catalogue and to reference available ancient genome data against it. The Saqqaq genome was
the first ancient nuclear genome sequenced at high coverage, and it is what the database
currently references.
Forensic Genomics:
use of NGS for crime investigations and missing
person identification, kinship testing and
ancestry investigation
http://blog.illumina.com
Task: recover forensic DNA evidence from tiny and highly mixed samples.
Study:
-short tandem repeat (STR) typing (repeating units of 2–6 nucleotides)
-mitochondrial DNA analysis
-dense panels of single nucleotide polymorphisms (SNPs)
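To make the STR definition concrete (tandem repeats of a 2–6 nucleotide unit), here is a minimal sketch that scans a sequence for such repeats with regular expressions. It is an illustration only, not the method used by forensic typing kits. The function name `find_strs` and the `min_repeats` threshold are assumptions.

```python
from __future__ import annotations

import re


def is_primitive(unit: str) -> bool:
    """True if unit is not itself a repeat of a shorter unit (e.g. ATAT = AT x 2)."""
    n = len(unit)
    return not any(
        n % k == 0 and unit == unit[:k] * (n // k) for k in range(1, n)
    )


def find_strs(seq: str, min_unit: int = 2, max_unit: int = 6,
              min_repeats: int = 3) -> list[tuple[int, str, int]]:
    """Return (start, unit, repeat_count) for tandem repeats in seq.

    Only units of min_unit..max_unit bases repeated at least min_repeats
    times are reported; start positions are 0-based.
    """
    seq = seq.upper()
    hits = []
    for unit_len in range(min_unit, max_unit + 1):
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            if is_primitive(unit):  # skip homopolymers and compound units
                hits.append((m.start(), unit, (m.end() - m.start()) // unit_len))
    return sorted(hits)


if __name__ == "__main__":
    # Toy sequence (made up): five copies of GATA between flanking bases.
    print(find_strs("CC" + "GATA" * 5 + "TT"))
```

In STR typing the repeat count at each locus is the allele, so the third element of each tuple is the quantity of interest.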
Summary of the process used for
forensic DNA typing with STR
markers
The human identity testing community has
focused on 22 autosomal STR loci (Table) and
about a dozen Y-chromosome STR markers
that are present in commercial kits.
The use of core sets of loci enables common
information to be included in criminal DNA
databases.
The chromosomal locations, repeat motifs,
allele ranges, PCR product sizes, and
random match probabilities for these common
autosomal STR loci are stored in the database.
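A random match probability is commonly estimated with the product rule: under Hardy-Weinberg equilibrium a homozygous genotype has frequency p^2 and a heterozygous one 2pq, and the per-locus frequencies are multiplied on the assumption that the loci are independent. The sketch below illustrates that calculation only. The locus names and allele frequencies are made-up placeholders, not values from any population database.

```python
from __future__ import annotations

from math import prod


def genotype_frequency(p: float, q: float | None = None) -> float:
    """Expected genotype frequency under Hardy-Weinberg equilibrium.

    One allele frequency means a homozygote (p^2);
    two allele frequencies mean a heterozygote (2pq).
    """
    if q is None:
        return p * p
    return 2 * p * q


def random_match_probability(profile: dict[str, tuple[float, ...]]) -> float:
    """Multiply per-locus genotype frequencies (assumes independent loci)."""
    return prod(genotype_frequency(*freqs) for freqs in profile.values())


if __name__ == "__main__":
    # Hypothetical profile: placeholder loci with invented allele frequencies.
    profile = {
        "LOCUS_A": (0.10, 0.20),  # heterozygote
        "LOCUS_B": (0.25,),       # homozygote
    }
    print(random_match_probability(profile))
```

Each additional locus multiplies in another small factor, which is why core sets of many loci give very low match probabilities.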
NCBI
• NCBI - Established in 1988 as a national
resource for molecular biology information,
NCBI creates public databases, conducts
research in computational biology, develops
software tools for analyzing genome data, and
disseminates biomedical information - all for
the better understanding of molecular
processes affecting human health and disease.
www.ncbi.nlm.nih.gov/Genbank
NCBI - databases and services
• sequence databases
GenBank, ESTs, SNPs, etc.
• PubMed - literature database
• Entrez
http://www.ncbi.nlm.nih.gov/entrez/
retrieval system connecting a plethora of
databases, including PubMed, genomes and ontologies
• BLAST - basic local alignment search tool
• Science primer
http://www.ncbi.nlm.nih.gov/About/primer/
introductions into molecular biology and
bioinformatics
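Entrez can also be queried programmatically through NCBI's E-utilities web service. As a minimal sketch, the helper below only builds an ESearch request URL and makes no network call. The function name `esearch_url` and the example query are illustrative assumptions; NCBI's usage guidelines (rate limits, API keys) should be checked before sending requests.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL that searches one Entrez database for a term."""
    query = urlencode({
        "db": db,           # Entrez database name, e.g. "pubmed"
        "term": term,       # search expression
        "retmax": retmax,   # maximum number of IDs to return
        "retmode": "json",  # ask for a JSON response
    })
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


if __name__ == "__main__":
    print(esearch_url("pubmed", "next generation sequencing", retmax=5))
```

The returned identifiers can then be passed to other E-utilities calls to fetch the matching records.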
EMBL
EBI - The European Bioinformatics Institute (EBI) is a non-profit
academic organisation that forms part of the European
Molecular Biology Laboratory (EMBL)
EBI databases
• EMBL nucleotide database - http://www.ebi.ac.uk/embl/
• UniProt (together with Expasy and PIR)
• ArrayExpress - http://www.ebi.ac.uk/arrayexpress/
public repository for microarray data
• Ensembl - http://www.ensembl.org/ - genomes and annotation
for metazoa
www.ebi.ac.uk/embl
DNA Data Bank of Japan
DDBJ (DNA Data Bank of Japan) began DNA data bank
activities in earnest in 1986 at the National Institute of
Genetics (NIG).
DDBJ has been functioning as the international nucleotide
sequence database in collaboration with EBI/EMBL and
NCBI/GenBank.
DNA sequence records organismic evolution more directly
than other biological materials and thus is invaluable not only
for research in life sciences but also human welfare in general.
The databases are, so to speak, a common treasure of human
beings. With this in mind, we make the databases online
accessible to anyone in the world.
www.ddbj.nig.ac.jp
DNA variations – SNPs, CNVs
Gene prediction
Open-source and on-line gene prediction
• Glimmer - bacteria, archaea, viruses
http://cbcb.umd.edu/software/glimmer/
• GlimmerHMM - eukaryotic genes
http://cbcb.umd.edu/software/GlimmerHMM/
• GeneZilla (TIGRscan) - eukaryotic genes
http://www.genezilla.org/
• GenScan - human genes
http://genes.mit.edu/GENSCAN.html
• software lists
http://www.genefinding.org/
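The tools listed above rely on statistical models of coding sequence. As a much simpler illustration of the underlying task, the sketch below scans the three forward reading frames for open reading frames (ATG to the next in-frame stop codon). It is a naive baseline, not the algorithm of any listed program. The function name `find_orfs` and the `min_codons` threshold are assumptions, and the reverse strand is deliberately left out.

```python
from __future__ import annotations

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 30) -> list[tuple[int, int]]:
    """Return (start, end) of forward-strand ORFs, 0-based, end-exclusive.

    An ORF runs from the first ATG in a frame to the next in-frame stop
    codon (stop included in the span). min_codons counts the coding
    codons, start codon included and stop codon excluded.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next start codon in this frame
    return sorted(orfs)


if __name__ == "__main__":
    # Toy sequence (made up): ATG, four GCT codons, then a TAA stop.
    print(find_orfs("CC" + "ATG" + "GCT" * 4 + "TAA" + "G", min_codons=3))
```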
Nucleic structures
RNAs and 3D nucleic structural databases
• 3D structures of nucleic acids
RNABase - http://www.rnabase.org/
NDB nucleic acids database- http://ndbserver.rutgers.edu/
• SCOR - structural classification of RNA - http://scor.berkeley.edu/
RNA motifs, structures and interactions
• other databases
Small RNA database - http://condor.bcm.tmc.edu/smallRNA/
Noncoding RNA database - http://biobases.ibch.poznan.pl/ncRNA/
ExPASy (expert protein analysis system)
• UniProt - the universal protein resource
http://www.expasy.uniprot.org/
- knowledgebase, reference clusters, archives
• Swiss-Prot - http://www.expasy.ch/sprot/
- database of protein sequences together with annotations
- structure and function of proteins
• PROSITE
http://www.expasy.ch/prosite/
- documentation on protein domains, folds, families
Moving from second-generation to
third-generation sequencing
strategies
Advanced Technological Approaches Generates Genomic Data Better, Faster, and Cheaper
In the next-gen sequencing arena, the focus over the past several years has been on technological advances, moving
from second-generation (SGS) to third-generation sequencing strategies (TGS) and producing research
instruments capable of delivering whole-genome sequences in parallel at increasing speed.
More recently, as read lengths and coverage continue to increase, throughputs rise, and costs decline, the
expanding range of applications of NGS has taken center stage.
Assembly problem - 1
[Figure: genome region with coordinates 1-9000 showing a “hypothetical” gene (178 aa)]
THANK YOU!