Moore's law calculates and predicts the pace of improvement of one of the fastest improving
technologies, computers
In the last 15 years the pace of improvement of DNA sequencing technologies has been much faster than that
of computers
Frederick Sanger
Nobel prize in chemistry in 1958 for sequencing insulin (and proteins in general)
Nobel prize in chemistry in 1980 for sequencing nucleic acids
One of only three persons to win two Nobel prizes in science
SANGER SEQUENCING
The most modern Sanger sequencers allow parallelization of up to 96 samples at once
Before sequencing, a PCR and purification step is necessary; if you do not know the sequence in advance, you also need to perform a cloning step
OUTPUT: 1000 bases per run (96000 if you parallelize)
NEXT-GEN SEQUENCING TECHNOLOGIES
• Roche/454 FLX
• Applied Biosystems SOLiD System
• Illumina/Solexa Genome Analyzer
• Ion Torrent
NEXT-GENERATION DNA SEQUENCING
MAIN CHARACTERISTICS
EXTREME MINIATURIZATION
Reactions are carried out in volumes of microliters thanks to specific technological advances
This in turn allows
MASSIVE PARALLELIZATION
Thousands to millions of reactions are performed in parallel, reducing the costs and increasing the output volume by orders of magnitude
NEXT-GEN SEQUENCING TECHNOLOGIES
Some specific aspects of each method are proprietary and therefore not disclosed
In 1977 Sanger made his method public (winning his second Nobel); today every new method is commercialized
SAMPLE PREPARATION
Nebulization of genomic DNA in fragments of 400-1000 base pairs
Ligation of fragments to two adapters (type A and type B)
Selection of single strand fragments with both adapters
EMULSION PCR
Fragments are mixed with agarose beads, 28 microns in diameter, bearing oligos complementary to the adapters
Each bead-fragment complex is isolated in an individual micelle of a water-in-oil emulsion
The emulsion PCR reaction yields about 1 million copies of the amplified fragment on the surface of each bead
SAMPLE LOAD
Each bead is placed in a well of a picotiter slide (7x7 cm fiber optic slide); several million 44 microns diameter wells per slide
Multiple enzymes and reagents are added in the form of even smaller beads
PYROSEQUENCING REACTION
1 single nucleotide species is added each cycle
Nucleotide incorporation → light generation
Rothberg Nat. Biotechnol. 2008
ROCHE/454 FLX Pyrosequencer
1 EMULSION PCR takes the place of thousands of cloning experiments
1 SEQUENCING RUN takes the place of thousands of SANGER sequencing runs
EXTREME MINIATURIZATION
MASSIVE PARALLELIZATION
ROCHE/454 GSFLX+
BASE CALLING ACCURACY: 99.9% or more (lower in the final part of the reads)
OUTPUT: Generates reads up to 1,000 nucleotides long
Generates about 500,000-1,000,000 reads
For a total output of 700 megabases per run (8 hours)
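To put the claim that one run replaces thousands of Sanger runs into numbers, here is a back-of-the-envelope sketch in Python. It only uses figures quoted in these slides (1,000 bases per Sanger read, 96 samples in parallel, 700 megabases per 454 run); the function name is hypothetical and chosen for illustration.

```python
def sanger_runs_equivalent(ngs_bases_per_run, bases_per_read=1000, parallel_samples=96):
    """Number of fully parallelized Sanger runs needed to match one next-gen run."""
    sanger_bases_per_run = bases_per_read * parallel_samples  # 96,000 bases per run
    return ngs_bases_per_run / sanger_bases_per_run


# One 454 GS FLX+ run, using the 700 megabases quoted in the slides
runs = sanger_runs_equivalent(700_000_000)
print(f"One 454 run is worth about {runs:,.0f} parallelized Sanger runs")
```

The result is in the thousands, consistent with the statement that one sequencing run takes the place of thousands of Sanger sequencing runs.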
454 MAIN ISSUE
Homopolymers: stretches of one single nucleotide species
Intrinsic problem of the technology
Multiple identical nucleotides are incorporated in a single cycle
They generate more light, but discrimination becomes increasingly difficult
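The homopolymer problem is easier to see with an idealized "flowgram", the list of light signals recorded flow by flow. The sketch below is a simplified model, not the instrument's base caller: it assumes a noise-free signal exactly proportional to the number of bases incorporated, and a repeating flow order of T, A, C, G (an assumption made for this example). The function name is hypothetical.

```python
def ideal_flowgram(read, flow_order="TACG"):
    """Return (nucleotide, signal) pairs for an idealized pyrosequencing run.

    In each flow a single nucleotide species is offered. The signal is the
    number of consecutive bases incorporated in that flow, 0 if the next
    base in the read is a different one.
    """
    if not set(read) <= set(flow_order):
        raise ValueError("read contains bases that are never flowed")
    signals = []
    pos = 0
    while pos < len(read):
        for base in flow_order:
            run = 0
            # A homopolymer is incorporated entirely within one flow
            while pos < len(read) and read[pos] == base:
                run += 1
                pos += 1
            signals.append((base, run))
            if pos == len(read):
                break
    return signals


for base, signal in ideal_flowgram("TTTAGC"):
    print(base, signal)
```

In this model the three T's produce a single signal of height 3. Going from 1 to 2 bases doubles the light, while going from 7 to 8 adds only about 14%, which is why long homopolymers are hard to call once real noise is present.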
ILLUMINA/SOLEXA Genome Analyzer
Currently the market leader
Very low cost per base, proven technology
Sequencing by synthesis
ILLUMINA/SOLEXA Genome Analyzer
1. DNA fragmentation and ligation to 2 types of adapters
2. Templates are bound on the surface of a flow microcell
3. "Bridge" amplification using primers complementary to the adapters, bound to the substrate at high density → production of clusters of up to 1,000,000 template copies "in situ" that generate a sufficient signal to be detected
ILLUMINA/SOLEXA Genome Analyzer
4. Addition of fluorescent nucleotides blocked at 3'-OH
5. Fluorescence detection
6. Removal of the fluorophore
7. Repeat steps 4-6
HISEQ 4000
- the newest Solexa/Illumina instrument
- total output: 125-1500 Gb
- read length: 150 bp paired ends
- cost per library construction: 500 euros
- sequencing cost per lane: 3000 euros
ILLUMINA/SOLEXA Genome Analyzer
• Four different fluorophores → no issues with homopolymers
• Shorter reads
Blocking the incorporation of multiple nucleotides is one of the foundations of the Illumina method
In each cycle the blocking is imperfect: a small percentage of the copies in a cluster incorporate two nucleotides, giving noise instead of a clean signal
When this percentage reaches a threshold, the signal is lost
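The link between imperfect blocking and read length can be sketched with a very simple model. Assume that in every cycle a fixed fraction of the copies still in phase slips ahead by one base and never recovers; the in-phase fraction after n cycles is then (1 - slip_rate)^n. The slip rate of 0.5% and the 50% threshold below are hypothetical values chosen for illustration, not instrument specifications.

```python
import math


def in_phase_fraction(cycle, slip_rate):
    """Fraction of copies in a cluster still in phase after `cycle` cycles."""
    return (1.0 - slip_rate) ** cycle


def max_usable_cycles(slip_rate, min_in_phase):
    """Last cycle at which the in-phase fraction is still >= min_in_phase."""
    return math.floor(math.log(min_in_phase) / math.log(1.0 - slip_rate))


slip_rate = 0.005      # hypothetical: 0.5% of copies slip per cycle
threshold = 0.5        # hypothetical: signal considered lost below 50% in phase
print(max_usable_cycles(slip_rate, threshold))
for cycle in (50, 100, 150):
    print(cycle, round(in_phase_fraction(cycle, slip_rate), 3))
```

Even a small per-cycle error compounds geometrically, which is why this kind of chemistry has a practical ceiling on read length.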
The smallest sequencer, fast and economical
An instrument: $50,000
A run: $1,000
Output: up to 80 Mb of reads, up to 400 bp long
Very quick, a run lasts for 3 hours
ION TORRENT
In many respects similar to 454
DNA is amplified on microbeads and inserted into wells
Then subjected to cycles of incorporation of a single type of nucleotide
ION TORRENT
The sequencing is performed on a semiconductor chip, which detects the release of protons
Potential for rapid technological development, taking advantage of the electronics industry
It does not detect light but the H+ ions released during sequencing: like a camera chip that detects protons instead of photons
All nucleotides release H+, so cycles of incorporations of individual types of nucleotides are required (A, C, G, T)
ION TORRENT
Same issue as 454: homopolymers
THIRD GENERATION SEQUENCING TECHNOLOGIES
REAL TIME SEQUENCING
The idea is to bypass the amplification step
Advantage: this avoids DNA fragmentation and gives longer reads
Pacific Biosciences PACBIO
Launched in 2009 (third-generation?)
Real-Time sequencing technology
The idea is to directly observe DNA polymerization while it is performed by the DNA polymerase
Single Molecule Real Time (SMRT) sequencing
Recently the third machine was released: PACBIO SEQUEL
cost around 800,000 dollars
Zero-mode waveguide (ZMW)
Highly sensitive detection system
Nanophotonic structure with wells 50 nm in diameter
Same principle as microwave oven doors
A laser illuminates from below, but its wavelength is too large for the light to propagate through the well
Zero-mode waveguide (ZMW)
The light penetrates 20-30 nm
This makes it possible to detect only what happens at the bottom of the well, reducing background noise and giving high sensitivity and temporal resolution
The latest PacBio instrument has around 1,000,000 wells
Polymerase
phi-29 phage polymerase
Highly processive, up to 70,000 nt
High fidelity, up to 100 times higher than Taq polymerase
Modified to be slower
The polymerase is linked to the bottom of the wells
Only 1/3 of the wells have a single polymerase, and thus can perform the sequencing
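The slides do not say why only about one third of the wells are usable. One common way to rationalize a figure of this size, offered here as an assumption rather than as the manufacturer's explanation, is that polymerases land in wells at random, so the number per well follows a Poisson distribution. The sketch below computes the probability of exactly one polymerase per well for a few hypothetical average loadings.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k events when the average number of events is `mean`."""
    return mean ** k * math.exp(-mean) / math.factorial(k)


# Hypothetical average numbers of polymerases per well
for mean in (0.5, 1.0, 2.0):
    empty = poisson_pmf(0, mean)
    single = poisson_pmf(1, mean)
    multiple = 1.0 - empty - single
    print(f"mean={mean}: empty={empty:.2f} single={single:.2f} multiple={multiple:.2f}")
```

Under this model the single-occupancy fraction peaks at an average of one polymerase per well, where it equals 1/e, a little over one third. Loading more polymerase only increases the share of wells with several enzymes.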
PacBio sequencing
Addition of single strand DNA that binds to the polymerase
Addition of the 4 nucleotide species, tagged with 4 different fluorophores
The nucleotide is incorporated and the fluorophore is cleaved off
The free fluorophore generates a flash of light, which is detected by a fluorescence microscope
Characteristics
Third generation sequencing
→ A new revolution, especially for bioinformatics
The sequencing is continuous, washing is not necessary → much faster
PacBio yields reads several thousand nucleotides long (up to 20,000)
PacBio ISSUES
Current issues are:
The cost (10x more expensive than Illumina)
The read quality: single molecule sequencing means every mistake is recorded and cannot be cancelled out by the presence of thousands of parallel reactions
However, these errors are random and can be overcome
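Because the errors are random, they rarely hit the same position in independent reads of the same region, so a simple vote across reads removes most of them. The sketch below is a toy illustration of that idea, assuming the reads are already aligned, have equal length, and contain only substitution errors. Real long-read data also contain insertions and deletions and need a proper alignment step first. The function name and example reads are made up for illustration.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority vote per position over equal-length, already aligned reads."""
    if not aligned_reads:
        raise ValueError("at least one read is required")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all reads must have the same length")
    # zip(*reads) walks the reads column by column
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )


reads = [
    "ACGTTGCA",
    "ACGATGCA",  # error at position 3
    "ACGTTGGA",  # error at position 6
    "TCGTTGCA",  # error at position 0
    "ACGTTGCA",
]
print(consensus(reads))
```

Each read carries at most one error, always at a different position, so every column still has a clear majority and the consensus recovers the underlying sequence.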
A revolution from the point of view of downstream analysis
http://flxlexblog.wordpress.com/2013/10/01/developments-in-next-generation-sequencing-october-2013-edition/
NEXT-GEN IS TRENDY
It is the new thing
It is powerful and cheap
It has uses in any biological system (From viruses to human genetics)
It is useful to answer a number of questions (De novo, mapping, transcriptomics)
NEXT-GEN IS TRENDY
So everyone wants to use it
you just extract your DNA/RNA and send it to a sequencing company
And then, who will do the analysis?
NEXT-GEN WORKFLOW
1. What is the goal?
2. Choose the right experimental setup
3. Choose the right sequencing technology
4. Data Analysis
What is your goal?
NO WAY BACK!
What exactly is the problem you want to address?
Evaluate approaches used in the past
Consider new approaches
Consider future problems
CHOOSE THE RIGHT EXPERIMENTAL SETUP
Nucleic acid quantity
Nucleic acid quality
Technical replicates
Biological replicates
Negative and/or positive controls
CHOOSE THE RIGHT TECHNOLOGY
de novo sequencing: 454, PacBio
Draft sequencing: Illumina, Ion Torrent
Microbial communities: 454, Illumina
Transcriptomics: Illumina, Ion Torrent
DATA ANALYSIS
A basic next-gen experiment generates gigabytes of information
This is HIGH-THROUGHPUT!
HIGH-THROUGHPUT TECHNOLOGIES
Technologies that generate too much data to be handled without computer assistance
EXAMPLES
Shotgun proteomics
Network analysis
BIOINFORMATICS
Bioinformatics is the development and use of computer methods for the analysis of biological data
Bioinformatics becomes absolutely necessary with the increase of data load
SO WHAT IS UNIX?
Unix is a family of multitasking, multiuser computer operating systems that derive from the original AT&T Unix, developed in the 1970s at the Bell Labs research center by Ken Thompson, Dennis Ritchie, and others.
UNIX Advantages
Full multitasking with protected memory
Very efficient virtual memory
Access controls and security
A rich set of small commands that do specific tasks well
Ability to combine commands to accomplish complicated tasks
A powerfully unified file system
Available on a wide variety of machines
Optimized for program development
UNIX Disadvantages
The command line interface is user hostile
Commands often have cryptic names and give very little response to tell the user what they are doing
To use Unix well, you need to understand some of the main design features
Richness of utilities (over 400 standard ones) often overwhelms novices
Documentation often feels underwhelming and poor in examples
Expensive
UNIX → LINUX
Linux is a UNIX-like family of Operating Systems (OSs)
Each "member" of the family has different characteristics and comes with different software and graphical environments
Broadly, each distribution (a.k.a. distro) is "tuned" for a specific task, aimed at a specific kind of user, or designed for a specific kind of device
Most Unix advantages, plus it is FREE and User-friendly
Linux Distros
for beginners:
Mint and Ubuntu, #1 and #2 most popular distributions
for a specific task:
e.g. BioLinux (bioinformatics), Scientific Linux (science in general) and Ubuntu Studio (multimedia)
for a specific platform:
e.g. Mythbuntu (home theater PCs), Yellow Dog Linux (Apple machines), OpenWrt (routers)
LINUX FOR BIOINFORMATICS
It requires more work than other operating systems
Why Linux?
Free and runs on most hardware
fully customizable
more efficient and stable
Why Linux for bioinformatics?
Supports multiple users in a controlled manner
Optimized for writing and executing scripts/commands
Features for handling massive amounts of files
Adopted by the scientific community
LINUX – OPEN SOURCE
Why Linux? free and open software
Open-source software (OSS) is computer software with its source code made available with a license in which the copyright holder provides the rights to study, change, and distribute the software to anyone and for any purpose
Open-source software may be developed in a collaborative public manner
LINUX
Linux servers are widely used, for example by Microsoft and Apple
Why Linux? more efficient and stable
As a bioinformatician, if you want to interact with your server quickly and well, you may find it easier if you use the same language
Is Linux the only way to do bioinformatics?
ABSOLUTELY NOT
However, its characteristics make it optimal for most bioinformatics tasks
Supports multiple users in a controlled manner
Optimized for writing and executing scripts/commands
Features for handling massive amounts of files
Adopted by the scientific community
Many bioinformaticians use a Mac laptop to interact with a Linux server (Mac OS X is Unix-based)