Copyright © 2004 Synamatix sdn bhd (538481-U)
Applications of a Novel Structured Pattern Database Technology for Analysis of Data from Second Generation Sequencers
Copyright © 2007 Synamatix Sdn. Bhd. (538481-U)
Synamatix Introductions
Dr. Arif Anwar – General Manager
14 yrs+ post-Ph.D. California and UK genomics background
B.Sc. (hons.) Genetics, U. of London
Ph.D. Genetics, UCL, U. of London and U. of Oxford
Life and Death
Genomics 2007-2020
Skilled people
Biotechnology
Genome centres
Drug discovery
Personalised Drugs
Integrated genomics healthcare
Foods and livestock
Medical
Nutraceuticals
Cosmeceuticals
2nd Gen. DNA sequencers
Bio-security
Personalised medicine
Ultimate aim is predictability
Genetic testing now active
80% of healthcare costs are at chronic level
[Figure: cost (not just $) against disease progression, across the predictive, diagnostic and chronic stages]
Personalised medicine
Much better and easier to treat “wellness” than “sickness”
[Figure: reversibility (%), from 0 to 100, against disease progression with age (years), from 30 to 60, across the predictive, diagnostic and chronic stages]
Where is the science today?
DNA sequencing, time and $ crashing
3730x: 2.76 MB/run
FLX: 80 MB/run
1 G: 1000 MB/run
Parallel revolution required
Cost and speed of DNA sequencing
Cost and speed of data analysis
Synamatix R & D
Synamatix solutions built on SynaBASE platform
Command line interface
CORE Database platform
SynaRex Bulk
SynaProbe Bulk
SynaSearch Bulk
SynaMer
SXoligosearch
SXSequenceRefs
SXLRESearch
SXParse
Tool development & data analysis
Another 20+ apps
www.MGRC.com.my
Synamatix approach for next-gen sequence data
454 reads
Illumina reads
Sanger reads
SOLiD
Helicos
Others
SynaSearch Bulk
SXoligosearch
SynaMer
Another 20+ apps
Bioinformatics Presentation
Mining
Pre-dispositions
Diagnostics
Therapeutics
Nested GUI
Mapping and Analysis Viewer
CORE Database platform
Reference Genomes
Strategy for fast mapping of 454 reads
1st pass, high-speed searching: Run SynaSearch to query against SynaBASE of Human Genome using high stringency settings
2nd pass, increased sensitivity searching: Using lower stringency parameters, sensitive searches were conducted to find divergent sequences
3rd pass, repeats searching: Remaining sequences suspected to be repeats searched using long pattern seeds
More than 3 billion bp mapped in 6 hrs
Approx 200 fold faster than BLAST and MegaBLAST
Utilises 1 CPU
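A minimal Python sketch of the three-pass idea on this slide, not SynaSearch itself: each pass runs only on the reads that earlier passes left unplaced. The seed lengths, mismatch limits, function names and toy reference are all assumptions made for illustration. The slide gives no parameter values, and a plain k-mer dictionary stands in for SynaBASE.

```python
from collections import defaultdict

def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, reference, index, k, max_mismatches):
    """Sorted start positions where the read aligns ungapped within max_mismatches."""
    hits = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            if sum(a != b for a, b in zip(read, window)) <= max_mismatches:
                hits.add(start)
    return sorted(hits)

def three_pass_map(reads, reference, passes):
    """Run each pass only on reads that earlier passes left unplaced."""
    placed, remaining = {}, list(reads)
    for name, k, max_mismatches, allow_multiple in passes:
        index = build_index(reference, k)
        still_unplaced = []
        for read in remaining:
            hits = map_read(read, reference, index, k, max_mismatches)
            if len(hits) == 1 or (hits and allow_multiple):
                placed[read] = (name, hits)
            else:
                still_unplaced.append(read)
        remaining = still_unplaced
    return placed, remaining

# Toy reference: three unique 12-mers separated by two copies of one repeat.
REPEAT = "GATTACAGATTC"
reference = "ACGTTGCAAGGC" + REPEAT + "CCTAGGTCATGC" + REPEAT + "TGGACCATCGAA"

# (pass name, seed length, allowed mismatches, accept multiple placements)
passes = [
    ("high-stringency", 8, 0, False),   # long seed, exact match, unique hit only
    ("sensitive", 4, 2, False),         # short seed, tolerate divergence
    ("repeats", 10, 0, True),           # long seed, report every placement
]

reads = ["ACGTTGCAAGGC", "CCTAGATCATGC", REPEAT, "TTTTTTTTTTTT"]
placed, remaining = three_pass_map(reads, reference, passes)
for read, (pass_name, positions) in placed.items():
    print(read, pass_name, positions)
print("unplaced:", remaining)
```

The ordering is the point of the design: the cheapest, strictest search handles the bulk of the reads, so the slower sensitive search and the repeat search only ever see the leftovers.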
Faster than MegaBLAST
[Figure: speed of SynaSearch, SynaSearch with a seed size (mml) of 28, and MegaBLAST in mapping 20,000 454 reads to the Human Genome (NCBI36).]
Higher sensitivity than MegaBLAST
[Figure: percent coverage of 20,000 454 reads against the Human Genome (NCBI36) with SynaSearch, SynaSearch with a seed size (mml) of 28, and MegaBLAST.]
Existing approach for Illumina reads
Can only handle 2 errors in the read
Performs poorly if length is above 30
Insertions and deletions cause algorithm to crash
[Figure: accuracy against read length, with 27 marked on the length axis]
Synamatix application for Illumina data
Uses a weighted profile search
Can handle gaps, insertions and deletions
No size limit
Leverages the Solexa PRB file
[Figure: accuracy against read length, with 27 marked on the length axis]
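A hedged sketch of what scoring a read as a weighted profile, rather than as a string of hard base calls, can look like. It assumes each read position carries four per-base scores in A, C, G, T order, which is how we understand the Solexa PRB file to be laid out. The function names and toy numbers are illustrative, and the sketch is ungapped, so it does not show the gap handling listed on this slide.

```python
BASES = "ACGT"

def profile_score(profile, window):
    """Sum, over positions, the score the profile gives to the reference base."""
    return sum(row[BASES.index(base)] for row, base in zip(profile, window))

def best_placements(profile, reference):
    """Best ungapped score and the sorted start positions that achieve it."""
    n = len(profile)
    best, starts = None, []
    for start in range(len(reference) - n + 1):
        score = profile_score(profile, reference[start:start + n])
        if best is None or score > best:
            best, starts = score, [start]
        elif score == best:
            starts.append(start)
    return best, starts

# Toy read: the hard call would be "ACGT", but the third cycle is nearly
# a coin toss between G (score 2) and T (score 1).
profile = [
    [40, -40, -40, -40],   # confident A
    [-40, 40, -40, -40],   # confident C
    [-40, -40, 2, 1],      # weak G, T almost as likely
    [-40, -40, -40, 40],   # confident T
]
reference = "GGACTTCC"

print("ACGT" in reference)                  # the hard call has no exact match
print(best_placements(profile, reference))  # the profile still finds ACTT
```

Because a disagreement at a low-confidence cycle costs almost nothing, the read is placed at the position an exact-match search on the called bases would miss.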
Free on-line version
Increased sensitivity
Indels are important
DB                              Indels   Substitutions
Homo sapiens
Human Gene mutation database    30%      70%
Overlapping BACs                21%      79%
Chromosome 22                   18%      82%
Distribution of gaps
An example of a read missed by ELAND
Using quality scores
Longer reads give higher specificity
Main benefits of SXOligoSearch
Hundreds of times faster on Eukaryote-sized genomes
More reads aligned to unique locations
Gapped alignments
Allows for more mismatches per read
Reporting of alignments to repeats improves read density analysis and identification of large deletion polymorphisms
No read length limit; most suitable for oligonucleotides < 60bp.
“Point of Care” Personal Genomes
SynaBASE uses a single CPU in a single integrated platform
Software solutions start from $483.00 per Gbp of sequence generated
No specialised HW or algorithm specific accelerators
Savings up to $220,000.00 per year:
  Fewer consumables
  Other running costs
Long v Short n-mers: advantages and disadvantages
100 mer
+ve
Fewer false positives
Improvement in final assembly
-ve
Errors in reads may lead to false negatives
Slow to process with conventional software
Overlapper for assembly pre-processing
Original user data set and requirement was:
To find all overlapping exact 100-mers in 50 million 1 kb sequencing reads, i.e. 50 billion bp
Report n-mers that have a frequency >2 and <m
Using conventional software and approaches, the user took 500 hrs and 1.5 TB of disc space to find all 100-mer overlaps
Hence the standard approach limits usage to 32-mers
Longer mers help bridge repetitive regions
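To make the requirement on this slide concrete, here is a small Python sketch of the counting definition only: tally every exact n-mer across the reads and report those whose frequency is >2 and <m. The function name and toy reads are assumptions. Holding every n-mer in memory like this is the conventional approach, so it does not scale to 50 billion bp and says nothing about how SynaMer works.

```python
from collections import Counter

def repeated_nmers(reads, n, m):
    """Return {n-mer: count} for exact n-mers whose count c satisfies 2 < c < m."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - n + 1):
            counts[read[i:i + n]] += 1
    return {nmer: c for nmer, c in counts.items() if 2 < c < m}

# Toy data: three reads share the 4-mer ACGT, the fourth shares nothing.
reads = ["AAACGT", "TACGTA", "GACGTT", "CCCCGG"]
print(repeated_nmers(reads, n=4, m=10))

# Size of the original data set: 50 million reads of 1 kb each.
total_bp = 50_000_000 * 1_000
print(total_bp)
```

The upper bound m is what keeps very high-copy repeats out of the report, while the lower bound drops n-mers seen only once or twice.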
Longer n-mer size leads to better assembly
[Figure with labels A and B: overlaps across a low-complexity region]
A shorter overlap results in more false positives
A longer overlap results in fewer false positives
Final assembly improved
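The effect of overlap length on false positives can be illustrated with a toy Python sketch. It uses a plain suffix/prefix comparison, which is an assumption chosen for clarity and not the Synamatix overlapper; the reads and function names are invented. Two reads genuinely overlap, while a third read from elsewhere merely starts with the same low-complexity stretch (ATATAT).

```python
def overlap_length(a, b, min_len):
    """Longest suffix of a that equals a prefix of b, or 0 if below min_len."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0

def overlaps(reads, min_len):
    """Ordered pairs (i, j), i != j, with a suffix/prefix overlap >= min_len."""
    return [(i, j)
            for i, a in enumerate(reads)
            for j, b in enumerate(reads)
            if i != j and overlap_length(a, b, min_len) > 0]

reads = [
    "GGCTAACGTCATATAT",   # read 0: ends inside the low-complexity stretch
    "CGTCATATATCCGGAA",   # read 1: true neighbour of read 0 (10 bp overlap)
    "ATATATTTGCAGCA",     # read 2: different locus, same low-complexity start
]

print(overlaps(reads, min_len=4))   # short overlaps also pair read 0 with read 2
print(overlaps(reads, min_len=8))   # longer overlaps keep only the true pair
```

Raising the minimum overlap removes the spurious pairing because the match has to extend beyond the low-complexity stretch into unique sequence.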
Using SynaMer there is no time increase with longer n-mers
[Figure: Time vs n-mer (m 2 to 50); x-axis: n-mer, 0 to 140; y-axis: Time, S, 0 to 25]
Summary of SynaMer
30 million 1 kb reads took 2+3 hours on a dual-CPU Itanium machine, with temporary file size less than 200 GB
100 fold faster than conventional “overlappers”
Allows use of longer n-mers
Potentially increases quality of assembly
Sanger read mapping
Aims:
Mapping of whole genome shotgun reads from a mammalian genome to the Human Genome, to facilitate genome assembly using Synamatix and public tools.
Compare sensitivity, specificity and performance advantages of Synamatix technologies.
Results:
In comparison to BLASTz, SynaSearch:
Is 219 fold faster
Finds 11% more true positives
Finds 17% more unique hits to queries
Has a higher specificity:
  113% fewer false positives
  fewer multiple placements per read (2.7 v 5.3)
Benefits:
Enables significant enhancements in workflow throughput.
SynaSearch requires only 1 search process, whereas BLASTz requires the genome to be separated into 5MB chunks and apportioned across multiple processors.
Results in better assemblies of new genomes
ROI
SynaBASE uses a single CPU
SynaBASE is a single integrated platform
No specialised HW or algorithm specific accelerators
Extra coverage equivalent to consumable savings:
  Illumina: 12%
  454: 17%
  Sanger: 11%
Summary
2nd generation sequencing technology is causing the cost and time of genome sequencing to tumble
Synamatix is ready TODAY to handle genome assembly and differentiation analysis of all types of reads with:
Higher performance
Increased sensitivity
More flexibility
454 reads
Solexa reads
Sanger reads
SOLiD
Helicos
Others
Acknowledgements
Karim Hercus - MD
Colin Hercus - CTO
Poh Yang Ming - Bioinformatics
Zayed Albertyn - Bioinformatics
Ali Reza - Bioinformatics
Elaine Mardis
Jarret Glasscock
Granger Sutton