Copyright © 2004 Synamatix sdn bhd (538481-U)

Applications of a Novel Structured Pattern Database Technology for Analysis of Data from Second Generation Sequencers

Copyright © 2007 Synamatix Sdn. Bhd. (538481-U)

Synamatix Introductions

Dr. Arif Anwar - General Manager

14 yrs+ post-Ph.D. California and UK genomics background

B.Sc. (hons.) Genetics, U. of London

Ph.D. Genetics, UCL, U. of London and U. of Oxford

Life and Death

Genomics 2007-2020

Skilled people

Biotechnology

Genome centres

Drug discovery Personalised Drugs

Integrated genomics healthcare

Foods and livestock

Medical

Nutraceuticals

Cosmeceuticals

2nd Gen. DNA sequencers

Bio-security

Personalised medicine

Ultimate aim is predictability

Genetic testing now active

80% of healthcare costs are at chronic level

Figure: cost (not just $) against disease progression, with stages labelled Predictive, Diagnostic and Chronic.

Personalised medicine

Much better and easier to treat “wellness” than “sickness”

Figure: reversibility (%, 0 to 100) against disease progression with age (30 to 60 years), with stages labelled Predictive, Diagnostic and Chronic.

Where is the science today?

DNA sequencing, time and $ crashing

3730x: 2.76 MB/run

FLX: 80 MB/run

1 G: 1000 MB/run

Parallel revolution required

Cost and speed of DNA sequencing

Cost and speed of data analysis

Synamatix R & D

Command line interface

CORE Database platform

SynaRex Bulk

SynaProbe Bulk

SynaSearch Bulk

SynaMer

SXoligosearch

SXSequenceRefs

SXLRESearch

SXParse

Tool development & data analysis

Another 20+ apps

www.MGRC.com.my

Synamatix solutions built on SynaBASE platform

Synamatix approach for next-gen sequence data

454 reads

Illumina reads

Sanger reads

SOLiD

Helicos

Others

SynaSearch Bulk

SXoligosearch

SynaMer

Another 20+ apps

Bioinformatics Presentation

Mining

Pre-dispositions

Diagnostics

Therapeutics

Nested GUI

Mapping and Analysis Viewer

CORE Database platform

Reference Genomes

Strategy for fast mapping of 454 reads

1st pass: High-speed searching. Run SynaSearch to query against SynaBASE of Human Genome using high stringency settings

2nd pass: Increased sensitivity searching. Using lower stringency parameters, sensitive searches were conducted to find divergent sequences

3rd pass: Repeats searching. Remaining sequences suspected to be repeats searched using long pattern seeds

More than 3 billion bp mapped in 6 hrs

Approx. 200 fold faster than BLAST and MegaBLAST

Utilises 1 CPU
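The three-pass idea (strict first, relaxed second, repeats last) can be illustrated with a toy sketch. Everything here is an assumption for illustration: the brute-force `find_hits` scan stands in for an indexed search, the pass settings are invented, and the repeats pass simply accepts multiple placements rather than using long pattern seeds. It is not SynaSearch.

```python
from typing import Dict, List, Sequence, Tuple

Hit = Tuple[int, int]  # (0-based reference position, mismatch count)
Pass = Tuple[str, int, bool]  # (pass name, max mismatches, accept multiple hits)


def find_hits(ref: str, read: str, max_mismatches: int) -> List[Hit]:
    """Brute-force ungapped scan; a hypothetical stand-in for an indexed search."""
    hits: List[Hit] = []
    for pos in range(len(ref) - len(read) + 1):
        window = ref[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


def map_in_passes(
    ref: str, reads: Sequence[str], passes: Sequence[Pass]
) -> Tuple[Dict[str, Tuple[str, List[Hit]]], List[str]]:
    """Run each pass only on the reads that earlier passes failed to place."""
    mapped: Dict[str, Tuple[str, List[Hit]]] = {}
    remaining = list(reads)
    for name, max_mismatches, allow_multiple in passes:
        unplaced = []
        for read in remaining:
            hits = find_hits(ref, read, max_mismatches)
            if hits and (allow_multiple or len(hits) == 1):
                mapped[read] = (name, hits)
            else:
                unplaced.append(read)
        remaining = unplaced
    return mapped, remaining


# Illustrative settings only: strict and unique, relaxed and unique, then repeats.
PASSES: List[Pass] = [
    ("high stringency", 0, False),
    ("increased sensitivity", 1, False),
    ("repeats", 0, True),
]
```

Because each pass sees only the leftovers of the previous one, the expensive relaxed search runs on a small fraction of the reads.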

Faster than MegaBLAST

Figure: speed of SynaSearch, SynaSearch with a seed size (mml) of 28, and MegaBLAST in mapping 20,000 454 reads to the Human Genome (NCBI36).

Higher sensitivity than MegaBLAST

Figure: percent coverage of 20,000 454 reads against the Human Genome (NCBI36) with SynaSearch, SynaSearch with a seed size (mml) of 28, and MegaBLAST.

Existing approach for Illumina reads

Can only handle 2 errors in the read

Performs poorly if length is above 30

Insertions and deletions cause algorithm to crash

Figure: accuracy against read length, with length 27 marked.

Synamatix application for Illumina data

Uses a weighted profile search

Can handle gaps, insertions and deletions

No size limit

Leverages the Solexa PRB file

Figure: accuracy against read length, with length 27 marked.
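The slide names a weighted profile search that uses the Solexa PRB file but gives no detail. As a rough illustration only, the sketch below assumes a PRB-style profile in which every cycle carries four scores in A, C, G, T order, and scores a reference window by summing the score of the reference base at each cycle. The function names and the ungapped scan are hypothetical and are not the Synamatix implementation.

```python
from typing import List, Optional, Sequence, Tuple

BASES = "ACGT"  # assumed column order of the four per-cycle scores
Cycle = Tuple[int, int, int, int]


def call_bases(profile: Sequence[Cycle]) -> str:
    """Collapse the profile to a plain read by taking the top-scoring base per cycle."""
    return "".join(BASES[max(range(4), key=lambda i: cycle[i])] for cycle in profile)


def profile_score(profile: Sequence[Cycle], window: str) -> int:
    """Sum, over all cycles, the score the profile gives to the reference base."""
    return sum(cycle[BASES.index(base)] for cycle, base in zip(profile, window))


def best_placement(profile: Sequence[Cycle], ref: str) -> Optional[Tuple[int, int]]:
    """Return (position, score) of the best ungapped placement, or None if ref is too short."""
    n = len(profile)
    best: Optional[Tuple[int, int]] = None
    for pos in range(len(ref) - n + 1):
        score = profile_score(profile, ref[pos:pos + n])
        if best is None or score > best[1]:
            best = (pos, score)
    return best


# Third cycle is ambiguous: G scores 5, T scores 3.
profile: List[Cycle] = [
    (40, -40, -40, -40),
    (-40, 40, -40, -40),
    (-40, -40, 5, 3),
    (-40, -40, -40, 40),
]
placement = best_placement(profile, "TTACTTGG")
```

In this toy case the called read is ACGT, which differs from the reference window ACTT at the ambiguous cycle, yet the profile still scores that window highest because the low-confidence call costs little. The sketch only handles reference bases A, C, G and T.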

Free on-line version

Increased sensitivity

Indels are important

Indels are important

DB                              Indels   Substitutions
Homo Sapiens
Human Gene mutation database    30%      70%
Overlapping BACs                21%      79%
Chromosome 22                   18%      82%

Distribution of gaps

An example of a read missed by ELAND

Using quality scores

Using quality scores

Longer reads give higher specificity

Longer reads give higher specificity
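The slide's figures are not recoverable, but the general effect can be shown on a toy sequence: the longer the read, the more likely its sequence occurs only once in the reference. The function name and the example reference below are illustrative assumptions, not data from the talk.

```python
from collections import Counter


def unique_fraction(ref: str, k: int) -> float:
    """Fraction of k-long windows in ref whose sequence occurs exactly once."""
    if k < 1 or k > len(ref):
        raise ValueError("k must be between 1 and len(ref)")
    kmers = [ref[i:i + k] for i in range(len(ref) - k + 1)]
    counts = Counter(kmers)
    return sum(1 for kmer in kmers if counts[kmer] == 1) / len(kmers)


toy_ref = "ACGTACGAAT"  # contains the 3-mer ACG twice
short_reads = unique_fraction(toy_ref, 3)
long_reads = unique_fraction(toy_ref, 5)
```

On this toy reference, 3-mers are unique at 6 of 8 positions, while every 5-mer is unique.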

Main benefits of SXOligoSearch

Hundreds of times faster on Eukaryote-sized genomes

More reads aligned to unique locations

Gapped alignments

Allows for more mismatches per read

Reporting of alignments to repeats improves read density analysis and identification of large deletion polymorphisms

No read length limit; most suitable for oligonucleotides < 60bp.

“Point of Care” Personal Genomes

SynaBASE uses a single CPU in a single integrated platform

Software solutions start from $483.00 per Gbp of sequence generated

No specialised HW or algorithm specific accelerators

Savings up to $220,000.00 per year

Fewer consumables

Other running costs

Long v Short n-mers: advantages and disadvantages

100 mer

+ve

Fewer false positives

Improvement in final assembly

-ve

Errors in reads may lead to false negatives

Slow to process with conventional software

Overlapper for assembly pre-processing

Original user data set and requirement was:

To find all overlapping exact 100-mers in 50 million 1 kb sequencing reads, i.e. 50 billion bp

Report n-mers that have a frequency >2 and <m

Using conventional software and approaches, the user took 500 hrs and 1.5 TB of disc space to find all 100-mer overlaps

Hence the standard approach limits usage to 32-mers

Longer mers help bridge repetitive regions
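The reporting requirement above (n-mers whose frequency lies strictly between two bounds) can be sketched with an in-memory counter. This is a toy illustration with hypothetical function names; a dictionary of every n-mer would not fit in memory for 50 billion bp, and it says nothing about how SynaMer works.

```python
from collections import Counter
from typing import Dict, Iterable


def nmer_counts(reads: Iterable[str], n: int) -> Counter:
    """Count every exact n-mer across all reads (overlapping windows)."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - n + 1):
            counts[read[i:i + n]] += 1
    return counts


def nmers_in_range(reads: Iterable[str], n: int, low: int, high: int) -> Dict[str, int]:
    """Return n-mers whose frequency f satisfies low < f < high."""
    return {
        nmer: count
        for nmer, count in nmer_counts(reads, n).items()
        if low < count < high
    }


toy_reads = ["ACGTAC", "CGTACG", "GTACGT"]
shared = nmers_in_range(toy_reads, 4, 2, 10)  # frequency > 2 and < 10
```

With these three toy reads only GTAC occurs more than twice, so it is the only 4-mer reported.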

Longer n-mer size leads to better assembly

Figure (panels A and B): overlaps across a low-complexity region.

A shorter overlap results in more false positives

A longer overlap results in fewer false positives

Final assembly improved

Using SynaMer there is no time increase with longer n-mers

Figure: Time vs n-mer (m 2 to 50). x-axis: n-mer (0 to 140); y-axis: time, s (0 to 25).

Summary of SynaMer

30 million 1 kb reads took 2+3 hours on a dual-CPU Itanium machine, with temporary file size less than 200 GB

100 fold faster than conventional “overlappers”

Allows use of longer n-mers

Potentially increases quality of assembly

Sanger read mapping

Aims:

Mapping of whole genome shotgun reads from a mammalian genome to the Human Genome, to facilitate genome assembly using Synamatix and public tools.

Compare sensitivity, specificity and performance advantages of Synamatix technologies.

Results:

In comparison to BLASTz, SynaSearch:

Is 219 fold faster

Finds 11% more true positives

Finds 17% more unique hits to queries

Has a higher specificity:

113% fewer false positives

fewer multiple placements per read: 2.7 v 5.3

Benefits:

Enables significant enhancements in workflow throughput.

SynaSearch requires only 1 search process whereas BLASTz requires the genome to be separated into 5MB chunks and apportioned across multiple processors.

Results in better assemblies of new genomes

ROI

SynaBASE uses a single CPU

SynaBASE is a single integrated platform

No specialised HW or algorithm specific accelerators

Extra coverage equivalent to consumable savings:

Illumina: 12%

454: 17%

Sanger: 11%

Summary

2nd generation sequencing technology is driving down the cost of genome sequencing and driving up its throughput

Synamatix is ready TODAY to handle genome assembly and differentiation analysis of all types of reads with:

Higher performance

Increased sensitivity

More flexibility

454 reads

Solexa reads

Sanger reads

SOLiD

Helicos

Others

Acknowledgements

Karim Hercus - MD

Colin Hercus - CTO

Poh Yang Ming - Bioinformatics

Zayed Albertyn - Bioinformatics

Ali Reza - Bioinformatics

Elaine Mardis

Jarret Glasscock

Granger Sutton