A Genome Sequence Analysis System Built With Hypertable


Doug Judd

CEO, Hypertable, Inc.

Application Development Team

UCSF-Abbott Viral Diagnostics and Discovery Center
  Director: Dr. Charles Chiu, M.D./Ph.D.
  http://vddc.ucsf.edu/

Helices Inc.
  Taylor Sittler, M.D.
  John Dennis
  Brad Miller, M.D.
  http://helic.es/

What is Hypertable?

Modeled after Google’s Bigtable
Open Source (GPL v2)
Horizontally Scalable
High Performance Implementation (C++)
Thrift Interface for all popular languages (Java, PHP, Ruby, Python, Perl, etc.)
NoSQL
  No joins (not yet)
  No transactions (not yet)
Project Started in March 2007

Hypertable Deployments

Why NoSQL?


Source: Nature 458, 719-724 (2009)

Source: wired.com, February 2011

Genomics 101

Base Pair (aka “base”)

Two nucleotides on opposite complementary DNA or RNA strands connected via hydrogen bonds
Double-stranded DNA/RNA is made up of base pairs
  adenine (A) pairs with thymine (T)
  guanine (G) pairs with cytosine (C)
Base-paired DNA sequence:
  ATCGATTGAGCTCTAGCG
  TAGCTAACTCGAGATCGC
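The pairing rules above can be checked mechanically. The following is a minimal sketch in Python; the names `PAIRS` and `complement` are illustrative and not part of the system described in this talk.

```python
# Watson-Crick pairing rules for DNA: A<->T and G<->C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand, in the same order."""
    return "".join(PAIRS[base] for base in strand)


top = "ATCGATTGAGCTCTAGCG"
bottom = complement(top)
print(top)
print(bottom)
```

Applied to the top strand of the example, this yields the bottom strand shown on the slide.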

Gene

Encodes info on how to make a protein
DNA or RNA sequence
Thousands to millions of base pairs long
Corresponds to various different biological traits
Human genome contains about 23,000 genes

Biological Samples

Specimen taken from human or animal
  Nasal Swabs
  Blood Serum
  Diarrheal
  Cerebrospinal fluid
Sent to a sequencing company to process into DNA sequence information in digital format
Each sample will generate anywhere from 1M to 100M “reads”
A read is a short DNA sequence snippet of approximately 100 bases

Example Reads File

GTGGATAGGGGGAGACTAATGTAGTATGATTATCATCATCAACAGAAGCTATGACACCAGGATAAACATTTCTTATTGCTGAAAGTATTCTATTGTAGAGATGTACCACAATTTGGTTTCTGGTTTTGTATTGGGAGGATACTAGGGATTACTGAAGCCAACTTTGCAGACTCATACATTTGACTAGACACAGCCACATTACAGTTTTCTGAGGAAAATTCTTAAGATGTTACCCCAAAACATAGCATTTTAAATTAAAACGGACCGGCTGAAGCCATGGCAGAAGAACATAAATTGTGAAGATTTCATGGGCATTTATTAGTTGGAAGTGATAAGTGTCCATGAAATCTTCACAATTTATGTTCAGAGATTGCAGTAAAGACAGGTGTAAAGACACAGCAAAGCTAAGAGGACCCAACACACGGTAGGGTCGGGGACCTTGGAGAAACATGGTGGCTTCTTCCTACATGCTTGTGATAGATGACCAAAAAACATTTGTTGAGTTGATGAATAGTACAAAAAAGGGGCGGATAATAAATGAAAAGGGAATGTGCTGTTATTTCCTACTAAGATCAGAAAGAGATATAAACAAAAGCTGTCATCACTTAGGGACTTCAGCCACATAAAACAATGTCAGGCTAGTCACTTAGAGCTTTGGGACTAGTTGAGTGGCAGCTTAACAAAGCAACGCAATATCCATAGGGATTGGGGATATTTACATCTAGTGGATTCTACCAGTATGGTGGTCTTATGTGGACTGCACGTGGTTTTCTAGTAAGATAGCAGCTCTTCCCAAATTTATTTATAATTGTGGCATTATTTATAATATCAAAATATTATGTTGCCAAAGGAGATTAACATTTGAGTCAGTGGGCGGGGTAAGGCCGACCTACCCTTAATCTGGTGGAGAAAGAAGCTGCTAATGGAGTTTAAAAGGTTACTGTCATTAATGAAAAATAAATTTACAGCCAGACATTTATGAACAGAAATGGGAAAAACACACTAGGAAAGCACTGCAAAGACTAATCTGTCTTTAAAGGAGATAGAGTGACTCCAGGCCCCTTAGAAATGACTATACCTGGCAGAGCATGCCAACTGATGGGCTCGAGTCCTCACAAATATGAATTCCCCCTAAGTCTTGAGAGGTCATTTGTGCATTTGGAAGGAAGAACATTCCATGCTCATGGGTAGGAAGAATCAATATCGTGAAAATGGTCATACTGCCCAGCGGGGTTTTTTTTTGTTTCATATTAACTTTAAAGTAGTTTTTTTCCATTTTGTGAAGAAAGACATAAAGAACCAAGGCTAATAGTTGTTTGAGTTGTACTTACCATGTTGTTAAATGTCACCTCACACCGCTGCCAGCCTATCAGAGCCGGGAATTACACCGTGCTTGGAGTTCTGGCACAGATCCACAGCTACAGTTCTTCATTGTAAGAAATGGATGCTAACATGTAACAAGAAAACATCTGAAGGTTAAACTCAAATAAATGGGTTAATAGTTTGTCTTTCGGTCTTCATACTTTCAATATAAGTGGTTTACTTAGCCGA

Sequence Alignment

Arranging the sequences of DNA or RNA to identify regions of similarity
Fuzzy matching algorithm
Alignment methods
  BLAST - Basic Local Alignment Search Tool
  MegaBLAST
Faster but less accurate alignment methods
  SOAP - Short Oligonucleotide Analysis Package
  BLAT - BLAST-like Alignment Tool
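To make "fuzzy matching" concrete, here is a small sketch of Smith-Waterman local alignment scoring, the classic dynamic-programming formulation of the problem. This is not how BLAST, SOAP, or BLAT are implemented (they use heuristics to go much faster), and the scoring values and the name `local_alignment_score` are assumptions chosen for illustration.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score of a vs. b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # scores for the previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally with a match or a mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("GATTACA", "TTAC"))  # exact substring
print(local_alignment_score("ACGT", "ACCT"))     # one mismatch tolerated
```

A read that differs from a reference by a base or two still scores highly, which is the property the alignment tools above rely on.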

Taxonomy

Hierarchical biological classification
Method to group and categorize organisms by biological type
Basic Ranks
  Kingdom, Phylum/Division, Class, Order, Family, Genus, Species
Downloadable from National Center for Biotechnology Information (NCBI) website
Every node in the taxonomy tree is assigned a unique numeric ID

GenBank

NIH genetic sequence database
380,000 distinct organisms
126,551,501,141 nucleotide bases
135,440,924 sequence records
Most important and most influential database for research in almost all biological fields
Growth rate is exponential
Information on each sequence includes:
  Numeric ID
  Taxonomic information

Schema Design

Taxa Table

Schema

CREATE TABLE Taxa (ID, Type, Children, Name);

Contents

/1              ID           1
/1              ID:fullName  /root
/1              Type         no rank
/1              Children     1,10239,12884,12908,28384,131567
/1              Name         root
/1/10239        ID           10239
/1/10239        ID:fullName  /root/Viruses
/1/10239        Type         superkingdom
/1/10239        Children     12333,12429,12877,29258,35237, …
/1/10239        Name         Viruses
/1/10239/12333  ID           12333
/1/10239/12333  ID:fullName  /root/Viruses/unclassified phages
/1/10239/12333  Type         no rank
/1/10239/12333  Children     12340,12347,12366,12371,12374, …
/1/10239/12333  Name         unclassified phages
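The row keys in the Taxa table are paths of taxonomy IDs from the root down to each node. The sketch below shows why that layout is convenient, assuming rows are kept sorted by row key as in Bigtable-style stores: every descendant of a node shares that node's key as a prefix, so a subtree is one contiguous key range. The helpers `taxa_row_key` and `subtree` are hypothetical, and a plain Python list stands in for the table.

```python
def taxa_row_key(lineage):
    """Build a path-style row key from taxonomy IDs ordered root-first."""
    return "".join(f"/{tax_id}" for tax_id in lineage)


def subtree(sorted_keys, ancestor_key):
    """Return the keys of all descendants of ancestor_key via a prefix match."""
    prefix = ancestor_key + "/"
    return [key for key in sorted_keys if key.startswith(prefix)]


# Keys built from IDs that appear in the Taxa contents above.
keys = sorted([
    taxa_row_key([1]),
    taxa_row_key([1, 10239]),
    taxa_row_key([1, 10239, 12333]),
    taxa_row_key([1, 131567]),
])

print(subtree(keys, "/1/10239"))
```

In a real deployment the prefix match would be a row-range scan rather than a pass over a list.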

Reads Table

Schema

CREATE TABLE Reads (Sequence, Quality, GeneKey, Comments);

Contents

AbCam1_100_ACAGTG,HWI...56#ACAGTG/1  Sequence                      ATCGCACCATTGAACTCCAGTC...
AbCam1_100_ACAGTG,HWI...56#ACAGTG/1  Quality                       eeaeeeed\\e_Ycc]dcacab...
AbCam1_100_ACAGTG,HWI...56#ACAGTG/1  Comments:qualityFilter        11071815...
AbCam1_100_ACAGTG,HWI...56#ACAGTG/1  Sequence                      GGCTTACGCCTGTAATCCCAGC...
AbCam1_100_ACAGTG,HWI...56#ACAGTG/1  Quality                       gfee_cgggegggecggggegc...
AbCam1_100_ACAGTG,HWI...56#ACAGTG/1  GeneKey:gnl|GNOMON|1320663.m  11...
AbCam1_100_ACAGTG,HWI...17#ACAGTG/1  Sequence                      AGGATACGGAAGGCCCAAGGAG...
AbCam1_100_ACAGTG,HWI...17#ACAGTG/1  Quality                       cdd`dffffffgffgggegf^e...
AbCam1_100_ACAGTG,HWI...17#ACAGTG/1  GeneKey:chr10                 110718151643.1308...
AbCam1_100_ACAGTG,HWI...80#ACAGTG/1  Sequence                      ACGGAAGAGCACACGTCTGAAC...
AbCam1_100_ACAGTG,HWI...80#ACAGTG/1  Quality                       cbccb[^W\\Ub]_b`_[bR_]...
AbCam1_100_ACAGTG,HWI...80#ACAGTG/1  Comments:qualityFilter        11071815...
AbCam1_100_ACAGTG,HWI...88#ACAGTG/1  Sequence                      GAACTCCAGTCACACAGTGATC...
AbCam1_100_ACAGTG,HWI...88#ACAGTG/1  Quality                       eeeeeeeeeeeceeeeeaeeTQ...
AbCam1_100_ACAGTG,HWI...88#ACAGTG/1  Comments:qualityFilter        11071815...

Genes Table

Schema

CREATE TABLE Genes (Sequence, TaxID, ID, ReadID);

Contents

1000075  Sequence  GAATTCCATGGCAGTAAAACATCTTCCCTTC…
1000075  TaxID     9606
1000075  ID:name   HSLFBPS6 Human fructose-1,6-biphosphatase
1000075  ReadID:0310.Lane8big,HWI-EAS355:8:91:1231:1315#0/1 …
1000075  ReadID:0908.Mexus2.TATTAT,SCS:1:22:395:324#0/1_TA …
1000075  ReadID:0916.Enceph2,SCS:6:24:1519:513#0/1
1000075  ReadID:0916.Mexus,SCS:1:22:410:248#0/1
1000075  ReadID:0916.MonkeyAdeno,SCS:2:17:811:769#0/1
1000075  ReadID:0916.MonkeyAdeno,SCS:2:21:1132:1067#0/1
1000075  ReadID:0916.MonkeyAdeno,SCS:2:24:1207:492#0/1
1000075  ReadID:0916.MonkeyAdeno,SCS:2:33:1138:547#0/1
1000075  ReadID:0916.Parecho,SCS:3:4:679:1416#0/1|1
1000075  ReadID:HIV.HIV18_Lane7.s_7_sequence.AAA,SCS:7:30:688 …
1000075  ReadID:HIV.HIV18_Lane7.s_7_sequence.AAA,SCS:7:30:688 …
1000075  ReadID:HIV.HIV18_Lane7.s_7_sequence.unbiased,SCS:7:30 …

Monitoring Table Overview

Applications

Novel Virus Discovery

Process for discovering new viral DNA in a biological sample
Algorithm Overview
  Import biological sample read data from sequencing company into system
  Strip out all reads that align to known DNA sequences
  What’s left over is novel

Novel Virus Discovery: Algorithm Detail

Import sample data into Reads table
Run MapReduce program to filter/align reads and update Comments column of Reads table
  Filter out poor quality (“low entropy”) reads
  Align to common human RNA/DNA
  Align to virus database
  Align to GenBank
All reads left in Reads table with no Comments column are novel
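The steps above can be sketched in a few lines of single-process Python. This only illustrates the control flow and is not the MapReduce program itself. A dict stands in for the Reads table, exact substring matching stands in for a real aligner, and the entropy threshold, the `aligned:` comment values, and all function names are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy in bits per base; repetitive reads score near zero."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def annotate(reads, references, min_entropy=1.5):
    """Return {read_id: comment} for reads that were filtered out or aligned.

    reads: {read_id: sequence}
    references: ordered list of (name, reference_sequence) pairs,
                e.g. human first, then virus database, then GenBank.
    """
    comments = {}
    for read_id, seq in reads.items():
        if shannon_entropy(seq) < min_entropy:
            comments[read_id] = "qualityFilter"
            continue
        for name, ref in references:
            if seq in ref:  # stand-in for a real alignment step
                comments[read_id] = f"aligned:{name}"
                break
    return comments


def novel_reads(reads, comments):
    """Reads with no comment were neither filtered nor aligned."""
    return [read_id for read_id in reads if read_id not in comments]


reads = {"r1": "AAAAAAAAAAAA", "r2": "ACGTTGCA", "r3": "GATTACAGGC"}
references = [("human", "TTACGTTGCATT"), ("virus", "CCCCGGGG")]
comments = annotate(reads, references)
print(novel_reads(reads, comments))
```

In this toy input, `r1` is dropped as low entropy, `r2` aligns to the human reference, and `r3` is left over as a novel candidate.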

Pathogen Discovery in Cancer Samples

Accomplished using same technique as novel virus discovery
Matthew Meyerson's Lab @ Broad Institute

Taxonomic Tree Viewer

Display taxonomy breakdown of biological sample
For each aligned read in sample, consult Genes table to determine Taxonomy ID
Populate HitSummary table with taxonomy IDs for all aligned reads from all samples
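A rough sketch of the lookup-and-tally step follows, with plain dicts standing in for the Reads and Genes tables. The layout of the real HitSummary table is not shown in these slides, so the `Counter` result and the name `hit_summary` are assumptions, as is the gene key `g2`.

```python
from collections import Counter


def hit_summary(read_to_gene, gene_to_taxid):
    """Count aligned reads per taxonomy ID.

    read_to_gene: {read_id: gene_id} for reads that aligned to a gene
    gene_to_taxid: {gene_id: taxonomy_id}, as stored in the Genes table
    """
    counts = Counter()
    for gene_id in read_to_gene.values():
        tax_id = gene_to_taxid.get(gene_id)
        if tax_id is not None:  # skip genes with no taxonomy entry
            counts[tax_id] += 1
    return counts


# Gene 1000075 maps to TaxID 9606 in the Genes table contents shown earlier;
# "g2" is a made-up gene mapped to 10239 (Viruses) for illustration.
read_to_gene = {"r1": "1000075", "r2": "1000075", "r3": "g2"}
gene_to_taxid = {"1000075": 9606, "g2": 10239}
summary = hit_summary(read_to_gene, gene_to_taxid)
print(summary)
```

The per-ID counts could then be rolled up the taxonomy tree for display.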

Depletion Array (future)

Align reads to human genome
Determine set of probes - sequences of human genome with the greatest number of alignments
Send probes to Agilent to produce vial of “magnetized” DNA sequences of the probes
Mix vial in with biological sample
Magnetized DNA binds to human DNA which precipitates from solution
Increases viral percentage of sample from ~0.01% - 0.1% to 10%
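The enrichment figures imply that nearly all human DNA must be removed. The short calculation below works this out under two simplifying assumptions that are not stated in the talk: the sample contains only viral and human material, and depletion removes no viral material. The function name is illustrative.

```python
def non_viral_fraction_to_remove(viral_before, viral_after):
    """Fraction of non-viral material to remove to raise the viral share.

    Both arguments are viral fractions of the whole sample (0.0001 == 0.01%).
    """
    # Non-viral material per unit of viral material, before and after.
    ratio_before = (1 - viral_before) / viral_before
    ratio_after = (1 - viral_after) / viral_after
    return 1 - ratio_after / ratio_before


for start in (0.0001, 0.001):
    removed = non_viral_fraction_to_remove(start, 0.10)
    print(f"{start:.2%} -> 10%: remove {removed:.2%} of human DNA")
```

Starting from 0.01% viral content, roughly 99.9% of the human DNA has to go; starting from 0.1%, roughly 99.1%.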

The End

Questions?
