This presentation was given by Doug Judd at the NoSQL Now! 2011 conference in San Jose.
A Genome Sequence Analysis System Built with Hypertable
Doug Judd
CEO, Hypertable, Inc.
Application Development Team
  UCSF-Abbott Viral Diagnostics and Discovery Center
    Director: Dr. Charles Chiu, M.D./Ph.D.
    http://vddc.ucsf.edu/
  Helices Inc.
    Taylor Sittler, M.D.
    John Dennis
    Brad Miller, M.D.
    http://helic.es/
What is Hypertable?
  Modeled after Google’s Bigtable
  Open Source (GPL v2)
  Horizontally Scalable
  High Performance Implementation (C++)
  Thrift Interface for all popular languages (Java, PHP, Ruby, Python, Perl, etc.)
  NoSQL
    No joins (not yet)
    No transactions (not yet)
  Project Started in March 2007
Hypertable Deployments
Why NoSQL?
Source: Nature 458, 719-724 (2009)
Source: wired.com, February 2011
Genomics 101
Base Pair (aka “base”)
  Two nucleotides on opposite complementary DNA or RNA strands connected via hydrogen bonds
  Double-stranded DNA/RNA is made up of base pairs
    adenine (A) pairs with thymine (T)
    guanine (G) pairs with cytosine (C)
  Base-paired DNA sequence:
    ATCGATTGAGCTCTAGCG
    TAGCTAACTCGAGATCGC
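
The pairing rule can be checked against the example strands with a few lines of Python. This is only an illustrative sketch: the names `PAIRS` and `complement` are made up here, and it covers the DNA pairs named on the slide and nothing else.

```python
# Watson-Crick pairing for DNA as listed on the slide: A<->T, G<->C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complementary strand, in the same left-to-right order."""
    return "".join(PAIRS[base] for base in strand)


top = "ATCGATTGAGCTCTAGCG"     # upper strand from the slide
bottom = complement(top)       # lower strand derived from the pairing rule
print(top)
print(bottom)                  # TAGCTAACTCGAGATCGC
```

Applying `complement` twice returns the original strand, which is a quick sanity check on the mapping.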
Gene
  Encodes info on how to make a protein
  DNA or RNA sequence
  Thousands to millions of base pairs long
  Corresponds to various different biological traits
  Human genome contains about 23,000 genes
Biological Samples
  Specimen taken from human or animal
    Nasal Swabs
    Blood Serum
    Diarrheal
    Cerebrospinal fluid
  Sent to a sequencing company to process into DNA sequence information in digital format
  Each sample will generate anywhere from 1M to 100M “reads”
  A read is a short DNA sequence snippet of approximately 100 bases
Example Reads File
GTGGATAGGGGGAGACTAATGTAGTATGATTATCATCATCAACAGAAGCTATGACACCAGGATAAACATTTCTTATTGCTGAAAGTATTCTATTGTAGAGATGTACCACAATTTGGTTTCTGGTTTTGTATTGGGAGGATACTAGGGATTACTGAAGCCAACTTTGCAGACTCATACATTTGACTAGACACAGCCACATTACAGTTTTCTGAGGAAAATTCTTAAGATGTTACCCCAAAACATAGCATTTTAAATTAAAACGGACCGGCTGAAGCCATGGCAGAAGAACATAAATTGTGAAGATTTCATGGGCATTTATTAGTTGGAAGTGATAAGTGTCCATGAAATCTTCACAATTTATGTTCAGAGATTGCAGTAAAGACAGGTGTAAAGACACAGCAAAGCTAAGAGGACCCAACACACGGTAGGGTCGGGGACCTTGGAGAAACATGGTGGCTTCTTCCTACATGCTTGTGATAGATGACCAAAAAACATTTGTTGAGTTGATGAATAGTACAAAAAAGGGGCGGATAATAAATGAAAAGGGAATGTGCTGTTATTTCCTACTAAGATCAGAAAGAGATATAAACAAAAGCTGTCATCACTTAGGGACTTCAGCCACATAAAACAATGTCAGGCTAGTCACTTAGAGCTTTGGGACTAGTTGAGTGGCAGCTTAACAAAGCAACGCAATATCCATAGGGATTGGGGATATTTACATCTAGTGGATTCTACCAGTATGGTGGTCTTATGTGGACTGCACGTGGTTTTCTAGTAAGATAGCAGCTCTTCCCAAATTTATTTATAATTGTGGCATTATTTATAATATCAAAATATTATGTTGCCAAAGGAGATTAACATTTGAGTCAGTGGGCGGGGTAAGGCCGACCTACCCTTAATCTGGTGGAGAAAGAAGCTGCTAATGGAGTTTAAAAGGTTACTGTCATTAATGAAAAATAAATTTACAGCCAGACATTTATGAACAGAAATGGGAAAAACACACTAGGAAAGCACTGCAAAGACTAATCTGTCTTTAAAGGAGATAGAGTGACTCCAGGCCCCTTAGAAATGACTATACCTGGCAGAGCATGCCAACTGATGGGCTCGAGTCCTCACAAATATGAATTCCCCCTAAGTCTTGAGAGGTCATTTGTGCATTTGGAAGGAAGAACATTCCATGCTCATGGGTAGGAAGAATCAATATCGTGAAAATGGTCATACTGCCCAGCGGGGTTTTTTTTTGTTTCATATTAACTTTAAAGTAGTTTTTTTCCATTTTGTGAAGAAAGACATAAAGAACCAAGGCTAATAGTTGTTTGAGTTGTACTTACCATGTTGTTAAATGTCACCTCACACCGCTGCCAGCCTATCAGAGCCGGGAATTACACCGTGCTTGGAGTTCTGGCACAGATCCACAGCTACAGTTCTTCATTGTAAGAAATGGATGCTAACATGTAACAAGAAAACATCTGAAGGTTAAACTCAAATAAATGGGTTAATAGTTTGTCTTTCGGTCTTCATACTTTCAATATAAGTGGTTTACTTAGCCGA
Sequence Alignment
  Arranging the sequences of DNA or RNA to identify regions of similarity
  Fuzzy matching algorithm
  Alignment methods
    BLAST - Basic Local Alignment Search Tool
    MegaBLAST
  Faster but less accurate alignment methods
    SOAP - Short Oligonucleotide Analysis Package
    BLAT - BLAST-like Alignment Tool
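
To make "fuzzy matching" concrete, here is a minimal sketch of textbook Smith-Waterman local alignment scoring. It is not how BLAST, SOAP or BLAT are implemented; those tools use faster heuristics. The function name and the scoring values (match +2, mismatch -1, gap -2) are assumptions chosen for illustration.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score between a and b (linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)              # scores for the previous row of the DP table
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("ACGT", "ACGT"))          # 8: four matches
print(local_alignment_score("ACGTACGT", "ACGAACGT"))  # 13: seven matches, one mismatch
print(local_alignment_score("AAAA", "TTTT"))          # 0: nothing in common
```

The second call shows the "fuzzy" part: a single mismatched base lowers the score but does not prevent the two sequences from aligning.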
Taxonomy
  Hierarchical biological classification
  Method to group and categorize organisms by biological type
  Basic Ranks
    Kingdom, Phylum/Division, Class, Order, Family, Genus, Species
  Downloadable from National Center for Biotechnology Information (NCBI) website
  Every node in the taxonomy tree is assigned a unique numeric ID
GenBank
  NIH genetic sequence database
    380,000 distinct organisms
    126,551,501,141 nucleotide bases
    135,440,924 sequence records
  Most important and most influential database for research in almost all biological fields
  Growth rate is exponential
  Information on each sequence includes:
    Numeric ID
    Taxonomic information
Schema Design
Taxa Table
  Schema
    CREATE TABLE Taxa (ID, Type, Children, Name);
  Contents
    /1               ID           1
    /1               ID:fullName  /root
    /1               Type         no rank
    /1               Children     1,10239,12884,12908,28384,131567
    /1               Name         root
    /1/10239         ID           10239
    /1/10239         ID:fullName  /root/Viruses
    /1/10239         Type         superkingdom
    /1/10239         Children     12333,12429,12877,29258,35237, …
    /1/10239         Name         Viruses
    /1/10239/12333   ID           12333
    /1/10239/12333   ID:fullName  /root/Viruses/unclassified phages
    /1/10239/12333   Type         no rank
    /1/10239/12333   Children     12340,12347,12366,12371,12374, …
    /1/10239/12333   Name         unclassified phages
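
The row keys in the Taxa listing are paths of taxonomy IDs from the root down to the node. Because Hypertable, like Bigtable, keeps rows sorted by row key, a node and its descendants sit next to each other. The sketch below imitates that with a sorted Python list instead of the Hypertable API; `PARENT`, `row_key` and `subtree` are hypothetical names, and the parent links are inferred from the three rows shown.

```python
from bisect import bisect_left

# Parent links for the three nodes in the Contents listing (1 is the root).
PARENT = {1: None, 10239: 1, 12333: 10239}


def row_key(tax_id: int) -> str:
    """Build the path-style row key, e.g. /1/10239/12333, by walking up to the root."""
    path = []
    node = tax_id
    while node is not None:
        path.append(str(node))
        node = PARENT[node]
    return "/" + "/".join(reversed(path))


def subtree(sorted_keys: list[str], key: str) -> list[str]:
    """Return the node and all its descendants as one contiguous range of sorted keys."""
    start = bisect_left(sorted_keys, key)
    result = []
    for k in sorted_keys[start:]:
        # Stop at the first key that is neither the node itself nor below it.
        if k != key and not k.startswith(key + "/"):
            break
        result.append(k)
    return result


keys = sorted(row_key(t) for t in PARENT)
print(row_key(12333))             # /1/10239/12333
print(subtree(keys, "/1/10239"))  # ['/1/10239', '/1/10239/12333']
```

Matching on `key + "/"` rather than on the bare prefix keeps an unrelated ID such as 102390 out of the Viruses subtree.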
Reads Table
  Schema
    CREATE TABLE Reads (Sequence, Quality, GeneKey, Comments);
  Contents
    AbCam1_100_ACAGTG,HWI...56#ACAGTG/1  Sequence                      ATCGCACCATTGAACTCCAGTC...
    AbCam1_100_ACAGTG,HWI...56#ACAGTG/1  Quality                       eeaeeeed\\e_Ycc]dcacab...
    AbCam1_100_ACAGTG,HWI...56#ACAGTG/1  Comments:qualityFilter        11071815...
    AbCam1_100_ACAGTG,HWI...56#ACAGTG/1  Sequence                      GGCTTACGCCTGTAATCCCAGC...
    AbCam1_100_ACAGTG,HWI...56#ACAGTG/1  Quality                       gfee_cgggegggecggggegc...
    AbCam1_100_ACAGTG,HWI...56#ACAGTG/1  GeneKey:gnl|GNOMON|1320663.m  11...
    AbCam1_100_ACAGTG,HWI...17#ACAGTG/1  Sequence                      AGGATACGGAAGGCCCAAGGAG...
    AbCam1_100_ACAGTG,HWI...17#ACAGTG/1  Quality                       cdd`dffffffgffgggegf^e...
    AbCam1_100_ACAGTG,HWI...17#ACAGTG/1  GeneKey:chr10                 110718151643.1308...
    AbCam1_100_ACAGTG,HWI...80#ACAGTG/1  Sequence                      ACGGAAGAGCACACGTCTGAAC...
    AbCam1_100_ACAGTG,HWI...80#ACAGTG/1  Quality                       cbccb[^W\\Ub]_b`_[bR_]...
    AbCam1_100_ACAGTG,HWI...80#ACAGTG/1  Comments:qualityFilter        11071815...
    AbCam1_100_ACAGTG,HWI...88#ACAGTG/1  Sequence                      GAACTCCAGTCACACAGTGATC...
    AbCam1_100_ACAGTG,HWI...88#ACAGTG/1  Quality                       eeeeeeeeeeeceeeeeaeeTQ...
    AbCam1_100_ACAGTG,HWI...88#ACAGTG/1  Comments:qualityFilter        11071815...
Genes Table
  Schema
    CREATE TABLE Genes (Sequence, TaxID, ID, ReadID);
  Contents
    1000075  Sequence  GAATTCCATGGCAGTAAAACATCTTCCCTTC…
    1000075  TaxID     9606
    1000075  ID:name   HSLFBPS6 Human fructose-1,6-biphosphatase
    1000075  ReadID:0310.Lane8big,HWI-EAS355:8:91:1231:1315#0/1 …
    1000075  ReadID:0908.Mexus2.TATTAT,SCS:1:22:395:324#0/1_TA …
    1000075  ReadID:0916.Enceph2,SCS:6:24:1519:513#0/1
    1000075  ReadID:0916.Mexus,SCS:1:22:410:248#0/1
    1000075  ReadID:0916.MonkeyAdeno,SCS:2:17:811:769#0/1
    1000075  ReadID:0916.MonkeyAdeno,SCS:2:21:1132:1067#0/1
    1000075  ReadID:0916.MonkeyAdeno,SCS:2:24:1207:492#0/1
    1000075  ReadID:0916.MonkeyAdeno,SCS:2:33:1138:547#0/1
    1000075  ReadID:0916.Parecho,SCS:3:4:679:1416#0/1|1
    1000075  ReadID:HIV.HIV18_Lane7.s_7_sequence.AAA,SCS:7:30:688 …
    1000075  ReadID:HIV.HIV18_Lane7.s_7_sequence.AAA,SCS:7:30:688 …
    1000075  ReadID:HIV.HIV18_Lane7.s_7_sequence.unbiased,SCS:7:30 …
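
In the Genes listing, each aligned read shows up as its own column qualifier under ReadID, so one gene row doubles as an index of the reads that hit it. The sketch below models that layout with plain dictionaries. It is an assumption-laden toy, not the Hypertable client API: the `put` and `reads_for_gene` helpers are hypothetical, and the empty cell values are a guess, since the listing shows no values for the ReadID cells.

```python
from collections import defaultdict

# (row key, column family) -> {qualifier: value}; a toy stand-in for a table.
genes = defaultdict(dict)


def put(table, row, family, qualifier="", value=""):
    """Write one cell; the qualifier is the part after the colon in Family:qualifier."""
    table[(row, family)][qualifier] = value


# Cells taken from the Genes listing (only the keys that are not truncated there).
put(genes, "1000075", "TaxID", value="9606")
put(genes, "1000075", "ReadID", "0916.Mexus,SCS:1:22:410:248#0/1")
put(genes, "1000075", "ReadID", "0916.Enceph2,SCS:6:24:1519:513#0/1")


def reads_for_gene(table, gene_id):
    """List the read keys recorded as ReadID qualifiers on one gene row."""
    return sorted(table.get((gene_id, "ReadID"), {}))


print(reads_for_gene(genes, "1000075"))
```

The same pattern appears in the opposite direction in the Reads listing, where GeneKey qualifiers name what a read aligned to.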
Monitoring Table Overview
Applications
Novel Virus Discovery
  Process for discovering new viral DNA in a biological sample
  Algorithm Overview
    Import biological sample read data from sequencing company into system
    Strip out all reads that align to known DNA sequences
    What’s left over is novel
Novel Virus Discovery: Algorithm Detail
  Import sample data into Reads table
  Run MapReduce program to filter/align reads and update Comments column of Reads table
    Filter out poor quality (“low entropy”) reads
    Align to common human RNA/DNA
    Align to virus database
    Align to GenBank
  All Reads left in Reads table with no Comments column are novel
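
The filtering steps above can be sketched in a few lines, under heavy assumptions. Shannon entropy of the base composition stands in for the "low entropy" filter, with a threshold of 1.0 bits picked arbitrarily. Exact substring search stands in for a real aligner such as BLAST. The reference sequences, the reads and the `alignment` qualifier name are invented for illustration, and a plain loop replaces the MapReduce job.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy of the base composition in bits per base (at most 2.0 for A/C/G/T)."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def annotate(read: str, references: dict[str, str], min_entropy: float = 1.0) -> dict[str, str]:
    """Return Comments-style annotations for one read; an empty dict means 'novel'."""
    if shannon_entropy(read) < min_entropy:
        return {"qualityFilter": "low entropy"}
    for name, ref in references.items():   # checked in order: human, then virus, ...
        if read in ref:                     # placeholder for a real alignment step
            return {"alignment": name}
    return {}


# Toy reference databases and reads (made up, far shorter than real data).
references = {"human": "GGACGTTAGC", "virus": "TTGACCATGA"}
reads = ["AAAAAAAA", "ACGTTAGC", "GACCATGA", "CATGGTCA"]

novel = [r for r in reads if not annotate(r, references)]
print(novel)  # ['CATGGTCA']
```

Only the last read survives: the first is filtered as low entropy, and the next two match the human and virus references.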
Pathogen Discovery in Cancer Samples
  Accomplished using same technique as novel virus discovery
  Matthew Meyerson's Lab @ Broad Institute
Taxonomic Tree Viewer
  Display Taxonomy breakdown of biological sample
  For each aligned read in sample, consult Genes table to determine Taxonomy ID
  Populate HitSummary table with taxonomy IDs for all aligned reads from all samples
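
A rough sketch of the HitSummary step follows: look up the taxonomy ID for each aligned read's gene, then count hits per sample. The layout of the real HitSummary table is not shown in the slides, so the dictionary shapes, the sample name `sample-1` and the gene key `gene-A` are hypothetical. Only the mapping of gene 1000075 to taxonomy ID 9606 comes from the Genes listing.

```python
from collections import Counter

# Gene key -> taxonomy ID, as would be read from the TaxID column of the Genes table.
gene_tax_id = {"1000075": 9606, "gene-A": 10239}

# Sample name -> gene keys that the sample's reads aligned to (one entry per aligned read).
aligned = {"sample-1": ["1000075", "gene-A", "1000075"]}


def hit_summary(aligned: dict[str, list[str]], gene_tax_id: dict[str, int]) -> dict[str, Counter]:
    """Count aligned reads per taxonomy ID for every sample."""
    summary = {}
    for sample, gene_keys in aligned.items():
        # Reads whose gene has no known taxonomy ID are skipped.
        summary[sample] = Counter(gene_tax_id[g] for g in gene_keys if g in gene_tax_id)
    return summary


summary = hit_summary(aligned, gene_tax_id)
print(dict(summary["sample-1"]))  # {9606: 2, 10239: 1}
```

A viewer could then roll these counts up the tree by following the path-style row keys of the Taxa table.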
Depletion Array (future)
  Align reads to human genome
  Determine set of probes - sequences of human genome with the highest number of alignments
  Send probes to Agilent to produce vial of “magnetized” DNA sequences of the probes
  Mix vial in with biological sample
  Magnetized DNA binds to human DNA, which precipitates from solution
  Increases viral percentage of sample from ~0.01% - 0.1% to 10%
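
Probe selection amounts to a top-k count over the genome regions that reads aligned to. The sketch below assumes each alignment has already been reduced to a region label. The labels and the `pick_probes` name are hypothetical, and how regions are actually defined is not stated in the slides.

```python
from collections import Counter


def pick_probes(alignment_regions: list[str], k: int) -> list[str]:
    """Return the k genome regions that attracted the most read alignments."""
    counts = Counter(alignment_regions)
    return [region for region, _ in counts.most_common(k)]


# One made-up region label per aligned read.
regions = ["region-a", "region-b", "region-a", "region-c", "region-a", "region-b"]
print(pick_probes(regions, 2))  # ['region-a', 'region-b']
```

For scale, going from a viral share of 0.01% to 0.1% up to 10% is an enrichment of roughly 100 to 1000 times.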
The End
Questions?