MongoDB and research Jan Aerts, PhD Wellcome Trust Sanger Institute Hinxton, UK [email protected] @jandot

Page 1: MongoDB and research

MongoDB and research

Jan Aerts, PhD

Wellcome Trust Sanger Institute
Hinxton, UK

[email protected]
@jandot

Page 2: MongoDB and research

Disclaimer 1

Page 3: MongoDB and research

Disclaimer 2

Page 4: MongoDB and research

Acknowledgments

MongoDB community

Karen Ambrose

10gen

Page 5: MongoDB and research
Page 6: MongoDB and research

transcriptomics

genomics

proteomics

*omics

Page 7: MongoDB and research

transcriptomics

genomics

proteomics

*omics

instantiationomics

metabolomics

spliceomics

interactomics

metallomics

lipidomics

orfeomics

phenomics

histomics

Page 8: MongoDB and research

Academia != industry

Page 9: MongoDB and research

heterogeneous systems

Page 10: MongoDB and research

transitory

Page 11: MongoDB and research

little optimization

Page 12: MongoDB and research

slow adoption of new technology

(don't break anything that works)

Page 13: MongoDB and research

data management = afterthought

money

Page 14: MongoDB and research

Who are the players?

Page 15: MongoDB and research

large genome/data centers

genome hackers (lone bioinformaticians)

bench-based scientists

Drawings by Morag Ann Lewis

Page 16: MongoDB and research

genome hackers (lone bioinformaticians)

bench-based scientists

heavy investment in infrastructure/pipelines

data exchange => standards!

large genome/data centers

Page 17: MongoDB and research

genome hackers (lone bioinformaticians)

bench-based scientists

little investment in infrastructure

little time/effort for optimization

one-off

getting it done

creating legacy

need IT support for heavier work

large genome/data centers

often self-taught

Page 18: MongoDB and research

large genome/data centers

genome hackers (lone bioinformaticians)

bench-based scientists

use whatever everyone else is using

"normalization?"

Page 19: MongoDB and research

The data landscape

Page 20: MongoDB and research

1. Flat text files

LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p (AXL2)
            and Rev7p (REV7) genes, complete cds.
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina;
            Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for DNA
            damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  PUBMED    7871890
FEATURES             Location/Qualifiers
     gene            687..3158
                     /gene="AXL2"
     gene            complement(3300..4037)
                     /gene="REV7"
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
      121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa
      181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg
      241 ccacactgtc attattataa ttagaaacag aacgcaaaaa ttatccacta tataattcaa
      301 agacgcgaaa aaaaaagaac aacgcgtcat agaacttttg gcaattcgcg tcacaaataa
      361 attttggcaa cttatgtttc ctcttcgagc agtactcgag ccctgtctca agaatgtaat
      421 aatacccatc gtaggtatgg ttaaagatag catctccaca acctc...
//
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p (AXL2)
            and Rev7p (REV7) ...

Page 22: MongoDB and research

1. Flat text files

##format=VCFv1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
#CHROM  POS      ID  REF  ALT  QUAL      FILTER  INFO                       FORMAT       NA00001
1       967433   .   G    A    151.43    0       AB=0.42;AC=1               GT:DP:GQ     1/0:11:99.00
1       970323   .   G    A    492.61    0       AB=0.41;AC=1;AF=0.50       GT:DP:GQ     1/0:28:99.00
1       970950   .   A    G    1287.90   0       AB=0.55;AC=1;AF=0.50       GT:DP:GQ     0/1:108:99.00
1       972804   .   T    C    210.56    0       AB=0.53;AC=1;AF=0.50       GT:DP:GQ     1/0:13:99.00
1       972857   .   T    C    846.18    0       AB=0.53;AC=1;AF=0.50;AN=2  GT:DP:GQ     1/0:58:99.00
1       974165   .   T    C    810.47    0       AB=0.38;AC=1;AF=0.50;AN=2  GT:DP:GQ     1/0:6:67.05
1       977063   .   C    T    1110.31   0       AB=0.50;AC=1;AF=0.50;AN=2  GT:DP:GQ     0/1:67:99.00
1       1006892  .   C    G    62.39     SF      AC=2;AF=1.00;AN=2          GT:DP:GQ     1/1:2:6.02
1       1148494  .   A    G    5237.88   0       AC=2;AF=1.00;AN=2          GT:DP:GQ     1/1:160:99.00
1       1149380  .   T    C    165.10    0       AC=2;AF=1.00;AN=2          GT:DP:GQ     1/1:6:18.05
1       1212553  .   C    T    426.61    0       AB=0.26;AC=1;AF=0.50;AN=2  GT:DP:GQ     0/1:18:99.00
1       1235867  .   A    G    1158.08   0       AC=2;AF=1.00;AN=2          GT:DP:GQ     1/1:30:90.28
1       1237357  .   T    C    142.01    0       AC=2;AF=1.00;AN=2          GT:DP:GQ     1/1:5:15.04
1       1239050  .   G    A    13952.03  0       AC=2;AF=1.00;AN=2          GT:DP:GQ     1/1:340:99.00
20      14370    .   G    A    29        0       NS=58;DP=258;AF=0.786      GT:GQ:DP:HQ  0|0:48:1:51,51
20      13330    .   T    A    3         q10     NS=55;DP=202;AF=0.024      GT:GQ:DP:HQ  0|0:49:3:58,50
20      1110696  .   A    G,T  67        0       AF=0.421,0.579;AA=T;DB     GT:GQ:DP:HQ  1|2:21:6:23,27
20      10237    .   T    .    47        0       NS=57;DP=257;AA=T          GT:GQ:DP:HQ  0|0:54:7:56,60
...

perl

java

python

ruby

“tab-delimited” is king
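
As a rough illustration of why tab-delimited files are so convenient for scripting languages, the sketch below splits one data line of the VCF example into a Python dictionary. The function name `parse_vcf_line` and the decision to convert only POS and QUAL to numbers are assumptions made for this illustration, not something taken from the slides; real VCF files have more edge cases (for example a missing QUAL written as ".") than this handles.

```python
# Column names as they appear in the "#CHROM ..." header line of the example.
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_vcf_line(line, sample_names):
    """Turn one tab-delimited VCF data line into a nested dictionary.

    Hypothetical helper for illustration; assumes QUAL is numeric.
    """
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED_COLUMNS, fields[:9]))
    record["POS"] = int(record["POS"])
    record["QUAL"] = float(record["QUAL"])

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags (e.g. DB).
    info = {}
    for item in record["INFO"].split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    record["INFO"] = info

    # FORMAT names the ':'-separated sub-fields of every sample column.
    keys = record.pop("FORMAT").split(":")
    record["samples"] = {
        name: dict(zip(keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }
    return record


line = "1\t967433\t.\tG\tA\t151.43\t0\tAB=0.42;AC=1\tGT:DP:GQ\t1/0:11:99.00"
record = parse_vcf_line(line, ["NA00001"])
print(record["POS"], record["INFO"]["AB"], record["samples"]["NA00001"]["GT"])
# prints: 967433 0.42 1/0
```

The nested dictionary that comes out is already close to the shape of a document in a document store, which is one reason the format lends itself to this kind of ad hoc processing.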

Page 25: MongoDB and research

2. Binary compressed flat files

One experiment => one data file
as text: 40-70 GB
=> compressed to 11-20 GB

Toolkits to access data (and generate tab-delimited)

C

java

Page 26: MongoDB and research

3. MySQL and Oracle

Curated data
Meta-data
Raw data: BLOBs

Sequencing: >6 TB/week and growing…

Departmental project: 40 individuals x 42 million data points/individual => joins?

Denormalized copy
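
The arithmetic behind the "joins?" question can be made explicit. The sketch below only multiplies out the two numbers given on the slide; the two table layouts it compares (one row per individual per data point versus one record per data point) are assumptions for illustration, since the slides do not describe the actual schema.

```python
individuals = 40
datapoints_per_individual = 42_000_000

# Assumed normalized layout: one row per (individual, data point) pair.
normalized_rows = individuals * datapoints_per_individual
print(f"normalized: {normalized_rows:,} rows")  # normalized: 1,680,000,000 rows

# Assumed denormalized layout: one record per data point that carries the
# values of all individuals, so no join is needed to compare individuals.
denormalized_records = datapoints_per_individual
print(f"denormalized: {denormalized_records:,} records")  # denormalized: 42,000,000 records
```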

Page 27: MongoDB and research

4. AceDB - A Caenorhabditis elegans database

object-oriented

Author "Patel B"
Full_name "Bala Patel"
Laboratory CB
Paper [cgc1011]
Paper [cgc533]
Mail "Laboratory of Molecular Biology"
Mail "Hills Road, Cambridge"
Fax "050 3456789"

Paper [cgc533]
Title "Yet more of those Genes"
Journal "Cell Reports"
Volume 3
Year 1993
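
To show how closely an AceDB object resembles a document in a document store, the sketch below converts the Author object from the slide into a Python dictionary and prints it as JSON. The function name `ace_to_document`, the `class` and `name` keys, and the choice to store every tag as a list are assumptions for this illustration; this is not the AceDB toolkit's own parser and it ignores most of the .ace syntax.

```python
import json
import shlex

ACE_TEXT = '''Author "Patel B"
Full_name "Bala Patel"
Laboratory CB
Paper [cgc1011]
Paper [cgc533]
Mail "Laboratory of Molecular Biology"
Mail "Hills Road, Cambridge"
Fax "050 3456789"
'''


def ace_to_document(text):
    """Convert one simple AceDB-style object into a nested dictionary.

    Hypothetical helper: the first line is read as class and object name,
    every following line as a tag followed by its value.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    class_name, *name_parts = shlex.split(lines[0])
    document = {"class": class_name, "name": " ".join(name_parts)}
    for line in lines[1:]:
        tag, *values = shlex.split(line)
        # Square brackets mark references such as [cgc533]; keep only the key.
        value = " ".join(values).strip("[]")
        # Tags can repeat (Paper, Mail), so every tag maps to a list.
        document.setdefault(tag, []).append(value)
    return document


author = ace_to_document(ACE_TEXT)
print(json.dumps(author, indent=2))
```

Repeated tags such as Paper and Mail become arrays, which a relational schema would typically model as separate tables.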

Page 28: MongoDB and research
Page 29: MongoDB and research

Challenges in *-omics

Where can MongoDB play a role?

Page 30: MongoDB and research

explosion of data

every researcher must be able to handle data

Page 31: MongoDB and research

low stepping stone for bench-based scientists

big data

Page 32: MongoDB and research
Page 33: MongoDB and research

Takeoff within research community?

widespread?
Cannot manage all data in-house <= data exchange!
=> focus more on file formats than on technology

smaller scale
Implement MongoDB for

* local storage and querying (load file from standard file format into custom DB)

* encourage non-informaticians to use MongoDB
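
A minimal sketch of the "load file from standard file format into custom DB" idea follows. It reads a tab-delimited file with a header line and inserts each row as one document. The helper `load_tab_delimited`, the `InMemoryCollection` stand-in, and the column names are assumptions made so the example runs without a MongoDB server; with pymongo installed, a real collection object could be passed in instead, since it offers an `insert_many` method as well.

```python
import csv
import io


def load_tab_delimited(handle, collection):
    """Insert every row of a tab-delimited file (with header) as one document."""
    reader = csv.DictReader(handle, delimiter="\t")
    documents = [dict(row) for row in reader]
    if documents:  # skip the call for an empty file
        collection.insert_many(documents)
    return len(documents)


class InMemoryCollection:
    """Stand-in that mimics two pymongo collection method names (assumption)."""

    def __init__(self):
        self.documents = []

    def insert_many(self, documents):
        self.documents.extend(documents)

    def find(self, query):
        # Only supports exact-match queries such as {"chrom": "1"}.
        return [d for d in self.documents
                if all(d.get(key) == value for key, value in query.items())]


data = io.StringIO("chrom\tpos\tref\talt\n"
                   "1\t967433\tG\tA\n"
                   "1\t970323\tG\tA\n"
                   "20\t14370\tG\tA\n")
variants = InMemoryCollection()
count = load_tab_delimited(data, variants)
print(count, len(variants.find({"chrom": "1"})))  # prints: 3 2

# With a real server (assuming pymongo is installed and mongod runs locally;
# the database, collection and file names here are hypothetical):
#   from pymongo import MongoClient
#   variants = MongoClient("mongodb://localhost:27017")["research"]["variants"]
#   with open("variants.tsv") as handle:
#       load_tab_delimited(handle, variants)
```

All values stay strings in this sketch; converting positions to integers before inserting would be needed for range queries.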

Page 34: MongoDB and research

Thank you!
Questions?

jan.aerts@gmail.com
@jandot

http://saaientist.blogspot.com