genome big data Adrián Báez 16/06/2014

Genome Big Data

Presentation from the "Demystifying Big Data" Technical Conference (Universidad de La Laguna, Spain, June 2014). The biomedical sciences rely on massive data sets. With machines that generate large amounts of data at low cost, science has entered the 'Big Data' era, making computational infrastructure essential to maintain, transfer and analyze all this information.

  • DNA, genes, proteins, genome, genomics, biomedicine
  • Sequencing, assembly
  • Genome sequencing: fragments, reads, FASTQ file
  • Genome sequencing. 2003: the Human Genome Project (1990-2003) ends, at a cost of 2.7 billion dollars. 2014: Illumina launches the HiSeq X Ten, at 1000 dollars/genome. Forty such machines would be able to sequence more genomes in one year than had been produced by all other sequencers to date.
  • Genome assembly. 400x20. Reads (MB ~ GB, HDD); assembly with intermediate data structures (GB ~ TB, RAM); original sequence (MB ~ GB, HDD).

    Organism            Reads     Assembly (RAM)   Result
    Escherichia coli    82.4 MB   1.64 GB          3.8 MB
    Trypanosoma cruzi   1 GB      13.75 GB         38.6 MB
  • Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias. Current system: web assembly and analysis
  • Future work: Big Data solutions. Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias
  • Implementing Big Data. Data transfer: Biotorrents. Security and privacy: advanced encryption algorithms; custom hardware solutions instead of cloud computing; consent forms to share personal genome data. Data storage: lack of an integral, economic and safe solution.
  • Sequencing/assembly projects. Google Scholar: papers that mention genome sequencing or assembly. Human Genome Project, Cancer Genome Project, Pine Genome Project, Dog Genome Project, Pediatric Cancer Genome Project, Bovine Genome Project, Mammoth Genome Project, Pear Genome Project, Fugu Genome Project.
  • Thanks for your attention
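The genome sequencing slides mention reads stored in a FASTQ file. As a rough illustration of what that data looks like to a program, here is a minimal sketch of a FASTQ reader. It assumes the common four-line record layout (header starting with `@`, sequence, separator starting with `+`, quality string) with no line-wrapped sequences. The names `Read` and `parse_fastq` are hypothetical and do not come from the presentation.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Read(NamedTuple):
    """One sequenced fragment (hypothetical container for this sketch)."""
    name: str
    sequence: str
    quality: str


def parse_fastq(handle: TextIO) -> Iterator[Read]:
    """Yield reads from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield Read(header[1:], sequence, quality)


# Tiny in-memory example with two made-up reads.
sample = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nFFFF\n"
for read in parse_fastq(io.StringIO(sample)):
    print(read.name, read.sequence, len(read.quality))
```

Because the function is a generator, only one record is held in memory at a time, which matters when the reads file is in the MB to GB range quoted on the assembly slide.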
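The figures on the genome assembly slide can be turned into a quick calculation of how much RAM the assembly step needs relative to the size of the input reads, and how much smaller the assembled result is. The sketch below uses only the numbers from that slide. The conversion of 1 GB to 1024 MB is an assumption (the slide does not say whether units are binary or decimal), and the names `TABLE`, `ram_per_read_mb` and `reads_per_result_mb` are hypothetical.

```python
# Assumption: binary units. With 1000 MB per GB the ratios shift slightly.
MB_PER_GB = 1024

# Sizes in MB, taken from the genome assembly slide.
TABLE = {
    "Escherichia coli": {
        "reads_mb": 82.4,
        "ram_mb": 1.64 * MB_PER_GB,
        "result_mb": 3.8,
    },
    "Trypanosoma cruzi": {
        "reads_mb": 1 * MB_PER_GB,
        "ram_mb": 13.75 * MB_PER_GB,
        "result_mb": 38.6,
    },
}


def ram_per_read_mb(row: dict) -> float:
    """MB of RAM used during assembly per MB of input reads."""
    return row["ram_mb"] / row["reads_mb"]


def reads_per_result_mb(row: dict) -> float:
    """MB of input reads per MB of assembled sequence."""
    return row["reads_mb"] / row["result_mb"]


for organism, row in TABLE.items():
    print(
        f"{organism}: RAM/reads = {ram_per_read_mb(row):.1f}x, "
        f"reads/result = {reads_per_result_mb(row):.1f}x"
    )
```

For both organisms the intermediate data structures occupy more than ten times the size of the reads, which is why the slide places the assembly step in RAM at the GB to TB scale while inputs and outputs stay on disk.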