Big Data Exploration in Genome-based Data Analysis
Dr. Jittisak Senachak
Systems Biology and Bioinformatics (SBI) research group
King Mongkut’s University of Technology Thonburi (KMUTT)
[Affiliation: National Center for Genetic Engineering and Biotechnology (BIOTEC)]
Trilateral Scientific Meeting Indonesia-Thailand-France: “Climate change, Big data management and Health”
IICC-Bogor, Indonesia, October 29, 2015
@Bangkok, THAILAND
BM: BangMod – Main campus
BKT: Bangkhunthien campus
- R&D cluster + Industrial Park
- Pilot-plant 1,2,3
- National Biopharmaceutical Facility
KX: Knowledge Exchange Center
- Big Data Exchange Center
@Ratchaburi, THAILAND
RAT: Ratchaburi Campus
- Residential campus
- Bee-Park
King Mongkut’s University of Technology Thonburi
http://global.kmutt.ac.th/
BangKhunThian Campus, BKK
Main Campus @BangMod, BKK Incity Innovation Center
Ratchaburi Campus
KX: Knowledge eXchange for Innovation Center
• KMUTT Learning Square
• Working + Learning + Sharing
• eXchange
• eXperience
• eXpert
• eXtension
• eXplore
BX: Data scientists meet with enterprise to solve business problems
• The big data ecosystem in Thailand
• eXplain: Education & training
• eXplore: Share ideas & best practices
• eXchange: Case studies, prototypes & surveys among academia & IT providers
• Mobilizing talent
• Leveraging education
• Big data trends
[Figure labels: Analysis Methods, Big Data, New Insight, Data]
Agenda
• Intro … KMUTT’s facility for Big Data trend
• Big Data & characteristics: genome-based data
• Exploration of biological entities
  • Genome browsers
  • Application: Comparative genomics
• Exploration of relationships among the entities
  • Integrative tool with ~omics data
• Example: Big Data technology applied to genome-based data analysis
• Our current activities: conference & workshop
What is Big Data?
Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
-- Gartner, 2015
Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy. The term often refers simply to the use of predictive analytics or other certain advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making. And better decisions can mean greater operational efficiency, cost reduction and reduced risk.
-- Wikipedia, 2015
What is Big Data?
Characteristics (5V)
• Volume: Data at scale
• Variety: Data in various forms
• Velocity: Data flow
• Veracity: Data uncertainty
• Value
Genome-based data as Big data
How big is genome-based data?
Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical? PLoS Biol 13(7): e1002195. doi: 10.1371/journal.pbio.1002195
 | Astronomy | Twitter | YouTube | Genomics
Current rate | 7.5 TB/sec | 5M tweets/day | 300 hours/min | Raw 3.6 PB + 35 Pbp/year
Growth in Y2025 | 750 TB/sec | 1,200M tweets/day | 1,700 hours/min | ??
Acquisition-2025 (unit per year) | 25 ZB (~5 trillion DVDs) | 0.5-15M tweets | 500-900M hours | ??
Storage-2025 (byte/year) | 1 EB (~212 million DVDs) | 1-17 PB (~3.62 million DVDs) | 1-2 EB (~425 million DVDs) | ??
Analysis-2025 | In situ data reduction; real-time processing; massive volumes | Topic & sentiment mining; metadata analysis | Limited requirements | Heterogeneous data/analysis; variant calling; all-pairs genome alignments
Distribution-2025 | Dedicated lines from antennae to server (600 TB/s) | Small units of distribution | Major component of modern user’s bandwidth (10 MB/s) | Many small (10 MB/s) and fewer massive (10 TB/s) data movements
Zetta: 2^70 (~10^21); Exa: 2^60 (~10^18); Peta: 2^50 (~10^15); Tera: 2^40 (~10^12); Giga: 2^30 (~10^9); Single-sided DVD ~4.5 GB
• The human genome: approx. 25,000 proteins (over 3,000 Mbp); variants make up about 0.1% of WGS data
• Assembled ~700 MB
• Raw (30x) ~200,000 MB
• Variants ~125 MB
Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical? PLoS Biol 13(7): e1002195. doi: 10.1371/journal.pbio.1002195
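The file sizes quoted for one human genome can be sanity-checked with simple arithmetic. The sketch below is not from the slides: it assumes 2 bits per base for an assembled genome (A, C, G, T only) and roughly 2 bytes per sequenced base for raw reads (one character for the base plus one for its quality score, ignoring read headers). Both figures are illustrative assumptions, so the results only need to land in the same order of magnitude as the slide's values.

```python
GENOME_BP = 3_000_000_000  # "over 3,000 Mbp" on the slide, rounded down
BITS_PER_BASE = 2          # assumption: only A, C, G, T, no metadata


def assembled_megabytes(n_bases: int = GENOME_BP) -> float:
    """Size of an assembled genome packed at 2 bits per base, in MB (10^6 bytes)."""
    return n_bases * BITS_PER_BASE / 8 / 1e6


def raw_megabytes(coverage: int, bytes_per_base: float,
                  n_bases: int = GENOME_BP) -> float:
    """Size of raw reads at a given coverage, in MB (10^6 bytes).

    bytes_per_base is an assumption about the read format, not a measured value.
    """
    return coverage * n_bases * bytes_per_base / 1e6


if __name__ == "__main__":
    print(f"assembled: {assembled_megabytes():,.0f} MB")
    print(f"raw (30x): {raw_megabytes(30, 2):,.0f} MB")
```

Under these assumptions the assembled genome comes to 750 MB and 30x raw data to 180,000 MB, close to the ~700 MB and ~200,000 MB on the slide.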
Too much Data: Cautionary Tale of Sequencing Data
[Figure: number of human genomes sequenced over time (axis marks x10^3 and x10^6), spanning first-generation, next-generation and next-next-generation sequencers; curves: Moore’s law, Illumina estimate, historical growth]
Not only human genomes are being sequenced
• Currently, the Sequence Read Archive (SRA) @NIH/NCBI contains more than 3.6 Pbp
  • ~32,000 microbes
  • ~5,000 plants & animals
  • ~250,000 human genomes
• Massive sequencing projects are ongoing
  • 3k rice genomes [public data on AWS]
  • 1,000k plants & animals
  • 100k (UK) + 100k (SA) + 320k (Iceland) + 1,000k (US) + 1,000k (CN)
• By 2025: around 25% of the population (developing countries) + 50% of the population (developed countries)
How big is genome-based data?
Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical? PLoS Biol 13(7): e1002195. doi: 10.1371/journal.pbio.1002195
 | Astronomy | Twitter | YouTube | Genomics
Current rate | 7.5 TB/sec | 5M tweets/day | 300 hours/min | Raw 3.6 PB + 35 Pbp/year
Growth in Y2025 | 750 TB/sec | 1,200M tweets/day | 1,700 hours/min | --estimated--
Acquisition-2025 (unit per year) | 25 ZB (~5 trillion DVDs) | 0.5-15M tweets | 500-900M hours | 1 ZB (~212.5 billion DVDs)
Storage-2025 (byte/year) | 1 EB (~212 million DVDs) | 1-17 PB (~3.62 million DVDs) | 1-2 EB (~425 million DVDs) | 2-40 EB (~8,500 million DVDs)
Analysis-2025 | In situ data reduction; real-time processing; massive volumes | Topic & sentiment mining; metadata analysis | Limited requirements | Heterogeneous data/analysis; variant calling (~2 trillion CPU-hours); all-pairs genome alignments (~10,000 trillion CPU-hours)
Distribution-2025 | Dedicated lines from antennae to server (600 TB/s) | Small units of distribution | Major component of modern user’s bandwidth (10 MB/s) | Many small (10 MB/s) and fewer massive (10 TB/s) data movements
Zetta: 2^70 (~10^21); Exa: 2^60 (~10^18); Peta: 2^50 (~10^15); Tera: 2^40 (~10^12); Giga: 2^30 (~10^9); Single-sided DVD ~4.5 GB
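The DVD equivalents in the table are plain unit conversions, and a short sketch makes them reproducible. One assumption to flag: the quoted counts (for example ~212 million DVDs per EB) come out when a DVD is taken as 4.7 GB with decimal prefixes, rather than the ~4.5 GB given in the footnote, so 4.7 GB is used here. The function name and prefix table are illustrative only.

```python
# Decimal (SI) prefixes in bytes.
PREFIX = {"GB": 10**9, "TB": 10**12, "PB": 10**15, "EB": 10**18, "ZB": 10**21}

# Assumption: 4.7 GB per single-layer DVD (the footnote quotes ~4.5 GB).
DVD_BYTES = 4.7 * PREFIX["GB"]


def dvd_equivalents(amount: float, unit: str) -> float:
    """Number of DVDs needed to hold `amount` of the given unit."""
    return amount * PREFIX[unit] / DVD_BYTES


if __name__ == "__main__":
    for amount, unit in [(25, "ZB"), (1, "ZB"), (1, "EB"), (17, "PB"), (40, "EB")]:
        print(f"{amount} {unit} ~ {dvd_equivalents(amount, unit):.3g} DVDs")
```

With this DVD size, 1 ZB works out to roughly 213 billion DVDs and 40 EB to roughly 8.5 billion.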
Exploration of Biological Entities
Biological Entities: Genome (Genes in total)
~1,500 base pairs fit on this slide; the human genome would need about 2,000,000 slides
Genome Browser: chromosomes to genes to nucleotides
Comparative Genomics
- Gene presence/absence
- Gene gains/losses
- Evolutionary study
- Strain screening
  - Kit set for QA
  - Alt options
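Gene presence/absence and gains/losses can be illustrated with a tiny table of ortholog groups. This is a minimal sketch on made-up data: the group and strain names are hypothetical, and "gained"/"lost" here simply means present in one genome and absent in the other relative to a chosen reference. Deciding the true evolutionary direction would need a phylogeny, which is not modelled.

```python
# Hypothetical input: ortholog group -> genomes in which a member gene was found.
ORTHOLOG_GROUPS = {
    "OG1": {"strainA", "strainB", "strainC"},
    "OG2": {"strainA", "strainB"},
    "OG3": {"strainC"},
}
GENOMES = ["strainA", "strainB", "strainC"]


def presence_absence(groups, genomes):
    """Gene-by-genome matrix: 1 if the group is present in the genome, else 0."""
    return {og: [int(g in members) for g in genomes]
            for og, members in groups.items()}


def core_groups(groups, genomes):
    """Ortholog groups found in every genome."""
    return sorted(og for og, members in groups.items()
                  if set(genomes) <= members)


def differences(groups, reference, other):
    """Groups present only in `other` (gained) or only in `reference` (lost)."""
    gained = sorted(og for og, m in groups.items()
                    if other in m and reference not in m)
    lost = sorted(og for og, m in groups.items()
                  if reference in m and other not in m)
    return gained, lost


if __name__ == "__main__":
    for og, row in presence_absence(ORTHOLOG_GROUPS, GENOMES).items():
        print(og, row)
    print("core:", core_groups(ORTHOLOG_GROUPS, GENOMES))
    print("strainC vs strainA:", differences(ORTHOLOG_GROUPS, "strainA", "strainC"))
```

A group that is present in only some strains is also the kind of marker a strain-screening kit could target.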
Biological entities: varieties
Exploration of Relationships among entities
Relationships: associations, interactions, …
• Associations
  • Correlation
• Interactions
  • DNA-Protein
  • TF-Target protein
  • Physical protein-protein
• Metabolic reactions
  • Substrates, products
  • Catalysts/enzymes
  • Metabolites
• Pathways
  • Metabolic pathways
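Of the relationship types listed, correlation is the easiest to show concretely. The sketch below computes a Pearson correlation between two expression profiles measured over the same conditions or time points. The profiles are invented for illustration, and Pearson is only one possible association measure; the slides do not say which one is used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    # Hypothetical expression levels at four time points.
    gene_a = [1.0, 2.0, 3.0, 4.0]
    gene_b = [2.0, 4.0, 6.0, 8.0]   # rises with gene_a
    gene_c = [8.0, 6.0, 4.0, 2.0]   # falls as gene_a rises
    print(f"a vs b: {pearson(gene_a, gene_b):+.2f}")
    print(f"a vs c: {pearson(gene_a, gene_c):+.2f}")
```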
Systems biology (integrative biology): velocity
How does the cell regulate itself? Signal-response study (condition + time)
Integrative analysis of ~omic data
…or google us: “SpirPro SBI”
J. Senachak et al. SpirPro: A Spirulina proteome database and web-based tools for the analysis of protein-protein interactions at the metabolic level in Spirulina (Arthrospira) platensis C1. BMC Bioinformatics 2015, 16:233 doi:10.1186/s12859-015-0676-z.
SpirPro: proteome-effect analysis
Proteome data: temporal stress response
Snapshot interactions: PPIs surrounding an expressed protein
SpirPro: proteome-effect analysis
Effect on Metabolisms (Proteome over KEGG pathways)
dashed line for expressed enzymes
Inter-pathway effects (expressed proteins affecting other pathways via PPIs)
The figure shows only the pathway of the left-hand-side protein and all of its possible PPIs to other pathways
Protein Interaction
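The inter-pathway idea can be sketched as a lookup over two tables: which pathways each protein belongs to, and which proteins interact. The code is a toy illustration, not SpirPro's implementation. The protein identifiers, pathway names and function names are all hypothetical, and only direct interaction partners are considered.

```python
# Hypothetical toy data; identifiers are illustrative, not taken from SpirPro.
PATHWAYS_OF = {
    "P1": {"glycolysis"},
    "P2": {"photosynthesis"},
    "P3": {"glycolysis", "nitrogen metabolism"},
    "P4": set(),
}
PPIS = [("P1", "P2"), ("P1", "P3"), ("P2", "P4")]


def partners(protein, ppis):
    """Direct interaction partners of a protein (PPIs are undirected)."""
    found = set()
    for a, b in ppis:
        if a == protein:
            found.add(b)
        elif b == protein:
            found.add(a)
    return found


def linked_pathways(protein, ppis, pathways_of):
    """Map each pathway outside the protein's own to the partners that link to it."""
    own = pathways_of.get(protein, set())
    linked = {}
    for partner in partners(protein, ppis):
        for pathway in pathways_of.get(partner, set()) - own:
            linked.setdefault(pathway, set()).add(partner)
    return linked


if __name__ == "__main__":
    for pathway, via in sorted(linked_pathways("P1", PPIS, PATHWAYS_OF).items()):
        print(f"P1 may affect {pathway} via {sorted(via)}")
```

Starting from an expressed protein, the result lists the other pathways it could influence and the interaction partners that provide the link.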
SpirPro: proteome-effect analysis
• Web-based platform for interactive browsing
• Comparative study of 52 cyanobacterial genomes
• Ortholog analysis
  • Orthologs classified by the OrthoMCL algorithm
• Protein domain analysis
  • Pfam scan V.14 with in-house script for visualization
• Protein-protein interaction
  • Inferred from yeast two-hybrid screening in Synechocystis sp. PCC6803
…or google us: “CyanoCOG”
[with-PPI]: Snapshot Interactions
<msa>: Multi-Sequence Alignment
[click-on-image]: Gene Location on Genome
Example: Speeding up the analysis pipeline by applying Big Data technology
Chromosome | VCFtools | Impala SQL
All chromosomes | 22x60 hr | 1.6 min
Per chromosome | 16.5 – 110 min | 2 – 7 sec
Speed up! ~1000X
1.5TB (uncompressed)
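The slides do not show the queries behind these timings. As a rough sketch of why a SQL engine suits this kind of work, the example below loads a few made-up variant records into a table and summarizes them with one declarative query. It uses Python's built-in sqlite3 purely as a stand-in. Impala is a distributed SQL engine that runs over Hadoop storage, so its setup and SQL dialect differ. The table layout, column names and records are assumptions for illustration.

```python
import sqlite3

# Hypothetical variant records: (chromosome, position, reference allele, alternate allele).
VARIANTS = [
    ("1", 10177, "A", "AC"),
    ("1", 10352, "T", "TA"),
    ("2", 10133, "C", "T"),
    ("22", 16050075, "A", "G"),
]


def count_per_chromosome(rows):
    """Load variant rows into an in-memory table and count them per chromosome."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE variants (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)"
        )
        conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT chrom, COUNT(*) FROM variants "
            "GROUP BY chrom ORDER BY CAST(chrom AS INTEGER)"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for chrom, n in count_per_chromosome(VARIANTS):
        print(f"chr{chrom}: {n} variants")
```

Once the variants sit in a table, filters and aggregations become single queries that the engine is free to parallelize, rather than scripts that rescan a large file on every run.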
The analysis pipeline for Next-gen sequence data
Hadoop file system speeds up variant calling
Use Big Data technology
[Figure: cluster layout with one master and five nodes, connected over 10 Gbps links]
Conclusions
• Genome-based data as big data
• Data visualization is important for genome-based research
  • Genome browsers: multi-level data
• Multi-level -omics data integration: regulatory network
• Biological networks and interactive tool: SpirPro
• Example: Big data technology applied to genome-based analysis
bigdataexperience http://www.sbi.kmutt.ac.th/
http://www.bioinformatics.kmutt.ac.th/
Acknowledgements
Proteome (Algal Biotech)
Bioinformatics & Systems Biology (BIF) program
Systems Biology & Bioinformatics (SBI)
Medical research (collab w/ hospitals)
Cell & Physiology (Algal Biotech)
Comparative Genomics Analysis & Visualization
• Genome bioinformatics’s current success, challenges, and opportunities in the era of low-cost sequencing
• Big data analysis in Metagenomics
• Third Generation Sequencing for Rapid Surveillance
http://www.csbio.org/2015/
http://academy.sbi.kmutt.ac.th/cmg2015/