Data analysis & integration challenges in genomics

Uppsala

March 19, 2015

Mikael Huss, SciLifeLab / Stockholm University

Where I work

INTEGRATIVE AND TECHNOLOGY DRIVEN RESEARCH IN HIGH-THROUGHPUT BIOLOGY

SciLifeLab – an infrastructure for massive biology

Science 328, 805 (14 May 2010)

Inaugurated mid-2010

Hosted by three universities in Stockholm: Karolinska Institutet (medical faculty), Royal Institute of Technology (technical) and Stockholm University (natural science). SciLifeLab node in Uppsala.

Approximately 700 researchers

More than 100 researchers in bioinformatics and systems biology

http://ngi-status.scilifelab.se/

National genomics facilities at SciLifeLab

Clinical Genomics, Clinical Biomarkers, Clinical Sequencing

Functional genomics – Eukaryotic Single Cell Genomics, Single Cell Proteomics, Microbial Single Cell Genomics, Karolinska High Throughput Center (KHTC)

Bioimaging - Advanced Light Microscopy, Fluorescence Correlation Spectroscopy

Drug discovery – ADME, Antibody Therapeutics, Protein Expression & Characterization, Lead Identification, Biophysical Screening etc.

Chemical Biology Consortium Sweden – Umeå, Uppsala, KI

Structural Biology – Protein Science Facility

National facilities at SciLifeLab

Clinical diagnostics

Affinity proteomics – Biobank Profiling, Cell Profiling, Fluorescence Tissue Profiling, Mass Cytometry, PLA Proteomics, Protein and Peptide Arrays, Tissue Profiling

Bioinformatics facilities

• Bioinformatics compute and storage (UPPNEX)

• Short-term support (2 weeks / 80h) + paid extension

– About 45 FTEs

• Long-term support (500h) for projects selected by external committee

“Embedded bioinformaticians” participate in projects on a longer-term basis

Long-term bioinformatics support group

• Currently 13 senior bioinformaticians + 2 managers

• Currently recruiting 10 new employees, thereby expanding from Uppsala and Stockholm to other locations in Sweden

• Example projects (from my own work):

– Characterizing the human muscle transcriptome in connection with exercise

– Metagenomics for looking at the connection between international travel and antibiotic resistance

– Characterizing neural stem cells in developing mouse brain

– Small RNAs involved in the CRISPR/Cas9 system in bacteria

Integrative bioinformatics initiative (“big data” project)

• Advertising for 4 positions, 2 in Gothenburg & 2 in Stockholm

• More in-depth support, experimental planning, method development

• Data integration

Pilot project: Connecting layers of information

• DNA (whole-genome sequencing, exome sequencing, CGH): mutations, SNVs, copy number variations, structural variations, gene fusions

• RNA (RNA-seq, microarrays): mRNA isoforms, allele-specific expression, fusion transcripts, eQTLs

• Proteins (high-throughput mass spectrometry): protein isoforms, post-translational modifications

My blog: Follow the Data

Machine learning, “big data”, “data science”, often in connection with life science

Published brief notes on APIs from One Codex, Google Genomics, SolveBio

Let’s get the “big data” buzzword out of the way …

… but some people are willing to go out on a limb

“Where is the cut-off? The line in the sand is 5TB of unstructured data or 7.5-10TB of structured data, which cannot be reduced any further”

(OLRAC SPS)

http://www.itweb.co.za/index.php?option=com_content&view=article&id=111815

“There is no such thing as biomedical big data”

(Will Bush, Vanderbilt University Center for Human Genetic Research)

http://gettinggeneticsdone.blogspot.se/2014/02/no-such-thing-biomedical-bigdata.html

Genomics big data in context: Throughput

[Figure: data processed per day (terabytes), log scale from 1e+00 to 1e+06, comparing SciLifeLab, King, NYSE, Sanger, Spotify, BGI, Twitter, Facebook, Baidu, NSA, Google, Ebay, the Internet and the world]

Genomics big data in context: Storage

[Figure: data stored (petabytes), log scale from 1 to 10,000, comparing AZ, SciLifeLab, Spotify, Sanger, Novartis, Ebay, Facebook, Baidu, NSA and Google]

Aside: Storage & processing frameworks

Hadoop, the standard solution for “big data” in industry, has not really caught on in genomics … Why? Some ideas:

- Existing computing infrastructure is sufficient

- Or, focused on supercomputing solutions rather than commodity servers

- The programming/sysadmin skills and training are not there

- Many problems not parallelizable

- Not enough flexibility for ad hoc, exploratory analysis

Spark/ADAM: a new framework enabling more interactive and in-memory-oriented analysis

Genomics big data in context: Heterogeneity

“The size of the data is not the whole story.

If the data are uniform, they can almost always be compressed and filtered with traditional methods.

You do not get a ‘big data’ processing challenge until other factors, such as variety, non-uniformity and continuous growth, are added to a large data set.”

(adapted from Aleksi Kallio)

Ideas on improving data integration

1. APIs to mitigate friction in data collection and preprocessing

2. Querying “by data set”

3. Leveraging advances in machine learning

So much public data out there!

APIs

Lowering barriers to entry with APIs (application programming interfaces; ways for a computer program to automatically retrieve information in a defined manner).

“80% of the time of a data scientist is spent finding and preparing the data”

APIs against good reference collections mitigate the hassle of looking for the right data sources, handling different versions/releases, etc.

We should be able to ask questions such as:

“Which gene variants in a patient have previously been associated with a specific disease?” <= addressed by SolveBio and Google Genomics (with the inclusion of the Tute annotation db)

Other questions could be, e.g.:

“Which microorganisms are found in this tissue sample?” <= addressed by the One Codex API

“Which genes are expressed exclusively in the parathyroid gland?”

“What is the most similar expression dataset to this one that I am currently working on?” <= partly addressed by NextBio (but it’s a commercial package!)

“Download all available sequences for arthropoda and store them as FASTQ files” <= addressed by bionode.io

“Give me the publicly available RNA-seq sequences that support this peptide that I found in mass spectrometry and which appears to have been translated from a fusion transcript”
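To make the "download all sequences for a taxon" question concrete, here is a minimal sketch of what such a programmatic query can look like. It is not the bionode.io interface: it assumes the public NCBI E-utilities `esearch` endpoint as a stand-in, and the function name `build_esearch_url` and its defaults are made up for illustration. It only builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode

# Base URL of the public NCBI E-utilities service (an assumed stand-in;
# it is not one of the APIs named on the slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(taxon: str, db: str = "nucleotide", retmax: int = 20) -> str:
    """Return an esearch URL that lists record IDs for one taxon (hypothetical helper)."""
    params = {
        "db": db,                      # Entrez database to search
        "term": f"{taxon}[Organism]",  # restrict the search to the given taxon
        "retmax": retmax,              # maximum number of IDs to return
        "retmode": "json",             # ask for a JSON response
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


url = build_esearch_url("Arthropoda")
print(url)
```

A real pipeline would fetch this URL, page through the returned IDs and download the records, which is the tedious part that a good API client hides from the researcher.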

Data provenance

Researchers often want to look at processed data (avoiding the work of reprocessing everything from scratch) but they want to know how the processing was done.

Each data set should have an “analysis history” attached

Also important for reproducibility and paper writing
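One possible shape for an attached "analysis history" is sketched below. Everything here is an assumption made for illustration: the class names `Step` and `Dataset`, the tool names "trimmer" and "aligner", and the idea of hashing the history into a short fingerprint so that two processed data sets can be checked for identical processing.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Step:
    """One processing step: which tool, which version, which parameters."""
    tool: str
    version: str
    params: Dict[str, str]


@dataclass
class Dataset:
    """A processed data set that carries its own analysis history."""
    name: str
    history: List[Step] = field(default_factory=list)

    def record(self, tool: str, version: str, **params) -> "Dataset":
        # Store parameters as strings so the history serializes cleanly.
        self.history.append(Step(tool, version, {k: str(v) for k, v in params.items()}))
        return self

    def history_json(self) -> str:
        # Deterministic serialization of the steps, in the order they were run.
        return json.dumps([asdict(step) for step in self.history], sort_keys=True)

    def fingerprint(self) -> str:
        # Short hash of the history: equal fingerprints mean identical processing.
        return hashlib.sha256(self.history_json().encode("utf-8")).hexdigest()[:12]


# Hypothetical tool names and parameters, for illustration only.
cortex = Dataset("cortex").record("trimmer", "1.0", min_quality=20).record("aligner", "2.1", reference="mm10")
spinal = Dataset("spinal_cord").record("trimmer", "1.0", min_quality=20).record("aligner", "2.1", reference="mm10")
print(cortex.history_json())
print(cortex.fingerprint() == spinal.fingerprint())
```

The fingerprint depends only on the recorded steps, not on the data set name, so it answers the question "were these two processed the same way?" without reading the full history.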

Querying by data set

Querying by dataset – we often want to relate our dataset to something “out there” without necessarily having a good preconception of what it could be (especially in metagenomics!).

NextBio does an interesting version of this but costs money (has been acquired by Illumina) and focuses on selected types of functional studies.

Using the dataset itself, or a statistical description of it, as a query
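A toy sketch of "the data is the query": the query is an expression profile, and the answer is a ranking of reference profiles by similarity. The reference collection, the expression values and the choice of cosine similarity are all assumptions made for illustration; this is not how NextBio or any other existing service works.

```python
import math
from typing import Dict, List, Tuple

Profile = Dict[str, float]  # gene name -> expression level


def cosine(a: Profile, b: Profile) -> float:
    """Cosine similarity between two profiles; genes missing from a profile count as 0."""
    shared = set(a) & set(b)
    dot = sum(a[g] * b[g] for g in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def query_by_dataset(query: Profile, collection: Dict[str, Profile], top: int = 3) -> List[Tuple[str, float]]:
    """Rank every reference data set by its similarity to the query profile."""
    scored = [(name, cosine(query, profile)) for name, profile in collection.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top]


# Made-up reference collection with toy expression values.
collection = {
    "muscle": {"ACTA1": 9.0, "MYH7": 8.0, "GFAP": 0.5},
    "brain": {"ACTA1": 0.5, "GFAP": 9.0, "SOX2": 6.0},
    "liver": {"ALB": 10.0, "ACTA1": 1.0},
}
my_dataset = {"ACTA1": 7.0, "MYH7": 6.5, "GFAP": 1.0}

hits = query_by_dataset(my_dataset, collection)
for name, score in hits:
    print(f"{name}\t{score:.3f}")
```

At realistic scale the ranking would need an index or a compact statistical summary of each data set rather than a linear scan.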

Jeff Jonas: “Data finds data”, “The data is the query”

“we want to support automated data exploration in ways that are simply not possible today” C Titus Brown (http://ivory.idyll.org/blog/2014-moore-ddd-round2-final.html)

Cumulative biology and metagenomics: The unknown

http://www.ted.com/talks/nathan_wolfe_what_s_left_to_explore.html

“Biological dark matter”

“The unknown continent”

According to one estimate, less than 1% of the viral diversity has been explored!

=> Reference databases very limited!

The unknown

In a recent paper on soil metagenomics, Titus Brown and colleagues report that:

80% of the 398 billion sequences they obtained could not be assembled into putative genes

Of the putative proteins encoded by the sequences that could be assembled into putative genes, 60% could not be matched to anything in the databases!

Ergo…

For metagenomics in particular, but also for other applications, we would like to have everything that has been published indexed in a better way, so that we can relate new data to it. We need to have a constantly growing index.

When we perform a new experiment, we could then relate our results to all of the data out there, not just the part that has made it into the official reference databases.
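A minimal sketch of a "constantly growing index", assuming a plain inverted index from k-mers to the data sets that contain them. The class `KmerIndex`, the data set names and the tiny sequences are made up; a production system would need compact sketches or probabilistic data structures rather than Python sets.

```python
from collections import defaultdict
from typing import Dict, Set


class KmerIndex:
    """Inverted index: k-mer -> set of data set IDs in which it has been seen."""

    def __init__(self, k: int = 4) -> None:
        self.k = k
        self.postings: Dict[str, Set[str]] = defaultdict(set)

    def kmers(self, seq: str) -> Set[str]:
        seq = seq.upper()
        return {seq[i:i + self.k] for i in range(len(seq) - self.k + 1)}

    def add(self, dataset_id: str, seq: str) -> None:
        # The index grows incrementally; nothing is rebuilt when data arrive.
        for kmer in self.kmers(seq):
            self.postings[kmer].add(dataset_id)

    def query(self, seq: str) -> Dict[str, float]:
        """Fraction of the query's k-mers found in each indexed data set."""
        wanted = self.kmers(seq)
        counts: Dict[str, int] = defaultdict(int)
        for kmer in wanted:
            for dataset_id in self.postings.get(kmer, ()):
                counts[dataset_id] += 1
        return {ds: n / len(wanted) for ds, n in counts.items()}


index = KmerIndex(k=4)
index.add("soil_A", "ACGTACGTGGA")   # toy sequences, not real data
index.add("gut_B", "TTTTCCCCGGGG")

before = index.query("ACGTACG")
index.add("marine_C", "GGACGTACGAA")  # a newly published data set is added later
after = index.query("ACGTACG")
print(before)
print(after)
```

The same query gives a richer answer after the index has grown, without any curated reference database being updated in between.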

Machine learning

Google has had great success with deep learning …

Learning to recognize cats from unlabeled YouTube videos (2012)

Neural network with “3 million neurons and 1 billion synapses”

…now it’s all over the place

Inaugural Stockholm deep learning meetup, March 10, 2015

Deep learning

Perhaps deep learning could be used in genomics, proteomics etc to transform diverse data sets into a more general representation which would facilitate data integration?

New datasets can then be overlaid onto representations trained on large collections.

Deep learning in genomics (1)

How do gene expression patterns relate to cell type and state? Hard problem to classify expression profiles into cell types because it is really a hierarchy where different genes are important at different levels of the hierarchy

We may be starting to accumulate enough data to enable a deep learning approach to learn a hierarchical representation of cell state based on expression profiles (particularly with all the single-cell RNA-seq data now coming out)


First step: Casey Greene’s group (Dartmouth)

A denoising autoencoder learned a generalized representation of breast cancer expression profiles based on the METABRIC cohort (>2000 samples). Validated on TCGA.

The nodes in the net can be interpreted to stand for different biological features.

Tan et al. (2015)
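To illustrate the idea of a denoising autoencoder, here is a small NumPy sketch. It is not the implementation of Tan et al.: the data are synthetic (two hidden "programs" driving 20 made-up genes), and the layer size, masking noise, squared-error loss and learning rate are all assumptions chosen to keep the example short.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(n_features, n_hidden, rng):
    """Small random weights and zero biases for encoder and decoder."""
    return (rng.normal(0.0, 0.1, (n_features, n_hidden)), np.zeros(n_hidden),
            rng.normal(0.0, 0.1, (n_hidden, n_features)), np.zeros(n_features))


def forward(x, params):
    w1, b1, w2, b2 = params
    hidden = sigmoid(x @ w1 + b1)       # compressed representation
    recon = sigmoid(hidden @ w2 + b2)   # reconstruction of the input
    return hidden, recon


def reconstruction_loss(x, params):
    """Mean squared reconstruction error per sample, on uncorrupted input."""
    _, recon = forward(x, params)
    return float(np.mean(np.sum((recon - x) ** 2, axis=1)))


def train(x, params, noise=0.2, lr=0.5, epochs=300, seed=1):
    rng = np.random.default_rng(seed)
    w1, b1, w2, b2 = (p.copy() for p in params)
    n = x.shape[0]
    for _ in range(epochs):
        # Denoising: randomly zero out inputs, but reconstruct the clean data.
        corrupted = x * (rng.random(x.shape) > noise)
        hidden, recon = forward(corrupted, (w1, b1, w2, b2))
        d_z2 = 2.0 * (recon - x) * recon * (1.0 - recon) / n
        d_z1 = (d_z2 @ w2.T) * hidden * (1.0 - hidden)
        w2 -= lr * hidden.T @ d_z2
        b2 -= lr * d_z2.sum(axis=0)
        w1 -= lr * corrupted.T @ d_z1
        b1 -= lr * d_z1.sum(axis=0)
    return w1, b1, w2, b2


# Synthetic "expression" data in [0, 1]: 200 samples, 20 genes, 2 latent programs.
rng = np.random.default_rng(0)
data = rng.random((200, 2)) @ rng.random((2, 20)) / 2.0

initial = init_params(20, 5, rng)
trained = train(data, initial)
loss_before = reconstruction_loss(data, initial)
loss_after = reconstruction_loss(data, trained)
print(loss_before, loss_after)
```

After training, each hidden node's activations across samples can be compared with known sample annotations, which is the kind of interpretation the slide refers to.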


Convolutional network for splice site detection

Reads the DNA sequence directly and abstracts into higher-level features.

This network learned patterns of splice sites

And also re-discovered the concept of codons

Hannes Bretschneider: http://www.psi.toronto.edu/~hannes/resources/MLCB2014-Presentation.pdf
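The basic operation of such a network can be sketched as follows: the DNA sequence is one-hot encoded and a convolution filter is slid along it. This is only an illustration and not Bretschneider's model. In particular, the filter below is set by hand to the donor-like hexamer GTAAGT, whereas a trained network would learn its filters from data; the toy sequence is made up.

```python
import numpy as np

ALPHABET = "ACGT"


def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as a (length, 4) matrix; raises ValueError on other letters."""
    indices = [ALPHABET.index(base) for base in seq.upper()]
    encoded = np.zeros((len(indices), 4))
    encoded[np.arange(len(indices)), indices] = 1.0
    return encoded


def conv1d(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide one filter of shape (width, 4) along the sequence, without padding."""
    width = kernel.shape[0]
    return np.array([np.sum(x[i:i + width] * kernel)
                     for i in range(x.shape[0] - width + 1)])


# Hand-set filter for illustration; a real network would learn many such filters.
donor_filter = one_hot("GTAAGT")

sequence = "CCAGGTAAGTCC"  # toy sequence
scores = conv1d(one_hot(sequence), donor_filter)
best = int(np.argmax(scores))
print(scores)
print(best, sequence[best:best + 6])
```

Stacking further layers on top of such filter outputs is what lets the network abstract from single motifs to higher-level features.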

“Classical” machine learning

Predictive modeling as a way to integrate information from different experimental assays.

Example: ongoing mouse neural development project

A number of genome-wide experiments have been done in developing spinal cord and cortex; have measurements/genome-wide signals about:

- Gene expression (RNA-seq)

- Where the Sox2 transcription factor is bound in each tissue (ChIP-seq)

- How open/accessible the chromatin is (DNase-seq)

- Potential transcription factor binding sites (DNase footprints)

as well as some calculated features like certain interesting “DNA words” (transcription factor binding motifs) and how conserved each stretch of DNA is between mice and other organisms.

How to make some sense of all these data?

“Genome browser” view of genomic landscape around a gene

Gene

Conservation

Different data tracks

“Openness”

Sox2 binding

raw signal

peaks

(borrowed from Mark Gerstein)

We decided we are most interested in understanding differences in gene expression between spinal cord and cortex neurons. Can the other measurements help?

Progressively summarized and abstracted the raw signals into blocks with various features => matrix of ~20,000 genes x 13 features

Use machine learning techniques to predict relative gene expression in cortex/spinal cord based on these features (ongoing…)
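As a rough sketch of this kind of predictive modeling, the code below fits a ridge regression on a genes x 13 features matrix and checks it on held-out genes. The slides do not say which techniques are used in the project, so ridge regression is an assumption, and the data are entirely synthetic: random features and a made-up linear relationship standing in for the cortex/spinal cord expression ratio.

```python
import numpy as np


def fit_ridge(x, y, alpha=1.0):
    """Closed-form ridge regression; returns (coefficients, intercept)."""
    x_mean, y_mean = x.mean(axis=0), y.mean()
    centered = x - x_mean
    gram = centered.T @ centered + alpha * np.eye(x.shape[1])
    coef = np.linalg.solve(gram, centered.T @ (y - y_mean))
    return coef, y_mean - x_mean @ coef


def r_squared(y, predicted):
    """Fraction of variance in y explained by the predictions."""
    residual = np.sum((y - predicted) ** 2)
    total = np.sum((y - y.mean()) ** 2)
    return 1.0 - residual / total


# Synthetic stand-in for the ~20,000 genes x 13 features matrix.
rng = np.random.default_rng(0)
features = rng.normal(size=(20000, 13))
true_weights = np.linspace(-1.0, 1.0, 13)  # made-up effect of each feature
log_ratio = features @ true_weights + rng.normal(scale=0.5, size=20000)

# Hold out the last 5,000 genes to estimate predictive performance.
coef, intercept = fit_ridge(features[:15000], log_ratio[:15000])
predicted = features[15000:] @ coef + intercept
r2 = r_squared(log_ratio[15000:], predicted)
print(r2)
print(np.argsort(-np.abs(coef))[:3])  # indices of the most influential features
```

Beyond prediction accuracy, the fitted coefficients show which assays carry information about the expression difference, which is the integrative question asked above.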

Recap

Indexing and querying technology such as Google’s can help genomics researchers by, e.g.:

- Enabling programmatic access to published data (processed but with a known analysis history) to lower the threshold for integrative analysis

- Allowing them to relate their datasets to other published data without overly relying on curated reference databases (cumulative biology)

- Facilitating ingestion into machine learning (e.g. deep learning) systems for learning general features of biological data from a very large set of samples

Extra slides