GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

Maulik R. Kamdar, Aftab Iqbal, Muhammad Saleem, Helena F. Deus and Stefan Decker

Conference on Semantics in Healthcare and Life Sciences (CSHALS), February 26-28, 2014, Boston

Outline

Introduction: Integrative Genomics, Genomic Visualization, Motivation

Methodology: Integration of Data Sources, Co-occurrence Data Cubes, Similarity Measures

GenomeSnip Platform: Genomic Wheel, Genomic Tracks Display

Comparative Evaluation

Discussion: Applicability, Future Work

Integrative Genomics

Advanced computational analyses of genomic datasets, such as Genome-wide Association Studies (GWAS), identify several susceptibility loci (point alterations and oncogenes) associated with various types of cancer, which are then catalogued.

Network-based approaches isolate genes implicated together (‘Co-occurrence’) in any disease or pathway, i.e. macro-molecular associations on a higher, abstract level.

Linked Data and Semantic Web technologies address the challenges of integrating multiple knowledge bases.

Advantages: Representation using Resource Description Framework (RDF), structured querying, integration with other biomedical datasets, data mining and knowledge extraction.

Genomic Visualization

Genome browsers navigate across the human genome in an intuitive, linear fashion. Data points are mapped to the genomic coordinates and datasets are visualized as charts or heatmaps.

Disadvantages: Human perception interprets patterns in depth more readily than along a length. Linear browsers fail to display inter- and intra-chromosomal relations, which humans could otherwise easily interpret and use.

Circular plots overcome the cognitive barriers to grasping relations between disjoint genomic features on an abstract level.

Disadvantages: They focus on the entire human genome or a whole chromosome. Visualizations are rendered as static images and lack interactivity.

Motivation

With the rise of high-throughput gene sequencing technologies, data analysis has replaced data generation as the rate-limiting step for the interpretation of genomic patterns and discovery.

Analyzing genomic datasets using the knowledge extracted through GWAS could help discover new tumour risk hypotheses and diagnose cancer on a personalized basis.

Studying Co-occurrence networks of genes could predict hidden protein-protein interactions (PPI) and functions.

Improved, interactive, intuitive, genomic visualization solutions, which infuse knowledge insights into cancer research, need to be perfected to augment discovery.

GenomeSnip Platform

A semantic, visual analytics prototype devised to expedite knowledge exploration and discovery in cancer research.

Idea: ‘Snip’ the human genome informatively into fragments through interaction with an aggregative, circular visualization, the ‘Genomic Wheel’, and analyze the snipped fragments in depth in a linear ‘Genomic Tracks’ display.

Technologies: Web-based client application developed using native technologies like HTML5 Canvas, JavaScript and JSON. KineticJS library, an HTML5 Canvas JavaScript framework, is used for node nesting, layering, caching and event handling.

Availability: http://srvgal78.deri.ie/genomeSnip/

http://kineticjs.com/


Integration of Data Sources

UCSC Genome Browser: Chromosome bands (ideograms) in the human genome assembly (GRCh37/hg19, Feb 2009).

CellBase: Coordinates and descriptions of the protein-coding genes, and genomic variants such as cancer-related mutations (COSMIC) and SNPs (dbSNP).

Cancer Gene Census: Catalogued oncogenes that bear somatic and germline mutations in their sequences.

UniProt: Disease-gene mappings exposed by the EBI RDF platform.

Kyoto Encyclopedia of Genes and Genomes (KEGG): Pathway-gene mappings exposed as a SPARQL endpoint.

Pubmed2Ensembl: Customized service extended to provide gene-related publication information.

Linked TCGA: Information on genomic alterations in cancer patients, such as DNA methylation, SNPs, CNVs and gene expression changes, mapped to the genomic loci.


Co-occurrence Data Cubes

      Chr1               Chr2               Chr3               …   Total
      Dis  Path  Pub     Dis  Path  Pub     Dis  Path  Pub     …   Dis  Path   Pub
Chr1  28   4049  4522    15   4251  2915    19   4811  2667    …   226  75044  49827
Chr2  15   4251  2915    14   1662  3589    14   2532  1748    …   155  42513  31424
Chr3  19   4811  2667    14   2532  1748    16   1338  1390    …   224  44310  26817

Genes mapped with diseases, pathways and publications are extracted a priori to instantiate co-occurrence pairs between all possible combinations of genes, after identifier conversions.

A co-occurrence data cube of 3 dimensions (segment 1, segment 2 and data source type) and 2 measures (co-occurrence count and names of mapped resources) is created between the genes, ideograms and chromosomes.
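The pair extraction and cube aggregation described above could be sketched in Python as follows. This is a minimal illustration, not the platform's implementation: the gene names, disease names and the function `cooccurrence_cube` are hypothetical, and it assumes that a gene pair counts once whenever the two genes share at least one mapped resource.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_cube(gene_resources, gene_chromosome, source_type):
    """Aggregate gene-gene co-occurrence pairs per chromosome pair.

    Returns {(segment1, segment2, source_type): {"count", "resources"}}.
    Assumption: a gene pair counts once if it shares >= 1 resource.
    """
    cube = defaultdict(lambda: {"count": 0, "resources": set()})
    for g1, g2 in combinations(sorted(gene_resources), 2):
        shared = gene_resources[g1] & gene_resources[g2]
        if not shared:
            continue
        c1, c2 = gene_chromosome[g1], gene_chromosome[g2]
        # Store both orientations so the cube is symmetric; the set
        # collapses to one key for pairs within the same chromosome.
        for key in {(c1, c2, source_type), (c2, c1, source_type)}:
            cube[key]["count"] += 1
            cube[key]["resources"] |= shared
    return dict(cube)


# Hypothetical toy input: gene -> diseases and gene -> chromosome.
gene_resources = {
    "GENE_A": {"disease_x", "disease_y"},
    "GENE_B": {"disease_x"},
    "GENE_C": {"disease_y"},
    "GENE_D": {"disease_z"},
}
gene_chromosome = {
    "GENE_A": "Chr1",
    "GENE_B": "Chr2",
    "GENE_C": "Chr1",
    "GENE_D": "Chr3",
}

cube = cooccurrence_cube(gene_resources, gene_chromosome, "Dis")
for key, cell in sorted(cube.items()):
    print(key, cell["count"], sorted(cell["resources"]))
```

Each key of the returned dictionary corresponds to the three dimensions (segment 1, segment 2, data source type), and each value holds the two measures (count and names of the shared resources).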

The data cubes are transformed to RDF (N-triples syntax) using RDF Data Cube Vocabulary and are stored in a Triple Store.
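As a rough sketch of this transformation, one cube cell could be serialized as a `qb:Observation` in N-Triples as shown below. The `qb:` namespace is that of the RDF Data Cube Vocabulary; the `http://example.org/genomesnip/` namespace, the property names under it and the function `observation_to_ntriples` are assumptions made for illustration and may differ from the URIs actually used by the platform.

```python
QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
EX = "http://example.org/genomesnip/"  # hypothetical namespace


def observation_to_ntriples(obs_id, segment1, segment2, source_type, count):
    """Serialize one data cube cell as N-Triples lines (one triple per line)."""
    subject = f"<{EX}obs/{obs_id}>"
    lines = [
        f"{subject} <{RDF_TYPE}> <{QB}Observation> .",
        f"{subject} <{QB}dataSet> <{EX}dataset/cooccurrence> .",
        # Dimensions (hypothetical property URIs).
        f"{subject} <{EX}segment1> <{EX}segment/{segment1}> .",
        f"{subject} <{EX}segment2> <{EX}segment/{segment2}> .",
        f"{subject} <{EX}sourceType> <{EX}source/{source_type}> .",
        # Measure: co-occurrence count as a typed literal.
        f'{subject} <{EX}cooccurrenceCount> "{count}"^^<{XSD_INTEGER}> .',
    ]
    return "\n".join(lines)


# Chr1-Chr2 disease co-occurrence count taken from the table above.
triples = observation_to_ntriples("chr1-chr2-dis", "Chr1", "Chr2", "Dis", 15)
print(triples)
```

The resulting lines can be loaded into any triple store that accepts N-Triples; the second measure (names of mapped resources) would be added as further triples on the same subject.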

Similarity Measures

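Inspired by Tversky's feature-based similarity measure. The similarity between two entities (chromosomes) is a weighted summation of their common features (the total number of co-occurrence pairs of their contained genes):

$$\mathrm{Sim}_{12} = \alpha \cdot \frac{\mathrm{Chr_1Chr_2Dis}}{\mathrm{HGTotalDis}} + \beta \cdot \frac{\mathrm{Chr_1Chr_2Path}}{\mathrm{HGTotalPath}} + \gamma \cdot \frac{\mathrm{Chr_1Chr_2Pub}}{\mathrm{HGTotalPub}} \qquad (1)$$

HGTotalDis = total number of gene-gene co-occurrence pairs extracted from the entire Human Genome (HG) for a particular data source; α, β and γ are the weights of the disease, pathway and publication terms. HGTotalDis = 1435, HGTotalPath = 83970, HGTotalPub = 129035.

For Chromosomes 1 and 2:

$$\mathrm{Sim}_{12} = \alpha \cdot \frac{15}{1435} + \beta \cdot \frac{4251}{83970} + \gamma \cdot \frac{2915}{129035}$$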

Similarity Measures (contd.)

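The relative similarity impact on the side of any chromosome is calculated by dividing the similarity measure by the maximum similarity for that chromosome (Sim12/Sim1Max).

The maximum similarity for any chromosome is calculated by evaluating Equation 1 with the total number of gene-gene co-occurrence pairs in which the genes of that chromosome (here, Chromosome 1) participate, where α, β and γ are the weights of the disease, pathway and publication terms:

$$\mathrm{Sim}_{1\mathrm{Max}} = \alpha \cdot \frac{\mathrm{Chr_1TotalDis}}{\mathrm{HGTotalDis}} + \beta \cdot \frac{\mathrm{Chr_1TotalPath}}{\mathrm{HGTotalPath}} + \gamma \cdot \frac{\mathrm{Chr_1TotalPub}}{\mathrm{HGTotalPub}}$$

Evaluating with the values of the last column in the data cube:

$$\mathrm{Sim}_{1\mathrm{Max}} = \alpha \cdot \frac{226}{1435} + \beta \cdot \frac{75044}{83970} + \gamma \cdot \frac{49827}{129035}$$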
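The calculation can be traced numerically with a short Python sketch using the counts from the data cube for Chromosomes 1 and 2. The weights 1, 0.4 and 0.1 are the values listed on the Genomic Wheel slide; assigning them to the disease, pathway and publication terms in that order is an assumption, as are the names `weighted_similarity`, `WEIGHTS` and `HG_TOTAL`.

```python
# Assumed assignment of the slide's weight values to the three terms.
WEIGHTS = {"Dis": 1.0, "Path": 0.4, "Pub": 0.1}

# Genome-wide totals of gene-gene co-occurrence pairs per data source.
HG_TOTAL = {"Dis": 1435, "Path": 83970, "Pub": 129035}


def weighted_similarity(counts, weights=WEIGHTS, totals=HG_TOTAL):
    """Weighted sum of co-occurrence counts normalized by genome-wide totals."""
    return sum(weights[s] * counts[s] / totals[s] for s in totals)


# Chr1-Chr2 cell and Chr1 "Total" column of the co-occurrence data cube.
chr1_chr2 = {"Dis": 15, "Path": 4251, "Pub": 2915}
chr1_total = {"Dis": 226, "Path": 75044, "Pub": 49827}

sim_12 = weighted_similarity(chr1_chr2)
sim_1_max = weighted_similarity(chr1_total)

# Relative similarity impact on the Chromosome 1 side of the chord.
relative_impact = sim_12 / sim_1_max

print(f"Sim12 = {sim_12:.4f}")
print(f"Sim1Max = {sim_1_max:.4f}")
print(f"Sim12 / Sim1Max = {relative_impact:.4f}")
```

Because the same weights and totals appear in both the numerator and the denominator, the relative impact always lies between 0 and 1, which makes it convenient as a scaling factor for drawing.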

Genomic Wheel

The human genome is laid out in a circle.

Chromosomes form the arcs on the perimeter, whose length is proportional to the size of each chromosome.

The thickness of the connecting chords is proportional to relative similarity impact between connected chromosomes.

Weights of the disease, pathway and publication terms in the similarity measure: α = 1, β = 0.4, γ = 0.1
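The geometry of such a wheel could be computed along the following lines. This is only a sketch under stated assumptions: the chromosome sizes and similarity values are illustrative placeholders, and the functions `chromosome_arcs` and `chord_widths`, the gap between arcs and the maximum chord width are hypothetical choices, not the platform's actual drawing code.

```python
import math


def chromosome_arcs(sizes, gap=0.02):
    """Assign each chromosome an arc (start, end) in radians.

    Arc spans are proportional to chromosome size; `gap` radians are
    left free after each arc so that neighbouring arcs do not touch.
    """
    total = sum(sizes.values())
    usable = 2 * math.pi - gap * len(sizes)
    arcs, angle = {}, 0.0
    for name, size in sizes.items():
        span = usable * size / total
        arcs[name] = (angle, angle + span)
        angle += span + gap
    return arcs


def chord_widths(sim_12, sim_1_max, sim_2_max, max_width=20.0):
    """Chord width at each end, proportional to the relative similarity
    impact on that side, which gives the chord its tapering shape."""
    return (max_width * sim_12 / sim_1_max, max_width * sim_12 / sim_2_max)


# Illustrative sizes (arbitrary units) and similarity values.
sizes = {"Chr1": 250, "Chr2": 240, "Chr3": 200}
arcs = chromosome_arcs(sizes)
widths = chord_widths(sim_12=0.03, sim_1_max=0.55, sim_2_max=0.40)

for name, (start, end) in arcs.items():
    print(f"{name}: {math.degrees(start):.1f} to {math.degrees(end):.1f} degrees")
print("chord widths (Chr1 side, Chr2 side):", widths)
```

The start and end angles could then be passed to a canvas arc-drawing call, with the two widths used at the two ends of the connecting chord.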

Genomic Wheel (contd.)

Hovering the mouse over a chord displays the corresponding slice of the co-occurrence data cube.

Hierarchical categorization: chromosomes, ideograms, genes and cancer point mutations form layers in the representative arc.

Only the ‘chromosome’ and ‘ideogram’ layers are shown in the initial layout.

Chord’s thickness is proportional to relative similarity impact between Chromosome 1 and 2 on the connected side (tapering).



When an arc is clicked, the represented segment is highlighted and flares out to display the subsequent layers.

At the gene level, the relation chords are drawn in distinct colors (red, green and blue for diseases, pathways and publications respectively) to make them visually discernible.


Genes catalogued in the Cancer Gene Census, whose gene sequences bear somatic and germline mutations responsible for cancer, are represented by shades of red.

Hovering the mouse pointer over any gene displays additional information.

Genomic Tracks Display

Select any tumor and associated patients from the Linked TCGA datasets.

Stacked display of patients’ exon expression (green) and DNA methylation (red) results.

Catalogued cancer-related mutations extracted from COSMIC.

Simultaneous analysis on different genome tracks.

Zooming and automated scrolling controls.

Comparative Evaluation

An empirical evaluation was conducted through a questionnaire (http://goo.gl/vnLtX4) embedded within the GenomeSnip platform. Four PhD students and five researchers working in Biotechnology (primarily protein engineering) and Bioinformatics responded.

Tools compared: UCSC, IGV, Savant, GenomeMaps, Circos, Regulome, GenomeSnip

Features compared: Linear Coordinates, Circular Plot, Web Interface, Data Upload, Third-party modules, TCGA Demonstration, GWAS Insights, Chromosomal Relations, Simultaneous Analysis, Knowledge Integration, Other Visualizations (Heatmap, Network), Dynamicity

Integration with JBrowse or GenomeMaps for a better Genome Tracks View

Data Upload of genomics data of new patients for visual comparison

Introduce Heatmap plots for exon expression data

Applicability of GenomeSnip

Formulating Improved Hypotheses: Integration of the knowledge extracted from GWAS, available literature and pathways, and interpretation through an intuitive visualization, allowing isolation and comparative analysis of interesting genomic segments using the Linked TCGA datasets.

Discovering Protein Interactions: Usage and improved visual discernibility of co-occurrence networks of genes, along with analysis of highly co-occurrent gene pairs using gene/protein expression datasets.

Predicting Tumour Risk on a personalized level: Evaluation of the genomic alterations, expression levels and clinical data of a new patient against those patients registered under the TCGA project, enabling personalized medicine.

Future Work on GenomeSnip

Integrate prostate cancer prediction models into the platform, for validation using Linked TCGA datasets as a training set, and for predicting tumour risk in new patients.

Provide an ‘interaction’ overlay by integrating knowledge on protein-protein interactions (BioGRID), gene co-expression (COXPRESdb) and functions (Gene Ontology).

Extract information like disease variant mutations from UniProt to interlink the ‘point mutations’ layer in the ‘Genomic Wheel’ with other segments.

Improve the granularity of the ‘Genomic Wheel’ and extensively test the user experience and the usability of this platform by conducting a user-driven evaluation.

Thank You! Questions?
