33
GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research Maulik R. Kamdar, Aftab Iqbal, Muhammad Saleem, Helena F. Deus and Stefan Decker Conference on Semantics in Healthcare and Life Sciences (CSHALS), February 26-28, 2014, Boston

GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

Embed Size (px)

DESCRIPTION

Presentation in the Conference on Semantics in Healthcare and Life Sciences (CSHALS) 2014, Boston. Abstract: Cancer genomics researchers have greatly benefited from high-throughput technologies for the characterization of genomic alterations in patients. These voluminous genomics datasets when supplemented with the appropriate computational tools have led towards the identification of 'oncogenes' and cancer pathways. However, if a researcher wishes to exploit the datasets in conjunction with this extracted knowledge his cognitive abilities need to be augmented through advanced visualizations. In this paper, we present GenomeSnip, a visual analytics platform, which facilitates the intuitive exploration of the human genome and displays the relationships between different genomic features. Knowledge, pertaining to the hierarchical categorization of the human genome, oncogenes and abstract, co-occurring relations, has been retrieved from multiple data sources and transformed a priori. We demonstrate how cancer experts could use this platform to interactively isolate genes or relations of interest and perform a comparative analysis on the 20.4 billion triples Linked Cancer Genome Atlas (TCGA) datasets.

Citation preview

Page 1: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

GenomeSnip: Fragmenting the

Genomic Wheel to augment discovery

in cancer researchMaulik R. Kamdar, Aftab Iqbal, Muhammad Saleem, Helena

F. Deus and Stefan Decker

Conference on Semantics in Healthcare and Life Sciences (CSHALS), February 26-28, 2014, Boston

Page 2: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

Outline Introduction

Integrative Genomics Genomic Visualization Motivation

Methodology Integration of Data Sources Co-occurrence Data Cubes Similarity Measures

GenomeSnip Platform Genomic Wheel Genomic Tracks Display

Comparative Evaluation Discussion

Applicability Future Work

Page 3: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

Integrative Genomics Advanced computation analyses of genomics datasets –

Genome-wide Association studies (GWAS), identify several susceptible loci (point alterations and oncogene) associated with various types of cancer, and are catalogued.

Network-based approaches isolate genes implicated together (‘Co-occurrence’) in any disease or pathway, i.e. macro-molecular associations on a higher, abstract level.

Linked Data and Semantic Web Technologies address challenges on integration of multiple knowledgebases.

Advantages: Representation using Resource Description Framework (RDF), structured querying, integration with other biomedical datasets, data mining and knowledge extraction.

Page 4: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

Genomic VisualizationGenome browsers navigate across the human genome in an intuitive, linear fashion. Data points are mapped to the genomic coordinates and datasets are visualized as charts or heatmaps. Disadvantages:The perceptive faculties of humans interpret patterns in depth compared to length. Fail to display inter- and intra-chromosomal relations, which could be easily interpreted and used by humans.

Circular plots overcome the cognitive barriers to grasp relations between disjoint genomic features on an abstract level.Disadvantages: Focus on the entire human genome or chromosome.Visualizations rendered as static images and lack interactivity.

Page 5: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

Motivation With the rise of high-throughput gene sequencing technologies, data analysis

has replaced data generation as the rate-limiting step for the interpretation of genomic patterns and discovery.

Analyzing genomic datasets using the knowledge extracted through GWAS could help discovering newer tumour risk hypotheses and diagnosing cancer on a personalized basis.

Studying Co-occurrence networks of genes could predict hidden protein-protein interactions (PPI) and functions.

Improved, interactive, intuitive, genomic visualization solutions, which infuse knowledge insights into cancer research, need to be perfected to augment discovery.

Page 6: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

GenomeSnip PlatformA semantic, visual analytics prototype devised to expedite knowledge

exploration and discovery in cancer research.

Idea: ‘Snip’ the human genome informatively in fragments through interaction with an aggregative, circular visualization, the ‘Genomic Wheel’ (circular) and introspectively analyze the snipped fragments in a ‘Genomic Tracks’ (linear) display.

Technologies: Web-based client application developed using native technologies like HTML5 Canvas, JavaScript and JSON. KineticJS library, an HTML5 Canvas JavaScript framework, is used for node nesting, layering, caching and event handling.

Availability: http://srvgal78.deri.ie/genomeSnip/

Page 7: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

UCSC Genome Browser: Chromosome Bands (Ideograms) in the Human Genome Assembly (GRCh37/hg19, Feb 2009)

UCSC Genome Browser: Chromosome Bands (Ideograms) in the Human Genome Assembly (GRCh37/hg19, Feb 2009)

Integration of Data Sources

Page 8: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

CellBase: Coordinates and descriptions of the protein-coding genes, genomic variants like cancer-related mutations (COSMIC) and SNPs (dbSNP).

CellBase: Coordinates and descriptions of the protein-coding genes, genomic variants like cancer-related mutations (COSMIC) and SNPs (dbSNP).

Integration of Data Sources

Page 9: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

Cancer Gene Census: Catalogued oncogenes, which bear somatic and germline mutations in their sequences

Cancer Gene Census: Catalogued oncogenes, which bear somatic and germline mutations in their sequences

Integration of Data Sources

Page 10: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

UniProt: Disease-gene mappings exposed by the EBI RDF platform.

UniProt: Disease-gene mappings exposed by the EBI RDF platform.

Integration of Data Sources

Page 11: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

UniProt: Disease-gene mappings exposed by the EBI RDF platform.

UniProt: Disease-gene mappings exposed by the EBI RDF platform.

Integration of Data Sources

Kyoto Encyclopedia of Genes and Genomes (KEGG): Pathway-gene Mappings exposed as a SPARQL Endpoint.

Kyoto Encyclopedia of Genes and Genomes (KEGG): Pathway-gene Mappings exposed as a SPARQL Endpoint.

Page 12: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

UniProt: Disease-gene mappings exposed by the EBI RDF platform.

UniProt: Disease-gene mappings exposed by the EBI RDF platform.

Integration of Data Sources

Pubmed2Ensembl: Customized service extended to provide gene-related publication information.

Pubmed2Ensembl: Customized service extended to provide gene-related publication information.

Page 13: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

UCSC Genome Browser: Chromosome Bands (Ideograms) in the Human Genome Assembly (GRCh37/hg19, Feb 2009)

UCSC Genome Browser: Chromosome Bands (Ideograms) in the Human Genome Assembly (GRCh37/hg19, Feb 2009)

CellBase: Coordinates and descriptions of the protein-coding genes, genomic variants like cancer-related mutations (COSMIC) and SNPs (dbSNP).

CellBase: Coordinates and descriptions of the protein-coding genes, genomic variants like cancer-related mutations (COSMIC) and SNPs (dbSNP).

Cancer Gene Census: Catalogued oncogenes, which bear somatic and germline mutations in their sequences

Cancer Gene Census: Catalogued oncogenes, which bear somatic and germline mutations in their sequences

UniProt: Disease-gene mappings exposed by the EBI RDF platform.

UniProt: Disease-gene mappings exposed by the EBI RDF platform.

Kyoto Encyclopedia of Genes and Genomes (KEGG): Pathway-gene Mappings exposed as a SPARQL Endpoint.

Kyoto Encyclopedia of Genes and Genomes (KEGG): Pathway-gene Mappings exposed as a SPARQL Endpoint.

Pubmed2Ensembl: Customized service extended to provide gene-related publication information.

Pubmed2Ensembl: Customized service extended to provide gene-related publication information.

Linked TCGA: Information related to the genomic alterations like DNA methylations, SNPs, CNVs and gene expression changes, in cancer patients, mapped to the genomic loci

Linked TCGA: Information related to the genomic alterations like DNA methylations, SNPs, CNVs and gene expression changes, in cancer patients, mapped to the genomic loci

Integration of Data Sources

Page 14: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

Co-occurrence Data Cubes

Chr1 Chr2 Chr3 … Total

Dis Path Pub Dis Path Pub Dis Path Pub … Dis Path Pub

Chr1 28 4049 4522 15 4251 2915 19 4811 2667 … 226 75044 49827

Chr2 15 4251 2915 14 1662 3589 14 2532 1748 … 155 42513 31424

Chr3 19 4811 2667 14 2532 1748 16 1338 1390 … 224 44310 26817

Genes mapped with diseases, pathways and publications are extracted a priori to instantiate co-occurrence pairs between all possible combinations of genes, after identifier conversions.

A co-occurrence data cube of 3 dimensions (segment 1, segment 2 and data source type) and 2 measures (co-occurrence count and names of mapped resources) is created between the genes, ideograms and chromosomes.

The data cubes are transformed to RDF (N-triples syntax) using RDF Data Cube Vocabulary and are stored in a Triple Store.

Page 15: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

Similarity Measures

1 2 1 2 1 212 * * *

Dis Path Pub

TotalDis TotalPath TotalPub

Chr Chr Chr Chr Chr ChrSim

HG HG HG

Inspired from Tversky's feature-based similarity measure. Similarity between two entities (chromosomes) is a weighted summation of their

common features (total number of co-occurrence pairs of contained genes).

HGTotalDis = Total number of gene-gene co-occurrence pairs extracted from the entire Human Genome (HG) for a particular data source.HGTotalDis = 1435, HGTotalPath = 83970, HGTotalPub = 129035

1215 4251 2915

* * *1435 83970 129035

Sim

Page 16: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

Similarity Measures (contd.)

The relative similarity impact on the side of any chromosome is calculated by dividing the similarity measure with the maximum similarity for the chromosome (Sim12/Sim1max).

1 1 11 * * *

TotalDis TotalPath TotalPubMax

TotalDis TotalPath TotalPub

Chr Chr ChrSim

HG HG HG

Evaluating with the values of last column in the data cube

The maximum similarity for any chromosome, is calculated by using the evaluating Equation 1, with the total number of gene-gene co-occurrence pairs, in which genes of Chromosome 1 participate.

1226 75044 49827

* * *1435 83970 129035

MaxSim

Page 17: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

Genomic WheelThe human genome is laid in a

circular layout.

Chromosomes form the arcs on the perimeter, whose length is proportional to the size of each chromosome.

The thickness of the connecting chords is proportional to relative similarity impact between connected chromosomes.

= 1, = 0.4, = 0.1

Page 18: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

Genomic Wheel (contd.)

Hovering the mouse over each chord displays slice of co-occurrence data cube.

Hierarchical categorization - chromosome, ideogram, gene and cancer point mutations forms layers in the representative arc.

Only the ‘chromosome’ and ‘ideogram’ layer are shown in the initial layout.

Chord’s thickness is proportional to relative similarity impact between Chromosome 1 and 2 on the connected side (tapering).

Chord’s thickness is proportional to relative similarity impact between Chromosome 1 and 2 on the connected side (tapering).

Page 19: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

Genomic Wheel (contd.)

Clicking on each arc, the represented segment is highlighted and flares out to display subsequent layers.

At the genetic level the relation chords are represented using distinct colors (Red, green and blue for diseases, pathways and publications respectively) to enable visual discernibility.

Page 20: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

Genomic Wheel (contd.)

Genes catalogued in the Cancer Gene Census, whose gene sequences bear somatic and germline mutations responsible for cancer, are represented by shades of red.

Hovering the mouse pointer over any gene displays additional information.

Page 21: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

Genomic Tracks Display

Select any tumor and associated patients from LinkedTCGA datasets

Select any tumor and associated patients from LinkedTCGA datasets

Page 22: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

Genomic Tracks Display

Stacked Display of patients’ Exon Expression (Green) and DNA Methylation (Red) Results

Stacked Display of patients’ Exon Expression (Green) and DNA Methylation (Red) Results

Page 23: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

Genomic Tracks Display

Catalogued Cancer-related mutations extracted from COSMIC

Catalogued Cancer-related mutations extracted from COSMIC

Page 24: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

Genomic Tracks DisplaySimultaneous analysis on different genome tracksSimultaneous analysis on different genome tracks

Page 25: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

Genomic Tracks Display

Zooming and Automated Scrolling controlsZooming and Automated Scrolling controls

Page 26: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

Empirical Evaluation conducted through a questionnaire (http://goo.gl/vnLtX4) embedded within the GenomeSnip platform.

4 PhD students and 5 researchers researching in Biotechnology (primarily protein engineering) and Bioinformatics responded.

Empirical Evaluation conducted through a questionnaire (http://goo.gl/vnLtX4) embedded within the GenomeSnip platform.

4 PhD students and 5 researchers researching in Biotechnology (primarily protein engineering) and Bioinformatics responded.

Comparative Evaluation

Page 27: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

UCSC IGVSavan

tGenomeMaps

Circos

Regulome

GenomeSnip

Linear Coordinates Circular Plot Web Interface Data Upload Third-party modules TCGA Demonstration

GWAS Insights Chromosomal Relations Simultaneous Analysis Knowledge Integration Other Visualizations

Heatmap

Heatmap Network

Dynamicity

Comparative Evaluation

Page 28: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

UCSC IGVSavan

tGenomeMaps

Circos

Regulome

GenomeSnip

Linear Coordinates Circular Plot Web Interface Data Upload Third-party modules TCGA Demonstration

GWAS Insights Chromosomal Relations Simultaneous Analysis Knowledge Integration Other Visualizations

Heatmap

Heatmap Network

Dynamicity

Comparative Evaluation

Integration with JBrowse or GenomeMaps for a better Genome Tracks View

Integration with JBrowse or GenomeMaps for a better Genome Tracks View

Page 29: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

UCSC IGVSavan

tGenomeMaps

Circos

Regulome

GenomeSnip

Linear Coordinates Circular Plot Web Interface Data Upload Third-party modules TCGA Demonstration

GWAS Insights Chromosomal Relations Simultaneous Analysis Knowledge Integration Other Visualizations

Heatmap

Heatmap Network

Dynamicity

Comparative Evaluation

Data Upload of genomics data of new patients for visual comparison

Data Upload of genomics data of new patients for visual comparison

Page 30: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

UCSC IGVSavan

tGenomeMaps

Circos

Regulome

GenomeSnip

Linear Coordinates Circular Plot Web Interface Data Upload Third-party modules TCGA Demonstration

GWAS Insights Chromosomal Relations Simultaneous Analysis Knowledge Integration Other Visualizations

Heatmap

Heatmap Network

Dynamicity

Comparative Evaluation

Introduce Heatmap plots for exon expression dataIntroduce Heatmap plots for exon expression data

Page 31: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

Applicability of GenomeSnipFormulating Improved Hypotheses: Integration of the knowledge extracted from GWAS, available literature and

pathways, and interpretation through an intuitive visualization, allowing isolation and comparative analysis of interesting genomic segments using the Linked TCGA datasets.

Discovering Protein Interactions: Usage and improved visual discernibility of co-occurrence networks of

genes, along with analysis of highly co-occurrent gene pairs using gene/protein expression datasets.

Predicting Tumour Risk on a personalized level: Evaluation of the genomic alterations, expression levels and clinical data of

a new patient against those patients registered under the TCGA project, enabling personalized medicine.

Page 32: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

Future Work on GenomeSnip Integrate prostate cancer prediction models into the platform, for

validation using Linked TCGA datasets as a training set, and predicting tumour risk in new patients.

Provide an ‘interaction’ overlay by integrating knowledge on protein-protein interactions (BioGrid), gene co-expression (CoExpresDB) and functions (Gene Ontology).

Extract information like disease variant mutations from UniProt to interlink the ‘point mutations’ layer in the ‘Genomic Wheel’ with other segments.

Improve the granularity of the ‘Genomic Wheel’ and extensively test the user experience and the usability of this platform by conducting a user-driven evaluation.

Page 33: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research

Thank You! Questions?