Presentation at the Conference on Semantics in Healthcare and Life Sciences (CSHALS) 2014, Boston. Abstract: Cancer genomics researchers have greatly benefited from high-throughput technologies for the characterization of genomic alterations in patients. These voluminous genomics datasets, when supplemented with the appropriate computational tools, have led to the identification of 'oncogenes' and cancer pathways. However, a researcher who wishes to exploit the datasets in conjunction with this extracted knowledge needs advanced visualizations to augment their cognitive abilities. In this paper, we present GenomeSnip, a visual analytics platform which facilitates the intuitive exploration of the human genome and displays the relationships between different genomic features. Knowledge pertaining to the hierarchical categorization of the human genome, oncogenes and abstract, co-occurring relations has been retrieved from multiple data sources and transformed a priori. We demonstrate how cancer experts could use this platform to interactively isolate genes or relations of interest and perform a comparative analysis on the 20.4-billion-triple Linked Cancer Genome Atlas (Linked TCGA) datasets.
GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research
Maulik R. Kamdar, Aftab Iqbal, Muhammad Saleem, Helena F. Deus and Stefan Decker
Conference on Semantics in Healthcare and Life Sciences (CSHALS), February 26-28, 2014, Boston
Outline
Introduction: Integrative Genomics, Genomic Visualization, Motivation
Methodology: Integration of Data Sources, Co-occurrence Data Cubes, Similarity Measures
GenomeSnip Platform: Genomic Wheel, Genomic Tracks Display
Comparative Evaluation
Discussion: Applicability, Future Work
Integrative Genomics
Advanced computational analyses of genomics datasets: Genome-wide association studies (GWAS) identify several susceptibility loci (point alterations and oncogenes) associated with various types of cancer, which are then catalogued.
Network-based approaches isolate genes implicated together (‘Co-occurrence’) in any disease or pathway, i.e. macro-molecular associations on a higher, abstract level.
Linked Data and Semantic Web Technologies address challenges on integration of multiple knowledgebases.
Advantages: Representation using Resource Description Framework (RDF), structured querying, integration with other biomedical datasets, data mining and knowledge extraction.
Genomic Visualization
Genome browsers navigate across the human genome in an intuitive, linear fashion. Data points are mapped to the genomic coordinates and datasets are visualized as charts or heatmaps.
Disadvantages: Human perceptive faculties interpret patterns better in depth than in length. Linear browsers fail to display inter- and intra-chromosomal relations, which humans could otherwise easily interpret and use.
Circular plots overcome the cognitive barriers to grasping relations between disjoint genomic features on an abstract level.
Disadvantages: They focus on the entire human genome or an entire chromosome. Visualizations are rendered as static images and lack interactivity.
Motivation
With the rise of high-throughput gene sequencing technologies, data analysis has replaced data generation as the rate-limiting step for the interpretation of genomic patterns and discovery.
Analyzing genomic datasets using the knowledge extracted through GWAS could help discover newer tumour risk hypotheses and diagnose cancer on a personalized basis.
Studying Co-occurrence networks of genes could predict hidden protein-protein interactions (PPI) and functions.
Improved, interactive, intuitive, genomic visualization solutions, which infuse knowledge insights into cancer research, need to be perfected to augment discovery.
GenomeSnip Platform
A semantic, visual analytics prototype devised to expedite knowledge exploration and discovery in cancer research.
Idea: ‘Snip’ the human genome informatively in fragments through interaction with an aggregative, circular visualization, the ‘Genomic Wheel’ (circular) and introspectively analyze the snipped fragments in a ‘Genomic Tracks’ (linear) display.
Technologies: Web-based client application developed using native technologies like HTML5 Canvas, JavaScript and JSON. KineticJS library, an HTML5 Canvas JavaScript framework, is used for node nesting, layering, caching and event handling.
Availability: http://srvgal78.deri.ie/genomeSnip/
Integration of Data Sources
UCSC Genome Browser: Chromosome Bands (Ideograms) in the Human Genome Assembly (GRCh37/hg19, Feb 2009).
CellBase: Coordinates and descriptions of the protein-coding genes, genomic variants like cancer-related mutations (COSMIC) and SNPs (dbSNP).
Cancer Gene Census: Catalogued oncogenes, which bear somatic and germline mutations in their sequences.
UniProt: Disease-gene mappings exposed by the EBI RDF platform.
Kyoto Encyclopedia of Genes and Genomes (KEGG): Pathway-gene mappings exposed as a SPARQL endpoint.
Pubmed2Ensembl: Customized service extended to provide gene-related publication information.
Linked TCGA: Information related to genomic alterations like DNA methylations, SNPs, CNVs and gene expression changes in cancer patients, mapped to the genomic loci.
Co-occurrence Data Cubes
        Chr1               Chr2               Chr3               …   Total
        Dis  Path  Pub     Dis  Path  Pub     Dis  Path  Pub     …   Dis  Path   Pub
Chr1    28   4049  4522    15   4251  2915    19   4811  2667    …   226  75044  49827
Chr2    15   4251  2915    14   1662  3589    14   2532  1748    …   155  42513  31424
Chr3    19   4811  2667    14   2532  1748    16   1338  1390    …   224  44310  26817
(Dis = diseases, Path = pathways, Pub = publications)
Genes mapped with diseases, pathways and publications are extracted a priori to instantiate co-occurrence pairs between all possible combinations of genes, after identifier conversions.
A co-occurrence data cube of 3 dimensions (segment 1, segment 2 and data source type) and 2 measures (co-occurrence count and names of mapped resources) is created at each level: genes, ideograms and chromosomes.
The data cubes are transformed to RDF (N-Triples syntax) using the RDF Data Cube Vocabulary and are stored in a triple store.
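The pair extraction and roll-up described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the platform's actual pipeline: the toy mappings, the gene and chromosome names, the `example.org` namespace and its property names are all hypothetical, and the chromosome-level count is assumed to be the number of gene pairs sharing at least one resource. Only `qb:Observation` and `qb:dataSet` come from the RDF Data Cube Vocabulary.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy input: source type -> resource -> genes mapped to it.
MAPPINGS = {
    "Dis": {"disease:D1": ["GENE_A", "GENE_B", "GENE_C"]},
    "Path": {"pathway:P1": ["GENE_A", "GENE_B"],
             "pathway:P2": ["GENE_B", "GENE_C"]},
    "Pub": {"pubmed:1": ["GENE_A", "GENE_C"]},
}
# Hypothetical gene -> chromosome lookup.
GENE_TO_CHR = {"GENE_A": "Chr1", "GENE_B": "Chr1", "GENE_C": "Chr2"}

QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
EX = "http://example.org/cooccurrence/"  # hypothetical namespace


def gene_cube(mappings):
    """Return {(gene1, gene2, source): [shared resources]} for every gene pair."""
    cube = defaultdict(list)
    for source, resources in mappings.items():
        for resource, genes in resources.items():
            for g1, g2 in combinations(sorted(set(genes)), 2):
                cube[(g1, g2, source)].append(resource)
    return dict(cube)


def roll_up(cube, gene_to_segment):
    """Count gene pairs per (segment1, segment2, source), e.g. per chromosome pair."""
    counts = defaultdict(int)
    for (g1, g2, source) in cube:
        s1, s2 = sorted((gene_to_segment[g1], gene_to_segment[g2]))
        counts[(s1, s2, source)] += 1
    return dict(counts)


def observation_ntriples(index, seg1, seg2, source, count):
    """Serialize one cube cell as N-Triples lines (one triple per line)."""
    obs = f"<{EX}obs/{index}>"
    return [
        f"{obs} <{RDF_TYPE}> <{QB}Observation> .",
        f"{obs} <{QB}dataSet> <{EX}dataset> .",
        f"{obs} <{EX}segment1> <{EX}segment/{seg1}> .",
        f"{obs} <{EX}segment2> <{EX}segment/{seg2}> .",
        f"{obs} <{EX}sourceType> <{EX}source/{source}> .",
        f'{obs} <{EX}count> "{count}"^^<{XSD_INTEGER}> .',
    ]


cube = gene_cube(MAPPINGS)
chromosome_counts = roll_up(cube, GENE_TO_CHR)
for i, ((s1, s2, source), n) in enumerate(sorted(chromosome_counts.items())):
    print("\n".join(observation_ntriples(i, s1, s2, source, n)))
```

Keeping the list of shared resources at the gene level and only the pair count at coarser levels mirrors the two measures named above; a real cube would also need the identifier conversions mentioned earlier.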
Similarity Measures
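$$\mathrm{Sim}_{12} = \alpha \cdot \frac{(\mathrm{Chr}_1\mathrm{Chr}_2)_{Dis}}{\mathrm{HG}_{TotalDis}} + \beta \cdot \frac{(\mathrm{Chr}_1\mathrm{Chr}_2)_{Path}}{\mathrm{HG}_{TotalPath}} + \gamma \cdot \frac{(\mathrm{Chr}_1\mathrm{Chr}_2)_{Pub}}{\mathrm{HG}_{TotalPub}} \qquad (1)$$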
Inspired by Tversky's feature-based similarity measure. Similarity between two entities (chromosomes) is a weighted summation of their common features (total number of co-occurrence pairs of contained genes).
HGTotalDis = Total number of gene-gene co-occurrence pairs extracted from the entire Human Genome (HG) for a particular data source (here, diseases).
HGTotalDis = 1435, HGTotalPath = 83970, HGTotalPub = 129035
$$\mathrm{Sim}_{12} = \alpha \cdot \frac{15}{1435} + \beta \cdot \frac{4251}{83970} + \gamma \cdot \frac{2915}{129035}$$
Similarity Measures (contd.)
The relative similarity impact on the side of any chromosome is calculated by dividing the similarity measure by the maximum similarity for that chromosome (Sim12/Sim1max).
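$$\mathrm{Sim}_{1}^{Max} = \alpha \cdot \frac{(\mathrm{Chr}_1)_{TotalDis}}{\mathrm{HG}_{TotalDis}} + \beta \cdot \frac{(\mathrm{Chr}_1)_{TotalPath}}{\mathrm{HG}_{TotalPath}} + \gamma \cdot \frac{(\mathrm{Chr}_1)_{TotalPub}}{\mathrm{HG}_{TotalPub}}$$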
The maximum similarity for any chromosome is calculated using Equation 1, with the total number of gene-gene co-occurrence pairs in which the genes of that chromosome (here, Chromosome 1) participate.
Evaluating with the values of the last column (Total) in the data cube:
$$\mathrm{Sim}_{1}^{Max} = \alpha \cdot \frac{226}{1435} + \beta \cdot \frac{75044}{83970} + \gamma \cdot \frac{49827}{129035}$$
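As a worked example, the similarity, maximum similarity and relative similarity impact for Chromosomes 1 and 2 can be evaluated from the data cube values shown earlier. The sketch below is illustrative only: it uses the three weights listed on the Genomic Wheel slide (1, 0.4 and 0.1) and assumes they apply to diseases, pathways and publications in that order; the function and variable names are hypothetical.

```python
# Genome-wide totals of gene-gene co-occurrence pairs, as given on the slide.
HG_TOTAL = {"Dis": 1435, "Path": 83970, "Pub": 129035}

# Weights from the Genomic Wheel slide; their assignment to the three
# data sources in this order is an assumption.
WEIGHTS = {"Dis": 1.0, "Path": 0.4, "Pub": 0.1}


def weighted_similarity(counts, totals=HG_TOTAL, weights=WEIGHTS):
    """Weighted sum of per-source pair counts, each normalized by the genome total."""
    return sum(weights[s] * counts[s] / totals[s] for s in totals)


# Values read from the co-occurrence data cube table.
chr1_chr2 = {"Dis": 15, "Path": 4251, "Pub": 2915}      # Chr1 x Chr2 cell
chr1_total = {"Dis": 226, "Path": 75044, "Pub": 49827}  # Total column, Chr1 row
chr2_total = {"Dis": 155, "Path": 42513, "Pub": 31424}  # Total column, Chr2 row

sim_12 = weighted_similarity(chr1_chr2)
sim_1_max = weighted_similarity(chr1_total)
sim_2_max = weighted_similarity(chr2_total)

# Relative similarity impact on each side of the Chr1-Chr2 relation.
impact_on_chr1 = sim_12 / sim_1_max
impact_on_chr2 = sim_12 / sim_2_max

print(f"Sim12 = {sim_12:.4f}")
print(f"impact on Chr1 side = {impact_on_chr1:.4f}")
print(f"impact on Chr2 side = {impact_on_chr2:.4f}")
```

Because the two chromosomes have different maxima, the same Sim12 yields a different relative impact on each side, which is what makes an asymmetric rendering of the relation possible.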
Genomic Wheel
The human genome is laid out in a circular layout.
Chromosomes form the arcs on the perimeter, whose length is proportional to the size of each chromosome.
The thickness of the connecting chords is proportional to relative similarity impact between connected chromosomes.
Weights used: α = 1, β = 0.4, γ = 0.1
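The proportional arc layout can be illustrated with a short sketch. It is not taken from the GenomeSnip code base: the function name, the optional `gap` parameter and the toy segment sizes are hypothetical, and real chromosome lengths would come from the genome assembly.

```python
import math


def arc_spans(sizes, gap=0.0):
    """Return {name: (start, end)} in radians, with sweep proportional to size.

    `gap` is an optional angular spacing (radians) left after each arc.
    """
    total = sum(sizes.values())
    usable = 2 * math.pi - gap * len(sizes)
    spans = {}
    start = 0.0
    for name, size in sizes.items():
        sweep = usable * size / total
        spans[name] = (start, start + sweep)
        start += sweep + gap
    return spans


# Hypothetical toy sizes, not real chromosome lengths.
TOY_SIZES = {"ChrA": 200, "ChrB": 100, "ChrC": 100}
spans = arc_spans(TOY_SIZES)
for name, (start, end) in spans.items():
    print(f"{name}: {math.degrees(start):.1f} to {math.degrees(end):.1f} degrees")
```

The same function could be reused for nested layers (ideograms within a chromosome) by calling it on the sub-segment sizes and offsetting the result by the parent arc's start angle.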
Genomic Wheel (contd.)
Hovering the mouse over each chord displays a slice of the co-occurrence data cube.
Hierarchical categorization: chromosome, ideogram, gene and cancer point mutations form layers in the representative arc.
Only the ‘chromosome’ and ‘ideogram’ layers are shown in the initial layout.
Chord’s thickness is proportional to relative similarity impact between Chromosome 1 and 2 on the connected side (tapering).
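The tapering can be sketched as one width per chord end, each scaled by the relative similarity impact on that side. This is a hedged illustration: the function names, the `max_width_px` scale, the linear interpolation along the chord and the input values are all assumptions, not details confirmed by the slides.

```python
def chord_end_widths(sim_12, sim_1_max, sim_2_max, max_width_px=40.0):
    """Return (width at segment 1, width at segment 2) in pixels.

    Each end is scaled by the relative similarity impact on its own side.
    """
    return (max_width_px * sim_12 / sim_1_max,
            max_width_px * sim_12 / sim_2_max)


def width_at(t, widths):
    """Linearly interpolate the chord width at position t in [0, 1] (assumed taper)."""
    w1, w2 = widths
    return w1 + (w2 - w1) * t


# Hypothetical similarity values chosen for easy arithmetic.
widths = chord_end_widths(sim_12=0.03, sim_1_max=0.6, sim_2_max=0.3)
midpoint = width_at(0.5, widths)
print(widths, midpoint)
```

A chord drawn this way is thinner at the segment for which the relation matters less relative to its own maximum, so the direction of the taper itself carries information.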
Genomic Wheel (contd.)
On clicking an arc, the represented segment is highlighted and flares out to display subsequent layers.
At the gene level, the relation chords are drawn in distinct colors (red, green and blue for diseases, pathways and publications respectively) to aid visual discernibility.
Genomic Wheel (contd.)
Genes catalogued in the Cancer Gene Census, whose gene sequences bear somatic and germline mutations responsible for cancer, are represented by shades of red.
Hovering the mouse pointer over any gene displays additional information.
Genomic Tracks Display
Select any tumor and associated patients from the Linked TCGA datasets.
Stacked display of patients’ exon expression (green) and DNA methylation (red) results.
Catalogued cancer-related mutations extracted from COSMIC.
Simultaneous analysis on different genome tracks.
Zooming and automated scrolling controls.
Empirical evaluation was conducted through a questionnaire (http://goo.gl/vnLtX4) embedded within the GenomeSnip platform.
Four PhD students and five researchers working in Biotechnology (primarily protein engineering) and Bioinformatics responded.
Comparative Evaluation
Tools compared: UCSC, IGV, Savant, GenomeMaps, Circos, Regulome and GenomeSnip.
Features compared: Linear Coordinates, Circular Plot, Web Interface, Data Upload, Third-party modules, TCGA Demonstration, GWAS Insights, Chromosomal Relations, Simultaneous Analysis, Knowledge Integration, Other Visualizations (entries include Heatmap and Network) and Dynamicity.
[Table: feature comparison of the tools listed above]
Callouts on the comparison table:
Integration with JBrowse or GenomeMaps for a better Genome Tracks view.
Data upload of genomics data of new patients for visual comparison.
Introduce heatmap plots for exon expression data.
Applicability of GenomeSnip
Formulating Improved Hypotheses: Integration of the knowledge extracted from GWAS, available literature and pathways, and interpretation through an intuitive visualization, allowing isolation and comparative analysis of interesting genomic segments using the Linked TCGA datasets.
Discovering Protein Interactions: Usage and improved visual discernibility of co-occurrence networks of genes, along with analysis of highly co-occurrent gene pairs using gene/protein expression datasets.
Predicting Tumour Risk on a personalized level: Evaluation of the genomic alterations, expression levels and clinical data of a new patient against those of patients registered under the TCGA project, enabling personalized medicine.
Future Work on GenomeSnip
Integrate prostate cancer prediction models into the platform, for validation using Linked TCGA datasets as a training set, and for predicting tumour risk in new patients.
Provide an ‘interaction’ overlay by integrating knowledge on protein-protein interactions (BioGrid), gene co-expression (CoExpresDB) and functions (Gene Ontology).
Extract information like disease variant mutations from UniProt to interlink the ‘point mutations’ layer in the ‘Genomic Wheel’ with other segments.
Improve the granularity of the ‘Genomic Wheel’ and extensively test the user experience and the usability of this platform by conducting a user-driven evaluation.
Thank You! Questions?