Mining Genotype-Phenotype Associations From Public Knowledge Sources

Mining Genotype-Phenotype Associations from Public Knowledge Sources via Semantic Web Querying

Richard C. Kiefer
Robert R. Freimuth, PhD
Christopher G. Chute, MD, DrPH
Jyotishman Pathak, PhD

AMIA Clinical Research Informatics (CRI) Summit
March 20, 2013

The Linked Clinical Data (LCD) Project

Outline

• Research goals Why?

• Data sources Who?

• Creation of endpoints Where?

•  SPARQL queries How?

• Results What?

•  Analysis Huh?

•  Future research Next?

©2013 MFMER | slide-2


Background

• Advances have led to a large influx of association data interlinking genes, SNPs, proteins, diseases and drugs.

• The Linked Data paradigm provides the requisite backbone for building applications that can analyze this information.

• In the W3C Linked Open Data (LOD) project, as of September 2012, more than 300 datasets with approximately 30 billion Resource Description Framework (RDF) triples have been connected via more than 500 million links.



The LOD “cloud”, September 2012


DBpedia: Extracting structured data from Wikipedia


@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbterm: <http://dbpedia.org/property/> .

dbpedia:Amsterdam
    dbterm:officialName "Amsterdam" ;
    dbterm:longd "4" ;
    dbterm:longm "53" ;
    dbterm:longs "32" ;
    dbterm:website <http://www.amsterdam.nl> ;
    dbterm:populationUrban "1364422" ;
    dbterm:areaTotalKm "219" ;
    ...

dbpedia:ABN_AMRO
    dbterm:location dbpedia:Amsterdam ;
    ...


Objective

• Our goal is to demonstrate the applicability of SPARQL querying capabilities using three public knowledgebases:
  • NCBI database for single nucleotide polymorphisms (dbSNP)
  • Online Mendelian Inheritance in Man (OMIM)
  • Gene Wiki Plus (GeneWiki+)

• To extract disease-gene-SNP associations for six chronic diseases: Arthritis, Asthma, Cancer, Diabetes, Dementia, Obesity



GeneWiki+

• GeneWiki+ enables semantic queries for retrieval of large amounts of information, including data from additional public data sources.

• GeneWiki+ contains approximately 11,000 detailed entries on human genes, half of which are linked to the Disease Ontology. There are 19,000 articles on SNPs, a tenth of which are directly associated with a disease.

• Creating a local dump of the RDF data files (approximately 80MB) allowed us to experiment with SPARQL queries which could not be executed within the GeneWiki+ environment.



OMIM

• The Online Mendelian Inheritance in Man (OMIM) is a comprehensive knowledgebase of human genes and genetic disorders that is distributed by the U.S. National Center for Biotechnology Information (NCBI) to support research in genomics.

• OMIM contains over 21,500 detailed entries on human genes and genetic disorders, including around 5,400 phenotypes with a clinical synopsis.

•  The OMIM data accessible via the Bio2RDF SPARQL endpoint contains 2,895,940 triples.



dbSNP

• The NCBI database for single nucleotide polymorphisms (dbSNP) contains information about genetic variations. Variations in dbSNP are linked to a genetic locus using Entrez Gene identifiers.

• In the June 2012 build of dbSNP there were more than 58.5 million reference SNPs and 187.8 million submitted SNPs for Homo sapiens.

•  To the best of our knowledge, there is no public dbSNP SPARQL endpoint; therefore, we downloaded the database and created our own endpoint.



dbSNP and OMIM RDF graphs



Extraction – OMIM/dbSNP

• Perl scripts were run against dbSNP build 134 to generate scripts for loading the data into MySQL. Using Virtuoso, a SPARQL endpoint was created through an RDF view of the database.

• To extract the disease-gene associations from OMIM, we devised a simple SPARQL query for execution at Bio2RDF’s OMIM endpoint.

• Federated queries were run against OMIM to get the genes that have been given a causal relationship to each of the six chronic diseases. Including dbSNP in the federated query provided the SNPs linked to those genes.



SPARQL – OMIM & dbSNP federation
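The query itself is not reproduced in this text. As a rough illustration only, a federated query of this kind could be assembled as in the sketch below, which uses SPARQL 1.1 SERVICE clauses to join genes from one endpoint with SNPs from another. The endpoint URLs, the `ex:` prefix and every predicate name are hypothetical placeholders, not the vocabulary used by Bio2RDF or by the local dbSNP endpoint.

```python
# Sketch only: endpoint URLs, the ex: prefix, and all predicates are
# hypothetical placeholders, not the real Bio2RDF or dbSNP vocabularies.
OMIM_ENDPOINT = "http://example.org/omim/sparql"    # placeholder
DBSNP_ENDPOINT = "http://example.org/dbsnp/sparql"  # placeholder


def build_federated_query(disease: str) -> str:
    """Return a SPARQL 1.1 query joining OMIM genes to dbSNP SNPs."""
    # A real implementation should escape `disease` before embedding it.
    return f"""
PREFIX ex: <http://example.org/vocab#>
SELECT DISTINCT ?geneId ?snp WHERE {{
  SERVICE <{OMIM_ENDPOINT}> {{
    ?phenotype ex:label ?label ;
               ex:causedByGene ?geneId .
    FILTER(CONTAINS(LCASE(STR(?label)), "{disease.lower()}"))
  }}
  SERVICE <{DBSNP_ENDPOINT}> {{
    ?snp ex:locatedInGene ?geneId .
  }}
}}
""".strip()


query = build_federated_query("Asthma")
print(query)
```

The shared variable `?geneId` is what joins the two SERVICE blocks; in practice the join key would be whatever gene identifier both graphs expose.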



Extraction – GeneWiki+

• We downloaded the February 12th, 2012 snapshot of GeneWiki+ data as RDF files, and loaded the dataset into a local Virtuoso server to create a SPARQL endpoint.

• Unlike OMIM, GeneWiki+ explicitly specifies “disease-SNP” associations (in addition to the “disease-gene” associations).

• A SPARQL query was run to retrieve the genes that have been given a causal relationship to each of the six chronic diseases. That result set was UNIONed with a second pattern to also retrieve the SNPs having a stated causal relationship.



SPARQL – GeneWiki+
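The query itself is not reproduced in this text. The sketch below shows one way a gene pattern and a SNP pattern could be combined with UNION in a single query. The `gw:` prefix and both association predicates are hypothetical and do not reflect the actual GeneWiki+ property names.

```python
# Sketch only: the gw: prefix and both predicates are hypothetical, not the
# actual GeneWiki+ property names.
def build_genewiki_query(disease: str) -> str:
    """Return a SPARQL query that UNIONs gene and SNP associations."""
    return f"""
PREFIX gw: <http://example.org/genewiki#>
SELECT DISTINCT ?entity ?kind WHERE {{
  ?disease gw:label "{disease}" .
  {{
    ?entity gw:geneAssociatedWith ?disease .
    BIND("gene" AS ?kind)
  }}
  UNION
  {{
    ?entity gw:snpAssociatedWith ?disease .
    BIND("SNP" AS ?kind)
  }}
}}
""".strip()


genewiki_query = build_genewiki_query("Asthma")
print(genewiki_query)
```

The `?kind` column is an illustrative convenience that lets the caller separate genes from SNPs in the combined result set.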



Results – Gene comparison

• We compared the genes and SNPs returned by the two queries (OMIM/dbSNP and GeneWiki+) and tallied the results.

•  The uniqueness of genes was based on a comparison of their Entrez-Gene IDs rather than gene symbols/names.
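Comparing on Entrez Gene IDs avoids counting one gene twice when two sources use different symbols or aliases for it. A minimal sketch of such a comparison follows. The IDs and symbols are made up for illustration and are not results from the study, and the overlap measure (shared genes as a share of all unique genes) is an assumption, since the slides do not define it.

```python
# Illustrative only: the IDs and symbols below are made up for the example
# and are not results from the study.
omim_genes = {"1001": "GENE_A", "1002": "GENE_B", "1003": "GENE_C"}
genewiki_genes = {"1002": "GENE_B_ALIAS", "1004": "GENE_D"}

# Compare on the dictionary keys (Entrez Gene IDs), not on the symbols.
omim_ids = set(omim_genes)
genewiki_ids = set(genewiki_genes)

shared = omim_ids & genewiki_ids
only_omim = omim_ids - genewiki_ids
only_genewiki = genewiki_ids - omim_ids
all_unique = omim_ids | genewiki_ids

# One possible overlap measure (an assumption; the slides do not define it):
# shared genes as a percentage of all unique genes from both sources.
overlap_pct = 100 * len(shared) / len(all_unique)

print(sorted(shared), sorted(only_omim), sorted(only_genewiki))
print(f"overlap: {overlap_pct:.1f}%")
```

Gene `1002` is matched even though the two sources label it differently, which a symbol-based comparison would miss.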



Results – Validation

• The SPARQL queries found 130 unique genes associated with asthma.

• A little over half of those associations were validated through a PubMed citation.
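The "little over half" figure can be checked against the per-source asthma counts reported on the Analysis slide (17 genes from OMIM, all cited; 113 from GeneWiki+, 50 cited). The sketch assumes the two asthma gene sets do not overlap, which is consistent with 17 + 113 = 130 unique genes.

```python
# Asthma counts as reported on the slides.
omim_total, omim_cited = 17, 17
genewiki_total, genewiki_cited = 113, 50

# Assumes no gene is shared between the two sources for asthma,
# consistent with 17 + 113 = 130 unique genes.
total = omim_total + genewiki_total
cited = omim_cited + genewiki_cited

print(f"validated overall: {cited}/{total} = {cited / total:.1%}")
print(f"validated in GeneWiki+: {genewiki_cited / genewiki_total:.1%}")
```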



Analysis

•  For Asthma, all 17 genes found in OMIM had a PubMed citation. Only 50 of the 113 genes mentioned by GeneWiki+ had PubMed citations.

• Our technique can deduce disease-gene associations that were not explicitly stated in the RDF graph, although further validation and verification are required before such data can be applied effectively.


Figure: rs3184504 associatedWith Asthma; rs3184504 partOf SH2B3; SH2B3 associatedWith Asthma (INFERRED)
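The inferred association in the rs3184504 / SH2B3 / Asthma example can be sketched as a simple rule over triples. The predicate names follow the figure labels. The rule and the helper `infer_gene_disease` are simplified assumptions for illustration, not the authors' implementation.

```python
# Sketch of the inference in the figure. Predicate names follow the figure
# labels; the rule itself is a simplified assumption, not the authors' code.
triples = {
    ("rs3184504", "associatedWith", "Asthma"),
    ("rs3184504", "partOf", "SH2B3"),
}


def infer_gene_disease(triples):
    """If a SNP is associatedWith a disease and partOf a gene,
    infer that the gene is associatedWith the disease."""
    # Index SNP -> genes from the partOf statements.
    snp_to_genes = {}
    for s, p, o in triples:
        if p == "partOf":
            snp_to_genes.setdefault(s, set()).add(o)

    # Propagate each SNP-disease association to the SNP's genes,
    # keeping only statements not already present in the graph.
    inferred = set()
    for s, p, o in triples:
        if p == "associatedWith":
            for gene in snp_to_genes.get(s, ()):
                candidate = (gene, "associatedWith", o)
                if candidate not in triples:
                    inferred.add(candidate)
    return inferred


inferred = infer_gene_disease(triples)
print(inferred)
```

Associations produced this way are candidates only and, as noted above, still need validation before use.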


Analysis

• Some genes appeared to be unique to GeneWiki+ based on the query results, even though their relationship to the disease was mentioned in the text of the corresponding OMIM article.

• DPP10 did not appear as associated with Asthma in the OMIM query results, but the association was mentioned and referenced on the OMIM website.



Deductions

• The total number of unique disease-gene associations in GeneWiki+ was significantly higher than in OMIM. Such a finding is contrary to our initial hypothesis.

• The overlap between the genes extracted from GeneWiki+ and OMIM was small (<5% on average) across all six diseases.

• Overall, the number of unique genes associated with Cancer was highest followed by Diabetes and Arthritis. This is consistent with the number of genome-wide association studies conducted.



Limitations and Future Work

• The OMIM-dbSNP query returns a list of SNPs whose gene is associated with a disease, but the SNP's own association with the disease cannot be asserted. Therefore, we could not analyze SNP comparisons.

• The OMIM data available at the endpoint did not fully reflect the content of the OMIM web pages.

• GeneWiki+ surfaces more associations, but only about half had supporting PubMed citations.

•  Future work will use federated queries to join data between public endpoints and private medical data to investigate patient genotypes and phenotypes.



Thank You!


http://informatics.mayo.edu/LCD
