21
Mining Genotype-Phenotype Associations from Public Knowledge Sources via Semantic Web Querying Richard C. Kiefer Robert R. Freimuth, PhD Christopher G. Chute, MD, DrPH Jyotishman Pathak, PhD AMIA Clinical Research Informatics (CRI) Summit March 20, 2013

Mining Genotype-Phenotype Associations From Public Knowledge Sources

  • Upload
    amia

  • View
    26

  • Download
    0

Embed Size (px)

DESCRIPTION

2013 Summit on Clinical Research Informatics

Citation preview

Page 1: Mining Genotype-Phenotype Associations From Public Knowledge Sources

Mining Genotype-Phenotype Associations from Public Knowledge Sources via Semantic Web Querying Richard C. Kiefer Robert R. Freimuth, PhD Christopher G. Chute, MD, DrPH Jyotishman Pathak, PhD AMIA Clinical Research Informatics (CRI) Summit March 20, 2013

Page 2: Mining Genotype-Phenotype Associations From Public Knowledge Sources

The Linked Clinical Data (LCD) Project

Outline

• Research goals Why?

• Data sources Who?

• Creation of endpoints Where?

•  SPARQL queries How?

• Results What?

•  Analysis Huh?

•  Future research Next?

©2013 MFMER | slide-2

Page 3: Mining Genotype-Phenotype Associations From Public Knowledge Sources

The Linked Clinical Data (LCD) Project

Background

•  Advances have led to a heavy influx of large amounts of association data interlinking genes, SNPs, proteins, diseases and drugs.

•  The Linked Data paradigm provides the requisite backbone for building applications that can address analysis of this information.

• W3C Linked Open Data (LOD) project, where as of September 2012, more than 300 datasets with approximately 30 billion Resource Description Framework (RDF) triples have been connected via more than 500 million links.

©2013 MFMER | slide-3

Page 4: Mining Genotype-Phenotype Associations From Public Knowledge Sources

The Linked Clinical Data (LCD) Project

The LOD “cloud”, September 2012

©2012 MFMER | slide-4

Page 5: Mining Genotype-Phenotype Associations From Public Knowledge Sources

DBPedia: Extracting structured data from the Wikipedia

©2012 MFMER | slide-5

@prefix dbpedia <http://dbpedia.org/resource/>. @prefix dbterm <http://dbpedia.org/property/>. dbpedia:Amsterdam

dbterm:officialName "Amsterdam" ;

dbterm:longd "4" ; dbterm:longm "53" ; dbterm:longs "32" ;

dbterm:website <http://www.amsterdam.nl> ; dbterm:populationUrban "1364422" ; dbterm:areaTotalKm "219" ; ... dbpedia:ABN_AMRO dbterm:location dbpedia:Amsterdam ; ...

Page 6: Mining Genotype-Phenotype Associations From Public Knowledge Sources

The Linked Clinical Data (LCD) Project

Objective

• Our goal is to demonstrate the applicability of SPARQL querying capabilities using three public knowledgebases: • NCBI database for single nucleotide

polymorphisms (dbSNP) • Online Mendelian Inheritance in Man (OMIM) • Gene Wiki Plus (GeneWiki+)

•  To extract disease-gene-SNP associations for six chronic diseases: Arthritis Asthma Cancer Diabetes Dementia Obesity

©2013 MFMER | slide-6

Page 7: Mining Genotype-Phenotype Associations From Public Knowledge Sources

The Linked Clinical Data (LCD) Project

GeneWiki+ • GeneWiki+ enables semantic queries for

retrieval of large amounts of information which also includes additional public data sources.

• GeneWiki+ contains approximately 11,000 detailed entries on human genes with half of them being linked to the Disease Ontology. There are 19,000 articles on SNPs with a tenth of them being directly associated to a disease.

• Creating a local dump of the RDF data files (approximately 80MB) allowed us to experiment with SPARQL queries which could not be executed within the GeneWiki+ environment.

©2013 MFMER | slide-7

Page 8: Mining Genotype-Phenotype Associations From Public Knowledge Sources

The Linked Clinical Data (LCD) Project

OMIM •  The Online Mendelian Inheritance in Man

(OMIM) is a comprehensive knowledgebase of human genes and genetic disorders that is distributed by the U.S. National Center for Biotechnology Information (NCBI) to support research in genomics.

• OMIM contains over 21,500 detailed entries on human genes and genetic disorders, including around 5,400 phenotypes with a clinical synopsis.

•  The OMIM data accessible via the Bio2RDF SPARQL endpoint contains 2,895,940 triples.

©2013 MFMER | slide-8

Page 9: Mining Genotype-Phenotype Associations From Public Knowledge Sources

The Linked Clinical Data (LCD) Project

dbSNP •  The NCBI database for single nucleotide

polymorphisms (dbSNP) contains information about genetic variations. Variations in dbSNP are linked to a genetic locus using Entrez Gene identifiers.

•  In the June 2012 build of dbSNP there were more than 58.5 million reference SNPs and 187.8 million submitted SNPs for homo sapiens.

•  To the best of our knowledge, there is no public dbSNP SPARQL endpoint; therefore, we downloaded the database and created our own endpoint.

©2013 MFMER | slide-9

Page 10: Mining Genotype-Phenotype Associations From Public Knowledge Sources

The Linked Clinical Data (LCD) Project

dbSNP and OMIM RDF graphs

©2013 MFMER | slide-10

Page 11: Mining Genotype-Phenotype Associations From Public Knowledge Sources

The Linked Clinical Data (LCD) Project

Extraction – OMIM/dbSNP •  Perl scripts were run against dbSNP build 134

to create scripts allowing the load of data into MySQL. Using Virtuoso, an endpoint was created through an RDF view of the database.

•  To extract the disease-gene associations from OMIM, we devised a simple SPARQL query for execution at Bio2RDF’s OMIM endpoint

•  Federated queries were run against OMIM to get the genes which have been given an causal relationship to each of the six chronic diseases. Including dbSNP in the federated query provided SNPs which make up those genes.

©2013 MFMER | slide-11

Page 12: Mining Genotype-Phenotype Associations From Public Knowledge Sources

The Linked Clinical Data (LCD) Project

SPARQL – OMIM & dbSNP federation

©2013 MFMER | slide-12

Page 13: Mining Genotype-Phenotype Associations From Public Knowledge Sources

The Linked Clinical Data (LCD) Project

Extraction – GeneWiki+

• We downloaded the February 12th, 2012 snapshot of GeneWiki+ data as RDF files, and loaded the dataset in a local Virtuoso server to create a SPARQL endpoint.

• Unlike OMIM, GeneWiki+ explicitly specifies “disease-SNP” associations (in addition to the “disease-gene” associations).

•  A SPARQL query was run to retrieve the genes which have been given an causal relationship to each of the six chronic diseases. That result set was UNIONed to also retrieve the SNPs having a stated causal relationship.

©2013 MFMER | slide-13

Page 14: Mining Genotype-Phenotype Associations From Public Knowledge Sources

The Linked Clinical Data (LCD) Project

SPARQL – GeneWiki+

©2013 MFMER | slide-14

Page 15: Mining Genotype-Phenotype Associations From Public Knowledge Sources

The Linked Clinical Data (LCD) Project

Results – Gene comparison

• We compared the genes/SNPs returned from both queries and calculated results.

•  The uniqueness of genes was based on a comparison of their Entrez-Gene IDs rather than gene symbols/names.

©2013 MFMER | slide-15

Page 16: Mining Genotype-Phenotype Associations From Public Knowledge Sources

The Linked Clinical Data (LCD) Project

Results – Validation

•  SPARQL queries discovered 130 unique genes were found to be associated with asthma.

•  A little over half of those genes were validated associations through a PubMed citation.

©2013 MFMER | slide-16

Page 17: Mining Genotype-Phenotype Associations From Public Knowledge Sources

The Linked Clinical Data (LCD) Project

Analysis

•  For Asthma, all 17 genes found in OMIM had a PubMed citation. Only 50 of the 113 genes mentioned by GeneWiki+ had PubMed citations.

• Our technique can deduce disease-gene associations that were not explicitly stated in the RDF graph, although further validation and verification is required before such data can be applied effectively.

©2013 MFMER | slide-17

rs3184504

SH2B3

Asthma associatedWith

partOf INFERRED

Page 18: Mining Genotype-Phenotype Associations From Public Knowledge Sources

The Linked Clinical Data (LCD) Project

Analysis

•  There were genes found to be unique in GeneWiki+ based on the query but the relationship to the disease was mentioned within the text of the article in OMIM.

• DPP10 did not show up being associated with Asthma in the query results, but was mentioned and referenced at the website.

©2013 MFMER | slide-18

Page 19: Mining Genotype-Phenotype Associations From Public Knowledge Sources

The Linked Clinical Data (LCD) Project

Deductions

•  The total number of unique disease-gene associations in GeneWiki+ was significantly higher than OMIM. Such a finding is contrary to our initial hypothesis.

•  The overlap between the numbers of genes extracted from GeneWiki+ and OMIM was significantly less (<5% on average) for all the six diseases.

• Overall, the number of unique genes associated with Cancer was highest followed by Diabetes and Arthritis. This is consistent with the number of genome-wide association studies conducted.

©2013 MFMER | slide-19

Page 20: Mining Genotype-Phenotype Associations From Public Knowledge Sources

The Linked Clinical Data (LCD) Project

Limitations and Future Work

• OMIM-dbSNP query results in a list of SNPs where the gene is associated with a disease but the SNPs association cannot be asserted. Therefore could not analyze SNP comparisons.

• OMIM data did not reflect page content.

• GWP+ surfaces more data but only half had PubMed citations to support.

•  Future work will use federated queries to join data between public endpoints and private medical data to investigate patient genotypes and phenotypes.

©2013 MFMER | slide-20

Page 21: Mining Genotype-Phenotype Associations From Public Knowledge Sources

The Linked Clinical Data (LCD) Project

Thank You!

©2012 MFMER | slide-21

http://informatics.mayo.edu/LCD