Scripps bioinformatics seminar_day_2


Day 2 of Computing on the shoulders of giants: how existing knowledge is represented and applied in bioinformatics

Benjamin Good
bgood@scripps.edu
Assistant Professor, Department of Molecular and Experimental Medicine

Recap from Day 1
• Make things (articles, genes, antibodies, etc.) easier to find
• Answer questions
• Generate hypotheses

Controlled vocabularies (MeSH)
Ontologies (Gene Ontology)
Knowledge graphs on the Web: the SPARQL query language
Knowledge plus computation = inference: the ABC model

Computing with knowledge
• Challenges with knowledge graphs
  • Too much data
    • ->> query, sort, visualize, interact
  • Not enough data
    • ->> mine for more
• Goal for practical day: Go beyond PubMed!
  • gain hands-on experience using a knowledge graph
  • either with tools built for the purpose or with your own code

Assignment: knowledge graph to hypothesis
• Option 1: Coding
  • Implement and apply an ABC-model-style hypothesis-generating program (you can adapt the example provided)
  • Explain its logic, explain how you used it to generate a hypothesis, and explain the hypothesis (provide a visual)
• Option 2: Non-coding
  • Use one or more knowledge discovery applications (list provided) to define a new hypothesis
  • If you can’t think of where to start, try to explain why Metformin may contribute to cancer survival
• Assignment deliverables: a document containing
  • the inputs you gave to your program or the online tool(s) you used
  • what was generated in response and the underlying logic
  • an image and text describing the results, especially any hypothesis you could derive
  • (for Option 1, also submit any code written or files generated as a tar or zip archive)

Online tools for knowledge discovery
• http://knowledge.bio (* we make this one…)
• http://www.biograph.be (a good tool, but it often breaks down)
• http://epiphanet.uth.tmc.edu (also on the flaky side, but can be good)
• https://skr3.nlm.nih.gov/SemMed/ (works okay, requires a (free) account)
• http://arrowsmith.psych.uic.edu (ugly interface, but a good tool)

Example question: repurposing all drugs

http://tinyurl.com/hwm9388

[Figure: query pattern. ?drug (interacts with) protein (encoded by) gene (genetic association) ?disease; the edge to predict is ?drug (treats??) ?disease]

Example program (feel free to follow or adapt to your interest)
• Example
  • Input = a disease (A)
  • Output = a ranked list of drugs (C) that might be used for treatment
  • Render the results of your workflow as a Cytoscape network that illustrates the reasoning behind the predictions
• Implementation
  • Python
  • Use a SPARQL endpoint such as http://query.wikidata.org
  • + identify and use another endpoint (e.g. EBI, UniProt)
  • ++ access PubMed articles and MeSH indexing
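The disease-to-drug workflow above could start from something like the following rough sketch. It assumes the SPARQLWrapper package is installed and that the Wikidata property IDs in the comments (P2293, P702, P129) mean what the comments say; check them on wikidata.org before relying on them. The function names `build_query`, `run_query`, and `flatten_bindings` are made up for this illustration and are not taken from the course notebook.

```python
ENDPOINT = "https://query.wikidata.org/sparql"

# Assumed Wikidata property IDs (verify on wikidata.org):
#   P2293 = genetic association (gene -> disease)
#   P702  = encoded by (protein -> gene)
#   P129  = physically interacts with (drug -> protein)


def build_query(disease_qid):
    """Build a SPARQL query for drug-protein-gene paths ending at one disease (A)."""
    return f"""
SELECT ?drug ?drugLabel ?gene ?geneLabel WHERE {{
  ?gene wdt:P2293 wd:{disease_qid} .
  ?protein wdt:P702 ?gene .
  ?drug wdt:P129 ?protein .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
"""


def run_query(query, endpoint=ENDPOINT):
    """Send the query to the endpoint and return the parsed SPARQL JSON results."""
    # Imported here so the rest of the sketch works without the package installed.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()


def flatten_bindings(results):
    """Turn SPARQL JSON results into a list of {variable: value} dictionaries."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]
```

To try it, pass the Wikidata identifier (the Q number) of your disease to `build_query`, send the result through `run_query`, and flatten the response with `flatten_bindings`; the flattened rows can then be ranked or exported.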

Python setup
• pip install rdflib SPARQLWrapper pandas …
• Jupyter is hopefully already installed; if not, install it: http://jupyter.readthedocs.io/en/latest/install.html
• Get the notebook from https://github.com/SuLab/sparql_to_pandas/blob/master/SPARQL_pandas.ipynb
• Go to the directory where you put the notebook
• Run it with
  • >jupyter notebook
• It should then be ready to run

The notebook
• will run a basic search for disease-gene-drug connections in Wikidata
• will sort the results by the number of intervening genes
• will export the data to a tab-delimited file you can view in Excel or a text editor, or load into Cytoscape
• Your job:
  • Run it and extend it by one or more of:
    • adapting the query
    • changing the way the results are sorted
    • working with the output in Cytoscape to produce an informative visualization
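The sort-and-export steps can be sketched as below. The notebook itself works with pandas; this version swaps in the standard library only, so it runs without extra installs. The row keys (`drug`, `gene`, `disease`), the edge labels, and the function names are assumptions made for this illustration, not the notebook's actual column names.

```python
import csv
from collections import defaultdict


def rank_drugs_by_gene_count(rows):
    """Rank drugs by how many distinct genes connect them to the disease."""
    genes_per_drug = defaultdict(set)
    for row in rows:
        genes_per_drug[row["drug"]].add(row["gene"])
    ranked = [(drug, len(genes)) for drug, genes in genes_per_drug.items()]
    # Most intervening genes first; ties broken alphabetically for stable output.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


def edge_rows(rows):
    """Convert drug-gene-disease rows into unique (source, interaction, target) edges."""
    edges = set()
    for row in rows:
        edges.add((row["drug"], "interacts_with", row["gene"]))
        edges.add((row["gene"], "associated_with", row["disease"]))
    return sorted(edges)


def write_tsv(edges, path):
    """Write edges to a tab-delimited file with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["source", "interaction", "target"])
        writer.writerows(edges)
```

A source/interaction/target table of this shape is a common way to hand a network to Cytoscape through its table import, and it also opens directly in Excel or a text editor.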

example output rendered in cytoscape

Other queries from Day 1 (slides 48-54)
• Drugs that target a cancer and impact a specific biological process
  • http://tinyurl.com/j222k6g
• Drugs that target a new disease linked, via a biological pathway with shared genes, to the disease the drug is now used to treat
  • http://tinyurl.com/gpfr9kj

Possible inputs for adaptations
• Browse and examine wikidata.org to see what you might make use of
• e.g.
  • Type of physical interaction between gene and drug
  • Gene Ontology annotation (what evidence codes?)
  • Disease Ontology hierarchy
  • Drug characteristics

Other possible knowledge sources
• SPARQL
  • UniProt http://sparql.uniprot.org
  • EBI SPARQL https://www.ebi.ac.uk/rdf/documentation/sparql-endpoints
  • Look for unique identifiers on genes and proteins that you can use to link Wikidata content to their content
• Text
  • Use the NCBI E-utils API to programmatically access PubMed articles and MeSH indexing http://www.ncbi.nlm.nih.gov/books/NBK25501/
  • Can be used to build co-occurrence networks of, e.g., MeSH terms
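For the text route, a minimal sketch of two pieces: building an E-utils `esearch` request URL for PubMed, and counting how often pairs of MeSH terms are indexed on the same article. Fetching and parsing the MeSH headings for each article is not shown; `cooccurrence` simply assumes you already have one list of terms per article. The function names are made up for this illustration.

```python
from collections import Counter
from itertools import combinations
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=20):
    """Build an E-utils esearch URL that returns PubMed IDs for a search term."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def cooccurrence(term_lists):
    """Count, for each pair of terms, the number of articles indexed with both."""
    counts = Counter()
    for terms in term_lists:
        # Sort and deduplicate so each pair is counted once per article,
        # always under the same (alphabetical) key.
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return counts
```

The pair counts can serve as edge weights in a co-occurrence network: each term is a node, and each pair with a nonzero count is an edge.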

Good luck! Ask questions!

ABC ranking algorithms
• Out of all C, which are most strongly related to A?
• Rank by N shared B concepts
  • c2: 4
  • c4: 3
  • c1: 1
  • c3: 1
  • c5: 1
  • c6: 1
• Next level: adjust to down-weight highly connected nodes

[Figure: A linked through B concepts to candidates c1 to c6]
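A small sketch of both ranking ideas. The toy graph is invented so that the shared-B counts match the slide (c2: 4, c4: 3, the rest 1); it is not the graph from the figure. The down-weighting rule, where each B contributes 1 / (number of C nodes it touches), is just one simple choice and is an assumption of this sketch, not something the slides specify.

```python
from collections import Counter, defaultdict

# Toy data (hypothetical): B concepts linked to A, and the C concepts each B links to.
A_TO_B = {"b1", "b2", "b3", "b4"}
B_TO_C = {
    "b1": {"c1", "c2"},
    "b2": {"c2", "c3", "c4"},
    "b3": {"c2", "c4", "c5"},
    "b4": {"c2", "c4", "c6"},
}


def rank_by_shared_b(a_to_b, b_to_c):
    """Rank each C by the number of B concepts it shares with A."""
    counts = Counter()
    for b in a_to_b:
        for c in b_to_c.get(b, ()):
            counts[c] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def rank_down_weighted(a_to_b, b_to_c):
    """Rank each C, letting highly connected B concepts contribute less."""
    scores = defaultdict(float)
    for b in a_to_b:
        targets = b_to_c.get(b, ())
        for c in targets:
            # A B that links to many C nodes is weaker evidence for any one of them.
            scores[c] += 1.0 / len(targets)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


print(rank_by_shared_b(A_TO_B, B_TO_C))
print(rank_down_weighted(A_TO_B, B_TO_C))
```

In this toy graph the plain count leaves c1, c3, c5, and c6 tied, while the down-weighted score places c1 ahead of the other three because its only linking concept (b1) is less widely connected.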

ABC ranking algorithms – advanced (require large networks to be useful)
• Average Minimum Weight (AMW) (Wren)
  • http://bioinformatics.oxfordjournals.org/content/20/3/389.full.pdf
• Linking Term Count with Average Minimum Weight (LTC-AMW) (Yetisgen-Yildiz and Pratt)
  • https://www.researchgate.net/publication/23759128_A_new_evaluation_methodology_for_literature-based_discovery_systems
• Predicate inter-dependence (Rastegar-Mojarad)
  • https://s3.amazonaws.com/uploads.hipchat.com/25885/154162/UaGvvQqbrhPBAWN/A%20new%20method.pdf
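A rough sketch of the first two ideas, under stated assumptions: that AMW for a candidate C is the mean, over its linking B terms, of min(weight(A,B), weight(B,C)), and that LTC-AMW ranks by linking term count and breaks ties with AMW. Check the linked papers for the exact definitions and for how the edge weights are derived; the weights and function names below are invented for illustration.

```python
# Toy weighted edges (hypothetical values).
W_AB = {"b1": 5.0, "b2": 2.0}            # weight of each A-B link
W_BC = {                                  # weight of each B-C link
    "b1": {"c1": 3.0, "c2": 1.0},
    "b2": {"c1": 4.0, "c3": 6.0},
}


def linking_terms(w_ab, w_bc, c):
    """Return the B terms that connect A to candidate c."""
    return [b for b in w_ab if c in w_bc.get(b, {})]


def amw(w_ab, w_bc, c):
    """Average, over linking B terms, of the weaker of the A-B and B-C links."""
    mins = [min(w_ab[b], w_bc[b][c]) for b in linking_terms(w_ab, w_bc, c)]
    return sum(mins) / len(mins) if mins else 0.0


def rank_ltc_amw(w_ab, w_bc):
    """Rank candidates by linking term count, then by AMW, then by name."""
    candidates = {c for b in w_ab for c in w_bc.get(b, {})}
    return sorted(
        candidates,
        key=lambda c: (-len(linking_terms(w_ab, w_bc, c)), -amw(w_ab, w_bc, c), c),
    )


print(rank_ltc_amw(W_AB, W_BC))
```

Taking the minimum along each path means a candidate is only as strong as the weaker of its two links, which is why c3 (minimum 2.0) outranks c2 (minimum 1.0) here even though both have a single linking term.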
