Three TSRI Tools for capturing, sharing, and applying community knowledge Benjamin Good The Scripps Research Institute @bgood

2015 6 bd2k_biobranch_knowbio


Page 1: 2015 6 bd2k_biobranch_knowbio

Three TSRI Tools for capturing, sharing, and applying community knowledge

Benjamin Good

The Scripps Research Institute

@bgood

Page 2: 2015 6 bd2k_biobranch_knowbio

Outline

• Gene wiki, quick recap, update
• Introducing:
  – http://knowledge.bio
  – http://biobranch.org

Page 3: 2015 6 bd2k_biobranch_knowbio

Gene Wiki (on Wikipedia)


Protein structure

Symbols and identifiers

Tissue expression pattern

Gene Ontology annotations

Links to structured databases

Gene summary

Protein interactions

Linked references

Huss, PLoS Biol, 2008

Page 4: 2015 6 bd2k_biobranch_knowbio

Wikidata


is a

regulates

Interacts with

Protein

Glycoprotein

Neural development

VLDL receptor

Amyloid precursor protein

Property:P31

Property:P128

Property:P129

Q8054

Q187126

Q1345738

Q1979313

Q423510

Q414043

Reelin

http://www.wikidata.org/wiki/Q414043
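
The statements on this slide can be read as subject / property / value triples. A minimal sketch in Python follows. The pairing of each value with a property is inferred from the slide layout and is an assumption, and `Statement`, `entity_url`, and `values_for` are illustrative names, not part of any Wikidata client library.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    subject: str
    prop: str
    value: str


# Pairings of property and value are inferred from the slide layout (an assumption).
REELIN_STATEMENTS = [
    Statement("Reelin", "is a", "Protein"),
    Statement("Reelin", "is a", "Glycoprotein"),
    Statement("Reelin", "regulates", "Neural development"),
    Statement("Reelin", "interacts with", "VLDL receptor"),
    Statement("Reelin", "interacts with", "Amyloid precursor protein"),
]


def entity_url(qid: str) -> str:
    """Build an item page URL in the same form as the one shown on the slide."""
    return f"http://www.wikidata.org/wiki/{qid}"


def values_for(statements, prop):
    """Return the values of every statement that uses the given property label."""
    return [s.value for s in statements if s.prop == prop]


print(entity_url("Q414043"))
print(values_for(REELIN_STATEMENTS, "interacts with"))
```

Keeping statements as plain triples is what makes the content computable: the same list can be filtered by property, by value, or by subject.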

Page 5: 2015 6 bd2k_biobranch_knowbio

A computable Gene (& Disease & Drug) Wiki


Structured data

Here now

Soon

Downstream (but exciting potential..)

Wikipedia(s)

Page 6: 2015 6 bd2k_biobranch_knowbio

Status Update

• Genes and diseases (and, any minute now, drugs) are in Wikidata

• Demonstrations of incorporating this content in Wikipedia are functional

• We've been slowed a little by Wikidata governance policies (they temporarily blocked our bot)

Page 7: 2015 6 bd2k_biobranch_knowbio

Wikidata activities

• YOU can help!
• https://www.wikidata.org/wiki/User:ProteinBoxBot

Join in one of these discussions and voice your support

Page 8: 2015 6 bd2k_biobranch_knowbio

Outline

• Gene wiki
• knowledge.bio
• biobranch.org

Page 9: 2015 6 bd2k_biobranch_knowbio

Knowledge.bio

• Provides a concept-centric view of the scientific literature.
  – You search and interact with concepts rather than documents.
• Main purpose is hypothesis generation
• 2 data sources mined from PubMed
  – 70 million explicit semantic relations ('triples')
  – 200 million implicit gene-disease associations
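
To make "concept-centric" concrete, here is a minimal sketch of indexing triples by concept so that a search returns related concepts rather than documents. The function names, the placeholder concepts, and the predicate labels are all assumptions for illustration; this is not how knowledge.bio is implemented.

```python
from collections import defaultdict


def build_concept_index(triples):
    """Map each concept to the triples in which it appears as subject or object."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((subject, predicate, obj))
        if obj != subject:
            index[obj].append((subject, predicate, obj))
    return index


def related_concepts(index, concept):
    """Return every concept that shares at least one triple with `concept`."""
    related = set()
    for subject, _, obj in index.get(concept, []):
        related.add(obj if subject == concept else subject)
    related.discard(concept)
    return sorted(related)


# Placeholder triples, not real mined relations.
toy_triples = [
    ("concept A", "AFFECTS", "concept B"),
    ("concept C", "ASSOCIATED_WITH", "concept A"),
    ("concept B", "INHIBITS", "concept D"),
]
toy_index = build_concept_index(toy_triples)
print(related_concepts(toy_index, "concept A"))
```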

Page 10: 2015 6 bd2k_biobranch_knowbio

http://knowledge.bio

Page 11: 2015 6 bd2k_biobranch_knowbio

Explicit relations view

Search for concept

View related concepts

(67 results)

Filter results

Page 12: 2015 6 bd2k_biobranch_knowbio

View text where triple was extracted

Page 13: 2015 6 bd2k_biobranch_knowbio

Diseases implicitly related to queried concept: CYP2R1

Page 14: 2015 6 bd2k_biobranch_knowbio

Concepts linking CYP2R1 to Smith-Lemli Opitz Syndrome

Page 15: 2015 6 bd2k_biobranch_knowbio

Table views complemented by a Network view for taking notes..

Page 16: 2015 6 bd2k_biobranch_knowbio

Network (“Map”) view

Cytoscape.js canvas

Auto and manual layout

Save Map as local text file

Load saved map

Page 17: 2015 6 bd2k_biobranch_knowbio

Step 1: find candidate relation

What new diseases might be related to CYP2R1?

Implicit prediction

Page 18: 2015 6 bd2k_biobranch_knowbio

Step 2: find linking concepts

How is CYP2R1 related to SLO syndrome?

Page 19: 2015 6 bd2k_biobranch_knowbio

Step 3: Start building a hypothesis to explain the predicted relation

Do CYP2R1 and DHCR7 participate in a process related to SLO syndrome?
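
Step 2 amounts to finding concepts that are connected to both ends of a predicted relation. A minimal sketch, assuming relations are available as undirected concept pairs; the edges below are a toy example built from the names on these slides, not relations exported from knowledge.bio, and `linking_concepts` is a hypothetical helper.

```python
def linking_concepts(edges, source, target):
    """Return concepts directly connected to both `source` and `target`."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    shared = neighbors.get(source, set()) & neighbors.get(target, set())
    return sorted(shared - {source, target})


# Toy edges for illustration only.
toy_edges = [
    ("CYP2R1", "DHCR7"),
    ("DHCR7", "SLO syndrome"),
    ("CYP2R1", "unrelated concept"),
]
print(linking_concepts(toy_edges, "CYP2R1", "SLO syndrome"))
```

Each shared neighbor is a candidate explanation for the implicit prediction, which the user can then inspect in the explicit relations view.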

Explicit relations view

Page 20: 2015 6 bd2k_biobranch_knowbio

Warning, may prove addictive..

Page 21: 2015 6 bd2k_biobranch_knowbio

Next steps for knowledge.bio

• Enhanced community sharing
• Integration with http://ndexbio.org from the Cytoscape consortium
• Allow user actions to feed back into underlying NLP systems
• Include access to other structured knowledge sources, e.g. Gene Ontology

Page 22: 2015 6 bd2k_biobranch_knowbio

Outline

• Gene wiki
• knowledge.bio
• biobranch.org

Page 23: 2015 6 bd2k_biobranch_knowbio

Breast cancer prognosis: 10-year survival?

Inferring class predictors

find patterns

make predictions on new samples

10-year survival? Yes / No

van't Veer, Laura J., et al. "Gene expression profiling predicts clinical outcome of breast cancer." Nature 415.6871 (2002): 530-536.

Page 24: 2015 6 bd2k_biobranch_knowbio

inferring survival predictors

find patterns, make predictions

1) select genes

2) infer predictor from data (e.g. decision tree, SVM, etc.)

Out of the 25,000+ genes, which small set works together the best?

10-year survival? Yes / No
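
The two steps can be sketched in a few lines of Python. This is a deliberately simple stand-in, assuming a difference-of-class-means ranking for step 1 and a single-gene threshold rule for step 2; the slide itself names decision trees and SVMs, and the gene names and values below are made up.

```python
from statistics import mean


def rank_genes(samples, labels):
    """Step 1: rank genes by the absolute difference between class means."""
    def score(gene):
        pos = [s[gene] for s, y in zip(samples, labels) if y]
        neg = [s[gene] for s, y in zip(samples, labels) if not y]
        return abs(mean(pos) - mean(neg))
    return sorted(samples[0].keys(), key=score, reverse=True)


def fit_stump(samples, labels, gene):
    """Step 2: infer a one-gene predictor with a threshold midway between class means."""
    pos_mean = mean(s[gene] for s, y in zip(samples, labels) if y)
    neg_mean = mean(s[gene] for s, y in zip(samples, labels) if not y)
    threshold = (pos_mean + neg_mean) / 2
    high_is_positive = pos_mean > neg_mean

    def predict(sample):
        return (sample[gene] > threshold) == high_is_positive
    return predict


# Made-up expression values; True stands for "survived 10 years".
toy_samples = [
    {"gene_a": 1.0, "gene_b": 5.0},
    {"gene_a": 1.2, "gene_b": 4.8},
    {"gene_a": 3.0, "gene_b": 5.1},
    {"gene_a": 3.2, "gene_b": 4.9},
]
toy_labels = [True, True, False, False]

ranked = rank_genes(toy_samples, toy_labels)
predictor = fit_stump(toy_samples, toy_labels, ranked[0])
print(ranked, predictor({"gene_a": 1.5, "gene_b": 5.0}))
```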

Page 25: 2015 6 bd2k_biobranch_knowbio

Problem: gene selection instability

instability: different methods, different datasets produce different gene sets for the same phenotype [1]

[1] Griffith, Obi L., et al. "A robust prognostic signature for hormone-positive node-negative breast cancer." Genome Medicine 5.10 (2013).
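
One common way to quantify this kind of instability is the overlap between the gene sets that two methods or datasets select. A minimal sketch using the Jaccard index; the choice of metric is an assumption here, not something taken from [1], and the gene names are placeholders.

```python
def jaccard(genes_a, genes_b):
    """Jaccard index of two gene sets: 1.0 means identical, 0.0 means disjoint."""
    a, b = set(genes_a), set(genes_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Placeholder signatures from two hypothetical selection runs.
signature_1 = {"g1", "g2", "g3"}
signature_2 = {"g2", "g3", "g4"}
print(jaccard(signature_1, signature_2))
```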

Page 26: 2015 6 bd2k_biobranch_knowbio

Problem: the validation gap

training data, test data

validation

validation: predictive signatures often perform worse on independent data created for validation.

Photograph by Richard Hallman, National Geographic Adventure Blog

Page 27: 2015 6 bd2k_biobranch_knowbio

find patterns

make predictions

Adding prior knowledge to the discovery algorithm

<10 yr survival

>10 yr survival

Page 28: 2015 6 bd2k_biobranch_knowbio

Ex.) Network guided forests

Use protein interaction network to find good gene combinations

Dutkowski & Ideker (2011) Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology

Page 29: 2015 6 bd2k_biobranch_knowbio

But most knowledge is not structured

[Figure: number of articles added to PubMed per year, 2000 to 2012; y-axis from 500,000 to 1,000,000]

>100 publications/hour

>194715 publications linked to “breast cancer” since 2000 http://tinyurl.com/brsince2000

Page 30: 2015 6 bd2k_biobranch_knowbio

How can we use unstructured knowledge to improve predictors?

Need a distributed network of intelligent systems that are good at reading and hypothesizing

Like you and your friends

Page 31: 2015 6 bd2k_biobranch_knowbio

A game with a purpose: The Cure

• http://genegames.org/cure
• http://games.jmir.org/2014/2/e7/
• The Cure: Design and Evaluation of a Crowdsourcing Game for Gene Selection for Breast Cancer Survival Prediction. JMIR Serious Games. PMID: 25654473

Page 32: 2015 6 bd2k_biobranch_knowbio

People wanted to control the trees

Page 33: 2015 6 bd2k_biobranch_knowbio

http://biobranch.org

Page 34: 2015 6 bd2k_biobranch_knowbio

Branch Goals

• Provide easy, visual way for non-programmers to use large datasets to answer questions

• Construct libraries of manually crafted predictive models

• Use the collected models to generate ensemble predictors that incorporate the knowledge of the users
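
The third goal can be illustrated with the simplest possible ensemble, a majority vote over user-built models. This is a sketch under assumptions: each model is treated as a plain function from a sample to a label, and the three toy models are invented for the example. How Branch actually combines trees is not described on these slides.

```python
from collections import Counter


def majority_vote(models, sample):
    """Return the label predicted by the most models.

    Ties go to the label that was voted for first.
    """
    votes = Counter(model(sample) for model in models)
    return votes.most_common(1)[0][0]


# Invented stand-ins for a library of manually crafted models.
toy_models = [
    lambda s: "relapse" if s["age"] < 34.5 else "no relapse",
    lambda s: "relapse" if s["grade"] >= 3 else "no relapse",
    lambda s: "no relapse",
]
print(majority_vote(toy_models, {"age": 30, "grade": 3}))
```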

Page 35: 2015 6 bd2k_biobranch_knowbio

Branch walkthrough: Choose a dataset

Page 36: 2015 6 bd2k_biobranch_knowbio

Select evaluation option

Page 37: 2015 6 bd2k_biobranch_knowbio

Tree Builder

Page 38: 2015 6 bd2k_biobranch_knowbio

Split node builder

Each button is a different way to compose a split node in your decision tree

Page 39: 2015 6 bd2k_biobranch_knowbio

Split node

Predictions at leaf nodes

100% correct

56% accurate

View data, adjust split point

If age less than 34.5, predict relapse

If greater, Predict no relapse
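
The per-leaf accuracy figures on this slide come from applying one split to the data and scoring each side separately. A minimal sketch of that calculation; the records below are toy values chosen for illustration, and `evaluate_split` is a hypothetical helper, not Branch code.

```python
def evaluate_split(records, feature, threshold, below_label, above_label):
    """Return the accuracy of each leaf produced by a single threshold split."""
    leaves = {"below": [0, 0], "above": [0, 0]}  # [correct, total] per leaf
    for rec in records:
        side = "below" if rec[feature] < threshold else "above"
        predicted = below_label if side == "below" else above_label
        leaves[side][1] += 1
        if predicted == rec["outcome"]:
            leaves[side][0] += 1
    return {
        side: (correct / total if total else None)
        for side, (correct, total) in leaves.items()
    }


# Toy patient records, not taken from any dataset.
toy_records = [
    {"age": 30, "outcome": "relapse"},
    {"age": 32, "outcome": "relapse"},
    {"age": 40, "outcome": "no relapse"},
    {"age": 50, "outcome": "relapse"},
]
print(evaluate_split(toy_records, "age", 34.5, "relapse", "no relapse"))
```

Adjusting the split point and re-running the function mimics the "View data, adjust split point" interaction.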

Page 40: 2015 6 bd2k_biobranch_knowbio

Single feature splits

Pick from genes or clinical features

Type-ahead search

Statistical ranker

Page 41: 2015 6 bd2k_biobranch_knowbio

Custom feature combination

BRCA2, TOP2B

BRCA2 + TOP2B

Allows user to use a manually composed linear combination of other features

Page 42: 2015 6 bd2k_biobranch_knowbio

Eg: 21 Gene Signature from OncoType Dx

Proliferation: Ki67, STK15, Survivin, CCNB1 (cyclin B1), MYBL2

Invasion: MMP11, CTSL2

HER2: GRB7, HER2

Estrogen: ER, PGR, BCL2, SCUBE2

GSTM1, CD68, BAG1

Reference: ACTB (b-actin), GAPDH, RPLPO, GUS, TFRC

Recurrence Score Algorithm*

1. HER2 group score = 0.9 x GRB7 + 0.1 x HER2 (if the result is less than 8, then the GRB7 group score is considered 8)
2. ER group score = (0.8 x ER + 1.2 x PGR + BCL2 + SCUBE2) ÷ 4
3. Proliferation group score = (Survivin + KI67 + MYBL2 + CCNB1 [the gene encoding cyclin B1] + STK15) ÷ 5 (if the result is less than 6.5, then the proliferation group score is considered 6.5)
4. Invasion group score = (CTSL2 [the gene encoding cathepsin L2] + MMP11 [the gene encoding stromelysin 3]) ÷ 2

RSU = 0.47 x HER2 group score - 0.34 x ER group score + 1.04 x proliferation group score + 0.10 x invasion group score + 0.05 x CD68 - 0.08 x GSTM1 - 0.07 x BAG1

*A Multigene Assay to Predict Recurrence of Tamoxifen-Treated, Node-Negative Breast Cancer
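
The recurrence score on this slide is itself a manually composed linear combination of features, so it translates directly into code. A minimal sketch that follows the four group scores and the RSU formula as written above; it assumes the input is a dictionary of reference-normalized expression values keyed by the gene names used on the slide, and the function name is illustrative.

```python
def recurrence_score_unscaled(expr):
    """Compute RSU from reference-normalized expression values (dict keyed by gene)."""
    # HER2 group score, floored at 8.
    her2_group = max(8.0, 0.9 * expr["GRB7"] + 0.1 * expr["HER2"])
    # ER group score.
    er_group = (0.8 * expr["ER"] + 1.2 * expr["PGR"]
                + expr["BCL2"] + expr["SCUBE2"]) / 4
    # Proliferation group score, floored at 6.5.
    proliferation = max(6.5, (expr["Survivin"] + expr["KI67"] + expr["MYBL2"]
                              + expr["CCNB1"] + expr["STK15"]) / 5)
    # Invasion group score.
    invasion = (expr["CTSL2"] + expr["MMP11"]) / 2
    return (0.47 * her2_group - 0.34 * er_group + 1.04 * proliferation
            + 0.10 * invasion + 0.05 * expr["CD68"]
            - 0.08 * expr["GSTM1"] - 0.07 * expr["BAG1"])


GENES = ["GRB7", "HER2", "ER", "PGR", "BCL2", "SCUBE2", "Survivin", "KI67",
         "MYBL2", "CCNB1", "STK15", "CTSL2", "MMP11", "CD68", "GSTM1", "BAG1"]

# Arbitrary uniform input used only to exercise the formula.
print(recurrence_score_unscaled({gene: 10.0 for gene in GENES}))
```

The two `max` calls implement the thresholds stated in steps 1 and 3; everything else is a weighted sum.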

Page 43: 2015 6 bd2k_biobranch_knowbio

Classifier nodes

Classifier Node

Class B

Class A


Use a trained predictive model, such as a support vector machine, as a node in your tree

Use

Build

Page 44: 2015 6 bd2k_biobranch_knowbio

biobranch tree nodes

Branch decision tree

Class B

Class A


Use a previously constructed tree as node
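
Classifier nodes and tree nodes share one idea: any predictor that separates Class A from Class B can act as the test at a split. A minimal sketch of such a composable tree follows. The `Node` class, the toy classifier (a stand-in for a trained model such as an SVM), and the feature values are assumptions for illustration, not Branch's data model.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Node:
    test: Callable[[dict], bool]    # any predicate over a sample
    if_true: Union["Node", str]     # subtree or leaf label
    if_false: Union["Node", str]

    def predict(self, sample: dict) -> str:
        branch = self.if_true if self.test(sample) else self.if_false
        return branch.predict(sample) if isinstance(branch, Node) else branch


def toy_classifier(sample: dict) -> str:
    """Stand-in for a trained model; a real one would be fitted to data."""
    return "Class A" if sample["BRCA2"] + sample["TOP2B"] > 1.0 else "Class B"


# A previously constructed tree, reused below as a node.
saved_tree = Node(lambda s: s["age"] < 34.5, "Class A", "Class B")

tree = Node(
    test=lambda s: toy_classifier(s) == "Class A",          # classifier node
    if_true=Node(lambda s: saved_tree.predict(s) == "Class A",  # tree as node
                 "Class A", "Class B"),
    if_false="Class B",
)
print(tree.predict({"age": 30, "BRCA2": 0.7, "TOP2B": 0.6}))
```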

Page 45: 2015 6 bd2k_biobranch_knowbio

Visually set decision boundary nodes

Visual split

Class B

Class A


Page 46: 2015 6 bd2k_biobranch_knowbio

Creating a visual split

Draw polygon

Add to tree

Select feature

Select feature
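
A visual split reduces to a point-in-polygon test: each sample is plotted on the two selected features and falls either inside or outside the drawn polygon. A minimal sketch using the standard ray-casting test; whether Branch uses this particular algorithm is an assumption, and the polygon below is a toy example.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: count how many polygon edges a rightward ray from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Toy polygon drawn on two feature axes.
drawn = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, drawn), point_in_polygon(5, 2, drawn))
```

Samples inside the polygon go down one branch of the split and all others go down the other.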

Page 47: 2015 6 bd2k_biobranch_knowbio

Teach students about overfitting..

Page 48: 2015 6 bd2k_biobranch_knowbio

Tree Builder

Page 49: 2015 6 bd2k_biobranch_knowbio

Evaluation panel

View training and testing sets

Performance metrics

Confusion matrix

ROC curve
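
For readers unfamiliar with the items in this panel, here is a minimal sketch of how a confusion matrix and the points of an ROC curve are computed from predictions. The function names and toy values are assumptions, and the ROC helper assumes both classes are present in the data.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for boolean labels and predictions."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fp = sum(1 for a, p in pairs if not a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    tn = sum(1 for a, p in pairs if not a and not p)
    return tp, fp, fn, tn


def roc_points(actual, scores):
    """Return (false positive rate, true positive rate) pairs, one per score threshold."""
    positives = sum(1 for a in actual if a)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp, fp, _, _ = confusion_counts(actual, predicted)
        points.append((fp / negatives, tp / positives))
    return points


# Toy labels and classifier scores.
toy_actual = [True, True, False, False]
toy_scores = [0.9, 0.4, 0.6, 0.1]
print(confusion_counts(toy_actual, [s >= 0.5 for s in toy_scores]))
print(roc_points(toy_actual, toy_scores))
```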

Page 50: 2015 6 bd2k_biobranch_knowbio

Navigation

Save your tree

New tree

Page 51: 2015 6 bd2k_biobranch_knowbio

Tree Collection

Open and edit shared tree

Search trees you create and trees shared with the community

Page 52: 2015 6 bd2k_biobranch_knowbio

Editing shared tree

Tracks which user created each node

Page 53: 2015 6 bd2k_biobranch_knowbio

Next steps

• More user testing
• More datasets
• Lots of users?
• Better models?

training data, test data

validation

Page 55: 2015 6 bd2k_biobranch_knowbio

Thanks

Funding and Support

BioGPS: GM83924
Gene Wiki: GM089820
BD2K COE: GM114833

Andra Waagmeester
Sebastian Burgstaller
Elvira Mitraka

Lynn Schriml
Gang Fu
Evan Bolton
Paul Pavlidis
Peter Robinson
Many WikiDatans

Richard Bruskiewich
http://starinformatics.com

Karthik Gangavarapu
Vyshakh Babji

Andrew Su

The Prince of Crowdsourcing

Implicitome: Kristina Hettne, Leiden University

Contact: [email protected]
@bgood