Three TSRI Tools for capturing, sharing, and applying community knowledge
Benjamin Good
The Scripps Research Institute
@bgood
Outline
• Gene wiki, quick recap, update
• Introducing:
– http://knowledge.bio
– http://biobranch.org
Gene Wiki (on Wikipedia)
Protein structure
Symbols and identifiers
Tissue expression pattern
Gene Ontology annotations
Links to structured databases
Gene summary
Protein interactions
Linked references
Huss, PLoS Biol, 2008
Wikidata
Reelin (Q414043): http://www.wikidata.org/wiki/Q414043
• is a (Property:P31): Protein (Q8054), Glycoprotein (Q187126)
• regulates (Property:P128): Neural development (Q1345738)
• interacts with (Property:P129): VLDL receptor (Q1979313), Amyloid precursor protein (Q423510)
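The Reelin statements on this slide can be written down as subject/property/object triples, which is what makes them computable. The sketch below is illustrative only: the pairing of each identifier with its label, and of each object with its property, is inferred from the order of the labels on the slide, and the helper names `objects_of` and `describe` are hypothetical.

```python
# Labels as they appear on the slide; the ID-to-label pairing is an
# assumption based on the order in which they are listed.
LABELS = {
    "Q414043": "Reelin",
    "Q8054": "Protein",
    "Q187126": "Glycoprotein",
    "Q1345738": "Neural development",
    "Q1979313": "VLDL receptor",
    "Q423510": "Amyloid precursor protein",
    "P31": "is a",
    "P128": "regulates",
    "P129": "interacts with",
}

# Each statement is a (subject, property, object) triple of identifiers.
STATEMENTS = [
    ("Q414043", "P31", "Q8054"),
    ("Q414043", "P31", "Q187126"),
    ("Q414043", "P128", "Q1345738"),
    ("Q414043", "P129", "Q1979313"),
    ("Q414043", "P129", "Q423510"),
]


def objects_of(subject, prop, statements=STATEMENTS):
    """Return every object linked to `subject` through `prop`."""
    return [o for s, p, o in statements if s == subject and p == prop]


def describe(statement):
    """Render one triple with human-readable labels."""
    s, p, o = statement
    return f"{LABELS[s]} {LABELS[p]} {LABELS[o]}"


if __name__ == "__main__":
    for st in STATEMENTS:
        print(describe(st))
```

Because every statement has the same shape, a question such as "what does Reelin interact with?" becomes a simple filter over the list.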
A computable Gene (& Disease & Drug) Wiki
Structured data
Here now
Soon
Downstream (but exciting potential...)
Wikipedia(s)
Status Update
• Genes, diseases (and, any minute now, drugs) are in Wikidata
• Demonstrations of incorporating this content in Wikipedia are functional
• We’ve been slowed a little by Wikidata governance policies (our bot was temporarily blocked)
Wikidata activities
• YOU can help!
• https://www.wikidata.org/wiki/User:ProteinBoxBot
Join one of these discussions and voice your support
Outline
• Gene wiki
• knowledge.bio
• biobranch.org
Knowledge.bio
• Provides a concept-centric view of the scientific literature.
– You search and interact with concepts rather than documents.
• Main purpose is hypothesis generation
• 2 data sources mined from PubMed
– 70 million explicit semantic relations (‘triples’)
– 200 million implicit gene-disease associations
http://knowledge.bio
Explicit relations view
Search for concept
View related concepts
(67 results)
Filter results
View text where triple was extracted
Diseases implicitly related to queried concept: CYP2R1
Concepts linking CYP2R1 to Smith-Lemli-Opitz syndrome
Table views complemented by a Network view for taking notes..
Network (“Map”) view
Cytoscape.js canvas
Auto and manual layout
Save Map as local text file
Load saved map
Step 1: find candidate relation
What new diseases might be related to CYP2R1?
Implicit prediction
Step 2: find linking concepts
How is CYP2R1 related to SLO syndrome?
Step 3: Start building a hypothesis to explain the predicted relation
Do CYP2R1 and DHCR7 participate in a process related to SLO syndrome?
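Step 2 amounts to finding concepts that are connected to both ends of the predicted relation. The sketch below shows that idea on a toy list of triples. It is not how knowledge.bio is implemented: the triples, the `ASSOCIATED_WITH` predicate, the `concept_x` and `concept_y` placeholders, and the function names are all made up for illustration.

```python
from collections import defaultdict

# Toy triples for illustration only; these are NOT knowledge.bio output.
TRIPLES = [
    ("CYP2R1", "ASSOCIATED_WITH", "DHCR7"),
    ("DHCR7", "ASSOCIATED_WITH", "Smith-Lemli-Opitz syndrome"),
    ("CYP2R1", "ASSOCIATED_WITH", "concept_x"),
    ("concept_y", "ASSOCIATED_WITH", "Smith-Lemli-Opitz syndrome"),
]


def neighbors(triples):
    """Build an undirected adjacency map from (subject, predicate, object) triples."""
    adj = defaultdict(set)
    for subject, _predicate, obj in triples:
        adj[subject].add(obj)
        adj[obj].add(subject)
    return adj


def linking_concepts(a, c, triples):
    """Return concepts directly connected to both `a` and `c`."""
    adj = neighbors(triples)
    return sorted((adj[a] & adj[c]) - {a, c})


if __name__ == "__main__":
    print(linking_concepts("CYP2R1", "Smith-Lemli-Opitz syndrome", TRIPLES))
```

Concepts tied to only one end (`concept_x`, `concept_y`) drop out, leaving the shared intermediate as the starting point for a hypothesis.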
Explicit relations view
Warning, may prove addictive..
Next steps for knowledge.bio
• Enhanced community sharing
• Integration with http://ndexbio.org from the Cytoscape consortium
• Allow user actions to feed back into underlying NLP systems
• Include access to other structured knowledge sources, e.g. Gene Ontology
Outline
• Gene wiki
• knowledge.bio
• biobranch.org
Breast cancer prognosis: 10 year survival?
find patterns
Inferring class predictors
No
van't Veer, Laura J., et al. "Gene expression profiling predicts clinical outcome of breast cancer." Nature 415.6871 (2002): 530-536.
Yes
make predictions on new samples
No
Yes
10 year survival?
find patterns, make predictions
inferring survival predictors
1) select genes
2) infer predictor from data (e.g. decision tree, SVM, etc.)
Out of the 25,000+ genes, which small set works together the best?
No
Yes
10 year survival?
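The two steps above (select genes, then infer a predictor) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method used in the cited study: genes are ranked by a simple signal-to-noise score, the predictor is a nearest-centroid rule, and the gene names and expression values are invented.

```python
from statistics import mean, pstdev

# Hypothetical expression values; True means the patient survived 10 years.
samples = [
    {"gene_a": 5.0, "gene_b": 1.0, "gene_c": 3.0},
    {"gene_a": 6.0, "gene_b": 2.0, "gene_c": 1.0},
    {"gene_a": 1.0, "gene_b": 1.5, "gene_c": 2.0},
    {"gene_a": 2.0, "gene_b": 1.0, "gene_c": 4.0},
]
labels = [True, True, False, False]


def rank_genes(samples, labels, k):
    """Step 1: keep the k genes whose class means differ most relative to their spread."""
    scores = {}
    for gene in samples[0]:
        pos = [s[gene] for s, y in zip(samples, labels) if y]
        neg = [s[gene] for s, y in zip(samples, labels) if not y]
        spread = (pstdev(pos) + pstdev(neg)) or 1.0  # avoid division by zero
        scores[gene] = abs(mean(pos) - mean(neg)) / spread
    return sorted(scores, key=scores.get, reverse=True)[:k]


def fit_centroids(samples, labels, genes):
    """Step 2: summarize each class by its mean expression over the selected genes."""
    centroids = {}
    for cls in (True, False):
        rows = [s for s, y in zip(samples, labels) if y == cls]
        centroids[cls] = {g: mean(r[g] for r in rows) for g in genes}
    return centroids


def predict(sample, centroids):
    """Assign a new sample to the class with the closest centroid."""
    def distance(centroid):
        return sum((sample[g] - v) ** 2 for g, v in centroid.items())
    return min(centroids, key=lambda cls: distance(centroids[cls]))


if __name__ == "__main__":
    selected = rank_genes(samples, labels, k=1)
    model = fit_centroids(samples, labels, selected)
    print(selected, predict({"gene_a": 5.2, "gene_b": 1.0, "gene_c": 2.0}, model))
```

With 25,000+ genes and only a few hundred samples, many different gene subsets score almost equally well in step 1, which is the root of the instability discussed next.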
Problem: gene selection instability
instability: different methods, different datasets produce different gene sets for the same phenotype [1]
[1] Griffith, Obi L., et al. "A robust prognostic signature for hormone-positive node-negative breast cancer." Genome Medicine 5.10 (2013).
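One common way to put a number on this instability is the overlap between the gene sets two methods or datasets produce. The sketch below uses the Jaccard index; the choice of metric and the two gene sets are assumptions made for illustration and are not taken from the cited paper.

```python
def jaccard(a, b):
    """Jaccard index: shared genes divided by all genes seen in either set."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty signatures are trivially identical
    return len(a & b) / len(a | b)


# Hypothetical signatures for the same phenotype from two different analyses.
signature_1 = {"gene_a", "gene_b", "gene_c", "gene_d"}
signature_2 = {"gene_c", "gene_d", "gene_e", "gene_f"}

overlap = jaccard(signature_1, signature_2)  # 2 shared genes out of 6 distinct

if __name__ == "__main__":
    print(f"overlap = {overlap:.2f}")
```

A value near 1 would indicate a stable signature; values near 0 mean the two analyses agree on almost no genes.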
Problem: the validation gap
training data, test data
validation
validation: predictive signatures often perform worse on independent data created for validation.
Photograph by Richard Hallman, National Geographic Adventure Blog
find patterns
make predictions
Adding prior knowledge to the discovery algorithm
<10 yr survival
>10 yr survival
Example: network-guided forests
Use protein interaction network to find good gene combinations
Dutkowski & Ideker (2011) Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology
But most knowledge is not structured
[Chart: number of articles added to PubMed per year, 2000 to 2012; y-axis from 500,000 to 1,000,000]
>100 publications/hour
>194715 publications linked to “breast cancer” since 2000 http://tinyurl.com/brsince2000
How can we use unstructured knowledge to improve predictors?
Need a distributed network of intelligent systems that are good at reading and hypothesizing
Like you and your friends
A game with a purpose: The Cure
• http://genegames.org/cure
• http://games.jmir.org/2014/2/e7/
• The Cure: Design and Evaluation of a Crowdsourcing Game for Gene Selection for Breast Cancer Survival Prediction. JMIR Serious Games. PMID: 25654473
People wanted to control the trees
http://biobranch.org
Branch Goals
• Provide easy, visual way for non-programmers to use large datasets to answer questions
• Construct libraries of manually crafted predictive models
• Use the collected models to generate ensemble predictors that incorporate the knowledge of the users
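The third goal, combining user-built models into an ensemble, could look roughly like the majority vote below. How Branch actually aggregates trees is not described on these slides, so this is only a sketch: the three "trees" are hypothetical one-rule models, and the feature names `gene_a` and `gene_b` and their thresholds are invented.

```python
from collections import Counter


# Three hypothetical hand-built models, each reduced to a single rule.
def tree_by_age(sample):
    return "relapse" if sample["age"] < 34.5 else "no relapse"


def tree_by_gene_a(sample):
    return "relapse" if sample["gene_a"] > 2.0 else "no relapse"


def tree_by_gene_b(sample):
    return "relapse" if sample["gene_b"] > 1.0 else "no relapse"


def ensemble_predict(models, sample):
    """Return the label predicted by the most models (simple majority vote)."""
    votes = Counter(model(sample) for model in models)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    models = [tree_by_age, tree_by_gene_a, tree_by_gene_b]
    patient = {"age": 50, "gene_a": 3.0, "gene_b": 1.5}
    print(ensemble_predict(models, patient))
```

Using an odd number of models avoids ties in a two-class problem; a weighted vote based on each tree's test performance would be a natural refinement.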
Branch walkthrough: Choose a dataset
Select evaluation option
Tree Builder
Split node builder
Each button is a different way to compose a split node in your decision tree
Split node
Predictions at leaf nodes
100% correct
56% accurate
View data, adjust split point
If age is less than 34.5, predict relapse
If greater, predict no relapse
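The split node on this slide is a single threshold test, and the percentages shown at the leaves are the fraction of samples in each leaf that the rule gets right. A minimal sketch, using the 34.5 split point from the slide but entirely made-up patient records:

```python
def age_split(age, threshold=34.5):
    """Split node from the slide: below the split point predict relapse."""
    return "relapse" if age < threshold else "no relapse"


def leaf_accuracy(records, threshold=34.5):
    """Fraction of correct predictions in each leaf for (age, outcome) records."""
    leaves = {"relapse": [], "no relapse": []}
    for age, outcome in records:
        predicted = age_split(age, threshold)
        leaves[predicted].append(outcome == predicted)
    return {
        leaf: (sum(hits) / len(hits) if hits else None)
        for leaf, hits in leaves.items()
    }


# Hypothetical (age, observed outcome) pairs, for illustration only.
records = [
    (30, "relapse"),
    (33, "relapse"),
    (40, "no relapse"),
    (45, "no relapse"),
    (52, "relapse"),
    (61, "no relapse"),
]

if __name__ == "__main__":
    print(leaf_accuracy(records))
```

Moving the threshold changes which records fall into each leaf, which is what adjusting the split point in the data view does.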
Single feature splits
Pick from genes or clinical features
Type-ahead search
Statistical ranker
Custom feature combination
BRCA2
TOP2B
BRCA2 + TOP2B
Allows user to use a manually composed linear combination of other features
E.g.: 21 Gene Signature from OncoType Dx*
Proliferation: Ki67, STK15, Survivin, CCNB1 (cyclin B1), MYBL2
Invasion: MMP11, CTSL2
HER2: GRB7, HER2
Estrogen: ER, PGR, BCL2, SCUBE2
GSTM1, CD68, BAG1
Reference: ACTB (β-actin), GAPDH, RPLP0, GUS, TFRC
Recurrence Score Algorithm
1. HER2 group score = 0.9 x GRB7 + 0.1 x HER2 (if the result is less than 8, then the GRB7 group score is considered 8);
2. ER group score = (0.8 x ER + 1.2 x PGR + BCL2 + SCUBE2) ÷ 4
3. Proliferation group score = (Survivin + KI67 + MYBL2 + CCNB1 [the gene encoding cyclin B1] + STK15) ÷ 5 (if the result is less than 6.5, then the proliferation group score is considered 6.5)
4. Invasion group score = (CTSL2 [the gene encoding cathepsin L2] + MMP11 [the gene encoding stromelysin 3]) ÷ 2.
RSU = 0.47 x HER2 - 0.34 x ER + 1.04 x PROLIFERATION + 0.10 x INVASION + 0.05 x CD68 - 0.08 x GSTM1 - 0.07 x BAG1
*A Multigene Assay to Predict Recurrence of Tamoxifen-Treated, Node-Negative Breast Cancer
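The recurrence score is a good example of a manually composed linear combination of features. The sketch below transcribes the four group scores and the RSU formula exactly as they appear on this slide. It assumes the input is a dict of reference-normalized expression values keyed by the gene names used on the slide; the function name is hypothetical, and any rescaling of RSU into a final reported score is not shown on the slide and is left out.

```python
def recurrence_score_unscaled(expr):
    """Compute RSU from a dict of expression values, following the slide's formulas."""
    # 1. HER2 group score, floored at 8
    her2_group = max(0.9 * expr["GRB7"] + 0.1 * expr["HER2"], 8.0)

    # 2. ER group score
    er_group = (0.8 * expr["ER"] + 1.2 * expr["PGR"]
                + expr["BCL2"] + expr["SCUBE2"]) / 4

    # 3. Proliferation group score, floored at 6.5
    proliferation = (expr["Survivin"] + expr["KI67"] + expr["MYBL2"]
                     + expr["CCNB1"] + expr["STK15"]) / 5
    proliferation = max(proliferation, 6.5)

    # 4. Invasion group score
    invasion = (expr["CTSL2"] + expr["MMP11"]) / 2

    return (0.47 * her2_group
            - 0.34 * er_group
            + 1.04 * proliferation
            + 0.10 * invasion
            + 0.05 * expr["CD68"]
            - 0.08 * expr["GSTM1"]
            - 0.07 * expr["BAG1"])


GENES = ["GRB7", "HER2", "ER", "PGR", "BCL2", "SCUBE2", "Survivin", "KI67",
         "MYBL2", "CCNB1", "STK15", "CTSL2", "MMP11", "CD68", "GSTM1", "BAG1"]

if __name__ == "__main__":
    # Hypothetical input: every gene at the same expression level.
    print(recurrence_score_unscaled({g: 10.0 for g in GENES}))
```

The five reference genes do not appear in the formula; they are used upstream to normalize the other sixteen measurements.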
Classifier nodes
Classifier Node
Class B
Class A
Use a trained predictive model, such as a support vector machine, as a node in your tree
Use
Build
biobranch tree nodes
Branch decision tree
Class B
Class A
Use a previously constructed tree as a node
Visually set decision boundary nodes
Visual split
Class B
Class A
Creating a visual split
Draw polygon
Add to tree
Select feature
Select feature
Teach students about overfitting..
Tree Builder
Evaluation panel
View training and testing sets
Performance metrics
Confusion matrix
ROC curve
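The confusion matrix and the basic performance metrics in the evaluation panel can be computed from paired lists of actual and predicted labels. This is a generic sketch, not Branch's code: the labels and predictions are made up, and which metrics the panel reports beyond those named on the slide is an assumption.

```python
def confusion_matrix(actual, predicted, positive="relapse"):
    """Count true/false positives and negatives for a two-class problem."""
    cm = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            cm["tp" if a == positive else "fp"] += 1
        else:
            cm["fn" if a == positive else "tn"] += 1
    return cm


def summarize(cm):
    """Derive accuracy, sensitivity and specificity from the four counts."""
    total = sum(cm.values())
    positives = cm["tp"] + cm["fn"]
    negatives = cm["tn"] + cm["fp"]
    return {
        "accuracy": (cm["tp"] + cm["tn"]) / total if total else None,
        "sensitivity": cm["tp"] / positives if positives else None,
        "specificity": cm["tn"] / negatives if negatives else None,
    }


# Hypothetical test-set outcomes and tree predictions.
actual = ["relapse", "relapse", "no relapse", "no relapse", "relapse", "no relapse"]
predicted = ["relapse", "no relapse", "no relapse", "relapse", "relapse", "no relapse"]

if __name__ == "__main__":
    cm = confusion_matrix(actual, predicted)
    print(cm, summarize(cm))
```

Computing these on the testing set rather than the training set is what exposes an overfitted tree.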
Navigation
Save your tree
New tree
Tree Collection
Open and edit shared tree
Search trees you create and trees shared with the community
Editing shared tree
Tracks which user created each node
Next steps
• More user testing
• More datasets
• Lots of users?
• Better models?
training data, test data
validation
Even more information!
• Screencasts
• http://tinyurl.com/branch-cast
• Open source code
– https://bitbucket.org/sulab/biobranch
– https://bitbucket.org/starinformatics/gbk
– https://bitbucket.org/sulab/wikidatabots
Thanks
Funding and Support
BioGPS: GM83924
Gene Wiki: GM089820
BD2K COE: GM114833
Andra Waagmeester, Sebastian Burgstaller, Elvira Mitraka
Lynn Schriml, Gang Fu, Evan Bolton, Paul Pavlidis, Peter Robinson, Many WikiDatans
Richard Bruskiewich, http://starinformatics.com
Karthik Gangavarapu, Vyshakh Babji
Andrew Su
The Prince of Crowdsourcing
Implicitome: Kristina Hettne, Leiden University
Contact: [email protected], @bgood