Given at the DBMI seminar series at UCSD. http://dbmi.ucsd.edu/display/DBMI/Seminars
Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org
Andrew Su, Ph.D.
@andrewsu
asu@scripps.edu
http://sulab.org
April 5, 2013
UCSD DBMI Seminar
Few genes are well annotated…
[Figure: GO annotation counts for 20,473 protein-coding genes, sorted by decreasing counts. Data: NCBI, February 2013. Values marked on the plot: 41%, 65%. Labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF.]
… because the literature is sparsely curated?
[Figure: Number of PubMed-indexed articles per year, 1979 to 2009 (axis from 0 to 1,000,000).]
[Figure: Average capacity of human scientist: number of articles read by typical scientist.]
311,696 articles (1.5% of PubMed) have been cited by GO annotations
Sooner or later, the research community will
need to be involved in the annotation effort to scale
up to the rate of data generation.
The Long Tail is a prolific source of content
[Figure: Content produced against contributors (sorted), split into a Short Head and a Long Tail.]
Category          Short Head         Long Tail
News              Newspapers         Blogs
Video             TV/Hollywood       YouTube
Product reviews   Consumer reports   Amazon reviews
Food reviews      Food critics       Yelp
Talent judging    Olympics           American Idol
Wikipedia is reasonably accurate
Wikipedia has breadth and depth
[Figure: Articles and words (millions), Wikipedia compared with Britannica Online. Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008]
We can harness the Long Tail of scientists to directly participate in
the gene annotation process.
From crowdsourcing to structured data
The Gene Wiki
Biological Games
Filtering, extracting, and summarizing PubMed
[Figure: Documents, Concepts, Review article]
Wiki success depends on a positive feedback loop
[Figure: cycle linking Gene Wiki page utility, number of users, and number of contributors.]
10,000 gene “stubs” within Wikipedia
Protein structure
Symbols and identifiers
Tissue expression pattern
Gene Ontology annotations
Links to structured databases
Gene summary
Protein interactions
Linked references
Huss, PLoS Biol, 2008
Gene Wiki has a critical mass of readers
Total: 4.0 million views / month
Huss, PLoS Biol, 2008; Good, NAR, 2011
Gene Wiki has a critical mass of editors
Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words
Approximately equal to 230 full-length articles
Good, NAR, 2011
[Figure: editor count (Editors) and edit count (Edits).]
A review article for every gene is powerful
References to the literature
Hyperlinks to related concepts
Reelin: 98 editors, 703 edits since July 2002
Heparin: 358 editors, 654 edits since June 2003
AMPK: 109 editors, 203 edits since March 2004
RNAi: 394 editors, 994 edits since October 2002
Making the Gene Wiki more computable
Structured annotations
Free text
Filling the gaps in gene annotation
Wikilink
GO exact match
Gene Wiki mapping
NCBI Entrez Gene: 334
Candidate assertion
GO:0006897
6319 novel GO annotations
2147 novel DO annotations
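The mining step on this slide (wikilink, GO exact match, candidate assertion) might be sketched as follows. This is a minimal illustration, not the Gene Wiki pipeline itself: the three-entry `GO_TERMS` lookup, the function name, and the sample wikitext are all assumptions made for the example, and a real run would load the full ontology.

```python
import re

# Hypothetical miniature lookup: lowercase GO term name -> GO identifier.
# A real run would load every term name from the ontology.
GO_TERMS = {
    "endocytosis": "GO:0006897",
    "axon guidance": "GO:0007411",
    "muscle contraction": "GO:0006936",
}

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]];
# group 1 captures only the link target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")

def candidate_assertions(entrez_gene_id, wikitext):
    """Return (gene, GO id) pairs for wikilinks whose target exactly matches a GO term name."""
    found = set()
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip().lower()
        if target in GO_TERMS:
            found.add((entrez_gene_id, GO_TERMS[target]))
    return sorted(found)

# Invented wikitext for illustration; not quoted from any Gene Wiki article.
text = "The protein may play a role in [[endocytosis]] and binds [[heparin|heparin chains]]."
print(candidate_assertions(334, text))
```

Each pair is only a candidate; deciding whether it is novel would still require a comparison against existing annotations.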
Gene Wiki content improves enrichment analysis
[Figure: workflow linking GO term, gene list, concept recognition, PubMed abstracts, and enrichment analysis.]
axon guidance (GO:0007411)
264 genes
811 articles
Linked genes through PubMed
P = 1.55 E-20
       Yes    No
Yes    13     2
No     251    12033
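The P value for a 2×2 table like this one can be checked with a one-sided Fisher's exact test. The sketch below assumes that rows mean "linked through PubMed" and columns mean "annotated to the GO term"; that orientation is inferred from 13 + 251 = 264 genes and is not stated on the slide. The choice of test is also an assumption.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test: P(overlap >= a) under the hypergeometric null.

    Assumed layout: [[a, b], [c, d]], rows = linked through PubMed (yes/no),
    columns = annotated to the GO term (yes/no).
    """
    linked = a + b            # genes linked through PubMed
    annotated = a + c         # genes annotated to the GO term
    total = a + b + c + d
    upper = min(linked, annotated)
    tail = sum(
        comb(annotated, k) * comb(total - annotated, linked - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(total, linked)

p = fisher_exact_greater(13, 2, 251, 12033)
print(f"P = {p:.3g}")
```

Under these assumptions the result is on the order of 1E-20, in line with the value shown on the slide.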
Gene Wiki content improves enrichment analysis
[Figure: workflow linking GO term, gene list, concept recognition, PubMed abstracts + Gene Wiki, and enrichment analysis.]
muscle contraction (GO:0006936)
87 genes
Linked genes through PubMed: P = 1.0
Linked genes through PubMed + Gene Wiki: P = 1.22 E-09
251 articles
87 articles
Gene Wiki content improves enrichment analysis
[Figure: p-value (PubMed only) plotted against p-value (PubMed + GW), with regions labeled “More significant PubMed + GW” and “More significant PubMed only”; muscle contraction is labeled.]
Making the Gene Wiki more computable
Structured annotations
Free text
Analyses
Databases
Linked Data
The Long Tail of scientists is a valuable source of
information on gene function
From crowdsourcing to structured data
The Gene Wiki
Biological Games
Gene databases are numerous and overlapping
… and hundreds more …
Why is there so much redundancy?
[Figure: Users, Requests, Resources, Time, Community development]
BioGPS emphasizes community extensibility
Why do developers define the gene report view?
BioGPS emphasizes user customizability
http://biogps.org
Community extensibility and user customizability
Utility: A simple and universal plugin interface
KEGG: http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}
STRING: http://string-db.org/newstring_cgi?...&identifier={{EnsemblGene}}
Pubmed: http://www.ncbi.nlm.nih.gov/sites/entrez?...&Term={{Symbol}}
[Figure: a URL template combined with a gene entity yields the rendered URL.]
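Rendering a plugin URL amounts to filling each `{{Key}}` placeholder from the gene entity. The sketch below is an illustration only, not BioGPS code: the function name and the dictionary shape of the gene entity are assumptions, and the CDK2 values are used purely as an example.

```python
import re
from urllib.parse import quote

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

def render_plugin_url(template, gene):
    """Fill each {{Key}} placeholder in a plugin URL template from a gene entity."""
    def substitute(match):
        key = match.group(1)
        if key not in gene:
            raise KeyError(f"gene entity has no identifier of type {key!r}")
        # Percent-encode the value so it is safe inside a query string.
        return quote(str(gene[key]), safe="")
    return PLACEHOLDER.sub(substitute, template)

# KEGG template as shown on the slide.
kegg = "http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}"
# Illustrative gene entity; the field names mirror the placeholder names.
gene = {"Symbol": "CDK2", "EntrezGene": 1017}
print(render_plugin_url(kegg, gene))
# prints http://www.genome.jp/dbget-bin/www_bget?hsa:1017
```

Because a plugin is just a template string, registering a new database requires no code on the portal side, which is presumably what makes the interface "simple and universal".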
Total of > 540 gene-centric online databases registered as BioGPS plugins
Users: BioGPS has critical mass
• > 6400 registered users
• 14,000 unique visitors per month
• 155,000 page views per month
Top 10 organizations
1. Harvard
2. NIH
3. UCSD
4. Scripps
5. MIT
6. Cambridge
7. U Penn
8. Stanford
9. Wash U
10. UNC
[Figure: daily pageviews.]
Contributors: Explicit and implicit knowledge
540 plugins registered (>300 publicly shared)
by over 120 users
spanning 280+ domains
All resources should provide RDF…
Mining structured content from HTML
Defining a data extraction template
…
TP53 TNF APOE IL6 VEGF … EGFR TGFB1
The BioGPS Semantic Annotator
http://54.244.135.254:8080
The Long Tail of bioinformaticians can collaboratively build a gene portal.
From crowdsourcing to structured data
The Gene Wiki
Biological Games
Seven million human hours
http://www.flickr.com/photos/archana3k1/4124330493/
Twenty million human hours
http://www.flickr.com/photos/ableman/2171326385/
150 billion human hours per year
http://www.flickr.com/photos/rvp-cw/6243289302/
Using games to fold proteins
Fold.it players have successfully:
• Outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010)
• Solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011)
• Designed an improved protein folding algorithm (Khatib, PNAS, 2011)
• Improved enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011)
Using games to diagnose malaria infection
http://biogames.ee.ucla.edu/
Using games to map neurons
http://eyewire.org
No good gene-disease annotation database
Query: Apolipoprotein E
• Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease
• Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; Macular degeneration, age-related; Myocardial infarction susceptibility
• Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; Macular degeneration, age-related; Myocardial infarction susceptibility; HIV; Psoriasis; Vascular Diseases ? ? ? ? ?
• Alzheimer's disease (AD); Neuropsychological Tests; Cognition Disorders; Dementia; Cognition; Disease Progression; Cardiovascular Diseases; Coronary Disease; Diabetes Mellitus, Type 2; Memory Disorders; Memory; Coronary Artery Disease; Hypertension; Mental Status Schedule; Psychiatric Status Rating Scales; Hyperlipidemias; Atrophy; Dementia, Vascular; Parkinson Disease; Brain Injuries; Myocardial Infarction; …
477 diseases!
Play Dizeez to annotate gene-disease links
1. Read the clue (gene)
2. Click the related disease (only one is “right”)
3. If it’s ‘right’, you get points
4. Then on to the next question…
5. Hurry!
6. Play to win!
Dizeez players seem pretty smart…
In total (since Dec 2011):
• 230 unique gamers
• 1045 games played
• 8525 guesses
# Occurrences Gene Disease
11 NBPF3 neuroblastoma
11 SOX8 mental retardation
9 ABL1 leukemia
9 SSX1 synovial sarcoma
8 APC colorectal cancer
8 FES sarcoma
8 RBP3 retinoblastoma
8 GAST gastrinoma
8 DCC colorectal cancer
8 MAP3K5 cancer
Gene Wiki OMIM PharmGKB PubMed
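An occurrence table like the one above can be produced by tallying player guesses per gene-disease pair. The sketch below is an assumption about how such a tally could work, not a description of the Dizeez backend: the log format, the `min_count` cutoff, and the sample guesses are all invented for illustration.

```python
from collections import Counter

def tally_links(guesses, min_count=2):
    """Count how often each (gene, disease) pair was guessed; keep pairs seen at least min_count times."""
    counts = Counter(guesses)
    return [
        (n, gene, disease)
        for (gene, disease), n in counts.most_common()
        if n >= min_count
    ]

# Invented guess log: one (gene, disease) tuple per click.
guesses = (
    [("ABL1", "leukemia")] * 3
    + [("APC", "colorectal cancer")] * 2
    + [("ABL1", "sarcoma")]
)
for n, gene, disease in tally_links(guesses):
    print(n, gene, disease)
```

Pairs that many independent players agree on can then be compared against existing sources to see which links are already known.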
Using games to predict phenotype from genotype?
http://genegames.org
Classification problems in genome biology
[Figure: data matrix of 100s of samples by 100,000s of features, labeled cancer / normal. Find patterns, then classify new samples as cancer or normal.]
SVM
Neural networks
Naïve Bayes
KNN
…
Random forests
[Figure: data matrix of 100s of samples by 100,000s of features, labeled cancer / normal.]
Sample subset of cases and features
Train decision tree
Classify new samples (cancer / normal)
How to interject biological knowledge?
Network-guided forests
Dutkowski & Ideker (2011). PLoS Computational Biology
[Figure: data matrix of 100s of samples by 100,000s of features, labeled cancer / normal.]
Sample features by PPI network
Train decision tree
Human-guided forests
[Figure: data matrix of 100s of samples by 100,000s of features, labeled cancer / normal.]
Sample features by human intelligence
Train decision tree
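The three forest variants differ only in how features are sampled for each tree. The sketch below makes that step pluggable. It is a simplified illustration under stated assumptions, not the method from the talk or from Dutkowski & Ideker: each "tree" is reduced to a single-split stump to keep the example short, and the toy data and guidance weights are invented.

```python
import random
from collections import Counter

def train_stump(rows, labels, features):
    """Return (feature, threshold, left_label, right_label) with the fewest training errors."""
    best = None
    for f in features:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # no usable split: always predict the majority class
        majority = Counter(labels).most_common(1)[0][0]
        return (None, None, majority, majority)
    return best[1:]

def uniform_sampler(n_features, k):
    """Plain random forest: every feature is equally likely to be drawn."""
    return lambda rng: rng.sample(range(n_features), k)

def guided_sampler(weights, k):
    """Guided forest: draw k distinct features with probability proportional to weights."""
    if sum(w > 0 for w in weights) < k:
        raise ValueError("need at least k features with positive weight")
    def sample(rng):
        chosen = set()
        while len(chosen) < k:
            chosen.add(rng.choices(range(len(weights)), weights=weights)[0])
        return sorted(chosen)
    return sample

def train_forest(rows, labels, sample_features, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample of cases
        features = sample_features(rng)                 # subset of features
        forest.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx], features))
    return forest

def predict(forest, row):
    """Majority vote over all stumps."""
    votes = Counter()
    for f, threshold, left_label, right_label in forest:
        goes_left = f is None or row[f] <= threshold
        votes[left_label if goes_left else right_label] += 1
    return votes.most_common(1)[0][0]

# Toy data: feature 0 separates the classes, features 1-3 are noise.
rows = [[0.1, 5, 3, 7], [0.2, 1, 9, 2], [0.3, 8, 4, 4],
        [0.9, 2, 8, 6], [0.8, 7, 1, 3], [0.7, 3, 6, 9]]
labels = ["normal", "normal", "normal", "cancer", "cancer", "cancer"]

# Hypothetical weights, e.g. how strongly a PPI network or players favored each feature.
guided = train_forest(rows, labels, guided_sampler([8, 4, 1, 1], k=2))
baseline = train_forest(rows, labels, uniform_sampler(4, k=2))
print(predict(guided, [0.05, 4, 4, 4]), predict(guided, [0.95, 4, 4, 4]))
```

Swapping `uniform_sampler` for `guided_sampler` is the only change between the plain and the guided forest; where the weights come from (a network or human players) is left open here.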
The Cure: Genomic predictors for disease
Human-guided forests
Classify new samples (cancer / normal)
“Critical Assessment”-style challenge
Results
• 214 registered players
  – 50% declared knowledge of cancer biology
  – 40% self-identified as having Ph.D.
• Prediction results
  – 70% correct on survival concordance index
  – Best scoring model was 76%
  – Player registrations still increasing!
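For readers unfamiliar with the metric, a survival concordance index can be computed as below. This sketch assumes the usual Harrell-style definition; the talk does not say which variant the challenge used, and the four-patient data set is invented.

```python
def concordance_index(times, events, risk_scores):
    """Harrell-style C: fraction of comparable patient pairs ranked in the right order.

    A pair is comparable when the patient with the shorter time had an observed event.
    A higher risk score should mean shorter survival; tied scores count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Invented example: survival times, event flags (0 = censored), predicted risk.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
scores = [0.9, 0.4, 0.5, 0.1]
print(concordance_index(times, events, scores))  # 4 of 5 comparable pairs are concordant: 0.8
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the 70% and 76% figures above.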
The Long Tail of gamers can collaboratively build an accurate disease classifier.
Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Jon Huss, GNF
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project
Group members
Katie Fisch
Ben Good
Salvatore Loguercio
Max Nanis
Chunlei Wu
Key group alumni
Adriel Carolino
Erik Clarke
Jon Huss
Marc Leglise
Maximilian Ludvigsson
Ian MacLeod
Camilo Orozco
Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820)
Contact
http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su
Doctoral Program in Chemical and Biological Sciences
CALIFORNIA
Office of Graduate Studies
10550 N. Torrey Pines Road
La Jolla, CA 92037
Email: gradprgrm@scripps.edu
Phone: 858.784.8469
http://education.scripps.edu
Recruiting graduate students in quantitative biology!