The use of crowdsourcing in biology is gaining popularity as a mechanism to tackle challenges of massive scale. However, to maximize participation and lower the barriers to entry, contributions to crowdsourcing efforts are typically not well structured, which makes computing on these data difficult. The presentation will discuss strategies for translating this unstructured content into structured data. Three vignettes (in varying degrees of completion) will be described, one each from our Gene Wiki [1], BioGPS [2], and serious gaming [3] initiatives.
[1]: http://en.wikipedia.org/wiki/Portal:Gene_Wiki
[2]: http://biogps.org
[3]: http://genegames.org
Translating unstructured, crowdsourced content into structured data
Andrew Su, Ph.D.
The Scripps Research Institute
NCBO Webinar
February 20, 2013
Human genetics underlies human health
~3 billion bases
~20,000 genes
Molecular diagnostics & therapeutics
Molecular understanding of:
• Biological function
• Genetic variation
• Mutation
• Deletion
• Amplification
• …
Gene annotations
Structured gene annotations
Structured gene annotations enable computation
Structured gene annotations
Few genes are well annotated
41%
65%
CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF
Data: NCBI, February 2013
20,473 protein-coding genes
[Figure: GO Annotation Counts (y-axis) vs. Genes, sorted by decreasing counts (x-axis)]
Few genes are well annotated
[Figure: GO Annotation Counts (y-axis) vs. Genes, sorted by decreasing counts (x-axis); + Electronic annotation (IEA)]
Data: NCBI, February 2013
Few genes are well annotated
[Figure: GO Annotation Counts (y-axis) vs. Genes, sorted by decreasing counts (x-axis); + Electronic annotation (IEA); Biological Process only]
Data: NCBI, February 2013
311,696 articles (1.5% of PubMed) have been cited by GO annotations
Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
Crowdsourcing empowers the entire scientific community to directly participate in the gene annotation process.
From crowdsourcing to structured data
The Gene Wiki
GeneGames.org
10,000 gene “stubs” within Wikipedia
Protein structure
Symbols and identifiers
Tissue expression pattern
Gene Ontology annotations
Links to structured databases
Gene summary
Protein interactions
Linked references
Huss, PLoS Biol, 2008
Gene Wiki has a critical mass of readers
Total: 4.0 million views / month
Huss, PLoS Biol, 2008; Good, NAR, 2011
Gene Wiki has a critical mass of editors
Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words
Approximately equal to 230 full-length articles
Good, NAR, 2011
[Figure axes: Editor count (Editors); Edit count (Edits)]
A review article for every gene is powerful
Hyperlinks to related concepts
References to the literature
Reelin: 68 editors, 543 edits since July 2002
Heparin: 175 editors, 320 edits since June 2003
AMPK: 44 editors, 84 edits since March 2004
RNAi: 232 editors, 708 edits since October 2002
Filtering, extracting, and summarizing PubMed
Documents
Concepts
Document- and concept-centric text mining
Subject, Predicate, Object
Simple text mining for gene annotations
Wikilink
GO exact match
Gene Wiki mapping
NCBI Entrez Gene: 334
Candidate assertion
GO:0006897
6319 novel Gene Ontology annotations
2147 novel Disease Ontology annotations
Good, BMC Genomics, 2011.
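The pipeline on this slide (wikilink, GO exact match, Gene Wiki mapping, candidate assertion) can be sketched roughly as follows. This is a minimal illustration, not the published implementation: the two lookup tables, the article title `EXAMPLE_GENE`, and the function name are hypothetical, and only Entrez Gene 334 and GO:0006897 come from the slide.

```python
import re

# Hypothetical miniature lookup tables. Real ones would be built from the
# Gene Ontology term list and from the Gene Wiki article-to-gene mapping.
GO_LABEL_TO_ID = {"endocytosis": "GO:0006897"}
ARTICLE_TO_ENTREZ = {"EXAMPLE_GENE": 334}  # placeholder article title

# Matches [[Target]], [[Target|shown text]] and [[Target#Section]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def candidate_assertions(article_title, wikitext):
    """Return (Entrez Gene ID, GO ID) pairs for wikilinks whose target
    exactly matches a GO term label (case-insensitive)."""
    gene_id = ARTICLE_TO_ENTREZ.get(article_title)
    if gene_id is None:
        return []
    found = []
    for target in WIKILINK.findall(wikitext):
        go_id = GO_LABEL_TO_ID.get(target.strip().lower())
        if go_id and (gene_id, go_id) not in found:
            found.append((gene_id, go_id))
    return found


text = "The protein takes part in [[Endocytosis|endocytic]] uptake and binds [[heparin]]."
print(candidate_assertions("EXAMPLE_GENE", text))
```

Exact label matching keeps precision high at the cost of recall, which is presumably why the slide calls the output candidate assertions rather than annotations.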
Gene Wiki content improves enrichment analysis
[Figure: p-value (PubMed only) vs. p-value (PubMed + GW); regions labeled "More significant PubMed + GW" and "More significant PubMed only"; highlighted term: Muscle contraction]
Good, BMC Genomics, 2011.
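To see why extra annotations can change an enrichment p-value, here is a minimal sketch using a one-sided hypergeometric test. The slide does not say which test was used, so the choice of test is an assumption, and every number below is invented purely for illustration.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, overlap):
    """One-sided hypergeometric p-value: the probability of seeing at
    least `overlap` annotated genes when drawing `sample` genes from a
    population in which `annotated` genes carry the term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Invented numbers: a 50-gene list drawn from 20,000 genes.
# With the first annotation source, 100 genes carry the term and 3 are in the list.
p_before = enrichment_p_value(20000, 100, 50, 3)
# After adding a second source, 130 genes carry the term and 6 are in the list.
p_after = enrichment_p_value(20000, 130, 50, 6)
print(p_before, p_after)
```

In this toy case the added annotations raise the overlap faster than they raise the background count, so the term becomes more significant.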
Gene Wiki+ for integrative queries
http://genewikiplus.org
mwsync
Good, J Biomed Semantics, 2012.
Dynamic queries across genes, diseases, SNPs
Good, J Biomed Semantics, 2012.
Gene Wiki+ for integrative queries
http://genewikiplus.org
mwsync
{{#ask: [[Category:Human_proteins]] [[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] [[HasSNP:: <q>[[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] </q>]]}}
…
OMIM, PharmGKB
Good, J Biomed Semantics, 2012.
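The nested `#ask` query reads as: human proteins that are associated with breast cancer and that have a SNP which is itself associated with breast cancer. A minimal sketch of that logic over an in-memory toy graph follows. The page names (`GeneA`, `snp1`, `DiseaseX`, ...) are placeholders, and this only mimics the query semantics; it is not how Semantic MediaWiki evaluates queries.

```python
# Toy wiki: page -> categories, plus (subject, property, object) links.
# All page names are hypothetical placeholders.
CATEGORIES = {
    "GeneA": {"Human_proteins"},
    "GeneB": {"Human_proteins"},
    "GeneC": {"Human_proteins"},
    "DiseaseX": {"Breast_cancer"},
    "snp1": set(),
    "snp2": set(),
}
TRIPLES = [
    ("GeneA", "is_associated_with", "DiseaseX"),
    ("GeneA", "HasSNP", "snp1"),
    ("snp1", "is_associated_with", "DiseaseX"),
    ("GeneB", "is_associated_with", "DiseaseX"),
    ("GeneB", "HasSNP", "snp2"),  # snp2 has no disease link
    ("GeneC", "HasSNP", "snp1"),  # GeneC itself has no disease link
]


def objects(page, prop):
    """All pages that `page` points to through property `prop`."""
    return {o for s, p, o in TRIPLES if s == page and p == prop}


def linked_to_category(page, category):
    """True if `page` is_associated_with any page in `category`."""
    return any(
        category in CATEGORIES.get(target, set())
        for target in objects(page, "is_associated_with")
    )


def breast_cancer_proteins_with_snp():
    """Mirror of the nested #ask query shown on the slide."""
    return sorted(
        page
        for page, cats in CATEGORIES.items()
        if "Human_proteins" in cats
        and linked_to_category(page, "Breast_cancer")
        and any(
            linked_to_category(snp, "Breast_cancer")
            for snp in objects(page, "HasSNP")
        )
    )


print(breast_cancer_proteins_with_snp())
```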
Wikidata
Provide a database of the world’s knowledge that anyone can edit
- Denny Vrandečić
Wikidata
Reelin (Q414043): http://www.wikidata.org/wiki/Q414043
Properties: is a (Property:P31), regulates (Property:P128), Interacts with (Property:P129)
Items: Protein (Q8054), Glycoprotein (Q187126), Neural development (Q1345738), VLDL receptor (Q1979313), Amyloid precursor protein (Q423510)
Wikidata
Property:P31, Property:P128, Property:P129
Q8054, Q187126, Q1345738, Q1979313, Q423510, Q414043
http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
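A minimal sketch of working with that API call is shown below. It builds the same `wbgetentities` URL (adding `format=json`, which is an addition here) and reads item values out of a response. The `SAMPLE_ENTITY` dictionary is hand-written to resemble the JSON layout the API returns today; it was not fetched, and the layout in 2013 may have differed. The helper names are hypothetical.

```python
from urllib.parse import urlencode

API = "http://wikidata.org/w/api.php"


def entity_url(entity_id, language="en"):
    """Build a wbgetentities request URL like the one on the slide."""
    params = {
        "action": "wbgetentities",
        "ids": entity_id,
        "languages": language,
        "format": "json",  # not on the slide; requests a JSON response
    }
    return API + "?" + urlencode(params)


def item_values(entity, prop):
    """Collect the item IDs that `entity` states for property `prop`."""
    ids = []
    for claim in entity.get("claims", {}).get(prop, []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
        if isinstance(value, dict) and "id" in value:
            ids.append(value["id"])
    return ids


# Hand-written stand-in for response["entities"]["Q414043"] (assumed shape).
SAMPLE_ENTITY = {
    "id": "Q414043",
    "labels": {"en": {"language": "en", "value": "Reelin"}},
    "claims": {
        "P31": [
            {"mainsnak": {"datavalue": {"type": "wikibase-entityid",
                                        "value": {"entity-type": "item", "id": "Q8054"}}}}
        ]
    },
}

print(entity_url("Q414043"))
print(item_values(SAMPLE_ENTITY, "P31"))
```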
Wikidata
http://www.wikidata.org/wiki/Wikidata:Molecular_Biology_task_force
From crowdsourcing to structured data
The Gene Wiki
GeneGames.org
Not just the biomedical literature…
BioGPS aggregates gene-centric information
http://biogps.org
Wu, NAR, 2013; Wu, Genome Biology, 2009.
The plugin interface is simple and universal
KEGG: http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}
STRING: http://string-db.org/newstring_cgi?...&identifier={{EnsemblGene}}
PubMed: http://www.ncbi.nlm.nih.gov/sites/entrez?...&Term={{Symbol}}
URL template
Gene entity
Rendered URL
Wu, NAR, 2013; Wu, Genome Biology, 2009.
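The template mechanism can be sketched in a few lines: each `{{Key}}` placeholder is replaced by the matching identifier of the current gene. This is an illustration only, not BioGPS code; the function name is hypothetical and the gene record below (CDK2 with its Entrez and Ensembl identifiers) is an example value chosen here, not taken from the slides.

```python
import re
from urllib.parse import quote

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_plugin_url(template, gene):
    """Fill every {{Key}} in `template` with the gene's identifier."""
    def fill(match):
        key = match.group(1)
        if key not in gene:
            raise KeyError(f"gene record has no '{key}' identifier")
        # Percent-encode the value so it is safe inside a URL.
        return quote(str(gene[key]), safe="")
    return PLACEHOLDER.sub(fill, template)


# Example gene entity (illustrative values).
gene = {"EntrezGene": 1017, "Symbol": "CDK2", "EnsemblGene": "ENSG00000123374"}

kegg = "http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}"
print(render_plugin_url(kegg, gene))
```

Because a plugin is nothing more than a URL template, any site that can be addressed by a gene identifier can be registered without changes on the site's side.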
Total of 389 gene-centric online databases registered as BioGPS plugins
BioGPS has a critical mass of users
• > 4100 registered users
• 4000 unique visitors per week
• 40,000 page views per week
Top 10 organizations
1. Harvard
2. NIH
3. UCSD
4. Scripps
5. MIT
6. Cambridge
7. U Penn
8. Stanford
9. Wash U
10. UNC
[Figure: Daily pageviews]
Wu, NAR, 2013; Wu, Genome Biology, 2009.
All resources should provide RDF…
Mining structured content from HTML
Defining a data extraction template
…
TP53 TNF APOE IL6 VEGF …EGFR TGFB1
The BioGPS Semantic Annotator
http://54.244.135.254:8000/
From crowdsourcing to structured data
The Gene Wiki
GeneGames.org
Seven million human hours
http://www.flickr.com/photos/archana3k1/4124330493/
Twenty million human hours
http://www.flickr.com/photos/ableman/2171326385/
150 billion human hours per year
http://www.flickr.com/photos/rvp-cw/6243289302/
Using games to fold proteins
Fold.it players have successfully:
• Outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010)
• Solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011)
• Designed an improved protein folding algorithm (Khatib, PNAS, 2011)
• Improved enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011)
Using games to align sequences
http://phylo.cs.mcgill.ca
Kawrykow, PLOS ONE, 2012.
No good gene-disease annotation database
Alzheimer's disease (AD)
Lipoprotein glomerulopathy
Sea-blue histiocyte disease
Query: Apolipoprotein E
No good gene-disease annotation database
Alzheimer's disease (AD)
Lipoprotein glomerulopathy
Sea-blue histiocyte disease
Hyperlipoproteinemia, type III
Macular degeneration, age-related
Myocardial infarction susceptibility
Query: Apolipoprotein E
No good gene-disease annotation database
Alzheimer's disease (AD)
Lipoprotein glomerulopathy
Sea-blue histiocyte disease
Hyperlipoproteinemia, type III
Macular degeneration, age-related
Myocardial infarction susceptibility
HIV
Psoriasis
Vascular Diseases
Query: Apolipoprotein E
?
?
?
?
?
No good gene-disease annotation database
Alzheimer's disease (AD); Neuropsychological Tests; Cognition Disorders; Dementia; Cognition; Disease Progression; Cardiovascular Diseases; Coronary Disease; Diabetes Mellitus, Type 2; Memory Disorders; Memory; Coronary Artery Disease; Hypertension; Mental Status Schedule; Psychiatric Status Rating Scales; Hyperlipidemias; Atrophy; Dementia, Vascular; Parkinson Disease; Brain Injuries; Myocardial Infarction; …
Query: Apolipoprotein E
477 diseases!
Play Dizeez to annotate gene-disease links
1. Read the clue (gene)
2. Click the related disease (only one is “right”)
3. If it’s ‘right’, you get points
4. Then on to the next question…
5. Hurry!
6. Play to win!
Dizeez players seem pretty smart…
In total (since Dec 2011):
• 230 unique gamers
• 1045 games played
• 8525 guesses
# Occurrences Gene Disease
11 NBPF3 neuroblastoma
11 SOX8 mental retardation
9 ABL1 leukemia
9 SSX1 synovial sarcoma
8 APC colorectal cancer
8 FES sarcoma
8 RBP3 retinoblastoma
8 GAST gastrinoma
8 DCC colorectal cancer
8 MAP3K5 cancer
Gene Wiki OMIM PharmGKB PubMed
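A table like the one above can be produced by tallying how often each gene-disease pair was guessed and keeping the frequent ones. The sketch below is a guess at that aggregation step, not the Dizeez code; the guess log is invented (the pair names are borrowed from the table, the counts are not), and the `min_occurrences` threshold is a hypothetical parameter.

```python
from collections import Counter


def frequent_pairs(guesses, min_occurrences=2):
    """Count (gene, disease) guesses and return (count, gene, disease)
    rows with at least `min_occurrences`, most frequent first."""
    counts = Counter(guesses)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(n, gene, disease) for (gene, disease), n in ranked if n >= min_occurrences]


# Invented guess log: one (gene, disease) tuple per player click.
guesses = [
    ("ABL1", "leukemia"),
    ("APC", "colorectal cancer"),
    ("ABL1", "leukemia"),
    ("ABL1", "sarcoma"),
    ("APC", "colorectal cancer"),
    ("ABL1", "leukemia"),
]

for row in frequent_pairs(guesses):
    print(row)
```

Pairs that many independent players agree on are candidate gene-disease links, which can then be checked against existing sources.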
Using games to predict phenotype from genotype?
http://genegames.org
Classification problems in genome biology
Training samples: cancer / normal
Find patterns
Classify new samples: cancer / normal
Methods: SVM, Neural networks, Naïve Bayes, KNN, …
Data matrix: 100s of samples × 100,000s of features
Random forests
Sample subset of cases and features
Train decision tree
Classify new samples: cancer / normal
Data matrix: 100s of samples × 100,000s of features
How to interject biological knowledge?
Network-guided forests
Sample features by PPI network
Train decision tree
cancer / normal
Data matrix: 100s of samples × 100,000s of features
Dutkowski & Ideker (2011). PLoS Computational Biology
Human-guided forests
Sample features by human intelligence
Train decision tree
cancer / normal
Data matrix: 100s of samples × 100,000s of features
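One way to read the random, network-guided, and human-guided variants is that they differ mainly in how each tree's feature subset is drawn. The sketch below makes that concrete under stated assumptions: uniform sampling stands in for a plain random forest, and weighted sampling stands in for guidance, with weights that could come from how often players picked a gene. The function, the weights, and the feature names are all hypothetical; the slides do not specify the actual scheme.

```python
import random


def sample_features(features, k, weights=None, rng=None):
    """Draw k distinct features for one tree.

    weights=None gives uniform sampling (plain random forest).
    Otherwise `weights` maps feature -> non-negative preference, and
    features are drawn without replacement in proportion to it.
    """
    rng = rng or random.Random()
    if weights is None:
        return rng.sample(list(features), k)
    pool = {f: weights[f] for f in features if weights.get(f, 0) > 0}
    if k > len(pool):
        raise ValueError("not enough features with positive weight")
    chosen = []
    for _ in range(k):
        names = list(pool)
        pick = rng.choices(names, weights=[pool[n] for n in names], k=1)[0]
        chosen.append(pick)
        del pool[pick]  # without replacement
    return chosen


features = [f"gene{i}" for i in range(10)]
player_votes = {"gene1": 5, "gene3": 1, "gene7": 2}  # hypothetical pick counts

rng = random.Random(0)
print(sample_features(features, 3, rng=rng))                 # uniform
print(sample_features(features, 2, player_votes, rng=rng))   # human-guided
```

Each sampled subset would then be passed to an ordinary decision-tree learner, and the trees combined by voting as in any random forest.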
The Cure: Genomic predictors for disease
Human-guided forests
Classify new samples: cancer / normal
“Critical Assessment”-style challenge
Results
• 214 registered players
– 50% declared knowledge of cancer biology
– 40% self-identified as having Ph.D.
• Prediction results
– 70% correct on survival concordance index
– Best scoring model was 76%
– Player registrations still increasing!
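For readers unfamiliar with the metric, a survival concordance index is the fraction of comparable patient pairs in which the patient predicted to be at higher risk is the one who died earlier. The sketch below uses a common textbook formulation (Harrell's C, with ties in predicted risk counted as half); the challenge's exact scoring rules are not given on the slide, and the data here are invented.

```python
def concordance_index(times, events, risk_scores):
    """Harrell-style concordance index.

    times: observed follow-up time per patient
    events: 1 if death was observed, 0 if the patient was censored
    risk_scores: predicted risk (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Invented example: four patients, the third one censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.7, 0.8, 0.1]
print(concordance_index(times, events, risks))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives context for the 70% and 76% figures above.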
Crowdsourcing empowers the entire scientific community to directly participate in the gene annotation process.
Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project
Group members
Katie Fisch
Ben Good
Salvatore Loguercio
Max Nanis
Chunlei Wu
Key group alumni
Erik Clarke
Jon Huss
Marc Leglise
Maximilian Ludvigsson
Ian MacLeod
Camilo Orozco
Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820)
Contact
http://sulab.org
[email protected]
@andrewsu
+Andrew Su