Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org
Andrew Su, Ph.D.
@andrewsu
[email protected]
sulab.org
May 14, 2014
CBIIT
Slides: slideshare.net/andrewsu
Citizen Science!
Few genes are well annotated…
Data: NCBI, February 2013
[Figure: GO annotation counts for 20,473 protein-coding genes, sorted by decreasing counts. Labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF. Percentages marked on the plot: 41% and 65%.]
… because the literature is sparsely curated?
[Figure: number of new PubMed-indexed articles per year, 1983 to 2013; y-axis from 0 to 1,200,000.]
… because the literature is sparsely curated?
[Figure: axis from 0 to 40, labeled "Average capacity of human scientist".]
311,696 articles (1.5% of PubMed) have been cited by GO annotations.
Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
The Long Tail is a prolific source of content
[Figure: content produced vs. contributors (sorted), with the Short Head and the Long Tail marked.]
Category: Short Head / Long Tail
• News: Newspapers / Blogs
• Video: TV, Hollywood / YouTube
• Product reviews: Consumer reports / Amazon reviews
• Food reviews: Food critics / Yelp
• Talent judging: Olympics / American Idol
Wikipedia is reasonably accurate
Wikipedia has breadth and depth
http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
[Figure: articles and words (millions), Wikipedia vs. Britannica Online.]
We can harness the Long Tail of scientists to directly participate in the gene annotation process.
From crowdsourcing to structured data
The Gene Wiki
Citizen Science
Filtering, extracting, and summarizing PubMed
[Diagram labels: Documents, Concepts, Review article]
Wiki success depends on a positive feedback loop
[Diagram: Gene Wiki page utility, number of users, number of contributors]
10,000 gene “stubs” within Wikipedia
• Protein structure
• Symbols and identifiers
• Tissue expression pattern
• Gene Ontology annotations
• Links to structured databases
• Gene summary
• Protein interactions
• Linked references
Huss, PLoS Biol, 2008
Gene Wiki has a critical mass of readers
Total: 4.0 million views / month
Huss, PLoS Biol, 2008; Good, NAR, 2011
Gene Wiki has a critical mass of editors
• Increase of ~10,000 words / month from >1,000 edits
• Currently 1.42 million words
• Approximately equal to 230 full-length articles
[Figure: editor count and edit count over time.]
Good, NAR, 2011
A review article for every gene is powerful
References to the literature
Hyperlinks to related concepts
• Reelin: 98 editors, 703 edits since July 2002
• Heparin: 358 editors, 654 edits since June 2003
• AMPK: 109 editors, 203 edits since March 2004
• RNAi: 394 editors, 994 edits since October 2002
Making the Gene Wiki more computable
[Diagram labels: Free text, Structured annotations]
Filling the gaps in gene annotation
Wikilink
GO exact match
Gene Wiki mapping
NCBI Entrez Gene: 334
Candidate assertion
GO:0006897
6319 novel GO annotations
2147 novel DO annotations
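The mapping on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used for the Gene Wiki: the function name, the one-entry label table, and the example wikilinks are assumptions made for the sketch. Only the Entrez Gene ID 334 and the term GO:0006897 (endocytosis) come from the slide.

```python
# Hypothetical sketch: turn wikilinks found on a Gene Wiki page into candidate
# GO assertions by exact (case-insensitive) match against GO term labels.

# Tiny stand-in for a GO label index; a real one would be built from the ontology.
GO_LABELS = {
    "endocytosis": "GO:0006897",
}

def candidate_assertions(entrez_gene_id, wikilinks, go_labels=GO_LABELS):
    """Return (gene_id, go_id, matched_link) for each wikilink matching a GO label."""
    candidates = []
    for link in wikilinks:
        go_id = go_labels.get(link.strip().lower())
        if go_id is not None:
            candidates.append((entrez_gene_id, go_id, link))
    return candidates

# The wikilinks below are invented for illustration.
page_links = ["Endocytosis", "Alzheimer's disease", "Protein"]
result = candidate_assertions(334, page_links)
print(result)
```

Each tuple is only a candidate assertion; whether it is novel still has to be checked against existing annotations.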
Gene Wiki content improves enrichment analysis
[Figure: enrichment p-values, PubMed only vs. PubMed + Gene Wiki (GW). Regions marked "more significant, PubMed + GW" and "more significant, PubMed only"; highlighted term: muscle contraction.]
Good BM et al., BMC Genomics, 2011
Making the Gene Wiki more computable
[Diagram labels: Free text, Structured annotations, Analyses]
Expansion through outreach and incentives
Genes shown: SP-A1, SP-A2, KIF11, LIG3, MIR155, EPHX2
Cardiovascular Gene Wiki Portal
• CAMK2D -- CaM kinase II subunit delta
• CSRP3 -- Cysteine and glycine-rich protein 3
• GJA1 -- Gap junction alpha-1 protein / Connexin-43
• MAPK14 -- Mitogen-activated protein kinase 14 / p38-α
• MYL7 -- Myosin regulatory light chain 2, atrial isoform
• MYL2 -- Myosin regulatory light chain 2, ventricular/cardiac isoform
• PECAM1 -- Platelet endothelial cell adhesion molecule / CD31
• RYR2 -- Ryanodine receptor 2
• ATP2A2 -- Sarcoplasmic/endoplasmic reticulum calcium ATPase 2 / SERCA2
• TNNI3 -- Troponin I, cardiac muscle
• TNNT2 -- Troponin T, cardiac muscle
Peipei Ping, UCLA
The Long Tail of scientists is a valuable source of information on gene function.
From crowdsourcing to structured data
The Gene Wiki
Citizen Science
Gene databases are numerous and overlapping
… and hundreds more …
Why is there so much redundancy?
[Diagram labels: Users, Requests, Resources, Time, Community development]
BioGPS emphasizes community extensibility
Why do developers define the gene report view?
BioGPS emphasizes user customizability
http://biogps.org
Community extensibility and user customizability
Utility: A simple and universal plugin interface
Total of > 540 gene-centric online databases registered as BioGPS plugins
Users: BioGPS has critical mass
• > 6400 registered users
• 14,000 unique visitors per month
• 155,000 page views per month
Top 10 organizations
1. Harvard
2. NIH
3. UCSD
4. Scripps
5. MIT
6. Cambridge
7. U Penn
8. Stanford
9. Wash U
10. UNC
[Figure: daily pageviews.]
Contributors: Explicit and implicit knowledge
540 plugins registered (>300 publicly shared) by over 120 users, spanning 280+ domains
Gene Annotation Query as a Service
http://mygene.info
• High performance
  – 3M hits/month
• Highly scalable
  – 13k species
  – 16M genes
• Weekly data updates
• JSON output
• REST interface
• Python/R/JS libraries
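As a rough illustration of the REST interface and JSON output listed above, the sketch below builds a query URL and reads a decoded response. The `/v3/query` path and the `q`, `species`, and `fields` parameters are taken from the service's public documentation as I understand it and may differ from the version available when this talk was given; the helper names and the sample response are invented, and no network request is made.

```python
from urllib.parse import urlencode

# Assumed base URL; check the current mygene.info documentation before relying on it.
BASE_URL = "https://mygene.info/v3/query"

def build_query_url(term, species="human", fields=("symbol", "name", "entrezgene")):
    """Build a gene query URL; the service is expected to answer with JSON."""
    params = {"q": term, "species": species, "fields": ",".join(fields)}
    return BASE_URL + "?" + urlencode(params)

def summarize_hits(response_json):
    """Pull (symbol, entrezgene) pairs out of a decoded response with a 'hits' list."""
    return [(hit.get("symbol"), hit.get("entrezgene"))
            for hit in response_json.get("hits", [])]

url = build_query_url("symbol:CDK9")
print(url)

# Offline stand-in for a decoded response; the ID value is a placeholder, not real data.
sample = {"hits": [{"symbol": "CDK9", "entrezgene": "PLACEHOLDER"}]}
print(summarize_hits(sample))
```

Keeping URL construction separate from the HTTP call makes the query easy to inspect and test without touching the live service.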
The Long Tail of bioinformaticians can collaboratively build a gene portal.
From crowdsourcing to structured data
The Gene Wiki
Citizen Science
The biomedical literature is growing fast
[Figure: number of new PubMed-indexed articles per year, 1983 to 2013; y-axis from 0 to 1,200,000.]
Information Extraction
1. Find mentions of high level concepts in text
2. Map mentions to specific terms in ontologies
3. Identify relationships between concepts
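The three steps above can be sketched as a toy pipeline. This is only an illustration under strong assumptions: the lexicon, the placeholder IDs (`EX:G1`, `EX:D1`), and the co-occurrence rule used for step 3 are all invented, and real systems use far richer methods for each step.

```python
import re

# Hypothetical mini-lexicon: surface form -> (concept type, placeholder ontology ID).
LEXICON = {
    "cdk9": ("GENE", "EX:G1"),
    "leukemia": ("DISEASE", "EX:D1"),
}

def find_mentions(text, lexicon=LEXICON):
    """Step 1: locate lexicon terms in the text as (start, end, surface) tuples."""
    mentions = []
    for term in lexicon:
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            mentions.append((m.start(), m.end(), m.group(0)))
    return sorted(mentions)

def normalize(mentions, lexicon=LEXICON):
    """Step 2: map each mention to its concept type and ontology ID."""
    return [(surface, *lexicon[surface.lower()]) for _, _, surface in mentions]

def cooccurring_pairs(concepts):
    """Step 3 (crude proxy): propose gene-disease pairs that occur in the same text."""
    genes = [c for c in concepts if c[1] == "GENE"]
    diseases = [c for c in concepts if c[1] == "DISEASE"]
    return [(g[2], d[2]) for g in genes for d in diseases]

text = "Inhibition of CDK9 is efficacious in leukemia cells."
concepts = normalize(find_mentions(text))
pairs = cooccurring_pairs(concepts)
print(concepts)
print(pairs)
```

Co-occurrence only proposes candidate pairs; it says nothing about the type of relationship, which is the hard part of step 3.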
Disease mentions in PubMed abstracts
NCBI Disease corpus
• 793 PubMed abstracts (100 development, 593 training, 100 test)
• 12 expert annotators (2 annotate each abstract)
• 6,900 “disease” mentions
Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.
Four types of disease mentions
• Specific Disease: “Diastrophic dysplasia”
• Disease Class: “Cancers”
• Composite Mention: “prostatic , skin , and lung cancer”
• Modifier: ..the “familial breast cancer” gene , BRCA2..
Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.
Question: Can a group of non-scientists collectively perform concept recognition in biomedical texts?
The Turk
http://en.wikipedia.org/wiki/The_Turk
Amazon Mechanical Turk (AMT)
Requester: for each task, specify:
• a qualification test
• how many workers per task
• how much we will pay per task
Amazon manages:
• parallel execution of jobs
• worker access to tasks via qualification tests
• payments
• task advertising
Workers
Workflow:
1. Create tasks
2. Execute
3. Aggregate
Instructions to workers
• Highlight all diseases and disease abbreviations
  – “...are associated with Huntington disease ( HD )... HD patients received...”
  – “The Wiskott-Aldrich syndrome ( WAS ) , an X-linked immunodeficiency…”
• Highlight the longest span of text specific to a disease
  – “... contains the insulin-dependent diabetes mellitus locus …”
• Highlight disease conjunctions as single, long spans
  – “... a significant fraction of familial breast and ovarian cancer , but undergoes…”
• Highlight symptoms (physical results of having a disease)
  – “XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”
Qualification test
Test #1: “Myotonic dystrophy ( DM ) is associated with a ( CTG ) n trinucleotide repeat expansion in the 3'-untranslated region of a protein kinase-encoding gene , DMPK , which maps to chromosome 19q13 . 3 . ”
Test #2: “Germline mutations in BRCA1 are responsible for most cases of inherited breast and ovarian cancer . However , the function of the BRCA1 protein has remained elusive . As a regulated secretory protein , BRCA1 appears to function by a mechanism not previously described for tumour suppressor gene products.”
Test #3: “We report about Dr . Kniest , who first described the condition in 1952 , and his patient , who , at the age of 50 years is severely handicapped with short stature , restricted joint mobility , and blindness but is mentally alert and leads an active life . This is in accordance with molecular findings in other patients with Kniest dysplasia and…”
26 yes / no questions
Qualification test results
33/194 workers passed (17%)
[Figure labels: Threshold for passing; Workers; qualified workers]
Simple annotation interface
Click to see instructions
Highlight disease mentions
Experimental design
• Task: Identify the disease mentions in the 593 abstracts from the NCBI disease corpus
  – $0.06 per Human Intelligence Task (HIT)
  – HIT = annotate one abstract from PubMed
  – 5 workers annotate each abstract
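The worker payments implied by this design follow from simple arithmetic, sketched below with the figures from the slide. The variable names are mine. The result covers base payments only; the deck reports a total cost of $192.90, and the slides do not break down the difference.

```python
from decimal import Decimal

abstracts = 593                 # abstracts to annotate (from the slide)
workers_per_abstract = 5        # workers per abstract (from the slide)
pay_per_hit = Decimal("0.06")   # dollars per HIT (from the slide)

total_hits = abstracts * workers_per_abstract
worker_payments = total_hits * pay_per_hit

print(total_hits)        # 2965
print(worker_payments)   # 177.90
```

`Decimal` is used instead of floats so the dollar amount is exact.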
Aggregation function based on simple voting
[Figure: the example passage below, highlighted at vote thresholds K = 1 (1 or more votes), K = 2, K = 3, and K = 4.]
This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.
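A minimal sketch of voting-based aggregation, assuming each worker's highlights are stored as exact (start, end) character offsets and that a span is kept when at least K workers marked the identical span. The slides do not say whether votes were counted per span, per token, or per character, so the exact-match rule and the example offsets are assumptions.

```python
from collections import Counter

def aggregate_by_votes(worker_spans, k):
    """Keep every span that at least k workers highlighted.

    worker_spans: one collection of (start, end) character offsets per worker.
    """
    # set(spans) ensures a worker cannot vote twice for the same span.
    votes = Counter(span for spans in worker_spans for span in set(spans))
    return {span for span, count in votes.items() if count >= k}

# Invented highlights from three workers on the same abstract.
workers = [
    {(10, 18), (40, 62)},
    {(10, 18), (40, 62), (70, 75)},
    {(10, 18)},
]

for k in (1, 2, 3):
    print(k, sorted(aggregate_by_votes(workers, k)))
```

Raising K trades recall for precision: K = 1 keeps anything any worker marked, while higher thresholds keep only spans with broad agreement.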
Comparison to gold standard
F = 0.81, k = 2, N = 5
• 593 documents
• 7 days
• 17 workers
• $192.90
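For reference, an F score of this kind can be computed as sketched below. The sketch assumes the balanced F-measure (harmonic mean of precision and recall) with exact span matching; the slides do not state which variant or matching rule was used, and the spans here are invented.

```python
def precision_recall_f(predicted, gold):
    """Compare two sets of (doc_id, start, end) spans using exact matching."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Invented example: 3 of 5 predicted spans are correct, 3 of 4 gold spans are found.
gold = {("d1", 0, 5), ("d1", 10, 20), ("d2", 3, 9), ("d2", 15, 22)}
predicted = {("d1", 0, 5), ("d1", 10, 20), ("d2", 3, 9), ("d2", 30, 35), ("d2", 40, 44)}

p, r, f = precision_recall_f(predicted, gold)
print(round(p, 2), round(r, 2), round(f, 2))
```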
[Figure: F-measure (y-axis, 0 to 0.9) against N (x-axis, 0 to 18), with curves for different vote thresholds k. Maximum F values annotated on the plot: 0.69, 0.79, 0.82, 0.85.]
Comparisons to text-mining algorithms
Comparisons to human annotators
• Average level of agreement between expert annotators (stage 1): F = 0.76
• Average level of agreement between expert annotators (stage 2): F = 0.87
In aggregate, our worker ensemble is faster, cheaper and as accurate as a single expert annotator for disease concept recognition.
Information Extraction
1. Find mentions of high level concepts in text
2. Map mentions to specific terms in ontologies
3. Identify relationships between concepts
Annotating the relationships
This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.
[Diagram: subject (GENE), predicate (“therapeutic target”), object (DISEASE)]
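One way to hold such an annotation is a typed subject-predicate-object record, sketched below. The class and field names are hypothetical, and choosing "hematologic malignancies" as the object is my reading of the example passage, not something the slide states.

```python
from typing import NamedTuple

class Relation(NamedTuple):
    """A typed subject-predicate-object assertion extracted from text."""
    subject: str
    subject_type: str
    predicate: str
    obj: str
    obj_type: str

def as_sentence(r):
    """Render a relation as a readable one-line string."""
    return f"{r.subject} ({r.subject_type}) --[{r.predicate}]--> {r.obj} ({r.obj_type})"

# Entities taken from the example passage; the pairing itself is an assumption.
rel = Relation("CDK9", "GENE", "therapeutic target", "hematologic malignancies", "DISEASE")
print(as_sentence(rel))
```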
Citizen Science at Mark2Cure.org
The Long Tail of citizen scientists can collaboratively annotate biomedical text.
Collaborators
• Doug Howe, ZFIN
• John Hogenesch, U Penn
• Jon Huss, GNF
• Luca de Alfaro, UCSC
• Angel Pizzaro, U Penn
• Faramarz Valafar, SDSU
• Pierre Lindenbaum, Fondation Jean Dausset
• Michael Martone, Rush
• Konrad Koehler, Karo Bio
• Warren Kibbe, Simon Lim, Northwestern
• Lynn Schriml, U Maryland
• Paul Pavlidis, U British Columbia
• Peipei Ping, UCLA
• Many Wikipedia editors
• WP:MCB Project
Group members
Katie Fisch, Karthik Gangavarapu, Louis Gioia, Ben Good, Salvatore Loguercio, Adam Mark, Max Nanis, Ginger Tseung, Chunlei Wu
Key group alumni
Adriel Carolino, Erik Clarke, Jon Huss, Marc Leglise, Maximilian Ludvigsson, Ian MacLeod, Camilo Orozco
Contact
http://[email protected]
@andrewsu
+Andrew Su
Citizen Science logo based on http://thenounproject.com/term/teamwork/39543/
Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820, DA036134)
Related AMT work
• [1] Zhai et al. (2013) used a similar protocol to tag medication names in clinical trial descriptions (F = 0.88 compared to gold standard).
• [2] Burger et al. (2014) used microtask workers to identify relationships between genes and mutations.
• [3] Aroyo & Welty (2013) used workers to identify relations between concepts in medical text.
[1] Zhai H. et al. (2013) “Web 2.0-Based Crowdsourcing for High-Quality Gold Standard Development in Clinical Natural Language Processing.” J Med Internet Res.
[2] Burger, John, et al. (2014) “Hybrid curation of gene-mutation relations combining automated extraction and crowdsourcing.” Mitre technical report.
[3] Aroyo, Lora, and Chris Welty (2013) “Harnessing disagreement in crowdsourcing a relation extraction gold standard.” Tech. Rep. RC25371 (WAT1304-058), IBM Research.