The Gene Wiki, from a BioRDF-naïve perspective
W3C / HCLSIG BioRDF Subgroup
November 17, 2008
How do we efficiently annotate the function of the ~25,000 genes in the mammalian genome?
Goal: “Genome-wide functional genomics”
Patterns of gene annotation
P(k) ~ k^{-a}
Entrez Gene
[Figure: four log-log panels plotting log(# genes) against log(# references). Fits shown on three panels: a = -1.32, R squared = 0.963; a = -0.6, R squared = 0.894; a = -0.4, R squared = 0.562.]
44% of genes in Entrez Gene have zero linked references. Over 75% have five or fewer linked references.
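The slides do not say how the fits in the panels were computed. A minimal sketch, assuming an ordinary least-squares line through the log-log histogram of references per gene, is shown below. The function names and the toy data are hypothetical and are not taken from Entrez Gene.

```python
import math
from collections import Counter


def fit_power_law(ref_counts):
    """Fit log10(# genes) = intercept + slope * log10(# references).

    ref_counts holds one entry per gene: its number of linked references.
    Genes with zero references are skipped because log10(0) is undefined.
    Returns (slope, intercept, r_squared).
    """
    hist = Counter(k for k in ref_counts if k > 0)
    ks = sorted(hist)
    xs = [math.log10(k) for k in ks]
    ys = [math.log10(hist[k]) for k in ks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared


def fraction_with_at_most(ref_counts, k):
    """Fraction of genes with k or fewer linked references."""
    return sum(1 for c in ref_counts if c <= k) / len(ref_counts)


# Toy data (hypothetical): gene counts fall off exactly as k^-2.
ref_counts = [0] * 500 + [1] * 1000 + [2] * 250 + [5] * 40 + [10] * 10

slope, intercept, r_squared = fit_power_law(ref_counts)
print(f"slope = {slope:.2f}, R squared = {r_squared:.3f}")
print(f"zero references: {fraction_with_at_most(ref_counts, 0):.0%}")
print(f"five or fewer:   {fraction_with_at_most(ref_counts, 5):.0%}")
```

The fitted slope plays the role of the value labelled a on the plots. The second helper gives the kind of coverage summary quoted on the slide (share of genes with zero, or with five or fewer, linked references).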
The Long Tail of Knowledge
• Traditional media revolves around the Short Head – a small number of publishers putting out lots of content
• “Web 2.0” media revolves around community-generated content – a huge population of individuals each generating a (relatively) small amount of content
[Figure: long-tail curve; axes: Users, Content]
The Short Head: Newspapers, TV/Hollywood, Consumer Reports, Olympics, Encyclopedia Britannica
The Long Tail: Blogs, YouTube, Amazon reviews, American Idol, Wikipedia
“Community intelligence”
The Long Tail of encyclopedias
http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
                  | Articles   | Words (millions) | Average words / article
Wikipedia         | >2,000,000 | >1,000           | 435
Britannica Online | 120,000    | 55               | 370
An expert-led investigation carried out by Nature … revealed numerous errors in both encyclopaedias, but among 42 entries tested, the difference in accuracy was not particularly great: the average science entry in Wikipedia contained around four inaccuracies; Britannica, about three.
• Wiki: “… a website that allows the visitors themselves to easily add, remove, and otherwise edit and change available content, typically without the need for registration.”
• Wikipedia: “the free encyclopedia that anyone can edit.”
Advantages of a Gene Wiki
1) Existing gene portals are great for structured content, but a wiki is suited for summarizing unstructured content
[Screenshots: Entrez Gene, Wikipedia]
Unstructured content allows for free-text, images, diagrams, photos, etc.
Advantages of a Gene Wiki
2) Wiki articles enable two-way communication of information, encouraging contributions and edits from the community.
Dec 18, 2002; Jan 3, 2004; Dec 11, 2004; May 6, 2006
Wikipedia is rarely the last place you look, but is often a good first place for an overview.
Gene “stubs”
• Active MCB community at WP had already developed ~650 gene articles
• Can we accelerate this process through stub creation?
• In total, created 7500 new articles and edited 650 previously existing articles.
Why Wikipedia?
• Critical mass of articles to which and from which we could link gene pages
• Critical mass of editors who were experienced in wiki-related issues (fighting vandalism, copyediting, governance)
• Active group of molecular biologists at the MCB “WikiProject” (http://en.wikipedia.org/wiki/WP:MCB)
• Alternatives considered
  – Home-built wiki
  – Citizendium (citizendium.org)
Gene wiki usage
[Chart labels: (650), (7500)]
50% of all edits to gene pages are to newly-created pages…
Gene Wiki pages are highly ranked at Google, ensuring critical mass of users and editors…
Currently there are ~9000 gene pages or stubs at Wikipedia
Positive feedback loop
[Diagram: loop linking Gene wiki page utility, Number of readers, and Number of editors]
1001
2002
25k gene-specific review articles?
Hyperlinks to related concepts
Reelin: 33 editors, 221 edits since July 2002
Heparin: 175 editors, 320 edits since June 2003
AMPK: 44 editors, 84 edits since March 2004
RNAi: 232 editors, 708 edits since October 2002
Gene Wiki activity
Steady (and growing?) edit rate over time
[Figure: Gene Wiki Daily Activity (Oct 17 - Nov 14); y-axis: # edits, 0 to 160; x-axis: date, 10/17/08 to 11/14/08]
[Figure: Gene Wiki Monthly Activity (May 07 - Nov 08); y-axis: # edits, 0 to 12000; x-axis: month, May-07 to Nov-08]
Gene Wiki article growth
http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/gene-wiki-top-2500-20081114
“Welcome to the semantic web…
The main concern with plaintext-on-Wikipedia is that it's not an effective way to truly exploit the long tail, since you're going to end up with this massive plaintext disaster that will require human collating (redundant work- just get it right the first time).”
- public-semweb-lifesci mailing list
Primary emphases
• Providing useful content – scientists will not find or contribute to a wiki unless it is already useful
• Instant feedback – wikis allow changes to be effective immediately, without approval or intermediary (e.g., corrections/additions to NCBI/Ensembl?)
• Emphasis on contributors, not data miners – emphasize getting data in, not on getting it out, since complex protocols encourage nonparticipation (e.g., MIAME)
• Critical mass – What will differentiate the Gene Wiki from the many other wiki efforts that are stagnant?
Secondary emphases
• Reliability and accuracy – do open and uncurated data models produce trustworthy content?
• Synergy with existing resources – how can the Gene Wiki make the growth of traditional annotation more efficient?
• Enabling semantic queries/structure – how can we structure unstructured content for data mining? (Semantic Mediawiki? NLP?)
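The slides leave the structuring question open and do not commit to an approach. To make the Semantic MediaWiki option concrete, here is a minimal sketch, assuming contributors mark facts inline with Semantic MediaWiki's `[[property::value]]` annotation syntax. The property names (`encoded by`, `involved in`), the sample sentence, and the helper `extract_triples` are hypothetical illustrations, not part of the Gene Wiki.

```python
import re

# Semantic MediaWiki inline annotation: [[property::value]] or
# [[property::value|display text]]. Ordinary links such as [[Heparin]]
# contain no "::" and are therefore ignored.
ANNOTATION = re.compile(r"\[\[([^\[\]:|]+)::([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def extract_triples(page_title, wikitext):
    """Return (subject, property, value) triples found in the wikitext."""
    return [
        (page_title, prop.strip(), value.strip())
        for prop, value in ANNOTATION.findall(wikitext)
    ]


# Hypothetical annotated sentence on a gene page.
text = (
    "Reelin is [[encoded by::RELN]] and is "
    "[[involved in::neuronal migration|cell migration in the brain]]; "
    "see also [[Heparin]]."
)

for triple in extract_triples("Reelin", text):
    print(triple)
```

The trade-off named under the primary emphases still applies: inline markup like this makes data easier to get out, but it asks more of contributors than plain free text does.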
Idealized information flow
Semantic structure
NCBI, Ensembl, …
1 Create Gene Wiki stubs
2 Unstructured content from the community
Wikipedia
3 Semantic encoding of free text (how?)
Direct semantic annotation by scientists
“Long tail” scientific contributions
Authoritative annotation databases
Figure to scale?
Semantic structure
NCBI, Ensembl, …
Wikipedia
“Long tail” scientific contributions
Summary
• Goal: create a complementary resource to existing tools, not competitive.
• Primary emphasis will always be on maximizing community participation.
• How do we structure the unstructured contributions?
Acknowledgements
Serge Batalov, Jason Boyer, Jennifer Floyd, Yue Hu, Jon Huss, Jeff Janes, Camilo Orozco, Steve Su, Julia Turner, Chunlei Wu
David Delano, James Goodale, Phil McClurg, Richard Trager
Faramarz Valafar, SDSU; Tim Vickers, Washington Univ
Michael Cooke, Pete Schultz
Funding: NIGMS, NIH; Novartis Research Foundation