NCBO Webinar: Translating unstructured, crowdsourced content into structured data

DESCRIPTION

The use of crowdsourcing in biology is gaining popularity as a mechanism to tackle challenges of massive scale. However, to maximize participation and lower the barriers to entry, contributions to crowdsourcing efforts are typically not well structured, which makes computing on these data difficult. The presentation will discuss strategies for translating this unstructured content into structured data. Three vignettes (in varying degrees of completion) will be described, one each from our Gene Wiki [1], BioGPS [2], and serious gaming [3] initiatives.

[1]: http://en.wikipedia.org/wiki/Portal:Gene_Wiki
[2]: http://biogps.org
[3]: http://genegames.org

Translating unstructured, crowdsourced content into structured data

Andrew Su, Ph.D., The Scripps Research Institute

NCBO Webinar

February 20, 2013

Human genetics underlies human health

~3 billion bases

~20,000 genes

Molecular diagnostics & therapeutics

Molecular understanding of:
• Biological function
• Genetic variation
• Mutation
• Deletion
• Amplification
• …

Gene annotations

Structured gene annotations

Structured gene annotations enable computation

Structured gene annotations

Few genes are well annotated

41%

65%

CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF

Data: NCBI, February 2013

20,473 protein-coding genes

[Figure: GO annotation counts (y-axis) for genes sorted by decreasing counts (x-axis)]

Few genes are well annotated

[Figure: GO annotation counts (y-axis) for genes sorted by decreasing counts (x-axis)]

Data: NCBI, February 2013

+ Electronic annotation (IEA)

Few genes are well annotated

[Figure: GO annotation counts (y-axis) for genes sorted by decreasing counts (x-axis)]

Data: NCBI, February 2013

+ Electronic annotation (IEA)

Biological Process only

311,696 articles (1.5% of PubMed) have been cited by GO annotations

Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.

Crowdsourcing empowers the entire scientific community to directly participate in the gene annotation process.

From crowdsourcing to structured data

The Gene Wiki

GeneGames.org

10,000 gene “stubs” within Wikipedia

Protein structure

Symbols and identifiers

Tissue expression pattern

Gene Ontology annotations

Links to structured databases

Gene summary

Protein interactions

Linked references

Huss, PLoS Biol, 2008

Gene Wiki has a critical mass of readers

Total: 4.0 million views / month

Huss, PLoS Biol, 2008; Good, NAR, 2011

Gene Wiki has a critical mass of editors

Increase of ~10,000 words / month from >1,000 edits

Currently 1.42 million words

Approximately equal to 230 full-length articles

Good, NAR, 2011

[Figure: editor count and edit count; series labeled Editors and Edits]

A review article for every gene is powerful

Hyperlinks to related concepts

References to the literature

Reelin: 68 editors, 543 edits since July 2002

Heparin: 175 editors, 320 edits since June 2003

AMPK: 44 editors, 84 edits since March 2004

RNAi: 232 editors, 708 edits since October 2002

Filtering, extracting, and summarizing PubMed

Documents

Concepts

Document- and concept-centric text mining

Subject Object

Predicate

Simple text mining for gene annotations

Wikilink

GO exact match

Gene Wiki mapping

NCBI Entrez Gene: 334

Candidate assertion

GO:0006897

6319 novel Gene Ontology annotations

2147 novel Disease Ontology annotations

Good, BMC Genomics, 2011.
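The pipeline on this slide (wikilink, exact match to a GO term, mapping of the article to an Entrez Gene ID, candidate assertion) can be sketched roughly as follows. This is a minimal illustration, not the published method: the article title, the wikitext snippet, the one-entry label table, and the function name are all assumptions made for the example. The label "endocytosis" for GO:0006897 is supplied here; the slide shows only the ID.

```python
import re

# Illustrative inputs (assumptions for this sketch, not data from the talk).
GO_LABELS = {"endocytosis": "GO:0006897"}         # GO term label -> GO ID
ARTICLE_TO_GENE = {"Example gene article": 334}   # Gene Wiki page -> NCBI Entrez Gene ID

# Matches [[Target]], [[Target|display text]] and [[Target#Section]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")

def candidate_assertions(title, wikitext):
    """Return (gene_id, go_id) pairs for wikilinks whose target exactly matches a GO label."""
    gene_id = ARTICLE_TO_GENE[title]
    found = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip().lower()
        go_id = GO_LABELS.get(target)
        if go_id and (gene_id, go_id) not in found:
            found.append((gene_id, go_id))
    return found

text = "The protein is involved in [[endocytosis]] and binds [[Clathrin|clathrin]]."
print(candidate_assertions("Example gene article", text))
```

Exact matching keeps precision high at the cost of recall; synonyms and redirects would need an additional lookup table.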

Gene Wiki content improves enrichment analysis

p-value (PubMed only)

p-value (PubMed + GW)

Muscle contraction

More significant

PubMed + GW

More significant

PubMed only

Good, BMC Genomics, 2011.

Gene Wiki+ for integrative queries

http://genewikiplus.org

mwsync

Good, J Biomed Semantics, 2012.

Dynamic queries across genes, diseases, SNPs

Good, J Biomed Semantics, 2012.

Gene Wiki+ for integrative queries

http://genewikiplus.org

mwsync

{{#ask: [[Category:Human_proteins]] [[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] [[HasSNP:: <q>[[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] </q>]]}}

OMIM, PharmGKB

Good, J Biomed Semantics, 2012.

OMIM, PharmGKB

Gene Wiki+ for integrative queries

http://genewikiplus.org

mwsync

Good, J Biomed Semantics, 2012.

Wikidata

Provide a database of the world’s knowledge that anyone can edit

- Denny Vrandečić

Wikidata

is a

regulates

Interacts with

Protein

Glycoprotein

Neural development

VLDL receptor

Amyloid precursor protein

Property:P31

Property:P128

Property:P129

Q8054

Q187126

Q1345738

Q1979313

Q423510

Q414043

Reelin

http://www.wikidata.org/wiki/Q414043

Wikidata

Property:P31

Property:P128

Property:P129

Q8054

Q187126

Q1345738

Q1979313

Q423510

Q414043

http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
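As a rough sketch of how a client might use the `wbgetentities` call shown above, the snippet below builds the request URL and pulls the target items of one property out of a response. The `format=json` parameter is an addition not shown on the slide, the helper names are made up for this example, and `sample` is a hand-written, heavily abbreviated fragment in the shape of an API response rather than real output. The pairing of P31 with Q8054 is a reading of the slide's diagram.

```python
from urllib.parse import urlencode

API = "http://wikidata.org/w/api.php"

def wbgetentities_url(entity_id, language="en"):
    """Build the request URL; format=json is added here so the reply is machine-readable."""
    params = {"action": "wbgetentities", "ids": entity_id,
              "languages": language, "format": "json"}
    return API + "?" + urlencode(params)

def claim_targets(entity, prop):
    """Return the item IDs (e.g. 'Q8054') that `entity` points to through property `prop`."""
    targets = []
    for claim in entity.get("claims", {}).get(prop, []):
        value = claim["mainsnak"].get("datavalue", {}).get("value", {})
        if isinstance(value, dict) and "numeric-id" in value:
            targets.append("Q%d" % value["numeric-id"])
    return targets

# Hand-written, abbreviated fragment (an assumption for illustration only).
sample = {"entities": {"Q414043": {
    "labels": {"en": {"language": "en", "value": "Reelin"}},
    "claims": {"P31": [{"mainsnak": {
        "snaktype": "value", "property": "P31",
        "datavalue": {"type": "wikibase-entityid",
                      "value": {"entity-type": "item", "numeric-id": 8054}}}}]},
}}}

print(wbgetentities_url("Q414043"))
print(claim_targets(sample["entities"]["Q414043"], "P31"))
```

Fetching the URL (for example with `urllib.request.urlopen`) and decoding the body with `json.loads` would yield a dictionary that can be passed to `claim_targets` in the same way.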

From crowdsourcing to structured data

The Gene Wiki

GeneGames.org

Not just the biomedical literature…

BioGPS aggregates gene-centric information

http://biogps.org

Wu, NAR, 2013; Wu, Genome Biology, 2009.

The plugin interface is simple and universal

KEGG: http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}

STRING: http://string-db.org/newstring_cgi?...&identifier={{EnsemblGene}}

PubMed: http://www.ncbi.nlm.nih.gov/sites/entrez?...&Term={{Symbol}}

URL template

Gene entity

Rendered URL

Wu, NAR, 2013; Wu, Genome Biology, 2009.
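The URL-template idea above amounts to substituting fields of a gene entity into `{{...}}` placeholders. The following is a minimal sketch under that reading; the function name and the shape of the gene dictionary are assumptions, and BioGPS's actual implementation may differ. The KEGG template is the one from the slide, and 7157 is the Entrez Gene ID of TP53.

```python
import re
from urllib.parse import quote

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

def render_url(template, gene):
    """Replace each {{Field}} in `template` with the URL-quoted value of gene[Field]."""
    def substitute(match):
        key = match.group(1)
        if key not in gene:
            raise KeyError(f"gene entity has no field {key!r}")
        return quote(str(gene[key]), safe="")
    return PLACEHOLDER.sub(substitute, template)

# Hypothetical gene entity; only the fields used by the template are needed.
gene = {"EntrezGene": 7157, "Symbol": "TP53"}
kegg = "http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}"
print(render_url(kegg, gene))  # http://www.genome.jp/dbget-bin/www_bget?hsa:7157
```

Because a plugin is only a template plus an identifier type, any gene-centric website with predictable URLs can be registered without changes on the provider's side.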

The plugin interface is simple and universal

Total of 389 gene-centric online databases registered as BioGPS plugins

BioGPS has a critical mass of users

• > 4100 registered users
• 4000 unique visitors per week
• 40,000 page views per week

Top 10 organizations

1. Harvard
2. NIH
3. UCSD
4. Scripps
5. MIT
6. Cambridge
7. U Penn
8. Stanford
9. Wash U
10. UNC

Daily pageviews

Wu, NAR, 2013; Wu, Genome Biology, 2009.

All resources should provide RDF…

Mining structured content from HTML

Defining a data extraction template

TP53 TNF APOE IL6 VEGF …EGFR TGFB1

The BioGPS Semantic Annotator

http://54.244.135.254:8000/

From crowdsourcing to structured data

The Gene Wiki

GeneGames.org

Seven million human hours

http://www.flickr.com/photos/archana3k1/4124330493/

Twenty million human hours

http://www.flickr.com/photos/ableman/2171326385/

150 billion human hours per year

http://www.flickr.com/photos/rvp-cw/6243289302/

Using games to fold proteins

Fold.it players have successfully:
• Outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010)
• Solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011)
• Designed an improved protein folding algorithm (Khatib, PNAS, 2011)
• Improved enzyme activity of de novo designed enzyme (Eiben, Nat Biotechnol, 2011)

Using games to fold RNAs

http://eterna.cmu.edu/

Using games to align sequences

http://phylo.cs.mcgill.ca

Kawrykow, PLOS ONE, 2012.

Using games to annotate genes?

http://genegames.org

No good gene-disease annotation database

Alzheimer's disease (AD)
Lipoprotein glomerulopathy
Sea-blue histiocyte disease

Query: Apolipoprotein E

No good gene-disease annotation database

Alzheimer's disease (AD)
Lipoprotein glomerulopathy
Sea-blue histiocyte disease
Hyperlipoproteinemia, type III
Macular degeneration, age-related
Myocardial infarction susceptibility

Query: Apolipoprotein E

No good gene-disease annotation database

Alzheimer's disease (AD)
Lipoprotein glomerulopathy
Sea-blue histiocyte disease
Hyperlipoproteinemia, type III
Macular degeneration, age-related
Myocardial infarction susceptibility
HIV
Psoriasis
Vascular Diseases

Query: Apolipoprotein E

? ? ? ? ?

No good gene-disease annotation database

Alzheimer's disease (AD); Neuropsychological Tests; Cognition Disorders; Dementia; Cognition; Disease Progression; Cardiovascular Diseases; Coronary Disease; Diabetes Mellitus, Type 2; Memory Disorders; Memory; Coronary Artery Disease; Hypertension; Mental Status Schedule; Psychiatric Status Rating Scales; Hyperlipidemias; Atrophy; Dementia, Vascular; Parkinson Disease; Brain Injuries; Myocardial Infarction; …

Query: Apolipoprotein E

477 diseases!

Play Dizeez to annotate gene-disease links

1. Read the clue (gene)
2. Click the related disease (only one is “right”)
3. If it’s ‘right’, you get points
4. Then on to the next question…
5. Hurry!
6. Play to win!

Dizeez players seem pretty smart…

In total (since Dec 2011):
• 230 unique gamers
• 1045 games played
• 8525 guesses

# Occurrences  Gene    Disease
11             NBPF3   neuroblastoma
11             SOX8    mental retardation
9              ABL1    leukemia
9              SSX1    synovial sarcoma
8              APC     colorectal cancer
8              FES     sarcoma
8              RBP3    retinoblastoma
8              GAST    gastrinoma
8              DCC     colorectal cancer
8              MAP3K5  cancer

Gene Wiki OMIM PharmGKB PubMed
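A table like the one above could be produced by tallying how often players assert the same gene-disease pair. The sketch below assumes that "# Occurrences" counts repeated guesses of a pair, which the slide does not state explicitly. The guess log is invented for illustration and the function name is hypothetical.

```python
from collections import Counter

def tally_guesses(guesses, min_count=2):
    """Count identical (gene, disease) guesses and keep pairs seen at least `min_count` times."""
    counts = Counter((gene, disease) for gene, disease in guesses)
    return [(n, gene, disease)
            for (gene, disease), n in counts.most_common()
            if n >= min_count]

# Invented guess log (not data from the game).
guesses = [
    ("ABL1", "leukemia"), ("APC", "colorectal cancer"), ("ABL1", "leukemia"),
    ("ABL1", "leukemia"), ("APC", "colorectal cancer"), ("FES", "sarcoma"),
]
for n, gene, disease in tally_guesses(guesses):
    print(n, gene, disease)
```

Requiring agreement among several independent players is one simple way to filter out lucky or careless guesses before comparing against curated sources.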

Using games to predict phenotype from genotype?

http://genegames.org

Classification problems in genome biology

[Figure: data matrix of 100s samples × 100,000s features, samples labeled cancer or normal]

find patterns: SVM, Neural networks, Naïve Bayes, KNN, …

Classify new samples: cancer / normal

Random forests

[Figure: data matrix of 100s samples × 100,000s features, samples labeled cancer or normal]

Sample subset of cases and features

Train decision tree

Random forests

[Figure: data matrix of 100s samples × 100,000s features, samples labeled cancer or normal]

Random forests

[Figure: data matrix of 100s samples × 100,000s features, samples labeled cancer or normal]

Classify new samples: cancer / normal
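The random forest steps on these slides (sample a subset of cases and features, train a tree, repeat, then classify new samples by vote) can be sketched as below. To stay short, each "tree" is reduced to a single-split decision stump, which is a simplification of a full decision tree. The toy expression values, the parameter defaults, and all function names are assumptions for illustration.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def train_stump(X, y, features):
    """One-split 'tree': the (feature, threshold) among `features` with the fewest errors."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(lab != l_lab for lab in left)
                      + sum(lab != r_lab for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # all sampled rows identical on these features
        m = majority(y)
        return (features[0], float("inf"), m, m)
    return best[1:]

def train_forest(X, y, n_trees=25, n_features=2, seed=0):
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]        # bootstrap sample of cases
        feats = rng.sample(range(len(X[0])), n_features)   # random subset of features
        forest.append(train_stump([X[i] for i in rows], [y[i] for i in rows], feats))
    return forest

def predict(forest, row):
    """Majority vote over all trees."""
    votes = Counter(l if row[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]

# Toy data: 20 samples, 3 features, constructed so the classes separate cleanly.
X = [[v, v + 1, 2 * v] for v in range(1, 11)] + [[v, v + 1, 2 * v] for v in range(21, 31)]
y = ["normal"] * 10 + ["cancer"] * 10
forest = train_forest(X, y)
print(predict(forest, [0, 0, 0]), predict(forest, [40, 41, 80]))
```

With real expression data the number of features far exceeds the number of samples, which is why the feature-sampling step matters: it decides which genes each tree is allowed to consider.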

How to interject biological knowledge?

Network-guided forests

Dutkowski & Ideker (2011). PLoS Computational Biology

Network-guided forests

[Figure: data matrix of 100s samples × 100,000s features, samples labeled cancer or normal]

Sample features by PPI network

Train decision tree

Human-guided forests

[Figure: data matrix of 100s samples × 100,000s features, samples labeled cancer or normal]

Sample features by human intelligence

Train decision tree
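In both the network-guided and human-guided variants, the only step that changes relative to a plain random forest is how the features for each tree are chosen. The sketch below shows two possible samplers under loudly stated assumptions: it is neither the algorithm of Dutkowski & Ideker nor the one behind the games described in this talk. The toy network, the pick counts, and the function names are all hypothetical.

```python
import random

def sample_by_network(network, k, rng):
    """Grow a connected set of up to k genes outward from a random seed gene."""
    seed = rng.choice(sorted(network))
    chosen = [seed]
    frontier = set(network[seed]) - {seed}
    while frontier and len(chosen) < k:
        gene = rng.choice(sorted(frontier))
        chosen.append(gene)
        frontier |= set(network.get(gene, ()))
        frontier -= set(chosen)
    return chosen

def sample_by_players(picks, k, rng):
    """Draw up to k distinct genes, weighted by how often players selected each one."""
    pool = dict(picks)
    chosen = []
    while pool and len(chosen) < k:
        genes = sorted(pool)
        gene = rng.choices(genes, weights=[pool[g] for g in genes])[0]
        chosen.append(gene)
        del pool[gene]
    return chosen

# Hypothetical protein-protein interaction network (undirected adjacency lists).
NETWORK = {"A": ["B", "C"], "B": ["A"], "C": ["A", "D"],
           "D": ["C"], "E": ["F"], "F": ["E"]}
# Hypothetical counts of how often players chose each gene.
PLAYER_PICKS = {"A": 5, "B": 1, "C": 3}

rng = random.Random(1)
print(sample_by_network(NETWORK, 3, rng))
print(sample_by_players(PLAYER_PICKS, 2, rng))
```

Either sampler can be passed to a forest trainer in place of uniform feature sampling, so each tree is built from genes that are related in the network or that people judged relevant.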

The Cure: Genomic predictors for disease

Human-guided forests

Classify new samples

cancer

normal

“Critical Assessment”-style challenge

Results

• 214 registered players
  – 50% declared knowledge of cancer biology
  – 40% self-identified as having Ph.D.
• Prediction results
  – 70% correct on survival concordance index
  – Best scoring model was 76%
  – Player registrations still increasing!
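For readers unfamiliar with the metric, a survival concordance index measures how often a model ranks pairs of patients in the right order. The slide does not say which variant was used in the challenge; the sketch below assumes Harrell's C, ignores pairs with tied survival times, and uses made-up numbers.

```python
def concordance_index(times, events, risks):
    """Fraction of comparable pairs in which the patient who died earlier had the higher risk.

    times  -- follow-up time per patient
    events -- 1 if death was observed, 0 if the patient was censored
    risks  -- predicted risk score (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i had an observed event before j's time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Made-up example: four patients, the third one censored.
times = [5, 10, 12, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.6, 0.1]
print(concordance_index(times, events, risks))  # 4 of 5 comparable pairs -> 0.8
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading the percentages on this slide.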

Crowdsourcing empowers the entire scientific community to directly participate in the gene annotation process.

Collaborators

Doug Howe, ZFIN
John Hogenesch, U Penn
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project

Group members

Katie Fisch
Ben Good
Salvatore Loguercio
Max Nanis
Chunlei Wu

Key group alumni

Erik Clarke
Jon Huss
Marc Leglise
Maximilian Ludvigsson
Ian MacLeod
Camilo Orozco

Funding and Support

(BioGPS: GM83924, Gene Wiki: GM089820)

Contact

http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su