Translating unstructured, crowdsourced content into structured data Andrew Su, Ph.D. The Scripps Research Institute NCBO Webinar February 20, 2013

NCBO Webinar: Translating unstructured, crowdsourced content into structured data


DESCRIPTION

The use of crowdsourcing in biology is gaining popularity as a mechanism to tackle challenges of massive scale. However, to maximize participation and lower the barriers to entry, contributions to crowdsourcing efforts are typically not well structured, which makes computing on these data difficult. This presentation discusses strategies for translating this unstructured content into structured data. Three vignettes (in varying degrees of completion) are described, one each from our Gene Wiki [1], BioGPS [2], and serious gaming [3] initiatives.

[1]: http://en.wikipedia.org/wiki/Portal:Gene_Wiki
[2]: http://biogps.org
[3]: http://genegames.org


Page 1: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Translating unstructured, crowdsourced content into structured data

Andrew Su, Ph.D., The Scripps Research Institute

NCBO Webinar

February 20, 2013

Page 2: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Human genetics underlies human health

~3 billion bases

~20,000 genes

Molecular diagnostics & therapeutics

Molecular understanding of:
• Biological function
• Genetic variation
• Mutation
• Deletion
• Amplification
• …

Gene annotations

Structured gene annotations

Page 3: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Structured gene annotations enable computation

Structured gene annotations

Page 4: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Few genes are well annotated

[Figure: GO annotation counts (y-axis) for 20,473 protein-coding genes, sorted by decreasing counts (x-axis). Labeled values: 41%, 65%. Labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF]

Data: NCBI, February 2013

Page 5: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Few genes are well annotated

[Figure: GO annotation counts (y-axis) for genes sorted by decreasing counts (x-axis)]

Data: NCBI, February 2013

+ Electronic annotation (IEA)

Page 6: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Few genes are well annotated

[Figure: GO annotation counts (y-axis) for genes sorted by decreasing counts (x-axis)]

Data: NCBI, February 2013

+ Electronic annotation (IEA)

Biological Process only

Page 7: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

311,696 articles (1.5% of PubMed) have been cited by GO annotations

Page 8: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.

Page 9: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Crowdsourcing empowers the entire scientific community to directly participate in the gene annotation process.

Page 10: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

From crowdsourcing to structured data

The Gene Wiki

GeneGames.org

Page 11: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

10,000 gene “stubs” within Wikipedia

Protein structure

Symbols and identifiers

Tissue expression pattern

Gene Ontology annotations

Links to structured databases

Gene summary

Protein interactions

Linked references

Huss, PLoS Biol, 2008

Page 12: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Gene Wiki has a critical mass of readers

Total: 4.0 million views / month

Huss, PLoS Biol, 2008; Good, NAR, 2011

Page 13: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Gene Wiki has a critical mass of editors

Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words
Approximately equal to 230 full-length articles

[Figure: plot with axes "Editor count" and "Edit count"; series: Editors, Edits]

Good, NAR, 2011

Page 14: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

A review article for every gene is powerful

Hyperlinks to related concepts

References to the literature

Reelin: 68 editors, 543 edits since July 2002

Heparin: 175 editors, 320 edits since June 2003

AMPK: 44 editors, 84 edits since March 2004

RNAi: 232 editors, 708 edits since October 2002

Page 15: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Filtering, extracting, and summarizing PubMed

Documents

Concepts

Page 16: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Document- and concept-centric text mining

Subject Object

Predicate

Page 17: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Simple text mining for gene annotations

Wikilink

GO exact match

Gene Wiki mapping

NCBI Entrez Gene: 334

Candidate assertion

GO:0006897

6319 novel Gene Ontology annotations
2147 novel Disease Ontology annotations

Good, BMC Genomics, 2011.
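The pipeline on this slide (wikilink, exact match to a GO term name, mapping from the Gene Wiki page to an Entrez Gene ID, candidate assertion) can be sketched roughly as follows. This is a minimal illustration and not the published implementation: the page title, the sample wikitext, and the two lookup tables are invented toy inputs, and only the identifiers 334 and GO:0006897 come from the slide.

```python
import re

# Toy lookup tables, invented for illustration (not data from the talk).
GO_TERMS = {"endocytosis": "GO:0006897"}      # lower-cased GO term name -> GO ID
PAGE_TO_ENTREZ = {"Example gene page": 334}   # hypothetical Gene Wiki title -> Entrez Gene ID

# Matches [[Target]], [[Target|label]] and [[Target#Section]]; captures the link target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def candidate_assertions(page_title, wikitext):
    """Return (Entrez Gene ID, GO ID) pairs for wikilinks that exactly match a GO term name."""
    gene_id = PAGE_TO_ENTREZ[page_title]
    found = []
    for target in WIKILINK.findall(wikitext):
        go_id = GO_TERMS.get(target.strip().lower())
        if go_id and (gene_id, go_id) not in found:
            found.append((gene_id, go_id))
    return found


text = "The protein is internalized by [[endocytosis]] and binds [[Heparin|heparin]]."
print(candidate_assertions("Example gene page", text))
```

Each pair is only a candidate assertion; exact string matching says nothing about whether the linked concept actually describes the gene, so some downstream review step is assumed.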

Page 18: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

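Gene Wiki content improves enrichment analysis

[Figure: scatter plot of enrichment p-value (PubMed only) against p-value (PubMed + GW); one side of the plot is labeled "More significant PubMed + GW", the other "More significant PubMed only"; labeled point: Muscle contraction]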

Good, BMC Genomics, 2011.

Page 19: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Gene Wiki+ for integrative queries

http://genewikiplus.org

mwsync

Good, J Biomed Semantics, 2012.

Page 20: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Dynamic queries across genes, diseases, SNPs

Good, J Biomed Semantics, 2012.

Page 21: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Gene Wiki+ for integrative queries

http://genewikiplus.org

mwsync

{{#ask: [[Category:Human_proteins]] [[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] [[HasSNP:: <q>[[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] </q>]]}}

OMIM, PharmGKB

Good, J Biomed Semantics, 2012.
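The nested `#ask` query on this slide can also be generated programmatically. The sketch below rebuilds the same condition string for an arbitrary category pair and wraps it in a URL for the Semantic MediaWiki `ask` API module. The property names `is_associated_with` and `HasSNP` come from the slide; the function names and the `api.php` location on genewikiplus.org are assumptions made for illustration.

```python
from urllib.parse import urlencode


def gene_disease_snp_conditions(category, disease_category):
    """Build the #ask conditions: pages in `category` associated with the disease
    that also have a SNP associated with the same disease."""
    disease = f"<q>[[Category:{disease_category}]]</q>"
    return (
        f"[[Category:{category}]] "
        f"[[is_associated_with:: {disease}]] "
        f"[[HasSNP:: <q>[[is_associated_with:: {disease}]] </q>]]"
    )


def ask_url(api_base, conditions):
    """URL for the Semantic MediaWiki 'ask' API module; api_base is a hypothetical endpoint."""
    return api_base + "?" + urlencode({"action": "ask", "query": conditions, "format": "json"})


conditions = gene_disease_snp_conditions("Human_proteins", "Breast_cancer")
print("{{#ask: " + conditions + "}}")
print(ask_url("http://genewikiplus.org/api.php", conditions))
```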

Page 22: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

OMIM, PharmGKB

Gene Wiki+ for integrative queries

http://genewikiplus.org

mwsync

Good, J Biomed Semantics, 2012.

Page 23: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Wikidata

Provide a database of the world’s knowledge that anyone can edit
- Denny Vrandečić

Page 24: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Wikidata

[Diagram: statements about Reelin, http://www.wikidata.org/wiki/Q414043]

Properties: is a (Property:P31), regulates (Property:P128), Interacts with (Property:P129)

Items: Protein (Q8054), Glycoprotein (Q187126), Neural development (Q1345738), VLDL receptor (Q1979313), Amyloid precursor protein (Q423510), Reelin (Q414043)

Page 25: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Wikidata

Property:P31

Property:P128

Property:P129

Q8054

Q187126

Q1345738

Q1979313

Q423510

Q414043

http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
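As a rough sketch of how the API call above might be used from a script, the following builds the `wbgetentities` URL and pulls (property, item) pairs out of a response. No network request is made: `SAMPLE_RESPONSE` is a hand-written, heavily trimmed stand-in for a real response, and its single P31/Q8054 statement is an assumption chosen from the identifiers on the slide.

```python
from urllib.parse import urlencode


def wbgetentities_url(item_id, language="en"):
    """Build a wbgetentities request URL like the one on the slide, asking for JSON."""
    params = {"action": "wbgetentities", "ids": item_id,
              "languages": language, "format": "json"}
    return "http://wikidata.org/w/api.php?" + urlencode(params)


# Hand-written illustration of the response shape; a real response has many more fields.
SAMPLE_RESPONSE = {
    "entities": {
        "Q414043": {
            "id": "Q414043",
            "labels": {"en": {"language": "en", "value": "Reelin"}},
            "claims": {
                "P31": [{"mainsnak": {"snaktype": "value", "property": "P31",
                                      "datavalue": {"type": "wikibase-entityid",
                                                    "value": {"entity-type": "item",
                                                              "id": "Q8054"}}}}],
            },
        }
    }
}


def item_statements(response, item_id):
    """Return (property ID, target item ID) pairs for statements whose value is an item."""
    claims = response["entities"][item_id].get("claims", {})
    pairs = []
    for prop, statements in claims.items():
        for statement in statements:
            value = statement["mainsnak"].get("datavalue", {}).get("value")
            if isinstance(value, dict) and "id" in value:
                pairs.append((prop, value["id"]))
    return pairs


print(wbgetentities_url("Q414043"))
print(item_statements(SAMPLE_RESPONSE, "Q414043"))
```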

Page 28: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

From crowdsourcing to structured data

The Gene Wiki

GeneGames.org

Page 29: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Not just the biomedical literature…

Page 30: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

BioGPS aggregates gene-centric information

http://biogps.org

Wu, NAR, 2013; Wu, Genome Biology, 2009.

Page 31: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

The plugin interface is simple and universal

KEGG: http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}

STRING: http://string-db.org/newstring_cgi?...&identifier={{EnsemblGene}}

PubMed: http://www.ncbi.nlm.nih.gov/sites/entrez?...&Term={{Symbol}}

[Diagram labels: URL template, Gene entity, Rendered URL]

Wu, NAR, 2013; Wu, Genome Biology, 2009.
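The plugin mechanism on this slide (a URL template plus a gene entity yields a rendered URL) amounts to placeholder substitution. A minimal sketch follows, using the KEGG template from the slide; the function name and the gene record are illustrative assumptions, and BioGPS's own implementation may differ.

```python
import re


def render_plugin_url(template, gene):
    """Replace each {{Key}} placeholder in the template with the matching gene field."""
    def fill(match):
        return str(gene[match.group(1)])
    return re.sub(r"\{\{(\w+)\}\}", fill, template)


# KEGG template as shown on the slide.
kegg_template = "http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}"

# Illustrative gene entity: CDK2, Entrez Gene ID 1017.
gene = {"EntrezGene": 1017, "Symbol": "CDK2"}

print(render_plugin_url(kegg_template, gene))
```

Because a plugin is nothing more than a template string keyed on gene identifiers, any site with gene-specific URLs can be registered without changes on the provider's side.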

Page 32: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

The plugin interface is simple and universal

Page 33: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

The plugin interface is simple and universal

Page 34: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

The plugin interface is simple and universal

Page 35: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

The plugin interface is simple and universal

Page 36: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

The plugin interface is simple and universal

Total of 389 gene-centric online databases registered as BioGPS plugins

Page 37: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

BioGPS has a critical mass of users

• > 4100 registered users
• 4000 unique visitors per week
• 40,000 page views per week

Top 10 organizations
1. Harvard
2. NIH
3. UCSD
4. Scripps
5. MIT
6. Cambridge
7. U Penn
8. Stanford
9. Wash U
10. UNC

[Figure: daily pageviews]

Wu, NAR, 2013; Wu, Genome Biology, 2009.

Page 38: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

All resources should provide RDF…

Page 39: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Mining structured content from HTML

Page 40: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Defining a data extraction template

TP53 TNF APOE IL6 VEGF … EGFR TGFB1

Page 41: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

The BioGPS Semantic Annotator

http://54.244.135.254:8000/

Page 42: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

From crowdsourcing to structured data

The Gene Wiki

GeneGames.org

Page 43: NCBO Webinar: Translating unstructured, crowdsourced content into structured data


http://www.flickr.com/photos/archana3k1/4124330493/

Seven million human hours

Page 44: NCBO Webinar: Translating unstructured, crowdsourced content into structured data


Twenty million human hours

http://www.flickr.com/photos/ableman/2171326385/

Page 45: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

150 billion human hours per year

http://www.flickr.com/photos/rvp-cw/6243289302/

Page 46: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Using games to fold proteins

Fold.it players have successfully:
• Outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010)
• Solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011)
• Designed an improved protein folding algorithm (Khatib, PNAS, 2011)
• Improved the enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011)

Page 47: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Using games to fold RNAs

http://eterna.cmu.edu/

Page 48: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Using games to align sequences

http://phylo.cs.mcgill.ca

Kawrykow, PLOS ONE, 2012.

Page 49: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Using games to annotate genes?

http://genegames.org

Page 50: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

No good gene-disease annotation database

Alzheimer's disease (AD)
Lipoprotein glomerulopathy
Sea-blue histiocyte disease

Query: Apolipoprotein E

Page 51: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

No good gene-disease annotation database

Alzheimer's disease (AD)
Lipoprotein glomerulopathy
Sea-blue histiocyte disease
Hyperlipoproteinemia, type III
Macular degeneration, age-related
Myocardial infarction susceptibility

Query: Apolipoprotein E

Page 52: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

No good gene-disease annotation database

Alzheimer's disease (AD)
Lipoprotein glomerulopathy
Sea-blue histiocyte disease
Hyperlipoproteinemia, type III
Macular degeneration, age-related
Myocardial infarction susceptibility
HIV
Psoriasis
Vascular Diseases

Query: Apolipoprotein E

[Five "?" callouts appear on this slide]

Page 53: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

No good gene-disease annotation database

Query: Apolipoprotein E

Alzheimer's disease (AD)
Neuropsychological Tests
Cognition Disorders
Dementia
Cognition
Disease Progression
Cardiovascular Diseases
Coronary Disease
Diabetes Mellitus, Type 2
Memory Disorders
Memory
Coronary Artery Disease
Hypertension
Mental Status Schedule
Psychiatric Status Rating Scales
Hyperlipidemias
Atrophy
Dementia, Vascular
Parkinson Disease
Brain Injuries
Myocardial Infarction
…

477 diseases!

Page 54: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Play Dizeez to annotate gene-disease links

1. Read the clue (gene)
2. Click the related disease (only one is “right”)
3. If it’s ‘right’, you get points
4. Then on to the next question…
5. Hurry!
6. Play to win!

Page 55: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Dizeez players seem pretty smart…

In total (since Dec 2011):
• 230 unique gamers
• 1045 games played
• 8525 guesses

# Occurrences   Gene     Disease
11              NBPF3    neuroblastoma
11              SOX8     mental retardation
9               ABL1     leukemia
9               SSX1     synovial sarcoma
8               APC      colorectal cancer
8               FES      sarcoma
8               RBP3     retinoblastoma
8               GAST     gastrinoma
8               DCC      colorectal cancer
8               MAP3K5   cancer

[Further table columns, cell values not preserved: Gene Wiki, OMIM, PharmGKB, PubMed]
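One plausible way to arrive at a table like this is to count how often players pick each gene-disease pair and then set aside pairs that a reference database already lists. The sketch below assumes that workflow; the guesses, the `known_links` set, and the `min_count` threshold are invented for illustration and are not Dizeez data.

```python
from collections import Counter


def candidate_links(guesses, known_links, min_count=2):
    """Count (gene, disease) guesses and keep frequent pairs absent from known_links.

    Returns (count, gene, disease) tuples, most frequent first.
    """
    counts = Counter((gene.upper(), disease.lower()) for gene, disease in guesses)
    return [(n, gene, disease)
            for (gene, disease), n in counts.most_common()
            if n >= min_count and (gene, disease) not in known_links]


# Invented toy guesses; each tuple is one player click.
guesses = [
    ("ABL1", "leukemia"), ("ABL1", "leukemia"), ("ABL1", "Leukemia"),
    ("APC", "colorectal cancer"), ("APC", "colorectal cancer"),
    ("DCC", "colorectal cancer"),
]
# Pairs assumed to be present already in some reference database.
known_links = {("APC", "colorectal cancer")}

for count, gene, disease in candidate_links(guesses, known_links):
    print(count, gene, disease)
```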

Page 56: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Using games to predict phenotype from genotype?

http://genegames.org

Page 57: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Classification problems in genome biology

[Diagram: data matrix of 100s of samples (cancer, normal) by 100,000s of features; find patterns; classify new samples as cancer or normal]

Methods: SVM, Neural networks, Naïve Bayes, KNN, …

Page 58: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Random forests

Sample subset of cases and features
Train decision tree

[Diagram: data matrix of 100s of samples (cancer, normal) by 100,000s of features]

Page 59: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Random forests

[Diagram: data matrix of 100s of samples (cancer, normal) by 100,000s of features]

Page 60: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Random forests

Classify new samples: cancer, normal

[Diagram: data matrix of 100s of samples (cancer, normal) by 100,000s of features]

How to interject biological knowledge?
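The random forest procedure in the preceding slides (sample a subset of cases and features, train a decision tree, repeat, then classify new samples by vote) can be sketched with the standard library alone. This is a small illustration under stated assumptions: each tree draws one feature subset, following the slide wording, whereas many implementations resample features at every split; splits use Gini impurity; the depth limit, tree count, and toy expression values are invented.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def build_tree(rows, labels, features, depth=0, max_depth=3):
    """Grow a tree using only the given feature indices; leaves are class labels."""
    if depth == max_depth or len(set(labels)) == 1:
        return majority(labels)
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            if not left or not right:
                continue
            score = sum(len(s) * gini([labels[i] for i in s]) for s in (left, right))
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return majority(labels)
    _, f, t, left, right = best
    return (f, t,
            build_tree([rows[i] for i in left], [labels[i] for i in left],
                       features, depth + 1, max_depth),
            build_tree([rows[i] for i in right], [labels[i] for i in right],
                       features, depth + 1, max_depth))


def predict_tree(tree, row):
    while isinstance(tree, tuple):          # internal node: (feature, threshold, low, high)
        f, t, low, high = tree
        tree = low if row[f] <= t else high
    return tree


def train_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]         # bootstrap sample of cases
        feats = rng.sample(range(len(rows[0])), n_features)    # random subset of features
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Classify a new sample by majority vote over the trees."""
    return majority([predict_tree(tree, row) for tree in forest])


# Invented toy data: 6 samples, 4 features.
rows = [[5.1, 6.2, 5.5, 6.8], [6.0, 5.4, 6.1, 5.9], [5.7, 6.6, 5.2, 6.3],
        [1.2, 2.1, 1.8, 2.5], [2.0, 1.4, 2.6, 1.1], [1.6, 2.8, 1.3, 2.2]]
labels = ["cancer", "cancer", "cancer", "normal", "normal", "normal"]

forest = train_forest(rows, labels)
print(predict_forest(forest, [9.0, 9.0, 9.0, 9.0]))
print(predict_forest(forest, [0.0, 0.0, 0.0, 0.0]))
```

The only step that changes in the variants discussed next is how the feature subset for each tree is chosen.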

Page 61: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Network-guided forests

Dutkowski & Ideker (2011). PLoS Computational Biology

Page 62: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Network-guided forests

Sample features by PPI network
Train decision tree

[Diagram: data matrix of 100s of samples (cancer, normal) by 100,000s of features]
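To make "sample features by PPI network" concrete, here is one simplified way such sampling could work: pick a seed gene at random, then grow the feature set outward through its interaction partners so that every chosen feature is connected to an earlier one. This is an illustrative simplification and not the algorithm of Dutkowski & Ideker; the network and gene names (G1 to G6) are invented.

```python
import random


def sample_features_by_network(network, n_features, rng):
    """Pick a random seed gene, then add network neighbors breadth-first.

    `network` maps each gene to the set of its interaction partners.
    Returns at most n_features genes, each (after the first) adjacent to an earlier pick.
    """
    seed_gene = rng.choice(sorted(network))
    chosen, frontier = [seed_gene], [seed_gene]
    while frontier and len(chosen) < n_features:
        gene = frontier.pop(0)
        neighbors = [g for g in sorted(network.get(gene, ())) if g not in chosen]
        rng.shuffle(neighbors)
        for g in neighbors:
            if len(chosen) < n_features:
                chosen.append(g)
                frontier.append(g)
    return chosen


# Invented toy interaction network (undirected, stored as adjacency sets).
network = {
    "G1": {"G2", "G3"}, "G2": {"G1", "G4"}, "G3": {"G1"},
    "G4": {"G2", "G5"}, "G5": {"G4", "G6"}, "G6": {"G5"},
}

print(sample_features_by_network(network, 3, random.Random(0)))
```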

Page 63: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Human-guided forests

Sample features by human intelligence
Train decision tree

[Diagram: data matrix of 100s of samples (cancer, normal) by 100,000s of features]
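"Sample features by human intelligence" could be realized in several ways. The sketch below assumes, hypothetically, that every gene a player selects counts as one vote and that each tree's features are drawn without replacement with probability proportional to those votes. The vote list is invented, and the game's actual mechanism may differ.

```python
import random
from collections import Counter


def sample_features_by_votes(picks, n_features, rng):
    """Weighted sampling without replacement: genes picked more often are more likely."""
    votes = Counter(picks)
    genes = sorted(votes)
    chosen = []
    while genes and len(chosen) < n_features:
        gene = rng.choices(genes, weights=[votes[g] for g in genes], k=1)[0]
        chosen.append(gene)
        genes.remove(gene)
    return chosen


# Invented player picks; repeated symbols mean several players chose the same gene.
picks = ["TP53", "TP53", "TP53", "EGFR", "EGFR", "TNF", "IL6"]

print(sample_features_by_votes(picks, 2, random.Random(0)))
```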

Page 64: NCBO Webinar: Translating unstructured, crowdsourced content into structured data


Page 65: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

The Cure: Genomic predictors for disease

Page 66: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

The Cure: Genomic predictors for disease

Page 67: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

The Cure: Genomic predictors for disease

Page 68: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

The Cure: Genomic predictors for disease

Page 69: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

The Cure: Genomic predictors for disease

Page 70: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

The Cure: Genomic predictors for disease

Page 71: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Human-guided forests

Classify new samples: cancer, normal

Page 72: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

“Critical Assessment”-style challenge

Page 73: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Results

• 214 registered players
  – 50% declared knowledge of cancer biology
  – 40% self-identified as having a Ph.D.
• Prediction results
  – 70% correct on survival concordance index
  – Best scoring model was 76%
  – Player registrations still increasing!
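For readers unfamiliar with the metric, a survival concordance index is the fraction of comparable patient pairs in which the patient predicted to be at higher risk has the shorter observed survival. The following is a minimal sketch of the common Harrell-style definition with censoring; the talk does not say which variant the challenge scored, and the numbers are invented.

```python
def concordance_index(times, events, risk_scores):
    """Harrell-style concordance index.

    times:       observed follow-up time per patient
    events:      1 if the event (e.g. death) was observed, 0 if censored
    risk_scores: model output, higher meaning shorter expected survival
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had the event before patient j's time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Invented toy data: four patients, the third is censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk_scores = [0.9, 0.3, 0.5, 0.1]

print(concordance_index(times, events, risk_scores))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives context for the 70% and 76% figures above.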

Page 74: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Crowdsourcing empowers the entire scientific community to directly participate in the gene annotation process.

Page 75: NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project

Group members
Katie Fisch
Ben Good
Salvatore Loguercio
Max Nanis
Chunlei Wu

Key group alumni
Erik Clarke
Jon Huss
Marc Leglise
Maximilian Ludvigsson
Ian MacLeod
Camilo Orozco

Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820)

Contact
http://sulab.org
[email protected]
@andrewsu
+Andrew Su