Creating structured biomedical knowledge networks via crowdsourcing
Tong Shu Li
Su Lab, The Scripps Research Institute
Bio-Ontologies SIG, ISMB 2015
2015-07-10


Knowledge networks allow for result interpretation

Bainbridge 2011

Network creation process

Relationship extraction subproblems

Crowdsourcing introduction

• Members of the public perform small tasks for small amounts of money
• Tasks are usually difficult for computers
• Workers contribute as a way of earning supplemental income
• Useful source of labor for academics and companies

Crowdsourcing driven biocuration

• Goal: replicate work done by PhD biocurators with members of the crowd
• Advantages:
  • Scalability
  • Faster results at a lower cost
  • Well suited for non-automatable tasks where an expert is not necessary

Crowdsourcing relies on gold standards for validation

• Crowdsourcing methods need to be validated with gold standards
• Gold standard: EU-ADR corpus [1]
  • “Positive”: known relationship
  • “Speculative”: uncertain relationship
  • “Negative”: known lack of relationship
  • “False”: no claim of relationship
  • Sentence-bound relationships
  • 300 abstracts annotated with relationships between genes/diseases/drugs

[1] van Mulligen et al. (2012) J. Biomed. Inform. 45: 879

Platform interface for relation annotation

Crowd agreement with the EU-ADR

• Strict agreement with EU-ADR: 71.67% (43/60 sentences)
• Agreement after combining speculative and positive: 76.67%
• 10 judgements/sentence
• 10 cents/judgement
• Time to complete: 2 hours
• Total cost: $182.21 USD
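
Agreement figures of this kind can be computed from the raw judgements with a plurality vote per sentence. The following is a minimal sketch, not the method used in the talk: the slides do not say how the 10 judgements per sentence were aggregated, so the plurality rule, the function names, and the toy data are all assumptions made for illustration.

```python
from collections import Counter

def merge_speculative(label):
    """Treat "speculative" as "positive" (the combined setting on the slide)."""
    return "positive" if label == "speculative" else label

def plurality_label(judgements):
    """Most common label for one sentence; ties go to the label seen first."""
    return Counter(judgements).most_common(1)[0][0]

def agreement(crowd, gold, merge=False):
    """Fraction of sentences whose plurality crowd label equals the gold label."""
    hits = 0
    for sentence_id, judgements in crowd.items():
        gold_label = gold[sentence_id]
        if merge:
            # Assumption: labels are merged before voting, not after.
            judgements = [merge_speculative(j) for j in judgements]
            gold_label = merge_speculative(gold_label)
        hits += plurality_label(judgements) == gold_label
    return hits / len(crowd)

# Hypothetical toy data: three sentences, 10 judgements each.
crowd = {
    "s1": ["positive"] * 7 + ["speculative"] * 3,
    "s2": ["speculative"] * 6 + ["positive"] * 4,
    "s3": ["false"] * 8 + ["negative"] * 2,
}
gold = {"s1": "positive", "s2": "positive", "s3": "false"}

strict = agreement(crowd, gold)
merged = agreement(crowd, gold, merge=True)
```

On the toy data, sentence s2 disagrees with the gold standard under strict matching and agrees once speculative and positive are combined. For the reported numbers, 43/60 gives 71.67%, and 76.67% is consistent with 46 of 60 sentences.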

Variability of gold standards

Figure axes: number of experts who chose that relationship type; percent of raw EU-ADR relations

Crowd agreement as a proxy for clarity

Figure axis: percent of crowd which chose published EU-ADR answer

Crowd agreement and accuracy probability

Figure axes: percent crowd agreement for the top choice; percent of annotations which agreed with EU-ADR
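
The relationship between the two axes above can be sketched as follows: for each sentence, take the share of workers who picked the most popular label, then measure how often that label matches EU-ADR at each agreement level. This is an illustrative sketch only. The grouping into levels by rounding, the function names, and the toy data are assumptions, and the talk may have computed the plot differently.

```python
from collections import Counter, defaultdict

def top_choice_agreement(judgements):
    """Return (top label, fraction of judgements that chose it)."""
    label, votes = Counter(judgements).most_common(1)[0]
    return label, votes / len(judgements)

def accuracy_by_agreement(crowd, gold):
    """Map each top-choice agreement level to the share of sentences
    at that level whose top label matches the gold label."""
    buckets = defaultdict(list)
    for sentence_id, judgements in crowd.items():
        label, fraction = top_choice_agreement(judgements)
        # With 10 judgements per sentence the levels are multiples of 0.1.
        buckets[round(fraction, 1)].append(label == gold[sentence_id])
    return {level: sum(hits) / len(hits) for level, hits in sorted(buckets.items())}

# Hypothetical toy data: four sentences, 10 judgements each.
crowd = {
    "s1": ["positive"] * 9 + ["false"],
    "s2": ["positive"] * 9 + ["speculative"],
    "s3": ["negative"] * 5 + ["false"] * 5,
    "s4": ["false"] * 5 + ["negative"] * 3 + ["positive"] * 2,
}
gold = {"s1": "positive", "s2": "positive", "s3": "false", "s4": "false"}

curve = accuracy_by_agreement(crowd, gold)
```

In the toy data, sentences with 90% top-choice agreement all match the gold standard, while those at 50% match only half the time, which is the kind of trend the slide title describes.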

Abstract level relationship extraction

Preliminary results

• AUC of 0.904
• Max F-score of 0.791 (0.773 precision, 0.809 recall)
• Max F-score achieved at a voting score of 0.407
• 4.5 hours, $54.72 USD to annotate 30 abstracts
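
The maximum F-score and its voting-score cutoff can be found by sweeping a threshold over per-relation voting scores. A minimal sketch, assuming the voting score is a number per candidate relation (for example the fraction of workers who voted for it) and that relations at or above the cutoff count as predicted; the slides do not define the score, and the function names and toy data are hypothetical.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, truth):
    """Try every observed score as a cutoff; return (max F-score, cutoff)."""
    best_f, best_cutoff = 0.0, None
    positives = sum(truth)
    for cutoff in sorted(set(scores)):
        predicted = [s >= cutoff for s in scores]
        true_pos = sum(p and t for p, t in zip(predicted, truth))
        if true_pos == 0:
            continue  # F-score is zero or undefined at this cutoff
        precision = true_pos / sum(predicted)
        recall = true_pos / positives
        f = f_score(precision, recall)
        if f > best_f:
            best_f, best_cutoff = f, cutoff
    return best_f, best_cutoff

# Check against the slide: 0.773 precision and 0.809 recall give F of about 0.791.
reported_f = round(f_score(0.773, 0.809), 3)

# Hypothetical toy data: voting scores and whether the gold standard has the relation.
scores = [0.9, 0.8, 0.45, 0.4, 0.2, 0.1]
truth = [True, True, False, True, False, False]
max_f, cutoff = best_threshold(scores, truth)
```

On the toy data the best cutoff is 0.4, where all three true relations are recovered with one false positive.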

Conclusion and next steps

• Gold standards are variable and imperfect
• Binary agreement may hide interesting information
• Expert and crowd agreement can be used to measure gold standard consistency
• Ambiguous portions of a gold standard may need to be treated differently during evaluations
• Integration with machine learning methods
  • Data generation
  • Feature extraction
• Semantically typed relationships

Acknowledgements

• Dr. Andrew Su
• Dr. Benjamin Good
• Dr. Laura Furlong
• Dr. Zhiyong Lu
• The Su Lab
• We’re hiring!

EU-ADR relationship examples

• Positive
  • For exposure levels within standard recommended guidelines, radioisotopes are far more likely to play a role in the occurrence of spontaneous abortions than X-rays.
• Speculative
  • Information from the SITE Cohort Study should clarify whether use of these immunosuppressive drugs for ocular inflammation increases the risk of mortality and fatal cancer.
• Negative
  • We found no evidence of impaired control of the carbohydrate and lipid metabolism or aggravation of vascular lesions during the two years an etonogestrel implant was used by diabetic women.
• False
  • The frequency of PONV did not correlate to the amounts of alfentanil, propofol, postoperative antiemetics consumed, or to female gender, non-smoking status, and history of PONV or motion sickness.

Data for all 244 drug-disease sentences

Crowd agreement and accuracy probability

Figure axes: percent of annotations which agreed with EU-ADR; percent crowd agreement for the top choice