
Anca Dumitrache, Lora Aroyo, Chris Welty http://CrowdTruth.org

Measures for Language Ambiguity in Medical Relation Extraction

Linked Data for Information Extraction @ ISWC2015

#CrowdTruth @anouk_anca @laroyo @cawelty #LD4IE2015

•  Most knowledge is in text, but it's not structured
•  Linked Data sources are a good start, but incomplete
•  Goal (Distant Supervision):
   –  extract LD triples from text
   –  given existing tuples, find sentences that mention both args
   –  use resulting sentences as TP to train a classifier (sketched below)
•  But it can sometimes be wrong:
   –  <PALPATION> location <CHEST>
   –  feeling the way the CHEST expands (PALPATION) can identify areas of lung that are full of fluid
•  Standard approach ⇒ Expert Annotation

Background
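The distant-supervision step above can be summarized in a few lines of code. This is a minimal illustrative sketch, not the authors' pipeline: the seed triple and the first sentence come from the slide, the second sentence is invented for contrast, and plain substring matching stands in for real entity recognition.

```python
# Minimal distant-supervision sketch (illustrative, not the authors' pipeline):
# given seed (arg1, relation, arg2) triples from a Linked Data source, collect
# sentences that mention both arguments and treat them as true positives (TP).

seed_triples = [
    ("PALPATION", "location", "CHEST"),  # the example triple from the slide
]

sentences = [
    "Feeling the way the CHEST expands (PALPATION) can identify areas of lung "
    "that are full of fluid.",
    "The CHEST X-ray showed no abnormalities.",  # invented, for contrast
]

def distant_supervision(triples, corpus):
    """Pair each seed triple with every sentence that mentions both arguments."""
    positives = []
    for arg1, relation, arg2 in triples:
        for sentence in corpus:
            if arg1 in sentence and arg2 in sentence:
                positives.append((sentence, relation))
    return positives

print(distant_supervision(seed_triples, sentences))
# The first sentence is collected as a positive example for "location" even
# though it does not express that relation -- the failure mode noted above.
```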

•  Human annotators with domain knowledge provide better annotated data, e.g. if you want medical texts annotated for medical relations, you need medical experts
•  But experts are expensive & don't scale
•  Multiple perspectives on data can be useful, beyond what experts believe is salient or correct

Human Annotation Myth: Experts know best

What if the CROWD IS BETTER?


What is the relation between the highlighted terms?  

He was the first physician to identify the relationship between HEMOPHILIA and HEMOPHILIC ARTHROPATHY.

experts: cause
crowd: no relation

Crowd reads text literally, providing better examples to the machine

Experts Know Best?


Experts Know Best?

experts vs. crowd?

What is the (medical) relation between the highlighted (medical) terms?

•  91% of expert annotations covered by the crowd
•  expert annotators reach agreement in only 30% of cases
•  the most popular crowd vote covers 95% of this

expert annotation agreement


•  Traditionally, disagreement is considered a measure of poor quality in the annotation task because:
   –  the task is poorly defined, or
   –  the annotators lack training
   rather than accepting disagreement as a natural property of semantic interpretation
•  This makes the elimination of disagreement a goal

Human Annotation Myth: Disagreement is Bad

What if it is GOOD?


Disagreement Bad?

Does each sentence express the TREAT relation?

ANTIBIOTICS are the first line treatment for indications of TYPHUS. → agreement 95%
Patients with TYPHUS who were given ANTIBIOTICS exhibited side-effects. → agreement 80%
With ANTIBIOTICS in short supply, DDT was used during WWII to control the insect vectors of TYPHUS. → agreement 50%

Disagreement can reflect the degree of clarity in a sentence


•  Annotator disagreement is signal, not noise.

•  It is indicative of the variation in human semantic interpretation of signs

•  It can indicate ambiguity, vagueness, similarity, over-generality, etc., as well as quality

CrowdTruth

•  Goal:
   –  collect a Medical RelEx Gold Standard
   –  improve the performance of a RelEx Classifier

•  Approach:
   –  crowdsource 900 medical sentences
   –  measure disagreement with the CrowdTruth Metrics
   –  train & evaluate a classifier with the CrowdTruth SRS Score

CrowdTruth for medical relation extraction


RelEx Task in CrowdFlower

Patients with ACUTE FEVER and nausea could be suffering from INFLUENZA AH1N1

Is ACUTE FEVER – related to → INFLUENZA AH1N1?


[Figure: a single worker's judgment on a sentence, represented as a binary vector over the candidate relations]

Worker Vector


[Figure: the worker vectors for a sentence are summed per relation into the sentence vector, e.g. 0 1 1 0 0 4 3 0 0 5 1 0; see the sketch below]

Sentence Vector
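As a concrete sketch of this representation, assuming 12 candidate relations: the individual worker rows below are invented, chosen only so that their column sums reproduce the sentence vector shown on the slide.

```python
import numpy as np

# Illustrative sketch of the CrowdTruth vectors: each worker's answer on a
# sentence is a binary vector over the candidate relations, and the sentence
# vector is the per-relation sum of all worker vectors. The worker rows below
# are assumed; only their column sums match the slide.
worker_vectors = np.array([
    # one row per worker, one column per relation (1 = relation selected)
    [0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
])

sentence_vector = worker_vectors.sum(axis=0)
print(sentence_vector)  # -> [0 1 1 0 0 4 3 0 0 5 1 0]
```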


Unclear relationship between the two arguments reflected in the disagreement

Sentence  Clarity  


Clearly expressed relation between the two arguments reflected in the agreement

Sentence  Clarity  


Measures how clearly a sentence expresses a relation

[Figure: the score is the cosine between the sentence vector (0 1 1 0 0 4 3 0 0 5 1 0) and the unit vector for the relation, here relation R6, giving Cosine = .55; see the sketch below]

Sentence-Relation Score (SRS)
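A minimal sketch of the SRS computation as drawn on the slide. Because the relation's unit vector contains a single 1, the cosine reduces to that relation's vote count divided by the norm of the sentence vector; assuming R6 is the sixth relation (index 5), this reproduces the 0.55 above.

```python
import numpy as np

# Sentence-relation score (SRS): cosine between the sentence vector and the
# unit vector of the relation in question.
sentence_vector = np.array([0, 1, 1, 0, 0, 4, 3, 0, 0, 5, 1, 0])

def sentence_relation_score(sentence_vector, relation_index):
    """Cosine between the sentence vector and the relation's unit vector."""
    unit = np.zeros(len(sentence_vector))
    unit[relation_index] = 1.0
    return float(np.dot(sentence_vector, unit) / np.linalg.norm(sentence_vector))

# Assuming R6 is the sixth relation (index 5, vote count 4):
print(round(sentence_relation_score(sentence_vector, relation_index=5), 2))  # 0.55
```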


[Figure: annotation-quality F1 of crowd (0.907, p = 0.007) vs. expert (0.844) annotations]

In the [0.6 - 0.8] range of crowd thresholds, the crowd significantly out-performs the expert, with a maximum of 0.907 F1 at the 0.7 threshold

Annotation Quality of Expert vs. Crowd Annotations


•  Normally P = TP / (TP + FP)
•  Intuition:
   –  some sentences make better examples
   –  more important to get the clear cases right
   –  but P normally treats all examples as equal
•  We propose: weight P with the sentence-relation score (SRS)

   P_W = ∑_i (TP_i × SRS_i) / ( ∑_i (TP_i × SRS_i) + ∑_i (FP_i × SRS_i) )

*and similarly for F1, Recall, and Accuracy

Weighted Precision*
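A short sketch of this weighted precision; the TP/FP indicators and SRS values below are invented to show the effect of down-weighting an unclear sentence.

```python
import numpy as np

# SRS-weighted precision: each true/false positive is weighted by the
# sentence-relation score of the sentence it comes from, so unclear
# (high-disagreement) sentences count less.
def weighted_precision(is_tp, is_fp, srs):
    """P_W = sum(TP_i * SRS_i) / (sum(TP_i * SRS_i) + sum(FP_i * SRS_i))."""
    tp = np.sum(is_tp * srs)
    fp = np.sum(is_fp * srs)
    return float(tp / (tp + fp))

# Three sentences classified as positive; the third is a false positive on an
# unclear sentence (low SRS), so it barely hurts the weighted precision.
is_tp = np.array([1, 1, 0])
is_fp = np.array([0, 0, 1])
srs   = np.array([0.9, 0.8, 0.2])

print(weighted_precision(is_tp, is_fp, srs))  # ~0.89, vs. unweighted P = 2/3
```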


CrowdTruth SRS Score as a Weight for Annotation Quality F1

            Unweighted   Weighted
Crowd@0.5   0.8382       0.9329
Crowd@0.7   0.9074       0.9626
Expert      0.8444       0.8611
Single      0.6637       0.7344
Baseline    0.6559       0.6891

The sentences with a lot of disagreement weigh less


Weighted F1 scores are higher at any given threshold

RelEx CAUSE Classifier for Crowd & Expert: Weighted vs. Unweighted F1 Score

[Figure: RelEx CAUSE classifier F1 bars: Crowd 0.658 vs. Expert 0.638, and Crowd 0.642 (p = 0.016) vs. Expert 0.638]

RelEx CAUSE Classifier F1 for Crowd vs. Expert Annotations


Crowd 0.642 (p = 0.016) vs. Expert 0.638

The crowd provides training data that is at least as good as, if not better than, that of the experts

RelEx CAUSE Classifier F1 for Crowd vs. Expert Annotations


Measured per worker

[Figure: the average cosine between a worker's sentence vector and the sentence vector (e.g. 0 1 1 0 0 4 3 0 0 5 1 0), taken over the sentences the worker annotated; see the sketch below]

Worker-Sentence Disagreement
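A hedged sketch of this per-worker measure as drawn on the slide: the cosine between the worker's vector and the sentence vector, averaged over the sentences the worker annotated. The full CrowdTruth metrics refine this (e.g. by discounting the worker's own vote from the sentence vector), and the vectors below are invented for illustration.

```python
import numpy as np

# Per-worker agreement with the crowd: average cosine between the worker's
# vector and the sentence vector for each sentence the worker annotated.
# A low average signals high worker-sentence disagreement.
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def worker_sentence_agreement(worker_vectors, sentence_vectors):
    """Mean cosine between a worker's vectors and the matching sentence vectors."""
    return float(np.mean([cosine(w, s)
                          for w, s in zip(worker_vectors, sentence_vectors)]))

# Illustrative data: one worker's votes on two sentences, plus the crowd's
# aggregated sentence vectors for the same two sentences.
worker = np.array([[0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
                   [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
sents  = np.array([[0, 1, 1, 0, 0, 4, 3, 0, 0, 5, 1, 0],
                   [0, 6, 1, 0, 0, 0, 1, 0, 0, 2, 0, 0]])

print(worker_sentence_agreement(worker, sents))  # ~0.90
```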


•  The crowd can build a ground truth:
   –  it performs just as well as medical experts
   –  the crowd is also cheaper
   –  the crowd is always available
•  The crowd scores (SRS) can be used as a weight:
   –  improved F1 scores for crowd and expert ground truths
•  CrowdTruth = a solution to the Clinical NLP challenge:
   –  lack of ground truth for training & benchmarking

Experiments showed:


CrowdTruth.org

http://data.CrowdTruth.org/medical-relex #CrowdTruth @anouk_anca @laroyo @cawelty #LD4IE2015 #ISWC2015