Crowdsourcing Event Extraction
Aljaž Košmerlj, Jenya Belyaeva, Gregor Leban, Blaž Fortuna, Marko Grobelnik
Jozef Stefan Institute




We present a system for manually extracting structured event information from free-form newswire text. The extraction is performed on news articles preprocessed by services developed within the XLike project and is guided by suggestions that the system produces using machine learning techniques. Tests with human annotators show that the system can produce meaningful data and suggest several avenues for improvement.


Page 1: Crowdsourcing event extraction

Crowdsourcing Event Extraction

Aljaž Košmerlj, Jenya Belyaeva, Gregor Leban,

Blaž Fortuna, Marko Grobelnik

Jozef Stefan Institute

Page 2: Crowdsourcing event extraction

Goal

Identify and extract features (an info-box) about events (e.g. earthquake, product launch…) reported in the news.

Automatically extracting structured information about events from news articles is challenging.

Even when limited to news articles, there is little structure in the text.

Human annotators can alleviate the shortcomings of automatic approaches.

Problem: expert annotators are expensive.

Solution: use crowdsourcing to lower costs.

Page 3: Crowdsourcing event extraction

Event type example

"San Bernardino, California was struck by a moderate earthquake on Thursday night, with shaking felt from Los Angeles to Orange County. A preliminary reading by the U.S. Geological Survey showed a 4.5-magnitude quake struck at 7:49pm. …"

Event type: earthquake

Roles:
• magnitude – What was the magnitude of the earthquake?
• location – Where did the earthquake occur?
• time – At what time did the earthquake occur?
• …
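Purely as an illustration (not part of the original slides), an event-type template like the one above could be represented as a small data structure; the class and field names below are hypothetical.

# Hypothetical sketch of an event-type template; the names are illustrative only.
from dataclasses import dataclass

@dataclass
class EventTemplate:
    event_type: str   # e.g. "earthquake"
    roles: dict       # role name -> question shown to the annotator

earthquake_template = EventTemplate(
    event_type="earthquake",
    roles={
        "magnitude": "What was the magnitude of the earthquake?",
        "location": "Where did the earthquake occur?",
        "time": "At what time did the earthquake occur?",
    },
)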

Page 4: Crowdsourcing event extraction

Constraints and considerations

A price of $1–$10 per article is acceptable.

The annotation process needs to be guided (semi-automatic) in order to be efficient, reliable and cheap.

We can assume some highly skilled workers (e.g. editors).

The schema of the extracted data has to be open and extensible.

Page 5: Crowdsourcing event extraction

Event extraction subtasks

1. Identify articles that can be meaningfully structured
2. Identify a set of event types
3. For each event type, identify a set of roles (a template)
4. For each new article, identify its event type and fill the roles with entities from the article
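A toy, hard-coded illustration of what subtask 4 produces for the earthquake example from the previous slide; in the real system the event type is suggested by the recommender and the roles are filled by the annotator.

# Toy illustration of the output of subtask 4; all values are hard-coded here,
# whereas the event type and role assignments would come from the recommender
# plus the human annotator.
article = ("San Bernardino, California was struck by a moderate earthquake on Thursday "
           "night. A preliminary reading by the U.S. Geological Survey showed a "
           "4.5-magnitude quake struck at 7:49pm.")

record = {
    "event_type": "earthquake",
    "magnitude": "4.5",
    "location": "San Bernardino, California",
    "time": "7:49pm",
}
print(record)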

Page 6: Crowdsourcing event extraction

Annotation interface

We annotate stories, not individual articles. A story is a cluster of articles about the same event.

Sources of clusters: Event Registry, Google clusters…

The articles are sent through the Enrycher* service (POS tagging, named entity extraction…).

Entities proposed for annotation are currently identified using only POS tags (sequences of numerals and nouns); a sketch of this step follows below.

Online annotation interface:
Front end: JavaScript
Back end: Python

* http://enrycher.ijs.si/
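Not from the slides: a rough sketch, assuming Penn Treebank POS tags, of how candidate entities can be proposed as maximal runs of numeral and noun tokens; the actual system works on the tags returned by Enrycher, which is not reproduced here.

# Sketch: propose candidate entities as maximal sequences of numeral/noun tokens.
# Input is assumed to be (token, Penn-Treebank-tag) pairs from a POS tagger.
CANDIDATE_TAGS = ("CD", "NN", "NNS", "NNP", "NNPS")

def candidate_entities(tagged_tokens):
    candidates, current = [], []
    for token, tag in tagged_tokens:
        if tag in CANDIDATE_TAGS:
            current.append(token)
        else:
            if current:
                candidates.append(" ".join(current))
                current = []
    if current:
        candidates.append(" ".join(current))
    return candidates

tagged = [("A", "DT"), ("4.5", "CD"), ("magnitude", "NN"), ("quake", "NN"),
          ("struck", "VBD"), ("San", "NNP"), ("Bernardino", "NNP"), (",", ","),
          ("California", "NNP"), ("on", "IN"), ("Thursday", "NNP"), ("night", "NN")]
print(candidate_entities(tagged))
# -> ['4.5 magnitude quake', 'San Bernardino', 'California', 'Thursday night']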

Page 7: Crowdsourcing event extraction

Interface

http://aidemo.ijs.si/eventAnnotation (pick any username, leave password empty)

Page 8: Crowdsourcing event extraction

Recommender of event types

QMiner [1] SVM classifier

Training data:
• 100 stories, ~20 per event type
• 5 event types: bombing, product launch, protest, road accident, earthquake

Features:
• event: concepts, title, summary
• articles: concepts, title

Leave-one-out testing: CA (classification accuracy) = 0.67
With 50 non-event stories: CA = 0.54

[1] https://github.com/qminer/qminer
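The recommender itself is built with QMiner; purely as an illustration of the same setup (an SVM over bag-of-words text features, evaluated with leave-one-out classification accuracy), here is a sketch in scikit-learn, with a few toy texts standing in for the concepts/title/summary features of the training stories.

# Illustrative only: leave-one-out evaluation of an SVM text classifier.
# scikit-learn stands in for QMiner; the toy texts stand in for the ~100 stories.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import LeaveOneOut, cross_val_score

texts = [
    "magnitude quake struck city residents felt shaking",
    "earthquake aftershocks reported geological survey",
    "company unveiled its new smartphone at a press event",
    "product launch of the latest tablet announced",
    "thousands marched in protest against the new law",
    "demonstrators gathered downtown to protest the ruling",
]
labels = ["earthquake", "earthquake", "product launch",
          "product launch", "protest", "protest"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
scores = cross_val_score(clf, texts, labels, cv=LeaveOneOut())
print("leave-one-out CA:", scores.mean())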

Page 9: Crowdsourcing event extraction

Evaluation

Page 10: Crowdsourcing event extraction

Evaluation - results

11 annotators, 10 stories

Overall stats:
• nr. of entities annotated: 13.4 ± 6.9
• % of entities annotated: 12.1 % ± 3.1 %
• nr. of roles filled: 6.2 ± 0.9

Pairwise annotator agreement:
• nr. of agreed event types: 5.9 ± 2.0
• Jaccard index per story: 0.25 ± 0.09

Recommender success:
• 1st recommendation: 6.6 ± 1.9
• in first two recommendations: 7.2 ± 2.0
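For reference, a minimal sketch of a pairwise Jaccard agreement computation, assuming the index is taken over the sets of entities different annotators marked in the same story; the annotation sets below are invented for illustration.

# Sketch of pairwise annotator agreement as a Jaccard index, assuming it is
# computed over the sets of entities two annotators marked in the same story.
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

story_annotations = {
    "annotator1": {"4.5", "San Bernardino", "Thursday night"},
    "annotator2": {"4.5", "California", "7:49pm"},
    "annotator3": {"San Bernardino", "4.5"},
}

pair_scores = [jaccard(story_annotations[x], story_annotations[y])
               for x, y in combinations(story_annotations, 2)]
print("mean pairwise Jaccard:", sum(pair_scores) / len(pair_scores))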

Page 11: Crowdsourcing event extraction

Future work

Improve recommender
• use predicates in features

Testing in a "professional" environment
• improvement in speed?
• what is a "correct" annotation?

Building a taxonomy of event types
• active learning

Page 12: Crowdsourcing event extraction

Thank you for your attention!