Description
We present a system for manually extracting structured event information from freeform newswire text. The extraction is performed on news articles preprocessed by services developed within the XLike project and is guided by suggestions that the system produces using machine learning techniques. Results of tests with human annotators show that the system can produce meaningful data and suggest several avenues for improvement.
Crowdsourcing Event Extraction
Aljaž Košmerlj, Jenya Belyaeva, Gregor Leban,
Blaž Fortuna, Marko Grobelnik
Jozef Stefan Institute
Goal
Identify and extract features (info-box) about events
(e.g. earthquake, product launch…) reported in the
news.
Automatically extracting structured information
about events from news articles is challenging.
Even when limited to news articles, there is little
structure in the text.
Human annotators can alleviate shortcomings of
automatic approaches
Problem: expert annotators are expensive
Solution: use crowdsourcing to lower costs
Event type example
"San Bernardino, California was struck by a moderate
earthquake on Thursday night, with shaking felt from Los
Angeles to Orange County.
A preliminary reading by the U.S. Geological Survey showed a
4.5-magnitude quake struck at 7:49pm. …"
Event type: earthquake
Roles:
• magnitude – What was the magnitude of the
earthquake?
• location – Where did the earthquake occur?
• time – At what time did the earthquake occur?
• …
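Such a template and a filled annotation can be represented as simple mappings. A minimal Python sketch; the field names are illustrative assumptions, not the system's actual schema:

# Illustrative sketch; field names are assumptions, not the actual schema.
EARTHQUAKE_TEMPLATE = {
    "event_type": "earthquake",
    "roles": {
        "magnitude": "What was the magnitude of the earthquake?",
        "location": "Where did the earthquake occur?",
        "time": "At what time did the earthquake occur?",
    },
}

# A filled instance for the example article above:
annotation = {
    "event_type": "earthquake",
    "roles": {
        "magnitude": "4.5",
        "location": "San Bernardino, California",
        "time": "7:49pm Thursday",
    },
}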
Constraints and considerations
A price of $1–$10 per article is acceptable.
The annotation process needs to be guided
(semi-automatic) in order to be efficient, reliable,
and cheap.
We can assume some highly skilled workers (e.g.
editors)
The schema of the extracted data has to be open and
extensible.
Event extraction subtasks
1. Identify articles that can be meaningfully
structured
2. Identify a set of event types
3. For each event type identify a set of roles
(a template)
4. For each new article, identify its event type and fill
the roles with entities from the article
Annotation interface
We annotate stories, not individual articles. A story
is a cluster of articles about the same event.
Sources of clusters: Event Registry, Google clusters…
The articles are sent through the Enrycher* service
(POS tagging, named entity extraction…).
Entities proposed for annotation are currently identified
using only POS tags (sequences of numerals and nouns);
a sketch of this heuristic follows below.
Online annotation interface
Front end: JavaScript
Back end: Python
* http://enrycher.ijs.si/
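A rough illustration of the candidate-selection heuristic: the sketch below marks maximal runs of numeral and noun tokens as annotation candidates. It uses NLTK's tagger and Penn Treebank tags (CD, NN*) as stand-ins for Enrycher's actual output, so both are assumptions.

import nltk  # stand-in tagger; the system actually uses Enrycher output
# requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def candidate_entities(text):
    """Return maximal token runs tagged as numerals (CD) or nouns (NN*)."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))  # Penn Treebank tags (assumed)
    runs, current = [], []
    for word, tag in tagged:
        if tag == "CD" or tag.startswith("NN"):
            current.append(word)
        elif current:
            runs.append(" ".join(current))
            current = []
    if current:
        runs.append(" ".join(current))
    return runs

print(candidate_entities("The U.S. Geological Survey reported a quake of magnitude 4.5."))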
Interface
http://aidemo.ijs.si/eventAnnotation (pick any username, leave password empty)
Recommender of event types
QMiner[1] SVM classifier
Training data:
100 stories (~20 per event type)
5 event types: bombing, product launch, protest,
road accident, earthquake
Features:
event: concepts, title, summary
articles: concepts, title
Leave-one-out testing (sketched below):
CA = 0.67
With 50 non-event stories:
CA = 0.54
[1] https://github.com/qminer/qminer
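In outline, the leave-one-out evaluation looks like the sketch below. It uses scikit-learn as a stand-in for the QMiner SVM and TF-IDF bag-of-words features as a stand-in for the concept/title/summary features, so both choices are assumptions.

# Sketch of leave-one-out accuracy; scikit-learn stands in for QMiner's SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for story texts (title + summary + concepts) and labels.
stories = [
    "a 4.5 magnitude quake struck the city",
    "the earthquake shook buildings downtown",
    "protesters marched against the new law",
    "a protest blocked the main square",
]
labels = ["earthquake", "earthquake", "protest", "protest"]

pipe = make_pipeline(TfidfVectorizer(), LinearSVC())
scores = cross_val_score(pipe, stories, labels, cv=LeaveOneOut())
print("CA =", scores.mean())  # classification accuracy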
Evaluation
Evaluation - results
11 annotators
10 stories
Overall stats:
nr. entities annotated: 13.4 ± 6.9
% entities annotated: 12.1 % ± 3.1 %
nr. roles filled: 6.2 ± 0.9
Pairwise annotator agreement:
nr. agreed event types: 5.9 ± 2.0
Jaccard index per story: 0.25 ± 0.09 (see sketch below)
Recommender success (correct event type, out of 10 stories):
1st recommendation: 6.6 ± 1.9
within first two recommendations: 7.2 ± 2.0
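Pairwise agreement of this kind can be computed as in the sketch below; representing an annotation as a set of (role, entity) pairs is an assumption about how agreement was scored, not the evaluation's confirmed procedure.

# Sketch of pairwise agreement; treating an annotation as a set of
# (role, entity) pairs is an assumption, not the exact scoring used.
def jaccard(a, b):
    """Jaccard index: |intersection| / |union| of two annotation sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

ann1 = {("magnitude", "4.5"), ("location", "San Bernardino")}
ann2 = {("magnitude", "4.5"), ("time", "7:49pm")}
print(jaccard(ann1, ann2))  # 1 shared pair of 3 distinct -> 0.33...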
Future work
Improve recommender
use predicates in features
Testing in a "professional" environment
improvement in speed?
what is a "correct" annotation?
Building a taxonomy of event types
active learning
Thank you
for your attention!