Entities, Time and Events in BiographyNet & NewsReader Antske Fokkens VU University Monday, November 11, 13

Page 1: Entities, Time and Events in BiographyNet and NewsReader

Entities, Time and Events in BiographyNet & NewsReader

Antske Fokkens

VU University

Monday, November 11, 13

Page 2: Entities, Time and Events in BiographyNet and NewsReader

Acknowledgement (people)

The work presented here was carried out by or with:

Agata Cybulska, Marieke van Erp and Piek Vossen

Niels Ockeloen, Serge ter Braake, Willem Robert van Hage, Jesper Hoeksema, Sara Tonelli, Rachele Sprugnoli, Luciano Serafini, Aitor Soroa, German Rigau and others


Page 3: Entities, Time and Events in BiographyNet and NewsReader

Overview

mini introduction to BiographyNet

mini introduction to NewsReader

representing entities and events


Page 4: Entities, Time and Events in BiographyNet and NewsReader

BiographyNet

An interdisciplinary project involving history, computer science and computational linguistics

Goal: inspire new historical research by identifying relations between people and events in biographical dictionaries


Page 5: Entities, Time and Events in BiographyNet and NewsReader

NLP in BiographyNet

The Biography Portal of the Netherlands

125,000 biographies from 23 sources describing 76,000 people

Text and metadata

Role of NLP:

Identify information in text

Study differences in style and focus


Page 6: Entities, Time and Events in BiographyNet and NewsReader

BiographyNet use cases

Analysis of groups of individuals (e.g. who were the governors-general of the Dutch Indies?)

More complex questions, e.g. the relation between influential people in the Dutch colonies and the current Dutch elite

Perspectives: how are people and events judged in different sources?


Page 7: Entities, Time and Events in BiographyNet and NewsReader

BiographyNet data

Biographical text in Dutch

Heterogeneous corpus: 23 sources, texts from the 17th century to the present

Metadata about basic facts:

high quality (few errors)

completeness varies


Page 8: Entities, Time and Events in BiographyNet and NewsReader

BiographyNet Text mining

First step: fill gaps in the metadata

Basic supervised machine learning system

Next steps:

Create timelines for individuals

Identify relations between people

Identify events and relations between them


Page 9: Entities, Time and Events in BiographyNet and NewsReader

BiographyNet Methodology

The output of NLP tools is used by other researchers

They should have insight into the performance of the tools and the approaches that are used

Provenance information plays a vital role
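As a minimal sketch of what this could mean in practice, an extracted value can be stored together with a record of the tool that produced it. Everything here is hypothetical and only for illustration: the class names, the field names, the tool name and the performance figure are invented, not taken from BiographyNet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolRun:
    """Which tool produced an output, and how well it is reported to perform."""
    tool: str
    version: str
    reported_f1: float  # placeholder figure; a real one would come from an evaluation


@dataclass(frozen=True)
class ExtractedFact:
    """A value extracted from a biography, kept together with its provenance."""
    person_id: str
    attribute: str
    value: str
    source_text_id: str
    produced_by: ToolRun


def describe(fact: ExtractedFact) -> str:
    """Render a fact with enough provenance for a researcher to judge it."""
    run = fact.produced_by
    return (
        f"{fact.attribute}={fact.value} for {fact.person_id} "
        f"(from {fact.source_text_id}, by {run.tool} {run.version}, "
        f"reported F1 {run.reported_f1:.2f})"
    )


# Invented example values, for illustration only.
run = ToolRun(tool="metadata-classifier", version="0.1", reported_f1=0.80)
fact = ExtractedFact("person/1", "occupation", "painter", "bio/42", run)
print(describe(fact))
```

Keeping the tool record attached to every fact lets a researcher filter or weigh results by how they were produced, rather than treating all extracted values as equally reliable.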


Page 10: Entities, Time and Events in BiographyNet and NewsReader

NewsReader

Automatically process massive streams of daily news from thousands of sources in 4 different languages

Project Partners:

VU University Amsterdam, LexisNexis, Synerscope (the Netherlands)

University of the Basque Country (Spain)

ScraperWiki (UK)

Fondazione Bruno Kessler (Italy)

Page 11: Entities, Time and Events in BiographyNet and NewsReader

NewsReader

What happened, where and when, and who was involved?

Which temporal and causal relations hold between events, and what does that tell us about the people involved?

Place the accumulated result in a knowledge store that can handle dynamic growth of information: a history recorder


Page 12: Entities, Time and Events in BiographyNet and NewsReader

NewsReader Big Data

Focus: The financial crisis

E.g. What is the impact of the financial crisis on the car industry?

Big Data: LexisNexis estimates:

1-2 million news articles per day

10 million English news articles about the car industry from the last 10 years in their archive


Page 13: Entities, Time and Events in BiographyNet and NewsReader

NewsReader Narratives

What are the stories that are being told by all this data?

Challenges:

Duplicates, overlap and repetitions: how to distinguish old from new?

Single results tell only parts of the story

Results can be inconsistent

News is opinionated and colored


Page 14: Entities, Time and Events in BiographyNet and NewsReader

NewsReader overall approach

Resolve all mentions of events, their participants, locations and time in texts and other resources

Determine coreference and other relations between them

Combine all information from coreferring event mentions around a hypothetical event instance (independent from text)

Combine instances into storylines
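The steps above can be sketched in a few lines of Python. This is a toy illustration, not the NewsReader implementation: the class and function names are invented, the example documents and companies are made up, and event coreference is replaced by a naive rule (same event type, year and location) purely to keep the sketch short.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EventMention:
    """One mention of an event in one source (all fields are illustrative)."""
    doc_id: str
    event_type: str
    year: int
    location: str
    participants: frozenset


@dataclass
class EventInstance:
    """A text-independent event instance built from coreferring mentions."""
    event_type: str
    year: int
    location: str
    participants: set = field(default_factory=set)
    sources: set = field(default_factory=set)


def build_instances(mentions):
    # Naive stand-in for event coreference: same type, year and location.
    groups = defaultdict(list)
    for m in mentions:
        groups[(m.event_type, m.year, m.location)].append(m)
    instances = []
    for (etype, year, loc), group in groups.items():
        inst = EventInstance(etype, year, loc)
        for m in group:
            inst.participants |= m.participants  # merge partial information
            inst.sources.add(m.doc_id)           # remember where it came from
        instances.append(inst)
    return instances


def storyline(instances):
    # Order instances chronologically; ties are broken by event type.
    return sorted(instances, key=lambda i: (i.year, i.event_type))


# Made-up mentions from three hypothetical documents.
mentions = [
    EventMention("doc1", "takeover", 2009, "Detroit", frozenset({"CompanyA"})),
    EventMention("doc2", "takeover", 2009, "Detroit",
                 frozenset({"CompanyA", "CompanyB"})),
    EventMention("doc3", "layoff", 2008, "Detroit", frozenset({"CompanyB"})),
]
story = storyline(build_instances(mentions))
for inst in story:
    print(inst.year, inst.event_type, sorted(inst.participants), sorted(inst.sources))
```

The two takeover mentions collapse into one instance that carries the union of their participants and both document identifiers, which is the accumulation step the slide describes.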


Page 15: Entities, Time and Events in BiographyNet and NewsReader

NLP pipeline

[Figure: NewsReader NLP pipeline diagram. Labels recoverable from the figure:

Input: LexisNexis documents; storage of original input data

Processing modules: tokenizer + sentence splitter, POS tagger, parser, NER, time expressions, WSD (client and server), NED (client and server), coreference resolution, SRL, event detection, opinion detection, factuality, event coreference, event relations, story understanding

Knowledge Store: KS frontend (API implementation over layers; replicated for scalability and fault tolerance); HBase + Hadoop (distributed and replicated for scalability and fault tolerance); triple store ((possibly) distributed) holding RDF triples + named graphs; resource, mention, entity, statement + context; partial replication; inference; management scripts (start/stop, backup/restore, configuration, statistics gathering)

Visualisation (Synerscope)

Legend: runs in virtual machine; input data storage; processes that can be carried out in any order at this stage

Partners shown: EHU, VUA, FBK]

Page 16: Entities, Time and Events in BiographyNet and NewsReader

Both Projects

Accumulate information about the same entities and events from various sources

Must deal with different perspectives, contradicting and partial information


Page 17: Entities, Time and Events in BiographyNet and NewsReader

Grounded Annotation Framework (GAF)

Sources report on events and entities: event mentions and entity mentions

URIs represent instances of these entities and events in reality

GAF links instances to mentions

Information from mentions in other sources is merged with known information around the instance
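A minimal sketch of the instance-to-mention link, assuming a simple in-memory index. The class names, the example.org URI, the file names and the character offsets are all invented for illustration; GAF itself expresses these links in RDF rather than in Python objects.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """A span of text in a source that refers to an event or entity."""
    source: str
    start: int
    end: int


class GroundedIndex:
    """Links instance URIs to the mentions that refer to them, and back."""

    def __init__(self):
        self._mentions_by_instance = defaultdict(list)
        self._instance_by_mention = {}

    def link(self, instance_uri: str, mention: Mention) -> None:
        self._mentions_by_instance[instance_uri].append(mention)
        self._instance_by_mention[mention] = instance_uri

    def mentions_of(self, instance_uri: str):
        return list(self._mentions_by_instance.get(instance_uri, []))

    def instance_of(self, mention: Mention):
        return self._instance_by_mention.get(mention)

    def sources_of(self, instance_uri: str):
        return sorted({m.source for m in self.mentions_of(instance_uri)})


TSUNAMI = "http://example.org/instances/tsunami-2004"  # made-up URI

index = GroundedIndex()
index.link(TSUNAMI, Mention("news-2008.txt", 4, 15))
index.link(TSUNAMI, Mention("blog-2009.txt", 52, 77))
print(index.sources_of(TSUNAMI))
```

Because the instance has its own identifier, a new source only adds another mention to the same URI; nothing about the instance has to be duplicated per document.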


Page 18: Entities, Time and Events in BiographyNet and NewsReader

a GAF example

[Figure: two parallel timelines, "changes in the world" and "publication of sources", with year marks from 2004 to 2013. SEM event labels: temblor, tsunami, USS Jimmy Carter energy weapon. Annotation links (two of them labelled NAF and TAF) connect these to sensor data, a direct event report, a delayed event report and a future event report. Other labels: Tsunami alert system, future tsunami.]

Source texts shown in the figure:

"The catastrophe four years ago devastated Indian Ocean community and killed more than 230,000 people, over 170,000 of them in Aceh at northern tip of Sumatra Island of Indonesia."

..., the vessel is the party responsible for the 2004 Indian Ocean tsunami that killed 230,000 people. Apparently, the submarine was able to trigger seismic activity via some kind of directed energy weapon.

Page 19: Entities, Time and Events in BiographyNet and NewsReader

Linguistic information in GAF

The NLP Annotation Format (NAF)

Knowledge Annotation Format (KAF)

stand-off layered annotation (LAF compatible)

separating mentions from instances

NLP Interchange Format (NIF)

RDF and URIs, inline annotation

Compatible with PROV-DM
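The idea of stand-off layered annotation can be illustrated with a small sketch: the text is never modified, a token layer points into it by character offset, and an entity layer points to token identifiers. This is not NAF syntax (NAF is an XML format); the class names, field names and the example sentence are invented for illustration.

```python
from dataclasses import dataclass

TEXT = "Anna Smit was born in Utrecht in 1850"  # made-up example sentence


@dataclass(frozen=True)
class Token:
    """Token layer: points into the raw text by character offset."""
    tid: str
    offset: int
    length: int


@dataclass(frozen=True)
class EntityMention:
    """Entity layer: points to tokens by identifier, never to the text itself."""
    eid: str
    etype: str
    token_ids: tuple


def tokenize(text):
    """Whitespace tokenizer that records character offsets (toy stand-in)."""
    tokens, pos = [], 0
    for i, word in enumerate(text.split(), start=1):
        offset = text.index(word, pos)
        tokens.append(Token(f"t{i}", offset, len(word)))
        pos = offset + len(word)
    return tokens


def surface(text, tokens, mention):
    """Recover a mention's string from the untouched text via the token layer."""
    by_id = {t.tid: t for t in tokens}
    spans = [by_id[tid] for tid in mention.token_ids]
    start = spans[0].offset
    end = spans[-1].offset + spans[-1].length
    return text[start:end]


tokens = tokenize(TEXT)
person = EntityMention("e1", "PERSON", ("t1", "t2"))
place = EntityMention("e2", "LOCATION", ("t6",))
print(surface(TEXT, tokens, person), "|", surface(TEXT, tokens, place))
```

Since each layer only refers to the one below it, a different tool can add or replace a layer without touching the text or the other layers.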

Page 20: Entities, Time and Events in BiographyNet and NewsReader

Events in GAF

extended Simple Event Model (SEM):

RDF representations of event instances with participant, location and time

can represent contradictory information
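As a rough sketch of what an event instance with participant, location and time looks like as RDF, the snippet below builds plain triples and prints them as N-Triples. The SEM namespace and the names `Event`, `hasActor`, `hasPlace` and `hasTime` are taken from the published SEM vocabulary as I understand it and should be checked against the SEM specification; the example.org URIs and the helper functions are invented, and the extensions used in GAF are not shown.

```python
SEM = "http://semanticweb.cs.vu.nl/2009/11/sem/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # made-up namespace for instance URIs


def event_triples(event, actors, place, time):
    """Return (subject, predicate, object) triples for one event instance."""
    triples = [(event, RDF_TYPE, SEM + "Event")]
    triples += [(event, SEM + "hasActor", actor) for actor in actors]
    triples.append((event, SEM + "hasPlace", place))
    triples.append((event, SEM + "hasTime", time))
    return triples


def to_ntriples(triples):
    """Serialise URI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


triples = event_triples(
    EX + "event1",
    actors=[EX + "actor1", EX + "actor2"],
    place=EX + "place1",
    time=EX + "time1",
)
print(to_ntriples(triples))
```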


Page 21: Entities, Time and Events in BiographyNet and NewsReader

GAF from NAF + SEM

Can accumulate information from different sources

Can represent repeated information as a single relation (with links to all sources that provided this information)

Can represent contradicting information

Is compatible with the PROV-DM
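The first three properties can be sketched with a small in-memory store. This is an illustration under stated assumptions, not how GAF is implemented: the class, the property name `cause`, the source identifiers and the notion of a "functional" property (one that should have a single value) are all introduced here for the example.

```python
from collections import defaultdict


class StatementStore:
    """Keeps one record per (subject, property, value) with all its sources."""

    def __init__(self):
        self._sources = defaultdict(set)

    def add(self, subject, prop, value, source):
        # A repeated statement only adds a source to the existing relation.
        self._sources[(subject, prop, value)].add(source)

    def relations(self):
        return {key: sorted(srcs) for key, srcs in self._sources.items()}

    def contradictions(self, functional_props):
        """Find subject/property pairs that have more than one value.

        Only properties listed in functional_props are assumed to allow
        a single value; everything else may legitimately have several.
        """
        values = defaultdict(set)
        for subject, prop, value in self._sources:
            if prop in functional_props:
                values[(subject, prop)].add(value)
        return {key: sorted(v) for key, v in values.items() if len(v) > 1}


store = StatementStore()
store.add("tsunami-2004", "cause", "temblor", "source-a")
store.add("tsunami-2004", "cause", "temblor", "source-b")        # repetition
store.add("tsunami-2004", "cause", "energy weapon", "source-c")  # contradiction

print(store.relations())
print(store.contradictions({"cause"}))
```

The store keeps both causes side by side, each with its own sources, instead of deciding which one is right; that decision is left to whoever consumes the data.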


Page 22: Entities, Time and Events in BiographyNet and NewsReader

Acknowledgements

Supported by the European Union’s 7th Framework Programme via the NewsReader project (ICT-316404)

Supported by the BiographyNet project (nr. 660.011.308), funded by the Netherlands eScience Center (http://escience.center.nl)


Page 23: Entities, Time and Events in BiographyNet and NewsReader

References

GAF:

Fokkens, Antske, Marieke van Erp, Piek Vossen, Sara Tonelli, Willem Robert van Hage, Luciano Serafini, Rachele Sprugnoli and Jesper Hoeksema. 2013. GAF: A Grounded Annotation Framework for Events. Proceedings of the first Workshop on EVENTS: Definition, Detection, Coreference and Representation. Atlanta, USA.

Van Erp, Marieke, Antske Fokkens, Piek Vossen, Sara Tonelli, Willem Robert van Hage, Luciano Serafini, Rachele Sprugnoli and Jesper Hoeksema. 2013. Denoting Data in the Grounded Annotation Framework. ISWC 2013 Posters and Demos. Sydney, Australia, 21-25 October 2013.


Page 24: Entities, Time and Events in BiographyNet and NewsReader

References

SEM:

Van Hage, Willem Robert, Véronique Malaisé, Roxane Segers, Laura Hollink and Guus Schreiber. 2011. Design and use of the Simple Event Model (SEM). Web Semantics: Science, Services and Agents on the World Wide Web 9(2): 128-136.

Cross-document coreference:

Cybulska, Agata and Piek Vossen. 2013. Semantic Relations between Events and their Time, Locations and Participants for Event Coreference Resolution. In: Proceedings of RANLP 2013.


Page 25: Entities, Time and Events in BiographyNet and NewsReader

References

Named Entity Recognition:

Van Erp, Marieke, Giuseppe Rizzo and Raphaël Troncy. 2013. Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning. #MSM2013 Concept Extraction Challenge. Rio de Janeiro, Brazil, May 2013.

Provenance:

Ockeloen, Niels, Antske Fokkens, Serge ter Braake, Piek Vossen, Victor de Boer, Guus Schreiber and Susan Legêne. 2013. BiographyNet: Managing Provenance at multiple levels and from different perspectives. In: Proceedings of the Workshop on Linked Science 2013 (LISC2013).
