Claudiu Mihăilă, The National Centre for Text Mining, The University of Manchester

News Search Using Discourse Analytics

Enhancing access to information within digital heritage archives, e.g. the New York Times corpus, by identifying discourse phenomena and by searching and filtering events according to multiple facets.

Page 1: News Search Using Discourse Analytics

Claudiu Mihăilă

The National Centre for Text Mining

The University of Manchester

News Search Using Discourse Analytics

Page 2: News Search Using Discourse Analytics

Data

Exponential growth of data

Information overload

Growing

Page 3: News Search Using Discourse Analytics

Data

Exponential growth of data

Information overload

Data deluge

Pouring

Page 4: News Search Using Discourse Analytics

Data

Exponential growth of data

Information overload

Data deluge

Can we process a deluge of data in a useful manner?

Processing

Page 5: News Search Using Discourse Analytics

Searching

Give a query as input

Obtain a set of relevant articles

Keywords vs. semantics

– Synonyms

– Hyponyms

– Spelling variants

– Inflections

– Relations between query terms
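The difference between keyword matching and the kinds of variation listed above can be illustrated with a minimal query-expansion sketch. Everything in it is an assumption made for illustration: the tiny lexicon, the function names and the naive plural rule are invented here and do not come from the talk, and a real system would rely on proper lexical resources and a morphological analyser.

```python
from typing import Dict, Set

# Toy lexicon: every entry is invented for illustration only.
SYNONYMS: Dict[str, Set[str]] = {"crime": {"offence"}}
HYPONYMS: Dict[str, Set[str]] = {"crime": {"murder", "theft", "fraud"}}
SPELLING_VARIANTS: Dict[str, Set[str]] = {"offence": {"offense"}}


def inflections(term: str) -> Set[str]:
    """Naive inflection: the term itself plus a regular plural in -s."""
    return {term, term + "s"}


def expand(term: str) -> Set[str]:
    """Return the term with its synonyms, hyponyms, spelling variants and inflections."""
    term = term.lower()
    related = {term} | SYNONYMS.get(term, set()) | HYPONYMS.get(term, set())
    # Add spelling variants of every word collected so far.
    for word in list(related):
        related |= SPELLING_VARIANTS.get(word, set())
    # Finally add the inflected forms of each word.
    expanded: Set[str] = set()
    for word in related:
        expanded |= inflections(word)
    return expanded
```

A pure keyword search for "crime" would only match that exact string, whereas `expand("crime")` also yields forms such as "murders" and "offense" that a semantic search would be expected to cover.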

Page 6: News Search Using Discourse Analytics

Searching

Crimes in the town of Sandwich

Keywords

Page 7: News Search Using Discourse Analytics

Searching

Crimes in the town of Sandwich

– Crime Sandwich by Click Bang Productions on SoundCloud

– Sandwich Crime - Topix

– Crime on rye: Four accused of stealing $10 sandwich from car

– Crime Scene Sandwich Bags

– Crime rate in Sandwich, Illinois (IL): murders, rapes, robberies

– Ham Sandwich Nation: Due Process When Everything is a Crime

Keywords

Page 8: News Search Using Discourse Analytics

Searching

Crimes in the town of Sandwich

Semantics

Page 9: News Search Using Discourse Analytics

Searching

Crimes in the town of Sandwich

– Kent Police issue warning after fake £20 notes reported in Sandwich

– Trio jailed for total of 30 years after crime spree in Sandwich

– Murder at Sandwich - Kent

Semantics

Page 10: News Search Using Discourse Analytics

Semantic search engine

Specification of semantic types of search terms: town:Sandwich

Normalisation of semantic entities: Sandwich, Kent = Sandwich, UK

Relations between search terms to describe events: location:Sandwich

Restrictions on discourse context of retrieved events

Features
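The first two features on this slide (typed search terms and entity normalisation) might be sketched as follows. This is only an illustration under stated assumptions: the semicolon-separated query syntax, the `QueryTerm` class, the `CANONICAL_PLACES` table and the canonical label are hypothetical and are not the actual query language or data of the system described in the talk.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical gazetteer: maps surface forms to one canonical label.
CANONICAL_PLACES: Dict[str, str] = {
    "sandwich, kent": "Sandwich (Kent, UK)",
    "sandwich, uk": "Sandwich (Kent, UK)",
}


@dataclass(frozen=True)
class QueryTerm:
    text: str
    semantic_type: Optional[str] = None  # e.g. "town"; None for a plain keyword


def normalise(text: str) -> str:
    """Map a known surface form to its canonical label; otherwise return it trimmed."""
    cleaned = text.strip()
    return CANONICAL_PLACES.get(cleaned.lower(), cleaned)


def parse_query(query: str) -> List[QueryTerm]:
    """Parse 'keyword; type:value' terms (assumed syntax; ';' separates terms)."""
    terms: List[QueryTerm] = []
    for token in query.split(";"):
        token = token.strip()
        if not token:
            continue
        prefix, sep, value = token.partition(":")
        if sep:  # typed term such as town:Sandwich, Kent
            terms.append(QueryTerm(normalise(value), prefix.strip().lower()))
        else:    # untyped keyword
            terms.append(QueryTerm(token))
    return terms
```

With this sketch, `parse_query("crime; town:Sandwich, Kent")` and `parse_query("crime; town:Sandwich, UK")` produce the same terms, so both spellings of the place would retrieve the same events.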

Page 11: News Search Using Discourse Analytics

Structured events

The event

Page 12: News Search Using Discourse Analytics

Discourse interpretation

Karl Munro may have killed Sunita in Weatherfield in 2013.

According to Karl Munro, Craig Tinker set Sunita on fire in Weatherfield in 2013.

Karl Munro said he will kill Sunita.

Karl Munro didn’t fail to kill Sunita in Weatherfield in 2013.

Stella Price condemned all of Karl’s wrongdoings.

The story

Page 13: News Search Using Discourse Analytics

ACE corpus

599 news-domain documents

– News articles

– Transcripts of broadcast news

– Transcripts of broadcast conversation

– Conversational telephone speech

– Weblogs

– Discussion fora

Polarity

Tense

Specificity

Modality

Source type

Subjectivity

2005 version

Discourse-related attributes
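One way to picture how such attributes support filtering is to attach them to each extracted event and select events by facet. The sketch below is an assumption-laden illustration: the `Event` class, the attribute values ("speculated", "asserted", and so on) and the hand-written annotations of three example sentences from the "Discourse interpretation" slide are invented here and are not taken from the ACE guidelines or from the system's output.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    trigger: str
    agent: str
    # Discourse attributes; the value labels are illustrative only.
    polarity: str = "positive"
    tense: str = "past"
    modality: str = "asserted"
    source: str = "author"


# Hand-annotated toy examples based on the slide's sentences.
EVENTS: List[Event] = [
    # "Karl Munro may have killed Sunita ..."
    Event("killed", "Karl Munro", modality="speculated"),
    # "According to Karl Munro, Craig Tinker set Sunita on fire ..."
    Event("set on fire", "Craig Tinker", source="Karl Munro"),
    # "Karl Munro said he will kill Sunita."
    Event("kill", "Karl Munro", tense="future", source="Karl Munro"),
]


def filter_events(events: List[Event], **facets: str) -> List[Event]:
    """Keep the events whose attributes equal every requested facet value."""
    return [
        event
        for event in events
        if all(getattr(event, name) == value for name, value in facets.items())
    ]
```

For example, `filter_events(EVENTS, modality="asserted", tense="past")` keeps only the "set on fire" event, while `filter_events(EVENTS, source="Karl Munro")` keeps the two events reported by Karl Munro.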

Page 14: News Search Using Discourse Analytics

Discourse context of events

Scheme

Page 15: News Search Using Discourse Analytics

New York Times corpus

20 years' worth of news articles (1.8M)

Includes annotations of

– Metadata

– Named entities

– Normalisation

Facilitates diachronic studies

– Language evolution

– Social change

– Development of events

Digital archive

Page 16: News Search Using Discourse Analytics

ISHER

Web-based

User-friendly interface

Intuitive query-building mechanism

Refining/filtering according to facets

Semantically enabled searching

Page 17: News Search Using Discourse Analytics

ISHER

Automatic Event Recognition - EventMine

Miwa, Thompson, Ananiadou (2012). Boosting automatic event extraction from the literature using domain adaptation and coreference resolution. Bioinformatics, 28(13), 1759-1765.

Page 18: News Search Using Discourse Analytics

ISHER

Web-based interface – “Coronation Street”

Page 19: News Search Using Discourse Analytics

ISHER

Semantic clustering

Lingo – 3rd party

NaCTeM clustering

Page 20: News Search Using Discourse Analytics

ISHER

Semantic clustering

Cluster summarisation

Page 21: News Search Using Discourse Analytics

ISHER

Metadata in the NYT corpus

Page 22: News Search Using Discourse Analytics

ISHER

Entities

Page 23: News Search Using Discourse Analytics

ISHER

Events

Page 24: News Search Using Discourse Analytics

ISHER

Events

Prime Minister Tony Blair’s election last month

Page 25: News Search Using Discourse Analytics

Final remarks

The same technique can be adapted to other domains

Previously developed

– EUPMC – medical journal articles

– ASCOT – clinical trials

Other domains

Page 26: News Search Using Discourse Analytics

Final remarks

Summary

Enhanced access to information within digital heritage archives (NYT)

Identified discourse phenomena to search for and filter events

Created ISHER, a semantic search engine to access the NYT corpus

Future work

Apply to new domains and institutional repositories

Customise towards social unrest

Diachronic studies

Other languages in danger of digital extinction – Meta-Net

Page 27: News Search Using Discourse Analytics

Thank you!