DISCOVERING LINKS BETWEEN POLITICAL DEBATES AND MEDIA Damir Juric, Delft University of Technology Laura Hollink, VU University Amsterdam Geert-Jan Houben, Delft University of Technology ICWE2013


Discovering links between political debates and media by Damir Juric, Laura Hollink, Geert-Jan Houben TU Delft - WIS at ICWE 2013, Aalborg, Denmark, July 2013


Page 1: ICWE2013 - Discovering links between political debates and media

DISCOVERING LINKS BETWEEN POLITICAL DEBATES AND MEDIA

Damir Juric, Delft University of Technology

Laura Hollink, VU University Amsterdam

Geert-Jan Houben, Delft University of Technology

ICWE2013

Page 2: ICWE2013 - Discovering links between political debates and media

The PoliMedia project: linking politics to media

Page 3: ICWE2013 - Discovering links between political debates and media

PoliMedia research questions
• How is a person, subject or process covered & visualised by the media?
• How do debates and arguments develop over a longer period of time?
• Analysing the changing ideas, arguments and presentation in different media

Page 4: ICWE2013 - Discovering links between political debates and media

Issues with current approach
• Go to different archives, look up original data!

Page 5: ICWE2013 - Discovering links between political debates and media

Goal: explicit links to different media types in one system

Page 6: ICWE2013 - Discovering links between political debates and media

Data Sets – Debates
Handelingen der Staten-Generaal (the Dutch Hansard), 1945-1995

Some provenance:
1. Transcripts are made of the complete debates of the Dutch parliament.
2. Published online by the government on http://www.statengeneraaldigitaal.nl/ (1818 - 1995) and http://officielebekendmakingen.nl/ (from 1995).
3. The PoliticalMashup project has translated the government PDF and TXT files into XML, including URIs as identifiers; see http://politicalmashup.nl/
4. We build on that.

Page 7: ICWE2013 - Discovering links between political debates and media

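Structure of the debate data

[Figure: a debate consists of Debate Metadata and a list of topics (Topic 1, Topic 2); each topic holds a sequence of speeches (Speaker 1 / Content, Speaker 2 / Content, Speaker 3 / Content, Speaker 1 / Content). Named entities (NEs) are attached to parts of the debate.]

Example topic description: "Aan de orde is de behandeling van: - de brief van de minister van Economische Zaken inzake Borssele (16226, nr. 26). De beraadslaging wordt geopend." (English: "On the agenda is the consideration of: the letter from the Minister of Economic Affairs concerning Borssele (16226, no. 26). The deliberation is opened.")

Example speech: "Mijnheer de Voorzitter! Met de verdragen tot uitbreiding van de EEG met Denemarken, Engeland, Ierland en Noorwegen wordt een van de doelstellingen van ons buitenlands beleid verwezenlijkt." (English: "Mr. Speaker! With the treaties enlarging the EEC to include Denmark, England, Ireland and Norway, one of the objectives of our foreign policy is being realised.")

Named-entity sets shown in the figure:
NEs={Economische Zaken, Borssele}
NEs={Borssele, Partij van de Arbeid, D66}

Metadata
• who, when, what
• identifiers for subparts of the debate
• chronological order of speakers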
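The nesting described on this slide can be pictured as a small data model. The following is only a minimal sketch: the class names, field names, identifier format and the placeholder date are assumptions made for illustration and are not taken from the PoliticalMashup XML or the PoliMedia model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Speech:
    speech_id: str   # identifier for this subpart of the debate (hypothetical format)
    speaker: str     # who
    text: str        # what
    named_entities: List[str] = field(default_factory=list)


@dataclass
class Topic:
    topic_id: str
    description: str
    named_entities: List[str] = field(default_factory=list)
    # Speeches are stored in the chronological order in which they were held.
    speeches: List[Speech] = field(default_factory=list)


@dataclass
class Debate:
    debate_id: str
    date: str        # when (ISO date string)
    topics: List[Topic] = field(default_factory=list)


def speakers_in_order(debate: Debate) -> List[str]:
    """Return the speakers in the order in which they spoke."""
    return [s.speaker for t in debate.topics for s in t.speeches]


# Placeholder values throughout; only the NE sets are copied from the slide.
topic = Topic(
    topic_id="d1.t1",
    description="de brief van de minister van Economische Zaken inzake Borssele",
    named_entities=["Economische Zaken", "Borssele"],
    speeches=[
        Speech("d1.t1.s1", "Speaker 1", "...", ["Borssele", "Partij van de Arbeid", "D66"]),
        Speech("d1.t1.s2", "Speaker 2", "..."),
        Speech("d1.t1.s3", "Speaker 3", "..."),
        Speech("d1.t1.s4", "Speaker 1", "..."),
    ],
)
debate = Debate(debate_id="d1", date="1975-01-01", topics=[topic])
print(speakers_in_order(debate))
```

Keeping the speeches in a list inside their topic preserves both the partOf relation and the speaking order, which are the two structural properties the slide highlights.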

Page 8: ICWE2013 - Discovering links between political debates and media

Data sets – Media
• Newspaper articles
  • at the National Library of the Netherlands
  • Many newspapers 1950-1995
  • Text + images of newspaper layout

Page 9: ICWE2013 - Discovering links between political debates and media

All data and links expressed as RDF
• We have created a semantic model to capture the datasets and the links between them.
• Reusing other vocabularies
  • Simple Event Model (SEM)
  • Dublin Core
  • FOAF
  • ISOCAT
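To make the idea of reusing vocabularies concrete, here is a hedged sketch that writes a few N-Triples-style statements for one speech. The `http://example.org/polimedia/` base URI, the identifiers, and the choice to type a speech as `sem:Event` with `dcterms:isPartOf` and `sem:hasActor` are assumptions for illustration; the slides do not show which terms the PoliMedia model actually uses.

```python
# Namespaces of the reused vocabularies.
SEM = "http://semanticweb.cs.vu.nl/2009/11/sem/"
DCT = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/polimedia/"  # hypothetical base URI


def triple(subject, predicate, obj, literal=False):
    """Format one statement; obj is a URI unless literal=True."""
    if literal:
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
        obj_term = '"%s"' % escaped
    else:
        obj_term = "<%s>" % obj
    return "<%s> <%s> %s ." % (subject, predicate, obj_term)


def speech_triples(debate_id, speech_id, speaker_id, speaker_name):
    """Describe a speech as an event that is part of a debate and has a speaker."""
    debate = BASE + "debate/" + debate_id
    speech = BASE + "speech/" + speech_id
    speaker = BASE + "person/" + speaker_id
    return [
        triple(speech, RDF_TYPE, SEM + "Event"),
        triple(speech, DCT + "isPartOf", debate),
        triple(speech, SEM + "hasActor", speaker),
        triple(speaker, FOAF + "name", speaker_name, literal=True),
    ]


triples = speech_triples("d1", "d1.s1", "p1", "Speaker 1")
print("\n".join(triples))
```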

Page 10: ICWE2013 - Discovering links between political debates and media

All data and links expressed as RDF

Page 11: ICWE2013 - Discovering links between political debates and media

PoliMedia linking method
• Debate speeches and newspaper articles are different types of documents, so default document similarity metrics are insufficient
  • Speeches contain many named entities, digressions.
  • Newspapers are formal and concise, words are used sparingly.
• The challenge: how to create a representation of the speeches that contains enough information to be used as a query to retrieve the right media articles from the archive?

Page 12: ICWE2013 - Discovering links between political debates and media

PoliMedia linking method

• Our PoliMedia linking method consists of four steps:

1. topics: enriching the existing debate metadata with topics

2. preselection of articles: candidate articles are preselected by when they were published and who spoke in the debate (timeframe and speakers)

3. automatic query creation: candidate articles are ranked based on similarity to the query (automatically created from speech text) by comparing vectors of topics and named entities

4. link creation: links are created between a speech and an article if the similarity score is above a threshold t
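Steps 3 and 4 can be sketched as follows. This is a minimal illustration, assuming cosine similarity over term-count vectors; the slides only say that vectors of topics and named entities are compared and that a threshold t is applied, so the similarity measure, the function names and the article identifiers are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def create_links(query_terms, candidates, t):
    """Rank candidate articles against the query and keep those scoring above t.

    query_terms: topic words and named entities taken from the speech
    candidates:  {article_id: list of terms} for the preselected articles
    """
    query = Counter(query_terms)
    scored = [(cosine(query, Counter(terms)), article_id)
              for article_id, terms in candidates.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [(article_id, score) for score, article_id in scored if score > t]


# Toy input: the terms are named entities that appear on the slides,
# the article identifiers are hypothetical.
query_terms = ["Borssele", "Economische Zaken", "D66"]
candidates = {
    "article-1": ["Borssele", "D66", "Kamer"],
    "article-2": ["EEG", "Noorwegen"],
}
links = create_links(query_terms, candidates, t=0.5)
print(links)
```

In this toy run only `article-1` shares terms with the query, so it is the only article that passes the threshold.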

Page 13: ICWE2013 - Discovering links between political debates and media

Topics
The MALLET topic model package
• Unsupervised analysis of text
• “a Topic consists of a cluster of words that frequently occur together” [see http://mallet.cs.umass.edu/topics.php]
• Input: Text, Number of iterations, Number of topics/clusters
• Output: Words that cluster around one topic.
• Example:
  • Text: a speech in a debate from 1975
  • number of iterations: 2000
  • number of topics: 1

Page 14: ICWE2013 - Discovering links between political debates and media

Automatic query creation

[Figure: four term sets feed the query: TopicSet Speech, NE Speech, TopicSet Topic and NE Topic. Other diagram labels: Debate Metadata; Topic 1, Topic 2; Speaker 1 / Content, Speaker 2 / Content, Speaker 3 / Content, Speaker 1 / Content; Actor; Query; Debate; "came from" (twice).]

Example terms shown in the figure, in extraction order:
• Kombrink, rente, inkomstenbelasting, bronheffing, vereenvoudiging, tarief, contourennota
• Nederland, word, tussen, wetgeving, sociale, moeten, fraude, fraudebestrijding, vraag, misbruik, ten, gebruik, kamer
• misbruik fraudebestrijding, ismo-rapport, Contourennota, Kombrink, EEG, Netherlandse, OESO-verband, Nederland, Contou, Engwirda, Couprie, Midden-Oosten, Euro-kapitaalmarkt, Tariefnota, Staatssecretaris, Regering, Financiën, Zwitserland, Brussel, Grave

Page 15: ICWE2013 - Discovering links between political debates and media

PoliMedia pipeline

[Figure: pipeline diagram. Labels: PoliticalMashup (xml); semantic model; RDF files; stopword removal; topic modeling; NERs Speech, TopicSet Speech, NERs Topic, TopicSet Topic; contextual vectors; automatic query creation (Query NE, Query content, expanded query creation); SRU Query (actor, date range); KB (preselect data); article metadata; similarity calculation; ranking; filtering.]
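The "SRU Query (actor, date range)" box in the pipeline preselects articles by speaker and time frame. A rough sketch of how such a request URL could be assembled is given below. The endpoint is a placeholder, and the CQL `date` index and date format are assumptions; only the request parameters `operation`, `version`, `query` and `maximumRecords` come from the SRU standard itself.

```python
from urllib.parse import urlencode, urlparse, parse_qs

SRU_ENDPOINT = "https://example.org/sru"  # placeholder, not the real KB endpoint


def build_sru_url(actor, start_date, end_date, max_records=50):
    """Build an SRU searchRetrieve URL for articles mentioning an actor in a date range."""
    # Assumed CQL: free-text match on the actor name plus a hypothetical "date" index.
    cql = '"%s" and date within "%s %s"' % (actor, start_date, end_date)
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql,
        "maximumRecords": str(max_records),
    }
    return SRU_ENDPOINT + "?" + urlencode(params)


# Hypothetical speaker/date combination, for illustration only.
url = build_sru_url("Kombrink", "1975-01-01", "1975-01-08")
print(url)

# Decode the URL again to inspect the CQL query that would be sent.
sent = parse_qs(urlparse(url).query)
print(sent["query"][0])
```

Restricting the candidate set this way keeps the later similarity calculation limited to articles that could plausibly report on the debate.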

Page 16: ICWE2013 - Discovering links between political debates and media

Evaluation

• We tried three different approaches:

• Experiment 1: NEs in speech

• Experiment 2: NEs + topics in speech

• Experiment 3: NEs + topics in speech and debate

• Two independent evaluators: reading the speeches and articles linked to them and manually assessing their relatedness
• Randomly selected 20 debates from our dataset of 10,924 debates (different subjects: from fraud in the social system to the European elections)
• Each experiment: random 50 speeches
• In total: 150 speech-article pairs, namely 3 sets of 50 each

Page 17: ICWE2013 - Discovering links between political debates and media

Evaluation

Results:
• best approach: named entities (speech + debate descriptions) and topics (speech + debate)

(Rating scale: 2: relevant, 1: partially relevant, 0: unrelated)

Page 18: ICWE2013 - Discovering links between political debates and media

Evaluation

• Relative recall:
  • different evaluation: an annotator reads a speech, manually creates a suitable query for it, and assesses the relevance of the articles returned for that query

Precision: 75%, recall 62%
Experiment 3 on 5 speeches/115 articles gave a recall of 3804 links
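For readers unfamiliar with the two measures, the sketch below shows how precision and relative recall can be computed from sets of article identifiers. It assumes that relative recall is measured against the pool of relevant articles found through the annotator's manual queries; the identifiers and counts are invented for illustration and do not reproduce the figures reported above.

```python
def precision(linked, relevant):
    """Fraction of automatically linked articles that were judged relevant."""
    if not linked:
        return 0.0
    return len(linked & relevant) / len(linked)


def relative_recall(linked, relevant_pool):
    """Fraction of the known relevant articles that were also linked automatically.

    relevant_pool is assumed to be the set of relevant articles found by the
    annotator's manual queries, not every relevant article in the archive.
    """
    if not relevant_pool:
        return 0.0
    return len(linked & relevant_pool) / len(relevant_pool)


# Hypothetical identifiers for one speech.
linked = {"a1", "a2", "a3", "a4"}           # links produced by the method
relevant = {"a1", "a2", "a3", "a5", "a6"}   # articles judged relevant by the annotator

print(precision(linked, relevant))        # 3 of 4 links are relevant
print(relative_recall(linked, relevant))  # 3 of 5 relevant articles were found
```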

Page 19: ICWE2013 - Discovering links between political debates and media

Conclusion
• Creation of links between two very different datasets: a dataset of political debates and a media archive
• Linking method takes advantage of:
  • Debate content and metadata
  • Named Entities and Topics from the debates
  • semantic partOf structure of the debates
• In experiments we have shown the added value of topics and debate structure
• Produced links
  • different in nature than those produced by e.g. ontology alignment tools
  • Now: coarsely typed links
  • Future: nature and strength of the link