Computational Support in eHumanities

Proof of concept produced during CLARIN’s Creative Camp Talk of Europe

Wim Peters, Adam Funk

University of Sheffield, UK

[email protected] [email protected]

Information Extraction in the TalkOfEurope Creative Camp



CLARIN’s Creative Camp Talk of Europe

• Our main aim in this event:

• Term identification and structuring in ToE and UK Parliament data

• Linking ToE and UK Parliament terminology

• Automatic enrichment of ToE data set

• http://linkedpolitics.ops.few.vu.nl/home

Data set 1

• Talk of Europe data set

• Plenary debates of the European Parliament as Linked Open Data

• http://linkedpolitics.ops.few.vu.nl/

Data set 2

• UK Parliamentary Archives

• http://www.parliament.uk/business/publications/parliamentary-archives/

ParlParse

• Speeches scraped from UK Parliamentary web site

• Converted into structured XML representations

http://parser.theyworkforyou.com/

Workflow

Output

• For terms in each data set:

– Terms

– Term hierarchies

– Term clusters

– Sentence-based sentiment context

• Between data sets:

– Relatedness between terms

• To identify and extract relevant information from the source material, we use the GATE architecture for the production of semantic metadata in the form of text annotations.

• GATE is a framework for language engineering applications, which supports efficient and robust text processing including functionality for both manual and automatic annotation.

• It is highly scalable and has been applied in many large text processing projects.

• It is an open source desktop application written in Java that provides a user interface for professional linguists and text engineers to bring together a wide variety of natural language processing tools and apply them to a set of documents.

General Architecture for Text Engineering

• General Architecture for Text Engineering (GATE)

• Open source framework which supports plug-in NLP components to process a corpus of text.

http://gate.ac.uk/

Free system download and training courses

LEX 2014, Ravenna, Italy

General Architecture for Text Engineering

Advantages

• Reproducibility

• Reusability

• Flexibility

• Customisability to scholarly requirements regarding research questions and analysis methodology

• http://www.gate.ac.uk

Text Annotations

Term Extraction

• TermRaider

• http://www.dcs.shef.ac.uk/~wim/termraider.html

• Automatically provides domain-specific noun phrase term candidates from a text corpus together with a statistically derived termhood score.

• Possible terms are filtered by means of a multi-word-unit grammar that defines the possible sequences of part of speech tags constituting noun phrases.

• It computes various termhood scores such as Kyoto Domain Relevance and term frequency/inverse document frequency (TF-IDF). The scores indicate the salience of each term candidate for each document in the corpus.
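The filtering step can be illustrated with a minimal sketch. It is not TermRaider's actual grammar: the tag mapping `TAG_CODE`, the pattern `NP_PATTERN` and the function `candidate_terms` are hypothetical names, and the input is assumed to be already tokenised and tagged with Penn Treebank part-of-speech tags.

```python
import re

# Hypothetical mapping from Penn Treebank tags to one-letter codes
# (A = adjective, N = noun); every other tag becomes "x".
TAG_CODE = {"JJ": "A", "NN": "N", "NNS": "N", "NNP": "N"}

# Hypothetical grammar: any run of adjectives and nouns that ends in a noun.
NP_PATTERN = re.compile(r"[AN]*N")

def candidate_terms(tagged_tokens):
    """Return the word sequences whose tag sequence matches NP_PATTERN."""
    codes = "".join(TAG_CODE.get(tag, "x") for _, tag in tagged_tokens)
    words = [word for word, _ in tagged_tokens]
    return [" ".join(words[m.start():m.end()])
            for m in NP_PATTERN.finditer(codes)]

sentence = [("efficient", "JJ"), ("control", "NN"), ("is", "VBZ"),
            ("needed", "VBN"), ("for", "IN"), ("EU", "NNP"), ("funds", "NNS")]
terms = candidate_terms(sentence)
```

The sketch only returns the longest non-overlapping matches; a fuller grammar would also allow prepositional post-modifiers and nested candidates.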

KYOTO domain relevance score

• df × (1 + nh)

– df: number of documents in the current corpus containing the term

– nh: number of hyponymic term candidates

• W. Bosma and P. Vossen. Bootstrapping language-neutral term extraction. In 7th Language Resources and Evaluation Conference (LREC), Valletta, Malta (2010)
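As a worked illustration of the formula above, the following sketch computes the raw score from two assumed inputs: `documents`, a list of per-document term sets, and `hyponyms`, a mapping from a term to its hyponymic term candidates. Both names and the toy data are hypothetical, and any rescaling to a 0 to 100 range is left out.

```python
def kyoto_relevance(term, documents, hyponyms):
    """Raw KYOTO domain relevance: df * (1 + nh)."""
    # df: number of documents containing the term
    df = sum(1 for doc_terms in documents if term in doc_terms)
    # nh: number of hyponymic term candidates of the term
    nh = len(hyponyms.get(term, ()))
    return df * (1 + nh)

# Toy data, invented for illustration only.
documents = [{"fight", "control"}, {"fight"}, {"control"}]
hyponyms = {"fight": {"fight against corruption",
                      "fight against serious crime"}}

fight_score = kyoto_relevance("fight", documents, hyponyms)
control_score = kyoto_relevance("control", documents, hyponyms)
```

With equal document frequency, the term with more hyponymic candidates scores higher, which rewards terms that head many longer terms.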

TF-IDF (Wikipedia)
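The formula shown on this slide did not survive extraction. A common textbook variant multiplies the raw term count in a document by the logarithm of the inverse document frequency; the sketch below assumes that variant, which may differ from the one TermRaider uses. The function name `tf_idf` and the toy corpus are hypothetical.

```python
import math

def tf_idf(term, document, corpus):
    """Raw term count times log(N / df), a common TF-IDF variant."""
    tf = document.count(term)                       # occurrences in this document
    df = sum(1 for doc in corpus if term in doc)    # documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

# Toy corpus: each document is a list of term occurrences.
corpus = [["fight", "fight", "control"], ["control"], ["budget"]]
fight_weight = tf_idf("fight", corpus[0], corpus)
control_weight = tf_idf("control", corpus[0], corpus)
```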

Term Relatedness 1: Hyponyms (rdf: skos:narrowerTransitive)

• Hierarchical relations between terms based on head phrase matching

• fight – fight against all form of intolerance

• fight – fight against serious crime and terrorism

• fight – fight against all form of intolerance and discrimination

• fight – fight against illegal drug and the organised crime

• fight – fight against corruption and organised crime

• control – efficient control

• efficient control of EU fund
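The examples above can be reproduced with a rough head-matching heuristic. This is a sketch under stated assumptions, not the authors' implementation: terms are lemmatised strings, the head phrase of a term is taken to be everything before its first preposition, and the preposition list `PREPOSITIONS` and the function names are hypothetical.

```python
# Hypothetical, deliberately small preposition list.
PREPOSITIONS = {"of", "against", "for", "in", "on", "to", "with"}

def split_core(term):
    """Split a term into its head phrase and its prepositional post-modifier."""
    tokens = term.split()
    for i, token in enumerate(tokens):
        if token in PREPOSITIONS:
            return tokens[:i], tokens[i:]
    return tokens, []

def is_narrower(candidate, broader):
    """True if `broader` matches the end of the head phrase of `candidate`."""
    b_tokens = broader.split()
    _, b_post = split_core(broader)
    if not b_tokens or b_post or candidate == broader:
        return False  # only unmodified terms act as broader terms here
    core, post = split_core(candidate)
    if len(b_tokens) > len(core) or core[-len(b_tokens):] != b_tokens:
        return False
    # The candidate must add a pre-modifier or a post-modifier.
    return len(b_tokens) < len(core) or bool(post)

a = is_narrower("fight against corruption and organised crime", "fight")
b = is_narrower("efficient control", "control")
c = is_narrower("efficient control of EU fund", "efficient control")
d = is_narrower("efficient control of EU fund", "control")
```

The last call also holds, which matches the transitive reading of skos:narrowerTransitive.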

Term relatedness 2: Clusters

• Compute Pointwise Mutual Information

– Pair-wise association score for terms that co-occur within a context window (in our case sentences)
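A minimal sketch of sentence-level PMI follows, using the standard definition PMI(x, y) = log2(p(x, y) / (p(x) p(y))) with probabilities estimated as fractions of sentences. The input format (one list of terms per sentence) and the function name `sentence_pmi` are assumptions; any rescaling of the scores is not shown.

```python
import math
from collections import Counter
from itertools import combinations

def sentence_pmi(sentences):
    """PMI for every pair of terms that co-occur in at least one sentence."""
    n = len(sentences)
    term_count = Counter()
    pair_count = Counter()
    for terms in sentences:
        unique = sorted(set(terms))           # count each term once per sentence
        term_count.update(unique)
        pair_count.update(combinations(unique, 2))
    return {
        (x, y): math.log2((count / n) /
                          ((term_count[x] / n) * (term_count[y] / n)))
        for (x, y), count in pair_count.items()
    }

# Toy input, invented for illustration only.
sentences = [["human rights", "freedom"], ["human rights", "freedom"],
             ["budget"], ["budget", "fund"]]
pmi = sentence_pmi(sentences)
```

Pair keys are alphabetically ordered tuples, so each unordered pair appears once.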

Cluster creation

• Simple clique algorithm

• https://en.wikipedia.org/wiki/Cluster_analysis

• Each cluster member (a term candidate with a Kyoto Domain Relevance score > 70/100) is connected to all other cluster members by a PMI score > 70/100

– Result: “statistical thesaurus”

– strongly associated groups of words

– Use: enhance data exploration by expanding searches with related terms (query expansion)
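The slides do not say which clique algorithm was used. The sketch below substitutes the basic Bron-Kerbosch algorithm to enumerate maximal cliques, and assumes that both the relevance scores and the PMI scores have already been rescaled to a 0 to 100 range. The function names and the toy scores are hypothetical.

```python
def maximal_cliques(adjacency):
    """Enumerate maximal cliques with the basic Bron-Kerbosch algorithm."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)      # r cannot be extended: it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & adjacency[v], x & adjacency[v])
            p = p - {v}
            x = x | {v}

    expand(frozenset(), frozenset(adjacency), frozenset())
    return cliques

def cluster_terms(relevance, pmi, threshold=70):
    """Cliques of relevant terms that are pairwise linked by a high PMI score."""
    keep = {t for t, score in relevance.items() if score > threshold}
    adjacency = {t: set() for t in keep}
    for (a, b), score in pmi.items():
        if score > threshold and a in keep and b in keep:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return [c for c in maximal_cliques(adjacency) if len(c) > 1]

# Toy scores on a 0-100 scale, invented for illustration only.
relevance = {"human rights": 90, "freedom": 85, "internet": 80,
             "budget": 75, "weather": 10}
scaled_pmi = {("human rights", "freedom"): 95, ("human rights", "internet"): 80,
              ("freedom", "internet"): 72, ("budget", "freedom"): 90,
              ("weather", "freedom"): 99, ("budget", "internet"): 40}
clusters = cluster_terms(relevance, scaled_pmi)
```

A term can belong to several clusters, which suits query expansion: a search for one member can be widened with the other members of each cluster it appears in.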

Clusters including “human rights”: ToE data

(manually highlighted elements indicative of contrast with UK perspective)

• \end\vote\commission\network\programme\funding\proposal\report\text\level\service\freedom\fund\concern\president\access\basis\internet\enforcement\example\instrument\plastic\money\EU policy

• \recommendation\position\level\change\community\right\part\approach\discussion\dossier\regard\opinion\policy\force\negotiation\account\public\opportunity\fight

Clusters including “human rights”: UK data

(manually highlighted elements indicative of contrast with EU perspective)

• \foreign\press\answer\election

• \realise\MPs\politician\consequence\claim\interest\lesson\pension\employment

• \incentive\accountability\movement\treatment\word\young people\assessment\

Term Relatedness 3: Links between ToE and UK terms

(rdf: skos:related)

• For now the link is limited to orthographic overlap of terms’ canonical forms

– Lemmatised

– decapitalised
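Linking by canonical form can be sketched as follows. The lemmatiser here is a crude stand-in that only strips a plural "s"; the actual pipeline would use a proper morphological analyser. The names `naive_lemma`, `canonical` and `link_terms` are hypothetical.

```python
def naive_lemma(word):
    """Crude stand-in for a lemmatiser: strip a plural 's' only."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def canonical(term):
    """Decapitalise the term and lemmatise each word."""
    return " ".join(naive_lemma(word) for word in term.lower().split())

def link_terms(toe_terms, uk_terms):
    """Pair ToE and UK terms whose canonical forms are identical."""
    uk_index = {}
    for term in uk_terms:
        uk_index.setdefault(canonical(term), []).append(term)
    return [(toe, uk)
            for toe in toe_terms
            for uk in uk_index.get(canonical(toe), [])]

links = link_terms(["Human Rights", "EU fund"], ["human right", "Pensions"])
```

Each resulting pair would correspond to one skos:related link between the two data sets.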

Sentiment Context for Terminology

• Sentences have a sentiment value of positive, negative or neutral

• This allows the exploration of the emotional load of the context in which terminology is used

Added RDF

Why RDF output?

• Standard knowledge representation

• Queryable in SPARQL

• Slots additional knowledge into the Talk of Europe data model
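To make the RDF output concrete, the sketch below serialises term relations as N-Triples lines using the two SKOS properties named on the earlier slides. The base URI `http://example.org/term/` and the URI scheme are placeholders, not the URIs of the Talk of Europe data model.

```python
from urllib.parse import quote

SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/term/"   # placeholder, not the project's namespace

def term_uri(term):
    """Build a URI for a term (hypothetical scheme: spaces become underscores)."""
    return BASE + quote(term.replace(" ", "_"))

def triple(subject, predicate, obj):
    """Serialise one SKOS relation between two terms as an N-Triples line."""
    return f"<{term_uri(subject)}> <{SKOS}{predicate}> <{term_uri(obj)}> ."

line = triple("fight", "narrowerTransitive", "fight against corruption")
```

Triples in this form can be loaded into a triple store and queried in SPARQL alongside the existing data.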

Coverage of results

• Proof of concept

• European Parliament

– 2 months (6546 speeches)

– 7900 term candidates

• UK Parliament

– 1 month (January 2014, 7571 UK speeches)

– 28000 term candidates

• Around 750000 triples

• 2900 relations between EU and UK terminology

Usability of data and methodology

• Assists further exploration of parliamentarians’ styles, priorities and perspectives through term usage and context

– E.g. compare cluster members of terms in order to detect contrastive perspectives between ToE and UK terminological use

– (see “human rights” example)

• Flexible methodology, re-usable on other data

Data

• RDF

• http://www.dcs.shef.ac.uk/~wim/TalkOfEurope-Gate-Terms.zip

• OWL model

• http://www.dcs.shef.ac.uk/~wim/toe-data-model.owl