Linguistic Processing in Lattice-Based Taxonomy Construction

Anastasia Novokreshchenova, Maria Shabanova, Dmitry Zaytsev and Nina Belyaeva

State University Higher School of Economics, Moscow
School of Applied Mathematics and Computer Science

CLA 2010, Seville, Spain. October 19-21, 2010.



Outline

• Motivation in Social Studies and the Data
• Building a lattice-based taxonomy over a text corpus
• Natural language processing techniques for automatic attribute acquisition
  – Keyword extraction
  – Probabilistic latent modeling of text
  – Named entity recognition

Motivation

• Represent the structure of a given domain in the form of a lattice-based taxonomy
  – Interdisciplinary research project “Discrete mathematical models for political analysis of democratic institutions and human rights”
  – Speeches of Western leaders and international organizations
  – The context in which Russia is addressed
  – The role and importance of the democracy and human rights agenda
• Construct a context from the text corpus
  – Extract the set of attributes from the texts for describing the documents
  – Analyze and develop natural language processing methods

THE DATA: 26 FULL SPEECHES OF FOREIGN LEADERS

CONSTRUCTING LATTICE-BASED TAXONOMY OVER A TEXT CORPUS

• Preliminary text processing
• Attribute extraction for describing the documents
• Building and pruning the lattice
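The last step above, building and pruning the lattice, can be illustrated with a minimal Python sketch. It enumerates the formal concepts of a tiny document-by-attribute context by closing every attribute subset, which is exponential and only suitable for toy examples. The slides do not say how the lattice was pruned; the support threshold used in `prune` is an assumption made for illustration, and the toy context and all function names are hypothetical.

```python
from itertools import combinations

def extent(context, attrs):
    """Objects (documents) that have every attribute in attrs."""
    return frozenset(o for o, a in context.items() if attrs <= a)

def intent(context, objs, all_attrs):
    """Attributes shared by every object in objs (all attributes if objs is empty)."""
    shared = set(all_attrs)
    for o in objs:
        shared &= context[o]
    return frozenset(shared)

def formal_concepts(context):
    """Naive enumeration: close every attribute subset into an (extent, intent) pair."""
    all_attrs = frozenset().union(*context.values())
    concepts = set()
    for r in range(len(all_attrs) + 1):
        for subset in combinations(sorted(all_attrs), r):
            objs = extent(context, frozenset(subset))
            concepts.add((objs, intent(context, objs, all_attrs)))
    return concepts

def prune(concepts, min_support):
    """ASSUMPTION: keep only concepts covering at least min_support documents."""
    return {(e, i) for e, i in concepts if len(e) >= min_support}

# Hypothetical toy context: documents described by attribute sets.
context = {
    "Doc1": {"security", "europe"},
    "Doc2": {"security", "energy"},
    "Doc3": {"security", "europe", "energy"},
}
concepts = formal_concepts(context)
pruned = prune(concepts, min_support=2)
```

On this toy context the unpruned lattice has four concepts; with `min_support=2` the concept containing only Doc3 is dropped.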

THREE KINDS OF TAXONOMIES

Three kinds of taxonomies, depending on the attribute type:
• frequent words
• latent topics
• named entities

BUILDING A TAXONOMY WITH FREQUENT WORDS

• eliminating stop words
• stemming: collapsing all morphological variants of a term to a single root form
• describing each document with its N most frequent terms
• building and pruning the lattice

        t1    …    tn
Doc1    X     …    -
Doc2    -          X
DocT    X     …    X

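\[ tf_{ij} = \frac{n_{ij}}{\sum_{k} n_{ik}} \]

where n_{ij} is the number of occurrences of term t_j in document Doc_i.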
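The frequent-word steps on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the stop-word list is a tiny placeholder, `toy_stem` is a crude suffix stripper standing in for a real stemmer, and the two sample documents are invented. Term frequency is computed as a term's count divided by the total number of retained terms in the document.

```python
import re
from collections import Counter

# ASSUMPTION: a tiny illustrative stop-word list, not the one used in the study.
STOP_WORDS = {"the", "and", "of", "is", "in", "to", "a", "on", "with", "are", "for"}

def toy_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ies", "ing", "es", "s", "y"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_frequencies(text):
    """Relative frequency of each stem among the non-stop-word tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [toy_stem(t) for t in tokens if t not in STOP_WORDS]
    counts = Counter(stems)
    total = len(stems)
    return {term: count / total for term, count in counts.items()}

def top_terms(text, n):
    """The n most frequent stems; ties are broken alphabetically."""
    tf = term_frequencies(text)
    ranked = sorted(tf.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]

def build_context(docs, n):
    """Binary context: each document is described by its n most frequent stems."""
    return {name: set(top_terms(text, n)) for name, text in docs.items()}

# Hypothetical sample documents.
docs = {
    "Doc1": "Security and energy: energy security is the priority of Europe.",
    "Doc2": "Democracy and rights: democracy needs rights, rights need democracy.",
}
context = build_context(docs, n=2)
```

Each document's attribute set then corresponds to the cells marked X in its row of the document-by-term table.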

31 FORMAL CONCEPTS OF THE LATTICE BASED ON FREQUENT WORDS

Figures in squares show the number of documents in each concept

ACCORDING TO THE WORD-FREQUENCY TAXONOMY:

• security issues and the relationship between Russia and Europe are the most discussed topics, along with some global problems
• democracy and human rights are not included in the presented taxonomy due to pruning
  ◦ the words "democracy", "human" and "right" appear in the concepts which include speeches by Barack Obama and Hillary Clinton

Probabilistic latent semantic analysis (pLSA)

• P( z ) – the distribution over topics z in a particular document

• P( w | z ) – the probability distribution over words w given topic z

• T is the number of topics

\[ P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j) \, P(z_i = j) \]
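As a small worked example of the mixture on this slide, the probability of a word is the sum over the T topics of P(w | z = j) P(z = j). The Python sketch below evaluates that sum for T = 2; the topic weights and word distributions are made-up numbers chosen only for illustration.

```python
# ASSUMPTION: invented toy values, not estimates from the speech corpus.
p_z = [0.7, 0.3]  # P(z = j) for T = 2 topics in one document
p_w_given_z = [
    {"crisis": 0.5, "energy": 0.1, "rights": 0.4},  # P(w | z = 1)
    {"crisis": 0.1, "energy": 0.8, "rights": 0.1},  # P(w | z = 2)
]

def word_probability(word, p_z, p_w_given_z):
    """P(w) as the sum over topics j of P(w | z = j) * P(z = j)."""
    return sum(p_w_given_z[j][word] * p_z[j] for j in range(len(p_z)))

p_crisis = word_probability("crisis", p_z, p_w_given_z)  # 0.5*0.7 + 0.1*0.3
total = sum(word_probability(w, p_z, p_w_given_z) for w in p_w_given_z[0])
```

Because each P(w | z = j) and P(z) sums to one, the resulting word probabilities also sum to one.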

BUILDING A TAXONOMY WITH LATENT TOPICS

• probabilistic modeling of text: documents are represented as random mixtures over latent topics
• each topic is characterized by a distribution over words
• 20 topics were derived from the 26 documents
• the 20 topics were used as attributes for describing the documents
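One possible way to use topics as attributes is to mark a topic as present in a document when its weight in that document's topic mixture reaches a threshold. The slides do not state how the topic distributions were turned into binary attributes, so the thresholding rule, the threshold value, and the numbers in this Python sketch are all assumptions.

```python
def topics_to_context(doc_topic, threshold):
    """ASSUMPTION: a topic becomes an attribute of a document when its weight >= threshold."""
    return {
        doc: {f"topic{j + 1}" for j, weight in enumerate(weights) if weight >= threshold}
        for doc, weights in doc_topic.items()
    }

# Hypothetical per-document topic mixtures (each row sums to 1).
doc_topic = {
    "Doc1": [0.6, 0.3, 0.1],
    "Doc2": [0.05, 0.15, 0.8],
}
topic_context = topics_to_context(doc_topic, threshold=0.25)
```

The resulting sets play the same role as the frequent-term sets: they form the rows of a binary context from which the lattice is built.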

6 OF THE 20 TOPICS DERIVED FROM THE DOCUMENTS: WORD DISTRIBUTIONS OVER TOPICS

Top word stems for each topic, in the order shown on the slide:

• Economics and financial crisis: crisi, presid, finance, econom, system, govern, reform, propos, time, market, subject, unit, bank, septemb, reason, euro, war, promot
• Democracy and human rights: right, human, govern, peopl, democraci, work, women, democrat, protect, principl, societi, account, univers, commun, leader, Life, clinton, independ
• Future of the US and weapon issues: nation, unit, nuclear, america, american, interest, futur, weapon, alli, centuri, war, common, year, prosper, forward, partnership, great, goal
• France and ecological problems: franc, summit, responc, final, french, preapar, longer, lead, choic, environment, african, debat, renew, organ, africa, collect, contribut, ambiti
• Russian–Georgian conflict: georgia, russian, intern, georgian, territori, south, order, process, ethnic, feder, direct, address, ossetia, plan, sepatatism, august, absolut, bomb
• Russia and energy issues: russia, russian, interest, energy, medvedev, issu, rule, trust, dialog, area, agreement, partnership, trade, law, intern, neighbor, common, gas

17 FORMAL CONCEPTS OF THE LATTICE BASED ON LATENT TOPICS


ACCORDING TO THE LATENT-TOPIC TAXONOMY

The most topical issues are those connected with:
• the European Union
• global problems
• security issues
• energy resources
• the Russian-Georgian conflict
• possible ways of solving conflicts and problems

The topic of democracy and human rights is not included in the presented taxonomy due to pruning:
• the concept with this topic includes speeches by Barack Obama and Nicolas Sarkozy

BUILDING A TAXONOMY WITH NAMED ENTITIES

• 38 paragraphs were taken from the 26 speeches; they cover solely issues concerning Russia
• three types of named entities were used for describing the documents
  ◦ names of persons
  ◦ organizations
  ◦ geographical objects

21 CONCEPTS OF A LATTICE BUILT FROM PARAGRAPHS AND NAMED ENTITIES

CONCLUDING REMARKS

• several techniques have been proposed to build a context over a text corpus
• frequent words made it possible to identify which questions foreign leaders raise most frequently regarding Russia
• latent topic modeling made it possible to specify and describe these issues more thoroughly
• named entities would be more informative when used in the context of latent topics
• the text corpus should be expanded

Thank you!