76
UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission Joint Research Centre (JRC) – Italy http:// langtech .jrc.it/ http://press.jrc.it/NewsExplorer

UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

Embed Size (px)

Citation preview

Page 1: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 1

Language Technology, JRC, European Commission

Multilingual Information Extraction

Bruno Pouliquen

European Commission – Joint Research Centre (JRC) – Italy

http://langtech.jrc.it/ http://press.jrc.it/NewsExplorer

Page 2: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 2

Joint Research Centre, IPSC, SES, langtech

• Joint Research Centre Mission: provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. (As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.)

• Institute for the Protection and Security of the Citizen• Support for external Security (unit)• Web and language technology (action)

Page 3: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 3

Agenda

• Motivation • EU: need for multilinguality• Enhanced Information Extraction by combining information from texts in various languages• Applications (analysing News, web mining, parliament texts…)

• Overview of our approach• Description of the components, challenges

• IE: Locations• IE: Person and Organisation names• Categorisation into Subject domains (Eurovoc classes)• Clustering of news• Linking clusters over time and languages• Others: event extraction, quotations, medical-concept indexation etc.

• Future work and Conclusions

Page 4: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 4

Motivation for cross-lingual document linking

• EU: need for multilinguality• 23 official languages (since 1st Jan 2007:Irish, Romania, Bulgarian)• 253 language pairs• + interest in Arabic, Farsi, Russian…

• Enhanced Information Extraction by combining information from texts in various languages

• Applications• NewsExplorer

• Based on Europe Media Monitor (collecting news in 30 languages from 1000 news web site)

• Analysing News in 18 languages (ar, bg, da, de, en, es, et, fa, fr, it, nl, pl, pt, ro, ru, sl, sv, tr)

• OSInt• Combines internet searches and entity extraction

• Medisys (medical intelligence)

Page 5: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 5

NewsExplorer

Page 6: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 6

OSInt

Page 7: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 7

Medisys

Page 8: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 8

Agenda

• Motivation • EU: need for multilinguality• Enhanced Information Extraction by combining information from texts in various languages• Applications (analysing News, web mining, parliament texts…)

• Overview of our approach• Description of the components, challenges

• IE: Locations• IE: Person and Organisation names• Categorisation into Subject domains (Eurovoc classes)• Clustering of news• Linking clusters over time and languages• Others: event extraction, quotations, medical-concept indexation etc.

• Future work and Conclusions

Page 9: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 9

• Map documents onto multilingual thesauri, nomenclatures, gazetteers, …(general, medical, technical, geographical, …)

• Identify and normalise language-independent text features • Numbers; dates; currency expressions; measurement expressions; …

le treize août 2007, am 13.8.2007, by 13/08/07 DATE_2007_08_13

• Names of persons, organisations, …

Muqtada al-Sadr, Mouqtada Sadr, Mokdata Sadr, Muqtadā aş-Şadr, Муктада Садр

ID=236

• Calculate the various vector similarities and add up• For details, see Steinberger et al. (2004, Informatica 28).

Approach to CLDS in a nutshell

Page 10: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 10

Geo-coding multilingual text

• Place names cannot be recognised by looking for patterns in text (Gey 2000) Multilingual gazetteer needed

‘Санкт-Петербург’, ‘Saint Petersburg’, ‘Saint Pétersbourg’, ‘Leningrad’, ‘Petrograd’, …

• Place name recognition via the lookup of text words in the gazetteer

Latitude / Longitude

Page 11: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 11

Major challenges for geocoding (1)

1. Places homographic with common words

Geo-stopwords

2. Places homographic with person names

Location only if not part of a person namee.g. ‘Kofi Annan’, ‘Annan’

Page 12: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 12

Major challenges for geocoding (2)

3. Completeness of the gazetteer: exonyms, endonyms, …‘Санкт-Петербург’, ‘Saint Petersburg’, ‘Saint Pétersbourg’, ‘Leningrad’, ‘Petrograd’, …

4. Inflection• Romanian: Parisului (of Paris)• Estonian: Londonit (London),

New Yorgile (New York) • Arabic: (the Paris inhabitants)

Usage of context-sensitive suffix lists and replacement rules to generate all variants(Romanian example)

Paris(ul|ului)? / Români(a|e)

[albaRiziu:n] الباريسون

Page 13: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 13

Major challenges for geocoding (3)

5. Homographic place names

Use size class information

Use country context (news source; unambiguous anchors)

Use kilometric distance E.g. “from Warsaw to Brest”

Brest (France): 2000 km from Warsaw Brest (Belarus): 200 km from Warsaw

Place name Number of cities with this nameAleksandrovka 244

...Washington 32London 18Berlin 15Paris 15

Page 14: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 14

Agenda

• Motivation • EU: need for multilinguality• Enhanced Information Extraction by combining information from texts in various languages• Applications (analysing News, web mining, parliament texts…)

• Overview of our approach• Description of the components, challenges

• IE: Locations• IE: Person and Organisation names• Categorisation into Subject domains (Eurovoc classes)• Clustering of news• Linking clusters over time and languages• Others: event extraction, quotations, medical-concept indexation etc.

• Future work and Conclusions

Page 15: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 15

Multilingual name recognition and variant merging

asesinato del exprimer ministro Rafic al-Hariri, que la oposición atribuyóes

l'assassinat de l'ex-dirigeant Rafic Hariri et le départ du chef de la diplomfr

na de moord op oud-premier Rafiq al-Hariri gingen gisteren bijna eennl

libanesischen Regierungschef Rafik Hariri vor einem Monat wichtige Bde

danjega libanonskega premiera Rafika Haririja. Libanonska opozicija sisl

möödumisele ekspeaminister Rafik al-Hariri surma põhjustanud pommiplet

death of former Prime Minister Rafik Hariri, blamed by many oppositionen

السابق اغتيال الوزراء الحريري رئيس سابقا رفيق حدث وما يهودية arبأياد

Бывший премьер-министр Ливана Рафик Харири, который ru

Page 16: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 16

Recognition of person names

• Lookup of known names from database • Currently over 600,000 names but about 50.000 are really looked for (+ 135.000 variants)• Insertion of about 1000 new names per day• Pre-generate morphological variants (Slovene example):

Tony(a|o|u|om|em|m|ju|jem|ja)?\s+Blair(a|o|u|om|em|m|ju|jem|ja)

• Guessing names using empirically-derived lexical patterns• Trigger word(s) + Name Surname

• President, Minister, Head of State, Sir, American• “death of”, “[0-9]+-year-old”, …• Known first names (John, Jean, Hans, Giovanni, Johan, …)• …• Combinations: “56-year-old former prime minister Kurmanbek Bakiyev”

Page 17: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 17

Person name recognition – dealing with inflection

• Recognition of person names, using regular expressions (Slovene example):

• kandidat(a|u|om)?• legend(a|e|i|o)• milijarder(ja|ju|jem)?• predsednik(a|u|om|em)?• predsednic(a|e|i|o)• ministric(a|e|i|o)• sekretar(ja|ju|jom|jem)?• diktator(ja|ju|jem)? • playboy(a|u|om|em)?

+ uppercase words

… verskega voditelja Moktade al Sadra je z notranjim … = Muqtada al-Sadr (ID=236)

Page 18: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 18

Learning name variants

• For all new names found: apply approximate name matching• Slow method:

• Based on sets of letter bigrams and letter trigrams• Merge two names if cosine similarity is > 70%

• Fast method:• Pre-select merging candidate using a “normalised form”

• Collect variants automatically from Wikipedia• Cross-script name matching

(Cyrillic, Greek, Arabic + Farsi, Hindi)

Name Normalised form

Malik Saïdoullaïev, Malik Sayduláyevmlk sdlv (malik saidulaiev)

Mahmoud Ahmadinejad, Mahmūd Ahmadīnežād

mhmd hmdnjd (mahmud ahmadinejad)

Сергей Куприянов, Sergei Kupriyanov, Sergei Kuprianow, Sergueï Kouprianov

srg kprnv (sergei kuprianov)

Ban Ki-moon, Ban Ki Moon, Пан Ги Мун bn k mn (ban ki mun)

Page 19: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 19

Person name recognition – Result

• Frequency list of normalised person names found(numerical person ID)

• Together with the “trigger words” associated

Page 20: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 20

Agenda

• Motivation • EU: need for multilinguality• Enhanced Information Extraction by combining information from texts in various languages• Applications (analysing News, web mining, parliament texts…)

• Overview of our approach• Description of the components, challenges

• IE: Locations• IE: Person and Organisation names• Categorisation into Subject domains (Eurovoc classes)• Clustering of news• Linking clusters over time and languages• Others: event extraction, quotations, medical-concept indexation etc.

• Future work and Conclusions

Page 21: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 21

• Over 6000 classes• Covering many different subject domains (wide coverage)• Multilingual (over 20 languages, one-to-one translations) • Developed by the European Parliament and others• Hierarchically organised

Eurovoc Thesaurushttp://europa.eu.int/celex/eurovoc

Page 22: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 22

Eurovoc categorisation – Major challenges

• Eurovoc is a conceptual thesaurus

categorisation vs. term extraction

• Large number of classes (~ 6000)

• Very unevenly distributed

• Various text types (heterogeneous training set)

• Multi-label categorisation (both for training and assignment)

E.g.• SPORT• PROTECTION OF MINORITIES• CONSTRUCTION AND TOWN PLANNING• RADIOACTIVE MATERIALS

Page 23: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 23

Eurovoc categorisation – Approach

• Profile-based, category ranking task• Training: Identification of most significant words for each class• Assignment: combination of measures to calculate similarity between profiles and new document

• Empirical refinement of parameter settings• Training:

• Stop words• Lemmatisation• Multi-word terms• Consider number of classes of each training document• Thresholds for training document length and number of training documents per class• Methods to determine significant words per document (log-likelihood vs. chi-square, etc.)• Choice of reference corpus• …

• Assignment:• Selection and combination of similarity measures (cosine, okapi, …)• ...

For details, see Pouliquen et al. (Eurolan 2003)

Page 24: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 24

Assignment Result (Example)

Title: Legislative resolution embodying Parliament's opinion on the proposal for a Council Regulation amending Regulation No 2847/93 establishing a control system applicable to the

common fisheries policy (COM(95)0256 - C4-0272/95 - 95/ 0146(CNS)) (Consultation procedure)

Page 25: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 25

The ‘JRC-Acquis’ parallel corpus in 21 languages

• Freely available for research purposes on our web site: http://langtech.jrc.it/JRC-Acquis.html• For details, see Steinberger et al. (2006, LREC)

• Average of 8.8 Million words per language• Pair-wise alignment for all 210 language pairs! • Average of 7600 documents per language• Most documents have been Eurovoc-classified manually• Incoming new version soon (about twice the same size) useful for:

1. Training of multilingual subject domain classifiers.2. Creation of multilingual lexical space (LSA, KCCA)3. Training of automatic systems for Statistical Machine Translation.4. Producing multilingual lexical or semantic resources such as dictionaries or ontologies.5. Training and testing multilingual information extraction software.6. Automatic translation consistency checking.7. Testing and benchmarking alignment software (sentences, words, etc.), across a larger variety of language pairs.

8. All types of multilingual and cross-lingual research.

Page 26: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 26

Agenda

• Motivation • EU: need for multilinguality• Enhanced Information Extraction by combining information from texts in various languages• Applications (analysing News, web mining, parliament texts…)

• Overview of our approach• Description of the components, challenges

• IE: Locations• IE: Person and Organisation names• Categorisation into Subject domains (Eurovoc classes)• Clustering of news• Linking clusters over time and languages• Others: event extraction, quotations, medical-concept indexation etc.

• Future work and Conclusions

Page 27: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 27

Monolingual document representation

• Vector of keywords and their keyness using log-likelihood test (Dunning 1993)

Keyness Keyword 109.24 jackson 41.54 neverland 37.93 santa 32.61 molestation 24.51 boy 24.43 pop 20.68 documentary 18.79 accuser 13.59 courthouse 11.12 jury 10.08 ranch 9.60 california

Keyness Keyword 9.39 verdict 7.56 testimony 6.50 maria 4.09 michael 1.73 reached 1.68 ap 1.05 appeared 0.53 child 0.50 trial 0.45 monday 0.26 children 0.09 family

“Michael Jackson Jury Reaches Verdicts”

Original cluster

Page 28: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 28

Calculation of a text’s Country Score

• Aim: show to what extent a text talks about a certain country• Sum of references to a country, normalised using the log-likelihood test• Add country score vector to keyword vector

Keyness Keyword 109.2478 jackson 41.5450 neverland 37.9347 santa 32.6105 molestation 24.5193 boy 24.4351 pop 20.6824

documentary 18.7973 accuser 13.5945 courthouse 11.1224 jury 10.4184 *us* 10.0838 ranch 9.6021 california 9.3905 verdict

KeynessKeyword

7.5620 testimony 6.5014 maria 4.0957 michael 1.7368 reached 1.6857 ap 1.5610 *gb* 1.5610 *il* 1.5610 *br* 1.0520 appeared 0.5384 child 0.5045 trial 0.4502 monday 0.2647 children 0.0946 family

Page 29: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 29

• Input: Vectors consisting of keywords and country score• Similarity measure: cosine• Method: Bottom-up group average unsupervised clustering• Build the binary hierarchical clustering tree (dendrogram)

• Retain only “big” nodes in the tree with a high cohesion (empirically refined minimum intra-node similarity: 45%)

• Use the title of the cluster’s medoid as the cluster title• For details, see Pouliquen et al. (CoLing 2004)

Monolingual news clustering

87%76%

40%

20%

10%

97%

45%

Page 30: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 30

Agenda

• Motivation • EU: need for multilinguality• Enhanced Information Extraction by combining information from texts in various languages• Applications (analysing News, web mining, parliament texts…)

• Overview of our approach• Description of the components, challenges

• IE: Locations• IE: Person and Organisation names• Categorisation into Subject domains (Eurovoc classes)• Clustering of news• Linking clusters over time and languages• Others: event extraction, quotations, medical-concept indexation etc.

• Future work and Conclusions

Page 31: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 31

Cross-lingual cluster linking – combination of 4 ingredients

CLDS = α∙S1 + β∙S2 + γ∙S3 + δ∙S4

• Ranked list of Eurovoc classes (40%)• Country score (30%)• Names + frequency (20%)• Monolingual cluster representation without country score (10%)

+ + +

Page 32: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 32

example

46%

Page 33: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 33

Cross-lingual cluster linking – evaluation

• Evaluation results depending on similarity threshold• Ingredients: 40/30/30 (names not yet considered)• Evaluation for EN FR and EN IT (136 EN clusters)

* Recall at 15% similarity threshold = 100%

• For details, see Pouliquen et al. (CoLing 2004)

Page 34: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 34

Filter out bad links by exploiting all cross-lingual links

Assumption: If EN is linked to FR, ES, IT, … FR should also be linked to ES, IT, ...If not: lower link likelihood

Page 35: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 35

Agenda

• Other applications• automatic detection of man made violent events• Social networks and relation extraction• Quotation extraction• Medisys: extraction of medical concepts (MeSH thesaurus)

from health news

Page 36: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 36

Automatic detection of man made violent events

• NEXUS News cluster Event eXtraction Using language Structures

• 3415 templates in total• Processes the news clusters for one day for about 30 minutes• Detects about 10-15 security-related events per day

Page 37: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 37

NEXUS: Learning Templates by Bootstrapping - Example

• We learned core templates like [action killed] [person X] and others• In some of the news clusters we match a text like

“U.S. troops killed Al Zarkawi”• Al Zarkawi is taken as an anchor entity and we consider the contexts in which this name appears in

the same news cluster where the above mentioned text appears• For example: “the body of Al Zarkawi was identified”, “the body of Al Zarkawi was found”, etc.• All these contexts from different anchor entities are passed to a pattern learning algorithm to extract

candidate patterns (e.g. the body of X)

Page 38: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 38

NEXUS: Event Structure

Fields Semantics

Date This is always the date of the cluster, though in practice sometimes we have back references to security related events in the past

Location Country and (if possible to identify) city , town, area. Sometimes differ from the cluster main geo-coding.

Cluster information Cluster code, title of the cluster, list of cluster article IDs, alerts

Number of dead and wounded These numbers are estimated from the cluster articles using special-purpose build damage-estimation algorithm.

Dead persons People who were killed in the event

Wounded persons People who were injured

Kidnapped persons People who were abducted

Actors These are the killers, kidnappers, or persons and organizations which are likely to have initiated a violent act.

Actions This field contains description of the violent act. This still looks like a list of keywords and key phrases

Article text Fragments from the articles referring to the main event. (Only titles and first sentence)

Page 39: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 39

Knowledge base

Benedict XVI

Photograph Wikipedia entries 1.10 Clusters Titles Name variants Associated keywords Associated countries Associated names (sorted by frequency) Weighted associated names

Page 40: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 40

Social networks

• Based on co-occurrences• Result of machine learning approach

Page 41: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 41

Associated names: weighting formula

).ln(1

12)ln(

2121

21

2121

,,,

eeee

eeeeee AACC

CCw

Weighting depending on the total number of persons they are associated with

Number of clusters they appear together

Weighting depending on the total number of clusters they appear in

Page 42: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 42

Associated names example

Javier Solana

Solana’s spokespersonSolana’s assistant

Without weighting

With weight

(European Union's currentForeign Secretary)

Page 43: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 43

Social network visualisation: interactive

Page 44: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 44

Social network visualisation, summarising relations around a person

Page 45: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 45

Relation Extraction

• We parsed the news articles in a 6 month span using a full dependency parser (MiniPar) and represented the corpus via a model called Syntactic Network (Tanev, Magnini, 2006)

• Beginning with a set of hand coded syntactic patterns (e.g. “X <-subj- contact -obj-> Y”) we capture anchor entities which fill the variable slots

• On the next step using these anchor entities we learn new syntactic templates (e.g. “X <-subj- meet -obj-> Y”)

• These two steps repeat until no more templates can be learned

Page 46: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 46

Relation Extraction

• Using the bootstrapping algorithm described above and the Syntactic Network, we extracted instances of the “contact”, “support”, “family”, and ”criticize” relations.

• We captured in total about 9000 occurrences of these relations in the corpus

Page 47: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 47

Quotations

e.g. -- Vi försökte uppmuntra samverkan, säger Urban Lundmark.

Person nameverb

Quotation mark

Page 48: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 48

Medisys

Page 49: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 49

Medisys: extraction of medical concepts (MeSH thesaurus) from health news

• MeSH extractor (en, es, de, fr, it, pt) • Geocoding (allows for region-level analysis)• Highlight medical terms in text• NewsExplorer-like medical application

Page 50: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 50

Agenda

• Future work and Conclusions

Page 51: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 51

Planned work (1)

1. Add more languages (analysis and cross-lingual linking)

2. Add more information facets for cross-lingual cluster linking• Dates• Professions• Expressions of measurement• Vehicles• ...

3. Empirically optimise weighting of ingredientsCLDS = α∙S1 + β∙S2 + γ∙S3 + δ∙S4 + …

Page 52: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 52

Planned work (2)

4. Link longer-running stories to other languages (currently, only individual clusters are linked)

5. Categorise persons (governmental, military, religious, civil, …)6. Categorise relations between persons

• Family relation• Contact/Meeting information• ORG-Membership relation• …

• …

Page 53: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 53

Cross-lingual GlossingЮрий Ехануров сменил на посту премьер-министр Украины Юлию Тимошенко, правительство которой было отправлено президентом Ющенко в отставку 8 сентября

Юрий Ехануров [Yuriy Yekhanurov] сменил [replaced] на посту премьер-министр [prime minister] Украины [Ukrainian] Юлию Тимошенко [Yulia Tymoshenko] , правительство [government] которой было отправлено [replaced] президентом [president] Виктор Ющенко [Viktor Yushchenko ] в отставку 8 сентября [08/09/2005]

Viktor Yushchenko Key Titles and Phrasesprésident ukrainien (fr-75)ukrainian president (en-67)president (en,sv - 182)präsident (de - 124)presidente ucraino (it - 16)président (fr - 65)predsednik (sl - 14)presidente (es,it - 28)

Yulia Tymoshenko

Key Titles and Phrasesministerpräsidentin (de-32)regierungschefin (de-23)prime minister (en-36)premier ministre (fr-26)

Yuriy YekhanurovKey Titles and Phrasesministerpräsidenten (de-17)prime minister (en - 13)ally (en - 8)Candidat (fr, 6)57 (en - 2)premier (en,it - 2)ministro de economía (es-2)

Resignation (dismissal) of Yulia Timoshenko’s government and her replacement by Yuriy Yekhanurov

Resignation (dismissal) of Yulia Timoshenko’s government and her replacement by Yuriy Yekhanurov

Page 54: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 54

Conclusion

• Cross-lingual linking of documents/clusters via language-independent representations is feasible.

• More information can be extracted when using texts written in several languages.

• Simple means can take you a long way• JRC effort to add a new language is 1 - 6 months

• Simple lexical patterns• Simple morphological suffix treatment

• Clustering and categorisation• Language-independent statistics and heuristics.

• Monolingual analysis is sufficient; no language pair-specific information is needed.

Page 55: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 55

NewsExplorer – Multilingual News Analysis

with Cross-lingual Linking

Machine Learning for Multilingual Information Access

NIPS Workshop -- Whistler, Canada, 10 December 2006

Ralf Steinberger, Bruno Pouliquen, Camelia Ignat

European Commission – Joint Research Centre (JRC)

http://langtech.jrc.it/ http://press.jrc.it/NewsExplorer

Page 56: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 56

Cross-lingual linking (instead of live demo) (2)

Page 57: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 57

Cross-lingual linking (instead of live demo) (3)

Page 58: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 58

Time-linked clusters (‘Stories’)http://press.jrc.it/NewsExplorer/storyindex/en/year.html

Page 59: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 59

Alexander Litvinenko

Page 60: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 60

Alexander Litvinenko (2)

Page 61: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 61

Alexander Litvinenko (3)

Page 62: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 62

Normalisation of name orthography

• Latin normalisation: Malik Saïdoullaïev

• accented character non-accented equivalent Malik Saidoullaiev

• double consonant single consonant Malik Saidoulaiev

• ou u Malik Saidulaiev

• wl (beginning of name) vl• ow (end of name) ov• ck k• ph f• ž j• š sh

Page 63: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 63

Transliteration

• Transliterate each character, or sequence of characters, by a Latin correspondent• ψ => ps• λ => l• μπ => b• Allow specific common transliterations Джордж [DJORDJ] => "George“, Джеймс => "James", • Examples of transliterations:

• Κόφι Ανάν, Greek Kofi Anan• Кофи Аннан, Russian Kofi Anan• Кофи Анан, Bulgarian Kofi Anan• عنان Arabic Kufi Anan ,كوفي• को�फी� अन्ना�न, Hindi Kofi Anan

Page 64: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 64

Fuzzy matching

Valéry Giscard d’Estaing Валери Жискар д'Эстен

valeri giskar destenvaleri giscard destaing

trigram overlap

bigram overlap

normalisation transliteration

_vvaalleerrii__ggiisscca…

_vvaalleerrii__ggiisskka…

_vavalalelereriri_i_g_gigisiscscacar…

_vavalalelereriri_i_g_gigisiskskakar…

First formula: average of bigram and trigram

First formula: average of bigram and trigram

Page 65: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 65

Fuzzy matching

Rafiq al-Hariri الحريري رفيق

trigram overlap

bigram overlap

rfk l hrr

consonants

Edit distance

rfk lhrr

consonants

normalisation

rafik al hariri

transliteration

rfik alhrir

Final formula: average of bigram, trigram and edit distance

Page 66: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 66

The EMM news data

• JRC’s Europe Media Monitor system• ~ 30,000 articles per day• 30 languages• ~ 1000 media news sources• Updated every 10 minutes• News in UTF8-encoded RSS format (XML)

• EMM available at http://press.jrc.it/• Breaking news• Topic-specific news (EU Commissioners,

EU Constitution, GMO, natural disasters, communicable diseases, …)

• Email and SMS alerts (ca. 10,000 per day)

live

Page 67: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 67

Evaluation of geocoding: results

Disambiguation technique F-measure

None 55.4%

Geo-context (country in the text) 66.7%

Importance of the place 75.9%

Minimum kilometric distance 68.7%

Filtering by person names 64.2%

Geo Stop list 61.1%

All techniques 77 %

Language Precision Recall F-measure

German 0.80 0.68 0.73

English 0.91 0.78 0.84

Spanish 0.75 0.80 0.76

French 0.68 0.74 0.70

Italian 0.70 0.85 0.76

Average 0.77 0.77 0.77

By disambiguation technique

By language

Difficult test set: Pouliquen et al. (2004) on the same test set: F = 0.38Same algorithm on 48 average English news texts: F = 0.94

Page 68: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 68

Evaluation of person name recognition

Evaluation of person name recognition in various languages. The number of rules (i.e. trigger words) gives an idea of the expected coverage for this language. The third and fourth columns show the size of the test set (number of texts, number of manually identified person names).

Page 69: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

69

DG JRC – slides references

join

t re

sear

ch c

entr

eEuropean Commission

Future work on NewsExplorerStory trackingBruno Pouliquen

Page 70: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 70

Future work

• Live clusters• Add new articles to existing clusters• Every 10 minutes export current clusters

• Stories• Each cluster is compared with previous 7 days• Put together similar clusters in “stories”

Page 71: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 71

Page 72: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 72

Linking clusters over time

Blair We'll hold Iraq nerve, says Blair

Baghdad security plan 'failing'

Blair: Troop rile Iraq rebels

British PM Nixes Withdrawal From Iraq

Iraq attacks kill nine US troops91 Die in

Sectarian Violence in Iraq

Day 0 Day 1 Day 2 Day 4

Page 73: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 73

Interlinked clusters

riots

Election results

Violence during elections

3 civils killed

violence

elections

Elected president

Election frauds

Page 74: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 74

Converging stories 2=>1

riots

Election results

Violence during elections

3 civils killed

violence

elections

Elect.d+6

Election frauds

…Elect.d+1

Page 75: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 75

Diverging stories 1=>2

riots

Election results

Election frauds

…Fraud d+1

Fraud d+6

…Riots d+1

Riots d+6

Election polls

Page 76: UkVisit’2006, Slide 1 Language Technology, JRC, European Commission Multilingual Information Extraction Bruno Pouliquen European Commission – J oint R

UkVisit’2006, Slide 76

Results of automatic evaluation across languages (F1 per document at rank=6)

0

10

20

30

40

50

60

En Es Da De Fi Fr It Nl Pt Sv Li Hu

With pre-processing (French=> only stop words)Without pre-processing

Human evaluation (correct descriptors, compared to inter-annotator agreement):

English: 83%

Spanish: 80%