Upload
margery-banks
View
222
Download
0
Tags:
Embed Size (px)
Citation preview
UkVisit’2006, Slide 1
Language Technology, JRC, European Commission
Multilingual Information Extraction
Bruno Pouliquen
European Commission – Joint Research Centre (JRC) – Italy
http://langtech.jrc.it/ http://press.jrc.it/NewsExplorer
UkVisit’2006, Slide 2
Joint Research Centre, IPSC, SES, langtech
• Joint Research Centre Mission: provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. (As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.)
• Institute for the Protection and Security of the Citizen• Support for external Security (unit)• Web and language technology (action)
UkVisit’2006, Slide 3
Agenda
• Motivation • EU: need for multilinguality• Enhanced Information Extraction by combining information from texts in various languages• Applications (analysing News, web mining, parliament texts…)
• Overview of our approach• Description of the components, challenges
• IE: Locations• IE: Person and Organisation names• Categorisation into Subject domains (Eurovoc classes)• Clustering of news• Linking clusters over time and languages• Others: event extraction, quotations, medical-concept indexation etc.
• Future work and Conclusions
UkVisit’2006, Slide 4
Motivation for cross-lingual document linking
• EU: need for multilinguality• 23 official languages (since 1st Jan 2007:Irish, Romania, Bulgarian)• 253 language pairs• + interest in Arabic, Farsi, Russian…
• Enhanced Information Extraction by combining information from texts in various languages
• Applications• NewsExplorer
• Based on Europe Media Monitor (collecting news in 30 languages from 1000 news web site)
• Analysing News in 18 languages (ar, bg, da, de, en, es, et, fa, fr, it, nl, pl, pt, ro, ru, sl, sv, tr)
• OSInt• Combines internet searches and entity extraction
• Medisys (medical intelligence)
UkVisit’2006, Slide 5
NewsExplorer
UkVisit’2006, Slide 6
OSInt
UkVisit’2006, Slide 7
Medisys
UkVisit’2006, Slide 8
Agenda
• Motivation • EU: need for multilinguality• Enhanced Information Extraction by combining information from texts in various languages• Applications (analysing News, web mining, parliament texts…)
• Overview of our approach• Description of the components, challenges
• IE: Locations• IE: Person and Organisation names• Categorisation into Subject domains (Eurovoc classes)• Clustering of news• Linking clusters over time and languages• Others: event extraction, quotations, medical-concept indexation etc.
• Future work and Conclusions
UkVisit’2006, Slide 9
• Map documents onto multilingual thesauri, nomenclatures, gazetteers, …(general, medical, technical, geographical, …)
• Identify and normalise language-independent text features • Numbers; dates; currency expressions; measurement expressions; …
le treize août 2007, am 13.8.2007, by 13/08/07 DATE_2007_08_13
• Names of persons, organisations, …
Muqtada al-Sadr, Mouqtada Sadr, Mokdata Sadr, Muqtadā aş-Şadr, Муктада Садр
ID=236
• Calculate the various vector similarities and add up• For details, see Steinberger et al. (2004, Informatica 28).
Approach to CLDS in a nutshell
UkVisit’2006, Slide 10
Geo-coding multilingual text
• Place names cannot be recognised by looking for patterns in text (Gey 2000) Multilingual gazetteer needed
‘Санкт-Петербург’, ‘Saint Petersburg’, ‘Saint Pétersbourg’, ‘Leningrad’, ‘Petrograd’, …
• Place name recognition via the lookup of text words in the gazetteer
Latitude / Longitude
UkVisit’2006, Slide 11
Major challenges for geocoding (1)
1. Places homographic with common words
Geo-stopwords
2. Places homographic with person names
Location only if not part of a person namee.g. ‘Kofi Annan’, ‘Annan’
UkVisit’2006, Slide 12
Major challenges for geocoding (2)
3. Completeness of the gazetteer: exonyms, endonyms, …‘Санкт-Петербург’, ‘Saint Petersburg’, ‘Saint Pétersbourg’, ‘Leningrad’, ‘Petrograd’, …
4. Inflection• Romanian: Parisului (of Paris)• Estonian: Londonit (London),
New Yorgile (New York) • Arabic: (the Paris inhabitants)
Usage of context-sensitive suffix lists and replacement rules to generate all variants(Romanian example)
Paris(ul|ului)? / Români(a|e)
[albaRiziu:n] الباريسون
UkVisit’2006, Slide 13
Major challenges for geocoding (3)
5. Homographic place names
Use size class information
Use country context (news source; unambiguous anchors)
Use kilometric distance E.g. “from Warsaw to Brest”
Brest (France): 2000 km from Warsaw Brest (Belarus): 200 km from Warsaw
Place name Number of cities with this nameAleksandrovka 244
...Washington 32London 18Berlin 15Paris 15
UkVisit’2006, Slide 14
Agenda
• Motivation • EU: need for multilinguality• Enhanced Information Extraction by combining information from texts in various languages• Applications (analysing News, web mining, parliament texts…)
• Overview of our approach• Description of the components, challenges
• IE: Locations• IE: Person and Organisation names• Categorisation into Subject domains (Eurovoc classes)• Clustering of news• Linking clusters over time and languages• Others: event extraction, quotations, medical-concept indexation etc.
• Future work and Conclusions
UkVisit’2006, Slide 15
Multilingual name recognition and variant merging
asesinato del exprimer ministro Rafic al-Hariri, que la oposición atribuyóes
l'assassinat de l'ex-dirigeant Rafic Hariri et le départ du chef de la diplomfr
na de moord op oud-premier Rafiq al-Hariri gingen gisteren bijna eennl
libanesischen Regierungschef Rafik Hariri vor einem Monat wichtige Bde
danjega libanonskega premiera Rafika Haririja. Libanonska opozicija sisl
möödumisele ekspeaminister Rafik al-Hariri surma põhjustanud pommiplet
death of former Prime Minister Rafik Hariri, blamed by many oppositionen
السابق اغتيال الوزراء الحريري رئيس سابقا رفيق حدث وما يهودية arبأياد
Бывший премьер-министр Ливана Рафик Харири, который ru
UkVisit’2006, Slide 16
Recognition of person names
• Lookup of known names from database • Currently over 600,000 names but about 50.000 are really looked for (+ 135.000 variants)• Insertion of about 1000 new names per day• Pre-generate morphological variants (Slovene example):
Tony(a|o|u|om|em|m|ju|jem|ja)?\s+Blair(a|o|u|om|em|m|ju|jem|ja)
• Guessing names using empirically-derived lexical patterns• Trigger word(s) + Name Surname
• President, Minister, Head of State, Sir, American• “death of”, “[0-9]+-year-old”, …• Known first names (John, Jean, Hans, Giovanni, Johan, …)• …• Combinations: “56-year-old former prime minister Kurmanbek Bakiyev”
UkVisit’2006, Slide 17
Person name recognition – dealing with inflection
• Recognition of person names, using regular expressions (Slovene example):
• kandidat(a|u|om)?• legend(a|e|i|o)• milijarder(ja|ju|jem)?• predsednik(a|u|om|em)?• predsednic(a|e|i|o)• ministric(a|e|i|o)• sekretar(ja|ju|jom|jem)?• diktator(ja|ju|jem)? • playboy(a|u|om|em)?
+ uppercase words
… verskega voditelja Moktade al Sadra je z notranjim … = Muqtada al-Sadr (ID=236)
UkVisit’2006, Slide 18
Learning name variants
• For all new names found: apply approximate name matching• Slow method:
• Based on sets of letter bigrams and letter trigrams• Merge two names if cosine similarity is > 70%
• Fast method:• Pre-select merging candidate using a “normalised form”
• Collect variants automatically from Wikipedia• Cross-script name matching
(Cyrillic, Greek, Arabic + Farsi, Hindi)
Name Normalised form
Malik Saïdoullaïev, Malik Sayduláyevmlk sdlv (malik saidulaiev)
Mahmoud Ahmadinejad, Mahmūd Ahmadīnežād
mhmd hmdnjd (mahmud ahmadinejad)
Сергей Куприянов, Sergei Kupriyanov, Sergei Kuprianow, Sergueï Kouprianov
srg kprnv (sergei kuprianov)
Ban Ki-moon, Ban Ki Moon, Пан Ги Мун bn k mn (ban ki mun)
UkVisit’2006, Slide 19
Person name recognition – Result
• Frequency list of normalised person names found(numerical person ID)
• Together with the “trigger words” associated
UkVisit’2006, Slide 20
Agenda
• Motivation • EU: need for multilinguality• Enhanced Information Extraction by combining information from texts in various languages• Applications (analysing News, web mining, parliament texts…)
• Overview of our approach• Description of the components, challenges
• IE: Locations• IE: Person and Organisation names• Categorisation into Subject domains (Eurovoc classes)• Clustering of news• Linking clusters over time and languages• Others: event extraction, quotations, medical-concept indexation etc.
• Future work and Conclusions
UkVisit’2006, Slide 21
• Over 6000 classes• Covering many different subject domains (wide coverage)• Multilingual (over 20 languages, one-to-one translations) • Developed by the European Parliament and others• Hierarchically organised
Eurovoc Thesaurushttp://europa.eu.int/celex/eurovoc
UkVisit’2006, Slide 22
Eurovoc categorisation – Major challenges
• Eurovoc is a conceptual thesaurus
categorisation vs. term extraction
• Large number of classes (~ 6000)
• Very unevenly distributed
• Various text types (heterogeneous training set)
• Multi-label categorisation (both for training and assignment)
E.g.• SPORT• PROTECTION OF MINORITIES• CONSTRUCTION AND TOWN PLANNING• RADIOACTIVE MATERIALS
UkVisit’2006, Slide 23
Eurovoc categorisation – Approach
• Profile-based, category ranking task• Training: Identification of most significant words for each class• Assignment: combination of measures to calculate similarity between profiles and new document
• Empirical refinement of parameter settings• Training:
• Stop words• Lemmatisation• Multi-word terms• Consider number of classes of each training document• Thresholds for training document length and number of training documents per class• Methods to determine significant words per document (log-likelihood vs. chi-square, etc.)• Choice of reference corpus• …
• Assignment:• Selection and combination of similarity measures (cosine, okapi, …)• ...
For details, see Pouliquen et al. (Eurolan 2003)
UkVisit’2006, Slide 24
Assignment Result (Example)
Title: Legislative resolution embodying Parliament's opinion on the proposal for a Council Regulation amending Regulation No 2847/93 establishing a control system applicable to the
common fisheries policy (COM(95)0256 - C4-0272/95 - 95/ 0146(CNS)) (Consultation procedure)
UkVisit’2006, Slide 25
The ‘JRC-Acquis’ parallel corpus in 21 languages
• Freely available for research purposes on our web site: http://langtech.jrc.it/JRC-Acquis.html• For details, see Steinberger et al. (2006, LREC)
• Average of 8.8 Million words per language• Pair-wise alignment for all 210 language pairs! • Average of 7600 documents per language• Most documents have been Eurovoc-classified manually• Incoming new version soon (about twice the same size) useful for:
1. Training of multilingual subject domain classifiers.2. Creation of multilingual lexical space (LSA, KCCA)3. Training of automatic systems for Statistical Machine Translation.4. Producing multilingual lexical or semantic resources such as dictionaries or ontologies.5. Training and testing multilingual information extraction software.6. Automatic translation consistency checking.7. Testing and benchmarking alignment software (sentences, words, etc.), across a larger variety of language pairs.
8. All types of multilingual and cross-lingual research.
UkVisit’2006, Slide 26
Agenda
• Motivation • EU: need for multilinguality• Enhanced Information Extraction by combining information from texts in various languages• Applications (analysing News, web mining, parliament texts…)
• Overview of our approach• Description of the components, challenges
• IE: Locations• IE: Person and Organisation names• Categorisation into Subject domains (Eurovoc classes)• Clustering of news• Linking clusters over time and languages• Others: event extraction, quotations, medical-concept indexation etc.
• Future work and Conclusions
UkVisit’2006, Slide 27
Monolingual document representation
• Vector of keywords and their keyness using log-likelihood test (Dunning 1993)
Keyness Keyword 109.24 jackson 41.54 neverland 37.93 santa 32.61 molestation 24.51 boy 24.43 pop 20.68 documentary 18.79 accuser 13.59 courthouse 11.12 jury 10.08 ranch 9.60 california
Keyness Keyword 9.39 verdict 7.56 testimony 6.50 maria 4.09 michael 1.73 reached 1.68 ap 1.05 appeared 0.53 child 0.50 trial 0.45 monday 0.26 children 0.09 family
“Michael Jackson Jury Reaches Verdicts”
Original cluster
UkVisit’2006, Slide 28
Calculation of a text’s Country Score
• Aim: show to what extent a text talks about a certain country• Sum of references to a country, normalised using the log-likelihood test• Add country score vector to keyword vector
Keyness Keyword 109.2478 jackson 41.5450 neverland 37.9347 santa 32.6105 molestation 24.5193 boy 24.4351 pop 20.6824
documentary 18.7973 accuser 13.5945 courthouse 11.1224 jury 10.4184 *us* 10.0838 ranch 9.6021 california 9.3905 verdict
KeynessKeyword
7.5620 testimony 6.5014 maria 4.0957 michael 1.7368 reached 1.6857 ap 1.5610 *gb* 1.5610 *il* 1.5610 *br* 1.0520 appeared 0.5384 child 0.5045 trial 0.4502 monday 0.2647 children 0.0946 family
UkVisit’2006, Slide 29
• Input: Vectors consisting of keywords and country score• Similarity measure: cosine• Method: Bottom-up group average unsupervised clustering• Build the binary hierarchical clustering tree (dendrogram)
• Retain only “big” nodes in the tree with a high cohesion (empirically refined minimum intra-node similarity: 45%)
• Use the title of the cluster’s medoid as the cluster title• For details, see Pouliquen et al. (CoLing 2004)
Monolingual news clustering
87%76%
40%
20%
10%
97%
45%
UkVisit’2006, Slide 30
Agenda
• Motivation • EU: need for multilinguality• Enhanced Information Extraction by combining information from texts in various languages• Applications (analysing News, web mining, parliament texts…)
• Overview of our approach• Description of the components, challenges
• IE: Locations• IE: Person and Organisation names• Categorisation into Subject domains (Eurovoc classes)• Clustering of news• Linking clusters over time and languages• Others: event extraction, quotations, medical-concept indexation etc.
• Future work and Conclusions
UkVisit’2006, Slide 31
Cross-lingual cluster linking – combination of 4 ingredients
CLDS = α∙S1 + β∙S2 + γ∙S3 + δ∙S4
• Ranked list of Eurovoc classes (40%)• Country score (30%)• Names + frequency (20%)• Monolingual cluster representation without country score (10%)
+ + +
UkVisit’2006, Slide 32
example
46%
UkVisit’2006, Slide 33
Cross-lingual cluster linking – evaluation
• Evaluation results depending on similarity threshold• Ingredients: 40/30/30 (names not yet considered)• Evaluation for EN FR and EN IT (136 EN clusters)
* Recall at 15% similarity threshold = 100%
• For details, see Pouliquen et al. (CoLing 2004)
UkVisit’2006, Slide 34
Filter out bad links by exploiting all cross-lingual links
Assumption: If EN is linked to FR, ES, IT, … FR should also be linked to ES, IT, ...If not: lower link likelihood
UkVisit’2006, Slide 35
Agenda
• Other applications• automatic detection of man made violent events• Social networks and relation extraction• Quotation extraction• Medisys: extraction of medical concepts (MeSH thesaurus)
from health news
UkVisit’2006, Slide 36
Automatic detection of man made violent events
• NEXUS News cluster Event eXtraction Using language Structures
• 3415 templates in total• Processes the news clusters for one day for about 30 minutes• Detects about 10-15 security-related events per day
UkVisit’2006, Slide 37
NEXUS: Learning Templates by Bootstrapping - Example
• We learned core templates like [action killed] [person X] and others• In some of the news clusters we match a text like
“U.S. troops killed Al Zarkawi”• Al Zarkawi is taken as an anchor entity and we consider the contexts in which this name appears in
the same news cluster where the above mentioned text appears• For example: “the body of Al Zarkawi was identified”, “the body of Al Zarkawi was found”, etc.• All these contexts from different anchor entities are passed to a pattern learning algorithm to extract
candidate patterns (e.g. the body of X)
UkVisit’2006, Slide 38
NEXUS: Event Structure
Fields Semantics
Date This is always the date of the cluster, though in practice sometimes we have back references to security related events in the past
Location Country and (if possible to identify) city , town, area. Sometimes differ from the cluster main geo-coding.
Cluster information Cluster code, title of the cluster, list of cluster article IDs, alerts
Number of dead and wounded These numbers are estimated from the cluster articles using special-purpose build damage-estimation algorithm.
Dead persons People who were killed in the event
Wounded persons People who were injured
Kidnapped persons People who were abducted
Actors These are the killers, kidnappers, or persons and organizations which are likely to have initiated a violent act.
Actions This field contains description of the violent act. This still looks like a list of keywords and key phrases
Article text Fragments from the articles referring to the main event. (Only titles and first sentence)
UkVisit’2006, Slide 39
Knowledge base
Benedict XVI
Photograph Wikipedia entries 1.10 Clusters Titles Name variants Associated keywords Associated countries Associated names (sorted by frequency) Weighted associated names
UkVisit’2006, Slide 40
Social networks
• Based on co-occurrences• Result of machine learning approach
UkVisit’2006, Slide 41
Associated names: weighting formula
).ln(1
12)ln(
2121
21
2121
,,,
eeee
eeeeee AACC
CCw
Weighting depending on the total number of persons they are associated with
Number of clusters they appear together
Weighting depending on the total number of clusters they appear in
UkVisit’2006, Slide 42
Associated names example
Javier Solana
Solana’s spokespersonSolana’s assistant
Without weighting
With weight
(European Union's currentForeign Secretary)
UkVisit’2006, Slide 43
Social network visualisation: interactive
UkVisit’2006, Slide 44
Social network visualisation, summarising relations around a person
UkVisit’2006, Slide 45
Relation Extraction
• We parsed the news articles in a 6 month span using a full dependency parser (MiniPar) and represented the corpus via a model called Syntactic Network (Tanev, Magnini, 2006)
• Beginning with a set of hand coded syntactic patterns (e.g. “X <-subj- contact -obj-> Y”) we capture anchor entities which fill the variable slots
• On the next step using these anchor entities we learn new syntactic templates (e.g. “X <-subj- meet -obj-> Y”)
• These two steps repeat until no more templates can be learned
UkVisit’2006, Slide 46
Relation Extraction
• Using the bootstrapping algorithm described above and the Syntactic Network, we extracted instances of the “contact”, “support”, “family”, and ”criticize” relations.
• We captured in total about 9000 occurrences of these relations in the corpus
UkVisit’2006, Slide 47
Quotations
e.g. -- Vi försökte uppmuntra samverkan, säger Urban Lundmark.
Person nameverb
Quotation mark
UkVisit’2006, Slide 48
Medisys
UkVisit’2006, Slide 49
Medisys: extraction of medical concepts (MeSH thesaurus) from health news
• MeSH extractor (en, es, de, fr, it, pt) • Geocoding (allows for region-level analysis)• Highlight medical terms in text• NewsExplorer-like medical application
UkVisit’2006, Slide 50
Agenda
• Future work and Conclusions
UkVisit’2006, Slide 51
Planned work (1)
1. Add more languages (analysis and cross-lingual linking)
2. Add more information facets for cross-lingual cluster linking• Dates• Professions• Expressions of measurement• Vehicles• ...
3. Empirically optimise weighting of ingredientsCLDS = α∙S1 + β∙S2 + γ∙S3 + δ∙S4 + …
UkVisit’2006, Slide 52
Planned work (2)
4. Link longer-running stories to other languages (currently, only individual clusters are linked)
5. Categorise persons (governmental, military, religious, civil, …)6. Categorise relations between persons
• Family relation• Contact/Meeting information• ORG-Membership relation• …
• …
UkVisit’2006, Slide 53
Cross-lingual GlossingЮрий Ехануров сменил на посту премьер-министр Украины Юлию Тимошенко, правительство которой было отправлено президентом Ющенко в отставку 8 сентября
Юрий Ехануров [Yuriy Yekhanurov] сменил [replaced] на посту премьер-министр [prime minister] Украины [Ukrainian] Юлию Тимошенко [Yulia Tymoshenko] , правительство [government] которой было отправлено [replaced] президентом [president] Виктор Ющенко [Viktor Yushchenko ] в отставку 8 сентября [08/09/2005]
Viktor Yushchenko Key Titles and Phrasesprésident ukrainien (fr-75)ukrainian president (en-67)president (en,sv - 182)präsident (de - 124)presidente ucraino (it - 16)président (fr - 65)predsednik (sl - 14)presidente (es,it - 28)
Yulia Tymoshenko
Key Titles and Phrasesministerpräsidentin (de-32)regierungschefin (de-23)prime minister (en-36)premier ministre (fr-26)
Yuriy YekhanurovKey Titles and Phrasesministerpräsidenten (de-17)prime minister (en - 13)ally (en - 8)Candidat (fr, 6)57 (en - 2)premier (en,it - 2)ministro de economía (es-2)
Resignation (dismissal) of Yulia Timoshenko’s government and her replacement by Yuriy Yekhanurov
Resignation (dismissal) of Yulia Timoshenko’s government and her replacement by Yuriy Yekhanurov
UkVisit’2006, Slide 54
Conclusion
• Cross-lingual linking of documents/clusters via language-independent representations is feasible.
• More information can be extracted when using texts written in several languages.
• Simple means can take you a long way• JRC effort to add a new language is 1 - 6 months
• Simple lexical patterns• Simple morphological suffix treatment
• Clustering and categorisation• Language-independent statistics and heuristics.
• Monolingual analysis is sufficient; no language pair-specific information is needed.
UkVisit’2006, Slide 55
NewsExplorer – Multilingual News Analysis
with Cross-lingual Linking
Machine Learning for Multilingual Information Access
NIPS Workshop -- Whistler, Canada, 10 December 2006
Ralf Steinberger, Bruno Pouliquen, Camelia Ignat
European Commission – Joint Research Centre (JRC)
http://langtech.jrc.it/ http://press.jrc.it/NewsExplorer
UkVisit’2006, Slide 56
Cross-lingual linking (instead of live demo) (2)
UkVisit’2006, Slide 57
Cross-lingual linking (instead of live demo) (3)
UkVisit’2006, Slide 58
Time-linked clusters (‘Stories’)http://press.jrc.it/NewsExplorer/storyindex/en/year.html
UkVisit’2006, Slide 59
Alexander Litvinenko
UkVisit’2006, Slide 60
Alexander Litvinenko (2)
UkVisit’2006, Slide 61
Alexander Litvinenko (3)
UkVisit’2006, Slide 62
Normalisation of name orthography
• Latin normalisation: Malik Saïdoullaïev
• accented character non-accented equivalent Malik Saidoullaiev
• double consonant single consonant Malik Saidoulaiev
• ou u Malik Saidulaiev
• wl (beginning of name) vl• ow (end of name) ov• ck k• ph f• ž j• š sh
UkVisit’2006, Slide 63
Transliteration
• Transliterate each character, or sequence of characters, by a Latin correspondent• ψ => ps• λ => l• μπ => b• Allow specific common transliterations Джордж [DJORDJ] => "George“, Джеймс => "James", • Examples of transliterations:
• Κόφι Ανάν, Greek Kofi Anan• Кофи Аннан, Russian Kofi Anan• Кофи Анан, Bulgarian Kofi Anan• عنان Arabic Kufi Anan ,كوفي• को�फी� अन्ना�न, Hindi Kofi Anan
UkVisit’2006, Slide 64
Fuzzy matching
Valéry Giscard d’Estaing Валери Жискар д'Эстен
valeri giskar destenvaleri giscard destaing
trigram overlap
bigram overlap
normalisation transliteration
_vvaalleerrii__ggiisscca…
_vvaalleerrii__ggiisskka…
_vavalalelereriri_i_g_gigisiscscacar…
_vavalalelereriri_i_g_gigisiskskakar…
First formula: average of bigram and trigram
First formula: average of bigram and trigram
UkVisit’2006, Slide 65
Fuzzy matching
Rafiq al-Hariri الحريري رفيق
trigram overlap
bigram overlap
rfk l hrr
consonants
Edit distance
rfk lhrr
consonants
normalisation
rafik al hariri
transliteration
rfik alhrir
Final formula: average of bigram, trigram and edit distance
UkVisit’2006, Slide 66
The EMM news data
• JRC’s Europe Media Monitor system• ~ 30,000 articles per day• 30 languages• ~ 1000 media news sources• Updated every 10 minutes• News in UTF8-encoded RSS format (XML)
• EMM available at http://press.jrc.it/• Breaking news• Topic-specific news (EU Commissioners,
EU Constitution, GMO, natural disasters, communicable diseases, …)
• Email and SMS alerts (ca. 10,000 per day)
live
UkVisit’2006, Slide 67
Evaluation of geocoding: results
Disambiguation technique F-measure
None 55.4%
Geo-context (country in the text) 66.7%
Importance of the place 75.9%
Minimum kilometric distance 68.7%
Filtering by person names 64.2%
Geo Stop list 61.1%
All techniques 77 %
Language Precision Recall F-measure
German 0.80 0.68 0.73
English 0.91 0.78 0.84
Spanish 0.75 0.80 0.76
French 0.68 0.74 0.70
Italian 0.70 0.85 0.76
Average 0.77 0.77 0.77
By disambiguation technique
By language
Difficult test set: Pouliquen et al. (2004) on the same test set: F = 0.38Same algorithm on 48 average English news texts: F = 0.94
UkVisit’2006, Slide 68
Evaluation of person name recognition
Evaluation of person name recognition in various languages. The number of rules (i.e. trigger words) gives an idea of the expected coverage for this language. The third and fourth columns show the size of the test set (number of texts, number of manually identified person names).
69
DG JRC – slides references
join
t re
sear
ch c
entr
eEuropean Commission
Future work on NewsExplorerStory trackingBruno Pouliquen
UkVisit’2006, Slide 70
Future work
• Live clusters• Add new articles to existing clusters• Every 10 minutes export current clusters
• Stories• Each cluster is compared with previous 7 days• Put together similar clusters in “stories”
UkVisit’2006, Slide 71
UkVisit’2006, Slide 72
Linking clusters over time
Blair We'll hold Iraq nerve, says Blair
Baghdad security plan 'failing'
Blair: Troop rile Iraq rebels
British PM Nixes Withdrawal From Iraq
Iraq attacks kill nine US troops91 Die in
Sectarian Violence in Iraq
Day 0 Day 1 Day 2 Day 4
UkVisit’2006, Slide 73
Interlinked clusters
riots
Election results
Violence during elections
3 civils killed
violence
elections
Elected president
Election frauds
UkVisit’2006, Slide 74
Converging stories 2=>1
riots
Election results
Violence during elections
3 civils killed
violence
elections
Elect.d+6
Election frauds
…Elect.d+1
UkVisit’2006, Slide 75
Diverging stories 1=>2
riots
Election results
Election frauds
…Fraud d+1
Fraud d+6
…Riots d+1
Riots d+6
Election polls
UkVisit’2006, Slide 76
Results of automatic evaluation across languages (F1 per document at rank=6)
0
10
20
30
40
50
60
En Es Da De Fi Fr It Nl Pt Sv Li Hu
With pre-processing (French=> only stop words)Without pre-processing
Human evaluation (correct descriptors, compared to inter-annotator agreement):
English: 83%
Spanish: 80%