EDRAK: Entity-centric Data Resource
for Arabic Knowledge
Mohamed H. Gad-Elrab, Mohamed Amir Yosef, Gerhard Weikum
Max-Planck-Institut für Informatik
Saarbrücken, Germany
30th July 2015
If only we had a Comprehensive Arabic Data Resource!
Outline
• Resources Use-cases
• Related Work
• EDRAK Resource
• EDRAK Creation
• EDRAK in Numbers
• Evaluation
Resources Use-cases
• Entity Linking / Named Entity Disambiguation
Example: ميركل تطالب البرلمان الألماني بدعم خطة إنقاذ اليونان
(Merkel calls the German parliament to support the Greece bailout plan)
Entity: Angela_Merkel
Context: ألمانيا - السياسية - انتخابات - الحزب الديمقراطي الاجتماعي - Germany - Politics - Elections, etc.
Names: أنجيلا دوروتيا ميركل - أنغيلا - أنجيلا - ميركل - المستشارة الألمانية - Angela Merkel - Merkel, etc.
Types: Person, German_Politician, etc.
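The example above pairs a mention with candidate names and context keyphrases. A minimal sketch of how such a resource could drive disambiguation is shown here, assuming a simple keyphrase-overlap score. The function `link_mention`, the second candidate `Max_Merkel`, and all toy data are illustrative assumptions, not the scoring used by the authors' system.

```python
def link_mention(mention, context_words, names, keyphrases):
    """Pick the candidate entity whose keyphrases overlap most with the context.

    names:      surface name -> list of candidate entity ids (assumed layout)
    keyphrases: entity id -> set of keyphrases (assumed layout)
    """
    candidates = names.get(mention, [])
    if not candidates:
        return None
    context = set(context_words)
    # Score each candidate by the number of keyphrases seen in the context.
    return max(candidates, key=lambda e: len(keyphrases.get(e, set()) & context))


# Toy data, made up for illustration only.
names = {"ميركل": ["Angela_Merkel", "Max_Merkel"]}
keyphrases = {
    "Angela_Merkel": {"ألمانيا", "البرلمان", "انتخابات"},
    "Max_Merkel": {"كرة القدم"},
}
sentence = "ميركل تطالب البرلمان الألماني بدعم خطة إنقاذ اليونان"
best = link_mention("ميركل", sentence.split(), names, keyphrases)
print(best)
```

Real systems typically combine such an overlap with name priors and entity-entity coherence; the sketch keeps only the context signal.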
Resources Use-cases
• Entity Linking / Named Entity Disambiguation
• Dictionary-based NER
• Entity Summarization
• Question Answering
• Fine-grained Semantic Type Classifier
• ….
Existing Resources
Resource Name | Entity-Aware? | Building Source | Arabic Names Size | Context Info?
JRC-Names (Steinberger et al. 2011) | No | Wikipedia + News | 17K | No
Arabic Lexical NEs (Attia et al. 2010) | No | Wikipedia + WordNet | 45K | [unclear]
CMUQ-Arabic-NET (Azab et al. 2013) | No | Wikipedia + News | 60K | No
Google-Word-To-Concept (Spitkovsky and Chang, 2012) | Yes | Wikipedia + Web | 800K | No
BabelNet (Navigli and Ponzetto, 2012) | Yes | Wikipedia + Concepts Translation | NA | Yes
AIDArabic (Yosef et al. 2014) | Yes | Wikipedia (Eng & Ar) | 495K | Yes
EDRAK
• Entity Catalog (2.4M Entities)
• Names Dictionary
• Keyphrases Dictionary
• Weights
• Semantic Types
• Entity-Entity Similarity
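To make the components above concrete, here is a minimal sketch of how one entity's entry might be represented, plus one possible entity-entity similarity. The class `EntityRecord`, its field names, the weights, and the use of weighted Jaccard overlap are all assumptions for illustration; the slides do not specify the storage format or the similarity measure.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class EntityRecord:
    """Hypothetical record for one entity in an EDRAK-like resource."""
    entity_id: str
    names: Dict[str, float] = field(default_factory=dict)       # name -> weight
    keyphrases: Dict[str, float] = field(default_factory=dict)  # keyphrase -> weight
    types: Set[str] = field(default_factory=set)


def keyphrase_similarity(a: EntityRecord, b: EntityRecord) -> float:
    """Weighted Jaccard overlap of keyphrases (an assumed measure)."""
    keys = set(a.keyphrases) | set(b.keyphrases)
    if not keys:
        return 0.0
    shared = sum(min(a.keyphrases.get(k, 0.0), b.keyphrases.get(k, 0.0)) for k in keys)
    total = sum(max(a.keyphrases.get(k, 0.0), b.keyphrases.get(k, 0.0)) for k in keys)
    return shared / total


# Toy entries with made-up weights.
merkel = EntityRecord(
    "Angela_Merkel",
    names={"ميركل": 0.9, "Angela Merkel": 1.0},
    keyphrases={"ألمانيا": 1.0, "انتخابات": 0.5},
    types={"PERSON"},
)
germany = EntityRecord("Germany", keyphrases={"ألمانيا": 0.5}, types={"LOCATION"})
similarity = keyphrase_similarity(merkel, germany)
print(round(similarity, 3))
```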
EDRAK Creation: Entity Catalog
• YAGO3 (English & Arabic)
• Culture-Specific Entities
• Prominent Entities
EDRAK Creation: Names Dictionary
• Manually from the Arabic Wikipedia
  • Page Titles
  • Anchor Text
  • Redirects
  • Disambiguation Pages
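The Wikipedia sources listed above can be merged into a single name-to-entity dictionary. The following is a minimal sketch under the assumption that each source has already been parsed into `(source, surface_name, entity_id)` triples; the function name, the triple layout, the entity id `Merkel_(surname)`, and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def build_names_dictionary(observations):
    """Aggregate (source, surface_name, entity_id) triples.

    Returns: surface name -> Counter of entity ids, counting how many
    observations support each name/entity pair.
    """
    names = defaultdict(Counter)
    for _source, surface, entity in observations:
        names[surface.strip()][entity] += 1
    return names


# Toy observations, made up for illustration only.
observations = [
    ("title", "أنجيلا ميركل", "Angela_Merkel"),
    ("anchor", "ميركل", "Angela_Merkel"),
    ("redirect", "ميركل", "Angela_Merkel"),
    ("disambiguation", "ميركل", "Merkel_(surname)"),
]
names_dictionary = build_names_dictionary(observations)
print(names_dictionary["ميركل"].most_common(1))
```

The counts could later be normalized into weights; that step is left out because the slides do not state how weights are computed.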
EDRAK Creation: Names Dictionary
• Limitation
[Chart: entities that have Arabic names vs. entities missing Arabic names]
EDRAK Creation: Names Dictionary
Goal: populate Arabic names for entities that exist only in the English Wikipedia, and compile more names for entities in the Arabic Wikipedia.
[Diagram: En. Entity Names -> (External Resources, Named Entity Translation, Transliteration) -> Generated Ar. Names]
EDRAK Creation: Names Dictionary
• Approach 1: External Resources
  • Entity-aware: Google Word to Concept (GW2C)
  • Web Hypertext Anchors to Wikipedia
  • Name Dictionaries:
    • JRC-Names
    • CMUQ-Arabic-NET (Azab et al. 2013)
[Figure: web pages]
EDRAK Creation: Names Dictionary
• Approach 2: Entity Names Translation
  • Statistical Machine Translation (SMT)
  • Services target full text
  • Named entities are mistranslated
    • E.g. “Nolan North is an American actor”
    • E.g. “Robert Green”
  • SMT systems do not consider types
EDRAK Creation: Names Dictionary
• Approach 2: Entity Names Translation
  • Entity-Names SMT
Christian Dior -> كريستيان ديور
Eric Schmidt -> إريك إشميت
Christian Schmidt -> ?
EDRAK Creation: Names Dictionary
• Approach 2: Entity Names Translation
  • Type-Aware Entity Names SMT
  • Wikipedia cross-language links + CMUQ-Arabic-NET
  • Persons, Non-persons, and Fallback models
[Diagram: English Entity -> is Person? -> yes: Persons Translation Model (trained on parallel PERSON names); no: Non-Persons Translation Model (trained on parallel NON-PERSON names) -> pick top-k -> Arabic Entity Name]
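The routing in the diagram above can be sketched as follows. This is an illustration only: the translation models are stood in for by plain lookup tables, and `translate_name`, `make_lookup_model`, and the rule that the fallback is used when the type-specific model returns nothing are assumptions, not the authors' implementation.

```python
def make_lookup_model(table):
    """Stand-in for a trained SMT model: returns ranked candidates from a table."""
    return lambda name: list(table.get(name, []))


def translate_name(name, is_person, person_model, non_person_model,
                   fallback_model, k=3):
    """Route a name to the type-specific model, then keep the top-k candidates."""
    primary = person_model if is_person else non_person_model
    # Assumed policy: use the fallback model only if the primary yields nothing.
    candidates = primary(name) or fallback_model(name)
    return candidates[:k]


# Toy tables built from the name pairs shown on the slides.
person_table = {"Eric Schmidt": ["إريك إشميت"]}
non_person_table = {"Christian Dior": ["كريستيان ديور"]}
fallback_table = {**person_table, **non_person_table}

person_model = make_lookup_model(person_table)
non_person_model = make_lookup_model(non_person_table)
fallback_model = make_lookup_model(fallback_table)

result = translate_name("Eric Schmidt", True,
                        person_model, non_person_model, fallback_model)
print(result)
```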
EDRAK Creation: Names Dictionary
• Approach 3: Persons Names Transliteration
  • Persons Names
  • Unseen Names
  • Capturing several Arabic possibilities
  • Ex.: Tony (توني / طوني)
  • Transliteration as Character-Level SMT
    • Training: En-Ar Persons Names
A n g e l a SPACE M e r k e l -> أ ن ج ي ل ا SPACE م ي ر ك ل
A l b e r t SPACE E i n s t e i n -> أ ل ب ر ت SPACE أ ي ن ش ت ا ي ن
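The character-level format shown above, with letters as tokens and word boundaries marked by `SPACE`, can be produced with a few lines of preprocessing. This is a minimal sketch; the function names are hypothetical, and it assumes each letter is a single Unicode code point, which ignores combining diacritics.

```python
def to_char_tokens(name, space_token="SPACE"):
    """Split a name into space-separated characters, marking word boundaries."""
    tokens = []
    for word_index, word in enumerate(name.split()):
        if word_index > 0:
            tokens.append(space_token)
        tokens.extend(word)  # one token per character
    return " ".join(tokens)


def from_char_tokens(tokenized, space_token="SPACE"):
    """Invert to_char_tokens: join characters and restore word boundaries."""
    return "".join(" " if tok == space_token else tok for tok in tokenized.split())


english = to_char_tokens("Angela Merkel")
arabic = to_char_tokens("أنجيلا ميركل")
print(english)
print(arabic)
```

Pairs of such token sequences could then be fed to a phrase-based SMT toolkit as if each character were a word.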
EDRAK Creation: Keyphrases Dictionary
• Manually from the Arabic Wikipedia
  • In-link Page Titles
  • Anchor Texts
  • Categories
  • Citations
EDRAK Creation: Keyphrases Dictionary
• Arabic Keyphrases Generation
[Diagram: En. In-link Titles -> (Named Entity Translation, Transliteration) -> Keyphrases Dictionary]
[Diagram: En. Categories -> Named Entity Translation -> Keyphrases Dictionary]
EDRAK in Numbers
• AIDArabic Resource (Yosef et al. 2014) vs EDRAK
[Bar chart, log scale from 1 to 1E+09, comparing AIDArabic and EDRAK on: Entities Count, Unique Names, Entity-Name Entries, Unique Keyphrases, Entity-Keyphrase Entries]
EDRAK in Numbers
• Entities per Semantic Type
[Pie chart. Slice values: 1,220,032 (52%); 360,108 (16%); 359,071 (15%); 199,846 (9%); 196,305 (8%). Legend: PERSON, LOCATION, ARTIFACT, EVENT, ORGANIZATION]
EDRAK in Numbers
• Example
Evaluation
• Manual Assessment
  • 55 native Arabic speakers
  • Distributed over many areas
  • 150 names per annotator
  • Fairly distributed sample
Evaluation
• Manual Assessment
  • Precision@1
[Bar chart, y-axis 0 to 100: Precision@1 by name source (First/Last Names, labels PERSON, labels NON-PERSON, Redirects PERSON, Redirects NON-PERSON, Categories); series: Type-Aware, Combined, Transliterated, Categories Translation]
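Precision@1 here means the share of entities whose top-ranked generated name was judged correct. A minimal sketch of computing it per name source from annotator judgments follows; the function name, the `(source, is_correct)` input layout, and the toy judgments are assumptions and do not reproduce the reported results.

```python
from collections import defaultdict


def precision_at_1_by_source(judgments):
    """judgments: iterable of (source, is_correct) for each top-ranked name.

    Returns: source -> percentage of top-ranked names judged correct.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for source, is_correct in judgments:
        totals[source] += 1
        correct[source] += int(is_correct)
    return {source: 100.0 * correct[source] / totals[source] for source in totals}


# Toy judgments, made up for illustration only.
judgments = [
    ("First/Last Names", True),
    ("First/Last Names", True),
    ("Redirects NON-PERSON", True),
    ("Redirects NON-PERSON", False),
]
scores = precision_at_1_by_source(judgments)
print(scores)
```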
Evaluation
• Manual Assessment
• Precision varies with the name source.
  • Highest: First/Last Names
  • Lowest: Redirects (NON-PERSON)
• No real difference between Type-Aware and Combined SMT.
• Transliterated names cause confusion
  • Ex.: Johannes, Friedrich
Conclusion
• EDRAK offers 2.4M entities with potential names, contextual keyphrases, and semantic types.
• EDRAK is not limited to the Arabic Wikipedia
  • External Resources
  • Type-Aware Entity Names Translation
  • Person Names Transliteration
Download EDRAK
http://www.mpi-inf.mpg.de/yago-naga/aida/
Thank you!