Named Entity Recognition and Linking...2018/01/24  · Definitions •Entity: something that exists...

Preview:

Citation preview

NamedEntityRecognitionandLinking

ArtūrsZnotiņš,arturs.znotins@gmail.com

Supervisor:Guntis Bārzdiņš,Dr.sc.comp.

TheProblem

• Largeamountsofunstructureddata– naturallanguage(WWW,books,newspapers,radio)• Alotofambiguity- contextisveryimportant• Humansaregoodatsemanticdisambiguation:• Whatentitiesdoestextreferto?• Whatfactsdoestextdescribe?• Whatisthemeaningofspecificword?

• Howtodothisautomatically?

2

Motivation

• UtilizeexistingunstructurednaturallanguageresourcesHiddenknowledge

• Usecases:• Informationextraction(IE)• Textclustering• Mediamonitoring• Dialogsystems• Questionanswering• Machinetranslation

3

Cikos atiet vilciens uz Daugavpili?

Definitions

• Entity:somethingthatexistsbyitself• Namedentity(NE): entitywithspecificname• Mention:phrasethatreferstoentity• Knowledgebase(KB):organizedrepositoryofknowledgeconsistingofconcepts,properties,links• (Named)EntityLinking(NEL):linkingmentionsofentitieswithinatexttoKBentities• WordSenseDisambiguation(WSD):assigningmeaningstowordoccurrenceswithintext

4

KnowledgeBase:Example

5https://www.wikidata.org/wiki/Wikidata:Introduction

KnowledgeBase:Example

6https://www.wikidata.org/w/index.php?search=&search=andris+berzins

NamedEntityLinking:Example

“Arī otra figūra Daimlerlietā ir Bojāra ārštatapadomnieks,unsens eksmēra draugs noarmijas laikiem

– Armands Zeihmanis.”(notvnet.lv)

• Bojāra→GundarsBojārs,dz.1967https://lv.wikipedia.org/wiki/Gundars_Bojārs

• Otra figūra=Bojāra =eksmēra

• Bojārs↔Zeihmanis:draugs,padomnieks

7

NEL:GeneralApproach

• NamedEntityRecognition• CandidateSelection

SelectingpossiblecandidatesfromthetargetKnowledgeBase

• DisambiguationDecidingwhichcandidateisthecorrectidentitycorrespondingtothementionofaNamedEntity

8

NEL:ContextFreeApproach

• ExtractsurfaceformsfromKBorannotatedcorpus• DBpedia labels(rathersparse)• InternallinksofWikipedia

• Cleanandcatalog• Faststringmatch

+ Simple+ Highprecision- Lowrecall- Doesnotsolveambiguities

9

𝑈"# = 𝑒# ∈ 𝐸"#: 𝐶 𝑒# ≥ 𝛼×-𝐶 𝑒.

/01

.

Generatenamealternatives

Decideonwhichsurfaceformshaveambiguouslabelswhichcannotbeconsideredwithoutcontext

NEL:KnowledgeRichApproach

• Stringmatchbasedcandidateselection• Mentionattributes:

• Bag-of-name-words• Bag-of-mention-words• Bag-of-context-words

• Cosinesimilaritybetweenvectors

10

Compatibilityfunction

Entity(KB)

Subentity (documentlevel)

Mentions+ Cansolveambiguities+ KBcanbeenrichedwithvalidatedentities- Performancedependsonpre-processingcomponents

Wicketal,2013;Stoyanov etal,2012

NEL:ProcessingPipeline

11

Morphologyandsyntax

NER

Knowledgebase

Co-referenceresolution

Annotateddocument

EntityLinking

Textdocument

MorphologyandSyntax

• Lexiconbasedmorphologicalanalyzer• CMMmorphologicaltagger• LSTMtransition-basedparser• Wordembeddings• CharacterLSTMrepresentation

12

Model UD(UAS, %)

LSTM+ WE 75.1

LSTM+WE+morpho-tag 75.4

• Inflectiongeneration• Mentioncandidate

selection

Paikens etal,2013;Pretkalniņa etal,2016

NamedEntityRecognition

13

Model NER(F1, %)

BaselineCRF 83.4

LSTM-CRF 63.7

LSTM-CRF+WE 90.5

NER:LearningCurve

0% 20% 40% 60% 80% 100%

Corpus fraction used for training

50%

60%

70%

80%

90%

100%

F1-

scor

e

58.8%

66.0%

57.1%

53.9%

80.1%

62.6%

57.1%

84.5%

72.1%

60.0%

86.6%

83.4%

63.7%

90.2%

CRF

LSTM-CRF

LSTM-CRF + embeddings

14

NamedEntityrecognition

15

Bi-directionalLSTM-CRF• Pre-trainedwordembeddings• CharacterLSTMorCNNrepresentations• Jointcross-labelmodelling

CRFlayer

LSTM

Wordembeddings

Characterlevelwordvector

Lample etal,2015

Co-referenceresolution

• Mentionsthatreferstothesamerealwordentity

16

Followinginthefootstepsoftheir fatherDainis,anex-bobsleighspecialist,brothersMartins andTomassDukurs arecurrentlyrankedamongtheglobe’stopskeletonracers.Martins,worldnumberone,isthegoldmedalfavourite.He dreamsofsharingthepodiumwithhis eldersibling.

• {theirfatherDainis,anex-bobsleighspecialist}• {their,brothersMartinsandTomass Dukurs }• {Martins,Martins,worldnumberone,thegoldmedalfavourite,He,his}• {Tomass Dukurs,hiseldersibling}

Co-referenceResolution

• MentiondetectionNE,pronouns,nounphrases,etc.

• Mentionchaining– clustering• Grammaticallyrelatedmentions• Salience(gender,number,person,etc.)• Pairwiseorclusterbased

• Representativenameandtype(person,etc.)

17

LVCoref:RuleBasedSystem

• Exactstringmatch• Precisegrammaticalconstructions• Appositives (AndrisBērziņš,thepresidentofLatvia)• Nominalpredicatives (Jānis Bērziņšisaprofessor)• Acronyms

• Headmatchvariations (SupremeCourtoftheRepublicofLatvia↔SupremeCourtofLatvia)

• Pronounanaphora

18

LVCoref:Results

19

F1 P R

Predictedmentions

Exactmatch 40.0 74.2 28.0

+Preciseconstruction 46.2 74.5 34.3

+Headmatch 56.2 69.4 47.3

+Pronouns 58.0 65.1 52.3

Goldmentions

Exactmatch 42.8 86.0 29.2

+Preciseconstruction 45.4 98.4 29.5

+Headmatch 66.8 88.5 54.1

+Pronouns 76.5 87.0 68.9

LVCoref: Components

0.3

0.9

0.1

0.7

1.12.3

-1 0 1 2 3 4

Gender

Number

Syntacticalconstraints

Entitylevelconstraints

Namedentitylabels

Syntacticalinformation

pp

ΔR ΔP ΔF120

DataannotationforLatvian• SelectedparagraphsfromTheBalancedCorpusofModernLatvian

• NER/NEL• 8categories:person,organization,location,GPE,product,event,time,entity

• DerivedfromMUCandAMRguidelines• LinkstoWikipedia

• Co-references:• Namedentities,nounphrasesandpronounsreferringtospecificentities

• DerivedfromOntoNotes 6.0guidelines• Alsosyntax,frames,AMR

Projects:• Tekstaautomātiskasdatorlingvistikas analīzespētījumsjaunainformācijasarhīvaproduktaizstrādē(2013-2014)

• Daudzslāņu valodas resursu kopa teksta semantiskajai analīzei unsintēzei latviešu valodā (2016-2019) 21

FutureWork

• Moreannotateddata• ImprovementiondetectionforLatvian• HighimpactonCR• Jointlearningwithshallowsyntaxparsing

• Namedentityrecognition• Hierarchicalnamedentitymentions• IncorporateneuralLSTMcharacterlevellanguagemodel

• Co-referenceresolutionJointlearningwithNEL

• NamedentityrecognitioninspeechSubword andclassbasedlanguagemodels

• Cross-lingualnamedentitylinking22

Publikācijas• Miranda,S.,Znotins,A.,Cohen,S.B.,Barzdins,G.(2017).MultilingualClusteringofStreamingNews.SubmittedtoNAACLHLT.

• Znotiņš,A.(2016).WordEmbeddingsfor Latvian NaturalLanguage ProcessingTools.InHumanLanguageTechnologies:TheBalticPerspective,IOSPress.WebofScience.

• Znotiņš,A.,Polis,K.,andDarģis R.(2015).MediamonitoringsystemforLatvianradioandTVbroadcasts.InProceedingsofthe16thAnnualConferenceoftheInternationalSpeechCommunicationAssociation(INTERSPEECH),732–733.WebofScience,Scopus.

• Darģis,R.andZnotiņš,A.(2014).BaselineforKeywordSpottinginLatvianBroadcastSpeech.InHumanLanguageTechnologies:TheBalticPerspective,IOSPress,75–82.WebofScience.

• Znotiņš,A.andPaikens,P.(2014).Coreference ResolutionforLatvian.InProceedingsoftheNinthInternationalConferenceonLanguageResourcesandEvaluation(LREC),3209–3213.WebofScience.

• Pretkalniņa,L.,Znotiņš,A.,Rituma,L.,Goško,D.(2014).Dependencyparsingrepresentationeffectsontheaccuracyofsemanticapplications— anexampleofaninflectivelanguage.InProceedingsoftheNinthInternationalConferenceonLanguageResourcesandEvaluation(LREC),4074–4081.WebofScience.

23

Summary

• NER/NEL:findallentitymentionswithinatextandlinkthemtoauthoritativedatasource(knowledgebase)• NELastextanchoring:• Coarsesummaryofwhattextisabout• OthertaskscanutilizeinformationfromKB• Othertaskscanenrichnamedentitieswithadditionalsemanticannotationlayers

24

Recommended