Practical NLP and Information Extraction with GATE
Dr. Diana Maynard, University of Sheffield, UK
What is text mining?
• Text mining is the discovery of new, previously unknown information by automatically extracting information from different textual resources.
• A key element is linking the extracted information together to form new facts or new hypotheses to be explored further.
• Text mining lets you investigate what’s actually in a document or a collection of documents.
• It lets us answer wh-questions: who, what, why, when, how, where, which?
Text Mining is not Data Mining
• Data mining is about using analytical techniques to find interesting patterns in large structured databases
• Examples:
  - using consumer purchasing patterns to predict which products to place close together on shelves in supermarkets
  - analysing spending patterns on credit cards to detect fraudulent card use
Text Mining is not Web Search
• Text mining is also different from traditional web search.
• In search, the user is typically looking for something that is already known and has been written by someone else.
• The problem lies in sifting through all the material that isn't currently relevant to your needs, in order to find the information that is.
• The solution often lies in better ways to ask the right question.
• You can't ask Google to tell you:
  - How does the language used by Donald Trump differ from the language used by Hillary Clinton?
  - In which parts of the country did people talk more about the environment during the UK elections?
  - Which female MPs talked in the last 6 months about British hospitals with more than 100 deaths per month since 2010?
Basic text processing
His MMSE was 23/30 on 15 January 2008.
(MMSE = Mini Mental State Examination)
Character offsets
His MMSE was 23/30 on 15 January 2008.
0....5....10...15...|....|....|....|....
Sentences
His MMSE was 23/30 on 15 January 2008.
0....5....10...15...|....|....|....|....
Tokens
His MMSE was 23/30 on 15 January 2008.
0....5....10...15...|....|....|....|....
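As a rough illustration of how tokens relate to character offsets, the sketch below splits the example sentence with a regular expression and records where each token starts and ends. The `tokenise` function and its splitting rule are assumptions made for this example; they are not GATE's tokeniser, which has its own rules.

```python
import re

TEXT = "His MMSE was 23/30 on 15 January 2008."

# Assumed rule: a token is either a run of letters/digits or a single
# non-space symbol such as "/" or ".".
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenise(text):
    """Return (start, end, string) triples; the end offset is exclusive."""
    return [(m.start(), m.end(), m.group()) for m in TOKEN_RE.finditer(text)]

for start, end, token in tokenise(TEXT):
    print(f"{start:>2}-{end:<2} {token}")
```

Under this rule "23/30" becomes three tokens (23, /, 30), and every token can be recovered from the text by slicing with its offsets.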
Part of speech categories
His MMSE was 23/30 on 15 January 2008.
0....5....10...15...|....|....|....|....
PP  NN   VB  CD CD PR CD NN      CD
Morphological Analysis
His MMSE was 23/30 on 15 January 2008.
0....5....10...15...|....|....|....|....
PP  NN   VB  CD CD PR CD NN      CD
         be
Knowledge engineering: finding patterns
His MMSE was 23/30 on 15 January 2008.
0....5....10...15...|....|....|....|....
PP  NN   VB  CD CD PR CD NN      CD
         be
Annotations are built up over this sentence step by step:
• “January” is annotated as Month
• “MMSE” is annotated as MMSE
• The pattern {number}{Month}{number} matches “15 January 2008”, giving a Date
• The pattern {number}{slash}{number} matches “23/30”, giving a Score
• The pattern {MMSE}{BE}{Score}{?}{Date} matches “MMSE was 23/30 on 15 January 2008”, giving “MMSE with score and date”
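The pattern-building idea can be approximated with ordinary regular expressions. The sketch below is only a stand-in: in GATE such patterns are written over annotations rather than raw characters, and the names `DATE`, `SCORE`, `BE` and `find_mmse` are invented for this example. The `{?}` slot is assumed here to mean one optional word (such as "on").

```python
import re

TEXT = "His MMSE was 23/30 on 15 January 2008."

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# {number}{Month}{number}
DATE = rf"(?P<date>\d{{1,2}} (?:{MONTHS}) \d{{4}})"
# {number}{slash}{number}
SCORE = r"(?P<score>\d+/\d+)"
# Assumption: a few surface forms stand in for the lemma "be"
BE = r"(?:is|was|are|were)"
# {MMSE}{BE}{Score}{?}{Date}
MMSE_RE = re.compile(rf"MMSE {BE} {SCORE}(?: \w+)? {DATE}")

def find_mmse(text):
    """Return the score, date and character span of the first match, or None."""
    m = MMSE_RE.search(text)
    if m is None:
        return None
    return {"score": m.group("score"), "date": m.group("date"),
            "start": m.start(), "end": m.end()}

print(find_mmse(TEXT))
```

Building the larger pattern out of the smaller `DATE` and `SCORE` pieces mirrors the way the Date and Score annotations are created first and then reused.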
Typical Information Extraction pipeline
• Pre-processing (tokenisation, sentence splitting, morphological analysis, POS tagging)
• Entity finding (gazetteer lookup, NE grammars)
• Co-reference (alias finding, orthographic co-reference etc.)
• Export the results somewhere (database / XML / ontology)
Example of Information Extraction
John lives in London. He works there for Polar Bear Design.
Basic Named Entity Recognition
• John: PER
• London: LOC
• Polar Bear Design: ORG
Co-reference
• same_as: “He” refers to the same entity as “John”
Relations
• live_in: John, London
Relations (2)
• employee_of: John, Polar Bear Design
Relations (3)
• based_in: Polar Bear Design, London
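One simple way to hold the output of this example is as typed entities plus (subject, relation, object) triples. The sketch below is illustrative only: the data structures and the `facts_about` helper are assumptions, and the arguments of each relation (and the reading of "there" as London) are inferred from the example sentence rather than given explicitly.

```python
# Entities found by named entity recognition, keyed by their text.
ENTITIES = {
    "John": "PER",
    "London": "LOC",
    "Polar Bear Design": "ORG",
}

# Co-reference: later mentions mapped to the entity they refer to.
# "there" -> "London" is an assumption made for this sketch.
SAME_AS = {"He": "John", "there": "London"}

# Relations as (subject, relation, object) triples.
RELATIONS = [
    ("John", "live_in", "London"),
    ("John", "employee_of", "Polar Bear Design"),
    ("Polar Bear Design", "based_in", "London"),
]

def facts_about(entity):
    """Return every triple in which the entity is subject or object."""
    return [t for t in RELATIONS if entity in (t[0], t[2])]

for triple in facts_about("London"):
    print(triple)
```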
What is Event Recognition?
• An event is an action or situation relevant to the domain, expressed by some relation between entities or terms.
• It is always grounded in time, e.g. the performance of a band, an election, the death of a person
Mitt Romney, the favorite to win the Republican nomination for president in 2012
[Figure: the example above annotated with Person, Event and Date, with Relation links between them]
Why are Entities and Events Useful?
• They can help answer the “Big 5” journalism questions (who, what, when, where, why)
• They can be used to categorise the texts in different ways
  - look at all texts about Donald Trump
• They can be used as targets for opinion mining
  - find out what people think about Donald Trump
• When linked to an ontology and/or combined with other information, they can be used for reasoning about things not explicit in the text
  - seeing how opinions about different American presidents have changed over the years
GATE: General Architecture for Text Engineering
Why GATE?
• GATE is the most widely used open source toolkit for NLP in the world
• We’re using it because it’s a great way to showcase all the core NLP components that are used for text analysis tasks
• You can play with all the tools in GATE and try things out for yourself to see how it works
• Experts
  - Developed at the University of Sheffield since 2000 (in its current form)
  - The person who has led the development of the NLP tools in GATE since 2000 is the one presenting to you now :)
• And by the way, just because it’s old doesn’t mean it’s out of date. GATE is in constant development, with new technologies being added all the time.
About this tutorial
• This tutorial will get you started with the GATE graphical user interface (GUI), also known as “GATE Developer”
• Everything you do in the GUI can be done via the API, but it’s easier to see what’s going on in the GUI
• It will be a hands-on session. You can try things out in GATE as the topics are presented.
• Things suggested for you to try yourself are in red.
• Download and install GATE 8.5.1 (if you haven’t already) from http://gate.ac.uk/download
• Start GATE on your computer (if you haven't already) by double clicking the icon
GATE GUI
[Screenshot: the GATE GUI, with labels for the Menu Bar, Shortcut Buttons, Resources Pane, Resource Features, Messages and Display Pane]
Resources
• Most things you use within GATE are “resources”:
• Language resources (LRs) are documents, document collections, ontologies...
  - A collection of documents is known as a corpus
• Processing resources (PRs) are programs that operate on text within the documents, and often create or modify annotations
• Datastores are for storing documents and corpora for later use
• Applications (“pipelines”) are sequences of processing resources that run on one or more documents
Displaying Resources
• When you first open GATE, the display pane will show messages from the system in the “Messages” tab
• The display pane displays whatever elements you are currently working with, e.g. an application, a document or a processing resource, each in its own tab
• Double clicking on a resource in the resources pane will display it
• Tabs along the top of the display pane allow you to choose which of the open resources to display
Create New Document
• From the Resources Pane, right click “Language Resources” → New → GATE Document
• Ignore the parameter settings that will be displayed
• Click OK
• “GATE Document_<id>” will now be added to “Language Resources”
• Double click that document name
• A tab is opened in the display pane, showing the empty document.
• Now enter some text there.
Empty Document
[Screenshot: an empty document, with labels for the Document Tab, Document Name, Document Editor, Document Editor Buttons and Document Resource Views]
Document Editor
• The Document Editor is shown as a new Tab in the Display Pane, alongside the Message Pane
• There are buttons on the top of the Editor, e.g. “Annotation Sets”; we will learn about them later.
• There are tabs at the bottom of the Document Tab: these show different “Views” of the document.
• The small pane in the lower left shows the “document features” (optional information associated with the document resource as key/value pairs)
Simple operations on resources
• Right clicking on the name of a resource in the resource pane gives access to a menu of actions
• Double clicking on the name of a resource opens a view of the resource in the display pane (triple clicking the name can be used to rename)
• Selecting a resource instance and pressing the Delete (Mac: Fn+Backspace) key will generally close it
• You can also right click and then select “Close”
Parameters
• Resources can have parameters which need to be specified when the resource is created: Initialization (init) Parameters
• Processing resources can also have parameters which can be changed for each run: Runtime Parameters
• Init parameters specify how a resource is created, e.g. the location of a document to load
• Runtime parameters configure what a processing resource does, e.g. whether some processing is case-sensitive or not.
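The difference between the two kinds of parameter can be sketched in plain Python. This is not the GATE API: the `KeywordTagger` class, its `entries` init parameter and its `case_sensitive` runtime parameter are hypothetical names chosen to mirror the examples above.

```python
class KeywordTagger:
    """Toy processing resource that finds listed keywords in a text."""

    def __init__(self, entries):
        # Init parameter: supplied once, when the resource is created.
        self.entries = list(entries)
        # Runtime parameter: may be changed before each run.
        self.case_sensitive = True

    def execute(self, text):
        """Return sorted (start, end, entry) matches for the current settings."""
        haystack = text if self.case_sensitive else text.lower()
        matches = []
        for entry in self.entries:
            needle = entry if self.case_sensitive else entry.lower()
            pos = haystack.find(needle)
            while pos != -1:
                matches.append((pos, pos + len(needle), entry))
                pos = haystack.find(needle, pos + 1)
        return sorted(matches)

tagger = KeywordTagger(["January"])        # init parameter set here
print(tagger.execute("in january and January"))
tagger.case_sensitive = False              # runtime parameter changed
print(tagger.execute("in january and January"))
```

The same resource instance is reused for both runs; only the runtime setting differs between them.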
Loading an existing document
• GATE can read and load documents in many formats: e.g. plain text, HTML, XML, PDF, Word, CoNLL, CSV, JSON
• GATE can load documents from files and from URLs
• When a document is loaded, it gets converted to GATE internal format as document text + annotations
Loading a document
• To load a document:
  - right click on Language Resources → New → GATE Document, OR
  - File menu → New Language Resource → GATE Document
• Use the sourceURL parameter to specify the document to be loaded:
  - type the filename or URL, or
  - click the file browser icon to navigate to the correct document
• Load a file from your hands-on materials: corpora → news-texts → ft-airlines-27-jul-2001.xml
• Load a web page, e.g. http://news.bbc.co.uk
Document viewer
[Screenshot: the document viewer, with labels for the document viewer buttons and the document; the highlighted tab is the resource currently being viewed]
Annotations
• Annotations are central to GATE
• Annotations represent aspects of the text you want to analyze: words, sentences, Dates, Person Names
• Annotations are named by their type, e.g. “Person”
• An annotation consists of
  - an annotation type
  - start and end offsets
  - a set of features; each feature is an arbitrary name/value pair, e.g. gender=male
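The three parts of an annotation map naturally onto a small record type. The sketch below is a plain-Python illustration, not GATE's own annotation class; the `Annotation` dataclass and the `covered_text` helper are names assumed for this example.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    type: str                                      # e.g. "Person"
    start: int                                     # offset of first character
    end: int                                       # offset just past the last character
    features: dict = field(default_factory=dict)   # arbitrary name/value pairs

def covered_text(text, ann):
    """Return the stretch of document text that the annotation spans."""
    return text[ann.start:ann.end]

TEXT = "John lives in London."
person = Annotation("Person", 0, 4, {"gender": "male"})
location = Annotation("Location", 14, 20)

for ann in (person, location):
    print(ann.type, covered_text(TEXT, ann), ann.features)
```

Because an annotation only stores offsets, the document text itself is never modified; any number of annotations can overlap the same span.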
Annotation Sets
• Annotations are grouped into sets
• Each set can contain any number of annotations of any type
• You can create and organize your annotation sets as you wish.
• Predefined sets
  - Default set (empty name): cannot be deleted
  - “Original markups”: annotations from the markups in the file
  - “Key”: by convention, used for gold standard annotations
• Click the “Annotation Sets” button in the document viewer for the ft-airlines document you loaded
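Grouping annotations into named sets can be pictured as a dictionary from set name to a list of annotations, with the empty string naming the default set. The `Document` class below is a hypothetical stand-in written for this sketch and does not reproduce the GATE API.

```python
from collections import defaultdict

class Document:
    """Toy document holding text plus named annotation sets."""

    def __init__(self, text):
        self.text = text
        self._sets = defaultdict(list)   # "" is the default (unnamed) set

    def add(self, ann_type, start, end, set_name="", **features):
        self._sets[set_name].append(
            {"type": ann_type, "start": start, "end": end, "features": features})

    def types(self, set_name=""):
        """Annotation types present in one set, in alphabetical order."""
        return sorted({a["type"] for a in self._sets[set_name]})

    def get(self, ann_type, set_name=""):
        """All annotations of one type within one set."""
        return [a for a in self._sets[set_name] if a["type"] == ann_type]

doc = Document("John lives in London.")
doc.add("Person", 0, 4, set_name="Key", gender="male")   # gold standard
doc.add("Location", 14, 20, set_name="Key")
doc.add("Person", 0, 4)                                   # default set

print(doc.types("Key"))
print(doc.types())
```

Keeping gold standard annotations in "Key" and system output in the default set means the two can hold the same types without interfering with each other.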
Annotation Sets
[Screenshot: the Annotation Sets view, with labels for the Default annotation set, the Original markups annotation set, the annotation types, the Document Viewer Buttons and the Tabs]
Viewing annotations
• Clicking on the Annotation Sets button opens a new pane on the right hand side inside the document view (Annotation Sets view)
• The default (unnamed) set contains some examples of annotations
• Click on the Annotation Set name (e.g. Key) to display the annotation types belonging to that set
• You should see types such as Location, Date, Person etc.
• Click the checkbox for an annotation type to view all the annotations of that type in the document
A closer look at the annotations
• Click the Annotations List button from the menu above the Display pane
• The table shows annotation type, annotation set, offsets, annotation id, and features (for all selected annotations)
• Select a row in the table to highlight the annotation in the text
• There are also other annotation views possible, such as the Annotation Stack and Coreference Editor
• Try the Annotation Stack view
Annotations
[Screenshot: a Date annotation highlighted in the text, with the annotations table below]
Editing existing annotations
• Select an annotation type from the Annotation Sets view and hover over a highlighted annotation in the text
• A popup window displays more information about it: this is the annotation editor
• Click the drawing pin symbol at the top of the editor. This will “pin” the window open (you can still move the window around on your screen if you wish)
• Try editing the annotation: you can change the annotation type, feature names and values, the span of the annotation (clicking left and right arrows at the top of the box) or delete the annotation or its features (red Xs)
• Close the annotation editor by clicking the X in the top right corner, then view your edited annotation in the Annotation List
Annotation editor
[Screenshot: the annotation editor, with labels for the annotation type, feature name and value]
Creating a Corpus
• A corpus is a collection of documents.
• We tend to run applications on a corpus rather than on a document itself
• First close the documents you have loaded in GATE, just so we don’t get confused (right click and Close)
• Now create a new empty corpus
• Right click on Language Resources → New → GATE Corpus
• You can give the corpus a name, or use the default one
Populating a Corpus
• Sometimes there could be hundreds of documents in a corpus.
• Using the populate function means you can load lots of documents into the corpus in one go
• Right click on the name of your new corpus in the Resources pane and select “Populate”
• Select the name of the directory with your documents (hands-on/corpora/news-texts)
• All the documents will be loaded in one go
• View a document or the corpus by double clicking on it
Processing Resources
• Processing resources (PRs) are the tools that process and annotate text (text processing algorithms).
• An “application” or “pipeline” consists of any number of PRs, run sequentially over a corpus of documents
• ANNIE contains PRs for tokenisation, sentence splitting, POS tagging, Named Entity recognition etc.
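The idea of PRs run sequentially over a corpus can be sketched as a list of functions applied in order to each document. Everything below is a simplified stand-in written for illustration: `tokeniser`, `number_tagger` and `run_pipeline` are hypothetical names, not components of ANNIE or the GATE API.

```python
import re

def new_document(text):
    return {"text": text, "annotations": []}

def tokeniser(doc):
    """Add a Token annotation for every word, number or symbol."""
    for m in re.finditer(r"\w+|[^\w\s]", doc["text"]):
        doc["annotations"].append(
            {"type": "Token", "start": m.start(), "end": m.end(),
             "features": {"string": m.group()}})

def number_tagger(doc):
    """Add a Number annotation over every all-digit Token (needs Tokens first)."""
    tokens = [a for a in doc["annotations"] if a["type"] == "Token"]
    for tok in tokens:
        if tok["features"]["string"].isdigit():
            doc["annotations"].append(
                {"type": "Number", "start": tok["start"], "end": tok["end"],
                 "features": {}})

def run_pipeline(prs, corpus):
    """Run each PR, in order, over each document in the corpus."""
    for doc in corpus:
        for pr in prs:
            pr(doc)

corpus = [new_document("His MMSE was 23/30 on 15 January 2008."),
          new_document("John lives in London.")]
run_pipeline([tokeniser, number_tagger], corpus)
```

Order matters: `number_tagger` reads the Token annotations that `tokeniser` creates, so swapping the two would leave it with nothing to work on.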
Applications
Here's one we made earlier: ANNIE
• ANNIE is a ready-made collection of PRs that performs Information Extraction on unstructured text.
• Click the ANNIE icon from the top GATE menu OR select File → Load ANNIE system
• View the ANNIE application by double clicking on it
• Run ANNIE on your corpus (select the corpus name and click “Run this application”)
Running an application
[Screenshot: the application view, with labels for the PRs selected in the application (in order of their execution), the corpus on which the application is executed, the runtime parameters of the selected PR, and the button to execute the application]
Viewing the results
• When a message appears in the bottom left corner of your GATE window saying something like “ANNIE run in 1.3 seconds”, the application has finished.
• Double click on the document to view it
• View the annotations by selecting Annotation Sets and clicking on any Annotation types in the Default (unnamed) set
• If you want, you can view the annotations table or stack view too.
• Remember that not all the results will be perfect!
Plugins
• A plugin is a collection of PRs and other resources bundled together.
• Everything needed for IE in ANNIE is in the ANNIE plugin.
• Everything needed for IE in French is in the lang_french plugin.
• An application can use PRs from one or more different plugins.
• In order to use PRs, you need to load the relevant plugin(s)
• Plugins are loaded via the Plugin Manager (green jigsaw piece icon)
Plugins
• Click the icon on the top GATE menu to open the Plugin Manager [or go via File → Manage CREOLE Plugins]
Plugins
[Screenshot: the Plugin Manager, with labels for the list of available plugins, the resources in the selected plugin, “load the plugin for this session only”, “load the plugin every time GATE starts”, “apply all the settings” and “close the plugins manager”]
Plugins
• Click on a plugin to see (on the RHS) the names of the resources it contains. Have a look at a few.
• Now load the Tools plugin by checking the relevant “Load Now” box for it
• Click “Apply All” to load the plugin
• Click “Close”
• Right click on Processing Resources to see which new PRs are now available
Adding a new PR
• Let's add a Verb Phrase Chunker PR to ANNIE.
• First, we have to load the plugin that contains it, and then load the PR into GATE, before we can add it to the application.
• If you were looking closely, you’ll have noticed that the Tools plugin you just loaded contains the ANNIE VP Chunker.
• Right click on Processing Resources and select “New” → “ANNIE VP Chunker”
• Leave all the default parameters set and click “OK”
Adding a new PR (2)
• Now we need to add the new PR to the application.
• Double click on ANNIE.
• You'll see the ANNIE VP Chunker is in the list of loaded PRs. This means it's available in GATE, but isn't yet contained in the application.
• Add it to the application by selecting it and using the right arrow to transfer it.
• Now use the up arrow to move it to the right place in the application. It should go after (below) the POS tagger but before (above) the NE transducer.
• Run the application and view the results on the document.
• You should see a new annotation type “VG”.
IE for other languages
• You can try out other applications in two ways
• Load a plugin that contains a ready-made application
• Click Applications → Ready made applications
• Load some documents and run the application
• You can also just load a blank document and type some text in it
• If you do this, you need to right click on the document and select “New corpus with this document” first
• Run the new application on your corpus
NER in French
NER in Arabic
Hands-on with TwitIE
• TwitIE is a version of ANNIE that’s been retrained for tweets
• Load the TWITIE plugin (green jigsaw icon)
• Now right-click on “Applications”, select “Ready-made applications” and “TwitIE”
• Create a new corpus, name it “Tweets”
• Right-click on the corpus and select “populate from Twitter JSON”, selecting the file hands-on-materials/corpora/energy-tweets.json
• Once loaded, double click on TwitIE to open it, and then select “Run this application” (make sure the tweets corpus is chosen)
• Look at the different annotations in the default annotation set
• To see Tokens in hashtags, use the Annotation Stack view
Analysing tweets