View
1
Download
0
Category
Preview:
Citation preview
The International Journal of Information Retrieval Research is indexed or listed in the following: ACM Digital Library; Bacon’s Media Directory; Cabell’s Directories; DBLP; Google Scholar; INSPEC; JournalTOCs; MediaFinder; ProQuest Advanced Technologies & Aerospace Journals; ProQuest Computer Science Journals; ProQuest Illustrata: Technology; ProQuest SciTech Journals; ProQuest Technology Journals; Research Library; The Standard Periodical Directory; Ulrich’s Periodicals Directory; Web of Science; Web of Science Emerging Sources Citation Index (ESCI)
Editorial Preface
iv Vishal Bhatnagar, Ambedkar Institute of Advanced Communication Technologies and Research, Department of Computer Science and Engineering, Delhi, India
Research Articles
1 Offlinevs.OnlineSentimentAnalysis:IssuesWithSentimentAnalysisofOnlineMicro-Texts;
Ritesh Srivastava, University of Delhi, New Delhi, India
M.P.S. Bhatia, University of Delhi, New Delhi, India
19 SemanticAnalysisBasedApproachforRelevantTextExtractionUsingOntology;
Poonam Chahal, Manav Rachna International University, Faridabad, India
Manjeet Singh, YMCA University of Science and Technology, Faridabad, India
Suresh Kumar, Manav Rachna International University, Faridabad, India
37 FrequentItemsetMininginLargeDatasetsaSurvey;
Amrit Pal, Indian Institute of Information Technology, Allahabad, India
Manish Kumar, Indian Institute of Information Technology, Allahabad, India
50 Geo-TaggingNewsStoriesUsingContextualModelling;
Md Sadek Ferdous, University of Southampton, Southampton, United Kingdom
Soumyadeb Chowdhury, Singapore Institute of Technology, Singapore, Singapore
Joemon M Jose, University of Glasgow, Scotland, United Kingdom
72 HadoopMapOnlyJobforEncipheringPatient-GeneratedHealthData;
Arushi Jain, Ambedkar Institute of Advanced Communication Technologies and Research, Department of Computer Science and Engineering, New Delhi, India
Vishal Bhatnagar, Ambedkar Institute of Advanced Communication Technologies and Research, Department of Computer Science and Engineering, New Delhi, India
CoPyRightThe International Journal of Information Retrieval Research (IJIRR) (ISSN 2155-6377; eISSN 2155-6385), Copyright © 2017 IGI Global. All rights, including translation into other languages reserved by the publisher. No part of this journal may be reproduced or used in any form or by any means without written permission from the publisher, except for noncommercial, educational use including classroom teaching purposes. Product or company names used in this journal are for identification purposes only. Inclusion of the names of the products or companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark. The views expressed in this journal are those of the authors but not necessarily of IGI Global.
Volume 7 • Issue 4 • October-December-2017 • ISSN: 2155-6377 • eISSN: 2155-6385An official publication of the Information Resources Management Association
InternationalJournalofInformationRetrievalResearch
TableofContents
DOI: 10.4018/IJIRR.2017100104
International Journal of Information Retrieval ResearchVolume 7 • Issue 4 • October-December 2017
Copyright©2017,IGIGlobal.CopyingordistributinginprintorelectronicformswithoutwrittenpermissionofIGIGlobalisprohibited.
Geo-Tagging News Stories Using Contextual ModellingMd Sadek Ferdous, University of Southampton, Southampton, United Kingdom
Soumyadeb Chowdhury, Singapore Institute of Technology, Singapore, Singapore
Joemon M Jose, University of Glasgow, Scotland, United Kingdom
ABSTRACT
Withtheever-increasingpopularityofLocation-basedServices,geo-taggingadocument-theprocessofidentifyinggeographiclocations(toponyms)inthedocument-hasgainedmuchattentioninrecentyears.Therehavebeenseveralapproachesproposedinthisregardandsomeofthemhavereportedtoachievehigherlevelofaccuracy.Theexistingapproachesperformwellatthecityorcountrylevel,unfortunately,theperformancedegradesduringgeo-taggingatthestreet/localitylevelforaspecificcity.Moreover,thesegeo-taggingapproachesfailcompletelyintheabsenceofaplacementionedinadocument.Inthispaper,analgorithmispresentedtoaddressthesetwolimitationsbyintroducingamodelofcontextswithrespecttoanewsstory.Thealgorithmevolvesaroundtheideathatanewsstorycanbegeo-taggednotonlyusingplace(s)foundinthenews,butalsousingcertainaspectsofitscontext.Animplementationoftheproposedapproachispresentedanditsperformanceisevaluatedonauniquedatasetwherefindingssuggestanimprovementoverexistingapproaches.
KeywoRdSContextual Modelling, Evaluation, Geo-Tagging, Information Retrieval, Text Mining
INTRodUCTIoN
With the ever-increasing popularity of Location-based Services, geo-tagging a document - theprocessofidentifyinggeographiclocations(toponyms)inthedocument-hasgainedmuchattentioninrecentyears.Insuchservices,geographiclocationsactasthegluethatbindtogetherdisparatedocumentsets(suchastextualcontents,imagesandvideos)frommultipledatasources.Devicesthatproducemultimediadocumentssuchasimagesandvideosareequippedwiththecapabilitytohaveadditionalsensors(GPSsensors)thatcangeo-tagtherelateddocumentwithgeographicinformationsuchaslatitudeandlongitudeandtherespectiveinformationisstoredinametadataalongwiththecorrespondingdocument.Webservicesthataccumulatesuchdocuments(e.g.YouTubeandFlickr)canretrievesuchinformationautomatically.Inaddition,suchservicesallowanyusertomanuallytaganymultimediadocumentwithgeographiclocationsincasesthedocumentsarenotgeo-taggedbytheircapturingdevices.Unfortunately,thegeo-taggingprocedureisrathercumbersomefortextualdocumentsandgenerallyreliesonmanualhumaninput.Therehavebeenseveralworkstoaddress
50
International Journal of Information Retrieval ResearchVolume 7 • Issue 4 • October-December 2017
51
thislimitationandsomeofthemhavereportedtoachievehighlevelofaccuracyasreportedin(Ding,2000),(Amitay,2004),(Garbin,2005),(Lieberman,2007),(Andogah,2012)and(Ignazio,2014).
Aspartofalarge-scaleproject,wehavebeencollectingnewsstoriesaboutacountryfromthecountry-specificRSSfeedofdifferentonlinenewswebsitesonadailybasisforaroundayear.ThemainideaistoaggregatethisdatasetwithothermodesofpublicdatasuchassocialmediapostsfromTwitter;multimediadatafromimagesharingwebsitessuchasFlickranddatafromwearablesensorssuchaslifeloggersandGPStrackerstocreateauniquemulti-modal(textualaswellasmultimedia)setofdataaboutaparticulargeographiclocation.Thiswillencodeexperiencesfrommultipleuserperspectivesandhasenormouspotentialinexploitingforpublicbenefit.Oneofthecorechallengesfordealingwithsuchheterogeneoussetofdataistodefinetheparametersthatcanbeusedtolinkthemtogetherfordifferentuse-casescenarios.Amongseveralparameters,thespatio-temporalattributepairisthesimplestofchoicesduetotheiromni-presenceinallourdatasetsexceptinnewsstories.
Newsstories,mostlytextual,areequippedwithatemporalattribute(intheformofatimestamp)tohighlightthetimeanddateofpublication,however,lackanyaccompanyingmetadatatopublicisethespatialattribute,eventhougheverynewsgenerallyhasageographicfocusinit(Andogah,2012).Thelackofanyspatialattributemakesitachallengingtasktogeo-taganewsstoryinanautomaticfashion.Togeo-tagourcollectionofnewsstories,wehavebeenlookingforpubliclyavailablegeo-taggingAPIs.CLAVIN(CLAVIN,2016)andCLIFF(Ignazio,2014)and(CLIFF,2015)aretwosuchAPIs.
AfterutilisingCLAVINandCLIFFoverasubsetofournewsdataset,wehavenoticed thefollowingshortcomings:
• Theyfailtogeo-tagadocumentintheabsenceofdirectmentionsofalocation;and• Theyfailtocreateanassociationbetweenafine-grainedlocationandacityincasesaTextual
documenthasbeengeo-tagged.
Whatwemeanbyafine-grainedlocationisatthegranularityofastreetoralocalityinacity.AnexampleofalocalityisChelseainLondonandanexampleofastreetisKing’sRoadinLondon,UK.Withoutaproperassociationbetweensuchfine-grainedlocationsandacity,itopensupthedoorfordisambiguity,sincemanycitiesmaysharethesamenameforalocalityorastreet.Thereasonforourinterestinsuchfine-grainedlocationsisthatitallowsustolinksuchnewswithotherdatasets,especiallylifelogsandGPStrails,whicharesupplementedwithsuchfine-grainedgeo-information.
Inthispaperweinvestigatethewaystheabovementionedproblemscanberectified.Especially,weinvestigatehowamathematicalmodelofcontextwithrespecttoanewsstorycanbedevelopedandhowsuchamodelcanberelatedwithamathematicalmodelofgeo-tagginganditsalgorithmicimplementationtorectifysuchproblems.
Inparticular,weseekanswerstothefollowingresearchquestions:
[RQ-1]:Canwedevelopamathematicalmodelofcontextrelatedtonewsstoriesandrelatewithamathematicalmodelofgeo-tagging?
[RQ-2]:Canweexploitthenewscontextmodeltogeo-tagnewsintheabsenceofdirectmentions?[RQ-3]:Canweexploitthenewscontextmodeltogeo-tagnewsatstreetgranularitylevelconsidering
thedisambiguitythatmayoccur?
Withthisintroduction,thepaperisorganisedasfollows.WedescribetherelatedworkinSection:RELATEDWORK.Section:GEO-TAGGINGMODELLINGintroducesourmathematicalmodel
International Journal of Information Retrieval ResearchVolume 7 • Issue 4 • October-December 2017
52
of context and geo-tagging for a news story. In Section: IMPLEMENTATION, we discuss ourimplementationalongwiththealgorithmthatutilisesourmodelofcontextforgeo-tagginganewsstory.WedescribeourevaluationprocedureinSection:EVALUATIONandpresenttheresultsinSection:RESULT.Weanswerourresearchquestions,discusstheadvantagesandhighlightthelimitationsoftheproposedapproachinSection:DISCUSSION.WeconcludeinSection:CONCLUSION.
ReLATed woRK
Oneoftheearliestworksongeo-taggingtextualwebresourceswasreportedin(Ding,2000)wheretheauthorsintroducedheuristictechniquesforautomaticallydetectingthegeographicalscope(s)withintheresource.Thetechniquesreliedontheanalysisoftextualcontentsandexaminingthegeographicaldistributionofhyperlinkswithintheresources.Anevaluationoftheirreportwascarriedoutover150webresourcesandmorethan75%precisionandrecallwasreported.Finally,ageo-awaresearchenginewasdevelopedusingtheirproposedapproachtoshowthesuitabilityoftheirapproach.Theauthorsmainlyfocusedonthecitylevelgranularityanditwasnotinvestigatediftheapproachwouldbesuitableforstreet/localitylevelgranularity.
An influential work for geo-tagging web documents was presented in (Amitay, 2004). Thepaperdescribedadataminingapproachutilisingagazetteer (anatlasenlisting thenamesof allplaces)tolocateplacesmentionedwithinthedocumentaswellastodeterminethegeographicfocus,representingthebroaderlocalitysuchascitiesorstates,ofthedocument.Theauthorsalsodiscussedmechanismstoresolvetwotypesofambiguities:geo/non-geoandgeo/geo.Thefirstambiguitydepictsthescenarioswhenalocationnameissimilartoanynon-geographicname,e.g.Turkey,whereasthesecondambiguity(geo/geo)illustratesthescenarioswhenplacesindifferentcountriessharethesamename,e.g.London,EnglandandLondon,Canada.Basedontheevaluationover600webpages,theauthorsreportedaprecisionof82%forindividualgeo-tagsandaprecisionof91%indeterminingthegeographicfocusofthenews.Theirpaperalsodidnotinvestigateiftheapproachwouldbesuitableforstreet/localitylevelgranularity.
Oneofthemajorchallengesingeo-taggingadocumentistohandledisambiguity.Inthisregard,theauthorsin(Garbin,2005)presentedanapproachbasedonunsupervisedmachinelearningbyaggregating twopublicly available gazetteers. At first, ambiguous locations weredisambiguatedautomaticallybyapplyingpreferenceheuristicswhichactedasatrainingdatasetforthemachinelearner.Next,themachinelearnerwasusedtodisambiguateambiguouslocationsfromotherdata.Theirresultoftheirapproachwascomparedwithahuman-annotatednewscorpuscontaining7,739documentswith78.5%precision.
Liebermanet.al.presentedaSpatio-TextualsearchenginecalledSTEWARDwhichisasystemforgeo-tagging,determininggeographicfocus,querying,andvisualisinggeographiclocationsintextdocumentsin(Lieberman,2007).Theauthorsutilisedadocumenttaggertoextractpotentialreferencesoflocationswhicharethenresolvedtogeographiclocationsusingtwogazetteers.Toresolvedisambiguity,theyformulatedanalgorithmcalledPairStrengthAlgorithm.Thealgorithmutilisedthefrequencycountofambiguouslocations,theirdistancewithinthedocument,theirgeodesicdistanceandtheirpopulations.Thedocumentdistanceisameasurementofoffsetbetweenapairoflocationsfromthestartofthedocument.AnalgorithmcalledContext-AwareRelevancyDetermination(CARD)wasformulatedtodeterminethefocusofthedocument.Then,theproposedapproachwasappliedtodesignanddevelopasystemthatvisualisesthedocumentontoamapallowingdocumentretrievalbasedonspatio-textualqueries.Toresolvedisambiguity,wehaveadoptedasimilarapproachasdocumentdistanceforcalculatingthedistancebetweenacitynameandstreetname(calledvicinityscoreinourapproach,seeSection:IMPLEMENTATION)mentionedinthenews.
WiththeassumptionthateverysingledocumentinanIR(InformationRetrieval)Systemandthequerytoretrievesuchdocumentsfromthesystemhasageographicfocus,Andogahet.al.presentedanapproachfordetermininggeographicscopeofadocument(Andogah,2012).Then, thescope
International Journal of Information Retrieval ResearchVolume 7 • Issue 4 • October-December 2017
53
hadbeenutilisedfortoponymresolution,relevancerankingandqueryexpansion.ThegeographicscopewasdeterminedbyextractingnamedentitiesusingaNamedEntityRecognizer(NER)tool.Especially, theyexploited thehierarchicalstructureof thenamedplacesandpeople,particularlypoliticalandgovernmentleaders,todeterminethescopeofthedocument.Then,theyemployedaheuristicalgorithmforresolvingambiguity.Finally,theauthorsdescribedhowtheirapproachcouldbeusedforqueryexpansionandrelevanceranking.AnevaluationwascarriedoveradatasetcalledTR-CoNLLcontaining946documents (Leidner, 2006).They reportedanaccuracyof79%overmanuallyannotatedarticlesfromthedatasetforgeographicresolutionaswellasanaccuracyof71%-80%overmanuallyannotatedarticlesfortoponymresolution.
A recent work on geo-tagging the news article was presented in (Ignazio, 2014) where theauthorsextendedanexistinggeo-taggingAPIcalledCLAVIN(CLAVIN,2016)byapplyingafewheuristicsbasedonthemethoddescribedin(Amitay,2004).Theirapproachdeterminedthefocusofanewsarticleaswellasallplacesmentionedinthearticle.Theyreported95%accuracyoverasmallmanuallyannotateddatasetof75newsand90%-91%accuracyindeterminingfocusatthecountrylevelusingseparate10,000samplesfromtheNewYorkTimesAnnotatedCorpus(Sandhaus,2008)andReutersRCV-1Corpusrespectively(Sandhaus,2004).
ThereareseveralcommercialAPIsavailableforgeo-taggingtextualdocumentssuchasnewsstories,e.g.OpenCalais(OpenCalais,2016),Placespotter(Placespotter,2016)andGeoTag(GeoTag,2016),however,wehavebeenlookingforpubliclyavailableAPIssothattheycanbeextendedtomeet our requirements. We have found two such APIs, namely, CLAVIN (CLAVIN, 2016) andCLIFF(CLIFF,2016).Betweenthesetwo,CLIFFisbasedonCLAVINandhasextendedCLAVIN’scapability.Moreover,ithasbeenreportedtoachievebetterperformancethanCLAVINin(Ignazio,2014). Therefore, we have selected CLIFF for our experiment. CLIFF utilises Stanford NER(StanfordNER,2016)toextractnamedentitiesandthenappliesafewheuristicstogeo-tagthenewsandtodetermineitsgeographicfocus.EventhoughCLIFFcanidentifyastreet/locality,itdoesnotassociateastreetwithacity.Inaddition,alltheworksdiscussedabovemainlyfocusedeitheronacountryoracitylevelanddidnotinvestigateiftheirapproachwouldbesuitableforstreetorlocalitylevelgeo-tagging.Furthermore,theirapproachwouldfailintheabsenceofdirectmentionsoflocations.
Havingbeeninspiredbytheworkof(Lieberman,2007)and(Ignazio,2014),wewouldliketoinvestigateifthementionedshortcomingscanbehandledandtheeffectivenessofgeo-taggingcanbeimprovedbyintroducingandexploitingamathematicalmodelofcontextandgeo-tagginginrelationtoanewsstory.Toourknowledge,thisisthefirstattempttoformaliseacontextwithrespecttoanewsstoryandthentoapplythatcontextforgeo-taggingnewsstories.
Geo-TAGGING ModeLLING
Inthissection,wedefineourmathematicalmodelofcontextofanewsstory.Atfirst,wedefinethetermContext.Next,wedefineamodelofspatiallocationinformation.Finally,werelatecontextualinformationwithspatial location informationbyformallydefining theprocessofgeocodingandgeo-tagging.
Contextual InformationTheterm“Context”(alsoknownascontextualinformation)hasbeenpopularisedfromthedomainofContext-awareservices.Interestingly,whatthetermContextmeansinhighlydebatedandithasbeendefinedinnumerousways.SchilitandTheimerusedthetermcontext-awareforthefirsttimein(Schilit,1994)wheretheydescribedcontextsaslocations,identitiesofnearbypeople,objectsandchangestothoseobjects.Similarly,Ryanetal.regardedcontextsastheuser’slocation,environment,identityandtime(Ryan,1997).Hulletal.representcontextsasdifferentaspectsofthecurrentsituation(Hull,1997).Oneofthemostaccurateandwidely-useddefinitionsisgivenbyAbowdetal.whereacontexthasbeendescribedas“anyinformationthatcanbeusedtocharacterizethesituationofentities
International Journal of Information Retrieval ResearchVolume 7 • Issue 4 • October-December 2017
54
(i.e.,whetheraperson,placeorobject)thatareconsideredrelevanttotheinteractionbetweenauserandanapplication,includingtheuserandtheapplicationthemselves”(Abowd,1999).
InthedomainofInformationRetrieval,acontextisusedtodefinethestateofauser(includingtheuser’sspatio-temporalattributes,interests,previousretrievalhistoriesandsoon)andsuchinformationcanbeusedasaknowledgebaseinordertoachievehigheraccuracyduringinformationretrieval(Gross,2002).Interestingly,inaninformationretrievalsystemwhichmainlydealswithnewsstories(asinours),howanewsstoryisstructuredcanoffervaluableinsights,whichcanbesupplementedwiththecontextofausertoachievehighlyaccurateretrievalresults.
Anewsstoryoutlinesafactualstorybyanswering5corequestionsofwho,what,where,whenandwhy(Errico,1997).Oftheseanswers,theanswerofwhohighlightsthenamedentitiessuchaspeopleororganisationsmentionedinthenewswhereastheanswersofwhenandwherehighlightthetemporalandaspatiallocationinformationrespectivelyregardingthenews.Theanswersofwhatandwhyrepresentthecontentsofthenewsandcanbeusedtoclassifythenews.Thecombinedanswersofthesequestionsessentiallydefinethestateofanews,muchlikethewayacontextdefinesthestateofauser.Thismotivatesustodefinethecontextofanewsinthefollowingway:
Definition 1:Thecontextofanewsstoryconsistsofthenamedentitiesandthetemporalandaspatiallocationinformationmentionedinthenewsaswellasthecategoriesdepictingtheclassificationofthenews(seeFigure1).
Itisimportanttounderstandthatthelocationinformationinsuchacontextmerelyrepresentsanaspatiallocationallydescriptivetext,usuallyidentifiedbyaNamedEntityRecognizer(NER)andhence,isnotavalidspatialrepresentationofalocation.
Figure 1. Modelling contexts of a news
International Journal of Information Retrieval ResearchVolume 7 • Issue 4 • October-December 2017
55
Mathematically,letP denotethesetofpeople, LN denotethesetofaspatiallocationnames,T denotethesingletonsetdepictingatimestampandO denotethesetoforganisations.Then,thecontextofanewscanbedefinedusingthefollowingset:
INFO P LN T OCONTEXT = ∪ ∪ ∪
where,P P⊆ , LN LN⊆ andO O⊆ .
Spatial Location ModellingAspatiallocationdescribesthephysicallocationofalocationnameandisrepresentedusinggeospatialcoordinatessuchaslatitudeandlongitude(Research,2016).Tomodelaspatiallocation,weassumeatreestructurewheretheworldistherootofthetreeandallcountriesintheworldisitsfirstlevelchildren.Acountryisassumedtobedividedindifferentcitiesconsistingofroads,localitiesandpostcodes.Themodelispresentedbelow.
LetC denotethesetofallcountriesintheworldandCITYc denotethesetofallcitiesinacountry c C∈ .Wedefinethesetofallroadsin city CITYc∈ ofacountry c with Rcity .Eachcountrydefinesitsownformatofapostcode.Withoutspecifyingwhatthatformatis,weassumethatapostcodeisassignedtoacollectionofoneormoreroadswithinaspecificcity.WedenotethesetofpostcodeswithPCcity forcity CITYc∈ .Torelateapostcodewiththecorrespondingroads,wedefinethefollowingfunction.
Definition 2:Let pcToRoads PC P Rcity city: → ( ) bethefunctionthatreturnsthesetofroads
whicharepartofthatpostcodein city CITYc∈ .
Inversely,wecandefineanotherfunctioncalled, roadToPC whichgivenaninputofaroadwillreturnthepostcodeofthatroad.Formally:
Definition 3:Let roadToPC R PCcity city: → bethefunctionthatreturnsthepostcodeofthatroadin city CITYc∈ .
Forexample,ifweassumethatthepostcode p PCcity∈ isassignedforroads r1 , r2 and r3 inthe city CITYc∈ ,then:
pcToRoads p r r r( ) = { }1 2 3, ,
roadToPC r p roadToPC r p and roadToPC r p1 2 3( ) = ( ) = ( ) =,
Anexampleofaroadalongwithitspostcodeis176King’sRoad(depictingtheroad),SW34UP(depictingthepostcode)incityLondon,UK.
Next,wedefinealocalityasthecollectionofseveralpostcodeswithinacityanddenotethesetof localitieswithinacityas LOCcity .Likeabove,wedefinethefollowingfunctionstorelateapostcodewithalocalitywithinacity.
International Journal of Information Retrieval ResearchVolume 7 • Issue 4 • October-December 2017
56
Definition 4:LetlocToPC LOC P PCcity city: ( )→ bethefunctionthatreturnsthesetofpostcodeswhicharepartofthatlocalityin city CITYc∈ .
Definition 5:Let pcToLoc PC LOCcity city: → bethefunctionthatreturnsthelocalityofthatpostcodein city CITYc∈ .
Forexample,ifweassumethatalocalityloc LOCcity∈ consistsof p1 , p2 and p3 postcodesin city CITYc∈ ,then:
locToPC loc p p p( ) = { }1 2 3, ,
pcToLoc p loc pcToLoc p loc and pcToLoc p loc1 2 3( ) = ( ) = ( ) =,
AnexampleofalocalityinLondon,UKisChelseaconsistingofseveralpostcodesandoneofitspostcodesisSW34UP.Itmayhappenthatforacountryorevenforacityinacountrythereisnodefinedlocality.Insuchcases,thelocalitysetforthatcity LOCcity( ) willrepresentanemptyset.
Finally,wedefinealocationasanorderedpairconsistingofaroad,alocality,apostcode,acityandacountry.Formally,thesetoflocationsforacity city CITYc∈ incountry c C∈ isdenotedas Lcity andisdefinedas:
L r loc pc city ccity = ( ){ }, , , ,
where:
r R loc LOC pc PC city CITY and c Ccity city city c∈ ∈ ∈ ∈ ∈, , ,
Anexampleofanelementofthelocationsetcanbegivenasfollows:
l r pcToLoc roadToPC r roadToPC r city c= ( )( ) ( )( )1 1 1, , , ,
where l Lcity∈ , roadToPC r1( ) resolves the rode to the corresponding postcode and
pcToLoc roadToPC r1( )( ) resolves the road to the postcode which is then resolved to thecorrespondinglocality.Anexampleofalocationwherealocalityisdefinedis:176King’sRoad,Chelsea,SW34UP,London,UK.Anotherexampleofalocationwherealocalityisnotdefinedis:29EthelbertRoad,CT13NF,Canterbury,UKwhereCanterburyisacity,notalocality,intheUK.
Then,wecandefinethesetoflocationsforacountryc C∈ astheunionofthelocationsetforallcitiesinthatcountryandtheuniversallocationset(denotedasL )canbedefinedastheunionofthelocationsetofallcountriesintheworld.Formally:
L L city CITY and L L c CC city c C= ∪ ∈ = ∪ ∈{ | } { | }
International Journal of Information Retrieval ResearchVolume 7 • Issue 4 • October-December 2017
57
Theremayexistdifferentwaysaspatialinformationcanberepresented.Inthispaper,weassumethataspatialinformationisrepresentedusingalocationasdefinedaboveandageographiccoordinateasdefinednext.
Ageographiccoordinateconsistsofalatitudeandlongitude.WedenotethesetoflatitudesasLAT andthesetoflongitudesasLON .Formally,ageographiccoordinateisdenotedasCOORD andisdefinedasanorderedpairoflatitudeandlongitude:
COORD lat lan lat LAT and lon LON= ( ) ∈ ∈{ , | }
Wedenotethesetofspatialinformationas INFOSPATIAL anddefineitasanorderedpairoflocationandcoordinatesinthefollowingway:
INFO Bcoord B L and coord COORDSPATIAL = ( ) ∈ ∈{ , | }
Geocoding and Geo-Tagging Modelling
Torelatecontextualinformation INFOCONTEXT( ) withspatialinformation INFOSPATIAL( ) ,wedefinetheconceptofgeocoding.In(Goldberg,2008),Geocodingisdefinedas:“theactoftransformingaspatiallocationallydescriptivetextintoavalidspatialrepresentationusingapredefinedprocess”.Formally,wedefinethegeocodingprocessasafunctionasfollows:
Definition 6:Let geocoding C INFO T INFOCONTEXT SPATIAL: ( \ )× → bethefunctionthat,giventheinputsofacountryandanycontextualinformationexceptatimestamp,returnsthespatialinformationforthatcontextwithinthatcountry.
Thecountryintheinputofthe geocoding functionisusedtorestrictthegeographicfocusontoaspecificcountryandisutilisedbythegeocodingservicessuchasGeocode.FarmAPI(Geocode,2016).
AsmentionedinSection1,ageo-taggingprocessidentifieslocationswithinadocument.Thisismerelyaninformalambiguousdefinitionasalocationinformationcanbeaspatial(asdefinedinINFOSPATIAL orspatial(asdefinedin INFOCONTEXT ).Arigorousformaldefinition,hence,shouldeliminatesuchambiguity.Furthermore,toresonatewiththescopeofthispaper,wemainlyfocusongeo-taggingnewsstories.Withthisgoalinmind,wedefinethegeo-taggingprocessforanewsstoryasafunctioninthefollowingwaywhereN isthesetofnewsstories:
Definition 7:Letgeo tagging N C P INFOnews SPATIAL− × →: ( ) bethefunctionthat,giventheinputsofanewsandacountry,returnsthesetofspatiallocationinformationwhichareidentifiedinthatnewsandlocatedwithinthatcountry.
AsinDefinition6,thecountryinputinthe geo taggingnews− functionisusedtorestrictthegeographicfocusontoaspecificcountry.
SUMMARy
Ourmodelofgeo-tagginganewsstoryissummarisedinFigure2.Inessence,ourmodelconsistsofdifferentatomicsets(P ,LN ,O ,R ,PC andsoon)andcombinedsets( Lcity , INFOCONTEXT ,
International Journal of Information Retrieval ResearchVolume 7 • Issue 4 • October-December 2017
58
INFOSPATIAL andsoon)whichareusedtorepresentdifferententities,attributesandgeographicalproperties.Forexample, INFOCONTEXT encodesthenotionofcontextualinformationwithrespecttoanewsstorywhereas INFOSPATIAL representsaspatiallocationofacitywithinacountryalongwith its coordinates.Themodel alsoconsistsofdifferent functions thatdefine the inter-relationbetweenthespecifiedsetsusingmathematicalfunctions.Amongalldefinedfunctions,geocoding andgeo taggingnews− areofparticularinterest.Thegeocoding functionillustratestheconceptofconvertingacontextintoaspatiallocationsconsistingofthecorrespondingcoordinates.Ontheotherhandthe geo taggingnews− functionillustratestheconceptoftagginganewswithaspatialinformationandprovidesapowerfulabstractionthathidesawaytheinternalgeocodingprocess.
These twofunctions, in reality,provide theblue-print fordevelopingalgorithms thatcanbeutilisedtoimplementanimprovedgeo-tagger.Inthenextsection,weelaboratehowwehaveachievedthesegoals.
IMPLeMeNTATIoN
Toutilisethemodelofcontext,anapplication,calledGeo-Tagger,hasbeendesignedandimplemented.Theapplicationutilisesanalgorithm,calledGeo-Taggingalgorithm(seebelow),whichrevolvesaroundtheideathatthereareotheraspectsofacontext,apartfromalocation,thatcanbeusedtogeo-taganews.Especially,weseektoexploitinformationregardingorganisationsfoundinanewsstoryforgeo-taggingthestory.Insuch,ourideagoesbeyondthetechniquesutilisedbyexistinggeo-taggingtoolssuchasCLAVINandCLIFFwhereonlylocationinformationwasexploitedforgeo-tagging.Otheraspectsofacontextsuchaspeopleandtimestamphavenotbeenconsideredforourcurrentimplementation.Thisisbecauseatimestamphasnogeographiclocationassociatedwith
Figure 2. Geo-tagging model
International Journal of Information Retrieval ResearchVolume 7 • Issue 4 • October-December 2017
59
it.Furthermore,eventhoughpeople,particularlypoliticalleaders,havebeenexploitedingeo-tagginganewsin(Andogah,2012),wearguethatthisisquitetrickyandcanbesusceptibletoerrors,sincethelocationofapersonisnotstationary.
The architecture of the Geo-Tagger is illustrated in Figure 3. The application relies on thefollowingcomponents:
• TheStanfordNamedEntityRecognizer (NER),which isused toextractnamedentitiesandaspatial location information such as locations, persons and organisations within a news(StanfordNER,2016)representingthesetofcontextualinformation INFOCONTEXT( ) ;
• Geocode.FarmRESTAPI(Geocode,2016),whichhasbeenusedtogeocodeanamedentity(consideredasanimplementationofthe geocoding function);
• Adatabasefromwhereanewstoryisretrievedandtowheretheresultaftergeo-taggingthenewsisstored;
• Aninputfilecontainingboundingboxcoordinatesofthecorrespondinggeographiclocation;and• TwoalgorithmscalledGeo-TaggingAlgorithmandAmbiguityResolutionAlgorithm(ARAin
short)describedbelow.
Sinceournewscollection(denotedas N )primarilyconsistsofnewsstoriesfromaspecificcountry,themainfocusoftheGeo-Taggeristoidentifylocationswithinthatcountry.Theboundingboxcoordinatesspecifiedintheinputfileisusedtofilteroutanyotherlocationsoutsideofthisboundingbox.Inthisway,theboundingboxcoordinatesintheinputfileactsasthegeographicfocusspecifiedbytheuserandrepresentsacountry c C∈ cforthe geo taggingnews− function.Thisisincontrastwithanyexistingapproachwherethegeographicalfocusisdetectedautomatically.TheadvantageofourapproachwillbediscussedinSection:Advantages.
Theflowforgeo-tagginganewsstoryisasfollows.TheuseroftheGeo-Taggerinputstherequiredboundingboxcoordinatesintotheinputfile.AnewsisfetchedfromthedatabaseandispassedintotheGeo-Taggingalgorithm(Algorithm1)alongwiththeinputfile.Thealgorithmprocessesthefileandoutputsthelocationsalongwiththeircoordinates.Thisinformationisthenstoredbackintothedatabasewithareferenceofthenews,indicatingthatthecorrespondingnewshasbeengeo-tagged.
Figure 3. Architecture of geo-tagger application
International Journal of Information Retrieval ResearchVolume 7 • Issue 4 • October-December 2017
60
The geo-tagging algorithm (Algorithm 1) essentially is an implementation of thegeo taggingnews− function.Ittakesanews n N∈( ) andaboundingboxofacountry(representingc C∈ ).Internally,itexploitsthecontextualinformationextractedfromnamedentitiesusingtheStanford NER representing the INFOCONTEXT set. At the first phase, all aspatial locationsLN INFOCONTEXT⊆( ) areprocessedonebyone.Ifsuchalocationpresentsacityhavingthe
coordinates(retrievedusingtheGeocode.FarmAPI)withinthespecifiedboundingbox,thedocumentisgeo-taggedwiththelocation.Ifthelocationrepresentseitheralocalityorastreet,thenthelocationisfedintotheARA(Algorithm2).Thisisbecausesuchalocationmayexistinmorethanonecitywithinthespecifiedboundingbox.TheARAmayreturnaproperlydisambiguatedlocationandifso,thenewsisgeo-taggedwiththelocation.TheARAmayalsoreturnalistoflocationsindicatingthelocationsinthenewsarestillambiguous.Insuchcases,thenewsisgeo-taggedwiththelocationalongwiththisambiguitytagandallmatchedcitiesandisstoredinthedatabase.Atthesecondphase,allorganisations O INFOCONTEXT⊆( ) areextractedfromthenewsusingtheStanfordNER.ThecoordinatesofeachorganisationisthenretrievedusingGeocode.FarmAPIandiftheyarewithintheboundingbox,thenewsisgeo-taggedwiththelocationoftheorganisation.
Algorithm1.Geo-taggingalgorithm
Input: a news story n N∈( ), input file (representing c C∈ )
Output: spatial locations identified in the news (a subset of INFOSPATIAL )1: → use Stanford NER to extract the set INFOCONTEXT( ) of named entities within the news 2: → extract locations LN INFOCONTEXT⊆3:loop for ln LN∈4: if ln represents a city CITYc∈ then5: → utilise the Geocode.Farm API by passing ln and c and retrieve coordinates i INFOSPATIAL∈( ) of ln within c.6: → geotag n with i7: else if ln∈LOCcity or ln∈Rcity (i.e. ln representing a locality or a road in a city) then
8: → call ARA passing the location ln( ) and the named entities INFOCONTEXT( ) as inputs and retrieve coordinates i INFOSPATIAL∈( )9: if i null≠ then10: → geotag n with i11: end loop12: → extract the set of organisations locations O INFOCONTEXT⊆13: loop for o O∈14: →utilise the Geocode.Farm API by passing o and c and retrieve coordinates i INFOSPATIAL∈( ) of o within c15: if i null≠ then16: → geotag n with i17: end loop
International Journal of Information Retrieval ResearchVolume 7 • Issue 4 • October-December 2017
61
Algorithm2.Ambiguityresolutionalgorithm
Input: an aspatial location ln( ) , a news n N∈( ), named entities from the news INFOCONTEXT( ) and a country c C∈Output: a set of spatial locations i INFOSPATIAL⊆ containing either one or multiple locations 1: → retrieve i INFOSPATIAL'⊆ the set of coordinates for ln in c using the Geocode.Farm API 2:loop for i i∈ '3: if coordinates in i belong to a city CITYc∈ then
4: → return i associating ln with city5: else if coordinates in i belong to different cities within country c then6: → extract city locations LN INFOCONTEXT⊆7: if only one city is found LN =( )1 then
8: → return i associating ln with city9: else if more than one match is found LN >( )1 then
10: → match between cities extracted in ln and i11: if only a single match is found then12: → return i associating ln with city13: else if more than one match is found then14: → determine the vicinity between ln and the matched cities 15: → choose city having the lowest vicinity score16: → return i associating ln with city17: → if more than one cities having the same vicinity score then
18: → store all cities in a list l1( )19: else if no city is found or list l1 exists then20: → extract O INFOCONTEXT⊆21: → retrieve locations for each organization using similar heuristic and return i associating ln with city22: end loop23: →return l1
TheAmbiguityResolutionAlgorithm(ARA)retrievesthecoordinatesofalocationusingtheGeocode.FarmAPI.Ifthelocation( ln LN∈ ,mainlyalocalityorastreet)isresolvedtoonlyonecityhavingthecoordinateswithintheboundingbox,thelocationisassociatedwiththecityandthenewsisgeo-taggedwiththelocationalongwithitscoordinates.Ifthelocationisresolvedtomorethanone city, cities residingwithin theboundingboxare selected.Then, aspatial city locationsLN INFOCONTEXT⊆( ) fromthenewsareextractedusingtheStanfordNER.Twolistsofcities
arematched.Ifonlyonematchisfound,thelocationisassociatedwiththematchedcity.Ifmorethanonematchisfound,thismeansthatthelocationmayresideinanyofthematchedcities.Inthe
International Journal of Information Retrieval ResearchVolume 7 • Issue 4 • October-December 2017
62
nextstep,avicinityscorebetweenthelocationandmatchedcitiesiscalculated.Avicinityscoremeasuresthedistancebetweenthelocationandthematchedcitiesmentionedwithinthenews.Thedistanceitselfismeasuredbycountingthenumberofwordsbetweenthelocationandthecity.Theintuitionisthatalocalityorastreetisgenerallymentionedalongwithitscityinthesamesentenceand/orinthesameparagraph.Thelocationisassociatedwiththecityhavingthelowestvicinityscore.Ifmorethanonecityhasthesameorreasonablyclosevicinityscore,allofthemarestoredinalist(called l1).Asthelastresortforambiguityresolution,allorganisationsO INFOCONTEXT⊆ areextractedfromthenewsusingtheStanfordNER.CoordinatesforeachorganisationareretrievedusingtheGeocode.FarmAPI.Citieshavingthecoordinateswithintheboundingboxarechosenandarematchedagainstthecitieslistedin l1.Ifonlyonematchisfound,thelocationisassociatedwiththecityandthenewsisgeo-taggedwiththelocation.Thereasonforthisisthatoftenanewshavingmentionedalocalityorastreetmaydiscussaboutanorganisationfromthesamecity.However,thismaynotbealwaystrueandthusmayresultinmorethanonematch.Thismeansthatthelocationisstillambiguousandhence,alistofambiguouslocationsisreturnedbythealgorithm.
eVALUATIoN
The studywith thehighest number evaluateddocumentswas reported in (Ignazio, 2014)whichutilisedNewYorkTimesAnnotatedCorpus(Sandhaus,2008)andReutersRCV-1Corpusrespectively(Sandhaus,2004),bothannotatedatcountryandcitylevel.However,theevaluationwasnotconductedforfinergranularity(e.g.street/localitylevel).Toourknowledge,thereisnopubliclyavailabledatasetwhichisannotatedatstreetorlocalitylevel.Hence,acomparativeevaluationwithsuchdatasetswasnotperformed.Instead,wehavedesignedauserstudywithasmallerdatasetfurtherdiscussedinSection:OurDataSetbelow.Thisdatasetwasgeo-taggedbytheGeo-Taggerapplicationandusedfortheuserstudy.Themaingoalofthestudyistodemonstratetheeffectivenessofthealgorithmaswellastoidentifyitslimitations.Thedatasetgenerationprocedure,theuser-studyanditsprotocolsarepresentedbelow.
Reuter data SetForalargescalecomparison,wecomparedthecountrylevelresultidentifiedbyGeo-TaggerwiththereportedresultbyCLIFFusingtheReuterRCV-1Corpus(Sandhaus,2004).TheRCV-1corpusconsistsofover800,000newswhereeachnewsincludesacountrytag.Fromthiscollection,asampleof10,000newswasrandomlyselectedrepresentingourReuterdatasetwhichwerethengeo-taggedusingCLIFFandtheGeo-Tagger.
our data SetOurcollection( n )ofnewsstoriesconsistsofmorethan11,000newsretrievedfromdifferentnewswebsitesforaroundaperiodofoneyear,asdiscussedinSection:INTRODUCTION.Atfirst,wehaveutilisedtheCLIFFAPItogeo-tagallnewsstoriesin n generatingtwodatasets:thefirstset,S1 ,consistingofnewsforwhichCLIFFhasfailedtogenerateanylocationandthesecondset, S2 consistingofnewsthathavebeengeo-taggedbyCLIFF.Then,fromS1 ,wehavechosen100randomnewsrepresentingourfirstevaluationdataset(denotedasEVAL SET−
100)andfromS2 wehave
selected200randomnewsrepresentingoursecondevaluationdataset(denotedasEVAL SET−200
).Next, our application has been utilised to geo-tag these 300 news in EVAL SET−
100 and
EVAL SET−200
whicharethenusedtocarryoutthefollowinguserstudy.
International Journal of Information Retrieval ResearchVolume 7 • Issue 4 • October-December 2017
63
User StudyTheweb-baseduserstudyconsistedof6subjects(Female:2;Male:4;agerange:25-35years).Thesubjectswererecruitedbysendinganemailto6differentresearchgroups.SinceEVAL SET−
100
and EVAL SET−200
primarilyconsistedofnewsfromaspecificgeographiclocation,oneoftherecruitmentconditionswasthatthesubjectneededtohavegoodfamiliarityregardingthatgeographiclocation.Thiswastoensurethatthesubjectcanidentifythelocationappropriatelywithinthatspecificregionandcanassociatethestreetwithacitywithhigherconfidence.Wereceived8responseswhoagreedtovoluntarilytakepartinourstudy.Amongthem,wechose6,sincetheothertwosubjectshavenotlivedinthatgeographiclocationformorethanayear.Thechosensubjectslivedintothatgeographiclocationformorethanthreeyears,whichensuredthattheyhavehigherfamiliaritywiththelocation.Moreover,only2subjectswereexpertsinthefieldofIR(subjectarearesearchedinthispaper),2wereComputingScienceresearchersand2wereconductingtheirresearchinEngineering.Wedidnotchooseacrowdsourcedevaluationsinceitwouldbedifficulttoensurethegeographicalknowledgeoftheevaluators,whichissignificanttofulfiltheobjectiveoftheevaluation.
Inordertoevaluatetheevaluationdatasets,weassigned100newsstoriestoeachsubject,whichensuredthateachstorywasevaluatedby2subjects.Inourstudy,eachsubjectwasgivenalinktotheweb-basedstudy,andanaccesskey.Uponaccessingthestudy,alistof100storieswasdisplayedtothem.Uponclickinganewsstory,thetitleandthenewsstorywasdisplayed,withthreequestions.Thesubjectswereaskedtofirstreadthewholecontent.Then,theywererequiredtocompletethefollowingtasks(i.e.toanswerthequestions):
T1: Ifanysortof locationinformation(i.e.nameofacity,country,etc.)relevant to thenewsispresent in thenews story?T1helpedus togather statistics related to thenumberof storieshavingnolocationinformation.Theseresultswillbefurtherusedtoevaluatetheeffectivenessofourapproach,i.e.abilitytofindageo-locationintheabsenceofdirectionmentionsinastory.
T2:Ifastreetname(e.g.JamaicaStreet)oralocalityinformation(e.g.Chelsea)relevanttothenewsispresentexplicitlyinthenews?T2gatheredstatisticsrelatedtothenumberofstorieshavinggranularlocationinformation(streetname,localityetc.).Theseresultswillbefurtherusedtodemonstratetheeffectivenessofourapproachinabsenceofdirectmentionsofastreet/localityinastory.
T3:Thesubjectswereshownalistoflocationsgeo-taggedbyGeo-Tagger,andaskedtochoosetheirrelevantones.T3helpedtofinderroneousgeo-taggedlocations.Theseresultswillbefurtheranalysedtoidentifythelimitationoftheproposedalgorithm.
Uponcompletingalltasks,thesubjectswerecontactedtovoicetheiropinionsontherelevanceofthelocationspresentedtothemduringthestudy.Theresultsobtainedduringthestudyandtheiranalysisalongwiththecollecteduseropinionsarediscussednext.
ReSULT
Inthissection,wepresentourevaluationresults.Atfirst,wepresenttheresultsofourstudywheretheindependentvariablesareuserresponses,CLIFFandGeo-Taggerapplicationwhereasthedependantvariablerepresentstheeffectivenessforeachindependentvariable.Then,wepresentacomparativeresultbetweenCLIFFandGeo-Taggeroverasampleof5,000newsstoriesfromRCV-1corpus.
T1 and T2Atfirst,weanalysetheresultswithrespecttoT1andT2.Theagreementlevel(i.e.bothsubjectshavingthesameopinion)inrelationtoT1was100andforT2was98.Thedisagreementlevelin
International Journal of Information Retrieval ResearchVolume 7 • Issue 4 • October-December 2017
64
thecaseofT2isattributedtothefactthatsomesubjectsconsideredahighwayzone(e.g.A390)asapartofthestreetinformation,butothersdidnot.Wealsofoundslightdisagreementwithrespecttolocality.Thehighagreementlevelisattributedtooursubjectdemographics(knowledgeaboutlocation),whichindeedmadetheresultsobtainedfromtheevaluationvalid.
ThecomparisonofCLIFF,Geo-TaggerandUserevaluationovertheevaluationdatasetwithrespecttoT1andT2ispresentedinTable1alongwiththecorrespondingstandarddeviation(SDcolumninTable1)andillustratedinFigure4.From EVAL SET−
100,theGeo-Taggerisableto
geo-tag78%ofthenewsstories.Thismeansthatfortheevaluationdatasetcomprisingof300news,thesuccessratio(indicatingthetotalnumberofnewssuccessfullygeo-taggedwithrespectto300news)forCLIFFis67whereasfortheGeo-Tagger,itis93%-animprovementof26%.
The results from the Geo-Tagger and the user response with respect to T1 and T2 are alsocompared.Thesubjectsidentifiedlocationsin260newsoutof300(87%).Thisisduetothefactthatsubjectshavebeenabletoidentifylocationsfromthenameoforganisationsbecauseoftheirlocalknowledge.However,theyfailedtoidentifyasmanylocationsastheGeo-Tagger(87%vs.93%)sincetheycouldnotidentifythelocationsofsomeorganisations,e.g.primaryschools,pubs,etc.Ontheotherhand,thesubjectsidentifiedstreets/localitiesin143newsoutof300(48%)andhenceperformedbetterthanCLIFF(15%vs48%).Thisisattributedtotheirlocalgeographicalknowledgeallowingthemtoidentifymanystreetsorlocalities.However,theperformanceofthesubjectswasstillinferiorcomparedtotheGeo-Taggerwhichidentifiedstreets/localitiesin233newsoutof300(78%).ThereasonforthisisthattheGeo-Tagger,intheabsenceofastreet/localityname,usedthenamesoftheorganisationstoidentifystreetsorlocalities.Eventhoughthesubjectshadlocalknowledge,theycouldnotresolvethestreet/localityinformationforsomeorganisations.Inthewordsofonesubject:
Table 1. CLIFF vs. geo-tagger vs. user response
CLIFF Geo-Tagger User SD
T1:Location 67% 93% 87% 14%
T2:Street/Locality 15% 78% 48% 32%
Figure 4. Result plot for T1 and T2
International Journal of Information Retrieval ResearchVolume 7 • Issue 4 • October-December 2017
65
“Ididnotfindanylocationinthenews.IunderstoodthatthesystemidentifiedthelocationusingXPrimarySchoolmentionedinthenews.However,Ihavenoideawherethisschoolis”.Anothersubjectsaid:“Evenifthenewsdidnothavealocation,Iwaspresentedwithlocationnames,whichseemedrelevanttothenews,forexample,containedthelocationoftheorganisationsheriffcourtpresentedinthenews.”AnotherinterestingobservationisthecomparisonofT1andT2withrespecttoGeo-Tagger.Geo-Taggerwasablefindlocationsin93%ofnewswhereasitcouldfindstreets/localitiesinonly78%ofnews.Thereasonisthatthereweremanynewsinwhichtherewasnomentionofstreetorlocalities.InmanysuchscenariosGeo-Taggerwasabletoidentifystreets/localitiesusingthelocationsoftheidentifiedorganisationsinthenews.However,thereweresituationswheretheStanfordNERcouldnotfindanyorganisationwhichresultedwithnewsnotgeo-taggedwithstreetsorlocalities.
Anon-parametricKruskal-Wallistestisusedtoexaminethesignificanceoftheresultsobtainedfrom the three independent groups (UserResponses,Geo-Tagger andCLIFF), in relation to thenumberofnewsstorieshavinglocationinformationinthem(T1).Thetestresultsshowedsignificantdifferences χ 2
75 172 2 0 001= = <( ). , , .df p .AMann-Whitneytest(post-hoc)wasconductedtofollowupthefindingsbyapplyingaBonferronicorrection,toreportalltheeffectsata0.016levelofsignificance.Thiscorrectionisusedtoreducethechancesofobtainingfalse-positiveresults(typeIerrors),whenmultiplepair-wisetestsareperformedonasinglesetofdata(especially,whenanon-parametrictestisusedasapost-hoctest).WeperformedaBonferronicorrection,bydividingthecriticalpvalue α =( )0 05. bythenumberofcomparisonsbeingmade(i.e.3).Thepost-hoctest
resultsshowedsignificantdifferencesbetweenallpairs p <( )0 001. ,except,UserResponsesand
Geo-Tagger ( p =( )0 018. ). Hence, the statistical test also suggests that the Geo-Tagger issignificantlyeffective(itsabilitytofindlocationsinanewsstory)thanCLIFF.
Anon-parametricKruskal-Wallistestisusedtoexaminethestatisticalsignificanceoftheresultsobtainedfromthethreeindependentgroups,inrelationtothenumberofnewsstorieshavingstreetl o c a t i o n s i n t h e m ( T 2 ) . T h e t e s t r e s u l t s s h owe d s i g n i f i c a n t d i f fe r e n c e sχ 2
233 20 2 0 001= = <( ). , , .df p .AMann-Whitneytest(post-hoc)wasconductedtofollowup the findingsbyapplyingaBonferroni correction, to report all the effects at a0.016 levelofsignificance.Thepost-hoctestresultsshowedsignificantdifferencesbetweenallpairs p <( )0 001. .TheseresultssuggestthattheGeo-Taggerissignificantlyeffective(infindingstreets/localitiesinanewsstory)thanCLIFF.
Insummary,theGeo-TaggerhaseffectivelyfoundspatiallocationsbetterthanCLIFFevenintheabsenceofsuchinformationinanewsstoryusingourcontextualmodel(Section:GEO-TAGGINGMODELLING).
T3Next,weanalysetheresultswithrespecttoT3.TheagreementlevelofthesubjectsinrelationtoT3was95%.Thedifferenceinagreementlevelisattributedtothefactthatsomesubjectschosesubsetlocationsasnon-relevant,evenifthisaccuratelyrepresentedthenews.Forexample,locationsinanewsmayinclude:i)CityXandii)RoadA,CityX.Ourapproachpresentedsuchlocationsbecausewewanted to show thecity level scope for anews tohelp the subjects in identifyingirrelevantstreet/locality information.Moreover, in thecaseofsportsnews(e.g. footballmatchbetweenChelseaVsArsenal,bothinLondon,UK),somesubjectsconsideredthenameoftheteamsasrelevantlocations,buttherestdidnot.Furthermore,somesubjectschoseallirrelevantoptionsandothersdidnot.Thisvariationontheagreement,webelieve,isacommonphenomenonsincedifferentuserswillhavedifferentsubjectiveopinionsregardinghowalocationcanbeinferredevenintheabsenceofadirectmention.Inadditiontheiropinionwillbeinfluenceddependingontheirfamiliaritywithaparticularlocation.
International Journal of Information Retrieval ResearchVolume 7 • Issue 4 • October-December 2017
66
TheGeo-Tagger identified1118 streets/localities in300news including repetitiveentries inmanylocations,meaningastreet/localitywasfoundinmorethanonenews.Outofthese,79streets/localitiesweretaggedasirrelevantbytheuserswhichisaround7%of1118,whichwasstatisticallyinsignificantaccordingtoMann-Whitneytest(p<0.001),i.e.demonstratedtheeffectivenessofthealgorithm,indicatinganaccuracyofaround93%(seeFigure5).
Foreachstorywitherroneousstreets/localitiesidentifiedbyasubjectinT3,theresultswerefurtherexploredtofindthenatureoferrorsbythetwoexperts.MostoftheerrorsareattributedtothewayGeocode.FarmAPIgeocodedalocationname.Forexample,whenAmerica/HollandoranyothercountrieswerepassedtoGeocode.FarmAPIwithafocusofacity,theAPIresolvedthenametoastreet(e.g.AmericaStreetorHollandStreet)ofthatcity.Webelievethiscanbecorrectedbyapplyinganexclusionpolicythatwillexcludethenameofacountryoutsidethechosenscopeunlessthenameissucceededbyareferenceofstreet(e.g.AmericaStreet)inanews.Inaddition,theAPIalsocouldnotgeocodeafeworganisationsproperlyandresolvedtheorganisationnameintoacompletelyanotherorganisationhavingasimilarnameinanothercity.Anexampleofsuchanorganisationisasupermarkethavingmultiplebranchesinacityorbranchesinmultiplecities.Inthewordsofanothersubject:“Sometimesthelocationwasnotcorrectfortheorganisation.BecauseIcouldseethenewsaboutcityXandcontainedthereferenceoforganisationAwhichwascompletelyresolvedtoalocationbelongingtocityZ”.Onewaytohandlethisistorestrictthescopeduringorganisationresolutiontothecitymainlyfocusedinthenews.Weplantorectifythisprobleminfuture.
Reuter RCV-1ThecomparativeresultispresentedinTable2.AsevidentfromTable2,Geo-TaggerperformedslightlybetterthanCLIFF.ThisimprovementisattributedtothefactthatGeo-Taggerexploited
Figure 5. Result plot for T3
Table 2. CLIFF vs. geo-tagger using RCV-1
Method Sample Size Accuracy
CLIFF 10,000 90%
Geo-Tagger 10,000 92%
International Journal of Information Retrieval ResearchVolume 7 • Issue 4 • October-December 2017
67
organisations,inadditiontoaspatiallocationsandthishelpedGeo-Taggertogeo-tagnewsevenintheabsenceofdirectmentionsofaspatiallocations.TheaccuracyofGeo-Taggeralmostresonate with the results obtained using our data set. This demonstrates the effectivenessofourapproachoverdifferentdata setsofnewsstories.TheaccuracyofCLIFFusingourReuterdatasetisslightlylowerthanwhatwasreportedin(Ignazio,2014).Theexactreasonisdifficulttoestablishsincethesampledistributionin(Ignazio,2014)andusedinthispaperwillmost likelybedifferenteventhoughtheyoriginatedfromthesamedataset.Wecouldnotcomparetheeffectivenessofourapproachinfindingthestreet/localitylevelgranularityoverthis10,000datasetsincethenewsstoriesinRCV-1arenotgeo-taggedatthisgranularityandhence it isnot reported in thispaper.Furthermore, it is important to realise that,eventhoughasmallerdata-setwasusedforcarryingoutthiscomparison,itdoesnotinvalidatethecomparisonwhatsoeverasthesamedatasethasbeenusedforcarryingoutthecomparisonbetweenCLLIFFandourapproach.
dISCUSSIoN
Inthissection,weexploretheresearchquestionsmentionedinSection:INTRODUCTION,highlighttheadvantagesofourapproachanddiscussitslimitations.
Research QuestionsWithrespecttoRQ-1,wehavedevelopedamathematicalmodelofcontextforanewsstory(Section:GEO-TAGGINGMODELLING)consistingofinformationthatanswers5corequestionsofwho,what,where,whenandwhy.Inthisway,thecontextofanewsessentiallycharacterisesthenews,justlikethewayacontextofausercharacteriseshis/hersituation.Thisisasimpleyetpowerfulmodelasseveralof itscomponentssuchaspeople,organisationsand locationscanbeutilisedforgeo-taggingnewsstories.Moreover,someofitscomponentssuchaslocationandcategorycanbeeasilyexpandeddependingonthegranularityofourchoice.Wehavealsoformalisedamodelofspatialinformationandhaveshownhowacontextualinformationcanberelatedwithaspatialinformationbydevelopingamathematicalmodelofgeocodingandgeo-tagging.Ourmodel,inessence,providesablue-printfordevelopingalgorithmstobeutilisedforthegeo-taggingprocess.Wehavediscussedhowwehavedevelopedsuchalgorithmsutilisingourmodelthatencodethegeo-taggingprocessintheimplementationsection.
TheeffectivenessofourmodelbecomesapparentwhenweanswertheRQ-2.Wehaveevaluatedourgeo-taggingprocedureovertwodatasets.Thefirstdatasetconsistsof10,000newswhichhavebeenrandomlycollectedfromtheReuterRCV-1Corpusconsistingof800,000news.Theseconddatasetconsistingofaround11,000newsretrievedfromdifferentnewswebsitesforaroundaperiodofoneyearintheUK.Theresultsofourevaluation(Section:RESULT)clearlyshowthatourmodelofcontextcanbeexploitedtogeo-tagnewsstoriesevenintheabsenceofdirectmentionsoflocations(seeFigure1).Thisisachievedbygeo-taggingnewsstoriesnotonlywithlocations,butalsowithotheraspectsofthecontextmodel.
AsforRQ-3,theresultsoftheevaluationalsodemonstratetheapplicabilityofexploitingdifferentaspectsofthecontextforgeo-taggingnewsatthestreetlevel.Themainchallengeregardingthisistodetectandresolveambiguitythatmayoccuratthisgranularity(street/locality).Ourapproachsuggeststhatitisevenpossibletoresolvelocationsatthestreetlevelevenifthereisnomentionofastreet.Inaddition,ourapproachhasalsobeenabletoassociatestreetswithacitybothmentionedwithinanews(seeFigure5).
AdvantagesAnumberofadvantagesofourapproacharehighlightedbelow:
International Journal of Information Retrieval ResearchVolume 7 • Issue 4 • October-December 2017
68
• Ourapproachenablesgeo-taggingnewsstoriesevenintheabsenceofdirectmentionsoflocationsbyutilisingthecontextualinformation,especiallyorganisations;
• Havingtheabilitytogeo-tagnewsatthestreet/localitylevelopensuptheopportunitytointegrateanynewscollectionswithothermulti-modaldatawhicharealreadyequippedwithsuchfine-grainedlocationinformation.Thewholedatasetcanthenbeutilisedtocreateauniquefine-grainedspatio-temporalretrievalsystem;
• AllowingauseroftheGeo-Taggertofixategeographicalscopeswiththehelpofanexternalinputfilewillenablehim/hertogeo-tagnewsoverthesamedatasetfordifferentscopesatdifferenttimes.Unlikeanyexistingapproachthatdeterminessuchscopesautomatically,thiswillenableausertogeo-tagnewsby“zoomingin”,or“zoomingout”,withinageographiclocationfordifferentscenarios.Forexample,ausercanrestrictthefocustoaparticularcitywithinacountrybyusingaparticularboundingboxcoordinatesatonetimeorcanrestrictthefocusoverthewholecountryinanothertimeusinganotherboundingboxcoordinates.
LimitationsThemainlimitationoftheapproachistheinaccuracyofsomeofthestreet/localitylocations.Asitturnsout,mostofsuchinaccuraciesareattributedtotheexternalgeocodingAPIthatisusedtoresolvelocationnamesintogeographiccoordinates.OnepossiblewaytorectifythisproblemistocombinetheresultsfromothergeocodingAPIssuchasGooglePlace(Google,2016)withtheresultofGeocode.FarmAPIandthenapplyapolicytoresolvethecorrectlocation.Weplantoincorporatethisintoourapplicationinfuture.
Anotherlimitationisthewaytheuserstudywascarriedwithasmallerdatasetandwithasmallnumberofsubjects.Bothareattributedtothefollowingreasons:
• Thereisnopubliclyavailabledatasetgeo-taggedwiththegranularityofstreet/locality;• Wehadtoascertainthatthesubjectshavegoodlocalknowledgewithinthegeographicalscope.
Thisruledoutthepossibilityforalarge-scalecrowd-sourcedevaluation.
CoNCLUSIoN
Inthispaper,wehavedevelopedamathematicalmodelofcontextandgeo-taggingwithrespecttonewsstoriesandhaveexploitedthatmodeltogeo-tagnewsstoriesevenintheabsenceofdirectmentionsoflocationsaswellasatthegranularityofstreet/localitylevel.Forthis,wehave incorporated our model with a geo-tagging algorithm and utilised off-the-shelf toolsandexistingGeocodingAPIs.Thedatasetgeneratedafterapplyingourapproachhasbeenevaluatedwith6users.Inaddition,wehaveevaluatedourapproachover10,000newsfromtheReuterdataset.TheresultsdemonstratetheeffectivenessofourapproachagainstexistingpubliclyavailableAPIs.
In future, we plan to extend the evaluation with a larger sample and subjects, once theapproachisimprovedandthentoconductanexperimentwithindependentstyledesign.Attheend,weaim to release thedatasetcontaining locationsat thegranularityof street/localitiesto the research community for their research. As we have found that different subjects havedifferentopinionsregardingafewspecificlocations(e.g.locationsregardinganorganisation),itwillbeinterestingtoincorporateaconfidencelevelwhiletheusersevaluatetheapproach.Then,resultscontainingaverylowconfidencelevelcanbespeciallytreatedorevenexcludedduringtheoverallevaluation.
Theultimategoalofourapproachistointegrateournewsdatacollectionwithanarrayofmulti-modaldatafromdifferentsocialmediasuchasFlickr,Twitter,YouTubeaswellasfromdifferent
International Journal of Information Retrieval ResearchVolume 7 • Issue 4 • October-December 2017
69
wearablesensorssuchaslifeloggersandGPStrackersandtodevelopalocation-basedinformationfusionsystemwherelocations,amongotherfactors,willactasthegluetobindtogetheralldatasets.Inthisregard,theproposedapproachwillbeanessentialingredientofthesystem.However,webelievethattheapproachcanbeadaptedforgeo-tagginganynewsstoriesinanyotherscenarioswithlittleornofurthermodifications.
International Journal of Information Retrieval ResearchVolume 7 • Issue 4 • October-December 2017
70
ReFeReNCeS
Abowd,G.D.,Dey,A.K.,Brown,P.J.,Davies,N.,Smith,M.,&Steggles,P.(1999,September).Towardsabetterunderstandingofcontextandcontext-awareness.Proceedings of theInternational Symposium on Handheld and Ubiquitous Computing(pp.304-307).Springer.doi:10.1007/3-540-48157-5_29
Amitay,E.,Har’El,N.,Sivan,R.,&Soffer,A.(2004,July).Web-a-where:geotaggingwebcontent.Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval(pp.273-280).ACM.
Andogah,G.,Bouma,G.,&Nerbonne,J.(2012).Everydocumenthasageographicalscope.Data & Knowledge Engineering,81,1–20.doi:10.1016/j.datak.2012.07.002
BericoTechnologies.(n.d.).CLAVIN.Retrievedfromhttps://clavin.bericotechnologies.com/
MIT.(n.d.).CLIFF.CenterforCivicMedia.Retrievedfromhttp://cliff.mediameter.org/
D’Ignazio,C.,Bhargava,R.,Zuckerman,E.,&Beck,L.(2014).Cliff-clavin:Determininggeographicfocusfornews.Proceedings of KDD ‘14.
Ding,J.,Gravano,L.,&Shivakumar,N.(1999).Computinggeographicalscopesofwebresources.
Errico,M.,April,J.,Asch,A.,Khalfani,L.,Smith,M.,&Ybarra,X.(1997).The evolution of the summary news lead.MediaHistoryMonographs-OnLineJournalofMediaHistory.
Garbin,E.,&Mani,I.(2005,October).Disambiguatingtoponymsinnews.Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing(pp.363-370).AssociationforComputationalLinguistics.doi:10.3115/1220575.1220621
Geocode.FarmAPI.(n.d.).Retrievedfromhttps://www.geocode.farm/
Goldberg,D.W.(2008).A geocoding best practices guide.Springfield,IL:NorthAmericanAssociationofCentralCancerRegistries.
GooglePlace,A.P.I.Retrieved25April,2016,fromhttps://developers.google.com/places/
Gross,T.,&Klemke,R.(2002).ContextModellingforInformationRetrieval-RequirementsandApproaches.ProceedingsofICWI(pp.247-254).
Hull,R.,Neaves,P.,&Bedford-Roberts,J.(1997,October).Towardssituatedcomputing.ProceedingsoftheFirstInternationalSymposiumonWearableComputers(pp.146-153).IEEE.doi:10.1109/ISWC.1997.629931
Leidner,J.L.(2006).Anevaluationdatasetforthetoponymresolutiontask.Computers, Environment and Urban Systems,30(4),400–417.doi:10.1016/j.compenvurbsys.2005.07.003
Lieberman,M.D.,Samet,H.,Sankaranarayanan,J.,&Sperling,J.(2007,November).STEWARD:architectureofaspatio-textualsearchengine.Proceedings of the 15th annual ACM international symposium on Advances in geographic information systems(p.25).ACM.doi:10.1145/1341012.1341045
MetaCarta.(n.d.).GeoTag.Retrievedfromhttp://www.metacarta.com/products-platform-geotag.htm
OPENCALAIS.ThomsonReuters.Retrievedfromhttp://new.opencalais.com
Placespotter.Yahoo.Retrievedfromhttps://developer.yahoo.com/boss/geo/docs/key-concepts.html
ResearchDataAustraliaContentProvidersGuide.Whatisspatiallocation?Retrievedfromhttp://guides.ands.org.au/rda-cpg/spatiallocation
Ryan,N.,Pascoe,J.,&Morse,D.(1999).Enhancedrealityfieldwork:Thecontextawarearchaeologicalassistant.Bar International Series,750,269–274.
Sandhaus,E.(2004).ReutersCorpora(RCV1,RCV2,TRC2).Retrievedfromhttp://trec.nist.gov/data/reuters/reuters.html
International Journal of Information Retrieval ResearchVolume 7 • Issue 4 • October-December 2017
71
Sandhaus,E.(2008).TheNewYorkTimesannotatedcorpusLDC2008T19.Retrievedfromhttps://catalog.ldc.upenn.edu/LDC2008T19
Schilit,B.N.,&Theimer,M.M.(1994).Disseminatingactivemapinformationtomobilehosts.IEEE Network,8(5),22–32.doi:10.1109/65.313011
StanfordNamedEntityRecognizer(NER),&theStanfordNaturalLanguageProcessingGroup.(n.d.).Retrievedfromhttp://nlp.stanford.edu/software/CRF-NER.shtml
Md Sadek Ferdous is a Research Fellow at the University of Southampton, UK. He received his PhD degree in Computing Science from the University of Glasgow, UK in 2015. His research interests include Information Retrieval, Security, Privacy, Identity Management, Trust Management and Blockchain technologies.
Soumyadeb Chowdhury completed his Masters and PhD in Computing Science from University of Glasgow. Employed as a Research Assistant thereafter working in the Integrated Multimedia City Data project in Urban Big Data Center (funded by EPSRC). Currently, lecturer of InfoComm Technology in Singapore Institute of Technology. Research interests include Information Retrieval, Information Security and Privacy, Human Computer Interaction, Pervasive Sensing Technologies and Visualisation Metaphors and Usable Security.
Please recommend this Publication to your librarianFor a convenient easy-to-use library recommendation form, please visit:http://www.igi-global.com/IJIRR
Volume 7 • Issue 4 • October-December 2017 • ISSN: 2155-6377 • eISSN: 2155-6385An official publication of the Information Resources Management Association
all inquiries regarding iJirr should be directed to the attention of:Zhongyu (Joan) Lu, Editor-in-Chief • IJIRR@igi-global.com
all manuscriPt submissions to iJirr should be sent through the online submission system:http://www.igi-global.com/authorseditors/titlesubmission/newproject.aspx
Advanced software development related information retrieval issues • Classification • Clustering approaches • Content and context awareness and environment awareness • Filtering system • Index techniques • Information mining • Information retrieval in cloud computing issues • Information retrieval in education • Information retrieval in healthcare • Information retrieval in science, engineering, and technologies • Information retrieval in social science, social behaviors • Information retrieval with business, commerce, etc. • Information retrieval with Internet of Things • Knowledge mining • Link analysis • Machine learning on documents • Message passing • Metadata and XML retrieval • Mobile computing related information retrieval issues • Multimedia retrieval • Performance measures • Query languages and optimization • Retrieval architecture • Retrieval evaluation • Retrieval languages and operations • Retrieval strategies • Retrieval systems • Retrieval theories • Retrieval with big data technologies • Scalability • Search algorithms • Search engine • Social media related information retrieval issues • Taxonomy theory and applications • Text mining • Text, document, and image retrieval • Web mining
Coverage and major topiCsThe topics of interest in this journal include, but are not limited to:
The mission of the International Journal of Information Retrieval Research (IJIRR) is to provide an outlet for researchers to present their research and obtain inspiration in the areas of information retrieval, computer science, and information science. Focusing on theories, methods, technologies, and tools, IJIRR is aimed towards information engineers, scientists, and related professionals. This journal exhibits expert experiences and state-of-the-art technologies in search and storage of texts, images, videos, and other data to stimulate innovation and exploration of improved approaches for conquering industry problems.
mission
Ideas FoR specIal Theme Issues may be submITTed To The edIToR(s)-In-chIeF
international Journal of information retrieval research
call for articles
Recommended