D3.4 EW-Shopp Components Evaluation Assessment
Grant n. 732590 - H2020-ICT-2016-2017/H2020-ICT-2016-1
Deliverable n: 3.4
Date: 30 June 2019
Status: Final
Version: 1.2
Authors: David Čreslovnik (CE), Aljaž Košmerlj (JSI), Nikolay Nikolov (SINTEF), Michele Ciavotta (UNIMIB), Lorenzo Sutton (ENG)
Contributors: UNIMIB, SINTEF, JSI, ENG
Reviewers: Matej Žvan, Francisco Rodriguez
Distribution: PU
EW-Shopp-Page2 GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1
History of Changes
Version  Date        Description                                  Revised by
0.5      05/06/2019  Content concept                              David Creslovnik (CE)
0.6      06/27/2019  Merging ENG                                  David Creslovnik (CE)
0.7      06/27/2019  Merging UNIMIB                               David Creslovnik (CE)
0.8      06/27/2019  Merging JSI                                  David Creslovnik (CE)
0.9      06/27/2019  Conclusion                                   Francisco Rodriguez (JOT), Matej Zvan (BT), David Creslovnik (CE)
1.0      06/28/2019  Taking reviewers' comments into account      All contributors
1.1      06/29/2019  Merging final contributions                  David Creslovnik (CE)
1.2      06/30/2019  Final check and small fixes (titles, etc.)   Matteo Palmonari (UNIMIB)
Table of contents
History of Changes
Chapter 1 Introduction
1.1 Objectives and scope
1.2 Applicable documents and references
1.2.1 Partners
1.2.2 Acronyms
1.3 Document structure
Chapter 2 Enrichment: DataGraft and ASIA
2.1 Usage in EW-Shopp BCs
2.2 Enrichment at scale
2.2.1 Distribution and Parallelization
2.2.2 Data and Service Locality
2.2.3 Scaling up the Enrichment Services
2.2.4 Hierarchical Caching
2.3 Assessment
2.3.1 Investigating the effect of caching in reconciliation
2.3.2 Investigating the effect of caching in enrichment
2.3.3 Investigating system scalability
2.3.4 Investigating the reconciliation accuracy
2.4 Discussion on limitations
2.4.1 Data Locality
2.4.2 Distributed Caching
2.4.3 Efficient API interaction
2.4.4 Improving the reconciliation method
2.5 Conclusions
Chapter 3 Analytics: QMiner
3.1 Usage in EW-Shopp BCs
3.2 Modelling Performance
3.2.1 Ceneje.si
3.2.2 Big Bang
3.3 Technical Performance
3.4 Conclusions and discussion on limitations
Chapter 4 Visualization: KNOWAGE
4.1 General and Common features
4.2 Usage in EW-Shopp BCs
4.2.1 Business Case 1
4.2.2 Business Case 3
4.2.3 Business Case 4
4.3 Performance assessment
4.4 Conclusions, limitations and improvements
Chapter 5 Conclusions
List of tables
Table 1. Short references for project partners
Table 2. Abbreviations and acronyms used
Table 3 Experiment results with different dataset sizes and hardware setups

List of figures
Figure 1: eCommerce market structure
Figure 2 General data flow model
Figure 3 Reconciliation and extension at scale using a load-balanced version of ASIA
Figure 4 An overview of the proposed Processing environment
Figure 5 Request execution time in milliseconds for the scenario with no duplicates
Figure 6 Request execution time in milliseconds for the scenario with four duplicates
Figure 7 Total execution time (in seconds) and linear regression curve, for different dataset sizes and two experimental setups
Figure 8. Predicting the number of visits with weather unaware model
Figure 9. Predicting the number of visits with weather aware model
Figure 10. The actual CPU time required to learn the model depending on the number of records
Figure 11. Consumption of memory during the learning of the model depending on the number of records
Figure 12. The actual CPU time required to learn the model depending on the number of features
Figure 13. Consumption of the memory during the learning of the model depending on the number of features
Figure 14 Example of SQL query dataset. Notice the dynamic parameters set with the syntax $P{IDPRODUCT} and $P{DATE}
Figure 15 Visual creation of a dataset association
Figure 16 TV brands (product) yearly click ranking. Each row is interactive, and the user can select one to access a more detailed cockpit. Note: some data have been blurred for confidentiality reasons
Figure 17 Drill down cockpit for a specific brand selected in the previous one (or through the provided filter). User can select two specific dates for seller/click performance comparison. Note: some data have been blurred for confidentiality reasons
Figure 18 Product selection interface. Notice the incremental search
Figure 19 User has interactively selected some dates of interest from the table, and all other visualisations are updated accordingly. Selections can be quickly cleared from the specific widget in the top-right part. Note: some data have been blurred for confidentiality reasons
Figure 20 BC3 bi-weekly store analysis with weather cockpit examples. Above: visualisation with widget showing visitors and weather. Below: updated compact visualization with visitors and temperature on the same chart and including store filter pane
Figure 21. BC4 weekly keyword cockpit – filtering by days
Figure 22. BC4 regional analysis cockpit (updated) filtering by region
Figure 23. Screenshot of the regional analysis cockpit with all filters applied (region, keyword and date)
Figure 24. Demonstration of the Business Case 3 main cockpit embedded in an external HTML page through a simple iframe
Figure 25. Demonstration of retrieving the Business Case 1 documents through the KNOWAGE API (left: through direct GET call, right: through Python)
Chapter 1 Introduction
The value creation funnel in classical retail has been interrupted and transformed, and the power
within it has been redistributed. Manufacturers and retailers have lost much of their access to the
end consumer. The value constellation in this domain is characterized by the co-opetition of several
actors and stakeholders, such as eMarketplaces, retailers, search engines, comparison shopping
engines, online advertising agencies, technology providers (for retailers and marketplaces),
marketing agencies, marketing research and product data vendors. Extremely fast e-commerce is
setting the pace of development and influences consumer behavior.
Figure 1: eCommerce market structure
We need to accept the new market structure (Figure 1), in which online marketplaces, search
platforms, social networks and other e-commerce platforms own the users' attention and influence
their decisions at every step of the cross-channel purchase funnel.
The data created in this ecosystem are key to the successful and profitable existence of retailers,
to the planning of manufacturers, to the development of products, and so on. However, the global
platforms that dominate key global markets tend to keep core insights for their own business
optimization, with very limited willingness to share. Mostly non-European tech platforms have access
to the user data and to the resources needed to structure it and apply advanced business analytics
to it. In the EW-Shopp project we realized that there is a distinctive group of market players
across the complete consumer journey that are creating great customer value and consequently
gather ultimate purchase behavior data.
Many of the companies operating in this domain have developed highly specialized competences in
integrating and vertically analyzing data in their specific sector. However, this approach produces
limited insights, and the ability to utilize powerful purchase behavior data remains in the hands of
the big tech companies that have established platforms for managing cross-domain data
infrastructures capable of large-scale computing.
The EW-Shopp project overcomes the limitations of specialized domain knowledge by incorporating
big data analysis, big data storage and big data visualization specialists into the consortium.
Together we can build competitive user behavior insights and demand knowledge that could help
European companies, from small to large, to be more competitive and at least maintain their local
market position by having the opportunity to innovate their services. Furthermore, the ability to
pool market information gathered across such platforms and leverage it through sophisticated
analytical tools can offer wider insight and enable the application of lessons learned from one
market to others.
1.1 Objectives and scope
This deliverable evaluates the performance of the data services in the EW-Shopp toolkit (formerly
referred to as platform). It presents the work done by four technical partners (ENG, JSI, SINTEF and
UNIMIB) and how it is used in the already developed pilots. Because the pilots are still evolving,
some of these services might undergo changes.
The overall architecture of the service toolkit has already been specified and described in other
deliverables (see deliverables D3.1, D3.2 and D2.2). Here the focus is on the usability and
performance of the functionalities of individual services. An assessment of the usability of the
services from the point of view of the business partners is also reported in this document.
1.2 Applicable documents and references
The following documents are applicable to the subject discussed in this deliverable:
1. EW-Shopp Grant Agreement [GA]
2. EW-Shopp Description of Action [DoA]
3. EW-Shopp Deliverable D2.2 – EW-Shopp Platform [D2.2]
4. EW-Shopp Deliverable D2.3 – EW-Shopp Platform evaluation assessment [D2.3]
5. EW-Shopp Deliverable D3.1 – EW-Shopp components as a service: data visualization, navigation and quality assessment [D3.1]
6. EW-Shopp Deliverable D3.2 – EW-Shopp components as a service: transformation, linking and analytics [D3.2]
7. EW-Shopp Deliverable D3.3 – EW-Shopp components as a service: data visualization, navigation and quality assessment [D3.3]
8. EW-Shopp Deliverable D4.2 – Pilots deployment [D4.2]
1.2.1 Partners
Short references may be used to refer to project beneficiaries, referred to as partners in the
document. References are listed in Table 1.
Table 1. Short references for project partners
No.  Beneficiary (partner) name as in [GA]                    Short reference
1    UNIVERSITA' DEGLI STUDI DI MILANO-BICOCCA                UNIMIB
2    CENEJE DRUZBA ZA TRGOVINO IN POSLOVNO SVETOVANJE DOO     CE
3    BROWSETEL (UK) LIMITED                                   BT
4    BIG BANG, TRGOVINA IN STORITVE, DOO                      BB
5    MEASURENCE LIMITED                                       ME
6    JOT INTERNET MEDIA ESPAÑA SL                             JOT
7    ENGINEERING – INGEGNERIA INFORMATICA SPA                 ENG
8    STIFTELSEN SINTEF                                        SINTEF
9    INSTITUT JOZEF STEFAN                                    JSI
1.2.2 Acronyms
Lastly, there are various abbreviations and acronyms used in this deliverable. These are listed in
Table 2.
Table 2. Abbreviations and acronyms used
Acronym   Name
API       Application Programming Interface
BC        Business Case
BI        Business Intelligence
BDaaS     Big Data-as-a-Service
BDVA      Big Data Value Association
CRM       Customer Relationship Management
CSE       Comparison Shopping Engine
DA        Digital Advertising agency
DaaS      Data-as-a-Service
DM        Dissemination Manager
DSL       Domain Specific Language
ECMWF     European Centre for Medium-Range Weather Forecasts
FTP       File Transfer Protocol
GLCI-RDF  Google GeoTargets in RDF
GRIB      GRIdded Binary or General Regularly-distributed Information in Binary form. A data format commonly used in meteorology to store historical and forecast weather data
HDT       Header Dictionary Triples
IPR       Intellectual Property Rights
JAR       Java ARchive
JSON      JavaScript Object Notation
KPI       Key Performance Indicator
LOD       Linked Open Data
LOV       List Of Values
MARS      The ECMWF Meteorological Archival and Retrieval System
MP        Marketplace
MR        Marketing Research
PDV       Product Data Vendor
PPP       Public-Private Partnership
QA        Quality Assurance
R&D       Research and Development
RDF       Resource Description Framework
RET       Retailer
S&T       Science & Technology
SLA       Service Level Agreement
SMB       Small Business Owner
SME       Small-Medium Enterprise
SQL       Structured Query Language
TSV       Tab-Separated Values. A text format for storing data in a tabular structure
TP        Technology Provider
UI        User Interface
VDP       Visual Data Profiling
VPN       Virtual Private Network
WP        Work Package
WYSIWYG   What You See Is What You Get – a user interface paradigm where users can edit and interact graphically and see the results of editing/interaction directly on the screen
1.3 Document structure
The rest of the document is organized as follows. Chapter 2 describes SINTEF's and UNIMIB's
contribution to the EW-Shopp platform, its performance assessment and its potential limitations.
The weather and event data contribution and all the machine learning and advanced analytics are
presented in Chapter 3 by JSI. Chapter 4 then discusses the evaluation of the visualization platform
of ENG, and Chapter 5 provides a summary.
Chapter 2 Enrichment: DataGraft and ASIA
2.1 Usage in EW-Shopp BCs
The Grafterizer 2.0 and ASIA tools and the processing infrastructure (i.e., the Processing
component) are built to support the tasks implemented in the Ingestion and Enrichment phases of the
General data flow model (see Figure 2).
Figure 2 General data flow model
During the Ingestion phase, the Processing component can be used with large-scale datasets to
re-format, pre-process (e.g., unzip archives, split large files for parallelization, etc.) and
deliver the data to the services for Enrichment. Furthermore, the Processing component is used to
automatically deploy up-to-date weather/event data in the Enrichment databases that are used by
ASIA to augment the original datasets (e.g., download and import weather measurements and forecasts
on a daily basis, which are then used in the Enrichment pipelines). Finally, the Processing
component is used to automate the process of direct data deployment in a database.
ASIA and Grafterizer 2.0 implement functionalities that support all tasks related to the Enrichment
phase. ASIA has been integrated with the Grafterizer UI to enable the use of its functionalities in
an interactive and user-friendly way. The UI of Grafterizer provides features that enable users to
graphically perform cleaning, filtering, anonymization, reconciliation, extension, transformation
(to knowledge graphs or other graph formats), minimization and aggregation on tabular data. In the
EW-Shopp project, the Grafterizer and ASIA tools are used to implement the data flows for Business
Cases 3 and 4.
In BC3 the focus is on automating the process of delivering data for the Measurence dashboard,
which displays visitor data and provides predictions of the expected numbers of visitors, based on
weather forecasts and planned events, for physical shop vendors. To enable this, the business case
makes use of raw behavioral data (collected using Measurence's proprietary IoT platform), weather
data (obtained from ECMWF) and event data (e.g., promotions; provided by the end users of the
dashboard). The raw behavioral and event data are updated weekly and ingested using a workflow
deployed using the Processing component. Data are thus uploaded once per week to a block storage
location (currently implemented using Minio, see footnote 1) and automatically loaded into a MySQL
database that is used by the front end of the web dashboard in Scout+. Weather data are also
obtained weekly, using a scheduled workflow based on the Processing component templates for the
weather data retrieval tool developed in EW-Shopp (see footnote 2). The workflow downloads the GRIB
data from the pre-specified set of locations and serializes it into a CSV file matching the
structure of the MySQL table that stores the weather variables. The CSVs are then imported into the
database.
In BC4, due to the large volume of keyword and keyword match data, the main focus is scalability.
The data comprise hundreds of millions of records of unique keyword matches (daily records of
impressions and clicks per country/region/city/date for tens of millions of keywords) that need to
be enriched with weather data in order to determine, e.g., the best day for launching a campaign for
specific marketing targets. Therefore, BC4 makes use of the Processing component's capability of
parallelizing Grafterizer/ASIA pipelines at scale. Keyword data are uploaded to a distributed file
system (currently the BC uses GlusterFS, see footnote 3) and delivered as a set of archives (ZIP
files). Each archive contains TSV files with one day's worth of keyword data for the extracted set
of keywords. To enable processing in the Enrichment Database and the services of the EW-Shopp
toolkit, the data are decompressed, transformed from TSV to CSV (the default format supported by
Grafterizer) and split into chunks. A set of Grafterizer/ASIA reconciliation and extension agents
running in parallel then augments the input data (row by row) with weather data through a
load-balanced set of ASIA service instances (see Figure 3). Apart from the reconciliation and
extension steps, the agents filter out invalid entries (e.g., remove rows that do not contain a
valid AdWord or AdWord group), structure the data into the logical graph structure described in
D2.3 and produce the input to the multi-model database used to store the data (ArangoDB, see
footnote 4). The data are then uploaded and stored for further usage in the ArangoDB database.
1 https://min.io/
2 http://github.com/ew-shopp/weather_import/
3 https://www.gluster.org/
Figure 3 Reconciliation and extension at scale using a load-balanced version of ASIA
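The pre-processing step described above (TSV to CSV conversion followed by splitting into chunks)
can be sketched as follows. This is a minimal illustration and not the project's actual
implementation: the function names, the assumption that every TSV file starts with a header row,
and the choice to repeat the header in every chunk are all hypothetical.

```python
import csv
import io


def _rows_to_csv(header, rows):
    """Serialize a header and a list of rows into one CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def tsv_to_csv_chunks(tsv_text, rows_per_chunk):
    """Convert TSV text into CSV chunks of at most rows_per_chunk data rows.

    Assumption: the first TSV line is a header; it is repeated in every
    chunk so that each chunk can be processed independently.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    chunks, current = [], []
    for row in reader:
        current.append(row)
        if len(current) == rows_per_chunk:
            chunks.append(_rows_to_csv(header, current))
            current = []
    if current:  # last, possibly shorter, chunk
        chunks.append(_rows_to_csv(header, current))
    return chunks
```

Repeating the header in each chunk is one possible design choice; it keeps the chunks
self-describing, so that an agent needs no information beyond the chunk it receives.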
In this document, we consolidate and complete (month 30) the evaluation carried out in month 24 and
described in Deliverable D2.3 [8], in which the functionalities of the toolkit were screened against
the demands of the different business cases. Here, as far as the Grafterizer and ASIA components are
concerned, evaluative studies on non-functional properties are reported. In particular, the
following features have been considered:
1. The transformation and enrichment speed, measured as the seconds needed to reconcile a single
record and as the GB transformed per hour.
2. The scalability of the system, statistically assessed by means of linear regression.
3. The precision of the reconciliation, compared against a ground truth consisting of a corpus
(silver standard) obtained by matching with SILK (see footnote 5) the toponyms of GeoTarget,
supported by AdWords, and GeoNames.
2.2 Enrichment at scale
Of the two business cases, BC4 certainly posed the greater challenge because of the size of the
dataset to be processed and the number of toponyms to be reconciled and extended. This section
presents the techniques and strategies employed to achieve a system capable of providing scalable
enrichment functionalities.
4 https://www.arangodb.com/
5 http://silkframework.org/
2.2.1 Distribution and Parallelization
The ASIA ecosystem is made up of various services and databases capable of serving a number of
concurrent invocations. Each time a call is made, ASIA receives a label (in the case of
reconciliation) or a URI (in the case of extension) from Grafterizer and returns one or more values.
Successive calls are independent of each other; for this reason, we have created a platform (the
Processing component) in which the transformation and enrichment pipeline is executed in parallel
on non-overlapping segments of the dataset.
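Because each call is independent, the rows can be partitioned and processed concurrently without
coordination. The sketch below illustrates the idea with a thread pool on a single machine; it is
not the container-based Processing component, and the names split_into_segments,
enrich_in_parallel and enrich_row are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_segments(rows, n_segments):
    """Split rows into contiguous, non-overlapping segments of near-equal size."""
    size, extra = divmod(len(rows), n_segments)
    segments, start = [], 0
    for index in range(n_segments):
        end = start + size + (1 if index < extra else 0)
        segments.append(rows[start:end])
        start = end
    return segments


def enrich_in_parallel(rows, enrich_row, n_agents):
    """Apply enrich_row to every row, one worker ("agent") per segment.

    enrich_row stands in for a reconciliation or extension call; it must
    not depend on any other row, which is what makes the split safe.
    """
    segments = split_into_segments(rows, n_agents)
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        results = list(pool.map(
            lambda segment: [enrich_row(row) for row in segment], segments))
    # Segments are returned in submission order, so row order is preserved.
    return [row for segment in results for row in segment]
```

Threads are used here because the enrichment calls are assumed to be dominated by network or
service latency rather than by local computation.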
From a technological point of view, the Processing component is implemented as a private cloud
consisting of a cluster of bare-metal machines that run the Docker engine (see footnote 6), are
connected via a high-bandwidth Ethernet fabric network and mount a shared file system (i.e.,
GlusterFS). In this environment, data flows are compiled into a chain of Docker containers that are
in turn deployed and run through a container orchestration system (i.e., Rancher, see footnote 7).
Each step of the chain consists of a set of containers that work independently and in parallel and
can be scaled up or down on demand. The communication between two consecutive steps of the chain,
that is, the handover of the partial results, occurs through writing to and reading from the file
system.
The implementation of a container-based solution has several benefits; for instance, it decouples
the data flow deployment from the particular stakeholder's hardware infrastructure and also works
in heterogeneous distributed environments. Furthermore, it guarantees flexible deployments, better
resource utilization and seamless horizontal scalability. As for the choice of GlusterFS, this
distributed file system has the advantage of being fast, as it lacks a central repository for
metadata, and of being linearly scalable and therefore able to support massive amounts of data.
Moreover, being physically separated from the machines on which the calculation takes place, it
guarantees reasonably stable performance over the cluster, reducing the risk of time skewness in
the enrichment process. A graphical representation of the proposed Processing environment is
presented in Figure 4.
Due to the high complexity of this layout, we will have to consider setting up a system monitor
that detects failures in any of the steps.
6 https://www.docker.com
7 https://rancher.com
Figure 4 An overview of the proposed Processing environment
2.2.2 Data and Service Locality
A first consideration about the causes that hinder acceptable scalability of the enrichment process
regards the use of remote services. As a matter of fact, in the first releases of Grafterizer and
ASIA these two components were deployed in geographically distant environments, but the use of
services accessible over the Internet proved to be incompatible with datasets featuring more than a
few thousand rows, due to the network latency and the high number of invocations. Hence, while it
is acceptable in the design phase (performed on a smaller dataset) to use remote multi-tenant
services, it is imperative to manage the life-cycle of enrichment data internally when large
datasets must be processed.
To this end, we have implemented a series of strategies aimed at reducing the number of requests to
external services as much as possible. In particular, due to its size, the weather information is
downloaded daily from the provider and processed to improve access from geospatial shared spaces of
identifiers (SSIs). Other, less changeable knowledge bases (KBs) are refreshed at different
frequencies. The local management of these KBs has the advantage of rightsizing the resources
allocated against the incoming workload; moreover, control over the (local) network enables reduced
and stable round-trip delay times (RDT). Similar considerations have led to deploying the ASIA
reconciliation services as close as possible to both the reference KBs and the agents (containers)
performing the reconciliation pipeline steps (within the Processing component).
2.2.3 Scaling up the Enrichment Services
In order to manage the workload caused by the simultaneous invocation of reconciliation and
extension functions (by the containers executing the reconciliation pipeline), the ASIA ecosystem
has been implemented to be easily replicable, so as to achieve horizontal scalability. This is
achievable because the knowledge bases are used in read-only mode; accordingly, they can be
duplicated without consistency issues. Load balancers take care of sharing the requests across the
various replicas of ASIA.
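The distribution of requests over read-only replicas can be illustrated with a round-robin policy,
which is the policy reported for the scalability experiment in Section 2.3.3. The class below is a
simplified, in-process sketch and not the load balancer actually deployed; the class name and the
replica names are hypothetical.

```python
import itertools


class RoundRobinBalancer:
    """Hand out replicas in a fixed rotating order.

    Because the replicas serve read-only knowledge bases, any replica can
    answer any request, so no session affinity is needed.
    """

    def __init__(self, replicas):
        replicas = list(replicas)
        if not replicas:
            raise ValueError("at least one replica is required")
        self._cycle = itertools.cycle(replicas)

    def next_replica(self):
        """Return the replica that should serve the next request."""
        return next(self._cycle)


# Hypothetical replica names, for illustration only.
balancer = RoundRobinBalancer(["asia-1", "asia-2", "asia-3"])
first_five = [balancer.next_replica() for _ in range(5)]
```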
2.2.4 Hierarchical Caching
Lastly, it should be noted that the same request for reconciliation or extension can be made
multiple times by the agents (running on the Processing component), because the columns involved
can have repeated values both within the same data chunk and across different chunks. If not
mitigated, this would generate a very high number of identical requests. For this reason, a
hierarchical caching system has been implemented in which each agent directly manages the first
level, while the other levels are managed by the stack of ASIA services and the underlying
databases.
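The lookup order of such a hierarchy can be sketched as follows: an agent first consults its own
in-memory cache, then a cache shared at the service level, and only then the backend. This is an
illustrative sketch under simplifying assumptions (two levels only, no memory bound), and the names
TtlCache, HierarchicalLookup and backend are hypothetical.

```python
import time


class TtlCache:
    """Shared cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired entry
            return None
        return value

    def put(self, key, value):
        self._store[key] = (value, self._clock() + self._ttl)


class HierarchicalLookup:
    """Agent-side lookup: local cache, then shared cache, then backend."""

    def __init__(self, shared_cache, backend):
        self._local = {}             # level managed by the single agent
        self._shared = shared_cache  # level managed on the service side
        self._backend = backend      # stands in for the actual service call

    def resolve(self, label):
        if label in self._local:
            return self._local[label]
        value = self._shared.get(label)
        if value is None:  # simplification: None results are not cached here
            value = self._backend(label)
            self._shared.put(label, value)
        self._local[label] = value
        return value
```

With this arrangement, a value repeated within one chunk is served from the agent's own memory,
while a value repeated across chunks handled by different agents is served from the shared level.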
2.3 Assessment
In deliverable D2.3, the functionalities of the EW-Shopp services were analyzed and evaluated by
comparing them with the needs of the various business cases. To complete the evaluation, in this
document we examine two non-functional properties that are particularly relevant in the BC4
scenario, namely the scalability and the accuracy of reconciliation.
To test the flexibility and scalability of the proposed solution, three experiments of increasing
size involving real datasets have been carried out; all of them make use of the geospatial
reconciliation and extension services (GeoNames) and are aimed at enriching the dataset with
weather information. To assess accuracy, we used a silver standard obtained through a pipeline that
matches GeoTarget (the source of BC4 toponyms) with GeoNames (the system of shared identifiers that
allows extension). The four experiments are described below.
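As an illustration of how reconciliation output can be scored against a silver standard, the
sketch below computes precision as the share of returned matches that agree with the reference
mapping. This is one common definition and is an assumption here, since the exact scoring procedure
is not given in this section; the function name and the identifiers used in the usage note are
hypothetical.

```python
def reconciliation_precision(predicted, silver):
    """Fraction of returned matches that agree with the silver standard.

    predicted: dict mapping toponym -> identifier (None if no match returned)
    silver:    dict mapping toponym -> reference identifier
    Only toponyms with a returned match and a silver entry are evaluated.
    """
    evaluated = [
        toponym for toponym, identifier in predicted.items()
        if identifier is not None and toponym in silver
    ]
    if not evaluated:
        return 0.0
    correct = sum(1 for toponym in evaluated
                  if predicted[toponym] == silver[toponym])
    return correct / len(evaluated)
```

For example, with three returned matches of which two agree with the silver standard, the function
returns 2/3; toponyms for which no match was returned do not lower the precision (they would affect
recall instead).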
2.3.1 Investigating the effect of caching in reconciliation
First, we designed a small-scale experiment reproducing the scenario in which the data scientist
executes the enrichment pipeline on a commodity machine. The main objective is to assess the
performance boost attributable to the introduction of different caching levels. We started by
testing the reconciliation performance with no caching strategy whatsoever: 200 thousand rows (21
columns) featuring 2227 different toponyms (from Germany and Spain) were extracted from the BC4
dataset, and a pipeline featuring only reconciliation was executed. The measured average time per
row was 12.927 ms.
The same test was then repeated with the first level of cache enabled, which is implemented at the
reconciliation service level. We reserved 64 MB of memory as cache, with a time-to-live of 1 hour.
The cache improved the performance, achieving on average 2.558 ms per row (about 5 times faster
than the baseline). Finally, the second cache layer, which is implemented locally on each agent,
was enabled. The objective is to avoid the network latency, which is substantial even in a local
setup (via the loopback interface). In this case the pipeline ran ~770 times faster than the
baseline (0.0168 ms/row on average). Since the local cache improves the overall performance
significantly, we generalized the cache structure to handle extension queries as well (namely,
weather and events).
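The speedup factors quoted above follow directly from the measured per-row averages. The short
calculation below reproduces them and also derives the total time implied for the 200 thousand
rows; the derived totals are simple arithmetic on the reported averages, not additional
measurements.

```python
def speedup(baseline_ms_per_row, cached_ms_per_row):
    """Ratio between the baseline and the cached per-row time."""
    return baseline_ms_per_row / cached_ms_per_row


def total_seconds(ms_per_row, rows):
    """Total time in seconds implied by an average per-row time."""
    return ms_per_row * rows / 1000.0


ROWS = 200_000
NO_CACHE_MS = 12.927       # no caching
SERVICE_CACHE_MS = 2.558   # cache at the reconciliation service level
LOCAL_CACHE_MS = 0.0168    # additional cache local to each agent

service_speedup = speedup(NO_CACHE_MS, SERVICE_CACHE_MS)  # about 5.05
local_speedup = speedup(NO_CACHE_MS, LOCAL_CACHE_MS)      # about 769.5

no_cache_total = total_seconds(NO_CACHE_MS, ROWS)         # about 2585 s
service_total = total_seconds(SERVICE_CACHE_MS, ROWS)     # about 512 s
local_total = total_seconds(LOCAL_CACHE_MS, ROWS)         # about 3.4 s
```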
2.3.2 Investigatingtheeffectofcachinginenrichment
In order to analyze the behavior of the cache over time, a second experiment was designed, extending the first one as follows: a more complex pipeline is considered, which reconciles city toponyms to GeoNames, extends reconciled entities with their first administrative level (i.e., regions), and fetches weather information about regions (i.e., temperature for a specific date and the following one), generating a new dataset with 25 columns.
This pipeline was first employed to enrich a dataset derived from that of the first experiment by filtering out the duplicates in the reconciliation target column (i.e., each value occurs at most once), thus resulting in 2227 unique cities (and rows). The outcomes of this experiment, where the cache did not significantly improve the performance (as it was built but never used), are depicted in Figure 5⁸. Afterward, a synthetic dataset was built in which each line from the previous dataset is repeated four times, which allows the local cache to be exploited. As reported in Figure 6, spikes are still visible due to cache building, but cache reuse speeds up the process progressively (4x on average), considerably reducing the execution time (which tends to be purely cache access time).
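The difference between the two scenarios can be made concrete by counting cache hits. The sketch below assumes an idealised cache that never evicts entries, and the city names are placeholders rather than the BC4 data.

```python
from typing import Hashable, Iterable


def cache_hit_ratio(keys: Iterable[Hashable]) -> float:
    """Fraction of lookups answered by a cache that never evicts."""
    seen = set()
    hits = 0
    total = 0
    for key in keys:
        total += 1
        if key in seen:
            hits += 1
        else:
            seen.add(key)
    return hits / total if total else 0.0


unique_cities = [f"city-{i}" for i in range(2227)]  # placeholder names
no_duplicates = unique_cities
four_duplicates = [city for city in unique_cities for _ in range(4)]

ratio_unique = cache_hit_ratio(no_duplicates)
ratio_repeated = cache_hit_ratio(four_duplicates)

# If a cache hit cost nothing, the speed-up could not exceed 1 / (1 - hit ratio).
max_speedup = 1 / (1 - ratio_repeated)
```

With no duplicates the cache is filled but never hit; with every row repeated four times three lookups out of four are hits, which bounds the achievable speed-up at 4x under this simplified model.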
Figure 5 Request execution time in milliseconds for the scenario with no duplicates
8 Initial spikes are due to the system startup (e.g., database connectors initialization)
Figure 6 Request execution time in milliseconds for the scenario with four duplicates
2.3.3 Investigating system scalability
The final experiment was devoted to investigating the system scalability. First, a commodity machine was used (this experiment, like the previous ones, was performed on a multi-tenant machine with 4 Intel Xeon Silver 4114 CPUs at 2.20 GHz and 125 GB RAM) with a single deployment of ASIA. The same pipeline was used to enrich datasets of different sizes: 100 MB, 1 GB, 5 GB, and 10 GB, divided into 10 chunks of equal size and assigned to 10 agents.
Performance results (in blue), reported in Figure 7 as dataset size / total completion time, show a linear trend, which highlights the scalability of the proposed solution. Finally, the enrichment of a 100 GB dataset (~500 million rows, 21 columns) was performed; the pipeline was run on the Big Data Environment deployed on a private cloud infrastructure featuring an 8-node cluster of heterogeneous hosts. Five of the nodes have 4-core CPUs and 15.4 GB RAM, and three nodes have 12-core CPUs and 64 GB RAM, with six 3 TB HDDs holding a GlusterFS distributed file system (shared across the whole cluster). The enrichment agents were deployed on the three servers. The transformation accessed a load-balanced (using round-robin load balancing) set of 10 ASIA services deployed on the same stack. The five configurations are also detailed in Table 3.
The linear trend with R² = 0.998 (please note that Figure 7 uses a base-10 log scale for the axes) is maintained also when the data point (in red) pertaining to the 100 GB experiment is considered, despite the different contexts in which the experiments were carried out. This is mainly due to similar access and reconciliation times between the two configurations used.
Figure 7 Total execution time (in seconds) and linear regression curve, for different dataset sizes and two experimental setups
Table 3 Experiment results with different dataset sizes and hardware setups
Trial  Dataset size (GB)  Execution time (s)  Reconc. rate (GB/h)  Hardware type
0      0.1                75.6                4.76                 commodity
1      1.0                120.7               29.83                commodity
2      5.0                152.6               117.96               commodity
3      10.0               262.2               137.30               commodity
4      100.0              3026.7              118.94               big data environment
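The reconciliation rate column of Table 3 follows directly from the other two columns. The short sketch below recomputes it; the function name is an assumption introduced for illustration, while the sizes and times are copied from the table.

```python
def reconciliation_rate_gb_per_hour(size_gb: float, seconds: float) -> float:
    """Dataset size divided by execution time, expressed in GB per hour."""
    return size_gb / seconds * 3600.0


# (trial, dataset size in GB, execution time in s), copied from Table 3
trials = [
    (0, 0.1, 75.6),
    (1, 1.0, 120.7),
    (2, 5.0, 152.6),
    (3, 10.0, 262.2),
    (4, 100.0, 3026.7),
]

rates = {
    trial: round(reconciliation_rate_gb_per_hour(size, secs), 2)
    for trial, size, secs in trials
}
```

The recomputed values agree with the table to two decimal places, including the 100 GB run on the big data environment.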
2.3.4 Investigating the reconciliation accuracy
To complete the evaluation of Grafterizer and ASIA in the context of BC4, we carried out an experiment to measure the accuracy of our solution in the matching operation.
Let us briefly describe the problem. JOT receives from AdWords information about the number of impressions achieved per campaign and per geographical area. The areas used by AdWords are a subset of those of GeoTarget. To enrich with meteorological information, ASIA uses GeoNames as a shared identifier system (SSI) of reference; for this reason, it is necessary to be able to reconcile with precision the toponyms of AdWords to those of GeoNames.
Not having a ground truth with which to compare the matches proposed by ASIA, we had to build it. With a tool called SILK we created a matching pipeline that associated the toponyms of AdWords related to Spain with the administrative areas of GeoNames. We limited ourselves to Spain because we are convinced that this is a difficult problem, given the large number of toponyms of Spanish origin on the planet (identical or differing in an accented character). AdWords, limited to Spain, considers 534 places; the SILK pipeline is able to create 499 links with the administrative areas of GeoNames. Of the 35 missing links, a manual analysis revealed that 30 do not have a corresponding administrative area (of any kind) associated in GeoNames, while the remaining 5 are not identified by SILK because they do not exceed the similarity threshold (because of syntactic differences: e.g., Las Rozas exists only as Las Rozas de Madrid in GeoNames). These links were inserted manually. The set of links was then manually evaluated by randomly extracting a percentage of the proposed links and assessing their correctness. As a result of this operation, we obtained a comparison dataset which we will refer to as the Silver Standard.
ASIA was then called to reconcile the 534 Spanish toponyms of AdWords, obtaining a precision of 0.7959 and correctly handling 425 cases. The errors of ASIA are mainly due to:
• accents: many Spanish cities (Coria del Río, Alcalá la Real, Mérida, etc.) are written with an accent, but unfortunately there are cities with the same name written without an accent (especially in the Philippines). ASIA, in the configuration of this experiment, was not given information on the state the toponyms belong to.
• ASIA creates a match even when the right administrative area does not exist (e.g., les Tres Cales does not have an administrative area, and ASIA returns Les Irois, a city in Haiti); errors of this type are due to the complexity of the scoring system used by Elasticsearch (which underlies ASIA), which is not normalized, so it is not trivial to set a threshold that is reasonable for all cases.
• in rare cases matching does not lead to any result (e.g., Guernica appears in GeoNames only as Gernika-Lumo; using SILK we had the same problem, which was solved by manually entering links).
• in very few cases we have a coding error (for the city Avila, in AdWords there seems to be a special character that misleads the reconciliation process).
• in a few cases we have unresolved ambiguities (e.g., Navalmoral de la Mata is matched to Navalmoral, both in Spain, both of level 3, but in different regions).
Clearly, it is possible to alleviate the problem of ambiguity by indicating the state to ASIA when carrying out reconciliation. In this case the precision rises to 0.8945. We believe that this value of precision is, in relation to the scope of BC4, a good result, because the toponyms not reconciled (or reconciled incorrectly) refer to peripheral locations of the territory and therefore affect the calculation of total impressions only marginally.
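The effect of indicating the state can be illustrated with a small sketch. It does not reproduce the ASIA API: the Candidate type, the best_match function and the scores are invented for illustration, and only the figures 425 and 534 come from the experiment above.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    country: str   # ISO 3166-1 alpha-2 code
    score: float   # unnormalized search score (made-up values below)


def best_match(candidates: List[Candidate],
               country: Optional[str] = None) -> Optional[Candidate]:
    """Return the highest-scoring candidate, optionally restricted to one state."""
    pool = [c for c in candidates if country is None or c.country == country]
    return max(pool, key=lambda c: c.score) if pool else None


# Hypothetical candidates for the AdWords toponym "Mérida".
candidates = [
    Candidate("Merida", "PH", 9.1),
    Candidate("Mérida", "ES", 8.7),
    Candidate("Mérida", "MX", 8.5),
]

without_state = best_match(candidates)
with_state = best_match(candidates, country="ES")

# Precision reported without state information: 425 correct cases out of 534.
baseline_precision = round(425 / 534, 4)
```

Without the state the highest raw score wins even when it belongs to a homonym abroad; restricting the pool to Spain removes that class of error, which is consistent with the rise in precision reported above.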
2.4 Discussion on limitations
In this section, we systematically discuss the current limitations and the aspects that could be improved to bring a non-marginal improvement to the performance of the entire process. It must be noted, nonetheless, that the performance achieved fits the requirements of the business cases.
2.4.1 Data Locality
In this initial release of the Processing component, the data locality principle, intended as one of the chief scalability enablers, is only partially implemented and exploited. It is limited to the life-cycle management of the data of the knowledge bases employed for enrichment, which are brought into the environment. In this way, the speed of the functionalities relying on them increases dramatically. The working data, by contrast, are stored in a distributed file system and accessed through the network. This architectural choice, which enables uniform access times to data, has the disadvantage of raising the average read/write times by a quantity equal to twice the network latency (each agent reads a data chunk and writes a larger one). As for the working data, there is the chance to improve the reading and writing performance, thus affecting the whole process positively, by moving the data as close as possible to the agents that must process them. This can be done by distributing the chunks among the machines that execute the agents' containers.
2.4.2 Distributed Caching
The hierarchical caching system has ample room for improvement, mainly because each replicated ASIA deployment has its own local cache. Moreover, due to the presence of the load balancer running a round-robin dispatching policy (thus caching-unaware), identical requests can be assigned to different replicas of ASIA, causing preventable cache misses. Besides, the cache used at the agent level is also private, which ends up generating more requests than are strictly necessary. A possibly better solution to the problem of duplicated requests to the enrichment services entails the use of a distributed cache shared among the various instances of ASIA and among the agents that carry out the pipeline in parallel. Such a service (for example Ehcache⁹), once strategically deployed on the machines that make up the cluster of the big data environment, would guarantee a rapid synchronization of the content of the local caches, reducing the number of misses.
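The preventable misses caused by round-robin dispatching over private caches can be counted in a small simulation. This is a simplified model with hypothetical function names, caches that never evict, and placeholder request keys; it does not model Ehcache or the actual load balancer.

```python
from typing import List, Sequence, Set


def misses_with_private_caches(requests: Sequence[str], replicas: int) -> int:
    """Count misses when each replica keeps its own cache behind round-robin."""
    caches: List[Set[str]] = [set() for _ in range(replicas)]
    misses = 0
    for i, key in enumerate(requests):
        cache = caches[i % replicas]  # round-robin dispatch ignores the key
        if key not in cache:
            misses += 1
            cache.add(key)
    return misses


def misses_with_shared_cache(requests: Sequence[str]) -> int:
    """With one shared cache, only the first request for each key misses."""
    return len(set(requests))


# 2227 placeholder toponyms, each requested four times in a row.
requests = [f"city-{i}" for i in range(2227) for _ in range(4)]

private_misses = misses_with_private_caches(requests, replicas=10)
shared_misses = misses_with_shared_cache(requests)
```

In this worst case the four consecutive requests for a toponym land on four different replicas, so every request misses, whereas a shared cache would miss only once per toponym.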
2.4.3 Efficient API interaction
The service APIs are in themselves a point that merits optimization, which would generate, we believe, a significant improvement in execution times. At present, in fact, for both the design and processing phases, reconciliation and extension are invoked for each single row of the working table. This means that for each line the agent running the pipeline must wait a time equal to the round-trip delay (RTD), that is, twice the network latency. A considerable improvement would be obtained by grouping the invocations to the service. The processing times of the input dataset could be further improved if lightweight network protocols (such as WebSocket) were used together with APIs that better exploit message serialization (such as Google Protobuf¹⁰).
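Why grouping invocations helps can be seen with a simple cost model in which every request pays one round trip and every row pays a fixed processing cost. The function names and the timing values below are assumptions chosen for illustration, not measurements from the experiments.

```python
import math
from typing import Iterator, List, Sequence


def batched(rows: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield list(rows[start:start + size])


def total_time_ms(n_rows: int, batch_size: int,
                  round_trip_ms: float, per_row_ms: float) -> float:
    """One round trip per request plus a fixed processing cost per row."""
    n_requests = math.ceil(n_rows / batch_size)
    return n_requests * round_trip_ms + n_rows * per_row_ms


rows = [f"toponym-{i}" for i in range(1000)]  # placeholder input
batches = list(batched(rows, 100))

# Illustrative values only: 2 ms round trip, 0.5 ms of service work per row.
one_by_one = total_time_ms(1000, 1, round_trip_ms=2.0, per_row_ms=0.5)
grouped = total_time_ms(1000, 100, round_trip_ms=2.0, per_row_ms=0.5)
```

Under these assumed values the row-by-row strategy spends most of its time waiting on round trips, while batches of 100 rows pay that cost only ten times.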
2.4.4 Improving the reconciliation method
We have seen that in the experiment reported above ASIA obtained a precision of 0.8945. This value can be easily improved by rebuilding the index used by ASIA so that it differentiates between accented and unaccented letters. In addition, it is possible to further improve the accuracy by passing to ASIA more context information beyond the state, such as the indication of the region.
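The difference between an index that folds accents and one that preserves them can be sketched with the Python standard library. How the ASIA index is actually configured is not stated here, so the two comparison functions below are illustrative assumptions.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Remove combining marks, so that 'Mérida' becomes 'Merida'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_insensitive_equal(a: str, b: str) -> bool:
    """Comparison as performed by an index that ignores accents."""
    return fold_accents(a).casefold() == fold_accents(b).casefold()


def accent_sensitive_equal(a: str, b: str) -> bool:
    """Comparison that keeps accented and unaccented letters distinct."""
    normalized_a = unicodedata.normalize("NFC", a).casefold()
    normalized_b = unicodedata.normalize("NFC", b).casefold()
    return normalized_a == normalized_b


folded = [fold_accents(name)
          for name in ["Coria del Río", "Alcalá la Real", "Mérida"]]
conflated = accent_insensitive_equal("Mérida", "Merida")
kept_apart = not accent_sensitive_equal("Mérida", "Merida")
```

An accent-insensitive comparison conflates the Spanish city with its unaccented homonyms, whereas an accent-sensitive one keeps them apart, which is the behaviour the index rebuild aims for.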
9 https://www.ehcache.org/
10 https://developers.google.com/protocol-buffers/
2.5 Conclusions
The objective of this chapter is to assess the Grafterizer, ASIA and Processing components within the scope of business cases 3 and 4. In particular, BC4 represents a challenge for the enrichment process because of the large size of the datasets to be processed and the number of toponyms to be reconciled. In Section 2, four experiments have been presented that demonstrate the suitability of the components of the toolkit both from the point of view of performance (up to 100 GB of enriched data in less than an hour) and accuracy (precision = 0.8945 for the toponyms of Spain, between GeoTarget and GeoNames).
Chapter 3 Analytics: QMiner
3.1 Usage in EW-Shopp BCs
The analytics component takes the transformed and linked data as input and then extracts insights and builds models that power the business services of business case partners. It is built on the QMiner data analytics platform, which has been used to implement a comprehensive learning and prediction pipeline as described in deliverable D3.2 [8].
Most business cases have already tested the analytics component in the development versions of their business services. The exception is BC3, where ME have used their own solution. BC1 partners have all used it for prediction in their use cases. CE have used it for prediction of market reactions to weather conditions and pricing events, and the weather prediction service has even been tested in a production environment on their site. BB is using it to predict visits to their physical stores based on weather conditions and a selection of internal and external events (i.e., mostly holidays and major marketing events). In the BT use case it is used to predict the expected volume of incoming calls and the expected answer rate of outgoing calls in their call center based on weather conditions and customer internal events (e.g., marketing campaigns or service outages). Finally, JOT is using it for prediction of the expected volume of impressions of keyword clusters based on weather conditions and global events covered in the news data obtained from Event Registry.
There are two ways to evaluate the analytics services. The first is to measure the performance of the built models by estimating their accuracy using machine learning methodology. The other is to measure the technical performance of the services by observing how many resources they consume while processing the data. The following two sections discuss these two kinds of performance respectively.
3.2 Modelling Performance
By modelling performance, we denote the measure of the quality of models built by the analytics services. Such measures estimate how well the models have captured the properties of the data and, consequently, the accuracy of the predictions the models give. Evaluation of models is a broad topic within the field of machine learning and statistics and goes well beyond the scope of this document. The analytics tasks within the project are almost exclusively of two types, namely classification tasks or regression tasks. Each type of task can be solved with a specific subset of analytic models and hence has its own associated measures for evaluating these models. For our discussion the most commonly used measures will suffice.
In a classification task the model is presented with an input example and asked to specify which of the k predefined categories this example belongs to. When evaluating the model, we have a collection of input examples, previously not seen by the model, for which the true categories are known. We feed these examples through the model to obtain predicted categories, which are later compared to the true categories. To quantify the model’s performance, different measures are calculated both on the level of a specific category and on the overall level. When evaluating on a specific category c, one of four possible cases can occur for each input example e:
• true positive (TP) – model correctly predicted that the true category of e is c,
• true negative (TN) – model correctly predicted that the true category of e is not c (since we are focusing only on category c, the actual predicted category does not matter as long as it is different from c),
• false positive (FP) – model falsely predicted that the true category of e is c (the correct category might be any other except c),
• false negative (FN) – model falsely predicted that the correct category of e is not c.
Each of the measures used in our discussion requires the number of occurrences of each of the above cases. The following measures are calculated when estimating the model’s performance on category c:
• precision – measures the fraction of examples whose category is correctly predicted to be c versus all examples whose category is (falsely or correctly) predicted to be c, and is calculated by the formula $\mathrm{precision} = \frac{TP}{TP + FP}$,
• recall – measures the fraction of examples whose category is correctly predicted to be c versus all examples whose correct category is c but might or might not be correctly predicted by the model, and is calculated by the formula $\mathrm{recall} = \frac{TP}{TP + FN}$,
• F-score – combines the precision and recall into a single number as their harmonic mean, and is calculated by the formula $F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$.
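The three formulas translate directly into code. The sketch below uses invented counts for TP, FP and FN, and additionally checks the harmonic mean against the precision and recall values reported for the Ceneje.si case in Section 3.2.1.

```python
def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP); defined as 0.0 when nothing was predicted as c."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN); defined as 0.0 when no example truly belongs to c."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Invented counts for one category c.
tp, fp, fn = 30, 10, 20
p = precision(tp, fp)
r = recall(tp, fn)
f = f_score(p, r)

# Precision and recall as reported for Ceneje.si (Section 3.2.1).
ceneje_f = round(f_score(0.766, 0.614), 3)
```

With the invented counts, precision is 0.75 and recall is 0.6; the Ceneje.si check reproduces the reported F-score of 0.682.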
In a regression task, the model is presented with an input example and asked to predict a numerical value. The task is similar to classification, except that the format of the output is different. One example would be predicting the exact number of customers visiting a physical store on a given day. When evaluating the model on a collection of input examples, true values are compared with the values predicted by the model. The difference between the true and the predicted value is referred to as an error. The following performance measures are calculated:
• mean absolute error (MAE) – is the mean absolute value of all errors, and is calculated by the formula $\mathrm{MAE} = \frac{1}{n} \sum_{e} \left| \mathrm{true}_e - \mathrm{predicted}_e \right|$, where the sum goes over all input examples and n is the number of all input examples,
• root-mean-squared error (RMSE) – is the square root of the mean of squared errors, and is calculated by the formula $\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{e} \left( \mathrm{true}_e - \mathrm{predicted}_e \right)^2}$, where the sum goes over all input examples and n is the number of all input examples,
• mean absolute percentage error (MAPE) – is the mean of absolute percentage errors and is calculated by the formula $\mathrm{MAPE} = \frac{100\%}{n} \sum_{e} \left| \frac{\mathrm{true}_e - \mathrm{predicted}_e}{\mathrm{true}_e} \right|$, where the sum goes over all input examples, n is the number of all input examples and $\left| \frac{\mathrm{true}_e - \mathrm{predicted}_e}{\mathrm{true}_e} \right|$ is an absolute percentage error for input example e.
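The three regression measures can be computed as in the following sketch. The visit counts are invented for illustration, and MAPE assumes that no true value is zero.

```python
import math
from typing import Sequence


def mae(true: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(true, predicted)) / len(true)


def rmse(true: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-squared error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(true, predicted)) / len(true)
    )


def mape(true: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent (true values must be non-zero)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(true, predicted)) / len(true)


# Invented daily visit counts.
true_visits = [100.0, 200.0, 400.0]
predicted_visits = [110.0, 180.0, 400.0]

mae_value = mae(true_visits, predicted_visits)
rmse_value = rmse(true_visits, predicted_visits)
mape_value = mape(true_visits, predicted_visits)
```

Here the errors are 10, 20 and 0 visits: the MAE is 10 visits, the RMSE is slightly higher because the larger error weighs more, and the MAPE is about 6.7%.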
All these measures indicate the quality of the built model and represent immediate feedback for the user because they can be computed on the input dataset using cross-validation¹¹, which means splitting the dataset into subsets and systematically using parts of the data for learning and the others for testing. This evaluation process is run automatically in the EW-Shopp analytics tool every time a model is built. This serves both as a measure of model quality and as a sanity check in case of errors in the learning process.
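The splitting step of cross-validation can be sketched as follows. This is a generic k-fold split with a hypothetical function name, not the code of the EW-Shopp analytics tool; a practical implementation would usually shuffle the examples first.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_examples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each example is tested exactly once."""
    indices = list(range(n_examples))
    # Spread the remainder over the first folds so that sizes differ by at most 1.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_examples=25, k=10))
tested = sorted(i for _, test in folds for i in test)
```

Each fold serves once as the test set while the remaining folds are used for learning, and the quality measures are then computed on the test predictions across the folds.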
The problem with the listed accuracy measures is that they are highly data dependent. This means that reporting the performance of the analytics services on one dataset (or one business case) tells us little about the expected performance on another dataset. There are ways of gauging the accuracy of machine learning algorithms that may give more generalizable results, such as using theoretical and synthetic data. These approaches fall outside the scope of this document, and there is a wide array of machine learning literature covering this topic. Here we only report model performance from two business cases, more to illustrate the measures themselves than to make an in-depth analysis of overall performance. More detailed discussions of business case results are reported in WP4 deliverables.
3.2.1 Ceneje.si
The first example of evaluation we describe comes from the Ceneje.si business case. CE are creating a tool that would help stores plan pricing policies, and they need to estimate the impact of discounts on product demand. This is formulated as a machine learning task where a model predicts whether some market situation (which may or may not have a discount) will raise customer demand (measured by the relative change of the number of clicks on the website) over some threshold. The results we report here are computed for the threshold of a 15% increase.
The data we use are for air conditioning units from the year 2017. For each product we aggregate pricing and demand data on a weekly level and compute features such as the minimal/maximal/average price on the market, recent and long-term demand, weather conditions and others. From these we build a model which separates “positive” cases (demand has risen by 15% or more) from the “negative” cases (demand has not risen by 15% or more). By running a 10-fold cross-validation, we obtain the following results by evaluating on the category of “positive” cases: precision = 0.766, recall = 0.614, F = 0.682. These numbers tell us that the model correctly captures 61.4% of all increases in demand and gives a false positive for only 23.4% of its positive predictions. Note that these numbers only give us the technical performance of the model; the business value of a model with this accuracy needs to be evaluated by domain experts from the business side.
11 https://en.wikipedia.org/wiki/Cross-validation_(statistics)
3.2.2 Big Bang
The second example is an evaluation of a model solving a task from the Big Bang business case. Big Bang is developing a tool for predicting the daily number of customers visiting one of their 18 physical stores. They are interested in predicting an exact number of visits, which can be formulated as a regression task.
The store is located in a large shopping center in Slovenia’s capital city, Ljubljana. The dataset records the number of daily visits from 2014 to 2018. The model is trained on the years from 2014 to 2017 and asked to predict the number of visits for each day in 2018.
We consider the trend, seasonality on a yearly and weekly basis, national holidays, major marketing events and weather information. In order to better understand the effect of weather, we trained two separate models, one with the information about weather included (weather aware model) and one without this information (weather unaware model). Both models are evaluated, and their performance is compared.
The improved performance of the weather aware model can be clearly seen in the summer months, when there are almost no other external factors such as holidays or large marketing events influencing the visits. Figure 8 and Figure 9 below show the number of visits predicted by the weather unaware model and the weather aware model respectively, for part of summer 2018. Black dots indicate the actual number of visits, while the blue line indicates the predicted number of visits. In the figure displaying predictions of the weather aware model, violet lines indicate the correction of the predicted value compared to the prediction of the weather unaware model.
Figure 8. Predicting the number of visits with weather unaware model.
While the weather unaware model failed to predict the increased number of visits on specific days, the weather aware model performed better.
Figure 9. Predicting the number of visits with weather aware model.
We report three different measures of performance for the weather unaware and the weather aware model. For the weather unaware model, we obtain the following results: RMSE = 261, MAE = 188 and MAPE = 25%, while for the weather aware model we obtain the following results: RMSE = 235, MAE = 153 and MAPE = 21%.
Since the MAE measure is in the same units as the value (number of visits), it can be said that the weather unaware model on average misses the number of visits by 188, while the weather aware model on average misses the number of visits by only 153. Since this measure expresses a mean, the actual error might be smaller on some days and larger on others.
The RMSE measure is more sensitive to large errors, as the effect of each error is proportional to its squared value. The RMSE of the weather aware model is also lower than the RMSE of the weather unaware model, which indicates that the weather aware model might perform better on average in the case of large errors.
The MAPE measure is probably the most widely used performance measure, and it is also the most interpretable across different stores, as it does not rely on the actual scale of values but expresses the error in percentages. The lowest MAPE is also achieved by the weather aware model, and it can be said that the model misses the true value by 21% on average, which is considered quite a good forecast. As in the previous example, these numbers only give us the technical performance of the model, while the actual business value needs to be evaluated by domain experts from the business side.
3.3 Technical Performance
The performance of the analytical services in a technical sense pertains to the amount of machine resources the services consume when processing some amount of data. This is of high importance, since if these needs exceed the capabilities we have at hand, we cannot solve the problem and need to obtain more hardware. The two main resources important for the analytics services (and pretty much any computation-intensive task) are the amount of time needed to perform the computation (CPU time) and the amount of system memory consumed (RAM).
Both resources of course depend on the amount of data we are processing and slightly on the operating system running the analytical services and the tools used for measurements. The evaluation was performed on the Ubuntu 18.04.1 LTS operating system using the time¹² and pidstat¹³ commands.
The CPU time is represented as the sum of the system time and user time, which represents how much actual CPU time the process used. However, this metric could potentially exceed wall clock time if many CPU cores are used, or be less than wall clock time if there are many blocking I/O operations during which the CPU is not needed. The RAM is measured as the Resident Set Size (RSS), which shows how much memory the process currently occupies in RAM. It does not include memory that is swapped out, but it does include memory from shared libraries whose pages are actually in memory. The Virtual Memory Size (VSZ) is the address space allocated in the process’s memory map, but there is not necessarily any actual memory behind it, let alone physical memory. Although these metrics are not the best for a very accurate evaluation of resource usage, they are nevertheless a good approximation for our purposes.
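The same two quantities can also be read from inside a process. The sketch below uses the Python standard library module resource (available on Unix only) instead of the time and pidstat commands used in the evaluation; the workload is a stand-in, and the helper names are assumptions.

```python
import resource  # Unix-only standard library module


def cpu_time_seconds() -> float:
    """User time plus system time consumed so far by this process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_utime + usage.ru_stime


def peak_rss_megabytes() -> float:
    """Peak resident set size; Linux reports ru_maxrss in kilobytes."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


before = cpu_time_seconds()
total = sum(i * i for i in range(200_000))  # stand-in for model learning
cpu_used = cpu_time_seconds() - before
peak_mb = peak_rss_megabytes()
```

As in the text, the CPU time is the sum of user and system time, and the memory figure is the peak RSS rather than the virtual memory size.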
To show the effect of the dataset size and the number of features on the resource usage of the model learning process, several tests were performed on the JOT dataset, consisting of approximately 3000 clustered keywords and 10 million activities. For the purposes of evaluation, we fixed the clustered keyword to “/Sport&Fitness/Sport/Fußball” and measured the resource usage during the learning of the model while separately changing the number of records and the number of features used. In the following figures, each data point represents the average of 10 runs, and the upper and lower grey caps represent the maximum and minimum value, respectively.
Figure 10. The actual CPU time required to learn the model depending on the number of records.
For one specific region in Germany and one specific clustered keyword in a year (in our example, 2017) we have a maximum of 365 impression records (and possibly fewer due to missing data).
12 https://linux.die.net/man/1/time
13 https://linux.die.net/man/1/pidstat
[Figure 10 chart data: CPU time [s] of 8.63, 10.53, 13.81, 15.58 and 20.35 for 285, 460, 686, 787 and 1076 records, respectively]
Therefore, we decided to gradually increase the number of records by adding records of different regions. Because the impact of weather features might differ a lot between regions, we decided to use only the event features, which are global and therefore region independent. As expected and seen in Figure 10, increasing the number of records increases the CPU time needed to learn the model almost linearly.
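The "almost linear" relationship can be checked with an ordinary least-squares fit on the data labels of Figure 10. The fit below is an illustration added here, not part of the original evaluation, and the function name is an assumption.

```python
from typing import Sequence, Tuple


def least_squares(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Return slope, intercept and R^2 of the line fitted to (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy * sxy / (sxx * syy)
    return slope, intercept, r_squared


# Data labels read from Figure 10.
records = [285, 460, 686, 787, 1076]
cpu_seconds = [8.63, 10.53, 13.81, 15.58, 20.35]

slope, intercept, r_squared = least_squares(records, cpu_seconds)
```

The fitted slope is roughly 0.015 s of CPU time per additional record, and the coefficient of determination is above 0.99, which supports the description of the trend as almost linear.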
Figure 11. Consumption of memory during the learning of the model depending on the number of records.
Figure 11 depicts the consumption of memory while learning the model on an increasing number of records. It shows that the average virtual memory size slowly increases with the number of records, while the maximum RSS, i.e., the occupancy of the actual RAM, does not increase and seems steady. However, RSS would eventually also increase with an increasing number of records.
[Figure 11 chart data: allocated memory [MB] for 285, 460, 686, 787 and 1076 records; average VSZ 1287.63, 1300.00, 1318.81, 1354.66 and 1362.83; maximum RSS 1389.17, 1389.16, 1389.13, 1389.20 and 1389.19]
Figure 12. The actual CPU time required to learn the model depending on the number of features.
[Figure 12 chart data: CPU time [s] of 8.63, 9.26, 9.23, 9.44 and 9.59 for 10, 35, 60, 85 and 103 features, respectively]
In the case where we tested the effect of an increasing number of features, we used 265 records from one region. The first 10 features are event features, and every additional set consists of various weather features. The last data point has 103 features, which is the whole feature set we were using to train the models.
In Figure 12, similarly to Figure 10, the actual CPU time required to learn the model increases with the number of features. The CPU time required increases faster per added feature than per added record. However, this depends on the type of the feature and on whether all values of the records are valid. This is fortunate because we usually have a fixed set of interesting features and increase only the number of records to learn the model.
Figure 13. Consumption of the memory during the learning of the model depending on the number of features.
As seen in Figure 13, an increasing number of features does not seem to influence the consumption of memory directly; it is likely that the process pre-allocates more memory than it needs. A considerably larger number of features would surely affect the allocated memory to a greater extent.
3.4 Conclusions and discussion on limitations
This chapter covers the two main approaches to evaluating analytics services in the EW-Shopp toolkit. Measuring the modelling performance is done by using machine learning methodology and gives the user a sense of the quality of the model built. These metrics are very useful but are data dependent and do not transfer between different tasks. This means that the level of performance we can achieve depends on how predictive the data are for the task we are solving, and the tests need to be performed on each dataset independently.
The other type of performance we can measure is the technical performance of the computation of the analytics processes. We observe the amount of hardware resources the computation consumes, focusing on CPU time and memory usage. We ran several experiments by simulating realistic EW-Shopp analytics tasks. Our results show these tasks can be efficiently solved on commodity hardware with computation time well below 1 minute and memory consumption on the order of 1.5 GB.
[Figure 13 chart data: allocated memory [MB] for 10, 35, 60, 85 and 103 features; average VSZ 1289.11, 1290.36, 1294.43, 1291.30 and 1292.78; maximum RSS 1389.21, 1389.15, 1389.17, 1389.23 and 1389.13]
Chapter 4 Visualization: KNOWAGE
4.1 General and Common features
As explained in detail in deliverable D3.3 [8], data visualisation and navigation in EW-Shopp is offered mainly through the Knowage platform (www.knowage-suite.com). Knowage is an Open Source business analytics and visualization suite developed by ENG. The Community Edition (CE) is freely downloadable by anyone and cross-platform, running on Linux and Windows. Among the different Knowage tools is a set of services able to visualize, navigate and explore, in an intuitive way, data present in the most widespread data storage formats and systems. Being completely open source, Knowage is extensively documented online.
In this chapter we focus on the evaluation of Knowage in the context of EW-Shopp at month 30, also following on from the evaluation carried out at month 24 and presented in D2.3 [8]. In order to facilitate the exposition, in the next section we provide a usage description, updated for the current project status (i.e., month 30), organised by EW-Shopp Business Case.
Given that Knowage is extensively documented externally and internally (see D3.3 above), here we only summarise some of the key Knowage features common to all business cases, for completeness and in order to facilitate the exposition below.
The basis for data management and visualization in Knowage are data sources and data sets. A data source is a data connection/acquisition from any of the supported systems (such as a relational database, REST API, CSV file, etc.). Knowage is able to connect to data ‘providers’ such as databases, stores and APIs in order to extract and process data which is eventually used for visualization. Data sets provide more in-depth and granular means of data interaction. They are also the data ‘suppliers’ for actual visualization features (e.g. widgets). For instance, a data set can be an SQL query to a MySQL database provided by a data source (Figure 14 shows an example). One very important feature of most data sets is the possibility to parametrize certain features, for instance a part of an SQL query, which can be modified in real time during visualization and interaction. Knowage offers specific interfaces for data source and data set management, including user management, access management and utilities such as previews and set-up of parameters. These have been described in more detail in D3.3.
Figure 14 Example of SQL query Data Set. Notice the dynamic parameters set with the syntax $P{idProduct} and $P{date}.
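The idea of a parametrized SQL data set can be illustrated outside Knowage. The sketch below does not show how Knowage resolves parameters internally: it simply rewrites the $P{name} placeholders as named parameters and binds them with the Python sqlite3 module, using an in-memory SQLite database in place of MySQL and a hypothetical prices table.

```python
import re
import sqlite3

PLACEHOLDER = re.compile(r"\$P\{(\w+)\}")


def to_named_parameters(query: str) -> str:
    """Rewrite $P{name} placeholders as sqlite3 named parameters (:name)."""
    return PLACEHOLDER.sub(r":\1", query)


# Hypothetical data set query using the placeholder syntax of Figure 14.
dataset_query = (
    "SELECT day, price FROM prices "
    "WHERE id_product = $P{idProduct} AND day >= $P{date}"
)

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE prices (id_product INTEGER, day TEXT, price REAL)"
)
connection.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [(1, "2019-01-01", 499.0), (1, "2019-02-01", 479.0), (2, "2019-02-01", 899.0)],
)

# Values are bound by the driver rather than pasted into the SQL text.
rows = connection.execute(
    to_named_parameters(dataset_query),
    {"idProduct": 1, "date": "2019-01-15"},
).fetchall()
```

Changing the bound values, as a user would do through a cockpit selection, changes the result set without touching the query text.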
In order to manage, filter and present data, Knowage also offers a set of behavioral models and analytical drivers.¹⁴ For instance, Lists of Values (LOVs) can be defined from different data sources (including data sets), which can then be linked to documents and essentially determine (still through user interaction) the way data is presented, filtered, etc.
Once the data sources and data sets to be used in the business case have been defined, the actual navigation and visualization is implemented through the creation of Cockpits. Cockpits allow Knowage users to build interactive environments through WYSIWYG and intuitive interfaces (e.g. clicks and drag and drop, real-time widget update based on user actions, etc.) which provide a series of data visualization and navigation widgets including various types of charts, tables, cross-tables, text, HTML, etc.
Within cockpits users can also create associations among datasets, essentially linking a field (i.e. column) of one dataset with a field of a different one. Associations are a very powerful feature: once fields are associated, any change to one field of a dataset (e.g. a user selection) is reflected in real time in all associated datasets, which allows highly interactive cockpits to be created quickly. Figure 15 shows an example of visual data association for BC4: users interactively associate fields (i.e. columns) from the different datasets. Cockpits can be augmented with analytical drivers, which provide (for instance) filtering features, and with cross navigation features, which make it possible to cross-link various cockpits and let users ‘navigate’ among them interactively.
14 See also D3.3 and, for detailed documentation about Behavioural Models, see: https://knowage-suite.readthedocs.io/en/latest/functionalities-guide/behavioural-model/
Figure 15 Visual creation of a dataset association
4.2 Usage in EW-Shopp BCs
4.2.1 Business Case 1
As extensively explained in other deliverables (namely D4.2 [8]), Business Case 1 actually deploys three different pilots. Pilot 1 concentrates on the B2C scenario, i.e., web portal user engagement; therefore, visualization and user interface are provided directly on the online store web portal. In particular, Graphical Widgets are integrated in target web portals, matching UX features needed for the content based on the predictive information set in exactly defined stages of the purchase process. Deployment (as described in D4.2) happens on the https://www.ceneje.si web portal. Regarding Pilot 3, visualization was implemented as a set of additional custom reports in the existing COCOS Workforce Management Application. No external visualization tools are planned, since COCOS is an established Cloud Contact Centre Suite. In the following we give an overview of the visualization scenario (with updates compared to M24) for Pilot 2.
A ‘back-office’ visualization scenario, which shares the domain and data used in Pilot 1, is carried out for Pilot 2. The visualization cockpits were updated in the latest months through interaction with the involved Business Case partners and in response to the evaluation assessment carried out at M24 in D3.3 (in particular concerning Pilot 2 - Visualization). In this Pilot Knowage aims to provide ‘back-office’ (business) users with a way of exploring the historical price evolution of certain products or product categories, in this case TVs. A new cockpit which presents a yearly rank of TV sellers has been created. This cockpit can also be used as an interface ‘entry point’ where the user clicks on a brand and is then shown the more detailed price analysis cockpit. This is accomplished by the cross-navigation feature offered by Knowage, which is extensively documented: in brief, cross-navigation allows setting up interactive visualisation widgets and cockpits which dynamically link to other documents.
Figure 16 TV brands (product) yearly click ranking. Each row is interactive, and the user can select one to access a more detailed cockpit. Note: some data have been blurred for confidentiality reasons.
Once the product is selected, a new cockpit opens with a ‘drill down’ on the product, presenting the overall price evolution (min, max and average) as well as the possibility to select two dates and compare the ranking of sellers by number of clicks. Leveraging Knowage interactivity, all widgets are clickable and interactive (e.g. the chart can be zoomed to specific date ranges).
Figure 17 Drill-down cockpit for a specific brand selected in the previous one (or through the provided filter). The user can select two specific dates for seller/click performance comparison. Note: some data have been blurred for confidentiality reasons.
Users are also able to select the specific TV brand/model through an intuitive filter selection widget, which also provides incremental search. From a technical point of view this has been implemented through a Knowage List of Values (LOV) added to the cockpit. Product search parameters can also be saved in a so-called ‘bookmark’ so that the user can quickly retrieve them again.
Figure 18 Product selection interface. Notice the incremental search.
Discrete dates (days) can also be selected by the user (e.g. to see specific days on the graph), for example by selecting multiple days in the clicks ranking table provided in the bottom-right part of the cockpit.
Figure 19 The user has interactively selected some dates of interest from the table, and all other visualisations are updated accordingly. Selections can be quickly cleared from the specific widget in the top-right part. Note: some data have been blurred for confidentiality reasons.
4.2.2 Business Case 3
ME already provides visualisation in their product; therefore, in the EW-Shopp Business Case we aim to provide added value through: a) incorporation of weather data; b) creation of a ‘live’ and interactive cockpit which uses periodically updated data. The main need is to have a cockpit for each store to visualise both daily visitor data (e.g. walk-by and visitors) captured by the Measurence system and, at the same time, daily weather features, in particular temperature, cloudiness and precipitation. Additionally, the end users want to be able to also see the weather forecast (again based on the above parameters) for the next week, in order to possibly plan marketing initiatives, etc. Historical data are also presented for reference. In order to implement the scenario, we created a (MariaDB) data source connecting to a centralized database which automatically collects both
Measurence-specific and weather data. The database also provides data for stores, meaning that if the data of a store is updated (or a store is added), this will be dynamically reflected in the cockpit. In order to provide a dynamic date range of two weeks, we created a set of data sets in which ‘today’ is conceptualized as the actual current date together with the two weeks around it (e.g. if ‘today’ is a Tuesday we will see information for the previous week starting on Monday up to next Sunday). A user can also change ‘today’, which by convention is chosen among the Mondays available in the data.
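The two-week window described above can be sketched in Python as follows. This is only one reading of the example in the text (Monday of the previous week up to the Sunday of the week containing ‘today’); the function name `two_week_window` is hypothetical and the actual data sets implement the range in the database rather than in Python.

```python
from datetime import date, timedelta
from typing import Tuple


def two_week_window(today: date) -> Tuple[date, date]:
    """Return (start, end) of the assumed two-week range around `today`.

    Assumption: the range runs from the Monday of the previous week
    to the Sunday of the week that contains `today`, both inclusive.
    """
    # date.weekday() is 0 for Monday and 6 for Sunday.
    monday_this_week = today - timedelta(days=today.weekday())
    start = monday_this_week - timedelta(days=7)
    end = monday_this_week + timedelta(days=6)
    return start, end


# Example: 25 June 2019 was a Tuesday.
window_start, window_end = two_week_window(date(2019, 6, 25))
```

Under this assumption the window always spans 14 calendar days, whichever day of the week is taken as ‘today’.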
The cockpit presents two main charts showing daily visitors and daily temperature, with the two weeks aggregated. Other widgets were also developed, showing in the upper part data regarding visitors and temperature for the past week (historical) and the weather forecast (including all relevant parameters) for the next week. The cockpit is created for each store and the specific store can be selected from a List of Values (LOV; see D3.3 for details). Figure 20 shows a screenshot of a complete store cockpit.
Figure 20 BC3 bi-weekly store analysis with weather cockpit examples. Above: visualisation with widgets showing visitors and weather. Below: updated compact visualization with visitors and temperature on the same chart and including a store filter pane.
4.2.3 Business Case 4
In Business Case 4 the main requirement is to be able to present, in a concise and effective way, the ‘performance’ of certain keywords in certain regions. To this end visualisation was set up for German regions and a set of relevant keywords in German. In this scenario the target users are JOT account managers involved in campaigns and interested in getting insights about keyword performance in the various regions covered by the campaign. Starting from weekly data, the cockpit presents a snapshot of daily keyword performance (measured as the sum of impressions per day). Additionally, the user can ‘drill down’ into regions and also filter by day, keyword, etc. To this end a prototype of a ‘weekly’ cockpit was created (together with its related data sets). The main visualisation chosen is a heat map chart where the X axis shows days of the week, the Y axis shows keywords and the Z axis shows the impressions per day (represented by a colour scale from yellow, for fewer impressions, to red, for more impressions). Additionally, a global daily performance chart (line) is displayed, showing the overall performance for each day.
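As a rough illustration of the aggregation behind these two widgets, the following Python sketch sums impressions per (keyword, day) cell for the heat map and per day for the line chart. The records, keyword names and function names are invented for illustration; in the cockpit itself this aggregation is performed by the data sets and widgets, not by custom code.

```python
from collections import defaultdict

# Hypothetical input rows: (day of week, keyword, impressions).
records = [
    ("Mon", "schuhe", 100),
    ("Mon", "schuhe", 50),
    ("Mon", "jacke", 30),
    ("Tue", "schuhe", 70),
]


def heatmap_cells(rows):
    """Sum impressions for each (keyword, day) pair: Y, X and Z of the heat map."""
    cells = defaultdict(int)
    for day, keyword, impressions in rows:
        cells[(keyword, day)] += impressions
    return dict(cells)


def daily_totals(rows):
    """Sum impressions over all keywords for each day: the global line chart."""
    totals = defaultdict(int)
    for day, _keyword, impressions in rows:
        totals[day] += impressions
    return dict(totals)


cells = heatmap_cells(records)
totals = daily_totals(records)
```

Each value in `cells` would drive the colour of one heat map cell, while `totals` corresponds to one point per day on the overall performance line.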
A cross-table listing keywords and total impressions per keyword is also displayed in the same dashboard. In this way a compact and quick, yet rather complete, visual snapshot of weekly performance is provided (the cross-table values are colour-coded and clickable), which makes the default visualization rather effective. Additionally, a set of dynamic data navigation and exploration features are embedded in the cockpit: 1) for the relevant data sets appropriate associations have been created, so that whenever the user clicks on any data point of a widget all of the other visual elements and widgets are updated accordingly; 2) a specific checkbox widget for filtering keywords (including multiselect) has been added. Similarly, users will most probably want to filter by day(s) to drill down into the performance of the different keywords on a daily basis and compare two or more different dates: a multiselect checkbox provides this functionality (see the example in the Figure 21 screenshot). All filters can be combined to highlight different data features. Finally, all charts are interactive, providing zoom functionalities, details about a data point via ‘hovering’ and generation of a filter selection via clicking.
Figure 21. BC4 weekly keyword cockpit – filtering by days
In addition to the above weekly keyword cockpit, a cockpit which focuses on regional performance is provided, enabling users to focus on regional analyses of the week. The cockpit (see example screenshots in Figure 22 and Figure 23) presents a heat map similar to the previous one, but in this case showing the date on the X axis, regions on the Y axis and activations as Z (i.e. colour scale from yellow to red). A cross-table linking regions, keywords and activations is provided (with colour coding), as well as a simple tree-map visualization where regions are displayed and the area is linked to the sum of activations per region. In this cockpit all widgets are also interactive and linked through data associations: some checkbox widgets associated with the data allow filtering the visualization and ease data navigation (e.g. drilling into certain keywords, regions or days).
Figure 22. BC4 regional analysis cockpit (updated) filtering by region
Figure 23. Screenshot of the regional analysis cockpit with all filters applied (region, keyword and date)
4.3 Performance assessment
At the current stage of the project, EW-Shopp aims to provide a modular toolkit to support certain complex and multiform data flows which include multiple data providers and diverse actors/stakeholders. In D2.3 we consolidated a common dataflow approach where Visualization constitutes the last step in the ‘data chain’ and is dedicated to the visualisation and navigation of data, as output from the preceding steps (Ingestion, Enrichment, Analysis). In the same D2.3 we therefore assessed “the capability of the various tools to work in an integrated manner and to support all of the dataflow stages”.15 In line with this business- and user-centric approach, here we aim to assess how Knowage, as the main visualization tool in the EW-Shopp toolkit, is able to provide the capabilities to support the business and functional requirements of the Business Cases.
• Access and consume multiple and different data sources for visualization. Knowage provides a series of data connectors ‘out of the box’ to most common data sources, including files (e.g. CSV), databases (e.g. MySQL/MariaDB, PostgreSQL), REST, Apache Hive, Apache Spark and MongoDB. In the EW-Shopp Business Cases the most used data sources/sets are MySQL, MariaDB and files. Additionally, a customised REST connector for ArangoDB was developed in the context of the initial Pilots of Business Case 4. Custom data sources can also be created, although this functionality was not used within EW-Shopp. Overall, Knowage offers broad capabilities to connect to most popular databases and is able to support data consumption, especially from the previous Analytics stage in the dataflow.
• Easily query data sources and extract data in customised ways. In the scope of the EW-Shopp Business Cases we consider queries to SQL databases: to this end Knowage offers the possibility to use standard SQL queries as data sets, which are therefore easily portable from existing applications and environments. Additionally, it provides query parametrization, which is a step towards making data queries more interactive and interoperable within the whole Knowage ecosystem. For example, in the following query:
SELECT `Date` FROM `measurence-db`.traffic_data WHERE StoreID = $P{storeID}
the syntax $P{storeID} makes storeID a dynamic parameter that can be linked to other data sets and therefore offer interactivity to the user. Other parametrization options exist depending on the data set, document (e.g. behavioural models), etc.
• Provide effective data visualisation tools which are also interactive and accessible to end users. In some of the EW-Shopp business cases, and in other potential similar use cases, we should assume that the final users at the end of the data consumption ‘chain’ are not necessarily technical users or data scientists (for instance ‘business users’, ‘account managers’, etc. in the various Pilots), and would therefore like to use (or better, ‘see’) data to make strategic or business decisions. Indeed, some of the main final artefacts in Knowage (documents) are, for instance, highly interactive cockpits which users can interact with
15 EW-Shopp, D2.3 EW-Shopp Platform evaluation assessment, Chapter 1 “Introduction and Methodology”
through point-and-click operations via any browser (including on mobile platforms). Of course, the actual set-up and creation of such artefacts does require at least basic technical data analytics and Knowage competences. This approach is still in line with both the EW-Shopp potential business model and that of Knowage, i.e. offering a free Open Source tool together with value-added services (e.g. data set set-up, cockpit creation and customisation, integration, etc.) should a potential business partner require such support. The visual tools in Knowage offer all basic as well as some advanced visualisation and navigation tools, including several types of charts, tables and cross-tables, as well as more web-oriented widgets based on data-linked HTML or SVG. A set of visual filtering and selection tools is also provided, such as the List of Values used in Business Cases 1 and 3 to select products and stores respectively.
• Integration and accessibility capabilities. As mentioned, the Knowage front-end, being HTML5-based, runs in any recent browser. This means not only that it can be integrated easily in any web-based environment (e.g. a company intranet, data portal, etc.), but also that it can be accessed from mobile devices such as phones and tablets. Additionally, most back-end and front-end features (such as data sets and cockpits) can be exposed externally (given the adequate access and security permissions) through a REST API for lower-level operations16 and through HTML for visual artefacts (see example screenshots in Figure 25 and Figure 24 respectively). Finally, the Knowage server can be easily deployed in either a public or a private cloud, making it relatively easy to integrate with other platforms and services.
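The idea behind the $P{storeID} parametrization mentioned in the second point above can be illustrated with a small, self-contained Python sketch. It is not Knowage code and does not claim to reproduce how Knowage resolves parameters internally: it simply rewrites a $P{name} marker into a bound parameter and runs the query against an in-memory SQLite table. The simplified table, the sample rows and the helper names `bind_placeholders` and `run_query` are assumptions made for the example.

```python
import re
import sqlite3

# Matches markers of the form $P{name}.
PLACEHOLDER = re.compile(r"\$P\{(\w+)\}")


def bind_placeholders(query):
    """Rewrite $P{name} markers as sqlite3 named parameters (:name)."""
    return PLACEHOLDER.sub(r":\1", query)


def run_query(conn, query, params):
    """Execute the query with values bound by name instead of string pasting."""
    return conn.execute(bind_placeholders(query), params).fetchall()


# Toy stand-in for the traffic data table (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE traffic_data (Date TEXT, StoreID INTEGER, Visitors INTEGER)")
conn.executemany(
    "INSERT INTO traffic_data VALUES (?, ?, ?)",
    [("2019-06-24", 1, 210), ("2019-06-25", 1, 190), ("2019-06-24", 2, 75)],
)

QUERY = "SELECT Date FROM traffic_data WHERE StoreID = $P{storeID} ORDER BY Date"

# The same query text serves every store; only the bound value changes.
rows_store_1 = run_query(conn, QUERY, {"storeID": 1})
rows_store_2 = run_query(conn, QUERY, {"storeID": 2})
```

The point of the sketch is that a single parametrized query can back a per-store cockpit, with the store chosen at run time (for example from a List of Values) rather than hard-coded in the SQL.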
Figure 24. Demonstration of the Business Case 3 main cockpit embedded in an external HTML page through a simple iFrame.
16 See https://knowage.docs.apiary.io/ for details
Figure 25. Demonstration of retrieving the Business Case 1 documents through the Knowage API (left: through direct GET call, right: through Python)
4.4 Conclusions, limitations and improvements
In this chapter we provided an overview of the current state of the art of the Knowage suite as the main visualization tool for the EW-Shopp toolkit, focusing in particular on its real-world use to support the EW-Shopp business cases. Given the diversity of business cases and related requirements for visualisation and data navigation, Knowage supports these requirements and can be positively assessed. Possible limitations are related to the fact that, in the context of EW-Shopp, not all Knowage features have been used, in particular non-visual features: this is in line with the EW-Shopp set-up, where Knowage was picked specifically as the visualization tool. Also, the definition of the cited dataflow methodology presented in D2.3 allowed us to define more clearly the functional-logical steps and which tools in the toolkit cover them. Another possible limitation (or maybe a feature, depending on the perspective) is that, with the dataflows and related data involved in the EW-Shopp business cases, we opted to have as much as possible of the data processing (in particular enrichment and analytics) happen ‘before’ the data is consumed by Knowage: this is essentially in line with the separation of steps explained earlier, but of course Knowage could also potentially cover some of these tasks.
Regarding improvements, as we are moving to the final stages of EW-Shopp towards Month 36, all partners are working to improve and fine-tune each Pilot. In particular, from the Knowage point of view we aim to improve ‘real-time’ (i.e. live update) functionalities for Business Case 3 with a better automated workflow to ingest, enrich and analyze the data before visualizing it, and more generally to have a final round of feedback and testing of the cockpits with users (and final users), to possibly integrate improvements and enhancements, especially from the UX point of view. Finally, while integration was successfully demonstrated, we would like to test (at least in one Business Case) a tighter integration (e.g. directly within the user partner environment); however, we are still studying if and how this can be achieved in a ‘smooth’ way, i.e. without being too obtrusive for the daily business of these partners.
Chapter 5 Conclusions
The consumer journey generates vast amounts of data that seem too hard to handle at first. Yet with the help of Big Data specialists, reconciling this data, enriching it with internal or external data sources such as weather or geo data, performing advanced analytics on it and eventually presenting it in a visually appealing manner becomes a doable task.
This document has presented the complete EW-Shopp toolkit and the way business partners use certain tools in it. Not every partner uses every single tool from the toolkit, but each has the possibility to adapt its data pipeline to fit its business needs.
The data scientists from SINTEF and UNIMIB have built an infrastructure that can process gigabytes of data. Processing means ingesting data into the infrastructure, cleansing it and eventually enriching it. They tackled the big data problem by designing the Process Infrastructure in a way that enables scalability and parallel processing. So, alongside the data wrangling and enrichment tools (DataGraft, Grafterizer 2.0 and ASIA), an additional set of tools was utilized to enable scalability and parallel processing: Docker containers, GlusterFS and Rancher as the container orchestrator.
The enriched data set is then ready to be fed to the machine learning algorithms of the QMiner component. All the business partners needed predictions based on historic data, and JSI provided analytical models which make the needed calculations. Importantly, these calculations are fast and the prediction data is propagated to business partners almost in real time. Of course, existing Pilots do not exploit the full ML potential of QMiner; at the moment they are limited to classification and regression algorithms.
Finally, all the descriptive and predictive analytics need to be presented in a visually appealing manner, and the Knowage platform handles this aspect. Its various types of charts and its support for many data connectors enable business intelligence specialists to design interactive cockpits that are suited to the end user, who can gain important insights from them.
It must be mentioned that no personal data of any kind is used and all the communication between different modules in the EW-Shopp toolkit runs through secure protocols such as HTTPS or SFTP.
In this deliverable D3.4 we reported the month 30 evaluation of the EW-Shopp tools. The approach adopted was to assess how each tool covers the Pilots of the Business Cases. For each tool the dataflow was described and performance assessed. Where needed, an analysis of current shortcomings and related corrective measures was provided.
Overall we can conclude that the evaluation of all components was successful, with natural improvements required given the timeframe and the fact that a further release of the complete toolset (formerly referred to as ‘Platform’) is foreseen in six months’ time (i.e. at month 36). To this end the EW-Shopp consortium is already at work to implement such corrective measures and improve the tools through various internal iterations, including all the required actors and actions.