The Billion Object Platform (BOP): A Real-time, Big Data, Spatio-Temporal Exploration Platform
Harvard ABCD-GIS
Benjamin Lewis, Merce Crosas, David Smiley, Devika Kakkar, Ariel Nunez
Outline
• Introduction
• Architecture
• Harvesting/Archiving
• Sentiment Enrichment
• Apache Kafka
• Solr for Geo-enrichment
• Solr & Time Sharding
• BOP Web-Service
• Client UI
• Deployment/Operations
• Docker and Kontena
BOP Requirements Summary
• Most recent ~billion geo-tweets
• Real-time search (<5 sec latency)
• Sub-second queries
  – Including heatmaps!
• On the cheap: ~6 commodity servers
Provide a proof-of-concept platform designed to lower the barrier for researchers who need to access big streaming spatio-temporal datasets.
BOP as an Example of a New Kind of Dataset Available in Dataverse
Streaming Data: Harvest and Archive
Initial focus on geo-tweets (could be any streaming dataset)
• 1-2% of tweets have GPS coordinates from the user's device; this ranges from 1 to 6 million tweets per day
• The CGA has been harvesting geo-tweets since 2012 and has an informal archive of about 8 billion objects
• Researcher Ryan Qi Wang was also harvesting during this period. His tweets were loaded first. CGA tweets will be merged later.
• Collaborators:
  – Harvard Dataverse Team
  – Boston Area Research Initiative
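Logical High-Level Architecture
[Diagram of BOP. Components: Harvesting, Enrichment, Kafka (archive), Solr, Web Service, Browser UI. Data flows via Apache Kafka; HTTP in front of the Web Service and Browser UI. Platform: Docker, Kontena, OpenStack. Hosting: Mass Open Cloud.]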
Apache Kafka
• Kafka: a scalable message/queue platform
• See new Kafka Streams & Kafka Connect APIs
• No back-pressure; can be a challenge
• Non-obvious use:
  – For storage; time partitioning
• Lots of benefits yet serious limitations
Real-Time Harvesting
• Connect to Twitter's Streaming API
• Stream tweets using predefined users and coordinates extent
• If the tweet is geotagged, send it to a Kafka topic
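The harvesting filter above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes tweets arrive as dictionaries with a GeoJSON-style `coordinates` field (`{"type": "Point", "coordinates": [lon, lat]}`), and `EXTENT`, `harvest`, and the `publish` callback (a stand-in for a Kafka producer) are hypothetical names, not part of the BOP code.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical extent as (min_lon, min_lat, max_lon, max_lat); whole world here.
EXTENT = (-180.0, -90.0, 180.0, 90.0)


def tweet_point(tweet: dict) -> Optional[Tuple[float, float]]:
    """Return (lon, lat) if the tweet carries a point geometry, else None."""
    coords = tweet.get("coordinates")
    if not coords or coords.get("type") != "Point":
        return None
    lon, lat = coords["coordinates"]
    return float(lon), float(lat)


def in_extent(point: Tuple[float, float], extent=EXTENT) -> bool:
    """True if the (lon, lat) point lies inside the bounding box."""
    lon, lat = point
    min_lon, min_lat, max_lon, max_lat = extent
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def harvest(stream: Iterable[dict], publish: Callable[[dict], None],
            extent=EXTENT) -> int:
    """Forward geotagged tweets inside the extent to `publish`; return the count."""
    sent = 0
    for tweet in stream:
        point = tweet_point(tweet)
        if point is not None and in_extent(point, extent):
            publish(tweet)  # assumption: stands in for sending to a Kafka topic
            sent += 1
    return sent
```

In a real deployment `stream` would be the streaming API connection and `publish` would write to the harvest topic; here both are plain Python objects so the filter logic can be tried in isolation.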
Enrichment
• Kafka topic → Enrich → Kafka topic
• Twitter Sentiment Classifier
• Geo: Solr with regional polygons & metadata
  – Query Solr via spatial point query; attach related metadata to tweet
Sentiment Analysis
• Classifier: Support Vector Machine (SVM) with linear kernel
• Source code in Python
• Uses scikit-learn, numpy, scipy, NLTK
• Two classes of sentiment: Positive (1), Negative (0)
• Training corpus: Sentiment140, Polarity dataset v2.0, University of Michigan
• Preprocessing: lowercase, URLs, @user, #tags, trimming, repeating characters, emoticons
• Stemming: Porter stemmer
• Precision, Recall, F1 score: 0.82 (82%)
• Processing speed: 20 ms/tweet (no emoticon), 5 ms/tweet (emoticon)
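The preprocessing step listed above might look roughly like the following. This is a minimal sketch using only the standard library, not the project's actual code: the placeholder tokens `URL` and `USER`, the choice to keep a hashtag's word, and collapsing repeats to two characters are all assumptions, and emoticon handling and stemming are left out.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # any character repeated 3 or more times


def preprocess(text: str) -> str:
    """Normalize a tweet's text before stemming and classification."""
    text = text.lower()                   # lowercase
    text = URL_RE.sub("URL", text)        # URLs -> placeholder token (assumed)
    text = USER_RE.sub("USER", text)      # @user -> placeholder token (assumed)
    text = HASHTAG_RE.sub(r"\1", text)    # "#tag" -> "tag"
    text = REPEAT_RE.sub(r"\1\1", text)   # "sooooo" -> "soo"
    return " ".join(text.split())         # trim and collapse whitespace
```

For example, `preprocess("  Sooooo HAPPY @Bob http://t.co/x #Boston  ")` gives `"soo happy USER URL boston"`.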
Sentiment Analysis
Phase 1: Training
• Train the classifier
• Save as pickle
Phase 2: Prediction
• Load the classifier
• For each tweet: Parse → Preprocess → Stem → Predict
Solr for Geo Enrichment: "Reverse Geocoding"
• Tweets (docs) can have a geo lat/lon
• Enrich tweet with Country, State/Province, …
  – Gazetteer lookup (point-in-polygon)
Data Set                     Features   Raw size   Index time   Index size
Admin2                       46,311     824 MB     510 min      892 MB
US States                    74,002     747 MB     4.9 min      840 MB
Massachusetts Census Blocks  154,621    152 MB     5.9 min      507 MB
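To make the gazetteer lookup concrete, here is a small point-in-polygon sketch. It is not how Solr does it: BOP uses Solr's spatial index, whereas this uses a plain ray-casting test over a linear scan, and the `gazetteer` structure (a list of name and ring pairs, polygons without holes) is a hypothetical simplification.

```python
def point_in_ring(lon, lat, ring):
    """Ray casting: count how many polygon edges a ray going east crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def reverse_geocode(lon, lat, gazetteer):
    """Return the name of the first region containing the point, or None."""
    for name, ring in gazetteer:
        if point_in_ring(lon, lat, ring):
            return name
    return None


# Two made-up square regions sharing an edge.
gazetteer = [
    ("A", [(0, 0), (10, 0), (10, 10), (0, 10)]),
    ("B", [(10, 0), (20, 0), (20, 10), (10, 10)]),
]
```

A linear scan like this is far too slow for tens of thousands of detailed polygons, which is why a spatial index is used in practice.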
Fast Point-in-Polygon Tricks
Index/Config
• Optimize to 1 segment
• RptWithGeometrySpatialField
  – precisionModel="floating_single"
  – autoIndex="true"
• <cache name="perSegSpatialFieldCache_WKT" …
Search
• Embed Solr (in-process)
• Use docValues, not stored
  – fl=block:field(GEOID10)
Query like this:
• q={!field cache=false f=WKT}Intersects(POINT($lon $lat))
Sub-Millisecond!
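As an illustration of the query shape above, the parameters can be assembled like this. The slide embeds Solr in-process, so the URL encoding here is only a way to show the parameters; the function names and the `rows=1` limit are assumptions added for the sketch, while `WKT` and `GEOID10` are the field names from the slide.

```python
from urllib.parse import urlencode


def reverse_geocode_params(lon: float, lat: float) -> dict:
    """Build Solr parameters for a point-in-polygon lookup against field WKT."""
    return {
        # WKT points are written "x y", that is longitude first, then latitude.
        "q": "{!field cache=false f=WKT}Intersects(POINT(%s %s))" % (lon, lat),
        # Read the block id from docValues rather than a stored field.
        "fl": "block:field(GEOID10)",
        "rows": 1,  # assumption: one matching polygon is enough
    }


def to_query_string(params: dict) -> str:
    """URL-encode the parameters, e.g. for an HTTP request to a select handler."""
    return urlencode(params)
```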
Apache Solr
• Search/analytics server, based on Lucene
• Custom add-ons:
  – Time sharded routing (index + query)
  – LatLonPointSpatialField – in Solr 6.5
    • Faster/leaner search & sort for point data
  – HeatmapSpatialField – in Solr 6.6 TBD
    • Faster/leaner heatmaps at scale
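For readers unfamiliar with heatmap faceting, the idea is to return a grid of counts per cell rather than individual points. The following is a naive sketch of that idea, not Solr's implementation; the function name, the grid orientation (row 0 at the northern edge), and the fixed cell counts are assumptions of this example.

```python
def heatmap(points, extent, cols, rows):
    """Count (lon, lat) points per grid cell inside a bounding box."""
    min_lon, min_lat, max_lon, max_lat = extent
    grid = [[0] * cols for _ in range(rows)]
    for lon, lat in points:
        if not (min_lon <= lon <= max_lon and min_lat <= lat <= max_lat):
            continue  # ignore points outside the requested extent
        # Clamp so points exactly on the east/south edge fall in the last cell.
        col = min(int((lon - min_lon) / (max_lon - min_lon) * cols), cols - 1)
        row = min(int((max_lat - lat) / (max_lat - min_lat) * rows), rows - 1)
        grid[row][col] += 1
    return grid
```

Scanning every point per request does not scale to a billion documents; the search server computes such grids from its spatial index instead.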
Time "Sharding"
Solr has no built-in time-based sharding. A custom Solr "URP" (UpdateRequestProcessor) was developed to route tweets to the right by-month shard. It auto-creates and deletes shards. A custom Solr "SearchHandler" was developed to decide which subset of shards to search, based on custom parameters sent by the web-service. This is generally useful for others, but needs more work before it can be contributed to Solr itself.
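The two routing decisions described here, which shard a tweet is indexed into and which shards a time-filtered query must visit, can be sketched as follows. The shard naming scheme (`tweets_YYYY_MM`) and function names are hypothetical, timestamps are assumed to be in a single time zone such as UTC, and the real logic lives in Java inside Solr rather than in Python.

```python
from datetime import datetime


def shard_name(year: int, month: int) -> str:
    """Hypothetical naming scheme for a by-month shard."""
    return f"tweets_{year:04d}_{month:02d}"


def shard_for(created_at: datetime) -> str:
    """Index-time routing: pick the shard for a tweet's timestamp."""
    return shard_name(created_at.year, created_at.month)


def shards_for_range(start: datetime, end: datetime) -> list:
    """Query-time routing: list every monthly shard overlapping [start, end]."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(shard_name(year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names
```

A query filtered to 20 November 2016 through 1 February 2017 would then touch four shards instead of the whole index.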
The BOP Web-Service
• HTTP/REST API
  – Keyword search
  – Faceting
    • Heatmaps
  – CSV export
• Why not Solr direct?
  – Define a supported API
  – Ease of use for clients
  – Security
Tech:
• Swagger
• Dropwizard
• Kotlin lang (on JVM)
Client UI
• Browser-side UI with no server component
• It uses the following technologies:
  – AngularJS
  – OpenLayers 3
  – npm (dependencies, script minification, development)
UI adapts to laptop
UI adapts to tablets
UI adapts to phones
Temporal filtering
Temporal faceting (histogram)
Spatial filtering
Spatial faceting (heatmap)
Text faceting (tag cloud)
Nearby tweets
Deployment/Operations
• Mass Open Cloud "MOC"
  – OpenStack based cloud (mimics Amazon EC2)
• CoreOS
• Kontena & Docker
• Admin/Ops tools:
  – Kafka Manager (Yahoo!)
  – Solr's admin UI
Stats:
• 12 nodes (machines)
  – 5 to Solr
  – 3 to Kafka
  – 3 to enrichment, …
• 217 GB RAM
• 3500 GB disk
• 17 services (software pieces)
• 133 containers
Docker
• Easy to find/try/use software
  – No installation
  – Simplified configuration (env variables)
  – Common logging
  – Isolated
• Ideal for:
  – Continuous Int. servers
  – Trying new software
  – Production advantages too
• but "new"
Docker in Production
• We use "Kontena"
• Common logging, machine/proc stats, security
  – VPN to secure network; access everything as local
• No longer need to care about:
  – Ansible, Chef, Puppet, etc.
  – Security at network or proxy; not service specific
• Challenges: state & big-data