Towards Scalable Multimedia Analytics Björn Þór Jónsson data sys group Computer Science Department IT University of Copenhagen

Towards Scalable Multimedia Analytics · 2018. 12. 5. · Towards Engineering a Web-Scale Multimedia Service: A Case Study Using Spark Proceedings of the ACM Multimedia Systems Conference


Towards Scalable Multimedia Analytics

Björn Þór Jónsson
datasys group
Computer Science Department
IT University of Copenhagen

Today’s Media Collections

• Massive and growing
  – Europeana > 50 million items
  – DeviantArt > 250 million items (160K/day)
  – Facebook > 1,000 billion items (200M/day)

• Variety of users and applications
  – Novices → enthusiasts → scholars → experts
  – Current systems aimed at helping experts

• Need for understanding and insights

Media Tasks

[Figure: Search, Exploration]

Media Tasks

[Zahálka and Worring, 2014]

Multimedia Analytics

[Figure: Multimedia Analytics combines Multimedia Analysis and Visual Analytics]

[Zahálka and Worring, 2014]

From Data to Insight

[Zahálka and Worring, 2014; Keim et al., 2010]

The Two Gaps

• Semantic Gap [Smeulders et al., 2000]
  – Generic data and annotation based on objective understanding
  – Specific context and task-driven subjective understanding

• Pragmatic Gap [Zahálka and Worring, 2014]
  – Predefined, fixed annotation based on understanding of the collection
  – Dynamically evolving and interaction-driven understanding of collections

Multimedia Analytics State of the Art

• Theory is developing
• Early systems have appeared
• No real-life applications (?)
• Small collections only

Scalable Multimedia Analytics

[Figure: Scalable Multimedia Analytics combines Multimedia Analysis, Visual Analytics, and Data Management]

[Jónsson et al., 2016]

The Three Gaps

• Semantic Gap [Smeulders et al., 2000]
  – Generic data and annotation based on objective understanding
  – Specific context and task-driven subjective understanding

• Pragmatic Gap [Zahálka and Worring, 2014]
  – Predefined, fixed annotation based on understanding of the collection
  – Dynamically evolving and interaction-driven understanding of collections

• Scale Gap [Jónsson et al., 2016]
  – Pre-computed indices and bulk processing of large datasets
  – Serendipitous and highly interactive sessions on small data subsets

Ten Research Questions for Scalable Multimedia Analytics

[Figure labels: Velocity, Volume, Variety, Visual Interaction]

[Jónsson et al., MMM 2016]

Big Data Framework: Lambda Architecture

[Figure: Service Layer, Batch Layer, Storage Layer, Speed Layer]

[Marz and Warren, 2015]


Outline

• Motivation: Scalable multimedia analytics

• Batch Layer: Spark and 43 billion high-dim features

• Service Layer: Blackthorn and 100 million images

• Conclusion: Importance and challenges of scale!

Gylfi Þór Guðmundsson, Laurent Amsaleg, Björn Þór Jónsson, Michael J. Franklin
Towards Engineering a Web-Scale Multimedia Service: A Case Study Using Spark
Proceedings of the ACM Multimedia Systems Conference (MMSys)
Taipei, Taiwan, June 2017

Spark Case Study: Motivation

• How can multimedia tasks harness the power of cloud-computing?
  – Multimedia collections are growing
  – Computing power is abundant

• ADCFs = Hadoop || Spark
  – Automatically Distributed Computing Frameworks
  – Designed for high-throughput processing

Design Choices: ADCF = Spark

• Hadoop is not suitable (more later)
• Resilient Distributed Datasets (RDDs)
  – Transform one RDD to another via operators
  – Lazy execution
  – Master and Workers paradigm

• Supports deep pipelines
• Supports worker’s memory sharing
• Lazy execution allows for optimizations
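The idea of transforming one dataset into another via operators, with lazy execution, can be illustrated with a small single-process sketch. This is not Spark code: `LazyDataset` and `traced_square` are hypothetical names invented for the illustration, and the class only mimics the behaviour that transformations are recorded first and executed when a result is requested.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, source, ops=None):
        self._source = source
        self._ops = list(ops or [])

    def map(self, fn):
        # Return a NEW dataset with one more recorded operator.
        return LazyDataset(self._source, self._ops + [lambda it: (fn(x) for x in it)])

    def filter(self, pred):
        return LazyDataset(self._source, self._ops + [lambda it: (x for x in it if pred(x))])

    def collect(self):
        # Only here is the recorded pipeline actually run.
        it = iter(self._source)
        for op in self._ops:
            it = op(it)
        return list(it)


calls = []


def traced_square(x):
    calls.append(x)  # record that work was actually done
    return x * x


pipeline = LazyDataset(range(5)).map(traced_square).filter(lambda v: v % 2 == 0)
calls_before_collect = len(calls)  # no element has been processed yet
result = pipeline.collect()        # the whole pipeline runs now
print(calls_before_collect, result)
```

Because nothing runs until `collect()`, a real framework is free to inspect and optimize the whole recorded pipeline before executing it, which is the point made in the last bullet.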

Design Choices: Application Domain

• Content-Based Image Retrieval (CBIR)
  – Well known application
  – Two phases: Off-line & “On-line”

[Figure: Query Image → CBIR System → Search results]

Design Choices: DeCP Algorithm

Properties:
• Clustering-based
• Deep hierarchical index
• Approximate k-NN search
• Trades response time for throughput by batching

Why?
• Very simple
• Prototypical of many CBIR algorithms
• Previous Hadoop implementation facilitates comparison
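Why batching trades response time for throughput can be sketched as follows, under the assumption that the dominant cost of a query is reading its cluster from storage. The helper names and the example cluster assignments are hypothetical and are not taken from DeCP itself.

```python
from collections import defaultdict


def cluster_reads_unbatched(query_clusters):
    """Each query is answered on its own, so it reads its cluster separately."""
    return len(query_clusters)


def group_queries_by_cluster(query_clusters):
    """Batching: collect all queries that need the same cluster."""
    groups = defaultdict(list)
    for query_id, cluster_id in enumerate(query_clusters):
        groups[cluster_id].append(query_id)
    return dict(groups)


def cluster_reads_batched(query_clusters):
    """With batching, every needed cluster is read once for the whole batch."""
    return len(group_queries_by_cluster(query_clusters))


# Hypothetical batch: the cluster each of six query features maps to.
query_clusters = [3, 1, 3, 3, 2, 1]
groups = group_queries_by_cluster(query_clusters)
print(cluster_reads_unbatched(query_clusters), cluster_reads_batched(query_clusters))
print(groups)
```

The first query in the batch must wait until the whole batch has been processed, so its response time grows, while the number of cluster reads per query shrinks.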

DeCP as a CBIR System

• Off-line
  – Build the index hierarchy
  – Cluster the data collection
• On-line
  – Approximate k-NN search
  – Vote aggregation

[Figure: searching a single feature. The index is in RAM and the clusters reside on disk. Steps: Identify, Retrieve, Scan, k-NN over the clustered collection.]
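The off-line and on-line phases can be sketched in a few lines of plain Python. This is a simplified illustration, not the DeCP implementation: it assumes a flat, single-level list of centroids instead of a deep hierarchical index, keeps everything in memory, and uses made-up 2-D vectors and image names.

```python
import math
from collections import Counter


def nearest_centroid(vec, centroids):
    """Identify: find the cluster whose centroid is closest to the vector."""
    return min(range(len(centroids)), key=lambda i: math.dist(centroids[i], vec))


def build_clusters(centroids, features):
    """Off-line: assign every (image_id, vector) pair to its nearest centroid."""
    clusters = {i: [] for i in range(len(centroids))}
    for image_id, vec in features:
        clusters[nearest_centroid(vec, centroids)].append((image_id, vec))
    return clusters


def search_feature(query_vec, centroids, clusters, k):
    """On-line, single feature: identify, retrieve, scan, keep the k nearest."""
    cluster_id = nearest_centroid(query_vec, centroids)            # identify
    candidates = clusters[cluster_id]                              # retrieve
    ranked = sorted(candidates, key=lambda c: math.dist(c[1], query_vec))  # scan
    return [image_id for image_id, _ in ranked[:k]]                # k-NN


def search_image(query_features, centroids, clusters, k):
    """Vote aggregation: each neighbour found casts one vote for its image."""
    votes = Counter()
    for vec in query_features:
        votes.update(search_feature(vec, centroids, clusters, k))
    return votes.most_common()


# Hypothetical toy data.
centroids = [(0.0, 0.0), (10.0, 10.0)]
features = [
    ("img_a", (0.1, 0.2)),
    ("img_a", (9.8, 10.1)),
    ("img_b", (0.3, 0.1)),
    ("img_c", (10.4, 9.9)),
]
clusters = build_clusters(centroids, features)
ranking = search_image([(0.0, 0.1), (10.0, 10.0)], centroids, clusters, k=2)
print(ranking)
```

The search is approximate because only one cluster is scanned per query feature; a true nearest neighbour that was assigned to a different cluster is missed.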

Design Choices: Feature Collection

• YLI feature corpus from YFCC100M
  – Various feature sets (visual, semantic, …)
  – 99.2M images and 0.8M videos
  – Largest dataset publicly available

• Use all 42.9 billion SIFT features!
  – Goal is to test at a very large scale
  – No feature aggregation or compression
  – Largest feature collection reported!


Research Questions

• What is the complexity of the Spark pipeline for typical multimedia-related tasks?

• How well does background processing scale as collection size and resources grow?

• How does batch size impact throughput of an online service?

Requirements for the ADCF

• R1 Scalability: Ability to scale out with additional computing power
• R2 Computational flexibility: Ability to carefully balance system resources as needed
• R3 Capacity: Ability to gracefully handle data that vastly exceeds main memory capacity
• R4 Updates: Ability to gracefully update the data structures for dynamic workloads
• R5 Flexible pipeline: Ability to easily implement variations of the indexing and/or retrieval process
• R6 Simplicity: How efficiently the programmer’s time is spent

DeCP on Hadoop

• Prior work evaluated DeCP on Hadoop using 30 billion SIFTs on 100+ machines

• Conclusion = limited success
  – Scalability limited due to RAM per core
  – Two-step Map-Reduce pipeline is too rigid
    • Ex: Single data-source only
    • Ex: Could not search multiple clusters
  – R1, R2, R3 = partially; R4 = no; R5, R6 = no

DeCP on Spark

• A very different ADCF from Hadoop
• Several advantages
  – Arbitrarily deep pipelines
    • Easily implement all features and functionality
  – Broadcast variables
    • Solves the RAM per core limitation
  – Multiple data sources
    • Ex: Allows join operations for maintenance (R4)

Spark Pipeline Symbols

• .map = one-to-one transformation
• .flatMap = one-to-any transformation

• .groupByKey = Hadoop’s Shuffle
• .reduceByKey = Hadoop’s Reduce

• .collectAsMap = collect to Master
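What each operator does can be shown with eager, single-process Python analogues. These helpers are hypothetical stand-ins written for this illustration (the `rdd_` prefix and the sample records are assumptions); in Spark the operators run lazily and in parallel across the workers.

```python
from collections import defaultdict
from functools import reduce


def rdd_map(records, fn):
    """.map: exactly one output per input record."""
    return [fn(r) for r in records]


def rdd_flat_map(records, fn):
    """.flatMap: zero, one, or many outputs per input record."""
    return [out for r in records for out in fn(r)]


def rdd_group_by_key(pairs):
    """.groupByKey: bring all values with the same key together (the shuffle)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return list(groups.items())


def rdd_reduce_by_key(pairs, fn):
    """.reduceByKey: combine the values of each key into a single value."""
    return [(key, reduce(fn, values)) for key, values in rdd_group_by_key(pairs)]


def rdd_collect_as_map(pairs):
    """.collectAsMap: return the key-value pairs as one dictionary."""
    return dict(pairs)


# Hypothetical records: an image and the clusters its features fall into.
records = [("img1", ["c3", "c1"]), ("img2", ["c3"])]
pairs = rdd_flat_map(records, lambda rec: [(c, rec[0]) for c in rec[1]])
images_per_cluster = rdd_collect_as_map(rdd_group_by_key(pairs))
ones = rdd_map(pairs, lambda p: (p[0], 1))
counts = rdd_collect_as_map(rdd_reduce_by_key(ones, lambda a, b: a + b))
print(images_per_cluster, counts)
```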

Search Pipeline

[Figure: Spark pipelines for Indexing and Search]

Evaluation: Off-line Indexing

• Hardware: 51 AWS c3.8xl nodes
  – 800 real cores + 800 virtual cores
  – 2.8 TB of RAM and 30 TB of SSD space

• Indexing time as collection grows

Features (billions)   Indexing time (seconds)   Scaling (relative)
8.5                   3,287                     –
17.2                  5,030                     1.53
26.0                  11,943                    3.63
34.5                  14,192                    4.31
42.9                  19,749                    6.00
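The relative scaling column is the indexing time divided by the time for the smallest collection. A short script makes this explicit and sets it next to the growth in collection size; the numbers are copied from the table, while the variable names are only illustrative.

```python
# (features in billions, indexing time in seconds), copied from the table.
measurements = [(8.5, 3287), (17.2, 5030), (26.0, 11943), (34.5, 14192), (42.9, 19749)]

base_features, base_seconds = measurements[0]
rows = []
for features, seconds in measurements[1:]:
    data_growth = features / base_features  # how much larger the collection is
    time_growth = seconds / base_seconds    # the "Scaling (relative)" column
    rows.append((features, data_growth, time_growth))
    print(f"{features:5.1f}B features: data x{data_growth:.2f}, time x{time_growth:.2f}")
```

The computed ratios agree with the table to within rounding in the last digit. For the full collection, roughly five times the data takes roughly six times as long to index.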

Evaluation: “On-line” Search

• Throughput with batching

[Figure: throughput with batching; the Hadoop limit is marked]

Summary

Requirement                      Spark                       Hadoop
R1 Scalability                   Yes                         Partial (RAM per core)
R2 Computational Flexibility     Yes                         Partial
R3 Capacity                      Yes                         Partial
R4 Updates                       Partial (full re-shuffle)   No
R5 Flexible Pipelines            Yes                         No
R6 Simplicity                    Yes                         No


Jan Zahálka, Stevan Rudinac, Björn Þór Jónsson, Dennis C. Koelma, Marcel Worring
Blackthorn: Large-Scale Interactive Multimodal Learning
Under revision at IEEE Transactions on Multimedia

Jan Zahálka, Stevan Rudinac, Björn Þór Jónsson, Dennis C. Koelma, Marcel Worring
Interactive Multimodal Learning on 100 Million Images
Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR)
New York, NY, USA, June 2016

Framework: Lambda Architecture

[Figure: Service Layer, Batch Layer, Storage Layer, Speed Layer]

Blackthorn Motivation

• Do not impose a dictionary on the user
• Let the user synthesize categories of relevance from semantic annotations on the fly

• Let the user search and explore along those categories interactively

• Interactive semi-supervised learning at scale!

Honza’s Scalability Illustration

• “Yesterday”: 10-100K images

• YFCC: 100M images

Image credit: http://demonocracy.info/infographics/usa/us_debt/us_debt.html

Scale > Size

• Single (high-end) workstation
• 1000D features → 800GB

• Interactive response time!
• Computing feature scores takes minutes!
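The 800GB figure is consistent with a back-of-the-envelope calculation for 100M images with one 1000-dimensional feature vector each. The 8 bytes per value used here is an assumption (the slides do not state the value width), as are the variable names.

```python
num_images = 100_000_000  # YFCC: 100M images
dims_per_image = 1000     # 1000D features
bytes_per_value = 8       # ASSUMPTION: 8-byte floating-point values

total_bytes = num_images * dims_per_image * bytes_per_value
total_gb = total_bytes / 10**9
print(f"{total_gb:.0f} GB")
```

A collection of this size does not fit in the main memory of a single workstation, which is why scoring every feature vector interactively is the hard part.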

Blackthorn Overview

Blackthorn Compression

Blackthorn Results: 1.2M Collection

• Compression: 880GB → 5GB
• Precision: 89-108% of uncompressed
• Scoring time: 60-80x faster
• Recall over time: Blackthorn rocks!

Blackthorn Results: YFCC100M Collection

• Scoring time: ~1 second!

Blackthorn Future Work

• More (user) evaluation is needed
• Other applications may (will) require adaptations
• Further scalability: Combine eCP and Blackthorn


Why Scale?

• Current and future applications
• Future of computing
• Because we cannot yet!

“We choose to go to the moon. We choose to go to the moon in this decade and do the other things, not because they are easy, but because they are hard, because that goal will serve to organize and measure the best of our energies and skills, because that challenge is one that we are willing to accept, one we are unwilling to postpone, and one which we intend to win, and the others, too.”

JFK, September 12, 1962

Scalability Hurdles: Can Industry Help?

Industry-Level Collections
  – Data
  – Processing capacity

The Small-Minded Reviewer
  – “Are there users willing to explore 100M data sets interactively?”

Interactive Applications
  – Application knowledge
  – User study “victims”


Summary

• Motivation: Scalable multimedia analytics

• Batch Layer: Spark and 43 billion high-dim features

• Serving Layer: Blackthorn and 100 million images

• Conclusion: Importance and challenges of scale!