
Towards Scalable Multimedia Analytics

Björn Þór Jónsson
datasys group, Computer Science Department
IT University of Copenhagen

Today’s Media Collections

• Massive and growing
– Europeana > 50 million items
– DeviantArt > 250 million items (160K/day)
– Facebook > 1,000 billion items (200M/day)

• Variety of users and applications
– Novices → enthusiasts → scholars → experts
– Current systems aimed at helping experts

• Need for understanding and insights

2

Media Tasks

3

Search
Exploration

Media Tasks

4

[Zahálka and Worring, 2014]

Multimedia Analytics

Multimedia Analytics

Multimedia Analytics = Multimedia Analysis + Visual Analytics

[Zahálka and Worring, 2014; Keim et al., 2010]

5

From Data to Insight

6

The Two Gaps

7

Semantic Gap [Smeulders et al., 2000]: from generic data and annotation based on an objective understanding, to a specific, context- and task-driven subjective understanding.

Pragmatic Gap [Zahálka and Worring, 2014]: from predefined, fixed annotation based on an understanding of the collection, to a dynamically evolving, interaction-driven understanding of collections.

Multimedia Analytics: State of the Art

• Theory is developing
• Early systems have appeared
• No real-life applications (?)
• Small collections only

8

Scalable Multimedia Analytics

Scalable Multimedia Analytics = Multimedia Analysis + Visual Analytics + Data Management

[Jónsson et al., 2016]

9

The Three Gaps

10

Semantic Gap [Smeulders et al., 2000]: from generic data and annotation based on an objective understanding, to a specific, context- and task-driven subjective understanding.

Pragmatic Gap [Zahálka and Worring, 2014]: from predefined, fixed annotation based on an understanding of the collection, to a dynamically evolving, interaction-driven understanding of collections.

Scale Gap [Jónsson et al., 2016]: from pre-computed indices and bulk processing of large datasets, to serendipitous and highly interactive sessions on small data subsets.

Velocity, Volume, Variety, Visual Interaction

[Jónsson et al., MMM 2016]

Ten Research Questions for Scalable Multimedia Analytics

Big Data Framework: Lambda Architecture

12

Layers: Batch Layer, Speed Layer, Service Layer, Storage Layer

[Marz and Warren, 2015]
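The layer diagram is lost in this rendering; the idea, following [Marz and Warren, 2015], is that the batch layer periodically recomputes complete views over the master dataset, the speed layer keeps incremental views over recent data only, and the service layer merges the two at query time. A minimal Python sketch of that split, for a hypothetical key/value counting workload:

```python
# Minimal Lambda Architecture sketch (hypothetical key/value counting workload).
# Batch layer: recomputes a complete view from the immutable master dataset.
# Speed layer: maintains an incremental view over recent data only.
# Service layer: merges both views at query time.
from collections import defaultdict

master_dataset = []                 # storage layer: append-only log of (key, value) events
realtime_view = defaultdict(int)    # speed layer: updated per event
batch_view = {}                     # batch layer: rebuilt periodically

def ingest(key, value):
    master_dataset.append((key, value))   # storage layer
    realtime_view[key] += value           # speed layer (incremental)

def run_batch_job():
    """Recompute the batch view from scratch, then reset the speed layer."""
    global batch_view
    view = defaultdict(int)
    for key, value in master_dataset:
        view[key] += value
    batch_view = dict(view)
    realtime_view.clear()

def query(key):
    """Service layer: merge the precomputed and real-time views."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)
```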


Outline

• Motivation: Scalable multimedia analytics

• Batch Layer: Spark and 43 billion high-dim features

• Service Layer: Blackthorn and 100 million images

• Conclusion: Importance and challenges of scale!

14

Gylfi Þór Guðmundsson, Laurent Amsaleg, Björn Þór Jónsson, Michael J. Franklin: Towards Engineering a Web-Scale Multimedia Service: A Case Study Using Spark. Proceedings of the ACM Multimedia Systems Conference (MMSys), Taipei, Taiwan, June 2017.

15

Spark Case Study: Motivation

• How can multimedia tasks harness the power of cloud computing?
– Multimedia collections are growing
– Computing power is abundant

• ADCFs = Hadoop || Spark
– Automatically Distributed Computing Frameworks
– Designed for high-throughput processing

16

Design Choices: ADCF = Spark

• Hadoop is not suitable (more later)
• Resilient Distributed Datasets (RDDs)
– Transform one RDD into another via operators
– Lazy execution
– Master and workers paradigm

• Supports deep pipelines
• Supports memory sharing among a worker's tasks
• Lazy execution allows for optimizations (see the sketch below)

17
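A minimal PySpark sketch of these points (toy data, local mode): transformations such as map and filter only record lineage, and nothing executes until an action such as count forces the pipeline to run, which lets Spark fuse the steps into a single pass per partition.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-rdd-demo")

numbers = sc.parallelize(range(1_000_000))

# Transformations are lazy: these two lines only describe the pipeline.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# The action triggers execution; Spark can pipeline the map and the filter
# into a single pass over each partition.
print(evens.count())

sc.stop()
```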

Design Choices: Application Domain

• Content-Based Image Retrieval (CBIR)
– Well-known application
– Two phases: off-line and "on-line"

(Figure: query image → CBIR system → search results)

18

Design Choices: DeCP Algorithm

Properties:
• Clustering-based
• Deep hierarchical index
• Approximate k-NN search
• Trades response time for throughput by batching

Why?
• Very simple
• Prototypical of many CBIR algorithms
• Previous Hadoop implementation facilitates comparison

19

DeCP as a CBIR System

20

• Off-line
– Build the index hierarchy
– Cluster the data collection

• On-line
– Approximate k-NN search
– Vote aggregation

The index is kept in RAM; the clusters reside on disk.

Searching a single feature: identify the closest cluster via the index, retrieve it from the clustered collection, scan it, and keep the k nearest neighbours (see the sketch below).
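A minimal NumPy sketch of that single-feature search, not the authors' implementation: the nested index structure, the load_cluster callback and the parameter names are all illustrative.

```python
import numpy as np

def identify_cluster(query, index):
    """Descend the in-RAM index hierarchy: at each level, follow the child
    whose representative is closest to the query. `index` is a hypothetical
    nested structure where each node holds a matrix of child representatives
    and either child nodes or a leaf cluster id."""
    node = index
    while "children" in node:
        dists = np.linalg.norm(node["representatives"] - query, axis=1)
        node = node["children"][int(np.argmin(dists))]
    return node["cluster_id"]

def search_feature(query, index, load_cluster, k=20):
    cluster_id = identify_cluster(query, index)      # 1. identify
    vectors, ids = load_cluster(cluster_id)          # 2. retrieve the cluster from disk
    dists = np.linalg.norm(vectors - query, axis=1)  # 3. scan the cluster
    nearest = np.argsort(dists)[:k]                  # 4. keep the k nearest neighbours
    return [(ids[i], float(dists[i])) for i in nearest]
```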

Design Choices: Feature Collection

• YLI feature corpus from YFCC100M
– Various feature sets (visual, semantic, …)
– 99.2M images and 0.8M videos
– Largest publicly available dataset

• Use all 42.9 billion SIFT features!
– Goal is to test at a very large scale
– No feature aggregation or compression
– Largest feature collection reported!

21

Research Questions

• What is the complexity of the Spark pipeline for typical multimedia-related tasks?

• How well does background processing scale as collection size and resources grow?

• How does batch size impact throughput of an online service?

22

Requirements for the ADCF

R1 Scalability: ability to scale out with additional computing power
R2 Computational flexibility: ability to carefully balance system resources as needed
R3 Capacity: ability to gracefully handle data that vastly exceeds main memory capacity
R4 Updates: ability to gracefully update the data structures for dynamic workloads
R5 Flexible pipeline: ability to easily implement variations of the indexing and/or retrieval process
R6 Simplicity: how efficiently the programmer's time is spent

23

DeCP on Hadoop

24

• Prior work evaluated DeCP on Hadoop using 30 billion SIFTs on 100+ machines

• Conclusion = limited success
– Scalability limited due to RAM per core
– Two-step Map-Reduce pipeline is too rigid
• Ex: single data source only
• Ex: could not search multiple clusters
– R1, R2, R3 = partial; R4 = no; R5, R6 = no

DeCP on Spark

• A very different ADCF from Hadoop
• Several advantages
– Arbitrarily deep pipelines
• Easily implement all features and functionality
– Broadcast variables (see the snippet below)
• Solve the RAM-per-core limitation
– Multiple data sources
• Ex: allows join operations for maintenance (R4)

25
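A small runnable PySpark illustration of the broadcast-variable point (toy data; names are illustrative): the representatives matrix stands in for the index and is shipped to each executor once, where all cores share the same read-only copy instead of each task receiving its own.

```python
from pyspark import SparkContext
import numpy as np

sc = SparkContext("local[*]", "broadcast-demo")

# Stand-in for the in-RAM index: a matrix of cluster representatives.
representatives = np.random.rand(1000, 128).astype(np.float32)

# Broadcast once: each executor holds a single read-only copy shared by
# all of its cores, instead of one copy per task (the Hadoop pain point).
reps_bc = sc.broadcast(representatives)

features = sc.parallelize([np.random.rand(128).astype(np.float32) for _ in range(500)])

# Assign each feature to its closest representative (cluster id).
cluster_ids = features.map(
    lambda f: int(np.argmin(np.linalg.norm(reps_bc.value - f, axis=1)))
)
print(cluster_ids.take(5))

sc.stop()
```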

Spark Pipeline Symbols

26

• .map = one-to-one transformation
• .flatMap = one-to-any transformation

• .groupByKey = Hadoop's shuffle
• .reduceByKey = Hadoop's reduce

• .collectAsMap = collect to the master

(Toy example below.)

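A toy, runnable PySpark example that exercises each operator in the legend, using word counting as the stand-in task:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "operator-demo")

lines = sc.parallelize(["big data", "big multimedia data"])

words = lines.flatMap(lambda line: line.split())   # one-to-any
pairs = words.map(lambda w: (w, 1))                # one-to-one

# groupByKey shuffles every value per key (like Hadoop's shuffle) ...
grouped = pairs.groupByKey().mapValues(lambda vals: sum(vals))
# ... while reduceByKey combines values locally before shuffling (like Hadoop's reduce).
counts = pairs.reduceByKey(lambda a, b: a + b)

# collectAsMap pulls the (small) result back to the master as a dict.
print(counts.collectAsMap())                            # {'big': 2, 'data': 2, 'multimedia': 1}
print(grouped.collectAsMap() == counts.collectAsMap())  # True: same result, different shuffle cost

sc.stop()
```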

Search Pipeline

27

(Figure: the Indexing and Search pipelines expressed with these operators; a condensed sketch of the search side follows.)
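The pipeline figures are lost in this rendering. The sketch below condenses how a DeCP-style search pipeline could be wired with these operators (toy data, a flat cluster assignment instead of the real hierarchy, and illustrative names throughout): the collection is clustered off-line, query features are routed to their clusters on-line, scanned, and the resulting votes are aggregated per (query, image) pair.

```python
from pyspark import SparkContext
import numpy as np

sc = SparkContext("local[*]", "decp-search-sketch")
rng = np.random.default_rng(0)

# Toy stand-ins for the real inputs (illustrative sizes only).
reps = rng.random((50, 128), dtype=np.float32)          # cluster representatives
reps_bc = sc.broadcast(reps)                            # in-RAM "index"
collection = [(img_id, rng.random(128, dtype=np.float32)) for img_id in range(2000)]
queries = sc.parallelize([(qid, rng.random(128, dtype=np.float32)) for qid in range(10)])

def assign(vec):
    """Nearest representative = cluster id (flat stand-in for DeCP's hierarchy)."""
    return int(np.argmin(np.linalg.norm(reps_bc.value - vec, axis=1)))

# Off-line: cluster the collection and group it by cluster id.
clustered = sc.parallelize(collection) \
              .map(lambda iv: (assign(iv[1]), iv)) \
              .groupByKey()

def knn(query_vec, members, k=20):
    """Scan one cluster and return the k nearest image ids."""
    members = list(members)
    vecs = np.array([v for _, v in members])
    order = np.argsort(np.linalg.norm(vecs - query_vec, axis=1))[:k]
    return [members[i][0] for i in order]

# On-line: route each query feature to its cluster, scan, vote, aggregate.
routed = queries.map(lambda q: (assign(q[1]), q))
joined = routed.join(clustered)          # (cluster_id, ((qid, vec), members))
votes = joined.flatMap(
    lambda kv: [((kv[1][0][0], img), 1) for img in knn(kv[1][0][1], kv[1][1])]
)
results = votes.reduceByKey(lambda a, b: a + b).collectAsMap()
print(list(results.items())[:5])

sc.stop()
```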

Evaluation: Off-line Indexing

28

• Hardware: 51 AWS c3.8xl nodes
– 800 real cores + 800 virtual cores
– 2.8 TB of RAM and 30 TB of SSD space

• Indexing time as the collection grows:

Features (billions) | Indexing time (seconds) | Scaling (relative)
8.5                 | 3,287                   | –
17.2                | 5,030                   | 1.53
26.0                | 11,943                  | 3.63
34.5                | 14,192                  | 4.31
42.9                | 19,749                  | 6.00

Evaluation: “On-line” Search

29

(Figure: throughput with batching, with the Hadoop limit marked for comparison.)

Summary

30

Spark: R1 Scalability = Yes; R2 Computational Flexibility = Yes; R3 Capacity = Yes; R4 Updates = Partial (full re-shuffle); R5 Flexible Pipelines = Yes; R6 Simplicity = Yes

Hadoop: R1 Scalability = Partial (RAM per core); R2 Computational Flexibility = Partial; R3 Capacity = Partial; R4 Updates = No; R5 Flexible Pipelines = No; R6 Simplicity = No

Outline

31

• Motivation: Scalable multimedia analytics

• Batch Layer: Spark and 43 billion high-dim features

• Service Layer: Blackthorn and 100 million images

• Conclusion: Importance and challenges of scale!

32

Jan Zahálka, Stevan Rudinac, Björn Þór Jónsson, Dennis C. Koelma, Marcel Worring: Blackthorn: Large-Scale Interactive Multimodal Learning. Under revision at IEEE Transactions on Multimedia.

Jan Zahálka, Stevan Rudinac, Björn Þór Jónsson, Dennis C. Koelma, Marcel Worring: Interactive Multimodal Learning on 100 Million Images. Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR), New York, NY, USA, June 2016.

Service Layer

Framework: Lambda Architecture

33

Batch Layer

Storage Layer

Speed Layer

Blackthorn: Motivation

• Do not impose a dictionary on the user
• Let the user synthesize categories of relevance from semantic annotations on the fly
• Let the user search and explore along those categories interactively
• Interactive semi-supervised learning at scale!

34
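A minimal sketch of one round of such an interactive learning loop, using scikit-learn's LinearSVC as a stand-in for Blackthorn's actual learner (toy data; names and parameters are illustrative): the user's judgments so far train a linear model, the whole collection is re-scored, and the top-scoring unjudged items are returned for the next round.

```python
import numpy as np
from sklearn.svm import LinearSVC

def interaction_round(features, labeled_ids, labels, top_k=50):
    """One round of interactive learning: fit on the user's labels so far,
    score the whole collection, and return new suggestions to judge."""
    model = LinearSVC()
    model.fit(features[labeled_ids], labels)
    scores = model.decision_function(features)   # score every item
    scores[labeled_ids] = -np.inf                 # hide already-judged items
    return np.argsort(-scores)[:top_k]            # most relevant first

# Toy usage: 10,000 items with 128-D features and a few initial judgments.
rng = np.random.default_rng(0)
features = rng.standard_normal((10_000, 128)).astype(np.float32)
labeled_ids = [0, 1, 2, 3]
labels = [1, 1, 0, 0]                             # 1 = relevant, 0 = not relevant
suggestions = interaction_round(features, labeled_ids, labels)
print(suggestions[:10])
```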

Honza's Scalability Illustration

• “Yesterday”: 10-100K images

• YFCC: 100M images

35

Image credit: http://demonocracy.info/infographics/usa/us_debt/us_debt.html

Scale > Size

36

• Single (high-end) workstation
• 1000-D features → 800 GB

• Interactive response time needed!
• Computing feature scores takes minutes!
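The back-of-the-envelope arithmetic behind the 800 GB figure, assuming 8-byte floating-point components:

```python
items = 100_000_000          # YFCC100M-scale collection
dims = 1000                  # dimensionality of the deep features
bytes_per_component = 8      # 64-bit floats (assumption)

total_gb = items * dims * bytes_per_component / 1e9
print(total_gb)              # 800.0 GB of raw feature data
```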

Blackthorn Overview

37

Blackthorn Compression

38
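The compression diagram is lost in this rendering. As a rough illustration of the general idea of shrinking dense deep features, and not necessarily Blackthorn's exact scheme, one can keep only the strongest components of each vector and quantize them to a byte each:

```python
import numpy as np

def compress(vector, n_keep=50, levels=256):
    """Keep the n_keep largest components and quantize them to one byte each.
    Rough top-k + quantization sketch, not Blackthorn's exact scheme."""
    idx = np.argsort(-np.abs(vector))[:n_keep]          # strongest components
    vals = vector[idx]
    scale = np.abs(vals).max() or 1.0
    quantized = np.round(vals / scale * (levels // 2 - 1)).astype(np.int8)
    return idx.astype(np.int32), quantized, np.float32(scale)

def score(idx, quantized, scale, weights):
    """Approximate dot product of a compressed vector with linear model weights."""
    return float(np.dot(weights[idx], quantized.astype(np.float32) * scale / 127.0))
```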

Blackthorn Results: 1.2M Collection

39

• Compression: 880 GB → 5 GB
• Precision: 89-108% of uncompressed
• Scoring time: 60-80x faster
• Recall over time: Blackthorn rocks!

Blackthorn Results: YFCC100M Collection

• Scoring time: ~1 second!

40

Blackthorn Future Work

• More (user) evaluation is needed
• Other applications may (will) require adaptations
• Further scalability: combine eCP and Blackthorn

41

Outline

• Motivation: Scalable multimedia analytics

• Batch Layer: Spark and 43 billion high-dim features

• Service Layer: Blackthorn and 100 million images

• Conclusion: Importance and challenges of scale!

42

Why Scale?

43

• Current and future applications
• Future of computing
• Because we cannot yet!

"We choose to … in this decade and do the other things, not because they are easy, but because they are hard, …"

We choose to go to the moon. We choose to go to the moon in this decade and do the other things, not because they are easy, but because they are hard, because that goal will serve to organize and measure the best of our energies and skills, because that challenge is one that we are willing to accept, one we are unwilling to postpone, and one which we intend to win, and the others, too.

JFK, September 12, 1962

Scalability Hurdles: Can Industry Help?

Industry-Level Collections
– Data
– Processing capacity

The Small-Minded Reviewer
– "Are there users willing to explore 100M data sets interactively?"

Interactive Applications
– Application knowledge
– User study "victims"

44


Summary

• Motivation: Scalable multimedia analytics

• Batch Layer: Spark and 43 billion high-dim features

• Service Layer: Blackthorn and 100 million images

• Conclusion: Importance and challenges of scale!

45
