
Towards Scalable Multimedia Analytics

Björn Þór Jónsson
datasys group, Computer Science Department
IT University of Copenhagen

Today’s Media Collections

• Massive and growing
– Europeana > 50 million items
– DeviantArt > 250 million items (160K/day)
– Facebook > 1,000 billion items (200M/day)

• Variety of users and applications
– Novices → enthusiasts → scholars → experts
– Current systems aimed at helping experts

• Need for understanding and insights

2

Media Tasks

3

Search
Exploration

Media Tasks

4

[Zahálka and Worring, 2014]

Multimedia Analytics

Multimedia Analytics

Multimedia Analytics = Multimedia Analysis + Visual Analytics

[Zahálka and Worring, 2014; Keim et al., 2010]

5

From Data to Insight

6

The Two Gaps

7

Semantic Gap [Smeulders et al., 2000]: from generic data and annotation based on an objective understanding, to a specific, context- and task-driven subjective understanding.

Pragmatic Gap [Zahálka and Worring, 2014]: from predefined, fixed annotation based on an understanding of the collection, to a dynamically evolving, interaction-driven understanding of collections.

Multimedia Analytics: State of the Art

• Theory is developing
• Early systems have appeared
• No real-life applications (?)
• Small collections only

8

Scalable Multimedia Analytics

Scalable Multimedia Analytics = Multimedia Analysis + Visual Analytics + Data Management

[Jónsson et al., 2016]

9

The Three Gaps

10

Semantic Gap [Smeulders et al., 2000]: from generic data and annotation based on an objective understanding, to a specific, context- and task-driven subjective understanding.

Pragmatic Gap [Zahálka and Worring, 2014]: from predefined, fixed annotation based on an understanding of the collection, to a dynamically evolving, interaction-driven understanding of collections.

Scale Gap [Jónsson et al., 2016]: from pre-computed indices and bulk processing of large datasets, to serendipitous and highly interactive sessions on small data subsets.

Velocity, Volume, Variety, Visual Interaction

[Jónsson et al., MMM 2016]

Ten Research Questions for Scalable Multimedia Analytics

Big Data Framework: Lambda Architecture

12

Layers: Batch Layer, Speed Layer, Service Layer, Storage Layer

[Marz and Warren, 2015]
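The layer diagram is lost in this rendering; the idea, following [Marz and Warren, 2015], is that the batch layer periodically recomputes complete views over the master dataset, the speed layer keeps incremental views over recent data only, and the service layer merges the two at query time. A minimal Python sketch of that split, for a hypothetical key/value counting workload:

```python
# Minimal Lambda Architecture sketch (hypothetical key/value counting workload).
# Batch layer: recomputes a complete view from the immutable master dataset.
# Speed layer: maintains an incremental view over recent data only.
# Service layer: merges both views at query time.
from collections import defaultdict

master_dataset = []                 # storage layer: append-only log of (key, value) events
realtime_view = defaultdict(int)    # speed layer: updated per event
batch_view = {}                     # batch layer: rebuilt periodically

def ingest(key, value):
    master_dataset.append((key, value))   # storage layer
    realtime_view[key] += value           # speed layer (incremental)

def run_batch_job():
    """Recompute the batch view from scratch, then reset the speed layer."""
    global batch_view
    view = defaultdict(int)
    for key, value in master_dataset:
        view[key] += value
    batch_view = dict(view)
    realtime_view.clear()

def query(key):
    """Service layer: merge the precomputed and real-time views."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)
```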


Outline

• Motivation: Scalable multimedia analytics

• Batch Layer: Spark and 43 billion high-dim features

• Service Layer: Blackthorn and 100 million images

• Conclusion: Importance and challenges of scale!

14

Gylfi Þór Guðmundsson, Laurent Amsaleg, Björn Þór Jónsson, Michael J. Franklin: Towards Engineering a Web-Scale Multimedia Service: A Case Study Using Spark. Proceedings of the ACM Multimedia Systems Conference (MMSys), Taipei, Taiwan, June 2017.

15

Spark Case Study: Motivation

• How can multimedia tasks harness the power of cloud computing?
– Multimedia collections are growing
– Computing power is abundant

• ADCFs = Hadoop || Spark
– Automatically Distributed Computing Frameworks
– Designed for high-throughput processing

16

Design Choices: ADCF = Spark

• Hadoop is not suitable (more later)
• Resilient Distributed Datasets (RDDs)
– Transform one RDD into another via operators
– Lazy execution
– Master and workers paradigm

• Supports deep pipelines
• Supports memory sharing among a worker's tasks
• Lazy execution allows for optimizations (see the sketch below)

17
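A minimal PySpark sketch of these points (toy data, local mode): transformations such as map and filter only record lineage, and nothing executes until an action such as count forces the pipeline to run, which lets Spark fuse the steps into a single pass per partition.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-rdd-demo")

numbers = sc.parallelize(range(1_000_000))

# Transformations are lazy: these two lines only describe the pipeline.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# The action triggers execution; Spark can pipeline the map and the filter
# into a single pass over each partition.
print(evens.count())

sc.stop()
```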

Design Choices: Application Domain

• Content-Based Image Retrieval (CBIR)
– Well-known application
– Two phases: off-line and "on-line"

(Figure: query image → CBIR system → search results)

18

Design Choices: DeCP Algorithm

Properties:
• Clustering-based
• Deep hierarchical index
• Approximate k-NN search
• Trades response time for throughput by batching

Why?
• Very simple
• Prototypical of many CBIR algorithms
• Previous Hadoop implementation facilitates comparison

19

DeCP as a CBIR System

20

• Off-line
– Build the index hierarchy
– Cluster the data collection

• On-line
– Approximate k-NN search
– Vote aggregation

The index is kept in RAM; the clusters reside on disk.

Searching a single feature: identify the closest cluster via the index, retrieve it from the clustered collection, scan it, and keep the k nearest neighbours (see the sketch below).
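A minimal NumPy sketch of that single-feature search, not the authors' implementation: the nested index structure, the load_cluster callback and the parameter names are all illustrative.

```python
import numpy as np

def identify_cluster(query, index):
    """Descend the in-RAM index hierarchy: at each level, follow the child
    whose representative is closest to the query. `index` is a hypothetical
    nested structure where each node holds a matrix of child representatives
    and either child nodes or a leaf cluster id."""
    node = index
    while "children" in node:
        dists = np.linalg.norm(node["representatives"] - query, axis=1)
        node = node["children"][int(np.argmin(dists))]
    return node["cluster_id"]

def search_feature(query, index, load_cluster, k=20):
    cluster_id = identify_cluster(query, index)      # 1. identify
    vectors, ids = load_cluster(cluster_id)          # 2. retrieve the cluster from disk
    dists = np.linalg.norm(vectors - query, axis=1)  # 3. scan the cluster
    nearest = np.argsort(dists)[:k]                  # 4. keep the k nearest neighbours
    return [(ids[i], float(dists[i])) for i in nearest]
```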

Design Choices: Feature Collection

• YLI feature corpus from YFCC100M
– Various feature sets (visual, semantic, …)
– 99.2M images and 0.8M videos
– Largest publicly available dataset

• Use all 42.9 billion SIFT features!
– Goal is to test at a very large scale
– No feature aggregation or compression
– Largest feature collection reported!

21

Research Questions

• What is the complexity of the Spark pipeline for typical multimedia-related tasks?

• How well does background processing scale as collection size and resources grow?

• How does batch size impact throughput of an online service?

22

Requirements for the ADCF

R1 Scalability: ability to scale out with additional computing power
R2 Computational flexibility: ability to carefully balance system resources as needed
R3 Capacity: ability to gracefully handle data that vastly exceeds main memory capacity
R4 Updates: ability to gracefully update the data structures for dynamic workloads
R5 Flexible pipeline: ability to easily implement variations of the indexing and/or retrieval process
R6 Simplicity: how efficiently the programmer's time is spent

23

DeCP on Hadoop

24

• Prior work evaluated DeCP on Hadoop using 30 billion SIFTs on 100+ machines

• Conclusion = limited success
– Scalability limited due to RAM per core
– Two-step Map-Reduce pipeline is too rigid
• Ex: single data source only
• Ex: could not search multiple clusters
– R1, R2, R3 = partial; R4 = no; R5, R6 = no

DeCP on Spark

• A very different ADCF from Hadoop
• Several advantages
– Arbitrarily deep pipelines
• Easily implement all features and functionality
– Broadcast variables (see the snippet below)
• Solve the RAM-per-core limitation
– Multiple data sources
• Ex: allows join operations for maintenance (R4)

25
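A small runnable PySpark illustration of the broadcast-variable point (toy data; names are illustrative): the representatives matrix stands in for the index and is shipped to each executor once, where all cores share the same read-only copy instead of each task receiving its own.

```python
from pyspark import SparkContext
import numpy as np

sc = SparkContext("local[*]", "broadcast-demo")

# Stand-in for the in-RAM index: a matrix of cluster representatives.
representatives = np.random.rand(1000, 128).astype(np.float32)

# Broadcast once: each executor holds a single read-only copy shared by
# all of its cores, instead of one copy per task (the Hadoop pain point).
reps_bc = sc.broadcast(representatives)

features = sc.parallelize([np.random.rand(128).astype(np.float32) for _ in range(500)])

# Assign each feature to its closest representative (cluster id).
cluster_ids = features.map(
    lambda f: int(np.argmin(np.linalg.norm(reps_bc.value - f, axis=1)))
)
print(cluster_ids.take(5))

sc.stop()
```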

Spark Pipeline Symbols

26

• .map = one-to-one transformation
• .flatMap = one-to-any transformation

• .groupByKey = Hadoop's shuffle
• .reduceByKey = Hadoop's reduce

• .collectAsMap = collect to the master

(Toy example below.)

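A toy, runnable PySpark example that exercises each operator in the legend, using word counting as the stand-in task:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "operator-demo")

lines = sc.parallelize(["big data", "big multimedia data"])

words = lines.flatMap(lambda line: line.split())   # one-to-any
pairs = words.map(lambda w: (w, 1))                # one-to-one

# groupByKey shuffles every value per key (like Hadoop's shuffle) ...
grouped = pairs.groupByKey().mapValues(lambda vals: sum(vals))
# ... while reduceByKey combines values locally before shuffling (like Hadoop's reduce).
counts = pairs.reduceByKey(lambda a, b: a + b)

# collectAsMap pulls the (small) result back to the master as a dict.
print(counts.collectAsMap())                            # {'big': 2, 'data': 2, 'multimedia': 1}
print(grouped.collectAsMap() == counts.collectAsMap())  # True: same result, different shuffle cost

sc.stop()
```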

Search Pipeline

27

(Figure: the Indexing and Search pipelines expressed with these operators; a condensed sketch of the search side follows.)
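The pipeline figures are lost in this rendering. The sketch below condenses how a DeCP-style search pipeline could be wired with these operators (toy data, a flat cluster assignment instead of the real hierarchy, and illustrative names throughout): the collection is clustered off-line, query features are routed to their clusters on-line, scanned, and the resulting votes are aggregated per (query, image) pair.

```python
from pyspark import SparkContext
import numpy as np

sc = SparkContext("local[*]", "decp-search-sketch")
rng = np.random.default_rng(0)

# Toy stand-ins for the real inputs (illustrative sizes only).
reps = rng.random((50, 128), dtype=np.float32)          # cluster representatives
reps_bc = sc.broadcast(reps)                            # in-RAM "index"
collection = [(img_id, rng.random(128, dtype=np.float32)) for img_id in range(2000)]
queries = sc.parallelize([(qid, rng.random(128, dtype=np.float32)) for qid in range(10)])

def assign(vec):
    """Nearest representative = cluster id (flat stand-in for DeCP's hierarchy)."""
    return int(np.argmin(np.linalg.norm(reps_bc.value - vec, axis=1)))

# Off-line: cluster the collection and group it by cluster id.
clustered = sc.parallelize(collection) \
              .map(lambda iv: (assign(iv[1]), iv)) \
              .groupByKey()

def knn(query_vec, members, k=20):
    """Scan one cluster and return the k nearest image ids."""
    members = list(members)
    vecs = np.array([v for _, v in members])
    order = np.argsort(np.linalg.norm(vecs - query_vec, axis=1))[:k]
    return [members[i][0] for i in order]

# On-line: route each query feature to its cluster, scan, vote, aggregate.
routed = queries.map(lambda q: (assign(q[1]), q))
joined = routed.join(clustered)          # (cluster_id, ((qid, vec), members))
votes = joined.flatMap(
    lambda kv: [((kv[1][0][0], img), 1) for img in knn(kv[1][0][1], kv[1][1])]
)
results = votes.reduceByKey(lambda a, b: a + b).collectAsMap()
print(list(results.items())[:5])

sc.stop()
```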

Evaluation: Off-line Indexing

28

• Hardware: 51 AWS c3.8xl nodes
– 800 real cores + 800 virtual cores
– 2.8 TB of RAM and 30 TB of SSD space

• Indexing time as the collection grows:

Features (billions) | Indexing time (seconds) | Scaling (relative)
8.5                 | 3,287                   | –
17.2                | 5,030                   | 1.53
26.0                | 11,943                  | 3.63
34.5                | 14,192                  | 4.31
42.9                | 19,749                  | 6.00

Evaluation: “On-line” Search

29

(Figure: throughput with batching, with the Hadoop limit marked for comparison.)

Summary

30

Spark: R1 Scalability = Yes; R2 Computational Flexibility = Yes; R3 Capacity = Yes; R4 Updates = Partial (full re-shuffle); R5 Flexible Pipelines = Yes; R6 Simplicity = Yes

Hadoop: R1 Scalability = Partial (RAM per core); R2 Computational Flexibility = Partial; R3 Capacity = Partial; R4 Updates = No; R5 Flexible Pipelines = No; R6 Simplicity = No

Outline

31

• Motivation: Scalable multimedia analytics

• Batch Layer: Spark and 43 billion high-dim features

• Service Layer: Blackthorn and 100 million images

• Conclusion: Importance and challenges of scale!

32

Jan Zahálka, Stevan Rudinac, Björn Þór Jónsson, Dennis C. Koelma, Marcel Worring: Blackthorn: Large-Scale Interactive Multimodal Learning. Under revision at IEEE Transactions on Multimedia.

Jan Zahálka, Stevan Rudinac, Björn Þór Jónsson, Dennis C. Koelma, Marcel Worring: Interactive Multimodal Learning on 100 Million Images. Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR), New York, NY, USA, June 2016.

Service Layer

Framework: Lambda Architecture

33

Batch Layer

Storage Layer

Speed Layer

Blackthorn: Motivation

• Do not impose a dictionary on the user
• Let the user synthesize categories of relevance from semantic annotations on the fly
• Let the user search and explore along those categories interactively
• Interactive semi-supervised learning at scale!

34
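A minimal sketch of one round of such an interactive learning loop, using scikit-learn's LinearSVC as a stand-in for Blackthorn's actual learner (toy data; names and parameters are illustrative): the user's judgments so far train a linear model, the whole collection is re-scored, and the top-scoring unjudged items are returned for the next round.

```python
import numpy as np
from sklearn.svm import LinearSVC

def interaction_round(features, labeled_ids, labels, top_k=50):
    """One round of interactive learning: fit on the user's labels so far,
    score the whole collection, and return new suggestions to judge."""
    model = LinearSVC()
    model.fit(features[labeled_ids], labels)
    scores = model.decision_function(features)   # score every item
    scores[labeled_ids] = -np.inf                 # hide already-judged items
    return np.argsort(-scores)[:top_k]            # most relevant first

# Toy usage: 10,000 items with 128-D features and a few initial judgments.
rng = np.random.default_rng(0)
features = rng.standard_normal((10_000, 128)).astype(np.float32)
labeled_ids = [0, 1, 2, 3]
labels = [1, 1, 0, 0]                             # 1 = relevant, 0 = not relevant
suggestions = interaction_round(features, labeled_ids, labels)
print(suggestions[:10])
```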

Honza's Scalability Illustration

• “Yesterday”: 10-100K images

• YFCC: 100M images

35

Image credit: http://demonocracy.info/infographics/usa/us_debt/us_debt.html

Scale > Size

36

• Single (high-end) workstation
• 1000-D features → 800 GB

• Interactive response time needed!
• Computing feature scores takes minutes!
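The back-of-the-envelope arithmetic behind the 800 GB figure, assuming 8-byte floating-point components:

```python
items = 100_000_000          # YFCC100M-scale collection
dims = 1000                  # dimensionality of the deep features
bytes_per_component = 8      # 64-bit floats (assumption)

total_gb = items * dims * bytes_per_component / 1e9
print(total_gb)              # 800.0 GB of raw feature data
```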

Blackthorn Overview

37

Blackthorn Compression

38
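The compression diagram is lost in this rendering. As a rough illustration of the general idea of shrinking dense deep features, and not necessarily Blackthorn's exact scheme, one can keep only the strongest components of each vector and quantize them to a byte each:

```python
import numpy as np

def compress(vector, n_keep=50, levels=256):
    """Keep the n_keep largest components and quantize them to one byte each.
    Rough top-k + quantization sketch, not Blackthorn's exact scheme."""
    idx = np.argsort(-np.abs(vector))[:n_keep]          # strongest components
    vals = vector[idx]
    scale = np.abs(vals).max() or 1.0
    quantized = np.round(vals / scale * (levels // 2 - 1)).astype(np.int8)
    return idx.astype(np.int32), quantized, np.float32(scale)

def score(idx, quantized, scale, weights):
    """Approximate dot product of a compressed vector with linear model weights."""
    return float(np.dot(weights[idx], quantized.astype(np.float32) * scale / 127.0))
```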

Blackthorn Results: 1.2M Collection

39

• Compression: 880 GB → 5 GB
• Precision: 89-108% of uncompressed
• Scoring time: 60-80x faster
• Recall over time: Blackthorn rocks!

Blackthorn Results: YFCC100M Collection

• Scoring time: ~1 second!

40

Blackthorn Future Work

• More (user) evaluation is needed
• Other applications may (will) require adaptations
• Further scalability: combine eCP and Blackthorn

41

Outline

• Motivation: Scalable multimedia analytics

• Batch Layer: Spark and 43 billion high-dim features

• Service Layer: Blackthorn and 100 million images

• Conclusion: Importance and challenges of scale!

42

Why Scale?

43

• Current and future applications
• Future of computing
• Because we cannot yet!

"We choose to … in this decade and do the other things, not because they are easy, but because they are hard, …"

We choose to go to the moon. We choose to go to the moon in this decade and do the other things, not because they are easy, but because they are hard, because that goal will serve to organize and measure the best of our energies and skills, because that challenge is one that we are willing to accept, one we are unwilling to postpone, and one which we intend to win, and the others, too.

JFK, September 12, 1962

Scalability Hurdles: Can Industry Help?

Industry-Level Collections
– Data
– Processing capacity

The Small-Minded Reviewer
– "Are there users willing to explore 100M data sets interactively?"

Interactive Applications
– Application knowledge
– User study "victims"

44


Summary

• Motivation: Scalable multimedia analytics

• Batch Layer: Spark and 43 billion high-dim features

• Service Layer: Blackthorn and 100 million images

• Conclusion: Importance and challenges of scale!

45
