big data systems-prp-berlin17-v3big-data-berlin.dima.tu-berlin.de/fileadmin/news/... · •Big Data...

Peter R. Pietzuch prp@doc.ic.ac.uk

Big Data Systems:Achieving Performance and Usability

Peter Pietzuch

Large-Scale Distributed Systems GroupDepartment of Computing, Imperial College London

http://lsds.doc.ic.ac.uk

Big Data Excellence, Berlin, Germany – March 2017

prp@imperial.ac.uk

Machine Learning For Web Data

– How to maximise user experience withrelevant content?

– How to analyse “click paths” to trace mostcommon user routes?

• Example: Online predictions for adverts to serve on search engines

Page Views

Visits

Unique Visitors

Uniquely Identified Visitors

Volume of Available Data

y E {−1,1}

predict

update

Peter Pietzuch - Imperial College London

• Solution: AdPredictor– Bayesian learning algorithm

ranks ads according to click probabilities

Data Mining of Social Networks

Twitter Cascade Detection

Intelligent Urban Transport

• Instrumentation of urban transport

– Induction loops to measure traffic flow

– Video surveillance of hot spots– Sensors in public transport

• Potential queries– How to detect traffic

congestion and road closures?

– How to explain the cause of congestion (public event, emergency)?

– How to react accordingly (egby adapting traffic light schedules)?

Big Data Applications Enabled by Systems

• Data centres allow processing at scale

– Commodity servers– Fast network fabrics

5Peter Pietzuch - Imperial College London

Task vs Data Parallelism

Big Data ...n machines in

data centre

Results

select highway, segment, direction, AVG(speed) from Vehicles[range 5 seconds slide 1 second]group by highway, segment, directionhaving avg < 40

Task parallelism:Multiple queries

Data parallelism:Single query

select distinct W.cidFrom Payments [range 300 seconds] as W,

Payments [partition-by 1 row] as Lwhere W.cid = L.cid and W.region != L.region

select highway, segment, direction, AVG(speed) from Vehicles[range 5 seconds slide 1 second]group by highway, segment, directionhaving avg < 40

Peter Pietzuch - Imperial College London 6

Distributed Dataflow Systems

• Idea:Execute multiple data-parallel tasks on cluster nodes

• Tasks organised as dataflow graph

• Almost all big data systems do this: Apache Hadoop, Apache Spark, Apache Storm, Apache Flink,Google Dataflow, Google TensorFlow, ...

parallelismdegree 3

parallelismdegree 2

Design Space of Big Data Systems

• Tension between performance and algorithmic expressiveness

Hard formachinelearningalgorithms

Hard forall algo-rithms

Result latency

10s 1s 100ms 10ms 1ms

1. High Performance with New Hardware

High-throughput processing

Facebook Insights: Aggregates 9 GB/s < 10 sec latencyFeedzai: 40K credit card transactions/s < 25 ms latencyGoogle Zeitgeist: 40K user queries/s (1 sec windows) < 1 ms latencyNovaSparks: 150M trade options/s < 1 ms latency

Low-latency results

Big Data System

2. Expressing Machine Learning Algorithms

…Share state

Aggregate

Iterate…

Pre-process

Parallelize…

Online machinelearning, data

mining

Datafiltering

Complexpattern

matchingData

queries

highwaysegmentdirectionspeed

T1(a, b, c)

T2(c, d, e)

T3(g, i, h)

Increasing complexity

Servers have many parallel CPU cores

Servers with GPUs common– GPU have even more specialised cores

New types of compute accelerators– Xeon Phi, Google's TPUs, FPGAs, ...

L2 Cache

DRAM DRAM

SMX1 ... SMXN

Command QueuePCIe Bus

1000s ofGPU cores

10s ofCPU cores

Must Exploit New Parallel Hardware

SQL-Based Big Data Queries

SQL provides well-defined declarative semantics for queries– Based on relational algebra (select, project, join, …)

• Example: Identify slow moving traffic on highway– Find highway segments with average speed below 40 km/h

select highway, segment, direction, AVG(speed) as avg

from Vehicles[range 5 sec slide 1 sec]group by highway, segment, directionhaving avg < 40

Input data

Output

Operators

Queries must operate over finite amounts of data at a time

123456

size: 4 secslide: 1 sec

Processing Big Data with Windows

windows

Naive strategy parallelises computation along window boundaries

123456

size: 4 secslide: 1 sec

Task T1

Task T2

Combine partial results

How to use GPUs with Big Data Queries?

E Window-based parallelism results in redundant computation

Parallel processing of non-overlapping window data?

123456 size: 4 secslide: 1 sec

Combine partial results

How to use GPUs with Big Data Queries?

E Slide-based parallelism limits degree of parallelism

Idea: Parallelise over data chunks that are best for hardware– Should be independent of query!

• Task contains one or more window fragments

10 9 8 7 6 5 4 3 2 115 14 13 12 11 10 9 8 7 615 14 13 12 11

T1T2T3

w1w2w3w4w5

size: 7 tuplesslide: 2 tuples

5 tuples/task

SABER’s Big Data Processing Model

Idea: Reassemble partial results to obtain overall result

Worker B: T2

w1w2w3

Worker A: T1w1

w1result

w2result

Result StageSlot 2 Slot 1

EmptyEmpty

Output result circular buffer

SABER's Big Data Processing Model

E Partial result assembly must also be done in parallel

SABER's Hybrid Execution Model

Idea: Permits task execution on all heterogeneous processors(eg CPUs, GPUs, ...)

– Scheduler assigns tasks to idle processors on-the-fly

T2 T1T3T4T5T6T7T8T9

QBQAQBQBQBQBQAQBQA

Task queue CPU worker

GPU worker

0 3 6 9 12

CPUGPU T1 T2

T2T1 T4 T5 T6

T4 T5 T6

ααop

Scheduling & execution stage

Dequeue tasks based on HLS

Dispatching stage

Dispatch fixed-size tasks

Merge & forward partial window results

Result stage

Java15K LOC

C & OpenCL4K LOC

SABER Big Data Engine

CM2 SG1 SG2 LRB3 LRB4Thro

SABER (CPU contrib.)

SABER (GPU contrib.)

Cluster Mgmt. Smart Grid LRB

aggravg group-byavg select

group-byavg group-bycnt

group-bycntgroup-byavg

select

Intel Xeon 2.6 GHz

NVIDIA Quadro K5200

16 cores

2,304 cores

What is SABER's Performance?

E SABER achieves aggregate performance of CPUs and GPUs

2. Expressing Machine Learning Algorithms

…Share state

Aggregate

Iterate…

Pre-process

Parallelize…

Online machinelearning, data

mining

Datafiltering

Complexpattern

matchingData

queries

highwaysegmentdirectionspeed

T1(a, b, c)

T2(c, d, e)

T3(g, i, h)

Increasing complexity

Democratise Big Data Analytics

E Need to enable more users to do big data analytics…Peter Pietzuch - Imperial College London 22

Programming Language Popularity

Programming Models For Big Data Apps?

• Declarative query languages (eg SQL) challenging for complex algorithms (eg iterative machine learning algorithms)

– Typically need to rely on user-defined functions (UDFs)

• Distributed dataflow frameworks favour functional programming models– MapReduce, SQL, PIG, DryadLINQ, Spark, …– Simplifies consistency and fault tolerance

• Domain experts tend to write imperative programs– Java, Matlab, C++, R, Python, Fortran, …

Executed as dataflow graph

Abstractions for Machine Learning

Rating: 3User AItem:

“iPhone”Rating: 5

User ARecommend:

“Apple Watch”

Customer eventson website

Up-to-date recommendations

• Online recommender system– Recommendations based on past user ratings– Eg based on collaborative filtering (cf Netflix, Amazon, …)

(eg Hadoop, Spark, Flink, …)

Online Collaborative Filtering in Java

Matrix userItem = new Matrix();Matrix coOcc = new Matrix();

void addRating(int user, int item, int rating) { userItem.setElement(user, item, rating);updateCoOccurrence(coOcc, userItem);

Vector getRec(int user) {Vector userRow = userItem.getRow(user);Vector userRec = coOcc.multiply(userRow); return userRec;

Item-A Item-BUser-A 4 5User-B 0 5

Item-A Item-BItem-A 1 1Item-B 1 2

User-Item matrix (UI)

Co-occurrence matrix (CO)

Update with new ratings

Multiply for recommendation

User-B 1 2 x

Collaborative Filtering in Spark (Java)// Build the recommendation model using ALSint rank = 10;int numIterations = 20;MatrixFactorizationModel model = ALS.train(JavaRDD.toRDD(ratings), rank, numIterations, 0.01);

// Evaluate the model on rating dataJavaRDD<Tuple2<Object, Object>> userProducts = ratings.map(new Function<Rating, Tuple2<Object, Object>>() {public Tuple2<Object, Object> call(Rating r) {return new Tuple2<Object, Object>(r.user(), r.product());

);JavaPairRDD<Tuple2<Integer, Integer>, Double> predictions = JavaPairRDD.fromJavaRDD(model.predict(JavaRDD.toRDD(userProducts)).toJavaRDD().map(new Function<Rating, Tuple2<Tuple2<Integer, Integer>, Double>>() {public Tuple2<Tuple2<Integer, Integer>, Double> call(Rating r){return new Tuple2<Tuple2<Integer, Integer>, Double>(new Tuple2<Integer, Integer>(r.user(), r.product()), r.rating());

));JavaRDD<Tuple2<Double, Double>> ratesAndPreds = JavaPairRDD.fromJavaRDD(ratings.map(

new Function<Rating, Tuple2<Tuple2<Integer, Integer>, Double>>() {public Tuple2<Tuple2<Integer, Integer>, Double> call(Rating r){return new Tuple2<Tuple2<Integer, Integer>, Double>(new Tuple2<Integer, Integer>(r.user(), r.product()), r.rating());

)).join(predictions).values();

Collaborative Filtering in Spark (Scala)

// Build the recommendation model using ALSval rank = 10val numIterations = 20val model = ALS.train(ratings, rank, numIterations, 0.01)

// Evaluate the model on rating dataval usersProducts = ratings.map {case Rating(user, product, rate) => (user, product)

}val predictions = model.predict(usersProducts).map {case Rating(user, product, rate) => ((user, product), rate)

}val ratesAndPreds = ratings.map {case Rating(user, product, rate) => ((user, product), rate)

}.join(predictions)

E All data is immutable, no fine-grained model updatesPeter Pietzuch - Imperial College London 28

State as First Class Citizen

User AItem 2

User B

Item 124 1

Tasks process data

State Elements (SEs) represent transient state

Dataflows represent

• State elements (SEs) are mutable in-memory data structures– Tasks have local access to SEs– SEs can be shared between tasks

Program.java Cluster

Annotated Java program(@Partitioned, @Partial, @Global, …)

Staticprogramanalysis

SEEP distributed dataflowframework

Translation &checkpoint-based fault

tolerance

Data-parallelStateful Dataflow

Graph (SDG)

Stateful Dataflow Graphs (SDG)Imperial,USENIXATC’14

Conclusions

• Big Data Systems have enabled the Big Data revolution– Driven by advances in data centre technology and parallel hardware

• 1. Performance means exploiting new parallel hardware (GPUs)– Must decouple query semantics from hardware parameters

• E SABER engine: Hybrid CPU/GPU execution model

• 2. Usability means providing right programing abstractions – Imperative programming models still going strong

• E SDG: Stateful Big Data processing for machine learning

Acknowledgements: LSDS Group

Peter Pietzuch<prp@doc.ic.ac.uk>

http://lsds.doc.ic.ac.ukThank you! Any Questions?Peter Pietzuch - Imperial College London

We're recruiting: PhDs, Post-Docs

big data systems-prp-berlin17-v3big-data-berlin.dima.tu-berlin.de/fileadmin/news/... · •Big Data...

Documents

Big Data Visualization: Turning Big Data into Big Insights · PDF fileWhite Paper Big Data Visualization: Turning Big Data Into Big Insights The Rise of Visualization-based Data Discovery

BIG DATA, SMART DATA AND BIG ANALYSIS

Big Data + Big Ideas = Big Impact

Big Data Visualization: Turning Big Data into Big Insights · Big Data Visualization: Turning Big Data Into Big Insights The Rise of Visualization-based Data Discovery Tools MARCH

E6893 Big Data Analytics Lecture 11: Linked Big Data ...cylin/course/bigdata/EECS6893-BigData... · E6893 Big Data Analytics – Lecture 11: Linked Big Data: ... E6893 Big Data Analytics

The BIG Future of BIG Data · 2017-03-08 · The BIG Future of BIG Data. Big Data Data Governance Data Warehousing Data Reporting Data Infrastructure. Become a PREDICTIVEEnterprise

Big Data and Hadoop - How Big is this Big Data?

Introduction to Big Data, Big Data Processing, and Big

Big Data Technology Big Data - aakritsubedi9.com.npaakritsubedi9.com.np/files/Big Data Technology.pdf · Big Data Technology Big Data 1"Big data" is a field that treats ways to analyze,

2.3 Methods for Big Data What is “Big Data”? Summarizing Big Data

Caterpillar Big Data Infrastructure Big Data, Data

Big Data Meets Big Data Analytics 105777

Big Data Visualization: Turning Big Data Into Big Insights – White

2016 Big Data For Beginners Understanding SMART Big Data, Data Mining & Data Analytics2016 Big Data for Beginners Understanding SMART Big Data, Data Mining & Data Analytics

Introduction to Big Data, Big Data Processing, and Big ...cis.csuohio.edu/~sschung/CIS660/Lecture1_IntroBigDataAnalyrics.pdf · What’s Big Data? From Wikipedia: • Big data is

· for executive: box big data ussuiu lla:ansnnns1ðxnu big data -big data big -wifñuiaÖ big data big data • hadoop big clouderâ manager hive impala big data 22 airuntju 2559

Practical Big Data Processing An Overview of …big-data-berlin.dima.tu-berlin.de/fileadmin/news/...• Language APIs automatically converts objects to tuples – Tuples mapped to

Big Data, Big Commerce, Big Challenge

Big Data, Big Challenges, Big - Oracle

Big Data, künstliche Intelligenz und Data Analytics · Big Data, künstliche Intelligenz, Machine Learning, Data Analytics & Co. How big is big? Big Data in der Versicherung sind