75
Advanced topics in Apache Flink™ Maximilian Michels [email protected] @stadtlegende EIT ICT Summer School 2015 Ufuk Celebi [email protected] @iamuce

Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels [email protected] @stadtlegende EIT ICT

  • Upload
    others

  • View
    12

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Advanced topics in Apache Flink™

Maximilian Michels [email protected] @stadtlegende

EIT ICT Summer School 2015

Ufuk Celebi [email protected] @iamuce

Page 2: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Agenda §  Batch analytics

§  Iterative processing

§  Fault tolerance

§  Data types and keys

§  More transformations

§  Further API concepts

2

Page 3: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Batch analytics

3

Page 4: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Memory Management

4

Page 5: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Memory Management

5

Page 6: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Managed memory in Flink

6 Memory runs out

Hash  Join  Probe  side:  64GB  

See  also:  h6p://flink.apache.org/news/2015/03/13/peeking-­‐into-­‐Apache-­‐Flinks-­‐Engine-­‐Room.html  

Page 7: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Cost-based optimizer

7

DataSource orders.tbl

Filter

Map DataSource lineitem.tbl

Join Hybrid Hash

buildHT probe

broadcast forward

Combine

GroupRed sort

hash-part [0,1]

DataSource orders.tbl

Filter

Map DataSource lineitem.tbl

Join Hybrid Hash

buildHT probe

hash-part [0] hash-part [0]

GroupRed sort

forward

Page 8: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Table API

8

val  customers  =  env.readCsvFile(…).as('id,  'mktSegment)              .filter("mktSegment  =  AUTOMOBILE")    val  orders  =  env.readCsvFile(…)              .filter(  o  =>  dateFormat.parse(o.orderDate).before(date)  )              .as("orderId,  custId,  orderDate,  shipPrio")    val  items  =  orders              .join(customers).where("custId  =  id")              .join(lineitems).where("orderId  =  id")              .select("orderId,  orderDate,  shipPrio,                      extdPrice  *  (Literal(1.0f)  –  discount)  as  revenue")    val  result  =  items              .groupBy("orderId,  orderDate,    shipPrio")              .select('orderId,  revenue.sum,  orderDate,  shipPrio")  

Page 9: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Flink stack

9

Tabl

e

DataSet (Java/Scala) DataStream (Java/Scala)

Had

oop

M/R

Local Cluster Yarn Tez Embedded

Tabl

e Streaming dataflow runtime

Page 10: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Iterative processing

10

Page 11: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Non-native iterations

11

Step Step Step Step Step

Client

for  (int  i  =  0;  i  <  maxIterations;  i++)  {    //  Execute  MapReduce  job  

}  

Page 12: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Iterative processing in Flink Flink offers built-in iterations and delta iterations to execute ML and graph algorithms efficiently.

12

map

join sum

ID1

ID2

ID3

Page 13: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

FlinkML §  API for ML pipelines inspired by scikit-learn

§  Collection of packaged algorithms •  SVM, Multiple Linear Regression, Optimization, ALS, ...

13

val  trainingData:  DataSet[LabeledVector]  =  ...  val  testingData:  DataSet[Vector]  =  ...    val  scaler  =  StandardScaler()  val  polyFeatures  =  PolynomialFeatures().setDegree(3)  val  mlr  =  MultipleLinearRegression()    val  pipeline  =  scaler.chainTransformer(polyFeatures).chainPredictor(mlr)    pipeline.fit(trainingData)    val  predictions:  DataSet[LabeledVector]  =  pipeline.predict(testingData)  

Page 14: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Gelly §  Graph API: various graph processing paradigms

§  Packaged algorithms •  PageRank, SSSP, Label Propagation, Community

Detection, Connected Components

14

ExecutionEnvironment  env  =  ExecutionEnvironment.getExecutionEnvironment();    Graph<Long,  Long,  NullValue>  graph  =  ...    DataSet<Vertex<Long,  Long>>  verticesWithCommunity  =  graph.run(  

       new  LabelPropagation<Long>(30)).getVertices();    verticesWithCommunity.print();    env.execute();  

Page 15: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Example: Matrix Factorization

15

Factorizing  a  matrix  with  28  billion  raNngs  for  recommendaNons  

More  at:  h6p://data-­‐arNsans.com/compuNng-­‐recommendaNons-­‐with-­‐flink.html  

Page 16: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Fault tolerance

16

Page 17: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Why be fault tolerant? §  Failures are rare but they occur §  The larger the cluster, the more likely

failures

§  Types of failures •  Hardware •  Software

17

Page 18: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Recovery strategies Batch §  Simple strategy: restart the job/tasks §  Resume from partially crated intermediate

results or re-read and process the entire input again

Streaming §  Simple strategy: restart job/tasks §  We loose the state of the operators §  Goal: find the correct offset to resume from

18

Page 19: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Streaming fault tolerance §  Ensure that operators see all events •  “At least once” •  Solved by replaying a stream from a

checkpoint, e.g., from a past Kafka offset

§  Ensure that operators do not perform duplicate updates to their state •  “Exactly once” •  Several solutions

19

Page 20: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Exactly once approaches §  Discretized streams (Spark Streaming)

•  Treat streaming as a series of small atomic computations •  “Fast track” to fault tolerance, but restricts computational

and programming model (e.g., cannot mutate state across “mini-batches”, window functions correlated with mini-batch size)

§  MillWheel (Google Cloud Dataflow) •  State update and derived events committed as atomic

transaction to a high-throughput transactional store •  Requires a very high-throughput transactional store J

§  Chandy-Lamport distributed snapshots (Flink)

20

Page 21: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

21

JobManager Initiate Checkpoint

Replay will start from here

Page 22: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

22

JobManager Barriers “push” prior events (assumes in-order delivery in individual channels)

Operator checkpointing starting

Operator checkpointing finished

Operator checkpointing in progress

Page 23: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

23

JobManager Operator checkpointing takes snapshot of state after ack’d data have updated the state. Checkpoints currently one-off and synchronous, WiP for incremental and asynchronous

State backup

Pluggable mechanism. Currently either JobManager (for small state) or file system (HDFS/Tachyon). WiP for in-memory grids

Page 24: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

24

JobManager

State snapshots at sinks signal successful end of this checkpoint

At failure, recover last checkpointed state and restart sources from last barrier. This guarantees at least once

State backup

Page 25: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Best of all worlds for streaming §  Low latency •  Thanks to pipelined engine

§  Exactly-once guarantees •  Variation of Chandy-Lamport

§  High throughput •  Controllable checkpointing overhead

§  Separates app logic from recovery •  Checkpointing interval is just a config parameter

25

Page 26: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Fault Tolerance Demo

26

Page 27: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Type System and Keys What kind of data can Flink handle?

27

Page 28: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Apache Flink’s Type System

§  Flink aims to support all data types •  Ease of programming •  Seamless integration with existing code

§  Programs are analyzed before execution •  Used data types are identified •  Serializer & comparator are configured

28

Page 29: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Apache Flink’s Type System

§  Data types are either •  Atomic types (like Java Primitives) •  Composite types (like Flink Tuples)

§  Composite types nest other types

§  Not all data types can be used as keys! •  Flink groups, joins & sorts DataSets on keys •  Key types must be comparable

29

Page 30: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Atomic Types

Flink  Type   Java  Type   Can  be  used  as  key?  

BasicType   Java  PrimiNves    (Integer,  String,  …)  

Yes  

ArrayType   Arrays  of  Java  primiNves  or  objects  

No  

WritableType   Implements  Hadoop’s  Writable  interface  

Yes,  if  implements  WritableComparable  

GenericType   Any  other  type   Yes,  if  implements  Comparable  

30

Page 31: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Composite Types §  Are composed of fields with other types •  Fields types can be atomic or composite

§  Fields can be addressed as keys •  Field type must be a key type!

§  A composite type can be a key type •  All field types must be key types!

31

Page 32: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

PojoType §  Any Java class that •  Has an empty default constructor •  Has publicly accessible fields

(Public or getter/setter)

public class Person { ! public int id; ! public String name; ! public Person() {}; ! public Person(int id, String name) {…}; !} !!

DataSet<Person> p = ! env.fromElements(new Person(1, ”Bob”)); !

32

Page 33: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

PojoType §  Define keys by field name DataSet<Person> p = … !// group on “name” field !d.groupBy(“name”).groupReduce(…); !

33

Page 34: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Scala CaseClasses §  Scala case classes are natively supported case class Person(id: Int, name: String) !d: DataSet[Person] = !" env.fromElements(new Person(1, “Bob”) !

!

§  Define keys by field name // use field “name” as key !d.groupBy(“name”).groupReduce(…) !!

34

Page 35: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Composite & nested keys DataSet<Tuple3<String, Person, Double>> d = … !

§  Composite keys are supported // group on both long fields !d.groupBy(0, 1).reduceGroup(…); !

§  Nested fields can be used as types // group on nested “name” field !d.groupBy(“f1.name”).reduceGroup(…); !

§  Full types can be used as key using “*” wildcard // group on complete nested Pojo field !d.groupBy(“f1.*”).reduceGroup(…); !!

•  “*” wildcard can also be used for atomic types !

35

Page 36: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Join & CoGroup Keys §  Key types must match for binary operations!

DataSet<Tuple2<Long, String>> d1 = … !DataSet<Tuple2<Long, Long>> d2 = … !!// works !d1.join(d2).where(0).equalTo(1).with(…); !// works !d1.join(d2).where(“f0”).equalTo(0).with(…); !// does not work! !d1.join(d2).where(1).equalTo(0).with(…); !

36

Page 37: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

KeySelectors §  Keys can be computed using KeySelectors public class SumKeySelector implements ! KeySelector<Tuple2<Long, Long>, Long> { !! public Long getKey(Tuple2<Long, Long> t) { ! return t.f0 + t.f1; ! }} !!DataSet<Tuple2<Long,Long>> d = … !d.groupBy(new SumKeySelector()).reduceGroup(…); !

37

Page 38: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Advanced Sources and Sinks Getting data in and out

38

Page 39: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Supported File Systems §  Flink build-in File Systems: •  LocalFileSystem (file://) •  Hadoop Distributed File System (hdfs://) •  Amazon S3 (s3://) •  MapR FS (maprfs://)

§  Support for all Hadoop File Systems •  NFS, Tachyon, FTP, har (Hadoop Archive), …

39

Page 40: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Input/Output Formats §  FileInputFormat

(recursive directory scans supported) •  DelimitedInputFormat

•  TextInputFormat (Reads text files linewise) •  CsvInputFormat (Reads field delimited files)

•  BinaryInputFormat •  AvroInputFormat (Reads Avro POJOs)

§  JDBCInputFormat (Reads result of SQL query) §  HadoopInputFormat

(Wraps any Hadoop InputFormat)

40

Page 41: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Hadoop Input/OutputFormats

§  Support for all Hadoop I/OFormats

§  Read from and write to •  MongoDB •  Apache Parquet •  Apache ORC •  Apache Kafka (for batch) •  Compressed file formats (.gz, .zip, ...) •  and more…

41

Page 42: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Using InputFormats ExecutionEnvironment env = … !!

// read text file linewise!env.readTextFile(…); !!

// read CSV file !env.readCsvFile(…); !!

// read file with Hadoop FileInputFormat!env.readHadoopFile(…); !!

// use regular Hadoop InputFormat!env.createHadoopInput(…); !!

// use regular Flink InputFormat!env.createInput(…); !! 42

Page 43: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Transformations & Functions

43

Page 44: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Transformations §  DataSet Basics presented: •  Map, FlatMap, GroupBy, GroupReduce, Join

§  Reduce §  CoGroup §  Combine §  GroupSort §  AllReduce & AllGroupReduce §  Union

§  see documentation for more transformations

44

Page 45: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

GroupReduce (Hadoop-style)

§  GroupReduceFunction gives iterator over elements of group •  Elements are streamed (possibly from disk),

not materialized in memory •  Group size can exceed available JVM heap

§  Input type and output type may be different

45

Page 46: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Reduce (FP-style) •  Reduce like in functional programming

•  Less generic compared to GroupReduce •  Function must be commutative and associative •  Input type == Output type

•  System can apply more optimizations •  Always combinable •  May use a hash strategy for execution (future)

46

Page 47: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Reduce (FP-style) DataSet<Tuple2<Long,Long>> sum = data ! .groupBy(0) ! .reduce(new SumReducer()); !!public static class SumReducer implements ! ReduceFunction<Tuple2<Long,Long>> { !! @Override ! public Tuple2<Long,Long> reduce( ! Tuple2<Long,Long> v1, ! Tuple2<Long,Long> v2) { !" "v1.f1 += v2.f1; !

return v1; !"} !

} !47

Page 48: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

CoGroup §  Binary operation (two inputs) •  Groups both inputs on a key •  Processes groups with matching keys of both

inputs

§  Similar to GroupReduce on two inputs

48

Page 49: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

CoGroup DataSet<Tuple2<Long,String>> d1 = …; !DataSet<Long> d2 = …; !DataSet<String> d3 = ! d1.coGroup(d2).where(0).equalTo(1).with(new CoGrouper()); !!

public static class CoGrouper implements !CoGroupFunction<Tuple2<Long,String>,Long,String>{ !!

@Override ! public void coGroup(Iterable<Tuple2<Long,String> vs1, ! Iterable<Long> vs2, Collector<String> out) { ! if(!vs2.iterator.hasNext()) { ! for(Tuple2<Long,String> v1 : vs1) { !

" out.collect(v1.f1); !" } !

" } !} } ! 49

Page 50: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Combiner §  Local pre-aggregation of data •  Before data is sent to GroupReduce or CoGroup •  (functional) Reduce injects combiner automatically •  Similar to Hadoop Combiner

§  Optional for semantics, crucial for performance! •  Reduces data before it is sent over the network

50

Page 51: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Combiner WordCount Example

51

Map

Map

Map

Combine

Combine

Combine

Reduce

Reduce

A B B A C A

C A C B A B

D A A A B C

(A, 1) (B, 1) (B, 1)

(A, 1) (C, 1) (A, 1)

(C, 1) (A, 1) (C, 1) (B, 1) (A, 1) (B, 1)

(D, 1) (A, 1) (A, 1)

(A, 1) (B, 1) (C, 1)

(A, 3) (C, 1)

(A, 2) (C, 2)

(A, 3) (C, 1)

(B, 2)

(B, 2)

(B, 1) (D, 1)

(A, 8)

(C, 4)

(B, 5)

(D, 1)

Page 52: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Use a combiner §  Implement RichGroupReduceFunction<I,O> !•  Override combine(Iterable<I> in, Collector<O>); !•  Same interface as reduce() method •  Annotate your GroupReduceFunction with @Combinable !•  Combiner will be automatically injected into Flink

program

§  Implement a GroupCombineFunction!•  Same interface as GroupReduceFunction!•  DataSet.combineGroup()!

52

Page 53: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

GroupSort §  Sort groups before they are handed to

GroupReduce or CoGroup functions •  More (resource-)efficient user code •  Easier user code implementation •  Comes (almost) for free •  Aka secondary sort (Hadoop)

53

DataSet<Tuple3<Long,Long,Long> data = …; !!data.groupBy(0) ! .sortGroup(1, Order.ASCENDING) ! .groupReduce(new MyReducer()); !

!

Page 54: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

AllReduce / AllGroupReduce

§  Reduce / GroupReduce without GroupBy •  Operates on a single group -> Full DataSet •  Full DataSet is sent to one machine •  Will automatically run with parallelism of 1

§  Careful with large DataSets! •  Make sure you have a Combiner

54

Page 55: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Union §  Union two data sets •  Binary operation, same data type required •  No duplicate elimination (SQL UNION ALL) •  Very cheap operation

55

DataSet<Tuple2<String, Long> d1 = …; !DataSet<Tuple2<String, Long> d2 = …; !!DataSet<Tuple2<String, Long> d3 = !"d1.union(d2); !

Page 56: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

RichFunctions §  Function interfaces have only one method •  Single abstract method (SAM) •  Support for Java8 Lambda functions

§  There is a “Rich” variant for each function. •  RichFlatMapFunction, … •  Additional methods

•  open(Configuration c) !•  close() !•  getRuntimeContext() !

56

Page 57: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

RichFunctions & RuntimeContext

§  RuntimeContext has useful methods: •  getIndexOfThisSubtask () !•  getNumberOfParallelSubtasks() !•  getExecutionConfig() !

§  Gives access to: •  Accumulators •  DistributedCache

flink.apache.org 57

Page 58: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Further API Concepts

58

Page 59: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Broadcast Variables

59

map

map

map

Example: Tag words with IDs in text corpus

Dictionary

Text data set

broadcast (small) dictionary to all

mappers

Page 60: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Broadcast variables §  register any DataSet as a broadcast

variable §  available on all parallel instances

60

// 1. The DataSet to be broadcasted !DataSet<Integer> toBroadcast = env.fromElements(1, 2, 3); !!// 2. Broadcast the DataSet !map().withBroadcastSet(toBroadcast, "broadcastSetName"); !!// 3. inside user defined function!getRuntimeContext().getBroadcastVariable("broadcastSetName"); !

Page 61: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Accumulators §  Lightweight tool to compute stats on data •  Useful to verify your assumptions about your data •  Similar to Counters (Hadoop MapReduce)

§  Build in accumulators •  Int and Long counters •  Histogramm

§  Easily customizable

61

Page 62: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

JobManager counter =

4 + 18 + 22 =

44

Accumulators

62

map long counter++

Example:    Count  total  number  of  words  in  text  corpus  

Send local counts from parallel instances to JobManager

Text data set

map long counter++

map long counter++

Page 63: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Using Accumulators §  Use accumulators to verify your

assumptions about the data

63

class Tokenizer extends ! RichFlatMapFunction<String, String>> { ! ! @Override ! public void flatMap(String val, ! Collector<String> out) { ! "getRuntimeContext() !

" .getLongCounter("elementCount").add(1L); ! // do more stuff. ! } !} !

Page 64: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Get Accumulator Results

§  Accumulators are available via JobExecutionResult •  returned by env.execute()

§  Accumulators are displayed •  by CLI client •  in the JobManager web frontend

64

JobExecutionResult result = env.execute("WordCount"); !long ec = result.getAccumulatorResult("elementCount"); !

Page 65: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Closing

65

Page 66: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

I ♥ , do you?

66

§  Get involved and start a discussion on

Flink‘s mailing list

§  { user, dev }@flink.apache.org

§  Subscribe to [email protected] §  Follow flink.apache.org/blog and

@ApacheFlink on Twitter

Page 67: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

67

flink-forward.org

Call for papers deadline: August 14, 2015

October 12-13, 2015

Discount code: FlinkEITSummerSchool25  

Page 68: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Thank you for listening!

68

Page 69: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Flink compared to other projects

69

Page 70: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Batch & Streaming projects Batch only Streaming only Hybrid

70

Page 71: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Batch comparison

71

API   low-­‐level   high-­‐level   high-­‐level  

Data  Transfer   batch   batch   pipelined  &  batch  

Memory  Management   disk-­‐based   JVM-­‐managed   AcNve  managed  

IteraAons   file  system  cached  

in-­‐memory    cached     streamed  

Fault  tolerance   task  level   task  level   job  level  

Good  at   massive  scale  out     data  exploraNon   heavy  backend  &  iteraNve  jobs  

Libraries   many  external   built-­‐in  &  external   evolving  built-­‐in  &  external  

Page 72: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Streaming comparison

72

Streaming   “true”   mini  batches   “true”  

API   low-­‐level   high-­‐level   high-­‐level  

Fault  tolerance   tuple-­‐level  ACKs   RDD-­‐based  (lineage)   coarse  checkpoinNng  

State   not  built-­‐in   external   internal  

Exactly  once   at  least  once   exactly  once   exactly  once  

Windowing   not  built-­‐in   restricted   flexible  

Latency   low   medium   low  

Throughput   medium   high   high  

Page 73: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Stream platform architecture

73

-  Gather and backup streams -  Offer streams for consumption -  Provide stream recovery

-  Analyze and correlate streams -  Create derived streams and state -  Provide these to downstream systems

Server logs

Trxn logs

Sensor logs

Downstream systems

Page 74: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

What is a stream processor?

1.  Pipelining 2.  Stream replay

3.  Operator state 4.  Backup and restore

5.  High-level APIs 6.  Integration with batch

7.  High availability 8.  Scale-in and scale-out

74

Basics

State

App development

Large deployments

See http://data-artisans.com/stream-processing-with-flink.html

Page 75: Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

Engine comparison

75

Paradigm

Optimization

Execution

API

Optimization in all APIs

Optimization of SQL queries none none

DAG

Transformations on k/v pair collections

Iterative transformations on collections

RDD Cyclic dataflows

MapReduce on k/v pairs

k/v pair Readers/Writers

Batch sorting

Batch sorting and partitioning

Batch with memory pinning

Stream with out-of-core algorithms

MapReduce