Sergio Fernández, Redlink GmbH. December 7, 2016, DataCamp Salzburg.

Introduction to Apache Beam (incubating) - DataCamp Salzburg - 7 dec 2016

Apache Beam is a unified and agnostic (batch + stream) programming model designed to provide efficient and portable data processing pipelines.

Some bits of history...

Beam Programming Model: abstract stack

● SDK / DSL

● Beam Pipeline Construction

● Runner

● Beam Fn Runners

● Execution

Beam Programming Model: concrete stack

● SDKs and DSLs: Java SDK, Python SDK, scio, x SDK

● Beam Pipeline Construction

● Runners: Apex Runner, Flink Runner, Spark Runner, Dataflow Runner, Direct Runner

● Beam Fn Runners

● Execution: Execution 1 ... Execution N
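The layering above separates describing a pipeline from executing it. As a rough illustration only, the following plain-Java toy records steps in one object and lets two interchangeable runners execute them. ToyPipeline, ToyRunner, SEQUENTIAL and PARALLEL are hypothetical names invented for this sketch; none of them are part of the Beam API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

final class StackSketch {

    /** Pipeline construction (hypothetical): records steps in order, executes nothing. */
    static final class ToyPipeline {
        final List<Function<String, String>> steps = new ArrayList<>();

        ToyPipeline apply(Function<String, String> step) {
            steps.add(step);
            return this;
        }

        /** Runs every recorded step, in order, on a single element. */
        String process(String element) {
            String current = element;
            for (Function<String, String> step : steps) {
                current = step.apply(current);
            }
            return current;
        }
    }

    /** Runner (hypothetical): decides how the recorded steps are executed. */
    interface ToyRunner {
        List<String> run(ToyPipeline pipeline, List<String> input);
    }

    /** Executes elements one after another on the calling thread. */
    static final ToyRunner SEQUENTIAL = (pipeline, input) ->
            input.stream().map(pipeline::process).collect(Collectors.toList());

    /** Executes elements on a parallel stream; the result keeps the input order. */
    static final ToyRunner PARALLEL = (pipeline, input) ->
            input.parallelStream().map(pipeline::process).collect(Collectors.toList());

    public static void main(String[] args) {
        ToyPipeline pipeline = new ToyPipeline()
                .apply(String::trim)
                .apply(String::toUpperCase);
        List<String> input = Arrays.asList(" beam ", "flink", " spark");

        // The same pipeline definition runs unchanged on either runner.
        System.out.println(SEQUENTIAL.run(pipeline, input));
        System.out.println(PARALLEL.run(pipeline, input));
    }
}
```

The point of the design is that the pipeline object never refers to a runner, so swapping the execution strategy does not require touching the pipeline definition.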

Beam Capability Matrix

https://beam.incubator.apache.org/documentation/runners/capability-matrix/

Beam Model API in a nutshell

● Pipeline: a data processing job as a directed graph of steps

● PCollection: a parallel collection of timestamped elements that are in windows

● IO: produce/consume PCollections from/to outside the pipeline

● Transforms, for instance:

    ○ ParDo: flatmap over elements of a PCollection

    ○ (Co)GroupByKey: shuffle & group {{K: V}} → {K: [V]}

    ○ Side inputs: global view of a PCollection used for broadcast / joins
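The first two transforms in the list can be mimicked on in-memory lists. The sketch below is a plain-Java analogy, not Beam code: parDo, groupByKey, kv and parse are hypothetical helpers written for this example, and the "key:value,value" line format is an assumption made up for the demo data.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

final class TransformSketch {

    /** Builds an immutable key/value pair. */
    static <K, V> Map.Entry<K, V> kv(K key, V value) {
        return new AbstractMap.SimpleImmutableEntry<>(key, value);
    }

    /** ParDo analogue: every input element may emit zero, one, or many outputs. */
    static <A, B> List<B> parDo(List<A> input, Function<A, List<B>> fn) {
        List<B> output = new ArrayList<>();
        for (A element : input) {
            output.addAll(fn.apply(element));
        }
        return output;
    }

    /** GroupByKey analogue: {{K: V}} -> {K: [V]}, with keys kept in sorted order. */
    static <K extends Comparable<K>, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new TreeMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Turns "key:v1,v2" into (key, v1), (key, v2); a line without values emits nothing. */
    static List<Map.Entry<String, String>> parse(String line) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        String[] parts = line.split(":", 2);
        if (parts.length < 2 || parts[1].isEmpty()) {
            return pairs;
        }
        for (String value : parts[1].split(",")) {
            pairs.add(kv(parts[0], value));
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("fruit:apple,pear", "veg:kale", "empty:", "fruit:fig");
        List<Map.Entry<String, String>> pairs = parDo(lines, TransformSketch::parse);
        System.out.println(groupByKey(pairs)); // prints {fruit=[apple, pear, fig], veg=[kale]}
    }
}
```

In a real runner the grouping step involves a distributed shuffle; here it is a single in-memory map, which is enough to show the shape of the input and output.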

https://beam.apache.org/documentation/programming-guide/
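The PCollection bullet above says elements are timestamped and live in windows. One common scheme is fixed-size windows, where the window is derived from the timestamp alone. The following is a minimal plain-Java sketch of that idea under stated assumptions: the Timestamped class, the helper names and the 60-second window size are all invented for illustration and do not reflect how Beam implements windowing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FixedWindowSketch {

    /** A value paired with its event timestamp in milliseconds (hypothetical type). */
    static final class Timestamped {
        final String value;
        final long timestampMillis;

        Timestamped(String value, long timestampMillis) {
            this.value = value;
            this.timestampMillis = timestampMillis;
        }
    }

    /** Start of the fixed window containing the timestamp; floorMod handles negatives. */
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    /** Groups element values by the start of the window they fall into. */
    static Map<Long, List<String>> assignWindows(List<Timestamped> elements, long sizeMillis) {
        Map<Long, List<String>> windows = new TreeMap<>();
        for (Timestamped element : elements) {
            long start = windowStart(element.timestampMillis, sizeMillis);
            windows.computeIfAbsent(start, s -> new ArrayList<>()).add(element.value);
        }
        return windows;
    }

    public static void main(String[] args) {
        List<Timestamped> elements = Arrays.asList(
                new Timestamped("a", 1_000L),
                new Timestamped("b", 59_000L),
                new Timestamped("c", 61_000L));
        // One-minute windows: "a" and "b" share a window, "c" falls into the next one.
        System.out.println(assignWindows(elements, 60_000L)); // prints {0=[a, b], 60000=[c]}
    }
}
```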

Writing a basic Beam Pipeline

    // Parse command-line flags such as --inputFile and --output into typed options.
    Options options = PipelineOptionsFactory.fromArgs(args)
        .withValidation().as(Options.class);

    Pipeline pipeline = Pipeline.create(options);

    // Read lines, count the words, format each count as text, and write the result.
    pipeline.apply("ReadLines", TextIO.Read.from(options.getInputFile()))
        .apply(new CountWords())
        .apply(MapElements.via(new FormatAsTextFn()))
        .apply("WriteCounts", TextIO.Write.to(options.getOutput()));

    pipeline.run();
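The slide does not show what CountWords and FormatAsTextFn do. As a hedged guess at their behaviour, the plain-Java sketch below counts words and renders each count as a "word: count" line. It runs on an in-memory list instead of files, the method names are hypothetical stand-ins, and both the word-splitting rule and the output format are assumptions rather than something the slides confirm.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class WordCountSketch {

    /** Stand-in for CountWords: split lines into words, then count each distinct word. */
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("[^\\p{L}]+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    /** Stand-in for FormatAsTextFn: render one (word, count) pair as a line of text. */
    static String formatAsText(Map.Entry<String, Long> count) {
        return count.getKey() + ": " + count.getValue();
    }

    /** The whole chain: lines in, formatted counts out (sorted by word). */
    static List<String> run(List<String> lines) {
        return countWords(lines).entrySet().stream()
                .map(WordCountSketch::formatAsText)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not", "to be");
        run(lines).forEach(System.out::println);
        // prints:
        // be: 2
        // not: 1
        // or: 1
        // to: 2
    }
}
```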

Run your Pipeline: Direct Runner

    mvn compile exec:java \
        -Dexec.mainClass=io.redlink.datacamp.beam.WordCount \
        -Dexec.args="--inputFile=../input.txt \
                     --output=target/direct/counts" \
        -Pdirect-runner

http://beam.incubator.apache.org/get-started/quickstart/
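The -Dexec.args string in the command above carries flags of the form --name=value, which the pipeline's main method receives as its args array and turns into option values. The sketch below shows that mapping in plain Java as an illustration only. It is not how PipelineOptionsFactory is implemented; the parse helper is hypothetical, and treating a bare --flag as "true" is a simplifying assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class OptionsSketch {

    /** Parses --name=value flags into an ordered map of option names to values. */
    static Map<String, String> parse(String... args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String arg : args) {
            if (!arg.startsWith("--")) {
                throw new IllegalArgumentException("Expected --name=value but got: " + arg);
            }
            int equals = arg.indexOf('=');
            // Assumption for this sketch: a flag without '=' is a boolean set to true.
            String name = equals < 0 ? arg.substring(2) : arg.substring(2, equals);
            String value = equals < 0 ? "true" : arg.substring(equals + 1);
            options.put(name, value);
        }
        return options;
    }

    public static void main(String[] args) {
        Map<String, String> options = parse(
                "--inputFile=../input.txt",
                "--output=target/direct/counts");
        System.out.println(options.get("inputFile")); // prints ../input.txt
        System.out.println(options.get("output"));    // prints target/direct/counts
    }
}
```

The same mechanism carries the runner choice on the other slides: --runner=SparkRunner or --runner=FlinkRunner is just one more name/value pair in the argument list.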

Run your Pipeline: Spark

    mvn compile exec:java \
        -Dexec.mainClass=io.redlink.datacamp.beam.WordCount \
        -Dexec.args="--runner=SparkRunner \
                     --inputFile=input.txt --output=target/spark/counts" \
        -Pspark-runner

http://beam.incubator.apache.org/get-started/quickstart/#runner-spark

Run your Pipeline: Flink

    mvn package exec:java \
        -Dexec.mainClass=io.redlink.datacamp.beam.WordCount \
        -Dexec.args="--runner=FlinkRunner \
                     --inputFile=input.txt \
                     --output=target/flink/counts" \
        -Pflink-runner

http://beam.incubator.apache.org/get-started/quickstart/#runner-flink

Thank you very much

Sergio Fernández, Software Engineer, https://www.wikier.org/

Redlink GmbH, http://redlink.co

Work partially funded by SSIX, a European Union Horizon 2020 project (grant agreement no. 645425).