Introduction to Apache Beam (incubating) - DataCamp Salzburg - 7 dec 2016


Apache Beam is a unified (batch + stream) and runner-agnostic programming model designed to provide efficient and portable data processing pipelines.

Some bits of history...

Beam Programming Model: abstract stack

● SDK / DSL

● Beam Pipeline Construction

● Runner

● Beam Fn Runners

● Execution

Beam Programming Model: concrete stack

● SDK / DSL: Java SDK, Python SDK, x SDK, scio

● Beam Pipeline Construction

● Runner: Apex Runner, Dataflow Runner, Flink Runner, Spark Runner, Direct Runner

● Beam Fn Runners

● Execution: Execution 1 ... Execution N

Beam Capability Matrix

https://beam.incubator.apache.org/documentation/runners/capability-matrix/

Beam Model API in a nutshell

● Pipeline: a data processing job as a directed graph of steps

● PCollection: a parallel collection of timestamped elements that are in windows

● IO: produce/consume PCollections from/to outside the pipeline

● Transforms, for instance:

    ○ ParDo: flatmap over elements of a PCollection

    ○ (Co)GroupByKey: shuffle & group {{K: V}} → {K: [V]}

    ○ Side inputs: global view of a PCollection used for broadcast / joins

https://beam.apache.org/documentation/programming-guide/
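To make the ParDo and GroupByKey bullets concrete, here is a minimal plain-Java sketch of their semantics on in-memory lists. It is an illustration only: `BeamModelSketch`, `parDo` and `groupByKey` are hypothetical names, not Beam API, and the sketch ignores timestamps, windows and parallelism. It assumes Java 9 or later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Plain-Java illustration only: none of these names are Beam API.
final class BeamModelSketch {

    // ParDo behaves like a flatmap: each input element yields zero or more outputs.
    static <I, O> List<O> parDo(List<I> input, Function<I, List<O>> fn) {
        List<O> output = new ArrayList<>();
        for (I element : input) {
            output.addAll(fn.apply(element));
        }
        return output;
    }

    // GroupByKey turns {K: V} pairs into {K: [V]}, collecting all values per key.
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be", "or not to be");
        // One line in, several words out.
        List<String> words = parDo(lines, line -> Arrays.asList(line.split(" ")));
        // One word in, exactly one (word, 1) pair out.
        List<Map.Entry<String, Integer>> pairs = parDo(words, w -> List.of(Map.entry(w, 1)));
        System.out.println(groupByKey(pairs)); // prints {to=[1, 1], be=[1, 1], or=[1], not=[1]}
    }
}
```

In a real pipeline the same two steps would run distributed over a PCollection, and the grouping would happen per window.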

Writing a basic Beam Pipeline

// Parse command-line flags such as --inputFile and --output into typed options.

Options options = PipelineOptionsFactory.fromArgs(args)

    .withValidation().as(Options.class);

Pipeline pipeline = Pipeline.create(options);

// Read lines, count the words, format each count as text, write the results.

pipeline.apply("ReadLines", TextIO.Read.from(options.getInputFile()))

    .apply(new CountWords())

    .apply(MapElements.via(new FormatAsTextFn()))

    .apply("WriteCounts", TextIO.Write.to(options.getOutput()));

pipeline.run();
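The slide does not show what `CountWords` and `FormatAsTextFn` do. As a rough guide, the plain-Java sketch below mimics the two stages on an in-memory list. It is not the talk's code and uses no Beam API: the class `WordCountSketch`, the word-splitting regex and the `word: count` output format are all assumptions made for illustration. It assumes Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the pipeline stages; the real classes are Beam transforms.
final class WordCountSketch {

    // Stand-in for CountWords: split each line into words, then count occurrences.
    // The regex (split on runs of non-letters) is an assumption.
    static Map<String, Long> countWords(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split("[^\\p{L}]+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    // Stand-in for FormatAsTextFn: one "word: count" line per entry (format assumed).
    static List<String> formatAsText(Map<String, Long> counts) {
        List<String> out = new ArrayList<>();
        counts.forEach((word, count) -> out.add(word + ": " + count));
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("to be, or not to be", "that is the question");
        // Words are printed in alphabetical order because counts is a TreeMap.
        formatAsText(countWords(lines)).forEach(System.out::println);
    }
}
```

The sorted `TreeMap` is used only to make the output deterministic; a distributed runner gives no such ordering guarantee.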

Run your Pipeline: Direct Runner

mvn compile exec:java \
    -Dexec.mainClass=io.redlink.datacamp.beam.WordCount \
    -Dexec.args="--inputFile=../input.txt \
                 --output=target/direct/counts" \
    -Pdirect-runner

http://beam.incubator.apache.org/get-started/quickstart/
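The flags passed through `-Dexec.args` use the form `--name=value`, and Beam's options interfaces expose each property through a matching getter (for example `--output` and `getOutput()`). The sketch below shows that naming convention with a hand-rolled parser. It is not how `PipelineOptionsFactory` is implemented, and `ArgsSketch`, `parse` and `getterName` are hypothetical names used only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: a hand-rolled parser, not Beam's PipelineOptionsFactory.
final class ArgsSketch {

    // Parse "--name=value" flags into a name -> value map, rejecting anything else.
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String arg : args) {
            if (!arg.startsWith("--") || !arg.contains("=")) {
                throw new IllegalArgumentException("Expected --name=value, got: " + arg);
            }
            int eq = arg.indexOf('=');
            options.put(arg.substring(2, eq), arg.substring(eq + 1));
        }
        return options;
    }

    // A property such as "inputFile" corresponds to a getter named "getInputFile".
    static String getterName(String property) {
        return "get" + Character.toUpperCase(property.charAt(0)) + property.substring(1);
    }

    public static void main(String[] args) {
        String[] example = {"--inputFile=../input.txt", "--output=target/direct/counts"};
        parse(example).forEach((name, value) ->
            System.out.println(name + " -> " + getterName(name) + "() = " + value));
    }
}
```

One consequence of this convention is that a flag name and its getter must agree: a flag with no matching property is typically rejected when the options are parsed.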

Run your Pipeline: Spark

mvn compile exec:java \
    -Dexec.mainClass=io.redlink.datacamp.beam.WordCount \
    -Dexec.args="--runner=SparkRunner \
                 --inputFile=input.txt --output=target/spark/counts" \
    -Pspark-runner

http://beam.incubator.apache.org/get-started/quickstart/#runner-spark

Run your Pipeline: Flink

mvn package exec:java \
    -Dexec.mainClass=io.redlink.datacamp.beam.WordCount \
    -Dexec.args="--runner=FlinkRunner \
                 --inputFile=input.txt \
                 --output=target/flink/counts" \
    -Pflink-runner

http://beam.incubator.apache.org/get-started/quickstart/#runner-flink

Thank you very much

Sergio Fernández

Software Engineer

https://www.wikier.org/

Redlink GmbH

http://redlink.co

Work partially funded by SSIX, a European Union Horizon 2020 project (grant agreement no. 645425)
