86
Simplifying Big Data with Apache Crunch Micah Whitacre @mkwhit

Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Simplifying Big Data with Apache Crunch

Micah Whitacre@mkwhit

Page 2: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation
Page 3: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation
Page 4: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation
Page 5: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation
Page 6: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation
Page 7: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation
Page 8: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation
Page 9: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation
Page 10: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation
Page 11: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Semantic Chart Search

Cloud Based EMR

Medical Alerting System

Population Health Management

Page 12: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Problem moves from scaling architecture...

Page 13: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Problem moves from not only scaling architecture...

To how to scale the knowledge

Page 14: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation
Page 15: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Battling the 3 V’s

Page 16: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Battling the 3 V’s

Daily, weekly, monthly uploads

Page 17: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Battling the 3 V’s

Daily, weekly, monthly uploads

60+ different data formats

Page 18: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Battling the 3 V’s

Daily, weekly, monthly uploads

60+ different data formats

Constant streams for near real time

Page 19: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Battling the 3 V’s

Daily, weekly, monthly uploads

2+ TB of streaming data daily

60+ different data formats

Constant streams for near real time

Page 20: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Normalize Data

Apply Algorithms

Load Data for

Displays

HBaseSolrVertica

AvroCSVVerticaHBase

Population Health

Page 21: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Page 22: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Mapper

Reducer

Page 23: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Struggle to fit into single MapReduce job

Page 24: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Struggle to fit into single MapReduce job

Integration done through persistence

Page 25: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Struggle to fit into single MapReduce job

Integration done through persistence

Custom impls of common patterns

Page 26: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Struggle to fit into single MapReduce job

Integration done through persistence

Custom impls of common patterns

Evolving Requirements

Page 27: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

AnonymizeData Avro

Prep for Bulk Load HBase

Page 28: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Easy integration between teams

Focus on processing steps

Shallow learning curve

Ability to tune for performance

Page 29: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Apache CrunchCompose processing into pipelines

Open Source FlumeJava impl

Utilizes POJOs (hides serialization)

Transformation through fns (not job)

Page 30: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Page 31: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Processing Pipeline

Page 32: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

PipelineProgrammatic description of DAG

Supports lazy execution

MapReduce, Spark, Memory

Implementations indicate runtime

Page 33: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Pipeline pipeline = MemPipeline.getIntance();

Pipeline pipeline = new MRPipeline(Driver.class, conf);

Pipeline pipeline = new SparkPipeline(sparkContext, “app”);

Page 34: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

SourceReads various inputs

At least one required per pipeline

Custom implementations

Creates initial collections for processing

Page 35: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

SourceSequence Files

Avro ParquetHBaseJDBCHFilesTextCSV

StringsAvroRecords

ResultsPOJOs

ProtobufsThrift

Writables

Page 36: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

pipeline.read(From.textFile(path));

Page 37: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

pipeline.read(new TextFileSource(path,ptype));

Page 38: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

PType<String> ptype = …;pipeline.read(new TextFileSource(path,ptype));

Page 39: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

PTypeHides serialization

Exposes data in native Java forms

Avro, Thrift, and Protocol Buffers

Supports composing complex types

Page 40: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Multiple Serialization Types

Serialization Type = PTypeFamily

Can’t mix families in single type

Avro & Writable available

Can easily convert between families

Page 41: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

PType<Integer> intTypes = Writables.ints();

PType<String> stringType = Avros.strings();

PType<Person> personType = Avros.records(Person.class);

Page 42: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

PType<Pair<String, Person>> pairType = Avros.pairs(stringType, personType);

Page 43: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

PTableType<String, Person> tableType = Avros.tableOf(stringType,personType);

Page 44: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

PType<String> ptype = …;PCollection<String> strings = pipeline.read(

new TextFileSource(path, ptype));

Page 45: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

PCollectionImmutable

Not created only read or transformed

Unsorted

Represents potential data

Page 46: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Page 47: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Process Reference

Data

PCollection<String> PCollection<RefData>

Page 48: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

DoFnSimple API to implement

Location for custom logic

Transforms PCollection between forms

Processes one element at a time

Page 49: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

For each item emits 0:M items

FilterFn - returns boolean

MapFn - emits 1:1

Page 50: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

DoFn API

class ExampleDoFn extends DoFn<String, RefData>{ ...} Type of Data In Type of Data Out

Page 51: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

public void process(String s, Emitter<RefData> emitter) { RefData data = …; emitter.emit(data);}

Type of Data In Type of Data Out

Page 52: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

PCollection<String> refStringsPCollection<RefData> refs = refStrings.parallelDo(fn, Avros.records(RefData.class));

Page 53: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

PCollection<String> dataStrs...PCollection<RefData> refs = dataStrs.parallelDo(diffFn, Avros.records(Data.class));

Page 54: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Page 55: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Hmm now I need to join...

We need a PTable

But they don’t have a common key?

Page 56: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

PTable<K, V>Immutable & Unsorted

Variation PCollection<Pair<K, V>>

Multimap of Keys and Values

Joins, Cogroups, Group By Key

Page 57: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

class ExampleDoFn extends DoFn<String, RefData>{ ...}

Page 58: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

class ExampleDoFn extends DoFn<String, Pair<String, RefData>>{

...}

Page 59: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

PCollection<String> refStringsPTable<String, RefData> refs = refStrings.parallelDo(fn, Avros.tableOf(Avros.strings(), Avros.records(RefData.class)));

Page 60: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

PTable<String, RefData> refs…;PTable<String, Data> data…;

Page 61: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

data.join(refs);

(inner join)

Page 62: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

PTable<String, Pair<Data, RefData>> joinedData = data.join(refs);

Page 63: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Joinsright, left, inner, outer

Mapside, BloomFilter, Sharded

Eliminates custom impls

Page 64: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Page 65: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Page 66: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

FilterFn API

class MyFilterFn extends FilterFn<...>{ ...} Type of Data In

Page 67: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

public boolean accept(... value){ return value > 3;}

Page 68: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

PCollection<Model> values = …;PCollection<Model> filtered = values.filter(new MyFilterFn());

Page 69: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Page 70: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

PTable<String,Model> models = …;

Keyed By PersonId

Page 71: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

PTable<String,Model> models = …;PGroupedTable<String, Model> groupedModels = models.groupByKey();

Page 72: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

PGroupedTable<K, V>Immutable & Sorted

PCollection<Pair<K, Iterable<V>>>

Page 73: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Page 74: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Page 75: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

PCollection<Person> persons = …;

Page 76: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

PCollection<Person> persons = …;pipeline.write(persons, To.avroFile(path));

Page 77: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

PCollection<Person> persons = …;pipeline.write(persons, new AvroFileTarget(path));

Page 78: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

TargetPersists PCollection

At least one required per pipeline

Custom implementations

Page 79: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

TargetSequence Files

Avro ParquetHBaseJDBCHFilesTextCSV

StringsAvroRecords

ResultsPOJOs

ProtobufsThrift

Writables

Page 80: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Page 81: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Pipeline pipeline = …; ...pipeline.write(...);PipelineResult result = pipeline.done();

Execution

Page 82: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Map Reduce Reduce

Page 83: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Tuning

GroupingOptions/ParallelDoOptions

Scale factors

Tweak pipeline for performance

Page 84: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Focus on the transformations

Smaller learning curve

Functionality first

Less fragility

Page 85: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation

Extend pipeline for new features

Iterate with confidence

Integration through PCollections