86
Simplifying Big Data with Apache Crunch Micah Whitacre @mkwhit

Simplifying Big Data with Apache Crunch

Embed Size (px)

Citation preview

Page 1: Simplifying Big Data with Apache Crunch

Simplifying Big Data with Apache Crunch

Micah Whitacre@mkwhit

Page 2: Simplifying Big Data with Apache Crunch
Page 3: Simplifying Big Data with Apache Crunch
Page 4: Simplifying Big Data with Apache Crunch
Page 5: Simplifying Big Data with Apache Crunch
Page 6: Simplifying Big Data with Apache Crunch
Page 7: Simplifying Big Data with Apache Crunch
Page 8: Simplifying Big Data with Apache Crunch
Page 9: Simplifying Big Data with Apache Crunch
Page 10: Simplifying Big Data with Apache Crunch
Page 11: Simplifying Big Data with Apache Crunch

Semantic Chart Search

Cloud Based EMR

Medical Alerting System

Population Health Management

Page 12: Simplifying Big Data with Apache Crunch

Problem moves from scaling architecture...

Page 13: Simplifying Big Data with Apache Crunch

Problem moves from not only scaling architecture...

To how to scale the knowledge

Page 14: Simplifying Big Data with Apache Crunch
Page 15: Simplifying Big Data with Apache Crunch

Battling the 3 V’s

Page 16: Simplifying Big Data with Apache Crunch

Battling the 3 V’s

Daily, weekly, monthly uploads

Page 17: Simplifying Big Data with Apache Crunch

Battling the 3 V’s

Daily, weekly, monthly uploads

60+ different data formats

Page 18: Simplifying Big Data with Apache Crunch

Battling the 3 V’s

Daily, weekly, monthly uploads

60+ different data formats

Constant streams for near real time

Page 19: Simplifying Big Data with Apache Crunch

Battling the 3 V’s

Daily, weekly, monthly uploads

2+ TB of streaming data daily

60+ different data formats

Constant streams for near real time

Page 20: Simplifying Big Data with Apache Crunch

Normalize Data

Apply Algorithms

Load Data for

Displays

HBaseSolrVertica

AvroCSVVerticaHBase

Population Health

Page 21: Simplifying Big Data with Apache Crunch

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Page 22: Simplifying Big Data with Apache Crunch

Mapper

Reducer

Page 23: Simplifying Big Data with Apache Crunch

Struggle to fit into single MapReduce job

Page 24: Simplifying Big Data with Apache Crunch

Struggle to fit into single MapReduce job

Integration done through persistence

Page 25: Simplifying Big Data with Apache Crunch

Struggle to fit into single MapReduce job

Integration done through persistence

Custom impls of common patterns

Page 26: Simplifying Big Data with Apache Crunch

Struggle to fit into single MapReduce job

Integration done through persistence

Custom impls of common patterns

Evolving Requirements

Page 27: Simplifying Big Data with Apache Crunch

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

AnonymizeData Avro

Prep for Bulk Load HBase

Page 28: Simplifying Big Data with Apache Crunch

Easy integration between teams

Focus on processing steps

Shallow learning curve

Ability to tune for performance

Page 29: Simplifying Big Data with Apache Crunch

Apache CrunchCompose processing into pipelines

Open Source FlumeJava impl

Utilizes POJOs (hides serialization)

Transformation through fns (not job)

Page 30: Simplifying Big Data with Apache Crunch

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Page 31: Simplifying Big Data with Apache Crunch

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Processing Pipeline

Page 32: Simplifying Big Data with Apache Crunch

PipelineProgrammatic description of DAG

Supports lazy execution

MapReduce, Spark, Memory

Implementations indicate runtime

Page 33: Simplifying Big Data with Apache Crunch

Pipeline pipeline = MemPipeline.getIntance();

Pipeline pipeline = new MRPipeline(Driver.class, conf);

Pipeline pipeline = new SparkPipeline(sparkContext, “app”);

Page 34: Simplifying Big Data with Apache Crunch

SourceReads various inputs

At least one required per pipeline

Custom implementations

Creates initial collections for processing

Page 35: Simplifying Big Data with Apache Crunch

SourceSequence Files

Avro ParquetHBaseJDBCHFilesTextCSV

StringsAvroRecords

ResultsPOJOs

ProtobufsThrift

Writables

Page 36: Simplifying Big Data with Apache Crunch

pipeline.read(From.textFile(path));

Page 37: Simplifying Big Data with Apache Crunch

pipeline.read(new TextFileSource(path,ptype));

Page 38: Simplifying Big Data with Apache Crunch

PType<String> ptype = …;pipeline.read(new TextFileSource(path,ptype));

Page 39: Simplifying Big Data with Apache Crunch

PTypeHides serialization

Exposes data in native Java forms

Avro, Thrift, and Protocol Buffers

Supports composing complex types

Page 40: Simplifying Big Data with Apache Crunch

Multiple Serialization Types

Serialization Type = PTypeFamily

Can’t mix families in single type

Avro & Writable available

Can easily convert between families

Page 41: Simplifying Big Data with Apache Crunch

PType<Integer> intTypes = Writables.ints();

PType<String> stringType = Avros.strings();

PType<Person> personType = Avros.records(Person.class);

Page 42: Simplifying Big Data with Apache Crunch

PType<Pair<String, Person>> pairType = Avros.pairs(stringType, personType);

Page 43: Simplifying Big Data with Apache Crunch

PTableType<String, Person> tableType = Avros.tableOf(stringType,personType);

Page 44: Simplifying Big Data with Apache Crunch

PType<String> ptype = …;PCollection<String> strings = pipeline.read(

new TextFileSource(path, ptype));

Page 45: Simplifying Big Data with Apache Crunch

PCollectionImmutable

Not created only read or transformed

Unsorted

Represents potential data

Page 46: Simplifying Big Data with Apache Crunch

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Page 47: Simplifying Big Data with Apache Crunch

Process Reference

Data

PCollection<String> PCollection<RefData>

Page 48: Simplifying Big Data with Apache Crunch

DoFnSimple API to implement

Location for custom logic

Transforms PCollection between forms

Processes one element at a time

Page 49: Simplifying Big Data with Apache Crunch

For each item emits 0:M items

FilterFn - returns boolean

MapFn - emits 1:1

Page 50: Simplifying Big Data with Apache Crunch

DoFn API

class ExampleDoFn extends DoFn<String, RefData>{ ...} Type of Data In Type of Data Out

Page 51: Simplifying Big Data with Apache Crunch

public void process(String s, Emitter<RefData> emitter) { RefData data = …; emitter.emit(data);}

Type of Data In Type of Data Out

Page 52: Simplifying Big Data with Apache Crunch

PCollection<String> refStringsPCollection<RefData> refs = refStrings.parallelDo(fn, Avros.records(RefData.class));

Page 53: Simplifying Big Data with Apache Crunch

PCollection<String> dataStrs...PCollection<RefData> refs = dataStrs.parallelDo(diffFn, Avros.records(Data.class));

Page 54: Simplifying Big Data with Apache Crunch

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Page 55: Simplifying Big Data with Apache Crunch

Hmm now I need to join...

We need a PTable

But they don’t have a common key?

Page 56: Simplifying Big Data with Apache Crunch

PTable<K, V>Immutable & Unsorted

Variation PCollection<Pair<K, V>>

Multimap of Keys and Values

Joins, Cogroups, Group By Key

Page 57: Simplifying Big Data with Apache Crunch

class ExampleDoFn extends DoFn<String, RefData>{ ...}

Page 58: Simplifying Big Data with Apache Crunch

class ExampleDoFn extends DoFn<String, Pair<String, RefData>>{

...}

Page 59: Simplifying Big Data with Apache Crunch

PCollection<String> refStringsPTable<String, RefData> refs = refStrings.parallelDo(fn, Avros.tableOf(Avros.strings(), Avros.records(RefData.class)));

Page 60: Simplifying Big Data with Apache Crunch

PTable<String, RefData> refs…;PTable<String, Data> data…;

Page 61: Simplifying Big Data with Apache Crunch

data.join(refs);

(inner join)

Page 62: Simplifying Big Data with Apache Crunch

PTable<String, Pair<Data, RefData>> joinedData = data.join(refs);

Page 63: Simplifying Big Data with Apache Crunch

Joinsright, left, inner, outer

Mapside, BloomFilter, Sharded

Eliminates custom impls

Page 64: Simplifying Big Data with Apache Crunch

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Page 65: Simplifying Big Data with Apache Crunch

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Page 66: Simplifying Big Data with Apache Crunch

FilterFn API

class MyFilterFn extends FilterFn<...>{ ...} Type of Data In

Page 67: Simplifying Big Data with Apache Crunch

public boolean accept(... value){ return value > 3;}

Page 68: Simplifying Big Data with Apache Crunch

PCollection<Model> values = …;PCollection<Model> filtered = values.filter(new MyFilterFn());

Page 69: Simplifying Big Data with Apache Crunch

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Page 70: Simplifying Big Data with Apache Crunch

PTable<String,Model> models = …;

Keyed By PersonId

Page 71: Simplifying Big Data with Apache Crunch

PTable<String,Model> models = …;PGroupedTable<String, Model> groupedModels = models.groupByKey();

Page 72: Simplifying Big Data with Apache Crunch

PGroupedTable<K, V>Immutable & Sorted

PCollection<Pair<K, Iterable<V>>>

Page 73: Simplifying Big Data with Apache Crunch

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Page 74: Simplifying Big Data with Apache Crunch

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Page 75: Simplifying Big Data with Apache Crunch

PCollection<Person> persons = …;

Page 76: Simplifying Big Data with Apache Crunch

PCollection<Person> persons = …;pipeline.write(persons, To.avroFile(path));

Page 77: Simplifying Big Data with Apache Crunch

PCollection<Person> persons = …;pipeline.write(persons, new AvroFileTarget(path));

Page 78: Simplifying Big Data with Apache Crunch

TargetPersists PCollection

At least one required per pipeline

Custom implementations

Page 79: Simplifying Big Data with Apache Crunch

TargetSequence Files

Avro ParquetHBaseJDBCHFilesTextCSV

StringsAvroRecords

ResultsPOJOs

ProtobufsThrift

Writables

Page 80: Simplifying Big Data with Apache Crunch

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Page 81: Simplifying Big Data with Apache Crunch

Pipeline pipeline = …; ...pipeline.write(...);PipelineResult result = pipeline.done();

Execution

Page 82: Simplifying Big Data with Apache Crunch

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Map Reduce Reduce

Page 83: Simplifying Big Data with Apache Crunch

Tuning

GroupingOptions/ParallelDoOptions

Scale factors

Tweak pipeline for performance

Page 84: Simplifying Big Data with Apache Crunch

Focus on the transformations

Smaller learning curve

Functionality first

Less fragility

Page 85: Simplifying Big Data with Apache Crunch

Extend pipeline for new features

Iterate with confidence

Integration through PCollections