Simplifying Big Data with Apache Crunch


Micah Whitacre (@mkwhit)

Semantic Chart Search

Cloud Based EMR

Medical Alerting System

Population Health Management

The problem moves from scaling the architecture...

...to scaling the knowledge


Battling the 3 V’s

Daily, weekly, monthly uploads

2+ TB of streaming data daily

60+ different data formats

Constant streams for near real time

Normalize Data

Apply Algorithms

Load Data for Displays

HBase, Solr, Vertica

Avro, CSV, Vertica, HBase

Population Health

[Pipeline diagram: two CSV inputs feed Process Reference Data and Process Raw Person Data; results combine in Process Raw Data using Reference → Filter Out Invalid Data → Group Data By Person → Create Person Record → Avro output]

[Diagram: the same flow forced into a single Mapper and Reducer]


Struggle to fit into single MapReduce job

Integration done through persistence

Custom implementations of common patterns

Evolving Requirements

[Pipeline diagram: the flow extended with new steps: Anonymize Data → Avro, and Prep for Bulk Load → HBase]

Easy integration between teams

Focus on processing steps

Shallow learning curve

Ability to tune for performance

Apache Crunch

Compose processing into pipelines

Open source FlumeJava implementation

Utilizes POJOs (hides serialization)

Transformation through functions, not jobs
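To give the API's flavor before walking through the pieces, here is the canonical word count sketch (mirroring Crunch's getting-started example; the class name and paths are illustrative):

Pipeline pipeline = new MRPipeline(WordCount.class, new Configuration());
PCollection<String> lines = pipeline.readTextFile("/in/docs");
// Split each line into words (0:M emits), then count occurrences per word.
PTable<String, Long> counts = lines.parallelDo(new DoFn<String, String>() {
  @Override
  public void process(String line, Emitter<String> emitter) {
    for (String word : line.split("\\s+")) {
      emitter.emit(word);
    }
  }
}, Writables.strings()).count();
pipeline.writeTextFile(counts, "/out/counts");
pipeline.done();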

[Pipeline diagram recap]

Processing Pipeline

Pipeline

Programmatic description of a DAG

Supports lazy execution

Implementations indicate runtime: MapReduce, Spark, Memory

Pipeline pipeline = MemPipeline.getInstance();

Pipeline pipeline = new MRPipeline(Driver.class, conf);

Pipeline pipeline = new SparkPipeline(sparkContext, "app");

Source

Reads various inputs

At least one required per pipeline

Custom implementations supported

Creates initial collections for processing

Source formats: Sequence Files, Avro, Parquet, HBase, JDBC, HFiles, Text, CSV

Data forms: Strings, Avro Records, Results, POJOs, Protobufs, Thrift, Writables

pipeline.read(From.textFile(path));

PType<String> ptype = ...;
pipeline.read(new TextFileSource(path, ptype));

PType

Hides serialization

Exposes data in native Java forms

Avro, Thrift, and Protocol Buffers supported

Supports composing complex types

Multiple Serialization Types

Serialization type = PTypeFamily

Avro & Writable families available

Can't mix families in a single type

Can easily convert between families

PType<Integer> intTypes = Writables.ints();

PType<String> stringType = Avros.strings();

PType<Person> personType = Avros.records(Person.class);

PType<Pair<String, Person>> pairType = Avros.pairs(stringType, personType);

PTableType<String, Person> tableType = Avros.tableOf(stringType, personType);
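On "can easily convert between families": a sketch assuming PTypeFamily's as() helper, which re-expresses a PType in another family (verify the method against your Crunch version):

PType<String> avroString = Avros.strings();
// Re-express the same logical type in the Writable family (assumed helper).
PType<String> writableString = WritableTypeFamily.getInstance().as(avroString);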

PType<String> ptype = ...;
PCollection<String> strings = pipeline.read(new TextFileSource(path, ptype));

PCollection

Immutable

Unsorted

Not created directly; only read from a source or transformed from another collection

Represents potential data (may not be materialized yet)
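Because a PCollection only represents potential data, chaining transformations merely builds up the DAG; a short sketch of that laziness (upperFn stands in for any hypothetical DoFn):

PCollection<String> strings = pipeline.read(From.textFile(path));
PCollection<String> upper = strings.parallelDo(upperFn, Avros.strings());  // returns a new immutable collection
// Nothing has executed yet; work happens when the pipeline runs (e.g., pipeline.done()).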

[Pipeline diagram: Process Reference Data highlighted]

Process Reference Data: PCollection<String> → PCollection<RefData>

DoFn

Simple API to implement

Location for custom logic

Transforms a PCollection between forms

Processes one element at a time

For each input emits 0:M outputs

FilterFn - returns a boolean

MapFn - emits 1:1

DoFn API

class ExampleDoFn extends DoFn<String, RefData> {  // DoFn<Type of Data In, Type of Data Out>
  public void process(String s, Emitter<RefData> emitter) {
    RefData data = ...;
    emitter.emit(data);
  }
}

PCollection<String> refStrings = ...;
PCollection<RefData> refs = refStrings.parallelDo(fn, Avros.records(RefData.class));

PCollection<String> dataStrs = ...;
PCollection<Data> data = dataStrs.parallelDo(diffFn, Avros.records(Data.class));
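MapFn is mentioned above but never shown; a minimal sketch of its 1:1 contract (the trim logic is illustrative):

// MapFn's single abstract method is map(input) -> output, exactly one output per input.
MapFn<String, String> trimFn = new MapFn<String, String>() {
  @Override
  public String map(String input) {
    return input.trim();
  }
};
PCollection<String> trimmed = refStrings.parallelDo(trimFn, Avros.strings());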

[Pipeline diagram: Process Raw Data using Reference highlighted]

Hmm, now I need to join...

We need a PTable

But they don’t have a common key?

PTable<K, V>

Immutable & unsorted

A variation of PCollection<Pair<K, V>>

Multimap of keys and values

Supports joins, cogroups, and group by key

class ExampleDoFn extends DoFn<String, RefData> { ... }

becomes:

class ExampleDoFn extends DoFn<String, Pair<String, RefData>> { ... }

PCollection<String> refStrings = ...;
PTable<String, RefData> refs = refStrings.parallelDo(fn,
    Avros.tableOf(Avros.strings(), Avros.records(RefData.class)));

PTable<String, RefData> refs = ...;
PTable<String, Data> data = ...;

data.join(refs);  // inner join by default

PTable<String, Pair<Data, RefData>> joinedData = data.join(refs);

Joins

Right, left, inner, outer

Mapside, BloomFilter, Sharded strategies

Eliminates custom implementations
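The strategy variants map to Crunch's JoinStrategy API in org.apache.crunch.lib.join; a sketch, treating the exact factory method as an assumption for your Crunch version:

// Mapside join loads the smaller (right) side into memory on each mapper.
JoinStrategy<String, Data, RefData> strategy = MapsideJoinStrategy.create();
PTable<String, Pair<Data, RefData>> joined = strategy.join(data, refs, JoinType.INNER_JOIN);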

[Pipeline diagram: Filter Out Invalid Data highlighted]

FilterFn API

class MyFilterFn extends FilterFn<...> {  // FilterFn<Type of Data In>
  public boolean accept(... value) {
    return value > 3;
  }
}

PCollection<Model> values = ...;
PCollection<Model> filtered = values.filter(new MyFilterFn());
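A concrete version for the "Filter Out Invalid Data" step; the Model type and its getPersonId() accessor are hypothetical:

class ValidModelFn extends FilterFn<Model> {
  @Override
  public boolean accept(Model value) {
    return value.getPersonId() != null;  // drop records missing the person key
  }
}

PCollection<Model> valid = values.filter(new ValidModelFn());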

[Pipeline diagram: Group Data By Person highlighted]

PTable<String, Model> models = ...;  // keyed by personId

PGroupedTable<String, Model> groupedModels = models.groupByKey();

PGroupedTable<K, V>

Immutable & sorted

A PCollection<Pair<K, Iterable<V>>>
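With the grouped table in hand, the "Create Person Record" step can be sketched as a DoFn over Pair<K, Iterable<V>>; the Person type, its constructor, and the merge logic are hypothetical:

class CreatePersonFn extends DoFn<Pair<String, Iterable<Model>>, Person> {
  @Override
  public void process(Pair<String, Iterable<Model>> grouped, Emitter<Person> emitter) {
    Person person = new Person(grouped.first());  // hypothetical constructor keyed by personId
    for (Model model : grouped.second()) {
      person.merge(model);                        // hypothetical merge of each model's fields
    }
    emitter.emit(person);
  }
}

PCollection<Person> persons = groupedModels.parallelDo(new CreatePersonFn(), Avros.records(Person.class));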

[Pipeline diagram: Create Person Record highlighted]

PCollection<Person> persons = ...;

pipeline.write(persons, To.avroFile(path));

pipeline.write(persons, new AvroFileTarget(path));

Target

Persists a PCollection

At least one required per pipeline

Custom implementations supported

Target formats: Sequence Files, Avro, Parquet, HBase, JDBC, HFiles, Text, CSV

Data forms: Strings, Avro Records, Results, POJOs, Protobufs, Thrift, Writables
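A sketch of fanning one collection out to multiple targets; the paths are hypothetical, and the WriteMode overload is assumed available on PCollection.write in your Crunch version:

persons.write(To.avroFile("/out/persons"));
// Second target for the same data, overwriting any previous output.
persons.write(new AvroFileTarget(new Path("/out/persons_backup")), Target.WriteMode.OVERWRITE);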

[Pipeline diagram: Avro output target highlighted]

Pipeline pipeline = ...;
...
pipeline.write(...);
PipelineResult result = pipeline.done();

Execution

[Pipeline diagram: the planner splits the flow into MapReduce phases: Map → Reduce → Reduce]
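Putting the pieces together, a simplified end-to-end driver might look like this (ParseModelFn, ExtractPersonIdFn, ValidModelFn, and CreatePersonFn are the hypothetical fns sketched earlier; the reference-data join is omitted for brevity):

// imports: org.apache.crunch.*; org.apache.crunch.impl.mr.MRPipeline;
// org.apache.crunch.io.From; org.apache.crunch.io.To; org.apache.crunch.types.avro.Avros;
// org.apache.hadoop.conf.*; org.apache.hadoop.util.*
public class Driver extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(Driver.class, getConf());

    // Parse raw lines into models, drop invalid ones, and key by person id.
    PCollection<Model> models = pipeline.read(From.textFile(args[0]))
        .parallelDo(new ParseModelFn(), Avros.records(Model.class));
    PTable<String, Model> byPerson = models
        .filter(new ValidModelFn())
        .by(new ExtractPersonIdFn(), Avros.strings());

    // Group per person and collapse each group into a single record.
    PCollection<Person> persons = byPerson.groupByKey()
        .parallelDo(new CreatePersonFn(), Avros.records(Person.class));

    pipeline.write(persons, To.avroFile(args[1]));

    // The lazy DAG is planned and executed here.
    PipelineResult result = pipeline.done();
    return result.succeeded() ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new Driver(), args));
  }
}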

Tuning

GroupingOptions / ParallelDoOptions

Scale factors

Tweak the pipeline for performance
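A sketch of the GroupingOptions hook; the reducer count is purely illustrative:

GroupingOptions opts = GroupingOptions.builder()
    .numReducers(20)   // force a specific reducer count for this shuffle
    .build();
PGroupedTable<String, Model> grouped = models.groupByKey(opts);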

Focus on the transformations

Smaller learning curve

Functionality first

Less fragility

Extend pipeline for new features

Iterate with confidence

Integration through PCollections

Recommended