Writing Hadoop Jobs in Scala using Scalding


Talk that I gave at #BcnDevCon13 about using Scalding and the strong points of using Scala for Big Data processing.


Writing Hadoop Jobs in Scala using Scalding @tonicebrian

How much storage can $100 buy you?

1980: 1 photo
1990: 5 songs
2000: 7 movies
2010: 600 movies, 170,000 songs, 5 million photos

From single drives… to clusters…

Data Science

“A mathematician is a device for turning coffee into theorems”
Alfréd Rényi

A data scientist is a device for turning coffee and data into insights.

Map Reduce (Program Model) + Distributed File System (Storage) = Hadoop

Word Count

Raw → Map → Reduce → Result

Raw:
Hello cruel world
Say hello! Hello!

Map (count the words within each line):
hello 1, cruel 1, world 1
say 1, hello 2

Reduce → Result (sum the counts for each word):
hello 3
cruel 1
world 1
say 1
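The same three stages can be sketched with plain Scala collections (an illustrative snippet, not Hadoop code; names and values are just the slide's example):

val raw = List("Hello cruel world", "Say hello! Hello!")

// Map: count the words within each line
val mapped = raw.flatMap { line =>
  line.toLowerCase.replaceAll("[^a-z0-9\\s]", "")
    .split("\\s+")
    .groupBy(identity)
    .map { case (word, occurrences) => (word, occurrences.length) }
}
// mapped contains (hello,1), (cruel,1), (world,1), (say,1), (hello,2)

// Reduce: sum the per-line counts for each word
val result = mapped.groupBy(_._1)
  .map { case (word, counts) => (word, counts.map(_._2).sum) }
// result: Map(hello -> 3, cruel -> 1, world -> 1, say -> 1)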

4 Main Characteristics of Scala

• JVM
• Statically Typed
• Object Oriented
• Functional Programming

def map[B](f: (A) ⇒ B): List[B]
Builds a new collection by applying a function to all elements of this list.

def reduce[A1 >: A](op: (A1, A1) ⇒ A1): A1
Reduces the elements of this list using the specified associative binary operator.
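Applied to an ordinary List, the two operations behave like this (illustrative values):

List(1, 2, 3, 4).map(x => x * 2)          // List(2, 4, 6, 8)
List(1, 2, 3, 4).reduce((a, b) => a + b)  // 10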

Recap

• Map/Reduce: a programming paradigm that employs concepts from Functional Programming
• Scala: a functional language that runs on the JVM
• Hadoop: an open-source implementation of Map/Reduce on the JVM

So in what language is Hadoop implemented?

The Result?

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}


High level approaches

• SQL
• Data Transformations

Pig

input_lines = LOAD 'myfile.txt' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

User defined functions (UDF)

-- myscript.pig
REGISTER myudfs.jar;
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;

Java

package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class UPPER extends EvalFunc<String> {
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0)
      return null;
    try {
      String str = (String) input.get(0);
      return str.toUpperCase();
    } catch (Exception e) {
      throw WrappedIOException.wrap("Caught exception processing input row ", e);
    }
  }
}

WordCount in Cascading

package impatient;

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexFilter;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.Scheme;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class Main {
  public static void main( String[] args ) {
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // specify a regex operation to split the "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.writeDOT( "dot/wc.dot" );
    wcFlow.complete();
  }
}

Good parts

• Data Flow Programming Model
• User Defined Functions

Bad

• Still Java
• Objects for Flows

package com.twitter.scalding.examples

import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )

  // Split a piece of text into individual words.
  def tokenize(text : String) : Array[String] = {
    // Lowercase each word and remove punctuation.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}
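Scalding also offers a type-safe API over the same model (the TypedPipe that shows up later in this talk); a minimal sketch of the same word count written with it would look roughly like this:

import com.twitter.scalding._

// Sketch only: the same job expressed with TypedPipe instead of the fields API.
class TypedWordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap { line => line.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+") }
    .groupBy { word => word }   // key every word by itself
    .size                       // (word, count) pairs
    .write(TypedTsv[(String, Long)](args("output")))
}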

TDD Cycle

Red → Green → Refactor

Broader view

Unit Testing → Acceptance Testing → Continuous Deployment → … Lean Startup

Big Data Big Speed

A typical day working with Hadoop

Is Scalding of any help here?

0 Size of code
1 Types
2 Unit Testing
3 Local execution

1 Types

An extra cycle

Compilation Phase → Unit Testing → Acceptance Testing → Continuous Deployment → Lean Startup

Static type-checking makes you a better programmer™

Fail-fast with type errors

(Int,Int,Int,Int) vs. TypedPipe[(Meters,Miles,Celsius,Fahrenheit)]

val w = 5
val x = 5
val y = 5
val z = 5

w + x + y + z = 20

val w = Meters(5)
val x = Miles(5)
val y = Celsius(5)
val z = Fahrenheit(5)

w + x + y + z => type error
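A minimal sketch of how such unit types could be declared in plain Scala (class and method names here are illustrative, not from the talk):

// Hypothetical unit wrappers: each is a distinct type, so mixing them
// no longer compiles even though every one wraps a plain Int.
case class Meters(value: Int) extends AnyVal {
  def +(other: Meters): Meters = Meters(value + other.value)
}
case class Miles(value: Int) extends AnyVal {
  def +(other: Miles): Miles = Miles(value + other.value)
}
case class Celsius(value: Int) extends AnyVal
case class Fahrenheit(value: Int) extends AnyVal

val w = Meters(5)
val x = Miles(5)
// w + x                  // does not compile: Meters.+ expects Meters, not Miles
// Meters(5) + Meters(5)  // fine: Meters(10)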

2 Unit Testing

How do you test a distributed algorithm without a distributed platform?

Source

Tap

// Scalding
import com.twitter.scalding._

class WordCountTest extends Specification with TupleConversions {
  "A WordCount job" should {
    JobTest("com.snowplowanalytics.hadoop.scalding.WordCountJob").
      arg("input", "inputFile").
      arg("output", "outputFile").
      source(TextLine("inputFile"), List("0" -> "hack hack hack and hack")).
      sink[(String,Int)](Tsv("outputFile")) { outputBuffer =>
        val outMap = outputBuffer.toMap
        "count words correctly" in {
          outMap("hack") must be_==(4)
          outMap("and") must be_==(1)
        }
      }.
      run.
      finish
  }
}

3 Local Execution

HDFS

Local

> run-main com.twitter.scalding.Tool MyJob --local

> run-main com.twitter.scalding.Tool MyJob --hdfs
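The job's own arguments go on the same command line after the mode flag, for example (the paths here are made up):

> run-main com.twitter.scalding.Tool MyJob --local --input data/docs.txt --output target/wordcount.tsv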

SBT as a REPL

More Scalding goodness

• Algebird
• Matrix library
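Algebird provides abstract-algebra type classes (Semigroup, Monoid, and friends) that Scalding can reuse when aggregating. A tiny standalone sketch of the idea:

import com.twitter.algebird._
import com.twitter.algebird.Operators._  // adds + for any type with a Semigroup

// Maps combine value-wise because Map[K, V] forms a Monoid when V does
val counts = Map("hack" -> 3, "and" -> 1) + Map("hack" -> 1, "scala" -> 2)
// counts == Map("hack" -> 4, "and" -> 1, "scala" -> 2)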

Be functional

Questions?
