Writing Hadoop Jobs in Scala using Scalding


Talk that I gave at #BcnDevCon13 about using Scalding and the strong points of using Scala for Big Data processing.


Writing Hadoop Jobs in Scala using Scalding @tonicebrian

How much storage can $100 buy you?

1980: 1 photo
1990: 5 songs
2000: 7 movies
2010: 600 movies, 170,000 songs, 5 million photos

From single drives… to clusters…

Data Science

“A mathematician is a device for turning coffee into theorems”
Alfréd Rényi

A data scientist is a device for turning coffee and data into insights.

Map Reduce (Program Model) + Distributed File System (Storage) = Hadoop

Word Count

Raw → Map → Reduce → Result

Raw:
Hello cruel world
Say hello! Hello!

Map (count the words within each line):
hello 1, cruel 1, world 1
say 1, hello 2

Reduce → Result (sum the counts for each word):
hello 3
cruel 1
world 1
say 1
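The same three stages can be sketched with plain Scala collections (an illustrative snippet, not Hadoop code; names and values are just the slide's example):

val raw = List("Hello cruel world", "Say hello! Hello!")

// Map: count the words within each line
val mapped = raw.flatMap { line =>
  line.toLowerCase.replaceAll("[^a-z0-9\\s]", "")
    .split("\\s+")
    .groupBy(identity)
    .map { case (word, occurrences) => (word, occurrences.length) }
}
// mapped contains (hello,1), (cruel,1), (world,1), (say,1), (hello,2)

// Reduce: sum the per-line counts for each word
val result = mapped.groupBy(_._1)
  .map { case (word, counts) => (word, counts.map(_._2).sum) }
// result: Map(hello -> 3, cruel -> 1, world -> 1, say -> 1)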

4 Main Characteristics of Scala

• JVM
• Statically Typed
• Object Oriented
• Functional Programming

def map[B](f: (A) ⇒ B): List[B]
Builds a new collection by applying a function to all elements of this list.

def reduce[A1 >: A](op: (A1, A1) ⇒ A1): A1
Reduces the elements of this list using the specified associative binary operator.
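Applied to an ordinary List, the two operations behave like this (illustrative values):

List(1, 2, 3, 4).map(x => x * 2)          // List(2, 4, 6, 8)
List(1, 2, 3, 4).reduce((a, b) => a + b)  // 10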

Recap

• Map/Reduce: a programming paradigm that employs concepts from Functional Programming
• Scala: a functional language that runs on the JVM
• Hadoop: an open-source implementation of Map/Reduce on the JVM

So in what language is Hadoop implemented?

The Result?

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}


High level approaches

• SQL
• Data Transformations

Pig

input_lines = LOAD 'myfile.txt' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

User defined functions (UDF)

-- myscript.pig
REGISTER myudfs.jar;
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;

Java

package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class UPPER extends EvalFunc<String> {
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0)
      return null;
    try {
      String str = (String) input.get(0);
      return str.toUpperCase();
    } catch (Exception e) {
      throw WrappedIOException.wrap("Caught exception processing input row ", e);
    }
  }
}

WordCount in Cascading

package impatient;

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexFilter;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.Scheme;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class Main {
  public static void main( String[] args ) {
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // specify a regex operation to split the "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.writeDOT( "dot/wc.dot" );
    wcFlow.complete();
  }
}

Good parts

• Data Flow Programming Model
• User Defined Functions

Bad

• Still Java
• Objects for Flows

package com.twitter.scalding.examples

import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )

  // Split a piece of text into individual words.
  def tokenize(text : String) : Array[String] = {
    // Lowercase each word and remove punctuation.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}
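Scalding also offers a type-safe API over the same model (the TypedPipe that shows up later in this talk); a minimal sketch of the same word count written with it would look roughly like this:

import com.twitter.scalding._

// Sketch only: the same job expressed with TypedPipe instead of the fields API.
class TypedWordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap { line => line.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+") }
    .groupBy { word => word }   // key every word by itself
    .size                       // (word, count) pairs
    .write(TypedTsv[(String, Long)](args("output")))
}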

TDD Cycle

Red → Green → Refactor

Broader view

Unit Testing → Acceptance Testing → Continuous Deployment → … Lean Startup

Big Data Big Speed

A typical day working with Hadoop

Is Scalding of any help here?

0 Size of code
1 Types
2 Unit Testing
3 Local execution

1 Types

An extra cycle

Compilation Phase → Unit Testing → Acceptance Testing → Continuous Deployment → Lean Startup

Static type-checking makes you a better programmer™

Fail-fast with type errors

(Int,Int,Int,Int) vs. TypedPipe[(Meters,Miles,Celsius,Fahrenheit)]

val w = 5
val x = 5
val y = 5
val z = 5

w + x + y + z = 20

val w = Meters(5)
val x = Miles(5)
val y = Celsius(5)
val z = Fahrenheit(5)

w + x + y + z => type error
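A minimal sketch of how such unit types could be declared in plain Scala (class and method names here are illustrative, not from the talk):

// Hypothetical unit wrappers: each is a distinct type, so mixing them
// no longer compiles even though every one wraps a plain Int.
case class Meters(value: Int) extends AnyVal {
  def +(other: Meters): Meters = Meters(value + other.value)
}
case class Miles(value: Int) extends AnyVal {
  def +(other: Miles): Miles = Miles(value + other.value)
}
case class Celsius(value: Int) extends AnyVal
case class Fahrenheit(value: Int) extends AnyVal

val w = Meters(5)
val x = Miles(5)
// w + x                  // does not compile: Meters.+ expects Meters, not Miles
// Meters(5) + Meters(5)  // fine: Meters(10)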

2 Unit Testing

How do you test a distributed algorithm without a distributed platform?

Source

Tap

// Scalding
import com.twitter.scalding._

class WordCountTest extends Specification with TupleConversions {
  "A WordCount job" should {
    JobTest("com.snowplowanalytics.hadoop.scalding.WordCountJob").
      arg("input", "inputFile").
      arg("output", "outputFile").
      source(TextLine("inputFile"), List("0" -> "hack hack hack and hack")).
      sink[(String,Int)](Tsv("outputFile")) { outputBuffer =>
        val outMap = outputBuffer.toMap
        "count words correctly" in {
          outMap("hack") must be_==(4)
          outMap("and") must be_==(1)
        }
      }.
      run.
      finish
  }
}

3 Local Execution

HDFS

Local

> run-main com.twitter.scalding.Tool MyJob --local

> run-main com.twitter.scalding.Tool MyJob --hdfs
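The job's own arguments go on the same command line after the mode flag, for example (the paths here are made up):

> run-main com.twitter.scalding.Tool MyJob --local --input data/docs.txt --output target/wordcount.tsv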

SBT as a REPL

More Scalding goodness

• Algebird
• Matrix library
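Algebird provides abstract-algebra type classes (Semigroup, Monoid, and friends) that Scalding can reuse when aggregating. A tiny standalone sketch of the idea:

import com.twitter.algebird._
import com.twitter.algebird.Operators._  // adds + for any type with a Semigroup

// Maps combine value-wise because Map[K, V] forms a Monoid when V does
val counts = Map("hack" -> 3, "and" -> 1) + Map("hack" -> 1, "scala" -> 2)
// counts == Map("hack" -> 4, "and" -> 1, "scala" -> 2)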

Be functional

Questions?
