YARN webinar series: Using Scalding to write applications to Hadoop and YARN

DESCRIPTION

This webinar focuses on introducing Scalding for developers and writing applications for Hadoop and YARN using Scalding. Guest speaker Jonathan Coveney from Twitter provides an overview, use cases, limitations, and core concepts.

Citation preview

Page 1: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

Page 1 © Hortonworks Inc. 2014

Scalding YARN Webinar Series

September 18, 2014

Ajay Singh, Director - Hortonworks Jonathan Coveney, Senior Software Engineer - Twitter

Page 2: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

Agenda

Introduction: Ajay Singh, Hortonworks

Modern Data Architecture and how Cascading and Scalding fit in

Scalding: Jonathan Coveney, Twitter

Why Scalding?

Core Concepts and Limitations

Scalding at Twitter

Resources

Page 3: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

Speakers

Ajay Singh is Director of Technical Channels at Hortonworks and leads strategic alliances with partners from a technology standpoint, including driving alignment on roadmaps, product certifications and demos. Ajay is dedicated to building, scaling and delivering exceptional go-to-market solutions with partners.

Jonathan Coveney currently works at Twitter, where he has spent a lot of time maintaining and updating Scalding; in the past, he has worked extensively on Apache Pig. He is deeply interested in functional programming, as well as in developing usable, scalable APIs for data processing.

Page 4: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

A Modern Data Architecture

[Diagram: a modern data architecture]
•  APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
•  DATA SYSTEM
   –  REPOSITORIES: RDBMS, EDW, MPP
   –  ENTERPRISE HADOOP: Governance & Integration, Data Access, Data Management, Security, Operations
•  SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs), Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
•  OPERATIONAL TOOLS: Manage & Monitor
•  DEV & DATA TOOLS: Build & Test

Page 5: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

HDP 2.1: Enterprise Hadoop

HDP 2.1 Hortonworks Data Platform

[Diagram: components of the Hortonworks Data Platform]
•  GOVERNANCE & INTEGRATION
   –  Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, NFS, WebHDFS
•  DATA ACCESS
   –  Batch: MapReduce
   –  Script: Pig
   –  SQL: Hive/Tez, HCatalog
   –  NoSQL: HBase, Accumulo
   –  Stream: Storm
   –  Search: Solr
   –  Others: In-Memory Analytics, ISV engines
•  DATA MANAGEMENT
   –  YARN: Data Operating System
   –  HDFS (Hadoop Distributed File System)
•  SECURITY
   –  Authentication, Authorization, Accounting, Data Protection
   –  Storage: HDFS
   –  Resources: YARN
   –  Access: Hive, …
   –  Pipeline: Falcon
   –  Cluster: Knox
•  OPERATIONS
   –  Provision, Manage & Monitor: Ambari, Zookeeper
   –  Scheduling: Oozie
•  Deployment Choice: Linux, Windows, On-Premise, Cloud
•  Cascading

Page 6: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

Cascading SDK

HDP Integrates and delivers Cascading SDK
•  Collection of tools, documentation, libraries, tutorials and example projects
•  Key Benefits
   –  Simplified Development
   –  Multi Language Support
   –  Reuse existing skills and tools
   –  Native YARN Integration

Hortonworks delivers Enterprise support
•  Backed by Concurrent

Hortonworks and Concurrent Advance Enterprise Data Application Development on Hadoop

Page 7: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

HDP Integration of Cascading SDK

•  Write once and deploy on your fabric of choice
•  Integration with data processing layer allows Cascading to take advantage of advances in interactive applications
•  Sep 17th - Cascading 3.0 WIP Now Supports Apache Tez
   –  http://www.cascading.org/2014/09/17/cascading-3-0-wip-now-supports-apache-tez/

[Diagram: Cascading on HDP]
•  PRESENTATION & APPLICATION: Enable both existing and new applications to provide value to the organization
•  CURRENT: Java (Cascading), Scala (Scalding), SQL (Lingual), ML (Pattern) on Batch Data Processing (MapReduce)
•  WIP: Java (Cascading), Scala (Scalding), SQL (Lingual), ML (Pattern) on Interactive Data Processing (Tez)
•  Efficient Cluster Resource Management & Shared Services (YARN)

Page 8: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

Cascading.org Scalding Resources

Scalding Resources on Cascading.org •  Videos and Tutorials

•  Mailing List

•  Newsletter

Cascading 3.0 WIP With Tez Support

•  https://github.com/cwensel/cascading/tree/wip-3.0/cascading-hadoop2-tez

Scalding Training Debuts This Fall

•  In-person, 1-day class with labs

•  Email: [email protected]

Page 9: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

Jonathan Coveney Twitter

@jco

Page 10: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

Why Scalding?

Writing raw map reduce is difficult!
●  Scalding is
   o  Less verbose
   o  Less error prone (type checking!)
   o  Easier to evolve
   o  Performant enough

Page 11: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

But what about Hive and Pig?

●  Really good for certain things
   o  Excellent for quick, ad-hoc work
   o  Easy to understand
   o  Can leverage existing knowledge (i.e. SQL)
●  Not always the best for maintainability
   o  Composition isn’t great
   o  Testing is difficult
   o  Type safety is lacking

Page 12: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

So… Cascading?

●  Still pretty verbose!
●  But you can use normal Java tools
   o  Maven
   o  JUnit
   o  IDEs
●  Handles the low level details for you
●  A good target for higher level languages

Page 13: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

Scalding

●  Concise, expressive syntax
●  Testable
●  Abstractable
●  Composable

Because it’s in a full-featured, functional language!

Page 14: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

But Scala is scary!

●  Scalding doesn’t force you to use more complicated features
●  Can just write less-verbose Java if desired
●  Functional programming is an important paradigm, especially for big data

Learning new things is good for your brain :)

Page 15: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

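Example Scalding job

import com.twitter.scalding._

class Webinar(args: Args) extends Job(args) {
  import TDsl._

  TextLine(args("input"))                 // read the input as lines of text
    .flatMap { _.split("\\s+") }          // split each line into words
    .map { w => (w, 1L) }                 // pair every word with a count of 1
    .group                                // group by word (the key)
    .sum                                  // add up the counts per word
    .write(TypedTsv[(String, Long)](args("output")))
}

“Hadoop is a system for counting words” -Oscar Boykin, @posco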

Page 16: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

Core concepts

●  Source
   o  How to read or write data
●  TypedPipe[T]
   o  A distributed list of T
   o  Kind of like a Seq[T] in Scala’s collections library
●  Grouped[K, T]
   o  A grouping on K
   o  Represents transition to reduce phase
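Since a TypedPipe[T] is described as being kind of like a Seq[T], the shape of a word count can be tried on an ordinary in-memory Seq first. This is only a local sketch using the Scala standard library, not Scalding itself; the object and method names (LocalWordCount, wordCount) are made up for illustration, and groupBy merely plays the role that .group plays on a TypedPipe.

```scala
object LocalWordCount {
  // Same shape as a typed word count job, but on an in-memory Seq
  // instead of a distributed TypedPipe (assumption: illustration only).
  def wordCount(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(_.split("\\s+").toSeq)   // one line -> zero or more words
      .filter(_.nonEmpty)               // split can yield an empty token
      .map(w => (w, 1L))                // pair each word with a count of 1
      .groupBy(_._1)                    // stands in for .group (the reduce boundary)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // stands in for .sum
}
```

On a cluster the grouping step is where data moves from the map phase to the reduce phase; locally it is just a Map keyed by word.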

Page 17: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

Word Co-Occurrence

TextLine(args("input"))
  .flatMap { line =>
    val words = line.split("\\s+")
    // emit (w1, Map(w2 -> 1)) for every ordered pair of distinct words on the line
    for (w1 <- words; w2 <- words if (w1 != w2)) yield (w1, Map(w2 -> 1L))
  }
  .group[String, Map[String, Long]]
  .sum // maps for the same word are merged key by key, adding the counts
  .flatMap { case (word, wordMap) =>
    wordMap.map { case (otherWord, count) => (word, otherWord, count) }
  }
  .write(TypedTsv[(String, String, Long)](args("output")))

Page 18: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

Important concepts

Scalding leverages a lot of Scala idioms, as well as concepts from functional programming
●  map
   o  a 1 to 1 mapping for every piece of data
●  flatMap
   o  a 1 to 0 or more mapping for every piece of data
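The difference between map and flatMap is easiest to see on a plain Scala List. The snippet below is a small standard-library sketch; the object and value names are illustrative only.

```scala
object MapVsFlatMap {
  val lines: List[String] = List("hello world", "", "scalding")

  // map: exactly one output element per input element
  val lengths: List[Int] = lines.map(_.length)

  // flatMap: zero or more output elements per input element
  // (the empty line contributes nothing)
  val words: List[String] =
    lines.flatMap(_.split("\\s+").filter(_.nonEmpty).toList)
}
```

Here lengths keeps three elements, one per line, while words has three elements coming from only two of the lines.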

Page 19: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

Important concepts (continued)

●  Typeclasses
   o  The separation of computation from data types
   o  Think Java’s Comparator (but way more powerful)
   o  These are what power .sum
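To make the typeclass idea concrete, here is a minimal sketch in plain Scala. The Semigroup trait, its instances and sumByKey are local stand-ins written for this example (assumptions, not Algebird's or Scalding's actual API); they only show how "how to add two values" can live outside the data type, so that one generic sum works for both Long counts and Map[String, Long] co-occurrence maps.

```scala
object TypeclassSketch {
  // The typeclass: knows how to combine two values of type T,
  // without T itself having to know anything about it.
  trait Semigroup[T] {
    def plus(a: T, b: T): T
  }

  object Semigroup {
    // Longs are combined by addition.
    implicit val longSemigroup: Semigroup[Long] = new Semigroup[Long] {
      def plus(a: Long, b: Long): Long = a + b
    }

    // Maps are combined key by key, provided the values can be combined.
    implicit def mapSemigroup[K, V](implicit sv: Semigroup[V]): Semigroup[Map[K, V]] =
      new Semigroup[Map[K, V]] {
        def plus(a: Map[K, V], b: Map[K, V]): Map[K, V] =
          b.foldLeft(a) { case (acc, (k, v)) =>
            acc.updated(k, acc.get(k).map(sv.plus(_, v)).getOrElse(v))
          }
      }
  }

  // One generic "sum per key"; the compiler picks the instance for V.
  def sumByKey[K, V](pairs: Seq[(K, V)])(implicit sg: Semigroup[V]): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) =>
      (k, kvs.map(_._2).reduce((x, y) => sg.plus(x, y)))
    }
}
```

As with Comparator, the behaviour is supplied separately from the data; unlike Comparator, instances can be derived from other instances, as the Map instance is from the instance for its values.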

Page 20: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

Limitations

Scalding’s limitations are MapReduce’s limitations
●  Bad at iterative jobs
●  Lots of checkpointing, serialization, sorting

However...
●  Cascading on Tez could help!
   o  in progress as part of Cascading 3.0
●  So could Cascading on Spark!

Page 21: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

The cutting edge

●  REPL support
●  Executor[T]
   o  Decoupling TypedPipes from specifics of the execution engine
   o  Makes iterative algorithms much easier to express
●  Macros
   o  Allowing easier use of case classes
   o  Closure analysis?

Page 22: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

Scalding at Twitter

●  Thousands of users
   o  Engineers AND data scientists
●  Many thousands of jobs every day
   o  ETL
   o  Recommendations
   o  Email
   o  Time series analysis

When you use Twitter, you’re using features powered by Scalding!

Page 23: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

Useful practices

●  A standardized “Job” subclass with company-specific information
   o  Want the common case to be as simple as possible
   o  In particular, it should configure serialization for users
●  Separate data from functions on data
   o  At Twitter, this means Thrift for data, and various Scala functions operating on that data
   o  Decouples the specification of some data from the derived data people want based on it
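The second practice can be sketched in a few lines of plain Scala. Everything here is hypothetical: PageView is a made-up record, a case class stands in for a Thrift-generated type, and the one-second bounce threshold is an arbitrary assumption chosen for the example.

```scala
object DataVsFunctions {
  // The data: only the fields that were actually recorded.
  // (Hypothetical record; a case class stands in for a Thrift struct.)
  final case class PageView(userId: Long, url: String, durationMs: Long)

  // The functions: derived values live next to the data, not inside it,
  // so new derivations can be added without changing the record definition.
  object PageViewFunctions {
    def durationSeconds(pv: PageView): Double = pv.durationMs / 1000.0

    // Arbitrary example rule: a view shorter than one second is a bounce.
    def isBounce(pv: PageView): Boolean = pv.durationMs < 1000L
  }
}
```

Jobs would then map over records with these functions, and the record schema stays small and stable.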

Page 24: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

Q&A

Page 25: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

Contribute!

●  Scalding
●  Algebird
   o  Math inspired aggregators (.sum uses it)
●  Bijection
   o  Conversion and serialization made fun
●  Summingbird
   o  Abstraction for batch and online map/reduce (see resources for more)

Page 26: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

More resources

Scalding/Algebird
•  Oscar Boykin: Algebra for Scalable Analytics
•  Avi Bryant: Add ALL the Things
•  Oscar Boykin, Argyris Zymnis: Scalding: Powerful & Concise MapReduce

You might also be interested in…
•  Summingbird! Streaming real-time and batch analytics, unified and made beautiful
•  Oscar Boykin: Introduction to Summingbird
•  Oscar Boykin, Sam Ritchie, Ian O’Connell, Jimmy Lin: Summingbird, A Framework for Integrating Batch and Online MapReduce Computations

Page 27: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

Next Webinar – Oct 2 - Spark

Writing applications to Hadoop and YARN using Spark •  October 2nd at 9am Pacific Time

•  Register

Find all webinars

•  Hortonworks.com/webinars

Find past recorded webinars

•  Hortonworks.com/webinars/#library

Page 28: YARN webinar series: Using Scalding to write applications to Hadoop and YARN

Thank you!