Workshop: From Zero to _ (Budapest DW Forum 2014)

From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

DESCRIPTION

Hadoop is everywhere these days, but it can seem like a complex, intimidating ecosystem to those who have yet to jump in. In this hands-on workshop, Alex Dean, co-founder of Snowplow Analytics, will take you “from zero to Hadoop”, showing you how to run a variety of simple (but powerful) Hadoop jobs on Elastic MapReduce, Amazon’s hosted Hadoop service. Alex will start with a no-nonsense overview of what Hadoop is, explaining its strengths and weaknesses and why it’s such a powerful platform for data warehouse practitioners. Then Alex will help get you set up with EMR and Amazon S3, before leading you through a very simple job in Pig, a simple language for writing Hadoop jobs. After this we will move on to writing a more advanced job in Scalding, Twitter’s Scala API for writing Hadoop jobs. For our final job, we will consolidate everything we have learnt by building a more sophisticated job in Scalding.

Page 1: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Workshop: From Zero to _

Budapest DW Forum 2014

Page 2: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Agenda today

1. Some setup before we start

2. (Back to the) introduction

3. Our workshop today

4. Part 1: a simple Pig Latin job on EMR

5. Part 2: a simple Scalding job on EMR

6. Part 3: a more complex Scalding job on EMR

Page 3: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Some setup before we start

Page 4: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

There is a lot to copy and paste – so let’s all join a Google Hangout chat

• http://bit.ly/1xgSQId

• If I forget to paste some content into the chat room, just shout out and remind me

Page 5: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

First, let’s all download and setup Virtualbox and Vagrant

http://docs.vagrantup.com/v2/installation/index.html

https://www.virtualbox.org/wiki/Downloads

Page 6: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Now let’s set up our development environment

$ vagrant plugin install vagrant-vbguest

If you have git already installed:

$ git clone --recursive https://github.com/snowplow/dev-environment.git

If not:

$ wget -O dev-environment.zip https://github.com/snowplow/dev-environment/archive/temp.zip

$ unzip dev-environment.zip

$ wget -O ansible-playbooks.zip https://github.com/snowplow/ansible-playbooks/archive/temp.zip

$ unzip ansible-playbooks.zip

Page 7: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Now let’s set up our development environment

$ cd dev-environment

$ vagrant up

$ vagrant ssh

Page 8: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Final step for now, let’s install some software

$ ansible-playbook /vagrant/ansible-playbooks/aws-tools.yml --inventory-file=/home/vagrant/ansible_hosts --connection=local

$ ansible-playbook /vagrant/ansible-playbooks/scala-sbt.yml --inventory-file=/home/vagrant/ansible_hosts --connection=local

Page 9: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

(Back to the) introduction

Page 10: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Snowplow is an open-source web and event analytics platform, built on Hadoop

• Co-founders Alex Dean and Yali Sassoon met at OpenX, the open-source ad technology business, in 2008

• We released Snowplow as a skunkworks prototype at the start of 2012:

github.com/snowplow/snowplow

• We built Snowplow on top of Hadoop from the very start

Page 11: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

We wanted to take a fresh approach to web analytics

• Your own web event data -> in your own data warehouse

• Your own event data model

• Slice / dice and mine the data in highly bespoke ways to answer your specific business questions

• Plug in the broadest possible set of analysis tools to drive value from your data

[Diagram: data pipeline -> data warehouse -> analyse your data in any analysis tool]

Page 12: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

And we saw the potential of new “big data” technologies and services to solve these problems in a scalable, low-cost manner

These tools make it possible to capture, transform, store and analyse all your granular, event-level data, so you can perform any analysis

[Logos: CloudFront, Amazon S3, Amazon EMR, Amazon Redshift]

Page 13: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Our Snowplow event processing flow runs on Hadoop, specifically Amazon’s Elastic MapReduce hosted Hadoop service

[Diagram of the Snowplow Hadoop data pipeline: website / webapp with JavaScript event tracker, CloudFront-based or Clojure-based event collector, Scalding-based enrichment on Hadoop, Amazon S3, and Amazon Redshift / PostgreSQL]

Page 14: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Why did we pick Hadoop?

• Scalability: we have customers processing 350m Snowplow events a day in Hadoop – runs in <2 hours

• Easy to reprocess data: if business rules change, we can fire up a large cluster and re-process all historical raw Snowplow events

• Highly testable: we write unit and integration tests for our jobs and run them locally, giving us confidence that our jobs will run correctly at scale on Hadoop

Page 15: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

And why Amazon’s Elastic MapReduce (EMR)?

• No need to run our own cluster: running your own Hadoop cluster is a huge pain – not for the fainthearted. By contrast, EMR just works (most of the time!)

• Elastic: Snowplow runs as a nightly (sometimes more frequent) batch job. We spin up the EMR cluster to run the job, and shut it down straight after

• Interop with other AWS services: EMR works really well with Amazon S3 as a file store. We are big fans of Amazon Redshift (hosted columnar database) too

Page 16: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Our workshop today

Page 17: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Hadoop is complicated…

Page 18: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

… for our workshop today, we will stick to using Elastic MapReduce and try to avoid any unnecessary complexity

Page 19: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

… and we will learn by doing!

• Lots of books and articles about Hadoop and the theory of MapReduce

• We will learn by doing – no theory unless it’s required to directly explain the jobs we are creating

• Our priority is to get you up-and-running on Elastic MapReduce, and confident enough to write your own Hadoop jobs

Page 20: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Part 1: a simple Pig Latin job on EMR

Page 21: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

What is Pig (Latin)?

• Pig is a high-level platform for creating MapReduce jobs which can run on Hadoop

• The language you write Pig jobs in is called Pig Latin

• For quick-and-dirty scripts, Pig just works

[Diagram of the Hadoop stack: Hadoop DFS and Hadoop MapReduce at the base, with Java, Crunch, Hive, Pig and Cascading layered on top]

Page 22: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Let’s all come up with a unique name for ourselves

• Lowercase letters, no spaces or hyphens or anything

• E.g. I will be alexsnowplow – please come up with a unique name for yourself!

• It will be visible to other participants so choose something you don’t mind being public

• In the rest of this workshop, wherever you see YOURNAME, replace it with your unique name

Page 23: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Let’s restart our Vagrant and do some setup

$ mkdir zero2hadoop

$ aws configure

// And type in:

AWS Access Key ID [None]: AKIAILD6DCBTFI642JPQ

AWS Secret Access Key [None]: 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

Default region name [None]: eu-west-1

Default output format [None]:

Page 24: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Let’s create some buckets in Amazon S3 – this is where our data and our apps will live

$ aws s3 mb s3://zero2hadoop-in-YOURNAME

$ aws s3 mb s3://zero2hadoop-out-YOURNAME

$ aws s3 mb s3://zero2hadoop-jobs-YOURNAME

// Check those worked

$ aws s3 ls

Page 25: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Let’s get some source data uploaded

$ mkdir -p ~/zero2hadoop/part1/in

$ cd ~/zero2hadoop/part1/in

$ wget https://raw.githubusercontent.com/snowplow/scalding-example-project/master/data/hello.txt

$ cat hello.txt

Hello world

Goodbye world

$ aws s3 cp hello.txt s3://zero2hadoop-in-YOURNAME/part1/hello.txt

Page 26: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Let’s get our EMR command-line tools installed (1/2)

$ /vagrant/emr-cli/elastic-mapreduce

$ rvm install ruby-1.8.7-head

$ rvm use 1.8.7

$ alias emr=/vagrant/emr-cli/elastic-mapreduce

Page 27: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Let’s get our EMR command-line tools installed (2/2)

Add this file:

{

"access_id": "AKIAI55OSYYRLYWLXH7A",

"private_key": "SHRXNIBRdfWuLPbCt57ZVjf+NMKUjm9WTknDHPTP",

"region": "eu-west-1"

}

to: /vagrant/emr-cli/credentials.json

(If EMR rejects requests because the VM’s clock has drifted, sync it first: sudo sntp -s 24.56.178.140)

Page 28: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Let’s get our EMR command-line tools installed (2/2)

// This should work fine now:

$ emr --list

<no output>

Page 29: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Let’s do some local file work

$ mkdir -p ~/zero2hadoop/part1/pig

$ cd ~/zero2hadoop/part1/pig

$ wget https://gist.githubusercontent.com/alexanderdean/d8371cebdf00064591ae/raw/cb3030a6c48b85d101e296ccf27331384df3288d/wordcount.pig

// The original gist: https://gist.github.com/alexanderdean/d8371cebdf00064591ae

Page 30: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Now upload to S3

$ aws s3 cp wordcount.pig s3://zero2hadoop-jobs-YOURNAME/part1/

$ aws s3 ls --recursive s3://zero2hadoop-jobs-YOURNAME/part1/

2014-06-06 09:10:31 674 part1/wordcount.pig

Page 31: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

And now we run our Pig script

$ emr --create --name "part1 YOURNAME" \

--set-visible-to-all-users true \

--pig-script s3n://zero2hadoop-jobs-YOURNAME/part1/wordcount.pig \

--ami-version 2.0 \

--args "-p,INPUT=s3n://zero2hadoop-in-YOURNAME/part1, \

-p,OUTPUT=s3n://zero2hadoop-out-YOURNAME/part1"

Page 32: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Let’s check out the jobs running in Elastic MapReduce – first at the console

$ emr --list

j-1HR90SWPP40M4 STARTING part1 YOURNAME

PENDING Setup Pig

PENDING Run Pig Script

Page 33: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

and also in the UI

Page 34: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Okay let’s check the output of our job! (1/2)

$ aws s3 ls --recursive s3://zero2hadoop-out-YOURNAME/part1

2014-06-06 09:57:53 0 part1/_SUCCESS

2014-06-06 09:57:50 26 part1/part-r-00000

Page 35: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Okay let’s check the output of our job! (2/2)

$ mkdir -p ~/zero2hadoop/part1/out

$ cd ~/zero2hadoop/part1/out

$ aws s3 cp --recursive s3://zero2hadoop-out-YOURNAME/part1 .

$ ls

part-r-00000 _SUCCESS

$ cat part-r-00000

2 world

1 Hello

1 Goodbye

Page 36: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Part 2: a simple Scalding job on EMR

Page 37: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

What is Scalding?

• Scalding is a Scala API over Cascading, the Java framework for building data processing pipelines on Hadoop:

[Diagram of the Hadoop stack: Hadoop DFS and Hadoop MapReduce at the base, Java, Cascading, Pig etc. above them, and Scalding, Cascalog, PyCascading and cascading.jruby on top of Cascading]

Page 38: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Cascading has a “plumbing” abstraction over vanilla MapReduce which should feel quite familiar to DW practitioners

Page 39: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Scalding improves further on Cascading by reducing boilerplate and making more complex pipelines easier to express

• Scalding written in Scala – reduces a lot of boilerplate versus vanilla Cascading. Easier to look at a job in its entirety and see what it does

• Scalding created and supported by Twitter, who use it throughout their organization

• We believe that data pipelines should be as strongly typed as possible – all the other DSLs/APIs on top of Cascading encourage dynamic typing

Page 40: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Strongly typed data pipelines – why?

• Catch errors as soon as possible – and report them in a strongly typed way too

• Define the inputs and outputs of each of your data processing steps in an unambiguous way

• Forces you to formally address the data types flowing through your system

• Lets you write code like this:
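As a purely illustrative sketch (not taken from the deck) of what a strongly typed pipeline looks like in Scalding’s Typed API – the PageView case class, the field names and TopPagesJob are hypothetical, invented for this example:

import com.twitter.scalding._

// Hypothetical event type: the schema of each record is declared up front
case class PageView(userId: String, url: String, durationMs: Int)

class TopPagesJob(args: Args) extends Job(args) {

  // Column types are declared at the source, not guessed at runtime
  TypedPipe.from(TypedTsv[(String, String, Int)](args("input")))
    .map { case (userId, url, durationMs) => PageView(userId, url, durationMs) }
    .groupBy(_.url)   // the key type (String) is checked at compile time
    .size             // count page views per URL
    .toTypedPipe
    .write(TypedTsv[(String, Long)](args("output")))
}

If the input columns change type, this job fails to compile rather than producing bad output at runtime.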

Page 41: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Okay let’s get started!

• Head to https://github.com/snowplow/scalding-example-project

Page 42: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Let’s get this code down locally and build it

$ mkdir -p ~/zero2hadoop/part2

$ cd ~/zero2hadoop/part2

$ git clone git://github.com/snowplow/scalding-example-project.git

$ cd scalding-example-project

$ sbt assembly

Page 43: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Here is our MapReduce code
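Roughly, the WordCountJob in scalding-example-project looks like the sketch below, based on Scalding’s Fields API (see the repository for the exact source):

package com.snowplowanalytics.hadoop.scalding

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {

  // Read each line of the input, split it into words,
  // count the occurrences of each word and write the counts out as TSV
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))

  // Lowercase the text, strip punctuation and split on whitespace
  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
}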

Page 44: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Good, tests are passing, now let’s upload this to S3 so it’s available to our EMR job

$ aws s3 cp target/scala-2.10/scalding-example-project-0.0.5.jar s3://zero2hadoop-jobs-YOURNAME/part2/

// If that doesn’t work:

$ aws s3 cp s3://snowplow-hosted-assets/third-party/scalding-example-project-0.0.5.jar s3://zero2hadoop-jobs-YOURNAME/part2/

$ aws s3 ls s3://zero2hadoop-jobs-YOURNAME/part2/

Page 45: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

And now we run it!

$ emr --create --name "part2 YOURNAME" \

--set-visible-to-all-users true \

--jar s3n://zero2hadoop-jobs-YOURNAME/part2/scalding-example-project-0.0.5.jar \

--arg com.snowplowanalytics.hadoop.scalding.WordCountJob \

--arg --hdfs \

--arg --input --arg s3n://zero2hadoop-in-YOURNAME/part1/hello.txt \

--arg --output --arg s3n://zero2hadoop-out-YOURNAME/part2

Page 46: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Let’s check out the jobs running in Elastic MapReduce – first at the console

$ emr --list

j-1M62IGREPL7I STARTING scalding-example-project

PENDING Example Jar Step

Page 47: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

and also in the UI

Page 48: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Okay let’s check the output of our job!

$ aws s3 ls --recursive s3://zero2hadoop-out-YOURNAME/part2

$ mkdir -p ~/zero2hadoop/part2/out

$ cd ~/zero2hadoop/part2/out

$ aws s3 cp --recursive s3://zero2hadoop-out-YOURNAME/part2 .

$ ls

$ cat part-00000

goodbye 1

hello 1

world 2

Page 49: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Part 3: a more complex Scalding job on EMR

Page 50: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Let’s explore another tutorial together

https://github.com/sharethrough/scalding-emr-tutorial

Page 51: From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Amazon Elastic MapReduce

Questions?

http://snowplowanalytics.com

https://github.com/snowplow/snowplow

@snowplowdata

To talk offline – @alexcrdean on Twitter or [email protected]