
Page 1

Exploring Hadoop: Microsoft Azure Implementation
Jacob Saunders - Chicago SQL BI User Group, 7/24/2012

(Source: chicagobi.pass.org/Portals/147/presentations/201207_Hadoop_Slides.pdf)

Page 2

Agenda

What is Hadoop?

HDFS

MapReduce

What’s Next?

Resources

Demo

Page 3

What is Hadoop?

Page 4

What is Hadoop?

"A system for processing mind-bogglingly large amounts of data efficiently."
- Jakob Homan, Sr. Software Engineer at LinkedIn

Page 5

Hadoop in Plain English

Hadoop is based on large-scale distributed storage. It is not a relational database.

Hadoop was designed to process large amounts of complex data efficiently.

Hadoop is very good at organizing unstructured data.

Hadoop is open source: the Apache Software Foundation (http://hadoop.apache.org/) provides support for a community of open source projects (e.g. Pig, Hive, etc.).

Page 6

A Brief History of Hadoop

Hadoop has its origins in Google's need to store and perform analysis on the entire contents of the Internet.

Hadoop is based on Google's GFS and MapReduce papers.

The concept of problem partitioning is as old as LISP.

Yahoo was also involved early on with Hadoop, and Yahoo is the main contributor to Hadoop today.

Doug Cutting created the prototype, based on a Google white paper, as part of Nutch (another open source search engine that was part of the Apache project).

Yahoo hired Doug Cutting to take that prototype and expand on it.

Page 7

Why is it called Hadoop?

Hadoop was named by Doug Cutting, after his son's toy elephant!

Page 8

Who is using Hadoop today?

Adobe

CCC

Sears

Orbitz Worldwide

The New York Times

eHarmony

Last.fm

Fox Interactive Media

Rackspace

LinkedIn

Facebook

IBM

Twitter

Microsoft

To name a few ….

Page 9

Main Components of Hadoop

HDFS (Hadoop Distributed File System) - storage

Open source implementation of MapReduce - math/computation

Page 10

The Hadoop Ecosystem

Page 11

HDFS

Page 12

HDFS

HDFS is a distributed file system that runs on large clusters of servers (also referred to as nodes).

The file system strongly resembles the UNIX file system in its structure and uses familiar UNIX-style tools (e.g. ls, cat, mkdir, etc.).
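As an illustration of that UNIX-like feel, here is a minimal sketch that drives the HDFS shell from Python. It assumes the hadoop client is on the PATH and can reach a running cluster, and localfile.txt is just a placeholder name.

    # Sketch only: drive the HDFS shell from Python via `hadoop fs` commands.
    # Assumes the hadoop client is installed and points at a running cluster.
    import subprocess

    def hdfs(*args):
        """Run a `hadoop fs` subcommand and return its stdout as text."""
        result = subprocess.run(["hadoop", "fs", *args],
                                capture_output=True, text=True, check=True)
        return result.stdout

    hdfs("-mkdir", "-p", "/user/demo/input")                # create a directory
    hdfs("-put", "localfile.txt", "/user/demo/input/")      # copy a local file in (placeholder name)
    print(hdfs("-ls", "/user/demo/input"))                  # list it, UNIX-style
    print(hdfs("-cat", "/user/demo/input/localfile.txt"))   # read it back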

Page 13

HDFS: example screen print of the file system

Page 14

HDFS

Hadoop can read very big files (petabytes and exabytes). What the heck is an exabyte?

An exabyte is approximately 1,000 petabytes. Another way to look at it: an exabyte is approximately one quintillion bytes, or one billion gigabytes. There is not much to compare an exabyte to; it has been said that 5 exabytes would equal all of the words ever spoken by mankind.

HDFS reads a file and splits it up into 128 MB blocks (the size is configurable).

HDFS makes 3 copies of each block (the number of copies is configurable; the more copies you have, the safer the file will be).
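A back-of-the-envelope sketch of what that splitting and copying mean in practice, using the 128 MB block size and the replication factor of 3 mentioned above (simplified: the last, possibly partial block is counted as a full block):

    import math

    # Simplified view of HDFS block splitting and replication.
    BLOCK_SIZE_MB = 128   # configurable block size from the slide
    REPLICATION = 3       # configurable number of copies from the slide

    def storage_footprint(file_size_mb):
        """Return (number of blocks, raw MB consumed across the cluster)."""
        blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
        return blocks, blocks * BLOCK_SIZE_MB * REPLICATION

    blocks, raw_mb = storage_footprint(1_048_576)   # a 1 TB file
    print(blocks)   # 8192 blocks
    print(raw_mb)   # 3,145,728 MB, i.e. roughly 3 TB once all three replicas are placed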

Page 15

How HDFS Stores Data

Distributed environment: the replicas of the data are placed on the nodes that comprise the cluster.

A datanode is a server on the cluster.

Page 16

Hadoop Nodes

Namenode (or headnode)

The Namenode does NOT store blocks of data. It keeps track of where the blocks are mapped on the other nodes (servers).
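A toy illustration of the bookkeeping the Namenode does; this is purely conceptual (the node names and block IDs are invented), since the real metadata lives in the Namenode's own structures, not a Python dict:

    # Toy model: block IDs mapped to the datanodes holding their replicas.
    # The Namenode stores this kind of mapping; it never stores the blocks themselves.
    block_map = {
        "blk_0001": ["datanode-1", "datanode-2", "datanode-3"],
        "blk_0002": ["datanode-2", "datanode-4", "datanode-5"],
    }

    def locate(block_id):
        """Ask the 'namenode' which datanodes hold a block's replicas."""
        return block_map.get(block_id, [])

    print(locate("blk_0001"))   # ['datanode-1', 'datanode-2', 'datanode-3']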

Page 17

MapReduce

Page 18

MapReduce

The Map function takes a large dataset and partitions it into several smaller intermediate datasets that can be processed in parallel by different nodes in a cluster.

The Reduce function then takes the separate results of each computation and aggregates them to form the final output.

MapReduce can be leveraged to perform operations such as sorting and statistical analysis on large datasets, which may be mapped into smaller partitions and processed in parallel.
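A minimal in-memory sketch of that model, using the classic word-count example; it only mimics the programming model on one machine, whereas a real job distributes the map work across the cluster:

    from collections import defaultdict

    # Word count as map + shuffle + reduce, run in memory to illustrate the model.
    def map_phase(chunk):
        # Emit one (word, 1) pair per word in this partition of the input.
        return [(word, 1) for word in chunk.split()]

    def reduce_phase(key, values):
        # Aggregate all the counts observed for a single word.
        return key, sum(values)

    partitions = ["hadoop stores big data", "hadoop processes big data in parallel"]

    grouped = defaultdict(list)          # the "shuffle": group emitted values by key
    for chunk in partitions:             # on a cluster, each chunk maps on its own node
        for key, value in map_phase(chunk):
            grouped[key].append(value)

    print(dict(reduce_phase(k, v) for k, v in grouped.items()))
    # {'hadoop': 2, 'stores': 1, 'big': 2, 'data': 2, 'processes': 1, 'in': 1, 'parallel': 1}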

Page 19

MapReduce

Data is treated as Keys & Values

Examples: <Key, value>

<Byte Offset, some text>

<Userid, User Profile> (used by social networks)

<timestamp, Access log entry> (used for log analysis)

<users, List of User’s friends> (used by social networks)

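To make those pairings concrete, a few hypothetical records written as Python tuples (every field value here is invented for illustration):

    # Hypothetical <key, value> records of the kinds listed above.
    records = [
        (0, "first line of a text file"),                      # <byte offset, some text>
        ("user_4711", {"name": "Ada", "interests": ["BI"]}),   # <user id, user profile>
        ("2012-07-24 19:00:00", "GET /index.html 200"),        # <timestamp, access log entry>
        ("user_4711", ["user_0815", "user_1234"]),             # <user, list of user's friends>
    ]
    for key, value in records:
        print(key, "->", value)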

Page 20

Steps to Write a MapReduce Job

Write a mapper that takes a key and a value and emits zero or more keys and values.

Write a reducer that takes all the values of one key and emits zero or more new keys and values.
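One common way to follow exactly those two steps from Python is Hadoop Streaming, where the mapper and reducer each read lines from stdin and write tab-separated key/value lines to stdout. The sketch below folds both roles into one runnable file and feeds it a tiny in-memory sample; in a real streaming job the mapper and reducer would be two separate scripts handed to the streaming jar, and the framework would do the sort/shuffle between them:

    import sys
    from itertools import groupby

    # Hadoop Streaming-style word count, condensed into one illustrative script.

    def mapper(lines):
        # Step 1: for each input line, emit zero or more (word, 1) pairs.
        for line in lines:
            for word in line.split():
                yield word, 1

    def reducer(pairs):
        # Step 2: take all the values for one key (delivered sorted by key)
        # and emit a single (word, total) pair.
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            yield word, sum(count for _, count in group)

    if __name__ == "__main__":
        sample = ["big data is big", "hadoop handles big data"]
        mapped = sorted(mapper(sample))   # the framework normally does this sort/shuffle
        for word, total in reducer(mapped):
            sys.stdout.write(f"{word}\t{total}\n")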

Page 21

Map Reduce Task Tracker

Page 22

Map Reduce Task Tracker

The calculation runs on the node where the data block resides.

Page 23

Map Reduce Task Tracker

The map reducer is run on many nodes in parallel because Hadoop is a distributed environment. The diagram to the right illustrates this.

Page 24

Map Reduce Task Tracker Output

Page 25

Map Reduce Task Tracker Output

Page 26

Map Reduce Task Tracker Output

The Reducer processes the results from all nodes. The occurrence of each key ("a" in the output above) is calculated, and the number of occurrences becomes the value (1, 5, 3, etc.).

Page 27

Map Reduce doesn't always "Map": new languages/projects that can be used instead of writing raw Map Reduce

Benefit: the projects below do not require you to think in the <key, value> paradigm.

Pig - a data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.

Hive - a distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL, which the runtime engine translates into MapReduce jobs for querying the data (see the sketch after this list).

HBase - a distributed, column-oriented database. HBase uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries (random reads).

ZooKeeper - a distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building applications.

Sqoop - a tool for efficiently moving data between relational databases and HDFS.
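As a flavor of the Hive item above, a minimal sketch that hands a SQL-like query to the Hive command-line client, which compiles it into MapReduce jobs behind the scenes. The hive client on the PATH and the access_log table are assumptions for illustration, not something the slides provide:

    import subprocess

    # Hand a HiveQL statement to the Hive CLI; Hive turns it into MapReduce jobs.
    # Assumes the `hive` client is installed and a table named access_log exists.
    query = """
        SELECT userid, COUNT(*) AS hits
        FROM access_log
        GROUP BY userid
    """
    result = subprocess.run(["hive", "-e", query], capture_output=True, text=True)
    print(result.stdout)   # tab-separated rows, one per userid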

Page 28

What’s Next?

Page 29

What’s next?

Google's Percolator for incremental indexing
By dealing only with new, modified, or deleted documents and using secondary indices to efficiently catalog and query the resulting output, Percolator can dramatically increase performance on rapidly changing datasets.

Dremel for ad hoc analytics
MapReduce (and thereby Hadoop) is purpose-built for organized data processing (jobs), not ad hoc exploration. Google invented Dremel (now exposed as the BigQuery product) as a purpose-built tool to allow analysts to scan over petabytes of data in seconds to answer ad hoc queries and power compelling visualizations.

Page 30

What’s next?

Pregel for analyzing graph data
Google MapReduce was purpose-built for crawling and analyzing the world's largest graph data structure - the internet. However, MapReduce is not good at analyzing networks of people, telecom equipment, documents and other graph structures. Therefore, Google built Pregel, a large bulk synchronous processing application for petabyte-scale graph processing on distributed commodity machines.

LINQ-to-Hadoop
Microsoft has refocused its DryadLINQ efforts towards Hadoop and will be offering the ability to query Hadoop using this friendly and modern language.

Page 31

Can I Get It?

Page 32

Yes, but…

Microsoft Azure Hadoop is open to the public, but you must respond to the survey on the Microsoft Connect site:
https://connect.microsoft.com/SQLServer/Survey/Survey.aspx?SurveyID=13697

If you just want to play around with Hadoop, Cloudera has a fully configured VM you can download and run:
https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM

Page 33

Resources

Page 35

Resources Used for this Hadoop Research

Hadoop: The Definitive Guide, O'Reilly | Yahoo! Press, by Tom White

Apache Hadoop - Petabytes and Terawatts, a discussion led by Jakob Homan, Software Engineer at LinkedIn (YouTube):
http://www.youtube.com/watch?v=SS27F-hYWfU&feature=endscreen

What is Hadoop? (and other big data terms like MapReduce), an interview with Mike Olson of Cloudera (YouTube):
http://www.youtube.com/watch?v=S9xnYBVqLws

Big Data University courses:
http://bigdatauniversity.com/courses/

Teradata Aster: The Differences Between Aster and Hadoop:
http://www.asterdata.com/blog/2008/09/06/differences-between-aster-and-hadoop/

Why the days are numbered for Hadoop as we know it:
http://gigaom.com/cloud/why-the-days-are-numbered-for-hadoop-as-we-know-it/

Page 36

Additional Hadoop Resources

Hadoop on Azure

https://www.hadooponazure.com/

Magenic: Why I am Excited about SQL Server 2012 (Part 2)

http://magenic.com/Blog/WhyIAmExcitedaboutSQLServer2012Part2.aspx

Apache Hadoop Home Page

http://hadoop.apache.org/

Running Apache Pig (Pig Latin) at Apache Hadoop on Windows Azure

http://blogs.msdn.com/b/avkashchauhan/archive/2012/01/10/running-apache-pig-pig-latin-at-apache-hadoop-on-windows-azure.aspx

Practical Problem Solving with Apache Hadoop Pig

http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig

Fluent Queries on the Interactive JavaScript Console

http://social.technet.microsoft.com/wiki/contents/articles/7183.fluent-queries-on-the-interactive-javascript-console.aspx

Hive vs. Pig

http://www.larsgeorge.com/2009/10/hive-vs-pig.html

Hive DDL Reference

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

Page 37

Demo

Page 38

Agenda

Methods of accessing Azure Hadoop

Loading files into HDFS

Pig from the Interactive Console

Hive

Excel Hive Add-In

Sqoop