
Page 1

Exploring Hadoop: Microsoft Azure Implementation
Jacob Saunders - Chicago SQL BI User Group, 7/24/2012

(Source: chicagobi.pass.org/Portals/147/presentations/201207_Hadoop_Slides.pdf)

Page 2

Agenda

What is Hadoop?

HDFS

MapReduce

What’s Next?

Resources

Demo

Page 3

What is Hadoop?

Page 4

What is Hadoop?

"A system for processing mind-bogglingly large amounts of data efficiently."
- Jakob Homan, Sr. Software Engineer at LinkedIn

Page 5

Hadoop in Plain English

Hadoop is based on large-scale distributed storage. It is not a relational database.

Hadoop was designed to process large amounts of complex data efficiently.

Hadoop is very good at organizing unstructured data.

Hadoop is open source: the Apache Software Foundation (http://hadoop.apache.org/) provides support for a community of open source projects (e.g. Pig, Hive, etc.).

Page 6

A Brief History of Hadoop

Hadoop has its origins in Google's need to store and perform analysis on the entire contents of the Internet.

Hadoop is based on Google's GFS and MapReduce papers.

The concept of problem partitioning is as old as LISP.

Yahoo was also involved early on with Hadoop, and Yahoo is the main contributor to Hadoop today.

Doug Cutting created the prototype, based on a Google white paper, as part of Nutch (another open source search engine that was part of the Apache project).

Yahoo hired Doug Cutting to take that prototype and expand on it.

Page 7

Why is it called Hadoop?

Hadoop was named by Doug Cutting, after his son's toy elephant!

Page 8

Who is using Hadoop today?

Adobe

CCC

Sears

Orbitz Worldwide

The New York Times

eHarmony

Last.fm

Fox Interactive Media

Rackspace

LinkedIn

Facebook

IBM

Twitter

Microsoft

To name a few ….

Page 9

Main Components of Hadoop

HDFS (Hadoop Distributed File System) - storage

Open source implementation of MapReduce - math/computation

Page 10

The Hadoop Ecosystem

Page 11

HDFS

Page 12

HDFS

HDFS is a distributed file system that runs on large clusters of servers (also referred to as nodes).

The file system strongly resembles the UNIX file system in its structure and uses familiar UNIX-style tools (e.g. ls, cat, mkdir, etc.).
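As an illustration of that UNIX-like feel, here is a minimal sketch that drives the HDFS shell from Python. It assumes the hadoop client is on the PATH and can reach a running cluster, and localfile.txt is just a placeholder name.

    # Sketch only: drive the HDFS shell from Python via `hadoop fs` commands.
    # Assumes the hadoop client is installed and points at a running cluster.
    import subprocess

    def hdfs(*args):
        """Run a `hadoop fs` subcommand and return its stdout as text."""
        result = subprocess.run(["hadoop", "fs", *args],
                                capture_output=True, text=True, check=True)
        return result.stdout

    hdfs("-mkdir", "-p", "/user/demo/input")                # create a directory
    hdfs("-put", "localfile.txt", "/user/demo/input/")      # copy a local file in (placeholder name)
    print(hdfs("-ls", "/user/demo/input"))                  # list it, UNIX-style
    print(hdfs("-cat", "/user/demo/input/localfile.txt"))   # read it back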

Page 13

HDFS: example screen print of the file system

Page 14

HDFS

Hadoop can read very big files (petabytes and exabytes). What the heck is an exabyte?

An exabyte is approximately 1,000 petabytes. Another way to look at it: an exabyte is approximately one quintillion bytes, or one billion gigabytes. There is not much to compare an exabyte to; it has been said that 5 exabytes would equal all of the words ever spoken by mankind.

HDFS reads a file and splits it up into 128 MB blocks (the size is configurable).

HDFS makes 3 copies of each block (the number of copies is configurable; the more copies you have, the safer the file will be).
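A back-of-the-envelope sketch of what that splitting and copying mean in practice, using the 128 MB block size and the replication factor of 3 mentioned above (simplified: the last, possibly partial block is counted as a full block):

    import math

    # Simplified view of HDFS block splitting and replication.
    BLOCK_SIZE_MB = 128   # configurable block size from the slide
    REPLICATION = 3       # configurable number of copies from the slide

    def storage_footprint(file_size_mb):
        """Return (number of blocks, raw MB consumed across the cluster)."""
        blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
        return blocks, blocks * BLOCK_SIZE_MB * REPLICATION

    blocks, raw_mb = storage_footprint(1_048_576)   # a 1 TB file
    print(blocks)   # 8192 blocks
    print(raw_mb)   # 3,145,728 MB, i.e. roughly 3 TB once all three replicas are placed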

Page 15

How HDFS Stores Data

Distributed environment: the replicas of the data are placed on the nodes that comprise the cluster.

A datanode is a server on the cluster.

Page 16

Hadoop Nodes

Namenode (or headnode)

The Namenode does NOT store blocks of data. It keeps track of where the blocks are mapped on the other nodes (servers).
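A toy illustration of the bookkeeping the Namenode does; this is purely conceptual (the node names and block IDs are invented), since the real metadata lives in the Namenode's own structures, not a Python dict:

    # Toy model: block IDs mapped to the datanodes holding their replicas.
    # The Namenode stores this kind of mapping; it never stores the blocks themselves.
    block_map = {
        "blk_0001": ["datanode-1", "datanode-2", "datanode-3"],
        "blk_0002": ["datanode-2", "datanode-4", "datanode-5"],
    }

    def locate(block_id):
        """Ask the 'namenode' which datanodes hold a block's replicas."""
        return block_map.get(block_id, [])

    print(locate("blk_0001"))   # ['datanode-1', 'datanode-2', 'datanode-3']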

Page 17

MapReduce

Page 18

MapReduce

The Map function takes a large dataset and partitions it into several smaller intermediate datasets that can be processed in parallel by different nodes in a cluster.

The Reduce function then takes the separate results of each computation and aggregates them to form the final output.

MapReduce can be leveraged to perform operations such as sorting and statistical analysis on large datasets, which may be mapped into smaller partitions and processed in parallel.
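A minimal in-memory sketch of that model, using the classic word-count example; it only mimics the programming model on one machine, whereas a real job distributes the map work across the cluster:

    from collections import defaultdict

    # Word count as map + shuffle + reduce, run in memory to illustrate the model.
    def map_phase(chunk):
        # Emit one (word, 1) pair per word in this partition of the input.
        return [(word, 1) for word in chunk.split()]

    def reduce_phase(key, values):
        # Aggregate all the counts observed for a single word.
        return key, sum(values)

    partitions = ["hadoop stores big data", "hadoop processes big data in parallel"]

    grouped = defaultdict(list)          # the "shuffle": group emitted values by key
    for chunk in partitions:             # on a cluster, each chunk maps on its own node
        for key, value in map_phase(chunk):
            grouped[key].append(value)

    print(dict(reduce_phase(k, v) for k, v in grouped.items()))
    # {'hadoop': 2, 'stores': 1, 'big': 2, 'data': 2, 'processes': 1, 'in': 1, 'parallel': 1}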

Page 19

MapReduce

Data is treated as Keys & Values

Examples: <Key, value>

<Byte Offset, some text>

<Userid, User Profile> (used by social networks)

<timestamp, Access log entry> (used for log analysis)

<users, List of User’s friends> (used by social networks)

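To make those pairings concrete, a few hypothetical records written as Python tuples (every field value here is invented for illustration):

    # Hypothetical <key, value> records of the kinds listed above.
    records = [
        (0, "first line of a text file"),                      # <byte offset, some text>
        ("user_4711", {"name": "Ada", "interests": ["BI"]}),   # <user id, user profile>
        ("2012-07-24 19:00:00", "GET /index.html 200"),        # <timestamp, access log entry>
        ("user_4711", ["user_0815", "user_1234"]),             # <user, list of user's friends>
    ]
    for key, value in records:
        print(key, "->", value)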

Page 20

Steps to Write a MapReduce Job

Write a mapper that takes a key and a value and emits zero or more keys and values.

Write a reducer that takes all the values of one key and emits zero or more new keys and values.
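One common way to follow exactly those two steps from Python is Hadoop Streaming, where the mapper and reducer each read lines from stdin and write tab-separated key/value lines to stdout. The sketch below folds both roles into one runnable file and feeds it a tiny in-memory sample; in a real streaming job the mapper and reducer would be two separate scripts handed to the streaming jar, and the framework would do the sort/shuffle between them:

    import sys
    from itertools import groupby

    # Hadoop Streaming-style word count, condensed into one illustrative script.

    def mapper(lines):
        # Step 1: for each input line, emit zero or more (word, 1) pairs.
        for line in lines:
            for word in line.split():
                yield word, 1

    def reducer(pairs):
        # Step 2: take all the values for one key (delivered sorted by key)
        # and emit a single (word, total) pair.
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            yield word, sum(count for _, count in group)

    if __name__ == "__main__":
        sample = ["big data is big", "hadoop handles big data"]
        mapped = sorted(mapper(sample))   # the framework normally does this sort/shuffle
        for word, total in reducer(mapped):
            sys.stdout.write(f"{word}\t{total}\n")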

Page 21

Map Reduce Task Tracker

Page 22

Map Reduce Task Tracker

The calculation runs on the node where the data block resides.

Page 23

Map Reduce Task Tracker

The map reducer is run on many nodes in parallel because Hadoop is a distributed environment. The diagram to the right illustrates this.

Page 24

Map Reduce Task Tracker Output

Page 25

Map Reduce Task Tracker Output

Page 26

Map Reduce Task Tracker Output

The Reducer processes the results from all nodes. The occurrence of each key ("a" in the output above) is calculated, and the number of occurrences becomes the value (1, 5, 3, etc.).

Page 27

Map Reduce doesn't always "Map": new languages/projects that can be used instead of writing raw Map Reduce

Benefit: the projects below do not require you to think in the <key, value> paradigm.

Pig - a data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.

Hive - a distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL, which the runtime engine translates into MapReduce jobs for querying the data (see the sketch after this list).

HBase - a distributed, column-oriented database. HBase uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries (random reads).

ZooKeeper - a distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building applications.

Sqoop - a tool for efficiently moving data between relational databases and HDFS.
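As a flavor of the Hive item above, a minimal sketch that hands a SQL-like query to the Hive command-line client, which compiles it into MapReduce jobs behind the scenes. The hive client on the PATH and the access_log table are assumptions for illustration, not something the slides provide:

    import subprocess

    # Hand a HiveQL statement to the Hive CLI; Hive turns it into MapReduce jobs.
    # Assumes the `hive` client is installed and a table named access_log exists.
    query = """
        SELECT userid, COUNT(*) AS hits
        FROM access_log
        GROUP BY userid
    """
    result = subprocess.run(["hive", "-e", query], capture_output=True, text=True)
    print(result.stdout)   # tab-separated rows, one per userid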

Page 28

What’s Next?

Page 29

What’s next?

Google's Percolator for incremental indexing
By dealing only with new, modified, or deleted documents and using secondary indices to efficiently catalog and query the resulting output, Percolator can dramatically increase performance on rapidly changing datasets.

Dremel for ad hoc analytics
MapReduce (and thereby Hadoop) is purpose-built for organized data processing (jobs), not ad hoc exploration. Google invented Dremel (now exposed as the BigQuery product) as a purpose-built tool to allow analysts to scan over petabytes of data in seconds to answer ad hoc queries and power compelling visualizations.

Page 30

What’s next?

Pregel for analyzing graph data
Google MapReduce was purpose-built for crawling and analyzing the world's largest graph data structure - the internet. However, MapReduce is not good at analyzing networks of people, telecom equipment, documents and other graph structures. Therefore, Google built Pregel, a large bulk synchronous processing application for petabyte-scale graph processing on distributed commodity machines.

LINQ-to-Hadoop
Microsoft has refocused its DryadLINQ efforts towards Hadoop and will be offering the ability to query Hadoop using this friendly and modern language.

Page 31

Can I Get It?

Page 32

Yes, but…

Microsoft Azure Hadoop is open to the public, but you must respond to the survey on the Microsoft Connect site:
https://connect.microsoft.com/SQLServer/Survey/Survey.aspx?SurveyID=13697

If you just want to play around with Hadoop, Cloudera has a fully configured VM you can download and run:
https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM

Page 33

Resources

Page 35

Resources Used for this Hadoop Research

Hadoop: The Definitive Guide, O'Reilly | Yahoo! Press, by Tom White

Apache Hadoop - Petabytes and Terawatts, a discussion led by Jakob Homan, Software Engineer at LinkedIn (YouTube):
http://www.youtube.com/watch?v=SS27F-hYWfU&feature=endscreen

What is Hadoop? (and other big data terms like MapReduce), an interview with Mike Olson of Cloudera (YouTube):
http://www.youtube.com/watch?v=S9xnYBVqLws

Big Data University courses:
http://bigdatauniversity.com/courses/

Teradata Aster: The Differences Between Aster and Hadoop:
http://www.asterdata.com/blog/2008/09/06/differences-between-aster-and-hadoop/

Why the days are numbered for Hadoop as we know it:
http://gigaom.com/cloud/why-the-days-are-numbered-for-hadoop-as-we-know-it/

Page 36

Additional Hadoop Resources

Hadoop on Azure

https://www.hadooponazure.com/

Magenic: Why I am Excited about SQL Server 2012 (Part 2)

http://magenic.com/Blog/WhyIAmExcitedaboutSQLServer2012Part2.aspx

Apache Hadoop Home Page

http://hadoop.apache.org/

Running Apache Pig (Pig Latin) at Apache Hadoop on Windows Azure

http://blogs.msdn.com/b/avkashchauhan/archive/2012/01/10/running-apache-pig-pig-latin-at-apache-hadoop-on-windows-azure.aspx

Practical Problem Solving with Apache Hadoop Pig

http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig

Fluent Queries on the Interactive JavaScript Console

http://social.technet.microsoft.com/wiki/contents/articles/7183.fluent-queries-on-the-interactive-javascript-console.aspx

Hive vs. Pig

http://www.larsgeorge.com/2009/10/hive-vs-pig.html

Hive DDL Reference

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

Page 37

Demo

Page 38

Agenda

Methods of accessing Azure Hadoop

Loading files into HDFS

Pig from the Interactive Console

Hive

Excel Hive Add-In

Sqoop