An introduction to Hadoop presentation geared towards educating potential clients on Hadoop's capabilities.
Object Partners Inc.
Introduction to Hadoop
Presented by: Joel Crabb
Demo by: Nick Adelman
Agenda
Ø Terminology
Ø Why does Hadoop exist?
Ø HDFS and Hbase
Ø Examples
Ø Getting Started
Ø Demo
Terminology
Ø Hadoop
– Core set of technologies, hosted by the Apache Foundation, for storing and searching data sets in the terabyte and petabyte range
Ø HDFS
– Hadoop Distributed File System, used as the basis for all Hadoop technologies
Ø Hbase
– Distributed map-based database which uses HDFS as its underlying data store
Ø Map Reduce
– A framework for programming distributed parallel processing algorithms
Terminology
Ø Distributed Computing
– A computing paradigm that parallelizes computations over multiple compute nodes in order to decrease overall processing time
Ø NOSQL
– Programming paradigm which does not use a relational database as the backend data store
Ø Big Data
– Generic term used when working with large data sets
Ø Name Node
– Server that knows the location of all files in the cluster
Enterprise Architecture 101
[Diagram: data sources feed into HDFS; Map Reduce jobs move data between HDFS, Hbase, and an RDBMS]
The New System Constraint
Ø Hard disk seek time is the new constraint when working with a petabyte data set
– Spread the seek time among multiple servers
– Isolate the data to a single read per disk
– It is faster to read too much data sequentially on disk and discard the excess
Ø Working under this paradigm requires New Tools
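As a rough illustration of the constraint (assuming figures typical of commodity disks of the era, roughly 10 ms per seek and 100 MB/s sequential transfer): reading 1 GB sequentially takes about 10 seconds, while fetching that same gigabyte as 100,000 random 10 KB reads costs on the order of 1,000 seconds in seek time alone. That hundredfold gap is why these tools favor large sequential scans over random access.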
New Tools: Why does Hadoop exist?
Ø In the early 2000s Google had three problems:
Ø Problem 1: Store terabytes and petabytes of data
– Inexpensive, reliable, accessible
Ø Answer: a distributed file system
Ø Problem 2: Distributed computing is hard
Ø Answer: make distributed computing easier
Ø Problem 3: Data sets too large for an RDBMS
Ø Answer: a new way to store application data
Google’s Solution: Tool 1
Ø Google File System (GFS)
– A file system specifically built to manage large files and support distributed computing
Ø Inexpensive:
– Store files distributed across a cluster of cheap servers
Ø Reliable:
– Plan for server failure: if you have 1,000 servers, one will fail every day
– Always maintain three copies of each file (configurable)
Ø Accessible:
– File chunk size is 64 MB, so there are fewer file handles to manage
– A master table keeps track of the location of each file copy
Problem 1: Store terabytes and petabytes of data
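A back-of-the-envelope example of why the large chunk size matters: a 1 TB file split into 64 MB chunks is about 16,384 chunks, or roughly 49,000 chunk records for the master to track with three copies of each. The same file split into conventional 4 KB blocks would require over 800 million records.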
Google’s Solution: Tool 2
Ø Map Reduce abstracts away the hard parts of distributed computing
Ø Programmers no longer need to manage:
– Where is the data?
– What piece of data am I working on?
– How do I move data and result sets?
– How do I combine results?
Ø Leverages the GFS
– Send the processing to the data
– Multiple file copies mean a higher chance of using more nodes for each process
Problem 2: Distributed computing is hard
Tool 2: Map Reduce
Ø Distributed parallel processing framework
Ø Map: done N times on N servers
– Perform an operation (e.g. a search) on a chunk (GBs) of data
Ø Example: search 100 GB
– Process the Map on 25 servers with 4 GB of memory each
– 100 GB processed in parallel, in memory
– Each Map stores its results as key-value pairs
Ø Reduce
– Take the Maps from N nodes
– Merge (reduce) the maps into a single sorted map (the result set)
Problem 2: Distributed computing is hard
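To make the Map and Reduce roles concrete, here is a minimal word-count sketch against the 0.20-era org.apache.hadoop.mapreduce API. It is illustrative only, and the class names are invented for this example; the same shape (emit key-value pairs in map, merge them per key in reduce) applies to the search example above.

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Map: runs once per input split, on a node holding that split.
  public class WordCountMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        word.set(token);
        context.write(word, ONE); // emit a key-value pair
      }
    }
  }

  // Reduce: merges all values emitted for one key into a single result.
  class WordCountReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      context.write(word, new IntWritable(sum)); // one sorted result per word
    }
  }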
Google’s Solution: Tool 3
Ø Bigtable: a new paradigm for storing large data sets
– “a sparse, distributed, persistent multi-dimensional sorted map”*
Ø Sparse: few entries in the map are populated
Ø Distributed: data is spread across multiple logical machines in multiple copies
Ø Multi-dimensional: maps within maps organize and store data
Ø Sorted: sorted by lexicographic keys
– Lexicographic = alphabetical, including numbers
Problem 3: Data sets too large for an RDBMS
*Bigtable: A Distributed Storage System for Structured Data
Google’s Architecture
[Diagram: clients reach Bigtable through Map Reduce jobs and direct access; Bigtable stores its data in GFS]
Hadoop – If Something Works…
Ø Hadoop was started to recreate these technologies in the Open Source community:
– GFS → HDFS
– Bigtable → Hbase
– Map Reduce → Map Reduce
A Little More on HDFS
Ø Plan for failure
– In a thousand-node cluster, machines will fail often
– HDFS is built to detect failure and redistribute files
Ø Fast data access
– Generally a batch processing system
Ø Large files – typically GB to TB files
Ø Simple coherency
– Once a file is closed, it cannot be updated or appended (see the sketch below)
Ø Cloud ready
– Can be set up on Amazon EC2 / S3
Summarized from: http://hadoop.apache.org/common/docs/current/hdfs_design.html
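As a quick illustration of the write-once model, here is a minimal sketch using Hadoop's Java FileSystem API. The path and file contents are made up for the example; the cluster location comes from the standard Hadoop configuration files.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsExample {
    public static void main(String[] args) throws Exception {
      // Picks up the cluster address from core-site.xml on the classpath.
      FileSystem fs = FileSystem.get(new Configuration());
      Path path = new Path("/demo/example.txt"); // hypothetical path

      // Write the file once; after close() it cannot be updated or appended.
      FSDataOutputStream out = fs.create(path);
      out.writeUTF("hello hadoop");
      out.close();

      // Read it back.
      FSDataInputStream in = fs.open(path);
      System.out.println(in.readUTF());
      in.close();
    }
  }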
A Little More on Hbase
Ø Multi-dimensional map:
– Map<byte[], Map<byte[], Map<byte[], Map<Long, byte[]>>>>
Ø First map: row key to column family
Ø Second map: column family to column label
Ø Third map: column label to timestamp
Ø Fourth map: timestamp to value
A Column Family is a grouping of columns of the same data type.
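In client code those four map levels surface as row key, column family, column label, and timestamp. Below is a minimal sketch against the 0.20-era Hbase client API; the table and column names are invented for illustration.

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HbaseExample {
    public static void main(String[] args) throws Exception {
      // Hypothetical table "pages" with a column family "content".
      HTable table = new HTable(new HBaseConfiguration(), "pages");

      // Write: row key -> column family -> column label (timestamp is implicit).
      Put put = new Put(Bytes.toBytes("row1"));
      put.add(Bytes.toBytes("content"), Bytes.toBytes("html"),
          Bytes.toBytes("<html>...</html>"));
      table.put(put);

      // Read the newest value back along the same path.
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      byte[] value = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"));
      System.out.println(Bytes.toString(value));
    }
  }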
Hbase Storage Model
[Diagram: Hbase storage model]
Hbase Access
Ø REST interface
– http://wiki.apache.org/hadoop/Hbase/Stargate
Ø Groovy
– http://wiki.apache.org/hadoop/Hbase/Groovy
Ø Scala
– http://wiki.apache.org/hadoop/Hbase/Scala
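For a flavor of the REST interface, the sketch below issues a plain HTTP GET for a single cell. The host, port, table, and column are assumptions for illustration; the /table/row/family:qualifier URL shape is the one described on the Stargate wiki page above.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class StargateExample {
    public static void main(String[] args) throws Exception {
      // Hypothetical Stargate server on localhost:8080, table "pages".
      URL url = new URL("http://localhost:8080/pages/row1/content:html");
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      conn.setRequestProperty("Accept", "application/json"); // ask for JSON output

      // Print the response body.
      BufferedReader reader =
          new BufferedReader(new InputStreamReader(conn.getInputStream()));
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
      reader.close();
    }
  }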
Industry Examples
Ø Web/file search (Yahoo!)
– Yahoo! is the main sponsor of and contributor to Hadoop
– Yahoo! has over 25,000 servers running Hadoop
Ø Log aggregation (Amazon, Facebook, Baidu)
Ø RDBMS replacement (Google Analytics)
Ø Image store (Google Earth)
Ø Email store (Gmail)
Ø Natural language search (Microsoft)
Ø Many more…
* Information from http://wiki.apache.org/hadoop/PoweredBy
Use Case #1: Yahoo! Search
Ø The problem, circa 2006:
Ø Yahoo! search is seen as inferior to Google’s
Ø Google is better at:
– Storing terabytes and petabytes of unstructured data
– Searching the data set efficiently
– Applying custom analytics to the data set
– Presenting a more relevant result set
Use Case #1: Yahoo! Search
Ø Solution: emulate Google with Hadoop’s HDFS, Pig, and Map Reduce
– HDFS
• Stores petabytes of web page data distributed over a cluster of thousands of compute nodes
• Runs on commodity hardware
• Average server: 2 x 4-core, 4 to 32 GB RAM*
– Pig (Hadoop subproject)
• Analytics processing platform
– Map Reduce
• Builds indexes from raw web data
* http://wiki.apache.org/hadoop/PoweredBy
Use Case #2: RDBMS Replacement
Ø Google Analytics, circa 2006
Ø Problem
– Store terabytes of analytics data about website usage
– GBs of data added per hour
– Data added in small increments
– Access and display data in under 3 seconds per request
Use Case #2: RDBMS Replacement
Ø Solution: Bigtable and Map Reduce on GFS
Ø Bigtable sits over GFS and ingests small bits of data
Ø In 2006, the Google Analytics cluster held ~220 TB*
Ø Raw Click table (200 TB)
– Rows keyed by website name + session time
– All of a website’s data is stored consecutively on disk
Ø Summary table (20 TB)
– Produced by a Map Reduce over the Raw Click table for customer web views
Pattern: collect data in one Bigtable instance, then Map Reduce into a view Bigtable instance
*Bigtable: A Distributed Storage System for Structured Data
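The row-key design is the crux of this pattern: because Bigtable (and Hbase) sort rows lexicographically, prefixing the key with the website name keeps each site's clicks contiguous on disk. A hypothetical composite-key helper, sketched with Hbase's Bytes utility:

  import org.apache.hadoop.hbase.util.Bytes;

  public class RowKeyExample {
    // Composite key: site name first, session time second, so all rows for
    // one site sort together and can be scanned sequentially from disk.
    static byte[] rowKey(String site, long sessionTimeMs) {
      return Bytes.add(Bytes.toBytes(site), Bytes.toBytes(sessionTimeMs));
    }

    public static void main(String[] args) {
      byte[] key = rowKey("example.com", 1262304000000L); // hypothetical values
      System.out.println(Bytes.toStringBinary(key));
    }
  }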
Can You Use Hadoop?
Ø If…
– You have a large amount of data (terabytes or more)
– You can split your data collection store from your online or analytics data store
– You can order your data lexicographically
– You can run analytics as batches
– You cannot afford a large enough RDBMS
– You need dynamic column additions
– You need near-linear performance as the data set grows
Other Hadoop Technologies
Ø Hive – SQL-like query language to use Hadoop like a data warehouse
Ø Pig – parallel data analysis framework
Ø Zookeeper – distributed application coordination framework
Ø Chukwa – data collection system for distributed computing
Ø Avro – data serialization framework
New Skills for IT
Ø Learning to restructure data
Ø Learning to write Map Reduce programs
Ø Learning to maintain a Hadoop cluster
Ø Forgetting RDBMS/SQL-dominated design principles
It takes a new style of creativity to both structure data in Hadoop and write useful Map Reduce programs.
Getting Started
Ø You can install a test system on a single Unix box
Ø A full system needs a minimum of 3 servers
– 10 to 20 servers is a small cluster
Ø Expect to spend a day to a week getting a multi-node cluster configured
Ø A book like Pro Hadoop by Jason Venner may save you time, but it is based on the 0.19 Hadoop release (Hadoop is currently at 0.20)
Optional Quickstart
Ø Cloudera has a preconfigured single-node Hadoop instance available for download at: http://www.cloudera.com/hadoop-training-virtual-machine
Ø Yahoo! has a Hadoop distribution as well at: http://developer.yahoo.com/hadoop/distribution/
Alternatives to Hbase
Ø Project Voldemort
– http://project-voldemort.com/
– Used by LinkedIn
Ø Hypertable
– http://www.hypertable.org/
– Used by Baidu (search leader of China)
Ø Cassandra
– http://cassandra.apache.org/
– Apache-sponsored distributed database
– Used by Facebook
Helpful Information
Ø http://hadoop.apache.org
Ø http://hbase.apache.org
Ø http://wiki.apache.org/hadoop/HadoopPresentations
Ø http://labs.google.com/papers/bigtable.html
Ø http://labs.google.com/papers/gfs.html
Ø http://labs.google.com/papers/mapreduce.html
Ø Twitter: @hbase
Ø Two articles on Map Reduce in the 01/2010 Communications of the ACM