An introduction to Hadoop presentation geared towards educating potential clients on Hadoop's capabilities.
Object Partners Inc.
Introduction to Hadoop
Presented by: Joel Crabb
Demo by: Nick Adelman
Agenda
Ø Terminology
Ø Why does Hadoop exist?
Ø HDFS and Hbase
Ø Examples
Ø Getting Started
Ø Demo
Terminology
Ø Hadoop
– Core set of technologies, hosted by the Apache Foundation, for storing and searching data sets in the terabyte and petabyte range
Ø HDFS
– Hadoop Distributed File System, used as the basis for all Hadoop technologies
Ø Hbase
– Distributed map-based database which uses HDFS as its underlying data store
Ø Map Reduce
– A framework for programming distributed parallel processing algorithms
Terminology
Ø Distributed Computing
– A computing paradigm that parallelizes computations over multiple compute nodes in order to decrease overall processing time
Ø NOSQL
– Programming paradigm which does not use a relational database as the backend data store
Ø Big Data
– Generic term used when working with large data sets
Ø Name Node
– Server that knows the location of all files in the cluster
Enterprise Architecture 101
[Diagram: data sources feed into HDFS; Map Reduce jobs move data between HDFS, Hbase, and an RDBMS]
The New System Constraint
Ø Hard disk seek time is the new constraint when working with a petabyte data set
– Spread the seek time among multiple servers
– Isolate the data to a single read per disk
– It is faster to read too much data sequentially on disk and discard the excess
Ø Working under this paradigm requires New Tools
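As a rough illustration of the constraint (assuming figures typical of commodity disks of the era, roughly 10 ms per seek and 100 MB/s sequential transfer): reading 1 GB sequentially takes about 10 seconds, while fetching that same gigabyte as 100,000 random 10 KB reads costs on the order of 1,000 seconds in seek time alone. That hundredfold gap is why these tools favor large sequential scans over random access.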
New Tools: Why does Hadoop exist?
Ø In the early 2000s Google had three problems:
Ø Problem 1: Store terabytes and petabytes of data
– Inexpensive, reliable, accessible
Ø Answer: a distributed file system
Ø Problem 2: Distributed computing is hard
Ø Answer: make distributed computing easier
Ø Problem 3: Data sets too large for an RDBMS
Ø Answer: a new way to store application data
Google’s Solution: Tool 1
Ø Google File System (GFS)
– A file system specifically built to manage large files and support distributed computing
Ø Inexpensive:
– Store files distributed across a cluster of cheap servers
Ø Reliable:
– Plan for server failure: if you have 1,000 servers, one will fail every day
– Always maintain three copies of each file (configurable)
Ø Accessible:
– File chunk size is 64 MB, so there are fewer file handles to manage
– A master table keeps track of the location of each file copy
Problem 1: Store terabytes and petabytes of data
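A back-of-the-envelope example of why the large chunk size matters: a 1 TB file split into 64 MB chunks is about 16,384 chunks, or roughly 49,000 chunk records for the master to track with three copies of each. The same file split into conventional 4 KB blocks would require over 800 million records.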
Google’s Solution: Tool 2
Ø Map Reduce abstracts away the hard parts of distributed computing
Ø Programmers no longer need to manage:
– Where is the data?
– What piece of data am I working on?
– How do I move data and result sets?
– How do I combine results?
Ø Leverages the GFS
– Send the processing to the data
– Multiple file copies mean a higher chance of using more nodes for each process
Problem 2: Distributed computing is hard
Tool 2: Map Reduce
Ø Distributed parallel processing framework
Ø Map: done N times on N servers
– Perform an operation (e.g. a search) on a chunk (GBs) of data
Ø Example: search 100 GB
– Process the Map on 25 servers with 4 GB of memory each
– 100 GB processed in parallel, in memory
– Each Map stores its results as key-value pairs
Ø Reduce
– Take the Maps from N nodes
– Merge (reduce) the maps into a single sorted map (the result set)
Problem 2: Distributed computing is hard
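To make the Map and Reduce roles concrete, here is a minimal word-count sketch against the 0.20-era org.apache.hadoop.mapreduce API. It is illustrative only, and the class names are invented for this example; the same shape (emit key-value pairs in map, merge them per key in reduce) applies to the search example above.

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Map: runs once per input split, on a node holding that split.
  public class WordCountMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        word.set(token);
        context.write(word, ONE); // emit a key-value pair
      }
    }
  }

  // Reduce: merges all values emitted for one key into a single result.
  class WordCountReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      context.write(word, new IntWritable(sum)); // one sorted result per word
    }
  }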
Google’s Solution: Tool 3
Ø Bigtable: a new paradigm for storing large data sets
– “a sparse, distributed, persistent multi-dimensional sorted map”*
Ø Sparse: few entries in the map are populated
Ø Distributed: data is spread across multiple logical machines in multiple copies
Ø Multi-dimensional: maps within maps organize and store data
Ø Sorted: sorted by lexicographic keys
– Lexicographic = alphabetical, including numbers
Problem 3: Data sets too large for an RDBMS
*Bigtable: A Distributed Storage System for Structured Data
Google’s Architecture
[Diagram: clients reach Bigtable through Map Reduce jobs and direct access; Bigtable stores its data in GFS]
Hadoop – If Something Works…
Ø Hadoop was started to recreate these technologies in the Open Source community:
– GFS → HDFS
– Bigtable → Hbase
– Map Reduce → Map Reduce
A Little More on HDFS
Ø Plan for failure
– In a thousand-node cluster, machines will fail often
– HDFS is built to detect failure and redistribute files
Ø Fast data access
– Generally a batch processing system
Ø Large files – typically GB to TB files
Ø Simple coherency
– Once a file is closed, it cannot be updated or appended (see the sketch below)
Ø Cloud ready
– Can be set up on Amazon EC2 / S3
Summarized from: http://hadoop.apache.org/common/docs/current/hdfs_design.html
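As a quick illustration of the write-once model, here is a minimal sketch using Hadoop's Java FileSystem API. The path and file contents are made up for the example; the cluster location comes from the standard Hadoop configuration files.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsExample {
    public static void main(String[] args) throws Exception {
      // Picks up the cluster address from core-site.xml on the classpath.
      FileSystem fs = FileSystem.get(new Configuration());
      Path path = new Path("/demo/example.txt"); // hypothetical path

      // Write the file once; after close() it cannot be updated or appended.
      FSDataOutputStream out = fs.create(path);
      out.writeUTF("hello hadoop");
      out.close();

      // Read it back.
      FSDataInputStream in = fs.open(path);
      System.out.println(in.readUTF());
      in.close();
    }
  }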
A Little More on Hbase
Ø Multi-dimensional map:
– Map<byte[], Map<byte[], Map<byte[], Map<Long, byte[]>>>>
Ø First map: row key to column family
Ø Second map: column family to column label
Ø Third map: column label to timestamp
Ø Fourth map: timestamp to value
A Column Family is a grouping of columns of the same data type.
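In client code those four map levels surface as row key, column family, column label, and timestamp. Below is a minimal sketch against the 0.20-era Hbase client API; the table and column names are invented for illustration.

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HbaseExample {
    public static void main(String[] args) throws Exception {
      // Hypothetical table "pages" with a column family "content".
      HTable table = new HTable(new HBaseConfiguration(), "pages");

      // Write: row key -> column family -> column label (timestamp is implicit).
      Put put = new Put(Bytes.toBytes("row1"));
      put.add(Bytes.toBytes("content"), Bytes.toBytes("html"),
          Bytes.toBytes("<html>...</html>"));
      table.put(put);

      // Read the newest value back along the same path.
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      byte[] value = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"));
      System.out.println(Bytes.toString(value));
    }
  }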
Hbase Storage Model
[Diagram: Hbase storage model]
Hbase Access
Ø REST interface
– http://wiki.apache.org/hadoop/Hbase/Stargate
Ø Groovy
– http://wiki.apache.org/hadoop/Hbase/Groovy
Ø Scala
– http://wiki.apache.org/hadoop/Hbase/Scala
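For a flavor of the REST interface, the sketch below issues a plain HTTP GET for a single cell. The host, port, table, and column are assumptions for illustration; the /table/row/family:qualifier URL shape is the one described on the Stargate wiki page above.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class StargateExample {
    public static void main(String[] args) throws Exception {
      // Hypothetical Stargate server on localhost:8080, table "pages".
      URL url = new URL("http://localhost:8080/pages/row1/content:html");
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      conn.setRequestProperty("Accept", "application/json"); // ask for JSON output

      // Print the response body.
      BufferedReader reader =
          new BufferedReader(new InputStreamReader(conn.getInputStream()));
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
      reader.close();
    }
  }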
Industry Examples
Ø Web/file search (Yahoo!)
– Yahoo! is the main sponsor of and contributor to Hadoop
– Yahoo! has over 25,000 servers running Hadoop
Ø Log aggregation (Amazon, Facebook, Baidu)
Ø RDBMS replacement (Google Analytics)
Ø Image store (Google Earth)
Ø Email store (Gmail)
Ø Natural language search (Microsoft)
Ø Many more…
* Information from http://wiki.apache.org/hadoop/PoweredBy
Use Case #1: Yahoo! Search
Ø The problem, circa 2006:
Ø Yahoo! search is seen as inferior to Google’s
Ø Google is better at:
– Storing terabytes and petabytes of unstructured data
– Searching the data set efficiently
– Applying custom analytics to the data set
– Presenting a more relevant result set
Use Case #1: Yahoo! Search
Ø Solution: emulate Google with Hadoop’s HDFS, Pig, and Map Reduce
– HDFS
• Stores petabytes of web page data distributed over a cluster of thousands of compute nodes
• Runs on commodity hardware
• Average server: 2 x 4-core, 4 to 32 GB RAM*
– Pig (Hadoop subproject)
• Analytics processing platform
– Map Reduce
• Builds indexes from raw web data
* http://wiki.apache.org/hadoop/PoweredBy
Use Case #2: RDBMS Replacement
Ø Google Analytics, circa 2006
Ø Problem
– Store terabytes of analytics data about website usage
– GBs of data added per hour
– Data added in small increments
– Access and display data in under 3 seconds per request
Use Case #2: RDBMS Replacement
Ø Solution: Bigtable and Map Reduce on GFS
Ø Bigtable sits over GFS and ingests small bits of data
Ø In 2006, the Google Analytics cluster held ~220 TB*
Ø Raw Click table (200 TB)
– Rows keyed by website name + session time
– All of a website’s data is stored consecutively on disk
Ø Summary table (20 TB)
– Produced by a Map Reduce over the Raw Click table for customer web views
Pattern: collect data in one Bigtable instance, then Map Reduce into a view Bigtable instance
*Bigtable: A Distributed Storage System for Structured Data
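The row-key design is the crux of this pattern: because Bigtable (and Hbase) sort rows lexicographically, prefixing the key with the website name keeps each site's clicks contiguous on disk. A hypothetical composite-key helper, sketched with Hbase's Bytes utility:

  import org.apache.hadoop.hbase.util.Bytes;

  public class RowKeyExample {
    // Composite key: site name first, session time second, so all rows for
    // one site sort together and can be scanned sequentially from disk.
    static byte[] rowKey(String site, long sessionTimeMs) {
      return Bytes.add(Bytes.toBytes(site), Bytes.toBytes(sessionTimeMs));
    }

    public static void main(String[] args) {
      byte[] key = rowKey("example.com", 1262304000000L); // hypothetical values
      System.out.println(Bytes.toStringBinary(key));
    }
  }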
Can You Use Hadoop?
Ø If…
– You have a large amount of data (terabytes or more)
– You can split your data collection store from your online or analytics data store
– You can order your data lexicographically
– You can run analytics as batches
– You cannot afford a large enough RDBMS
– You need dynamic column additions
– You need near-linear performance as the data set grows
Other Hadoop Technologies
Ø Hive – SQL-like query language to use Hadoop like a data warehouse
Ø Pig – parallel data analysis framework
Ø Zookeeper – distributed application coordination framework
Ø Chukwa – data collection system for distributed computing
Ø Avro – data serialization framework
New Skills for IT
Ø Learning to restructure data
Ø Learning to write Map Reduce programs
Ø Learning to maintain a Hadoop cluster
Ø Forgetting RDBMS/SQL-dominated design principles
It takes a new style of creativity to both structure data in Hadoop and write useful Map Reduce programs.
Getting Started
Ø You can install a test system on a single Unix box
Ø A full system needs a minimum of 3 servers
– 10 to 20 servers is a small cluster
Ø Expect to spend a day to a week getting a multi-node cluster configured
Ø A book like Pro Hadoop by Jason Venner may save you time, but it is based on the 0.19 Hadoop release (Hadoop is currently at 0.20)
Optional Quickstart
Ø Cloudera has a preconfigured single-node Hadoop instance available for download at: http://www.cloudera.com/hadoop-training-virtual-machine
Ø Yahoo! has a Hadoop distribution as well at: http://developer.yahoo.com/hadoop/distribution/
Alternatives to Hbase
Ø Project Voldemort
– http://project-voldemort.com/
– Used by LinkedIn
Ø Hypertable
– http://www.hypertable.org/
– Used by Baidu (search leader of China)
Ø Cassandra
– http://cassandra.apache.org/
– Apache-sponsored distributed database
– Used by Facebook
Helpful Information
Ø http://hadoop.apache.org
Ø http://hbase.apache.org
Ø http://wiki.apache.org/hadoop/HadoopPresentations
Ø http://labs.google.com/papers/bigtable.html
Ø http://labs.google.com/papers/gfs.html
Ø http://labs.google.com/papers/mapreduce.html
Ø Twitter: @hbase
Ø Two articles on Map Reduce in the 01/2010 Communications of the ACM