Exploring Hadoop: Microsoft Azure Implementation. Jacob Saunders, Chicago SQL BI User Group, 7/24/2012
Agenda
What is Hadoop?
HDFS
MapReduce
What’s Next?
Resources
Demo
What is Hadoop?
A system for processing mind-bogglingly large amounts of data efficiently. - Jakob Homan, Sr. Software Engineer at LinkedIn
Hadoop in Plain English
Hadoop is based on large-scale distributed storage; it is not a relational database.
Hadoop was designed to process large amounts of complex data efficiently.
Hadoop is very good at organizing unstructured data.
Hadoop is open source: the Apache Software Foundation (http://hadoop.apache.org/) provides support for a community of open-source projects (e.g., Pig, Hive).
A Brief History of Hadoop
Hadoop has its origins in Google's need to store and perform analysis on the entire contents of the Internet.
Hadoop is based on Google's GFS and MapReduce papers.
The concept of problem partitioning is as old as LISP.
Yahoo was also involved early on with Hadoop, and is the main contributor to Hadoop today.
Doug Cutting created the original prototype as part of Nutch (another open-source search engine that was part of the Apache project), based on a Google white paper.
Yahoo hired Doug Cutting to take that prototype and expand on it.
Why is it called Hadoop?
Hadoop was named by Doug Cutting,
after his son’s toy elephant!
Who is using Hadoop today?
Adobe
CCC
Sears
Orbitz Worldwide
The New York Times
eHarmony
Last.fm
Fox Interactive Media
Rackspace
IBM
Microsoft
To name a few ….
Main Components of Hadoop
HDFS (Hadoop Distributed File System) - storage
An open-source implementation of MapReduce - math/computation
The Hadoop Ecosystem
HDFS
HDFS is a distributed file system that runs on large clusters of servers (also referred to as nodes).
The file system strongly resembles the UNIX file system in its structure and provides UNIX-style commands (e.g., ls, cat, mkdir).
HDFS: example screenshot of the file system
HDFS can read very big files (petabytes and even exabytes). What the heck is an exabyte?
An exabyte is approximately 1,000 petabytes. Another way to look at it: an exabyte is approximately one quintillion bytes, or one billion gigabytes. There is not much to compare an exabyte to; it has been said that 5 exabytes would equal all of the words ever spoken by mankind.
HDFS reads a file and splits it up into 128 MB blocks (the size is configurable).
HDFS makes three copies of each block (the number of copies is configurable; the more copies you have, the safer the file will be).
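The block-and-replica arithmetic above is easy to sketch. A minimal Python illustration (the 10 GB file size is a made-up example; the function name is mine, not a Hadoop API):

```python
# Sketch: how a file is split into blocks and replicated in HDFS.
# Block size and replication factor are the configurable defaults
# mentioned above.

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, configurable
REPLICATION = 3                  # 3 copies, configurable

def hdfs_footprint(file_size_bytes):
    """Return (number of blocks, total bytes stored across the cluster)."""
    # Ceiling division: the last block may be smaller than BLOCK_SIZE.
    blocks = -(-file_size_bytes // BLOCK_SIZE)
    return blocks, file_size_bytes * REPLICATION

one_gb = 1024 ** 3
blocks, stored = hdfs_footprint(10 * one_gb)   # a 10 GB file
print(blocks)            # 80 blocks of 128 MB
print(stored / one_gb)   # 30.0 GB actually stored (3 replicas)
```

Note the trade-off the slide mentions: each extra replica multiplies the storage cost but makes the file safer.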
How HDFS Stores Data
In a distributed environment, the replicas of the data are placed on the nodes that comprise the cluster.
A datanode is a server on the cluster.
Hadoop Nodes
Namenode (or headnode)
The Namenode does NOT store blocks of data. It keeps track of where the blocks are mapped on the other nodes (servers).
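That metadata-only role can be pictured as two lookup tables: file to block list, and block to replica locations. A toy sketch (all file, block, and datanode names below are hypothetical, and the real Namenode keeps this in memory with an edit log, not plain dicts):

```python
# Toy model of Namenode metadata: no data blocks, only mappings.

namenode_files = {
    "/logs/access.log": ["blk_1", "blk_2"],   # file -> ordered block list
}
block_locations = {
    "blk_1": ["datanode-a", "datanode-b", "datanode-c"],  # 3 replicas each
    "blk_2": ["datanode-b", "datanode-c", "datanode-d"],
}

def locate(path):
    """For each block of the file, return the datanodes holding a replica."""
    return [block_locations[b] for b in namenode_files[path]]

print(locate("/logs/access.log"))
```

A client asks the Namenode where the blocks live, then reads the blocks directly from the datanodes.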
MapReduce
The Map function takes a large dataset and partitions
it into several smaller intermediate datasets that can
be processed in parallel by different nodes in a cluster.
The Reduce function then takes the separate results of
each computation and aggregates them to form the
final output.
MapReduce can be leveraged to perform operations
such as sorting and statistical analysis on large
datasets, which may be mapped into smaller partitions
and processed in parallel.
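The partition-then-aggregate flow described above can be sketched as a toy in-memory simulation (this is not the Hadoop API; function names are mine):

```python
from collections import defaultdict

# Toy MapReduce: partition a dataset, "map" each partition independently
# (the part Hadoop runs in parallel on many nodes), then "reduce" by
# aggregating the intermediate results into the final output.

def map_partition(lines):
    """Mapper: emit (word, 1) pairs for one partition of the data."""
    return [(word, 1) for line in lines for word in line.split()]

def reduce_results(intermediate):
    """Reducer: sum the counts for each key across all partitions."""
    totals = defaultdict(int)
    for pairs in intermediate:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)

dataset = ["big data", "big clusters", "data data"]
partitions = [dataset[0:2], dataset[2:3]]           # pretend two nodes
intermediate = [map_partition(p) for p in partitions]
print(reduce_results(intermediate))   # {'big': 2, 'data': 3, 'clusters': 1}
```

Each partition is mapped with no knowledge of the others, which is what lets the work spread across a cluster.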
MapReduce
Data is treated as Keys & Values
Examples: <Key, value>
<Byte Offset, some text>
<Userid, User Profile> (used by social networks)
<timestamp, Access log entry> (used for log analysis)
<users, List of User’s friends> (used by social networks)
Steps to Write a MapReduce Job
Write a mapper that takes a key and a value, and emits zero or more keys and values.
Write a reducer that takes all the values of one key and emits zero or more new keys and values.
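The two steps above are usually written as standalone mapper and reducer functions. A hedged word-count sketch in the style of Hadoop Streaming (in a real job these would be separate scripts reading stdin; function names are mine):

```python
# Word count as mapper + reducer. Hadoop sorts the mapper's output
# by key before feeding it to the reducer, so the reducer sees all
# values for one key together.

def mapper(lines):
    """Takes (key=byte offset, value=line of text); emits (word, 1) pairs."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(sorted_pairs):
    """Takes all values of one key in turn; emits (word, total) pairs."""
    current, total = None, 0
    for word, count in sorted_pairs:
        if word != current:
            if current is not None:
                yield current, total
            current, total = word, 0
        total += count
    if current is not None:
        yield current, total

pairs = sorted(mapper(["to be or not", "to be"]))   # stand-in for the shuffle/sort
print(list(reducer(pairs)))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

The `sorted()` call stands in for Hadoop's shuffle/sort phase between the two steps.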
Map Reduce Task Tracker
The calculation runs on the node where the data block resides.
The map and reduce tasks run on many nodes in parallel because Hadoop is a distributed environment. The diagram to the right illustrates this.
Map Reduce Task Tracker Output
The reducer processes the results from all nodes. The occurrence of each key (a, in the example above) is counted, and the number of occurrences becomes the value (1, 5, 3, etc.).
MapReduce Doesn't Always "Map": new languages/projects that can be used instead of MapReduce
Benefit: the projects below do not require you to think in the <Key, Value> paradigm.
Pig - a data-flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
Hive - a distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which the runtime engine translates into MapReduce jobs) for querying the data.
HBase - a distributed, column-oriented database. HBase uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries (random reads).
ZooKeeper - a distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building applications.
Sqoop - a tool for efficiently moving data between relational databases and HDFS.
What’s Next?
Google's Percolator for incremental indexing: by dealing only with new, modified, or deleted documents and using secondary indices to efficiently catalog and query the resulting output, Percolator can dramatically increase performance on rapidly changing datasets.
Dremel for ad hoc analytics: MapReduce (and thereby Hadoop) is purpose-built for organized data processing (jobs), not ad hoc exploration. Google invented Dremel (now exposed as the BigQuery product) as a purpose-built tool to let analysts scan petabytes of data in seconds to answer ad hoc queries and power compelling visualizations.
Pregel for analyzing graph data: Google MapReduce was purpose-built for crawling and analyzing the world's largest graph data structure, the Internet. However, MapReduce is not good at analyzing networks of people, telecom equipment, documents, and other graph structures. Therefore, Google built Pregel, a large bulk-synchronous processing application for petabyte-scale graph processing on distributed commodity machines.
LINQ-to-Hadoop: Microsoft has refocused its DryadLINQ efforts toward Hadoop and will be offering the ability to query Hadoop using this friendly and modern language.
Can I Get It?
Yes, but…
Microsoft's Hadoop on Azure is open to the public, but you must respond to the survey on the Microsoft Connect site:
https://connect.microsoft.com/SQLServer/Survey/Survey.aspx?SurveyID=13697
If you just want to play around with Hadoop, Cloudera has a fully configured VM you can download and run:
https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM
Resources
Local User Groups
Chicago Area Hadoop User Group (CHUG)
http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/
Chicago Big Data
http://www.meetup.com/Chicago-Big-Data/
Resources Used for this Hadoop Research
Hadoop: The Definitive Guide, O'Reilly | Yahoo! Press, by Tom White
Apache Hadoop - Petabytes and Terawatts, discussion led by Jakob Homan, Software Engineer at LinkedIn (YouTube): http://www.youtube.com/watch?v=SS27F-hYWfU&feature=endscreen
What is Hadoop? Other Big Terms like MapReduce, interview with Mike Olson of Cloudera (YouTube): http://www.youtube.com/watch?v=S9xnYBVqLws
http://bigdatauniversity.com/courses/
Teradata Aster: The Differences Between Aster and Hadoop: http://www.asterdata.com/blog/2008/09/06/differences-between-aster-and-hadoop/
Why the days are numbered for Hadoop as we know it: http://gigaom.com/cloud/why-the-days-are-numbered-for-hadoop-as-we-know-it/
Additional Hadoop Resources
Hadoop on Azure
https://www.hadooponazure.com/
Magenic: Why I am Excited about SQL Server 2012 (Part 2)
http://magenic.com/Blog/WhyIAmExcitedaboutSQLServer2012Part2.aspx
Apache Hadoop Home Page
http://hadoop.apache.org/
Running Apache Pig (Pig Latin) at Apache Hadoop on Windows Azure
http://blogs.msdn.com/b/avkashchauhan/archive/2012/01/10/running-apache-pig-pig-latin-at-apache-hadoop-on-windows-azure.aspx
Practical Problem Solving with Apache Hadoop Pig
http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
Fluent Queries on the Interactive JavaScript Console
http://social.technet.microsoft.com/wiki/contents/articles/7183.fluent-queries-on-the-interactive-javascript-console.aspx
Hive vs. Pig
http://www.larsgeorge.com/2009/10/hive-vs-pig.html
Hive DDL Reference
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
Demo
Agenda
Methods of accessing Azure Hadoop
Loading files into HDFS
Pig from the Interactive Console
Hive
Excel Hive Add-In
Sqoop