DESCRIPTION
Introduction to Hadoop and its ecosystem at BreizhJug
INTRODUCTION TO HADOOP
Rennes – 2014-11-06
David Morin – @davAtBzh
BreizhJug
Me
Solutions Engineer at …
David Morin – @davAtBzh
What is Hadoop?

An elephant – this one?

No, this one!

The father

Let's go!

Timeline

How did the story begin?
=> Deal with high volumes of data
Big Data – Big Server?

Big Data – Big Problems?

Split is the key

How to find the data?

Define a master

Try again

Not so bad
Hadoop fundamentals
● Distributed filesystem for high volumes of data
● Runs on commodity servers (limits costs)
● Scalable / fault-tolerant
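The split-and-replicate idea behind these fundamentals can be sketched in a few lines. This is a conceptual illustration, not Hadoop code: the block size, replication factor, and round-robin placement below are simplified assumptions (real HDFS placement is rack-aware).

```java
import java.util.ArrayList;
import java.util.List;

public class BlockPlacementSketch {

    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB, a common HDFS default
    static final int REPLICATION = 3;                  // the usual default replication factor

    /** Number of blocks needed to store a file of the given size. */
    static long blockCount(long fileSize) {
        return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    /** Toy round-robin placement of each block's replicas on a set of datanodes. */
    static List<List<Integer>> placeBlocks(long fileSize, int nodes) {
        List<List<Integer>> placement = new ArrayList<>();
        for (long b = 0; b < blockCount(fileSize); b++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                replicas.add((int) ((b + r) % nodes)); // real HDFS picks nodes rack-aware
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        System.out.println("blocks for 1 GB: " + blockCount(oneGb)); // 8 blocks of 128 MB
        System.out.println("placement on 5 nodes: " + placeBlocks(oneGb, 5));
    }
}
```

Losing one server loses only one copy of each of its blocks; the master can re-replicate from the two surviving copies, which is how the "fault-tolerant on commodity hardware" claim holds.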
HDFS

Hadoop Distributed FileSystem

Hadoop fundamentals
● Distributed filesystem for high volumes of data
● Runs on commodity servers (limits costs)
● Scalable / fault-tolerant ??

Hadoop Distributed FileSystem
MapReduce
HDFS + MapReduce

MapReduce: word count
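The word-count flow can be sketched without a cluster: the map phase emits a (word, 1) pair per word, and the reduce phase sums the values per key. A minimal in-memory sketch in plain Java (not the Hadoop API — just the model):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {

    /** Map phase: emit a (word, 1) pair for every word of every line. */
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(Map.entry(word, 1));
                }
            }
        }
        return pairs;
    }

    /** Shuffle + reduce phase: group the pairs by key and sum the values. */
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("hello hadoop", "hello world");
        System.out.println(reduce(map(lines))); // {hello=2, hadoop=1, world=1} (order may vary)
    }
}
```

On a real cluster, many mappers run this map step in parallel on different HDFS blocks, and the framework's shuffle routes all pairs with the same key to the same reducer.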
Data Locality Optimization

MapReduce in action
Hadoop v1: drawbacks
– One Namenode: SPOF
– One JobTracker: SPOF, and not scalable (limited number of nodes)
– MapReduce only: the platform needs to be opened to non-MR applications
– MapReduce v1: does not fit well with the iterative algorithms used in Machine Learning
Hadoop v2
Improvements:
– HDFS v2: Secondary namenode
– YARN (Yet Another Resource Negotiator)
  ● JobTracker => Resource Manager + Application Masters (more than one)
  ● Can be used by non-MapReduce applications
– MapReduce v2: uses YARN

Hadoop v2

YARN
What about monitoring?
● Command line: hadoop job, yarn
● Web UI to monitor cluster status
● Web UI to check the status of running jobs
● Access to log files on node activity from the Web UI
What can we do with Hadoop?
(Me) Two projects at Credit Mutuel Arkea:
– LAB: Anti-money laundering
– Operational reporting for a B2B customer

LAB: Context
● Tracfin: supervised by the Economic and Financial department in France
LAB: Context
● Difficulty providing accurate alerts: the system is complex to maintain and to extend with new features
● Cobol batch (z/OS): ran from 7:00 pm until 9:00 am the next day

LAB: Migration to Hadoop
● Pig: the Pig dataflow model fits this kind of process well (lots of data manipulation)
● Lots of input data: +1 for Pig
● Many job tasks can be parallelized: +1 for Hadoop
● Time spent on data manipulation reduced by more than 50%
● The previous job was a batch: MapReduce OK
Operational Reporting
Context:
– Provide a wide variety of reports to a B2B partner
Why Hadoop:
– New project
– Huge number of different data sources as input: Pig, help me!
– Batch is OK
Pig – Why a new language?
● With Pig, writing MR jobs becomes easy
● Dataflow model: data is the key!
● Language: Pig Latin
● No limits: User Defined Functions
http://pig.apache.org/docs/r0.13.0/
https://github.com/linkedin/datafu
https://github.com/twitter/elephant-bird
https://cwiki.apache.org/confluence/display/PIG/PiggyBank
Pig “Hello world”
● Pig word count

-- Load the file from HDFS
lines = LOAD '/user/XXX/file.txt' AS (line:chararray);
-- Iterate over each line:
-- use TOKENIZE to split it into words and FLATTEN to obtain one tuple per word
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group by word
grouped = GROUP words BY word;
-- Count the number of occurrences in each group (word)
wordcount = FOREACH grouped GENERATE group, COUNT(words);
-- Display the results on stdout
DUMP wordcount;
Pig vs MapReduce
=> 130 lines of code!

import ...

public class WordCount2 {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    static enum CountersEnum { INPUT_WORDS }

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    private boolean caseSensitive;
    private Set<String> patternsToSkip = new HashSet<String>();

    private Configuration conf;
    private BufferedReader fis;
    ...
Hive
● SQL-like language: HQL
● Metastore: data abstraction and data discovery
● UDFs
Hive “Hello world”
● Hive word count

-- Create the table structure (DDL)
CREATE TABLE docs (line STRING);
-- Load data into it
LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs;
-- Create a table for the results:
-- select the data from the previous table, split the lines into words,
-- then group by word and count the records per group
CREATE TABLE word_counts AS
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
Zookeeper
Purpose: coordinate the different actors and provide a global configuration that has been pushed to it.
● Distributed coordination service
● Dynamic configuration
● Distributed locking
Kafka
● Messaging system with a specific design
● Topic (publish/subscribe) and point-to-point at the same time
● Suitable for high volumes of data
https://kafka.apache.org/
Hadoop: batch, but not only...
Tez
● Interactive processing on top of Hive and Pig
HBase
● Online database (realtime querying)
● NoSQL: column-oriented database
● Based on Google BigTable
● Storage on HDFS
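The column-oriented model can be pictured as a sorted map of maps: each row key maps to a set of "family:qualifier" cells. A toy in-memory sketch of that data model (real HBase adds timestamps/versions, splits rows into regions, and persists to HDFS):

```java
import java.util.TreeMap;

public class ColumnStoreSketch {

    // row key -> (column "family:qualifier" -> value); TreeMap keeps rows sorted
    // by key, as HBase does, which makes range scans cheap
    private final TreeMap<String, TreeMap<String, String>> rows = new TreeMap<>();

    /** Write one cell: rows are sparse, so absent columns cost nothing. */
    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    /** Read one cell, or null if the row or column does not exist. */
    String get(String rowKey, String column) {
        TreeMap<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(column);
    }

    /** Range scan over row keys, the access pattern HBase is optimized for. */
    Iterable<String> scan(String startRow, String stopRow) {
        return rows.subMap(startRow, stopRow).keySet();
    }

    public static void main(String[] args) {
        ColumnStoreSketch store = new ColumnStoreSketch();
        store.put("user#001", "info:name", "David");
        store.put("user#001", "info:city", "Rennes");
        store.put("user#002", "info:name", "Alice");
        System.out.println(store.get("user#001", "info:name")); // David
        System.out.println(store.scan("user#001", "user#002")); // [user#001]
    }
}
```

Designing the row key well matters precisely because of that sorted layout: realtime lookups and scans both walk row keys in order.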
Storm
● Streaming mode
● Integrates well with Apache Kafka
● Allows data to be processed as it arrives
http://fr.slideshare.net/hugfrance/hugfr-6-oct2014ovhantiddos
http://fr.slideshare.net/miguno/apache-storm-09-basic-training-verisign
Cascading
● Application development platform on Hadoop
● APIs in Java: standard API, data processing, data integration, scheduler API
Scalding
● Scala API for Cascading
Phoenix
● Relational DB layer over HBase
● HBase access delivered as a JDBC client
● Performance: on the order of milliseconds for small queries, seconds for tens of millions of rows
Spark
● Big data analytics, in memory / on disk
● Complements Hadoop
● Faster and more flexible
https://speakerdeck.com/nivdul/lightning-fast-machine-learning-with-spark
http://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
??