Hadoop – BreizhJug


Introduction to Hadoop and its ecosystem, presented at BreizhJug


INTRODUCTION TO HADOOP

Rennes – 2014-11-06
David Morin – @davAtBzh

BreizhJug

Me

Solutions Engineer at Credit Mutuel Arkea

David Morin – @davAtBzh

3

What is Hadoop ?

4

An elephant – This one ?

5

No, this one !

6

The father

7

Let's go !

8

Let's go !

9

Timeline

10

How did the story begin ?

=> Dealing with high volumes of data

11

Big Data – Big Server ?

12

Big Data – Big Server ?

13

Big Data – Big Problems ?

14

Big Data – Big Problems ?

15

Split is the key

16

How to find data ?

17

Define a master

18

Try again

19

Not so bad

20

Hadoop fundamentals

● Distributed file system for high volumes of data
● Runs on commodity servers (limits costs)
● Scalable / fault tolerant
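Not on the original slides: a minimal sketch of what using HDFS looks like from Java, through the standard FileSystem API. The NameNode address and the paths are invented for the example.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020");   // assumed NameNode address

    try (FileSystem fs = FileSystem.get(conf)) {
      Path file = new Path("/user/demo/hello.txt");     // assumed path

      // The client asks the NameNode for metadata; the blocks themselves
      // are written to and read from the DataNodes, with replication.
      try (FSDataOutputStream out = fs.create(file, true)) {
        out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
      }

      try (BufferedReader in = new BufferedReader(
          new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
        System.out.println(in.readLine());
      }
    }
  }
}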

21

HDFS

HDFS

22

Hadoop Distributed FileSystem

23

Hadoop fundamentals

● Distributed file system for high volumes of data
● Runs on commodity servers (limits costs)
● Scalable / fault tolerant ??

24

Hadoop Distributed FileSystem

25

MapReduce

HDFS MapReduce

26

MapReduce

27

MapReduce: word count

Map Reduce
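To make the two phases concrete, here is a minimal sketch of the word-count Mapper and Reducer with the org.apache.hadoop.mapreduce API; names are illustrative (the longer WordCount2 listing from the Hadoop tutorial appears later, on the "Pig vs MapReduce" slide).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

  // Map phase: called once per input line, emits a (word, 1) pair per token.
  public static class TokenizeMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: receives all the 1s emitted for a given word and sums them.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }
}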

28

Data Locality Optimization

29

MapReduce in action

30

Hadoop v1 : drawbacks

– One NameNode: SPOF
– One JobTracker: SPOF and not scalable (limited number of nodes)
– MapReduce only: the platform should be opened to non-MR applications
– MapReduce v1: does not fit well with the iterative algorithms used by Machine Learning

31

Hadoop v2

Improvements:
– HDFS v2: NameNode High Availability (standby NameNode)
– YARN (Yet Another Resource Negotiator)
  ● JobTracker => ResourceManager + Application Masters (more than one)
  ● Can be used by non-MapReduce applications
– MapReduce v2: runs on YARN

32

Hadoop v2

33

YARN

34

YARN

35

YARN

36

YARN

37

YARN

38

YARN

39

What about monitoring ?

● Command line: hadoop job, yarn
● Web UI to monitor cluster status
● Web UI to check the status of running jobs
● Access to node activity log files from the Web UI

40

What about monitoring ?

41

What can we do with Hadoop ?

(Me) Two projects at Credit Mutuel Arkea:
– LAB: anti-money laundering
– Operational reporting for a B2B customer

42

LAB : Context

● Tracfin: the French anti-money-laundering unit, supervised by the Ministry of Economy and Finance

43

LAB : Context

● Difficult to provide accurate alerts: the system is complex to maintain and to extend with new features

44

LAB : Context

● COBOL batch (z/OS): runs from 19:00 until 09:00 the next day

45

LAB : Migration to Hadoop

● Pig: the dataflow model fits this kind of process well (a lot of data manipulation)

46

LAB : Migration to Hadoop

● A lot of input data: +1 for Pig

47

LAB : Migration to Hadoop

● A lot of job tasks can be parallelized: +1 for Hadoop

48

LAB : Migration to Hadoop

● Time spent on data manipulation reduced by more than 50%

49

LAB : Migration to Hadoop

● The previous job was a batch: MapReduce is OK

50

Operational Reporting

Context:
– Provide a large variety of reports to a B2B partner

Why Hadoop:
– New project
– Huge number of different input data sources: Pig, help me!
– Batch is OK

51

52

Pig – Why a new language?

● With Pig, writing MR jobs becomes easy
● Dataflow model: data is the key!
● Language: Pig Latin
● No limits: User Defined Functions (a sketch follows the links below)

http://pig.apache.org/docs/r0.13.0/
https://github.com/linkedin/datafu
https://github.com/twitter/elephant-bird
https://cwiki.apache.org/confluence/display/PIG/PiggyBank
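As an illustration of the User Defined Functions point above, a minimal sketch of a Java EvalFunc; the class name and its behaviour are invented for the example, not taken from the talk.

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: upper-cases its single chararray argument.
public class UpperCase extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    return ((String) input.get(0)).toUpperCase();
  }
}

Once packaged in a jar (name hypothetical), it would be registered and called from Pig Latin with something like REGISTER myudfs.jar; followed by FOREACH lines GENERATE UpperCase(line);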

53

● Pig – Wordcount

-- Load a file from HDFS
lines = LOAD '/user/XXX/file.txt' AS (line:chararray);

-- Iterate on each line
-- TOKENIZE splits the line into words and FLATTEN turns the bag into one tuple per word
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group by word
grouped = GROUP words BY word;

-- Count the number of occurrences for each group (word)
wordcount = FOREACH grouped GENERATE group, COUNT(words);

-- Display the results on stdout
DUMP wordcount;

Pig “Hello world”

54

=> 130 lines of code !

import …

public class WordCount2 {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    static enum CountersEnum { INPUT_WORDS }

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    private boolean caseSensitive;
    private Set<String> patternsToSkip = new HashSet<String>();

    private Configuration conf;
    private BufferedReader fis;

    ...

Pig vs MapReduce

55

● SQL-like: HQL
● Metastore: data abstraction and data discovery
● UDFs

Hive

56

● Hive-Wordcount

-- Create table with structure (DDL)

CREATE TABLE docs (line STRING);

-- Load data..

LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs;

-- Create a table for the results:
-- select data from the previous table, split each line into words,
-- group by word and count the records per group
CREATE TABLE word_counts AS
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, '\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;

Hive “Hello world”

57

Zookeeper

Purpose: coordinate the different actors and provide them with a global configuration that we push to ZooKeeper.

58

Zookeeper
● Distributed coordination service

59

● Dynamic configuration (a sketch follows below)
● Distributed locking

Zookeeper
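A minimal sketch of the "dynamic configuration" use case with the ZooKeeper Java client; the quorum address and the znode path are invented for the example.

import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ConfigWatcher {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);

    // Connect to the ZooKeeper quorum (assumed address) and wait for the session.
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();

    // Read a (hypothetical) configuration znode and leave a watch on it:
    // ZooKeeper notifies us when its content changes.
    byte[] data = zk.getData("/myapp/config",
        event -> System.out.println("Configuration changed: " + event), null);
    System.out.println(new String(data, StandardCharsets.UTF_8));

    zk.close();
  }
}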

60

● Messaging system with a specific design
● Topic (publish/subscribe) and point-to-point semantics at the same time
● Suitable for high volumes of data

Kafka

https://kafka.apache.org/
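A minimal sketch of publishing a message with the Kafka Java producer client (this Java client is more recent than the Scala producer that was current in 2014); the broker address, topic and message are invented for the example.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092");   // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (Producer<String, String> producer = new KafkaProducer<>(props)) {
      // Messages with the same key land in the same partition, so their
      // order is preserved per key; consumers read the topic as an ordered log.
      producer.send(new ProducerRecord<>("events", "user-42", "page_view"));
    }
  }
}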

61

Hadoop: batch, but not only...

62

Tez

● Interactive processing on top of Hive and Pig

63

HBase

● Online database (realtime querying)
● NoSQL: column-oriented database
● Based on Google BigTable
● Storage on HDFS
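A minimal sketch of a write and a random read with the HBase Java client (HBase 1.x+ API; the API current in 2014 used HTable directly). The table, column family and values are invented for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath

    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Write one cell: row key "user-42", column family "info", qualifier "name".
      Put put = new Put(Bytes.toBytes("user-42"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("David"));
      table.put(put);

      // Random read by row key: the "online / realtime querying" part.
      Result result = table.get(new Get(Bytes.toBytes("user-42")));
      System.out.println(
          Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    }
  }
}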

64

Storm

● Streaming mode
● Plugs in well with Apache Kafka
● Allows manipulating data as it arrives

http://fr.slideshare.net/hugfrance/hugfr-6-oct2014ovhantiddos

http://fr.slideshare.net/miguno/apache-storm-09-basic-training-verisign

65

Cascading

● Application development platform on Hadoop
● APIs in Java: standard API, data processing, data integration, scheduler API

66

Scalding

● Scala API for Cascading

67

Phoenix

● Relational DB layer over HBase
● HBase access delivered as a JDBC client
● Performance: on the order of milliseconds for small queries, or seconds for tens of millions of rows
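Since Phoenix is exposed as a JDBC driver, querying it is plain JDBC; a minimal sketch, where the ZooKeeper quorum and the table are invented for the example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixQuery {
  public static void main(String[] args) throws Exception {
    // Older driver versions may need an explicit registration.
    Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");

    // The JDBC URL points at the ZooKeeper quorum of the HBase cluster.
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT user_id, name FROM users LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString("user_id") + " -> " + rs.getString("name"));
      }
    }
  }
}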

68

Spark

● Big data analytics, in-memory / on disk
● Complements Hadoop
● Fast and more flexible

https://speakerdeck.com/nivdul/lightning-fast-machine-learning-with-spark

http://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
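For comparison with the earlier examples, a minimal sketch of the same word count with the Spark Java API (RDDs, Spark 2.x signatures, which postdate this talk); the paths are invented for the example.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("wordcount");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/file.txt");

    // Same map / reduce logic as before, but intermediate data can stay in memory.
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .mapToPair(word -> new Tuple2<String, Integer>(word, 1))
        .reduceByKey((a, b) -> a + b);

    counts.saveAsTextFile("hdfs:///user/demo/wordcount-output");
    sc.stop();
  }
}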

69

??
