Big Data
A brief introduction to Big Data & Hadoop
01/01/16 F. v. Noort
Big Data – A brief introduction 2
Big Data – A definition
• Big data usually includes data sets (both structured and unstructured) with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time.
• Doug Laney (2001) 3V’s: “data growth challenges and opportunities defined as being three-dimensional, i.e. increasing Volume (amount of data), Velocity (speed of data in and out), and Variety (range of data types and sources)”
Big Data – A definition
• Gartner (2012): "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."
Big Data - Characterization
The original 3V's have been expanded into the following, more complete set of characteristics:
• Volume: the quantity of generated & stored data
• Velocity: the speed at which data is generated & processed
• Variety: the type and nature of the data
• Variability: inconsistency of the data set can hamper processes to handle and manage it
• Veracity: the quality of captured data can vary greatly, affecting accurate analysis
• Complexity: managing data coming from multiple sources can be very challenging; data must be linked, connected, and correlated so users can query and process it effectively
Big Data versus Business Intelligence (BI)
• Business Intelligence uses descriptive statistics with high information density data to measure things, detect trends, etc.
• Big Data (analytics) uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low information density, in order to reveal relationships and dependencies or to predict outcomes and behaviors.
Architecture: Client Server
[Diagram: many clients all connecting to a single server]
Clients can always overwhelm the system!
Architecture: Storage Area Network
[Diagram: clients connect through a central contact point to multiple servers sharing the storage network]
Clients can always overwhelm the system!
Architecture: Google File System (GFS)
• Instead of having a giant file storage appliance sitting in the back end, use industry-standard hardware on a large scale
• Drive high performance through the sheer number of components
• Reliability through redundancy & replication
• Computation is done where the data is
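The "computation is done where the data is" idea can be sketched as a toy scheduler that assigns each task to a node already holding its input block (node and block names below are made up for illustration; this is not the Hadoop scheduler):

```python
# Toy data-locality scheduler: run each task on a node that already
# holds its input block, falling back to any node otherwise.
# All names are illustrative, not real Hadoop/GFS identifiers.

def schedule(block_locations, all_nodes):
    """block_locations: dict block_id -> list of nodes storing it."""
    assignments = {}
    for block, holders in block_locations.items():
        # Prefer a node that has the data locally (no network transfer).
        assignments[block] = holders[0] if holders else all_nodes[0]
    return assignments

blocks = {"blk_1": ["node-a", "node-c"], "blk_2": ["node-b"]}
print(schedule(blocks, ["node-a", "node-b", "node-c"]))
# {'blk_1': 'node-a', 'blk_2': 'node-b'}
```

Moving the code instead of the data keeps the bulk of I/O on local disks rather than the network, which is what makes the scale-out architecture fast.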
[Diagram: a grid of sixteen commodity nodes, each combining storage and compute]
Hadoop
• Based on work from the Google File System + MapReduce
• Doug Cutting & Mike Cafarella created their own version: Hadoop (named after the toy elephant of Doug's son)
• Current distributions are based on open-source and vendor work:
– Apache Hadoop
– Cloudera - CDH4 w/ Impala
– Hortonworks
– MapR
– AWS
– Windows Azure HDInsight
Why use Hadoop?
• Scalability: scales to petabytes or more
• Fault tolerant
• Faster: parallel data processing
• Better: suited for particular types of Big Data problems
• Open source
• Low cost: can be deployed on commodity hardware
Hadoop Core Architecture
The Hadoop core comprises:
• A distributed file system
– HDFS: Hadoop Distributed File System (based on GFS)
– File sharing & data protection across physical servers
• A processing paradigm
– MapReduce: distributed computing across physical servers
[Diagram: the MapReduce layer sits on top of HDFS]
HDFS (1/2)
Hadoop Distributed File System
• Written in Java
• Sits on top of the native file system
• Designed to handle very large files with streaming data access patterns
• Uses blocks to store a file or parts of a file; large files are split into blocks
• Built on x86 standards
• Lots of flexibility: reference architectures for many types of servers
[Diagram: the Hadoop Distributed File System spanning multiple x86 servers]
HDFS (2/2)
• HDFS file blocks
– 64 MB (default), 128 MB (recommended)
– 1 HDFS block is backed by multiple operating system (OS) blocks
• Blocks are replicated (default 3x) to multiple nodes
• Allows for node failure without data loss
• Two key services:
– Master NameNode
– Many DataNodes
• Checkpoint Node (Secondary NameNode)
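The block mechanics above can be illustrated with a small sketch (illustrative only; real HDFS does this inside the NameNode and DataNode services):

```python
# Sketch of HDFS-style block splitting: a large file is cut into
# fixed-size blocks, and each block is replicated to several nodes.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the recommended block size
REPLICATION = 3                  # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_index, block_length) pairs covering the file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks

# A 300 MB file needs three blocks: 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))             # 3
print(blocks[-1][1] / 2**20)   # 44.0
```

Note that only the last block is smaller than the block size; every other block is full, which keeps the NameNode's bookkeeping simple.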
MapReduce
"Take your task, which is data-oriented, chunk it up, and distribute it over the network such that every piece of work is done within the network by the machine that has the piece of data that needs to be worked on."

MapReduce:
• Processing paradigm that pairs with HDFS
• Distributed computation algorithm that pushes the compute down to each of the x86 servers
• Fault tolerant
• Parallelized (scalable) processing
• Combination of a Map and a Reduce procedure:
– Map procedure: performs filtering and sorting of the data
– Reduce procedure: performs summary operations
Other Hadoop tools/frameworks
• Data access:
– Hive, Pig, Mahout
• Tools:
– Sqoop, Flume
Hadoop Architecture
Main nodes of Hadoop
• Hadoop Distributed File System (HDFS) nodes
– NameNode
– DataNode
• MapReduce nodes
– JobTracker
– TaskTracker
HDFS - NameNode
• Single master service for HDFS
• Single point of failure (HDFS 1.x)
• Stores file-to-block-to-location mappings in the namespace (manages the file system namespace and metadata)
• Don't use inexpensive commodity hardware for this node
• Large memory requirements: keeps the entire file system metadata in memory
HDFS - DataNode
• Many per Hadoop cluster
• Stores blocks on local disk
• Manages blocks with data and serves them to clients
• Checksums on blocks => fault-tolerant data store
• Clients connect to DataNodes for I/O
• Sends frequent heartbeats ("hey, I'm alive" pings, every few seconds) to the NameNode
• Sends block reports to the NameNode
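A minimal sketch of the heartbeat idea, assuming a simple timeout-based liveness rule (timestamps, node names, and the timeout value are all illustrative, not HDFS defaults):

```python
# Toy liveness check: the NameNode considers a DataNode dead when no
# heartbeat has arrived within a timeout. Values are illustrative.

def live_datanodes(last_heartbeat, now, timeout=30.0):
    """last_heartbeat: dict node -> timestamp of its last heartbeat."""
    return {n for n, t in last_heartbeat.items() if now - t <= timeout}

beats = {"dn1": 100.0, "dn2": 95.0, "dn3": 40.0}
print(sorted(live_datanodes(beats, now=110.0)))  # ['dn1', 'dn2']
```

When a DataNode stops heartbeating, the NameNode can re-replicate its blocks from the surviving replicas, which is how the cluster tolerates node failure without data loss.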
HDFS Write operation
[Diagram: an HDFS client writes File 1 (Blocks 1-3) to a cluster of three racks of six DataNodes each, coordinated by the NameNode]
• The client divides the file into blocks
• The client contacts the NameNode to write data
• The NameNode replies: write to these nodes (DN1, DN7, DN15)
• Each block ends up on three DataNodes, spread over the racks
• DataNodes replicate data blocks, orchestrated by the NameNode
• Default: 3 replicas
• Rack-aware system!
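A toy version of rack-aware placement, assuming one replica on the writer's node and two more in another rack (real HDFS placement policies vary by version; all names here are illustrative):

```python
# Sketch of rack-aware replica placement for 3 replicas: the first
# replica goes to the writer's node, the remaining two to a different
# rack, so a whole-rack failure never loses all copies of a block.
# Illustrative only; not the actual HDFS placement code.

def place_replicas(writer_rack, racks):
    """racks: dict rack -> list of nodes. Returns 3 chosen nodes."""
    first = racks[writer_rack][0]                      # local replica
    other_rack = next(r for r in racks if r != writer_rack)
    second, third = racks[other_rack][0], racks[other_rack][1]
    return [first, second, third]

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn7", "dn8"]}
print(place_replicas("rack1", racks))  # ['dn1', 'dn7', 'dn8']
```

The key property is that the three replicas never share a single rack, so losing one rack (e.g. a top-of-rack switch failure) still leaves at least one readable copy.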
HDFS Read operation
[Diagram: an HDFS client reads a file from the same three-rack cluster; each block is available on three DataNodes]
• The client contacts the NameNode to read data
• The NameNode replies with the nodes on which the blocks can be found
• The client reads each block directly from one of its DataNodes
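The read path can be sketched with plain dictionaries standing in for the NameNode's block map and the DataNodes (illustrative only, not the real HDFS client API):

```python
# Toy HDFS read path: ask the "NameNode" (a dict here) where each
# block lives, then fetch each block from one of its DataNodes and
# reassemble the file. All structures stand in for real RPC calls.

def read_file(block_map, datanodes):
    """block_map: ordered dict block -> list of nodes holding it.
    datanodes: dict node -> dict block -> bytes."""
    out = b""
    for block, holders in block_map.items():
        node = holders[0]              # pick the closest/first replica
        out += datanodes[node][block]  # stream the block from that node
    return out

block_map = {"blk_1": ["dn1"], "blk_2": ["dn7"]}
datanodes = {"dn1": {"blk_1": b"Hello, "}, "dn7": {"blk_2": b"HDFS!"}}
print(read_file(block_map, datanodes))  # b'Hello, HDFS!'
```

Note that the NameNode only hands out locations; the actual data bytes flow directly between the client and the DataNodes, keeping the master out of the data path.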
HDFS 2.0 Features
• NameNode High Availability
– Two redundant NameNodes in an active/passive configuration
– Manual or automated failover
• NameNode Federation
– Multiple independent NameNodes using the same collection of DataNodes
Hadoop MapReduce
• Moves the code to the data
• JobTracker
– Master service to monitor jobs
• TaskTracker
– Multiple services to run tasks
– Runs on the same physical machine as a DataNode
• A job contains many tasks
• A task contains one or more task attempts
MapReduce JobTracker
• One per Hadoop cluster
• Receives job requests submitted by clients
• Schedules jobs in FIFO order
• Schedules & monitors MapReduce jobs on TaskTrackers
• Issues task attempts to TaskTrackers
• Single point of failure for MapReduce
MapReduce TaskTracker
• Runs on the same node as the DataNode service
• Many per Hadoop cluster
• Sends heartbeats and task reports to the JobTracker
• Executes MapReduce operations
• Configurable number of map and reduce slots
• Runs map and reduce task attempts
HDFS Architecture: Master & Slave
[Diagram: an HDFS client talks to the master services (NameNode, Secondary NameNode, JobTracker); each slave node runs a DataNode with a co-located TaskTracker, together forming the MapReduce distributed data processing layer]
Note:
• Hadoop 1.0 has only 1 NameNode
• Hadoop 2.0 has an active & a passive NameNode
How MapReduce works (1/3)
• MapReduce is a combination of a Map and a Reduce procedure:
– Map procedure: performs filtering and sorting of the data
– Reduce procedure: performs summary operations
How MapReduce works (2/3)
Scenario: get the sum of sales grouped by ZipCode.

Input records (CustId, ZipCode, Amount) are spread over three DataNodes and read by 2 map jobs. The Map phase emits (ZipCode, Amount) pairs:

(6654FD, €75), (1534CD, €60), (5734CD, €30), (1184AN, €15), (5734CD, €65), (6654FD, €22), (5734CD, €15), (4484AN, €10), (1534CD, €95), (4484AN, €55), (4484AN, €25), (1184AN, €15)
How MapReduce works (3/3)
Scenario: get the sum of sales grouped by ZipCode.

Shuffle phase: the map output is redistributed so that all pairs with the same ZipCode end up at the same reducer.

Sort (pairs grouped per key):
5734CD: €30, €65, €15
4484AN: €10, €25, €55
1534CD: €60, €95
6654FD: €75, €22
1184AN: €15, €15

Sum (reducer output):
5734CD: €110
1534CD: €155
4484AN: €90
1184AN: €30
6654FD: €97
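The whole job above can be simulated in a few lines of plain Python (not the Hadoop API, but the map -> shuffle/sort -> reduce flow is the same, and it reproduces the sums on this slide):

```python
# In-process simulation of the slide's MapReduce job: "sum of sales
# grouped by ZipCode", using the mapper outputs shown above.
from collections import defaultdict

records = [  # (ZipCode, Amount) pairs emitted by the two mappers
    ("6654FD", 75), ("1534CD", 60), ("5734CD", 30), ("1184AN", 15),
    ("5734CD", 65), ("6654FD", 22), ("5734CD", 15), ("4484AN", 10),
    ("1534CD", 95), ("4484AN", 55), ("4484AN", 25), ("1184AN", 15),
]

def map_phase(rows):
    # Map: emit (key, value) pairs; here the rows are already keyed.
    return [(zipcode, amount) for zipcode, amount in rows]

def shuffle(pairs):
    # Shuffle/sort: group all values by key, as Hadoop does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: one summary operation per key, here a sum.
    return {key: sum(values) for key, values in groups.items()}

totals = reduce_phase(shuffle(map_phase(records)))
print(totals["5734CD"], totals["1534CD"], totals["4484AN"])  # 110 155 90
```

In a real cluster the three phases run on different machines, and each reducer only ever sees the keys shuffled to it; this single-process version keeps the same structure without the distribution.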
The Hadoop Ecosystem
• Data access:
– Hive
– Pig
– Mahout
• Tools:
– Sqoop
– Flume
Hive
• Declarative language
• Allows users to write SQL-like queries (not ANSI SQL)
• Analytics area
• Structures data
• Data lives in tables
• Tables persist
[Diagram: Hive sits on top of MapReduce, which sits on top of HDFS]
PIG
• Procedural language (Pig Latin)
• Generates one or more MapReduce jobs
• Efficiency in computing
• Handles structured and unstructured data
• Data lives in variables
• Values may not be retained
[Diagram: PIG joins Hive on top of MapReduce and HDFS]
Mahout
• Library for scalable machine learning (written in Java)
• Classification, clustering, pattern mining, etc.
[Diagram: Mahout joins Hive and PIG on top of MapReduce and HDFS]
Sqoop
• Transfers data to and from relational databases
• Compression of the transferred data is a supported feature
[Diagram: Sqoop joins the stack on top of MapReduce and HDFS]
Flume
• An application for moving streaming data into a Hadoop cluster
• A massively distributable framework for event-based data
[Diagram: the complete stack: Hive, PIG, Mahout, Sqoop, and Flume on top of MapReduce and HDFS]