BIG DATA: HADOOP AND BEYOND

Daryl Heinz

[email protected]

AGENDA

• The Apache Software Foundation

• BigData and the role of Hadoop

• Overview of Hadoop

• The Hadoop Distributed File System (HDFS)

• Yet Another Resource Negotiator (YARN)

• Application types:

• Data at Rest (Batch)

• Data in Motion (Streaming)

• A Brief look at some “Hadoop EcoSystem” projects

• The Berkeley Data Analytic Stack

BIG DATA

• The 3 V’s and the issue of mutability

• What do you do with your current data infrastructure?

• Start using it IN CONCERT WITH big data frameworks

• BigData can be any or all of these (and more):

• Clickstream

• Geographic

• Sensor/Machine

• Sentiment

• Server Logs

• Text

• Big Data is poly-structured

THE APACHE SOFTWARE FOUNDATION

• OPEN

• The ASF provides support for the Apache Community of open-source software projects, which provide software products for the public good

• INNOVATION

• The ASF projects are defined by collaborative, consensus-based processes, an open, pragmatic software license and a desire to create high-quality software that leads the way in its field.

• COMMUNITY

• We consider ourselves not simply a group of projects sharing a server, but rather a community of developers and users.

• APACHE PROJECTS

• The all-volunteer ASF develops, stewards, and incubates more than 350 Open Source projects and initiatives that cover a wide range of technologies

• [NOTE] This is where the “professional open-source”, “hybrid” and “proprietary” vendors step in with their “distributions”

• http://www.apache.org/

ASF PROJECTS

• http://www.apache.org/

APACHE HADOOP

• The Hadoop project includes these modules:

• Hadoop Common

• The common utilities that support the other Hadoop modules.

• Hadoop Distributed File System (HDFS™)

• A distributed file system that provides high availability (HA) of poly-structured data

• Hadoop YARN

• A framework for job scheduling and cluster resource management.

• Hadoop MapReduce

• A YARN-based application type for batch parallel processing of large data sets.

• http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex

APACHE HADOOP

HDFS OVERVIEW

• The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on “commodity” or low-cost hardware

• Services are fault-tolerant

• Data is replicated

• Increased data capacity is provided by “horizontal” rather than “vertical” scaling

• A “virtual” file system “that looks like *nix” is provided to the user.

• HDFS is not POSIX
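
The replication and horizontal-scaling points above can be sketched in a few lines of plain Python. This is a toy model, not HDFS code: the 128 MB block size and replication factor of 3 match common HDFS defaults, but the round-robin placement, the function names and the node names are assumptions made for illustration (real HDFS placement is rack-aware).

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size: 128 MB
REPLICATION = 3                 # assumed replication factor


def place_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Split a file into blocks and assign each block to `replication` nodes.

    Placement is simple round-robin (hypothetical policy, for illustration only).
    Returns {block_index: [datanode, ...]}.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    n_blocks = math.ceil(file_size / block_size)
    return {
        b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        for b in range(n_blocks)
    }


def surviving_blocks(placement, failed_node):
    """Blocks that still have at least one replica after one node fails."""
    return [b for b, nodes in placement.items() if any(n != failed_node for n in nodes)]


nodes = ["dn1", "dn2", "dn3", "dn4"]          # hypothetical DataNode names
layout = place_blocks(file_size=1024 ** 3, datanodes=nodes)  # a 1 GB file
still_readable = surviving_blocks(layout, failed_node="dn1")
```

Adding a fifth entry to `nodes` increases capacity without changing the code, which is the sense of “horizontal” scaling; losing any single node leaves every block readable because each block lives on three different nodes.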

HDFS OVERVIEW: ASSUMPTIONS AND GOALS

• Hardware Failure is the norm rather than the exception.

• An HDFS instance may consist of thousands of server machines, each storing part of the file system’s data

• Fault detection and automatic recovery from faults is a core architectural element of HDFS

• The query application (or YARN application type) specifies the data type – translation to the data type is part of the query (if necessary)

• Storage format, data type and query data types are decoupled

• The three “V”s and the question of immutability are the user’s responsibility

• HDFS is designed for batch processing rather than interactive analysis (or streaming analysis) by users.

HDFS OVERVIEW: ASSUMPTIONS AND GOALS (2)

• The emphasis is on high volume of data accessed rather than low latency of data access

• Moving Computation is Cheaper than Moving Data

• A computation is efficient when executed where the data resides

• HDFS provides interfaces for applications (YARN) so they can be moved to where the data is located

• Portability Across Heterogeneous Hardware and Software Platforms

• The HDFS services are portable from one platform to another

• Portability facilitates adoption of HDFS as a viable virtual file system for a large range of applications, including non-ASF projects

HDFS ARCHITECTURE

• https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

YARN OVERVIEW

• Yet Another Resource Negotiator

• YARN is the computation framework of Hadoop whereas HDFS is the storage framework of Hadoop

• YARN and its services can support multiple application types – both batch (data at rest) and streaming (data in motion) oriented

• Two important points:

• The Resource Manager can be configured for HA

• YARN provides an API to bring legacy and new applications under the YARN resource management and application HA

• Refer to http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html#Writing_a_simple_Client

YARN COMPONENTS

• http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

The ResourceManager communicates with the NodeManager(s) (NM) for the status of tasks associated with the application type task running on each NM’s server.

The ResourceManager arbitrates resources among all the applications in the system.

The ApplicationMaster is an application-type specific library responsible for negotiating resources from the ResourceManager for its application type and working with the NodeManager(s) to execute and monitor the applications, or tasks.
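
The division of labour described above can be sketched as a toy model in plain Python. None of this is the YARN API: the class names mirror the YARN roles, but the methods, the memory-only resource model and the "most free memory first" policy are assumptions chosen to keep the example short.

```python
class NodeManager:
    """Tracks free memory on one server and the containers started there."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.containers = []

    def start_container(self, app_id, memory_mb):
        # Record that a container for this application is running here.
        self.containers.append((app_id, memory_mb))


class ResourceManager:
    """Arbitrates memory among all applications (hypothetical toy policy)."""

    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, memory_mb):
        # Grant the request on the node with the most free memory, if any fits.
        candidates = [nm for nm in self.node_managers if nm.free_mb >= memory_mb]
        if not candidates:
            return None
        chosen = max(candidates, key=lambda nm: nm.free_mb)
        chosen.free_mb -= memory_mb
        return chosen


class ApplicationMaster:
    """Negotiates containers from the ResourceManager for one application."""

    def __init__(self, app_id, resource_manager):
        self.app_id = app_id
        self.rm = resource_manager

    def run_tasks(self, count, memory_mb):
        placed = []
        for _ in range(count):
            node = self.rm.allocate(memory_mb)
            if node is None:
                break  # cluster is full; remaining tasks are not placed
            node.start_container(self.app_id, memory_mb)
            placed.append(node.name)
        return placed


cluster = [NodeManager("nm1", 4096), NodeManager("nm2", 2048)]
rm = ResourceManager(cluster)
batch = ApplicationMaster("batch-job", rm).run_tasks(count=3, memory_mb=2048)
stream = ApplicationMaster("stream-job", rm).run_tasks(count=1, memory_mb=1024)
```

Here the batch application consumes the whole cluster, so the streaming application gets nothing. Real schedulers use queues and capacity or fairness rules to avoid exactly this outcome.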

HADOOP ECOSYSTEM

• Many Apache projects, considered part of the “Hadoop Ecosystem” may be used without any reference to Hadoop. The following is only a sampling:

• Avro™: A data serialization system

• Cassandra™: A scalable multi-master database with no single points of failure.

• Chukwa™: A data collection system for managing large distributed systems.

• HBase™: A scalable, distributed NoSQL, key-value database that supports structured data storage for large tables

• Hive™: A SQL-presenting framework that provides abstraction over the M/R paradigm

• Kafka™: A high-throughput distributed messaging system

• Mahout™: A scalable machine learning and data mining library

• Mesos: Abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively

• NiFi: Successor to Flume. Supports scalable directed graphs of data routing, transformation, and system mediation logic.

• Phoenix: A JDBC “skin” around HBase

• Pig™: A high-level data-flow language and execution framework for parallel computation

• Samza: A distributed stream processing framework that uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management

• Spark: A fast and general engine for large-scale memory or disk-resident data processing

• ZooKeeper™: A high-performance coordination service for distributed applications.

PIG

• Found in various locations on the web

HIVE

• Defining a table:

hive> CREATE TABLE mytable (name STRING, age INT)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY ','

STORED AS TEXTFILE;

• ROW FORMAT DELIMITED with FIELDS TERMINATED BY ',' is a Hive-specific clause indicating that each row is comma-delimited text

• HiveQL statements are terminated with a semicolon ';'

• Other table operations:

• SHOW TABLES

• CREATE TABLE

• ALTER TABLE

• DROP TABLE

• Courtesy Hortonworks
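
What the table definition above means for a comma-delimited text file can be sketched in plain Python: the schema (a text name and an integer age) is applied when rows are read, not when they are stored. This is an illustration of the idea, not Hive code; the function name and sample data are hypothetical, and mapping unparsable values to `None` is intended to mirror Hive returning NULL for fields that do not fit the declared type.

```python
import csv
import io

# Mirrors the two-column table above: a text name and an integer age.
SCHEMA = [("name", str), ("age", int)]


def read_rows(text, schema=SCHEMA, delimiter=","):
    """Apply the schema to each delimited line at read time."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not fields:
            continue  # skip blank lines
        row = {}
        for (column, cast), raw in zip(schema, fields):
            try:
                row[column] = cast(raw)
            except ValueError:
                row[column] = None  # value does not fit the declared type
        rows.append(row)
    return rows


sample = "alice,34\nbob,not-a-number\n"  # hypothetical file contents
rows = read_rows(sample)
```

The stored file never changes; a different schema over the same bytes would simply produce different rows, which is why storage format and query data types can stay decoupled.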

WHAT REALLY HAPPENS WITH HIVE

• Courtesy Hortonworks

NIFI

• NiFi automates system-to-system dataflow

• dataflow: the automated and managed flow of information between systems

• dataflow patterns: Gregor Hohpe. Enterprise Integration Patterns

• http://www.enterpriseintegrationpatterns.com/

• Apache NiFi provides directed graphs of

• data routing

• Transformation

• system mediation logic

• Documentation: http://nifi.apache.org/docs.html

2/3/2016

NIFI OBJECTIVES (2)

• Web-based UI

• Configurable

• Flow can be modified at runtime

• Data Provenance

• Track dataflow from beginning to end

• Designed for extension

• Build your own processors

• Enables rapid development and effective testing

• Secure

• SSL, SSH, HTTPS, encrypted content, etc...

• Pluggable role-based authentication/authorization

PHOENIX

• A relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data

• Compiles a SQL query into a series of HBase scans

• The running of the scans is orchestrated to produce regular JDBC result sets

PHOENIX ON TOP OF HBASE

• The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema

• Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.

PERFORMANCE AND MEETING YOUR SLA

• It’s in the configuration: over 2400 Java properties for Hadoop in 4 files

• About 20 more for Hive

• About 50 more for Pig

• About 15 more for Spark

• 400 more for HBase

• …

• Each framework is reliant on shell environment variables…

• Don’t forget about disk I/O, network issues, serialization…

CONFIGURATION PROPERTIES

• Properties are pervasive throughout the components of the Hadoop Ecosystem

• All components are “shipped” with “default” configuration settings that must be reviewed for applicability to each use case and cluster environment

• “Administrators” must review the properties and decide the following:

• What properties are the organization’s “default”

• What properties must be marked as final (a <final>true</final> element on the property)

• What properties are intended to be superseded by either the core-site.xml file or by job-specific parameters
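
The layering described above (shipped defaults, site overrides, job parameters, with final properties locked) can be sketched in plain Python. This is a simplified model of the behaviour, not Hadoop's own configuration loader: `dfs.replication` and `dfs.blocksize` are real HDFS property names, but the values, the `parse_properties` and `merge` helpers and the three layers are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def parse_properties(xml_text):
    """Return {name: (value, is_final)} from a Hadoop-style <configuration> document."""
    props = {}
    for prop in ET.fromstring(xml_text.strip()).findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        is_final = (prop.findtext("final") or "").strip().lower() == "true"
        props[name] = (value, is_final)
    return props


def merge(*layers):
    """Apply layers in order; later layers win unless an earlier one was final."""
    merged = {}
    for layer in layers:
        for name, (value, is_final) in layer.items():
            if name in merged and merged[name][1]:
                continue  # property was marked final: ignore the override
            merged[name] = (value, is_final)
    return {name: value for name, (value, _) in merged.items()}


defaults = parse_properties("""
<configuration>
  <property><name>dfs.replication</name><value>3</value></property>
  <property><name>dfs.blocksize</name><value>134217728</value></property>
</configuration>
""")

site = parse_properties("""
<configuration>
  <property>
    <name>dfs.replication</name><value>2</value><final>true</final>
  </property>
</configuration>
""")

# Hypothetical job-level overrides supplied by a user.
job = {"dfs.replication": ("1", False), "dfs.blocksize": ("268435456", False)}

effective = merge(defaults, site, job)
```

The job can raise the block size because nothing locked it, but its attempt to lower replication is ignored because the site file marked that property final.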

PERFORMANCE “TUNING” THE HADOOP CLUSTER (2)

• There is NOT a “standard” configuration for optimal performance that can be set at installation time

• Google on “Hadoop Performance Tuning” for many URLs to reference on this topic.

• Hadoop vendors have UIs to collect and facilitate preliminary performance analysis

• Apache Ambari is strongly suggested to start the journey

• Other tools may be chosen and used

• Selection criteria are usually based on specific function and user familiarity

HDFS-DEFAULT.XML / HDFS-SITE.XML

• Of the 1200 properties, some of immediate interest for the NameNode are:

• Some properties of immediate interest for the DataNodes are:

• Images from http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-common/ClusterSetup.html

HADOOP ENVIRONMENT SCRIPTS

• In addition to ensuring the configuration properties in the .xml files are set appropriately, the environment in which the Hadoop daemons execute must be set.

• Values for environment variables that influence the Hadoop daemons are set in the following files:

• Example hadoop-env.sh

• https://github.com/hanborq/hadoop/blob/master/example-confs/conf.secure/hadoop-env.sh

• Example yarn-env.sh

• https://apache.googlesource.com/hadoop-common/+/2942a5bfbafd67655b0859d339a4e95a0b6d5044/hadoop-yarn-project/hadoop-yarn/conf/yarn-env.sh

• Properties specified in either of these files (with caveats) can be superseded as job parameters (via the command line, properties objects or vendor-specific management consoles).

AMBARI CONSOLE

• http://ambari.apache.org/1.2.0/installing-hadoop-using-ambari/content/

WHEN TO [NOT] USE HADOOP

• When to use Hadoop:

• Your data sets are really big

• You celebrate data diversity

• You have mad programming skills

• You are building an enterprise data hub for the future

• You find yourself throwing away perfectly good data

• When not to use Hadoop:

• You need answers in a hurry

• Your queries are complex and require extensive optimization

• You require random interactive access to data

• You want to store sensitive data

• You want to replace your data warehouse

• (http://www.facebook.com/pages/Datanami/124760547631010)

SLOW FRAMEWORK, COMPLEX DATA

• http://news360.com/article/246140284

BERKELEY AMPLAB

• 18 Commercial Sponsors:

• Cisco

• Cloudera

• Ericsson

• Facebook

• GE

• Hortonworks

• Intel

• Microsoft

• Oracle

• Splunk

• VMware

• Yahoo and more…

• Began January 2011

• 8 Faculty, 40 students, 3 SW engineers

• Funding from

• DARPA

• Xdata

• NSF

• CISE Expedition Grant

• Amazon, Google, SAP

• 18 commercial sponsors

APPROACH TO BDAS GOALS

• Support the combination of batch, streaming and interactive computations with relative ease

• A single execution model supports all computation models

• Support interactive and streaming computations via use of memory and parallelism

• Memory transfer rates far exceed any disk configuration capability

• RAM/SSD hybrid memories are beginning to appear

• Support the development of algorithms beyond simple MR or current ML algorithms such as recommendation engines and K-means clustering

• Provides Python and Scala shells

• Provides abstractions for graph based and ML algorithms

• Be compatible with existing Hadoop/HDFS and its ecosystem

• Interoperates with existing storage and input formats (HDFS, Hive, Flume, etc)

• Supports existing execution models (Hive, GraphLab, etc)

THE BDAS PROJECTS

• https://amplab.cs.berkeley.edu/software/

MESOS

• Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively

• Runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with APIs for resource management and scheduling across entire datacenter and cloud environments

• Compatible with current Hadoop-related ASF projects

• Currently supporting 3500+ servers at Twitter

• Scalability to 10000s

• https://amplab.cs.berkeley.edu/projects/mesos-dynamic-resource-sharing-for-clusters/

SPARK

• Spark is a general engine allowing the combination of multiple types of computations (e.g., SQL queries, text processing and machine learning) that with Hadoop have required learning different engines

• Hadoop is disk oriented, Spark is memory oriented

• [Core] Spark is now an ASF Project

SPARK (2)

• Spark as an engine is the basis of Spark SQL, Spark Streaming, MLlib, GraphX

• Spark offers simple APIs in Python, Java, Scala and SQL, and built-in libraries

• Spark can run in Hadoop clusters and access any Hadoop data source

SPARK CORE

• Spark Core provides the basic capabilities of Spark:

• Task scheduling

• Memory management

• Fault recovery

• Storage system usage

• Spark Core also provides the API that defines Resilient Distributed Datasets (RDDs)

• RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel

• Spark Core provides many APIs for building and manipulating these collections
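
The RDD idea above (a collection split into partitions, transformed lazily, recomputable from its lineage) can be sketched with a pure-Python stand-in. `ToyRDD` is a hypothetical class written for this illustration and is not the Spark API; the method names are borrowed from Spark only to make the analogy easier to follow, and the partitions are processed sequentially rather than on separate nodes.

```python
from functools import reduce


class ToyRDD:
    """Partitions plus a recorded chain of transformations (the lineage)."""

    def __init__(self, partitions, lineage=()):
        self.partitions = partitions
        self.lineage = lineage

    @classmethod
    def parallelize(cls, items, num_partitions):
        items = list(items)
        # Deal the items out to partitions, one stride per partition.
        return cls([items[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        # Lazy: only record the step, do not run it yet.
        return ToyRDD(self.partitions, self.lineage + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self.partitions, self.lineage + (("filter", predicate),))

    def _compute(self, partition):
        # Replaying the lineage is also how a lost partition could be rebuilt.
        data = partition
        for kind, fn in self.lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    def collect(self):
        return [x for part in self.partitions for x in self._compute(part)]

    def reduce(self, fn):
        # Reduce each partition independently, then combine the partial results.
        computed = (self._compute(part) for part in self.partitions)
        partials = [reduce(fn, part) for part in computed if part]
        return reduce(fn, partials)


numbers = ToyRDD.parallelize(range(1, 11), num_partitions=3)
even_squares = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
total = even_squares.reduce(lambda a, b: a + b)
```

Nothing runs until `collect` or `reduce` is called, and each partition is processed without reference to the others, which is what lets a real engine spread the work across compute nodes.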

SPARK SQL

• Spark SQL provides a SQL interface to Spark that represents database tables as Spark RDDs and translates SQL queries into Spark operations

• Spark SQL allows developers to intermix SQL queries with the programmatic data manipulations supported by RDDs in Python, Java and Scala, all within a single application

• Spark SQL was added to Spark in version 1.0

SHARK

• Shark is a project out of UC Berkeley that pre-dates Spark SQL and is being ported to work on top of Spark SQL

• Shark provides additional functionality so that Spark can act as drop-in replacement for Apache Hive

• This includes a HiveQL shell, as well as a JDBC server that makes it easy to connect external graphing and data exploration tools

SPARK STREAMING

• Spark Streaming provides an API for manipulating data streams that closely resembles the Spark Core’s RDD API

• Programmers familiar with RDDs can learn Spark Streaming easily and move between applications that manipulate data stored in memory, on disk, or arriving in real time

• Examples of data streams:

• log files generated by production web servers

• queues of messages containing status updates posted by users of a web service

• Spark Streaming provides the same degree of fault tolerance, throughput, and scalability that the Spark Core provides

MLLIB

• Spark comes with a library containing common machine learning (ML) functionality called MLlib

• MLlib provides multiple types of machine learning algorithms, including

• binary classification

• Regression

• Clustering

• collaborative filtering

• Model evaluation

• Data import

• MLlib provides lower-level ML primitives including a generic gradient descent optimization algorithm

• All of these methods are designed to scale out across a cluster.
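
A generic gradient descent primitive of the kind mentioned above can be sketched in plain Python. This is not MLlib code: the function names, the learning rate, the step count and the tiny noiseless data set are all assumptions chosen so the example converges quickly on a single machine.

```python
def gradient_descent(gradient, initial, learning_rate=0.05, steps=2000):
    """Repeatedly step against the gradient; works for any gradient function."""
    weights = list(initial)
    for _ in range(steps):
        grad = gradient(weights)
        weights = [w - learning_rate * g for w, g in zip(weights, grad)]
    return weights


def mse_gradient(points):
    """Gradient of mean squared error for the model y = w0 + w1 * x."""
    n = len(points)

    def gradient(weights):
        w0, w1 = weights
        errors = [(w0 + w1 * x - y, x) for x, y in points]
        return [
            2 / n * sum(e for e, _ in errors),      # derivative w.r.t. w0
            2 / n * sum(e * x for e, x in errors),  # derivative w.r.t. w1
        ]

    return gradient


points = [(x, 2 * x + 1) for x in range(5)]  # noiseless samples of y = 2x + 1
intercept, slope = gradient_descent(mse_gradient(points), initial=[0.0, 0.0])
```

The optimizer knows nothing about regression; only `mse_gradient` does. In a cluster setting the two sums inside the gradient are the part that would be computed per partition and then combined.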

GRAPHX

• GraphX is a library added in Spark 0.9 that provides an API for manipulating graphs (e.g., a social network’s friend graph) and performing graph-parallel computations.

• GraphX extends the Spark RDD API, allowing creation of a directed graph with arbitrary properties attached to each vertex and edge.

• GraphX also provides a set of operators for manipulating graphs (e.g., subgraph and mapVertices) and a library of common graph algorithms (e.g., PageRank)
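
As an illustration of the kind of graph algorithm mentioned above, here is a minimal PageRank over a directed edge list in plain Python. It does not use GraphX (whose API is Scala); the damping factor of 0.85, the fixed iteration count, the dangling-vertex handling and the three-vertex graph are assumptions made for the sketch.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Iterative PageRank on a directed graph given as (source, target) pairs."""
    vertices = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in vertices}
        for src, targets in out_links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # Dangling vertex: spread its rank evenly over all vertices.
                for v in vertices:
                    new_rank[v] += damping * rank[src] / n
        rank = new_rank
    return rank


# Hypothetical graph: a links to b and c, b links to c, c links back to a.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
ranks = pagerank(edges)
```

Each iteration only needs a vertex's current rank and its outgoing edges, which is the property that makes the computation graph-parallel.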

SPARKR

• R package that provides a light-weight frontend to use Apache Spark

• Exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster

• As of April 2015, SparkR has been officially merged into Apache Spark.

https://github.com/apache/spark/pull/5096

TACHYON

• High-throughput, fault-tolerant in-memory storage

• Compatible with HDFS

• Supports Spark and Hadoop

• Succinct (requires Tachyon):

• Queries on Compressed RDDs

BLINKDB

• A massively parallel, approximate query engine for running interactive SQL queries on large volumes of data

• Allows users to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars

• BlinkDB was demonstrated live at VLDB 2012 on a 100-node Amazon EC2 cluster, answering a range of queries on 17 TB of data in less than 2 seconds (over 200x faster than Hive), within an error of 2-10%.

• Deployed at Facebook
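
The accuracy-for-speed trade described above can be sketched in plain Python: answer an aggregate from a random sample and report an error bar alongside it. This is a textbook confidence-interval estimate, not BlinkDB's implementation; the function name, the synthetic data, the sample size and the 95% normal-approximation multiplier of 1.96 are assumptions for the example.

```python
import math
import random


def approximate_mean(values, sample_size, seed=0, z=1.96):
    """Estimate the mean from a random sample; return (estimate, error margin)."""
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    mean = sum(sample) / sample_size
    variance = sum((x - mean) ** 2 for x in sample) / (sample_size - 1)
    margin = z * math.sqrt(variance / sample_size)  # approximate 95% error bar
    return mean, margin


# Synthetic "table column": 100,000 values drawn around 100 with spread 15.
data_rng = random.Random(42)
population = [data_rng.gauss(100.0, 15.0) for _ in range(100_000)]

estimate, margin = approximate_mean(population, sample_size=1_000)
exact = sum(population) / len(population)  # the full scan, for comparison
```

The sample touches 1% of the rows, and the margin shrinks with the square root of the sample size, so the user can choose how much accuracy to buy with additional response time.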

http://www.vldb2012.org/

CLUSTER MANAGEMENT / RESOURCE NEGOTIATION

BDAS/HADOOP COMPATIBILITY

• Supports existing interfaces

BDAS/HADOOP COMPATIBILITY (2)

• Uses existing interfaces

SUMMARY

• The Apache Software Foundation

• BigData and the role of Hadoop

• Overview of Hadoop

• The Hadoop Distributed File System (HDFS)

• Yet Another Resource Negotiator (YARN)

• Application types:

• Data at Rest (Batch)

• Data in Motion (Streaming)

• A Brief look at some “Hadoop EcoSystem” projects

• The Berkeley Data Analytic Stack

• Conclusion: They are just a bunch of processes. Some designed to exploit disk, some designed to exploit memory. They can co-exist on the same servers.

THANK YOU! Q & A
