Architectural Road Map for Hadoop Ecosystem Implementation
Sudhir Nallagangu

Page 1: Hadoop -  Architectural road map for Hadoop Ecosystem

Architectural Road Map for Hadoop Ecosystem

Implementation

Sudhir Nallagangu

Page 2: Hadoop -  Architectural road map for Hadoop Ecosystem

Content Index

• Industry Definitions, History and Capabilities

• Hadoop EcoSystem Architecture/Components

• Road Map for Implementation

• Architecture Decision points

• Use cases and Data Sciences

• Questions?

Page 3: Hadoop -  Architectural road map for Hadoop Ecosystem

Industry Definitions, History and Capabilities

Definition - Big Data is a broad, generic term for data sets so large that traditional data processing (RDBMS) techniques and infrastructure are not adequate. Two core challenges are faced:
• Storage - storing petabytes of structured, semi-structured and unstructured data that grows by terabytes daily.
• Processing - processing, analyzing and visualizing that data in a reasonable amount of time.

History - Google, faced with the task of saving, indexing and searching billions of web pages, turned to distributed storage and processing innovations. They published their techniques as white papers: the Google File System (GFS, 2003) for the storage challenge and MapReduce (2004) for the processing challenge.

Inspired by the above, an open source development team created the Hadoop core (HDFS and MapReduce) in 2006. It provided techniques to manage Big Data using commodity hardware (no supercomputers) in a regular enterprise datacenter setting.
• HDFS - a distributed file storage system that meets ever-growing storage needs through horizontally scalable infrastructure.
• MapReduce - a distributed "divide and conquer" processing technique that leverages "data locality". Data locality means execution/processing happens where the data resides, which is more efficient than transferring data over the network.
• The core system and services are designed to operate, survive and recover from hardware and software failures, treating them as regular occurrences.

Page 4: Hadoop -  Architectural road map for Hadoop Ecosystem

Industry Definitions, History and Capabilities

Capabilities and Use Cases

Data Warehouse modernization - identified as the early adoption use case. Integrate big data and data warehouse capabilities to increase operational efficiency, and optimize the data warehouse to enable new types of analysis. Use big data technologies to set up a staging area or landing zone for new data before determining what should be moved to the data warehouse.

Big Data Exploration - addresses the challenge that every large organization faces: information is stored in many different systems and silos. Enables you to explore and mine big data to find, visualize, and understand all your data and improve decision making; by creating a unified view of information across all data sources you gain enhanced value and new insights. Supports important decisions such as pricing, and helps arrive at Master Data Management.

Customer Behavior Analysis, Retention, Segmentation - with access to consumer behavior data, companies can learn what prompts users to stick around and learn about customer characteristics to improve marketing efforts. Data-science-led algorithms help build recommendation systems.

Cyber Security Intelligence - data-science-led intrusion, fraud and anomaly detection systems: lower risk, detect fraud and monitor cyber security in real time. Augment and enhance cyber security and intelligence analysis platforms with big data technologies to process and analyze new types of data.

Industry specific - every industry has invested in big data to address challenges specific to that industry. Healthcare, for instance, uses patient data to improve patient outcomes; agriculture uses it to boost crop yields. One interesting case involves the UN using cell phone data to track the spread of malaria in Africa.

Sources of Data
• Traditional data sources - RDBMS and transactional data, feeds from current data warehouses.
• User and machine-generated content - email, help desk calls, social media, web and software logs.
• Internet of Things - cameras, information-sensing mobile devices, aerial sensory technologies, and genomics.

Page 5: Hadoop -  Architectural road map for Hadoop Ecosystem

Hadoop EcoSystem Components

Early Hadoop supported the core services of storage (HDFS) and processing (MapReduce).
• All interfaces to HDFS and MapReduce were through low-level Java APIs. There were no higher-level abstractions.
• There was no robust security model built in.
• Early adopters were heavy tech corporations like Yahoo, Facebook, LinkedIn, and others.

The Hadoop ecosystem evolved to meet traditional enterprise needs (and is still evolving). It includes, but is not limited to:
• Higher-level abstractions (Pig, Hive, Cascading)
• Data ingestion tools (DistCp, Sqoop, Flume)
• Random real-time read/write access for large datasets (HBase)
• Distributed coordination services (ZooKeeper)
• Workflow engines (Oozie, Falcon)
• Security (Apache Ranger, Knox)
• Improved resource management (YARN, now part of the Hadoop core)
• In-memory processing (Spark)
• Machine learning (Mahout)
• Provisioning and monitoring services

Architecture and Design Goals
• As the Hadoop ecosystem has matured rather rapidly, enterprises have the task of effectively integrating these several tools into complete solutions.
• A rich ecosystem of tools, APIs, and development options provides choice and flexibility, but can make it challenging to determine the best choices for implementing a solution.
• Enterprises need to understand the role of each tool and sub-component; architects need to ask the right questions, pick the right tools, and make the right decisions for the implementation.

Page 6: Hadoop -  Architectural road map for Hadoop Ecosystem

Hadoop EcoSystem Components..

(Picture Courtesy HortonWorks HDP)

Page 7: Hadoop -  Architectural road map for Hadoop Ecosystem

Hadoop EcoSystem Core

HDFS
• A scalable, fault-tolerant, distributed storage system spread across many servers. An HDFS cluster is composed of a NameNode (master) that manages the cluster metadata and DataNodes (workers) that store the actual data.
• Key features:
• Provides the standard file system commands familiar from POSIX systems (a minimal Java client sketch follows this list).
• Rack awareness - allows a node's physical location to be considered when allocating storage and scheduling tasks.
• Replication - provides a default replication factor of 3 to survive the failure of a particular DataNode; errors are detected and a separate copy is automatically created if required.
• Minimal data motion - MapReduce moves compute processes to the data on HDFS, not the other way around. Processing tasks can occur on the physical node where the data resides, which significantly reduces network traffic and improves overall latency/performance.
• Utilities diagnose the health of the file system and can rebalance data across nodes.
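As a quick illustration of the file system abstraction above, the sketch below is a minimal, hedged example using the standard org.apache.hadoop.fs.FileSystem Java API to create a directory, write a small file and list it. The NameNode URI and paths are hypothetical placeholders; in practice the client picks up fs.defaultFS from core-site.xml.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath;
        // the NameNode URI below is only an illustrative default.
        Configuration conf = new Configuration();
        conf.setIfUnset("fs.defaultFS", "hdfs://namenode-host:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path dir = new Path("/data/landing");      // hypothetical landing zone
            fs.mkdirs(dir);

            // Write a small file; HDFS handles block placement and replication.
            Path file = new Path(dir, "hello.txt");
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // List the directory, analogous to 'hdfs dfs -ls /data/landing'.
            for (FileStatus status : fs.listStatus(dir)) {
                System.out.println(status.getPath() + " " + status.getLen() + " bytes");
            }
        }
    }
}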

MapReduce

• The core divide-and-conquer distributed processing engine. Earlier versions also carried resource management, but that responsibility has now moved to YARN.
• Includes JobTracker (master) and TaskTracker (worker) components to run batch jobs (a minimal word-count sketch follows below).
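To make the divide-and-conquer model concrete, here is a hedged, minimal word-count job using the standard org.apache.hadoop.mapreduce API; the input and output paths are hypothetical.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs where the data block lives (data locality) and emits (word, 1).
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sums the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/landing"));      // hypothetical input
        FileOutputFormat.setOutputPath(job, new Path("/data/wordcount"));  // hypothetical output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}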

YARN
• The foundation of the new generation of the Hadoop core operating system. Provides resource management and a central operating platform, and addresses the limitations of the MapReduce 1.0 JobTracker service in the areas of scalability and support for non-MapReduce programs.
• Key features:
• Multi-tenancy - allows multiple access engines (either open source or proprietary) to use Hadoop as the common standard, with batch, interactive and real-time engines simultaneously accessing the same data set.
• Cluster utilization - YARN's dynamic allocation of cluster resources improves utilization over the more static MapReduce rules used in early versions of Hadoop.
• YARN's original purpose was to split the major responsibilities of the JobTracker/TaskTracker into separate entities: a global ResourceManager, a per-application ApplicationMaster, a per-node NodeManager, and per-application Containers running on the NodeManagers.

Page 8: Hadoop -  Architectural road map for Hadoop Ecosystem

HDFS

MapReduce

Page 9: Hadoop -  Architectural road map for Hadoop Ecosystem

Cluster layout

Page 10: Hadoop -  Architectural road map for Hadoop Ecosystem

Hadoop EcoSystem Abstraction Services
• One criticism of MapReduce is that the default development cycle in Java is very long. Writing the mappers and reducers, compiling and packaging the code, submitting the job(s), and retrieving the results is time consuming. Also, those who are not Java programmers cannot really make good use of Hadoop/HDFS.
• Pig and Hive come to the rescue for analysts and DBAs, and Cascading for Java programmers. Under the covers, all three translate their abstractions into a series of MapReduce programs.

Pig
• Originated at Yahoo. Pig's scripting language, Pig Latin, can process petabytes of data with just a few statements. Recommended use cases are ETL data pipelines, research on raw data, and iterative processing.
• Key features:
• Built-in operators and functions such as LOAD, FILTER, GROUP BY, FOREACH, MAX, ORDER, LIMIT, UNION, CROSS, SPLIT, CUBE, ROLLUP.
• Ability to create user-defined functions (UDFs) in Java and other languages that can be called within Pig scripts (a small Java UDF sketch follows below).
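As a hedged illustration of the UDF point above, the sketch below is a minimal Java EvalFunc that upper-cases a chararray field; the Pig Latin snippet in the comment shows how such a function would typically be registered and called, with the jar name, paths and aliases all hypothetical.

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Example Pig Latin usage (hypothetical jar/alias names):
//   REGISTER my-udfs.jar;
//   loaded = LOAD '/data/users' USING PigStorage(',') AS (name:chararray, city:chararray);
//   upper  = FOREACH loaded GENERATE UpperCase(name), city;
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;                       // Pig treats null as missing data
        }
        return input.get(0).toString().toUpperCase();
    }
}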

Hive/HQL

• Originated at Facebook, Hive provides a SQL-like abstraction called HQL for data analysts with a SQL background.
• Key features:
• Under the covers, Hive projects a SQL-style schema onto HDFS data by using a Metastore. The Metastore, which presents HDFS data as SQL tables, is typically backed by a relational database such as MySQL; that database does not store the data itself, only the structure describing the data stored in HDFS.
• Provides the majority of ANSI-SQL-like statements for processing, including EXPLAIN and ANALYZE.
• Provides partitioning, indexing and bucketing features to manage performance (a hedged JDBC sketch follows below).
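Since Hive exposes HQL over JDBC (via HiveServer2), the hedged sketch below shows how a Java client might run a simple query. The host, port, database, table and credentials are hypothetical, and it assumes the Hive JDBC driver jar is on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // Hive JDBC driver class; host, port and database below are illustrative defaults.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hiveserver-host:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // Hypothetical table; assumes it was created over data already sitting in HDFS.
            ResultSet rs = stmt.executeQuery(
                "SELECT city, COUNT(*) AS cnt FROM web_logs GROUP BY city");
            while (rs.next()) {
                System.out.println(rs.getString("city") + " -> " + rs.getLong("cnt"));
            }
        }
    }
}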

Cascading
• Pig is great for early-adopter use cases, ad hoc queries, and less complex applications. Cascading is aimed at enterprise data workflows and is designed for "scale, complexity and stability" beyond Pig.
• With Cascading, you can package your entire MapReduce application, including its orchestration and testing, within a single JAR file.
• The processing model is based on data "pipes", filters/operations, and data "taps" (sources and sinks).

Page 11: Hadoop -  Architectural road map for Hadoop Ecosystem

Hadoop EcoSystem Data Ingestion tools

Sqoop
• Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.
• Mostly batch driven; a simple use case is an organization that runs a nightly Sqoop import to load the day's data from a production database into HDFS for Hive/HQL analysis.
• Provides import and export commands and runs the transfer in parallel, with data flowing into and out of HDFS.
• Parallelizes data transfer for fast performance and optimal system utilization.
• Copies data quickly from external systems to Hadoop.
• Makes data analysis more efficient.
• Mitigates excessive load on external systems (a hedged invocation sketch follows below).
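Sqoop is normally invoked from the command line or from an Oozie action, but it can also be driven from Java. The hedged sketch below assumes the Sqoop 1 client library is on the classpath; the JDBC URL, table, paths and mapper count are hypothetical, and the argument array mirrors the usual sqoop import flags.

import org.apache.sqoop.Sqoop;

public class NightlyImportSketch {
    public static void main(String[] args) {
        // Equivalent to running 'sqoop import ...' from the shell; all values are illustrative.
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://db-host:3306/sales",
            "--username", "etl_user",
            "--password-file", "/user/etl/.db_password",   // keep secrets off the command line
            "--table", "orders",
            "--target-dir", "/data/landing/orders",
            "--num-mappers", "4"                            // degree of parallelism
        };
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}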

Flume
• Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data into HDFS (not out of HDFS).
• Mostly real time and event driven; a common use case is collecting log data from one system, such as a bank of web server log files, and aggregating it in HDFS for later analysis.
• Streams data from multiple sources into Hadoop for analysis.
• Insulates the pipeline from transient spikes, when the rate of incoming data exceeds the rate at which data can be written to the destination.
• Guarantees data delivery.
• Scales horizontally to handle additional data volume.

DistCp
• DistCp (distributed copy) is a tool used for large inter- and intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting.

Hadoop EcoSystem Real-time Read/Write Storage

HBase
• HDFS is a distributed file system that is well suited to the storage of large files. It is a general-purpose file system and does not provide fast individual record lookups in files. HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables; this can sometimes be a point of conceptual confusion. Internally, HBase puts your data in indexed "StoreFiles" on HDFS for high-speed lookups.
• Key features: modelled after Google BigTable; cells are versioned; storage is column oriented (versus the RDBMS row orientation); rows are kept sorted (for lookups); the data model is map oriented and distributed across servers. HBase depends on ZooKeeper for the distribution services of the cluster, so there is no single master server, which provides additional flexibility. A hedged client sketch follows below.
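To make the fast-record-lookup point concrete, here is a hedged sketch using the standard HBase 1.x+ Java client to write and read a single cell. The table name, column family, qualifier and row key are hypothetical, and it assumes hbase-site.xml (with the ZooKeeper quorum) is on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLookupSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml / ZooKeeper quorum

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("customer_profiles"))) {

            // Write one cell: row key, column family 'info', qualifier 'city'.
            Put put = new Put(Bytes.toBytes("customer-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Austin"));
            table.put(put);

            // Random read of the same row, the kind of lookup HDFS alone cannot serve quickly.
            Result result = table.get(new Get(Bytes.toBytes("customer-42")));
            byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            System.out.println("city = " + Bytes.toString(city));
        }
    }
}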

Page 12: Hadoop -  Architectural road map for Hadoop Ecosystem

Hadoop EcoSystem Workflow Tools

Oozie
• Oozie is a workflow scheduling engine specialized in running multi-stage jobs in the Hadoop ecosystem. It has the ability to monitor, track, recover from errors, and maintain dependencies between jobs.
• Workflows are expressed as XML; no development language is required.
• Three types of jobs:
• Oozie Workflow jobs - jobs running on demand, expressed as Directed Acyclic Graphs (DAGs).
• Oozie Coordinator jobs - workflow jobs running periodically on a regular basis, triggered by time and data availability.
• Oozie Bundle jobs - a way to package multiple coordinator and workflow jobs.
• Deployment model: an Oozie application comprises one file defining the logic of the application plus other files such as scripts, configuration, and JAR files. A workflow application consists of a workflow.xml file and may have configuration files, Pig scripts, Hive scripts, JAR files, etc. A hedged submission sketch using the Oozie Java client follows below.
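As a hedged illustration of the deployment model above, the sketch below uses the Oozie Java client (org.apache.oozie.client.OozieClient) to submit a workflow application that has already been copied to HDFS. The Oozie URL, application path and NameNode/ResourceManager addresses are all hypothetical.

import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitWorkflowSketch {
    public static void main(String[] args) throws Exception {
        // Oozie server URL is an illustrative default.
        OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = client.createConfiguration();
        // HDFS directory containing workflow.xml plus scripts/JARs.
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode-host:8020/apps/nightly-etl");
        conf.setProperty("nameNode", "hdfs://namenode-host:8020");
        conf.setProperty("jobTracker", "resourcemanager-host:8032");

        String jobId = client.run(conf);          // submit and start the workflow
        System.out.println("Submitted workflow " + jobId);

        WorkflowJob.Status status = client.getJobInfo(jobId).getStatus();
        System.out.println("Current status: " + status);
    }
}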

Falcon
• Simplifies data management (data jobs) for Hadoop and is used mainly as part of data ingestion. Falcon is a feed processing and feed management system aimed at making it easier for end consumers to onboard their data onto Hadoop clusters.
• Provides the key services data processing applications need, so sophisticated data lifecycle management (DLM) can easily be added to Hadoop applications.
• Services include data processing (using Pig scripts, for example), retry logic, late-arrival data handling, replication, retention, record audits, governance, and metrics.
• Provides integration with the metastore/catalog (HCatalog).
• Enables faster development and higher quality for ETL, reporting and other data processing applications on Hadoop.

Hadoop EcoSystem Real-time Read/Write Storage (continued)

Accumulo
• Originally developed by the NSA (National Security Agency), Accumulo provides cell-level security access to a BigTable-modeled database. Due to its origins in the intelligence community, Accumulo provides extremely fast access to data in massive tables, while also controlling access to its billions of rows and millions of columns down to the individual cell.
• Key features:
• Group columns within a single file.
• Automatic tablet splitting and rebalancing.
• Merge tablets and clone tables.
• Uses ZooKeeper, just like HBase.

Page 13: Hadoop -  Architectural road map for Hadoop Ecosystem

Hadoop EcoSystem Workflow tools….

Page 14: Hadoop -  Architectural road map for Hadoop Ecosystem

Hadoop EcoSystem Analytic Processing Tools

Storm
• Storm provides the ability to process large amounts of "real time" data (versus the batch orientation of core Hadoop).
• A typical use case would be processing Twitter feeds or online web log errors that need immediate action.
• Features:
• The input stream of data is managed by a "spout" (like a water spout), which passes data to a "bolt"; a bolt transforms (processes) the data or passes it on to another bolt. You scale up by increasing the number of bolts and spouts.
• Storm has two kinds of nodes: master nodes (which run the Nimbus daemon) and worker nodes (which run the Supervisor daemon). The cluster is coordinated by ZooKeeper, so nodes can fail and restart. Fast data transfer between workers is handled by the lightweight ZeroMQ messaging layer.
• Provides Java, Scala and Python language interfaces. A hedged topology-wiring sketch follows below.
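To show how the spout/bolt wiring looks in Java, here is a hedged sketch of topology assembly. WebLogSpout and ErrorAlertBolt are hypothetical classes (implementing Storm's IRichSpout and IRichBolt) that would emit log lines and flag errors; the package names follow the current org.apache.storm namespace (older releases used backtype.storm), and the topology name and parallelism values are illustrative.

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class LogAlertTopologySketch {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Hypothetical spout and bolt implementations: the spout emits web log lines,
        // the bolt inspects each tuple and raises an alert on errors.
        builder.setSpout("web-logs", new WebLogSpout(), 2);
        builder.setBolt("error-alerts", new ErrorAlertBolt(), 4)
               .shuffleGrouping("web-logs");

        Config config = new Config();
        config.setNumWorkers(2);   // worker processes spread across the Supervisor nodes

        StormSubmitter.submitTopology("log-alert-topology", config,
                builder.createTopology());
    }
}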

Spark
• An in-memory compute engine for machine learning and data science projects. Typical MapReduce processing shuffles data through disk between stages, whereas Spark keeps data processing and intermediate results in memory (to a great extent), providing roughly an order-of-magnitude performance improvement over disk-based MapReduce.
• Use cases - speeding up batch analysis jobs and iterative machine learning jobs. If you are interested in, for example, executing a Hadoop MapReduce-style job much faster, Spark is a great option (although memory requirements must be considered).
• Spark provides the RDD (Resilient Distributed Dataset) mechanism for in-memory data storage, with processing primitives that are applied across the whole data set.
• Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these frameworks seamlessly in the same application.
• Spark can run on Hadoop 2's YARN cluster manager and can read existing Hadoop data. Spark SQL can be built and configured to read and write data stored in Hive.
• Provides Java, Scala and Python language interfaces. A hedged RDD sketch follows below.
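As a hedged example of the RDD model, the sketch below is a word count using the Spark 2.x Java API. The HDFS paths are hypothetical; in practice the master would be set by spark-submit (for example, to run on YARN) rather than hard-coded to local mode.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCountSketch {
    public static void main(String[] args) {
        // local[*] is only for testing; spark-submit normally supplies the master (e.g. YARN).
        SparkConf conf = new SparkConf().setAppName("word-count-sketch").setMaster("local[*]");

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Hypothetical HDFS path; RDD partitions live in memory across transformations.
            JavaRDD<String> lines = sc.textFile("hdfs://namenode-host:8020/data/landing/*.txt");

            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            counts.saveAsTextFile("hdfs://namenode-host:8020/data/wordcount-spark");
        }
    }
}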

Mahout
• A suite of machine learning libraries designed to be scalable and robust, with a Java interface. Supports four main data science use cases (a hedged recommender sketch follows this list):

• Collaborative filtering – mines user behavior and makes product recommendations (e.g. Amazon recommendations)

• Clustering – takes items in a particular class (such as web pages or newspaper articles) and organizes them into naturally occurring groups

• Classification – learns from existing categorizations and then assigns unclassified items to the best category

• Frequent item set mining – analyzes items in a group (e.g. items in a shopping cart or terms in a query session)
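As a hedged illustration of the collaborative filtering use case, the sketch below uses Mahout's classic Taste API (org.apache.mahout.cf.taste) to build a user-based recommender. The ratings.csv file, user ID and neighborhood size are hypothetical; the file is assumed to contain lines of userID,itemID,preference.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical input: lines of userID,itemID,preference.
        DataModel model = new FileDataModel(new File("ratings.csv"));

        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 product recommendations for user 42 (the collaborative filtering use case).
        List<RecommendedItem> recommendations = recommender.recommend(42L, 3);
        for (RecommendedItem item : recommendations) {
            System.out.println("item " + item.getItemID() + " score " + item.getValue());
        }
    }
}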

Page 15: Hadoop -  Architectural road map for Hadoop Ecosystem

Hadoop EcoSystem Security

The architects and designers of early Hadoop were initially focused on HDFS and MapReduce at massive scale, with no security model. The reason was that most of the workflow (data ingestion, storage, processing, analytics) was internal to an enterprise. In recent times, as Hadoop and related technologies have gone mainstream with interfaces extending outside the enterprise, robust security models have been added to the Hadoop architecture.

HDP Security/Ranger

• Built-in advanced security and authorization for Hadoop.
• Centralized security administration - HDP provides a console for managing security policies and access controls from one place.
• Authentication - Kerberos-based authentication, which can generate delegation tokens for time-bound authentication (like Disney tickets that are valid for a certain number of days), since Kerberos servers cannot scale to tens of thousands of jobs/data requests. Kerberos has the additional benefit of never sending passwords over the wire, and it can integrate with corporate LDAP if required.
• Authorization and audit - prevents rogue jobs being executed by insiders. Fine-grained authorization via file permissions in HDFS, resource-level access control for YARN and MapReduce, and coarser-grained access control at a service level.

Knox
• The Apache Knox Gateway ("Knox") provides perimeter security for the Hadoop ecosystem.
• Provides security for all of Hadoop's REST and HTTP services.
• Supports the REST APIs of Apache Ambari, Apache Falcon and Apache Ranger.

Hadoop EcoSystem Operational Tools

Ambari - cluster administration, provisioning and monitoring tool.

ZooKeeper - distributed coordination service used by HBase and Accumulo.

Page 16: Hadoop -  Architectural road map for Hadoop Ecosystem

Road Map for Implementation

Stage 1: Development Tools
• Most Hadoop sub-components can be independently installed and configured, but it is highly recommended to go with sandbox versions from a vendor such as Hortonworks or Cloudera.
• Use Linux (or Linux-like) machines for developers, as the Hadoop ecosystem is designed around Linux and Linux is usually the only supported platform in production.
• Sandbox development environments are also provided as virtual machine packages that can run on Windows machines for development. Port forwarding and sharing between host and guest development machines can meet most development needs.
• Programming IDEs, such as Eclipse and IntelliJ for Java, can continue with the current setup.
• For a large team, use Vagrant and build a VM with all required software components.

Stage 2: Cluster Setup - Hardware and Network
• Choose optimal hardware per vendor recommendations for the critical NameNode and ResourceManager/JobTracker as well as the rest of the worker nodes. Install and configure a multi-node cluster (24 or 32 nodes) as a proof of concept.
• Isolate the Hadoop ecosystem in separate subnets away from web and other traditional infrastructure. Separate master nodes, worker nodes, client nodes and management nodes, where master and worker nodes take input only from the management nodes.
• Keep client nodes on edge nodes (akin to a DMZ), with end-user access limited to only those. Users should never get access to any Hadoop nodes except the edge/gateway nodes.
• Work with the network architect to choose a topology: the traditional two-tier rooted switch architecture (using expensive chassis switches) or the big-data-preferred spine-and-leaf model (using inexpensive scale-out fixed switches).
• Define rack awareness (the Hadoop topology script configuration) for all servers so the NameNode can make the right placement decisions.

Stage 3: Pre-Production Component Setup and Testing
• Execute a POC demo of all components. This need not be a business use case; it is more a technology demonstration.
• Complete one full life cycle of development, deployment and testing of the POC application with all components.
• Install IDS/IPS products inline on the data ingestion path for data coming from external sources (Twitter feeds, etc.) so that fictitious data cannot be injected into processing.

Page 17: Hadoop -  Architectural road map for Hadoop Ecosystem

Road Map for Implementation

Stage 4: Early Adoption Use Case
• Identify long-running jobs and queries from the enterprise data warehouse or BI team.
• Formulate a use case, or copy an existing one, to mimic that functionality.
• Design and code Sqoop and Pig scripts to load data into HDFS.
• Design the HDFS layout and Hive schema before the data load.
• Replicate the existing functionality using Hadoop ecosystem components.
• Adopt architectural standards for data format and compression (defined in the next section).

Stage 5: Mature Use Case such as MDM
• Provide data exploration and analysis to identify, locate and load internal and external data files into the landing zone.
• Clean the data as per business requirements.
• Prepare - design and develop big data structures and tables to load the data files.
• Analyze - provide productivity tools for data access and analysis.
• Implement enterprise data transformation tools like Cascading if required.

Stage 6: Data-Science-Driven Analytic Use Case
• Work with the business to identify a data-science-driven use case.
• Set up the big data ecosystem and environment.
• Engineer alongside the data science team to design the prediction model.

Page 18: Hadoop -  Architectural road map for Hadoop Ecosystem

Architectural and Design Decisions

Data Format, Compression and Storage System

File format
• SequenceFiles - recommended for consolidating smaller files (below the 64 MB default HDFS block size).
• Avro - recommended for use cases with ever-evolving schemas, since the format is self-describing.
• RC and ORC file formats - columnar formats, efficient for most analytic cases where only a few columns of data are read.

Compression - it is critical that the chosen approach supports a splittable format, as required for MapReduce:
• Snappy - efficient compression, but raw Snappy files are not splittable.
• LZO - slower compression than Snappy, but supports splits (when indexed).
• Bzip2 - supports splits but is less efficient; use only if storage is a serious concern. A hedged job-configuration sketch follows below.
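As a hedged illustration of these choices, the snippet below shows how a MapReduce job (such as the word-count sketch earlier) might be configured to write block-compressed SequenceFile output with Snappy; the container format stays splittable even though raw Snappy is not. Class names follow the standard Hadoop APIs, and the job name is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CompressedOutputConfigSketch {
    public static Job configure(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "compressed-output-sketch");

        // Block-compressed SequenceFiles remain splittable even with a non-splittable codec.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);

        return job;
    }
}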

Schema design - choose between HDFS and HBase, with the latter providing random read/write support.

Adopt a standard directory structure and a standard organization of data.

Data Movement
• Network - ensure appropriate bandwidth requirements are met.
• Sqoop for data ingestion:
• Use a single hop from source to HDFS where possible.
• Use database-specific connectors whenever available.
• Apply the "Goldilocks method" of Sqoop performance tuning (a mapper count that is neither too high nor too low).
• Load many tables in parallel with Fair Scheduler throttling.
• Ask the following questions and design to minimize network traffic:
• Timeliness of data ingestion.
• Incremental updates.
• Data transformation - does the data need to be transformed in flight?
• Source system structure, layout and proximity.

Data Processing
• MapReduce low-level Java API - use only if high flexibility and fine control over MapReduce are required.
• Spark and Storm are the clear choices for most enterprise-level processing.
• Use abstractions like Pig and HQL for early adoption or for data analysts. Leverage modern performance improvements like Tez, which performs better than MapReduce as an execution engine for Pig and HQL.

Page 19: Hadoop -  Architectural road map for Hadoop Ecosystem

Architectural and Design Decisions

Patterns
Use patterns for commonly occurring problems:
• Removing duplicate keys.
• Windowing analysis - helps in arriving at the peaks and lows of, for example, an equity price.
• Updating time-series data - one common way of storing this information is as versions in HBase. HBase can store every change you make to a record as a version. Versions are defined at the column level and ordered by modification timestamp, so you can go back to any point in time to see what the record looked like (a hedged read sketch follows below).
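Building on the earlier HBase client sketch, the hedged snippet below reads back several versions of a single cell. The table, column family, qualifier and row key are hypothetical, and the calls are from the HBase 1.x client API.

import java.util.List;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeSeriesVersionsSketch {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("sensor_readings"))) {

            // Ask for up to 10 historical versions of the 'metrics:temperature' cell.
            Get get = new Get(Bytes.toBytes("sensor-7"));
            get.setMaxVersions(10);

            Result result = table.get(get);
            List<Cell> versions = result.getColumnCells(
                    Bytes.toBytes("metrics"), Bytes.toBytes("temperature"));
            for (Cell cell : versions) {
                System.out.println(cell.getTimestamp() + " -> "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}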

Graph Processing
• Giraph - recommended for mature graph processing cases, but only for expert graph programmers.
• Spark GraphX - recommended, as it is available within the single Spark stack, which is set to replace MapReduce.

Workflows
• While most data pipelines and processing can be driven manually with scripts, an enterprise should leverage Oozie and Falcon; they provide robust mechanisms for orchestrating processing.

Programming Languages
• Hadoop is developed in Java, and Java is the core language choice for all interfaces such as Spark, Storm and MapReduce (abstractions like Pig and HQL are excluded).
• However, as Hadoop matures, Python is becoming a language of choice. If development teams are already familiar with Java, then continue with Java.

Unit Testing
• All languages and abstractions for Hadoop offer unit testing (including Pig), and it should be used just as in any enterprise application. A hedged MapReduce unit-test sketch follows below.
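As one hedged example, the sketch below unit-tests the TokenizerMapper from the earlier word-count sketch using Apache MRUnit, a common choice for MapReduce tests in this era; similar harnesses exist for Pig (PigUnit) and Hive. The class and test names are illustrative.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class TokenizerMapperTest {

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // WordCount.TokenizerMapper is the mapper from the earlier word-count sketch.
        mapDriver = MapDriver.newMapDriver(new WordCount.TokenizerMapper());
    }

    @Test
    public void emitsOneCountPerToken() throws Exception {
        mapDriver.withInput(new LongWritable(0), new Text("hadoop hdfs hadoop"))
                 .withOutput(new Text("hadoop"), new IntWritable(1))
                 .withOutput(new Text("hdfs"), new IntWritable(1))
                 .withOutput(new Text("hadoop"), new IntWritable(1))
                 .runTest();
    }
}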

Page 20: Hadoop -  Architectural road map for Hadoop Ecosystem

Questions ?