  • A Tour of the Zoo: The Hadoop Ecosystem

    Prafulla Wani

    Technical Architect - Big Data

    Syntel

  • Confidential 2012 Syntel, Inc.

    Agenda

    Welcome to the Zoo!

    Evolution Timeline

    Traditional BI/DW Architecture

    Where Hadoop Fits In


  • Welcome to the Zoo!

    [Slide graphic: project logos for Jaql, Giraph, Shark, ZooKeeper, Pig, Hama, and Hadoop]

    I am sure you won't find a Shark in any other zoo.

    http://zookeeper.apache.org/

    https://cwiki.apache.org/confluence/display/Hive

  • What is Hadoop?

    Hadoop is an open-source project overseen by the Apache Software Foundation

    Hadoop is an ecosystem, not a single product

    Originally based on papers published by Google in 2003 and 2004

    Several other projects in the ecosystem were also inspired by whitepapers published by Google

    Google calls it | Hadoop equivalent
    GFS             | HDFS
    MapReduce       | Hadoop MapReduce
    Sawzall         | Hive, Pig
    BigTable        | HBase
    Chubby          | ZooKeeper
    Pregel          | Giraph

  • Evolution Timeline

    Started by Doug Cutting at Yahoo! in early 2006, and named after his kid's toy elephant

    Hadoop committers work at several different organizations, including Facebook, Yahoo!, LinkedIn, Twitter, Cloudera, and Hortonworks

    [Timeline graphic: project logos, including Jaql and Giraph, placed along the years 2006 to 2011]

  • Traditional Data Strategy - BI/DW Architecture

                | ETL Tools              | DW / Marts   | BI                  | Analytics
    Commercial  | Informatica            | Teradata     | MicroStrategy       | SAS
                | Oracle Data Integrator | Oracle       | OBIEE               | TIBCO Spotfire
                | IBM DataStage          | DB2, Netezza | Cognos              | SPSS
                | Microsoft SSIS         | SQL Server   | Microsoft SSRS      |
    Open source | Talend                 | MySQL        | Pentaho, Jaspersoft | R, RapidMiner

    [Architecture diagram: sources (ERP, CRM, Database, Files); ETL Process; Data Warehouse and Data Marts; consumers (Analytics, OLAP Analysis/BI, Ad Hoc, Reporting)]

  • How Hadoop Fits In

    Hadoop can complement the existing DW environment as well as replace some of the components in a traditional data architecture.

    [Architecture diagram, repeated from the previous slide: sources (ERP, CRM, Database, Files); ETL Process; Data Warehouse and Data Marts; consumers (Analytics, OLAP Analysis/BI, Ad Hoc, Reporting)]

  • Data Storage

    Hadoop Distributed File System (HDFS)

      It's a file system, not a DBMS

      Allows storage of both structured and unstructured data

      Provides distributed, redundant storage for massive amounts of data on cheap, unreliable computers

      The Hadoop 2.0 release (still in beta) added some important features: HDFS Federation and High Availability

    HBase

      Distributed, versioned, column-oriented store on top of HDFS

      Provides low-latency (OLTP-style) reads and writes, along with support for the batch-processing model of MapReduce

      Goal: to store tables with billions of rows and millions of columns
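To make "versioned, column-oriented" more concrete, here is a minimal in-memory sketch of that data model. It is a toy illustration only and not the HBase client API: the class `VersionedTable`, its methods, and the `max_versions` parameter are hypothetical names chosen for this example, and the column names follow a "family:qualifier" convention assumed for illustration.

```python
from collections import defaultdict


class VersionedTable:
    """Toy in-memory model of a versioned, column-oriented table (not the HBase API)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # row key -> "family:qualifier" -> list of (timestamp, value), newest first
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp):
        """Store a new version of one cell."""
        cells = self._rows[row][column]
        cells.append((timestamp, value))
        # Keep the newest versions first and drop anything beyond max_versions.
        cells.sort(key=lambda cell: cell[0], reverse=True)
        del cells[self.max_versions:]

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (the latest if None)."""
        for ts, value in self._rows.get(row, {}).get(column, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None


table = VersionedTable(max_versions=2)
table.put("user1", "info:city", "Pune", timestamp=1)
table.put("user1", "info:city", "Mumbai", timestamp=2)
table.put("user1", "info:city", "Delhi", timestamp=3)
print(table.get("user1", "info:city"))               # Delhi
print(table.get("user1", "info:city", timestamp=2))  # Mumbai
```

Because each row stores only the columns that were actually written, a table in this model can be very wide and sparse, which is the idea behind the "millions of columns" goal above.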

  • Data Processing (ETL / Analytics)

    Extract / Load

      Source / target is an RDBMS: Sqoop

      Log collection and aggregation: Flume, Scribe, Chukwa

      Stream processing: S4, Storm (Storm also supports transformation)

    Transformation

      Map-reduce programming in Java or any other language, or high-level query languages like Pig and Hive

      Workflow design and implementation using tools like Oozie and Azkaban

      Iterative algorithms or in-memory cluster processing using Spark, Shark, etc.

    Analytics

      Mahout: scalable machine learning library with most of the algorithms implemented on top of Apache Hadoop using the map/reduce paradigm

      RHadoop: provides R packages to access data in HDFS and HBase, and to write map-reduce jobs in R
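The map-reduce programming model mentioned under Transformation can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce phases, assuming a word-count job as the example; it does not use any Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


logs = ["pig hive pig", "hive hbase"]
print(run_job(logs))  # {'pig': 2, 'hive': 2, 'hbase': 1}
```

On a real cluster the mapper and reducer run in parallel on many machines and the shuffle moves data across the network; the structure of the job stays the same.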

  • Common Industry Use Cases

    Use case                                    | Solution               | Comments
    Cold data storage                           | HDFS                   | More cost-effective option compared to most appliances in the market
    Huge transactional volume                   | HBase                  | StumbleUpon created OpenTSDB to capture their infrastructure metrics data
    Batch processing                            | MapReduce / Hive / Pig |
    Log aggregation                             | Flume, Scribe, Chukwa  | Web-log collection on HDFS in near real-time
    Real-time message/stream processing         | Storm, S4              | Used by Twitter for real-time tweet processing
    Iterative algorithms / in-memory processing | Spark / Shark          | Predictive analytics, log mining
    Machine learning / analytics                | Mahout, RHadoop        |
    Graph data storage/processing               | Giraph                 | Championed at Yahoo!

  • Proposed Big Data Roadmap

    1. Kickoff - Assessment Study:

       Understand the business processes

       Understand organizational goals and current investments

       Understand the challenges and pain points of the current setup

    2. Proof of Concept:

       A Proof of Concept can be performed to demonstrate the applicability of Hadoop to enhance the DW

    3. Big Data integration: Initial steps

       Move cold/warm data to Hive/HBase to reduce expenses on storage infrastructure

       Bring in new data sources such as web logs, which was not possible with traditional storage solutions

    4. Big Data integration: Next steps

       Open the data up to business users for analysis, so they can appreciate the power of the new infrastructure

    5. Big Data integration: Next steps

       Identify the opportunities in the ETL and Analytics space

       Move hot data to Hadoop

       Perform real-time data integration using Storm/Spark

    6. Big Data integration: Next steps

       Implement advanced solutions

    Hadoop Technology Stack

       HDFS, HBase

       Hive, Pig, MapReduce

       Mahout, RHadoop

  • Thank You