Grid Infrastructure and Databases BrownBag


  • 8/13/2019 Grid Infrastructure and Databases BrownBag

    1/32

    Hadoop in a nutshell,

    Grid Infrastructure, Databases and More.

    ~Ashok Kondala


    What is Hadoop?

    Hadoop is derived from Google's Google File System (GFS) and MapReduce papers.

    Yahoo! is an early adopter of and major contributor to Hadoop, and uses it extensively.

    Hadoop was created by Doug Cutting and Michael J. Cafarella.

    Hadoop developed into a set of Java-based Apache open-source projects.

    Hadoop can process large datasets (structured, unstructured and semi-structured) using a widely distributed cluster of servers.

    It is a Big Data solution for dealing with the complexities of the "3 Vs" of data: volume, velocity and variety.


    Why Hadoop?

    A typical Analytics Architecture

    [Diagram: data sources (data logs, web servers, network devices, devices such as mobile, misc sources) feed raw and unstructured data into ETL processing, which produces structured & aggregated data and then BI & reporting data]


    Hadoop Architecture

    Hadoop has two main systems:

    Map-Reduce: This is the processing part. It plays a dual role: managing and scheduling jobs, and providing the programming abstraction for running computations and returning results.

    Hadoop Distributed File System (HDFS): This is the data part of Hadoop, providing high-bandwidth clustered storage.

    [Diagram: MapReduce and HDFS]
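
    To make the processing model concrete, here is a minimal single-process sketch in Python of the map, shuffle and reduce steps behind a word count. It is illustrative only: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example and are not Hadoop APIs, and a real job would read its input from HDFS and run across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

    In this sketch the map and reduce functions hold the job-specific logic, while the shuffle step stands in for the grouping that the framework performs between them.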


    Hadoop Architecture (contd.)

    Hadoop subscribes to a master-slave architecture

    Master: NameNode and JobTracker

    Slave: DataNode and TaskTracker

    The Map-Reduce server on a typical machine is called a TaskTracker.

    The HDFS server on a typical machine is called a DataNode.

    [Diagram: a machine running a TaskTracker and a DataNode]


    Cluster of Machines/Nodes

    The JobTracker keeps track of the jobs being run.

    [Diagram: a JobTracker and four machines, each running a TaskTracker and a DataNode]


    Cluster of Machines/Nodes

    The NameNode keeps track of where the data is located and acts as coordinator for all the DataNodes.

    [Diagram: a NameNode and four machines, each running a TaskTracker and a DataNode]
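
    As a rough picture of the bookkeeping described above, the toy model below records which DataNodes hold each block of a file. Everything in it is an assumption made for illustration: the class ToyNameNode, its methods, and the block and node names are hypothetical, and a real NameNode also handles replication, heartbeats and much more.

```python
from collections import defaultdict


class ToyNameNode:
    """Toy model: remembers which DataNodes hold each block of each file."""

    def __init__(self):
        # file path -> ordered list of block ids
        self.file_blocks = defaultdict(list)
        # block id -> set of DataNode names holding a replica
        self.block_locations = defaultdict(set)

    def register_block(self, path, block_id, datanodes):
        """Record that the given DataNodes hold a replica of one block."""
        if block_id not in self.file_blocks[path]:
            self.file_blocks[path].append(block_id)
        self.block_locations[block_id].update(datanodes)

    def locate(self, path):
        """Return [(block_id, sorted DataNode names)] in block order."""
        return [
            (block_id, sorted(self.block_locations[block_id]))
            for block_id in self.file_blocks.get(path, [])
        ]


if __name__ == "__main__":
    nn = ToyNameNode()
    nn.register_block("/data/a", "blk_1", ["dn2", "dn1"])
    nn.register_block("/data/a", "blk_2", ["dn3"])
    print(nn.locate("/data/a"))
```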


    Hadoop Characteristics

    A few key attributes of Hadoop

    Hadoop is scalable, reliable, cost-effective and fault-tolerant.

    Hadoop can store and process petabytes of data.

    Hadoop performs large-scale computations and provides data redundancy and reliability.

    Hadoop is geared towards distributed, batch-centric applications.

    It is designed to scale up from a single machine to several thousand machines with a high degree of fault tolerance.

    Hadoop runs on commodity hardware.

    Hadoop commands are similar to Linux/Unix commands:

    hadoop fs -ls, hadoop fs -mkdir, hadoop fs -copyFromLocal, etc.


    Hadoop Sub-Projects

    PIG

    PIG programs run as map-reduce jobs. Pig is a high-level language that works more like a compiler. Developed at Y!.

    It consists of Pig Latin programs and the Pig runtime.

    Example: a set of transformations that is converted into map-reduce tasks

    -- Register the jars used by the script below.
    register /homes/akondala/videoanalytics/jar/videoanalytics.jar

    register /grid/0/gs/pig/current/libexec/released/sds.jar

    -- Load one part file with the ULT loader.
    A = load '/data/SDS/data/ULT_apps_b_daily/video_cdn/20090905/51329/view/part-00000' using com.yahoo.yst.sds.ULT.ULTLoader() as (simpleFields, mapFields, mapListFields);

    -- Keep only the page_params map.
    B = FOREACH A GENERATE mapFields#'page_params' as page_params;

    -- Parse the 'dt' and 'd' page parameters with the PARSE_ATLAS_DATA UDF.
    C = FOREACH B GENERATE videoanalytics.PARSE_ATLAS_DATA(page_params#'dt', page_params#'d') as atlas_data;

    dump C;

    -- Project three fields and print the first 20 rows.
    grunt> D = FOREACH C GENERATE atlas_data.vid, atlas_data.b, atlas_data.cat;

    grunt> E = limit D 20;

    grunt> dump E;

    -------------------------------------------------------------------------------------------------------------------------------

    -- DUMP C OUTPUT -- field order: "vid", "sid", "cat", "cdn", "ca", "ca_t", "cp", "cp_t", "us", "us_t", "b"

    ((3890802772,3890802772_700.mp4,flickr,,,,792600246,s,,,))

    ((3658435432,3658435432_700.mp4,flickr,,,,792600246,s,,,))

    ((3715267808,3715267808_700.mp4,flickr,,,,792600246,s,,,))

    ((3858990041,3858990041_700.mp4,flickr,,,,792600246,s,,,))

    -- DUMP E OUTPUT

    (3792043310,,flickr)

    (3821820098,,flickr)

    (3763110035,,flickr)


    PIG and HIVE

    [Diagram: PIG, HIVE, MAP-REDUCE and HDFS]

    PIG: A high-level language consisting of Pig Latin programs and the Pig runtime.

    Developed @Y!

    A set of transformations (such as load, dump, store, filter, group) that are converted into map-reduce tasks.

    HIVE: A query language with a SQL-like interface.

    Developed by Facebook.

    Uses SQL-like constructs (SELECT, FROM, GROUP BY, WHERE), which are transformed into map-reduce tasks.

    It can be run from the Hive shell (command line), from JDBC/ODBC applications/drivers, or from a Thrift client.
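
    As a rough illustration of how a SQL-like query can be turned into map-reduce tasks, the sketch below hand-translates a query of the form SELECT cat, COUNT(*) FROM videos WHERE cdn = 's' GROUP BY cat into a map step and a reduce step in Python. The table, column names and rows are invented for this example, and this is not how Hive's own planner is implemented.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows of a "videos" table.
ROWS = [
    {"vid": 1, "cat": "flickr", "cdn": "s"},
    {"vid": 2, "cat": "news", "cdn": "s"},
    {"vid": 3, "cat": "flickr", "cdn": "x"},
    {"vid": 4, "cat": "flickr", "cdn": "s"},
]


def map_row(row):
    """WHERE becomes a filter in the mapper; the GROUP BY column becomes the key."""
    if row["cdn"] == "s":
        yield row["cat"], 1


def reduce_group(key, values):
    """COUNT(*) becomes a sum over the values that share a key."""
    return key, sum(values)


def run_query(rows):
    emitted = [kv for row in rows for kv in map_row(row)]
    # Sorting by key stands in for the shuffle/sort between map and reduce.
    emitted.sort(key=itemgetter(0))
    return [
        reduce_group(key, [value for _, value in group])
        for key, group in groupby(emitted, key=itemgetter(0))
    ]


if __name__ == "__main__":
    print(run_query(ROWS))
```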


    HBASE

    Apart from batch processing, there is a requirement to process data in real time. HBASE fills that requirement.

    [Diagram: PIG, HIVE, HBASE, ZooKeeper, MAP-REDUCE and HDFS]


    HBASE

    It is a columnar database that fills the need for real-time data access.

    It does not support SQL and is not an RDBMS datastore.

    It is a set of tables stored in HDFS.

    It does not use map-reduce, but uses a similar master-slave architecture.

    HBase can be accessed by PIG, Hive and MapReduce.

    HBase keeps some of its metadata in ZooKeeper (which is used for coordination across several servers).


    HADOOP Overall

    [Diagram: the overall stack: PIG, HIVE, HBASE, HCAT, ZooKeeper, MAP-REDUCE and HDFS]


    Grid Architecture


    Grid Databases

    Grid databases run on Oracle (11.2.0.2/11.2.0.3) and MySQL (5.1/5.5).

    Five critical databases on the Grid:

    Grid Data Management RunTime (GDM) (formerly known as Data AcQuisition, DAQ)

    Grid Data Management Console (GDMConsole)

    Oozie

    HCAT (Hive Metadata Store)

    Support Shop (MySQL)


    Grid Data Management

    GDM Physical Architecture: Multi-Colo


    GDM Components

    GDM consists of three facets: Acquisition, Replication and Retention, and optionally Archival.

    Each colo has all of these facet servers installed.

    GDM Console is the centralized console server.

    Data from all the colos is aggregated and replicated into the GDMConsole database.

    The Console provides a GUI with a centralized view of the facets in all the colos, covering their workflows and the configuration of datasets and data sources.


    GDM Logical Architecture


    GDM Database Layout


    Oozie

    Oozie is a workflow engine for Hadoop, more like a workflow manager and scheduler.

    Sets of Coordinator jobs execute on the basis of time and data availability.

    Oozie supports several types of Hadoop jobs, such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and more, as well as system-specific jobs, such as Java programs and shell scripts.

    Oozie is a scalable, reliable and extensible system.


    Oozie Data/Workflow

    Oozie uses a database for storing workflow definitions and currently running workflow instances, including instance states and variables.

    The Oozie schema is local to each Grid cluster, with about seven Oracle tables holding the Bundle Actions and Jobs, Coordinator Actions and Jobs, and Workflow Actions and Jobs.

    Oozie runs workflow jobs with multiple actions.

    Each workflow job creates at least three events in the database tables: CREATED, STARTED and SUCCEEDED/KILLED/FAILED.

    Oozie workflow definitions are written in XML.

    Oozie use-case: http://twiki.corp.yahoo.com/view/CCDI/OozieWorkflow
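
    The event sequence described on this slide can be sketched as a small check in Python. The event names come from the slide. The Event enum and the is_complete_history helper are hypothetical names introduced for illustration, not part of Oozie's schema or API.

```python
from enum import Enum


class Event(Enum):
    CREATED = "CREATED"
    STARTED = "STARTED"
    SUCCEEDED = "SUCCEEDED"
    KILLED = "KILLED"
    FAILED = "FAILED"


# A job ends in exactly one of these.
TERMINAL = {Event.SUCCEEDED, Event.KILLED, Event.FAILED}


def is_complete_history(events):
    """True if the events start with CREATED, STARTED and end with one terminal event."""
    if len(events) < 3:
        return False
    if events[0] is not Event.CREATED or events[1] is not Event.STARTED:
        return False
    # The terminal event must come last and must be the only terminal event.
    return events[-1] in TERMINAL and not any(e in TERMINAL for e in events[:-1])


if __name__ == "__main__":
    print(is_complete_history([Event.CREATED, Event.STARTED, Event.SUCCEEDED]))
```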


    Oozie coordinator

    The Coordinator automates workflow execution.

    It triggers workflow execution based on:

    time (like a cron job)

    input data availability

    [Diagram: the Coordinator checks the time and checks data availability in HDFS, then starts the Workflow]
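
    The two trigger conditions can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Oozie's coordinator implementation: the function names due_runs and ready_to_start are hypothetical, and a plain set of path strings stands in for a listing of what exists in HDFS.

```python
from datetime import datetime, timedelta


def due_runs(start, frequency, now):
    """Nominal run times from start up to now, one every `frequency` (cron-like)."""
    if frequency <= timedelta(0):
        raise ValueError("frequency must be positive")
    runs = []
    t = start
    while t <= now:
        runs.append(t)
        t += frequency
    return runs


def ready_to_start(nominal_time, now, required_paths, existing_paths):
    """A workflow may start once its time has come and all its input data exists."""
    time_ok = now >= nominal_time
    data_ok = all(path in existing_paths for path in required_paths)
    return time_ok and data_ok


if __name__ == "__main__":
    start = datetime(2019, 8, 13, 0, 0)
    now = datetime(2019, 8, 13, 2, 30)
    available = {"/data/in/00"}  # hypothetical input directories that exist
    for i, run in enumerate(due_runs(start, timedelta(hours=1), now)):
        needed = ["/data/in/%02d" % i]
        print(run, ready_to_start(run, now, needed, available))
```

    A run whose time has passed but whose input is missing simply stays pending in this model until the data shows up.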


    Oozie Architecture

    [Diagram: Oozie architecture. Components shown: Web-Services JSON/REST API (WS API, WS Callback), Security, DAG Engine, Commands (submit, start, rerun, suspend, resume, kill, info, signal, job notification; action start, action end, action check, action callback), Command Queue, Command Executor Thread Pool, Recovery Daemon Thread, ActionExecutors (m/r, pig, fs, sub-wf), Instrumentation, WFstore, WFlib, Oracle DB]


    HCAT/Hive

    Hadoop Catalog (HCAT): The HCAT Oracle databases store all the Hive metadata.

    Hive always has a metadata store that holds information about the tables that Hive can process.

    This metadata store is pulled out of Hive and made into HCatalog, making it available to other applications.

    Real time SQL abstraction layer.

    The Hive metadata stores all the information about the tables, their partitions, the schemas, the columns and their types, the table locations, etc.

    This information can be queried or modified using a Thrift interface, so it can be called from clients in different programming languages.
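
    To give a feel for the kind of record such a metadata store holds, here is a minimal in-memory sketch in Python covering tables, columns and their types, partitions and locations. All names and paths are invented for illustration. This is not the Hive metastore schema or its Thrift API.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    type: str


@dataclass
class Partition:
    values: dict   # partition key -> value, e.g. {"dt": "20090905"}
    location: str  # where this partition's files live


@dataclass
class TableMetadata:
    name: str
    columns: list
    location: str
    partitions: list = field(default_factory=list)

    def add_partition(self, values, location):
        """Record a new partition and where its data is stored."""
        self.partitions.append(Partition(values, location))

    def find_partitions(self, **wanted):
        """Return partitions whose values match every given key=value filter."""
        return [
            p for p in self.partitions
            if all(p.values.get(k) == v for k, v in wanted.items())
        ]


if __name__ == "__main__":
    table = TableMetadata(
        "video_views",
        [Column("vid", "bigint"), Column("cat", "string")],
        "/warehouse/video_views",
    )
    table.add_partition({"dt": "20090905"}, "/warehouse/video_views/dt=20090905")
    print(table.find_partitions(dt="20090905"))
```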


    HCAT/Hive

    The reason for storing this information in an RDBMS is to serve clients faster, in almost real time.

    The metadata store gives Hive information about the location, data types and content of the data in HDFS.

    For the object-relational mapping, the metadata store uses DataNucleus. The information is kept in an RDBMS instead of on HDFS due to latency issues, and this ORM is compatible with most RDBMS plugins.


    New Grid Flow

    HCAT Integrated with GDM and Oozie

    [Diagram: Off-Grid Data, GDM, HCatalog, CMS and Oozie (Workflow 1, Workflow 2), with these numbered steps:]

    0: load data

    1: add data

    2: add partition(s)

    3: publish partition(s)

    4: send notification

    5: start workflow

    6: add data

    7: publish metadata


    SupportShop

    SupportShop is the Grid portal, deployed on MySQL in the BF1 and GQ1 colos.

    SupportShop is a one-stop self-help solution for all Yahoo! Grid users: the Web User Portal (WUP). WUP should provide users with the following content:

    Convenient access to various internal & external educational & informational resources on Yahoo Grid technologies.

    A dashboard providing real-time status, critical alerts and announcements about the various Yahoo Grids & currently running user jobs.

    Various historical user-facing reports on Grid utilization, workloads, data access, etc.

    Various Grid forms for users to request resources on the grid and report problems, plus a mechanism to follow up.

    Various technology-enabled services that help users make more productive use of the grids (see the detailed requirements for the tools supported in the first release).

    A customizable portal interface to view only the information of interest.

    yo/supportshop


    SupportShop Physical Layout


    SupportShop Architecture


    References

    Grid DBA Twiki: http://twiki.corp.yahoo.com/view/Grid/GridProjDBinfrastructure

    Grid Web Portal: http://twiki.corp.yahoo.com/view/Grid/WebHome

    Grid Database Inventory: https://docs.google.com/a/yahoo-inc.com/spreadsheet/ccc?key=0ApG6hF0D_DhvdFItdEhDcEQ3cjBvWUJGVm5NUm9WYkE#gid=0

    Grid Clusters/version: yo/gridversions

    SupportShop: yo/supportshop
