Grid Infrastructure and Databases BrownBag

8/13/2019 Grid Infrastructure and Databases BrownBag

1/32

Hadoop in a Nutshell,

Grid Infrastructure, Databases and More

~Ashok Kondala


2/32

What is Hadoop?

Hadoop is derived from Google's Google File System (GFS) and MapReduce papers.

Yahoo! was the earliest major contributor to Hadoop and uses it extensively.

Hadoop was created by Doug Cutting and Michael J. Cafarella.

Hadoop developed into a set of Java-based Apache open-source projects.

Hadoop is capable of processing large datasets, whether structured, unstructured or semi-structured, using a widely distributed cluster of servers.

It is the Big Data solution for dealing with the complexities of data that is high in the three Vs (volume, velocity and variety).


3/32

Why Hadoop?

A typical Analytics Architecture

[Diagram: data sources (logs, web servers, devices such as mobile, network devices, misc sources) feed raw and unstructured data into ETL processing, which produces structured and aggregated data, which in turn feeds BI and reporting data]


4/32

Hadoop Architecture

Hadoop has two main systems:

MapReduce: This is the processing part, which plays a dual role: managing/scheduling jobs, and providing the programming abstraction for computations and producing results.

Hadoop Distributed File System (HDFS): This is the data part of Hadoop, providing high-bandwidth clustered storage.

MapReduce

HDFS
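
A minimal sketch of the MapReduce programming model described on this slide, written in plain Python rather than against the Hadoop Java API. The word-count task and the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions, not taken from the slides; on a real cluster the map and reduce steps would run as distributed tasks over data held in HDFS.

```python
from collections import defaultdict

def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle step: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

The programmer supplies only the map and reduce functions; grouping by key between the two is the part the framework takes care of.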


5/32

Hadoop Arch contd

Hadoop follows a master-slave architecture.

Master: NameNode and JobTracker. Slave: DataNode and TaskTracker.

The MapReduce server on a typical machine is called a TaskTracker.

The HDFS server on a typical machine is called a DataNode.

Machine

TaskTracker

DataNode


6/32

Cluster of Machines/Nodes

JobTracker keeps track of the jobs being run.

[Diagram: one JobTracker above four machines, each running a TaskTracker and a DataNode]


7/32

Cluster of Machines/Nodes

NameNode keeps track of where the data is located and acts as coordinator for all the DataNodes.

[Diagram: one NameNode above four machines, each running a TaskTracker and a DataNode]
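
A toy sketch of the NameNode's bookkeeping role described on this slide: it holds the mapping from file blocks to the DataNodes that store them, not the data itself. The class, method names, paths and node names are hypothetical, and the real NameNode tracks far more (namespace, permissions, DataNode health).

```python
class NameNode:
    """Toy stand-in for the NameNode: it records where blocks live, not the data."""

    def __init__(self):
        # (file path, block index) -> list of DataNode names holding a replica
        self.block_locations = {}

    def register_block(self, path, block_index, datanodes):
        """Record which DataNodes hold a replica of one block of a file."""
        self.block_locations[(path, block_index)] = list(datanodes)

    def locate(self, path):
        """Return, per block in order, the DataNodes a client could read from."""
        blocks = sorted(k for k in self.block_locations if k[0] == path)
        return [self.block_locations[k] for k in blocks]

if __name__ == "__main__":
    nn = NameNode()
    nn.register_block("/logs/day1", 0, ["datanode1", "datanode3"])
    nn.register_block("/logs/day1", 1, ["datanode2", "datanode4"])
    print(nn.locate("/logs/day1"))
```

A client asks the NameNode where the blocks are and then reads the data directly from the DataNodes it is given.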


8/32

Hadoop Characteristics

A few key attributes of Hadoop:

Hadoop is scalable, reliable, cost-effective and fault-tolerant.

Hadoop can store and process petabytes of data.

Hadoop performs large-scale computations and provides data redundancy and reliability.

Hadoop is best suited to distributed, batch-centric applications.

Designed to scale up from a single machine to several thousand machines with a high degree of fault tolerance.

Hadoop runs on commodity hardware.

Hadoop commands are similar to Linux/Unix commands:

hadoop fs -ls, hadoop fs -mkdir, hadoop fs -copyFromLocal, etc.


9/32

Hadoop Sub-Projects

PIG

Pig programs run as MapReduce jobs. Pig is a high-level language whose runtime works more like a compiler. Developed at Y!.

Consists of Pig Latin programs and the Pig runtime.

Example: a set of transformations that is converted into MapReduce tasks

register /homes/akondala/videoanalytics/jar/videoanalytics.jar

register /grid/0/gs/pig/current/libexec/released/sds.jar

A = load '/data/SDS/data/ULT_apps_b_daily/video_cdn/20090905/51329/view/part-00000' using com.yahoo.yst.sds.ULT.ULTLoader() as (simpleFields, mapFields,mapListFields);

B = FOREACH A GENERATE mapFields#'page_params' as page_params;

C = FOREACH B GENERATE videoanalytics.PARSE_ATLAS_DATA(page_params#'dt', page_params#'d') as atlas_data;

dump C;

grunt> D = FOREACH C GENERATE atlas_data.vid, atlas_data.b, atlas_data.cat;

grunt> E = limit D 20;

grunt> dump E;

-------------------------------------------------------------------------------------------------------------------------------

-- DUMP C: OUTPUT-- order "vid", "sid", "cat", "cdn", "ca", "ca_t", "cp", "cp_t", "us", "us_t", "b"

((3890802772,3890802772_700.mp4,flickr,,,,792600246,s,,,))

((3658435432,3658435432_700.mp4,flickr,,,,792600246,s,,,))

((3715267808,3715267808_700.mp4,flickr,,,,792600246,s,,,))

((3858990041,3858990041_700.mp4,flickr,,,,792600246,s,,,))

--DUMP E OUTPUT

(3792043310,,flickr)

(3821820098,,flickr)

(3763110035,,flickr)
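
To make the last two Pig statements concrete, here is a rough plain-Python equivalent of projecting the vid, b and cat fields and limiting the result. It is only a sketch: the helper names parse_row and project are hypothetical, the field order is taken from the "DUMP C" comment above, and Pig's LIMIT does not promise which rows are kept, so taking the first ones is a simplification.

```python
# Field order taken from the "DUMP C" comment in the example above.
FIELDS = ["vid", "sid", "cat", "cdn", "ca", "ca_t", "cp", "cp_t", "us", "us_t", "b"]

def parse_row(text):
    """Turn one dumped tuple such as '((a,b,...))' into a dict keyed by field name."""
    values = text.strip().strip("()").split(",")
    return dict(zip(FIELDS, values))

def project(rows, columns, limit):
    """Mimic FOREACH ... GENERATE followed by LIMIT on a list of row dicts."""
    return [tuple(row[c] for c in columns) for row in rows[:limit]]

if __name__ == "__main__":
    dumped = [
        "((3890802772,3890802772_700.mp4,flickr,,,,792600246,s,,,))",
        "((3658435432,3658435432_700.mp4,flickr,,,,792600246,s,,,))",
    ]
    rows = [parse_row(line) for line in dumped]
    for record in project(rows, ["vid", "b", "cat"], 20):
        print(record)
```

The empty middle value in each projected record corresponds to the empty "b" field, which is why the DUMP E lines show two adjacent commas.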


10/32

PIG and HIVE

MAP-REDUCE

HDFS

PIG: A high-level language consisting of Pig Latin programs and the Pig runtime.

Developed at Y!.

A set of transformations, such as load, dump, store, filter and group, that are converted into MapReduce tasks.

HIVE: A query language with an SQL-like interface.

Developed by Facebook.

Uses SQL-like constructs such as SELECT, FROM, GROUP BY and WHERE, which are transformed into MapReduce tasks.

It can be run from the Hive shell (command line), from JDBC/ODBC applications/drivers, or from a Thrift client.
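
As an illustration of how an SQL-like query can map onto MapReduce stages, the sketch below evaluates a hypothetical query, SELECT cat, COUNT(*) FROM views WHERE duration >= 30 GROUP BY cat, in plain Python. The table, its columns and the function name are assumptions made for the example; Hive's actual query planner is considerably more involved.

```python
from collections import defaultdict

def run_query(rows, min_duration):
    """Evaluate the hypothetical query in two MapReduce-style stages."""
    # Map stage: apply the WHERE filter and emit (GROUP BY key, 1) pairs.
    emitted = [(row["cat"], 1) for row in rows if row["duration"] >= min_duration]
    # Shuffle and reduce stage: sum the emitted values per key, giving COUNT(*).
    counts = defaultdict(int)
    for key, value in emitted:
        counts[key] += value
    return dict(counts)

if __name__ == "__main__":
    views = [
        {"cat": "flickr", "duration": 45},
        {"cat": "flickr", "duration": 10},
        {"cat": "news", "duration": 60},
    ]
    print(run_query(views, 30))
```

The WHERE clause becomes a map-side filter, the GROUP BY column becomes the shuffle key, and the aggregate is computed in the reducer.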


11/32

HBASE

Apart from batch processing, there is also a requirement to process data in real time; HBase fills that requirement.

MAP-REDUCE

HDFS

PIG

HIVE

HBASE ZooKeeper


12/32

HBASE

A columnar database that fills the need for real-time data access.

Does not support SQL and is not an RDBMS datastore.

A set of tables stored in HDFS.

Does not use MapReduce, but uses a similar master-slave architecture.

HBase can be accessed by Pig, Hive and MapReduce.

HBase keeps some of its metadata in ZooKeeper (used for coordination across several servers).
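
A toy model of the column-oriented layout, to show why lookups by row key suit real-time access. This is not the HBase client API: the class and method names are hypothetical, and real HBase cells are also versioned by timestamp, which is left out here.

```python
class ToyTable:
    """Toy column-oriented table: row key -> 'family:qualifier' -> value."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, column, value):
        """Store one cell; rows need not share the same set of columns."""
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column=None):
        """Fetch one cell, or a copy of the whole row when no column is given."""
        row = self.rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

if __name__ == "__main__":
    table = ToyTable()
    table.put("video#3890802772", "meta:cat", "flickr")
    table.put("video#3890802772", "stats:views", 12)
    print(table.get("video#3890802772", "meta:cat"))
    print(table.get("video#3890802772"))
```

Reads and writes address a single row key directly, with no job to schedule, which is the contrast with batch MapReduce processing.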


13/32

HADOOP Overall

MAP-REDUCE

HDFS

PIG

HIVE

HBASE ZooKeeper HCAT


14/32

Grid Architecture


15/32


16/32

Grid Databases

Grid databases run on Oracle (11.2.0.2/11.2.0.3) and MySQL (5.1/5.5).

Five critical databases on the Grid:

Grid Data Management RunTime (GDM) (formerly known as Data AcQuisition, DAQ)

Grid Data Management Console (GDMConsole)

Oozie

HCAT: Hive Metadata Store

Support Shop (MySQL)


17/32

Grid Data Management

GDM Physical Architecture: Multi-Colo


18/32

GDM Components

GDM consists of three facets: Acquisition, Replication and Retention, and optionally Archival.

Each colo will have all of these facet servers installed.

GDM Console is the centralized console server.

Data from all the colos is aggregated and replicated into the GDMConsole database.

The Console provides a GUI with a centralized view of the facets of all the colos, for their workflows and for configuring the datasets and datasources.


19/32


20/32

GDM Logical Architecture


21/32

GDM Database Layout


22/32

Oozie

Oozie is a workflow engine for Hadoop, more like a workflow manager and scheduler.

A set of Coordinator jobs executes on the basis of time and data availability.

Oozie supports several types of Hadoop jobs, such as Java MapReduce, Streaming MapReduce, Pig, Hive, Sqoop and more, as well as system-specific jobs, such as Java programs and shell scripts.

Oozie is a scalable, reliable and extensible system.


23/32

Oozie Data/Workflow

Oozie uses a database for storing workflow definitions and currently running workflow instances, including instance states and variables.

The Oozie schema is local to each Grid cluster, with about seven Oracle tables holding the Bundle Actions and Jobs, Coordinator Actions and Jobs, and Workflow Actions and Jobs.

Oozie runs workflow jobs with multiple actions.

Each workflow job creates at least three events in the database tables: CREATED, STARTED and SUCCEEDED/KILLED/FAILED.

Oozie workflow definitions are written in XML.
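
A small sketch of the minimum event trail a finished workflow job leaves in the database, following the three events listed on this slide. The function name and the tuple layout are hypothetical and do not reflect the actual Oozie table schema.

```python
TERMINAL_STATES = {"SUCCEEDED", "KILLED", "FAILED"}

def record_workflow_events(job_id, outcome):
    """Return the minimum event rows one finished workflow job would leave behind."""
    if outcome not in TERMINAL_STATES:
        raise ValueError(f"unknown terminal state: {outcome}")
    return [(job_id, "CREATED"), (job_id, "STARTED"), (job_id, outcome)]

if __name__ == "__main__":
    for event in record_workflow_events("wf-0001", "SUCCEEDED"):
        print(event)
```

For capacity planning this gives a lower bound of three event rows per workflow job, before counting the rows for its individual actions.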

Oozie use-case: http://twiki.corp.yahoo.com/view/CCDI/OozieWorkflow


24/32

Oozie coordinator

Coordinator automates workflow execution

Triggers workflow execution based on:

time (like a cron job)

input data availability

[Diagram: the Coordinator checks the time and checks data availability in HDFS, then starts the Workflow]
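
The coordinator's two triggers on this slide can be sketched as a single check. This is a simplified illustration, not Oozie's implementation: the function name, its parameters and the paths are hypothetical, and path_exists stands in for whatever call would test for a path in HDFS.

```python
from datetime import datetime

def should_start(now, nominal_time, required_paths, path_exists):
    """Return True when both the time trigger and the data trigger are satisfied."""
    time_ready = now >= nominal_time
    data_ready = all(path_exists(p) for p in required_paths)
    return time_ready and data_ready

if __name__ == "__main__":
    available = {"/data/in/2019-08-13"}  # hypothetical paths that already exist
    print(should_start(
        now=datetime(2019, 8, 13, 1, 0),
        nominal_time=datetime(2019, 8, 13, 0, 0),
        required_paths=["/data/in/2019-08-13"],
        path_exists=available.__contains__,
    ))
```

Passing the existence check in as a function keeps the sketch independent of any particular file system client.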


25/32

Oozie Architecture

[Diagram: Web-services JSON/REST API with Security, WS API and WS Callback; Command Queue and Command Executor Thread Pool; Commands (submit, start, rerun, suspend, resume, kill, info, action start, action end, action check, callback, signal, job notification); DAG Engine; ActionExecutors (m/r, fs, pig, sub-wf); Recovery Daemon Thread; Instrumentation; WFStore and WFlib; Oracle DB]


26/32

HCAT/Hive

Hadoop Catalog (HCAT): The HCAT Oracle databases store all the Hive metadata.

Hive always has a metadata store that holds information about the tables that Hive can process.

This metadata store is pulled out of Hive and made into HCatalog, making it available to other applications.

Real-time SQL abstraction layer.

Hive metadata stores all the information about the tables, their partitions, the schemas, the columns and their types, the table locations, etc.

This information can be queried or modified using a Thrift interface, and as a result it can be called from clients in different programming languages.
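
A toy sketch of the kind of information the metadata store holds and how a client might look it up. The class names, fields, table name and HDFS paths are all hypothetical; the real metastore schema and its Thrift API are much richer.

```python
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    """What a metadata store might keep for one table (illustrative fields only)."""
    name: str
    columns: dict                                    # column name -> type name
    location: str                                    # base directory of the table in HDFS
    partitions: dict = field(default_factory=dict)   # partition value -> HDFS directory

class ToyMetastore:
    """In-memory stand-in for the metadata store."""

    def __init__(self):
        self.tables = {}

    def create_table(self, meta):
        self.tables[meta.name] = meta

    def add_partition(self, table, value, location):
        self.tables[table].partitions[value] = location

    def partition_location(self, table, value):
        """Answer 'where is this partition's data?' without touching HDFS."""
        return self.tables[table].partitions.get(value)

if __name__ == "__main__":
    store = ToyMetastore()
    store.create_table(TableMetadata("views", {"vid": "string", "cat": "string"}, "/warehouse/views"))
    store.add_partition("views", "dt=20090905", "/warehouse/views/dt=20090905")
    print(store.partition_location("views", "dt=20090905"))
```

Because lookups like this hit only the metadata store, any application that can reach it can discover schemas and locations without scanning the data.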


27/32

HCAT/Hive

The reason for storing this information in an RDBMS is to serve clients faster, in almost real time.

The metadata store gives Hive information about the location, data types and content of data in HDFS.

For object-relational mapping, the metadata store uses DataNucleus, instead of storing this information on HDFS, due to latency issues. This ORM is compatible with most RDBMS plugins.



28/32

New Grid Flow

HCAT Integrated with GDM and Oozie

[Diagram: components are Off-Grid Data, GDM, Data, HCatalog, CMS, Oozie, Workflow 1 and Workflow 2; the numbered steps are:]

0: load data

1: add data

2: add partition(s)

3: publish partition(s)

4: send notification

5: start workflow

6: add data

7: publish metadata


29/32

SupportShop

SupportShop is the Grid portal, deployed on MySQL in the BF1 and GQ1 colos.

SupportShop is a one-stop self-help solution for all Grid Yahoo! users: the Web User Portal (WUP). WUP should provide users with the following content:

Convenient access to various internal and external educational and informational resources on Yahoo Grid technologies.

A dashboard providing real-time status, critical alerts and announcements about various Yahoo Grids and currently running user jobs.

Various historical user-facing reports on Grid utilization, workloads, data access, etc.

Various Grid forms for users to request resources on the grid and report problems, with a mechanism to follow up.

Various technology-enabled services for users for more productive usage of the grids (see the detailed requirements for tools supported in the first release).

Customizable portal interface to only view the information of interest

yo/supportshop


30/32

SupportShop Physical Layout


31/32

SupportShop Architecture


32/32

References

Grid DBA Twiki: http://twiki.corp.yahoo.com/view/Grid/GridProjDBinfrastructure

Grid Web Portal: http://twiki.corp.yahoo.com/view/Grid/WebHome

Grid Database Inventory: https://docs.google.com/a/yahoo-inc.com/spreadsheet/ccc?key=0ApG6hF0D_DhvdFItdEhDcEQ3cjBvWUJGVm5NUm9WYkE#gid=0

Grid Clusters/version: yo/gridversions

SupportShop: yo/supportshop
