8/13/2019 Grid Infrastructure and Databases BrownBag
Hadoop in a nutshell,
Grid Infrastructure, Databases and More
~Ashok Kondala
What is Hadoop?
Hadoop is derived from the Google File System (GFS) and Google's Map-Reduce papers.
Yahoo! is a major contributor to Hadoop and uses it extensively!
Hadoop was created by Doug Cutting and Michael J. Cafarella.
Hadoop developed into a set of Java-based Apache open-source projects.
Hadoop is capable of processing large datasets, structured, unstructured and semi-structured, using a widely distributed cluster of servers.
It is the Big Data solution for dealing with the complexities of the high 3Vs of data (volume, velocity, variety).
Why Hadoop?
[Diagram: a typical analytics architecture. Raw and unstructured data from logs, web servers, devices (such as mobile), network devices and miscellaneous sources flows through ETL processing into structured and aggregated data, which feeds BI and reporting.]
Hadoop Architecture
Hadoop has two main systems:
Map-Reduce: the processing part, which plays a dual role: managing/scheduling jobs, and providing the programming abstraction for computations and their results.
Hadoop Distributed File System (HDFS): the data part of Hadoop, with high-bandwidth clustered storage.
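The Map-Reduce abstraction above can be sketched in miniature (illustrative Python, not the actual Hadoop Java API): the framework runs the map step on the nodes holding the data, shuffles intermediate pairs by key, then runs the reduce step.

```python
from collections import defaultdict

def map_phase(documents):
    # Mapper: emit (word, 1) for every word in every input record.
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle/sort: group all intermediate values by key, as the
    # framework does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: aggregate the values for each key (here, sum counts).
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big cluster"])))
print(counts)  # {'big': 2, 'data': 1, 'cluster': 1}
```

In the real system, map tasks run in parallel on many TaskTrackers and the shuffle moves data between machines; the program only supplies the map and reduce functions.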
Hadoop Architecture (contd.)
Hadoop subscribes to a master-slave architecture:
Master: NameNode and JobTracker. Slave: DataNode and TaskTracker.
The map-reduce server on a typical machine is called a TaskTracker.
The HDFS server on a typical machine is called a DataNode.
[Diagram: a single machine running a TaskTracker and a DataNode]
Cluster of Machines/Nodes
The JobTracker keeps track of the jobs being run.
[Diagram: one JobTracker coordinating four nodes, each running a TaskTracker and a DataNode]
Cluster of Machines/Nodes
The NameNode keeps track of data-location information and acts as the coordinator for all the DataNodes.
[Diagram: one NameNode coordinating four nodes, each running a TaskTracker and a DataNode]
Hadoop Characteristics
A few key attributes of Hadoop:
Hadoop is scalable, reliable, cost-effective and fault-tolerant.
Hadoop can store and process petabytes of data.
Hadoop does extremely powerful computations and provides data redundancy and reliability.
Hadoop is geared toward distributed and batch-centric applications.
Designed to scale from a single machine up to several thousands of machines, with a high degree of fault tolerance.
Hadoop runs on commodity hardware.
Hadoop commands are similar to Linux/Unix commands:
hadoop fs -ls, hadoop fs -mkdir, hadoop fs -copyFromLocal, etc.
Hadoop Sub-Projects
PIG: Pig programs run as map-reduce jobs; Pig Latin is a high-level language, more like a compiler. Developed at Y!.
Pig constitutes Pig Latin programs and the Pig runtime.
Example: a set of transformations which is converted into map-reduce tasks:

register /homes/akondala/videoanalytics/jar/videoanalytics.jar
register /grid/0/gs/pig/current/libexec/released/sds.jar
A = load '/data/SDS/data/ULT_apps_b_daily/video_cdn/20090905/51329/view/part-00000' using com.yahoo.yst.sds.ULT.ULTLoader() as (simpleFields, mapFields, mapListFields);
B = FOREACH A GENERATE mapFields#'page_params' as page_params;
C = FOREACH B GENERATE videoanalytics.PARSE_ATLAS_DATA(page_params#'dt', page_params#'d') as atlas_data;
dump C;
grunt> D = FOREACH C GENERATE atlas_data.vid, atlas_data.b, atlas_data.cat;
grunt> E = limit D 20;
grunt> dump E;

-- DUMP C OUTPUT -- order: "vid", "sid", "cat", "cdn", "ca", "ca_t", "cp", "cp_t", "us", "us_t", "b"
((3890802772,3890802772_700.mp4,flickr,,,,792600246,s,,,))
((3658435432,3658435432_700.mp4,flickr,,,,792600246,s,,,))
((3715267808,3715267808_700.mp4,flickr,,,,792600246,s,,,))
((3858990041,3858990041_700.mp4,flickr,,,,792600246,s,,,))
-- DUMP E OUTPUT
(3792043310,,flickr)
(3821820098,,flickr)
(3763110035,,flickr)
PIG and HIVE
[Diagram: Pig and Hive sit on top of Map-Reduce, which sits on top of HDFS]
PIG: a high-level language; constitutes Pig Latin programs and the Pig runtime. Developed @Y!. A set of transformations which converts to map-reduce tasks, such as load, dump, store, filter, group.
HIVE: a query language with a SQL-like interface. Developed by Facebook. Uses SQL-like constructs (SELECT, FROM, GROUP BY, WHERE) which are transformed into map-reduce tasks. It can be run from the Hive shell command line, from JDBC/ODBC applications/drivers, or from a Thrift client.
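As a sketch of how a SQL-style GROUP BY can be transformed into map and reduce steps (illustrative Python with made-up sample rows, not Hive itself):

```python
from collections import defaultdict

# Roughly: SELECT cat, COUNT(*) FROM videos GROUP BY cat,
# expressed as a map step followed by a reduce step.
rows = [{"vid": 1, "cat": "flickr"},
        {"vid": 2, "cat": "flickr"},
        {"vid": 3, "cat": "news"}]

# Map: emit (group key, 1) for each row.
mapped = [(row["cat"], 1) for row in rows]

# Shuffle + reduce: group by key and aggregate the counts.
groups = defaultdict(int)
for key, value in mapped:
    groups[key] += value

print(dict(groups))  # {'flickr': 2, 'news': 1}
```

Hive's query planner produces this kind of map-reduce plan automatically from the SQL-like statement.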
HBASE
Apart from batch processing, there is a requirement to process data in real time; HBase fills that requirement.
[Diagram: HBase and ZooKeeper join Pig and Hive on top of Map-Reduce and HDFS]
HBASE
HBase is a columnar database that fills real-time data needs.
It does not support SQL and is not an RDBMS datastore.
It is a set of tables stored in HDFS.
It does not use map-reduce, but it uses a similar master-slave architecture.
HBase can be accessed by Pig, Hive and MapReduce.
HBase keeps some of its metadata in ZooKeeper (used for coordination across several servers).
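HBase's columnar, versioned data model can be sketched as a nested map from row key to column to timestamped versions (a toy Python model; the class and method names are hypothetical, not the HBase API):

```python
from collections import defaultdict

class TinyHTable:
    """Toy model of an HBase-style table (illustrative only):
    row key -> "family:qualifier" column -> list of (timestamp, value)
    versions, newest first."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, ts):
        # Each cell keeps multiple timestamped versions.
        self.rows[row][column].append((ts, value))
        self.rows[row][column].sort(reverse=True)

    def get(self, row, column):
        # A read returns the newest version by default.
        versions = self.rows[row][column]
        return versions[0][1] if versions else None

t = TinyHTable()
t.put("video:123", "stats:views", 10, ts=1)
t.put("video:123", "stats:views", 25, ts=2)
print(t.get("video:123", "stats:views"))  # 25
```

Reads and writes address individual cells by row key and column, which is what makes low-latency, real-time access possible without map-reduce.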
HADOOP: Overall
[Diagram: the overall stack: HDFS at the base, Map-Reduce above it, with Pig, Hive, HBase, HCat and ZooKeeper on top]
Grid Architecture
Grid Databases
Grid databases run on Oracle (11.2.0.2/11.2.0.3) and MySQL (5.1/5.5).
Five critical databases on the Grid:
Grid Data Management RunTime (GDM) (formerly known as Data AcQuisition, DAQ)
Grid Data Management Console (GDM Console)
Oozie
HCAT (Hive Metadata Store)
Support Shop (MySQL)
Grid Data Management
GDM Physical Architecture: Multi-Colo
GDM Components
GDM consists of three facets: Acquisition, Replication and Retention, plus optionally Archival.
Each colo has all of these facet servers installed.
GDM Console is the centralized console server.
Data from all the colos is aggregated and replicated into the GDM Console database.
The Console provides a GUI with a centralized view of the facets across all colos, for their workflows and for configuring the datasets and datasources.
GDM Logical Architecture
GDM Database Layout
Oozie
Oozie is the workflow engine for Hadoop: more like a workflow manager and scheduler.
A set of coordinator jobs executes on the basis of time and data availability.
Oozie supports several types of Hadoop jobs, such as Java map-reduce, streaming map-reduce, Pig, Hive, Sqoop and more, as well as system-specific jobs, such as Java programs and shell scripts.
Oozie is a scalable, reliable and extensible system.
Oozie Data/Workflow
Oozie uses a database for storing workflow definitions and currently running workflow instances, including instance states and variables.
The Oozie schema is local to each Grid cluster, with about seven Oracle tables holding the Bundle Actions and Jobs, Coordinator Actions and Jobs, and Workflow Actions and Jobs.
Oozie runs workflow jobs with multiple actions.
Each workflow job creates at least three events in the database tables: CREATED, STARTED and SUCCEEDED/KILLED/FAILED.
Oozie workflow definitions are written in XML.
Oozie use-case: http://twiki.corp.yahoo.com/view/CCDI/OozieWorkflow
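As a hypothetical sketch of the XML shape of a workflow definition (node names and the Pig script are placeholders, not taken from an actual Grid workflow): a start node, one Pig action, and ok/error transitions to end/kill nodes.

```xml
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.2">
  <start to="pig-node"/>
  <action name="pig-node">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>script.pig</script>
    </pig>
    <!-- Transitions produce the STARTED / SUCCEEDED / KILLED / FAILED
         events that Oozie records in its database tables. -->
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Pig action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```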
Oozie coordinator
The coordinator automates workflow execution.
It triggers workflow execution based on:
time (like a cron job)
input data availability
[Diagram: the coordinator checks the time and checks data availability in HDFS, then starts the workflow]
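The two trigger conditions can be sketched as a single gating check (hypothetical Python; os.path.exists stands in for an HDFS data-availability probe):

```python
import os
import time

def coordinator_should_start(trigger_time, input_path):
    """Hypothetical sketch of a coordinator's trigger check: the
    workflow starts only when the scheduled time has passed AND the
    input data it depends on exists."""
    time_reached = time.time() >= trigger_time
    data_available = os.path.exists(input_path)  # stand-in for HDFS check
    return time_reached and data_available

# Held until both conditions are met; a path that exists plus a
# trigger time in the past means the workflow can start now.
ready = coordinator_should_start(0, os.getcwd())
print(ready)  # True
```

A real coordinator re-evaluates this check periodically and materializes a workflow run each time both conditions become true for a new nominal time.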
Oozie Architecture
[Diagram: a JSON/REST web-services API (with security, a WS API and WS callbacks) feeds a DAG engine; commands (submit, start, rerun, resume, suspend, kill, info) pass through a command queue serviced by an executor thread pool, with a recovery daemon thread; action executors handle m/r, fs, pig and sub-workflow actions; workflow state (WF lib, store) is persisted in an Oracle DB; instrumentation, start/end/check action signals, callbacks and job notifications round out the system]
HCAT/Hive
Hadoop Catalog (HCAT): the HCAT Oracle databases store all the Hive metadata.
Hive always has a metadata store that holds information about the tables Hive can process.
This metadata store was pulled out of Hive and made into HCatalog, making it available to other applications.
It is a real-time SQL abstraction layer.
The Hive metadata stores all the information about the tables: their partitions, the schemas, the columns and their types, the table locations, etc.
This information can be queried or modified using a Thrift interface, and as a result it can be called from clients in different programming languages.
HCAT/Hive
The reason for storing this information in an RDBMS is to serve clients faster, in near real time.
The metadata store gives Hive information about location, data types and content in HDFS.
For object-relational mapping, the metadata store uses DataNucleus, instead of storing this information on HDFS, due to latency issues; this ORM is compatible with most RDBMS plugins.
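What the metastore tracks can be sketched as a toy in-memory catalog (hypothetical Python; the class and method names are illustrative, not the Hive Thrift API):

```python
class TinyMetastore:
    """Toy sketch of a Hive-style metadata store: it records table
    schemas, HDFS locations and partitions, so that other tools can
    find and interpret the data files."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, columns, location):
        # columns: mapping of column name -> type, as the metastore tracks.
        self.tables[name] = {"columns": columns,
                             "location": location,
                             "partitions": []}

    def add_partition(self, table, spec):
        # A partition spec maps partition keys to values, e.g. a date.
        self.tables[table]["partitions"].append(spec)

    def describe(self, table):
        return self.tables[table]

ms = TinyMetastore()
ms.create_table("page_views", {"vid": "string", "views": "int"},
                "/data/page_views")
ms.add_partition("page_views", {"dt": "20090905"})
print(ms.describe("page_views")["partitions"])  # [{'dt': '20090905'}]
```

Pulling this catalog out of Hive (as HCatalog) is what lets Pig and map-reduce jobs share the same table definitions.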
New Grid Flow
HCAT integrated with GDM and Oozie.
[Diagram: components: GDM, Oozie (Workflow 1 and Workflow 2), HCatalog, CMS, and off-grid data. Steps: 0: load data; 1: add data; 2: add partition(s); 3: publish partition(s); 4: send notification; 5: start workflow; 6: add data; 7: publish metadata]
SupportShop
SupportShop is the Grid portal, deployed on MySQL in the BF1 and GQ1 colos.
SupportShop is a one-stop self-help solution for all Yahoo! Grid users: the Web User Portal (WUP). The WUP should provide users with the following content:
Convenient access to various internal and external educational and informational resources on Yahoo! Grid technologies.
A dashboard providing real-time status, critical alerts and announcements about the various Yahoo! Grids and currently running user jobs.
Various historical user-facing reports on Grid utilization, workloads, data access, etc.
Various Grid forms for users to request resources on the grid and report problems, with a mechanism to follow up.
Various technology-enabled services for more productive usage of the grids (see the detailed requirements for the tools supported in the first release).
A customizable portal interface to view only the information of interest.
yo/supportshop
SupportShop Physical Layout
SupportShop Architecture
References
Grid DBA Twiki: http://twiki.corp.yahoo.com/view/Grid/GridProjDBinfrastructure
Grid Web Portal: http://twiki.corp.yahoo.com/view/Grid/WebHome
Grid Database Inventory: https://docs.google.com/a/yahoo-inc.com/spreadsheet/ccc?key=0ApG6hF0D_DhvdFItdEhDcEQ3cjBvWUJGVm5NUm9WYkE#gid=0
Grid Clusters/versions: yo/gridversions
SupportShop: yo/supportshop