Presented By
Kelly Technologies
www.kellytechno.com
Agenda
- What is distributed computing
- Why distributed computing
- Common Architectures
- Best Practice
- Case study: Condor; Hadoop (HDFS and MapReduce)
What is Distributed Computing/System?
Distributed computing
- A field of computer science that studies distributed systems.
- The use of distributed systems to solve computational problems.
Distributed system
- Wikipedia: there are several autonomous computational entities, each of which has its own local memory; the entities communicate with each other by message passing.
- Operating System Concepts: the processors communicate with one another through various communication lines, such as high-speed buses or telephone lines; each processor has its own local memory.
What is Distributed Computing/System?
Distributed program
- A computer program that runs in a distributed system.
Distributed programming
- The process of writing distributed programs.
What is Distributed Computing/System?
Common properties
- Fault tolerance: when one or more nodes fail, the whole system can still work, apart from reduced performance. The status of each node needs to be checked.
- Each node plays a partial role: each computer has only a limited, incomplete view of the system, and may know only one part of the input.
- Resource sharing: each user can share the computing power and storage resources in the system with other users.
- Load sharing: dispatching tasks to each node helps spread the load across the whole system.
- Easy to expand: adding nodes should take little time, ideally none at all.
Why Distributed Computing?
The nature of the application
- Performance
  - Computing intensive: the task consumes a lot of computing time, for example calculating π.
  - Data intensive: the task deals with a large number of files or very large files, for example Facebook or the LHC (Large Hadron Collider).
- Robustness
  - No SPOF (Single Point Of Failure): other nodes can execute the same task that was executed on the failed node.
Common Architectures
- Communicate and coordinate work among concurrent processes.
- Processes communicate by sending/receiving messages.
- Communication can be synchronous or asynchronous.
Common Architectures
Master/slave architecture
- Master/slave is a model of communication where one device or process has unidirectional control over one or more other devices.
- Database replication: the source database can be treated as the master, and the destination database can be treated as a slave.
- Client-server: web browsers and web servers.
Common Architectures
Data-centric architecture
- Using a standard, general-purpose relational database management system, rather than customized in-memory or file-based data structures and access methods.
- Using dynamic, table-driven logic, rather than logic embodied in previously compiled programs.
- Using stored procedures, rather than logic running in middle-tier application servers.
- Using shared databases as the basis for communicating between parallel processes, rather than direct inter-process communication via message passing.
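The last point, communicating through a shared database instead of message passing, can be sketched with SQLite (the table name and payload are illustrative):

```python
# Sketch: data-centric communication, where two tasks coordinate
# through a shared database table instead of direct messages.
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "shared.db")

def producer():
    """Write a work item into the shared table."""
    with sqlite3.connect(db_path) as db:
        db.execute("CREATE TABLE IF NOT EXISTS tasks (id INTEGER PRIMARY KEY, payload TEXT)")
        db.execute("INSERT INTO tasks (payload) VALUES (?)", ("build report",))

def consumer():
    """Read the oldest work item from the shared table."""
    with sqlite3.connect(db_path) as db:
        row = db.execute("SELECT payload FROM tasks ORDER BY id LIMIT 1").fetchone()
        return row[0]

producer()
print(consumer())  # build report
```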
Best Practice
Data intensive or computing intensive?
- Consider the data size, the amount of data, and the attributes of the data the application consumes.
- Computing intensive: we can move the data to the nodes where we execute the jobs.
- Data intensive: we can partition/replicate the data across different nodes and then execute our tasks on those nodes.
  - Reduce data replication when executing tasks.
  - Master nodes need to know the data location.
- No data loss when incidents happen
  - SAN (Storage Area Network)
  - Data replication on different nodes
- Synchronization: when splitting tasks across different nodes, how can we make sure these tasks are synchronized?
Best Practice
Robustness
- The system should still be safe when one or several nodes fail, and should recover when the failed nodes come back online with little or no further action (e.g. Condor only restarts its daemon).
- Failure detection: when any node fails, the master node can detect the situation, e.g. by heartbeat detection.
- Apps/users don't need to know when a partial failure happens; tasks are restarted on other nodes for the users.
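Heartbeat detection can be sketched as follows; the class, node names, and timeout are illustrative, and real systems such as Condor or Hadoop use their own protocols and intervals:

```python
# Sketch: heartbeat-based failure detection. Nodes report in
# periodically; any node silent for longer than the timeout is
# considered failed.
import time

class HeartbeatMonitor:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, node):
        """Record that a node just sent its periodic heartbeat."""
        self.last_seen[node] = time.monotonic()

    def failed_nodes(self):
        """Return nodes whose last heartbeat is older than the timeout."""
        now = time.monotonic()
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]

monitor = HeartbeatMonitor(timeout=0.1)
monitor.heartbeat("node-1")
monitor.heartbeat("node-2")
time.sleep(0.2)               # both nodes go silent...
monitor.heartbeat("node-2")   # ...then node-2 reports in again
print(monitor.failed_nodes())  # ['node-1']
```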
Best Practice
Network issues
- Bandwidth: we need to think about bandwidth when copying files from one node to other nodes, i.e. when we want to execute a task on nodes that do not hold the data.
Scalability
- Easy to expand (e.g. Hadoop: modify the configuration and start the daemon).
Optimization
- What can we do if the performance of some nodes is not good?
  - Monitor the performance of each node, based on information exchanges such as heartbeats or logs.
  - Resume the same task on other nodes.
Best Practice
App/User
- Apps/users shouldn't need to know how nodes communicate with each other.
- User mobility: users can access the system from a single entry point or from anywhere (e.g. Grid: UI (user interface); Condor: submit machine).
Case study - Condor
Condor
- Designed for computing-intensive jobs.
- Queuing policy: matches tasks with computing nodes.
- Resource classification: each resource can advertise its attributes, and the master can classify resources accordingly.
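The matching of advertised resources against job requirements can be sketched like this; the dictionaries and attribute names are illustrative, not Condor's real ClassAd syntax:

```python
# Sketch: Condor-style matchmaking. Machines advertise attributes,
# a job states requirements, and the negotiator pairs them up.
machines = [
    {"name": "exec-1", "memory_mb": 2048, "os": "linux"},
    {"name": "exec-2", "memory_mb": 512,  "os": "linux"},
]

job = {"min_memory_mb": 1024, "os": "linux"}

def matches(machine, job):
    """Check a job's requirements against a machine's advertised attributes."""
    return (machine["memory_mb"] >= job["min_memory_mb"]
            and machine["os"] == job["os"])

candidates = [m["name"] for m in machines if matches(m, job)]
print(candidates)  # ['exec-1']
```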
Case study - Condor
Case study - Condor
Roles
- Central manager: the collector of information and the negotiator between resources and resource requests.
- Execution machine: responsible for executing Condor tasks.
- Submit machine: responsible for submitting Condor tasks.
- Checkpoint servers: responsible for storing all checkpoint files for the tasks.
Case study - Condor
Robustness
- If one execution machine fails, we can execute the same task on other nodes.
- Recovery: only the daemon needs to be restarted when the failed nodes come back online.
Case study - Condor
Resource sharing
- Each Condor user can share computing power with other Condor users.
Synchronization
- Users need to take care of it themselves: users can execute MPI jobs in a Condor pool, but need to think about the issues of synchronization and deadlock.
Failure detection
- The central manager knows when nodes fail, based on the update notifications sent by the nodes.
Scalability
- Only a few commands need to be executed when new nodes come online.
Case study - Hadoop
HDFS
- NameNode: manages the file system namespace, regulates access to files by clients, and determines the mapping of blocks to DataNodes.
- DataNode: manages storage attached to the node it runs on, saves CRC checksums, and sends heartbeats to the NameNode. Each file is split into chunks, and each chunk is stored on several DataNodes.
- Secondary NameNode: responsible for merging the fsImage and the EditLog.
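The chunk-splitting and placement idea can be sketched as follows; the block size, node names, and round-robin placement are illustrative and much simpler than HDFS's real rack-aware policy:

```python
# Sketch: HDFS-style block placement. Data is split into fixed-size
# chunks and each chunk is mapped to several datanodes (replicas).
import itertools

def place_blocks(data, block_size, datanodes, replication):
    """Split data into blocks and assign each block to `replication` nodes."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    nodes = itertools.cycle(datanodes)  # simple round-robin placement
    placement = {}
    for idx in range(len(blocks)):
        placement[idx] = [next(nodes) for _ in range(replication)]
    return blocks, placement

blocks, placement = place_blocks(b"abcdefghij", block_size=4,
                                 datanodes=["dn1", "dn2", "dn3"],
                                 replication=2)
print(len(blocks))   # 3
print(placement[0])  # ['dn1', 'dn2']
```

This is the mapping the NameNode maintains: it stores only the metadata (which nodes hold which blocks), while the blocks themselves live on the DataNodes.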
Case study - Hadoop
Case study - Hadoop
MapReduce framework
- JobTracker: responsible for dispatching jobs to each TaskTracker, and for job management such as removal and scheduling.
- TaskTracker: responsible for executing jobs; usually the TaskTracker launches another JVM to execute the job.
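The map-reduce programming model itself can be sketched in a single process: map emits (key, value) pairs, the framework groups them by key, and reduce aggregates each group. This word-count sketch runs everything locally; in Hadoop, each mapper would handle one input split on one node.

```python
# Sketch: the map-reduce model as a local word count.
from collections import defaultdict

def map_phase(line):
    """Map: emit (word, 1) for every word in a line."""
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    """Reduce: sum the counts emitted for one word."""
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle: group intermediate pairs by key.
groups = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        groups[word].append(count)

result = dict(reduce_phase(w, c) for w, c in groups.items())
print(result["the"], result["fox"])  # 3 2
```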
Case study - Hadoop
From Hadoop - The Definitive Guide
Case study - Hadoop
Data replication
- Data are replicated to different nodes, which reduces the possibility of data loss.
- Data locality: jobs are sent to the node where the data reside.
Robustness
- If one DataNode fails, we can get the data from other nodes.
- If one TaskTracker fails, we can start the same task on a different node.
- Recovery: only the daemon needs to be restarted when the failed nodes come back online.
Case study - Hadoop
Resource sharing
- Each Hadoop user can share computing power and storage space with other Hadoop users.
Synchronization
- No synchronization is needed between tasks.
Failure detection
- The NameNode/JobTracker knows when a DataNode/TaskTracker fails, based on heartbeats.
Case study - Hadoop
Scalability
- Only a few commands need to be executed when new nodes come online.
Optimization
- A speculative task is launched only when a task takes too much time on one node; the slower task is killed when the other one has finished.
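Speculative execution can be sketched with two competing copies of a task, where the first finisher wins and the slower copy is discarded; the thread setup and durations are illustrative, and real Hadoop tracks per-task progress to decide when to speculate.

```python
# Sketch: speculative execution. A backup copy of a slow task is
# launched, and whichever copy finishes first supplies the result.
import threading
import time

result = {}
done = threading.Event()

def run_task(node, duration):
    time.sleep(duration)      # stand-in for the actual work
    if not done.is_set():     # first finisher wins; the loser's result is dropped
        result["node"] = node
        done.set()

slow = threading.Thread(target=run_task, args=("node-slow", 0.5))
fast = threading.Thread(target=run_task, args=("node-fast", 0.1))  # speculative copy
slow.start(); fast.start()
done.wait()
print(result["node"])  # node-fast
slow.join(); fast.join()
```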