Hadoop Summit San Jose 2014 - Analyzing Historical Data of Applications on Hadoop YARN: for Fun and...

Analyzing Historical Data of Applications on Hadoop YARN: for Fun and Profit

Mayank Bansal, Zhijie Shen

Agenda

• Who are we?

• Why do we need a new History Server?

• Application History Server

• Timeline Server

• Future Work

Who we are

• Hadoop Architect @ eBay

• Apache Hadoop Committer

• Apache Oozie PMC and Committer

• Current

– Leading Hadoop Core Development for YARN and MapReduce @ eBay

• Past

– Working on Scheduler / Resource Managers

– Working on Distributed Systems

– Data Pipeline frameworks

Mayank Bansal

Who we are

• Software Engineer @ Hortonworks

• Apache Hadoop Committer

• Apache Samza PPMC and Committer

Zhijie Shen


MR JobHistory Server

• We already have the MapReduce Job History Server

• It is customized for MapReduce only

• Storage is HDFS only

• Stored data is very MR specific

– Counters

– Mappers and Reducers

• If you only run MapReduce, you are good.

Hadoop-2

Single Use System: Batch Apps

Multi Purpose Platform: Batch, Interactive, Streaming

Issues with current Job History

• What if I have other applications?

• RM crashes

• Hard Limit on # Apps

• Upgrades / Updates


Application History Server

• Separate Process

• Pluggable Storage

– HDFS

– In-Memory

• Resource Manager directly writes to Storage

• Aggregated Logs

• Separate UI, CLI and REST endpoint

Storage:

• It stores generic data

– Application-level data (queue, user, etc.)

– List of ApplicationAttempts

– Information about each ApplicationAttempt

– List of containers for each ApplicationAttempt

– Generic information about each container

• CLI Interface

$ yarn application -status <Application ID>

$ yarn applicationattempt -list <Application ID>

• REST APIs

• http://localhost:8188/ws/v1/applicationhistory/apps/appid
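As a rough illustration of the REST endpoint above, the following Python sketch builds the application-history URL and, optionally, fetches it. The host, port and path come from the slide, with `appid` treated as a placeholder for a real application ID. The helper names `app_history_url` and `fetch_app_history` and the sample application ID are hypothetical, and the JSON response format is an assumption.

```python
import json
import urllib.request


def app_history_url(app_id, host="localhost", port=8188):
    """Build the application-history URL from the slide (appid is a placeholder there)."""
    return f"http://{host}:{port}/ws/v1/applicationhistory/apps/{app_id}"


def fetch_app_history(app_id, host="localhost", port=8188, timeout=10):
    """Fetch one application's history; assumes the server replies with JSON."""
    req = urllib.request.Request(
        app_history_url(app_id, host, port),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical application ID, used only to show the URL shape.
    print(app_history_url("application_1400000000000_0001"))
```

Calling `fetch_app_history` only makes sense against a running history server; the URL builder can be used on its own.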

• Scalability for storage

• One file per application

• File format is protobuf

• Size of HDFS files

• Multiple RM threads writing to History Storage

# of Containers  | 100   | 1K     | 10K    | 100K
Size of the File | 19 KB | 184 KB | 1.8 MB | 19 MB
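To make the growth rate in these numbers easier to see, the short Python sketch below divides each file size by its container count. It assumes decimal units (1 KB = 1000 bytes), which the slide does not specify; the names `SIZES_BYTES` and `bytes_per_container` are hypothetical.

```python
# File sizes from the slide, converted with decimal units (1 KB = 1000 bytes).
# The unit convention is an assumption; the slide does not state it.
SIZES_BYTES = {
    100: 19_000,
    1_000: 184_000,
    10_000: 1_800_000,
    100_000: 19_000_000,
}


def bytes_per_container(sizes):
    """Return the average history-file bytes per container for each row."""
    return {containers: size / containers for containers, size in sizes.items()}


if __name__ == "__main__":
    for containers, per in bytes_per_container(SIZES_BYTES).items():
        print(f"{containers:>7} containers: about {per:.0f} bytes per container")
```

Under that assumption every row works out to roughly 180 to 190 bytes per container, so the file size grows about linearly with the number of containers.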


Timeline Service - Motivation

• YARN takes care of it

– Relieving the application from running its own monitoring service

• Application diversity

– Framework-specific metadata/metrics

Timeline Service – Data Model

• Entity Type

– An abstract concept of anything

• Entity

– One specific instance of an entity type

– Defines the relationships between entities

• Event

– Something that happens to an entity
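The data model above can be sketched in a few lines of Python. This is an illustration only, not the server's schema: the class names, the field names (chosen to resemble lowercase JSON keys), the millisecond timestamps, and the sample types `MY_TASK`, `MY_JOB` and `TASK_STARTED` are all assumptions.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Event:
    """Something that happens to an entity."""
    eventtype: str
    timestamp: int                      # milliseconds since the epoch (assumed unit)
    eventinfo: dict = field(default_factory=dict)


@dataclass
class Entity:
    """One specific instance of an entity type."""
    entitytype: str
    entity: str                         # the entity's ID
    starttime: int
    events: list = field(default_factory=list)
    # Relationships to other entities: entity type -> list of entity IDs.
    relatedentities: dict = field(default_factory=dict)


def to_json(entity):
    """Serialize an entity, including nested events, to a JSON string."""
    return json.dumps(asdict(entity), sort_keys=True)


if __name__ == "__main__":
    task = Entity("MY_TASK", "task_1", 1400000000000)
    task.events.append(Event("TASK_STARTED", 1400000000500))
    task.relatedentities["MY_JOB"] = ["job_1"]
    print(to_json(task))
```

The entity type is just a string here, which matches the idea that it is an abstract label an application framework defines for itself.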

Timeline Service – Architecture

• LevelDB Store

• Client Library

• REST Interfaces

Timeline Service – Store

• LevelDB based store

– Key-value store

– Lightweight

– License compatible

• Implements reader/writer interfaces

• Supports data retention
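The idea of a store behind reader/writer interfaces with data retention can be sketched as follows. This is a toy in-memory stand-in written in Python, not the LevelDB-based store from the talk; the names `TimelineStore`, `InMemoryStore`, `discard_old` and the time-to-live policy are hypothetical.

```python
import time
from abc import ABC, abstractmethod


class TimelineStore(ABC):
    """Hypothetical combined reader/writer interface for a key-value store."""

    @abstractmethod
    def put(self, entity_type, entity_id, data, timestamp=None): ...

    @abstractmethod
    def get(self, entity_type, entity_id): ...


class InMemoryStore(TimelineStore):
    """Toy implementation that keeps entries in a dict and ages them out."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._data = {}  # (entity_type, entity_id) -> (timestamp, data)

    def put(self, entity_type, entity_id, data, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        self._data[(entity_type, entity_id)] = (ts, data)

    def get(self, entity_type, entity_id):
        item = self._data.get((entity_type, entity_id))
        return None if item is None else item[1]

    def discard_old(self, now=None):
        """Retention: drop entries older than the TTL and return how many were dropped."""
        now = time.time() if now is None else now
        cutoff = now - self.ttl
        old = [key for key, (ts, _) in self._data.items() if ts < cutoff]
        for key in old:
            del self._data[key]
        return len(old)
```

A real store would persist to disk and run retention in the background; the sketch only shows how a pluggable interface and an age-based cleanup fit together.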

Timeline Service – Client

• TimelineClient

– Wraps the REST POST method

– POJO objects

• TimelineEntity

• TimelineEvent

– In Client/AM/Container
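To show what "wrapping the REST POST method" can look like, here is a minimal Python sketch that prepares such a request without sending it. It is not the Java `TimelineClient`; the function name `build_put_request`, the target path (taken from the base of the timeline URLs in this talk) and the `{"entities": [...]}` body shape are assumptions.

```python
import json
import urllib.request


def build_put_request(entities, host="localhost", port=8188):
    """Prepare (but do not send) a POST carrying timeline entities as JSON.

    The URL path and the body layout are assumptions made for illustration.
    """
    body = json.dumps({"entities": entities}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}/ws/v1/timeline",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_put_request([{"entity": "task_1", "entitytype": "MY_TASK"}])
    print(req.get_method(), req.full_url)
    # To actually send it against a running server:
    # urllib.request.urlopen(req, timeout=10)
```

Keeping request construction separate from sending makes the wrapper easy to inspect and test without a server.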

Timeline Service – APIs

• REST APIs, JSON as the media type

• Get timeline entities

– http://localhost:8188/ws/v1/timeline/{entityType}

• Get timeline entity

– http://localhost:8188/ws/v1/timeline/{entityType}/{entityId}

• Get timeline events

– http://localhost:8188/ws/v1/timeline/{entityType}/events
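The three URL templates above can be filled in with a small Python helper. The function names and the sample entity type and ID are hypothetical, and a real server may expect additional query parameters that this sketch does not model.

```python
from urllib.parse import quote

# Base path as shown on the slide.
BASE = "http://localhost:8188/ws/v1/timeline"


def entities_url(entity_type, base=BASE):
    """URL listing the entities of one type."""
    return f"{base}/{quote(entity_type, safe='')}"


def entity_url(entity_type, entity_id, base=BASE):
    """URL for one specific entity."""
    return f"{entities_url(entity_type, base)}/{quote(entity_id, safe='')}"


def events_url(entity_type, base=BASE):
    """URL for the events of entities of one type."""
    return f"{entities_url(entity_type, base)}/events"


if __name__ == "__main__":
    print(entities_url("MY_TASK"))
    print(entity_url("MY_TASK", "task_1"))
    print(events_url("MY_TASK"))
```

Percent-encoding the path segments keeps IDs that contain spaces or slashes from breaking the URL.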

Timeline Service – Security

• HTTP SPNEGO

• Kerberos Authentication

• Delegation Token

– Performance

– AM/containers have no Kerberos credentials

• Access Control

– Admin/owner

– Timeline entity-level

Timeline Service – Use Case (1)

Timeline Service – Early Adopter (2)

Timeline Service – Early Adopter (3)


To Be Continued…

• Integrating the generic history and timeline data

• Rebasing the MR Job History Server on the timeline server

• Making the timeline server render the timeline data

To Be Continued…

• LevelDB does not handle eBay scale

• We need something that can scale horizontally

• HBase

Questions

Mayank Bansal: mabansal@ebay.com, mayank@apache.org

Zhijie Shen: zshen@hortonworks.com, zjshen@apache.org
