38
HADOOP ROBOT Action and Remediation Center for eBay's Hadoop Clusters Polo Li | 李健 eBay Cloud Services

Hadoop Robot from eBay at China Hadoop Summit 2015

  • Upload
    polo-li

  • View
    319

  • Download
    4

Embed Size (px)

Citation preview

Page 1: Hadoop Robot from eBay at China Hadoop Summit 2015

HADOOP ROBOT Action and Remediation Center for eBay's Hadoop Clusters

Polo Li | 李健

eBay Cloud Services

Page 2: Hadoop Robot from eBay at China Hadoop Summit 2015

About me

2

Page 3: Hadoop Robot from eBay at China Hadoop Summit 2015

Agenda

• Hadoop @eBay

• Problem Statement

• What is Hadoop Robot

• Q&A

3

Page 4: Hadoop Robot from eBay at China Hadoop Summit 2015

4

Hadoop @eBay

1-10 nodes

2007

100+ nodes

1000 + core

1 PB

2010

2011

1000+ node

10,000+ core

10+ PB

4000+ node

40,000+ core

50+ PB

2013

2015

10,000+ nodes

150,000+ cores

150+ PB

2009

10+ nodes

Page 5: Hadoop Robot from eBay at China Hadoop Summit 2015

Hadoop @eBay

•10+ large Hadoop Clusters

•10,000+ nodes

•50,000+ jobs per day

•50,000,000+ tasks per day

5

Page 6: Hadoop Robot from eBay at China Hadoop Summit 2015

Shared vs Dedicated Clusters

• Shared clusters

• Used primarily for analytics of user behavior and inventory

• Mix of batch and ad-hoc jobs

• Mix of MR, YARN, Hive, Pig, Cascading, etc.

• Hadoop and HBase security enabled

• Dedicated clusters

• Very specific use case like index building

• Tight SLAs for jobs (in order of minutes)

• Immediate revenue impact

• Usually smaller than our shared clusters, but still large (600+ nodes)

6

Page 7: Hadoop Robot from eBay at China Hadoop Summit 2015

7

Hadoop Platform Stack

Page 8: Hadoop Robot from eBay at China Hadoop Summit 2015

Agenda

• Hadoop @eBay

• Problem Statement

• What is Hadoop Robot

• Q&A

8

Page 9: Hadoop Robot from eBay at China Hadoop Summit 2015

Problem Statements

• Long trouble shooting time

• Bad cluster performance

• Too many different skus, operating systems and metadatas

• Human resource Cost

• Cluster Availability

9

Page 10: Hadoop Robot from eBay at China Hadoop Summit 2015

Traditional Trouble Shooting Pipeline

Step 1 • Check failed application task logs to find out the suspicious hadoop nodes

Step 2 • Check the suspicious hadoop node hardware && system status

Step 3 • Check hadoop metrics and hadoop daemon logs

Step 4 • Check hadoop source code

10

Page 11: Hadoop Robot from eBay at China Hadoop Summit 2015

Victim or Perpetrator ?

Sometimes you think you've found the perpetrators, however it may turn out to be

the victim.

11

Page 12: Hadoop Robot from eBay at China Hadoop Summit 2015

What may impact cluster performance?

• Hardware

• System

• Hadoop

• JVM

• …

12

Page 13: Hadoop Robot from eBay at China Hadoop Summit 2015

Hardware issues

84%

3%

4% 4%

1%

1%

1%

2%

Hardware Issues distribution

Disk

Nic

Memory

Cpu

Chassis

Fan

Mobo

PSU

13

Page 14: Hadoop Robot from eBay at China Hadoop Summit 2015

System issues

•High load

•Node reboot

•Disk full

•Network saturated fully

•OOM

•Kernel bug

•Orphan processes

•…

14

Page 15: Hadoop Robot from eBay at China Hadoop Summit 2015

Hadoop Issues

•Hot spot

•Low hdfs locality

•High RPC call queue length

•Hadoop configuration inconsistent issues

•Big resource consuming applications

•Big log/hdfs output applications

•Bad Application scheduling

•…

15

Page 16: Hadoop Robot from eBay at China Hadoop Summit 2015

About hadoop node

16

NODE

SKU

OS METADATA

Page 17: Hadoop Robot from eBay at China Hadoop Summit 2015

Different SKU

•Cpu core

•Memory size

•Disk number

•Disk size

•Nic Speed

•…

17

Page 18: Hadoop Robot from eBay at China Hadoop Summit 2015

Different OS

• Image profile

• Image version

•Patches

•…

18

Page 19: Hadoop Robot from eBay at China Hadoop Summit 2015

Different Metadata

• Hadoop Metadata

•Daemons

•Packages

•Configurations

• OS Metadata

•Services

•Scripts/Tools

•Configurations

19

Page 20: Hadoop Robot from eBay at China Hadoop Summit 2015

Human Resource Cost

• Detect node (varies based on admin experiences)

• Node decommission (1-2 Hours)

• Vendor offline remediation (3-5 Days)

• Node reimage(45 Mins)

• Hadoop Installation and OS configuration (15 Mins)

• Node restart (5 Mins)

• Disk burning test (6-8 Hours)

• Node health verification (15 Mins)

• Node recommission (15 Mins)

20

Page 21: Hadoop Robot from eBay at China Hadoop Summit 2015

Cluster Availability

21

• Improve Availability

• Increase Live data node percentage

• Reduce job failures due to bad node

Page 22: Hadoop Robot from eBay at China Hadoop Summit 2015

Agenda

• Hadoop @eBay

• Problem Statement

• What is Hadoop Robot

• Q&A

22

Page 23: Hadoop Robot from eBay at China Hadoop Summit 2015

What is Hadoop Robot Hadoop Robot is action and remediation center for eBay hadoop clusters:

• End-to-end automated remediation center

• API center for hadoop action and remediation

• Unified Hadoop Admin Console

• Real time maintenance view of Hadoop clusters

• Analytical insights into hardware maintenance data

23

Page 24: Hadoop Robot from eBay at China Hadoop Summit 2015

End-to-end Automatic remediation center •Hardware Maintenance

•Alert Detection

•Node Decommission

•Remediation

•Node Recommission

•Remove Failed Disk Volume

•Bad Disk Hot Swap

•Hadoop Daemon Restart

•Hadoop Abnormal Job Termination

•Hadoop Cluster Expansion

•…

24

Page 25: Hadoop Robot from eBay at China Hadoop Summit 2015

Robot Architecture

25

Page 26: Hadoop Robot from eBay at China Hadoop Summit 2015

Hardware Maintenance Workflow

• Alert detection

• Node decommission

• Vendor remediation (while node is offline)

• Node OS reprovisioning

• Hadoop Installation && OS configuration

• Node restart

• Burn-in test

• Node health verification

• Node recommission

26

Page 27: Hadoop Robot from eBay at China Hadoop Summit 2015

Data Flow

27

Page 28: Hadoop Robot from eBay at China Hadoop Summit 2015

Alerts Drive Workflow

28

Page 29: Hadoop Robot from eBay at China Hadoop Summit 2015

Hardware and System Alerts

•Hardware

29

DISK

SMARTD_CHECK

SECTOR_CHECK

RW_CHECK

MOUNT_CHCEK

SYSLOG_CHECK

NIC

SPEED_CHCEK

DUPLEX_CHECK

FLAPPING_CHECK

DOWN_CHECK

NOT_FOUND_CHECK

MCELOG

CPU_CHECK

MEMORY_CHCEK

CPU_HIGH_TEMPERATURE_CHECK

MEMORY

MEMORY_SIZE_CHECK

Page 30: Hadoop Robot from eBay at China Hadoop Summit 2015

Hardware and System Alerts

•System

30

DISK

DISK_SPACE_CHECK

REBOOT

NODE_REBOOT_CHCEK

PROCESS

ORPHAN_PROCESSES_CHECK

Page 31: Hadoop Robot from eBay at China Hadoop Summit 2015

Metadata Setup - ansible

31

Page 32: Hadoop Robot from eBay at China Hadoop Summit 2015

One button hadoop installation and system configuration

We use ansible playbook to install and

configure various OS and software

packages including Hadoop.

32

Copy the Hadoop Packages, Configuration and Codes from

Source to the Destination Node

Update Symbolic Links to point to the Latest Hadoop Code.

Restart Hadoop Daemon

Page 33: Hadoop Robot from eBay at China Hadoop Summit 2015

Bonnie++

•We use bonnie++ to carry out a stress-test of the repaired hardware. This not only

puts a load on the I/O and disk subsystem, but it also can flush out CPU, RAM

and fan/cooling issues.

33

Page 34: Hadoop Robot from eBay at China Hadoop Summit 2015

Cluster level status overview

34

Page 35: Hadoop Robot from eBay at China Hadoop Summit 2015

Unified Hadoop Admin Console

35

Page 36: Hadoop Robot from eBay at China Hadoop Summit 2015

Summit Job

36

Page 37: Hadoop Robot from eBay at China Hadoop Summit 2015

Track Robot Job Activities

37

Page 38: Hadoop Robot from eBay at China Hadoop Summit 2015

38

We are Hiring Now!

https://careers.ebayinc.com Or contact me: [email protected]