
Matt Singer (@mattbytes)

April 17th, 2019

Pachyzoom: Understanding & Optimizing Hadoop servers with Intel VPP

Agenda

1. Hadoop & Twitter
2. Problem Statement
3. Project Objectives
4. Experimenting and Analyzing
5. Results

What is Hadoop?
● A file system: HDFS
● Distributed applications API and compute runtime: YARN
● Data processing frameworks: MapReduce, Hive, Tez, Spark, …

Twitter uses Hadoop for:
● Data: Core Data (Tweets, Users, …), Logs, Durable Storage
● Analytics: Metrics
● Insight: Model Training, Experiments, Ad-Hoc Analysis

Hadoop @ Twitter

Introduction


● >1T Events Per Day
● >500K Compute Threads
● >1 exabyte Physical Storage
● >12,500 Peak Cluster Size

Hadoop @ Twitter Scale

Introduction


Problem Statement

Twitter has an exascale Hadoop infrastructure built on many relatively small (1/2/4TB) HDDs.

How will we have enough IOPS if we use far fewer, but larger (6/8/12TB) HDDs?

The Project’s Objectives

1. MEASURE: I/O and CPU usage in our test and production Hadoop clusters.
2. ENABLE: Adoption of bigger data HDDs.
3. DENSIFY: Reduce the number of nodes needed for future clusters.

Project Objectives

VTune™ Amplifier Platform Profiler

We made extensive use of VPP to visualize things like:

● CPU Utilization
● Memory Use
● Disk Throughput
● Disk Latency
● IOPS
● Queue Depths
● Network Throughput

Experiment Configuration

Collaboration

The Plan
● Collaborate with Intel to test Intel Cache Acceleration Software
  ○ Try to reduce contention on disk access as there were more and more reads and writes per disk.
● Test Intel Optane SSD DC P4800X
  ○ Test really fast, really high endurance storage for this cache
● Use a combination of Hadoop Benchmarks

The Teams Involved

@Twitter:
● Hardware Engineering
● Hadoop Development
● Hadoop SRE
● Infrastructure Optimization and Performance

@Intel:
● SSG: Hadoop
● SSG: VPP
● NSG: Engineering
● NSG: Intel CAS Team
● NSG: Technical Sales
● SMG: Account Team

Functional: Gridmix
● Capture of real production cluster workload trace (1000+ jobs)
● Replays reads, writes, shuffles, compute
● Used over three generations of hardware
● Standard (Apache Hadoop)

Base: TestDFSIO, Teragen, Terasort
● Low level I/O tests
● Repeatable
● Easy to use
● Standard (Apache Hadoop)

Twitter Hadoop Test Approach



● Dual Socket E5-2640 v4 CPU
● 128GB RAM
● 12x 6TB 7200 RPM SATA DISK
● 1x SATA SSD Boot Disk
● 1x 750GB Intel P4800X NVMe SSD
● 25G Ethernet

102 nodes spread across 6 racks

3.33 CPU Threads / HDD

Twitter Experiment System Setup

● Dual Socket Xeon 8180
● 128GB RAM
● 8x 4TB 7200 RPM SATA DISK
● 1x SATA SSD Boot Disk
● 1x 750GB Intel P4800X NVMe SSD
● 2x10G Ethernet

9 Nodes

14 CPU Threads / HDD

Intel Lab Test System Setup



Twitter ran tests in 10+ configs:
● Baseline Config:
  ○ 12 HDDs with YARN temp files/logs on HDDs
● 12 HDDs HDFS Cached
● YARN on Optane + HDFS Cached
● YARN on HDDs + HDFS Cached
● YARN on Optane
● Baseline with 6 HDDs
● YARN on Optane, 6 HDDs
● YARN on Optane, 3 HDDs
● YARN, Optane, 6 HDDs, HDFS Cached
● YARN, Optane, 3 HDDs, HDFS Cached
● Baseline with half the threads
● YARN on Optane with half the threads

Intel ran Gridmix in 10 configs (some similar, some different)

Why does Twitter use YARN on HDD?

In the past, we haven't wanted to make an a priori determination of how much spill a job could produce. Also, SSDs used to be much smaller and more expensive.

Test Configurations


This is data from one of the 12 disks in each system in the test cluster. VPP is showing us:

Throughput: Peaks to 200MB/Sec, but average is 32MB/Sec

IOPS averages about 70R + 80W during the run. 150 IOPS is really high for an HDD!

About ⅔ of the samples show a QDepth > 20, but block size is all over the place.
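The two averages above also imply a rough mean request size, since throughput divided by IOPS gives bytes per operation. A minimal sketch, assuming decimal units (1 MB = 1000 KB) and that the 32MB/Sec and 150 IOPS averages cover the same interval; the function name is hypothetical:

```python
def mean_request_kb(throughput_mb_s: float, iops: float) -> float:
    """Average size of one I/O in KB, assuming 1 MB = 1000 KB."""
    return throughput_mb_s * 1000.0 / iops


read_iops, write_iops = 70, 80        # averages quoted for the baseline run
total_iops = read_iops + write_iops   # combined read + write IOPS per HDD
avg_kb = mean_request_kb(32, total_iops)

print(f"{total_iops} IOPS at 32 MB/s is roughly {avg_kb:.0f} KB per request")
```

Because the block size is all over the place, this mean hides a wide distribution and is only a sanity check on the two averages.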

Baseline Example


Results

Intel’s Cluster Baseline Config, Gridmix 75

Source: Intel Corporation


We saw a significant reduction in runtimes on both Gridmix (27.5%) and Terasort (52%) with the same system.


Why?

HDD Utilization drops dramatically now that YARN and HDFS aren’t contending for the same disks.

Gridmix HDD Utilization, Baseline Config (37MB/Sec)

Gridmix HDD Utilization, YARN on Optane Config (6MB/Sec)


Why?

Since the benchmark was significantly I/O bound before, CPU Utilization increases from an average of 40% to an average of 57%. The CPU was doing work at 1.4x the rate, which correlates with the reduced runtime.
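The correlation can be checked with a little arithmetic. This is a sketch only, assuming the 40% and 57% CPU averages come from the Gridmix runs and that the total work is the same in both configurations; the helper name is hypothetical:

```python
def speedup_from_reduction(reduction: float) -> float:
    """Rate speedup implied by a fractional runtime reduction (0.275 = 27.5%)."""
    return 1.0 / (1.0 - reduction)


cpu_ratio = 57 / 40                               # ratio of average CPU utilization
gridmix_speedup = speedup_from_reduction(0.275)   # Gridmix runtime fell by 27.5%
terasort_speedup = speedup_from_reduction(0.52)   # Terasort runtime fell by 52%

print(f"CPU utilization ratio: {cpu_ratio:.2f}x")
print(f"Gridmix speedup:       {gridmix_speedup:.2f}x")
print(f"Terasort speedup:      {terasort_speedup:.2f}x")
```

A 27.5% shorter run means the same work finished about 1.38x faster, which lines up with the roughly 1.4x increase in average CPU utilization.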

CPU Utilization, Baseline Config (40%)

CPU Utilization, New Config (57%)


Why?

We can now see how the temporary data really wants to behave.

Peaks to 20K IOPS, average > 6000 total RW IOPS

There was no way that spinning disks could handle this alone, let alone when sharing the HDD with HDFS load.
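To put those IOPS numbers in HDD terms, here is a back-of-the-envelope sketch. It assumes roughly 150 IOPS per HDD (the baseline average, which was already described as really high for an HDD) and ignores queueing and block size effects; the constant and function names are hypothetical:

```python
import math

HDD_IOPS = 150  # assumption: per-disk average observed in the baseline run


def hdds_needed(target_iops: int, per_hdd_iops: int = HDD_IOPS) -> int:
    """Number of HDDs required to serve target_iops at per_hdd_iops each."""
    return math.ceil(target_iops / per_hdd_iops)


avg_disks = hdds_needed(6000)     # sustained temporary-data load
peak_disks = hdds_needed(20000)   # peak temporary-data load

print(f"Average load needs about {avg_disks} HDDs, peak needs about {peak_disks}")
```

Under these assumptions the average load alone would need more than three times the 12 HDDs in each test node, before any HDFS traffic is counted.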


Gridmix 75

From Phase 1 (Baseline) to Phase 2 (YARN Data on Optane), the processing time was reduced by 51.7%

Adding in some HDFS Caching (Phase 5), the processing time was reduced by 56.2% (or about 9% incrementally)

In comparison, the max reduction in runtime in the Twitter test cluster was 27.5%
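The "about 9% incrementally" figure follows from comparing the Phase 5 runtime to the Phase 2 runtime rather than to the baseline. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def incremental_reduction(before: float, after: float) -> float:
    """Extra runtime cut when the cumulative reduction goes from `before`
    to `after`, measured relative to the runtime at `before`."""
    return 1.0 - (1.0 - after) / (1.0 - before)


phase2 = 0.517   # YARN data on Optane, reduction vs. baseline
phase5 = 0.562   # plus HDFS caching, reduction vs. baseline
step = incremental_reduction(phase2, phase5)

print(f"Phase 2 -> Phase 5 incremental reduction: {step:.1%}")
```

Subtracting the two percentages directly would give 4.5 points of the baseline runtime; the roughly 9% is the same gain expressed against the already shortened Phase 2 runtime.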

In Contrast: Intel’s Results

Source: Intel Corporation

51.7% Reduction in Runtime

9% Incremental Reduction in Runtime

Phase 2: 750G Optane + 1.6TB NAND


Different Systems!

● Much more compute power in Intel’s system.

● Twitter retrofitted the NVMe Disks into an existing system, and couldn’t accommodate the optimal attachment of the NVMe disk to CPU Socket 0.

Intel did some testing that indicated that this may have had a 10% penalty when trying to move big data from HDD to Cache.

Why didn’t we see the same scaling?


                  Intel Config      Twitter Config
CPU               Xeon 8180 x 2     E5-2640 v4 x 2
Threads           112               40
HDDs              8                 12
Threads / Disk    14                3.33
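The Threads / Disk row is just threads divided by HDD count. A minimal sketch using the thread and HDD counts from the comparison above; the class and field names are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeConfig:
    name: str
    threads: int   # hardware threads per node
    hdds: int      # data HDDs per node

    @property
    def threads_per_hdd(self) -> float:
        return self.threads / self.hdds


intel = NodeConfig("Intel", threads=112, hdds=8)
twitter = NodeConfig("Twitter", threads=40, hdds=12)

for cfg in (intel, twitter):
    print(f"{cfg.name}: {cfg.threads_per_hdd:.2f} threads per HDD")

print(f"Intel has {intel.threads_per_hdd / twitter.threads_per_hdd:.1f}x the compute per disk")
```

The Intel nodes have about 4.2x as many threads behind each HDD, which is the imbalance the rest of the experiments explore.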

1. Moving the YARN data to the Optane SSD dramatically changed the disk utilization pattern.
2. Our test cluster only had 3.33 Threads/Disk, vs. 14 Threads/Disk in Intel’s Mini Cluster.
3. We couldn’t push the HDDs hard enough with just HDFS Data to see a benefit from caching metadata and small files with Intel’s Cache Acceleration Software.

Opportunity to apply what we discovered

Explore the other dimensions of testing: CPU and HDD Count


Experiment: Removing disks from the cluster

[Figures: Gridmix Runtime and Terasort Runtime for the 12 Disk, 6 Disk, and 3 Disk configurations. Chart labels: 27%, 38%, 68%. Throughput labels: 40MB/s, 45MB/s, 23MB/s, 9MB/s.]

Strawman Example - 2X CPU with ¼ HDD

[Figure: Gridmix Runtime. Chart labels: +27%, -9%, -27%, -55%.]

YARN on SSD enabled scaling. The benchmark ran in less than half the time, with 75% fewer HDDs.


With ½ of Cores Disabled, no NVMe


CPU goes right to 100%*

*Due to the way that we disabled the cores (in the scheduler rather than in the BIOS), Linux reports 50%.

Cluster is CPU bound for most of the run!


½ Cores Disabled, with NVMe SSD


HDD Utilization drops dramatically, but we’re still CPU bound. Thus, a minimal impact to runtime.


All cores, with NVMe SSD, 3 HDDs


CPU is now heavily utilized (but not pegged at 100%) for most of the run.

The HDDs are moving a lot of data without a lot of IOPS.

Great condition to be in!


CPU Utilization and Disk Access In Real Life

CPU Utilization, Processing Cluster 1

HDD Utilization, Processing Cluster 1

CPU Utilization, Realtime Cluster

Disk Utilization, Realtime Cluster

vs. The experiment clusters

● Intel 8 HDD: ~30MB/Sec (HDFS Was Partially Cached)
● Twitter 12 HDD: ~9MB/Sec
● Twitter 6 HDD: ~22MB/Sec
● Twitter 3 HDD: ~45MB/Sec

800MB/s

3000-IOPS
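One way to read the Twitter per-disk figures above is to multiply them back out to a per-node total. A rough sketch, assuming the quoted values are per-HDD averages and that all HDDs in a node carry similar load; the variable names are hypothetical:

```python
# HDD count per node -> approximate MB/s per HDD (Twitter experiment clusters)
per_disk_mb_s = {12: 9, 6: 22, 3: 45}

# Aggregate HDD throughput per node for each configuration
aggregate = {hdds: hdds * mb_s for hdds, mb_s in per_disk_mb_s.items()}

for hdds, total in aggregate.items():
    print(f"{hdds:>2} HDDs x {per_disk_mb_s[hdds]:>2} MB/s = {total} MB/s per node")
```

Under those assumptions the per-node total stays in the same range (roughly 110 to 135 MB/s) as disks are removed, so each remaining HDD simply carries a larger share of a similar HDFS load.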


Impact and Next Steps


Shifting the Balance

32 Threads to 8 Disks is a 4:1 Ratio, a better fit for today’s more compute intensive workloads.

A 3-6TB SSD for YARN data enables this shift in Threads:HDDs.

Impact


                Legacy Config            Possible Config
CPU             Xeon E3 Series 4-Core    6262V CLX 24-Core
Memory          32-64GB                  192GB
HDD             12 x 2TB HDD             8 x 6TB HDD
Boot            240GB Boot               240GB Boot
YARN Storage    N/A                      6.4TB High Endurance NVMe
Compute         1X                       >4X
Storage         1X                       4X
Racks           4X                       1

                Original Goal            End Result
Scaling         2X-3X                    4-5X

Estimated 30% more cost efficient.

Best Practices

1. MEASURE: A visualization tool such as VTune™ Amplifier Platform Profiler made it really easy to measure what was happening in the experiment and production clusters.
2. EXPERIMENT: Challenge prior assumptions about how things are and how things will work.
3. REPEAT: Learn from the data that you collected, adjust your experiments, and try again!

Learnings

1. YARN Data on NVMe SSD: This single tweak alone changed the disk access patterns dramatically.
2. Adopt High Density Drives: Once the YARN data was moved to the SSD, it became clear that we didn’t need as many HDDs.
3. More Compute Power for Every Disk: Now we know that the next platform that we design needs to have more compute threads for each disk in the system.

What’s Next?

1. YARN Temporary Capacity Size Analysis
2. SSD Endurance Needs Analysis (Total YARN Daily Write Load)
3. Determine Optimal Balance of HDD, Threads, and NVMe SSD

Thank you!

Thank you Intel Team!
Ali Alavi (@TheAliAlavi), Felipe Barajas, Mauricio Cuervo (@mauriciocuervo), Milind Damle, Juan Fernandez (@cachegordon), Uma Gangumalla, Fabrizio Giamello (@giame), Andrzej Jakowski, David Leone, Anup Navare, Chris Parry, Brien Porter, Rakeshr Radhakrishnan Potty, David Tuhy, Barrie Wheeler (@AndBarrie), Michal Wysoczanski

Thank you Twitter Team!
Tu Lam (@tulam_tu), Brian Martin (@brayniac), Derrick Tseng (@dstseng), Mark Schonbach (@markbach), Matt Silver (@msilver)

#Collaborate

Twitter @ VTune Summit 2019
