Matt Singer (@mattbytes) April 17th, 2019 Pachyzoom: Understanding & Optimizing Hadoop servers with Intel VPP




Agenda

1. Hadoop & Twitter
2. Problem Statement
3. Project Objectives
4. Experimenting and Analyzing
5. Results


Introduction

Hadoop @ Twitter

What is Hadoop?
● A file system: HDFS
● Distributed applications API and compute runtime: YARN
● Data processing frameworks: MapReduce, Hive, TEZ, Spark, …

Twitter uses Hadoop for:
● Data: Core Data (Tweets, Users, …), Logs, Durable Storage
● Analytics: Metrics
● Insight: Model Training, Experiments, Ad-Hoc Analysis


Hadoop @ Twitter Scale

● >1T Events Per Day
● >500K Compute Threads
● >1 exabyte Physical Storage
● >12,500 Peak Cluster Size


Problem Statement

Twitter has an exascale Hadoop infrastructure built on many relatively small (1/2/4TB) HDDs.

How will we have enough IOPS if we use far fewer, but larger (6/8/12TB) HDDs?
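
The concern can be made concrete with a back-of-the-envelope calculation. A spinning disk delivers roughly the same number of random IOPS whatever its capacity, so the IOPS available per stored terabyte falls as drives grow. The sketch below assumes 100 IOPS per drive, which is an illustrative figure and not a number from the talk; the function names are likewise made up for this example.

```python
# Back-of-the-envelope: IOPS available per TB stored as drives get larger.
# ASSUMPTION: every HDD delivers about the same random IOPS regardless of
# capacity; 100 is an illustrative figure, not a number from the talk.
ASSUMED_IOPS_PER_HDD = 100


def iops_per_tb(capacity_tb: float, iops_per_hdd: float = ASSUMED_IOPS_PER_HDD) -> float:
    """IOPS available for each TB stored on a single drive."""
    return iops_per_hdd / capacity_tb


def drives_needed(total_tb: float, capacity_tb: float) -> int:
    """Number of drives needed to hold total_tb (ceiling division)."""
    return int(-(-total_tb // capacity_tb))


if __name__ == "__main__":
    for capacity in (1, 2, 4, 6, 8, 12):
        print(f"{capacity:>2} TB drive: {iops_per_tb(capacity):6.1f} IOPS per TB")
```

Under that assumption, moving from 2TB to 12TB drives cuts the IOPS available per terabyte by a factor of six, which is the gap the project set out to understand.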


The Project’s Objectives

1. MEASURE: I/O and CPU usage in our test and production Hadoop clusters.
2. ENABLE: Enable adoption of bigger data HDDs.
3. DENSIFY: Reduce the number of nodes needed for future clusters.


Project Objectives

VTune™ Amplifier Platform Profiler

We made extensive use of VPP to visualize things like:

● CPU Utilization
● Memory Use
● Disk Throughput
● Disk Latency
● IOPS
● Queue Depths
● Network Throughput
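
For readers without access to VPP, the disk metrics in this list can be approximated on Linux from two samples of `/proc/diskstats`. The sketch below is not how VPP works internally; it only shows how IOPS, throughput, and average queue depth fall out of the kernel counters. The class and function names are hypothetical and the two sample lines are made up.

```python
from dataclasses import dataclass

SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


@dataclass(frozen=True)
class DiskSample:
    reads: int            # reads completed
    sectors_read: int
    writes: int           # writes completed
    sectors_written: int
    weighted_io_ms: int   # weighted time spent doing I/O, in ms


def parse_diskstats_line(line: str) -> tuple[str, DiskSample]:
    """Parse one /proc/diskstats line into (device name, counters)."""
    f = line.split()
    return f[2], DiskSample(int(f[3]), int(f[5]), int(f[7]), int(f[9]), int(f[13]))


def disk_rates(before: DiskSample, after: DiskSample, interval_s: float) -> dict[str, float]:
    """Derive per-second rates from two samples taken interval_s apart."""
    sectors = (after.sectors_read - before.sectors_read) + (
        after.sectors_written - before.sectors_written
    )
    return {
        "read_iops": (after.reads - before.reads) / interval_s,
        "write_iops": (after.writes - before.writes) / interval_s,
        "throughput_mb_s": sectors * SECTOR_BYTES / interval_s / 1e6,
        # Same idea as iostat's average queue size: weighted I/O time / wall time.
        "avg_queue_depth": (after.weighted_io_ms - before.weighted_io_ms)
        / (interval_s * 1000),
    }


# Made-up sample lines, 10 seconds apart (not data from the talk).
_, t0 = parse_diskstats_line("8 0 sda 1000 0 80000 500 2000 0 160000 900 0 1200 1400")
_, t1 = parse_diskstats_line("8 0 sda 1700 0 400000 900 2800 0 500000 1500 3 9000 201400")
print(disk_rates(t0, t1, interval_s=10.0))
```

In practice the two samples would be read from `/proc/diskstats` a fixed interval apart, once per data disk.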


Experiment Configuration

Collaboration

The Plan
● Collaborate with Intel to test Intel Cache Acceleration Software
  ○ Try to reduce contention on disk access as there were more and more reads and writes per disk.
● Test Intel Optane SSD DC P4800X
  ○ Test really fast, really high endurance storage for this cache
● Use a combination of Hadoop Benchmarks

The Teams Involved

@Twitter:
● Hardware Engineering
● Hadoop Development
● Hadoop SRE
● Infrastructure Optimization and Performance

@Intel:
● SSG: Hadoop
● SSG: VPP
● NSG: Engineering
● NSG: Intel CAS Team
● NSG: Technical Sales
● SMG: Account Team


Twitter Hadoop Test Approach

Functional: Gridmix
● Capture of real production cluster workload trace (1000+ jobs)
● Replays reads, writes, shuffles, compute
● Used over three generations of hardware
● Standard (Apache Hadoop)

Base: TestDFSIO, Teragen, Terasort
● Low level I/O tests
● Repeatable
● Easy to use
● Standard (Apache Hadoop)


Twitter Experiment System Setup
● Dual Socket E5-2640 v4 CPU
● 128GB RAM
● 12x 6TB 7200 RPM SATA DISK
● 1x SATA SSD Boot Disk
● 1x 750GB Intel P4800X NVMe SSD
● 25G Ethernet

102 nodes spread across 6 racks

3.33 CPU Threads / HDD

Intel Lab Test System Setup
● Dual Socket Xeon 8180
● 128GB RAM
● 8x 4TB 7200 RPM SATA DISK
● 1x SATA SSD Boot Disk
● 1x 750GB Intel P4800X NVMe SSD
● 2x10G Ethernet

9 Nodes

14 CPU Threads / HDD


Test Configurations

Twitter ran tests in 10+ configs:
● Baseline Config: 12 HDDs with YARN temp files/logs on HDDs
● 12 HDDs HDFS Cached
● YARN on Optane + HDFS Cached
● YARN on HDDs + HDFS Cached
● YARN on Optane
● Baseline with 6 HDDs
● YARN on Optane, 6 HDDs
● YARN on Optane, 3 HDDs
● YARN, Optane, 6 HDDs, HDFS Cached
● YARN, Optane, 3 HDDs, HDFS Cached
● Baseline with half the threads
● YARN on Optane with half the threads

Intel ran Gridmix in 10 configs (some similar, some different)

Why does Twitter use YARN on HDD?

In the past, we haven't wanted to make an a priori determination of how much spill a job could produce. Also, SSDs used to be much smaller and more expensive.
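
The difference between the baseline and the "YARN on Optane" configurations above comes down to where the NodeManager keeps its temporary files and logs. The sketch below renders the two standard Hadoop properties that control this, `yarn.nodemanager.local-dirs` and `yarn.nodemanager.log-dirs`, for both layouts. The deck does not show Twitter's actual settings: the mount points and the helper function are hypothetical.

```python
import xml.etree.ElementTree as ET


def yarn_site_fragment(local_dirs: list[str], log_dirs: list[str]) -> str:
    """Render the two NodeManager directory properties as yarn-site.xml text."""
    config = ET.Element("configuration")
    for name, dirs in (
        ("yarn.nodemanager.local-dirs", local_dirs),
        ("yarn.nodemanager.log-dirs", log_dirs),
    ):
        prop = ET.SubElement(config, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = ",".join(dirs)
    ET.indent(config)  # pretty-print; requires Python 3.9+
    return ET.tostring(config, encoding="unicode")


# Hypothetical mount points, for illustration only.
HDD_MOUNTS = [f"/data/hdd{i:02d}" for i in range(12)]
NVME_MOUNT = "/data/nvme0"

# Baseline: YARN temp files and logs share the 12 HDDs with HDFS.
baseline = yarn_site_fragment(
    [f"{m}/yarn/local" for m in HDD_MOUNTS],
    [f"{m}/yarn/logs" for m in HDD_MOUNTS],
)

# YARN on Optane: temp files and logs move to the single NVMe device.
yarn_on_optane = yarn_site_fragment(
    [f"{NVME_MOUNT}/yarn/local"],
    [f"{NVME_MOUNT}/yarn/logs"],
)

print(yarn_on_optane)
```

Keeping everything else identical between runs is what lets the later slides attribute the change in disk behavior to this one setting.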


Baseline Example

This is data from one of the 12 disks in each system in the test cluster. VPP is showing us:

Throughput: Peaks to 200MB/Sec, but average is 32MB/Sec

IOPS averages about 70R + 80W during the run. 150 IOPS is really high for a HDD!

About ⅔ of the samples show a QDepth > 20, but block size is all over the place.
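
Dividing the baseline's average throughput by its average IOPS gives a rough mean request size. This is only a sanity check on the slide's numbers, using decimal units; since the block size "is all over the place", the mean says little about the distribution. The function name is made up for this example.

```python
def avg_request_kb(throughput_mb_s: float, iops: float) -> float:
    """Mean I/O size in kB, using decimal units (1 MB = 1000 kB)."""
    return throughput_mb_s * 1000 / iops


# Figures from the baseline slide: 32 MB/s average, about 70 reads + 80 writes per second.
baseline_avg_kb = avg_request_kb(32, 70 + 80)
print(f"Mean request size: about {baseline_avg_kb:.0f} kB")
```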


Results

Intel’s Cluster Baseline Config, Gridmix 75

Source: Intel Corporation


We saw a significant reduction in runtimes on both Gridmix (27.5%) and Terasort (52%) with the same system.


Why?

HDD Utilization drops dramatically now that YARN and HDFS aren’t contending for the same disks.

Gridmix HDD Utilization, Baseline Config (37MB/Sec)

Gridmix HDD Utilization, YARN on Optane Config (6MB/Sec)


Why?

Since the benchmark was significantly I/O bound before, CPU Utilization increases from an average of 40% to an average of 57%. The CPU was doing work at 1.4x the rate, which correlates with the reduced runtime.

CPU Utilization, Baseline Config (40%)

CPU Utilization, New Config (57%)
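
A quick check of how closely the two figures line up, assuming the total CPU work in the benchmark is the same in both runs (so the rate of work scales with average utilization) and that this slide refers to the Gridmix run with the 27.5% runtime reduction reported earlier. Both are assumptions made for this sketch.

```python
def speedup_from_reduction(runtime_reduction: float) -> float:
    """Convert a fractional runtime reduction into a speedup factor."""
    return 1.0 / (1.0 - runtime_reduction)


# Average CPU utilization before and after, from the slide.
cpu_rate_ratio = 57 / 40

# Gridmix runtime reduction reported for the Twitter test cluster.
gridmix_speedup = speedup_from_reduction(0.275)

print(f"CPU work rate ratio: {cpu_rate_ratio:.2f}x")
print(f"Gridmix speedup:     {gridmix_speedup:.2f}x")
```

The two ratios land within a few percent of each other, which is the correlation the slide points to.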


Why?

We can now see how the temporary data really wants to behave.

Peaks to 20K IOPS, average > 6000 total RW IOPS

There was no way that spinning disks could handle this alone, let alone when sharing the HDD with HDFS load.
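
To put those numbers in HDD terms, the sketch below asks how many spinning disks it would take to serve the YARN temporary-data load by themselves. It takes 150 IOPS per HDD, the level the baseline slide already calls "really high for a HDD", and assumes the 6000 and 20K figures are per node; the deck does not state that explicitly.

```python
import math

HDD_IOPS = 150        # per-disk level the baseline slide calls "really high"
HDDS_PER_NODE = 12    # Twitter test system


def hdds_needed(target_iops: float, iops_per_hdd: float = HDD_IOPS) -> int:
    """Number of HDDs required to sustain target_iops."""
    return math.ceil(target_iops / iops_per_hdd)


avg_hdds = hdds_needed(6000)     # average YARN temp-data load
peak_hdds = hdds_needed(20000)   # peak YARN temp-data load

print(f"Average load: {avg_hdds} HDDs, peak load: {peak_hdds} HDDs "
      f"(the node has {HDDS_PER_NODE})")
```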


In Contrast: Intel’s Results

Gridmix 75

From Phase 1 (Baseline) to Phase 2 (YARN Data on Optane), the processing time was reduced by 51.7%

Adding in some HDFS Caching (Phase 5), the processing time was reduced by 56.2% (or about 9% incrementally)

In comparison, the max reduction in runtime in the Twitter test cluster was 27.5%

[Chart annotations: 51.7% Reduction in Runtime; 9% Incremental Reduction in Runtime; Phase 2: 750G Optane + 1.6TB NAND]

Source: Intel Corporation
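
The "about 9% incrementally" figure follows from comparing the Phase 5 runtime with the Phase 2 runtime rather than with the baseline. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def incremental_reduction(prev_reduction: float, new_reduction: float) -> float:
    """Runtime reduction of the new config relative to the previous config.

    Both arguments are fractional reductions measured against the same baseline.
    """
    prev_runtime = 1.0 - prev_reduction   # Phase 2 runtime as a fraction of baseline
    new_runtime = 1.0 - new_reduction     # Phase 5 runtime as a fraction of baseline
    return 1.0 - new_runtime / prev_runtime


phase5_vs_phase2 = incremental_reduction(0.517, 0.562)
print(f"Phase 5 vs Phase 2: {phase5_vs_phase2:.1%} less runtime")
```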


Why didn’t we see the same scaling?

Different Systems!

● Much more compute power in Intel’s system.
● Twitter retrofitted the NVMe Disks into an existing system, and couldn’t accommodate the optimal attachment of the NVMe disk to CPU Socket 0. Intel did some testing that indicated that this may have had a 10% penalty when trying to move big data from HDD to Cache.

| | Intel Config | Twitter Config |
|---|---|---|
| CPU | Xeon 8180 x 2 | E5-2640 v4 x 2 |
| Threads | 112 | 40 |
| HDDs | 8 | 12 |
| Threads / Disk | 14 | 3.33 |
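
The threads-per-disk figures can be reproduced from the hardware descriptions. In this sketch the E5-2640 v4 core count (10 per socket) comes from Intel's published specification, while the 28 cores per socket for the Intel lab system is inferred from its 112-thread total with Hyper-Threading enabled. The `Node` class is a made-up helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    sockets: int
    cores_per_socket: int
    threads_per_core: int
    hdds: int

    @property
    def threads(self) -> int:
        """Total hardware threads in the node."""
        return self.sockets * self.cores_per_socket * self.threads_per_core

    @property
    def threads_per_hdd(self) -> float:
        """Compute threads available for each data HDD."""
        return self.threads / self.hdds


twitter = Node(sockets=2, cores_per_socket=10, threads_per_core=2, hdds=12)
intel = Node(sockets=2, cores_per_socket=28, threads_per_core=2, hdds=8)  # inferred

for name, node in (("Twitter", twitter), ("Intel", intel)):
    print(f"{name}: {node.threads} threads, {node.threads_per_hdd:.2f} threads per HDD")
```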


Opportunity to apply what we discovered

Explore the other dimensions of testing: CPU and HDD Count

1. Moving the YARN data to the Optane SSD dramatically changed the disk utilization pattern.
2. Our test cluster only had 3.33 Threads/Disk, vs 14 Threads/Disk in Intel’s Mini Cluster.
3. We couldn’t push the HDDs hard enough with just HDFS Data to see a benefit from caching metadata and small files with Intel’s Cache Acceleration Software.


Experiment: Removing disks from the cluster

[Charts: Gridmix Runtime; Terasort Runtime]


[Figure: charts for the 12 Disk, 6 Disk, and 3 Disk configurations; chart labels: 27%, 38%, 68%; 40MB/s, 45MB/s, 23 MB/s, 9MB/s]


Strawman Example - 2X CPU with ¼ HDD

YARN on SSD enabled scaling. The benchmark runs in less than half the time, with 75% fewer HDDs.

[Chart: Gridmix Runtime; bar labels: +27%, -9%, -27%, -55%]


With ½ of Cores Disabled, no NVMe


CPU goes right to 100%*

*Due to the way that we disabled the cores (in the scheduler rather than in the BIOS), Linux reports 50%.

Cluster is CPU bound for most of the run!


½ Cores Disabled, with NVMe SSD


HDD Utilization drops dramatically, but we’re still CPU bound. Thus, a minimal impact to runtime.


All cores, with NVMe SSD, 3 HDDs


CPU is now heavily utilized (but not pegged at 100%) for most of the run.

The HDDs are running a lot of data without a lot of IOPS.

Great condition to be in!


CPU Utilization and Disk Access In Real Life

CPU Utilization, Processing Cluster 1


CPU Utilization and Disk Access In Real Life

HDD Utilization, Processing Cluster 1


CPU Utilization and Disk Access In Real Life

CPU Utilization, Realtime Cluster


CPU Utilization and Disk Access In Real Life

Disk Utilization, Realtime Cluster


vs. The experiment clusters

Intel 8 HDD: ~30MB/Sec (HDFS Was Partially Cached)

Twitter 12 HDD: ~9MB/Sec


vs. The experiment clusters

Twitter 6 HDD: ~22MB/Sec

Intel 8 HDD: ~30MB/Sec (HDFS Was Partially Cached)

Twitter 3 HDD: ~45MB/Sec

[Chart labels: 800MB/s, 3000-IOPS]


Impact and Next Steps


Shifting the Balance

32 Threads to 8 Disks is a 4:1 Ratio, a better fit for today’s more compute intensive workloads.

3-6T SSD for YARN data enables this shift in Threads:HDDs

| | Legacy Config | Possible Config |
|---|---|---|
| CPU | Xeon E3 Series 4-Core | 6262V CLX 24-Core |
| Memory | 32-64GB | 192GB |
| HDD | 12 x 2TB HDD | 8 x 6TB HDD |
| Boot | 240GB Boot | 240GB Boot |
| YARN Storage | N/A | 6.4TB High Endurance NVMe |

Impact

| | Legacy Config | Possible Config |
|---|---|---|
| Compute | 1X | >4X |
| Storage | 1X | 4X |
| Racks | 4X | 1 |

| | Original Goal | End Result |
|---|---|---|
| Scaling | 2X-3X | 4-5X |


Estimated 30% more cost efficient.


Best Practices

1. MEASURE: A visualization tool such as VTune™ Amplifier Platform Profiler made it really easy to measure what was happening in the experiment and production clusters.
2. EXPERIMENT: Challenge prior assumptions about how things are and how things will work.
3. REPEAT: Learn from the data that you collected, adjust your experiments, and try again!


Learnings

1. YARN Data on NVMe SSD: This single tweak alone changed the disk access patterns dramatically.
2. Adopt High Density Drives: Once the YARN data was moved to the SSD, it became clear that we didn’t need as many HDDs.
3. More Compute Power for Every Disk: Now we know that the next platform that we design needs to have more compute threads for each disk in the system.


What’s Next?

1. YARN Temporary Capacity Size Analysis
2. SSD Endurance Needs Analysis (Total YARN Daily Write Load)
3. Determine Optimal Balance of HDD, Threads, and NVMe SSD


Thank you!

Thank you Intel Team!
Ali Alavi (@TheAliAlavi), Felipe Barajas, Mauricio Cuervo (@mauriciocuervo), Milind Damle, Juan Fernandez (@cachegordon), Uma Gangumalla, Fabrizio Giamello (@giame), Andrzej Jakowski, David Leone, Anup Navare, Chris Parry, Brien Porter, Rakeshr Radhakrishnan Potty, David Tuhy, Barrie Wheeler (@AndBarrie), Michal Wysoczanski

Thank you Twitter Team!
Tu Lam (@tulam_tu), Brian Martin (@brayniac), Derrick Tseng (@dstseng), Mark Schonbach (@markbach), Matt Silver (@msilver)

#Collaborate


Twitter @ VTune Summit 2019
