
Cross-Layer Memory Management to Reduce DRAM Power Consumption

Michael Jantz

Assistant Professor

University of Tennessee, Knoxville


Introduction


• Assistant Professor at UT since August 2014

• Before UT

– PhD in Computer Science at KU (July 2014)

– Intern at Intel Corporation (2012 – 2013)

• Research interests:

– Compilers (optimization, phase ordering)

– Operating Systems (kernel instrumentation, memory and power management)

– Runtime Systems (dynamic compilation, object mgmt.)

• Courses taught:

– Compilers (COSC 461), Discrete Structures (COSC 311)

Outline

• Compiler Optimization Phase Ordering

• Dynamic Compilation

• Cross-Layer Memory Management

– Motivation

– Design

– Experimental Evaluation

• Future Directions

• Conclusions


Compiler Optimization Phase Ordering


Phase Ordering

• Compiler optimizations operate in phases

– Phases interact with each other

– Phase ordering problem: different orderings of the same phases produce code of different quality

• Problem: finding the best ordering for each function or program takes a very long time

– Iterative search is the most common technique
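To make iterative search concrete, here is a minimal sketch in C of a random search over phase orderings. The phase count, sequence length, and the compile_and_time evaluator are illustrative assumptions; a real search driver would invoke the compiler with each candidate ordering and time the generated code.

#include <stdio.h>
#include <stdlib.h>

#define NUM_PHASES 15   /* number of distinct optimization phases (assumed) */
#define SEQ_LEN    20   /* length of each candidate ordering (assumed) */
#define BUDGET     1000 /* number of candidate orderings to evaluate */

/* Stand-in evaluator: a real search would compile the function with the
 * given phase ordering and time the generated code. */
static double compile_and_time(const int *seq, int len)
{
    double t = 100.0;
    for (int i = 0; i < len; i++)       /* toy model of phase interactions */
        t -= ((seq[i] * (i + 1)) % 7) * 0.01;
    return t;
}

int main(void)
{
    int best[SEQ_LEN], cand[SEQ_LEN];
    double best_time = 1e30;

    srand(42);
    for (int n = 0; n < BUDGET; n++) {
        for (int i = 0; i < SEQ_LEN; i++)
            cand[i] = rand() % NUM_PHASES;  /* random ordering, repeats allowed */

        double t = compile_and_time(cand, SEQ_LEN);
        if (t < best_time) {                /* keep the best ordering found */
            best_time = t;
            for (int i = 0; i < SEQ_LEN; i++)
                best[i] = cand[i];
        }
    }
    printf("best time %.2f, first phase %d\n", best_time, best[0]);
    return 0;
}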


Exploiting Phase Interactions

• Our approach: identify and exploit phase interactions during search

• Major contributions:

– Reduce exhaustive phase ordering search time

– Increase applicability and effectiveness of individual optimization phases

– Improve phase ordering heuristics

• Publications: LCTES ‘10 [1], CASES ‘10 [2], CASES ‘13 [3], S:P&E (Jan. ‘13) [4]


Dynamic Compilation


Tradeoffs in Dynamic Compilation

• Managed language applications (e.g., Java)

– Distributed as machine-independent code

– Require compilation at runtime

• Dynamic compilation policies involve tradeoffs

– Can potentially slow down overall performance

– Must consider several factors when setting policy:

• Compiling speed and quality of compiled code

• Execution frequency of individual methods

• Availability of compilation resources


Dynamic Compilation Strategies

• Conducted multiple studies on whether, when, and how to compile program methods

• Employ industrial-grade Java VM (HotSpot)

• Major studies:

– Performance potential of phase selection in dynamic compilers (VEE '13-A [5])

– Dynamic compilation strategy on modern machines (TACO, Dec. '13 [6])


Cross-Layer Memory Management


A Collaborative Approach to Memory Management

• Memory has become a significant contributor to both system power and performance

• Memory power management is challenging

• Propose a collaborative approach between applications, operating system, and hardware:

– Applications – communicate memory usage intent to OS

– OS – re-architect memory mgmt. to interpret application intent and manage memory over hardware units

– Hardware – communicate hardware layout to the OS to guide memory management decisions


A Collaborative Approach to Memory Management

• Implemented framework by re-architecting a recent Linux kernel

• Experimental evaluation

• Publications: VEE ’13-B [7], Linux Symposium ‘14 [8], manuscript in submission [9]


Why

• CPU and memory are the most significant contributors to system power and performance

– In servers, memory can account for as much as 40% of total power [10]

• Applications can direct CPU usage

– threads may be affinitized to individual cores or migrated between cores

– prioritize threads for task deadlines (with nice)

– individual cores may be turned off when unused

• Surprisingly, much of this flexibility does not exist for controlling memory
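To make the contrast concrete, the sketch below pins the calling process to core 0 and lowers its priority using standard Linux interfaces (sched_setaffinity and nice); no comparably direct interface exists for steering which memory devices back an allocation.

#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>    /* sched_setaffinity, CPU_ZERO, CPU_SET */
#include <stdio.h>
#include <unistd.h>   /* nice */

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                           /* affinitize to core 0 */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    errno = 0;
    if (nice(10) == -1 && errno != 0)           /* deprioritize this process */
        perror("nice");

    /* ... workload now runs on core 0 at reduced priority ... */
    return 0;
}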


Example Scenario

• Consider a system with 512GB of DRAM running a database workload

– All memory in use, but only 2% of pages are accessed frequently

– CPU utilization is low

• How to reduce power consumption?


Challenges in Managing Memory Power

• Memory refs. have temporal and spatial variation

• At least two levels of virtualization:

– Virtual memory abstracts away application-level info

– Physical memory viewed as single, contiguous array of storage

• No way for agents to cooperate with the OS and with each other

• Lack of a tuning methodology


A Collaborative Approach

• Our approach: enable applications to guide mem. mgmt.

• Requires collaboration between the application, OS, and hardware:

– Interface for communicating application intent to OS

– Ability to keep track of which memory modules host which physical pages during memory mgmt.

• To achieve this, we propose the following abstractions:

– Colors

– Trays


Communicating Application Intent with Colors

• Color = a hint for how pages will be used

– Colors applied to sets of virtual pages that are alike

– Attributes associated with each color

• Attributes express different types of distinctions:

– Hot and cold pages (frequency of access)

– Pages belonging to data structures with different usage patterns

• Allow applications to remain agnostic to lower-level details of mem. mgmt.
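The slides do not show the framework's actual interface, so the sketch below is only an illustration of what application-side coloring could look like; the mcolor entry point and the color names are assumptions, not the real API.

#include <stddef.h>

/* Hypothetical colors; the slides name hot/cold and usage-pattern
 * distinctions as example attributes. */
enum mem_color { COLOR_HOT, COLOR_COLD, COLOR_SEQ_ACCESS };

/* Hypothetical entry point that applies a color to a range of virtual
 * pages; stubbed here so the sketch compiles. */
static int mcolor(void *addr, size_t len, enum mem_color color)
{
    (void)addr; (void)len; (void)color;
    return 0;
}

/* Example: a database-like application hints that its index is hot and
 * its append-only log is cold. */
static void apply_hints(void *index_buf, size_t index_len,
                        void *log_buf, size_t log_len)
{
    mcolor(index_buf, index_len, COLOR_HOT);   /* frequent, random access */
    mcolor(log_buf, log_len, COLOR_COLD);      /* rarely accessed */
}

int main(void)
{
    static char index_buf[1 << 20], log_buf[1 << 20];
    apply_hints(index_buf, sizeof index_buf, log_buf, sizeof log_buf);
    return 0;
}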



Power-Manageable Units Represented as Trays

• Tray = software structure containing sets of pages that constitute a power-manageable unit

• Requires mapping from physical addresses to power-manageable units

– ACPI 5.0 defines the Memory Power State Table (MPST) to expose this mapping

• Re-architect a recent Linux kernel to perform memory management over trays (a struct-level sketch follows the figure below)

[Figure: software intent is expressed as colors, colors map to trays, and trays guide memory allocation and freeing.]
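As a rough illustration, a tray might be represented in the kernel along the lines of the C sketch below; the field names and layout are assumptions, since the slides describe only the concept.

/* Minimal stand-in for the kernel's struct list_head so the sketch is
 * self-contained outside a kernel tree. */
struct list_head { struct list_head *next, *prev; };

/* Sketch of a tray: the software representation of one power-manageable
 * memory unit (e.g., a rank behind a channel). Field names are assumed. */
struct tray {
    int id;                     /* index of the power-manageable unit */
    int nid;                    /* NUMA node containing the unit */
    unsigned long start_pfn;    /* first physical page frame in the unit */
    unsigned long nr_pages;     /* number of page frames in the unit */
    struct list_head free_list; /* free pages, managed per tray */
    unsigned long nr_free;      /* current count of free pages */
};

On an allocation or page fault, the modified kernel can then consult the faulting virtual page's color and pick a tray accordingly, for example packing cold pages onto trays that can remain in low-power states.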


[Figure: Example of tray-based memory management across the application, operating system, and hardware layers. The application colors a range of virtual pages (V1, V2, …, VN) to indicate they will be hot; the OS looks up the attribute associated with each virtual page's color and assigns physical pages (P1, P2, …, PN) to trays (T0–T7) during physical memory allocation and recycling. The memory topology, two NUMA nodes each with two controllers/channels (CH0, CH1) and memory modules M0–M7, is represented in the OS using trays. Example colors include hot pages, sequentially accessed pages, and cold pages.]

Experimental Evaluation

• Emulating NUMA APIs

• Memory prioritization for applications

• Reducing DRAM power consumption

– Power-saving potential of containerized memory management

– Localized allocation and recycling

– Exploiting generational garbage collection


Automatic Cross-Layer Memory Management

• Limitations of application guidance:

– Little understanding of which colors or coloring hints will be most useful for existing workloads

– All colors and hints must be manually inserted

• Our approach: integrate with profiling and analysis to automatically provide power / bandwidth mgmt.

– Implemented using the HotSpot JVM

– Instrumentation and analysis to build memory profile

– Partition live objects into separately colored regions
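One simple partitioning policy, sketched below purely for illustration (the profile fields and the hot-space byte budget are assumptions, not necessarily HotSpot's implementation), ranks allocation sites by access density and colors the densest sites hot until a size budget is exhausted.

#include <stdio.h>
#include <stdlib.h>

struct site_profile {
    const char *site;        /* identifies the allocation site */
    unsigned long accesses;  /* sampled accesses to its objects */
    unsigned long bytes;     /* bytes its objects occupy */
    int hot;                 /* output: 1 if colored hot */
};

static int by_density_desc(const void *a, const void *b)
{
    const struct site_profile *x = a, *y = b;
    double dx = (double)x->accesses / x->bytes;
    double dy = (double)y->accesses / y->bytes;
    return (dy > dx) - (dy < dx);   /* descending access density */
}

/* Mark the densest sites hot until the hot-space byte budget is spent. */
static void partition(struct site_profile *sites, int n, unsigned long budget)
{
    qsort(sites, n, sizeof(*sites), by_density_desc);
    for (int i = 0; i < n; i++) {
        if (sites[i].bytes <= budget) {
            sites[i].hot = 1;
            budget -= sites[i].bytes;
        }
    }
}

int main(void)
{
    struct site_profile s[] = {    /* hypothetical profile data */
        { "Index.<init>", 900000,  64u << 20, 0 },
        { "Log.append",     2000, 512u << 20, 0 },
        { "Cache.put",    400000, 128u << 20, 0 },
    };
    partition(s, 3, 256u << 20);   /* 256 MB hot budget (assumed) */
    for (int i = 0; i < 3; i++)
        printf("%-14s %s\n", s[i].site, s[i].hot ? "hot" : "cold");
    return 0;
}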



[Figure: JVM framework overview. The execution engine (JIT compiler, object profiling and analysis, garbage collection) manages an application heap in which the young generation (hot/cold eden, hot/cold survivor spaces) and the tenured generation (hot/cold tenured) are each divided into hot and cold spaces.]

• Employ the default HotSpot config. for server-class applications

• Divide survivor / tenured spaces into spaces for hot / cold objects


[Figure: same JVM framework diagram as above.]

• Color spaces on creation or resize

• Partition allocation sites and objects into hot / cold sets


Potential of JVM Framework

• Our goal: evaluate power-saving potential when hot / cold objects are known statically

• MemBench: Java benchmark that uses different object types for hot / cold memory

• “HotObject” and “ColdObject”

– Contain memory resources (array of integers)

– Implement different functions for accessing mem.


Experimental Platform

• Hardware

– Single node of a 2-socket server machine

– Processor: Intel Xeon E5-2620 (12 threads @ 2.1GHz)

– Memory: 32GB DDR3 memory (four DIMMs, each connected to its own channel)

• Operating System

– CentOS 6.5 with Linux 2.6.32

• HotSpot JVM

– v. 1.6.0_24, 64-bit

– Default configuration for server-class applications


The MemBench Benchmark

• Object allocation

– Creates “HotObject” and “ColdObject” objects in a large in-memory array

– Fewer hot objects than cold (~15% of all objects are hot)

– Object array occupies most (~90%) of system memory

• Multi-threaded object access

– Object array divided into 12 separate parts, each passed to its own thread

– Iterate over object array, only accessing hot objects

• Optional delay parameter
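MemBench itself is written in Java, but its access pattern is easy to see in the C analogue below; the object counts and sizes are scaled-down assumptions (the real benchmark fills roughly 90% of system memory). Build with cc membench.c -lpthread.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NUM_THREADS 12                     /* one part of the array per thread */
#define NUM_OBJECTS (NUM_THREADS * 20000)  /* scaled down for illustration */
#define OBJ_INTS    128                    /* each object holds an integer array */

struct object { int hot; int data[OBJ_INTS]; };
static struct object *objects;
static long delay_ns;                      /* optional delay between accesses */

static void spin_delay(long ns)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    do clock_gettime(CLOCK_MONOTONIC, &t1);
    while ((t1.tv_sec - t0.tv_sec) * 1000000000L
           + (t1.tv_nsec - t0.tv_nsec) < ns);
}

static void *worker(void *arg)
{
    long part = (long)arg, chunk = NUM_OBJECTS / NUM_THREADS, sum = 0;
    for (long i = part * chunk; i < (part + 1) * chunk; i++) {
        if (!objects[i].hot)
            continue;                      /* touch only hot objects */
        for (int j = 0; j < OBJ_INTS; j++)
            sum += objects[i].data[j];
        if (delay_ns)
            spin_delay(delay_ns);          /* throttles bandwidth demand */
    }
    return (void *)sum;
}

int main(int argc, char **argv)
{
    delay_ns = argc > 1 ? atol(argv[1]) : 0;
    objects = calloc(NUM_OBJECTS, sizeof(*objects));
    for (long i = 0; i < NUM_OBJECTS; i++)
        objects[i].hot = rand() % 100 < 15;  /* ~15% of objects are hot */

    pthread_t tid[NUM_THREADS];
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);
    free(objects);
    return 0;
}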


MemBench Configurations

• Three configurations

– Default

– Tray-based kernel (custom kernel, default HotSpot)

– Hot/cold organize (custom kernel, custom HotSpot)

• Delay varied from "no delay" to 1000ns

– With no delay, 85ns between memory accesses


MemBench Performance

• Tray-based kernel has about the same performance as the default

• Hot/cold organize exhibits poor performance with low delay


[Figure: MemBench performance (runtime relative to default, P(X)/P(DEF)) and bandwidth (GB/s) vs. time (ns) between memory accesses (85–1000ns) for the default, tray-based kernel, and hot/cold organize configurations.]

MemBench Bandwidth

• Default and tray-based kernel produce high memory bandwidth when delay is low

• Placement of hot objects across multiple channels enables higher bandwidth


[Figure: same performance/bandwidth chart as above.]

MemBench Bandwidth

• Hot/cold organize co-locates hot objects on a single channel

• Increasing the delay reduces the bandwidth requirements of the workload


[Figure: same performance/bandwidth chart as above.]


MemBench Energy

• Hot/cold organize consumes much less power with low delay

• Even when BW reqs. are reduced, hot/cold organize consumes less power than other configurations

[Figure: MemBench energy consumed relative to default (J(X)/J(DEF)) vs. time (ns) between memory accesses, for the tray-based kernel and hot/cold organize configurations, measured for DRAM only and for CPU+DRAM.]


MemBench Energy

• Significant energy savings potential with custom JVM

• Max. DRAM energy savings of ~39%, max. CPU+DRAM energy savings of ~15%

[Figure: same energy chart as above.]

Results Summary

• Object partitioning strategies

– Offline approach partitions allocation sites

– Online approach uses sampling to predict object access patterns

• Evaluate with standard sets of benchmarks

– DaCapo, SciMark

• Achieve 10% average DRAM energy savings and a 2.8% CPU+DRAM energy reduction

• Performance overhead

– 2.2% for offline, 5% for online


Current and Future Projects in Cross-Layer Memory Management

• Immediate future work: address performance losses of our current approach

– Improve the online sampling

– Automatic bandwidth management

• Applications for heterogeneous memory architectures

• Exploit data object placement within each page to improve efficiency


Conclusions

• Research focuses on software systems

– Compilers, operating systems, and runtime systems

• Cross-layer memory management

– Achieving power/performance efficiency in memory requires a cross-layer approach

– First framework to use usage patterns of application objects to steer low-level memory mgmt.

– Approach shows promise for reducing DRAM energy

– Opens several avenues for future research in collaborative memory management


Questions?


References

1. Prasad Kulkarni, Michael Jantz, and David Whalley. Improving Both the Performance Benefits and Speed of Optimization Phase Sequence Searches. In the ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES '10), April 2010.

2. Michael Jantz and Prasad Kulkarni. Eliminating False Phase Interactions to Reduce Optimization Phase Order Search Space. In the ACM/IEEE International Conference on Compilers, Architectures, and Synthesis for Embedded Systems (CASES '10), October 24-29, 2010.

3. Michael Jantz and Prasad Kulkarni. Exploiting Phase Inter-Dependencies for Faster Iterative Compiler Optimization Phase Order Searches. In the ACM/IEEE International Conference on Compilers, Architectures, and Synthesis for Embedded Systems (CASES '13), September 29 - October 4, 2013.

4. Michael Jantz and Prasad Kulkarni. Analyzing and Addressing False Phase Interactions During Compiler Optimization Phase Ordering. In Software: Practice and Experience. January 2013.

5. Michael Jantz and Prasad Kulkarni. Performance Potential of Optimization Phase Selection During Dynamic JIT Compilation. In the ACM SIGPLAN Conference on Virtual Execution Environments (VEE '13), March 16-17, 2013.

6. Michael Jantz and Prasad Kulkarni. Exploring Single and Multi-Level JIT Compilation Policy for Modern Machines. In ACM Transactions on Architecture and Code Optimization (TACO). December 2013.


References

7. Michael Jantz, Carl Strickland, Karthik Kumar, Martin Dimitrov, and Kshitij A. Doshi. A Framework for Application Guidance in Virtual Memory Systems. In the ACM SIGPLAN Conference on Virtual Execution Environments (VEE '13), March 16-17, 2013.

8. Michael Jantz, Kshitij Doshi, Prasad Kulkarni, and Heechul Yun. Leveraging MPST in Linux with Application Guidance to Achieve Power-Performance Goals. In Linux Symposium, Ottawa, Canada. May 2014.

9. Michael Jantz, Forrest Robinson, Prasad Kulkarni, and Kshitij Doshi. Cross-Layer Memory Management for Managed Language Applications. In submission. July 2015.

10. C. Lefurgy, K. Rajamani, F. Rawson, W. Felter, M. Kistler, and T. W. Keller. Energy Management for Commercial Servers. Computer, 36(12):39–48, Dec. 2003.
