A Fine-grained Component-level Power Measurement Method

Zehan Cui, Yan Zhu, Yungang Bao, Mingyu Chen

Institute of Computing Technology, Chinese Academy of Sciences

July 28, 2011


Outline

• Motivation

• Design & Implementation

• Experiments

• Conclusion & Work in Progress


Background

CPU no longer dominates the system power.

[Figure: Watts/Server. Source: The Problem of Power Consumption in Servers, Intel, 2009]

[Source: Barroso et al., The Datacenter as a Computer, 2009]

Motivation

Measurement is the basis.

[Diagram labels: Low power (Hardware, Software), model, measurement]

Existing Measurement Method

Component-level: ATX-based method

Accuracy

• Directly powered through ATX wires.

• Modern motherboards mostly have dedicated ATX wires for the processor.

• VRM (Voltage Regulation Module) loss.

• Usually deduced from multiple ATX wires. Platform dependent.


Our Solution: A Hybrid Way

Disk & CPU
◦ Similar to other ATX-based methods

Memory & Add-in Card Devices
◦ Wrapper-based methods

Advantages
◦ Accurate: direct measurement
◦ Easy-to-use: no deduction needed
◦ Portable: multi-platform

[Diagram labels: Power Supply, Current Sensor, CPU, Disk, Memory, wrapper]

Implementation

Prototype
◦ Disk power
◦ CPU power
◦ Memory power

Component | Count | Description
Wrapper Card | 1 | Memory power measurement. Supports DDR2-400 DIMM.
Intermediate Card | 1 | 8 channels. Each channel converts one current into a voltage.
DMM | 2 | Agilent 34411A. One channel each. Max speed: 50K samples per second. LAN interface.
Collector | 1 | PC. Collects data from the DMMs.


Experimental Setup

Component | Detail
CPU | Intel Core2 Duo E4500. # of cores: 2. Clock speed: 2.2GHz. L2 cache: 2MB. FSB speed: 800MHz.
Memory | DDR2-400 2GB UDIMM. Frequency: 200MHz. Max bandwidth: 3.2GB/s.
Disk | 640GB SATA
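The 3.2GB/s figure in the setup table can be checked from the DDR2-400 numbers. This is only a quick sanity check, and it assumes the standard 64-bit (8-byte) DIMM data bus, which the slide does not state.

```python
# Peak bandwidth of a DDR2-400 DIMM.
# Assumption (not on the slide): a standard 64-bit, i.e. 8-byte, data bus.
IO_CLOCK_HZ = 200e6          # "Frequency: 200MHz" from the setup table
TRANSFERS_PER_CLOCK = 2      # double data rate: two transfers per clock
BUS_WIDTH_BYTES = 8          # assumed 64-bit DIMM data bus


def peak_bandwidth_bytes_per_s(clock_hz, transfers_per_clock, bus_bytes):
    """Theoretical peak = clock x transfers per clock x bytes per transfer."""
    return clock_hz * transfers_per_clock * bus_bytes


peak = peak_bandwidth_bytes_per_s(IO_CLOCK_HZ, TRANSFERS_PER_CLOCK, BUS_WIDTH_BYTES)
print(f"{peak / 1e9:.1f} GB/s")
```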

An Example

401.bzip2 from SPECCPU2006

[Figure: power of components (unit: Watt, 0 to 50) versus time from beginning (unit: second, 0 to 70); series: CPU, Memory, Disk]

Time Granularity

The more frequently we measure the power, the more detail we get.

Observation: 5,000 samples/s is an appropriate sampling frequency at the component level.
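A small sketch of why the sampling rate matters. The trace is synthetic, not measured, and the slower meter is assumed to report the mean of each window; both are illustrative assumptions.

```python
# Synthetic illustration: a short power spike is visible at 5,000 samples/s
# but is averaged away at 10 samples/s. The trace is made up, not measured.
FULL_RATE = 5000  # samples per second, the rate suggested on the slide


def synthetic_trace(seconds=1, base_watts=10.0, spike_watts=30.0):
    """Return a flat trace with one 2 ms spike in the middle (hypothetical data)."""
    n = FULL_RATE * seconds
    trace = [base_watts] * n
    start = n // 2
    for i in range(start, start + 10):  # 10 samples at 5 kHz = 2 ms
        trace[i] = spike_watts
    return trace


def downsample_by_average(trace, factor):
    """Emulate a slower meter that reports the mean of each window (assumption)."""
    return [sum(trace[i:i + factor]) / factor
            for i in range(0, len(trace) - factor + 1, factor)]


trace = synthetic_trace()
coarse = downsample_by_average(trace, FULL_RATE // 10)  # 10 samples/s
print(max(trace), max(coarse))
```

The peak of the coarse trace sits far below the true spike, which is the kind of detail a low sampling rate hides.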


Graph BFS (Breadth First Search)

Higher BW, but lower Power

Lower BW, Higher Power

Microbenchmark

Malloc 512MB

Access in different strides

Time: 6.5 times longer

Power: slightly lower

Energy: 5.9 times higher

Two causes
◦ Row conflict
◦ Lots of TLB misses

Increase row buffer hit rate

Large pages may be more efficient

What is the relationship between performance and power?
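The three ratios on this slide are tied together by energy = average power x time. A quick check, assuming the ratios compare the same two runs and that "power" means average power over the run:

```python
# Energy = average power x time, so the slide's ratios imply a power ratio.
TIME_RATIO = 6.5     # "Time: 6.5 times longer"
ENERGY_RATIO = 5.9   # "Energy: 5.9 times higher"


def implied_power_ratio(energy_ratio, time_ratio):
    """Average-power ratio implied by an energy ratio and a time ratio."""
    return energy_ratio / time_ratio


ratio = implied_power_ratio(ENERGY_RATIO, TIME_RATIO)
print(f"average power ratio: {ratio:.2f}")
```

A ratio a little below 1 agrees with "Power: slightly lower": the extra energy comes almost entirely from the longer run time.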

Random vs. Sequential

64MB memory
◦ Random vs. Sequential

Jumping at least 64B eliminates cache hits

Large pages (2MB) eliminate TLB misses

Load/Store_Unit % = LSU_stall_time / CPU_Cycle

Observation: It seems that DRAM power is already proportional to bandwidth. But the fact is that …
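A minimal sketch of the two access patterns and the utilization metric. The function names and the `seed` parameter are hypothetical; the slides give only the 64MB size, the 64B jump, and the Load/Store_Unit % formula.

```python
import random

LINE_BYTES = 64               # jump size from the slide
BUFFER_BYTES = 64 * 1024**2   # 64MB working set from the slide


def access_offsets(pattern, seed=0, buffer_bytes=BUFFER_BYTES, step=LINE_BYTES):
    """Byte offsets visited once each, in sequential or shuffled order."""
    offsets = list(range(0, buffer_bytes, step))
    if pattern == "random":
        random.Random(seed).shuffle(offsets)  # seed selects the random pattern
    elif pattern != "sequential":
        raise ValueError(f"unknown pattern: {pattern}")
    return offsets


def lsu_utilization(lsu_stall_time, cpu_cycles):
    """Load/Store_Unit % as defined on the slide: stall time over CPU cycles."""
    return lsu_stall_time / cpu_cycles


# Small buffers keep the demonstration cheap; the slide uses 64MB.
seq = access_offsets("sequential", buffer_bytes=1024)
rnd = access_offsets("random", seed=1, buffer_bytes=1024)
```

Both patterns touch the same set of 64B-spaced offsets, so only the order of accesses differs between them.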

Random Access

Use different SEEDs to generate different random access patterns.

Power varies less than 1.1%.

Observation: DRAM power is highly correlated to two factors
• Load/Store Unit utilization
• Sequential / Random

We can build memory power models based on these two factors rather than bandwidth.
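One possible shape for such a model is a separate straight line per access pattern, fitted on Load/Store Unit utilization. The slide names only the two factors; the linear form, the function names, and the placeholder numbers below are all assumptions.

```python
# Hypothetical model form: power = a[pattern] + b[pattern] * lsu_utilization.
# The slide names the two factors; the linear form and all numbers are assumptions.
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def fit_model(samples):
    """samples: iterable of (pattern, lsu_utilization, watts) tuples."""
    by_pattern = {}
    for pattern, util, watts in samples:
        xs, ys = by_pattern.setdefault(pattern, ([], []))
        xs.append(util)
        ys.append(watts)
    return {p: fit_line(xs, ys) for p, (xs, ys) in by_pattern.items()}


def predict(model, pattern, util):
    """Predicted watts for one access pattern at a given LSU utilization."""
    a, b = model[pattern]
    return a + b * util


# Placeholder numbers purely to exercise the code; these are not measurements.
placeholder = [("sequential", 0.0, 1.0), ("sequential", 1.0, 3.0),
               ("random", 0.0, 1.0), ("random", 1.0, 2.0)]
model = fit_model(placeholder)
```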


Takeaway Messages (Conclusions)

We use a hybrid approach
◦ ATX-based: CPU/Disk
◦ Wrapper card: DRAM/…

5 kHz is an appropriate sampling frequency to disclose fine-grained power behavior.

DRAM power is highly correlated to Load/Store Unit utilization, rather than bandwidth.

Work in progress

Upgrade current system
◦ Support DDR3
◦ Support large memory capacity
◦ Support 40 simultaneous measuring channels

Use FPGA to collect measured data

Correlate the measured power data with high-level semantic information

Thanks!

Questions?


Backup

Wrapper Card Design

The wrapper card already exists.

We only made several small modifications.

[Diagram labels: Current Sensor, Power Supply Signals]

Memory Capacity Limitation

Normal

[Diagram labels: DIMM, DIMM slot, Motherboard]

DIMM: Dual In-line Memory Module

Memory Capacity Limitation

With our initial wrapper card

[Diagram labels: DIMM, Wrapper Card, DIMM slot, Motherboard]

Inside a DRAM Device

[Diagram labels: Bank 0, Row Decoder, Column Decoder, Sense Amps, Registers, Write FIFO, Drivers, Receivers, ODT]

Banks
• Independent arrays
• Asynchronous: independent of memory bus speed

I/O Circuitry
• Runs at bus speed
• Clock sync/distribution
• Bus drivers and receivers
• Buffering/queueing

On-Die Termination
• Required by bus electrical characteristics for reliable operation
• Resistive element that dissipates power when bus is active

[Source: H. David et al., Memory Power Management via Dynamic Voltage/Frequency Scaling, ICAC, 2011]

DRAM power

Can be approximately divided into
◦ Background power: considered to be stable
◦ Bank power: active/precharge; related to the frequency of row operations
◦ I/O power: burst; proportional to bandwidth
◦ Termination power: termination resistors; proportional to bandwidth
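The four-way split can be written as a simple additive model. This is a sketch only: the class, the field names, and every coefficient are hypothetical, and only the structure (a constant term, one term scaling with row operations, two terms scaling with bandwidth) comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class DramPowerModel:
    """Additive breakdown from the slide; every coefficient is hypothetical."""
    background_w: float          # treated as constant
    joules_per_row_op: float     # bank energy per active/precharge pair
    io_joules_per_byte: float    # I/O energy per byte transferred
    term_joules_per_byte: float  # termination energy per byte transferred

    def breakdown(self, row_ops_per_s, bytes_per_s):
        """Watts per category for a given row-operation rate and bandwidth."""
        return {
            "background": self.background_w,
            "bank": self.joules_per_row_op * row_ops_per_s,
            "io": self.io_joules_per_byte * bytes_per_s,
            "termination": self.term_joules_per_byte * bytes_per_s,
        }

    def total(self, row_ops_per_s, bytes_per_s):
        """Total DRAM power in watts."""
        return sum(self.breakdown(row_ops_per_s, bytes_per_s).values())


# Placeholder coefficients chosen only so the example runs.
model = DramPowerModel(1.0, 2e-8, 1e-10, 1e-10)
print(model.breakdown(row_ops_per_s=1e7, bytes_per_s=1e9))
```

Keeping the bank term separate from the bandwidth terms lets the model express equal bandwidth with different row-operation rates, which is the distinction between sequential and random access.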

Current Sensor

P = U * I

[Diagram labels: DC Current, CSA (Current-Sense Amplifier), DC Voltage, ADC or DMM, Data, Collector (PC)]

DC Voltage: doesn't fluctuate too much, less than 2% on our platform.
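A sketch of how a voltage reading from the sensing chain could be turned into power and energy. It assumes a shunt-resistor current-sense amplifier with a fixed gain and a supply rail held at its nominal voltage; the slides do not give the shunt value, the gain, or the rail voltages, so all numbers are hypothetical.

```python
# Sketch of the sensing chain: current -> CSA -> voltage -> DMM -> P = U * I.
# Assumptions (not on the slide): a shunt-resistor CSA with a fixed gain,
# and a supply rail treated as constant at its nominal value.
def current_from_csa(v_out, shunt_ohms, gain):
    """Invert V_out = I * R_shunt * gain to recover the rail current in amps."""
    return v_out / (shunt_ohms * gain)


def power_watts(rail_volts, v_out, shunt_ohms, gain):
    """P = U * I for one supply rail."""
    return rail_volts * current_from_csa(v_out, shunt_ohms, gain)


def energy_joules(power_samples, sample_rate_hz):
    """Rectangle-rule integration of evenly spaced power samples."""
    return sum(power_samples) / sample_rate_hz


# Hypothetical values: 12 V rail, 10 milliohm shunt, gain of 50, 1.0 V reading.
p = power_watts(12.0, 1.0, 0.010, 50.0)
print(f"{p:.1f} W")
```

Treating U as a constant is what lets a single current measurement per rail stand in for power, given the small voltage fluctuation noted on the slide.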

DRAM power

Possible reason for the non-proportionality of random-access power in slide 17:
◦ When bandwidth is low, auto-precharge (caused by refresh) means every access needs an ACTIVE; the bank power is proportional to bandwidth.
◦ When bandwidth is high, some accesses may hit in the row buffer and need fewer ACTIVEs; the slope of the bank power increase is lower than before.