Dark Silicon, Mobile Devices, and Possible Open Source Solutions

Koan-Sin Tan

[email protected]

COSCUP 2013, Aug. 3rd, TICC, Taipei

Friday, August 23, 13



Page 2: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

• Software engineer, veteran open-source user

• Learned about lightweight processes (LWP) on SunOS 4.x in the early 1990s

• Wrote a user-level thread library on 386BSD with a classmate in 1992

• Recently involved in big.LITTLE scheduling work


Page 5: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

Silicon

Page 6: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

• “Dark Silicon refers to the exponentially increasing number of a chip's transistors that must remain passive, or "dark", in order to stay within a chip's power budget”


Page 7: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

Figure from the textbook. We know we are in the CMP (chip multiprocessor) era. “Since 2003, the limits of power and available instruction-level parallelism have slowed uniprocessor performance.”

Page 8: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

Dennard scaling hits the wall

• Dennard Scaling

• When voltages are scaled along with all dimensions, a device’s electric fields remain constant, and most device characteristics are preserved

• scaling maintains constant power density

• logic area and power are both scaled down by alpha^2

• energy per transition is scaled down by alpha^3, but frequency is scaled up by 1/alpha, resulting in an alpha^2 decrease in power per gate

• ........

• Search for “Dennard scaling” to find more information, e.g., http://www1.cs.columbia.edu/~cs4824/lectures/csee4824_f12_lec22.pdf
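Written out, the bullets above amount to a short derivation. This is a sketch under the classical assumptions, with alpha < 1 as the linear scaling factor. The symbols A (gate area), C (gate capacitance), V (supply voltage), f (frequency), E (energy per transition) and P (power per gate) are introduced here for illustration and do not appear on the slide.

```latex
\begin{align*}
A &\to \alpha^{2} A, \qquad C \to \alpha C, \qquad V \to \alpha V, \qquad f \to \frac{f}{\alpha} \\
E = C V^{2} &\to (\alpha C)(\alpha V)^{2} = \alpha^{3} E \\
P = E f &\to \alpha^{3} E \cdot \frac{f}{\alpha} = \alpha^{2} P \\
\frac{P}{A} &\to \frac{\alpha^{2} P}{\alpha^{2} A} = \frac{P}{A}
\end{align*}
```

If the supply voltage can no longer be lowered (V held fixed) while the other quantities still scale, the same arithmetic gives E → alpha E and P → P per gate, so power density grows as 1/alpha^2. That is the point at which parts of the chip have to stay dark.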


Page 9: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

Mobile Devices

• Both power and thermal constraints are more severe than on desktop devices

• Battery technology progresses relatively slowly

• You don’t want to put a fan into your smartphone

• conduction, convection, radiation

Page 10: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

Yes, modern high-end mobile processors have serious thermal problems. Tegra 4 game console figure from iFixit

Page 11: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

Nexus 10 Thermal Throttling

• Antutu 3.0.2

• Unit for the X axis is 200 ms

• It reaches 80 ˚C in 20 seconds

• Throttling starts at 80 ˚C; stops at 78 ˚C

• Throttling means decrementing the maximum frequency value of cpufreq
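The start/stop behaviour described above is a simple hysteresis loop, which can be sketched as below. Only the 80 ˚C and 78 ˚C trip points come from the slide. The frequency table, the `throttle_state` struct, the `throttle_step` function, and the policy of raising the cap again one step per sample once throttling has stopped are all assumptions made for illustration; this is not the actual Nexus 10 kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Trip points taken from the slide; everything else is illustrative. */
#define THROTTLE_START_C 80
#define THROTTLE_STOP_C  78

/* Hypothetical cpufreq table in kHz, highest frequency first. */
static const unsigned int freq_table_khz[] = { 1700000, 1400000, 1100000, 800000 };
#define NUM_FREQS (sizeof(freq_table_khz) / sizeof(freq_table_khz[0]))

struct throttle_state {
    bool   throttling;  /* set at >= 80 C, cleared at <= 78 C */
    size_t max_index;   /* index into freq_table_khz of the current cap */
};

/* Process one temperature sample and return the new maximum frequency (kHz). */
unsigned int throttle_step(struct throttle_state *s, int temp_c)
{
    assert(s != NULL && s->max_index < NUM_FREQS);

    if (temp_c >= THROTTLE_START_C)
        s->throttling = true;
    else if (temp_c <= THROTTLE_STOP_C)
        s->throttling = false;
    /* Between the two trip points the previous state is kept (hysteresis). */

    if (s->throttling) {
        if (s->max_index + 1 < NUM_FREQS)
            s->max_index++;      /* decrement the maximum frequency by one step */
    } else if (s->max_index > 0) {
        s->max_index--;          /* assumed policy: restore one step per sample */
    }
    return freq_table_khz[s->max_index];
}
```

Calling `throttle_step` once per 200 ms sample would produce the sawtooth pattern one expects from such a loop: the cap drops while the temperature is at or above 80 ˚C and climbs back once it has fallen to 78 ˚C.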


Page 12: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

[Figure: “Running Antutu on Octa”; series: freq 0 to freq 3 and temp 0 to temp 3, plotted over sample number]

Antutu 3.0.2 on S4 Octa

Page 13: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

[Figure: “Running Antutu on New One”; series: tz0 to tz11, vertical axis 0 to 100, plotted over sample number]

Antutu 3.0.2 on new One

Page 14: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

Introducing big.LITTLE

ARM DEN0013C Copyright © 2011, 2012 ARM. All rights reserved. 28-5 ID071612 Non-Confidential

Figure 28-3 Processor DVFS curves

In a big.LITTLE system these operating points are applied both to the Cortex-A15 and Cortex-A7 processors. When the Cortex-A7 processor is executing the OS can tune the operating points as it would for an existing platform with a single applications processor. When the Cortex-A7 processor is at its highest operating point (Figure 28-3), if more performance is required a switch is invoked that transfers the OS and applications to the Cortex-A15 processor. Further DVFS tuning takes place on the Cortex-A15 processor if required, as the operating load increases.

Migration requires rapid context switching capability. Coherency is clearly a critical enabler in achieving a fast task migration time as it allows the state that has been saved on the outbound (migrated from) processor to be snooped and restored on the inbound (migrated to) processor rather than going via main memory. Additionally, for Cluster migration, (or for CPU migration when all processors have been switched) because the L2 cache of the outbound processor is coherent it can remain powered up after a task migration to improve the cache warming time of the inbound processor through snooping of data values. However, since the L2 cache of the outbound processor cannot be allocated, it will eventually need to be cleaned and powered off to save leakage power. The switching sequence is described in Figure 28-4 on page 28-6.
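The switching rule in the excerpt above (tune DVFS on the Cortex-A7 first, migrate to the Cortex-A15 only when the A7 is already at its highest operating point) can be sketched as a tiny state machine. The type and function names, the number of operating points per cluster, and the symmetric `step_down` path are assumptions made for illustration; the real switcher also has to save and restore processor state, which is not modelled here.

```c
#include <assert.h>
#include <stddef.h>

enum cluster { LITTLE_A7, BIG_A15 };

struct cpu_state {
    enum cluster cluster;
    size_t opp;              /* current operating point, 0 = lowest */
};

/* Hypothetical number of operating points per cluster. */
#define A7_OPPS  4
#define A15_OPPS 4

static size_t num_opps(enum cluster c)
{
    return c == LITTLE_A7 ? A7_OPPS : A15_OPPS;
}

/* More performance is required: move one step up the combined DVFS curve. */
void step_up(struct cpu_state *s)
{
    assert(s != NULL && s->opp < num_opps(s->cluster));
    if (s->opp + 1 < num_opps(s->cluster)) {
        s->opp++;                    /* ordinary DVFS inside the cluster */
    } else if (s->cluster == LITTLE_A7) {
        s->cluster = BIG_A15;        /* A7 is at its top point: switch */
        s->opp = 0;
    }
    /* Already at the top of the A15: nothing left to raise. */
}

/* Less performance is required (assumed mirror image of step_up). */
void step_down(struct cpu_state *s)
{
    assert(s != NULL && s->opp < num_opps(s->cluster));
    if (s->opp > 0) {
        s->opp--;
    } else if (s->cluster == BIG_A15) {
        s->cluster = LITTLE_A7;      /* A15 is at its lowest point: switch back */
        s->opp = A7_OPPS - 1;
    }
}
```

The point of the sketch is that the OS sees one longer DVFS curve; which physical core sits behind a given operating point is hidden behind the switch.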

ARM big.LITTLE

Page 15: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

Thread-Level Parallelism

• Thread-level parallelism (TLP) is an index; you can treat it as the number of threads running concurrently

• a table from an ISCA ‘10 paper named “Evolution of thread-level parallelism in desktop applications”

• 2000, 2010

• mobile devices are worse

• http://dl.acm.org/citation.cfm?id=1816000
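To make the index concrete, here is a minimal sketch that assumes the usual definition from this line of work: c[i] is the fraction of time during which exactly i threads run concurrently, and TLP is the average concurrency over the non-idle time, i.e. the sum of i * c[i] for i = 1..n divided by (1 - c[0]). The function name `tlp` and the sample numbers in the note below are made up for illustration, not taken from the paper's table.

```c
#include <assert.h>
#include <stddef.h>

/*
 * c[0..n] holds the fraction of time with exactly i threads running
 * concurrently (c[0] is the idle fraction); the entries should sum to 1.
 * Returns the average number of concurrent threads over non-idle time.
 */
double tlp(const double *c, size_t n)
{
    assert(c != NULL && n >= 1 && c[0] < 1.0);

    double weighted = 0.0;
    for (size_t i = 1; i <= n; i++)
        weighted += c[i] * (double)i;

    return weighted / (1.0 - c[0]);
}
```

For instance, a machine that is idle half of the time, runs one thread a quarter of the time and two threads the remaining quarter has a TLP of 1.5: on average, the busy periods use one and a half cores.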


Page 16: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

Parallel Programming Could Help a Bit

• Parallel computing/programming has been around for a long time

• pthread and OpenMP are available, and C++11 came with concurrency support

• Java uses threads and its own synchronization model

• “Why Threads Are A Bad Idea”, by John Ousterhout, http://www.cc.gatech.edu/classes/AY2009/cs4210_fall/papers/ousterhout-threads.pdf

• Threads are “easy: to describe; to use; to get wrong”, to quote Andrew Birrell, http://www.cs.princeton.edu/courses/archive/spr07/cos598A/lectures/Birrell.pdf

• For a more theoretical explanation, see “The Problems with Threads” by Edward Lee, http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf

• Besides the shared-memory model, there is the message-passing model, and more: actors, data-flow, systolic arrays, etc.

Page 18: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

Some of Ousterhout’s arguments remain valid

• Synchronization

• mutexes/locks must be placed manually

• deadlock: yes, deadlock

• hard to debug

• threads break modularity

• callbacks don’t work with locks

Page 19: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

Threads are easy to get wrong

• Manual selection of mutual exclusion:

• Default is too little (and hence races)

• Easy fix is too much (deadlocks or blank stares)

• Projects don’t create hierarchical abstractions

• Can’t decide and/or maintain acyclic locking order

• “Composition” requires entire new abstractions

• “Clever” optimizations aren’t maintainable

• .....
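The “acyclic locking order” point above is easiest to see with two locks. The sketch below always acquires a pair of mutexes in the same global order (here by address), so two threads transferring in opposite directions cannot deadlock. The `account` struct and the function names are hypothetical, and ordering by address is only one possible convention.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

struct account {
    pthread_mutex_t lock;
    long balance;
};

/* Acquire both locks in one fixed global order: lower address first. */
static void lock_pair(struct account *a, struct account *b)
{
    assert(a != b);
    if ((uintptr_t)a < (uintptr_t)b) {
        pthread_mutex_lock(&a->lock);
        pthread_mutex_lock(&b->lock);
    } else {
        pthread_mutex_lock(&b->lock);
        pthread_mutex_lock(&a->lock);
    }
}

/* The release order does not matter for deadlock freedom. */
static void unlock_pair(struct account *a, struct account *b)
{
    pthread_mutex_unlock(&a->lock);
    pthread_mutex_unlock(&b->lock);
}

/* Move money between two accounts while holding both locks. */
void transfer(struct account *from, struct account *to, long amount)
{
    lock_pair(from, to);
    from->balance -= amount;
    to->balance += amount;
    unlock_pair(from, to);
}
```

Had `transfer` simply locked `from` and then `to`, a concurrent `transfer(a, b)` and `transfer(b, a)` could each hold one lock and wait forever for the other. The rule only helps if every piece of code that takes both locks follows it, which is exactly the maintenance problem the slide points at.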


Page 20: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

User-level libraries, frameworks

• Android AsyncTask

• a class to help perform background operations and publish results on the UI thread without having to manipulate threads and/or handlers

• http://developer.android.com/reference/android/os/AsyncTask.html

• Intel Threading Building Blocks (TBB)

• http://threadingbuildingblocks.org/, http://en.wikipedia.org/wiki/Intel_Threading_Building_Blocks

• works on Android x86 and ARM

• Apple Grand Central Dispatch (GCD)

• http://developer.apple.com/library/ios/#documentation/Performance/Reference/GCD_libdispatch_Ref/

• Software Transactional Memory

• http://gcc.gnu.org/wiki/TransactionalMemory


Page 22: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

OpenCL Related

• OpenCL

• pocl, http://pocl.sourceforge.net/

• OpenCL and Java

• Aparapi, https://code.google.com/p/aparapi/

• Sumatra, http://openjdk.java.net/projects/sumatra/

• RenderScript

• in AOSP

• ThorScript

• will be open-sourced

Page 23: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

Cilk Plus: simple language extensions that originated with Charles Leiserson

Page 24: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

Simple Cilk Plus Example

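/* Serial version */
int fib(int n)
{
    if (n < 2) return n;
    int x = fib(n-1);
    int y = fib(n-2);
    return x + y;
}

/* Cilk Plus version */
#include <cilk/cilk.h>

int fib(int n)
{
    if (n < 2) return n;
    int x = cilk_spawn fib(n-1);  /* may run in parallel with the next call */
    int y = fib(n-2);
    cilk_sync;                    /* wait for the spawned call before using x */
    return x + y;
}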

Page 25: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

simple GCD+blocks

/* n, result, fib and a_queue are declared elsewhere (not shown on the slide) */
dispatch_group_t group = dispatch_group_create();

fib = ^() {
    if (n < 2) { result = n; return; }
    __block int x, y;
    int m = n;

    n = m - 1;
    dispatch_group_async(group, a_queue, ^{ fib(); x = result; });
    dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
    n = m - 2;
    dispatch_sync(a_queue, ^{ fib(); y = result; });
    n = m;
    result = x + y;
    return;
};

Page 26: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

data parallel fib() looks more reasonable

int fib(int n)
{
    if (n < 2) return n;
    int p = 0, q = 1, result = 0;

    cilk_for (int i = 2; i <= n; i++) {
        result = p + q;
        p = q;
        q = result;
    }
    return result;
}

n.b.: in case you didn’t notice, this may produce wrong results because of the loop-carried dependency: every iteration reads the p and q written by the previous one, so the iterations cannot safely run in parallel.

Page 27: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

parallel fib() with GCD and blocks

int (^fib)(int);

fib = ^(int n) {
    if (n < 2) return n;
    __block int p = 0, q = 1, result = 0;
    /* Each iteration reads and writes the shared p, q and result, so this
       has a loop-carried dependency and may produce wrong results. */
    dispatch_apply(n - 1,
                   dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0),
                   ^(size_t i) {
                       result = p + q;
                       p = q;
                       q = result;
                   });
    return result;
};

Page 28: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

GCD can be used with OpenCL

• That’s what is available on Mac OS X and iOS

• Well, not quite: iOS hasn’t opened up OpenCL yet, but you can easily find out how to use OpenCL for ARM on iOS

Page 29: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

What is available

• Task-parallel and data-parallel constructs, libraries, or languages

• Lambda, closure, continuation, etc.

• Queue, queue management: load balancing, work stealing, etc.

• Data structures, e.g., TBB

• Lock-less synchronization

Page 30: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

Lock-free synchronization

• In case you didn’t know it, NO, it’s not new at all

• Linux has been using RCU (Read-Copy-Update) for several years

• In fact, it has been around since the 1970s; see Kung’s 1980 paper, which proposed an RCU-like mechanism.


Page 31: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

Kernel

• big.LITTLE

• IKS: in-kernel-switcher

• related code being upstreamed after 3.10

• Global Task Scheduling (GTS), Heterogeneous Multi-Processor (HMP)

• The current CFS maintainer, Ingo Molnar, didn’t like the power-saving part of GTS

• Power Management

• So many mechanisms: cpufreq, cpuidle, runtime PM, CCF, etc.

• Linaro has a wiki page on how to/what to enable/implement for a new SoC

• Thermal Management

• Throttling, e.g., asking related components to slow down so that less heat is generated

Page 32: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

Linaro In-kernel Switcher

Page 33: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

Global Task-Scheduling (GTS)


Page 34: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

Much remains to be done

• No widely used open-source power or thermal management framework available?

• Some problems are fundamentally hard to parallelize, e.g.,

• parsing in browsers: nowadays, WebKit and Firefox use LALR(1) or similar parsing algorithms

• No full-featured open-source OpenCL implementation for GPGPU

Page 35: Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

Wrap-up

• “dark silicon” is a reality on mobile devices

• power wall and thermal wall

• parallel/concurrent code isn’t popular on mobile devices (yet)

• discussed some possible free and open-source solutions

• much remains to be done