Dr. Dobb's Journal
February 2012

ALSO INSIDE
The Need to Rewrite Established Algorithms >>
Efficient Use of Lambda Expressions and std::function >>
From the Vault: 8 Simple Rules for Designing Threaded Applications >>

Welcome to the Jungle: Parallel Programming
Herb Sutter told you when the free lunch was over; now he leads you through the parallel hardware jungle.
February 2012

CONTENTS

COVER STORY
6 Welcome to the Jungle
By Herb Sutter
The transitions to multicore processors, GPU computing, and HaaS cloud computing are not separate trends, but aspects of a single trend: mainstream computers from desktops to smartphones are being permanently transformed into heterogeneous supercomputer clusters. Henceforth, a single compute-intensive application will need to harness different kinds of cores, in immense numbers, to get its job done. The free lunch is over. Welcome to the hardware jungle.
26 Efficient Use of Lambda Expressions and std::function
By Cassio Neri
Functors and std::function implementations vary widely between libraries. C++11's lambdas make them more efficient.
3 Editorial: The Need to Rewrite Established Algorithms
By Andrew Binstock
Parallel architectures, like other hardware advances before them, require us to rewrite algorithms and data structures, especially the old standbys that have served us well.
31 From the Vault: 8 Simple Rules for Designing Threaded Applications
By Clay Breshears
Multithreaded programming is still more art than science. This
article gives eight simple rules that you can add to your palette
of threading design methods. By following these rules, you will
have more success in writing the best and most-efficient
threaded implementation of your applications.
38 Links
Snapshots of the most interesting items on drdobbs.com, including cross-platform development with Eclipse CDT, deployment with Amazon's Elastic Beanstalk, and more.
39 Editorial and Business Contacts
More on DrDobbs.com

Boost Performance for Your Android Apps
The Android NDK is a toolset for building native components that make use of C and C++ in Android applications.
http://drdobbs.com/go-paral

Seeing the Light with Backtracking
Use the processing speed of modern hardware to explore different possible solutions to puzzles.
http://drdobbs.com/go-parallel/blogs/architecture-and-design/232300953

The Best of 2011
The most popular articles on drdobbs.com, plus some additional pieces picked for consideration by our staff.
http://drdobbs.com/232301271

Booting an Intel Architecture System: Early Initialization
The boot sequence today is very different from what it was even a decade ago. Here's a step-by-step walkthrough of the boot process.
http://drdobbs.com/parallel/

Two Different Kinds of Optimization
Experience with SPITBOL shows that there are at least two fundamentally different kinds of optimization, and that the usual advice applies only to one of those.
http://drdobbs.com/blogs/cp
The Need to Rewrite Established Algorithms

By Andrew Binstock

Parallel architectures, like other hardware advances before them, require us to rewrite algorithms and data structures, especially the old standbys that have served us well.
A central point of developer wisdom is to reuse code, especially data structures and collections. A few decades ago, it was common for C programmers to write innumerable implementations of linked lists from scratch. The code became almost a muscle memory as you banged it out. Today, such an exercise is more the result of ignoring established and well-tested options, rather than coding prowess. Except in exigent circumstances, writing your own collections has the whiff of cowboy programming.

It's safe to say that, for the most part, you should not be writing your own data structures or basic algorithms (sorts, checksums, encryption, calendars, etc.). However, this principle has a recurring exception that needs to be acknowledged; namely, that advances in hardware must find their way promptly into the implementations of common algorithms.
In a 1996 article on hashing efficiency that I wrote (http://drdobbs.com/database/184409859), I discussed the then-significant problem of memory latency on hash-table design. Basically, the concern was that every bucket that was not in cache created a significant performance obstacle as the processor waited for the long memory fetch. I suggested that on closed hash tables, nonlinear rehashing until an empty slot was found was a costly operation. Linear rehashing (moving to the closest empty slot) worked better. The problem of memory latency and small caches, in those days, made algorithm and data-structure selection a task best completed with care.
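To make the two strategies concrete, here is a minimal sketch (an illustration added here, not code from the 1996 article) of a closed hash table that resolves collisions by linear rehashing, scanning forward to the nearest empty slot so that probes tend to stay within already-fetched cache lines:

#include <cstddef>
#include <vector>

// Minimal closed (open-addressing) hash set of non-negative keys.
// On a collision, it probes linearly to the next empty slot, which
// usually stays within the cache lines already fetched, instead of
// jumping to a distant bucket as nonlinear rehashing would.
class IntSet {
    std::vector<long> slots_;                     // -1 marks an empty slot
public:
    explicit IntSet(std::size_t capacity) : slots_(capacity, -1) {}

    bool insert(long key) {
        std::size_t i = static_cast<std::size_t>(key) % slots_.size();
        for (std::size_t probes = 0; probes < slots_.size(); ++probes) {
            if (slots_[i] == -1) { slots_[i] = key; return true; }
            if (slots_[i] == key) return true;    // already present
            i = (i + 1) % slots_.size();          // linear step
        }
        return false;                             // table is full
    }
};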
The expansion of processor caches changed this calculus insofar as algorithms were concerned. Unless you have a comp-sci background, the terms cache-aware and cache-oblivious algorithms might be new to you. Implementers of the former tend to uncover the size of the cache on the host machine
and then size the data structures and algorithms to minimize memory fetches. Success in this can represent significant performance gains, at the cost of some portability. Some libraries, frequently those provided by processor vendors (such as Intel and AMD, in particular) or specialized development houses, provide these implementations. Intel's Integrated Performance Primitives library (http://drdobbs.com/go-parallel/blogs/cpp/232300486), for example, checks the runtime platform characteristics and brings in the right binaries for optimal performance.
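As a sketch of the cache-sizing idea (an example added here, assuming a Linux system where sysconf exposes cache geometry; the internals of vendor libraries are not public):

#include <unistd.h>
#include <cstddef>

// Ask the OS for the L1 data cache size, then pick a working-block
// size so that an algorithm's hot data fits in cache.
std::size_t block_elems_for_l1() {
    long bytes = sysconf(_SC_LEVEL1_DCACHE_SIZE);   // -1 if not available
    if (bytes <= 0) bytes = 32 * 1024;              // fall back to a common size
    return static_cast<std::size_t>(bytes) / sizeof(double) / 2;  // leave headroom
}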
For most applications, however, we're dependent on the standard libraries provided with the language. (Intel's IPP library, for example, comes only for native code. Java and .NET are supported only with wrappers.) Language providers eventually do deliver library updates, but the progress can be frustratingly slow and the work uneven. The delivery of Java's support for multithread-friendly collections is a case in point. Scala's multithreaded collections were a major draw because they came at a time when Java's collections did not work well enough.
Not only are better libraries needed, but even within standard libraries, the choice of data structures is becoming more complex. In this excellent article explaining why linked lists are passé (http://drdobbs.com/go-parallel/blogs/parallel/232400466), Dr. Dobb's blogger Clay Breshears discusses why trees make a better and more parallel-friendly data structure than the ever-sequential linked list. This is exactly the kind of nuance that should keep us vigilant against lazily accepting a static view of which algorithms and data structures to choose. Everyone knows, don't they, that
linked lists are faster than trees? And yet, even this mainstay of obvious logic is now changing beneath our feet.
The imminent era of manycore processors is likely to bring other
changes to the fore. I especially expect that sort routines will be
dramatically affected. Quicksort will no longer be the default sorting
algorithm. The choice of sort will be more carefully matched to the
needs of the data and the capabilities of the platform. We already
see this on a macro level in the new world of big data. Map-reduce
at scale depends upon sorts being done in smaller increments and
reassembled through a merge function. And even there, the basic
sorting has to be capable of handling billions of data items. In
which case, grabbing an early item and making it the pivot element
for millions of other entries (Quicksort) can have unfortunate consequences on performance.
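A sketch of the chunk-and-merge pattern described above (an illustration added here; real map-reduce frameworks distribute the chunks across nodes rather than sorting them in one address space):

#include <algorithm>
#include <cstddef>
#include <vector>

// Sort a large dataset in fixed-size chunks, then merge the sorted runs.
// Each chunk could be sorted on a different core or node before the
// merge phase reassembles them, avoiding one giant Quicksort partition.
void chunked_sort(std::vector<int>& data, std::size_t chunk) {
    for (std::size_t begin = 0; begin < data.size(); begin += chunk) {
        std::size_t end = std::min(begin + chunk, data.size());
        std::sort(data.begin() + begin, data.begin() + end);
    }
    for (std::size_t run = chunk; run < data.size(); run *= 2) {
        for (std::size_t begin = 0; begin + run < data.size(); begin += 2 * run) {
            std::size_t mid = begin + run;
            std::size_t end = std::min(begin + 2 * run, data.size());
            std::inplace_merge(data.begin() + begin, data.begin() + mid,
                               data.begin() + end);
        }
    }
}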
Between the proliferation of cores, the rapid expansion but not faster performance of RAM, and the huge increase in data volumes, the traditional choices of algorithms and data structures are no longer inherently safe or appropriate at all. Once again, select them with considerable care aforethought.

Andrew Binstock is Editor in Chief for Dr. Dobb's.
Welcome to the Jungle

By Herb Sutter

The free lunch is over. Welcome to the hardware jungle.

In the twilight of Moore's Law, the transitions to multicore processors, GPU computing, and hardware or infrastructure as a service (HaaS) cloud computing are not separate trends, but aspects of a single trend: mainstream computers from desktops to smartphones are being permanently transformed into heterogeneous supercomputer clusters. Henceforth, a single compute-intensive application will need to harness different kinds of cores, in immense numbers, to get its job done. The free lunch is over. Welcome to the hardware jungle.

From 1975 to 2005, our industry accomplished a phenomenal mission: In 30 years, we put a personal computer on every desk, in every home, and in every pocket.
In 2005, however, mainstream computing hit a wall. In "The Free Lunch Is Over (A Fundamental Turn Toward Concurrency in Software)" (http://is.gd/RHSOzm), I described the reasons for the then-upcoming industry transition from single-core to multicore CPUs in mainstream machines, why it would require changes throughout the software stack from operating systems to languages to tools, and why it would permanently affect the way we as software developers have to write our code if we want our applications to continue exploiting Moore's transistor dividend.
In 2005, our industry undertook a new mission: to put a personal parallel supercomputer on every desk, in every home, and in every pocket. 2011 was special: It's the year that we completed the transition to parallel computing in all mainstream form factors, with the arrival of multicore tablets (such as iPad 2, Playbook, Kindle Fire, and Nook Tablet) and smartphones (for example, Galaxy S II and iPhone 4S). 2012 will see the continued build-out of multicore, with mainstream quad- and eight-core tablets (as Windows 8 brings a modern tablet experience to x86 as well as ARM), and the last single-core gaming console holdout will go multicore (as Nintendo's Wii U replaces the Wii; http://is.gd/sBuPtr).

It took us just six years to deliver mainstream parallel computing in all popular form factors. And we know the transition is permanent, because multicore delivers compute performance that a single core cannot, and there will always be mainstream applications that run better on a multicore machine. There's no going back.

For the first time in the history of computing, mainstream hardware is no longer a single-processor von Neumann machine, and never will be again.

That was the first act.
Overview: Trifecta
It turns out that multicore is just the first of three related permanent transitions that layer on and amplify each other, as the timeline in Figure 1 illustrates.
1. Multicore (2005-). As explained previously.
2. Heterogeneous cores (2009-). A single computer already typically includes more than one kind of processor core, as mainstream notebooks, consoles, and tablets all increasingly have both CPUs and compute-capable GPUs. The open question in the industry today is not whether a single application will be spread across different kinds of cores, but only how different the cores should be. That is, should they be basically the same, with similar instruction sets but in a mix of a few big cores that are best at sequential code plus many smaller cores best at running parallel code (the Intel MIC (http://is.gd/I2iB09) model slated to arrive in 2012-2013, which is easier to program)? Or should they be cores with different capabilities that may only support subsets of general-purpose languages (the current Cell and GPGPU model, which requires more specialized code, including language extensions and subsets)?
Heterogeneity amplifies the first trend (multicore), because if some of the cores are smaller, then we can fit more of them on the same chip. Indeed, 100x and 1,000x parallelism is already available today on many mainstream home machines for programs that can harness it.

We know the transition to heterogeneous cores is permanent, because different kinds of computations naturally run faster and/or use less power on different kinds of cores, and different parts of the same application will run faster and/or cooler on a machine with several different kinds of cores.
3. Elastic compute cloud cores (2010-). For our purposes, cloud means specifically HaaS: delivering access to more computational hardware as an extension of the mainstream machine. This started to hit the mainstream with commercial compute cloud offerings from Amazon Web Services (AWS), Microsoft Azure, Google App Engine (GAE), and others.

Cloud HaaS again amplifies both of the first two trends, because it is fundamentally about deploying large numbers of nodes, where each node is a mainstream machine containing multiple and heterogeneous cores. In the cloud, the number of cores available to a single application is scaling fast. In mid-2011, Cycle Computing delivered a 30,000-core cloud for under $1,300/hour (http://is.) using AWS. The same heterogeneous cores are available in the cloud as well. For example, AWS already offers Cluster GPU nodes with dual NVIDIA Tesla M2050 GPU cards, enabling massively parallel and massively distributed CUDA applications.
Figure 1.
In short, parallelism is not just in full bloom, but increasingly in full
variety. In this article, I develop four key points:
1. Moore's End. We can observe clear evidence that Moore's Law is ending, because we can point to a pattern that precedes the end of exploiting any kind of resource. But there's no reason to panic, because Moore's Law limits only one kind of scaling, and we have already started another kind.
2. Mapping one trend, not three. Multicore, heterogeneous cores, and HaaS cloud computing are not three separate trends, but aspects of a single trend: putting a personal heterogeneous supercomputer cluster on every desk and in every pocket.

3. The effect on software development. As software developers, we will be expected to enable a single application to exploit a jungle of enormous numbers of cores that are increasingly different in kind (specialized for different tasks) and different in location (from local to very remote; on-die, in-box, on-premises, in-cloud). The jungle of heterogeneity will continue to spur deep and fast evolution of mainstream software development, but we can predict what some of the changes will be.

4. Three distinct near-term stages of Moore's End. And why smartphones aren't, really.

Let's begin with the end of Moore's Law.
Mining Moore's Law
We've been hearing breathless "Moore's Law is ending" announcements for years. That Moore's Law would end was never in question; every exponential progression must. Although it didn't end when some prognosticators expected, its end is possible to forecast: We just have to know what to look for, and that is diminishing returns.

A key observation is that exploiting Moore's Law is like mining a gold mine or any other kind of resource. Exploiting a resource never just stops abruptly; rather, running a mine goes through phases of increasing costs and diminishing returns until finally the gold left in that patch of ground is no longer commercially exploitable and operating the mine is no longer profitable.

Mining Moore's Law has followed the same pattern. It has gone through three major phases, where we are now in transition from Phase II to Phase III. And throughout this discussion, never forget that the only reason Moore's Law is interesting at all is because we can transform its raw resource (more transistors) into a useful form (either greater computational throughput or lower cost).

Phase I, Moore's Motherlode = Unicore Free Lunch (1975-2005)
When you first find an ore deposit and open a mine, you focus your efforts on the motherlode, where everybody gets a high yield and a low cost per pound of gold extracted.
For 30 years, mainstream processors mined Moore's motherlode by using their growing transistor budgets to make a single core more and more complex so that it could execute a single thread faster. This was wonderful because it meant the performance was easily exploitable: compute-bound software would get faster with relatively little effort. Mining this motherlode in mainstream microprocessors went through two main subphases as the pendulum swung from simpler to increasingly complex cores:

In the 1970s and 1980s, each chip generation could use most of the extra transistors to add One Big Feature (such as an on-die floating point unit, pipelining, or out-of-order execution) that would make single-threaded code run faster.

In the 1990s and 2000s, each chip generation started using the extra transistors to add or improve two or three smaller features that would make single-threaded code run faster, then four or six smaller features, and so on.
Figure 2 shows how the pendulum swung toward increasingly complex single cores, with three sample chips: the 80286, the 80486, and the Pentium Extreme Edition 840. Note the chips' growing numbers of transistors.

By 2005, the pendulum had swung about as far as it could go toward the complex single-core model. Although the motherlode is now mostly exhausted, we're still scraping ore off its walls: Expect some continued improvement in single-threaded performance, but no longer at the historically delightful exponential rate.
Phase II, Secondary Veins = Homogeneous Multicore (2005-)
As a motherlode gets used up, miners concentrate on secondary veins that are still profitable but have a more moderate yield and a higher cost per pound of extracted gold. So when Moore's unicore motherlode started getting mined out, the industry turned to mining Moore's secondary veins: using the additional transistors to make more cores per chip. Multicore let us continue to deliver exponentially increasing compute throughput in mainstream computers, but in a form that was less easily exploitable, because it placed a greater burden on software developers, who had to write parallel programs that could use the hardware.

Moving into Phase II took a lot of work in the software world. We had to learn to write new free-lunch applications: ones that have lots of latent parallelism and so can once again ride the wave, running the same executable faster on next year's hardware, hardware that still delivers exponential performance gains but primarily in the form of additional cores. Today, there are parallel runtimes like
Intel Threading Building Blocks (TBB) and Microsoft Parallel Patterns
Library (PPL), parallel debuggers and parallel profilers, and updated
operating systems to run them all.
But this time the phase didn't last 30 years. We barely have time to catch our breath, because Phase III is already beginning.
Phase III, Tertiary Veins = Heterogeneous Cores (2011-)
As our miners are forced to move into smaller and smaller veins, yields diminish and costs rise. The miners are turning to Moore's tertiary veins: using Moore's extra transistors to make not just more cores, but also different kinds of cores, and in very large numbers, because the different cores are often smaller and swing the pendulum back toward the left.

There are two main categories of heterogeneity; see Figure 3.
Big/fast vs. small/slow cores. The smallest amount of heterogeneity is when all the cores are general-purpose cores with the same instruction set, but some cores are beefier than others because they contain more hardware to accelerate execution (notably by hiding memory latency using various forms of internal concurrency). In this model, some cores are big complex ones that are optimized to run the sequential parts of a program really fast, while others are smaller cores optimized to get better total throughput for the scalably parallel parts of the program. However, even though they use the same instruction set, the compiler will often want to generate different code, and this difference can become visible to the programmer if the programming language must expose ways to control code generation. This is Intel's approach with Xeon (big/fast) and MIC (small/slow) cores that share approximately the x86 instruction set.

General vs. specialized cores. Beyond that, we see systems with multiple cores having different capabilities, including some that may not be able to support all of a mainstream language. In 2006-2007, with the arrival of the PlayStation 3, the Cell processor led the way by incorporating different kinds of cores on the same chip, with a single general-purpose core assisted by eight special-purpose SPU cores. Since 2009, we have begun to see mainstream use of GPUs to perform computation instead of just graphics. Specialized cores like SPUs and GPUs are attractive when they can run certain kinds of code more efficiently, both faster and more cheaply; that's a great bargain if your workload fits it.

GPGPU is especially interesting because we already have a large underutilized installed base: A significant percentage of existing mainstream machines already have compute-capable GPUs just waiting to be exploited. With the June 2011 introduction of AMD Fusion and
the November 2011 launch of NVIDIA Tegra 3, systems with CPU and GPU cores on the same chip are becoming a new norm. That installed base is a big carrot, and creates an enormous incentive for compute-intensive mainstream applications to leverage that patiently waiting hardware. To date, a few early adopters have been using technologies like CUDA, OpenCL, and more recently C++ AMP to harness GPUs for computation. Mainstream application developers who care about performance need to learn to do the same; see Table 1.
But that's pretty much it: We currently know of no other major ways to exploit Moore's Law for compute performance, and once these veins are exhausted, it will be largely mined out.

We're still actively mining for now, but the writing on the wall is clear: mene mene, the diminishing returns demonstrate that we've entered the endgame.
On The Charts: Not Three Trends, But One Trend
Next, let's put all of this in perspective by showing that multicore, hetero-core, and cloud-core are not three trends, but aspects of a single trend. To show that, we have to show that they can all be put on the same map. Figure 4 shows an appropriate map that lets us describe where processor core architectures are going, where memory architectures are going, and visualize just where we've been digging around in the mine so far.

First, I describe each axis, then map out past and current hardware to spot trends, and finally draw some conclusions about where hardware is likely to concentrate.
Processor Core Types
The vertical axis shows processor core architectures. As shown in Figure 5, from bottom to top, they form a continuum of increasing performance and scalability, but also of increasing restrictions on programs and programmers in the form of additional performance issues (yellow) or correctness issues (red) added at each step.

Complex cores are the big traditional ones, whose pendulum has swung far to the right in the habitable zone. These are best at running sequential code, including code limited by Amdahl's Law.

Simpler cores are the small traditional ones, to the left in the habitable zone. These are best at running parallelizable code that still requires the full expressivity of a mainstream programming language.

Specialized cores like those in GPUs, DSPs, and Cell's SPUs are more limited, and often do not yet fully support all features of mainstream languages (such as exception handling). These are best at running highly parallelizable code that can be expressed in a subset of a language like
C or C++. For example, Xbox Kinect skeletal tracking requires using both the CPU and the GPU cores on the console, and would be impossible otherwise. The farther you move upward on the chart (to the right in the blown-up figure), the better the performance throughput and/or the less power you need, but the more the application code is constrained, as it has to be more parallel and/or use only subsets of a mainstream language.
Future mainstream hardware will likely contain all three basic kinds of cores, because many applications have all these kinds of code in the same program, and so naturally will run best on a heterogeneous computer that has all these kinds of cores. For example, all Kinect games and all CUDA/OpenCL/C++ AMP applications available today could not run well, or at all, on a homogeneous machine, because they rely on running parts of the same application on CPU(s) and other parts on specialized cores. Those apps are just the beginning.

Memory Architectures
The horizontal axis shows six common memory architectures. From left to right, they form a continuum of increasing performance and scalability, but (except for one important discontinuity) also of increasing work for programs and programmers to deal with performance issues (yellow) or correctness issues (red). In Figure 6, the upper boxes represent cache and the lower boxes represent RAM. A processor core sits at the top of each cache peak.

Unified memory is tied to the unicore motherlode, where the memory hierarchy is wonderfully simple: a single mountain with one core sitting on top. This describes essentially all mainstream computers from the dawn of computing until the mid-2000s. This is the simplest programming model: Every pointer (or object reference) can address
every byte, and every byte is equally far away from the core. Even
here, programmers need to be conscious of at least two basic cache
effects: locality, or how well hot data fits into cache; and access order,
because modern memory architectures love sequential access patterns (for more on this, see my Machine Architecture talk at
http://is.gd/1Fe99o).
NUMA cache retains a single chunk of RAM, but adds multiple caches. Now instead of a single mountain, we have a mountain range with multiple peaks, each with a core on top. This describes today's mainstream multicore devices. Here, we still enjoy a single address space and pretty good performance, as long as different cores access different memory, but programmers now have to deal with two main additional performance effects:

Locality matters in new ways, because some peaks are closer to each other than others (two cores that share an L2 cache vs. two cores that share only L3 or RAM).

Layout matters, because we have to keep data physically close together if it's used together (on the same cache line), and apart if it's not (for example, to avoid the ping-pong game of false sharing; see the sketch after this list).
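A minimal illustration of the layout point (an example added here, not from the article): aligning two independently updated counters to separate cache lines prevents the false-sharing ping-pong.

#include <atomic>
#include <thread>

// Counters written by different threads. If they shared one cache line,
// each increment would invalidate the other core's copy of that line
// (false sharing). alignas(64) puts each on its own 64-byte line.
struct Counters {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

int main() {
    Counters c;
    std::thread t1([&c] { for (int i = 0; i < 1000000; ++i) c.a.fetch_add(1); });
    std::thread t2([&c] { for (int i = 0; i < 1000000; ++i) c.b.fetch_add(1); });
    t1.join();
    t2.join();
}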
NUMA RAM further fragments memory into multiple physical chunks of RAM, but still exposes a single logical address space. Now, the performance valleys between the cores get deeper, because accessing RAM in a chunk not local to this core incurs a trip across the bus. Examples include bladed servers, symmetric multiprocessor (SMP) desktop computers with multiple sockets, and newer GPU architectures that provide a unified address space view of the CPU's and GPU's memory, but leave some memory physically closer to the CPU and other memory physically closer to the GPU. Now we add another item to the list of things a performance-conscious programmer needs to think about: Just because we can form a pointer to anything doesn't mean we always should, if it means reaching across an expensive chasm.
Incoherent and weak memory makes memory be less synchronized, in the hope that allowing each core to have a more independent view of the state of memory can make them run faster, at least until the memory must inevitably be synchronized again. As of this writing, the only remaining mainstream CPUs with weak memory models are current PowerPC and ARM processors (popular despite their memory models rather than because of them; more on this below). This model still has the simplicity of a single address space, but now the programmer further has to take on the burden of synchronizing memory.
Disjoint (tightly coupled) memory bites the bullet and lets different cores see different memory, typically over a shared bus, while still running as a tightly coupled unit that has low latency and whose reliability is still evaluated as a single unit. Now the picture looks like a tightly clustered group of mountainous islands, each with its own mountains of cache overlooking square miles of RAM, connected by bridges with a fleet of trucks expediting scheduled point-to-point bulk transfer operations, message queues, and the like. In the mainstream, we see this model used by 2009-2011 era discrete GPUs, whose on-board memory is not shared with the CPU or with each other. True, programmers no longer enjoy having a single address space and the ability to share pointers. But in exchange, we have removed the entire set of programmer burdens accumulated so far and replaced them with a single new responsibility: copying data between islands of memory.
Disjoint (loosely coupled) is the cloud, where cores spread out-of-box into different rooms, buildings, and datacenters. This moves the islands farther apart, and replaces the bus bridges with network speedboats and tankers. In the mainstream, we see this model in HaaS cloud computing offerings; this is the commodity-hardware compute cluster. Programmers now have to arrange to deal with two additional concerns, which often can be abstracted away by libraries and runtimes: reliability, as nodes can come and go; and latency, as the islands are farther apart.
Charting the Hardware
All three trends are just aspects of a single trend: filling out the chart and enabling heterogeneous parallel computing. Figure 7 suggests that the chart wants to be filled out, because there are workloads that are naturally suited to each of these boxes, though some boxes are more popular than others.

To help visualize the filling-out process more concretely, let's check to see how mainstream hardware has progressed on the chart. The easiest place to start is the long-standing mainstream CPU and the more recent GPU:

From the 1970s to the 2000s, CPUs started with simple single cores and then moved downward as the pendulum swung to increasingly complex cores. They hugged the left of the chart by staying single-core as long as possible, but in 2005 they ran out of room and turned toward multicore NUMA cache architectures; see Figure 8.

Meanwhile, in the late 2000s, mainstream GPUs started to be capable of handling computational workloads. Because they started life in an add-on discrete GPU card format, where the graphics-specific cores and memory were physically separate from the CPU and system RAM, they started further to the right (Specialized / Disjoint (local)). GPUs have been migrating leftward to increasingly unified views of memory, and slightly
downward to try to support full mainstream languages (such as
adding exception handling support).
Today's typical mainstream computer includes both a CPU and a discrete or integrated GPU. The dotted line in the graphic denotes cores that are available to a single application because they are in the same device, but not on the same chip.
Now we are seeing a trend to use CPU and specialized (currently GPU)
cores with very tightly coupled memory, and even on the same die:
In 2005, the Xbox 360 sported a multicore CPU and GPU that could not only directly access the same RAM, but had the very unusual feature that they could share even L2 cache.
In 2006 and 2007, the Cell-based PS3 console sported a single
processor having both a single general-purpose core and eight
special-purpose SPU cores. The solid line in Figure 9 denotes
cores that are on the same chip, not just in the same device.
In June 2011 and November 2011, respectively, AMD and NVIDIA launched the Fusion and Tegra 3 architectures: single chips that sported a compute-class GPU (hence moved up vertically) on the same die (hence well to the left). Intel has also shipped the Sandy Bridge processor, which includes an integrated GPU that is not quite compute-capable. Intel's main focus has been the MIC effort of about 50 simple, general-purpose x86-like cores on the same die, expected to be commercially available in the near future.

Finally, we complete the picture with cloud HaaS; see Figure 10:

In 2008 and 2009, Amazon, Microsoft, Google, and others began rolling out their cloud compute offerings. AWS, Azure, and GAE support an elastic cloud of nodes, each of which is a traditional computer (big-core and loosely coupled, in
the bottom right corner of the chart) where each node in the
cloud has a single core or multiple CPU cores (the two lower-left
boxes). As before, the dotted line denotes that all of the cores are
available to a single application, and the network is just another
bus to more compute cores.
Since November 2010, AWS also supports compute instances that contain both CPU cores and GPU cores, indicated by the H-shaped virtual machine, where the application runs on a cloud of loosely coupled nodes with disjoint memory (right column), each of which contains both CPU and GPU cores (currently not on the same die, so the vertical lines are still dotted).
The Jungle
Putting it all together, we get a noisy profusion of life and color, as in Figure 11. This may look like a confused mess, so let's notice two things that help make sense of it.
First, every box has a workload that it's best at, but some boxes (particularly some columns) are more popular than others. Two columns are particularly less interesting:

Fully unified memory models are only applicable to the unicore era, which is being essentially abandoned in the mainstream.

Incoherent/weak memory models are a performance experiment that is in the process of failing in the marketplace. On the hardware side, the theoretical performance benefits of letting caches work less synchronously have been largely duplicated in other ways by mainstream processors having stronger memory models. On the software side, the mainstream general-purpose languages and environments (C, C++, Java, .NET) have largely rejected weak memory models, and require a coherent model that is technically called "sequential consistency for data race free programs" (http://is.gd/EmpCDn [PDF]) as either their only supported memory model (Java, .NET) or their default memory model (ISO C++11, ISO C11); the sketch after this list illustrates the guarantee. Nobody is moving toward the incoherent/weak memory strip of the chart; at best, some are moving through it to get to the other side, but nobody wants to stay there.
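A small C++11 illustration of the data-race-free contract (an example added here, not the article's code): as long as shared accesses are ordered through atomics, the program behaves as if sequentially consistent.

#include <atomic>
#include <cassert>
#include <thread>

// 'ready' orders the accesses to 'payload': the write to payload
// happens-before the atomic store, which synchronizes-with the atomic
// load, so the consumer is guaranteed to read 42, never 0.
std::atomic<bool> ready(false);
int payload = 0;   // plain int; race-free because 'ready' orders access

int main() {
    std::thread producer([] {
        payload = 42;
        ready.store(true);           // default ordering: seq_cst
    });
    std::thread consumer([] {
        while (!ready.load()) { }    // spin until published
        assert(payload == 42);
    });
    producer.join();
    consumer.join();
}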
But all other boxes, including all rows (processor types), are strongly represented, and we realize why that's true: Different parts of even the same application naturally want to run on different kinds of cores.

Second, let's clarify the picture by highlighting and labeling the regions that hardware is migrating toward in Figure 12.
In Figure 12, again we see the first and fourth columns being de-emphasized, as hardware trends have begun gradually coalescing around two major areas. Both areas extend vertically across all kinds of cores, and the most important thing to note is that these represent two mines, where the area to the left is the Moore's Law mine.

Mine #1: Scale-in = Moore's Law. Local machines will continue to use large numbers of heterogeneous local cores, either in-box (such as CPU with discrete GPU) or on-die (Sandy Bridge, Fusion, Tegra 3). We'll see core counts increase until Moore's Law ends, and then stabilize for individual local devices.
Mine #2: Scale-out = distributed cloud. Much more importantly, we will continue to see a cornucopia of cores delivered via compute clouds, either on-premises (cluster, private cloud) or in public clouds. This is a brand new mine, opened up by the lower coupling of disjoint memory, especially loosely coupled distributed nodes.

The good news is that we can heave a sigh of relief at having found another mine to open. The even better news is that the new mine has a far faster growth rate than even Moore's Law. Notice the slopes of the lines when we graph the amount of parallelism available to a single application running on various architectures; see Figure 13. The bottom three lines are mining Moore's Law for scale-in growth; their common slope reflects Moore's wonderful exponent, just shifted upward or downward to account for how many cores of a given size can be packed onto the same die. The top two lines are mining the cloud (with CPUs and GPUs, respectively) for scale-out growth.
If hardware designers merely use Moore's Law to deliver more big fat cores, on-device hardware parallelism will stay in double digits for the next decade, which is very roughly when Moore's Law is due to sputter, give or take about a half decade. If hardware follows Niagara's and MIC's lead and goes back to simpler cores, we'll see a one-time jump and then stay in triple digits. If we all learn to leverage GPUs, we already have 1,500-way parallelism in modern graphics cards (I'll say "cores" for convenience, though that word means something a little different on GPUs) and will likely reach five digits in the decade timeframe.
But all of that is eclipsed by the scalability of the cloud, whose growth line is already steeper than Moore's Law because we're better at quickly deploying and using cost-effective networked machines than we've been at quickly jam-packing and harnessing cost-effective transistors. It's hard to get data on the current largest cloud deployments because many projects are private, but the largest documented public cloud apps (which don't use GPUs) are already harnessing over 30,000 cores for a single computation. I wouldn't be surprised if some projects are exceeding 100,000 cores today. And that's general-purpose cores; if you add GPU-capable nodes to the mix, add two more zeroes.
Such massive parallelism, already available for rates of under $1,300/hour for a 30,000-core cloud, is game-changing. If you doubt that, here is a boring example that doesn't involve advanced augmented reality or spook-level technomancery: How long will it take someone who's stolen a strong password file (which we'll assume is correctly hashed and salted and contains no dictionary passwords) to retrieve 90% of the passwords by brute force using a publicly available GPU-enabled compute cloud? Hint: An AWS dual-Tesla node can test on the order of 20 billion passwords per second, and clouds of 30,000 nodes are publicly documented (of course, Amazon won't say if it has that many GPU-enabled nodes for hire; but if it doesn't today, it will soon). To borrow a tired misquote (http://is.gd/PJ...): 640 trillion affordable attempts per second should be enough for anybody. If that's not enough for you, not to worry; just wait a couple more years and it'll be 640 quadrillion affordable attempts per second.
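(For scale, the arithmetic is simple: 30,000 nodes at roughly 20 billion attempts per second each comes to about 6 x 10^14, some 600 trillion attempts per second, which is the order of magnitude behind that figure.)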
What It Means For Us: A Programmer's View
How will all of this change the way we write our software, if we care about harnessing mainstream hardware performance? The conclusions echo and expand upon ones that I proposed in "The Free Lunch Is Over":

Applications will need to be at least massively parallel, and ideally able to use non-local cores and heterogeneous cores, if they want to fully exploit the long-term continued exponential growth in compute throughput being delivered both in-box and in-cloud. After all, soon the vast majority of compute cores available to a mainstream application will be non-local.

Efficiency and performance optimization will get more, not less, important. We're being asked to deliver richer experiences, like sensor-based UIs and augmented reality, using less hardware (constrained mobile form factors, and the eventual plateauing of scale-in when Moore's Law ends). In December 2004 I wrote: "Those languages that already lend themselves to heavy optimization will find new life; those that don't will need to find ways to compete and become more efficient and optimizable. Expect long-term increased demand for performance-oriented languages and systems." This is happening: witness the resurgence of interest in C++,
primarily because of its expressive flexibility and performance efficiency. A program that is twice as efficient has two advantages:

It will be able to run twice as well on a local disconnected device, especially when Moore's Law can no longer deliver local performance improvements in any form;

It will always be able to run at half the power and cost on an elastic compute cloud, even as those continue to expand for the indefinite future.
Programming languages and systems will increasingly be forced to deal with heterogeneous distributed parallelism. As previously predicted, just basic homogeneous multicore has proved to be a far bigger event for languages than even object-oriented programming was, because some languages (notably C) could get away with ignoring objects while still remaining commercially relevant for mainstream software development. No mainstream language, including the just-ratified C11 standard, could ignore basic concurrency and parallelism and stay relevant in even a homogeneous-multicore world. Now expect all mainstream languages and environments, including their standard libraries, to develop explicit support for at least distributed parallelism and probably also heterogeneous parallelism; they cannot hope to avoid it without becoming marginalized for mainstream app development.
Expanding on that last bullet, what are some basic elements we will need to add to mainstream programming models (think: C, C++, Java, and .NET)? Here are a few basics I think will be unavoidable, that must be supported explicitly in one form or another.
Deal with the processor axis' lower section, by supporting compute cores with different performance (big/fast vs. slow/small). At minimum, mainstream operating systems and runtimes will need to be aware that some cores are faster than others, and know which parts of an application want to run on which of those cores.

Deal with the processor axis' upper section, by supporting language subsets, to allow for cores with different capabilities, including that not all fully support mainstream language features. In the next decade, a mainstream operating system (perhaps extended or augmented with an extra runtime like the Java/.NET VM or the ConcRT runtime underpinning PPL) will be capable of managing cores with different instruction sets and running a single application across many of those cores. Programming languages and tools will be extended to let the developer express code that is restricted to use just a subset of a mainstream programming language (as with the restrict() qualifiers in C++ AMP; a sketch follows after this list). I'm optimistic that for most mainstream languages a single language extension will be sufficient, while leveraging existing language rules for overloading and dispatch, and so minimizing the impact on developers.

Deal with the memory axis for computations, by providing distributed algorithms that can scale not just locally, but across a compute cloud. Libraries and runtimes like TBB and PPL will be extended or duplicated to enable parallel_for_each and other algorithms that run on large numbers of local and non-local parallel cores. Today, we can write a parallel_for_each call that can run with 1,000x parallelism on a set of local discrete GPUs and ship the right data shards to the right compute cards and the
results back; tomorrow, we need to be able to write that same call to run with 1,000,000,000x parallelism on a set of cloud-based GPUs and ship the right data shards to the right nodes and the results back. This is a baby-step example in that it just uses local data (that can fit in a single machine's memory), but distributed computation; the data subsets are simply copied hub-and-spoke.
Deal with the memory axis for data, by providing distributed data containers, which can be spread across many nodes. The next step is for the data itself to be larger than any node's memory, and (preferably automatically) move the right data subsets to the right nodes of a distributed computation. For example, we need containers like a distributed_array or distributed_table that can be backed by multiple and/or redundant cloud storage, and then make those the target of the same distributed parallel_for_each call. After all, why shouldn't we write a single parallel_for_each call that efficiently updates a 100-petabyte table? Hadoop (http://hadoop.apache.org/) enables this today for specific workloads and with extra work; this will become a standard capability available out-of-the-box in mainstream language compilers and their standard libraries.
Enable a unified programming model that can handle the entire chart with the same source code. Since we can map all of the hardware on a single chart with two degrees of freedom, the landscape is unified enough that it should be able to be served by a single programming model in the future. Any solution will have at least two basic characteristics: First, it will cover the Processor axis by letting the programmer express language subsets in a way integrated holistically into the language; second, it will cover or hide the Memory axis by abstracting the location of data, and copying data subsets on demand by default, while also providing a way to take control of the copying for the advanced users who want to optimize the performance of a specific computation.
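To make the restrict() idea mentioned above concrete, here is a minimal C++ AMP sketch (an illustration added here, not code from the article; C++ AMP requires a compiler that supports it, such as Visual C++ 2012). The restrict(amp) qualifier marks the lambda as limited to the language subset that the specialized cores can execute:

#include <amp.h>
#include <vector>

// Doubles each element on an accelerator such as a GPU. The compiler
// rejects anything inside the restrict(amp) lambda that the target
// cores cannot support (e.g., exceptions or virtual calls).
void double_all(std::vector<float>& data) {
    concurrency::array_view<float, 1> av(static_cast<int>(data.size()), data);
    concurrency::parallel_for_each(av.extent,
        [=](concurrency::index<1> i) restrict(amp) {
            av[i] *= 2.0f;
        });
    av.synchronize();   // copy results back to the host vector
}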
Perhaps our most difficult mental adjustment, however, will be to learn to think of the cloud as part of the mainstream machine: to view all these local and non-local cores as being equally part of the target machine that executes our application, where the network is just another bus that connects us to more cores. That is, in a few years we will write code for mainstream machines assuming that they have million-way parallelism, of which only thousand-way parallelism is guaranteed to always be available (when out of WiFi range).

Five years from now, we want to be delivering applications that run well on an isolated device, and then just run faster or better when they are in WiFi range and have dynamic access to many more cores. All of our operating systems, runtimes, libraries, programming languages, and tools need to get us to a place where we can write compute-bound applications that run well in isolation on disconnected devices with 1,000-way local parallelism, and when the device is connected,
just run faster, handle much larger data sets, and/or light up with additional capabilities. We have a very small taste of that now with cloud-based apps like Shazam (which function only when online), but we have a long way to go yet to realize this full vision.
Exit Moore, Pursued by a Dark Silicon Bear
Finally, let's return one more time to the end of Moore's Law to see what awaits us in our near future (Figure 14), and why we will likely pass through three distinct stages as we navigate Moore's End.

Eventually, the tired miners will reach the point where it's no longer economically feasible to operate the mine. There's still gold left, but it's no longer commercially exploitable. Recall that Moore's Law has been interesting only because of the ability to transform its raw resource of more transistors into one of two useful forms:

Exploit #1: Greater throughput. Moore's Law lets us deliver more transistors, and therefore more complex chips, at the same cost. That's what will let processors continue to deliver more
computational performance per chip, as long as we can find ways to harness the extra transistors for computation.

Exploit #2: Lower cost/power/size. Alternatively, Moore's Law enables delivery of the same number of transistors at a lower cost, including in a smaller area and at lower power. This is what will let us continue to deliver powerful experiences in increasingly compact, mobile, and embedded form factors.

The key thing to note is that we can expect these two ways of exploiting Moore's Law to end, not at the same time, but one after the other, and in that order.

Why? Because Exploit #2 only relies on the basic Moore's Law effect, whereas the first relies on Moore's Law and the ability to use all the transistors at the same time.

Which brings me to one last problem down in our mine.

The Power Problem: Dark Silicon
Sometimes you can be hard at work in a mine, still finding gold, when a small disaster happens: a cave-in, or striking water. Such events can render entire sections of the mine unreachable. We are now starting to hit exactly those kinds of problems.

One particular problem we have just begun to encounter is known as dark silicon. Although Moore's Law is still delivering more transistors, we are losing the ability to power them all at the same time. For details, see Jem Davies' talk "Compute Power With Energy-Efficiency" (http://is.gd/Lfl7iz [PDF]) and the ISCA'11 paper "Dark Silicon and the End of Multicore Scaling" (http://is.gd/GhGdz9 [PDF]).
This dark silicon effect is like a Shakespearean bear pursuing our doomed character offstage. Even though we can continue to pack more
cores on a chip, if we cannot use them at the same time, we have failed to exploit Moore's Law to deliver more computational throughput (Exploit #1). When we enter the phase where Moore's Law continues to give us more transistors per die area, but we are no longer able to power them all, we will find ourselves in a transitional period where Exploit #1 has ended while Exploit #2 continues and outlives it for a time.

This means that we will likely see the following major phases in the scale-in growth of mainstream machines. (Note that these apply to individual machines only, such as your personal notebook and smartphone or an individual compute node; they do not apply to a compute cloud, which we saw belongs to a different scale-out mine.)
Exploit #1 + Exploit #2: Increasing performance (compute throughput) in all form factors (1975 to mid-2010s?). For a few years yet, we will see continuing increases in mainstream computer performance in all form factors, from desktop to smartphone. As of today, the bigger form factors still have more parallelism, just as today's desktop CPUs and GPUs are routinely more capable than those in tablets and smartphones, as long as Exploit #1 lives. And then:

Exploit #2 only: Flat performance (compute throughput) at the top end, with mid and lower segments catching up (late 2010s to early 2020s?). Next, if problems like dark silicon are not solved, we will enter a period where mainstream computer performance levels out, starting at the top end with desktops and game consoles and working its way down through tablets and smartphones. During this period, we will continue to use Moore's Law to lower cost, power, and/or size, delivering the same complexity and performance already available in bigger form factors
also in smaller devices. Assuming Moore's Law continues long enough beyond the end of Exploit #1, we can estimate how long it will take for Exploit #2 to equalize personal devices by comparing the difference in transistor counts between today's mainstream desktop machines and smartphones: roughly a factor of 20, which will take Moore's Law about eight years to cover.

Democratization (early 2020s? onward). Finally, the democratization will reach the point where a desktop machine and a smartphone have roughly the same computational performance. In that case, why buy a desktop ever again? Just dock your tablet or smartphone. You might think that there will still be important differences between the desktop and the pocket device: power, because the desktop is plugged in, and peripherals, because the desktop has easier access to a bigger screen and a real keyboard/mouse. But once you dock the smartphone, it enjoys the same access to power and peripherals.

Speaking of Smartphones: Pocket Tablets and Docking
Note that the word "smartphone" is already a major misnomer, because a pocket device that can run apps is not primarily a phone at all. It's primarily a general-purpose personal computer that happens to have a couple of built-in radios for cell and WiFi connectivity. That makes the traditional cell phone capability just an app that happens to use the cell radio, and the Skype IP phone capability on the same device just another similar app that happens to use the WiFi radio instead.

The right way to think about even today's mobile devices is that there are not really tablets and smartphones; there are page-sized tablets and pocket-sized tablets, both already available with and
without cellular radios. That they run different operating systems today is just a point-in-time effect.

This is why those people who said an iPad is just a big iPhone without the cellular radio had it exactly backwards: The iPhone (3G or later, which allows apps) is a small iPad that fits in your pocket and happens to have a cellular radio in order to obsolete another pocket-sized device. Both devices are primarily tablets: They minimize hardware chrome and turn into the full-screen immersive app, and that's the closest thing you can get today to a morphing device that turns into a special-purpose device on demand. Many of us routinely use our phones mostly as a small tablet, spending most of our time on the device running apps to read books, browse news, watch movies, play games, update social networks, and surf the Internet. I already use my phone as a small tablet far more often than I use it as a phone, and if you have an app-capable phone, then I'll bet you already do that, too.
Well before the end of this decade, I expect the most likely dominant mainstream form factor to be page-sized and pocket-sized tablets, plus docking, where "docking" means any means of attaching peripherals like keyboards and big screens on demand; today that already encompasses physical docks and Bluetooth and Play To connections, and it will only continue to get more wireless and more seamless.

This future shouldn't be too hard to imagine, because many of us have already been working that way for a while now: For the past decade, I've routinely worked from my notebook as my primary and only environment. Usually, I'm in my home office or work office, where I use a real keyboard and big screens by docking the notebook and/or using it via a remote-desktop client, and when I'm mobile, I use it as a notebook. In 2012, I expect to replace my notebook with an x86-based modern tablet and use it exactly the same way. We've seen it play out many times:
Many of us used to carry around both a PalmPilot and a cell phone, but then the smartphone took over the job of the dedicated PalmPilot and eliminated a device with the same form factor.

Lots of kids (or their parents) carry a hand-held gaming device and a pocket tablet (aka smartphone), and we are seeing the decline of the dedicated hand-held gaming device as the pocket tablet takes over more and more of that job.

Similarly, today many of us carry around a notebook and a dedicated tablet, and convergence will again let us eliminate a device with the same form factor.

Computing loves convergence. In general-purpose personal computing (like notebooks and tablets, not special-purpose devices like microwaves and automobiles that may happen to contain processors), convergence always happily dooms special-purpose devices in the long run, as each device either evolves to take over the other's job or gets taken over. We will continue to have distinct pocket-sized tablets and page-sized tablets for a time, because they are different form factors with different mobile uses, but even those, once we find a way to unify the form factors (fold the screen?), can converge too.
Summary and Conclusions
Mainstream hardware is becoming permanently parallel, heterogeneous, and distributed. These changes are permanent, and so will permanently affect the way we have to write performance-intensive code on mainstream architectures.

The good news is that Moore's local scale-in transistor mine isn't empty yet. It appears the transistor bonanza will continue for about
another decade, give or take five years or so, which should be long enough to exploit the lower-cost side of the Law to get us to parity between desktops and pocket tablets. The bad news is that we can clearly observe the diminishing returns: As the transistors become decreasingly exploitable with each new generation of processors, software developers have to work harder, and the chips get more difficult to power. And with each new crank of the diminishing-returns wheel, there's less time for hardware and software designers to come up with ways to overcome the next hurdle; the motherlode free lunch lasted 30 years, but the homogeneous multicore era lasted only about six years, and we are now already overlapping the next two eras of hetero-core and cloud-core.
But all is well: When your mine is getting empty, you don't panic; you just open a new mine at a new motherlode. As usual, in this case the end of one dominant wave overlaps with the beginning of the next, and we are now early in the period of overlap where we are standing with a foot in each wave, a crew in each of Moore's mine and the cloud mine. Perhaps the best news of all is that the cloud wave is already scaling enormously quickly, faster than the Moore's Law wave that it complements and that it will outlive and replace.
If you haven't done so already, now is the time to take a hard look at the design of your applications, determine what existing features (or better still, what potential and currently unimaginable demanding new features) are CPU-sensitive now or are likely to become so soon, and identify how those places could benefit from local and distributed parallelism. Now is also the time for you and your team to grok the requirements, pitfalls, styles, and idioms of hetero-parallel (e.g., GPGPU) and cloud programming (e.g., Amazon Web Services, Microsoft Azure, Google App Engine).
To continue enjoying the free lunch of shipping an application that runs well on today's hardware and will just naturally run faster or better on tomorrow's hardware, you need to write an app with lots of latent parallelism expressed in a form that can be spread across a machine with a variable number of cores of different kinds: local and distributed cores, and big/small/specialized cores. The throughput gains do cost extra: extra development effort, extra code complexity, and extra testing effort. The good news is that for many classes of applications the extra effort will be worthwhile, because concurrency will let them fully exploit the exponential gains in compute throughput that will continue to grow strong and fast long after Moore's Law has gone into its sunny retirement, as we continue to mine the cloud for the rest of our careers.
Acknowledgments
I would like to particularly thank Jeffrey Barr, David Callahan, Olivier Giroux, Yossi Levanoni, Henry Moreton, and James Reinders, who graciously made themselves available to answer questions and provide background information, and who shared their insights on appropriately mapping their companies' products on the processor/memory chart.
Herb Sutter is a bestselling author and consultant on software development topics, and a software architect at Microsoft. A version of this article is posted at http://herbsutter.com/welcome-to-the-jungle/.
Efficient Use of Lambda Expressions and std::function
Functors and std::function implementations vary widely between libraries. C++11's lambdas make them more efficient.

By Cassio Neri
Functor classes (classes that implement operator()) are old friends to C++ programmers who, for many years, have used them as predicates for STL algorithms. Nevertheless, implementing simple functor classes is quite cumbersome, as the following example shows.
Suppose that v is an STL container of ints and we want to compute how many of its elements are multiples of a certain value n set at runtime. An STL way of doing this is:
std::count_if(v.begin(), v.end(), is_multiple_of(n));
where is_multiple_of is defined by:
class is_multiple_of {
public:
    typedef bool result_type;  // These typedefs are recommended
    typedef int argument_type; // but not strictly required.
                               // More on them later.
    is_multiple_of(int n) : n(n) {}
    bool operator()(int i) const { return i % n == 0; }
private:
    const int n;
};
Having to write all this code pushes many programmers to write their own loops instead of calling std::count_if. By doing so, they can miss good opportunities for compiler optimizations.
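For contrast, here is a minimal sketch (mine, not from the article) of the kind of hand-rolled loop the previous paragraph alludes to; it does the same job as the count_if call above:

#include <cstddef>
#include <vector>

// Hand-written counting loop: more code at every call site, and an
// ad-hoc loop the optimizer knows less about than a standard algorithm.
int count_multiples(std::vector<int> const& v, int n) {
    int count = 0;
    for (std::size_t i = 0; i != v.size(); ++i)
        if (v[i] % n == 0)
            ++count;
    return count;
}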
Lambda expressions (http://en.wikipedia.org/wiki/Anonymous_function#C.2B.2B) make creation of simple functor classes much easier. Although two of the Boost libraries (http://www.boost.org), Boost.Lambda and, more recently, Boost.Phoenix, provide very good implementations of lambda abstractions in C++03, to improve the language's expressiveness the standard committee decided to add language support for lambda expressions in C++11. Using this new feature, the previous example becomes:
std::count_if(v.begin(), v.end(), [n](int i){ return i%n == 0; });
Behind the scenes, the lambda expression [n](int i){ return i%n == 0; } forces the compiler to implement an unnamed functor class similar to is_multiple_of, with some obvious advantages:

1. It's much less verbose.
2. It doesn't introduce a new name just for a temporary use, resulting in less name pollution.
3. Frequently (not in this example, though) the name of the functor class is much less expressive than its actual code. Placing the code closer to where it's called improves code clarity.
The Closure Type
In the previous examples, our functor class was named is_multiple_of. Naturally, the functor class automatically implemented by the compiler has a different name. Only the compiler knows this type's name, and we can think of it as an unnamed type. For presentation purposes, it's called the closure type, whereas the temporary object resulting from the lambda expression is the closure object. The type anonymity is not an issue for std::count_if because it's a template function and, therefore, argument type deduction takes place.

Turning a function into a template is a way to make it accept lambda expressions as arguments. Consider, for instance, a situation where one implements a root-finder; i.e., a function that takes a function f and returns a double value x such that f(x) = 0. A first attempt might be a template function:

template <typename T>
double find_root(T const& f);
However, this might not be desirable due to a few classical template weaknesses: The code must be exposed in header files, compilation time increases, and template functions can't be virtual. Can find_root be a non-template function? If so, what would be its signature?

double find_root(??? const& f);
Argument type deduction for template functions already existed before C++11. Nevertheless, the new standard introduced the keywords auto and decltype to support type deduction elsewhere. (auto was a keyword in C++03, but with a different meaning.) If we want to give a name to a closure object, then we can follow this example:

auto f = [](double x){ return x * x - 0.5; };

Furthermore, an alias for the closure type, say function_t, can be created:

typedef decltype(f) function_t;

Unfortunately, function_t is set at the same scope as the lambda expression and, therefore, is invisible elsewhere. In particular, it cannot be used in find_root's signature.
Now, the other important actor of our play enters the stage:
std::function.
std::function and Its Costs
Another Boost option, std::function, implements a type-erasure mechanism that allows a uniform treatment of different functor types. Its predecessor, boost::function, dates back to 2001 and was introduced into TR1 in 2005 as std::tr1::function. Now, it's part of C++11 and has been promoted to namespace std.
We shall see a few details of three different implementations of std::function and related classes: Boost, the Microsoft C++ Standard Library (MSLIB for short), and the GNU Standard C++ Library (a.k.a. libstdc++, but referred to here as GCCLIB). Unless otherwise stated, we shall generically refer to the relevant library types and functions as if they belonged to namespace std, regardless of the fact that Boost's do not. I will cover two compilers: Microsoft Visual Studio 2010 (MSVC) and the GNU Compiler Collection 4.5.3 (GCC) using option -std=c++0x. I'll consider these compilers, compiling their corresponding aforementioned standard libraries, and also compiling Boost.
Using std::function, find_root's declaration becomes:

double find_root(std::function<double(double)> const& f);
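The article leaves find_root's body unspecified; purely as an illustration, here is a minimal bisection sketch against that interface. The bracket parameters lo and hi are my addition, and the sketch assumes f is continuous and changes sign on [lo, hi]:

#include <functional>

// Illustrative only (not the article's implementation): bisection,
// assuming f(lo) and f(hi) have opposite signs.
double find_root(std::function<double(double)> const& f,
                 double lo, double hi) {
    for (int i = 0; i < 60; ++i) {            // fixed iteration budget
        double mid = 0.5 * (lo + hi);
        if ((f(lo) < 0.0) == (f(mid) < 0.0))
            lo = mid;                          // root lies in [mid, hi]
        else
            hi = mid;                          // root lies in [lo, mid]
    }
    return 0.5 * (lo + hi);
}

// Usage: double r = find_root([](double x){ return x*x - 0.5; }, 0.0, 1.0);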
Generally, std::function<R(T1, ..., TN)> is a functor class that wraps any functor object that takes N arguments of types T1, ..., TN and returns a value convertible to R. It provides template conversion constructors that accept such functor objects. In particular, closure types are implicitly converted to std::function. There are two hidden and preventable costs at construction.
First, the constructor takes the functor object by value, so a copy is made. Furthermore, the constructor forwards the object to a series of helper functions, many of which also take it by value, making further copies. For instance, MSLIB and GCCLIB make multiple copies, and Boost makes seven. However, the large number of copies isn't the culprit for the biggest performance hit.
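These copy counts are easy to observe. The following instrumentation sketch is mine, not the article's; the number it prints depends on which library you compile against:

#include <functional>
#include <iostream>

// A functor that counts how many times it is copied. Declaring the copy
// constructor suppresses the implicit move constructor, so every
// transfer made inside std::function's constructor shows up here.
struct counting_functor {
    static int copies;
    counting_functor() {}
    counting_functor(counting_functor const&) { ++copies; }
    bool operator()(int i) const { return i % 2 == 0; }
};
int counting_functor::copies = 0;

int main() {
    std::function<bool(int)> f((counting_functor()));
    std::cout << "copies made: " << counting_functor::copies << '\n';
}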
The second issue is related to the functor's size. The implementations follow the standard's recommendation to apply a small-object optimization so as to avoid dynamic memory allocation: they use a data member to store a copy of the wrapped functor object. But because the object's size is known only at construction, this member might not be big enough to hold the copy. In this case, the copy is allocated on the heap through a call to new (unless a custom allocator is provided), and only a pointer to this copy is stored by the data member. The size beyond which the heap is used depends on the platform and alignment considerations. The best cases for common platforms are 16 bytes for GCCLIB and 24 bytes for Boost; MSLIB's threshold is smaller still.
Improving Performance of std::function
Clearly, to address these performance issues, copies should be avoided. The natural idea is working with references instead of copies. However, we all know that this is not generally possible, because you might want the std::function object to outlive the original functor.

This is an old issue, as STL algorithms also take functors by value. A good solution was implemented by Boost years ago and is now part of C++11 as well.

The template class std::reference_wrapper wraps a reference to an object and provides automatic conversion to this reference,
making the std::reference_wrapper usable in many circumstances where the wrapped type is expected. The size of std::reference_wrapper is the size of a reference and, thus, small. Additionally, there are two template functions, std::ref and std::cref, to ease the creation of non-const and const std::reference_wrappers, respectively. (They act like std::make_pair does to create std::pairs.)

Back to the first example: To avoid the multiple copies of is_multiple_of (which actually don't cost much, since this is a small class) we can use:

std::count_if(v.begin(), v.end(), std::cref(is_multiple_of(n)));

Applying the same idea to the lambda expression yields:

std::count_if(v.begin(), v.end(), std::cref([n](int i){ return i%n == 0; }));
Unfortunately, things get a bit more complicated and depend on the
compiler and library.
• Boost in both compilers (change std::cref to boost::cref): It doesn't work, because boost::reference_wrapper is not a functor.

• MSLIB: Currently, it doesn't work, but it should in the near future. Indeed, to handle types returned by functor objects, MSLIB uses std::result_of which, in TR1, depends on the functor type having a member result_type, a typedef to the type returned by operator(). Notice that is_multiple_of has this member type, but the closure type doesn't (as per C++11). In C++11, std::result_of has changed and is now based on decltype. We are in a transition period and MSLIB still follows TR1 (http://social.msdn.microsoft.com/Forums/en/vclanguage/thread/4e438675-eb1e-42ef-b1df-7ae262234695), but the next release of MSLIB (https://connect.microsoft.com/VisualStudio/feedback/details/618807) is supposed to follow C++11.

• GCCLIB: It works.
In addition, as per C++11, functor classes originating from lambda expressions are not adaptable: they don't contain the typedef members required by STL adaptors, and the following doesn't compile:

std::not1([n](int i){ return i%n == 0; });

In this case, std::not1 requires argument_type, and only is_multiple_of defines it.
The previous issue takes a slightly different form when std::function is involved. By definition, std::function wraps functors. Hence, when its constructor receives an object of type std::reference_wrapper<T>, it assumes that T is a functor type and behaves accordingly. For instance, the following lines are legal and work with GCCLIB, but not yet with MSLIB (though they should in the next release):

auto is_mult = [n](int i){ return i%n == 0; };
std::function<bool(int)> f(std::cref(is_mult));
std::count_if(v.begin(), v.end(), f);

It's worth mentioning that std::function's wrappers around ordinary functor classes are adaptable and can be given to STL adaptors (for example, std::not1).
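As a small illustration of that last point (my sketch, not the article's): wrapping the closure in std::function<bool(int)> supplies the argument_type member that std::not1 needs and the raw closure type lacks:

#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

std::ptrdiff_t count_non_multiples(std::vector<int> const& v, int n) {
    std::function<bool(int)> g = [n](int i){ return i % n == 0; };
    // Legal because std::function<bool(int)> defines argument_type;
    // calling std::not1 directly on the closure would not compile.
    return std::count_if(v.begin(), v.end(), std::not1(g));
}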
I'm led to conclude that if you don't want heap storage (and custom allocators), then Boost and GCCLIB are good options. If you are aiming for portability, then you should use Boost.

For developers using MSLIB, the performance issue remains unsolved until the next release. For those who can't wait, here is a workaround that turns out to be portable (it works with GCC and MSVC).
The idea is obvious: Keep the closure type small. This size depends on the variables that are captured by the lambda expression (that is, that appear inside the square brackets []). For instance, the lambda expression previously seen,

[n](int i){ return i%n == 0; };

captures the variable int n and, for this reason, the closure type has an int data member holding a copy of n. The more identifiers we put inside [], the bigger the size of the closure type gets. If the aggregated size of all identifiers inside [] is small enough (for example, one int or one double), then the heap is not used.
One way to keep the size small is by creating a struct enclosing references to all identifiers that normally would go inside [], and putting only a reference to this struct inside []. You use the struct members in the body of the lambda expression. For instance, the following lambda expression

double a;
double b;
// ...
[a, b](double x){ return a * x + b; };

yields a closure type with at least 2 * sizeof(double) bytes, which is enough for MSLIB to use the heap. The alternative is:

double a;
double b;
// ...
struct {
    const double& a;
    const double& b;
} p = { a, b };
[&p](double x){ return p.a * x + p.b; };
In this way, only a reference to p is captured, which is small enough for MSLIB, GCCLIB, and Boost to avoid the heap.
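To see the effect, you can print the sizes of the two closure types. This quick check is mine, not the article's, and the exact numbers are implementation-defined:

#include <iostream>

int main() {
    double a = 2.0, b = 3.0;

    auto big_closure = [a, b](double x){ return a * x + b; };   // two doubles copied
    struct { const double& ra; const double& rb; } p = { a, b };
    auto small_closure = [&p](double x){ return p.ra * x + p.rb; }; // one reference

    // Typically prints 16 and 8 on a 64-bit platform.
    std::cout << sizeof(big_closure) << ' ' << sizeof(small_closure) << '\n';
}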
A final word on the letter of the law: The standard allows the closure type to have different sizes and alignments. Therefore, in theory, the aforementioned workaround might not work. More precisely, the code remains legal, but the heap might be used if the closure type is big enough. However, neither MSVC nor GCC does this.
Acknowledgment
I would like to thank Lorenz Schneider and Victor for their comments and careful reading of this article.
Cassio Neri has a Ph.D. in Mathematics. He works in the FX Quantitative Research team at Lloyds Banking Group in London.
From the Vault

8 Simple Rules for Designing Threaded Applications

This entry from Dr. Dobb's in 2008 offers rules that still hold true for creating efficient threaded implementations of your applications.

By Clay Breshears

The Threading Methodology used at Intel has four major steps: Analysis, Design & Implementation, Debugging, and Performance Tuning. These steps are used to create a multithreaded application from a serial base code. While the use of software tools for the first, third, and fourth steps is well documented, there hasn't been much written about how to do the Design & Implementation part of the process.
There are plenty of books published on parallel algorithms and computation. However, these tend to focus on message-passing, distributed-memory systems, or theoretical parallel models of computation that may or may not have much in common with realized multicore platforms. If you're going to be engaged in threaded programming, it can be helpful to know how to program or design algorithms for these models. Of course, these models are fairly limited, and many software developers will not have had the opportunity to be exposed to problems that need such specialized programming.

Multithreaded programming is still more art than science. This article gives eight simple rules that you can add to your palette of threading design methods. By following these rules, you will have more success in writing the best and most-efficient threaded implementation of your applications.

Rule 1. Be sure you identify truly independent computations.

You can't execute anything concurrently unless the operations that
would be executed in parallel can be run independently of each other. We can easily think of different real-world instances of independent actions being performed in order to satisfy a single goal. For example, building a house can involve many different workers with different skills: carpenters, electricians, glaziers, plumbers, roofers, painters, masons, lawn wranglers, etc. There are some obvious scheduling dependencies between pairs of workers (can't put on roof shingles before walls are built, can't paint the walls until the drywall is installed), but for the most part, the people involved in building a house can work independently of each other.
Another real-world example would be a DVD rental warehouse. Orders for movies are collected and then distributed to the workers, who go out to where all the discs are stored and find copies to satisfy their assigned orders. Pulling out My Fair Lady by one worker does not interfere with another worker who is looking for The Terminator, nor will it interfere with a worker trying to locate episodes from the second season of Seinfeld. (We can assume that any conflicts that would result from unavailable inventory have been dealt with before orders are transmitted to the warehouse.) Also, the packaging and mailing of each order will not interfere with disc searches or the shipping and handling of any other order.
There are cases where you will have exclusively sequential computations that cannot be made concurrent; many of these will be dependencies between loop iterations or steps that must be carried out in a specific order. An example of the latter is a pregnant reindeer. The normal gestation period is about eight consecutive months, so you can't get a calf by putting eight cows on the job for one month. However, if Santa wanted to field a whole new sled team as soon as possible, he could have eight cows carrying his future team all at the same time.
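A minimal sketch of Rule 1 in code (my example, using C++11's std::thread, which postdates the original article): each warehouse order below runs on its own thread, which is safe only because the orders share no mutable state:

#include <thread>
#include <vector>

// Stand-in for one order's work (locate discs, package, ship).
// It reads and writes no shared state, so orders are truly independent.
void fulfill_order(int order_id) { /* ... */ }

int main() {
    std::vector<std::thread> workers;
    for (int id = 0; id < 8; ++id)
        workers.emplace_back(fulfill_order, id);  // one thread per order
    for (std::thread& w : workers)
        w.join();
}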
Rule 2. Implement concurrency at the highest level possible.

There are two directions that you can use when approaching where to thread a serial code. These are bottom-up and top-down. In the Analysis phase of the Threading Methodology, you identify the segments of your code that take the most execution time. If you are able to run those code portions in parallel, you will have the best chance at achieving the maximum performance possible.

In a bottom-up approach, you would attempt to thread the hotspots in your code. If this is not possible, you can search up the call stack of the application to determine if there is another place in the code that can be run in parallel and still executes the hotspot code. For example, if you have a picture compression application, you can divide the processing of the picture into separate, independent regions to be processed in parallel. Even if it is possible to employ concurrency within the hotspot code, you should still look to see if it would be possible to implement that concurrency at a point in the code higher up in the call stack. This can increase the granularity of the work done by each thread.
With the top-down approach, you first consider the whole application, what the computation is coded to accomplish and, at an abstract level, all the parts of the app that combine to realize that computation. If there is no obvious concurrency, you should distill the parts of the computation into successively smaller parts until you can identify independent computations. Results from the Analysis phase can
guide your investigation to include the most time-consuming modules. Consider threading a video encoding application. You can start at the lowest level of independent pixels within a single frame, or realize that groups of frames can be processed independently of other groups. If the video encoding app is expected to process multiple videos, expressing your parallelism at that level may be easier to write and will be at the highest level of possible concurrency.
The granularity of concurrent computations is loosely defined as the amount of computation done before synchronization is needed. The longer the time between synchronizations, the coarser the granularity will be. Fine-grained parallelism runs the danger of not having enough work assigned to threads to overcome the overhead costs of using threads. Adding more threads, when the amount of computation doesn't change, only exacerbates the problem. Coarse-grained parallelism has lower overhead costs and also tends to be more readily scalable to an increase in the number of threads. Top-down approaches to threading (or driving the point of threading as high in the call stack as possible) are the best options to achieve a coarse-grained solution.
Rule 3. Plan early for scalability to take advantage of increasing numbers of cores.

Processors have recently gone from being dual-core to quad-core, and Intel has announced the 80-core Teraflop research chip. The number of cores available in future processors will only increase. Thus, you should plan for such processor increases within your software. Scalability is the measure of an application's ability to handle changes, typically increases, in system resources (number of cores, memory size, bus speed) or data set sizes. In the face of more available cores, write applications that can take advantage of different numbers of cores.
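One concrete way to follow Rule 3 (a sketch of mine, using a C++11 facility that postdates the original article) is to size the worker pool from the core count discovered at runtime rather than a hard-coded constant:

#include <thread>

// Ask the runtime how many hardware threads are available. The call is
// allowed to return 0 when the count is unknown, so fall back to a
// conservative default in that case.
unsigned worker_count() {
    unsigned n = std::thread::hardware_concurrency();
    return n != 0 ? n : 4;
}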