Dr. Dobb's Journal Digital Issue, February 2012


Dr. Dobb's Journal
February 2012

ALSO INSIDE

The Need to Rewrite Established Algorithms >>
Efficient Use of Lambda Expressions and std::function >>
From the Vault: 8 Simple Rules for Designing Threaded Applications >>

Welcome to the Jungle

Parallel Programming

Herb Sutter told you when the free lunch was over; now he leads you through the parallel hardware jungle.


CONTENTS

COVER STORY

Welcome to the Jungle
By Herb Sutter
The transitions to multicore processors, GPU computing, and HaaS cloud computing are not separate trends, but aspects of a single trend: mainstream computers from desktops to smartphones are being permanently transformed into heterogeneous supercomputer clusters. Henceforth, a single compute-intensive application will need to harness different kinds of cores, in immense numbers, to get its job done. The free lunch is over. Welcome to the hardware jungle.

Efficient Use of Lambda Expressions and std::function
By Cassio Neri
Functors and std::function implementations vary widely between libraries. C++11's lambdas make them more efficient.

Editorial: The Need to Rewrite Established Algorithms
By Andrew Binstock
Parallel architectures, like other hardware advances before them, require us to rewrite algorithms and data structures, especially the old standbys that have served us well.

From the Vault: 8 Simple Rules for Designing Threaded Applications
By Clay Breshears
Multithreaded programming is still more art than science. This article gives eight simple rules that you can add to your palette of threading design methods. By following these rules, you will have more success in writing the best and most-efficient threaded implementation of your applications.

Links
Snapshots of the most interesting items on drdobbs.com, including cross-platform development with Eclipse CDT, deployment with Amazon's Elastic Beanstalk, and more.

Editorial and Business Contacts


More on DrDobbs.com

Boost Performance for Y…
The Android NDK is a tool… components that make use of … applications.
http://drdobbs.com/go-paral…

Seeing the Light with Ba…
Use the processing speed … to explore different possible … puzzles.
http://drdobbs.com/go-pa…design/232300953

The Best of 2011
The most popular articles o… plus some additional pieces p… for consideration by our staff.
http://drdobbs.com/232301271

Booting an Intel Architecture System: Early Initialization
The boot sequence today is very different than it was even a decade ago. Here's a step-by-step walkthrough of the boot process.
http://drdobbs.com/parallel/…

Two Different Kinds of Optimization
Experience with SPITBOL suggests that there are at least two fundamentally different kinds of optimization, and that the advice … applies only to one of those.
http://drdobbs.com/blogs/cp…


The Need to Rewrite Established Algorithms

By Andrew Binstock

Parallel architectures, like other hardware advances before them, require us to rewrite algorithms and data structures, especially the old standbys that have served us well.

A central point of developer wisdom is to reuse code, especially data structures and collections. A few decades ago, it was common for C programmers to write innumerable implementations of linked lists from scratch. The code became almost a muscle memory as you banged it out. Today, such an exercise is more the result of ignoring established and well-tested options than of coding prowess. Except in exigent circumstances, writing your own collections has the whiff of cowboy programming.

It's safe to say that, for the most part, you should not be writing your own data structures or basic algorithms (sorts, checksums, encryption, calendars, etc.). However, this principle has a recurring exception that needs to be acknowledged; namely, that advances in hardware must find their way promptly into the implementations of common algorithms.

In a 1996 article on hashing efficiency that I wrote (http://drdobbs.com/database/184409859), I discussed the then-significant problem of memory latency on hash-table design. Basically, the concern was that every bucket that was not in cache created a significant stall as the processor waited for the long memory fetch. I suggested that on closed hash tables, nonlinear rehashing until an empty slot was found was a costly operation. Linear rehashing (moving to the closest empty slot) worked better. The problem of memory latency and small caches, in those days, made algorithm and data-structure selection a task best completed with care.
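To make the linear-rehashing idea concrete, here is a minimal sketch of a closed (open-addressing) hash table in C++; it is my illustration of the general technique, not code from the 1996 article. On a collision, the probe simply steps to the adjacent slot, so successive probes walk sequentially through memory and tend to stay within already-fetched cache lines:

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    // Closed hash table with linear rehashing: on a collision, step to
    // the next slot, so probes touch adjacent memory (cache-friendly).
    class LinearProbeTable {
        struct Slot { bool used; std::string key; int value; };
        std::vector<Slot> slots_;
    public:
        explicit LinearProbeTable(std::size_t capacity)
            : slots_(capacity, Slot{false, std::string(), 0}) {}

        bool insert(const std::string& key, int value) {
            std::size_t h = std::hash<std::string>()(key) % slots_.size();
            for (std::size_t n = 0; n < slots_.size(); ++n) {
                Slot& s = slots_[(h + n) % slots_.size()];  // linear rehash
                if (!s.used || s.key == key) {
                    s.used = true; s.key = key; s.value = value;
                    return true;
                }
            }
            return false;  // table is full
        }

        const int* find(const std::string& key) const {
            std::size_t h = std::hash<std::string>()(key) % slots_.size();
            for (std::size_t n = 0; n < slots_.size(); ++n) {
                const Slot& s = slots_[(h + n) % slots_.size()];
                if (!s.used) return nullptr;        // empty slot: absent
                if (s.key == key) return &s.value;  // found
            }
            return nullptr;  // scanned the whole table
        }
    };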

The expansion of processor caches changed that calculus insofar as algorithms were concerned. Unless you have a comp-sci background, the terms "cache-aware" and "cache-oblivious" algorithms might be new to you. Cache-aware implementations tend to uncover the size of the cache on the target machine and then size the data structures and algorithms to minimize memory fetches.


Success in this can represent significant performance gains, at the cost of some portability. Some libraries, frequently those provided by processor vendors (such as Intel and AMD, in particular) or specialized development houses, provide these implementations. Intel's Integrated Performance Primitives library (http://drdobbs.com/go-parallel/blogs/cpp/232300486), for example, checks the runtime platform characteristics and brings in the right binaries for optimal performance.
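As a rough illustration of what sizing an algorithm to the cache means in practice (my sketch, not IPP code; the block size here is an assumed tuning constant rather than a detected one), this matrix transpose works in tiles small enough that the source and destination tiles stay cached for the whole inner loop:

    #include <cstddef>
    #include <vector>

    // Cache-aware transpose of an n x n row-major matrix: work in
    // BLOCK x BLOCK tiles so each pair of tiles fits comfortably in
    // cache (32 x 32 doubles = 8 KB per tile; a cache-aware library
    // would pick this from the detected cache size).
    const std::size_t BLOCK = 32;

    void transpose(const std::vector<double>& src,
                   std::vector<double>& dst, std::size_t n) {
        for (std::size_t ib = 0; ib < n; ib += BLOCK)
            for (std::size_t jb = 0; jb < n; jb += BLOCK)
                // Both tiles stay hot in cache during this inner pass.
                for (std::size_t i = ib; i < ib + BLOCK && i < n; ++i)
                    for (std::size_t j = jb; j < jb + BLOCK && j < n; ++j)
                        dst[j * n + i] = src[i * n + j];
    }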

For most applications, however, we're dependent on the standard libraries provided with the language. (Intel's IPP library, for example, comes only for native code. Java and .NET are supported only with wrappers.) Language providers eventually do deliver library updates, but the progress can be frustratingly slow and the work uneven. The delivery of Java's support for multithread-friendly collections is a case in point. Scala's multithreaded collections were a major draw because they came at a time when Java's collections did not work well enough.

Not only are better libraries needed, but even within standard libraries, the choice of data structures is becoming more complex. In an excellent article explaining why linked lists are passé (http://drdobbs.com/go-parallel/blogs/parallel/232400466), Dr. Dobb's blogger Clay Breshears discusses why trees make a better and more parallel-friendly data structure than the ever-sequential linked list. This is exactly the kind of nuance that should keep us vigilant against lazily accepting a static view of which algorithms and data structures to choose. Everyone knows, don't they, that linked lists are faster than trees?


And yet, even this mainstay of obvious logic is now changing beneath our feet.

The imminent era of manycore processors is likely to bring other changes to the fore. I especially expect that sort routines will be dramatically affected. Quicksort will no longer be the default sorting algorithm. The choice of sort will be more carefully matched to the needs of the data and the capabilities of the platform. We already see this on a macro level in the new world of big data. Map-reduce at scale depends upon sorts being done in smaller increments and reassembled through a merge function. And even there, the basic sorting has to be capable of handling billions of data items. In which case, grabbing an early item and making it the pivot element for millions of other entries (Quicksort) can have unfortunate consequences on performance.
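The sort-in-increments-then-merge shape described here is easy to sketch with C++11's standard threads and algorithms (my illustration, not code from the editorial): each worker sorts its own slice, and a merge pass reassembles the whole.

    #include <algorithm>
    #include <thread>
    #include <vector>

    // Sort two halves independently (one on a second thread), then
    // merge them -- the smaller-increments-plus-merge pattern used at
    // far larger scale by map-reduce systems.
    void two_way_parallel_sort(std::vector<int>& data) {
        std::vector<int>::iterator mid = data.begin() + data.size() / 2;
        std::thread worker([&data, mid] { std::sort(data.begin(), mid); });
        std::sort(mid, data.end());  // second half on the calling thread
        worker.join();
        std::inplace_merge(data.begin(), mid, data.end());  // reassemble
    }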

Between the proliferation of cores, the rapid expansion and faster performance of RAM, and the huge increase in data volumes, traditional choices of algorithms and data structures are no longer inherently safe or appropriate at all. Once again, they must be selected with considerable care and forethought.

Andrew Binstock is Editor in Chief for Dr. Dobb's and can be contacted at [email protected].


Welcome to the Jungle

By Herb Sutter

The free lunch is over. Welcome to the hardware jungle.

In the twilight of Moore's Law, the transitions to multicore processors, GPU computing, and hardware or infrastructure as a service (HaaS) cloud computing are not separate trends, but aspects of a single trend: mainstream computers from desktops to smartphones are being permanently transformed into heterogeneous supercomputer clusters. Henceforth, a single compute-intensive application will need to harness different kinds of cores, in immense numbers, to get its job done. The free lunch is over. Welcome to the hardware jungle.

From 1975 to 2005, our industry accomplished a phenomenal mission: In 30 years, we put a personal computer on every desk, in every home, and in every pocket.

In 2005, however, mainstream computing hit a wall. In "The Free Lunch Is Over (A Fundamental Turn Toward Concurrency in Software)" (http://is.gd/RHSOzm), I described the reasons for the then-upcoming industry transition from single-core to multicore CPUs in mainstream machines, why it would require changes throughout the software stack from operating systems to languages to tools, and why it would permanently affect the way we as software developers have to write our code if we want our applications to continue exploiting Moore's transistor dividend.

In 2005, our industry undertook a new mission: Put a personal parallel supercomputer on every desk, in every home, and in every pocket. 2011 was special: It's the year that we completed the transition to parallel computing in all mainstream form factors, with the arrival of multicore tablets (such as iPad 2, Playbook, Kindle Fire, and Nook Tablet) and smartphones (for example, Galaxy S II, Droid X2, and iPhone 4S). 2012 will see the continued build out of multicore with mainstream quad- and eight-core tablets (as Windows 8 brings a modern tablet experience to x86 as well as ARM), and the last single-core gaming console holdout will go multicore (as Nintendo's Wii U replaces the Wii; http://is.gd/sBuPtr).

It took us just six years to deliver mainstream parallel computing in all popular form factors. And we know the transition is permanent, because multicore delivers compute performance that a single core cannot, and there will always be mainstream applications that run better on a multicore machine. There's no going back.

For the first time in the history of computing, mainstream hardware is no longer a single-processor von Neumann machine, and it will never be again.

That was the first act.


Overview: Trifecta

It turns out that multicore is just the first of three related permanent transitions that layer on and amplify each other, as the timeline in Figure 1 illustrates.

1. Multicore (2005-). As explained previously.

2. Heterogeneous cores (2009-). A single computer already typically includes more than one kind of processor core, as mainstream notebooks, consoles, and tablets all increasingly have both CPUs and compute-capable GPUs. The open question in the industry today is not whether a single application will be spread across different kinds of cores, but only how different the cores should be. That is, whether they should be basically the same, with similar instruction sets but in a mix of a few big cores that are best at sequential code plus many smaller cores best at running parallel code (the Intel MIC (http://is.gd/I2iB09) model slated to arrive in 2012-2013, which is easier to program); or cores with different capabilities that may only support subsets of general-purpose languages (the current Cell and GPGPU model, which requires more effort, including language extensions and subsets).

Heterogeneity amplifies the first trend (multicore): If some of the cores are smaller, then we can fit more of them on the same chip. Indeed, 100x and 1,000x parallelism is already available on today's mainstream home machines for programs that can harness it.

We know the transition to heterogeneous cores is permanent, because different kinds of computations naturally run faster and/or use less power on different kinds of cores, and different parts of the same application will run faster and/or cooler on a machine with several different kinds of cores.

3. Elastic compute cloud cores (2010-). For our purposes, "cloud" means specifically HaaS: delivering access to more computational hardware as an extension of the mainstream machine. This started to hit the mainstream with commercial compute cloud offerings from Amazon Web Services (AWS), Microsoft Azure, Google App Engine (GAE), and others.

Cloud HaaS again amplifies both of the first two trends, because it is fundamentally about deploying large numbers of nodes, where each node is a mainstream machine containing multiple and heterogeneous cores. In the cloud, the number of cores available to a single application is scaling fast. In mid-2011, Cycle Computing delivered a 30,000-core cloud for under $1,300/hour (http://is.gd/…) using AWS. The same heterogeneous cores are available in the cloud: For example, AWS already offers Cluster GPU nodes with dual NVIDIA Tesla M2050 GPU cards, enabling massively parallel and distributed CUDA applications.


In short, parallelism is not just in full bloom, but increasingly in full variety. In this article, I develop four key points:

1. Moore's End. We can observe clear evidence that Moore's Law is ending, because we can point to a pattern that precedes the end of exploiting any kind of resource. But there's no reason to panic, because Moore's Law limits only one kind of scaling, and we have already started another kind.

2. Mapping one trend, not three. Multicore, heterogeneous cores, and HaaS cloud computing are not three separate trends, but aspects of a single trend: putting a personal heterogeneous supercomputer cluster on every desk and in every pocket.

3. The effect on software development. As software developers, we will be expected to enable a single application to exploit a jungle of enormous numbers of cores that are increasingly different in kind (specialized for different tasks) and different in location (from local to very remote; on-die, in-box, on-premises, in-cloud). The jungle of heterogeneity will continue to spur deep and fast evolution of mainstream software development, but we can predict what some of the changes will be.

4. Three distinct near-term stages of Moore's End. And why smartphones aren't, really.

Let's begin with the end of Moore's Law.

Mining Moore's Law

We've been hearing breathless "Moore's Law is ending" announcements for years. That Moore's Law would end was never news; every exponential progression must. Although it didn't end when some prognosticators expected, its end is possible to forecast; you just have to know what to look for, and that is diminishing returns.

A key observation is that exploiting Moore's Law is like mining a gold mine or any other kind of resource. Exploiting a resource never just stops abruptly; rather, running a mine goes through phases of increasing costs and diminishing returns until finally the gold left in that patch of ground is no longer commercially exploitable and operating the mine is no longer profitable.

Mining Moore's Law has followed the same pattern. It has gone through three major phases, where we are now in transition from Phase II to Phase III. And throughout this discussion, never forget that the only reason Moore's Law is interesting at all is because we can transform its raw resource (more transistors) into a useful form (greater computational throughput or lower cost).

Phase I, Moore's Motherlode = Unicore Free Lunch (1975-2005)

When you first find an ore deposit and open a mine, you concentrate your efforts on the motherlode, where everybody gets to enjoy a high yield and a low cost per pound of gold extracted.


For 30 years, mainstream processors mined Moore's motherlode by using their growing transistor budgets to make a single core more and more complex so that it could execute a single thread faster. This was wonderful because it meant the performance was easily exploitable: compute-bound software would get faster with relatively little effort. Mining this motherlode in mainstream microprocessors went through two main subphases as the pendulum swung from simpler to increasingly complex cores:

In the 1970s and 1980s, each chip generation could use most of the extra transistors to add One Big Feature (such as an on-die floating point unit, pipelining, out of order execution) that would make single-threaded code run faster.

In the 1990s and 2000s, each chip generation started using the extra transistors to add or improve two or three smaller features that would make single-threaded code run faster, then four or five or six smaller features, and so on.

Figure 2 shows how the pendulum swung toward increasingly complex single cores, with three sample chips, from the 80286 to the Pentium Extreme Edition 840. Note that the chips' boxes are sized by number of transistors.

By 2005, the pendulum had swung about as far as it could go toward the complex single-core model. Although the motherlode is mostly exhausted, we're still scraping ore off its walls; this yields some continued improvement in single-threaded performance, but no longer at the historically delightful exponential rate.

Phase II, Secondary Veins = Homogeneous Multicore (2005-)

As a motherlode gets used up, miners concentrate on secondary veins that are still profitable but have a more moderate yield and higher cost per pound of extracted gold. So when Moore's unicore motherlode started getting mined out, the industry turned to mining Moore's secondary veins: using the additional transistors to make more cores per chip. Multicore let us continue to deliver exponentially increasing compute throughput in mainstream computers, but in a form that was less easily exploitable, because it placed a greater burden on software developers, who had to write parallel programs that could use the hardware.

Moving into Phase II took a lot of work in the software world. We had to learn to write new "free lunch" applications: ones that have lots of latent parallelism and so can once again ride the wave of running the same executable faster on next year's hardware (hardware that still delivers exponential performance gains, but primarily in the form of additional cores). Today, there are parallel runtimes and libraries like Intel Threading Building Blocks (TBB) and Microsoft Parallel Patterns Library (PPL), parallel debuggers and parallel profilers, and updated operating systems to run them all.

But this time the phase didn't last 30 years. We barely have time to catch our breath, because Phase III is already beginning.

Phase III, Tertiary Veins = Heterogeneous Cores (2011-)

As our miners are forced to move into smaller and smaller veins, yields diminish and costs rise. The miners are turning to Moore's tertiary veins: using Moore's extra transistors to make not just more cores, but also different kinds of cores, and in very large numbers, because the different cores are often smaller and swing the pendulum back toward the left.

There are two main categories of heterogeneity; see Figure 3.

Big/fast vs. small/slow cores. The smallest amount of heterogeneity is when all the cores are general-purpose cores with the same instruction set, but some cores are beefier than others because they have more hardware to accelerate execution (notably by hiding memory latency using various forms of internal concurrency). In this model, some cores are big complex ones that are optimized to run the sequential parts of a program really fast, while others are smaller cores optimized to get better total throughput for the scalably parallel parts of the program. However, even though they use the same instruction set, the compiler will often want to generate different code, and the difference can become visible to the programmer if the programming language must expose ways to control code generation. Intel follows this approach with Xeon (big/fast) and MIC (small/slow) cores that share approximately the x86 instruction set.

General vs. specialized cores. Beyond that, we see systems with multiple cores having different capabilities, including some that may not be able to support all of a mainstream language. In 2006-2007, with the arrival of the PlayStation 3, the Cell processor led the way by incorporating different kinds of cores on the same chip, with a single general-purpose core assisted by eight special-purpose SPU cores. Since 2009, we have begun to see mainstream use of GPUs to perform computation instead of just graphics. Specialized cores like SPUs and GPUs are attractive when they can run certain kinds of code more efficiently, both faster and more cheaply, which is a great bargain if your workload fits it.

GPGPU is especially interesting because we already have a large underutilized installed base: A significant percentage of existing mainstream machines already have compute-capable GPUs just waiting to be exploited. With the June 2011 introduction of AMD Fusion and the November 2011 launch of NVIDIA Tegra 3, systems with CPU and GPU cores on the same chip are becoming the new norm.


That installed base is a big carrot, and it creates an enormous incentive for compute-intensive mainstream applications to leverage that patiently waiting hardware. To date, a few early adopters have been using technologies like CUDA, OpenCL, and more recently C++ AMP to harness GPUs for computation. Mainstream application developers who care about performance need to learn to do the same; see Table 1.

But that's pretty much it: we currently know of no other major ways to exploit Moore's Law for compute performance, and once these veins are exhausted, it will be largely mined out.

We're still actively mining for now, but the writing on the wall is clear: "mene mene" diminishing returns demonstrate that we've entered the endgame.


On The Charts: Not Three Trends, But One Trend

Next, let's put all of this in perspective by showing that multicore, hetero-core, and cloud-core are not three trends, but aspects of a single trend. To show that, we have to show that they can be drawn on the same map. Figure 4 shows an appropriate map that lets us see where processor core architectures are going and where memory architectures are going, and visualize just where we've been digging around in the mine so far.

First, I describe each axis, then map out past and current hardware to spot trends, and finally draw some conclusions about where hardware is likely to concentrate.

Processor Core Types

The vertical axis shows processor core architectures. As shown in Figure 5, from bottom to top, they form a continuum of increasing performance and scalability, but also of increasing restrictions on programs and programmers, in the form of additional performance issues (yellow) or correctness issues (red) added at each step.

Complex cores are the big traditional ones, whose pendulum has swung far to the right in the habitable zone. These are best at running sequential code, including code limited by Amdahl's Law.

Simpler cores are the small traditional ones, toward the left of the habitable zone. These are best at running parallelizable code that still requires the full expressivity of a mainstream programming language.

Specialized cores like those in GPUs, DSPs, and Cell's SPUs are more limited, and often do not yet fully support all features of mainstream languages (such as exception handling). These are best at running highly parallelizable code that can be expressed in a subset of a language like C or C++.


For example, Xbox Kinect skeletal tracking requires using both the CPU and the GPU cores on the console, and would be impossible otherwise.

The farther you move upward on the chart (to the right in the blown-up figure), the better the performance throughput and/or the less power you need, but the more the application code is constrained, as it has to be more parallel and/or use only subsets of a mainstream language.

Future mainstream hardware will likely contain all three basic kinds of cores, because many applications have all these kinds of code in the same program, and so naturally will run best on a heterogeneous computer that has all these kinds of cores. For example, Kinect skeletal tracking, all Kinect games, and all CUDA/OpenCL/C++ AMP applications available today could not run well or at all on a homogeneous machine, because they rely on running parts of the same application on the CPU(s) and other parts on specialized cores. Those apps are just the beginning.

Memory Architectures

The horizontal axis shows six common memory architectures. From left to right, they form a continuum of increasing performance and scalability, but (except for one important discontinuity) also of additional work for programs and programmers to deal with performance issues (yellow) or correctness issues (red). In Figure 6, the peaks represent cache and the lower boxes represent RAM; a processor core sits at the top of each cache peak.

Unified memory is tied to the unicore motherlode. This memory hierarchy is wonderfully simple: a single mountain with one core sitting on top. This describes essentially all mainstream computers from the dawn of computing until the mid-2000s. This is the simplest programming model: Every pointer (or object reference) can address every byte, and every byte is equally far away from the core.


Even here, programmers need to be conscious of at least two basic cache effects: locality, or how well hot data fits into cache; and access order, because modern memory architectures love sequential access patterns (for more on this, see my Machine Architecture talk at http://is.gd/1Fe99o).
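A hedged illustration of the access-order point (my example; the matrix is row-major and the sizes are arbitrary): the two loops below touch exactly the same bytes, but the first walks memory sequentially while the second strides across rows, which typically runs several times slower on cached hardware.

    #include <cstddef>
    #include <vector>

    // Same work, different access order over a row-major matrix.
    void sum_both_orders(const std::vector<double>& m, std::size_t rows,
                         std::size_t cols, double& by_row, double& by_col) {
        by_row = by_col = 0.0;
        for (std::size_t i = 0; i < rows; ++i)      // sequential walk:
            for (std::size_t j = 0; j < cols; ++j)  // cache-friendly
                by_row += m[i * cols + j];
        for (std::size_t j = 0; j < cols; ++j)      // strided walk: a fresh
            for (std::size_t i = 0; i < rows; ++i)  // cache line per element
                by_col += m[i * cols + j];
    }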

NUMA cache retains a single chunk of RAM, but adds multiple caches. Now, instead of a single mountain, we have a mountain range with multiple peaks, each with a core on top. This describes today's mainstream multicore devices. Here, we still enjoy a single address space and pretty good performance, as long as different cores access different memory, but programmers now have to deal with two main additional performance effects:

locality matters in new ways, because some peaks are closer to each other than others (two cores that share an L2 cache vs. two cores that share only L3 or RAM); and layout matters, because we have to keep data physically close together if it's used together (on the same cache line), and apart if it's not (for example, to avoid the ping-pong game of false sharing, sketched in the example below).
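Here is a minimal sketch of that false-sharing effect (my example, assuming C++11's alignas and a typical 64-byte cache line): two counters hammered by two threads will ping-pong a shared cache line if they sit side by side, and stop interfering once each is padded onto its own line.

    #include <atomic>
    #include <thread>

    // Two hot counters. Packed together they would share one cache
    // line, and the two writing cores would invalidate each other
    // constantly; alignas(64) gives each counter its own line.
    struct alignas(64) PaddedCounter {
        std::atomic<long> value{0};
    };

    PaddedCounter counters[2];

    void hammer(int which) {
        for (long i = 0; i < 10000000; ++i)
            counters[which].value.fetch_add(1, std::memory_order_relaxed);
    }

    int main() {
        std::thread a(hammer, 0), b(hammer, 1);
        a.join();
        b.join();
    }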

NUMA RAM further fragments memory into multiple physical chunks of RAM, but still exposes a single logical address space. Now, the performance valleys between the cores get deeper, because accessing RAM in a chunk not local to this core incurs a trip across the bus. Examples include bladed servers, symmetric multiprocessor (SMP) desktop computers with multiple sockets, and newer GPU architectures that provide a unified address space view of the CPU's and GPU's memory, but leave some memory physically closer to the CPU and other memory physically closer to the GPU. Now we add another item to the list of things a performance-conscious programmer needs to think about: just because we can form a pointer to anything doesn't mean we always should, if it means reaching across an expensive chasm.

Incoherent and weak memory makes memory be less synchronized, in the hope that allowing each core to have a looser view of the state of memory can make them run faster, even though the memory must inevitably be synchronized again. As of this writing, the only remaining mainstream CPUs with weak memory models are current PowerPC and ARM processors (popular despite their memory models rather than because of them; more on this below). This model still has the simplicity of a single address space, but now the programmer further has to take on the burden of synchronizing memory.

Disjoint (tightly coupled) memory bites the bullet: Different cores see different memory, typically over a shared bus, while still running as a tightly coupled unit that has low latency and whose reliability is still evaluated as a single unit. Now the picture is a tightly clustered group of mountainous islands, each with its own mountains of cache overlooking square miles of RAM, connected by bridges with a fleet of trucks expediting goods (point-to-point bulk transfer operations, message queues, and the like). In the mainstream, we see this model used by 2009-generation GPUs, whose on-board memory is not shared with the CPU or with each other. True, programmers no longer enjoy having a single address space and the ability to share pointers. But in exchange, we have removed the entire set of programmer burdens accumulated so far, and replaced them with a single new responsibility: copying data between islands of memory.


Disjoint (loosely coupled) is the cloud, where cores spread out-of-box into different rooms, buildings, and datacenters. This moves the islands farther apart, and replaces the bus bridges with network speedboats and tankers. In the mainstream, we see this model in HaaS cloud computing offerings; this is the commoditization of the compute cluster. Programmers now have to arrange to deal with two additional concerns, which often can be abstracted away by libraries and runtimes: reliability, as nodes can come and go; and latency, as the islands are farther apart.

Charting the Hardware

All three trends are just aspects of a single trend: filling out the chart and enabling heterogeneous parallel computing. Figure 7 shows how the chart wants to be filled out, because there are workloads naturally suited to each of these boxes, though some boxes are more popular than others.

To help visualize the filling-out process more concretely, let's spot-check how mainstream hardware has progressed on this chart. The easiest place to start is the long-standing mainstream CPU and the more recent GPU:

From the 1970s to the 2000s, CPUs started with simple single cores and then moved downward as the pendulum swung toward increasingly complex cores. They hugged the left side of the chart by staying single-core as long as possible, but eventually ran out of room and turned toward multicore NUMA cache architectures; see Figure 8.

Meanwhile, in the late 2000s, mainstream GPUs became capable of handling computational workloads. Because they started life in an add-on discrete GPU card format, where the graphics-specific cores and memory were physically separate from the CPU and system RAM, they started further toward the right (Specialized / Disjoint (local)). GPUs have been moving leftward to increasingly unified views of memory, and slightly downward to try to support full mainstream languages (such as adding exception handling support).


Today's typical mainstream computer includes both a CPU and a discrete or integrated GPU. The dotted line in the graphic denotes cores that are available to a single application because they are in the same device, but not on the same chip.

Now we are seeing a trend to use CPU and specialized (currently GPU) cores with very tightly coupled memory, and even on the same die:

In 2005, the Xbox 360 sported a multicore CPU and GPU that could not only directly access the same RAM, but had the very unusual feature that they could share even L2 cache.

In 2006 and 2007, the Cell-based PS3 console sported a single processor having both a single general-purpose core and eight special-purpose SPU cores. The solid line in Figure 9 denotes cores that are on the same chip, not just in the same device.

In June 2011 and November 2011, respectively, AMD and NVIDIA launched the Fusion and Tegra 3 architectures, mainstream chips that sported a compute-class GPU (hence higher up vertically) on the same die (hence well to the left).

Intel has also shipped the Sandy Bridge architecture, which includes an integrated GPU that is not yet fully compute-capable. Intel's main focus has been the MIC effort of about 50 simple, general-purpose x86-like cores on the same die, expected to be commercially available in the near future.

Finally, we complete the picture with cloud HaaS (Figure 10):

In 2008 and 2009, Amazon, Microsoft, Google, and others began rolling out their cloud compute offerings. AWS, Azure, and GAE support an elastic cloud of nodes, each of which is a traditional computer (big-core and loosely coupled, at or near the bottom right corner of the chart), where each node in the cloud has a single core or multiple CPU cores (the two lower-left boxes). As before, the dotted line denotes that all of the cores are available to a single application, and the network is just another bus to more compute cores.


Since November 2010, AWS also supports compute instances that contain both CPU cores and GPU cores, indicated by the H-shaped virtual machine where the application runs on a cloud of loosely coupled nodes with disjoint memory (right column), each of which contains both CPU and GPU cores (currently not on the same die, so the vertical lines are still dotted).

The Jungle

Putting it all together, we get a noisy profusion of life and color, as in Figure 11. This may look like a confused mess, so let's notice two things that help make sense of it.

First, every box has a workload that it's best at, but some boxes (particularly some columns) are more popular than others. Two boxes are particularly less interesting:

Fully unified memory models are only applicable to the unicore box, which is being essentially abandoned in the mainstream.

Incoherent/weak memory models are a performance experiment that is in the process of failing in the marketplace. On the hardware side, the theoretical performance benefits that come from letting caches work less synchronously have already been largely duplicated in other ways by mainstream processors having stronger memory models. On the software side, the mainstream general-purpose languages and environments (C, C++, Java, .NET) have largely rejected weak memory models, and require a coherent model that is technically called "sequential consistency for data race free programs" (http://is.gd/EmpCDn [PDF]) as either their only supported memory model (Java, .NET) or their default memory model (ISO C++11, ISO C11). Nobody is moving toward the weak end of the incoherent/weak memory strip of the chart; at most, some are moving through it to get to the other side, but nobody wants to stay there.
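To make "sequential consistency for data-race-free programs" concrete, here is a minimal C++11 sketch (my illustration): because the flag is a std::atomic with the default sequentially consistent ordering, the program has no data race, and the reader is guaranteed to observe the write to data once it sees the flag set.

    #include <atomic>
    #include <cassert>
    #include <thread>

    int data = 0;                    // ordinary, non-atomic data
    std::atomic<bool> ready(false);  // default ordering: seq_cst

    int main() {
        std::thread producer([] {
            data = 42;               // ordered before the flag store
            ready.store(true);       // publish (sequentially consistent)
        });
        std::thread consumer([] {
            while (!ready.load()) {} // spin until the flag is visible
            assert(data == 42);      // guaranteed under SC-DRF
        });
        producer.join();
        consumer.join();
    }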

But all other boxes, including all rows (processor core types), are strongly represented, and we realize why that's true: Different parts of even the same application naturally want to run on different kinds of cores.

Second, let's clarify the picture by highlighting and labeling the two regions that hardware is migrating toward, in Figure 12.


In Figure 12, again we see the first and fourth columns being de-emphasized, as hardware trends have begun gradually coalescing around two major areas. Both areas extend vertically across all kinds of cores, and the most important thing to note is that these represent two mines, where the area to the left is the Moore's Law mine.

Mine #1: Scale in = Moore's Law. Local machines will continue to use large numbers of heterogeneous local cores, either in-box (such as CPU with discrete GPU) or on-die (Sandy Bridge, Fusion, Tegra 3). We'll see core counts increase until Moore's Law ends, and then stabilize for individual local devices.

Mine #2: Scale out = distributed cloud. Much more importantly, we will continue to see a cornucopia of cores delivered via compute clouds, either on-premises (cluster, private cloud) or in public clouds. This is a brand new mine, directly enabled by the lower coupling of disjoint memory, especially loosely coupled distributed nodes.

The good news is that we can heave a sigh of relief at having found another mine to open. The even better news is that the new mine has a far faster growth rate than even Moore's Law. Notice the slopes of the lines when we graph the amount of parallelism available to a single application running on various architectures; see Figure 13. The bottom three lines are mining Moore's Law for scale-in growth; their common slope reflects Moore's wonderful exponent, just shifted upward or downward to account for how many cores of a given size can be packed onto the same die. The top two lines are mining the cloud (with CPUs and GPUs, respectively) for scale-out growth, and their slope is steeper still.


If hardware designers merely use Moore's Law to deliver more big fat cores, on-device hardware parallelism will stay in double digits for the next decade, which is very roughly when Moore's Law is due to sputter, give or take about a half decade. If hardware follows Niagara's and MIC's lead to go back to simpler cores, we'll see a one-time jump and then stay in triple digits. If we all learn to leverage GPUs, we already have 1,500-way parallelism in modern graphics cards (I'll say "cores" for convenience, though that word means something a little different on GPUs) and will likely reach five digits in the decade timeframe.

But all of that is eclipsed by the scalability of the cloud, whose growth line is already steeper than Moore's Law, because we're better at quickly deploying and using cost-effective networked machines than we've been at quickly jam-packing and harnessing cost-effective transistors. It's hard to get data on the current largest cloud deployments because many projects are private, but the largest documented public cloud apps (which don't use GPUs) are already harnessing over 30,000 cores for a single computation. I wouldn't be surprised if some projects are exceeding 100,000 cores today. And that's general-purpose cores; if you add GPU-capable nodes to the mix, add two more zeroes.

Such massive parallelism, already available for rates of under $1,300/hour for a 30,000-core cloud, is game-changing. If you doubt that, here is a boring example that doesn't involve advanced augmented reality or spook-level technomancery: How long will it take someone who's stolen a strong password file (which we'll assume is correctly hashed and salted and contains no dictionary passwords) to retrieve 90% of the passwords by brute force using a publicly available GPU-enabled compute cloud? Hint: An AWS dual-Tesla node can test on the order of 20 billion passwords per second, and clouds of 30,000 nodes are publicly documented (of course, Amazon won't say if it has that many GPU-enabled nodes for hire; but if it doesn't yet, it will soon). Multiply those two numbers and, to borrow a tired misquote (http://is.gd/PJ…): 640 trillion affordable attempts per second should be enough for anybody. If that's not enough for you, not to worry; just wait a few years and it'll be 640 quadrillion affordable attempts per second.

What It Means For Us: A Programmer's View

How will all of this change the way we write our software, if we care about harnessing mainstream hardware performance? The basic conclusions echo and expand upon ones that I proposed in "The Free Lunch Is Over":

Applications will need to be at least massively parallel, and ideally able to use non-local cores and heterogeneous cores, if they want to fully exploit the long-term continued exponential growth in compute throughput being delivered both in-box and in-cloud. After all, soon the vast majority of compute cores available to a mainstream application will be non-local.

Efficiency and performance optimization will get more, not less, important. We're being asked to deliver richer experiences, like sensor-based UIs and augmented reality, on less hardware (constrained mobile form factors, plus the eventual plateauing of scale-in when Moore's Law ends). In December 2004 I wrote: "Those languages that already lend themselves to heavy optimization will find new life; those that don't will need to find ways to compete and become more efficient and optimizable. Expect long-term increased demand for performance-oriented languages and systems." Witness the resurgence of interest in C++ in 2011, primarily because of its expressive flexibility and performance efficiency.


A program that is twice as efficient has two advantages: it will be able to run twice as well on a local disconnected device, especially when Moore's Law can no longer deliver local performance improvements in any form; and it will always be able to run at half the power and cost on an elastic compute cloud, even as those continue to expand for the indefinite future.

Programming languages and systems will increasingly be forced to deal with heterogeneous distributed parallelism. As previously predicted, just basic homogeneous multicore has proved to be a far bigger event for languages than even object-oriented programming was, because some languages (notably C) could get away with ignoring objects while still remaining commercially relevant for mainstream software development. No mainstream language, including the just-ratified C11 standard, could ignore basic concurrency and parallelism and stay relevant in even a homogeneous-multicore world. Now expect all mainstream languages and environments, including their standard libraries, to develop explicit support for at least distributed parallelism and probably also heterogeneous parallelism; they cannot hope to avoid it without becoming marginalized for mainstream app development.

Expanding on that last bullet, what are some basic elements we will need to add to mainstream programming models (think: C, C++, Java, and .NET)? Here are a few basics I think will be unavoidable, and that must be supported explicitly in one form or another.

Deal with the processor axis' lower section, by supporting compute cores with different performance (big/fast, slow/small). At minimum, mainstream operating systems and runtimes will need to be aware that some cores are faster than others, and know which parts of an application want to run on which of those cores.

Deal with the processor axis' upper section, by supporting language subsets, to allow for cores with different capabilities, including cores that do not fully support mainstream language features. In the next decade, a mainstream operating system, alone or augmented with an extra runtime (like the Java VM or .NET CLR, or the ConcRT runtime underpinning PPL), will be capable of managing cores with different instruction sets and running a single application across many of those cores. Programming languages and tools will be extended to let the developer express code that is restricted to use just a subset of a mainstream programming language (as with the restrict() qualifiers in C++ AMP, sketched below). I'm optimistic that for most mainstream languages a small language extension will be sufficient, while leaving intact the usual language rules for overloading and dispatch, and minimizing the impact on developers.
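For a taste of what that looks like today, here is a minimal C++ AMP fragment (C++ AMP is the Microsoft technology the article names; the surrounding code is my illustration of its published API). The lambda is marked restrict(amp), telling the compiler to check its body against the language subset the accelerator cores can execute:

    #include <amp.h>   // C++ AMP (Visual C++): array_view, parallel_for_each
    using namespace concurrency;

    void square_all(float* data, int n) {
        array_view<float, 1> av(n, data);   // wrap host data for the GPU
        parallel_for_each(av.extent,
            [=](index<1> i) restrict(amp) { // body must stay within the
                av[i] = av[i] * av[i];      // amp-restricted subset
            });
        av.synchronize();                   // copy results back to host
    }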

Deal with the memory axis for computation, by providing distributed algorithms that can scale not just locally but also across a compute cloud. Libraries and runtimes like TBB and PPL will be extended or duplicated to enable parallel-for-each and other algorithms that run on large numbers of local and non-local parallel cores. Today, we can write a parallel_for_each call that can run with 1,000x parallelism on a set of local discrete GPUs and ship the right data shards to the right compute cards and the results back.


Tomorrow, we need to be able to write that same call so that it runs with 1,000,000,000x parallelism on a set of cloud-based GPUs and ships the right data shards to the right nodes and the results back. This is a baby-step example in that it just uses local data (that can fit in a single machine's memory), but distributed computation; the data subsets are simply copied hub-and-spoke.

Deal with the memory axis for data, by providing distributed data containers, which can be spread across many nodes. The next step is for the data itself to be larger than any node's memory, and (preferably automatically) move the right data subsets to the right nodes of a distributed computation. For example, we need containers like a distributed_array or distributed_table that can be backed by multiple and/or redundant cloud storage, and then make those the target of the same distributed parallel_for_each call. After all, why shouldn't we write a single parallel_for_each call that efficiently updates a 100 petabyte table? Hadoop (http://hadoop.apache.org/) enables this today for specific workloads and with extra work; this will become a standard capability available out-of-the-box in mainstream language compilers and their standard libraries. A sketch of what such a container might look like follows.
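To show the shape of the API being imagined here, a deliberately hypothetical C++ sketch: distributed_array, shard, and for_each_shard are invented names standing in for the future containers and algorithms described above, not any real library.

    #include <vector>

    // Hypothetical sketch only: a container logically larger than any
    // one node's RAM, physically split into per-node shards. All names
    // here are illustrative inventions, not a shipping API.
    template <typename T>
    class distributed_array {
        struct shard { int node_id; std::vector<T> local_data; };
        std::vector<shard> shards_;  // one shard resident on each node
    public:
        // A distributed parallel_for_each would dispatch each shard's
        // work to the node that already holds that data, rather than
        // copying everything hub-and-spoke through the caller.
        template <typename Fn>
        void for_each_shard(Fn work) {
            for (typename std::vector<shard>::iterator s = shards_.begin();
                 s != shards_.end(); ++s)  // stand-in for remote dispatch,
                work(s->local_data);       // one task per node
        }
    };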

Enable a unified programming model that can handle the entire chart with the same source code. Since we've been able to draw all of this hardware on a single chart with just two degrees of freedom, the landscape is unified enough that it should be able to be spanned by a single programming model in the future. Such a model will have at least two basic characteristics: First, it will cover the Processor axis by letting the programmer express language subsets in a way integrated holistically into the language. Second, it will cover or hide the Memory axis by abstracting the location of data, and copying data subsets on demand by default, while also providing a way to take control of the copying for power users who want to optimize the performance of a specific computation.

Perhaps our most difficult mental adjustment, however, will be to learn to think of the cloud as part of the mainstream machine: to view all these local and non-local cores as being equally part of the target machine that executes our application, where the network is just another bus that connects us to more cores. That is, in a few years we will write code for mainstream machines assuming that they have million-way parallelism, of which only thousand-way parallelism is guaranteed to always be available (when out of WiFi range).

Five years from now, we want to be delivering apps that run well on an isolated device, and then just run faster or better when they are in WiFi range and have dynamic access to many more cores. Every part of our operating systems, runtimes, libraries, programming languages, and tools needs to get us to a place where we can write compute-bound applications that run well in isolation on disconnected devices with 1,000-way local parallelism, and when the device is connected,


just run faster, handle much larger data sets, and/or light up with additional capabilities. We have a very small taste of that now with cloud-based apps like Shazam (which function only when online), but we have a long way yet to go to realize this full vision.

Exit Moore, Pursued by a Dark Silicon Bear

Finally, let's return one more time to the end of Moore's Law to see what awaits us in our near future (Figure 14), and why we will likely pass through three distinct stages as we navigate Moore's End.

Eventually, the tired miners will reach the point where it's no longer economically feasible to operate the mine. There's still gold left, but it's no longer commercially exploitable. Recall that Moore's Law has been interesting only because of the ability to transform its raw resource of more transistors into one of two useful forms:

Exploit #1: Greater throughput. Moore's Law lets us deliver more transistors, and therefore more complex chips, at the same cost. That's what will let processors continue to deliver more computational performance per chip, as long as we can find ways to harness the extra transistors for computation.

Exploit #2: Lower cost/power/size. Alternatively, Moore's Law enables delivery of the same number of transistors at lower cost, including in a smaller area and at lower power. This is what will let us continue to deliver powerful experiences in increasingly compact and mobile and embedded form factors.

The key thing to note is that we can expect these two ways of exploiting Moore's Law to end, not at the same time, but one after the other, and in that order.

Why? Because Exploit #2 only relies on the basic Moore's Law effect, whereas the first relies on Moore's Law and the ability to power all the transistors at the same time.

Which brings me to one last problem down in our mine.

The Power Problem: Dark Silicon

Sometimes you can be hard at work in a mine, still pulling out ore, when a small disaster happens: a cave-in, or striking water. Such problems can render entire sections of the mine unreachable. We have just begun to hit exactly those kinds of problems.

One particular problem we have just begun to encounter is known as "dark silicon." Although Moore's Law is still delivering more transistors, we are losing the ability to power them all at the same time. For more details, see Jem Davies' talk "Compute Power With Energy-Efficiency" (http://is.gd/Lfl7iz [PDF]) and the ISCA'11 paper "Dark Silicon and the End of Multicore Scaling" (http://is.gd/GhGdz9 [PDF]).

This dark silicon effect is like a Shakespearean bear that chases our doomed character offstage.


Even though we can continue to pack more cores on a chip, if we cannot use them at the same time, we have failed to exploit Moore's Law to deliver more computational throughput (Exploit #1). When we enter the phase where Moore's Law continues to give us more transistors per die area, but we are no longer able to power them all, we will find ourselves in a transitional period where Exploit #1 has ended while Exploit #2 continues and outlives it for a time.

This means that we will likely see the following major phases in the scale-in growth of mainstream machines. (Note that these apply to individual machines only, such as your personal notebook and smartphone or an individual compute node; they do not apply to a compute cloud, which we saw belongs to a different scale-out mine.)

    Exploit #1 + Exploit #2: Increasing performance (compute

    throughput) in all form factors (1975 mid-2010s?). For a

    few years yet, we will see continuing increases in mainstream

    computer performance in all form factors from desktop to smart-

    phone. As of today, the bigger form factors still have more paral-

    lelism, just as todays desktop CPUs and GPUs are routinely more

    capable than those in tablets and smartphones as long as Ex-

    ploit #1 lives, and then

    Exploit #2 only: Flat performance (compute throughput) at

    the top end, and mid and lower segments catching up (late

    2010s early 2020s?). Next, if problems like dark silicon are not

    solved, we will enter a period where mainstream computer per-

    formance levels out, starting at the top end with desktops and

    game consoles and working its way down through tablets and

    smartphones. During this period we will continue to use Moores

    Law to lower cost, power, and/or size delivering the same com-plexity and performance already available in bigger form factors

    also in smaller devices. Assuming Moores L

    enough beyond the end of Exploit #1, we can

    it will take for Exploit #2 to equalize personal

    ing the difference in transistor counts betw

    stream desktop machines and smartphones;

    of 20, which will take Moores Law about eigh

Democratization (early 2020s? onward). Eventually, democratization will reach the point where a desktop and a smartphone have roughly the same compute performance. In that case, why buy a desktop ever again? Just dock your tablet or smartphone. You might think that there are still important differences between the desktop and the smartphone: power, because the desktop is plugged in, and peripherals, because the desktop has easier access to a bigger screen and a real keyboard/mouse. But once you dock the smartphone, it has the same access to power and peripherals.
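A quick sanity check on that eight-year figure (my arithmetic, assuming a transistor-count gap of roughly 20x and one Moore's Law doubling every two years):

t ≈ log2(20) doublings × 2 years per doubling ≈ 4.3 × 2 ≈ 8.6 years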

Speaking of Smartphones: Pocket Tablets and Docking

Note that the word smartphone is already a misnomer, because a pocket device that can run apps is not primarily a phone at all. It's primarily a general-purpose personal computer that happens to have a couple of built-in radios for cell and WiFi connectivity. That makes the traditional cell phone capability just an app that happens to use the cell radio, and the Skype IP phone capability on the same device just another similar app that happens to use the WiFi radio instead.

The right way to think about even today's mobile devices is that there are not really tablets and smartphones; there are page-sized tablets and pocket-sized tablets, both already available with and


without cellular radios. That they run different operating systems today is just a point-in-time effect.

This is why those people who said an iPad is just a big iPhone without the cellular radio had it exactly backwards: the iPhone (3G or later, which allows apps) is a small iPad that fits in your pocket and happens to have a cellular radio in order to obsolete another pocket-sized device. Both devices are primarily tablets: they minimize hardware chrome and turn into the full-screen immersive app, and that's the closest thing you can get today to a morphing device that turns into a special-purpose device on demand. Many of us routinely use our phones mostly as a small tablet, spending most of our time on the device running apps to read books, browse news, watch movies, play games, update social networks, and surf the Internet. I already use my phone as a small tablet far more often than I use it as a phone, and if you have an app-capable phone then I'll bet you already do that, too.

Well before the end of this decade, I expect the most likely dominant mainstream form factor to be page-sized and pocket-sized tablets, plus docking, where docking means any means of attaching peripherals like keyboards and big screens on demand, which today already encompasses physical docks and Bluetooth and Play To connections, and will only continue to get more wireless and more seamless.

This future shouldn't be too hard to imagine, because many of us have already been working that way for a while now: For the past decade I've routinely worked from my notebook as my primary and only environment. Usually, I'm in my home office or work office, where I use a real keyboard and big screens by docking the notebook and/or using it via a remote-desktop client, and when I'm mobile I use it as a notebook. In 2012, I expect to replace my notebook with an x86-based modern tablet and use it exactly the same way. We've seen it play out many times:

Many of us used to carry around both a PalmPilot and a phone, but then the smartphone took over the job of the PalmPilot and eliminated a device with the same form factor.

Lots of kids (or their parents) carry a hand-held gaming device and a pocket tablet (aka smartphone), and we are watching the decline of the dedicated hand-held gaming device as the pocket tablet is taking over more and more of that job.

Similarly, today many of us carry around a notebook and a dedicated tablet, and convergence will again let us carry one device with the same form factor.

Computing loves convergence. In general-purpose personal computing (like notebooks and tablets, not special-purpose devices like microwaves and automobiles that may happen to contain processors), convergence always happily dooms special-purpose devices in the long run, as each device either evolves to take over the other's job or gets taken over. We will continue to have distinct pocket-sized tablets and page-sized tablets for a time because they are different form factors with different mobile uses, but even that may last only until we find a way to unify the form factors (fold the bigger screen?) so they too can converge.

Summary and Conclusions

Mainstream hardware is becoming permanently parallel, heterogeneous, and distributed. These changes are permanent, and so will permanently affect the way we have to write performance-intensive code on mainstream architectures.

The good news is that Moore's local scale-in transistor mine isn't empty yet. It appears the transistor bonanza will last for


another decade, give or take five years or so, which should be long enough to exploit the lower-cost side of the Law to get us to parity between desktops and pocket tablets. The bad news is that we can clearly observe the diminishing returns as the transistors are decreasingly exploitable: with each new generation of processors, software developers have to work harder and the chips get more difficult to power. And with each new crank of the diminishing-returns wheel, there's less time for hardware and software designers to come up with ways to overcome the next hurdle; the motherlode free lunch lasted 30 years, but the homogeneous multicore era lasted only about six years, and we are now already overlapping the next two eras of hetero-core and cloud-core.

But all is well: When your mine is getting empty, you don't panic, you just open a new mine at a new motherlode. As usual, in this case the end of one dominant wave overlaps with the beginning of the next, and we are now early in the period of overlap where we are standing with a foot in each wave, a crew in each of Moore's mine and the cloud mine. Perhaps the best news of all is that the cloud wave is already scaling enormously quickly, faster than the Moore's Law wave that it complements, and that it will outlive and replace.

If you haven't done so already, now is the time to take a hard look at the design of your applications, determine what existing features (or, better still, what potential and currently unimaginable demanding new features) are CPU-sensitive now or are likely to become so soon, and identify how those places could benefit from local and distributed parallelism. Now is also the time for you and your team to grok the requirements, pitfalls, styles, and idioms of hetero-parallel (e.g., GPGPU) and cloud programming (e.g., Amazon Web Services, Microsoft Azure, Google App Engine).

To continue enjoying the free lunch of shipping an application that runs well on today's hardware and will just naturally run faster on tomorrow's hardware, you need to write an app with lots of latent parallelism, expressed in a form that can be spread across a machine with a variable number of cores of different kinds: local and distributed cores, and big/small/specialized cores. The throughput gains will cost extra: extra development effort, extra code complexity, and extra testing effort. The good news is that for many classes of applications the extra effort will be worthwhile, because concurrency will let them fully exploit the exponential gains in compute throughput that will continue to grow strong and fast long after Moore's Law has gone into its sunny retirement, as we continue to mine the cloud for the rest of our careers.
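To make the shape of such code concrete, here is a minimal C++11 sketch (my illustration, not from the article): the work is expressed as independent tasks, the task count adapts to whatever core count the machine reports, and std::async stands in for a real work-stealing scheduler.

#include <algorithm>
#include <future>
#include <numeric>
#include <thread>
#include <vector>

// Sum a big vector as a set of independent tasks; the partitioning
// (naively, one chunk per reported hardware thread) scales with the machine.
double parallel_sum(std::vector<double> const& v) {
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::size_t chunk = (v.size() + n - 1) / n;
    std::vector<std::future<double>> parts;
    for (std::size_t i = 0; i < v.size(); i += chunk) {
        std::size_t end = std::min(v.size(), i + chunk);
        parts.push_back(std::async(std::launch::async, [&v, i, end] {
            return std::accumulate(v.begin() + i, v.begin() + end, 0.0);
        }));
    }
    double total = 0.0;
    for (auto& part : parts)
        total += part.get();  // the only synchronization point
    return total;
}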

Acknowledgments

I would like to particularly thank Jeffrey Barr, David Callahan, Olivier Giroux, Yossi Levanoni, Henry Moreton, and James Reinders, who graciously made themselves available to answer questions and provide background information, and who shared their thoughts on appropriately mapping their companies' products onto the processor/memory chart.

Herb Sutter is a bestselling author and consultant on software development topics, and a software architect at Microsoft. A version of this article is posted at http://herbsutter.com/welcome-to-the-jungle/.


Efficient Use of Lambda Expressions and std::function

Functors and std::function implementations vary widely between libraries. C++11's lambdas make them more efficient.

By Cassio Neri

Functor classes (classes that implement operator()) are old friends to C++ programmers who, for many years, have used them as predicates for STL algorithms. Nevertheless, implementing simple functor classes is quite cumbersome, as the following example shows.

Suppose that v is an STL container of ints and we want to compute how many of its elements are multiples of a certain value n set at runtime. An STL way of doing this is:

std::count_if(v.begin(), v.end(), is_multiple_of(n));

    where is_multiple_of is defined by:

class is_multiple_of {
public:
    typedef bool result_type;  // These typedefs are not strictly
    typedef int argument_type; // required, but are recommended.
                               // More details later.
    is_multiple_of(int n) : n(n) {}
    bool operator()(int i) const { return i % n == 0; }
private:
    const int n;
};

Having to write all this code pushes many programmers to write their own loops instead of calling std::count_if. By doing so, they lose good opportunities for compiler optimizations.


Lambda expressions (http://en.wikipedia.org/wiki/Anonymous_function#C.2B.2B) make the creation of simple functor classes much easier. Although two of the Boost libraries (http://www.boost.org), Boost.Lambda and, more recently, Boost.Phoenix, provide very good implementations of lambda abstractions in C++03, the standard committee decided to add language support for lambda expressions in C++11 to improve the language's expressiveness. Using this new feature, the previous example becomes:

std::count_if(v.begin(), v.end(), [n](int i){ return i%n == 0; });

Behind the scenes, the lambda expression [n](int i){ return i%n == 0; } forces the compiler to implement an unnamed functor class similar to is_multiple_of, with some obvious advantages:

1. It's much less verbose.
2. It doesn't introduce a new name just for a temporary use, resulting in less name pollution.
3. Frequently (not in this example, though) the name of the functor class is much less expressive than its actual code. Placing the code closer to where it's called improves code clarity.

The Closure Type

In the previous examples, our functor class was named is_multiple_of. Naturally, the functor class automatically implemented by the compiler has a different name. Only the compiler knows this type's name, and we can think of it as an unnamed type. For presentation purposes, it's called the closure type, whereas the temporary object resulting from the lambda expression is the closure object. The type's anonymity is not an issue for std::count_if because it's a template function and, therefore, argument type deduction takes place.

Turning a function into a template is a way to make it accept lambda expressions as arguments. Consider, for instance, a situation where one implements a root-finder; i.e., a function that takes a function f and returns a double value x such that f(x) = 0. A first attempt might be a template function:

template <typename T>
double find_root(T const& f);

However, this might not be desirable due to a few well-known template weaknesses: The code must be exposed in header files, compilation time increases, and template functions can't be virtual.

Can find_root be a non-template function? If so, what should its signature be?

    double find_root(??? const& f);

Argument type deduction for template functions already existed before C++11. Nevertheless, the new standard introduces the keywords auto and decltype to support type deduction in other contexts. (auto was a keyword in C++03, but with a different meaning.) If we want to give a name to a closure object, then we can follow this example:

auto f = [](double x){ return x * x - 0.5; };

Furthermore, an alias for the closure type, say function_t, can be created:

typedef decltype(f) function_t;

Unfortunately, function_t is set at the same scope as the lambda expression and, therefore, is invisible elsewhere. In particular, it cannot be used in find_root's signature.


Now, the other important actor of our play enters the stage: std::function.

std::function and Its Costs

Another Boost option, std::function, implements a type-erasure mechanism that allows a uniform treatment of different functor types. Its predecessor boost::function dates back to 2001 and was introduced into TR1 in 2005 as std::tr1::function. Now, it's part of C++11 and has been promoted to namespace std.

We shall see a few details of three different implementations of std::function and related classes: Boost, the Microsoft C++ Standard library (MSLIB for short), and the GNU Standard C++ Library (a.k.a. libstdc++, but referred to here as GCCLIB). Unless otherwise stated, we shall generically refer to the relevant library types and functions as if they belonged to namespace std, regardless of the fact that Boost's do not. I will cover two compilers: Microsoft Visual Studio 2010 (MSVC) and the GNU Compiler Collection 4.5.3 (GCC) using option -std=c++0x. I'll consider these compilers, compiling their corresponding aforementioned standard libraries, and also compiling Boost.

Using std::function, find_root's declaration becomes:

double find_root(std::function<double(double)> const& f);
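For illustration, here is one minimal way such a non-template find_root could be implemented and called (a naive bisection that assumes f changes sign on [0, 1]; the article itself doesn't prescribe an algorithm):

#include <functional>

// Naive bisection: assumes f(0) and f(1) have opposite signs.
double find_root(std::function<double(double)> const& f) {
    double lo = 0.0, hi = 1.0;
    for (int i = 0; i < 60; ++i) {
        double mid = 0.5 * (lo + hi);
        if ((f(lo) < 0.0) == (f(mid) < 0.0)) lo = mid;
        else                                 hi = mid;
    }
    return 0.5 * (lo + hi);
}

// A closure converts implicitly to std::function<double(double)>:
// double r = find_root([](double x){ return x * x - 0.5; });

Because the body no longer depends on the functor's type, it can live in a source file and could even be made virtual, which is exactly what the template version above could not offer.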

Generally, std::function<R(T1, ..., TN)> is a functor class that wraps any functor object that takes N arguments of types T1, ..., TN and returns a value convertible to R. It provides template conversion constructors that accept such functor objects. In particular, closure types are implicitly converted to std::function. There are two hidden and preventable costs at construction.

First, the constructor takes the functor object by value and, therefore, a copy is made. Furthermore, the constructor forwards this copy to a series of helper functions, many of which also take it by value, making further copies. For instance, MSLIB and GCCLIB make a few copies, while Boost makes seven. However, the large number of copies is not the culprit for the biggest performance hit.

The second issue is related to the functor's size. The three implementations follow the standard's recommendation to apply a small-object optimization so as to avoid dynamic memory allocation. Basically, they use a data member to store a copy of the wrapped functor object. But because the object's size is known only at construction time, this member might not be big enough to hold the copy. In this case, the copy is allocated on the heap through a call to new (unless a custom allocator is given), and only a pointer to this copy is stored by the data member. The maximum size beyond which the heap is used depends on the implementation and alignment considerations. The best cases for common platforms are 16 bytes and 24 bytes for GCCLIB and Boost, respectively; MSLIB's limit is smaller still.
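You can watch the copying happen with a small experiment of my own devising (the printed count varies with the library, per the figures above). The spy functor counts its own copy constructions; since declaring a copy constructor suppresses the implicit move constructor, moves are counted too:

#include <cstdio>
#include <functional>

struct spy {
    static int copies;
    spy() {}
    spy(spy const&) { ++copies; }
    bool operator()(int i) const { return i % 2 == 0; }
};
int spy::copies = 0;

int main() {
    std::function<bool(int)> f = spy();  // wrap a temporary functor
    std::printf("copies made: %d\n", spy::copies);
}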

Improving Performance of std::function

Clearly, to address these performance issues, copies of the functor should be avoided. The natural idea is working with references instead of copies. However, we all know that this is not generally possible, because you might want the std::function object to outlive the original functor.

This is an old issue, as STL algorithms also take functor arguments by value. A good solution was implemented by Boost a long time ago and is now part of C++11 as well.

The template class std::reference_wrapper<T> wraps a reference to an object and provides automatic conversion to T&,

making the std::reference_wrapper usable in many circumstances where the wrapped type is expected. The size of std::reference_wrapper is the size of a reference and, thus, small. Additionally, there are two template functions, std::ref and std::cref, to ease the creation of non-const and const std::reference_wrappers, respectively. (They act like std::make_pair does to create std::pairs.)

Back to the first example: To avoid the multiple copies of is_multiple_of (which actually don't cost much, since this is a small class) we can use:

std::count_if(v.begin(), v.end(), std::cref(is_multiple_of(n)));

Applying the same idea to the lambda expression yields:

std::count_if(v.begin(), v.end(), std::cref([n](int i){ return i%n == 0; }));

Unfortunately, things get a bit more complicated and depend on the compiler and library.

Boost in both compilers (change std::cref to boost::cref): It doesn't work, because boost::reference_wrapper is not a functor.

MSLIB: Currently, it doesn't work, but should in the near future. Indeed, to handle types returned by functor objects, MSLIB uses std::result_of which, in TR1, depends on the functor type having a member result_type, a typedef to the type returned by operator(). Notice that is_multiple_of has this member type, but the closure type doesn't (as per C++11). In C++11, std::result_of has changed and is now based on decltype. We are in a transition period, and MSVC still follows TR1 (http://social.msdn.microsoft.com/Forums/en/vclanguage/thread/4e438675-eb1e-42ef-b1df-7ae262234695), but the next release of MSLIB (https://connect.microsoft.com/VisualStudio/feedback/details/618807) is supposed to follow C++11.

GCCLIB: It works.

In addition, as per C++11, functor classes originating from lambda expressions are not adaptable: they don't contain the typedef members required by STL adaptors, and the following fails to compile:

std::not1([n](int i){ return i%n == 0; });

In this case, std::not1 requires argument_type; is_multiple_of defines it, but the closure type doesn't.

The previous issue takes a slightly different form when std::function is involved. By definition, std::function wraps functors. Hence, when its constructor receives an object of type std::reference_wrapper<T>, it assumes that T is a functor type and behaves accordingly. For instance, the following lines are legal with Boost and GCCLIB, but not yet with MSLIB (though they should be in the next release):

std::function<bool(int)> f(std::cref([n](int i){ return i%n == 0; }));
std::count_if(v.begin(), v.end(), f);

It's worth mentioning that std::function's wrappers around unary and binary functor classes are adaptable and can be given to STL adaptors (for example, std::not1).


I'm led to conclude that if you don't want heap storage (and custom allocators), then Boost and GCCLIB are good options. If you are aiming for portability, then you should use Boost.

For developers using MSLIB, the performance issue remains unsolved until the next release. For those who can't wait, here is a workaround that turns out to be portable (it works with GCC and MSVC).

The idea is obvious: Keep the closure type small. This size depends on the variables that are captured by the lambda expression (that is, that appear inside the square brackets []). For instance, the lambda expression previously seen,

[n](int i){ return i%n == 0; };

captures the variable int n and, for this reason, the closure type has an int data member holding a copy of n. The more identifiers we put inside [], the bigger the size of the closure type gets. If the aggregated size of all identifiers inside [] is small enough (for example, one int or one double), then the heap is not used.

One way to keep the size small is by creating a struct enclosing references to all identifiers that normally would go inside [], and putting only a reference to this struct inside []. You use the struct members in the body of the lambda expression. For instance, the following lambda expression

double a;
double b;
// ...
[a, b](double x){ return a * x + b; };

yields a closure type with at least 2 * sizeof(double) bytes, which is enough for MSLIB to use the heap. The alternative is:

double a;
double b;
// ...
struct {
    const double& a;
    const double& b;
} p = { a, b };
[&p](double x){ return p.a * x + p.b; };

In this way, only a reference to p is captured, which is small enough for MSLIB, GCCLIB, and Boost to avoid the heap.
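To see the effect on a given toolchain, compare the two closure sizes directly (a quick test of mine; the exact numbers are implementation-specific):

#include <cstdio>

int main() {
    double a = 2.0, b = 3.0;
    struct refs { const double& a; const double& b; };
    refs p = { a, b };

    auto big   = [a, b](double x){ return a * x + b; };   // copies two doubles
    auto small = [&p](double x){ return p.a * x + p.b; }; // copies one reference

    std::printf("big: %u bytes, small: %u bytes\n",
                (unsigned)sizeof(big), (unsigned)sizeof(small));
}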

A final word on the letter of the law: The standard allows implementations to give closure types different sizes and alignments. Therefore, in theory, the aforementioned workaround might not work. More precisely, the code remains legal, but the heap might be used if the closure type is big enough. However, neither MSVC nor GCC does this.

Acknowledgment

I would like to thank Lorenz Schneider and Victor for their comments and careful reading of this article.

Cassio Neri has a Ph.D. in Mathematics. He works in the FX Quantitative Research team at Lloyds Banking Group in London.


From the Vault

8 Simple Rules for Designing Threaded Applications

This entry from Dr. Dobb's in 2008 offers rules that still hold true for creating efficient threaded implementations of applications.

By Clay Breshears

The Threading Methodology used at Intel has four major steps: Analysis, Design & Implementation, Debugging, and Performance Tuning. These steps are used to create a multithreaded application from a serial base code. While the use of software tools for the first, third, and fourth steps is well documented, there hasn't been much written about how to do the Design & Implementation part of the process.

There are plenty of books published on parallel algorithms and computation. However, these tend to focus on message-passing, distributed-memory systems, or theoretical parallel models of computation that may or may not have much in common with realized multicore platforms. If you're going to be engaged in threaded programming, it can be helpful to know how to program or design algorithms for such models. Of course, these models are fairly limited, and many software developers will not have had the opportunity to be exposed to systems that need such specialized programming.

Multithreaded programming is still more art than science. This article gives eight simple rules that you can add to your palette of threading design methods. By following these rules, you will have more success in writing the best and most-efficient threaded implementation of your applications.

Rule 1. Be sure you identify truly independent computations.

You can't execute anything concurrently unless the operations that


would be executed in parallel can be run independently of each other. We can easily think of different real-world instances of independent actions being performed in order to satisfy a single goal. For example, building a house can involve many different workers with different skills: carpenters, electricians, glazers, plumbers, roofers, painters, masons, lawn wranglers, etc. There are some obvious scheduling dependencies between pairs of workers (you can't put on roof shingles before the walls are built, and you can't paint the walls until the drywall is installed), but for the most part, the people involved in building a house can work independently of each other.

Another real-world example would be a DVD rental warehouse. Orders for movies are collected and then distributed to the workers, who go out to where all the discs are stored and find copies to satisfy their assigned orders. Pulling out My Fair Lady by one worker does not interfere with another worker who is looking for The Terminator, nor will it interfere with a worker trying to locate episodes from the second season of Seinfeld. (We can assume that any conflicts that would result from unavailable inventory have been dealt with before orders are transmitted to the warehouse.) Also, the packaging and mailing of each order will not interfere with disc searches or the shipping and handling of any other order.

There are cases where you will have exclusively sequential computations that cannot be made concurrent; many of these will be dependencies between loop iterations or steps that must be carried out in a specific order. An example of the latter is a pregnant reindeer. The normal gestation period is about eight consecutive months, so you can't get a calf by putting eight cows on the job for one month. However, if Santa wanted to field a whole new sled team as soon as possible, he could have eight cows carrying his future team all at the same time.

Rule 2. Implement concurrency at the highest level possible.

There are two directions you can take when approaching how to thread a serial code: bottom-up and top-down. In the Analysis phase of the Threading Methodology, you identify the segments of your code that take the most execution time (the hotspots). If you are able to run those code portions in parallel, you will have the best chance of achieving the maximum performance possible.

In a bottom-up approach, you would attempt to thread the hotspots in your code. If this is not possible, you can search up the call stack of the application to determine if there is another place in the code that can be run in parallel and still executes the hotspot code. For example, if you have a picture compression application, you could divide the processing of the picture into separate, independent regions to be processed in parallel. Even if it is possible to employ concurrency directly at the hotspot code, you should still look to see if it would be possible to implement that concurrency at a point in the code higher up the call stack. This can increase the granularity of the work done by each thread.

With the top-down approach, you first consider the whole application, what the computation is coded to accomplish and, at an abstract level, all the parts of the app that combine to realize that computation. If there is no obvious concurrency, you should distill the parts of the computation into successively smaller parts until you can identify independent computations. Results from the Analysis phase can


guide your investigation to include the most time-consuming modules. Consider threading a video encoding application. You can start at the lowest level of independent pixels within a single frame, or realize that groups of frames can be processed independent of other groups. If the video encoding app is expected to process multiple videos, expressing your parallelism at that level may be easier to write and will be at the highest level of possible concurrency.

The granularity of concurrent computations is loosely defined as the amount of computation done before synchronization is needed. The longer the time between synchronizations, the coarser the granularity will be. Fine-grained parallelism runs the danger of not having enough work assigned to threads to overcome the overhead costs of using threads. Adding more threads, when the amount of computation doesn't change, only exacerbates the problem. Coarse-grained parallelism has lower overhead costs and also tends to be more readily scalable to an increase in the number of threads. Top-down approaches to threading (or driving the point of threading as high in the call stack as possible) are the best options to achieve a coarse-grained solution.
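As a sketch of the difference in C++11 terms (my example, not the article's; encode_frame_group is a stand-in for substantial per-frame work):

#include <algorithm>
#include <thread>
#include <vector>

// Stand-in for real work on a contiguous group of frames.
void encode_frame_group(int first, int last) {
    for (int f = first; f < last; ++f) { /* encode frame f */ }
}

// Coarse-grained: each thread gets a whole group of frames and
// synchronizes exactly once, at join time. A fine-grained version
// that spawned a task per pixel would pay the thread-management
// overhead millions of times for tiny pieces of work.
void encode_coarse(int frames, int threads) {
    std::vector<std::thread> pool;
    int group = (frames + threads - 1) / threads;
    for (int t = 0; t < threads; ++t) {
        int first = t * group;
        int last  = std::min(frames, first + group);
        if (first < last)
            pool.emplace_back(encode_frame_group, first, last);
    }
    for (auto& th : pool) th.join();  // the single synchronization point
}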

Rule 3. Plan early for scalability to take advantage of increasing numbers of cores.

Processors have recently gone from being dual-core to quad-core, and Intel has announced the 80-core Teraflop chip. The number of cores available in future processors will only increase. Thus, you should plan for such processor increases within your software. Scalability is the measure of an application's ability to handle changes, typically increases in system resources (number of cores, memory size, bus speed) or data set sizes. In the face of more available cores, you should write flexible code that can take advantage of different numbers of cores.