Dr. Dobb's Journal Digital Issue, February 2012


Dr. Dobb's Journal
February 2012

ALSO INSIDE

The Need to Rewrite Established Algorithms >>
Efficient Use of Lambda Expressions and std::function >>
From the Vault: 8 Simple Rules for Designing Threaded Applications >>

Welcome to the Jungle

Parallel Programming

Herb Sutter told you when the free lunch was over; now he leads you through the parallel hardware jungle.


CONTENTS

COVER STORY

Welcome to the Jungle
By Herb Sutter
The transitions to multicore processors, GPU computing, and HaaS cloud computing are not separate trends, but aspects of a single trend: mainstream computers from desktops to smartphones are being permanently transformed into heterogeneous supercomputer clusters. Henceforth, a single compute-intensive application will need to harness different kinds of cores, in immense numbers, to get its job done. The free lunch is over. Welcome to the hardware jungle.

Efficient Use of Lambda Expressions and std::function
By Cassio Neri
Functors and std::function implementations vary widely between libraries. C++11's lambdas make them more efficient.

Editorial: The Need to Rewrite Established Algorithms
By Andrew Binstock
Parallel architectures, like other hardware advances before them, require us to rewrite algorithms and data structures, especially the old standbys that have served us well.

From the Vault: 8 Simple Rules for Designing Threaded Applications
By Clay Breshears
Multithreaded programming is still more art than science. This article gives eight simple rules that you can add to your palette of threading design methods. By following these rules, you will have more success in writing the best and most-efficient threaded implementation of your applications.

Links
Snapshots of the most interesting items on drdobbs.com, including cross-platform development with Eclipse CDT, deployment with Amazon's Elastic Beanstalk, and more.

Editorial and Business Contacts


More on DrDobbs.com

Boost Performance for Y…
The Android NDK is a tool… components that make use of … applications.
http://drdobbs.com/go-paral…

Seeing the Light with Ba…
Use the processing speed … to explore different possible … puzzles.
http://drdobbs.com/go-pa…design/232300953

The Best of 2011
The most popular articles o… plus some additional pieces p… for consideration by our staff.
http://drdobbs.com/232301271

Booting an Intel Architecture System: Early Initialization
The boot sequence today is very different than it was even a decade ago. Here's a step-by-step walkthrough of the boot process.
http://drdobbs.com/parallel/…

Two Different Kinds of Optimization
Experience with SPITBOL suggests that there are at least two fundamentally different kinds of optimization, and that the advice … applies only to one of those.
http://drdobbs.com/blogs/cp…


The Need to Rewrite Established Algorithms

By Andrew Binstock

Parallel architectures, like other hardware advances before them, require us to rewrite algorithms and data structures, especially the old standbys that have served us well.

A central point of developer wisdom is to reuse code, especially data structures and collections. A few decades ago, it was common for C programmers to write innumerable implementations of linked lists from scratch. The code became almost a muscle memory as you banged it out. Today, such an exercise is more the result of ignoring established and well-tested options than of coding prowess. Except in exigent circumstances, writing your own collections has the whiff of cowboy programming.

It's safe to say that, for the most part, you should not be writing your own data structures or basic algorithms (sorts, checksums, encryption, calendars, etc.). However, this principle has a recurring exception that needs to be acknowledged; namely, that advances in hardware must find their way promptly into the implementations of common algorithms.

In a 1996 article on hashing efficiency that I wrote (http://drdobbs.com/database/184409859), I discussed the then-significant problem of memory latency on hash-table design. Basically, the concern was that every bucket that was not in cache created a significant stall as the processor waited for the long memory fetch. I suggested that on closed hash tables, nonlinear rehashing until an empty slot was found was a costly operation. Linear rehashing (moving to the closest empty slot) worked better. The problem of memory latency and small caches, in those days, made algorithm and data-structure selection a task best completed with care.
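To make the linear-rehashing idea concrete, here is a minimal sketch of a closed (open-addressing) hash table in C++; it is my illustration of the general technique, not code from the 1996 article. On a collision, the probe simply steps to the adjacent slot, so successive probes walk sequentially through memory and tend to stay within already-fetched cache lines:

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    // Closed hash table with linear rehashing: on a collision, step to
    // the next slot, so probes touch adjacent memory (cache-friendly).
    class LinearProbeTable {
        struct Slot { bool used; std::string key; int value; };
        std::vector<Slot> slots_;
    public:
        explicit LinearProbeTable(std::size_t capacity)
            : slots_(capacity, Slot{false, std::string(), 0}) {}

        bool insert(const std::string& key, int value) {
            std::size_t h = std::hash<std::string>()(key) % slots_.size();
            for (std::size_t n = 0; n < slots_.size(); ++n) {
                Slot& s = slots_[(h + n) % slots_.size()];  // linear rehash
                if (!s.used || s.key == key) {
                    s.used = true; s.key = key; s.value = value;
                    return true;
                }
            }
            return false;  // table is full
        }

        const int* find(const std::string& key) const {
            std::size_t h = std::hash<std::string>()(key) % slots_.size();
            for (std::size_t n = 0; n < slots_.size(); ++n) {
                const Slot& s = slots_[(h + n) % slots_.size()];
                if (!s.used) return nullptr;        // empty slot: absent
                if (s.key == key) return &s.value;  // found
            }
            return nullptr;  // scanned the whole table
        }
    };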

The expansion of processor caches changed that calculus insofar as algorithms were concerned. Unless you have a comp-sci background, the terms "cache-aware" and "cache-oblivious" algorithms might be new to you. Cache-aware implementations tend to uncover the size of the cache on the target machine and then size the data structures and algorithms to minimize memory fetches.


Success in this can represent significant performance gains, at the cost of some portability. Some libraries, frequently those provided by processor vendors (such as Intel and AMD, in particular) or specialized development houses, provide these implementations. Intel's Integrated Performance Primitives library (http://drdobbs.com/go-parallel/blogs/cpp/232300486), for example, checks the runtime platform characteristics and brings in the right binaries for optimal performance.
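As a rough illustration of what sizing an algorithm to the cache means in practice (my sketch, not IPP code; the block size here is an assumed tuning constant rather than a detected one), this matrix transpose works in tiles small enough that the source and destination tiles stay cached for the whole inner loop:

    #include <cstddef>
    #include <vector>

    // Cache-aware transpose of an n x n row-major matrix: work in
    // BLOCK x BLOCK tiles so each pair of tiles fits comfortably in
    // cache (32 x 32 doubles = 8 KB per tile; a cache-aware library
    // would pick this from the detected cache size).
    const std::size_t BLOCK = 32;

    void transpose(const std::vector<double>& src,
                   std::vector<double>& dst, std::size_t n) {
        for (std::size_t ib = 0; ib < n; ib += BLOCK)
            for (std::size_t jb = 0; jb < n; jb += BLOCK)
                // Both tiles stay hot in cache during this inner pass.
                for (std::size_t i = ib; i < ib + BLOCK && i < n; ++i)
                    for (std::size_t j = jb; j < jb + BLOCK && j < n; ++j)
                        dst[j * n + i] = src[i * n + j];
    }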

For most applications, however, we're dependent on the standard libraries provided with the language. (Intel's IPP library, for example, comes only for native code. Java and .NET are supported only with wrappers.) Language providers eventually do deliver library updates, but the progress can be frustratingly slow and the work uneven. The delivery of Java's support for multithread-friendly collections is a case in point. Scala's multithreaded collections were a major draw because they came at a time when Java's collections did not work well enough.

Not only are better libraries needed, but even within standard libraries, the choice of data structures is becoming more complex. In an excellent article explaining why linked lists are passé (http://drdobbs.com/go-parallel/blogs/parallel/232400466), Dr. Dobb's blogger Clay Breshears discusses why trees make a better and more parallel-friendly data structure than the ever-sequential linked list. This is exactly the kind of nuance that should keep us vigilant against lazily accepting a static view of which algorithms and data structures to choose. Everyone knows, don't they, that linked lists are faster than trees?


And yet, even this mainstay of obvious logic is now changing beneath our feet.

The imminent era of manycore processors is likely to bring other changes to the fore. I especially expect that sort routines will be dramatically affected. Quicksort will no longer be the default sorting algorithm. The choice of sort will be more carefully matched to the needs of the data and the capabilities of the platform. We already see this on a macro level in the new world of big data. Map-reduce at scale depends upon sorts being done in smaller increments and reassembled through a merge function. And even there, the basic sorting has to be capable of handling billions of data items. In which case, grabbing an early item and making it the pivot element for millions of other entries (Quicksort) can have unfortunate consequences on performance.
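The sort-in-increments-then-merge shape described here is easy to sketch with C++11's standard threads and algorithms (my illustration, not code from the editorial): each worker sorts its own slice, and a merge pass reassembles the whole.

    #include <algorithm>
    #include <thread>
    #include <vector>

    // Sort two halves independently (one on a second thread), then
    // merge them -- the smaller-increments-plus-merge pattern used at
    // far larger scale by map-reduce systems.
    void two_way_parallel_sort(std::vector<int>& data) {
        std::vector<int>::iterator mid = data.begin() + data.size() / 2;
        std::thread worker([&data, mid] { std::sort(data.begin(), mid); });
        std::sort(mid, data.end());  // second half on the calling thread
        worker.join();
        std::inplace_merge(data.begin(), mid, data.end());  // reassemble
    }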

Between the proliferation of cores, the rapid expansion and faster performance of RAM, and the huge increase in data volumes, traditional choices of algorithms and data structures are no longer inherently safe or appropriate at all. Once again, they must be selected with considerable care and forethought.

Andrew Binstock is Editor in Chief for Dr. Dobb's and can be contacted at [email protected].


Welcome to the Jungle

By Herb Sutter

The free lunch is over. Welcome to the hardware jungle.

In the twilight of Moore's Law, the transitions to multicore processors, GPU computing, and hardware or infrastructure as a service (HaaS) cloud computing are not separate trends, but aspects of a single trend: mainstream computers from desktops to smartphones are being permanently transformed into heterogeneous supercomputer clusters. Henceforth, a single compute-intensive application will need to harness different kinds of cores, in immense numbers, to get its job done. The free lunch is over. Welcome to the hardware jungle.

From 1975 to 2005, our industry accomplished a phenomenal mission: In 30 years, we put a personal computer on every desk, in every home, and in every pocket.

In 2005, however, mainstream computing hit a wall. In "The Free Lunch Is Over (A Fundamental Turn Toward Concurrency in Software)" (http://is.gd/RHSOzm), I described the reasons for the then-upcoming industry transition from single-core to multicore CPUs in mainstream machines, why it would require changes throughout the software stack from operating systems to languages to tools, and why it would permanently affect the way we as software developers have to write our code if we want our applications to continue exploiting Moore's transistor dividend.

In 2005, our industry undertook a new mission: Put a personal parallel supercomputer on every desk, in every home, and in every pocket. 2011 was special: It's the year that we completed the transition to parallel computing in all mainstream form factors, with the arrival of multicore tablets (such as iPad 2, Playbook, Kindle Fire, and Nook Tablet) and smartphones (for example, Galaxy S II, Droid X2, and iPhone 4S). 2012 will see the continued build out of multicore with mainstream quad- and eight-core tablets (as Windows 8 brings a modern tablet experience to x86 as well as ARM), and the last single-core gaming console holdout will go multicore (as Nintendo's Wii U replaces the Wii; http://is.gd/sBuPtr).

It took us just six years to deliver mainstream parallel computing in all popular form factors. And we know the transition is permanent, because multicore delivers compute performance that a single core cannot, and there will always be mainstream applications that run better on a multicore machine. There's no going back.

For the first time in the history of computing, mainstream hardware is no longer a single-processor von Neumann machine, and it will never be again.

That was the first act.


Overview: Trifecta

It turns out that multicore is just the first of three related permanent transitions that layer on and amplify each other, as the timeline in Figure 1 illustrates.

1. Multicore (2005-). As explained previously.

2. Heterogeneous cores (2009-). A single computer already typically includes more than one kind of processor core, as mainstream notebooks, consoles, and tablets all increasingly have both CPUs and compute-capable GPUs. The open question in the industry today is not whether a single application will be spread across different kinds of cores, but only how different the cores should be. That is, whether they should be basically the same, with similar instruction sets but in a mix of a few big cores that are best at sequential code plus many smaller cores best at running parallel code (the Intel MIC (http://is.gd/I2iB09) model slated to arrive in 2012-2013, which is easier to program); or cores with different capabilities that may only support subsets of general-purpose languages (the current Cell and GPGPU model, which requires more effort, including language extensions and subsets).

Heterogeneity amplifies the first trend (multicore): If some of the cores are smaller, then we can fit more of them on the same chip. Indeed, 100x and 1,000x parallelism is already available on today's mainstream home machines for programs that can harness it.

We know the transition to heterogeneous cores is permanent, because different kinds of computations naturally run faster and/or use less power on different kinds of cores, and different parts of the same application will run faster and/or cooler on a machine with several different kinds of cores.

3. Elastic compute cloud cores (2010-). For our purposes, "cloud" means specifically HaaS: delivering access to more computational hardware as an extension of the mainstream machine. This started to hit the mainstream with commercial compute cloud offerings from Amazon Web Services (AWS), Microsoft Azure, Google App Engine (GAE), and others.

Cloud HaaS again amplifies both of the first two trends, because it is fundamentally about deploying large numbers of nodes, where each node is a mainstream machine containing multiple and heterogeneous cores. In the cloud, the number of cores available to a single application is scaling fast. In mid-2011, Cycle Computing delivered a 30,000-core cloud for under $1,300/hour (http://is.gd/…) using AWS. The same heterogeneous cores are available in the cloud: For example, AWS already offers Cluster GPU nodes with dual NVIDIA Tesla M2050 GPU cards, enabling massively parallel and distributed CUDA applications.


In short, parallelism is not just in full bloom, but increasingly in full variety. In this article, I develop four key points:

1. Moore's End. We can observe clear evidence that Moore's Law is ending, because we can point to a pattern that precedes the end of exploiting any kind of resource. But there's no reason to panic, because Moore's Law limits only one kind of scaling, and we have already started another kind.

2. Mapping one trend, not three. Multicore, heterogeneous cores, and HaaS cloud computing are not three separate trends, but aspects of a single trend: putting a personal heterogeneous supercomputer cluster on every desk and in every pocket.

3. The effect on software development. As software developers, we will be expected to enable a single application to exploit a jungle of enormous numbers of cores that are increasingly different in kind (specialized for different tasks) and different in location (from local to very remote; on-die, in-box, on-premises, in-cloud). The jungle of heterogeneity will continue to spur deep and fast evolution of mainstream software development, but we can predict what some of the changes will be.

4. Three distinct near-term stages of Moore's End. And why smartphones aren't, really.

Let's begin with the end of Moore's Law.

Mining Moore's Law

We've been hearing breathless "Moore's Law is ending" announcements for years. That Moore's Law would end was never news; every exponential progression must. Although it didn't end when some prognosticators expected, its end is possible to forecast; you just have to know what to look for, and that is diminishing returns.

A key observation is that exploiting Moore's Law is like mining a gold mine or any other kind of resource. Exploiting a resource never just stops abruptly; rather, running a mine goes through phases of increasing costs and diminishing returns until finally the gold left in that patch of ground is no longer commercially exploitable and operating the mine is no longer profitable.

Mining Moore's Law has followed the same pattern. It has gone through three major phases, where we are now in transition from Phase II to Phase III. And throughout this discussion, never forget that the only reason Moore's Law is interesting at all is because we can transform its raw resource (more transistors) into a useful form (greater computational throughput or lower cost).

Phase I, Moore's Motherlode = Unicore Free Lunch (1975-2005)

When you first find an ore deposit and open a mine, you concentrate your efforts on the motherlode, where everybody gets to enjoy a high yield and a low cost per pound of gold extracted.


For 30 years, mainstream processors mined Moore's motherlode by using their growing transistor budgets to make a single core more and more complex so that it could execute a single thread faster. This was wonderful because it meant the performance was easily exploitable: compute-bound software would get faster with relatively little effort. Mining this motherlode in mainstream microprocessors went through two main subphases as the pendulum swung from simpler to increasingly complex cores:

In the 1970s and 1980s, each chip generation could use most of the extra transistors to add One Big Feature (such as an on-die floating point unit, pipelining, out of order execution) that would make single-threaded code run faster.

In the 1990s and 2000s, each chip generation started using the extra transistors to add or improve two or three smaller features that would make single-threaded code run faster, then four or five or six smaller features, and so on.

Figure 2 shows how the pendulum swung toward increasingly complex single cores, with three sample chips, from the 80286 to the Pentium Extreme Edition 840. Note that the chips' boxes are sized by number of transistors.

By 2005, the pendulum had swung about as far as it could go toward the complex single-core model. Although the motherlode is mostly exhausted, we're still scraping ore off its walls; this yields some continued improvement in single-threaded performance, but no longer at the historically delightful exponential rate.

Phase II, Secondary Veins = Homogeneous Multicore (2005-)

As a motherlode gets used up, miners concentrate on secondary veins that are still profitable but have a more moderate yield and higher cost per pound of extracted gold. So when Moore's unicore motherlode started getting mined out, the industry turned to mining Moore's secondary veins: using the additional transistors to make more cores per chip. Multicore let us continue to deliver exponentially increasing compute throughput in mainstream computers, but in a form that was less easily exploitable, because it placed a greater burden on software developers, who had to write parallel programs that could use the hardware.

Moving into Phase II took a lot of work in the software world. We had to learn to write new "free lunch" applications: ones that have lots of latent parallelism and so can once again ride the wave of running the same executable faster on next year's hardware (hardware that still delivers exponential performance gains, but primarily in the form of additional cores). Today, there are parallel runtimes and libraries like Intel Threading Building Blocks (TBB) and Microsoft Parallel Patterns Library (PPL), parallel debuggers and parallel profilers, and updated operating systems to run them all.

But this time the phase didn't last 30 years. We barely have time to catch our breath, because Phase III is already beginning.

Phase III, Tertiary Veins = Heterogeneous Cores (2011-)

As our miners are forced to move into smaller and smaller veins, yields diminish and costs rise. The miners are turning to Moore's tertiary veins: using Moore's extra transistors to make not just more cores, but also different kinds of cores, and in very large numbers, because the different cores are often smaller and swing the pendulum back toward the left.

There are two main categories of heterogeneity; see Figure 3.

Big/fast vs. small/slow cores. The smallest amount of heterogeneity is when all the cores are general-purpose cores with the same instruction set, but some cores are beefier than others because they have more hardware to accelerate execution (notably by hiding memory latency using various forms of internal concurrency). In this model, some cores are big complex ones that are optimized to run the sequential parts of a program really fast, while others are smaller cores optimized to get better total throughput for the scalably parallel parts of the program. However, even though they use the same instruction set, the compiler will often want to generate different code, and the difference can become visible to the programmer if the programming language must expose ways to control code generation. Intel follows this approach with Xeon (big/fast) and MIC (small/slow) cores that share approximately the x86 instruction set.

General vs. specialized cores. Beyond that, we see systems with multiple cores having different capabilities, including some that may not be able to support all of a mainstream language. In 2006-2007, with the arrival of the PlayStation 3, the Cell processor led the way by incorporating different kinds of cores on the same chip, with a single general-purpose core assisted by eight special-purpose SPU cores. Since 2009, we have begun to see mainstream use of GPUs to perform computation instead of just graphics. Specialized cores like SPUs and GPUs are attractive when they can run certain kinds of code more efficiently, both faster and more cheaply, which is a great bargain if your workload fits it.

GPGPU is especially interesting because we already have a large underutilized installed base: A significant percentage of existing mainstream machines already have compute-capable GPUs just waiting to be exploited. With the June 2011 introduction of AMD Fusion and the November 2011 launch of NVIDIA Tegra 3, systems with CPU and GPU cores on the same chip are becoming the new norm.


That installed base is a big carrot, and it creates an enormous incentive for compute-intensive mainstream applications to leverage that patiently waiting hardware. To date, a few early adopters have been using technologies like CUDA, OpenCL, and more recently C++ AMP to harness GPUs for computation. Mainstream application developers who care about performance need to learn to do the same; see Table 1.

But that's pretty much it: we currently know of no other major ways to exploit Moore's Law for compute performance, and once these veins are exhausted, it will be largely mined out.

We're still actively mining for now, but the writing on the wall is clear: "mene mene" diminishing returns demonstrate that we've entered the endgame.


On The Charts: Not Three Trends, But One Trend

Next, let's put all of this in perspective by showing that multicore, hetero-core, and cloud-core are not three trends, but aspects of a single trend. To show that, we have to show that they can be drawn on the same map. Figure 4 shows an appropriate map that lets us see where processor core architectures are going and where memory architectures are going, and visualize just where we've been digging around in the mine so far.

First, I describe each axis, then map out past and current hardware to spot trends, and finally draw some conclusions about where hardware is likely to concentrate.

Processor Core Types

The vertical axis shows processor core architectures. As shown in Figure 5, from bottom to top, they form a continuum of increasing performance and scalability, but also of increasing restrictions on programs and programmers, in the form of additional performance issues (yellow) or correctness issues (red) added at each step.

Complex cores are the big traditional ones, whose pendulum has swung far to the right in the habitable zone. These are best at running sequential code, including code limited by Amdahl's Law.

Simpler cores are the small traditional ones, toward the left of the habitable zone. These are best at running parallelizable code that still requires the full expressivity of a mainstream programming language.

Specialized cores like those in GPUs, DSPs, and Cell's SPUs are more limited, and often do not yet fully support all features of mainstream languages (such as exception handling). These are best at running highly parallelizable code that can be expressed in a subset of a language like C or C++.


For example, Xbox Kinect skeletal tracking requires using both the CPU and the GPU cores on the console, and would be impossible otherwise.

The farther you move upward on the chart (to the right in the blown-up figure), the better the performance throughput and/or the less power you need, but the more the application code is constrained, as it has to be more parallel and/or use only subsets of a mainstream language.

Future mainstream hardware will likely contain all three basic kinds of cores, because many applications have all these kinds of code in the same program, and so naturally will run best on a heterogeneous computer that has all these kinds of cores. For example, Kinect skeletal tracking, all Kinect games, and all CUDA/OpenCL/C++ AMP applications available today could not run well or at all on a homogeneous machine, because they rely on running parts of the same application on the CPU(s) and other parts on specialized cores. Those apps are just the beginning.

Memory Architectures

The horizontal axis shows six common memory architectures. From left to right, they form a continuum of increasing performance and scalability, but (except for one important discontinuity) also of additional work for programs and programmers to deal with performance issues (yellow) or correctness issues (red). In Figure 6, the peaks represent cache and the lower boxes represent RAM; a processor core sits at the top of each cache peak.

Unified memory is tied to the unicore motherlode. This memory hierarchy is wonderfully simple: a single mountain with one core sitting on top. This describes essentially all mainstream computers from the dawn of computing until the mid-2000s. This is the simplest programming model: Every pointer (or object reference) can address every byte, and every byte is equally far away from the core.


Even here, programmers need to be conscious of at least two basic cache effects: locality, or how well hot data fits into cache; and access order, because modern memory architectures love sequential access patterns (for more on this, see my Machine Architecture talk at http://is.gd/1Fe99o).
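A hedged illustration of the access-order point (my example; the matrix is row-major and the sizes are arbitrary): the two loops below touch exactly the same bytes, but the first walks memory sequentially while the second strides across rows, which typically runs several times slower on cached hardware.

    #include <cstddef>
    #include <vector>

    // Same work, different access order over a row-major matrix.
    void sum_both_orders(const std::vector<double>& m, std::size_t rows,
                         std::size_t cols, double& by_row, double& by_col) {
        by_row = by_col = 0.0;
        for (std::size_t i = 0; i < rows; ++i)      // sequential walk:
            for (std::size_t j = 0; j < cols; ++j)  // cache-friendly
                by_row += m[i * cols + j];
        for (std::size_t j = 0; j < cols; ++j)      // strided walk: a fresh
            for (std::size_t i = 0; i < rows; ++i)  // cache line per element
                by_col += m[i * cols + j];
    }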

NUMA cache retains a single chunk of RAM, but adds multiple caches. Now, instead of a single mountain, we have a mountain range with multiple peaks, each with a core on top. This describes today's mainstream multicore devices. Here, we still enjoy a single address space and pretty good performance, as long as different cores access different memory, but programmers now have to deal with two main additional performance effects:

locality matters in new ways, because some peaks are closer to each other than others (two cores that share an L2 cache vs. two cores that share only L3 or RAM); and layout matters, because we have to keep data physically close together if it's used together (on the same cache line), and apart if it's not (for example, to avoid the ping-pong game of false sharing, sketched in the example below).
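Here is a minimal sketch of that false-sharing effect (my example, assuming C++11's alignas and a typical 64-byte cache line): two counters hammered by two threads will ping-pong a shared cache line if they sit side by side, and stop interfering once each is padded onto its own line.

    #include <atomic>
    #include <thread>

    // Two hot counters. Packed together they would share one cache
    // line, and the two writing cores would invalidate each other
    // constantly; alignas(64) gives each counter its own line.
    struct alignas(64) PaddedCounter {
        std::atomic<long> value{0};
    };

    PaddedCounter counters[2];

    void hammer(int which) {
        for (long i = 0; i < 10000000; ++i)
            counters[which].value.fetch_add(1, std::memory_order_relaxed);
    }

    int main() {
        std::thread a(hammer, 0), b(hammer, 1);
        a.join();
        b.join();
    }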

NUMA RAM further fragments memory into multiple physical chunks of RAM, but still exposes a single logical address space. Now, the performance valleys between the cores get deeper, because accessing RAM in a chunk not local to this core incurs a trip across the bus. Examples include bladed servers, symmetric multiprocessor (SMP) desktop computers with multiple sockets, and newer GPU architectures that provide a unified address space view of the CPU's and GPU's memory, but leave some memory physically closer to the CPU and other memory physically closer to the GPU. Now we add another item to the list of things a performance-conscious programmer needs to think about: just because we can form a pointer to anything doesn't mean we always should, if it means reaching across an expensive chasm.

Incoherent and weak memory makes memory be less synchronized, in the hope that allowing each core to have a looser view of the state of memory can make them run faster, even though the memory must inevitably be synchronized again. As of this writing, the only remaining mainstream CPUs with weak memory models are current PowerPC and ARM processors (popular despite their memory models rather than because of them; more on this below). This model still has the simplicity of a single address space, but now the programmer further has to take on the burden of synchronizing memory.

Disjoint (tightly coupled) memory bites the bullet: Different cores see different memory, typically over a shared bus, while still running as a tightly coupled unit that has low latency and whose reliability is still evaluated as a single unit. Now the picture is a tightly clustered group of mountainous islands, each with its own mountains of cache overlooking square miles of RAM, connected by bridges with a fleet of trucks expediting goods (point-to-point bulk transfer operations, message queues, and the like). In the mainstream, we see this model used by 2009-generation GPUs, whose on-board memory is not shared with the CPU or with each other. True, programmers no longer enjoy having a single address space and the ability to share pointers. But in exchange, we have removed the entire set of programmer burdens accumulated so far, and replaced them with a single new responsibility: copying data between islands of memory.


Disjoint (loosely coupled) is the cloud, where cores spread out-of-box into different rooms, buildings, and datacenters. This moves the islands farther apart, and replaces the bus bridges with network speedboats and tankers. In the mainstream, we see this model in HaaS cloud computing offerings; this is the commoditization of the compute cluster. Programmers now have to arrange to deal with two additional concerns, which often can be abstracted away by libraries and runtimes: reliability, as nodes can come and go; and latency, as the islands are farther apart.

Charting the Hardware

All three trends are just aspects of a single trend: filling out the chart and enabling heterogeneous parallel computing. Figure 7 shows how the chart wants to be filled out, because there are workloads naturally suited to each of these boxes, though some boxes are more popular than others.

To help visualize the filling-out process more concretely, let's spot-check how mainstream hardware has progressed on this chart. The easiest place to start is the long-standing mainstream CPU and the more recent GPU:

From the 1970s to the 2000s, CPUs started with simple single cores and then moved downward as the pendulum swung toward increasingly complex cores. They hugged the left side of the chart by staying single-core as long as possible, but eventually ran out of room and turned toward multicore NUMA cache architectures; see Figure 8.

Meanwhile, in the late 2000s, mainstream GPUs became capable of handling computational workloads. Because they started life in an add-on discrete GPU card format, where the graphics-specific cores and memory were physically separate from the CPU and system RAM, they started further toward the right (Specialized / Disjoint (local)). GPUs have been moving leftward to increasingly unified views of memory, and slightly downward to try to support full mainstream languages (such as adding exception handling support).


Today's typical mainstream computer includes both a CPU and a discrete or integrated GPU. The dotted line in the graphic denotes cores that are available to a single application because they are in the same device, but not on the same chip.

Now we are seeing a trend to use CPU and specialized (currently GPU) cores with very tightly coupled memory, and even on the same die:

In 2005, the Xbox 360 sported a multicore CPU and GPU that could not only directly access the same RAM, but had the very unusual feature that they could share even L2 cache.

In 2006 and 2007, the Cell-based PS3 console sported a single processor having both a single general-purpose core and eight special-purpose SPU cores. The solid line in Figure 9 denotes cores that are on the same chip, not just in the same device.

In June 2011 and November 2011, respectively, AMD and NVIDIA launched the Fusion and Tegra 3 architectures, mainstream chips that sported a compute-class GPU (hence higher up vertically) on the same die (hence well to the left).

Intel has also shipped the Sandy Bridge architecture, which includes an integrated GPU that is not yet fully compute-capable. Intel's main focus has been the MIC effort of about 50 simple, general-purpose x86-like cores on the same die, expected to be commercially available in the near future.

Finally, we complete the picture with cloud HaaS (Figure 10):

In 2008 and 2009, Amazon, Microsoft, Google, and others began rolling out their cloud compute offerings. AWS, Azure, and GAE support an elastic cloud of nodes, each of which is a traditional computer (big-core and loosely coupled, at or near the bottom right corner of the chart), where each node in the cloud has a single core or multiple CPU cores (the two lower-left boxes). As before, the dotted line denotes that all of the cores are available to a single application, and the network is just another bus to more compute cores.


Since November 2010, AWS also supports compute instances that contain both CPU cores and GPU cores, indicated by the H-shaped virtual machine where the application runs on a cloud of loosely coupled nodes with disjoint memory (right column), each of which contains both CPU and GPU cores (currently not on the same die, so the vertical lines are still dotted).

The Jungle

Putting it all together, we get a noisy profusion of life and color, as in Figure 11. This may look like a confused mess, so let's notice two things that help make sense of it.

First, every box has a workload that it's best at, but some boxes (particularly some columns) are more popular than others. Two boxes are particularly less interesting:

Fully unified memory models are only applicable to the unicore box, which is being essentially abandoned in the mainstream.

Incoherent/weak memory models are a performance experiment that is in the process of failing in the marketplace. On the hardware side, the theoretical performance benefits that come from letting caches work less synchronously have already been largely duplicated in other ways by mainstream processors having stronger memory models. On the software side, the mainstream general-purpose languages and environments (C, C++, Java, .NET) have largely rejected weak memory models, and require a coherent model that is technically called "sequential consistency for data race free programs" (http://is.gd/EmpCDn [PDF]) as either their only supported memory model (Java, .NET) or their default memory model (ISO C++11, ISO C11). Nobody is moving toward the weak end of the incoherent/weak memory strip of the chart; at most, some are moving through it to get to the other side, but nobody wants to stay there.
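To make "sequential consistency for data-race-free programs" concrete, here is a minimal C++11 sketch (my illustration): because the flag is a std::atomic with the default sequentially consistent ordering, the program has no data race, and the reader is guaranteed to observe the write to data once it sees the flag set.

    #include <atomic>
    #include <cassert>
    #include <thread>

    int data = 0;                    // ordinary, non-atomic data
    std::atomic<bool> ready(false);  // default ordering: seq_cst

    int main() {
        std::thread producer([] {
            data = 42;               // ordered before the flag store
            ready.store(true);       // publish (sequentially consistent)
        });
        std::thread consumer([] {
            while (!ready.load()) {} // spin until the flag is visible
            assert(data == 42);      // guaranteed under SC-DRF
        });
        producer.join();
        consumer.join();
    }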

But all other boxes, including all rows (processor core types), are strongly represented, and we realize why that's true: Different parts of even the same application naturally want to run on different kinds of cores.

Second, let's clarify the picture by highlighting and labeling the two regions that hardware is migrating toward, in Figure 12.


In Figure 12, again we see the first and fourth columns being de-emphasized, as hardware trends have begun gradually coalescing around two major areas. Both areas extend vertically across all kinds of cores, and the most important thing to note is that these represent two mines, where the area to the left is the Moore's Law mine.

Mine #1: Scale in = Moore's Law. Local machines will continue to use large numbers of heterogeneous local cores, either in-box (such as CPU with discrete GPU) or on-die (Sandy Bridge, Fusion, Tegra 3). We'll see core counts increase until Moore's Law ends, and then stabilize for individual local devices.

Mine #2: Scale out = distributed cloud. Much more importantly, we will continue to see a cornucopia of cores delivered via compute clouds, either on-premises (cluster, private cloud) or in public clouds. This is a brand new mine, directly enabled by the lower coupling of disjoint memory, especially loosely coupled distributed nodes.

The good news is that we can heave a sigh of relief at having found another mine to open. The even better news is that the new mine has a far faster growth rate than even Moore's Law. Notice the slopes of the lines when we graph the amount of parallelism available to a single application running on various architectures; see Figure 13. The bottom three lines are mining Moore's Law for scale-in growth; their common slope reflects Moore's wonderful exponent, just shifted upward or downward to account for how many cores of a given size can be packed onto the same die. The top two lines are mining the cloud (with CPUs and GPUs, respectively) for scale-out growth, and their slope is steeper still.


If hardware designers merely use Moore's Law to deliver more big fat cores, on-device hardware parallelism will stay in double digits for the next decade, which is very roughly when Moore's Law is due to sputter, give or take about a half decade. If hardware follows Niagara's and MIC's lead to go back to simpler cores, we'll see a one-time jump and then stay in triple digits. If we all learn to leverage GPUs, we already have 1,500-way parallelism in modern graphics cards (I'll say "cores" for convenience, though that word means something a little different on GPUs) and will likely reach five digits in the decade timeframe.

But all of that is eclipsed by the scalability of the cloud, whose growth line is already steeper than Moore's Law, because we're better at quickly deploying and using cost-effective networked machines than we've been at quickly jam-packing and harnessing cost-effective transistors. It's hard to get data on the current largest cloud deployments because many projects are private, but the largest documented public cloud apps (which don't use GPUs) are already harnessing over 30,000 cores for a single computation. I wouldn't be surprised if some projects are exceeding 100,000 cores today. And that's general-purpose cores; if you add GPU-capable nodes to the mix, add two more zeroes.

Such massive parallelism, already available for rates of under $1,300/hour for a 30,000-core cloud, is game-changing. If you doubt that, here is a boring example that doesn't involve advanced augmented reality or spook-level technomancery: How long will it take someone who's stolen a strong password file (which we'll assume is correctly hashed and salted and contains no dictionary passwords) to retrieve 90% of the passwords by brute force using a publicly available GPU-enabled compute cloud? Hint: An AWS dual-Tesla node can test on the order of 20 billion passwords per second, and clouds of 30,000 nodes are publicly documented (of course, Amazon won't say if it has that many GPU-enabled nodes for hire; but if it doesn't yet, it will soon). Multiply those two numbers and, to borrow a tired misquote (http://is.gd/PJ…): 640 trillion affordable attempts per second should be enough for anybody. If that's not enough for you, not to worry; just wait a few years and it'll be 640 quadrillion affordable attempts per second.

What It Means For Us: A Programmer's View

How will all of this change the way we write our software, if we care about harnessing mainstream hardware performance? The basic conclusions echo and expand upon ones that I proposed in "The Free Lunch Is Over":

Applications will need to be at least massively parallel, and ideally able to use non-local cores and heterogeneous cores, if they want to fully exploit the long-term continued exponential growth in compute throughput being delivered both in-box and in-cloud. After all, soon the vast majority of compute cores available to a mainstream application will be non-local.

Efficiency and performance optimization will get more, not less, important. We're being asked to deliver richer experiences, like sensor-based UIs and augmented reality, on less hardware (constrained mobile form factors, plus the eventual plateauing of scale-in when Moore's Law ends). In December 2004 I wrote: "Those languages that already lend themselves to heavy optimization will find new life; those that don't will need to find ways to compete and become more efficient and optimizable. Expect long-term increased demand for performance-oriented languages and systems." Witness the resurgence of interest in C++ in 2011, primarily because of its expressive flexibility and performance efficiency.


A program that is twice as efficient has two advantages: it will be able to run twice as well on a local disconnected device, especially when Moore's Law can no longer deliver local performance improvements in any form; and it will always be able to run at half the power and cost on an elastic compute cloud, even as those continue to expand for the indefinite future.

Programming languages and systems will increasingly be forced to deal with heterogeneous distributed parallelism. As previously predicted, just basic homogeneous multicore has proved to be a far bigger event for languages than even object-oriented programming was, because some languages (notably C) could get away with ignoring objects while still remaining commercially relevant for mainstream software development. No mainstream language, including the just-ratified C11 standard, could ignore basic concurrency and parallelism and stay relevant in even a homogeneous-multicore world. Now expect all mainstream languages and environments, including their standard libraries, to develop explicit support for at least distributed parallelism and probably also heterogeneous parallelism; they cannot hope to avoid it without becoming marginalized for mainstream app development.

Expanding on that last bullet, what are some basic elements we will need to add to mainstream programming models (think: C, C++, Java, and .NET)? Here are a few basics I think will be unavoidable, and that must be supported explicitly in one form or another.

Deal with the processor axis' lower section, by supporting compute cores with different performance (big/fast, slow/small). At minimum, mainstream operating systems and runtimes will need to be aware that some cores are faster than others, and know which parts of an application want to run on which of those cores.

Deal with the processor axis' upper section, by supporting language subsets, to allow for cores with different capabilities, including cores that do not fully support mainstream language features. In the next decade, a mainstream operating system, alone or augmented with an extra runtime (like the Java VM or .NET CLR, or the ConcRT runtime underpinning PPL), will be capable of managing cores with different instruction sets and running a single application across many of those cores. Programming languages and tools will be extended to let the developer express code that is restricted to use just a subset of a mainstream programming language (as with the restrict() qualifiers in C++ AMP, sketched below). I'm optimistic that for most mainstream languages a small language extension will be sufficient, while leaving intact the usual language rules for overloading and dispatch, and minimizing the impact on developers.
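For a taste of what that looks like today, here is a minimal C++ AMP fragment (C++ AMP is the Microsoft technology the article names; the surrounding code is my illustration of its published API). The lambda is marked restrict(amp), telling the compiler to check its body against the language subset the accelerator cores can execute:

    #include <amp.h>   // C++ AMP (Visual C++): array_view, parallel_for_each
    using namespace concurrency;

    void square_all(float* data, int n) {
        array_view<float, 1> av(n, data);   // wrap host data for the GPU
        parallel_for_each(av.extent,
            [=](index<1> i) restrict(amp) { // body must stay within the
                av[i] = av[i] * av[i];      // amp-restricted subset
            });
        av.synchronize();                   // copy results back to host
    }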

Deal with the memory axis for computation, by providing distributed algorithms that can scale not just locally but also across a compute cloud. Libraries and runtimes like TBB and PPL will be extended or duplicated to enable parallel-for-each and other algorithms that run on large numbers of local and non-local parallel cores. Today, we can write a parallel_for_each call that can run with 1,000x parallelism on a set of local discrete GPUs and ship the right data shards to the right compute cards and the results back.


Tomorrow, we need to be able to write that same call so that it runs with 1,000,000,000x parallelism on a set of cloud-based GPUs and ships the right data shards to the right nodes and the results back. This is a baby-step example in that it just uses local data (that can fit in a single machine's memory), but distributed computation; the data subsets are simply copied hub-and-spoke.

Deal with the memory axis for data, by providing distributed data containers, which can be spread across many nodes. The next step is for the data itself to be larger than any node's memory, and (preferably automatically) move the right data subsets to the right nodes of a distributed computation. For example, we need containers like a distributed_array or distributed_table that can be backed by multiple and/or redundant cloud storage, and then make those the target of the same distributed parallel_for_each call. After all, why shouldn't we write a single parallel_for_each call that efficiently updates a 100 petabyte table? Hadoop (http://hadoop.apache.org/) enables this today for specific workloads and with extra work; this will become a standard capability available out-of-the-box in mainstream language compilers and their standard libraries. A sketch of what such a container might look like follows.
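To show the shape of the API being imagined here, a deliberately hypothetical C++ sketch: distributed_array, shard, and for_each_shard are invented names standing in for the future containers and algorithms described above, not any real library.

    #include <vector>

    // Hypothetical sketch only: a container logically larger than any
    // one node's RAM, physically split into per-node shards. All names
    // here are illustrative inventions, not a shipping API.
    template <typename T>
    class distributed_array {
        struct shard { int node_id; std::vector<T> local_data; };
        std::vector<shard> shards_;  // one shard resident on each node
    public:
        // A distributed parallel_for_each would dispatch each shard's
        // work to the node that already holds that data, rather than
        // copying everything hub-and-spoke through the caller.
        template <typename Fn>
        void for_each_shard(Fn work) {
            for (typename std::vector<shard>::iterator s = shards_.begin();
                 s != shards_.end(); ++s)  // stand-in for remote dispatch,
                work(s->local_data);       // one task per node
        }
    };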

Enable a unified programming model that can handle the entire chart with the same source code. Since we've been able to draw all of this hardware on a single chart with just two degrees of freedom, the landscape is unified enough that it should be able to be spanned by a single programming model in the future. Such a model will have at least two basic characteristics: First, it will cover the Processor axis by letting the programmer express language subsets in a way integrated holistically into the language. Second, it will cover or hide the Memory axis by abstracting the location of data, and copying data subsets on demand by default, while also providing a way to take control of the copying for power users who want to optimize the performance of a specific computation.

Perhaps our most difficult mental adjustment, however, will be to learn to think of the cloud as part of the mainstream machine: to view all these local and non-local cores as being equally part of the target machine that executes our application, where the network is just another bus that connects us to more cores. That is, in a few years we will write code for mainstream machines assuming that they have million-way parallelism, of which only thousand-way parallelism is guaranteed to always be available (when out of WiFi range).

Five years from now, we want to be delivering apps that run well on an isolated device, and then just run faster or better when they are in WiFi range and have dynamic access to many more cores. Every part of our operating systems, runtimes, libraries, programming languages, and tools needs to get us to a place where we can write compute-bound applications that run well in isolation on disconnected devices with 1,000-way local parallelism, and when the device is connected,


just run faster, handle much larger data sets, and/or light up with additional capabilities. We have a very small taste of that now with cloud-based apps like Shazam (which function only when online), but we have a long way yet to go to realize this full vision.

Exit Moore, Pursued by a Dark Silicon Bear

Finally, let's return one more time to the end of Moore's Law to see what awaits us in our near future (Figure 14), and why we will likely pass through three distinct stages as we navigate Moore's End.

Eventually, the tired miners will reach the point where it's no longer economically feasible to operate the mine. There's still gold left, but it's no longer commercially exploitable. Recall that Moore's Law has been interesting only because of the ability to transform its raw resource of more transistors into one of two useful forms:

Exploit #1: Greater throughput. Moore's Law lets us deliver more transistors, and therefore more complex chips, at the same cost. That's what will let processors continue to deliver more computational performance per chip, as long as we can find ways to harness the extra transistors for computation.

Exploit #2: Lower cost/power/size. Alternatively, Moore's Law enables delivery of the same number of transistors at lower cost, including in a smaller area and at lower power. This is what will let us continue to deliver powerful experiences in increasingly compact and mobile and embedded form factors.

The key thing to note is that we can expect these two ways of exploiting Moore's Law to end, not at the same time, but one after the other, and in that order.

Why? Because Exploit #2 only relies on the basic Moore's Law effect, whereas the first relies on Moore's Law and the ability to power all the transistors at the same time.

Which brings me to one last problem down in our mine.

The Power Problem: Dark Silicon

Sometimes you can be hard at work in a mine, still pulling out ore, when a small disaster happens: a cave-in, or striking water. Such problems can render entire sections of the mine unreachable. We have just begun to hit exactly those kinds of problems.

One particular problem we have just begun to encounter is known as "dark silicon." Although Moore's Law is still delivering more transistors, we are losing the ability to power them all at the same time. For more details, see Jem Davies' talk "Compute Power With Energy-Efficiency" (http://is.gd/Lfl7iz [PDF]) and the ISCA'11 paper "Dark Silicon and the End of Multicore Scaling" (http://is.gd/GhGdz9 [PDF]).

This dark silicon effect is like a Shakespearean bear that chases our doomed character offstage.


Even though we can continue to pack more cores on a chip, if we cannot use them at the same time, we have failed to exploit Moore's Law to deliver more computational throughput (Exploit #1). When we enter the phase where Moore's Law continues to give us more transistors per die area, but we are no longer able to power them all, we will find ourselves in a transitional period where Exploit #1 has ended while Exploit #2 continues and outlives it for a time.

This means that we will likely see the following major phases in the scale-in growth of mainstream machines. (Note that these apply to individual machines only, such as your personal notebook and smartphone or an individual compute node; they do not apply to a compute cloud, which we saw belongs to a different scale-out mine.)

    Exploit #1 + Exploit #2: Increasing performance (compute

    throughput) in all form factors (1975 mid-2010s?). For a

    few years yet, we will see continuing increases in mainstream

    computer performance in all form factors from desktop to smart-

    phone. As of today, the bigger form factors still have more paral-

    lelism, just as todays desktop CPUs and GPUs are routinely more

    capable than those in tablets and smartphones as long as Ex-

    ploit #1 lives, and then

    Exploit #2 only: Flat performance (compute throughput) at

    the top end, and mid and lower segments catching up (late

    2010s early 2020s?). Next, if problems like dark silicon are not

    solved, we will enter a period where mainstream computer per-

    formance levels out, starting at the top end with desktops and

    game consoles and working its way down through tablets and

    smartphones. During this period we will continue to use Moores

    Law to lower cost, power, and/or size delivering the same com-plexity and performance already available in bigger form factors

    also in smaller devices. Assuming Moores L

    enough beyond the end of Exploit #1, we can

    it will take for Exploit #2 to equalize personal

    ing the difference in transistor counts betw

    stream desktop machines and smartphones;

    of 20, which will take Moores Law about eigh

Democratization (early 2020s? onward). Eventually, democratization will reach the point where a desktop and a smartphone have roughly the same compute performance. In that case, why buy a desktop ever again? Just dock your tablet or smartphone. You might think that there are still important differences between the desktop and the smartphone: power, because the desktop is plugged in, and peripherals, because the desktop has easier access to a bigger screen and a real keyboard/mouse. But once you dock the smartphone, it has the same access to power and peripherals.
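A quick sanity check on that eight-year figure (my arithmetic, assuming a transistor-count gap of roughly 20x and one Moore's Law doubling every two years):

t ≈ log2(20) doublings × 2 years per doubling ≈ 4.3 × 2 ≈ 8.6 years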

Speaking of Smartphones: Pocket Tablets and Docking

Note that the word smartphone is already a misnomer, because a pocket device that can run apps is not primarily a phone at all. It's primarily a general-purpose personal computer that happens to have a couple of built-in radios for cell and WiFi connectivity. That makes the traditional cell phone capability just an app that happens to use the cell radio, and the Skype IP phone capability on the same device just another similar app that happens to use the WiFi radio instead.

The right way to think about even today's mobile devices is that there are not really tablets and smartphones; there are page-sized tablets and pocket-sized tablets, both already available with and


without cellular radios. That they run different operating systems today is just a point-in-time effect.

This is why those people who said an iPad is just a big iPhone without the cellular radio had it exactly backwards: the iPhone (3G or later, which allows apps) is a small iPad that fits in your pocket and happens to have a cellular radio in order to obsolete another pocket-sized device. Both devices are primarily tablets: they minimize hardware chrome and turn into the full-screen immersive app, and that's the closest thing you can get today to a morphing device that turns into a special-purpose device on demand. Many of us routinely use our phones mostly as a small tablet, spending most of our time on the device running apps to read books, browse news, watch movies, play games, update social networks, and surf the Internet. I already use my phone as a small tablet far more often than I use it as a phone, and if you have an app-capable phone then I'll bet you already do that, too.

Well before the end of this decade, I expect the most likely dominant mainstream form factor to be page-sized and pocket-sized tablets, plus docking, where docking means any means of attaching peripherals like keyboards and big screens on demand, which today already encompasses physical docks and Bluetooth and Play To connections, and will only continue to get more wireless and more seamless.

This future shouldn't be too hard to imagine, because many of us have already been working that way for a while now: For the past decade I've routinely worked from my notebook as my primary and only environment. Usually, I'm in my home office or work office, where I use a real keyboard and big screens by docking the notebook and/or using it via a remote-desktop client, and when I'm mobile I use it as a notebook. In 2012, I expect to replace my notebook with an x86-based modern tablet and use it exactly the same way. We've seen it play out many times:

Many of us used to carry around both a PalmPilot and a phone, but then the smartphone took over the job of the PalmPilot and eliminated a device with the same form factor.

Lots of kids (or their parents) carry a hand-held gaming device and a pocket tablet (aka smartphone), and we are watching the decline of the dedicated hand-held gaming device as the pocket tablet is taking over more and more of that job.

Similarly, today many of us carry around a notebook and a dedicated tablet, and convergence will again let us carry one device with the same form factor.

Computing loves convergence. In general-purpose personal computing (like notebooks and tablets, not special-purpose devices like microwaves and automobiles that may happen to contain processors), convergence always happily dooms special-purpose devices in the long run, as each device either evolves to take over the other's job or gets taken over. We will continue to have distinct pocket-sized tablets and page-sized tablets for a time because they are different form factors with different mobile uses, but even that may last only until we find a way to unify the form factors (fold the bigger screen?) so they too can converge.

Summary and Conclusions

Mainstream hardware is becoming permanently parallel, heterogeneous, and distributed. These changes are permanent, and so will permanently affect the way we have to write performance-intensive code on mainstream architectures.

The good news is that Moore's local scale-in transistor mine isn't empty yet. It appears the transistor bonanza will last for


another decade, give or take five years or so, which should be long enough to exploit the lower-cost side of the Law to get us to parity between desktops and pocket tablets. The bad news is that we can clearly observe the diminishing returns as the transistors are decreasingly exploitable: with each new generation of processors, software developers have to work harder and the chips get more difficult to power. And with each new crank of the diminishing-returns wheel, there's less time for hardware and software designers to come up with ways to overcome the next hurdle; the motherlode free lunch lasted 30 years, but the homogeneous multicore era lasted only about six years, and we are now already overlapping the next two eras of hetero-core and cloud-core.

But all is well: When your mine is getting empty, you don't panic, you just open a new mine at a new motherlode. As usual, in this case the end of one dominant wave overlaps with the beginning of the next, and we are now early in the period of overlap where we are standing with a foot in each wave, a crew in each of Moore's mine and the cloud mine. Perhaps the best news of all is that the cloud wave is already scaling enormously quickly, faster than the Moore's Law wave that it complements, and that it will outlive and replace.

If you haven't done so already, now is the time to take a hard look at the design of your applications, determine what existing features (or, better still, what potential and currently unimaginable demanding new features) are CPU-sensitive now or are likely to become so soon, and identify how those places could benefit from local and distributed parallelism. Now is also the time for you and your team to grok the requirements, pitfalls, styles, and idioms of hetero-parallel (e.g., GPGPU) and cloud programming (e.g., Amazon Web Services, Microsoft Azure, Google App Engine).

To continue enjoying the free lunch of shipping an application that runs well on today's hardware and will just naturally run faster on tomorrow's hardware, you need to write an app with lots of latent parallelism, expressed in a form that can be spread across a machine with a variable number of cores of different kinds: local and distributed cores, and big/small/specialized cores. The throughput gains will cost extra: extra development effort, extra code complexity, and extra testing effort. The good news is that for many classes of applications the extra effort will be worthwhile, because concurrency will let them fully exploit the exponential gains in compute throughput that will continue to grow strong and fast long after Moore's Law has gone into its sunny retirement, as we continue to mine the cloud for the rest of our careers.
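To make the shape of such code concrete, here is a minimal C++11 sketch (my illustration, not from the article): the work is expressed as independent tasks, the task count adapts to whatever core count the machine reports, and std::async stands in for a real work-stealing scheduler.

#include <algorithm>
#include <future>
#include <numeric>
#include <thread>
#include <vector>

// Sum a big vector as a set of independent tasks; the partitioning
// (naively, one chunk per reported hardware thread) scales with the machine.
double parallel_sum(std::vector<double> const& v) {
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::size_t chunk = (v.size() + n - 1) / n;
    std::vector<std::future<double>> parts;
    for (std::size_t i = 0; i < v.size(); i += chunk) {
        std::size_t end = std::min(v.size(), i + chunk);
        parts.push_back(std::async(std::launch::async, [&v, i, end] {
            return std::accumulate(v.begin() + i, v.begin() + end, 0.0);
        }));
    }
    double total = 0.0;
    for (auto& part : parts)
        total += part.get();  // the only synchronization point
    return total;
}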

Acknowledgments

I would like to particularly thank Jeffrey Barr, David Callahan, Olivier Giroux, Yossi Levanoni, Henry Moreton, and James Reinders, who graciously made themselves available to answer questions and provide background information, and who shared their thoughts on appropriately mapping their companies' products onto the processor/memory chart.

Herb Sutter is a bestselling author and consultant on software development topics, and a software architect at Microsoft. A version of this article is posted at http://herbsutter.com/welcome-to-the-jungle/.


Efficient Use of Lambda Expressions and std::function

Functors and std::function implementations vary widely between libraries. C++11's lambdas make them more efficient.

By Cassio Neri

Functor classes (classes that implement operator()) are old friends to C++ programmers who, for many years, have used them as predicates for STL algorithms. Nevertheless, implementing simple functor classes is quite cumbersome, as the following example shows.

Suppose that v is an STL container of ints and we want to compute how many of its elements are multiples of a certain value n set at runtime. An STL way of doing this is:

std::count_if(v.begin(), v.end(), is_multiple_of(n));

    where is_multiple_of is defined by:

class is_multiple_of {
public:
    typedef bool result_type;  // These typedefs are not strictly
    typedef int argument_type; // required, but are recommended.
                               // More details later.
    is_multiple_of(int n) : n(n) {}
    bool operator()(int i) const { return i % n == 0; }
private:
    const int n;
};

Having to write all this code pushes many programmers to write their own loops instead of calling std::count_if. By doing so, they lose good opportunities for compiler optimizations.


Lambda expressions (http://en.wikipedia.org/wiki/Anonymous_function#C.2B.2B) make the creation of simple functor classes much easier. Although two of the Boost libraries (http://www.boost.org), Boost.Lambda and, more recently, Boost.Phoenix, provide very good implementations of lambda abstractions in C++03, the standard committee decided to add language support for lambda expressions in C++11 to improve the language's expressiveness. Using this new feature, the previous example becomes:

std::count_if(v.begin(), v.end(), [n](int i){ return i%n == 0; });

Behind the scenes, the lambda expression [n](int i){ return i%n == 0; } forces the compiler to implement an unnamed functor class similar to is_multiple_of, with some obvious advantages:

1. It's much less verbose.
2. It doesn't introduce a new name just for a temporary use, resulting in less name pollution.
3. Frequently (not in this example, though) the name of the functor class is much less expressive than its actual code. Placing the code closer to where it's called improves code clarity.

The Closure Type

In the previous examples, our functor class was named is_multiple_of. Naturally, the functor class automatically implemented by the compiler has a different name. Only the compiler knows this type's name, and we can think of it as an unnamed type. For presentation purposes, it's called the closure type, whereas the temporary object resulting from the lambda expression is the closure object. The type's anonymity is not an issue for std::count_if because it's a template function and, therefore, argument type deduction takes place.

Turning a function into a template is a way to make it accept lambda expressions as arguments. Consider, for instance, a situation where one implements a root-finder; i.e., a function that takes a function f and returns a double value x such that f(x) = 0. A first attempt might be a template function:

template <typename T>
double find_root(T const& f);

However, this might not be desirable due to a few well-known template weaknesses: The code must be exposed in header files, compilation time increases, and template functions can't be virtual.

Can find_root be a non-template function? If so, what should its signature be?

    double find_root(??? const& f);

Argument type deduction for template functions already existed before C++11. Nevertheless, the new standard introduces the keywords auto and decltype to support type deduction in other contexts. (auto was a keyword in C++03, but with a different meaning.) If we want to give a name to a closure object, then we can follow this example:

auto f = [](double x){ return x * x - 0.5; };

Furthermore, an alias for the closure type, say function_t, can be created:

typedef decltype(f) function_t;

Unfortunately, function_t is set at the same scope as the lambda expression and, therefore, is invisible elsewhere. In particular, it cannot be used in find_root's signature.


Now, the other important actor of our play enters the stage: std::function.

std::function and Its Costs

Another Boost option, std::function, implements a type-erasure mechanism that allows a uniform treatment of different functor types. Its predecessor boost::function dates back to 2001 and was introduced into TR1 in 2005 as std::tr1::function. Now, it's part of C++11 and has been promoted to namespace std.

We shall see a few details of three different implementations of std::function and related classes: Boost, the Microsoft C++ Standard library (MSLIB for short), and the GNU Standard C++ Library (a.k.a. libstdc++, but referred to here as GCCLIB). Unless otherwise stated, we shall generically refer to the relevant library types and functions as if they belonged to namespace std, regardless of the fact that Boost's do not. I will cover two compilers: Microsoft Visual Studio 2010 (MSVC) and the GNU Compiler Collection 4.5.3 (GCC) using option -std=c++0x. I'll consider these compilers, compiling their corresponding aforementioned standard libraries, and also compiling Boost.

Using std::function, find_root's declaration becomes:

double find_root(std::function<double(double)> const& f);
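For illustration, here is one minimal way such a non-template find_root could be implemented and called (a naive bisection that assumes f changes sign on [0, 1]; the article itself doesn't prescribe an algorithm):

#include <functional>

// Naive bisection: assumes f(0) and f(1) have opposite signs.
double find_root(std::function<double(double)> const& f) {
    double lo = 0.0, hi = 1.0;
    for (int i = 0; i < 60; ++i) {
        double mid = 0.5 * (lo + hi);
        if ((f(lo) < 0.0) == (f(mid) < 0.0)) lo = mid;
        else                                 hi = mid;
    }
    return 0.5 * (lo + hi);
}

// A closure converts implicitly to std::function<double(double)>:
// double r = find_root([](double x){ return x * x - 0.5; });

Because the body no longer depends on the functor's type, it can live in a source file and could even be made virtual, which is exactly what the template version above could not offer.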

Generally, std::function<R(T1, ..., TN)> is a functor class that wraps any functor object that takes N arguments of types T1, ..., TN and returns a value convertible to R. It provides template conversion constructors that accept such functor objects. In particular, closure types are implicitly converted to std::function. There are two hidden and preventable costs at construction.

First, the constructor takes the functor object by value and, therefore, a copy is made. Furthermore, the constructor forwards this copy to a series of helper functions, many of which also take it by value, making further copies. For instance, MSLIB and GCCLIB make a few copies, while Boost makes seven. However, the large number of copies is not the culprit for the biggest performance hit.

The second issue is related to the functor's size. The three implementations follow the standard's recommendation to apply a small-object optimization so as to avoid dynamic memory allocation. Basically, they use a data member to store a copy of the wrapped functor object. But because the object's size is known only at construction time, this member might not be big enough to hold the copy. In this case, the copy is allocated on the heap through a call to new (unless a custom allocator is given), and only a pointer to this copy is stored by the data member. The maximum size beyond which the heap is used depends on the implementation and alignment considerations. The best cases for common platforms are 16 bytes and 24 bytes for GCCLIB and Boost, respectively; MSLIB's limit is smaller still.
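You can watch the copying happen with a small experiment of my own devising (the printed count varies with the library, per the figures above). The spy functor counts its own copy constructions; since declaring a copy constructor suppresses the implicit move constructor, moves are counted too:

#include <cstdio>
#include <functional>

struct spy {
    static int copies;
    spy() {}
    spy(spy const&) { ++copies; }
    bool operator()(int i) const { return i % 2 == 0; }
};
int spy::copies = 0;

int main() {
    std::function<bool(int)> f = spy();  // wrap a temporary functor
    std::printf("copies made: %d\n", spy::copies);
}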

Improving Performance of std::function

Clearly, to address these performance issues, copies of the functor should be avoided. The natural idea is working with references instead of copies. However, we all know that this is not generally possible, because you might want the std::function object to outlive the original functor.

This is an old issue, as STL algorithms also take functor arguments by value. A good solution was implemented by Boost a long time ago and is now part of C++11 as well.

The template class std::reference_wrapper<T> wraps a reference to an object and provides automatic conversion to T&,

making the std::reference_wrapper usable in many circumstances where the wrapped type is expected. The size of std::reference_wrapper is the size of a reference and, thus, small. Additionally, there are two template functions, std::ref and std::cref, to ease the creation of non-const and const std::reference_wrappers, respectively. (They act like std::make_pair does to create std::pairs.)

Back to the first example: To avoid the multiple copies of is_multiple_of (which actually don't cost much, since this is a small class) we can use:

std::count_if(v.begin(), v.end(), std::cref(is_multiple_of(n)));

Applying the same idea to the lambda expression yields:

std::count_if(v.begin(), v.end(), std::cref([n](int i){ return i%n == 0; }));

Unfortunately, things get a bit more complicated and depend on the compiler and library.

Boost in both compilers (change std::cref to boost::cref): It doesn't work, because boost::reference_wrapper is not a functor.

MSLIB: Currently, it doesn't work, but should in the near future. Indeed, to handle types returned by functor objects, MSLIB uses std::result_of which, in TR1, depends on the functor type having a member result_type, a typedef to the type returned by operator(). Notice that is_multiple_of has this member type, but the closure type doesn't (as per C++11). In C++11, std::result_of has changed and is now based on decltype. We are in a transition period, and MSVC still follows TR1 (http://social.msdn.microsoft.com/Forums/en/vclanguage/thread/4e438675-eb1e-42ef-b1df-7ae262234695), but the next release of MSLIB (https://connect.microsoft.com/VisualStudio/feedback/details/618807) is supposed to follow C++11.

GCCLIB: It works.

In addition, as per C++11, functor classes originating from lambda expressions are not adaptable: they don't contain the typedef members required by STL adaptors, and the following fails to compile:

std::not1([n](int i){ return i%n == 0; });

In this case, std::not1 requires argument_type; is_multiple_of defines it, but the closure type doesn't.

The previous issue takes a slightly different form when std::function is involved. By definition, std::function wraps functors. Hence, when its constructor receives an object of type std::reference_wrapper<T>, it assumes that T is a functor type and behaves accordingly. For instance, the following lines are legal with Boost and GCCLIB, but not yet with MSLIB (though they should be in the next release):

std::function<bool(int)> f(std::cref([n](int i){ return i%n == 0; }));
std::count_if(v.begin(), v.end(), f);

It's worth mentioning that std::function's wrappers around unary and binary functor classes are adaptable and can be given to STL adaptors (for example, std::not1).


I'm led to conclude that if you don't want heap storage (and custom allocators), then Boost and GCCLIB are good options. If you are aiming for portability, then you should use Boost.

For developers using MSLIB, the performance issue remains unsolved until the next release. For those who can't wait, here is a workaround that turns out to be portable (it works with GCC and MSVC).

The idea is obvious: Keep the closure type small. This size depends on the variables that are captured by the lambda expression (that is, that appear inside the square brackets []). For instance, the lambda expression previously seen,

[n](int i){ return i%n == 0; };

captures the variable int n and, for this reason, the closure type has an int data member holding a copy of n. The more identifiers we put inside [], the bigger the size of the closure type gets. If the aggregated size of all identifiers inside [] is small enough (for example, one int or one double), then the heap is not used.

One way to keep the size small is by creating a struct enclosing references to all identifiers that normally would go inside [], and putting only a reference to this struct inside []. You use the struct members in the body of the lambda expression. For instance, the following lambda expression

double a;
double b;
// ...
[a, b](double x){ return a * x + b; };

yields a closure type with at least 2 * sizeof(double) bytes, which is enough for MSLIB to use the heap. The alternative is:

double a;
double b;
// ...
struct {
    const double& a;
    const double& b;
} p = { a, b };
[&p](double x){ return p.a * x + p.b; };

In this way, only a reference to p is captured, which is small enough for MSLIB, GCCLIB, and Boost to avoid the heap.
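To see the effect on a given toolchain, compare the two closure sizes directly (a quick test of mine; the exact numbers are implementation-specific):

#include <cstdio>

int main() {
    double a = 2.0, b = 3.0;
    struct refs { const double& a; const double& b; };
    refs p = { a, b };

    auto big   = [a, b](double x){ return a * x + b; };   // copies two doubles
    auto small = [&p](double x){ return p.a * x + p.b; }; // copies one reference

    std::printf("big: %u bytes, small: %u bytes\n",
                (unsigned)sizeof(big), (unsigned)sizeof(small));
}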

A final word on the letter of the law: The standard allows implementations to give closure types different sizes and alignments. Therefore, in theory, the aforementioned workaround might not work. More precisely, the code remains legal, but the heap might be used if the closure type is big enough. However, neither MSVC nor GCC does this.

Acknowledgment

I would like to thank Lorenz Schneider and Victor for their comments and careful reading of this article.

Cassio Neri has a Ph.D. in Mathematics. He works in the FX Quantitative Research team at Lloyds Banking Group in London.


From the Vault

8 Simple Rules for Designing Threaded Applications

This entry from Dr. Dobb's in 2008 offers rules that still hold true for creating efficient threaded implementations of applications.

By Clay Breshears

The Threading Methodology used at Intel has four major steps: Analysis, Design & Implementation, Debugging, and Performance Tuning. These steps are used to create a multithreaded application from a serial base code. While the use of software tools for the first, third, and fourth steps is well documented, there hasn't been much written about how to do the Design & Implementation part of the process.

There are plenty of books published on parallel algorithms and computation. However, these tend to focus on message-passing, distributed-memory systems, or theoretical parallel models of computation that may or may not have much in common with realized multicore platforms. If you're going to be engaged in threaded programming, it can be helpful to know how to program or design algorithms for such models. Of course, these models are fairly limited, and many software developers will not have had the opportunity to be exposed to systems that need such specialized programming.

Multithreaded programming is still more art than science. This article gives eight simple rules that you can add to your palette of threading design methods. By following these rules, you will have more success in writing the best and most-efficient threaded implementation of your applications.

Rule 1. Be sure you identify truly independent computations.

You can't execute anything concurrently unless the operations that


would be executed in parallel can be run independently of each other. We can easily think of different real-world instances of independent actions being performed in order to satisfy a single goal. For example, building a house can involve many different workers with different skills: carpenters, electricians, glazers, plumbers, roofers, painters, masons, lawn wranglers, etc. There are some obvious scheduling dependencies between pairs of workers (you can't put on roof shingles before the walls are built, and you can't paint the walls until the drywall is installed), but for the most part, the people involved in building a house can work independently of each other.

Another real-world example would be a DVD rental warehouse. Orders for movies are collected and then distributed to the workers, who go out to where all the discs are stored and find copies to satisfy their assigned orders. Pulling out My Fair Lady by one worker does not interfere with another worker who is looking for The Terminator, nor will it interfere with a worker trying to locate episodes from the second season of Seinfeld. (We can assume that any conflicts that would result from unavailable inventory have been dealt with before orders are transmitted to the warehouse.) Also, the packaging and mailing of each order will not interfere with disc searches or the shipping and handling of any other order.

There are cases where you will have exclusively sequential computations that cannot be made concurrent; many of these will be dependencies between loop iterations or steps that must be carried out in a specific order. An example of the latter is a pregnant reindeer. The normal gestation period is about eight consecutive months, so you can't get a calf by putting eight cows on the job for one month. However, if Santa wanted to field a whole new sled team as soon as possible, he could have eight cows carrying his future team all at the same time.

Rule 2. Implement concurrency at the highest level possible.

There are two directions you can take when approaching how to thread a serial code: bottom-up and top-down. In the Analysis phase of the Threading Methodology, you identify the segments of your code that take the most execution time (the hotspots). If you are able to run those code portions in parallel, you will have the best chance of achieving the maximum performance possible.

In a bottom-up approach, you would attempt to thread the hotspots in your code. If this is not possible, you can search up the call stack of the application to determine if there is another place in the code that can be run in parallel and still executes the hotspot code. For example, if you have a picture compression application, you could divide the processing of the picture into separate, independent regions to be processed in parallel. Even if it is possible to employ concurrency directly at the hotspot code, you should still look to see if it would be possible to implement that concurrency at a point in the code higher up the call stack. This can increase the granularity of the work done by each thread.

With the top-down approach, you first consider the whole application, what the computation is coded to accomplish and, at an abstract level, all the parts of the app that combine to realize that computation. If there is no obvious concurrency, you should distill the parts of the computation into successively smaller parts until you can identify independent computations. Results from the Analysis phase can


guide your investigation to include the most time-consuming modules. Consider threading a video encoding application. You can start at the lowest level of independent pixels within a single frame, or realize that groups of frames can be processed independent of other groups. If the video encoding app is expected to process multiple videos, expressing your parallelism at that level may be easier to write and will be at the highest level of possible concurrency.

The granularity of concurrent computations is loosely defined as the amount of computation done before synchronization is needed. The longer the time between synchronizations, the coarser the granularity will be. Fine-grained parallelism runs the danger of not having enough work assigned to threads to overcome the overhead costs of using threads. Adding more threads, when the amount of computation doesn't change, only exacerbates the problem. Coarse-grained parallelism has lower overhead costs and also tends to be more readily scalable to an increase in the number of threads. Top-down approaches to threading (or driving the point of threading as high in the call stack as possible) are the best options to achieve a coarse-grained solution.
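As a sketch of the difference in C++11 terms (my example, not the article's; encode_frame_group is a stand-in for substantial per-frame work):

#include <algorithm>
#include <thread>
#include <vector>

// Stand-in for real work on a contiguous group of frames.
void encode_frame_group(int first, int last) {
    for (int f = first; f < last; ++f) { /* encode frame f */ }
}

// Coarse-grained: each thread gets a whole group of frames and
// synchronizes exactly once, at join time. A fine-grained version
// that spawned a task per pixel would pay the thread-management
// overhead millions of times for tiny pieces of work.
void encode_coarse(int frames, int threads) {
    std::vector<std::thread> pool;
    int group = (frames + threads - 1) / threads;
    for (int t = 0; t < threads; ++t) {
        int first = t * group;
        int last  = std::min(frames, first + group);
        if (first < last)
            pool.emplace_back(encode_frame_group, first, last);
    }
    for (auto& th : pool) th.join();  // the single synchronization point
}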

Rule 3. Plan early for scalability to take advantage of increasing numbers of cores.

Processors have recently gone from being dual-core to quad-core, and Intel has announced the 80-core Teraflop chip. The number of cores available in future processors will only increase. Thus, you should plan for such processor increases within your software. Scalability is the measure of an application's ability to handle changes, typically increases in system resources (number of cores, memory size, bus speed) or data set sizes. In the face of more available cores, you should write flexible code that can take advantage of different numbers of cores.