

Scaling Reliably: Improving the Scalability of the Erlang Distributed Actor Platform

Phil Trinder, Natalia Chechina, Nikolaos Papaspyrou, Konstantinos Sagonas, Simon Thompson, Stephen Adams, Stavros Aronis, Robert Baker, Eva Bihari, Olivier Boudeville, Francesco Cesarini, Maurizio Di Stefano, Sverker Eriksson, Viktoria Fordos, Amir Ghaffari, Aggelos Giantsios, Rickard Green, Csaba Hoch, David Klaftenegger, Huiqing Li, Kenneth Lundin, Kenneth MacKenzie, Katerina Roukounaki, Yiannis Tsiouris, Kjell Winblad

RELEASE EU FP7 STREP (287510) project
www.release-project.eu/

April 25, 2017

Abstract

Distributed actor languages are an effective means of constructing scalable reliable systems, and the Erlang programming language has a well-established and influential model. While the Erlang model conceptually provides reliable scalability, it has some inherent scalability limits, and these force developers to depart from the model at scale. This article establishes the scalability limits of Erlang systems, and reports the work of the EU RELEASE project to improve the scalability and understandability of the Erlang reliable distributed actor model.

We systematically study the scalability limits of Erlang, and then address the issues at the virtual machine, language and tool levels. More specifically: (1) We have evolved the Erlang virtual machine so that it can work effectively in large scale single-host multicore and NUMA architectures. We have made important changes and architectural improvements to the widely used Erlang/OTP release. (2) We have designed and implemented Scalable Distributed (SD) Erlang libraries to address language-level scalability issues, and provided and validated a set of semantics for the new language constructs. (3) To make large Erlang systems easier to deploy, monitor, and debug we have developed and made open source releases of five complementary tools, some specific to SD Erlang.

Throughout the article we use two case studies to investigate the capabilities of our new technologies and tools: a distributed hash table based Orbit calculation and Ant Colony Optimisation (ACO). Chaos Monkey experiments show that two versions of ACO survive random process failure, and hence that SD Erlang preserves the Erlang reliability model. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even for programs with no global recovery data to maintain, SD Erlang partitions the network to reduce network traffic, and hence improves the performance of the Orbit and ACO benchmarks above 80 hosts. ACO measurements show that maintaining global recovery data dramatically limits scalability; however, scalability is recovered by partitioning the recovery data. We exceed the established scalability limits of distributed Erlang, and do not reach the limits of SD Erlang for these benchmarks at this scale (256 hosts, 6144 cores).


I. Introduction

Distributed programming languages and frameworks are central to engineering large scale systems, where key properties include scalability and reliability. By scalability we mean that performance increases as hosts and cores are added, and by large scale we mean architectures with hundreds of hosts and tens of thousands of cores. Experience with high performance and data centre computing shows that reliability is critical at these scales; e.g. host failures alone account for around one failure per hour on commodity servers with approximately 10^5 cores [11]. To be usable, programming languages employed on them must be supported by a suite of deployment, monitoring, refactoring and testing tools that work at scale.

Controlling shared state is the only way to build reliable scalable systems. State shared by multiple units of computation limits scalability due to high synchronisation and communication costs. Moreover, shared state is a threat to reliability, as failures corrupting or permanently locking shared state may poison the entire system.

Actor languages avoid shared state: actors or processes have entirely local state and only interact with each other by sending messages [2]. Recovery is facilitated in this model since actors, like operating system processes, can fail independently without affecting the state of other actors. Moreover, an actor can supervise other actors, detecting failures and taking remedial action, e.g. restarting the failed actor.

Erlang [6, 19] is a beacon language for reliable scalable computing with a widely emulated distributed actor model. It has influenced the design of numerous programming languages like Clojure [43] and F# [74], and many languages have Erlang-inspired actor frameworks, e.g. Kilim for Java [73], Cloud Haskell [31], and Akka for C#, F# and Scala [64]. Erlang is widely used for building reliable scalable servers, e.g. Ericsson's AXD301 telephone exchange (switch) [79], the Facebook chat server, and the WhatsApp instant messaging server [77].

In Erlang the actors are termed processes, and are managed by a sophisticated Virtual Machine on a single multicore or NUMA host, while distributed Erlang provides relatively transparent distribution over networks of VMs on multiple hosts. Erlang is supported by the Open Telecom Platform (OTP) libraries that capture common patterns of reliable distributed computation, such as the client-server pattern and process supervision. Any large-scale system needs scalable persistent storage, and following the CAP theorem [39], Erlang uses, and indeed implements, Dynamo-style NoSQL DBMS like Riak [49] and Cassandra [50].

While the Erlang distributed actor model conceptually provides reliable scalability, it has some inherent scalability limits, and indeed large-scale distributed Erlang systems must depart from the distributed Erlang paradigm in order to scale, e.g. not maintaining a fully connected graph of hosts. The EU FP7 RELEASE project set out to establish and address the scalability limits of the Erlang reliable distributed actor model [67].

After outlining related work (Section II) and the benchmarks used throughout the article (Section III), we investigate the scalability limits of Erlang/OTP, seeking to identify specific issues at the virtual machine, language and persistent storage levels (Section IV).

We then report the RELEASE project work to address these issues, working at the following three levels.

1. We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level reliability and scalability issues. An operational semantics is provided for the key new s_group construct, and the implementation is validated against the semantics (Section V).

2. We have evolved the Erlang virtual machine so that it can work effectively in large-scale single-host multicore and NUMA architectures. We have improved the shared ETS tables, time management, and load balancing between schedulers. Most of these improvements are now included in the Erlang/OTP release, currently downloaded approximately 50K times each month (Section VI).

3. To facilitate the development of scalable Erlang systems, and to make them maintainable, we have developed three new tools, Devo, SDMon and WombatOAM, and enhanced two others: the visualisation tool Percept and the refactorer Wrangler. The tools support refactoring programs to make them more scalable, easier to deploy at large scale (hundreds of hosts), and easier to monitor and visualise in their behaviour. Most of these tools are freely available under open source licences; the WombatOAM deployment and monitoring tool is a commercial product (Section VII).

Throughout the article we use two benchmarks to investigate the capabilities of our new technologies and tools. These are a computation in symbolic algebra, more specifically an algebraic 'orbit' calculation that exploits a non-replicated distributed hash table, and an Ant Colony Optimisation (ACO) parallel search program (Section III).

We report on the reliability and scalability implications of our new technologies using Orbit, ACO and other benchmarks. We use a Chaos Monkey instance [14] that randomly kills processes in the running system to demonstrate the reliability of the benchmarks, and to show that SD Erlang preserves the Erlang language-level reliability model. While we report measurements on a range of NUMA and cluster architectures, as specified in Appendix A, the key scalability experiments are conducted on the Athos cluster with 256 hosts and 6144 cores. Having established scientifically the folklore limitations of around 60 connected hosts/nodes for distributed Erlang systems in Section IV, a key result is to show that the SD Erlang benchmarks exceed this limit and do not reach their limits on the Athos cluster (Section VIII).

Contributions. This article is the first systematic presentation of the coherent set of technologies for engineering scalable reliable Erlang systems developed in the RELEASE project.

Section IV presents the first scalability study covering Erlang VM, language, and storage scalability. Indeed, we believe it is the first comprehensive study of any distributed actor language at this scale (100s of hosts, and around 10K cores). Individual scalability studies, e.g. into Erlang VM scaling [8], or language and storage scaling, have appeared before [38, 37].

At the language level the design, implementation and validation of the new libraries (Section V) have been reported piecemeal [21, 60], and are included here for completeness.

While some of the improvements made to the Erlang Virtual Machine (Section i) have been thoroughly reported in conference publications [66, 46, 47, 69, 70, 71], others are reported here for the first time (Sections iii and ii).

In Section VII the WombatOAM and SDMon tools are described for the first time, as is the revised Devo system and visualisation. The other tools for profiling, debugging and refactoring developed in the project have previously been published piecemeal [52, 53, 55, 10], but this is their first unified presentation.

All of the performance results in Section VIII are entirely new, although a comprehensive study of SD Erlang performance is now available in a recent article by [22].


II. Context

i. Scalable Reliable Programming Models

There is a plethora of shared memory concurrent programming models like PThreads or Java threads, and some models, like OpenMP [20], are simple and high level. However, synchronisation costs mean that these models generally do not scale well, often struggling to exploit even 100 cores. Moreover, reliability mechanisms are greatly hampered by the shared state: for example, a lock becomes permanently unavailable if the thread holding it fails.

The High Performance Computing (HPC) community build large-scale (10^6 core) distributed memory systems using the de facto standard MPI communication libraries [72]. Increasingly these are hybrid applications that combine MPI with OpenMP. Unfortunately MPI is not suitable for producing general purpose concurrent software, as it is too low level, with explicit message passing. Moreover, the most widely used MPI implementations offer no fault recovery:¹ if any part of the computation fails, the entire computation fails. Currently the issue is addressed by using what is hoped to be highly reliable computational and networking hardware, but there is intense research interest in introducing reliability into HPC applications [33].

Server farms use commodity computational and networking hardware, and often scale to around 10^5 cores, where host failures are routine. They typically perform rather constrained computations, e.g. Big Data Analytics, using reliable frameworks like Google MapReduce [26] or Hadoop [78]. The idempotent nature of the analytical queries makes it relatively easy for the frameworks to provide implicit reliability: queries are monitored and failed queries are simply re-run. In contrast, actor languages like Erlang are used to engineer reliable general purpose computation, often recovering failed stateful computations.

¹ Some fault tolerance is provided in less widely used MPI implementations like [27].

ii. Actor Languages

The actor model of concurrency consists of independent processes communicating by means of messages sent asynchronously between processes. A process can send a message to any other process for which it has the address (in Erlang, the "process identifier" or pid), and the remote process may reside on a different host. While the notion of actors originated in AI [42], it has been used widely as a general metaphor for concurrency, as well as being incorporated into a number of niche programming languages in the 1970s and 80s. More recently it has come back to prominence through the rise of not only multicore chips but also larger-scale distributed programming in data centres and the cloud.

With built-in concurrency and data isolation, actors are a natural paradigm for engineering reliable scalable general-purpose systems [1, 41]. The model has two main concepts: actors, which are the unit of computation, and messages, which are the unit of communication. Each actor has an address-book that contains the addresses of all the other actors it is aware of. These addresses can be either locations in memory, direct physical attachments, or network addresses. In a pure actor language messages are the only way for actors to communicate.

After receiving a message an actor can do the following: (i) send messages to another actor from its address-book, (ii) create new actors, or (iii) designate a behaviour to handle the next message it receives. The model does not impose any restrictions on the order in which these actions must be taken. Similarly, two messages sent concurrently can be received in any order. These features enable actor based systems to support indeterminacy and quasi-commutativity, while providing locality, modularity, reliability and scalability [41].
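These actions can be made concrete in a few lines of Erlang. The following sketch is ours, not taken from the article's benchmarks: a behaviour is represented as a function from a message to the behaviour for the next message.

actor(Behaviour) ->
    receive
        Msg ->
            %% Handle the message and designate the behaviour
            %% that will handle the next one.
            actor(Behaviour(Msg))
    end.

%% Example behaviour: counts messages and can create new actors.
counter(N) ->
    fun ({spawn_child, ChildBehaviour}) ->
            spawn(fun() -> actor(ChildBehaviour) end),
            counter(N + 1);
        (_Other) ->
            counter(N + 1)
    end.

An actor running counter(0) handles each message, optionally spawns a child actor, and then designates a new counting behaviour for the next message.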

Actors are just one message-based paradigm, and other languages and libraries have related message passing paradigms. Recent example languages include Go [28] and Rust [62], which provide explicit channels similar to actor mailboxes. Probably the most famous message passing library is MPI [72], with APIs for many languages, widely used on clusters and High Performance Computers. It is, however, arguable that the most important contribution of the actor model is the one-way asynchronous communication [41]. Messages are not coupled with the sender, and neither are they transferred synchronously to a temporary container where transmission takes place, e.g. a buffer, a queue or a mailbox. Once a message is sent, the receiver is the only entity responsible for that message.

Erlang [6, 19] is the pre-eminent programming language based on the actor model, having a history of use in production systems, initially with its developer Ericsson, and then more widely through open source adoption. There are now actor frameworks for many other languages; these include Akka for C#, F# and Scala [64], CAF² for C++, Pykka³, Cloud Haskell [31], PARLEY [51] for Python, and Termite Scheme [35], and each of these is currently under active use and development. Moreover, the recently defined Rust language [62] has a version of the actor model built in, albeit in an imperative context.

² http://actor-framework.org
³ http://pykka.readthedocs.org/en/latest

iii. Erlang's Support for Concurrency

In Erlang actors are termed processes, and virtual machines are termed nodes. The key elements of the actor model are: fast process creation and destruction; lightweight processes, e.g. enabling 10^6 concurrent processes on a single host with 8GB RAM; fast asynchronous message passing with copying semantics; process monitoring; strong dynamic typing; and selective message reception.

By default Erlang processes are addressed by their process identifier (pid), e.g.

Pong_PID = spawn(fun some_module:pong/0)

spawns a process to execute the anonymous function given as argument to the spawn primitive, and binds Pong_PID to the new process identifier. Here the new process will execute the pong/0 function which is defined in some_module. A subsequent call

Pong_PID ! finish

sends the message finish to the process identified by Pong_PID. Alternatively, processes can be given names using a call of the form

register(my_funky_name, Pong_PID)

which registers this process name in the node's process name table if not already present. Subsequently, these names can be used to refer to or communicate with the corresponding processes (e.g. send them a message):

my_funky_name ! hello

A distributed Erlang system executes on multiple nodes, and the nodes can be freely deployed across hosts, e.g. they can be located on the same or different hosts. To help make distribution transparent to the programmer, when any two nodes connect they do so transitively, sharing their sets of connections. Without considerable care, this quickly leads to a fully connected graph of nodes. A process may be spawned on an explicitly identified node, e.g.

Remote_Pong_PID = spawn(some_node,
                        fun some_module:pong/0)

After this, the remote process can be addressed just as if it were local. It is a significant burden on the programmer to identify the remote nodes in large systems, and we will return to this in Sections ii and i.2.


Figure 1: Conceptual view of Erlang's concurrency, multicore support and distribution.

iv. Scalability and Reliability in Erlang Systems

Erlang was designed to solve a particular set of problems, namely those in building telecommunications infrastructure, where systems need to be scalable to accommodate hundreds of thousands of calls concurrently, in soft real-time. These systems need to be highly-available and reliable, i.e. to be robust in the case of failure, which can come from software or hardware faults. Given the inevitability of the latter, Erlang adopts the "let it fail" philosophy for error handling. That is, encourage programmers to embrace the fact that a process may fail at any point, and have them rely on the supervision mechanism, discussed shortly, to handle the failures.

Figure 1 illustrates Erlang's support for concurrency, multicores and distribution. Each Erlang node is represented by a yellow shape, and each rectangle represents a host with an IP address. Each red arc represents a connection between Erlang nodes. Each node can run on multiple cores, and exploit the inherent concurrency provided. This is done automatically by the VM, with no user intervention needed. Typically each core has an associated scheduler that schedules processes; a new process will be spawned on the same core as the process that spawns it, but work can be moved to a different scheduler through a work-stealing allocation algorithm. Each scheduler allows a process that is ready to compute at most a fixed number of computation steps before switching to another. Erlang built-in functions, or BIFs, are implemented in C, and at the start of the project were run to completion once scheduled, causing performance and responsiveness problems if the BIF had a long execution time.

Scaling in Erlang is provided in two different ways. It is possible to scale within a single node by means of the multicore virtual machine exploiting the concurrency provided by the multiple cores or NUMA nodes. It is also possible to scale across multiple hosts using multiple distributed Erlang nodes.

Reliability in Erlang is multi-faceted. As in all actor languages, each process has private state, preventing a failed or failing process from corrupting the state of other processes. Messages enable stateful interaction, and contain a deep copy of the value to be shared, with no references (e.g. pointers) to the sender's internal state. Moreover, Erlang avoids type errors by enforcing strong typing, albeit dynamically [7]. Connected nodes check liveness with heartbeats, and can be monitored from outside Erlang, e.g. by an operating system process.

However, the most important way to achieve reliability is supervision, which allows a process to monitor the status of a child process and react to any failure, for example by spawning a substitute process to replace a failed process. Supervised processes can in turn supervise other processes, leading to a supervision tree. The supervising and supervised processes may be in different nodes and on different hosts, and hence the supervision tree may span multiple hosts or nodes.
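As an illustration (ours; production systems use the OTP supervisor behaviour), a minimal supervision loop links to a child process and restarts it whenever it fails:

supervise(ChildFun) ->
    process_flag(trap_exit, true),
    Pid = spawn_link(ChildFun),
    receive
        {'EXIT', Pid, normal} ->
            ok;                     % child finished cleanly
        {'EXIT', Pid, _Reason} ->
            supervise(ChildFun)     % replace the failed child
    end.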

To provide reliable distributed service registration, a global namespace is maintained on every node, which maps process names to pids. It is this that we mean when we talk about a 'reliable' system: it is one in which a named process in a distributed system can be restarted without requiring the client processes also to be restarted (because the name can still be used for communication).

To see global registration in action, consider a pong server process:

global:register_name(pong_server,
                     Remote_Pong_PID)

Clients of the server can send messages to the registered name, e.g.

global:whereis_name(pong_server) ! finish

If the server fails, the supervisor can spawn a replacement server process with a new pid and register it with the same name (pong_server). Thereafter, client messages to the pong_server will be delivered to the new server process. We return to discuss the scalability limitations of maintaining a global namespace in Section V.
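A sketch of this recovery step (ours, simplified from what an OTP supervisor would do): the monitoring process spawns a replacement and rebinds the global name, so clients are undisturbed.

restart_on_failure() ->
    NewPid = spawn(fun some_module:pong/0),
    %% Rebind the well-known name to the replacement pid.
    yes = global:re_register_name(pong_server, NewPid),
    Ref = erlang:monitor(process, NewPid),
    receive
        {'DOWN', Ref, process, NewPid, _Reason} ->
            restart_on_failure()
    end.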

v. ETS: Erlang Term Storage

Erlang is a pragmatic language, and the actor model it supports is not pure. Erlang processes, besides communicating via asynchronous message passing, can also share data in public memory areas called ETS tables.

The Erlang Term Storage (ETS) mechanism is a central component of Erlang's implementation. It is used internally by many libraries and underlies the in-memory databases. ETS tables are key-value stores: they store Erlang tuples where one of the positions in the tuple serves as the lookup key. An ETS table has a type that may be either set, bag or duplicate_bag, implemented as a hash table, or ordered_set, which is currently implemented as an AVL tree. The main operations that ETS supports are: table creation; insertion of individual entries and atomic bulk insertion of multiple entries in a table; deletion and lookup of an entry based on some key; and destructive update. The operations are implemented as C built-in functions in the Erlang VM.

The code snippet below creates a set ETS table keyed by the first element of the entry, atomically inserts two elements with keys some_key and 42, updates the value associated with the table entry with key 42, and then looks up this entry:

Table = ets:new(my_table,
                [set, public, {keypos, 1}]),
ets:insert(Table, [
    {some_key, an_atom_value},
    {42, {a,tuple,value}}]),
ets:update_element(Table, 42,
    [{2, {another,tuple,value}}]),
[{Key, Value}] = ets:lookup(Table, 42).

ETS tables are heavily used in many Erlang applications. This is partly due to the convenience of sharing data for some programming tasks, but also partly due to their fast implementation. As a shared resource, however, ETS tables induce contention and become a scalability bottleneck, as we shall see in Section i.
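This contention is easy to provoke. In the sketch below (our illustration, not the ets_test code), N processes write concurrently to a single public table and serialise on its locks:

contend(N, Ops) ->
    Table = ets:new(shared, [set, public]),
    Parent = self(),
    Pids = [spawn(fun() ->
                      [ets:insert(Table, {rand:uniform(1000), I})
                       || I <- lists:seq(1, Ops)],
                      Parent ! {done, self()}
                  end) || _ <- lists:seq(1, N)],
    [receive {done, Pid} -> ok end || Pid <- Pids],
    ets:delete(Table).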

III. Benchmarks for Scalability and Reliability

The two benchmarks that we use throughout this article are Orbit, which measures scalability without looking at reliability, and Ant Colony Optimisation (ACO), which allows us to measure the impact on scalability of adding global namespaces to ensure reliability. The source code for the benchmarks, together with more detailed documentation, is available at https://github.com/release-project/benchmarks. The RELEASE project team also worked to improve the reliability and scalability of other Erlang programs, including a substantial (approximately 150K lines of Erlang code) Sim-Diasca simulator [16] and an Instant Messenger that is more typical of Erlang applications [23], but we do not cover these systematically here.

i. Orbit

Figure 2: Distributed Erlang Orbit (D-Orbit) architecture: workers are mapped to nodes.

Orbit is a computation in symbolic algebra which generalises a transitive closure computation [57]. To compute the orbit for a given space [0, X], a list of generators g1, g2, ..., gn is applied to an initial vertex x0 ∈ [0, X]. This creates new values (x1, ..., xn) ∈ [0, X], where xi = gi(x0). The generator functions are applied to the new values until no new value is generated.
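A minimal sequential sketch of this fixpoint computation (ours; the distributed benchmark instead partitions the set over a DHT, as described below):

orbit(Gens, X0, SpaceSize) ->
    explore(Gens, [X0], sets:from_list([X0]), SpaceSize).

explore(_Gens, [], Seen, _SpaceSize) ->
    sets:to_list(Seen);
explore(Gens, [X | Queue], Seen, SpaceSize) ->
    %% Apply every generator to X, staying inside the space.
    Candidates = [G(X) rem SpaceSize || G <- Gens],
    Fresh = lists:usort([Y || Y <- Candidates,
                              not sets:is_element(Y, Seen)]),
    Seen1 = lists:foldl(fun sets:add_element/2, Seen, Fresh),
    explore(Gens, Fresh ++ Queue, Seen1, SpaceSize).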

Orbit is a suitable benchmark because it has a number of aspects that characterise a class of real applications. The core data structure it maintains is a set, and in distributed environments it is implemented as a distributed hash table (DHT), similar to the DHTs used in replicated form in NoSQL database management systems. Also, in distributed mode, it uses standard peer-to-peer (P2P) techniques like a credit-based termination algorithm [61]. By choosing the orbit size, the benchmark can be parameterised to specify smaller or larger computations that are suitable to run on a single machine (Section i) or on many nodes (Section i). Moreover, it is only a few hundred lines of code.

As shown in Figure 2, the computation is initiated by a master which creates a number of workers. In the single node scenario of the benchmark, workers correspond to processes, but these workers can also spawn other processes to apply the generator functions on a subset of their input values, thus creating intra-worker parallelism. In the distributed version of the benchmark, processes are spawned by the master node to worker nodes, each maintaining a DHT fragment. A newly spawned process gets a share of the parent's credit, and returns this on completion. The computation is finished when the master node collects all credit, i.e. all workers have completed.

ii. Ant Colony Optimisation (ACO)

Ant Colony Optimisation [29] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. For the purpose of this article we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [63], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node and consisting of multiple "ants". Ants are simple computational agents which concurrently compute possible solutions to the input problem, guided by shared information about good paths through the search space; there is also a certain amount of stochastic variation which allows the ants to explore new directions. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution.

We implement four distributed coordination patterns for the same multi-colony ACO computation, as follows. In each implementation the individual colonies perform some number of local iterations (i.e. generations of ants) and then report their best solutions; the globally-best solution is then selected and reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations.

Figure 3: Distributed Erlang Two-level Ant Colony Optimisation (TL-ACO) architecture.

Figure 4: Distributed Erlang Multi-level Ant Colony Optimisation (ML-ACO) architecture.

Two-level ACO (TL-ACO) has a single master node that collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 3 depicts the process and node placements of the TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on available nodes. In the next step each colony process spawns NA ant processes on the local node. Each ant iterates IA times, returning its result to the colony master. Each colony iterates IM times, reporting its best solution to, and receiving the globally-best solution from, the master process. We validated the implementation by applying TL-ACO to a number of standard SMTWTP instances [25, 13, 34], obtaining good results in all cases, and confirmed that the number of perfect solutions increases as we increase the number of colonies.
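The master loop of TL-ACO then has roughly the following shape (a sketch with assumed helper names; colony/0 stands for the colony process code):

master(Nodes, IM) ->
    Colonies = [spawn(Node, fun colony/0) || Node <- Nodes],
    global_iterations(Colonies, IM, none).

global_iterations(Colonies, 0, Best) ->
    [C ! stop || C <- Colonies],
    Best;
global_iterations(Colonies, I, Best) ->
    [C ! {iterate, self()} || C <- Colonies],
    %% Collect each colony's locally-best solution.
    Solutions = [receive {solution, S} -> S end || _ <- Colonies],
    NewBest = best_of([Best | Solutions]),
    %% Broadcast the globally-best solution back to the colonies.
    [C ! {global_best, NewBest} || C <- Colonies],
    global_iterations(Colonies, I - 1, NewBest).

best_of(Solutions) ->
    case [S || S <- Solutions, S =/= none] of
        []   -> none;
        Real -> lists:min(Real)  % assumes lower cost terms compare smaller
    end.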

Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies, and thus could become a bottleneck. ML-ACO addresses this by having a tree of submasters (Figure 4), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from their children.

Globally Reliable ACO (GR-ACO). In ML-ACO, if a single colony fails to report back, the system will wait indefinitely. GR-ACO adds fault tolerance, supervising colonies so that a faulty colony can be detected and restarted, allowing the system to continue execution.

Scalable Reliable ACO (SR-ACO) also adds fault-tolerance, but using supervision within our new s_groups from Section i.1; the architecture of SR-ACO is discussed in detail there.

IV. Erlang Scalability Limits

This section investigates the scalability of Erlang at the VM, language, and persistent storage levels. An aspect we choose not to explore is the security of large scale systems, where, for example, one might imagine providing enhanced security for systems with multiple clusters or cloud instances connected by a Wide Area Network. We assume that existing security mechanisms are used, e.g. a Virtual Private Network.

i. Scaling Erlang on a Single Host

To investigate Erlang scalability we built BenchErl, an extensible open source benchmark suite with a web interface.⁴ BenchErl shows how an application's performance changes when resources, like cores or schedulers, are added, or when options that control these resources change:

• the number of nodes, i.e. the number of Erlang VMs used, typically on multiple hosts;

• the number of cores per node;

• the number of schedulers, i.e. the OS threads that execute Erlang processes in parallel, and their binding to the topology of the cores of the underlying computer node;

• the Erlang/OTP release and flavor; and

• the command-line arguments used to start the Erlang nodes.

Using BenchErl, we investigated the scalability of an initial set of twelve benchmarks and two substantial Erlang applications, using a single Erlang node (VM) on machines with up to 64 cores, including the Bulldozer machine specified in Appendix A. This set of experiments, reported by [8], confirmed that some programs scaled well in the most recent Erlang/OTP release of the time (R15B01), but also revealed VM and language level scalability bottlenecks.

⁴ Information about BenchErl is available at http://release.softlab.ntua.gr/bencherl/.

Figure 5: Runtime and speedup of two configurations of the Orbit benchmark using Erlang/OTP R15B01.

Figure 5 shows runtime and speedup curves for the Orbit benchmark, where master and workers run on a single Erlang node, in configurations with and without intra-worker parallelism. In both configurations the program scales: runtime continuously decreases as we add more schedulers to exploit more cores. The speedup of the benchmark without intra-worker parallelism, i.e. without spawning additional processes for the computation (green curve), is almost linear up to 32 cores, but increases less rapidly from that point on; we see a similar, but more clearly visible, pattern for the other configuration (red curve), where there is no performance improvement beyond 32 schedulers. This is due to the asymmetric characteristics of the machine's micro-architecture, which consists of modules that couple two conventional x86 out-of-order cores that share the early pipeline stages, the floating point unit, and the L2 cache with the rest of the module [3].

Some other benchmarks, however, did not scale well or experienced significant slowdowns when run in many VM schedulers (threads). For example, the ets_test benchmark has multiple processes accessing a shared ETS table. Figure 6 shows runtime and speedup curves for ets_test on a 16-core (eight cores with hyperthreading) Intel Xeon-based machine. It shows that runtime increases beyond two schedulers, and that the program exhibits a slowdown instead of a speedup.

For many benchmarks there are obvious reasons for poor scaling, like limited parallelism in the application, or contention for shared resources. The reasons for poor scaling are less obvious for other benchmarks, and it is exactly these we have chosen to study in detail in subsequent work [8, 46].

Figure 6: Runtime and speedup of the ets_test BenchErl benchmark using Erlang/OTP R15B01.

A simple example is the parallel BenchErl benchmark that spawns some n processes, each of which creates a list of m timestamps and, after checking that each timestamp in the list is strictly greater than the previous one, sends the result to its parent. Figure 7 shows that up to eight cores each additional core leads to a slowdown; thereafter a small speedup is obtained up to 32 cores, and then again a slowdown. A small aspect of the benchmark, easily overlooked, explains the poor scalability: the benchmark creates timestamps using the erlang:now/0 built-in function, whose implementation acquires a global lock in order to return a unique timestamp. That is, two calls to erlang:now/0, even from different processes, are guaranteed to produce monotonically increasing values. This lock is precisely the bottleneck in the VM that limits the scalability of this benchmark. We describe our work to address VM timing issues in Section iii.
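The core of the benchmark looks roughly as follows (our reconstruction from the description above); every call to erlang:now/0 in every process takes the same global lock:

parallel(N, M) ->
    Parent = self(),
    Pids = [spawn(fun() ->
                      Ts = [erlang:now() || _ <- lists:seq(1, M)],
                      Parent ! {self(), strictly_increasing(Ts)}
                  end) || _ <- lists:seq(1, N)],
    [receive {Pid, Ok} -> Ok end || Pid <- Pids].

strictly_increasing([A, B | Rest]) when A < B ->
    strictly_increasing([B | Rest]);
strictly_increasing([_, _ | _]) ->
    false;
strictly_increasing(_) ->
    true.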

Discussion. Our investigations identified contention for shared ETS tables, and for commonly-used shared resources like timers, as the key VM-level scalability issues. Section VI outlines how we addressed these issues in recent Erlang/OTP releases.

ii. Distributed Erlang Scalability

Figure 7: Runtime and speedup of the BenchErl benchmark called parallel using Erlang/OTP R15B01.

Network Connectivity Costs. When any two normal distributed Erlang nodes communicate, they share their connection sets, and this typically leads to a fully connected graph of nodes. So a system with n nodes will maintain O(n²) connections, and these are relatively expensive TCP connections with continual maintenance traffic. This design aids transparent distribution, as there is no need to discover nodes, and it works well for small numbers of nodes. However, at emergent server architecture scales, i.e. hundreds of nodes, this design becomes very expensive, and system architects must switch from the default Erlang model, e.g. they need to start using hidden nodes that do not share connection sets.

We have investigated the scalability limits imposed by network connectivity costs using several Orbit calculations on two large clusters, Kalkyl and Athos, as specified in Appendix A. The Kalkyl results are discussed by [21], and Figure 29 in Section i shows representative results for distributed Erlang computing orbits with 2M and 5M elements on Athos. In all cases performance degrades beyond some scale (40 nodes for the 5M orbit, and 140 nodes for the 2M orbit). Figure 32 illustrates the additional network traffic induced by the fully connected network. It allows a comparison between the number of packets sent in a fully connected network (ML-ACO) and those sent in a network partitioned using our new s_groups (SR-ACO).

Global Information Costs. Maintaining global information is known to limit the scalability of distributed systems, and crucially the process namespaces used for reliability are global. To investigate the scalability limits imposed on distributed Erlang by such global information, we have designed and implemented DE-Bench, an open source, parameterisable, and scalable peer-to-peer benchmarking framework [37, 36]. DE-Bench measures the throughput and latency of distributed Erlang commands on a cluster of Erlang nodes, and the design is influenced by the Basho Bench benchmarking tool for Riak [12]. Each DE-Bench instance acts as a peer, providing scalability and reliability by eliminating central coordination and any single point of failure.

Figure 8: DE-Bench's internal workflow.

To evaluate the scalability of distributed Erlang, we measure how adding more hosts increases the throughput, i.e. the total number of successfully executed distributed Erlang commands per experiment. Figure 8 shows the parameterisable internal workflow of DE-Bench. There are three classes of commands in DE-Bench: (i) Point-to-Point (P2P) commands, where a function with tunable argument size and computation time is run on a remote node; these include spawn, rpc, and synchronous calls to server processes, i.e. gen_server or gen_fsm. (ii) Global commands, which entail synchronisation across all connected nodes, such as global:register_name and global:unregister_name. (iii) Local commands, which are executed independently by a single node, e.g. whereis_name, a look up in the local name table.
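For illustration, a P2P command in this style might be implemented as below (module and function names are our assumptions, not DE-Bench's actual API):

%% Run a remote computation with a tunable payload size (bytes)
%% and computation time (ms), DE-Bench style.
p2p(Node, ArgSize, CompTime) ->
    Payload = binary:copy(<<0>>, ArgSize),
    rpc:call(Node, remote_worker, compute, [Payload, CompTime]).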

The benchmarking is conducted on 10 to 100 host configurations of the Kalkyl cluster (in steps of 10), and measures the throughput of successful commands per second over 5 minutes. There is one Erlang VM on each host, and one DE-Bench instance on each VM. The full paper [37] investigates the impact of data size and computation time in P2P calls, both independently and in combination, and the scaling properties of the common Erlang/OTP generic server processes gen_server and gen_fsm.

Here we focus on the impact of different proportions of global commands, mixing global with P2P and local commands. Figure 9 shows that even a low proportion of global commands limits the scalability of distributed Erlang, e.g. just 0.01% global commands limits scalability to around 60 nodes. Figure 10 reports the latency of all commands, and shows that while the latencies for P2P and local commands are stable at scale, the latency of the global commands increases dramatically with scale. Both results illustrate that the impact of global operations on throughput and latency in a distributed Erlang system is severe.

Explicit Placement. While network connectivity and global information impact performance at scale, our investigations also identified explicit process placement as a programming issue at scale. Recall from Section iii that distributed Erlang requires the programmer to identify an explicit Erlang node (VM) when spawning a process. Identifying an appropriate node becomes a significant burden for large and dynamic systems. The problem is exacerbated in large distributed systems where (1) the hosts may not be identical, having different hardware capabilities or different software installed, and (2) communication times may be non-uniform: it may be fast to send a message between VMs on the same host, and slow if the VMs are on different hosts in a large distributed system.

These factors make it difficult to deploy applications, especially in a scalable and portable manner. Moreover, while the programmer may be able to use platform-specific knowledge to decide where to spawn processes so that an application runs efficiently, if the application is then deployed on a different platform, or if the platform changes as hosts fail or are added, this knowledge becomes outdated.

Figure 9: Scalability vs percentage of global commands in distributed Erlang.

Discussion. Our investigations confirm three language-level scalability limitations of Erlang from developer folklore. (1) Maintaining a fully connected network of Erlang nodes limits scalability; for example, Orbit is typically limited to just 40 nodes. (2) Global operations, and crucially the global operations required for reliability, i.e. to maintain a global namespace, seriously limit the scalability of distributed Erlang systems. (3) Explicit process placement makes it hard to build performance portable applications for large architectures. These issues cause designers of reliable large scale systems in Erlang to depart from the standard Erlang model, e.g. using techniques like hidden nodes and storing pids in data structures. In Section V we develop language technologies to address these issues.

iii. Persistent Storage

Any large scale system needs reliable scalable persistent storage, and we have studied the scalability limits of Erlang persistent storage alternatives [38]. We envisage a typical large server having around 10^5 cores on around 100 hosts. We have reviewed the requirements for scalable and available persistent storage, and evaluated four popular Erlang DBMS against these requirements. For a target scale of around 100 hosts, Mnesia and CouchDB are, unsurprisingly, not suitable. However, Dynamo-style NoSQL DBMS like Cassandra and Riak have the potential to be.

We have investigated the current scalability limits of the Riak NoSQL DBMS using the Basho Bench benchmarking framework on a cluster with up to 100 nodes and independent disks. We found that the scalability limit of Riak version 1.1.1 is 60 nodes on the Kalkyl cluster. The study placed into the public scientific domain what was previously well-evidenced, but anecdotal, developer experience.

We have also shown that resources like memory, disk and network do not limit the scalability of Riak. By instrumenting the global and gen_server OTP libraries, we identified a specific Riak remote procedure call that fails to scale. We outline how later releases of Riak are refactored to eliminate the scalability bottlenecks.

Figure 10: Latency of commands as the number of Erlang nodes increases.

Discussion. We conclude that Dynamo-like NoSQL DBMSs have the potential to deliver reliable persistent storage for Erlang at our target scale of approximately 100 hosts. Specifically, an Erlang Cassandra interface is available, and Riak 1.1.1 already provides scalable and available persistent storage on 60 nodes. Moreover, the scalability of Riak is much improved in subsequent versions.

V. Improving Language Scalability

This section outlines the Scalable Distributed (SD) Erlang libraries [21] we have designed and implemented to address the distributed Erlang scalability issues identified in Section ii. SD Erlang introduces two concepts to improve scalability. S_groups partition the set of nodes in an Erlang system to reduce network connectivity and partition global data (Section i.1). Semi-explicit placement alleviates the issues of explicit process placement in large heterogeneous networks (Section i.2). The two features are independent and can be used separately or in combination. We overview SD Erlang in Section i, and outline s_group semantics and validation in Sections ii and iii respectively.

i. SD Erlang Design

i.1 S_groups

S_groups reduce both the number of connections a node maintains and the size of namespaces, i.e. they minimise global information. Specifically, names are registered on, and synchronised between, only the nodes within the s_group. An s_group has the following parameters: a name, a list of nodes, and a list of registered names. A node can belong to many s_groups or to none. If a node belongs to no s_group, it behaves as a usual distributed Erlang node.

Table 1: Summary of s_group functions

    new_s_group(SGroupName, Nodes)
        Creates a new s_group consisting of some nodes.
    delete_s_group(SGroupName)
        Deletes an s_group.
    add_nodes(SGroupName, Nodes)
        Adds a list of nodes to an s_group.
    remove_nodes(SGroupName, Nodes)
        Removes a list of nodes from an s_group.
    s_groups()
        Returns a list of all s_groups known to the node.
    own_s_groups()
        Returns a list of s_group tuples the node belongs to.
    own_nodes()
        Returns a list of nodes from all s_groups the node belongs to.
    own_nodes(SGroupName)
        Returns a list of nodes from the given s_group.
    info()
        Returns s_group state information.
    register_name(SGroupName, Name, Pid)
        Registers a name in the given s_group.
    re_register_name(SGroupName, Name, Pid)
        Re-registers a name (changes a registration) in a given s_group.
    unregister_name(SGroupName, Name)
        Unregisters a name in the given s_group.
    registered_names({node, Node})
        Returns a list of all registered names on the given node.
    registered_names({s_group, SGroupName})
        Returns a list of all registered names in the given s_group.
    whereis_name(SGroupName, Name)
    whereis_name(Node, SGroupName, Name)
        Return the pid of a name registered in the given s_group.
    send(SGroupName, Name, Msg)
    send(Node, SGroupName, Name, Msg)
        Send a message to a name registered in the given s_group.

The s_group library defines the functions shown in Table 1. Some of these functions manipulate s_groups and provide information about them, such as creating s_groups and providing a list of nodes from a given s_group. The remaining functions manipulate names registered in s_groups and provide information about these names. For example, to register a process Pid with name Name in s_group SGroupName we use the following function; the name will only be registered if the process is being executed on a node that belongs to the given s_group, and neither Name nor Pid is already registered in that group:

s_group:register_name(SGroupName, Name,
                      Pid) -> yes | no

To illustrate the impact of s_groups on scalability we repeat the global operations experiment from Section ii (Figure 9). In the SD Erlang experiment we partition the set of nodes into s_groups, each containing ten nodes; hence names are replicated and synchronised on just ten nodes, and not on all nodes as in distributed Erlang. The results in Figure 11 show that with 0.01% of global operations the throughput of distributed Erlang stops growing at 40 nodes, while the throughput of SD Erlang continues to grow linearly.

Figure 11: Global operations in Distributed Erlang vs SD Erlang.

The connection topology of s_groups is extremely flexible: they may be organised into a hierarchy of arbitrary depth or branching, e.g. there could be multiple levels in the tree of s_groups; see Figure 12. Moreover, it is not necessary to create a hierarchy of s_groups; for example, we have constructed an Orbit implementation using a ring of s_groups.

Given such a flexible way of organising distributed systems, key questions in the design of an SD Erlang system are the following. How should s_groups be structured? Depending on the reason the nodes are grouped (reducing the number of connections, reducing the namespace, or both), s_groups can be freely structured as a tree, ring, or some other topology. How large should the s_groups be? Smaller s_groups mean more inter-group communication, but the synchronisation of the s_group state between the s_group nodes constrains the maximum size of s_groups. We have not found this constraint to be a serious restriction; for example, many s_groups are either relatively small (e.g. 10-node) internal or terminal elements in some topology, e.g. the leaves and nodes of a tree. How do nodes from different s_groups communicate? While any two nodes can communicate in an SD Erlang system, to minimise the number of connections communication between nodes from different s_groups is typically routed via gateway nodes that belong to both s_groups. How do we avoid single points of failure? For reliability, and to minimise communication load, multiple gateway nodes and processes may be required.

Information to make these design choices is provided by the tools in Section VII and by benchmarking. A further challenge is how to systematically refactor a distributed Erlang application into SD Erlang, and this is outlined in Section i. A detailed discussion of distributed system design and refactoring in SD Erlang is provided in a recent article [22].

We illustrate typical SD Erlang system designs by showing refactorings of the Orbit and ACO benchmarks from Section III. In both distributed and SD Erlang, the computation starts on the Master node and the actual computation is done on the Worker nodes. In the distributed Erlang version all nodes are interconnected, and messages are transferred directly from the sending node to the receiving node (Figure 2). In contrast, in the SD Erlang version nodes are grouped into s_groups, and messages are transferred between different s_groups via Submaster nodes (Figure 13).

Figure 12: SD Erlang ACO (SR-ACO) architecture: a master process (level 0) and s_groups of colony nodes (levels 1 and 2), with sub-master, colony, and ant processes.

A fragment of code that creates an s_group on a Sub-master node is as follows:

create_s_group(Master, GroupName, Nodes0) ->
    case s_group:new_s_group(GroupName,
                             Nodes0) of
        {ok, GroupName, Nodes} ->
            Master ! {GroupName, Nodes};
        _ -> io:format("exception message")
    end.

Similarly, we introduce s_groups in the GR-ACO benchmark from Section ii to create Scalable Reliable ACO (SR-ACO); see Figure 12. Here, apart from reducing the number of connections, s_groups also reduce the global namespace information. That is, instead of registering the name of a pid globally, i.e. with all nodes, the name is registered only on the nodes in the s_group, with s_group:register_name/3.

A comparative performance evaluation of distributed Erlang and SD Erlang Orbit and ACO is presented in Section VIII.

i.2 Semi-Explicit Placement

Recall from Section iii that distributed Erlang spawns a process onto an explicitly named Erlang node, e.g.

spawn(some_node, fun some_module:pong/0)

and also recall the portability and programming effort issues associated with such explicit placement in large scale systems, discussed in Section ii.

To address these issues we have developed a semi-explicit placement library that enables the programmer to select nodes on which to spawn processes based on run-time information about the properties of the nodes. For example, if a process performs a lot of computation, one would like to spawn it on a node with considerable computation power; or, if two processes are likely to communicate frequently, then it would be desirable to spawn them on the same node, or on nodes with a fast interconnect.

Figure 13: SD Erlang (SD-Orbit) architecture.

We have implemented two Erlang libraries to support semi-explicit placement [60]. The first deals with node attributes, and describes properties of individual Erlang VMs and associated hosts, such as total and currently available RAM, installed software, hardware configuration, etc. The second deals with a notion of communication distances, which models the communication times between nodes in a distributed system. Therefore, instead of specifying a node, we can use the attr:choose_node/1 function to define the target node, i.e.

spawn(attr:choose_node(Params),
      fun some_module:pong/0)

[60] report an investigation into the communication latencies on a range of NUMA and cluster architectures, and demonstrate the effectiveness of the placement libraries using the ML-ACO benchmark on the Athos cluster.

ii. S_group Semantics

For precise specification and as a basis for val-idation we provide a small-step operationalsemantics of the s_group operations [21] Fig-ure 14 defines the state of an SD Erlang systemand associated abstract syntax variables Theabstract syntax variables on the left are definedas members of sets denoted and these inturn may contain tuples denoted () or furthersets In particular nm is a process name p apid ni a node_id and nis a set of node_idsThe state of a system is modelled as a fourtuple comprising a set of s_groups a set off ree_groups a set of f ree_hidden_groups anda set of nodes Each type of group is associ-ated with nodes and has a namespace Ans_group additionally has a name whereas af ree_hidden_group consists of only one node

i.e. a hidden node simultaneously acts as a node and as a group, because as a group it has a namespace but does not share it with any other node. Free normal and hidden groups have no names, and are uniquely defined by the nodes associated with them. Therefore, group names, gr_names, are either NoGroup or a set of s_group_names. A namespace is a set of (name, pid) pairs, and is replicated on all nodes of the associated group.

A node has the following four parameters: node_id identifier; node_type, which can be either hidden or normal; connections; and group_names, i.e. the names of the groups the node belongs to. A node can belong either to a list of s_groups or to one of the free groups; the type of the free group is defined by the node type. Connections are a set of node_ids.

Transitions in the semantics have the form (state, command, ni) → (state′, value), meaning that executing command on node ni in state returns value and transitions to state′.

The semantics is presented in more detail by [21], but we illustrate it here with the s_group:registered_names/1 function from Section i.1. The function returns a list of names


(grs, fgs, fhs, nds) ∈ state ≡
    ({s_group}, {free_group}, {free_hidden_group}, {node})
gr ∈ grs ≡ {s_group} ≡ {(s_group_name, {node_id}, namespace)}
fg ∈ fgs ≡ {free_group} ≡ {({node_id}, namespace)}
fh ∈ fhs ≡ {free_hidden_group} ≡ {(node_id, namespace)}
nd ∈ nds ≡ {node} ≡ {(node_id, node_type, connections, gr_names)}
gs ∈ gr_names ≡ NoGroup | {s_group_name}
ns ∈ namespace ≡ {(name, pid)}
cs ∈ connections ≡ {node_id}
nt ∈ node_type ≡ Normal | Hidden
s ∈ NoGroup | s_group_name

Figure 14: SD Erlang state [21]

registered in the s_group s if node ni belongs to the s_group, and an empty list otherwise (Figure 15). Here ⊕ denotes disjoint set union; IsSGroupNode returns true if node ni is a member of some s_group s, and false otherwise; and OutputNms returns a set of process names registered in the namespace ns of s_group s.

iii Semantics Validation

As the semantics is concrete it can readily be made executable in Erlang, with lists replacing sets throughout. Having an executable semantics allows users to engage with it, and to understand how the semantics behaves vis à vis the library, giving them an opportunity to assess the correctness of the library against the semantics.

Figure 16: Testing s_groups using QuickCheck

Better still, we can automatically assess how the system behaves in comparison with the (executable) semantics by executing them in lockstep, guided by the constraints of which operations are possible at each point. We do that by building an abstract state machine model of the library. We can then generate random sequences (or traces) through the model, with appropriate library data generated too. This random generation is supported by the QuickCheck property-based testing system [24, 9].

The architecture of the testing framework is shown in Figure 16. First, an abstract state machine, embedded as an "eqc_statem" module, is derived from the executable semantic specification. The state machine defines the abstract state representation and the transition from one state to another when an operation is applied. Test case and data generators are then defined to control the test case generation;


((grs, fgs, fhs, nds), s_group:registered_names(s), ni) →
    ((grs, fgs, fhs, nds), nms)    if IsSGroupNode(ni, s, grs)
    ((grs, fgs, fhs, nds), {})     otherwise

where (s, {ni} ⊕ nis, ns) ⊕ grs′ ≡ grs
      nms ≡ OutputNms(s, ns)

IsSGroupNode(ni, s, grs) = ∃ nis, ns, grs′. (s, {ni} ⊕ nis, ns) ⊕ grs′ ≡ grs

OutputNms(s, ns) = {(s, nm) | (nm, p) ∈ ns}

Figure 15: SD Erlang semantics of the s_group:registered_names/1 function

this includes the automatic generation of eligible s_group operations and the input data to those operations. Test oracles are encoded as the postcondition for s_group operations.

During testing, each test command is applied to both the abstract model and the s_group library. The application of the test command to the abstract model takes the abstract model from its current state to a new state, as described by the transition functions, whereas the application of the test command to the library leads the system to a new actual state. The actual state information is collected from each node in the distributed system, then merged and normalised to the same format as the abstract state representation. For a test to be successful, after the execution of a test command the test oracles specified for this command should be satisfied. Various test oracles can be defined for s_group operations; for instance, one of the generic constraints that applies to all the s_group operations is that after each s_group operation the normalised system state should be equivalent to the abstract state.
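The skeleton below sketches the shape of such a state machine (a minimal sketch assuming Quviq QuickCheck's eqc_statem behaviour; the generators gen_group_name/0 and gen_nodes/1, and the helpers abstract_state/1, actual_state/0 and normalise/1, are illustrative placeholders):

-module(s_group_eqc).
-include_lib("eqc/include/eqc.hrl").
-include_lib("eqc/include/eqc_statem.hrl").

-record(state, {groups = [], nodes = []}).

initial_state() -> #state{}.

%% Generate eligible s_group operations and their input data.
command(S) ->
    oneof([{call, s_group, new_s_group, [gen_group_name(), gen_nodes(S)]},
           {call, s_group, registered_names, [gen_group_name()]}]).

%% The abstract model steps as described by the semantics.
next_state(S, _Res, {call, s_group, new_s_group, [G, Nds]}) ->
    S#state{groups = [{G, Nds} | S#state.groups]};
next_state(S, _Res, _Call) -> S.

%% Oracle: the normalised actual state matches the abstract state.
postcondition(S, {call, s_group, _Op, _Args}, _Res) ->
    normalise(actual_state()) =:= normalise(abstract_state(S)).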

Thousands of tests were run, and three kinds of errors, all subsequently corrected, were found. Some errors in the library implementation were found, including one error due to the synchronisation between nodes, and another related to the remove_nodes operation, which erroneously raised an exception. We also found a couple of trivial errors in the semantic specification itself, which had been missed by manual examination. Finally, we found some situations where there were inconsistencies between the semantics and the library implementation, despite their states being equivalent; an example of this was in the particular values returned by functions on certain errors. Overall, the automation of testing boosted our confidence in the correctness of the library implementation and the semantic specification. This work is reported in more detail by [54].

VI Improving VM Scalability

This section reports the primary VM and library improvements we have designed and implemented to address the scalability and reliability issues identified in Section i.


i Improvements to Erlang Term Storage

Because ETS tables are so heavily used in Erlang systems, they are a focus for scalability improvements. We start by describing their redesign, including some improvements that pre-date our RELEASE project work, i.e. those prior to Erlang/OTP R15B03. These historical improvements are very relevant for a scalability study, and form the basis for our subsequent changes and improvements. At the point when Erlang/OTP got support for multiple cores (in release R11B), there was a single reader-writer lock for each ETS table. Optional fine grained locking of hash-based ETS tables (i.e. set, bag or duplicate_bag tables) was introduced in Erlang/OTP R13B02-1, adding 16 reader-writer locks for the hash buckets. Reader groups to minimise read synchronisation overheads were introduced in Erlang/OTP R14B. The key observation is that a single count of the multiple readers must be synchronised across many cache lines, potentially far away in a NUMA system. Maintaining reader counts in multiple (local) caches makes reads fast, although writes must now check every reader count. In Erlang/OTP R16B the number of bucket locks, and the default number of reader groups, were both upgraded from 16 to 64.
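These concurrency options are selected per table at creation time; for instance (an illustrative fragment, with an arbitrary table name and contents):

%% Fine-grained bucket locking and reader-group optimised reads
%% for a hash-based (set) table:
T = ets:new(bench_table, [set, public,
                          {write_concurrency, true},
                          {read_concurrency, true}]),
true = ets:insert(T, {key, 42}),
[{key, 42}] = ets:lookup(T, key).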

We illustrate the scaling properties of the ETS concurrency options using the ets_bench BenchErl benchmark on the Intel NUMA machine with 32 hyperthreaded cores specified in Appendix A. The ets_bench benchmark inserts 1M items into the table, then records the time to perform 17M operations, where an operation is either a lookup, an insert, or a delete. The experiments are conducted on a hash-based (set) ETS table with different percentages of update operations, i.e. insertions or deletions.

Figure 17 shows the runtimes in seconds of 17M operations in different Erlang/OTP versions, varying the number of schedulers (x-axis), reflecting how the scalability of ETS tables has improved in more recent Erlang/OTP releases. Figure 18 shows the runtimes in seconds of 17M operations on an ETS table with different numbers of reader groups, again varying the number of schedulers. We see that one reader group is not sufficient with 10% updates, nor are two with 1% updates. Beyond that, different numbers of reader groups have little impact on the benchmark performance, except that using 64 groups with 10% updates slightly degrades performance.

Figure 17: Runtime of 17M ETS operations in Erlang/OTP releases R11B-5, R13B02-1, R14B and R16B: 10% updates (left) and 1% updates (right). The y-axis shows time in seconds (lower is better), and the x-axis the number of OS threads (VM schedulers).

Figure 18: Runtime of 17M ETS operations with varying numbers of reader groups: 10% updates (left) and 1% updates (right). The y-axis shows runtime in seconds (lower is better), and the x-axis the number of schedulers.

We have explored four other extensions or redesigns of the ETS implementation for better scalability: (1) allowing more programmer control over the number of bucket locks in hash-based tables, so that the programmer can reflect the number of schedulers and the expected access pattern; (2) using contention-adapting trees to get better scalability for ordered_set ETS tables, as described by [69]; (3) using queue delegation locking to improve scalability [47]; and (4) adopting schemes for completely eliminating the locks in the meta table. A more complete discussion of our work on ETS can be found in the papers by [69] and [47].

Here we outline only our work on contention-adapting (CA) trees. A CA tree monitors contention in different parts of a tree-shaped data structure, introducing routing nodes with locks in response to high contention, and removing them in response to low contention. For experimental purposes, two variants of the CA tree have been implemented to represent ordered_sets in the virtual machine of Erlang/OTP 17.0: one extends the existing AVL trees in the Erlang VM, and the other uses a Treap data structure [5]. Figure 19 compares the throughput of the CA tree variants with that of ordered_set and set as the number of schedulers increases. It is unsurprising that the CA trees scale so much better than an ordered_set, which is protected by a single readers-writer lock. It is more surprising that they also scale better than set. This is due to hash tables using fine-grained locking at a fixed granularity, while CA trees can adapt the number of locks both to the current contention level, and to the parts of the key range where contention is occurring.

Figure 19: Throughput (operations per microsecond) of the CA tree variants (AVL-CA tree, Treap-CA tree) versus set and ordered_set: 10% updates (left) and 1% updates (right); the x-axis is the number of threads.


ii Improvements to Schedulers

In the Erlang VM, a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until it either blocks waiting for input, typically a message from some other process, or until it has executed its quota of reductions.

The Erlang VM is usually started with one scheduler per logical core (SMT-thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned it is placed in the run queue of the scheduler of its parent, and it waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores, that is, an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Figure 24 (Section iii.2).

The default load management mechanism is load compaction, which aims to keep as many scheduler threads as possible fully loaded with work, i.e. it attempts to ensure that scheduler threads do not run out of work. We have developed a new optional scheduler utilisation balancing mechanism, available from Erlang/OTP 17.0. The new mechanism aims to balance scheduler utilisation between schedulers; that is, it strives for equal scheduler utilisation on all schedulers.
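Scheduler utilisation balancing is off by default and is enabled with the +sub flag when starting the VM, as described in the erl documentation:

erl +sub true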

The scheduler utilisation balancing mechanism has no performance impact on the system when not enabled. On the other hand, when enabled it results in changed timing in the system; normally there is a small overhead, due to measuring utilisation and calculating balancing information, which depends on the underlying primitives provided by the operating system.

The new balancing mechanism results in a better distribution of processes to schedulers, reducing the probability of core contention. Together with other VM improvements, such as interruptable BIFs and garbage collection, it results in lower latency and improved responsiveness, and hence reliability, for soft real-time applications.

iii Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel BenchErl benchmark (Section i). The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in that returns "Erlang system time", or time since Epoch, with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, i.e. not SMT-enabled. The documentation promises that values returned by it are strictly increasing, and many applications ended up relying on this; for example, applications often employ erlang:now/0 to generate unique integers.

Erlang system time should align with the operating system's view of time since Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot, without invalidating the strictly increasing value guarantee. The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when the two have deviated, e.g. in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], i.e. a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards and backwards, while system time is allowed to do this. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.
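In the new API, Erlang system time is obtained by combining the two components, and the unique-integer idiom no longer needs erlang:now/0 (an illustrative fragment):

%% Erlang system time is monotonic time plus a varying offset:
SystemTime = erlang:monotonic_time() + erlang:time_offset(),

%% A scalable replacement for the erlang:now/0 idiom of
%% generating strictly increasing unique integers:
Unique = erlang:unique_integer([monotonic, positive]).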

Time Retrieval. Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe, but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, monotonic time delivered by the operating system is solely based on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP); that is, they will run with different frequencies. Linux is an exception, with a monotonic clock that is NTP-adjusted and runs with the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing monotonic time and system time delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute. If the adjustment needs to be changed, new adjustment information is published and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the time adjustment information previously published, and calculates Erlang monotonic time. To preserve monotonicity it is important that all threads that read the same OS monotonic time map it to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information, using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed, which means that in the vast majority of cases it will be read-locked, allowing multiple readers to run concurrently. To prevent bouncing the lock cache-line, we use a bespoke reader-optimised RW lock implementation, where reader threads notify about their presence in counters on separate cache-lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11]; alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].

Timer Wheel and BIF Timer. The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel, used by the processes executing on that scheduler.

The implementation of timers in Erlang/OTP uses a built-in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS tables, and only insert one timer at a time into the timer wheel.

Benchmarks. We have measured several benchmarks on a 16-core Bulldozer machine with eight dual CPU AMD Opteron 4376 HEs.5 We present three of them here.

The first micro benchmark compares the execution time of an Erlang receive with that of a receive after that specifies a timeout and provides a default value. The receive after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, total execution time with the optimised timers is only 5% longer than without timers.
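The two patterns compared by this micro benchmark have the following shape (an illustrative sketch):

%% Plain receive: no timer is involved.
wait() ->
    receive Msg -> Msg end.

%% receive .. after: a timer is set when the process blocks,
%% and cancelled when a message arrives.
wait_with_default() ->
    receive Msg -> Msg
    after 1000 -> default
    end.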

The second micro benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4 and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on the left shows that (1) the performance of time management has remained roughly unchanged between Erlang/OTP releases prior to 18.0; (2) the improved time management in Erlang/OTP 18.x makes time management less likely to be a scalability bottleneck, even when using erlang:now/0; and (3) the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution. The graph on the right side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

5 See §2.5.4 of the RELEASE project Deliverable 2.4 (http://release-project.eu/documents/D24.pdf).

Figure 20: BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases: runtimes (left) and speedup obtained using erlang:monotonic_time/0 (right).

VII Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable, to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling and debugging Erlang systems, which can be used separately or together as appropriate for solving the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on, or specialise, the services offered by the Erlang virtual machine through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer6 application gives a comprehensive overview of many of these data on a node-by-node basis.

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature; that is, the tools support the language itself, rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use the tracing support provided by the Erlang VM, other actor frameworks, e.g. Akka, can use the Kamon7 JVM monitoring system. Similarly, tools for other actor languages or frameworks could use data derived through the probes of OS-level tracing frameworks such as DTrace8 and SystemTap9, as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

6 http://www.erlang.org/doc/apps/observer
7 http://kamon.io

i Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare it for modification or extension, or, as is the case here, in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support in a refactoring tool. There are a number of tools that support refactoring in Erlang: in the RELEASE project we have chosen to extend Wrangler10 [56]; other tools include Tidier [68] and RefactorErl [44].

Supporting API Migration. The SD Erlang libraries modify Erlang's global_group library, becoming the new s_group library as a result. Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolves and this often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be

8 http://dtrace.org/blogs/about
9 https://sourceware.org/systemtap/wiki
10 http://www.cs.kent.ac.uk/projects/wrangler


supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically. The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work, and the technicalities of the implementation, can be found in a paper by [52].
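An adaptor module for this migration might take the following shape (a hedged sketch: the module name and the particular old-to-new mappings are illustrative, not the actual adaptor used in the project):

-module(global_group_adaptor).
-compile(export_all).

%% Each old global_group call is defined in terms of the new
%% s_group API; the migration refactoring is generated from
%% these definitions.
registered_names(Where) ->
    s_group:registered_names(Where).

send(Name, Msg) ->
    s_group:send(Name, Msg).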

Support for Introducing Parallelism. We have introduced support for parallelising explicit list operations (map and foreach); for process introduction, to complete a computationally intensive task in parallel; for introducing a worker process to deal with call handling in an Erlang "generic server"; and for parallelising a tail recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach. The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward; even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
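The essence of such a parallel map is easily sketched (an illustrative implementation, not necessarily the one in para_lib): each element is mapped in its own process, and the results are collected in list order.

pmap(F, Xs) ->
    Parent = self(),
    %% One process per element, each sending its result back:
    Pids = [spawn(fun() -> Parent ! {self(), F(X)} end) || X <- Xs],
    %% Collect results in the order the processes were spawned:
    [receive {Pid, Res} -> Res end || Pid <- Pids].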

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute a task in parallel with its parent process. The result of the new process is sent back to the parent process, which will then consume it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies; for instance, a function might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are checked by program slicing, which is discussed below.

Support for Program Slicing. Program slicing is a general program analysis technique for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE11 [17], a tool developed in another EU project that also re-uses the Wrangler front end. That work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing to transform programs, making them suitable for the introduction of parallel structures.

11 http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf


Figure 21: The architecture of WombatOAM. The Master node offers northbound REST interfaces to the Wombat web dashboard and CLI, and delegates to Middle Managers; agents on the managed Erlang nodes provide the metrics, notifications, alarms, topology and orchestration services, with virtual machines provisioned from infrastructure providers via Libcloud/elibcloud and software provisioned over SSH.

ii Scalable Deployment

We have developed the WombatOAM tool12 to provide a deployment, operations and maintenance framework for large-scale Erlang distributed systems. These systems typically consist of a number of Erlang nodes executing on different hosts. The hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP.

12 WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd.

Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this remains the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture. The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability because of the role played by the central Master node; in the current version an additional layer of Middle Managers was introduced, to allow the system to scale easily to thousands of deployed nodes. As the diagram shows, the "northbound" interfaces to the web dashboard and the command-line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms and so forth; we describe those now.

Services. WombatOAM is designed to collect, store and display various kinds of information and events from running Erlang systems, and these data are accessed and managed through an AJAX-based Web Dashboard; they include the following.

Metrics. WombatOAM supports the collection of some hundred metrics (including, for instance, the number of processes on a node, and the message traffic in and out of the node) on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom13, and interface with other tools, such as graphite14, which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications. As well as metrics, it can support event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which is part of the standard distribution, or the lager logging framework15, and will be displayed and logged as they occur.

13 Folsom collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom
14 https://graphiteapp.org

Alarms. Alarms are more complex entities. Alarms have a state: they can be raised, and once dealt with they can be cleared. They also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as for notifications.

Topology. The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible; if not, it notifies the other services and periodically tries to reconnect, and when the nodes are available again it also notifies the other services. It doesn't have a middle manager part, because it doesn't talk to the nodes directly; instead it asks the Node manager service to do so.

Node manager. This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration. This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers the Orchestration service uses an external library called Libcloud16, for which Erlang Solutions has written an open source Erlang wrapper called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications: it provides infrastructure for deploying them.

15 https://github.com/basho/lager

Deployment. The deployment mechanism consists of the following five steps.

1. Registering a provider. WombatOAM provides the same interface for different cloud providers that support the OpenStack standard or the Amazon EC2 API. WombatOAM also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this enables WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

16 The unified cloud API [4]: https://libcloud.apache.org

4. Defining a deployment domain. At this step a deployment domain is created, specifying (i) which providers should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to, or removed from, the system depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue, and processes migrate between the queues through a work stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes onto nodes.17

iii.1 Percept2

Percept218 builds on the existing Percept tool to provide post hoc, offline analysis and visualisation of Erlang systems. Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling for a distributed Erlang application using Percept2 is shown in Figure 22.

17 This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.
18 https://github.com/RefactoringTools/percept2

Figure 22: Offline profiling of a distributed Erlang application using Percept2
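A typical session follows a profile/analyse/view workflow (a sketch, assuming Percept2 retains Percept's three-step API; the file name, entry point and port are illustrative):

%% Trace a run of the program to a file:
percept2:profile("orbit.dat", {orbit, main, [Args]}, [all]),
%% ... once the run completes, analyse and view the results:
percept2:analyze(["orbit.dat"]),
percept2:start_webserver(8888).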

Percept generates an application-level, zoomable concurrency graph showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways, as detailed by [53], including most importantly:

• Distinguishing between running and runnable time for each process: this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including: the migration history of a process between run queues; statistics about message passing between processes, i.e. the number of messages and the average message size sent/received by a process; and the accumulated runtime per process, i.e. the accumulated time when a process is in a running state.

Figure 23: Percept2 showing processes running and runnable (with 4 schedulers/run queues)

• Presenting the process tree: the hierarchy structure indicating the parent-child relationships between processes.

• Recording dynamic function call graph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system. Unlike global group, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating the measurement of a function's own execution time gives users the freedom of not profiling all the function calls invoked during the program execution. For example, they can choose to profile only the functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of generated web pages. Together these make the system more responsive and more scalable.

iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks they are relatively lightweight, and require neither specially recompiled versions of the software to be examined, nor special post-processing tools to create meaningful information from the gathered data. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the sizes of the run queues vary during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the same storage infrastructure and presentation facilities.

Figure 24: DTrace run-queue visualisation of the bang benchmark with 16 schedulers (run queues 0-15; time on the x-axis)

iii.3 Devo

Devo19 [10] is designed to provide real-time, online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).

Figure 25 shows visualisations from Devo in both modes. On the left-hand side a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram), with six cores each; with hyperthreading this gives twelve virtual cores, and hence 24 run queues in total. The size of these run queues is shown by both the colour and the height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle. A green arc shows migration on the same physical core, a grey one on the same chip, and blue shows migrations between the two chips.

19 https://github.com/RefactoringTools/devo

Figure 25: Devo low- and high-level visualisations of SD-Orbit

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between the nodes (green quiescent; red busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. The tracing to be performed on monitored nodes is also specified within the configuration file.

An agent is started by a master SD-Mon node for each s_group and for each free node. The configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up it tries to get in contact with its nodes, and applies the tracing to them as stated by the master. Binary files are stored in the host file system. Tracing is used internally in order to track s_group operations happening at runtime; an asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

Figure 26: SD-Mon architecture

The monitoring network is also supervised, in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected then that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, the messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution: before (top) and after eliminating s_group 2 (bottom)


Figure 28: SD-Mon online monitoring

VIII Systemic Evaluation

Preceding sections have investigated improvements to individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI and VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state of the art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable20. The bulk of the experiments reported here are conducted on the Athos cluster, using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure. The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

20 See Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf


Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling]. The x-axis gives the number of nodes (cores), up to 257 (6168); the y-axis gives relative speedup for D-Orbit and SD-Orbit with W=2M and W=5M.

i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot the standard deviation. The results show that D-Orbit performs better on a small number of nodes, as communication is direct, rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups, i.e. beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails, because some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5,000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long running servers.


Figure 30: Wombat deployment time: experimental data and the best fit 0.0125N + 83 seconds for N nodes (N up to 10,000).

The measurement data shows two important facts: it shows that WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and that WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of the effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of the two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, whereas SR-ACO registers them only within an s_group (Section ii).

For both GR-ACO and SR-ACO a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common X86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark, and obtained similar results [23].
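The essence of such a Chaos Monkey is easily expressed in Erlang (a minimal, illustrative sketch, not the service used in the experiments; rand requires OTP 18+, with random:uniform/1 as the earlier equivalent):

chaos_monkey() ->
    timer:sleep(1000),
    %% Pick a random process on this node, other than ourselves:
    Pids = [P || P <- processes(), P =/= self()],
    Victim = lists:nth(rand:uniform(length(Pids)), Pids),
    %% Kill it unconditionally, then repeat:
    exit(Victim, kill),
    chaos_monkey().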

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.


Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling]. The x-axis gives the number of nodes (cores), up to 250 (6000); the y-axis gives execution time in seconds for GR-ACO, SR-ACO and ML-ACO.

Scalability. Figure 31 compares the runtimes of the ML, GR, and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and a fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks, and others, that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.

As SD Orbit scales better than D-Orbit, SR-ACO scales better than ML-ACO, and SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes respectively, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limit of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language while preserving reliability. The work is a vade mecum for addressing scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering VM, language and persistent storage levels. We have developed the BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15] or Legion [40]. We establish scientifically the folklore limitations of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including the memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups for partitioning the network and global process namespace, and semi-explicit process placement for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).

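As a concrete sketch, the fragment below creates an s_group and spawns a process on one of its nodes. We assume s_group:new_s_group/2 as described in Section V; the random node choice is a stand-in for the attribute-based semi-explicit placement of [60], and submaster:start/0 is a hypothetical entry point:

    %% Partition Nodes into a named s_group, then place a process on a
    %% node chosen from the group.
    start_partition(Nodes) ->
        s_group:new_s_group(aco_group, Nodes),
        Node = lists:nth(rand:uniform(length(Nodes)), Nodes),
        spawn(Node, fun() -> submaster:start() end).
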
To improve the scalability of the Erlang VM and libraries we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores and to minimise contention on shared VM resources (Section VI).

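At the API level, much of this work sits behind the existing ETS tuning options; the sketch below shows the relevant knobs (for ordered_set tables, later OTP releases use the contention-adapting trees of Section VI when write_concurrency is set):

    %% read_concurrency selects reader-group optimised locking;
    %% write_concurrency enables fine-grained (bucket or tree) locking.
    T = ets:new(stats, [ordered_set, public,
                        {read_concurrency, true},
                        {write_concurrency, true}]),
    true = ets:insert(T, {hits, 0}),
    ets:update_counter(T, hits, {2, 1}).  % atomic in-place increment of field 2
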
To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages (see footnote 21). For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3 GHz, 16 "Bulldozer" cores each, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

Table 2: Cluster Specifications (RAM per host in GB)

Name    Hosts  Cores/host  Cores (total)  Cores (max avail)  Processor                                  RAM     Interconnection
GPG     20     16          320            320                Intel Xeon E5-2640v2, 8C, 2GHz             64      10GB Ethernet
Kalkyl  348    8           2784           1408               Intel Xeon 5520v2, 4C, 2.26GHz             24–72   InfiniBand 20Gb/s
TinTin  160    16          2560           2240               AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz   64–128  2:1 oversubscribed QDR Infiniband
Athos   776    24          18624          6144               Intel Xeon E5-2697v2, 12C, 2.7GHz          64      Infiniband FDR14

21. See Deliverable 6.7 (http://www.release-project.eu/documents/D6.7.pdf), available online.

Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level Visualization of Concurrent and Distributed Computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riak Docs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking/.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Kőszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go Programming Language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 1:1–1:1, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. Charm++: a portable concurrent object oriented system based on C++. In ACM SIGPLAN Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak Core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley/.

[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lubeck and Max Neunhoffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897/.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing: 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale (3rd ed., revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-Strength Functional programming: Experiences with the Ericsson AXD301 Project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.

Page 2: Scaling Reliably Scaling Reliably: Improving the ... · Distributed actor languages are an effective means of constructing scalable reliable systems, and the Erlang ... that capture

I Introduction

Distributed programming languages andframeworks are central to engineering largescale systems where key properties includescalability and reliability By scalability wemean that performance increases as hosts andcores are added and by large scale we meanarchitectures with hundreds of hosts and tensof thousands of cores Experience with highperformance and data centre computing showsthat reliability is critical at these scales eghost failures alone account for around one fail-ure per hour on commodity servers with ap-proximately 105 cores [11] To be usable pro-gramming languages employed on them mustbe supported by a suite of deployment moni-toring refactoring and testing tools that workat scale

Controlling shared state is the only way tobuild reliable scalable systems State shared bymultiple units of computation limits scalabilitydue to high synchronisation and communica-tion costs Moreover shared state is a threatfor reliability as failures corrupting or perma-nently locking shared state may poison theentire system

Actor languages avoid shared state actors orprocesses have entirely local state and only in-teract with each other by sending messages [2]Recovery is facilitated in this model since ac-tors like operating system processes can failindependently without affecting the state ofother actors Moreover an actor can superviseother actors detecting failures and taking re-medial action eg restarting the failed actor

Erlang [6 19] is a beacon language for re-liable scalable computing with a widely em-ulated distributed actor model It has in-fluenced the design of numerous program-ming languages like Clojure [43] and F [74]and many languages have Erlang-inspiredactor frameworks eg Kilim for Java [73]Cloud Haskell [31] and Akka for C F andScala [64] Erlang is widely used for build-

ing reliable scalable servers eg EricssonrsquosAXD301 telephone exchange (switch) [79] theFacebook chat server and the Whatsapp in-stant messaging server [77]

In Erlang the actors are termed processesand are managed by a sophisticated VirtualMachine on a single multicore or NUMAhost while distributed Erlang provides rela-tively transparent distribution over networksof VMs on multiple hosts Erlang is supportedby the Open Telecom Platform (OTP) librariesthat capture common patterns of reliable dis-tributed computation such as the client-serverpattern and process supervision Any large-scale system needs scalable persistent stor-age and following the CAP theorem [39]Erlang uses and indeed implements Dynamo-style NoSQL DBMS like Riak [49] and Cassan-dra [50]

While the Erlang distributed actor modelconceptually provides reliable scalability it hassome inherent scalability limits and indeedlarge-scale distributed Erlang systems must de-part from the distributed Erlang paradigm inorder to scale eg not maintaining a fully con-nected graph of hosts The EU FP7 RELEASEproject set out to establish and address the scal-ability limits of the Erlang reliable distributedactor model [67]

After outlining related work (Section II) andthe benchmarks used throughout the article(Section III) we investigate the scalability lim-its of ErlangOTP seeking to identify specificissues at the virtual machine language andpersistent storage levels (Section IV)

We then report the RELEASE project work toaddress these issues working at the followingthree levels

1 We have designed and implemented aset of Scalable Distributed (SD) Erlang li-braries to address language-level reliabil-ity and scalability issues An operationalsemantics is provided for the key news_group construct and the implementa-

2

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

tion is validated against the semantics (Sec-tion V)

2 We have evolved the Erlang virtual machineso that it can work effectively in large-scalesingle-host multicore and NUMA archi-tectures We have improved the sharedETS tables time management and loadbalancing between schedulers Most ofthese improvements are now included inthe ErlangOTP release currently down-loaded approximately 50K times eachmonth (Section VI)

3 To facilitate the development of scalableErlang systems and to make them main-tainable we have developed three newtools Devo SDMon and WombatOAMand enhanced two others the visualisa-tion tool Percept and the refactorer Wran-gler The tools support refactoring pro-grams to make them more scalable eas-ier to deploy at large scale (hundreds ofhosts) easier to monitor and visualise theirbehaviour Most of these tools are freelyavailable under open source licences theWombatOAM deployment and monitor-ing tool is a commercial product (Sec-tion VII)

Throughout the article we use two bench-marks to investigate the capabilities of our newtechnologies and tools These are a compu-tation in symbolic algebra more specificallyan algebraic lsquoorbitrsquo calculation that exploitsa non-replicated distributed hash table andan Ant Colony Optimisation (ACO) parallelsearch program (Section III)

We report on the reliability and scalabilityimplications of our new technologies using Or-bit ACO and other benchmarks We use aChaos Monkey instance [14] that randomlykills processes in the running system to demon-strate the reliability of the benchmarks andto show that SD Erlang preserves the Erlanglanguage-level reliability model While we

report measurements on a range of NUMAand cluster architectures as specified in Ap-pendix A the key scalability experiments areconducted on the Athos cluster with 256 hostsand 6144 cores Having established scientifi-cally the folklore limitations of around 60 con-nected hostsnodes for distributed Erlang sys-tems in Section 4 a key result is to show thatthe SD Erlang benchmarks exceed this limitand do not reach their limits on the Athos clus-ter (Section VIII)

Contributions This article is the first sys-tematic presentation of the coherent set oftechnologies for engineering scalable reliableErlang systems developed in the RELEASEproject

Section IV presents the first scalability studycovering Erlang VM language and storagescalability Indeed we believe it is the firstcomprehensive study of any distributed ac-tor language at this scale (100s of hosts andaround 10K cores) Individual scalability stud-ies eg into Erlang VM scaling [8] or lan-guage and storage scaling have appeared be-fore [38 37]

At the language level the design implemen-tation and validation of the new libraries (Sec-tion V) have been reported piecemeal [21 60]and are included here for completeness

While some of the improvements made tothe Erlang Virtual Machine (Section i) havebeen thoroughly reported in conference pub-lications [66 46 47 69 70 71] others are re-ported here for the first time (Sections iii ii)

In Section VII the WombatOAM and SD-Mon tools are described for the first time asis the revised Devo system and visualisationThe other tools for profiling debugging andrefactoring developed in the project have previ-ously been published piecemeal [52 53 55 10]but this is their first unified presentation

All of the performance results in Section VIIIare entirely new although a comprehensivestudy of SD Erlang performance is now avail-able in a recent article by [22]

3

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

II Context

i Scalable Reliable ProgrammingModels

There is a plethora of shared memory con-current programming models like PThreadsor Java threads and some models likeOpenMP [20] are simple and high level How-ever synchronisation costs mean that thesemodels generally do not scale well often strug-gling to exploit even 100 cores Moreover re-liability mechanisms are greatly hampered bythe shared state for example a lock becomespermanently unavailable if the thread holdingit fails

The High Performance Computing (HPC)community build large-scale (106 core) dis-tributed memory systems using the de factostandard MPI communication libraries [72] In-creasingly these are hybrid applications thatcombine MPI with OpenMP UnfortunatelyMPI is not suitable for producing general pur-pose concurrent software as it is too low levelwith explicit message passing Moreover themost widely used MPI implementations offerno fault recovery1 if any part of the compu-tation fails the entire computation fails Cur-rently the issue is addressed by using whatis hoped to be highly reliable computationaland networking hardware but there is intenseresearch interest in introducing reliability intoHPC applications [33]

Server farms use commodity computationaland networking hardware and often scale toaround 105 cores where host failures are rou-tine They typically perform rather constrainedcomputations eg Big Data Analytics using re-liable frameworks like Google MapReduce [26]or Hadoop [78] The idempotent nature of theanalytical queries makes it relatively easy forthe frameworks to provide implicit reliabilityqueries are monitored and failed queries are

1Some fault tolerance is provided in less widely usedMPI implementations like [27]

simply re-run In contrast actor languages likeErlang are used to engineer reliable generalpurpose computation often recovering failedstateful computations

ii Actor Languages

The actor model of concurrency consists of in-dependent processes communicating by meansof messages sent asynchronously between pro-cesses A process can send a message to anyother process for which it has the address (inErlang the ldquoprocess identifierrdquo or pid) andthe remote process may reside on a differenthost While the notion of actors originatedin AI [42] it has been used widely as a gen-eral metaphor for concurrency as well as beingincorporated into a number of niche program-ming languages in the 1970s and 80s More re-cently it has come back to prominence throughthe rise of not only multicore chips but alsolarger-scale distributed programming in datacentres and the cloud

With built-in concurrency and data isola-tion actors are a natural paradigm for engi-neering reliable scalable general-purpose sys-tems [1 41] The model has two main conceptsactors that are the unit of computation andmessages that are the unit of communicationEach actor has an address-book that contains theaddresses of all the other actors it is aware ofThese addresses can be either locations in mem-ory direct physical attachments or networkaddresses In a pure actor language messagesare the only way for actors to communicate

After receiving a message an actor can dothe following (i) send messages to anotheractor from its address-book (ii) create new ac-tors or (iii) designate a behaviour to handle thenext message it receives The model does notimpose any restrictions in the order in whichthese actions must be taken Similarly twomessages sent concurrently can be received inany order These features enable actor basedsystems to support indeterminacy and quasi-

4

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

commutativity while providing locality modu-larity reliability and scalability [41]

Actors are just one message-based paradigmand other languages and libraries have relatedmessage passing paradigms Recent examplelanguages include Go [28] and Rust [62] thatprovide explicit channels similar to actor mail-boxes Probably the most famous messagepassing library is MPI [72] with APIs for manylanguages and widely used on clusters andHigh Performance Computers It is howeverarguable that the most important contributionof the actor model is the one-way asynchronouscommunication [41] Messages are not coupledwith the sender and neither they are trans-ferred synchronously to a temporary containerwhere transmission takes place eg a buffer aqueue or a mailbox Once a message is sentthe receiver is the only entity responsible forthat message

Erlang [6 19] is the pre-eminent program-ming language based on the actor model hav-ing a history of use in production systemsinitially with its developer Ericsson and thenmore widely through open source adoptionThere are now actor frameworks for manyother languages these include Akka for C Fand Scala [64] CAF2 for C++ Pykka3 CloudHaskell [31] PARLEY [51] for Python and Ter-mite Scheme [35] and each of these is currentlyunder active use and development Moreoverthe recently defined Rust language [62] has aversion of the actor model built in albeit in animperative context

iii Erlangrsquos Support for Concurrency

In Erlang actors are termed processes and vir-tual machines are termed nodes The key ele-ments of the actor model are fast process cre-ation and destruction lightweight processeseg enabling 106 concurrent processes on a sin-gle host with 8GB RAM fast asynchronous

2httpactor-frameworkorg3httppykkareadthedocsorgenlatest

message passing with copying semantics pro-cess monitoring strong dynamic typing andselective message reception

By default Erlang processes are addressedby their process identifier (pid) eg

Pong_PID = spawn(fun some_modulepong0)

spawns a process to execute the anonymousfunction given as argument to the spawn prim-itive and binds Pong_PID to the new processidentifier Here the new process will exe-cute the pong0 function which is defined insome_module A subsequent call

Pong_PID finish

sends the messaged finish to the process iden-tified by Pong_PID Alternatively processes canbe given names using a call of the form

register(my_funky_name Pong_PID)

which registers this process name in the nodersquosprocess name table if not already present Sub-sequently these names can be used to refer toor communicate with the corresponding pro-cesses (eg send them a message)

my_funky_process hello

A distributed Erlang system executes on mul-tiple nodes and the nodes can be freely de-ployed across hosts eg they can be locatedon the same or different hosts To help makedistribution transparent to the programmerwhen any two nodes connect they do so transi-tively sharing their sets of connections With-out considerable care this quickly leads to afully connected graph of nodes A process maybe spawned on an explicitly identified nodeeg

Remote_Pong_PID = spawn(some_node

fun some_modulepong0)

After this the remote process can be addressedjust as if it were local It is a significant bur-den on the programmer to identify the remotenodes in large systems and we will return tothis in Sections ii i2

5

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 1 Conceptual view of Erlangrsquos concurrency mul-ticore support and distribution

iv Scalability and Reliability inErlang Systems

Erlang was designed to solve a particularset of problems namely those in buildingtelecommunicationsrsquo infrastructure where sys-tems need to be scalable to accommodate hun-dreds of thousands of calls concurrently in softreal-time These systems need to be highly-available and reliable ie to be robust in thecase of failure which can come from softwareor hardware faults Given the inevitability ofthe latter Erlang adopts the ldquolet it failrdquo phi-losophy for error handling That is encourageprogrammers to embrace the fact that a processmay fail at any point and have them rely onthe supervision mechanism discussed shortlyto handle the failures

Figure 1 illustrates Erlangrsquos support for con-currency multicores and distribution EachErlang node is represented by a yellow shapeand each rectangle represents a host with an IPaddress Each red arc represents a connectionbetween Erlang nodes Each node can run onmultiple cores and exploit the inherent con-currency provided This is done automaticallyby the VM with no user intervention neededTypically each core has an associated schedulerthat schedules processes a new process will bespawned on the same core as the process thatspawns it but work can be moved to a differentscheduler through a work-stealing allocationalgorithm Each scheduler allows a process

that is ready to compute at most a fixed num-ber of computation steps before switching toanother Erlang built-in functions or BIFs are im-plemented in C and at the start of the projectwere run to completion once scheduled caus-ing performance and responsiveness problemsif the BIF had a long execution time

Scaling in Erlang is provided in two differ-ent ways It is possible to scale within a singlenode by means of the multicore virtual machineexploiting the concurrency provided by themultiple cores or NUMA nodes It is also pos-sible to scale across multiple hosts using multipledistributed Erlang nodes

Reliability in Erlang is multi-faceted As inall actor languages each process has privatestate preventing a failed or failing processfrom corrupting the state of other processesMessages enable stateful interaction and con-tain a deep copy of the value to be shared withno references (eg pointers) to the sendersrsquointernal state Moreover Erlang avoids typeerrors by enforcing strong typing albeit dy-namically [7] Connected nodes check livenesswith heartbeats and can be monitored fromoutside Erlang eg by an operating systemprocess

However the most important way to achievereliability is supervision which allows a processto monitor the status of a child process andreact to any failure for example by spawninga substitute process to replace a failed processSupervised processes can in turn superviseother processes leading to a supervision treeThe supervising and supervised processes maybe in different nodes and on different hostsand hence the supervision tree may span mul-tiple hosts or nodes

To provide reliable distributed service regis-tration a global namespace is maintained onevery node which maps process names to pidsIt is this that we mean when we talk about alsquoreliablersquo system it is one in which a namedprocess in a distributed system can be restartedwithout requiring the client processes also to

6

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

be restarted (because the name can still be usedfor communication)

To see global registration in action considera pong server process

globalregister_name(pong_server

Remote_Pong_PID)

Clients of the server can send messages to theregistered name eg

globalwhereis_name(pong_server)finish

If the server fails the supervisor can spawn areplacement server process with a new pid andregister it with the same name (pong_server)Thereafter client messages to the pong_server

will be delivered to the new server process Wereturn to discuss the scalability limitations ofmaintaining a global namespace in Section V

v ETS Erlang Term Storage

Erlang is a pragmatic language and the actormodel it supports is not pure Erlang processesbesides communicating via asynchronous mes-sage passing can also share data in publicmemory areas called ETS tables

The Erlang Term Storage (ETS) mechanismis a central component of Erlangrsquos implementa-tion It is used internally by many libraries andunderlies the in-memory databases ETS tablesare key-value stores they store Erlang tupleswhere one of the positions in the tuple servesas the lookup key An ETS table has a typethat may be either set bag or duplicate_bagimplemented as a hash table or ordered_setwhich is currently implemented as an AVL treeThe main operations that ETS supports are ta-ble creation insertion of individual entries andatomic bulk insertion of multiple entries in atable deletion and lookup of an entry basedon some key and destructive update The oper-ations are implemented as C built-in functionsin the Erlang VM

The code snippet below creates a set ETStable keyed by the first element of the en-try atomically inserts two elements with keys

some_key and 42 updates the value associatedwith the table entry with key 42 and then looksup this entry

Table = etsnew(my_table

[set public keypos 1])

etsinsert(Table [

some_key an_atom_value

42 atuplevalue])

etsupdate_element(Table 42

[2 anothertuplevalue])

[Key Value] = etslookup(Table 42)

ETS tables are heavily used in many Erlangapplications This is partly due to the conve-nience of sharing data for some programmingtasks but also partly due to their fast imple-mentation As a shared resource however ETStables induce contention and become a scala-bility bottleneck as we shall see in Section i

III Benchmarks for scalability

and reliability

The two benchmarks that we use through-out this article are Orbit that measures scal-ability without looking at reliability andAnt Colony Optimisation (ACO) that allowsus to measure the impact on scalability ofadding global namespaces to ensure reli-ability The source code for the bench-marks together with more detailed docu-mentation is available at httpsgithub

comrelease-projectbenchmarks TheRELEASE project team also worked to im-prove the reliability and scalability of otherErlang programs including a substantial (ap-proximately 150K lines of Erlang code) Sim-Diasca simulator [16] and an Instant Messengerthat is more typical of Erlang applications [23]but we do not cover these systematically here

i Orbit

Orbit is a computation in symbolic algebrawhich generalises a transitive closure compu-

7

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 2 Distributed Erlang Orbit (D-Orbit) architec-ture workers are mapped to nodes

tation [57] To compute the orbit for a givenspace [0X] a list of generators g1 g2 gn areapplied on an initial vertex x0 isin [0X] Thiscreates new values (x1xn) isin [0X] wherexi = gi(x0) The generator functions are ap-plied on the new values until no new value isgenerated

Orbit is a suitable benchmark because it hasa number of aspects that characterise a classof real applications The core data structureit maintains is a set and in distributed envi-ronments is implemented as a distributed hashtable (DHT) similar to the DHTs used in repli-cated form in NoSQL database managementsystems Also in distributed mode it usesstandard peer-to-peer (P2P) techniques like acredit-based termination algorithm [61] Bychoosing the orbit size the benchmark can beparameterised to specify smaller or larger com-putations that are suitable to run on a singlemachine (Section i) or on many nodes (Sec-tioni) Moreover it is only a few hundred linesof code

As shown in Figure 2 the computation isinitiated by a master which creates a numberof workers In the single node scenario of thebenchmark workers correspond to processesbut these workers can also spawn other pro-cesses to apply the generator functions on asubset of their input values thus creating intra-worker parallelism In the distributed versionof the benchmark processes are spawned bythe master node to worker nodes each main-taining a DHT fragment A newly spawned

process gets a share of the parentrsquos credit andreturns this on completion The computationis finished when the master node collects allcredit ie all workers have completed

ii Ant Colony Optimisation (ACO)

Ant Colony Optimisation [29] is a meta-heuristic which has been applied to a largenumber of combinatorial optimisation prob-lems For the purpose of this article we haveapplied it to an NP-hard scheduling problemknown as the Single Machine Total WeightedTardiness Problem (SMTWTP) [63] where anumber of jobs of given lengths have to be ar-ranged in a single linear schedule The goal isto minimise the cost of the schedule as deter-mined by certain constraints

The ACO method is attractive from the pointof view of distributed computing because itcan benefit from having multiple cooperatingcolonies each running on a separate computenode and consisting of multiple ldquoantsrdquo Antsare simple computational agents which concur-rently compute possible solutions to the inputproblem guided by shared information aboutgood paths through the search space there isalso a certain amount of stochastic variationwhich allows the ants to explore new direc-tions Having multiple colonies increases thenumber of ants thus increasing the probabilityof finding a good solution

We implement four distributed coordinationpatterns for the same multi-colony ACO com-

Figure 3 Distributed Erlang Two-level Ant Colony Op-timisation (TL-ACO) architecture

8

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 4 Distributed Erlang Multi-level Ant Colony Optimisation (ML-ACO) architecture

putation as follows In each implementationthe individual colonies perform some numberof local iterations (ie generations of ants) andthen report their best solutions the globally-best solution is then selected and is reportedto the colonies which use it to update theirpheromone matrices This process is repeatedfor some number of global iterations

Two-level ACO (TL-ACO) has a single masternode that collects the coloniesrsquo best solutionsand distributes the overall best solution back tothe colonies Figure 3 depicts the process andnode placements of the TL-ACO in a clusterwith NC nodes The master process spawnsNC colony processes on available nodes Inthe next step each colony process spawns NAant processes on the local node Each ant iter-ates IA times returning its result to the colonymaster Each colony iterates IM times report-ing their best solution to and receiving theglobally-best solution from the master processWe validated the implementation by applyingTL-ACO to a number of standard SMTWTP

instances [25 13 34] obtaining good resultsin all cases and confirmed that the number ofperfect solutions increases as we increase thenumber of colonies

Multi-level ACO (ML-ACO) In TL-ACO themaster node receives messages from all of thecolonies and thus could become a bottleneckML-ACO addresses this by having a tree ofsubmasters (Figure 4) with each node in thebottom level collecting results from a smallnumber of colonies These are then fed upthrough the tree with nodes at higher levelsselecting the best solutions from their children

Globally Reliable ACO (GR-ACO) In ML-ACOif a single colony fails to report back the sys-tem will wait indefinitely GR-ACO adds faulttolerance supervising colonies so that a faultycolony can be detected and restarted allowingthe system to continue execution

Scalable Reliable ACO (SR-ACO) also addsfault-tolerance but using supervision withinour new s_groups from Section i1 and thearchitecture of SR-ACO is discussed in detail

9

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

there

IV Erlang Scalability Limits

This section investigates the scalability ofErlang at VM language and persistent stor-age levels An aspect we choose not to exploreis the security of large scale systems wherefor example one might imagine providing en-hanced security for systems with multiple clus-ters or cloud instances connected by a WideArea Network We assume that existing secu-rity mechanisms are used eg a Virtual PrivateNetwork

i Scaling Erlang on a Single Host

To investigate Erlang scalability we builtBenchErl an extensible open source bench-mark suite with a web interface4 BenchErlshows how an applicationrsquos performancechanges when resources like cores or sched-ulers are added or when options that controlthese resources change

bull the number of nodes ie the number ofErlang VMs used typically on multiplehosts

bull the number of cores per node

bull the number of schedulers ie the OS threadsthat execute Erlang processes in paralleland their binding to the topology of thecores of the underlying computer node

bull the ErlangOTP release and flavor and

bull the command-line arguments used to startthe Erlang nodes

Using BenchErl we investigated the scala-bility of an initial set of twelve benchmarksand two substantial Erlang applications us-ing a single Erlang node (VM) on machines

4Information about BenchErl is available at http

releasesoftlabntuagrbencherl

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

20000

0 10 20 30 40 50 60 70

time

(m

s)

Schedulers

with intra-worker parallelism 10048 128without intra-worker parallelism 10048 128

0

5

10

15

20

25

30

35

40

0 10 20 30 40 50 60 70

Sp

ee

du

p

Schedulers

with intra-worker parallelism 10048 128without intra-worker parallelism 10048 128

Figure 5 Runtime and speedup of two configu-rations of the Orbit benchmark usingErlangOTP R15B01

with up to 64 cores including the Bulldozermachine specified in Appendix A This set ofexperiments reported by [8] confirmed thatsome programs scaled well in the most recentErlangOTP release of the time (R15B01) butalso revealed VM and language level scalabilitybottlenecks

Figure 5 shows runtime and speedup curvesfor the Orbit benchmark where master andworkers run on a single Erlang node in con-figurations with and without intra-worker par-allelism In both configurations the programscales Runtime continuously decreases as weadd more schedulers to exploit more coresThe speedup of the benchmark without intra-worker parallelism ie without spawning ad-ditional processes for the computation (greencurve) is almost linear up to 32 cores but in-

10

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

creases less rapidly from that point on we see asimilar but more clearly visible pattern for theother configuration (red curve) where there isno performance improvement beyond 32 sched-ulers This is due to the asymmetric charac-teristics of the machinersquos micro-architecturewhich consists of modules that couple two con-ventional x86 out-of-order cores that share theearly pipeline stages the floating point unitand the L2 cache with the rest of the mod-ule [3]

Some other benchmarks however did notscale well or experienced significant slow-downs when run in many VM schedulers(threads) For example the ets_test benchmarkhas multiple processes accessing a shared ETStable Figure 6 shows runtime and speedupcurves for ets_test on a 16-core (eight coreswith hyperthreading) Intel Xeon-based ma-chine It shows that runtime increases beyondtwo schedulers and that the program exhibitsa slowdown instead of a speedup

For many benchmarks there are obvious rea-sons for poor scaling like limited parallelismin the application or contention for shared re-sources The reasons for poor scaling are lessobvious for other benchmarks and it is exactlythese we have chosen to study in detail in sub-sequent work [8 46]

A simple example is the parallel BenchErlbenchmark that spawns some n processeseach of which creates a list of m timestampsand after it checks that each timestamp in thelist is strictly greater than the previous onesends the result to its parent Figure 7 showsthat up to eight cores each additional core leadsto a slowdown thereafter a small speedup isobtained up to 32 cores and then again a slow-down A small aspect of the benchmark eas-ily overlooked explains the poor scalabilityThe benchmark creates timestamps using theerlangnow0 built-in function whose imple-mentation acquires a global lock in order to re-turn a unique timestamp That is two calls toerlangnow0 even from different processes

1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

0 2 4 6 8 10 12 14 16

time

(m

s)

Schedulers

[25121264640]

02

04

06

08

1

12

14

16

0 2 4 6 8 10 12 14 16

Sp

ee

du

p

Schedulers

[25121264640]

Figure 6 Runtime and speedup of theets_test BenchErl benchmark usingErlangOTP R15B01

are guaranteed to produce monotonically in-creasing values This lock is precisely the bot-tleneck in the VM that limits the scalabilityof this benchmark We describe our work toaddress VM timing issues in Figure iii

Discussion Our investigations identified con-tention for shared ETS tables and forcommonly-used shared resources like timersas the key VM-level scalability issues Sec-tion VI outlines how we addressed these issuesin recent ErlangOTP releases

ii Distributed Erlang Scalability

Network Connectivity Costs When any nor-mal distributed Erlang nodes communicatethey share their connection sets and this typi-cally leads to a fully connected graph of nodes

11

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 7: Runtime and speedup of the parallel BenchErl benchmark using Erlang/OTP R15B01. The y-axis shows time in ms (left) and speedup (right); the x-axis is number of schedulers.

So a system with n nodes will maintain O(n²) connections, and these are relatively expensive TCP connections with continual maintenance traffic. This design aids transparent distribution, as there is no need to discover nodes, and it works well for small numbers of nodes. However, at emergent server architecture scales, i.e. hundreds of nodes, this design becomes very expensive, and system architects must depart from the default Erlang model, e.g. they need to start using hidden nodes that do not share connection sets.

We have investigated the scalability limits imposed by network connectivity costs using several Orbit calculations on two large clusters, Kalkyl and Athos, as specified in Appendix A. The Kalkyl results are discussed by [21], and Figure 29 in Section i shows representative results for distributed Erlang computing orbits with 2M and 5M elements on Athos. In all cases performance degrades beyond some scale (40 nodes for the 5M orbit, and 140 nodes for the 2M orbit). Figure 32 illustrates the additional network traffic induced by the fully connected network. It allows a comparison between the number of packets sent in a fully connected network (ML-ACO) with those sent in a network partitioned using our new s_groups (SR-ACO).

Global Information Costs: Maintaining global information is known to limit the scalability of distributed systems, and crucially the process namespaces used for reliability are global. To investigate the scalability limits imposed on distributed Erlang by such global information, we have designed and implemented DE-Bench, an open source, parameterisable and scalable peer-to-peer benchmarking framework [37, 36]. DE-Bench measures the throughput and latency of distributed Erlang commands on a cluster of Erlang nodes; the design is influenced by the Basho Bench benchmarking tool for Riak [12]. Each DE-Bench instance acts as a peer, providing scalability and reliability by eliminating central coordination and any single point of failure.

Figure 8: DE-Bench's internal workflow.

To evaluate the scalability of distributed Erlang we measure how adding more hosts increases the throughput, i.e. the total number of successfully executed distributed Erlang commands per experiment. Figure 8 shows the parameterisable internal workflow of DE-Bench. There are three classes of commands in DE-Bench: (i) Point-to-Point (P2P) commands, where a function with tunable argument size and computation time is run on a remote node; these include spawn, rpc, and synchronous calls to server processes, i.e. gen_server or gen_fsm. (ii) Global commands, which entail synchronisation across all connected nodes, such as global:register_name and global:unregister_name. (iii) Local commands, which are executed independently by a single node, e.g. whereis_name, a lookup in the local name table.
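Representative calls for the three command classes, as a sketch (DE-Bench wraps such calls with tunable argument sizes and timing):

%% (i) Point-to-Point: run a function on a remote node
Result = rpc:call(RemoteNode, some_module, some_fun, [Arg]),
%% (ii) Global: synchronised across all connected nodes
yes = global:register_name(some_name, self()),
%% (iii) Local: a lookup in the node-local name table
Pid = global:whereis_name(some_name),
_ = global:unregister_name(some_name).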

The benchmarking is conducted on 10 to 100 host configurations of the Kalkyl cluster (in steps of 10), and measures the throughput of successful commands per second over 5 minutes. There is one Erlang VM on each host and one DE-Bench instance on each VM. The full paper [37] investigates the impact of data size and computation time in P2P calls, both independently and in combination, and the scaling properties of the common Erlang/OTP generic server processes gen_server and gen_fsm.

Here we focus on the impact of different proportions of global commands, mixing global with P2P and local commands. Figure 9 shows that even a low proportion of global commands limits the scalability of distributed Erlang, e.g. just 0.01% global commands limits scalability to around 60 nodes. Figure 10 reports the latency of all commands and shows that, while the latencies for P2P and local commands are stable at scale, the latency of the global commands increases dramatically with scale. Both results illustrate that the impact of global operations on throughput and latency in a distributed Erlang system is severe.

Explicit Placement: While network connectivity and global information impact performance at scale, our investigations also identified explicit process placement as a programming issue at scale. Recall from Section iii that distributed Erlang requires the programmer to identify an explicit Erlang node (VM) when spawning a process. Identifying an appropriate node becomes a significant burden for large and dynamic systems. The problem is exacerbated in large distributed systems where (1) the hosts may not be identical, having different hardware capabilities or different software installed, and (2) communication times may be non-uniform: it may be fast to send a message between VMs on the same host, and slow if the VMs are on different hosts in a large distributed system.

These factors make it difficult to deploy applications, especially in a scalable and portable manner. Moreover, while the programmer may be able to use platform-specific knowledge to decide where to spawn processes so that an application runs efficiently, if the application is then deployed on a different platform, or if the platform changes as hosts fail or are added, this knowledge becomes outdated.

Discussion: Our investigations confirm three language-level scalability limitations of Erlang



Figure 9: Scalability vs. percentage of global commands in distributed Erlang.

from developer folklore: (1) Maintaining a fully connected network of Erlang nodes limits scalability; for example, Orbit is typically limited to just 40 nodes. (2) Global operations, and crucially the global operations required for reliability, i.e. to maintain a global namespace, seriously limit the scalability of distributed Erlang systems. (3) Explicit process placement makes it hard to build performance-portable applications for large architectures. These issues cause designers of reliable large scale systems in Erlang to depart from the standard Erlang model, e.g. using techniques like hidden nodes and storing pids in data structures. In Section V we develop language technologies to address these issues.

iii Persistent Storage

Any large scale system needs reliable scalable persistent storage, and we have studied the scalability limits of Erlang persistent storage alternatives [38]. We envisage a typical large server having around 10⁵ cores on around 100 hosts. We have reviewed the requirements for scalable and available persistent storage, and evaluated four popular Erlang DBMS against these requirements. For a target scale of around 100 hosts, Mnesia and CouchDB are, unsurprisingly, not suitable. However, Dynamo-style NoSQL DBMS like Cassandra and Riak have the potential to be.

We have investigated the current scalability limits of the Riak NoSQL DBMS using the Basho Bench benchmarking framework on a cluster with up to 100 nodes and independent disks. We found that the scalability limit of Riak version 1.1.1 is 60 nodes on the Kalkyl cluster. The study placed into the public scientific domain what was previously well-evidenced, but anecdotal, developer experience.

We have also shown that resources like memory, disk and network do not limit the scalability of Riak. By instrumenting the global and gen_server OTP libraries, we identified a specific Riak remote procedure call that fails to scale. We outline how later releases of Riak are refactored to eliminate the scalability bottlenecks.



Figure 10: Latency of commands as the number of Erlang nodes increases.

Discussion: We conclude that Dynamo-like NoSQL DBMSs have the potential to deliver reliable persistent storage for Erlang at our target scale of approximately 100 hosts. Specifically, an Erlang Cassandra interface is available, and Riak 1.1.1 already provides scalable and available persistent storage on 60 nodes. Moreover, the scalability of Riak is much improved in subsequent versions.

V Improving Language Scalability

This section outlines the Scalable Distributed (SD) Erlang libraries [21] we have designed and implemented to address the distributed Erlang scalability issues identified in Section ii. SD Erlang introduces two concepts to improve scalability. S_groups partition the set of nodes in an Erlang system to reduce network connectivity and partition global data (Section i.1). Semi-explicit placement alleviates the issues of explicit process placement in large heterogeneous networks (Section i.2). The two features are independent and can be used separately or in combination. We overview SD Erlang in Section i, and outline s_group semantics and validation in Sections ii and iii respectively.

i SD Erlang Design

i.1 S_groups

S_groups reduce both the number of connections a node maintains and the size of namespaces, i.e. they minimise global information. Specifically, names are registered on, and synchronised between, only the nodes within the s_group. An s_group has the following parameters: a name, a list of nodes, and a list of registered names. A node can belong to many s_groups or to none. If a node belongs to no s_group, it behaves as a usual distributed Erlang node.

The s_group library defines the functions shown in Table 1. Some of these functions manipulate s_groups and provide information about them, such as creating s_groups and providing a list of nodes from a given s_group. The remaining functions manipulate names registered in s_groups and provide information about these names. For example, to register a process Pid with name Name in s_group SGroupName we use the following function. The name will only be registered if the process is



Table 1: Summary of s_group functions

new_s_group(SGroupName, Nodes) - Creates a new s_group consisting of some nodes
delete_s_group(SGroupName) - Deletes an s_group
add_nodes(SGroupName, Nodes) - Adds a list of nodes to an s_group
remove_nodes(SGroupName, Nodes) - Removes a list of nodes from an s_group
s_groups() - Returns a list of all s_groups known to the node
own_s_groups() - Returns a list of s_group tuples the node belongs to
own_nodes() - Returns a list of nodes from all s_groups the node belongs to
own_nodes(SGroupName) - Returns a list of nodes from the given s_group
info() - Returns s_group state information
register_name(SGroupName, Name, Pid) - Registers a name in the given s_group
re_register_name(SGroupName, Name, Pid) - Re-registers a name (changes a registration) in a given s_group
unregister_name(SGroupName, Name) - Unregisters a name in the given s_group
registered_names({node, Node}) - Returns a list of all registered names on the given node
registered_names({s_group, SGroupName}) - Returns a list of all registered names in the given s_group
whereis_name(SGroupName, Name) - Returns the pid of a name registered in the given s_group
whereis_name(Node, SGroupName, Name) - Returns the pid of a name registered in the given s_group
send(SGroupName, Name, Msg) - Sends a message to a name registered in the given s_group
send(Node, SGroupName, Name, Msg) - Sends a message to a name registered in the given s_group

being executed on a node that belongs to the given s_group, and neither Name nor Pid are already registered in that group.

s_group:register_name(SGroupName, Name, Pid) -> yes | no
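A minimal usage sketch, assuming nodes 'a@host' and 'b@host' are running and reachable:

{ok, group1, _Nodes} = s_group:new_s_group(group1, ['a@host', 'b@host']),
yes = s_group:register_name(group1, server1, self()),
Pid = s_group:whereis_name(group1, server1),   % Pid =:= self()
s_group:send(group1, server1, {hello, self()}).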

To illustrate the impact of s_groups on scalability we repeat the global operations experiment from Section ii (Figure 9). In the SD Erlang experiment we partition the set of nodes into s_groups, each containing ten nodes, and hence the names are replicated and synchronised on just ten nodes, and not on all nodes as in distributed Erlang. The results in Figure 11 show that with 0.01% of global operations the throughput of distributed Erlang stops growing at 40 nodes, while the throughput of SD Erlang continues to grow linearly.

The connection topology of s_groups is extremely flexible: they may be organised into a hierarchy of arbitrary depth or branching, e.g. there could be multiple levels in the tree



Figure 11: Global operations in distributed Erlang vs SD Erlang. The y-axis shows throughput; the x-axis is number of nodes.

of s_groups; see Figure 12. Moreover, it is not necessary to create a hierarchy of s_groups; for example, we have constructed an Orbit implementation using a ring of s_groups.

Given such a flexible way of organising distributed systems, key questions in the design of an SD Erlang system are the following. How should s_groups be structured? Depending on the reason the nodes are grouped (reducing the number of connections, reducing the namespace, or both), s_groups can be freely structured as a tree, ring, or some other topology. How large should the s_groups be? Smaller s_groups mean more inter-group communication, but the synchronisation of the s_group state between the s_group nodes constrains the maximum size of s_groups. We have not found this constraint to be a serious restriction; for example, many s_groups are either relatively small (e.g. 10-node) internal elements, or terminal elements in some topology, e.g. the leaves and nodes of a tree. How do nodes from different s_groups communicate? While any two nodes can communicate in an SD Erlang system, to minimise the number of connections, communication between nodes from different s_groups is typically routed via gateway nodes that belong to both s_groups. How do we avoid single points of failure? For reliability, and to minimise communication load, multiple gateway nodes and processes may be required.

Information to make these design choices is provided by the tools in Section VII and by benchmarking. A further challenge is how to systematically refactor a distributed Erlang application into SD Erlang, and this is outlined in Section i. A detailed discussion of distributed system design and refactoring in SD Erlang is provided in a recent article [22].

We illustrate typical SD Erlang system designs by showing refactorings of the Orbit and ACO benchmarks from Section III. In both distributed and SD Erlang the computation starts on the Master node, and the actual computation is done on the Worker nodes. In the distributed Erlang version all nodes are interconnected, and messages are transferred directly from the sending node to the receiving node (Figure 2). In contrast, in the SD Erlang version nodes are grouped into s_groups, and messages are transferred between different s_groups via Sub-master nodes (Figure 13).



Figure 12: SD Erlang ACO (SR-ACO) architecture: a master process (Level 0), sub-master and colony processes on colony nodes (Level 1), and ant processes (Level 2), with nodes grouped into s_groups.

A fragment of code that creates an s_group on a Sub-master node is as follows:

create_s_group(Master, GroupName, Nodes0) ->
    case s_group:new_s_group(GroupName, Nodes0) of
        {ok, GroupName, Nodes} ->
            Master ! {GroupName, Nodes};
        _ -> io:format("exception: message")
    end.

Similarly, we introduce s_groups in the GR-ACO benchmark from Section ii to create Scalable Reliable ACO (SR-ACO); see Figure 12. Here, apart from reducing the number of connections, s_groups also reduce the global namespace information. That is, instead of registering the name of a pid globally, i.e. with all nodes, the name is registered only on all nodes in the s_group, with s_group:register_name/3.

A comparative performance evaluation of distributed Erlang and SD Erlang Orbit and ACO is presented in Section VIII.

i.2 Semi-Explicit Placement

Recall from Section iii that distributed Erlang spawns a process onto an explicitly named Erlang node, e.g.

spawn(some_node, fun some_module:pong/0)

and also recall the portability and programming effort issues associated with such explicit placement in large scale systems, discussed in Section ii.

To address these issues we have developed a semi-explicit placement library that enables the programmer to select nodes on which to spawn processes based on run-time information about the properties of the nodes. For example, if a process performs a lot of computation one would like to spawn it on a node with considerable computation power, or if two processes are likely to communicate frequently then it would be desirable to spawn them on the same node, or on nodes with a fast interconnect.

We have implemented two Erlang libraries to support semi-explicit placement [60]. The first deals with node attributes, and describes properties of individual Erlang VMs and associated hosts, such as total and currently available RAM, installed software, hardware configuration, etc. The second deals with a notion of communication distances, which models the communication times between nodes in a distributed system. Therefore, instead of specifying a node, we can use the attr:choose_node/1 function to define the target node, i.e.

spawn(attr:choose_node(Params), fun some_module:pong/0)

Figure 13: SD Erlang (SD-Orbit) architecture.

[60] report an investigation into the communication latencies on a range of NUMA and cluster architectures, and demonstrate the effectiveness of the placement libraries using the ML-ACO benchmark on the Athos cluster.
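For example, a target node satisfying given requirements might be chosen at spawn time. The parameter format below is hypothetical; attr:choose_node/1 is the library's entry point:

%% choose a node meeting (hypothetical) attribute requirements, then spawn there
Node = attr:choose_node([{ram_at_least_gb, 16}, {has_module, some_module}]),
spawn(Node, fun some_module:pong/0).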

ii S_group Semantics

For precise specification, and as a basis for validation, we provide a small-step operational semantics of the s_group operations [21]. Figure 14 defines the state of an SD Erlang system and associated abstract syntax variables. The abstract syntax variables on the left are defined as members of sets, denoted {}, and these in turn may contain tuples, denoted (), or further sets. In particular, nm is a process name, p a pid, ni a node_id, and nis a set of node_ids. The state of a system is modelled as a four-tuple comprising a set of s_groups, a set of free_groups, a set of free_hidden_groups, and a set of nodes. Each type of group is associated with nodes and has a namespace. An s_group additionally has a name, whereas a free_hidden_group consists of only one node; i.e. a hidden node simultaneously acts as a node and as a group, because as a group it has a namespace but does not share it with any other node. Free normal and hidden groups have no names, and are uniquely defined by the nodes associated with them. Therefore, group names, gr_names, are either NoGroup or a set of s_group_names. A namespace is a set of name and process id (pid) pairs, and is replicated on all nodes of the associated group.

A node has the following four parameters: a node_id identifier; a node_type that can be either hidden or normal; connections; and group_names, i.e. names of groups the node belongs to. The node can belong to either a list of s_groups or one of the free groups. The type of the free group is defined by the node type. Connections are a set of node_ids.

Transitions in the semantics have the form (state, command, ni) −→ (state′, value), meaning that executing command on node ni in state returns value and transitions to state′.

The semantics is presented in more detail by [21], but we illustrate it here with the s_group:registered_names/1 function from Section i.1. The function returns a list of names



(grs, fgs, fhs, nds) ∈ state ≡ ({s_group}, {free_group}, {free_hidden_group}, {node})
gr ∈ grs ≡ {s_group} ≡ {(s_group_name, {node_id}, namespace)}
fg ∈ fgs ≡ {free_group} ≡ {({node_id}, namespace)}
fh ∈ fhs ≡ {free_hidden_group} ≡ {(node_id, namespace)}
nd ∈ nds ≡ {node} ≡ {(node_id, node_type, connections, gr_names)}
gs ∈ gr_names ≡ {NoGroup} ∪ {s_group_name}
ns ∈ namespace ≡ {(name, pid)}
cs ∈ connections ≡ {node_id}
nt ∈ node_type ≡ {Normal, Hidden}
s ∈ {NoGroup} ∪ {s_group_name}

Figure 14: SD Erlang state [21].

registered in s_group s if node ni belongs to the s_group, and an empty list otherwise (Figure 15). Here ⊕ denotes disjoint set union; IsSGroupNode returns true if node ni is a member of some s_group s, and false otherwise; and OutputNms returns a set of process names registered in the ns namespace of s_group s.

iii Semantics Validation

Figure 16: Testing s_groups using QuickCheck.

As the semantics is concrete, it can readily be made executable in Erlang, with lists replacing sets throughout. Having an executable semantics allows users to engage with it and to understand how the semantics behaves vis-à-vis the library, giving them an opportunity to assess the correctness of the library against the semantics.
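For instance, the registered_names rule of Figure 15 has a direct list-based rendering; a minimal sketch (function and variable names are ours, not the project's module):

registered_names(S, Ni, {Grs, _Fgs, _Fhs, _Nds}) ->
    %% each s_group is a {Name, NodeIds, Namespace} tuple
    case lists:keyfind(S, 1, Grs) of
        {S, Nis, Ns} ->
            case lists:member(Ni, Nis) of
                true  -> [{S, Nm} || {Nm, _Pid} <- Ns];
                false -> []
            end;
        false -> []
    end.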

Better still, we can automatically assess how the system behaves in comparison with the (executable) semantics by executing them in lockstep, guided by the constraints of which operations are possible at each point. We do that by building an abstract state machine model of the library. We can then generate random sequences (or traces) through the model, with appropriate library data generated too. This random generation is supported by the QuickCheck property-based testing system [24, 9].

The architecture of the testing framework is shown in Figure 16. First, an abstract state machine, embedded as an "eqc_statem" module, is derived from the executable semantic specification. The state machine defines the abstract state representation and the transition from one state to another when an operation is applied. Test case and data generators are then defined to control the test case generation;



((grs, fgs, fhs, nds), s_group:registered_names(s), ni)
    −→ ((grs, fgs, fhs, nds), nms)   if IsSGroupNode(ni, s, grs)
    −→ ((grs, fgs, fhs, nds), {})    otherwise

where {(s, {ni} ⊕ nis, ns)} ⊕ grs′ ≡ grs
      nms ≡ OutputNms(s, ns)

IsSGroupNode(ni, s, grs) = ∃ nis, ns, grs′ . {(s, {ni} ⊕ nis, ns)} ⊕ grs′ ≡ grs

OutputNms(s, ns) = {(s, nm) | (nm, p) ∈ ns}

Figure 15: SD Erlang semantics of the s_group:registered_names/1 function.

this includes the automatic generation of eligible s_group operations and the input data to those operations. Test oracles are encoded as the postconditions for s_group operations.

During testing, each test command is applied to both the abstract model and the s_group library. The application of the test command to the abstract model takes the abstract model from its current state to a new state, as described by the transition functions, whereas the application of the test command to the library leads the system to a new actual state. The actual state information is collected from each node in the distributed system, then merged and normalised to the same format as the abstract state representation. For a test to be successful, after the execution of a test command the test oracles specified for this command should be satisfied. Various test oracles can be defined for s_group operations; for instance, one of the generic constraints that applies to all the s_group operations is that after each s_group operation the normalised system state should be equivalent to the abstract state.
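The lockstep execution can be captured as a standard eqc_statem property; a minimal sketch (the model module name s_group_eqc is illustrative):

-include_lib("eqc/include/eqc.hrl").

prop_s_group() ->
    ?FORALL(Cmds, eqc_statem:commands(s_group_eqc),
            begin
                %% run the generated command sequence against model and library
                {H, S, Res} = eqc_statem:run_commands(s_group_eqc, Cmds),
                eqc_statem:pretty_commands(s_group_eqc, Cmds, {H, S, Res},
                                           Res =:= ok)
            end).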

Thousands of tests were run, and three kinds of errors (all subsequently corrected) were found. Some errors in the library implementation were found, including one error due to the synchronisation between nodes, and another related to the remove_nodes operation, which erroneously raised an exception. We also found a couple of trivial errors in the semantic specification itself, which had been missed by manual examination. Finally, we found some situations where there were inconsistencies between the semantics and the library implementation, despite their states being equivalent; an example of this was in the particular values returned by functions on certain errors. Overall, the automation of testing boosted our confidence in the correctness of the library implementation and the semantic specification. This work is reported in more detail by [54].

VI Improving VM Scalability

This section reports the primary VM and library improvements we have designed and implemented to address the scalability and reliability issues identified in Section i.



i Improvements to Erlang Term Storage

Because ETS tables are so heavily used in Erlang systems, they are a focus for scalability improvements. We start by describing their redesign, including some improvements that pre-date our RELEASE project work, i.e. those prior to Erlang/OTP R15B03. These historical improvements are very relevant for a scalability study, and form the basis for our subsequent changes and improvements. At the point when Erlang/OTP got support for multiple cores (in release R11B), there was a single reader-writer lock for each ETS table. Optional fine-grained locking of hash-based ETS tables (i.e. set, bag or duplicate_bag tables) was introduced in Erlang/OTP R13B02-1, adding 16 reader-writer locks for the hash buckets. Reader groups, to minimise read synchronisation overheads, were introduced in Erlang/OTP R14B. The key observation is that a single count of the multiple readers must be synchronised across many cache lines, potentially far away in a NUMA system. Maintaining reader counts in multiple (local) caches makes reads fast, although writes must now check every reader count. In Erlang/OTP R16B the number of bucket locks, and the default number of reader groups, were both upgraded from 16 to 64.

We illustrate the scaling properties of the ETS concurrency options using the ets_bench BenchErl benchmark on the Intel NUMA machine with 32 hyperthreaded cores specified in Appendix A. The ets_bench benchmark inserts 1M items into the table, then records the time to perform 17M operations, where an operation is either a lookup, an insert, or a delete. The experiments are conducted on a hash-based (set) ETS table with different percentages of update operations, i.e. insertions or deletions.

Figure 17 shows the runtimes in seconds of 17M operations in different Erlang/OTP versions, varying the number of schedulers (x-axis), reflecting how the scalability of ETS tables has improved in more recent Erlang/OTP releases. Figure 18 shows the runtimes in seconds of 17M operations on an ETS table with different numbers of reader groups, again varying the number of schedulers. We see that one reader group is not sufficient with 10% updates, nor are two with 1% updates. Beyond that, different numbers of reader groups have little impact on the benchmark performance, except that using 64 groups with 10% updates slightly degrades performance.

Figure 17: Runtime of 17M ETS operations in Erlang/OTP releases R11B-5, R13B02-1, R14B and R16B: 10% updates (left) and 1% updates (right). The y-axis shows time in seconds (lower is better) and the x-axis is number of OS threads (VM schedulers).
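The concurrency machinery discussed above is selected per table at creation time: fine-grained bucket locking is enabled with {write_concurrency, true} and reader groups with {read_concurrency, true}. A small sketch using the standard ets API:

T = ets:new(bench_table, [set, public,
                          {write_concurrency, true},   % fine-grained bucket locks
                          {read_concurrency, true}]),  % reader groups
true = ets:insert(T, {key1, 42}),
[{key1, 42}] = ets:lookup(T, key1).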

We have explored four other extensions or redesigns in the ETS implementation for better scalability: (1) allowing more programmer control over the number of bucket locks in hash-based tables, so the programmer can reflect the number of schedulers and the expected access pattern; (2) using contention-adapting trees to get better scalability for ordered_set ETS tables, as described by [69]; (3) using queue delegation locking to improve scalability [47]; and (4) adopting schemes for completely eliminating the locks in the meta table. A more complete discussion of our work on ETS can be found in the papers by [69] and [47].

Figure 18: Runtime of 17M ETS operations with varying numbers of reader groups (1 to 64): 10% updates (left) and 1% updates (right). The y-axis shows runtime in seconds (lower is better) and the x-axis is number of schedulers.

Here we outline only our work on contention-adapting (CA) trees. A CA tree monitors contention in different parts of a tree-shaped data structure, introducing routing nodes with locks in response to high contention, and removing them in response to low contention. For experimental purposes, two variants of the CA tree have been implemented to represent ordered_sets in the virtual machine of Erlang/OTP 17.0. One extends the existing AVL trees in the Erlang VM, and the other uses a Treap data structure [5]. Figure 19 compares the throughput of the CA tree variants with that of ordered_set and set as the number of schedulers increases. It is unsurprising that the CA trees scale so much better than an ordered_set, which is protected by a single readers-writer lock. It is more surprising that they also scale better than set. This is due to hash tables using fine-grained locking at a fixed granularity, while CA trees can adapt the number of locks to the current contention level, and also to the parts of the key range where contention is occurring.

Figure 19: Throughput (operations per microsecond) of set, ordset, and the AVL and Treap CA tree variants as the number of threads increases: 10% updates (left) and 1% updates (right).



ii Improvements to Schedulers

In the Erlang VM, a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until it either blocks waiting for input, typically a message from some other process, or until it has executed its quota of reductions.

The Erlang VM is usually started with one scheduler per logical core (SMT-thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned it is placed in the run queue of the scheduler of its parent, and it waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores; that is, an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Section iii.2.

The default load management mechanism is load compaction, which aims to keep as many scheduler threads as possible fully loaded with work, i.e. it attempts to ensure that scheduler threads do not run out of work. We have developed a new optional scheduler utilisation balancing mechanism that is available from Erlang/OTP 17.0. The new mechanism aims to balance scheduler utilisation between schedulers; that is, it will strive for equal scheduler utilisation on all schedulers.

The scheduler utilisation balancing mechanism has no performance impact on the system when not enabled. On the other hand, when enabled it results in changed timing in the system; normally there is a small overhead due to the measuring of utilisation and the calculation of balancing information, which depends on the underlying primitives provided by the operating system.

The new balancing mechanism results in a better distribution of processes to schedulers, reducing the probability of core contention. Together with other VM improvements, such as interruptable BIFs and garbage collection, it results in lower latency and improved responsiveness, and hence reliability, for soft real-time applications.
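The mechanism is enabled with the +sub true VM flag (Erlang/OTP 17.0 or later), and per-scheduler utilisation can be observed with the standard statistics API; a small sketch:

%% start the VM with utilisation balancing enabled:  erl +S 16 +sub true
erlang:system_flag(scheduler_wall_time, true),
timer:sleep(60000),  % let the system run for a while
WallTimes = erlang:statistics(scheduler_wall_time),
%% WallTimes is a list of {SchedulerId, ActiveTime, TotalTime} tuples
[{Id, Active / Total} || {Id, Active, Total} <- WallTimes].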

iii Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel benchmark in Section i. The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in that returns "Erlang system time", or time since Epoch, with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, i.e. not SMT-enabled. The documentation promises that values returned by it are strictly increasing, and many applications ended up relying on this; for example, applications often employ erlang:now/0 to generate unique integers.

Erlang system time should align with the operating system's view of time since Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot, without invalidating the strictly-increasing-value guarantee. The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when these two have deviated, e.g. in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], i.e. a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards and backwards, while system time is allowed to do this. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.
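In the new API (Erlang/OTP 18.x) this structure is directly visible, and erlang:unique_integer/1 replaces the unique-timestamp idiom that made erlang:now/0 a bottleneck (all standard built-ins):

Mono    = erlang:monotonic_time(),   % zero point unspecified; never leaps
Offset  = erlang:time_offset(),      % varying offset to Erlang system time
SysTime = Mono + Offset,             % equals erlang:system_time()
Unique  = erlang:unique_integer([monotonic, positive]).  % scalable unique values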

Time Retrieval: Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe, but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, monotonic time delivered by the operating system is solely based on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP); that is, they will run with different frequencies. Linux is an exception, with a monotonic clock that is NTP-adjusted and runs with the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing monotonic time and system time delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute. If the adjustment needs to be changed, new adjustment information is published and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the time adjustment information previously published, and calculates Erlang monotonic time. To preserve monotonicity it is important that all threads that read the same OS monotonic time map this to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information, using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed. This means that in the vast majority of cases the RW lock will be read-locked, which allows multiple readers to run concurrently. To prevent bouncing the lock cache-line, we use a bespoke reader-optimised RW lock implementation, where reader threads notify about their presence in counters on separate cache-lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11]; alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].

Timer Wheel and BIF Timer: The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel, which is used by processes executing on the scheduler.

The implementation of timers in Erlang/OTP uses a built-in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS tables, and only insert one timer at a time into the timer wheel.

Benchmarks: We have measured several benchmarks on a 16-core Bulldozer machine with eight dual CPU AMD Opteron 4376 HEs⁵. We present three of them here.

The first micro benchmark compares the execution time of an Erlang receive with that of a receive after that specifies a timeout and provides a default value. The receive after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, total execution time with the optimised timers is only 5% longer than without timers.
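The two forms compared by this benchmark are sketched below; only the second interacts with the timer wheel:

%% plain receive: no timer is involved
receive Msg1 -> Msg1 end,
%% receive after: a timer is set when the process blocks,
%% and cancelled if a message arrives before the timeout
receive Msg2 -> Msg2
after 1000 -> default_value
end.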

The second micro benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4 and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on the left shows that (1) the performance of time management has remained roughly unchanged between Erlang/OTP releases prior to 18.0; (2) the improved time management in Erlang/OTP 18.x makes time management less likely to be a scalability bottleneck even when using erlang:now/0; and (3) the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution. The graph on the right side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

⁵See §2.5.4 of the RELEASE project Deliverable 2.4 (http://release-project.eu/documents/D24.pdf)

Figure 20: The BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases: runtimes (left) and speedup obtained using erlang:monotonic_time/0 (right).

VII Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable,



to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling and debugging Erlang systems that can be used separately or together, as appropriate for solving the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on, or specialise, the services offered by the Erlang virtual machine through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer⁶ application gives a comprehensive overview of many of these data on a node-by-node basis.

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature; that is, the tools support the language itself rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use the tracing support provided by the Erlang VM, so other actor frameworks, e.g. Akka, can use the Kamon⁷ JVM monitoring system. Similarly, tools for other actor languages or frameworks could use data derived through OS-level tracing framework (DTrace⁸ and SystemTap⁹) probes, as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

⁶http://www.erlang.org/doc/apps/observer
⁷http://kamon.io

i Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare a program for modification or extension, or, as is the case here, in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support in a refactoring tool. There are a number of tools that support refactoring in Erlang; in the RELEASE project we have chosen to extend Wrangler¹⁰ [56]; other tools include Tidier [68] and RefactorErl [44].

Supporting API Migration: The SD Erlang libraries modify Erlang's global_group library, becoming the new s_group library; as a result, Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolves and this often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically. The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work, and the technicalities of the implementation, can be found in a paper by [52].

⁸http://dtrace.org/blogs/about
⁹https://sourceware.org/systemtap/wiki
¹⁰http://www.cs.kent.ac.uk/projects/wrangler
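For example, an adaptor clause might express an old global_group call in terms of the new s_group API; from clauses like these the migration refactoring is generated (a hypothetical sketch, not the project's shipped adaptor module):

-module(global_group_adaptor).
-export([own_nodes/0, registered_names/1]).

%% old global_group calls expressed in terms of the new s_group API
own_nodes() ->
    s_group:own_nodes().

registered_names({node, Node}) ->
    s_group:registered_names({node, Node}).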

Support for Introducing Parallelism: We have introduced support for parallelising explicit list operations (map and foreach), for process introduction to complete a computationally intensive task in parallel, for introducing a worker process to deal with call handling in an Erlang "generic server", and for parallelising a tail-recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach. The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward; even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
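Assuming para_lib exports pmap/2 mirroring lists:map/2 (the function name here is our assumption, not confirmed by the source), the transformation amounts to:

%% sequential
Seq = lists:map(fun(X) -> X * X end, Xs),
%% parallel counterpart from Wrangler's para_lib (name assumed)
Par = para_lib:pmap(fun(X) -> X * X end, Xs).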

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute a task in parallel with its parent process. The result of the new process is sent back to the parent process, which will then consume it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies; for instance, an example might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are checked by program slicing, which is discussed below.

Support for Program Slicing: Program slicing is a general technique of program analysis for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE¹¹ [17], a tool developed in another EU project that also re-uses the Wrangler front end. This work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing in transforming programs to make them suitable for the introduction of parallel structures.

¹¹http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf



Figure 21: The architecture of WombatOAM.

ii Scalable Deployment

We have developed the WombatOAM tool¹² to provide a deployment, operations and maintenance framework for large-scale Erlang distributed systems. These systems typically consist of a number of Erlang nodes executing on different hosts. These hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP.

¹²WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd.

Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this is the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture: The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability because of the role played by the central Master node; in its current version an additional layer of Middle Managers was introduced to allow the system to scale easily to thousands of deployed nodes. As the diagram shows, the "northbound" interfaces to the web dashboard and the command-line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms, and so forth; we describe those now.

Services: WombatOAM is designed to collect, store and display various kinds of information and events from running Erlang systems, and these data are accessed and managed through an AJAX-based Web Dashboard; they include the following.

Metrics: WombatOAM supports the collection of some hundred metrics, including, for instance, numbers of processes on a node and the message traffic in and out of the node, on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom¹³, and interface with other tools, such as graphite¹⁴, which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications: As well as metrics, WombatOAM can support event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which is part of the standard distribution, or the lager logging framework¹⁵, and will be displayed and logged as they occur.

¹³Folsom collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom
¹⁴https://graphiteapp.org

Alarms: Alarms are more complex entities. Alarms have a state: they can be raised, and once dealt with they can be cleared. They also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as for notifications.

Topology: The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible; if not, it notifies the other services and periodically tries to reconnect. When the nodes are available again, it also notifies the other services. It doesn't have a middle manager part, because it doesn't talk to the nodes directly; instead it asks the Node manager service to do so.

Node manager: This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration: This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers, the Orchestration service uses an external library called Libcloud¹⁶, for which Erlang Solutions has written an open source Erlang wrapper called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications; it provides infrastructure for deploying them.

¹⁵https://github.com/basho/lager

Deployment: The mechanism consists of the following five steps.

1. Registering a provider: WombatOAM provides the same interface for different cloud providers that support the OpenStack standard or the Amazon EC2 API. WombatOAM also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release: The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this enables WombatOAM to start and stop nodes when needed.

3. Defining a node family: The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

¹⁶The unified cloud API [4]: https://libcloud.apache.org

4. Defining a deployment domain: At this step a deployment domain is created, which specifies (i) which providers should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment: To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to or removed from the system, depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single-host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue, and processes migrate between the queues through a work stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes¹⁷.

iii.1 Percept2

Percept2¹⁸ builds on the existing Percept tool to provide post hoc offline analysis and visualisation of Erlang systems. Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling for a distributed Erlang application using Percept2 is shown in Figure 22.

¹⁷This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.
¹⁸https://github.com/RefactoringTools/percept2

Figure 22: Offline profiling of a distributed Erlang application using Percept2.
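A typical offline session follows this profile-analyse-view pattern; a sketch in which the option flags and arities are indicative of Percept2's API rather than definitive:

%% collect a trace of the program run into a file
percept2:profile("orbit.dat", {orbit_app, run, [Args]},
                 [proc, message, scheduler]),
%% analyse the trace into the RAM database
percept2:analyze(["orbit.dat"]),
%% browse the results at http://localhost:8888
percept2:start_webserver(8888).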

Percept generates an application-level zoomable concurrency graph, showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways, as detailed by [53], including most importantly:

• Distinguishing between running and runnable time for each process: this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including the migration history of a process between run queues; statistics about message passing between processes (the number of messages and the average message size sent/received by a process); the accumulated runtime per process; and the accumulated time when a process is in a running state.

• Presenting the process tree: the hierarchy


Figure 23 Percept2 showing processes running and runnable (with 4 schedulers/run queues)

structure indicating the parent-child relationships between processes.

• Recording the dynamic function callgraph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system. Unlike global groups, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have extended Percept2 to allow the profiling of s_group related activities, so that these dynamic changes can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating measurement of a function's own execution time gives users the freedom of not profiling all the function calls invoked during the program execution. For example, they can choose to profile only functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph, a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of


generated web pages. Together these make the system more responsive and more scalable.

iii2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, and require neither specially recompiled versions of the software to be examined nor special post-processing tools to create meaningful information from the data gathered. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.
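On the Erlang side, a VM built with DTrace/SystemTap support also exposes user-level probes through the dyntrace module in the runtime_tools application. The sketch below is a minimal illustration of that facility (assuming the dyntrace:p/N user-probe calls, which accept integer and string arguments; do_work/1 is hypothetical), not one of the VM-internal probes discussed next:

    %% Minimal sketch: firing user probes around a unit of work,
    %% so an external D or SystemTap script can time it.
    work_with_probes(Item) ->
        dyntrace:p(1, "work_start"),
        Result = do_work(Item),
        dyntrace:p(1, "work_end"),
        Result.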

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the sizes of the run queues vary during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the

Figure 24 DTrace run-queue visualisation of the bang benchmark with 16 schedulers (y-axis: run queue length, 0–4500; x-axis: time; one curve per run queue, RunQueue0–RunQueue15)

same storage infrastructure and presentation facilities.

iii3 Devo

Devo19 [10] is designed to provide real-time online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).
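The kind of ttb instrumentation involved is sketched below, using standard OTP trace tool builder calls; the trace flags shown are illustrative rather than the exact set Devo configures:

    %% Minimal sketch: collect scheduling and message events with ttb.
    {ok, _} = ttb:tracer(node(), [{file, "devo_trace"}]),
    {ok, _} = ttb:p(all, [running, send, 'receive', timestamp]),
    %% ... run the system under observation ...
    ttb:stop([format]).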

Figure 25 shows visualisations from Devo in both modes. On the left-hand side a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram) with six cores each; with hyperthreading this gives twelve virtual cores, and hence 24 run queues in total. The size of these run queues is shown by both the colour and height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle.

19 https://github.com/RefactoringTools/devo


Figure 25 Devo low- and high-level visualisations of SD-Orbit

A green arc shows migration on the same physical core, a grey one on the same chip, and blue shows migrations between the two chips.

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes

represents the current intensity of communication between the nodes (green: quiescent; red: busiest).

iii4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. Tracing to be performed on monitored nodes is also specified within the configuration file.
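The sketch below gives a flavour of such a configuration as a list of Erlang terms. It is a hypothetical illustration of the information just described (hosts, nodes, s_group partitions and trace settings), not SD-Mon's actual file format:

    %% Hypothetical SD-Mon-style configuration sketch.
    [{hosts,    ["host1.example.org", "host2.example.org"]},
     {nodes,    ['node1@host1', 'node2@host1', 'node3@host2']},
     {s_groups, [{group1, ['node1@host1', 'node2@host1']},
                 {group2, ['node2@host1', 'node3@host2']}]},
     {tracing,  [send, 'receive', s_group_ops]}].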

An agent is started by a master SD-Mon node for each s_group and for each free node. Configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up it tries to get in contact with its nodes and apply the tracing to them as stated by the master. Binary files are stored in the host file system. Tracing is internally used in order to track s_group operations happening at runtime; an asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is


Figure 26 SD-Mon architecture

deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

The monitoring network is also supervised in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected then that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27 SD-Mon evolution: before (top) and after eliminating s_group 2 (bottom)


Figure 28 SD-Mon online monitoring

VIII Systemic Evaluation

Preceding sections have investigated the improvements of individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI, VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state of the art NUMA architectures and

the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable20. The bulk of the experiments reported here are conducted on the Athos cluster using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed-size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure.
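Stated precisely (a standard formulation, added here for clarity): if t_n is the execution time on n cores, then for a fixed-size problem the relative speedup is

    S_rel(n) = t_1 / t_n

whereas under weak scaling the problem size grows in proportion to n, and the ideal outcome is a flat execution-time curve, i.e. t_n ≈ t_1.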

20 See Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf


Figure 29 Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling] (y-axis: relative speedup, 0–450; x-axis: number of nodes (cores), up to 257 (6168); curves: D-Orbit and SD-Orbit for W=2M and W=5M)

The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i1. The measurements are repeated seven times, and we plot the standard deviation. The results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups: beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails because some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.
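As a quick consistency check of the linear fit in Figure 30, the best-fit line predicts 0.0125 × 10000 + 83 = 208s for the 10000-node deployment, in good agreement with the measured 212s.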


Figure 30 Wombat Deployment Time (y-axis: time (sec), 80–200; x-axis: number of nodes N, 0–10000; experimental data with best fit 0.0125N + 83)

The measurement data shows two important facts: WombatOAM scales well (up to a deployment base of 10000 Erlang nodes), and WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of the effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, while SR-ACO registers them only within an s_group (Section ii).
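The difference between the two versions is essentially where the registration happens. The sketch below illustrates the SR-ACO style using the s_group API introduced in Section V (the group and process names are hypothetical, and the return-value patterns should be checked against the SD Erlang libraries):

    %% Sketch: register a critical process only within an s_group
    %% (SR-ACO style), rather than in the global namespace (GR-ACO style).
    _ = s_group:new_s_group(colony_group, [node() | ColonyNodes]),
    yes = s_group:register_name(colony_group, master, MasterPid),
    %% Clients resolve the name within the group only:
    Pid = s_group:whereis_name(colony_group, master).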

For both GR-ACO and SR-ACO a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common x86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, so around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark and obtained similar results [23].
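A per-node Chaos Monkey can be as simple as the following sketch (our minimal illustration using only standard BIFs, not the actual service used in the experiments; on Erlang/OTP 17.4 the older random module would replace rand):

    %% Minimal Chaos Monkey sketch: kill one random local process
    %% every second, indefinitely.
    chaos_loop() ->
        timer:sleep(1000),
        Procs = erlang:processes(),
        Victim = lists:nth(rand:uniform(length(Procs)), Procs),
        exit(Victim, kill),
        chaos_loop().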

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves


Figure 31 ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling] (y-axis: execution time (s), 0–16; x-axis: number of nodes (cores), up to 250 (6000); curves: GR-ACO, SR-ACO, ML-ACO)

the distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.

Scalability. Figure 31 compares the runtimes of the ML, GR and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks and

others that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang


Figure 32 Number of sent packets in ML-ACO, GR-ACO and SR-ACO

preserves the distributed Erlang language-level reliability model.

As SD-Orbit scales better than D-Orbit, SR-ACO scales better than ML-ACO, and SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes respectively, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limit of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups,

and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language, while preserving reliability. The work is a vade mecum for addressing scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering VM, language and persistent storage levels. We have developed the


BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15], or Legion [40]. We establish scientifically the folklore limitations of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups for partitioning the network and global process namespace, and semi-explicit process placement for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).

To improve the scalability of the Erlang VM and libraries, we have improved the implementation of shared ETS tables, time management,

and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores and to minimise contention on shared VM resources (Section VI).
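Some of this tuning is visible at the language level: the standard ETS options below (stock Erlang/OTP options; the direct programmer control over bucket-lock counts described above was an experimental extension) let programmers trade memory for reduced lock contention:

    %% Standard Erlang/OTP table options affecting ETS locking.
    Tab = ets:new(shared_table,
                  [set, public,
                   {write_concurrency, true},  % finer-grained write locking
                   {read_concurrency, true}]). % optimised concurrent reads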

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the


Table 2 Cluster Specifications (RAM per host in GB)

Name    Hosts  Cores/host  Cores (max)  Cores (avail)  Processor                                  RAM     Interconnection
GPG     20     16          320          320            Intel Xeon E5-2640v2, 8C, 2GHz             64      10GB Ethernet
Kalkyl  348    8           2784         1408           Intel Xeon 5520v2, 4C, 2.26GHz             24-72   InfiniBand 20Gb/s
TinTin  160    16          2560         2240           AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz   64-128  2:1 oversubscribed QDR InfiniBand
Athos   776    24          18624        6144           Intel Xeon E5-2697v2, 12C, 2.7GHz          64      InfiniBand FDR14

key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, pre-

liminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages21. For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3 GHz, 16 "Bulldozer" cores each, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

21 See Deliverable 6.7, available online: http://www.release-project.eu/documents/D67.pdf


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.
[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58-67, 1986.
[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture)
[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org
[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540-545, October 1989.
[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.
[7] Joe Armstrong. Erlang. Commun. ACM, 53:68-75, 2010.
[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33-42, New York, NY, USA, September 2012. ACM.
[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 2-10, New York, NY, USA, 2006. ACM.
[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level Visualization of Concurrent and Distributed Computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.
[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.
[12] Basho Technologies. Riak docs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking/
[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069-1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html
[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.
[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 207-216. ACM, 1995.
[16] Olivier Boudeville. Technical manual of the sim-diasca simulation engine. EDF R&D, 2012.
[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Koszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104-121. Springer, 2015.
[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157-166, New York, NY, USA, 2013. ACM.
[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.
[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.
[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22-34, 2016.
[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.
[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33-41, New York, NY, USA, 2016. ACM.
[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268-279, New York, NY, USA, 2000. ACM.
[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341-350, 1998. The datasets from this paper are included in Beasley's ORLIB.
[26] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107-113, 2008.
[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133-140, Berlin, Heidelberg, 2006. Springer-Verlag.
[28] Alan A. A. Donovan and Brian W. Kernighan. The Go programming language. Addison-Wesley Professional, 2015.
[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.
[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13-22, New York, NY, USA, 2007. ACM.
[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118-129. ACM, 2011.
[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.
[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89-144. Springer International Publishing, 2015.
[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Technical Report, Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP
[35] Guillaume Germain. Concurrency oriented programming in termite scheme. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 20-20, New York, NY, USA, 2006. ACM.
[36] Amir Ghaffari. De-bench: a benchmark tool for distributed erlang, 2014. https://github.com/amirghaffari/DEbench
[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43-49. ACM, 2014.
[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73-74, New York, NY, USA, 2013. ACM.
[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51-59, 2002.
[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39-45, 1997.
[41] Carl Hewitt. Actor model for discretionary, adaptive concurrency. CoRR, abs/1008.1459, 2010.
[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235-245, San Francisco, CA, USA, 1973. Morgan Kaufmann.
[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 1:1-1:1, New York, NY, USA, 2008. ACM.
[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.
[45] Laxmikant V. Kale and Sanjeev Krishnan. Charm++: a portable concurrent object oriented system based on C++. In ACM Sigplan Notices, volume 28, pages 91-108. ACM, 1993.
[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15-26, New York, NY, USA, September 2013. ACM.
[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572-583. Springer, 2014.
[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.
[49] Rusty Klophaus. Riak core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1-14:1, New York, NY, USA, 2010. ACM.
[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35-40, April 2010.
[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley
[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.
[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN workshop on Erlang, 2013.
[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.
[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.
[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.
[57] Frank Lubeck and Max Neunhoffer. Enumerating large Orbits and direct condensation. Experimental Mathematics, 10(2):197-205, 2001.
[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12-23, New York, NY, USA, 2016. ACM.
[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897
[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27-38. ACM, 2015.
[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(2):207-221, 1998.
[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103-104. ACM, 2014.
[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1-12, 1959.
[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org
[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.
[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11-20. ACM, 2012.
[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011-2015), 2015. http://www.release-project.eu
[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13-24. ACM, 2009.
[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3-11. ACM, September 2014.
[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215-224. IEEE Computer Society, 2015.
[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing - 28th International Workshop, volume 9519 of LNCS, pages 37-53. Springer, 2016.
[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.
[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104-128, Berlin, Heidelberg, 2008. Springer-Verlag.
[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.
[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.
[76] Marcus Völker. Linux timers, May 2014.
[77] WhatsApp, 2015. https://www.whatsapp.com
[78] Tom White. Hadoop: The Definitive Guide. Storage and Analysis at Internet Scale (3rd ed., revised and updated). O'Reilly, 2012.
[79] Ulf Wiger. Industrial-Strength Functional programming: Experiences with the Ericsson AXD301 Project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.

  • Introduction
  • Context
    • Scalable Reliable Programming Models
    • Actor Languages
    • Erlangs Support for Concurrency
    • Scalability and Reliability in Erlang Systems
    • ETS Erlang Term Storage
      • Benchmarks for scalability and reliability
        • Orbit
        • Ant Colony Optimisation (ACO)
          • Erlang Scalability Limits
            • Scaling Erlang on a Single Host
            • Distributed Erlang Scalability
            • Persistent Storage
              • Improving Language Scalability
                • SD Erlang Design
                  • S_groups
                  • Semi-Explicit Placement
                    • S_group Semantics
                    • Semantics Validation
                      • Improving VM Scalability
                        • Improvements to Erlang Term Storage
                        • Improvements to Schedulers
                        • Improvements to Time Management
                          • Scalable Tools
                            • Refactoring for Scalability
                            • Scalable Deployment
                            • Monitoring and Visualisation
                              • Percept2
                              • DTraceSystemTap tracing
                              • Devo
                              • SD-Mon
                                  • Systemic Evaluation
                                    • Orbit
                                    • Ant Colony Optimisation (ACO)
                                    • Evaluation Summary
                                      • Discussion
Page 3: Scaling Reliably Scaling Reliably: Improving the ... · Distributed actor languages are an effective means of constructing scalable reliable systems, and the Erlang ... that capture

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

tion is validated against the semantics (Sec-tion V)

2 We have evolved the Erlang virtual machineso that it can work effectively in large-scalesingle-host multicore and NUMA archi-tectures We have improved the sharedETS tables time management and loadbalancing between schedulers Most ofthese improvements are now included inthe ErlangOTP release currently down-loaded approximately 50K times eachmonth (Section VI)

3 To facilitate the development of scalableErlang systems and to make them main-tainable we have developed three newtools Devo SDMon and WombatOAMand enhanced two others the visualisa-tion tool Percept and the refactorer Wran-gler The tools support refactoring pro-grams to make them more scalable eas-ier to deploy at large scale (hundreds ofhosts) easier to monitor and visualise theirbehaviour Most of these tools are freelyavailable under open source licences theWombatOAM deployment and monitor-ing tool is a commercial product (Sec-tion VII)

Throughout the article we use two bench-marks to investigate the capabilities of our newtechnologies and tools These are a compu-tation in symbolic algebra more specificallyan algebraic lsquoorbitrsquo calculation that exploitsa non-replicated distributed hash table andan Ant Colony Optimisation (ACO) parallelsearch program (Section III)

We report on the reliability and scalabilityimplications of our new technologies using Or-bit ACO and other benchmarks We use aChaos Monkey instance [14] that randomlykills processes in the running system to demon-strate the reliability of the benchmarks andto show that SD Erlang preserves the Erlanglanguage-level reliability model While we

report measurements on a range of NUMAand cluster architectures as specified in Ap-pendix A the key scalability experiments areconducted on the Athos cluster with 256 hostsand 6144 cores Having established scientifi-cally the folklore limitations of around 60 con-nected hostsnodes for distributed Erlang sys-tems in Section 4 a key result is to show thatthe SD Erlang benchmarks exceed this limitand do not reach their limits on the Athos clus-ter (Section VIII)

Contributions This article is the first sys-tematic presentation of the coherent set oftechnologies for engineering scalable reliableErlang systems developed in the RELEASEproject

Section IV presents the first scalability studycovering Erlang VM language and storagescalability Indeed we believe it is the firstcomprehensive study of any distributed ac-tor language at this scale (100s of hosts andaround 10K cores) Individual scalability stud-ies eg into Erlang VM scaling [8] or lan-guage and storage scaling have appeared be-fore [38 37]

At the language level the design implemen-tation and validation of the new libraries (Sec-tion V) have been reported piecemeal [21 60]and are included here for completeness

While some of the improvements made tothe Erlang Virtual Machine (Section i) havebeen thoroughly reported in conference pub-lications [66 46 47 69 70 71] others are re-ported here for the first time (Sections iii ii)

In Section VII the WombatOAM and SD-Mon tools are described for the first time asis the revised Devo system and visualisationThe other tools for profiling debugging andrefactoring developed in the project have previ-ously been published piecemeal [52 53 55 10]but this is their first unified presentation

All of the performance results in Section VIIIare entirely new although a comprehensivestudy of SD Erlang performance is now avail-able in a recent article by [22]

3

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

II Context

i Scalable Reliable ProgrammingModels

There is a plethora of shared memory con-current programming models like PThreadsor Java threads and some models likeOpenMP [20] are simple and high level How-ever synchronisation costs mean that thesemodels generally do not scale well often strug-gling to exploit even 100 cores Moreover re-liability mechanisms are greatly hampered bythe shared state for example a lock becomespermanently unavailable if the thread holdingit fails

The High Performance Computing (HPC)community build large-scale (106 core) dis-tributed memory systems using the de factostandard MPI communication libraries [72] In-creasingly these are hybrid applications thatcombine MPI with OpenMP UnfortunatelyMPI is not suitable for producing general pur-pose concurrent software as it is too low levelwith explicit message passing Moreover themost widely used MPI implementations offerno fault recovery1 if any part of the compu-tation fails the entire computation fails Cur-rently the issue is addressed by using whatis hoped to be highly reliable computationaland networking hardware but there is intenseresearch interest in introducing reliability intoHPC applications [33]

Server farms use commodity computationaland networking hardware and often scale toaround 105 cores where host failures are rou-tine They typically perform rather constrainedcomputations eg Big Data Analytics using re-liable frameworks like Google MapReduce [26]or Hadoop [78] The idempotent nature of theanalytical queries makes it relatively easy forthe frameworks to provide implicit reliabilityqueries are monitored and failed queries are

1Some fault tolerance is provided in less widely usedMPI implementations like [27]

simply re-run In contrast actor languages likeErlang are used to engineer reliable generalpurpose computation often recovering failedstateful computations

ii Actor Languages

The actor model of concurrency consists of in-dependent processes communicating by meansof messages sent asynchronously between pro-cesses A process can send a message to anyother process for which it has the address (inErlang the ldquoprocess identifierrdquo or pid) andthe remote process may reside on a differenthost While the notion of actors originatedin AI [42] it has been used widely as a gen-eral metaphor for concurrency as well as beingincorporated into a number of niche program-ming languages in the 1970s and 80s More re-cently it has come back to prominence throughthe rise of not only multicore chips but alsolarger-scale distributed programming in datacentres and the cloud

With built-in concurrency and data isola-tion actors are a natural paradigm for engi-neering reliable scalable general-purpose sys-tems [1 41] The model has two main conceptsactors that are the unit of computation andmessages that are the unit of communicationEach actor has an address-book that contains theaddresses of all the other actors it is aware ofThese addresses can be either locations in mem-ory direct physical attachments or networkaddresses In a pure actor language messagesare the only way for actors to communicate

After receiving a message an actor can dothe following (i) send messages to anotheractor from its address-book (ii) create new ac-tors or (iii) designate a behaviour to handle thenext message it receives The model does notimpose any restrictions in the order in whichthese actions must be taken Similarly twomessages sent concurrently can be received inany order These features enable actor basedsystems to support indeterminacy and quasi-

4

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

commutativity while providing locality modu-larity reliability and scalability [41]

Actors are just one message-based paradigmand other languages and libraries have relatedmessage passing paradigms Recent examplelanguages include Go [28] and Rust [62] thatprovide explicit channels similar to actor mail-boxes Probably the most famous messagepassing library is MPI [72] with APIs for manylanguages and widely used on clusters andHigh Performance Computers It is howeverarguable that the most important contributionof the actor model is the one-way asynchronouscommunication [41] Messages are not coupledwith the sender and neither they are trans-ferred synchronously to a temporary containerwhere transmission takes place eg a buffer aqueue or a mailbox Once a message is sentthe receiver is the only entity responsible forthat message

Erlang [6 19] is the pre-eminent program-ming language based on the actor model hav-ing a history of use in production systemsinitially with its developer Ericsson and thenmore widely through open source adoptionThere are now actor frameworks for manyother languages these include Akka for C Fand Scala [64] CAF2 for C++ Pykka3 CloudHaskell [31] PARLEY [51] for Python and Ter-mite Scheme [35] and each of these is currentlyunder active use and development Moreoverthe recently defined Rust language [62] has aversion of the actor model built in albeit in animperative context

iii Erlangrsquos Support for Concurrency

In Erlang actors are termed processes and vir-tual machines are termed nodes The key ele-ments of the actor model are fast process cre-ation and destruction lightweight processeseg enabling 106 concurrent processes on a sin-gle host with 8GB RAM fast asynchronous

2httpactor-frameworkorg3httppykkareadthedocsorgenlatest

message passing with copying semantics pro-cess monitoring strong dynamic typing andselective message reception

By default Erlang processes are addressedby their process identifier (pid) eg

Pong_PID = spawn(fun some_modulepong0)

spawns a process to execute the anonymousfunction given as argument to the spawn prim-itive and binds Pong_PID to the new processidentifier Here the new process will exe-cute the pong0 function which is defined insome_module A subsequent call

Pong_PID finish

sends the messaged finish to the process iden-tified by Pong_PID Alternatively processes canbe given names using a call of the form

register(my_funky_name Pong_PID)

which registers this process name in the nodersquosprocess name table if not already present Sub-sequently these names can be used to refer toor communicate with the corresponding pro-cesses (eg send them a message)

my_funky_process hello

A distributed Erlang system executes on mul-tiple nodes and the nodes can be freely de-ployed across hosts eg they can be locatedon the same or different hosts To help makedistribution transparent to the programmerwhen any two nodes connect they do so transi-tively sharing their sets of connections With-out considerable care this quickly leads to afully connected graph of nodes A process maybe spawned on an explicitly identified nodeeg

Remote_Pong_PID = spawn(some_node

fun some_modulepong0)

After this the remote process can be addressedjust as if it were local It is a significant bur-den on the programmer to identify the remotenodes in large systems and we will return tothis in Sections ii i2

5

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 1 Conceptual view of Erlangrsquos concurrency mul-ticore support and distribution

iv Scalability and Reliability inErlang Systems

Erlang was designed to solve a particularset of problems namely those in buildingtelecommunicationsrsquo infrastructure where sys-tems need to be scalable to accommodate hun-dreds of thousands of calls concurrently in softreal-time These systems need to be highly-available and reliable ie to be robust in thecase of failure which can come from softwareor hardware faults Given the inevitability ofthe latter Erlang adopts the ldquolet it failrdquo phi-losophy for error handling That is encourageprogrammers to embrace the fact that a processmay fail at any point and have them rely onthe supervision mechanism discussed shortlyto handle the failures

Figure 1 illustrates Erlangrsquos support for con-currency multicores and distribution EachErlang node is represented by a yellow shapeand each rectangle represents a host with an IPaddress Each red arc represents a connectionbetween Erlang nodes Each node can run onmultiple cores and exploit the inherent con-currency provided This is done automaticallyby the VM with no user intervention neededTypically each core has an associated schedulerthat schedules processes a new process will bespawned on the same core as the process thatspawns it but work can be moved to a differentscheduler through a work-stealing allocationalgorithm Each scheduler allows a process

that is ready to compute at most a fixed num-ber of computation steps before switching toanother Erlang built-in functions or BIFs are im-plemented in C and at the start of the projectwere run to completion once scheduled caus-ing performance and responsiveness problemsif the BIF had a long execution time

Scaling in Erlang is provided in two differ-ent ways It is possible to scale within a singlenode by means of the multicore virtual machineexploiting the concurrency provided by themultiple cores or NUMA nodes It is also pos-sible to scale across multiple hosts using multipledistributed Erlang nodes

Reliability in Erlang is multi-faceted As inall actor languages each process has privatestate preventing a failed or failing processfrom corrupting the state of other processesMessages enable stateful interaction and con-tain a deep copy of the value to be shared withno references (eg pointers) to the sendersrsquointernal state Moreover Erlang avoids typeerrors by enforcing strong typing albeit dy-namically [7] Connected nodes check livenesswith heartbeats and can be monitored fromoutside Erlang eg by an operating systemprocess

However the most important way to achievereliability is supervision which allows a processto monitor the status of a child process andreact to any failure for example by spawninga substitute process to replace a failed processSupervised processes can in turn superviseother processes leading to a supervision treeThe supervising and supervised processes maybe in different nodes and on different hostsand hence the supervision tree may span mul-tiple hosts or nodes

To provide reliable distributed service registration, a global namespace is maintained on every node, which maps process names to pids. It is this that we mean when we talk about a 'reliable' system: it is one in which a named process in a distributed system can be restarted without requiring the client processes also to be restarted (because the name can still be used for communication).

To see global registration in action, consider a pong server process:

    global:register_name(pong_server, Remote_Pong_PID)

Clients of the server can send messages to the registered name, e.g.:

    global:whereis_name(pong_server) ! finish

If the server fails, the supervisor can spawn a replacement server process with a new pid and register it with the same name (pong_server). Thereafter client messages to the pong_server will be delivered to the new server process. We return to discuss the scalability limitations of maintaining a global namespace in Section V.
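A minimal sketch of this recovery pattern (our illustration, not code from the paper's sources; the pong_server module and its loop/0 function are assumed):

    %% Spawn a replacement server and re-bind the global name to the
    %% new pid, so clients can keep sending messages to pong_server.
    restart_pong_server() ->
        NewPid = spawn(pong_server, loop, []),
        yes = global:re_register_name(pong_server, NewPid),
        NewPid.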

v ETS: Erlang Term Storage

Erlang is a pragmatic language, and the actor model it supports is not pure. Erlang processes, besides communicating via asynchronous message passing, can also share data in public memory areas called ETS tables.

The Erlang Term Storage (ETS) mechanism is a central component of Erlang's implementation. It is used internally by many libraries and underlies the in-memory databases. ETS tables are key-value stores: they store Erlang tuples where one of the positions in the tuple serves as the lookup key. An ETS table has a type that may be either set, bag or duplicate_bag, implemented as a hash table, or ordered_set, which is currently implemented as an AVL tree. The main operations that ETS supports are table creation, insertion of individual entries and atomic bulk insertion of multiple entries in a table, deletion and lookup of an entry based on some key, and destructive update. The operations are implemented as C built-in functions in the Erlang VM.

The code snippet below creates a set ETS table keyed by the first element of the entry, atomically inserts two elements with keys some_key and 42, updates the value associated with the table entry with key 42, and then looks up this entry:

    Table = ets:new(my_table, [set, public, {keypos, 1}]),
    ets:insert(Table, [{some_key, an_atom_value},
                       {42, {a, tuple, value}}]),
    ets:update_element(Table, 42, [{2, {another, tuple, value}}]),
    [{Key, Value}] = ets:lookup(Table, 42)

ETS tables are heavily used in many Erlang applications. This is partly due to the convenience of sharing data for some programming tasks, but also partly due to their fast implementation. As a shared resource, however, ETS tables induce contention and become a scalability bottleneck, as we shall see in Section i.

III Benchmarks for scalability and reliability

The two benchmarks that we use throughout this article are Orbit, which measures scalability without looking at reliability, and Ant Colony Optimisation (ACO), which allows us to measure the impact on scalability of adding global namespaces to ensure reliability. The source code for the benchmarks, together with more detailed documentation, is available at https://github.com/release-project/benchmarks. The RELEASE project team also worked to improve the reliability and scalability of other Erlang programs, including a substantial (approximately 150K lines of Erlang code) Sim-Diasca simulator [16] and an Instant Messenger that is more typical of Erlang applications [23], but we do not cover these systematically here.

i Orbit

Figure 2: Distributed Erlang Orbit (D-Orbit) architecture: workers are mapped to nodes.

Orbit is a computation in symbolic algebra which generalises a transitive closure computation [57]. To compute the orbit for a given space [0, X], a list of generators g_1, g_2, ..., g_n is applied to an initial vertex x_0 ∈ [0, X]. This creates new values (x_1, ..., x_n) ∈ [0, X], where x_i = g_i(x_0). The generator functions are applied to the new values until no new value is generated.
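To make the definition concrete, the following sequential Erlang sketch (ours, not the benchmark code) computes an orbit by applying the generators to a work list until no fresh values appear:

    orbit(Gens, X0) ->
        orbit(Gens, [X0], sets:from_list([X0])).

    orbit(_Gens, [], Seen) ->
        sets:to_list(Seen);
    orbit(Gens, [X | Queue], Seen) ->
        %% apply every generator to X and keep only unseen values
        Fresh = [Y || G <- Gens, Y <- [G(X)], not sets:is_element(Y, Seen)],
        Seen1 = lists:foldl(fun sets:add_element/2, Seen, Fresh),
        orbit(Gens, Fresh ++ Queue, Seen1).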

Orbit is a suitable benchmark because it has a number of aspects that characterise a class of real applications. The core data structure it maintains is a set, which in distributed environments is implemented as a distributed hash table (DHT), similar to the DHTs used in replicated form in NoSQL database management systems. Also, in distributed mode, it uses standard peer-to-peer (P2P) techniques like a credit-based termination algorithm [61]. By choosing the orbit size, the benchmark can be parameterised to specify smaller or larger computations that are suitable to run on a single machine (Section i) or on many nodes (Section i). Moreover, it is only a few hundred lines of code.

As shown in Figure 2, the computation is initiated by a master which creates a number of workers. In the single node scenario of the benchmark, workers correspond to processes, but these workers can also spawn other processes to apply the generator functions on a subset of their input values, thus creating intra-worker parallelism. In the distributed version of the benchmark, processes are spawned by the master node to worker nodes, each maintaining a DHT fragment. A newly spawned process gets a share of the parent's credit, and returns this on completion. The computation is finished when the master node collects all credit, i.e., all workers have completed.
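The credit-based termination algorithm can be sketched as follows (an illustration under simplifying assumptions: credit is a float, and do_task/1 is a hypothetical placeholder for the generator work; real implementations use exact fractions to avoid rounding loss):

    %% The parent gives half its credit to each child it spawns; a
    %% finished worker returns its credit to the master, which is done
    %% when the collected credit sums back to 1.0.
    spawn_worker(Master, Task, Credit) ->
        Share = Credit / 2,
        spawn(fun() -> worker(Master, Task, Share) end),
        Credit - Share.    % credit retained by the parent

    worker(Master, Task, Credit) ->
        do_task(Task),
        Master ! {done, Credit}.

    master_loop(Collected) when Collected >= 1.0 -> finished;
    master_loop(Collected) ->
        receive {done, Credit} -> master_loop(Collected + Credit) end.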

ii Ant Colony Optimisation (ACO)

Ant Colony Optimisation [29] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. For the purpose of this article we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [63], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node and consisting of multiple "ants". Ants are simple computational agents which concurrently compute possible solutions to the input problem, guided by shared information about good paths through the search space; there is also a certain amount of stochastic variation which allows the ants to explore new directions. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution.

Figure 3: Distributed Erlang Two-level Ant Colony Optimisation (TL-ACO) architecture.

Figure 4: Distributed Erlang Multi-level Ant Colony Optimisation (ML-ACO) architecture.

We implement four distributed coordination patterns for the same multi-colony ACO computation, as follows. In each implementation the individual colonies perform some number of local iterations (i.e., generations of ants) and then report their best solutions; the globally-best solution is then selected and reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations.

Two-level ACO (TL-ACO) has a single master node that collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 3 depicts the process and node placements of the TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on available nodes. In the next step, each colony process spawns NA ant processes on the local node. Each ant iterates IA times, returning its result to the colony master. Each colony iterates IM times, reporting their best solution to, and receiving the globally-best solution from, the master process. We validated the implementation by applying TL-ACO to a number of standard SMTWTP instances [25, 13, 34], obtaining good results in all cases, and confirmed that the number of perfect solutions increases as we increase the number of colonies.

Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies, and thus could become a bottleneck. ML-ACO addresses this by having a tree of submasters (Figure 4), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from their children.

Globally Reliable ACO (GR-ACO). In ML-ACO, if a single colony fails to report back, the system will wait indefinitely. GR-ACO adds fault tolerance, supervising colonies so that a faulty colony can be detected and restarted, allowing the system to continue execution.

Scalable Reliable ACO (SR-ACO) also adds fault-tolerance, but using supervision within our new s_groups from Section i1; the architecture of SR-ACO is discussed in detail there.

IV Erlang Scalability Limits

This section investigates the scalability of Erlang at VM, language, and persistent storage levels. An aspect we choose not to explore is the security of large scale systems, where, for example, one might imagine providing enhanced security for systems with multiple clusters or cloud instances connected by a Wide Area Network. We assume that existing security mechanisms are used, e.g., a Virtual Private Network.

i Scaling Erlang on a Single Host

To investigate Erlang scalability we built BenchErl, an extensible open source benchmark suite with a web interface.4 BenchErl shows how an application's performance changes when resources like cores or schedulers are added, or when options that control these resources change:

• the number of nodes, i.e., the number of Erlang VMs used, typically on multiple hosts;

• the number of cores per node;

• the number of schedulers, i.e., the OS threads that execute Erlang processes in parallel, and their binding to the topology of the cores of the underlying computer node;

• the Erlang/OTP release and flavor; and

• the command-line arguments used to start the Erlang nodes.

Using BenchErl, we investigated the scalability of an initial set of twelve benchmarks and two substantial Erlang applications, using a single Erlang node (VM) on machines with up to 64 cores, including the Bulldozer machine specified in Appendix A. This set of experiments, reported by [8], confirmed that some programs scaled well in the most recent Erlang/OTP release of the time (R15B01), but also revealed VM and language level scalability bottlenecks.

4 Information about BenchErl is available at http://release.softlab.ntua.gr/bencherl/

Figure 5: Runtime and speedup of two configurations of the Orbit benchmark (with and without intra-worker parallelism) using Erlang/OTP R15B01; runtime in ms and speedup against the number of schedulers.

Figure 5 shows runtime and speedup curves for the Orbit benchmark, where master and workers run on a single Erlang node, in configurations with and without intra-worker parallelism. In both configurations the program scales: runtime continuously decreases as we add more schedulers to exploit more cores. The speedup of the benchmark without intra-worker parallelism, i.e., without spawning additional processes for the computation (green curve), is almost linear up to 32 cores, but increases less rapidly from that point on; we see a similar but more clearly visible pattern for the other configuration (red curve), where there is no performance improvement beyond 32 schedulers. This is due to the asymmetric characteristics of the machine's micro-architecture, which consists of modules that couple two conventional x86 out-of-order cores that share the early pipeline stages, the floating point unit, and the L2 cache with the rest of the module [3].

Some other benchmarks, however, did not scale well, or experienced significant slowdowns when run on many VM schedulers (threads). For example, the ets_test benchmark has multiple processes accessing a shared ETS table. Figure 6 shows runtime and speedup curves for ets_test on a 16-core (eight cores with hyperthreading) Intel Xeon-based machine. It shows that runtime increases beyond two schedulers, and that the program exhibits a slowdown instead of a speedup.

For many benchmarks there are obvious reasons for poor scaling, like limited parallelism in the application or contention for shared resources. The reasons for poor scaling are less obvious for other benchmarks, and it is exactly these we have chosen to study in detail in subsequent work [8, 46].

Figure 6: Runtime and speedup of the ets_test BenchErl benchmark using Erlang/OTP R15B01; runtime in ms and speedup against the number of schedulers.

A simple example is the parallel BenchErl benchmark that spawns some n processes, each of which creates a list of m timestamps and, after checking that each timestamp in the list is strictly greater than the previous one, sends the result to its parent. Figure 7 shows that up to eight cores each additional core leads to a slowdown; thereafter a small speedup is obtained up to 32 cores, and then again a slowdown. A small aspect of the benchmark, easily overlooked, explains the poor scalability. The benchmark creates timestamps using the erlang:now/0 built-in function, whose implementation acquires a global lock in order to return a unique timestamp. That is, two calls to erlang:now/0, even from different processes, are guaranteed to produce monotonically increasing values. This lock is precisely the bottleneck in the VM that limits the scalability of this benchmark. We describe our work to address VM timing issues in Section iii.

Discussion. Our investigations identified contention for shared ETS tables, and for commonly-used shared resources like timers, as the key VM-level scalability issues. Section VI outlines how we addressed these issues in recent Erlang/OTP releases.

ii Distributed Erlang Scalability

Figure 7: Runtime and speedup of the BenchErl benchmark called parallel using Erlang/OTP R15B01; runtime in ms and speedup against the number of schedulers.

Network Connectivity Costs. When any normal distributed Erlang nodes communicate, they share their connection sets, and this typically leads to a fully connected graph of nodes. So a system with n nodes will maintain O(n²) connections, and these are relatively expensive TCP connections with continual maintenance traffic. This design aids transparent distribution, as there is no need to discover nodes, and the design works well for small numbers of nodes. However, at emergent server architecture scales, i.e., hundreds of nodes, this design becomes very expensive, and system architects must switch from the default Erlang model, e.g., they need to start using hidden nodes that do not share connection sets.

We have investigated the scalability limits imposed by network connectivity costs using several Orbit calculations on two large clusters, Kalkyl and Athos, as specified in Appendix A. The Kalkyl results are discussed by [21], and Figure 29 in Section i shows representative results for distributed Erlang computing orbits with 2M and 5M elements on Athos. In all cases performance degrades beyond some scale (40 nodes for the 5M orbit, and 140 nodes for the 2M orbit). Figure 32 illustrates the additional network traffic induced by the fully connected network: it allows a comparison between the number of packets sent in a fully connected network (ML-ACO) with those sent in a network partitioned using our new s_groups (SR-ACO).

Global Information Costs. Maintaining global information is known to limit the scalability of distributed systems, and crucially the process namespaces used for reliability are global. To investigate the scalability limits imposed on distributed Erlang by such global information we have designed and implemented DE-Bench, an open source, parameterisable, and scalable peer-to-peer benchmarking framework [37, 36]. DE-Bench measures the throughput and latency of distributed Erlang commands on a cluster of Erlang nodes, and the design is influenced by the Basho Bench benchmarking tool for Riak [12]. Each DE-Bench instance acts as a peer, providing scalability and reliability by eliminating central coordination and any single point of failure.

To evaluate the scalability of distributed Erlang, we measure how adding more hosts increases the throughput, i.e., the total number of successfully executed distributed Erlang commands per experiment. Figure 8 shows the parameterisable internal workflow of DE-Bench. There are three classes of commands in DE-Bench: (i) Point-to-Point (P2P) commands, where a function with tunable argument size and computation time is run on a remote node, including spawn, rpc, and synchronous calls to server processes, i.e., gen_server or gen_fsm; (ii) Global commands, which entail synchronisation across all connected nodes, such as global:register_name and global:unregister_name; (iii) Local commands, which are executed independently by a single node, e.g., whereis_name, a look-up in the local name table.

Figure 8: DE-Bench's internal workflow.

The benchmarking is conducted on 10 to 100 host configurations of the Kalkyl cluster (in steps of 10), and measures the throughput of successful commands per second over 5 minutes. There is one Erlang VM on each host, and one DE-Bench instance on each VM. The full paper [37] investigates the impact of data size and computation time in P2P calls, both independently and in combination, and the scaling properties of the common Erlang/OTP generic server processes gen_server and gen_fsm.

Here we focus on the impact of different proportions of global commands, mixing global with P2P and local commands. Figure 9 shows that even a low proportion of global commands limits the scalability of distributed Erlang, e.g., just 0.01% global commands limits scalability to around 60 nodes. Figure 10 reports the latency of all commands, and shows that while the latencies for P2P and local commands are stable at scale, the latency of the global commands increases dramatically with scale. Both results illustrate that the impact of global operations on throughput and latency in a distributed Erlang system is severe.

Explicit Placement. While network connectivity and global information impact performance at scale, our investigations also identified explicit process placement as a programming issue at scale. Recall from Section iii that distributed Erlang requires the programmer to identify an explicit Erlang node (VM) when spawning a process. Identifying an appropriate node becomes a significant burden for large and dynamic systems. The problem is exacerbated in large distributed systems where (1) the hosts may not be identical, having different hardware capabilities or different software installed, and (2) communication times may be non-uniform: it may be fast to send a message between VMs on the same host, and slow if the VMs are on different hosts in a large distributed system.

These factors make it difficult to deploy applications, especially in a scalable and portable manner. Moreover, while the programmer may be able to use platform-specific knowledge to decide where to spawn processes to enable an application to run efficiently, if the application is then deployed on a different platform, or if the platform changes as hosts fail or are added, this knowledge becomes outdated.

Figure 9: Scalability vs. percentage of global commands in distributed Erlang.

Discussion. Our investigations confirm three language-level scalability limitations of Erlang from developer folklore: (1) maintaining a fully connected network of Erlang nodes limits scalability, for example Orbit is typically limited to just 40 nodes; (2) global operations, and crucially the global operations required for reliability, i.e., to maintain a global namespace, seriously limit the scalability of distributed Erlang systems; (3) explicit process placement makes it hard to build performance-portable applications for large architectures. These issues cause designers of reliable large scale systems in Erlang to depart from the standard Erlang model, e.g., using techniques like hidden nodes and storing pids in data structures. In Section V we develop language technologies to address these issues.

iii Persistent Storage

Any large scale system needs reliable scalable persistent storage, and we have studied the scalability limits of Erlang persistent storage alternatives [38]. We envisage a typical large server having around 10^5 cores on around 100 hosts. We have reviewed the requirements for scalable and available persistent storage, and evaluated four popular Erlang DBMS against these requirements. For a target scale of around 100 hosts, Mnesia and CouchDB are, unsurprisingly, not suitable. However, Dynamo-style NoSQL DBMS like Cassandra and Riak have the potential to be.

We have investigated the current scalability limits of the Riak NoSQL DBMS using the Basho Bench benchmarking framework on a cluster with up to 100 nodes and independent disks. We found that the scalability limit of Riak version 1.1.1 is 60 nodes on the Kalkyl cluster. The study placed into the public scientific domain what was previously well-evidenced, but anecdotal, developer experience.

We have also shown that resources like memory, disk, and network do not limit the scalability of Riak. By instrumenting the global and gen_server OTP libraries, we identified a specific Riak remote procedure call that fails to scale. We outline how later releases of Riak are refactored to eliminate the scalability bottlenecks.


Figure 10: Latency of commands as the number of Erlang nodes increases.

Discussion. We conclude that Dynamo-like NoSQL DBMSs have the potential to deliver reliable persistent storage for Erlang at our target scale of approximately 100 hosts. Specifically, an Erlang Cassandra interface is available, and Riak 1.1.1 already provides scalable and available persistent storage on 60 nodes. Moreover, the scalability of Riak is much improved in subsequent versions.

V Improving Language Scalability

This section outlines the Scalable Distributed (SD) Erlang libraries [21] we have designed and implemented to address the distributed Erlang scalability issues identified in Section ii. SD Erlang introduces two concepts to improve scalability. S_groups partition the set of nodes in an Erlang system to reduce network connectivity and partition global data (Section i1). Semi-explicit placement alleviates the issues of explicit process placement in large heterogeneous networks (Section i2). The two features are independent and can be used separately or in combination. We overview SD Erlang in Section i, and outline s_group semantics and validation in Sections ii and iii respectively.

i SD Erlang Design

i1 S_groups

S_groups reduce both the number of connections a node maintains, and the size of name spaces, i.e., they minimise global information. Specifically, names are registered on, and synchronised between, only the nodes within the s_group. An s_group has the following parameters: a name, a list of nodes, and a list of registered names. A node can belong to many s_groups or to none. If a node belongs to no s_group, it behaves as a usual distributed Erlang node.

The s_group library defines the functions shown in Table 1. Some of these functions manipulate s_groups and provide information about them, such as creating s_groups and providing a list of nodes from a given s_group. The remaining functions manipulate names registered in s_groups and provide information about these names. For example, to register a process Pid with name Name in s_group SGroupName we use the following function.


Table 1: Summary of s_group functions.

new_s_group(SGroupName, Nodes): Creates a new s_group consisting of some nodes.
delete_s_group(SGroupName): Deletes an s_group.
add_nodes(SGroupName, Nodes): Adds a list of nodes to an s_group.
remove_nodes(SGroupName, Nodes): Removes a list of nodes from an s_group.
s_groups(): Returns a list of all s_groups known to the node.
own_s_groups(): Returns a list of s_group tuples the node belongs to.
own_nodes(): Returns a list of nodes from all s_groups the node belongs to.
own_nodes(SGroupName): Returns a list of nodes from the given s_group.
info(): Returns s_group state information.
register_name(SGroupName, Name, Pid): Registers a name in the given s_group.
re_register_name(SGroupName, Name, Pid): Re-registers a name (changes a registration) in a given s_group.
unregister_name(SGroupName, Name): Unregisters a name in the given s_group.
registered_names({node, Node}): Returns a list of all registered names on the given node.
registered_names({s_group, SGroupName}): Returns a list of all registered names in the given s_group.
whereis_name(SGroupName, Name), whereis_name(Node, SGroupName, Name): Returns the pid of a name registered in the given s_group.
send(SGroupName, Name, Msg), send(Node, SGroupName, Name, Msg): Sends a message to a name registered in the given s_group.

The name will only be registered if the process is being executed on a node that belongs to the given s_group, and neither Name nor Pid is already registered in that group:

    s_group:register_name(SGroupName, Name, Pid) -> yes | no
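For example, a hypothetical usage (ours, following the signatures in Table 1), given a process Pid on a node in an s_group named group1:

    yes = s_group:register_name(group1, pong_server, Pid),
    Pid = s_group:whereis_name(group1, pong_server),
    s_group:send(group1, pong_server, finish)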

To illustrate the impact of s_groups on scalability, we repeat the global operations experiment from Section ii (Figure 9). In the SD Erlang experiment we partition the set of nodes into s_groups, each containing ten nodes, and hence the names are replicated and synchronised on just ten nodes, and not on all nodes as in distributed Erlang. The results in Figure 11 show that with 0.01% of global operations the throughput of distributed Erlang stops growing at 40 nodes, while the throughput of SD Erlang continues to grow linearly.

Figure 11: Global operations in distributed Erlang vs SD Erlang; throughput against the number of nodes.

The connection topology of s_groups is extremely flexible: they may be organised into a hierarchy of arbitrary depth or branching, e.g., there could be multiple levels in the tree of s_groups; see Figure 12. Moreover, it is not necessary to create a hierarchy of s_groups; for example, we have constructed an Orbit implementation using a ring of s_groups.

Given such a flexible way of organising distributed systems, key questions in the design of an SD Erlang system are the following. How should s_groups be structured? Depending on the reason the nodes are grouped (reducing the number of connections, reducing the namespace, or both), s_groups can be freely structured as a tree, ring, or some other topology. How large should the s_groups be? Smaller s_groups mean more inter-group communication, but the synchronisation of the s_group state between the s_group nodes constrains the maximum size of s_groups. We have not found this constraint to be a serious restriction: for example, many s_groups are either relatively small (e.g., 10-node) internal or terminal elements in some topology, e.g., the leaves and nodes of a tree. How do nodes from different s_groups communicate? While any two nodes can communicate in an SD Erlang system, to minimise the number of connections, communication between nodes from different s_groups is typically routed via gateway nodes that belong to both s_groups. How do we avoid single points of failure? For reliability, and to minimise communication load, multiple gateway nodes and processes may be required.

Information to make these design choices is provided by the tools in Section VII, and by benchmarking. A further challenge is how to systematically refactor a distributed Erlang application into SD Erlang, and this is outlined in Section i. A detailed discussion of distributed system design and refactoring in SD Erlang is provided in a recent article [22].

We illustrate typical SD Erlang system designs by showing refactorings of the Orbit and ACO benchmarks from Section III. In both distributed and SD Erlang the computation starts on the Master node, and the actual computation is done on the Worker nodes. In the distributed Erlang version all nodes are interconnected, and messages are transferred directly from the sending node to the receiving node (Figure 2). In contrast, in the SD Erlang version, nodes are grouped into s_groups, and messages are transferred between different s_groups via Submaster nodes (Figure 13).

Figure 12: SD Erlang ACO (SR-ACO) architecture, showing the master process (Level 0), sub-master and colony processes on colony nodes (Levels 1 and 2), and ant processes, with nodes grouped into s_groups.

A fragment of code that creates an s_group on a Sub-master node is as follows:

    create_s_group(Master, GroupName, Nodes0) ->
        case s_group:new_s_group(GroupName, Nodes0) of
            {ok, GroupName, Nodes} ->
                {Master, GroupName, Nodes};
            _ -> io:format("exception message")
        end.

Similarly, we introduce s_groups in the GR-ACO benchmark from Section ii to create Scalable Reliable ACO (SR-ACO); see Figure 12. Here, apart from reducing the number of connections, s_groups also reduce the global namespace information. That is, instead of registering the name of a pid globally, i.e., with all nodes, the name is registered only on all nodes in the s_group, with s_group:register_name/3.

A comparative performance evaluation of distributed Erlang and SD Erlang Orbit and ACO is presented in Section VIII.

i2 Semi-Explicit Placement

Recall from Section iii that distributed Erlang spawns a process onto an explicitly named Erlang node, e.g.:

    spawn(some_node, fun some_module:pong/0)

and also recall the portability and programming effort issues associated with such explicit placement in large scale systems, discussed in Section ii.

To address these issues we have developed a semi-explicit placement library that enables the programmer to select nodes on which to spawn processes based on run-time information about the properties of the nodes. For example, if a process performs a lot of computation one would like to spawn it on a node with considerable computation power, or if two processes are likely to communicate frequently, then it would be desirable to spawn them on the same node, or on nodes with a fast interconnect.

We have implemented two Erlang libraries to support semi-explicit placement [60]. The first deals with node attributes, and describes properties of individual Erlang VMs and associated hosts, such as total and currently available RAM, installed software, hardware configuration, etc. The second deals with a notion of communication distances, which models the communication times between nodes in a distributed system. Therefore, instead of specifying a node, we can use the attr:choose_node/1 function to define the target node, i.e.:

    spawn(attr:choose_node(Params), fun some_module:pong/0)

Figure 13: SD Erlang (SD-Orbit) architecture.

[60] report an investigation into the communication latencies on a range of NUMA and cluster architectures, and demonstrate the effectiveness of the placement libraries using the ML-ACO benchmark on the Athos cluster.

ii S_group Semantics

For precise specification, and as a basis for validation, we provide a small-step operational semantics of the s_group operations [21]. Figure 14 defines the state of an SD Erlang system and associated abstract syntax variables. The abstract syntax variables on the left are defined as members of sets, denoted {...}, and these in turn may contain tuples, denoted (...), or further sets. In particular, nm is a process name, p a pid, ni a node_id, and nis a set of node_ids. The state of a system is modelled as a four-tuple comprising a set of s_groups, a set of free_groups, a set of free_hidden_groups, and a set of nodes. Each type of group is associated with nodes and has a namespace. An s_group additionally has a name, whereas a free_hidden_group consists of only one node, i.e., a hidden node simultaneously acts as a node and as a group, because as a group it has a namespace, but does not share it with any other node. Free normal and hidden groups have no names, and are uniquely defined by the nodes associated with them. Therefore, group names, gr_names, are either NoGroup or a set of s_group_names. A namespace is a set of name and process id (pid) pairs, and is replicated on all nodes of the associated group.

A node has the following four parameters: a node_id identifier; a node_type, that can be either hidden or normal; connections; and group_names, i.e., names of groups the node belongs to. The node can belong to either a list of s_groups or one of the free groups. The type of the free group is defined by the node type. Connections are a set of node_ids.

Transitions in the semantics have the form (state, command, ni) → (state′, value), meaning that executing command on node ni in state returns value and transitions to state′.

(grs, fgs, fhs, nds) ∈ state ≡ ({s_group}, {free_group}, {free_hidden_group}, {node})
gr ∈ grs ≡ {s_group} ≡ {(s_group_name, {node_id}, namespace)}
fg ∈ fgs ≡ {free_group} ≡ {({node_id}, namespace)}
fh ∈ fhs ≡ {free_hidden_group} ≡ {(node_id, namespace)}
nd ∈ nds ≡ {node} ≡ {(node_id, node_type, connections, gr_names)}
gs ∈ gr_names ≡ NoGroup | {s_group_name}
ns ∈ namespace ≡ {(name, pid)}
cs ∈ connections ≡ {node_id}
nt ∈ node_type ≡ Normal | Hidden
s ∈ NoGroup ∪ {s_group_name}

Figure 14: SD Erlang state [21].

The semantics is presented in more detail by [21], but we illustrate it here with the s_group:registered_names/1 function from Section i1. The function returns a list of names registered in s_group s if node ni belongs to the s_group, and an empty list otherwise (Figure 15). Here ⊕ denotes disjoint set union; IsSGroupNode returns true if node ni is a member of some s_group s, and false otherwise; and OutputNms returns the set of process names registered in the ns namespace of s_group s.

iii Semantics Validation

Figure 16: Testing s_groups using QuickCheck.

As the semantics is concrete, it can readily be made executable in Erlang, with lists replacing sets throughout. Having an executable semantics allows users to engage with it, and to understand how the semantics behaves vis à vis the library, giving them an opportunity to assess the correctness of the library against the semantics.

Better still, we can automatically assess how the system behaves in comparison with the (executable) semantics by executing them in lockstep, guided by the constraints of which operations are possible at each point. We do that by building an abstract state machine model of the library. We can then generate random sequences (or traces) through the model, with appropriate library data generated too. This random generation is supported by the QuickCheck property-based testing system [24, 9].

The architecture of the testing framework is shown in Figure 16. First, an abstract state machine, embedded as an "eqc_statem" module, is derived from the executable semantic specification. The state machine defines the abstract state representation and the transition from one state to another when an operation is applied. Test case and data generators are then defined to control the test case generation.


((grs, fgs, fhs, nds), s_group:registered_names(s), ni) → ((grs, fgs, fhs, nds), nms)   if IsSGroupNode(ni, s, grs)
((grs, fgs, fhs, nds), s_group:registered_names(s), ni) → ((grs, fgs, fhs, nds), {})   otherwise

where (s, ni ⊕ nis, ns) ⊕ grs′ ≡ grs
      nms ≡ OutputNms(s, ns)

IsSGroupNode(ni, s, grs) = ∃ nis, ns, grs′. (s, ni ⊕ nis, ns) ⊕ grs′ ≡ grs

OutputNms(s, ns) = {(s, nm) | (nm, p) ∈ ns}

Figure 15: SD Erlang semantics of the s_group:registered_names/1 function.

This includes the automatic generation of eligible s_group operations and the input data to those operations. Test oracles are encoded as the postcondition for s_group operations.

During testing, each test command is applied to both the abstract model and the s_group library. The application of the test command to the abstract model takes the abstract model from its current state to a new state, as described by the transition functions, whereas the application of the test command to the library leads the system to a new actual state. The actual state information is collected from each node in the distributed system, then merged and normalised to the same format as the abstract state representation. For a test to be successful, after the execution of a test command, the test oracles specified for this command should be satisfied. Various test oracles can be defined for s_group operations; for instance, one of the generic constraints that applies to all the s_group operations is that after each s_group operation the normalised system state should be equivalent to the abstract state.
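A property driving this lockstep testing might look as follows (an illustrative sketch using the standard Quviq eqc_statem API; the s_group_eqc model module is the one described above):

    -module(s_group_eqc_test).
    -include_lib("eqc/include/eqc.hrl").
    -include_lib("eqc/include/eqc_statem.hrl").

    %% generate random command sequences from the model, run them, and
    %% check that all postconditions (the test oracles) hold
    prop_s_group() ->
        ?FORALL(Cmds, commands(s_group_eqc),
                begin
                    {History, State, Result} = run_commands(s_group_eqc, Cmds),
                    pretty_commands(s_group_eqc, Cmds, {History, State, Result},
                                    Result =:= ok)
                end).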

Thousands of tests were run, and three kinds of errors (which have subsequently been corrected) were found. Some errors in the library implementation were found, including one error due to the synchronisation between nodes, and another related to the remove_nodes operation, which erroneously raised an exception. We also found a couple of trivial errors in the semantic specification itself, which had been missed by manual examination. Finally, we found some situations where there were inconsistencies between the semantics and the library implementation, despite their states being equivalent; an example of this was in the particular values returned by functions on certain errors. Overall, the automation of testing boosted our confidence in the correctness of the library implementation and the semantic specification. This work is reported in more detail by [54].

VI Improving VM Scalability

This section reports the primary VM and library improvements we have designed and implemented to address the scalability and reliability issues identified in Section i.


i Improvements to Erlang Term Storage

Because ETS tables are so heavily used in Erlang systems, they are a focus for scalability improvements. We start by describing their redesign, including some improvements that pre-date our RELEASE project work, i.e., those prior to Erlang/OTP R15B03. These historical improvements are very relevant for a scalability study, and form the basis for our subsequent changes and improvements. At the point when Erlang/OTP got support for multiple cores (in release R11B), there was a single reader-writer lock for each ETS table. Optional fine grained locking of hash-based ETS tables (i.e., set, bag or duplicate_bag tables) was introduced in Erlang/OTP R13B02-1, adding 16 reader-writer locks for the hash buckets. Reader groups to minimise read synchronisation overheads were introduced in Erlang/OTP R14B. The key observation is that a single count of the multiple readers must be synchronised across many cache lines, potentially far away in a NUMA system. Maintaining reader counts in multiple (local) caches makes reads fast, although writes must now check every reader count. In Erlang/OTP R16B the number of bucket locks and the default number of reader groups were both upgraded from 16 to 64.
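The locking machinery described above is controlled per table through standard ETS options in stock Erlang/OTP; for example:

    %% Enable reader-group based read optimisation and fine-grained
    %% write locking on a hash-based table.
    T = ets:new(my_table, [set, public,
                           {read_concurrency, true},
                           {write_concurrency, true}])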

We illustrate the scaling properties of the ETS concurrency options using the ets_bench BenchErl benchmark on the Intel NUMA machine with 32 hyperthreaded cores specified in Appendix A. The ets_bench benchmark inserts 1M items into the table, then records the time to perform 17M operations, where an operation is either a lookup, an insert, or a delete. The experiments are conducted on a hash-based (set) ETS table with different percentages of update operations, i.e., insertions or deletions.

Figure 17: Runtime of 17M ETS operations in Erlang/OTP releases R11B-5, R13B02-1, R14B and R16B, with 10% updates (left) and 1% updates (right). The y-axis shows time in seconds (lower is better) and the x-axis is the number of OS threads (VM schedulers).

Figure 17 shows the runtimes in seconds of 17M operations in different Erlang/OTP versions, varying the number of schedulers (x-axis), reflecting how the scalability of ETS tables has improved in more recent Erlang/OTP releases. Figure 18 shows the runtimes in seconds of 17M operations on an ETS table with different numbers of reader groups, again varying the number of schedulers. We see that one reader group is not sufficient with 10% updates, nor are two with 1% updates. Beyond that, different numbers of reader groups have little impact on the benchmark performance, except that using 64 groups with 10% updates slightly degrades performance.

Figure 18: Runtime of 17M ETS operations with varying numbers of reader groups (1 to 64), with 10% updates (left) and 1% updates (right). The y-axis shows runtime in seconds (lower is better) and the x-axis is the number of schedulers.

We have explored four other extensions or redesigns of the ETS implementation for better scalability: (1) allowing more programmer control over the number of bucket locks in hash-based tables, so that the programmer can reflect the number of schedulers and the expected access pattern; (2) using contention-adapting trees to get better scalability for ordered_set ETS tables, as described by [69]; (3) using queue delegation locking to improve scalability [47]; and (4) adopting schemes for completely eliminating the locks in the meta table. A more complete discussion of our work on ETS can be found in the papers by [69] and [47].

Figure 19: Throughput (operations/microsecond) of CA tree variants, set and ordered_set against the number of threads, with 10% updates (left) and 1% updates (right).

Here we outline only our work on contention-adapting (CA) trees. A CA tree monitors contention in different parts of a tree-shaped data structure, introducing routing nodes with locks in response to high contention, and removing them in response to low contention. For experimental purposes, two variants of the CA tree have been implemented to represent ordered_sets in the virtual machine of Erlang/OTP 17.0: one extends the existing AVL trees in the Erlang VM, and the other uses a Treap data structure [5]. Figure 19 compares the throughput of the CA tree variants with that of ordered_set and set as the number of schedulers increases. It is unsurprising that the CA trees scale so much better than an ordered_set, which is protected by a single readers-writer lock. It is more surprising that they also scale better than set. This is due to hash tables using fine-grained locking at a fixed granularity, while CA trees can adapt the number of locks both to the current contention level and to the parts of the key range where contention is occurring.


ii Improvements to Schedulers

In the Erlang VM, a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until it either blocks waiting for input, typically a message from some other process, or until it has executed its quota of reductions.

The Erlang VM is usually started with one scheduler per logical core (SMT-thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned it is placed in the run queue of the scheduler of its parent, and it waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores, that is, an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Figure iii.2.

The default load management mechanism is load compaction, which aims to keep as many scheduler threads as possible fully loaded with work, i.e., it attempts to ensure that scheduler threads do not run out of work. We have developed a new optional scheduler utilisation balancing mechanism that is available from Erlang/OTP 17.0. The new mechanism aims to balance scheduler utilisation between schedulers; that is, it will strive for equal scheduler utilisation on all schedulers.
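(In stock Erlang/OTP releases the mechanism is disabled by default, in favour of load compaction, and is enabled by starting the VM with the +sub true command-line flag.)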

The scheduler utilisation balancing mechanism has no performance impact on the system when not enabled. On the other hand, when enabled, it results in changed timing in the system; normally there is a small overhead due to measuring of utilisation and calculating balancing information, which depends on the underlying primitives provided by the operating system.

The new balancing mechanism results in a better distribution of processes to schedulers, reducing the probability of core contention. Together with other VM improvements, such as interruptable BIFs and garbage collection, it results in lower latency and improved responsiveness, and hence reliability, for soft real-time applications.

iii Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel benchmark in Section i. The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in that returns "Erlang system time", or time since Epoch, with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, i.e., not SMT-enabled. The documentation promises that values returned by it are strictly increasing, and many applications ended up relying on this. For example, applications often employ erlang:now/0 to generate unique integers.
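For that particular use case, Erlang/OTP 18 provides a dedicated, scalable replacement in the stock API:

    %% Strictly monotonically increasing unique integers, without the
    %% global lock of erlang:now/0.
    Unique = erlang:unique_integer([monotonic, positive])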

Erlang system time should align with the operating system's view of time since Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot, without invalidating the strictly increasing value guarantee. The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when these two have deviated, e.g., in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], i.e., a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards and backwards, while system time is allowed to do this. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.
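In code, the new API recovers system time as monotonic time plus the offset (stock Erlang/OTP 18 API; the final unit conversion is illustrative):

    Mono    = erlang:monotonic_time(),   % may be negative; never leaps
    Offset  = erlang:time_offset(),      % Erlang system time minus monotonic time
    SysTime = Mono + Offset,             % Erlang system time, in native units
    Millis  = erlang:convert_time_unit(SysTime, native, milli_seconds)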

Time Retrieval. Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe, but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, monotonic time delivered by the operating system is solely based on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP); that is, they will run with different frequencies. Linux is an exception, with a monotonic clock that is NTP-adjusted and runs with the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing monotonic time and system time delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute. If the adjustment needs to be changed, new adjustment information is published and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the time adjustment information previously published, and calculates Erlang monotonic time. To preserve monotonicity, it is important that all threads that read the same OS monotonic time map this to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information, using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed. This means that in the vast majority of cases the RW lock will be read-locked, which allows multiple readers to run concurrently. To prevent bouncing the lock cache-line, we use a bespoke reader-optimised RW lock implementation, where reader threads notify about their presence in counters on separate cache-lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11]; alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].

Timer Wheel and BIF Timer. The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel, which is used by processes executing on the scheduler.

The implementation of timers in Erlang/OTP uses a built-in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS tables, and only insert one timer at a time into the timer wheel.

Benchmarks. We have measured several benchmarks on a 16-core Bulldozer machine with eight dual CPU AMD Opteron 4376 HEs.5 We present three of them here.

The first micro benchmark compares the execution time of an Erlang receive with that of a receive ... after that specifies a timeout and provides a default value. The receive ... after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, total execution time with the optimised timers is only 5% longer than without timers.
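The two forms compared by this micro benchmark are sketched below (ours; the timeout value and default_value atom are illustrative):

    wait_plain() ->
        receive Msg -> Msg end.

    wait_with_timeout() ->
        receive
            Msg -> Msg
        after 1000 ->        % sets a timer on block, cancelled on arrival
            default_value
        end.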

The second micro benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4 and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on the left shows that (1) the performance of time management has remained roughly unchanged between Erlang/OTP releases prior to 18.0; (2) the improved time management in Erlang/OTP 18.x makes time management less likely to be a scalability bottleneck even when using erlang:now/0; and (3) the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution. The graph on the right side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

5 See §2.5.4 of the RELEASE project Deliverable 2.4 (http://release-project.eu/documents/D24.pdf).

Figure 20: BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases: runtimes (left) and speedup obtained using erlang:monotonic_time/0 (right).

VII Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SDMon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable, to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling and debugging Erlang systems, which can be used separately or together, as appropriate for solving the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on or specialise the services offered by the Erlang virtual machine, through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer6 application gives a comprehensive overview of many of these data on a node-by-node basis.

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature. That is, the tools support the language itself rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use tracing support provided by the Erlang VM, so other actor frameworks, e.g., Akka, can use the Kamon7 JVM monitoring system. Similarly, tools for other actor languages or frameworks could use data derived through OS-level tracing frameworks, DTrace8 and SystemTap9 probes, as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

6 http://www.erlang.org/doc/apps/observer
7 http://kamon.io
8 http://dtrace.org/blogs/about
9 https://sourceware.org/systemtap/wiki

i Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare it for modification or extension, or, as is the case here, in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support in a refactoring tool. There are a number of tools that support refactoring in Erlang: in the RELEASE project we have chosen to extend Wrangler10 [56]; other tools include Tidier [68] and RefactorErl [44].

10 http://www.cs.kent.ac.uk/projects/wrangler

Supporting API Migration. The SD Erlang libraries modify Erlang's global_group library, which becomes the new s_group library; as a result, Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolves and this often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically. The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work, and the technicalities of the implementation, can be found in a paper by [52].

8 http://dtrace.org/blogs/about
9 https://sourceware.org/systemtap/wiki
10 http://www.cs.kent.ac.uk/projects/wrangler
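To give the flavour of the approach, a hypothetical adaptor module for part of the global_group-to-s_group migration might look as follows; the adaptor conventions and the s_group signatures shown are illustrative assumptions, not the released interfaces:

    %% Each function carries the name and arity of an old global_group
    %% function, and defines the old behaviour in terms of the new
    %% s_group API; Wrangler generates the client-side refactoring
    %% from definitions of this shape.
    -module(global_group_adaptor).
    -export([registered_names/1, whereis_name/1]).

    registered_names({node, Node}) ->
        s_group:registered_names({node, Node}).

    whereis_name(Name) ->
        %% my_group is a made-up s_group name standing in for whatever
        %% group the migrated application actually uses.
        s_group:whereis_name(my_group, Name).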

Support for Introducing Parallelism. We have introduced support for parallelising explicit list operations (map and foreach); for process introduction, to complete a computationally intensive task in parallel; for introducing a worker process to deal with call handling in an Erlang "generic server"; and for parallelising a tail-recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach. The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward; even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
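The essence of the transformation is small. Assuming para_lib mirrors the lists module (the function name pmap is an assumption about its interface), the refactoring rewrites

    Squares = lists:map(fun(X) -> X * X end, lists:seq(1, 1000)).

to

    Squares = para_lib:pmap(fun(X) -> X * X end, lists:seq(1, 1000)).

A self-contained parallel map of the kind such a library could provide, spawning one process per element and collecting results in order, is:

    pmap(F, Xs) ->
        Parent = self(),
        Pids = [spawn(fun() -> Parent ! {self(), F(X)} end) || X <- Xs],
        [receive {Pid, R} -> R end || Pid <- Pids].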

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute a task in parallel with its parent process. The result of the new process is sent back to the parent process, which consumes it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.
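The shape of the refactored code is roughly the following sketch, where expensive_task/1, do_unrelated_work/0 and combine/2 are hypothetical placeholders rather than Wrangler's literal output:

    handle(Input) ->
        Parent = self(),
        Pid = spawn(fun() -> Parent ! {self(), expensive_task(Input)} end),
        OtherWork = do_unrelated_work(),     % runs while the task executes
        Result = receive {Pid, R} -> R end,  % placed just before first use
        combine(OtherWork, Result).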

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot because of data dependencies; for instance, a function might recurse over a list while accumulating results in an accumulator variable. In such a situation it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are checked by program slicing, which is discussed below.

Support for Program Slicing. Program slicing is a general program analysis technique for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE11 [17], a tool developed in another EU project that also re-uses the Wrangler front end. That work concentrates on skeleton introduction, as does some of ours, but we go further in using static analysis and slicing to transform programs to make them suitable for the introduction of parallel structures.

11 http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf


Figure 21: The architecture of WombatOAM.

ii Scalable Deployment

We have developed the WombatOAM tool12 to provide a deployment, operations and maintenance framework for large-scale distributed Erlang systems. These systems typically consist of a number of Erlang nodes executing on different hosts; the hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP.

12 WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd.

Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this remains the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture. The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability because of the role played by the central Master node; in the current version an additional layer of Middle Managers was introduced to allow the system to scale easily to thousands of deployed nodes. As the diagram shows, the "northbound" interfaces to the web dashboard and the command line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms, and so forth; we describe these now.

Services. WombatOAM is designed to collect, store and display various kinds of information and events from running Erlang systems, and these data are accessed and managed through an AJAX-based Web Dashboard. The services include the following.

Metrics. WombatOAM supports the collection, on a regular basis, of some hundred metrics, including, for instance, the number of processes on a node and the message traffic in and out of the node, from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom13, and interface with other tools, such as Graphite14, which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications. As well as metrics, WombatOAM supports event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which is part of the standard distribution, or the lager logging framework15, and will be displayed and logged as they occur.

13 Folsom collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom
14 https://graphiteapp.org

Alarms. Alarms are more complex entities. Alarms have a state: they can be raised and, once dealt with, they can be cleared. They also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as for notifications; a small example of raising and clearing a SASL alarm appears after this list of services.

Topology. The Topology service handles adding, deleting and discovering nodes. It also monitors whether nodes are accessible; if not, it notifies the other services and periodically tries to reconnect, and when the nodes are available again it again notifies the other services. This service doesn't have a middle manager part, because it doesn't talk to the nodes directly; instead, it asks the Node manager service to do so.

Node manager. This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration. This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers the Orchestration service uses an external library called Libcloud16, for which Erlang Solutions has written an open source Erlang wrapper, called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications; it provides infrastructure for deploying them.

15 https://github.com/basho/lager
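As promised above, a minimal sketch of the alarm mechanism: an application can raise and clear alarms through OTP's standard SASL alarm service, which is one source of the alarms WombatOAM collects. The alarm identifier and description here are made up for illustration, and SASL must be running:

    %% Raise an alarm; the identifier lets the same alarm be raised on
    %% several nodes, and each instance be cleared individually.
    ok = alarm_handler:set_alarm({{disk_almost_full, node()},
                                  "less than 5% free"}),
    %% ... later, once the condition has been dealt with ...
    ok = alarm_handler:clear_alarm({disk_almost_full, node()}).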

Deployment. The deployment mechanism consists of the following five steps.

1. Registering a provider. WombatOAM provides the same interface for different cloud providers that support the OpenStack standard or the Amazon EC2 API; it also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this enables WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family: the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

16 The unified cloud API [4]: https://libcloud.apache.org

4. Defining a deployment domain. At this step a deployment domain is created, specifying (i) which providers should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to, or removed from, the system depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and in adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single-host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with each core manages its own run queue and processes migrate between the queues through a work-stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes17.

iii.1 Percept2

Percept218 builds on the existing Percept tool to provide post hoc, offline analysis and visualisation of Erlang systems.

17 This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.
18 https://github.com/RefactoringTools/percept2


Figure 22: Offline profiling of a distributed Erlang application using Percept2.

Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang's built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling a distributed Erlang application using Percept2 is shown in Figure 22.
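A typical session therefore has three phases, sketched below in a Percept-style API (percept2:profile/3, percept2:analyze/1, percept2:start_webserver/1); the entry point my_app:main/0 and the option atoms are placeholders, so consult the Percept2 documentation for the exact names:

    %% 1. Run the program under trace, writing events to a file.
    percept2:profile("my_app.dat", {my_app, main, []}, [message, scheduler]),
    %% 2. Analyse the trace file into the RAM database.
    percept2:analyze(["my_app.dat"]),
    %% 3. Browse the results at http://localhost:8888/.
    percept2:start_webserver(8888).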

Percept generates an application-level, zoomable concurrency graph showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways, as detailed by [53], including most importantly:

• Distinguishing between running and runnable time for each process: this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including the migration history of a process between run queues; statistics about message passing between processes (the number of messages, and the average message size, sent/received by a process); the accumulated runtime per process; and the accumulated time each process spends in the running state.

• Presenting the process tree: the hierarchy structure indicating the parent-child relationships between processes.

Figure 23: Percept2 showing processes running and runnable (with 4 schedulers/run queues).

• Recording the dynamic function callgraph, with call counts and times: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in each function.

• Tracing of s_group activities in a distributed system. Unlike global group, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have also extended Percept2 to allow the profiling of s_group-related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating measurement of a function's own execution time frees users from profiling all the function calls invoked during the program execution; for example, they can choose to profile only functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph, a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for that function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of generated web pages. Together these make the system more responsive and more scalable.

iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, requiring neither specially recompiled versions of the software to be examined nor special post-processing tools to create meaningful information from the gathered data. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, and so on. Figure 24 visualises the results of such monitoring: it shows how the sizes of the run queues vary during execution of the bang BenchErl benchmark on a VM with 16 schedulers.
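A coarse, in-VM approximation of what these probes expose can be obtained from the standard statistics API (available from Erlang/OTP 18.3); the sketch below merely samples the per-scheduler run-queue lengths once a second, whereas the DTrace/SystemTap probes capture far finer events, such as individual lock-acquisition attempts:

    sample_run_queues(0) -> ok;
    sample_run_queues(N) ->
        %% one integer per scheduler run queue
        Lengths = erlang:statistics(run_queue_lengths),
        io:format("run queues: ~w~n", [Lengths]),
        timer:sleep(1000),
        sample_run_queues(N - 1).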

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the same storage infrastructure and presentation facilities.

Figure 24: DTrace run-queue visualisation of the bang benchmark with 16 schedulers (time on X-axis).

iii.3 Devo

Devo19 [10] is designed to provide real-time, online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, which are instrumented through the trace tool builder (ttb).
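The kind of ttb session such a visualisation back-end can sit on top of is sketched below, using the standard OTP trace tool builder API; the node names are made up:

    %% Trace message sends and scheduling events on two nodes,
    %% logging to files; Devo-style tools consume such traces live.
    {ok, _} = ttb:tracer(['node1@host1', 'node2@host2'],
                         [{file, "devo_trace"}]),
    {ok, _} = ttb:p(all, [send, running]),
    %% ... run the system under observation ...
    ttb:stop([format]).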

Figure 25 shows visualisations from Devo in both modes. On the left-hand side a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram) with six cores each; with hyperthreading this gives twelve virtual cores per chip, and hence 24 run queues in total. The size of these run queues is shown by both the colour and the height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle. A green arc shows migration on the same physical core, a grey one migration on the same chip, and a blue one migration between the two chips.

19 https://github.com/RefactoringTools/devo


Figure 25: Devo low- and high-level visualisations of SD-Orbit.

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between those nodes (green quiescent, red busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. The tracing to be performed on monitored nodes is also specified within the configuration file.
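A configuration of this shape could be written as a file of Erlang terms, as in the following hypothetical example; the actual SD-Mon schema may differ, and the hosts, nodes and trace flags here are invented:

    {hosts, ["host1.example.org", "host2.example.org"]}.
    {s_groups, [{group1, ['node1@host1', 'node2@host1']},
                {group2, ['node3@host2']}]}.
    {free_nodes, ['node4@host2']}.
    %% trace flags to apply to the members of each partition
    {tracing, [{group1, [send, 'receive']},
               {group2, [running]}]}.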

An agent is started by a master SD-Mon node for each s_group and for each free node. The configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up it tries to get in contact with its nodes and applies the tracing to them, as instructed by the master. Binary files are stored in the host file system. Tracing is used internally to track s_group operations happening at runtime; an asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

Figure 26: SD-Mon architecture.

The monitoring network is also supervised, in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected, that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as for post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution, before (top) and after eliminating s_group 2 (bottom).


Figure 28: SD-Mon online monitoring.

VIII Systemic Evaluation

Preceding sections have investigated the improvements of individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI and VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state-of-the-art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable20. The bulk of the experiments reported here are conducted on the Athos cluster using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit performs a fixed-size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure.

20 See Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf


Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling].

The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impact of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot standard deviation. Results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups: beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails because some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5,000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.


Figure 30: Wombat Deployment Time (experimental data, with best fit 0.0125N + 83).

The measurement data shows two important facts: WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of the effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, while SR-ACO registers them only within an s_group (Section ii).

For both GR-ACO and SR-ACO a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common X86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark and obtained similar results [23].
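The core of such a Chaos Monkey service is tiny; the sketch below kills one randomly chosen local process each second (illustrative only: a real harness must avoid killing vital system processes and the monkey itself, and the rand module requires Erlang/OTP 18 or later):

    chaos_loop() ->
        timer:sleep(1000),
        Victims = processes(),
        Victim = lists:nth(rand:uniform(length(Victims)), Victims),
        exit(Victim, kill),
        chaos_loop().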

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves


Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling].

the distributed Erlang reliability model. The remainder of the section outlines the impact that maintaining the recovery information required for reliability has on scalability.

Scalability. Figure 31 compares the runtimes of the ML, GR and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and a fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks, and others, that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration at all.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang


Figure 32: Number of sent packets in ML-ACO, GR-ACO and SR-ACO.

preserves the distributed Erlang language-level reliability model.

As SD-Orbit scales better than D-Orbit, as SR-ACO scales better than ML-ACO, and as SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limits of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language, while preserving reliability. The work is a vade mecum for addressing scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering the VM, language and persistent storage levels. We have developed the BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15] or Legion [40]. We establish scientifically the folklore limitation of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups, for partitioning the network and global process namespace, and semi-explicit process placement, for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).
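To recap the flavour of the constructs, the sketch below partitions three nodes into an s_group and registers a name that is visible only within that group. The function names follow the SD Erlang papers (s_group:new_s_group/2, s_group:register_name/3, s_group:whereis_name/2), but the node names are made up, and the exact signatures and return values should be checked against the released libraries:

    SubMaster = spawn('sub1@host2', aco, submaster_loop, []),
    s_group:new_s_group(aco_group,
                        ['master@host1', 'sub1@host2', 'sub2@host3']),
    s_group:register_name(aco_group, submaster1, SubMaster),
    %% resolvable by group members only, unlike global:whereis_name/1
    SubMaster = s_group:whereis_name(aco_group, submaster1).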

To improve the scalability of the Erlang VM and libraries we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large-scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately-sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

Table 2: Cluster Specifications (RAM per host in GB)

  Name    Hosts  Cores/host  Cores total  Cores avail  Processor                                  RAM     Interconnection
  GPG     20     16          320          320          Intel Xeon E5-2640v2, 8C, 2GHz             64      10GB Ethernet
  Kalkyl  348    8           2784         1408         Intel Xeon 5520v2, 4C, 2.26GHz             24-72   InfiniBand 20Gb/s
  TinTin  160    16          2560         2240         AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz   64-128  2:1 oversubscribed QDR Infiniband
  Athos   776    24          18624        6144         Intel Xeon E5-2697v2, 12C, 2.7GHz          64      Infiniband FDR14

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large-scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages21. For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3GHz, each with 16 "Bulldozer" cores, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

21 See Deliverable 6.7 (http://www.release-project.eu/documents/D67.pdf), available online.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58-67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540-545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68-75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33-42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 2-10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level Visualization of Concurrent and Distributed Computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riakdocs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069-1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207-216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördos, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Koszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104-121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157-166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22-34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördos, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang '16, pages 33-41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268-279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341-350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107-113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI '06, pages 133-140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go Programming Language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13-22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118-129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89-144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Technical Report, Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 20-20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43-49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73-74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51-59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39-45, 1997.

[41] Carl Hewitt. Actor model for discretionary adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI '73, pages 235-245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS '08, pages 1:1-1:1, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. In ACM SIGPLAN Notices, volume 28, pages 91-108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15-26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572-583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak Core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1-14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35-40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.

[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE '12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lübeck and Max Neunhöffer. Enumerating large Orbits and direct condensation. Experimental Mathematics, 10(2):197-205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12-23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27-38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207-221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103-104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1-12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11-20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011-2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP '09), pages 13-24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proceedings of the ACM SIGPLAN Workshop on Erlang, pages 3-11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215-224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing, 28th International Workshop, volume 9519 of LNCS, pages 37-53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP '08, pages 104-128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop: The Definitive Guide. Storage and Analysis at Internet Scale (3rd edition, revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-Strength Functional Programming: Experiences with the Ericsson AXD301 Project. In Implementing Functional Languages (IFL '00), Aachen, Germany, September 2000. Springer-Verlag.


II Context

i Scalable Reliable Programming Models

There is a plethora of shared memory concurrent programming models like PThreads or Java threads, and some models, like OpenMP [20], are simple and high level. However, synchronisation costs mean that these models generally do not scale well, often struggling to exploit even 100 cores. Moreover, reliability mechanisms are greatly hampered by the shared state: for example, a lock becomes permanently unavailable if the thread holding it fails.

The High Performance Computing (HPC) community build large-scale (10^6 core) distributed memory systems using the de facto standard MPI communication libraries [72]. Increasingly these are hybrid applications that combine MPI with OpenMP. Unfortunately MPI is not suitable for producing general purpose concurrent software as it is too low level, with explicit message passing. Moreover, the most widely used MPI implementations offer no fault recovery¹: if any part of the computation fails, the entire computation fails. Currently the issue is addressed by using what is hoped to be highly reliable computational and networking hardware, but there is intense research interest in introducing reliability into HPC applications [33].

Server farms use commodity computational and networking hardware, and often scale to around 10^5 cores, where host failures are routine. They typically perform rather constrained computations, e.g. Big Data Analytics, using reliable frameworks like Google MapReduce [26] or Hadoop [78]. The idempotent nature of the analytical queries makes it relatively easy for the frameworks to provide implicit reliability: queries are monitored and failed queries are simply re-run. In contrast, actor languages like Erlang are used to engineer reliable general purpose computation, often recovering failed stateful computations.

¹Some fault tolerance is provided in less widely used MPI implementations like [27].

ii Actor Languages

The actor model of concurrency consists of independent processes communicating by means of messages sent asynchronously between processes. A process can send a message to any other process for which it has the address (in Erlang, the "process identifier" or pid), and the remote process may reside on a different host. While the notion of actors originated in AI [42], it has been used widely as a general metaphor for concurrency, as well as being incorporated into a number of niche programming languages in the 1970s and 80s. More recently it has come back to prominence through the rise of not only multicore chips but also larger-scale distributed programming in data centres and the cloud.

With built-in concurrency and data isolation, actors are a natural paradigm for engineering reliable scalable general-purpose systems [1, 41]. The model has two main concepts: actors, which are the unit of computation, and messages, which are the unit of communication. Each actor has an address-book that contains the addresses of all the other actors it is aware of. These addresses can be either locations in memory, direct physical attachments, or network addresses. In a pure actor language, messages are the only way for actors to communicate.

After receiving a message an actor can do the following: (i) send messages to another actor from its address-book; (ii) create new actors; or (iii) designate a behaviour to handle the next message it receives. The model does not impose any restrictions on the order in which these actions must be taken. Similarly, two messages sent concurrently can be received in any order. These features enable actor based systems to support indeterminacy and quasi-commutativity, while providing locality, modularity, reliability and scalability [41].

Actors are just one message-based paradigm, and other languages and libraries have related message passing paradigms. Recent example languages include Go [28] and Rust [62], which provide explicit channels similar to actor mailboxes. Probably the most famous message passing library is MPI [72], with APIs for many languages, and widely used on clusters and High Performance Computers. It is, however, arguable that the most important contribution of the actor model is the one-way asynchronous communication [41]. Messages are not coupled with the sender, and neither are they transferred synchronously to a temporary container where transmission takes place, e.g. a buffer, a queue, or a mailbox. Once a message is sent, the receiver is the only entity responsible for that message.

Erlang [6, 19] is the pre-eminent programming language based on the actor model, having a history of use in production systems, initially with its developer Ericsson, and then more widely through open source adoption. There are now actor frameworks for many other languages; these include Akka for C#, F# and Scala [64], CAF² for C++, Pykka³, Cloud Haskell [31], PARLEY [51] for Python, and Termite Scheme [35], and each of these is currently under active use and development. Moreover, the recently defined Rust language [62] has a version of the actor model built in, albeit in an imperative context.

iii Erlang's Support for Concurrency

In Erlang, actors are termed processes, and virtual machines are termed nodes. The key elements of the actor model are: fast process creation and destruction; lightweight processes, e.g. enabling 10^6 concurrent processes on a single host with 8GB RAM; fast asynchronous message passing with copying semantics; process monitoring; strong dynamic typing; and selective message reception.

²http://actor-framework.org
³http://pykka.readthedocs.org/en/latest

By default Erlang processes are addressed by their process identifier (pid), e.g.

Pong_PID = spawn(fun some_module:pong/0)

spawns a process to execute the anonymous function given as argument to the spawn primitive, and binds Pong_PID to the new process identifier. Here the new process will execute the pong/0 function, which is defined in some_module. A subsequent call

Pong_PID ! finish

sends the message finish to the process identified by Pong_PID. Alternatively, processes can be given names using a call of the form

register(my_funky_name, Pong_PID)

which registers this process name in the node's process name table if not already present. Subsequently, these names can be used to refer to or communicate with the corresponding processes (e.g. send them a message):

my_funky_name ! hello
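Putting these fragments together, a minimal sketch of the pong process used above might look as follows (our illustration, not code from the article's benchmarks):

-module(some_module).
-export([pong/0]).

%% A minimal server loop: replies to ping requests and stops on finish.
pong() ->
    receive
        {ping, From} ->
            From ! pong,
            pong();
        finish ->
            ok
    end.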

A distributed Erlang system executes on multiple nodes, and the nodes can be freely deployed across hosts, e.g. they can be located on the same or different hosts. To help make distribution transparent to the programmer, when any two nodes connect they do so transitively, sharing their sets of connections. Without considerable care this quickly leads to a fully connected graph of nodes. A process may be spawned on an explicitly identified node, e.g.

Remote_Pong_PID = spawn(some_node,
                        fun some_module:pong/0)

After this, the remote process can be addressed just as if it were local. It is a significant burden on the programmer to identify the remote nodes in large systems, and we will return to this in Sections ii and i.2.


Figure 1: Conceptual view of Erlang's concurrency, multicore support, and distribution.

iv Scalability and Reliability in Erlang Systems

Erlang was designed to solve a particular set of problems, namely those in building telecommunications infrastructure, where systems need to be scalable to accommodate hundreds of thousands of calls concurrently, in soft real-time. These systems need to be highly-available and reliable, i.e. to be robust in the case of failure, which can come from software or hardware faults. Given the inevitability of the latter, Erlang adopts the "let it fail" philosophy for error handling: that is, encourage programmers to embrace the fact that a process may fail at any point, and have them rely on the supervision mechanism, discussed shortly, to handle the failures.

Figure 1 illustrates Erlang's support for concurrency, multicores and distribution. Each Erlang node is represented by a yellow shape, and each rectangle represents a host with an IP address. Each red arc represents a connection between Erlang nodes. Each node can run on multiple cores, and exploit the inherent concurrency provided. This is done automatically by the VM, with no user intervention needed. Typically each core has an associated scheduler that schedules processes; a new process will be spawned on the same core as the process that spawns it, but work can be moved to a different scheduler through a work-stealing allocation algorithm. Each scheduler allows a process that is ready to compute at most a fixed number of computation steps before switching to another. Erlang built-in functions, or BIFs, are implemented in C, and at the start of the project were run to completion once scheduled, causing performance and responsiveness problems if the BIF had a long execution time.

Scaling in Erlang is provided in two different ways. It is possible to scale within a single node by means of the multicore virtual machine exploiting the concurrency provided by the multiple cores or NUMA nodes. It is also possible to scale across multiple hosts using multiple distributed Erlang nodes.

Reliability in Erlang is multi-faceted. As in all actor languages, each process has private state, preventing a failed or failing process from corrupting the state of other processes. Messages enable stateful interaction, and contain a deep copy of the value to be shared, with no references (e.g. pointers) to the sender's internal state. Moreover, Erlang avoids type errors by enforcing strong typing, albeit dynamically [7]. Connected nodes check liveness with heartbeats, and can be monitored from outside Erlang, e.g. by an operating system process.

However, the most important way to achieve reliability is supervision, which allows a process to monitor the status of a child process and react to any failure, for example by spawning a substitute process to replace a failed process. Supervised processes can in turn supervise other processes, leading to a supervision tree. The supervising and supervised processes may be in different nodes, and on different hosts, and hence the supervision tree may span multiple hosts or nodes.

To provide reliable distributed service registration, a global namespace is maintained on every node, which maps process names to pids. It is this that we mean when we talk about a 'reliable' system: it is one in which a named process in a distributed system can be restarted without requiring the client processes also to be restarted (because the name can still be used for communication).

To see global registration in action, consider a pong server process:

global:register_name(pong_server,
                     Remote_Pong_PID)

Clients of the server can send messages to the registered name, e.g.

global:whereis_name(pong_server) ! finish

If the server fails, the supervisor can spawn a replacement server process with a new pid and register it with the same name (pong_server). Thereafter, client messages to the pong_server will be delivered to the new server process. We return to discuss the scalability limitations of maintaining a global namespace in Section V.
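The recovery pattern just described can be sketched with a hand-rolled monitor; in production code one would use an OTP supervisor instead, and some_module:pong/0 is the illustrative server loop from earlier in this section:

%% A minimal sketch of name-based recovery (ours, not the article's code).
supervise() ->
    Pid = spawn(fun some_module:pong/0),
    yes = global:register_name(pong_server, Pid),
    Ref = erlang:monitor(process, Pid),
    receive
        {'DOWN', Ref, process, Pid, _Reason} ->
            %% The server died, and global has dropped its name:
            %% spawn and re-register a replacement. Clients keep
            %% using the name pong_server, so they need no restart.
            supervise()
    end.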

v ETS: Erlang Term Storage

Erlang is a pragmatic language, and the actor model it supports is not pure: Erlang processes, besides communicating via asynchronous message passing, can also share data in public memory areas called ETS tables.

The Erlang Term Storage (ETS) mechanism is a central component of Erlang's implementation. It is used internally by many libraries and underlies the in-memory databases. ETS tables are key-value stores: they store Erlang tuples where one of the positions in the tuple serves as the lookup key. An ETS table has a type that may be either set, bag or duplicate_bag, implemented as a hash table, or ordered_set, which is currently implemented as an AVL tree. The main operations that ETS supports are table creation, insertion of individual entries and atomic bulk insertion of multiple entries in a table, deletion and lookup of an entry based on some key, and destructive update. The operations are implemented as C built-in functions in the Erlang VM.

The code snippet below creates a set ETS table keyed by the first element of the entry, atomically inserts two elements with keys some_key and 42, updates the value associated with the table entry with key 42, and then looks up this entry:

Table = ets:new(my_table,
                [set, public, {keypos, 1}]),
ets:insert(Table, [{some_key, an_atom_value},
                   {42, {a, tuple, value}}]),
ets:update_element(Table, 42,
                   [{2, {another, tuple, value}}]),
[{Key, Value}] = ets:lookup(Table, 42)

ETS tables are heavily used in many Erlang applications. This is partly due to the convenience of sharing data for some programming tasks, but also partly due to their fast implementation. As a shared resource, however, ETS tables induce contention and become a scalability bottleneck, as we shall see in Section i.

III Benchmarks for scalability and reliability

The two benchmarks that we use throughout this article are Orbit, which measures scalability without looking at reliability, and Ant Colony Optimisation (ACO), which allows us to measure the impact on scalability of adding global namespaces to ensure reliability. The source code for the benchmarks, together with more detailed documentation, is available at https://github.com/release-project/benchmarks. The RELEASE project team also worked to improve the reliability and scalability of other Erlang programs, including a substantial (approximately 150K lines of Erlang code) Sim-Diasca simulator [16] and an Instant Messenger that is more typical of Erlang applications [23], but we do not cover these systematically here.

i Orbit

Orbit is a computation in symbolic algebra, which generalises a transitive closure computation [57]. To compute the orbit for a given space [0, X], a list of generators g1, g2, ..., gn are applied on an initial vertex x0 ∈ [0, X]. This creates new values (x1, ..., xn) ∈ [0, X], where xi = gi(x0). The generator functions are applied on the new values until no new value is generated.

Figure 2: Distributed Erlang Orbit (D-Orbit) architecture: workers are mapped to nodes.

Orbit is a suitable benchmark because it has a number of aspects that characterise a class of real applications. The core data structure it maintains is a set, and in distributed environments it is implemented as a distributed hash table (DHT), similar to the DHTs used in replicated form in NoSQL database management systems. Also, in distributed mode, it uses standard peer-to-peer (P2P) techniques like a credit-based termination algorithm [61]. By choosing the orbit size, the benchmark can be parameterised to specify smaller or larger computations that are suitable to run on a single machine (Section i) or on many nodes (Section i). Moreover, it is only a few hundred lines of code.

As shown in Figure 2, the computation is initiated by a master which creates a number of workers. In the single node scenario of the benchmark, workers correspond to processes, but these workers can also spawn other processes to apply the generator functions on a subset of their input values, thus creating intra-worker parallelism. In the distributed version of the benchmark, processes are spawned by the master node to worker nodes, each maintaining a DHT fragment. A newly spawned process gets a share of the parent's credit, and returns this on completion. The computation is finished when the master node collects all credit, i.e. all workers have completed.
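The credit-splitting idea can be illustrated with a small sketch; this is our own simplification, not the benchmark's code, and new_values/1 is a hypothetical helper standing in for generator application. Credit is kept as an integer so that splitting never loses units to rounding:

%% Hedged sketch of credit-based termination (not the real benchmark).
worker(Master, Credit, Value) ->
    case new_values(Value) of
        [] ->
            Master ! {done, Credit};   % leaf: return all our credit
        New ->
            Share = Credit div (length(New) + 1),
            [spawn(fun() -> worker(Master, Share, V) end) || V <- New],
            %% Return the remainder, so total credit is conserved.
            Master ! {done, Credit - Share * length(New)}
    end.

%% The master is finished when all credit has been returned.
master_loop(Total, Collected) when Collected =:= Total -> finished;
master_loop(Total, Collected) ->
    receive {done, C} -> master_loop(Total, Collected + C) end.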

ii Ant Colony Optimisation (ACO)

Ant Colony Optimisation [29] is a meta-heuristic which has been applied to a large number of combinatorial optimisation problems. For the purpose of this article we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [63], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node and consisting of multiple "ants". Ants are simple computational agents which concurrently compute possible solutions to the input problem, guided by shared information about good paths through the search space; there is also a certain amount of stochastic variation which allows the ants to explore new directions. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution.

We implement four distributed coordination patterns for the same multi-colony ACO computation, as follows. In each implementation the individual colonies perform some number of local iterations (i.e. generations of ants) and then report their best solutions; the globally-best solution is then selected and is reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations.

Figure 3: Distributed Erlang Two-level Ant Colony Optimisation (TL-ACO) architecture.

Figure 4: Distributed Erlang Multi-level Ant Colony Optimisation (ML-ACO) architecture.

Two-level ACO (TL-ACO) has a single master node that collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 3 depicts the process and node placements of the TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on available nodes. In the next step each colony process spawns NA ant processes on the local node. Each ant iterates IA times, returning its result to the colony master. Each colony iterates IM times, reporting their best solution to, and receiving the globally-best solution from, the master process. We validated the implementation by applying TL-ACO to a number of standard SMTWTP instances [25, 13, 34], obtaining good results in all cases, and confirmed that the number of perfect solutions increases as we increase the number of colonies.

Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies and thus could become a bottleneck. ML-ACO addresses this by having a tree of submasters (Figure 4), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from their children, as sketched below.
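A schematic of the submaster logic (illustrative names and message shapes, not the benchmark's actual code); solutions are assumed to be {Cost, Schedule} pairs, where lower cost is better:

%% Collect one best solution from each child (colony or lower-level
%% submaster) and forward the overall best to our own parent.
submaster(Parent, NumChildren) ->
    Best = collect(NumChildren, none),
    Parent ! {best, Best}.

collect(0, Best) -> Best;
collect(N, Best) ->
    receive
        {best, Candidate} -> collect(N - 1, better(Candidate, Best))
    end.

better(S, none) -> S;
better({C1, _} = S1, {C2, _}) when C1 =< C2 -> S1;
better(_, S2) -> S2.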

Globally Reliable ACO (GR-ACO). In ML-ACO, if a single colony fails to report back, the system will wait indefinitely. GR-ACO adds fault tolerance, supervising colonies so that a faulty colony can be detected and restarted, allowing the system to continue execution.

Scalable Reliable ACO (SR-ACO) also adds fault-tolerance, but using supervision within our new s_groups from Section i.1, and the architecture of SR-ACO is discussed in detail there.

IV Erlang Scalability Limits

This section investigates the scalability of Erlang at VM, language, and persistent storage levels. An aspect we choose not to explore is the security of large scale systems, where, for example, one might imagine providing enhanced security for systems with multiple clusters or cloud instances connected by a Wide Area Network. We assume that existing security mechanisms are used, e.g. a Virtual Private Network.

i Scaling Erlang on a Single Host

To investigate Erlang scalability we built BenchErl, an extensible open source benchmark suite with a web interface⁴. BenchErl shows how an application's performance changes when resources like cores or schedulers are added, or when options that control these resources change:

• the number of nodes, i.e. the number of Erlang VMs used, typically on multiple hosts;

• the number of cores per node;

• the number of schedulers, i.e. the OS threads that execute Erlang processes in parallel, and their binding to the topology of the cores of the underlying computer node;

• the Erlang/OTP release and flavor; and

• the command-line arguments used to start the Erlang nodes.

Using BenchErl, we investigated the scalability of an initial set of twelve benchmarks and two substantial Erlang applications, using a single Erlang node (VM) on machines with up to 64 cores, including the Bulldozer machine specified in Appendix A. This set of experiments, reported by [8], confirmed that some programs scaled well in the most recent Erlang/OTP release of the time (R15B01), but also revealed VM and language level scalability bottlenecks.

⁴Information about BenchErl is available at http://release.softlab.ntua.gr/bencherl.

Figure 5: Runtime and speedup of two configurations of the Orbit benchmark (with and without intra-worker parallelism) using Erlang/OTP R15B01; time in ms and speedup vs. number of schedulers.

Figure 5 shows runtime and speedup curves for the Orbit benchmark, where master and workers run on a single Erlang node, in configurations with and without intra-worker parallelism. In both configurations the program scales: runtime continuously decreases as we add more schedulers to exploit more cores. The speedup of the benchmark without intra-worker parallelism, i.e. without spawning additional processes for the computation (green curve), is almost linear up to 32 cores, but increases less rapidly from that point on; we see a similar, but more clearly visible, pattern for the other configuration (red curve), where there is no performance improvement beyond 32 schedulers. This is due to the asymmetric characteristics of the machine's micro-architecture, which consists of modules that couple two conventional x86 out-of-order cores that share the early pipeline stages, the floating point unit, and the L2 cache with the rest of the module [3].

Some other benchmarks, however, did not scale well or experienced significant slowdowns when run in many VM schedulers (threads). For example, the ets_test benchmark has multiple processes accessing a shared ETS table. Figure 6 shows runtime and speedup curves for ets_test on a 16-core (eight cores with hyperthreading) Intel Xeon-based machine. It shows that runtime increases beyond two schedulers, and that the program exhibits a slowdown instead of a speedup.

For many benchmarks there are obvious reasons for poor scaling, like limited parallelism in the application, or contention for shared resources. The reasons for poor scaling are less obvious for other benchmarks, and it is exactly these that we have chosen to study in detail in subsequent work [8, 46].

A simple example is the parallel BenchErl benchmark that spawns some n processes, each of which creates a list of m timestamps and, after it checks that each timestamp in the list is strictly greater than the previous one, sends the result to its parent. Figure 7 shows that up to eight cores each additional core leads to a slowdown; thereafter a small speedup is obtained up to 32 cores, and then again a slowdown. A small aspect of the benchmark, easily overlooked, explains the poor scalability. The benchmark creates timestamps using the erlang:now/0 built-in function, whose implementation acquires a global lock in order to return a unique timestamp. That is, two calls to erlang:now/0, even from different processes, are guaranteed to produce monotonically increasing values. This lock is precisely the bottleneck in the VM that limits the scalability of this benchmark. We describe our work to address VM timing issues in Section iii.

Figure 6: Runtime and speedup of the ets_test BenchErl benchmark using Erlang/OTP R15B01.

Discussion. Our investigations identified contention for shared ETS tables, and for commonly-used shared resources like timers, as the key VM-level scalability issues. Section VI outlines how we addressed these issues in recent Erlang/OTP releases.

ii Distributed Erlang Scalability

Network Connectivity Costs. When any normal distributed Erlang nodes communicate, they share their connection sets, and this typically leads to a fully connected graph of nodes. So a system with n nodes will maintain O(n²) connections, and these are relatively expensive TCP connections with continual maintenance traffic. This design aids transparent distribution, as there is no need to discover nodes, and the design works well for small numbers of nodes. However, at emergent server architecture scales, i.e. hundreds of nodes, this design becomes very expensive, and system architects must switch from the default Erlang model, e.g. they need to start using hidden nodes that do not share connection sets.

Figure 7: Runtime and speedup of the BenchErl benchmark called parallel, using Erlang/OTP R15B01.

We have investigated the scalability limits imposed by network connectivity costs using several Orbit calculations on two large clusters, Kalkyl and Athos, as specified in Appendix A. The Kalkyl results are discussed by [21], and Figure 29 in Section i shows representative results for distributed Erlang computing orbits with 2M and 5M elements on Athos. In all cases performance degrades beyond some scale (40 nodes for the 5M orbit, and 140 nodes for the 2M orbit). Figure 32 illustrates the additional network traffic induced by the fully connected network: it allows a comparison between the number of packets sent in a fully connected network (ML-ACO) with those sent in a network partitioned using our new s_groups (SR-ACO).

Global Information Costs. Maintaining global information is known to limit the scalability of distributed systems, and crucially the process namespaces used for reliability are global. To investigate the scalability limits imposed on distributed Erlang by such global information, we have designed and implemented DE-Bench, an open source, parameterisable and scalable peer-to-peer benchmarking framework [37, 36]. DE-Bench measures the throughput and latency of distributed Erlang commands on a cluster of Erlang nodes, and the design is influenced by the Basho Bench benchmarking tool for Riak [12]. Each DE-Bench instance acts as a peer, providing scalability and reliability by eliminating central coordination and any single point of failure.

To evaluate the scalability of distributed Erlang, we measure how adding more hosts increases the throughput, i.e. the total number of successfully executed distributed Erlang commands per experiment. Figure 8 shows the parameterisable internal workflow of DE-Bench.

Figure 8: DE-Bench's internal workflow.

There are three classes of commands in DE-Bench: (i) Point-to-Point (P2P) commands, where a function with tunable argument size and computation time is run on a remote node, include spawn, rpc, and synchronous calls to server processes, i.e. gen_server or gen_fsm; (ii) Global commands, which entail synchronisation across all connected nodes, such as global:register_name and global:unregister_name; (iii) Local commands, which are executed independently by a single node, e.g. whereis_name, a look up in the local name table.
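As a rough illustration of what a single global-command sample involves (our sketch, not DE-Bench's actual code), timer:tc/3 can be used to measure the latency, in microseconds, of one global registration and its cleanup:

%% Time one global registration/unregistration pair; a harness such as
%% DE-Bench aggregates many samples into throughput and latency figures.
sample_global_latency() ->
    Name = list_to_atom("name_" ++
               integer_to_list(erlang:unique_integer([positive]))),
    {RegMicros, yes} = timer:tc(global, register_name, [Name, self()]),
    {UnregMicros, _} = timer:tc(global, unregister_name, [Name]),
    {RegMicros, UnregMicros}.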

The benchmarking is conducted on 10 to 100 host configurations of the Kalkyl cluster (in steps of 10), and measures the throughput of successful commands per second over 5 minutes. There is one Erlang VM on each host, and one DE-Bench instance on each VM. The full paper [37] investigates the impact of data size and computation time in P2P calls, both independently and in combination, and the scaling properties of the common Erlang/OTP generic server processes gen_server and gen_fsm.

Here we focus on the impact of different proportions of global commands, mixing global with P2P and local commands. Figure 9 shows that even a low proportion of global commands limits the scalability of distributed Erlang, e.g. just 0.01% global commands limits scalability to around 60 nodes. Figure 10 reports the latency of all commands, and shows that while the latencies for P2P and local commands are stable at scale, the latency of the global commands increases dramatically with scale. Both results illustrate that the impact of global operations on throughput and latency in a distributed Erlang system is severe.

Explicit Placement. While network connectivity and global information impact performance at scale, our investigations also identified explicit process placement as a programming issue at scale. Recall from Section iii that distributed Erlang requires the programmer to identify an explicit Erlang node (VM) when spawning a process. Identifying an appropriate node becomes a significant burden for large and dynamic systems. The problem is exacerbated in large distributed systems where (1) the hosts may not be identical, having different hardware capabilities or different software installed, and (2) communication times may be non-uniform: it may be fast to send a message between VMs on the same host, and slow if the VMs are on different hosts in a large distributed system.

These factors make it difficult to deploy applications, especially in a scalable and portable manner. Moreover, while the programmer may be able to use platform-specific knowledge to decide where to spawn processes to enable an application to run efficiently, if the application is then deployed on a different platform, or if the platform changes as hosts fail or are added, this knowledge becomes outdated.

Discussion. Our investigations confirm three language-level scalability limitations of Erlang from developer folklore. (1) Maintaining a fully connected network of Erlang nodes limits scalability; for example, Orbit is typically limited to just 40 nodes. (2) Global operations, and crucially the global operations required for reliability, i.e. to maintain a global namespace, seriously limit the scalability of distributed Erlang systems. (3) Explicit process placement makes it hard to build performance portable applications for large architectures. These issues cause designers of reliable large scale systems in Erlang to depart from the standard Erlang model, e.g. using techniques like hidden nodes and storing pids in data structures. In Section V we develop language technologies to address these issues.

Figure 9: Scalability vs. percentage of global commands in distributed Erlang.

iii Persistent Storage

Any large scale system needs reliable scalable persistent storage, and we have studied the scalability limits of Erlang persistent storage alternatives [38]. We envisage a typical large server having around 10^5 cores on around 100 hosts. We have reviewed the requirements for scalable and available persistent storage, and evaluated four popular Erlang DBMS against these requirements. For a target scale of around 100 hosts, Mnesia and CouchDB are, unsurprisingly, not suitable. However, Dynamo-style NoSQL DBMS like Cassandra and Riak have the potential to be.

We have investigated the current scalability limits of the Riak NoSQL DBMS using the Basho Bench benchmarking framework on a cluster with up to 100 nodes and independent disks. We found that the scalability limit of Riak version 1.1.1 is 60 nodes on the Kalkyl cluster. The study placed into the public scientific domain what was previously well-evidenced, but anecdotal, developer experience.

We have also shown that resources like memory, disk and network do not limit the scalability of Riak. By instrumenting the global and gen_server OTP libraries, we identified a specific Riak remote procedure call that fails to scale. We outline how later releases of Riak are refactored to eliminate the scalability bottlenecks.


Figure 10: Latency of commands as the number of Erlang nodes increases.

Discussion. We conclude that Dynamo-like NoSQL DBMSs have the potential to deliver reliable persistent storage for Erlang at our target scale of approximately 100 hosts. Specifically, an Erlang Cassandra interface is available, and Riak 1.1.1 already provides scalable and available persistent storage on 60 nodes. Moreover, the scalability of Riak is much improved in subsequent versions.

V Improving Language Scalability

This section outlines the Scalable Distributed (SD) Erlang libraries [21] we have designed and implemented to address the distributed Erlang scalability issues identified in Section ii. SD Erlang introduces two concepts to improve scalability. S_groups partition the set of nodes in an Erlang system to reduce network connectivity and partition global data (Section i.1). Semi-explicit placement alleviates the issues of explicit process placement in large heterogeneous networks (Section i.2). The two features are independent and can be used separately or in combination. We overview SD Erlang in Section i, and outline s_group semantics and validation in Sections ii and iii respectively.

i SD Erlang Design

i1 S_groups

S_groups reduce both the number of connections a node maintains and the size of name spaces, i.e. they minimise global information. Specifically, names are registered on, and synchronised between, only the nodes within the s_group. An s_group has the following parameters: a name, a list of nodes, and a list of registered names. A node can belong to many s_groups or to none. If a node belongs to no s_group, it behaves as a usual distributed Erlang node.

The s_group library defines the functions shown in Table 1. Some of these functions manipulate s_groups and provide information about them, such as creating s_groups and providing a list of nodes from a given s_group. The remaining functions manipulate names registered in s_groups and provide information about these names. For example, to register a process Pid with name Name in s_group SGroupName we use the following function.


Table 1: Summary of s_group functions.

new_s_group(SGroupName, Nodes): Creates a new s_group consisting of some nodes.
delete_s_group(SGroupName): Deletes an s_group.
add_nodes(SGroupName, Nodes): Adds a list of nodes to an s_group.
remove_nodes(SGroupName, Nodes): Removes a list of nodes from an s_group.
s_groups(): Returns a list of all s_groups known to the node.
own_s_groups(): Returns a list of s_group tuples the node belongs to.
own_nodes(): Returns a list of nodes from all s_groups the node belongs to.
own_nodes(SGroupName): Returns a list of nodes from the given s_group.
info(): Returns s_group state information.
register_name(SGroupName, Name, Pid): Registers a name in the given s_group.
re_register_name(SGroupName, Name, Pid): Re-registers a name (changes a registration) in a given s_group.
unregister_name(SGroupName, Name): Unregisters a name in the given s_group.
registered_names({node, Node}): Returns a list of all registered names on the given node.
registered_names({s_group, SGroupName}): Returns a list of all registered names in the given s_group.
whereis_name(SGroupName, Name) and whereis_name(Node, SGroupName, Name): Return the pid of a name registered in the given s_group.
send(SGroupName, Name, Msg) and send(Node, SGroupName, Name, Msg): Send a message to a name registered in the given s_group.

The name will only be registered if the process is being executed on a node that belongs to the given s_group, and neither Name nor Pid are already registered in that group:

s_group:register_name(SGroupName, Name,
                      Pid) -> yes | no

To illustrate the impact of s_groups on scalability, we repeat the global operations experiment from Section ii (Figure 9). In the SD Erlang experiment we partition the set of nodes into s_groups, each containing ten nodes, and hence the names are replicated and synchronised on just ten nodes, and not on all nodes as in distributed Erlang. The results in Figure 11 show that with 0.01% of global operations the throughput of distributed Erlang stops growing at 40 nodes, while the throughput of SD Erlang continues to grow linearly.

The connection topology of s_groups is extremely flexible: they may be organised into a hierarchy of arbitrary depth or branching, e.g. there could be multiple levels in the tree of s_groups; see Figure 12. Moreover, it is not necessary to create a hierarchy of s_groups; for example, we have constructed an Orbit implementation using a ring of s_groups.

Figure 11: Global operations in distributed Erlang vs SD Erlang (throughput vs. number of nodes).

Given such a flexible way of organising distributed systems, key questions in the design of an SD Erlang system are the following. How should s_groups be structured? Depending on the reason the nodes are grouped (reducing the number of connections, reducing the namespace, or both), s_groups can be freely structured as a tree, ring, or some other topology. How large should the s_groups be? Smaller s_groups mean more inter-group communication, but the synchronisation of the s_group state between the s_group nodes constrains the maximum size of s_groups. We have not found this constraint to be a serious restriction; for example, many s_groups are either relatively small, e.g. 10-node, internal or terminal elements in some topology, e.g. the leaves and nodes of a tree. How do nodes from different s_groups communicate? While any two nodes can communicate in an SD Erlang system, to minimise the number of connections communication between nodes from different s_groups is typically routed via gateway nodes that belong to both s_groups. How do we avoid single points of failure? For reliability, and to minimise communication load, multiple gateway nodes and processes may be required.

Information to make these design choices is provided by the tools in Section VII and by benchmarking. A further challenge is how to systematically refactor a distributed Erlang application into SD Erlang, and this is outlined in Section i. A detailed discussion of distributed system design and refactoring in SD Erlang is provided in a recent article [22].

We illustrate typical SD Erlang system designs by showing refactorings of the Orbit and ACO benchmarks from Section III. In both distributed and SD Erlang the computation starts on the Master node, and the actual computation is done on the Worker nodes. In the distributed Erlang version all nodes are interconnected, and messages are transferred directly from the sending node to the receiving node (Figure 2). In contrast, in the SD Erlang version nodes are grouped into s_groups, and messages are transferred between different s_groups via Submaster nodes (Figure 13).


Figure 12: SD Erlang ACO (SR-ACO) architecture. (The diagram shows levels 0-2 of the node hierarchy, distinguishing the master process, sub-master processes, colony processes, ant processes, s_groups, and nodes.)

A fragment of code that creates an s_group on a Sub-master node is as follows:

create_s_group(Master, GroupName, Nodes0) ->
    case s_group:new_s_group(GroupName, Nodes0) of
        {ok, GroupName, Nodes} ->
            Master ! {GroupName, Nodes};
        _ -> io:format("exception: message~n")
    end.

Similarly, we introduce s_groups in the GR-ACO benchmark from Section ii to create Scalable Reliable ACO (SR-ACO); see Figure 12. Here, apart from reducing the number of connections, s_groups also reduce the global namespace information. That is, instead of registering the name of a pid globally, i.e. with all nodes, the name is registered only on all nodes in the s_group, with s_group:register_name/3.

A comparative performance evaluation of distributed Erlang and SD Erlang Orbit and ACO is presented in Section VIII.

i2 Semi-Explicit Placement

Recall from Section iii that distributed Erlang spawns a process onto an explicitly named Erlang node, e.g.

spawn(some_node, fun some_module:pong/0)

and also recall the portability and programming effort issues associated with such explicit placement in large scale systems, discussed in Section ii.

To address these issues we have developed a semi-explicit placement library that enables the programmer to select nodes on which to spawn processes based on run-time information about the properties of the nodes. For example, if a process performs a lot of computation one would like to spawn it on a node with considerable computation power, or if two processes are likely to communicate frequently then it would be desirable to spawn them on the same node, or on nodes with a fast interconnect.

We have implemented two Erlang libraries to support semi-explicit placement [60]. The first deals with node attributes, and describes properties of individual Erlang VMs and associated hosts, such as total and currently available RAM, installed software, hardware configuration, etc. The second deals with a notion of communication distances, which models the communication times between nodes in a distributed system. Therefore, instead of specifying a node, we can use the attr:choose_node/1 function to define the target node, i.e.

spawn(attr:choose_node(Params),
      fun some_module:pong/0)

Figure 13: SD Erlang (SD-Orbit) architecture.

[60] report an investigation into the communication latencies on a range of NUMA and cluster architectures, and demonstrate the effectiveness of the placement libraries using the ML-ACO benchmark on the Athos cluster.

ii S_group Semantics

For precise specification, and as a basis for validation, we provide a small-step operational semantics of the s_group operations [21]. Figure 14 defines the state of an SD Erlang system and associated abstract syntax variables. The abstract syntax variables on the left are defined as members of sets, denoted {...}, and these in turn may contain tuples, denoted (...), or further sets. In particular nm is a process name, p a pid, ni a node_id, and nis a set of node_ids. The state of a system is modelled as a four tuple comprising a set of s_groups, a set of free_groups, a set of free_hidden_groups, and a set of nodes. Each type of group is associated with nodes and has a namespace. An s_group additionally has a name, whereas a free_hidden_group consists of only one node: i.e. a hidden node simultaneously acts as a node and as a group, because as a group it has a namespace but does not share it with any other node. Free normal and hidden groups have no names, and are uniquely defined by the nodes associated with them. Therefore, group names, gr_names, are either NoGroup or a set of s_group_names. A namespace is a set of name and process id (pid) pairs, and is replicated on all nodes of the associated group.

A node has the following four parameters: node_id identifier; node_type, which can be either hidden or normal; connections; and group_names, i.e. names of groups the node belongs to. The node can belong to either a list of s_groups or one of the free groups. The type of the free group is defined by the node type. Connections are a set of node_ids.

Transitions in the semantics have the form (state, command, ni) −→ (state′, value), meaning that executing command on node ni in state returns value and transitions to state′.

The semantics is presented in more detail by [21], but we illustrate it here with the s_group:registered_names/1 function from Section i.1. The function returns a list of names


(grs, fgs, fhs, nds) ∈ state ≡ ({s_group}, {free_group}, {free_hidden_group}, {node})

gr ∈ grs ≡ {s_group} ≡ {(s_group_name, {node_id}, namespace)}

fg ∈ fgs ≡ {free_group} ≡ {({node_id}, namespace)}

fh ∈ fhs ≡ {free_hidden_group} ≡ {(node_id, namespace)}

nd ∈ nds ≡ {node} ≡ {(node_id, node_type, connections, gr_names)}

gs ∈ gr_names ≡ NoGroup | {s_group_name}

ns ∈ namespace ≡ {(name, pid)}

cs ∈ connections ≡ {node_id}

nt ∈ node_type ≡ Normal | Hidden

s ∈ NoGroup | s_group_name

Figure 14: SD Erlang state [21].

registered in s_group s if node ni belongs to the s_group, and an empty list otherwise (Figure 15). Here ⊕ denotes disjoint set union, IsSGroupNode returns true if node ni is a member of some s_group s and false otherwise, and OutputNms returns a set of process names registered in the ns namespace of s_group s.

iii Semantics Validation

As the semantics is concrete, it can readily be made executable in Erlang, with lists replacing sets throughout. Having an executable semantics allows users to engage with it, and to understand how the semantics behaves vis-à-vis the library, giving them an opportunity to assess the correctness of the library against the semantics.

Figure 16: Testing s_groups using QuickCheck.

Better still, we can automatically assess how the system behaves in comparison with the (executable) semantics by executing them in lockstep, guided by the constraints of which operations are possible at each point. We do that by building an abstract state machine model of the library. We can then generate random sequences (or traces) through the model, with appropriate library data generated too. This random generation is supported by the QuickCheck property-based testing system [24, 9].

The architecture of the testing framework is shown in Figure 16. First, an abstract state machine embedded as an "eqc_statem" module is derived from the executable semantic specification. The state machine defines the abstract state representation and the transition from one state to another when an operation is applied. Test case and data generators are then defined to control the test case generation; this includes the automatic generation of eligible s_group operations and the input data to those operations. Test oracles are encoded as the postcondition for s_group operations.

((grs, fgs, fhs, nds), s_group:registered_names(s), ni)
    −→ ((grs, fgs, fhs, nds), nms)   if IsSGroupNode(ni, s, grs)
    −→ ((grs, fgs, fhs, nds), {})    otherwise

where
    {(s, {ni} ⊕ nis, ns)} ⊕ grs′ ≡ grs
    nms ≡ OutputNms(s, ns)

IsSGroupNode(ni, s, grs) = ∃ nis, ns, grs′. {(s, {ni} ⊕ nis, ns)} ⊕ grs′ ≡ grs

OutputNms(s, ns) = {(s, nm) | (nm, p) ∈ ns}

Figure 15: SD Erlang semantics of the s_group:registered_names/1 function.

During testing, each test command is applied to both the abstract model and the s_group library. The application of the test command to the abstract model takes the abstract model from its current state to a new state, as described by the transition functions, whereas the application of the test command to the library leads the system to a new actual state. The actual state information is collected from each node in the distributed system, then merged and normalised to the same format as the abstract state representation. For a test to be successful, after the execution of a test command the test oracles specified for this command should be satisfied. Various test oracles can be defined for s_group operations; for instance, one of the generic constraints that applies to all the s_group operations is that after each s_group operation the normalised system state should be equivalent to the abstract state.
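To give a flavour of such a model, a schematic skeleton of an eqc_statem callback module is sketched below; this is our own simplified illustration, not the project's actual module, and the generator helpers (gen_group_name/0, gen_nodes/0, gen_existing_group/1) and the oracle abstract_registered_names/2 are illustrative:

-module(sgroup_eqc).   %% hypothetical module name
-include_lib("eqc/include/eqc.hrl").
-include_lib("eqc/include/eqc_statem.hrl").

%% Abstract state: here simplified to a list of {Name, Nodes, Names}.
initial_state() -> [].

%% Generate an eligible command together with generated input data.
command(State) ->
    oneof([{call, s_group, new_s_group, [gen_group_name(), gen_nodes()]}] ++
          [{call, s_group, registered_names, [gen_existing_group(State)]}
           || State =/= []]).

precondition(_State, _Call) -> true.

%% Transition function: how the abstract state changes.
next_state(State, _Res, {call, s_group, new_s_group, [Name, Nodes]}) ->
    [{Name, Nodes, []} | State];
next_state(State, _Res, _Call) -> State.

%% Oracle: the library's result must agree with the abstract semantics.
postcondition(State, {call, s_group, registered_names, [Name]}, Res) ->
    Res =:= abstract_registered_names(Name, State);
postcondition(_State, _Call, _Res) -> true.

gen_group_name() -> elements([g1, g2, g3]).
gen_nodes() -> list(elements(['node1@host', 'node2@host'])).
gen_existing_group(State) -> elements([Name || {Name, _, _} <- State]).

abstract_registered_names(Name, State) ->
    case lists:keyfind(Name, 1, State) of
        {Name, _Nodes, Names} -> Names;
        false -> []
    end.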

Thousands of tests were run, and three kinds of errors (which have subsequently been corrected) were found. Some errors in the library implementation were found, including one error due to the synchronisation between nodes, and another related to the remove_nodes operation, which erroneously raised an exception. We also found a couple of trivial errors in the semantic specification itself, which had been missed by manual examination. Finally, we found some situations where there were inconsistencies between the semantics and the library implementation, despite their states being equivalent; an example of this was in the particular values returned by functions on certain errors. Overall, the automation of testing boosted our confidence in the correctness of the library implementation and the semantic specification. This work is reported in more detail by [54].

VI Improving VM Scalability

This section reports the primary VM and library improvements we have designed and implemented to address the scalability and reliability issues identified in Section i.


i Improvements to Erlang Term Storage

Because ETS tables are so heavily used in Erlang systems, they are a focus for scalability improvements. We start by describing their redesign, including some improvements that pre-date our RELEASE project work, i.e. those prior to Erlang/OTP R15B03. These historical improvements are very relevant for a scalability study, and form the basis for our subsequent changes and improvements. At the point when Erlang/OTP got support for multiple cores (in release R11B), there was a single reader-writer lock for each ETS table. Optional fine grained locking of hash-based ETS tables (i.e. set, bag or duplicate_bag tables) was introduced in Erlang/OTP R13B02-1, adding 16 reader-writer locks for the hash buckets. Reader groups to minimise read synchronisation overheads were introduced in Erlang/OTP R14B. The key observation is that a single count of the multiple readers must be synchronised across many cache lines, potentially far away in a NUMA system. Maintaining reader counts in multiple (local) caches makes reads fast, although writes must now check every reader count. In Erlang/OTP R16B the number of bucket locks and the default number of reader groups were both upgraded from 16 to 64.
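From the programmer's perspective these mechanisms are enabled per table: {write_concurrency, true} turns on the fine-grained bucket locking, and {read_concurrency, true} enables the reader-group optimisation. A minimal illustration (table names are ours):

%% Read-mostly table: reader groups make concurrent lookups cheap.
Counters = ets:new(counters, [set, public, {read_concurrency, true}]),

%% Mixed-workload table: also enable fine-grained bucket locking.
Cache = ets:new(cache, [set, public,
                        {read_concurrency, true},
                        {write_concurrency, true}]),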

We illustrate the scaling properties of the ETS concurrency options using the ets_bench BenchErl benchmark on the Intel NUMA machine with 32 hyperthreaded cores specified in Appendix A. The ets_bench benchmark inserts 1M items into the table, then records the time to perform 17M operations, where an operation is either a lookup, an insert, or a delete. The experiments are conducted on a hash-based (set) ETS table with different percentages of update operations, i.e. insertions or deletions.

Figure 17 shows the runtimes in seconds of 17M operations in different Erlang/OTP versions, varying the number of schedulers (x-axis), reflecting how the scalability of ETS tables has improved in more recent Erlang/OTP releases. Figure 18 shows the runtimes in seconds of 17M operations on an ETS table with different numbers of reader groups, again varying the number of schedulers. We see that one reader group is not sufficient with 10% updates, nor are two with 1% updates. Beyond that, different numbers of reader groups have little impact on the benchmark performance, except that using 64 groups with 10% updates slightly degrades performance.

Figure 17: Runtime of 17M ETS operations in Erlang/OTP releases R11B-5, R13B02-1, R14B and R16B; 10% updates (left) and 1% updates (right). The y-axis shows time in seconds (lower is better), and the x-axis is the number of OS threads (VM schedulers).

We have explored four other extensions or redesigns in the ETS implementation for better scalability: 1. allowing more programmer control over the number of bucket locks in hash-based tables, so the programmer can reflect the number of schedulers and the expected access pattern; 2. using contention-adapting trees to get better scalability for ordered_set ETS tables, as described by [69]; 3. using queue delegation locking to improve scalability [47]; 4. adopting schemes for completely eliminating the locks in the meta table. A more complete discussion of our work on ETS can be found in the papers by [69] and [47].

Figure 18: Runtime of 17M ETS operations with varying numbers of reader groups; 10% updates (left) and 1% updates (right). The y-axis shows runtime in seconds (lower is better), and the x-axis is the number of schedulers.

Here we outline only our work on contention-adapting (CA) trees. A CA tree monitors contention in different parts of a tree-shaped data structure, introducing routing nodes with locks in response to high contention, and removing them in response to low contention. For experimental purposes, two variants of the CA tree have been implemented to represent ordered_sets in the virtual machine of Erlang/OTP 17.0. One extends the existing AVL trees in the Erlang VM, and the other uses a Treap data structure [5]. Figure 19 compares the throughput of the CA tree variants with that of ordered_set and set as the number of schedulers increases. It is unsurprising that the CA trees scale so much better than an ordered_set, which is protected by a single readers-writer lock. It is more surprising that they also scale better than set. This is due to hash tables using fine-grained locking at a fixed granularity, while CA trees can adapt the number of locks to the current contention level, and also to the parts of the key range where contention is occurring.

Figure 19: Throughput (operations per microsecond vs. number of threads) of set, ordset, and the AVL-CA and Treap-CA tree variants; 10% updates (left) and 1% updates (right).


ii Improvements to Schedulers

In the Erlang VM, a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until it either blocks waiting for input, typically a message from some other process, or until it has executed its quota of reductions.

The Erlang VM is usually started with one scheduler per logical core (SMT-thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned, it is placed in the run queue of the scheduler of its parent, and it waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores; that is, an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Figure iii.2.

The default load management mechanism is load compaction, which aims to keep as many scheduler threads as possible fully loaded with work, i.e. it attempts to ensure that scheduler threads do not run out of work. We have developed a new optional scheduler utilisation balancing mechanism that is available from Erlang/OTP 17.0. The new mechanism aims to balance scheduler utilisation between schedulers; that is, it will strive for equal scheduler utilisation on all schedulers.

The scheduler utilisation balancing mechanism has no performance impact on the system when not enabled. On the other hand, when enabled, it results in changed timing in the system; normally there is a small overhead due to measuring of utilisation and calculating balancing information, which depends on the underlying primitives provided by the operating system.

The new balancing mechanism results in a better distribution of processes to schedulers, reducing the probability of core contention. Together with other VM improvements, such as interruptable BIFs and garbage collection, it results in lower latency and improved responsiveness, and hence reliability, for soft real-time applications.
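For completeness: the balancing mechanism is opt-in; if we recall the release documentation correctly, it is controlled by the +sub emulator flag, e.g.

erl +sub true

with +sub false (the default) retaining the load compaction behaviour described above.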

iii Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel benchmark in Section i. The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in that returns "Erlang system time", or time since Epoch, with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, i.e. not SMT-enabled. The documentation promises that values returned by it are strictly increasing, and many applications ended up relying on this. For example, applications often employ erlang:now/0 to generate unique integers.
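Code that used erlang:now/0 purely as a unique-value generator can instead use erlang:unique_integer/1 from the new API; a minimal sketch:

    %% Old style: unique, strictly increasing values from the clock;
    %% every call synchronises on a global lock.
    Unique1 = erlang:now(),
    %% New style: strictly increasing integers with no global lock.
    Unique2 = erlang:unique_integer([monotonic, positive]).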

Erlang system time should align with the operating system's view of time since Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot, without invalidating the strictly increasing value


guarantee. The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when these two have deviated, e.g. in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], i.e. a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards and backwards, while system time is allowed to do this. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.

Time Retrieval. Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe, but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, monotonic time delivered by the operating system is solely based on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP). That is, they will run with different frequencies. Linux is an exception, with a monotonic clock that is NTP adjusted and runs with the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing monotonic time and system time delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute. If the adjustment needs to be changed, new adjustment information is published and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the time adjustment information previously published, and calculates Erlang monotonic time. To preserve monotonicity it is important that all threads that read the same OS monotonic time map this to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information, using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed. This means that in the vast majority of cases the RW lock will be read-locked, which allows multiple readers to run concurrently. To prevent bouncing the lock cache-line, we use a bespoke reader-optimised RW lock implementation where reader threads notify about their presence in counters on separate cache-lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11]; alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].
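The same decomposition is visible at the language level: Erlang system time is monotonic time plus a dynamically varying offset. A minimal sketch using the new API (all calls are part of Erlang/OTP 18 and later):

    %% Erlang system time = Erlang monotonic time + time offset.
    Mono    = erlang:monotonic_time(),
    Offset  = erlang:time_offset(),
    SysTime = Mono + Offset,
    %% The equivalent single call (read at a slightly later instant):
    SysTime2 = erlang:system_time().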

Timer Wheel and BIF Timer. The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel that is used by processes executing on the scheduler.

The implementation of timers in Erlang/OTP uses a built-in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS tables, and only insert one timer at a time into the timer wheel.

Benchmarks. We have measured several benchmarks on a 16-core Bulldozer machine with eight dual-CPU AMD Opteron 4376 HEs.⁵

We present three of them here.

The first micro benchmark compares the execution time of an Erlang receive with that of a receive ... after that specifies a timeout and provides a default value. The receive ... after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, total execution time with the optimised timers is only 5% longer than without timers.
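The two code shapes being compared are the plain receive and the standard timeout idiom; a minimal sketch:

    %% Plain receive: no timer is involved.
    plain_recv() ->
        receive
            Msg -> Msg
        end.

    %% receive ... after: a timer is set when the process blocks
    %% and cancelled when a message arrives.
    recv_with_timeout(Timeout) ->
        receive
            Msg -> Msg
        after Timeout ->
            timeout
        end.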

The second micro benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4 and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on its left shows that: (1) the performance of time management has remained roughly unchanged between Erlang/OTP releases prior to 18.0; (2) the improved time management in Erlang/OTP 18.x makes time management less

⁵ See Section 2.5.4 of the RELEASE project Deliverable 2.4 (http://release-project.eu/documents/D24.pdf).

Figure 20: BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases: runtimes (left) and speedup obtained using erlang:monotonic_time/0 (right).

likely to be a scalability bottleneck, even when using erlang:now/0; and (3) the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution. The graph on the right side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

VII Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable,


to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling and debugging Erlang systems that can be used separately or together, as appropriate for solving the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on or specialise the services offered by the Erlang virtual machine through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer⁶ application gives a comprehensive overview of many of these data on a node-by-node basis.

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature. That is, the tools support the language itself rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use tracing support provided by the Erlang VM, other actor frameworks, e.g. Akka, can use the Kamon⁷ JVM monitoring system. Similarly, tools for other actor languages or frameworks could use data derived through OS-level

⁶ http://www.erlang.org/doc/apps/observer
⁷ http://kamon.io

tracing frameworks: DTrace⁸ and SystemTap⁹ probes, as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

i Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare it for modification or extension, or, as is the case here, in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support, in a refactoring tool. There are a number of tools that support refactoring in Erlang: in the RELEASE project we have chosen to extend Wrangler¹⁰ [56]; other tools include Tidier [68] and RefactorErl [44].

Supporting API Migration. The SD Erlang libraries modify Erlang's global_group library, becoming the new s_group library; as a result, Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolves and this often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be

⁸ http://dtrace.org/blogs/about
⁹ https://sourceware.org/systemtap/wiki
¹⁰ http://www.cs.kent.ac.uk/projects/wrangler


supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically. The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work, and the technicalities of the implementation, can be found in a paper by [52].
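As an illustration, an adaptor function defines an old call in terms of the new API; a hypothetical sketch (the concrete adaptor syntax, and the exact form of the s_group function shown, are assumptions, not taken from the source):

    %% Hypothetical adaptor module: each function defines an old
    %% global_group call in terms of the new s_group API.
    -module(global_group_adaptor).
    -export([registered_names/1]).

    registered_names({node, Node}) ->
        %% Old API: global_group:registered_names({node, Node}).
        %% New API (assumed form): s_group:registered_names(node, Node).
        s_group:registered_names(node, Node).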

Support for Introducing Parallelism. We have introduced support for parallelising explicit list operations (map and foreach), for process introduction to complete a computationally intensive task in parallel, for introducing a worker process to deal with call handling in an Erlang "generic server", and to parallelise a tail-recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach. The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward; even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
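The flavour of such a transformation is easy to see with a hand-rolled parallel map; a minimal sketch (this is not the para_lib implementation, whose API the article does not detail):

    %% Spawn one worker per element and collect results in order.
    pmap(F, Xs) ->
        Parent = self(),
        Refs = [begin
                    Ref = make_ref(),
                    spawn_link(fun() -> Parent ! {Ref, F(X)} end),
                    Ref
                end || X <- Xs],
        [receive {Ref, Res} -> Res end || Ref <- Refs].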

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute a task in parallel with its parent process. The result of the new process is sent back to the parent process, which will then consume it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.
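The transformed code follows this spawn-then-receive-late pattern; a minimal sketch (expensive_task1/0, task2/0 and combine/2 are hypothetical):

    %% Before: R1 and R2 computed sequentially.
    %% After: expensive_task1 runs in a child process; the receive
    %% is placed just before R1 is actually needed.
    Parent = self(),
    Ref = make_ref(),
    spawn(fun() -> Parent ! {Ref, expensive_task1()} end),
    R2 = task2(),                       % independent work proceeds
    R1 = receive {Ref, V} -> V end,     % block only when R1 is needed
    combine(R1, R2).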

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies. For instance, an example might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are checked by program slicing, which is discussed below.

Support for Program Slicing. Program slicing is a general technique of program analysis for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE¹¹ [17], a tool developed in another EU project that also re-uses the Wrangler front end. This work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing in transforming programs to make them suitable for the introduction of parallel structures.

¹¹ http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf


Figure 21: The architecture of WombatOAM.

ii Scalable Deployment

We have developed the WombatOAM tool¹² to provide a deployment, operations and maintenance framework for large-scale Erlang distributed systems. These systems typically consist of a number of Erlang nodes executing on different hosts. These hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP.

¹² WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd.

Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this is the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture. The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability because of the role played by the central Master node; in its current


version an additional layer of Middle Managers was introduced to allow the system to scale easily to thousands of deployed nodes. As the diagram shows, the "northbound" interfaces to the web dashboard and the command-line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers that engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms, and so forth; we describe those now.

Services. WombatOAM is designed to collect, store and display various kinds of information and event from running Erlang systems, and these data are accessed and managed through an AJAX-based Web Dashboard; these include the following.

Metrics. WombatOAM supports the collection of some hundred metrics (including, for instance, the number of processes on a node, and the message traffic in and out of the node) on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom¹³, and interface with other tools, such as Graphite¹⁴, which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications. As well as metrics, WombatOAM supports event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which is part of the standard distri-

¹³ Folsom collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom
¹⁴ https://graphiteapp.org

bution, or the lager logging framework¹⁵, and will be displayed and logged as they occur.

Alarms. Alarms are more complex entities. Alarms have a state: they can be raised, and once dealt with they can be cleared. They also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as for notifications.

Topology. The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible, and if not, it notifies the other services and periodically tries to reconnect. When the nodes are available again, it also notifies the other services. It doesn't have a Middle Manager part, because it doesn't talk to the nodes directly; instead it asks the Node manager service to do so.

Node manager. This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration. This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers, the Orchestration

¹⁵ https://github.com/basho/lager


service uses an external library called Libcloud¹⁶, for which Erlang Solutions has written an open source Erlang wrapper called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications; it provides infrastructure for deploying them.

Deployment. The mechanism consists of the following five steps.

1. Registering a provider. WombatOAM provides the same interface for different cloud providers that support the OpenStack standard or the Amazon EC2 API. WombatOAM also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this enables WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

¹⁶ The unified cloud API [4]: https://libcloud.apache.org

4. Defining a deployment domain. At this step a deployment domain is created that specifies (i) which providers should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to or removed from the system depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single-host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue, and processes migrate between the queues through a work stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes.¹⁷

iii.1 Percept2

Percept2¹⁸ builds on the existing Percept tool to provide post hoc offline analysis and visuali-

¹⁷ This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.
¹⁸ https://github.com/RefactoringTools/percept2


Figure 22: Offline profiling of a distributed Erlang application using Percept2.

sation of Erlang systems. Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling for a distributed Erlang application using Percept2 is shown in Figure 22.
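A typical offline session follows this profile/analyse/view cycle; a minimal sketch (the module aco and the exact option lists are assumptions; consult the Percept2 documentation for the precise API):

    %% 1. Run the program under tracing, writing events to a file.
    percept2:profile("aco.dat", {aco, run, []}, [all]),
    %% 2. Analyse the trace file(s) into the RAM database.
    percept2:analyze(["aco.dat"]),
    %% 3. Browse the results at http://localhost:8888/.
    percept2:start_webserver(8888).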

Percept generates an application-level zoomable concurrency graph, showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways, as detailed by [53], including most importantly:

• Distinguishing between running and runnable time for each process; this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including: the migration history of a process between run queues; statistics about message passing between processes (the number of messages and the average message size sent/received by a process); the accumulated runtime per process; and the accumulated time when a process is in a running state.

• Presenting the process tree: the hierarchy


Figure 23: Percept2 showing processes running and runnable (with 4 schedulers/run queues).

structure indicating the parent-child relationships between processes.

• Recording the dynamic function callgraph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system. Unlike global group, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have also extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating measurement of a function's own execution time gives users the freedom of not profiling all the function calls invoked during the program execution. For example, they can choose to profile only functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of


generated web pages. Together these make the system more responsive and more scalable.

iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, and require neither special recompiled versions of the software to be examined, nor special post-processing tools to create meaningful information from the data gathered. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the size of run queues varies during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the

Figure 24: DTrace run-queue visualisation of the bang benchmark with 16 schedulers (time on X-axis).

same storage infrastructure and presentation facilities.

iii.3 Devo

Devo¹⁹ [10] is designed to provide real-time online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).

Figure 25 shows visualisations from Devo in both modes. On the left-hand side a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram) with six cores each; with hyperthreading each chip gives twelve virtual cores, and hence there are 24 run queues in total. The size of these run queues is shown by both the colour and height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle. A green arc shows mi-

¹⁹ https://github.com/RefactoringTools/devo


Figure 25: Devo low- and high-level visualisations of SD-Orbit.

gration on the same physical core, a grey one migration on the same chip, and blue shows migrations between the two chips.

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between the nodes (green quiescent; red busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. Tracing to be performed on monitored nodes is also specified within the configuration file.
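Such a configuration is naturally expressed as a file of Erlang terms; a hypothetical sketch (the article does not give the concrete format, and the keys shown are invented; the trace flags are standard erlang:trace/3 flags):

    %% Hypothetical SD-Mon configuration sketch: hosts, s_group
    %% partitions, and the tracing to apply on monitored nodes.
    {hosts, ['host1.example.org', 'host2.example.org']}.
    {s_groups, [{group1, ['node1@host1', 'node2@host1']},
                {group2, ['node3@host2']}]}.
    {tracing, [{group1, [send, 'receive']},
               {group2, [procs]}]}.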

An agent is started by a master SD-Mon node for each s_group and for each free node. Configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up it tries to get in contact with its nodes, and applies the tracing to them as stated by the master. Binary files are stored in the host file system. Tracing is internally used in order to track s_group operations happening at runtime. An asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is


Figure 26: SD-Mon architecture.

deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

The monitoring network is also supervised, in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected then that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution. Before (top) and after eliminating s_group 2 (bottom).


Figure 28: SD-Mon online monitoring.

VIII Systemic Evaluation

Preceding sections have investigated the improvements of individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI and VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state-of-the-art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable.²⁰ The bulk of the experiments reported here are conducted on the Athos cluster, using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed-size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure. The benchmarks

²⁰ See Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf


Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling].

also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot standard deviation. Results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups, i.e. beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails, due to the fact that some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5,000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.


Figure 30: WombatOAM deployment time (best fit: 0.0125N + 83 seconds for N nodes).

The measurement data shows two important facts: it shows that WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and that WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, while SR-ACO registers them only within an s_group (Section ii).
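At the code level the two versions differ only at each registration site; a minimal sketch (the group and process names are invented; the s_group call follows the SD Erlang library API):

    %% GR-ACO: name registered in the global namespace, replicated
    %% on every connected node.
    yes = global:register_name(submaster_1, Pid),
    %% SR-ACO: name registered only within one s_group.
    yes = s_group:register_name(group_1, submaster_1, Pid).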

For both GR-ACO and SR-ACO, a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine, using (globally synchronised) local recovery information. For example, on a common X86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, so around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark and obtained similar results [23].
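Such a service is a few lines of Erlang; a minimal sketch (not the code used in the experiments, and using the rand module of later Erlang/OTP releases):

    %% A minimal Chaos Monkey: kill one randomly chosen local process
    %% every second. (A real implementation would avoid killing
    %% system processes such as init.)
    chaos_loop() ->
        Pids = erlang:processes(),
        Victim = lists:nth(rand:uniform(length(Pids)), Pids),
        exit(Victim, kill),
        timer:sleep(1000),
        chaos_loop().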

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the


Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling].

distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.

Scalability. Figure 31 compares the runtimes of the ML, GR and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks and others that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang


Figure 32: Number of sent packets in ML-ACO, GR-ACO and SR-ACO.

preserves the distributed Erlang language-level reliability model.

As SD-Orbit scales better than D-Orbit, SR-ACO scales better than ML-ACO, and SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limits of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language, while preserving reliability. The work is a vade mecum for addressing scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering VM, language and persistent storage levels. We have developed the


BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15] or Legion [40]. We establish scientifically the folklore limitations of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups for partitioning the network and global process namespace, and semi-explicit process placement for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).

To improve the scalability of the Erlang VM and libraries, we have improved the implementation of shared ETS tables, time management and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large-scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the


Table 2: Cluster specifications (RAM per host in GB).

    Name     Hosts   Cores/host   Cores (max)   Cores (avail)   Processor                                   RAM      Interconnect
    GPG         20       16            320            320       Intel Xeon E5-2640v2, 8C, 2GHz               64      10GB Ethernet
    Kalkyl     348        8           2784           1408       Intel Xeon 5520v2, 4C, 2.26GHz              24-72    InfiniBand 20Gb/s
    TinTin     160       16           2560           2240       AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz    64-128   2:1 oversubscribed QDR InfiniBand
    Athos      776       24          18624           6144       Intel Xeon E5-2697v2, 12C, 2.7GHz            64      InfiniBand FDR14

key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large-scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages.²¹ For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3GHz, with 16 "Bulldozer" cores each, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

²¹ See Deliverable 6.7 (http://www.release-project.eu/documents/D67.pdf), available online.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha ACTORS A Model of ConcurrentComputation in Distributed Systems PhDthesis MIT 1985

[2] Gul Agha An overview of Actor lan-guages SIGPLAN Not 21(10)58ndash67 1986

[3] Bulldozer (microarchitecture) 2015httpsenwikipediaorgwiki

Bulldozer_(microarchitecture)

[4] Apache SF Liblcoud 2016 https

libcloudapacheorg

[5] C R Aragon and R G Seidel Random-ized search trees In Proceedings of the 30thAnnual Symposium on Foundations of Com-puter Science pages 540ndash545 October 1989

[6] Joe Armstrong Programming Erlang Soft-ware for a Concurrent World PragmaticBookshelf 2007

[7] Joe Armstrong Erlang Commun ACM5368ndash75 2010

[8] Stavros Aronis Nikolaos Papaspyrou Ka-terina Roukounaki Konstantinos Sago-nas Yiannis Tsiouris and Ioannis E

Venetis A scalability benchmark suite forErlangOTP In Torben Hoffman and JohnHughes editors Proceedings of the EleventhACM SIGPLAN Workshop on Erlang pages33ndash42 New York NY USA September2012 ACM

[9] Thomas Arts John Hughes Joakim Jo-hansson and Ulf Wiger Testing telecomssoftware with Quviq QuickCheck In Pro-ceedings of the 2006 ACM SIGPLAN work-shop on Erlang pages 2ndash10 New York NYUSA 2006 ACM

[10] Robert Baker Peter Rodgers SimonThompson and Huiqing Li Multi-level Vi-sualization of Concurrent and DistributedComputation in Erlang In Visual Lan-guages and Computing (VLC) The 19th In-ternational Conference on Distributed Multi-media Systems (DMS 2013) 2013

[11] Luiz Andreacute Barroso Jimmy Clidaras andUrs Houmllzle The Datacenter as a ComputerMorgan and Claypool 2nd edition 2013

[12] Basho Technologies Riakdocs BashoBench 2014 httpdocsbasho

comriaklatestopsbuilding

benchmarking

[13] J E Beasley OR-Library Distributingtest problems by electronic mail TheJournal of the Operational Research Soci-ety 41(11)1069ndash1072 1990 Datasetsavailable at httppeoplebrunelac

uk~mastjjbjeborlibwtinfohtml

[14] Cory Bennett and Ariel Tseitlin Chaosmonkey released into the wild NetflixBlog 2012

[15] Robert D Blumofe Christopher F JoergBradley C Kuszmaul Charles E LeisersonKeith H Randall and Yuli Zhou Cilk Anefficient multithreaded runtime system In

44

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Proceedings of the fifth ACM SIGPLAN sym-posium on Principles and practice of parallelprogramming pages 207ndash216 ACM 1995

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Koszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go programming language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Technical Report Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33(2):51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 1:1–1:1, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. Charm++: a portable concurrent object oriented system based on C++. In ACM Sigplan Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.

[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lübeck and Max Neunhöffer. Enumerating large Orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(3):207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing - 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale (3. ed., revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-Strength Functional programming: Experiences with the Ericsson AXD301 Project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.


commutativity, while providing locality, modularity, reliability and scalability [41].

Actors are just one message-based paradigm, and other languages and libraries have related message passing paradigms. Recent example languages include Go [28] and Rust [62], which provide explicit channels similar to actor mailboxes. Probably the most famous message passing library is MPI [72], with APIs for many languages, widely used on clusters and High Performance Computers. It is, however, arguable that the most important contribution of the actor model is one-way asynchronous communication [41]. Messages are not coupled with the sender, and neither are they transferred synchronously to a temporary container where transmission takes place, e.g. a buffer, a queue or a mailbox. Once a message is sent, the receiver is the only entity responsible for that message.

Erlang [6, 19] is the pre-eminent programming language based on the actor model, having a history of use in production systems, initially with its developer Ericsson and then more widely through open source adoption. There are now actor frameworks for many other languages; these include Akka for C#, F# and Scala [64], CAF (http://actor-framework.org) for C++, Pykka (http://pykka.readthedocs.org/en/latest), Cloud Haskell [31], PARLEY [51] for Python, and Termite Scheme [35], and each of these is currently under active use and development. Moreover, the recently defined Rust language [62] has a version of the actor model built in, albeit in an imperative context.

iii Erlang's Support for Concurrency

In Erlang, actors are termed processes, and virtual machines are termed nodes. The key elements of the actor model are fast process creation and destruction; lightweight processes, e.g. enabling 10^6 concurrent processes on a single host with 8GB RAM; fast asynchronous message passing with copying semantics; process monitoring; strong dynamic typing; and selective message reception.

By default, Erlang processes are addressed by their process identifier (pid), e.g.

Pong_PID = spawn(fun some_module:pong/0)

spawns a process to execute the anonymous function given as argument to the spawn primitive, and binds Pong_PID to the new process identifier. Here the new process will execute the pong/0 function, which is defined in some_module. A subsequent call

Pong_PID ! finish

sends the message finish to the process identified by Pong_PID. Alternatively, processes can be given names using a call of the form

register(my_funky_name, Pong_PID)

which registers this process name in the node's process name table if not already present. Subsequently these names can be used to refer to, or communicate with, the corresponding processes (e.g. send them a message):

my_funky_name ! hello
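For illustration, a minimal pong process of the kind assumed above might be written as follows (a sketch; the module, function and message names are ours):

-module(some_module).
-export([pong/0]).

%% A minimal server loop: reply to ping requests and
%% terminate on receiving 'finish'.
pong() ->
    receive
        {ping, From} ->
            From ! pong,
            pong();
        finish ->
            ok
    end.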

A distributed Erlang system executes on multiple nodes, and the nodes can be freely deployed across hosts, e.g. they can be located on the same or different hosts. To help make distribution transparent to the programmer, when any two nodes connect they do so transitively, sharing their sets of connections. Without considerable care, this quickly leads to a fully connected graph of nodes. A process may be spawned on an explicitly identified node, e.g.

Remote_Pong_PID = spawn(some_node,
    fun some_module:pong/0)

After this, the remote process can be addressed just as if it were local. It is a significant burden on the programmer to identify the remote nodes in large systems, and we will return to this in Sections ii and i2.


Figure 1: Conceptual view of Erlang's concurrency, multicore support and distribution.

iv Scalability and Reliability in Erlang Systems

Erlang was designed to solve a particular set of problems, namely those in building telecommunications infrastructure, where systems need to be scalable to accommodate hundreds of thousands of calls concurrently, in soft real-time. These systems need to be highly-available and reliable, i.e. to be robust in the case of failure, which can come from software or hardware faults. Given the inevitability of the latter, Erlang adopts the "let it fail" philosophy for error handling. That is, it encourages programmers to embrace the fact that a process may fail at any point, and to rely on the supervision mechanism, discussed shortly, to handle the failures.

Figure 1 illustrates Erlang's support for concurrency, multicores and distribution. Each Erlang node is represented by a yellow shape, and each rectangle represents a host with an IP address. Each red arc represents a connection between Erlang nodes. Each node can run on multiple cores, and exploit the inherent concurrency provided. This is done automatically by the VM, with no user intervention needed. Typically each core has an associated scheduler that schedules processes: a new process will be spawned on the same core as the process that spawns it, but work can be moved to a different scheduler through a work-stealing allocation algorithm. Each scheduler allows a process that is ready to compute at most a fixed number of computation steps before switching to another. Erlang built-in functions, or BIFs, are implemented in C and, at the start of the project, were run to completion once scheduled, causing performance and responsiveness problems if the BIF had a long execution time.

Scaling in Erlang is provided in two different ways. It is possible to scale within a single node by means of the multicore virtual machine exploiting the concurrency provided by the multiple cores or NUMA nodes. It is also possible to scale across multiple hosts using multiple distributed Erlang nodes.

Reliability in Erlang is multi-faceted. As in all actor languages, each process has private state, preventing a failed or failing process from corrupting the state of other processes. Messages enable stateful interaction, and contain a deep copy of the value to be shared, with no references (e.g. pointers) to the senders' internal state. Moreover, Erlang avoids type errors by enforcing strong typing, albeit dynamically [7]. Connected nodes check liveness with heartbeats, and can be monitored from outside Erlang, e.g. by an operating system process.

However, the most important way to achieve reliability is supervision, which allows a process to monitor the status of a child process and react to any failure, for example by spawning a substitute process to replace a failed process. Supervised processes can in turn supervise other processes, leading to a supervision tree. The supervising and supervised processes may be in different nodes and on different hosts, and hence the supervision tree may span multiple hosts or nodes.
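As a concrete illustration, a minimal OTP supervisor that restarts a failed worker might be sketched as follows (the module names are ours, and the child specification is deliberately simple):

-module(pong_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, pong_sup}, pong_sup, []).

%% one_for_one: if a child fails, only that child is
%% restarted, up to 5 restarts in 10 seconds.
init([]) ->
    {ok, {{one_for_one, 5, 10},
          [{pong_worker,
            {pong_server, start_link, []},   %% hypothetical worker module
            permanent, 5000, worker, [pong_server]}]}}.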

To provide reliable distributed service registration, a global namespace is maintained on every node, which maps process names to pids. It is this that we mean when we talk about a 'reliable' system: it is one in which a named process in a distributed system can be restarted without requiring the client processes also to be restarted (because the name can still be used for communication).

To see global registration in action, consider a pong server process:

global:register_name(pong_server, Remote_Pong_PID)

Clients of the server can send messages to the registered name, e.g.

global:whereis_name(pong_server) ! finish

If the server fails, the supervisor can spawn a replacement server process with a new pid and register it with the same name (pong_server). Thereafter, client messages to the pong_server will be delivered to the new server process. We return to discuss the scalability limitations of maintaining a global namespace in Section V.
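For example, the restart action might be sketched as follows (our own illustration; global:re_register_name/2 overwrites an existing registration):

%% Spawn a replacement server and re-register it under the
%% existing global name, so clients are unaffected.
restart_pong_server() ->
    New_Pong_PID = spawn(some_node, fun some_module:pong/0),
    yes = global:re_register_name(pong_server, New_Pong_PID),
    New_Pong_PID.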

v ETS: Erlang Term Storage

Erlang is a pragmatic language, and the actor model it supports is not pure. Erlang processes, besides communicating via asynchronous message passing, can also share data in public memory areas called ETS tables.

The Erlang Term Storage (ETS) mechanism is a central component of Erlang's implementation. It is used internally by many libraries, and underlies the in-memory databases. ETS tables are key-value stores: they store Erlang tuples where one of the positions in the tuple serves as the lookup key. An ETS table has a type that may be either set, bag or duplicate_bag, implemented as a hash table, or ordered_set, which is currently implemented as an AVL tree. The main operations that ETS supports are table creation, insertion of individual entries and atomic bulk insertion of multiple entries in a table, deletion and lookup of an entry based on some key, and destructive update. The operations are implemented as C built-in functions in the Erlang VM.

The code snippet below creates a set ETS table keyed by the first element of the entry; atomically inserts two elements with keys some_key and 42; updates the value associated with the table entry with key 42; and then looks up this entry:

Table = ets:new(my_table,
                [set, public, {keypos, 1}]),
ets:insert(Table, [
    {some_key, an_atom_value},
    {42, {a,tuple,value}}]),
ets:update_element(Table, 42,
    [{2, {another,tuple,value}}]),
[{Key, Value}] = ets:lookup(Table, 42)

ETS tables are heavily used in many Erlang applications. This is partly due to the convenience of sharing data for some programming tasks, but also partly due to their fast implementation. As a shared resource, however, ETS tables induce contention and become a scalability bottleneck, as we shall see in Section i.

III Benchmarks for scalability and reliability

The two benchmarks that we use throughout this article are Orbit, which measures scalability without looking at reliability, and Ant Colony Optimisation (ACO), which allows us to measure the impact on scalability of adding global namespaces to ensure reliability. The source code for the benchmarks, together with more detailed documentation, is available at https://github.com/release-project/benchmarks. The RELEASE project team also worked to improve the reliability and scalability of other Erlang programs, including a substantial (approximately 150K lines of Erlang code) Sim-Diasca simulator [16] and an Instant Messenger that is more typical of Erlang applications [23], but we do not cover these systematically here.

i Orbit

Figure 2: Distributed Erlang Orbit (D-Orbit) architecture: workers are mapped to nodes.

Orbit is a computation in symbolic algebra which generalises a transitive closure computation [57]. To compute the orbit for a given space [0..X], a list of generators g1, g2, ..., gn is applied to an initial vertex x0 ∈ [0..X]. This creates new values (x1, ..., xn) ∈ [0..X], where xi = gi(x0). The generator functions are applied to the new values until no new value is generated.
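A minimal sequential sketch of this fixpoint computation (our own formulation, not the benchmark code) is:

%% Compute the orbit of X0 under the generator functions Gs,
%% accumulating the set of seen values until no new value
%% is generated.
orbit(Gs, X0) ->
    orbit(Gs, [X0], sets:from_list([X0])).

orbit(_Gs, [], Seen) ->
    sets:to_list(Seen);
orbit(Gs, [X | Xs], Seen) ->
    New = [G(X) || G <- Gs],
    Fresh = [Y || Y <- New, not sets:is_element(Y, Seen)],
    Seen1 = lists:foldl(fun sets:add_element/2, Seen, Fresh),
    orbit(Gs, Fresh ++ Xs, Seen1).

For example, orbit([fun(X) -> (X * X + 1) rem 1000 end], 1) explores an orbit in the space [0..999].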

Orbit is a suitable benchmark because it has a number of aspects that characterise a class of real applications. The core data structure it maintains is a set, and in distributed environments it is implemented as a distributed hash table (DHT), similar to the DHTs used in replicated form in NoSQL database management systems. Also, in distributed mode, it uses standard peer-to-peer (P2P) techniques like a credit-based termination algorithm [61]. By choosing the orbit size, the benchmark can be parameterised to specify smaller or larger computations that are suitable to run on a single machine (Section i) or on many nodes (Section i). Moreover, it is only a few hundred lines of code.

As shown in Figure 2, the computation is initiated by a master which creates a number of workers. In the single node scenario of the benchmark, workers correspond to processes, but these workers can also spawn other processes to apply the generator functions on a subset of their input values, thus creating intra-worker parallelism. In the distributed version of the benchmark, processes are spawned by the master node to worker nodes, each maintaining a DHT fragment. A newly spawned process gets a share of the parent's credit, and returns this on completion. The computation is finished when the master node collects all credit, i.e. all workers have completed.

ii Ant Colony Optimisation (ACO)

Ant Colony Optimisation [29] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. For the purpose of this article, we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [63], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node and consisting of multiple "ants". Ants are simple computational agents which concurrently compute possible solutions to the input problem, guided by shared information about good paths through the search space; there is also a certain amount of stochastic variation which allows the ants to explore new directions. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution.

We implement four distributed coordination patterns for the same multi-colony ACO computation, as follows. In each implementation, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report their best solutions; the globally-best solution is then selected and is reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations.

Figure 3: Distributed Erlang Two-level Ant Colony Optimisation (TL-ACO) architecture.

Figure 4: Distributed Erlang Multi-level Ant Colony Optimisation (ML-ACO) architecture.

Two-level ACO (TL-ACO) has a single master node that collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 3 depicts the process and node placements of the TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on available nodes. In the next step, each colony process spawns NA ant processes on the local node. Each ant iterates IA times, returning its result to the colony master. Each colony iterates IM times, reporting their best solution to, and receiving the globally-best solution from, the master process. We validated the implementation by applying TL-ACO to a number of standard SMTWTP instances [25, 13, 34], obtaining good results in all cases, and confirmed that the number of perfect solutions increases as we increase the number of colonies.
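The master's collection step can be sketched as follows (a simplification with our own names; the real benchmark also manages node placement and the iteration parameters NA, IA and IM):

%% Spawn one colony per node, gather the best solutions,
%% and select the globally-best one.
run_colonies(Nodes) ->
    Master = self(),
    [spawn(Node, fun() -> Master ! {best, colony:run()} end)
     || Node <- Nodes],
    collect(length(Nodes), []).

collect(0, Solutions) ->
    lists:min(Solutions);   %% assuming lower cost is better
collect(N, Solutions) ->
    receive
        {best, S} -> collect(N - 1, [S | Solutions])
    end.

Here colony:run/0 stands for a colony's local iterations and is hypothetical.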

Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies, and thus could become a bottleneck. ML-ACO addresses this by having a tree of submasters (Figure 4), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from their children.

Globally Reliable ACO (GR-ACO). In ML-ACO, if a single colony fails to report back, the system will wait indefinitely. GR-ACO adds fault tolerance, supervising colonies so that a faulty colony can be detected and restarted, allowing the system to continue execution.

Scalable Reliable ACO (SR-ACO) also adds fault-tolerance, but using supervision within our new s_groups from Section i1; the architecture of SR-ACO is discussed in detail there.

IV Erlang Scalability Limits

This section investigates the scalability of Erlang at VM, language and persistent storage levels. An aspect we choose not to explore is the security of large scale systems, where for example one might imagine providing enhanced security for systems with multiple clusters or cloud instances connected by a Wide Area Network. We assume that existing security mechanisms are used, e.g. a Virtual Private Network.

i Scaling Erlang on a Single Host

To investigate Erlang scalability we built BenchErl, an extensible open source benchmark suite with a web interface (information about BenchErl is available at http://release.softlab.ntua.gr/bencherl). BenchErl shows how an application's performance changes when resources like cores or schedulers are added, or when options that control these resources change:

• the number of nodes, i.e. the number of Erlang VMs used, typically on multiple hosts;

• the number of cores per node;

• the number of schedulers, i.e. the OS threads that execute Erlang processes in parallel, and their binding to the topology of the cores of the underlying computer node;

• the Erlang/OTP release and flavor; and

• the command-line arguments used to start the Erlang nodes.

Using BenchErl, we investigated the scalability of an initial set of twelve benchmarks and two substantial Erlang applications, using a single Erlang node (VM) on machines with up to 64 cores, including the Bulldozer machine specified in Appendix A. This set of experiments, reported by [8], confirmed that some programs scaled well in the most recent Erlang/OTP release of the time (R15B01), but also revealed VM and language level scalability bottlenecks.

Figure 5: Runtime and speedup of two configurations of the Orbit benchmark (with and without intra-worker parallelism) using Erlang/OTP R15B01. The plots show time in ms and speedup against the number of schedulers.

Figure 5 shows runtime and speedup curves for the Orbit benchmark, where master and workers run on a single Erlang node in configurations with and without intra-worker parallelism. In both configurations the program scales: runtime continuously decreases as we add more schedulers to exploit more cores. The speedup of the benchmark without intra-worker parallelism, i.e. without spawning additional processes for the computation (green curve), is almost linear up to 32 cores, but increases less rapidly from that point on; we see a similar, but more clearly visible, pattern for the other configuration (red curve), where there is no performance improvement beyond 32 schedulers. This is due to the asymmetric characteristics of the machine's micro-architecture, which consists of modules that couple two conventional x86 out-of-order cores that share the early pipeline stages, the floating point unit, and the L2 cache with the rest of the module [3].

Some other benchmarks, however, did not scale well, or experienced significant slowdowns when run on many VM schedulers (threads). For example, the ets_test benchmark has multiple processes accessing a shared ETS table. Figure 6 shows runtime and speedup curves for ets_test on a 16-core (eight cores with hyperthreading) Intel Xeon-based machine. It shows that runtime increases beyond two schedulers, and that the program exhibits a slowdown instead of a speedup.

For many benchmarks there are obvious reasons for poor scaling, like limited parallelism in the application or contention for shared resources. The reasons for poor scaling are less obvious for other benchmarks, and it is exactly these that we have chosen to study in detail in subsequent work [8, 46].

Figure 6: Runtime and speedup of the ets_test BenchErl benchmark using Erlang/OTP R15B01.

A simple example is the parallel BenchErl benchmark, which spawns some n processes, each of which creates a list of m timestamps and, after checking that each timestamp in the list is strictly greater than the previous one, sends the result to its parent. Figure 7 shows that up to eight cores each additional core leads to a slowdown; thereafter a small speedup is obtained up to 32 cores, and then again a slowdown. A small aspect of the benchmark, easily overlooked, explains the poor scalability. The benchmark creates timestamps using the erlang:now/0 built-in function, whose implementation acquires a global lock in order to return a unique timestamp. That is, two calls to erlang:now/0, even from different processes, are guaranteed to produce monotonically increasing values. This lock is precisely the bottleneck in the VM that limits the scalability of this benchmark. We describe our work to address VM timing issues in Section iii.
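The heart of the benchmark can be reconstructed as follows (our own sketch from the description above; note that erlang:now/0 is deprecated in later Erlang/OTP releases):

%% Each of N processes builds a list of M timestamps and
%% checks that the list is strictly increasing; every call
%% to erlang:now/0 takes a global lock.
parallel(N, M) ->
    Master = self(),
    [spawn(fun() ->
               Ts = [erlang:now() || _ <- lists:seq(1, M)],
               Master ! {done, strictly_increasing(Ts)}
           end) || _ <- lists:seq(1, N)],
    [receive {done, true} -> ok end || _ <- lists:seq(1, N)].

strictly_increasing([T1, T2 | Ts]) ->
    T1 < T2 andalso strictly_increasing([T2 | Ts]);
strictly_increasing(_) ->
    true.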

Discussion. Our investigations identified contention for shared ETS tables, and for commonly-used shared resources like timers, as the key VM-level scalability issues. Section VI outlines how we addressed these issues in recent Erlang/OTP releases.

ii Distributed Erlang Scalability

Network Connectivity Costs. When any normal distributed Erlang nodes communicate, they share their connection sets, and this typically leads to a fully connected graph of nodes.

Figure 7: Runtime and speedup of the BenchErl benchmark called parallel using Erlang/OTP R15B01.

So a system with n nodes will maintain O(n²) connections, and these are relatively expensive TCP connections with continual maintenance traffic. This design aids transparent distribution, as there is no need to discover nodes, and the design works well for small numbers of nodes. However, at emergent server architecture scales, i.e. hundreds of nodes, this design becomes very expensive and system architects must switch from the default Erlang model, e.g. they need to start using hidden nodes that do not share connection sets.

We have investigated the scalability limits imposed by network connectivity costs using several Orbit calculations on two large clusters, Kalkyl and Athos, as specified in Appendix A. The Kalkyl results are discussed by [21], and Figure 29 in Section i shows representative results for distributed Erlang computing orbits with 2M and 5M elements on Athos. In all cases performance degrades beyond some scale (40 nodes for the 5M orbit, and 140 nodes for the 2M orbit). Figure 32 illustrates the additional network traffic induced by the fully connected network: it allows a comparison between the number of packets sent in a fully connected network (ML-ACO) with those sent in a network partitioned using our new s_groups (SR-ACO).

Global Information Costs. Maintaining global information is known to limit the scalability of distributed systems, and crucially the process namespaces used for reliability are global. To investigate the scalability limits imposed on distributed Erlang by such global information, we have designed and implemented DE-Bench, an open source, parameterisable and scalable peer-to-peer benchmarking framework [37, 36]. DE-Bench measures the throughput and latency of distributed Erlang commands on a cluster of Erlang nodes, and the design is influenced by the Basho Bench benchmarking tool for Riak [12]. Each DE-Bench instance acts as a peer, providing scalability and reliability by eliminating central coordination and any single point of failure.

To evaluate the scalability of distributed Erlang, we measure how adding more hosts increases the throughput, i.e. the total number of successfully executed distributed Erlang commands per experiment. Figure 8 shows the parameterisable internal workflow of DE-Bench. There are three classes of commands in DE-Bench: (i) Point-to-Point (P2P) commands, where a function with tunable argument size and computation time is run on a remote node; these include spawn, rpc, and synchronous calls to server processes, i.e. gen_server or gen_fsm. (ii) Global commands, which entail synchronisation across all connected nodes, such as global:register_name and global:unregister_name. (iii) Local commands, which are executed independently by a single node, e.g. whereis_name, a lookup in the local name table.

Figure 8: DE-Bench's internal workflow.
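For example (our own illustrative calls, not DE-Bench's internal code), the three command classes correspond to Erlang expressions such as:

de_bench_examples(Node, Pid) ->
    %% P2P command: run a function on a remote node.
    _ = rpc:call(Node, erlang, node, []),
    %% Global command: synchronised across all connected nodes.
    yes = global:register_name(some_name, Pid),
    %% Local command: a lookup in the node-local name table.
    Pid = global:whereis_name(some_name),
    %% Global again: remove the registration.
    _ = global:unregister_name(some_name),
    ok.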

The benchmarking is conducted on 10 to 100 host configurations of the Kalkyl cluster (in steps of 10), and measures the throughput of successful commands per second over 5 minutes. There is one Erlang VM on each host, and one DE-Bench instance on each VM. The full paper [37] investigates the impact of data size and computation time in P2P calls, both independently and in combination, and the scaling properties of the common Erlang/OTP generic server processes gen_server and gen_fsm.

Here we focus on the impact of different proportions of global commands, mixing global with P2P and local commands. Figure 9 shows that even a low proportion of global commands limits the scalability of distributed Erlang, e.g. just 0.01% global commands limits scalability to around 60 nodes. Figure 10 reports the latency of all commands, and shows that while the latencies for P2P and local commands are stable at scale, the latency of the global commands increases dramatically with scale. Both results illustrate that the impact of global operations on throughput and latency in a distributed Erlang system is severe.

Explicit Placement. While network connectivity and global information impact performance at scale, our investigations also identified explicit process placement as a programming issue at scale. Recall from Section iii that distributed Erlang requires the programmer to identify an explicit Erlang node (VM) when spawning a process. Identifying an appropriate node becomes a significant burden for large and dynamic systems. The problem is exacerbated in large distributed systems where (1) the hosts may not be identical, having different hardware capabilities or different software installed, and (2) communication times may be non-uniform: it may be fast to send a message between VMs on the same host, and slow if the VMs are on different hosts in a large distributed system.

These factors make it difficult to deploy applications, especially in a scalable and portable manner. Moreover, while the programmer may be able to use platform-specific knowledge to decide where to spawn processes to enable an application to run efficiently, if the application is then deployed on a different platform, or if the platform changes as hosts fail or are added, this knowledge becomes outdated.

Figure 9: Scalability vs. percentage of global commands in distributed Erlang.

Discussion. Our investigations confirm three language-level scalability limitations of Erlang from developer folklore. (1) Maintaining a fully connected network of Erlang nodes limits scalability; for example, Orbit is typically limited to just 40 nodes. (2) Global operations, and crucially the global operations required for reliability, i.e. to maintain a global namespace, seriously limit the scalability of distributed Erlang systems. (3) Explicit process placement makes it hard to build performance portable applications for large architectures. These issues cause designers of reliable large scale systems in Erlang to depart from the standard Erlang model, e.g. using techniques like hidden nodes and storing pids in data structures. In Section V we develop language technologies to address these issues.

iii Persistent Storage

Any large scale system needs reliable scalable persistent storage, and we have studied the scalability limits of Erlang persistent storage alternatives [38]. We envisage a typical large server having around 10^5 cores on around 100 hosts. We have reviewed the requirements for scalable and available persistent storage, and evaluated four popular Erlang DBMS against these requirements. For a target scale of around 100 hosts, Mnesia and CouchDB are, unsurprisingly, not suitable. However, Dynamo-style NoSQL DBMS like Cassandra and Riak have the potential to be.

We have investigated the current scalability limits of the Riak NoSQL DBMS using the Basho Bench benchmarking framework on a cluster with up to 100 nodes and independent disks. We found that the scalability limit of Riak version 1.1.1 is 60 nodes on the Kalkyl cluster. The study placed into the public scientific domain what was previously well-evidenced but anecdotal developer experience.

We have also shown that resources like memory, disk and network do not limit the scalability of Riak. By instrumenting the global and gen_server OTP libraries, we identified a specific Riak remote procedure call that fails to scale. We outline how later releases of Riak are refactored to eliminate the scalability bottlenecks.


Figure 10: Latency of commands as the number of Erlang nodes increases.

Discussion. We conclude that Dynamo-like NoSQL DBMSs have the potential to deliver reliable persistent storage for Erlang at our target scale of approximately 100 hosts. Specifically, an Erlang Cassandra interface is available, and Riak 1.1.1 already provides scalable and available persistent storage on 60 nodes. Moreover, the scalability of Riak is much improved in subsequent versions.

V Improving Language Scalability

This section outlines the Scalable Distributed (SD) Erlang libraries [21] we have designed and implemented to address the distributed Erlang scalability issues identified in Section ii. SD Erlang introduces two concepts to improve scalability. S_groups partition the set of nodes in an Erlang system to reduce network connectivity and partition global data (Section i1). Semi-explicit placement alleviates the issues of explicit process placement in large heterogeneous networks (Section i2). The two features are independent and can be used separately or in combination. We overview SD Erlang in Section i, and outline s_group semantics and validation in Sections ii and iii respectively.

i SD Erlang Design

i1 S_groups

S_groups reduce both the number of connections a node maintains, and the size of name spaces, i.e. they minimise global information. Specifically, names are registered on, and synchronised between, only the nodes within the s_group. An s_group has the following parameters: a name, a list of nodes, and a list of registered names. A node can belong to many s_groups or to none. If a node belongs to no s_group, it behaves as a usual distributed Erlang node.

The s_group library defines the functions shown in Table 1. Some of these functions manipulate s_groups and provide information about them, such as creating s_groups and providing a list of nodes from a given s_group. The remaining functions manipulate names registered in s_groups and provide information about these names. For example, to register a process Pid with name Name in s_group SGroupName, we use the following function.


Table 1: Summary of s_group functions

new_s_group(SGroupName, Nodes) - Creates a new s_group consisting of some nodes

delete_s_group(SGroupName) - Deletes an s_group

add_nodes(SGroupName, Nodes) - Adds a list of nodes to an s_group

remove_nodes(SGroupName, Nodes) - Removes a list of nodes from an s_group

s_groups() - Returns a list of all s_groups known to the node

own_s_groups() - Returns a list of s_group tuples the node belongs to

own_nodes() - Returns a list of nodes from all s_groups the node belongs to

own_nodes(SGroupName) - Returns a list of nodes from the given s_group

info() - Returns s_group state information

register_name(SGroupName, Name, Pid) - Registers a name in the given s_group

re_register_name(SGroupName, Name, Pid) - Re-registers a name (changes a registration) in a given s_group

unregister_name(SGroupName, Name) - Unregisters a name in the given s_group

registered_names({node, Node}) - Returns a list of all registered names on the given node

registered_names({s_group, SGroupName}) - Returns a list of all registered names in the given s_group

whereis_name(SGroupName, Name), whereis_name(Node, SGroupName, Name) - Return the pid of a name registered in the given s_group

send(SGroupName, Name, Msg), send(Node, SGroupName, Name, Msg) - Send a message to a name registered in the given s_group

The name will only be registered if the process is being executed on a node that belongs to the given s_group, and neither Name nor Pid is already registered in that group:

s_group:register_name(SGroupName, Name, Pid) -> yes | no
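For example, a group can be created and its namespace used as follows (a sketch; the node and process names are ours):

%% Create a two-node s_group, register a server in it, and
%% message the server through the group namespace.
{ok, group1, _Nodes} =
    s_group:new_s_group(group1, ['node1@host1', 'node2@host2']),
yes = s_group:register_name(group1, pong_server, Pong_PID),
Pong_PID = s_group:whereis_name(group1, pong_server),
s_group:send(group1, pong_server, finish)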

To illustrate the impact of s_groups on scalability, we repeat the global operations experiment from Section ii (Figure 9). In the SD Erlang experiment we partition the set of nodes into s_groups, each containing ten nodes, and hence the names are replicated and synchronised on just ten nodes, and not on all nodes as in distributed Erlang. The results in Figure 11 show that with 0.01% of global operations the throughput of distributed Erlang stops growing at 40 nodes, while the throughput of SD Erlang continues to grow linearly.

The connection topology of s_groups is extremely flexible: they may be organised into a hierarchy of arbitrary depth or branching, e.g. there could be multiple levels in the tree of s_groups; see Figure 12.

Figure 11: Global operations in distributed Erlang vs SD Erlang: throughput against number of nodes.

Moreover, it is not necessary to create a hierarchy of s_groups; for example, we have constructed an Orbit implementation using a ring of s_groups.

Given such a flexible way of organising distributed systems, key questions in the design of an SD Erlang system are the following. How should s_groups be structured? Depending on the reason the nodes are grouped, reducing the number of connections, or reducing the namespace, or both, s_groups can be freely structured as a tree, ring, or some other topology. How large should the s_groups be? Smaller s_groups mean more inter-group communication, but the synchronisation of the s_group state between the s_group nodes constrains the maximum size of s_groups. We have not found this constraint to be a serious restriction; for example, many s_groups are either relatively small, e.g. 10-node, internal or terminal elements in some topology, e.g. the leaves and nodes of a tree. How do nodes from different s_groups communicate? While any two nodes can communicate in an SD Erlang system, to minimise the number of connections communication between nodes from different s_groups is typically routed via gateway nodes that belong to both s_groups. How do we avoid single points of failure? For reliability, and to minimise communication load, multiple gateway nodes and processes may be required.

Information to make these design choices is provided by the tools in Section VII and by benchmarking. A further challenge is how to systematically refactor a distributed Erlang application into SD Erlang, and this is outlined in Section i. A detailed discussion of distributed system design and refactoring in SD Erlang is provided in a recent article [22].

We illustrate typical SD Erlang system designs by showing refactorings of the Orbit and ACO benchmarks from Section III. In both distributed and SD Erlang, the computation starts on the Master node, and the actual computation is done on the Worker nodes. In the distributed Erlang version all nodes are interconnected, and messages are transferred directly from the sending node to the receiving node (Figure 2). In contrast, in the SD Erlang version nodes are grouped into s_groups, and messages are transferred between different s_groups via Submaster nodes (Figure 13).


Figure 12: SD Erlang ACO (SR-ACO) architecture: a master process (Level 0), sub-master processes (Level 1), and colony processes with their ant processes on the colony nodes (Level 2); each sub-master and its colony nodes form an s_group.

A fragment of code that creates an s_group on a Sub-master node is as follows:

create_s_group(Master, GroupName, Nodes0) ->
    case s_group:new_s_group(GroupName, Nodes0) of
        {ok, GroupName, Nodes} ->
            Master ! {GroupName, Nodes};
        _ -> io:format("exception message")
    end.

Similarly, we introduce s_groups in the GR-ACO benchmark from Section ii to create Scalable Reliable ACO (SR-ACO); see Figure 12. Here, apart from reducing the number of connections, s_groups also reduce the global namespace information. That is, instead of registering the name of a pid globally, i.e. with all nodes, the name is registered with s_group:register_name/3 only on the nodes in the s_group.
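For instance, a colony master that is to be supervised within its own s_group might register itself as follows (the group and name are ours):

yes = s_group:register_name(colony_group, colony_master, self())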

A comparative performance evaluation of distributed Erlang and SD Erlang Orbit and ACO is presented in Section VIII.

i2 Semi-Explicit Placement

Recall from Section iii that distributed Erlang spawns a process onto an explicitly named Erlang node, e.g.

spawn(some_node, fun some_module:pong/0)

and also recall the portability and programming effort issues associated with such explicit placement in large scale systems, discussed in Section ii.

To address these issues we have developed a semi-explicit placement library that enables the programmer to select nodes on which to spawn processes based on run-time information about the properties of the nodes. For example, if a process performs a lot of computation one would like to spawn it on a node with considerable computation power, or if two processes are likely to communicate frequently then it would be desirable to spawn them on the same node, or on nodes with a fast interconnect.

Figure 13: SD Erlang (SD-Orbit) architecture.

We have implemented two Erlang libraries to support semi-explicit placement [60]. The first deals with node attributes, and describes properties of individual Erlang VMs and associated hosts, such as total and currently available RAM, installed software, hardware configuration, etc. The second deals with a notion of communication distances, which models the communication times between nodes in a distributed system. Therefore, instead of specifying a node, we can use the attr:choose_node/1 function to define the target node, i.e.

spawn(attr:choose_node(Params),
      fun some_module:pong/0)
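For example, with hypothetical attribute parameters (the library's actual constraint format is described by [60]), one might ask for a node with plenty of free memory:

%% Params is illustrative: choose a target node by run-time
%% attributes rather than by an explicit node name.
Params = [{mem_free, 8000000000}],
Node = attr:choose_node(Params),
spawn(Node, fun some_module:pong/0)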

[60] report an investigation into the communication latencies on a range of NUMA and cluster architectures, and demonstrate the effectiveness of the placement libraries using the ML-ACO benchmark on the Athos cluster.

ii S_group Semantics

For precise specification, and as a basis for validation, we provide a small-step operational semantics of the s_group operations [21]. Figure 14 defines the state of an SD Erlang system and associated abstract syntax variables. The abstract syntax variables on the left are defined as members of sets, denoted {...}, and these in turn may contain tuples, denoted (...), or further sets. In particular, nm is a process name, p a pid, ni a node_id, and nis a set of node_ids. The state of a system is modelled as a four-tuple comprising a set of s_groups, a set of free_groups, a set of free_hidden_groups, and a set of nodes. Each type of group is associated with nodes and has a namespace. An s_group additionally has a name, whereas a free_hidden_group consists of only one node,

i.e. a hidden node simultaneously acts as a node and as a group, because as a group it has a namespace, but does not share it with any other node. Free normal and hidden groups have no names, and are uniquely defined by the nodes associated with them. Therefore, group names, gr_names, are either NoGroup or a set of s_group_names. A namespace is a set of name and process id (pid) pairs, and is replicated on all nodes of the associated group.

A node has the following four parameters: its node_id identifier; node_type, which can be either hidden or normal; connections; and group_names, i.e. names of groups the node belongs to. The node can belong to either a list of s_groups or one of the free groups. The type of the free group is defined by the node type. Connections are a set of node_ids.

Transitions in the semantics have the form (state, command, ni) −→ (state′, value), meaning that executing command on node ni in state returns value and transitions to state′.

The semantics is presented in more detail by [21], but we illustrate it here with the s_group:registered_names/1 function from Section i1.

(grs, fgs, fhs, nds) ∈ state ≡ ({s_group}, {free_group}, {free_hidden_group}, {node})

gr ∈ grs ≡ {s_group} ≡ {(s_group_name, {node_id}, namespace)}

fg ∈ fgs ≡ {free_group} ≡ {({node_id}, namespace)}

fh ∈ fhs ≡ {free_hidden_group} ≡ {(node_id, namespace)}

nd ∈ nds ≡ {node} ≡ {(node_id, node_type, connections, gr_names)}

gs ∈ gr_names ≡ {NoGroup, {s_group_name}}

ns ∈ namespace ≡ {(name, pid)}

cs ∈ connections ≡ {node_id}

nt ∈ node_type ≡ {Normal, Hidden}

s ∈ {NoGroup, s_group_name}

Figure 14: SD Erlang state [21].

The function returns a list of names registered in s_group s if node ni belongs to the s_group, and an empty list otherwise (Figure 15). Here ⊕ denotes disjoint set union; IsSGroupNode returns true if node ni is a member of some s_group s, and false otherwise; and OutputNms returns a set of process names registered in the ns namespace of s_group s.

iii Semantics Validation

As the semantics is concrete, it can readily be made executable in Erlang, with lists replacing sets throughout. Having an executable semantics allows users to engage with it, and to understand how the semantics behaves vis à vis the library, giving them an opportunity to assess the correctness of the library against the semantics.

Figure 16: Testing s_groups using QuickCheck.

Better still, we can automatically assess how the system behaves in comparison with the (executable) semantics by executing them in lockstep, guided by the constraints of which operations are possible at each point. We do that by building an abstract state machine model of the library. We can then generate random sequences (or traces) through the model, with appropriate library data generated too. This random generation is supported by the QuickCheck property-based testing system [24, 9].
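A property of the following shape drives the lockstep execution (a sketch in QuickCheck's eqc_statem style; the actual model is more elaborate):

%% Requires -include_lib("eqc/include/eqc.hrl") and
%% -include_lib("eqc/include/eqc_statem.hrl").
prop_s_group() ->
    ?FORALL(Cmds, commands(?MODULE),
            begin
                {_History, _FinalState, Result} =
                    run_commands(?MODULE, Cmds),
                Result == ok
            end).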

The architecture of the testing framework is shown in Figure 16. First, an abstract state machine, embedded as an "eqc_statem" module, is derived from the executable semantic specification. The state machine defines the abstract state representation and the transition from one state to another when an operation is applied. Test case and data generators are then defined to control the test case generation.


((grs, fgs, fhs, nds), s_group:registered_names(s), ni)
    −→ ((grs, fgs, fhs, nds), nms)   if IsSGroupNode(ni, s, grs)
    −→ ((grs, fgs, fhs, nds), {})    otherwise

where
    {(s, {ni} ⊕ nis, ns)} ⊕ grs′ ≡ grs
    nms ≡ OutputNms(s, ns)

IsSGroupNode(ni, s, grs) = ∃ nis, ns, grs′ . {(s, {ni} ⊕ nis, ns)} ⊕ grs′ ≡ grs

OutputNms(s, ns) = {(s, nm) | (nm, p) ∈ ns}

Figure 15: SD Erlang semantics of the s_group:registered_names/1 function.

This includes the automatic generation of eligible s_group operations and the input data to those operations. Test oracles are encoded as the postcondition for s_group operations.

During testing, each test command is applied to both the abstract model and the s_group library. The application of the test command to the abstract model takes the abstract model from its current state to a new state, as described by the transition functions, whereas the application of the test command to the library leads the system to a new actual state. The actual state information is collected from each node in the distributed system, then merged and normalised to the same format as the abstract state representation. For a test to be successful, after the execution of a test command, the test oracles specified for this command should be satisfied. Various test oracles can be defined for s_group operations; for instance, one of the generic constraints that applies to all the s_group operations is that after each s_group operation the normalised system state should be equivalent to the abstract state.

Thousands of tests were run, and three kinds of errors, which have subsequently been corrected, were found. Some errors in the library implementation were found, including one error due to the synchronisation between nodes, and another related to the remove_nodes operation, which erroneously raised an exception. We also found a couple of trivial errors in the semantic specification itself, which had been missed by manual examination. Finally, we found some situations where there were inconsistencies between the semantics and the library implementation despite their states being equivalent; an example of this was in the particular values returned by functions on certain errors. Overall, the automation of testing boosted our confidence in the correctness of the library implementation and the semantic specification. This work is reported in more detail by [54].

VI Improving VM Scalability

This section reports the primary VM and library improvements we have designed and implemented to address the scalability and reliability issues identified in Section i.


i Improvements to Erlang Term Storage

Because ETS tables are so heavily used in Erlang systems, they are a focus for scalability improvements. We start by describing their redesign, including some improvements that pre-date our RELEASE project work, i.e. those prior to Erlang/OTP R15B03. These historical improvements are very relevant for a scalability study, and form the basis for our subsequent changes and improvements. At the point when Erlang/OTP got support for multiple cores (in release R11B), there was a single reader-writer lock for each ETS table. Optional fine-grained locking of hash-based ETS tables (i.e. set, bag or duplicate_bag tables) was introduced in Erlang/OTP R13B02-1, adding 16 reader-writer locks for the hash buckets. Reader groups to minimise read synchronisation overheads were introduced in Erlang/OTP R14B. The key observation is that a single count of the multiple readers must be synchronised across many cache lines, potentially far away in a NUMA system. Maintaining reader counts in multiple (local) caches makes reads fast, although writes must now check every reader count. In Erlang/OTP R16B the number of bucket locks and the default number of reader groups were both upgraded from 16 to 64.

We illustrate the scaling properties of the ETS concurrency options using the ets_bench BenchErl benchmark on the Intel NUMA machine with 32 hyperthreaded cores specified in Appendix A. The ets_bench benchmark inserts 1M items into the table, then records the time to perform 17M operations, where an operation is either a lookup, an insert, or a delete. The experiments are conducted on a hash-based (set) ETS table with different percentages of update operations, i.e. insertions or deletions.
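The access pattern can be sketched as follows. This is a simplified illustration, not the actual BenchErl code (and the measured releases pre-date the rand module used here for brevity); UpdateRatio would be 0.1 for 10% updates or 0.01 for 1% updates:

    run(UpdateRatio) ->
        T = ets:new(bench, [set, public]),
        %% Populate the table with 1M items.
        [true = ets:insert(T, {K, K}) || K <- lists:seq(1, 1000000)],
        %% Time 17M mixed operations.
        timer:tc(fun() -> loop(T, 17000000, UpdateRatio) end).

    loop(_T, 0, _Ratio) -> ok;
    loop(T, N, Ratio) ->
        Key = rand:uniform(1000000),
        case rand:uniform() < Ratio of
            true ->                          % update: insert or delete
                case rand:uniform(2) of
                    1 -> ets:insert(T, {Key, N});
                    2 -> ets:delete(T, Key)
                end;
            false -> ets:lookup(T, Key)      % read
        end,
        loop(T, N - 1, Ratio).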

Figure 17 shows the runtimes in seconds of 17M operations in different Erlang/OTP versions, varying the number of schedulers (x-axis), reflecting how the scalability of ETS tables has improved in more recent Erlang/OTP releases.

Figure 17: Runtime of 17M ETS operations in Erlang/OTP releases R11B-5, R13B02-1, R14B and R16B: 10% updates (left) and 1% updates (right). The y-axis shows time in seconds (lower is better), and the x-axis is the number of OS threads (VM schedulers).

Figure 18 shows the runtimes in seconds of 17M operations on an ETS table with different numbers of reader groups, again varying the number of schedulers. We see that one reader group is not sufficient with 10% updates, nor are two with 1% updates. Beyond that, different numbers of reader groups have little impact on the benchmark performance, except that using 64 groups with 10% updates slightly degrades performance.

We have explored four other extensions or redesigns of the ETS implementation for better scalability.


Figure 18: Runtime of 17M ETS operations with varying numbers of reader groups (1 to 64): 10% updates (left) and 1% updates (right). The y-axis shows runtime in seconds (lower is better), and the x-axis is the number of schedulers.

These are: (1) allowing more programmer control over the number of bucket locks in hash-based tables, so the programmer can reflect the number of schedulers and the expected access pattern; (2) using contention-adapting trees to get better scalability for ordered_set ETS tables, as described by [69]; (3) using queue delegation locking to improve scalability [47]; and (4) adopting schemes for completely eliminating the locks in the meta table. A more complete discussion of our work on ETS can be found in the papers by [69] and [47].

Here we outline only our work on contention-adapting (CA) trees. A CA tree monitors contention in different parts of a tree-shaped data structure, introducing routing nodes with locks in response to high contention, and removing them in response to low contention. For experimental purposes, two variants of the CA tree have been implemented to represent ordered_sets in the virtual machine of Erlang/OTP 17.0.

Figure 19: Throughput (operations per microsecond, against number of threads) of CA tree variants: 10% updates (left) and 1% updates (right), comparing set, ordset, AVL-CA tree and Treap-CA tree.

One variant extends the existing AVL trees in the Erlang VM, and the other uses a Treap data structure [5]. Figure 19 compares the throughput of the CA tree variants with that of ordered_set and set as the number of schedulers increases. It is unsurprising that the CA trees scale so much better than an ordered_set, which is protected by a single readers-writer lock. It is more surprising that they also scale better than set. This is due to hash tables using fine-grained locking at a fixed granularity, while CA trees can adapt the number of locks both to the current contention level and to the parts of the key range where contention is occurring.
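As background beyond this article: the experimental CA tree variants later matured into the production implementation, so on a recent Erlang/OTP release (22 or later) an ordered_set table with write_concurrency enabled is backed by a CA tree, and the adaptive locking is requested simply through the table options:

    %% On Erlang/OTP 22+ this ordered_set is backed by a CA tree;
    %% in the 17.0 experiments above, the prototype variants were used.
    T = ets:new(tab, [ordered_set, public, {write_concurrency, true}]).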


ii Improvements to Schedulers

In the Erlang VM, a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction-count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until it either blocks waiting for input, typically a message from some other process, or until it has executed its quota of reductions.

The Erlang VM is usually started with one scheduler per logical core (SMT thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned it is placed in the run queue of the scheduler of its parent, and it waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores; that is, an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Figure 24.

The default load management mechanism is load compaction, which aims to keep as many scheduler threads as possible fully loaded with work, i.e. it attempts to ensure that scheduler threads do not run out of work. We have developed a new optional scheduler utilisation balancing mechanism, available from Erlang/OTP 17.0. The new mechanism aims to balance scheduler utilisation between schedulers; that is, it strives for equal scheduler utilisation on all schedulers.

The scheduler utilisation balancing mechanism has no performance impact on the system when not enabled. On the other hand, when enabled it results in changed timing in the system; normally there is a small overhead due to measuring utilisation and calculating balancing information, which depends on the underlying primitives provided by the operating system.

The new balancing mechanism results in a better distribution of processes to schedulers, reducing the probability of core contention. Together with other VM improvements, such as interruptible BIFs and garbage collection, it results in lower latency and improved responsiveness, and hence reliability, for soft real-time applications.

iii Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel benchmark in Section i. The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in that returns "Erlang system time", or time since Epoch, with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, i.e. not SMT-enabled. The documentation promises that values returned by it are strictly increasing, and many applications ended up relying on this. For example, applications often employ erlang:now/0 to generate unique integers.
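With the new API (available from Erlang/OTP 18), such uses can be expressed scalably; a small sketch of the replacement:

    %% Old pattern (deprecated): a strictly increasing timestamp
    %% used as a unique value, serialised VM-wide.
    Unique0 = erlang:now().

    %% New API: a unique integer without global synchronisation;
    %% the monotonic option makes successive values strictly increasing.
    Unique1 = erlang:unique_integer([monotonic, positive]).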

Erlang system time should align with the operating system's view of time since Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot, without invalidating the strictly-increasing-value guarantee. The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when these two have deviated, e.g. in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], i.e. a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards and backwards, while system time is allowed to do this. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.
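Concretely, in the new API a caller obtains system time by combining the two quantities:

    %% Erlang system time is Erlang monotonic time plus a (dynamically
    %% varying) offset; both values are in native time units.
    SystemTime = erlang:monotonic_time() + erlang:time_offset().
    %% erlang:system_time() computes the same sum in a single call.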

Time Retrieval. Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe, but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, monotonic time delivered by the operating system is solely based on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP); that is, they will run with different frequencies. Linux is an exception, with a monotonic clock that is NTP-adjusted and runs with the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing monotonic time and system time delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute. If the adjustment needs to be changed, new adjustment information is published and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the time adjustment information previously published, and calculates Erlang monotonic time. To preserve monotonicity, it is important that all threads that read the same OS monotonic time map this to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information, using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed. This means that in the vast majority of cases the RW lock will be read-locked, which allows multiple readers to run concurrently. To prevent bouncing the lock cache-line, we use a bespoke reader-optimised RW lock implementation, where reader threads notify about their presence in counters on separate cache-lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11]; alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].

Timer Wheel and BIF Timer. The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel, which is used by processes executing on the scheduler.

The implementation of timers in Erlang/OTP uses a built-in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS tables, and only insert one timer at a time into the timer wheel.

Benchmarks. We have measured several benchmarks on a 16-core Bulldozer machine with eight dual CPU AMD Opteron 4376 HEs.5 We present three of them here.

The first micro-benchmark compares the execution time of an Erlang receive with that of a receive after that specifies a timeout and provides a default value. The receive after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, total execution time with the optimised timers is only 5% longer than without timers.
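A sketch of the two loops being compared (illustrative code, not the benchmark itself); the after clause forces a timer to be set each time the process blocks, and cancelled when the reply arrives:

    %% Receive without a timeout: no timer is involved.
    wait(Peer) ->
        Peer ! {self(), ping},
        receive pong -> ok end.

    %% Receive with a timeout: a timer is set when the process blocks,
    %% and cancelled when the message arrives.
    wait_with_timer(Peer) ->
        Peer ! {self(), ping},
        receive pong -> ok
        after 60000 -> timeout
        end.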

The second micro-benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4 and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on the left shows that: (1) the performance of time management has remained roughly unchanged between Erlang/OTP releases prior to 18.0; (2) the improved time management in Erlang/OTP 18.x makes time management less likely to be a scalability bottleneck even when using erlang:now/0; and (3) the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution.

5 See Section 2.5.4 of the RELEASE project Deliverable 2.4 (http://release-project.eu/documents/D24.pdf).

Figure 20: BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases: runtimes (left) and speedup obtained using erlang:monotonic_time/0 (right).

The graph on the right side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

VII Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable; to deploy them for scalability; and to monitor and visualise them.


Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling and debugging Erlang systems that can be used separately or together, as appropriate for solving the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on, or specialise, the services offered by the Erlang virtual machine through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer6 application gives a comprehensive overview of many of these data on a node-by-node basis.

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature; that is, the tools support the language itself rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use the tracing support provided by the Erlang VM, so other actor frameworks, e.g. Akka, can use the Kamon7 JVM monitoring system.

6 http://www.erlang.org/doc/apps/observer
7 http://kamon.io

Similarly, tools for other actor languages or frameworks could use data derived through OS-level tracing frameworks, via DTrace8 and SystemTap9 probes, as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

i Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare it for modification or extension, or, as is the case here, in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support in a refactoring tool. There are a number of tools that support refactoring in Erlang: in the RELEASE project we have chosen to extend Wrangler10 [56]; other tools include Tidier [68] and RefactorErl [44].

Supporting API Migration. The SD Erlang libraries modify Erlang's global_group library, which becomes the new s_group library as a result; Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolves, and this often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically.

8 http://dtrace.org/blogs/about
9 https://sourceware.org/systemtap/wiki
10 http://www.cs.kent.ac.uk/projects/wrangler


The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work, and the technicalities of the implementation, can be found in a paper by [52].
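As a hedged illustration (the concrete function mappings here are hypothetical, chosen only to show the shape), an adaptor defining old global_group functions in terms of the new s_group API might look as follows; Wrangler then generates the client-side migration refactoring from such definitions:

    -module(global_group_adaptor).
    -export([registered_names/1, own_nodes/0]).

    %% Each old API function is expressed as a call to the new API.
    registered_names({node, Node}) ->
        s_group:registered_names({node, Node}).

    own_nodes() ->
        s_group:own_nodes().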

Support for Introducing Parallelism. We have introduced support for parallelising explicit list operations (map and foreach), for process introduction to complete a computationally intensive task in parallel, for introducing a worker process to deal with call handling in an Erlang "generic server", and for parallelising a tail-recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach. The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward; even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
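For illustration, a minimal parallel map in the style provided by para_lib (a sketch, not para_lib's actual implementation) can be written as:

    %% Spawn one process per element; collect results in list order.
    pmap(F, Xs) ->
        Parent = self(),
        Refs = [begin
                    Ref = make_ref(),
                    spawn_link(fun() -> Parent ! {Ref, F(X)} end),
                    Ref
                end || X <- Xs],
        [receive {Ref, Res} -> Res end || Ref <- Refs].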

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute a task in parallel with its parent process. The result of the new process is sent back to the parent process, which will then consume it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.
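The shape of the transformation is sketched below (function names hypothetical): the first task is floated out into a new process, and the receive is placed immediately before the first use of its result:

    %% Before the refactoring:
    %%   R1 = task_a(),
    %%   R2 = task_b(),
    %%   use(R1, R2).
    %% After: task_a/0 runs in a new process, in parallel with task_b/0.
    Parent = self(),
    Ref = make_ref(),
    spawn(fun() -> Parent ! {Ref, task_a()} end),
    R2 = task_b(),
    R1 = receive {Ref, V} -> V end,   % placed just before R1 is used
    use(R1, R2).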

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies. For instance, an example might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are checked by program slicing, which is discussed below.

Support for Program Slicing. Program slicing is a general technique of program analysis for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE11 [17], a tool developed in another EU project that also re-uses the Wrangler front end. This work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing in transforming programs to make them suitable for the introduction of parallel structures.

11 http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf


Figure 21: The architecture of WombatOAM

ii Scalable Deployment

We have developed the WombatOAM tool12 to provide a deployment, operations and maintenance framework for large-scale Erlang distributed systems. These systems typically consist of a number of Erlang nodes executing on different hosts. These hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP.

12 WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd.

Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this is the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture. The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability, because of the role played by the central Master node;


in its current version an additional layer of Middle Managers was introduced, to allow the system to scale easily to thousands of deployed nodes. As the diagram shows, the "northbound" interfaces to the web dashboard and the command line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms, and so forth; we describe those now.

Services. WombatOAM is designed to collect, store and display various kinds of information and events from running Erlang systems, and these data are accessed and managed through an AJAX-based Web Dashboard. These include the following:

Metrics: WombatOAM supports the collection of some hundred metrics, including, for instance, numbers of processes on a node and the message traffic in and out of the node, on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom13, and can interface with other tools, such as Graphite14, which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications: As well as metrics, WombatOAM supports event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which is part of the standard distribution, or the lager logging framework15, and will be displayed and logged as they occur.

13 Folsom collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom
14 https://graphiteapp.org


Alarms: Alarms are more complex entities. Alarms have a state: they can be raised and, once dealt with, they can be cleared. They also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as for notifications.

Topology: The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible; if not, it notifies the other services and periodically tries to reconnect. When the nodes are available again, it also notifies the other services. It doesn't have a middle manager part, because it doesn't talk to the nodes directly; instead it asks the Node manager service to do so.

Node manager: This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration: This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances.

15 https://github.com/basho/lager


For communicating with the cloud providers, the Orchestration service uses an external library called Libcloud16, for which Erlang Solutions has written an open source Erlang wrapper called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications: it provides infrastructure for deploying them.

Deployment. The mechanism consists of the following five steps:

1. Registering a provider. WombatOAM provides the same interface for different cloud providers that support the OpenStack standard or the Amazon EC2 API. WombatOAM also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this enables WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

16 The unified cloud API [4]: https://libcloud.apache.org

4. Defining a deployment domain. At this step a deployment domain is created, specifying (i) which providers should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to or removed from the system, depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and in adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single-host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue, and processes migrate between the queues through a work stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes.17

iii.1 Percept2

Percept218 builds on the existing Percept tool to provide post hoc offline analysis and visualisation of Erlang systems.

17 This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.

18 https://github.com/RefactoringTools/percept2


Figure 22: Offline profiling of a distributed Erlang application using Percept2

Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling for a distributed Erlang application using Percept2 is shown in Figure 22.
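A typical offline profiling session then has three steps. The sketch below assumes that Percept2 keeps the profile/analyze/start_webserver entry points of the original Percept; the file names, entry point and option flags are illustrative:

    %% 1. Run the program under trace, writing events to file.
    percept2:profile("my_trace", {my_app, run, [Args]}, [message]).
    %% 2. Analyse the trace file(s) into the RAM database.
    percept2:analyze(["my_trace0.dat"]).
    %% 3. Browse the results through the web interface.
    percept2:start_webserver(8888).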

Percept generates an application-level zoomable concurrency graph, showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways, as detailed by [53], including most importantly:

• Distinguishing between running and runnable time for each process: this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including: the migration history of a process between run queues; statistics about message passing between processes, i.e. the number of messages and the average message size sent/received by a process; the accumulated runtime per process; and the accumulated time when a process is in a running state.

• Presenting the process tree: the hierarchy structure indicating the parent-child relationships between processes.

Figure 23: Percept2 showing processes running and runnable (with 4 schedulers/run queues)

• Recording dynamic function call graph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system. Unlike global group, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating measurement of a function's own execution time gives users the freedom of not profiling all the function calls invoked during the program execution. For example, they can choose to profile only functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph, a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of generated web pages. Together these make the system more responsive and more scalable.

iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, and require neither specially recompiled versions of the software to be examined, nor special post-processing tools to create meaningful information from the data gathered. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the sizes of the run queues vary during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the same storage infrastructure and presentation facilities.

Figure 24: DTrace run-queue visualisation of the bang benchmark with 16 schedulers (one line per run queue, time on the x-axis)


iii.3 Devo

Devo19 [10] is designed to provide real-time online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).

Figure 25 shows visualisations from Devo in both modes. On the left-hand side a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram), with six cores each; with hyperthreading this gives twelve virtual cores per chip, and hence 24 run queues in total. The size of these run queues is shown by both the colour and height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle.

19 https://github.com/RefactoringTools/devo


Figure 25: Devo low- and high-level visualisations of SD-Orbit

A green arc shows migration on the same physical core, a grey one migration on the same chip, and blue shows migrations between the two chips.

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between the nodes (green quiescent, red busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. Tracing to be performed on monitored nodes is also specified within the configuration file.

An agent is started by a master SD-Mon node for each s_group and for each free node. Configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the user interface.

Each agent takes care of an s_group or of a free node. At start-up it tries to get in contact with its nodes, and applies the tracing to them as stated by the master. Binary files are stored in the host file system. Tracing is internally used in order to track s_group operations happening at runtime; an asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is


Figure 26: SD-Mon architecture

deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

The monitoring network is also supervised, in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected then that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution, before (top) and after eliminating s_group 2 (bottom)


Figure 28: SD-Mon online monitoring

VIII Systemic Evaluation

Preceding sections have investigated the improvements of individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI and VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state-of-the-art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable.20 The bulk of the experiments reported here are conducted on the Athos cluster, using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed-size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure.

20 See Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf


Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements), relative speedup against number of nodes (cores) [Strong Scaling]

The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot standard deviation. Results show that D-Orbit performs better on a small number of nodes, as communication is direct, rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups: beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails, due to the fact that some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster, is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.


Figure 30: Wombat deployment time, time (sec) against number of nodes N; the best fit to the experimental data is 0.0125N + 83

The measurement data shows two important facts: that WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and that WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, while SR-ACO registers them only within an s_group (Section ii).

For both GR-ACO and SR-ACO, a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine, using (globally synchronised) local recovery information. For example, on a common x86/Ubuntu platform, typical Erlang process recovery times are around 0.3ms, i.e. around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark and obtained similar results [23].
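Recovery of this kind is typically implemented with standard OTP supervision: each killed process is restarted by its supervisor from local recovery information. A minimal sketch (module and child names hypothetical):

    -module(colony_sup).
    -behaviour(supervisor).
    -export([start_link/0, init/1]).

    start_link() ->
        supervisor:start_link({local, ?MODULE}, ?MODULE, []).

    %% Restart a crashed colony worker, tolerating up to ten
    %% failures per second before escalating.
    init([]) ->
        {ok, {{one_for_one, 10, 1},
              [{colony, {colony, start_link, []},
                permanent, 5000, worker, [colony]}]}}.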

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the distributed Erlang reliability model.


Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling]: execution time (s) against number of nodes (cores) for GR-ACO, SR-ACO and ML-ACO

The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.

Scalability. Figure 31 compares the runtimes of the ML, GR and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and the fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks and others that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.


Figure 32: Number of sent packets in ML-ACO, GR-ACO and SR-ACO


As SD-Orbit scales better than D-Orbit, as SR-ACO scales better than ML-ACO, and as SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limits of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language, while preserving reliability. The work is a vade mecum for addressing scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering VM, language and persistent storage levels. We have developed the BenchErl and DE-Bench tools for this purpose.


Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15], or Legion [40]. We establish scientifically the folklore limitation of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups, for partitioning the network and global process namespace, and semi-explicit process placement, for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).

To improve the scalability of the Erlang VM and libraries, we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation, such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large-scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores).


Table 2: Cluster Specifications (RAM per host in GB)

Name     Hosts   Cores     Cores    Cores        Processor                                   RAM      Interconnection
                 per host  total    max avail
GPG      20      16        320      320          Intel Xeon E5-2640v2, 8C, 2GHz              64       10GB Ethernet
Kalkyl   348     8         2784     1408         Intel Xeon 5520v2, 4C, 2.26GHz              24–72    InfiniBand 20Gb/s
TinTin   160     16        2560     2240         AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz    64–128   2:1 oversubscribed QDR Infiniband
Athos    776     24        18624    6144         Intel Xeon E5-2697v2, 12C, 2.7GHz           64       Infiniband FDR14

Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60 node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes (VMs) and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process namespace, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages.21 For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3 GHz, each with 16 "Bulldozer" cores, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

21 See Deliverable 6.7 (http://www.release-project.eu/documents/D67.pdf), available online.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level Visualization of Concurrent and Distributed Computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riakdocs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Kőszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go Programming Language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Technical Report Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary, adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 1–1, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. In ACM SIGPLAN Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.

[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lübeck and Max Neunhöffer. Enumerating large Orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing - 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale (3. ed., revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-Strength Functional programming: Experiences with the Ericsson AXD301 Project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.


Figure 1: Conceptual view of Erlang's concurrency, multicore support and distribution.

iv Scalability and Reliability in Erlang Systems

Erlang was designed to solve a particular set of problems, namely those in building telecommunications infrastructure, where systems need to be scalable to accommodate hundreds of thousands of calls concurrently, in soft real-time. These systems need to be highly-available and reliable, i.e. to be robust in the case of failure, which can come from software or hardware faults. Given the inevitability of the latter, Erlang adopts the "let it fail" philosophy for error handling: that is, encourage programmers to embrace the fact that a process may fail at any point, and have them rely on the supervision mechanism, discussed shortly, to handle the failures.

Figure 1 illustrates Erlang's support for concurrency, multicores and distribution. Each Erlang node is represented by a yellow shape, and each rectangle represents a host with an IP address. Each red arc represents a connection between Erlang nodes. Each node can run on multiple cores, and exploit the inherent concurrency provided. This is done automatically by the VM, with no user intervention needed. Typically each core has an associated scheduler that schedules processes; a new process will be spawned on the same core as the process that spawns it, but work can be moved to a different scheduler through a work-stealing allocation algorithm. Each scheduler allows a process that is ready to compute at most a fixed number of computation steps before switching to another. Erlang built-in functions, or BIFs, are implemented in C, and at the start of the project were run to completion once scheduled, causing performance and responsiveness problems if the BIF had a long execution time.

Scaling in Erlang is provided in two different ways. It is possible to scale within a single node by means of the multicore virtual machine exploiting the concurrency provided by the multiple cores or NUMA nodes. It is also possible to scale across multiple hosts using multiple distributed Erlang nodes.

Reliability in Erlang is multi-faceted. As in all actor languages each process has private state, preventing a failed or failing process from corrupting the state of other processes. Messages enable stateful interaction, and contain a deep copy of the value to be shared, with no references (e.g. pointers) to the senders' internal state. Moreover, Erlang avoids type errors by enforcing strong typing, albeit dynamically [7]. Connected nodes check liveness with heartbeats, and can be monitored from outside Erlang, e.g. by an operating system process.

However, the most important way to achieve reliability is supervision, which allows a process to monitor the status of a child process and react to any failure, for example by spawning a substitute process to replace a failed process. Supervised processes can in turn supervise other processes, leading to a supervision tree. The supervising and supervised processes may be in different nodes, and on different hosts, and hence the supervision tree may span multiple hosts or nodes.
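To make the supervision pattern concrete, the following is a minimal sketch (the module and child names are ours, and we assume a pong_server module exporting start_link/0; neither is taken from the article's benchmarks):

    -module(pong_sup).
    -behaviour(supervisor).
    -export([start_link/0, init/1]).

    start_link() ->
        supervisor:start_link({local, pong_sup}, ?MODULE, []).

    init([]) ->
        %% one_for_one: restart only the child that fails,
        %% allowing at most 3 restarts in any 10-second window.
        {ok, {{one_for_one, 3, 10},
              [{pong_server,                     % child id
                {pong_server, start_link, []},   % {Module, Function, Args}
                permanent,                       % always restart on failure
                5000,                            % shutdown timeout in ms
                worker,
                [pong_server]}]}}.

If the pong_server process crashes, the supervisor spawns a replacement, which is exactly the behaviour relied upon by the registration example below.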

To provide reliable distributed service registration, a global namespace is maintained on every node, which maps process names to pids. It is this that we mean when we talk about a 'reliable' system: it is one in which a named process in a distributed system can be restarted without requiring the client processes also to be restarted (because the name can still be used for communication).

To see global registration in action, consider a pong server process:

    global:register_name(pong_server,
                         Remote_Pong_PID)

Clients of the server can send messages to the registered name, e.g.:

    global:whereis_name(pong_server) ! finish

If the server fails the supervisor can spawn a replacement server process with a new pid and register it with the same name (pong_server). Thereafter client messages to the pong_server will be delivered to the new server process. We return to discuss the scalability limitations of maintaining a global namespace in Section V.

v ETS: Erlang Term Storage

Erlang is a pragmatic language, and the actor model it supports is not pure: Erlang processes, besides communicating via asynchronous message passing, can also share data in public memory areas called ETS tables.

The Erlang Term Storage (ETS) mechanism is a central component of Erlang's implementation. It is used internally by many libraries, and underlies the in-memory databases. ETS tables are key-value stores: they store Erlang tuples where one of the positions in the tuple serves as the lookup key. An ETS table has a type that may be either set, bag or duplicate_bag, implemented as a hash table, or ordered_set, which is currently implemented as an AVL tree. The main operations that ETS supports are table creation, insertion of individual entries and atomic bulk insertion of multiple entries in a table, deletion and lookup of an entry based on some key, and destructive update. The operations are implemented as C built-in functions in the Erlang VM.

The code snippet below creates a set ETS table keyed by the first element of the entry, atomically inserts two elements with keys some_key and 42, updates the value associated with the table entry with key 42, and then looks up this entry:

    Table = ets:new(my_table,
                    [set, public, {keypos, 1}]),
    ets:insert(Table, [{some_key, an_atom_value},
                       {42, {a,tuple,value}}]),
    ets:update_element(Table, 42,
                       [{2, {another,tuple,value}}]),
    [{Key, Value}] = ets:lookup(Table, 42).

ETS tables are heavily used in many Erlang applications. This is partly due to the convenience of sharing data for some programming tasks, but also partly due to their fast implementation. As a shared resource, however, ETS tables induce contention and become a scalability bottleneck, as we shall see in Section i.
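As an aside, ETS offers per-table options that can reduce, though not eliminate, such contention; the sketch below (the table name is ours) requests the concurrency-optimised variants whose implementation is discussed in Section VI:

    %% write_concurrency enables finer-grained (bucket) locking for
    %% concurrent writers; read_concurrency optimises for concurrent
    %% readers, at some cost for workloads that mix reads and writes.
    Shared = ets:new(shared_table,
                     [set, public,
                      {write_concurrency, true},
                      {read_concurrency, true}]).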

III Benchmarks for Scalability and Reliability

The two benchmarks that we use throughout this article are Orbit, which measures scalability without looking at reliability, and Ant Colony Optimisation (ACO), which allows us to measure the impact on scalability of adding global namespaces to ensure reliability. The source code for the benchmarks, together with more detailed documentation, is available at https://github.com/release-project/benchmarks. The RELEASE project team also worked to improve the reliability and scalability of other Erlang programs, including a substantial (approximately 150K lines of Erlang code) Sim-Diasca simulator [16] and an Instant Messenger that is more typical of Erlang applications [23], but we do not cover these systematically here.

i Orbit

Orbit is a computation in symbolic algebra which generalises a transitive closure computation [57].


Figure 2: Distributed Erlang Orbit (D-Orbit) architecture: workers are mapped to nodes.

To compute the orbit for a given space [0, X], a list of generators g1, g2, ..., gn are applied on an initial vertex x0 ∈ [0, X]. This creates new values (x1, ..., xn) ∈ [0, X], where xi = gi(x0). The generator functions are applied on the new values until no new value is generated.
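A minimal sequential sketch of this fixpoint computation (our own illustration; the function names are not from the benchmark) may clarify the structure:

    %% Compute the orbit of X0 under the generators Gens: repeatedly
    %% apply every generator to the frontier of new values until no
    %% unseen value is produced.
    orbit(Gens, X0) ->
        orbit(Gens, [X0], sets:from_list([X0])).

    orbit(_Gens, [], Seen) ->
        sets:to_list(Seen);
    orbit(Gens, [X | Frontier], Seen) ->
        New    = [G(X) || G <- Gens],
        Unseen = [Y || Y <- New, not sets:is_element(Y, Seen)],
        Seen1  = lists:foldl(fun sets:add_element/2, Seen, Unseen),
        orbit(Gens, Unseen ++ Frontier, Seen1).

The distributed benchmark parallelises exactly this loop: the set Seen becomes a distributed hash table, and the frontier is processed by worker processes.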

Orbit is a suitable benchmark because it has a number of aspects that characterise a class of real applications. The core data structure it maintains is a set, and in distributed environments it is implemented as a distributed hash table (DHT), similar to the DHTs used in replicated form in NoSQL database management systems. Also, in distributed mode, it uses standard peer-to-peer (P2P) techniques like a credit-based termination algorithm [61]. By choosing the orbit size, the benchmark can be parameterised to specify smaller or larger computations that are suitable to run on a single machine (Section i) or on many nodes (Section i). Moreover, it is only a few hundred lines of code.

As shown in Figure 2, the computation is initiated by a master which creates a number of workers. In the single node scenario of the benchmark, workers correspond to processes, but these workers can also spawn other processes to apply the generator functions on a subset of their input values, thus creating intra-worker parallelism. In the distributed version of the benchmark, processes are spawned by the master node to worker nodes, each maintaining a DHT fragment. A newly spawned process gets a share of the parent's credit, and returns this on completion. The computation is finished when the master node collects all credit, i.e. all workers have completed.

ii Ant Colony Optimisation (ACO)

Ant Colony Optimisation [29] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. For the purpose of this article we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [63], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node and consisting of multiple "ants". Ants are simple computational agents which concurrently compute possible solutions to the input problem, guided by shared information about good paths through the search space; there is also a certain amount of stochastic variation which allows the ants to explore new directions. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution.

We implement four distributed coordination patterns for the same multi-colony ACO computation, as follows.

Figure 3: Distributed Erlang Two-level Ant Colony Optimisation (TL-ACO) architecture.


Figure 4: Distributed Erlang Multi-level Ant Colony Optimisation (ML-ACO) architecture.

In each implementation the individual colonies perform some number of local iterations (i.e. generations of ants), and then report their best solutions; the globally-best solution is then selected, and is reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations.

Two-level ACO (TL-ACO) has a single master node that collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 3 depicts the process and node placements of the TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on available nodes. In the next step each colony process spawns NA ant processes on the local node. Each ant iterates IA times, returning its result to the colony master. Each colony iterates IM times, reporting their best solution to, and receiving the globally-best solution from, the master process. We validated the implementation by applying TL-ACO to a number of standard SMTWTP instances [25, 13, 34], obtaining good results in all cases, and confirmed that the number of perfect solutions increases as we increase the number of colonies.

Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies, and thus could become a bottleneck. ML-ACO addresses this by having a tree of submasters (Figure 4), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from their children.

Globally Reliable ACO (GR-ACO). In ML-ACO, if a single colony fails to report back, the system will wait indefinitely. GR-ACO adds fault tolerance, supervising colonies so that a faulty colony can be detected and restarted, allowing the system to continue execution.

Scalable Reliable ACO (SR-ACO) also adds fault-tolerance, but using supervision within our new s_groups from Section i.1; the architecture of SR-ACO is discussed in detail there.

IV Erlang Scalability Limits

This section investigates the scalability of Erlang at VM, language, and persistent storage levels. An aspect we choose not to explore is the security of large scale systems, where for example one might imagine providing enhanced security for systems with multiple clusters or cloud instances connected by a Wide Area Network. We assume that existing security mechanisms are used, e.g. a Virtual Private Network.

i Scaling Erlang on a Single Host

To investigate Erlang scalability we built BenchErl, an extensible open source benchmark suite with a web interface.4 BenchErl shows how an application's performance changes when resources, like cores or schedulers, are added, or when options that control these resources change:

• the number of nodes, i.e. the number of Erlang VMs used, typically on multiple hosts;

• the number of cores per node;

• the number of schedulers, i.e. the OS threads that execute Erlang processes in parallel, and their binding to the topology of the cores of the underlying computer node;

• the Erlang/OTP release and flavor; and

• the command-line arguments used to start the Erlang nodes.

Using BenchErl, we investigated the scalability of an initial set of twelve benchmarks and two substantial Erlang applications, using a single Erlang node (VM) on machines with up to 64 cores, including the Bulldozer machine specified in Appendix A.

4 Information about BenchErl is available at http://release.softlab.ntua.gr/bencherl.

Figure 5: Runtime and speedup of two configurations of the Orbit benchmark using Erlang/OTP R15B01 (runtime in ms and speedup, plotted against the number of schedulers, with and without intra-worker parallelism).

This set of experiments, reported by [8], confirmed that some programs scaled well in the most recent Erlang/OTP release of the time (R15B01), but also revealed VM and language level scalability bottlenecks.

Figure 5 shows runtime and speedup curves for the Orbit benchmark, where master and workers run on a single Erlang node in configurations with and without intra-worker parallelism. In both configurations the program scales: runtime continuously decreases as we add more schedulers to exploit more cores. The speedup of the benchmark without intra-worker parallelism, i.e. without spawning additional processes for the computation (green curve), is almost linear up to 32 cores, but increases less rapidly from that point on; we see a similar, but more clearly visible, pattern for the other configuration (red curve), where there is no performance improvement beyond 32 schedulers. This is due to the asymmetric characteristics of the machine's micro-architecture, which consists of modules that couple two conventional x86 out-of-order cores that share the early pipeline stages, the floating point unit, and the L2 cache with the rest of the module [3].

Some other benchmarks, however, did not scale well, or experienced significant slowdowns when run on many VM schedulers (threads). For example the ets_test benchmark has multiple processes accessing a shared ETS table. Figure 6 shows runtime and speedup curves for ets_test on a 16-core (eight cores with hyperthreading) Intel Xeon-based machine. It shows that runtime increases beyond two schedulers, and that the program exhibits a slowdown instead of a speedup.

For many benchmarks there are obvious reasons for poor scaling, like limited parallelism in the application, or contention for shared resources. The reasons for poor scaling are less obvious for other benchmarks, and it is exactly these we have chosen to study in detail in subsequent work [8, 46].

A simple example is the parallel BenchErl benchmark that spawns some n processes, each of which creates a list of m timestamps and, after checking that each timestamp in the list is strictly greater than the previous one, sends the result to its parent. Figure 7 shows that up to eight cores each additional core leads to a slowdown; thereafter a small speedup is obtained up to 32 cores, and then again a slowdown. A small aspect of the benchmark, easily overlooked, explains the poor scalability. The benchmark creates timestamps using the erlang:now/0 built-in function, whose implementation acquires a global lock in order to return a unique timestamp. That is, two calls to erlang:now/0, even from different processes,

Figure 6: Runtime and speedup of the ets_test BenchErl benchmark using Erlang/OTP R15B01 (runtime in ms and speedup vs. the number of schedulers).

are guaranteed to produce monotonically increasing values. This lock is precisely the bottleneck in the VM that limits the scalability of this benchmark. We describe our work to address VM timing issues in Section iii.
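A sketch of the benchmark's structure (our reconstruction, not the BenchErl source) shows why every process ends up serialising on the same clock:

    %% Spawn N processes; each builds a list of M timestamps, checks
    %% that the list is strictly increasing, and reports to its parent.
    %% Every erlang:now() call takes a global lock, so the processes
    %% contend on the clock instead of running independently.
    %% (Later OTP releases deprecate erlang:now/0 in favour of
    %% erlang:monotonic_time/0, partly for this reason.)
    parallel(N, M) ->
        Parent = self(),
        [spawn(fun() ->
                   Ts = [erlang:now() || _ <- lists:seq(1, M)],
                   Parent ! {self(), strictly_increasing(Ts)}
               end) || _ <- lists:seq(1, N)],
        [receive {_Pid, Ok} -> Ok end || _ <- lists:seq(1, N)].

    strictly_increasing([A, B | Rest]) when A < B ->
        strictly_increasing([B | Rest]);
    strictly_increasing([_]) -> true;
    strictly_increasing(_)   -> false.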

Discussion. Our investigations identified contention for shared ETS tables, and for commonly-used shared resources like timers, as the key VM-level scalability issues. Section VI outlines how we addressed these issues in recent Erlang/OTP releases.

ii Distributed Erlang Scalability

Network Connectivity Costs. When any normal distributed Erlang nodes communicate, they share their connection sets, and this typically leads to a fully connected graph of nodes.


Figure 7: Runtime and speedup of the BenchErl benchmark called parallel using Erlang/OTP R15B01 (runtime in ms and speedup vs. the number of schedulers).

So a system with n nodes will maintain O(n2) connections (n(n-1)/2 in total, e.g. 4,950 TCP connections for 100 nodes), and these are relatively expensive TCP connections with continual maintenance traffic. This design aids transparent distribution, as there is no need to discover nodes, and the design works well for small numbers of nodes. However, at emergent server architecture scales, i.e. hundreds of nodes, this design becomes very expensive, and system architects must switch from the default Erlang model, e.g. they need to start using hidden nodes that do not share connection sets.

We have investigated the scalability limits imposed by network connectivity costs using several Orbit calculations on two large clusters, Kalkyl and Athos, as specified in Appendix A. The Kalkyl results are discussed by [21], and Figure 29 in Section i shows representative results for distributed Erlang computing orbits with 2M and 5M elements on Athos. In all cases performance degrades beyond some scale (40 nodes for the 5M orbit, and 140 nodes for the 2M orbit). Figure 32 illustrates the additional network traffic induced by the fully connected network. It allows a comparison between the number of packets sent in a fully connected network (ML-ACO) with those sent in a network partitioned using our new s_groups (SR-ACO).

Global Information Costs. Maintaining global information is known to limit the scalability of distributed systems, and crucially the process namespaces used for reliability are global. To investigate the scalability limits imposed on distributed Erlang by such global information we have designed and implemented DE-Bench, an open source, parameterisable, and scalable peer-to-peer benchmarking framework [37, 36]. DE-Bench measures the throughput and latency of distributed Erlang commands on a cluster of Erlang nodes, and the design is influenced by the Basho Bench benchmarking tool for Riak [12]. Each DE-Bench instance acts as a peer, providing scalability and reliability by eliminating central coordination and any single point of failure.

Figure 8: DE-Bench's internal workflow.

To evaluate the scalability of distributed Erlang, we measure how adding more hosts increases the throughput, i.e. the total number of successfully executed distributed Erlang commands per experiment. Figure 8 shows the parameterisable internal workflow of DE-Bench. There are three classes of commands in DE-Bench: (i) Point-to-Point (P2P) commands, where a function with tunable argument size and computation time is run on a remote node; these include spawn, rpc, and synchronous calls to server processes, i.e. gen_server or gen_fsm. (ii) Global commands, which entail synchronisation across all connected nodes, such as global:register_name and global:unregister_name. (iii) Local commands, which are executed independently by a single node, e.g. whereis_name, a look-up in the local name table.
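As an illustration, the three command classes might be exercised as follows (a sketch; the node, payload, and name are ours, and DE-Bench's actual command generation is configurable and richer):

    %% (i) P2P: run a function with a tunable payload on a remote node.
    Result = rpc:call(RemoteNode, erlang, md5, [Payload]),

    %% (ii) Global: register and unregister a name, synchronising
    %% across all connected nodes.
    yes = global:register_name(my_name, self()),
    _   = global:unregister_name(my_name),

    %% (iii) Local: look up a name in the local name table only.
    undefined = global:whereis_name(my_name).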

The benchmarking is conducted on 10 to 100 host configurations of the Kalkyl cluster (in steps of 10), and measures the throughput of successful commands per second over 5 minutes. There is one Erlang VM on each host, and one DE-Bench instance on each VM. The full paper [37] investigates the impact of data size and computation time in P2P calls, both independently and in combination, and the scaling properties of the common Erlang/OTP generic server processes gen_server and gen_fsm.

Here we focus on the impact of different proportions of global commands, mixing global with P2P and local commands. Figure 9 shows that even a low proportion of global commands limits the scalability of distributed Erlang, e.g. just 0.01% global commands limits scalability to around 60 nodes. Figure 10 reports the latency of all commands, and shows that while the latencies for P2P and local commands are stable at scale, the latency of the global commands increases dramatically with scale. Both results illustrate that the impact of global operations on throughput and latency in a distributed Erlang system is severe.

Explicit Placement. While network connectivity and global information impact performance at scale, our investigations also identified explicit process placement as a programming issue at scale. Recall from Section iii that distributed Erlang requires the programmer to identify an explicit Erlang node (VM) when spawning a process. Identifying an appropriate node becomes a significant burden for large and dynamic systems. The problem is exacerbated in large distributed systems where (1) the hosts may not be identical, having different hardware capabilities or different software installed, and (2) communication times may be non-uniform: it may be fast to send a message between VMs on the same host, and slow if the VMs are on different hosts in a large distributed system.

These factors make it difficult to deploy applications, especially in a scalable and portable manner. Moreover, while the programmer may be able to use platform-specific knowledge to decide where to spawn processes to enable an application to run efficiently, if the application is then deployed on a different platform, or if the platform changes as hosts fail or are added, this knowledge becomes outdated.

Discussion. Our investigations confirm three language-level scalability limitations of Erlang from developer folklore: (1) Maintaining a fully connected network of Erlang nodes limits scalability; for example, Orbit is typically limited to just 40 nodes. (2) Global operations, and crucially the global operations required for reliability, i.e. to maintain a global namespace, seriously limit the scalability of distributed Erlang systems. (3) Explicit process placement makes it hard to build performance portable applications for large architectures. These issues cause designers of reliable large scale systems in Erlang to depart from the standard Erlang model, e.g. using techniques like hidden nodes and storing pids in data structures. In Section V we develop language technologies to address these issues.

Figure 9: Scalability vs. percentage of global commands in distributed Erlang.

iii Persistent Storage

Any large scale system needs reliable scalable persistent storage, and we have studied the scalability limits of Erlang persistent storage alternatives [38]. We envisage a typical large server having around 10⁵ cores on around 100 hosts. We have reviewed the requirements for scalable and available persistent storage, and evaluated four popular Erlang DBMS against these requirements. For a target scale of around 100 hosts, Mnesia and CouchDB are, unsurprisingly, not suitable. However, Dynamo-style NoSQL DBMS like Cassandra and Riak have the potential to be.

We have investigated the current scalability limits of the Riak NoSQL DBMS using the Basho Bench benchmarking framework on a cluster with up to 100 nodes and independent disks. We found that the scalability limit of Riak version 1.1.1 is 60 nodes on the Kalkyl cluster. The study placed into the public scientific domain what was previously well-evidenced, but anecdotal, developer experience.

We have also shown that resources like memory, disk, and network do not limit the scalability of Riak. By instrumenting the global and gen_server OTP libraries, we identified a specific Riak remote procedure call that fails to scale. We outline how later releases of Riak are refactored to eliminate the scalability bottlenecks.


Figure 10: Latency of commands as the number of Erlang nodes increases.

Discussion. We conclude that Dynamo-like NoSQL DBMSs have the potential to deliver reliable persistent storage for Erlang at our target scale of approximately 100 hosts. Specifically, an Erlang Cassandra interface is available, and Riak 1.1.1 already provides scalable and available persistent storage on 60 nodes. Moreover, the scalability of Riak is much improved in subsequent versions.

V Improving Language Scalability

This section outlines the Scalable Distributed (SD) Erlang libraries [21] we have designed and implemented to address the distributed Erlang scalability issues identified in Section ii. SD Erlang introduces two concepts to improve scalability. S_groups partition the set of nodes in an Erlang system to reduce network connectivity and partition global data (Section i.1). Semi-explicit placement alleviates the issues of explicit process placement in large heterogeneous networks (Section i.2). The two features are independent and can be used separately or in combination. We overview SD Erlang in Section i, and outline s_group semantics and validation in Sections ii and iii respectively.

i SD Erlang Design

i.1 S_groups

S_groups reduce both the number of connections a node maintains, and the size of namespaces, i.e. they minimise global information. Specifically, names are registered on, and synchronised between, only the nodes within the s_group. An s_group has the following parameters: a name, a list of nodes, and a list of registered names. A node can belong to many s_groups or to none. If a node belongs to no s_group, it behaves as a usual distributed Erlang node.

The s_group library defines the functions shown in Table 1. Some of these functions manipulate s_groups and provide information about them, such as creating s_groups and providing a list of nodes from a given s_group. The remaining functions manipulate names registered in s_groups and provide information about these names.


Table 1: Summary of s_group functions

Function                                   Description
new_s_group(SGroupName, Nodes)             Creates a new s_group consisting of some nodes
delete_s_group(SGroupName)                 Deletes an s_group
add_nodes(SGroupName, Nodes)               Adds a list of nodes to an s_group
remove_nodes(SGroupName, Nodes)            Removes a list of nodes from an s_group
s_groups()                                 Returns a list of all s_groups known to the node
own_s_groups()                             Returns a list of s_group tuples the node belongs to
own_nodes()                                Returns a list of nodes from all s_groups the node belongs to
own_nodes(SGroupName)                      Returns a list of nodes from the given s_group
info()                                     Returns s_group state information
register_name(SGroupName, Name, Pid)       Registers a name in the given s_group
re_register_name(SGroupName, Name, Pid)    Re-registers a name (changes a registration) in a given s_group
unregister_name(SGroupName, Name)          Unregisters a name in the given s_group
registered_names({node, Node})             Returns a list of all registered names on the given node
registered_names({s_group, SGroupName})    Returns a list of all registered names in the given s_group
whereis_name(SGroupName, Name)
whereis_name(Node, SGroupName, Name)       Return the pid of a name registered in the given s_group
send(SGroupName, Name, Msg)
send(Node, SGroupName, Name, Msg)          Send a message to a name registered in the given s_group

For example, to register a process Pid with name Name in s_group SGroupName we use the following function. The name will only be registered if the process is being executed on a node that belongs to the given s_group, and neither Name nor Pid are already registered in that group.

    s_group:register_name(SGroupName, Name,
                          Pid) -> yes | no
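Putting the functions of Table 1 together, a small example (the node names and DbPid are invented for illustration) creates a group, registers a server in it, and resolves the name from within the group:

    %% Form a two-node s_group and register a server under a
    %% group-local name (run on a node of the new group).
    {ok, group_a, _Nodes} =
        s_group:new_s_group(group_a,
                            ['node1@host', 'node2@host']),
    yes = s_group:register_name(group_a, db_server, DbPid),

    %% On any node of group_a: resolve the name and message it.
    Pid = s_group:whereis_name(group_a, db_server),
    s_group:send(group_a, db_server, {request, self()}).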

To illustrate the impact of s_groups on scalability we repeat the global operations experiment from Section ii (Figure 9). In the SD Erlang experiment we partition the set of nodes into s_groups, each containing ten nodes, and hence the names are replicated and synchronised on just ten nodes, and not on all nodes as in distributed Erlang. The results in Figure 11 show that with 0.01% of global operations the throughput of distributed Erlang stops growing at 40 nodes, while the throughput of SD Erlang continues to grow linearly.

The connection topology of s_groups is extremely flexible: they may be organised into a hierarchy of arbitrary depth or branching, e.g. there could be multiple levels in the tree


Figure 11: Global operations in distributed Erlang vs. SD Erlang (throughput vs. number of nodes).

of s_groups; see Figure 12. Moreover, it is not necessary to create a hierarchy of s_groups; for example, we have constructed an Orbit implementation using a ring of s_groups.

Given such a flexible way of organising distributed systems, key questions in the design of an SD Erlang system are the following. How should s_groups be structured? Depending on the reason the nodes are grouped (reducing the number of connections, reducing the namespace, or both) s_groups can be freely structured as a tree, ring, or some other topology. How large should the s_groups be? Smaller s_groups mean more inter-group communication, but the synchronisation of the s_group state between the s_group nodes constrains the maximum size of s_groups. We have not found this constraint to be a serious restriction; for example, many s_groups are either relatively small (e.g. 10-node) internal or terminal elements in some topology, e.g. the leaves and nodes of a tree. How do nodes from different s_groups communicate? While any two nodes can communicate in an SD Erlang system, to minimise the number of connections communication between nodes from different s_groups is typically routed via gateway nodes that belong to both s_groups. How do we avoid single points of failure? For reliability, and to minimise communication load, multiple gateway nodes and processes may be required.

Information to make these design choices is provided by the tools in Section VII and by benchmarking. A further challenge is how to systematically refactor a distributed Erlang application into SD Erlang, and this is outlined in Section i. A detailed discussion of distributed system design and refactoring in SD Erlang is provided in a recent article [22].

We illustrate typical SD Erlang system designs by showing refactorings of the Orbit and ACO benchmarks from Section III. In both distributed and SD Erlang the computation starts on the Master node and the actual computation is done on the Worker nodes. In the distributed Erlang version all nodes are interconnected, and messages are transferred directly from the sending node to the receiving node (Figure 2). In contrast, in the SD Erlang version nodes are grouped into s_groups, and messages are transferred between different s_groups via Submaster nodes (Figure 13).


Figure 12: SD Erlang ACO (SR-ACO) architecture: a master node with the master process at Level 0, s_groups of sub-master processes and 16 colony nodes at Levels 1 and 2, with sub-master, colony, and ant processes placed on the nodes.

A fragment of code that creates an s_group on a Sub-master node is as follows:

    create_s_group(Master, GroupName, Nodes0) ->
        case s_group:new_s_group(GroupName, Nodes0) of
            {ok, GroupName, Nodes} ->
                Master ! {GroupName, Nodes};
            _ -> io:format("exception message")
        end.

Similarly, we introduce s_groups in the GR-ACO benchmark from Section ii to create Scalable Reliable ACO (SR-ACO); see Figure 12. Here, apart from reducing the number of connections, s_groups also reduce the global namespace information. That is, instead of registering the name of a pid globally, i.e. with all nodes, the name is registered only on all nodes in the s_group with s_group:register_name/3.

A comparative performance evaluation of distributed Erlang and SD Erlang Orbit and ACO is presented in Section VIII.

i.2 Semi-Explicit Placement

Recall from Section iii that distributed Erlang spawns a process onto an explicitly named Erlang node, e.g.:

    spawn(some_node, fun some_module:pong/0)

and also recall the portability and programming effort issues associated with such explicit placement in large scale systems, discussed in Section ii.

To address these issues we have developed a semi-explicit placement library that enables the programmer to select nodes on which to spawn processes based on run-time information about the properties of the nodes. For example, if a process performs a lot of computation one would like to spawn it on a node with considerable computation power, or if two processes are likely to communicate frequently then it would be desirable to spawn them on the same node, or on nodes with a fast interconnect.

We have implemented two Erlang libraries to support semi-explicit placement [60]. The first deals with node attributes, and describes properties of individual Erlang VMs and associated hosts, such as total and currently available RAM, installed software, hardware configuration, etc. The second deals with a notion of communication distances, which models the communication times between nodes in a distributed system. Therefore, instead of specifying a node, we can use the attr:choose_node/1 function to define the target node, i.e.:

    spawn(attr:choose_node(Params),
          fun some_module:pong/0)

Figure 13: SD Erlang (SD-Orbit) architecture.

[60] report an investigation into the communication latencies on a range of NUMA and cluster architectures, and demonstrate the effectiveness of the placement libraries using the ML-ACO benchmark on the Athos cluster.

ii S_group Semantics

For precise specification, and as a basis for validation, we provide a small-step operational semantics of the s_group operations [21]. Figure 14 defines the state of an SD Erlang system and associated abstract syntax variables. The abstract syntax variables on the left are defined as members of sets, denoted {...}, and these in turn may contain tuples, denoted (...), or further sets. In particular nm is a process name, p a pid, ni a node_id, and nis a set of node_ids. The state of a system is modelled as a four tuple comprising a set of s_groups, a set of free_groups, a set of free_hidden_groups, and a set of nodes. Each type of group is associated with nodes and has a namespace. An s_group additionally has a name, whereas a free_hidden_group consists of only one node, i.e. a hidden node simultaneously acts as a node and as a group, because as a group it has a namespace but does not share it with any other node. Free normal and hidden groups have no names, and are uniquely defined by the nodes associated with them. Therefore, group names gr_names are either NoGroup or a set of s_group_names. A namespace is a set of name and process id (pid) pairs, and is replicated on all nodes of the associated group.

A node has the following four parameters: node_id identifier; node_type, that can be either hidden or normal; connections; and group_names, i.e. names of groups the node belongs to. The node can belong to either a list of s_groups or one of the free groups. The type of the free group is defined by the node type. Connections are a set of node_ids.

Transitions in the semantics have the form (state, command, ni) → (state', value), meaning that executing command on node ni in state returns value and transitions to state'.

The semantics is presented in more detail by [21], but we illustrate it here with the s_group:registered_names/1 function from Section i.1. The function returns a list of names


(grs, fgs, fhs, nds) ∈ state ≡ ({s_group}, {free_group}, {free_hidden_group}, {node})

gr ∈ grs ≡ {s_group} ≡ {(s_group_name, {node_id}, namespace)}

fg ∈ fgs ≡ {free_group} ≡ {({node_id}, namespace)}

fh ∈ fhs ≡ {free_hidden_group} ≡ {(node_id, namespace)}

nd ∈ nds ≡ {node} ≡ {(node_id, node_type, connections, gr_names)}

gs ∈ gr_names ≡ {NoGroup, {s_group_name}}

ns ∈ namespace ≡ {(name, pid)}

cs ∈ connections ≡ {node_id}

nt ∈ node_type ≡ {Normal, Hidden}

s ∈ {NoGroup, s_group_name}

Figure 14: SD Erlang state [21].

registered in s_group s if node ni belongs to the s_group, and an empty list otherwise (Figure 15). Here ⊕ denotes disjoint set union; IsSGroupNode returns true if node ni is a member of some s_group s, and false otherwise; and OutputNms returns a set of process names registered in the ns namespace of s_group s.

iii Semantics Validation

As the semantics is concrete it can readily be made executable in Erlang, with lists replacing sets throughout. Having an executable semantics allows users to engage with it, and to understand how the semantics behaves vis-à-vis the library, giving them an opportunity to assess the correctness of the library against the semantics.

Figure 16: Testing s_groups using QuickCheck.

Better still, we can automatically assess how the system behaves in comparison with the (executable) semantics by executing them in lockstep, guided by the constraints of which operations are possible at each point. We do that by building an abstract state machine model of the library. We can then generate random sequences (or traces) through the model, with appropriate library data generated too. This random generation is supported by the QuickCheck property-based testing system [24, 9].

The architecture of the testing framework is shown in Figure 16. First, an abstract state machine, embedded as an "eqc_statem" module, is derived from the executable semantic specification. The state machine defines the abstract state representation, and the transition from one state to another when an operation is applied. Test case and data generators are then defined to control the test case generation;


((grs, fgs, fhs, nds), s_group:registered_names(s), ni)
    −→ ((grs, fgs, fhs, nds), nms)   if IsSGroupNode(ni, s, grs)
    −→ ((grs, fgs, fhs, nds), {})    otherwise

where {(s, {ni} ⊕ nis, ns)} ⊕ grs′ ≡ grs
      nms ≡ OutputNms(s, ns)
      IsSGroupNode(ni, s, grs) = ∃ nis, ns, grs′ . {(s, {ni} ⊕ nis, ns)} ⊕ grs′ ≡ grs
      OutputNms(s, ns) = {(s, nm) | (nm, p) ∈ ns}

Figure 15: SD Erlang semantics of the s_group:registered_names/1 function

this includes the automatic generation of eligible s_group operations and the input data to those operations. Test oracles are encoded as the postconditions of the s_group operations.

During testing, each test command is applied to both the abstract model and the s_group library. The application of the test command to the abstract model takes the model from its current state to a new state, as described by the transition functions, whereas the application of the test command to the library leads the system to a new actual state. The actual state information is collected from each node in the distributed system, then merged and normalised to the same format as the abstract state representation. For a test to be successful, after the execution of a test command the test oracles specified for this command should be satisfied. Various test oracles can be defined for s_group operations; for instance, one of the generic constraints that applies to all the s_group operations is that after each s_group operation the normalised system state should be equivalent to the abstract state.
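The sketch below shows the shape of such a model, assuming QuickCheck's eqc_statem behaviour: the abstract state mirrors the semantics' four-tuple, the command generator produces eligible s_group operations, and the postcondition is the state-equivalence oracle. The functions that collect and normalise the actual system state are stubbed out, and the generators are simplified:

    -module(s_group_eqc).
    -include_lib("eqc/include/eqc.hrl").
    -include_lib("eqc/include/eqc_statem.hrl").
    -compile(export_all).

    %% Abstract state: the semantics' (grs, fgs, fhs, nds) four-tuple.
    initial_state() -> {[], [], [], []}.

    %% Generate an eligible s_group operation, with generated data.
    command(_S) ->
        oneof([{call, s_group, new_s_group,
                [elements([g1, g2]), list(node_id())]},
               {call, s_group, registered_names, [elements([g1, g2])]}]).

    precondition(_S, _Call) -> true.

    %% Transition function of the abstract model (simplified).
    next_state(S, _Res, {call, s_group, new_s_group, [Name, Nis]}) ->
        add_group(S, Name, Nis);
    next_state(S, _Res, _Call) -> S.

    %% Oracle: normalised actual state equivalent to the abstract state.
    postcondition(S, _Call, _Res) ->
        normalise(actual_state()) =:= normalise(S).

    prop_s_group() ->
        ?FORALL(Cmds, commands(?MODULE),
                begin
                    {_H, _S, Res} = run_commands(?MODULE, Cmds),
                    Res =:= ok
                end).

    %% Stubs standing in for the real infrastructure.
    node_id() -> elements(['node1@host', 'node2@host']).
    add_group({Grs, Fgs, Fhs, Nds}, Name, Nis) ->
        {[{Name, Nis, []} | Grs], Fgs, Fhs, Nds}.
    actual_state() -> {[], [], [], []}.   % would query every node and merge
    normalise(S) -> S.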

Thousands of tests were run, and three kinds of errors, all subsequently corrected, were found. Some errors in the library implementation were found, including one due to the synchronisation between nodes, and another related to the remove_nodes operation, which erroneously raised an exception. We also found a couple of trivial errors in the semantic specification itself, which had been missed by manual examination. Finally, we found some situations where there were inconsistencies between the semantics and the library implementation despite their states being equivalent; an example of this was in the particular values returned by functions on certain errors. Overall, the automation of testing boosted our confidence in the correctness of both the library implementation and the semantic specification. This work is reported in more detail by [54].

VI Improving VM Scalability

This section reports the primary VM and library improvements that we have designed and implemented to address the scalability and reliability issues identified in Section i.


i Improvements to Erlang Term Storage

Because ETS tables are so heavily used in Erlang systems, they are a focus for scalability improvements. We start by describing their redesign, including some improvements that pre-date our RELEASE project work, i.e. those prior to Erlang/OTP R15B03. These historical improvements are very relevant for a scalability study, and form the basis for our subsequent changes and improvements. At the point when Erlang/OTP got support for multiple cores (in release R11B), there was a single reader-writer lock for each ETS table. Optional fine-grained locking of hash-based ETS tables (i.e. set, bag or duplicate_bag tables) was introduced in Erlang/OTP R13B02-1, adding 16 reader-writer locks for the hash buckets. Reader groups, which minimise read synchronisation overheads, were introduced in Erlang/OTP R14B. The key observation is that a single count of the multiple readers must be synchronised across many cache lines, potentially far away in a NUMA system. Maintaining reader counts in multiple (local) caches makes reads fast, although writes must now check every reader count. In Erlang/OTP R16B the number of bucket locks and the default number of reader groups were both upgraded from 16 to 64.

We illustrate the scaling properties of the ETS concurrency options using the ets_bench BenchErl benchmark on the Intel NUMA machine with 32 hyperthreaded cores specified in Appendix A. The ets_bench benchmark inserts 1M items into the table, then records the time to perform 17M operations, where an operation is either a lookup, an insert, or a delete. The experiments are conducted on a hash-based (set) ETS table with different percentages of update operations, i.e. insertions or deletions.
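A simplified, sequential sketch of such a workload is shown below; the real benchmark spreads the 17M operations over concurrently running processes, one per scheduler, and the names and option lists here are illustrative:

    -module(ets_bench_sketch).
    -export([run/1]).

    %% Run the workload with the given percentage of updates, e.g. run(10).
    run(UpdatePercent) ->
        T = ets:new(bench, [set, public, {read_concurrency, true},
                            {write_concurrency, true}]),
        [ets:insert(T, {K, K}) || K <- lists:seq(1, 1000000)],
        {Micros, ok} = timer:tc(fun() -> ops(T, 17000000, UpdatePercent) end),
        io:format("17M operations took ~p ms~n", [Micros div 1000]).

    ops(_T, 0, _U) -> ok;
    ops(T, N, U) ->
        K = rand:uniform(1000000),
        case rand:uniform(100) =< U of
            true ->                        % update: an insert or a delete
                case rand:uniform(2) of
                    1 -> ets:insert(T, {K, N});
                    2 -> ets:delete(T, K)
                end;
            false -> ets:lookup(T, K)      % read
        end,
        ops(T, N - 1, U).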

Figure 17 shows the runtimes in seconds of 17M operations in different Erlang/OTP versions, varying the number of schedulers (x-axis), reflecting how the scalability of ETS tables has improved in more recent Erlang/OTP releases.

Figure 17: Runtime of 17M ETS operations in Erlang/OTP releases R11B-5, R13B02-1, R14B and R16B: 10% updates (left) and 1% updates (right). The y-axis shows time in seconds (lower is better), and the x-axis is the number of OS threads (VM schedulers).

Figure 18 shows the runtimes in seconds of 17M operations on an ETS table with different numbers of reader groups, again varying the number of schedulers. We see that one reader group is not sufficient with 10% updates, nor are two with 1% updates. Beyond that, different numbers of reader groups have little impact on the benchmark performance, except that using 64 groups with 10% updates slightly degrades performance.
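The reader-group limit can also be set when the VM is started, using the +rg emulator flag; for example:

    % Cap the number of reader groups used by read/write locks at 8:
    erl +rg 8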

We have explored four other extensions or redesigns of the ETS implementation for better scalability: (1) allowing more programmer control over the number of bucket locks in hash-based tables, so that the programmer can reflect the number of schedulers and the expected

Figure 18: Runtime of 17M ETS operations with varying numbers of reader groups (1, 2, 4, 8, 16, 32 and 64): 10% updates (left) and 1% updates (right). The y-axis shows runtime in seconds (lower is better), and the x-axis is the number of schedulers.

access pattern; (2) using contention-adapting trees to get better scalability for ordered_set ETS tables, as described by [69]; (3) using queue delegation locking to improve scalability [47]; and (4) adopting schemes for completely eliminating the locks in the meta table. A more complete discussion of our work on ETS can be found in the papers by [69] and [47].

Here we outline only our work on contention-adapting (CA) trees. A CA tree monitors contention in different parts of a tree-shaped data structure, introducing routing nodes with locks in response to high contention, and removing them in response to low contention. For experimental purposes, two variants of the CA tree have been implemented to represent ordered_sets in the virtual machine of Erlang/OTP 17.0.

Figure 19: Throughput (operations per microsecond, y-axis) of the AVL and Treap CA tree variants, set and ordered_set, against the number of threads (x-axis): 10% updates (left) and 1% updates (right).

One extends the existing AVL trees in the Erlang VM, and the other uses a Treap data structure [5]. Figure 19 compares the throughput of the CA tree variants with that of ordered_set and set as the number of schedulers increases. It is unsurprising that the CA trees scale so much better than an ordered_set, which is protected by a single readers-writer lock. It is more surprising that they also scale better than set. This is because hash tables use fine-grained locking at a fixed granularity, while CA trees can adapt the number of locks both to the current contention level and to the parts of the key range where contention is occurring.


ii Improvements to Schedulers

In the Erlang VM, a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction-count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until it either blocks waiting for input, typically a message from some other process, or until it has executed its quota of reductions.

The Erlang VM is usually started with one scheduler per logical core (SMT thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned, it is placed in the run queue of the scheduler of its parent, and it waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores; that is, an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Figure 24 (Section iii.2).

The default load management mechanism is load compaction, which aims to keep as many scheduler threads as possible fully loaded with work, i.e. it attempts to ensure that scheduler threads do not run out of work. We have developed a new optional scheduler utilisation balancing mechanism, available from Erlang/OTP 17.0. The new mechanism aims to balance scheduler utilisation between schedulers; that is, it strives for equal scheduler utilisation on all schedulers.
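The mechanism is disabled by default, and is enabled with the +sub emulator flag when starting the VM:

    % Start the VM with scheduler utilisation balancing enabled:
    erl +sub true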

The scheduler utilisation balancing mechanism has no performance impact on the system when not enabled. On the other hand, when enabled it results in changed timing in the system; normally there is a small overhead, due to measuring utilisation and calculating balancing information, which depends on the underlying primitives provided by the operating system.

The new balancing mechanism results in a better distribution of processes to schedulers, reducing the probability of core contention. Together with other VM improvements, such as interruptable BIFs and garbage collection, it results in lower latency and improved responsiveness, and hence reliability, for soft real-time applications.

iii Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel BenchErl benchmark (Section i). The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in that returns "Erlang system time", i.e. time since Epoch with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, i.e. not SMT-enabled. The documentation promises that values returned by it are strictly increasing, and many applications ended up relying on this; for example, applications often employ erlang:now/0 to generate unique integers.

Erlang system time should align with the operating system's view of time since Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot, without invalidating the strictly increasing value


guarantee. The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when the two have deviated, e.g. in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], i.e. a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards or backwards, while system time is allowed to do so. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.
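For illustration, with the new API a duration is measured using monotonic time, and Erlang system time is recovered by adding the time offset; a minimal sketch using the OTP 18 built-ins:

    %% Measure a duration: monotonic time never leaps.
    T0 = erlang:monotonic_time(),
    %% ... the work being timed ...
    ElapsedMs = erlang:convert_time_unit(erlang:monotonic_time() - T0,
                                         native, milli_seconds),

    %% Erlang system time = monotonic time + dynamically varying offset.
    SystemTime = erlang:monotonic_time() + erlang:time_offset().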

Time Retrieval. Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe, but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, the monotonic time delivered by the operating system is based solely on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP); that is, they will run with different frequencies. Linux is an exception, with a monotonic clock that is NTP-adjusted and runs with the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing the monotonic time and system time delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute. If the adjustment needs to be changed, new adjustment information is published, and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the previously published time adjustment information, and calculates Erlang monotonic time. To preserve monotonicity, it is important that all threads that read the same OS monotonic time map it to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information, using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed; this means that in the vast majority of cases the RW lock will be read-locked, which allows multiple readers to run concurrently. To prevent bouncing of the lock cache-line, we use a bespoke reader-optimised RW lock implementation, where reader threads notify of their presence in counters on separate cache-lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11]; alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].

Timer Wheel and BIF Timer. The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel, used by the processes executing on that scheduler.

The implementation of timers in Erlang/OTP uses a built-in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS


tables, and only insert one timer at a time into the timer wheel.
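At the language level the timer API is unchanged; timers are still set and cancelled with the usual BIFs, for example:

    Ref = erlang:send_after(5000, self(), timeout),  % set a 5 second timer
    %% ... if the awaited event arrives before the timeout ...
    erlang:cancel_timer(Ref).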

Benchmarks. We have measured several benchmarks on a 16-core Bulldozer machine with eight dual-core AMD Opteron 4376 HE CPUs.⁵ We present three of them here.

The first micro benchmark compares the execution time of an Erlang receive with that of a receive ... after, which specifies a timeout and provides a default value. The receive ... after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, the total execution time with the optimised timers is only 5% longer than without timers.
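The two process loops being compared look roughly as follows (illustrative, not the exact benchmark code):

    %% Plain receive: no timer interaction.
    loop() ->
        receive ping -> loop(); stop -> ok end.

    %% receive ... after: a timer is set each time the process blocks,
    %% and cancelled when a message arrives before the timeout.
    loop_t() ->
        receive
            ping -> loop_t();
            stop -> ok
        after 60000 ->
            timeout
        end.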

The second micro benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, on three Erlang/OTP releases: R15B01, 17.4 and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 in which the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on the left shows that: (1) the performance of time management remained roughly unchanged between Erlang/OTP releases prior to 18.0; (2) the improved time management in Erlang/OTP 18.x makes time management less

⁵See §2.5.4 of the RELEASE project Deliverable 2.4 (http://release-project.eu/documents/D24.pdf)

Figure 20: BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases (series: R15B01 with now(), 17.4 with now(), 18.1 with now(), and 18.1 with monotonic_time()): runtime in ms against the number of schedulers (left), and the speedup obtained using erlang:monotonic_time/0 (right).

likely to be a scalability bottleneck, even when using erlang:now/0; and (3) the new time API (erlang:monotonic_time/0 and friends) provides a scalable solution. The graph on the right-hand side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

VII Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable,


to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling and debugging Erlang systems that can be used separately or together, as appropriate for solving the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on, or specialise, the services offered by the Erlang virtual machine through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer⁶ application gives a comprehensive overview of many of these data on a node-by-node basis.

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature; that is, the tools support the language itself, rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use the tracing support provided by the Erlang VM, so other actor frameworks, e.g. Akka, can use the Kamon⁷ JVM monitoring system. Similarly, tools for other actor languages or frameworks could use data derived through OS-level

⁶http://www.erlang.org/doc/apps/observer
⁷http://kamon.io

tracing frameworks, such as DTrace⁸ and SystemTap⁹ probes, as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

i Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare a program for modification or extension, or, as is the case here, in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support in a refactoring tool. There are a number of tools that support refactoring in Erlang; in the RELEASE project we have chosen to extend Wrangler¹⁰ [56]; other tools include Tidier [68] and RefactorErl [44].

Supporting API Migration. The SD Erlang libraries modify Erlang's global_group library, becoming the new s_group library; as a result, Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon: as software evolves, the API of a library often changes. Rather than simply extend Wrangler with a refactoring to perform this particular migration, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of the API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be

⁸http://dtrace.org/blogs/about
⁹https://sourceware.org/systemtap/wiki
¹⁰http://www.cs.kent.ac.uk/projects/wrangler


supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically. The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work, and the technicalities of the implementation, can be found in a paper by [52].
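For instance, an adaptor function for one migrated function might define the old global_group call in terms of the new s_group API; the sketch below illustrates the idea, rather than Wrangler's actual adaptor format:

    -module(global_group_adaptor).
    -export([registered_names/1]).

    %% Old API expressed in terms of the new: with s_groups, names are
    %% registered per group, so strip the group component (illustrative).
    registered_names({node, Node}) ->
        [Name || {_SGroup, Name} <- s_group:registered_names({node, Node})].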

Support for Introducing Parallelism. We have introduced support for parallelising explicit list operations (map and foreach), for process introduction to complete a computationally intensive task in parallel, for introducing a worker process to deal with call handling in an Erlang "generic server", and for parallelising a tail-recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in the conference paper describing this work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach. The transformation from an explicit use of sequential map/foreach to their parallel counterparts is very straightforward, and even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
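For example, assuming para_lib exports a pmap/2 that mirrors lists:map/2 (the names exported by the actual library may differ), the refactoring rewrites:

    %% Before: sequential map over independent tasks.
    Results = lists:map(fun process_task/1, Tasks),

    %% After: the parallel counterpart from Wrangler's para_lib.
    Results = para_lib:pmap(fun process_task/1, Tasks),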

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process that executes a task in parallel with its parent process. The result of the new process is sent back to the parent process, which consumes it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.
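The shape of the code after this refactoring is roughly as follows (names are illustrative), with the receive placed immediately before the result is consumed:

    Parent = self(),
    Pid = spawn(fun() -> Parent ! {self(), heavy_task(Input)} end),
    %% ... computations that do not depend on the result ...
    receive
        {Pid, Result} -> use(Result)
    end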

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies; for instance, a function might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are checked by program slicing, which is discussed below.

Support for Program Slicing. Program slicing is a general program analysis technique for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE¹¹ [17], a tool developed in another EU project that also re-uses the Wrangler front end. That work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing to transform programs to make them suitable for the introduction of parallel structures.

¹¹http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf

Figure 21: The architecture of WombatOAM

ii Scalable Deployment

We have developed the WombatOAM tool¹² to provide a deployment, operations and maintenance framework for large-scale distributed Erlang systems. These systems typically consist of a number of Erlang nodes executing on different hosts. The hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP.

¹²WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd.

Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this is the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture. The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability, because of the role played by the central Master node; in the current


version an additional layer of Middle Managers was introduced, to allow the system to scale easily to thousands of deployed nodes. As the diagram shows, the "northbound" interfaces to the web dashboard and the command line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms, and so forth; we describe these now.

Services. WombatOAM is designed to collect, store and display various kinds of information and events from running Erlang systems, and these data are accessed and managed through an AJAX-based Web Dashboard. They include the following.

Metrics. WombatOAM supports the collection of some hundred metrics (including, for instance, the number of processes on a node, and the message traffic in and out of the node) on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom¹³, and interface with other tools, such as graphite¹⁴, which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications. As well as metrics, WombatOAM supports event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which is part of the standard

¹³Folsom collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom
¹⁴https://graphiteapp.org

distribution, or the lager logging framework¹⁵, and will be displayed and logged as they occur.

Alarms. Alarms are more complex entities. Alarms have a state: they can be raised, and once dealt with they can be cleared. They also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as for notifications.

Topology. The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible; if not, it notifies the other services and periodically tries to reconnect, and when the nodes are available again it again notifies the other services. It doesn't have a Middle Manager part, because it doesn't talk to the nodes directly; instead it asks the Node manager service to do so.

Node manager. This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection to a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection to a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration. This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers the Orchestration

¹⁵https://github.com/basho/lager


service uses an external library called Libcloud¹⁶, for which Erlang Solutions has written an open source Erlang wrapper called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications: it provides infrastructure for deploying them.

Deployment. The deployment mechanism consists of the following five steps.

1. Registering a provider. WombatOAM provides the same interface for different cloud providers that support the OpenStack standard or the Amazon EC2 API, and the same interface again for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this enables WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family, the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

¹⁶The unified cloud API [4]: https://libcloud.apache.org

4. Defining a deployment domain. At this step a deployment domain is created, specifying (i) which providers should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to, or removed from, the system depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and in adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single-host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with each core manages its own run queue, and processes migrate between the queues through a work stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes onto nodes.¹⁷

iii.1 Percept2

Percept2¹⁸ builds on the existing Percept tool to provide post hoc offline analysis and visualisation

¹⁷This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.

¹⁸https://github.com/RefactoringTools/percept2

Figure 22: Offline profiling of a distributed Erlang application using Percept2

of Erlang systems. Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang's built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and the data is viewed through a web-based interface. The process of offline profiling a distributed Erlang application using Percept2 is shown in Figure 22.

Percept generates an application-level zoomable concurrency graph, showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime at which the process was active, as well as other per-process information.
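A typical offline profiling session has three steps, sketched below with API names as we understand them from the Percept2 documentation; the exact signatures and option lists should be checked against the tool's README, and the module and file names here are invented:

    %% 1. Run the program under profiling, writing trace events to file:
    percept2:profile("orbit.dat", {orbit, main, [Args]}, [all]),

    %% 2. Analyse the trace file(s) into the RAM database:
    percept2:analyze(["orbit.dat"]),

    %% 3. Inspect the results in a browser via the built-in web server:
    percept2:start_webserver(8888).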

Percept2 extends Percept in a number of ways, as detailed by [53], including most importantly:

• Distinguishing between running and runnable time for each process; this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including the migration history of a process between run queues; statistics about message passing between processes (the number of messages, and the average message size, sent/received by a process); the accumulated runtime per process; and the accumulated time during which a process is in the running state.

• Presenting the process tree: the hierarchy

Figure 23: Percept2 showing processes running and runnable (with 4 schedulers/run queues)

structure indicating the parent-child relationships between processes.

• Recording the dynamic function call graph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in each function.

• Tracing of s_group activities in a distributed system. Unlike global group, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating the measurement of a function's own execution time gives users the freedom not to profile all the function calls invoked during program execution. For example, they can choose to profile only the functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph, a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of


generated web pages. Together these make the system more responsive and more scalable.

iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, and require neither special recompiled versions of the software under examination, nor special post-processing tools to create meaningful information from the gathered data. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, and so on. Figure 24 visualises the results of such monitoring: it shows how the sizes of the run queues vary during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the

Figure 24: DTrace visualisation of the run queue sizes (run queues 0–15) for the bang benchmark with 16 schedulers (time on the x-axis)

same storage infrastructure and presentation facilities.

iii.3 Devo

Devo¹⁹ [10] is designed to provide real-time online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).

Figure 25 shows visualisations from Devo in both modes. On the left-hand side a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram), with six cores each; with hyperthreading this gives twelve virtual cores per chip, and hence 24 run queues in total. The size of each run queue is shown by both the colour and the height of its column, and process migrations are illustrated by (fading) arcs between the queues within the circle. A green arc shows

¹⁹https://github.com/RefactoringTools/devo

Figure 25: Devo low- and high-level visualisations of SD-Orbit

migration on the same physical core, a grey one migration on the same chip, and a blue one migration between the two chips.

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between the nodes (green: quiescent; red: busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. The tracing to be performed on monitored nodes is also specified within the configuration file.

An agent is started by a master SD-Mon node for each s_group and for each free node. The configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the user interface.

Each agent takes care of an s_group or of a free node. At start-up it tries to get in contact with its nodes, and applies the tracing to them as stated by the master. Binary files are stored in the host file system. Tracing is used internally in order to track the s_group operations happening at runtime; an asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is


Figure 26: SD-Mon architecture

deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

The monitoring network is also supervised, in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected then that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as for post hoc offline analysis. Figure 28 shows, in real time, the messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution: before (top) and after eliminating s_group 2 (bottom)


Figure 28: SD-Mon online monitoring

VIII Systemic Evaluation

Preceding sections have investigated improvements to individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI and VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state-of-the-art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable²⁰. The bulk of the experiments reported here are conducted on the Athos cluster, using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed-size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to the execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure. The benchmarks

²⁰See Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf

Figure 29: Speedup of distributed Erlang vs. SD Erlang Orbit (2M and 5M elements) [Strong Scaling]. The y-axis shows relative speedup, and the x-axis the number of nodes (cores).

also evaluate different aspects of s_groups: Orbit evaluates the scalability impact of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot the standard deviation. The results show that D-Orbit performs better on a small number of nodes, as communication is direct, rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups: beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails, because some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.

Figure 30: WombatOAM deployment time in seconds against the number of nodes N; the best fit is 0.0125N + 83.

The measurement data shows two important facts: WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of the effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, while SR-ACO registers them only within an s_group (Section ii).

For both GR-ACO and SR-ACO a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common x86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, i.e. around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark, and obtained similar results [23].
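In essence, a Chaos Monkey can be a trivial Erlang process; the minimal sketch below kills a random local process every second (the service used in our experiments is more elaborate, e.g. in selecting eligible victims and in running on every node):

    -module(chaos_monkey_sketch).
    -export([start/0]).

    start() -> spawn(fun loop/0).

    loop() ->
        timer:sleep(1000),                            % once a second...
        Ps = erlang:processes(),
        Victim = lists:nth(rand:uniform(length(Ps)), Ps),
        exit(Victim, kill),                           % ...kill a random process
        loop().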

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the

Figure 31: ACO execution times (seconds, y-axis) against the number of nodes (cores), for GR-ACO, SR-ACO and ML-ACO, Erlang/OTP 17.4 (RELEASE) [Weak Scaling]

distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.

Scalability. Figure 31 compares the runtimes of the ML, GR and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and the fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks and others that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang

Figure 32: Number of sent packets in ML-ACO, GR-ACO and SR-ACO

preserves the distributed Erlang language-level reliability model.

As SD-Orbit scales better than D-Orbit, SR-ACO scales better than ML-ACO, and SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M and W=5M) reach their scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limit of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with our other experiments.

IX Discussion

Distributed actor platforms, like Erlang or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language, while preserving reliability. The work is a vade mecum for addressing scalability in reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering VM, language and persistent storage levels. We have developed the

41

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

BenchErl and DE-Bench tools for this purposeKey VM-level scalability issues we identify in-clude contention for shared ETS tables and forcommonly-used shared resources like timersKey language scalability issues are the costs ofmaintaining a fully-connected network main-taining global recovery information and ex-plicit process placement Unsurprisingly thescaling issues for this distributed actor lan-guage are common to other distributed or par-allel languages and frameworks with otherparadigms like CHARM++ [45] Cilk [15] orLegion [40] We establish scientifically the folk-lore limitations of around 60 connected nodesfor distributed Erlang (Section IV)

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including the memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups for partitioning the network and the global process namespace, and semi-explicit process placement for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).

To improve the scalability of the Erlang VM and libraries we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation, such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

Table 2: Cluster Specifications (RAM per host in GB).

Name    Hosts  Cores/host  Cores total  Cores max avail  Processor                                 RAM     Interconnect
GPG      20    16            320           320           Intel Xeon E5-2640v2, 8C, 2GHz            64      10GB Ethernet
Kalkyl  348     8           2784          1408           Intel Xeon 5520v2, 4C, 2.26GHz            24–72   InfiniBand 20Gb/s
TinTin  160    16           2560          2240           AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz  64–128  2:1 oversubscribed QDR InfiniBand
Athos   776    24          18624          6144           Intel Xeon E5-2697v2, 12C, 2.7GHz         64      InfiniBand FDR14

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages²¹. For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3 GHz, each with 16 "Bulldozer" cores, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

²¹ See Deliverable 6.7 (http://www.release-project.eu/documents/D67.pdf), available online.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level Visualization of Concurrent and Distributed Computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riakdocs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking/.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Koszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go Programming Language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary, adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. In ACM SIGPLAN Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak Core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.

[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing: 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop: The Definitive Guide - Storage and Analysis at Internet Scale (3rd ed., revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-Strength Functional Programming: Experiences with the Ericsson AXD301 Project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.


be restarted (because the name can still be used for communication).

To see global registration in action, consider a pong server process:

global:register_name(pong_server, Remote_Pong_PID)

Clients of the server can send messages to the registered name, e.g.:

global:whereis_name(pong_server) ! finish

If the server fails, the supervisor can spawn a replacement server process with a new pid and register it with the same name (pong_server). Thereafter, client messages to the pong_server will be delivered to the new server process. We return to discuss the scalability limitations of maintaining a global namespace in Section V.
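To make the recovery pattern concrete, here is a minimal sketch of a supervising process (our illustration, with invented names such as pong_super and pong_loop; it is not code from the paper, which would typically use an OTP supervisor):

-module(pong_super).
-export([start/0]).

%% Spawn the server, register it under a global name, and supervise it.
start() ->
    Pid = spawn(fun pong_loop/0),
    yes = global:register_name(pong_server, Pid),
    monitor_loop(Pid).

%% When the server dies, spawn a replacement and re-register the same
%% name, so clients can keep sending messages to pong_server.
monitor_loop(Pid) ->
    Ref = erlang:monitor(process, Pid),
    receive
        {'DOWN', Ref, process, Pid, _Reason} ->
            New = spawn(fun pong_loop/0),
            global:re_register_name(pong_server, New),
            monitor_loop(New)
    end.

%% A trivial server body.
pong_loop() ->
    receive
        {ping, From} -> From ! pong, pong_loop();
        finish       -> ok
    end.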

v ETS: Erlang Term Storage

Erlang is a pragmatic language, and the actor model it supports is not pure: Erlang processes, besides communicating via asynchronous message passing, can also share data in public memory areas called ETS tables.

The Erlang Term Storage (ETS) mechanism is a central component of Erlang's implementation. It is used internally by many libraries and underlies the in-memory databases. ETS tables are key-value stores: they store Erlang tuples where one of the positions in the tuple serves as the lookup key. An ETS table has a type that may be either set, bag or duplicate_bag, implemented as a hash table, or ordered_set, which is currently implemented as an AVL tree. The main operations that ETS supports are table creation, insertion of individual entries and atomic bulk insertion of multiple entries in a table, deletion and lookup of an entry based on some key, and destructive update. The operations are implemented as C built-in functions in the Erlang VM.

The code snippet below creates a set ETS table, keyed by the first element of the entry; atomically inserts two elements with keys some_key and 42; updates the value associated with the table entry with key 42; and then looks up this entry:

Table = ets:new(my_table, [set, public, {keypos, 1}]),
ets:insert(Table, [{some_key, an_atom_value},
                   {42, {a, tuple, value}}]),
ets:update_element(Table, 42, [{2, {another, tuple, value}}]),
[{Key, Value}] = ets:lookup(Table, 42).

ETS tables are heavily used in many Erlang applications. This is partly due to the convenience of sharing data for some programming tasks, but also partly due to their fast implementation. As a shared resource, however, ETS tables induce contention and become a scalability bottleneck, as we shall see in Section i.
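As a concrete illustration of reducing such contention (our example, not from the paper; the options shown are standard ETS table options), tables accessed by many schedulers are commonly created with the read_concurrency and write_concurrency options, trading memory for finer-grained locking:

Counters = ets:new(counters, [set, public,
                              {write_concurrency, true},
                              {read_concurrency, true}]),
true = ets:insert(Counters, {hits, 0}),
1 = ets:update_counter(Counters, hits, 1).  % atomic in-place increment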

III Benchmarks for scalability and reliability

The two benchmarks that we use throughout this article are Orbit, which measures scalability without looking at reliability, and Ant Colony Optimisation (ACO), which allows us to measure the impact on scalability of adding global namespaces to ensure reliability. The source code for the benchmarks, together with more detailed documentation, is available at https://github.com/release-project/benchmarks. The RELEASE project team also worked to improve the reliability and scalability of other Erlang programs, including a substantial (approximately 150K lines of Erlang code) Sim-Diasca simulator [16] and an Instant Messenger that is more typical of Erlang applications [23], but we do not cover these systematically here.

i Orbit

Orbit is a computation in symbolic algebra which generalises a transitive closure computation [57]. To compute the orbit for a given space [0, X], a list of generators g1, g2, ..., gn are applied on an initial vertex x0 ∈ [0, X]. This creates new values (x1, ..., xn) ∈ [0, X], where xi = gi(x0). The generator functions are applied on the new values until no new value is generated.

Figure 2: Distributed Erlang Orbit (D-Orbit) architecture: workers are mapped to nodes.

Orbit is a suitable benchmark because it has a number of aspects that characterise a class of real applications. The core data structure it maintains is a set, and in distributed environments it is implemented as a distributed hash table (DHT), similar to the DHTs used in replicated form in NoSQL database management systems. Also, in distributed mode, it uses standard peer-to-peer (P2P) techniques like a credit-based termination algorithm [61]. By choosing the orbit size, the benchmark can be parameterised to specify smaller or larger computations that are suitable to run on a single machine (Section i) or on many nodes (Section i). Moreover, it is only a few hundred lines of code.
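The following sequential sketch (our illustration, with invented module and function names; it is not the benchmark code) shows the fixed-point structure of the computation, with a gb_sets set standing in for the distributed hash table:

-module(orbit_seq).
-export([orbit/2]).

%% orbit(Gens, X0): apply each generator to newly discovered vertices
%% until no new vertex appears, returning all reachable vertices.
orbit(Gens, X0) ->
    explore(Gens, [X0], gb_sets:from_list([X0])).

explore(_Gens, [], Seen) ->
    gb_sets:to_list(Seen);
explore(Gens, [X | Front], Seen) ->
    Candidates = [G(X) || G <- Gens],
    {Front1, Seen1} =
        lists:foldl(fun(Y, {F, S}) ->
                            case gb_sets:is_member(Y, S) of
                                true  -> {F, S};
                                false -> {[Y | F], gb_sets:add(Y, S)}
                            end
                    end, {Front, Seen}, Candidates),
    explore(Gens, Front1, Seen1).

For example, orbit_seq:orbit([fun(X) -> X*X rem 1024 end, fun(X) -> (X+1) rem 1024 end], 1) explores a small space; the distributed benchmark instead partitions the seen set across worker processes.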

As shown in Figure 2, the computation is initiated by a master which creates a number of workers. In the single node scenario of the benchmark, workers correspond to processes, but these workers can also spawn other processes to apply the generator functions on a subset of their input values, thus creating intra-worker parallelism. In the distributed version of the benchmark, processes are spawned by the master node to worker nodes, each maintaining a DHT fragment. A newly spawned process gets a share of the parent's credit and returns this on completion. The computation is finished when the master node collects all credit, i.e. all workers have completed.

ii Ant Colony Optimisation (ACO)

Ant Colony Optimisation [29] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. For the purpose of this article we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [63], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node and consisting of multiple "ants". Ants are simple computational agents which concurrently compute possible solutions to the input problem, guided by shared information about good paths through the search space; there is also a certain amount of stochastic variation which allows the ants to explore new directions. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution.

We implement four distributed coordination patterns for the same multi-colony ACO computation, as follows. In each implementation, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report their best solutions; the globally-best solution is then selected and is reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations.

Figure 3: Distributed Erlang Two-level Ant Colony Optimisation (TL-ACO) architecture.

Figure 4: Distributed Erlang Multi-level Ant Colony Optimisation (ML-ACO) architecture.

Two-level ACO (TL-ACO) has a single master node that collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 3 depicts the process and node placements of TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on the available nodes. In the next step, each colony process spawns NA ant processes on the local node. Each ant iterates IA times, returning its result to the colony master. Each colony iterates IM times, reporting its best solution to, and receiving the globally-best solution from, the master process. We validated the implementation by applying TL-ACO to a number of standard SMTWTP instances [25, 13, 34], obtaining good results in all cases, and confirmed that the number of perfect solutions increases as we increase the number of colonies.

Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies, and thus could become a bottleneck. ML-ACO addresses this by having a tree of submasters (Figure 4), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from their children.

Globally Reliable ACO (GR-ACO). In ML-ACO, if a single colony fails to report back, the system will wait indefinitely. GR-ACO adds fault tolerance, supervising colonies so that a faulty colony can be detected and restarted, allowing the system to continue execution.

Scalable Reliable ACO (SR-ACO) also adds fault-tolerance, but using supervision within our new s_groups from Section i.1; the architecture of SR-ACO is discussed in detail there.

IV Erlang Scalability Limits

This section investigates the scalability of Erlang at VM, language, and persistent storage levels. An aspect we choose not to explore is the security of large scale systems, where for example one might imagine providing enhanced security for systems with multiple clusters or cloud instances connected by a Wide Area Network. We assume that existing security mechanisms are used, e.g. a Virtual Private Network.

i Scaling Erlang on a Single Host

To investigate Erlang scalability we built BenchErl, an extensible open source benchmark suite with a web interface⁴. BenchErl shows how an application's performance changes when resources, like cores or schedulers, are added, or when options that control these resources change:

• the number of nodes, i.e. the number of Erlang VMs used, typically on multiple hosts;

• the number of cores per node;

• the number of schedulers, i.e. the OS threads that execute Erlang processes in parallel, and their binding to the topology of the cores of the underlying computer node;

• the Erlang/OTP release and flavor; and

• the command-line arguments used to start the Erlang nodes.

Using BenchErl we investigated the scalability of an initial set of twelve benchmarks and two substantial Erlang applications, using a single Erlang node (VM) on machines with up to 64 cores, including the Bulldozer machine specified in Appendix A.

⁴ Information about BenchErl is available at http://release.softlab.ntua.gr/bencherl.

Figure 5: Runtime and speedup of two configurations of the Orbit benchmark (with and without intra-worker parallelism) using Erlang/OTP R15B01; time (ms) and speedup are plotted against the number of schedulers.

This set of experiments, reported by [8], confirmed that some programs scaled well in the most recent Erlang/OTP release of the time (R15B01), but also revealed VM and language level scalability bottlenecks.

Figure 5 shows runtime and speedup curves for the Orbit benchmark, where master and workers run on a single Erlang node in configurations with and without intra-worker parallelism. In both configurations the program scales: runtime continuously decreases as we add more schedulers to exploit more cores. The speedup of the benchmark without intra-worker parallelism, i.e. without spawning additional processes for the computation (green curve), is almost linear up to 32 cores but increases less rapidly from that point on; we see a similar but more clearly visible pattern for the other configuration (red curve), where there is no performance improvement beyond 32 schedulers. This is due to the asymmetric characteristics of the machine's micro-architecture, which consists of modules that couple two conventional x86 out-of-order cores that share the early pipeline stages, the floating point unit, and the L2 cache with the rest of the module [3].

Some other benchmarks, however, did not scale well or experienced significant slowdowns when run on many VM schedulers (threads). For example, the ets_test benchmark has multiple processes accessing a shared ETS table. Figure 6 shows runtime and speedup curves for ets_test on a 16-core (eight cores with hyperthreading) Intel Xeon-based machine. It shows that runtime increases beyond two schedulers, and that the program exhibits a slowdown instead of a speedup.

For many benchmarks there are obvious reasons for poor scaling, like limited parallelism in the application or contention for shared resources. The reasons for poor scaling are less obvious for other benchmarks, and it is exactly these that we have chosen to study in detail in subsequent work [8, 46].

A simple example is the parallel BenchErl benchmark, which spawns some n processes, each of which creates a list of m timestamps and, after checking that each timestamp in the list is strictly greater than the previous one, sends the result to its parent. Figure 7 shows that up to eight cores each additional core leads to a slowdown; thereafter a small speedup is obtained up to 32 cores; and then again a slowdown. A small aspect of the benchmark, easily overlooked, explains the poor scalability: the benchmark creates timestamps using the erlang:now/0 built-in function, whose implementation acquires a global lock in order to return a unique timestamp. That is, two calls to erlang:now/0, even from different processes,

Figure 6: Runtime and speedup of the ets_test BenchErl benchmark using Erlang/OTP R15B01; time (ms) and speedup are plotted against the number of schedulers.

are guaranteed to produce monotonically increasing values. This lock is precisely the bottleneck in the VM that limits the scalability of this benchmark. We describe our work to address VM timing issues in Section iii.
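As an illustration of the remedy (a sketch of ours using the standard time API introduced in Erlang/OTP 18, not the benchmark's code), timestamps can be taken without the global lock:

%% erlang:now/0 serialises all callers to guarantee unique, strictly
%% increasing results.
old_stamp() ->
    erlang:now().

%% Monotonic time is lock-free; pairing it with a unique integer
%% restores uniqueness where the caller needs it.
new_stamp() ->
    {erlang:monotonic_time(), erlang:unique_integer([monotonic])}.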

Discussion. Our investigations identified contention for shared ETS tables, and for commonly-used shared resources like timers, as the key VM-level scalability issues. Section VI outlines how we addressed these issues in recent Erlang/OTP releases.

ii Distributed Erlang Scalability

Network Connectivity Costs. When any normal distributed Erlang nodes communicate, they share their connection sets, and this typically leads to a fully connected graph of nodes.

Figure 7: Runtime and speedup of the parallel BenchErl benchmark using Erlang/OTP R15B01; time (ms) and speedup are plotted against the number of schedulers.

So a system with n nodes will maintain O(n²) connections, i.e. n(n-1)/2 of them (4,950 TCP connections for 100 nodes), and these are relatively expensive TCP connections with continual maintenance traffic. This design aids transparent distribution, as there is no need to discover nodes, and it works well for small numbers of nodes. However, at emergent server architecture scales, i.e. hundreds of nodes, this design becomes very expensive, and system architects must depart from the default Erlang model, e.g. they need to start using hidden nodes that do not share connection sets.

We have investigated the scalability limits imposed by network connectivity costs using several Orbit calculations on two large clusters, Kalkyl and Athos, as specified in Appendix A. The Kalkyl results are discussed by [21], and Figure 29 in Section i shows representative results for distributed Erlang computing orbits with 2M and 5M elements on Athos. In all cases performance degrades beyond some scale (40 nodes for the 5M orbit, and 140 nodes for the 2M orbit). Figure 32 illustrates the additional network traffic induced by the fully connected network: it allows a comparison between the number of packets sent in a fully connected network (ML-ACO) and those sent in a network partitioned using our new s_groups (SR-ACO).

Global Information Costs. Maintaining global information is known to limit the scalability of distributed systems, and crucially the process namespaces used for reliability are global. To investigate the scalability limits imposed on distributed Erlang by such global information, we have designed and implemented DE-Bench, an open source, parameterisable, and scalable peer-to-peer benchmarking framework [37, 36]. DE-Bench measures the throughput and latency of distributed Erlang commands on a cluster of Erlang nodes, and the design is influenced by the Basho Bench benchmarking tool for Riak [12]. Each DE-Bench instance acts as a peer, providing scalability and reliability by eliminating central coordination and any single point of failure.

To evaluate the scalability of distributed Erlang, we measure how adding more hosts increases the throughput, i.e. the total number of successfully executed distributed Erlang commands per experiment. Figure 8 shows the parameterisable internal workflow of DE-Bench. There are three classes of commands in DE-Bench: (i) Point-to-Point (P2P) commands, where a function with tunable argument size and computation time is run on a remote node; these include spawn, rpc, and synchronous calls to server processes, i.e. gen_server or gen_fsm. (ii) Global commands, which entail synchronisation across all connected nodes, such as global:register_name and global:unregister_name. (iii) Local commands, which are executed independently by a single node, e.g. whereis_name, a look-up in the local name table.

Figure 8: DE-Bench's internal workflow.
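The snippet below (our illustration, with invented function names; it is not DE-Bench source) shows a representative command from each class:

%% P2P: run a function on an explicit remote node.
p2p(Node) ->
    rpc:call(Node, erlang, length, [[1, 2, 3]]).

%% Global: name (un)registration synchronises all connected nodes.
global_cmd(Name, Pid) ->
    yes = global:register_name(Name, Pid),
    global:unregister_name(Name).

%% Local: a name look-up consults only the local name table.
local_cmd(Name) ->
    global:whereis_name(Name).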

The benchmarking is conducted on 10 to 100 host configurations of the Kalkyl cluster (in steps of 10), and measures the throughput of successful commands per second over 5 minutes. There is one Erlang VM on each host and one DE-Bench instance on each VM. The full paper [37] investigates the impact of data size and computation time in P2P calls, both independently and in combination, and the scaling properties of the common Erlang/OTP generic server processes gen_server and gen_fsm.

Here we focus on the impact of different proportions of global commands, mixing global with P2P and local commands. Figure 9 shows that even a low proportion of global commands limits the scalability of distributed Erlang, e.g. just 0.01% global commands limits scalability to around 60 nodes. Figure 10 reports the latency of all commands, and shows that while the latencies for P2P and local commands are stable at scale, the latency of the global commands increases dramatically with scale. Both results illustrate that the impact of global operations on throughput and latency in a distributed Erlang system is severe.

Explicit Placement. While network connectivity and global information impact performance at scale, our investigations also identified explicit process placement as a programming issue at scale. Recall from Section iii that distributed Erlang requires the programmer to identify an explicit Erlang node (VM) when spawning a process. Identifying an appropriate node becomes a significant burden for large and dynamic systems. The problem is exacerbated in large distributed systems where (1) the hosts may not be identical, having different hardware capabilities or different software installed, and (2) communication times may be non-uniform: it may be fast to send a message between VMs on the same host, and slow if the VMs are on different hosts in a large distributed system.

These factors make it difficult to deploy applications, especially in a scalable and portable manner. Moreover, while the programmer may be able to use platform-specific knowledge to decide where to spawn processes so that an application runs efficiently, if the application is then deployed on a different platform, or if the platform changes as hosts fail or are added, this knowledge becomes outdated.

Discussion. Our investigations confirm three language-level scalability limitations of Erlang from developer folklore: (1) maintaining a fully connected network of Erlang nodes limits scalability, for example Orbit is typically limited to just 40 nodes; (2) global operations, and crucially the global operations required for reliability, i.e. to maintain a global namespace, seriously limit the scalability of distributed Erlang systems; (3) explicit process placement makes it hard to build performance portable applications for large architectures. These issues cause designers of reliable large scale systems in Erlang to depart from the standard Erlang model, e.g. using techniques like hidden nodes and storing pids in data structures. In Section V we develop language technologies to address these issues.

Figure 9: Scalability vs. percentage of global commands in distributed Erlang.

iii Persistent Storage

Any large scale system needs reliable scalable persistent storage, and we have studied the scalability limits of Erlang persistent storage alternatives [38]. We envisage a typical large server having around 10⁵ cores on around 100 hosts. We have reviewed the requirements for scalable and available persistent storage, and evaluated four popular Erlang DBMS against these requirements. For a target scale of around 100 hosts, Mnesia and CouchDB are, unsurprisingly, not suitable. However, Dynamo-style NoSQL DBMS like Cassandra and Riak have the potential to be.

We have investigated the current scalability limits of the Riak NoSQL DBMS using the Basho Bench benchmarking framework on a cluster with up to 100 nodes and independent disks. We found that the scalability limit of Riak version 1.1.1 is 60 nodes on the Kalkyl cluster. The study placed into the public scientific domain what was previously well-evidenced but anecdotal developer experience.

We have also shown that resources like memory, disk, and network do not limit the scalability of Riak. By instrumenting the global and gen_server OTP libraries, we identified a specific Riak remote procedure call that fails to scale. We outline how later releases of Riak are refactored to eliminate the scalability bottlenecks.

Figure 10: Latency of commands as the number of Erlang nodes increases.

Discussion. We conclude that Dynamo-like NoSQL DBMSs have the potential to deliver reliable persistent storage for Erlang at our target scale of approximately 100 hosts. Specifically, an Erlang Cassandra interface is available, and Riak 1.1.1 already provides scalable and available persistent storage on 60 nodes. Moreover, the scalability of Riak is much improved in subsequent versions.

V Improving Language Scalability

This section outlines the Scalable Distributed (SD) Erlang libraries [21] that we have designed and implemented to address the distributed Erlang scalability issues identified in Section ii. SD Erlang introduces two concepts to improve scalability: s_groups partition the set of nodes in an Erlang system to reduce network connectivity and partition global data (Section i.1); semi-explicit placement alleviates the issues of explicit process placement in large heterogeneous networks (Section i.2). The two features are independent and can be used separately or in combination. We overview SD Erlang in Section i, and outline s_group semantics and validation in Sections ii and iii respectively.

i SD Erlang Design

i.1 S_groups

S_groups reduce both the number of connections a node maintains and the size of namespaces, i.e. they minimise global information. Specifically, names are registered on, and synchronised between, only the nodes within the s_group. An s_group has the following parameters: a name, a list of nodes, and a list of registered names. A node can belong to many s_groups or to none. If a node belongs to no s_group, it behaves as a usual distributed Erlang node.

The s_group library defines the functions shown in Table 1. Some of these functions manipulate s_groups and provide information about them, such as creating s_groups and providing a list of nodes from a given s_group. The remaining functions manipulate names registered in s_groups and provide information about these names. For example, to register a process Pid with name Name in s_group SGroupName we use the function below.


Table 1: Summary of s_group functions.

new_s_group(SGroupName, Nodes)            Creates a new s_group consisting of some nodes.
delete_s_group(SGroupName)                Deletes an s_group.
add_nodes(SGroupName, Nodes)              Adds a list of nodes to an s_group.
remove_nodes(SGroupName, Nodes)           Removes a list of nodes from an s_group.
s_groups()                                Returns a list of all s_groups known to the node.
own_s_groups()                            Returns a list of s_group tuples the node belongs to.
own_nodes()                               Returns a list of nodes from all s_groups the node belongs to.
own_nodes(SGroupName)                     Returns a list of nodes from the given s_group.
info()                                    Returns s_group state information.
register_name(SGroupName, Name, Pid)      Registers a name in the given s_group.
re_register_name(SGroupName, Name, Pid)   Re-registers a name (changes a registration) in a given s_group.
unregister_name(SGroupName, Name)         Unregisters a name in the given s_group.
registered_names({node, Node})            Returns a list of all registered names on the given node.
registered_names({s_group, SGroupName})   Returns a list of all registered names in the given s_group.
whereis_name(SGroupName, Name)            Return the pid of a name registered in the given
whereis_name(Node, SGroupName, Name)      s_group.
send(SGroupName, Name, Msg)               Send a message to a name registered in the given
send(Node, SGroupName, Name, Msg)         s_group.

The name will only be registered if the process is being executed on a node that belongs to the given s_group, and neither Name nor Pid are already registered in that group:

s_group:register_name(SGroupName, Name, Pid) -> yes | no
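A hedged usage sketch follows (the node and group names are made up, and the member nodes must already be running and reachable):

%% Create an s_group spanning two nodes, then register and look up a
%% name that is synchronised only within that group.
{ok, group1, _Nodes} =
    s_group:new_s_group(group1, ['node1@host1', 'node2@host2']),
Pid = spawn(fun() -> receive stop -> ok end end),
yes = s_group:register_name(group1, my_server, Pid),
Pid = s_group:whereis_name(group1, my_server).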

To illustrate the impact of s_groups on scalability, we repeat the global operations experiment from Section ii (Figure 9). In the SD Erlang experiment we partition the set of nodes into s_groups, each containing ten nodes, and hence the names are replicated and synchronised on just ten nodes, and not on all nodes as in distributed Erlang. The results in Figure 11 show that with 0.01% of global operations the throughput of distributed Erlang stops growing at 40 nodes, while the throughput of SD Erlang continues to grow linearly.

The connection topology of s_groups is extremely flexible: they may be organised into a hierarchy of arbitrary depth or branching, e.g. there could be multiple levels in the tree of s_groups; see Figure 12.

Figure 11: Global operations in distributed Erlang vs SD Erlang (throughput against number of nodes).

Moreover, it is not necessary to create a hierarchy of s_groups; for example, we have constructed an Orbit implementation using a ring of s_groups.

Given such a flexible way of organising distributed systems, key questions in the design of an SD Erlang system are the following. How should s_groups be structured? Depending on the reason the nodes are grouped (reducing the number of connections, reducing the namespace, or both), s_groups can be freely structured as a tree, ring, or some other topology. How large should the s_groups be? Smaller s_groups mean more inter-group communication, but the synchronisation of the s_group state between the s_group nodes constrains the maximum size of s_groups. We have not found this constraint to be a serious restriction: for example, many s_groups are either relatively small, e.g. 10-node, internal or terminal elements in some topology, e.g. the leaves and nodes of a tree. How do nodes from different s_groups communicate? While any two nodes can communicate in an SD Erlang system, to minimise the number of connections, communication between nodes from different s_groups is typically routed via gateway nodes that belong to both s_groups. How do we avoid single points of failure? For reliability, and to minimise communication load, multiple gateway nodes and processes may be required.

Information to make these design choices is provided by the tools in Section VII and by benchmarking. A further challenge is how to systematically refactor a distributed Erlang application into SD Erlang, and this is outlined in Section i. A detailed discussion of distributed system design and refactoring in SD Erlang is provided in a recent article [22].

We illustrate typical SD Erlang system designs by showing refactorings of the Orbit and ACO benchmarks from Section III. In both distributed and SD Erlang, the computation starts on the Master node and the actual computation is done on the Worker nodes. In the distributed Erlang version all nodes are interconnected, and messages are transferred directly from the sending node to the receiving node (Figure 2). In contrast, in the SD Erlang version nodes are grouped into s_groups, and messages are transferred between different s_groups via Submaster nodes (Figure 13).

Figure 12: SD Erlang ACO (SR-ACO) architecture: a master node with the master process at level 0, sub-master processes at intermediate levels, and colony nodes (running colony and ant processes) organised into s_groups.

A fragment of code that creates an s_group on a Sub-master node is as follows:

create_s_group(Master, GroupName, Nodes0) ->
    case s_group:new_s_group(GroupName, Nodes0) of
        {ok, GroupName, Nodes} ->
            {Master, GroupName, Nodes};
        _ -> io:format("exception: message")
    end.

Similarly, we introduce s_groups in the GR-ACO benchmark from Section ii to create Scalable Reliable ACO (SR-ACO); see Figure 12. Here, apart from reducing the number of connections, s_groups also reduce the global namespace information. That is, instead of registering the name of a pid globally, i.e. with all nodes, the name is registered only on all nodes in the s_group, with s_group:register_name/3.

A comparative performance evaluation of distributed Erlang and SD Erlang Orbit and ACO is presented in Section VIII.

i.2 Semi-Explicit Placement

Recall from Section iii that distributed Erlang spawns a process onto an explicitly named Erlang node, e.g.:

spawn(some_node, fun some_module:pong/0)

and also recall the portability and programming effort issues associated with such explicit placement in large scale systems, discussed in Section ii.

To address these issues we have developed a semi-explicit placement library that enables the programmer to select nodes on which to spawn processes based on run-time information about the properties of the nodes. For example, if a process performs a lot of computation, one would like to spawn it on a node with considerable computation power; or if two processes are likely to communicate frequently, then it would be desirable to spawn them on the same node, or on nodes with a fast interconnect.

We have implemented two Erlang libraries to support semi-explicit placement [60]. The first deals with node attributes, and describes properties of individual Erlang VMs and associated hosts, such as total and currently available RAM, installed software, hardware configuration, etc. The second deals with a notion of communication distances, which models the communication times between nodes in a distributed system. Therefore, instead of specifying a node, we can use the attr:choose_node/1 function to define the target node, i.e.:

spawn(attr:choose_node(Params),
      fun some_module:pong/0)

Figure 13: SD Erlang (SD-Orbit) architecture.

[60] report an investigation into the communication latencies on a range of NUMA and cluster architectures, and demonstrate the effectiveness of the placement libraries using the ML-ACO benchmark on the Athos cluster.
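To give a flavour of attribute-based choice, the sketch below (a simplified stand-in of our own devising; it is not the attr library's API) picks the connected node currently using the least memory:

%% Score every connected node by negated memory use, so lists:max
%% selects the least-loaded node; unreachable nodes rank last.
choose_least_loaded() ->
    Scored = [{score(N), N} || N <- [node() | nodes()]],
    {_Score, Best} = lists:max(Scored),
    Best.

score(N) ->
    case rpc:call(N, erlang, memory, [total]) of
        Bytes when is_integer(Bytes) -> -Bytes;
        _ -> -(1 bsl 60)                 % unreachable: rank last
    end.

A process can then be placed with spawn(choose_least_loaded(), Fun); the real library generalises this to arbitrary node attributes and to communication distances.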

ii S_group Semantics

For precise specification, and as a basis for validation, we provide a small-step operational semantics of the s_group operations [21]. Figure 14 defines the state of an SD Erlang system and associated abstract syntax variables. The abstract syntax variables on the left are defined as members of sets, denoted {...}, and these in turn may contain tuples, denoted (...), or further sets. In particular, nm is a process name, p a pid, ni a node_id, and nis a set of node_ids. The state of a system is modelled as a four tuple comprising a set of s_groups, a set of free_groups, a set of free_hidden_groups, and a set of nodes. Each type of group is associated with nodes and has a namespace. An s_group additionally has a name, whereas a free_hidden_group consists of only one node: i.e. a hidden node simultaneously acts as a node and as a group, because as a group it has a namespace but does not share it with any other node. Free normal and hidden groups have no names, and are uniquely defined by the nodes associated with them. Therefore, group names, gr_names, are either NoGroup or a set of s_group_names. A namespace is a set of name and process id (pid) pairs, and is replicated on all nodes of the associated group.

A node has the following four parameters: a node_id identifier; a node_type that can be either hidden or normal; connections; and group_names, i.e. the names of the groups the node belongs to. The node can belong to either a list of s_groups or one of the free groups. The type of the free group is defined by the node type. Connections are a set of node_ids.

Transitions in the semantics have the form (state, command, ni) → (state', value), meaning that executing command on node ni in state returns value and transitions to state'.

The semantics is presented in more detail by [21], but we illustrate it here with the s_group:registered_names/1 function from Section i.1. The function returns a list of names registered in s_group s if node ni belongs to the s_group, and an empty list otherwise (Figure 15).

(grs, fgs, fhs, nds) ∈ state ≡ ({s_group}, {free_group}, {free_hidden_group}, {node})

gr ∈ grs ≡ {s_group} ≡ {(s_group_name, {node_id}, namespace)}
fg ∈ fgs ≡ {free_group} ≡ {({node_id}, namespace)}
fh ∈ fhs ≡ {free_hidden_group} ≡ {(node_id, namespace)}
nd ∈ nds ≡ {node} ≡ {(node_id, node_type, connections, gr_names)}

gs ∈ gr_names ≡ {NoGroup, {s_group_name}}
ns ∈ namespace ≡ {(name, pid)}
cs ∈ connections ≡ {node_id}
nt ∈ node_type ≡ {Normal, Hidden}
s ∈ {NoGroup} ∪ {s_group_name}

Figure 14: SD Erlang state [21].

Here ⊕ denotes disjoint set union; IsSGroupNode returns true if node ni is a member of some s_group s, and false otherwise; and OutputNms returns a set of process names registered in the ns namespace of s_group s.

iii Semantics Validation

As the semantics is concrete, it can readily be made executable in Erlang, with lists replacing sets throughout. Having an executable semantics allows users to engage with it, and to understand how the semantics behaves vis-à-vis the library, giving them an opportunity to assess the correctness of the library against the semantics.

Figure 16: Testing s_groups using QuickCheck
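As an instance of this, the two auxiliaries of Figure 15 can be rendered over lists as follows (our sketch, with sets represented as lists, and grs as a list of {Name, NodeIds, Namespace} tuples):

    %% is_s_group_node/3: true if node Ni is a member of s_group S in Grs.
    is_s_group_node(Ni, S, Grs) ->
        case lists:keyfind(S, 1, Grs) of
            {S, Nis, _Ns} -> lists:member(Ni, Nis);
            false         -> false
        end.

    %% output_nms/2: the names registered in namespace Ns, tagged with S.
    output_nms(S, Ns) ->
        [{S, Nm} || {Nm, _Pid} <- Ns].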

Better still, we can automatically assess how the system behaves in comparison with the (executable) semantics by executing them in lockstep, guided by the constraints of which operations are possible at each point. We do that by building an abstract state machine model of the library. We can then generate random sequences (or traces) through the model, with appropriate library data generated too. This random generation is supported by the QuickCheck property-based testing system [24, 9].

The architecture of the testing framework is shown in Figure 16. First, an abstract state machine embedded as an "eqc_statem" module is derived from the executable semantic specification. The state machine defines the abstract state representation and the transition from one state to another when an operation is applied. Test case and data generators are then defined to control the test case generation;

((grs, fgs, fhs, nds), s_group:registered_names(s), ni)
    −→ ((grs, fgs, fhs, nds), nms)    if IsSGroupNode(ni, s, grs)
    −→ ((grs, fgs, fhs, nds), {})     otherwise

where {(s, {ni} ⊕ nis, ns)} ⊕ grs′ ≡ grs
      nms ≡ OutputNms(s, ns)

IsSGroupNode(ni, s, grs) = ∃ nis, ns, grs′ . {(s, {ni} ⊕ nis, ns)} ⊕ grs′ ≡ grs
OutputNms(s, ns) = {(s, nm) | (nm, p) ∈ ns}

Figure 15: SD Erlang semantics of the s_group:registered_names/1 function

this includes the automatic generation of eligible s_group operations and the input data to those operations. Test oracles are encoded as the postcondition for s_group operations.

During testing, each test command is applied to both the abstract model and the s_group library. The application of the test command to the abstract model takes the abstract model from its current state to a new state, as described by the transition functions, whereas the application of the test command to the library leads the system to a new actual state. The actual state information is collected from each node in the distributed system, then merged and normalised to the same format as the abstract state representation. For a test to be successful, after the execution of a test command the test oracles specified for this command should be satisfied. Various test oracles can be defined for s_group operations; for instance, one of the generic constraints that applies to all the s_group operations is that after each s_group operation the normalised system state should be equivalent to the abstract state.
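A skeletal eqc_statem module of this kind is sketched below. The callback names are QuickCheck's; the state record, the generators, and the exact shapes of the s_group calls are illustrative and may differ from the released library:

    -module(s_group_eqc).
    -include_lib("eqc/include/eqc.hrl").
    -include_lib("eqc/include/eqc_statem.hrl").
    -compile(export_all).

    %% Abstract state: a list of {SGroupName, NodeIds, Namespace} tuples,
    %% mirroring the semantics state of Figure 14 (illustrative).
    -record(state, {groups = []}).

    initial_state() -> #state{}.

    %% Generate an eligible command in the current abstract state.
    command(_S) ->
        oneof([{call, s_group, new_s_group, [group_name(), [node()]]},
               {call, s_group, registered_names, [group_name()]}]).

    group_name() -> elements([group_a, group_b]).

    precondition(_S, _Call) -> true.

    %% Transitions of the abstract model.
    next_state(S, _Res, {call, s_group, new_s_group, [Name, Nodes]}) ->
        S#state{groups = lists:keystore(Name, 1, S#state.groups,
                                        {Name, Nodes, []})};
    next_state(S, _Res, _Call) -> S.

    %% Test oracle: the library's answer must match the abstract namespace.
    postcondition(S, {call, s_group, registered_names, [Name]}, Res) ->
        Expected = case lists:keyfind(Name, 1, S#state.groups) of
                       {Name, _Nodes, Ns} -> [Nm || {Nm, _Pid} <- Ns];
                       false              -> []
                   end,
        lists:sort(Res) =:= lists:sort(Expected);
    postcondition(_S, _Call, _Res) -> true.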

Thousands of tests were run, and three kinds of errors (all subsequently corrected) were found. Some errors in the library implementation were found, including one error due to the synchronisation between nodes, and another related to the remove_nodes operation, which erroneously raised an exception. We also found a couple of trivial errors in the semantic specification itself, which had been missed by manual examination. Finally, we found some situations where there were inconsistencies between the semantics and the library implementation, despite their states being equivalent; an example of this was in the particular values returned by functions on certain errors. Overall, the automation of testing boosted our confidence in the correctness of the library implementation and the semantic specification. This work is reported in more detail by [54].

VI Improving VM Scalability

This section reports the primary VM and library improvements we have designed and implemented to address the scalability and reliability issues identified in Section i.

i Improvements to Erlang Term Storage

Because ETS tables are so heavily used in Erlang systems, they are a focus for scalability improvements. We start by describing their redesign, including some improvements that pre-date our RELEASE project work, i.e. those prior to Erlang/OTP R15B03. These historical improvements are very relevant for a scalability study, and form the basis for our subsequent changes and improvements. At the point when Erlang/OTP got support for multiple cores (in release R11B), there was a single reader-writer lock for each ETS table. Optional fine-grained locking of hash-based ETS tables (i.e. set, bag or duplicate_bag tables) was introduced in Erlang/OTP R13B02-1, adding 16 reader-writer locks for the hash buckets. Reader groups to minimise read synchronisation overheads were introduced in Erlang/OTP R14B. The key observation is that a single count of the multiple readers must be synchronised across many cache lines, potentially far away in a NUMA system. Maintaining reader counts in multiple (local) caches makes reads fast, although writes must now check every reader count. In Erlang/OTP R16B the number of bucket locks and the default number of reader groups were both upgraded from 16 to 64.
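These concurrency options are selected per table when it is created; a brief sketch using the standard ets:new/2 options (the table name is ours; the reader-groups limit itself is a VM-wide setting):

    %% A hash-based table with per-bucket write locks and reader-group
    %% optimised read locks:
    T = ets:new(bench_table, [set, public,
                              {write_concurrency, true},
                              {read_concurrency, true}]),
    true = ets:insert(T, {key, value}),
    [{key, value}] = ets:lookup(T, key).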

We illustrate the scaling properties of the ETS concurrency options using the ets_bench BenchErl benchmark on the Intel NUMA machine with 32 hyperthreaded cores specified in Appendix A. The ets_bench benchmark inserts 1M items into the table, then records the time to perform 17M operations, where an operation is either a lookup, an insert, or a delete. The experiments are conducted on a hash-based (set) ETS table with different percentages of update operations, i.e. insertions or deletions.

Figure 17 shows the runtimes in seconds of 17M operations in different Erlang/OTP versions, varying the number of schedulers (x-axis), reflecting how the scalability of ETS tables has improved in more recent Erlang/OTP releases.

Figure 17: Runtime of 17M ETS operations in Erlang/OTP releases: 10% updates (left) and 1% updates (right). The y-axis shows time in seconds (lower is better), and the x-axis is the number of OS threads (VM schedulers).

Figure 18 shows the runtimes in seconds of 17M operations on an ETS table with different numbers of reader groups, again varying the number of schedulers. We see that one reader group is not sufficient with 10% updates, nor are two with 1% updates. Beyond that, different numbers of reader groups have little impact on the benchmark performance, except that using 64 groups with 10% updates slightly degrades performance.

Figure 18: Runtime of 17M ETS operations with varying numbers of reader groups: 10% updates (left) and 1% updates (right). The y-axis shows runtime in seconds (lower is better), and the x-axis is the number of schedulers.

We have explored four other extensions or redesigns in the ETS implementation for better scalability: 1. allowing more programmer control over the number of bucket locks in hash-based tables, so the programmer can reflect the number of schedulers and the expected access pattern; 2. using contention-adapting trees to get better scalability for ordered_set ETS tables, as described by [69]; 3. using queue delegation locking to improve scalability [47]; and 4. adopting schemes for completely eliminating the locks in the meta table. A more complete discussion of our work on ETS can be found in the papers by [69] and [47].

Here we outline only our work on contention-adapting (CA) trees. A CA tree monitors contention in different parts of a tree-shaped data structure, introducing routing nodes with locks in response to high contention, and removing them in response to low contention. For experimental purposes, two variants of the CA tree have been implemented to represent ordered_sets in the virtual machine of Erlang/OTP 17.0: one extends the existing AVL trees in the Erlang VM, and the other uses a Treap data structure [5]. Figure 19 compares the throughput of the CA tree variants with that of ordered_set and set as the number of schedulers increases. It is unsurprising that the CA trees scale so much better than an ordered_set, which is protected by a single readers-writer lock. It is more surprising that they also scale better than set. This is due to hash tables using fine-grained locking at a fixed granularity, while CA trees can adapt the number of locks to the current contention level, and also to the parts of the key range where contention is occurring.

Figure 19: Throughput (operations per microsecond) of CA tree variants against number of threads: 10% updates (left) and 1% updates (right).

ii Improvements to Schedulers

In the Erlang VM, a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until it either blocks waiting for input, typically a message from some other process, or until it has executed its quota of reductions.

The Erlang VM is usually started with one scheduler per logical core (SMT-thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned, it is placed in the run queue of the scheduler of its parent, and it waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores, that is, an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Figure 24 (Section iii.2).

The default load management mechanism is load compaction, which aims to keep as many scheduler threads as possible fully loaded with work, i.e. it attempts to ensure that scheduler threads do not run out of work. We have developed a new optional scheduler utilisation balancing mechanism that is available from Erlang/OTP 17.0. The new mechanism aims to balance scheduler utilisation between schedulers; that is, it will strive for equal scheduler utilisation on all schedulers.

The scheduler utilisation balancing mechanism has no performance impact on the system when not enabled. On the other hand, when enabled it results in changed timing in the system: normally there is a small overhead, due to the measuring of utilisation and the calculation of balancing information, which depends on the underlying primitives provided by the operating system.

The new balancing mechanism results in a better distribution of processes to schedulers, reducing the probability of core contention. Together with other VM improvements, such as interruptable BIFs and garbage collection, it results in lower latency and improved responsiveness, and hence reliability, for soft real-time applications.

iii Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel benchmark in Section i. The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in that returns "Erlang system time", or time since Epoch, with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, i.e. SMT-enabled. The documentation promises that values returned by it are strictly increasing, and many applications ended up relying on this; for example, applications often employ erlang:now/0 to generate unique integers.

Erlang system time should align with the operating system's view of time since Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot, without invalidating the strictly-increasing-value guarantee. The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when these two have deviated, e.g. in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], i.e. a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards and backwards, while system time is allowed to do this. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.
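The relationship is directly visible in the new API; a small sketch using the Erlang/OTP 18 time built-ins:

    %% Erlang system time = Erlang monotonic time + a (varying) offset.
    Mono   = erlang:monotonic_time(),
    Offset = erlang:time_offset(),
    %% Convert from the native unit to microseconds since Epoch:
    SysUs  = erlang:convert_time_unit(Mono + Offset, native, micro_seconds).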

Time Retrieval. Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe, but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, monotonic time delivered by the operating system is solely based on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP); that is, they will run with different frequencies. Linux is an exception, with a monotonic clock that is NTP-adjusted and runs with the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing monotonic time and system time delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute. If the adjustment needs to be changed, new adjustment information is published and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the time adjustment information previously published, and calculates Erlang monotonic time. To preserve monotonicity, it is important that all threads that read the same OS monotonic time map this to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information, using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed. This means that in the vast majority of cases the RW lock will be read-locked, which allows multiple readers to run concurrently. To prevent bouncing the lock cache-line, we use a bespoke reader-optimised RW lock implementation, where reader threads notify about their presence in counters on separate cache-lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11]; alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].

Timer Wheel and BIF Timer. The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel that is used by processes executing on the scheduler.

The implementation of timers in Erlang/OTP uses a built-in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS tables, and only insert one timer at a time into the timer wheel.

Benchmarks. We have measured several benchmarks on a 16-core Bulldozer machine with eight dual CPU AMD Opteron 4376 HEs5. We present three of them here.

The first micro benchmark compares the execution time of an Erlang receive with that of a receive ... after that specifies a timeout and provides a default value. The receive ... after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, the total execution time with the optimised timers is only 5% longer than without timers.
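For concreteness, the two forms being compared look like this (our sketch; the benchmark's actual code may differ):

    %% Plain receive: no timer is involved.
    wait_plain() ->
        receive Msg -> Msg end.

    %% receive ... after: a timer is set when the process blocks, and is
    %% cancelled if a message arrives before the timeout expires.
    wait_with_timeout(TimeoutMs, Default) ->
        receive
            Msg -> Msg
        after TimeoutMs ->
            Default
        end.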

The second micro benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4 and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on the left shows that: 1. the performance of time management has remained roughly unchanged between Erlang/OTP releases prior to 18.0; 2. the improved time management in Erlang/OTP 18.x makes time management less likely to be a scalability bottleneck, even when using erlang:now/0; and 3. the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution. The graph on the right side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

5 See §2.5.4 of the RELEASE project Deliverable 2.4: http://release-project.eu/documents/D24.pdf

Figure 20: BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases: runtimes (left) and speedup obtained using erlang:monotonic_time/0 (right).

VII Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable, to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling and debugging Erlang systems, that can be used separately or together as appropriate for solving the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on or specialise the services offered by the Erlang virtual machine through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer6 application gives a comprehensive overview of many of these data on a node-by-node basis.

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature. That is, the tools support the language itself, rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use tracing support provided by the Erlang VM, so other actor frameworks, e.g. Akka, can use the Kamon7 JVM monitoring system. Similarly, tools for other actor languages or frameworks could use data derived through OS-level tracing frameworks, DTrace8 and SystemTap9 probes, as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

6 http://www.erlang.org/doc/apps/observer
7 http://kamon.io

i Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare a program for modification or extension, or, as is the case here, in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support in a refactoring tool. There are a number of tools that support refactoring in Erlang: in the RELEASE project we have chosen to extend Wrangler10 [56]; other tools include Tidier [68] and RefactorErl [44].

Supporting API Migration. The SD Erlang libraries modify Erlang's global_group library, becoming the new s_group library; as a result, Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolves, and this often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically. The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work, and the technicalities of the implementation, can be found in a paper by [52].

8 http://dtrace.org/blogs/about
9 https://sourceware.org/systemtap/wiki
10 http://www.cs.kent.ac.uk/projects/wrangler
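As an illustration, a hypothetical adaptor in this style (the module name and the exact global_group/s_group correspondence are ours, for exposition only):

    %% Hypothetical adaptor module: each old global_group call is defined
    %% in terms of the new s_group API, and the client refactoring is
    %% generated from these definitions.
    -module(global_group_adaptor).
    -export([registered_names/1]).

    %% Old API expressed via the new one (illustrative only).
    registered_names({node, Node}) ->
        s_group:registered_names({node, Node}).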

Support for Introducing Parallelism. We have introduced support for parallelising explicit list operations (map and foreach), for process introduction to complete a computationally intensive task in parallel, for introducing a worker process to deal with call handling in an Erlang "generic server", and for parallelising a tail-recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach. The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward: even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
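A minimal sketch of a parallel map of the kind para_lib provides (our code, not the library's implementation):

    %% Spawn one worker per list element and collect results in order.
    pmap(F, Xs) ->
        Parent = self(),
        Refs = [begin
                    Ref = make_ref(),
                    spawn_link(fun() -> Parent ! {Ref, F(X)} end),
                    Ref
                end || X <- Xs],
        [receive {Ref, Result} -> Result end || Ref <- Refs].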

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute a task in parallel with its parent process. The result of the new process is sent back to the parent process, which will then consume it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies; for instance, an example might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these checks are done by program slicing, which is discussed below.

Support for Program Slicing. Program slicing is a general technique of program analysis for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE11 [17], a tool developed in another EU project that also re-uses the Wrangler front end. This work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing in transforming programs to make them suitable for the introduction of parallel structures.

11 http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf

Figure 21: The architecture of WombatOAM

ii Scalable Deployment

We have developed the WombatOAM tool12 to provide a deployment, operations and maintenance framework for large-scale Erlang distributed systems. These systems typically consist of a number of Erlang nodes executing on different hosts. These hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP.

12 WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd.

Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this is the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture. The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability, because of the role played by the central Master node; in its current version an additional layer of Middle Managers was introduced, to allow the system to scale easily to thousands of deployed nodes. As the diagram shows, the "northbound" interfaces to the web dashboard and the command-line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms, and so forth; we describe those now.

Services. WombatOAM is designed to collect, store and display various kinds of information and event from running Erlang systems, and these data are accessed and managed through an AJAX-based Web Dashboard; they include the following.

Metrics: WombatOAM supports the collection of some hundred metrics (including, for instance, the number of processes on a node, and the message traffic in and out of the node) on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom13, and interface with other tools, such as graphite14, which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications: As well as metrics, WombatOAM supports event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which is part of the standard distribution, or the lager logging framework15, and will be displayed and logged as they occur.

13 Folsom collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom
14 https://graphiteapp.org

Alarms: Alarms are more complex entities. Alarms have a state: they can be raised, and once dealt with they can be cleared. They also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as for notifications.

Topology: The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible; if not, it notifies the other services and periodically tries to reconnect. When the nodes are available again, it also notifies the other services. It doesn't have a middle manager part, because it doesn't talk to the nodes directly; instead it asks the Node manager service to do so.

Node manager: This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration: This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers, the Orchestration service uses an external library called Libcloud16, for which Erlang Solutions has written an open source Erlang wrapper, called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications: it provides infrastructure for deploying them.

15 https://github.com/basho/lager

Deployment. The mechanism consists of the following five steps:

1. Registering a provider. WombatOAM provides the same interface for different cloud providers, which support the OpenStack standard or the Amazon EC2 API. WombatOAM also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this will enable WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

16 The unified cloud API [4]: https://libcloud.apache.org

4. Defining a deployment domain. At this step a deployment domain is created, which specifies (i) the providers that should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to or removed from the system, depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single-host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue, and processes migrate between the queues through a work stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes.17

iii.1 Percept2

Percept218 builds on the existing Percept tool to provide post hoc offline analysis and visualisation of Erlang systems. Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling for a distributed Erlang application using Percept2 is shown in Figure 22.

17 This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.
18 https://github.com/RefactoringTools/percept2

Figure 22: Offline profiling of a distributed Erlang application using Percept2
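As a flavour of the built-in tracing that Percept2 builds on, the erlang:trace/3 BIF can deliver scheduling and lifecycle events to a collector process (a sketch only; Percept2's actual setup is more elaborate):

    %% Ask the VM to send this process timestamped scheduling (in/out)
    %% and process lifecycle events for all processes:
    erlang:trace(all, true, [running, procs, timestamp]),
    receive
        Event -> io:format("~p~n", [Event])
    end.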

Percept generates an application-level zoomable concurrency graph, showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways, as detailed by [53], including most importantly:

• Distinguishing between running and runnable time for each process: this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including: the migration history of a process between run queues; statistics about message passing between processes (the number of messages, and the average message size, sent/received by a process); the accumulated runtime per process; and the accumulated time when a process is in a running state.

Figure 23: Percept2 showing processes running and runnable (with 4 schedulers/run queues)

• Presenting the process tree: the hierarchy structure indicating the parent-child relationships between processes.

• Recording the dynamic function call graph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system. Unlike global_group, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have also extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled or disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating measurement of a function's own execution time gives users the freedom of not profiling all the function calls invoked during the program execution. For example, they can choose to profile only functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph, a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of generated web pages. Together these make the system more responsive and more scalable.

iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike, to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, and require neither specially recompiled versions of the software to be examined, nor special post-processing tools to create meaningful information from the data gathered. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the sizes of the run queues vary during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the same storage infrastructure and presentation facilities.

Figure 24: DTrace run-queue visualisation of the bang benchmark with 16 schedulers (time on the x-axis)

iii.3 Devo

Devo19 [10] is designed to provide real-time online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).

Figure 25 shows visualisations from Devo in both modes. On the left-hand side, a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram), with six cores each; with hyperthreading, this gives twelve virtual cores, and hence 24 run queues in total. The size of these run queues is shown by both the colour and height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle. A green arc shows migration on the same physical core, a grey one migration on the same chip, and a blue one migration between the two chips.

19 https://github.com/RefactoringTools/devo

Figure 25: Devo low- and high-level visualisations of SD-Orbit

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between the nodes (green quiescent; red busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents, which collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. The tracing to be performed on monitored nodes is also specified within the configuration file.

An agent is started by a master SD-Mon node for each s_group and for each free node. Configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up, it tries to get in contact with its nodes and apply the tracing to them, as stated by the master. Binary files are stored in the host file system. Tracing is internally used in order to track s_group operations happening at runtime; an asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

Figure 26: SD-Mon architecture

The monitoring network is also supervised, in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected then that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution: before (top) and after eliminating s_group 2 (bottom)

Figure 28: SD-Mon online monitoring

VIII Systemic Evaluation

Preceding sections have investigated the improvements of individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI and VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state-of-the-art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable.20 The bulk of the experiments reported here are conducted on the Athos cluster using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed-size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure. The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

20 See Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf

Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling]

i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot standard deviation. Results show that D-Orbit performs better on a small number of nodes, as communication is direct, rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups: beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails, due to the fact that some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5,000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.

Figure 30: Wombat deployment time against number of nodes N (best fit: 0.0125N + 83 seconds)

The measurement data shows two important facts: it shows that WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and that WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of the effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, while SR-ACO registers them only within an s_group (Section ii).

For both GR-ACO and SR-ACO, a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common X86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, so around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark, and obtained similar results [23].
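A minimal Chaos Monkey of this kind can be expressed in a few lines (our illustration; the harness used in the experiments may differ, e.g. in protecting system processes):

    %% Kill one randomly chosen local process every second.
    chaos_monkey() ->
        timer:sleep(1000),
        Pids = erlang:processes(),
        Victim = lists:nth(rand:uniform(length(Pids)), Pids),
        exit(Victim, kill),
        chaos_monkey().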

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.

Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling]

Scalability. Figure 31 compares the runtimes of the ML, GR and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and a fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks and others that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.

Figure 32: Number of sent packets in ML-ACO, GR-ACO and SR-ACO

As SD-Orbit scales better than D-Orbit, as SR-ACO scales better than ML-ACO, and as SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limits of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language, while preserving reliability. The work is a vade mecum for addressing the scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering VM, language, and persistent storage levels. We have developed the BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15], or Legion [40]. We establish scientifically the folklore limitation of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups for partitioning the network and the global process namespace, and semi-explicit process placement for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).

To improve the scalability of the Erlang VM and libraries we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article.

Table 2: Cluster Specifications (RAM per host in GB)

Name    Hosts  Cores/host  Cores total  Cores max avail  Processor                                  RAM     Interconnect
GPG     20     16          320          320              Intel Xeon E5-2640v2, 8C, 2GHz             64      10GB Ethernet
Kalkyl  348    8           2784         1408             Intel Xeon 5520v2, 4C, 2.26GHz             24–72   InfiniBand 20Gb/s
TinTin  160    16          2560         2240             AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz   64–128  2:1 oversubscribed QDR Infiniband
Athos   776    24          18624        6144             Intel Xeon E5-2697v2, 12C, 2.7GHz          64      Infiniband FDR14

While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages.21 For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3 GHz, 16 "Bulldozer" cores each, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

21 See Deliverable 6.7 (http://www.release-project.eu/documents/D67.pdf), available online.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level Visualization of Concurrent and Distributed Computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riak docs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking/.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the sim-diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Kőszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go programming language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Technical Report Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 1:1–1:1, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. Charm++: a portable concurrent object oriented system based on C++. In ACM Sigplan Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014: Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.

[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lübeck and Max Neunhöffer. Enumerating large Orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.


[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing: 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop: The Definitive Guide. Storage and Analysis at Internet Scale (3rd edition, revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-Strength Functional programming: Experiences with the Ericsson AXD301 Project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.



Figure 2: Distributed Erlang Orbit (D-Orbit) architecture: workers are mapped to nodes.

tation [57]. To compute the orbit for a given space [0..X], a list of generators g1, g2, ..., gn are applied to an initial vertex x0 ∈ [0..X]. This creates new values (x1, ..., xn) ∈ [0..X], where xi = gi(x0). The generator functions are applied to the new values until no new value is generated.
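The fixed point at the heart of the computation can be sketched in a few lines of sequential Erlang (a minimal reconstruction of our own, with hypothetical names; the real benchmark distributes the vertex set over worker processes as a DHT):

%% Minimal sequential sketch of the Orbit kernel: apply every generator
%% to each newly discovered vertex until no new vertex appears.
-module(orbit_seq).
-export([orbit/2]).

orbit(Gens, X0) ->
    explore(Gens, [X0], sets:from_list([X0])).

explore(_Gens, [], Seen) ->
    sets:to_list(Seen);                        % fixed point reached
explore(Gens, Frontier, Seen) ->
    New   = [G(X) || G <- Gens, X <- Frontier],
    Fresh = lists:usort([Y || Y <- New, not sets:is_element(Y, Seen)]),
    Seen1 = lists:foldl(fun sets:add_element/2, Seen, Fresh),
    explore(Gens, Fresh, Seen1).

For example, orbit_seq:orbit([fun(X) -> (X*X+1) rem 1024 end], 1) terminates because the space is finite.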

Orbit is a suitable benchmark because it has a number of aspects that characterise a class of real applications. The core data structure it maintains is a set, and in distributed environments it is implemented as a distributed hash table (DHT), similar to the DHTs used in replicated form in NoSQL database management systems. Also, in distributed mode, it uses standard peer-to-peer (P2P) techniques like a credit-based termination algorithm [61]. By choosing the orbit size, the benchmark can be parameterised to specify smaller or larger computations that are suitable to run on a single machine (Section i) or on many nodes (Section i). Moreover, it is only a few hundred lines of code.

As shown in Figure 2, the computation is initiated by a master which creates a number of workers. In the single node scenario of the benchmark, workers correspond to processes, but these workers can also spawn other processes to apply the generator functions on a subset of their input values, thus creating intra-worker parallelism. In the distributed version of the benchmark, processes are spawned by the master node to worker nodes, each maintaining a DHT fragment. A newly spawned process gets a share of the parent's credit, and returns this on completion. The computation is finished when the master node collects all credit, i.e. all workers have completed.

ii Ant Colony Optimisation (ACO)

Ant Colony Optimisation [29] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. For the purpose of this article we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [63], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.
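The article does not restate the SMTWTP objective, but the standard formulation is as follows: given n jobs with weights w_j and due dates d_j, a schedule fixes completion times C_j, and the cost to minimise is the total weighted tardiness

\[
\min \sum_{j=1}^{n} w_j T_j, \qquad T_j = \max(0,\; C_j - d_j),
\]

where T_j is the tardiness of job j.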

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node and consisting of multiple "ants". Ants are simple computational agents which concurrently compute possible solutions to the input problem, guided by shared information about good paths through the search space; there is also a certain amount of stochastic variation which allows the ants to explore new directions. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution.

We implement four distributed coordination patterns for the same multi-colony ACO computation, as follows. In each implementation the individual colonies perform some number of local iterations (i.e. generations of ants) and then report their best solutions; the globally-best solution is then selected and reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations.

Figure 3: Distributed Erlang Two-level Ant Colony Optimisation (TL-ACO) architecture.

Figure 4: Distributed Erlang Multi-level Ant Colony Optimisation (ML-ACO) architecture.

Two-level ACO (TL-ACO) has a single master node that collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 3 depicts the process and node placements of the TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on available nodes. In the next step each colony process spawns NA ant processes on the local node. Each ant iterates IA times, returning its result to the colony master. Each colony iterates IM times, reporting their best solution to, and receiving the globally-best solution from, the master process. We validated the implementation by applying TL-ACO to a number of standard SMTWTP instances [25, 13, 34], obtaining good results in all cases, and confirmed that the number of perfect solutions increases as we increase the number of colonies.
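The master-colony interaction of TL-ACO can be sketched as follows (a simplified skeleton with names of our own choosing, not the benchmark code):

%% TL-ACO master (simplified): one colony per node; IM global iterations
%% of collecting each colony's best solution and broadcasting the best.
-module(tl_aco_sketch).
-export([master/2]).

master(Nodes, IM) ->
    Colonies = [spawn(Node, fun colony/0) || Node <- Nodes],
    iterate(Colonies, IM).

iterate(Colonies, 0) ->
    [C ! stop || C <- Colonies],
    done;
iterate(Colonies, I) ->
    [C ! {report_best, self()} || C <- Colonies],
    Bests = [receive {best, S} -> S end || _ <- Colonies],
    Best = lists:min(Bests),                  % lowest-cost schedule wins
    [C ! {global_best, Best} || C <- Colonies],
    iterate(Colonies, I - 1).

colony() ->
    receive
        {report_best, Master} ->
            Master ! {best, local_iterations()},
            colony();
        {global_best, _Best} ->               % update pheromone matrix here
            colony();
        stop ->
            ok
    end.

local_iterations() ->
    rand:uniform(1000).                       % stands in for NA ants x IA iterations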

Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies, and thus could become a bottleneck. ML-ACO addresses this by having a tree of submasters (Figure 4), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from their children.

Globally Reliable ACO (GR-ACO). In ML-ACO, if a single colony fails to report back, the system will wait indefinitely. GR-ACO adds fault tolerance, supervising colonies so that a faulty colony can be detected and restarted, allowing the system to continue execution.

Scalable Reliable ACO (SR-ACO) also adds fault-tolerance, but using supervision within our new s_groups from Section i.1; the architecture of SR-ACO is discussed in detail there.

IV Erlang Scalability Limits

This section investigates the scalability of Erlang at VM, language, and persistent storage levels. An aspect we choose not to explore is the security of large scale systems, where for example one might imagine providing enhanced security for systems with multiple clusters or cloud instances connected by a Wide Area Network. We assume that existing security mechanisms are used, e.g. a Virtual Private Network.

i Scaling Erlang on a Single Host

To investigate Erlang scalability we built BenchErl, an extensible open source benchmark suite with a web interface.4 BenchErl shows how an application's performance changes when resources like cores or schedulers are added, or when options that control these resources change:

• the number of nodes, i.e. the number of Erlang VMs used, typically on multiple hosts;

• the number of cores per node;

• the number of schedulers, i.e. the OS threads that execute Erlang processes in parallel, and their binding to the topology of the cores of the underlying computer node;

• the Erlang/OTP release and flavor; and

• the command-line arguments used to start the Erlang nodes.

Using BenchErl, we investigated the scalability of an initial set of twelve benchmarks and two substantial Erlang applications, using a single Erlang node (VM) on machines with up to 64 cores, including the Bulldozer machine specified in Appendix A. This set of experiments, reported by [8], confirmed that some programs scaled well in the most recent Erlang/OTP release of the time (R15B01), but also revealed VM and language level scalability bottlenecks.

4 Information about BenchErl is available at http://release.softlab.ntua.gr/bencherl.

Figure 5: Runtime and speedup of two configurations of the Orbit benchmark using Erlang/OTP R15B01.

Figure 5 shows runtime and speedup curves for the Orbit benchmark, where master and workers run on a single Erlang node, in configurations with and without intra-worker parallelism. In both configurations the program scales: runtime continuously decreases as we add more schedulers to exploit more cores. The speedup of the benchmark without intra-worker parallelism, i.e. without spawning additional processes for the computation (green curve), is almost linear up to 32 cores but increases less rapidly from that point on; we see a similar, but more clearly visible, pattern for the other configuration (red curve), where there is no performance improvement beyond 32 schedulers. This is due to the asymmetric characteristics of the machine's micro-architecture, which consists of modules that couple two conventional x86 out-of-order cores that share the early pipeline stages, the floating point unit, and the L2 cache with the rest of the module [3].

Some other benchmarks, however, did not scale well, or experienced significant slowdowns when run with many VM schedulers (threads). For example, the ets_test benchmark has multiple processes accessing a shared ETS table. Figure 6 shows runtime and speedup curves for ets_test on a 16-core (eight cores with hyperthreading) Intel Xeon-based machine. It shows that runtime increases beyond two schedulers, and that the program exhibits a slowdown instead of a speedup.

For many benchmarks there are obvious reasons for poor scaling, like limited parallelism in the application or contention for shared resources. The reasons for poor scaling are less obvious for other benchmarks, and it is exactly these we have chosen to study in detail in subsequent work [8, 46].

Figure 6: Runtime and speedup of the ets_test BenchErl benchmark using Erlang/OTP R15B01.

A simple example is the parallel BenchErl benchmark that spawns some n processes, each of which creates a list of m timestamps and, after checking that each timestamp in the list is strictly greater than the previous one, sends the result to its parent. Figure 7 shows that up to eight cores each additional core leads to a slowdown; thereafter a small speedup is obtained up to 32 cores, and then again a slowdown. A small aspect of the benchmark, easily overlooked, explains the poor scalability: the benchmark creates timestamps using the erlang:now/0 built-in function, whose implementation acquires a global lock in order to return a unique timestamp. That is, two calls to erlang:now/0, even from different processes, are guaranteed to produce monotonically increasing values. This lock is precisely the bottleneck in the VM that limits the scalability of this benchmark. We describe our work to address VM timing issues in Section iii.

Figure 7: Runtime and speedup of the BenchErl benchmark called parallel, using Erlang/OTP R15B01.
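The benchmark pattern, and its hidden serialisation point, can be reconstructed as follows (a sketch, not the exact BenchErl code); on later Erlang/OTP releases the lock-free erlang:monotonic_time/0 provides timestamps without this bottleneck:

%% Sketch of the 'parallel' benchmark: N processes each build a list of
%% M timestamps and check that it is strictly increasing. Every call to
%% erlang:now/0 takes a global lock, serialising all N processes.
-module(parallel_sketch).
-export([run/2]).

run(N, M) ->
    Parent = self(),
    [spawn(fun() -> Parent ! {done, worker(M)} end) || _ <- lists:seq(1, N)],
    [receive {done, Res} -> Res end || _ <- lists:seq(1, N)].

worker(M) ->
    Ts = [erlang:now() || _ <- lists:seq(1, M)],   % the global-lock bottleneck
    strictly_increasing(Ts).

strictly_increasing([A, B | Rest]) when A < B ->
    strictly_increasing([B | Rest]);
strictly_increasing([_]) -> true;
strictly_increasing([])  -> true;
strictly_increasing(_)   -> false.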

Discussion. Our investigations identified contention for shared ETS tables, and for commonly-used shared resources like timers, as the key VM-level scalability issues. Section VI outlines how we addressed these issues in recent Erlang/OTP releases.

ii Distributed Erlang Scalability

Network Connectivity Costs. When any normal distributed Erlang nodes communicate, they share their connection sets, and this typically leads to a fully connected graph of nodes. So a system with n nodes will maintain O(n^2) connections, i.e. n(n-1)/2 of them (4950 TCP connections for 100 nodes), and these are relatively expensive TCP connections with continual maintenance traffic. This design aids transparent distribution, as there is no need to discover nodes, and the design works well for small numbers of nodes. However, at emergent server architecture scales, i.e. hundreds of nodes, this design becomes very expensive, and system architects must depart from the default Erlang model, e.g. they need to start using hidden nodes that do not share connection sets.

We have investigated the scalability limits imposed by network connectivity costs using several Orbit calculations on two large clusters, Kalkyl and Athos, as specified in Appendix A. The Kalkyl results are discussed by [21], and Figure 29 in Section i shows representative results for distributed Erlang computing orbits with 2M and 5M elements on Athos. In all cases performance degrades beyond some scale (40 nodes for the 5M orbit, and 140 nodes for the 2M orbit). Figure 32 illustrates the additional network traffic induced by the fully connected network: it allows a comparison between the number of packets sent in a fully connected network (ML-ACO) with those sent in a network partitioned using our new s_groups (SR-ACO).

Global Information Costs. Maintaining global information is known to limit the scalability of distributed systems, and crucially the process namespaces used for reliability are global. To investigate the scalability limits imposed on distributed Erlang by such global information we have designed and implemented DE-Bench, an open source, parameterisable, and scalable peer-to-peer benchmarking framework [37, 36]. DE-Bench measures the throughput and latency of distributed Erlang commands on a cluster of Erlang nodes, and its design is influenced by the Basho Bench benchmarking tool for Riak [12]. Each DE-Bench instance acts as a peer, providing scalability and reliability by eliminating central coordination and any single point of failure.

To evaluate the scalability of distributed Erlang we measure how adding more hosts increases the throughput, i.e. the total number of successfully executed distributed Erlang commands per experiment. Figure 8 shows the parameterisable internal workflow of DE-Bench. There are three classes of commands in DE-Bench: (i) Point-to-Point (P2P) commands, where a function with tunable argument size and computation time is run on a remote node, including spawn, rpc, and synchronous calls to server processes, i.e. gen_server or gen_fsm; (ii) Global commands, which entail synchronisation across all connected nodes, such as global:register_name and global:unregister_name; (iii) Local commands, which are executed independently by a single node, e.g. whereis_name, a lookup in the local name table.

Figure 8: DE-Bench's internal workflow.
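The three command classes correspond to plain distributed Erlang calls; a minimal illustration (our own example, not DE-Bench code) is:

%% Illustrative examples of DE-Bench's three command classes.
-module(de_bench_cmds).
-export([p2p/1, global_cmd/1, local_cmd/1]).

%% (i) P2P: run a function on a remote node (cf. spawn, rpc, gen_server calls).
p2p(Node) ->
    rpc:call(Node, erlang, node, []).

%% (ii) Global: register then unregister a name, synchronising all
%% connected nodes on each call.
global_cmd(Name) ->
    yes = global:register_name(Name, self()),
    global:unregister_name(Name).

%% (iii) Local: a lookup in the local name table only.
local_cmd(Name) ->
    global:whereis_name(Name).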

The benchmarking is conducted on 10 to 100 host configurations of the Kalkyl cluster (in steps of 10), and measures the throughput of successful commands per second over 5 minutes. There is one Erlang VM on each host, and one DE-Bench instance on each VM. The full paper [37] investigates the impact of data size and computation time in P2P calls, both independently and in combination, and the scaling properties of the common Erlang/OTP generic server processes gen_server and gen_fsm.

Here we focus on the impact of different proportions of global commands, mixing global with P2P and local commands. Figure 9 shows that even a low proportion of global commands limits the scalability of distributed Erlang, e.g. just 0.01% global commands limits scalability to around 60 nodes. Figure 10 reports the latency of all commands, and shows that while the latencies for P2P and local commands are stable at scale, the latency of the global commands increases dramatically with scale. Both results illustrate that the impact of global operations on throughput and latency in a distributed Erlang system is severe.

Explicit Placement. While network connectivity and global information impact performance at scale, our investigations also identified explicit process placement as a programming issue at scale. Recall from Section iii that distributed Erlang requires the programmer to identify an explicit Erlang node (VM) when spawning a process. Identifying an appropriate node becomes a significant burden for large and dynamic systems. The problem is exacerbated in large distributed systems where (1) the hosts may not be identical, having different hardware capabilities or different software installed, and (2) communication times may be non-uniform: it may be fast to send a message between VMs on the same host, and slow if the VMs are on different hosts in a large distributed system.

These factors make it difficult to deploy applications, especially in a scalable and portable manner. Moreover, while the programmer may be able to use platform-specific knowledge to decide where to spawn processes so that an application runs efficiently, if the application is then deployed on a different platform, or if the platform changes as hosts fail or are added, this knowledge becomes outdated.

Discussion. Our investigations confirm three language-level scalability limitations of Erlang from developer folklore. (1) Maintaining a fully connected network of Erlang nodes limits scalability; for example, Orbit is typically limited to just 40 nodes. (2) Global operations, and crucially the global operations required for reliability, i.e. to maintain a global namespace, seriously limit the scalability of distributed Erlang systems. (3) Explicit process placement makes it hard to build performance-portable applications for large architectures. These issues cause designers of reliable large scale systems in Erlang to depart from the standard Erlang model, e.g. using techniques like hidden nodes and storing pids in data structures. In Section V we develop language technologies to address these issues.

Figure 9: Scalability vs percentage of global commands in distributed Erlang.

iii Persistent Storage

Any large scale system needs reliable scalable persistent storage, and we have studied the scalability limits of Erlang persistent storage alternatives [38]. We envisage a typical large server having around 10^5 cores on around 100 hosts. We have reviewed the requirements for scalable and available persistent storage, and evaluated four popular Erlang DBMS against these requirements. For a target scale of around 100 hosts, Mnesia and CouchDB are, unsurprisingly, not suitable. However, Dynamo-style NoSQL DBMS like Cassandra and Riak have the potential to be.

We have investigated the current scalability limits of the Riak NoSQL DBMS using the Basho Bench benchmarking framework on a cluster with up to 100 nodes and independent disks. We found that the scalability limit of Riak version 1.1.1 is 60 nodes on the Kalkyl cluster. The study placed into the public scientific domain what was previously well-evidenced but anecdotal developer experience.

We have also shown that resources like memory, disk, and network do not limit the scalability of Riak. By instrumenting the global and gen_server OTP libraries, we identified a specific Riak remote procedure call that fails to scale. We outline how later releases of Riak are refactored to eliminate the scalability bottlenecks.

Figure 10: Latency of commands as the number of Erlang nodes increases.

Discussion. We conclude that Dynamo-like NoSQL DBMSs have the potential to deliver reliable persistent storage for Erlang at our target scale of approximately 100 hosts. Specifically, an Erlang Cassandra interface is available, and Riak 1.1.1 already provides scalable and available persistent storage on 60 nodes. Moreover, the scalability of Riak is much improved in subsequent versions.

V Improving Language Scalability

This section outlines the Scalable Distributed (SD) Erlang libraries [21] we have designed and implemented to address the distributed Erlang scalability issues identified in Section ii. SD Erlang introduces two concepts to improve scalability. S_groups partition the set of nodes in an Erlang system to reduce network connectivity and partition global data (Section i.1). Semi-explicit placement alleviates the issues of explicit process placement in large heterogeneous networks (Section i.2). The two features are independent and can be used separately or in combination. We overview SD Erlang in Section i, and outline s_group semantics and validation in Sections ii and iii respectively.

i SD Erlang Design

i.1 S_groups

S_groups reduce both the number of connections a node maintains and the size of name spaces, i.e. they minimise global information. Specifically, names are registered on, and synchronised between, only the nodes within the s_group. An s_group has the following parameters: a name, a list of nodes, and a list of registered names. A node can belong to many s_groups or to none. If a node belongs to no s_group it behaves as a usual distributed Erlang node.

The s_group library defines the functions shown in Table 1. Some of these functions manipulate s_groups and provide information about them, such as creating s_groups and providing a list of nodes from a given s_group. The remaining functions manipulate names registered in s_groups and provide information about these names.

Table 1: Summary of s_group functions

Function                                   Description
new_s_group(SGroupName, Nodes)             Creates a new s_group consisting of some nodes
delete_s_group(SGroupName)                 Deletes an s_group
add_nodes(SGroupName, Nodes)               Adds a list of nodes to an s_group
remove_nodes(SGroupName, Nodes)            Removes a list of nodes from an s_group
s_groups()                                 Returns a list of all s_groups known to the node
own_s_groups()                             Returns a list of s_group tuples the node belongs to
own_nodes()                                Returns a list of nodes from all s_groups the node belongs to
own_nodes(SGroupName)                      Returns a list of nodes from the given s_group
info()                                     Returns s_group state information
register_name(SGroupName, Name, Pid)       Registers a name in the given s_group
re_register_name(SGroupName, Name, Pid)    Re-registers a name (changes a registration) in a given s_group
unregister_name(SGroupName, Name)          Unregisters a name in the given s_group
registered_names({node, Node})             Returns a list of all registered names on the given node
registered_names({s_group, SGroupName})    Returns a list of all registered names in the given s_group
whereis_name(SGroupName, Name)             Return the pid of a name registered in the given s_group
whereis_name(Node, SGroupName, Name)
send(SGroupName, Name, Msg)                Send a message to a name registered in the given s_group
send(Node, SGroupName, Name, Msg)

For example, to register a process Pid with name Name in s_group SGroupName we use the following function. The name will only be registered if the process is being executed on a node that belongs to the given s_group, and neither Name nor Pid is already registered in that group.

s_group:register_name(SGroupName, Name, Pid) -> yes | no
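For instance, assuming an s_group named group_a (a hypothetical name), a server can be registered, looked up, and messaged as follows:

%% Hypothetical use of the Table 1 API.
Pid = spawn(fun() -> receive Msg -> Msg end end),
yes = s_group:register_name(group_a, db_server, Pid),
Pid = s_group:whereis_name(group_a, db_server),
s_group:send(group_a, db_server, {request, self()}).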

To illustrate the impact of s_groups on scalability we repeat the global operations experiment from Section ii (Figure 9). In the SD Erlang experiment we partition the set of nodes into s_groups, each containing ten nodes, and hence the names are replicated and synchronised on just ten nodes, and not on all nodes as in distributed Erlang. The results in Figure 11 show that with 0.01% of global operations the throughput of distributed Erlang stops growing at 40 nodes, while the throughput of SD Erlang continues to grow linearly.

Figure 11: Global operations in distributed Erlang vs SD Erlang.

The connection topology of s_groups is extremely flexible: they may be organised into a hierarchy of arbitrary depth or branching, e.g. there could be multiple levels in the tree of s_groups; see Figure 12. Moreover, it is not necessary to create a hierarchy of s_groups; for example, we have constructed an Orbit implementation using a ring of s_groups.

Given such a flexible way of organising distributed systems, key questions in the design of an SD Erlang system are the following. How should s_groups be structured? Depending on the reason the nodes are grouped – reducing the number of connections, or reducing the namespace, or both – s_groups can be freely structured as a tree, ring, or some other topology. How large should the s_groups be? Smaller s_groups mean more inter-group communication, but the synchronisation of the s_group state between the s_group nodes constrains the maximum size of s_groups. We have not found this constraint to be a serious restriction: for example, many s_groups are either relatively small (e.g. 10 nodes) or internal or terminal elements in some topology, e.g. the leaves and nodes of a tree. How do nodes from different s_groups communicate? While any two nodes can communicate in an SD Erlang system, to minimise the number of connections communication between nodes from different s_groups is typically routed via gateway nodes that belong to both s_groups. How do we avoid single points of failure? For reliability, and to minimise communication load, multiple gateway nodes and processes may be required.

Information to make these design choices is provided by the tools in Section VII and by benchmarking. A further challenge is how to systematically refactor a distributed Erlang application into SD Erlang, and this is outlined in Section i. A detailed discussion of distributed system design and refactoring in SD Erlang is provided in a recent article [22].

We illustrate typical SD Erlang system designs by showing refactorings of the Orbit and ACO benchmarks from Section III. In both distributed and SD Erlang the computation starts on the Master node, and the actual computation is done on the Worker nodes. In the distributed Erlang version all nodes are interconnected, and messages are transferred directly from the sending node to the receiving node (Figure 2). In contrast, in the SD Erlang version nodes are grouped into s_groups, and messages are transferred between different s_groups via Submaster nodes (Figure 13).

Figure 12: SD Erlang ACO (SR-ACO) architecture.

A fragment of code that creates an s_group on a Sub-master node is as follows:

create_s_group(Master, GroupName, Nodes0) ->
    case s_group:new_s_group(GroupName, Nodes0) of
        {ok, GroupName, Nodes} ->
            Master ! {GroupName, Nodes};
        _ -> io:format("exception: message")
    end.

Similarly, we introduce s_groups in the GR-ACO benchmark from Section ii to create Scalable Reliable ACO (SR-ACO); see Figure 12. Here, apart from reducing the number of connections, s_groups also reduce the global namespace information. That is, instead of registering the name of a pid globally, i.e. with all nodes, the name is registered only on all nodes in the s_group, with s_group:register_name/3.

A comparative performance evaluation of distributed Erlang and SD Erlang Orbit and ACO is presented in Section VIII.

i.2 Semi-Explicit Placement

Recall from Section iii that distributed Erlang spawns a process onto an explicitly named Erlang node, e.g.

spawn(some_node, fun some_module:pong/0)

and also recall the portability and programming effort issues associated with such explicit placement in large scale systems, discussed in Section ii.

To address these issues we have developed a semi-explicit placement library that enables the programmer to select nodes on which to spawn processes based on run-time information about the properties of the nodes. For example, if a process performs a lot of computation one would like to spawn it on a node with considerable computation power, or if two processes are likely to communicate frequently then it would be desirable to spawn them on the same node, or on nodes with a fast interconnect.

We have implemented two Erlang libraries to support semi-explicit placement [60]. The first deals with node attributes, and describes properties of individual Erlang VMs and associated hosts, such as total and currently available RAM, installed software, hardware configuration, etc. The second deals with a notion of communication distances, which models the communication times between nodes in a distributed system. Therefore, instead of specifying a node, we can use the attr:choose_node/1 function to define the target node, i.e.

spawn(attr:choose_node(Params), fun some_module:pong/0)

Figure 13: SD Erlang (SD-Orbit) architecture.

[60] report an investigation into the communication latencies on a range of NUMA and cluster architectures, and demonstrate the effectiveness of the placement libraries using the ML-ACO benchmark on the Athos cluster.

ii S_group Semantics

For precise specification, and as a basis for validation, we provide a small-step operational semantics of the s_group operations [21]. Figure 14 defines the state of an SD Erlang system and associated abstract syntax variables. The abstract syntax variables on the left are defined as members of sets, denoted {...}, and these in turn may contain tuples, denoted (...), or further sets. In particular nm is a process name, p a pid, ni a node_id, and nis a set of node_ids. The state of a system is modelled as a four tuple comprising a set of s_groups, a set of free_groups, a set of free_hidden_groups, and a set of nodes. Each type of group is associated with nodes and has a namespace. An s_group additionally has a name, whereas a free_hidden_group consists of only one node; i.e. a hidden node simultaneously acts as a node and as a group, because as a group it has a namespace but does not share it with any other node. Free normal and hidden groups have no names, and are uniquely defined by the nodes associated with them. Therefore group names gr_names are either NoGroup or a set of s_group_names. A namespace is a set of name and process id (pid) pairs, and is replicated on all nodes of the associated group.

A node has the following four parameters: a node_id identifier; a node_type that can be either hidden or normal; connections; and group_names, i.e. names of groups the node belongs to. The node can belong to either a list of s_groups or one of the free groups. The type of the free group is defined by the node type. Connections are a set of node_ids. For example, a normal node n1 connected to n2 and belonging to a single s_group g would be represented as (n1, Normal, {n2}, {g}).

Transitions in the semantics have the form (state, command, ni) → (state′, value), meaning that executing command on node ni in state returns value and transitions to state′.

The semantics is presented in more detail by [21], but we illustrate it here with the s_group:registered_names/1 function from Section i.1.

(grs, fgs, fhs, nds) ∈ state ≡ ({s_group}, {free_group}, {free_hidden_group}, {node})

gr ∈ grs ≡ {s_group} ≡ {(s_group_name, {node_id}, namespace)}

fg ∈ fgs ≡ {free_group} ≡ {({node_id}, namespace)}

fh ∈ fhs ≡ {free_hidden_group} ≡ {(node_id, namespace)}

nd ∈ nds ≡ {node} ≡ {(node_id, node_type, connections, gr_names)}

gs ∈ gr_names ≡ {NoGroup, {s_group_name}}

ns ∈ namespace ≡ {(name, pid)}

cs ∈ connections ≡ {node_id}

nt ∈ node_type ≡ {Normal, Hidden}

s ∈ {NoGroup, s_group_name}

Figure 14: SD Erlang state [21].

The function returns a list of names registered in s_group s if node ni belongs to the s_group, and an empty list otherwise (Figure 15). Here ⊕ denotes disjoint set union, IsSGroupNode returns true if node ni is a member of some s_group s (false otherwise), and OutputNms returns the set of process names registered in the ns namespace of s_group s.

iii Semantics Validation

As the semantics is concrete it can readily be made executable in Erlang, with lists replacing sets throughout. Having an executable semantics allows users to engage with it, and to understand how the semantics behaves vis à vis the library, giving them an opportunity to assess the correctness of the library against the semantics.

Figure 16: Testing s_groups using QuickCheck.

Better still, we can automatically assess how the system behaves in comparison with the (executable) semantics by executing them in lockstep, guided by the constraints of which operations are possible at each point. We do that by building an abstract state machine model of the library. We can then generate random sequences (or traces) through the model, with appropriate library data generated too. This random generation is supported by the QuickCheck property-based testing system [24, 9].

The architecture of the testing framework is shown in Figure 16. First, an abstract state machine embedded as an "eqc_statem" module is derived from the executable semantic specification. The state machine defines the abstract state representation and the transition from one state to another when an operation is applied. Test case and data generators are then defined to control the test case generation; this includes the automatic generation of eligible s_group operations and the input data to those operations. Test oracles are encoded as the postcondition for s_group operations.

((grs, fgs, fhs, nds), s_group:registered_names(s), ni)
    → ((grs, fgs, fhs, nds), nms)    if IsSGroupNode(ni, s, grs)
    → ((grs, fgs, fhs, nds), {})     otherwise

where {(s, {ni} ⊕ nis, ns)} ⊕ grs′ ≡ grs
      nms ≡ OutputNms(s, ns)

IsSGroupNode(ni, s, grs) = ∃ nis, ns, grs′ . {(s, {ni} ⊕ nis, ns)} ⊕ grs′ ≡ grs

OutputNms(s, ns) = {(s, nm) | (nm, p) ∈ ns}

Figure 15: SD Erlang semantics of the s_group:registered_names/1 function.

During testing, each test command is applied to both the abstract model and the s_group library. The application of the test command to the abstract model takes the abstract model from its current state to a new state, as described by the transition functions, whereas the application of the test command to the library leads the system to a new actual state. The actual state information is collected from each node in the distributed system, then merged and normalised to the same format as the abstract state representation. For a test to be successful, after the execution of a test command the test oracles specified for this command should be satisfied. Various test oracles can be defined for s_group operations; for instance, one of the generic constraints that applies to all the s_group operations is that after each s_group operation the normalised system state should be equivalent to the abstract state.
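The shape of such a model is sketched below, for a single operation (a hypothetical fragment in the style of Quviq's eqc_statem; the real model covers all the s_group operations and a multi-node state):

%% Hypothetical eqc_statem fragment modelling one s_group operation.
-module(s_group_eqc_sketch).
-include_lib("eqc/include/eqc.hrl").
-include_lib("eqc/include/eqc_statem.hrl").
-export([initial_state/0, command/1, precondition/2,
         postcondition/3, next_state/3]).

initial_state() ->
    [].                                   % abstract state: [{GroupName, Nodes}]

command(_S) ->
    {call, s_group, new_s_group, [group_name(), list(node_name())]}.

precondition(_S, _Call) ->
    true.

%% Oracle: the library's result must agree with the abstract model.
postcondition(S, {call, s_group, new_s_group, [G, Ns]}, Result) ->
    case Result of
        {ok, G, Ns} -> not lists:keymember(G, 1, S);
        _           -> true
    end.

next_state(S, _Res, {call, s_group, new_s_group, [G, Ns]}) ->
    [{G, Ns} | S].

group_name() -> elements([group_a, group_b]).
node_name()  -> elements(['node1@host', 'node2@host']).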

Thousands of tests were run, and three kinds of errors were found, all of which have subsequently been corrected. Some errors in the library implementation were found, including one error due to the synchronisation between nodes, and another related to the remove_nodes operation, which erroneously raised an exception. We also found a couple of trivial errors in the semantic specification itself, which had been missed by manual examination. Finally, we found some situations where there were inconsistencies between the semantics and the library implementation, despite their states being equivalent; an example of this was in the particular values returned by functions on certain errors. Overall, the automation of testing boosted our confidence in the correctness of the library implementation and the semantic specification. This work is reported in more detail by [54].

VI Improving VM Scalability

This section reports the primary VM and library improvements we have designed and implemented to address the scalability and reliability issues identified in Section i.


i Improvements to Erlang Term Storage

Because ETS tables are so heavily used in Erlang systems, they are a focus for scalability improvements. We start by describing their redesign, including some improvements that pre-date our RELEASE project work, i.e. those prior to Erlang/OTP R15B03. These historical improvements are very relevant for a scalability study, and form the basis for our subsequent changes and improvements. At the point when Erlang/OTP got support for multiple cores (in release R11B), there was a single reader-writer lock for each ETS table. Optional fine-grained locking of hash-based ETS tables (i.e. set, bag or duplicate_bag tables) was introduced in Erlang/OTP R13B02-1, adding 16 reader-writer locks for the hash buckets. Reader groups, to minimise read synchronisation overheads, were introduced in Erlang/OTP R14B. The key observation is that a single count of the multiple readers must be synchronised across many cache lines, potentially far away in a NUMA system. Maintaining reader counts in multiple (local) caches makes reads fast, although writes must now check every reader count. In Erlang/OTP R16B the number of bucket locks and the default number of reader groups were both upgraded from 16 to 64.
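Both mechanisms are exposed to programmers when a table is created; a minimal illustration (the table name and option values here are ours, not from the benchmark):

    %% read_concurrency selects the reader-group optimisation described
    %% above; write_concurrency enables the fine-grained bucket locking
    %% of hash-based tables.
    Tab = ets:new(bench_table,
                  [set, public,
                   {read_concurrency, true},
                   {write_concurrency, true}]).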

We illustrate the scaling properties of the ETS concurrency options using the ets_bench BenchErl benchmark on the Intel NUMA machine with 32 hyperthreaded cores specified in Appendix A. The ets_bench benchmark inserts 1M items into the table, then records the time to perform 17M operations, where an operation is either a lookup, an insert or a delete. The experiments are conducted on a hash-based (set) ETS table with different percentages of update operations, i.e. insertions or deletions.
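The sketch below shows the shape of one such operation under a given update percentage; the key range and probability split are illustrative rather than the exact ets_bench code:

    -module(ets_bench_sketch).
    -export([op/2]).

    %% One operation of the mix: with probability UpdatePct/100 perform
    %% an update (an insert or a delete), otherwise a lookup. The key
    %% range mirrors the 1M-item table described above.
    op(Tab, UpdatePct) ->
        Key = rand:uniform(1000000),
        case rand:uniform(100) =< UpdatePct of
            true ->
                case rand:uniform(2) of
                    1 -> ets:insert(Tab, {Key, Key});
                    2 -> ets:delete(Tab, Key)
                end;
            false ->
                ets:lookup(Tab, Key)
        end.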

Figure 17 shows the runtimes in seconds of 17M operations in different Erlang/OTP versions, varying the number of schedulers (x-axis), reflecting how the scalability of ETS tables has improved in more recent Erlang/OTP releases.


Figure 17: Runtime of 17M ETS operations in Erlang/OTP releases: 10% updates (left) and 1% updates (right). The y-axis shows time in seconds (lower is better), and the x-axis is the number of OS threads (VM schedulers).

Figure 18 shows the runtimes in seconds of 17M operations on an ETS table with different numbers of reader groups, again varying the number of schedulers. We see that one reader group is not sufficient with 10% updates, nor are two with 1% updates. Beyond that, different numbers of reader groups have little impact on the benchmark performance, except that using 64 groups with 10% updates slightly degrades performance.

We have explored four other extensions or redesigns in the ETS implementation for better scalability: (1) allowing more programmer control over the number of bucket locks in hash-based tables, so that the programmer can reflect the number of schedulers and the expected access pattern;


Figure 18: Runtime of 17M ETS operations with varying numbers of reader groups: 10% updates (left) and 1% updates (right). The y-axis shows runtime in seconds (lower is better), and the x-axis is the number of schedulers.

(2) using contention-adapting trees to get better scalability for ordered_set ETS tables, as described by [69]; (3) using queue delegation locking to improve scalability [47]; and (4) adopting schemes for completely eliminating the locks in the meta table. A more complete discussion of our work on ETS can be found in the papers by [69] and [47].

Here we outline only our work on contention-adapting (CA) trees. A CA tree monitors contention in different parts of a tree-shaped data structure, introducing routing nodes with locks in response to high contention, and removing them in response to low contention. For experimental purposes, two variants of the CA tree have been implemented to represent ordered_sets in the virtual machine of Erlang/OTP 17.0.


Figure 19: Throughput of CA tree variants: 10% updates (left) and 1% updates (right). The y-axis shows operations per microsecond, and the x-axis is the number of threads.

One extends the existing AVL trees in the Erlang VM, and the other uses a Treap data structure [5]. Figure 19 compares the throughput of the CA tree variants with that of ordered_set and set as the number of schedulers increases. It is unsurprising that the CA trees scale so much better than an ordered_set, which is protected by a single readers-writer lock. It is more surprising that they also scale better than set. This is due to hash tables using fine-grained locking at a fixed granularity, while CA trees can adapt the number of locks both to the current contention level and to the parts of the key range where contention is occurring.
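From the programmer's perspective only the table options change; in builds that include the CA-tree work (and in later Erlang/OTP releases that adopted it), an ordered_set table that opts in to write concurrency is backed by a CA tree rather than a single readers-writer lock. A minimal sketch, assuming such a release:

    %% In a CA-tree-enabled release this table uses adaptive
    %% per-subtree locking instead of one table-wide lock.
    Tab = ets:new(ordered_bench,
                  [ordered_set, public, {write_concurrency, true}]).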


ii Improvements to Schedulers

In the Erlang VM, a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until it either blocks waiting for input, typically a message from some other process, or has executed its quota of reductions.

The Erlang VM is usually started with one scheduler per logical core (SMT-thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned, it is placed in the run queue of the scheduler of its parent, and it waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores; that is, an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Figure 24.

The default load management mechanism is load compaction, which aims to keep as many scheduler threads as possible fully loaded with work, i.e. it attempts to ensure that scheduler threads do not run out of work. We have developed a new optional scheduler utilisation balancing mechanism that is available from Erlang/OTP 17.0. The new mechanism aims to balance scheduler utilisation between schedulers; that is, it will strive for equal scheduler utilisation on all schedulers.

The scheduler utilisation balancing mechanism has no performance impact on the system when not enabled. On the other hand, when enabled, it results in changed timing in the system; normally there is a small overhead due to measuring of utilisation and calculating balancing information, which depends on the underlying primitives provided by the operating system.

The new balancing mechanism results in a better distribution of processes to schedulers, reducing the probability of core contention. Together with other VM improvements, such as interruptable BIFs and garbage collection, it results in lower latency and improved responsiveness, and hence reliability, for soft real-time applications.
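The mechanism is opt-in: to the best of our knowledge it is enabled with the +sub emulator flag when starting the VM (available from Erlang/OTP 17.0), for example:

    erl +sub true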

iii Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel benchmark in Section i. The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in that returns "Erlang system time", or time since Epoch, with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, i.e. not SMT-enabled. The documentation promises that values returned by it are strictly increasing, and many applications ended up relying on this. For example, applications often employ erlang:now/0 to generate unique integers.
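Later releases provide a dedicated primitive for this use case, so code no longer needs to rely on the strictly-increasing guarantee of erlang:now/0; a minimal example (assuming Erlang/OTP 18.0 or later):

    %% A strictly monotonically increasing, positive integer, unique
    %% within the VM; a scalable replacement for erlang:now/0 as a
    %% source of unique values.
    Unique = erlang:unique_integer([monotonic, positive]).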

Erlang system time should align with the operating system's view of time since Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot, without invalidating the strictly-increasing value guarantee.


The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when the two have deviated, e.g. in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], i.e. a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards and backwards, while system time is allowed to do this. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.
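In terms of the Erlang/OTP 18 API, this relationship can be expressed directly; the following lines sketch how system time is recovered from monotonic time and the published offset:

    Mono    = erlang:monotonic_time(),  %% strictly non-decreasing, native units
    Offset  = erlang:time_offset(),     %% current Erlang time offset
    SysTime = Mono + Offset.            %% the value erlang:system_time()
                                        %% would return at this instant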

Time Retrieval. Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, monotonic time delivered by the operating system is solely based on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP); that is, they will run with different frequencies. Linux is an exception, with a monotonic clock that is NTP-adjusted and runs with the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing monotonic time and system time delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute. If the adjustment needs to be changed, new adjustment information is published and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the time adjustment information previously published, and calculates Erlang monotonic time. To preserve monotonicity, it is important that all threads that read the same OS monotonic time map this to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information, using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed. This means that in the vast majority of cases the RW lock will be read-locked, which allows multiple readers to run concurrently. To prevent bouncing the lock cache-line, we use a bespoke reader-optimised RW lock implementation, where reader threads notify about their presence in counters on separate cache-lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11]; alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].

Timer Wheel and BIF Timer. The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel, which is used by processes executing on the scheduler.

The implementation of timers in Erlang/OTP uses a built-in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS tables, and only insert one timer at a time into the timer wheel.

Benchmarks. We have measured several benchmarks on a 16-core Bulldozer machine with eight dual-CPU AMD Opteron 4376 HEs5. We present three of them here.

The first micro benchmark compares the execution time of an Erlang receive with that of a receive ... after that specifies a timeout and provides a default value. The receive ... after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, the total execution time with the optimised timers is only 5% longer than without timers.
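The two loops compared by this benchmark have roughly the following shape (module and function names are illustrative):

    -module(timer_bench_sketch).
    -export([plain_wait/0, timeout_wait/0]).

    %% Blocks until a message arrives; no timer is involved.
    plain_wait() ->
        receive Msg -> Msg end.

    %% Blocks with a timeout: a timer is set when the process blocks
    %% and cancelled when a message arrives, exercising the BIF timer.
    timeout_wait() ->
        receive Msg -> Msg
        after 10000 -> timeout
        end.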

The second micro benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, on three Erlang/OTP releases: R15B01, 17.4 and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on the left shows that (1) the performance of time management has remained roughly unchanged between Erlang/OTP releases prior to 18.0; (2) the improved time management in Erlang/OTP 18.x makes time management less likely to be a scalability bottleneck, even when using erlang:now/0; and (3) the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution.

5See §2.5.4 of the RELEASE project Deliverable 2.4 (http://release-project.eu/documents/D24.pdf).


Figure 20: BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases: runtimes (left) and speedup obtained using erlang:monotonic_time/0 (right).

The graph on the right-hand side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

VII Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable, to deploy them for scalability, and to monitor and visualise them.


Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling and debugging Erlang systems that can be used separately or together, as appropriate for solving the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on or specialise the services offered by the Erlang virtual machine through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer6 application gives a comprehensive overview of many of these data on a node-by-node basis.

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature; that is, the tools support the language itself rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use the tracing support provided by the Erlang VM, other actor frameworks, e.g. Akka, can use the Kamon7 JVM monitoring system. Similarly, tools for other actor languages or frameworks could use data derived through OS-level tracing frameworks, DTrace8 and SystemTap9 probes, as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

6http://www.erlang.org/doc/apps/observer/   7http://kamon.io


i Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare it for modification or extension, or, as is the case here, in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support, i.e. a refactoring tool. There are a number of tools that support refactoring in Erlang; in the RELEASE project we have chosen to extend Wrangler10 [56]; other tools include Tidier [68] and RefactorErl [44].

Supporting API Migration. The SD Erlang libraries modify Erlang's global_group library, becoming the new s_group library as a result; Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolves and this often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API.

8http://dtrace.org/blogs/about/   9https://sourceware.org/systemtap/wiki/

10http://www.cs.kent.ac.uk/projects/wrangler/


This refactoring can also be supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically. The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work, and the technicalities of the implementation, can be found in a paper by [52].
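As an illustration, an adaptor for a single function might look as follows; the old-to-new mapping shown is hypothetical, and the real adaptor conventions are those defined by Wrangler:

    -module(global_group_adaptor).
    -export([registered_names/1]).

    %% The old global_group call defined in terms of the new s_group
    %% API; from such definitions the client-code refactoring is
    %% generated. (Hypothetical mapping, for illustration only.)
    registered_names({group, GroupName}) ->
        s_group:registered_names(GroupName).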

Support for Introducing Parallelism. We have introduced support for parallelising explicit list operations (map and foreach), for process introduction to complete a computationally intensive task in parallel, for introducing a worker process to deal with call handling in an Erlang "generic server", and for parallelising a tail-recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach. The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward; even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
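A parallel map in plain Erlang can be as small as the following sketch (the actual para_lib API may differ):

    -module(para_lib_sketch).
    -export([pmap/2]).

    %% Apply F to every element of Xs in its own process and collect
    %% the results in the original list order.
    pmap(F, Xs) ->
        Parent = self(),
        Refs = [begin
                    Ref = make_ref(),
                    spawn_link(fun() -> Parent ! {Ref, F(X)} end),
                    Ref
                end || X <- Xs],
        [receive {Ref, Result} -> Result end || Ref <- Refs].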

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute a task in parallel with its parent process. The result of the new process is sent back to the parent process, which will then consume it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies. For instance, an example might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are checked by program slicing, which is discussed below.
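The sketch below illustrates the idea on a hypothetical accumulating recursion: the independent per-element computation is floated out into separate processes, while the dependent accumulation stays sequential:

    -module(float_out_sketch).
    -export([sum_costly/1]).

    %% Stand-in for a non-trivial, independent computation.
    costly(X) -> X * X.

    %% Tail recursion with an accumulator: the costly/1 calls are
    %% floated out into processes; only the dependent summation of
    %% their results remains sequential.
    sum_costly(Xs) -> loop(Xs, [], 0).

    loop([], Refs, Acc) ->
        lists:foldl(fun(Ref, A) ->
                            receive {Ref, V} -> A + V end
                    end, Acc, Refs);
    loop([X | Xs], Refs, Acc) ->
        Parent = self(),
        Ref = make_ref(),
        spawn_link(fun() -> Parent ! {Ref, costly(X)} end),
        loop(Xs, [Ref | Refs], Acc).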

Support for Program Slicing. Program slicing is a general technique of program analysis for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE11 [17], a tool developed in another EU project that also re-uses the Wrangler front end. This work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing in transforming programs to make them suitable for the introduction of parallel structures.

11http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf



Figure 21: The architecture of WombatOAM.

ii Scalable Deployment

We have developed the WombatOAM tool12 to provide a deployment, operations and maintenance framework for large-scale distributed Erlang systems. These systems typically consist of a number of Erlang nodes executing on different hosts. These hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP.

12WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd.

Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this is the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture. The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability because of the role played by the central Master node; in its current version an additional layer of Middle Managers was introduced, allowing the system to scale easily to thousands of deployed nodes. As the diagram shows, the "northbound" interfaces to the web dashboard and the command-line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms, and so forth; we describe those now.

Services. WombatOAM is designed to collect, store and display various kinds of information and events from running Erlang systems, and these data are accessed and managed through an AJAX-based Web Dashboard. The services include the following:

Metrics. WombatOAM supports the collection of some hundred metrics, including, for instance, the number of processes on a node and the message traffic in and out of the node, on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom13, and interface with other tools, such as graphite14, which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications. As well as metrics, WombatOAM supports event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which is part of the standard distribution, or the lager logging framework15, and will be displayed and logged as they occur.

13Folsom collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom

14https://graphiteapp.org


Alarms. Alarms are more complex entities. Alarms have a state: they can be raised, and once dealt with they can be cleared. They also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as for notifications.

Topology. The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible; if not, it notifies the other services and periodically tries to reconnect. When the nodes are available again, it again notifies the other services. It doesn't have a middle manager part, because it doesn't talk to the nodes directly; instead, it asks the Node manager service to do so.

Node manager. This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration. This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers, the Orchestration service uses an external library called Libcloud16, for which Erlang Solutions has written an open source Erlang wrapper, called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications; it provides infrastructure for deploying them.

15https://github.com/basho/lager



Deployment. The mechanism consists of the following five steps:

1. Registering a provider. WombatOAM provides the same interface for different cloud providers that support the OpenStack standard or the Amazon EC2 API. WombatOAM also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this will enable WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

16The unified cloud API [4]: https://libcloud.apache.org

4. Defining a deployment domain. At this step a deployment domain is created that specifies (i) which providers should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to or removed from the system, depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single-host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue and processes migrate between the queues through a work stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes17.

iii1 Percept2

Percept218 builds on the existing Percept tool to provide post hoc offline analysis and visualisation of Erlang systems.

17This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.

18https://github.com/RefactoringTools/percept2


Figure 22: Offline profiling of a distributed Erlang application using Percept2.

Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling for a distributed Erlang application using Percept2 is shown in Figure 22.

Percept generates an application-level zoomable concurrency graph, showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways, as detailed by [53], including most importantly:

• Distinguishing between running and runnable time for each process: this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including the migration history of a process between run queues; statistics about message passing between processes (the number of messages and the average message size sent/received by a process); the accumulated runtime per process; and the accumulated time when a process is in a running state.

• Presenting the process tree: the hierarchy structure indicating the parent-child relationships between processes.


Figure 23: Percept2 showing processes running and runnable (with 4 schedulers/run queues).


• Recording the dynamic function callgraph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system. Unlike global_group, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have also extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating measurement of a function's own execution time gives users the freedom of not profiling all the function calls invoked during the program execution. For example, they can choose to profile only functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of generated web pages. Together these make the system more responsive and more scalable.



iii2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, and require neither specially recompiled versions of the software to be examined nor special post-processing tools to create meaningful information from the data gathered. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the sizes of run queues vary during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the same storage infrastructure and presentation facilities.


Figure 24: DTrace run-queue visualisation of the bang benchmark with 16 schedulers (time on x-axis).


iii3 Devo

Devo19 [10] is designed to provide real-time online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).

Figure 25 shows visualisations from Devo in both modes. On the left-hand side, a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram) with six cores each; with hyperthreading, this gives twelve virtual cores per chip, and hence 24 run queues in total. The size of these run queues is shown by both the colour and height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle. A green arc shows migration on the same physical core, a grey one migration on the same chip, and blue shows migrations between the two chips.

19https://github.com/RefactoringTools/devo



Figure 25: Devo low- and high-level visualisations of SD-Orbit.


On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between the nodes (green quiescent, red busiest).


iii4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. Tracing to be performed on monitored nodes is also specified within the configuration file.

An agent is started by a master SD-Mon node for each s_group and for each free node. Configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up it tries to get in contact with its nodes and apply the tracing to them, as stated by the master. Binary files are stored in the host file system. Tracing is internally used in order to track s_group operations happening at runtime. An asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.


Figure 26: SD-Mon architecture.


The monitoring network is also supervised, in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected then that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution: before (top) and after eliminating s_group 2 (bottom).


Figure 28: SD-Mon online monitoring.

VIII Systemic Evaluation

Preceding sections have investigated the improvements of individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI and VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state-of-the-art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable20. The bulk of the experiments reported here are conducted on the Athos cluster, using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed-size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure. The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

20See Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf


Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [strong scaling]. The y-axis shows relative speedup; the x-axis the number of nodes (cores).


i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot standard deviation. Results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups: beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails, due to the fact that some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5,000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.



Figure 30: Wombat deployment time: experimental data and best fit 0.0125N + 83 (time in seconds against number of nodes N).

The measurement data shows two important facts: WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, and SR-ACO registers them only within an s_group (Section ii).

For both GR-ACO and SR-ACO, a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common X86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, so around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark, and obtained similar results [23].

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.


Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [weak scaling]. The y-axis shows execution time in seconds; the x-axis the number of nodes (cores).


Scalability. Figure 31 compares the runtimes of the ML, GR and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks and others that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.


Figure 32: Number of sent packets in ML-ACO, GR-ACO and SR-ACO.


As SD-Orbit scales better than D-Orbit, SR-ACO scales better than ML-ACO, and SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limits of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language while preserving reliability. The work is a vade mecum for addressing scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering VM, language and persistent storage levels. We have developed the BenchErl and DE-Bench tools for this purpose.


Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15] or Legion [40]. We establish scientifically the folklore limitations of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups for partitioning the network and global process namespace, and semi-explicit process placement for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).

To improve the scalability of the Erlang VM and libraries, we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large-scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article.


Table 2: Cluster Specifications (RAM per host in GB)

Name    Hosts  Cores     Cores  Cores      Processor                                  RAM     Interconnection
               per host  total  max avail
GPG     20     16        320    320        Intel Xeon E5-2640v2, 8C, 2GHz             64      10GB Ethernet
Kalkyl  348    8         2784   1408       Intel Xeon 5520v2, 4C, 2.26GHz             24–72   InfiniBand 20Gb/s
TinTin  160    16        2560   2240       AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz   64–128  2:1 oversubscribed QDR Infiniband
Athos   776    24        18624  6144       Intel Xeon E5-2697v2, 12C, 2.7GHz          64      Infiniband FDR14

While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60 node limit for distributed Erlang and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process namespace, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages.21 For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3 GHz, with 16 "Bulldozer" cores each, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

21 See Deliverable D6.7 (http://www.release-project.eu/documents/D67.pdf), available online.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level Visualization of Concurrent and Distributed Computation in Erlang. In Visual Languages and Computing (VLC), The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riakdocs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking/.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the sim-diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Kőszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go Programming Language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 1:1–1:1, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. Charm++: a portable concurrent object oriented system based on C++. In ACM Sigplan Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.

[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, 2008.

[57] Frank Lübeck and Max Neunhöffer. Enumerating large Orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897/.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing, 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop: The Definitive Guide - Storage and Analysis at Internet Scale (3rd edition, revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-Strength Functional Programming: Experiences with the Ericsson AXD301 Project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.


[Figure 4: Distributed Erlang Multi-level Ant Colony Optimisation (ML-ACO) architecture.]

computation as follows. In each implementation the individual colonies perform some number of local iterations (i.e. generations of ants) and then report their best solutions; the globally-best solution is then selected and is reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations.

Two-level ACO (TL-ACO) has a single master node that collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 3 depicts the process and node placements of the TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on available nodes. In the next step each colony process spawns NA ant processes on the local node. Each ant iterates IA times, returning its result to the colony master. Each colony iterates IM times, reporting their best solution to, and receiving the globally-best solution from, the master process. We validated the implementation by applying TL-ACO to a number of standard SMTWTP

instances [25, 13, 34], obtaining good results in all cases, and confirmed that the number of perfect solutions increases as we increase the number of colonies.
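A minimal sketch of the TL-ACO master process follows; the module, function, and message names are hypothetical, and error handling is omitted:

    %% Spawn one colony process per node, then run IM global iterations,
    %% each selecting the globally-best solution and feeding it back.
    run_master(Nodes, IM) ->
        Colonies = [spawn(Node, colony, start, [self()]) || Node <- Nodes],
        global_loop(Colonies, IM, {infinity, none}).

    global_loop(Colonies, 0, Best) ->
        [C ! stop || C <- Colonies],
        Best;
    global_loop(Colonies, I, Best0) ->
        %% Each colony reports its best {Cost, Schedule} after IA local iterations.
        Solutions = [receive {best_solution, C, Sol} -> Sol end || C <- Colonies],
        %% Erlang term order makes lists:min pick the lowest-cost tuple; the
        %% {infinity, none} seed is never chosen once a real solution arrives,
        %% since numbers sort before the atom infinity.
        Best = lists:min([Best0 | Solutions]),
        [C ! {global_best, Best} || C <- Colonies],
        global_loop(Colonies, I - 1, Best).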

Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies, and thus could become a bottleneck. ML-ACO addresses this by having a tree of submasters (Figure 4), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from their children.

Globally Reliable ACO (GR-ACO). In ML-ACO, if a single colony fails to report back, the system will wait indefinitely. GR-ACO adds fault tolerance, supervising colonies so that a faulty colony can be detected and restarted, allowing the system to continue execution.

Scalable Reliable ACO (SR-ACO) also adds fault-tolerance, but using supervision within our new s_groups from Section i.1; the architecture of SR-ACO is discussed in detail there.

IV Erlang Scalability Limits

This section investigates the scalability of Erlang at VM, language, and persistent storage levels. An aspect we choose not to explore is the security of large scale systems, where, for example, one might imagine providing enhanced security for systems with multiple clusters or cloud instances connected by a Wide Area Network. We assume that existing security mechanisms are used, e.g. a Virtual Private Network.

i Scaling Erlang on a Single Host

To investigate Erlang scalability we built BenchErl, an extensible open source benchmark suite with a web interface.4 BenchErl shows how an application's performance changes when resources like cores or schedulers are added, or when options that control these resources change:

• the number of nodes, i.e. the number of Erlang VMs used, typically on multiple hosts;

• the number of cores per node;

• the number of schedulers, i.e. the OS threads that execute Erlang processes in parallel, and their binding to the topology of the cores of the underlying computer node;

• the Erlang/OTP release and flavor; and

• the command-line arguments used to start the Erlang nodes.

Using BenchErl we investigated the scalability of an initial set of twelve benchmarks and two substantial Erlang applications, using a single Erlang node (VM) on machines

4 Information about BenchErl is available at http://release.softlab.ntua.gr/bencherl.

[Figure 5: Runtime and speedup of two configurations of the Orbit benchmark using Erlang/OTP R15B01.]

with up to 64 cores, including the Bulldozer machine specified in Appendix A. This set of experiments, reported by [8], confirmed that some programs scaled well in the most recent Erlang/OTP release of the time (R15B01), but also revealed VM and language level scalability bottlenecks.

Figure 5 shows runtime and speedup curves for the Orbit benchmark, where master and workers run on a single Erlang node, in configurations with and without intra-worker parallelism. In both configurations the program scales: runtime continuously decreases as we add more schedulers to exploit more cores. The speedup of the benchmark without intra-worker parallelism, i.e. without spawning additional processes for the computation (green curve), is almost linear up to 32 cores but increases less rapidly from that point on; we see a similar, but more clearly visible, pattern for the other configuration (red curve), where there is no performance improvement beyond 32 schedulers. This is due to the asymmetric characteristics of the machine's micro-architecture, which consists of modules that couple two conventional x86 out-of-order cores that share the early pipeline stages, the floating point unit, and the L2 cache with the rest of the module [3].

Some other benchmarks, however, did not scale well or experienced significant slowdowns when run in many VM schedulers (threads). For example, the ets_test benchmark has multiple processes accessing a shared ETS table. Figure 6 shows runtime and speedup curves for ets_test on a 16-core (eight cores with hyperthreading) Intel Xeon-based machine. It shows that runtime increases beyond two schedulers, and that the program exhibits a slowdown instead of a speedup.

For many benchmarks there are obvious reasons for poor scaling, like limited parallelism in the application or contention for shared resources. The reasons for poor scaling are less obvious for other benchmarks, and it is exactly these that we have chosen to study in detail in subsequent work [8, 46].

A simple example is the parallel BenchErl benchmark, which spawns some n processes, each of which creates a list of m timestamps and, after it checks that each timestamp in the list is strictly greater than the previous one, sends the result to its parent. Figure 7 shows that up to eight cores each additional core leads to a slowdown; thereafter a small speedup is obtained up to 32 cores, and then again a slowdown. A small aspect of the benchmark, easily overlooked, explains the poor scalability: the benchmark creates timestamps using the erlang:now/0 built-in function, whose implementation acquires a global lock in order to return a unique timestamp.

[Figure 6: Runtime and speedup of the ets_test BenchErl benchmark using Erlang/OTP R15B01.]

That is, two calls to erlang:now/0, even from different processes, are guaranteed to produce monotonically increasing values. This lock is precisely the bottleneck in the VM that limits the scalability of this benchmark. We describe our work to address VM timing issues in Section iii.
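The following sketch contrasts the old idiom with the lock-free alternative that the new time API (Section iii; Erlang/OTP 18 and later) makes possible:

    %% Old idiom: every call acquires a global lock so that successive
    %% results are unique and strictly increasing.
    unique_stamp_old() ->
        erlang:now().

    %% OTP 18+ replacement: unique and monotonic without a global lock.
    unique_stamp_new() ->
        {erlang:monotonic_time(), erlang:unique_integer([monotonic])}.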

Discussion. Our investigations identified contention for shared ETS tables, and for commonly-used shared resources like timers, as the key VM-level scalability issues. Section VI outlines how we addressed these issues in recent Erlang/OTP releases.

ii Distributed Erlang Scalability

Network Connectivity Costs. When any normal distributed Erlang nodes communicate, they share their connection sets, and this typically leads to a fully connected graph of nodes.


[Figure 7: Runtime and speedup of the BenchErl benchmark parallel using Erlang/OTP R15B01.]

So a system with n nodes will maintain n(n−1)/2, i.e. O(n²), connections (for example, 4950 connections for 100 nodes), and these are relatively expensive TCP connections with continual maintenance traffic. This design aids transparent distribution, as there is no need to discover nodes, and the design works well for small numbers of nodes. However, at emergent server architecture scales, i.e. hundreds of nodes, this design becomes very expensive, and system architects must switch from the default Erlang model, e.g. they need to start using hidden nodes that do not share connection sets.

We have investigated the scalability limits imposed by network connectivity costs using several Orbit calculations on two large clusters, Kalkyl and Athos, as specified in Appendix A. The Kalkyl results are discussed by [21], and Figure 29 in Section i shows representative results for distributed Erlang computing orbits with 2M and 5M elements on Athos. In all cases performance degrades beyond some scale (40 nodes for the 5M orbit, and 140 nodes for the 2M orbit). Figure 32 illustrates the additional network traffic induced by the fully connected network: it allows a comparison between the number of packets sent in a fully connected network (ML-ACO) with those sent in a network partitioned using our new s_groups (SR-ACO).

Global Information Costs. Maintaining global information is known to limit the scalability of distributed systems, and crucially the process namespaces used for reliability are global. To investigate the scalability limits imposed on distributed Erlang by such global information we have designed and implemented DE-Bench, an open source, parameterisable, and scalable peer-to-peer benchmarking framework [37, 36]. DE-Bench measures the throughput and latency of distributed Erlang commands on a cluster of Erlang nodes, and the design is influenced by the Basho Bench benchmarking tool for Riak [12]. Each DE-Bench instance acts as a peer, providing scalability and reliability by eliminating central coordination and any single point of failure.

To evaluate the scalability of distributed Erlang we measure how adding more hosts increases the throughput, i.e. the total number of successfully executed distributed Erlang commands per experiment. Figure 8 shows the parameterisable internal workflow of DE-Bench. There are three classes of commands in DE-Bench: (i) Point-to-Point (P2P) commands, in which a function with tunable argument size and computation time is run on a remote node; these include spawn, rpc, and synchronous calls to server processes, i.e. gen_server or gen_fsm; (ii) Global commands, which entail synchronisation across all connected nodes, such as global:register_name and global:unregister_name; and (iii) Local commands, which are executed independently by a single node, e.g. whereis_name, a look up in the local name table.


[Figure 8: DE-Bench's internal workflow.]

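For concreteness, the sketch below shows one representative standard Erlang/OTP call per command class; DE-Bench's own wrappers and tunable parameters are omitted, and the node, module, and registered names are illustrative:

    command_classes(RemoteNode, Pid) ->
        %% P2P: run a function with a tunable argument on a remote node.
        _Result = rpc:call(RemoteNode, some_module, some_fun, [some_arg]),
        %% Global: name registration synchronises with all connected nodes.
        yes = global:register_name(my_server, Pid),
        _ = global:unregister_name(my_server),
        %% Local: a lookup that reads only the local replica of the name table.
        global:whereis_name(my_server).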

The benchmarking is conducted on 10 to 100 host configurations of the Kalkyl cluster (in steps of 10), and measures the throughput of successful commands per second over 5 minutes. There is one Erlang VM on each host, and one DE-Bench instance on each VM. The full paper [37] investigates the impact of data size and computation time in P2P calls, both independently and in combination, and the scaling properties of the common Erlang/OTP generic server processes gen_server and gen_fsm.

Here we focus on the impact of different proportions of global commands, mixing global with P2P and local commands. Figure 9 shows that even a low proportion of global commands limits the scalability of distributed Erlang, e.g. just 0.01% global commands limits scalability to around 60 nodes. Figure 10 reports the latency of all commands, and shows that while the latencies for P2P and local commands are stable at scale, the latency of the global commands increases dramatically with scale. Both results illustrate that the impact of global operations on throughput and latency in a distributed Erlang system is severe.

Explicit Placement. While network connectivity and global information impact performance at scale, our investigations also identified explicit process placement as a programming issue at scale. Recall from Section iii that distributed Erlang requires the programmer to identify an explicit Erlang node (VM) when spawning a process. Identifying an appropriate node becomes a significant burden for large and dynamic systems. The problem is exacerbated in large distributed systems where (1) the hosts may not be identical, having different hardware capabilities or different software installed, and (2) communication times may be non-uniform: it may be fast to send a message between VMs on the same host, and slow if the VMs are on different hosts in a large distributed system.

These factors make it difficult to deploy applications, especially in a scalable and portable manner. Moreover, while the programmer may be able to use platform-specific knowledge to decide where to spawn processes to enable an application to run efficiently, if the application is then deployed on a different platform, or if the platform changes as hosts fail or are added, this knowledge becomes outdated.

Discussion. Our investigations confirm three language-level scalability limitations of Erlang from developer folklore:


[Figure 9: Scalability vs. percentage of global commands in distributed Erlang.]

(1) Maintaining a fully connected network of Erlang nodes limits scalability; for example, Orbit is typically limited to just 40 nodes. (2) Global operations, and crucially the global operations required for reliability, i.e. to maintain a global namespace, seriously limit the scalability of distributed Erlang systems. (3) Explicit process placement makes it hard to build performance-portable applications for large architectures. These issues cause designers of reliable large scale systems in Erlang to depart from the standard Erlang model, e.g. using techniques like hidden nodes and storing pids in data structures. In Section V we develop language technologies to address these issues.

iii Persistent Storage

Any large scale system needs reliable scalable persistent storage, and we have studied the scalability limits of Erlang persistent storage alternatives [38]. We envisage a typical large server having around 10^5 cores on around 100 hosts. We have reviewed the requirements for scalable and available persistent storage, and evaluated four popular Erlang DBMS against these requirements. For a target scale of around 100 hosts, Mnesia and CouchDB are, unsurprisingly, not suitable. However, Dynamo-style NoSQL DBMS like Cassandra and Riak have the potential to be.

We have investigated the current scalability limits of the Riak NoSQL DBMS, using the Basho Bench benchmarking framework on a cluster with up to 100 nodes and independent disks. We found that the scalability limit of Riak version 1.1.1 is 60 nodes on the Kalkyl cluster. The study placed into the public scientific domain what was previously well-evidenced but anecdotal developer experience.

We have also shown that resources like memory, disk, and network do not limit the scalability of Riak. By instrumenting the global and gen_server OTP libraries, we identified a specific Riak remote procedure call that fails to scale. We outline how later releases of Riak are refactored to eliminate the scalability bottlenecks.


[Figure 10: Latency of commands as the number of Erlang nodes increases.]

Discussion. We conclude that Dynamo-like NoSQL DBMSs have the potential to deliver reliable persistent storage for Erlang at our target scale of approximately 100 hosts. Specifically, an Erlang Cassandra interface is available, and Riak 1.1.1 already provides scalable and available persistent storage on 60 nodes. Moreover, the scalability of Riak is much improved in subsequent versions.

V Improving Language Scalability

This section outlines the Scalable Distributed (SD) Erlang libraries [21] we have designed and implemented to address the distributed Erlang scalability issues identified in Section ii. SD Erlang introduces two concepts to improve scalability. S_groups partition the set of nodes in an Erlang system to reduce network connectivity and partition global data (Section i.1). Semi-explicit placement alleviates the issues of explicit process placement in large heterogeneous networks (Section i.2). The two features are independent and can be used separately or in combination. We overview SD Erlang in Section i, and outline s_group semantics and validation in Sections ii and iii respectively.

i SD Erlang Design

i.1 S_groups

S_groups reduce both the number of connections a node maintains and the size of namespaces, i.e. they minimise global information. Specifically, names are registered on, and synchronised between, only the nodes within the s_group. An s_group has the following parameters: a name, a list of nodes, and a list of registered names. A node can belong to many s_groups or to none. If a node belongs to no s_group it behaves as a usual distributed Erlang node.

The s_group library defines the functions shown in Table 1. Some of these functions manipulate s_groups and provide information about them, such as creating s_groups and providing a list of nodes from a given s_group. The remaining functions manipulate names registered in s_groups and provide information about these names.


Table 1: Summary of s_group functions

Function                                  Description
new_s_group(SGroupName, Nodes)            Creates a new s_group consisting of some nodes
delete_s_group(SGroupName)                Deletes an s_group
add_nodes(SGroupName, Nodes)              Adds a list of nodes to an s_group
remove_nodes(SGroupName, Nodes)           Removes a list of nodes from an s_group
s_groups()                                Returns a list of all s_groups known to the node
own_s_groups()                            Returns a list of s_group tuples the node belongs to
own_nodes()                               Returns a list of nodes from all s_groups the node belongs to
own_nodes(SGroupName)                     Returns a list of nodes from the given s_group
info()                                    Returns s_group state information
register_name(SGroupName, Name, Pid)      Registers a name in the given s_group
re_register_name(SGroupName, Name, Pid)   Re-registers a name (changes a registration) in a given s_group
unregister_name(SGroupName, Name)         Unregisters a name in the given s_group
registered_names({node, Node})            Returns a list of all registered names on the given node
registered_names({s_group, SGroupName})   Returns a list of all registered names in the given s_group
whereis_name(SGroupName, Name),
whereis_name(Node, SGroupName, Name)      Return the pid of a name registered in the given s_group
send(SGroupName, Name, Msg),
send(Node, SGroupName, Name, Msg)         Send a message to a name registered in the given s_group

For example, to register a process Pid with name Name in s_group SGroupName we use the function below. The name will only be registered if the process is being executed on a node that belongs to the given s_group, and neither Name nor Pid is already registered in that group:

    s_group:register_name(SGroupName, Name, Pid) -> yes | no
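A small usage sketch follows; the node names, registered name, and DbPid are illustrative, and the {ok, GroupName, Nodes} return shape is taken from the s_group creation fragment shown later in this section:

    start_db(DbPid) ->
        %% Create a group, register a server process in it, and address
        %% the server by name; registration is synchronised only within g1.
        {ok, g1, _Nodes} = s_group:new_s_group(g1, ['n1@host1', 'n2@host2']),
        yes = s_group:register_name(g1, db_server, DbPid),
        DbPid = s_group:whereis_name(g1, db_server),
        s_group:send(g1, db_server, {request, self()}).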

To illustrate the impact of s_groups on scalability we repeat the global operations experiment from Section ii (Figure 9). In the SD Erlang experiment we partition the set of nodes into s_groups, each containing ten nodes, so the names are replicated and synchronised on just ten nodes, and not on all nodes as in distributed Erlang. The results in Figure 11 show that with 0.01% of global operations the throughput of distributed Erlang stops growing at 40 nodes, while the throughput of SD Erlang continues to grow linearly.

The connection topology of s_groups is extremely flexible: they may be organised into a hierarchy of arbitrary depth or branching, e.g. there could be multiple levels in the tree of s_groups, as in Figure 12. Moreover, it is not necessary to create a hierarchy of s_groups; for example, we have constructed an Orbit implementation using a ring of s_groups.

[Figure 11: Global operations in distributed Erlang vs SD Erlang.]

Given such a flexible way of organising distributed systems, key questions in the design of an SD Erlang system are the following. How should s_groups be structured? Depending on the reason the nodes are grouped (reducing the number of connections, reducing the namespace, or both), s_groups can be freely structured as a tree, a ring, or some other topology. How large should the s_groups be? Smaller s_groups mean more inter-group communication, but the synchronisation of the s_group state between the s_group nodes constrains the maximum size of s_groups. We have not found this constraint to be a serious restriction: for example, many s_groups are either relatively small (e.g. 10 nodes), or internal or terminal elements in some topology, e.g. the leaves and nodes of a tree. How do nodes from different s_groups communicate? While any two nodes can communicate in an SD Erlang system, to minimise the number of connections communication between nodes from different s_groups is typically routed via gateway nodes that belong to both s_groups. How do we avoid single points of failure? For reliability, and to minimise communication load, multiple gateway nodes and processes may be required.

Information to make these design choices is provided by the tools in Section VII and by benchmarking. A further challenge is how to systematically refactor a distributed Erlang application into SD Erlang, and this is outlined in Section i. A detailed discussion of distributed system design and refactoring in SD Erlang is provided in a recent article [22].

We illustrate typical SD Erlang system designs by showing refactorings of the Orbit and ACO benchmarks from Section III. In both distributed and SD Erlang the computation starts on the Master node, and the actual computation is done on the Worker nodes. In the distributed Erlang version all nodes are interconnected, and messages are transferred directly from the sending node to the receiving node (Figure 2). In contrast, in the SD Erlang version nodes are grouped into s_groups, and messages are transferred between different s_groups via Submaster nodes (Figure 13).


[Figure 12: SD Erlang ACO (SR-ACO) architecture: a three-level tree with the master process at level 0, sub-master processes above s_groups of colony nodes, and ant processes within each colony.]

A fragment of code that creates an s_group on a Sub-master node is as follows:

    %% Create a new s_group and report its name and nodes back to Master.
    create_s_group(Master, GroupName, Nodes0) ->
        case s_group:new_s_group(GroupName, Nodes0) of
            {ok, GroupName, Nodes} ->
                Master ! {GroupName, Nodes};
            _ -> io:format("exception message")
        end.

Similarly, we introduce s_groups in the GR-ACO benchmark from Section ii to create Scalable Reliable ACO (SR-ACO); see Figure 12. Here, apart from reducing the number of connections, s_groups also reduce the global namespace information. That is, instead of registering the name of a pid globally, i.e. with all nodes, the name is registered only on all nodes in the s_group, with s_group:register_name/3.

A comparative performance evaluation of distributed Erlang and SD Erlang Orbit and ACO is presented in Section VIII.

i.2 Semi-Explicit Placement

Recall from Section iii that distributed Erlang spawns a process onto an explicitly named Erlang node, e.g.:

    spawn(some_node, fun some_module:pong/0)

and also recall the portability and programming effort issues associated with such explicit placement in large scale systems, discussed in Section ii.

To address these issues we have developed a semi-explicit placement library that enables the programmer to select nodes on which to spawn processes based on run-time information about the properties of the nodes. For example, if a process performs a lot of computation one would like to spawn it on a node with considerable computation power, or if two processes are likely to communicate frequently, then it would be desirable to spawn them on the same node, or on nodes with a fast interconnect.

We have implemented two Erlang libraries to support semi-explicit placement [60]. The first deals with node attributes, and describes properties of individual Erlang VMs and associated hosts, such as total and currently available RAM, installed software, hardware configuration, etc. The second deals with a notion of communication distances, which models the communication times between nodes in a distributed system. Therefore, instead of specifying a node, we can use the attr:choose_node/1 function to define the target node, i.e.:

    spawn(attr:choose_node(Params), fun some_module:pong/0)


[Figure 13: SD Erlang (SD-Orbit) architecture.]
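For illustration, a hedged sketch of semi-explicit spawning; the exact form of the Params accepted by attr:choose_node/1 is not shown in this article, so the attribute names below are hypothetical:

    spawn_pong() ->
        %% Pick a node with ample free memory that is communication-close
        %% to the current node, then spawn the process there.
        TargetNode = attr:choose_node([{free_ram_gb, 8}, {max_distance, 2}]),
        spawn(TargetNode, fun some_module:pong/0).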

[60] report an investigation into the communication latencies on a range of NUMA and cluster architectures, and demonstrate the effectiveness of the placement libraries using the ML-ACO benchmark on the Athos cluster.

ii S_group Semantics

For precise specification, and as a basis for validation, we provide a small-step operational semantics of the s_group operations [21]. Figure 14 defines the state of an SD Erlang system and associated abstract syntax variables. The abstract syntax variables on the left are defined as members of sets, denoted {}, and these in turn may contain tuples, denoted (), or further sets. In particular, nm is a process name, p a pid, ni a node_id, and nis a set of node_ids. The state of a system is modelled as a four-tuple comprising a set of s_groups, a set of free_groups, a set of free_hidden_groups, and a set of nodes. Each type of group is associated with nodes and has a namespace. An s_group additionally has a name, whereas a free_hidden_group consists of only one node,

i.e. a hidden node simultaneously acts as a node and as a group, because as a group it has a namespace but does not share it with any other node. Free normal and hidden groups have no names, and are uniquely defined by the nodes associated with them. Therefore, group names, gr_names, are either NoGroup or a set of s_group_names. A namespace is a set of name and process id (pid) pairs, and is replicated on all nodes of the associated group.

A node has the following four parameters: a node_id identifier; a node_type that can be either hidden or normal; connections; and group_names, i.e. names of the groups the node belongs to. The node can belong to either a list of s_groups or one of the free groups; the type of the free group is defined by the node type. Connections are a set of node_ids.

Transitions in the semantics have the form (state, command, ni) → (state′, value), meaning that executing command on node ni in state returns value and transitions to state′.

The semantics is presented in more detail by [21], but we illustrate it here with the s_group:registered_names/1 function from Section i.1.


(grs, fgs, fhs, nds) ∈ state ≡ ({s_group}, {free_group}, {free_hidden_group}, {node})
gr ∈ grs ≡ {s_group} ≡ {(s_group_name, {node_id}, namespace)}
fg ∈ fgs ≡ {free_group} ≡ {({node_id}, namespace)}
fh ∈ fhs ≡ {free_hidden_group} ≡ {(node_id, namespace)}
nd ∈ nds ≡ {node} ≡ {(node_id, node_type, connections, gr_names)}
gs ∈ gr_names ≡ NoGroup | {s_group_name}
ns ∈ namespace ≡ {(name, pid)}
cs ∈ connections ≡ {node_id}
nt ∈ node_type ≡ {Normal, Hidden}
s ∈ {NoGroup} ∪ {s_group_name}

Figure 14: SD Erlang state [21]

The function returns a list of names registered in s_group s if node ni belongs to the s_group, and an empty list otherwise (Figure 15). Here ⊕ denotes disjoint set union; IsSGroupNode returns true if node ni is a member of some s_group s, and false otherwise; and OutputNms returns the set of process names registered in the ns namespace of s_group s.

iii Semantics Validation

As the semantics is concrete it can readily be made executable in Erlang, with lists replacing sets throughout. Having an executable semantics allows users to engage with it, and to understand how the semantics behaves vis-à-vis the library, giving them an opportunity to assess the correctness of the library against the semantics.

[Figure 16: Testing s_groups using QuickCheck.]

Better still, we can automatically assess how the system behaves in comparison with the (executable) semantics by executing them in lockstep, guided by the constraints of which operations are possible at each point. We do that by building an abstract state machine model of the library. We can then generate random sequences (or traces) through the model, with appropriate library data generated too. This random generation is supported by the QuickCheck property-based testing system [24, 9].

The architecture of the testing framework is shown in Figure 16. First, an abstract state machine, embedded as an "eqc_statem" module, is derived from the executable semantic specification. The state machine defines the abstract state representation and the transition from one state to another when an operation is applied. Test case and data generators are then defined to control the test case generation; this includes the automatic generation of eligible s_group operations and the input data to those operations. Test oracles are encoded as the postcondition for s_group operations.


((grs, fgs, fhs, nds), s_group:registered_names(s), ni)
    → ((grs, fgs, fhs, nds), nms)   if IsSGroupNode(ni, s, grs)
    → ((grs, fgs, fhs, nds), {})    otherwise

where {(s, {ni} ⊕ nis, ns)} ⊕ grs′ ≡ grs
      nms ≡ OutputNms(s, ns)

IsSGroupNode(ni, s, grs) = ∃ nis, ns, grs′ . {(s, {ni} ⊕ nis, ns)} ⊕ grs′ ≡ grs

OutputNms(s, ns) = {(s, nm) | (nm, p) ∈ ns}

Figure 15: SD Erlang semantics of the s_group:registered_names/1 function


During testing, each test command is applied to both the abstract model and the s_group library. The application of the test command to the abstract model takes the abstract model from its current state to a new state, as described by the transition functions, whereas the application of the test command to the library leads the system to a new actual state. The actual state information is collected from each node in the distributed system, then merged and normalised to the same format as the abstract state representation. For a test to be successful, after the execution of a test command the test oracles specified for this command should be satisfied. Various test oracles can be defined for s_group operations; for instance, one of the generic constraints that applies to all the s_group operations is that after each s_group operation the normalised system state should be equivalent to the abstract state.
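A much-simplified sketch of such an eqc_statem model is shown below (Quviq QuickCheck); the real model is derived from the executable semantics, covers all s_group operations, and compares the full normalised system state rather than just the group names:

    -module(sgroup_eqc).
    -include_lib("eqc/include/eqc.hrl").
    -include_lib("eqc/include/eqc_statem.hrl").
    -compile(export_all).

    %% Abstract state: a list of {SGroupName, Nodes} pairs.
    initial_state() -> [].

    %% Generate eligible commands (and their data) from the abstract state.
    command(_S) ->
        oneof([{call, s_group, new_s_group, [group_name(), node_list()]},
               {call, s_group, s_groups, []}]).

    group_name() -> elements([g1, g2, g3]).
    node_list()  -> elements([['n1@host'], ['n1@host', 'n2@host']]).

    precondition(_S, _Call) -> true.

    %% Transition function mirroring the operational semantics.
    next_state(S, _V, {call, s_group, new_s_group, [Name, Nodes]}) ->
        lists:keystore(Name, 1, S, {Name, Nodes});
    next_state(S, _V, _Call) -> S.

    %% Test oracle: the library's answer must match the abstract model.
    postcondition(S, {call, s_group, s_groups, []}, Res) ->
        lists:sort(Res) =:= lists:sort([Name || {Name, _} <- S]);
    postcondition(_S, _Call, _Res) -> true.

    prop_s_group() ->
        ?FORALL(Cmds, commands(?MODULE),
                begin
                    {_H, _S, Res} = run_commands(?MODULE, Cmds),
                    Res =:= ok
                end).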

Thousands of tests were run, and three kinds of errors (which have subsequently been corrected) were found. Some errors in the library implementation were found, including one error due to the synchronisation between nodes, and another related to the remove_nodes operation, which erroneously raised an exception. We also found a couple of trivial errors in the semantic specification itself, which had been missed by manual examination. Finally, we found some situations where there were inconsistencies between the semantics and the library implementation, despite their states being equivalent; an example of this was in the particular values returned by functions on certain errors. Overall, the automation of testing boosted our confidence in the correctness of the library implementation and the semantic specification. This work is reported in more detail by [54].

VI Improving VM Scalability

This section reports the primary VM and library improvements we have designed and implemented to address the scalability and reliability issues identified in Section i.


i Improvements to Erlang Term Storage

Because ETS tables are so heavily used in Erlang systems, they are a focus for scalability improvements. We start by describing their redesign, including some improvements that pre-date our RELEASE project work, i.e. those prior to Erlang/OTP R15B03. These historical improvements are very relevant for a scalability study, and form the basis for our subsequent changes and improvements. At the point when Erlang/OTP got support for multiple cores (in release R11B) there was a single reader-writer lock for each ETS table. Optional fine grained locking of hash-based ETS tables (i.e. set, bag, or duplicate_bag tables) was introduced in Erlang/OTP R13B02-1, adding 16 reader-writer locks for the hash buckets. Reader groups to minimise read synchronisation overheads were introduced in Erlang/OTP R14B. The key observation is that a single count of the multiple readers must be synchronised across many cache lines, potentially far away in a NUMA system. Maintaining reader counts in multiple (local) caches makes reads fast, although writes must now check every reader count. In Erlang/OTP R16B the number of bucket locks and the default number of reader groups were both upgraded from 16 to 64.
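At the language level these locking facilities are enabled per table via standard ETS options; a minimal sketch (the table and key names are illustrative):

    new_bench_table() ->
        %% A hash-based table whose writes use the fine-grained bucket
        %% locks and whose reads use reader groups.
        Tab = ets:new(bench_table, [set, public,
                                    {write_concurrency, true},
                                    {read_concurrency, true}]),
        true = ets:insert(Tab, {some_key, 42}),
        [{some_key, 42}] = ets:lookup(Tab, some_key),
        Tab.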

We illustrate the scaling properties of the ETS concurrency options using the ets_bench BenchErl benchmark on the Intel NUMA machine with 32 hyperthreaded cores specified in Appendix A. The ets_bench benchmark inserts 1M items into the table, then records the time to perform 17M operations, where an operation is either a lookup, an insert, or a delete. The experiments are conducted on a hash-based (set) ETS table with different percentages of update operations, i.e. insertions or deletions.

Figure 17 shows the runtimes in seconds of 17M operations in different Erlang/OTP versions, varying the number of schedulers (x-axis), reflecting how the scalability of ETS tables has improved in more recent Erlang/OTP releases.

[Figure 17: Runtime of 17M ETS operations in Erlang/OTP releases R11B-5, R13B02-1, R14B, and R16B; 10% updates (left) and 1% updates (right). The y-axis shows time in seconds (lower is better) and the x-axis the number of OS threads (VM schedulers).]

Figure 18 shows the runtimes in seconds of 17M operations on an ETS table with different numbers of reader groups, again varying the number of schedulers. We see that one reader group is not sufficient with 10% updates, nor are two with 1% updates. Beyond that, different numbers of reader groups have little impact on the benchmark performance, except that using 64 groups with 10% updates slightly degrades performance.

We have explored four other extensions or redesigns in the ETS implementation for better scalability: (1) allowing more programmer control over the number of bucket locks in hash-based tables, so the programmer can reflect the number of schedulers and the expected access pattern; (2) using contention-adapting trees to get better scalability for ordered_set ETS tables, as described by [69]; (3) using queue delegation locking to improve scalability [47]; and (4) adopting schemes for completely eliminating the locks in the meta table. A more complete discussion of our work on ETS can be found in the papers by [69] and [47].

[Figure 18: Runtime of 17M ETS operations with varying numbers of reader groups; 10% updates (left) and 1% updates (right). The y-axis shows runtime in seconds (lower is better) and the x-axis the number of schedulers.]

Here we outline only our work on contention-adapting (CA) trees. A CA tree monitors contention in different parts of a tree-shaped data structure, introducing routing nodes with locks in response to high contention, and removing them in response to low contention. For experimental purposes, two variants of the CA tree have been implemented to represent ordered_sets in the virtual machine of Erlang/OTP 17.0. One extends the existing AVL trees in the Erlang VM, and the other uses a Treap data structure [5]. Figure 19 compares the throughput of the CA tree variants with that of ordered_set and set as the number of schedulers increases. It is unsurprising that the CA trees scale so much better than an ordered_set, which is protected by a single readers-writer lock. It is more surprising that they also scale better than set. This is due to hash tables using fine-grained locking at a fixed granularity, while CA trees can adapt the number of locks both to the current contention level and to the parts of the key range where contention is occurring.

[Figure 19: Throughput of CA tree variants; 10% updates (left) and 1% updates (right).]


ii Improvements to Schedulers

In the Erlang VM, a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until it either blocks waiting for input, typically a message from some other process, or until it has executed its quota of reductions.

The Erlang VM is usually started with one scheduler per logical core (SMT-thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned it is placed in the run queue of the scheduler of its parent, and it waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores, that is, an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Figure iii.2.

The default load management mechanism is load compaction, which aims to keep as many scheduler threads as possible fully loaded with work, i.e. it attempts to ensure that scheduler threads do not run out of work. We have developed a new optional scheduler utilisation balancing mechanism that is available from Erlang/OTP 17.0. The new mechanism aims to balance scheduler utilisation between schedulers, that is, it strives for equal scheduler utilisation on all schedulers.

The scheduler utilisation balancing mechanism has no performance impact on the system when not enabled. When enabled, it results in changed timing in the system: normally there is a small overhead due to measuring utilisation and calculating balancing information, which depends on the underlying primitives provided by the operating system.

The new balancing mechanism results in a better distribution of processes to schedulers, reducing the probability of core contention. Together with other VM improvements, such as interruptable BIFs and garbage collection, it results in lower latency and improved responsiveness, and hence reliability, for soft real-time applications.
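The mechanism is opt-in: it is enabled with the documented +sub flag when the VM is started, while the default (+sub false) retains load compaction. A sketch:

    $ erl +sub true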

iii Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel benchmark in Section i. The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in that returns "Erlang system time", or time since Epoch, with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, i.e. not SMP-enabled. The documentation promises that values returned by it are strictly increasing, and many applications ended up relying on this. For example, applications often employ erlang:now/0 to generate unique integers.
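Such uses migrate naturally to the new API: erlang:unique_integer/1, introduced alongside the new time API in Erlang/OTP 18, produces unique values without global synchronisation. A sketch:

    %% Old: unique, strictly increasing, but serialises all callers.
    Old = erlang:now(),
    %% New: unique (and optionally monotonic) without a global lock.
    New = erlang:unique_integer([monotonic, positive]).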

Erlang system time should align with the operating system's view of time since Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot, without invalidating the strictly-increasing-value guarantee.


The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if the two do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when the two have deviated, e.g. in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], i.e. a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards or backwards, while system time is allowed to do this. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.
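A sketch of the resulting API (the three BIFs below are all part of the Erlang/OTP 18 time API; milli_seconds is the OTP 18 spelling of the unit):

    %% Erlang system time is a dynamically varying offset from
    %% Erlang monotonic time.
    Mono    = erlang:monotonic_time(),   % never leaps; may be negative
    Offset  = erlang:time_offset(),      % re-published as the VM adjusts
    SysTime = Mono + Offset,             % the value erlang:system_time()
                                         % would return
    Ms = erlang:convert_time_unit(SysTime, native, milli_seconds).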

Time Retrieval. Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, monotonic time delivered by the operating system is based solely on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP); that is, they run at different frequencies. Linux is an exception, with a monotonic clock that is NTP-adjusted and runs at the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing monotonic time and system time as delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute.

If the adjustment needs to be changed, new adjustment information is published and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the time adjustment information previously published, and calculates Erlang monotonic time. To preserve monotonicity it is important that all threads that read the same OS monotonic time map it to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information, using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed. This means that in the vast majority of cases the RW lock will be read-locked, which allows multiple readers to run concurrently. To prevent bouncing the lock cache line, we use a bespoke reader-optimised RW lock implementation, where reader threads notify about their presence in counters on separate cache lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11]; alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].

Timer Wheel and BIF Timer. The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel that is used by processes executing on the scheduler.

The implementation of timers in Erlang/OTP uses a built-in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS tables, and insert only one timer at a time into the timer wheel.


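A sketch of the timer BIFs whose implementation this work optimises:

    %% Set a timer: {timeout, TRef, hello} is delivered to this process
    %% after 5000 ms; the timer is inserted into a timer wheel.
    TRef = erlang:start_timer(5000, self(), hello),
    %% Cancelling returns the time left in ms (or false if it fired).
    TimeLeft = erlang:cancel_timer(TRef).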

Benchmarks. We have measured several benchmarks on a 16-core Bulldozer machine with eight dual-core modules, i.e. two AMD Opteron 4376 HEs⁵. We present three of them here.

The first micro benchmark compares the execution time of an Erlang receive with that of a receive ... after that specifies a timeout and provides a default value. The receive ... after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, total execution time with the optimised timers is only 5% longer than without timers.
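The two forms compared by this benchmark are sketched below; the only difference is the after clause, which sets a timer when the process blocks and cancels it on message arrival:

    %% Plain receive: no timer is involved.
    wait() ->
        receive Msg -> Msg end.

    %% receive ... after: a timer is set and later cancelled.
    wait(Timeout) ->
        receive Msg -> Msg
        after Timeout -> default_value
        end.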

The second micro benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4 and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on the left shows that: (1) the performance of time management has remained roughly unchanged between Erlang/OTP releases prior to 18.0; (2) the improved time management in Erlang/OTP 18.x makes time management less

⁵See §2.5.4 of the RELEASE project Deliverable 2.4 (http://release-project.eu/documents/D24.pdf).


Figure 20: BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases: runtimes (left) and speedup obtained using erlang:monotonic_time/0 (right).

likely to be a scalability bottleneck, even when using erlang:now/0; and (3) the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution. The graph on the right side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

VII Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable,


to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling and debugging Erlang systems that can be used separately or together, as appropriate for solving the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on, or specialise, the services offered by the Erlang virtual machine through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer⁶ application gives a comprehensive overview of many of these data on a node-by-node basis.
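For example, the dbg application in the standard distribution is a thin front end to the VM's trace BIFs; a minimal sketch:

    %% Trace all calls to lists:seq/2 in every process.
    {ok, _} = dbg:tracer(),              % start the default trace handler
    {ok, _} = dbg:p(all, [call]),        % enable call tracing everywhere
    {ok, _} = dbg:tp(lists, seq, 2, []), % set a pattern on lists:seq/2
    lists:seq(1, 10),                    % this call emits a trace message
    dbg:stop().                          % switch tracing off again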

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature; that is, the tools support the language itself rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use the tracing support provided by the Erlang VM, so other actor frameworks, e.g. Akka, can use the Kamon⁷ JVM monitoring system. Similarly, tools for other actor languages or frameworks could use data derived through OS-level

⁶http://www.erlang.org/doc/apps/observer
⁷http://kamon.io

tracing frameworks: DTrace⁸ and SystemTap⁹

probes, as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

i Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare it for modification or extension, or, as is the case here, in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support in a refactoring tool. There are a number of tools that support refactoring in Erlang: in the RELEASE project we have chosen to extend Wrangler¹⁰ [56]; other tools include Tidier [68] and RefactorErl [44].

Supporting API Migration. The SD Erlang libraries modify Erlang's global_group library, which becomes the new s_group library; as a result, Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolves and this often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be

⁸http://dtrace.org/blogs/about
⁹https://sourceware.org/systemtap/wiki
¹⁰http://www.cs.kent.ac.uk/projects/wrangler


supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically. The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work, and the technicalities of the implementation, can be found in a paper by [52].
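A hypothetical adaptor module of the kind this framework consumes is sketched below: each old global_group function is defined by a call to its s_group counterpart. The function names, the group argument and the module conventions are illustrative assumptions, not the actual SD Erlang or Wrangler formats:

    %% Hypothetical adaptor: defines the old API in terms of the new.
    %% Wrangler generates the client-code migration from definitions
    %% of this shape.
    -module(global_group_adaptor).
    -export([send/2]).

    %% Assumed s_group function taking an explicit s_group argument.
    send(Name, Msg) ->
        s_group:send(my_group, Name, Msg).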

Support for Introducing Parallelism. We have introduced support for parallelising explicit list operations (map and foreach), for process introduction to complete a computationally intensive task in parallel, for introducing a worker process to deal with call handling in an Erlang "generic server", and for parallelising a tail-recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach (a sketch of the underlying idea is shown below). The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward; even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
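A minimal sketch of the idea behind a parallel map (not the actual para_lib implementation): spawn a worker per element, then collect the results in list order:

    pmap(F, Xs) ->
        Parent = self(),
        Pids = [spawn(fun() -> Parent ! {self(), F(X)} end) || X <- Xs],
        %% Receive in spawn order so results align with lists:map(F, Xs).
        [receive {Pid, R} -> R end || Pid <- Pids].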

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute

a task in parallel with its parent process. The result of the new process is sent back to the parent process, which will then consume it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.
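A sketch of the code shape this refactoring produces (expensive_task/0, other_work/0 and combine/2 are placeholder names):

    Parent = self(),
    Tag = make_ref(),
    %% Introduced process computes the task in parallel with its parent.
    spawn(fun() -> Parent ! {Tag, expensive_task()} end),
    Other = other_work(),                    % independent computation
    %% The receive is placed just before the result is needed.
    Result = receive {Tag, R} -> R end,
    combine(Other, Result).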

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies: for instance, a function might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are checked by program slicing, which is discussed below.

Support for Program Slicing. Program slicing is a general technique of program analysis for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE¹¹ [17], a tool developed in another EU project that also re-uses the Wrangler front end. This work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing in transforming programs to make them suitable for the introduction of parallel structures.

¹¹http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf



Figure 21: The architecture of WombatOAM

ii Scalable Deployment

We have developed the WombatOAM tool¹² to provide a deployment, operations and maintenance framework for large-scale Erlang distributed systems. These systems typically consist of a number of Erlang nodes executing on different hosts. These hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP.

¹²WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd.

Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this is the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture. The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability because of the role played by the central Master node: in the current


version, an additional layer of Middle Managers was introduced to allow the system to scale easily to thousands of deployed nodes. As the diagram shows, the "northbound" interfaces to the web dashboard and the command line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms, and so forth; we describe those now.

Services. WombatOAM is designed to collect, store and display various kinds of information and events from running Erlang systems, and these data are accessed and managed through an AJAX-based Web Dashboard; they include the following:

Metrics. WombatOAM supports the collection of some hundred metrics (including, for instance, the number of processes on a node, and the message traffic in and out of the node) on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom¹³, and interface with other tools, such as Graphite¹⁴, which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications. As well as metrics, it supports event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which is part of the standard distri-

¹³Folsom collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom
¹⁴https://graphiteapp.org

bution, or the lager logging framework¹⁵, and will be displayed and logged as they occur.

Alarms. Alarms are more complex entities. Alarms have a state: they can be raised, and once dealt with they can be cleared; they also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as for notifications.

Topology. The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible; if not, it notifies the other services and periodically tries to reconnect. When the nodes are available again, it again notifies the other services. It doesn't have a middle manager part, because it doesn't talk to the nodes directly; instead it asks the Node manager service to do so.

Node manager. This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration. This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers, the Orchestration

¹⁵https://github.com/basho/lager


service uses an external library called Libcloud¹⁶, for which Erlang Solutions has written an open source Erlang wrapper called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications; it provides infrastructure for deploying them.

Deployment. The mechanism consists of the following five steps:

1. Registering a provider. WombatOAM provides the same interface for different cloud providers which support the OpenStack standard or the Amazon EC2 API. WombatOAM also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this enables WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

¹⁶The unified cloud API [4]: https://libcloud.apache.org

4. Defining a deployment domain. At this step a deployment domain is created, which specifies (i) which providers should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to, or removed from, the system depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single-host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue, and processes migrate between the queues through a work stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes¹⁷.

iii.1 Percept2

Percept2¹⁸ builds on the existing Percept tool to provide post hoc offline analysis and visualisation of Erlang systems.

¹⁷This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.

¹⁸https://github.com/RefactoringTools/percept2


Figure 22: Offline profiling of a distributed Erlang application using Percept2

Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling for a distributed Erlang application using Percept2 is shown in Figure 22.
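A typical offline profiling session follows the pattern sketched below. The entry points follow the Percept2 documentation (percept2:profile/3, percept2:analyze/1, percept2:start_webserver/1); exact arguments may differ between releases, and the orbit module is a placeholder:

    %% Run the program under trace, writing events to a file.
    percept2:profile("orbit.dat", {orbit, run, []}, [all]),
    %% Analyse the file into the RAM database.
    percept2:analyze(["orbit.dat"]),
    %% Browse the results at http://localhost:8888/.
    percept2:start_webserver(8888).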

Percept generates an application-level zoomable concurrency graph, showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways, as detailed by [53], including most importantly:

• Distinguishing between running and runnable time for each process: this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including: the migration history of a process between run queues; statistics about message passing between processes (the number of messages, and the average message size, sent/received by a process); the accumulated runtime per process; and the accumulated time when a process is in a running state.

• Presenting the process tree: the hierarchy structure indicating the parent-child relationships between processes.


Figure 23: Percept2 showing processes running and runnable (with 4 schedulers/run queues)


• Recording dynamic function call graph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system. Unlike global groups, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating measurement of a function's own execution time gives users the freedom of not profiling all the function calls invoked during the program execution. For example, they can choose to profile only functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph, a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of


generated web pages. Together these make the system more responsive and more scalable.

iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, do not require specially recompiled versions of the software to be examined, nor special post-processing tools to create meaningful information from the data gathered. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the sizes of the run queues vary during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the


Figure 24: DTrace run-queue visualisation of the bang benchmark with 16 schedulers (time on X-axis)

same storage infrastructure and presentation facilities.

iii.3 Devo

Devo¹⁹ [10] is designed to provide real-time online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).
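A sketch of how a system can be instrumented with ttb (the trace flags shown are illustrative, not Devo's actual configuration):

    %% Trace scheduling and message-send events on all connected nodes.
    {ok, _} = ttb:tracer([node() | nodes()], [{file, "devo_trace"}]),
    {ok, _} = ttb:p(all, [running, send, timestamp]),
    %% ... run the system under observation ...
    ttb:stop([format]).    % fetch and format the trace files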

Figure 25 shows visualisations from Devo in both modes. On the left-hand side a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram) with six cores each; with hyperthreading this gives twelve virtual cores per chip, and hence 24 run queues in total. The size of these run queues is shown by both the colour and height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle. A green arc shows

¹⁹https://github.com/RefactoringTools/devo



Figure 25: Devo low- and high-level visualisations of SD-Orbit

migration on the same physical core, a grey one migration on the same chip, and a blue one migration between the two chips.

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes repre-

sents the current intensity of communication between the nodes (green quiescent; red busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. The tracing to be performed on monitored nodes is also specified within the configuration file.

An agent is started by a master SD-Mon node for each s_group and for each free node. The configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up, it tries to get in contact with its nodes and apply the tracing to them, as stated by the master. Binary files are stored in the host file system. Tracing is internally used in order to track s_group operations happening at runtime. An asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is


Figure 26: SD-Mon architecture

deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

The monitoring network is also supervised, in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role; there are also periodic consistency checks for the system as a whole, and when an inconsistency is detected then that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution: before (top) and after eliminating s_group 2 (bottom)


Figure 28: SD-Mon online monitoring

VIII Systemic Evaluation

Preceding sections have investigated the improvement of individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI and VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state-of-the-art NUMA architectures and

the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable²⁰. The bulk of the experiments reported here are conducted on the Athos cluster, using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed-size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure. The benchmarks

²⁰See Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf



Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [strong scaling]

also evaluate different aspects of s_groups: Orbit evaluates the scalability impact of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot standard deviation. Results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups: beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails because some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5000 nodes are deployed, one per core (i.e. 24 nodes on each of 209 24-core Athos hosts), in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.


Figure 30: Wombat deployment time. The best fit to the experimental data is 0.0125N + 83 seconds for N nodes.

The measurement data shows two important facts: it shows that WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and that WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of the effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, while SR-ACO registers them only within an s_group (Section ii).

For both GR-ACO and SR-ACO, a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common X86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark and obtained similar results [23].
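A minimal sketch of such a Chaos Monkey process (the real service is more selective about its victims; rand is the OTP 18+ random number API):

    %% Kill one randomly chosen local process every second, forever.
    chaos_monkey() ->
        timer:sleep(1000),
        Procs = erlang:processes(),
        Victim = lists:nth(rand:uniform(length(Procs)), Procs),
        exit(Victim, kill),       % brutal kill, bypasses trap_exit
        chaos_monkey().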

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the



Figure 31: ACO execution times (GR-ACO, SR-ACO and ML-ACO), Erlang/OTP 17.4 (RELEASE) [weak scaling]

distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.

Scalability. Figure 31 compares the runtimes of the ML, GR and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks and

others that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang


Figure 32: Number of sent packets in ML-ACO, GR-ACO and SR-ACO

preserves the distributed Erlang language-level reliability model.

As SD-Orbit scales better than D-Orbit, as SR-ACO scales better than ML-ACO, and as SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limit of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups,

and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language, while preserving reliability. The work is a vade mecum for addressing the scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering VM, language and persistent storage levels. We have developed the


BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15] or Legion [40]. We establish scientifically the folklore limitation of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups, for partitioning the network and global process namespace, and semi-explicit process placement, for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).

To improve the scalability of the Erlang VM and libraries we have improved the implementation of shared ETS tables, time management,

and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups has been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation, such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large-scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the


Table 2: Cluster specifications (RAM per host in GB)

Name     Hosts   Cores/host   Cores total   Cores max avail   Processor                                  RAM      Interconnection
GPG      20      16           320           320               Intel Xeon E5-2640v2 8C, 2.0GHz            64       10GB Ethernet
Kalkyl   348     8            2784          1408              Intel Xeon 5520v2 4C, 2.26GHz              24-72    InfiniBand 20Gb/s
TinTin   160     16           2560          2240              AMD Opteron 6220v2 Bulldozer 8C, 3.0GHz    64-128   2:1 oversubscribed QDR InfiniBand
Athos    776     24           18624         6144              Intel Xeon E5-2697v2 12C, 2.7GHz           64       InfiniBand FDR14

key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately-sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large-scale servers. In addition, pre-

liminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages²¹. For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3 GHz, 16 "Bulldozer" cores each, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

²¹See Deliverable 6.7 (http://www.release-project.eu/documents/D67.pdf), available online.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha ACTORS A Model of ConcurrentComputation in Distributed Systems PhDthesis MIT 1985

[2] Gul Agha An overview of Actor lan-guages SIGPLAN Not 21(10)58ndash67 1986

[3] Bulldozer (microarchitecture) 2015httpsenwikipediaorgwiki

Bulldozer_(microarchitecture)

[4] Apache SF Liblcoud 2016 https

libcloudapacheorg

[5] C R Aragon and R G Seidel Random-ized search trees In Proceedings of the 30thAnnual Symposium on Foundations of Com-puter Science pages 540ndash545 October 1989

[6] Joe Armstrong Programming Erlang Soft-ware for a Concurrent World PragmaticBookshelf 2007

[7] Joe Armstrong Erlang Commun ACM5368ndash75 2010

[8] Stavros Aronis Nikolaos Papaspyrou Ka-terina Roukounaki Konstantinos Sago-nas Yiannis Tsiouris and Ioannis E

Venetis A scalability benchmark suite forErlangOTP In Torben Hoffman and JohnHughes editors Proceedings of the EleventhACM SIGPLAN Workshop on Erlang pages33ndash42 New York NY USA September2012 ACM

[9] Thomas Arts John Hughes Joakim Jo-hansson and Ulf Wiger Testing telecomssoftware with Quviq QuickCheck In Pro-ceedings of the 2006 ACM SIGPLAN work-shop on Erlang pages 2ndash10 New York NYUSA 2006 ACM

[10] Robert Baker Peter Rodgers SimonThompson and Huiqing Li Multi-level Vi-sualization of Concurrent and DistributedComputation in Erlang In Visual Lan-guages and Computing (VLC) The 19th In-ternational Conference on Distributed Multi-media Systems (DMS 2013) 2013

[11] Luiz Andreacute Barroso Jimmy Clidaras andUrs Houmllzle The Datacenter as a ComputerMorgan and Claypool 2nd edition 2013

[12] Basho Technologies Riakdocs BashoBench 2014 httpdocsbasho

comriaklatestopsbuilding

benchmarking

[13] J E Beasley OR-Library Distributingtest problems by electronic mail TheJournal of the Operational Research Soci-ety 41(11)1069ndash1072 1990 Datasetsavailable at httppeoplebrunelac

uk~mastjjbjeborlibwtinfohtml

[14] Cory Bennett and Ariel Tseitlin Chaosmonkey released into the wild NetflixBlog 2012

[15] Robert D Blumofe Christopher F JoergBradley C Kuszmaul Charles E LeisersonKeith H Randall and Yuli Zhou Cilk Anefficient multithreaded runtime system In

44

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Proceedings of the fifth ACM SIGPLAN sym-posium on Principles and practice of parallelprogramming pages 207ndash216 ACM 1995

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Koszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go programming language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Technical Report Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. Charm++: a portable concurrent object oriented system based on C++. In ACM Sigplan Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley/.

[52] Huiqing Li and Simon Thompson. Automated API migration in a user-extensible refactoring tool for Erlang programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved semantics and implementation through property-based testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe concurrency introduction through slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lubeck and Max Neunhoffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing, 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale (3rd edition, revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-strength functional programming: Experiences with the Ericsson AXD301 project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.



IV Erlang Scalability Limits

This section investigates the scalability of Erlang at VM, language, and persistent storage levels. An aspect we choose not to explore is the security of large scale systems, where, for example, one might imagine providing enhanced security for systems with multiple clusters or cloud instances connected by a Wide Area Network. We assume that existing security mechanisms are used, e.g. a Virtual Private Network.

i Scaling Erlang on a Single Host

To investigate Erlang scalability we built BenchErl, an extensible open source benchmark suite with a web interface.4 BenchErl shows how an application's performance changes when resources like cores or schedulers are added, or when options that control these resources change:

• the number of nodes, i.e. the number of Erlang VMs used, typically on multiple hosts;

• the number of cores per node;

• the number of schedulers, i.e. the OS threads that execute Erlang processes in parallel, and their binding to the topology of the cores of the underlying computer node;

• the Erlang/OTP release and flavor; and

• the command-line arguments used to start the Erlang nodes (see the sketch after this list).
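For example, the scheduler count and binding can be set when starting a node; a minimal sketch (the +S and +sbt flags are standard erl options, while the node name and core count are illustrative):

    $ erl -name bench@host1 +S 16:16 +sbt db
    1> erlang:system_info(schedulers_online).
    16

Here +S 16:16 requests 16 schedulers with 16 online, and +sbt db binds them to cores using the default bind order.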

Using BenchErl, we investigated the scalability of an initial set of twelve benchmarks and two substantial Erlang applications, using a single Erlang node (VM) on machines

4 Information about BenchErl is available at http://release.softlab.ntua.gr/bencherl.

Figure 5: Runtime and speedup of two configurations of the Orbit benchmark using Erlang/OTP R15B01.

with up to 64 cores, including the Bulldozer machine specified in Appendix A. This set of experiments, reported by [8], confirmed that some programs scaled well in the most recent Erlang/OTP release of the time (R15B01), but also revealed VM and language level scalability bottlenecks.

Figure 5 shows runtime and speedup curves for the Orbit benchmark, where master and workers run on a single Erlang node in configurations with and without intra-worker parallelism. In both configurations the program scales: runtime continuously decreases as we add more schedulers to exploit more cores. The speedup of the benchmark without intra-worker parallelism, i.e. without spawning additional processes for the computation (green curve), is almost linear up to 32 cores, but increases less rapidly from that point on; we see a similar, but more clearly visible, pattern for the other configuration (red curve), where there is no performance improvement beyond 32 schedulers. This is due to the asymmetric characteristics of the machine's micro-architecture, which consists of modules that couple two conventional x86 out-of-order cores sharing the early pipeline stages, the floating point unit, and the L2 cache with the rest of the module [3].

Some other benchmarks, however, did not scale well or experienced significant slowdowns when run on many VM schedulers (threads). For example, the ets_test benchmark has multiple processes accessing a shared ETS table. Figure 6 shows runtime and speedup curves for ets_test on a 16-core (eight cores with hyperthreading) Intel Xeon-based machine. It shows that runtime increases beyond two schedulers, and that the program exhibits a slowdown instead of a speedup.

For many benchmarks there are obvious reasons for poor scaling, like limited parallelism in the application or contention for shared resources. The reasons for poor scaling are less obvious for other benchmarks, and it is exactly these that we have chosen to study in detail in subsequent work [8, 46].

A simple example is the parallel BenchErl benchmark, which spawns some n processes, each of which creates a list of m timestamps and, after checking that each timestamp in the list is strictly greater than the previous one, sends the result to its parent. Figure 7 shows that up to eight cores each additional core leads to a slowdown; thereafter a small speedup is obtained up to 32 cores, and then again a slowdown. A small aspect of the benchmark, easily overlooked, explains the poor scalability: the benchmark creates timestamps using the erlang:now/0 built-in function, whose implementation acquires a global lock in order to return a unique timestamp. That is, two calls to erlang:now/0, even from different processes,

Figure 6: Runtime and speedup of the ets_test BenchErl benchmark using Erlang/OTP R15B01.

are guaranteed to produce monotonically increasing values. This lock is precisely the bottleneck in the VM that limits the scalability of this benchmark. We describe our work to address VM timing issues in Section iii.
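Applications that call erlang:now/0 only to obtain unique, increasing values can avoid this lock entirely; a minimal sketch using the lock-free primitives introduced in Erlang/OTP 18 (discussed in Section iii):

    %% A unique, monotonically increasing timestamp without global locking:
    %% pair a monotonic clock reading with a strictly monotonic integer.
    unique_ts() ->
        {erlang:monotonic_time(), erlang:unique_integer([monotonic])}.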

Discussion. Our investigations identified contention for shared ETS tables, and for commonly-used shared resources like timers, as the key VM-level scalability issues. Section VI outlines how we addressed these issues in recent Erlang/OTP releases.

ii Distributed Erlang Scalability

Network Connectivity Costs. When any normal distributed Erlang nodes communicate, they share their connection sets, and this typically leads to a fully connected graph of nodes.


Figure 7: Runtime and speedup of the BenchErl benchmark called parallel using Erlang/OTP R15B01.

So a system with n nodes will maintain O(n²) connections, and these are relatively expensive TCP connections with continual maintenance traffic. This design aids transparent distribution, as there is no need to discover nodes, and it works well for small numbers of nodes. However, at emergent server architecture scales, i.e. hundreds of nodes, this design becomes very expensive, and system architects must switch from the default Erlang model, e.g. they need to start using hidden nodes that do not share connection sets.

We have investigated the scalability limits imposed by network connectivity costs using several Orbit calculations on two large clusters, Kalkyl and Athos, as specified in Appendix A. The Kalkyl results are discussed by [21], and Figure 29 in Section i shows representative results for distributed Erlang computing orbits with 2M and 5M elements on Athos. In all cases performance degrades beyond some scale (40 nodes for the 5M orbit, and 140 nodes for the 2M orbit). Figure 32 illustrates the additional network traffic induced by the fully connected network: it allows a comparison between the number of packets sent in a fully connected network (ML-ACO) with those sent in a network partitioned using our new s_groups (SR-ACO).

Global Information Costs. Maintaining global information is known to limit the scalability of distributed systems, and crucially the process namespaces used for reliability are global. To investigate the scalability limits imposed on distributed Erlang by such global information, we have designed and implemented DE-Bench, an open source, parameterisable, and scalable peer-to-peer benchmarking framework [37, 36]. DE-Bench measures the throughput and latency of distributed Erlang commands on a cluster of Erlang nodes, and the design is influenced by the Basho Bench benchmarking tool for Riak [12]. Each DE-Bench instance acts as a peer, providing scalability and reliability by eliminating central coordination and any single point of failure.

To evaluate the scalability of distributed Erlang, we measure how adding more hosts increases the throughput, i.e. the total number of successfully executed distributed Erlang commands per experiment. Figure 8 shows the parameterisable internal workflow of DE-Bench. There are three classes of commands in DE-Bench: (i) Point-to-Point (P2P) commands, where a function with tunable argument size and computation time is run on a remote node; these include spawn, rpc, and synchronous calls to server processes, i.e. gen_server or gen_fsm. (ii) Global commands, which entail synchronisation across all connected nodes, such as global:register_name and global:unregister_name. (iii) Local commands, which are executed independently by a single node, e.g. whereis_name, a look-up in the local name table.

Figure 8: DE-Bench's internal workflow.
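A sketch of the three command classes as plain Erlang calls (rpc:call/4, global:register_name/2 and global:whereis_name/1 are the real OTP functions; the name and payload are illustrative):

    %% (i) P2P: run a function on a remote node.
    p2p(Node) -> rpc:call(Node, erlang, node, []).

    %% (ii) Global: registration synchronises with every connected node.
    global_cmd(Pid) -> global:register_name(my_proc, Pid).

    %% (iii) Local: a lookup in the node-local name table.
    local_cmd() -> global:whereis_name(my_proc).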

The benchmarking is conducted on 10 to 100 host configurations of the Kalkyl cluster (in steps of 10), and measures the throughput of successful commands per second over 5 minutes. There is one Erlang VM on each host and one DE-Bench instance on each VM. The full paper [37] investigates the impact of data size and computation time in P2P calls, both independently and in combination, and the scaling properties of the common Erlang/OTP generic server processes gen_server and gen_fsm.

Here we focus on the impact of different proportions of global commands, mixing global with P2P and local commands. Figure 9 shows that even a low proportion of global commands limits the scalability of distributed Erlang, e.g. just 0.01% global commands limits scalability to around 60 nodes. Figure 10 reports the latency of all commands, and shows that while the latencies for P2P and local commands are stable at scale, the latency of the global commands increases dramatically with scale. Both results illustrate that the impact of global operations on throughput and latency in a distributed Erlang system is severe.

Explicit Placement. While network connectivity and global information impact performance at scale, our investigations also identified explicit process placement as a programming issue at scale. Recall from Section iii that distributed Erlang requires the programmer to identify an explicit Erlang node (VM) when spawning a process. Identifying an appropriate node becomes a significant burden for large and dynamic systems. The problem is exacerbated in large distributed systems where (1) the hosts may not be identical, having different hardware capabilities or different software installed, and (2) communication times may be non-uniform: it may be fast to send a message between VMs on the same host, and slow if the VMs are on different hosts in a large distributed system.

These factors make it difficult to deploy applications, especially in a scalable and portable manner. Moreover, while the programmer may be able to use platform-specific knowledge to decide where to spawn processes so that an application runs efficiently, if the application is then deployed on a different platform, or if the platform changes as hosts fail or are added, this knowledge becomes outdated.

Figure 9: Scalability vs percentage of global commands in Distributed Erlang.

Discussion. Our investigations confirm three language-level scalability limitations of Erlang from developer folklore: (1) maintaining a fully connected network of Erlang nodes limits scalability, with Orbit, for example, typically limited to just 40 nodes; (2) global operations, and crucially the global operations required for reliability, i.e. to maintain a global namespace, seriously limit the scalability of distributed Erlang systems; (3) explicit process placement makes it hard to build performance-portable applications for large architectures. These issues cause designers of reliable large-scale systems in Erlang to depart from the standard Erlang model, e.g. using techniques like hidden nodes and storing pids in data structures. In Section V we develop language technologies to address these issues.

iii Persistent Storage

Any large scale system needs reliable scalable persistent storage, and we have studied the scalability limits of Erlang persistent storage alternatives [38]. We envisage a typical large server having around 10^5 cores on around 100 hosts. We have reviewed the requirements for scalable and available persistent storage, and evaluated four popular Erlang DBMS against these requirements. For a target scale of around 100 hosts, Mnesia and CouchDB are, unsurprisingly, not suitable. However, Dynamo-style NoSQL DBMS like Cassandra and Riak have the potential to be.

We have investigated the current scalability limits of the Riak NoSQL DBMS using the Basho Bench benchmarking framework on a cluster with up to 100 nodes and independent disks. We found that the scalability limit of Riak version 1.1.1 is 60 nodes on the Kalkyl cluster. The study placed into the public scientific domain what was previously well-evidenced but anecdotal developer experience.

We have also shown that resources like memory, disk, and network do not limit the scalability of Riak. By instrumenting the global and gen_server OTP libraries, we identified a specific Riak remote procedure call that fails to scale. We outline how later releases of Riak are refactored to eliminate the scalability bottlenecks.


Figure 10: Latency of commands as the number of Erlang nodes increases.

Discussion. We conclude that Dynamo-like NoSQL DBMSs have the potential to deliver reliable persistent storage for Erlang at our target scale of approximately 100 hosts. Specifically, an Erlang Cassandra interface is available, and Riak 1.1.1 already provides scalable and available persistent storage on 60 nodes. Moreover, the scalability of Riak is much improved in subsequent versions.

V Improving Language Scalability

This section outlines the Scalable Distributed (SD) Erlang libraries [21] that we have designed and implemented to address the distributed Erlang scalability issues identified in Section ii. SD Erlang introduces two concepts to improve scalability. S_groups partition the set of nodes in an Erlang system to reduce network connectivity and partition global data (Section i.1). Semi-explicit placement alleviates the issues of explicit process placement in large heterogeneous networks (Section i.2). The two features are independent and can be used separately or in combination. We overview SD Erlang in Section i, and outline s_group semantics and validation in Sections ii and iii respectively.

i SD Erlang Design

i.1 S_groups

S_groups reduce both the number of connections a node maintains and the size of name spaces, i.e. they minimise global information. Specifically, names are registered on, and synchronised between, only the nodes within the s_group. An s_group has the following parameters: a name, a list of nodes, and a list of registered names. A node can belong to many s_groups or to none. If a node belongs to no s_group, it behaves as a usual distributed Erlang node.

The s_group library defines the functions shown in Table 1. Some of these functions manipulate s_groups and provide information about them, such as creating s_groups and providing a list of nodes from a given s_group. The remaining functions manipulate names registered in s_groups and provide information about these names.


Table 1: Summary of s_group functions.

    new_s_group(SGroupName, Nodes): creates a new s_group consisting of some nodes.
    delete_s_group(SGroupName): deletes an s_group.
    add_nodes(SGroupName, Nodes): adds a list of nodes to an s_group.
    remove_nodes(SGroupName, Nodes): removes a list of nodes from an s_group.
    s_groups(): returns a list of all s_groups known to the node.
    own_s_groups(): returns a list of s_group tuples the node belongs to.
    own_nodes(): returns a list of nodes from all s_groups the node belongs to.
    own_nodes(SGroupName): returns a list of nodes from the given s_group.
    info(): returns s_group state information.
    register_name(SGroupName, Name, Pid): registers a name in the given s_group.
    re_register_name(SGroupName, Name, Pid): re-registers a name (changes a registration) in a given s_group.
    unregister_name(SGroupName, Name): unregisters a name in the given s_group.
    registered_names({node, Node}): returns a list of all registered names on the given node.
    registered_names({s_group, SGroupName}): returns a list of all registered names in the given s_group.
    whereis_name(SGroupName, Name), whereis_name(Node, SGroupName, Name): return the pid of a name registered in the given s_group.
    send(SGroupName, Name, Msg), send(Node, SGroupName, Name, Msg): send a message to a name registered in the given s_group.

For example, to register a process Pid with name Name in s_group SGroupName we use the following function; the name will only be registered if the process is being executed on a node that belongs to the given s_group, and neither Name nor Pid is already registered in that group:

    s_group:register_name(SGroupName, Name, Pid) -> yes | no
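For instance, a process can be registered in, and looked up from, an s_group as follows (a minimal sketch; the group and process names are illustrative, and the calling node is assumed to belong to group1):

    Pid = spawn(fun() -> receive stop -> ok end end),
    yes = s_group:register_name(group1, db_server, Pid),
    Pid = s_group:whereis_name(group1, db_server).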

To illustrate the impact of s_groups on scalability we repeat the global operations experiment from Section ii (Figure 9). In the SD Erlang experiment we partition the set of nodes into s_groups, each containing ten nodes, and hence the names are replicated and synchronised on just ten nodes, rather than on all nodes as in distributed Erlang. The results in Figure 11 show that with 0.01% of global operations the throughput of distributed Erlang stops growing at 40 nodes, while the throughput of SD Erlang continues to grow linearly.

The connection topology of s_groups is extremely flexible: they may be organised into a hierarchy of arbitrary depth or branching, e.g. there could be multiple levels in the tree


Figure 11: Global operations in Distributed Erlang vs SD Erlang.

of s_groups (see Figure 12). Moreover, it is not necessary to create a hierarchy of s_groups; for example, we have constructed an Orbit implementation using a ring of s_groups.

Given such a flexible way of organising distributed systems, key questions in the design of an SD Erlang system are the following. How should s_groups be structured? Depending on the reason the nodes are grouped (reducing the number of connections, reducing the namespace, or both), s_groups can be freely structured as a tree, ring, or some other topology. How large should the s_groups be? Smaller s_groups mean more inter-group communication, but the synchronisation of the s_group state between the s_group nodes constrains the maximum size of s_groups. We have not found this constraint to be a serious restriction: for example, many s_groups are either relatively small, e.g. 10-node, internal or terminal elements in some topology, e.g. the leaves and nodes of a tree. How do nodes from different s_groups communicate? While any two nodes can communicate in an SD Erlang system, to minimise the number of connections communication between nodes from different s_groups is typically routed via gateway nodes that belong to both s_groups. How do we avoid single points of failure? For reliability, and to minimise communication load, multiple gateway nodes and processes may be required.

Information to make these design choices is provided by the tools in Section VII and by benchmarking. A further challenge is how to systematically refactor a distributed Erlang application into SD Erlang, and this is outlined in Section i. A detailed discussion of distributed system design and refactoring in SD Erlang is provided in a recent article [22].

We illustrate typical SD Erlang system designs by showing refactorings of the Orbit and ACO benchmarks from Section III. In both distributed and SD Erlang the computation starts on the Master node, and the actual computation is done on the Worker nodes. In the distributed Erlang version all nodes are interconnected, and messages are transferred directly from the sending node to the receiving node (Figure 2). In contrast, in the SD Erlang version nodes are grouped into s_groups, and messages are transferred between different s_groups via Sub-master nodes (Figure 13).


Figure 12: SD Erlang ACO (SR-ACO) architecture.

A fragment of code that creates an s_group on a Sub-master node is as follows:

    create_s_group(Master, GroupName, Nodes0) ->
        case s_group:new_s_group(GroupName, Nodes0) of
            {ok, GroupName, Nodes} ->
                {Master, GroupName, Nodes};
            _ -> io:format("exception: message")
        end.

Similarly, we introduce s_groups in the GR-ACO benchmark from Section ii to create Scalable Reliable ACO (SR-ACO); see Figure 12. Here, apart from reducing the number of connections, s_groups also reduce the global namespace information. That is, instead of registering the name of a pid globally, i.e. with all nodes, the name is registered only on the nodes of the s_group, with s_group:register_name/3.

A comparative performance evaluation of distributed Erlang and SD Erlang Orbit and ACO is presented in Section VIII.

i.2 Semi-Explicit Placement

Recall from Section iii that distributed Erlang spawns a process onto an explicitly named Erlang node, e.g.:

    spawn(some_node, fun some_module:pong/0)

and also recall the portability and programming effort issues associated with such explicit placement in large scale systems, discussed in Section ii.

To address these issues we have developed a semi-explicit placement library that enables the programmer to select nodes on which to spawn processes based on run-time information about the properties of the nodes. For example, if a process performs a lot of computation one would like to spawn it on a node with considerable computation power, or if two processes are likely to communicate frequently then it would be desirable to spawn them on the same node, or on nodes with a fast interconnect.

We have implemented two Erlang libraries to support semi-explicit placement [60]. The first deals with node attributes, and describes properties of individual Erlang VMs and associated hosts, such as total and currently available RAM, installed software, hardware configuration, etc. The second deals with a notion of communication distances, which models the communication times between nodes in a distributed system. Therefore, instead of specifying a node, we can use the attr:choose_node/1 function to define the target node, i.e.:

    spawn(attr:choose_node(Params), fun some_module:pong/0)

Figure 13: SD Erlang (SD-Orbit) architecture.

[60] report an investigation into the communication latencies on a range of NUMA and cluster architectures, and demonstrate the effectiveness of the placement libraries using the ML-ACO benchmark on the Athos cluster.
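For example, one might constrain the choice of target node by attribute before spawning; a sketch (attr:choose_node/1 is the library entry point above, but the parameter format shown is hypothetical; see [60] for the actual API):

    %% Hypothetical constraints: at least 1 GiB of free RAM, and the
    %% node nearest to the caller in communication distance.
    Node = attr:choose_node([{ram_free, '>=', 16#40000000},
                             {distance, nearest}]),
    spawn(Node, fun some_module:pong/0).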

ii S_group Semantics

For precise specification, and as a basis for validation, we provide a small-step operational semantics of the s_group operations [21]. Figure 14 defines the state of an SD Erlang system and associated abstract syntax variables. The abstract syntax variables on the left are defined as members of sets, denoted {...}, and these in turn may contain tuples, denoted (...), or further sets. In particular nm is a process name, p a pid, ni a node_id, and nis a set of node_ids. The state of a system is modelled as a four-tuple comprising a set of s_groups, a set of free_groups, a set of free_hidden_groups, and a set of nodes. Each type of group is associated with nodes and has a namespace. An s_group additionally has a name, whereas a free_hidden_group consists of only one node; i.e. a hidden node simultaneously acts as a node and as a group, because as a group it has a namespace but does not share it with any other node. Free normal and hidden groups have no names, and are uniquely defined by the nodes associated with them. Therefore group names, gr_names, are either NoGroup or a set of s_group_names. A namespace is a set of (name, pid) pairs, and is replicated on all nodes of the associated group.

A node has the following four parameters: a node_id identifier; a node_type that can be either hidden or normal; connections; and group_names, i.e. the names of the groups the node belongs to. The node can belong to either a list of s_groups or one of the free groups; the type of the free group is defined by the node type. Connections are a set of node_ids.

Transitions in the semantics have the form (state, command, ni) → (state′, value), meaning that executing command on node ni in state returns value and transitions to state′.

The semantics is presented in more detail by [21], but we illustrate it here with the s_group:registered_names/1 function from Section i.1. The function returns a list of names


    (grs, fgs, fhs, nds) ∈ state ≡ ({s_group}, {free_group}, {free_hidden_group}, {node})
    gr ∈ grs ≡ s_group ≡ (s_group_name, {node_id}, namespace)
    fg ∈ fgs ≡ free_group ≡ ({node_id}, namespace)
    fh ∈ fhs ≡ free_hidden_group ≡ (node_id, namespace)
    nd ∈ nds ≡ node ≡ (node_id, node_type, connections, gr_names)
    gs ∈ gr_names ≡ NoGroup | {s_group_name}
    ns ∈ namespace ≡ {(name, pid)}
    cs ∈ connections ≡ {node_id}
    nt ∈ node_type ≡ Normal | Hidden
    s ∈ NoGroup ∪ {s_group_name}

Figure 14: SD Erlang state [21].

registered in s_group s if node ni belongs to the s_group, and an empty list otherwise (Figure 15). Here ⊕ denotes disjoint set union, IsSGroupNode returns true if node ni is a member of some s_group s and false otherwise, and OutputNms returns the set of process names registered in the ns namespace of s_group s.

iii Semantics Validation

As the semantics is concrete, it can readily be made executable in Erlang, with lists replacing sets throughout. Having an executable semantics allows users to engage with it, and to understand how the semantics behaves vis à vis the library, giving them an opportunity to assess the correctness of the library against the semantics.

Figure 16: Testing s_groups using QuickCheck.

Better still, we can automatically assess how the system behaves in comparison with the (executable) semantics by executing them in lockstep, guided by the constraints of which operations are possible at each point. We do that by building an abstract state machine model of the library. We can then generate random sequences (or traces) through the model, with appropriate library data generated too. This random generation is supported by the QuickCheck property-based testing system [24, 9].

The architecture of the testing framework is shown in Figure 16. First, an abstract state machine embedded as an "eqc_statem" module is derived from the executable semantic specification. The state machine defines the abstract state representation and the transition from one state to another when an operation is applied. Test case and data generators are then defined to control the test case generation;


    ((grs, fgs, fhs, nds), s_group:registered_names(s), ni)
        → ((grs, fgs, fhs, nds), nms)    if IsSGroupNode(ni, s, grs)
        → ((grs, fgs, fhs, nds), {})     otherwise
    where {(s, {ni} ⊕ nis, ns)} ⊕ grs′ ≡ grs
          nms ≡ OutputNms(s, ns)

    IsSGroupNode(ni, s, grs) = ∃ nis, ns, grs′ . {(s, {ni} ⊕ nis, ns)} ⊕ grs′ ≡ grs
    OutputNms(s, ns) = {(s, nm) | (nm, p) ∈ ns}

Figure 15: SD Erlang semantics of the s_group:registered_names/1 function.

this includes the automatic generation of eligible s_group operations and the input data to those operations. Test oracles are encoded as the postcondition for s_group operations.

During testing, each test command is applied to both the abstract model and the s_group library. The application of the test command to the abstract model takes the abstract model from its current state to a new state, as described by the transition functions, whereas the application of the test command to the library leads the system to a new actual state. The actual state information is collected from each node in the distributed system, then merged and normalised to the same format as the abstract state representation. For a test to be successful, after the execution of a test command the test oracles specified for this command should be satisfied. Various test oracles can be defined for s_group operations; for instance, one of the generic constraints that applies to all the s_group operations is that after each s_group operation the normalised system state should be equivalent to the abstract state.
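A skeleton of such a model, assuming Quviq QuickCheck's eqc_statem behaviour (the callback names are the standard eqc_statem API; the abstract state representation and generators are illustrative):

    -module(s_group_eqc).
    -include_lib("eqc/include/eqc.hrl").
    -include_lib("eqc/include/eqc_statem.hrl").
    -compile(export_all).

    %% Abstract state: the s_groups the model knows about.
    initial_state() -> [].

    %% Generate eligible s_group operations with generated input data.
    command(_S) ->
        oneof([{call, s_group, new_s_group, [group_name(), nodes_gen()]},
               {call, s_group, s_groups, []}]).

    precondition(_S, _Call) -> true.

    %% Transition function: how an operation changes the abstract state.
    next_state(S, _Res, {call, s_group, new_s_group, [Name, Nodes]}) ->
        [{Name, Nodes} | S];
    next_state(S, _Res, _Call) -> S.

    %% Test oracle: the actual result and state must agree with the model.
    postcondition(_S, _Call, _Res) -> true.

    group_name() -> elements([g1, g2, g3]).
    nodes_gen()  -> list(elements(['n1@host', 'n2@host'])).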

Thousands of tests were run, and three kinds of errors (all subsequently corrected) were found. Some errors in the library implementation were found, including one error due to the synchronisation between nodes, and another related to the remove_nodes operation, which erroneously raised an exception. We also found a couple of trivial errors in the semantic specification itself, which had been missed by manual examination. Finally, we found some situations where there were inconsistencies between the semantics and the library implementation despite their states being equivalent; an example of this was in the particular values returned by functions on certain errors. Overall, the automation of testing boosted our confidence in the correctness of the library implementation and the semantic specification. This work is reported in more detail by [54].

VI Improving VM Scalability

This section reports the primary VM and library improvements we have designed and implemented to address the scalability and reliability issues identified in Section i.


i Improvements to Erlang Term Storage

Because ETS tables are so heavily used in Erlang systems, they are a focus for scalability improvements. We start by describing their redesign, including some improvements that pre-date our RELEASE project work, i.e. those prior to Erlang/OTP R15B03. These historical improvements are very relevant for a scalability study, and form the basis for our subsequent changes and improvements. At the point when Erlang/OTP got support for multiple cores (in release R11B) there was a single reader-writer lock for each ETS table. Optional fine grained locking of hash-based ETS tables (i.e. set, bag, or duplicate_bag tables) was introduced in Erlang/OTP R13B02-1, adding 16 reader-writer locks for the hash buckets. Reader groups to minimise read synchronisation overheads were introduced in Erlang/OTP R14B. The key observation is that a single count of the multiple readers must be synchronised across many cache lines, potentially far away in a NUMA system. Maintaining reader counts in multiple (local) caches makes reads fast, although writes must now check every reader count. In Erlang/OTP R16B the number of bucket locks and the default number of reader groups were both upgraded from 16 to 64.

We illustrate the scaling properties of the ETS concurrency options using the ets_bench BenchErl benchmark on the Intel NUMA machine with 32 hyperthreaded cores specified in Appendix A. The ets_bench benchmark inserts 1M items into the table, then records the time to perform 17M operations, where an operation is either a lookup, an insert, or a delete. The experiments are conducted on a hash-based (set) ETS table with different percentages of update operations, i.e. insertions or deletions.
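The concurrency options involved are chosen when a table is created; a minimal sketch (read_concurrency and write_concurrency are the real ETS options; the table name is illustrative):

    %% A hash-based table with fine-grained bucket locking and
    %% reader groups enabled.
    T = ets:new(bench_table, [set, public,
                              {write_concurrency, true},
                              {read_concurrency, true}]),
    true = ets:insert(T, {key1, 0}),
    [{key1, 0}] = ets:lookup(T, key1).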

Figure 17 shows the runtimes in seconds of 17M operations in different Erlang/OTP versions, varying the number of schedulers (x-axis), reflecting how the scalability of ETS tables has improved in more recent Erlang/OTP releases.

Figure 17: Runtime of 17M ETS operations in different Erlang/OTP releases: 10% updates (left) and 1% updates (right). The y-axis shows time in seconds (lower is better) and the x-axis is the number of OS threads (VM schedulers).

Figure 18 shows the runtimes in seconds of 17M operations on an ETS table with different numbers of reader groups, again varying the number of schedulers. We see that one reader group is not sufficient with 10% updates, nor are two with 1% updates. Beyond that, different numbers of reader groups have little impact on the benchmark performance, except that using 64 groups with 10% updates slightly degrades performance.

We have explored four other extensions or redesigns of the ETS implementation for better scalability: 1. allowing more programmer control over the number of bucket locks in hash-based tables, so the programmer can reflect the number of schedulers and the expected


Figure 18: Runtime of 17M ETS operations with varying numbers of reader groups: 10% updates (left) and 1% updates (right). The y-axis shows runtime in seconds (lower is better) and the x-axis is the number of schedulers.

access pattern; 2. using contention-adapting trees to get better scalability for ordered_set ETS tables, as described by [69]; 3. using queue delegation locking to improve scalability [47]; 4. adopting schemes for completely eliminating the locks in the meta table. A more complete discussion of our work on ETS can be found in the papers by [69] and [47].

Here we outline only our work on contention-adapting (CA) trees. A CA tree monitors contention in different parts of a tree-shaped data structure, introducing routing nodes with locks in response to high contention, and removing them in response to low contention. For experimental purposes, two variants of the CA tree have been implemented to represent ordered_sets in the virtual

Figure 19: Throughput of CA tree variants: 10% updates (left) and 1% updates (right).

machine of Erlang/OTP 17.0: one extends the existing AVL trees in the Erlang VM, and the other uses a Treap data structure [5]. Figure 19 compares the throughput of the CA tree variants with that of ordered_set and set as the number of schedulers increases. It is unsurprising that the CA trees scale so much better than an ordered_set, which is protected by a single readers-writer lock. It is more surprising that they also scale better than set. This is because hash tables use fine-grained locking at a fixed granularity, while CA trees can adapt the number of locks both to the current contention level and to the parts of the key range where contention is occurring.


ii Improvements to Schedulers

In the Erlang VM, a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until it either blocks waiting for input, typically a message from some other process, or until it has executed its quota of reductions.

The Erlang VM is usually started with one scheduler per logical core (SMT-thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned it is placed in the run queue of the scheduler of its parent, and it waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores, that is, an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Figure iii2.

The default load management mechanism is load compaction, which aims to keep as many scheduler threads as possible fully loaded with work, i.e. it attempts to ensure that scheduler threads do not run out of work. We have developed a new optional scheduler utilisation balancing mechanism that is available from Erlang/OTP 17.0. The new mechanism aims to balance scheduler utilisation between schedulers, that is, it will strive for equal scheduler utilisation on all schedulers.
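The mechanism is switched on with a VM flag when the node is started; a minimal sketch (+sub is the documented switch in Erlang/OTP 17.0 and later):

    $ erl +sub true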

The scheduler utilisation balancing mechanism has no performance impact on the system when not enabled. On the other hand, when enabled it results in changed timing in the system; normally there is a small overhead due to measuring of utilisation and calculating balancing information, which depends on the underlying primitives provided by the operating system.

The new balancing mechanism results in a better distribution of processes to schedulers, reducing the probability of core contention. Together with other VM improvements, such as interruptable BIFs and garbage collection, it results in lower latency and improved responsiveness, and hence reliability, for soft real-time applications.

iii Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel benchmark in Section i. The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in that returns "Erlang system time", i.e. time since Epoch with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, i.e. not SMT-enabled. The documentation promises that values returned by it are strictly increasing, and many applications ended up relying on this. For example, applications often employ erlang:now/0 to generate unique integers.

Erlang system time should align with the operating system's view of time since Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot, without invalidating the strictly increasing value


guarantee. The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when these two have deviated, e.g. in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], i.e. a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards and backwards, while system time is allowed to do this. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.
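The relationship is directly observable in the new API; a minimal sketch using the OTP 18 built-ins (the microsecond unit atom is that of recent OTP releases):

    %% Erlang system time = Erlang monotonic time + time offset.
    Mono   = erlang:monotonic_time(),
    Offset = erlang:time_offset(),
    SysUs  = erlang:convert_time_unit(Mono + Offset, native, microsecond).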

Time Retrieval. Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, monotonic time delivered by the operating system is solely based on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP). That is, they will run with different frequencies. Linux is an exception, with a monotonic clock that is NTP-adjusted and runs with the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing monotonic time and system time delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute. If the adjustment needs to be changed, new adjustment information is published and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the time adjustment information previously published, and calculates Erlang monotonic time. To preserve monotonicity, it is important that all threads that read the same OS monotonic time map this to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information, using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed. This means that in the vast majority of cases the RW lock will be read-locked, which allows multiple readers to run concurrently. To prevent bouncing the lock cache-line, we use a bespoke reader-optimised RW lock implementation, where reader threads notify about their presence in counters on separate cache-lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11]; alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].

Timer Wheel and BIF Timer. The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel, which is used by processes executing on that scheduler.

The implementation of timers in Erlang/OTP uses a built-in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS


tables, and only insert one timer at a time into the timer wheel.

Benchmarks. We have measured several benchmarks on a 16-core Bulldozer machine with eight dual CPU AMD Opteron 4376 HEs.5 We present three of them here.

The first micro benchmark compares the execution time of a plain Erlang receive with that of a receive with an after clause that specifies a timeout and provides a default value. The receive with after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, total execution time with the optimised timers is only 5% longer than without timers.
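The two patterns that the benchmark compares look as follows (a sketch; the timeout value is illustrative):

    %% Plain receive: no timer is involved.
    plain() -> receive Msg -> Msg end.

    %% receive ... after: a timer is set when the process blocks,
    %% and cancelled when a message arrives.
    timed() -> receive Msg -> Msg after 1000 -> default end.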

The second micro benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4, and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on the left shows that: 1. the performance of time management remained roughly unchanged between Erlang/OTP releases prior to 18.0; 2. the improved time management in Erlang/OTP 18.x makes time management less

5 See §2.5.4 of the RELEASE project Deliverable 2.4 (http://release-project.eu/documents/D24.pdf).


Figure 20: BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases: runtimes (left) and speedup obtained using erlang:monotonic_time/0 (right).

likely to be a scalability bottleneck, even when using erlang:now/0; and (3) the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution. The graph on the right-hand side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

VII. Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable,


to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool “ecosystem” consists of small stand-alone tools for tracing, profiling and debugging Erlang systems that can be used separately or together, as appropriate for solving the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on or specialise the services offered by the Erlang virtual machine through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer⁶ application gives a comprehensive overview of many of these data on a node-by-node basis.
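As a minimal illustration of the tracing infrastructure such tools build on (our sketch; real tools use richer trace flags and the trace tool builder):

    %% Ask the VM to report every message sent by Pid, then print
    %% the first N trace events delivered to this (tracer) process.
    trace_sends(Pid, N) ->
        erlang:trace(Pid, true, [send]),
        print_trace_events(N).

    print_trace_events(0) -> ok;
    print_trace_events(N) ->
        receive
            Event ->
                io:format("~p~n", [Event]),
                print_trace_events(N - 1)
        end.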

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature; that is, the tools support the language itself, rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use the tracing support provided by the Erlang VM, other actor frameworks, e.g. Akka, can use the Kamon⁷ JVM monitoring system. Similarly, tools for other actor languages or frameworks could use data derived through OS-level

⁶ http://www.erlang.org/doc/apps/observer
⁷ http://kamon.io

tracing frameworks, DTrace⁸ and SystemTap⁹ probes, as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

i. Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare it for modification or extension, or, as is the case here, in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support in a refactoring tool. There are a number of tools that support refactoring in Erlang: in the RELEASE project we have chosen to extend Wrangler¹⁰ [56]; other tools include Tidier [68] and RefactorErl [44].

Supporting API Migration. The SD Erlang libraries modify Erlang's global_group library, which becomes the new s_group library; as a result, Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolves and this often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be

⁸ http://dtrace.org/blogs/about
⁹ https://sourceware.org/systemtap/wiki
¹⁰ http://www.cs.kent.ac.uk/projects/wrangler


supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically. The refactoring works by generating a set of rules that “fold in” the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work and the technicalities of the implementation can be found in a paper by [52].
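To make the idea concrete, an adaptor module for the global_group to s_group migration might look like the sketch below; the s_group calls shown are illustrative assumptions rather than the actual SD Erlang API:

    -module(global_group_adaptor).
    -export([registered_names/1, send/2]).

    %% Each function bears the name and arity of an old global_group
    %% API function and expresses the old behaviour in terms of the
    %% new API; Wrangler generates the client refactoring from it.
    registered_names(Arg) ->
        s_group:registered_names(Arg).              % hypothetical call
    send(Name, Msg) ->
        s_group:send(default_group, Name, Msg).     % hypothetical call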

Support for Introducing Parallelism. We have introduced support for parallelising explicit list operations (map and foreach); for process introduction, to complete a computationally intensive task in parallel; for introducing a worker process to deal with call handling in an Erlang “generic server”; and for parallelising a tail recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach. The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward: even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
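The essence of a parallel map is only a few lines of Erlang; the following is our sketch of the idea, not para_lib's actual implementation, and for simplicity spawns one process per list element:

    pmap(F, Xs) ->
        Parent = self(),
        Refs = [begin
                    Ref = make_ref(),
                    spawn_link(fun() -> Parent ! {Ref, F(X)} end),
                    Ref
                end || X <- Xs],
        %% Collect the results; matching on each Ref in turn
        %% preserves the order of the input list.
        [receive {Ref, Res} -> Res end || Ref <- Refs].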

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute

a task in parallel with its parent process. The result of the new process is sent back to the parent process, which will then consume it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.
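The shape of the code produced is roughly as follows (our illustration, with heavy_task/1, other_work/0 and combine/2 hypothetical; Wrangler's actual output differs in detail):

    f(Input) ->
        Parent = self(),
        Pid = spawn(fun() -> Parent ! {self(), heavy_task(Input)} end),
        Other = other_work(),       % independent work is not blocked
        %% The receive is placed immediately before the point where
        %% the result is needed.
        receive
            {Pid, Result} -> combine(Other, Result)
        end.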

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies. For instance, an example might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation it is possible to “float out” some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are checked by program slicing, which is discussed below.

Support for Program Slicing. Program slicing is a general technique of program analysis for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE¹¹ [17], a tool developed in another EU project that also re-uses the Wrangler front end. This work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing in transforming programs to make them suitable for the introduction of parallel structures.

¹¹ http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf



Figure 21: The architecture of WombatOAM.

ii. Scalable Deployment

We have developed the WombatOAM tool¹² to provide a deployment, operations and maintenance framework for large-scale distributed Erlang systems. These systems typically consist of a number of Erlang nodes executing on different hosts. These hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP.

¹² WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd.

Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this is the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture. The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability because of the role played by the central Master node; in its current


version an additional layer of Middle Managers was introduced, to allow the system to scale easily to thousands of deployed nodes. As the diagram shows, the “northbound” interfaces to the web dashboard and the command-line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms and so forth; we describe those now.

Services. WombatOAM is designed to collect, store and display various kinds of information and events from running Erlang systems, and these data are accessed and managed through an AJAX-based Web Dashboard; they include the following.

Metrics: WombatOAM supports the collection of some hundred metrics, including, for instance, the number of processes on a node and the message traffic in and out of the node, on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom¹³, and interface with other tools, such as Graphite¹⁴, which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications: As well as metrics, WombatOAM supports event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which is part of the standard

¹³ Folsom collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom
¹⁴ https://graphiteapp.org

distribution, or the lager logging framework¹⁵, and will be displayed and logged as they occur.

Alarms: Alarms are more complex entities. Alarms have a state: they can be raised and, once dealt with, they can be cleared. They also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as for notifications.

Topology: The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible; if not, it notifies the other services and periodically tries to reconnect. When the nodes are available again it also notifies the other services. It doesn't have a middle manager part, because it doesn't talk to the nodes directly; instead it asks the Node manager service to do so.

Node manager: This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration: This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers, the Orchestration

¹⁵ https://github.com/basho/lager


service uses an external library called Libcloud¹⁶, for which Erlang Solutions has written an open source Erlang wrapper, called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications: it provides infrastructure for deploying them.

Deployment. The mechanism consists of the following five steps.

1. Registering a provider. WombatOAM provides the same interface for different cloud providers that support the OpenStack standard or the Amazon EC2 API. WombatOAM also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this enables WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

¹⁶ The unified cloud API [4]: https://libcloud.apache.org

4. Defining a deployment domain. At this step a deployment domain is created, specifying (i) which providers should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to or removed from the system, depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii. Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and in adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single-host (“multicore”) and multi-host (“distributed”) nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue, and processes migrate between the queues through a work stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes.¹⁷
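The distributed primitive referred to here is the node-qualified spawn, e.g. (with worker_loop/0 a placeholder for the process body):

    %% Explicit placement: the programmer names the node on which
    %% the new process will run.
    start_worker(Node) ->
        spawn(Node, fun worker_loop/0).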

iii.1 Percept2

Percept2¹⁸ builds on the existing Percept tool to provide post hoc offline analysis and

¹⁷ This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.

¹⁸ https://github.com/RefactoringTools/percept2


Figure 22: Offline profiling of a distributed Erlang application using Percept2.

visualisation of Erlang systems. Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling for a distributed Erlang application using Percept2 is shown in Figure 22.

Percept generates an application-level zoomable concurrency graph showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways, as detailed by [53], including most importantly:

• Distinguishing between running and runnable time for each process: this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including: the migration history of a process between run queues; statistics about message passing between processes (the number of messages and the average message size sent/received by a process); the accumulated runtime per process; and the accumulated time when a process is in a running state.

• Presenting the process tree: the hierarchy


Figure 23: Percept2 showing processes running and runnable (with 4 schedulers/run queues).

structure indicating the parent-child relationships between processes.

• Recording dynamic function call graph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system. Unlike global group, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have also extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the “proc” flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating the measurement of a function's own execution time gives users the freedom of not profiling all the function calls invoked during the program execution. For example, they can choose to profile only functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of


generated web pages. Together these make the system more responsive and more scalable.

iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, and require neither specially recompiled versions of the software to be examined nor special post-processing tools to create meaningful information from the data gathered. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the sizes of the run queues vary during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the


Figure 24: DTrace run-queue visualisation of the bang benchmark with 16 schedulers (time on X-axis).

same storage infrastructure and presentation facilities.

iii.3 Devo

Devo¹⁹ [10] is designed to provide real-time online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).

Figure 25 shows visualisations from Devo in both modes. On the left-hand side a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram) with six cores each; with hyperthreading this gives twelve virtual cores per chip, and hence 24 run queues in total. The size of these run queues is shown by both the colour and height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle. A green arc shows

¹⁹ https://github.com/RefactoringTools/devo



Figure 25: Devo low- and high-level visualisations of SD-Orbit.

migration on the same physical core, a grey one migration on the same chip, and a blue one migration between the two chips.

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes

represents the current intensity of communication between the nodes (green: quiescent; red: busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. The tracing to be performed on monitored nodes is also specified within the configuration file.
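Such a configuration file might look like the following Erlang-term sketch; the keys and exact format here are hypothetical, for illustration only:

    %% Hosts and Erlang nodes of the target system.
    {hosts, ["host1.example.org", "host2.example.org"]}.
    {nodes, ['node1@host1', 'node2@host2', 'free@host2']}.
    %% s_group partitions; 'free@host2' belongs to no s_group.
    {s_groups, [{group1, ['node1@host1', 'node2@host2']}]}.
    %% Tracing to apply on every monitored node.
    {tracing, [send, 'receive', s_group_ops]}.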

An agent is started by a master SD-Mon node for each s_group and for each free node. The configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up it tries to get in contact with its nodes and to apply the tracing to them, as stated by the master. Binary files are stored in the host file system. Tracing is used internally in order to track s_group operations happening at runtime; an asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is


Figure 26: SD-Mon architecture.

deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

The monitoring network is also supervised, in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution, before (top) and after eliminating s_group 2 (bottom).


Figure 28: SD-Mon online monitoring.

VIII. Systemic Evaluation

Preceding sections have investigated improvements to individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI and VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state-of-the-art NUMA architectures and

the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable²⁰. The bulk of the experiments reported here are conducted on the Athos cluster, using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed-size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to the execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure.
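For reference, the two measures can be stated as follows (our formulation of the standard definitions, with $T(n)$ the execution time on $n$ cores):

\[ S(n) = \frac{T(1)}{T(n)} \quad \text{(relative speedup; strong scaling, ideal } S(n) = n\text{)} \]

\[ T(n) \approx T(1) \quad \text{(weak scaling, work proportional to } n\text{; ideal is constant runtime)} \]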

²⁰ See Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf



Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling].

The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

i. Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot the standard deviation. The results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups: beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails, because some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii. Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5,000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.
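The linearity is captured by the best fit shown in Figure 30, $T(N) \approx 0.0125N + 83$ seconds; as a quick check against the 10,000-node experiment:

\[ T(10000) \approx 0.0125 \times 10000 + 83 = 208\ \text{s}, \]

close to the measured 212s.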



Figure 30: Wombat Deployment Time (experimental data and best fit 0.0125N + 83).

The measurement data shows two important facts: it shows that WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and that WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of the effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, while SR-ACO registers them only within an s_group (Section ii).
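In code, the difference between the two variants is essentially the scope of one registration call; the s_group function below is our rendering of the SD Erlang registration API, so treat its name and arity as illustrative:

    %% GR-ACO: the name is replicated on, and kept consistent
    %% across, all connected nodes.
    register_gr(Name, Pid) ->
        yes = global:register_name(Name, Pid).

    %% SR-ACO: the name is visible only within the given s_group,
    %% so recovery data is maintained by a small group of nodes.
    register_sr(SGroup, Name, Pid) ->
        yes = s_group:register_name(SGroup, Name, Pid).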

For both GR-ACO and SR-ACO, a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common X86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, so around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark, and obtained similar results [23].
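A toy version of the failure-injection loop is a few lines of Erlang (our sketch in the spirit of the Chaos Monkey service used; the real service is more selective about eligible victims):

    %% Once a second, kill one randomly chosen process on this node.
    chaos_monkey() ->
        Pids = erlang:processes() -- [self()],
        Victim = lists:nth(rand:uniform(length(Pids)), Pids),
        exit(Victim, kill),
        timer:sleep(1000),
        chaos_monkey().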

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves



Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling].

the distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.

Scalability. Figure 31 compares the runtimes of the ML, GR and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks and

others that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii. Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang


Figure 32: Number of sent packets in ML-ACO, GR-ACO and SR-ACO.

preserves the distributed Erlang language-level reliability model.

As SD-Orbit scales better than D-Orbit, SR-ACO scales better than ML-ACO, and SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limits of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups,

and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX. Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language, while preserving reliability. The work is a vade mecum for addressing scalability in reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering VM, language and persistent storage levels. We have developed the


BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15], or Legion [40]. We establish scientifically the folklore limitation of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including the memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups, for partitioning the network and the global process namespace, and semi-explicit process placement, for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).

To improve the scalability of the Erlang VM and libraries we have improved the implementation of shared ETS tables, time management,

and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large-scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the


Table 2: Cluster Specifications (RAM per host in GB)

    Name     Hosts   Cores/host   Cores (max total)   Cores (max avail)   Processor                                   RAM      Interconnection
    GPG      20      16           320                 320                 Intel Xeon E5-2640v2, 8C, 2GHz              64       10GB Ethernet
    Kalkyl   348     8            2784                1408                Intel Xeon 5520v2, 4C, 2.26GHz              24–72    InfiniBand 20Gb/s
    TinTin   160     16           2560                2240                AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz    64–128   2:1 oversubscribed QDR InfiniBand
    Athos    776     24           18624               6144                Intel Xeon E5-2697v2, 12C, 2.7GHz           64       InfiniBand FDR14

key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large-scale servers. In addition,

preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages²¹. For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3 GHz, each with 16 “Bulldozer” cores, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

²¹ See Deliverable 6.7 (http://www.release-project.eu/documents/D67.pdf), available online.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 “SCIEnce: Symbolic Computing Infrastructure in Europe”, IST-2011-287510 “RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software”, and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 “HPC-GAP: High Performance Computational Algebra and Discrete Mathematics”.

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level visualization of concurrent and distributed computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riakdocs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking/.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos Monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördos, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Koszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördos, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go Programming Language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Technical Report, Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 1–1, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. In ACM SIGPLAN Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak Core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.

[52] Huiqing Li and Simon Thompson. Automated API migration in a user-extensible refactoring tool for Erlang programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved semantics and implementation through property-based testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe concurrency introduction through slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and KonstantinosSagonas On preserving term sharing inthe Erlang virtual machine In TorbenHoffman and John Hughes editors Pro-ceedings of the 11th ACM SIGPLAN ErlangWorkshop pages 11ndash20 ACM 2012

[67] RELEASE Project Team EU Framework7 Project 287510 (2011ndash2015) 2015 httpwwwrelease-projecteu

[68] Konstantinos Sagonas and ThanassisAvgerinos Automatic refactoring ofErlang programs In Principles and Prac-tice of Declarative Programming (PPDPrsquo09)pages 13ndash24 ACM 2009

[69] Konstantinos Sagonas and Kjell WinbladMore scalable ordered set for ETS usingadaptation In Proc ACM SIGPLAN Work-shop on Erlang pages 3ndash11 ACM Septem-ber 2014

[70] Konstantinos Sagonas and Kjell WinbladContention adapting search trees In Pro-ceedings of the 14th International Sympo-sium on Parallel and Distributed Computingpages 215ndash224 IEEE Computing Society2015

[71] Konstantinos Sagonas and Kjell WinbladEfficient support for range queries andrange updates using contention adaptingsearch trees In Languages and Compilersfor Parallel Computing - 28th InternationalWorkshop volume 9519 of LNCS pages37ndash53 Springer 2016

[72] Marc Snir Steve W Otto D WWalker Jack Dongarra and Steven Huss-Lederman MPI The Complete ReferenceMIT Press Cambridge MA USA 1995

[73] Sriram Srinivasan and Alan MycroftKilim Isolation-typed actors for Java InECOOPrsquo08 pages 104ndash128 Berlin Heidel-berg 2008 Springer-Verlag

[74] Don Syme Adam Granicz and AntonioCisternino Expert F 40 Springer 2015

[75] Simon Thompson and Huiqing Li Refac-toring tools for functional languages Jour-nal of Functional Programming 23(3) 2013

[76] Marcus Voumllker Linux timers May 2014

[77] WhatsApp 2015 httpswwwwhatsappcom

[78] Tom White Hadoop - The Definitive GuideStorage and Analysis at Internet Scale (3 edrevised and updated) OrsquoReilly 2012

[79] Ulf Wiger Industrial-Strength Functionalprogramming Experiences with the Er-icsson AXD301 Project In ImplementingFunctional Languages (IFLrsquo00) Aachen Ger-many September 2000 Springer-Verlag


creases less rapidly from that point on; we see a similar, but more clearly visible, pattern for the other configuration (red curve), where there is no performance improvement beyond 32 schedulers. This is due to the asymmetric characteristics of the machine's micro-architecture, which consists of modules that couple two conventional x86 out-of-order cores that share the early pipeline stages, the floating point unit, and the L2 cache with the rest of the module [3].

Some other benchmarks, however, did not scale well or experienced significant slowdowns when run on many VM schedulers (threads). For example, the ets_test benchmark has multiple processes accessing a shared ETS table. Figure 6 shows runtime and speedup curves for ets_test on a 16-core (eight cores with hyperthreading) Intel Xeon-based machine. It shows that runtime increases beyond two schedulers, and that the program exhibits a slowdown instead of a speedup.

For many benchmarks there are obvious reasons for poor scaling, like limited parallelism in the application, or contention for shared resources. The reasons for poor scaling are less obvious for other benchmarks, and it is exactly these that we have chosen to study in detail in subsequent work [8, 46].

A simple example is the parallel BenchErl benchmark that spawns some n processes, each of which creates a list of m timestamps and, after it checks that each timestamp in the list is strictly greater than the previous one, sends the result to its parent. Figure 7 shows that up to eight cores each additional core leads to a slowdown; thereafter a small speedup is obtained up to 32 cores, and then again a slowdown. A small aspect of the benchmark, easily overlooked, explains the poor scalability. The benchmark creates timestamps using the erlang:now/0 built-in function, whose implementation acquires a global lock in order to return a unique timestamp. That is, two calls to erlang:now/0, even from different processes,

Figure 6: Runtime and speedup of the ets_test BenchErl benchmark using Erlang/OTP R15B01 (runtime in ms and speedup vs. number of schedulers).

are guaranteed to produce monotonically increasing values. This lock is precisely the bottleneck in the VM that limits the scalability of this benchmark. We describe our work to address VM timing issues in Section VI.

Discussion: Our investigations identified contention for shared ETS tables and for commonly-used shared resources like timers as the key VM-level scalability issues. Section VI outlines how we addressed these issues in recent Erlang/OTP releases.

ii Distributed Erlang Scalability

Network Connectivity Costs: When any normal distributed Erlang nodes communicate, they share their connection sets, and this typically leads to a fully connected graph of nodes.


Figure 7: Runtime and speedup of the BenchErl benchmark called parallel using Erlang/OTP R15B01 (runtime in ms and speedup vs. number of schedulers).

So a system with n nodes will maintain O(n²) connections, and these are relatively expensive TCP connections with continual maintenance traffic. This design aids transparent distribution, as there is no need to discover nodes, and the design works well for small numbers of nodes. However, at emergent server architecture scales, i.e. hundreds of nodes, this design becomes very expensive, and system architects must switch from the default Erlang model, e.g. they need to start using hidden nodes that do not share connection sets.

We have investigated the scalability limits imposed by network connectivity costs using several Orbit calculations on two large clusters, Kalkyl and Athos, as specified in Appendix A. The Kalkyl results are discussed by [21], and Figure 29 in Section i shows representative results for distributed Erlang computing orbits with 2M and 5M elements on Athos. In all cases performance degrades beyond some scale (40 nodes for the 5M orbit, and 140 nodes for the 2M orbit). Figure 32 illustrates the additional network traffic induced by the fully connected network. It allows a comparison between the number of packets sent in a fully connected network (ML-ACO) with those sent in a network partitioned using our new s_groups (SR-ACO).

Global Information Costs: Maintaining global information is known to limit the scalability of distributed systems, and crucially the process namespaces used for reliability are global. To investigate the scalability limits imposed on distributed Erlang by such global information we have designed and implemented DE-Bench, an open source, parameterisable, and scalable peer-to-peer benchmarking framework [37, 36]. DE-Bench measures the throughput and latency of distributed Erlang commands on a cluster of Erlang nodes, and its design is influenced by the Basho Bench benchmarking tool for Riak [12]. Each DE-Bench instance acts as a peer, providing scalability and reliability by eliminating central coordination and any single point of failure.

Figure 8: DE-Bench's internal workflow.

To evaluate the scalability of distributed Erlang, we measure how adding more hosts increases the throughput, i.e. the total number of successfully executed distributed Erlang commands per experiment. Figure 8 shows the parameterisable internal workflow of DE-Bench. There are three classes of commands in DE-Bench: (i) Point-to-Point (P2P) commands, where a function with tunable argument size and computation time is run on a remote node; these include spawn, rpc, and synchronous calls to server processes, i.e. gen_server or gen_fsm. (ii) Global commands, which entail synchronisation across all connected nodes, such as global:register_name and global:unregister_name. (iii) Local commands, which are executed independently by a single node, e.g. whereis_name, a look-up in the local name table.
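For concreteness, the sketch below shows one representative command of each class in plain Erlang; the node, module, and name values are illustrative assumptions, not DE-Bench's actual generated workload.

    %% P2P command: run a function on a remote node (cf. spawn/rpc).
    p2p(Node) ->
        rpc:call(Node, lists, seq, [1, 100]).

    %% Global command: (un)registering a name synchronises all connected nodes.
    global_cmd(Pid) ->
        yes = global:register_name(bench_proc, Pid),
        _ = global:unregister_name(bench_proc).

    %% Local command: a look-up in the node-local name table only.
    local_cmd() ->
        global:whereis_name(bench_proc).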

The benchmarking is conducted on 10 to 100 host configurations of the Kalkyl cluster (in steps of 10), and measures the throughput of successful commands per second over 5 minutes. There is one Erlang VM on each host and one DE-Bench instance on each VM. The full paper [37] investigates the impact of data size and computation time in P2P calls, both independently and in combination, and the scaling properties of the common Erlang/OTP generic server processes gen_server and gen_fsm.

Here we focus on the impact of different proportions of global commands, mixing global with P2P and local commands. Figure 9 shows that even a low proportion of global commands limits the scalability of distributed Erlang, e.g. just 0.01% of global commands limits scalability to around 60 nodes. Figure 10 reports the latency of all commands, and shows that while the latencies for P2P and local commands are stable at scale, the latency of the global commands increases dramatically with scale. Both results illustrate that the impact of global operations on throughput and latency in a distributed Erlang system is severe.

Explicit Placement: While network connectivity and global information impact performance at scale, our investigations also identified explicit process placement as a programming issue at scale. Recall from Section iii that distributed Erlang requires the programmer to identify an explicit Erlang node (VM) when spawning a process. Identifying an appropriate node becomes a significant burden for large and dynamic systems. The problem is exacerbated in large distributed systems where (1) the hosts may not be identical, having different hardware capabilities or different software installed, and (2) communication times may be non-uniform: it may be fast to send a message between VMs on the same host, and slow if the VMs are on different hosts in a large distributed system.

These factors make it difficult to deploy applications, especially in a scalable and portable manner. Moreover, while the programmer may be able to use platform-specific knowledge to decide where to spawn processes so that an application runs efficiently, if the application is then deployed on a different platform, or if the platform changes as hosts fail or are added, this knowledge becomes outdated.

Figure 9: Scalability vs. percentage of global commands in distributed Erlang.

Discussion: Our investigations confirm three language-level scalability limitations of Erlang from developer folklore: (1) Maintaining a fully connected network of Erlang nodes limits scalability; for example, Orbit is typically limited to just 40 nodes. (2) Global operations, and crucially the global operations required for reliability, i.e. to maintain a global namespace, seriously limit the scalability of distributed Erlang systems. (3) Explicit process placement makes it hard to build performance portable applications for large architectures. These issues cause designers of reliable large scale systems in Erlang to depart from the standard Erlang model, e.g. using techniques like hidden nodes and storing pids in data structures. In Section V we develop language technologies to address these issues.

iii Persistent Storage

Any large scale system needs reliable scalable persistent storage, and we have studied the scalability limits of Erlang persistent storage alternatives [38]. We envisage a typical large server having around 10^5 cores on around 100 hosts. We have reviewed the requirements for scalable and available persistent storage, and evaluated four popular Erlang DBMS against these requirements. For a target scale of around 100 hosts, Mnesia and CouchDB are, unsurprisingly, not suitable. However, Dynamo-style NoSQL DBMS like Cassandra and Riak have the potential to be.

We have investigated the current scalability limits of the Riak NoSQL DBMS using the Basho Bench benchmarking framework on a cluster with up to 100 nodes and independent disks. We found that the scalability limit of Riak version 1.1.1 is 60 nodes on the Kalkyl cluster. The study placed into the public scientific domain what was previously well-evidenced but anecdotal developer experience.

We have also shown that resources like memory, disk, and network do not limit the scalability of Riak. By instrumenting the global and gen_server OTP libraries, we identified a specific Riak remote procedure call that fails to scale. We outline how later releases of Riak are refactored to eliminate the scalability bottlenecks.

Figure 10: Latency of commands as the number of Erlang nodes increases.

Discussion: We conclude that Dynamo-like NoSQL DBMSs have the potential to deliver reliable persistent storage for Erlang at our target scale of approximately 100 hosts. Specifically, an Erlang Cassandra interface is available, and Riak 1.1.1 already provides scalable and available persistent storage on 60 nodes. Moreover, the scalability of Riak is much improved in subsequent versions.

V Improving Language Scalability

This section outlines the Scalable Distributed (SD) Erlang libraries [21] we have designed and implemented to address the distributed Erlang scalability issues identified in Section ii. SD Erlang introduces two concepts to improve scalability. S_groups partition the set of nodes in an Erlang system to reduce network connectivity and partition global data (Section i.1). Semi-explicit placement alleviates the issues of explicit process placement in large heterogeneous networks (Section i.2). The two features are independent and can be used separately or in combination. We overview SD Erlang in Section i, and outline s_group semantics and validation in Sections ii and iii respectively.

i SD Erlang Design

i.1 S_groups

S_groups reduce both the number of connections a node maintains and the size of namespaces, i.e. they minimise global information. Specifically, names are registered on, and synchronised between, only the nodes within the s_group. An s_group has the following parameters: a name, a list of nodes, and a list of registered names. A node can belong to many s_groups or to none. If a node belongs to no s_group, it behaves as a usual distributed Erlang node.

The s_group library defines the functions shown in Table 1. Some of these functions manipulate s_groups and provide information about them, such as creating s_groups and providing a list of nodes from a given s_group. The remaining functions manipulate names registered in s_groups and provide information about these names.


Table 1: Summary of s_group functions

new_s_group(SGroupName, Nodes): creates a new s_group consisting of some nodes.
delete_s_group(SGroupName): deletes an s_group.
add_nodes(SGroupName, Nodes): adds a list of nodes to an s_group.
remove_nodes(SGroupName, Nodes): removes a list of nodes from an s_group.
s_groups(): returns a list of all s_groups known to the node.
own_s_groups(): returns a list of s_group tuples the node belongs to.
own_nodes(): returns a list of nodes from all s_groups the node belongs to.
own_nodes(SGroupName): returns a list of nodes from the given s_group.
info(): returns s_group state information.
register_name(SGroupName, Name, Pid): registers a name in the given s_group.
re_register_name(SGroupName, Name, Pid): re-registers a name (changes a registration) in a given s_group.
unregister_name(SGroupName, Name): unregisters a name in the given s_group.
registered_names({node, Node}): returns a list of all registered names on the given node.
registered_names({s_group, SGroupName}): returns a list of all registered names in the given s_group.
whereis_name(SGroupName, Name) and whereis_name(Node, SGroupName, Name): return the pid of a name registered in the given s_group.
send(SGroupName, Name, Msg) and send(Node, SGroupName, Name, Msg): send a message to a name registered in the given s_group.

For example, to register a process Pid with name Name in s_group SGroupName we use the following function; the name will only be registered if the process is being executed on a node that belongs to the given s_group, and neither Name nor Pid is already registered in that group:

    s_group:register_name(SGroupName, Name, Pid) -> yes | no
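As a usage sketch (the group, node, and process names here are illustrative assumptions), a process registered in an s_group can then be looked up and messaged by name from any node of that group:

    %% A sketch, assuming nodes a@host and b@host are already connected:
    {ok, group1, _Nodes} = s_group:new_s_group(group1, ['a@host', 'b@host']),
    Pid = spawn('b@host', fun() -> receive Msg -> Msg end end),
    yes = s_group:register_name(group1, pong_server, Pid),
    Pid = s_group:whereis_name(group1, pong_server),
    s_group:send(group1, pong_server, ping).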

To illustrate the impact of s_groups on scalability we repeat the global operations experiment from Section ii (Figure 9). In the SD Erlang experiment we partition the set of nodes into s_groups, each containing ten nodes, and hence the names are replicated and synchronised on just ten nodes, and not on all nodes as in distributed Erlang. The results in Figure 11 show that with 0.01% of global operations the throughput of distributed Erlang stops growing at 40 nodes, while the throughput of SD Erlang continues to grow linearly.

The connection topology of s_groups is extremely flexible: they may be organised into a hierarchy of arbitrary depth or branching, e.g. there could be multiple levels in the tree of s_groups; see Figure 12. Moreover, it is not necessary to create a hierarchy of s_groups; for example, we have constructed an Orbit implementation using a ring of s_groups.

Figure 11: Global operations in distributed Erlang vs SD Erlang (throughput vs. number of nodes).

Given such a flexible way of organising distributed systems, key questions in the design of an SD Erlang system are the following. How should s_groups be structured? Depending on the reason the nodes are grouped (reducing the number of connections, reducing the namespace, or both), s_groups can be freely structured as a tree, a ring, or some other topology. How large should the s_groups be? Smaller s_groups mean more inter-group communication, but the synchronisation of the s_group state between the s_group nodes constrains the maximum size of s_groups. We have not found this constraint to be a serious restriction: for example, many s_groups are either relatively small (e.g. 10-node) internal or terminal elements in some topology, e.g. the leaves and nodes of a tree. How do nodes from different s_groups communicate? While any two nodes can communicate in an SD Erlang system, to minimise the number of connections communication between nodes from different s_groups is typically routed via gateway nodes that belong to both s_groups. How do we avoid single points of failure? For reliability, and to minimise communication load, multiple gateway nodes and processes may be required.

Information to make these design choices is provided by the tools in Section VII and by benchmarking. A further challenge is how to systematically refactor a distributed Erlang application into SD Erlang, and this is outlined in Section i. A detailed discussion of distributed system design and refactoring in SD Erlang is provided in a recent article [22].

We illustrate typical SD Erlang system designs by showing refactorings of the Orbit and ACO benchmarks from Section III. In both distributed and SD Erlang the computation starts on the Master node, and the actual computation is done on the Worker nodes. In the distributed Erlang version all nodes are interconnected, and messages are transferred directly from the sending node to the receiving node (Figure 2). In contrast, in the SD Erlang version nodes are grouped into s_groups, and messages are transferred between different s_groups via Sub-master nodes (Figure 13).

Figure 12: SD Erlang ACO (SR-ACO) architecture.

A fragment of code that creates an s_group on a Sub-master node is as follows:

    create_s_group(Master, GroupName, Nodes0) ->
        case s_group:new_s_group(GroupName, Nodes0) of
            {ok, GroupName, Nodes} ->
                {Master, GroupName, Nodes};
            _ -> io:format("exception message")
        end.

Similarly, we introduce s_groups in the GR-ACO benchmark from Section ii to create Scalable Reliable ACO (SR-ACO); see Figure 12. Here, apart from reducing the number of connections, s_groups also reduce the global namespace information. That is, instead of registering the name of a pid globally, i.e. with all nodes, the name is registered only on all nodes in the s_group, with s_group:register_name/3.

A comparative performance evaluation of distributed Erlang and SD Erlang Orbit and ACO is presented in Section VIII.

i.2 Semi-Explicit Placement

Recall from Section iii that distributed Erlang spawns a process onto an explicitly named Erlang node, e.g.:

    spawn(some_node, fun some_module:pong/0)

and also recall the portability and programming effort issues associated with such explicit placement in large scale systems, discussed in Section ii.

To address these issues we have developed a semi-explicit placement library that enables the programmer to select nodes on which to spawn processes based on run-time information about the properties of the nodes. For example, if a process performs a lot of computation one would like to spawn it on a node with considerable computation power, or if two processes are likely to communicate frequently then it would be desirable to spawn them on the same node, or on nodes with a fast interconnect.

We have implemented two Erlang libraries to support semi-explicit placement [60]. The first deals with node attributes, and describes properties of individual Erlang VMs and associated hosts, such as total and currently available RAM, installed software, hardware configuration, etc. The second deals with a notion of communication distances, which models the communication times between nodes in a distributed system. Therefore, instead of specifying a node, we can use the attr:choose_node/1 function to define the target node, i.e.:

    spawn(attr:choose_node(Params),
          fun some_module:pong/0)

Figure 13: SD Erlang (SD-Orbit) architecture.

[60] report an investigation into the communication latencies on a range of NUMA and cluster architectures, and demonstrate the effectiveness of the placement libraries using the ML-ACO benchmark on the Athos cluster.

ii S_group Semantics

For precise specification, and as a basis for validation, we provide a small-step operational semantics of the s_group operations [21]. Figure 14 defines the state of an SD Erlang system and associated abstract syntax variables. The abstract syntax variables on the left are defined as members of sets, denoted {...}, and these in turn may contain tuples, denoted (...), or further sets. In particular, nm is a process name, p a pid, ni a node_id, and nis a set of node_ids. The state of a system is modelled as a four-tuple comprising a set of s_groups, a set of free_groups, a set of free_hidden_groups, and a set of nodes. Each type of group is associated with nodes and has a namespace. An s_group additionally has a name, whereas a free_hidden_group consists of only one node,

i.e. a hidden node simultaneously acts as a node and as a group, because as a group it has a namespace but does not share it with any other node. Free normal and hidden groups have no names, and are uniquely defined by the nodes associated with them. Therefore, group names, gr_names, are either NoGroup or a set of s_group_names. A namespace is a set of name and process id (pid) pairs, and is replicated on all nodes of the associated group.

A node has the following four parameters: a node_id identifier; a node_type, which can be either hidden or normal; connections; and group_names, i.e. names of the groups the node belongs to. The node can belong to either a list of s_groups or one of the free groups. The type of the free group is defined by the node type. Connections are a set of node_ids.

Transitions in the semantics have the form (state, command, ni) → (state′, value), meaning that executing command on node ni in state returns value and transitions to state′.

The semantics is presented in more detail by [21], but we illustrate it here with the s_group:registered_names/1 function from Section i.1. The function returns a list of names


(grs, fgs, fhs, nds) ∈ state ≡ ({s_group}, {free_group}, {free_hidden_group}, {node})
gr ∈ grs ≡ {s_group} ≡ {(s_group_name, {node_id}, namespace)}
fg ∈ fgs ≡ {free_group} ≡ {({node_id}, namespace)}
fh ∈ fhs ≡ {free_hidden_group} ≡ {(node_id, namespace)}
nd ∈ nds ≡ {node} ≡ {(node_id, node_type, connections, gr_names)}
gs ∈ gr_names ≡ {NoGroup, {s_group_name}}
ns ∈ namespace ≡ {(name, pid)}
cs ∈ connections ≡ {node_id}
nt ∈ node_type ≡ {Normal, Hidden}
s ∈ {NoGroup, s_group_name}

Figure 14: SD Erlang state [21].

registered in s_group s if node ni belongs to the s_group, and an empty list otherwise (Figure 15). Here ⊕ denotes disjoint set union; IsSGroupNode returns true if node ni is a member of some s_group s, and false otherwise; and OutputNms returns a set of process names registered in the ns namespace of s_group s.

iii Semantics Validation

As the semantics is concrete, it can readily be made executable in Erlang, with lists replacing sets throughout. Having an executable semantics allows users to engage with it, and to understand how the semantics behaves vis-à-vis the library, giving them an opportunity to assess the correctness of the library against the semantics.

Figure 16: Testing s_groups using QuickCheck.

Better still, we can automatically assess how the system behaves in comparison with the (executable) semantics by executing them in lockstep, guided by the constraints of which operations are possible at each point. We do that by building an abstract state machine model of the library. We can then generate random sequences (or traces) through the model, with appropriate library data generated too. This random generation is supported by the QuickCheck property-based testing system [24, 9].

The architecture of the testing framework is shown in Figure 16. First, an abstract state machine embedded as an "eqc_statem" module is derived from the executable semantic specification. The state machine defines the abstract state representation and the transition from one state to another when an operation is applied. Test case and data generators are then defined to control the test case generation; this includes the automatic generation of eligible s_group operations and the input data to those operations. Test oracles are encoded as the postcondition for s_group operations.

((grs, fgs, fhs, nds), s_group:registered_names(s), ni)
    → ((grs, fgs, fhs, nds), nms)   if IsSGroupNode(ni, s, grs)
    → ((grs, fgs, fhs, nds), {})    otherwise

where (s, {ni} ⊕ nis, ns) ⊕ grs′ ≡ grs
      nms ≡ OutputNms(s, ns)

IsSGroupNode(ni, s, grs) = ∃ nis, ns, grs′. (s, {ni} ⊕ nis, ns) ⊕ grs′ ≡ grs
OutputNms(s, ns) = {(s, nm) | (nm, p) ∈ ns}

Figure 15: SD Erlang semantics of the s_group:registered_names/1 function.
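The fragment below sketches the shape of such an eqc_statem module; the callbacks are QuickCheck's standard state-machine API, but the state record, generators, and oracles are simplified assumptions rather than the project's actual model (and running it requires Quviq QuickCheck to be installed).

    -module(s_group_eqc).
    -include_lib("eqc/include/eqc.hrl").
    -include_lib("eqc/include/eqc_statem.hrl").

    -record(model, {groups = []}).          %% simplified abstract state

    initial_state() -> #model{}.

    %% Generate eligible s_group operations and their input data.
    command(_S) ->
        oneof([{call, s_group, new_s_group,
                [elements([group1, group2]),
                 list(elements(['a@host', 'b@host']))]},
               {call, s_group, own_s_groups, []}]).

    precondition(_S, _Call) -> true.

    %% Transition function: how each operation changes the abstract state.
    next_state(S, _Res, {call, s_group, new_s_group, [Name, Nodes]}) ->
        S#model{groups = [{Name, Nodes} | S#model.groups]};
    next_state(S, _Res, _Call) -> S.

    %% Test oracle, encoded as a postcondition.
    postcondition(_S, {call, s_group, new_s_group, [Name, Nodes]}, Res) ->
        Res =:= {ok, Name, Nodes};
    postcondition(_S, _Call, _Res) -> true.

    prop_s_group() ->
        ?FORALL(Cmds, commands(?MODULE),
                begin
                    {_H, _S, Res} = run_commands(?MODULE, Cmds),
                    Res =:= ok
                end).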

During testing, each test command is applied to both the abstract model and the s_group library. The application of the test command to the abstract model takes the abstract model from its current state to a new state, as described by the transition functions, whereas the application of the test command to the library leads the system to a new actual state. The actual state information is collected from each node in the distributed system, then merged and normalised to the same format as the abstract state representation. For a test to be successful, after the execution of a test command the test oracles specified for this command should be satisfied. Various test oracles can be defined for s_group operations; for instance, one of the generic constraints that applies to all the s_group operations is that after each s_group operation the normalised system state should be equivalent to the abstract state.

Thousands of tests were run, and three kinds of errors (which have subsequently been corrected) were found. Some errors in the library implementation were found, including one error due to the synchronisation between nodes, and another related to the remove_nodes operation, which erroneously raised an exception. We also found a couple of trivial errors in the semantic specification itself, which had been missed by manual examination. Finally, we found some situations where there were inconsistencies between the semantics and the library implementation, despite their states being equivalent; an example of this was in the particular values returned by functions on certain errors. Overall, the automation of testing boosted our confidence in the correctness of the library implementation and the semantic specification. This work is reported in more detail by [54].

VI Improving VM Scalability

This section reports the primary VM and library improvements we have designed and implemented to address the scalability and reliability issues identified in Section i.


i Improvements to Erlang Term Storage

Because ETS tables are so heavily used in Erlang systems, they are a focus for scalability improvements. We start by describing their redesign, including some improvements that pre-date our RELEASE project work, i.e. those prior to Erlang/OTP R15B03. These historical improvements are very relevant for a scalability study, and form the basis for our subsequent changes and improvements. At the point when Erlang/OTP got support for multiple cores (in release R11B) there was a single reader-writer lock for each ETS table. Optional fine-grained locking of hash-based ETS tables (i.e. set, bag or duplicate_bag tables) was introduced in Erlang/OTP R13B02-1, adding 16 reader-writer locks for the hash buckets. Reader groups to minimise read synchronisation overheads were introduced in Erlang/OTP R14B. The key observation is that a single count of the multiple readers must be synchronised across many cache lines, potentially far away in a NUMA system. Maintaining reader counts in multiple (local) caches makes reads fast, although writes must now check every reader count. In Erlang/OTP R16B the number of bucket locks and the default number of reader groups were both upgraded from 16 to 64.

We illustrate the scaling properties of the ETS concurrency options using the ets_bench BenchErl benchmark on the Intel NUMA machine with 32 hyperthreaded cores specified in Appendix A. The ets_bench benchmark inserts 1M items into the table, then records the time to perform 17M operations, where an operation is either a lookup, an insert, or a delete. The experiments are conducted on a hash-based (set) ETS table with different percentages of update operations, i.e. insertions or deletions.

Figure 17 shows the runtimes in seconds of 17M operations in different Erlang/OTP versions, varying the number of schedulers (x-axis), reflecting how the scalability of ETS tables

Figure 17: Runtime of 17M ETS operations in Erlang/OTP releases: 10% updates (left) and 1% updates (right). The y-axis shows time in seconds (lower is better), and the x-axis is number of OS threads (VM schedulers).

has improved in more recent Erlang/OTP releases. Figure 18 shows the runtimes in seconds of 17M operations on an ETS table with different numbers of reader groups, again varying the number of schedulers. We see that one reader group is not sufficient with 10% updates, nor are two with 1% updates. Beyond that, different numbers of reader groups have little impact on the benchmark performance, except that using 64 groups with 10% updates slightly degrades performance.

We have explored four other extensions or redesigns of the ETS implementation for better scalability: (1) allowing more programmer control over the number of bucket locks in hash-based tables, so the programmer can reflect the number of schedulers and the expected access pattern; (2) using contention-adapting trees to get better scalability for ordered_set ETS tables, as described by [69]; (3) using queue delegation locking to improve scalability [47]; and (4) adopting schemes for completely eliminating the locks in the meta table. A more complete discussion of our work on ETS can be found in the papers by [69] and [47].

Figure 18: Runtime of 17M ETS operations with varying numbers of reader groups: 10% updates (left) and 1% updates (right). The y-axis shows runtime in seconds (lower is better), and the x-axis is number of schedulers.

Here we outline only our work on contention-adapting (CA) trees. A CA tree monitors contention in different parts of a tree-shaped data structure, introducing routing nodes with locks in response to high contention, and removing them in response to low contention. For experimental purposes, two variants of the CA tree have been implemented to represent ordered_sets in the virtual machine of Erlang/OTP 17.0. One extends the existing AVL trees in the Erlang VM, and the other uses a Treap data structure [5]. Figure 19 compares the throughput of the CA tree variants with that of ordered_set and set as the number of schedulers increases. It is unsurprising that the CA trees scale so much better than an ordered_set, which is protected by a single readers-writer lock. It is more surprising that they also scale better than set. This is due to hash tables using fine-grained locking at a fixed granularity, while CA trees can adapt the number of locks to the current contention level, and also to the parts of the key range where contention is occurring.

Figure 19: Throughput of CA tree variants (operations per microsecond vs. number of threads): 10% updates (left) and 1% updates (right).


ii Improvements to Schedulers

In the Erlang VM, a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until it either blocks waiting for input, typically a message from some other process, or until it has executed its quota of reductions.

The Erlang VM is usually started with one scheduler per logical core (SMT-thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned it is placed in the run queue of the scheduler of its parent, and it waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores; that is, an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Figure iii.2.

The default load management mechanism is load compaction, which aims to keep as many scheduler threads as possible fully loaded with work, i.e. it attempts to ensure that scheduler threads do not run out of work. We have developed a new optional scheduler utilisation balancing mechanism that is available from Erlang/OTP 17.0. The new mechanism aims to balance scheduler utilisation between schedulers; that is, it will strive for equal scheduler utilisation on all schedulers.

The scheduler utilisation balancing mechanism has no performance impact on the system when not enabled. On the other hand, when enabled it results in changed timing in the system; normally there is a small overhead due to measuring utilisation and calculating balancing information, which depends on the underlying primitives provided by the operating system. The new balancing mechanism results in a better distribution of processes to schedulers, reducing the probability of core contention. Together with other VM improvements, such as interruptable BIFs and garbage collection, it results in lower latency and improved responsiveness, and hence reliability, for soft real-time applications.
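For example, to the best of our knowledge the mechanism is enabled with an emulator flag when the VM is started (the +sub flag defaults to false, and assumes Erlang/OTP 17.0 or later):

    erl +sub true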

iii Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel benchmark in Section i. The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in that returns "Erlang system time", i.e. time since Epoch with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, i.e. not SMT-enabled. The documentation promises that values returned by it are strictly increasing, and many applications ended up relying on this. For example, applications often employ erlang:now/0 to generate unique integers.
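With the new API such code can avoid the global lock entirely; for example (erlang:unique_integer/1 and erlang:monotonic_time/0 are the Erlang/OTP 18 built-ins discussed below):

    %% A unique, strictly increasing integer, without erlang:now/0:
    Unique = erlang:unique_integer([monotonic, positive]),
    %% A unique timestamp-like pair, again without a global lock:
    Stamp = {erlang:monotonic_time(), erlang:unique_integer([monotonic])}.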

Erlang system time should align with the operating system's view of time since Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot, without invalidating the strictly increasing value guarantee. The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when these two have deviated, e.g. in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], i.e. a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards and backwards, while system time is allowed to do this. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.
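In code, the relationship is simply the following (these built-ins are part of the Erlang/OTP 18 time API):

    %% Erlang system time is monotonic time plus the current offset;
    %% erlang:system_time() performs the same computation in one call.
    SysTime = erlang:monotonic_time() + erlang:time_offset().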

Time Retrieval: Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, monotonic time delivered by the operating system is solely based on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP); that is, they will run with different frequencies. Linux is an exception, with a monotonic clock that is NTP adjusted and runs with the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing monotonic time and system time delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute. If the adjustment needs to be changed, new adjustment information is published and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the time adjustment information previously published, and calculates Erlang monotonic time. To preserve monotonicity it is important that all threads that read the same OS monotonic time map this to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information, using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed. This means that in the vast majority of cases the RW lock will be read-locked, which allows multiple readers to run concurrently. To prevent bouncing of the lock cache-line, we use a bespoke reader-optimised RW lock implementation, where reader threads notify about their presence in counters on separate cache-lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11], and alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].

Timer Wheel and BIF Timer: The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel that is used by processes executing on the scheduler.

The implementation of timers in Erlang/OTP uses a built-in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS tables, and only insert one timer at a time into the timer wheel.

Benchmarks: We have measured several benchmarks on a 16-core Bulldozer machine with eight dual CPU AMD Opteron 4376 HEs.5 We present three of them here.

The first micro benchmark compares the execution time of an Erlang receive with that of a receive after that specifies a timeout and provides a default value. The receive after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, the total execution time with the optimised timers is only 5% longer than without timers.
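The measured pattern is sketched below; the receive after variant implicitly sets a timer when the process blocks, and cancels it if the message arrives first:

    %% Plain receive: no timer is involved.
    wait(Ref) ->
        receive {Ref, Msg} -> Msg end.

    %% receive .. after: a timeout with a default value.
    wait(Ref, Timeout, Default) ->
        receive {Ref, Msg} -> Msg
        after Timeout -> Default
        end.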

The second micro benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4, and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on the left shows that: (1) the performance of time management has remained roughly unchanged between Erlang/OTP releases prior to 18.0; (2) the improved time management in Erlang/OTP 18.x makes time management less likely to be a scalability bottleneck, even when using erlang:now/0; and (3) the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution. The graph on the right side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

Figure 20: BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases: runtimes (left) and speedup obtained using erlang:monotonic_time/0 (right).

5 See §2.5.4 of the RELEASE project Deliverable 2.4 (http://release-project.eu/documents/D24.pdf).

VII Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable, to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling, and debugging Erlang systems that can be used separately or together, as appropriate for solving the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on, or specialise, the services offered by the Erlang virtual machine through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer6 application gives a comprehensive overview of many of these data on a node-by-node basis.

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature. That is, the tools support the language itself, rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use tracing support provided by the Erlang VM, so other actor frameworks, e.g. Akka, can use the Kamon7 JVM monitoring system. Similarly, tools for other actor languages or frameworks could use data derived through OS-level tracing frameworks (DTrace8 and SystemTap9 probes), as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

6 http://www.erlang.org/doc/apps/observer
7 http://kamon.io
8 http://dtrace.org/blogs/about
9 https://sourceware.org/systemtap/wiki

i Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare a program for modification or extension, or (as is the case here) in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support in a refactoring tool. There are a number of tools that support refactoring in Erlang: in the RELEASE project we have chosen to extend Wrangler10 [56]; other tools include Tidier [68] and RefactorErl [44].

10 http://www.cs.kent.ac.uk/projects/wrangler

Supporting API Migration: The SD Erlang libraries modify Erlang's global_group library, which becomes the new s_group library as a result; Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolves and this often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically. The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work, and the technicalities of the implementation, can be found in a paper by [52].
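As an illustration of the idea (a schematic adaptor, not Wrangler's actual adaptor format), a call in the old interface can be defined in terms of the new one; here we assume a hypothetical default_group name standing in for the old global group:

    %% Hypothetical adaptor: old one-argument calls expressed via the
    %% new s_group API. A generated refactoring would fold these
    %% definitions into client code.
    whereis_name(Name) ->
        s_group:whereis_name(default_group, Name).

    send(Name, Msg) ->
        s_group:send(default_group, Name, Msg).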

Support for Introducing Parallelism: We have introduced support for parallelising explicit list operations (map and foreach), for process introduction to complete a computationally intensive task in parallel, for introducing a worker process to deal with call handling in an Erlang "generic server", and for parallelising a tail recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach. The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward; even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
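A minimal parallel map in plain Erlang conveys the idea (a sketch, not para_lib's actual implementation):

    %% Spawn one worker per list element, then gather results in order.
    pmap(F, Xs) ->
        Parent = self(),
        Refs = [begin
                    Ref = make_ref(),
                    spawn(fun() -> Parent ! {Ref, F(X)} end),
                    Ref
                end || X <- Xs],
        [receive {Ref, Res} -> Res end || Ref <- Refs].

A sequential lists:map(F, Xs) call can then be replaced by pmap(F, Xs) whenever F is free of side effects.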

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute a task in parallel with its parent process. The result of the new process is sent back to the parent process, which will then consume it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies. For instance, an example might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are checked by program slicing, which is discussed below.

Support for Program Slicing: Program slicing is a general technique of program analysis for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE11 [17], a tool developed in another EU project that also re-uses the Wrangler front end. This work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing in transforming programs to make them suitable for the introduction of parallel structures.

11 http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf


Figure 21: The architecture of WombatOAM.

ii Scalable Deployment

We have developed the WombatOAM tool12 to provide a deployment, operations, and maintenance framework for large-scale Erlang distributed systems. These systems typically consist of a number of Erlang nodes executing on different hosts. These hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP.

12 WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd.

Prior to the development of WombatOAM de-ployment of systems would use scripting inErlang and the shell and this is the state ofthe art for other actor frameworks it would bepossible to adapt the WombatOAM approachto these frameworks in a straightforward way

Architecture: The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability because of the role played by the central Master node; in the current version an additional layer of Middle Managers was introduced to allow the system to scale easily to thousands of deployed nodes. As the diagram shows, the "northbound" interfaces to the web dashboard and the command-line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms, and so forth; we describe these now.

Services. WombatOAM is designed to collect, store and display various kinds of information and events from running Erlang systems; these data are accessed and managed through an AJAX-based Web Dashboard, and include the following.

Metrics. WombatOAM supports the collection of some hundred metrics — including, for instance, the number of processes on a node and the message traffic in and out of the node — on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom¹³, and interface with other tools, such as Graphite¹⁴, which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.
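For illustration, metrics of this kind can be sampled on any node with standard Erlang/OTP BIFs; this sketch is ours and does not show WombatOAM's internal API:

    %% Sample a few node-level metrics using standard BIFs.
    -module(node_metrics).
    -export([sample/0]).

    sample() ->
        {{input, In}, {output, Out}} = erlang:statistics(io),
        [{process_count, erlang:system_info(process_count)}, % processes on this node
         {run_queue, erlang:statistics(run_queue)},          % processes waiting to run
         {memory_total, erlang:memory(total)},               % bytes allocated by the VM
         {io_in, In},                                        % bytes received via ports
         {io_out, Out}].                                     % bytes sent via ports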

Notifications. As well as metrics, it can support event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which is part of the standard distribution, or the lager logging framework¹⁵, and will be displayed and logged as they occur.

¹³ Folsom collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom

¹⁴ https://graphiteapp.org


Alarms. Alarms are more complex entities. Alarms have a state: they can be raised and, once dealt with, they can be cleared. They also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as notifications are.
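For a concrete flavour, alarms with state and identity can be raised and cleared through SASL's standard alarm_handler module; the alarm identifier and description below are invented for the example:

    %% The same alarm id can be raised for several nodes; each instance
    %% is cleared separately, matching the behaviour described above.
    raise_disk_alarm(Node) ->
        alarm_handler:set_alarm({{disk_almost_full, Node}, "90% of disk used"}).

    clear_disk_alarm(Node) ->
        alarm_handler:clear_alarm({disk_almost_full, Node}).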

Topology. The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible; if not, it notifies the other services and periodically tries to reconnect. When the nodes are available again, it also notifies the other services. It doesn't have a middle manager part, because it doesn't talk to the nodes directly; instead it asks the Node manager service to do so.

Node manager. This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration. This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers, the Orchestration service uses an external library called Libcloud¹⁶, for which Erlang Solutions has written an open source Erlang wrapper called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications; it provides infrastructure for deploying them.

¹⁵ https://github.com/basho/lager



Deployment. The mechanism consists of the following five steps:

1. Registering a provider. WombatOAM provides the same interface for different cloud providers that support the OpenStack standard or the Amazon EC2 API; it also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that the start and stop commands should be explicitly specified; this enables WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

¹⁶ The unified cloud API [4]: https://libcloud.apache.org

4. Defining a deployment domain. At this step a deployment domain is created, specifying (i) which providers should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to or removed from the system, depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii. Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single-host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue and processes migrate between the queues through a work stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes¹⁷.

iii.1 Percept2

Percept2¹⁸ builds on the existing Percept tool to provide post hoc offline analysis and visualisation of Erlang systems.

¹⁷ This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.

¹⁸ https://github.com/RefactoringTools/percept2


Figure 22: Offline profiling of a distributed Erlang application using Percept2.

Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang's built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling for a distributed Erlang application using Percept2 is shown in Figure 22.
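A typical offline profiling session follows the original Percept workflow from OTP, which Percept2 retains; the traced entry point my_app:run/0 is a stand-in, and Percept2's own entry points are analogous rather than guaranteed identical:

    %% In the Erlang shell: trace a run to file, analyse the file into the
    %% RAM database, and start the web server for browsing the results.
    percept:profile("test.dat", {my_app, run, []}, [procs]).
    percept:analyze("test.dat").
    percept:start_webserver(8888).   % returns {started, Host, Port}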

Percept generates an application-level zoomable concurrency graph, showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways — as detailed by [53] — including, most importantly:

• Distinguishing between running and runnable time for each process: this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including the migration history of a process between run queues; statistics about message passing between processes (the number of messages and the average message size sent/received by a process); the accumulated runtime per process; and the accumulated time when a process is in a running state.

• Presenting the process tree: the hierarchy structure indicating the parent-child relationships between processes.


Figure 23: Percept2 showing processes running and runnable (with 4 schedulers/run queues).


• Recording the dynamic function callgraph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system. Unlike global groups, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have also extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating the measurement of a function's own execution time gives users the freedom not to profile all the function calls invoked during the program execution. For example, they can choose to profile only functions defined in their own applications' code, and not those in libraries. (The standard fprof workflow on which this builds is sketched after this list.)

• Improved dynamic function callgraph. With the dynamic function callgraph, a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.
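For comparison, the standard fprof workflow that the modified version above builds on is shown below (my_mod:run/0 is a placeholder for application code):

    %% In the Erlang shell: run the function under tracing, profile the
    %% collected trace, and write the analysis to a file.
    fprof:apply(my_mod, run, []).
    fprof:profile().
    fprof:analyse([{dest, "fprof.analysis"}]).

Percept2's variant differs in omitting each function's own execution time from the measurements, which is what makes selective profiling practical.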

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of generated web pages. Together these make the system more responsive and more scalable.



iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, and require neither specially recompiled versions of the software to be examined nor special post-processing tools to create meaningful information from the data gathered. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the sizes of the run queues vary during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the same storage infrastructure and presentation facilities.

Figure 24: DTrace run-queue visualisation of the bang benchmark with 16 schedulers (run queues 0–15; time on X-axis).


iii.3 Devo

Devo¹⁹ [10] is designed to provide real-time online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).
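As an illustration of the instrumentation layer, tracing of the kind Devo consumes can be switched on with ttb from the standard distribution; this is a generic sketch, not Devo's actual configuration:

    %% In the Erlang shell: start a tracer for this node, trace message
    %% sends and receives for all processes, then stop tracing and format
    %% the collected trace files.
    {ok, _} = ttb:tracer(node()).
    {ok, _} = ttb:p(all, [send, 'receive']).
    %% ... run the system under observation ...
    ttb:stop([format]).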

Figure 25 shows visualisations from Devo in both modes. On the left-hand side, a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram), with six cores each; with hyperthreading, this gives twelve virtual cores and hence 24 run queues in total. The size of these run queues is shown by both the colour and height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle. A green arc shows migration on the same physical core, a grey one on the same chip, and blue shows migrations between the two chips.

¹⁹ https://github.com/RefactoringTools/devo



Figure 25: Devo low- and high-level visualisations of SD-Orbit.


On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between the nodes (green quiescent; red busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. The tracing to be performed on monitored nodes is also specified within the configuration file.

An agent is started by a master SD-Mon node for each s_group and for each free node. Configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up, it tries to get in contact with its nodes and applies the tracing to them as stated by the master. Binary files are stored in the host file system. Tracing is used internally in order to track s_group operations happening at runtime; an asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.


Figure 26: SD-Mon architecture.


The monitoring network is also supervised, in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected then that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, the messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution: before (top) and after eliminating s_group 2 (bottom).


Figure 28: SD-Mon online monitoring.

VIII. Systemic Evaluation

Preceding sections have investigated improvements to individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI and VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state of the art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable²⁰. The bulk of the experiments reported here are conducted on the Athos cluster, using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed-size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure. The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.
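For concreteness (our notation): writing T(n) for the execution time on n cores, the relative speedup plotted for Orbit is

    S(n) = T(1) / T(n)

whereas for ACO, where the work grows with the compute resources, ideal weak scaling appears as a flat execution-time curve as nodes are added.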

²⁰ See Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf



Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [strong scaling].


i. Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot the standard deviation. The results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups, i.e. beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails, because some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii. Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5,000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.


Figure 30: WombatOAM deployment time: experimental data and best fit 0.0125N + 83 (time in seconds against number of nodes N).

The measurement data shows two important facts: it shows that WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and that WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of the effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, whereas SR-ACO registers them only within an s_group (Section ii).

For both GR-ACO and SR-ACO, a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common X86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark and obtained similar results [23].
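The essence of such a Chaos Monkey service is easily expressed in Erlang; this minimal sketch is ours (using the rand module of Erlang/OTP 18+), and the real service is more careful, e.g. about protecting itself and system processes:

    %% Kill a randomly chosen local process every second.
    chaos_loop() ->
        Pids = erlang:processes(),
        Victim = lists:nth(rand:uniform(length(Pids)), Pids),
        exit(Victim, kill),
        timer:sleep(1000),
        chaos_loop().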

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.



Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [weak scaling].


Scalability. Figure 31 compares the runtimes of the ML, GR and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and a fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks and others that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii. Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.


Figure 32: Number of sent packets in ML-ACO, GR-ACO and SR-ACO.


As SD-Orbit scales better than D-Orbit, SR-ACO scales better than ML-ACO, and SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limit of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX. Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language, while preserving reliability. The work is a vade mecum for addressing the scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering the VM, language and persistent storage levels. We have developed the BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15] or Legion [40]. We establish scientifically the folklore limitation of around 60 connected nodes for distributed Erlang (Section IV).



The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including the memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups, for partitioning the network and global process namespace, and semi-explicit process placement, for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).
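To give the flavour of the library, the sketch below creates an s_group and registers a name within it, using the s_group functions described in this article and in [21]; node names are illustrative and return values are elided:

    %% The three nodes are fully connected to each other only, and the
    %% name 'coordinator' is registered and synchronised within group1
    %% rather than in a global namespace.
    Nodes = ['node1@host1', 'node2@host2', 'node3@host3'],
    s_group:new_s_group(group1, Nodes),
    s_group:register_name(group1, coordinator, self()),
    Pid = s_group:whereis_name(group1, coordinator).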

To improve the scalability of the Erlang VM and libraries, we have improved the implementation of shared ETS tables, time management and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups has been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation, such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).


Table 2: Cluster specifications (RAM per host in GB)

Name     Hosts   Cores per host   Cores total   Cores avail   Processor                                  RAM      Interconnection
GPG      20      16               320           320           Intel Xeon E5-2640v2, 8C, 2GHz             64       10GB Ethernet
Kalkyl   348     8                2784          1408          Intel Xeon 5520v2, 4C, 2.26GHz             24–72    InfiniBand 20Gb/s
TinTin   160     16               2560          2240          AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz   64–128   2:1 oversubscribed QDR Infiniband
Athos    776     24               18624         6144          Intel Xeon E5-2697v2, 12C, 2.7GHz          64       Infiniband FDR14


In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages²¹. For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3 GHz, each with 16 "Bulldozer" cores, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

²¹ See Deliverable 6.7 (http://www.release-project.eu/documents/D67.pdf), available online.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level Visualization of Concurrent and Distributed Computation in Erlang. In Visual Languages and Computing (VLC), The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riakdocs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking/.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Kőszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go Programming Language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 1:1–1:1, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. In ACM SIGPLAN Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.

[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.


[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing: 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale (3rd ed., revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-Strength Functional Programming: Experiences with the Ericsson AXD301 Project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.



P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

0

20000

40000

60000

80000

100000

120000

140000

160000

0 10 20 30 40 50 60 70

time

(m

s)

Schedulers

[70016640]

01

02

03

04

05

06

07

08

09

1

0 10 20 30 40 50 60 70

Sp

ee

du

p

Schedulers

[70016640]

Figure 7 Runtime and speedup of the BenchErlbenchmark called parallel usingErlangOTP R15B01

So a system with n nodes will maintain O(n2)connections and these are relatively expensiveTCP connections with continual maintenancetraffic This design aids transparent distribu-tion as there is no need to discover nodes andthe design works well for small numbers ofnodes However at emergent server architec-ture scales ie hundreds of nodes this designbecomes very expensive and system architectsmust switch from the default Erlang model egthey need to start using hidden nodes that donot share connection sets

We have investigated the scalability limitsimposed by network connectivity costs usingseveral Orbit calculations on two large clustersKalkyl and Athos as specified in Appendix AThe Kalkyl results are discussed by [21] andFigure 29 in Section i shows representative

results for distributed Erlang computing or-bits with 2M and 5M elements on Athos Inall cases performance degrades beyond somescale (40 nodes for the 5M orbit and 140 nodesfor the 2M orbit) Figure 32 illustrates the ad-ditional network traffic induced by the fullyconnected network It allows a comparison be-tween the number of packets sent in a fully con-nected network (ML-ACO) with those sent ina network partitioned using our new s_groups(SR-ACO)

Global Information Costs Maintainingglobal information is known to limit thescalability of distributed systems and cruciallythe process namespaces used for reliabilityare global To investigate the scalabilitylimits imposed on distributed Erlang bysuch global information we have designedand implemented DE-Bench an open sourceparameterisable and scalable peer-to-peerbenchmarking framework [37 36] DE-Benchmeasures the throughput and latency ofdistributed Erlang commands on a cluster ofErlang nodes and the design is influencedby the Basho Bench benchmarking tool forRiak [12] Each DE-Bench instance acts asa peer providing scalability and reliabilityby eliminating central coordination and anysingle point of failure

To evaluate the scalability of distributedErlang we measure how adding more hostsincreases the throughput ie the total num-ber of successfully executed distributed Erlangcommands per experiment Figure 8 showsthe parameterisable internal workflow of DE-Bench There are three classes of commandsin DE-Bench(i) Point-to-Point (P2P) commandswhere a function with tunable argument sizeand computation time is run on a remotenode include spawn rpc and synchronouscalls to server processes ie gen_server

or gen_fsm (ii) Global commands whichentail synchronisation across all connectednodes such as globalregister_name and

12

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 8 DE-Benchrsquos internal workflow

globalunregister_name (iii) Local commandswhich are executed independently by a singlenode eg whereis_name a look up in thelocal name table

The benchmarking is conducted on 10 to 100host configurations of the Kalkyl cluster (insteps of 10) and measures the throughput ofsuccessful commands per second over 5 min-utes There is one Erlang VM on each host andone DE-Bench instance on each VM The fullpaper [37] investigates the impact of data sizeand computation time in P2P calls both inde-pendently and in combination and the scalingproperties of the common ErlangOTP genericserver processes gen_server and gen_fsm

Here we focus on the impact of different pro-portions of global commands mixing globalwith P2P and local commands Figure 9shows that even a low proportion of globalcommands limits the scalability of distributedErlang eg just 001 global commands limitsscalability to around 60 nodes Figure 10 re-ports the latency of all commands and showsthat while the latencies for P2P and local com-mands are stable at scale the latency of theglobal commands increases dramatically withscale Both results illustrate that the impact ofglobal operations on throughput and latencyin a distributed Erlang system is severe

Explicit Placement While network connectiv-ity and global information impact performanceat scale our investigations also identified ex-plicit process placement as a programming issueat scale Recall from Section iii that distributedErlang requires the programmer to identify anexplicit Erlang node (VM) when spawning aprocess Identifying an appropriate node be-comes a significant burden for large and dy-namic systems The problem is exacerbated inlarge distributed systems where (1) the hostsmay not be identical having different hardwarecapabilities or different software installed and(2) communication times may be non-uniformit may be fast to send a message between VMson the same host and slow if the VMs are ondifferent hosts in a large distributed system

These factors make it difficult to deploy ap-plications especially in a scalable and portablemanner Moreover while the programmer maybe able to use platform-specific knowledge todecide where to spawn processes to enable anapplication to run efficiently if the applicationis then deployed on a different platform or ifthe platform changes as hosts fail or are addedthis becomes outdated

Discussion Our investigations confirm threelanguage-level scalability limitations of Erlang

13

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 9 Scalability vs percentage of global commands in Distributed Erlang

from developer folklore(1) Maintaining a fullyconnected network of Erlang nodes limits scal-ability for example Orbit is typically limitedto just 40 nodes (2) Global operations and cru-cially the global operations required for relia-bility ie to maintain a global namespace seri-ously limit the scalability of distributed Erlangsystems (3) Explicit process placement makes ithard to built performance portable applicationsfor large architectures These issues cause de-signers of reliable large scale systems in Erlangto depart from the standard Erlang model egusing techniques like hidden nodes and stor-ing pids in data structures In Section V wedevelop language technologies to address theseissues

iii Persistent Storage

Any large scale system needs reliable scalable persistent storage, and we have studied the scalability limits of Erlang persistent storage alternatives [38]. We envisage a typical large server having around 10^5 cores on around 100 hosts. We have reviewed the requirements for scalable and available persistent storage, and evaluated four popular Erlang DBMSs against these requirements. For a target scale of around 100 hosts, Mnesia and CouchDB are, unsurprisingly, not suitable. However, Dynamo-style NoSQL DBMSs like Cassandra and Riak have the potential to be suitable.

We have investigated the current scalability limits of the Riak NoSQL DBMS, using the Basho Bench benchmarking framework, on a cluster with up to 100 nodes and independent disks. We found that the scalability limit of Riak version 1.1.1 is 60 nodes on the Kalkyl cluster. The study placed into the public scientific domain what was previously well-evidenced, but anecdotal, developer experience.

We have also shown that resources like memory, disk and network do not limit the scalability of Riak. By instrumenting the global and gen_server OTP libraries, we identified a specific Riak remote procedure call that fails to scale. We outline how later releases of Riak are refactored to eliminate the scalability bottlenecks.


Figure 10: Latency of commands as the number of Erlang nodes increases.

Discussion. We conclude that Dynamo-like NoSQL DBMSs have the potential to deliver reliable persistent storage for Erlang at our target scale of approximately 100 hosts. Specifically, an Erlang Cassandra interface is available, and Riak 1.1.1 already provides scalable and available persistent storage on 60 nodes. Moreover, the scalability of Riak is much improved in subsequent versions.

V Improving Language Scalability

This section outlines the Scalable Distributed (SD) Erlang libraries [21] we have designed and implemented to address the distributed Erlang scalability issues identified in Section ii. SD Erlang introduces two concepts to improve scalability. S_groups partition the set of nodes in an Erlang system to reduce network connectivity and partition global data (Section i1). Semi-explicit placement alleviates the issues of explicit process placement in large heterogeneous networks (Section i2). The two features are independent and can be used separately or in combination. We overview SD Erlang in Section i, and outline s_group semantics and validation in Sections ii and iii respectively.

i SD Erlang Design

i1 S_groups

S_groups reduce both the number of connections a node maintains and the size of namespaces, i.e. they minimise global information. Specifically, names are registered on, and synchronised between, only the nodes within the s_group. An s_group has the following parameters: a name, a list of nodes, and a list of registered names. A node can belong to many s_groups or to none. If a node belongs to no s_group, it behaves as a usual distributed Erlang node.

The s_group library defines the functions shown in Table 1. Some of these functions manipulate s_groups and provide information about them, such as creating s_groups and providing a list of nodes from a given s_group. The remaining functions manipulate names registered in s_groups and provide information about these names. For example, to register a process Pid with name Name in s_group SGroupName, we use the following function.


Table 1: Summary of s_group functions

new_s_group(SGroupName, Nodes): Creates a new s_group consisting of some nodes.
delete_s_group(SGroupName): Deletes an s_group.
add_nodes(SGroupName, Nodes): Adds a list of nodes to an s_group.
remove_nodes(SGroupName, Nodes): Removes a list of nodes from an s_group.
s_groups(): Returns a list of all s_groups known to the node.
own_s_groups(): Returns a list of s_group tuples the node belongs to.
own_nodes(): Returns a list of nodes from all s_groups the node belongs to.
own_nodes(SGroupName): Returns a list of nodes from the given s_group.
info(): Returns s_group state information.
register_name(SGroupName, Name, Pid): Registers a name in the given s_group.
re_register_name(SGroupName, Name, Pid): Re-registers a name (changes a registration) in a given s_group.
unregister_name(SGroupName, Name): Unregisters a name in the given s_group.
registered_names({node, Node}): Returns a list of all registered names on the given node.
registered_names({s_group, SGroupName}): Returns a list of all registered names in the given s_group.
whereis_name(SGroupName, Name), whereis_name(Node, SGroupName, Name): Returns the pid of a name registered in the given s_group.
send(SGroupName, Name, Msg), send(Node, SGroupName, Name, Msg): Sends a message to a name registered in the given s_group.

The name will only be registered if the process is being executed on a node that belongs to the given s_group, and neither Name nor Pid is already registered in that group:

    s_group:register_name(SGroupName, Name, Pid) -> yes | no
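For instance, a registered process can then be looked up and messaged by name within the group. The following fragment is our own sketch, using the functions of Table 1 with hypothetical group and process names:

    %% Sketch (ours): register a process in an s_group, then use the
    %% group namespace to find it and send it a message.
    Pid = spawn(fun() -> receive {From, ping} -> From ! pong end end),
    yes = s_group:register_name(group1, pong_server, Pid),
    Pid = s_group:whereis_name(group1, pong_server),
    s_group:send(group1, pong_server, {self(), ping}).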

To illustrate the impact of s_groups on scalability, we repeat the global operations experiment from Section ii (Figure 9). In the SD Erlang experiment we partition the set of nodes into s_groups, each containing ten nodes, and hence the names are replicated and synchronised on just ten nodes, and not on all nodes as in distributed Erlang. The results in Figure 11 show that with 0.01% of global operations the throughput of distributed Erlang stops growing at 40 nodes, while the throughput of SD Erlang continues to grow linearly.

Figure 11: Global operations in Distributed Erlang vs SD Erlang: throughput (successful commands per second) on 10 to 100 nodes.

The connection topology of s_groups is extremely flexible: they may be organised into a hierarchy of arbitrary depth or branching, e.g. there could be multiple levels in the tree of s_groups; see Figure 12. Moreover, it is not necessary to create a hierarchy of s_groups; for example, we have constructed an Orbit implementation using a ring of s_groups.

Given such a flexible way of organising distributed systems, key questions in the design of an SD Erlang system are the following. How should s_groups be structured? Depending on the reason the nodes are grouped (reducing the number of connections, reducing the namespace, or both), s_groups can be freely structured as a tree, ring, or some other topology. How large should the s_groups be? Smaller s_groups mean more inter-group communication, but the synchronisation of the s_group state between the s_group nodes constrains the maximum size of s_groups. We have not found this constraint to be a serious restriction; for example, many s_groups are either relatively small (e.g. 10-node) internal or terminal elements in some topology, e.g. the leaves and nodes of a tree. How do nodes from different s_groups communicate? While any two nodes can communicate in an SD Erlang system, to minimise the number of connections communication between nodes from different s_groups is typically routed via gateway nodes that belong to both s_groups. How do we avoid single points of failure? For reliability, and to minimise communication load, multiple gateway nodes and processes may be required.
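As a minimal illustration of a gateway (our sketch, with hypothetical node names), two s_groups can share a node, through which inter-group traffic is then routed; the return value shape follows the create_s_group fragment shown later in this section:

    %% Sketch (ours): 'gateway@host2' belongs to both groups and can
    %% route messages between them.
    {ok, group_a, _} = s_group:new_s_group(group_a,
                           ['worker1@host1', 'gateway@host2']),
    {ok, group_b, _} = s_group:new_s_group(group_b,
                           ['gateway@host2', 'worker2@host3']).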

Information to make these design choices is provided by the tools in Section VII and by benchmarking. A further challenge is how to systematically refactor a distributed Erlang application into SD Erlang, and this is outlined in Section i. A detailed discussion of distributed system design and refactoring in SD Erlang is provided in a recent article [22].

We illustrate typical SD Erlang system designs by showing refactorings of the Orbit and ACO benchmarks from Section III. In both distributed and SD Erlang, the computation starts on the Master node and the actual computation is done on the Worker nodes. In the distributed Erlang version all nodes are interconnected, and messages are transferred directly from the sending node to the receiving node (Figure 2). In contrast, in the SD Erlang version nodes are grouped into s_groups, and messages are transferred between different s_groups via Sub-master nodes (Figure 13).


Figure 12: SD Erlang ACO (SR-ACO) architecture: a master process on the master node at Level 0; sub-master and colony processes on colony nodes (node1 to node16) at Levels 1 and 2, grouped into s_groups; and ant processes (P1 to P16) on the colony nodes.

A fragment of code that creates an s_group on a Sub-master node is as follows:

    create_s_group(Master, GroupName, Nodes0) ->
        case s_group:new_s_group(GroupName, Nodes0) of
            {ok, GroupName, Nodes} ->
                {Master, GroupName, Nodes};
            _ -> io:format("exception: message")
        end.

Similarly, we introduce s_groups in the GR-ACO benchmark from Section ii to create Scalable Reliable ACO (SR-ACO); see Figure 12. Here, apart from reducing the number of connections, s_groups also reduce the global namespace information. That is, instead of registering the name of a pid globally, i.e. with all nodes, the name is registered only on all nodes in the s_group, with s_group:register_name/3.

A comparative performance evaluation of distributed Erlang and SD Erlang Orbit and ACO is presented in Section VIII.

i2 Semi-Explicit Placement

Recall from Section iii that distributed Erlang spawns a process onto an explicitly named Erlang node, e.g.:

    spawn(some_node, fun some_module:pong/0)

and also recall the portability and programming effort issues associated with such explicit placement in large scale systems, discussed in Section ii.

To address these issues we have developed a semi-explicit placement library that enables the programmer to select nodes on which to spawn processes based on run-time information about the properties of the nodes. For example, if a process performs a lot of computation one would like to spawn it on a node with considerable computation power, or if two processes are likely to communicate frequently then it would be desirable to spawn them on the same node, or on nodes with a fast interconnect.

Figure 13: SD Erlang (SD-Orbit) architecture.

We have implemented two Erlang libraries to support semi-explicit placement [60]. The first deals with node attributes, and describes properties of individual Erlang VMs and associated hosts, such as total and currently available RAM, installed software, hardware configuration, etc. The second deals with a notion of communication distances, which models the communication times between nodes in a distributed system. Therefore, instead of specifying a node, we can use the attr:choose_node/1 function to define the target node, i.e.:

    spawn(attr:choose_node(Params), fun some_module:pong/0)
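The article does not fix the format of Params, so the following sketch is purely illustrative of how attribute-based placement might be expressed; the attribute names are hypothetical:

    %% Hypothetical attribute query (ours): pick a node with plenty of
    %% free RAM that is communication-close to the current node.
    TargetNode = attr:choose_node([{ram_free, 4096}, {distance, near}]),
    spawn(TargetNode, fun some_module:pong/0).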

[60] report an investigation into the communication latencies on a range of NUMA and cluster architectures, and demonstrate the effectiveness of the placement libraries using the ML-ACO benchmark on the Athos cluster.

ii S_group Semantics

For precise specification, and as a basis for validation, we provide a small-step operational semantics of the s_group operations [21]. Figure 14 defines the state of an SD Erlang system and associated abstract syntax variables. The abstract syntax variables on the left are defined as members of sets, denoted {...}, and these in turn may contain tuples, denoted (...), or further sets. In particular, nm is a process name, p a pid, ni a node_id, and nis a set of node_ids. The state of a system is modelled as a four-tuple comprising a set of s_groups, a set of free_groups, a set of free_hidden_groups, and a set of nodes. Each type of group is associated with nodes and has a namespace. An s_group additionally has a name, whereas a free_hidden_group consists of only one node: i.e. a hidden node simultaneously acts as a node and as a group, because as a group it has a namespace but does not share it with any other node. Free normal and hidden groups have no names, and are uniquely defined by the nodes associated with them. Therefore, group names, gr_names, are either NoGroup or a set of s_group_names. A namespace is a set of name and process id (pid) pairs, and is replicated on all nodes of the associated group.

A node has the following four parameters: a node_id identifier; a node_type that can be either hidden or normal; connections; and group_names, i.e. names of groups the node belongs to. The node can belong to either a list of s_groups or one of the free groups. The type of the free group is defined by the node type. Connections are a set of node_ids.

Transitions in the semantics have the form (state, command, ni) → (state′, value), meaning that executing command on node ni in state returns value and transitions to state′.

The semantics is presented in more detail by [21], but we illustrate it here with the s_group:registered_names/1 function from Section i1.

    (grs, fgs, fhs, nds) ∈ state ≡ ({s_group}, {free_group}, {free_hidden_group}, {node})
    gr ∈ grs ≡ {s_group} ≡ {(s_group_name, {node_id}, namespace)}
    fg ∈ fgs ≡ {free_group} ≡ {({node_id}, namespace)}
    fh ∈ fhs ≡ {free_hidden_group} ≡ {(node_id, namespace)}
    nd ∈ nds ≡ {node} ≡ {(node_id, node_type, connections, gr_names)}
    gs ∈ gr_names ≡ NoGroup ∪ {s_group_name}
    ns ∈ namespace ≡ {(name, pid)}
    cs ∈ connections ≡ {node_id}
    nt ∈ node_type ≡ {Normal, Hidden}
    s ∈ NoGroup ∪ s_group_name

Figure 14: SD Erlang state [21].

The function returns a list of names registered in s_group s if node ni belongs to the s_group, and an empty list otherwise (Figure 15). Here ⊕ denotes disjoint set union, IsSGroupNode returns true if node ni is a member of some s_group s and false otherwise, and OutputNms returns a set of process names registered in the ns namespace of s_group s.

iii Semantics Validation

Figure 16: Testing s_groups using QuickCheck.

As the semantics is concrete, it can readily be made executable in Erlang, with lists replacing sets throughout. Having an executable semantics allows users to engage with it, and to understand how the semantics behaves vis à vis the library, giving them an opportunity to assess the correctness of the library against the semantics.
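Concretely, the state of Figure 14 maps onto Erlang terms along the following lines (our sketch, with lists standing in for sets):

    -type state()             :: {[s_group()], [free_group()],
                                  [free_hidden_group()], [node_rec()]}.
    -type s_group()           :: {s_group_name(), [node_id()], namespace()}.
    -type free_group()        :: {[node_id()], namespace()}.
    -type free_hidden_group() :: {node_id(), namespace()}.
    -type node_rec()          :: {node_id(), normal | hidden,
                                  [node_id()], no_group | [s_group_name()]}.
    -type namespace()         :: [{atom(), pid()}].
    -type node_id()           :: node().
    -type s_group_name()      :: atom().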

Better still, we can automatically assess how the system behaves in comparison with the (executable) semantics by executing them in lockstep, guided by the constraints of which operations are possible at each point. We do that by building an abstract state machine model of the library. We can then generate random sequences (or traces) through the model, with appropriate library data generated too. This random generation is supported by the QuickCheck property-based testing system [24, 9].

The architecture of the testing framework is shown in Figure 16. First, an abstract state machine, embedded as an "eqc_statem" module, is derived from the executable semantic specification. The state machine defines the abstract state representation and the transition from one state to another when an operation is applied. Test case and data generators are then defined to control the test case generation;

    ((grs, fgs, fhs, nds), s_group:registered_names(s), ni)
        → ((grs, fgs, fhs, nds), nms)   if IsSGroupNode(ni, s, grs)
        → ((grs, fgs, fhs, nds), {})    otherwise

    where {(s, {ni} ⊕ nis, ns)} ⊕ grs′ ≡ grs
          nms ≡ OutputNms(s, ns)

    IsSGroupNode(ni, s, grs) = ∃ nis, ns, grs′ . {(s, {ni} ⊕ nis, ns)} ⊕ grs′ ≡ grs

    OutputNms(s, ns) = {(s, nm) | (nm, p) ∈ ns}

Figure 15: SD Erlang semantics of the s_group:registered_names/1 function.

this includes the automatic generation of eligible s_group operations and the input data to those operations. Test oracles are encoded as postconditions for the s_group operations.

During testing, each test command is applied to both the abstract model and the s_group library. The application of the test command to the abstract model takes the abstract model from its current state to a new state, as described by the transition functions, whereas the application of the test command to the library leads the system to a new actual state. The actual state information is collected from each node in the distributed system, then merged and normalised to the same format as the abstract state representation. For a test to be successful, after the execution of a test command the test oracles specified for this command should be satisfied. Various test oracles can be defined for s_group operations; for instance, one of the generic constraints that applies to all the s_group operations is that after each s_group operation the normalised system state should be equivalent to the abstract state.
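The following fragment sketches the shape of such an eqc_statem module (our own much-simplified illustration; the real model derived from the semantics is considerably richer):

    -module(s_group_eqc).
    -include_lib("eqc/include/eqc.hrl").
    -include_lib("eqc/include/eqc_statem.hrl").

    %% Abstract state: the four-tuple of Figure 14, with lists for sets.
    initial_state() -> {[], [], [], []}.

    %% Generate eligible s_group operations with generated data.
    command(_S) ->
        oneof([{call, s_group, new_s_group,
                [elements([group1, group2]),
                 list(elements(['n1@h1', 'n2@h2']))]},
               {call, s_group, registered_names,
                [elements([group1, group2])]}]).

    precondition(_S, _Call) -> true.

    %% Transition function mirroring the semantics.
    next_state({Grs, Fgs, Fhs, Nds}, _Res,
               {call, s_group, new_s_group, [Name, Nodes]}) ->
        {[{Name, Nodes, []} | Grs], Fgs, Fhs, Nds};
    next_state(S, _Res, _Call) -> S.

    %% Test oracle: compare the actual result or normalised actual state
    %% with what the abstract model predicts.
    postcondition(_S, {call, s_group, registered_names, [_G]}, Res) ->
        is_list(Res);
    postcondition(_S, _Call, _Res) -> true.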

Thousands of tests were run, and three kinds of errors (which have subsequently been corrected) were found. Some errors in the library implementation were found, including one error due to the synchronisation between nodes, and another related to the remove_nodes operation, which erroneously raised an exception. We also found a couple of trivial errors in the semantic specification itself, which had been missed by manual examination. Finally, we found some situations where there were inconsistencies between the semantics and the library implementation, despite their states being equivalent; an example of this was in the particular values returned by functions on certain errors. Overall, the automation of testing boosted our confidence in the correctness of the library implementation and the semantic specification. This work is reported in more detail by [54].

VI Improving VM Scalability

This section reports the primary VM and library improvements we have designed and implemented to address the scalability and reliability issues identified in Section i.


i Improvements to Erlang Term Storage

Because ETS tables are so heavily used in Erlang systems, they are a focus for scalability improvements. We start by describing their redesign, including some improvements that pre-date our RELEASE project work, i.e. those prior to Erlang/OTP R15B03. These historical improvements are very relevant for a scalability study, and form the basis for our subsequent changes and improvements. At the point when Erlang/OTP got support for multiple cores (in release R11B), there was a single reader-writer lock for each ETS table. Optional fine-grained locking of hash-based ETS tables (i.e. set, bag or duplicate_bag tables) was introduced in Erlang/OTP R13B02-1, adding 16 reader-writer locks for the hash buckets. Reader groups to minimise read synchronisation overheads were introduced in Erlang/OTP R14B. The key observation is that a single count of the multiple readers must be synchronised across many cache lines, potentially far away in a NUMA system. Maintaining reader counts in multiple (local) caches makes reads fast, although writes must now check every reader count. In Erlang/OTP R16B the number of bucket locks and the default number of reader groups were both upgraded from 16 to 64.
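Both forms of locking are controlled at table creation; a brief sketch (ours):

    %% set tables get fine-grained bucket locks with write_concurrency,
    %% and per-scheduler reader groups with read_concurrency.
    T = ets:new(bench_table, [set, public,
                              {read_concurrency, true},
                              {write_concurrency, true}]),
    true = ets:insert(T, {key, 42}),
    [{key, 42}] = ets:lookup(T, key).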

We illustrate the scaling properties of the ETS concurrency options using the ets_bench BenchErl benchmark on the Intel NUMA machine with 32 hyperthreaded cores specified in Appendix A. The ets_bench benchmark inserts 1M items into the table, then records the time to perform 17M operations, where an operation is either a lookup, an insert, or a delete. The experiments are conducted on a hash-based (set) ETS table with different percentages of update operations, i.e. insertions or deletions.

Figure 17: Runtime of 17M ETS operations in Erlang/OTP releases R11B-5, R13B02-1, R14B and R16B: 10% updates (left) and 1% updates (right). The y-axis shows time in seconds (lower is better), and the x-axis is the number of OS threads (VM schedulers).

Figure 17 shows the runtimes in seconds of 17M operations in different Erlang/OTP versions, varying the number of schedulers (x-axis), reflecting how the scalability of ETS tables has improved in more recent Erlang/OTP releases. Figure 18 shows the runtimes in seconds of 17M operations on an ETS table with different numbers of reader groups, again varying the number of schedulers. We see that one reader group is not sufficient with 10% updates, nor are two with 1% updates. Beyond that, different numbers of reader groups have little impact on the benchmark performance, except that using 64 groups with 10% updates slightly degrades performance.

Figure 18: Runtime of 17M ETS operations with varying numbers of reader groups (1 to 64): 10% updates (left) and 1% updates (right). The y-axis shows runtime in seconds (lower is better), and the x-axis is the number of schedulers.

We have explored four other extensions or redesigns of the ETS implementation for better scalability: (1) allowing more programmer control over the number of bucket locks in hash-based tables, so that the programmer can reflect the number of schedulers and the expected access pattern; (2) using contention-adapting trees to get better scalability for ordered_set ETS tables, as described by [69]; (3) using queue delegation locking to improve scalability [47]; and (4) adopting schemes for completely eliminating the locks in the meta table. A more complete discussion of our work on ETS can be found in the papers by [69] and [47].

Figure 19: Throughput (operations per microsecond) of set, ordered_set and the CA tree variants (AVL-CA tree and Treap-CA tree) as the number of threads grows: 10% updates (left) and 1% updates (right).

Here we outline only our work on contention-adapting (CA) trees. A CA tree monitors contention in different parts of a tree-shaped data structure, introducing routing nodes with locks in response to high contention, and removing them in response to low contention. For experimental purposes, two variants of the CA tree have been implemented to represent ordered_sets in the virtual machine of Erlang/OTP 17.0. One extends the existing AVL trees in the Erlang VM, and the other uses a Treap data structure [5]. Figure 19 compares the throughput of the CA tree variants with that of ordered_set and set as the number of schedulers increases. It is unsurprising that the CA trees scale so much better than an ordered_set, which is protected by a single readers-writer lock. It is more surprising that they also scale better than set. This is due to hash tables using fine-grained locking at a fixed granularity, while CA trees can adapt the number of locks both to the current contention level and to the parts of the key range where contention is occurring.


ii Improvements to Schedulers

In the Erlang VM, a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until it either blocks waiting for input, typically a message from some other process, or until it has executed its quota of reductions.

The Erlang VM is usually started with one scheduler per logical core (SMT-thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned it is placed in the run queue of the scheduler of its parent, and it waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores, that is, an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Section iii2.

The default load management mechanism is load compaction, which aims to keep as many scheduler threads as possible fully loaded with work, i.e. it attempts to ensure that scheduler threads do not run out of work. We have developed a new optional scheduler utilisation balancing mechanism that is available from Erlang/OTP 17.0. The new mechanism aims to balance scheduler utilisation between schedulers; that is, it will strive for equal scheduler utilisation on all schedulers.
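To the best of our knowledge, the mechanism is switched on with the +sub emulator flag (it defaults to false, which keeps the load compaction strategy):

    # enable scheduler utilisation balancing when starting the VM
    erl +sub true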

The scheduler utilisation balancing mechanism has no performance impact on the system when not enabled. On the other hand, when enabled it results in changed timing in the system; normally there is a small overhead due to measuring of utilisation and calculating balancing information, which depends on the underlying primitives provided by the operating system.

The new balancing mechanism results in a better distribution of processes to schedulers, reducing the probability of core contention. Together with other VM improvements, such as interruptable BIFs and garbage collection, it results in lower latency and improved responsiveness, and hence reliability, for soft real-time applications.

iii Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel benchmark in Section i. The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in that returns "Erlang system time", or time since Epoch, with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, i.e. SMT-enabled. The documentation promises that values returned by it are strictly increasing, and many applications ended up relying on this. For example, applications often employ erlang:now/0 to generate unique integers.
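Since Erlang/OTP 18 this use case is served directly by erlang:unique_integer/1, avoiding the global time lock altogether; a one-line sketch (ours):

    %% strictly increasing, positive, node-unique integers
    UniqueId = erlang:unique_integer([monotonic, positive]).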

Erlang system time should align with the operating system's view of time since Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot, without invalidating the strictly increasing value guarantee. The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when these two have deviated, e.g. in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], i.e. a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards and backwards, while system time is allowed to do this. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.
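The relationship can be expressed with the OTP 18 time API itself (a sketch, ours):

    %% Erlang system time = Erlang monotonic time + time offset.
    Mono    = erlang:monotonic_time(),
    Offset  = erlang:time_offset(),
    SysTime = Mono + Offset.   % corresponds to erlang:system_time()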

Time Retrieval. Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, monotonic time delivered by the operating system is solely based on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP); that is, they will run with different frequencies. Linux is an exception, with a monotonic clock that is NTP adjusted and runs with the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing monotonic time and system time delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute. If the adjustment needs to be changed, new adjustment information is published and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the time adjustment information previously published, and calculates Erlang monotonic time. To preserve monotonicity, it is important that all threads that read the same OS monotonic time map this to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information, using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed. This means that in the vast majority of cases the RW lock will be read-locked, which allows multiple readers to run concurrently. To prevent bouncing the lock cache-line, we use a bespoke reader-optimised RW lock implementation, where reader threads notify about their presence in counters on separate cache-lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11], and alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].

Timer Wheel and BIF Timer. The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel, which is used by processes executing on the scheduler.

The implementation of timers in Erlang/OTP uses a built-in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS tables, and only insert one timer at a time into the timer wheel.

Benchmarks. We have measured several benchmarks on a 16-core AMD Bulldozer machine (dual AMD Opteron 4376 HEs).5 We present three of them here.

The first micro benchmark compares the execution time of an Erlang receive with that of a receive ... after that specifies a timeout and provides a default value. The receive ... after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, total execution time with the optimised timers is only 5% longer than without timers.
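The two patterns compared by this benchmark look as follows (our sketch):

    plain_receive() ->
        receive Msg -> Msg end.

    receive_with_timeout() ->
        receive
            Msg -> Msg
        after 1000 ->        % a timer is set on block, cancelled on receipt
            default_value
        end.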

The second micro benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4 and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on the left shows that: (1) the performance of time management has remained roughly unchanged between Erlang/OTP releases prior to 18.0; (2) the improved time management in Erlang/OTP 18.x makes time management less likely to be a scalability bottleneck, even when using erlang:now/0; and (3) the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution. The graph on the right side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

5. See §2.5.4 of the RELEASE project Deliverable 2.4 (http://release-project.eu/documents/D24.pdf).

Figure 20: BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases: runtimes in ms (left) and speedup obtained using erlang:monotonic_time/0 (right); the x-axis is the number of schedulers.

VII Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable, to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling and debugging Erlang systems, which can be used separately or together, as appropriate for solving the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on, or specialise, the services offered by the Erlang virtual machine through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer6 application gives a comprehensive overview of many of these data on a node-by-node basis.

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature. That is, the tools support the language itself rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use the tracing support provided by the Erlang VM, so other actor frameworks, e.g. Akka, can use the Kamon7 JVM monitoring system. Similarly, tools for other actor languages or frameworks could use data derived through OS-level tracing frameworks, DTrace8 and SystemTap9 probes, as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

6. http://www.erlang.org/doc/apps/observer
7. http://kamon.io

i Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare a program for modification or extension, or, as is the case here, in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support in a refactoring tool. There are a number of tools that support refactoring in Erlang: in the RELEASE project we have chosen to extend Wrangler10 [56]; other tools include Tidier [68] and RefactorErl [44].

Supporting API Migration. The SD Erlang libraries modify Erlang's global_group library, which becomes the new s_group library as a result; Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolves and this often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically. The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work and the technicalities of the implementation can be found in a paper by [52].

8. http://dtrace.org/blogs/about
9. https://sourceware.org/systemtap/wiki
10. http://www.cs.kent.ac.uk/projects/wrangler
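As a purely illustrative sketch (the actual adaptor format is defined by Wrangler, and this module is hypothetical), an adaptor might express an old global_group call in terms of the new s_group API:

    -module(global_group_adaptor).
    -export([registered_names/1]).

    %% Old API expressed in terms of the new one; Wrangler generates the
    %% client-code refactoring from definitions of this shape.
    registered_names({node, Node}) ->
        s_group:registered_names({node, Node}).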

Support for Introducing Parallelism. We have introduced support for parallelising explicit list operations (map and foreach), for process introduction to complete a computationally intensive task in parallel, for introducing a worker process to deal with call handling in an Erlang "generic server", and for parallelising a tail recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in the conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach, as sketched below. The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward; even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
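A parallel map of this kind can be written in a few lines of plain Erlang (our sketch, not the para_lib code):

    %% Spawn one worker per list element, then collect results in order.
    pmap(F, Xs) ->
        Parent = self(),
        Refs = [begin
                    Ref = make_ref(),
                    spawn(fun() -> Parent ! {Ref, F(X)} end),
                    Ref
                end || X <- Xs],
        [receive {Ref, Res} -> Res end || Ref <- Refs].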

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute a task in parallel with its parent process. The result of the new process is sent back to the parent process, which will then consume it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies; for instance, a function might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are checked by program slicing, which is discussed below.

Support for Program Slicing. Program slicing is a general program analysis technique for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE11 [17], a tool developed in another EU project that also re-uses the Wrangler front end. This work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing in transforming programs to make them suitable for the introduction of parallel structures.

11. http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf


Figure 21: The architecture of WombatOAM: a Master node offers northbound REST interfaces to the Wombat web dashboard and CLI, and delegates to Middle Managers whose services (metrics, notifications, alarms, orchestration, topology, node manager) engage with agents on the managed Erlang nodes; VMs are provisioned via Libcloud/elibcloud and software via SSH.

ii Scalable Deployment

We have developed the WombatOAM tool12 to provide a deployment, operations and maintenance framework for large-scale Erlang distributed systems. These systems typically consist of a number of Erlang nodes executing on different hosts. These hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP.

12. WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd.

Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this is the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture. The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability because of the role played by the central Master node; in its current version an additional layer of Middle Managers was introduced to allow the system to scale easily to thousands of deployed nodes. As the diagram shows, the "northbound" interfaces to the web dashboard and the command-line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms, and so forth; we describe those now.

Services. WombatOAM is designed to collect, store and display various kinds of information and events from running Erlang systems. These data are accessed and managed through an AJAX-based Web Dashboard, and include the following.

Metrics. WombatOAM supports the collection of some hundred metrics (including, for instance, the number of processes on a node and the message traffic in and out of the node) on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom13, and interface with other tools, such as graphite14, which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications. As well as metrics, WombatOAM can support event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which is part of the standard distribution, or the lager logging framework15, and will be displayed and logged as they occur.

13. Folsom collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom
14. https://graphiteapp.org

Alarms. Alarms are more complex entities. Alarms have a state: they can be raised, and once dealt with they can be cleared. They also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as for notifications.

Topology. The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible, and if not, it notifies the other services and periodically tries to reconnect. When the nodes are available again, it also notifies the other services. It doesn't have a middle manager part, because it doesn't talk to the nodes directly; instead it asks the Node manager service to do so.

Node manager. This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration. This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers, the Orchestration service uses an external library called Libcloud16, for which Erlang Solutions has written an open source Erlang wrapper, called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications; it provides infrastructure for deploying them.

15. https://github.com/basho/lager

Deployment. The mechanism consists of the following five steps.

1. Registering a provider. WombatOAM provides the same interface for different cloud providers which support the OpenStack standard or the Amazon EC2 API. WombatOAM also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this will enable WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

16. The unified cloud API [4]: https://libcloud.apache.org

4. Defining a deployment domain. At this step a deployment domain is created, which specifies (i) which providers should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to or removed from the system, depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single-host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue, and processes migrate between the queues through a work stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes.17

iii1 Percept2

Percept218 builds on the existing Percept tool to provide post hoc offline analysis and visualisation of Erlang systems. Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling a distributed Erlang application using Percept2 is shown in Figure 22.

17. This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however they still need to gain insight into the behaviour of their programs to tune performance.
18. https://github.com/RefactoringTools/percept2

Figure 22: Offline profiling of a distributed Erlang application using Percept2.
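Assuming Percept2 mirrors Percept's profile/analyze/webserver entry points (an assumption; consult the Percept2 documentation), a typical offline session looks like the following sketch:

    percept2:profile("orbit.dat", {orbit, main, []}, [proc]),
    percept2:analyze(["orbit.dat"]),
    percept2:start_webserver(8888).
    %% then browse the results at http://localhost:8888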

Percept generates an application-level zoomable concurrency graph, showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways, as detailed by [53], including most importantly:

• Distinguishing between running and runnable time for each process: this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including: the migration history of a process between run queues; statistics about message passing between processes (the number of messages and the average message size sent/received by a process); the accumulated runtime per process (the accumulated time when a process is in a running state).

Figure 23: Percept2 showing processes running and runnable (with 4 schedulers/run queues).

• Presenting the process tree: the hierarchy structure indicating the parent-child relationships between processes.

• Recording dynamic function callgraph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system. Unlike global_group, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have also extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating measurement of a function's own execution time gives users the freedom of not profiling all the function calls invoked during the program execution. For example, they can choose to profile only functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph, a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of generated web pages. Together these make the system more responsive and more scalable.

iii2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, and require neither special recompiled versions of the software to be examined, nor special post-processing tools to create meaningful information from the data gathered. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the sizes of run queues vary during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Figure 24: DTrace run-queue visualisation of the bang benchmark with 16 schedulers (RunQueue0 to RunQueue15; time on x-axis).

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible. It uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the same storage infrastructure and presentation facilities.

iii3 Devo

Devo19 [10] is designed to provide real-time online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).

Figure 25 shows visualisations from devoin both modes On the left-hand side a sin-gle compute node is shown This consists oftwo physical chips (the upper and lower halvesof the diagram) with six cores each with hy-perthreading this gives twelve virtual coresand hence 24 run queues in total The size ofthese run queues is shown by both the colourand height of each column and process migra-tions are illustrated by (fading) arc between thequeues within the circle A green arc shows mi-

19httpsgithubcomRefactoringToolsdevo


Figure 25: Devo low- and high-level visualisations of SD-Orbit.

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between the nodes (green quiescent; red busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. Tracing to be performed on monitored nodes is also specified within the configuration file.

An agent is started by a master SD-Mon node for each s_group and for each free node. Configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up it tries to get in contact with its nodes and apply the tracing to them as stated by the master. Binary files are stored in the host file system. Tracing is internally used in order to track s_group operations happening at runtime; an asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.


Figure 26: SD-Mon architecture.


The monitoring network is also supervised in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected then that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution, before (top) and after eliminating s_group 2 (bottom).


Figure 28: SD-Mon online monitoring.

VIII Systemic Evaluation

Preceding sections have investigated improvements to individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI, and VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state-of-the-art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable²⁰. The bulk of the experiments reported here are conducted on the Athos cluster using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed-size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure.

²⁰ See Deliverable 6.2, available online: http://www.release-project.eu/documents/D6.2.pdf


Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling]. Relative speedup against number of nodes (cores), for D-Orbit and SD-Orbit with W=2M and W=5M.

The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impact of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.
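Concretely, for a fixed-size benchmark the relative speedup on n cores is S(n) = T(1)/T(n), where T(n) is the execution time on n cores; for a weak-scaling benchmark like ACO, where the input grows with the compute resources, a flat execution-time curve indicates perfect scaling.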

i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot the standard deviation. The results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups: beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails because some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5,000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been demand to do so, as most Erlang systems are long-running servers.


Figure 30: Wombat Deployment Time. Time (sec) against number of nodes (N); best fit 0.0125N + 83.

The measurement data shows two important facts: that WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and that WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, whereas SR-ACO registers them only within an s_group (Section ii).

For both GR-ACO and SR-ACO, a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common x86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark and obtained similar results [23].
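The language-level recovery mechanism being exercised here is ordinary OTP supervision; a minimal sketch of the kind of supervisor that restarts a killed worker (module and child names are illustrative assumptions, not the benchmark's actual code):

-module(colony_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% Restart any crashed ant worker individually, allowing up to
    %% 10 restarts per second before escalating the failure.
    {ok, {{one_for_one, 10, 1},
          [{ant_worker, {ant_worker, start_link, []},
            permanent, 5000, worker, [ant_worker]}]}}.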

We conclude that both GR-ACO and SR-ACO are reliable, and hence that SD Erlang preserves the distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.


Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling]. Execution time (s) against number of nodes (cores) for GR-ACO, SR-ACO and ML-ACO.


Scalability. Figure 31 compares the runtimes of the ML, GR, and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks, and others, that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.


Figure 32: Number of sent packets in ML-ACO, GR-ACO and SR-ACO.


As SD-Orbit scales better than D-Orbit, as SR-ACO scales better than ML-ACO, and as SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limit of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language, while preserving reliability. The work is a vade mecum for addressing scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering VM, language and persistent storage levels. We have developed the BenchErl and DE-Bench tools for this purpose.


Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15], or Legion [40]. We establish scientifically the folklore limitation of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups, for partitioning the network and global process namespace, and semi-explicit process placement, for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).

To improve the scalability of the Erlang VM and libraries we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article.


Table 2: Cluster Specifications (RAM per host in GB)

Name     Hosts   Cores      Cores       Cores   Processor                                  RAM      Interconnection
                 per host   max total   avail
GPG      20      16         320         320     Intel Xeon E5-2640v2, 8C, 2GHz             64       10GB Ethernet
Kalkyl   348     8          2784        1408    Intel Xeon 5520v2, 4C, 2.26GHz             24–72    InfiniBand 20Gb/s
TinTin   160     16         2560        2240    AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz   64–128   2:1 oversubscribed QDR Infiniband
Athos    776     24         18624       6144    Intel Xeon E5-2697v2, 12C, 2.7GHz          64       Infiniband FDR14

While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages²¹. For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3 GHz, each with 16 "Bulldozer" cores, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

²¹ See Deliverable 6.7 (http://www.release-project.eu/documents/D6.7.pdf), available online.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.
[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.
[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).
[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.
[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.
[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.
[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.
[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.
[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.
[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level Visualization of Concurrent and Distributed Computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.
[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.
[12] Basho Technologies. Riakdocs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking/.
[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.
[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.
[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216. ACM, 1995.
[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.
[17] István Bozó, Viktória Fördos, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Koszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.
[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.
[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.
[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.
[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.
[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördos, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.
[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.
[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.
[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.
[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.
[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.
[28] Alan A. A. Donovan and Brian W. Kernighan. The Go Programming Language. Addison-Wesley Professional, 2015.
[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.
[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.
[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.
[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.
[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.
[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Technical Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.
[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.
[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.
[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.
[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.
[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.
[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.
[41] Carl Hewitt. Actor model for discretionary, adaptive concurrency. CoRR, abs/1008.1459, 2010.
[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.
[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 1:1–1:1, New York, NY, USA, 2008. ACM.
[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.
[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. In ACM SIGPLAN Notices, volume 28, pages 91–108. ACM, 1993.
[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.
[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.
[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.
[49] Rusty Klophaus. Riak Core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.
[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.
[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiui.edu/parley.
[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.
[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, 2013.
[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.
[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.
[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.
[57] Frank Lubeck and Max Neunhoffer. Enumerating large Orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.
[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.
[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897/.
[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.
[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.
[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.
[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.
[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.
[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.
[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.
[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.
[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.
[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.
[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computing Society, 2015.
[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing: 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.
[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.
[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.
[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.
[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.
[76] Marcus Völker. Linux timers, May 2014.
[77] WhatsApp, 2015. https://www.whatsapp.com.
[78] Tom White. Hadoop: The Definitive Guide. Storage and Analysis at Internet Scale (3rd ed., revised and updated). O'Reilly, 2012.
[79] Ulf Wiger. Industrial-Strength Functional Programming: Experiences with the Ericsson AXD301 Project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.


Figure 8: DE-Bench's internal workflow.

global:unregister_name. (iii) Local commands, which are executed independently by a single node, e.g. whereis_name, a look-up in the local name table.

The benchmarking is conducted on 10 to 100 host configurations of the Kalkyl cluster (in steps of 10), and measures the throughput of successful commands per second over 5 minutes. There is one Erlang VM on each host, and one DE-Bench instance on each VM. The full paper [37] investigates the impact of data size and computation time in P2P calls, both independently and in combination, and the scaling properties of the common Erlang/OTP generic server processes gen_server and gen_fsm.

Here we focus on the impact of different proportions of global commands, mixing global with P2P and local commands. Figure 9 shows that even a low proportion of global commands limits the scalability of distributed Erlang: e.g. just 0.01% global commands limits scalability to around 60 nodes. Figure 10 reports the latency of all commands, and shows that while the latencies for P2P and local commands are stable at scale, the latency of the global commands increases dramatically with scale. Both results illustrate that the impact of global operations on throughput and latency in a distributed Erlang system is severe.

Explicit Placement. While network connectivity and global information impact performance at scale, our investigations also identified explicit process placement as a programming issue at scale. Recall from Section iii that distributed Erlang requires the programmer to identify an explicit Erlang node (VM) when spawning a process. Identifying an appropriate node becomes a significant burden for large and dynamic systems. The problem is exacerbated in large distributed systems where (1) the hosts may not be identical, having different hardware capabilities or different software installed, and (2) communication times may be non-uniform: it may be fast to send a message between VMs on the same host, and slow if the VMs are on different hosts in a large distributed system.

These factors make it difficult to deploy applications, especially in a scalable and portable manner. Moreover, while the programmer may be able to use platform-specific knowledge to decide where to spawn processes so that an application runs efficiently, if the application is then deployed on a different platform, or if the platform changes as hosts fail or are added, this knowledge becomes outdated.

Discussion. Our investigations confirm three language-level scalability limitations of Erlang from developer folklore. (1) Maintaining a fully connected network of Erlang nodes limits scalability; for example, Orbit is typically limited to just 40 nodes. (2) Global operations, and crucially the global operations required for reliability, i.e. to maintain a global namespace, seriously limit the scalability of distributed Erlang systems. (3) Explicit process placement makes it hard to build performance-portable applications for large architectures. These issues cause designers of reliable large scale systems in Erlang to depart from the standard Erlang model, e.g. using techniques like hidden nodes and storing pids in data structures. In Section V we develop language technologies to address these issues.


Figure 9: Scalability vs. percentage of global commands in Distributed Erlang.


iii Persistent Storage

Any large scale system needs reliable, scalable persistent storage, and we have studied the scalability limits of Erlang persistent storage alternatives [38]. We envisage a typical large server having around 10⁵ cores on around 100 hosts. We have reviewed the requirements for scalable and available persistent storage, and evaluated four popular Erlang DBMS against these requirements. For a target scale of around 100 hosts, Mnesia and CouchDB are, unsurprisingly, not suitable. However, Dynamo-style NoSQL DBMS like Cassandra and Riak have the potential to be.

We have investigated the current scalability limits of the Riak NoSQL DBMS, using the Basho Bench benchmarking framework on a cluster with up to 100 nodes and independent disks. We found that the scalability limit of Riak version 1.1.1 is 60 nodes on the Kalkyl cluster. The study placed into the public scientific domain what was previously well-evidenced but anecdotal developer experience.

We have also shown that resources like memory, disk and network do not limit the scalability of Riak. By instrumenting the global and gen_server OTP libraries, we identified a specific Riak remote procedure call that fails to scale. We outline how later releases of Riak are refactored to eliminate the scalability bottlenecks.


Figure 10: Latency of commands as the number of Erlang nodes increases.

Discussion. We conclude that Dynamo-like NoSQL DBMSs have the potential to deliver reliable persistent storage for Erlang at our target scale of approximately 100 hosts. Specifically, an Erlang Cassandra interface is available, and Riak 1.1.1 already provides scalable and available persistent storage on 60 nodes. Moreover, the scalability of Riak is much improved in subsequent versions.

V Improving Language Scalability

This section outlines the Scalable Distributed (SD) Erlang libraries [21] that we have designed and implemented to address the distributed Erlang scalability issues identified in Section ii. SD Erlang introduces two concepts to improve scalability. S_groups partition the set of nodes in an Erlang system to reduce network connectivity and partition global data (Section i.1). Semi-explicit placement alleviates the issues of explicit process placement in large heterogeneous networks (Section i.2). The two features are independent and can be used separately or in combination. We overview SD Erlang in Section i, and outline s_group semantics and validation in Sections ii and iii respectively.

i SD Erlang Design

i.1 S_groups

S_groups reduce both the number of connections a node maintains and the size of name spaces, i.e. they minimise global information. Specifically, names are registered on, and synchronised between, only the nodes within the s_group. An s_group has the following parameters: a name, a list of nodes, and a list of registered names. A node can belong to many s_groups or to none. If a node belongs to no s_group, it behaves as a usual distributed Erlang node.

The s_group library defines the functions shown in Table 1. Some of these functions manipulate s_groups and provide information about them, such as creating s_groups and providing a list of nodes from a given s_group. The remaining functions manipulate names registered in s_groups and provide information about these names. For example, to register a process Pid with name Name in s_group SGroupName we use the function below; the name will only be registered if the process is being executed on a node that belongs to the given s_group, and neither Name nor Pid are already registered in that group.


Table 1: Summary of s_group functions

Function                                     Description
new_s_group(SGroupName, Nodes)               Creates a new s_group consisting of some nodes
delete_s_group(SGroupName)                   Deletes an s_group
add_nodes(SGroupName, Nodes)                 Adds a list of nodes to an s_group
remove_nodes(SGroupName, Nodes)              Removes a list of nodes from an s_group
s_groups()                                   Returns a list of all s_groups known to the node
own_s_groups()                               Returns a list of s_group tuples the node belongs to
own_nodes()                                  Returns a list of nodes from all s_groups the node belongs to
own_nodes(SGroupName)                        Returns a list of nodes from the given s_group
info()                                       Returns s_group state information
register_name(SGroupName, Name, Pid)         Registers a name in the given s_group
re_register_name(SGroupName, Name, Pid)      Re-registers a name (changes a registration) in a given s_group
unregister_name(SGroupName, Name)            Unregisters a name in the given s_group
registered_names({node, Node})               Returns a list of all registered names on the given node
registered_names({s_group, SGroupName})      Returns a list of all registered names in the given s_group
whereis_name(SGroupName, Name)               Return the pid of a name registered in the given
whereis_name(Node, SGroupName, Name)         s_group
send(SGroupName, Name, Msg)                  Send a message to a name registered in the given
send(Node, SGroupName, Name, Msg)            s_group


s_group:register_name(SGroupName, Name, Pid) -> yes | no
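For illustration, a minimal session sketch using the functions of Table 1 (the node names, group name and spawned fun are illustrative assumptions):

%% Create a two-node s_group from a node that will belong to it:
{ok, group1, Nodes} =
    s_group:new_s_group(group1, ['node1@host1', 'node2@host2']),
%% Spawn a process and register its name within the s_group only:
Pid = spawn('node1@host1', fun() -> receive stop -> ok end end),
yes = s_group:register_name(group1, my_server, Pid),
%% The name is visible inside the s_group, not globally:
Pid = s_group:whereis_name(group1, my_server).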

To illustrate the impact of s_groups on scalability we repeat the global operations experiment from Section ii (Figure 9). In the SD Erlang experiment we partition the set of nodes into s_groups, each containing ten nodes, and hence the names are replicated and synchronised on just ten nodes, and not on all nodes as in distributed Erlang. The results in Figure 11 show that with 0.01% of global operations the throughput of distributed Erlang stops growing at 40 nodes, while the throughput of SD Erlang continues to grow linearly.

The connection topology of s_groups is extremely flexible: they may be organised into a hierarchy of arbitrary depth or branching, e.g. there could be multiple levels in the tree of s_groups; see Figure 12. Moreover, it is not necessary to create a hierarchy of s_groups; for example, we have constructed an Orbit implementation using a ring of s_groups.


Figure 11: Global operations in Distributed Erlang vs SD Erlang. Throughput against number of nodes.


Given such a flexible way of organising distributed systems, key questions in the design of an SD Erlang system are the following. How should s_groups be structured? Depending on the reason the nodes are grouped – reducing the number of connections, or reducing the namespace, or both – s_groups can be freely structured as a tree, ring, or some other topology. How large should the s_groups be? Smaller s_groups mean more inter-group communication, but the synchronisation of the s_group state between the s_group nodes constrains the maximum size of s_groups. We have not found this constraint to be a serious restriction; for example, many s_groups are either relatively small, e.g. 10-node, internal or terminal elements in some topology, e.g. the leaves and nodes of a tree. How do nodes from different s_groups communicate? While any two nodes can communicate in an SD Erlang system, to minimise the number of connections, communication between nodes from different s_groups is typically routed via gateway nodes that belong to both s_groups. How do we avoid single points of failure? For reliability, and to minimise communication load, multiple gateway nodes and processes may be required.

Information to make these design choices is provided by the tools in Section VII, and by benchmarking. A further challenge is how to systematically refactor a distributed Erlang application into SD Erlang, and this is outlined in Section i. A detailed discussion of distributed system design and refactoring in SD Erlang is provided in a recent article [22].

We illustrate typical SD Erlang system designs by showing refactorings of the Orbit and ACO benchmarks from Section III. In both distributed and SD Erlang, the computation starts on the Master node, and the actual computation is done on the Worker nodes. In the distributed Erlang version all nodes are interconnected, and messages are transferred directly from the sending node to the receiving node (Figure 2). In contrast, in the SD Erlang version nodes are grouped into s_groups, and messages are transferred between different s_groups via Sub-master nodes (Figure 13).


Figure 12: SD Erlang ACO (SR-ACO) architecture. A master node at level 0 connects to s_groups of colony nodes (node1..node16) at levels 1 and 2; the diagram distinguishes the master process, sub-master processes, colony processes, ant processes, s_groups and nodes.

A fragment of code that creates an s_group on a Sub-master node is as follows:

create_s_group(Master, GroupName, Nodes0) ->
    case s_group:new_s_group(GroupName, Nodes0) of
        {ok, GroupName, Nodes} ->
            Master ! {GroupName, Nodes};
        _ -> io:format("exception: message~n")
    end.

Similarly, we introduce s_groups in the GR-ACO benchmark from Section ii to create Scalable Reliable ACO (SR-ACO); see Figure 12. Here, apart from reducing the number of connections, s_groups also reduce the global namespace information. That is, instead of registering the name of a pid globally, i.e. with all nodes, the name is registered only on all nodes in the s_group, with s_group:register_name/3.

A comparative performance evaluation of distributed Erlang and SD Erlang Orbit and ACO is presented in Section VIII.

i.2 Semi-Explicit Placement

Recall from Section iii that distributed Erlang spawns a process onto an explicitly named Erlang node, e.g.:

spawn(some_node, fun some_module:pong/0)

and also recall the portability and programming effort issues associated with such explicit placement in large scale systems, discussed in Section ii.

To address these issues we have developed a semi-explicit placement library that enables the programmer to select nodes on which to spawn processes based on run-time information about the properties of the nodes. For example, if a process performs a lot of computation one would like to spawn it on a node with considerable computation power, or if two processes are likely to communicate frequently then it would be desirable to spawn them on the same node, or on nodes with a fast interconnect.

We have implemented two Erlang libraries to support semi-explicit placement [60]. The first deals with node attributes, and describes properties of individual Erlang VMs and associated hosts, such as total and currently available RAM, installed software, hardware configuration, etc. The second deals with a notion of communication distances, which models the communication times between nodes in a distributed system. Therefore, instead of specifying a node, we can use the attr:choose_node/1 function to define the target node, i.e.:

spawn(attr:choose_node(Params),
      fun some_module:pong/0)

Figure 13: SD Erlang (SD-Orbit) architecture.

[60] report an investigation into the communication latencies on a range of NUMA and cluster architectures, and demonstrate the effectiveness of the placement libraries using the ML-ACO benchmark on the Athos cluster.

ii S_group Semantics

For precise specification, and as a basis for validation, we provide a small-step operational semantics of the s_group operations [21]. Figure 14 defines the state of an SD Erlang system and associated abstract syntax variables. The abstract syntax variables on the left are defined as members of sets, denoted {}, and these in turn may contain tuples, denoted (), or further sets. In particular, nm is a process name, p a pid, ni a node_id, and nis a set of node_ids. The state of a system is modelled as a four-tuple comprising a set of s_groups, a set of free_groups, a set of free_hidden_groups, and a set of nodes. Each type of group is associated with nodes and has a namespace. An s_group additionally has a name, whereas a free_hidden_group consists of only one node: a hidden node simultaneously acts as a node and as a group, because as a group it has a namespace but does not share it with any other node. Free normal and hidden groups have no names, and are uniquely defined by the nodes associated with them. Therefore, group names, gr_names, are either NoGroup or a set of s_group_names. A namespace is a set of name and process id (pid) pairs, and is replicated on all nodes of the associated group.

A node has the following four parameters: a node_id identifier; a node_type that can be either hidden or normal; connections; and group_names, i.e. names of groups the node belongs to. The node can belong to either a list of s_groups or one of the free groups; the type of the free group is defined by the node type. Connections are a set of node_ids.

Transitions in the semantics have the form (state, command, ni) −→ (state′, value), meaning that executing command on node ni in state returns value and transitions to state′.

The semantics is presented in more detail by [21], but we illustrate it here with the s_group:registered_names/1 function from Section i.1. The function returns a list of names registered in s_group s if node ni belongs to the s_group, and an empty list otherwise (Figure 15). Here ⊕ denotes disjoint set union; IsSGroupNode returns true if node ni is a member of some s_group s, and false otherwise; and OutputNms returns a set of process names registered in the ns namespace of s_group s.


(grs, fgs, fhs, nds) ∈ state ≡
    ({s_group}, {free_group}, {free_hidden_group}, {node})
gr ∈ grs ≡ {s_group} ≡ {(s_group_name, {node_id}, namespace)}
fg ∈ fgs ≡ {free_group} ≡ {({node_id}, namespace)}
fh ∈ fhs ≡ {free_hidden_group} ≡ {(node_id, namespace)}
nd ∈ nds ≡ {node} ≡ {(node_id, node_type, connections, gr_names)}
gs ∈ gr_names ≡ {NoGroup, {s_group_name}}
ns ∈ namespace ≡ {(name, pid)}
cs ∈ connections ≡ {node_id}
nt ∈ node_type ≡ {Normal, Hidden}
s ∈ {NoGroup, s_group_name}

Figure 14: SD Erlang state [21].


iii Semantics Validation

As the semantics is concrete, it can readily be made executable in Erlang, with lists replacing sets throughout. Having an executable semantics allows users to engage with it, and to understand how the semantics behaves vis à vis the library, giving them an opportunity to assess the correctness of the library against the semantics.

Figure 16: Testing s_groups using QuickCheck.


Better still, we can automatically assess how the system behaves in comparison with the (executable) semantics by executing them in lockstep, guided by the constraints of which operations are possible at each point. We do that by building an abstract state machine model of the library. We can then generate random sequences (or traces) through the model, with appropriate library data generated too. This random generation is supported by the QuickCheck property-based testing system [24, 9].

The architecture of the testing framework is shown in Figure 16. First, an abstract state machine, embedded as an "eqc_statem" module, is derived from the executable semantic specification. The state machine defines the abstract state representation and the transition from one state to another when an operation is applied. Test case and data generators are then defined to control the test case generation; this includes the automatic generation of eligible s_group operations and the input data to those operations. Test oracles are encoded as the postcondition for s_group operations.


((grs, fgs, fhs, nds), s_group:registered_names(s), ni)
    −→ ((grs, fgs, fhs, nds), nms)    if IsSGroupNode(ni, s, grs)
    −→ ((grs, fgs, fhs, nds), {})     otherwise

where {(s, {ni} ⊕ nis, ns)} ⊕ grs′ ≡ grs
      nms ≡ OutputNms(s, ns)

IsSGroupNode(ni, s, grs) = ∃ nis, ns, grs′. {(s, {ni} ⊕ nis, ns)} ⊕ grs′ ≡ grs

OutputNms(s, ns) = {(s, nm) | (nm, p) ∈ ns}

Figure 15: SD Erlang semantics of the s_group:registered_names/1 function.


During testing, each test command is applied to both the abstract model and the s_group library. The application of the test command to the abstract model takes the abstract model from its current state to a new state, as described by the transition functions, whereas the application of the test command to the library leads the system to a new actual state. The actual state information is collected from each node in the distributed system, then merged and normalised to the same format as the abstract state representation. For a test to be successful, after the execution of a test command the test oracles specified for this command should be satisfied. Various test oracles can be defined for s_group operations; for instance, one generic constraint that applies to all the s_group operations is that after each s_group operation the normalised system state should be equivalent to the abstract state.
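To make the shape of such a model concrete, the following is a minimal sketch of an eqc_statem-style module. The callback names are the standard Quviq QuickCheck state-machine API; the state representation, generators and modelled command set are illustrative assumptions, not the project's actual test code.

-module(s_group_eqc).
-include_lib("eqc/include/eqc.hrl").
-include_lib("eqc/include/eqc_statem.hrl").
-export([initial_state/0, command/1, precondition/2,
         postcondition/3, next_state/3]).

%% Abstract state: a simplified model, here a list of
%% {GroupName, Nodes} pairs rather than the full four-tuple.
initial_state() -> [].

%% Generate an eligible command together with generated data.
command(_S) ->
    oneof([{call, s_group, new_s_group, [group_name(), node_ids()]},
           {call, s_group, registered_names,
            [{s_group, group_name()}]}]).

precondition(_S, _Call) -> true.

%% Oracle: the library's answer must match the abstract model.
postcondition(S, {call, s_group, registered_names, [{s_group, G}]}, Res) ->
    Res =:= names_in_model(S, G);
postcondition(_S, _Call, _Res) -> true.

%% Transition function mirroring the operational semantics.
next_state(S, _Res, {call, s_group, new_s_group, [G, Nodes]}) ->
    [{G, Nodes} | S];
next_state(S, _Res, _Call) -> S.

%% Illustrative generators and model lookup.
group_name() -> elements([group1, group2]).
node_ids()   -> list(elements(['node1@host', 'node2@host'])).
names_in_model(_S, _G) -> [].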

Thousands of tests were run, and three kinds of errors, which have subsequently been corrected, were found. Some errors in the library implementation were found, including one error due to the synchronisation between nodes, and another related to the remove_nodes operation, which erroneously raised an exception. We also found a couple of trivial errors in the semantic specification itself, which had been missed by manual examination. Finally, we found some situations where there were inconsistencies between the semantics and the library implementation, despite their states being equivalent; an example of this was in the particular values returned by functions on certain errors. Overall, the automation of testing boosted our confidence in the correctness of the library implementation and the semantic specification. This work is reported in more detail by [54].

VI Improving VM Scalability

This section reports the primary VM and library improvements we have designed and implemented to address the scalability and reliability issues identified in Section i.


i Improvements to Erlang Term Storage

Because ETS tables are so heavily used in Erlang systems, they are a focus for scalability improvements. We start by describing their redesign, including some improvements that pre-date our RELEASE project work, i.e. those prior to Erlang/OTP R15B03. These historical improvements are very relevant for a scalability study, and form the basis for our subsequent changes and improvements. At the point when Erlang/OTP got support for multiple cores (in release R11B) there was a single reader-writer lock for each ETS table. Optional fine-grained locking of hash-based ETS tables (i.e. set, bag or duplicate_bag tables) was introduced in Erlang/OTP R13B02-1, adding 16 reader-writer locks for the hash buckets. Reader groups to minimise read synchronisation overheads were introduced in Erlang/OTP R14B. The key observation is that a single count of the multiple readers must be synchronised across many cache lines, potentially far away in a NUMA system. Maintaining reader counts in multiple (local) caches makes reads fast, although writes must now check every reader count. In Erlang/OTP R16B the number of bucket locks and the default number of reader groups were both upgraded from 16 to 64.
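These mechanisms are enabled per table via standard ets:new/2 options; a minimal sketch:

    %% Create a hash-based table that uses bucket locks for writes
    %% and reader groups for reads.
    T = ets:new(my_table,
                [set, public,
                 {write_concurrency, true},   %% fine-grained bucket locks
                 {read_concurrency, true}]),  %% reader groups
    true = ets:insert(T, {key, 42}),
    [{key, 42}] = ets:lookup(T, key).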

We illustrate the scaling properties of the ETS concurrency options using the ets_bench BenchErl benchmark on the Intel NUMA machine with 32 hyperthreaded cores specified in Appendix A. The ets_bench benchmark inserts 1M items into the table, then records the time to perform 17M operations, where an operation is either a lookup, an insert, or a delete. The experiments are conducted on a hash-based (set) ETS table with different percentages of update operations, i.e. insertions or deletions.

Figure 17 shows the runtimes in seconds of 17M operations in different Erlang/OTP versions, varying the number of schedulers (x-axis), reflecting how the scalability of ETS tables has improved in more recent Erlang/OTP releases.

Figure 17: Runtime of 17M ETS operations in Erlang/OTP releases R11B-5, R13B02-1, R14B and R16B: 10% updates (left) and 1% updates (right). The y-axis shows time in seconds (lower is better), and the x-axis is the number of OS threads (VM schedulers).

Figure 18 shows the runtimes in seconds of 17M operations on an ETS table with different numbers of reader groups, again varying the number of schedulers. We see that one reader group is not sufficient with 10% updates, nor are two with 1% updates. Beyond that, different numbers of reader groups have little impact on the benchmark performance, except that using 64 groups with 10% updates slightly degrades performance.

We have explored four other extensions or redesigns in the ETS implementation for better scalability: (1) allowing more programmer control over the number of bucket locks in hash-based tables, so the programmer can reflect the number of schedulers and the expected access pattern; (2) using contention-adapting trees to get better scalability for ordered_set ETS tables, as described by [69]; (3) using queue delegation locking to improve scalability [47]; and (4) adopting schemes for completely eliminating the locks in the meta table. A more complete discussion of our work on ETS can be found in the papers by [69] and [47].


Figure 18: Runtime of 17M ETS operations with varying numbers of reader groups (1 to 64): 10% updates (left) and 1% updates (right). The y-axis shows runtime in seconds (lower is better), and the x-axis is the number of schedulers.

Here we outline only our work on contention-adapting (CA) trees. A CA tree monitors contention in different parts of a tree-shaped data structure, introducing routing nodes with locks in response to high contention, and removing them in response to low contention. For experimental purposes, two variants of the CA tree have been implemented to represent ordered_sets in the virtual machine of Erlang/OTP 17.0.

Figure 19: Throughput (operations per microsecond, y-axis) of CA tree variants (AVL-CA tree and Treap-CA tree) against set and ordered_set, varying the number of threads (x-axis): 10% updates (left) and 1% updates (right).

One extends the existing AVL trees in the Erlang VM, and the other uses a Treap data structure [5]. Figure 19 compares the throughput of the CA tree variants with that of ordered_set and set as the number of schedulers increases. It is unsurprising that the CA trees scale so much better than an ordered_set, which is protected by a single readers-writer lock. It is more surprising that they also scale better than set. This is due to hash tables using fine-grained locking at a fixed granularity, while CA trees can adapt the number of locks to the current contention level, and also to the parts of the key range where contention is occurring.
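For illustration, a CA-tree-backed ordered_set is obtained as below; note that write_concurrency only selects a CA tree in stock Erlang/OTP from release 22 onwards, after this work was incorporated upstream (earlier releases ignore the option for ordered_set tables):

    %% ordered_set with write_concurrency: backed by a
    %% contention-adapting tree in OTP 22+.
    T = ets:new(ca_demo, [ordered_set, public,
                          {write_concurrency, true}]),
    true = ets:insert(T, [{K, K * K} || K <- lists:seq(1, 1000)]),
    [{500, 250000}] = ets:lookup(T, 500).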


ii Improvements to Schedulers

In the Erlang VM a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until either it blocks waiting for input, typically a message from some other process, or it has executed its quota of reductions.

The Erlang VM is usually started with one scheduler per logical core (SMT-thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned it is placed in the run queue of the scheduler of its parent, and it waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores; that is, an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Figure 24.

The default load management mechanism is load compaction, which aims to keep as many scheduler threads as possible fully loaded with work, i.e. it attempts to ensure that scheduler threads do not run out of work. We have developed a new optional scheduler utilisation balancing mechanism that is available from Erlang/OTP 17.0. The new mechanism aims to balance scheduler utilisation between schedulers; that is, it strives for equal scheduler utilisation on all schedulers.
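The mechanism is off by default; the sketch below shows how it is enabled and how per-scheduler utilisation can then be observed (both the +sub emulator flag and the statistics call are stock Erlang/OTP):

    %% Start the VM with scheduler utilisation balancing enabled:
    %%   erl +sub true
    %% Observe per-scheduler utilisation from Erlang code, as a list of
    %% {SchedulerId, ActiveTime, TotalTime} tuples:
    erlang:system_flag(scheduler_wall_time, true),
    timer:sleep(1000),
    SchedTimes = erlang:statistics(scheduler_wall_time).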

The scheduler utilisation balancing mechanism has no performance impact on the system when not enabled. On the other hand, when enabled, it results in changed timing in the system: normally there is a small overhead due to measuring utilisation and calculating balancing information, which depends on the underlying primitives provided by the operating system.

The new balancing mechanism results in a better distribution of processes to schedulers, reducing the probability of core contention. Together with other VM improvements, such as interruptable BIFs and garbage collection, it results in lower latency and improved responsiveness, and hence reliability, for soft real-time applications.

iii Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel benchmark in Section i. The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in that returns "Erlang system time", or time since Epoch, with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, i.e. SMT-enabled. The documentation promises that values returned by it are strictly increasing, and many applications ended up relying on this; for example, applications often employ erlang:now/0 to generate unique integers.

Erlang system time should align with the operating system's view of time since Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot, without invalidating the strictly increasing value guarantee.


The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when these two have deviated, e.g. in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], i.e. a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards and backwards, while system time is allowed to do this. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.
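The shape of the new API is sketched below; all calls are stock Erlang/OTP 18 built-ins:

    %% Erlang system time = Erlang monotonic time + a varying offset.
    Mono   = erlang:monotonic_time(),
    Offset = erlang:time_offset(),
    SysT   = Mono + Offset,   %% corresponds to erlang:system_time()
    %% Scalable replacement for the unique-integer idiom built on erlang:now/0:
    Unique = erlang:unique_integer([monotonic, positive]).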

Time Retrieval. Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, monotonic time delivered by the operating system is solely based on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP); that is, they will run with different frequencies. Linux is an exception, with a monotonic clock that is NTP adjusted and runs with the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing monotonic time and system time delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute. If the adjustment needs to be changed, new adjustment information is published and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the time adjustment information previously published, and calculates Erlang monotonic time. To preserve monotonicity it is important that all threads that read the same OS monotonic time map this to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed. This means that in the vast majority of cases the RW lock will be read-locked, which allows multiple readers to run concurrently. To prevent bouncing the lock cache-line, we use a bespoke reader-optimised RW lock implementation, where reader threads notify about their presence in counters on separate cache-lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11]; alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].

Timer Wheel and BIF Timer. The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel, which is used by processes executing on the scheduler.

The implementation of timers in Erlang/OTP uses a built-in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS tables, and only insert one timer at a time into the timer wheel.


Benchmarks. We have measured several benchmarks on a 16-core Bulldozer machine with eight dual CPU AMD Opteron 4376 HEs5.

We present three of them here.

The first micro benchmark compares the execution time of an Erlang receive with that of a receive ... after that specifies a timeout and provides a default value. The receive ... after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, total execution time with the optimised timers is only 5% longer than without timers.
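The two code shapes compared by this benchmark are sketched below (the timeout and default value are illustrative):

    %% Plain receive: no timer is involved.
    plain_wait() ->
        receive Msg -> Msg end.

    %% receive ... after: a timer is set when the process blocks,
    %% and cancelled when a message arrives.
    wait_with_timeout() ->
        receive
            Msg -> Msg
        after 1000 ->
            default_value
        end.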

The second micro benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4 and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0.

5 See §2.5.4 of the RELEASE project Deliverable 2.4 (http://release-project.eu/documents/D24.pdf).

Figure 20: BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases: runtimes (left) and speedup obtained using erlang:monotonic_time/0 (right).

The graph on the left shows that: (1) the performance of time management has remained roughly unchanged between Erlang/OTP releases prior to 18.0; (2) the improved time management in Erlang/OTP 18.x makes time management less likely to be a scalability bottleneck, even when using erlang:now/0; and (3) the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution. The graph on the right side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

VII Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable,


to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling and debugging Erlang systems, which can be used separately or together as appropriate for solving the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on or specialise the services offered by the Erlang virtual machine through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer6 application gives a comprehensive overview of many of these data on a node-by-node basis.

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature; that is, the tools support the language itself rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use the tracing support provided by the Erlang VM, so other actor frameworks, e.g. Akka, can use the Kamon7 JVM monitoring system. Similarly, tools for other actor languages or frameworks could use data derived through OS-level

6 http://www.erlang.org/doc/apps/observer
7 http://kamon.io

tracing frameworks: DTrace8 and SystemTap9 probes, as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

i Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare it for modification or extension, or, as is the case here, in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support, in a refactoring tool. There are a number of tools that support refactoring in Erlang; in the RELEASE project we have chosen to extend Wrangler10 [56]; other tools include Tidier [68] and RefactorErl [44].

Supporting API Migration. The SD Erlang libraries modify Erlang's global_group library, becoming the new s_group library as a result; Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolves and this often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be

8 http://dtrace.org/blogs/about
9 https://sourceware.org/systemtap/wiki
10 http://www.cs.kent.ac.uk/projects/wrangler


supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically. The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work, and the technicalities of the implementation, can be found in a paper by [52].
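As an illustration, such an adaptor might define an old global_group function directly in terms of its s_group counterpart; the module below is a hypothetical sketch, not the shipped adaptor, and the s_group call shape is an assumption:

    -module(global_group_adaptor).
    -export([registered_names/1]).

    %% Hypothetical sketch: express the old global_group call in terms
    %% of the new s_group API. Wrangler generates the client-code
    %% migration refactoring from definitions of this shape.
    registered_names({node, Node}) ->
        s_group:registered_names({node, Node}).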

Support for Introducing Parallelism. We have introduced support for parallelising explicit list operations (map and foreach), for process introduction to complete a computationally intensive task in parallel, for introducing a worker process to deal with call handling in an Erlang "generic server", and for parallelising a tail-recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach. The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward; even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
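A minimal sketch of the parallel map pattern in the style of para_lib (the library's actual interface may differ):

    %% Spawn one process per element; collect results in list order,
    %% so the result matches lists:map(F, Xs).
    pmap(F, Xs) ->
        Parent = self(),
        Refs = [begin
                    Ref = make_ref(),
                    spawn_link(fun() -> Parent ! {Ref, F(X)} end),
                    Ref
                end || X <- Xs],
        [receive {Ref, R} -> R end || Ref <- Refs].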

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute a task in parallel with its parent process. The result of the new process is sent back to the parent process, which will then consume it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.
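The code shape produced by the refactoring is sketched below; heavy_task/0, other_work/0 and combine/2 are hypothetical:

    with_worker() ->
        Parent = self(),
        Ref = make_ref(),
        spawn(fun() -> Parent ! {Ref, heavy_task()} end),
        Other = other_work(),                 %% runs in parallel with the task
        Result = receive {Ref, R} -> R end,   %% placed just before Result is needed
        combine(Other, Result).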

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies. For instance, an example might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these checks are carried out by program slicing, which is discussed below.
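For example, the accumulating parameter below threads a data dependency through every recursive call, so the function cannot be rewritten as a plain map:

    sum_squares([], Acc)       -> Acc;
    sum_squares([X | Xs], Acc) -> sum_squares(Xs, Acc + X * X).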

Support for Program Slicing. Program slicing is a general technique of program analysis for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE11 [17], a tool developed in another EU project that also re-uses the Wrangler front end. This work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing in transforming programs to make them suitable for the introduction of parallel structures.

11 http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf


Figure 21: The architecture of WombatOAM.

ii Scalable Deployment

We have developed the WombatOAM tool12 to provide a deployment, operations and maintenance framework for large-scale Erlang distributed systems. These systems typically consist of a number of Erlang nodes executing on different hosts. These hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP.

12 WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd.

Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this is the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture. The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability because of the role played by the central Master node; in its current version an additional layer of Middle Managers was introduced to allow the system to scale easily to thousands of deployed nodes.


As the diagram shows, the "northbound" interfaces to the web dashboard and the command-line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms and so forth; we describe those now.

Services. WombatOAM is designed to collect, store and display various kinds of information and events from running Erlang systems; these data are accessed and managed through an AJAX-based Web Dashboard, and include the following:

Metrics: WombatOAM supports the collection of some hundred metrics, including for instance the number of processes on a node and the message traffic in and out of the node, on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom13, and interface with other tools, such as Graphite14, which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications: As well as metrics, it can support event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which is part of the standard distribution, or the lager logging framework15, and will be displayed and logged as they occur.

13 Folsom collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom
14 https://graphiteapp.org

Alarms: Alarms are more complex entities. Alarms have a state: they can be raised, and once dealt with they can be cleared. They also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as for notifications.

Topology: The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible and, if not, it notifies the other services and periodically tries to reconnect. When the nodes are available again, it also notifies the other services. It doesn't have a middle manager part because it doesn't talk to the nodes directly; instead it asks the Node manager service to do so.

Node manager: This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration: This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers, the Orchestration

15 https://github.com/basho/lager


service uses an external library called Libcloud16, for which Erlang Solutions has written an open source Erlang wrapper, called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications; it provides infrastructure for deploying them.

Deployment. The mechanism consists of the following five steps:

1. Registering a provider. WombatOAM provides the same interface for different cloud providers that support the OpenStack standard or the Amazon EC2 API. WombatOAM also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this enables WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

16 The unified cloud API [4]: https://libcloud.apache.org

4. Defining a deployment domain. At this step a deployment domain is created that specifies (i) which providers should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to or removed from the system, depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single-host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue and processes migrate between the queues through a work stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes.17

iii.1 Percept2

Percept218 builds on the existing Percept tool to provide post hoc offline analysis and visualisation of Erlang systems.

17 This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.

18 https://github.com/RefactoringTools/percept2


Figure 22: Offline profiling of a distributed Erlang application using Percept2.

Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling for a distributed Erlang application using Percept2 is shown in Figure 22.

Percept generates an application-level zoomable concurrency graph, showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways, as detailed by [53], including most importantly:

• Distinguishing between running and runnable time for each process: this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including the migration history of a process between run queues; statistics about message passing between processes (the number of messages and the average message size sent/received by a process); the accumulated runtime per process; and the accumulated time when a process is in a running state.

• Presenting the process tree: the hierarchy structure indicating the parent-child relationships between processes.


Figure 23: Percept2 showing processes running and runnable (with 4 schedulers/run queues).

• Recording the dynamic function call graph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system. Unlike global_group, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have also extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating measurement of a function's own execution time gives users the freedom of not profiling all the function calls invoked during the program execution. For example, they can choose to profile only functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph, a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of generated web pages. Together these make the system more responsive and more scalable.


iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, do not require special recompiled versions of the software to be examined, nor special post-processing tools to create meaningful information from the data gathered. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the size of run queues varies during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the same storage infrastructure and presentation facilities.

Figure 24: DTrace run-queue visualisation of the bang benchmark with 16 schedulers (run-queue sizes on the y-axis, time on the x-axis).

iii.3 Devo

Devo19 [10] is designed to provide real-time online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).

Figure 25 shows visualisations from Devo in both modes. On the left-hand side a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram), with six cores each; with hyperthreading this gives twelve virtual cores, and hence 24 run queues in total. The size of these run queues is shown by both the colour and height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle. A green arc shows migration on the same physical core, a grey one on the same chip, and blue shows migrations between the two chips.

19 https://github.com/RefactoringTools/devo


Figure 25: Devo low- and high-level visualisations of SD-Orbit.

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between the nodes (green quiescent; red busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. Tracing to be performed on monitored nodes is also specified within the configuration file.

An agent is started by a master SD-Mon node for each s_group and for each free node. Configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up it tries to get in contact with its nodes and apply the tracing to them, as stated by the master. Binary files are stored in the host file system. Tracing is internally used in order to track s_group operations happening at runtime. An asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.


Figure 26: SD-Mon architecture.

The monitoring network is also supervised, in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role; there are also periodic consistency checks for the system as a whole, and when an inconsistency is detected then that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution: before (top) and after eliminating s_group 2 (bottom).


Figure 28: SD-Mon online monitoring.

VIII Systemic Evaluation

Preceding sections have investigated the improvements of individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI and VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state-of-the-art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable20. The bulk of the experiments reported here are conducted on the Athos cluster, using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed-size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure. The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

20 See Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf


Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling].

i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot standard deviation. Results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups, i.e. beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails, due to the fact that some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.


Figure 30: Wombat deployment time against number of nodes N; the best fit to the experimental data is 0.0125N + 83 seconds.

The measurement data shows two important facts: it shows that WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and that WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, and SR-ACO registers them only within an s_group (Section ii).

For both GR-ACO and SR-ACO a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common X86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, so around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark and obtained similar results [23].

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the distributed Erlang reliability model.


Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling].

The remainder of the section outlines the impact on scalability of maintaining the recovery data required for reliability.

Scalability. Figure 31 compares the runtimes of the ML, GR and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and the fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks and others that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.


Figure 32: Number of sent packets in ML-ACO, GR-ACO and SR-ACO.

As SD-Orbit scales better than D-Orbit, SR-ACO scales better than ML-ACO, and SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limits of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language, while preserving reliability. The work is a vade mecum for addressing scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering VM, language and persistent storage levels. We have developed the BenchErl and DE-Bench tools for this purpose.


BenchErl and DE-Bench tools for this purposeKey VM-level scalability issues we identify in-clude contention for shared ETS tables and forcommonly-used shared resources like timersKey language scalability issues are the costs ofmaintaining a fully-connected network main-taining global recovery information and ex-plicit process placement Unsurprisingly thescaling issues for this distributed actor lan-guage are common to other distributed or par-allel languages and frameworks with otherparadigms like CHARM++ [45] Cilk [15] orLegion [40] We establish scientifically the folk-lore limitations of around 60 connected nodesfor distributed Erlang (Section IV)

The actor model is no panacea and there canstill be scalability problems in the algorithmsthat we write either within a single actor or inthe way that we structure communicating ac-tors A range of pragmatic issues also impactthe performance and scalability of actor sys-tems including memory occupied by processes(even when quiescent) mailboxes filling upetc Identifying and resolving these problemsis where tools like Percept2 and WombatOAMare needed However many of the scalabilityissues arise where Erlang departs from the privatestate principle of the actor model eg in main-taining shared state in ETS tables or a sharedglobal process namespace for recovery

We have designed and implemented a setof Scalable Distributed (SD) Erlang librariesto address language-level scalability issues Thekey constructs are s_groups for partitioning thenetwork and global process namespace andsemi-explicit process placement for deployingdistributed Erlang applications on large hetero-geneous architectures in a portable way Wehave provided a state transition operational se-mantics for the new s_groups and validatedthe library implementation against the seman-tics using QuickCheck (Section V)

To improve the scalability of the Erlang VMand libraries we have improved the implemen-tation of shared ETS tables time management

and load balancing between schedulers Fol-lowing a systematic analysis of ETS tablesthe number of fine-grained (bucket) locks andof reader groups have been increased Wehave developed and evaluated four new tech-niques for improving ETS scalability(i) pro-grammer control of number of bucket locks(ii) a contention-adapting tree data structurefor ordered_sets (iii) queue delegation lock-ing and (iv) eliminating the locks in the metatable We have introduced a new scheduler util-isation balancing mechanism to spread work tomultiple schedulers (and hence cores) and newsynchronisation mechanisms to reduce con-tention on the widely-used time managementmechanisms By June 2015 with ErlangOTP180 the majority of these changes had been in-cluded in the primary releases In any scalableactor language implementation such thought-ful design and engineering will be requiredto schedule large numbers of actors on hostswith many cores and to minimise contentionon shared VM resources (Section VI)

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article.


Table 2: Cluster Specifications (RAM per host in GB)

    Name     Hosts   Cores     Cores    Cores       Processor                                   RAM      Interconnection
                     per host  total    max avail
    GPG      20      16        320      320         Intel Xeon E5-2640v2, 8C, 2GHz              64       10GB Ethernet
    Kalkyl   348     8         2784     1408        Intel Xeon 5520v2, 4C, 2.26GHz              24-72    InfiniBand 20Gb/s
    TinTin   160     16        2560     2240        AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz    64-128   2:1 oversubscribed QDR InfiniBand
    Athos    776     24        18624    6144        Intel Xeon E5-2697v2, 12C, 2.7GHz           64       InfiniBand FDR14

While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process namespace, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages.21 For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3GHz, 16 "Bulldozer" cores each, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

21 See Deliverable 6.7 (http://www.release-project.eu/documents/D67.pdf), available online.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.
[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58-67, 1986.
[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).
[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.
[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540-545, October 1989.
[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.
[7] Joe Armstrong. Erlang. Commun. ACM, 53:68-75, 2010.
[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33-42, New York, NY, USA, September 2012. ACM.
[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 2-10, New York, NY, USA, 2006. ACM.
[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level Visualization of Concurrent and Distributed Computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.
[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.
[12] Basho Technologies. Riakdocs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking/.
[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069-1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.
[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.
[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207-216. ACM, 1995.
[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.
[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Kőszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104-121. Springer, 2015.
[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157-166, New York, NY, USA, 2013. ACM.
[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.
[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.
[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22-34, 2016.
[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.
[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang '16, pages 33-41, New York, NY, USA, 2016. ACM.
[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268-279, New York, NY, USA, 2000. ACM.
[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341-350, 1998. The datasets from this paper are included in Beasley's ORLIB.
[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107-113, 2008.
[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI '06, pages 133-140, Berlin, Heidelberg, 2006. Springer-Verlag.
[28] Alan A. A. Donovan and Brian W. Kernighan. The Go Programming Language. Addison-Wesley Professional, 2015.
[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.
[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13-22, New York, NY, USA, 2007. ACM.
[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118-129. ACM, 2011.
[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.
[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89-144. Springer International Publishing, 2015.
[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Technical Report, Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.
[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 20-20, New York, NY, USA, 2006. ACM.
[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.
[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43-49. ACM, 2014.
[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73-74, New York, NY, USA, 2013. ACM.
[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51-59, 2002.
[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39-45, 1997.
[41] Carl Hewitt. Actor model for discretionary adaptive concurrency. CoRR, abs/1008.1459, 2010.
[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI '73, pages 235-245, San Francisco, CA, USA, 1973. Morgan Kaufmann.
[43] Rich Hickey. The Clojure programming language. In DLS '08, pages 1:1-1:1, New York, NY, USA, 2008. ACM.
[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.
[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. In ACM SIGPLAN Notices, volume 28, pages 91-108. ACM, 1993.
[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15-26, New York, NY, USA, September 2013. ACM.
[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572-583. Springer, 2014.
[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.
[49] Rusty Klophaus. Riak core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1-14:1, New York, NY, USA, 2010. ACM.
[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35-40, April 2010.
[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.
[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE '12. IEEE Computer Society, 2012.
[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, 2013.
[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.
[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.
[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.
[57] Frank Lübeck and Max Neunhöffer. Enumerating large Orbits and direct condensation. Experimental Mathematics, 10(2):197-205, 2001.
[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12-23, New York, NY, USA, 2016. ACM.
[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897.
[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27-38. ACM, 2015.
[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207-221, 1998.
[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103-104. ACM, 2014.
[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1-12, 1959.
[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.
[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.
[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11-20. ACM, 2012.
[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011-2015), 2015. http://www.release-project.eu.
[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP '09), pages 13-24. ACM, 2009.
[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3-11. ACM, September 2014.
[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215-224. IEEE Computing Society, 2015.
[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing: 28th International Workshop, volume 9519 of LNCS, pages 37-53. Springer, 2016.
[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.
[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP '08, pages 104-128, Berlin, Heidelberg, 2008. Springer-Verlag.
[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.
[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.
[76] Marcus Völker. Linux timers, May 2014.
[77] WhatsApp, 2015. https://www.whatsapp.com.
[78] Tom White. Hadoop: The Definitive Guide. Storage and Analysis at Internet Scale (3rd ed., revised and updated). O'Reilly, 2012.
[79] Ulf Wiger. Industrial-Strength Functional Programming: Experiences with the Ericsson AXD301 Project. In Implementing Functional Languages (IFL '00), Aachen, Germany, September 2000. Springer-Verlag.



Figure 9: Scalability vs. percentage of global commands in Distributed Erlang

from developer folklore: (1) maintaining a fully connected network of Erlang nodes limits scalability, for example Orbit is typically limited to just 40 nodes; (2) global operations, and crucially the global operations required for reliability, i.e. to maintain a global namespace, seriously limit the scalability of distributed Erlang systems; (3) explicit process placement makes it hard to build performance portable applications for large architectures. These issues cause designers of reliable large scale systems in Erlang to depart from the standard Erlang model, e.g. using techniques like hidden nodes and storing pids in data structures. In Section V we develop language technologies to address these issues.

iii Persistent Storage

Any large scale system needs reliable scalable persistent storage, and we have studied the scalability limits of Erlang persistent storage alternatives [38]. We envisage a typical large server having around 10^5 cores on around 100 hosts. We have reviewed the requirements for scalable and available persistent storage, and evaluated four popular Erlang DBMS against these requirements. For a target scale of around 100 hosts, Mnesia and CouchDB are, unsurprisingly, not suitable. However, Dynamo-style NoSQL DBMS like Cassandra and Riak have the potential to be.

We have investigated the current scalability limits of the Riak NoSQL DBMS using the Basho Bench benchmarking framework on a cluster with up to 100 nodes and independent disks. We found that the scalability limit of Riak version 1.1.1 is 60 nodes on the Kalkyl cluster. The study placed into the public scientific domain what was previously well-evidenced, but anecdotal, developer experience.

We have also shown that resources like memory, disk, and network do not limit the scalability of Riak. By instrumenting the global and gen_server OTP libraries, we identified a specific Riak remote procedure call that fails to scale. We outline how later releases of Riak are refactored to eliminate the scalability bottlenecks.


Figure 10: Latency of commands as the number of Erlang nodes increases

Discussion. We conclude that Dynamo-like NoSQL DBMSs have the potential to deliver reliable persistent storage for Erlang at our target scale of approximately 100 hosts. Specifically, an Erlang Cassandra interface is available, and Riak 1.1.1 already provides scalable and available persistent storage on 60 nodes. Moreover, the scalability of Riak is much improved in subsequent versions.

V Improving Language Scalability

This section outlines the Scalable Distributed (SD) Erlang libraries [21] we have designed and implemented to address the distributed Erlang scalability issues identified in Section ii. SD Erlang introduces two concepts to improve scalability. S_groups partition the set of nodes in an Erlang system to reduce network connectivity and partition global data (Section i.1). Semi-explicit placement alleviates the issues of explicit process placement in large heterogeneous networks (Section i.2). The two features are independent and can be used separately or in combination. We overview SD Erlang in Section i, and outline s_group semantics and validation in Sections ii and iii respectively.

validation in Sections ii iii respectively

i SD Erlang Design

i.1 S_groups

S_groups reduce both the number of connections a node maintains and the size of name spaces, i.e. they minimise global information. Specifically, names are registered on, and synchronised between, only the nodes within the s_group. An s_group has the following parameters: a name, a list of nodes, and a list of registered names. A node can belong to many s_groups or to none. If a node belongs to no s_group, it behaves as a usual distributed Erlang node.

The s_group library defines the functions shown in Table 1. Some of these functions manipulate s_groups and provide information about them, such as creating s_groups and providing a list of nodes from a given s_group. The remaining functions manipulate names registered in s_groups and provide information about these names.

15

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Table 1: Summary of s_group functions

    new_s_group(SGroupName, Nodes)
        Creates a new s_group consisting of some nodes.
    delete_s_group(SGroupName)
        Deletes an s_group.
    add_nodes(SGroupName, Nodes)
        Adds a list of nodes to an s_group.
    remove_nodes(SGroupName, Nodes)
        Removes a list of nodes from an s_group.
    s_groups()
        Returns a list of all s_groups known to the node.
    own_s_groups()
        Returns a list of s_group tuples the node belongs to.
    own_nodes()
        Returns a list of nodes from all s_groups the node belongs to.
    own_nodes(SGroupName)
        Returns a list of nodes from the given s_group.
    info()
        Returns s_group state information.
    register_name(SGroupName, Name, Pid)
        Registers a name in the given s_group.
    re_register_name(SGroupName, Name, Pid)
        Re-registers a name (changes a registration) in a given s_group.
    unregister_name(SGroupName, Name)
        Unregisters a name in the given s_group.
    registered_names({node, Node})
        Returns a list of all registered names on the given node.
    registered_names({s_group, SGroupName})
        Returns a list of all registered names in the given s_group.
    whereis_name(SGroupName, Name)
    whereis_name(Node, SGroupName, Name)
        Return the pid of a name registered in the given s_group.
    send(SGroupName, Name, Msg)
    send(Node, SGroupName, Name, Msg)
        Send a message to a name registered in the given s_group.

For example, to register a process Pid with name Name in s_group SGroupName we use the following function. The name will only be registered if the process is being executed on a node that belongs to the given s_group, and neither Name nor Pid are already registered in that group.

    s_group:register_name(SGroupName, Name, Pid) -> yes | no
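For concreteness, here is a hedged sketch of the API in use; the node and group names are illustrative only, and the return shapes are those given in Table 1 and in the code fragment later in this section:

    %% create a group spanning two (illustrative) nodes,
    %% register a name in it, then look it up and message it
    {ok, group1, Nodes} =
        s_group:new_s_group(group1, ['node1@host', 'node2@host']),
    yes = s_group:register_name(group1, server, self()),
    Pid = s_group:whereis_name(group1, server),
    s_group:send(group1, server, {hello, self()})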

To illustrate the impact of s_groups on scalability we repeat the global operations experiment from Section ii (Figure 9). In the SD Erlang experiment we partition the set of nodes into s_groups, each containing ten nodes, and hence the names are replicated and synchronised on just ten nodes, and not on all nodes as in distributed Erlang. The results in Figure 11 show that with 0.01% of global operations the throughput of distributed Erlang stops growing at 40 nodes, while the throughput of SD Erlang continues to grow linearly.

Figure 11: Global operations in Distributed Erlang vs SD Erlang

The connection topology of s_groups is extremely flexible: they may be organised into a hierarchy of arbitrary depth or branching, e.g. there could be multiple levels in the tree of s_groups, see Figure 12. Moreover, it is not necessary to create a hierarchy of s_groups; for example, we have constructed an Orbit implementation using a ring of s_groups.

Given such a flexible way of organising distributed systems, key questions in the design of an SD Erlang system are the following. How should s_groups be structured? Depending on the reason the nodes are grouped (reducing the number of connections, reducing the namespace, or both), s_groups can be freely structured as a tree, ring, or some other topology. How large should the s_groups be? Smaller s_groups mean more inter-group communication, but the synchronisation of the s_group state between the s_group nodes constrains the maximum size of s_groups. We have not found this constraint to be a serious restriction: for example, many s_groups are either relatively small, e.g. 10-node, internal or terminal elements in some topology, e.g. the leaves and nodes of a tree. How do nodes from different s_groups communicate? While any two nodes can communicate in an SD Erlang system, to minimise the number of connections, communication between nodes from different s_groups is typically routed via gateway nodes that belong to both s_groups. How do we avoid single points of failure? For reliability, and to minimise communication load, multiple gateway nodes and processes may be required.

Information to make these design choices is provided by the tools in Section VII and by benchmarking. A further challenge is how to systematically refactor a distributed Erlang application into SD Erlang, and this is outlined in Section i. A detailed discussion of distributed system design and refactoring in SD Erlang is provided in a recent article [22].

We illustrate typical SD Erlang system designs by showing refactorings of the Orbit and ACO benchmarks from Section III. In both distributed and SD Erlang the computation starts on the Master node and the actual computation is done on the Worker nodes. In the distributed Erlang version all nodes are interconnected, and messages are transferred directly from the sending node to the receiving node (Figure 2). In contrast, in the SD Erlang version nodes are grouped into s_groups, and messages are transferred between different s_groups via Sub-master nodes (Figure 13).


Figure 12: SD Erlang ACO (SR-ACO) architecture

A fragment of code that creates an s_group on a Sub-master node is as follows:

    create_s_group(Master, GroupName, Nodes0) ->
        case s_group:new_s_group(GroupName, Nodes0) of
            {ok, GroupName, Nodes} ->
                Master ! {GroupName, Nodes};
            _ -> io:format("exception: message")
        end.

Similarly, we introduce s_groups in the GR-ACO benchmark from Section ii to create Scalable Reliable ACO (SR-ACO); see Figure 12. Here, apart from reducing the number of connections, s_groups also reduce the global namespace information. That is, instead of registering the name of a pid globally, i.e. with all nodes, the name is registered only on all nodes in the s_group, with s_group:register_name/3.

A comparative performance evaluation of distributed Erlang and SD Erlang Orbit and ACO is presented in Section VIII.

i.2 Semi-Explicit Placement

Recall from Section iii that distributed Erlang spawns a process onto an explicitly named Erlang node, e.g.

    spawn(some_node, fun some_module:pong/0)

and also recall the portability and programming effort issues associated with such explicit placement in large scale systems, discussed in Section ii.

To address these issues we have developed a semi-explicit placement library that enables the programmer to select nodes on which to spawn processes based on run-time information about the properties of the nodes. For example, if a process performs a lot of computation one would like to spawn it on a node with considerable computation power, or if two processes are likely to communicate frequently then it would be desirable to spawn them on the same node, or on nodes with a fast interconnect.

We have implemented two Erlang libraries to support semi-explicit placement [60]. The first deals with node attributes, and describes properties of individual Erlang VMs and associated hosts, such as total and currently available RAM, installed software, hardware configuration, etc. The second deals with a notion of communication distances, which models the communication times between nodes in a distributed system. Therefore, instead of specifying a node, we can use the attr:choose_node/1 function to define the target node, i.e.

    spawn(attr:choose_node(Params), fun some_module:pong/0)

Figure 13: SD Erlang (SD-Orbit) architecture

[60] report an investigation into the communication latencies on a range of NUMA and cluster architectures, and demonstrate the effectiveness of the placement libraries using the ML-ACO benchmark on the Athos cluster.

ii S_group Semantics

For precise specification, and as a basis for validation, we provide a small-step operational semantics of the s_group operations [21]. Figure 14 defines the state of an SD Erlang system and associated abstract syntax variables. The abstract syntax variables on the left are defined as members of sets, denoted {...}, and these in turn may contain tuples, denoted (...), or further sets. In particular, nm is a process name, p a pid, ni a node_id, and nis a set of node_ids. The state of a system is modelled as a four-tuple comprising a set of s_groups, a set of free_groups, a set of free_hidden_groups, and a set of nodes. Each type of group is associated with nodes and has a namespace. An s_group additionally has a name, whereas a free_hidden_group consists of only one node,

i.e. a hidden node simultaneously acts as a node and as a group, because as a group it has a namespace but does not share it with any other node. Free normal and hidden groups have no names, and are uniquely defined by the nodes associated with them. Therefore, group names, gr_names, are either NoGroup or a set of s_group_names. A namespace is a set of (name, pid) pairs, and is replicated on all nodes of the associated group.

A node has the following four parameters: a node_id identifier; a node_type that can be either hidden or normal; connections; and group_names, i.e. names of groups the node belongs to. The node can belong to either a list of s_groups or one of the free groups; the type of the free group is defined by the node type. Connections are a set of node_ids.

Transitions in the semantics have the form (state, command, ni) → (state′, value), meaning that executing command on node ni in state returns value and transitions to state′.

The semantics is presented in more detail by [21], but we illustrate it here with the s_group:registered_names/1 function from Section i.1. The function returns a list of names

    (grs, fgs, fhs, nds) ∈ state ≡
        ({s_group}, {free_group}, {free_hidden_group}, {node})
    gr ∈ grs ≡ {s_group} ≡ {(s_group_name, {node_id}, namespace)}
    fg ∈ fgs ≡ {free_group} ≡ {({node_id}, namespace)}
    fh ∈ fhs ≡ {free_hidden_group} ≡ {(node_id, namespace)}
    nd ∈ nds ≡ {node} ≡ {(node_id, node_type, connections, gr_names)}
    gs ∈ gr_names ≡ {NoGroup} ∪ {s_group_name}
    ns ∈ namespace ≡ {(name, pid)}
    cs ∈ connections ≡ {node_id}
    nt ∈ node_type ≡ {Normal, Hidden}
    s ∈ {NoGroup} ∪ {s_group_name}

Figure 14: SD Erlang state [21]

registered in s_group s if node ni belongs to the s_group, and an empty list otherwise (Figure 15). Here ⊕ denotes disjoint set union, IsSGroupNode returns true if node ni is a member of some s_group s and false otherwise, and OutputNms returns a set of process names registered in the ns namespace of s_group s.

iii Semantics Validation

As the semantics is concrete it can readily be made executable in Erlang, with lists replacing sets throughout. Having an executable semantics allows users to engage with it, and to understand how the semantics behaves vis à vis the library, giving them an opportunity to assess the correctness of the library against the semantics.

Figure 16: Testing s_groups using QuickCheck

Better still, we can automatically assess how the system behaves in comparison with the (executable) semantics by executing them in lockstep, guided by the constraints of which operations are possible at each point. We do that by building an abstract state machine model of the library. We can then generate random sequences (or traces) through the model, with appropriate library data generated too. This random generation is supported by the QuickCheck property-based testing system [24, 9].

The architecture of the testing framework is shown in Figure 16. First an abstract state machine, embedded as an "eqc_statem" module, is derived from the executable semantic specification. The state machine defines the abstract state representation and the transition from one state to another when an operation is applied.

    ((grs, fgs, fhs, nds), s_group:registered_names(s), ni) →
        ((grs, fgs, fhs, nds), nms)    if IsSGroupNode(ni, s, grs)
    ((grs, fgs, fhs, nds), s_group:registered_names(s), ni) →
        ((grs, fgs, fhs, nds), {})     otherwise

    where {(s, {ni} ⊕ nis, ns)} ⊕ grs′ ≡ grs
          nms ≡ OutputNms(s, ns)

    IsSGroupNode(ni, s, grs) = ∃ nis, ns, grs′ . {(s, {ni} ⊕ nis, ns)} ⊕ grs′ ≡ grs

    OutputNms(s, ns) = {(s, nm) | (nm, p) ∈ ns}

Figure 15: SD Erlang semantics of the s_group:registered_names/1 function

Test case and data generators are then defined to control the test case generation; this includes the automatic generation of eligible s_group operations and the input data to those operations. Test oracles are encoded as the postconditions for s_group operations.

During testing, each test command is applied to both the abstract model and the s_group library. The application of the test command to the abstract model takes the abstract model from its current state to a new state, as described by the transition functions, whereas the application of the test command to the library leads the system to a new actual state. The actual state information is collected from each node in the distributed system, then merged and normalised to the same format as the abstract state representation. For a test to be successful, after the execution of a test command the test oracles specified for this command should be satisfied. Various test oracles can be defined for s_group operations; for instance, one of the generic constraints that applies to all the s_group operations is that after each s_group operation the normalised system state should be equivalent to the abstract state.
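For illustration, the top-level property has roughly this shape (a minimal sketch assuming Quviq QuickCheck's eqc_statem conventions; the real suite is more elaborate):

    %% assumes -include_lib("eqc/include/eqc.hrl") and an
    %% eqc_statem module with command/next_state/postcondition callbacks
    prop_s_group_ops() ->
        ?FORALL(Cmds, commands(?MODULE),
                begin
                    {History, State, Result} = run_commands(?MODULE, Cmds),
                    %% postconditions (the test oracles) are checked
                    %% command-by-command inside run_commands/2
                    pretty_commands(?MODULE, Cmds, {History, State, Result},
                                    Result == ok)
                end).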

Thousands of tests were run, and three kinds of errors (all subsequently corrected) were found. Some errors in the library implementation were found, including one error due to the synchronisation between nodes, and another related to the remove_nodes operation, which erroneously raised an exception. We also found a couple of trivial errors in the semantic specification itself, which had been missed by manual examination. Finally, we found some situations where there were inconsistencies between the semantics and the library implementation despite their states being equivalent; an example of this was in the particular values returned by functions on certain errors. Overall, the automation of testing boosted our confidence in the correctness of the library implementation and the semantic specification. This work is reported in more detail by [54].

VI Improving VM Scalability

This section reports the primary VM and library improvements we have designed and implemented to address the scalability and reliability issues identified in Section i.


i Improvements to Erlang Term Storage

Because ETS tables are so heavily used in Erlang systems, they are a focus for scalability improvements. We start by describing their redesign, including some improvements that pre-date our RELEASE project work, i.e. those prior to Erlang/OTP R15B03. These historical improvements are very relevant for a scalability study, and form the basis for our subsequent changes and improvements. At the point when Erlang/OTP got support for multiple cores (in release R11B) there was a single reader-writer lock for each ETS table. Optional fine-grained locking of hash-based ETS tables (i.e. set, bag or duplicate_bag tables) was introduced in Erlang/OTP R13B02-1, adding 16 reader-writer locks for the hash buckets. Reader groups to minimise read synchronisation overheads were introduced in Erlang/OTP R14B. The key observation is that a single count of the multiple readers must be synchronised across many cache lines, potentially far away in a NUMA system. Maintaining reader counts in multiple (local) caches makes reads fast, although writes must now check every reader count. In Erlang/OTP R16B the number of bucket locks and the default number of reader groups were both upgraded from 16 to 64.

We illustrate the scaling properties of the ETS concurrency options using the ets_bench BenchErl benchmark on the Intel NUMA machine with 32 hyperthreaded cores specified in Appendix A. The ets_bench benchmark inserts 1M items into the table, then records the time to perform 17M operations, where an operation is either a lookup, an insert, or a delete. The experiments are conducted on a hash-based (set) ETS table with different percentages of update operations, i.e. insertions or deletions.
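For reference, a table like the one benchmarked can be created with the concurrency tuning options that this line of work targets (a minimal sketch; the table name and option choices are illustrative):

    %% a hash-based (set) table with fine-grained locking and
    %% reader groups enabled
    T = ets:new(bench_table, [set, public,
                              {read_concurrency, true},
                              {write_concurrency, true}]),
    true = ets:insert(T, {key1, value1}),
    [{key1, value1}] = ets:lookup(T, key1)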

Figure 17 shows the runtimes in seconds of 17M operations in different Erlang/OTP versions, varying the number of schedulers (x-axis), reflecting how the scalability of ETS tables has improved in more recent Erlang/OTP releases. Figure 18 shows the runtimes in seconds of 17M operations on an ETS table with different numbers of reader groups, again varying the number of schedulers. We see that one reader group is not sufficient with 10% updates, nor are two with 1% updates. Beyond that, different numbers of reader groups have little impact on the benchmark performance, except that using 64 groups with 10% updates slightly degrades performance.

Figure 17: Runtime of 17M ETS operations in Erlang/OTP releases, 10% updates (left) and 1% updates (right). The y-axis shows time in seconds (lower is better), and the x-axis is the number of OS threads (VM schedulers).

We have explored four other extensions or redesigns in the ETS implementation for better scalability: (1) allowing more programmer control over the number of bucket locks in hash-based tables, so the programmer can reflect the number of schedulers and the expected access pattern; (2) using contention-adapting trees to get better scalability for ordered_set ETS tables, as described by [69]; (3) using queue delegation locking to improve scalability [47]; and (4) adopting schemes for completely eliminating the locks in the meta table. A more complete discussion of our work on ETS can be found in the papers by [69] and [47].

Figure 18: Runtime of 17M ETS operations with varying numbers of reader groups, 10% updates (left) and 1% updates (right). The y-axis shows runtime in seconds (lower is better), and the x-axis is the number of schedulers.

Here we outline only our work on contention-adapting (CA) trees. A CA tree monitors contention in different parts of a tree-shaped data structure, introducing routing nodes with locks in response to high contention, and removing them in response to low contention. For experimental purposes, two variants of the CA tree have been implemented to represent ordered_sets in the virtual machine of Erlang/OTP 17.0: one extends the existing AVL trees in the Erlang VM, and the other uses a Treap data structure [5]. Figure 19 compares the throughput of the CA tree variants with that of ordered_set and set as the number of schedulers increases. It is unsurprising that the CA trees scale so much better than an ordered_set, which is protected by a single readers-writer lock. It is more surprising that they also scale better than set. This is due to hash tables using fine-grained locking at a fixed granularity, while CA trees can adapt the number of locks to the current contention level, and also to the parts of the key range where contention is occurring.

Figure 19: Throughput of CA tree variants, 10% updates (left) and 1% updates (right)


ii Improvements to Schedulers

In the Erlang VM, a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until it either blocks waiting for input, typically a message from some other process, or until it has executed its quota of reductions.

The Erlang VM is usually started with one scheduler per logical core (SMT-thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned it is placed in the run queue of the scheduler of its parent, and it waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores, that is, an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Figure iii.2.

The default load management mechanism is load compaction, which aims to keep as many scheduler threads as possible fully loaded with work, i.e. it attempts to ensure that scheduler threads do not run out of work. We have developed a new optional scheduler utilisation balancing mechanism that is available from Erlang/OTP 17.0. The new mechanism aims to balance scheduler utilisation between schedulers; that is, it will strive for equal scheduler utilisation on all schedulers.

The scheduler utilisation balancing mechanism has no performance impact on the system when not enabled. On the other hand, when enabled, it results in changed timing in the system: normally there is a small overhead due to measuring of utilisation and calculating balancing information, which depends on the underlying primitives provided by the operating system.

The new balancing mechanism results in a better distribution of processes to schedulers, reducing the probability of core contention. Together with other VM improvements, such as interruptable BIFs and garbage collection, it results in lower latency and improved responsiveness, and hence reliability, for soft real-time applications.
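The mechanism is opt-in; as a sketch (to the best of our knowledge the +sub emulator flag enables it, and the standard scheduler wall-time statistics can be used to observe its effect; the workload call is a hypothetical placeholder):

    %% start the VM with utilisation balancing enabled:
    %%   erl +sub true
    %% then observe per-scheduler utilisation from Erlang:
    erlang:system_flag(scheduler_wall_time, true),
    Before = erlang:statistics(scheduler_wall_time),
    run_workload(),    % hypothetical workload
    After = erlang:statistics(scheduler_wall_time)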

iii Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel benchmark in Figure i. The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in that returns "Erlang system time", or time since Epoch, with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, i.e. SMT-enabled. The documentation promises that values returned by it are strictly increasing, and many applications ended up relying on this. For example, applications often employ erlang:now/0 to generate unique integers.
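Code that used erlang:now/0 purely as a source of unique, strictly increasing values can migrate to a purpose-built primitive; a minimal sketch using erlang:unique_integer/1, available from Erlang/OTP 18:

    %% unique, strictly increasing, and does not require a global lock
    Id = erlang:unique_integer([monotonic, positive])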

Erlang system time should align with the operating system's view of time since Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot without invalidating the strictly increasing value guarantee. The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when these two have deviated, e.g. in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], i.e. a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards and backwards, while system time is allowed to do this. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.
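Under the new API, Erlang system time is recovered by explicitly combining the two components; a minimal sketch (the milli_seconds unit name is the one current in Erlang/OTP 18.x):

    Mono   = erlang:monotonic_time(),   % Erlang monotonic time, native units
    Offset = erlang:time_offset(),      % current (dynamically varying) offset
    SysNative = Mono + Offset,          % Erlang system time, native units
    SysMs = erlang:convert_time_unit(SysNative, native, milli_seconds)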

Time Retrieval. Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, monotonic time delivered by the operating system is solely based on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP). That is, they will run with different frequencies. Linux is an exception, with a monotonic clock that is NTP adjusted and runs with the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing monotonic time and system time delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute. If the adjustment needs to be changed, new adjustment information is published and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the time adjustment information previously published, and calculates Erlang monotonic time. To preserve monotonicity it is important that all threads that read the same OS monotonic time map this to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information, using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed. This means that in the vast majority of cases the RW lock will be read-locked, which allows multiple readers to run concurrently. To prevent bouncing the lock cache-line, we use a bespoke reader-optimised RW lock implementation where reader threads notify about their presence in counters on separate cache-lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11], and alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].

Timer Wheel and BIF Timer. The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel that is used by processes executing on the scheduler.

The implementation of timers in Erlang/OTP uses a built-in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS tables, and only insert one timer at a time into the timer wheel.

Benchmarks. We have measured several benchmarks on a 16-core Bulldozer machine with eight dual CPU AMD Opteron 4376 HEs.5 We present three of them here.

5 See §2.5.4 of the RELEASE project Deliverable 2.4 (http://release-project.eu/documents/D24.pdf).

The first micro benchmark compares the execution time of an Erlang receive with that of a receive with an after clause that specifies a timeout and provides a default value. The receive ... after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, total execution time with the optimised timers is only 5% longer than without timers.
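The two shapes compared by this micro benchmark look as follows (a sketch; the message pattern and timeout value are arbitrary):

    wait_no_timer() ->
        receive Msg -> Msg end.

    wait_with_timer() ->
        receive
            Msg -> Msg
        after 1000 ->      % a timer is set when the process blocks,
            timeout        % and cancelled when a message arrives
        end.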

The second micro benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4 and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on the left shows that (1) the performance of time management has remained roughly unchanged between Erlang/OTP releases prior to 18.0; (2) the improved time management in Erlang/OTP 18.x makes time management less likely to be a scalability bottleneck, even when using erlang:now/0; and (3) the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution. The graph on the right side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

Figure 20: BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases: runtimes (left) and speedup obtained using erlang:monotonic_time/0 (right)

VII Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable, to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling and debugging Erlang systems that can be used separately, or together as appropriate, for solving the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on, or specialise, the services offered by the Erlang virtual machine through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer6 application gives a comprehensive overview of many of these data on a node-by-node basis.

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature. That is, the tools support the language itself, rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use tracing support provided by the Erlang VM, so can other actor frameworks: e.g. Akka can use the Kamon7 JVM monitoring system. Similarly, tools for other actor languages or frameworks could use data derived through OS-level tracing frameworks, DTrace8 and SystemTap9 probes, as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

6 http://www.erlang.org/doc/apps/observer
7 http://kamon.io
8 http://dtrace.org/blogs/about
9 https://sourceware.org/systemtap/wiki

i Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare it for modification or extension, or, as is the case here, in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support, in a refactoring tool. There are a number of tools that support refactoring in Erlang: in the RELEASE project we have chosen to extend Wrangler10 [56]; other tools include Tidier [68] and RefactorErl [44].

10 http://www.cs.kent.ac.uk/projects/wrangler

Supporting API Migration. The SD Erlang libraries modify Erlang's global_group library, becoming the new s_group library; as a result, Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolves and this often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically. The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work, and the technicalities of the implementation, can be found in a paper by [52].
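To make the idea concrete, a user-defined adaptor module might look like the sketch below. The module name, the choice of a single default group called universe, and the exact s_group signatures are our illustrative assumptions, not Wrangler's actual adaptor conventions.

    %% Hypothetical adaptor module: each function defines a call to the
    %% old global_group API in terms of the new s_group API. Wrangler
    %% generates the client-code refactoring from definitions like these.
    -module(global_group_adaptor).
    -export([registered_names/1, whereis_name/1]).

    %% Assumption: names previously registered globally now live in a
    %% single s_group named 'universe'.
    registered_names({node, Node}) ->
        s_group:registered_names({node, Node}).

    whereis_name(Name) ->
        s_group:whereis_name(universe, Name).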

Support for Introducing Parallelism. We have introduced support for parallelising explicit list operations (map and foreach), for process introduction to complete a computationally intensive task in parallel, for introducing a worker process to deal with call handling in an Erlang "generic server", and to parallelise a tail recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach. The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward; even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
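The following sketch shows the shape of such a transformation; pmap below is our illustrative stand-in for para_lib's parallel map (the real library's module and function names may differ), spawning one process per list element and collecting results in the original order.

    -module(para_demo).
    -export([pmap/2]).

    %% Illustrative parallel map: spawn one worker per list element,
    %% then gather the results in list order using unique references.
    pmap(F, Xs) ->
        Parent = self(),
        Refs = [begin
                    Ref = make_ref(),
                    spawn(fun() -> Parent ! {Ref, F(X)} end),
                    Ref
                end || X <- Xs],
        [receive {Ref, Result} -> Result end || Ref <- Refs].

    %% A sequential lists:map(fun heavy/1, Data) call would be
    %% refactored to para_demo:pmap(fun heavy/1, Data).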

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute a task in parallel with its parent process. The result of the new process is sent back to the parent process, which will then consume it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.
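The transformed code has roughly the following shape (a hand-written sketch with invented function names, not Wrangler's generated output):

    %% Before: the two tasks are serialised.
    %%   f() -> R1 = heavy_task(), R2 = other_work(), use(R1, R2).

    %% After: heavy_task/0 runs in a new process; the receive is placed
    %% immediately before the first point where its result is needed.
    f() ->
        Parent = self(),
        Ref = make_ref(),
        spawn(fun() -> Parent ! {Ref, heavy_task()} end),
        R2 = other_work(),                  %% not blocked by heavy_task/0
        R1 = receive {Ref, Res} -> Res end, %% just before R1 is used
        use(R1, R2).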

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies. For instance, an example might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are checked by program slicing, which is discussed below.

Support for Program Slicing. Program slicing is a general technique of program analysis for extracting the part of a program, also called the slice, that influences, or is influenced by, a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE [17] (see http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf), a tool developed in another EU project that also re-uses the Wrangler front end. This work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing in transforming programs to make them suitable for the introduction of parallel structures.

Figure 21: The architecture of WombatOAM, showing the Wombat web dashboard and CLI, the northbound REST interface to the Master (with its core database and Topology, Orchestration, Notification, Alarms and Metrics services), the Middle Managers with elibcloud/Libcloud VM provisioning and SSH software provisioning, and the agents on the managed Erlang nodes.

ii Scalable Deployment

We have developed the WombatOAM tool (a commercial tool available from Erlang Solutions Ltd: https://www.erlang-solutions.com/products/wombat-oam.html) to provide a deployment, operations and maintenance framework for large-scale Erlang distributed systems. These systems typically consist of a number of Erlang nodes executing on different hosts. These hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP. Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this is the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture. The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability because of the role played by the central Master node; in its current version an additional layer of Middle Managers was introduced to allow the system to scale easily to thousands of deployed nodes. As the diagram shows, the "northbound" interfaces to the web dashboard and the command-line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms and so forth; we describe those now.

Services. WombatOAM is designed to collect, store and display various kinds of information and events from running Erlang systems, and these data are accessed and managed through an AJAX-based Web Dashboard; these include the following.

Metrics. WombatOAM supports the collection of some hundred metrics, including, for instance, numbers of processes on a node and the message traffic in and out of the node, on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom (which collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom), and interface with other tools, such as Graphite (https://graphiteapp.org), which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications. As well as metrics, it can support event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which is part of the standard distribution, or the lager logging framework (https://github.com/basho/lager), and will be displayed and logged as they occur; a small sketch of emitting such events appears after this list.

Alarms. Alarms are more complex entities. Alarms have a state: they can be raised, and once dealt with they can be cleared. They also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as for notifications.

Topology. The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible; if not, it notifies the other services and periodically tries to reconnect. When the nodes are available again it also notifies the other services. It doesn't have a middle manager part because it doesn't talk to the nodes directly; instead it asks the Node manager service to do so.

Node manager. This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration. This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers, the Orchestration service uses an external library called Libcloud (the unified cloud API [4]: https://libcloud.apache.org), for which Erlang Solutions has written an open source Erlang wrapper, called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications; it provides infrastructure for deploying them.
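As promised under Notifications above, here is a minimal sketch of how a managed node might emit events for collection. It uses only the standard SASL error_logger API and lager; the report contents are invented for illustration, and the exact formats WombatOAM consumes are not shown here.

    %% One-time events via SASL's error_logger (part of OTP):
    error_logger:info_report([{service, db}, {event, cache_rebuilt}]).
    error_logger:error_report([{service, db}, {event, disk_full}]).

    %% The same via the lager logging framework (requires the lager
    %% parse transform to be enabled for the calling module):
    lager:warning("mnesia table ~p is overloaded", [user_sessions]).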

Deployment. The mechanism consists of the following five steps:

1. Registering a provider. WombatOAM provides the same interface for different cloud providers which support the OpenStack standard or the Amazon EC2 API. WombatOAM also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this enables WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

4. Defining a deployment domain. At this step a deployment domain is created that specifies (i) which providers should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to or removed from the system, depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single-host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue and processes migrate between the queues through a work stealing algorithm; at the same time we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes. (This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.)

iii.1 Percept2

Percept2 (https://github.com/RefactoringTools/percept2) builds on the existing Percept tool to provide post hoc, offline analysis and visualisation of Erlang systems. Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling for a distributed Erlang application using Percept2 is shown in Figure 22.

Figure 22: Offline profiling of a distributed Erlang application using Percept2.
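A typical offline session has roughly the following shape. The function names below mirror those of the original Percept application, which Percept2 retains; the entry point orbit:main/1 is hypothetical, and the option atoms are assumptions based on the profiling flags described later in this section.

    %% 1. Run the program under trace, writing events to a file.
    percept2:profile("sdorbit.dat",
                     {orbit, main, [10000]},
                     [message, scheduler, s_group]).

    %% 2. Analyse the trace file(s); results go into a RAM database.
    percept2:analyze(["sdorbit.dat"]).

    %% 3. Browse the results through the web-based interface.
    percept2:start_webserver(8888).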

Percept generates an application-level zoomable concurrency graph, showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways, as detailed by [53], including most importantly:

• Distinguishing between running and runnable time for each process: this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including: the migration history of a process between run queues; statistics about message passing between processes (the number of messages and the average message size sent/received by a process); the accumulated runtime per process; and the accumulated time when a process is in a running state.

• Presenting the process tree: the hierarchy structure indicating the parent-child relationships between processes.

Figure 23: Percept2 showing processes running and runnable (with 4 schedulers/run queues).

• Recording the dynamic function callgraph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system. Unlike global groups, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have also extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating measurement of a function's own execution time gives users the freedom of not profiling all the function calls invoked during the program execution. For example, they can choose to profile only functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph, a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of generated web pages. Together these make the system more responsive and more scalable.

iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, and require neither special recompiled versions of the software to be examined, nor special post-processing tools to create meaningful information from the data gathered. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the sizes of run queues vary during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the same storage infrastructure and presentation facilities.

Figure 24: DTrace run-queue visualisation of the bang benchmark with 16 schedulers (run queues 0-15; time on X-axis).
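For application-level probes, a VM built with DTrace/SystemTap support exposes the dyntrace module (in the runtime_tools application) for firing user probes from Erlang code. The sketch below is our illustration: the probe payloads are invented and do_handle/1 is a hypothetical worker function.

    %% Fire user-space probes that a DTrace/SystemTap script can match;
    %% assumes an emulator built with dynamic tracing enabled.
    handle_request(Req) ->
        dyntrace:p("request_start"),
        Result = do_handle(Req),      %% hypothetical worker function
        dyntrace:p("request_done"),
        Result.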

iii.3 Devo

Devo (https://github.com/RefactoringTools/devo) [10] is designed to provide real-time online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).

Figure 25 shows visualisations from Devo in both modes. On the left-hand side a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram) with six cores each; with hyperthreading this gives twelve virtual cores, and hence 24 run queues in total. The size of these run queues is shown by both the colour and height of each column, and process migrations are illustrated by a (fading) arc between the queues within the circle. A green arc shows migration on the same physical core, a grey one on the same chip, and blue shows migrations between the two chips.

Figure 25: Devo low- and high-level visualisations of SD-Orbit.

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between the nodes (green quiescent; red busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. Tracing to be performed on monitored nodes is also specified within the configuration file.

An agent is started by a master SD-Mon node for each s_group and for each free node. Configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up it tries to get in contact with its nodes and apply the tracing to them as stated by the master. Binary files are stored in the host file system. Tracing is internally used in order to track s_group operations happening at runtime. An asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

Figure 26: SD-Mon architecture.

The monitoring network is also supervised, in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected then that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution: before (top) and after eliminating s_group 2 (bottom).

Figure 28: SD-Mon online monitoring.

VIII Systemic Evaluation

Preceding sections have investigated the improvements of individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI, VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state of the art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable (Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf). The bulk of the experiments reported here are conducted on the Athos cluster, using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed-size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure. The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling].
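For reference, the two scaling measures can be written as follows (standard formulations, stated here for clarity), where T(n) denotes execution time on n cores:

\[ \text{relative speedup (strong scaling):} \quad S(n) = \frac{T(1)}{T(n)} \]

\[ \text{weak scaling efficiency:} \quad E(n) = \frac{T(1)}{T(n)}, \quad \text{with total work grown in proportion to } n \]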

i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot standard deviation. Results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups, i.e. beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails, due to the fact that some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5,000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.

Figure 30: WombatOAM deployment time: experimental data with best fit 0.0125N + 83 (time in seconds against number of nodes N).
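As a quick sanity check of the fit (our arithmetic, assuming time in seconds): T(N) ≈ 0.0125N + 83 gives T(10000) ≈ 0.0125 × 10000 + 83 = 208 s, close to the measured 212 s for the 10,000-node deployment.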

The measurement data shows two important facts: it shows that WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and that WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, and SR-ACO registers them only within an s_group (Section ii).

For both GR-ACO and SR-ACO, a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common x86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark and obtained similar results [23].
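A Chaos Monkey of this kind is only a few lines of Erlang. The sketch below is our illustration (not the implementation used in the experiments): it kills one randomly chosen local process per second; a real version would filter the candidate list to protect system processes.

    -module(chaos).
    -export([start/0]).

    %% Kill a random process on this node every second.
    start() ->
        spawn(fun loop/0).

    loop() ->
        timer:sleep(1000),
        Procs = erlang:processes(),
        %% rand is available from OTP 18; on earlier releases use random:uniform/1.
        Victim = lists:nth(rand:uniform(length(Procs)), Procs),
        exit(Victim, kill),   %% untrappable kill signal
        loop().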

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.

Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling].

Scalability. Figure 31 compares the runtimes of the ML, GR and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks and others that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.

Figure 32: Number of sent packets in ML-ACO, GR-ACO and SR-ACO.

As SD-Orbit scales better than D-Orbit, as SR-ACO scales better than ML-ACO, and as SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limits of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language, while preserving reliability. The work is a vade mecum for addressing scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering VM, language and persistent storage levels. We have developed the BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15], or Legion [40]. We establish scientifically the folklore limitations of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups for partitioning the network and global process namespace, and semi-explicit process placement for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).

To improve the scalability of the Erlang VM and libraries we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

Table 2: Cluster specifications (RAM per host in GB).

    Name     Hosts   Cores per host   Cores total   Cores max avail   Processor                                  RAM      Interconnection
    GPG      20      16               320           320               Intel Xeon E5-2640v2, 8C, 2.0GHz           64       10GB Ethernet
    Kalkyl   348     8                2784          1408              Intel Xeon 5520v2, 4C, 2.26GHz             24–72    InfiniBand 20Gb/s
    TinTin   160     16               2560          2240              AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz   64–128   2:1 oversubscribed QDR InfiniBand
    Athos    776     24               18624         6144              Intel Xeon E5-2697v2, 12C, 2.7GHz          64       InfiniBand FDR14

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages (see Deliverable 6.7, available online: http://www.release-project.eu/documents/D67.pdf). For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, four AMD Opteron 6276s at 2.3 GHz, 16 "Bulldozer" cores each, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM, four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.
[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.
[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).
[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.
[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.
[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.
[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.
[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.
[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.
[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level visualization of concurrent and distributed computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.
[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.
[12] Basho Technologies. Riak Docs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking/.
[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.
[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.
[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216. ACM, 1995.
[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.
[17] István Bozó, Viktória Fördos, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Koszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.
[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.
[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.
[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.
[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.
[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördos, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.
[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.
[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.
[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.
[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.
[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.
[28] Alan A. A. Donovan and Brian W. Kernighan. The Go Programming Language. Addison-Wesley Professional, 2015.
[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.
[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.
[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.
[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.
[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.
[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Technical Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.
[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.
[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.
[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.
[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.
[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.
[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.
[41] Carl Hewitt. Actor model for discretionary, adaptive concurrency. CoRR, abs/1008.1459, 2010.
[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.
[43] Rich Hickey. The Clojure programming language. In DLS'08, New York, NY, USA, 2008. ACM.
[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.
[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. In ACM SIGPLAN Notices, volume 28, pages 91–108. ACM, 1993.
[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.
[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.
[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.
[49] Rusty Klophaus. Riak Core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, New York, NY, USA, 2010. ACM.
[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.
[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.
[52] Huiqing Li and Simon Thompson. Automated API migration in a user-extensible refactoring tool for Erlang programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.
[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN workshop on Erlang, 2013.
[54] Huiqing Li and Simon Thompson. Improved semantics and implementation through property-based testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.
[55] Huiqing Li and Simon Thompson. Safe concurrency introduction through slicing. In PEPM '15. ACM SIGPLAN, January 2015.
[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.
[57] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.
[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.
[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897/.
[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.
[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.
[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.
[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.
[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.
[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.
[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.
[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.
[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.
[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.
[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computer Society, 2015.
[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing: 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.
[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.
[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.
[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.
[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.
[76] Marcus Völker. Linux timers, May 2014.
[77] WhatsApp, 2015. https://www.whatsapp.com.
[78] Tom White. Hadoop: The Definitive Guide. Storage and Analysis at Internet Scale (3rd ed., revised and updated). O'Reilly, 2012.
[79] Ulf Wiger. Industrial-strength functional programming: Experiences with the Ericsson AXD301 project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.
Page 15: Scaling Reliably Scaling Reliably: Improving the ... · Distributed actor languages are an effective means of constructing scalable reliable systems, and the Erlang ... that capture

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 10 Latency of commands as the number of Erlang nodes increases

Discussion We conclude that Dynamo-likeNoSQL DBMSs have the potential to deliver re-liable persistent storage for Erlang at our targetscale of approximately 100 hosts Specificallyan Erlang Cassandra interface is available andRiak 111 already provides scalable and avail-able persistent storage on 60 nodes Moreoverthe scalability of Riak is much improved insubsequent versions

V. Improving Language Scalability

This section outlines the Scalable Distributed (SD) Erlang libraries [21] that we have designed and implemented to address the distributed Erlang scalability issues identified in Section ii. SD Erlang introduces two concepts to improve scalability. S_groups partition the set of nodes in an Erlang system to reduce network connectivity and partition global data (Section i.1). Semi-explicit placement alleviates the issues of explicit process placement in large heterogeneous networks (Section i.2). The two features are independent, and can be used separately or in combination. We overview SD Erlang in Section i, and outline s_group semantics and validation in Sections ii and iii respectively.

i. SD Erlang Design

i.1 S_groups

S_groups reduce both the number of connections a node maintains and the size of name spaces, i.e. they minimise global information. Specifically, names are registered on, and synchronised between, only the nodes within the s_group. An s_group has the following parameters: a name, a list of nodes, and a list of registered names. A node can belong to many s_groups or to none. If a node belongs to no s_group, it behaves as a usual distributed Erlang node.

The s_group library defines the functions shown in Table 1. Some of these functions manipulate s_groups and provide information about them, such as creating s_groups and providing a list of nodes from a given s_group. The remaining functions manipulate names registered in s_groups and provide information about these names. For example, to register a process Pid with name Name in s_group SGroupName we use the function below.


Table 1: Summary of s_group functions

    new_s_group(SGroupName, Nodes)           Creates a new s_group consisting of some nodes.
    delete_s_group(SGroupName)               Deletes an s_group.
    add_nodes(SGroupName, Nodes)             Adds a list of nodes to an s_group.
    remove_nodes(SGroupName, Nodes)          Removes a list of nodes from an s_group.
    s_groups()                               Returns a list of all s_groups known to the node.
    own_s_groups()                           Returns a list of s_group tuples the node belongs to.
    own_nodes()                              Returns a list of nodes from all s_groups the node belongs to.
    own_nodes(SGroupName)                    Returns a list of nodes from the given s_group.
    info()                                   Returns s_group state information.
    register_name(SGroupName, Name, Pid)     Registers a name in the given s_group.
    re_register_name(SGroupName, Name, Pid)  Re-registers a name (changes a registration) in a given s_group.
    unregister_name(SGroupName, Name)        Unregisters a name in the given s_group.
    registered_names({node, Node})           Returns a list of all registered names on the given node.
    registered_names({s_group, SGroupName})  Returns a list of all registered names in the given s_group.
    whereis_name(SGroupName, Name)           Return the pid of a name registered in the given s_group.
    whereis_name(Node, SGroupName, Name)
    send(SGroupName, Name, Msg)              Send a message to a name registered in the given s_group.
    send(Node, SGroupName, Name, Msg)

The name will only be registered if the process is being executed on a node that belongs to the given s_group, and neither Name nor Pid is already registered in that group:

    s_group:register_name(SGroupName, Name, Pid) -> yes | no
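For example, on a node belonging to an s_group named group1 (the group, name and process here are illustrative) a process can be registered and then addressed from any node of that group:

    Pid = spawn(fun() -> receive ping -> ok end end),
    yes = s_group:register_name(group1, pong_server, Pid),
    %% From any node of group1: look the name up, or message it directly.
    Pid = s_group:whereis_name(group1, pong_server),
    s_group:send(group1, pong_server, ping).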

To illustrate the impact of s_groups on scalability we repeat the global operations experiment from Section ii (Figure 9). In the SD Erlang experiment we partition the set of nodes into s_groups, each containing ten nodes, so the names are replicated and synchronised on just ten nodes, and not on all nodes as in distributed Erlang. The results in Figure 11 show that with 0.01% of global operations the throughput of distributed Erlang stops growing at 40 nodes, while the throughput of SD Erlang continues to grow linearly.

The connection topology of s_groups is extremely flexible: they may be organised into a hierarchy of arbitrary depth or branching, e.g. there could be multiple levels in the tree of s_groups (see Figure 12).


Figure 11: Global operations in distributed Erlang vs SD Erlang: throughput against the number of nodes

Moreover, it is not necessary to create a hierarchy of s_groups; for example, we have constructed an Orbit implementation using a ring of s_groups.

Given such a flexible way of organising distributed systems, key questions in the design of an SD Erlang system are the following. How should s_groups be structured? Depending on the reason the nodes are grouped, reducing the number of connections or reducing the namespace or both, s_groups can be freely structured as a tree, ring, or some other topology. How large should the s_groups be? Smaller s_groups mean more inter-group communication, but the synchronisation of the s_group state between the s_group nodes constrains the maximum size of s_groups. We have not found this constraint to be a serious restriction; for example, many s_groups are either relatively small, e.g. 10-node, internal or terminal elements in some topology, e.g. the leaves and nodes of a tree. How do nodes from different s_groups communicate? While any two nodes can communicate in an SD Erlang system, to minimise the number of connections communication between nodes from different s_groups is typically routed via gateway nodes that belong to both s_groups. How do we avoid single points of failure? For reliability, and to minimise communication load, multiple gateway nodes and processes may be required.

Information to make these design choices is provided by the tools in Section VII and by benchmarking. A further challenge is how to systematically refactor a distributed Erlang application into SD Erlang, and this is outlined in Section i. A detailed discussion of distributed system design and refactoring in SD Erlang is provided in a recent article [22].

We illustrate typical SD Erlang system designs by showing refactorings of the Orbit and ACO benchmarks from Section III. In both distributed and SD Erlang the computation starts on the Master node, and the actual computation is done on the Worker nodes. In the distributed Erlang version all nodes are interconnected, and messages are transferred directly from the sending node to the receiving node (Figure 2). In contrast, in the SD Erlang version nodes are grouped into s_groups, and messages are transferred between different s_groups via Sub-master nodes (Figure 13).


Figure 12: SD Erlang ACO (SR-ACO) architecture

A fragment of code that creates an s_group on a Sub-master node is as follows:

    create_s_group(Master, GroupName, Nodes0) ->
        case s_group:new_s_group(GroupName, Nodes0) of
            {ok, GroupName, Nodes} ->
                Master ! {GroupName, Nodes};
            _ -> io:format("exception: message")
        end.

Similarly, we introduce s_groups in the GR-ACO benchmark from Section ii to create Scalable Reliable ACO (SR-ACO); see Figure 12. Here, apart from reducing the number of connections, s_groups also reduce the global namespace information. That is, instead of registering the name of a pid globally, i.e. with all nodes, the name is registered only on all nodes in the s_group, with s_group:register_name/3.

A comparative performance evaluation of distributed Erlang and SD Erlang Orbit and ACO is presented in Section VIII.

i.2 Semi-Explicit Placement

Recall from Section iii that distributed Erlang spawns a process onto an explicitly named Erlang node, e.g.:

    spawn(some_node, fun some_module:pong/0)

and also recall the portability and programming effort issues associated with such explicit placement in large scale systems, discussed in Section ii.

To address these issues we have developed a semi-explicit placement library that enables the programmer to select nodes on which to spawn processes, based on run-time information about the properties of the nodes. For example, if a process performs a lot of computation one would like to spawn it on a node with considerable computation power, or if two processes are likely to communicate frequently then it would be desirable to spawn them on the same node, or on nodes with a fast interconnect.

We have implemented two Erlang libraries to support semi-explicit placement [60]. The first deals with node attributes, and describes properties of individual Erlang VMs and associated hosts, such as total and currently available RAM, installed software, hardware configuration, etc. The second deals with a notion of communication distances, which models the communication times between nodes in a distributed system. Therefore, instead of specifying a node, we can use the attr:choose_node/1 function to define the target node, i.e.:

    spawn(attr:choose_node(Params), fun some_module:pong/0)

Figure 13: SD Erlang (SD-Orbit) architecture

[60] report an investigation into the communication latencies on a range of NUMA and cluster architectures, and demonstrate the effectiveness of the placement libraries using the ML-ACO benchmark on the Athos cluster.
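For illustration, a placement-aware spawn might then look as follows; the attribute names in Params are hypothetical, since the text leaves the exact parameter format open:

    %% Ask for a node with ample free memory that is communication-close
    %% to the current node (attribute names are illustrative).
    Params = [{mem_free, 2000000000}, {max_comm_distance, 2}],
    spawn(attr:choose_node(Params), fun some_module:pong/0).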

ii. S_group Semantics

For precise specification, and as a basis for validation, we provide a small-step operational semantics of the s_group operations [21]. Figure 14 defines the state of an SD Erlang system and the associated abstract syntax variables. The abstract syntax variables on the left are defined as members of sets, denoted {...}, and these in turn may contain tuples, denoted (...), or further sets. In particular nm is a process name, p a pid, ni a node_id, and nis a set of node_ids. The state of a system is modelled as a four-tuple comprising a set of s_groups, a set of free_groups, a set of free_hidden_groups, and a set of nodes. Each type of group is associated with nodes and has a namespace. An s_group additionally has a name, whereas a free_hidden_group consists of only one node: a hidden node simultaneously acts as a node and as a group, because as a group it has a namespace, but does not share it with any other node. Free normal and hidden groups have no names, and are uniquely defined by the nodes associated with them. Therefore group names, gr_names, are either NoGroup or a set of s_group_names. A namespace is a set of (name, pid) pairs, and is replicated on all nodes of the associated group.

A node has the following four parameters: a node_id identifier; a node_type, which can be either hidden or normal; connections; and group_names, i.e. the names of the groups the node belongs to. The node can belong to either a list of s_groups or one of the free groups; the type of the free group is defined by the node type. Connections are a set of node_ids.

Transitions in the semantics have the form (state, command, ni) → (state′, value), meaning that executing command on node ni in state returns value and transitions to state′.

The semantics is presented in more detail by [21], but we illustrate it here with the s_group:registered_names/1 function from Section i.1. The function returns a list of names registered in s_group s if node ni belongs to the s_group, and an empty list otherwise (Figure 15).


(grs, fgs, fhs, nds) ∈ state ≡ ({s_group}, {free_group}, {free_hidden_group}, {node})
gr ∈ grs ≡ {s_group} ≡ {(s_group_name, {node_id}, namespace)}
fg ∈ fgs ≡ {free_group} ≡ {({node_id}, namespace)}
fh ∈ fhs ≡ {free_hidden_group} ≡ {(node_id, namespace)}
nd ∈ nds ≡ {node} ≡ {(node_id, node_type, connections, gr_names)}
gs ∈ gr_names ≡ NoGroup | {s_group_name}
ns ∈ namespace ≡ {(name, pid)}
cs ∈ connections ≡ {node_id}
nt ∈ node_type ≡ Normal | Hidden
s ∈ NoGroup ∪ {s_group_name}

Figure 14: SD Erlang state [21]

Here ⊕ denotes disjoint set union; IsSGroupNode returns true if node ni is a member of some s_group s, and false otherwise; and OutputNms returns the set of process names registered in the namespace ns of s_group s.

iii. Semantics Validation

As the semantics is concrete, it can readily be made executable in Erlang, with lists replacing sets throughout. Having an executable semantics allows users to engage with it, and to understand how the semantics behaves vis-à-vis the library, giving them an opportunity to assess the correctness of the library against the semantics.

Figure 16: Testing s_groups using QuickCheck
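To give a flavour of this, the rule for s_group:registered_names/1 (Figure 15) could be rendered executable along the following lines, with sets as lists. The state layout follows Figure 14, but the function and variable names here are our illustrative reconstruction, not the project's actual code:

    %% State: {Grs, Fgs, Fhs, Nds}; each s_group is {Name, NodeIds, Namespace}.
    registered_names({Grs, _Fgs, _Fhs, _Nds} = State, S, Ni) ->
        case [Ns || {Name, Nis, Ns} <- Grs, Name =:= S, lists:member(Ni, Nis)] of
            [Ns] -> {State, [{S, Nm} || {Nm, _Pid} <- Ns]};  %% OutputNms(s, ns)
            []   -> {State, []}                              %% ni not in s
        end.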

Better still, we can automatically assess how the system behaves in comparison with the (executable) semantics by executing them in lockstep, guided by the constraints of which operations are possible at each point. We do that by building an abstract state machine model of the library. We can then generate random sequences (or traces) through the model, with appropriate library data generated too. This random generation is supported by the QuickCheck property-based testing system [24, 9].

The architecture of the testing framework is shown in Figure 16. First, an abstract state machine, embedded as an "eqc_statem" module, is derived from the executable semantics specification. The state machine defines the abstract state representation, and the transition from one state to another when an operation is applied. Test case and data generators are then defined to control the test case generation; this includes the automatic generation of eligible s_group operations, and the input data to those operations. Test oracles are encoded as the postcondition for s_group operations.


((grs, fgs, fhs, nds), s_group:registered_names(s), ni)
    → ((grs, fgs, fhs, nds), nms)    if IsSGroupNode(ni, s, grs)
    → ((grs, fgs, fhs, nds), {})     otherwise

where (s, {ni} ⊕ nis, ns) ⊕ grs′ ≡ grs
      nms ≡ OutputNms(s, ns)

IsSGroupNode(ni, s, grs) = ∃ nis, ns, grs′ . (s, {ni} ⊕ nis, ns) ⊕ grs′ ≡ grs

OutputNms(s, ns) = {(s, nm) | (nm, p) ∈ ns}

Figure 15: SD Erlang semantics of the s_group:registered_names/1 function


During testing, each test command is applied to both the abstract model and the s_group library. The application of the test command to the abstract model takes the abstract model from its current state to a new state, as described by the transition functions, whereas the application of the test command to the library leads the system to a new actual state. The actual state information is collected from each node in the distributed system, then merged and normalised to the same format as the abstract state representation. For a test to be successful, after the execution of a test command the test oracles specified for this command should be satisfied. Various test oracles can be defined for s_group operations; for instance, one of the generic constraints that applies to all the s_group operations is that after each s_group operation the normalised system state should be equivalent to the abstract state.
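For concreteness, a skeletal version of such an abstract state machine module is sketched below. The callback names follow Quviq QuickCheck's eqc_statem behaviour; the state layout, the generators, and the single transition shown are our illustrative assumptions rather than the project's actual model:

    -module(s_group_model).
    -include_lib("eqc/include/eqc.hrl").
    -include_lib("eqc/include/eqc_statem.hrl").
    -export([initial_state/0, command/1, precondition/2,
             next_state/3, postcondition/3, prop_s_group/0]).

    %% Abstract state: the four-tuple of Figure 14, with sets as lists.
    initial_state() -> {[], [], [], []}.

    %% Generator for eligible s_group operations and their input data.
    command(_S) ->
        oneof([{call, s_group, new_s_group,
                [elements([g1, g2]), list(elements(['n1@host', 'n2@host']))]},
               {call, s_group, s_groups, []}]).

    precondition(_S, _Call) -> true.

    %% Transition function, mirroring the executable semantics.
    next_state({Grs, Fgs, Fhs, Nds}, _Res,
               {call, s_group, new_s_group, [Name, Nodes]}) ->
        {[{Name, Nodes, []} | Grs], Fgs, Fhs, Nds};
    next_state(S, _Res, _Call) -> S.

    %% Test oracle: the normalised actual state gathered from the nodes
    %% must be equivalent to the abstract state (check elided here).
    postcondition(_S, _Call, _Res) -> true.

    prop_s_group() ->
        ?FORALL(Cmds, commands(?MODULE),
                begin
                    {H, S, Res} = run_commands(?MODULE, Cmds),
                    pretty_commands(?MODULE, Cmds, {H, S, Res}, Res == ok)
                end).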

Thousands of tests were run, and three kinds of errors (all subsequently corrected) were found. Some errors in the library implementation were found, including one error due to the synchronisation between nodes, and another related to the remove_nodes operation, which erroneously raised an exception. We also found a couple of trivial errors in the semantic specification itself, which had been missed by manual examination. Finally, we found some situations where there were inconsistencies between the semantics and the library implementation, despite their states being equivalent; an example of this was in the particular values returned by functions on certain errors. Overall, the automation of testing boosted our confidence in the correctness of the library implementation and the semantic specification. This work is reported in more detail by [54].

VI. Improving VM Scalability

This section reports the primary VM and library improvements that we have designed and implemented to address the scalability and reliability issues identified in Section i.


i. Improvements to Erlang Term Storage

Because ETS tables are so heavily used in Erlang systems, they are a focus for scalability improvements. We start by describing their redesign, including some improvements that pre-date our RELEASE project work, i.e. those prior to Erlang/OTP R15B03. These historical improvements are very relevant for a scalability study, and form the basis for our subsequent changes and improvements. At the point when Erlang/OTP got support for multiple cores (in release R11B) there was a single reader-writer lock for each ETS table. Optional fine-grained locking of hash-based ETS tables (i.e. set, bag or duplicate_bag tables) was introduced in Erlang/OTP R13B02-1, adding 16 reader-writer locks for the hash buckets. Reader groups to minimise read synchronisation overheads were introduced in Erlang/OTP R14B. The key observation is that a single count of the multiple readers must be synchronised across many cache lines, potentially far away in a NUMA system. Maintaining reader counts in multiple (local) caches makes reads fast, although writes must now check every reader count. In Erlang/OTP R16B the number of bucket locks and the default number of reader groups were both upgraded from 16 to 64.
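These locking facilities are exposed to programmers through the standard ETS table options; a minimal sketch (the table name and access pattern are illustrative):

    %% A hash-based (set) table tuned for concurrent use:
    %% write_concurrency enables fine-grained bucket locking, and
    %% read_concurrency enables reader groups.
    Tab = ets:new(my_table, [set, public,
                             {write_concurrency, true},
                             {read_concurrency, true}]),
    true = ets:insert(Tab, {some_key, 42}),
    [{some_key, 42}] = ets:lookup(Tab, some_key).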

We illustrate the scaling properties of the ETS concurrency options using the ets_bench BenchErl benchmark on the Intel NUMA machine with 32 hyperthreaded cores specified in Appendix A. The ets_bench benchmark inserts 1M items into the table, then records the time to perform 17M operations, where an operation is either a lookup, an insert, or a delete. The experiments are conducted on a hash-based (set) ETS table with different percentages of update operations, i.e. insertions or deletions.

Figure 17 shows the runtimes in seconds of 17M operations in different Erlang/OTP versions, varying the number of schedulers (x-axis), reflecting how the scalability of ETS tables has improved in more recent Erlang/OTP releases.

Figure 17: Runtime of 17M ETS operations in Erlang/OTP releases R11B-5, R13B02-1, R14B and R16B: 10% updates (left) and 1% updates (right). The y-axis shows time in seconds (lower is better), and the x-axis is the number of OS threads (VM schedulers).

Figure 18 shows the runtimes in seconds of 17M operations on an ETS table with different numbers of reader groups, again varying the number of schedulers. We see that one reader group is not sufficient with 10% updates, nor are two with 1% updates. Beyond that, different numbers of reader groups have little impact on the benchmark performance, except that using 64 groups with 10% updates slightly degrades performance.

We have explored four other extensions or redesigns in the ETS implementation for better scalability: (1) allowing more programmer control over the number of bucket locks in hash-based tables, so the programmer can reflect the number of schedulers and the expected access pattern; (2) using contention-adapting trees to get better scalability for ordered_set ETS tables, as described by [69]; (3) using queue delegation locking to improve scalability [47]; and (4) adopting schemes for completely eliminating the locks in the meta table. A more complete discussion of our work on ETS can be found in the papers by [69] and [47].


Figure 18: Runtime of 17M ETS operations with varying numbers of reader groups: 10% updates (left) and 1% updates (right). The y-axis shows runtime in seconds (lower is better), and the x-axis is the number of schedulers.

Here we outline only our work on contention-adapting (CA) trees. A CA tree monitors contention in different parts of a tree-shaped data structure, introducing routing nodes with locks in response to high contention, and removing them in response to low contention. For experimental purposes, two variants of the CA tree have been implemented to represent ordered_sets in the virtual machine of Erlang/OTP 17.0.

Figure 19: Throughput (operations per microsecond) of the CA tree variants, set and ordered_set, against the number of scheduler threads: 10% updates (left) and 1% updates (right)

One extends the existing AVL trees in the Erlang VM, and the other uses a Treap data structure [5]. Figure 19 compares the throughput of the CA tree variants with that of ordered_set and set as the number of schedulers increases. It is unsurprising that the CA trees scale so much better than an ordered_set, which is protected by a single readers-writer lock. It is more surprising that they also scale better than set. This is due to hash tables using fine-grained locking at a fixed granularity, while CA trees can adapt the number of locks both to the current contention level, and to the parts of the key range where contention is occurring.


ii. Improvements to Schedulers

In the Erlang VM, a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until it either blocks waiting for input, typically a message from some other process, or until it has executed its quota of reductions.

The Erlang VM is usually started with one scheduler per logical core (SMT-thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned it is placed in the run queue of the scheduler of its parent, and waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores; that is, an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Figure 24.

The default load management mechanism is load compaction, which aims to keep as many scheduler threads as possible fully loaded with work, i.e. it attempts to ensure that scheduler threads do not run out of work. We have developed a new optional scheduler utilisation balancing mechanism, available from Erlang/OTP 17.0. The new mechanism aims to balance scheduler utilisation between schedulers; that is, it strives for equal scheduler utilisation on all schedulers.

The scheduler utilisation balancing mechanism has no performance impact on the system when not enabled. On the other hand, when enabled it results in changed timing in the system: normally there is a small overhead, due to measuring of utilisation and calculating balancing information, which depends on the underlying primitives provided by the operating system.

The new balancing mechanism results in a better distribution of processes to schedulers, reducing the probability of core contention. Together with other VM improvements, such as interruptable BIFs and garbage collection, it results in lower latency and improved responsiveness, and hence reliability, for soft real-time applications.
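The mechanism is switched on with a VM flag when starting a node (by default the VM keeps the load compaction strategy):

    $ erl +sub true    # enable scheduler utilisation balancing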

iii. Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel benchmark in Section i. The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in that returns "Erlang system time", i.e. time since the Epoch with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, i.e. not SMT-enabled. The documentation promises that values returned by it are strictly increasing, and many applications ended up relying on this; for example, applications often employ erlang:now/0 to generate unique integers.

Erlang system time should align with the operating system's view of time since the Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot, without invalidating the strictly increasing value guarantee.


The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when the two have deviated, e.g. in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], i.e. a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards or backwards, while system time is allowed to do so. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.
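Under the new API, the common erlang:now/0 idioms translate as in the sketch below. All the calls shown are the Erlang/OTP 18+ time API (do_work/0 is a placeholder, and the millisecond unit name follows the modern spelling):

    %% A wall-clock timestamp: monotonic time plus the varying offset.
    SystemTime = erlang:monotonic_time() + erlang:time_offset(),
    %% Measuring a duration: differences of monotonic time are safe.
    T0 = erlang:monotonic_time(),
    do_work(),
    ElapsedMs = erlang:convert_time_unit(erlang:monotonic_time() - T0,
                                         native, millisecond),
    %% Unique integers, a frequent reason for calling erlang:now/0.
    Id = erlang:unique_integer([monotonic, positive]).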

Time Retrieval. Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe, but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, monotonic time delivered by the operating system is solely based on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP); that is, they will run with different frequencies. Linux is an exception, with a monotonic clock that is NTP-adjusted and runs with the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing monotonic time and system time delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute. If the adjustment needs to be changed, new adjustment information is published and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the time adjustment information previously published, and calculates Erlang monotonic time. To preserve monotonicity it is important that all threads that read the same OS monotonic time map it to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information, using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed, which means that in the vast majority of cases it will be read-locked, allowing multiple readers to run concurrently. To prevent bouncing of the lock cache-line, we use a bespoke reader-optimised RW lock implementation, where reader threads record their presence in counters on separate cache-lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11]; alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].

Timer Wheel and BIF Timer. The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel, which is used by processes executing on that scheduler.

The implementation of timers in Erlang/OTP uses a built-in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS tables, and only insert one timer at a time into the timer wheel.

Benchmarks. We have measured several benchmarks on a 16-core Bulldozer machine with eight dual CPU AMD Opteron 4376 HEs5; we present three of them here.

The first micro benchmark compares the execution time of an Erlang receive with that of a receive after that specifies a timeout and provides a default value. The receive after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, the total execution time with the optimised timers is only 5% longer than without timers.
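The shape of this micro benchmark is roughly as follows (our reconstruction, with an arbitrary timeout; the timer set on entry to the receive is cancelled as soon as the message arrives):

    with_timer(0) -> ok;
    with_timer(N) ->
        self() ! msg,
        receive
            msg -> ok
        after 1000 -> timeout    %% never fires: msg is already in the mailbox
        end,
        with_timer(N - 1).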

The second micro benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4 and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on the left shows that: (1) the performance of time management has remained roughly unchanged between Erlang/OTP releases prior to 18.0; (2) the improved time management in Erlang/OTP 18.x makes time management less likely to be a scalability bottleneck, even when using erlang:now/0; and (3) the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution. The graph on the right side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

5. See §2.5.4 of the RELEASE project Deliverable 2.4 (http://release-project.eu/documents/D24.pdf).

Figure 20: BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases: runtimes (left) and speedup obtained using erlang:monotonic_time/0 (right)

VII. Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable, to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling and debugging Erlang systems, which can be used separately or together as appropriate for solving the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on or specialise the services offered by the Erlang virtual machine through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer6 application gives a comprehensive overview of many of these data on a node-by-node basis.

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature; that is, the tools support the language itself, rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use the tracing support provided by the Erlang VM, so other actor frameworks, e.g. Akka, can use the Kamon7 JVM monitoring system. Similarly, tools for other actor languages or frameworks could use data derived through OS-level tracing frameworks such as DTrace8 and SystemTap9, as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

6. http://www.erlang.org/doc/apps/observer
7. http://kamon.io

i. Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare it for modification or extension, or, as is the case here, in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support in a refactoring tool. There are a number of tools that support refactoring in Erlang: in the RELEASE project we have chosen to extend Wrangler10 [56]; other tools include Tidier [68] and RefactorErl [44].

Supporting API Migration. The SD Erlang libraries modify Erlang's global_group library, becoming the new s_group library; as a result, Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolves and this often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically. The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work, and the technicalities of the implementation, can be found in a paper by [52].

8. http://dtrace.org/blogs/about
9. https://sourceware.org/systemtap/wiki
10. http://www.cs.kent.ac.uk/projects/wrangler
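As an illustration, an adaptor for the global_group to s_group migration might contain clauses like those below, each defining an old API call in terms of the new one. The module name, the wrapping convention, and the fixed group name are our assumptions, not Wrangler's actual adaptor format:

    -module(global_group_adaptor).
    -export([registered_names/1, send/2]).

    %% Old call: global_group:registered_names({node, Node}).
    registered_names({node, Node}) ->
        s_group:registered_names({node, Node}).

    %% Old call: global_group:send(Name, Msg); names now live in an s_group.
    send(Name, Msg) ->
        s_group:send(my_s_group, Name, Msg).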

Support for Introducing Parallelism. We have introduced support for parallelising explicit list operations (map and foreach), for process introduction to complete a computationally intensive task in parallel, for introducing a worker process to deal with call handling in an Erlang "generic server", and for parallelising a tail recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach. The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward: even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
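A before/after sketch of that transformation (para_lib:map/2 is the library function named above; its signature mirroring lists:map/2, and the compute/1 worker, are our assumptions):

    %% Before: sequential map.
    Results = lists:map(fun compute/1, Inputs),
    %% After the refactoring: parallel map from Wrangler's para_lib.
    Results = para_lib:map(fun compute/1, Inputs).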

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute a task in parallel with its parent process. The result of the new process is sent back to the parent process, which will then consume it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies. For instance, an example might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are checked by program slicing, which is discussed below.

Support for Program Slicing. Program slicing is a general technique of program analysis for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE11 [17], a tool developed in another EU project that also re-uses the Wrangler front end. This work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing in transforming programs to make them suitable for the introduction of parallel structures.

11. http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf


Figure 21: The architecture of WombatOAM

ii. Scalable Deployment

We have developed the WombatOAM tool12 to provide a deployment, operations and maintenance framework for large-scale Erlang distributed systems. These systems typically consist of a number of Erlang nodes executing on different hosts. These hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP.

12. WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd.

Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this is the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture. The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability because of the role played by the central Master node; in its current version an additional layer of Middle Managers was introduced to allow the system to scale easily to thousands of deployed nodes. As the diagram shows, the "northbound" interfaces to the web dashboard and the command-line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms, and so forth; we describe these now.

Services. WombatOAM is designed to collect, store and display various kinds of information and events from running Erlang systems, and these data are accessed and managed through an AJAX-based Web Dashboard. These include the following:

Metrics: WombatOAM supports the collection of some hundred metrics (including, for instance, the number of processes on a node, and the message traffic in and out of the node) on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom13, and interface with other tools, such as Graphite14, which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications: As well as metrics, WombatOAM supports event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which is part of the standard distribution, or the lager logging framework15, and will be displayed and logged as they occur.

13. Folsom collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom
14. https://graphiteapp.org

Alarms: Alarms are more complex entities. Alarms have a state: they can be raised, and once dealt with they can be cleared. They also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as for notifications.

Topology: The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible; if not, it notifies the other services and periodically tries to reconnect. When the nodes are available again it also notifies the other services. It doesn't have a middle manager part, because it doesn't talk to the nodes directly; instead it asks the Node manager service to do so.

Node manager: This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration: This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers the Orchestration service uses an external library called Libcloud16, for which Erlang Solutions has written an open source Erlang wrapper, called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications; it provides infrastructure for deploying them.

15. https://github.com/basho/lager

Deployment. The mechanism consists of the following five steps:

1. Registering a provider. WombatOAM provides the same interface for different cloud providers that support the OpenStack standard or the Amazon EC2 API. WombatOAM also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this enables WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

16. The unified cloud API [4]: https://libcloud.apache.org

4. Defining a deployment domain. At this step a deployment domain is created, specifying (i) which providers should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to or removed from the system, depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii. Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single-host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue, and processes migrate between the queues through a work stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes.17

iii.1 Percept2

Percept218 builds on the existing Percept tool to provide post hoc offline analysis and visualisation of Erlang systems. Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling for a distributed Erlang application using Percept2 is shown in Figure 22.

17. This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.
18. https://github.com/RefactoringTools/percept2

Figure 22: Offline profiling of a distributed Erlang application using Percept2
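In code, the offline workflow of Figure 22 amounts to a profile/analyse/browse sequence. The calls below mirror the original Percept API on which Percept2 builds, but the exact option list, file name and port are our assumptions:

    percept2:profile("orbit.dat", {bench, orbit, []}, [all]),  %% trace a run to file
    percept2:analyze(["orbit.dat"]),                           %% build the RAM database
    percept2:start_webserver(8888).                            %% view at http://localhost:8888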

Percept generates an application-level zoomable concurrency graph, showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways, as detailed by [53], including most importantly:

• Distinguishing between running and runnable time for each process: this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution: including the migration history of a process between run queues; statistics about message passing between processes (the number of messages, and the average message size, sent/received by a process); the accumulated runtime per process; and the accumulated time when a process is in a running state.

Figure 23: Percept2 showing processes running and runnable (with 4 schedulers/run queues)

• Presenting the process tree: the hierarchy structure indicating the parent-child relationships between processes.

• Recording the dynamic function call graph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system: unlike global group, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have also extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled: the profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes: in Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating measurement of a function's own execution time gives users the freedom of not profiling all the function calls invoked during the program execution. For example, they can choose to profile only functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph: with the dynamic function callgraph, a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of generated web pages. Together these make the system more responsive and more scalable.

iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, and require neither specially recompiled versions of the software to be examined, nor special post-processing tools to create meaningful information from the data gathered. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the sizes of the run queues vary during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the same storage infrastructure and presentation facilities.

Figure 24: DTrace run-queue visualisation of the bang benchmark with 16 schedulers (time on the x-axis)

iii.3 Devo

Devo19 [10] is designed to provide real-time online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).

Figure 25 shows visualisations from Devo in both modes. On the left-hand side a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram), with six cores each; with hyperthreading this gives twelve virtual cores per chip, and hence 24 run queues in total. The size of these run queues is shown by both the colour and the height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle. A green arc shows migration on the same physical core, a grey one migration on the same chip, and blue shows migrations between the two chips.

19. https://github.com/RefactoringTools/devo


Figure 25: Devo low- and high-level visualisations of SD-Orbit

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between the nodes (green quiescent; red busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. Tracing to be performed on monitored nodes is also specified within the configuration file.

An agent is started by a master SD-Mon node for each s_group and for each free node. Configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up it tries to get in contact with its nodes and apply the tracing to them, as stated by the master. Binary files are stored in the host file system. Tracing is internally used in order to track s_group operations happening at runtime; an asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

Figure 26: SD-Mon architecture

The monitoring network is also supervised in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected then that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution, before (top) and after eliminating s_group 2 (bottom)


Figure 28: SD-Mon online monitoring

VIII Systemic Evaluation

Preceding sections have investigated the improvements of individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI, and VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state of the art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable20. The bulk of the experiments reported here are conducted on the Athos cluster using Erlang/OTP 17.4 and the associated SD Erlang libraries.

20 See Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf

Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling]

The experiments cover two measures of scalability. As Orbit does a fixed size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure. The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.
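For reference, relative speedup on n cores is the standard ratio S(n) = T(1)/T(n), where T(n) is the execution time on n cores; under weak scaling the work grows with the compute resources, so runtimes rather than speedups are reported (as in Figure 31).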

i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot standard deviation. Results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups: beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails because some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long running servers.

Figure 30: Wombat deployment time. Experimental data with best fit 0.0125N + 83 (time in seconds against number of nodes N)
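(As a consistency check, at N = 10,000 nodes the fit predicts 0.0125 x 10,000 + 83 = 208s, close to the measured 212s.)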

The measurement data shows two important facts: WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, while SR-ACO registers them only within an s_group (Section ii).

For both GR-ACO and SR-ACO, a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common x86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, so around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark and obtained similar results [23].
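The essence of the Chaos Monkey service [14] can be conveyed in a few lines of Erlang. The following is a minimal sketch of the technique (our illustration, using the OTP 18+ rand module, not the harness used in the experiments; a real harness would exclude VM-critical processes from the victim list):

    %% Kill one randomly chosen local process every second.
    chaos_monkey() ->
        Victims = [P || P <- erlang:processes(), P =/= self()],
        Victim = lists:nth(rand:uniform(length(Victims)), Victims),
        exit(Victim, kill),
        timer:sleep(1000),
        chaos_monkey().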

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.

Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling]

Scalability. Figure 31 compares the runtimes of the ML, GR, and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and the fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks and others that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.

Figure 32: Number of sent packets in ML-ACO, GR-ACO, and SR-ACO

As SD-Orbit scales better than D-Orbit, SR-ACO scales better than ML-ACO, and SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60 node scaling limits of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language, while preserving reliability. The work is a vade mecum for addressing scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering VM, language and persistent storage levels. We have developed the BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15], or Legion [40]. We establish scientifically the folklore limitations of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups for partitioning the network and the global process namespace, and semi-explicit process placement for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).

To improve the scalability of the Erlang VM and libraries we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60 node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

Table 2: Cluster Specifications (RAM per host in GB)

Name    Hosts  Cores/host  Cores (max total)  Cores (avail)  Processor                                  RAM     Interconnection
GPG     20     16          320                320            Intel Xeon E5-2640v2, 8C, 2.0GHz           64      10GB Ethernet
Kalkyl  348    8           2784               1408           Intel Xeon 5520v2, 4C, 2.26GHz             24-72   InfiniBand 20Gb/s
TinTin  160    16          2560               2240           AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz   64-128  2:1 oversubscribed QDR Infiniband
Athos   776    24          18624              6144           Intel Xeon E5-2697v2, 12C, 2.7GHz          64      Infiniband FDR14

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages21. For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) An AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3 GHz, with 16 "Bulldozer" cores each, giving a total of 64 cores. (2) An Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

21 See Deliverable 6.7 (http://www.release-project.eu/documents/D67.pdf), available online.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58-67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540-545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68-75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33-42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 2-10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level Visualization of Concurrent and Distributed Computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riakdocs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069-1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 207-216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördos, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Koszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104-121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157-166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22-34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördos, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33-41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268-279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341-350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107-113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133-140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go programming language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13-22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118-129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89-144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Technical Report, Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 20-20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. De-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43-49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73-74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51-59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39-45, 1997.

[41] Carl Hewitt. Actor model for discretionary, adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235-245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 1:1-1:1, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. In ACM Sigplan Notices, volume 28, pages 91-108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15-26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572-583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1-14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35-40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.

[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lubeck and Max Neunhoffer. Enumerating large Orbits and direct condensation. Experimental Mathematics, 10(2):197-205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12-23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27-38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207-221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103-104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1-12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11-20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011-2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13-24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3-11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215-224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing - 28th International Workshop, volume 9519 of LNCS, pages 37-53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104-128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale (3rd ed., revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-Strength Functional programming: Experiences with the Ericsson AXD301 Project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.


Table 1: Summary of s_group functions

new_s_group(SGroupName, Nodes): Creates a new s_group consisting of some nodes.
delete_s_group(SGroupName): Deletes an s_group.
add_nodes(SGroupName, Nodes): Adds a list of nodes to an s_group.
remove_nodes(SGroupName, Nodes): Removes a list of nodes from an s_group.
s_groups(): Returns a list of all s_groups known to the node.
own_s_groups(): Returns a list of s_group tuples the node belongs to.
own_nodes(): Returns a list of nodes from all s_groups the node belongs to.
own_nodes(SGroupName): Returns a list of nodes from the given s_group.
info(): Returns s_group state information.
register_name(SGroupName, Name, Pid): Registers a name in the given s_group.
re_register_name(SGroupName, Name, Pid): Re-registers a name (changes a registration) in a given s_group.
unregister_name(SGroupName, Name): Unregisters a name in the given s_group.
registered_names({node, Node}): Returns a list of all registered names on the given node.
registered_names({s_group, SGroupName}): Returns a list of all registered names in the given s_group.
whereis_name(SGroupName, Name), whereis_name(Node, SGroupName, Name): Return the pid of a name registered in the given s_group.
send(SGroupName, Name, Msg), send(Node, SGroupName, Name, Msg): Send a message to a name registered in the given s_group.

The function registers Name for Pid in the given s_group, provided that the calling process is being executed on a node that belongs to the given s_group, and neither Name nor Pid are already registered in that group:

s_group:register_name(SGroupName, Name, Pid) -> yes | no
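For example, a process might create an s_group, register a name in it, and address that name as follows (a sketch: the group and node names are illustrative):

    %% Create an s_group, register a process name, then use the name.
    {ok, group1, Nodes} = s_group:new_s_group(group1, ['node1@host1', 'node2@host2']),
    yes = s_group:register_name(group1, db_server, Pid),
    Pid = s_group:whereis_name(group1, db_server),
    s_group:send(group1, db_server, {request, self()}).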

To illustrate the impact of s_groups on scalability we repeat the global operations experiment from Section ii (Figure 9). In the SD Erlang experiment we partition the set of nodes into s_groups, each containing ten nodes, and hence the names are replicated and synchronised on just ten nodes, and not on all nodes as in distributed Erlang. The results in Figure 11 show that with 0.01% of global operations the throughput of distributed Erlang stops growing at 40 nodes, while the throughput of SD Erlang continues to grow linearly.

Figure 11: Global operations in Distributed Erlang vs SD Erlang

The connection topology of s_groups is extremely flexible: they may be organised into a hierarchy of arbitrary depth or branching, e.g. there could be multiple levels in the tree of s_groups; see Figure 12. Moreover, it is not necessary to create a hierarchy of s_groups; for example, we have constructed an Orbit implementation using a ring of s_groups.

Given such a flexible way of organising distributed systems, key questions in the design of an SD Erlang system are the following. How should s_groups be structured? Depending on the reason the nodes are grouped (reducing the number of connections, reducing the namespace, or both), s_groups can be freely structured as a tree, ring, or some other topology. How large should the s_groups be? Smaller s_groups mean more inter-group communication, but the synchronisation of the s_group state between the s_group nodes constrains the maximum size of s_groups. We have not found this constraint to be a serious restriction; for example, many s_groups are either relatively small, e.g. 10-node, internal or terminal elements in some topology, e.g. the leaves and nodes of a tree. How do nodes from different s_groups communicate? While any two nodes can communicate in an SD Erlang system, to minimise the number of connections communication between nodes from different s_groups is typically routed via gateway nodes that belong to both s_groups. How do we avoid single points of failure? For reliability, and to minimise communication load, multiple gateway nodes and processes may be required.

Information to make these design choices is provided by the tools in Section VII and by benchmarking. A further challenge is how to systematically refactor a distributed Erlang application into SD Erlang, and this is outlined in Section i. A detailed discussion of distributed system design and refactoring in SD Erlang is provided in a recent article [22].

We illustrate typical SD Erlang system designs by showing refactorings of the Orbit and ACO benchmarks from Section III. In both distributed and SD Erlang, the computation starts on the Master node and the actual computation is done on the Worker nodes. In the distributed Erlang version all nodes are interconnected, and messages are transferred directly from the sending node to the receiving node (Figure 2). In contrast, in the SD Erlang version nodes are grouped into s_groups, and messages are transferred between different s_groups via Sub-master nodes (Figure 13).

Figure 12: SD Erlang ACO (SR-ACO) architecture

A fragment of code that creates an s_group on a Sub-master node is as follows:

create_s_group(Master, GroupName, Nodes0) ->
    case s_group:new_s_group(GroupName, Nodes0) of
        {ok, GroupName, Nodes} ->
            Master ! {GroupName, Nodes};
        _ -> io:format("exception: message")
    end.

Similarly, we introduce s_groups in the GR-ACO benchmark from Section ii to create Scalable Reliable ACO (SR-ACO); see Figure 12. Here, apart from reducing the number of connections, s_groups also reduce the global namespace information. That is, instead of registering the name of a pid globally, i.e. with all nodes, the name is registered only on all nodes in the s_group, with s_group:register_name/3.

A comparative performance evaluation of distributed Erlang and SD Erlang Orbit and ACO is presented in Section VIII.

i.2 Semi-Explicit Placement

Recall from Section iii that distributed Erlang spawns a process onto an explicitly named Erlang node, e.g.:

spawn(some_node, fun some_module:pong/0)

and also recall the portability and programming effort issues associated with such explicit placement in large scale systems, discussed in Section ii.

To address these issues we have developed a semi-explicit placement library that enables the programmer to select nodes on which to spawn processes based on run-time information about the properties of the nodes. For example, if a process performs a lot of computation one would like to spawn it on a node with considerable computation power, or if two processes are likely to communicate frequently then it would be desirable to spawn them on the same node, or on nodes with a fast interconnect.

We have implemented two Erlang libraries to support semi-explicit placement [60]. The first deals with node attributes, and describes properties of individual Erlang VMs and associated hosts, such as total and currently available RAM, installed software, hardware configuration, etc. The second deals with a notion of communication distances, which models the communication times between nodes in a distributed system. Therefore, instead of specifying a node, we can use the attr:choose_node/1 function to define the target node, i.e.:

spawn(attr:choose_node(Params), fun some_module:pong/0)

Figure 13: SD Erlang (SD-Orbit) architecture
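For example, a computation-heavy process might be placed as follows (a sketch: the attribute list passed to attr:choose_node/1 is a hypothetical illustration, not the library's exact parameter format):

    %% Choose a target node by run-time properties, then spawn there.
    Params = [{ram_available_gb, 16}, {max_comm_distance, 2}],  % hypothetical attributes
    Node = attr:choose_node(Params),
    spawn(Node, fun some_module:pong/0).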

[60] report an investigation into the communication latencies on a range of NUMA and cluster architectures, and demonstrate the effectiveness of the placement libraries using the ML-ACO benchmark on the Athos cluster.

ii S_group Semantics

For precise specification, and as a basis for validation, we provide a small-step operational semantics of the s_group operations [21]. Figure 14 defines the state of an SD Erlang system and associated abstract syntax variables. The abstract syntax variables on the left are defined as members of sets, denoted {...}, and these in turn may contain tuples, denoted (...), or further sets. In particular, nm is a process name, p a pid, ni a node_id, and nis a set of node_ids. The state of a system is modelled as a four-tuple comprising a set of s_groups, a set of free_groups, a set of free_hidden_groups, and a set of nodes. Each type of group is associated with nodes and has a namespace. An s_group additionally has a name, whereas a free_hidden_group consists of only one node, i.e. a hidden node simultaneously acts as a node and as a group, because as a group it has a namespace but does not share it with any other node. Free normal and hidden groups have no names, and are uniquely defined by the nodes associated with them. Therefore, group names, gr_names, are either NoGroup or a set of s_group_names. A namespace is a set of name and process id (pid) pairs, and is replicated on all nodes of the associated group.

A node has the following four parameters: node_id identifier; node_type, which can be either hidden or normal; connections; and group_names, i.e. names of the groups the node belongs to. The node can belong to either a list of s_groups or one of the free groups. The type of the free group is defined by the node type. Connections are a set of node_ids.

Transitions in the semantics have the form (state, command, ni) −→ (state′, value), meaning that executing command on node ni in state returns value and transitions to state′.

(grs, fgs, fhs, nds) ∈ state ≡ ({s_group}, {free_group}, {free_hidden_group}, {node})
gr ∈ grs ≡ s_group ≡ (s_group_name, {node_id}, namespace)
fg ∈ fgs ≡ free_group ≡ ({node_id}, namespace)
fh ∈ fhs ≡ free_hidden_group ≡ (node_id, namespace)
nd ∈ nds ≡ node ≡ (node_id, node_type, connections, gr_names)
gs ∈ gr_names ≡ {NoGroup, {s_group_name}}
ns ∈ namespace ≡ {(name, pid)}
cs ∈ connections ≡ {node_id}
nt ∈ node_type ≡ {Normal, Hidden}
s ∈ {NoGroup, s_group_name}

Figure 14: SD Erlang state [21]

The semantics is presented in more detail by [21], but we illustrate it here with the s_group:registered_names/1 function from Section i.1. The function returns a list of names registered in s_group s if node ni belongs to the s_group, and an empty list otherwise (Figure 15). Here ⊕ denotes disjoint set union, IsSGroupNode returns true if node ni is a member of some s_group s and false otherwise, and OutputNms returns a set of process names registered in the ns namespace of s_group s.
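Since the semantics is executable with lists standing in for sets (Section iii), the Figure 15 rule can be rendered directly in Erlang. The following is a minimal sketch under our assumed encoding of the Figure 14 state, not the project's actual module:

    %% State is {GRs, FGs, FHs, NDs}; GRs is a list of {SGroupName, NodeIds, NS}
    %% tuples, and NS is a list of {Name, Pid} pairs (lists standing in for sets).
    registered_names({GRs, _FGs, _FHs, _NDs} = State, S, Ni) ->
        case lists:keyfind(S, 1, GRs) of
            {S, NodeIds, NS} ->
                case lists:member(Ni, NodeIds) of
                    true  -> {State, [{S, Nm} || {Nm, _Pid} <- NS]};
                    false -> {State, []}
                end;
            false -> {State, []}
        end.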

iii Semantics Validation

Figure 16: Testing s_groups using QuickCheck

As the semantics is concrete, it can readily be made executable in Erlang, with lists replacing sets throughout. Having an executable semantics allows users to engage with it, and to understand how the semantics behaves vis à vis the library, giving them an opportunity to assess the correctness of the library against the semantics.

Better still, we can automatically assess how the system behaves in comparison with the (executable) semantics by executing them in lockstep, guided by the constraints of which operations are possible at each point. We do that by building an abstract state machine model of the library. We can then generate random sequences (or traces) through the model, with appropriate library data generated too. This random generation is supported by the QuickCheck property-based testing system [24, 9].

The architecture of the testing framework is shown in Figure 16. First, an abstract state machine embedded as an "eqc_statem" module is derived from the executable semantic specification. The state machine defines the abstract state representation and the transition from one state to another when an operation is applied. Test case and data generators are then defined to control the test case generation; this includes the automatic generation of eligible s_group operations and the input data to those operations. Test oracles are encoded as the postcondition for s_group operations.

((grs, fgs, fhs, nds), s_group:registered_names(s), ni)
  −→ ((grs, fgs, fhs, nds), nms)   if IsSGroupNode(ni, s, grs)
  −→ ((grs, fgs, fhs, nds), {})    otherwise

where (s, {ni} ⊕ nis, ns) ⊕ grs′ ≡ grs
      nms ≡ OutputNms(s, ns)

IsSGroupNode(ni, s, grs) = ∃ nis, ns, grs′. (s, {ni} ⊕ nis, ns) ⊕ grs′ ≡ grs

OutputNms(s, ns) = {(s, nm) | (nm, p) ∈ ns}

Figure 15: SD Erlang Semantics of s_group:registered_names/1 Function
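For illustration, the shape of such a model is sketched below. The module, generators, and state representation are our invention, not the project's actual test suite, but the callbacks (initial_state/0, command/1, precondition/2, next_state/3, postcondition/3) are the standard eqc_statem interface:

    %% Sketch of an eqc_statem model for one s_group operation.
    -module(s_group_eqc).
    -include_lib("eqc/include/eqc.hrl").
    -include_lib("eqc/include/eqc_statem.hrl").
    -compile(export_all).

    %% Abstract state: a list of {SGroupName, Nodes} pairs.
    initial_state() -> [].

    %% Generate an eligible operation together with generated input data.
    command(_S) ->
        {call, s_group, new_s_group,
         [elements([g1, g2, g3]),
          list(elements(['n1@host', 'n2@host', 'n3@host']))]}.

    precondition(_S, _Call) -> true.

    %% Transition function: record the newly created s_group in the model.
    next_state(S, _Res, {call, s_group, new_s_group, [Name, Nodes]}) ->
        [{Name, Nodes} | S].

    %% Test oracle: the library must report the group it was asked to create.
    postcondition(_S, {call, s_group, new_s_group, [Name, _Nodes]}, Res) ->
        case Res of
            {ok, Name, _Nodes1} -> true;
            _                   -> false
        end.

    prop_s_group() ->
        ?FORALL(Cmds, commands(?MODULE),
                begin
                    {_History, _Final, Result} = run_commands(?MODULE, Cmds),
                    Result =:= ok
                end).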

During testing, each test command is applied to both the abstract model and the s_group library. The application of the test command to the abstract model takes the abstract model from its current state to a new state, as described by the transition functions, whereas the application of the test command to the library leads the system to a new actual state. The actual state information is collected from each node in the distributed system, then merged and normalised to the same format as the abstract state representation. For a test to be successful, after the execution of a test command the test oracles specified for this command should be satisfied. Various test oracles can be defined for s_group operations; for instance, one of the generic constraints that applies to all the s_group operations is that after each s_group operation the normalised system state should be equivalent to the abstract state.

Thousands of tests were run, and three kinds of errors (which have subsequently been corrected) were found. Some errors in the library implementation were found, including one error due to the synchronisation between nodes, and another related to the remove_nodes operation, which erroneously raised an exception. We also found a couple of trivial errors in the semantic specification itself, which had been missed by manual examination. Finally, we found some situations where there were inconsistencies between the semantics and the library implementation despite their states being equivalent; an example of this was in the particular values returned by functions on certain errors. Overall, the automation of testing boosted our confidence in the correctness of the library implementation and the semantic specification. This work is reported in more detail by [54].

VI Improving VM Scalability

This section reports the primary VM and library improvements we have designed and implemented to address the scalability and reliability issues identified in Section i.


i Improvements to Erlang Term Storage

Because ETS tables are so heavily used in Erlang systems, they are a focus for scalability improvements. We start by describing their redesign, including some improvements that pre-date our RELEASE project work, i.e. those prior to Erlang/OTP R15B03. These historical improvements are very relevant for a scalability study, and form the basis for our subsequent changes and improvements. At the point when Erlang/OTP got support for multiple cores (in release R11B), there was a single reader-writer lock for each ETS table. Optional fine grained locking of hash-based ETS tables (i.e. set, bag or duplicate_bag tables) was introduced in Erlang/OTP R13B02-1, adding 16 reader-writer locks for the hash buckets. Reader groups to minimise read synchronisation overheads were introduced in Erlang/OTP R14B. The key observation is that a single count of the multiple readers must be synchronised across many cache lines, potentially far away in a NUMA system. Maintaining reader counts in multiple (local) caches makes reads fast, although writes must now check every reader count. In Erlang/OTP R16B the number of bucket locks and the default number of reader groups were both upgraded from 16 to 64.
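These concurrency options are selected per table at creation time with the standard ETS API; for example:

    %% A hash-based table with fine-grained bucket locking and reader
    %% groups enabled:
    T = ets:new(bench_table, [set, public,
                              {write_concurrency, true},
                              {read_concurrency, true}]),
    true = ets:insert(T, {key, 42}),
    [{key, 42}] = ets:lookup(T, key).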

We illustrate the scaling properties of the ETS concurrency options using the ets_bench BenchErl benchmark on the Intel NUMA machine with 32 hyperthreaded cores specified in Appendix A. The ets_bench benchmark inserts 1M items into the table, then records the time to perform 17M operations, where an operation is either a lookup, an insert, or a delete. The experiments are conducted on a hash-based (set) ETS table with different percentages of update operations, i.e. insertions or deletions.

Figure 17: Runtime of 17M ETS operations in Erlang/OTP releases, 10% updates (left) and 1% updates (right). The y-axis shows time in seconds (lower is better) and the x-axis is number of OS threads (VM schedulers)

Figure 17 shows the runtimes in seconds of 17M operations in different Erlang/OTP versions, varying the number of schedulers (x-axis), reflecting how the scalability of ETS tables has improved in more recent Erlang/OTP releases. Figure 18 shows the runtimes in seconds of 17M operations on an ETS table with different numbers of reader groups, again varying the number of schedulers. We see that one reader group is not sufficient with 10% updates, nor are two with 1% updates. Beyond that, different numbers of reader groups have little impact on the benchmark performance, except that using 64 groups with 10% updates slightly degrades performance.

Figure 18: Runtime of 17M ETS operations with varying numbers of reader groups, 10% updates (left) and 1% updates (right). The y-axis shows runtime in seconds (lower is better) and the x-axis is number of schedulers

We have explored four other extensions or redesigns of the ETS implementation for better scalability: 1. Allowing more programmer control over the number of bucket locks in hash-based tables, so the programmer can reflect the number of schedulers and the expected access pattern. 2. Using contention-adapting trees to get better scalability for ordered_set ETS tables, as described by [69]. 3. Using queue delegation locking to improve scalability [47]. 4. Adopting schemes for completely eliminating the locks in the meta table. A more complete discussion of our work on ETS can be found in the papers by [69] and [47].

Figure 19: Throughput of CA tree variants, 10% updates (left) and 1% updates (right)

Here we outline only our work on contention-adapting (CA) trees. A CA tree monitors contention in different parts of a tree-shaped data structure, introducing routing nodes with locks in response to high contention, and removing them in response to low contention. For experimental purposes, two variants of the CA tree have been implemented to represent ordered_sets in the virtual machine of Erlang/OTP 17.0: one extends the existing AVL trees in the Erlang VM, and the other uses a Treap data structure [5]. Figure 19 compares the throughput of the CA tree variants with that of ordered_set and set as the number of schedulers increases. It is unsurprising that the CA trees scale so much better than an ordered_set, which is protected by a single readers-writer lock. It is more surprising that they also scale better than set. This is due to hash tables using fine-grained locking at a fixed granularity, while CA trees can adapt the number of locks to the current contention level, and also to the parts of the key range where contention is occurring.
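(The two variants above were experimental builds of Erlang/OTP 17.0. To the best of our knowledge, in much later releases, from Erlang/OTP 22, a CA tree backs any ordered_set table created with write_concurrency, so the data structure is reachable through the ordinary ETS API:

    %% In recent OTP releases this ordered_set is backed by a CA tree:
    T = ets:new(ca_table, [ordered_set, public, {write_concurrency, true}]).)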


ii Improvements to Schedulers

In the Erlang VM, a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until it either blocks waiting for input, typically a message from some other process, or until it has executed its quota of reductions.

The Erlang VM is usually started with one scheduler per logical core (SMT-thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned, it is placed in the run queue of the scheduler of its parent, and it waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores: an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Figure 24 (Section iii.2).

The default load management mechanism is load compaction, which aims to keep as many scheduler threads as possible fully loaded with work, i.e. it attempts to ensure that scheduler threads do not run out of work. We have developed a new optional scheduler utilisation balancing mechanism that is available from Erlang/OTP 17.0. The new mechanism aims to balance scheduler utilisation between schedulers; that is, it will strive for equal scheduler utilisation on all schedulers.
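(The mechanism is opt-in: to our knowledge it is enabled by starting the emulator with the +sub true flag, and remains disabled by default.)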

The scheduler utilisation balancing mechanism has no performance impact on the system when not enabled. On the other hand, when enabled it results in changed timing in the system: normally there is a small overhead due to measuring of utilisation and calculating balancing information, which depends on the underlying primitives provided by the operating system.

The new balancing mechanism results in a better distribution of processes to schedulers, reducing the probability of core contention. Together with other VM improvements, such as interruptable BIFs and garbage collection, it results in lower latency and improved responsiveness, and hence reliability, for soft real-time applications.

iii Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel benchmark in Section i. The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in that returns "Erlang system time", or time since Epoch, with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, i.e. not SMT-enabled. The documentation promises that values returned by it are strictly increasing, and many applications ended up relying on this. For example, applications often employ erlang:now/0 to generate unique integers.
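For example, code that used erlang:now/0 only as a unique-value generator can migrate to the dedicated BIF in the new API (a sketch; erlang:unique_integer/1 is available from Erlang/OTP 18):

    %% Old style: unique {MegaSecs, Secs, MicroSecs} tuples, serialised
    %% through the globally-locked erlang:now/0.
    unique_old() -> erlang:now().

    %% New style: scalable unique integers; the monotonic option is only
    %% needed if values must also be strictly increasing.
    unique_new() -> erlang:unique_integer([positive, monotonic]).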

Erlang system time should align with the operating system's view of time since Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot, without invalidating the strictly increasing value


guarantee. The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when the two have deviated, e.g. in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], i.e. a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards or backwards, while system time is allowed to do this. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.
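Concretely, under the new API the three quantities are related as follows (a sketch using the OTP 18 BIFs):

    %% Erlang system time = Erlang monotonic time + time offset.
    Mono    = erlang:monotonic_time(),  % never leaps; may be negative
    Offset  = erlang:time_offset(),     % varies as the VM aligns with OS time
    SysTime = Mono + Offset.            % equals erlang:system_time()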

Time Retrieval. Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe, but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, monotonic time delivered by the operating system is solely based on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP); that is, they will run with different frequencies. Linux is an exception, with a monotonic clock that is NTP-adjusted and runs with the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing monotonic time and system time delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute. If the adjustment needs to be changed, new adjustment information is published and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the time adjustment information previously published, and calculates Erlang monotonic time. To preserve monotonicity it is important that all threads that read the same OS monotonic time map this to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information, using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed. This means that in the vast majority of cases the RW lock will be read-locked, which allows multiple readers to run concurrently. To prevent bouncing the lock cache-line, we use a bespoke reader-optimised RW lock implementation, where reader threads notify about their presence in counters on separate cache-lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11]; alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].

Timer Wheel and BIF Timer. The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel, which is used by processes executing on the scheduler.

The implementation of timers in Erlang/OTP uses a built-in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS


tables, and insert only one timer at a time into the timer wheel.

Benchmarks. We have measured several benchmarks on a 16-core Bulldozer machine with two eight-core AMD Opteron 4376 HE processors5.

We present three of them here.

The first micro benchmark compares the execution time of an Erlang receive with that of a receive ... after that specifies a timeout and provides a default value. The receive ... after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, the total execution time with the optimised timers is only 5% longer than without timers.
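The essence of the two variants compared is sketched below; it is the receive ... after form that sets and cancels a timer on every blocking receive:

    %% Without a timer: block until the reply arrives.
    wait(Pid) ->
        receive {Pid, Reply} -> Reply end.

    %% With a timer: a timer is set when the process blocks and is
    %% cancelled when a message arrives; this exercises the BIF timers.
    wait(Pid, Timeout) ->
        receive {Pid, Reply} -> Reply
        after Timeout -> timeout
        end.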

The second micro benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4 and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on the left shows that: (1) the performance of time management has remained roughly unchanged between Erlang/OTP releases prior to 18.0; (2) the improved time management in Erlang/OTP 18.x makes time management less

5 See Section 2.5.4 of the RELEASE project Deliverable D2.4 (http://release-project.eu/documents/D24.pdf).

Figure 20: BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases: runtimes (left) and speedup obtained using erlang:monotonic_time/0 (right)

likely to be a scalability bottleneck, even when using erlang:now/0; and (3) the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution. The graph on the right side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

VII Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable,


to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling and debugging Erlang systems that can be used separately or together, as appropriate for solving the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on or specialise the services offered by the Erlang virtual machine through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer6 application gives a comprehensive overview of many of these data on a node-by-node basis.
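For example, the built-in tracing that these tools specialise can be driven directly from the shell; a minimal sketch that traces the messages of a single process:

    %% Trace events are sent as messages to the tracer (here the shell).
    Pid = spawn(fun() -> receive stop -> ok end end),
    erlang:trace(Pid, true, [send, 'receive']),
    Pid ! stop,
    flush().   % shows e.g. {trace, Pid, 'receive', stop}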

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature. That is, the tools support the language itself, rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use tracing support provided by the Erlang VM, so can other actor frameworks: e.g. Akka can use the Kamon7 JVM monitoring system. Similarly, tools for other actor languages or frameworks could use data derived through OS-level

6 http://www.erlang.org/doc/apps/observer
7 http://kamon.io

tracing frameworks, using DTrace8 and SystemTap9 probes as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

i Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare it for modification or extension, or, as is the case here, in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support in a refactoring tool. There are a number of tools that support refactoring in Erlang: in the RELEASE project we have chosen to extend Wrangler10 [56]; other tools include Tidier [68] and RefactorErl [44].

Supporting API Migration. The SD Erlang libraries modify Erlang's global_group library, which becomes the new s_group library as a result; Erlang programs using global_group will therefore have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolution often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be

8 http://dtrace.org/blogs/about
9 https://sourceware.org/systemtap/wiki
10 http://www.cs.kent.ac.uk/projects/wrangler


supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically. The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work, and the technicalities of the implementation, can be found in a paper by [52].
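For illustration, an adaptor for one global_group function might look as follows; this is a hypothetical sketch, and the s_group signature shown is an assumption rather than the definitive API:

    %% Hypothetical adaptor module: each old API call is defined in
    %% terms of the new API; the migration refactoring is generated
    %% from definitions of this shape.
    -module(global_group_adaptor).
    -export([registered_names/1]).

    registered_names({node, Node}) ->
        s_group:registered_names({node, Node}).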

Support for Introducing Parallelism. We have introduced support for parallelising explicit list operations (map and foreach), for process introduction to complete a computationally intensive task in parallel, for introducing a worker process to deal with call handling in an Erlang "generic server", and for parallelising a tail-recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach. The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward: even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
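The transformation then has the following shape (a sketch: we assume here that the parallel map in para_lib is called pmap/2 and mirrors lists:map/2):

    %% Before: sequential list processing.
    Results = lists:map(fun process_item/1, Items),

    %% After: the parallel counterpart from Wrangler's para_lib
    %% (assumed signature mirroring lists:map/2).
    Results = para_lib:pmap(fun process_item/1, Items),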

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute

a task in parallel with its parent process. The result of the new process is sent back to the parent process, which will then consume it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.
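The code shape this refactoring produces is roughly the following (a sketch; expensive_task/0, other_work/0 and combine/2 are placeholders):

    Self = self(),
    Pid = spawn(fun() -> Self ! {self(), expensive_task()} end),
    Other = other_work(),                  % independent computations first
    Result = receive {Pid, R} -> R end,    % receive placed just before use
    combine(Other, Result).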

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies. For instance, an example might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are checked using program slicing, which is discussed below.
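For instance, a function of the following shape (a hypothetical example; f/1 and combine/2 are placeholders) cannot be refactored to a map, because each step depends on the accumulator; but if slicing shows that the f(X) computations are independent of Acc, they can be floated out and computed in parallel:

    total([], Acc)     -> Acc;
    total([X|Xs], Acc) -> total(Xs, combine(Acc, f(X))).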

Support for Program Slicing. Program slicing is a general technique of program analysis for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE11 [17], a tool developed in another EU project that also re-uses the Wrangler front end. This work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing when transforming programs to make them suitable for the introduction of parallel structures.

11 http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf


Figure 21: The architecture of WombatOAM

ii Scalable Deployment

We have developed the WombatOAM tool12 to provide a deployment, operations and maintenance framework for large-scale Erlang distributed systems. These systems typically consist of a number of Erlang nodes executing on different hosts. These hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP.

12 WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd.

Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this is the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture. The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability because of the role played by the central Master node. In its current


version, an additional layer of Middle Managers was introduced to allow the system to scale easily to thousands of deployed nodes. As the diagram shows, the "northbound" interfaces to the web dashboard and the command-line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms, and so forth; we describe these now.

Services. WombatOAM is designed to collect, store and display various kinds of information and events from running Erlang systems, and these data are accessed and managed through an AJAX-based Web Dashboard; these include the following:

Metrics: WombatOAM supports the collection of some hundred metrics (including, for instance, the number of processes on a node, and the message traffic in and out of the node) on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom13, and interface with other tools, such as Graphite14, which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications: As well as metrics, WombatOAM supports event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which is part of the standard

13 Folsom collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom
14 https://graphiteapp.org

distribution, or the lager logging framework15, and will be displayed and logged as they occur.

Alarms: Alarms are more complex entities. Alarms have a state: they can be raised, and once dealt with they can be cleared. They also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as for notifications.

Topology: The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible and, if not, it notifies the other services and periodically tries to reconnect. When the nodes are available again it also notifies the other services. It doesn't have a middle manager part because it doesn't talk to the nodes directly; instead it asks the Node manager service to do so.

Node manager: This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration: This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers, the Orchestration

15 https://github.com/basho/lager


service uses an external library called Libcloud16, for which Erlang Solutions has written an open source Erlang wrapper, called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications; it provides infrastructure for deploying them.

Deployment. The mechanism consists of the following five steps:

1. Registering a provider: WombatOAM provides the same interface for different cloud providers that support the OpenStack standard or the Amazon EC2 API. WombatOAM also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release: The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this enables WombatOAM to start and stop nodes when needed.

3. Defining a node family: The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

16 The unified cloud API [4]: https://libcloud.apache.org

4. Defining a deployment domain: At this step a deployment domain is created, which specifies (i) which providers should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment: To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to or removed from the system, depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single-host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue, and processes migrate between the queues through a work stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes17.

iii.1 Percept2

Percept218 builds on the existing Percept tool to provide post hoc offline analysis and visualisation of Erlang systems.

17 This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.
18 https://github.com/RefactoringTools/percept2


Figure 22: Offline profiling of a distributed Erlang application using Percept2

Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling for a distributed Erlang application using Percept2 is shown in Figure 22.
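A typical Percept2 session has the following shape (a sketch: bench:run/0 is an illustrative entry point, and the option list is abbreviated):

    percept2:profile("bench.dat", {bench, run, []}, [all]),  % collect trace
    percept2:analyze(["bench.dat"]),                         % build RAM DB
    percept2:start_webserver(8888).                          % browse results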

Percept generates an application-level zoomable concurrency graph, showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways, as detailed by [53], including most importantly:

• Distinguishing between running and runnable time for each process: this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including: the migration history of a process between run queues; statistics about message passing between processes (the number of messages and the average message size sent/received by a process); the accumulated runtime per process; and the accumulated time when a process is in a running state.

• Presenting the process tree: the hierarchy


Figure 23: Percept2 showing processes running and runnable (with 4 schedulers/run queues)

structure indicating the parent-child relationships between processes.

• Recording the dynamic function callgraph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system. Unlike global group, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have also extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating measurement of a function's own execution time gives users the freedom of not profiling all the function calls invoked during the program execution. For example, they can choose to profile only functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of


generated web pages. Together these make the system more responsive and more scalable.

iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike, to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, do not require specially recompiled versions of the software to be examined, nor special post-processing tools to create meaningful information from the data gathered. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the size of run queues varies during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible. It uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the

Figure 24: DTrace run-queue visualisation of the bang benchmark with 16 schedulers (time on X-axis)

same storage infrastructure and presentation facilities.

iii.3 Devo

Devo19 [10] is designed to provide real-time online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).

Figure 25 shows visualisations from Devo in both modes. On the left-hand side a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram), with six cores each; with hyperthreading this gives twelve virtual cores, and hence 24 run queues in total. The size of these run queues is shown by both the colour and height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle.

19 https://github.com/RefactoringTools/devo


Figure 25: Devo low- and high-level visualisations of SD-Orbit

A green arc shows migration on the same physical core, a grey one on the same chip, and blue shows migrations between the two chips.

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes repre-

sents the current intensity of communication between the nodes (green quiescent, red busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. Tracing to be performed on monitored nodes is also specified within the configuration file.
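The sketch below is hypothetical and only indicates the kind of information such a configuration records (hosts, nodes, group partitions, and the tracing to apply); the actual file format may differ:

    %% Hypothetical SD-Mon configuration sketch (format illustrative only).
    {hosts,    ['host1.example.org', 'host2.example.org']}.
    {s_groups, [{group1, ['node1@host1.example.org',
                          'node2@host2.example.org']}]}.
    {tracing,  [{group1, [send, 'receive']}]}.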

An agent is started by a master SD-Mon node for each s_group and for each free node. Configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up it tries to get in contact with its nodes and applies the tracing to them, as stated by the master. Binary files are stored in the host file system. Tracing is internally used in order to track s_group operations happening at runtime. An asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is


Figure 26: SD-Mon architecture

deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

The monitoring network is also supervised, in order to take account of network fragility: when an agent node goes down another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected then that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution, before (top) and after eliminating s_group 2 (bottom)


Figure 28: SD-Mon online monitoring

VIII Systemic Evaluation

Preceding sections have investigated the improvements of individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI, VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state-of-the-art NUMA architectures and

the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable20. The bulk of the experiments reported here are conducted on the Athos cluster, using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed-size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure. The benchmarks

20 See Deliverable D6.2, available online: http://www.release-project.eu/documents/D62.pdf


Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling]

also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot standard deviation. Results show that D-Orbit performs better on a small number of nodes, as communication is direct, rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups, i.e. beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails, due to the fact that some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5,000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.


Figure 30: WombatOAM deployment time: experimental data and best fit 0.0125N + 83

The measurement data shows two important facts: it shows that WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and that WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of the effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, while SR-ACO registers them only within an s_group (Section ii).

For both GR-ACO and SR-ACO a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common X86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, so around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark and obtained similar results [23].
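The Chaos Monkey behaviour described here amounts to the following loop on each node (a sketch, not the code used in the experiments; a real service would presumably filter out critical system processes):

    %% Sketch: kill a random process on this node every second.
    %% (rand is available from OTP 18; earlier releases used random.)
    chaos_loop() ->
        Pids = erlang:processes(),
        Victim = lists:nth(rand:uniform(length(Pids)), Pids),
        exit(Victim, kill),               % kill a random process ...
        timer:sleep(1000),                % ... every second
        chaos_loop().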

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the distributed Erlang reliability model.


Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling]

The remainder of the section outlines the scalability impact of maintaining the recovery information required for reliability.

Scalability. Figure 31 compares the runtimes of the ML, GR and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and a fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks and oth-

ers that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang


Figure 32: Number of sent packets in ML-ACO, GR-ACO and SR-ACO

preserves the distributed Erlang language-level reliability model.

As SD-Orbit scales better than D-Orbit, SR-ACO scales better than ML-ACO, and SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limit of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups,

and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language, while preserving reliability. The work is a vade mecum for addressing scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering VM, language and persistent storage levels. We have developed the


BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15] or Legion [40]. We establish scientifically the folklore limitations of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups for partitioning the network and global process namespace, and semi-explicit process placement for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).

To improve the scalability of the Erlang VM and libraries, we have improved the implementation of shared ETS tables, time management,

and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large-scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the


Table 2: Cluster Specifications (RAM per host in GB)

    Name    Hosts  Cores/host  Cores total  Cores max avail  Processor                                  RAM     Interconnection
    GPG        20          16         320              320   Intel Xeon E5-2640v2, 8C, 2GHz              64     10GB Ethernet
    Kalkyl    348           8        2784             1408   Intel Xeon 5520v2, 4C, 2.26GHz           24-72     InfiniBand 20Gb/s
    TinTin    160          16        2560             2240   AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz 64-128    2:1 oversubscribed QDR Infiniband
    Athos     776          24       18624             6144   Intel Xeon E5-2697v2, 12C, 2.7GHz           64     Infiniband FDR14

key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately-sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large-scale servers. In addition, pre-

liminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages21. For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A. Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines. (1) An AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3GHz, each with 16 "Bulldozer" cores, giving a total of 64 cores. (2) An Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

21 See Deliverable D6.7 (http://www.release-project.eu/documents/D67.pdf), available online.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58-67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture)

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540-545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68-75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33-42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 2-10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level visualization of concurrent and distributed computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riak docs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069-1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html

[14] Cory Bennett and Ariel Tseitlin. Chaos Monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207-216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Koszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104-121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157-166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22-34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang '16, pages 33-41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268-279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341-350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107-113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI '06, pages 133-140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go Programming Language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13-22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein Andrew P Black and Si-mon Peyton-Jones Towards Haskell inthe Cloud In Haskell rsquo11 pages 118ndash129ACM 2011

[32] Martin Fowler Refactoring Improving theDesign of Existing Code Addison-WesleyLongman Publishing Co Inc BostonMA USA 1999

[33] Ana Gainaru and Franck Cappello Errorsand faults In Fault-Tolerance Techniques forHigh-Performance Computing pages 89ndash144Springer International Publishing 2015

[34] Martin Josef Geiger New instancesfor the Single Machine Total WeightedTardiness Problem Technical ReportResearch Report 10-03-01 March 2010Datasets available at httplogistik

hsu-hhdeSMTWTP

[35] Guillaume Germain Concurrency ori-ented programming in termite schemeIn Proceedings of the 2006 ACM SIGPLANworkshop on Erlang pages 20ndash20 NewYork NY USA 2006 ACM

[36] Amir Ghaffari De-bench a benchmarktool for distributed erlang 2014 https

githubcomamirghaffariDEbench

[37] Amir Ghaffari Investigating the scalabil-ity limits of distributed Erlang In Proceed-ings of the 13th ACM SIGPLAN Workshopon Erlang pages 43ndash49 ACM 2014

[38] Amir Ghaffari Natalia Chechina PhipTrinder and Jon Meredith Scalable persis-tent storage for Erlang Theory and prac-tice In Proceedings of the 12th ACM SIG-PLAN Workshop on Erlang pages 73ndash74New York NY USA 2013 ACM

[39] Seth Gilbert and Nancy Lynch Brewerrsquosconjecture and the feasibility of consistentavailable partition-tolerant web servicesACM SIGACT News 3351ndash59 2002

[40] Andrew S Grimshaw Wm A Wulf andthe Legion team The Legion vision of aworldwide virtual computer Communica-tions of the ACM 40(1)39ndash45 1997

[41] Carl Hewitt Actor model for discre-tionary adaptive concurrency CoRRabs10081459 2010

[42] Carl Hewitt Peter Bishop and RichardSteiger A universal modular ACTOR for-malism for artificial intelligence In IJ-CAIrsquo73 pages 235ndash245 San Francisco CAUSA 1973 Morgan Kaufmann

[43] Rich Hickey The Clojure programminglanguage In DLSrsquo08 pages 11ndash11 NewYork NY USA 2008 ACM

[44] Zoltaacuten Horvaacuteth Laacuteszloacute Loumlvei TamaacutesKozsik Roacutebert Kitlei Anikoacute Nagyneacute ViacutegTamaacutes Nagy Melinda Toacuteth and RolandKiraacutely Building a refactoring tool forErlang In Workshop on Advanced SoftwareDevelopment Tools and Techniques WAS-DETT 2008 2008

[45] Laxmikant V Kale and Sanjeev KrishnanCharm++ a portable concurrent objectoriented system based on c++ In ACMSigplan Notices volume 28 pages 91ndash108ACM 1993

[46] David Klaftenegger Konstantinos Sago-nas and Kjell Winblad On the scalabilityof the Erlang term storage In Proceedingsof the Twelfth ACM SIGPLAN Workshop onErlang pages 15ndash26 New York NY USASeptember 2013 ACM

[47] David Klaftenegger Konstantinos Sago-nas and Kjell Winblad Delegation lock-ing libraries for improved performance of

46

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

multithreaded programs In Euro-Par 2014Parallel Processing volume 8632 of LNCSpages 572ndash583 Springer 2014

[48] David Klaftenegger Konstantinos Sago-nas and Kjell Winblad Queue delegationlocking IEEE Transactions on Parallel andDistributed Systems 2017 To appear

[49] Rusty Klophaus Riak core Building dis-tributed applications without shared stateIn ACM SIGPLAN Commercial Users ofFunctional Programming CUFP rsquo10 pages141ndash141 New York NY USA 2010ACM

[50] Avinash Lakshman and Prashant MalikCassandra A decentralized structuredstorage system SIGOPS Oper Syst Rev44(2)35ndash40 April 2010

[51] J Lee et al Python actor runtime library2010 oslcsuiuieduparley

[52] Huiqing Li and Simon Thompson Auto-mated API Migration in a User-ExtensibleRefactoring Tool for Erlang Programs InTim Menzies and Motoshi Saeki editorsAutomated Software Engineering ASErsquo12IEEE Computer Society 2012

[53] Huiqing Li and Simon Thompson Multi-core profiling for Erlang programs usingPercept2 In Proceedings of the Twelfth ACMSIGPLAN workshop on Erlang 2013

[54] Huiqing Li and Simon Thompson Im-proved Semantics and ImplementationThrough Property-Based Testing withQuickCheck In 9th International Workshopon Automation of Software Test 2014

[55] Huiqing Li and Simon Thompson SafeConcurrency Introduction through SlicingIn PEPM rsquo15 ACM SIGPLAN January2015

[56] Huiqing Li Simon Thompson GyoumlrgyOrosz and Melinda Toumlth Refactoring

with Wrangler updated In ACM SIG-PLAN Erlang Workshop volume 2008 2008

[57] Frank Lubeck and Max Neunhoffer Enu-merating large Orbits and direct conden-sation Experimental Mathematics 10(2)197ndash205 2001

[58] Andreea Lutac Natalia Chechina Ger-ardo Aragon-Camarasa and Phil TrinderTowards reliable and scalable robot com-munication In Proceedings of the 15th Inter-national Workshop on Erlang Erlang 2016pages 12ndash23 New York NY USA 2016ACM

[59] LWNnet The high-resolution timerAPI January 2006 httpslwnnet

Articles167897

[60] Kenneth MacKenzie Natalia Chechinaand Phil Trinder Performance portabilitythrough semi-explicit placement in dis-tributed Erlang In Proceedings of the 14thACM SIGPLAN Workshop on Erlang pages27ndash38 ACM 2015

[61] Jeff Matocha and Tracy Camp A taxon-omy of distributed termination detectionalgorithms Journal of Systems and Software43(221)207ndash221 1998

[62] Nicholas D Matsakis and Felix S Klock IIThe Rust language In ACM SIGAda AdaLetters volume 34 pages 103ndash104 ACM2014

[63] Robert McNaughton Scheduling withdeadlines and loss functions ManagementScience 6(1)1ndash12 1959

[64] Martin Odersky et al The Scala program-ming language 2012 wwwscala-langorg

[65] William F Opdyke Refactoring Object-Oriented Frameworks PhD thesis Uni-versity of Illinois at Urbana-Champaign1992

47

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

[66] Nikolaos Papaspyrou and KonstantinosSagonas On preserving term sharing inthe Erlang virtual machine In TorbenHoffman and John Hughes editors Pro-ceedings of the 11th ACM SIGPLAN ErlangWorkshop pages 11ndash20 ACM 2012

[67] RELEASE Project Team EU Framework7 Project 287510 (2011ndash2015) 2015 httpwwwrelease-projecteu

[68] Konstantinos Sagonas and ThanassisAvgerinos Automatic refactoring ofErlang programs In Principles and Prac-tice of Declarative Programming (PPDPrsquo09)pages 13ndash24 ACM 2009

[69] Konstantinos Sagonas and Kjell WinbladMore scalable ordered set for ETS usingadaptation In Proc ACM SIGPLAN Work-shop on Erlang pages 3ndash11 ACM Septem-ber 2014

[70] Konstantinos Sagonas and Kjell WinbladContention adapting search trees In Pro-ceedings of the 14th International Sympo-sium on Parallel and Distributed Computingpages 215ndash224 IEEE Computing Society2015

[71] Konstantinos Sagonas and Kjell WinbladEfficient support for range queries andrange updates using contention adaptingsearch trees In Languages and Compilersfor Parallel Computing - 28th InternationalWorkshop volume 9519 of LNCS pages37ndash53 Springer 2016

[72] Marc Snir Steve W Otto D WWalker Jack Dongarra and Steven Huss-Lederman MPI The Complete ReferenceMIT Press Cambridge MA USA 1995

[73] Sriram Srinivasan and Alan MycroftKilim Isolation-typed actors for Java InECOOPrsquo08 pages 104ndash128 Berlin Heidel-berg 2008 Springer-Verlag

[74] Don Syme Adam Granicz and AntonioCisternino Expert F 40 Springer 2015

[75] Simon Thompson and Huiqing Li Refac-toring tools for functional languages Jour-nal of Functional Programming 23(3) 2013

[76] Marcus Voumllker Linux timers May 2014

[77] WhatsApp 2015 httpswwwwhatsappcom

[78] Tom White Hadoop - The Definitive GuideStorage and Analysis at Internet Scale (3 edrevised and updated) OrsquoReilly 2012

[79] Ulf Wiger Industrial-Strength Functionalprogramming Experiences with the Er-icsson AXD301 Project In ImplementingFunctional Languages (IFLrsquo00) Aachen Ger-many September 2000 Springer-Verlag



Figure 11 Global operations in Distributed Erlang vs SD Erlang: throughput (y-axis) against number of nodes (x-axis)

of s_groups; see Figure 12. Moreover, it is not necessary to create a hierarchy of s_groups; for example, we have constructed an Orbit implementation using a ring of s_groups.

Given such a flexible way of organising distributed systems, key questions in the design of an SD Erlang system are the following. How should s_groups be structured? Depending on the reason the nodes are grouped, reducing the number of connections or reducing the namespace or both, s_groups can be freely structured as a tree, ring, or some other topology. How large should the s_groups be? Smaller s_groups mean more inter-group communication, but the synchronisation of the s_group state between the s_group nodes constrains the maximum size of s_groups. We have not found this constraint to be a serious restriction. For example, many s_groups are either relatively small, e.g. 10-node, internal or terminal elements in some topology, e.g. the leaves and nodes of a tree. How do nodes from different s_groups communicate? While any two nodes can communicate in an SD Erlang system, to minimise the number of connections, communication between nodes from different s_groups is typically routed via gateway nodes that belong to both s_groups. How do we avoid single points of failure? For reliability, and to minimise communication load, multiple gateway nodes and processes may be required.

Information to make these design choices is provided by the tools in Section VII and by benchmarking. A further challenge is how to systematically refactor a distributed Erlang application into SD Erlang, and this is outlined in Section i. A detailed discussion of distributed system design and refactoring in SD Erlang is provided in a recent article [22].

We illustrate typical SD Erlang system designs by showing refactorings of the Orbit and ACO benchmarks from Section III. In both distributed and SD Erlang, the computation starts on the Master node and the actual computation is done on the Worker nodes. In the distributed Erlang version all nodes are interconnected, and messages are transferred directly from the sending node to the receiving node (Figure 2). In contrast, in the SD Erlang version nodes are grouped into s_groups, and messages are transferred between different s_groups via Sub-master nodes (Figure 13).


Figure 12 SD Erlang ACO (SR-ACO) architecture: the master process runs on the master node at Level 0; sub-master processes run at the next levels; and colony processes, each with their ant processes, run on s_groups of 16 colony nodes at Levels 1 and 2

A fragment of code that creates an s_group on a Sub-master node is as follows:

    create_s_group(Master, GroupName, Nodes0) ->
        case s_group:new_s_group(GroupName, Nodes0) of
            {ok, GroupName, Nodes} ->
                Master ! {GroupName, Nodes};
            _ -> io:format("exception: message")
        end.

Similarly, we introduce s_groups in the GR-ACO benchmark from Section ii to create Scalable Reliable ACO (SR-ACO); see Figure 12. Here, apart from reducing the number of connections, s_groups also reduce the global namespace information. That is, instead of registering the name of a pid globally, i.e. with all nodes, the name is registered only on all nodes in the s_group with s_group:register_name/3.
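For instance, a colony pid that would previously have been registered globally can instead be registered only within its own s_group. The sketch below assumes that the s_group registration functions mirror their global counterparts with an added group argument; the group name, process name, and colony module are illustrative:

    %% Register the pid in the namespace of s_group group1 only,
    %% rather than on every connected node:
    Pid = spawn(colony, run, []),
    yes = s_group:register_name(group1, colony1, Pid),
    %% Name lookup is scoped by s_group too:
    Pid = s_group:whereis_name(group1, colony1).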

A comparative performance evaluation of distributed Erlang and SD Erlang Orbit and ACO is presented in Section VIII.

i2 Semi-Explicit Placement

Recall from Section iii that distributed Erlang spawns a process onto an explicitly named Erlang node, e.g.

    spawn(some_node, fun some_module:pong/0)

and also recall the portability and programming effort issues associated with such explicit placement in large scale systems, discussed in Section ii.

To address these issues we have developed a semi-explicit placement library that enables the programmer to select nodes on which to spawn processes based on run-time information about the properties of the nodes. For example, if a process performs a lot of computation one would like to spawn it on a node with considerable computation power, or if two processes are likely to communicate frequently then it would be desirable to spawn them on the same node, or on nodes with a fast interconnect.

We have implemented two Erlang libraries to support semi-explicit placement [60]. The first deals with node attributes, and describes properties of individual Erlang VMs and associated hosts, such as total and currently available RAM, installed software, hardware configuration, etc. The second deals with a notion of communication distances, which models the communication times between nodes in a distributed system. Therefore, instead of specifying a node, we can use the attr:choose_node/1 function to define the target node, i.e.

    spawn(attr:choose_node(Params),
          fun some_module:pong/0)


Figure 13 SD Erlang (SD-Orbit) architecture


[60] report an investigation into the communication latencies on a range of NUMA and cluster architectures, and demonstrate the effectiveness of the placement libraries using the ML-ACO benchmark on the Athos cluster.

ii S_group Semantics

For precise specification, and as a basis for validation, we provide a small-step operational semantics of the s_group operations [21]. Figure 14 defines the state of an SD Erlang system and associated abstract syntax variables. The abstract syntax variables on the left are defined as members of sets, denoted {}, and these in turn may contain tuples, denoted (), or further sets. In particular nm is a process name, p a pid, ni a node_id, and nis a set of node_ids. The state of a system is modelled as a four tuple comprising a set of s_groups, a set of free_groups, a set of free_hidden_groups, and a set of nodes. Each type of group is associated with nodes and has a namespace. An s_group additionally has a name, whereas a free_hidden_group consists of only one node, i.e. a hidden node simultaneously acts as a node and as a group, because as a group it has a namespace but does not share it with any other node. Free normal and hidden groups have no names, and are uniquely defined by the nodes associated with them. Therefore, group names, gr_names, are either NoGroup or a set of s_group_names. A namespace is a set of name and process id (pid) pairs, and is replicated on all nodes of the associated group.

A node has the following four parameters: node_id identifier; node_type, that can be either hidden or normal; connections; and group_names, i.e. names of groups the node belongs to. The node can belong to either a list of s_groups or one of the free groups. The type of the free group is defined by the node type. Connections are a set of node_ids.

Transitions in the semantics have the form (state, command, ni) −→ (state′, value), meaning that executing command on node ni in state returns value and transitions to state′.

The semantics is presented in more detail by [21], but we illustrate it here with the s_group:registered_names/1 function from Section i1. The function returns a list of names


(grs, fgs, fhs, nds) ∈ state ≡ ({s_group}, {free_group}, {free_hidden_group}, {node})

gr ∈ grs ≡ {s_group} ≡ {(s_group_name, {node_id}, namespace)}

fg ∈ fgs ≡ {free_group} ≡ {({node_id}, namespace)}

fh ∈ fhs ≡ {free_hidden_group} ≡ {(node_id, namespace)}

nd ∈ nds ≡ {node} ≡ {(node_id, node_type, connections, gr_names)}

gs ∈ gr_names ≡ NoGroup ∪ {s_group_name}

ns ∈ namespace ≡ {(name, pid)}

cs ∈ connections ≡ {node_id}

nt ∈ node_type ≡ {Normal, Hidden}

s ∈ NoGroup ∪ s_group_name

Figure 14 SD Erlang state [21]

registered in s_group s if node ni belongs to the s_group, and an empty list otherwise (Figure 15). Here ⊕ denotes disjoint set union; IsSGroupNode returns true if node ni is a member of some s_group s, and false otherwise; and OutputNms returns a set of process names registered in the ns namespace of s_group s.

iii Semantics Validation

As the semantics is concrete, it can readily be made executable in Erlang, with lists replacing

Figure 16 Testing s_groups using QuickCheck

sets throughout. Having an executable semantics allows users to engage with it, and to understand how the semantics behaves vis à vis the library, giving them an opportunity to assess the correctness of the library against the semantics.
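As an illustration, the rule for s_group:registered_names/1 (Figure 15) might be rendered in the executable semantics roughly as follows. This is a sketch with sets represented as lists; the tuple layout follows Figure 14 rather than the project's actual code:

    %% State is {Grs, Fgs, Fhs, Nds}; each s_group in Grs is a tuple
    %% {SGroupName, NodeIds, Namespace}.
    registered_names({Grs, _Fgs, _Fhs, _Nds}, S, Ni) ->
        case [Ns || {S1, Nis, Ns} <- Grs,
                    S1 =:= S, lists:member(Ni, Nis)] of
            [Ns] -> [{S, Nm} || {Nm, _Pid} <- Ns];  % OutputNms(s, ns)
            []   -> []                              % ni not in s_group s
        end.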

Better still, we can automatically assess how the system behaves in comparison with the (executable) semantics by executing them in lockstep, guided by the constraints of which operations are possible at each point. We do that by building an abstract state machine model of the library. We can then generate random sequences (or traces) through the model, with appropriate library data generated too. This random generation is supported by the QuickCheck property-based testing system [24, 9].

The architecture of the testing framework is shown in Figure 16. First, an abstract state machine embedded as an "eqc_statem" module is derived from the executable semantic specification. The state machine defines the abstract state representation and the transition from one state to another when an operation is applied. Test case and data generators are then defined to control the test case generation.


((grs, fgs, fhs, nds), s_group:registered_names(s), ni)
    −→ ((grs, fgs, fhs, nds), nms)    if IsSGroupNode(ni, s, grs)
    −→ ((grs, fgs, fhs, nds), {})     otherwise

where (s, {ni} ⊕ nis, ns) ⊕ grs′ ≡ grs
      nms ≡ OutputNms(s, ns)

IsSGroupNode(ni, s, grs) = ∃ nis, ns, grs′. (s, {ni} ⊕ nis, ns) ⊕ grs′ ≡ grs

OutputNms(s, ns) = {(s, nm) | (nm, p) ∈ ns}

Figure 15 SD Erlang semantics of the s_group:registered_names/1 function

This includes the automatic generation of eligible s_group operations and the input data to those operations. Test oracles are encoded as the postcondition for s_group operations.

During testing, each test command is applied to both the abstract model and the s_group library. The application of the test command to the abstract model takes the abstract model from its current state to a new state, as described by the transition functions, whereas the application of the test command to the library leads the system to a new actual state. The actual state information is collected from each node in the distributed system, then merged and normalised to the same format as the abstract state representation. For a test to be successful, after the execution of a test command the test oracles specified for this command should be satisfied. Various test oracles can be defined for s_group operations; for instance, one of the generic constraints that applies to all the s_group operations is that after each s_group operation the normalised system state should be equivalent to the abstract state.
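The shape of such a model is sketched below, assuming Quviq QuickCheck's eqc_statem behaviour. The state representation, generators, and oracle are simplified illustrations rather than the project's actual model:

    -module(s_group_eqc).
    -include_lib("eqc/include/eqc.hrl").
    -include_lib("eqc/include/eqc_statem.hrl").
    -compile(export_all).

    %% Abstract state: a list of {SGroupName, Nodes} pairs, a
    %% simplification of the semantics' state 4-tuple.
    initial_state() -> [].

    gname()  -> elements([g1, g2, g3]).
    gnodes() -> ?LET(Ns, list(elements(['n1@h', 'n2@h', 'n3@h'])),
                     lists:usort(Ns)).

    %% Generate eligible s_group operations.
    command(_S) ->
        oneof([{call, s_group, new_s_group, [gname(), gnodes()]},
               {call, s_group, registered_names, [gname()]}]).

    precondition(_S, _Call) -> true.

    %% Transition function, derived from the executable semantics.
    next_state(S, _Res, {call, s_group, new_s_group, [G, Ns]}) ->
        lists:keystore(G, 1, S, {G, Ns});
    next_state(S, _Res, _Call) -> S.

    %% Oracle: the library's answer must match the model's answer.
    postcondition(S, {call, s_group, registered_names, [G]}, Res) ->
        Res =:= model_names(S, G);
    postcondition(_S, _Call, _Res) -> true.

    model_names(_S, _G) -> [].  % no names registered in this toy model

    prop_s_group() ->
        ?FORALL(Cmds, commands(?MODULE),
                begin
                    {H, S, Res} = run_commands(?MODULE, Cmds),
                    pretty_commands(?MODULE, Cmds, {H, S, Res},
                                    Res == ok)
                end).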

Thousands of tests were run, and three kinds of errors (all subsequently corrected) were found. Some errors in the library implementation were found, including one error due to the synchronisation between nodes, and the other related to the remove_nodes operation, which erroneously raised an exception. We also found a couple of trivial errors in the semantic specification itself, which had been missed by manual examination. Finally, we found some situations where there were inconsistencies between the semantics and the library implementation, despite their states being equivalent; an example of this was in the particular values returned by functions on certain errors. Overall, the automation of testing boosted our confidence in the correctness of the library implementation and the semantic specification. This work is reported in more detail by [54].

VI Improving VM Scalability

This section reports the primary VM and library improvements we have designed and implemented to address the scalability and reliability issues identified in Section i.


i Improvements to Erlang Term Storage

Because ETS tables are so heavily used in Erlang systems, they are a focus for scalability improvements. We start by describing their redesign, including some improvements that pre-date our RELEASE project work, i.e. those prior to Erlang/OTP R15B03. These historical improvements are very relevant for a scalability study, and form the basis for our subsequent changes and improvements. At the point when Erlang/OTP got support for multiple cores (in release R11B), there was a single reader-writer lock for each ETS table. Optional fine grained locking of hash-based ETS tables (i.e. set, bag or duplicate_bag tables) was introduced in Erlang/OTP R13B02-1, adding 16 reader-writer locks for the hash buckets. Reader groups to minimise read synchronisation overheads were introduced in Erlang/OTP R14B. The key observation is that a single count of the multiple readers must be synchronised across many cache lines, potentially far away in a NUMA system. Maintaining reader counts in multiple (local) caches makes reads fast, although writes must now check every reader count. In Erlang/OTP R16B the number of bucket locks and the default number of reader groups were both upgraded from 16 to 64.
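These mechanisms are switched on per table via options to ets:new/2; a minimal sketch (the table name and contents are illustrative):

    %% A hash-based table with fine-grained bucket locking
    %% (write_concurrency) and reader groups (read_concurrency):
    Tab = ets:new(my_table, [set, public,
                             {write_concurrency, true},
                             {read_concurrency, true}]),
    true = ets:insert(Tab, {some_key, 42}),
    [{some_key, 42}] = ets:lookup(Tab, some_key).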

We illustrate the scaling properties of the ETS concurrency options using the ets_bench BenchErl benchmark on the Intel NUMA machine with 32 hyperthreaded cores specified in Appendix A. The ets_bench benchmark inserts 1M items into the table, then records the time to perform 17M operations, where an operation is either a lookup, an insert or a delete. The experiments are conducted on a hash-based (set) ETS table with different percentages of update operations, i.e. insertions or deletions.

Figure 17 shows the runtimes in seconds of 17M operations in different Erlang/OTP versions, varying the number of schedulers (x-axis), reflecting how the scalability of ETS tables has improved in more recent Erlang/OTP releases.

Figure 17 Runtime of 17M ETS operations in Erlang/OTP releases R11B-5, R13B02-1, R14B and R16B: 10% updates (left) and 1% updates (right). The y-axis shows time in seconds (lower is better), and the x-axis is number of OS threads (VM schedulers)

Figure 18 shows the runtimes in seconds of 17M operations on an ETS table with different numbers of reader groups, again varying the number of schedulers. We see that one reader group is not sufficient with 10% updates, nor are two with 1% updates. Beyond that, different numbers of reader groups have little impact on the benchmark performance, except that using 64 groups with 10% updates slightly degrades performance.

We have explored four other extensions or redesigns in the ETS implementation for better scalability: 1. Allowing more programmer control over the number of bucket locks in hash-based tables, so the programmer can reflect the number of schedulers and the expected access pattern.


Figure 18 Runtime of 17M ETS operations with 1, 2, 4, 8, 16, 32 and 64 reader groups (RG): 10% updates (left) and 1% updates (right). The y-axis shows runtime in seconds (lower is better), and the x-axis is number of schedulers

2. Using contention-adapting trees to get better scalability for ordered_set ETS tables, as described by [69]. 3. Using queue delegation locking to improve scalability [47]. 4. Adopting schemes for completely eliminating the locks in the meta table. A more complete discussion of our work on ETS can be found in the papers by [69] and [47].

Here we outline only our work on contention-adapting (CA) trees. A CA tree monitors contention in different parts of a tree-shaped data structure, introducing routing nodes with locks in response to high contention, and removing them in response to low contention. For experimental purposes, two variants of the CA tree have been implemented to represent ordered_sets in the virtual machine of Erlang/OTP 17.0.

Figure 19 Throughput (operations per microsecond) of the AVL-CA tree and Treap-CA tree variants compared with set and ordset: 10% updates (left) and 1% updates (right); the x-axis is number of threads

One extends the existing AVL trees in the Erlang VM, and the other uses a Treap data structure [5]. Figure 19 compares the throughput of the CA tree variants with that of ordered_set and set as the number of schedulers increases. It is unsurprising that the CA trees scale so much better than an ordered_set, which is protected by a single readers-writer lock. It is more surprising that they also scale better than set. This is due to hash tables using fine-grained locking at a fixed granularity, while CA trees can adapt the number of locks to the current contention level, and also to the parts of the key range where contention is occurring.


ii Improvements to Schedulers

In the Erlang VM, a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until it either blocks waiting for input, typically a message from some other process, or until it has executed its quota of reductions.
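Reduction counts are visible to programs, which helps when reasoning about scheduling behaviour; a small sketch:

    %% How many reductions has the current process used so far?
    {reductions, R} = erlang:process_info(self(), reductions),
    io:format("used ~p reductions~n", [R]).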

The Erlang VM is usually started with one scheduler per logical core (SMT-thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned it is placed in the run queue of the scheduler of its parent, and it waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores, that is, an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Figure 24 (Section iii2).

The default load management mechanism is load compaction, which aims to keep as many scheduler threads as possible fully loaded with work, i.e. it attempts to ensure that scheduler threads do not run out of work. We have developed a new optional scheduler utilisation balancing mechanism that is available from Erlang/OTP 17.0. The new mechanism aims to balance scheduler utilisation between schedulers; that is, it will strive for equal scheduler utilisation on all schedulers.

The scheduler utilisation balancing mechanism has no performance impact on the system when not enabled. On the other hand, when enabled, it results in changed timing in the system; normally there is a small overhead due to measuring of utilisation and calculating balancing information, which depends on the underlying primitives provided by the operating system. The new balancing mechanism results in a better distribution of processes to schedulers, reducing the probability of core contention. Together with other VM improvements, such as interruptable BIFs and garbage collection, it results in lower latency and improved responsiveness, and hence reliability, for soft real-time applications.
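The balancing mechanism is enabled with the +sub true emulator flag. Its effect can be observed with the long-standing scheduler_wall_time statistics; a sketch, where the one second sampling window is an arbitrary choice:

    %% Per-scheduler utilisation over a one second window.
    erlang:system_flag(scheduler_wall_time, true),
    T0 = lists:sort(erlang:statistics(scheduler_wall_time)),
    timer:sleep(1000),
    T1 = lists:sort(erlang:statistics(scheduler_wall_time)),
    Util = [{Id, (A1 - A0) / (W1 - W0)}
            || {{Id, A0, W0}, {Id, A1, W1}} <- lists:zip(T0, T1)].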

iii Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel benchmark in Section i. The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in that returns "Erlang system time", or time since Epoch, with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, i.e. SMT-enabled. The documentation promises that values returned by it are strictly increasing, and many applications ended up relying on this. For example, applications often employ erlang:now/0 to generate unique integers.
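With the new API such uses no longer need a globally ordered clock; a sketch of the replacement idioms on recent releases (do_work/0 is a hypothetical workload):

    %% Unique values, without reading any clock:
    Id = erlang:unique_integer([monotonic, positive]),
    %% Durations, measured with monotonic time:
    Start = erlang:monotonic_time(),
    do_work(),
    ElapsedUs = erlang:convert_time_unit(
                    erlang:monotonic_time() - Start,
                    native, microsecond).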

Erlang system time should align with the operating system's view of time since Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot, without invalidating the strictly increasing value


guarantee. The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when these two have deviated, e.g. in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], i.e. a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards and backwards, while system time is allowed to do this. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.
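This relationship is directly visible in the new API; a minimal sketch (the three calls happen at slightly different instants, hence the generous tolerance):

    Mono   = erlang:monotonic_time(),
    Offset = erlang:time_offset(),
    System = erlang:system_time(),
    %% Erlang system time is monotonic time plus the current offset:
    true = abs(System - (Mono + Offset))
             < erlang:convert_time_unit(1, second, native).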

Time Retrieval. Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe, but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, monotonic time delivered by the operating system is solely based on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP); that is, they will run with different frequencies. Linux is an exception, with a monotonic clock that is NTP adjusted and runs with the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing monotonic time and system time delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute. If the adjustment needs to be changed, new adjustment information is published and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the time adjustment information previously published, and calculates Erlang monotonic time. To preserve monotonicity, it is important that all threads that read the same OS monotonic time map this to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information, using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed. This means that in the vast majority of cases the RW lock will be read-locked, which allows multiple readers to run concurrently. To prevent bouncing the lock cache-line, we use a bespoke reader optimised RW lock implementation, where reader threads notify about their presence in counters on separate cache-lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11], and alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].

Timer Wheel and BIF Timer. The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel, which is used by processes executing on the scheduler.

The implementation of timers in Erlang/OTP uses a built in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS tables, and only insert one timer at a time into the timer wheel.



Benchmarks. We have measured several benchmarks on a 16-core Bulldozer machine with eight dual CPU AMD Opteron 4376 HEs.5

We present three of them here. The first micro benchmark compares the execution time of an Erlang receive with that of a receive ... after that specifies a timeout and provides a default value. The receive ... after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, total execution time with the optimised timers is only 5% longer than without timers.
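A minimal sketch of the two code shapes being compared (the message protocol and timeout value are illustrative):

    %% Plain receive: no timer is involved.
    wait(Pid) ->
        Pid ! {self(), ping},
        receive pong -> ok end.

    %% receive ... after: a timer is set when the process blocks,
    %% and cancelled when the message arrives.
    wait_after(Pid) ->
        Pid ! {self(), ping},
        receive pong -> ok
        after 1000 -> timeout
        end.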

The second micro benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. In this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4 and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on its left shows that: 1. the performance of time management has remained roughly unchanged between Erlang/OTP releases prior to 18.0; 2. the improved time management in Erlang/OTP 18.x makes time management less likely to be a scalability bottleneck, even when using erlang:now/0; and 3. the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution. The graph on the right side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

5 See §2.5.4 of the RELEASE project Deliverable 2.4 (http://release-project.eu/documents/D24.pdf).

Figure 20 BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases: runtimes in ms against number of schedulers (left) and speedup obtained using erlang:monotonic_time/0 (right)


VII Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable,


to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling and debugging Erlang systems, that can be used separately or together as appropriate for solving the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on or specialise the services offered by the Erlang virtual machine, through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer6 application gives a comprehensive overview of many of these data on a node-by-node basis.

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature. That is, the tools support the language itself, rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use tracing support provided by the Erlang VM, so other actor frameworks, e.g. Akka, can use the Kamon7 JVM monitoring system. Similarly, tools for other actor languages or frameworks could use data derived through OS-level tracing frameworks, DTrace8 and SystemTap9 probes, as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

6 http://www.erlang.org/doc/apps/observer
7 http://kamon.io



i Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare it for modification or extension, or, as is the case here, in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support, in a refactoring tool. There are a number of tools that support refactoring in Erlang; in the RELEASE project we have chosen to extend Wrangler10 [56]; other tools include Tidier [68] and RefactorErl [44].

Supporting API Migration. The SD Erlang libraries modify Erlang's global_group library, which becomes the new s_group library as a result. Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolves and this often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition, we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically.

8 http://dtrace.org/blogs/about
9 https://sourceware.org/systemtap/wiki
10 http://www.cs.kent.ac.uk/projects/wrangler


The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work, and the technicalities of the implementation, can be found in a paper by [52].
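As an illustration, an adaptor for a function whose interface has gained a group argument might be written as below. This is a sketch of the idea only, not Wrangler's actual adaptor interface; the module and group names are hypothetical:

    -module(old_api_adaptor).
    -export([register_name/2]).

    %% Each old API function is defined in terms of the new API.
    %% From such definitions, rewrite rules are generated that
    %% transform client calls of the old function into the new form.
    register_name(Name, Pid) ->
        s_group:register_name(default_group, Name, Pid).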

Support for Introducing Parallelism. We have introduced support for parallelising explicit list operations (map and foreach), for process introduction to complete a computationally intensive task in parallel, for introducing a worker process to deal with call handling in an Erlang "generic server", and to parallelise a tail recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach. The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward; even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
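The target of such a transformation can be pictured with a minimal parallel map. This is an illustrative sketch, not para_lib's implementation; a production version would bound the number of spawned workers:

    %% Spawn one worker per list element, then collect the results
    %% in the original order using unique references.
    pmap(F, Xs) ->
        Parent = self(),
        Refs = [begin
                    Ref = make_ref(),
                    spawn(fun() -> Parent ! {Ref, F(X)} end),
                    Ref
                end || X <- Xs],
        [receive {Ref, Res} -> Res end || Ref <- Refs].

For example, pmap(fun(X) -> X*X end, lists:seq(1, 100)) then mirrors the corresponding lists:map/2 call.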

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute a task in parallel with its parent process. The result of the new process is sent back to the parent process, which will then consume it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies. For instance, an example might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are checked by program slicing, which is discussed below.

Support for Program Slicing. Program slicing is a general technique of program analysis for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE11 [17], a tool developed in another EU project that also re-uses the Wrangler front end. This work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing in transforming programs to make them suitable for introduction of parallel structures.

11 http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf


[Diagram: the Wombat web dashboard and CLI connect over a REST northbound interface to the Master; the Master delegates to Middle Managers, which manage Erlang nodes (metrics, notifications, alarms, topology and orchestration services, with matching agents on each managed node) via the Erlang distribution protocol, provision software over SSH, and provision VMs through elibcloud/Libcloud.]

Figure 21 The architecture of WombatOAM

ii Scalable Deployment

We have developed the WombatOAM tool12 to provide a deployment, operations and maintenance framework for large-scale Erlang distributed systems. These systems typically consist of a number of Erlang nodes executing on different hosts. These hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP.

12 WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd.

Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this is the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture. The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability, because of the role played by the central Master node; in its current version an additional layer of Middle Managers was introduced, to allow the system to scale easily to thousands of deployed nodes. As the diagram shows, the "northbound" interfaces to the web dashboard and the command-line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms, and so forth; we describe those now.



Services. WombatOAM is designed to collect, store and display various kinds of information and event from running Erlang systems, and these data are accessed and managed through an AJAX-based Web Dashboard; they include the following.

Metrics. WombatOAM supports the collection of some hundred metrics (including, for instance, numbers of processes on a node and the message traffic in and out of the node) on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom13, and interface with other tools, such as graphite14, which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications. As well as metrics, it can support event-based monitoring through the collection of notifications from running nodes. Notifications, which are one time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which is part of the standard distribution, or the lager logging framework15, and will be displayed and logged as they occur.

13 Folsom collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom
14 https://graphiteapp.org


Alarms. Alarms are more complex entities. Alarms have a state: they can be raised, and once dealt with they can be cleared. They also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as for notifications.

Topology. The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible; if not, it notifies the other services and periodically tries to reconnect. When the nodes are available again, it also notifies the other services. It doesn't have a middle manager part, because it doesn't talk to the nodes directly; instead it asks the Node manager service to do so.

Node manager. This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration. This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers, the Orchestration service uses an external library called Libcloud16, for which Erlang Solutions has written an open source Erlang wrapper, called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications: it provides infrastructure for deploying them.

15 https://github.com/basho/lager



Deployment. The mechanism consists of the following five steps.

1. Registering a provider. WombatOAM provides the same interface for different cloud providers which support the OpenStack standard or the Amazon EC2 API. WombatOAM also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this will enable WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

16 The unified cloud API [4]: https://libcloud.apache.org

4. Defining a deployment domain. At this step a deployment domain is created, which specifies (i) which providers should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to or removed from the system, depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue, and processes migrate between the queues through a work stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes.17

iii1 Percept2

Percept218 builds on the existing Percept tool to provide post hoc offline analysis and visualisation of Erlang systems.

17 This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.
18 https://github.com/RefactoringTools/percept2


Figure 22 Offline profiling of a distributed Erlang application using Percept2

Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node on a single manycore host. It visualises Erlang application level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang built in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling for a distributed Erlang application using Percept2 is shown in Figure 22.
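In outline, a profiling session follows the profile/analyse/view pattern of the original Percept. The sketch below assumes Percept2 keeps Percept's API shape; the exact entry points and arguments are assumptions, and the file name, entry function and port are illustrative:

    %% 1. Run the program under tracing, writing events to file:
    percept2:profile("my_run.dat", {my_app, main, []}, [concurrency]),
    %% 2. Analyse the trace into the RAM database:
    percept2:analyze(["my_run.dat"]),
    %% 3. Inspect the results through the web interface:
    percept2:start_webserver(8888).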

Percept generates an application-level zoomable concurrency graph, showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways, as detailed by [53], including most importantly:

• Distinguishing between running and runnable time for each process; this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including the migration history of a process between run queues; statistics about message passing between processes (the number of messages, and the average message size, sent/received by a process); the accumulated runtime per-process; and the accumulated time when a process is in a running state.

• Presenting the process tree: the hierarchy structure indicating the parent-child relationships between processes.


Figure 23 Percept2 showing processes running and runnable (with 4 schedulers/run queues)


• Recording dynamic function callgraph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system. Unlike global_group, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have also extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating measurement of a function's own execution time gives users the freedom not to profile all the function calls invoked during the program execution. For example, they can choose to profile only functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of generated web pages. Together these make the system more responsive and more scalable.

iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system, during development or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, and require neither specially recompiled versions of the software to be examined, nor special post-processing tools to create meaningful information from the data gathered. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the sizes of the run queues vary during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Figure 24: DTrace run-queue visualisation of the bang benchmark with 16 schedulers (run queues 0-15; time on the x-axis).

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation reuses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the same storage infrastructure and presentation facilities.
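On the application side, Erlang code can also emit user-level probe events through the dyntrace module of runtime_tools, provided the VM was built with DTrace/SystemTap support; an OS-side script can then match on these probes. A minimal sketch, with an arbitrary illustrative payload:

    %% Emit a user-level probe, if the VM supports dynamic tracing.
    case dyntrace:available() of
        true  -> dyntrace:p(42, "request_started");  %% one integer, one string
        false -> ok   %% VM built without DTrace/SystemTap support
    end.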

iii.3 Devo

Devo (https://github.com/RefactoringTools/devo) [10] is designed to provide real-time, online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).
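As an indication of how such instrumentation is set up, the following sketch uses the standard ttb API from the observer application; the node names and trace flags are illustrative choices, not necessarily those used by Devo:

    %% Start tracing on two (hypothetical) nodes.
    ttb:tracer(['node1@host', 'node2@host'], []),

    %% Trace scheduling and message-passing events for all processes.
    ttb:p(all, [running, send, 'receive']),

    %% ... run the system under observation ...

    %% Stop tracing and fetch the trace files for analysis.
    ttb:stop([return_fetch_dir]).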

Figure 25 shows visualisations from Devo in both modes. On the left-hand side a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram) with six cores each; with hyperthreading this gives twelve virtual cores, and hence 24 run queues in total. The size of these run queues is shown by both the colour and height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle. A green arc shows migration on the same physical core, a grey one migration on the same chip, and blue shows migrations between the two chips.

Figure 25: Devo low- and high-level visualisations of SD-Orbit.

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between the nodes (green quiescent; red busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. Tracing to be performed on monitored nodes is also specified within the configuration file.
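The article does not fix the configuration syntax; purely as a hypothetical sketch, such a file might contain Erlang terms along the following lines (all names invented):

    %% Hypothetical SD-Mon configuration: hosts, nodes, group
    %% partitions, and the tracing to apply on monitored nodes.
    {hosts,    ["host1", "host2"]}.
    {nodes,    ['node1@host1', 'node2@host1', 'node3@host2']}.
    {s_groups, [{group1, ['node1@host1', 'node2@host1']},
                {group2, ['node2@host1', 'node3@host2']}]}.
    {tracing,  [{flags, [send, 'receive']}]}.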

An agent is started by a master SD-Mon node for each s_group and for each free node. Configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up it tries to get in contact with its nodes and apply the tracing to them, as stated by the master. Binary files are stored in the host file system. Tracing is internally used in order to track s_group operations happening at runtime; an asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

Figure 26: SD-Mon architecture.

The monitoring network is also supervised, in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected then that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, the messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution, before (top) and after (bottom) eliminating s_group 2.

Figure 28: SD-Mon online monitoring.

VIII Systemic Evaluation

Preceding sections have investigated improvements to individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI and VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state of the art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable (Deliverable 6.2: http://www.release-project.eu/documents/D62.pdf). The bulk of the experiments reported here are conducted on the Athos cluster using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure. The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling].
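For reference, the two measures can be stated simply (standard definitions, with t(n) denoting runtime on n nodes): relative speedup for a fixed input is

    S(n) = t(1) / t(n)

while under weak scaling the input grows in proportion to n, so ideal behaviour is a flat runtime curve, t(n) ≈ t(1); the ACO runtimes reported below should be read in this second way.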

i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot standard deviation. Results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups: beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails because some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster, is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5,000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.


Figure 30: Wombat deployment time; experimental data with best fit 0.0125N + 83 (time in seconds vs. number of nodes N).

The measurement data shows two important facts: it shows that WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and that WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of the effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, while SR-ACO registers them only within an s_group (Section ii).

For both GR-ACO and SR-ACO a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common X86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark and obtained similar results [23].
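A service of this shape is straightforward to picture in Erlang. The following minimal sketch (not the code used in the experiments; note that the rand module is OTP 18+, and a real service would avoid killing VM-critical system processes) kills one randomly chosen local process per second:

    -module(chaos_monkey).
    -export([start/0]).

    %% Run the killer loop in its own process.
    start() -> spawn(fun loop/0).

    loop() ->
        timer:sleep(1000),                 %% one victim per second
        Pids = erlang:processes(),         %% all local processes
        Victim = lists:nth(rand:uniform(length(Pids)), Pids),
        exit(Victim, kill),                %% unconditional kill
        loop().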

We conclude that both GR-ACO and SR-ACO are reliable, and hence that SD Erlang preserves the distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.

Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling].

Scalability. Figure 31 compares the runtimes of the ML, GR and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and a fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks, and others, that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.

Figure 32: Number of sent packets in ML-ACO, GR-ACO and SR-ACO.

As SD-Orbit scales better than D-Orbit, SR-ACO scales better than ML-ACO, and SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limits of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language while preserving reliability. The work is a vade mecum for addressing scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering the VM, language and persistent storage levels. We have developed the BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15] or Legion [40]. We establish scientifically the folklore limitation of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups, for partitioning the network and global process namespace, and semi-explicit process placement, for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).

To improve the scalability of the Erlang VM and libraries, we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages (see Deliverable 6.7: http://www.release-project.eu/documents/D67.pdf). For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3 GHz, each with 16 "Bulldozer" cores, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

Table 2: Cluster specifications (RAM per host in GB).

    Name    Hosts  Cores/host  Cores total  Cores avail  Processor                                  RAM     Interconnect
    GPG       20       16            320          320    Intel Xeon E5-2640v2, 8C, 2.0GHz           64      10GB Ethernet
    Kalkyl   348        8           2784         1408    Intel Xeon 5520v2, 4C, 2.26GHz             24-72   InfiniBand 20Gb/s
    TinTin   160       16           2560         2240    AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz   64-128  2:1 oversubscribed QDR InfiniBand
    Athos    776       24          18624         6144    Intel Xeon E5-2697v2, 12C, 2.7GHz          64      InfiniBand FDR14



Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level visualization of concurrent and distributed computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riak docs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking/.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Kőszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang '16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI '06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go Programming Language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary, adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI '73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS '08, pages 1:1–1:1, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. In ACM SIGPLAN Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak Core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley/.

[52] Huiqing Li and Simon Thompson. Automated API migration in a user-extensible refactoring tool for Erlang programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE '12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved semantics and implementation through property-based testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe concurrency introduction through slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897/.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP '09), pages 13–24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing: 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP '08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop: The Definitive Guide. Storage and Analysis at Internet Scale (3rd edition, revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-strength functional programming: Experiences with the Ericsson AXD301 project. In Implementing Functional Languages (IFL '00), Aachen, Germany, September 2000. Springer-Verlag.


Figure 12: SD Erlang ACO (SR-ACO) architecture, with a master process on the master node at level 0, and s_groups of sub-master, colony and ant processes on the colony nodes at levels 1 and 2.

A fragment of code that creates an s_group on a sub-master node is as follows:

    create_s_group(Master, GroupName, Nodes0) ->
        case s_group:new_s_group(GroupName, Nodes0) of
            {ok, GroupName, Nodes} ->
                {Master, GroupName, Nodes};
            _ -> io:format("exception message")
        end.

Similarly, we introduce s_groups in the GR-ACO benchmark from Section ii to create Scalable Reliable ACO (SR-ACO); see Figure 12. Here, apart from reducing the number of connections, s_groups also reduce the global namespace information. That is, instead of registering the name of a pid globally, i.e. with all nodes, the name is registered only on all nodes in the s_group, with s_group:register_name/3.

A comparative performance evaluation of distributed Erlang and SD Erlang Orbit and ACO is presented in Section VIII.

i.2 Semi-Explicit Placement

Recall from Section iii that distributed Erlang spawns a process onto an explicitly named Erlang node, e.g.:

    spawn(some_node, fun some_module:pong/0)

and also recall the portability and programming effort issues associated with such explicit placement in large scale systems, discussed in Section ii.

To address these issues we have developed a semi-explicit placement library that enables the programmer to select nodes on which to spawn processes based on run-time information about the properties of the nodes. For example, if a process performs a lot of computation one would like to spawn it on a node with considerable computation power; or, if two processes are likely to communicate frequently, then it would be desirable to spawn them on the same node, or on nodes with a fast interconnect.

We have implemented two Erlang libraries to support semi-explicit placement [60]. The first deals with node attributes, which describe properties of individual Erlang VMs and associated hosts, such as total and currently available RAM, installed software, hardware configuration, etc. The second deals with a notion of communication distances, which models the communication times between nodes in a distributed system. Therefore, instead of specifying a node, we can use the attr:choose_node/1 function to define the target node, i.e.:

    spawn(attr:choose_node(Params), fun some_module:pong/0)

Figure 13: SD Erlang (SD-Orbit) architecture.

[60] report an investigation into the communication latencies on a range of NUMA and cluster architectures, and demonstrate the effectiveness of the placement libraries using the ML-ACO benchmark on the Athos cluster.

ii S_group Semantics

For precise specification, and as a basis for validation, we provide a small-step operational semantics of the s_group operations [21]. Figure 14 defines the state of an SD Erlang system and associated abstract syntax variables. The abstract syntax variables on the left are defined as members of sets, denoted {...}, and these in turn may contain tuples, denoted (...), or further sets. In particular nm is a process name, p a pid, ni a node_id, and nis a set of node_ids. The state of a system is modelled as a four-tuple comprising a set of s_groups, a set of free_groups, a set of free_hidden_groups, and a set of nodes. Each type of group is associated with nodes and has a namespace. An s_group additionally has a name, whereas a free_hidden_group consists of only one node: a hidden node simultaneously acts as a node and as a group, because as a group it has a namespace but does not share it with any other node. Free normal and hidden groups have no names, and are uniquely defined by the nodes associated with them. Therefore group names, gr_names, are either NoGroup or a set of s_group_names. A namespace is a set of (name, pid) pairs, and is replicated on all nodes of the associated group.

A node has the following four parameters: a node_id identifier; a node_type, which can be either hidden or normal; connections; and group_names, i.e. the names of groups the node belongs to. The node can belong to either a list of s_groups or one of the free groups; the type of the free group is defined by the node type. Connections are a set of node_ids.

Transitions in the semantics have the form (state, command, ni) → (state′, value), meaning that executing command on node ni in state returns value and transitions to state′.

The semantics is presented in more detail by [21], but we illustrate it here with the s_group:registered_names/1 function from Section i.1. The function returns a list of names

(grs, fgs, fhs, nds) ∈ state ≡ ({s_group}, {free_group}, {free_hidden_group}, {node})

gr ∈ grs ≡ s_group ≡ (s_group_name, {node_id}, namespace)

fg ∈ fgs ≡ free_group ≡ ({node_id}, namespace)

fh ∈ fhs ≡ free_hidden_group ≡ (node_id, namespace)

nd ∈ nds ≡ node ≡ (node_id, node_type, connections, gr_names)

gs ∈ gr_names ≡ NoGroup | {s_group_name}

ns ∈ namespace ≡ {(name, pid)}

cs ∈ connections ≡ {node_id}

nt ∈ node_type ≡ Normal | Hidden

s ∈ NoGroup | s_group_name

Figure 14: SD Erlang state [21].

registered in s_group s if node ni belongs to the s_group, and an empty list otherwise (Figure 15). Here ⊕ denotes disjoint set union; IsSGroupNode returns true if node ni is a member of some s_group s, and false otherwise; and OutputNms returns the set of process names registered in the namespace ns of s_group s.

iii Semantics Validation

As the semantics is concrete, it can readily be made executable in Erlang, with lists replacing sets throughout. Having an executable semantics allows users to engage with it, and to understand how the semantics behaves vis à vis the library, giving them an opportunity to assess the correctness of the library against the semantics.

Figure 16: Testing s_groups using QuickCheck.

Better still, we can automatically assess how the system behaves in comparison with the (executable) semantics by executing them in lockstep, guided by the constraints of which operations are possible at each point. We do that by building an abstract state machine model of the library. We can then generate random sequences (or traces) through the model, with appropriate library data generated too. This random generation is supported by the QuickCheck property-based testing system [24, 9].

The architecture of the testing framework is shown in Figure 16. First, an abstract state machine, embedded as an "eqc_statem" module, is derived from the executable semantic specification. The state machine defines the abstract state representation and the transition from one state to another when an operation is applied. Test case and data generators are then defined to control the test case generation; this includes the automatic generation of eligible s_group operations and the input data to those operations. Test oracles are encoded as the postcondition for s_group operations.

((grs, fgs, fhs, nds), s_group:registered_names(s), ni)
    → ((grs, fgs, fhs, nds), nms)    if IsSGroupNode(ni, s, grs)
    → ((grs, fgs, fhs, nds), {})     otherwise

where
    {(s, ni ⊕ nis, ns)} ⊕ grs′ ≡ grs
    nms ≡ OutputNms(s, ns)

    IsSGroupNode(ni, s, grs) = ∃ nis, ns, grs′. {(s, ni ⊕ nis, ns)} ⊕ grs′ ≡ grs
    OutputNms(s, ns) = {(s, nm) | (nm, p) ∈ ns}

Figure 15: SD Erlang semantics of the s_group:registered_names/1 function.
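To give a flavour of the model's shape, the sketch below uses Quviq QuickCheck's eqc_statem callbacks; the state record, generators and oracle shown are our own illustrative assumptions, not the project's actual model:

    -module(s_group_eqc).
    -include_lib("eqc/include/eqc.hrl").
    -include_lib("eqc/include/eqc_statem.hrl").
    -compile(export_all).

    -record(state, {groups = []}).   %% abstract s_groups: [{Name, Nodes}]

    %% Hypothetical generators for group names and node lists.
    group_name()  -> elements([g1, g2, g3]).
    group_nodes() -> list(elements(['node1@host', 'node2@host'])).

    %% Initial abstract state, mirroring the semantics' initial state.
    initial_state() -> #state{}.

    %% Generate an eligible command in the current abstract state.
    command(_S) ->
        oneof([{call, s_group, new_s_group,      [group_name(), group_nodes()]},
               {call, s_group, registered_names, [group_name()]}]).

    precondition(_S, _Call) -> true.

    %% State transition, taken from the executable semantics.
    next_state(S, _Res, {call, s_group, new_s_group, [Name, Nodes]}) ->
        S#state{groups = lists:keystore(Name, 1, S#state.groups, {Name, Nodes})};
    next_state(S, _Res, _Call) -> S.

    %% Test oracle: compare the library's answer with the abstract model.
    postcondition(S, {call, s_group, registered_names, [Name]}, Res) ->
        case lists:keyfind(Name, 1, S#state.groups) of
            false   -> Res =:= [];   %% unknown group: no registered names
            {_N, _} -> is_list(Res)  %% known group: a list of names
        end;
    postcondition(_S, _Call, _Res) -> true.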

During testing, each test command is applied to both the abstract model and the s_group library. The application of the test command to the abstract model takes the abstract model from its current state to a new state, as described by the transition functions, whereas the application of the test command to the library leads the system to a new actual state. The actual state information is collected from each node in the distributed system, then merged and normalised to the same format as the abstract state representation. For a test to be successful, after the execution of a test command the test oracles specified for this command should be satisfied. Various test oracles can be defined for s_group operations; for instance, one generic constraint that applies to all the s_group operations is that after each s_group operation the normalised system state should be equivalent to the abstract state.

Thousands of tests were run, and three kinds of errors (all subsequently corrected) were found. Some errors in the library implementation were found, including one error due to the synchronisation between nodes, and another related to the remove_nodes operation, which erroneously raised an exception. We also found a couple of trivial errors in the semantic specification itself, which had been missed by manual examination. Finally, we found some situations where there were inconsistencies between the semantics and the library implementation, despite their states being equivalent; an example of this was in the particular values returned by functions on certain errors. Overall, the automation of testing boosted our confidence in the correctness of the library implementation and the semantic specification. This work is reported in more detail by [54].

VI Improving VM Scalability

This section reports the primary VM and library improvements we have designed and implemented to address the scalability and reliability issues identified in Section i.


i Improvements to Erlang Term Storage

Because ETS tables are so heavily used in Erlang systems, they are a focus for scalability improvements. We start by describing their redesign, including some improvements that pre-date our RELEASE project work, i.e. those prior to Erlang/OTP R15B03. These historical improvements are very relevant for a scalability study, and form the basis for our subsequent changes and improvements. At the point when Erlang/OTP got support for multiple cores (in release R11B), there was a single reader-writer lock for each ETS table. Optional fine-grained locking of hash-based ETS tables (i.e. set, bag or duplicate_bag tables) was introduced in Erlang/OTP R13B02-1, adding 16 reader-writer locks for the hash buckets. Reader groups, to minimise read synchronisation overheads, were introduced in Erlang/OTP R14B. The key observation is that a single count of the multiple readers must be synchronised across many cache lines, potentially far away in a NUMA system. Maintaining reader counts in multiple (local) caches makes reads fast, although writes must now check every reader count. In Erlang/OTP R16B the number of bucket locks and the default number of reader groups were both upgraded from 16 to 64.

We illustrate the scaling properties of the ETS concurrency options using the ets_bench BenchErl benchmark on the Intel NUMA machine with 32 hyperthreaded cores specified in Appendix A. The ets_bench benchmark inserts 1M items into the table, then records the time to perform 17M operations, where an operation is either a lookup, an insert, or a delete. The experiments are conducted on a hash-based (set) ETS table with different percentages of update operations, i.e. insertions or deletions.
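For orientation, the concurrency options in question are chosen per table at creation time using the standard ets API; a minimal sketch (table name and data invented):

    %% A hash-based table with fine-grained locking for concurrent
    %% writers, and reader groups for concurrent readers.
    Tab = ets:new(bench_table, [set, public,
                                {write_concurrency, true},
                                {read_concurrency,  true}]),
    true = ets:insert(Tab, {key1, value1}),
    [{key1, value1}] = ets:lookup(Tab, key1).

The number of reader groups, by contrast, is a VM-wide setting, controlled by the +rg emulator flag (e.g. erl +rg 64).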

Figure 17 shows the runtimes in seconds of 17M operations in different Erlang/OTP versions, varying the number of schedulers (x-axis), reflecting how the scalability of ETS tables has improved in more recent Erlang/OTP releases. Figure 18 shows the runtimes in seconds of 17M operations on an ETS table with different numbers of reader groups, again varying the number of schedulers. We see that one reader group is not sufficient with 10% updates, nor are two with 1% updates. Beyond that, different numbers of reader groups have little impact on the benchmark performance, except that using 64 groups with 10% updates slightly degrades performance.

Figure 17: Runtime of 17M ETS operations in Erlang/OTP releases, 10% updates (left) and 1% updates (right). The y-axis shows time in seconds (lower is better), and the x-axis is the number of OS threads (VM schedulers).

We have explored four other extensions or redesigns of the ETS implementation for better scalability: (1) allowing more programmer control over the number of bucket locks in hash-based tables, so the programmer can reflect the number of schedulers and the expected access pattern; (2) using contention-adapting trees to get better scalability for ordered_set ETS tables, as described by [69]; (3) using queue delegation locking to improve scalability [47]; and (4) adopting schemes for completely eliminating the locks in the meta table. A more complete discussion of our work on ETS can be found in the papers by [69] and [47].

Figure 18: Runtime of 17M ETS operations with varying numbers of reader groups (RG), 10% updates (left) and 1% updates (right). The y-axis shows runtime in seconds (lower is better), and the x-axis is the number of schedulers.

Here we outline only our work on contention-adapting (CA) trees. A CA tree monitors contention in different parts of a tree-shaped data structure, introducing routing nodes with locks in response to high contention, and removing them in response to low contention. For experimental purposes, two variants of the CA tree have been implemented to represent ordered_sets in the virtual machine of Erlang/OTP 17.0: one extends the existing AVL trees in the Erlang VM, and the other uses a Treap data structure [5]. Figure 19 compares the throughput of the CA tree variants with that of ordered_set and set as the number of schedulers increases. It is unsurprising that the CA trees scale so much better than an ordered_set, which is protected by a single readers-writer lock. It is more surprising that they also scale better than set. This is because hash tables use fine-grained locking at a fixed granularity, while CA trees can adapt the number of locks both to the current contention level and to the parts of the key range where contention is occurring.

Figure 19: Throughput of CA tree variants (operations/microsecond vs. number of threads), 10% updates (left) and 1% updates (right).


ii Improvements to Schedulers

In the Erlang VM, a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until it either blocks waiting for input, typically a message from some other process, or until it has executed its quota of reductions.

The Erlang VM is usually started with one scheduler per logical core (SMT thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned it is placed in the run queue of the scheduler of its parent, and it waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores, that is, an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Figure 24 (Section iii.2).

The default load management mechanismis load compaction that aims to keep as manyscheduler threads as possible fully loaded withwork ie it attempts to ensure that sched-uler threads do not run out of work Wehave developed a new optional scheduler utilisa-tion balancing mechanism that is available fromErlangOTP 170 The new mechanism aimsto balance scheduler utilisation between sched-ulers that is it will strive for equal schedulerutilisation on all schedulers

The scheduler utilisation balancing mecha-nism has no performance impact on the systemwhen not enabled On the other hand whenenabled it results in changed timing in the sys-tem normally there is a small overhead dueto measuring of utilisation and calculating bal-ancing information which depends on the un-derlying primitives provided by the operating

systemThe new balancing mechanism results in a

better distribution of processes to schedulersreducing the probability of core contention To-gether with other VM improvements such asinterruptable BIFs and garbage collection itresults in lower latency and improved respon-siveness and hence reliability for soft real-timeapplications
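Assuming the emulator flag introduced for this purpose (+sub, available from Erlang/OTP 17.0; the flag name follows the standard erl documentation, but check the documentation for your release), the mechanism is enabled at node start-up:

    erl +sub true

By default the flag is false, preserving the load compaction behaviour described above.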

iii Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel benchmark in Section i. The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in that returns "Erlang system time", i.e., time since Epoch with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, i.e., not SMT-enabled. The documentation promises that values returned by it are strictly increasing, and many applications ended up relying on this. For example, applications often employ erlang:now/0 to generate unique integers.

Erlang system time should align with the operating system's view of time since Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot, without invalidating the strictly increasing value guarantee. The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when the two have deviated, e.g., in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], i.e., a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards or backwards, while system time is allowed to do so. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.
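A sketch of the new API in use (erlang:monotonic_time/0, erlang:time_offset/0 and erlang:unique_integer/1 are the Erlang/OTP 18 built-ins; do_some_work/0 is a placeholder):

    %% Monotonic time is for measuring elapsed time (native time unit).
    Start = erlang:monotonic_time(),
    do_some_work(),                          %% placeholder workload
    Elapsed = erlang:monotonic_time() - Start,

    %% Erlang system time is monotonic time plus the current offset.
    SystemTime = erlang:monotonic_time() + erlang:time_offset(),

    %% Strictly increasing unique values no longer require erlang:now/0.
    Id = erlang:unique_integer([monotonic, positive]).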

Time Retrieval. Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe, but which scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, monotonic time delivered by the operating system is based solely on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP); that is, they run with different frequencies. Linux is an exception, with a monotonic clock that is NTP-adjusted and runs with the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing monotonic time and system time as delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute. If the adjustment needs to be changed, new adjustment information is published and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the previously published time adjustment information, and calculates Erlang monotonic time. To preserve monotonicity, it is important that all threads that read the same OS monotonic time map it to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information, using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed. This means that in the vast majority of cases the RW lock will be read-locked, which allows multiple readers to run concurrently. To prevent bouncing of the lock cache line, we use a bespoke reader-optimised RW lock implementation, in which reader threads notify about their presence in counters on separate cache lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11]; alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].

Timer Wheel and BIF Timer. The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel, which is used by processes executing on that scheduler.

The implementation of timers in Erlang/OTP uses a built-in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS tables, and only insert one timer at a time into the timer wheel.
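For illustration, these are the user-facing timer BIFs whose implementation the above work optimises (handle_tick/0 is a placeholder):

    %% Set a timer that delivers {timeout, TRef, tick} after 5 seconds.
    TRef = erlang:start_timer(5000, self(), tick),
    receive
        {timeout, TRef, tick} -> handle_tick()   %% placeholder handler
    end,

    %% Timers can also be cancelled before they fire.
    Ref = erlang:send_after(1000, self(), ping),
    erlang:cancel_timer(Ref).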

Benchmarks. We have measured several benchmarks on a 16-core Bulldozer machine with dual AMD Opteron 4376 HE processors (eight cores each).5 We present three of them here.

The first micro benchmark compares the execution time of an Erlang receive with that of a receive ... after that specifies a timeout and provides a default value. The receive ... after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, the total execution time with the optimised timers is only 5% longer than without timers.
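A sketch of the kind of loop pair this micro benchmark measures (the function and message names are illustrative, not the benchmark's actual code):

    %% Receive without a timeout: no timer is involved.
    recv_loop(0) -> ok;
    recv_loop(N) ->
        self() ! msg,
        receive msg -> ok end,
        recv_loop(N - 1).

    %% Receive with a timeout: a timer is set when the process blocks
    %% in the receive, and cancelled when the message arrives.
    recv_after_loop(0) -> ok;
    recv_after_loop(N) ->
        self() ! msg,
        receive msg -> ok
        after 1000 -> timeout
        end,
        recv_after_loop(N - 1).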

The second micro benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4 and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on the left shows that: (1) the performance of time management has remained roughly unchanged between Erlang/OTP releases prior to 18.0; (2) the improved time management in Erlang/OTP 18.x makes time management less likely to be a scalability bottleneck, even when using erlang:now/0; and (3) the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution. The graph on the right side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

5 See Section 2.5.4 of the RELEASE project Deliverable 2.4 (http://release-project.eu/documents/D24.pdf).

Figure 20: BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases: runtimes (left) and speedup obtained using erlang:monotonic_time/0 (right).

VII Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable, to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling and debugging Erlang systems, which can be used separately or together as appropriate for the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on or specialise the services offered by the Erlang virtual machine through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer6 application gives a comprehensive overview of many of these data on a node-by-node basis.

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature; that is, the tools support the language itself rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use the tracing support provided by the Erlang VM, other actor frameworks, e.g. Akka, can use the Kamon7 JVM monitoring system. Similarly, tools for other actor languages or frameworks could use data derived through OS-level tracing frameworks, DTrace8 and SystemTap9 probes, as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

6 http://www.erlang.org/doc/apps/observer
7 http://kamon.io

i Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare a program for modification or extension, or, as is the case here, in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support in a refactoring tool. There are a number of tools that support refactoring in Erlang: in the RELEASE project we have chosen to extend Wrangler10 [56]; other tools include Tidier [68] and RefactorErl [44].

Supporting API Migration. The SD Erlang libraries modify Erlang's global_group library, which becomes the new s_group library as a result; Erlang programs using global_group will therefore have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolution often changes the API of a library. Rather than simply extending Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of the API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically. The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work, and the technicalities of the implementation, can be found in a paper by [52].

8 http://dtrace.org/blogs/about
9 https://sourceware.org/systemtap/wiki
10 http://www.cs.kent.ac.uk/projects/wrangler
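As an illustration, an adaptor module might look like the sketch below; the particular mapping and the s_group signature shown are assumptions made for the example, not the actual rules shipped with Wrangler:

    -module(global_group_adaptor).
    -export([registered_names/1]).

    %% Each adaptor function defines an old global_group call in terms
    %% of the new s_group API; Wrangler folds these definitions into
    %% client code. (Hypothetical mapping, for illustration only.)
    registered_names({node, Node}) ->
        s_group:registered_names({node, Node}).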

Support for Introducing Parallelism. We have introduced support for parallelising explicit list operations (map and foreach), for process introduction to complete a computationally intensive task in parallel, for introducing a worker process to deal with call handling in an Erlang "generic server", and for parallelising a tail-recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach. The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward; even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
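The essence of a parallel map is a spawn per list element, with results gathered in list order; para_lib's actual implementation differs, so the following is only a minimal sketch:

    %% Minimal parallel map: spawn one process per element and collect
    %% the results in order, pairing each with a unique reference.
    pmap(F, Xs) ->
        Parent = self(),
        Refs = [begin
                    Ref = make_ref(),
                    spawn_link(fun() -> Parent ! {Ref, F(X)} end),
                    Ref
                end || X <- Xs],
        [receive {Ref, Res} -> Res end || Ref <- Refs].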

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute a task in parallel with its parent process. The result of the new process is sent back to the parent process, which consumes it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.
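A sketch of the shape of this refactoring (task_a/0, task_b/0 and use/2 are placeholders):

    %% Before: R1 = task_a(), R2 = task_b(), use(R1, R2).

    %% After: task_a/0 runs in a new process, and the receive is placed
    %% immediately before R1 is needed, so task_b/0 is not blocked.
    Self = self(),
    Pid = spawn(fun() -> Self ! {self(), task_a()} end),
    R2 = task_b(),
    receive {Pid, R1} -> use(R1, R2) end.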

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies; for instance, a function might recurse over a list while accumulating results in an accumulator variable. In such a situation it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are checked by program slicing, which is discussed below.

Support for Program Slicing. Program slicing is a general program analysis technique for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e., the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE11 [17], a tool developed in another EU project that also re-uses the Wrangler front end. That work concentrates on skeleton introduction, as does some of ours, but we go further in using static analysis and slicing to transform programs so as to make them suitable for the introduction of parallel structures.

11 http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf

Figure 21: The architecture of WombatOAM.

ii Scalable Deployment

We have developed the WombatOAM tool12 to provide a deployment, operations and maintenance framework for large-scale Erlang distributed systems. These systems typically consist of a number of Erlang nodes executing on different hosts. The hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP.

12 WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd.

Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this remains the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture. The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability because of the role played by the central Master node; in the current version an additional layer of Middle Managers has been introduced, allowing the system to scale easily to thousands of deployed nodes. As the diagram shows, the "northbound" interfaces to the web dashboard and the command line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms, and so forth; we describe these now.

Services. WombatOAM is designed to collect, store and display various kinds of information and events from running Erlang systems; these data are accessed and managed through an AJAX-based Web Dashboard. The services include the following.

Metrics. WombatOAM supports the collection of some hundred metrics (including, for instance, the number of processes on a node, and the message traffic in and out of the node) on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom13, and interface with other tools, such as Graphite14, which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications. As well as metrics, WombatOAM supports event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which are part of the standard distribution, or the lager logging framework15, and are displayed and logged as they occur.

Alarms. Alarms are more complex entities. Alarms have a state: they can be raised, and once dealt with they can be cleared. They also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as notifications are.

Topology. The Topology service handles adding, deleting and discovering nodes. It also monitors whether nodes are accessible; if not, it notifies the other services and periodically tries to reconnect. When the nodes become available again, it again notifies the other services. It has no Middle Manager part, because it doesn't talk to the nodes directly; instead it asks the Node manager service to do so.

Node manager. This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection to a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g., if the connection to a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager has no REST API, since the node states are provided via the Topology service's REST API.

13 Folsom collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom
14 https://graphiteapp.org

Orchestration. This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers the Orchestration service uses an external library called Libcloud16, for which Erlang Solutions has written an open source Erlang wrapper called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications; it provides infrastructure for deploying them.

15 https://github.com/basho/lager

Deployment. The deployment mechanism consists of the following five steps.

1. Registering a provider. WombatOAM provides the same interface for different cloud providers that support the OpenStack standard or the Amazon EC2 API; it also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this enables WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

16 The unified cloud API [4]: https://libcloud.apache.org

4. Defining a deployment domain. At this step a deployment domain is created, which specifies (i) the providers that should be used for provisioning machines, and (ii) the username to be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family the nodes belong to. Nodes can be dynamically added to or removed from the system, depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and in adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single-host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue, and processes migrate between the queues through a work stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes.17

iii.1 Percept2

Percept218 builds on the existing Percept tool to provide post hoc, offline analysis and visualisation of Erlang systems. Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang's built-in tracing and profiling to monitor process states, i.e., waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored in a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling for a distributed Erlang application using Percept2 is shown in Figure 22.

17 This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.
18 https://github.com/RefactoringTools/percept2

Figure 22: Offline profiling of a distributed Erlang application using Percept2.
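A typical offline profiling session follows the profile/analyse/visualise pattern of the original Percept. The entry points below follow Percept2's documented interface, though the specific module, arguments and options shown are assumptions for the example:

    %% 1. Run the program under Percept2, writing events to a trace file.
    percept2:profile("orbit.dat", {bench, orbit, [small]},  %% MFA is hypothetical
                     [message, process_scheduling]),
    %% 2. Analyse the trace file into the RAM database.
    percept2:analyze(["orbit.dat"]),
    %% 3. Inspect the results through the web-based interface.
    percept2:start_webserver(8888).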

Percept generates an application-level, zoomable concurrency graph showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime at which the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways, as detailed by [53], including most importantly:

• Distinguishing between running and runnable time for each process: this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including the migration history of a process between run queues; statistics about message passing between processes (the number of messages, and the average message size, sent/received by a process); the accumulated runtime per process; and the accumulated time during which a process is in the running state.

• Presenting the process tree: the hierarchical structure indicating the parent-child relationships between processes.

Figure 23: Percept2 showing processes running and runnable (with 4 schedulers/run queues).

• Recording the dynamic function callgraph/count/time: the hierarchical structure showing the calling relationships between functions during the program run, and the amount of time spent in each function.

• Tracing of s_group activities in a distributed system. Unlike global_group, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have therefore extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating the measurement of a function's own execution time gives users the freedom of not profiling all the function calls invoked during program execution. For example, they can choose to profile only the functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of generated web pages. Together these make the system more responsive and more scalable.

iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, and require neither specially recompiled versions of the software under examination nor special post-processing tools to create meaningful information from the gathered data. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the sizes of the run queues vary during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the same storage infrastructure and presentation facilities.

Figure 24: DTrace run-queue visualisation of the bang benchmark with 16 schedulers (time on X-axis).

iii.3 Devo

Devo19 [10] is designed to provide real-time, online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).

Figure 25 shows visualisations from Devo in both modes. On the left-hand side a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram) with six cores each; with hyperthreading, this gives twelve virtual cores and hence 24 run queues in total. The size of these run queues is shown by both the colour and the height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle. A green arc shows a migration on the same physical core, a grey one a migration on the same chip, and a blue arc shows a migration between the two chips.

19 https://github.com/RefactoringTools/devo

Figure 25: Devo low- and high-level visualisations of SD-Orbit.

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between the nodes (green quiescent, red busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. The tracing to be performed on monitored nodes is also specified within the configuration file.

An agent is started by a master SD-Mon node for each s_group and for each free node. The configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are stored persistently, so that the last configuration can be reproduced after a restart. The shadow network can, of course, always be updated via the user interface.

Each agent takes care of an s_group or of a free node. At start-up it tries to get in contact with its nodes and applies the tracing to them as specified by the master. Binary files are stored in the host file system. Tracing is used internally in order to track s_group operations happening at runtime; an asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

Figure 26: SD-Mon architecture.

The monitoring network is also supervised, in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected, that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as for post hoc offline analysis. Figure 28 shows, in real time, the messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution, before (top) and after (bottom) eliminating s_group 2.

Figure 28: SD-Mon online monitoring.

VIII Systemic Evaluation

Preceding sections have investigated improvements to individual aspects of an Erlang system, e.g., ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI and VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state-of-the-art NUMA architectures and on the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable.20 The bulk of the experiments reported here are conducted on the Athos cluster, using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit performs a fixed-size computation, the appropriate scaling measure is relative speedup (or strong scaling), i.e., speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure. The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impact of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

20 See Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf

Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling].
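For reference, the s_group operations that these benchmarks exercise have the following shape (a sketch: the function names are those of the SD Erlang libraries, but the exact arities and return values shown are assumptions):

    %% Create an s_group connecting two nodes: connections and name
    %% registration are then scoped to the group rather than global.
    {ok, aco_group, _Nodes} =
        s_group:new_s_group(aco_group,
                            ['submaster1@host1', 'colony1@host2']),
    %% Register a name only within the group (cf. global:register_name/2).
    yes = s_group:register_name(aco_group, master, self()),
    Pid = s_group:whereis_name(aco_group, master).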

i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot the standard deviation. The results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups: beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails, because some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e., 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5,000 nodes are deployed one per core (i.e., 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.

Figure 30: WombatOAM deployment time: experimental data with best fit 0.0125N + 83.

The measurement data shows two important facts: WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of the effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g., those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of the two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, while SR-ACO registers them only within an s_group (Section ii).
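A minimal sketch of such a service (the real Chaos Monkey also protects essential system processes; this illustrative loop only avoids killing itself):

    %% Kill one randomly chosen process on the local node every second.
    chaos_monkey() ->
        timer:sleep(1000),
        Victims = [P || P <- erlang:processes(), P =/= self()],
        Victim = lists:nth(rand:uniform(length(Victims)), Victims),
        exit(Victim, kill),
        chaos_monkey().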

For both GR-ACO and SR-ACO, a Chaos Monkey runs on every Erlang node, i.e., on the master, the submasters and the colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common x86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, i.e., around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark, and obtained similar results [23].

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.

Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling].

Scalability. Figure 31 compares the runtimes of the ML, GR and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it also registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and the fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks, and others, that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (the dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

Figure 32: Number of sent packets in ML-ACO, GR-ACO and SR-ACO.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.

As SD-Orbit scales better than D-Orbit, as SR-ACO scales better than ML-ACO, and as SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M and W=5M) reach scalability limits at around 40 and 60 nodes respectively, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limit of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e., a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms, like Erlang or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable, reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language while preserving reliability. The work is a vade mecum for addressing scalability in reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering the VM, language and persistent storage levels, and have developed the BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language-level scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15] or Legion [40]. We establish scientifically the folklore limitation of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including the memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g., in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups, for partitioning the network and the global process namespace, and semi-explicit process placement, for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).

To improve the scalability of the Erlang VM and libraries we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups has been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large-scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e., a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately-sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

Table 2: Cluster Specifications (RAM per host in GB)

Name    Hosts  Cores/host  Cores total  Cores max avail  Processor                                   RAM     Interconnection
GPG     20     16          320          320              Intel Xeon E5-2640 v2, 8C, 2GHz             64      10GB Ethernet
Kalkyl  348    8           2784         1408             Intel Xeon 5520 v2, 4C, 2.26GHz             24-72   InfiniBand 20Gb/s
TinTin  160    16          2560         2240             AMD Opteron 6220 v2 Bulldozer, 8C, 3.0GHz   64-128  2:1 oversubscribed QDR InfiniBand
Athos   776    24          18624        6144             Intel Xeon E5-2697 v2, 12C, 2.7GHz          64      InfiniBand FDR14

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large-scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages.21 For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2 / 16M L3 cache and 128GB RAM, with four AMD Opteron 6276 processors at 2.3GHz, each with 16 "Bulldozer" cores, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM and four Intel Xeon E5-4650 processors at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

21 See Deliverable 6.7 (http://www.release-project.eu/documents/D67.pdf), available online.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture)

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level visualization of concurrent and distributed computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riak docs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html

[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördos, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Koszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördos, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina Mario Moro Hernan-dez and Phil Trinder A scalable reliableinstant messenger using the SD Erlang li-braries In Erlangrsquo16 pages 33ndash41 NewYork NY USA 2016 ACM

[24] Koen Claessen and John HughesQuickCheck a lightweight tool forrandom testing of Haskell programsIn Proceedings of the 5th ACM SIGPLANInternational Conference on FunctionalProgramming pages 268ndash279 New YorkNY USA 2000 ACM

[25] H A J Crauwels C N Potts and L Nvan Wassenhove Local search heuristicsfor the single machine total weighted tar-diness scheduling problem INFORMSJournal on Computing 10(3)341ndash350 1998The datasets from this paper are includedin Beasleyrsquos ORLIB

[26] Jeffrey Dean and Sanjay GhemawatMapreduce Simplified data processing onlarge clusters Commun ACM 51(1)107ndash113 2008

[27] David Dewolfs Jan Broeckhove VaidySunderam and Graham E Fagg FT-MPI fault-tolerant metacomputing andgeneric name services A case study InEuroPVMMPIrsquo06 pages 133ndash140 BerlinHeidelberg 2006 Springer-Verlag

[28] Alan AA Donovan and Brian WKernighan The Go programming languageAddison-Wesley Professional 2015

[29] Marco Dorigo and Thomas Stuumltzle AntColony Optimization Bradford CompanyScituate MA USA 2004

[30] Faith Ellen Yossi Lev Victor Luchangcoand Mark Moir SNZI scalable NonZero

45

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

indicators In Proceedings of the 26th AnnualACM Symposium on Principles of DistributedComputing pages 13ndash22 New York NYUSA 2007 ACM

[31] Jeff Epstein Andrew P Black and Si-mon Peyton-Jones Towards Haskell inthe Cloud In Haskell rsquo11 pages 118ndash129ACM 2011

[32] Martin Fowler Refactoring Improving theDesign of Existing Code Addison-WesleyLongman Publishing Co Inc BostonMA USA 1999

[33] Ana Gainaru and Franck Cappello Errorsand faults In Fault-Tolerance Techniques forHigh-Performance Computing pages 89ndash144Springer International Publishing 2015

[34] Martin Josef Geiger New instancesfor the Single Machine Total WeightedTardiness Problem Technical ReportResearch Report 10-03-01 March 2010Datasets available at httplogistik

hsu-hhdeSMTWTP

[35] Guillaume Germain Concurrency ori-ented programming in termite schemeIn Proceedings of the 2006 ACM SIGPLANworkshop on Erlang pages 20ndash20 NewYork NY USA 2006 ACM

[36] Amir Ghaffari De-bench a benchmarktool for distributed erlang 2014 https

githubcomamirghaffariDEbench

[37] Amir Ghaffari Investigating the scalabil-ity limits of distributed Erlang In Proceed-ings of the 13th ACM SIGPLAN Workshopon Erlang pages 43ndash49 ACM 2014

[38] Amir Ghaffari Natalia Chechina PhipTrinder and Jon Meredith Scalable persis-tent storage for Erlang Theory and prac-tice In Proceedings of the 12th ACM SIG-PLAN Workshop on Erlang pages 73ndash74New York NY USA 2013 ACM

[39] Seth Gilbert and Nancy Lynch Brewerrsquosconjecture and the feasibility of consistentavailable partition-tolerant web servicesACM SIGACT News 3351ndash59 2002

[40] Andrew S Grimshaw Wm A Wulf andthe Legion team The Legion vision of aworldwide virtual computer Communica-tions of the ACM 40(1)39ndash45 1997

[41] Carl Hewitt Actor model for discre-tionary adaptive concurrency CoRRabs10081459 2010

[42] Carl Hewitt Peter Bishop and RichardSteiger A universal modular ACTOR for-malism for artificial intelligence In IJ-CAIrsquo73 pages 235ndash245 San Francisco CAUSA 1973 Morgan Kaufmann

[43] Rich Hickey The Clojure programminglanguage In DLSrsquo08 pages 11ndash11 NewYork NY USA 2008 ACM

[44] Zoltaacuten Horvaacuteth Laacuteszloacute Loumlvei TamaacutesKozsik Roacutebert Kitlei Anikoacute Nagyneacute ViacutegTamaacutes Nagy Melinda Toacuteth and RolandKiraacutely Building a refactoring tool forErlang In Workshop on Advanced SoftwareDevelopment Tools and Techniques WAS-DETT 2008 2008

[45] Laxmikant V Kale and Sanjeev KrishnanCharm++ a portable concurrent objectoriented system based on c++ In ACMSigplan Notices volume 28 pages 91ndash108ACM 1993

[46] David Klaftenegger Konstantinos Sago-nas and Kjell Winblad On the scalabilityof the Erlang term storage In Proceedingsof the Twelfth ACM SIGPLAN Workshop onErlang pages 15ndash26 New York NY USASeptember 2013 ACM

[47] David Klaftenegger Konstantinos Sago-nas and Kjell Winblad Delegation lock-ing libraries for improved performance of

46

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

multithreaded programs In Euro-Par 2014Parallel Processing volume 8632 of LNCSpages 572ndash583 Springer 2014

[48] David Klaftenegger Konstantinos Sago-nas and Kjell Winblad Queue delegationlocking IEEE Transactions on Parallel andDistributed Systems 2017 To appear

[49] Rusty Klophaus Riak core Building dis-tributed applications without shared stateIn ACM SIGPLAN Commercial Users ofFunctional Programming CUFP rsquo10 pages141ndash141 New York NY USA 2010ACM

[50] Avinash Lakshman and Prashant MalikCassandra A decentralized structuredstorage system SIGOPS Oper Syst Rev44(2)35ndash40 April 2010

[51] J Lee et al Python actor runtime library2010 oslcsuiuieduparley

[52] Huiqing Li and Simon Thompson Auto-mated API Migration in a User-ExtensibleRefactoring Tool for Erlang Programs InTim Menzies and Motoshi Saeki editorsAutomated Software Engineering ASErsquo12IEEE Computer Society 2012

[53] Huiqing Li and Simon Thompson Multi-core profiling for Erlang programs usingPercept2 In Proceedings of the Twelfth ACMSIGPLAN workshop on Erlang 2013

[54] Huiqing Li and Simon Thompson Im-proved Semantics and ImplementationThrough Property-Based Testing withQuickCheck In 9th International Workshopon Automation of Software Test 2014

[55] Huiqing Li and Simon Thompson SafeConcurrency Introduction through SlicingIn PEPM rsquo15 ACM SIGPLAN January2015

[56] Huiqing Li Simon Thompson GyoumlrgyOrosz and Melinda Toumlth Refactoring

with Wrangler updated In ACM SIG-PLAN Erlang Workshop volume 2008 2008

[57] Frank Lubeck and Max Neunhoffer Enu-merating large Orbits and direct conden-sation Experimental Mathematics 10(2)197ndash205 2001

[58] Andreea Lutac Natalia Chechina Ger-ardo Aragon-Camarasa and Phil TrinderTowards reliable and scalable robot com-munication In Proceedings of the 15th Inter-national Workshop on Erlang Erlang 2016pages 12ndash23 New York NY USA 2016ACM

[59] LWNnet The high-resolution timerAPI January 2006 httpslwnnet

Articles167897

[60] Kenneth MacKenzie Natalia Chechinaand Phil Trinder Performance portabilitythrough semi-explicit placement in dis-tributed Erlang In Proceedings of the 14thACM SIGPLAN Workshop on Erlang pages27ndash38 ACM 2015

[61] Jeff Matocha and Tracy Camp A taxon-omy of distributed termination detectionalgorithms Journal of Systems and Software43(221)207ndash221 1998

[62] Nicholas D Matsakis and Felix S Klock IIThe Rust language In ACM SIGAda AdaLetters volume 34 pages 103ndash104 ACM2014

[63] Robert McNaughton Scheduling withdeadlines and loss functions ManagementScience 6(1)1ndash12 1959

[64] Martin Odersky et al The Scala program-ming language 2012 wwwscala-langorg

[65] William F Opdyke Refactoring Object-Oriented Frameworks PhD thesis Uni-versity of Illinois at Urbana-Champaign1992

47

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

[66] Nikolaos Papaspyrou and KonstantinosSagonas On preserving term sharing inthe Erlang virtual machine In TorbenHoffman and John Hughes editors Pro-ceedings of the 11th ACM SIGPLAN ErlangWorkshop pages 11ndash20 ACM 2012

[67] RELEASE Project Team EU Framework7 Project 287510 (2011ndash2015) 2015 httpwwwrelease-projecteu

[68] Konstantinos Sagonas and ThanassisAvgerinos Automatic refactoring ofErlang programs In Principles and Prac-tice of Declarative Programming (PPDPrsquo09)pages 13ndash24 ACM 2009

[69] Konstantinos Sagonas and Kjell WinbladMore scalable ordered set for ETS usingadaptation In Proc ACM SIGPLAN Work-shop on Erlang pages 3ndash11 ACM Septem-ber 2014

[70] Konstantinos Sagonas and Kjell WinbladContention adapting search trees In Pro-ceedings of the 14th International Sympo-sium on Parallel and Distributed Computingpages 215ndash224 IEEE Computing Society2015

[71] Konstantinos Sagonas and Kjell WinbladEfficient support for range queries andrange updates using contention adaptingsearch trees In Languages and Compilersfor Parallel Computing - 28th InternationalWorkshop volume 9519 of LNCS pages37ndash53 Springer 2016

[72] Marc Snir Steve W Otto D WWalker Jack Dongarra and Steven Huss-Lederman MPI The Complete ReferenceMIT Press Cambridge MA USA 1995

[73] Sriram Srinivasan and Alan MycroftKilim Isolation-typed actors for Java InECOOPrsquo08 pages 104ndash128 Berlin Heidel-berg 2008 Springer-Verlag

[74] Don Syme Adam Granicz and AntonioCisternino Expert F 40 Springer 2015

[75] Simon Thompson and Huiqing Li Refac-toring tools for functional languages Jour-nal of Functional Programming 23(3) 2013

[76] Marcus Voumllker Linux timers May 2014

[77] WhatsApp 2015 httpswwwwhatsappcom

[78] Tom White Hadoop - The Definitive GuideStorage and Analysis at Internet Scale (3 edrevised and updated) OrsquoReilly 2012

[79] Ulf Wiger Industrial-Strength Functionalprogramming Experiences with the Er-icsson AXD301 Project In ImplementingFunctional Languages (IFLrsquo00) Aachen Ger-many September 2000 Springer-Verlag

48


Figure 13: SD Erlang (SD-Orbit) architecture.

fun some_module:pong/0)

[60] report an investigation into the communication latencies on a range of NUMA and cluster architectures, and demonstrate the effectiveness of the placement libraries using the ML-ACO benchmark on the Athos cluster.

ii S_group Semantics

For precise specification, and as a basis for validation, we provide a small-step operational semantics of the s_group operations [21]. Figure 14 defines the state of an SD Erlang system and associated abstract syntax variables. The abstract syntax variables on the left are defined as members of sets, denoted {}, and these in turn may contain tuples, denoted (), or further sets. In particular, nm is a process name, p a pid, ni a node_id, and nis a set of node_ids. The state of a system is modelled as a four-tuple comprising a set of s_groups, a set of free_groups, a set of free_hidden_groups, and a set of nodes. Each type of group is associated with nodes and has a namespace. An s_group additionally has a name, whereas a free_hidden_group consists of only one node,

i.e. a hidden node simultaneously acts as a node and as a group, because as a group it has a namespace but does not share it with any other node. Free normal and hidden groups have no names, and are uniquely defined by the nodes associated with them. Therefore, group names, gr_names, are either NoGroup or a set of s_group_names. A namespace is a set of name and process id (pid) pairs, and is replicated on all nodes of the associated group.

A node has the following four parameters: a node_id identifier; a node_type, which can be either hidden or normal; connections; and group_names, i.e. names of the groups the node belongs to. The node can belong to either a list of s_groups or one of the free groups; the type of the free group is defined by the node type. Connections are a set of node_ids.

Transitions in the semantics have the form (state, command, ni) → (state′, value), meaning that executing command on node ni in state returns value and transitions to state′.

The semantics is presented in more detail by [21], but we illustrate it here with the s_group:registered_names/1 function from Section i.1.


(grs, fgs, fhs, nds) ∈ state ≡ ({s_group}, {free_group}, {free_hidden_group}, {node})
gr ∈ grs ≡ {s_group} ≡ {(s_group_name, {node_id}, namespace)}
fg ∈ fgs ≡ {free_group} ≡ {({node_id}, namespace)}
fh ∈ fhs ≡ {free_hidden_group} ≡ {(node_id, namespace)}
nd ∈ nds ≡ {node} ≡ {(node_id, node_type, connections, gr_names)}
gs ∈ gr_names ≡ NoGroup | {s_group_name}
ns ∈ namespace ≡ {(name, pid)}
cs ∈ connections ≡ {node_id}
nt ∈ node_type ≡ Normal | Hidden
s ∈ NoGroup | s_group_name

Figure 14: SD Erlang state [21].

The function returns a list of names registered in s_group s if node ni belongs to the s_group, and an empty list otherwise (Figure 15). Here ⊕ denotes disjoint set union; IsSGroupNode returns true if node ni is a member of some s_group s, and false otherwise; and OutputNms returns a set of process names registered in the ns namespace of s_group s.

((grs, fgs, fhs, nds), s_group:registered_names(s), ni)
    → ((grs, fgs, fhs, nds), nms)    if IsSGroupNode(ni, s, grs)
    → ((grs, fgs, fhs, nds), {})     otherwise

where
    (s, {ni} ⊕ nis, ns) ⊕ grs′ ≡ grs
    nms ≡ OutputNms(s, ns)
    IsSGroupNode(ni, s, grs) = ∃ nis, ns, grs′ . (s, {ni} ⊕ nis, ns) ⊕ grs′ ≡ grs
    OutputNms(s, ns) = {(s, nm) | (nm, p) ∈ ns}

Figure 15: SD Erlang semantics of the s_group:registered_names/1 function.
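For illustration, a hedged shell sketch of this rule, assuming an s_group named group1 whose namespace registers the name foo, and passing the group name directly as in the semantics (the library API may wrap the argument differently):

    %% On a node that is a member of group1:
    (node1@host)1> s_group:registered_names(group1).
    [{group1,foo}]
    %% On a node that is not a member of group1:
    (node2@host)1> s_group:registered_names(group1).
    []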

iii Semantics Validation

As the semantics is concrete, it can readily be made executable in Erlang, with lists replacing sets throughout. Having an executable semantics allows users to engage with it and to understand how the semantics behaves vis-à-vis the library, giving them an opportunity to assess the correctness of the library against the semantics.

Figure 16: Testing s_groups using QuickCheck.

Better still, we can automatically assess how the system behaves in comparison with the (executable) semantics by executing them in lockstep, guided by the constraints of which operations are possible at each point. We do that by building an abstract state machine model of the library. We can then generate random sequences (or traces) through the model, with appropriate library data generated too. This random generation is supported by the QuickCheck property-based testing system [24, 9].

The architecture of the testing framework is shown in Figure 16. First, an abstract state machine embedded as an "eqc_statem" module is derived from the executable semantic specification. The state machine defines the abstract state representation and the transition from one state to another when an operation is applied. Test case and data generators are then defined to control the test case generation; this includes the automatic generation of eligible s_group operations and the input data to those operations. Test oracles are encoded as the postcondition for s_group operations.

During testing, each test command is applied to both the abstract model and the s_group library. The application of the test command to the abstract model takes the abstract model from its current state to a new state, as described by the transition functions, whereas the application of the test command to the library leads the system to a new actual state. The actual state information is collected from each node in the distributed system, then merged and normalised to the same format as the abstract state representation. For a test to be successful, after the execution of a test command the test oracles specified for this command should be satisfied. Various test oracles can be defined for s_group operations; for instance, one of the generic constraints that applies to all the s_group operations is that after each s_group operation the normalised system state should be equivalent to the abstract state.
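To make this concrete, a minimal sketch of such an abstract state machine follows. The state representation, generators and state-collection helpers are simplified stand-ins, not the project's actual model; only the QuickCheck eqc_statem callback shape is taken as given:

    -module(s_group_eqc).
    -include_lib("eqc/include/eqc.hrl").
    -include_lib("eqc/include/eqc_statem.hrl").
    -compile(export_all).

    %% Abstract state: a simplified stand-in for the four-tuple state
    %% of Figure 14; here just a list of {SGroupName, Nodes} pairs.
    initial_state() -> [].

    %% Generator for eligible s_group operations and their input data.
    command(_S) ->
        oneof([{call, s_group, new_s_group,
                [elements([group1, group2]),
                 elements([['n1@host'], ['n1@host', 'n2@host']])]},
               {call, s_group, registered_names,
                [elements([group1, group2])]}]).

    precondition(_S, _Call) -> true.

    %% Transition function, mirroring the executable semantics.
    next_state(S, _Res, {call, s_group, new_s_group, [Name, Nodes]}) ->
        lists:keystore(Name, 1, S, {Name, Nodes});
    next_state(S, _Res, _Call) -> S.

    %% Test oracle: the normalised actual system state must be
    %% equivalent to the abstract state after the transition.
    postcondition(S, Call, Res) ->
        normalise(actual_state()) =:= normalise(next_state(S, Res, Call)).

    prop_s_group() ->
        ?FORALL(Cmds, commands(?MODULE),
                begin
                    {_History, _State, Result} = run_commands(?MODULE, Cmds),
                    Result =:= ok
                end).

    %% Stubs: the real framework collects state from every node in the
    %% distributed system, then merges and normalises it.
    actual_state() -> [].
    normalise(State) -> lists:sort(State).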

Thousands of tests were run, and three kinds of errors, all of which have subsequently been corrected, were found. Some errors in the library implementation were found, including one error due to the synchronisation between nodes, and another related to the remove_nodes operation, which erroneously raised an exception. We also found a couple of trivial errors in the semantic specification itself, which had been missed by manual examination. Finally, we found some situations where there were inconsistencies between the semantics and the library implementation despite their states being equivalent; an example of this was in the particular values returned by functions on certain errors. Overall, the automation of testing boosted our confidence in the correctness of the library implementation and the semantic specification. This work is reported in more detail by [54].

VI Improving VM Scalability

This section reports the primary VM and library improvements we have designed and implemented to address the scalability and reliability issues identified in Section i.


i Improvements to Erlang Term Storage

Because ETS tables are so heavily used in Erlang systems, they are a focus for scalability improvements. We start by describing their redesign, including some improvements that pre-date our RELEASE project work, i.e. those prior to Erlang/OTP R15B03. These historical improvements are very relevant for a scalability study, and form the basis for our subsequent changes and improvements. At the point when Erlang/OTP got support for multiple cores (in release R11B), there was a single reader-writer lock for each ETS table. Optional fine-grained locking of hash-based ETS tables (i.e. set, bag or duplicate_bag tables) was introduced in Erlang/OTP R13B02-1, adding 16 reader-writer locks for the hash buckets. Reader groups, to minimise read synchronisation overheads, were introduced in Erlang/OTP R14B. The key observation is that a single count of the multiple readers must be synchronised across many cache lines, potentially far away in a NUMA system. Maintaining reader counts in multiple (local) caches makes reads fast, although writes must now check every reader count. In Erlang/OTP R16B the number of bucket locks and the default number of reader groups were both upgraded from 16 to 64.
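These locking refinements are controlled through ETS table options. As a brief sketch (the table name and accesses are illustrative; the option names are the stock Erlang/OTP ones):

    -module(ets_demo).
    -export([new_table/0]).

    %% A hash-based (set) table with fine-grained bucket locking
    %% ({write_concurrency, true}, from R13B02-1) and reader-group
    %% optimised read counts ({read_concurrency, true}, from R14B).
    new_table() ->
        T = ets:new(bench_table,
                    [set, public,
                     {write_concurrency, true},
                     {read_concurrency, true}]),
        true = ets:insert(T, {key, 42}),
        [{key, 42}] = ets:lookup(T, key),
        T.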

We illustrate the scaling properties of the ETS concurrency options using the ets_bench BenchErl benchmark on the Intel NUMA machine with 32 hyperthreaded cores specified in Appendix A. The ets_bench benchmark inserts 1M items into the table, then records the time to perform 17M operations, where an operation is either a lookup, an insert, or a delete. The experiments are conducted on a hash-based (set) ETS table with different percentages of update operations, i.e. insertions or deletions.

Figure 17 shows the runtimes in seconds of 17M operations in different Erlang/OTP versions, varying the number of schedulers (x-axis), reflecting how the scalability of ETS tables has improved in more recent Erlang/OTP releases. Figure 18 shows the runtimes in seconds of 17M operations on an ETS table with different numbers of reader groups, again varying the number of schedulers. We see that one reader group is not sufficient with 10% updates, nor are two with 1% updates. Beyond that, different numbers of reader groups have little impact on the benchmark performance, except that using 64 groups with 10% updates slightly degrades performance.

Figure 17: Runtime of 17M ETS operations in Erlang/OTP releases R11B-5, R13B02-1, R14B and R16B: 10% updates (left) and 1% updates (right). The y-axis shows time in seconds (lower is better), and the x-axis is the number of OS threads (VM schedulers).

We have explored four other extensions or redesigns in the ETS implementation for better scalability: (1) allowing more programmer control over the number of bucket locks in hash-based tables, so the programmer can reflect the number of schedulers and the expected access pattern; (2) using contention-adapting trees to get better scalability for ordered_set ETS tables, as described by [69]; (3) using queue delegation locking to improve scalability [47]; and (4) adopting schemes for completely eliminating the locks in the meta table. A more complete discussion of our work on ETS can be found in the papers by [69] and [47].

Figure 18: Runtime of 17M ETS operations with varying numbers of reader groups (1 to 64): 10% updates (left) and 1% updates (right). The y-axis shows runtime in seconds (lower is better), and the x-axis is the number of schedulers.

Here we outline only our work on contention-adapting (CA) trees. A CA tree monitors contention in different parts of a tree-shaped data structure, introducing routing nodes with locks in response to high contention, and removing them in response to low contention. For experimental purposes, two variants of the CA tree have been implemented to represent ordered_sets in the virtual machine of Erlang/OTP 17.0: one extends the existing AVL trees in the Erlang VM, and the other uses a Treap data structure [5]. Figure 19 compares the throughput of the CA tree variants with that of ordered_set and set as the number of schedulers increases. It is unsurprising that the CA trees scale so much better than an ordered_set, which is protected by a single readers-writer lock. It is more surprising that they also scale better than set. This is due to hash tables using fine-grained locking at a fixed granularity, while CA trees can adapt the number of locks to the current contention level, and also to the parts of the key range where contention is occurring.

Figure 19: Throughput (operations per microsecond) of set, ordered_set, and the AVL-CA tree and Treap-CA tree variants: 10% updates (left) and 1% updates (right).


ii Improvements to Schedulers

In the Erlang VM, a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until it either blocks waiting for input, typically a message from some other process, or until it has executed its quota of reductions.

The Erlang VM is usually started with one scheduler per logical core (SMT thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned, it is placed in the run queue of the scheduler of its parent, and it waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores; that is, an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Section iii.2.

The default load management mechanism is load compaction, which aims to keep as many scheduler threads as possible fully loaded with work, i.e. it attempts to ensure that scheduler threads do not run out of work. We have developed a new optional scheduler utilisation balancing mechanism, available from Erlang/OTP 17.0. The new mechanism aims to balance scheduler utilisation between schedulers; that is, it will strive for equal scheduler utilisation on all schedulers.

The scheduler utilisation balancing mechanism has no performance impact on the system when not enabled. On the other hand, when enabled, it results in changed timing in the system; normally there is a small overhead due to the measuring of utilisation and the calculating of balancing information, which depends on the underlying primitives provided by the operating system.

The new balancing mechanism results in a better distribution of processes to schedulers, reducing the probability of core contention. Together with other VM improvements, such as interruptable BIFs and garbage collection, it results in lower latency and improved responsiveness, and hence reliability, for soft real-time applications.

iii Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel benchmark in Section i. The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in that returns "Erlang system time", or time since Epoch with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, i.e. not SMT-enabled. The documentation promises that values returned by it are strictly increasing, and many applications ended up relying on this. For example, applications often employ erlang:now/0 to generate unique integers.

Erlang system time should align with the operating system's view of time since Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot, without invalidating the strictly increasing value guarantee. The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when these two have deviated, e.g. in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], i.e. a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards and backwards, while system time is allowed to do this. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.
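In code, the new API covers both of the idioms above directly; a small sketch using the Erlang/OTP 18 built-ins (the module and wrapper names are illustrative):

    -module(time_demo).
    -export([system_time/0, unique/0]).

    %% Erlang system time is Erlang monotonic time plus the
    %% dynamically varying offset, as described above.
    system_time() ->
        erlang:monotonic_time() + erlang:time_offset().

    %% Strictly increasing unique integers, previously a common use
    %% of erlang:now/0, without global synchronisation.
    unique() ->
        erlang:unique_integer([monotonic, positive]).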

Time Retrieval. Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe, but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, monotonic time delivered by the operating system is solely based on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP); that is, they will run with different frequencies. Linux is an exception, with a monotonic clock that is NTP-adjusted and runs with the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing monotonic time and system time delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute. If the adjustment needs to be changed, new adjustment information is published and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the time adjustment information previously published, and calculates Erlang monotonic time. To preserve monotonicity, it is important that all threads that read the same OS monotonic time map this to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed. This means that in the vast majority of cases the RW lock will be read-locked, which allows multiple readers to run concurrently. To prevent bouncing the lock cache-line, we use a bespoke reader-optimised RW lock implementation, where reader threads notify about their presence in counters on separate cache-lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11]; alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].

Timer Wheel and BIF Timer. The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel, which is used by processes executing on the scheduler.

The implementation of timers in Erlang/OTP uses a built-in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS tables, and only insert one timer at a time into the timer wheel.

Benchmarks. We have measured several benchmarks on a 16-core Bulldozer machine with eight dual CPU AMD Opteron 4376 HEs⁵. We present three of them here.

The first micro benchmark compares the execution time of an Erlang receive with that of a receive ... after that specifies a timeout and provides a default value. The receive ... after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, the total execution time with the optimised timers is only 5% longer than without timers.
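The two code shapes compared by this benchmark are sketched below (function names are illustrative); in the second, the VM sets a timer when the process blocks and cancels it if a message arrives before the timeout:

    %% Plain receive: no timer is involved.
    wait() ->
        receive Msg -> Msg end.

    %% receive ... after: a timeout and default value are specified.
    wait(Timeout, Default) ->
        receive
            Msg -> Msg
        after Timeout ->
            Default
        end.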

The second micro benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4 and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on the left shows that (1) the performance of time management has remained roughly unchanged between Erlang/OTP releases prior to 18.0; (2) the improved time management in Erlang/OTP 18.x makes time management less likely to be a scalability bottleneck, even when using erlang:now/0; and (3) the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution. The graph on the right side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

⁵See §2.5.4 of the RELEASE project Deliverable 2.4 (http://release-project.eu/documents/D24.pdf).

Figure 20: BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases: runtimes (left) and speedup obtained using erlang:monotonic_time/0 (right).

VII Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable, to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling and debugging Erlang systems, which can be used separately or together as appropriate for solving the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on or specialise the services offered by the Erlang virtual machine through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer⁶ application gives a comprehensive overview of many of these data on a node-by-node basis.

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature; that is, the tools support the language itself, rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use the tracing support provided by the Erlang VM, so other actor frameworks, e.g. Akka, can use the Kamon⁷ JVM monitoring system. Similarly, tools for other actor languages or frameworks could use data derived through probes of the OS-level tracing frameworks DTrace⁸ and SystemTap⁹, as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

⁶http://www.erlang.org/doc/apps/observer
⁷http://kamon.io

i Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare it for modification or extension, or, as is the case here, in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support in a refactoring tool. There are a number of tools that support refactoring in Erlang; in the RELEASE project we have chosen to extend Wrangler¹⁰ [56], while other tools include Tidier [68] and RefactorErl [44].

Supporting API Migration. The SD Erlang libraries modify Erlang's global_group library, which becomes the new s_group library; as a result, Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolution often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically. The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work, and the technicalities of the implementation, can be found in a paper by [52].

⁸http://dtrace.org/blogs/about
⁹https://sourceware.org/systemtap/wiki
¹⁰http://www.cs.kent.ac.uk/projects/wrangler
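As an illustration, an adaptor clause for one global_group function might take the following shape (a sketch: the module name, the clause and the s_group argument convention are assumptions for illustration, not the shipped adaptor):

    %% Old API (global_group) defined in terms of the new (s_group).
    -module(global_group_adaptor).
    -export([registered_names/1]).

    registered_names({group, GroupName}) ->
        s_group:registered_names(GroupName).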

Support for Introducing Parallelism. We have introduced support for parallelising explicit list operations (map and foreach), for process introduction to complete a computationally intensive task in parallel, for introducing a worker process to deal with call handling in an Erlang "generic server", and for parallelising a tail-recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach. The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward; even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
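A parallel map of the kind provided by para_lib can be sketched in a few lines (an illustrative implementation, not the library's actual code):

    -module(para_sketch).
    -export([pmap/2]).

    %% Apply F to each element of Xs in a fresh process and collect
    %% the results in order.
    pmap(F, Xs) ->
        Parent = self(),
        Refs = [begin
                    Ref = make_ref(),
                    spawn_link(fun() -> Parent ! {Ref, F(X)} end),
                    Ref
                end || X <- Xs],
        [receive {Ref, Res} -> Res end || Ref <- Refs].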

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute a task in parallel with its parent process. The result of the new process is sent back to the parent process, which will then consume it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.
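The code shape this refactoring produces is roughly the following (names and the spawned task are illustrative placeholders):

    -module(intro_proc).
    -export([run/0]).

    run() ->
        Parent = self(),
        Ref = make_ref(),
        %% The new process computes the task in parallel with its parent.
        spawn_link(fun() -> Parent ! {Ref, heavy_task()} end),
        Other = other_work(),                    % not blocked meanwhile
        %% The receive sits immediately before the result is needed.
        Result = receive {Ref, R} -> R end,
        {Other, Result}.

    %% Illustrative placeholders for the original computations.
    heavy_task() -> lists:sum(lists:seq(1, 1000000)).
    other_work() -> ok.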

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies; for instance, a function might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are established by program slicing, which is discussed below.

Support for Program Slicing. Program slicing is a general technique of program analysis for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE¹¹ [17], a tool developed in another EU project that also re-uses the Wrangler front end. This work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing in transforming programs to make them suitable for the introduction of parallel structures.

¹¹http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf


Figure 21: The architecture of WombatOAM. The Master provides northbound REST interfaces to the Wombat web dashboard and CLI, and delegates to Middle Managers, which communicate with the agents (metrics, notifications, alarms, topology and orchestration) on the managed Erlang nodes.

ii Scalable Deployment

We have developed the WombatOAM tool¹² to provide a deployment, operations and maintenance framework for large-scale Erlang distributed systems. These systems typically consist of a number of Erlang nodes executing on different hosts. These hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP.

¹²WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd.

Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this is the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture. The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability because of the role played by the central Master node; in its current version an additional layer of Middle Managers was introduced to allow the system to scale easily to thousands of deployed nodes. As the diagram shows, the "northbound" interfaces to the web dashboard and the command-line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms and so forth; we describe those now.

Services. WombatOAM is designed to collect, store and display various kinds of information and events from running Erlang systems; these data are accessed and managed through an AJAX-based Web Dashboard, and include the following.

Metrics. WombatOAM supports the collection of some hundred metrics (including, for instance, the number of processes on a node and the message traffic in and out of the node) on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom¹³, and interface with other tools, such as graphite¹⁴, which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications. As well as metrics, it can support event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which is part of the standard distribution, or the lager logging framework¹⁵, and will be displayed and logged as they occur.

¹³Folsom collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom
¹⁴https://graphiteapp.org

Alarms. Alarms are more complex entities. Alarms have a state: they can be raised, and once dealt with they can be cleared. They also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as for notifications.

Topology. The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible and, if not, it notifies the other services and periodically tries to reconnect. When the nodes are available again, it also notifies the other services. It doesn't have a middle manager part, because it doesn't talk to the nodes directly; instead, it asks the Node manager service to do so.

Node manager. This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration. This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers, the Orchestration service uses an external library called Libcloud¹⁶, for which Erlang Solutions has written an open source Erlang wrapper called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications; it provides infrastructure for deploying them.

¹⁵https://github.com/basho/lager

Deployment. The mechanism consists of the following five steps.

1. Registering a provider. WombatOAM provides the same interface for different cloud providers that support the OpenStack standard or the Amazon EC2 API. WombatOAM also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this enables WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

¹⁶The unified cloud API [4]: https://libcloud.apache.org

4. Defining a deployment domain. At this step a deployment domain is created that specifies (i) which providers should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to or removed from the system, depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single-host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue, and processes migrate between the queues through a work stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes¹⁷.

iii.1 Percept2

Percept2¹⁸ builds on the existing Percept tool to provide post hoc offline analysis and visualisation of Erlang systems. Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling for a distributed Erlang application using Percept2 is shown in Figure 22.

¹⁷This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.
¹⁸https://github.com/RefactoringTools/percept2

Figure 22: Offline profiling of a distributed Erlang application using Percept2.
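A typical offline session follows the Percept workflow of trace collection, analysis and web-based viewing. The sketch below uses the documented Percept entry points, which Percept2 mirrors under the percept2 module (its exact signatures may differ slightly); the profiled module, function and file name are hypothetical:

    %% Collect a trace while running an entry function, analyse it
    %% into the RAM database, then browse via the web interface.
    percept:profile("orbit.dat", {orbit, run, []}, [procs]),
    percept:analyze("orbit.dat"),
    percept:start_webserver(8888).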

Percept generates an application-level zoomable concurrency graph, showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways, as detailed by [53], including most importantly:

• Distinguishing between running and runnable time for each process; this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including the migration history of a process between run queues; statistics about message passing between processes (the number of messages and the average message size sent/received by a process); the accumulated runtime per process; and the accumulated time when a process is in a running state.

• Presenting the process tree: the hierarchy structure indicating the parent-child relationships between processes.

Figure 23: Percept2 showing processes running and runnable (with 4 schedulers/run queues).

• Recording dynamic function callgraph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system. Unlike global_group, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have also extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating measurement of a function's own execution time gives users the freedom of not profiling all the function calls invoked during the program execution. For example, they can choose to profile only functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph, a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of generated web pages. Together these make the system more responsive and more scalable.

iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, and require neither specially recompiled versions of the software to be examined, nor special post-processing tools to create meaningful information from the data gathered. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the sizes of run queues vary during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the same storage infrastructure and presentation facilities.

Figure 24: DTrace run-queue visualisation of the bang benchmark with 16 schedulers (run queues RunQueue0–RunQueue15; time on the x-axis).

iii.3 Devo

Devo¹⁹ [10] is designed to provide real-time online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).

Figure 25 shows visualisations from Devo in both modes. On the left-hand side a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram) with six cores each; with hyperthreading this gives twelve virtual cores per chip, and hence 24 run queues in total. The size of these run queues is shown by both the colour and height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle. A green arc shows migration on the same physical core, a grey one on the same chip, and blue shows migrations between the two chips.

¹⁹https://github.com/RefactoringTools/devo


Figure 25 Devo low- and high-level visualisations of SD-Orbit

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between the nodes (green quiescent; red busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. Tracing to be performed on monitored nodes is also specified within the configuration file.
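The concrete file format is defined by the tool itself; purely as an illustration of the kind of information involved (hosts, nodes, s_group partitions, and the tracing to apply), a hypothetical configuration might contain Erlang terms such as:

    %% Hypothetical SD-Mon configuration terms; the term names below
    %% are illustrative, not SD-Mon's actual schema.
    {hosts,    ["host1.example.org", "host2.example.org"]}.
    {nodes,    ['node1@host1', 'node2@host1', 'node3@host2']}.
    {s_groups, [{group1, ['node1@host1', 'node2@host1']},
                {group2, ['node2@host1', 'node3@host2']}]}.
    {tracing,  [s_group, message]}.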

An agent is started by a master SD-Mon node for each s_group and for each free node. Configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up, it tries to get in contact with its nodes and applies the tracing to them, as stated by the master. Binary files are stored in the host file system. Tracing is internally used in order to track s_group operations happening at runtime; an asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is deleted, another agent takes over, as shown in Figure 27.


Figure 26 SD-Mon architecture

When an agent is stopped, all traces on the controlled nodes are switched off.

The monitoring network is also supervised in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected then that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27 SD-Mon evolution: before (top) and after eliminating s_group 2 (bottom)


Figure 28 SD-Mon online monitoring

VIII Systemic Evaluation

Preceding sections have investigated the improvements of individual aspects of an Erlang system, eg ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI and VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state of the art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable²⁰. The bulk of the experiments reported here are conducted on the Athos cluster, using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed size computation, the scaling measure is relative speedup (or strong scaling), ie speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure.
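For reference, these are the standard definitions (not specific to this article): with T(1) the runtime on a single core and T(c) the runtime on c cores,

    relative speedup  S(c) = T(1) / T(c)      (strong scaling: fixed input)

whereas a weak-scaling experiment grows the input in proportion to the compute resources and reports the, ideally constant, runtime.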

²⁰See Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf


Figure 29 Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling]: relative speedup against number of nodes (cores), for D-Orbit and SD-Orbit with W=2M and W=5M

The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot standard deviation. Results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups: beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails because some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (ie 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5000 nodes are deployed one per core (ie 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.


Figure 30 Wombat deployment time: time (sec) against number of nodes (N); experimental data with best fit 0.0125N + 83

The measurement data shows two important facts: WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, eg those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, whereas SR-ACO registers them only within an s_group (Section ii).

For both GR-ACO and SR-ACO, a Chaos Monkey runs on every Erlang node, ie master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common X86/Ubuntu platform, typical Erlang process recovery times are around 0.3ms, so around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark, and obtained similar results [23].
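The essence of such a per-node Chaos Monkey fits in a few lines of Erlang. This is our illustrative sketch, in the style of [14], not the project's actual service:

    %% Kill one randomly chosen local process every second.
    %% (A real service would filter out system processes; the rand
    %% module is available from Erlang/OTP 18.0.)
    chaos_monkey() ->
        timer:sleep(1000),
        Procs = erlang:processes(),
        Victim = lists:nth(rand:uniform(length(Procs)), Procs),
        exit(Victim, kill),
        chaos_monkey().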

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.


Figure 31 ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling]: execution time (s) against number of nodes (cores), for GR-ACO, SR-ACO and ML-ACO


Scalability. Figure 31 compares the runtimes of the ML, GR and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and a fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks, and others, that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.


Figure 32 Number of sent packets in ML-ACO, GR-ACO and SR-ACO


As SD-Orbit scales better than D-Orbit, as SR-ACO scales better than ML-ACO, and as SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limits of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, ie a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms, like Erlang or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language, while preserving reliability. The work is a vade mecum for addressing scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering VM, language and persistent storage levels. We have developed the BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15] or Legion [40]. We establish scientifically the folklore limitation of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, eg in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups, for partitioning the network and global process namespace, and semi-explicit process placement, for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).

To improve the scalability of the Erlang VM and libraries we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60 node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, ie a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).


Table 2 Cluster Specifications (RAM per host in GB)

Name   | Hosts | Cores per host | Cores total | Cores max avail | Processor                                | RAM    | Interconnection
GPG    | 20    | 16             | 320         | 320             | Intel Xeon E5-2640v2, 8C, 2GHz           | 64     | 10GB Ethernet
Kalkyl | 348   | 8              | 2784        | 1408            | Intel Xeon 5520v2, 4C, 2.26GHz           | 24–72  | InfiniBand 20Gb/s
TinTin | 160   | 16             | 2560        | 2240            | AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz | 64–128 | 2:1 oversubscribed QDR Infiniband
Athos  | 776   | 24             | 18624       | 6144            | Intel Xeon E5-2697v2, 12C, 2.7GHz        | 64     | Infiniband FDR14


In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages²¹. For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3 GHz, each with 16 "Bulldozer" cores, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

²¹See Deliverable 6.7, available online: http://www.release-project.eu/documents/D67.pdf


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level Visualization of Concurrent and Distributed Computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riakdocs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking/.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Koszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go programming language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Technical Report Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 11–11, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. In ACM Sigplan Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.

[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lübeck and Max Neunhöffer. Enumerating large Orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897/.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing - 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale (3. ed., revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-Strength Functional programming: Experiences with the Ericsson AXD301 Project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.


(grs, fgs, fhs, nds) ∈ state ≡ ({s_group}, {free_group}, {free_hidden_group}, {node})

gr ∈ grs ≡ {s_group} ≡ {(s_group_name, {node_id}, namespace)}

fg ∈ fgs ≡ {free_group} ≡ {({node_id}, namespace)}

fh ∈ fhs ≡ {free_hidden_group} ≡ {(node_id, namespace)}

nd ∈ nds ≡ {node} ≡ {(node_id, node_type, connections, gr_names)}

gs ∈ gr_names ≡ NoGroup | {s_group_name}

ns ∈ namespace ≡ {(name, pid)}

cs ∈ connections ≡ {node_id}

nt ∈ node_type ≡ Normal | Hidden

s ∈ NoGroup | s_group_name

Figure 14 SD Erlang state [21]

registered in s_group s if node ni belongs to the s_group, and an empty list otherwise (Figure 15). Here ⊕ denotes disjoint set union, IsSGroupNode returns true if node ni is a member of some s_group s and false otherwise, and OutputNms returns a set of process names registered in the ns namespace of s_group s.
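As a concrete reading of this rule, an executable transcription in Erlang, with lists standing in for sets as in the validation work below, might look as follows (our illustration, not the project's validated specification):

    %% registered_names(State, S, Ni): the names registered in s_group S,
    %% as {S, Name} pairs, provided node Ni belongs to S; otherwise [].
    registered_names({Grs, _Fgs, _Fhs, _Nds}, S, Ni) ->
        case lists:keyfind(S, 1, Grs) of
            {S, NodeIds, Ns} ->
                case lists:member(Ni, NodeIds) of
                    true  -> [{S, Nm} || {Nm, _Pid} <- Ns];  % OutputNms(s, ns)
                    false -> []
                end;
            false ->
                []   % IsSGroupNode(ni, s, grs) does not hold
        end.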

iii Semantics Validation

As the semantics is concrete, it can readily be made executable in Erlang, with lists replacing sets throughout.

Figure 16 Testing s_groups using QuickCheck

Having an executable semantics allows users to engage with it, and to understand how the semantics behaves vis à vis the library, giving them an opportunity to assess the correctness of the library against the semantics.

Better still, we can automatically assess how the system behaves in comparison with the (executable) semantics by executing them in lockstep, guided by the constraints of which operations are possible at each point. We do that by building an abstract state machine model of the library. We can then generate random sequences (or traces) through the model, with appropriate library data generated too. This random generation is supported by the QuickCheck property-based testing system [24, 9].

The architecture of the testing framework is shown in Figure 16. First, an abstract state machine, embedded as an "eqc_statem" module, is derived from the executable semantic specification. The state machine defines the abstract state representation and the transition from one state to another when an operation is applied. Test case and data generators are then defined to control the test case generation.


((grs, fgs, fhs, nds), s_group:registered_names(s), ni)
    → ((grs, fgs, fhs, nds), nms)    if IsSGroupNode(ni, s, grs)
    → ((grs, fgs, fhs, nds), {})     otherwise

where {(s, {ni} ⊕ nis, ns)} ⊕ grs′ ≡ grs
      nms ≡ OutputNms(s, ns)

IsSGroupNode(ni, s, grs) = ∃ nis, ns, grs′ . {(s, {ni} ⊕ nis, ns)} ⊕ grs′ ≡ grs

OutputNms(s, ns) = {(s, nm) | (nm, p) ∈ ns}

Figure 15 SD Erlang semantics of the s_group:registered_names/1 function

This includes the automatic generation of eligible s_group operations and the input data to those operations. Test oracles are encoded as the postcondition for s_group operations.

During testing, each test command is applied to both the abstract model and the s_group library. The application of the test command to the abstract model takes the abstract model from its current state to a new state, as described by the transition functions, whereas the application of the test command to the library leads the system to a new actual state. The actual state information is collected from each node in the distributed system, then merged and normalised to the same format as the abstract state representation. For a test to be successful, after the execution of a test command the test oracles specified for this command should be satisfied. Various test oracles can be defined for s_group operations; for instance, one of the generic constraints that applies to all the s_group operations is that, after each s_group operation, the normalised system state should be equivalent to the abstract state.

Thousands of tests were run, and three kinds of errors, all subsequently corrected, were found. Some errors in the library implementation were found, including one error due to the synchronisation between nodes, and another related to the remove_nodes operation, which erroneously raised an exception. We also found a couple of trivial errors in the semantic specification itself, which had been missed by manual examination. Finally, we found some situations where there were inconsistencies between the semantics and the library implementation despite their states being equivalent; an example of this was in the particular values returned by functions on certain errors. Overall, the automation of testing boosted our confidence in the correctness of the library implementation and the semantic specification. This work is reported in more detail by [54].

VI Improving VM Scalability

This section reports the primary VM and library improvements that we have designed and implemented to address the scalability and reliability issues identified in Section i.


i Improvements to Erlang Term Storage

Because ETS tables are so heavily used in Erlang systems, they are a focus for scalability improvements. We start by describing their redesign, including some improvements that pre-date our RELEASE project work, ie those prior to Erlang/OTP R15B03. These historical improvements are very relevant for a scalability study, and form the basis for our subsequent changes and improvements. At the point when Erlang/OTP got support for multiple cores (in release R11B), there was a single reader-writer lock for each ETS table. Optional fine-grained locking of hash-based ETS tables (ie set, bag or duplicate_bag tables) was introduced in Erlang/OTP R13B02-1, adding 16 reader-writer locks for the hash buckets. Reader groups, to minimise read synchronisation overheads, were introduced in Erlang/OTP R14B. The key observation is that a single count of the multiple readers must be synchronised across many cache lines, potentially far away in a NUMA system. Maintaining reader counts in multiple (local) caches makes reads fast, although writes must now check every reader count. In Erlang/OTP R16B the number of bucket locks and the default number of reader groups were both upgraded from 16 to 64.

We illustrate the scaling properties of the ETS concurrency options using the ets_bench BenchErl benchmark on the Intel NUMA machine with 32 hyperthreaded cores specified in Appendix A. The ets_bench benchmark inserts 1M items into the table, then records the time to perform 17M operations, where an operation is either a lookup, an insert or a delete. The experiments are conducted on a hash-based (set) ETS table with different percentages of update operations, ie insertions or deletions.
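The locking behaviours discussed above are controlled through standard, documented ets:new/2 options; for example:

    %% A hash-based (set) table with per-bucket write locks and the
    %% reader-groups read optimisation enabled.
    Tab = ets:new(bench_table,
                  [set, public,
                   {write_concurrency, true},   % fine-grained bucket locks
                   {read_concurrency,  true}]). % reader groups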

Figure 17 shows the runtimes in seconds of 17M operations in different Erlang/OTP versions, varying the number of schedulers (x-axis), reflecting how the scalability of ETS tables has improved in more recent Erlang/OTP releases.

Figure 17 Runtime of 17M ETS operations in Erlang/OTP releases R11B-5, R13B02-1, R14B and R16B: 10% updates (left) and 1% updates (right). The y-axis shows time in seconds (lower is better), and the x-axis is the number of OS threads (VM schedulers).

Figure 18 shows the runtimes in seconds of 17M operations on an ETS table with different numbers of reader groups, again varying the number of schedulers. We see that one reader group is not sufficient with 10% updates, nor are two with 1% updates. Beyond that, different numbers of reader groups have little impact on the benchmark performance, except that using 64 groups with 10% updates slightly degrades performance.

We have explored four other extensions or redesigns in the ETS implementation for better scalability: (1) allowing more programmer control over the number of bucket locks in hash-based tables, so the programmer can reflect the number of schedulers and the expected access pattern; (2) using contention-adapting trees to get better scalability for ordered_set ETS tables, as described by [69]; (3) using queue delegation locking to improve scalability [47]; and (4) adopting schemes for completely eliminating the locks in the meta table. A more complete discussion of our work on ETS can be found in the papers by [69] and [47].


Figure 18 Runtime of 17M ETS operations with varying numbers of reader groups (1–64): 10% updates (left) and 1% updates (right). The y-axis shows runtime in seconds (lower is better), and the x-axis is the number of schedulers.


Here we outline only our work on contention-adapting (CA) trees. A CA tree monitors contention in different parts of a tree-shaped data structure, introducing routing nodes with locks in response to high contention, and removing them in response to low contention. For experimental purposes, two variants of the CA tree have been implemented to represent ordered_sets in the virtual machine of Erlang/OTP 17.0.

Figure 19 Throughput of the CA tree variants (operations/microsecond against number of threads), compared with set and ordered_set: 10% updates (left) and 1% updates (right)

One extends the existing AVL trees in the Erlang VM, and the other uses a Treap data structure [5]. Figure 19 compares the throughput of the CA tree variants with that of ordered_set and set as the number of schedulers increases. It is unsurprising that the CA trees scale so much better than an ordered_set, which is protected by a single readers-writer lock. It is more surprising that they also scale better than set. This is due to hash tables using fine-grained locking at a fixed granularity, while CA trees can adapt the number of locks both to the current contention level and to the parts of the key range where contention is occurring.


ii Improvements to Schedulers

In the Erlang VM, a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until it either blocks waiting for input, typically a message from some other process, or until it has executed its quota of reductions.

The Erlang VM is usually started with one scheduler per logical core (SMT-thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned it is placed in the run queue of the scheduler of its parent, and it waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores, that is, an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Figure 24 (Section iii.2).

The default load management mechanism is load compaction, which aims to keep as many scheduler threads as possible fully loaded with work, ie it attempts to ensure that scheduler threads do not run out of work. We have developed a new optional scheduler utilisation balancing mechanism, available from Erlang/OTP 17.0 and enabled with the +sub true emulator flag. The new mechanism aims to balance scheduler utilisation between schedulers; that is, it strives for equal scheduler utilisation on all schedulers.

The scheduler utilisation balancing mechanism has no performance impact on the system when not enabled. On the other hand, when enabled, it results in changed timing in the system: normally there is a small overhead, due to measuring utilisation and calculating balancing information, which depends on the underlying primitives provided by the operating system.

The new balancing mechanism results in a better distribution of processes to schedulers, reducing the probability of core contention. Together with other VM improvements, such as interruptable BIFs and garbage collection, it results in lower latency and improved responsiveness, and hence reliability, for soft real-time applications.

iii Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel benchmark in Section i. The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in, which returns "Erlang system time", ie time since Epoch, with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, ie not SMT-enabled. The documentation promises that values returned by it are strictly increasing, and many applications ended up relying on this. For example, applications often employ erlang:now/0 to generate unique integers.
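For reference, the replacement API introduced in Erlang/OTP 18.0 separates these distinct uses of erlang:now/0 into individually scalable built-ins:

    %% Scalable replacements for common erlang:now/0 idioms.
    T  = erlang:monotonic_time(),       % measuring elapsed time
    S  = erlang:system_time(),          % wall-clock (Erlang system) time
    Id = erlang:unique_integer([positive, monotonic]).  % unique increasing values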

Erlang system time should align with the operating system's view of time since Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot, without invalidating the strictly increasing value guarantee. The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when the two have deviated, eg in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], ie a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards or backwards, while system time is allowed to do this. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.
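This relationship is directly observable in the new API: system time is monotonic time plus the current, dynamically varying, offset.

    %% All values are in native time unit; the sum corresponds to
    %% erlang:system_time(), up to the offset changing between reads.
    Mono   = erlang:monotonic_time(),
    Offset = erlang:time_offset(),
    SystemTime = Mono + Offset.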

Time Retrieval. Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, monotonic time delivered by the operating system is solely based on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP); that is, they run with different frequencies. Linux is an exception, with a monotonic clock that is NTP-adjusted and runs with the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing monotonic time and system time delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute. If the adjustment needs to be changed, new adjustment information is published and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the time adjustment information previously published, and calculates Erlang monotonic time. To preserve monotonicity, it is important that all threads that read the same OS monotonic time map it to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information, using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed. This means that in the vast majority of cases the RW lock will be read-locked, which allows multiple readers to run concurrently. To prevent bouncing the lock cache-line, we use a bespoke reader-optimised RW lock implementation, where reader threads notify about their presence in counters on separate cache-lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11]; alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].

Timer Wheel and BIF Timer. The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel, which is used by processes executing on that scheduler.

The implementation of timers in Erlang/OTP uses a built-in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS tables, and only insert one timer at a time into the timer wheel.

Benchmarks. We have measured several benchmarks on a 16-core Bulldozer machine with eight dual-CPU AMD Opteron 4376 HEs⁵.

We present three of them here.

The first micro benchmark compares the execution time of an Erlang receive with that of a receive after, which specifies a timeout and provides a default value. The receive after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, the total execution time with the optimised timers is only 5% longer than without timers.
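The two code shapes the benchmark compares are shown below; it is the after clause that sets a timer on blocking and cancels it on message arrival:

    plain_receive() ->
        receive Msg -> Msg end.

    %% The 'after' clause sets a timer when the process blocks, and
    %% cancels it when a message arrives before the timeout.
    receive_with_timeout() ->
        receive Msg -> Msg after 1000 -> default end.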

The second micro benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4 and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on the left shows that: (1) the performance of time management remained roughly unchanged between Erlang/OTP releases prior to 18.0; (2) the improved time management in Erlang/OTP 18.x makes time management less likely to be a scalability bottleneck, even when using erlang:now/0; and (3) the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution.

5See sect254 of the RELEASE project Deliverable 24(httprelease-projecteudocumentsD24pdf)

0

20000

40000

60000

80000

100000

120000

140000

160000

180000

0 10 20 30 40 50 60 70

time

(m

s)

Schedulers

15B01 (now()) [70016640]174 (now()) [70016640]

181 (monotonic_time()) [70016640]181 (now()) [70016640]

0

5

10

15

20

25

30

0 10 20 30 40 50 60 70

Sp

ee

du

p

Schedulers

181 (monotonic_time()) [70016640]

Figure 20 BenchErl parallel benchmark us-ing erlangmonotonic_time0 orerlangnow0 in different ErlangOTPreleases runtimes (left) and speedup obtainedusing erlangmonotonic_time0 (right)

likely to be a scalability bottleneck even whenusing erlangnow0 and 3 the new time API(using erlangmonotonic_time0 and friends)provides a scalable solution The graph on theright side of Figure 20 shows the speedup thatthe modified version of the parallel benchmarkachieves in ErlangOTP 181

VII Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable,


to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool “ecosystem” consists of small stand-alone tools for tracing, profiling and debugging Erlang systems that can be used separately or together, as appropriate for solving the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on or specialise the services offered by the Erlang virtual machine through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer6 application gives a comprehensive overview of many of these data on a node-by-node basis.

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature; that is, the tools support the language itself rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use the tracing support provided by the Erlang VM, so other actor frameworks, e.g. Akka, can use the Kamon7 JVM monitoring system. Similarly, tools for other actor languages or frameworks could use data derived through OS-level

6http://www.erlang.org/doc/apps/observer
7http://kamon.io

tracing frameworks, DTrace8 and SystemTap9 probes, as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

i Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare it for modification or extension, or — as is the case here — in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support in a refactoring tool. There are a number of tools that support refactoring in Erlang; in the RELEASE project we have chosen to extend Wrangler10 [56]; other tools include Tidier [68] and RefactorErl [44].

Supporting API Migration. The SD Erlang libraries modify Erlang's global_group library, which becomes the new s_group library; as a result, Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolves and this often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be

8httpdtraceorgblogsabout9httpssourcewareorgsystemtapwiki

10httpwwwcskentacukprojectswrangler


supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically. The refactoring works by generating a set of rules that “fold in” the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work, and the technicalities of the implementation, can be found in a paper by [52].
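For illustration, an adaptor definition might take the following shape; the module name, the function pair, and the default group argument here are all hypothetical, chosen only to show the idea:

    %% Hypothetical adaptor module: each function defines a call to the
    %% old API in terms of the new one. From definitions of this shape,
    %% Wrangler generates the refactoring that rewrites client code to
    %% call the new API directly.
    -module(global_group_adaptor).
    -export([whereis_name/1]).

    %% Old API: global_group:whereis_name(Name)
    %% New API: s_group:whereis_name(SGroupName, Name)
    whereis_name(Name) ->
        s_group:whereis_name(default_group, Name).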

Support for Introducing Parallelism. We have introduced support for parallelising explicit list operations (map and foreach), for process introduction to complete a computationally intensive task in parallel, for introducing a worker process to deal with call handling in an Erlang “generic server”, and for parallelising a tail-recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach. The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward; even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
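For example, the explicit map transformation has the following shape (assuming para_lib provides a pmap/2 mirroring lists:map/2; the exact names exported by para_lib may differ, and process_task/1 is a placeholder):

    %% Before the refactoring: sequential list processing.
    Results0 = lists:map(fun(T) -> process_task(T) end, Tasks),

    %% After the refactoring: the same computation, with the list
    %% elements processed in parallel.
    Results1 = para_lib:pmap(fun(T) -> process_task(T) end, Tasks)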

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute

a task in parallel with its parent process. The result of the new process is sent back to the parent process, which will then consume it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.
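The shape of the transformed code is sketched below; this is a hand-written illustration of the pattern, with placeholder task functions, rather than Wrangler's literal output:

    compute(Input) ->
        Parent = self(),
        %% Spawn a new process for the independent, expensive task ...
        Worker = spawn(fun() -> Parent ! {self(), heavy_task(Input)} end),
        %% ... carry on with computations that do not need its result ...
        Other = independent_work(),
        %% ... and only receive the result immediately before the point
        %% where it is needed, so unrelated work is not blocked.
        receive
            {Worker, Result} -> combine(Other, Result)
        end.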

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies: for instance, a function might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation it is possible to “float out” some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are established by program slicing, which is discussed below.

Support for Program Slicing. Program slicing is a general technique of program analysis for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE11 [17], a tool developed in another EU project that also re-uses the Wrangler front end. This work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing in transforming programs to make them suitable for the introduction of parallel structures.

11http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf


Figure 21 The architecture of WombatOAM

ii Scalable Deployment

We have developed the WombatOAM tool12 to provide a deployment, operations and maintenance framework for large-scale distributed Erlang systems. These systems typically consist of a number of Erlang nodes executing on different hosts. These hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP.

12WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd.

Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this is the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture. The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability because of the role played by the central Master node; in its current


version, an additional layer of Middle Managers has been introduced to allow the system to scale easily to thousands of deployed nodes. As the diagram shows, the “northbound” interfaces to the web dashboard and the command line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms, and so forth; we describe these now.

Services. WombatOAM is designed to collect, store and display various kinds of information and events from running Erlang systems, and these data are accessed and managed through an AJAX-based Web Dashboard. They include the following:

Metrics. WombatOAM supports the collection of some hundred metrics — including, for instance, the number of processes on a node and the message traffic in and out of the node — on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom13, and interface with other tools, such as graphite14, which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications. As well as metrics, it can support event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which is part of the standard distri-

13Folsom collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom
14https://graphiteapp.org

bution, or the lager logging framework15, and will be displayed and logged as they occur.

Alarms. Alarms are more complex entities. Alarms have a state: they can be raised, and, once dealt with, they can be cleared. They also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as for notifications.

Topology. The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible; if not, it notifies the other services and periodically tries to reconnect. When the nodes are available again it also notifies the other services. It doesn't have a middle manager part because it doesn't talk to the nodes directly; instead it asks the Node manager service to do so.

Node manager. This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration. This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers the Orchestration

15https://github.com/basho/lager


service uses an external library called Libcloud16, for which Erlang Solutions has written an open source Erlang wrapper, called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications; it provides infrastructure for deploying them.

Deployment. The mechanism consists of the following five steps.

1. Registering a provider. WombatOAM provides the same interface for different cloud providers that support the OpenStack standard or the Amazon EC2 API. WombatOAM also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this enables WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

16The unified cloud API [4]: https://libcloud.apache.org

4. Defining a deployment domain. At this step a deployment domain is created that specifies (i) which providers should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to or removed from the system, depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and in adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single-host (“multicore”) and multi-host (“distributed”) nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue and processes migrate between the queues through a work stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes17.

iii.1 Percept2

Percept218 builds on the existing Percept tool to provide post hoc offline analysis and

17This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.

18https://github.com/RefactoringTools/percept2


Figure 22 Offline profiling of a distributed Erlang application using Percept2

visualisation of Erlang systems. Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling for a distributed Erlang application using Percept2 is shown in Figure 22.
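A typical session therefore has three stages, sketched below on the assumption that Percept2 keeps Percept's profile/analyze/start_webserver interface; the module under test and the option list are illustrative:

    %% 1. Run the program under profiling, writing events to a file.
    percept2:profile("aco.dat", {aco, run, []}, [all]),

    %% 2. Analyse the trace file(s) into the RAM database; several
    %%    files can be processed at the same time.
    percept2:analyze(["aco.dat"]),

    %% 3. Inspect the results in a browser via the built-in web server.
    percept2:start_webserver(8888)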

Percept generates an application-level zoomable concurrency graph, showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways — as detailed by [53] — including, most importantly:

• Distinguishing between running and runnable time for each process: this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including: the migration history of a process between run queues; statistics about message passing between processes (the number of messages, and the average message size, sent/received by a process); the accumulated runtime per process; and the accumulated time when a process is in a running state.

• Presenting the process tree: the hierarchy


Figure 23 Percept2 showing processes running and runnable (with 4 schedulers/run queues)

structure indicating the parent-child relationships between processes.

• Recording dynamic function callgraph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system. Unlike global groups, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the “proc” flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating measurement of a function's own execution time gives users the freedom of not profiling all the function calls invoked during the program execution. For example, they can choose to profile only functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph, a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of


generated web pages. Together these make the system more responsive and more scalable.

iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, and require neither specially recompiled versions of the software to be examined nor special post-processing tools to create meaningful information from the data gathered. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the sizes of run queues vary during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the same storage infrastructure and presentation facilities.

Figure 24 DTrace run-queue visualisation of the bang benchmark with 16 schedulers (time on X-axis)

iii.3 Devo

Devo19 [10] is designed to provide real-time online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).
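For example, a minimal ttb session of the kind such instrumentation builds on might look as follows (the actual trace flags Devo sets are not shown here):

    %% Start tracers on all connected nodes, logging to file.
    {ok, _} = ttb:tracer([node() | nodes()], [{file, "devo_trace"}]),

    %% Trace scheduling and message-send events for all processes.
    {ok, _} = ttb:p(all, [running, send, timestamp]),

    %% ... run the system under observation, then stop and fetch traces.
    ttb:stop()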

Figure 25 shows visualisations from devo in both modes. On the left-hand side a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram) with six cores each; with hyperthreading this gives twelve virtual cores, and hence 24 run queues in total. The size of these run queues is shown by both the colour and height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle.

19https://github.com/RefactoringTools/devo


Figure 25 Devo low- and high-level visualisations of SD-Orbit

A green arc shows migration on the same physical core, a grey one migration on the same chip, and a blue arc shows migration between the two chips.

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes

represents the current intensity of communication between the nodes (green quiescent; red busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. The tracing to be performed on monitored nodes is also specified within the configuration file.

An agent is started by a master SD-Mon node for each s_group and for each free node. The configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up it tries to get in contact with its nodes and applies the tracing to them as stated by the master. Binary files are stored in the host file system. Tracing is used internally in order to track s_group operations happening at runtime; an asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is


Figure 26 SD-Mon architecture

deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

The monitoring network is also supervised in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected then that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as for post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27 SD-Mon evolution: before (top) and after eliminating s_group 2 (bottom)


Figure 28 SD-Mon online monitoring

VIII Systemic Evaluation

Preceding sections have investigated improvements to individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI and VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state-of-the-art NUMA architectures and

the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable20. The bulk of the experiments reported here are conducted on the Athos cluster, using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed-size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure. The benchmarks

20See Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf


Figure 29 Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling]

also evaluate different aspects of s_groups: Orbit evaluates the scalability impact of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot the standard deviation. The results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups: beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails, because some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster, is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5,000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.


Figure 30 Wombat deployment time: experimental data and best fit 0.0125N + 83

The measurement data shows two important facts: WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of the effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration-file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, while SR-ACO registers them only within an s_group (Section ii).
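The difference between the two versions amounts to where each critical name is registered. Sketched below, assuming an s_group registration function that takes the group name as an extra first argument; the group and process names are invented for illustration:

    %% GR-ACO: the name is replicated on, and synchronised across,
    %% all connected nodes.
    yes = global:register_name(submaster_1, SubMasterPid),

    %% SR-ACO: the name is registered, and kept consistent, only
    %% within the nodes of one s_group.
    yes = s_group:register_name(colony_group_1, submaster_1, SubMasterPid)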

For both GR-ACO and SR-ACO a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common X86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark and obtained similar results [23].
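A per-node Chaos Monkey of this kind can be written in a few lines of Erlang; the sketch below is our illustration rather than the project's implementation, and a production version would avoid killing vital system processes:

    %% Once a second, kill one randomly chosen process on this node.
    chaos_monkey() ->
        Victims = erlang:processes() -- [self()],
        Victim = lists:nth(rand:uniform(length(Victims)), Victims),
        exit(Victim, kill),    %% untrappable kill signal
        timer:sleep(1000),
        chaos_monkey().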

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the


Figure 31 ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling]

distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.

Scalability. Figure 31 compares the runtimes of the ML, GR and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and the fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks and

others that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang


Figure 32 Number of sent packets in ML-ACO, GR-ACO and SR-ACO

preserves the distributed Erlang language-level reliability model.

As SD-Orbit scales better than D-Orbit, SR-ACO scales better than ML-ACO, and SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes respectively, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limit of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups,

and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language, while preserving reliability. The work is a vade mecum for addressing the scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering the VM, language and persistent storage levels. We have developed the


BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15], or Legion [40]. We establish scientifically the folklore limitation of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including the memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private-state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups, for partitioning the network and the global process namespace, and semi-explicit process placement, for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).

To improve the scalability of the Erlang VM and libraries we have improved the implementation of shared ETS tables, time management,

and load balancing between schedulers. Following a systematic analysis of ETS tables, the numbers of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large-scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the


Table 2 Cluster Specifications (RAM per host in GB)

Name   | Hosts | Cores per host | Cores (max total) | Cores (avail) | Processor                                | RAM    | Interconnection
GPG    | 20    | 16             | 320               | 320           | Intel Xeon E5-2640v2, 8C, 2GHz           | 64     | 10GB Ethernet
Kalkyl | 348   | 8              | 2784              | 1408          | Intel Xeon 5520v2, 4C, 2.26GHz           | 24–72  | InfiniBand 20Gb/s
TinTin | 160   | 16             | 2560              | 2240          | AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz | 64–128 | 2:1 oversubscribed QDR Infiniband
Athos  | 776   | 24             | 18624             | 6144          | Intel Xeon E5-2697v2, 12C, 2.7GHz        | 64     | Infiniband FDR14

key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large-scale servers. In addition, pre-

liminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages21. For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3 GHz, each with 16 “Bulldozer” cores, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

21See Deliverable 6.7 (http://www.release-project.eu/documents/D67.pdf), available online.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 “SCIEnce: Symbolic Computing Infrastructure in Europe”, IST-2011-287510 “RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software”, and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 “HPC-GAP: High Performance Computational Algebra and Discrete Mathematics”.

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level Visualization of Concurrent and Distributed Computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riakdocs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Koszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go Programming Language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary, adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 1:1–1:1, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. In ACM SIGPLAN Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak Core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.

[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

47

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

[66] Nikolaos Papaspyrou and KonstantinosSagonas On preserving term sharing inthe Erlang virtual machine In TorbenHoffman and John Hughes editors Pro-ceedings of the 11th ACM SIGPLAN ErlangWorkshop pages 11ndash20 ACM 2012

[67] RELEASE Project Team EU Framework7 Project 287510 (2011ndash2015) 2015 httpwwwrelease-projecteu

[68] Konstantinos Sagonas and ThanassisAvgerinos Automatic refactoring ofErlang programs In Principles and Prac-tice of Declarative Programming (PPDPrsquo09)pages 13ndash24 ACM 2009

[69] Konstantinos Sagonas and Kjell WinbladMore scalable ordered set for ETS usingadaptation In Proc ACM SIGPLAN Work-shop on Erlang pages 3ndash11 ACM Septem-ber 2014

[70] Konstantinos Sagonas and Kjell WinbladContention adapting search trees In Pro-ceedings of the 14th International Sympo-sium on Parallel and Distributed Computingpages 215ndash224 IEEE Computing Society2015

[71] Konstantinos Sagonas and Kjell WinbladEfficient support for range queries andrange updates using contention adaptingsearch trees In Languages and Compilersfor Parallel Computing - 28th InternationalWorkshop volume 9519 of LNCS pages37ndash53 Springer 2016

[72] Marc Snir Steve W Otto D WWalker Jack Dongarra and Steven Huss-Lederman MPI The Complete ReferenceMIT Press Cambridge MA USA 1995

[73] Sriram Srinivasan and Alan MycroftKilim Isolation-typed actors for Java InECOOPrsquo08 pages 104ndash128 Berlin Heidel-berg 2008 Springer-Verlag

[74] Don Syme Adam Granicz and AntonioCisternino Expert F 40 Springer 2015

[75] Simon Thompson and Huiqing Li Refac-toring tools for functional languages Jour-nal of Functional Programming 23(3) 2013

[76] Marcus Voumllker Linux timers May 2014

[77] WhatsApp 2015 httpswwwwhatsappcom

[78] Tom White Hadoop - The Definitive GuideStorage and Analysis at Internet Scale (3 edrevised and updated) OrsquoReilly 2012

[79] Ulf Wiger Industrial-Strength Functionalprogramming Experiences with the Er-icsson AXD301 Project In ImplementingFunctional Languages (IFLrsquo00) Aachen Ger-many September 2000 Springer-Verlag

48

• Introduction
• Context
  • Scalable Reliable Programming Models
  • Actor Languages
  • Erlang's Support for Concurrency
  • Scalability and Reliability in Erlang Systems
  • ETS: Erlang Term Storage
• Benchmarks for scalability and reliability
  • Orbit
  • Ant Colony Optimisation (ACO)
• Erlang Scalability Limits
  • Scaling Erlang on a Single Host
  • Distributed Erlang Scalability
  • Persistent Storage
• Improving Language Scalability
  • SD Erlang Design
    • S_groups
    • Semi-Explicit Placement
  • S_group Semantics
  • Semantics Validation
• Improving VM Scalability
  • Improvements to Erlang Term Storage
  • Improvements to Schedulers
  • Improvements to Time Management
• Scalable Tools
  • Refactoring for Scalability
  • Scalable Deployment
  • Monitoring and Visualisation
    • Percept2
    • DTrace/SystemTap tracing
    • Devo
    • SD-Mon
• Systemic Evaluation
  • Orbit
  • Ant Colony Optimisation (ACO)
  • Evaluation Summary
• Discussion

((grs, fgs, fhs, nds), s_group:registered_names(s), ni)
    −→ ((grs, fgs, fhs, nds), nms)   if IsSGroupNode(ni, s, grs)
    −→ ((grs, fgs, fhs, nds), {})    otherwise

where (s, ni ⊕ nis, ns) ⊕ grs′ ≡ grs
      nms ≡ OutputNms(s, ns)
      IsSGroupNode(ni, s, grs) = ∃ nis, ns, grs′. (s, ni ⊕ nis, ns) ⊕ grs′ ≡ grs
      OutputNms(s, ns) = {(s, nm) | (nm, p) ∈ ns}

Figure 15: SD Erlang semantics of the s_group:registered_names/1 function

This includes the automatic generation of eligible s_group operations and the input data to those operations. Test oracles are encoded as the postcondition for s_group operations.

During testing, each test command is applied to both the abstract model and the s_group library. The application of the test command to the abstract model takes the abstract model from its current state to a new state, as described by the transition functions, whereas the application of the test command to the library leads the system to a new actual state. The actual state information is collected from each node in the distributed system, then merged and normalised to the same format as the abstract state representation. For a test to be successful, after the execution of a test command the test oracles specified for this command should be satisfied. Various test oracles can be defined for s_group operations; for instance, one of the generic constraints that applies to all the s_group operations is that after each s_group operation the normalised system state should be equivalent to the abstract state.

Thousands of tests were run, and three kinds of errors, which have subsequently been corrected, were found. Some errors in the library implementation were found, including one error due to the synchronisation between nodes, and another related to the remove_nodes operation, which erroneously raised an exception. We also found a couple of trivial errors in the semantic specification itself, which had been missed by manual examination. Finally, we found some situations where there were inconsistencies between the semantics and the library implementation, despite their states being equivalent; an example of this was in the particular values returned by functions on certain errors. Overall, the automation of testing boosted our confidence in the correctness of the library implementation and the semantic specification. This work is reported in more detail by [54].
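To make the test-oracle idea concrete, the following sketch shows how such a postcondition might look in a QuickCheck-style (eqc_statem) state-machine specification. The module context and the helper abs_registered_names/2 are our own illustrative assumptions, not the project's actual test code:

    %% Illustrative eqc_statem-style postcondition: after calling
    %% s_group:registered_names/1, the names observed in the real
    %% system must match those predicted by the abstract model.
    postcondition(AbsState, {call, s_group, registered_names, [SGroup]}, Actual) ->
        Expected = abs_registered_names(SGroup, AbsState),  %% hypothetical model lookup
        lists:sort(Actual) =:= lists:sort(Expected).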

VI Improving VM Scalability

This section reports the primary VM and library improvements we have designed and implemented to address the scalability and reliability issues identified in Section i.


i Improvements to Erlang Term Storage

Because ETS tables are so heavily used in Erlang systems, they are a focus for scalability improvements. We start by describing their redesign, including some improvements that pre-date our RELEASE project work, i.e. those prior to Erlang/OTP R15B03. These historical improvements are very relevant for a scalability study, and form the basis for our subsequent changes and improvements. At the point when Erlang/OTP got support for multiple cores (in release R11B), there was a single reader-writer lock for each ETS table. Optional fine grained locking of hash-based ETS tables (i.e. set, bag or duplicate_bag tables) was introduced in Erlang/OTP R13B02-1, adding 16 reader-writer locks for the hash buckets. Reader groups to minimise read synchronisation overheads were introduced in Erlang/OTP R14B. The key observation is that a single count of the multiple readers must be synchronised across many cache lines, potentially far away in a NUMA system. Maintaining reader counts in multiple (local) caches makes reads fast, although writes must now check every reader count. In Erlang/OTP R16B the number of bucket locks and the default number of reader groups were both upgraded from 16 to 64.
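The locking behaviour described above is selected per table at creation time, using standard ets options; a brief illustration (the table name and option mix are ours):

    %% Create a hash-based table with the concurrency optimisations
    %% enabled: read_concurrency selects the reader-group optimisation,
    %% write_concurrency the fine-grained bucket locks.
    Tab = ets:new(my_cache, [set, public,
                             {read_concurrency, true},
                             {write_concurrency, true}]).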

We illustrate the scaling properties of the ETS concurrency options using the ets_bench BenchErl benchmark on the Intel NUMA machine with 32 hyperthreaded cores specified in Appendix A. The ets_bench benchmark inserts 1M items into the table, then records the time to perform 17M operations, where an operation is either a lookup, an insert, or a delete. The experiments are conducted on a hash-based (set) ETS table with different percentages of update operations, i.e. insertions or deletions.

Figure 17 shows the runtimes in seconds of 17M operations in different Erlang/OTP versions, varying the number of schedulers (x-axis), reflecting how the scalability of ETS tables has improved in more recent Erlang/OTP releases. Figure 18 shows the runtimes in seconds of 17M operations on an ETS table with different numbers of reader groups, again varying the number of schedulers. We see that one reader group is not sufficient with 10% updates, nor are two with 1% updates. Beyond that, different numbers of reader groups have little impact on the benchmark performance, except that using 64 groups with 10% updates slightly degrades performance.

Figure 17: Runtime of 17M ETS operations in Erlang/OTP releases R11B-5, R13B02-1, R14B and R16B: 10% updates (left) and 1% updates (right). The y-axis shows time in seconds (lower is better), and the x-axis is the number of OS threads (VM schedulers).

We have explored four other extensions or redesigns in the ETS implementation for better scalability: (1) allowing more programmer control over the number of bucket locks in hash-based tables, so the programmer can reflect the number of schedulers and the expected access pattern; (2) using contention-adapting trees to get better scalability for ordered_set ETS tables, as described by [69]; (3) using queue delegation locking to improve scalability [47]; and (4) adopting schemes for completely eliminating the locks in the meta table. A more complete discussion of our work on ETS can be found in the papers by [69] and [47].

Figure 18: Runtime of 17M ETS operations with varying numbers of reader groups: 10% updates (left) and 1% updates (right). The y-axis shows runtime in seconds (lower is better), and the x-axis is the number of schedulers.

Here we outline only our work on contention-adapting (CA) trees. A CA tree monitors contention in different parts of a tree-shaped data structure, introducing routing nodes with locks in response to high contention, and removing them in response to low contention. For experimental purposes, two variants of the CA tree have been implemented to represent ordered_sets in the virtual machine of Erlang/OTP 17.0. One extends the existing AVL trees in the Erlang VM, and the other uses a Treap data structure [5]. Figure 19 compares the throughput of the CA tree variants with that of ordered_set and set as the number of schedulers increases. It is unsurprising that the CA trees scale so much better than an ordered_set, which is protected by a single readers-writer lock. It is more surprising that they also scale better than set. This is due to hash tables using fine-grained locking at a fixed granularity, while CA trees can adapt the number of locks to the current contention level, and also to the parts of the key range where contention is occurring.

Figure 19: Throughput (operations per microsecond) of set, ordered_set and the AVL-CA tree and Treap-CA tree variants against number of threads: 10% updates (left) and 1% updates (right).
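For readers using recent Erlang/OTP releases: to the best of our knowledge, CA trees were later adopted in mainline Erlang/OTP (from OTP 22) as the implementation behind ordered_set tables with write_concurrency enabled, so the scalable behaviour measured here can be requested with an ordinary table option (this postdates the work described in this article):

    %% On OTP 22 or later (our understanding), this ordered_set is
    %% backed by a contention-adapting tree.
    Tab = ets:new(ordered_cache, [ordered_set, public,
                                  {write_concurrency, true}]).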


ii Improvements to Schedulers

In the Erlang VM, a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until it either blocks waiting for input, typically a message from some other process, or until it has executed its quota of reductions.

The Erlang VM is usually started with one scheduler per logical core (SMT-thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned, it is placed in the run queue of the scheduler of its parent, and it waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores, that is, an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Figure 24 (Section iii.2).
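Both quantities described here can be inspected on a live node with standard built-ins (a brief illustration; the variable names are ours):

    %% Number of schedulers (usually one per logical core), and the
    %% current total length of all scheduler run queues.
    Schedulers  = erlang:system_info(schedulers_online),
    RunQueueLen = erlang:statistics(run_queue).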

The default load management mechanism is load compaction, which aims to keep as many scheduler threads as possible fully loaded with work, i.e. it attempts to ensure that scheduler threads do not run out of work. We have developed a new optional scheduler utilisation balancing mechanism that is available from Erlang/OTP 17.0. The new mechanism aims to balance scheduler utilisation between schedulers, that is, it will strive for equal scheduler utilisation on all schedulers.

The scheduler utilisation balancing mechanism has no performance impact on the system when not enabled. On the other hand, when enabled, it results in changed timing in the system; normally there is a small overhead due to measuring of utilisation and calculating balancing information, which depends on the underlying primitives provided by the operating system.

The new balancing mechanism results in a better distribution of processes to schedulers, reducing the probability of core contention. Together with other VM improvements, such as interruptable BIFs and garbage collection, it results in lower latency and improved responsiveness, and hence reliability, for soft real-time applications.
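The mechanism is off by default and is enabled with a VM flag when starting a node (the standard erl flag for scheduler utilisation balancing):

    # Start a node with scheduler utilisation balancing enabled.
    erl +sub true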

iii Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel benchmark in Section i. The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in that returns "Erlang system time", or time since Epoch, with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, i.e. not SMT-enabled. The documentation promises that values returned by it are strictly increasing, and many applications ended up relying on this; for example, applications often employ erlang:now/0 to generate unique integers.
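The new time API (available from Erlang/OTP 18) provides scalable replacements for such uses; for instance, unique integers no longer need to come from a strictly increasing clock:

    %% Scalable replacement for erlang:now/0-based unique integers.
    Id = erlang:unique_integer([monotonic, positive]).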

Erlang system time should align with the operating system's view of time since Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot, without invalidating the strictly increasing value guarantee. The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when these two have deviated, e.g. in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], i.e. a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards and backwards, while system time is allowed to do this. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.
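In the new API this relationship is directly observable (standard built-ins from Erlang/OTP 18 onwards; the unit conversion at the end is just for display):

    %% Erlang system time = Erlang monotonic time + time offset.
    Mono   = erlang:monotonic_time(),
    Offset = erlang:time_offset(),
    SysMicros = erlang:convert_time_unit(Mono + Offset,
                                         native, microsecond).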

Time Retrieval. Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe, but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, monotonic time delivered by the operating system is solely based on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP); that is, they will run with different frequencies. Linux is an exception, with a monotonic clock that is NTP adjusted and runs with the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing monotonic time and system time delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute. If the adjustment needs to be changed, new adjustment information is published and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the time adjustment information previously published, and calculates Erlang monotonic time. To preserve monotonicity, it is important that all threads that read the same OS monotonic time map this to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information, using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed. This means that in the vast majority of cases the RW lock will be read-locked, which allows multiple readers to run concurrently. To prevent bouncing the lock cache-line, we use a bespoke reader-optimised RW lock implementation, where reader threads notify about their presence in counters on separate cache-lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11]; alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].

Timer Wheel and BIF Timer. The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel, which is used by processes executing on the scheduler.

The implementation of timers in Erlang/OTP uses a built-in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS tables, and only insert one timer at a time into the timer wheel.
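The BIF timers discussed here are the ones behind the standard timer built-ins, for example (standard API; the message and interval are ours):

    %% Set a timer: after 5000 ms the VM sends the message 'tick' to
    %% the calling process; the returned reference can cancel it.
    TRef = erlang:send_after(5000, self(), tick),
    Remaining = erlang:cancel_timer(TRef).  %% ms left, or false if already fired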

Benchmarks. We have measured several benchmarks on a 16-core Bulldozer machine with two eight-core AMD Opteron 4376 HE CPUs.5 We present three of them here.

The first micro benchmark compares the execution time of an Erlang receive with that of a receive after that specifies a timeout and provides a default value. The receive after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, the total execution time with the optimised timers is only 5% longer than without timers.
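As a hedged sketch of what the first micro benchmark exercises (our own minimal reconstruction, not the benchmark source):

    %% A plain receive needs no timer ...
    wait() ->
        receive done -> ok end.

    %% ... while receive-after sets a timer on blocking and cancels
    %% it when a message arrives, exercising the timer implementation.
    wait_with_timeout() ->
        receive done -> ok
        after 1000   -> timeout
        end.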

The second micro benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4 and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on the left shows that: (1) the performance of time management has remained roughly unchanged between Erlang/OTP releases prior to 18.0; (2) the improved time management in Erlang/OTP 18.x makes time management less likely to be a scalability bottleneck, even when using erlang:now/0; and (3) the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution. The graph on the right side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

5 See §2.5.4 of the RELEASE project Deliverable 2.4 (http://release-project.eu/documents/D24.pdf).

Figure 20: BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases: runtimes in ms against number of schedulers (left), and speedup obtained using erlang:monotonic_time/0 (right).

VII Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable, to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling and debugging Erlang systems, which can be used separately or together, as appropriate for solving the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on or specialise the services offered by the Erlang virtual machine, through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer6 application gives a comprehensive overview of many of these data on a node-by-node basis.

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature; that is, the tools support the language itself rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use the tracing support provided by the Erlang VM, so other actor frameworks, e.g. Akka, can use the Kamon7 JVM monitoring system. Similarly, tools for other actor languages or frameworks could use data derived through OS-level tracing frameworks (DTrace8 and SystemTap9 probes), as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

6 http://www.erlang.org/doc/apps/observer
7 http://kamon.io

i Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare it for modification or extension, or, as is the case here, in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support, in a refactoring tool. There are a number of tools that support refactoring in Erlang: in the RELEASE project we have chosen to extend Wrangler10 [56]; other tools include Tidier [68] and RefactorErl [44].

Supporting API Migration. The SD Erlang libraries modify Erlang's global_group library, becoming the new s_group library; as a result, Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolves and this often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically. The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work, and the technicalities of the implementation, can be found in a paper by [52].

8 http://dtrace.org/blogs/about
9 https://sourceware.org/systemtap/wiki
10 http://www.cs.kent.ac.uk/projects/wrangler
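As a hedged illustration of the idea (the module name and adaptor format are our assumptions, not Wrangler's actual conventions): an adaptor defines the old global_group call in terms of s_group, and the generated refactoring folds this definition into client code:

    %% Hypothetical adaptor: the old API expressed in terms of the new.
    -module(global_group_adaptor).
    -export([registered_names/1]).

    %% Old: global_group:registered_names(Arg)
    %% New: the corresponding s_group query.
    registered_names(Arg) ->
        s_group:registered_names(Arg).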

Support for Introducing Parallelism. We have introduced support for parallelising explicit list operations (map and foreach), for process introduction to complete a computationally intensive task in parallel, for introducing a worker process to deal with call handling in an Erlang "generic server", and to parallelise a tail recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach. The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward; even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
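To illustrate what such a parallel counterpart looks like, here is a minimal parallel map in plain Erlang (our own sketch; para_lib's actual implementation and API may differ):

    %% Spawn one worker per element, then collect results in order.
    %% Unique references pair each result with its position.
    pmap(F, Xs) ->
        Parent = self(),
        Refs = [begin
                    Ref = make_ref(),
                    spawn_link(fun() -> Parent ! {Ref, F(X)} end),
                    Ref
                end || X <- Xs],
        [receive {Ref, Result} -> Result end || Ref <- Refs].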

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute a task in parallel with its parent process. The result of the new process is sent back to the parent process, which will then consume it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.
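A hedged sketch of the shape of the refactored code (heavy_task/0, unrelated_computation/0 and combine/2 are illustrative names):

    %% After the refactoring: heavy_task/0 runs in a child process,
    %% and the receive sits just before the result is first needed.
    parent_task() ->
        Parent = self(),
        Ref = make_ref(),
        spawn(fun() -> Parent ! {Ref, heavy_task()} end),
        OtherWork = unrelated_computation(),   %% proceeds in parallel
        Result = receive {Ref, R} -> R end,
        combine(OtherWork, Result).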

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies; for instance, an example might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are checked by program slicing, which is discussed below.

Support for Program Slicing. Program slicing is a general technique of program analysis for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE11 [17], a tool developed in another EU project that also re-uses the Wrangler front end. This work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing in transforming programs to make them suitable for the introduction of parallel structures.

11 http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf


Figure 21: The architecture of WombatOAM. A Master node offers northbound REST interfaces (Topology, Orchestration, Metrics, Notifications, Alarms APIs) to the Wombat Web Dashboard and the Wombat CLI, backed by a Core DB; Middle Managers communicate over Erlang RPC with per-node agents (metrics, error_logger, lager, alarms) on the managed Erlang nodes; the Orchestration service provisions VMs from an infrastructure provider via elibcloud/Libcloud (REST) and provisions software over SSH.

ii Scalable Deployment

We have developed the WombatOAM tool12 to provide a deployment, operations and maintenance framework for large-scale distributed Erlang systems. These systems typically consist of a number of Erlang nodes executing on different hosts. These hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP. Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this is the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

12 WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd.

Architecture. The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability because of the role played by the central Master node; in the current version an additional layer of Middle Managers was introduced, to allow the system to scale easily to thousands of deployed nodes. As the diagram shows, the "northbound" interfaces to the web dashboard and the command-line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms, and so forth; we describe those now.

Services. WombatOAM is designed to collect, store and display various kinds of information and events from running Erlang systems, and these data are accessed and managed through an AJAX-based Web Dashboard. They include the following:

Metrics. WombatOAM supports the collection of some hundred metrics, including, for instance, numbers of processes on a node and the message traffic in and out of the node, on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom13, and interface with other tools, such as graphite14, which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications. As well as metrics, it can support event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which is part of the standard distribution, or the lager logging framework15, and will be displayed and logged as they occur.

13 Folsom collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom
14 https://graphiteapp.org

Alarms. Alarms are more complex entities. Alarms have a state: they can be raised, and once dealt with they can be cleared. They also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as for notifications.

Topology. The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible and, if not, it notifies the other services and periodically tries to reconnect. When the nodes are available again, it also notifies the other services. It doesn't have a middle manager part, because it doesn't talk to the nodes directly; instead it asks the Node manager service to do so.

Node manager. This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration. This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers the Orchestration service uses an external library called Libcloud16, for which Erlang Solutions has written an open source Erlang wrapper called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications: it provides infrastructure for deploying them.

15 https://github.com/basho/lager
16 The unified cloud API [4]: https://libcloud.apache.org

Deployment. The deployment mechanism consists of the following five steps:

1. Registering a provider. WombatOAM provides the same interface for different cloud providers that support the OpenStack standard or the Amazon EC2 API. WombatOAM also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this enables WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

4. Defining a deployment domain. At this step a deployment domain is created, which specifies (i) which providers should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to or removed from the system, depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single-host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue, and processes migrate between the queues through a work stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes.17

iii.1 Percept2

Percept218 builds on the existing Percept tool to provide post hoc, offline analysis and visualisation of Erlang systems. Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node, on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling for a distributed Erlang application using Percept2 is shown in Figure 22.

17 This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.
18 https://github.com/RefactoringTools/percept2

Figure 22: Offline profiling of a distributed Erlang application using Percept2.
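A typical offline-profiling session has this shape (a hedged sketch: we assume Percept2 keeps Percept's profile/analyze/start_webserver entry points, and mymod:run/0 is a hypothetical program under study; exact names and arities may differ):

    %% Profile a run of mymod:run/0 into a trace file, analyse it
    %% into the RAM database, then browse the results on port 8888.
    percept2:profile("run.dat", {mymod, run, []}, [proc]),
    percept2:analyze(["run.dat"]),
    percept2:start_webserver(8888).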

Percept generates an application-level, zoomable concurrency graph showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways, as detailed by [53], including most importantly:

• Distinguishing between running and runnable time for each process: this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including: the migration history of a process between run queues; statistics about message passing between processes; the number of messages and the average message size sent/received by a process; the accumulated runtime per process; and the accumulated time when a process is in a running state.

• Presenting the process tree: the hierarchy structure indicating the parent-child relationships between processes.

Figure 23: Percept2 showing processes running and runnable (with 4 schedulers/run queues).

• Recording dynamic function call graph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system. Unlike global group, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have also extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating measurement of a function's own execution time gives users the freedom of not profiling all the function calls invoked during the program execution; for example, they can choose to profile only functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph, a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of generated web pages. Together these make the system more responsive and more scalable.

iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike, to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, and require neither specially recompiled versions of the software to be examined, nor special post-processing tools to create meaningful information from the data gathered. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the sizes of the run queues vary during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the same storage infrastructure and presentation facilities.

Figure 24: DTrace run-queue visualisation of the bang benchmark with 16 schedulers (RunQueue0–RunQueue15; time on x-axis).

iii.3 Devo

Devo19 [10] is designed to provide real-time, online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).

Figure 25 shows visualisations from devo in both modes. On the left-hand side, a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram), with six cores each; with hyperthreading this gives twelve virtual cores, and hence 24 run queues in total. The size of these run queues is shown by both the colour and height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle. A green arc shows migration on the same physical core, a grey one on the same chip, and blue shows migrations between the two chips.

19 https://github.com/RefactoringTools/devo

Figure 25: Devo low and high-level visualisations of SD-Orbit.

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between the nodes (green quiescent, red busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. Tracing to be performed on monitored nodes is also specified within the configuration file.
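A configuration file in this style might contain entries such as the following (purely illustrative: SD-Mon's actual configuration format may differ):

    %% Hypothetical SD-Mon configuration (Erlang terms).
    {hosts,    ["host1.example.org", "host2.example.org"]}.
    {s_groups, [{group1, ['node1@host1', 'node2@host1']},
                {group2, ['node3@host2']}]}.
    {tracing,  [{group1, [send, 'receive']}]}.   %% trace message passing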

An agent is started by a master SD-Mon node for each s_group and for each free node. Configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up, it tries to get in contact with its nodes and apply the tracing to them, as stated by the master. Binary files are stored in the host file system. Tracing is internally used in order to track s_group operations happening at runtime. An asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

Figure 26: SD-Mon architecture.

The monitoring network is also supervised, in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected, that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution: before (top) and after (bottom) eliminating s_group 2.


Figure 28: SD-Mon online monitoring.

VIII Systemic Evaluation

Preceding sections have investigated improvements to individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI and VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state of the art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable20. The bulk of the experiments reported here are conducted on the Athos cluster, using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed-size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure. The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

20 See Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf

Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements); relative speedup against number of nodes (cores) [Strong Scaling].

i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot standard deviation. Results show that D-Orbit performs better on a small number of nodes, as communication is direct, rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups, i.e. beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails, due to the fact that some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.

38

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 30: Wombat deployment time in seconds against number of nodes N: experimental data with best fit 0.0125N + 83.

The measurement data shows two important facts: it shows that WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and that WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of the effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, while SR-ACO registers them only within an s_group (Section ii).
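The difference between the two reliability schemes is visible directly in the registration calls (the group and process names below are our own illustrations; s_group:register_name/3 as provided by the SD Erlang libraries):

    %% GR-ACO: name synchronised across all connected nodes.
    yes = global:register_name(submaster_1, SubMasterPid),

    %% SR-ACO: name visible only within the s_group colony_1.
    yes = s_group:register_name(colony_1, submaster_1, SubMasterPid).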

For both GR-ACO and SR-ACO, a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine, using (globally synchronised) local recovery information. For example, on a common X86/Ubuntu platform, typical Erlang process recovery times are around 0.3ms, so around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark, and obtained similar results [23].

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.

Figure 31: ACO execution times for GR-ACO, SR-ACO and ML-ACO, Erlang/OTP 17.4 (RELEASE); execution time in seconds against number of nodes (cores) [Weak Scaling].

Scalability. Figure 31 compares the runtimes of the ML, GR and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and the fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks, and others, that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.



Figure 32: Number of sent packets in ML-ACO, GR-ACO and SR-ACO.


As SD Orbit scales better than D-Orbit, SR-ACO scales better than ML-ACO, and SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limit of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language while preserving reliability. The work is a vade mecum for addressing scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering VM, language and persistent storage levels, and have developed the BenchErl and DE-Bench tools for this purpose.



Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15] or Legion [40]. We establish scientifically the folklore limit of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups, for partitioning the network and global process namespace, and semi-explicit process placement, for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).

To improve the scalability of the Erlang VM and libraries, we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores).



Table 2: Cluster Specifications (RAM per host in GB)

    Name     Hosts   Cores/host   Cores (total)   Cores (max avail)   Processor                                   RAM      Interconnect
    GPG        20       16            320               320           Intel Xeon E5-2640v2, 8C, 2GHz               64      10GB Ethernet
    Kalkyl    348        8           2784              1408           Intel Xeon 5520v2, 4C, 2.26GHz            24–72      InfiniBand 20Gb/s
    TinTin    160       16           2560              2240           AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz  64–128     2:1 oversubscribed QDR Infiniband
    Athos     776       24          18624              6144           Intel Xeon E5-2697v2, 12C, 2.7GHz            64      Infiniband FDR14

Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages21. For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3GHz, 16 "Bulldozer" cores each, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

21 See Deliverable 6.7 (http://www.release-project.eu/documents/D67.pdf), available online.



Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level Visualization of Concurrent and Distributed Computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riakdocs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking/.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Kőszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go Programming Language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 1:1–1:1, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. In ACM SIGPLAN Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak Core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.

[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lubeck and Max Neunhoffer. Enumerating large Orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing, 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop: The Definitive Guide. Storage and Analysis at Internet Scale (3rd edition, revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-Strength Functional programming: Experiences with the Ericsson AXD301 Project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.



i Improvements to Erlang Term Storage

Because ETS tables are so heavily used in Erlang systems, they are a focus for scalability improvements. We start by describing their redesign, including some improvements that pre-date our RELEASE project work, i.e. those prior to Erlang/OTP R15B03. These historical improvements are very relevant for a scalability study, and form the basis for our subsequent changes and improvements. At the point when Erlang/OTP got support for multiple cores (in release R11B), there was a single reader-writer lock for each ETS table. Optional fine-grained locking of hash-based ETS tables (i.e. set, bag or duplicate_bag tables) was introduced in Erlang/OTP R13B02-1, adding 16 reader-writer locks for the hash buckets. Reader groups, to minimise read synchronisation overheads, were introduced in Erlang/OTP R14B. The key observation is that a single count of the multiple readers must be synchronised across many cache lines, potentially far away in a NUMA system. Maintaining reader counts in multiple (local) caches makes reads fast, although writes must now check every reader count. In Erlang/OTP R16B the number of bucket locks and the default number of reader groups were both upgraded from 16 to 64.

We illustrate the scaling properties of the ETS concurrency options using the ets_bench BenchErl benchmark on the Intel NUMA machine with 32 hyperthreaded cores specified in Appendix A. The ets_bench benchmark inserts 1M items into the table, then records the time to perform 17M operations, where an operation is either a lookup, an insert, or a delete. The experiments are conducted on a hash-based (set) ETS table with different percentages of update operations, i.e. insertions or deletions.
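A minimal sketch of such an operation mix follows (the actual ets_bench sources ship with BenchErl, and the real benchmark runs the operations from concurrent worker processes); Updates is the percentage of update operations, e.g. 10 or 1.

    -module(ets_bench_sketch).
    -export([bench/2]).

    %% Insert 1M items, then perform NOps operations, each a lookup,
    %% insert or delete; Updates% of the operations are updates.
    bench(NOps, Updates) ->
        T = ets:new(bench, [set, public, {write_concurrency, true}]),
        [true = ets:insert(T, {K, K}) || K <- lists:seq(1, 1000000)],
        do_ops(T, NOps, Updates).

    do_ops(_T, 0, _Updates) ->
        ok;
    do_ops(T, N, Updates) ->
        Key = rand:uniform(1000000),
        case rand:uniform(100) =< Updates of
            false -> ets:lookup(T, Key);
            true  -> case rand:uniform(2) of
                         1 -> ets:insert(T, {Key, N});
                         2 -> ets:delete(T, Key)
                     end
        end,
        do_ops(T, N - 1, Updates).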

Figure 17 shows the runtimes in seconds of 17M operations in different Erlang/OTP versions, varying the number of schedulers (x-axis), reflecting how the scalability of ETS tables has improved in more recent Erlang/OTP releases.


Figure 17: Runtime of 17M ETS operations in Erlang/OTP releases R11B-5, R13B02-1, R14B and R16B: 10% updates (left) and 1% updates (right). The y-axis shows time in seconds (lower is better) and the x-axis is the number of OS threads (VM schedulers).

Figure 18 shows the runtimes in seconds of 17M operations on an ETS table with different numbers of reader groups, again varying the number of schedulers. We see that one reader group is not sufficient with 10% updates, nor are two with 1% updates. Beyond that, different numbers of reader groups have little impact on the benchmark performance, except that using 64 groups with 10% updates slightly degrades performance.

We have explored four other extensions or redesigns in the ETS implementation for better scalability: (1) allowing more programmer control over the number of bucket locks in hash-based tables, so the programmer can reflect the number of schedulers and the expected access pattern; (2) using contention-adapting trees to get better scalability for ordered_set ETS tables, as described by [69]; (3) using queue delegation locking to improve scalability [47]; and (4) adopting schemes for completely eliminating the locks in the meta table.
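Of these, (1) was an experimental extension; in released Erlang/OTP the programmer-visible knobs are the per-table concurrency options, which select fine-grained bucket locking and reader-group-friendly locks, for example:

    %% Tuning a hash-based table for concurrent access in stock
    %% Erlang/OTP: write_concurrency enables fine-grained (bucket)
    %% locking; read_concurrency optimises for concurrent readers.
    T = ets:new(my_table, [set, public,
                           {write_concurrency, true},
                           {read_concurrency, true}]).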




Figure 18: Runtime of 17M ETS operations with varying numbers of reader groups (1 to 64): 10% updates (left) and 1% updates (right). The y-axis shows runtime in seconds (lower is better) and the x-axis is the number of schedulers.

A more complete discussion of our work on ETS can be found in the papers by [69] and [47].

Here we outline only our work on contention-adapting (CA) trees. A CA tree monitors contention in different parts of a tree-shaped data structure, introducing routing nodes with locks in response to high contention, and removing them in response to low contention. For experimental purposes, two variants of the CA tree have been implemented to represent ordered_sets in the virtual machine of Erlang/OTP 17.0: one extends the existing AVL trees in the Erlang VM, and the other uses a Treap data structure [5].


Figure 19: Throughput (operations/microsecond) of CA tree variants against number of threads, for set, ordset, AVL-CA tree and Treap-CA tree: 10% updates (left) and 1% updates (right).

Figure 19 compares the throughput of the CA tree variants with that of ordered_set and set as the number of schedulers increases. It is unsurprising that the CA trees scale so much better than ordered_set, which is protected by a single readers-writer lock. It is more surprising that they also scale better than set. This is because hash tables use fine-grained locking at a fixed granularity, while CA trees can adapt the number of locks to the current contention level, and also to the parts of the key range where contention is occurring.



ii Improvements to Schedulers

In the Erlang VM, a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until it either blocks waiting for input, typically a message from some other process, or until it has executed its quota of reductions.

The Erlang VM is usually started with one scheduler per logical core (SMT-thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned, it is placed in the run queue of the scheduler of its parent, and it waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores; that is, an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Figure iii.2.

The default load management mechanism is load compaction, which aims to keep as many scheduler threads as possible fully loaded with work, i.e. it attempts to ensure that scheduler threads do not run out of work. We have developed a new optional scheduler utilisation balancing mechanism that is available from Erlang/OTP 17.0. The new mechanism aims to balance scheduler utilisation between schedulers; that is, it will strive for equal scheduler utilisation on all schedulers.
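The mechanism is off by default; it is enabled with the +sub emulator flag when starting the VM:

    $ erl +sub true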

The scheduler utilisation balancing mechanism has no performance impact on the system when not enabled. On the other hand, when enabled, it results in changed timing in the system; normally there is a small overhead, due to measuring utilisation and calculating balancing information, which depends on the underlying primitives provided by the operating system.

The new balancing mechanism results in a better distribution of processes to schedulers, reducing the probability of core contention. Together with other VM improvements, such as interruptable BIFs and garbage collection, it results in lower latency and improved responsiveness, and hence reliability, for soft real-time applications.

iii Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel benchmark in Section i. The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in that returns "Erlang system time", or time since Epoch, with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, i.e. SMT-enabled. The documentation promises that values returned by it are strictly increasing, and many applications ended up relying on this. For example, applications often employ erlang:now/0 to generate unique integers.
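Under the new API such uses are better served by erlang:unique_integer/1 (Erlang/OTP 18+), which provides unique, and optionally strictly increasing, integers without global synchronisation; a minimal before/after sketch:

    %% Old idiom: a unique value, relying on now/0's strictly
    %% increasing guarantee (serialises all callers).
    {Mega, Sec, Micro} = erlang:now(),
    Old = (Mega * 1000000 + Sec) * 1000000 + Micro,

    %% New idiom: a scalable unique integer; the monotonic option
    %% is needed only if strictly increasing values are required.
    New = erlang:unique_integer([positive, monotonic]).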

Erlang system time should align with the operating system's view of time since Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot, without invalidating the strictly increasing value guarantee.



The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when these two have deviated, e.g. in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], i.e. a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards and backwards, while system time is allowed to do this. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.
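In code, the new API (Erlang/OTP 18+) expresses this relationship directly:

    %% Erlang system time is a dynamically varying offset from
    %% Erlang monotonic time (both in the native time unit);
    %% erlang:system_time() performs the equivalent computation
    %% in a single call.
    SysTime = erlang:monotonic_time() + erlang:time_offset().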

Time Retrieval. Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, monotonic time delivered by the operating system is solely based on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP); that is, they will run with different frequencies. Linux is an exception, with a monotonic clock that is NTP-adjusted and runs with the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing monotonic time and system time delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute. If the adjustment needs to be changed, new adjustment information is published and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the time adjustment information previously published, and calculates Erlang monotonic time. To preserve monotonicity it is important that all threads that read the same OS monotonic time map it to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information, using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed. This means that in the vast majority of cases the RW lock will be read-locked, which allows multiple readers to run concurrently. To prevent bouncing the lock cache-line, we use a bespoke reader-optimised RW lock implementation, where reader threads notify about their presence in counters on separate cache-lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11]; alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].

Timer Wheel and BIF Timer. The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel that is used by processes executing on the scheduler.

The implementation of timers in Erlang/OTP uses a built-in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS tables, and only insert one timer at a time into the timer wheel.




Benchmarks. We have measured several benchmarks on a 16-core Bulldozer machine with eight dual CPU AMD Opteron 4376 HEs.5 We present three of them here.

The first micro benchmark compares the execution time of an Erlang receive with that of a receive after that specifies a timeout and provides a default value. The receive after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, total execution time with the optimised timers is only 5% longer than without timers.
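The two measured code paths are essentially the following minimal sketch:

    %% Without a timer: block until a message arrives.
    wait() ->
        receive Msg -> Msg end.

    %% With a timer: a timer is set when the process blocks and
    %% cancelled when a message arrives; managing that timer is
    %% the overhead being measured.
    wait_with_timeout() ->
        receive
            Msg -> Msg
        after 10000 ->
            timeout
        end.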

The second micro benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4 and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on the left shows that: (1) the performance of time management remained roughly unchanged between Erlang/OTP releases prior to 18.0; (2) the improved time management in Erlang/OTP 18.x makes time management less

5 See §2.5.4 of the RELEASE project Deliverable 2.4 (http://release-project.eu/documents/D24.pdf).

Figure 20: BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases: runtimes (left) and speedup obtained using erlang:monotonic_time/0 (right). [Plot: time (ms) and speedup vs. number of schedulers, for R15B01 (now()), 17.4 (now()), 18.1 (monotonic_time()) and 18.1 (now()).]

likely to be a scalability bottleneck even when using erlang:now/0; and (3) the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution. The graph on the right side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

VII Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable,



to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling and debugging Erlang systems that can be used separately or together, as appropriate for solving the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on or specialise the services offered by the Erlang virtual machine through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer6 application gives a comprehensive overview of many of these data on a node-by-node basis.

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature. That is, the tools support the language itself, rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use tracing support provided by the Erlang VM, so can other actor frameworks; e.g. Akka can use the Kamon7 JVM monitoring system. Similarly, tools for other actor languages or frameworks could use data derived through OS-level

6 http://www.erlang.org/doc/apps/observer
7 http://kamon.io

tracing frameworks, DTrace8 and SystemTap9 probes, as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

i Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare a program for modification or extension, or, as is the case here, in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support in a refactoring tool. There are a number of tools that support refactoring in Erlang; in the RELEASE project we have chosen to extend Wrangler10 [56]; other tools include Tidier [68] and RefactorErl [44].

Supporting API Migration. The SD Erlang libraries modify Erlang's global_group library, which becomes the new s_group library; as a result, Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolves and this often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be

8 http://dtrace.org/blogs/about
9 https://sourceware.org/systemtap/wiki

10 http://www.cs.kent.ac.uk/projects/wrangler



supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically. The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work, and the technicalities of the implementation, can be found in a paper by [52].
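For example, a hypothetical adaptor for one migrated function might look as follows (the function name, arguments and s_group call shown are illustrative, not the actual global_group or s_group APIs):

    %% Hypothetical adaptor module: each function defines an old
    %% global_group call in terms of the new s_group library;
    %% Wrangler generates the client-code refactoring from
    %% definitions of this shape.
    -module(global_group_adaptor).
    -export([send/2]).

    send(Name, Msg) ->
        s_group:send(default_group, Name, Msg).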

Support for Introducing Parallelism. We have introduced support for parallelising explicit list operations (map and foreach), for process introduction to complete a computationally intensive task in parallel, for introducing a worker process to deal with call handling in an Erlang "generic server", and for parallelising a tail recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach. The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward; even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
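An illustrative parallel map of the kind para_lib provides (a generic sketch, not the para_lib source): spawn a worker per list element and collect the results in order.

    -module(para_sketch).
    -export([pmap/2]).

    %% Parallel map: one worker per element; results are collected
    %% in list order by matching on per-element references.
    pmap(F, Xs) ->
        Parent = self(),
        Refs = [begin
                    Ref = make_ref(),
                    spawn_link(fun() -> Parent ! {Ref, F(X)} end),
                    Ref
                end || X <- Xs],
        [receive {Ref, Res} -> Res end || Ref <- Refs].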

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute a task in parallel with its parent process. The result of the new process is sent back to the parent process, which will then consume it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.
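The shape of the transformed code is sketched below (helper names hypothetical): the spawned process computes the task, and the receive sits just before the first use of the result.

    %% After the refactoring: the task runs in a new process while
    %% the parent continues with independent work; the receive is
    %% placed immediately before the point where Result is needed.
    Parent = self(),
    Ref = make_ref(),
    spawn_link(fun() -> Parent ! {Ref, expensive_task()} end),
    Other = independent_work(),
    Result = receive {Ref, R} -> R end,
    combine(Other, Result).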

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies. For instance, an example might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation, it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are checked by program slicing, which is discussed below.

Support for Program Slicing. Program slicing is a general technique of program analysis for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE11 [17], a tool developed in another EU project that also re-uses the Wrangler front end. This work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing in transforming programs to make them suitable for the introduction of parallel structures.

11 http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf



Figure 21: The architecture of WombatOAM. [Diagram: the Wombat web dashboard and CLI connect over REST (the northbound interface) to the Master, which exposes Topology, Orchestration, Metrics, Notification and Alarms APIs backed by a Core DB (including Topology and Orchestration data). Middle Managers carry out the Master's operations, talking to agents (metrics, error_logger, lager, alarms) on each managed Erlang node via Erlang RPC and the distribution protocol, provisioning VMs from an infrastructure provider through elibcloud/Libcloud (REST), and provisioning software over SSH.]

ii Scalable Deployment

We have developed the WombatOAM tool12 to provide a deployment, operations and maintenance framework for large-scale Erlang distributed systems. These systems typically consist of a number of Erlang nodes executing on different hosts. These hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP.

12 WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd.

Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this is the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture. The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability because of the role played by the central Master node.



In its current version an additional layer of Middle Managers was introduced to allow the system to scale easily to thousands of deployed nodes. As the diagram shows, the "northbound" interfaces to the web dashboard and the command line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms, and so forth; we describe these now.

Services. WombatOAM is designed to collect, store and display various kinds of information and events from running Erlang systems, and these data are accessed and managed through an AJAX-based Web Dashboard; they include the following.

Metrics. WombatOAM supports the collection of some hundred metrics (including, for instance, the number of processes on a node and the message traffic in and out of the node) on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom13, and interface with other tools, such as graphite14, which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications. As well as metrics, it can support event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which is part of the standard distribution, or the lager logging framework15, and will be displayed and logged as they occur.

13 Folsom collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom

14 https://graphiteapp.org


Alarms. Alarms are more complex entities. Alarms have a state: they can be raised, and once dealt with they can be cleared. They also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as for notifications.

Topology. The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible; if not, it notifies the other services and periodically tries to reconnect. When the nodes are available again, it also notifies the other services. It doesn't have a middle manager part because it doesn't talk to the nodes directly; instead it asks the Node manager service to do so.

Node manager. This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration. This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers, the Orchestration service uses an external library called Libcloud16, for which Erlang Solutions has written an open source Erlang wrapper called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications; it provides infrastructure for deploying them.

15 https://github.com/basho/lager




Deployment. The mechanism consists of the following five steps.

1. Registering a provider. WombatOAM provides the same interface for different cloud providers that support the OpenStack standard or the Amazon EC2 API. WombatOAM also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this enables WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

16 The unified cloud API [4]: https://libcloud.apache.org

4. Defining a deployment domain. At this step a deployment domain is created that specifies (i) which providers should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to or removed from the system depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single-host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue and processes migrate between the queues through a work stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes17.

iii.1 Percept2

Percept218 builds on the existing Percept tool to provide post hoc offline analysis and visualisation of Erlang systems.

17 This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.

18 https://github.com/RefactoringTools/percept2



Figure 22: Offline profiling of a distributed Erlang application using Percept2.

Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node, i.e. on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang's built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling a distributed Erlang application using Percept2 is shown in Figure 22.
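For concreteness, a typical offline profiling session has three steps; the sketch below follows the Percept2 README (the trace file name and the entry point orbit:run/0 are illustrative, and option names may differ slightly between versions):

    %% 1. Run Mod:Fun(Args) under Percept2, writing trace events to a file.
    percept2:profile("orbit.dat", {orbit, run, []}, [all]),

    %% 2. Analyse the trace file(s) into Percept2's RAM database.
    percept2:analyze(["orbit.dat"]),

    %% 3. Browse the results through the web interface at http://localhost:8888/.
    percept2:start_webserver(8888).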

Percept generates an application-level zoomable concurrency graph, showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways, as detailed by [53]; the most important of these are:

• Distinguishing between running and runnable time for each process: this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including: the migration history of a process between run queues; statistics about message passing between processes (the number of messages, and the average message size, sent/received by a process); the accumulated runtime per process; and the accumulated time when a process is in a running state.

• Presenting the process tree: the hierarchy structure indicating the parent-child relationships between processes.

Figure 23: Percept2 showing processes running and runnable (with 4 schedulers/run queues).

• Recording the dynamic function callgraph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in each function.

• Tracing of s_group activities in a distributed system. Unlike global groups, s_groups can be changed dynamically. In order to support SD Erlang, we have extended Percept2 to allow the profiling of s_group-related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled or disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating the measurement of a function's own execution time gives users the freedom not to profile all the function calls invoked during program execution. For example, they can choose to profile only the functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph, a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of generated web pages. Together these make the system more responsive and more scalable.

iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, and require neither specially recompiled versions of the software under examination, nor special post-processing tools to create meaningful information from the gathered data. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the size of the run queues varies during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation reuses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the same storage infrastructure and presentation facilities.

Figure 24: DTrace run-queue visualisation of the bang benchmark with 16 schedulers (run queues 0-15; time on the X-axis).

iii.3 Devo

Devo19 [10] is designed to provide real-time online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).

Figure 25 shows visualisations from Devo in both modes. On the left-hand side a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram), with six cores each; with hyperthreading this gives twelve virtual cores per chip, and hence 24 run queues in total. The size of these run queues is shown by both the colour and the height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle. A green arc shows migration on the same physical core, a grey one on the same chip, and blue shows migrations between the two chips.

19 https://github.com/RefactoringTools/devo

Figure 25: Devo low- and high-level visualisations of SD-Orbit.

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between the nodes (green quiescent, red busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. The tracing to be performed on monitored nodes is also specified within the configuration file.

An agent is started by a master SD-Mon node for each s_group and for each free node. The configured tracing is applied on every monitored node, and traces are stored in binary format in the agent's file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up, it tries to get in contact with its nodes, and applies the tracing to them as stated by the master. Binary files are stored in the host file system. Tracing is used internally in order to track s_group operations happening at runtime; an asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

Figure 26: SD-Mon architecture.

The monitoring network is also supervised, in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected, that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master's file system.

SD-Mon provides facilities for online visualisation of this data, as well as for post hoc offline analysis. Figure 28 shows, in real time, the messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution, before (top) and after eliminating s_group 2 (bottom).


Figure 28: SD-Mon online monitoring.

VIII Systemic Evaluation

Preceding sections have investigated the improvements of individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI and VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state-of-the-art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable20. The bulk of the experiments reported here are conducted on the Athos cluster, using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed-size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure. The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.
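In standard terms, writing $T_n$ for the execution time on $n$ cores, the two measures are

\[ \mathrm{Speedup}(n) = \frac{T_1}{T_n} \quad \text{(strong scaling: fixed total work)}, \qquad E(n) = \frac{T_1}{T_n} \quad \text{(weak scaling: work grown in proportion to } n\text{)}. \]

For strong scaling the ideal is $\mathrm{Speedup}(n) = n$; for weak scaling an ideal system keeps the runtime, and hence $E(n)$, constant as cores are added.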

20 See Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf


Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling]. Relative speedup is plotted against the number of nodes (cores), up to 257 nodes (6168 cores).


i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot the standard deviation. The results show that D-Orbit performs better on a small number of nodes, as communication is direct, rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups: beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails, because some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster, is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5,000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.


Figure 30: WombatOAM deployment time: experimental data and the best fit 0.0125N + 83, plotting time (s) against the number of nodes N.

The measurement data shows two important facts: WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and WombatOAM is non-intrusive, as its overhead on a monitored node is typically less than 1.5% of the effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of the two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, while SR-ACO registers them only within an s_group (Section ii).
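The language-level difference between the two is small; a minimal sketch using the s_group API of Section V (the names colony_group and colony_master are illustrative):

    %% GR-ACO style: the name is registered globally, so the recovery
    %% information is replicated on, and synchronised across, all
    %% connected nodes.
    global:register_name(colony_master, MasterPid),
    Pid1 = global:whereis_name(colony_master),

    %% SR-ACO style: the name is registered only within an s_group,
    %% so it is replicated on the s_group's nodes alone.
    s_group:register_name(colony_group, colony_master, MasterPid),
    Pid2 = s_group:whereis_name(colony_group, colony_master).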

For both GR-ACO and SR-ACO, a Chaos Monkey runs on every Erlang node, i.e. master, submasters, and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common x86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, i.e. around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark, and obtained similar results [23].
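A Chaos Monkey of this kind is simple to express in Erlang. The sketch below is a simplification of the service used in the experiments [14]: a real deployment would restrict the victims to application processes, since killing system processes can bring the whole node down (and rand requires OTP 18+; earlier releases used the random module):

    -module(chaos_monkey).
    -export([start/0]).

    %% Kill one randomly chosen local process every second.
    start() ->
        spawn(fun loop/0).

    loop() ->
        timer:sleep(1000),
        Victims = erlang:processes() -- [self()],
        Victim = lists:nth(rand:uniform(length(Victims)), Victims),
        exit(Victim, kill),
        loop().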

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.

Figure 31: ACO execution times on Erlang/OTP 17.4 (RELEASE), plotting execution time (s) against the number of nodes (cores) for GR-ACO, SR-ACO and ML-ACO [Weak Scaling].

Scalability Figure 31 compares the runtimes of the ML, GR, and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it also registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and a fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks, and others, that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO, and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest to SR-ACO (the dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration at all.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.

Figure 32: Number of sent packets in ML-ACO, GR-ACO and SR-ACO.

As SD-Orbit scales better than D-Orbit, as SR-ACO scales better than ML-ACO, and as SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes respectively, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limit of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language, while preserving reliability. The work is a vade mecum for addressing the scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering the VM, language and persistent storage levels, and have developed the BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues that we identify include contention for shared ETS tables, and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15], or Legion [40]. We establish scientifically the folklore limitation of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea: there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including the memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address the language-level scalability issues. The key constructs are s_groups, for partitioning the network and the global process namespace, and semi-explicit process placement, for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).

To improve the scalability of the Erlang VM and libraries we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks, and of reader groups, has been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool has been enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large-scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

Table 2: Cluster specifications (RAM per host in GB).

Name     Hosts   Cores/host   Cores total   Cores max avail   Processor                                   RAM      Interconnection
GPG      20      16           320           320               Intel Xeon E5-2640v2, 8C, 2GHz              64       10GB Ethernet
Kalkyl   348     8            2784          1408              Intel Xeon 5520v2, 4C, 2.26GHz              24-72    InfiniBand 20Gb/s
TinTin   160     16           2560          2240              AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz    64-128   2:1 oversubscribed QDR InfiniBand
Athos    776     24           18624         6144              Intel Xeon E5-2697v2, 12C, 2.7GHz           64       InfiniBand FDR14

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large-scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages21. For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3GHz, each with 16 "Bulldozer" cores, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

21 See Deliverable 6.7, available online: http://www.release-project.eu/documents/D67.pdf


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level Visualization of Concurrent and Distributed Computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riakdocs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Kőszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go programming language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Technical Report Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary, adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 1:1–1:1, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. In ACM Sigplan Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.

[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lübeck and Max Neunhöffer. Enumerating large Orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing - 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale (3. ed., revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-Strength Functional programming: Experiences with the Ericsson AXD301 Project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.


Figure 18: Runtime of 17M ETS operations with varying numbers of reader groups (1 to 64 RG): 10% updates (left) and 1% updates (right). The y-axis shows runtime in seconds (lower is better), and the x-axis the number of schedulers.

access pattern. 2. Using contention-adapting trees to get better scalability for ordered_set ETS tables, as described by [69]. 3. Using queue delegation locking to improve scalability [47]. 4. Adopting schemes for completely eliminating the locks in the meta table. A more complete discussion of our work on ETS can be found in the papers by [69] and [47].
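Several of these optimisations surface to the programmer through standard ETS table options; a minimal sketch (the table name is illustrative):

    %% A shared table tuned for concurrent access:
    %% write_concurrency enables fine-grained (bucket) locking, and
    %% read_concurrency selects reader-group optimised RW locks.
    Tab = ets:new(shared_counters,
                  [set, public,
                   {write_concurrency, true},
                   {read_concurrency, true}]).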

Here we outline only our work on contention-adapting (CA) trees. A CA tree monitors contention in different parts of a tree-shaped data structure, introducing routing nodes with locks in response to high contention, and removing them in response to low contention. For experimental purposes, two variants of the CA tree have been implemented to represent ordered_sets in the virtual machine of Erlang/OTP 17.0: one extends the existing AVL trees in the Erlang VM, and the other uses a Treap data structure [5]. Figure 19 compares the throughput of the CA tree variants with that of ordered_set and set as the number of schedulers increases. It is unsurprising that the CA trees scale so much better than an ordered_set, which is protected by a single readers-writer lock. It is more surprising that they also scale better than set. This is due to hash tables using fine-grained locking at a fixed granularity, while CA trees can adapt the number of locks both to the current contention level and to the parts of the key range where contention is occurring.

Figure 19: Throughput (operations per microsecond, against the number of threads) of the AVL-CA tree, Treap-CA tree, set and ordered_set: 10% updates (left) and 1% updates (right).
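Although these experiments used a modified Erlang/OTP 17.0, CA trees later reached the mainline: our understanding of the OTP 22 release notes is that ordered_set tables with write_concurrency enabled have since been backed by a CA tree, so the programmer-visible change is just a table flag:

    %% On OTP 22 and later this ordered_set is backed by a
    %% contention-adapting tree (table name illustrative):
    Tab = ets:new(ordered_shared,
                  [ordered_set, public, {write_concurrency, true}]).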


ii Improvements to Schedulers

In the Erlang VM, a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction-count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until it either blocks waiting for input, typically a message from some other process, or until it has executed its quota of reductions.

The Erlang VM is usually started with one scheduler per logical core (SMT thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned it is placed in the run queue of the scheduler of its parent, and it waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores, that is, an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Figure 24 (Section iii.2).

The default load management mechanism is load compaction, which aims to keep as many scheduler threads as possible fully loaded with work, i.e. it attempts to ensure that scheduler threads do not run out of work. We have developed a new optional scheduler utilisation balancing mechanism, available from Erlang/OTP 17.0. The new mechanism aims to balance scheduler utilisation between schedulers; that is, it strives for equal scheduler utilisation on all schedulers.
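The mechanism is disabled by default; going by the OTP 17.0 release notes, it is switched on with an emulator flag when the node is started (check erl(1) for the flag in your release):

    $ erl +sub true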

The scheduler utilisation balancing mechanism has no performance impact on the system when not enabled. When enabled, on the other hand, it results in changed timing in the system: normally there is a small overhead, due to measuring utilisation and calculating balancing information, which depends on the underlying primitives provided by the operating system.

The new balancing mechanism results in a better distribution of processes to schedulers, reducing the probability of core contention. Together with other VM improvements, such as interruptable BIFs and garbage collection, it results in lower latency and improved responsiveness, and hence reliability, for soft real-time applications.

iii Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel BenchErl benchmark of Section i. The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in, which returns "Erlang system time", i.e. time since Epoch with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, i.e. not SMT-enabled. The documentation promises that the values it returns are strictly increasing, and many applications have ended up relying on this; for example, applications often employ erlang:now/0 to generate unique integers.

Erlang system time should align with the operating system's view of time since Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot, without invalidating the strictly-increasing-value guarantee. The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when the two have deviated, e.g. in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], i.e. a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards or backwards, while system time is. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.
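The new API (available from OTP 18) exposes exactly this structure; a minimal sketch:

    %% Erlang system time is monotonic time plus a dynamically
    %% varying offset:
    Mono    = erlang:monotonic_time(),
    Offset  = erlang:time_offset(),
    SysTime = Mono + Offset,   % the same value erlang:system_time() returns

    %% Convert from the VM's native unit to a familiar one:
    Millis = erlang:convert_time_unit(SysTime, native, millisecond).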

Time Retrieval Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread-safe, but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, the monotonic time delivered by the operating system is based solely on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP); that is, they will run with different frequencies. Linux is an exception, with a monotonic clock that is NTP-adjusted and runs with the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing the monotonic time and system time delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute. If the adjustment needs to be changed, new adjustment information is published, and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the time adjustment information previously published, and calculates Erlang monotonic time. To preserve monotonicity, it is important that all threads that read the same OS monotonic time map it to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information, using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed, which means that in the vast majority of cases it will be read-locked, allowing multiple readers to run concurrently. To prevent bouncing the lock cache-line, we use a bespoke reader-optimised RW lock implementation, where reader threads notify about their presence in counters on separate cache-lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11], and alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].

Timer Wheel and BIF Timer The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel, which is used by the processes executing on that scheduler.

The implementation of timers in Erlang/OTP uses a built-in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to timers in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes: these keep information about timers in private ETS tables, and only insert one timer at a time into the timer wheel.

Benchmarks We have measured several benchmarks on a 16-core Bulldozer machine with eight dual CPU AMD Opteron 4376 HEs5.

We present three of them here.

The first micro benchmark compares the execution time of an Erlang receive with that of a receive ... after, which specifies a timeout and provides a default value. The receive ... after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers; using the improved implementation in Erlang/OTP 18.0, the total execution time with the optimised timers is only 5% longer than without timers.
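In Erlang syntax, the two forms compared are (schematically; the 1000ms timeout is illustrative):

    %% Plain receive: no timer involved.
    plain() ->
        receive
            Msg -> Msg
        end.

    %% receive ... after: a timer is set when the process blocks,
    %% and cancelled when a message arrives in time.
    with_timeout() ->
        receive
            Msg -> Msg
        after 1000 ->
            default_value
        end.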

The second micro benchmark repeatedly checks the system time, by calling the built-in erlang:now/0 in Erlang/OTP 17.4, and by calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4 and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on the left shows that: (1) the performance of time management has remained roughly unchanged between Erlang/OTP releases prior to 18.0; (2) the improved time management in Erlang/OTP 18.x makes time management less likely to be a scalability bottleneck, even when using erlang:now/0; and (3) the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution. The graph on the right side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.
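The substitution made in the modified benchmark is mechanical; where erlang:now/0 was used only as a source of unique, monotonically increasing values, OTP 18 provides a dedicated, scalable BIF:

    %% Before (pre-18): erlang:now() doubled as clock and unique-value
    %% generator, and was serialised VM-wide.
    Unique0 = erlang:now(),

    %% After (18+): the two use cases are separated.
    Stamp  = erlang:monotonic_time(),
    Unique = erlang:unique_integer([monotonic, positive]).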

5 See §2.5.4 of the RELEASE project Deliverable 2.4: http://release-project.eu/documents/D24.pdf

Figure 20: BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases (R15B01, 17.4 and 18.1): runtimes in ms against the number of schedulers (left), and the speedup obtained using erlang:monotonic_time/0 (right).

VII Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. Together they provide tooling to transform programs to make them more scalable, to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling and debugging Erlang systems that can be used separately or together as appropriate for solving the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on or specialise the services offered by the Erlang virtual machine through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer6 application gives a comprehensive overview of many of these data on a node-by-node basis.

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature; that is, the tools support the language itself rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use the tracing support provided by the Erlang VM, so other actor frameworks, e.g. Akka, can use the Kamon7 JVM monitoring system. Similarly, tools for other actor languages or frameworks could use data derived through OS-level tracing frameworks, DTrace8 and SystemTap9 probes, as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

6 http://www.erlang.org/doc/apps/observer
7 http://kamon.io

i Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare a program for modification or extension, or, as is the case here, in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support in a refactoring tool. There are a number of tools that support refactoring in Erlang: in the RELEASE project we have chosen to extend Wrangler10 [56]; other tools include Tidier [68] and RefactorErl [44].

Supporting API Migration. The SD Erlang libraries modify Erlang's global_group library, which becomes the new s_group library; as a result, Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolves and this often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be

8 http://dtrace.org/blogs/about
9 https://sourceware.org/systemtap/wiki

10 http://www.cs.kent.ac.uk/projects/wrangler


supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically. The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work, and the technicalities of the implementation, can be found in a paper by [52].
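As a hedged illustration (the module and function names below are hypothetical, not Wrangler's required input format), an adaptor defines each old API function in terms of the new API, and the migration refactoring is generated from such definitions:

    -module(old_api_adaptor).
    -export([lookup/2]).

    %% Old API: lookup(Key, Table) -> Value | not_found.
    %% New API: new_api:find(Table, Key) -> {ok, Value} | error.
    lookup(Key, Table) ->
        case new_api:find(Table, Key) of
            {ok, Value} -> Value;
            error       -> not_found
        end.

Folding this definition into client code rewrites each call to the old lookup/2 as the corresponding case expression over new_api:find/2.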

Support for Introducing Parallelism. We have introduced support for parallelising explicit list operations (map and foreach), for process introduction to complete a computationally intensive task in parallel, for introducing a worker process to deal with call handling in an Erlang "generic server", and to parallelise a tail-recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach. The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward; even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
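A minimal sketch of such a parallel map, spawning one process per list element and collecting the results in order (the actual para_lib implementation may differ, e.g. by bounding the number of worker processes), is:

    pmap(F, Xs) ->
        Parent = self(),
        %% spawn one worker per element; each sends its result
        %% back tagged with its own pid
        Pids = [spawn_link(fun() -> Parent ! {self(), F(X)} end)
                || X <- Xs],
        %% collect the results in the original list order
        [receive {Pid, Res} -> Res end || Pid <- Pids].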

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute a task in parallel with its parent process. The result of the new process is sent back to the parent process, which will then consume it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.
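A sketch of the transformation on hypothetical code, where heavy/1 and other_work/0 are independent:

    %% Before: the two computations run sequentially.
    f(Task) ->
        R = heavy(Task),
        Other = other_work(),
        combine(R, Other).

    %% After: heavy/1 runs in a new process, and the receive is
    %% placed immediately before the first use of its result.
    f(Task) ->
        Parent = self(),
        Pid = spawn(fun() -> Parent ! {self(), heavy(Task)} end),
        Other = other_work(),
        R = receive {Pid, Res} -> Res end,
        combine(R, Other).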

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies. For instance, an example might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are checked by program slicing, which is discussed below.

Support for Program Slicing. Program slicing is a general technique of program analysis for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE11 [17], a tool developed in another EU project that also re-uses the Wrangler front end. This work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing in transforming programs to make them suitable for the introduction of parallel structures.

11 http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf


Figure 21: The architecture of WombatOAM. [Figure: the Wombat Web Dashboard and Wombat CLI connect through a RESTful northbound interface to the Master, whose Metrics, Orchestration, Topology, Notification and Alarms APIs are backed by a Core DB (including topology and orchestration data); the Master delegates to Middle Managers, whose services communicate with agents on the managed Erlang nodes via Erlang RPC and the distribution protocol; Orchestration provisions VMs from an infrastructure provider via elibcloud/Libcloud (REST) and provisions software over SSH.]

ii Scalable Deployment

We have developed the WombatOAM tool12 to provide a deployment, operations and maintenance framework for large-scale Erlang distributed systems. These systems typically consist of a number of Erlang nodes executing on different hosts. These hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP.

12 WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd.

Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this is the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture. The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability because of the role played by the central Master node; in its current version an additional layer of Middle Managers was introduced, to allow the system to scale easily to thousands of deployed nodes. As the diagram shows, the "northbound" interfaces to the web dashboard and the command-line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms, and so forth; we describe those now.

Services. WombatOAM is designed to collect, store and display various kinds of information and event from running Erlang systems, and these data are accessed and managed through an AJAX-based Web Dashboard; they include the following.

Metrics. WombatOAM supports the collection of some hundred metrics, including, for instance, numbers of processes on a node and the message traffic in and out of the node, on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom13, and interface with other tools, such as graphite14, which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications. As well as metrics, WombatOAM can support event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which is part of the standard distribution, or the lager logging framework15, and will be displayed and logged as they occur.

13 Folsom collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom
14 https://graphiteapp.org

Alarms. Alarms are more complex entities. Alarms have a state: they can be raised, and once dealt with they can be cleared. They also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as for notifications.

Topology. The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible, and if not, it notifies the other services and periodically tries to reconnect. When the nodes are available again, it also notifies the other services. It doesn't have a middle manager part, because it doesn't talk to the nodes directly; instead it asks the Node manager service to do so.

Node manager. This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration. This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers the Orchestration service uses an external library called Libcloud16, for which Erlang Solutions has written an open source Erlang wrapper called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications: it provides infrastructure for deploying them.

15 https://github.com/basho/lager
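For illustration, the notifications and alarms that these services consume can be generated on a managed node with the standard SASL and lager APIs (a sketch; WombatOAM's own agent code is not shown here):

    notify_examples() ->
        %% one-off notification via the SASL error logger
        error_logger:warning_report([{node, node()}, {event, overload}]),
        %% stateful alarm: raised, then cleared once dealt with
        alarm_handler:set_alarm({disk_almost_full, "/dev/sda1"}),
        alarm_handler:clear_alarm(disk_almost_full).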

Deployment. The mechanism consists of the following five steps:

1. Registering a provider. WombatOAM provides the same interface for different cloud providers which support the OpenStack standard or the Amazon EC2 API. WombatOAM also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this enables WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

16 The unified cloud API [4]: https://libcloud.apache.org

4. Defining a deployment domain. At this step a deployment domain is created that specifies (i) which providers should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to or removed from the system depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue and processes migrate between the queues through a work stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes.17

iii.1 Percept2

Percept218 builds on the existing Percept tool to provide post hoc offline analysis and visualisation of Erlang systems. Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node on a single manycore host. It visualises Erlang application level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling for a distributed Erlang application using Percept2 is shown in Figure 22.

17 This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.
18 https://github.com/RefactoringTools/percept2

Figure 22: Offline profiling of a distributed Erlang application using Percept2.
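For example, a profiling session might look as follows (a sketch based on Percept2's documented profile/analyze/start_webserver interface; the bench_orbit module is hypothetical):

    profile_run() ->
        %% run the program and collect trace events to file
        percept2:profile("orbit.dat", {bench_orbit, run, []}, [all]),
        %% analyse the trace file(s) into the RAM database
        percept2:analyze(["orbit.dat"]),
        %% browse the results at http://localhost:8888/
        percept2:start_webserver(8888).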

Percept generates an application-level zoomable concurrency graph, showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways, as detailed by [53], including most importantly:

• Distinguishing between running and runnable time for each process: this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including: the migration history of a process between run queues; statistics about message passing between processes (the number of messages and the average message size sent/received by a process); the accumulated runtime per process; and the accumulated time when a process is in a running state.

• Presenting the process tree: the hierarchy structure indicating the parent-child relationships between processes.

Figure 23: Percept2 showing processes running and runnable (with 4 schedulers/run queues).

• Recording the dynamic function callgraph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system. Unlike global group, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have also extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating measurement of a function's own execution time gives users the freedom of not profiling all the function calls invoked during the program execution. For example, they can choose to profile only functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph, a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of generated web pages. Together these make the system more responsive and more scalable.

iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, and require neither special recompiled versions of the software to be examined nor special post-processing tools to create meaningful information from the data gathered. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the sizes of run queues vary during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the same storage infrastructure and presentation facilities.

Figure 24: DTrace run-queue visualisation of the bang benchmark with 16 schedulers. [Figure: the sizes of run queues 0-15 over time.]

iii.3 Devo

Devo19 [10] is designed to provide real-time online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).
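For illustration, tracing of this kind can be turned on by hand with ttb; a minimal sketch (the file name and trace flags here are illustrative, not Devo's actual configuration) is:

    start_tracing(Nodes) ->
        %% start tracers on the given nodes, writing to file
        ttb:tracer(Nodes, [{file, "devo_trace"}]),
        %% trace scheduling events for all processes
        ttb:p(all, [running, timestamp]),
        ok.

    stop_tracing() ->
        %% stop tracing, then fetch and format the logs
        ttb:stop([format]).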

Figure 25 shows visualisations from devo in both modes. On the left-hand side a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram), with six cores each; with hyperthreading this gives twelve virtual cores, and hence 24 run queues in total. The size of these run queues is shown by both the colour and height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle. A green arc shows migration on the same physical core, a grey one on the same chip, and blue shows migrations between the two chips.

19 https://github.com/RefactoringTools/devo

Figure 25: Devo low and high-level visualisations of SD-Orbit.

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between the nodes (green quiescent; red busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. Tracing to be performed on monitored nodes is also specified within the configuration file.

An agent is started by a master SD-Mon node for each s_group and for each free node. Configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up, it tries to get in contact with its nodes and apply the tracing to them as stated by the master. Binary files are stored in the host file system. Tracing is internally used in order to track s_group operations happening at runtime. An asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

Figure 26: SD-Mon architecture.

The monitoring network is also supervised, in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected then that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution: before (top) and after eliminating s_group 2 (bottom).

Figure 28: SD-Mon online monitoring.

VIII Systemic Evaluation

Preceding sections have investigated the improvements of individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI, VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state of the art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable20. The bulk of the experiments reported here are conducted on the Athos cluster using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure. The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

20 See Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf

Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling]. [Figure: relative speedup against number of nodes (cores) for D-Orbit and SD-Orbit with W=2M and W=5M.]

i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot standard deviation. Results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups: beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails, due to the fact that some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.


Figure 30: Wombat deployment time. [Figure: deployment time (sec) against number of nodes N, showing experimental data and the best fit 0.0125N + 83.]

The measurement data shows two important facts: it shows that WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and that WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, and SR-ACO registers them only within an s_group (Section ii).

For both GR-ACO and SR-ACO, a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common X86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, so around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark and obtained similar results [23].
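A minimal sketch of such a Chaos Monkey process is given below (an illustration only: the service used in the experiments is described in [14], and a realistic version would avoid killing system-critical processes; the rand module assumes Erlang/OTP 18 or later):

    chaos_monkey() ->
        timer:sleep(1000),
        %% pick a random process on this node and kill it
        Pids = erlang:processes(),
        Victim = lists:nth(rand:uniform(length(Pids)), Pids),
        exit(Victim, kill),
        chaos_monkey().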

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.

Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling]. [Figure: execution time (s) against number of nodes (cores) for GR-ACO, SR-ACO and ML-ACO.]

Scalability. Figure 31 compares the runtimes of the ML, GR, and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks and others that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.

Figure 32: Number of sent packets in ML-ACO, GR-ACO and SR-ACO.

As SD-Orbit scales better than D-Orbit, SR-ACO scales better than ML-ACO, and SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60 node scaling limits of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language, while preserving reliability. The work is a vade mecum for addressing scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering VM, language and persistent storage levels. We have developed the BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15], or Legion [40]. We establish scientifically the folklore limitations of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups for partitioning the network and global process namespace, and semi-explicit process placement for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).
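For illustration, the kind of code these constructs enable is sketched below (following the s_group API described in Section V; the node and group names are hypothetical):

    init_colony() ->
        %% create an s_group of three nodes: they become fully
        %% connected to each other, but not to other nodes
        s_group:new_s_group(colony1, [n1@host1, n2@host2, n3@host3]),
        %% register a critical process within the s_group only,
        %% rather than in a global namespace
        s_group:register_name(colony1, master, self()).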

To improve the scalability of the Erlang VM and libraries we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60 node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can however be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

Table 2: Cluster Specifications (RAM per host in GB)

  Name    Hosts  Cores/host  Cores (max)  Cores (avail)  Processor                                  RAM     Interconnection
  GPG     20     16          320          320            Intel Xeon E5-2640v2, 8C, 2GHz             64      10GB Ethernet
  Kalkyl  348    8           2784         1408           Intel Xeon 5520v2, 4C, 2.26GHz             24-72   InfiniBand 20Gb/s
  TinTin  160    16          2560         2240           AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz   64-128  2:1 oversubscribed QDR InfiniBand
  Athos   776    24          18624        6144           Intel Xeon E5-2697v2, 12C, 2.7GHz          64      InfiniBand FDR14

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages21. For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache and 128GB RAM, with four AMD Opteron 6276s at 2.3 GHz, each with 16 "Bulldozer" cores, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM, with four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

21 See Deliverable 6.7 (http://www.release-project.eu/documents/D67.pdf), available online.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58-67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture)

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540-545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68-75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33-42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 2-10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level Visualization of Concurrent and Distributed Computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riakdocs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069-1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html

[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 207-216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the sim-diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Kőszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104-121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157-166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22-34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33-41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268-279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341-350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107-113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133-140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go programming language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13-22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118-129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89-144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Technical Report Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP

[35] Guillaume Germain. Concurrency oriented programming in termite scheme. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 20-20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. De-bench: a benchmark tool for distributed erlang, 2014. https://github.com/amirghaffari/DEbench

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43-49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73-74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51-59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39-45, 1997.

[41] Carl Hewitt. Actor model for discretionary adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235-245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 11-11, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. Charm++: a portable concurrent object oriented system based on C++. In ACM Sigplan Notices, volume 28, pages 91-108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15-26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572-583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 141-141, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35-40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley

[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lübeck and Max Neunhöffer. Enumerating large Orbits and direct condensation. Experimental Mathematics, 10(2):197-205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12-23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27-38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207-221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103-104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1-12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11-20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011-2015), 2015. http://www.release-project.eu

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13-24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3-11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215-224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing - 28th International Workshop, volume 9519 of LNCS, pages 37-53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104-128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com

[78] Tom White. Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale (3rd ed., revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-Strength Functional programming: Experiences with the Ericsson AXD301 Project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.


ii Improvements to Schedulers

In the Erlang VM a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until it either blocks waiting for input, typically a message from some other process, or until it has executed its quota of reductions.

The Erlang VM is usually started with one scheduler per logical core (SMT-thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned it is placed in the run queue of the scheduler of its parent, and it waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores, that is, an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Figure iii.2.

The default load management mechanism is load compaction, which aims to keep as many scheduler threads as possible fully loaded with work, i.e. it attempts to ensure that scheduler threads do not run out of work. We have developed a new optional scheduler utilisation balancing mechanism that is available from Erlang/OTP 17.0. The new mechanism aims to balance scheduler utilisation between schedulers, that is, it will strive for equal scheduler utilisation on all schedulers.

The scheduler utilisation balancing mechanism has no performance impact on the system when not enabled. On the other hand, when enabled it results in changed timing in the system; normally there is a small overhead due to measuring of utilisation and calculating balancing information, which depends on the underlying primitives provided by the operating system.

The new balancing mechanism results in a better distribution of processes to schedulers, reducing the probability of core contention. Together with other VM improvements, such as interruptable BIFs and garbage collection, it results in lower latency and improved responsiveness, and hence reliability, for soft real-time applications.
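The mechanism is opt-in. As a minimal sketch (assuming Erlang/OTP 17.0 or later, where the +sub emulator flag is available), utilisation balancing is requested when the VM is started:

    # Start the VM with scheduler utilisation balancing enabled;
    # the default (+sub false) keeps the load-compaction strategy.
    erl +sub true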

iii Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel benchmark in Figure i. The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in that returns "Erlang system time", or time since Epoch, with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, i.e. not SMT-enabled. The documentation promises that values returned by it are strictly increasing, and many applications ended up relying on this. For example, applications often employ erlang:now/0 to generate unique integers.
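For illustration, the old idiom and its replacement are sketched below; erlang:unique_integer/1 arrived with the new time API (Erlang/OTP 18) precisely to serve such uses without imposing a global ordering on time values:

    %% Pre-OTP 18 idiom: relies on the strictly-increasing guarantee
    %% of erlang:now/0 to manufacture unique, ordered identifiers.
    make_id_old() ->
        erlang:now().

    %% OTP 18 replacement: unique, strictly monotonically increasing
    %% integers, with no global synchronisation on a time value.
    make_id_new() ->
        erlang:unique_integer([positive, monotonic]).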

Erlang system time should align with the operating system's view of time since Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot, without invalidating the strictly-increasing value guarantee. The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when these two have deviated, e.g. in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], i.e. a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards and backwards, while system time is allowed to do this. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.
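In the new API this relationship is directly observable; a small sketch (Erlang/OTP 18 or later), equivalent to calling erlang:system_time():

    %% Erlang system time = Erlang monotonic time + current offset.
    %% The offset varies dynamically; monotonic time never leaps.
    system_time_sketch() ->
        erlang:monotonic_time() + erlang:time_offset().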

Time Retrieval. Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, monotonic time delivered by the operating system is solely based on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP). That is, they will run with different frequencies. Linux is an exception, with a monotonic clock that is NTP adjusted and runs with the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing monotonic time and system time delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute. If the adjustment needs to be changed, new adjustment information is published and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the time adjustment information previously published, and calculates Erlang monotonic time. To preserve monotonicity it is important that all threads that read the same OS monotonic time map this to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed. This means that in the vast majority of cases the RW lock will be read-locked, which allows multiple readers to run concurrently. To prevent bouncing the lock cache-line we use a bespoke reader-optimised RW lock implementation, where reader threads notify about their presence in counters on separate cache-lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11]; alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].

Timer Wheel and BIF Timer. The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel that is used by processes executing on the scheduler.

The implementation of timers in Erlang/OTP uses a built-in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS tables, and only insert one timer at a time into the timer wheel.

Benchmarks. We have measured several benchmarks on a 16-core Bulldozer machine with eight dual-core AMD Opteron 4376 HE modules.5

We present three of them here. The first micro-benchmark compares the execution time of an Erlang receive with that of a receive after that specifies a timeout and provides a default value. The receive after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, total execution time with the optimised timers is only 5% longer than without timers.
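A sketch of the two benchmark kernels (illustrative code, not the exact benchmark): the only difference is the after clause, which forces a timer to be set and cancelled around every blocking receive:

    %% Plain receive: no timer involved.
    recv_plain() ->
        receive Msg -> Msg end.

    %% receive .. after: a timer is set when the process blocks and
    %% cancelled when the message arrives, exercising the timer wheel.
    recv_after() ->
        receive
            Msg -> Msg
        after 10000 ->
            timeout
        end.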

The second micro-benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4 and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on the left shows that: (1) the performance of time management has remained roughly unchanged between Erlang/OTP releases prior to 18.0; (2) the improved time management in Erlang/OTP 18.x makes time management less likely to be a scalability bottleneck, even when using erlang:now/0; and (3) the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution. The graph on the right side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

5 See §2.5.4 of the RELEASE project Deliverable 2.4 (http://release-project.eu/documents/D24.pdf).

Figure 20: BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases: runtimes (left) and speedup obtained using erlang:monotonic_time/0 (right).


VII Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable, to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling and debugging Erlang systems that can be used separately or together, as appropriate for solving the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on or specialise the services offered by the Erlang virtual machine through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer6 application gives a comprehensive overview of many of these data on a node-by-node basis.

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature. That is, the tools support the language itself rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use tracing support provided by the Erlang VM, so other actor frameworks, e.g. Akka, can use the Kamon7 JVM monitoring system. Similarly, tools for other actor languages or frameworks could use data derived through OS-level tracing frameworks, DTrace8 and SystemTap9 probes, as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

6 http://www.erlang.org/doc/apps/observer
7 http://kamon.io

i Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare it for modification or extension, or, as is the case here, in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support in a refactoring tool. There are a number of tools that support refactoring in Erlang; in the RELEASE project we have chosen to extend Wrangler10 [56]; other tools include Tidier [68] and RefactorErl [44].

Supporting API Migration. The SD Erlang libraries modify Erlang's global_group library, which becomes the new s_group library as a result; Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolves and this often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically. The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work, and the technicalities of the implementation, can be found in a paper by [52].

8 http://dtrace.org/blogs/about
9 https://sourceware.org/systemtap/wiki

10 http://www.cs.kent.ac.uk/projects/wrangler


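A sketch of the shape of such an adaptor module is given below. The module name, the default group name and the selection of functions are illustrative only (the precise adaptor conventions are described by [52]); the s_group calls mirror the global_group API with an added s_group name argument:

    %% Hypothetical adaptor: old global_group calls defined in terms
    %% of the new s_group API. Wrangler generates the client-code
    %% migration refactoring from definitions of this shape.
    -module(global_group_adaptor).
    -export([send/2, whereis_name/1]).

    send(Name, Msg) ->
        s_group:send(my_group, Name, Msg).        % my_group: placeholder

    whereis_name(Name) ->
        s_group:whereis_name(my_group, Name).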

Support for Introducing Parallelism. We have introduced support for parallelising explicit list operations (map and foreach), for process introduction to complete a computationally intensive task in parallel, for introducing a worker process to deal with call handling in an Erlang "generic server", and to parallelise a tail-recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach. The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward: even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
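The parallel counterpart can be understood through the classic Erlang idiom below (an illustrative sketch, not para_lib's actual implementation): one worker process per list element, with results collected in list order via unique references:

    %% Illustrative parallel map: spawn a worker per element and
    %% gather the results in order by matching unique references.
    pmap(F, Xs) ->
        Parent = self(),
        Refs = [begin
                    Ref = make_ref(),
                    spawn_link(fun() -> Parent ! {Ref, F(X)} end),
                    Ref
                end || X <- Xs],
        [receive {Ref, Result} -> Result end || Ref <- Refs].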

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute a task in parallel with its parent process. The result of the new process is sent back to the parent process, which will then consume it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.
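The transformation has the following shape (illustrative code; heavy_task/1, other_work/1 and combine/2 are placeholder functions):

    %% Before: heavy_task/1 blocks the parent until it completes.
    seq(X, Y) ->
        R1 = heavy_task(X),
        R2 = other_work(Y),     % independent of R1
        combine(R1, R2).

    %% After: heavy_task/1 runs in a new process; the receive sits
    %% immediately before the first use of R1, so other_work/1
    %% overlaps with the spawned computation.
    par(X, Y) ->
        Parent = self(),
        Tag = make_ref(),
        spawn_link(fun() -> Parent ! {Tag, heavy_task(X)} end),
        R2 = other_work(Y),
        R1 = receive {Tag, Res} -> Res end,
        combine(R1, R2).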

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies. For instance, an example might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are checked by program slicing, which is discussed below.

Support for Program Slicing. Program slicing is a general technique of program analysis for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE11 [17], a tool developed in another EU project that also re-uses the Wrangler front end. This work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing in transforming programs to make them suitable for the introduction of parallel structures.

11 http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf


Figure 21: The architecture of WombatOAM.

ii Scalable Deployment

We have developed the WombatOAM tool12 to provide a deployment, operations and maintenance framework for large-scale Erlang distributed systems. These systems typically consist of a number of Erlang nodes executing on different hosts. These hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP.

12 WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd.

Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this is the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture. The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability because of the role played by the central Master node; in the current version an additional layer of Middle Managers was introduced to allow the system to scale easily to thousands of deployed nodes. As the diagram shows, the "northbound" interfaces to the web dashboard and the command-line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms, and so forth; we describe those now.

Services. WombatOAM is designed to collect, store and display various kinds of information and events from running Erlang systems, and these data are accessed and managed through an AJAX-based Web Dashboard; these include the following.

Metrics. WombatOAM supports the collection of some hundred metrics, including for instance the number of processes on a node and the message traffic in and out of the node, on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom13, and interface with other tools, such as graphite14, which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications. As well as metrics, it can support event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which are part of the standard distribution, or the lager logging framework15, and will be displayed and logged as they occur.

13 Folsom collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom
14 https://graphiteapp.org

Alarms. Alarms are more complex entities. Alarms have a state: they can be raised, and once dealt with they can be cleared. They also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as for notifications.

Topology. The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible; if not, it notifies the other services and periodically tries to reconnect. When the nodes are available again, it also notifies the other services. It doesn't have a middle manager part because it doesn't talk to the nodes directly; instead it asks the Node manager service to do so.

Node manager. This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration. This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers the Orchestration service uses an external library called Libcloud16, for which Erlang Solutions has written an open source Erlang wrapper, called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications; it provides infrastructure for deploying them.

15 https://github.com/basho/lager

Deployment. The mechanism consists of the following five steps.

1. Registering a provider. WombatOAM provides the same interface for different cloud providers that support the OpenStack standard or the Amazon EC2 API. WombatOAM also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this will enable WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

16 The unified cloud API [4]: https://libcloud.apache.org

4. Defining a deployment domain. At this step a deployment domain is created that specifies (i) which providers should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to or removed from the system depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single-host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue, and processes migrate between the queues through a work stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes.17

iii.1 Percept2

Percept2 builds on the existing Percept tool to provide post hoc offline analysis and visualisation of Erlang systems.18

17 This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.

18 https://github.com/RefactoringTools/percept2


Figure 22: Offline profiling of a distributed Erlang application using Percept2.

Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node, on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling for a distributed Erlang application using Percept2 is shown in Figure 22.
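In practice the workflow is driven from the Erlang shell; a sketch follows (the entry-point module and file names are placeholders, and the calls follow the Percept2 API as distributed with the tool):

    %% Profile a run of the (placeholder) benchmark, analyse the
    %% resulting trace file, and browse the results on port 8888.
    1> percept2:profile("orbit.dat", {bench_orbit, run, []}, [all]).
    2> percept2:analyze(["orbit.dat"]).
    3> percept2:start_webserver(8888).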

Percept generates an application-level zoomable concurrency graph, showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways, as detailed by [53], including most importantly:

• Distinguishing between running and runnable time for each process; this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including: the migration history of a process between run queues; statistics about message passing between processes (the number of messages, and the average message size, sent/received by a process); the accumulated runtime per process; and the accumulated time when a process is in a running state.

• Presenting the process tree: the hierarchy structure indicating the parent-child relationships between processes.


Figure 23: Percept2 showing processes running and runnable (with 4 schedulers/run queues).


• Recording dynamic function call graph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system. Unlike global_group, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have also extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating measurement of a function's own execution time gives users the freedom of not profiling all the function calls invoked during the program execution. For example, they can choose to profile only functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph, a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of generated web pages. Together these make the system more responsive and more scalable.

iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, and require neither special recompiled versions of the software to be examined, nor special post-processing tools to create meaningful information from the data gathered. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the sizes of run queues vary during execution of the bang BenchErl benchmark on a VM with 16 schedulers.
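Where DTrace/SystemTap probes are unavailable, a coarser view of the same run-queue data can be sampled from inside the VM; a sketch (erlang:statistics(run_queue_lengths) is available from Erlang/OTP 18.3):

    %% Print the length of every scheduler's run queue once per second.
    sample_run_queues() ->
        Lengths = erlang:statistics(run_queue_lengths),
        io:format("run queues: ~w~n", [Lengths]),
        timer:sleep(1000),
        sample_run_queues().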

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the same storage infrastructure and presentation facilities.

Figure 24: DTrace run-queue visualisation of the bang benchmark with 16 schedulers (time on X-axis).


iii.3 Devo

Devo19 [10] is designed to provide real-time online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).

Figure 25 shows visualisations from Devo in both modes. On the left-hand side a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram), with six cores each; with hyperthreading this gives twelve virtual cores, and hence 24 run queues in total. The size of these run queues is shown by both the colour and height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle. A green arc shows migration on the same physical core, a grey one on the same chip, and blue shows migrations between the two chips.

19 https://github.com/RefactoringTools/devo


Figure 25: Devo low- and high-level visualisations of SD-Orbit.


On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between the nodes (green quiescent; red busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. Tracing to be performed on monitored nodes is also specified within the configuration file.

An agent is started by a master SD-Mon node for each s_group and for each free node. Configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up it tries to get in contact with its nodes and apply the tracing to them, as stated by the master. Binary files are stored in the host file system. Tracing is internally used in order to track s_group operations happening at runtime. An asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.


Figure 26: SD-Mon architecture.


The monitoring network is also supervised, in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected, that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution. Before (top) and after eliminating s_group 2 (bottom).


Figure 28: SD-Mon online monitoring.

VIII Systemic Evaluation

Preceding sections have investigated the improvements of individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI and VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative. Similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state-of-the-art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable20. The bulk of the experiments reported here are conducted on the Athos cluster using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed-size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure. The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

20 See Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf


Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling].


i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot standard deviation. Results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups, i.e. beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails, due to the fact that some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5,000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.


Figure 30: Wombat deployment time (experimental data, with best fit 0.0125N + 83).

The measurement data shows two important facts: it shows that WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and that WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, while SR-ACO registers them only within an s_group (Section ii).
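The difference comes down to where names are registered; a sketch (the group and process names are placeholders, and the s_group API is that of the SD Erlang libraries of Section V):

    register_names() ->
        %% GR-ACO: a global registration, synchronised across all
        %% connected nodes in the system.
        yes = global:register_name(colony_master, self()),
        %% SR-ACO: the name, and hence its recovery data, is kept
        %% only on the nodes of one s_group.
        yes = s_group:register_name(colony_group, colony_master, self()).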

For both GR-ACO and SR-ACO a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common X86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, so around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark, and obtained similar results [23].
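Such a service is only a few lines of Erlang; an illustrative sketch of a per-node process killer in this style (not the exact harness used in the experiments):

    %% Every second, pick a random process on this node and kill it.
    chaos_monkey() ->
        timer:sleep(1000),
        Procs = erlang:processes(),
        Victim = lists:nth(rand:uniform(length(Procs)), Procs),
        exit(Victim, kill),
        chaos_monkey().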

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.


Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling].


Scalability. Figure 31 compares the runtimes of the ML, GR and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks and others that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.


Figure 32: Number of sent packets in ML-ACO, GR-ACO and SR-ACO.


As SD-Orbit scales better than D-Orbit, SR-ACO scales better than ML-ACO, and SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limits of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable, reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language, while preserving reliability. The work is a vade mecum for addressing scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering VM, language and persistent storage levels. We have developed the BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15], or Legion [40]. We establish scientifically the folklore limitations of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups, for partitioning the network and global process namespace, and semi-explicit process placement, for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).

To improve the scalability of the Erlang VM and libraries we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large-scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

Table 2: Cluster Specifications (RAM per host in GB)

    Name    Hosts  Cores/host  Cores (max total)  Cores (avail)  Processor                                  RAM     Interconnection
    GPG     20     16          320                320            Intel Xeon E5-2640v2, 8C, 2GHz             64      10GB Ethernet
    Kalkyl  348    8           2784               1408           Intel Xeon 5520v2, 4C, 2.26GHz             24–72   InfiniBand 20Gb/s
    TinTin  160    16          2560               2240           AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz   64–128  2:1 oversubscribed QDR InfiniBand
    Athos   776    24          18624              6144           Intel Xeon E5-2697v2, 12C, 2.7GHz          64      InfiniBand FDR14


In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large-scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages21. For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3GHz, 16 "Bulldozer" cores each, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

²¹See Deliverable 6.7 (http://www.release-project.eu/documents/D67.pdf), available online.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level visualization of concurrent and distributed computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riakdocs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking/.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Kőszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang '16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI '06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go Programming Language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Technical Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary, adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI '73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS '08, pages 1–1, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. In ACM SIGPLAN Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak Core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley/.

[52] Huiqing Li and Simon Thompson. Automated API migration in a user-extensible refactoring tool for Erlang programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE '12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved semantics and implementation through property-based testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe concurrency introduction through slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, 2008.

[57] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897/.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock, II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP '09), pages 13–24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proceedings of the ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computer Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing: 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP '08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop: The Definitive Guide. Storage and Analysis at Internet Scale (3rd edition, revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-strength functional programming: Experiences with the Ericsson AXD301 project. In Implementing Functional Languages (IFL '00), Aachen, Germany, September 2000. Springer-Verlag.



guarantee. The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when these two have deviated, e.g. in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], i.e. a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards and backwards, while system time is allowed to do this. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.
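This relationship is directly visible in the Erlang/OTP 18 time API; the fragment below (ours) computes system time from its two components:

    %% All calls are standard Erlang/OTP 18 built-ins.
    Mono    = erlang:monotonic_time(),     %% never leaps
    Offset  = erlang:time_offset(),        %% the dynamically varying offset
    SysTime = Mono + Offset,               %% what erlang:system_time() returns
    Millis  = erlang:convert_time_unit(SysTime, native, milli_seconds).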

Time Retrieval. Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, monotonic time delivered by the operating system is solely based on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP); that is, they will run with different frequencies. Linux is an exception, with a monotonic clock that is NTP adjusted and runs with the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing monotonic time and system time delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute. If the adjustment needs to be changed, new adjustment information is published and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the time adjustment information previously published, and calculates Erlang monotonic time. To preserve monotonicity it is important that all threads that read the same OS monotonic time map this to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information, using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed. This means that in the vast majority of cases the RW lock will be read-locked, which allows multiple readers to run concurrently. To prevent bouncing the lock cache-line, we use a bespoke reader-optimised RW lock implementation where reader threads notify about their presence in counters on separate cache-lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11]; alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].

Timer Wheel and BIF Timer. The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel that is used by processes executing on the scheduler.

The implementation of timers in Erlang/OTP uses a built in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS tables, and only insert one timer at a time into the timer wheel.

Benchmarks. We have measured several benchmarks on a 16-core Bulldozer machine with dual eight-core AMD Opteron 4376 HE processors⁵.

We present three of them here. The first micro benchmark compares the execution time of an Erlang receive with that of a receive ... after that specifies a timeout and provides a default value. The receive ... after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4, the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, total execution time with the optimised timers is only 5% longer than without timers.
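For clarity, the two code shapes the benchmark compares are simply the following (a sketch; function and variable names are ours):

    %% No timeout: no timer is involved.
    no_timeout() ->
        receive
            Msg -> Msg
        end.

    %% With timeout: a timer is set when the process blocks,
    %% and cancelled when a message arrives first.
    with_timeout() ->
        receive
            Msg -> Msg
        after 1000 ->
            default_value
        end.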

The second micro benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.
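The substitution itself is straightforward (a sketch, ours); where erlang:now/0 was used only to obtain unique monotonically increasing values, erlang:unique_integer/1 avoids even the addition:

    %% Before (Erlang/OTP 17.4 version): globally serialised.
    Ts = erlang:now(),
    %% After (Erlang/OTP 18.x version): scalable reads.
    Ts2 = erlang:monotonic_time() + erlang:time_offset(),
    %% If only unique, monotonically increasing values are required:
    U = erlang:unique_integer([monotonic, positive]).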

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4 and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 in which the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on the left shows that: (1) the performance of time management has remained roughly unchanged between Erlang/OTP releases prior to 18.0; (2) the improved time management in Erlang/OTP 18.x makes time management less

⁵See §2.5.4 of the RELEASE project Deliverable 2.4 (http://release-project.eu/documents/D24.pdf).

Figure 20: BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases: runtimes (left) and speedup obtained using erlang:monotonic_time/0 (right).

likely to be a scalability bottleneck, even when using erlang:now/0; and (3) the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution. The graph on the right-hand side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

VII Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable, to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling and debugging Erlang systems that can be used separately or together, as appropriate for solving the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on or specialise the services offered by the Erlang virtual machine through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer⁶ application gives a comprehensive overview of many of these data on a node-by-node basis.

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature; that is, the tools support the language itself rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use the tracing support provided by the Erlang VM, so other actor frameworks, e.g. Akka, can use the Kamon⁷ JVM monitoring system. Similarly, tools for other actor languages or frameworks could use data derived through probes of the OS-level tracing frameworks DTrace⁸ and SystemTap⁹, as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

⁶http://www.erlang.org/doc/apps/observer
⁷http://kamon.io

i Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare it for modification or extension, or, as is the case here, in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support in a refactoring tool. There are a number of tools that support refactoring in Erlang; in the RELEASE project we have chosen to extend Wrangler¹⁰ [56]; other tools include Tidier [68] and RefactorErl [44].

Supporting API Migration. The SD Erlang libraries modify Erlang's global_group library, becoming the new s_group library as a result. Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolves and this often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically. The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work and the technicalities of the implementation can be found in a paper by [52].

⁸http://dtrace.org/blogs/about
⁹https://sourceware.org/systemtap/wiki
¹⁰http://www.cs.kent.ac.uk/projects/wrangler
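The adaptor idea can be sketched as follows; the s_group calls follow the API described in Section V, while the module name and clause structure here are only illustrative, not Wrangler's actual adaptor conventions:

    %% Illustrative adaptor: an old global_group call expressed
    %% in terms of the new s_group API.
    -module(global_group_adaptor).
    -export([registered_names/1]).

    registered_names({node, Node}) ->
        s_group:registered_names({node, Node});
    registered_names({group, GroupName}) ->
        s_group:registered_names({s_group, GroupName}).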

Support for Introducing Parallelism. We have introduced support for parallelising explicit list operations (map and foreach), for process introduction to complete a computationally intensive task in parallel, for introducing a worker process to deal with call handling in an Erlang "generic server", and to parallelise a tail recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach. The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward: even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
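For illustration, a minimal parallel map in plain Erlang can be written as below; this is a sketch of the idea, not para_lib's actual implementation:

    %% Sketch: spawn a process per element, then collect the
    %% results in the original order.
    pmap(F, Xs) ->
        Parent = self(),
        Pids = [spawn_link(fun() -> Parent ! {self(), F(X)} end) || X <- Xs],
        [receive {Pid, Res} -> Res end || Pid <- Pids].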

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute a task in parallel with its parent process. The result of the new process is sent back to the parent process, which will then consume it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.
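The shape of the code produced is roughly the following (a sketch; expensive_task/0, other_work/0 and combine/2 are hypothetical helpers):

    %% The expensive task runs in a new process; the receive sits
    %% immediately before the point where the result is used.
    compute() ->
        Parent = self(),
        Tag = make_ref(),
        spawn_link(fun() -> Parent ! {Tag, expensive_task()} end),
        Other = other_work(),                   %% independent work, in parallel
        Result = receive {Tag, R} -> R end,     %% blocks only here
        combine(Other, Result).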

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies. For instance, an example might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are checked by program slicing, which is discussed below.

Support for Program Slicing. Program slicing is a general technique of program analysis for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE¹¹ [17], a tool developed in another EU project that also re-uses the Wrangler front end. This work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing in transforming programs to make them suitable for the introduction of parallel structures.

¹¹http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf


Figure 21: The architecture of WombatOAM.

ii Scalable Deployment

We have developed the WombatOAM tool¹² to provide a deployment, operations and maintenance framework for large-scale Erlang distributed systems. These systems typically consist of a number of Erlang nodes executing on different hosts. These hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP.

¹²WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd.

Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this is the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture. The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability because of the role played by the central Master node; in its current version an additional layer of Middle Managers was introduced to allow the system to scale easily to thousands of deployed nodes. As the diagram shows, the "northbound" interfaces to the web dashboard and the command-line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms and so forth; we describe those now.

Services. WombatOAM is designed to collect, store and display various kinds of information and events from running Erlang systems, and these data are accessed and managed through an AJAX-based Web Dashboard; these include the following:

Metrics: WombatOAM supports the collection of some hundred metrics (including, for instance, the number of processes on a node and the message traffic in and out of the node) on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom¹³, and interface with other tools, such as Graphite¹⁴, which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications: As well as metrics, WombatOAM supports event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which is part of the standard distribution, or the lager logging framework¹⁵, and will be displayed and logged as they occur.

¹³Folsom collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom
¹⁴https://graphiteapp.org

Alarms: Alarms are more complex entities. Alarms have a state: they can be raised, and once dealt with they can be cleared; they also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as for notifications.

Topology: The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible; if not, it notifies the other services and periodically tries to reconnect. When the nodes are available again, it also notifies the other services. It doesn't have a middle manager part because it doesn't talk to the nodes directly; instead it asks the Node manager service to do so.

Node manager: This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration: This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers, the Orchestration service uses an external library called Libcloud¹⁶, for which Erlang Solutions has written an open source Erlang wrapper called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications: it provides infrastructure for deploying them.

¹⁵https://github.com/basho/lager

Deployment. The mechanism consists of the following five steps:

1. Registering a provider. WombatOAM provides the same interface for different cloud providers that support the OpenStack standard or the Amazon EC2 API. WombatOAM also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this enables WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

¹⁶The unified cloud API [4]: https://libcloud.apache.org

4. Defining a deployment domain. At this step a deployment domain is created that specifies (i) which providers should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to or removed from the system, depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue, and processes migrate between the queues through a work stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes¹⁷.

iii.1 Percept2

Percept2¹⁸ builds on the existing Percept tool to provide post hoc offline analysis and visualisation of Erlang systems.

¹⁷This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.

¹⁸https://github.com/RefactoringTools/percept2


Figure 22: Offline profiling of a distributed Erlang application using Percept2.

Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang's built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling a distributed Erlang application using Percept2 is shown in Figure 22.
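A typical offline session is driven from the Erlang shell along the following lines; this is a sketch assuming the percept2:profile/3, percept2:analyze/1 and percept2:start_webserver/1 entry points from the tool's distribution, with file, module and argument names that are ours:

    %% 1. Run the program under profiling, writing events to a file.
    percept2:profile("orbit.dat", {orbit, main, [Args]}, [all]),
    %% 2. Analyse the trace file into the RAM database.
    percept2:analyze(["orbit.dat"]),
    %% 3. Inspect the results in a browser via the web interface.
    percept2:start_webserver(8888).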

Percept generates an application-level zoomable concurrency graph, showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways, as detailed by [53], including most importantly:

• Distinguishing between running and runnable time for each process: this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution: the migration history of a process between run queues; statistics about message passing between processes (the number of messages and the average message size sent/received by a process); the accumulated runtime per process; and the accumulated time when a process is in a running state.

• Presenting the process tree: the hierarchy structure indicating the parent-child relationships between processes.

Figure 23: Percept2 showing processes running and runnable (with 4 schedulers/run queues).

• Recording the dynamic function callgraph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system. Unlike global_group, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating measurement of a function's own execution time gives users the freedom not to profile all the function calls invoked during the program execution. For example, they can choose to profile only functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph, a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of generated web pages. Together these make the system more responsive and more scalable.

iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, and require neither special recompiled versions of the software to be examined, nor special post-processing tools to create meaningful information from the data gathered. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the sizes of the run queues vary during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs and a different format for the trace files, but the same storage infrastructure and presentation facilities.

Figure 24: DTrace run-queue visualisation of the bang benchmark with 16 schedulers (time on X-axis).

iii.3 Devo

Devo¹⁹ [10] is designed to provide real-time online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).

Figure 25 shows visualisations from Devo in both modes. On the left-hand side a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram) with six cores each; with hyperthreading this gives twelve virtual cores per chip, and hence 24 run queues in total. The size of these run queues is shown by both the colour and height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle. A green arc shows migration on the same physical core, a grey one on the same chip, and blue shows migrations between the two chips.

¹⁹https://github.com/RefactoringTools/devo

Figure 25: Devo low- and high-level visualisations of SD-Orbit.

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between the nodes (green quiescent; red busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. Tracing to be performed on monitored nodes is also specified within the configuration file.

An agent is started by a master SD-Mon node for each s_group and for each free node. Configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up, it tries to get in contact with its nodes and applies the tracing to them as stated by the master. Binary files are stored in the host file system. Tracing is internally used in order to track s_group operations happening at runtime. An asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

Figure 26: SD-Mon architecture.

The monitoring network is also supervised, in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role; there are also periodic consistency checks for the system as a whole, and when an inconsistency is detected then that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, the messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution: before (top) and after eliminating s_group 2 (bottom).


Figure 28: SD-Mon online monitoring.

VIII Systemic Evaluation

Preceding sections have investigated the improvements of individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI and VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state of the art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable²⁰. The bulk of the experiments reported here are conducted on the Athos cluster, using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure.
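Writing T(n) for the runtime on n nodes, the two measures can be stated as follows (our notation):

    % Strong scaling (Orbit): total work fixed; ideal is S_rel(n) = n.
    S_{rel}(n) = \frac{T(1)}{T(n)}
    % Weak scaling (ACO): work grows in proportion to n, so the ideal
    % is a flat runtime curve, T(n) \approx T(1).

The benchmarks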

²⁰See Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf


Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling].

also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot the standard deviation. Results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups: beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails because some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long running servers.

Figure 30: Wombat deployment time (experimental data; best fit 0.0125N + 83).

The measurement data shows two important facts: it shows that WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and that WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of the effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, while SR-ACO registers them only within an s_group (Section ii).
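The difference between the two is visible at the registration call sites (a sketch; the s_group API is that described in Section V, and the names are ours):

    %% GR-ACO: the name is replicated on, and visible to,
    %% every connected node.
    yes = global:register_name(submaster_3, SubMasterPid),
    %% SR-ACO: the name is kept only within the enclosing s_group.
    yes = s_group:register_name(colony_group_3, submaster_3, SubMasterPid).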

For both GR-ACO and SR-ACO, a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common X86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, so around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark, and obtained similar results [23].

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.

Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling].

Scalability. Figure 31 compares the runtimes of the ML, GR, and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks and others that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO, and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.


Figure 32: Number of sent packets in ML-ACO, GR-ACO, and SR-ACO.


As SD-Orbit scales better than D-Orbit, SR-ACO scales better than ML-ACO, and SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limit of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language while preserving reliability. The work is a vade mecum for addressing scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering VM, language, and persistent storage levels. We have developed the


BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15], or Legion [40]. We establish scientifically the folklore limitations of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups for partitioning the network and global process namespace, and semi-explicit process placement for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).

To improve the scalability of the Erlang VM and libraries we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores and to minimise contention on shared VM resources (Section VI).
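As a concrete illustration, the bucket locks and reader groups discussed above are engaged through a table's concurrency options. This is a minimal sketch; the table name is arbitrary, and the options shown are standard ets flags.

    %% An ETS table tuned for concurrent use: write_concurrency
    %% enables fine-grained (bucket) locking, and read_concurrency
    %% optimises for read-mostly access patterns.
    Tab = ets:new(conc_tab, [set, public,
                             {write_concurrency, true},
                             {read_concurrency, true}]),
    true = ets:insert(Tab, {key, 42}),
    [{key, 42}] = ets:lookup(Tab, key).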

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the


Table 2: Cluster Specifications (RAM per host in GB)

Name     Hosts   Cores/host   Cores total   Cores avail   Processor                                     RAM      Interconnect
GPG         20           16           320           320   Intel Xeon E5-2640 v2, 8C, 2.0GHz             64       10GB Ethernet
Kalkyl     348            8          2784          1408   Intel Xeon 5520 v2, 4C, 2.26GHz               24–72    InfiniBand 20Gb/s
TinTin     160           16          2560          2240   AMD Opteron 6220 v2 (Bulldozer), 8C, 3.0GHz   64–128   2:1 oversubscribed QDR InfiniBand
Athos      776           24         18624          6144   Intel Xeon E5-2697 v2, 12C, 2.7GHz            64       InfiniBand FDR14

key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages.²¹ For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3 GHz, each with 16 "Bulldozer" cores, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

21. See Deliverable 6.7 (http://www.release-project.eu/documents/D67.pdf), available online.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level Visualization of Concurrent and Distributed Computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riakdocs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos Monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In


Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Koszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go Programming Language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero


indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Technical Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 11–11, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. Charm++: a portable concurrent object oriented system based on C++. In ACM SIGPLAN Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of


multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.

[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lübeck and Max Neunhöffer. Enumerating large Orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.


[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing - 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop: The Definitive Guide - Storage and Analysis at Internet Scale (3rd ed., revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-Strength Functional Programming: Experiences with the Ericsson AXD301 Project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.


tables, and only insert one timer at a time into the timer wheel.

Benchmarks. We have measured several benchmarks on a 16-core Bulldozer machine with eight dual-core modules (two AMD Opteron 4376 HE CPUs).⁵

We present three of them here.

The first micro benchmark compares the execution time of an Erlang receive with that of a receive after that specifies a timeout and provides a default value. The receive after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, total execution time with the optimised timers is only 5% longer than without timers.
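The two forms compared by this benchmark are as follows (a minimal sketch; the timeout value and the default are arbitrary):

    %% Plain receive: no timer is involved.
    plain_wait() ->
        receive Msg -> Msg end.

    %% receive ... after: a timer is set when the process blocks and
    %% cancelled when a message arrives; on timeout the after branch
    %% supplies a default value.
    wait_or_default() ->
        receive
            Msg -> Msg
        after 1000 ->
            default
        end.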

The second micro benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.
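In code, the two ways of reading the system time are as follows (a sketch; both calls are standard BIFs, with erlang:monotonic_time/0 and erlang:time_offset/0 available from OTP 18.0 onwards):

    %% Pre-18 style: erlang:now/0 is globally serialised so that
    %% successive results are unique and monotonic -- a bottleneck.
    old_system_time() ->
        erlang:now().

    %% OTP 18 style: monotonic time plus the time offset yields the
    %% system time without global serialisation.
    new_system_time() ->
        erlang:monotonic_time() + erlang:time_offset().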

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4, and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on the left shows that: (1) the performance of time management has remained roughly unchanged between Erlang/OTP releases prior to 18.0; (2) the improved time management in Erlang/OTP 18.x makes time management less

5. See §2.5.4 of the RELEASE project Deliverable 2.4 (http://release-project.eu/documents/D24.pdf).

Figure 20: BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases: runtimes (left) and speedup obtained using erlang:monotonic_time/0 (right). (Time in ms and speedup against number of schedulers, for R15B01 (now()), 17.4 (now()), 18.1 (monotonic_time()) and 18.1 (now()), each with workload [70016640].)

likely to be a scalability bottleneck even when using erlang:now/0; and (3) the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution. The graph on the right side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

VII Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable,


to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling and debugging Erlang systems that can be used separately or together as appropriate for solving the problem at hand, rather than as a single monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on or specialise the services offered by the Erlang virtual machine through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer application gives a comprehensive overview of many of these data on a node-by-node basis.

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature. That is, the tools support the language itself rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use the tracing support provided by the Erlang VM, other actor frameworks, e.g. Akka, can use the Kamon JVM monitoring system. Similarly, tools for other actor languages or frameworks could use data derived through OS-level

6. http://www.erlang.org/doc/apps/observer
7. http://kamon.io

tracing frameworks' (DTrace and SystemTap) probes, as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

i Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare it for modification or extension, or, as is the case here, in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support in a refactoring tool. There are a number of tools that support refactoring in Erlang: in the RELEASE project we have chosen to extend Wrangler [56]; other tools include Tidier [68] and RefactorErl [44].

Supporting API Migration. The SD Erlang libraries modify Erlang's global_group library, becoming the new s_group library; as a result, Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolves and this often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be

8. http://dtrace.org/blogs/about
9. https://sourceware.org/systemtap/wiki
10. http://www.cs.kent.ac.uk/projects/wrangler


supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically. The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work and the technicalities of the implementation can be found in a paper by [52].
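For illustration, an adaptor for the global_group to s_group migration might look as follows. This is a hypothetical sketch: the module name and the exact adaptor format Wrangler expects are assumptions, not taken from the tool's documentation.

    %% Each adaptor function defines an old global_group call in
    %% terms of the new s_group API; the migration refactoring is
    %% generated from definitions of this shape.
    -module(global_group_adaptor).
    -export([registered_names/1]).

    registered_names({node, Node}) ->
        s_group:registered_names({node, Node}).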

Support for Introducing Parallelism. We have introduced support for parallelising explicit list operations (map and foreach), for process introduction to complete a computationally intensive task in parallel, for introducing a worker process to deal with call handling in an Erlang "generic server", and to parallelise a tail recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach. The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward; even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.
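A parallel map of the kind para_lib provides can be sketched in a few lines (illustrative only; the real para_lib API may differ):

    %% Minimal parallel map: spawn one worker per list element, then
    %% collect the results in the original order via unique references.
    pmap(F, Xs) ->
        Parent = self(),
        Refs = [begin
                    Ref = make_ref(),
                    spawn(fun() -> Parent ! {Ref, F(X)} end),
                    Ref
                end || X <- Xs],
        [receive {Ref, Res} -> Res end || Ref <- Refs].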

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute a task in parallel with its parent process. The result of the new process is sent back to the parent process, which will then consume it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.
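The shape of the refactored code is roughly the following (a sketch; expensive_task/0, other_work/0 and use/2 are placeholders):

    %% After the refactoring: the task runs in a new process, the
    %% parent continues with independent work, and the receive sits
    %% immediately before the point where the result is needed.
    Parent = self(),
    Ref = make_ref(),
    spawn(fun() -> Parent ! {Ref, expensive_task()} end),
    Other = other_work(),                  % does not need the result
    Result = receive {Ref, R} -> R end,    % placed just before use
    use(Result, Other).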

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies: for instance, an example might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are checked by program slicing, which is discussed below.

Support for Program Slicing. Program slicing is a general technique of program analysis for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE [17], a tool developed in another EU project that also re-uses the Wrangler front end. This work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing in transforming programs to make them suitable for the introduction of parallel structures.

11. http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf


Figure 21: The architecture of WombatOAM. (Diagram: the Wombat web dashboard and CLI reach the Master through a REST northbound interface; the Master hosts the Core, Topology, Orchestration, Metrics, Notifications and Alarms services, their APIs, and the core database (including topology and orchestration data); Middle Managers run the corresponding agents (metrics, error_logger, lager, alarms) and node managers that talk to the managed Erlang nodes over Erlang RPC; virtual machines are provisioned at the infrastructure provider through elibcloud/Libcloud via REST, and software is provisioned over SSH.)

ii Scalable Deployment

We have developed the WombatOAM tool to provide a deployment, operations and maintenance framework for large-scale Erlang distributed systems. These systems typically consist of a number of Erlang nodes executing on different hosts. These hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP.

12. WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd.

Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this is the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture. The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability because of the role played by the central Master node; in the current


version, an additional layer of Middle Managers was introduced to allow the system to scale easily to thousands of deployed nodes. As the diagram shows, the "northbound" interfaces to the web dashboard and the command line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms, and so forth; we describe those now.

Services. WombatOAM is designed to collect, store and display various kinds of information and event from running Erlang systems, and these data are accessed and managed through an AJAX-based Web Dashboard. These include the following:

Metrics. WombatOAM supports the collection of some hundred metrics (including, for instance, numbers of processes on a node and the message traffic in and out of the node) on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom, and interface with other tools, such as Graphite, which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications. As well as metrics, it can support event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which is part of the standard distribution, or the lager logging framework, and will be displayed and logged as they occur.

13. Folsom collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom
14. https://graphiteapp.org

Alarms. Alarms are more complex entities. Alarms have a state: they can be raised, and once dealt with they can be cleared; they also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as for notifications.

Topology. The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible, and if not it notifies the other services and periodically tries to reconnect. When the nodes are available again, it also notifies the other services. It doesn't have a middle manager part, because it doesn't talk to the nodes directly; instead it asks the Node manager service to do so.

Node manager. This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration. This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers, the Orchestration

15. https://github.com/basho/lager


service uses an external library called Libcloud, for which Erlang Solutions has written an open source Erlang wrapper called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications: it provides infrastructure for deploying them.

Deployment. The mechanism consists of the following five steps:

1. Registering a provider. WombatOAM provides the same interface for different cloud providers that support the OpenStack standard or the Amazon EC2 API. WombatOAM also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this enables WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

16. The unified cloud API [4]: https://libcloud.apache.org

4. Defining a deployment domain. At this step a deployment domain is created, which specifies (i) which providers should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to or removed from the system depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single-host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue, and processes migrate between the queues through a work stealing algorithm; at the same time we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes.¹⁷

iii.1 Percept2

Percept2 builds on the existing Percept tool to provide post hoc, offline analysis and visualisation of Erlang systems.

17. This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.
18. https://github.com/RefactoringTools/percept2


Figure 22: Offline profiling of a distributed Erlang application using Percept2.

Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling for a distributed Erlang application using Percept2 is shown in Figure 22.
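A typical offline profiling session then looks roughly like this (a sketch based on the Percept2 repository's documented entry points; bench:run/0 is a placeholder, and exact signatures and options may differ):

    %% Trace a benchmark run into a file, analyse the trace, and
    %% browse the results through the web interface.
    percept2:profile("bench.dat", {bench, run, []}, [all]),
    percept2:analyze(["bench.dat"]),
    percept2:start_webserver(8888).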

Percept generates an application-level, zoomable concurrency graph showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways, as detailed by [53], including most importantly:

• Distinguishing between running and runnable time for each process: this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including: the migration history of a process between run queues; statistics about message passing between processes (the number of messages and the average message size sent/received by a process); the accumulated runtime per process; and the accumulated time when a process is in a running state.

• Presenting the process tree: the hierarchy structure indicating the parent-child relationships between processes.


Figure 23: Percept2 showing processes running and runnable (with 4 schedulers/run queues).


• Recording dynamic function call graph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system. Unlike global_group, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have also extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating measurement of a function's own execution time gives users the freedom of not profiling all the function calls invoked during the program execution. For example, they can choose to profile only functions defined in their own application's code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph, a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of


generated web pages. Together these make the system more responsive and more scalable.

iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, and require neither specially recompiled versions of the software to be examined nor special post-processing tools to create meaningful information from the data gathered. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the size of run queues varies during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the

Figure 24: DTrace run-queue visualisation of the bang benchmark with 16 schedulers (lengths of RunQueue0–RunQueue15; time on X-axis).

same storage infrastructure and presentation facilities.

iii.3 Devo

Devo [10] is designed to provide real-time online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).
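Trace collection of this kind can be set up with ttb along the following lines (a sketch; the second node name is a placeholder, and the flags shown are illustrative rather than Devo's actual configuration):

    %% Start tracers on the nodes of interest, trace message sends
    %% and receives for all processes, then stop and format.
    {ok, _} = ttb:tracer([node(), 'node2@host'], [{file, "devo_trace"}]),
    {ok, _} = ttb:p(all, [send, 'receive']),
    %% ... run the system under observation ...
    ttb:stop([format]).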

Figure 25 shows visualisations from Devo in both modes. On the left-hand side, a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram), with six cores each; with hyperthreading this gives twelve virtual cores, and hence 24 run queues in total. The size of these run queues is shown by both the colour and height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle.

19. https://github.com/RefactoringTools/devo


Figure 25: Devo low- and high-level visualisations of SD-Orbit.

A green arc shows migration on the same physical core, a grey one on the same chip, and blue shows migrations between the two chips.

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between the nodes (green quiescent; red busiest).


iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. Tracing to be performed on monitored nodes is also specified within the configuration file.

An agent is started by a master SD-Mon node for each s_group and for each free node. Configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up, it tries to get in contact with its nodes and apply the tracing to them, as stated by the master. Binary files are stored in the host file system. Tracing is internally used in order to track s_group operations happening at runtime. An asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is


Figure 26: SD-Mon architecture.

deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

The monitoring network is also supervised, in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected then that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution: before (top) and after eliminating s_group 2 (bottom).


Figure 28: SD-Mon online monitoring.

VIII Systemic Evaluation

Preceding sections have investigated the improvements of individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI, and VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state-of-the-art NUMA architectures and

the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable.²⁰ The bulk of the experiments reported here are conducted on the Athos cluster using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed-size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure. The benchmarks

20. See Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf


Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling]. (The plot shows relative speedup against the number of nodes (cores), from 0 up to 257 nodes (6168 cores), for D-Orbit and SD-Orbit with W=2M and W=5M.)

The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot the standard deviation. Results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups: beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails because some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5,000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.
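As a quick arithmetic check, the best-fit line of Figure 30 reproduces the measured deployment time well:

    T(10000) \approx 0.0125 \times 10000 + 83 = 208\,\mathrm{s}

within 2% of the measured 212s.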


Figure 30: WombatOAM deployment time. (The plot shows time (sec) against the number of nodes N, with experimental data and the best fit 0.0125N + 83.)

The measurement data shows two important facts: WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and WombatOAM is non-intrusive, as its overhead on a monitored node is typically less than 1.5% of the effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, whereas SR-ACO registers them only within an s_group (Section ii).
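The difference between the two versions is essentially one line at each registration site. A sketch, assuming the s_group registration API of the SD Erlang libraries (a three-argument register_name taking the s_group name as its first argument); the names colony_master and colony_group_1 are illustrative:

    %% GR-ACO style: the name goes into the global namespace and must
    %% be synchronised across every node in the connected network.
    register_globally(MasterPid) ->
        yes = global:register_name(colony_master, MasterPid),
        global:whereis_name(colony_master).

    %% SR-ACO style: the name is registered, and synchronised, only
    %% within a single s_group.
    register_in_s_group(MasterPid) ->
        yes = s_group:register_name(colony_group_1, colony_master, MasterPid),
        s_group:whereis_name(colony_group_1, colony_master).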

For both GR-ACO and SR-ACO a Chaos Monkey runs on every Erlang node, i.e. master, submasters, and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common x86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, i.e. around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark and obtained similar results [23].
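For illustration, a Chaos Monkey of this shape takes only a few lines of Erlang; this is a sketch rather than the service used in the experiments (note that the rand module requires OTP 18 or later):

    -module(chaos_monkey_sketch).
    -export([start/0]).

    start() -> spawn(fun loop/0).

    %% Kill one randomly chosen local process every second; supervisors
    %% are expected to restart whatever is lost. (A production version
    %% would avoid killing itself and essential system processes.)
    loop() ->
        timer:sleep(1000),
        Procs = erlang:processes(),
        Victim = lists:nth(rand:uniform(length(Procs)), Procs),
        exit(Victim, kill),
        loop().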

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.


Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling]. (The plot shows execution time (s) against the number of nodes (cores), up to 250 nodes (6000 cores), for GR-ACO, SR-ACO, and ML-ACO.)


Scalability. Figure 31 compares the runtimes of the ML, GR, and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and a fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks, and others, that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO, and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.


Figure 32: Number of sent packets in ML-ACO, GR-ACO, and SR-ACO


As SD-Orbit scales better than D-Orbit, as SR-ACO scales better than ML-ACO, and as SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M and W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limit of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language while preserving reliability. The work is a vade mecum for addressing scalability in reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering the VM, language, and persistent storage levels. We have developed the BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15], or Legion [40]. We establish scientifically the folklore limitation of around 60 connected nodes for distributed Erlang (Section IV).



The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including the memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups, for partitioning the network and the global process namespace, and semi-explicit process placement, for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).

To improve the scalability of the Erlang VM and libraries we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks, and of reader groups, has been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation, such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores and to minimise contention on shared VM resources (Section VI).

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large-scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately-sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).


Table 2: Cluster Specifications (RAM per host in GB)

Name    Hosts  Cores/host  Cores total  Cores max avail  Processor                                   RAM     Interconnection
GPG     20     16          320          320              Intel Xeon E5-2640v2, 8C, 2GHz              64      10GB Ethernet
Kalkyl  348    8           2784         1408             Intel Xeon 5520v2, 4C, 2.26GHz              24-72   InfiniBand 20Gb/s
TinTin  160    16          2560         2240             AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz    64-128  2:1 oversubscribed QDR Infiniband
Athos   776    24          18624        6144             Intel Xeon E5-2697v2, 12C, 2.7GHz           64      Infiniband FDR14


In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large-scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages.21 For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3 GHz, each with 16 “Bulldozer” cores, giving a total of 64 cores; (2) an Intel NUMA machine with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

21 See Deliverable 6.7, available online at http://www.release-project.eu/documents/D67.pdf


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 “SCIEnce: Symbolic Computing Infrastructure in Europe”, IST-2011-287510 “RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software”, and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 “HPC-GAP: High Performance Computational Algebra and Discrete Mathematics”.

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level visualization of concurrent and distributed computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riak Docs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking/.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos Monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Koszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go Programming Language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary, adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 1:1–1:1, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. In ACM SIGPLAN Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak Core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiui.edu/parley/.

[52] Huiqing Li and Simon Thompson. Automated API migration in a user-extensible refactoring tool for Erlang programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved semantics and implementation through property-based testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe concurrency introduction through slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, 2008.

[57] Frank Lubeck and Max Neunhoffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proceedings of the ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computer Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing: 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop: The Definitive Guide. Storage and Analysis at Internet Scale (3rd ed., revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-strength functional programming: Experiences with the Ericsson AXD301 project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

to deploy them for scalability to monitor andvisualise them Most of the tools are freelyavailable under open source licences (DevoPercept2 SD-Mon Wrangler) while Wombat-OAM is a commercial product The tools havebeen used for profiling and refactoring theACO and Orbit benchmarks from Section III

The Erlang tool ldquoecosystemrdquo consists ofsmall stand-alone tools for tracing profilingand debugging Erlang systems that can be usedseparately or together as appropriate for solv-ing the problem at hand rather than as a sin-gle monolithic super-tool The tools presentedhere have been designed to be used as part ofthat ecosystem and to complement alreadyavailable functionality rather than to duplicateit The Erlang runtime system has built-insupport for tracing many types of events andthis infrastructure forms the basis of a numberof tools for tracing and profiling Typically thetools build on or specialise the services offeredby the Erlang virtual machine through a num-ber of built-in functions Most recently andsince the RELEASE project was planned theObserver6 application gives a comprehensiveoverview of many of these data on a node-by-node basis

As actor frameworks and languages (seeSection ii) have only recently become widelyadopted commercially their tool support re-mains relatively immature and generic in na-ture That is the tools support the languageitself rather than its distinctively concurrentaspects Given the widespread use of Erlangtools developed for it point the way for toolsin other actor languages and frameworks Forexample just as many Erlang tools use trac-ing support provided by the Erlang VM socan other actor frameworks eg Akka can usethe Kamon7 JVM monitoring system Simi-larly tools for other actor languages or frame-works could use data derived through OS-level

6httpwwwerlangorgdocappsobserver7httpkamonio

tracing frameworks DTrace8 and SystemTap9

probes as we show in this section for Erlangprovided that the host language has tracinghooks into the appropriate infrastructure

i Refactoring for Scalability

Refactoring [65 32 75] is the process of chang-ing how a program works without changingwhat it does This can be done for readabilityfor testability to prepare it for modification orextension or mdash as is the case here mdash in orderto improve its scalability Because refactoringinvolves the transformation of source code itis typically performed using machine supportin a refactoring tool There are a number oftools that support refactoring in Erlang in theRELEASE project we have chosen to extendWrangler10 [56] other tools include Tidier [68]and RefactorErl [44]

Supporting API Migration The SD Erlang li-braries modify Erlangrsquos global_group librarybecoming the new s_group library as a re-sult Erlang programs using global_group willhave to be refactored to use s_group This kindof API migration problem is not uncommon assoftware evolves and this often changes the APIof a library Rather than simply extend Wran-gler with a refactoring to perform this particu-lar operation we instead added a frameworkfor the automatic generation of API migrationrefactorings from a user-defined adaptor mod-ule

Our approach to automatic API migrationworks in this way when an API functionrsquosinterface is changed the author of this APIfunction implements an adaptor function defin-ing calls to the old API in terms of the newFrom this definition we automatically generatethe refactoring that transforms the client codeto use the new API this refactoring can also be

8httpdtraceorgblogsabout9httpssourcewareorgsystemtapwiki

10httpwwwcskentacukprojectswrangler

27

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

supplied by the API writer to clients on libraryupgrade allowing users to upgrade their codeautomatically The refactoring works by gener-ating a set of rules that ldquofold inrdquo the adaptationto the client code so that the resulting codeworks directly with the new API More detailsof the design choices underlying the work andthe technicalities of the implementation can befound in a paper by [52]

Support for Introducing Parallelism Wehave introduced support for parallelising ex-plicit list operations (map and foreach) for pro-cess introduction to complete a computation-ally intensive task in parallel for introducing aworker process to deal with call handling in anErlang ldquogeneric serverrdquo and to parallelise a tailrecursive function We discuss these in turnnow more details and practical examples ofthe refactorings appear in a conference paperdescribing that work [55]

Uses of map and foreach in list process-ing are among of the most obvious placeswhere parallelism can be introduced Wehave added a small library to Wrangler calledpara_lib which provides parallel implementa-tions of map and foreach The transformationfrom an explicit use of sequential mapforeachto the use of their parallel counterparts isvery straightforward even manual refactor-ing would not be a problem However amapforeach operation could also be imple-mented differently using recursive functionslist comprehensions etc identifying this kindof implicit mapforeach usage can be done us-ing Wranglerrsquos code inspection facility and arefactoring that turns an implicit mapforeachto an explicit mapforeach can also be speci-fied using Wranglerrsquos rule-based transforma-tion API

If the computations of two non-trivial tasksdo not depend on each other then they canbe executed in parallel The Introduce a NewProcess refactoring implemented in Wranglercan be used to spawn a new process to execute

a task in parallel with its parent process Theresult of the new process is sent back to the par-ent process which will then consume it whenneeded In order not to block other computa-tions that do not depend on the result returnedby the new process the receive expression isplaced immediately before the point where theresult is needed

While some tail-recursive list processingfunctions can be refactored to an explicit mapoperation many cannot due to data dependen-cies For instance an example might perform arecursion over a list while accumulating resultsin an accumulator variable In such a situa-tion it is possible to ldquofloat outrdquo some of thecomputations into parallel computations Thiscan only be done when certain dependencyconstraints are satisfied and these are done byprogram slicing which is discussed below

Support for Program Slicing Program slic-ing is a general technique of program analysisfor extracting the part of a program also calledthe slice that influences or is influenced by agiven point of interest ie the slicing criterionStatic program slicing is generally based onprogram dependency including both controldependency and data dependency Backwardintra-function slicing is used by some of therefactorings described above it is also usefulin general and made available to end-usersunder Wranglerrsquos Inspector menu [55]

Our work can be compared with that inPaRTE11 [17] a tool developed in another EUproject that also re-uses the Wrangler front endThis work concentrates on skeleton introduc-tion as does some of our work but we gofurther in using static analysis and slicing intransforming programs to make them suitablefor introduction of parallel structures

11httpparaphrase-enlargedeltehu

downloadsD4-3_user_manualpdf

28

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

InfrastructureProvider

Wombat Web

Dashboard

Northbound Interface

Master

Middle Manager

MetricsAPI

OrchestrationAPI

TopologyAPI

Core Metrics

elibcloudLibcloud

ProvisionVMs(REST)

Agents

Metrics agent

Managed Erlang NodeVirtual machine

Erlang

ErlangRPC

Erlang

Core Metrics

REST

OrchestrationNode manager

Core DB(including

Topology and Orchestration

data)

Topology

Node manager

Provision software(SSH)

WombatCLI

REST

NotifAPI

Notif

error_logger agent

Notif

lager agent

Alarms agent

AlarmsAPI

Alarms

Alarms

Orchestration

Erlang

Figure 21 The architecture of WombatOAM

ii Scalable Deployment

We have developed the WombatOAM tool12 toprovide a deployment operations and main-tenance framework for large-scale Erlang dis-tributed systems These systems typically con-sist of a number of Erlang nodes executing ondifferent hosts These hosts may have differenthardware or operating systems be physical orvirtual or run different versions of ErlangOTP

12WombatOAM (httpswwwerlang-solutionscomproductswombat-oamhtml) is a commercial toolavailable from Erlang Solutions Ltd

Prior to the development of WombatOAM de-ployment of systems would use scripting inErlang and the shell and this is the state ofthe art for other actor frameworks it would bepossible to adapt the WombatOAM approachto these frameworks in a straightforward way

Architecture The architecture ofWombatOAM is summarised in Figure 21Originally the system had problems address-ing full scalability because of the role playedby the central Master node in its current

29

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

version an additional layer of Middle Managerswas introduced to allow the system to scaleeasily to thousands of deployed nodes As thediagram shows the ldquonorthboundrdquo interfacesto the web dashboard and the command-lineare provided through RESTful connectionsto the Master The operations of the Masterare delegated to the Middle Managers thatengage directly with the managed nodes Eachmanaged node runs a collection of services thatcollect metrics raise alarms and so forth wedescribe those now

Services WombatOAM is designed to collectstore and display various kinds of informationand event from running Erlang systems andthese data are accessed and managed throughan AJAX-based Web Dashboard these includethe following

Metrics WombatOAM supports the collectionof some hundred metrics mdash including forinstance numbers of processes on a nodeand the message traffic in and out of thenode mdash on a regular basis from the ErlangVMs running on each of the hosts It canalso collect metrics defined by users withinother metrics collection frameworks suchas Folsom13 and interface with other toolssuch as graphite14 which can log and dis-play such information The metrics can bedisplayed as histograms covering differentwindows such as the last fifteen minuteshour day week or month

Notifications As well as metrics it can sup-port event-based monitoring through thecollection of notifications from runningnodes Notifications which are one timeevents can be generated using the ErlangSystem Architecture Support LibrariesSASL which is part of the standard distri-

13Folsom collects and publishes metrics through anErlang API httpsgithubcomboundaryfolsom

14httpsgraphiteapporg

bution or the lager logging framework15and will be displayed and logged as theyoccur

Alarms Alarms are more complex entitiesAlarms have a state they can be raised andonce dealt with they can be cleared theyalso have identities so that the same alarmmay be raised on the same node multipletimes and each instance will need to bedealt with separately Alarms are gener-ated by SASL or lager just as for notifica-tions

Topology The Topology service handlesadding deleting and discovering nodesIt also monitors whether they are accessi-ble and if not it notifies the other servicesand periodically tries to reconnect Whenthe nodes are available again it also noti-fies the other services It doesnrsquot have amiddle manager part because it doesnrsquottalk to the nodes directly instead it asksthe Node manager service to do so

Node manager This service maintains the con-nection to all managed nodes via theErlang distribution protocol If it losesthe connection towards a node it periodi-cally tries to reconnect It also maintainsthe states of the nodes in the database (egif the connection towards a node is lostthe Node manager changes the node stateto DOWN and raises an alarm) The Nodemanager doesnrsquot have a REST API sincethe node states are provided via the Topol-ogy servicersquos REST API

Orchestration This service can deploy newErlang nodes on already running ma-chines It can also provision new vir-tual machine instances using several cloudproviders and deploy Erlang nodes onthose instances For communicating withthe cloud providers the Orchestration

15httpsgithubcombasholager

30

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

service uses an external library calledLibcloud16 for which Erlang Solutions haswritten an open source Erlang wrappercalled elibcloud to make Libcloud eas-ier to use from WombatOAM Note thatWombatOAM Orchestration doesnrsquot pro-vide a platform for writing Erlang appli-cations it provides infrastructure for de-ploying them

Deployment The mechanism consists of thefollowing five steps

1 Registering a provider WombatOAM pro-vides the same interface for differentcloud providers which support the Open-Stack standard or the Amazon EC2 APIWombatOAM also provides the same in-terface for using a fixed set of machinesIn WombatOAMrsquos backend this has beenimplemented as two driver modules theelibcloud driver module which uses theelibcloud and Libcloud libraries to com-municate with the cloud providers andthe SSH driver module that keeps track ofa fixed set of machines

2 Uploading a release The release can be ei-ther a proper Erlang release archive or aset of Erlang modules The only importantaspect from WombatOAMrsquos point of viewis that start and stop commands shouldbe explicitly specified This will enableWombatOAM start and stop nodes whenneeded

3 Defining a node family The next step is cre-ating the node family which is the entitythat refers to a certain release containsdeployment domains that refer to certainproviders and contains other informationnecessary to deploy a node

16The unified cloud API [4] httpslibcloud

apacheorg

4 Defining a deployment domain At this stepa deployment domain is created that spec-ifies(i) which providers should be used forprovisioning machines (ii) the usernamethat will be used when WombatOAM con-nects to the hosts using SSH

5 Node deployment To deploy nodes aWombatOAM user needs only to specifythe number of nodes and the node fam-ily these nodes belong to Nodes can bedynamically added to or removed fromthe system depending on the needs of theapplication The nodes are started andWombatOAM is ready to initiate and runthe application

iii Monitoring and Visualisation

A key aim in designing new monitoring andvisualisation tools and adapting existing oneswas to provide support for systems running onparallel and distributed hardware Specificallyin order to run modern Erlang systems andin particular SD Erlang systems it is necessaryto understand both their single host (ldquomulti-corerdquo) and multi-host (ldquodistributedrdquo) natureThat is we need to be able to understand howsystems run on the Erlang multicore virtualmachine where the scheduler associated witha core manages its own run queue and pro-cesses migrate between the queues througha work stealing algorithm at the same timewe have to understand the dynamics of a dis-tributed Erlang program where the user ex-plicitly spawns processes to nodes17

iii1 Percept2

Percept218 builds on the existing Percept toolto provide post hoc offline analysis and visuali-

17This is in contrast with the multicore VM whereprogrammers have no control over where processes arespawned however they still need to gain insight into thebehaviour of their programs to tune performance

18httpsgithubcomRefactoringToolspercept2

31

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 22 Offline profiling of a distributed Erlang application using Percept2

sation of Erlang systems Percept2 is designedto allow users to visualise and tune the paral-lel performance of Erlang systems on a singlenode on a single manycore host It visualisesErlang application level concurrency and iden-tifies concurrency bottlenecks Percept2 usesErlang built in tracing and profiling to monitorprocess states ie waiting running runnablefree and exiting A waiting or suspended pro-cess is considered an inactive and a running orrunnable process is considered active As a pro-gram runs with Percept events are collectedand stored to a file The file is then analysedwith the results stored in a RAM database andthis data is viewed through a web-based in-terface The process of offline profiling for adistributed Erlang application using Percept2is shown in Figure 22

Percept generates an application-levelzoomable concurrency graph showing thenumber of active processes at each pointduring profiling dips in the graph representlow concurrency A lifetime bar for eachprocess is also included showing the pointsduring its lifetime when the process was activeas well as other per-process information

Percept2 extends Percept in a number ofways mdash as detailed by [53] mdash including mostimportantly

bull Distinguishing between running andrunnable time for each process this isapparent in the process runnability com-parison as shown in Figure 23 where or-ange represents runnable and green rep-resents running This shows very clearlywhere potential concurrency is not beingexploited

bull Showing scheduler activity the numberof active schedulers at any time during theprofiling

bull Recording more information during exe-cution including the migration history ofa process between run queues statisticsabout message passing between processesthe number of messages and the averagemessage size sentreceived by a processthe accumulated runtime per-process theaccumulated time when a process is in arunning state

bull Presenting the process tree the hierarchy

32

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 23 Percept2 showing processes running and runnable (with 4 schedulersrun queues)

structure indicating the parent-child rela-tionships between processes

bull Recording dynamic function callgraphcounttime the hierarchy struc-ture showing the calling relationshipsbetween functions during the programrun and the amount of time spent in afunction

bull Tracing of s_group activities in a dis-tributed system Unlike global groups_group allows dynamic changes to thes_group structure of a distributed Erlangsystem In order to support SD Erlang wehave also extended Percept2 to allow theprofiling of s_group related activities sothat the dynamic changes to the s_groupstructure of a distributed Erlang systemcan be captured

We have also improved on Percept as follows

bull Enabling finer-grained control of whatis profiled The profiling of port activi-ties schedulers activities message pass-ing process migration garbage collec-tion and s_group activities can be en-ableddisabled while the profiling of pro-cess runnability (indicated by the ldquoprocrdquoflag) is always enabled

bull Selective function profiling of processesIn Percept2 we have built a version offprof which does not measure a functionrsquosown execution time but measures every-thing else that fprof measures Eliminatingmeasurement of a functionrsquos own execu-tion time gives users the freedom of notprofiling all the function calls invoked dur-ing the program execution For examplethey can choose to profile only functionsdefined in their own applicationsrsquo codeand not those in libraries

bull Improved dynamic function callgraphWith the dynamic function callgraph auser is able to understand the causes ofcertain events such as heavy calls of a par-ticular function by examining the regionaround the node for the function includ-ing the path to the root of the graph Eachedge in the callgraph is annotated withthe number of times the target functionis called by the source function as well asfurther information

Finally we have also improved the scalabilityof Percept in three ways First we have paral-lelised the processing of trace files so that mul-tiple data files can be processed at the sametime We have also compressed the representa-tion of call graphs and cached the history of

33

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

generated web pages Together these make thesystem more responsive and more scalable

iii2 DTraceSystemTap tracing

DTrace provides dynamic tracing support forvarious flavours of Unix including BSD andMac OS X and SystemTap does the same forLinux both allow the monitoring of live run-ning systems with minimal overhead Theycan be used by administrators language devel-opers and application developers alike to ex-amine the behaviour of applications languageimplementations and the operating system dur-ing development or even on live productionsystems In comparison to other similar toolsand instrumentation frameworks they are rel-atively lightweight do not require special re-compiled versions of the software to be ex-amined nor special post-processing tools tocreate meaningful information from the datagathered Using these probs it is possible toidentify bottlenecks both in the VM itself andin applications

VM bottlenecks are identified using a largenumber of probes inserted into ErlangOTPrsquosVM for example to explore scheduler run-queue lengths These probes can be used tomeasure the number of processes per sched-uler the number of processes moved duringwork stealing the number of attempts to gaina run-queue lock how many of these succeedimmediately etc Figure 24 visualises the re-sults of such monitoring it shows how the sizeof run queues vary during execution of thebang BenchErl benchmark on a VM with 16schedulers

Application bottlenecks are identified by analternative back-end for Percept2 based onDTraceSystemTap instead of the Erlang built-in tracing mechanism The implementation re-uses the existing Percept2 infrastructure as faras possible It uses a different mechanism forcollecting information about Erlang programsa different format for the trace files but the

0

500

1000

1500

2000

2500

3000

3500

4000

4500

Run Queues

RunQueue0RunQueue1RunQueue2RunQueue3RunQueue4RunQueue5RunQueue6RunQueue7RunQueue8RunQueue9RunQueue10RunQueue11RunQueue12RunQueue13RunQueue14RunQueue15

Figure 24 DTrace runqueue visualisation of bang

benchmark with 16 schedulers (time on X-axis)

same storage infrastructure and presentationfacilities

iii3 Devo

Devo19 [10] is designed to provide real-time on-line visualisation of both the low-level (singlenode multiple cores) and high-level (multi-ple nodes grouped into s_groups) aspects ofErlang and SD Erlang systems Visualisationis within a browser with web sockets provid-ing the connections between the JavaScript vi-sualisations and the running Erlang systemsinstrumented through the trace tool builder(ttb)

Figure 25 shows visualisations from devoin both modes On the left-hand side a sin-gle compute node is shown This consists oftwo physical chips (the upper and lower halvesof the diagram) with six cores each with hy-perthreading this gives twelve virtual coresand hence 24 run queues in total The size ofthese run queues is shown by both the colourand height of each column and process migra-tions are illustrated by (fading) arc between thequeues within the circle A green arc shows mi-

19httpsgithubcomRefactoringToolsdevo

34

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

p0 9152 0

p0 9501 0

p0 10890 0

p0 10891 0

Figure 25 Devo low and high-level visualisations ofSD-Orbit

gration on the same physical core a grey oneon the same chip and blue shows migrationsbetween the two chips

On the right-hand side of Figure 25 is a vi-sualisation of SD-Orbit in action Each nodein the graph represents an Erlang node andcolours (red green blue and orange) are usedto represent the s_groups to which a node be-longs As is evident the three nodes in thecentral triangle belong to multiple groups andact as routing nodes between the other nodesThe colour of the arc joining two nodes repre-

sents the current intensity of communicationbetween the nodes (green quiescent red busi-est)

iii4 SD-Mon

SD-Mon is a tool specifically designed for mon-itoring SD-Erlang systems This purpose isaccomplished by means of a shadow network ofagents that collect data from a running sys-tem An example deployment is shown inFigure 26 where blue dots represent nodesin the target system and the other nodes makeup the SD-Mon infrastructure The network isdeployed on the basis of a configuration filedescribing the network architecture in terms ofhosts Erlang nodes global group and s_grouppartitions Tracing to be performed on moni-tored nodes is also specified within the config-uration file

An agent is started by a master SD-Mon nodefor each s_group and for each free node Con-figured tracing is applied on every monitorednode and traces are stored in binary format inthe agent file system The shadow network fol-lows system changes so that agents are startedand stopped at runtime as required as shownin Figure 27 Such changes are persistentlystored so that the last configuration can be re-produced after a restart Of course the shadownetwork can be always updated via the UserInterface

Each agent takes care of an s_group or of afree node At start-up it tries to get in contactwith its nodes and apply the tracing to them asstated by the master Binary files are stored inthe host file system Tracing is internally usedin order to track s_group operations happeningat runtime An asynchronous message is sentto the master whenever one of these changes oc-curs Since each process can only be traced bya single process at a time each node (includedthose belonging to more than one s_group) iscontrolled by only one agent When a node isremoved from a group or when the group is


Figure 26: SD-Mon architecture

deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

The monitoring network is also supervised in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution. Before (top) and after eliminating s_group 2 (bottom)


Figure 28: SD-Mon online monitoring

VIII Systemic Evaluation

Preceding sections have investigated the improvements of individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI, VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16] on several state of the art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable20. The bulk of the experiments reported here are conducted on the Athos cluster using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure.

20 See Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf


Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling]. (Relative speedup, 0-450, against number of nodes (cores) from 0 to 257 (6168); series: D-Orbit W=2M, D-Orbit W=5M, SD-Orbit W=2M, SD-Orbit W=5M.)

The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot standard deviation. Results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups, i.e. beyond 80 nodes in case of 2M orbit elements, and beyond 100 nodes in case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails because some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment The deployment and monitoring of ACO and of the large (150K lines of Erlang) Sim-Diasca simulation using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5,000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques. However, there hasn't been the demand to do so, as most Erlang systems are long-running servers.
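The divide-and-conquer idea amounts to splitting the list of nodes to start and delegating each half, so the recursion depth is logarithmic in the number of nodes. The fragment below is a schematic of that idea only, not WombatOAM code; deploy_node/1 is a hypothetical helper that starts one node, and the critical path only becomes logarithmic in practice if the delegated halves run on already-deployed hosts rather than all on the master.

    %% Schematic divide-and-conquer deployment (not WombatOAM's code).
    %% deploy_node/1 is a hypothetical helper that starts one node.
    deploy([]) -> ok;
    deploy([Node]) -> deploy_node(Node);
    deploy(Nodes) ->
        {Left, Right} = lists:split(length(Nodes) div 2, Nodes),
        Self = self(),
        %% Start the two halves concurrently; a real system would run
        %% the delegated half from an already-deployed host.
        Pid = spawn(fun() -> deploy(Right), Self ! {done, self()} end),
        deploy(Left),
        receive {done, Pid} -> ok end.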


Figure 30: Wombat Deployment Time. (Time (sec) against number of nodes (N) from 0 to 10000; experimental data with best fit 0.0125N + 83.)

The measurement data shows two important facts: it shows that WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and that WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, and SR-ACO registers them only within an s_group (Section ii).

For both GR-ACO and SR-ACO a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common x86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, so around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark and obtained similar results [23].

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the


Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling]. (Execution time (s), 0-16, against number of nodes (cores) from 0 to 250 (6000); series: GR-ACO, SR-ACO, ML-ACO.)

distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.

Scalability Figure 31 compares the runtimes of the ML, GR and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks and others that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang


Figure 32: Number of sent packets in ML-ACO, GR-ACO and SR-ACO

preserves the distributed Erlang language-level reliability model.

As SD-Orbit scales better than D-Orbit, SR-ACO scales better than ML-ACO, and SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limits of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model with its automatic and VM-supported reliability mechanisms makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language, while preserving reliability. The work is a vade mecum for addressing scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering VM, language and persistent storage levels. We have developed the BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15], or Legion [40]. We establish scientifically the folklore limitations of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups for partitioning the network and global process namespace, and semi-explicit process placement for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).
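As a flavour of the library, the shell sketch below follows the s_group API as described in the SD Erlang documentation; the group and node names are illustrative, and the exact return shapes should be checked against the library itself.

    %% SD Erlang s_group sketch (group and node names illustrative).
    %% Create an s_group: connections and registered names are then
    %% shared only among its member nodes, not across the whole system.
    {ok, swarm1, _Members} =
        s_group:new_s_group(swarm1, ['n1@host1', 'n2@host2']),
    %% Register a name within the s_group only (contrast with
    %% global:register_name/2, which propagates to every connected node).
    yes = s_group:register_name(swarm1, submaster, self()),
    %% Name lookup is likewise scoped to the s_group.
    Pid = s_group:whereis_name(swarm1, submaster).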

To improve the scalability of the Erlang VM and libraries we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the


Table 2: Cluster Specifications (RAM per host in GB)

Name     Hosts  Cores/host  Cores (max total)  Cores (avail)  Processor                                  RAM     Interconnection
GPG      20     16          320                320            Intel Xeon E5-2640v2, 8C, 2GHz             64      10GB Ethernet
Kalkyl   348    8           2784               1408           Intel Xeon 5520v2, 4C, 2.26GHz             24–72   InfiniBand 20Gb/s
TinTin   160    16          2560               2240           AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz   64–128  2:1 oversubscribed QDR Infiniband
Athos    776    24          18624              6144           Intel Xeon E5-2697v2, 12C, 2.7GHz          64      Infiniband FDR14

key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages21. For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) An AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, four AMD Opteron 6276s at 2.3 GHz, 16 "Bulldozer" cores each, giving a total of 64 cores. (2) An Intel NUMA with 128GB RAM, four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

21 See Deliverable 6.7 (http://www.release-project.eu/documents/D67.pdf), available online.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level Visualization of Concurrent and Distributed Computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riakdocs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking/.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the sim-diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Kőszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go programming language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Technical Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in termite scheme. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. De-bench: a benchmark tool for distributed erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary, adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 1:1–1:1, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. Charm++: a portable concurrent object oriented system based on c++. In ACM Sigplan Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.

[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lübeck and Max Neunhöffer. Enumerating large Orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897/.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing - 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale (3. ed., revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-Strength Functional programming: Experiences with the Ericsson AXD301 Project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.

Page 28: Scaling Reliably Scaling Reliably: Improving the ... · Distributed actor languages are an effective means of constructing scalable reliable systems, and the Erlang ... that capture

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

supplied by the API writer to clients on libraryupgrade allowing users to upgrade their codeautomatically The refactoring works by gener-ating a set of rules that ldquofold inrdquo the adaptationto the client code so that the resulting codeworks directly with the new API More detailsof the design choices underlying the work andthe technicalities of the implementation can befound in a paper by [52]

Support for Introducing Parallelism Wehave introduced support for parallelising ex-plicit list operations (map and foreach) for pro-cess introduction to complete a computation-ally intensive task in parallel for introducing aworker process to deal with call handling in anErlang ldquogeneric serverrdquo and to parallelise a tailrecursive function We discuss these in turnnow more details and practical examples ofthe refactorings appear in a conference paperdescribing that work [55]

Uses of map and foreach in list process-ing are among of the most obvious placeswhere parallelism can be introduced Wehave added a small library to Wrangler calledpara_lib which provides parallel implementa-tions of map and foreach The transformationfrom an explicit use of sequential mapforeachto the use of their parallel counterparts isvery straightforward even manual refactor-ing would not be a problem However amapforeach operation could also be imple-mented differently using recursive functionslist comprehensions etc identifying this kindof implicit mapforeach usage can be done us-ing Wranglerrsquos code inspection facility and arefactoring that turns an implicit mapforeachto an explicit mapforeach can also be speci-fied using Wranglerrsquos rule-based transforma-tion API

If the computations of two non-trivial tasksdo not depend on each other then they canbe executed in parallel The Introduce a NewProcess refactoring implemented in Wranglercan be used to spawn a new process to execute

a task in parallel with its parent process Theresult of the new process is sent back to the par-ent process which will then consume it whenneeded In order not to block other computa-tions that do not depend on the result returnedby the new process the receive expression isplaced immediately before the point where theresult is needed

While some tail-recursive list processingfunctions can be refactored to an explicit mapoperation many cannot due to data dependen-cies For instance an example might perform arecursion over a list while accumulating resultsin an accumulator variable In such a situa-tion it is possible to ldquofloat outrdquo some of thecomputations into parallel computations Thiscan only be done when certain dependencyconstraints are satisfied and these are done byprogram slicing which is discussed below

Support for Program Slicing Program slic-ing is a general technique of program analysisfor extracting the part of a program also calledthe slice that influences or is influenced by agiven point of interest ie the slicing criterionStatic program slicing is generally based onprogram dependency including both controldependency and data dependency Backwardintra-function slicing is used by some of therefactorings described above it is also usefulin general and made available to end-usersunder Wranglerrsquos Inspector menu [55]

Our work can be compared with that inPaRTE11 [17] a tool developed in another EUproject that also re-uses the Wrangler front endThis work concentrates on skeleton introduc-tion as does some of our work but we gofurther in using static analysis and slicing intransforming programs to make them suitablefor introduction of parallel structures

11httpparaphrase-enlargedeltehu

downloadsD4-3_user_manualpdf

28

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

InfrastructureProvider

Wombat Web

Dashboard

Northbound Interface

Master

Middle Manager

MetricsAPI

OrchestrationAPI

TopologyAPI

Core Metrics

elibcloudLibcloud

ProvisionVMs(REST)

Agents

Metrics agent

Managed Erlang NodeVirtual machine

Erlang

ErlangRPC

Erlang

Core Metrics

REST

OrchestrationNode manager

Core DB(including

Topology and Orchestration

data)

Topology

Node manager

Provision software(SSH)

WombatCLI

REST

NotifAPI

Notif

error_logger agent

Notif

lager agent

Alarms agent

AlarmsAPI

Alarms

Alarms

Orchestration

Erlang

Figure 21 The architecture of WombatOAM

ii Scalable Deployment

We have developed the WombatOAM tool12 toprovide a deployment operations and main-tenance framework for large-scale Erlang dis-tributed systems These systems typically con-sist of a number of Erlang nodes executing ondifferent hosts These hosts may have differenthardware or operating systems be physical orvirtual or run different versions of ErlangOTP

12WombatOAM (httpswwwerlang-solutionscomproductswombat-oamhtml) is a commercial toolavailable from Erlang Solutions Ltd

Prior to the development of WombatOAM de-ployment of systems would use scripting inErlang and the shell and this is the state ofthe art for other actor frameworks it would bepossible to adapt the WombatOAM approachto these frameworks in a straightforward way

Architecture The architecture ofWombatOAM is summarised in Figure 21Originally the system had problems address-ing full scalability because of the role playedby the central Master node in its current

29

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

version an additional layer of Middle Managerswas introduced to allow the system to scaleeasily to thousands of deployed nodes As thediagram shows the ldquonorthboundrdquo interfacesto the web dashboard and the command-lineare provided through RESTful connectionsto the Master The operations of the Masterare delegated to the Middle Managers thatengage directly with the managed nodes Eachmanaged node runs a collection of services thatcollect metrics raise alarms and so forth wedescribe those now

Services WombatOAM is designed to collectstore and display various kinds of informationand event from running Erlang systems andthese data are accessed and managed throughan AJAX-based Web Dashboard these includethe following

Metrics WombatOAM supports the collectionof some hundred metrics mdash including forinstance numbers of processes on a nodeand the message traffic in and out of thenode mdash on a regular basis from the ErlangVMs running on each of the hosts It canalso collect metrics defined by users withinother metrics collection frameworks suchas Folsom13 and interface with other toolssuch as graphite14 which can log and dis-play such information The metrics can bedisplayed as histograms covering differentwindows such as the last fifteen minuteshour day week or month

Notifications As well as metrics it can sup-port event-based monitoring through thecollection of notifications from runningnodes Notifications which are one timeevents can be generated using the ErlangSystem Architecture Support LibrariesSASL which is part of the standard distri-

13Folsom collects and publishes metrics through anErlang API httpsgithubcomboundaryfolsom

14httpsgraphiteapporg

bution or the lager logging framework15and will be displayed and logged as theyoccur

Alarms Alarms are more complex entitiesAlarms have a state they can be raised andonce dealt with they can be cleared theyalso have identities so that the same alarmmay be raised on the same node multipletimes and each instance will need to bedealt with separately Alarms are gener-ated by SASL or lager just as for notifica-tions

Topology The Topology service handlesadding deleting and discovering nodesIt also monitors whether they are accessi-ble and if not it notifies the other servicesand periodically tries to reconnect Whenthe nodes are available again it also noti-fies the other services It doesnrsquot have amiddle manager part because it doesnrsquottalk to the nodes directly instead it asksthe Node manager service to do so

Node manager This service maintains the con-nection to all managed nodes via theErlang distribution protocol If it losesthe connection towards a node it periodi-cally tries to reconnect It also maintainsthe states of the nodes in the database (egif the connection towards a node is lostthe Node manager changes the node stateto DOWN and raises an alarm) The Nodemanager doesnrsquot have a REST API sincethe node states are provided via the Topol-ogy servicersquos REST API

Orchestration This service can deploy newErlang nodes on already running ma-chines It can also provision new vir-tual machine instances using several cloudproviders and deploy Erlang nodes onthose instances For communicating withthe cloud providers the Orchestration

15httpsgithubcombasholager

30

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

service uses an external library calledLibcloud16 for which Erlang Solutions haswritten an open source Erlang wrappercalled elibcloud to make Libcloud eas-ier to use from WombatOAM Note thatWombatOAM Orchestration doesnrsquot pro-vide a platform for writing Erlang appli-cations it provides infrastructure for de-ploying them

Deployment The mechanism consists of thefollowing five steps

1 Registering a provider WombatOAM pro-vides the same interface for differentcloud providers which support the Open-Stack standard or the Amazon EC2 APIWombatOAM also provides the same in-terface for using a fixed set of machinesIn WombatOAMrsquos backend this has beenimplemented as two driver modules theelibcloud driver module which uses theelibcloud and Libcloud libraries to com-municate with the cloud providers andthe SSH driver module that keeps track ofa fixed set of machines

2 Uploading a release The release can be ei-ther a proper Erlang release archive or aset of Erlang modules The only importantaspect from WombatOAMrsquos point of viewis that start and stop commands shouldbe explicitly specified This will enableWombatOAM start and stop nodes whenneeded

3 Defining a node family The next step is cre-ating the node family which is the entitythat refers to a certain release containsdeployment domains that refer to certainproviders and contains other informationnecessary to deploy a node

16The unified cloud API [4] httpslibcloud

apacheorg

4 Defining a deployment domain At this stepa deployment domain is created that spec-ifies(i) which providers should be used forprovisioning machines (ii) the usernamethat will be used when WombatOAM con-nects to the hosts using SSH

5 Node deployment To deploy nodes aWombatOAM user needs only to specifythe number of nodes and the node fam-ily these nodes belong to Nodes can bedynamically added to or removed fromthe system depending on the needs of theapplication The nodes are started andWombatOAM is ready to initiate and runthe application

iii Monitoring and Visualisation

A key aim in designing new monitoring andvisualisation tools and adapting existing oneswas to provide support for systems running onparallel and distributed hardware Specificallyin order to run modern Erlang systems andin particular SD Erlang systems it is necessaryto understand both their single host (ldquomulti-corerdquo) and multi-host (ldquodistributedrdquo) natureThat is we need to be able to understand howsystems run on the Erlang multicore virtualmachine where the scheduler associated witha core manages its own run queue and pro-cesses migrate between the queues througha work stealing algorithm at the same timewe have to understand the dynamics of a dis-tributed Erlang program where the user ex-plicitly spawns processes to nodes17

iii1 Percept2

Percept218 builds on the existing Percept toolto provide post hoc offline analysis and visuali-

17This is in contrast with the multicore VM whereprogrammers have no control over where processes arespawned however they still need to gain insight into thebehaviour of their programs to tune performance

18httpsgithubcomRefactoringToolspercept2

31

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 22 Offline profiling of a distributed Erlang application using Percept2

sation of Erlang systems Percept2 is designedto allow users to visualise and tune the paral-lel performance of Erlang systems on a singlenode on a single manycore host It visualisesErlang application level concurrency and iden-tifies concurrency bottlenecks Percept2 usesErlang built in tracing and profiling to monitorprocess states ie waiting running runnablefree and exiting A waiting or suspended pro-cess is considered an inactive and a running orrunnable process is considered active As a pro-gram runs with Percept events are collectedand stored to a file The file is then analysedwith the results stored in a RAM database andthis data is viewed through a web-based in-terface The process of offline profiling for adistributed Erlang application using Percept2is shown in Figure 22

Percept generates an application-levelzoomable concurrency graph showing thenumber of active processes at each pointduring profiling dips in the graph representlow concurrency A lifetime bar for eachprocess is also included showing the pointsduring its lifetime when the process was activeas well as other per-process information

Percept2 extends Percept in a number ofways mdash as detailed by [53] mdash including mostimportantly

bull Distinguishing between running andrunnable time for each process this isapparent in the process runnability com-parison as shown in Figure 23 where or-ange represents runnable and green rep-resents running This shows very clearlywhere potential concurrency is not beingexploited

bull Showing scheduler activity the numberof active schedulers at any time during theprofiling

bull Recording more information during exe-cution including the migration history ofa process between run queues statisticsabout message passing between processesthe number of messages and the averagemessage size sentreceived by a processthe accumulated runtime per-process theaccumulated time when a process is in arunning state

bull Presenting the process tree the hierarchy

32

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 23 Percept2 showing processes running and runnable (with 4 schedulersrun queues)

structure indicating the parent-child rela-tionships between processes

bull Recording dynamic function callgraphcounttime the hierarchy struc-ture showing the calling relationshipsbetween functions during the programrun and the amount of time spent in afunction

bull Tracing of s_group activities in a dis-tributed system Unlike global groups_group allows dynamic changes to thes_group structure of a distributed Erlangsystem In order to support SD Erlang wehave also extended Percept2 to allow theprofiling of s_group related activities sothat the dynamic changes to the s_groupstructure of a distributed Erlang systemcan be captured

We have also improved on Percept as follows

bull Enabling finer-grained control of whatis profiled The profiling of port activi-ties schedulers activities message pass-ing process migration garbage collec-tion and s_group activities can be en-ableddisabled while the profiling of pro-cess runnability (indicated by the ldquoprocrdquoflag) is always enabled

bull Selective function profiling of processesIn Percept2 we have built a version offprof which does not measure a functionrsquosown execution time but measures every-thing else that fprof measures Eliminatingmeasurement of a functionrsquos own execu-tion time gives users the freedom of notprofiling all the function calls invoked dur-ing the program execution For examplethey can choose to profile only functionsdefined in their own applicationsrsquo codeand not those in libraries

bull Improved dynamic function callgraphWith the dynamic function callgraph auser is able to understand the causes ofcertain events such as heavy calls of a par-ticular function by examining the regionaround the node for the function includ-ing the path to the root of the graph Eachedge in the callgraph is annotated withthe number of times the target functionis called by the source function as well asfurther information

Finally we have also improved the scalabilityof Percept in three ways First we have paral-lelised the processing of trace files so that mul-tiple data files can be processed at the sametime We have also compressed the representa-tion of call graphs and cached the history of

33

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

generated web pages Together these make thesystem more responsive and more scalable

iii2 DTraceSystemTap tracing

DTrace provides dynamic tracing support forvarious flavours of Unix including BSD andMac OS X and SystemTap does the same forLinux both allow the monitoring of live run-ning systems with minimal overhead Theycan be used by administrators language devel-opers and application developers alike to ex-amine the behaviour of applications languageimplementations and the operating system dur-ing development or even on live productionsystems In comparison to other similar toolsand instrumentation frameworks they are rel-atively lightweight do not require special re-compiled versions of the software to be ex-amined nor special post-processing tools tocreate meaningful information from the datagathered Using these probs it is possible toidentify bottlenecks both in the VM itself andin applications

VM bottlenecks are identified using a largenumber of probes inserted into ErlangOTPrsquosVM for example to explore scheduler run-queue lengths These probes can be used tomeasure the number of processes per sched-uler the number of processes moved duringwork stealing the number of attempts to gaina run-queue lock how many of these succeedimmediately etc Figure 24 visualises the re-sults of such monitoring it shows how the sizeof run queues vary during execution of thebang BenchErl benchmark on a VM with 16schedulers

Application bottlenecks are identified by analternative back-end for Percept2 based onDTraceSystemTap instead of the Erlang built-in tracing mechanism The implementation re-uses the existing Percept2 infrastructure as faras possible It uses a different mechanism forcollecting information about Erlang programsa different format for the trace files but the

0

500

1000

1500

2000

2500

3000

3500

4000

4500

Run Queues

RunQueue0RunQueue1RunQueue2RunQueue3RunQueue4RunQueue5RunQueue6RunQueue7RunQueue8RunQueue9RunQueue10RunQueue11RunQueue12RunQueue13RunQueue14RunQueue15

Figure 24 DTrace runqueue visualisation of bang

benchmark with 16 schedulers (time on X-axis)

same storage infrastructure and presentationfacilities

iii3 Devo

Devo19 [10] is designed to provide real-time on-line visualisation of both the low-level (singlenode multiple cores) and high-level (multi-ple nodes grouped into s_groups) aspects ofErlang and SD Erlang systems Visualisationis within a browser with web sockets provid-ing the connections between the JavaScript vi-sualisations and the running Erlang systemsinstrumented through the trace tool builder(ttb)

Figure 25 shows visualisations from devoin both modes On the left-hand side a sin-gle compute node is shown This consists oftwo physical chips (the upper and lower halvesof the diagram) with six cores each with hy-perthreading this gives twelve virtual coresand hence 24 run queues in total The size ofthese run queues is shown by both the colourand height of each column and process migra-tions are illustrated by (fading) arc between thequeues within the circle A green arc shows mi-

19httpsgithubcomRefactoringToolsdevo

34

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

p0 9152 0

p0 9501 0

p0 10890 0

p0 10891 0

Figure 25 Devo low and high-level visualisations ofSD-Orbit

gration on the same physical core a grey oneon the same chip and blue shows migrationsbetween the two chips

On the right-hand side of Figure 25 is a vi-sualisation of SD-Orbit in action Each nodein the graph represents an Erlang node andcolours (red green blue and orange) are usedto represent the s_groups to which a node be-longs As is evident the three nodes in thecentral triangle belong to multiple groups andact as routing nodes between the other nodesThe colour of the arc joining two nodes repre-

sents the current intensity of communicationbetween the nodes (green quiescent red busi-est)

iii4 SD-Mon

SD-Mon is a tool specifically designed for mon-itoring SD-Erlang systems This purpose isaccomplished by means of a shadow network ofagents that collect data from a running sys-tem An example deployment is shown inFigure 26 where blue dots represent nodesin the target system and the other nodes makeup the SD-Mon infrastructure The network isdeployed on the basis of a configuration filedescribing the network architecture in terms ofhosts Erlang nodes global group and s_grouppartitions Tracing to be performed on moni-tored nodes is also specified within the config-uration file

An agent is started by a master SD-Mon nodefor each s_group and for each free node Con-figured tracing is applied on every monitorednode and traces are stored in binary format inthe agent file system The shadow network fol-lows system changes so that agents are startedand stopped at runtime as required as shownin Figure 27 Such changes are persistentlystored so that the last configuration can be re-produced after a restart Of course the shadownetwork can be always updated via the UserInterface

Each agent takes care of an s_group or of afree node At start-up it tries to get in contactwith its nodes and apply the tracing to them asstated by the master Binary files are stored inthe host file system Tracing is internally usedin order to track s_group operations happeningat runtime An asynchronous message is sentto the master whenever one of these changes oc-curs Since each process can only be traced bya single process at a time each node (includedthose belonging to more than one s_group) iscontrolled by only one agent When a node isremoved from a group or when the group is

35

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 26 SD-Mon architecture

deleted another agent takes over as shown inFigure 27 When an agent is stopped all traceson the controlled nodes are switched off

The monitoring network is also supervisedin order to take account of network fragilityand when an agent node goes down anothernode is deployed to play its role there are alsoperiodic consistency checks for the system as awhole and when an inconsistency is detectedthen that part of the system can be restarted

SD-Mon does more than monitor activitiesone node at a time In particular inter-node andinter-group messages are displayed at runtimeAs soon as an agent is stopped the relatedtracing files are fetched across the network bythe master and they are made available in areadable format in the master file system

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution. Before (top) and after eliminating s_group 2 (bottom).


Figure 28: SD-Mon online monitoring.

VIII Systemic Evaluation

Preceding sections have investigated improvements to individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI and VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state of the art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable20. The bulk of the experiments reported here are conducted on the Athos cluster using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure. The benchmarks

20 See Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf


Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling]. Relative speedup against number of nodes (cores), for D-Orbit and SD-Orbit at W=2M and W=5M.

also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.
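
For reference, writing T(n) for the execution time on n cores: relative speedup is speedup(n) = T(1)/T(n), with ideal value n under strong scaling (fixed total work); under weak scaling the work grows in proportion to n, so the ideal outcome is constant execution time, i.e. T(n) ≈ T(1).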

i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot standard deviation. Results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups, i.e. beyond 80 nodes in the case of 2M orbit elements and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails because some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5,000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.


Figure 30: WombatOAM deployment time. Time (sec) against number of nodes N; best fit to the experimental data is 0.0125N + 83.

The measurement data shows two important facts: WombatOAM scales well, up to a deployment base of 10,000 Erlang nodes, and WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, whereas SR-ACO registers them only within an s_group (Section ii).
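
Schematically, the two versions differ only in the registry used (a sketch; the process and group names are illustrative, and the s_group call follows the SD Erlang library API):

    %% GR-ACO: the name is replicated and kept synchronised on every
    %% connected node in the system.
    yes = global:register_name(submaster1, SubMasterPid),

    %% SR-ACO: the name is registered, and visible, only within one
    %% s_group (here the illustrative group colony_group).
    yes = s_group:register_name(colony_group, submaster1, SubMasterPid),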

For both GR-ACO and SR-ACO a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common X86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, so around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark and obtained similar results [23].
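
The essence of such a service is small; a minimal per-node Chaos Monkey could be written along the following lines (a sketch, not the code used in the experiments; the rand module is available from Erlang/OTP 18.0, random:uniform/1 on earlier releases):

    %% Kill one randomly chosen local process every second, forever.
    %% (A production service would protect essential system processes
    %% such as init.)
    chaos_monkey() ->
        timer:sleep(1000),
        Victims = [P || P <- erlang:processes(), P =/= self()],
        Victim = lists:nth(rand:uniform(length(Victims)), Victims),
        exit(Victim, kill),
        chaos_monkey().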

We conclude that both GR-ACO and SR-ACO are reliable and that SD Erlang preserves


Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling]. Execution time (s) against number of nodes (cores), for GR-ACO, SR-ACO and ML-ACO.

the distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.

Scalability. Figure 31 compares the runtimes of the ML, GR and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and a fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks and

others that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang


Figure 32: Number of sent packets in ML-ACO, GR-ACO and SR-ACO.

preserves the distributed Erlang language-level reliability model.

As SD-Orbit scales better than D-Orbit, SR-ACO scales better than ML-ACO, and SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limits of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language while preserving reliability. The work is a vade mecum for addressing scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering VM, language and persistent storage levels. We have developed the


BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15] or Legion [40]. We establish scientifically the folklore limitations of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups for partitioning the network and global process namespace, and semi-explicit process placement for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).
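
In outline, the two constructs are used as follows (a sketch with illustrative node, group and function names; we assume the s_group API of Section V, including choose_nodes/1 for attribute-based node selection as in [60]):

    %% Partition: create an s_group; connections and registered names
    %% are then maintained only within the group. (Return value shape
    %% abbreviated; see the s_group API.)
    {ok, _, _} = s_group:new_s_group(aco_group, ['n1@h1', 'n2@h2', 'n3@h3']),

    %% Semi-explicit placement: choose a target node by attribute
    %% rather than by explicit node name, then spawn onto it.
    %% (colony_loop/0 is an illustrative worker function.)
    [Target | _] = s_group:choose_nodes([{s_group, aco_group}]),
    Pid = spawn(Target, fun colony_loop/0),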

To improve the scalability of the Erlang VM and libraries we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores and to minimise contention on shared VM resources (Section VI).
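
From the programmer's side, most of these VM improvements are reached through existing ets options; for example (standard ets calls; whether the contention-adapting tree backs ordered_set depends on the OTP release in use):

    %% A shared table tuned for concurrent access: write_concurrency
    %% requests finer-grained (bucket) locking and, in OTP releases
    %% that include it, the contention-adapting tree for ordered_set.
    Tab = ets:new(shared_stats, [ordered_set, public,
                                 {write_concurrency, true},
                                 {read_concurrency, true}]),
    true = ets:insert(Tab, {hits, 0}),
    1 = ets:update_counter(Tab, hits, 1),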

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the


Table 2: Cluster Specifications (RAM per host in GB)

  Name     Hosts  Cores/host  Cores total  Cores max avail  Processor                                  RAM     Interconnect
  GPG         20      16          320            320        Intel Xeon E5-2640v2, 8C, 2GHz             64      10GB Ethernet
  Kalkyl     348       8         2784           1408        Intel Xeon 5520v2, 4C, 2.26GHz             24–72   InfiniBand 20Gb/s
  TinTin     160      16         2560           2240        AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz   64–128  2:1 oversubscribed QDR InfiniBand
  Athos      776      24        18624           6144        Intel Xeon E5-2697v2, 12C, 2.7GHz          64      InfiniBand FDR14

key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages21. For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3 GHz, each with 16 "Bulldozer" cores, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

21 See Deliverable 6.7 (http://www.release-project.eu/documents/D67.pdf), available online.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level Visualization of Concurrent and Distributed Computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riak docs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking/.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the sim-diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Kőszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go programming language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Technical Report Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in termite scheme. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. De-bench: a benchmark tool for distributed erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 1:1–1:1, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. Charm++: a portable concurrent object oriented system based on c++. In ACM Sigplan Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.

[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lübeck and Max Neunhöffer. Enumerating large Orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing - 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale (3rd edition, revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-Strength Functional programming: Experiences with the Ericsson AXD301 Project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.

48

  • Introduction
  • Context
    • Scalable Reliable Programming Models
    • Actor Languages
    • Erlangs Support for Concurrency
    • Scalability and Reliability in Erlang Systems
    • ETS Erlang Term Storage
      • Benchmarks for scalability and reliability
        • Orbit
        • Ant Colony Optimisation (ACO)
          • Erlang Scalability Limits
            • Scaling Erlang on a Single Host
            • Distributed Erlang Scalability
            • Persistent Storage
              • Improving Language Scalability
                • SD Erlang Design
                  • S_groups
                  • Semi-Explicit Placement
                    • S_group Semantics
                    • Semantics Validation
                      • Improving VM Scalability
                        • Improvements to Erlang Term Storage
                        • Improvements to Schedulers
                        • Improvements to Time Management
                          • Scalable Tools
                            • Refactoring for Scalability
                            • Scalable Deployment
                            • Monitoring and Visualisation
                              • Percept2
                              • DTraceSystemTap tracing
                              • Devo
                              • SD-Mon
                                  • Systemic Evaluation
                                    • Orbit
                                    • Ant Colony Optimisation (ACO)
                                    • Evaluation Summary
                                      • Discussion
Page 29: Scaling Reliably Scaling Reliably: Improving the ... · Distributed actor languages are an effective means of constructing scalable reliable systems, and the Erlang ... that capture

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

InfrastructureProvider

Wombat Web

Dashboard

Northbound Interface

Master

Middle Manager

MetricsAPI

OrchestrationAPI

TopologyAPI

Core Metrics

elibcloudLibcloud

ProvisionVMs(REST)

Agents

Metrics agent

Managed Erlang NodeVirtual machine

Erlang

ErlangRPC

Erlang

Core Metrics

REST

OrchestrationNode manager

Core DB(including

Topology and Orchestration

data)

Topology

Node manager

Provision software(SSH)

WombatCLI

REST

NotifAPI

Notif

error_logger agent

Notif

lager agent

Alarms agent

AlarmsAPI

Alarms

Alarms

Orchestration

Erlang

Figure 21 The architecture of WombatOAM

ii Scalable Deployment

We have developed the WombatOAM tool12 toprovide a deployment operations and main-tenance framework for large-scale Erlang dis-tributed systems These systems typically con-sist of a number of Erlang nodes executing ondifferent hosts These hosts may have differenthardware or operating systems be physical orvirtual or run different versions of ErlangOTP

12WombatOAM (httpswwwerlang-solutionscomproductswombat-oamhtml) is a commercial toolavailable from Erlang Solutions Ltd

Prior to the development of WombatOAM de-ployment of systems would use scripting inErlang and the shell and this is the state ofthe art for other actor frameworks it would bepossible to adapt the WombatOAM approachto these frameworks in a straightforward way

Architecture The architecture ofWombatOAM is summarised in Figure 21Originally the system had problems address-ing full scalability because of the role playedby the central Master node in its current

29

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

version an additional layer of Middle Managerswas introduced to allow the system to scaleeasily to thousands of deployed nodes As thediagram shows the ldquonorthboundrdquo interfacesto the web dashboard and the command-lineare provided through RESTful connectionsto the Master The operations of the Masterare delegated to the Middle Managers thatengage directly with the managed nodes Eachmanaged node runs a collection of services thatcollect metrics raise alarms and so forth wedescribe those now

Services WombatOAM is designed to collectstore and display various kinds of informationand event from running Erlang systems andthese data are accessed and managed throughan AJAX-based Web Dashboard these includethe following

Metrics WombatOAM supports the collectionof some hundred metrics mdash including forinstance numbers of processes on a nodeand the message traffic in and out of thenode mdash on a regular basis from the ErlangVMs running on each of the hosts It canalso collect metrics defined by users withinother metrics collection frameworks suchas Folsom13 and interface with other toolssuch as graphite14 which can log and dis-play such information The metrics can bedisplayed as histograms covering differentwindows such as the last fifteen minuteshour day week or month

Notifications As well as metrics it can sup-port event-based monitoring through thecollection of notifications from runningnodes Notifications which are one timeevents can be generated using the ErlangSystem Architecture Support LibrariesSASL which is part of the standard distri-

13Folsom collects and publishes metrics through anErlang API httpsgithubcomboundaryfolsom

14httpsgraphiteapporg

bution or the lager logging framework15and will be displayed and logged as theyoccur

Alarms Alarms are more complex entitiesAlarms have a state they can be raised andonce dealt with they can be cleared theyalso have identities so that the same alarmmay be raised on the same node multipletimes and each instance will need to bedealt with separately Alarms are gener-ated by SASL or lager just as for notifica-tions

Topology The Topology service handlesadding deleting and discovering nodesIt also monitors whether they are accessi-ble and if not it notifies the other servicesand periodically tries to reconnect Whenthe nodes are available again it also noti-fies the other services It doesnrsquot have amiddle manager part because it doesnrsquottalk to the nodes directly instead it asksthe Node manager service to do so

Node manager This service maintains the con-nection to all managed nodes via theErlang distribution protocol If it losesthe connection towards a node it periodi-cally tries to reconnect It also maintainsthe states of the nodes in the database (egif the connection towards a node is lostthe Node manager changes the node stateto DOWN and raises an alarm) The Nodemanager doesnrsquot have a REST API sincethe node states are provided via the Topol-ogy servicersquos REST API

Orchestration This service can deploy newErlang nodes on already running ma-chines It can also provision new vir-tual machine instances using several cloudproviders and deploy Erlang nodes onthose instances For communicating withthe cloud providers the Orchestration

15httpsgithubcombasholager

30

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

service uses an external library calledLibcloud16 for which Erlang Solutions haswritten an open source Erlang wrappercalled elibcloud to make Libcloud eas-ier to use from WombatOAM Note thatWombatOAM Orchestration doesnrsquot pro-vide a platform for writing Erlang appli-cations it provides infrastructure for de-ploying them

Deployment The mechanism consists of thefollowing five steps

1 Registering a provider WombatOAM pro-vides the same interface for differentcloud providers which support the Open-Stack standard or the Amazon EC2 APIWombatOAM also provides the same in-terface for using a fixed set of machinesIn WombatOAMrsquos backend this has beenimplemented as two driver modules theelibcloud driver module which uses theelibcloud and Libcloud libraries to com-municate with the cloud providers andthe SSH driver module that keeps track ofa fixed set of machines

2 Uploading a release The release can be ei-ther a proper Erlang release archive or aset of Erlang modules The only importantaspect from WombatOAMrsquos point of viewis that start and stop commands shouldbe explicitly specified This will enableWombatOAM start and stop nodes whenneeded

3 Defining a node family The next step is cre-ating the node family which is the entitythat refers to a certain release containsdeployment domains that refer to certainproviders and contains other informationnecessary to deploy a node

16The unified cloud API [4] httpslibcloud

apacheorg

4 Defining a deployment domain At this stepa deployment domain is created that spec-ifies(i) which providers should be used forprovisioning machines (ii) the usernamethat will be used when WombatOAM con-nects to the hosts using SSH

5 Node deployment To deploy nodes aWombatOAM user needs only to specifythe number of nodes and the node fam-ily these nodes belong to Nodes can bedynamically added to or removed fromthe system depending on the needs of theapplication The nodes are started andWombatOAM is ready to initiate and runthe application

iii Monitoring and Visualisation

A key aim in designing new monitoring andvisualisation tools and adapting existing oneswas to provide support for systems running onparallel and distributed hardware Specificallyin order to run modern Erlang systems andin particular SD Erlang systems it is necessaryto understand both their single host (ldquomulti-corerdquo) and multi-host (ldquodistributedrdquo) natureThat is we need to be able to understand howsystems run on the Erlang multicore virtualmachine where the scheduler associated witha core manages its own run queue and pro-cesses migrate between the queues througha work stealing algorithm at the same timewe have to understand the dynamics of a dis-tributed Erlang program where the user ex-plicitly spawns processes to nodes17

iii1 Percept2

Percept218 builds on the existing Percept toolto provide post hoc offline analysis and visuali-

17This is in contrast with the multicore VM whereprogrammers have no control over where processes arespawned however they still need to gain insight into thebehaviour of their programs to tune performance

18httpsgithubcomRefactoringToolspercept2

31

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 22 Offline profiling of a distributed Erlang application using Percept2

sation of Erlang systems Percept2 is designedto allow users to visualise and tune the paral-lel performance of Erlang systems on a singlenode on a single manycore host It visualisesErlang application level concurrency and iden-tifies concurrency bottlenecks Percept2 usesErlang built in tracing and profiling to monitorprocess states ie waiting running runnablefree and exiting A waiting or suspended pro-cess is considered an inactive and a running orrunnable process is considered active As a pro-gram runs with Percept events are collectedand stored to a file The file is then analysedwith the results stored in a RAM database andthis data is viewed through a web-based in-terface The process of offline profiling for adistributed Erlang application using Percept2is shown in Figure 22

Percept generates an application-levelzoomable concurrency graph showing thenumber of active processes at each pointduring profiling dips in the graph representlow concurrency A lifetime bar for eachprocess is also included showing the pointsduring its lifetime when the process was activeas well as other per-process information

Percept2 extends Percept in a number ofways mdash as detailed by [53] mdash including mostimportantly

bull Distinguishing between running andrunnable time for each process this isapparent in the process runnability com-parison as shown in Figure 23 where or-ange represents runnable and green rep-resents running This shows very clearlywhere potential concurrency is not beingexploited

bull Showing scheduler activity the numberof active schedulers at any time during theprofiling

bull Recording more information during exe-cution including the migration history ofa process between run queues statisticsabout message passing between processesthe number of messages and the averagemessage size sentreceived by a processthe accumulated runtime per-process theaccumulated time when a process is in arunning state

bull Presenting the process tree the hierarchy

32

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 23 Percept2 showing processes running and runnable (with 4 schedulersrun queues)

structure indicating the parent-child rela-tionships between processes

bull Recording dynamic function callgraphcounttime the hierarchy struc-ture showing the calling relationshipsbetween functions during the programrun and the amount of time spent in afunction

bull Tracing of s_group activities in a dis-tributed system Unlike global groups_group allows dynamic changes to thes_group structure of a distributed Erlangsystem In order to support SD Erlang wehave also extended Percept2 to allow theprofiling of s_group related activities sothat the dynamic changes to the s_groupstructure of a distributed Erlang systemcan be captured

We have also improved on Percept as follows

bull Enabling finer-grained control of whatis profiled The profiling of port activi-ties schedulers activities message pass-ing process migration garbage collec-tion and s_group activities can be en-ableddisabled while the profiling of pro-cess runnability (indicated by the ldquoprocrdquoflag) is always enabled

bull Selective function profiling of processesIn Percept2 we have built a version offprof which does not measure a functionrsquosown execution time but measures every-thing else that fprof measures Eliminatingmeasurement of a functionrsquos own execu-tion time gives users the freedom of notprofiling all the function calls invoked dur-ing the program execution For examplethey can choose to profile only functionsdefined in their own applicationsrsquo codeand not those in libraries

bull Improved dynamic function callgraphWith the dynamic function callgraph auser is able to understand the causes ofcertain events such as heavy calls of a par-ticular function by examining the regionaround the node for the function includ-ing the path to the root of the graph Eachedge in the callgraph is annotated withthe number of times the target functionis called by the source function as well asfurther information

Finally we have also improved the scalabilityof Percept in three ways First we have paral-lelised the processing of trace files so that mul-tiple data files can be processed at the sametime We have also compressed the representa-tion of call graphs and cached the history of

33

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

generated web pages Together these make thesystem more responsive and more scalable

iii2 DTraceSystemTap tracing

DTrace provides dynamic tracing support forvarious flavours of Unix including BSD andMac OS X and SystemTap does the same forLinux both allow the monitoring of live run-ning systems with minimal overhead Theycan be used by administrators language devel-opers and application developers alike to ex-amine the behaviour of applications languageimplementations and the operating system dur-ing development or even on live productionsystems In comparison to other similar toolsand instrumentation frameworks they are rel-atively lightweight do not require special re-compiled versions of the software to be ex-amined nor special post-processing tools tocreate meaningful information from the datagathered Using these probs it is possible toidentify bottlenecks both in the VM itself andin applications

VM bottlenecks are identified using a largenumber of probes inserted into ErlangOTPrsquosVM for example to explore scheduler run-queue lengths These probes can be used tomeasure the number of processes per sched-uler the number of processes moved duringwork stealing the number of attempts to gaina run-queue lock how many of these succeedimmediately etc Figure 24 visualises the re-sults of such monitoring it shows how the sizeof run queues vary during execution of thebang BenchErl benchmark on a VM with 16schedulers

Application bottlenecks are identified by analternative back-end for Percept2 based onDTraceSystemTap instead of the Erlang built-in tracing mechanism The implementation re-uses the existing Percept2 infrastructure as faras possible It uses a different mechanism forcollecting information about Erlang programsa different format for the trace files but the

0

500

1000

1500

2000

2500

3000

3500

4000

4500

Run Queues

RunQueue0RunQueue1RunQueue2RunQueue3RunQueue4RunQueue5RunQueue6RunQueue7RunQueue8RunQueue9RunQueue10RunQueue11RunQueue12RunQueue13RunQueue14RunQueue15

Figure 24 DTrace runqueue visualisation of bang

benchmark with 16 schedulers (time on X-axis)

same storage infrastructure and presentationfacilities

iii3 Devo

Devo19 [10] is designed to provide real-time on-line visualisation of both the low-level (singlenode multiple cores) and high-level (multi-ple nodes grouped into s_groups) aspects ofErlang and SD Erlang systems Visualisationis within a browser with web sockets provid-ing the connections between the JavaScript vi-sualisations and the running Erlang systemsinstrumented through the trace tool builder(ttb)

Figure 25 shows visualisations from devoin both modes On the left-hand side a sin-gle compute node is shown This consists oftwo physical chips (the upper and lower halvesof the diagram) with six cores each with hy-perthreading this gives twelve virtual coresand hence 24 run queues in total The size ofthese run queues is shown by both the colourand height of each column and process migra-tions are illustrated by (fading) arc between thequeues within the circle A green arc shows mi-

19httpsgithubcomRefactoringToolsdevo

34

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

p0 9152 0

p0 9501 0

p0 10890 0

p0 10891 0

Figure 25 Devo low and high-level visualisations ofSD-Orbit

gration on the same physical core a grey oneon the same chip and blue shows migrationsbetween the two chips

On the right-hand side of Figure 25 is a vi-sualisation of SD-Orbit in action Each nodein the graph represents an Erlang node andcolours (red green blue and orange) are usedto represent the s_groups to which a node be-longs As is evident the three nodes in thecentral triangle belong to multiple groups andact as routing nodes between the other nodesThe colour of the arc joining two nodes repre-

sents the current intensity of communicationbetween the nodes (green quiescent red busi-est)

iii4 SD-Mon

SD-Mon is a tool specifically designed for mon-itoring SD-Erlang systems This purpose isaccomplished by means of a shadow network ofagents that collect data from a running sys-tem An example deployment is shown inFigure 26 where blue dots represent nodesin the target system and the other nodes makeup the SD-Mon infrastructure The network isdeployed on the basis of a configuration filedescribing the network architecture in terms ofhosts Erlang nodes global group and s_grouppartitions Tracing to be performed on moni-tored nodes is also specified within the config-uration file

An agent is started by a master SD-Mon nodefor each s_group and for each free node Con-figured tracing is applied on every monitorednode and traces are stored in binary format inthe agent file system The shadow network fol-lows system changes so that agents are startedand stopped at runtime as required as shownin Figure 27 Such changes are persistentlystored so that the last configuration can be re-produced after a restart Of course the shadownetwork can be always updated via the UserInterface

Each agent takes care of an s_group or of afree node At start-up it tries to get in contactwith its nodes and apply the tracing to them asstated by the master Binary files are stored inthe host file system Tracing is internally usedin order to track s_group operations happeningat runtime An asynchronous message is sentto the master whenever one of these changes oc-curs Since each process can only be traced bya single process at a time each node (includedthose belonging to more than one s_group) iscontrolled by only one agent When a node isremoved from a group or when the group is

35

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 26 SD-Mon architecture

deleted another agent takes over as shown inFigure 27 When an agent is stopped all traceson the controlled nodes are switched off

The monitoring network is also supervisedin order to take account of network fragilityand when an agent node goes down anothernode is deployed to play its role there are alsoperiodic consistency checks for the system as awhole and when an inconsistency is detectedthen that part of the system can be restarted

SD-Mon does more than monitor activitiesone node at a time In particular inter-node andinter-group messages are displayed at runtimeAs soon as an agent is stopped the relatedtracing files are fetched across the network bythe master and they are made available in areadable format in the master file system

SD-Mon provides facilities for online visual-isation of this data as well as post hoc offlineanalysis Figure 28 shows in real time mes-sages that are sent between s_groups This datacan also be used as input to the animated Devovisualisation as illustrated in the right-handside of Figure 25

Figure 27 SD-Mon evolution Before (top) and aftereliminating s_group 2 (bottom)

36

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 28 SD-Mon online monitoring

VIII Systemic Evaluation

Preceding sections have investigated the im-provements of individual aspects of an Erlangsystem eg ETS tables in Section i This sec-tion analyses the impact of the new tools andtechnologies from Sections V VI VII in concertWe do so by investigating the deployment re-liability and scalability of the Orbit and ACObenchmarks from Section III The experimentsreported here are representative Similar exper-iments show consistent results for a range ofmicro-benchmarks several benchmarks andthe very substantial (approximately 150K linesof Erlang) Sim-Diasca case study [16] on sev-eral state of the art NUMA architectures and

the four clusters specified in Appendix A A co-herent presentation of many of these results isavailable in an article by [22] and in a RELEASEproject deliverable20 The bulk of the exper-iments reported here are conducted on theAthos cluster using ErlangOTP 174 and theassociated SD Erlang libraries

The experiments cover two measures of scal-ability As Orbit does a fixed size computa-tion the scaling measure is relative speedup (orstrong scaling) ie speedup relative to execu-tion time on a single core As the work in ACOincreases with compute resources weak scalingis the appropriate measure The benchmarks

20See Deliverable 62 available online httpwww

release-projecteudocumentsD62pdf

37

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

0

100

200

300

400

450

0 50(1200) 100(2400) 150(3600) 200(4800) 257(6168)

Rel

ativ

e S

pee

du

p

Number of nodes (cores)

D-Orbit W=2MD-Orbit W=5M

SD-Orbit W=2MSD-Orbit W=5M

Figure 29 Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling]

also evaluate different aspects of s_groups Or-bit evaluates the scalability impacts of networkconnections while ACO evaluates the impactof both network connections and the globalnamespace required for reliability

i Orbit

Figure 29 shows the speedup of the D-Orbitand SD-Orbit benchmarks from Section i1 Themeasurements are repeated seven times andwe plot standard deviation Results show thatD-Orbit performs better on a small numberof nodes as communication is direct ratherthan via a gateway node As the number ofnodes grows however SD-Orbit delivers betterspeedups ie beyond 80 nodes in case of 2Morbit elements and beyond 100 nodes in caseof 5M orbit elements When we increase thesize of Orbit beyond 5M the D-Orbit versionfails due to the fact that some VMs exceed theavailable RAM of 64GB In contrast SD-Orbitexperiments run successfully even for an orbitwith 60M elements

ii Ant Colony Optimisation (ACO)

Deployment The deployment and moni-toring of ACO and of the large (150Klines of Erlang) Sim-Diasca simulation usingWombatOAM (Section ii) on the Athos clusteris detailed in [22]

An example experiment deploys 10000Erlang nodes without enabling monitoringand hence allocates three nodes per core (ie72 nodes on each of 139 24-core Athos hosts)Figure 30 shows that WombatOAM deploysthe nodes in 212s which is approximately 47nodes per second It is more common to haveat least one core per Erlang node and in a re-lated experiment 5000 nodes are deployed oneper core (ie 24 nodes on each of 209 24-coreAthos hosts) in 101s or 50 nodes per secondCrucially in both cases the deployment timeis linear in the number of nodes The deploy-ment time could be reduced to be logarithmicin the number of nodes using standard divide-and-conquer techniques However there hasnrsquotbeen the demand to do so as most Erlang sys-tems are long running servers

38

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

80

120

160

200

0 2000 4000 6000 8000 10000

Tim

e (s

ec)

Number of nodes (N)

Experimental data Best fit 00125N+83

Figure 30 Wombat Deployment Time

The measurement data shows two impor-tant facts it shows that WombatOAM scaleswell (up to a deployment base of 10000 Erlangnodes) and that WombatOAM is non-intrusivebecause its overhead on a monitored node istypically less than 15 of effort on the node

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.
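The article does not list its deployment scripts; a minimal sketch of plain distributed Erlang deployment driven by a configuration file (here the standard ~/.hosts.erlang file read by net_adm; module name ours, and the classic slave module is used for illustration) might look like:

    %% deploy_sketch.erl -- illustrative only, not the project's scripts.
    %% Starts one Erlang node per host listed in ~/.hosts.erlang.
    -module(deploy_sketch).
    -export([start_cluster/1]).

    start_cluster(NodeName) ->
        Hosts = net_adm:host_file(),          % hosts from the config file
        [begin
             {ok, Node} = slave:start(Host, NodeName),
             Node                              % e.g. 'aco@athos001'
         end || Host <- Hosts].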

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, while SR-ACO registers them only within an s_group (Section ii).

For both GR-ACO and SR-ACO, a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common X86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, i.e. around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark and obtained similar results [23].
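The Chaos Monkey service itself is not shown in the article; the essence of what it does on each node can be sketched in a few lines of Erlang (module name ours; the rand module assumes OTP 18 or later, and a real service, following [14], would protect vital system processes rather than killing blindly):

    %% chaos_monkey_sketch.erl -- kills a random local process once a
    %% second, mimicking the failure injection described above.
    -module(chaos_monkey_sketch).
    -export([start/0]).

    start() -> spawn(fun loop/0).

    loop() ->
        timer:sleep(1000),                            % one failure per second
        case processes() -- [self()] of
            []      -> ok;
            Victims ->
                Victim = lists:nth(rand:uniform(length(Victims)), Victims),
                exit(Victim, kill)                    % unconditional kill
        end,
        loop().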

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.

[Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling]. Execution time (s) vs number of nodes (cores), from 0 to 250 nodes (6000 cores), for GR-ACO, SR-ACO and ML-ACO.]


Scalability. Figure 31 compares the runtimes of the ML, GR and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it also registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and the fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks, and others, that partitioning the network of Erlang nodes significantly improves performance at large scale.
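Concretely, the difference between the two reliability schemes is the scope in which a critical process name is registered; a hedged sketch (the s_group signature follows the SD Erlang papers and may differ in detail from the released libraries):

    %% GR-ACO style: global registration, synchronised across all
    %% connected nodes (standard Erlang/OTP 'global' module).
    register_globally(Name, Pid) ->
        yes = global:register_name(Name, Pid).

    %% SR-ACO style: registration visible only within one s_group, so
    %% only that group's nodes share the namespace (SD Erlang library).
    register_in_group(SGroup, Name, Pid) ->
        yes = s_group:register_name(SGroup, Name, Pid).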

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO, and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii. Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.

[Figure 32: Number of sent packets in ML-ACO, GR-ACO and SR-ACO.]


As SD-Orbit scales better than D-Orbit, SR-ACO scales better than ML-ACO, and SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M and W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limit of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX. Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language while preserving reliability. The work is a vade mecum for addressing scalability in reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering the VM, language, and persistent storage levels, and we have developed the BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15], or Legion [40]. We establish scientifically the folklore limitation of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups, for partitioning the network and the global process namespace, and semi-explicit process placement, for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).
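To give a flavour of the two constructs, here is a sketch only: the function names and return shapes follow the SD Erlang papers (e.g. [60]) and may differ in the released libraries, and the worker module name is ours:

    %% Partition the network: create an s_group over three nodes.
    Nodes = ['n1@host1', 'n2@host2', 'n3@host3'],
    {ok, group1, _} = s_group:new_s_group(group1, Nodes),

    %% Semi-explicit placement: ask for candidate nodes by attribute
    %% instead of hard-coding a node name.
    [Target | _] = s_group:choose_nodes([{s_group, group1}]),
    Pid = spawn(Target, my_worker, start, []).   % my_worker is hypothetical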

To improve the scalability of the Erlang VM and libraries we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups has been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).
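The programmer-visible side of this tuning appears in standard ETS table options; a minimal example using the stock Erlang/OTP API (table and key names ours):

    %% Concurrency-related options at table creation steer the VM's
    %% locking strategy for the table.
    Tab = ets:new(stats, [set, public,
                          {write_concurrency, true},   % fine-grained bucket locks
                          {read_concurrency, true}]),  % read-optimised synchronisation
    true = ets:insert(Tab, {hits, 0}),
    1 = ets:update_counter(Tab, hits, 1).              % atomic increment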

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).



In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages.21 For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

21 See Deliverable 6.7 (http://www.release-project.eu/documents/D67.pdf), available online.

Appendix A. Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3GHz, each with 16 "Bulldozer" cores, giving a total of 64 cores; (2) an Intel NUMA machine with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

Table 2: Cluster Specifications (RAM per host in GB)

Name    Hosts  Cores     Cores  Cores      Processor                                  RAM     Interconnection
               per host  total  max avail
GPG     20     16        320    320        Intel Xeon E5-2640v2, 8C, 2GHz             64      10GB Ethernet
Kalkyl  348    8         2784   1408       Intel Xeon 5520v2, 4C, 2.26GHz             24-72   InfiniBand 20Gb/s
TinTin  160    16        2560   2240       AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz   64-128  2:1 oversubscribed QDR Infiniband
Athos   776    24        18624  6144       Intel Xeon E5-2697v2, 12C, 2.7GHz          64      Infiniband FDR14



Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level Visualization of Concurrent and Distributed Computation in Erlang. In Visual Languages and Computing (VLC), The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riak docs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Koszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go programming language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Technical Report Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary, adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 1:1–1:1, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. Charm++: a portable concurrent object oriented system based on C++. In ACM Sigplan Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak Core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.

[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lübeck and Max Neunhöffer. Enumerating large Orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing, 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop: The Definitive Guide - Storage and Analysis at Internet Scale (3rd ed., revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-Strength Functional programming: Experiences with the Ericsson AXD301 Project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.


P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

version an additional layer of Middle Managerswas introduced to allow the system to scaleeasily to thousands of deployed nodes As thediagram shows the ldquonorthboundrdquo interfacesto the web dashboard and the command-lineare provided through RESTful connectionsto the Master The operations of the Masterare delegated to the Middle Managers thatengage directly with the managed nodes Eachmanaged node runs a collection of services thatcollect metrics raise alarms and so forth wedescribe those now

Services WombatOAM is designed to collectstore and display various kinds of informationand event from running Erlang systems andthese data are accessed and managed throughan AJAX-based Web Dashboard these includethe following

Metrics WombatOAM supports the collectionof some hundred metrics mdash including forinstance numbers of processes on a nodeand the message traffic in and out of thenode mdash on a regular basis from the ErlangVMs running on each of the hosts It canalso collect metrics defined by users withinother metrics collection frameworks suchas Folsom13 and interface with other toolssuch as graphite14 which can log and dis-play such information The metrics can bedisplayed as histograms covering differentwindows such as the last fifteen minuteshour day week or month

Notifications As well as metrics it can sup-port event-based monitoring through thecollection of notifications from runningnodes Notifications which are one timeevents can be generated using the ErlangSystem Architecture Support LibrariesSASL which is part of the standard distri-

13Folsom collects and publishes metrics through anErlang API httpsgithubcomboundaryfolsom

14httpsgraphiteapporg

bution or the lager logging framework15and will be displayed and logged as theyoccur

Alarms Alarms are more complex entitiesAlarms have a state they can be raised andonce dealt with they can be cleared theyalso have identities so that the same alarmmay be raised on the same node multipletimes and each instance will need to bedealt with separately Alarms are gener-ated by SASL or lager just as for notifica-tions

Topology The Topology service handlesadding deleting and discovering nodesIt also monitors whether they are accessi-ble and if not it notifies the other servicesand periodically tries to reconnect Whenthe nodes are available again it also noti-fies the other services It doesnrsquot have amiddle manager part because it doesnrsquottalk to the nodes directly instead it asksthe Node manager service to do so

Node manager This service maintains the con-nection to all managed nodes via theErlang distribution protocol If it losesthe connection towards a node it periodi-cally tries to reconnect It also maintainsthe states of the nodes in the database (egif the connection towards a node is lostthe Node manager changes the node stateto DOWN and raises an alarm) The Nodemanager doesnrsquot have a REST API sincethe node states are provided via the Topol-ogy servicersquos REST API

Orchestration This service can deploy newErlang nodes on already running ma-chines It can also provision new vir-tual machine instances using several cloudproviders and deploy Erlang nodes onthose instances For communicating withthe cloud providers the Orchestration

15httpsgithubcombasholager

30

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

service uses an external library calledLibcloud16 for which Erlang Solutions haswritten an open source Erlang wrappercalled elibcloud to make Libcloud eas-ier to use from WombatOAM Note thatWombatOAM Orchestration doesnrsquot pro-vide a platform for writing Erlang appli-cations it provides infrastructure for de-ploying them

Deployment The mechanism consists of thefollowing five steps

1 Registering a provider WombatOAM pro-vides the same interface for differentcloud providers which support the Open-Stack standard or the Amazon EC2 APIWombatOAM also provides the same in-terface for using a fixed set of machinesIn WombatOAMrsquos backend this has beenimplemented as two driver modules theelibcloud driver module which uses theelibcloud and Libcloud libraries to com-municate with the cloud providers andthe SSH driver module that keeps track ofa fixed set of machines

2 Uploading a release The release can be ei-ther a proper Erlang release archive or aset of Erlang modules The only importantaspect from WombatOAMrsquos point of viewis that start and stop commands shouldbe explicitly specified This will enableWombatOAM start and stop nodes whenneeded

3 Defining a node family The next step is cre-ating the node family which is the entitythat refers to a certain release containsdeployment domains that refer to certainproviders and contains other informationnecessary to deploy a node

16The unified cloud API [4] httpslibcloud

apacheorg

4 Defining a deployment domain At this stepa deployment domain is created that spec-ifies(i) which providers should be used forprovisioning machines (ii) the usernamethat will be used when WombatOAM con-nects to the hosts using SSH

5 Node deployment To deploy nodes aWombatOAM user needs only to specifythe number of nodes and the node fam-ily these nodes belong to Nodes can bedynamically added to or removed fromthe system depending on the needs of theapplication The nodes are started andWombatOAM is ready to initiate and runthe application

iii Monitoring and Visualisation

A key aim in designing new monitoring andvisualisation tools and adapting existing oneswas to provide support for systems running onparallel and distributed hardware Specificallyin order to run modern Erlang systems andin particular SD Erlang systems it is necessaryto understand both their single host (ldquomulti-corerdquo) and multi-host (ldquodistributedrdquo) natureThat is we need to be able to understand howsystems run on the Erlang multicore virtualmachine where the scheduler associated witha core manages its own run queue and pro-cesses migrate between the queues througha work stealing algorithm at the same timewe have to understand the dynamics of a dis-tributed Erlang program where the user ex-plicitly spawns processes to nodes17

iii1 Percept2

Percept218 builds on the existing Percept toolto provide post hoc offline analysis and visuali-

17This is in contrast with the multicore VM whereprogrammers have no control over where processes arespawned however they still need to gain insight into thebehaviour of their programs to tune performance

18httpsgithubcomRefactoringToolspercept2

31

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 22 Offline profiling of a distributed Erlang application using Percept2

sation of Erlang systems Percept2 is designedto allow users to visualise and tune the paral-lel performance of Erlang systems on a singlenode on a single manycore host It visualisesErlang application level concurrency and iden-tifies concurrency bottlenecks Percept2 usesErlang built in tracing and profiling to monitorprocess states ie waiting running runnablefree and exiting A waiting or suspended pro-cess is considered an inactive and a running orrunnable process is considered active As a pro-gram runs with Percept events are collectedand stored to a file The file is then analysedwith the results stored in a RAM database andthis data is viewed through a web-based in-terface The process of offline profiling for adistributed Erlang application using Percept2is shown in Figure 22

Percept generates an application-levelzoomable concurrency graph showing thenumber of active processes at each pointduring profiling dips in the graph representlow concurrency A lifetime bar for eachprocess is also included showing the pointsduring its lifetime when the process was activeas well as other per-process information

Percept2 extends Percept in a number ofways mdash as detailed by [53] mdash including mostimportantly

bull Distinguishing between running andrunnable time for each process this isapparent in the process runnability com-parison as shown in Figure 23 where or-ange represents runnable and green rep-resents running This shows very clearlywhere potential concurrency is not beingexploited

bull Showing scheduler activity the numberof active schedulers at any time during theprofiling

bull Recording more information during exe-cution including the migration history ofa process between run queues statisticsabout message passing between processesthe number of messages and the averagemessage size sentreceived by a processthe accumulated runtime per-process theaccumulated time when a process is in arunning state

bull Presenting the process tree the hierarchy

32

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 23 Percept2 showing processes running and runnable (with 4 schedulersrun queues)

structure indicating the parent-child rela-tionships between processes

bull Recording dynamic function callgraphcounttime the hierarchy struc-ture showing the calling relationshipsbetween functions during the programrun and the amount of time spent in afunction

bull Tracing of s_group activities in a dis-tributed system Unlike global groups_group allows dynamic changes to thes_group structure of a distributed Erlangsystem In order to support SD Erlang wehave also extended Percept2 to allow theprofiling of s_group related activities sothat the dynamic changes to the s_groupstructure of a distributed Erlang systemcan be captured

We have also improved on Percept as follows

bull Enabling finer-grained control of whatis profiled The profiling of port activi-ties schedulers activities message pass-ing process migration garbage collec-tion and s_group activities can be en-ableddisabled while the profiling of pro-cess runnability (indicated by the ldquoprocrdquoflag) is always enabled

bull Selective function profiling of processesIn Percept2 we have built a version offprof which does not measure a functionrsquosown execution time but measures every-thing else that fprof measures Eliminatingmeasurement of a functionrsquos own execu-tion time gives users the freedom of notprofiling all the function calls invoked dur-ing the program execution For examplethey can choose to profile only functionsdefined in their own applicationsrsquo codeand not those in libraries

bull Improved dynamic function callgraphWith the dynamic function callgraph auser is able to understand the causes ofcertain events such as heavy calls of a par-ticular function by examining the regionaround the node for the function includ-ing the path to the root of the graph Eachedge in the callgraph is annotated withthe number of times the target functionis called by the source function as well asfurther information

Finally we have also improved the scalabilityof Percept in three ways First we have paral-lelised the processing of trace files so that mul-tiple data files can be processed at the sametime We have also compressed the representa-tion of call graphs and cached the history of

33

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

generated web pages Together these make thesystem more responsive and more scalable

iii2 DTraceSystemTap tracing

DTrace provides dynamic tracing support forvarious flavours of Unix including BSD andMac OS X and SystemTap does the same forLinux both allow the monitoring of live run-ning systems with minimal overhead Theycan be used by administrators language devel-opers and application developers alike to ex-amine the behaviour of applications languageimplementations and the operating system dur-ing development or even on live productionsystems In comparison to other similar toolsand instrumentation frameworks they are rel-atively lightweight do not require special re-compiled versions of the software to be ex-amined nor special post-processing tools tocreate meaningful information from the datagathered Using these probs it is possible toidentify bottlenecks both in the VM itself andin applications

VM bottlenecks are identified using a largenumber of probes inserted into ErlangOTPrsquosVM for example to explore scheduler run-queue lengths These probes can be used tomeasure the number of processes per sched-uler the number of processes moved duringwork stealing the number of attempts to gaina run-queue lock how many of these succeedimmediately etc Figure 24 visualises the re-sults of such monitoring it shows how the sizeof run queues vary during execution of thebang BenchErl benchmark on a VM with 16schedulers

Application bottlenecks are identified by analternative back-end for Percept2 based onDTraceSystemTap instead of the Erlang built-in tracing mechanism The implementation re-uses the existing Percept2 infrastructure as faras possible It uses a different mechanism forcollecting information about Erlang programsa different format for the trace files but the

0

500

1000

1500

2000

2500

3000

3500

4000

4500

Run Queues

RunQueue0RunQueue1RunQueue2RunQueue3RunQueue4RunQueue5RunQueue6RunQueue7RunQueue8RunQueue9RunQueue10RunQueue11RunQueue12RunQueue13RunQueue14RunQueue15

Figure 24 DTrace runqueue visualisation of bang

benchmark with 16 schedulers (time on X-axis)

same storage infrastructure and presentationfacilities

iii3 Devo

Devo19 [10] is designed to provide real-time on-line visualisation of both the low-level (singlenode multiple cores) and high-level (multi-ple nodes grouped into s_groups) aspects ofErlang and SD Erlang systems Visualisationis within a browser with web sockets provid-ing the connections between the JavaScript vi-sualisations and the running Erlang systemsinstrumented through the trace tool builder(ttb)

Figure 25 shows visualisations from devoin both modes On the left-hand side a sin-gle compute node is shown This consists oftwo physical chips (the upper and lower halvesof the diagram) with six cores each with hy-perthreading this gives twelve virtual coresand hence 24 run queues in total The size ofthese run queues is shown by both the colourand height of each column and process migra-tions are illustrated by (fading) arc between thequeues within the circle A green arc shows mi-

19httpsgithubcomRefactoringToolsdevo

34

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

p0 9152 0

p0 9501 0

p0 10890 0

p0 10891 0

Figure 25 Devo low and high-level visualisations ofSD-Orbit

gration on the same physical core a grey oneon the same chip and blue shows migrationsbetween the two chips

On the right-hand side of Figure 25 is a vi-sualisation of SD-Orbit in action Each nodein the graph represents an Erlang node andcolours (red green blue and orange) are usedto represent the s_groups to which a node be-longs As is evident the three nodes in thecentral triangle belong to multiple groups andact as routing nodes between the other nodesThe colour of the arc joining two nodes repre-

sents the current intensity of communicationbetween the nodes (green quiescent red busi-est)

iii4 SD-Mon

SD-Mon is a tool specifically designed for mon-itoring SD-Erlang systems This purpose isaccomplished by means of a shadow network ofagents that collect data from a running sys-tem An example deployment is shown inFigure 26 where blue dots represent nodesin the target system and the other nodes makeup the SD-Mon infrastructure The network isdeployed on the basis of a configuration filedescribing the network architecture in terms ofhosts Erlang nodes global group and s_grouppartitions Tracing to be performed on moni-tored nodes is also specified within the config-uration file

An agent is started by a master SD-Mon nodefor each s_group and for each free node Con-figured tracing is applied on every monitorednode and traces are stored in binary format inthe agent file system The shadow network fol-lows system changes so that agents are startedand stopped at runtime as required as shownin Figure 27 Such changes are persistentlystored so that the last configuration can be re-produced after a restart Of course the shadownetwork can be always updated via the UserInterface

Each agent takes care of an s_group or of afree node At start-up it tries to get in contactwith its nodes and apply the tracing to them asstated by the master Binary files are stored inthe host file system Tracing is internally usedin order to track s_group operations happeningat runtime An asynchronous message is sentto the master whenever one of these changes oc-curs Since each process can only be traced bya single process at a time each node (includedthose belonging to more than one s_group) iscontrolled by only one agent When a node isremoved from a group or when the group is

35

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 26 SD-Mon architecture

deleted another agent takes over as shown inFigure 27 When an agent is stopped all traceson the controlled nodes are switched off

The monitoring network is also supervisedin order to take account of network fragilityand when an agent node goes down anothernode is deployed to play its role there are alsoperiodic consistency checks for the system as awhole and when an inconsistency is detectedthen that part of the system can be restarted

SD-Mon does more than monitor activitiesone node at a time In particular inter-node andinter-group messages are displayed at runtimeAs soon as an agent is stopped the relatedtracing files are fetched across the network bythe master and they are made available in areadable format in the master file system

SD-Mon provides facilities for online visual-isation of this data as well as post hoc offlineanalysis Figure 28 shows in real time mes-sages that are sent between s_groups This datacan also be used as input to the animated Devovisualisation as illustrated in the right-handside of Figure 25

Figure 27 SD-Mon evolution Before (top) and aftereliminating s_group 2 (bottom)

36

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 28 SD-Mon online monitoring

VIII Systemic Evaluation

Preceding sections have investigated the im-provements of individual aspects of an Erlangsystem eg ETS tables in Section i This sec-tion analyses the impact of the new tools andtechnologies from Sections V VI VII in concertWe do so by investigating the deployment re-liability and scalability of the Orbit and ACObenchmarks from Section III The experimentsreported here are representative Similar exper-iments show consistent results for a range ofmicro-benchmarks several benchmarks andthe very substantial (approximately 150K linesof Erlang) Sim-Diasca case study [16] on sev-eral state of the art NUMA architectures and

the four clusters specified in Appendix A A co-herent presentation of many of these results isavailable in an article by [22] and in a RELEASEproject deliverable20 The bulk of the exper-iments reported here are conducted on theAthos cluster using ErlangOTP 174 and theassociated SD Erlang libraries

The experiments cover two measures of scal-ability As Orbit does a fixed size computa-tion the scaling measure is relative speedup (orstrong scaling) ie speedup relative to execu-tion time on a single core As the work in ACOincreases with compute resources weak scalingis the appropriate measure The benchmarks

20See Deliverable 62 available online httpwww

release-projecteudocumentsD62pdf

37

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

0

100

200

300

400

450

0 50(1200) 100(2400) 150(3600) 200(4800) 257(6168)

Rel

ativ

e S

pee

du

p

Number of nodes (cores)

D-Orbit W=2MD-Orbit W=5M

SD-Orbit W=2MSD-Orbit W=5M

Figure 29 Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling]

also evaluate different aspects of s_groups Or-bit evaluates the scalability impacts of networkconnections while ACO evaluates the impactof both network connections and the globalnamespace required for reliability

i Orbit

Figure 29 shows the speedup of the D-Orbitand SD-Orbit benchmarks from Section i1 Themeasurements are repeated seven times andwe plot standard deviation Results show thatD-Orbit performs better on a small numberof nodes as communication is direct ratherthan via a gateway node As the number ofnodes grows however SD-Orbit delivers betterspeedups ie beyond 80 nodes in case of 2Morbit elements and beyond 100 nodes in caseof 5M orbit elements When we increase thesize of Orbit beyond 5M the D-Orbit versionfails due to the fact that some VMs exceed theavailable RAM of 64GB In contrast SD-Orbitexperiments run successfully even for an orbitwith 60M elements

ii Ant Colony Optimisation (ACO)

Deployment The deployment and moni-toring of ACO and of the large (150Klines of Erlang) Sim-Diasca simulation usingWombatOAM (Section ii) on the Athos clusteris detailed in [22]

An example experiment deploys 10000Erlang nodes without enabling monitoringand hence allocates three nodes per core (ie72 nodes on each of 139 24-core Athos hosts)Figure 30 shows that WombatOAM deploysthe nodes in 212s which is approximately 47nodes per second It is more common to haveat least one core per Erlang node and in a re-lated experiment 5000 nodes are deployed oneper core (ie 24 nodes on each of 209 24-coreAthos hosts) in 101s or 50 nodes per secondCrucially in both cases the deployment timeis linear in the number of nodes The deploy-ment time could be reduced to be logarithmicin the number of nodes using standard divide-and-conquer techniques However there hasnrsquotbeen the demand to do so as most Erlang sys-tems are long running servers

38

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

80

120

160

200

0 2000 4000 6000 8000 10000

Tim

e (s

ec)

Number of nodes (N)

Experimental data Best fit 00125N+83

Figure 30 Wombat Deployment Time

The measurement data shows two impor-tant facts it shows that WombatOAM scaleswell (up to a deployment base of 10000 Erlangnodes) and that WombatOAM is non-intrusivebecause its overhead on a monitored node istypically less than 15 of effort on the node

We conclude that WombatOAM is capableof deploying and monitoring substantial dis-tributed Erlang and SD Erlang programs Theexperiments in the remainder of this sectionuse standard distributed Erlang configurationfile deployment

Reliability SD Erlang changes the organisa-tion of processes and their recovery data atthe language level so we seek to show thatthese changes have not disrupted Erlangrsquosworld-class reliability mechanisms at this levelAs we havenrsquot changed them we donrsquot exer-cise Erlangrsquos other reliability mechanisms egthose for managing node failures network con-gestion etc A more detailed study of SDErlang reliability including the use of repli-cated databases for recovering Instant Messen-ger chat sessions finds similar results [23]

We evaluate the reliability of two ACO ver-sions using a Chaos Monkey service that killsprocesses in the running system at random [14]Recall that GR-ACO provides reliability by reg-istering the names of critical processes globallyand SR-ACO registers them only within ans_group (Section ii)

For both GR-ACO and SR-ACO a ChaosMonkey runs on every Erlang node ie mas-ter submasters and colony nodes killing arandom Erlang process every second BothACO versions run to completion Recoveryat this failure frequency has no measurableimpact on runtime This is because processesare recovered within the Virtual machine using(globally synchronised) local recovery informa-tion For example on a common X86Ubuntuplatform typical Erlang process recovery timesare around 03ms so around 1000x less thanthe Unix process recovery time on the sameplatform [58] We have conducted more de-tailed experiments on an Instant Messengerbenchmark and obtained similar results [23]

We conclude that both GR-ACO and SR-ACO are reliable and that SD Erlang preserves

39

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

0

4

8

12

16

0 50(1200) 100(2400) 150(3600) 200(4800) 250(6000)

Ex

ecu

tio

n t

ime (

s)

Number of nodes (cores)

GR-ACO SR-ACO ML-ACO

Figure 31 ACO execution times ErlangOTP 174 (RELEASE) [Weak Scaling]

distributed Erlang reliability model The re-mainder of the section outlines the impact ofmaintaining the recovery information requiredfor reliability on scalability

Scalability Figure 31 compares the runtimesof the ML GR and SR versions of ACO (Sec-tion ii) on ErlangOTP 174(RELEASE) As out-lined in Section i1 GR-ACO not only main-tains a fully connected graph of nodes it reg-isters process names for reliability and hencescales significantly worse than the unreliableML-ACO We conclude that providing reliabil-ity with standard distributed Erlang processregistration dramatically limits scalability

While ML-ACO does not provide reliabilityand hence doesnrsquot register process names itmaintains a fully connected graph of nodeswhich limits scalability SR-ACO that main-tains connections and registers process namesonly within s_groups scales best of all Fig-ure 31 illustrates how maintaining the processnamespace and fully connected network im-pacts performance This reinforces the evi-dence from the Orbit benchmarks and oth-

ers that partitioning the network of Erlangnodes significantly improves performance atlarge scale

To investigate the impact of SD Erlang onnetwork traffic we measure the number of sentand received packets on the GPG cluster forthree versions of ACO ML-ACO GR-ACOand SR-ACO Figure 32 shows the total numberof sent packets The highest traffic (the red line)belongs to the GR-ACO and the lowest trafficbelongs to the SR-ACO (dark blue line) Thisshows that SD Erlang significantly reduces thenetwork traffic between Erlang nodes Evenwith the s_group name registration SR-ACOhas less network traffic than ML-ACO that hasno global name registration

iii Evaluation Summary

We have shown that WombatOAM is capableof deploying and monitoring substantial dis-tributed Erlang and SD Erlang programs likeACO and Sim-Diasca The Chaos Monkey ex-periments with GR-ACO and SR-ACO showthat both are reliable and hence that SD Erlang

40

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 32 Number of sent packets in ML-ACO GR-ACO and SR-ACO

preserves the distributed Erlang language-levelreliability model

As SD Orbit scales better than D-Orbit SR-ACO scales better than ML-ACO and SR-ACO has significantly less network trafficwe conclude that even when global recoverydata is not maintained partitioning the fully-connected network into s_groups reduces net-work traffic and improves performance Whilethe distributed Orbit instances (W=2M) and(W=5M) reach scalability limits at around 40and 60 nodes Orbit scales to 150 nodes on SDErlang (limited by input size) and SR-ACOis still scaling well on 256 nodes (6144 cores)Hence not only have we exceeded the 60 nodescaling limits of distributed Erlang identifiedin Section ii we have not reached the scalinglimits of SD Erlang on this architecture

Comparing GR-ACO and ML-ACO scalabil-ity curves shows that maintaining global re-covery data ie a process name space dra-matically limits scalability Comparing GR-ACO and SR-ACO scalability curves showsthat scalability can be recovered by partitioningthe nodes into appropriately-sized s_groups

and hence maintaining the recovery data onlywithin a relatively small group of nodes Theseresults are consistent with other experiments

IX Discussion

Distributed actor platforms like Erlang orScala with Akka are a common choice forinternet-scale system architects as the modelwith its automatic and VM-supported reliabil-ity mechanisms makes it extremely easy to engi-neer scalable reliable systems Targeting emergentserver architectures with hundreds of hostsand tens of thousands of cores we report asystematic effort to improve the scalability of aleading distributed actor language while pre-serving reliability The work is a vade mecumfor addressing scalability of reliable actor lan-guages and frameworks It is also high impactwith downloads of our improved ErlangOTPrunning at 50K a month

We have undertaken the first systematic study of scalability in a distributed actor language, covering the VM, language, and persistent storage levels. We have developed the BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15], or Legion [40]. We establish scientifically the folklore limitation of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including the memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups, for partitioning the network and the global process namespace, and semi-explicit process placement, for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).
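For example, with semi-explicit placement a process is spawned on some node satisfying given attributes rather than on an explicitly named host. A sketch follows (the attribute list and the spawned function ant_colony:run/1 are invented; choose_nodes/1 follows the SD Erlang design, so treat the exact call as illustrative):

    %% Ask for candidate nodes in our own s_group and spawn there,
    %% leaving the concrete host choice to the placement library.
    [Target | _] = s_group:choose_nodes([{s_group, colony_1}]),
    Worker = spawn(Target, ant_colony, run, [Params])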

To improve the scalability of the Erlang VM and libraries we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).
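Several of these optimisations are reached through ordinary ETS table options. A brief sketch (table name and entries invented) of creating a table tuned for concurrent access:

    %% write_concurrency enables fine-grained (bucket) locking and,
    %% for ordered_set tables in later OTP releases, the
    %% contention-adapting tree; read_concurrency optimises for
    %% many concurrent readers.
    Tab = ets:new(pheromone, [ordered_set, public,
                              {read_concurrency, true},
                              {write_concurrency, true}]),
    true = ets:insert(Tab, {{city, 1}, 0.5}),
    [{{city, 1}, _Level}] = ets:lookup(Tab, {city, 1})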

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

Table 2: Cluster Specifications (RAM per host in GB)

Name     Hosts   Cores/host   Cores total   Cores max avail   Processor                                   RAM      Interconnection
GPG      20      16           320           320               Intel Xeon E5-2640v2, 8C, 2.0GHz            64       10GB Ethernet
Kalkyl   348     8            2784          1408              Intel Xeon 5520v2, 4C, 2.26GHz              24-72    InfiniBand 20Gb/s
TinTin   160     16           2560          2240              AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz    64-128   2:1 oversubscribed QDR InfiniBand
Athos    776     24           18624         6144              Intel Xeon E5-2697v2, 12C, 2.7GHz           64       InfiniBand FDR14

In future work we plan to incorporate the RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages: see Deliverable 6.7 (http://www.release-project.eu/documents/D67.pdf), available online. For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3GHz, each with 16 "Bulldozer" cores, giving a total of 64 cores; (2) an Intel NUMA machine with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.
[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.
[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).
[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.
[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.
[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.
[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.
[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.
[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level visualization of concurrent and distributed computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.
[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.
[12] Basho Technologies. Riakdocs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking/.
[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.
[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.
[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.
[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Koszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.
[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.
[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.
[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.
[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.
[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.
[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.
[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.
[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.
[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.
[28] Alan A. A. Donovan and Brian W. Kernighan. The Go Programming Language. Addison-Wesley Professional, 2015.
[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.
[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.
[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.
[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.
[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Technical Report Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.
[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.
[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.
[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.
[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.
[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.
[41] Carl Hewitt. Actor model for discretionary adaptive concurrency. CoRR, abs/1008.1459, 2010.
[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.
[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 1:1–1:1, New York, NY, USA, 2008. ACM.
[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.
[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. In ACM SIGPLAN Notices, volume 28, pages 91–108. ACM, 1993.
[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.
[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.
[49] Rusty Klophaus. Riak Core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.
[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.
[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.
[52] Huiqing Li and Simon Thompson. Automated API migration in a user-extensible refactoring tool for Erlang programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.
[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, 2013.
[54] Huiqing Li and Simon Thompson. Improved semantics and implementation through property-based testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.
[55] Huiqing Li and Simon Thompson. Safe concurrency introduction through slicing. In PEPM '15. ACM SIGPLAN, January 2015.
[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.
[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.
[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897/.
[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.
[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.
[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.
[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.
[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.
[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.
[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.
[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.
[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.
[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computing Society, 2015.
[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing - 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.
[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.
[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.
[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.
[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.
[76] Marcus Völker. Linux timers, May 2014.
[77] WhatsApp, 2015. https://www.whatsapp.com.
[78] Tom White. Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale (3. ed., revised and updated). O'Reilly, 2012.
[79] Ulf Wiger. Industrial-strength functional programming: Experiences with the Ericsson AXD301 project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.




hsu-hhdeSMTWTP

[35] Guillaume Germain Concurrency ori-ented programming in termite schemeIn Proceedings of the 2006 ACM SIGPLANworkshop on Erlang pages 20ndash20 NewYork NY USA 2006 ACM

[36] Amir Ghaffari De-bench a benchmarktool for distributed erlang 2014 https

githubcomamirghaffariDEbench

[37] Amir Ghaffari Investigating the scalabil-ity limits of distributed Erlang In Proceed-ings of the 13th ACM SIGPLAN Workshopon Erlang pages 43ndash49 ACM 2014

[38] Amir Ghaffari Natalia Chechina PhipTrinder and Jon Meredith Scalable persis-tent storage for Erlang Theory and prac-tice In Proceedings of the 12th ACM SIG-PLAN Workshop on Erlang pages 73ndash74New York NY USA 2013 ACM

[39] Seth Gilbert and Nancy Lynch Brewerrsquosconjecture and the feasibility of consistentavailable partition-tolerant web servicesACM SIGACT News 3351ndash59 2002

[40] Andrew S Grimshaw Wm A Wulf andthe Legion team The Legion vision of aworldwide virtual computer Communica-tions of the ACM 40(1)39ndash45 1997

[41] Carl Hewitt Actor model for discre-tionary adaptive concurrency CoRRabs10081459 2010

[42] Carl Hewitt Peter Bishop and RichardSteiger A universal modular ACTOR for-malism for artificial intelligence In IJ-CAIrsquo73 pages 235ndash245 San Francisco CAUSA 1973 Morgan Kaufmann

[43] Rich Hickey The Clojure programminglanguage In DLSrsquo08 pages 11ndash11 NewYork NY USA 2008 ACM

[44] Zoltaacuten Horvaacuteth Laacuteszloacute Loumlvei TamaacutesKozsik Roacutebert Kitlei Anikoacute Nagyneacute ViacutegTamaacutes Nagy Melinda Toacuteth and RolandKiraacutely Building a refactoring tool forErlang In Workshop on Advanced SoftwareDevelopment Tools and Techniques WAS-DETT 2008 2008

[45] Laxmikant V Kale and Sanjeev KrishnanCharm++ a portable concurrent objectoriented system based on c++ In ACMSigplan Notices volume 28 pages 91ndash108ACM 1993

[46] David Klaftenegger Konstantinos Sago-nas and Kjell Winblad On the scalabilityof the Erlang term storage In Proceedingsof the Twelfth ACM SIGPLAN Workshop onErlang pages 15ndash26 New York NY USASeptember 2013 ACM

[47] David Klaftenegger Konstantinos Sago-nas and Kjell Winblad Delegation lock-ing libraries for improved performance of

46

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

multithreaded programs In Euro-Par 2014Parallel Processing volume 8632 of LNCSpages 572ndash583 Springer 2014

[48] David Klaftenegger Konstantinos Sago-nas and Kjell Winblad Queue delegationlocking IEEE Transactions on Parallel andDistributed Systems 2017 To appear

[49] Rusty Klophaus Riak core Building dis-tributed applications without shared stateIn ACM SIGPLAN Commercial Users ofFunctional Programming CUFP rsquo10 pages141ndash141 New York NY USA 2010ACM

[50] Avinash Lakshman and Prashant MalikCassandra A decentralized structuredstorage system SIGOPS Oper Syst Rev44(2)35ndash40 April 2010

[51] J Lee et al Python actor runtime library2010 oslcsuiuieduparley

[52] Huiqing Li and Simon Thompson Auto-mated API Migration in a User-ExtensibleRefactoring Tool for Erlang Programs InTim Menzies and Motoshi Saeki editorsAutomated Software Engineering ASErsquo12IEEE Computer Society 2012

[53] Huiqing Li and Simon Thompson Multi-core profiling for Erlang programs usingPercept2 In Proceedings of the Twelfth ACMSIGPLAN workshop on Erlang 2013

[54] Huiqing Li and Simon Thompson Im-proved Semantics and ImplementationThrough Property-Based Testing withQuickCheck In 9th International Workshopon Automation of Software Test 2014

[55] Huiqing Li and Simon Thompson SafeConcurrency Introduction through SlicingIn PEPM rsquo15 ACM SIGPLAN January2015

[56] Huiqing Li Simon Thompson GyoumlrgyOrosz and Melinda Toumlth Refactoring

with Wrangler updated In ACM SIG-PLAN Erlang Workshop volume 2008 2008

[57] Frank Lubeck and Max Neunhoffer Enu-merating large Orbits and direct conden-sation Experimental Mathematics 10(2)197ndash205 2001

[58] Andreea Lutac Natalia Chechina Ger-ardo Aragon-Camarasa and Phil TrinderTowards reliable and scalable robot com-munication In Proceedings of the 15th Inter-national Workshop on Erlang Erlang 2016pages 12ndash23 New York NY USA 2016ACM

[59] LWNnet The high-resolution timerAPI January 2006 httpslwnnet

Articles167897

[60] Kenneth MacKenzie Natalia Chechinaand Phil Trinder Performance portabilitythrough semi-explicit placement in dis-tributed Erlang In Proceedings of the 14thACM SIGPLAN Workshop on Erlang pages27ndash38 ACM 2015

[61] Jeff Matocha and Tracy Camp A taxon-omy of distributed termination detectionalgorithms Journal of Systems and Software43(221)207ndash221 1998

[62] Nicholas D Matsakis and Felix S Klock IIThe Rust language In ACM SIGAda AdaLetters volume 34 pages 103ndash104 ACM2014

[63] Robert McNaughton Scheduling withdeadlines and loss functions ManagementScience 6(1)1ndash12 1959

[64] Martin Odersky et al The Scala program-ming language 2012 wwwscala-langorg

[65] William F Opdyke Refactoring Object-Oriented Frameworks PhD thesis Uni-versity of Illinois at Urbana-Champaign1992

47

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

[66] Nikolaos Papaspyrou and KonstantinosSagonas On preserving term sharing inthe Erlang virtual machine In TorbenHoffman and John Hughes editors Pro-ceedings of the 11th ACM SIGPLAN ErlangWorkshop pages 11ndash20 ACM 2012

[67] RELEASE Project Team EU Framework7 Project 287510 (2011ndash2015) 2015 httpwwwrelease-projecteu

[68] Konstantinos Sagonas and ThanassisAvgerinos Automatic refactoring ofErlang programs In Principles and Prac-tice of Declarative Programming (PPDPrsquo09)pages 13ndash24 ACM 2009

[69] Konstantinos Sagonas and Kjell WinbladMore scalable ordered set for ETS usingadaptation In Proc ACM SIGPLAN Work-shop on Erlang pages 3ndash11 ACM Septem-ber 2014

[70] Konstantinos Sagonas and Kjell WinbladContention adapting search trees In Pro-ceedings of the 14th International Sympo-sium on Parallel and Distributed Computingpages 215ndash224 IEEE Computing Society2015

[71] Konstantinos Sagonas and Kjell WinbladEfficient support for range queries andrange updates using contention adaptingsearch trees In Languages and Compilersfor Parallel Computing - 28th InternationalWorkshop volume 9519 of LNCS pages37ndash53 Springer 2016

[72] Marc Snir Steve W Otto D WWalker Jack Dongarra and Steven Huss-Lederman MPI The Complete ReferenceMIT Press Cambridge MA USA 1995

[73] Sriram Srinivasan and Alan MycroftKilim Isolation-typed actors for Java InECOOPrsquo08 pages 104ndash128 Berlin Heidel-berg 2008 Springer-Verlag

[74] Don Syme Adam Granicz and AntonioCisternino Expert F 40 Springer 2015

[75] Simon Thompson and Huiqing Li Refac-toring tools for functional languages Jour-nal of Functional Programming 23(3) 2013

[76] Marcus Voumllker Linux timers May 2014

[77] WhatsApp 2015 httpswwwwhatsappcom

[78] Tom White Hadoop - The Definitive GuideStorage and Analysis at Internet Scale (3 edrevised and updated) OrsquoReilly 2012

[79] Ulf Wiger Industrial-Strength Functionalprogramming Experiences with the Er-icsson AXD301 Project In ImplementingFunctional Languages (IFLrsquo00) Aachen Ger-many September 2000 Springer-Verlag

48

  • Introduction
  • Context
    • Scalable Reliable Programming Models
    • Actor Languages
    • Erlangs Support for Concurrency
    • Scalability and Reliability in Erlang Systems
    • ETS Erlang Term Storage
      • Benchmarks for scalability and reliability
        • Orbit
        • Ant Colony Optimisation (ACO)
          • Erlang Scalability Limits
            • Scaling Erlang on a Single Host
            • Distributed Erlang Scalability
            • Persistent Storage
              • Improving Language Scalability
                • SD Erlang Design
                  • S_groups
                  • Semi-Explicit Placement
                    • S_group Semantics
                    • Semantics Validation
                      • Improving VM Scalability
                        • Improvements to Erlang Term Storage
                        • Improvements to Schedulers
                        • Improvements to Time Management
                          • Scalable Tools
                            • Refactoring for Scalability
                            • Scalable Deployment
                            • Monitoring and Visualisation
                              • Percept2
                              • DTraceSystemTap tracing
                              • Devo
                              • SD-Mon
                                  • Systemic Evaluation
                                    • Orbit
                                    • Ant Colony Optimisation (ACO)
                                    • Evaluation Summary
                                      • Discussion
Page 32: Scaling Reliably Scaling Reliably: Improving the ... · Distributed actor languages are an effective means of constructing scalable reliable systems, and the Erlang ... that capture


Figure 22: Offline profiling of a distributed Erlang application using Percept2

visualisation of Erlang systems. Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node, on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling for a distributed Erlang application using Percept2 is shown in Figure 22.
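This workflow can be driven from the Erlang shell. The sketch below assumes Percept2 mirrors the profile/analyze/start_webserver API of OTP's Percept; the file name, entry point and port are invented, and exact option names may differ between releases:

    %% Hedged sketch of an offline Percept2 session (API assumed to
    %% mirror OTP's percept module; names below are illustrative).
    %% 1. Run the program under test, collecting trace events to a file.
    percept2:profile("my_app.dat", {my_app, run_benchmark, []}, [proc]).
    %% 2. Analyse the trace file(s) into the RAM database.
    percept2:analyze(["my_app.dat"]).
    %% 3. Browse the results through the web-based interface.
    percept2:start_webserver(8888).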

Percept generates an application-level, zoomable concurrency graph, showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways, as detailed by [53], including most importantly:

• Distinguishing between running and runnable time for each process: this is apparent in the process runnability comparison, as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including: the migration history of a process between run queues; statistics about message passing between processes (the number of messages and the average message size sent/received by a process); the accumulated runtime per process; and the accumulated time a process spends in the running state.

• Presenting the process tree: the hierarchy


Figure 23: Percept2 showing processes running and runnable (with 4 schedulers/run queues)

structure indicating the parent-child relationships between processes.

• Recording the dynamic function call graph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system. Unlike global groups, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have also extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating measurement of a function's own execution time gives users the freedom of not profiling all the function calls invoked during the program execution. For example, they can choose to profile only functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph, a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of


generated web pages. Together these make the system more responsive and more scalable.

iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, and require neither specially recompiled versions of the software to be examined, nor special post-processing tools to create meaningful information from the data gathered. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the sizes of the run queues vary during execution of the bang BenchErl benchmark on a VM with 16 schedulers.
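When the VM is built with DTrace/SystemTap support, application code can also fire user-level probes through the dyntrace module of OTP's runtime_tools; a minimal sketch follows, in which the integer tag and label are arbitrary illustrations:

    %% Fire a user-level probe that a D or SystemTap script can match;
    %% guarded so it is a no-op on VMs built without dynamic tracing.
    case dyntrace:available() of
        true  -> dyntrace:p(42, "request_handled");
        false -> ok
    end.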

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation reuses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the

Figure 24: DTrace run-queue visualisation of the bang benchmark with 16 schedulers (time on X-axis; one curve per run queue, RunQueue0 to RunQueue15)

same storage infrastructure and presentation facilities.

iii.3 Devo

Devo19 [10] is designed to provide real-time online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).
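The trace tool builder is part of OTP's runtime_tools, so collecting the underlying events can be sketched as follows (node names, file name and trace flags are illustrative choices, not Devo's exact configuration):

    %% Trace process scheduling, garbage collection and message sends
    %% on two nodes, writing the events to file for later fetching.
    ttb:tracer(['node1@host_a', 'node2@host_b'], [{file, "devo_trace"}]),
    ttb:p(all, [running, garbage_collection, send]),
    %% ... run the system under observation ...
    ttb:stop([format]).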

Figure 25 shows visualisations from Devo in both modes. On the left-hand side a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram) with six cores each; with hyperthreading this gives twelve virtual cores, and hence 24 run queues in total. The size of these run queues is shown by both the colour and height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle.

19 https://github.com/RefactoringTools/devo


Figure 25: Devo low- and high-level visualisations of SD-Orbit

A green arc shows migration on the same physical core, a grey one migration on the same chip, and blue shows migrations between the two chips.

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between the nodes (green quiescent, red busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. Tracing to be performed on monitored nodes is also specified within the configuration file.
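The configuration format itself is not reproduced here; purely as an illustration, a file of Erlang terms describing the same kinds of entities might look as follows (every name and key below is hypothetical):

    %% Hypothetical SD-Mon configuration sketch: hosts, nodes, s_group
    %% partitions, and the tracing to apply to each partition.
    {hosts,    ["host_a", "host_b"]}.
    {nodes,    ['n1@host_a', 'n2@host_a', 'n3@host_b']}.
    {s_groups, [{group1, ['n1@host_a', 'n2@host_a']},
                {group2, ['n2@host_a', 'n3@host_b']}]}.
    {tracing,  [{group1, [send, 'receive']},
                {group2, [s_group_ops]}]}.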

An agent is started by a master SD-Mon node for each s_group and for each free node. Configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up it tries to get in contact with its nodes and apply the tracing to them, as stated by the master. Binary files are stored in the host file system. Tracing is internally used in order to track s_group operations happening at runtime; an asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent.


Figure 26: SD-Mon architecture

When a node is removed from a group, or when the group is deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

The monitoring network is also supervised, in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected then that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution, before (top) and after eliminating s_group 2 (bottom)


Figure 28: SD-Mon online monitoring

VIII Systemic Evaluation

Preceding sections have investigated the improvements of individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI and VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state of the art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable20. The bulk of the experiments reported here are conducted on the Athos cluster using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure.
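Writing T(n) for the execution time on n cores, the two measures can be stated as follows (standard definitions, added here for reference):

    \[ S_{\mathrm{rel}}(n) = \frac{T(1)}{T(n)}, \qquad
       E_{\mathrm{weak}}(n) = \frac{T(1)}{T'(n)} \]

where T'(n) is the time for n times the baseline work on n cores; ideal scaling gives S_rel(n) = n and E_weak(n) = 1.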

20 See Deliverable 6.2, available online: http://www.release-project.eu/documents/D62.pdf


Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling]

The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot standard deviation. Results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups: beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails, because some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes: the best fit in Figure 30 is 0.0125N + 83 seconds, which for N = 10,000 gives 208s, close to the measured 212s. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.


Figure 30: WombatOAM deployment time for N nodes (experimental data and best fit 0.0125N + 83)

The measurement data shows two important facts: it shows that WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and that WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of the effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, and SR-ACO registers them only within an s_group (Section ii).
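Such a service is simple to express in Erlang. The sketch below is our illustration, not the code used in the experiments: once a second it kills one randomly chosen process on the local node (a production version would exclude system-critical processes):

    %% Illustrative Chaos Monkey: kill a random local process once per
    %% second. processes/0, exit/2 and rand:uniform/1 are standard.
    chaos_monkey() ->
        case [P || P <- processes(), P =/= self()] of
            []      -> ok;
            Victims -> Victim = lists:nth(rand:uniform(length(Victims)),
                                          Victims),
                       exit(Victim, kill)
        end,
        timer:sleep(1000),
        chaos_monkey().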

For both GR-ACO and SR-ACO a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common x86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, so around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark and obtained similar results [23].

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the distributed Erlang reliability model.


Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling]

The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.

Scalability. Figure 31 compares the runtimes of the ML, GR and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks and others that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.

Figure 32: Number of sent packets in ML-ACO, GR-ACO and SR-ACO

As SD-Orbit scales better than D-Orbit, SR-ACO scales better than ML-ACO, and SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60 node scaling limits of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language while preserving reliability. The work is a vade mecum for addressing scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering VM, language and persistent storage levels.


We have developed the BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15] or Legion [40]. We establish scientifically the folklore limitations of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups for partitioning the network and global process namespace, and semi-explicit process placement for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).
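For illustration, the s_group operations contrast with their global counterparts roughly as follows (group and node names invented; see [21] for the definitive API):

    %% Sketch: partition three nodes into an s_group and register a
    %% critical process within it, instead of on every connected node
    %% as global:register_name/2 would.
    s_group:new_s_group(colony1, ['n1@host_a', 'n2@host_a', 'n3@host_b']),
    s_group:register_name(colony1, master, self()),
    %% Lookup and send are likewise scoped to the s_group.
    _Pid = s_group:whereis_name(colony1, master),
    s_group:send(colony1, master, {tasks, []}).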

To improve the scalability of the Erlang VM and libraries, we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks, and of reader groups, have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation, such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).
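At the language level much of this tuning surfaces as ETS table options; a minimal sketch using the standard ets API (the concurrency options are hints whose precise effect varies by OTP release):

    %% Shared table with concurrency hints: on recent OTP releases
    %% write_concurrency selects finer-grained locking, and for
    %% ordered_set a contention-adapting tree.
    Tab = ets:new(counters, [ordered_set, public,
                             {read_concurrency, true},
                             {write_concurrency, true}]),
    true = ets:insert(Tab, {hits, 0}),
    ets:update_counter(Tab, hits, 1).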

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article.


Table 2: Cluster Specifications (RAM per host in GB)

  Name     Hosts   Cores/host   Cores (max total)   Cores (avail)   Processor                        RAM      Interconnection
  GPG         20       16             320                 320       Intel Xeon E5-2640v2, 8C, 2GHz    64      10GB Ethernet
  Kalkyl     348        8            2784                1408       Intel Xeon 5520v2, 4C, 2.26GHz    24-72   InfiniBand 20Gb/s
  TinTin     160       16            2560                2240       AMD Opteron 6220v2 Bulldozer,     64-128  2:1 oversubscribed
                                                                    8C, 3.0GHz                                QDR Infiniband
  Athos      776       24           18624                6144       Intel Xeon E5-2697v2, 12C,        64      Infiniband FDR14
                                                                    2.7GHz
While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60 node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages21. For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3 GHz, with 16 "Bulldozer" cores each, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

21 See Deliverable 6.7 (http://www.release-project.eu/documents/D67.pdf), available online.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level Visualization of Concurrent and Distributed Computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riakdocs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the sim-diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Koszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go programming language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Technical Report, Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in termite scheme. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. De-bench: a benchmark tool for distributed erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary, adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 1:1–1:1, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. Charm++: a portable concurrent object oriented system based on C++. In ACM Sigplan Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.

[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lubeck and Max Neunhoffer. Enumerating large Orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing - 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale (3. ed., revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-Strength Functional programming: Experiences with the Ericsson AXD301 Project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.

Page 33: Scaling Reliably Scaling Reliably: Improving the ... · Distributed actor languages are an effective means of constructing scalable reliable systems, and the Erlang ... that capture

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 23 Percept2 showing processes running and runnable (with 4 schedulersrun queues)

structure indicating the parent-child rela-tionships between processes

• Recording dynamic function callgraph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system. Unlike global group, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have also extended Percept2 to allow the profiling of s_group related activities, so that these dynamic changes can be captured.

We have also improved on Percept as follows:

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled (see the profiling sketch at the end of this subsection).

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating measurement of a function's own execution time gives users the freedom of not profiling all the function calls invoked during the program execution. For example, they can choose to profile only functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files, so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of generated web pages. Together these make the system more responsive and more scalable.
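As an illustration of the finer-grained profiling control described above, the sketch below drives a Percept2 session from Erlang. The orbit:bench/0 entry point is hypothetical, and the profiling flag names are indicative rather than definitive; consult the Percept2 documentation for the authoritative API.

    %% A minimal Percept2 profiling session (a sketch; flag names are
    %% indicative, and the always-on 'proc' profiling needs no flag).
    profile_orbit() ->
        percept2:profile("orbit.dat",              % trace file
                         {orbit, bench, []},       % entry point (hypothetical)
                         [message, migration, s_group]),
        percept2:analyze(["orbit.dat"]),           % build the profile database
        percept2:start_webserver(8888).            % browse the results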

iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks they are relatively lightweight, and require neither specially recompiled versions of the software to be examined nor special post-processing tools to create meaningful information from the data gathered. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.
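Besides the probes compiled into the VM, application code can fire its own user-level probes through the dyntrace module in Erlang/OTP's runtime_tools (on VMs built with DTrace/SystemTap support), so that application events can be correlated with VM events. A minimal sketch, with an illustrative payload:

    %% Fire a user-level DTrace/SystemTap probe carrying an integer and
    %% a tag string. dyntrace:available/0 reports whether the VM was
    %% built with dynamic-trace support; the payload is illustrative.
    report_sample(QLen) ->
        case dyntrace:available() of
            true  -> dyntrace:p(QLen, "queue_sample");
            false -> ok
        end.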

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring: it shows how the sizes of the run queues vary during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation reuses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the same storage infrastructure and presentation facilities.

Figure 24: DTrace run-queue visualisation of the bang benchmark with 16 schedulers (run-queue sizes for RunQueue0–RunQueue15 on the Y-axis, time on the X-axis)

iii.3 Devo

Devo [10] (https://github.com/RefactoringTools/devo) is designed to provide real-time online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).
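Instrumentation of the kind Devo consumes can be switched on with ttb (part of the observer application) roughly as follows; the node names and trace flags here are illustrative:

    %% Trace scheduling and message events on two nodes (node names
    %% are illustrative); the traces are written to per-node files.
    start_tracing() ->
        {ok, _} = ttb:tracer(['node1@host1', 'node2@host2'],
                             [{file, "devo_trace"}]),
        ttb:p(all, [running, send, timestamp]),
        ok.

    %% Stop tracing, then fetch and format the trace files.
    stop_tracing() ->
        ttb:stop([format]).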

Figure 25 shows visualisations from Devo in both modes. On the left-hand side a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram) with six cores each; with hyperthreading this gives twelve virtual cores, and hence 24 run queues in total. The size of these run queues is shown by both the colour and height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle. A green arc shows migration on the same physical core, a grey one on the same chip, and blue shows migrations between the two chips.

Figure 25: Devo low and high-level visualisations of SD-Orbit

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between the nodes (green quiescent, red busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system, and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. Tracing to be performed on monitored nodes is also specified within the configuration file.
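The concrete file format is SD-Mon's own; rendered as Erlang terms, a configuration carrying the information just described might look like the following sketch (all host, node and group names are hypothetical):

    %% Hypothetical SD-Mon configuration sketch: hosts, nodes,
    %% s_group partitions, and the tracing to apply on monitored nodes.
    {hosts, ["host1", "host2"]}.
    {nodes, ['master@host1', 'n1@host1', 'n2@host2', 'free1@host2']}.
    {s_groups, [{group1, ['master@host1', 'n1@host1']},
                {group2, ['n1@host1', 'n2@host2']}]}.
    {tracing, [{group1, [send, 'receive']},
               {group2, [send, 'receive']}]}.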

An agent is started by a master SD-Mon node for each s_group and for each free node. Configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Figure 26: SD-Mon architecture

Each agent takes care of an s_group or of a free node. At start-up it tries to get in contact with its nodes and apply the tracing to them as stated by the master. Binary files are stored in the host file system. Tracing is internally used in order to track s_group operations happening at runtime; an asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

The monitoring network is also supervised in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected then that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution, before (top) and after eliminating s_group 2 (bottom)

Figure 28: SD-Mon online monitoring

VIII. Systemic Evaluation

Preceding sections have investigated improvements to individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI and VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state of the art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by Chechina et al. [22] and in a RELEASE project deliverable (Deliverable 6.2, available online at http://www.release-project.eu/documents/D62.pdf). The bulk of the experiments reported here are conducted on the Athos cluster, using Erlang/OTP 17.4 and the associated SD Erlang libraries.

Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling]. Relative speedup against number of nodes (cores), up to 257 nodes (6168 cores), for D-Orbit and SD-Orbit with W=2M and W=5M.

The experiments cover two measures of scalability. As Orbit does a fixed-size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure. The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.
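Stated precisely, for the strong-scaling (Orbit) measurements, with T(1) the single-core execution time and T(n) the time on n cores for the same fixed-size problem, the plotted quantity is

    % Relative speedup (strong scaling): fixed problem size.
    S(n) = \frac{T(1)}{T(n)}

For weak scaling (ACO) the workload grows in proportion to the compute resources, so the ideal behaviour is constant execution time rather than growing speedup; this is why Figure 31 below plots execution time instead of speedup.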

i. Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot standard deviation. Results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups: beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails because some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii. Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.
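As a consistency check, the best-fit line shown in Figure 30 reproduces the 10,000-node experiment well:

    % Linear fit from Figure 30 (N nodes, time in seconds):
    T(N) = 0.0125\,N + 83
    % At N = 10000:
    T(10000) = 0.0125 \times 10000 + 83 = 208\ \text{s} \quad (\text{measured: } 212\ \text{s})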

Figure 30: Wombat deployment time: experimental data and best fit T(N) = 0.0125N + 83 (time in seconds against number of nodes N, up to 10,000)

The measurement data shows two important facts: it shows that WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and that WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of the effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, while SR-ACO registers them only within an s_group (Section ii).

For both GR-ACO and SR-ACO a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common X86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark and obtained similar results [23].
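In essence, such a Chaos Monkey is a periodic random killer on each node. A minimal sketch of the idea in plain Erlang (not the exact service used in these experiments; a production version would also avoid killing system processes):

    %% Minimal Chaos Monkey sketch: once a second, kill one random
    %% process on the local node. The rand module is available from
    %% Erlang/OTP 18; earlier releases would use the random module.
    -module(chaos).
    -export([start/0]).

    start() ->
        spawn(fun loop/0).

    loop() ->
        timer:sleep(1000),
        case processes() -- [self()] of
            []      -> ok;
            Victims ->
                Pid = lists:nth(rand:uniform(length(Victims)), Victims),
                exit(Pid, kill)
        end,
        loop().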

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.

Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling]. Execution time (s) against number of nodes (cores), up to 250 nodes (6000 cores), for GR-ACO, SR-ACO and ML-ACO.

Scalability. Figure 31 compares the runtimes of the ML, GR and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and a fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks, and others, that partitioning the network of Erlang nodes significantly improves performance at large scale.
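The essential difference between the GR and SR variants is the namespace in which a critical process is registered. The sketch below contrasts the two styles; global:register_name/2 is standard Erlang, while the s_group calls follow the SD Erlang libraries and their exact signatures should be treated as indicative:

    %% GR-ACO style: a name in the global namespace is replicated on
    %% every connected node.
    register_globally(Name, Pid) ->
        yes = global:register_name(Name, Pid).

    %% SR-ACO style: create an s_group and register the name only
    %% within it, so recovery data is replicated inside the group alone.
    register_in_group(Group, Nodes, Name, Pid) ->
        {ok, Group, _Nodes} = s_group:new_s_group(Group, Nodes),
        yes = s_group:register_name(Group, Name, Pid).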

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii. Evaluation Summary

Figure 32: Number of sent packets in ML-ACO, GR-ACO and SR-ACO

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.

As SD-Orbit scales better than D-Orbit, SR-ACO scales better than ML-ACO, and SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M and W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60 node scaling limit of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX. Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language while preserving reliability. The work is a vade mecum for addressing scalability in reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering VM, language and persistent storage levels; we have developed the BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15] or Legion [40]. We establish scientifically the folklore limitation of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups, for partitioning the network and the global process namespace, and semi-explicit process placement, for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).
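With semi-explicit placement, a process is spawned on a node chosen by property rather than by a hard-wired node name. The sketch below assumes an s_group:choose_nodes/1 style interface; the parameter names are indicative of the SD Erlang placement API rather than definitive:

    %% Semi-explicit placement sketch: pick a candidate node from an
    %% s_group instead of naming it explicitly.
    spawn_in_group(Group, Fun) ->
        Nodes = s_group:choose_nodes([{s_group, Group}]),
        Node  = lists:nth(rand:uniform(length(Nodes)), Nodes),
        spawn(Node, Fun).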

To improve the scalability of the Erlang VM and libraries we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).
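Several of these improvements are visible from ordinary ETS code. For example, from Erlang/OTP 22 onwards a contention-adapting tree backs ordered_set tables with write_concurrency enabled, so the technique described above can be exercised with stock OTP:

    %% An ordered_set table whose write_concurrency option selects a
    %% contention-adapting tree in Erlang/OTP 22 and later.
    new_counters() ->
        ets:new(counters, [ordered_set, public, named_table,
                           {write_concurrency, true},
                           {read_concurrency, true}]).

    bump(Key) ->
        %% Atomic per-key increment; concurrent writers to different
        %% keys contend only locally within the adapted tree.
        ets:update_counter(counters, Key, {2, 1}, {Key, 0}).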

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60 node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

Table 2: Cluster Specifications (RAM per host in GB)

    Name    Hosts  Cores/host  Cores (max total)  Cores (avail)  Processor                                  RAM     Interconnection
    GPG     20     16          320                320            Intel Xeon E5-2640v2, 8C, 2.0GHz           64      10GB Ethernet
    Kalkyl  348    8           2784               1408           Intel Xeon 5520v2, 4C, 2.26GHz             24–72   InfiniBand 20Gb/s
    TinTin  160    16          2560               2240           AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz   64–128  2:1 oversubscribed QDR InfiniBand
    Athos   776    24          18624              6144           Intel Xeon E5-2697v2, 12C, 2.7GHz          64      InfiniBand FDR14

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages (see Deliverable 6.7, available online at http://www.release-project.eu/documents/D67.pdf). For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache and 128GB RAM, comprising four AMD Opteron 6276s at 2.3 GHz, each with 16 "Bulldozer" cores, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM, comprising four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level visualization of concurrent and distributed computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riak docs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Kőszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go Programming Language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 11–11, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. In ACM Sigplan Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak Core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.

[52] Huiqing Li and Simon Thompson. Automated API migration in a user-extensible refactoring tool for Erlang programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved semantics and implementation through property-based testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe concurrency introduction through slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, 2008.

[57] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing – 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop – The Definitive Guide: Storage and Analysis at Internet Scale (3rd ed., revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-strength functional programming: Experiences with the Ericsson AXD301 project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.


We have designed and implemented a setof Scalable Distributed (SD) Erlang librariesto address language-level scalability issues Thekey constructs are s_groups for partitioning thenetwork and global process namespace andsemi-explicit process placement for deployingdistributed Erlang applications on large hetero-geneous architectures in a portable way Wehave provided a state transition operational se-mantics for the new s_groups and validatedthe library implementation against the seman-tics using QuickCheck (Section V)

To improve the scalability of the Erlang VMand libraries we have improved the implemen-tation of shared ETS tables time management

and load balancing between schedulers Fol-lowing a systematic analysis of ETS tablesthe number of fine-grained (bucket) locks andof reader groups have been increased Wehave developed and evaluated four new tech-niques for improving ETS scalability(i) pro-grammer control of number of bucket locks(ii) a contention-adapting tree data structurefor ordered_sets (iii) queue delegation lock-ing and (iv) eliminating the locks in the metatable We have introduced a new scheduler util-isation balancing mechanism to spread work tomultiple schedulers (and hence cores) and newsynchronisation mechanisms to reduce con-tention on the widely-used time managementmechanisms By June 2015 with ErlangOTP180 the majority of these changes had been in-cluded in the primary releases In any scalableactor language implementation such thought-ful design and engineering will be requiredto schedule large numbers of actors on hostswith many cores and to minimise contentionon shared VM resources (Section VI)

To facilitate the development of large Erlangsystems and to make them understandable wehave developed a range of tools The propri-etary WombatOAM tool deploys and monitorslarge distributed Erlang systems over multi-ple and possibly heterogeneous clusters orclouds We have made open source releases offour concurrency tools Percept2 now detectsconcurrency bad smells Wrangler provides en-hanced concurrency refactoring the Devo toolis enhanced to provide interactive visualisationof SD Erlang systems and the new SD-Montool monitors SD Erlang systems We an-ticipate that these tools will guide the designof tools for other large scale distributed actorlanguages and frameworks (Section VII)

We report on the reliability and scalabilityimplications of our new technologies using arange of benchmarks and consistently use theOrbit and ACO benchmarks throughout thearticle While we report measurements on arange of NUMA and cluster architectures the

42

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Table 2 Cluster Specifications (RAM per host in GB)

Cores

per max

Name Hosts host total avail Processor RAM Inter-connection

GPG 20 16 320 320 Intel Xeon E5-2640v2 8C2GHz

64 10GB Ethernet

Kalkyl 348 8 2784 1408 Intel Xeon 5520v2 4C226GHz

24ndash72 InfiniBand 20Gbs

TinTin 160 16 2560 2240AMD Opteron 6220v2 Bull-dozer 8C 30GHz 64ndash128

21 oversub-scribed QDRInfiniband

Athos 776 24 18624 6144 Intel Xeon E5-2697v2 12C27GHz

64 InfinibandFDR14

key scalability experiments are conducted onthe Athos cluster with 256 hosts (6144 cores)Even when global recovery data is not main-tained partitioning the network into s_groupsreduces network traffic and improves the per-formance of the Orbit and ACO benchmarksabove 80 hosts Crucially we exceed the 60node limit for distributed Erlang and do notreach the scalability limits of SD Erlang with256 nodesVMs and 6144 cores Chaos Mon-key experiments show that two versions ofACO are reliable and hence that SD Erlang pre-serves the Erlang reliability model Howeverthe ACO results show that maintaining globalrecovery data ie a global process name spacedramatically limits scalability in distributedErlang Scalability can however be recoveredby maintaining recovery data only within ap-propriately sized s_groups These results areconsistent with experiments with other bench-marks and on other architectures (Section VIII)

In future work we plan to incorporateRELEASE technologies along with other tech-nologies in a generic framework for buildingperformant large scale servers In addition pre-

liminary investigations suggest that some SDErlang ideas could improve the scalability ofother actor languages21 For example the Akkaframework for Scala could benefit from semi-explicit placement and Cloud Haskell frompartitioning the network

Appendix A Architecture

Specifications

The specifications of the clusters used for mea-surement are summarised in Table 2 We alsouse the following NUMA machines (1) AnAMD Bulldozer with 16M L216M L3 cache128GB RAM four AMD Opteron 6276s at23 GHz 16 ldquoBulldozerrdquo cores each giving a to-tal of 64 cores (2) An Intel NUMA with 128GBRAM four Intel Xeon E5-4650s at 270GHzeach with eight hyperthreaded cores giving atotal of 64 cores

21See Deliverable 67 (httpwwwrelease-projecteudocumentsD67pdf) available online

43

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Acknowledgements

We would like to thank the entire RELEASEproject team for technical insights and adminis-trative support Roberto Aloi and Enrique Fer-nandez Casado contributed to the developmentand measurement of WombatOAM This workhas been supported by the European Uniongrant RII3-CT-2005-026133 ldquoSCIEnce SymbolicComputing Infrastructure in Europerdquo IST-2011-287510 ldquoRELEASE A High-Level Paradigm forReliable Large-scale Server Softwarerdquo and bythe UKrsquos Engineering and Physical SciencesResearch Council grant EPG0551811 ldquoHPC-GAP High Performance Computational Alge-bra and Discrete Mathematicsrdquo

References

[1] Gul Agha ACTORS A Model of ConcurrentComputation in Distributed Systems PhDthesis MIT 1985

[2] Gul Agha An overview of Actor lan-guages SIGPLAN Not 21(10)58ndash67 1986

[3] Bulldozer (microarchitecture) 2015httpsenwikipediaorgwiki

Bulldozer_(microarchitecture)

[4] Apache SF Liblcoud 2016 https

libcloudapacheorg

[5] C R Aragon and R G Seidel Random-ized search trees In Proceedings of the 30thAnnual Symposium on Foundations of Com-puter Science pages 540ndash545 October 1989

[6] Joe Armstrong Programming Erlang Soft-ware for a Concurrent World PragmaticBookshelf 2007

[7] Joe Armstrong Erlang Commun ACM5368ndash75 2010

[8] Stavros Aronis Nikolaos Papaspyrou Ka-terina Roukounaki Konstantinos Sago-nas Yiannis Tsiouris and Ioannis E

Venetis A scalability benchmark suite forErlangOTP In Torben Hoffman and JohnHughes editors Proceedings of the EleventhACM SIGPLAN Workshop on Erlang pages33ndash42 New York NY USA September2012 ACM

[9] Thomas Arts John Hughes Joakim Jo-hansson and Ulf Wiger Testing telecomssoftware with Quviq QuickCheck In Pro-ceedings of the 2006 ACM SIGPLAN work-shop on Erlang pages 2ndash10 New York NYUSA 2006 ACM

[10] Robert Baker Peter Rodgers SimonThompson and Huiqing Li Multi-level Vi-sualization of Concurrent and DistributedComputation in Erlang In Visual Lan-guages and Computing (VLC) The 19th In-ternational Conference on Distributed Multi-media Systems (DMS 2013) 2013

[11] Luiz Andreacute Barroso Jimmy Clidaras andUrs Houmllzle The Datacenter as a ComputerMorgan and Claypool 2nd edition 2013

[12] Basho Technologies Riakdocs BashoBench 2014 httpdocsbasho

comriaklatestopsbuilding

benchmarking

[13] J E Beasley OR-Library Distributingtest problems by electronic mail TheJournal of the Operational Research Soci-ety 41(11)1069ndash1072 1990 Datasetsavailable at httppeoplebrunelac

uk~mastjjbjeborlibwtinfohtml

[14] Cory Bennett and Ariel Tseitlin Chaosmonkey released into the wild NetflixBlog 2012

[15] Robert D Blumofe Christopher F JoergBradley C Kuszmaul Charles E LeisersonKeith H Randall and Yuli Zhou Cilk Anefficient multithreaded runtime system In

44

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Proceedings of the fifth ACM SIGPLAN sym-posium on Principles and practice of parallelprogramming pages 207ndash216 ACM 1995

[16] Olivier Boudeville Technical manual ofthe sim-diasca simulation engine EDFRampD 2012

[17] Istvaacuten Bozoacute Viktoacuteria Foumlrdos Daacuteniel Hor-paacutecsi Zoltaacuten Horvaacuteth Tamaacutes Kozsik Ju-dit Koszegi and Melinda Toacuteth Refactor-ings to enable parallelization In JurriaanHage and Jay McCarthy editors Trends inFunctional Programming 15th InternationalSymposium TFP 2014 Revised Selected Pa-pers volume 8843 of LNCS pages 104ndash121Springer 2015

[18] Irina Calciu Dave Dice Yossi Lev Vic-tor Luchangco Virendra J Marathe andNir Shavit NUMA-aware reader-writerlocks In Proceedings of the 18th ACM SIG-PLAN Symposium on Principles and Prac-tice of Parallel Programming pages 157ndash166New York NY USA 2013 ACM

[19] Francesco Cesarini and Simon Thomp-son Erlang Programming A ConcurrentApproach to Software Development OrsquoReillyMedia 1st edition 2009

[20] Rohit Chandra Leonardo Dagum DaveKohr Dror Maydan Jeff McDonald andRamesh Menon Parallel Programming inOpenMP Morgan Kaufmann San Fran-cisco CA USA 2001

[21] Natalia Chechina Huiqing Li AmirGhaffari Simon Thompson and PhilTrinder Improving the network scalabil-ity of Erlang J Parallel Distrib Comput90(C)22ndash34 2016

[22] Natalia Chechina Kenneth MacKenzieSimon Thompson Phil Trinder OlivierBoudeville Viktoacuteria Foumlrdos Csaba HochAmir Ghaffari and Mario Moro Her-nandez Evaluating scalable distributed

Erlang for scalability and reliability IEEETransactions on Parallel and Distributed Sys-tems 2017

[23] Natalia Chechina Mario Moro Hernan-dez and Phil Trinder A scalable reliableinstant messenger using the SD Erlang li-braries In Erlangrsquo16 pages 33ndash41 NewYork NY USA 2016 ACM

[24] Koen Claessen and John HughesQuickCheck a lightweight tool forrandom testing of Haskell programsIn Proceedings of the 5th ACM SIGPLANInternational Conference on FunctionalProgramming pages 268ndash279 New YorkNY USA 2000 ACM

[25] H A J Crauwels C N Potts and L Nvan Wassenhove Local search heuristicsfor the single machine total weighted tar-diness scheduling problem INFORMSJournal on Computing 10(3)341ndash350 1998The datasets from this paper are includedin Beasleyrsquos ORLIB

[26] Jeffrey Dean and Sanjay GhemawatMapreduce Simplified data processing onlarge clusters Commun ACM 51(1)107ndash113 2008

[27] David Dewolfs Jan Broeckhove VaidySunderam and Graham E Fagg FT-MPI fault-tolerant metacomputing andgeneric name services A case study InEuroPVMMPIrsquo06 pages 133ndash140 BerlinHeidelberg 2006 Springer-Verlag

[28] Alan AA Donovan and Brian WKernighan The Go programming languageAddison-Wesley Professional 2015

[29] Marco Dorigo and Thomas Stuumltzle AntColony Optimization Bradford CompanyScituate MA USA 2004

[30] Faith Ellen Yossi Lev Victor Luchangcoand Mark Moir SNZI scalable NonZero

45

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

indicators In Proceedings of the 26th AnnualACM Symposium on Principles of DistributedComputing pages 13ndash22 New York NYUSA 2007 ACM

[31] Jeff Epstein Andrew P Black and Si-mon Peyton-Jones Towards Haskell inthe Cloud In Haskell rsquo11 pages 118ndash129ACM 2011

[32] Martin Fowler Refactoring Improving theDesign of Existing Code Addison-WesleyLongman Publishing Co Inc BostonMA USA 1999

[33] Ana Gainaru and Franck Cappello Errorsand faults In Fault-Tolerance Techniques forHigh-Performance Computing pages 89ndash144Springer International Publishing 2015

[34] Martin Josef Geiger New instancesfor the Single Machine Total WeightedTardiness Problem Technical ReportResearch Report 10-03-01 March 2010Datasets available at httplogistik

hsu-hhdeSMTWTP

[35] Guillaume Germain Concurrency ori-ented programming in termite schemeIn Proceedings of the 2006 ACM SIGPLANworkshop on Erlang pages 20ndash20 NewYork NY USA 2006 ACM

[36] Amir Ghaffari De-bench a benchmarktool for distributed erlang 2014 https

githubcomamirghaffariDEbench

[37] Amir Ghaffari Investigating the scalabil-ity limits of distributed Erlang In Proceed-ings of the 13th ACM SIGPLAN Workshopon Erlang pages 43ndash49 ACM 2014

[38] Amir Ghaffari Natalia Chechina PhipTrinder and Jon Meredith Scalable persis-tent storage for Erlang Theory and prac-tice In Proceedings of the 12th ACM SIG-PLAN Workshop on Erlang pages 73ndash74New York NY USA 2013 ACM

[39] Seth Gilbert and Nancy Lynch Brewerrsquosconjecture and the feasibility of consistentavailable partition-tolerant web servicesACM SIGACT News 3351ndash59 2002

[40] Andrew S Grimshaw Wm A Wulf andthe Legion team The Legion vision of aworldwide virtual computer Communica-tions of the ACM 40(1)39ndash45 1997

[41] Carl Hewitt Actor model for discre-tionary adaptive concurrency CoRRabs10081459 2010

[42] Carl Hewitt Peter Bishop and RichardSteiger A universal modular ACTOR for-malism for artificial intelligence In IJ-CAIrsquo73 pages 235ndash245 San Francisco CAUSA 1973 Morgan Kaufmann

[43] Rich Hickey The Clojure programminglanguage In DLSrsquo08 pages 11ndash11 NewYork NY USA 2008 ACM

[44] Zoltaacuten Horvaacuteth Laacuteszloacute Loumlvei TamaacutesKozsik Roacutebert Kitlei Anikoacute Nagyneacute ViacutegTamaacutes Nagy Melinda Toacuteth and RolandKiraacutely Building a refactoring tool forErlang In Workshop on Advanced SoftwareDevelopment Tools and Techniques WAS-DETT 2008 2008

[45] Laxmikant V Kale and Sanjeev KrishnanCharm++ a portable concurrent objectoriented system based on c++ In ACMSigplan Notices volume 28 pages 91ndash108ACM 1993

[46] David Klaftenegger Konstantinos Sago-nas and Kjell Winblad On the scalabilityof the Erlang term storage In Proceedingsof the Twelfth ACM SIGPLAN Workshop onErlang pages 15ndash26 New York NY USASeptember 2013 ACM

[47] David Klaftenegger Konstantinos Sago-nas and Kjell Winblad Delegation lock-ing libraries for improved performance of

46

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

multithreaded programs In Euro-Par 2014Parallel Processing volume 8632 of LNCSpages 572ndash583 Springer 2014

[48] David Klaftenegger Konstantinos Sago-nas and Kjell Winblad Queue delegationlocking IEEE Transactions on Parallel andDistributed Systems 2017 To appear

[49] Rusty Klophaus Riak core Building dis-tributed applications without shared stateIn ACM SIGPLAN Commercial Users ofFunctional Programming CUFP rsquo10 pages141ndash141 New York NY USA 2010ACM

[50] Avinash Lakshman and Prashant MalikCassandra A decentralized structuredstorage system SIGOPS Oper Syst Rev44(2)35ndash40 April 2010

[51] J Lee et al Python actor runtime library2010 oslcsuiuieduparley

[52] Huiqing Li and Simon Thompson Auto-mated API Migration in a User-ExtensibleRefactoring Tool for Erlang Programs InTim Menzies and Motoshi Saeki editorsAutomated Software Engineering ASErsquo12IEEE Computer Society 2012

[53] Huiqing Li and Simon Thompson Multi-core profiling for Erlang programs usingPercept2 In Proceedings of the Twelfth ACMSIGPLAN workshop on Erlang 2013

[54] Huiqing Li and Simon Thompson Im-proved Semantics and ImplementationThrough Property-Based Testing withQuickCheck In 9th International Workshopon Automation of Software Test 2014

[55] Huiqing Li and Simon Thompson SafeConcurrency Introduction through SlicingIn PEPM rsquo15 ACM SIGPLAN January2015

[56] Huiqing Li Simon Thompson GyoumlrgyOrosz and Melinda Toumlth Refactoring

with Wrangler updated In ACM SIG-PLAN Erlang Workshop volume 2008 2008

[57] Frank Lubeck and Max Neunhoffer Enu-merating large Orbits and direct conden-sation Experimental Mathematics 10(2)197ndash205 2001

[58] Andreea Lutac Natalia Chechina Ger-ardo Aragon-Camarasa and Phil TrinderTowards reliable and scalable robot com-munication In Proceedings of the 15th Inter-national Workshop on Erlang Erlang 2016pages 12ndash23 New York NY USA 2016ACM

[59] LWNnet The high-resolution timerAPI January 2006 httpslwnnet

Articles167897

[60] Kenneth MacKenzie Natalia Chechinaand Phil Trinder Performance portabilitythrough semi-explicit placement in dis-tributed Erlang In Proceedings of the 14thACM SIGPLAN Workshop on Erlang pages27ndash38 ACM 2015

[61] Jeff Matocha and Tracy Camp A taxon-omy of distributed termination detectionalgorithms Journal of Systems and Software43(221)207ndash221 1998

[62] Nicholas D Matsakis and Felix S Klock IIThe Rust language In ACM SIGAda AdaLetters volume 34 pages 103ndash104 ACM2014

[63] Robert McNaughton Scheduling withdeadlines and loss functions ManagementScience 6(1)1ndash12 1959

[64] Martin Odersky et al The Scala program-ming language 2012 wwwscala-langorg

[65] William F Opdyke Refactoring Object-Oriented Frameworks PhD thesis Uni-versity of Illinois at Urbana-Champaign1992

47

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

[66] Nikolaos Papaspyrou and KonstantinosSagonas On preserving term sharing inthe Erlang virtual machine In TorbenHoffman and John Hughes editors Pro-ceedings of the 11th ACM SIGPLAN ErlangWorkshop pages 11ndash20 ACM 2012

[67] RELEASE Project Team EU Framework7 Project 287510 (2011ndash2015) 2015 httpwwwrelease-projecteu

[68] Konstantinos Sagonas and ThanassisAvgerinos Automatic refactoring ofErlang programs In Principles and Prac-tice of Declarative Programming (PPDPrsquo09)pages 13ndash24 ACM 2009

[69] Konstantinos Sagonas and Kjell WinbladMore scalable ordered set for ETS usingadaptation In Proc ACM SIGPLAN Work-shop on Erlang pages 3ndash11 ACM Septem-ber 2014

[70] Konstantinos Sagonas and Kjell WinbladContention adapting search trees In Pro-ceedings of the 14th International Sympo-sium on Parallel and Distributed Computingpages 215ndash224 IEEE Computing Society2015

[71] Konstantinos Sagonas and Kjell WinbladEfficient support for range queries andrange updates using contention adaptingsearch trees In Languages and Compilersfor Parallel Computing - 28th InternationalWorkshop volume 9519 of LNCS pages37ndash53 Springer 2016

[72] Marc Snir Steve W Otto D WWalker Jack Dongarra and Steven Huss-Lederman MPI The Complete ReferenceMIT Press Cambridge MA USA 1995

[73] Sriram Srinivasan and Alan MycroftKilim Isolation-typed actors for Java InECOOPrsquo08 pages 104ndash128 Berlin Heidel-berg 2008 Springer-Verlag

[74] Don Syme Adam Granicz and AntonioCisternino Expert F 40 Springer 2015

[75] Simon Thompson and Huiqing Li Refac-toring tools for functional languages Jour-nal of Functional Programming 23(3) 2013

[76] Marcus Voumllker Linux timers May 2014

[77] WhatsApp 2015 httpswwwwhatsappcom

[78] Tom White Hadoop - The Definitive GuideStorage and Analysis at Internet Scale (3 edrevised and updated) OrsquoReilly 2012

[79] Ulf Wiger Industrial-Strength Functionalprogramming Experiences with the Er-icsson AXD301 Project In ImplementingFunctional Languages (IFLrsquo00) Aachen Ger-many September 2000 Springer-Verlag

48

  • Introduction
  • Context
    • Scalable Reliable Programming Models
    • Actor Languages
    • Erlangs Support for Concurrency
    • Scalability and Reliability in Erlang Systems
    • ETS Erlang Term Storage
      • Benchmarks for scalability and reliability
        • Orbit
        • Ant Colony Optimisation (ACO)
          • Erlang Scalability Limits
            • Scaling Erlang on a Single Host
            • Distributed Erlang Scalability
            • Persistent Storage
              • Improving Language Scalability
                • SD Erlang Design
                  • S_groups
                  • Semi-Explicit Placement
                    • S_group Semantics
                    • Semantics Validation
                      • Improving VM Scalability
                        • Improvements to Erlang Term Storage
                        • Improvements to Schedulers
                        • Improvements to Time Management
                          • Scalable Tools
                            • Refactoring for Scalability
                            • Scalable Deployment
                            • Monitoring and Visualisation
                              • Percept2
                              • DTraceSystemTap tracing
                              • Devo
                              • SD-Mon
                                  • Systemic Evaluation
                                    • Orbit
                                    • Ant Colony Optimisation (ACO)
                                    • Evaluation Summary
                                      • Discussion
Page 35: Scaling Reliably Scaling Reliably: Improving the ... · Distributed actor languages are an effective means of constructing scalable reliable systems, and the Erlang ... that capture

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 25: Devo low- and high-level visualisations of SD-Orbit

…migration on the same physical core, a grey one on the same chip, and blue shows migrations between the two chips.

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between the nodes (green quiescent, red busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. The tracing to be performed on monitored nodes is also specified within the configuration file.
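A minimal sketch of such a configuration, written as an Erlang terms file, is shown below. The key names and overall schema are illustrative assumptions for exposition, not SD-Mon's actual format:

    %% sdmon.config -- illustrative sketch only; SD-Mon's real schema may differ.
    {hosts, ["athos001", "athos002", "athos003"]}.
    {nodes, ['master@athos001']}.                         % free nodes
    {s_groups, [{group1, ['n1@athos002', 'n2@athos003']},
                {group2, ['n3@athos003']}]}.
    {tracing,  [{group1, [send, 'receive']},              % trace flags per s_group
                {group2, [procs]}]}.

A file in this style can be read with file:consult/1, the idiomatic way for Erlang tools to load such configurations.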

An agent is started by a master SD-Mon node for each s_group and for each free node. Configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes, so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored, so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up it tries to get in contact with its nodes and applies the tracing to them as stated by the master. Binary files are stored in the host file system. Tracing is internally used in order to track s_group operations happening at runtime; an asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group, or when the group is deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

Figure 26: SD-Mon architecture

The monitoring network is also supervised in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected then that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as for post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution. Before (top) and after eliminating s_group 2 (bottom)

Figure 28: SD-Mon online monitoring

VIII Systemic Evaluation

Preceding sections have investigated the improvements of individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI, VII in concert. We do so by investigating the deployment, reliability and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16] on several state of the art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable^20. The bulk of the experiments reported here are conducted on the Athos cluster using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed-size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure. The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.
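In symbols (standard definitions, stated here for concreteness rather than drawn from the article): writing $T_c$ for the runtime on $c$ cores, relative speedup for the fixed-size Orbit computation is
\[ S(c) \;=\; \frac{T_1}{T_c}, \]
with $S(c) = c$ as the ideal. For ACO, where each core receives a fixed workload $w$, the weak-scaling efficiency
\[ E(c) \;=\; \frac{T_1(w)}{T_c(c\,w)} \]
is the relevant quantity; ideal weak scaling keeps $E(c)$, and hence the runtime, constant as $c$ grows.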

^20 See Deliverable 6.2, available online: http://www.release-project.eu/documents/D6.2.pdf

Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling]


i Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot standard deviation. Results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups: beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails because some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.
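The best-fit line reported in Figure 30 makes the linearity concrete. Taking $T(N) = 0.0125N + 83$ seconds (the fit shown in the figure):
\[ T(10000) \;\approx\; 0.0125 \times 10000 + 83 \;=\; 208\,\mathrm{s}, \]
which agrees well with the measured 212s. The slope implies an asymptotic rate of $1/0.0125 = 80$ nodes per second; the observed average of roughly 47 nodes per second includes the constant start-up cost.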

Figure 30: Wombat Deployment Time (experimental data and best fit 0.0125N + 83)

The measurement data shows two important facts: it shows that WombatOAM scales well (up to a deployment base of 10000 Erlang nodes), and that WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of the effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, whereas SR-ACO registers them only within an s_group (Section ii).
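The difference is visible at the registration call sites. The sketch below contrasts the two styles; the s_group API follows the SD Erlang library of Section V, the return values are shown mirroring those of global, and the module and process names are illustrative:

    -module(registration_sketch).
    -export([register_colony/3]).

    %% GR-ACO style: the name goes into the global namespace, which is
    %% replicated on every node of the fully connected network.
    register_colony(global, Name, Pid) ->
        yes = global:register_name(Name, Pid),
        global:whereis_name(Name);
    %% SR-ACO style: the name is registered only within one s_group, so
    %% only the group members maintain (and synchronise) the name table.
    register_colony({s_group, Group}, Name, Pid) ->
        yes = s_group:register_name(Group, Name, Pid),
        s_group:whereis_name(Group, Name).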

For both GR-ACO and SR-ACO, a Chaos Monkey runs on every Erlang node, i.e. master, submasters and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common x86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, so around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark and obtained similar results [23].
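A Chaos Monkey of this kind needs only a few lines of Erlang. The sketch below is an illustrative reconstruction rather than the code used in the experiments: each node runs one such process, which kills a randomly chosen local process every second (the rand module is available from OTP 18.0):

    -module(chaos_monkey).
    -export([start/0]).

    %% Kill one randomly chosen process on the local node every second.
    start() ->
        spawn(fun loop/0).

    loop() ->
        timer:sleep(1000),
        case erlang:processes() -- [self()] of
            [] ->
                loop();
            Procs ->
                Victim = lists:nth(rand:uniform(length(Procs)), Procs),
                %% exit/2 with reason kill cannot be trapped, so the victim
                %% always dies; its supervisor then restarts it using local
                %% recovery information, as discussed above.
                exit(Victim, kill),
                loop()
        end.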

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.

Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling]


Scalability. Figure 31 compares the runtimes of the ML, GR, and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and the fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks and others that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO, and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes. Even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.

Figure 32: Number of sent packets in ML-ACO, GR-ACO, and SR-ACO

As SD-Orbit scales better than D-Orbit, SR-ACO scales better than ML-ACO, and SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limits of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language while preserving reliability. The work is a vade mecum for addressing scalability of reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering VM, language and persistent storage levels. We have developed the BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15], or Legion [40]. We establish scientifically the folklore limitations of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups for partitioning the network and the global process namespace, and semi-explicit process placement for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).
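As a concrete illustration, the fragment below first forms an s_group and then places a worker by attribute rather than by hard-coded node name. The s_group call follows the SD Erlang library; the attr module name and the return shapes shown are assumptions for exposition (the choose_nodes/1 operation is described in [60]):

    -module(placement_sketch).
    -export([start_worker/1]).

    start_worker(Params) ->
        %% Members of an s_group share a namespace and a full mesh, but are
        %% not transitively connected to nodes outside the group.
        {ok, aco_group, _Members} =
            s_group:new_s_group(aco_group,
                                ['colony1@athos', 'colony2@athos',
                                 'colony3@athos']),
        %% Semi-explicit placement: ask for candidate nodes by attribute
        %% instead of naming a node explicitly, then spawn there.
        %% ('attr' is a hypothetical module name for choose_nodes/1.)
        [Node | _] = attr:choose_nodes([{s_group, aco_group}]),
        spawn(Node, ant_colony, run, [Params]).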

To improve the scalability of the Erlang VM and libraries, we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups has been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).
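At the API level this tuning surface is reached through the standard ets:new/2 options. The options below are standard Erlang/OTP; the table and key names are illustrative, and whether ordered_set write concurrency uses the contention-adapting tree depends on the OTP version:

    -module(ets_sketch).
    -export([demo/0]).

    %% A shared table tuned for concurrent access from many schedulers:
    %% {write_concurrency, true} enables the fine-grained (bucket) locking
    %% discussed above, and {read_concurrency, true} optimises read-mostly
    %% access patterns.
    demo() ->
        Tab = ets:new(pheromone_matrix,
                      [set, public,
                       {read_concurrency, true},
                       {write_concurrency, true}]),
        true = ets:insert(Tab, {edge_1_2, 0.42}),
        [{edge_1_2, Weight}] = ets:lookup(Tab, edge_1_2),
        Weight.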

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).


Table 2: Cluster Specifications (RAM per host in GB)

Name     Hosts   Cores/host   Cores total   Max avail   Processor                                   RAM      Interconnection
GPG         20           16           320         320   Intel Xeon E5-2640v2, 8C, 2GHz              64       10GB Ethernet
Kalkyl     348            8          2784        1408   Intel Xeon 5520v2, 4C, 2.26GHz              24–72    InfiniBand 20Gb/s
TinTin     160           16          2560        2240   AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz    64–128   2:1 oversubscribed QDR InfiniBand
Athos      776           24         18624        6144   Intel Xeon E5-2697v2, 12C, 2.7GHz           64       InfiniBand FDR14


In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages^21. For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3 GHz, each with 16 "Bulldozer" cores, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

^21 See Deliverable 6.7 (http://www.release-project.eu/documents/D6.7.pdf), available online.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level Visualization of Concurrent and Distributed Computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riakdocs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Koszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go programming language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Technical Report Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary, adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 1:1–1:1, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. In ACM Sigplan Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014: Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.

[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lübeck and Max Neunhöffer. Enumerating large Orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock, II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing: 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale (3. ed., revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-Strength Functional programming: Experiences with the Ericsson AXD301 Project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.

48

  • Introduction
  • Context
    • Scalable Reliable Programming Models
    • Actor Languages
    • Erlangs Support for Concurrency
    • Scalability and Reliability in Erlang Systems
    • ETS Erlang Term Storage
      • Benchmarks for scalability and reliability
        • Orbit
        • Ant Colony Optimisation (ACO)
          • Erlang Scalability Limits
            • Scaling Erlang on a Single Host
            • Distributed Erlang Scalability
            • Persistent Storage
              • Improving Language Scalability
                • SD Erlang Design
                  • S_groups
                  • Semi-Explicit Placement
                    • S_group Semantics
                    • Semantics Validation
                      • Improving VM Scalability
                        • Improvements to Erlang Term Storage
                        • Improvements to Schedulers
                        • Improvements to Time Management
                          • Scalable Tools
                            • Refactoring for Scalability
                            • Scalable Deployment
                            • Monitoring and Visualisation
                              • Percept2
                              • DTraceSystemTap tracing
                              • Devo
                              • SD-Mon
                                  • Systemic Evaluation
                                    • Orbit
                                    • Ant Colony Optimisation (ACO)
                                    • Evaluation Summary
                                      • Discussion
Page 36: Scaling Reliably Scaling Reliably: Improving the ... · Distributed actor languages are an effective means of constructing scalable reliable systems, and the Erlang ... that capture

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 26: SD-Mon architecture.

deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

The monitoring network is also supervised in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system.

SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution, before (top) and after eliminating s_group 2 (bottom).


Figure 28: SD-Mon online monitoring.

VIII. Systemic Evaluation

Preceding sections have investigated improvements to individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI, and VII in concert. We do so by investigating the deployment, reliability, and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state-of-the-art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by Chechina et al. [22] and in a RELEASE project deliverable (Deliverable 6.2, online: http://www.release-project.eu/documents/D62.pdf). The bulk of the experiments reported here are conducted on the Athos cluster, using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit performs a fixed-size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to the execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure.
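For concreteness, writing T(n) for the runtime on n cores (the notation here is ours, not the article's):

    S(n) = T(1) / T(n)      % relative speedup (strong scaling): fixed total work
    T_w(n) ≈ T_w(1)         % ideal weak scaling: runtime stays flat as work grows with n

Orbit is therefore reported as a speedup curve (Figure 29), while ACO is reported as a runtime curve that should remain flat under perfect weak scaling (Figure 31).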


Figure 29: Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [strong scaling]. (Plot: relative speedup vs number of nodes (cores), up to 257 (6168); curves for D-Orbit and SD-Orbit at W=2M and W=5M.)

The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impact of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

i. Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot the standard deviation. The results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups: beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of the orbit beyond 5M elements, the D-Orbit version fails because some VMs exceed the available RAM of 64GB. In contrast, the SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii. Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5,000 nodes are deployed one per core (i.e. 24 nodes on each of 209 24-core Athos hosts) in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.
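A quick back-of-the-envelope reading of the best-fit line in Figure 30 confirms the linearity claim (our arithmetic, not an additional measurement):

    T(N) ≈ 0.0125 N + 83 seconds
    T(10,000) ≈ 125 + 83 = 208 s    (measured: 212 s)

The average rate of 10,000/212 ≈ 47 nodes/s is below the marginal rate of 1/0.0125 = 80 nodes/s because the constant 83 s start-up cost is amortised over the whole deployment.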

Figure 30: WombatOAM deployment time. (Plot: time (sec) vs number of nodes N, up to 10,000; experimental data with best fit 0.0125N + 83.)

The measurement data shows two important facts: WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and WombatOAM is non-intrusive, as its overhead on a monitored node is typically less than 1.5% of the effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration-file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of the two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, while SR-ACO registers them only within an s_group (Section ii).
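The difference between the two versions is essentially one registration call. A minimal sketch: global:register_name/2 is the standard Erlang API, while the s_group:register_name/3 call follows the SD Erlang design of Section V, and its exact signature should be read as illustrative:

    %% GR-ACO: register a critical process in the global namespace,
    %% which is replicated on every connected node in the system.
    yes = global:register_name(colony_1, ColonyPid),

    %% SR-ACO: register the same process only within its s_group, so the
    %% name (i.e. the recovery data) is maintained by group members alone.
    yes = s_group:register_name(aco_group, colony_1, ColonyPid).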

For both GR-ACO and SR-ACO, a Chaos Monkey runs on every Erlang node, i.e. master, submasters, and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery at this failure frequency has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common x86/Ubuntu platform, typical Erlang process recovery times are around 0.3ms, around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark and obtained similar results [23].
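A minimal sketch of such a service (ours, not the Netflix tool [14] nor the exact harness used in these experiments; the rand module requires Erlang/OTP 18 or later):

    -module(chaos_monkey).
    -export([start/0]).

    %% Once a second, kill a randomly chosen process on the local node;
    %% supervisors are then responsible for restarting the victims.
    %% (A real harness would exclude system processes such as init.)
    start() ->
        spawn(fun loop/0).

    loop() ->
        timer:sleep(1000),
        Procs = processes(),
        Victim = lists:nth(rand:uniform(length(Procs)), Procs),
        exit(Victim, kill),
        loop().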

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.

Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [weak scaling]. (Plot: execution time (s) vs number of nodes (cores), up to 250 (6000); curves for GR-ACO, SR-ACO, and ML-ACO.)

Scalability. Figure 31 compares the runtimes of the ML, GR, and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it also registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and a fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks and others that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO, and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest (the dark blue line) to SR-ACO. This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii. Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.

Figure 32: Number of sent packets in ML-ACO, GR-ACO, and SR-ACO.

As SD-Orbit scales better than D-Orbit, SR-ACO scales better than ML-ACO, and SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M and W=5M) reach their scalability limits at around 40 and 60 nodes respectively, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60-node scaling limit of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX. Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language while preserving reliability. The work is a vade mecum for addressing the scalability of reliable actor languages and frameworks. It is also high-impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering the VM, language, and persistent storage levels, and we have developed the BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15], or Legion [40]. We establish scientifically the folklore limitation of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private-state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups for partitioning the network and the global process namespace, and semi-explicit process placement for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).
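To give a flavour of the two constructs, a sketch follows; the function names match the SD Erlang design, but the exact signatures and return values are assumptions for illustration:

    %% Partitioning: create an s_group of three nodes. Transitive
    %% connections and the registered-name space are now maintained
    %% only within the group.
    {ok, aco_group, _Nodes} =
        s_group:new_s_group(aco_group,
                            ['master@host1', 'colony1@host2', 'colony2@host3']),

    %% Semi-explicit placement: ask for candidate nodes by attribute
    %% instead of hard-coding a host name, then spawn on one of them.
    Candidates = s_group:choose_nodes([{s_group, aco_group}]),
    Pid = spawn(hd(Candidates), ant_colony, start, []).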

To improve the scalability of the Erlang VM and libraries, we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups has been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation, such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).
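Several of these improvements are engaged through standard table options; a small sketch using the stock ETS API (in OTP releases after this work, an ordered_set table with write_concurrency enabled also selects the contention-adapting tree described above):

    %% A shared table with fine-grained (bucket) locking and reader
    %% groups enabled, followed by a contended counter update.
    T = ets:new(stats, [set, public,
                        {write_concurrency, true},
                        {read_concurrency,  true}]),
    true = ets:insert(T, {requests, 0}),
    ets:update_counter(T, requests, 1).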

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large-scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately-sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

Table 2: Cluster specifications (RAM per host in GB).

Name     Hosts   Cores/host   Cores total   Max avail   Processor                                   RAM      Interconnection
GPG      20      16           320           320         Intel Xeon E5-2640v2, 8C, 2GHz              64       10GB Ethernet
Kalkyl   348     8            2784          1408        Intel Xeon 5520v2, 4C, 2.26GHz              24-72    InfiniBand 20Gb/s
TinTin   160     16           2560          2240        AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz    64-128   2:1 oversubscribed QDR Infiniband
Athos    776     24           18624         6144        Intel Xeon E5-2697v2, 12C, 2.7GHz           64       Infiniband FDR14

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large-scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages (see Deliverable 6.7, online: http://www.release-project.eu/documents/D67.pdf). For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3 GHz, each with 16 "Bulldozer" cores, giving a total of 64 cores; (2) an Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58-67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540-545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68-75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33-42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 2-10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level visualization of concurrent and distributed computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riakdocs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking/.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069-1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos Monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207-216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Kőszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104-121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157-166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22-34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33-41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268-279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341-350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107-113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133-140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go Programming Language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13-22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118-129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89-144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 20-20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43-49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73-74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51-59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39-45, 1997.

[41] Carl Hewitt. Actor model for discretionary, adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235-245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 1:1-1:1, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. In ACM SIGPLAN Notices, volume 28, pages 91-108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15-26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572-583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak Core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1-14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35-40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley/.

[52] Huiqing Li and Simon Thompson. Automated API migration in a user-extensible refactoring tool for Erlang programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved semantics and implementation through property-based testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe concurrency introduction through slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197-205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12-23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897/.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27-38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207-221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103-104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1-12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11-20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011-2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13-24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3-11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215-224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing: 28th International Workshop, volume 9519 of LNCS, pages 37-53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104-128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop: The Definitive Guide - Storage and Analysis at Internet Scale (3rd ed., revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-strength functional programming: Experiences with the Ericsson AXD301 project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.

48

  • Introduction
  • Context
    • Scalable Reliable Programming Models
    • Actor Languages
    • Erlangs Support for Concurrency
    • Scalability and Reliability in Erlang Systems
    • ETS Erlang Term Storage
      • Benchmarks for scalability and reliability
        • Orbit
        • Ant Colony Optimisation (ACO)
          • Erlang Scalability Limits
            • Scaling Erlang on a Single Host
            • Distributed Erlang Scalability
            • Persistent Storage
              • Improving Language Scalability
                • SD Erlang Design
                  • S_groups
                  • Semi-Explicit Placement
                    • S_group Semantics
                    • Semantics Validation
                      • Improving VM Scalability
                        • Improvements to Erlang Term Storage
                        • Improvements to Schedulers
                        • Improvements to Time Management
                          • Scalable Tools
                            • Refactoring for Scalability
                            • Scalable Deployment
                            • Monitoring and Visualisation
                              • Percept2
                              • DTraceSystemTap tracing
                              • Devo
                              • SD-Mon
                                  • Systemic Evaluation
                                    • Orbit
                                    • Ant Colony Optimisation (ACO)
                                    • Evaluation Summary
                                      • Discussion
Page 37: Scaling Reliably Scaling Reliably: Improving the ... · Distributed actor languages are an effective means of constructing scalable reliable systems, and the Erlang ... that capture

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 28 SD-Mon online monitoring

VIII Systemic Evaluation

Preceding sections have investigated the im-provements of individual aspects of an Erlangsystem eg ETS tables in Section i This sec-tion analyses the impact of the new tools andtechnologies from Sections V VI VII in concertWe do so by investigating the deployment re-liability and scalability of the Orbit and ACObenchmarks from Section III The experimentsreported here are representative Similar exper-iments show consistent results for a range ofmicro-benchmarks several benchmarks andthe very substantial (approximately 150K linesof Erlang) Sim-Diasca case study [16] on sev-eral state of the art NUMA architectures and

the four clusters specified in Appendix A A co-herent presentation of many of these results isavailable in an article by [22] and in a RELEASEproject deliverable20 The bulk of the exper-iments reported here are conducted on theAthos cluster using ErlangOTP 174 and theassociated SD Erlang libraries

The experiments cover two measures of scal-ability As Orbit does a fixed size computa-tion the scaling measure is relative speedup (orstrong scaling) ie speedup relative to execu-tion time on a single core As the work in ACOincreases with compute resources weak scalingis the appropriate measure The benchmarks

20See Deliverable 62 available online httpwww

release-projecteudocumentsD62pdf

37

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

0

100

200

300

400

450

0 50(1200) 100(2400) 150(3600) 200(4800) 257(6168)

Rel

ativ

e S

pee

du

p

Number of nodes (cores)

D-Orbit W=2MD-Orbit W=5M

SD-Orbit W=2MSD-Orbit W=5M

Figure 29 Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling]

also evaluate different aspects of s_groups Or-bit evaluates the scalability impacts of networkconnections while ACO evaluates the impactof both network connections and the globalnamespace required for reliability

i Orbit

Figure 29 shows the speedup of the D-Orbitand SD-Orbit benchmarks from Section i1 Themeasurements are repeated seven times andwe plot standard deviation Results show thatD-Orbit performs better on a small numberof nodes as communication is direct ratherthan via a gateway node As the number ofnodes grows however SD-Orbit delivers betterspeedups ie beyond 80 nodes in case of 2Morbit elements and beyond 100 nodes in caseof 5M orbit elements When we increase thesize of Orbit beyond 5M the D-Orbit versionfails due to the fact that some VMs exceed theavailable RAM of 64GB In contrast SD-Orbitexperiments run successfully even for an orbitwith 60M elements

ii Ant Colony Optimisation (ACO)

Deployment The deployment and moni-toring of ACO and of the large (150Klines of Erlang) Sim-Diasca simulation usingWombatOAM (Section ii) on the Athos clusteris detailed in [22]

An example experiment deploys 10000Erlang nodes without enabling monitoringand hence allocates three nodes per core (ie72 nodes on each of 139 24-core Athos hosts)Figure 30 shows that WombatOAM deploysthe nodes in 212s which is approximately 47nodes per second It is more common to haveat least one core per Erlang node and in a re-lated experiment 5000 nodes are deployed oneper core (ie 24 nodes on each of 209 24-coreAthos hosts) in 101s or 50 nodes per secondCrucially in both cases the deployment timeis linear in the number of nodes The deploy-ment time could be reduced to be logarithmicin the number of nodes using standard divide-and-conquer techniques However there hasnrsquotbeen the demand to do so as most Erlang sys-tems are long running servers

38

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

80

120

160

200

0 2000 4000 6000 8000 10000

Tim

e (s

ec)

Number of nodes (N)

Experimental data Best fit 00125N+83

Figure 30 Wombat Deployment Time

The measurement data shows two impor-tant facts it shows that WombatOAM scaleswell (up to a deployment base of 10000 Erlangnodes) and that WombatOAM is non-intrusivebecause its overhead on a monitored node istypically less than 15 of effort on the node

We conclude that WombatOAM is capableof deploying and monitoring substantial dis-tributed Erlang and SD Erlang programs Theexperiments in the remainder of this sectionuse standard distributed Erlang configurationfile deployment

Reliability SD Erlang changes the organisa-tion of processes and their recovery data atthe language level so we seek to show thatthese changes have not disrupted Erlangrsquosworld-class reliability mechanisms at this levelAs we havenrsquot changed them we donrsquot exer-cise Erlangrsquos other reliability mechanisms egthose for managing node failures network con-gestion etc A more detailed study of SDErlang reliability including the use of repli-cated databases for recovering Instant Messen-ger chat sessions finds similar results [23]

We evaluate the reliability of two ACO ver-sions using a Chaos Monkey service that killsprocesses in the running system at random [14]Recall that GR-ACO provides reliability by reg-istering the names of critical processes globallyand SR-ACO registers them only within ans_group (Section ii)

For both GR-ACO and SR-ACO a ChaosMonkey runs on every Erlang node ie mas-ter submasters and colony nodes killing arandom Erlang process every second BothACO versions run to completion Recoveryat this failure frequency has no measurableimpact on runtime This is because processesare recovered within the Virtual machine using(globally synchronised) local recovery informa-tion For example on a common X86Ubuntuplatform typical Erlang process recovery timesare around 03ms so around 1000x less thanthe Unix process recovery time on the sameplatform [58] We have conducted more de-tailed experiments on an Instant Messengerbenchmark and obtained similar results [23]

We conclude that both GR-ACO and SR-ACO are reliable and that SD Erlang preserves

39

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

0

4

8

12

16

0 50(1200) 100(2400) 150(3600) 200(4800) 250(6000)

Ex

ecu

tio

n t

ime (

s)

Number of nodes (cores)

GR-ACO SR-ACO ML-ACO

Figure 31 ACO execution times ErlangOTP 174 (RELEASE) [Weak Scaling]

distributed Erlang reliability model The re-mainder of the section outlines the impact ofmaintaining the recovery information requiredfor reliability on scalability

Scalability Figure 31 compares the runtimesof the ML GR and SR versions of ACO (Sec-tion ii) on ErlangOTP 174(RELEASE) As out-lined in Section i1 GR-ACO not only main-tains a fully connected graph of nodes it reg-isters process names for reliability and hencescales significantly worse than the unreliableML-ACO We conclude that providing reliabil-ity with standard distributed Erlang processregistration dramatically limits scalability

While ML-ACO does not provide reliabilityand hence doesnrsquot register process names itmaintains a fully connected graph of nodeswhich limits scalability SR-ACO that main-tains connections and registers process namesonly within s_groups scales best of all Fig-ure 31 illustrates how maintaining the processnamespace and fully connected network im-pacts performance This reinforces the evi-dence from the Orbit benchmarks and oth-

ers that partitioning the network of Erlangnodes significantly improves performance atlarge scale

To investigate the impact of SD Erlang onnetwork traffic we measure the number of sentand received packets on the GPG cluster forthree versions of ACO ML-ACO GR-ACOand SR-ACO Figure 32 shows the total numberof sent packets The highest traffic (the red line)belongs to the GR-ACO and the lowest trafficbelongs to the SR-ACO (dark blue line) Thisshows that SD Erlang significantly reduces thenetwork traffic between Erlang nodes Evenwith the s_group name registration SR-ACOhas less network traffic than ML-ACO that hasno global name registration

iii Evaluation Summary

We have shown that WombatOAM is capableof deploying and monitoring substantial dis-tributed Erlang and SD Erlang programs likeACO and Sim-Diasca The Chaos Monkey ex-periments with GR-ACO and SR-ACO showthat both are reliable and hence that SD Erlang

40

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 32 Number of sent packets in ML-ACO GR-ACO and SR-ACO

preserves the distributed Erlang language-levelreliability model

As SD Orbit scales better than D-Orbit SR-ACO scales better than ML-ACO and SR-ACO has significantly less network trafficwe conclude that even when global recoverydata is not maintained partitioning the fully-connected network into s_groups reduces net-work traffic and improves performance Whilethe distributed Orbit instances (W=2M) and(W=5M) reach scalability limits at around 40and 60 nodes Orbit scales to 150 nodes on SDErlang (limited by input size) and SR-ACOis still scaling well on 256 nodes (6144 cores)Hence not only have we exceeded the 60 nodescaling limits of distributed Erlang identifiedin Section ii we have not reached the scalinglimits of SD Erlang on this architecture

Comparing GR-ACO and ML-ACO scalabil-ity curves shows that maintaining global re-covery data ie a process name space dra-matically limits scalability Comparing GR-ACO and SR-ACO scalability curves showsthat scalability can be recovered by partitioningthe nodes into appropriately-sized s_groups

and hence maintaining the recovery data onlywithin a relatively small group of nodes Theseresults are consistent with other experiments

IX Discussion

Distributed actor platforms like Erlang orScala with Akka are a common choice forinternet-scale system architects as the modelwith its automatic and VM-supported reliabil-ity mechanisms makes it extremely easy to engi-neer scalable reliable systems Targeting emergentserver architectures with hundreds of hostsand tens of thousands of cores we report asystematic effort to improve the scalability of aleading distributed actor language while pre-serving reliability The work is a vade mecumfor addressing scalability of reliable actor lan-guages and frameworks It is also high impactwith downloads of our improved ErlangOTPrunning at 50K a month

We have undertaken the first systematicstudy of scalability in a distributed actor lan-guage covering VM language and persis-tent storage levels We have developed the

41

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

BenchErl and DE-Bench tools for this purposeKey VM-level scalability issues we identify in-clude contention for shared ETS tables and forcommonly-used shared resources like timersKey language scalability issues are the costs ofmaintaining a fully-connected network main-taining global recovery information and ex-plicit process placement Unsurprisingly thescaling issues for this distributed actor lan-guage are common to other distributed or par-allel languages and frameworks with otherparadigms like CHARM++ [45] Cilk [15] orLegion [40] We establish scientifically the folk-lore limitations of around 60 connected nodesfor distributed Erlang (Section IV)

The actor model is no panacea and there canstill be scalability problems in the algorithmsthat we write either within a single actor or inthe way that we structure communicating ac-tors A range of pragmatic issues also impactthe performance and scalability of actor sys-tems including memory occupied by processes(even when quiescent) mailboxes filling upetc Identifying and resolving these problemsis where tools like Percept2 and WombatOAMare needed However many of the scalabilityissues arise where Erlang departs from the privatestate principle of the actor model eg in main-taining shared state in ETS tables or a sharedglobal process namespace for recovery

We have designed and implemented a setof Scalable Distributed (SD) Erlang librariesto address language-level scalability issues Thekey constructs are s_groups for partitioning thenetwork and global process namespace andsemi-explicit process placement for deployingdistributed Erlang applications on large hetero-geneous architectures in a portable way Wehave provided a state transition operational se-mantics for the new s_groups and validatedthe library implementation against the seman-tics using QuickCheck (Section V)

To improve the scalability of the Erlang VMand libraries we have improved the implemen-tation of shared ETS tables time management

and load balancing between schedulers Fol-lowing a systematic analysis of ETS tablesthe number of fine-grained (bucket) locks andof reader groups have been increased Wehave developed and evaluated four new tech-niques for improving ETS scalability(i) pro-grammer control of number of bucket locks(ii) a contention-adapting tree data structurefor ordered_sets (iii) queue delegation lock-ing and (iv) eliminating the locks in the metatable We have introduced a new scheduler util-isation balancing mechanism to spread work tomultiple schedulers (and hence cores) and newsynchronisation mechanisms to reduce con-tention on the widely-used time managementmechanisms By June 2015 with ErlangOTP180 the majority of these changes had been in-cluded in the primary releases In any scalableactor language implementation such thought-ful design and engineering will be requiredto schedule large numbers of actors on hostswith many cores and to minimise contentionon shared VM resources (Section VI)

To facilitate the development of large Erlangsystems and to make them understandable wehave developed a range of tools The propri-etary WombatOAM tool deploys and monitorslarge distributed Erlang systems over multi-ple and possibly heterogeneous clusters orclouds We have made open source releases offour concurrency tools Percept2 now detectsconcurrency bad smells Wrangler provides en-hanced concurrency refactoring the Devo toolis enhanced to provide interactive visualisationof SD Erlang systems and the new SD-Montool monitors SD Erlang systems We an-ticipate that these tools will guide the designof tools for other large scale distributed actorlanguages and frameworks (Section VII)

We report on the reliability and scalabilityimplications of our new technologies using arange of benchmarks and consistently use theOrbit and ACO benchmarks throughout thearticle While we report measurements on arange of NUMA and cluster architectures the

42

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Table 2 Cluster Specifications (RAM per host in GB)

Cores

per max

Name Hosts host total avail Processor RAM Inter-connection

GPG 20 16 320 320 Intel Xeon E5-2640v2 8C2GHz

64 10GB Ethernet

Kalkyl 348 8 2784 1408 Intel Xeon 5520v2 4C226GHz

24ndash72 InfiniBand 20Gbs

TinTin 160 16 2560 2240AMD Opteron 6220v2 Bull-dozer 8C 30GHz 64ndash128

21 oversub-scribed QDRInfiniband

Athos 776 24 18624 6144 Intel Xeon E5-2697v2 12C27GHz

64 InfinibandFDR14

key scalability experiments are conducted onthe Athos cluster with 256 hosts (6144 cores)Even when global recovery data is not main-tained partitioning the network into s_groupsreduces network traffic and improves the per-formance of the Orbit and ACO benchmarksabove 80 hosts Crucially we exceed the 60node limit for distributed Erlang and do notreach the scalability limits of SD Erlang with256 nodesVMs and 6144 cores Chaos Mon-key experiments show that two versions ofACO are reliable and hence that SD Erlang pre-serves the Erlang reliability model Howeverthe ACO results show that maintaining globalrecovery data ie a global process name spacedramatically limits scalability in distributedErlang Scalability can however be recoveredby maintaining recovery data only within ap-propriately sized s_groups These results areconsistent with experiments with other bench-marks and on other architectures (Section VIII)

In future work we plan to incorporateRELEASE technologies along with other tech-nologies in a generic framework for buildingperformant large scale servers In addition pre-

liminary investigations suggest that some SDErlang ideas could improve the scalability ofother actor languages21 For example the Akkaframework for Scala could benefit from semi-explicit placement and Cloud Haskell frompartitioning the network

Appendix A Architecture

Specifications

The specifications of the clusters used for mea-surement are summarised in Table 2 We alsouse the following NUMA machines (1) AnAMD Bulldozer with 16M L216M L3 cache128GB RAM four AMD Opteron 6276s at23 GHz 16 ldquoBulldozerrdquo cores each giving a to-tal of 64 cores (2) An Intel NUMA with 128GBRAM four Intel Xeon E5-4650s at 270GHzeach with eight hyperthreaded cores giving atotal of 64 cores

21See Deliverable 67 (httpwwwrelease-projecteudocumentsD67pdf) available online

43

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Acknowledgements

We would like to thank the entire RELEASEproject team for technical insights and adminis-trative support Roberto Aloi and Enrique Fer-nandez Casado contributed to the developmentand measurement of WombatOAM This workhas been supported by the European Uniongrant RII3-CT-2005-026133 ldquoSCIEnce SymbolicComputing Infrastructure in Europerdquo IST-2011-287510 ldquoRELEASE A High-Level Paradigm forReliable Large-scale Server Softwarerdquo and bythe UKrsquos Engineering and Physical SciencesResearch Council grant EPG0551811 ldquoHPC-GAP High Performance Computational Alge-bra and Discrete Mathematicsrdquo

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level Visualization of Concurrent and Distributed Computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riakdocs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the sim-diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Koszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go programming language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in termite scheme. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. De-bench: a benchmark tool for distributed erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary, adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 11–11, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. Charm++: a portable concurrent object oriented system based on c++. In ACM Sigplan Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 141–141, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.

[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lübeck and Max Neunhöffer. Enumerating large Orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing - 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale (3. ed., revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-Strength Functional programming: Experiences with the Ericsson AXD301 Project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.

48

  • Introduction
  • Context
    • Scalable Reliable Programming Models
    • Actor Languages
    • Erlangs Support for Concurrency
    • Scalability and Reliability in Erlang Systems
    • ETS Erlang Term Storage
      • Benchmarks for scalability and reliability
        • Orbit
        • Ant Colony Optimisation (ACO)
          • Erlang Scalability Limits
            • Scaling Erlang on a Single Host
            • Distributed Erlang Scalability
            • Persistent Storage
              • Improving Language Scalability
                • SD Erlang Design
                  • S_groups
                  • Semi-Explicit Placement
                    • S_group Semantics
                    • Semantics Validation
                      • Improving VM Scalability
                        • Improvements to Erlang Term Storage
                        • Improvements to Schedulers
                        • Improvements to Time Management
                          • Scalable Tools
                            • Refactoring for Scalability
                            • Scalable Deployment
                            • Monitoring and Visualisation
                              • Percept2
                              • DTraceSystemTap tracing
                              • Devo
                              • SD-Mon
                                  • Systemic Evaluation
                                    • Orbit
                                    • Ant Colony Optimisation (ACO)
                                    • Evaluation Summary
                                      • Discussion
Page 38: Scaling Reliably Scaling Reliably: Improving the ... · Distributed actor languages are an effective means of constructing scalable reliable systems, and the Erlang ... that capture

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

0

100

200

300

400

450

0 50(1200) 100(2400) 150(3600) 200(4800) 257(6168)

Rel

ativ

e S

pee

du

p

Number of nodes (cores)

D-Orbit W=2MD-Orbit W=5M

SD-Orbit W=2MSD-Orbit W=5M

Figure 29 Speedup of distributed Erlang vs SD Erlang Orbit (2M and 5M elements) [Strong Scaling]

also evaluate different aspects of s_groups Or-bit evaluates the scalability impacts of networkconnections while ACO evaluates the impactof both network connections and the globalnamespace required for reliability

i Orbit

Figure 29 shows the speedup of the D-Orbitand SD-Orbit benchmarks from Section i1 Themeasurements are repeated seven times andwe plot standard deviation Results show thatD-Orbit performs better on a small numberof nodes as communication is direct ratherthan via a gateway node As the number ofnodes grows however SD-Orbit delivers betterspeedups ie beyond 80 nodes in case of 2Morbit elements and beyond 100 nodes in caseof 5M orbit elements When we increase thesize of Orbit beyond 5M the D-Orbit versionfails due to the fact that some VMs exceed theavailable RAM of 64GB In contrast SD-Orbitexperiments run successfully even for an orbitwith 60M elements

ii Ant Colony Optimisation (ACO)

Deployment The deployment and moni-toring of ACO and of the large (150Klines of Erlang) Sim-Diasca simulation usingWombatOAM (Section ii) on the Athos clusteris detailed in [22]

An example experiment deploys 10000Erlang nodes without enabling monitoringand hence allocates three nodes per core (ie72 nodes on each of 139 24-core Athos hosts)Figure 30 shows that WombatOAM deploysthe nodes in 212s which is approximately 47nodes per second It is more common to haveat least one core per Erlang node and in a re-lated experiment 5000 nodes are deployed oneper core (ie 24 nodes on each of 209 24-coreAthos hosts) in 101s or 50 nodes per secondCrucially in both cases the deployment timeis linear in the number of nodes The deploy-ment time could be reduced to be logarithmicin the number of nodes using standard divide-and-conquer techniques However there hasnrsquotbeen the demand to do so as most Erlang sys-tems are long running servers

38

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

80

120

160

200

0 2000 4000 6000 8000 10000

Tim

e (s

ec)

Number of nodes (N)

Experimental data Best fit 00125N+83

Figure 30 Wombat Deployment Time

The measurement data shows two impor-tant facts it shows that WombatOAM scaleswell (up to a deployment base of 10000 Erlangnodes) and that WombatOAM is non-intrusivebecause its overhead on a monitored node istypically less than 15 of effort on the node

We conclude that WombatOAM is capableof deploying and monitoring substantial dis-tributed Erlang and SD Erlang programs Theexperiments in the remainder of this sectionuse standard distributed Erlang configurationfile deployment

Reliability SD Erlang changes the organisa-tion of processes and their recovery data atthe language level so we seek to show thatthese changes have not disrupted Erlangrsquosworld-class reliability mechanisms at this levelAs we havenrsquot changed them we donrsquot exer-cise Erlangrsquos other reliability mechanisms egthose for managing node failures network con-gestion etc A more detailed study of SDErlang reliability including the use of repli-cated databases for recovering Instant Messen-ger chat sessions finds similar results [23]

We evaluate the reliability of two ACO ver-sions using a Chaos Monkey service that killsprocesses in the running system at random [14]Recall that GR-ACO provides reliability by reg-istering the names of critical processes globallyand SR-ACO registers them only within ans_group (Section ii)

For both GR-ACO and SR-ACO a ChaosMonkey runs on every Erlang node ie mas-ter submasters and colony nodes killing arandom Erlang process every second BothACO versions run to completion Recoveryat this failure frequency has no measurableimpact on runtime This is because processesare recovered within the Virtual machine using(globally synchronised) local recovery informa-tion For example on a common X86Ubuntuplatform typical Erlang process recovery timesare around 03ms so around 1000x less thanthe Unix process recovery time on the sameplatform [58] We have conducted more de-tailed experiments on an Instant Messengerbenchmark and obtained similar results [23]

We conclude that both GR-ACO and SR-ACO are reliable and that SD Erlang preserves

39

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

0

4

8

12

16

0 50(1200) 100(2400) 150(3600) 200(4800) 250(6000)

Ex

ecu

tio

n t

ime (

s)

Number of nodes (cores)

GR-ACO SR-ACO ML-ACO

Figure 31 ACO execution times ErlangOTP 174 (RELEASE) [Weak Scaling]

distributed Erlang reliability model The re-mainder of the section outlines the impact ofmaintaining the recovery information requiredfor reliability on scalability

Scalability Figure 31 compares the runtimesof the ML GR and SR versions of ACO (Sec-tion ii) on ErlangOTP 174(RELEASE) As out-lined in Section i1 GR-ACO not only main-tains a fully connected graph of nodes it reg-isters process names for reliability and hencescales significantly worse than the unreliableML-ACO We conclude that providing reliabil-ity with standard distributed Erlang processregistration dramatically limits scalability

While ML-ACO does not provide reliabilityand hence doesnrsquot register process names itmaintains a fully connected graph of nodeswhich limits scalability SR-ACO that main-tains connections and registers process namesonly within s_groups scales best of all Fig-ure 31 illustrates how maintaining the processnamespace and fully connected network im-pacts performance This reinforces the evi-dence from the Orbit benchmarks and oth-

ers that partitioning the network of Erlangnodes significantly improves performance atlarge scale

To investigate the impact of SD Erlang onnetwork traffic we measure the number of sentand received packets on the GPG cluster forthree versions of ACO ML-ACO GR-ACOand SR-ACO Figure 32 shows the total numberof sent packets The highest traffic (the red line)belongs to the GR-ACO and the lowest trafficbelongs to the SR-ACO (dark blue line) Thisshows that SD Erlang significantly reduces thenetwork traffic between Erlang nodes Evenwith the s_group name registration SR-ACOhas less network traffic than ML-ACO that hasno global name registration

iii Evaluation Summary

We have shown that WombatOAM is capableof deploying and monitoring substantial dis-tributed Erlang and SD Erlang programs likeACO and Sim-Diasca The Chaos Monkey ex-periments with GR-ACO and SR-ACO showthat both are reliable and hence that SD Erlang

40

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 32 Number of sent packets in ML-ACO GR-ACO and SR-ACO

preserves the distributed Erlang language-levelreliability model

As SD Orbit scales better than D-Orbit SR-ACO scales better than ML-ACO and SR-ACO has significantly less network trafficwe conclude that even when global recoverydata is not maintained partitioning the fully-connected network into s_groups reduces net-work traffic and improves performance Whilethe distributed Orbit instances (W=2M) and(W=5M) reach scalability limits at around 40and 60 nodes Orbit scales to 150 nodes on SDErlang (limited by input size) and SR-ACOis still scaling well on 256 nodes (6144 cores)Hence not only have we exceeded the 60 nodescaling limits of distributed Erlang identifiedin Section ii we have not reached the scalinglimits of SD Erlang on this architecture

Comparing GR-ACO and ML-ACO scalabil-ity curves shows that maintaining global re-covery data ie a process name space dra-matically limits scalability Comparing GR-ACO and SR-ACO scalability curves showsthat scalability can be recovered by partitioningthe nodes into appropriately-sized s_groups

and hence maintaining the recovery data onlywithin a relatively small group of nodes Theseresults are consistent with other experiments

IX Discussion

Distributed actor platforms like Erlang orScala with Akka are a common choice forinternet-scale system architects as the modelwith its automatic and VM-supported reliabil-ity mechanisms makes it extremely easy to engi-neer scalable reliable systems Targeting emergentserver architectures with hundreds of hostsand tens of thousands of cores we report asystematic effort to improve the scalability of aleading distributed actor language while pre-serving reliability The work is a vade mecumfor addressing scalability of reliable actor lan-guages and frameworks It is also high impactwith downloads of our improved ErlangOTPrunning at 50K a month

We have undertaken the first systematicstudy of scalability in a distributed actor lan-guage covering VM language and persis-tent storage levels We have developed the

41

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

BenchErl and DE-Bench tools for this purposeKey VM-level scalability issues we identify in-clude contention for shared ETS tables and forcommonly-used shared resources like timersKey language scalability issues are the costs ofmaintaining a fully-connected network main-taining global recovery information and ex-plicit process placement Unsurprisingly thescaling issues for this distributed actor lan-guage are common to other distributed or par-allel languages and frameworks with otherparadigms like CHARM++ [45] Cilk [15] orLegion [40] We establish scientifically the folk-lore limitations of around 60 connected nodesfor distributed Erlang (Section IV)

The actor model is no panacea and there canstill be scalability problems in the algorithmsthat we write either within a single actor or inthe way that we structure communicating ac-tors A range of pragmatic issues also impactthe performance and scalability of actor sys-tems including memory occupied by processes(even when quiescent) mailboxes filling upetc Identifying and resolving these problemsis where tools like Percept2 and WombatOAMare needed However many of the scalabilityissues arise where Erlang departs from the privatestate principle of the actor model eg in main-taining shared state in ETS tables or a sharedglobal process namespace for recovery

We have designed and implemented a setof Scalable Distributed (SD) Erlang librariesto address language-level scalability issues Thekey constructs are s_groups for partitioning thenetwork and global process namespace andsemi-explicit process placement for deployingdistributed Erlang applications on large hetero-geneous architectures in a portable way Wehave provided a state transition operational se-mantics for the new s_groups and validatedthe library implementation against the seman-tics using QuickCheck (Section V)

To improve the scalability of the Erlang VMand libraries we have improved the implemen-tation of shared ETS tables time management

and load balancing between schedulers Fol-lowing a systematic analysis of ETS tablesthe number of fine-grained (bucket) locks andof reader groups have been increased Wehave developed and evaluated four new tech-niques for improving ETS scalability(i) pro-grammer control of number of bucket locks(ii) a contention-adapting tree data structurefor ordered_sets (iii) queue delegation lock-ing and (iv) eliminating the locks in the metatable We have introduced a new scheduler util-isation balancing mechanism to spread work tomultiple schedulers (and hence cores) and newsynchronisation mechanisms to reduce con-tention on the widely-used time managementmechanisms By June 2015 with ErlangOTP180 the majority of these changes had been in-cluded in the primary releases In any scalableactor language implementation such thought-ful design and engineering will be requiredto schedule large numbers of actors on hostswith many cores and to minimise contentionon shared VM resources (Section VI)

To facilitate the development of large Erlangsystems and to make them understandable wehave developed a range of tools The propri-etary WombatOAM tool deploys and monitorslarge distributed Erlang systems over multi-ple and possibly heterogeneous clusters orclouds We have made open source releases offour concurrency tools Percept2 now detectsconcurrency bad smells Wrangler provides en-hanced concurrency refactoring the Devo toolis enhanced to provide interactive visualisationof SD Erlang systems and the new SD-Montool monitors SD Erlang systems We an-ticipate that these tools will guide the designof tools for other large scale distributed actorlanguages and frameworks (Section VII)

We report on the reliability and scalabilityimplications of our new technologies using arange of benchmarks and consistently use theOrbit and ACO benchmarks throughout thearticle While we report measurements on arange of NUMA and cluster architectures the

42

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Table 2 Cluster Specifications (RAM per host in GB)

Cores

per max

Name Hosts host total avail Processor RAM Inter-connection

GPG 20 16 320 320 Intel Xeon E5-2640v2 8C2GHz

64 10GB Ethernet

Kalkyl 348 8 2784 1408 Intel Xeon 5520v2 4C226GHz

24ndash72 InfiniBand 20Gbs

TinTin 160 16 2560 2240AMD Opteron 6220v2 Bull-dozer 8C 30GHz 64ndash128

21 oversub-scribed QDRInfiniband

Athos 776 24 18624 6144 Intel Xeon E5-2697v2 12C27GHz

64 InfinibandFDR14

key scalability experiments are conducted onthe Athos cluster with 256 hosts (6144 cores)Even when global recovery data is not main-tained partitioning the network into s_groupsreduces network traffic and improves the per-formance of the Orbit and ACO benchmarksabove 80 hosts Crucially we exceed the 60node limit for distributed Erlang and do notreach the scalability limits of SD Erlang with256 nodesVMs and 6144 cores Chaos Mon-key experiments show that two versions ofACO are reliable and hence that SD Erlang pre-serves the Erlang reliability model Howeverthe ACO results show that maintaining globalrecovery data ie a global process name spacedramatically limits scalability in distributedErlang Scalability can however be recoveredby maintaining recovery data only within ap-propriately sized s_groups These results areconsistent with experiments with other bench-marks and on other architectures (Section VIII)

In future work we plan to incorporateRELEASE technologies along with other tech-nologies in a generic framework for buildingperformant large scale servers In addition pre-

liminary investigations suggest that some SDErlang ideas could improve the scalability ofother actor languages21 For example the Akkaframework for Scala could benefit from semi-explicit placement and Cloud Haskell frompartitioning the network

Appendix A Architecture

Specifications

The specifications of the clusters used for mea-surement are summarised in Table 2 We alsouse the following NUMA machines (1) AnAMD Bulldozer with 16M L216M L3 cache128GB RAM four AMD Opteron 6276s at23 GHz 16 ldquoBulldozerrdquo cores each giving a to-tal of 64 cores (2) An Intel NUMA with 128GBRAM four Intel Xeon E5-4650s at 270GHzeach with eight hyperthreaded cores giving atotal of 64 cores

21See Deliverable 67 (httpwwwrelease-projecteudocumentsD67pdf) available online

43

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Acknowledgements

We would like to thank the entire RELEASEproject team for technical insights and adminis-trative support Roberto Aloi and Enrique Fer-nandez Casado contributed to the developmentand measurement of WombatOAM This workhas been supported by the European Uniongrant RII3-CT-2005-026133 ldquoSCIEnce SymbolicComputing Infrastructure in Europerdquo IST-2011-287510 ldquoRELEASE A High-Level Paradigm forReliable Large-scale Server Softwarerdquo and bythe UKrsquos Engineering and Physical SciencesResearch Council grant EPG0551811 ldquoHPC-GAP High Performance Computational Alge-bra and Discrete Mathematicsrdquo

References

[1] Gul Agha ACTORS A Model of ConcurrentComputation in Distributed Systems PhDthesis MIT 1985

[2] Gul Agha An overview of Actor lan-guages SIGPLAN Not 21(10)58ndash67 1986

[3] Bulldozer (microarchitecture) 2015httpsenwikipediaorgwiki

Bulldozer_(microarchitecture)

[4] Apache SF Liblcoud 2016 https

libcloudapacheorg

[5] C R Aragon and R G Seidel Random-ized search trees In Proceedings of the 30thAnnual Symposium on Foundations of Com-puter Science pages 540ndash545 October 1989

[6] Joe Armstrong Programming Erlang Soft-ware for a Concurrent World PragmaticBookshelf 2007

[7] Joe Armstrong Erlang Commun ACM5368ndash75 2010

[8] Stavros Aronis Nikolaos Papaspyrou Ka-terina Roukounaki Konstantinos Sago-nas Yiannis Tsiouris and Ioannis E

Venetis A scalability benchmark suite forErlangOTP In Torben Hoffman and JohnHughes editors Proceedings of the EleventhACM SIGPLAN Workshop on Erlang pages33ndash42 New York NY USA September2012 ACM

[9] Thomas Arts John Hughes Joakim Jo-hansson and Ulf Wiger Testing telecomssoftware with Quviq QuickCheck In Pro-ceedings of the 2006 ACM SIGPLAN work-shop on Erlang pages 2ndash10 New York NYUSA 2006 ACM

[10] Robert Baker Peter Rodgers SimonThompson and Huiqing Li Multi-level Vi-sualization of Concurrent and DistributedComputation in Erlang In Visual Lan-guages and Computing (VLC) The 19th In-ternational Conference on Distributed Multi-media Systems (DMS 2013) 2013

[11] Luiz Andreacute Barroso Jimmy Clidaras andUrs Houmllzle The Datacenter as a ComputerMorgan and Claypool 2nd edition 2013

[12] Basho Technologies Riakdocs BashoBench 2014 httpdocsbasho

comriaklatestopsbuilding

benchmarking

[13] J E Beasley OR-Library Distributingtest problems by electronic mail TheJournal of the Operational Research Soci-ety 41(11)1069ndash1072 1990 Datasetsavailable at httppeoplebrunelac

uk~mastjjbjeborlibwtinfohtml

[14] Cory Bennett and Ariel Tseitlin Chaosmonkey released into the wild NetflixBlog 2012

[15] Robert D Blumofe Christopher F JoergBradley C Kuszmaul Charles E LeisersonKeith H Randall and Yuli Zhou Cilk Anefficient multithreaded runtime system In

44

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Proceedings of the fifth ACM SIGPLAN sym-posium on Principles and practice of parallelprogramming pages 207ndash216 ACM 1995

[16] Olivier Boudeville Technical manual ofthe sim-diasca simulation engine EDFRampD 2012

[17] Istvaacuten Bozoacute Viktoacuteria Foumlrdos Daacuteniel Hor-paacutecsi Zoltaacuten Horvaacuteth Tamaacutes Kozsik Ju-dit Koszegi and Melinda Toacuteth Refactor-ings to enable parallelization In JurriaanHage and Jay McCarthy editors Trends inFunctional Programming 15th InternationalSymposium TFP 2014 Revised Selected Pa-pers volume 8843 of LNCS pages 104ndash121Springer 2015

[18] Irina Calciu Dave Dice Yossi Lev Vic-tor Luchangco Virendra J Marathe andNir Shavit NUMA-aware reader-writerlocks In Proceedings of the 18th ACM SIG-PLAN Symposium on Principles and Prac-tice of Parallel Programming pages 157ndash166New York NY USA 2013 ACM

[19] Francesco Cesarini and Simon Thomp-son Erlang Programming A ConcurrentApproach to Software Development OrsquoReillyMedia 1st edition 2009

[20] Rohit Chandra Leonardo Dagum DaveKohr Dror Maydan Jeff McDonald andRamesh Menon Parallel Programming inOpenMP Morgan Kaufmann San Fran-cisco CA USA 2001

[21] Natalia Chechina Huiqing Li AmirGhaffari Simon Thompson and PhilTrinder Improving the network scalabil-ity of Erlang J Parallel Distrib Comput90(C)22ndash34 2016

[22] Natalia Chechina Kenneth MacKenzieSimon Thompson Phil Trinder OlivierBoudeville Viktoacuteria Foumlrdos Csaba HochAmir Ghaffari and Mario Moro Her-nandez Evaluating scalable distributed

Erlang for scalability and reliability IEEETransactions on Parallel and Distributed Sys-tems 2017

[23] Natalia Chechina Mario Moro Hernan-dez and Phil Trinder A scalable reliableinstant messenger using the SD Erlang li-braries In Erlangrsquo16 pages 33ndash41 NewYork NY USA 2016 ACM

[24] Koen Claessen and John HughesQuickCheck a lightweight tool forrandom testing of Haskell programsIn Proceedings of the 5th ACM SIGPLANInternational Conference on FunctionalProgramming pages 268ndash279 New YorkNY USA 2000 ACM

[25] H A J Crauwels C N Potts and L Nvan Wassenhove Local search heuristicsfor the single machine total weighted tar-diness scheduling problem INFORMSJournal on Computing 10(3)341ndash350 1998The datasets from this paper are includedin Beasleyrsquos ORLIB

[26] Jeffrey Dean and Sanjay GhemawatMapreduce Simplified data processing onlarge clusters Commun ACM 51(1)107ndash113 2008

[27] David Dewolfs Jan Broeckhove VaidySunderam and Graham E Fagg FT-MPI fault-tolerant metacomputing andgeneric name services A case study InEuroPVMMPIrsquo06 pages 133ndash140 BerlinHeidelberg 2006 Springer-Verlag

[28] Alan AA Donovan and Brian WKernighan The Go programming languageAddison-Wesley Professional 2015

[29] Marco Dorigo and Thomas Stuumltzle AntColony Optimization Bradford CompanyScituate MA USA 2004

[30] Faith Ellen Yossi Lev Victor Luchangcoand Mark Moir SNZI scalable NonZero

45

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

indicators In Proceedings of the 26th AnnualACM Symposium on Principles of DistributedComputing pages 13ndash22 New York NYUSA 2007 ACM

[31] Jeff Epstein Andrew P Black and Si-mon Peyton-Jones Towards Haskell inthe Cloud In Haskell rsquo11 pages 118ndash129ACM 2011

[32] Martin Fowler Refactoring Improving theDesign of Existing Code Addison-WesleyLongman Publishing Co Inc BostonMA USA 1999

[33] Ana Gainaru and Franck Cappello Errorsand faults In Fault-Tolerance Techniques forHigh-Performance Computing pages 89ndash144Springer International Publishing 2015

[34] Martin Josef Geiger New instancesfor the Single Machine Total WeightedTardiness Problem Technical ReportResearch Report 10-03-01 March 2010Datasets available at httplogistik

hsu-hhdeSMTWTP

[35] Guillaume Germain Concurrency ori-ented programming in termite schemeIn Proceedings of the 2006 ACM SIGPLANworkshop on Erlang pages 20ndash20 NewYork NY USA 2006 ACM

[36] Amir Ghaffari De-bench a benchmarktool for distributed erlang 2014 https

githubcomamirghaffariDEbench

[37] Amir Ghaffari Investigating the scalabil-ity limits of distributed Erlang In Proceed-ings of the 13th ACM SIGPLAN Workshopon Erlang pages 43ndash49 ACM 2014

[38] Amir Ghaffari Natalia Chechina PhipTrinder and Jon Meredith Scalable persis-tent storage for Erlang Theory and prac-tice In Proceedings of the 12th ACM SIG-PLAN Workshop on Erlang pages 73ndash74New York NY USA 2013 ACM

[39] Seth Gilbert and Nancy Lynch Brewerrsquosconjecture and the feasibility of consistentavailable partition-tolerant web servicesACM SIGACT News 3351ndash59 2002

[40] Andrew S Grimshaw Wm A Wulf andthe Legion team The Legion vision of aworldwide virtual computer Communica-tions of the ACM 40(1)39ndash45 1997

[41] Carl Hewitt Actor model for discre-tionary adaptive concurrency CoRRabs10081459 2010

[42] Carl Hewitt Peter Bishop and RichardSteiger A universal modular ACTOR for-malism for artificial intelligence In IJ-CAIrsquo73 pages 235ndash245 San Francisco CAUSA 1973 Morgan Kaufmann

[43] Rich Hickey The Clojure programminglanguage In DLSrsquo08 pages 11ndash11 NewYork NY USA 2008 ACM

[44] Zoltaacuten Horvaacuteth Laacuteszloacute Loumlvei TamaacutesKozsik Roacutebert Kitlei Anikoacute Nagyneacute ViacutegTamaacutes Nagy Melinda Toacuteth and RolandKiraacutely Building a refactoring tool forErlang In Workshop on Advanced SoftwareDevelopment Tools and Techniques WAS-DETT 2008 2008

[45] Laxmikant V Kale and Sanjeev KrishnanCharm++ a portable concurrent objectoriented system based on c++ In ACMSigplan Notices volume 28 pages 91ndash108ACM 1993

[46] David Klaftenegger Konstantinos Sago-nas and Kjell Winblad On the scalabilityof the Erlang term storage In Proceedingsof the Twelfth ACM SIGPLAN Workshop onErlang pages 15ndash26 New York NY USASeptember 2013 ACM

[47] David Klaftenegger Konstantinos Sago-nas and Kjell Winblad Delegation lock-ing libraries for improved performance of

46

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

multithreaded programs In Euro-Par 2014Parallel Processing volume 8632 of LNCSpages 572ndash583 Springer 2014

[48] David Klaftenegger Konstantinos Sago-nas and Kjell Winblad Queue delegationlocking IEEE Transactions on Parallel andDistributed Systems 2017 To appear

[49] Rusty Klophaus Riak core Building dis-tributed applications without shared stateIn ACM SIGPLAN Commercial Users ofFunctional Programming CUFP rsquo10 pages141ndash141 New York NY USA 2010ACM

[50] Avinash Lakshman and Prashant MalikCassandra A decentralized structuredstorage system SIGOPS Oper Syst Rev44(2)35ndash40 April 2010

[51] J Lee et al Python actor runtime library2010 oslcsuiuieduparley

[52] Huiqing Li and Simon Thompson Auto-mated API Migration in a User-ExtensibleRefactoring Tool for Erlang Programs InTim Menzies and Motoshi Saeki editorsAutomated Software Engineering ASErsquo12IEEE Computer Society 2012

[53] Huiqing Li and Simon Thompson Multi-core profiling for Erlang programs usingPercept2 In Proceedings of the Twelfth ACMSIGPLAN workshop on Erlang 2013

[54] Huiqing Li and Simon Thompson Im-proved Semantics and ImplementationThrough Property-Based Testing withQuickCheck In 9th International Workshopon Automation of Software Test 2014

[55] Huiqing Li and Simon Thompson SafeConcurrency Introduction through SlicingIn PEPM rsquo15 ACM SIGPLAN January2015

[56] Huiqing Li Simon Thompson GyoumlrgyOrosz and Melinda Toumlth Refactoring

with Wrangler updated In ACM SIG-PLAN Erlang Workshop volume 2008 2008

[57] Frank Lubeck and Max Neunhoffer Enu-merating large Orbits and direct conden-sation Experimental Mathematics 10(2)197ndash205 2001

[58] Andreea Lutac Natalia Chechina Ger-ardo Aragon-Camarasa and Phil TrinderTowards reliable and scalable robot com-munication In Proceedings of the 15th Inter-national Workshop on Erlang Erlang 2016pages 12ndash23 New York NY USA 2016ACM

[59] LWNnet The high-resolution timerAPI January 2006 httpslwnnet

Articles167897

[60] Kenneth MacKenzie Natalia Chechinaand Phil Trinder Performance portabilitythrough semi-explicit placement in dis-tributed Erlang In Proceedings of the 14thACM SIGPLAN Workshop on Erlang pages27ndash38 ACM 2015

[61] Jeff Matocha and Tracy Camp A taxon-omy of distributed termination detectionalgorithms Journal of Systems and Software43(221)207ndash221 1998

[62] Nicholas D Matsakis and Felix S Klock IIThe Rust language In ACM SIGAda AdaLetters volume 34 pages 103ndash104 ACM2014

[63] Robert McNaughton Scheduling withdeadlines and loss functions ManagementScience 6(1)1ndash12 1959

[64] Martin Odersky et al The Scala program-ming language 2012 wwwscala-langorg

[65] William F Opdyke Refactoring Object-Oriented Frameworks PhD thesis Uni-versity of Illinois at Urbana-Champaign1992

47

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

[66] Nikolaos Papaspyrou and KonstantinosSagonas On preserving term sharing inthe Erlang virtual machine In TorbenHoffman and John Hughes editors Pro-ceedings of the 11th ACM SIGPLAN ErlangWorkshop pages 11ndash20 ACM 2012

[67] RELEASE Project Team EU Framework7 Project 287510 (2011ndash2015) 2015 httpwwwrelease-projecteu

[68] Konstantinos Sagonas and ThanassisAvgerinos Automatic refactoring ofErlang programs In Principles and Prac-tice of Declarative Programming (PPDPrsquo09)pages 13ndash24 ACM 2009

[69] Konstantinos Sagonas and Kjell WinbladMore scalable ordered set for ETS usingadaptation In Proc ACM SIGPLAN Work-shop on Erlang pages 3ndash11 ACM Septem-ber 2014

[70] Konstantinos Sagonas and Kjell WinbladContention adapting search trees In Pro-ceedings of the 14th International Sympo-sium on Parallel and Distributed Computingpages 215ndash224 IEEE Computing Society2015

[71] Konstantinos Sagonas and Kjell WinbladEfficient support for range queries andrange updates using contention adaptingsearch trees In Languages and Compilersfor Parallel Computing - 28th InternationalWorkshop volume 9519 of LNCS pages37ndash53 Springer 2016

[72] Marc Snir Steve W Otto D WWalker Jack Dongarra and Steven Huss-Lederman MPI The Complete ReferenceMIT Press Cambridge MA USA 1995

[73] Sriram Srinivasan and Alan MycroftKilim Isolation-typed actors for Java InECOOPrsquo08 pages 104ndash128 Berlin Heidel-berg 2008 Springer-Verlag

[74] Don Syme Adam Granicz and AntonioCisternino Expert F 40 Springer 2015

[75] Simon Thompson and Huiqing Li Refac-toring tools for functional languages Jour-nal of Functional Programming 23(3) 2013

[76] Marcus Voumllker Linux timers May 2014

[77] WhatsApp 2015 httpswwwwhatsappcom

[78] Tom White Hadoop - The Definitive GuideStorage and Analysis at Internet Scale (3 edrevised and updated) OrsquoReilly 2012

[79] Ulf Wiger Industrial-Strength Functionalprogramming Experiences with the Er-icsson AXD301 Project In ImplementingFunctional Languages (IFLrsquo00) Aachen Ger-many September 2000 Springer-Verlag

48

  • Introduction
  • Context
    • Scalable Reliable Programming Models
    • Actor Languages
    • Erlangs Support for Concurrency
    • Scalability and Reliability in Erlang Systems
    • ETS Erlang Term Storage
      • Benchmarks for scalability and reliability
        • Orbit
        • Ant Colony Optimisation (ACO)
          • Erlang Scalability Limits
            • Scaling Erlang on a Single Host
            • Distributed Erlang Scalability
            • Persistent Storage
              • Improving Language Scalability
                • SD Erlang Design
                  • S_groups
                  • Semi-Explicit Placement
                    • S_group Semantics
                    • Semantics Validation
                      • Improving VM Scalability
                        • Improvements to Erlang Term Storage
                        • Improvements to Schedulers
                        • Improvements to Time Management
                          • Scalable Tools
                            • Refactoring for Scalability
                            • Scalable Deployment
                            • Monitoring and Visualisation
                              • Percept2
                              • DTraceSystemTap tracing
                              • Devo
                              • SD-Mon
                                  • Systemic Evaluation
                                    • Orbit
                                    • Ant Colony Optimisation (ACO)
                                    • Evaluation Summary
                                      • Discussion
Page 39: Scaling Reliably Scaling Reliably: Improving the ... · Distributed actor languages are an effective means of constructing scalable reliable systems, and the Erlang ... that capture

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

80

120

160

200

0 2000 4000 6000 8000 10000

Tim

e (s

ec)

Number of nodes (N)

Experimental data Best fit 00125N+83

Figure 30 Wombat Deployment Time

The measurement data shows two impor-tant facts it shows that WombatOAM scaleswell (up to a deployment base of 10000 Erlangnodes) and that WombatOAM is non-intrusivebecause its overhead on a monitored node istypically less than 15 of effort on the node

We conclude that WombatOAM is capableof deploying and monitoring substantial dis-tributed Erlang and SD Erlang programs Theexperiments in the remainder of this sectionuse standard distributed Erlang configurationfile deployment

Reliability SD Erlang changes the organisa-tion of processes and their recovery data atthe language level so we seek to show thatthese changes have not disrupted Erlangrsquosworld-class reliability mechanisms at this levelAs we havenrsquot changed them we donrsquot exer-cise Erlangrsquos other reliability mechanisms egthose for managing node failures network con-gestion etc A more detailed study of SDErlang reliability including the use of repli-cated databases for recovering Instant Messen-ger chat sessions finds similar results [23]

We evaluate the reliability of two ACO ver-sions using a Chaos Monkey service that killsprocesses in the running system at random [14]Recall that GR-ACO provides reliability by reg-istering the names of critical processes globallyand SR-ACO registers them only within ans_group (Section ii)

For both GR-ACO and SR-ACO a ChaosMonkey runs on every Erlang node ie mas-ter submasters and colony nodes killing arandom Erlang process every second BothACO versions run to completion Recoveryat this failure frequency has no measurableimpact on runtime This is because processesare recovered within the Virtual machine using(globally synchronised) local recovery informa-tion For example on a common X86Ubuntuplatform typical Erlang process recovery timesare around 03ms so around 1000x less thanthe Unix process recovery time on the sameplatform [58] We have conducted more de-tailed experiments on an Instant Messengerbenchmark and obtained similar results [23]

We conclude that both GR-ACO and SR-ACO are reliable and that SD Erlang preserves

39

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

0

4

8

12

16

0 50(1200) 100(2400) 150(3600) 200(4800) 250(6000)

Ex

ecu

tio

n t

ime (

s)

Number of nodes (cores)

GR-ACO SR-ACO ML-ACO

Figure 31 ACO execution times ErlangOTP 174 (RELEASE) [Weak Scaling]

distributed Erlang reliability model The re-mainder of the section outlines the impact ofmaintaining the recovery information requiredfor reliability on scalability

Scalability Figure 31 compares the runtimesof the ML GR and SR versions of ACO (Sec-tion ii) on ErlangOTP 174(RELEASE) As out-lined in Section i1 GR-ACO not only main-tains a fully connected graph of nodes it reg-isters process names for reliability and hencescales significantly worse than the unreliableML-ACO We conclude that providing reliabil-ity with standard distributed Erlang processregistration dramatically limits scalability

While ML-ACO does not provide reliabilityand hence doesnrsquot register process names itmaintains a fully connected graph of nodeswhich limits scalability SR-ACO that main-tains connections and registers process namesonly within s_groups scales best of all Fig-ure 31 illustrates how maintaining the processnamespace and fully connected network im-pacts performance This reinforces the evi-dence from the Orbit benchmarks and oth-

ers that partitioning the network of Erlangnodes significantly improves performance atlarge scale

To investigate the impact of SD Erlang onnetwork traffic we measure the number of sentand received packets on the GPG cluster forthree versions of ACO ML-ACO GR-ACOand SR-ACO Figure 32 shows the total numberof sent packets The highest traffic (the red line)belongs to the GR-ACO and the lowest trafficbelongs to the SR-ACO (dark blue line) Thisshows that SD Erlang significantly reduces thenetwork traffic between Erlang nodes Evenwith the s_group name registration SR-ACOhas less network traffic than ML-ACO that hasno global name registration

iii Evaluation Summary

We have shown that WombatOAM is capableof deploying and monitoring substantial dis-tributed Erlang and SD Erlang programs likeACO and Sim-Diasca The Chaos Monkey ex-periments with GR-ACO and SR-ACO showthat both are reliable and hence that SD Erlang

40

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Figure 32 Number of sent packets in ML-ACO GR-ACO and SR-ACO

preserves the distributed Erlang language-levelreliability model

As SD Orbit scales better than D-Orbit SR-ACO scales better than ML-ACO and SR-ACO has significantly less network trafficwe conclude that even when global recoverydata is not maintained partitioning the fully-connected network into s_groups reduces net-work traffic and improves performance Whilethe distributed Orbit instances (W=2M) and(W=5M) reach scalability limits at around 40and 60 nodes Orbit scales to 150 nodes on SDErlang (limited by input size) and SR-ACOis still scaling well on 256 nodes (6144 cores)Hence not only have we exceeded the 60 nodescaling limits of distributed Erlang identifiedin Section ii we have not reached the scalinglimits of SD Erlang on this architecture

Comparing GR-ACO and ML-ACO scalabil-ity curves shows that maintaining global re-covery data ie a process name space dra-matically limits scalability Comparing GR-ACO and SR-ACO scalability curves showsthat scalability can be recovered by partitioningthe nodes into appropriately-sized s_groups

and hence maintaining the recovery data onlywithin a relatively small group of nodes Theseresults are consistent with other experiments

IX Discussion

Distributed actor platforms like Erlang orScala with Akka are a common choice forinternet-scale system architects as the modelwith its automatic and VM-supported reliabil-ity mechanisms makes it extremely easy to engi-neer scalable reliable systems Targeting emergentserver architectures with hundreds of hostsand tens of thousands of cores we report asystematic effort to improve the scalability of aleading distributed actor language while pre-serving reliability The work is a vade mecumfor addressing scalability of reliable actor lan-guages and frameworks It is also high impactwith downloads of our improved ErlangOTPrunning at 50K a month

We have undertaken the first systematicstudy of scalability in a distributed actor lan-guage covering VM language and persis-tent storage levels We have developed the

41

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

BenchErl and DE-Bench tools for this purposeKey VM-level scalability issues we identify in-clude contention for shared ETS tables and forcommonly-used shared resources like timersKey language scalability issues are the costs ofmaintaining a fully-connected network main-taining global recovery information and ex-plicit process placement Unsurprisingly thescaling issues for this distributed actor lan-guage are common to other distributed or par-allel languages and frameworks with otherparadigms like CHARM++ [45] Cilk [15] orLegion [40] We establish scientifically the folk-lore limitations of around 60 connected nodesfor distributed Erlang (Section IV)

The actor model is no panacea and there canstill be scalability problems in the algorithmsthat we write either within a single actor or inthe way that we structure communicating ac-tors A range of pragmatic issues also impactthe performance and scalability of actor sys-tems including memory occupied by processes(even when quiescent) mailboxes filling upetc Identifying and resolving these problemsis where tools like Percept2 and WombatOAMare needed However many of the scalabilityissues arise where Erlang departs from the privatestate principle of the actor model eg in main-taining shared state in ETS tables or a sharedglobal process namespace for recovery

We have designed and implemented a setof Scalable Distributed (SD) Erlang librariesto address language-level scalability issues Thekey constructs are s_groups for partitioning thenetwork and global process namespace andsemi-explicit process placement for deployingdistributed Erlang applications on large hetero-geneous architectures in a portable way Wehave provided a state transition operational se-mantics for the new s_groups and validatedthe library implementation against the seman-tics using QuickCheck (Section V)

To improve the scalability of the Erlang VMand libraries we have improved the implemen-tation of shared ETS tables time management

and load balancing between schedulers Fol-lowing a systematic analysis of ETS tablesthe number of fine-grained (bucket) locks andof reader groups have been increased Wehave developed and evaluated four new tech-niques for improving ETS scalability(i) pro-grammer control of number of bucket locks(ii) a contention-adapting tree data structurefor ordered_sets (iii) queue delegation lock-ing and (iv) eliminating the locks in the metatable We have introduced a new scheduler util-isation balancing mechanism to spread work tomultiple schedulers (and hence cores) and newsynchronisation mechanisms to reduce con-tention on the widely-used time managementmechanisms By June 2015 with ErlangOTP180 the majority of these changes had been in-cluded in the primary releases In any scalableactor language implementation such thought-ful design and engineering will be requiredto schedule large numbers of actors on hostswith many cores and to minimise contentionon shared VM resources (Section VI)

To facilitate the development of large Erlangsystems and to make them understandable wehave developed a range of tools The propri-etary WombatOAM tool deploys and monitorslarge distributed Erlang systems over multi-ple and possibly heterogeneous clusters orclouds We have made open source releases offour concurrency tools Percept2 now detectsconcurrency bad smells Wrangler provides en-hanced concurrency refactoring the Devo toolis enhanced to provide interactive visualisationof SD Erlang systems and the new SD-Montool monitors SD Erlang systems We an-ticipate that these tools will guide the designof tools for other large scale distributed actorlanguages and frameworks (Section VII)

We report on the reliability and scalabilityimplications of our new technologies using arange of benchmarks and consistently use theOrbit and ACO benchmarks throughout thearticle While we report measurements on arange of NUMA and cluster architectures the

42

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Table 2 Cluster Specifications (RAM per host in GB)

Cores

per max

Name Hosts host total avail Processor RAM Inter-connection

GPG 20 16 320 320 Intel Xeon E5-2640v2 8C2GHz

64 10GB Ethernet

Kalkyl 348 8 2784 1408 Intel Xeon 5520v2 4C226GHz

24ndash72 InfiniBand 20Gbs

TinTin 160 16 2560 2240AMD Opteron 6220v2 Bull-dozer 8C 30GHz 64ndash128

21 oversub-scribed QDRInfiniband

Athos 776 24 18624 6144 Intel Xeon E5-2697v2 12C27GHz

64 InfinibandFDR14

key scalability experiments are conducted onthe Athos cluster with 256 hosts (6144 cores)Even when global recovery data is not main-tained partitioning the network into s_groupsreduces network traffic and improves the per-formance of the Orbit and ACO benchmarksabove 80 hosts Crucially we exceed the 60node limit for distributed Erlang and do notreach the scalability limits of SD Erlang with256 nodesVMs and 6144 cores Chaos Mon-key experiments show that two versions ofACO are reliable and hence that SD Erlang pre-serves the Erlang reliability model Howeverthe ACO results show that maintaining globalrecovery data ie a global process name spacedramatically limits scalability in distributedErlang Scalability can however be recoveredby maintaining recovery data only within ap-propriately sized s_groups These results areconsistent with experiments with other bench-marks and on other architectures (Section VIII)

In future work we plan to incorporateRELEASE technologies along with other tech-nologies in a generic framework for buildingperformant large scale servers In addition pre-

liminary investigations suggest that some SDErlang ideas could improve the scalability ofother actor languages21 For example the Akkaframework for Scala could benefit from semi-explicit placement and Cloud Haskell frompartitioning the network

Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) an AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3GHz, each with 16 "Bulldozer" cores, giving a total of 64 cores; (2) an Intel NUMA machine with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.
[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.
[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).
[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.
[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.
[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.
[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.
[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.
[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.
[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level Visualization of Concurrent and Distributed Computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.
[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.
[12] Basho Technologies. Riakdocs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking.
[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.
[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.
[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 207–216. ACM, 1995.
[16] Olivier Boudeville. Technical manual of the sim-diasca simulation engine. EDF R&D, 2012.
[17] István Bozó, Viktória Fördos, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Koszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.
[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.
[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.
[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.
[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.
[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördos, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.
[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.
[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.
[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.
[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.
[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.
[28] Alan A. A. Donovan and Brian W. Kernighan. The Go programming language. Addison-Wesley Professional, 2015.
[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.
[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.
[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.
[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.
[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.
[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.
[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.
[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.
[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.
[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.
[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.
[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.
[41] Carl Hewitt. Actor model for discretionary, adaptive concurrency. CoRR, abs/1008.1459, 2010.
[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.
[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 1:1–1:1, New York, NY, USA, 2008. ACM.
[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.
[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. In ACM Sigplan Notices, volume 28, pages 91–108. ACM, 1993.
[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.
[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.
[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.
[49] Rusty Klophaus. Riak Core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.
[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.
[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.
[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.
[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN workshop on Erlang, 2013.
[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.
[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.
[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.
[57] Frank Lubeck and Max Neunhoffer. Enumerating large Orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.
[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.
[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897.
[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.
[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.
[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.
[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.
[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.
[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.
[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.
[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.
[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.
[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.
[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computing Society, 2015.
[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing: 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.
[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.
[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.
[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.
[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.
[76] Marcus Völker. Linux timers, May 2014.
[77] WhatsApp, 2015. https://www.whatsapp.com.
[78] Tom White. Hadoop: The Definitive Guide. Storage and Analysis at Internet Scale (3rd ed., revised and updated). O'Reilly, 2012.
[79] Ulf Wiger. Industrial-Strength Functional programming: Experiences with the Ericsson AXD301 Project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.


[49] Rusty Klophaus Riak core Building dis-tributed applications without shared stateIn ACM SIGPLAN Commercial Users ofFunctional Programming CUFP rsquo10 pages141ndash141 New York NY USA 2010ACM

[50] Avinash Lakshman and Prashant MalikCassandra A decentralized structuredstorage system SIGOPS Oper Syst Rev44(2)35ndash40 April 2010

[51] J Lee et al Python actor runtime library2010 oslcsuiuieduparley

[52] Huiqing Li and Simon Thompson Auto-mated API Migration in a User-ExtensibleRefactoring Tool for Erlang Programs InTim Menzies and Motoshi Saeki editorsAutomated Software Engineering ASErsquo12IEEE Computer Society 2012

[53] Huiqing Li and Simon Thompson Multi-core profiling for Erlang programs usingPercept2 In Proceedings of the Twelfth ACMSIGPLAN workshop on Erlang 2013

[54] Huiqing Li and Simon Thompson Im-proved Semantics and ImplementationThrough Property-Based Testing withQuickCheck In 9th International Workshopon Automation of Software Test 2014

[55] Huiqing Li and Simon Thompson SafeConcurrency Introduction through SlicingIn PEPM rsquo15 ACM SIGPLAN January2015

[56] Huiqing Li Simon Thompson GyoumlrgyOrosz and Melinda Toumlth Refactoring

with Wrangler updated In ACM SIG-PLAN Erlang Workshop volume 2008 2008

[57] Frank Lubeck and Max Neunhoffer Enu-merating large Orbits and direct conden-sation Experimental Mathematics 10(2)197ndash205 2001

[58] Andreea Lutac Natalia Chechina Ger-ardo Aragon-Camarasa and Phil TrinderTowards reliable and scalable robot com-munication In Proceedings of the 15th Inter-national Workshop on Erlang Erlang 2016pages 12ndash23 New York NY USA 2016ACM

[59] LWNnet The high-resolution timerAPI January 2006 httpslwnnet

Articles167897

[60] Kenneth MacKenzie Natalia Chechinaand Phil Trinder Performance portabilitythrough semi-explicit placement in dis-tributed Erlang In Proceedings of the 14thACM SIGPLAN Workshop on Erlang pages27ndash38 ACM 2015

[61] Jeff Matocha and Tracy Camp A taxon-omy of distributed termination detectionalgorithms Journal of Systems and Software43(221)207ndash221 1998

[62] Nicholas D Matsakis and Felix S Klock IIThe Rust language In ACM SIGAda AdaLetters volume 34 pages 103ndash104 ACM2014

[63] Robert McNaughton Scheduling withdeadlines and loss functions ManagementScience 6(1)1ndash12 1959

[64] Martin Odersky et al The Scala program-ming language 2012 wwwscala-langorg

[65] William F Opdyke Refactoring Object-Oriented Frameworks PhD thesis Uni-versity of Illinois at Urbana-Champaign1992

47

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

[66] Nikolaos Papaspyrou and KonstantinosSagonas On preserving term sharing inthe Erlang virtual machine In TorbenHoffman and John Hughes editors Pro-ceedings of the 11th ACM SIGPLAN ErlangWorkshop pages 11ndash20 ACM 2012

[67] RELEASE Project Team EU Framework7 Project 287510 (2011ndash2015) 2015 httpwwwrelease-projecteu

[68] Konstantinos Sagonas and ThanassisAvgerinos Automatic refactoring ofErlang programs In Principles and Prac-tice of Declarative Programming (PPDPrsquo09)pages 13ndash24 ACM 2009

[69] Konstantinos Sagonas and Kjell WinbladMore scalable ordered set for ETS usingadaptation In Proc ACM SIGPLAN Work-shop on Erlang pages 3ndash11 ACM Septem-ber 2014

[70] Konstantinos Sagonas and Kjell WinbladContention adapting search trees In Pro-ceedings of the 14th International Sympo-sium on Parallel and Distributed Computingpages 215ndash224 IEEE Computing Society2015

[71] Konstantinos Sagonas and Kjell WinbladEfficient support for range queries andrange updates using contention adaptingsearch trees In Languages and Compilersfor Parallel Computing - 28th InternationalWorkshop volume 9519 of LNCS pages37ndash53 Springer 2016

[72] Marc Snir Steve W Otto D WWalker Jack Dongarra and Steven Huss-Lederman MPI The Complete ReferenceMIT Press Cambridge MA USA 1995

[73] Sriram Srinivasan and Alan MycroftKilim Isolation-typed actors for Java InECOOPrsquo08 pages 104ndash128 Berlin Heidel-berg 2008 Springer-Verlag

[74] Don Syme Adam Granicz and AntonioCisternino Expert F 40 Springer 2015

[75] Simon Thompson and Huiqing Li Refac-toring tools for functional languages Jour-nal of Functional Programming 23(3) 2013

[76] Marcus Voumllker Linux timers May 2014

[77] WhatsApp 2015 httpswwwwhatsappcom

[78] Tom White Hadoop - The Definitive GuideStorage and Analysis at Internet Scale (3 edrevised and updated) OrsquoReilly 2012

[79] Ulf Wiger Industrial-Strength Functionalprogramming Experiences with the Er-icsson AXD301 Project In ImplementingFunctional Languages (IFLrsquo00) Aachen Ger-many September 2000 Springer-Verlag

48

  • Introduction
  • Context
    • Scalable Reliable Programming Models
    • Actor Languages
    • Erlangs Support for Concurrency
    • Scalability and Reliability in Erlang Systems
    • ETS Erlang Term Storage
      • Benchmarks for scalability and reliability
        • Orbit
        • Ant Colony Optimisation (ACO)
          • Erlang Scalability Limits
            • Scaling Erlang on a Single Host
            • Distributed Erlang Scalability
            • Persistent Storage
              • Improving Language Scalability
                • SD Erlang Design
                  • S_groups
                  • Semi-Explicit Placement
                    • S_group Semantics
                    • Semantics Validation
                      • Improving VM Scalability
                        • Improvements to Erlang Term Storage
                        • Improvements to Schedulers
                        • Improvements to Time Management
                          • Scalable Tools
                            • Refactoring for Scalability
                            • Scalable Deployment
                            • Monitoring and Visualisation
                              • Percept2
                              • DTraceSystemTap tracing
                              • Devo
                              • SD-Mon
                                  • Systemic Evaluation
                                    • Orbit
                                    • Ant Colony Optimisation (ACO)
                                    • Evaluation Summary
                                      • Discussion
Page 42: Scaling Reliably Scaling Reliably: Improving the ... · Distributed actor languages are an effective means of constructing scalable reliable systems, and the Erlang ... that capture

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15], or Legion [40]. We establish scientifically the folklore limitation of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups for partitioning the network and global process namespace, and semi-explicit process placement for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).
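
To make the constructs concrete, the following is a minimal sketch of how an application might create a partition and keep its recovery names group-local. The s_group functions are those described in [21, 22]; the group name, node names, and pid variable are illustrative assumptions.

    %% A minimal sketch, assuming the s_group API of [21, 22]; the
    %% group and node names are hypothetical.
    {ok, aco_group, _Nodes} =
        s_group:new_s_group(aco_group,
                            ['master@host1', 'sub1@host2', 'sub2@host3']),

    %% Register a name in the s_group namespace rather than globally:
    %% updates are coordinated only among the group's nodes.
    yes = s_group:register_name(aco_group, submaster, SubMasterPid),

    %% Lookup, and hence recovery traffic, stays within the partition.
    SubMasterPid = s_group:whereis_name(aco_group, submaster).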

To improve the scalability of the Erlang VM and libraries we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the number of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores and to minimise contention on shared VM resources (Section VI).
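
As a concrete illustration of the programmer-visible side of these changes, the sketch below uses only documented ets options; the table and key names are invented for the example.

    %% Sketch: concurrency tuning options in the standard ets API.
    %% read_concurrency selects reader-group based synchronisation;
    %% write_concurrency enables fine-grained (bucket) locking, and in
    %% later OTP releases an ordered_set table with this option is
    %% backed by a contention-adapting tree.
    Tab = ets:new(stats, [ordered_set, public,
                          {read_concurrency, true},
                          {write_concurrency, true}]),
    true = ets:insert(Tab, {requests, 0}),
    ets:update_counter(Tab, requests, 1).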

To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60-node limit for distributed Erlang, and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

Table 2: Cluster Specifications (RAM per host in GB)

Name     Hosts   Cores per host   Cores total   Cores max avail   Processor                                  RAM      Interconnection
GPG         20               16           320               320   Intel Xeon E5-2640v2, 8C, 2GHz              64      10GB Ethernet
Kalkyl     348                8          2784              1408   Intel Xeon 5520v2, 4C, 2.26GHz           24–72      InfiniBand 20Gb/s
TinTin     160               16          2560              2240   AMD Opteron 6220v2 Bulldozer, 8C, 3.0GHz 64–128     2:1 oversubscribed QDR Infiniband
Athos      776               24         18624              6144   Intel Xeon E5-2697v2, 12C, 2.7GHz           64      Infiniband FDR14

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages.²¹ For example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines: (1) An AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, four AMD Opteron 6276s at 2.3 GHz, 16 "Bulldozer" cores each, giving a total of 64 cores. (2) An Intel NUMA with 128GB RAM, four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

²¹ See Deliverable 6.7 (http://www.release-project.eu/documents/D67.pdf), available online.


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.

[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level visualization of concurrent and distributed computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riakdocs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking/.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Kőszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.

[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go Programming Language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. DE-Bench: a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary, adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 1:1–1:1, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. In ACM SIGPLAN Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiui.edu/parley.

[52] Huiqing Li and Simon Thompson. Automated API migration in a user-extensible refactoring tool for Erlang programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved semantics and implementation through property-based testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe concurrency introduction through slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897/.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.


[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing, 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop: The Definitive Guide. Storage and Analysis at Internet Scale (3rd ed., revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-strength functional programming: Experiences with the Ericsson AXD301 project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.

48

  • Introduction
  • Context
    • Scalable Reliable Programming Models
    • Actor Languages
    • Erlangs Support for Concurrency
    • Scalability and Reliability in Erlang Systems
    • ETS Erlang Term Storage
      • Benchmarks for scalability and reliability
        • Orbit
        • Ant Colony Optimisation (ACO)
          • Erlang Scalability Limits
            • Scaling Erlang on a Single Host
            • Distributed Erlang Scalability
            • Persistent Storage
              • Improving Language Scalability
                • SD Erlang Design
                  • S_groups
                  • Semi-Explicit Placement
                    • S_group Semantics
                    • Semantics Validation
                      • Improving VM Scalability
                        • Improvements to Erlang Term Storage
                        • Improvements to Schedulers
                        • Improvements to Time Management
                          • Scalable Tools
                            • Refactoring for Scalability
                            • Scalable Deployment
                            • Monitoring and Visualisation
                              • Percept2
                              • DTraceSystemTap tracing
                              • Devo
                              • SD-Mon
                                  • Systemic Evaluation
                                    • Orbit
                                    • Ant Colony Optimisation (ACO)
                                    • Evaluation Summary
                                      • Discussion
Page 43: Scaling Reliably Scaling Reliably: Improving the ... · Distributed actor languages are an effective means of constructing scalable reliable systems, and the Erlang ... that capture

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Table 2 Cluster Specifications (RAM per host in GB)

Cores

per max

Name Hosts host total avail Processor RAM Inter-connection

GPG 20 16 320 320 Intel Xeon E5-2640v2 8C2GHz

64 10GB Ethernet

Kalkyl 348 8 2784 1408 Intel Xeon 5520v2 4C226GHz

24ndash72 InfiniBand 20Gbs

TinTin 160 16 2560 2240AMD Opteron 6220v2 Bull-dozer 8C 30GHz 64ndash128

21 oversub-scribed QDRInfiniband

Athos 776 24 18624 6144 Intel Xeon E5-2697v2 12C27GHz

64 InfinibandFDR14

key scalability experiments are conducted onthe Athos cluster with 256 hosts (6144 cores)Even when global recovery data is not main-tained partitioning the network into s_groupsreduces network traffic and improves the per-formance of the Orbit and ACO benchmarksabove 80 hosts Crucially we exceed the 60node limit for distributed Erlang and do notreach the scalability limits of SD Erlang with256 nodesVMs and 6144 cores Chaos Mon-key experiments show that two versions ofACO are reliable and hence that SD Erlang pre-serves the Erlang reliability model Howeverthe ACO results show that maintaining globalrecovery data ie a global process name spacedramatically limits scalability in distributedErlang Scalability can however be recoveredby maintaining recovery data only within ap-propriately sized s_groups These results areconsistent with experiments with other bench-marks and on other architectures (Section VIII)

In future work we plan to incorporateRELEASE technologies along with other tech-nologies in a generic framework for buildingperformant large scale servers In addition pre-

liminary investigations suggest that some SDErlang ideas could improve the scalability ofother actor languages21 For example the Akkaframework for Scala could benefit from semi-explicit placement and Cloud Haskell frompartitioning the network

Appendix A Architecture

Specifications

The specifications of the clusters used for mea-surement are summarised in Table 2 We alsouse the following NUMA machines (1) AnAMD Bulldozer with 16M L216M L3 cache128GB RAM four AMD Opteron 6276s at23 GHz 16 ldquoBulldozerrdquo cores each giving a to-tal of 64 cores (2) An Intel NUMA with 128GBRAM four Intel Xeon E5-4650s at 270GHzeach with eight hyperthreaded cores giving atotal of 64 cores

21See Deliverable 67 (httpwwwrelease-projecteudocumentsD67pdf) available online

43

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Acknowledgements

We would like to thank the entire RELEASEproject team for technical insights and adminis-trative support Roberto Aloi and Enrique Fer-nandez Casado contributed to the developmentand measurement of WombatOAM This workhas been supported by the European Uniongrant RII3-CT-2005-026133 ldquoSCIEnce SymbolicComputing Infrastructure in Europerdquo IST-2011-287510 ldquoRELEASE A High-Level Paradigm forReliable Large-scale Server Softwarerdquo and bythe UKrsquos Engineering and Physical SciencesResearch Council grant EPG0551811 ldquoHPC-GAP High Performance Computational Alge-bra and Discrete Mathematicsrdquo

References

[1] Gul Agha ACTORS A Model of ConcurrentComputation in Distributed Systems PhDthesis MIT 1985

[2] Gul Agha An overview of Actor lan-guages SIGPLAN Not 21(10)58ndash67 1986

[3] Bulldozer (microarchitecture) 2015httpsenwikipediaorgwiki

Bulldozer_(microarchitecture)

[4] Apache SF Liblcoud 2016 https

libcloudapacheorg

[5] C R Aragon and R G Seidel Random-ized search trees In Proceedings of the 30thAnnual Symposium on Foundations of Com-puter Science pages 540ndash545 October 1989

[6] Joe Armstrong Programming Erlang Soft-ware for a Concurrent World PragmaticBookshelf 2007

[7] Joe Armstrong Erlang Commun ACM5368ndash75 2010

[8] Stavros Aronis Nikolaos Papaspyrou Ka-terina Roukounaki Konstantinos Sago-nas Yiannis Tsiouris and Ioannis E

Venetis A scalability benchmark suite forErlangOTP In Torben Hoffman and JohnHughes editors Proceedings of the EleventhACM SIGPLAN Workshop on Erlang pages33ndash42 New York NY USA September2012 ACM

[9] Thomas Arts John Hughes Joakim Jo-hansson and Ulf Wiger Testing telecomssoftware with Quviq QuickCheck In Pro-ceedings of the 2006 ACM SIGPLAN work-shop on Erlang pages 2ndash10 New York NYUSA 2006 ACM

[10] Robert Baker Peter Rodgers SimonThompson and Huiqing Li Multi-level Vi-sualization of Concurrent and DistributedComputation in Erlang In Visual Lan-guages and Computing (VLC) The 19th In-ternational Conference on Distributed Multi-media Systems (DMS 2013) 2013

[11] Luiz Andreacute Barroso Jimmy Clidaras andUrs Houmllzle The Datacenter as a ComputerMorgan and Claypool 2nd edition 2013

[12] Basho Technologies Riakdocs BashoBench 2014 httpdocsbasho

comriaklatestopsbuilding

benchmarking

[13] J E Beasley OR-Library Distributingtest problems by electronic mail TheJournal of the Operational Research Soci-ety 41(11)1069ndash1072 1990 Datasetsavailable at httppeoplebrunelac

uk~mastjjbjeborlibwtinfohtml

[14] Cory Bennett and Ariel Tseitlin Chaosmonkey released into the wild NetflixBlog 2012

[15] Robert D Blumofe Christopher F JoergBradley C Kuszmaul Charles E LeisersonKeith H Randall and Yuli Zhou Cilk Anefficient multithreaded runtime system In

44

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Proceedings of the fifth ACM SIGPLAN sym-posium on Principles and practice of parallelprogramming pages 207ndash216 ACM 1995

[16] Olivier Boudeville Technical manual ofthe sim-diasca simulation engine EDFRampD 2012

[17] Istvaacuten Bozoacute Viktoacuteria Foumlrdos Daacuteniel Hor-paacutecsi Zoltaacuten Horvaacuteth Tamaacutes Kozsik Ju-dit Koszegi and Melinda Toacuteth Refactor-ings to enable parallelization In JurriaanHage and Jay McCarthy editors Trends inFunctional Programming 15th InternationalSymposium TFP 2014 Revised Selected Pa-pers volume 8843 of LNCS pages 104ndash121Springer 2015

[18] Irina Calciu Dave Dice Yossi Lev Vic-tor Luchangco Virendra J Marathe andNir Shavit NUMA-aware reader-writerlocks In Proceedings of the 18th ACM SIG-PLAN Symposium on Principles and Prac-tice of Parallel Programming pages 157ndash166New York NY USA 2013 ACM

[19] Francesco Cesarini and Simon Thomp-son Erlang Programming A ConcurrentApproach to Software Development OrsquoReillyMedia 1st edition 2009

[20] Rohit Chandra Leonardo Dagum DaveKohr Dror Maydan Jeff McDonald andRamesh Menon Parallel Programming inOpenMP Morgan Kaufmann San Fran-cisco CA USA 2001

[21] Natalia Chechina Huiqing Li AmirGhaffari Simon Thompson and PhilTrinder Improving the network scalabil-ity of Erlang J Parallel Distrib Comput90(C)22ndash34 2016

[22] Natalia Chechina Kenneth MacKenzieSimon Thompson Phil Trinder OlivierBoudeville Viktoacuteria Foumlrdos Csaba HochAmir Ghaffari and Mario Moro Her-nandez Evaluating scalable distributed

Erlang for scalability and reliability IEEETransactions on Parallel and Distributed Sys-tems 2017

[23] Natalia Chechina Mario Moro Hernan-dez and Phil Trinder A scalable reliableinstant messenger using the SD Erlang li-braries In Erlangrsquo16 pages 33ndash41 NewYork NY USA 2016 ACM

[24] Koen Claessen and John HughesQuickCheck a lightweight tool forrandom testing of Haskell programsIn Proceedings of the 5th ACM SIGPLANInternational Conference on FunctionalProgramming pages 268ndash279 New YorkNY USA 2000 ACM

[25] H A J Crauwels C N Potts and L Nvan Wassenhove Local search heuristicsfor the single machine total weighted tar-diness scheduling problem INFORMSJournal on Computing 10(3)341ndash350 1998The datasets from this paper are includedin Beasleyrsquos ORLIB

[26] Jeffrey Dean and Sanjay GhemawatMapreduce Simplified data processing onlarge clusters Commun ACM 51(1)107ndash113 2008

[27] David Dewolfs Jan Broeckhove VaidySunderam and Graham E Fagg FT-MPI fault-tolerant metacomputing andgeneric name services A case study InEuroPVMMPIrsquo06 pages 133ndash140 BerlinHeidelberg 2006 Springer-Verlag

[28] Alan AA Donovan and Brian WKernighan The Go programming languageAddison-Wesley Professional 2015

[29] Marco Dorigo and Thomas Stuumltzle AntColony Optimization Bradford CompanyScituate MA USA 2004

[30] Faith Ellen Yossi Lev Victor Luchangcoand Mark Moir SNZI scalable NonZero

45

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

indicators In Proceedings of the 26th AnnualACM Symposium on Principles of DistributedComputing pages 13ndash22 New York NYUSA 2007 ACM

[31] Jeff Epstein Andrew P Black and Si-mon Peyton-Jones Towards Haskell inthe Cloud In Haskell rsquo11 pages 118ndash129ACM 2011

[32] Martin Fowler Refactoring Improving theDesign of Existing Code Addison-WesleyLongman Publishing Co Inc BostonMA USA 1999

[33] Ana Gainaru and Franck Cappello Errorsand faults In Fault-Tolerance Techniques forHigh-Performance Computing pages 89ndash144Springer International Publishing 2015

[34] Martin Josef Geiger New instancesfor the Single Machine Total WeightedTardiness Problem Technical ReportResearch Report 10-03-01 March 2010Datasets available at httplogistik

hsu-hhdeSMTWTP

[35] Guillaume Germain Concurrency ori-ented programming in termite schemeIn Proceedings of the 2006 ACM SIGPLANworkshop on Erlang pages 20ndash20 NewYork NY USA 2006 ACM

[36] Amir Ghaffari De-bench a benchmarktool for distributed erlang 2014 https

githubcomamirghaffariDEbench

[37] Amir Ghaffari Investigating the scalabil-ity limits of distributed Erlang In Proceed-ings of the 13th ACM SIGPLAN Workshopon Erlang pages 43ndash49 ACM 2014

[38] Amir Ghaffari Natalia Chechina PhipTrinder and Jon Meredith Scalable persis-tent storage for Erlang Theory and prac-tice In Proceedings of the 12th ACM SIG-PLAN Workshop on Erlang pages 73ndash74New York NY USA 2013 ACM

[39] Seth Gilbert and Nancy Lynch Brewerrsquosconjecture and the feasibility of consistentavailable partition-tolerant web servicesACM SIGACT News 3351ndash59 2002

[40] Andrew S Grimshaw Wm A Wulf andthe Legion team The Legion vision of aworldwide virtual computer Communica-tions of the ACM 40(1)39ndash45 1997

[41] Carl Hewitt Actor model for discre-tionary adaptive concurrency CoRRabs10081459 2010

[42] Carl Hewitt Peter Bishop and RichardSteiger A universal modular ACTOR for-malism for artificial intelligence In IJ-CAIrsquo73 pages 235ndash245 San Francisco CAUSA 1973 Morgan Kaufmann

[43] Rich Hickey The Clojure programminglanguage In DLSrsquo08 pages 11ndash11 NewYork NY USA 2008 ACM

[44] Zoltaacuten Horvaacuteth Laacuteszloacute Loumlvei TamaacutesKozsik Roacutebert Kitlei Anikoacute Nagyneacute ViacutegTamaacutes Nagy Melinda Toacuteth and RolandKiraacutely Building a refactoring tool forErlang In Workshop on Advanced SoftwareDevelopment Tools and Techniques WAS-DETT 2008 2008

[45] Laxmikant V Kale and Sanjeev KrishnanCharm++ a portable concurrent objectoriented system based on c++ In ACMSigplan Notices volume 28 pages 91ndash108ACM 1993

[46] David Klaftenegger Konstantinos Sago-nas and Kjell Winblad On the scalabilityof the Erlang term storage In Proceedingsof the Twelfth ACM SIGPLAN Workshop onErlang pages 15ndash26 New York NY USASeptember 2013 ACM

[47] David Klaftenegger Konstantinos Sago-nas and Kjell Winblad Delegation lock-ing libraries for improved performance of

46

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

multithreaded programs In Euro-Par 2014Parallel Processing volume 8632 of LNCSpages 572ndash583 Springer 2014

[48] David Klaftenegger Konstantinos Sago-nas and Kjell Winblad Queue delegationlocking IEEE Transactions on Parallel andDistributed Systems 2017 To appear

[49] Rusty Klophaus Riak core Building dis-tributed applications without shared stateIn ACM SIGPLAN Commercial Users ofFunctional Programming CUFP rsquo10 pages141ndash141 New York NY USA 2010ACM

[50] Avinash Lakshman and Prashant MalikCassandra A decentralized structuredstorage system SIGOPS Oper Syst Rev44(2)35ndash40 April 2010

[51] J Lee et al Python actor runtime library2010 oslcsuiuieduparley

[52] Huiqing Li and Simon Thompson Auto-mated API Migration in a User-ExtensibleRefactoring Tool for Erlang Programs InTim Menzies and Motoshi Saeki editorsAutomated Software Engineering ASErsquo12IEEE Computer Society 2012

[53] Huiqing Li and Simon Thompson Multi-core profiling for Erlang programs usingPercept2 In Proceedings of the Twelfth ACMSIGPLAN workshop on Erlang 2013

[54] Huiqing Li and Simon Thompson Im-proved Semantics and ImplementationThrough Property-Based Testing withQuickCheck In 9th International Workshopon Automation of Software Test 2014

[55] Huiqing Li and Simon Thompson SafeConcurrency Introduction through SlicingIn PEPM rsquo15 ACM SIGPLAN January2015

[56] Huiqing Li Simon Thompson GyoumlrgyOrosz and Melinda Toumlth Refactoring

with Wrangler updated In ACM SIG-PLAN Erlang Workshop volume 2008 2008

[57] Frank Lubeck and Max Neunhoffer Enu-merating large Orbits and direct conden-sation Experimental Mathematics 10(2)197ndash205 2001

[58] Andreea Lutac Natalia Chechina Ger-ardo Aragon-Camarasa and Phil TrinderTowards reliable and scalable robot com-munication In Proceedings of the 15th Inter-national Workshop on Erlang Erlang 2016pages 12ndash23 New York NY USA 2016ACM

[59] LWNnet The high-resolution timerAPI January 2006 httpslwnnet

Articles167897

[60] Kenneth MacKenzie Natalia Chechinaand Phil Trinder Performance portabilitythrough semi-explicit placement in dis-tributed Erlang In Proceedings of the 14thACM SIGPLAN Workshop on Erlang pages27ndash38 ACM 2015

[61] Jeff Matocha and Tracy Camp A taxon-omy of distributed termination detectionalgorithms Journal of Systems and Software43(221)207ndash221 1998

[62] Nicholas D Matsakis and Felix S Klock IIThe Rust language In ACM SIGAda AdaLetters volume 34 pages 103ndash104 ACM2014

[63] Robert McNaughton Scheduling withdeadlines and loss functions ManagementScience 6(1)1ndash12 1959

[64] Martin Odersky et al The Scala program-ming language 2012 wwwscala-langorg

[65] William F Opdyke Refactoring Object-Oriented Frameworks PhD thesis Uni-versity of Illinois at Urbana-Champaign1992

47

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

[66] Nikolaos Papaspyrou and KonstantinosSagonas On preserving term sharing inthe Erlang virtual machine In TorbenHoffman and John Hughes editors Pro-ceedings of the 11th ACM SIGPLAN ErlangWorkshop pages 11ndash20 ACM 2012

[67] RELEASE Project Team EU Framework7 Project 287510 (2011ndash2015) 2015 httpwwwrelease-projecteu

[68] Konstantinos Sagonas and ThanassisAvgerinos Automatic refactoring ofErlang programs In Principles and Prac-tice of Declarative Programming (PPDPrsquo09)pages 13ndash24 ACM 2009

[69] Konstantinos Sagonas and Kjell WinbladMore scalable ordered set for ETS usingadaptation In Proc ACM SIGPLAN Work-shop on Erlang pages 3ndash11 ACM Septem-ber 2014

[70] Konstantinos Sagonas and Kjell WinbladContention adapting search trees In Pro-ceedings of the 14th International Sympo-sium on Parallel and Distributed Computingpages 215ndash224 IEEE Computing Society2015

[71] Konstantinos Sagonas and Kjell WinbladEfficient support for range queries andrange updates using contention adaptingsearch trees In Languages and Compilersfor Parallel Computing - 28th InternationalWorkshop volume 9519 of LNCS pages37ndash53 Springer 2016

[72] Marc Snir Steve W Otto D WWalker Jack Dongarra and Steven Huss-Lederman MPI The Complete ReferenceMIT Press Cambridge MA USA 1995

[73] Sriram Srinivasan and Alan MycroftKilim Isolation-typed actors for Java InECOOPrsquo08 pages 104ndash128 Berlin Heidel-berg 2008 Springer-Verlag

[74] Don Syme Adam Granicz and AntonioCisternino Expert F 40 Springer 2015

[75] Simon Thompson and Huiqing Li Refac-toring tools for functional languages Jour-nal of Functional Programming 23(3) 2013

[76] Marcus Voumllker Linux timers May 2014

[77] WhatsApp 2015 httpswwwwhatsappcom

[78] Tom White Hadoop - The Definitive GuideStorage and Analysis at Internet Scale (3 edrevised and updated) OrsquoReilly 2012

[79] Ulf Wiger Industrial-Strength Functionalprogramming Experiences with the Er-icsson AXD301 Project In ImplementingFunctional Languages (IFLrsquo00) Aachen Ger-many September 2000 Springer-Verlag

48

  • Introduction
  • Context
    • Scalable Reliable Programming Models
    • Actor Languages
    • Erlangs Support for Concurrency
    • Scalability and Reliability in Erlang Systems
    • ETS Erlang Term Storage
      • Benchmarks for scalability and reliability
        • Orbit
        • Ant Colony Optimisation (ACO)
          • Erlang Scalability Limits
            • Scaling Erlang on a Single Host
            • Distributed Erlang Scalability
            • Persistent Storage
              • Improving Language Scalability
                • SD Erlang Design
                  • S_groups
                  • Semi-Explicit Placement
                    • S_group Semantics
                    • Semantics Validation
                      • Improving VM Scalability
                        • Improvements to Erlang Term Storage
                        • Improvements to Schedulers
                        • Improvements to Time Management
                          • Scalable Tools
                            • Refactoring for Scalability
                            • Scalable Deployment
                            • Monitoring and Visualisation
                              • Percept2
                              • DTraceSystemTap tracing
                              • Devo
                              • SD-Mon
                                  • Systemic Evaluation
                                    • Orbit
                                    • Ant Colony Optimisation (ACO)
                                    • Evaluation Summary
                                      • Discussion
Page 44: Scaling Reliably Scaling Reliably: Improving the ... · Distributed actor languages are an effective means of constructing scalable reliable systems, and the Erlang ... that capture

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Acknowledgements

We would like to thank the entire RELEASEproject team for technical insights and adminis-trative support Roberto Aloi and Enrique Fer-nandez Casado contributed to the developmentand measurement of WombatOAM This workhas been supported by the European Uniongrant RII3-CT-2005-026133 ldquoSCIEnce SymbolicComputing Infrastructure in Europerdquo IST-2011-287510 ldquoRELEASE A High-Level Paradigm forReliable Large-scale Server Softwarerdquo and bythe UKrsquos Engineering and Physical SciencesResearch Council grant EPG0551811 ldquoHPC-GAP High Performance Computational Alge-bra and Discrete Mathematicsrdquo

References

[1] Gul Agha ACTORS A Model of ConcurrentComputation in Distributed Systems PhDthesis MIT 1985

[2] Gul Agha An overview of Actor lan-guages SIGPLAN Not 21(10)58ndash67 1986

[3] Bulldozer (microarchitecture) 2015httpsenwikipediaorgwiki

Bulldozer_(microarchitecture)

[4] Apache SF Liblcoud 2016 https

libcloudapacheorg

[5] C R Aragon and R G Seidel Random-ized search trees In Proceedings of the 30thAnnual Symposium on Foundations of Com-puter Science pages 540ndash545 October 1989

[6] Joe Armstrong Programming Erlang Soft-ware for a Concurrent World PragmaticBookshelf 2007

[7] Joe Armstrong Erlang Commun ACM5368ndash75 2010

[8] Stavros Aronis Nikolaos Papaspyrou Ka-terina Roukounaki Konstantinos Sago-nas Yiannis Tsiouris and Ioannis E

Venetis A scalability benchmark suite forErlangOTP In Torben Hoffman and JohnHughes editors Proceedings of the EleventhACM SIGPLAN Workshop on Erlang pages33ndash42 New York NY USA September2012 ACM

[9] Thomas Arts John Hughes Joakim Jo-hansson and Ulf Wiger Testing telecomssoftware with Quviq QuickCheck In Pro-ceedings of the 2006 ACM SIGPLAN work-shop on Erlang pages 2ndash10 New York NYUSA 2006 ACM

[10] Robert Baker Peter Rodgers SimonThompson and Huiqing Li Multi-level Vi-sualization of Concurrent and DistributedComputation in Erlang In Visual Lan-guages and Computing (VLC) The 19th In-ternational Conference on Distributed Multi-media Systems (DMS 2013) 2013

[11] Luiz Andreacute Barroso Jimmy Clidaras andUrs Houmllzle The Datacenter as a ComputerMorgan and Claypool 2nd edition 2013

[12] Basho Technologies Riakdocs BashoBench 2014 httpdocsbasho

comriaklatestopsbuilding

benchmarking

[13] J E Beasley OR-Library Distributingtest problems by electronic mail TheJournal of the Operational Research Soci-ety 41(11)1069ndash1072 1990 Datasetsavailable at httppeoplebrunelac

uk~mastjjbjeborlibwtinfohtml

[14] Cory Bennett and Ariel Tseitlin Chaosmonkey released into the wild NetflixBlog 2012

[15] Robert D Blumofe Christopher F JoergBradley C Kuszmaul Charles E LeisersonKeith H Randall and Yuli Zhou Cilk Anefficient multithreaded runtime system In

44

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

Proceedings of the fifth ACM SIGPLAN sym-posium on Principles and practice of parallelprogramming pages 207ndash216 ACM 1995

[16] Olivier Boudeville Technical manual ofthe sim-diasca simulation engine EDFRampD 2012

[17] Istvaacuten Bozoacute Viktoacuteria Foumlrdos Daacuteniel Hor-paacutecsi Zoltaacuten Horvaacuteth Tamaacutes Kozsik Ju-dit Koszegi and Melinda Toacuteth Refactor-ings to enable parallelization In JurriaanHage and Jay McCarthy editors Trends inFunctional Programming 15th InternationalSymposium TFP 2014 Revised Selected Pa-pers volume 8843 of LNCS pages 104ndash121Springer 2015

[18] Irina Calciu Dave Dice Yossi Lev Vic-tor Luchangco Virendra J Marathe andNir Shavit NUMA-aware reader-writerlocks In Proceedings of the 18th ACM SIG-PLAN Symposium on Principles and Prac-tice of Parallel Programming pages 157ndash166New York NY USA 2013 ACM

[19] Francesco Cesarini and Simon Thomp-son Erlang Programming A ConcurrentApproach to Software Development OrsquoReillyMedia 1st edition 2009

[20] Rohit Chandra Leonardo Dagum DaveKohr Dror Maydan Jeff McDonald andRamesh Menon Parallel Programming inOpenMP Morgan Kaufmann San Fran-cisco CA USA 2001

[21] Natalia Chechina Huiqing Li AmirGhaffari Simon Thompson and PhilTrinder Improving the network scalabil-ity of Erlang J Parallel Distrib Comput90(C)22ndash34 2016

[22] Natalia Chechina Kenneth MacKenzieSimon Thompson Phil Trinder OlivierBoudeville Viktoacuteria Foumlrdos Csaba HochAmir Ghaffari and Mario Moro Her-nandez Evaluating scalable distributed

Erlang for scalability and reliability IEEETransactions on Parallel and Distributed Sys-tems 2017

[23] Natalia Chechina Mario Moro Hernan-dez and Phil Trinder A scalable reliableinstant messenger using the SD Erlang li-braries In Erlangrsquo16 pages 33ndash41 NewYork NY USA 2016 ACM

[24] Koen Claessen and John HughesQuickCheck a lightweight tool forrandom testing of Haskell programsIn Proceedings of the 5th ACM SIGPLANInternational Conference on FunctionalProgramming pages 268ndash279 New YorkNY USA 2000 ACM

[25] H A J Crauwels C N Potts and L Nvan Wassenhove Local search heuristicsfor the single machine total weighted tar-diness scheduling problem INFORMSJournal on Computing 10(3)341ndash350 1998The datasets from this paper are includedin Beasleyrsquos ORLIB

[26] Jeffrey Dean and Sanjay GhemawatMapreduce Simplified data processing onlarge clusters Commun ACM 51(1)107ndash113 2008

[27] David Dewolfs Jan Broeckhove VaidySunderam and Graham E Fagg FT-MPI fault-tolerant metacomputing andgeneric name services A case study InEuroPVMMPIrsquo06 pages 133ndash140 BerlinHeidelberg 2006 Springer-Verlag

[28] Alan AA Donovan and Brian WKernighan The Go programming languageAddison-Wesley Professional 2015

[29] Marco Dorigo and Thomas Stuumltzle AntColony Optimization Bradford CompanyScituate MA USA 2004

[30] Faith Ellen Yossi Lev Victor Luchangcoand Mark Moir SNZI scalable NonZero

45

P Trinder N Chechina N Papaspyrou K Sagonas S Thompson et al bull Scaling Reliably bull 2017

indicators In Proceedings of the 26th AnnualACM Symposium on Principles of DistributedComputing pages 13ndash22 New York NYUSA 2007 ACM

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. DE-Bench: A benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary, adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 1:1–1:1, New York, NY, USA, 2008. ACM.

[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: A portable concurrent object oriented system based on C++. In ACM SIGPLAN Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley.

[52] Huiqing Li and Simon Thompson. Automated API Migration in a User-Extensible Refactoring Tool for Erlang Programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved Semantics and Implementation Through Property-Based Testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe Concurrency Introduction through Slicing. In PEPM '15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.

[57] Frank Lübeck and Max Neunhöffer. Enumerating large Orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897/.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(3):207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock, II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computer Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing – 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com.

[78] Tom White. Hadoop: The Definitive Guide – Storage and Analysis at Internet Scale (3rd ed., revised and updated). O'Reilly, 2012.

[79] Ulf Wiger. Industrial-Strength Functional Programming: Experiences with the Ericsson AXD301 Project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.
