Universiteit van Amsterdam
Making Multi-core Computing Mainstream
Professor Chris Jesshope, University of Amsterdam
Invited presentation IPA Lentedagen 2010 (c) Chris Jesshope 20/4/2010
Overview
i. Motivation and background - why multi-core?
ii. Introduction to the EU funded Apple-CORE project
iii. The SVP execution model
iv. Microgrid implementation and evaluation
v. Summary
Moore’s law and its consequences
• If dimensions on silicon are reduced by a factor λ
• density increases by λ² and clock speed by λ
• giving a performance increase of λ³
• ... but power density also grows by λ²
• Power is proportional to frequency × Voltage²
The problem exponentials
• Power density increases exponentially over time
• Chip area reachable in a clock decreases exponentially
• We are now in an era where clock frequency is limited by cooling constraints and it takes many cycles for a signal to cross a chip - distributed systems on a chip
[Figure: the shrinking fraction of a 20 mm chip edge reachable in one clock at the 130 nm, 100 nm, 70 nm and 35 nm process nodes]
The resulting problems
• Power - without expensive cooling technology we have reached the limit of processor speed in CMOS
• ILP - executing multiple instructions in a single cycle by executing instructions out of order is power intensive and has limits - it is difficult to achieve more than 4-way issue
• Memory gap - there is an increasing gap between processor and memory speed as a consequence of signal propagation
• As we will see, multi-core is a solution, but it introduces another problem - how to program and manage explicit concurrency
Invited presentation INFOS 2010 (c) Chris Jesshope 28/3/2010
Multi-core - a Solution?
• Multi-core is the only way to increase chip performance
• Multi-core means simpler cores (in-order issue), which are more power efficient to implement - a move from a few complex cores to many simpler ones
• Applications run more power efficiently on many simple cores than on a few complex ones... with some provisos
• applications must be scalable
• communications must be local
• programs probably need to be rewritten using concurrent programming, which is difficult
Shape of the future?
• Intel recently announced its Single-chip Cloud Computer
a multicore comprising 48 IA architecture cores
• SCC is a research vehicle to investigate programming models and operating systems for the multi-core era
• It comprises:
• 48 P54C cores operating at 1GHz (FDIV bug fixed!)
• 24 node mesh network with 64GBytes/sec links
• Partitioned shared address space with no hardware coherency between cores
• Up to 64GBytes shared off-chip memory
• 256 KBytes L2 cache per core (not shared)
• 16KByte shared message passing buffer
• Supports frequency/voltage (FV) scaling with 24 frequency domains and 6 voltage islands (N.B. Pd = k·f·V²)
Making it Mainstream
• The changes required are fundamental and disruptive!
• Questions that need to be answered include:
• what programming model is required and how is it supported in the architecture?
• do we need a full operating system on each core?
• how do we deal with locality and, in its absence, how do we introduce asynchronous instruction execution?
• Finally, can we have generic concurrent programs?
• i.e. source or even binary code compatibility across generations of multi-core processors
Overview
i. Motivation and background - why multi-core?
ii. Introduction to the EU funded Apple-CORE project
iii. The SVP execution model
iv. Microgrid implementation and evaluation
v. Summary
• The goal of the EU Project Apple-CORE is indeed to make multi-core mainstream
• this means expressing and composing computations as concurrently as possible
• i.e. replace sequential with concurrent composition
• We use an execution model designed for multi-cores
• The Self-adaptive Virtual Processor or SVP
• SVP implements both an OS kernel and a related programming model
• In Apple-CORE - SVP is implemented in the core’s binary instructions and it is the only OS required on most cores
Overview
i. Motivation and background - why multi-core?
ii. Introduction to the EU funded Apple-CORE project
iii. The SVP execution model
iv. Microgrid implementation and evaluation
v. Summary
The changing Landscape
• In the past, memory has been ubiquitous and cores were scarce - we therefore rationed processor cycles
• In the multi-core era, we consider both memory and cores to be ubiquitous
• Move from time sharing to space sharing, the cluster is the processor
• SVP introduces the concept of place (a cluster) allocated for exclusive use of a thread (space sharing)
Motivation for SVP
• Provide an execution model that supports binary-code compatibility across generations of processors - from 1 to many cores
• Capture locality through implicit communication and support weakly-consistent memory systems
• shared, partitioned shared and distributed memories
• Provide explicit and dynamic resource management
• Support work migration and load balancing through the use of delegation - threads are continually created and terminate
• Provide security to support the exclusive use of both Memory and Cores
Functional concurrency
• SVP composes threads (tasks) concurrently
• The abstract SVP API supports thread create with asynchronous termination on sync
• Parent blocks on sync until child has terminated
• SVP Programs capture all concurrency in an application using hierarchy
• Implementations apply a sequential schedule when the concurrency resources available are less than those exposed by the program
[Diagram: a parent thread creates foo, then creates bar, and later blocks on each family's sync]
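The create/sync pattern can be sketched with plain Python threads. This is only an illustration of the abstract API's behaviour - in SVP, create is a machine-level operation, not a library call, and the function names here (create, sync, foo, bar) follow the slide rather than any real implementation:

```python
import threading

def create(fn, *args):
    """Sketch of SVP create: start a child asynchronously and
    return a handle the parent can later sync on."""
    t = threading.Thread(target=fn, args=args)
    t.start()
    return t

def sync(family):
    """Parent blocks here until the child family has terminated."""
    family.join()

results = []

def foo():
    results.append("foo")

def bar():
    results.append("bar")

# Parent creates both children, continues, then blocks on each sync.
f1 = create(foo)
f2 = create(bar)
sync(f1)
sync(f2)
```

After both syncs complete, the parent can rely on both children having terminated, which is exactly the guarantee the model gives.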
Scheduling by instruction
[Diagram: a non-blocking thread waits for all of its parameters (a, b, c), then is created and executes to completion; a blocking thread is created first, then its parameters are set, and it blocks on them using SVP shared objects (i-structures)]
• Pairwise communication between threads uses shared objects - these support blocking read & non-blocking write
• Blocking is used to schedule threads in the SVP kernel
• The same mechanism can be used to decouple instruction execution - e.g. memory accesses are a concurrent activity
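A shared object with blocking read and non-blocking write behaves like a write-once i-structure. A minimal Python sketch (the class and method names are illustrative, not SVP's actual interface):

```python
import threading

class IStructure:
    """Write-once synchronising slot: reads block until the single
    non-blocking write has happened (an i-structure)."""
    def __init__(self):
        self._written = threading.Event()
        self._value = None

    def write(self, value):
        # Non-blocking, at most once; unblocks any waiting readers.
        assert not self._written.is_set(), "i-structures are write-once"
        self._value = value
        self._written.set()

    def read(self):
        # Blocks the reading thread until the value is available -
        # in SVP this blocking is what schedules threads.
        self._written.wait()
        return self._value

# Consumer blocks on read until the producer writes.
slot = IStructure()
out = []
consumer = threading.Thread(target=lambda: out.append(slot.read()))
consumer.start()
slot.write(42)
consumer.join()
```

The same blocking-read discipline, applied to register reads, is what lets the hardware treat long-latency memory accesses as concurrent activities.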
Replication
• SVP's create API also supports replication
• A family of statically defined homogeneous threads (dynamically heterogeneous) can be created by supplying an index range {start, step, limit}
• threads are automatically created with their index variable pre-defined - testing this index allows heterogeneity
• Creating a family of threads requires at least one additional concurrent thread context
• one context executes one child thread at a time
• more contexts mean more concurrency in execution
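A family create with an index range {start, step, limit} could be sketched as below. The following slide's create(f1;;0;7;1;;) yields 8 threads with i=0..7, so the limit is taken as inclusive here; all names are illustrative:

```python
import threading

def create_family(body, start, limit, step):
    """Sketch of replicated create: one thread per index in
    start..limit (inclusive, matching the slides' example), each
    created with its index variable pre-defined as an argument."""
    family = [threading.Thread(target=body, args=(i,))
              for i in range(start, limit + 1, step)]
    for t in family:
        t.start()
    return family

def sync(family):
    for t in family:
        t.join()

# Statically homogeneous, dynamically heterogeneous:
# every thread runs the same body, but behaviour branches on the index.
results = {}
def worker(i):
    results[i] = i * i if i % 2 == 0 else -i

f = create_family(worker, 0, 7, 1)
sync(f)
```

Each of the 8 threads writes a distinct entry, so the family can populate results concurrently without interference.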
Replication & Communication
[Diagram: a main thread executes create(f1;;0;7;1;;) and sync(f1), producing a family of 8 homogeneous threads (thread_i, i=0..7); each of these may create a subordinate family with create(f2;;0;5;1;;) and sync(f2) (one shown); blocking communication links adjacent threads]
• Shared communication is pairwise between adjacent threads
• i.e. in linear chains between child threads
• This captures locality in the model and avoids communication deadlock - communication is guaranteed to be acyclic
• Communication from the parent to all children using global objects is also supported
Memory consistency
• We define consistency domains over memory to differentiate between shared and global memories
• Even within a consistency domain, concurrent threads cannot reliably read each other's memory, except:
i. A family of threads can see any memory written by its parent before the create
ii. A parent can see memory written by its children after it syncs
iii. Memory written in one thread prior to a shared write can be read by that thread's successor, and memory written by a parent thread prior to a global write can be seen by all its children
iv. Between consistency domains only the values explicitly communicated by the shared and global objects are visible
SVP resource model
• Processing resources are introduced into SVP by acquiring a place and specifying it as a parameter on create
• threads are then created at that place - a delegation
• a place server (SEP) must be provided by the SVP run-time
• Place is an abstraction defined for each SVP implementation - it contains one or more of:
• an address to access the place
• a security key to allow execution at a place
• virtualisation of the physical place
• mutual exclusion at a place
• Together these allow single-use keys and exclusive use
Mutual exclusion
• As SVP's weakly consistent memory cannot be used for locks, SVP introduces the concept of a mutex place - a place shared between threads that serialises all requests to create tasks
• This provides the necessary support to implement processor and memory allocation which require exclusive access to data structures identifying usage
• Memory reads and writes at exclusive places provide consistency between otherwise unrelated threads
• Similar to Dijkstra’s secretary concept
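The effect of a mutex place can be sketched with a lock that serialises every create delegated to it. Note this is only an analogy: in SVP the serialisation is a property of the place itself, not of code inside the tasks, and the names here are illustrative:

```python
import threading

class MutexPlace:
    """Sketch of an exclusive (mutex) place: all tasks created here
    run one at a time, giving mutual exclusion without the tasks
    themselves using memory-based locks."""
    def __init__(self):
        self._one_at_a_time = threading.Lock()

    def create(self, task, *args):
        def run():
            with self._one_at_a_time:   # creates are serialised
                task(*args)
        t = threading.Thread(target=run)
        t.start()
        return t

# A shared tally updated only via the mutex place stays consistent,
# as required for e.g. processor/memory allocation bookkeeping.
tally = {"n": 0}
def bump():
    tally["n"] += 1

place = MutexPlace()
threads = [place.create(bump) for _ in range(100)]
for t in threads:
    t.join()
```

Because every bump runs at the exclusive place, the 100 updates are applied one at a time - the secretary pattern the slide refers to.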
Place allocation
• There are two model-defined places
• default - no place specified (interpreted by implementation)
• local - forces locality of threads (e.g. on same core)
• All other places are allocated by a resource server - the SEP
[Diagram: a control thread sends a resource request to the SEP (an exclusive place); a place is allocated; the control thread creates work at that place and syncs on it; the place is then returned to the SEP]
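The request/allocate/create/sync/return protocol can be sketched with a tiny place server. The class and method names are illustrative stand-ins for whatever interface a real SVP run-time's SEP exposes:

```python
import queue
import threading

class SEP:
    """Sketch of an SVP place server: hands out places (clusters)
    for exclusive use and takes them back on release."""
    def __init__(self, places):
        self._free = queue.Queue()
        for p in places:
            self._free.put(p)

    def request(self):
        return self._free.get()    # blocks while no place is free

    def release(self, place):
        self._free.put(place)

sep = SEP(["cluster0", "cluster1"])
used = []

def control_thread():
    place = sep.request()              # request resource
    worker = threading.Thread(         # create work at that place
        target=lambda: used.append(place))
    worker.start()
    worker.join()                      # sync
    sep.release(place)                 # place returned

controllers = [threading.Thread(target=control_thread) for _ in range(4)]
for c in controllers:
    c.start()
for c in controllers:
    c.join()
```

With only two places and four delegations, two control threads block in request until a place is released - the space-sharing analogue of waiting for a CPU under time sharing.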
Overview
i. Motivation and background - why multi-core
ii. Introduction to the EU funded Apple-CORE project
iii. The SVP execution model
iv. Microgrid implementation and evaluation
v. Summary
Implementations
• In Apple-CORE we have implemented SVP in the ISA of a many-core processor
• we have Alpha and SPARC software emulation
• we have an FPGA SPARC implementation (one core)
• We also have a software implementation of SVP built over pthreads
• we plan to re-implement this on the SCC to achieve an efficient software kernel (granularity = small functions)
Compilers
• Have defined a core language µTC (micro-threaded C) which captures all of the SVP concepts
• Have µTC to Alpha & SPARC compilers based on GCC
• Have parallelising C and SaC compilers that target the core SVP language
• SaC is a data-parallel functional language
The chip architecture
[Diagram: distributed on-chip memory (COMA), clusters of 1, 4, 16, 32 and 64 cores on a work distribution network, and external interfaces]
• The cluster is the processor - binary code executes on a cluster of cores, i.e. the same code runs on one or more cores
The SVP Core
[Diagram: a thread-aware I-cache and thread queue feed thread/instruction select; an I-structure register file links the synchronous pipeline to asynchronous operations (e.g. FPU/load); channels carry communication to and from the neighbouring cores]
• Instructions are tagged with thread and family index; data is tagged with RF address
• The active thread queue contains threads that are guaranteed to execute at least one instruction
The core supports asynchrony
• Memory accesses may take 100s or 1000s of cycles - H/W support for 100s of threads/core tolerates this
• Asynchronous units (FPU, FPGA, encryption etc.) are tolerated likewise and can be shared between cores
Concurrency overheads
• Thread creation takes 1 pipeline cycle on one core in the cluster to create an arbitrary number of threads
• Context switch on every cycle
Chip parameters for evaluation
• Results presented execute code on a microgrid of 128 cores on a cycle-accurate software emulation
• Alpha ISA with 1024 int & 512 float register file
• 256 thread table entries, 16 family table entries
• 1K L1 I-cache and 1K L1 D-cache
• 32K L2 cache shared between 4 cores
• 2 DDR3 standard memory channels
• In all experiments the same binary code was executed on clusters of bare cores of arbitrary size with cold caches
128-core Microgrid
[Diagram: clusters of cores on a ring, with 1 FPU per 2 cores and an L2 cache per cluster; COMA directories on a COMA ring with two root directories, each with a DDR channel; an I/O interface with DMA]
• 2-level COMA memory, I/O cores, DDR external memory interfaces
Delegation granularity
• Results plot GFLOPS & GIPS for a 1 GHz core using a shared FPU (1 FPU per 2 cores)
• Efficiency is the average IPC of all cores - N.B. this is from a cold start on bare cores
• n=4K on 64 cores - we create & execute 64 threads/core in 3.5 µsec
• n=8K on 64 cores - we create & execute 128 threads/core in 6.5 µsec
• These figures include place allocation
[Plot: GFLOPS, GIPS and average efficiency vs. number of cores]
Latency tolerance
• Generally the L1 D-cache consumes a large fraction of core area
• We show that with latency-tolerant cores the size of the L1 D-cache can be reduced to little more than a buffer
• Evaluated using FFT with L1 D-cache sizes from 1 to 64 KBytes, using from 1 to 64 cores in steps of 1
L1 D-cache size evaluation
[Plot: speedup vs. number of cores per cluster for a 1-D FFT, with 1, 4, 16 and 64 KByte D-caches]
I/O bound kernel
[Plot: LMK7 (equation of state fragment) performance in GFLOP/s with cold caches, vs. problem size (#psize) and #cores per SVP place; performance saturates at the DDR3 memory bandwidth]
Compute bound kernel
[Plot: LMK21 (matrix-matrix product) performance in GFLOP/s vs. problem size (#psize) and #cores]
Pipeline efficiency
[Plot: LMK21 (matrix-matrix product) pipeline efficiency (#used slots / #total slots) vs. problem size (#psize) and #cores]
SL Quicksort - Sort rate
[Plot: SL Quicksort sort rate vs. number of cores used, for several configurations]
Overview
i. Motivation and background - why multi-core
ii. Introduction to the EU funded Apple-CORE project
iii. The SVP execution model
iv. Microgrid implementation and evaluation
v. Summary
Summary - models and software-engineering issues
• SVP provides a deterministic concurrency model with weakly consistent memory
• it composes without deadlock
• it has a well-defined sequential schedule, allowing implementations of generic code at many levels of granularity
• resource deadlock problems are solvable
• For generality we add mutual exclusion as a place - to support OS services
Summary - architecture
• The microgrid is the finest-grained SVP implementation possible
• We have demonstrated that it provides high-efficiency execution across a range of applications
• The approach, however, is disruptive; it requires:
• new architectures, compilers and operating systems
• Cores can be tuned for depth or breadth of concurrency
• Core area optimised using very small L1 caches and by sharing large-area operations, e.g. FPUs, dividers
• 1st version of the FPGA prototype demonstrated in February this year in the 2nd Apple-CORE review
Security in SVP
• Whenever a page is added to a protection domain, or a unit of work is created at a place or terminated, at any level of the system, this cannot be achieved without access to the appropriate one-time security key - a capability
[Diagram: a capability guards create(fid, place, pdid...), kill(fid), and adding memory pages to a protection domain (PD); the place is a cluster of cores]
Resource allocation granularity
• The process of delegation and work creation is extremely efficient - function delegation can be effective up to MHz rates
• 4K threads created, executed and synchronised in 3.5K cycles in the LMK7 benchmark - this includes resource allocation
• This does not allow much time for analysis and hence we use JIT place allocation
• At higher levels of the concurrency tree we have scope for analysis and resource partitioning
Microgrid resources
[Diagram: an SVP microgrid with on-chip COMA memory, an SEP and unused cores]
• The SVP microgrid comprises:
• a single virtual address space on chip with distributed shared memory
• a grid of DRISC cores where clusters are fixed or configured dynamically
• an SEP as resource manager
• Families are delegated to clusters using a place
• named place = cluster; local = same processor; default = same cluster
[Diagram: delegation protocol - request, delegate, configure (e.g. set capability), return place, complete, release]
Multi-rooted allocation
[Diagram: an application concurrency tree rooted at main, with a static pipeline and dynamic sub-trees allocated by local SEPs (SEP1..SEP4), and dynamic adjustment between partitions]
• For relatively static components resources (and allocators) can be partitioned with dynamic migration between partitions
Resource partitioning + dynamic allocation