
OGO 2.1: SGI Origin 2000

Robert van Liere

CWI, Amsterdam

TU/e, Eindhoven

11 September 2001

unite.sara.nl

• SGI Origin 2000
• Located at SARA in Amsterdam

• Hardware configuration:
  – 128 MIPS R10000 CPUs @ 250 MHz
  – 64 Gbyte main memory
  – 1 Tbyte disk storage
  – 11 Ethernet @ 100 Mbit/s
  – 1 Ethernet @ 1 Gbit/s

Contents

• Architecture
  – Overview
  – Module interconnect
  – Memory hierarchies

• Programming
  – Parallel models
  – Data placement

• Pros and cons

Overview - Features

• 64-bit RISC microprocessors

• Large main memory

• “Scalable” in CPU, memory and I/O

• Shared memory programming model

Overview - Applications

• Worldwide: ~30,000 systems
  – ~50 with more than 128 CPUs
  – ~100 with 64-128 CPUs
  – ~500 with 32-64 CPUs

• Compute serving: many CPUs and much memory
• Database serving: many disks
• Web serving: much I/O

System architecture – 1 CPU

• CPU + cache
• One system bus
• Memory
• I/O (network + disk)

• Cached data

System architecture – N CPU

• Symmetric multi-processing (SMP)

• Multiple CPUs + caches
• One shared bus
• Memory
• I/O

N CPU – cache coherency

• Problem:
  – Inconsistent cached data

• Solution:
  – Snooping
  – Broadcasting

• Not scalable
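
A toy model of the snooping idea, in C (a sketch only, not SGI's actual protocol; NCPUS, cache[], and the function names are all illustrative): each CPU's cached copy of a line is Modified, Shared, or Invalid, and every write broadcasts an invalidate that the other caches snoop.

/* Toy snoopy coherence for one memory line (illustrative only). */
#include <stdio.h>

#define NCPUS 4
enum state { INVALID, SHARED, MODIFIED };
static enum state cache[NCPUS];        /* per-CPU state of the line */

static void cpu_read(int cpu)
{
    if (cache[cpu] == INVALID) {       /* miss: fetch a copy */
        for (int i = 0; i < NCPUS; i++)
            if (cache[i] == MODIFIED)  /* snoop: owner shares back */
                cache[i] = SHARED;
        cache[cpu] = SHARED;
    }
}

static void cpu_write(int cpu)
{
    for (int i = 0; i < NCPUS; i++)    /* broadcast invalidate */
        if (i != cpu)
            cache[i] = INVALID;
    cache[cpu] = MODIFIED;             /* only valid copy remains here */
}

int main(void)
{
    cpu_read(0);                       /* CPUs 0 and 1 share the line */
    cpu_read(1);
    cpu_write(1);                      /* CPU 1 writes: CPU 0 goes stale */
    printf("CPU 0 state: %d (0 = INVALID)\n", cache[0]);
    return 0;
}

Note that cpu_write must touch every cache on every write. That broadcast is why bus snooping does not scale, and why the Origin 2000 uses a per-node directory instead (next slide).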

Architecture – Origin 2000

• Node board

• 2 CPUs + caches
• Memory
• Directory
• HUB
• I/O

Origin 2000 Interconnect

• Node boards

• Routers
  – Six ports

Interconnect Topology

Sample Topologies

128-CPU Topology

Virtual Memory

• One CPU, multiple programs

• Pages
• Paging disk
• Page replacement
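
A minimal sketch of page replacement, assuming a FIFO eviction policy (IRIX's real policy is more elaborate); the frame count and reference string are made-up examples.

/* Toy FIFO page replacement: a tiny "physical memory" of NFRAMES
 * frames services a reference string; on a fault the oldest
 * resident page is evicted. Illustrative only. */
#include <stdio.h>

#define NFRAMES 3

int main(void)
{
    int refs[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
    int nrefs = sizeof refs / sizeof refs[0];
    int frames[NFRAMES] = {-1, -1, -1};
    int next = 0, faults = 0;

    for (int r = 0; r < nrefs; r++) {
        int hit = 0;
        for (int f = 0; f < NFRAMES; f++)
            if (frames[f] == refs[r])
                hit = 1;
        if (!hit) {                       /* page fault: evict oldest */
            frames[next] = refs[r];
            next = (next + 1) % NFRAMES;
            faults++;
        }
    }
    printf("%d faults for %d references\n", faults, nrefs);
    return 0;
}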

O2000 Virtual Memory

• Multiple CPUs, multiple programs

• Non-Uniform Memory Access (NUMA)

• Efficient programs:
  – Minimize data movement
  – Keep data “close” to the CPU that uses it
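
A hedged sketch of keeping data close to the CPU, assuming the usual NUMA first-touch placement policy: each thread initializes the array slice it will later compute on, so those pages end up on that thread's node. Sizes and names are illustrative; error handling is omitted.

/* First-touch placement: pages land on the node of the CPU that
 * first writes them, so workers initialize their own slices. */
#include <pthread.h>
#include <stdlib.h>

#define NTHREADS 4
#define N (1 << 22)

static double *a;

static void *worker(void *arg)
{
    long id = (long)arg;
    long lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;

    for (long i = lo; i < hi; i++)   /* first touch: pages placed here */
        a[i] = 1.0;
    for (long i = lo; i < hi; i++)   /* later work hits local memory */
        a[i] = 2.0 * a[i] + 1.0;
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    a = malloc(N * sizeof *a);       /* untouched: no pages placed yet */

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    free(a);
    return 0;
}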

Latencies and Bandwidth

Application performance

• Scientific computing
  – LU, Ocean, Barnes, Radiosity (SPLASH-2 benchmarks)

• Linear speedup
  – More CPUs -> more performance
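
How close to linear the speedup can get is bounded by Amdahl's law (a standard result, added here for context; it is not on the slide). With runtime T(p) on p CPUs and parallel fraction f of the work:

    S(p) = T(1) / T(p)
    S_Amdahl(p) = 1 / ((1 - f) + f / p)

For example, f = 0.95 on p = 128 CPUs gives S ≈ 17, not 128: near-linear speedup on this machine demands an almost perfectly parallel program.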

Programming support

• IRIX operating system
• Parallel programming:

  – C source level with compiler pragmas
  – POSIX Threads
  – UNIX processes (see the sketch below)

• Data placement:
  – dplace, dlock, dperf

• Profiling:
  – timex, ssrun
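
A minimal sketch of the UNIX-processes model listed above (POSIX Threads appears in the decomposition sketch later; the exact IRIX pragma syntax is not reproduced here). NPROCS and the printed text are illustrative.

/* UNIX processes: fork one worker per task, parent reaps them all. */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROCS 4

int main(void)
{
    for (int i = 0; i < NPROCS; i++) {
        if (fork() == 0) {             /* child: do this task's work */
            printf("process %d running\n", i);
            _exit(0);
        }
    }
    for (int i = 0; i < NPROCS; i++)   /* parent: wait for all children */
        wait(NULL);
    return 0;
}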

Parallel Programs

• Functional Decomposition– Decompose the problem into different tasks

• Domain Decomposition– Partition the problem’s data structure

• Consider:
  – Mapping tasks/parts onto CPUs
  – Coordinating work and communication among CPUs
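
A minimal domain-decomposition sketch using POSIX Threads: the array is the partitioned data structure, one slice maps onto each thread, and the only coordination is one partial sum per thread at the end. N, NTHREADS, and the helper names are illustrative.

/* Domain decomposition: partition the data, one slice per thread. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N (1 << 20)

static double a[N];
static double part[NTHREADS];

static void *sum_slice(void *arg)
{
    long id = (long)arg;
    long lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;
    double s = 0.0;

    for (long i = lo; i < hi; i++)
        s += a[i];
    part[id] = s;                    /* one write per thread at the end */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    double total = 0.0;

    for (long i = 0; i < N; i++)
        a[i] = 1.0;
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, sum_slice, (void *)i);
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL);
        total += part[i];            /* combine the partial results */
    }
    printf("total = %f\n", total);
    return 0;
}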

Task Decomposition

• Decompose problem

• Determine dependencies

Task Decomposition (continued)

• Map tasks on threads

• Compare:
  – Sequential case
  – Parallel case

Efficient programs

• Use many CPUs
  – Measure speedups

• Avoid:
  – Excessive data dependencies
  – Excessive cache misses
  – Excessive inter-node communication
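
One common source of excessive cache misses is false sharing, sketched below: per-thread counters packed into a single cache line would bounce between CPUs on every increment, so each counter is padded out to a line of its own. The 128-byte line size is an assumption (the R10000 secondary cache); names are illustrative.

/* False-sharing avoidance: pad each counter to its own cache line.
 * LINE = 128 bytes is an assumed (R10000 L2) line size. */
#include <pthread.h>

#define NTHREADS 4
#define LINE 128
#define ITERS 1000000

struct padded { long n; char pad[LINE - sizeof(long)]; };
static struct padded counter[NTHREADS];   /* one line per counter */

static void *bump(void *arg)
{
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++)
        counter[id].n++;                   /* no line ping-pong */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, bump, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}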

Pros vs Cons

Pros:
• Multi-processor (128 CPUs)
• Large memory (64 Gbyte)
• Shared memory programming

Cons:
• Slow integer CPU
• Performance penalty:
  – Data dependencies
  – Off-board memory accesses
