OGO 2.1: SGI Origin 2000
Robert van Liere
CWI, Amsterdam
TU/e, Eindhoven
11 September 2001
unite.sara.nl
• SGI Origin 2000
• Located at SARA in Amsterdam
• Hardware configuration:
– 128 MIPS R10000 CPUs @ 250 MHz
– 64 Gbyte main memory
– 1 Tbyte disk storage
– 11 Ethernet interfaces @ 100 Mbit/s
– 1 Ethernet interface @ 1 Gbit/s
Contents
• Architecture
– Overview
– Module interconnect
– Memory hierarchies
• Programming
– Parallel models
– Data placement
• Pros and cons
Overview - Features
• 64-bit RISC microprocessors
• Large main memory
• “Scalable” in CPU, memory and I/O
• Shared memory programming model
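In the shared memory model every CPU sees one address space, so threads exchange data simply by reading and writing the same variables, with no explicit message passing. A minimal POSIX Threads sketch (array and function names are illustrative):

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8

static double shared_data[NTHREADS];  /* visible to every thread */

void *worker(void *arg) {
    long id = (long)arg;
    shared_data[id] = id * 2.0;       /* write directly into shared memory */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    for (int i = 0; i < NTHREADS; i++)
        printf("%g ", shared_data[i]);
    printf("\n");
    return 0;
}
```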
Overview - Applications
• Worldwide: ~30,000 systems
– ~50 with >128 CPUs
– ~100 with 64-128 CPUs
– ~500 with 32-64 CPUs
• Compute serving: many CPUs and much memory
• Database serving: many disks
• Web serving: much I/O
System architecture – 1 CPU
• CPU + cache
• One system bus
• Memory
• I/O (network + disk)
• Cached data
System architecture – N CPU
• Symmetric multi-processing (SMP)
• Multiple CPUs + caches
• One shared bus
• Memory
• I/O
N CPU – cache coherency
• Problem:
– Inconsistent cached data
• Solution:
– Snooping
– Broadcasting
• Not scalable: broadcast traffic grows with the number of CPUs
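Coherency traffic hurts even correct programs. A minimal sketch of "false sharing", assuming two counters land in the same cache line (all names are illustrative): every write by one thread invalidates the line in the other CPU's cache, so the line ping-pongs between caches.

```c
#include <pthread.h>

/* Two counters packed into one cache line: each thread only touches
 * its own field, yet the whole line bounces between the two caches. */
static struct { long a, b; } counters;

void *inc_a(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000L; i++) counters.a++;
    return NULL;
}

void *inc_b(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000L; i++) counters.b++;
    return NULL;
}

int main(void) {
    pthread_t ta, tb;
    pthread_create(&ta, NULL, inc_a, NULL);
    pthread_create(&tb, NULL, inc_b, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return 0;
}
```

Padding each counter out to its own cache line removes this traffic; the directory on the node board described next attacks the other half of the problem, the broadcasts.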
Architecture – Origin 2000
• Node board
• 2 CPUs + caches
• Memory
• Directory
• HUB
• I/O
Origin 2000 Interconnect
• Node boards
• Routers
– Six ports
Interconnect Topology
Sample Topologies
128-CPU Topology
Virtual Memory
• One CPU, multiple programs
• Page
• Paging disk
• Page replacement
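The virtual address space is mapped in fixed-size pages; a program can query the size through standard POSIX calls. A minimal sketch (the 16 Kbyte figure in the comment is an assumption about typical IRIX configurations):

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Page size as seen by this process; IRIX on the Origin 2000
     * commonly used 16 Kbyte pages (assumption). */
    long pagesize = sysconf(_SC_PAGESIZE);
    printf("page size: %ld bytes\n", pagesize);
    return 0;
}
```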
O2000 Virtual Memory
• Multiple CPUs, multiple programs
• Non-Uniform Memory Access (NUMA)
• Efficient programs:
– Minimize data movement
– Keep data “close” to the CPU
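A minimal sketch of keeping data close to the CPU, assuming IRIX's default first-touch placement policy (a page is allocated on the node of the CPU that first writes it): each thread initializes the data it will later use, so its pages end up in local memory.

```c
#include <pthread.h>
#include <stdlib.h>

#define N 1000000
#define NTHREADS 2

static double *chunk[NTHREADS];

/* Each thread allocates and initializes its own chunk; under a
 * first-touch policy the pages land in that thread's local memory. */
void *worker(void *arg) {
    long id = (long)arg;
    chunk[id] = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++)
        chunk[id][i] = 0.0;      /* first touch: pages placed locally */
    for (long i = 0; i < N; i++)
        chunk[id][i] += 1.0;     /* later accesses stay node-local */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```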
Latencies and Bandwidth
Application performance
• Scientific computing
– LU, ocean, barnes, radiosity
• Linear speedup
– More CPUs -> more performance
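Linear speedup is the ideal case; it holds only while the serial fraction of the program is negligible. Amdahl's law gives the bound for a program with parallel fraction p running on N CPUs:

```latex
S(N) = \frac{1}{(1 - p) + p/N}
```

For example, p = 0.95 on 128 CPUs yields S ≈ 17, far below linear; kernels such as LU, ocean, barnes and radiosity scale well because their serial fractions are tiny.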
Programming support
• IRIX operating system
• Parallel programming
– C source level with compiler pragmas (see the sketch after this list)
– POSIX Threads
– UNIX processes
• Data placement
– dplace, dlock, dperf
• Profiling
– timex, ssrun
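As an illustration of the pragma-based style, a hedged sketch in OpenMP form; later MIPSpro compilers accepted OpenMP pragmas, but the exact pragma dialect on a given IRIX release may differ:

```c
#include <stdio.h>

#define N 1024

int main(void) {
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* The compiler splits the loop iterations across the CPUs;
     * without OpenMP support the pragma is simply ignored. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[%d] = %g\n", N - 1, c[N - 1]);
    return 0;
}
```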
Parallel Programs
• Functional Decomposition
– Decompose the problem into different tasks
• Domain Decomposition
– Partition the problem’s data structure
• Consider:
– Mapping tasks/partitions onto CPUs
– Coordinating the work and communication of the CPUs
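A minimal sketch of domain decomposition with POSIX Threads, assuming a simple 1-D array as the data structure (all names are illustrative): the array is partitioned into contiguous blocks and each block is mapped onto one thread.

```c
#include <pthread.h>

#define N 1000000
#define NTHREADS 4

static double data[N];

struct range { long lo, hi; };

/* Each thread works only on its own partition of the domain. */
void *process(void *arg) {
    struct range *r = (struct range *)arg;
    for (long i = r->lo; i < r->hi; i++)
        data[i] = data[i] * 2.0 + 1.0;
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    struct range r[NTHREADS];
    long chunk = N / NTHREADS;
    for (int i = 0; i < NTHREADS; i++) {
        r[i].lo = i * chunk;
        r[i].hi = (i == NTHREADS - 1) ? N : (i + 1) * chunk;
        pthread_create(&t[i], NULL, process, &r[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```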
Task Decomposition
• Decompose problem
• Determine dependencies
Task Decomposition (cont.)
• Map tasks onto threads
• Compare:
– Sequential case
– Parallel case
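A minimal sketch of mapping two independent tasks onto threads (both task functions are illustrative stubs): in the parallel case the tasks run concurrently and meet at the join; the sequential case simply calls them one after the other.

```c
#include <pthread.h>
#include <stdio.h>

/* Two independent tasks identified by the dependency analysis. */
void *task_filter(void *arg) { (void)arg; /* ... filtering work ... */ return NULL; }
void *task_stats(void *arg)  { (void)arg; /* ... statistics work ... */ return NULL; }

int main(void) {
    pthread_t t1, t2;

    /* Parallel case: each task gets its own thread. */
    pthread_create(&t1, NULL, task_filter, NULL);
    pthread_create(&t2, NULL, task_stats, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* Sequential case for comparison:
     *   task_filter(NULL); task_stats(NULL); */
    puts("both tasks done");
    return 0;
}
```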
Efficient programs
• Use many CPUs
– Measure speedups
• Avoid:
– Excessive data dependencies
– Excessive cache misses
– Excessive inter-node communication
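For example, cache misses often come from traversal order rather than from the algorithm itself: C stores matrix rows contiguously, so the inner loop should walk along a row. A minimal sketch contrasting the two orders (function names are illustrative):

```c
#include <stdio.h>

#define N 1024
static double m[N][N];

/* Cache-friendly: consecutive iterations of the inner loop touch
 * consecutive memory locations. */
double sum_row_order(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Cache-hostile: each access jumps a whole row ahead, so almost
 * every access can miss the cache. */
double sum_column_order(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}

int main(void) {
    printf("%g %g\n", sum_row_order(), sum_column_order());
    return 0;
}
```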
Pros and Cons
• Multi-processor (128 CPUs)
• Large memory (64 Gbyte)
• Shared memory programming
• Slow integer CPU
• Performance penalty:
– Data dependencies
– Off-board memory