10/15/04

Distributed Shared-Memory Parallel Computing with UPC on SAN-based Clusters – Q3 Status Report

High-Performance Networking (HPN) Group, HCS Research Laboratory
ECE Department, University of Florida

Principal Investigator: Professor Alan D. George
Sr. Research Assistant: Mr. Hung-Hsun Su
Outline
Objectives and Motivations
Background
Related Research
Approach
Results
Conclusions and Future Plans
Objectives and Motivations

Objectives
- Support advancements for HPC with Unified Parallel C (UPC) on cluster systems exploiting high-throughput, low-latency System-Area Networks (SANs) and LANs
- Design and analysis of tools to support UPC on SAN-based systems
- Benchmarking and case studies with key UPC applications
- Analysis of tradeoffs in application, network, service, and system design

Motivations
- Increasing demand in sponsor and scientific computing community for shared-memory parallel computing with UPC
- New and emerging technologies in system-area networking and cluster computing
  - Scalable Coherent Interface (SCI)
  - Myrinet (GM)
  - InfiniBand (VAPI)
  - QsNet (Quadrics Elan)
  - Gigabit Ethernet and 10 Gigabit Ethernet
- Clusters offer excellent cost-performance potential
Background

- Key sponsor applications and developments toward shared-memory parallel computing with UPC
- UPC extends the C language to exploit parallelism
  - Currently runs best on shared-memory multiprocessors or proprietary clusters (e.g., AlphaServer SC), notably with HP/Compaq's UPC compiler
- First-generation UPC runtime systems becoming available for clusters (MuPC, Berkeley UPC)
- Significant potential advantage in cost-performance ratio with Commercial Off-The-Shelf (COTS) cluster configurations
  - Leverage economy of scale
  - Clusters exhibit low cost relative to tightly coupled SMP, CC-NUMA, and MPP systems
  - Scalable performance with COTS technologies
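As a minimal illustration of the programming model (a sketch of our own, not code from the report; it requires a UPC compiler such as Berkeley UPC's upcc, so it is shown for flavor rather than as plain C), a UPC vector add distributes loop iterations by data affinity:

```c
#include <upc_relaxed.h>

#define N 1024
shared int a[N], b[N], c[N];   /* arrays spread cyclically across threads */

int main(void) {
    int i;
    /* each thread executes only the iterations whose element has
       affinity to it, so every access in the loop body is local */
    upc_forall (i = 0; i < N; i++; &c[i])
        c[i] = a[i] + b[i];
    upc_barrier;               /* wait for all threads before continuing */
    return 0;
}
```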
Related Research

- University of California at Berkeley: UPC runtime system, UPC-to-C translator, Global-Address Space Networking (GASNet) design and development, application benchmarks
- George Washington University: UPC specification, UPC documentation, UPC testing strategies and test suites, UPC benchmarking, UPC collective communications, parallel I/O
- Michigan Tech University: Michigan Tech UPC (MuPC) design and development, UPC collective communications, memory model research, programmability studies, test suite development
- Ohio State University: UPC benchmarking
- HP/Compaq: UPC compiler
- Intrepid: GCC UPC compiler
Related Research -- MuPC & DSM

MuPC (Michigan Tech UPC)
- First open-source reference implementation of UPC for COTS clusters
- Runs on any cluster that provides Pthreads and MPI
- Built as a reference implementation, so performance is secondary
  - Limitations in application size and memory model
  - Not suitable for performance-critical applications

UPC/DSM/SCI
- SCI-VM (DSM system for SCI)
  - HAMSTER interface allows multiple modules to support MPI and shared-memory models
  - Created using the Dolphin SISCI API and ANSI C
- SCI-VM not under constant development, so future upgrades are uncertain
- Not feasible given the amount of work needed versus the expected performance
- Better possibilities with GASNet
Related Research -- GASNet

- Global-Address Space Networking (GASNet) [1]: communication system created by U.C. Berkeley; the target for the Berkeley UPC system
- Language-independent, low-level networking layer for high-performance communication
- Segment region for communication on each node; three types:
  - Segment-fast: sacrifices size for speed
  - Segment-large: allows a large memory area for shared space, perhaps with some loss in performance (though the firehose [2] algorithm is often employed)
  - Segment-everything: exposes the entire virtual memory space of each process for shared access
- Firehose algorithm allows memory to be managed in buckets for efficient transfers
- Interface for high-level global-address-space SPMD languages such as UPC [3] and Titanium [4]
- Divided into two layers
  - Core: Active Messages
  - Extended: high-level operations that take direct advantage of network capabilities
- A reference implementation of the Extended API is available that uses only the Core layer
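The Core API is built around Active Messages; skeletal use of it looks roughly like the following (a sketch against the GASNet-1 C API; the handler indices 200/201 and the handlers themselves are our own illustrative choices, and building it requires a GASNet conduit):

```c
#include <gasnet.h>

/* illustrative request handler: runs on the node receiving the AM */
static void ping_handler(gasnet_token_t token) {
    gasnet_AMReplyShort0(token, 201);   /* reply handler index: our choice */
}
static void pong_handler(gasnet_token_t token) { /* request completed */ }

static gasnet_handlerentry_t htable[] = {
    { 200, (void (*)()) ping_handler },
    { 201, (void (*)()) pong_handler },
};

int main(int argc, char **argv) {
    gasnet_init(&argc, &argv);
    /* register handlers and set up the shared segment on every node */
    gasnet_attach(htable, 2, GASNET_PAGESIZE, 0);
    if (gasnet_mynode() == 0)
        gasnet_AMRequestShort0(1, 200);  /* send a short AM to node 1 */
    gasnet_exit(0);
    return 0;
}
```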
Related Research -- Berkeley UPC

- Second open-source implementation of UPC for COTS clusters; first with a focus on performance
- Uses GASNet for all accesses to remote memory; network conduits allow for high performance over many different interconnects
- Targets a variety of architectures: x86, Alpha, Itanium, PowerPC, SPARC, MIPS, PA-RISC
- Best chance as of now for high-performance UPC applications on COTS clusters
- Note: only supports strict shared-memory access and therefore only uses the blocking transfer functions in the GASNet spec

[Figure: Berkeley UPC software stack. UPC code feeds a UPC-to-C translator; the translator-generated C code runs on the Berkeley UPC runtime system, which sits on the GASNet communication system and the network hardware. Successive layers are annotated as platform-, network-, compiler-, and language-independent.]
Approach

- Collaboration
  - HP/Compaq UPC Compiler V2.1 running in our lab on ES80 AlphaServer (Marvel)
  - Support of testing by OSU, MTU, UCB/LBNL, UF, et al. with leading UPC tools and systems for function performance evaluation
  - Field test of newest compiler and system
- Exploiting SAN strengths for UPC
  - Design and develop new SCI conduit for GASNet in collaboration with UCB/LBNL
  - Evaluate DSM for SCI as an option for executing UPC
- Benchmarking
  - Use and design of applications in UPC to grasp key concepts and understand performance issues
  - NAS benchmarks from GWU; DES-cypher benchmark from UF
- Performance analysis
  - Network communication experiments
  - UPC computing experiments
- Emphasis on SAN options and tradeoffs: SCI, Myrinet, InfiniBand, Quadrics, GigE, 10GigE, etc.
[Figure: collaboration layers.
Upper layers (applications, translators, documentation): Ohio State (benchmarks), Michigan Tech (benchmarks, modeling, specification), UC Berkeley (benchmarks, UPC-to-C translator, specification), GWU (benchmarks, documents, specification), UF HCS Lab (benchmarks).
Middle layers (runtime systems, interfaces): Michigan Tech (UPC-to-MPI translation and runtime system), UC Berkeley (C runtime system, upper levels of GASNet), HP (UPC runtime system on AlphaServer), UF HCS Lab (GASNet collaboration, beta testing).
Lower layers: UC Berkeley (GASNet), UF HCS Lab (GASNet collaboration, network performance analysis).]
GASNet SCI Conduit

Scalable Coherent Interface (SCI)
- Low-latency, high-bandwidth SAN with shared-memory capabilities
  - Requires memory exporting and importing
  - PIO (requires importing) + DMA (needs 8-byte alignment)
  - Remote write ~10x faster than remote read

SCI conduit
- AM enabling (Core API)
  - Dedicated AM message channels (Command)
  - Request/Response pairs to prevent deadlock
  - Flags to signal arrival of new AM (Control)
- Put/Get enabling (Extended API)
  - Global segment (Payload)

[Figure: SCI-space layout for node X. Each node exports Control segments (N total), Command segments (N*N total), and Payload segments (N total), plus local DMA queues; segments are mapped between physical and virtual addresses via SCI exporting/importing, with some local regions in use and others free.]
GASNet SCI Conduit - Core API: Active Message Transfer

1. Obtain a free slot
   - Tracked locally using an array of flags
2. Package AM header
3. Transfer data
   - Short AM: PIO write (header)
   - Medium AM: PIO write (header), PIO write (medium payload)
   - Long AM: PIO write (header); PIO write for payloads up to 1024 bytes and for the unaligned portion of the payload; DMA write otherwise (multiples of 64 bytes)
4. Wait for transfer completion
5. Signal AM arrival
   - Message-ready flag: value = type of AM
   - Message-exist flag: value = TRUE
6. Wait for reply/control signal
   - Free up remote slot for reuse

[Figure: node X writes the AM header into its command segment on node Y and the medium/long payload into node Y's payload segment. Node Y's polling loop checks the message-exist flag, extracts message information, processes all new messages, and sends an AM reply or acknowledgment; node X waits for completion.]
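The slot-and-flag handshake in steps 1 through 6 can be modeled in plain C. The following is a toy single-process sketch of the bookkeeping only (our own illustration, not the conduit's SCI code; the real implementation performs these writes over exported SCI segments):

```c
#include <string.h>

/* Toy model of the conduit's command-slot handshake: a sender claims a
 * free slot, writes the AM header, and raises a message-ready flag; the
 * receiver polls the flags, consumes the message, and frees the slot. */
#define SLOTS 4

typedef struct {
    int  ready;        /* message-ready flag: 0 = free, nonzero = AM type */
    char header[32];   /* packaged AM header */
} am_slot;

static am_slot channel[SLOTS];

/* Step 1: obtain a free slot, tracked locally via the flag array. */
static int claim_slot(void) {
    for (int i = 0; i < SLOTS; i++)
        if (channel[i].ready == 0) return i;
    return -1;  /* all in flight: sender must wait for replies (step 6) */
}

/* Steps 2-5: package the header, "transfer" it, and signal arrival. */
static void send_am(int slot, int am_type, const char *hdr) {
    strncpy(channel[slot].header, hdr, sizeof channel[slot].header - 1);
    channel[slot].ready = am_type;   /* flag value encodes the AM type */
}

/* Receiver side: poll the flags, consume one message, free the slot. */
static int poll_am(char *out, int outlen) {
    for (int i = 0; i < SLOTS; i++)
        if (channel[i].ready != 0) {
            strncpy(out, channel[i].header, outlen - 1);
            out[outlen - 1] = '\0';
            channel[i].ready = 0;    /* step 6: slot reusable */
            return 1;
        }
    return 0;
}
```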
Experimental Testbed

- Elan, VAPI (Xeon), MPI, and SCI conduits
  - Nodes: dual 2.4 GHz Intel Xeons, 1GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset
  - SCI: 667 MB/s (300 MB/s sustained) Dolphin SCI D337 (2D/3D) NICs, using PCI 64/66, 4x2 torus
  - Quadrics: 528 MB/s (340 MB/s sustained) Elan3, using PCI-X in two nodes with QM-S16 16-port switch
  - InfiniBand: 4x (10 Gb/s, 800 MB/s sustained) Infiniserv HCAs, using PCI-X 100, InfiniIO 2000 8-port switch from Infinicon
  - RedHat 9.0 with gcc compiler V3.3.2; SCI uses MP-MPICH beta from RWTH Aachen Univ., Germany; Berkeley UPC runtime system 1.1
- VAPI (Opteron)
  - Nodes: dual AMD Opteron 240, 1GB DDR PC2700 (DDR333) RAM, Tyan Thunder K8S server motherboard
  - InfiniBand: same as in VAPI (Xeon)
- GM (Myrinet) conduit (c/o access to cluster at MTU)
  - Nodes*: dual 2.0 GHz Intel Xeons, 2GB DDR PC2100 (DDR266) RAM
  - Myrinet*: 250 MB/s Myrinet 2000, using PCI-X, on 8 nodes connected with 16-port M3F-SW16 switch
  - RedHat 7.3 with Intel C compiler V7.1; Berkeley UPC runtime system 1.1
- ES80 AlphaServer (Marvel)
  - Four 1 GHz EV7 Alpha processors, 8GB RD1600 RAM, proprietary inter-processor connections
  - Tru64 5.1B Unix, HP UPC V2.1 compiler

* via testbed made available courtesy of Michigan Tech
SCI Conduit GASNet Core Level Experiments

Experimental setup
- SCI conduit: latency/throughput (testam, 10000 iterations)
- SCI raw: PIO latency (scipp); DMA latency and throughput (dma_bench)

Analysis
- Latency a little high, but the overhead is constant (not exponential)
- Throughput follows the raw SCI trend

[Charts: Short/Medium AM ping-pong latency (us) for payloads 0-1024 bytes, Long AM ping-pong latency (us) for payloads 0-16K bytes, and Long AM throughput (MB/s) for payloads 64 bytes-256K bytes, each comparing SCI raw against the SCI conduit; the PIO/DMA mode shift is visible in the Long AM curves.]
SCI Conduit GASNet Extended Level Experiments

Experimental setup
- GASNet configured with segment-large
  - As fast as segment-fast inside the segment
  - Makes use of Firehose for memory outside the segment (often more efficient than segment-fast)
- GASNet conduit experiments: Berkeley GASNet test suite
  - Average of 1000 iterations
  - Each uses put/get operations to take advantage of the implemented extended APIs
  - Executed with target memory falling inside and then outside the GASNet segment; only inside results reported unless the difference was significant
  - Latency results use testsmall; throughput results use testlarge

Analysis
- Elan shows best performance for latency of puts and gets
- VAPI has by far the best bandwidth; latency very good
- GM latencies a little higher than all the rest
- HCS SCI conduit shows better put latency than MPI on SCI for sizes > 64 bytes and is very close to MPI on SCI for smaller messages; its get latency is slightly higher than MPI on SCI
- GM and SCI provide about the same throughput, with the HCS SCI conduit achieving slightly higher bandwidth for the largest message sizes
- Quick look at estimated total cost to support 8 nodes of these interconnect architectures:
  - SCI: ~$8,700; Myrinet: ~$9,200; InfiniBand: ~$12,300; Elan3: ~$18,000 (based on Elan4 pricing structure, which is slightly higher)

* via testbed made available courtesy of Michigan Tech
GASNet Extended Level Latency

[Chart: round-trip latency (usec, 0-40) vs. message size (1 byte to 1K bytes) for GM, Elan, VAPI, MPI SCI, and HCS SCI puts and gets.]
GASNet Extended Level Throughput

[Chart: throughput (MB/s, 0-800) vs. message size (128 bytes to 256K bytes) for GM, Elan, VAPI, MPI SCI, and HCS SCI puts and gets.]
Matisse IP-Based Networks

- Switch-based GigE network with DWDM backbone between switches for high scalability
- Product in alpha testing stage
- Experimental setup
  - Nodes: dual 2.4 GHz Intel Xeons, 1GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset
  - Setup: 1 switch (all nodes connected to one switch); 2 switches (half of the nodes connected to each switch, with either short (1 km) or long (12.5 km) fiber between the switches)
- Tests
  - Low level: Pallas benchmark, ping-pong and send-recv
  - GASNet level: testsmall

[Charts: Pallas PingPong and SendRecv (2 nodes) throughput (MB/sec) vs. message size (8 bytes to 4096K bytes) for 1 switch, 2 switches with short fiber, and 2 switches with long fiber.]
Matisse IP-Based Networks

GASNet put/get
- Latency for 2 switches with short/long fibers is constant: short, 250 us; long, 374 us
- Throughput is comparable with regular GigE
- Latency comparable with regular GigE (~255 us for all sizes)

[Charts: testsmall latency (us, 0-300) and throughput (kb/s, 0-8000) vs. message size (1 byte to 2048 bytes) for 1 switch, 2 switches with short fiber, 2 switches with long fiber, and regular GigE puts and gets.]
UPC function performance

A look at common shared-data operations
- Comparison between accesses to local data through regular and private pointers
- Block copies between shared and private memory: upc_memget, upc_memput
- Pointer conversion (shared local to private)
- Pointer addition (advancing a pointer to the next location)
- Loads and stores (to a single location, local and remote)

Block copies
- upc_memget and upc_memput translate directly into GASNet blocking get and put (even on local shared objects); see previous graph for results
- Marvel with the HP UPC compiler shows no appreciable difference between local and remote puts and gets and regular C operations
  - Steady increase from 0.27 to 1.83 µsec for sizes 2 to 8K bytes
  - Difference of < 0.5 µsec for remote operations
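In source form these bulk copies are one call each. The following sketch is our own illustration (not taken from the benchmark code) and needs a UPC compiler:

```c
#include <upc_relaxed.h>

#define N 256
/* blocked layout: thread t has affinity to src[t*N .. t*N + N-1] */
shared [N] int src[N * THREADS];
int scratch[N];   /* ordinary private memory on each thread */

int main(void) {
    /* bulk copy from a (possibly remote) shared block into private memory */
    upc_memget(scratch, &src[((MYTHREAD + 1) % THREADS) * N],
               N * sizeof(int));
    /* and back from private memory into this thread's own shared block */
    upc_memput(&src[MYTHREAD * N], scratch, N * sizeof(int));
    upc_barrier;
    return 0;
}
```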
UPC function performance

Pointer operations
- Cast local shared to private: all BUPC conduits ~2 ns; Marvel needed ~90 ns
- Pointer addition (below): shared-pointer manipulation about an order of magnitude greater than private

[Chart: pointer-addition execution time (usec, 0-0.07) for private and shared pointers on MPI-SCI, MPI GigE, Elan, GM, VAPI, and Marvel.]
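The cheap local-shared-to-private cast measured above looks like this in source (our own sketch, requiring a UPC compiler):

```c
#include <upc_relaxed.h>

shared int data[THREADS];   /* one element with affinity to each thread */

int main(void) {
    shared int *sp = &data[MYTHREAD];  /* "fat" pointer-to-shared */
    if (upc_threadof(sp) == MYTHREAD) {
        /* casting a pointer-to-shared with local affinity down to an
           ordinary C pointer strips the fat-pointer bookkeeping, so
           later dereferences cost the same as private accesses */
        int *p = (int *) sp;
        *p = MYTHREAD;
    }
    upc_barrier;
    return 0;
}
```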
UPC function performance

Loads and stores with pointers (not bulk); data local to the calling node
- "Pvt Shared" are private pointers to the local shared space
- MPI on GigE shared store takes 2 orders of magnitude longer, therefore not shown
- Marvel shared loads and stores about twice an order of magnitude greater than private

[Chart: execution time (usec, 0-0.1) for private, shared, and private-to-shared stores and loads on MPI SCI, MPI GigE, Elan, GM, VAPI, and Marvel.]
UPC function performance

Loads and stores with pointers (not bulk); data remote to the calling node
- Note: MPI GigE showed a time of ~450 µsec for loads and ~500 µsec for stores (not shown)
- Marvel remote access through pointers is the same as with local shared, two orders of magnitude less than Elan

[Chart: remote store and load execution time (usec, 0-30) for MPI-SCI, Elan, GM, VAPI, and Marvel.]
UPC Benchmark - IS from NAS Benchmark

- IS (Integer Sort, Class A): lots of fine-grain communication, low computation
- Poor performance in the GASNet communication system does not necessarily indicate poor performance in a UPC application

[Chart: IS execution time (sec, 0-30) for 1, 2, 4, and 8 threads on GM, Elan, GigE mpi, VAPI (Xeon), SCI mpi, SCI, and Marvel.]
UPC Benchmarks – FT from NAS Benchmarks*

- FT (3-D Fast Fourier Transform, Class A): medium communication, high computation
- Used optimized version 01 (private pointers to local shared memory)
- SCI conduit unable to run due to driver limitation (size constraint)
- High-bandwidth networks perform best (VAPI followed by Elan)
  - VAPI conduit allows a cluster of Xeons to keep pace with Marvel's performance
- MPI on GigE not well suited for these types of problems; its high-latency, low-bandwidth traits impede performance
- MPI on SCI has lower bandwidth than VAPI but still maintains near-linear speedup for more than 2 nodes (skirts TCP/IP overhead)
- GM performance a factor of processor speed (see 1 Thread)

[Chart: FT execution time (sec, 0-50) for 1, 2, 4, and 8 threads on GM, Elan, GigE mpi, VAPI, SCI mpi, and Marvel.]

* Using code developed at GWU
UPC Benchmark - DES Differential Attack Simulator

- S-DES (8-bit key) cipher (integer-based)
- Creates basic components used in differential cryptanalysis: S-Boxes, Difference Pair Tables (DPT), and Differential Distribution Tables (DDT)
- Bandwidth-intensive application
- Designed for a high cache-miss rate, so very costly in terms of memory access

[Chart: DES execution time (msec, 0-3500) for sequential, 1, 2, and 4 threads on GM, Elan, GigE mpi, VAPI (Xeon), SCI mpi, SCI, and Marvel.]
UPC Benchmark - DES Analysis

- With increasing number of nodes, bandwidth and NIC response time become more important; interconnects with high bandwidth and fast response times perform best
- Marvel shows near-perfect linear speedup, but processing time of integers is an issue
- VAPI shows constant speedup
- Elan shows near-linear speedup from 1 to 2 nodes, but more nodes are needed in the testbed for better analysis
- GM does not begin to show any speedup until 4 nodes, then minimal
- SCI conduit performs well for high-bandwidth programs but has the same speedup problem as GM
- MPI conduit clearly inadequate for high-bandwidth programs
UPC Benchmark - Differential Cryptanalysis for CAMEL Cipher

- Uses 1024-bit S-Boxes
- Given a key, encrypts data, then tries to guess the key solely from the encrypted data using a differential attack
- Has three main phases:
  - Compute optimal difference pair based on S-Box (not very CPU-intensive)
  - Perform main differential attack (extremely CPU-intensive): gets a list of candidate keys and checks all candidate keys using brute force in combination with the optimal difference pair computed earlier
  - Analyze data from differential attack (not very CPU-intensive)
- Computationally intensive (independent processes) + several synchronization points

[Chart: CAMEL execution time (s, 0-250) for 1, 2, 4, 8, and 16 threads on SCI (Xeon), VAPI (Opteron), and Marvel. Parameters: MAINKEYLOOP = 256, NUMPAIRS = 400,000, initial key 12345.]
UPC Benchmark - CAMEL Analysis

- Marvel
  - Attained almost perfect speedup
  - Synchronization cost very low
- Berkeley UPC
  - Speedup decreases with increasing number of threads; cost of synchronization increases with the number of threads
  - Run time varied greatly as the number of threads increased, making it hard to get consistent timing readings
  - Still decent performance at 32 threads (76.25% efficiency, VAPI)
  - Performance is more sensitive to data affinity
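The efficiency figure quoted above follows the usual definitions of parallel speedup and efficiency; as a quick reference (the timings exercised in the test are made-up round numbers, not the measured CAMEL results):

```c
/* Parallel scaling metrics:
 *   speedup(n)    = T(1) / T(n)
 *   efficiency(n) = speedup(n) / n    (1.0 = perfect linear scaling) */
static double speedup(double t1, double tn) {
    return t1 / tn;
}

static double efficiency(double t1, double tn, int n) {
    return speedup(t1, tn) / n;
}
```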
Architectural Performance Tests

Theme: preliminary study of tradeoffs in available processor architectures, since their performance will clearly affect computation, communication, and synchronization in UPC clusters.

- Intel Pentium 4 Xeon features
  - 32-bit processor
  - Hyper-Threading technology for increased CPU utilization
  - Intel NetBurst microarchitecture with RISC processor core
  - 4.3 GB/s I/O bandwidth
- AMD Opteron features
  - 32-bit/64-bit processor with real-time support of 32-bit OSes
  - On-chip memory controllers; eliminates the 4 GB memory barrier imposed by 32-bit systems
  - 19.2 GB/s I/O bandwidth per processor
- Intel Itanium II features
  - 64-bit processor based on the EPIC architecture
  - 3-level cache design
  - Enhanced Machine Check Architecture (MCA) with extensive Error Correcting Code (ECC)
  - 6.4 GB/s I/O bandwidth
CPU Performance Results

- AIM 9, 10 iterations using 5MB files, testing sequential and random reads, writes, and copies
- Itanium2 slightly above Opteron in both reads and writes, except for random writes where Opteron has a slight advantage
- Both Itanium2 and Opteron outperform Xeon by a wide margin in all cases except sequential reads
- Xeon sequential reads are comparable to Opteron, but Itanium2 is much higher than both
- Major performance gain from sequential reads compared to random, but sequential writes do not receive nearly as large a boost
- Computation benchmarks excluded due to compiler problems with Itanium2

[Chart: throughput (MB/s, 0-1000) for random reads, random writes, sequential reads, sequential writes, and disk copies on Itanium2, Opteron, and Xeon.]
10 Gigabit Ethernet – Preliminary Results

Testbed
- Nodes: each with dual 2.4 GHz Xeons, S2io Xframe 10GigE card in PCI-X 100, 1GB PC2100 DDR RAM, Intel PRO/1000 1GigE, RedHat 9.0 kernel 2.4.20-8smp, LAM-MPI V7.0.3

Analysis
- 10GigE is promising due to the expected economy-of-scale advantages of Ethernet
- S2io 10GigE shows impressive throughput, though slightly less than half of the theoretical maximum; further tuning is needed to go higher
- Results show a much-needed decrease in latency versus other Ethernet options

[Charts: round-trip latency (usec, 0-200) vs. message size (0-4K bytes) and throughput (MB/s, 0-450) vs. message size (64 bytes to 64K bytes), 10GigE vs. GigE.]
Conclusions

Key insights
- HCS SCI conduit shows promise
  - Performance on par with other conduits
  - Ongoing collaboration with vendor (Dolphin) to resolve the memory-constraint issue
- Berkeley UPC system a promising COTS cluster tool
  - Performance on par with HP UPC (also see [6])
  - Performance of COTS clusters matches and sometimes beats performance of high-end CC-NUMA
  - Various conduits allow UPC to execute on many interconnects; VAPI and Elan initially found to be strongest
  - Some open issues with bugs and optimization; active bug reports and the development team help improvements
  - Very good solution for clusters to execute UPC, but may not quite be ready for production use: no debugging or performance tools available
- Xeon cluster suitable for applications with a high read/write ratio
- Opteron cluster suitable for generic applications due to comparable read/write capability
- Itanium2 excellent for sequential reads, about the same as Opteron for everything else
- 10GigE provides high bandwidth with much lower latencies than 1GigE

Key accomplishments to date
- Baselining of UPC on shared-memory multiprocessors
- Evaluation of promising tools for UPC on clusters
- Leveraging and extension of communication and UPC layers
- Conceptual design of new tools for UPC
- Preliminary network and system performance analyses for UPC systems
- Completion of optimized GASNet SCI conduit for UPC
References

1. D. Bonachea, "GASNet Specification, v1.1," U.C. Berkeley Tech Report UCB/CSD-02-1207, October 2002.
2. C. Bell, D. Bonachea, "A New DMA Registration Strategy for Pinning-Based High Performance Networks," Workshop on Communication Architecture for Clusters (CAC'03), April 2003.
3. W. Carlson, J. Draper, D. Culler, K. Yelick, E. Brooks, K. Warren, "Introduction to UPC and Language Specification," IDA Center for Computing Sciences, Tech. Report CCS-TR-99-157, May 1999.
4. K. A. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. N. Hilfinger, S. L. Graham, D. Gay, P. Colella, and A. Aiken, "Titanium: A High-Performance Java Dialect," Concurrency: Practice and Experience, Vol. 10, No. 11-13, September-November 1998.
5. B. Gordon, S. Oral, G. Li, H. Su, and A. George, "Performance Analysis of HP AlphaServer ES80 vs. SAN-based Clusters," 22nd IEEE International Performance, Computing, and Communications Conference (IPCCC), April 2003.
6. W. Chen, D. Bonachea, J. Duell, P. Husbands, C. Iancu, K. Yelick, "A Performance Analysis of the Berkeley UPC Compiler," 17th Annual International Conference on Supercomputing (ICS), June 2003.