Workshop on Commodity-Based Visualization Clusters
Learning From the Stanford/DOE Visualization Cluster
Mike Houston, Greg Humphreys, Randall Frank, Pat Hanrahan
Outline
Stanford’s current cluster
– Design decisions
– Performance evaluation
– Bottleneck evaluation
Cluster “Landscape”
– General classification
– Bottleneck evaluation
Stanford’s next cluster
– Design goals
– Research directions
Stanford/DOE Visualization Cluster
The Chromium Cluster
Cluster Configuration (Jan. 2000)
Cluster: 32 graphics nodes + 4 server nodes
Computer: Compaq SP750
– 2 processors (800 MHz PIII Xeon, 133 MHz FSB)
– i840 core logic (big issue for vis-clusters)
• Simultaneous fast graphics and networking
• Network: 64-bit, 66 MHz PCI
• Graphics: AGP-4x
– 256 MB memory
– 18 GB SCSI 160 disk (+ 3 × 36 GB on servers)
Graphics (Sept. 2002)
– 16 NVIDIA GeForce3 w/ DVI (64 MB)
– 16 NVIDIA GeForce4 Ti 4200 w/ DVI (128 MB)
Network
– Myrinet 64-bit, 66 MHz (LANai 7)
Graphics Evaluation
NVIDIA GeForce3
– 25 MTri/s triangle rate observed
– 680 MPix/s fill rate observed
NVIDIA GeForce4
– 60 MTri/s triangle rate observed
– 800 MPix/s fill rate observed
Read Pixels performance
– 35 MPix/s (140 MB/s) RGBA
– 22 MPix/s (87 MB/s) Depth
Draw Pixels performance
– 45 MPix/s (180 MB/s) RGBA
– 21 MPix/s (85 MB/s) Depth
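To put the Read Pixels and Draw Pixels rates in context, the sketch below estimates how long one full frame would take to move at each observed rate. The 1024x768 frame size is an assumption (borrowed from the tile size used for the display wall), MPix is taken as 10^6 pixels, and the function name is illustrative only.

```python
def readback_ms(width, height, mpix_per_s):
    """Milliseconds to move one width x height frame at a given pixel rate."""
    return width * height / (mpix_per_s * 1e6) * 1e3


# Observed rates from the slide, in MPix/s.
RATES = {
    "ReadPixels RGBA": 35,
    "ReadPixels depth": 22,
    "DrawPixels RGBA": 45,
    "DrawPixels depth": 21,
}

# The 1024x768 frame size is an assumption, not a figure from this slide.
for name, rate in RATES.items():
    ms = readback_ms(1024, 768, rate)
    print(f"{name}: {ms:.1f} ms per frame, at most {1000 / ms:.0f} frames/s")
```

Under these assumptions a single RGBA readback already costs a little over 22 ms, which suggests why read/draw shows up as a sort-last bottleneck.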
Network Evaluation
Myrinet LANai 7 PCI64A boards
– Theoretical Limit: 160 MB/s
– 142 MB/s observed peak under Linux
– ~100 MB/s observed sustained under Linux
ServerNet not chosen
– Driver support
– Large switching infrastructure required
Gigabit Ethernet
– Performance and scalability concerns
Myrinet Issues
Fairness: Clients starved of network resources
– Implemented credit scheme to minimize congestion
Lack of buffering in switching fabric
– Causes poor performance in high load conditions
– Open issue
[Figure: partitioned cluster vs. unpartitioned cluster]
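The slide does not describe the credit scheme in detail. As a rough illustration of the general idea of credit-based flow control (not the Stanford implementation), the sketch below lets a sender transmit only while it holds credits, where each credit stands for one free receive buffer. All class and method names are hypothetical.

```python
from collections import deque


class CreditedReceiver:
    """Receiver with a fixed number of buffer slots."""

    def __init__(self, buffer_slots):
        self.buffer_slots = buffer_slots
        self.queue = deque()

    def accept(self, packet):
        # With correct credit accounting this check never fires.
        assert len(self.queue) < self.buffer_slots, "receiver overrun"
        self.queue.append(packet)

    def consume(self, n):
        """Process up to n packets; return how many credits to hand back."""
        freed = min(n, len(self.queue))
        for _ in range(freed):
            self.queue.popleft()
        return freed


class CreditedSender:
    """Sender that may only transmit while it holds credits."""

    def __init__(self, initial_credits):
        self.credits = initial_credits  # one credit = one free receive buffer
        self.pending = deque()          # packets waiting for a credit

    def submit(self, packet):
        self.pending.append(packet)

    def grant(self, n):
        self.credits += n

    def drain(self, receiver):
        """Send as many pending packets as the current credits allow."""
        sent = 0
        while self.pending and self.credits > 0:
            receiver.accept(self.pending.popleft())
            self.credits -= 1
            sent += 1
        return sent


receiver = CreditedReceiver(buffer_slots=4)
sender = CreditedSender(initial_credits=4)
for i in range(10):
    sender.submit(i)
print(sender.drain(receiver))      # limited by the initial credits
sender.grant(receiver.consume(2))  # receiver frees two buffers
print(sender.drain(receiver))      # limited by the returned credits
```

Because a sender can never have more packets in flight than the receiver has buffers, one client cannot flood a server and starve the others, which is the congestion behaviour the slide is concerned with.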
i840 Chipset Evaluation
66 MHz 64-bit PCI performance not full speed:
– 210 MB/s PCI read (40% of theoretical peak)
– 288 MB/s PCI write (54% of theoretical peak)
– Combined read/write ~121 MB/s
AGP
– Fast Writes / Side Band Addressing unstable under Linux
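The percentages above can be checked against the theoretical peak of a 64-bit, 66 MHz PCI bus. The sketch below is a minimal calculation that assumes one transfer per clock and a nominal clock of 66.67 MHz (using exactly 66 MHz gives 528 MB/s instead); the function name is illustrative.

```python
def pci_peak_mb_s(bus_width_bits, clock_mhz):
    """Theoretical peak of a parallel PCI bus, assuming one transfer per clock."""
    return bus_width_bits / 8 * clock_mhz


PCI_CLOCK_MHZ = 66.67  # nominal "66 MHz" clock; an assumption
peak = pci_peak_mb_s(64, PCI_CLOCK_MHZ)
print(f"theoretical peak: {peak:.0f} MB/s")

# Observed i840 figures from the slide, in MB/s.
for label, observed in [("read", 210), ("write", 288), ("combined read/write", 121)]:
    print(f"{label}: {observed} MB/s is {observed / peak:.1%} of peak")
```

The computed fractions land within about a percentage point of the rounded values on the slide, and the combined read/write figure works out to under a quarter of the peak.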
Sort-First Performance
Configuration
– Application runs on client
– Primitives distributed to servers
Tiled Display
– 4x3 @ 1024x768
– Total resolution: 4096x2304, 9 Megapixel
Quake 3
– 50 fps
Atlantis
– 450 fps
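The display wall figures follow directly from the tile layout. As a small worked check (function name illustrative), the sketch below multiplies out the 4x3 grid of 1024x768 tiles and converts the frame rates into aggregate pixel rates.

```python
def wall_resolution(cols, rows, tile_w, tile_h):
    """Total width, height, and pixel count of a tiled display."""
    width, height = cols * tile_w, rows * tile_h
    return width, height, width * height


width, height, pixels = wall_resolution(4, 3, 1024, 768)
print(f"{width}x{height} = {pixels} pixels = {pixels / 2**20:.0f} x 2^20")

# Aggregate pixels displayed per second at the frame rates on the slide.
for app, fps in [("Quake 3", 50), ("Atlantis", 450)]:
    print(f"{app}: {pixels * fps / 1e6:.0f} MPix/s across the wall")
```

The "9 Megapixel" figure is exact when a megapixel is taken as 2^20 pixels; in decimal units the wall is about 9.4 million pixels.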
Sort-Last Performance
Configuration
– Parallel rendering on multiple nodes
– Composite to final display node
Volume Rendering on 16 nodes
– 1.57 GVox/s [Humphreys 02]
– 1.82 GVox/s (tuned) 9/02
– 256x256x1024 volume¹ rendered twice
¹ Data courtesy of G. A. Johnson, G. P. Cofer, S. L. Gewalt, and L. W. Hedlund from the Duke Center for In Vivo Microscopy (an NIH/NCRR National Resource)
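A voxel rate can be translated into an approximate frame rate for this dataset. The sketch below makes two assumptions that the slide does not state: that "rendered twice" doubles the voxels processed per frame, and that "G" means either 10^9 or 2^30 (both are shown). The function name is illustrative.

```python
def volume_fps(gvox_per_s, dims, passes=1, giga=1e9):
    """Frames per second implied by a voxel rate for a volume of the given size."""
    voxels_per_frame = dims[0] * dims[1] * dims[2] * passes
    return gvox_per_s * giga / voxels_per_frame


DIMS = (256, 256, 1024)  # volume size from the slide
PASSES = 2               # "rendered twice"; treated as 2x the voxels (assumption)

for rate in (1.57, 1.82):
    decimal = volume_fps(rate, DIMS, PASSES)              # giga = 10^9
    binary = volume_fps(rate, DIMS, PASSES, giga=2**30)   # giga = 2^30
    print(f"{rate} GVox/s: {decimal:.1f} fps (10^9) or {binary:.1f} fps (2^30)")
```

Under either reading the tuned rate corresponds to somewhere between about 13 and 15 frames per second for the doubled volume.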
Cluster Accomplishments
Development Platform
– WireGL
– Chromium
Cluster configuration replicated
Interactive Performance
– 256x512x1024 volume @ 15fps
– 9 Megapixel Quake3 @ 50fps
Sources of Bottlenecks
Sort-First
– Packing speed (processor)
– Primitive distribution (network and bus)
– Rendering (processor and graphics chip)
Sort-Last
– Rendering (graphics chip)
– Composite (network, bus, and read/draw pixels)
Bottleneck Evaluation – Stanford
Sort-First: Processor and Network
Sort-Last: Network and Read/Draw
[Chart: throughput (axis 0 to 1000) for Read/Draw, Network, Bus, Graphics, and Processor]
The Landscape of Graphics Clusters
Many Options
– Low End <$2500/node
– Mid End ~$5000/node
– High End >$7500/node
Tradeoffs
– Different bottlenecks
– Price/Performance
– Scalability
– Usage
Evaluation
– Based on published benchmarks and specs
Cluster Interconnect Options
Many choices
– GigE
• ~100 MB/s
– Myrinet 2000 (http://www.myrinet.com)
• 245 MB/s
– SCI/Dolphin (http://www.dolphinics.com)
• 326 MB/s
– Quadrics (http://www.quadrics.com)
• 340 MB/s
Future options
– 10 GigE
– Infiniband
– HyperTransport
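To compare the interconnects in rendering terms, the sketch below estimates how long one RGBA tile would take to cross each link at the quoted bandwidth. The 1024x768 tile size is an assumption, MB is taken as 10^6 bytes, and protocol overhead and latency are ignored, so these are best-case numbers.

```python
FRAME_BYTES = 1024 * 768 * 4  # one RGBA tile; the size is an assumption

# Bandwidths quoted on the slide, in MB/s (taken here as 10^6 bytes/s).
INTERCONNECTS = {
    "GigE": 100,
    "Myrinet 2000": 245,
    "SCI/Dolphin": 326,
    "Quadrics": 340,
}


def transfer_ms(n_bytes, mb_per_s):
    """Best-case milliseconds to move n_bytes at mb_per_s, ignoring latency."""
    return n_bytes / (mb_per_s * 1e6) * 1e3


for name, bandwidth in INTERCONNECTS.items():
    ms = transfer_ms(FRAME_BYTES, bandwidth)
    print(f"{name}: {ms:.1f} ms per tile, at most {1000 / ms:.0f} tiles/s")
```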
Low End
General Definition
– Single CPU
– Consumer Mainboard
– Integrated Graphics
– High Speed commodity network
Example Node Configuration
– Nvidia NForce2
– AMD Athlon 2400+
– 512 MB DDR
– GigE and 10/100
– 1U rack chassis
– Estimated Price: $1500
Bottleneck Evaluation – Low End
Bus/Network limited
[Chart: throughput (axis 0 to 1000) for Read/Draw, Network, Bus, Graphics, and Processor]
Mid End
General Definition
– Dual Processor
– “Workstation” mainboard
– High performance bus
• 64-bit PCI or PCI-X
– High Speed Commodity / Low end cluster interconnect
– High-End consumer graphics board
Example Node Configuration
– Intel i860
– Dual Intel P4 Xeon 2.4GHz
– 2GB RDRAM
– ATI Radeon 9700
– GigE onboard + Myrinet 2000
– 2U rack chassis
– Estimated Price: $4000
Bottleneck Evaluation – Mid End
Sort-First: Network limited
Sort-Last: Read/Draw and Network limited
[Chart: throughput (axis 0 to 1000) for Read/Draw, Network, Bus, Graphics, and Processor]
High End
General Definition
– Dual or Quad processor
– Cutting edge bus
• PCI-X, HyperTransport, PCI Enhanced
– High Speed Commodity / High end cluster interconnect
– “Professional” graphics board
– RAID system
Example Node Configuration
– ServerWorks GC-WS
– Dual P4 Xeon 2.6GHz
– Nvidia Quadro4 900XGL
– 4GB DDR
– GigE onboard + Infiniband
– Estimated Price: $7500
Bottleneck Evaluation – High End
Sort-First: Well balanced
Sort-Last: Read/Draw limited
[Chart: throughput (axis 0 to 1000) for Read/Draw, Network, Bus, Graphics, and Processor]
Balanced System is Key
Only as fast as slowest component
– Spend money where it matters!
[Chart: throughput (axis 0 to 1000) of Network, Bus, Graphics, and Processor for the High End, Mid End, Stanford, and Low End clusters]
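The "only as fast as slowest component" rule amounts to taking the minimum throughput over the stages of a pipeline. The sketch below applies it to a sort-last composite path using the measured Stanford figures quoted earlier in the deck (MB/s); combining them into one serial pipeline is an illustrative assumption, and these are not the values plotted in the chart.

```python
def bottleneck(stages):
    """Return (name, throughput) of the slowest stage in a serial pipeline."""
    name = min(stages, key=stages.get)
    return name, stages[name]


# Measured Stanford figures from earlier slides, in MB/s.
# Treating them as one serial pipeline is an assumption for illustration.
sort_last_composite = {
    "ReadPixels (RGBA)": 140,
    "PCI (combined read/write)": 121,
    "Network (Myrinet, sustained)": 100,
    "DrawPixels (RGBA)": 180,
}

name, rate = bottleneck(sort_last_composite)
print(f"limited by {name} at about {rate} MB/s")
```

With these inputs, upgrading anything other than the limiting stage leaves the end-to-end rate unchanged, which is the point of spending money where it matters.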
Goals for Next Cluster
Performance
– Sort-Last
• 5 GVox/s
• 1 GTri/s
– Sort-First at 4096x2304
• Quake3 @ >100fps
Research
– Remote visualization
– Time-varying datasets
– Compositing
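As a rough sense of scale for these goals, the sketch below divides each target across nodes and compares it with the figures reported for the current cluster. It assumes a 16-node cluster and perfect linear scaling, neither of which is stated on this slide as part of the goal, so the results are only an optimistic lower bound on what each node must deliver.

```python
NODES = 16  # assumed node count for the next cluster


def per_node(total, nodes=NODES):
    """Share of an aggregate rate each node must deliver under perfect scaling."""
    return total / nodes


# Goals from this slide versus figures reported earlier in the deck.
goal_gvox, current_gvox = 5.0, 1.82       # GVox/s (current: tuned, 16 nodes)
goal_mtri, geforce4_mtri = 1000.0, 60.0   # MTri/s (current: one GeForce4, observed)

print(f"volume rendering: {per_node(goal_gvox):.4f} GVox/s per node, "
      f"{goal_gvox / current_gvox:.2f}x the current aggregate rate")
print(f"triangles: {per_node(goal_mtri):.1f} MTri/s per node, "
      f"versus {geforce4_mtri:.0f} MTri/s observed on a GeForce4")
```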
What we plan to build
16 Node cluster, 1U nodes
Mainboard chipsets
– Intel Placer
– ServerWorks GC-WS
– AMD Hammer
Memory
– 2-4GB
Graphics Chip
– Nvidia NV30
– ATI R300/350
Interconnect
– Infiniband, Quadrics
Disk
– IDE RAID or SCSI
Continuing Chipset Issues
Why do chipsets perform so poorly?
– “Workstation”
• Intel i860: 215 MB/s read (40% of theoretical), 300 MB/s write (56% of theoretical)
• AMD 760MPX: 300 MB/s read (56% of theoretical), 312 MB/s write (59% of theoretical)
– “Server”
• ServerWorks ServerSet III LE: 423 MB/s read (79% of theoretical), 486 MB/s write (91% of theoretical)
Why can’t a “server” have an AGP slot?
Performance numbers from http://www.conservativecomputer.com
Ongoing Bottlenecks
Readback performance
– Will be fixed “soon”
– Hardware compositing?
Chipset Performance
– Achieve fraction of theoretical
– Need faster busses in commodity chipsets
Network Performance
– Scalability
– Fast is VERY expensive
Conclusions
What we still need
– More vendors
– More chipsets
– More performance
Graphics Clusters are getting better
– Chipsets
– Interconnects
– Form factor
– Processing
– Graphics Chips
Things are really starting to get interesting!