The x86 Server Platform
… Resistance is futile …
Dec 6, 2004
Server shipments – Total vs x86
Market Share: Servers, United States, 2Q04
United States: Vendor Revenue by Operating System (Millions of Dollars)
OS        2Q03     3Q03     4Q03     1Q04     2Q04   Share 2Q03  Share 2Q04  Growth 2Q03-2Q04  Growth 1Q04-2Q04
Windows  1,534.1  1,692.3  1,671.6  1,645.6  1,665.5    34.79%      36.18%              8.6%              1.2%
Unix     1,622.6  1,474.6  1,554.1  1,374.2  1,471.9    36.79%      31.98%             -9.3%              7.1%
Others     820.2    823.7  1,142.4    897.2    852.6    18.60%      18.52%              3.9%             -5.0%
Linux      433.2    497.3    552.5    555.0    613.2     9.82%      13.32%             41.5%             10.5%
Total    4,410.2  4,487.9  4,920.7  4,472.1  4,603.1   100.00%     100.00%              4.4%              2.9%
Michael McLaughlin, Market Share: Servers, United States, 2Q04 7 October 2004, Gartner
x86 Platform CPUs
Intel
• Xeon MP – Gallatin (future is Potomac)
• Xeon SP/DP – Nocona (EM64T)
• Itanium II MP – Madison (future is Montecito)
AMD
• Opteron
Gallatin – MP
• 130 nm
• 3 GHz
• 4 MB L3 Cache
• FSB – 400 MHz
ES7000 – 32 Gallatins
Nocona – Single Processor with EM64T
• 90 nm
• Clock Speed – 3.2-3.6 GHz
• L3 – 4 MB
• FSB – 800 MHz
Itanium II – Madison
• 130 nm
• 9 MB L3 cache
• 1.6 GHz
• FSB – 400 MHz
Why Multi-Core? … And while we're at it, why Multi-Threading?
It’s all about the balance of
• Silicon real estate
• Compiler technology
• Cost
• Power
… to meet the constant pressure to double performance every 18 months
Memory Latency vs CPU Speed
[Figure: microprocessor on-chip clock frequency (GHz) and commodity DRAM access frequency ((10⁻⁹ s)⁻¹) on log scales from 0.01 to 10.0, by production year 1990–2010.]
Processor Architecture
When latency → 0 and bandwidth → ∞, we would have the perfect CPU
A great deal of innovation has centered on approximating this perfect world:
• CISC
• CPU Cache
• RISC
• EPIC
• Multi-Threading
• Multiple Cores
Complex Instruction Set Computer
Hardware implements assembler instructions
MULT A, B
• Hardware loads the registers, multiplies, and stores the result
• Multiple clocks are needed per instruction
RAM requirements are relatively small
Compilers translate high-level languages down to assembler instructions (the Von Neumann model)
http://www.hardwarecentral.com/hardwarecentral/tutorials/2427
CPU Cache
When CPU speeds started to increase, memory latency emerged as a bottleneck
CPU caches were used to keep local references “close” to the CPU
For SMP systems, memory banks were more than a clock away
• It is not uncommon today to find 3 orders of magnitude between the fastest and slowest memory latency
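That spread between fast and slow memory can be made concrete with the standard average memory access time (AMAT) model. The cache levels and cycle counts below are illustrative assumptions, not figures from the slides:

```python
# Average Memory Access Time (AMAT) for a two-level cache hierarchy:
# AMAT = L1_hit + L1_miss_rate * (L2_hit + L2_miss_rate * DRAM_latency)
# All latencies are in CPU cycles; the numbers are hypothetical.

def amat(l1_hit, l1_miss, l2_hit, l2_miss, dram):
    return l1_hit + l1_miss * (l2_hit + l2_miss * dram)

# Hypothetical machine: 1-cycle L1, 5% L1 misses, 10-cycle L2,
# 10% L2 misses, 200-cycle DRAM.
cycles = amat(l1_hit=1, l1_miss=0.05, l2_hit=10, l2_miss=0.10, dram=200)
print(cycles)  # 2.5 cycles on average, despite 200-cycle DRAM
```

Keeping references "close" to the CPU is what lets the average stay near the L1 latency even when the slowest tier is hundreds of times slower.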
Reduced Instruction Set Computer
Hardware is simplified – fewer transistors are needed for the full instruction set
RAM requirements are higher to store intermediate results and more code
Compilers are more complex
Clock speeds increase because instructions are simpler
Deterministic, simple instructions allow pipelining
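The MULT A, B example from the CISC slide can be contrasted with its RISC-style decomposition. This toy register-machine sketch (a dict standing in for memory) is purely illustrative:

```python
# CISC-style: one memory-to-memory instruction hides several micro-steps.
def mult_cisc(mem, a, b):
    mem[a] = mem[a] * mem[b]   # MULT A, B (load, multiply, store in one op)

# RISC-style: the same work as explicit load/operate/store instructions.
def mult_risc(mem, a, b):
    r1 = mem[a]        # LOAD  r1, A
    r2 = mem[b]        # LOAD  r2, B
    r1 = r1 * r2       # MUL   r1, r2
    mem[a] = r1        # STORE r1, A

m1 = {"A": 6, "B": 7}
m2 = {"A": 6, "B": 7}
mult_cisc(m1, "A", "B")
mult_risc(m2, "A", "B")
print(m1["A"], m2["A"])  # 42 42 – same result, different instruction granularity
```

The RISC version needs more instructions (and more RAM to hold them), but each one is simple and deterministic enough to pipeline.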
Pipelining
Higher Clock Speeds!
[Figure: functional-unit utilization with and without pipelining – 25% busy unpipelined vs stages running 100%, 80%, 60%, and 40% busy.]
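The throughput gain from pipelining can be sketched with the classic textbook model; the stage count and instruction count below are arbitrary examples:

```python
# Ideal pipelining model: k instructions through an n-stage pipeline.
# Unpipelined: every instruction occupies all n stages serially -> k * n cycles.
# Pipelined: the pipe fills in n cycles, then retires one instruction per cycle.
def speedup(n_stages, k_instructions):
    unpipelined = k_instructions * n_stages
    pipelined = n_stages + (k_instructions - 1)
    return unpipelined / pipelined

print(speedup(4, 100))   # ~3.88: approaches the stage count, 4
print(speedup(4, 10**6)) # ~4.0 for long instruction streams
```

The speedup approaches the number of stages, which is the incentive for ever-deeper pipelines (and, as the next slide shows, the reason branch misses get expensive).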
Branch Prediction
While processing in parallel, branches occur
Branch prediction is used to increase the probability that a specific branch will be followed
If incorrect, the pipeline is “dead” and the CPU stalls
Statistics
• 10%-20% of instructions are branches
• Predictions are incorrect about 10% of the time
As the pipeline deepens, the probability of a miss increases and more cycles will be discarded
• 80-deep pipeline / 20% branches / 10% miss => ~80% chance of a miss in flight and a penalty of up to 80 cycles
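The 80% figure follows from treating each in-flight instruction independently; a quick check:

```python
# Probability that at least one instruction in an 80-deep pipeline is a
# mispredicted branch, assuming independence: each slot is a branch 20%
# of the time, and a branch is mispredicted 10% of the time.
depth, branch_rate, miss_rate = 80, 0.20, 0.10

p_bad_slot = branch_rate * miss_rate          # 2% per instruction
p_flush = 1 - (1 - p_bad_slot) ** depth       # at least one miss in flight
print(round(p_flush, 2))  # 0.8 – matching the slide's ~80% figure
```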
Itanium II EPIC Instruction Set
Explicitly Parallel Instruction Computing
Compiler can indicate code that can be executed in parallel
Both branches are pipelined
• No lost cycles due to misprediction
Pipeline can be deeper
Complexity continues to move into the compiler
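The idea of computing both outcomes and selecting by a predicate instead of branching can be mimicked in software with a branch-free select. This bit-mask trick is only an analogy for what EPIC hardware does with predicate registers:

```python
# Branchy absolute value: the CPU must predict the "x < 0" branch.
def abs_branchy(x):
    if x < 0:
        return -x
    return x

# "Predicated" absolute value: no data-dependent branch.
# mask is -1 (all ones) when x < 0, else 0; (x ^ mask) - mask
# negates x exactly when the predicate holds.
def abs_predicated(x):
    mask = -(x < 0)          # predicate materialized as a bit mask
    return (x ^ mask) - mask

print(abs_predicated(-5), abs_predicated(7))  # 5 7
```

Both paths do the work; the predicate merely selects the result, so there is nothing to mispredict and nothing to flush.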
Multi-Threading
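A minimal way to see why multi-threading helps: model a core that stalls on memory, and let extra hardware threads fill the stall cycles. The busy fraction below is a made-up number and the perfect-overlap assumption is an idealization:

```python
# Simple SMT model: a single thread keeps the core busy only
# busy_fraction of the time, stalling on memory the rest.
# With n hardware threads to switch between, utilization rises
# toward 1.0 (assuming stalls overlap perfectly).
def core_utilization(busy_fraction, n_threads):
    return min(1.0, n_threads * busy_fraction)

print(core_utilization(0.4, 1))  # 0.4 – one thread, 60% of cycles stalled
print(core_utilization(0.4, 2))  # 0.8 – a second thread fills most stalls
print(core_utilization(0.4, 3))  # 1.0 – core saturated
```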
Multiple Cores
Fabrication sizes continue to diminish
The additional real estate has been used to put more and more memory on the die
Multi-core technology provides a new way to exploit the additional space
The clock rates cannot continue to climb due to the excessive heat
• P = C · V² · f  (C – switch capacitance, V – supply voltage, f – clock frequency)
Multiple cores are the next step to providing faster execution times for applications
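The power relation above explains the conclusion: since supply voltage tends to scale roughly with frequency, dynamic power grows roughly cubically with clock speed, so two slower cores can out-perform one fast core per watt. The scaling factors below are illustrative assumptions, not vendor data:

```python
# Dynamic power: P = C * V^2 * f. Assume (roughly) that V scales with f,
# so P ~ f^3. Compare one core at full clock against two cores at 80%
# clock and 80% voltage (normalized units, illustrative numbers only).
def power(c, v, f):
    return c * v**2 * f

single = power(c=1.0, v=1.0, f=1.0)      # 1.0 unit of power
dual = 2 * power(c=1.0, v=0.8, f=0.8)    # 2 * 0.512 = ~1.02 units
print(single, dual)  # nearly the same power budget...
print(2 * 0.8)       # ...for up to 1.6x the aggregate throughput
```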
(End of 2005?)
AMD Opteron 800 Series
• 130 nm
• Clock Speed – 1.4-2.4 GHz
• L2 – 1 MB
• 6.4 GB/s HyperTransport
Architectural Comparison
[Figure: four Opterons, each attached to DDR 144-bit memory, linked to one another and to the I/O hub, PCI-X bridges, and other bridges over 6.4 GB/s HyperTransport™; four Xeons on a shared front-side bus into an SNC with memory address buffers, an I/O hub, and PCI-X bridges, also at 6.4 GB/s.]
Mapping Workloads onto Architecture
Consider a dichotomy of workloads:
• Large Memory Model – This needs a large, single system image and a large amount of coherent memory
- Database apps - SQL Server / Oracle
- Business Intelligence – Data Warehousing + Analytics
- Memory-resident databases
- 64 bit architectures allow memory addressability above 1 TB
• Small/Medium Memory Model – This can be cost-effective in workloads that do not require extensive shared memory/state
- Stateless Applications and Web Services
- Web Servers
- Clusters of systems for parallelized applications and grids
Large Server Vendors
Intel Announcement (Nov 19): Otellini said product development, marketing and software efforts (for Itanium) will all now be aimed at "greater than four-way systems". He also said, "The mainframe isn't dead. That's where I'd like to push Itanium over time."
The size of the SMP is affected by Intel’s chip set support for coherent memory
OEM Vendors (Unisys, HP, SGI, Fujitsu, IBM)
• Each has a unique "chip set" to build basic four-ways into large SMP systems
• IBM has Power5, which is a direct competitor
Intel 32-bit and EM64T
• This could emerge as the flagship product
Where Are We Going?
Since the early CISC computers, we have moved more and more of the complexity out to the compiler to achieve parallelism and fully exploit the silicon "real estate"
The power requirements, along with the smaller fabrication sizes, have pushed the CPU vendors to exploit multiple cores
The key to performance for these future machines will be the application’s ability to exploit parallelism