The x86 Server Platform
… Resistance is futile …
Dec 6, 2004
Server shipments – Total vs x86
Market Share: Servers, United States, 2Q04
United States: Vendor Revenue by Operating System (Millions of Dollars)
OS        2Q03     3Q03     4Q03     1Q04     2Q04   Share 2Q03  Share 2Q04  Growth 2Q03-2Q04  Growth 1Q04-2Q04
Windows  1,534.1  1,692.3  1,671.6  1,645.6  1,665.5    34.79%      36.18%              8.6%              1.2%
Unix     1,622.6  1,474.6  1,554.1  1,374.2  1,471.9    36.79%      31.98%             -9.3%              7.1%
Others     820.2    823.7  1,142.4    897.2    852.6    18.60%      18.52%              3.9%             -5.0%
Linux      433.2    497.3    552.5    555.0    613.2     9.82%      13.32%             41.5%             10.5%
Total    4,410.2  4,487.9  4,920.7  4,472.1  4,603.1   100.00%     100.00%              4.4%              2.9%
Michael McLaughlin, Market Share: Servers, United States, 2Q04 7 October 2004, Gartner
x86 Platform CPUs
Intel
• Xeon MP – Gallatin (future is Potomac)
• Xeon SP/DP – Nocona (EM64T)
• Itanium II MP – Madison (future is Montecito)
AMD
• Opteron
Gallatin – MP
• 130 nm
• 3 GHz
• 4 MB L3 Cache
• FSB – 400 MHz
ES7000 – 32 Gallatins
Nocona – Single Processor with EM64T
• 90 nm
• Clock Speed – 3.2-3.6 GHz
• L3 – 4 MB
• FSB – 800 MHz
Itanium II – Madison
• 130 nm
• 9 MB L3 cache
• 1.6 GHz
• FSB – 400 MHz
Why Multi-Core? … And while we're at it, why Multi-Threading?
It’s all about the balance of
• Silicon real estate
• Compiler technology
• Cost
• Power
… to meet the constant pressure to double performance every 18 months
Memory Latency vs CPU Speed
[Figure: microprocessor on-chip clock frequency (GHz) and commodity DRAM access frequency ((10⁻⁹ s)⁻¹) on log scales from 0.01 to 10.0, by production year 1990–2010.]
Processor Architecture
When latency → 0 and bandwidth → ∞, we would have the perfect CPU
A great deal of innovation has centered on approximating this perfect world:
• CISC
• CPU Cache
• RISC
• EPIC
• Multi-Threading
• Multiple Cores
Complex Instruction Set Computer
Hardware implements assembler instructions
MULT A, B
• Hardware loads the registers, multiplies, and stores the result
• Multiple clocks are needed per instruction
RAM requirements are relatively small
Compilers translate high-level languages down to assembler instructions (the Von Neumann model)
http://www.hardwarecentral.com/hardwarecentral/tutorials/2427
CPU Cache
When CPU speeds started to increase, memory latency emerged as a bottleneck
CPU caches were used to keep local references “close” to the CPU
For SMP systems, memory banks were more than a clock away
• It is not uncommon today to find 3 orders of magnitude between the fastest and slowest memory latency
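That spread between fast and slow memory can be made concrete with the standard average memory access time (AMAT) model. The cache levels and cycle counts below are illustrative assumptions, not figures from the slides:

```python
# Average Memory Access Time (AMAT) for a two-level cache hierarchy:
# AMAT = L1_hit + L1_miss_rate * (L2_hit + L2_miss_rate * DRAM_latency)
# All latencies are in CPU cycles; the numbers are hypothetical.

def amat(l1_hit, l1_miss, l2_hit, l2_miss, dram):
    return l1_hit + l1_miss * (l2_hit + l2_miss * dram)

# Hypothetical machine: 1-cycle L1, 5% L1 misses, 10-cycle L2,
# 10% L2 misses, 200-cycle DRAM.
cycles = amat(l1_hit=1, l1_miss=0.05, l2_hit=10, l2_miss=0.10, dram=200)
print(cycles)  # 2.5 cycles on average, despite 200-cycle DRAM
```

Keeping references "close" to the CPU is what lets the average stay near the L1 latency even when the slowest tier is hundreds of times slower.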
Reduced Instruction Set Computer
Hardware is simplified – fewer transistors are needed for the full instruction set
RAM requirements are higher to store intermediate results and more code
Compilers are more complex
Clock speeds increase because instructions are simpler
Deterministic, simple instructions allow pipelining
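The MULT A, B example from the CISC slide can be contrasted with its RISC-style decomposition. This toy register-machine sketch (a dict standing in for memory) is purely illustrative:

```python
# CISC-style: one memory-to-memory instruction hides several micro-steps.
def mult_cisc(mem, a, b):
    mem[a] = mem[a] * mem[b]   # MULT A, B (load, multiply, store in one op)

# RISC-style: the same work as explicit load/operate/store instructions.
def mult_risc(mem, a, b):
    r1 = mem[a]        # LOAD  r1, A
    r2 = mem[b]        # LOAD  r2, B
    r1 = r1 * r2       # MUL   r1, r2
    mem[a] = r1        # STORE r1, A

m1 = {"A": 6, "B": 7}
m2 = {"A": 6, "B": 7}
mult_cisc(m1, "A", "B")
mult_risc(m2, "A", "B")
print(m1["A"], m2["A"])  # 42 42 – same result, different instruction granularity
```

The RISC version needs more instructions (and more RAM to hold them), but each one is simple and deterministic enough to pipeline.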
Pipelining
Higher Clock Speeds!
[Figure: functional-unit utilization with and without pipelining – 25% busy unpipelined vs stages running 100%, 80%, 60%, and 40% busy.]
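The throughput gain from pipelining can be sketched with the classic textbook model; the stage count and instruction count below are arbitrary examples:

```python
# Ideal pipelining model: k instructions through an n-stage pipeline.
# Unpipelined: every instruction occupies all n stages serially -> k * n cycles.
# Pipelined: the pipe fills in n cycles, then retires one instruction per cycle.
def speedup(n_stages, k_instructions):
    unpipelined = k_instructions * n_stages
    pipelined = n_stages + (k_instructions - 1)
    return unpipelined / pipelined

print(speedup(4, 100))   # ~3.88: approaches the stage count, 4
print(speedup(4, 10**6)) # ~4.0 for long instruction streams
```

The speedup approaches the number of stages, which is the incentive for ever-deeper pipelines (and, as the next slide shows, the reason branch misses get expensive).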
Branch Prediction
While processing in parallel, branches occur
Branch prediction is used to increase the probability that a specific branch will be followed
If incorrect, the pipeline is “dead” and the CPU stalls
Statistics
• 10%-20% of instructions are branches
• Predictions are incorrect about 10% of the time
As the pipeline deepens, the probability of a miss increases and more cycles will be discarded
• 80-deep pipeline / 20% branches / 10% miss => ~80% chance of a miss in flight and a penalty of up to 80 cycles
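The 80% figure follows from treating each in-flight instruction independently; a quick check:

```python
# Probability that at least one instruction in an 80-deep pipeline is a
# mispredicted branch, assuming independence: each slot is a branch 20%
# of the time, and a branch is mispredicted 10% of the time.
depth, branch_rate, miss_rate = 80, 0.20, 0.10

p_bad_slot = branch_rate * miss_rate          # 2% per instruction
p_flush = 1 - (1 - p_bad_slot) ** depth       # at least one miss in flight
print(round(p_flush, 2))  # 0.8 – matching the slide's ~80% figure
```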
Itanium II EPIC Instruction Set
Explicitly Parallel Instruction Computing
Compiler can indicate code that can be executed in parallel
Both branches are pipelined
• No lost cycles due to misprediction
Pipeline can be deeper
Complexity continues to move into the compiler
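The idea of computing both outcomes and selecting by a predicate instead of branching can be mimicked in software with a branch-free select. This bit-mask trick is only an analogy for what EPIC hardware does with predicate registers:

```python
# Branchy absolute value: the CPU must predict the "x < 0" branch.
def abs_branchy(x):
    if x < 0:
        return -x
    return x

# "Predicated" absolute value: no data-dependent branch.
# mask is -1 (all ones) when x < 0, else 0; (x ^ mask) - mask
# negates x exactly when the predicate holds.
def abs_predicated(x):
    mask = -(x < 0)          # predicate materialized as a bit mask
    return (x ^ mask) - mask

print(abs_predicated(-5), abs_predicated(7))  # 5 7
```

Both paths do the work; the predicate merely selects the result, so there is nothing to mispredict and nothing to flush.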
Multi-Threading
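A minimal way to see why multi-threading helps: model a core that stalls on memory, and let extra hardware threads fill the stall cycles. The busy fraction below is a made-up number and the perfect-overlap assumption is an idealization:

```python
# Simple SMT model: a single thread keeps the core busy only
# busy_fraction of the time, stalling on memory the rest.
# With n hardware threads to switch between, utilization rises
# toward 1.0 (assuming stalls overlap perfectly).
def core_utilization(busy_fraction, n_threads):
    return min(1.0, n_threads * busy_fraction)

print(core_utilization(0.4, 1))  # 0.4 – one thread, 60% of cycles stalled
print(core_utilization(0.4, 2))  # 0.8 – a second thread fills most stalls
print(core_utilization(0.4, 3))  # 1.0 – core saturated
```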
Multiple Cores
Fabrication sizes continue to diminish
The additional real estate has been used to put more and more memory on the die
Multi-core technology provides a new way to exploit the additional space
The clock rates cannot continue to climb due to the excessive heat
• P = C · V² · f  (C – switch capacitance, V – supply voltage, f – clock frequency)
Multiple cores are the next step to providing faster execution times for applications
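The power relation above explains the conclusion: since supply voltage tends to scale roughly with frequency, dynamic power grows roughly cubically with clock speed, so two slower cores can out-perform one fast core per watt. The scaling factors below are illustrative assumptions, not vendor data:

```python
# Dynamic power: P = C * V^2 * f. Assume (roughly) that V scales with f,
# so P ~ f^3. Compare one core at full clock against two cores at 80%
# clock and 80% voltage (normalized units, illustrative numbers only).
def power(c, v, f):
    return c * v**2 * f

single = power(c=1.0, v=1.0, f=1.0)      # 1.0 unit of power
dual = 2 * power(c=1.0, v=0.8, f=0.8)    # 2 * 0.512 = ~1.02 units
print(single, dual)  # nearly the same power budget...
print(2 * 0.8)       # ...for up to 1.6x the aggregate throughput
```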
(End of 2005?)
AMD Opteron 800 Series
• 130 nm
• Clock Speed – 1.4-2.4 GHz
• L2 – 1 MB
• 6.4 GB/s HyperTransport
Architectural Comparison
[Figure: four Opterons, each attached to DDR 144-bit memory, linked to one another and to the I/O hub, PCI-X bridges, and other bridges over 6.4 GB/s HyperTransport™; four Xeons on a shared front-side bus into an SNC with memory address buffers, an I/O hub, and PCI-X bridges, also at 6.4 GB/s.]
Mapping Workloads onto Architecture
Consider a dichotomy of workloads:
• Large Memory Model – This needs a large, single system image and a large amount of coherent memory
- Database apps - SQL Server / Oracle
- Business Intelligence – Data Warehousing + Analytics
- Memory-resident databases
- 64 bit architectures allow memory addressability above 1 TB
• Small/Medium Memory Model – This can be cost-effective in workloads that do not require extensive shared memory/state
- Stateless Applications and Web Services
- Web Servers
- Clusters of systems for parallelized applications and grids
Large Server Vendors
Intel Announcement (Nov 19): Otellini said product development, marketing and software efforts (for Itanium) will all now be aimed at "greater than four-way systems". He also said, "The mainframe isn't dead. That's where I'd like to push Itanium over time."
The size of the SMP is affected by Intel’s chip set support for coherent memory
OEM Vendors (Unisys, HP, SGI, Fujitsu, IBM)
• Each has a unique "chip set" to build basic four-ways into large SMP systems
• IBM has Power5, which is a direct competitor
Intel 32-bit and EM64T
• This could emerge as the flagship product
Where Are We Going?
Since the early CISC computers, we have moved more and more of the complexity out to the compiler to achieve parallelism and fully exploit the silicon "real estate"
The power requirements, along with the smaller fabrication sizes, have pushed the CPU vendors to exploit multiple cores
The key to performance for these future machines will be the application’s ability to exploit parallelism