OpenSPARC T1 Processor DV Club – Silicon ValleyMay 23rd, 2006
Shrenik MehtaDirector, Frontend Technologies & OpenSPARC ProgramSystem GroupSun Microsystems, Inc.
www.opensparc.net
Design Verification Club – Silicon Valley www.opensparc.net 2
Agenda
1.New wave – Chip Multi-Threading (CMT)2.UltraSPARC T13.Participation Age – UltraSPARC T1
verification4.OpenSPARC™ community
Design Verification Club – Silicon Valley www.opensparc.net 3
Making the Right WavesIm
prov
ed P
rice/P
erfo
rman
ce
1980 20001990 2010
Chip Multi-threading(CoolThreadsTM)Symmetrical
Multi-processing (SMP)
Reduced Instruction Set Computing (RISC)
Design Verification Club – Silicon Valley www.opensparc.net 4
Attributes of Commercial Workloads
Attribute
ApplicationCategory
WebServer
Instruction-level Parallelism
Thread-level Parallelism
Instruction/DataWorking Set
Data Sharing
SAP 2T SAP 3T(DB)
DSS(TPC-H)
ServerJava
OLTP ERP ERP DSS
Low Low Low LowMedium High
High High High High High High
Large Large Large Medium Large Large
Low Medium High Medium High Medium
TIER1Web
(Web99)
TIER2App Serv
(JBB)
TIER3Data
(TPC-C)
Web Services Client Server Data Warehouse
Design Verification Club – Silicon Valley www.opensparc.net 5
Memory BottleneckRelative Performance
10000
11990 1995 2005 1980
1000
100
10
1985 2000
2x Every 6 Years
2x Every 2 Years
Gap
CPU FrequencyDRAM Speeds
Source: Sun World Wide Analyst Conference Feb. 25, 2003
Design Verification Club – Silicon Valley www.opensparc.net 6
Single Threaded Performance
Single Threading
Thread
Memory Latency Compute
Time
HURRYUP ANDWAIT!
C C C
Typical Processor Utilization:15–25%
M M M
Up to 85% Cycles Waiting for Memory
Design Verification Club – Silicon Valley www.opensparc.net 7
Single Threaded Performance Chip Multi-threaded
(CMT) Performance
The Power of CMT - CoolThreads
UltraSPARC T1 Processor Utilization: Up
to 85%
C MC MC MThread 1
Memory Latency ComputeTim
e
C MC MC M
C MC MC M
C MC MC M
Thread 2
Thread 3
Thread 4
Design Verification Club – Silicon Valley www.opensparc.net 8
Chip Multi-Threading (CMT) to the rescue
CMP (chip multiprocessing)
HMT (hardware multithreading)
CMT (chip
multithreading)
n cores per processor m strands per core n x m threads per processor
Design Verification Club – Silicon Valley www.opensparc.net 9
Why CMT Works
“Goal: 100% Resource Utilization”
20% MaximumCore Size
SPARC: 4 threads per core● Increases core die area by ~20%● Improves performance by ~50–100%
.05
1.0
10.0
2.0Multi-Thread, Single-Core
Multi-Thread, Multi-Core
Single-Thread, Single-Core
Rel
ativ
e Pe
rform
ance
on
thre
ad-
rich
mem
ory
boun
d w
orkl
oads
Copyright Sun Microsystems 2006, Sun Microsystems, Inc. All rights reserved. Used by permission.Page 10
Example - SpecJBB Execution Efficiency
0 4 8
SingleThreaded
Four Threaded
Idle Time
72%Efficiency
Cycles
3.79 cycles
1.56 cycles
Idle Time
21%Efficiency
11 + 3.79
=
44 + 1.56
=
Compute Pipeline Conflict Pipeline Latency Memory LatencyA. S. Leon et al., “A Power-Efficient High Throughput 32-Thread SPARC Processor,” ISSCC06, Paper 5.1
Design Verification Club – Silicon Valley www.opensparc.net 11
• SPARC V9 implementation• Up to eight 4-way multi-
threaded cores for up to32 simultaneous threads
• All cores connected through a 134.4GB/s crossbar switch
• High-bandwidth 12-way associative 3MB Level-2 cache on chip
• 4 DDR2 channels (23GB/s)• Power : < 80W • ~300M transistors • 378 sq. mm die
1 of 8 Cores BUS
C8C7C6C5C4C3C2C1
L2$L2$L2$L2$
Xbar
DDR-2SDRAM
DDR-2SDRAM
DDR-2SDRAM
DDR-2SDRAM
FPU
UltraSPARC T1 Processor
Sys I/FBuffer Switch
Core
Design Verification Club – Silicon Valley www.opensparc.net 12
CMT: On-chip = High Bandwidth
SwitchP P P P
M M M M
IO
SwitchP P P P
M M M M
IO
SwitchP P P P
M M M M
IO
SwitchP P P P
M M M M
IO
Niagara: 32 Threads
Direct crossbar interconnectLower cost, better RAS, lower BTUs,
lower and uniform latency,greater and uniform bandwidth. . .
PPPPPPPP
Mem Ctlr
Mem Ctlr
Mem Ctlr
Mem CtlrI/OSwitc
hTraditional SMP: 32 Threads
Example: Typical SMP Machine ConfigurationOne motherboard, no switch ASICs
Switch
SwitchP P P P
M M M M
IO
SwitchP P P P
M M M M
IO
SwitchP P P P
M M M M
IO
SwitchP P P P
M M M M
IO
Switch
XBar
L2L2L2L2
Design Verification Club – Silicon Valley www.opensparc.net 13
Single-Core Processor CMT Processor
(Not to Scale)
C1 C2 C3 C4
C5 C6 C7 C8
Faster Can Be Cooler
107C
102C
96C
91C
85C
80C
74C
69C
63C
58C
Design Verification Club – Silicon Valley www.opensparc.net 14
UltraSPARC-T1: Some Design Choices
• Simpler core architecture to maximize cores on die
• Caches, dram channels shared across cores give better area utilization
• Shared L2 decreases cost of coherence misses by an order of magnitude
• On die memory controllers reduce miss latency
• Crossbar good for b/w, latency, functional verification
• 378mm2 die in 90nm dissipating ~70W
www.opensparc.net Design Verification Club – Silicon Valley 15
UltraSPARC-T1 Processor Core● Four threads per core● Single issue 6 stage pipeline● 16KB I-Cache, 8KB D-Cache> Unique resources per thread
> Registers> Portions of I-fetch datapath> Store and Miss buffers
> Resources shared by 4 threads> Caches, TLBs, Execution Units> Pipeline registers and DP
● Core Area = 11mm2 in 90nm● MT adds ~20% area to core
IFU
EXU
MUL
TRAP
MMU LSU
Design Verification Club – Silicon Valley www.opensparc.net 16
SPARC Core Pipeline
Fetch Thrd Sel Decode Execute Memory WB
ICacheItlb
Instbuf x 4
DCacheDtlbStbuf x 4Decode
Regfile x4
Thread selects
ThrdSelMux
ThrdSelMux
PC logic x 4
Threadselect logic
Instruction typemissestraps & interruptsresource conflicts
CrossbarInterface
AluMulShftDiv
Modular Exponientation Unit
Design Verification Club – Silicon Valley www.opensparc.net 17
Thread Selection Policy● Switch between available threads every cycle giving
priority to least recently executed thread● Threads become unavailable due to:
● Long latency ops like loads, branch, mul, div.● Pipeline stalls such as cache misses, traps, and resource conflicts
● Loads are speculated as cache hits, and the thread is switched in with lower priority
Design Verification Club – Silicon Valley www.opensparc.net 18
CMT Benefits
Performance
Cost● Fewer servers● Less floor space● Reduced power consumption● Less air conditioning● Lower administration and
maintenance
Reliability
Design Verification Club – Silicon Valley www.opensparc.net 19
CMT Pays Off with CoolThreadsTM Technology
*See disclosures
Sun Fire T1000 and T2000
● Up to 5x the performance● As low as 1/5 the energy ● As small as 1/4 the size
Design Verification Club – Silicon Valley www.opensparc.net 20
CoolThreads Servers are a Hit
“...the UltraSparc T1 is truly a revolutionary
processor.”
“...performance in several profiles unmatched for the
power and space it consumes.”
“These servers could save companies millions.”
“...the machines put Sun at the cutting edge of one of the chip
industry's biggest trends... multi-core systems.”
Design Verification Club – Silicon Valley www.opensparc.net 21
New wave requires rethinking everything
Why not open source hardware?
Design Verification Club – Silicon Valley www.opensparc.net 22
World's First Open Source Microprocessor
• Governed by GPL (2)• Complete chip architecture• Register Transfer Logic (RTL)• Hypervisor API• Verification suite and
architectural models• Simulation model for Solaris
bringup on s/w
OpenSPARC.net
Design Verification Club – Silicon Valley www.opensparc.net 23
Virtualisation
• Thin software layer between OS and platform hardware
• Para-virtualised OS
• Hypervisor + sun4v interface• Virtualises machine HW and isolates OS from register-
level• Delivered with platform not OS• Not itself an OS
SPARC hardware
Hypervisor
Solaris
UserApp
sun4v virtual machine
stableinterface“sun4v”
UserApp
OpenBoot
UserApp
Design Verification Club – Silicon Valley www.opensparc.net 24
It’s aboutParticipation
www.opensparc.net Design Verification Club – Silicon Valley 25
UltraSPARC T1 - Verification• Philosophy
> Verify once and then leverage, e.g. SPARC core, L2$ bank• Strategy
> Directed “Black box” testing for architectural, “cycle” independent features
> Stand Alone Testing to flush out all basic bugs allows for more “independent bug peeling” and better controllability
> Directed Random “White box” testing with Functional Coverage Feedback
> Some formal methods, code review in high risk areas
www.opensparc.net Design Verification Club – Silicon Valley 26
Architecture Verification• Cross checking with Architectural Simulator (AS)
> Checks “Architectural state”, but follows RTL on certain conditions like memory order
• Program Reference Manual (PRM) Coverage> Aim to “qualify” PRM with set of diags> Exhaustive set of directed diags, very basic operations, with
certain coverage goals> Timing independent, only checks for functionality> Divided in opcode groups – ALU, State-regs, Control-Transfer,
Mem-ops> Traps, ASI & MMU, State-register coverage
www.opensparc.net Design Verification Club – Silicon Valley 27
Architecture Verification (2)• Self-Checking Diags
> Covers areas where AS lags>Especially in Diagnostics, IO/DMA, tick-regs, errors, reset
and self-modifying code>Memory model verification>Performance diags
www.opensparc.net Design Verification Club – Silicon Valley 28
Architecture Verification (3)• System Interface Verification
> Producer-Consumer variations between SPARC core and I/O devices for memory and I/O addresses
> All threads accessibility to DRAM controller and I/O's> Traps on unsupported instructions
• Reset Verification> Power-up state of the machine as per PRM> Chip state after warm reset, including DRAM controller and I/O
• Debug Support Verification> Cache set associativity functionality> Error injection, clock-stop, TAP access
www.opensparc.net Design Verification Club – Silicon Valley 29
Microarchitecture Verification
FetchExecuteTLBLd/StTrapsSystemFloatSwitchL2DRAM
Uni
t-Foc
used
Tes
ting
Function-FocusedMT Coherency RAS ... Perf Arch
www.opensparc.net Design Verification Club – Silicon Valley 30
Microarchitecture Verification (2)• Unit focused Verification
> Write Checkers> Write Directed, C/Perl, Internal code generator diags to verify
functionality> All major blocks Pipe, Traps, TLB, Load Store, Streaming,
Floating point, crossbar, L2 and DRAM controller• Function focused Verification
> Rely on AS and functional checkers> Generate directed and pseudo-random> Total Store Order (memory ordering), Multi-threading,
RAS/Errors> Notice that function focused verification spans across blocks
www.opensparc.net Design Verification Club – Silicon Valley 31
Memory verification using Symbolic Simulation
RTL SPICE Netlist
testbench file (.tb)
ssf -runcv
coverage analysis
Pass
debug
Fail
Transistor Schematic
Main Steps1. Schematic to SPICE netlist generation2. Testbench generation3. Symbolic simulation and debug4. Symbolic coverage analysis
www.opensparc.net Design Verification Club – Silicon Valley 32
Clock Domain Checking Verification• Clock Domain Checking (CDC)
> Static analysis of clock domain crossings• Key features of CDC
> Identify clock domain crossings> Detect missing synchronizers> Identify incorrect trasfer protocols> Verify reconvergence of independently synchronized signals
• Found 8 bugs across 6000 crossings in 3 clock domains
www.opensparc.net Design Verification Club – Silicon Valley 33
Performance Verification• Goal to make sure performance of RTL model matches
the Architectural performance simulator• Single thread and multi-thread architectural and
performance diags matches within 2-3% accuracy• Run bigger Multi-threaded application traces under mini-
OS kernel• Found 50+ bugs
www.opensparc.net Design Verification Club – Silicon Valley 34
Hardware Acceleration• Prevent functional bug escapes by running long directed
and random tests• Facilitate early SW development and system integration
> Booted Hypervisor, Open Boot Prom (OBP) and multi-threaded Solaris prior to Tapeout
• Aid Post-silicon debug and fix verification> Repeatability, check-point replay
• Ran estimated 5+ Trillion simulation cycles
www.opensparc.net Design Verification Club – Silicon Valley 35
Functional Verification Tapeout Criteria• 15B fullchip random cycles without a hardware failure• Full chip bug rate < 1 per week• Coverage
> Functional coverage greater than 95%> Line/Condition coverage greater than 99%> Remaining untested coverage reviews and waivers
• All architectural diags including all PRM documented features written and passing
• Tester reset tested in RTL & full gate level simulation• Hypervisor and OBP brought up in RTL• No open bugs blocking bring up activities
Design Verification Club – Silicon Valley www.opensparc.net 36
VerificationMetrics
www.opensparc.net Design Verification Club – Silicon Valley 37
RTL Stability, by week
www.opensparc.net Design Verification Club – Silicon Valley 38
Weekly New Bugs
www.opensparc.net Design Verification Club – Silicon Valley 39
Cumulative Bugs
www.opensparc.net Design Verification Club – Silicon Valley 40
Coverage Objects
www.opensparc.net Design Verification Club – Silicon Valley 41
System level Coverage
www.opensparc.net Design Verification Club – Silicon Valley 42
Coverage• 20K+ total coverage objects
IFU EXU FFU MMU SPU TLU LSU MSS FPU DRAM
IOB JBI0
500
1000
1500
2000
2500
3000
3500
4000
4500
Coverage Distrubution across Functional Blocks
Coverage Objects
www.opensparc.net Design Verification Club – Silicon Valley 43
Verification Effectiveness
• 30% Post TO1.0 bugs were critical• 80% of post TO1.0 bugs were fixed, rest became features• Most of the bugs had hardware or software workaround• Memory and I/O subsystem had most escapes• Data compares quite well against other commercial CPUs
Critical/Not-critical Fixed/Feature Workaround (hw/sw)0
5
10
15
20
25
30
35
40
45
Test Escapes
Fetch Execution Memory I/O Electrical DFT
0
2
4
6
8
10
12
14
16
Escapes across various functionalities
Design Verification Club – Silicon Valley www.opensparc.net 44
Feedback"The positive effect of good verification I've noted in recent project—Sun's Niagara . Projects had a close working relationship between the design team and the verification team. As a result of the close teamwork, very early silicon worked extremely well. On many larger design projects, the design team builds the design and then "throws it over the wall" to the verification team.This bureaucratic approach often appears in entrenched organizations of larger, established companies. Smaller startups cannot afford such aluxury, and the Niagara team was formed as a startup (Afara WebSystems) and then bought by Sun Microsystems. The Afara team managed to maintain its team approach even after being swallowed up by the much larger Sun."
-- Kevin Krewell, Analyst Microprocessor report 5/31/05
Design Verification Club – Silicon Valley www.opensparc.net 45
Innovationwill happen everywhere
About the Community: opensparc.net
Innovation Happens Everywhere
Clustermaps for http://opensparc.net
Design Verification Club – Silicon Valley www.opensparc.net 46
Get the Source, Start Innovating
Innovate anywhere – within it or outside it
Things you can do:- use as is- add/delete cores- add new instr.- change FPU- add video/graphics- add network interface- change memory interface- change I/O interface- change cache/mem interface- etc.
IO BUS
C4C3C2C1
L2$ BankL2$ BankL2$ BankL2$ Bank
Crossbar FPU
Sys I/FBuffer Switch Core
20+ GB/s read/write
16KB I$
8KB D$
16KB I$
8KB D$
16KB I$
8KB D$
16KB I$
8KB D$
C8C7C6C5
16KB I$
8KB D$
16KB I$
8KB D$
16KB I$
8KB D$
16KB I$
8KB D$
L2$ Bank L2$ Bank L2$ Bank L2$ Bank
MemoryController
MemoryController
MemoryController
MemoryController
16B @ 333 MT/s
16B @ 200Mhz3.2GB/s peak, 2.5GB/s effective
Crossbar
DDR2 DIMM DDR2 DIMM DDR2 DIMM DDR2 DIMM
4 threads per core
3MB L2$
Design Verification Club – Silicon Valley www.opensparc.net 47
OpenSPARC Communities
Chip Designers
Hardware IP Suppliers
EDA Vendors
CMT Tools
Academia/Universities
Operating Systems
BenchmarkingReference flowFPGAEmulationVerificationPhysical DesignMulti-threaded tools
Architecture, ISA, VLSI course workThreading, Scaling, ParallelizationBenchmarks
PCI cores, SERDES etc.
Compilers, ThreadingOptimizationPerformance Analysis
OpenSolaris,Linux, BSD variants,Embedded OSs
SoC designs, Hard macrosTelecom applications
www.opensparc.net Design Verification Club – Silicon Valley 48
64 Bits. 32 Threads. Free.
Imagine what’s next...
"Sun's decision to release Verilog source code for the UltraSPARC hardware design under a free software license is a historic step. Sun is showing its profound understanding of the forces shaping our technological future in making this decision. -- Eben Moglen, founding director of the Software Freedom Law Center
"The free world welcomes Sun's decision to use the Free Software Foundation's GNU GPL for the freeing of OpenSPARC. We'd love to see other hardware companies follow in Sun's footsteps."
-- Richard Stallman, Free Software Foundation
Shrenik MehtaDirector, Frontend Technologies & OpenSPARC ProgramSun Microsystems, Inc.
Thank You!
www.opensparc.net
www.opensparc.net Design Verification Club – Silicon Valley 50
Conversations We Should Have Now
OpenStandards(SystemVerilog, Property Specification Language)
Open Verification
Network with Peers OpenSource(Software, hardware)
Design Verification Club – Silicon Valley www.opensparc.net 51
Legal Substantiation – Benchmarks• Sun Fire T2000 (8 cores, 1 chip) 14,001 SPECweb2005. IBM p5 550 (4 cores, 2 chips) 7881 SPECweb2005. IBM
eServer Xseries x346 (2 cores, 2 chips) 4348 SPECweb2005. SPEC, SPECweb reg tm of Standard Performance Evaluation Corporation. Sun Fire T2000 results submitted to SPEC. Other results from www.spec.org as of December 6, 2005.
• SPECjAppServer2004 with BEA - Sun Fire T2000 (8 cores, 1 chip) 615.64 JOPS@Standard. SPECjAppServer2004 Sun Fire rx4600 (4 cores, 4 chip) 471.28 JOPS@Standard. SPEC, SPECjAppServer reg tm of Standard Performance Evaluation Corporation. Sun Fire T2000 results submitted to SPEC. Other results from www.spec.org as of 12/06/2005.
• SPECjAppServer2004 with Sun Java System Application Server. SPECjAppServer2004 Sun Fire T2000 (8 cores, 1 chip) 436.71 JOPS@Standard. SPECjAppServer2004 HPrx4640 (4 cores, 4 chip) 471.28 JOPS@Standard. SPEC, SPECjAppServer reg tm of Standard Performance Evaluation Corporation. Sun Fire T2000 results submitted to SPEC. Other results from www.spec.org as of 12/06/2005. Sun HW+SW application tier cost = $37,484.95, appl cost per JOP = $85.83. HP HW+SW application tier cost = $140,537.88, appl cost per JOP = $298.20 HP Bill of Material from http://www.spec.org/jAppServer2004/results/res2005q3/jAppServer2004-20050913-00016.html BEA pricing from http://www.awaretechnologies.com/BEA/index.html.System pricing dated 11/29/05
• Sun Fire T1000 Server (1 chip, 8 cores, 1-way) 51,540 bops, 12,885 bops/JVM. IBM x346 (2 chip, 4 cores, 4-way) 39,585 bops, 39,585 bops/JVM. IBM p520 (1 chip, 2 cores, 2-way) 32,820 bops, 32,820 bops/JVM. Dell SC1425 (2 chip, 2 cores, 2-way) 24,208 bops, 24,208 bops/JVM. SPEC, SPECjbb reg tm of Standard PerformanceEvaluation Corporation. Sun Fire T1000 results submitted to SPEC. Other resultss as of 12/6/2005 on www.spec.org
Design Verification Club – Silicon Valley www.opensparc.net 52
Legal Substantiation – Benchmarks (Cont'd)
• NotesBench R6iNotes Sun Fire T2000 (1x1200 MHz UltraSPARC T1, 32GB), 4 partitions, Solaris[TM] 10, Lotus[R] Domino 7.0, 19,000 users, $4.36 per user, 16,061 NotesMark tpm, 400 ms avg rt. NotesBench R6iNotes IBM x346 (2 x 3.4 GHz Xeon processors, 8GB), 1 partition, SuSE Linux 8, Lotus[R] DominoR6.5.3, 6,50 users, $9.07 per user, 5,109 NotesMark tpm, 569 ms avg rt.More info www.notesbench.org
• Portal tests conducted on Sun Fire Sun Fire T2000 configured with 6 cores, 1.0GHz UltraSPARC T1 processor and 16GB memory. Dell 6650 configured with 4 x Intel Xeon processors at 2GHz and 4GB memory. 1 processor was active to run the test. Internal test using SLAMD Distributed Load Generation Engine AE & Mercury Load Runner. Both systems were installed with Solaris 10 OS and Sun Java Portal Server 7. Test date: 11/14/05.
• Crypto Processing RSA & DSA sign operations @ 1024-bit
Design Verification Club – Silicon Valley www.opensparc.net 53
Legal Substantiation – Benchmarks (Cont'd)
• Two-tier SAP ECC 5.0 Standard Sales and Distribution (SD) benchmark Sun Fire T2000 (1 processor, 8 cores, 32 threads) 1.2 GHz UltraSPARC T1, 32 GB mem, 950 SD benchmark users, 1.91 sec avg resp, MaxDB 7.5 database, Solaris 10. SAP certification number was not available at press time, please see: www.sap.com/benchmark. Benchmark data submitted for approval: 950 SD Users (Sales &Distribution), Ave. dialog resp. time: 1.91 seconds, Throughput: Fully processed order line items/hour:95,670, Dialog steps/hour: 287,000, SAPS: 4,780, Average DB req. time (dia/upd): 0.080 sec / 0.157 sec, CPU utilization of central server: 99%, central server OS: Solaris 10, RDBMS: MaxDB 7.5, SAP ECC
Release: 5.0, Configuration: Sun Fire Model T2000, 1 processor/ 8 cores / 32 threads, UltraSPARC T1, 1200 MHz, 64 KB(D) + 128 KB(I) L1 cache, 3 MB L2 cache, 32 GB main memory. Two-tier SAP SD Standard Application Benchmark Release 4.70 (64-bit)results for the HP ProLiant DL580 (4-way, 4 procs, 4 cores, 4 threads) included 4x 3.33 Ghz Xeon, 32 GB mem, 937 SAP SD Benchmark users,1.96 sec avg response time, Cert#2005012, running Microsoft® Windows Server
2003, Enterprise x64 Edition (64-bit) and Microsoft SQL Server 2000 Enterprise Edition (32-bit), certified on March 29, 2005. Two-tier SAP SD Standard Application Benchmark Release 4.70 (64-bit) results for the HP rx4640 (4-way, 4 procs, 4 cores, 4 threads) included 4x 1.5 Ghz Itanium2, 32 GB mem, 880 SAP SD Benchmark users, 1.89 sec avg response time, Cert#2004030, running HP-UX 11i,Oracle 9i certified on June 4, 2004. More information on SAP Benchmark results can be found at www.sap.com/benchmark