Workload Optimized Systems: The Wheel of Reincarnation
Michael Sporer, Netezza Appliance Hardware Architect
21 April 2013
Outline
• Definition
• Technology
• Minicomputers – Prime
• Workstations – Apollo
• Graphics – Stellar/Stardent
• SMP Database Machines – Data General
• MPP Database Machines – Netezza/IBM
• What’s Next?
• Conclusion
What Is Workload Optimized?
• Technology is advancing
• Groups of technologies play together
• Step functions lead to qualitative opportunities
• Opportunity is defined by a specific market
• Market is identified by workload
• Hardware/software co-design optimizes the workload
• Startups are often a leading indicator
• Established vendors follow or perish
Technology – Hardware
• Transistor technology – Moore’s Law
  – ICs -> CPUs, ASICs, FPGAs
• DRAM
• Busses and their protocols
• Magnetic recording technology – bpi
  – Disks
• Networks – Ethernet, primarily
• Graphic displays -> GPUs
• Cache and chip interconnect
• SMP and NUMA
Technology – Software
• Time-share OS
• Virtual memory
• Languages: ASM, Fortran, C, C++, Java
• Object-oriented methods
• Open source
Minicomputer – Prime (1)
• Early 1970s
• Engineers need computer access
• Mainframes too expensive
• New IC technology, DRAM, PROM, and cost-effective, removable 14” Winchester disk drives
• Firmware technology for flexibility and ISA range
• Virtual memory and large address spaces
• OS technology pioneered by MULTICS
Minicomputer – Prime (2)
• Designed a multi-user, time-shared minicomputer for CAD and SW development
• Architecture defined by software engineers
• Wrote our own optimizing compilers
• Process switch and VM table walk in firmware
• 32-bit ISA as an extension of the prior 16-bit version
• All 16-bit applications ran at GA (general availability)
• New and recompiled apps ran in 32-bit mode
• Floating point in firmware in PROM
Workstation – Apollo (1)
• Early 1980s
• Engineers needed dedicated performance
• Workgroups cooperating
• Early microprocessors (68010)
• Ethernet proves networks can be built
• High-resolution graphic displays available
• Bit-map technology proven by PARC and MIT
Workstation – Apollo (2)
• Design a personal engineering workstation
• High-resolution bit-map display
• Network OS with seamless file sharing
• 2D graphics SW system for cut and paste
• VM implemented using 2x 68010
• Application development environment
Graphics Supercomputer – Stellar/Stardent (1)
• Engineers need 3D displays for CAD, biotech
• Heavy compute load using multiple threads
• Sea-of-gates IC technology
• YACC
• Single-clock scan-path design proven (by IBM)
• Parallelizing compilers beginning to be understood
Graphics Supercomputer – Stellar/Stardent (2)
• Design a graphics supercomputer
  – 20 MIPS, 80 MFLOPS, 120K Gouraud-shaded triangles
• Home-brew C-like simulation language using YACC
  – 11 designs, 49 sea-of-gates chips per system
• First-pass operational
• Scan-path SW
• Unix OS
• Home-brew parallelizing Fortran compiler
SMP Database Machine – Data General (1)
• Early 1990s
• Database usage surging – mostly OLTP
• SMP reaching limits
• Dense ASICs available
• Intel CPUs becoming good
SMP Database Machine – Data General (2)
• Need lots of CPUs and lots of I/O to handle high OLTP tps demands and database size
• Design a 32-way system with distributed I/O
  – SMP: 8 groups of 4-way M88K CPUs on a packet bus
  – NUMA: 8 groups of 4-way Pentium CPUs on an SCI bus
• DG/UX to handle NUMA
• I/O subsystem aware of NUMA
MPP Database Machine – Netezza/IBM (1)
• Data accumulating – could it be used?
• OLTP systems groaning under OLAP workloads
• Cyber Bricks at Microsoft – computation is low cost
• Active Disks at CMU – parallel algorithms near disk
• Inexpensive low-power CPUs
• Inexpensive FPGAs
• Inexpensive consumer disks
• Cheap 100Mb -> 1Gb Ethernet
MPP Database Machine – Netezza/IBM (2)
• Design a simple SPU – disk + FPGA + CPU + NIC
• Put lots into a rack
• Develop MPP SW
  – Postgres front end
  – Tuned optimizer
  – Parallelizer
  – Query compiler – code for CPU and FPGA
  – Distributor of snippets of work
• FPGA – decompress, restrict, project – reduce CPU load
• Appliance – easy to install, use, service, support
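The restrict/project offload described above can be sketched in miniature. This is an illustrative Python model (not Netezza code): the "FPGA" stage filters rows and drops columns block-by-block as data streams off disk, so the downstream CPU stage touches far fewer bytes.

```python
# Hypothetical sketch of snippet processing: the FPGA stage restricts
# (filters rows) and projects (drops columns) before the CPU stage sees
# the data, so the query engine handles a much smaller stream.

def fpga_stage(blocks, predicate, columns):
    """Streaming restrict + project, applied as blocks leave the disk."""
    for block in blocks:
        for row in block:
            if predicate(row):                       # restrict
                yield {c: row[c] for c in columns}   # project

def cpu_stage(rows, key):
    """Downstream aggregation on the reduced stream (a GROUP BY count)."""
    counts = {}
    for row in rows:
        counts[row[key]] = counts.get(row[key], 0) + 1
    return counts

# Example: two disk blocks of sales rows; keep only zip 02139, count by date.
blocks = [
    [{"date": "2013-04-01", "zip": "02139", "amt": 10},
     {"date": "2013-04-01", "zip": "94304", "amt": 20}],
    [{"date": "2013-04-02", "zip": "02139", "amt": 30}],
]
reduced = fpga_stage(blocks, lambda r: r["zip"] == "02139", ["date"])
print(cpu_stage(reduced, "date"))  # {'2013-04-01': 1, '2013-04-02': 1}
```

The generator keeps the pipeline streaming: no stage materializes the full table, mirroring why a cheap CPU suffices behind the FPGA.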
Netezza Snippet Processor
• CPU: PowerPC 440GX
• FPGA: 1M gates
• Memory: 1 GB RAM, socketed DIMM
• Disk: 400 GB enterprise SATA disk drive
• Network: GigE to each SPU
Netezza Streaming Processing
• Identify work that does not need to be done
• Do as much work as we can in parallel on the SPU
• Move restricts into the FPGA
• Increase effective disk performance
• Execute operational streaming analytics
[Figure: base-table data blocks (Col 1: Date, Col 2: Zip). Zone maps: 18 of 48 extents read. CBT maps (Dimension 1: Date, Dimension 2: Zip): 2 of 48 extents read.]
[Figure: SPU streaming pipeline – DMA from disk (primary/mirror, SPU swap) into the FPGA’s streaming record processor (Project, Restrict), then the PowerPC query engine (joining, sorting, grouping, analytic functions “on stream”), with snippet queue, main memory, compiled tables, fault recovery, and transaction/lock manager.]
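The zone-map figure above illustrates "work that does not need to be done." A minimal sketch of the idea (names are hypothetical, not Netezza's implementation): each extent records the min/max of a column, and a range predicate skips every extent whose range cannot match.

```python
# Illustrative zone-map sketch: per-extent min/max metadata lets a scan
# skip extents that cannot contain matching rows, so a 48-extent table
# may need only a few extents actually read from disk.

def build_zone_map(extents, col):
    """One (min, max) pair per extent for the given column."""
    return [(min(r[col] for r in ext), max(r[col] for r in ext))
            for ext in extents]

def extents_to_read(zone_map, lo, hi):
    """Indices of extents whose [min, max] overlaps the predicate [lo, hi]."""
    return [i for i, (mn, mx) in enumerate(zone_map) if mx >= lo and mn <= hi]

# Three extents of rows keyed by date; predicate: 6 <= date <= 9.
extents = [
    [{"date": 1}, {"date": 5}],
    [{"date": 6}, {"date": 9}],
    [{"date": 10}, {"date": 20}],
]
zm = build_zone_map(extents, "date")
print(extents_to_read(zm, 6, 9))  # [1] -- only the middle extent is scanned
```

Because the metadata is tiny, the check costs almost nothing, while skipped extents directly increase effective disk performance.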
Netezza Evolution
• First product – 2003
  – 112 80 MHz CPUs
  – 112 small FPGAs – 8-bit datapath
  – 64 MB DRAM
  – 33 MB/s, 40 GB disks
• Current product – 2013
  – 112 2.2 GHz CPU cores
  – 112 FPGA cores – 32-bit datapath
  – 8 GB DRAM per core + 512 MB per FPGA core for disk cache
  – 160 MB/s, 600 GB disks
What Netezza Got Right
• Streaming balanced architecture
  – Disk, CPU, and network overlap – query time is the max of all 3
• FPGA technology
  – Improve scan performance of HDDs
  – Decrease data to be processed by the CPU -> cheap CPU
• MPP – move processing toward the disk
• Focus on GB/sec/$
• New products when there is a step function in technology
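The "query time is the max of all 3" claim can be made concrete with a back-of-envelope model. This is an illustrative sketch with made-up numbers, not measured figures: when disk, CPU, and network stages are fully overlapped, elapsed time is set by the slowest stage rather than the sum.

```python
# Back-of-envelope model of a balanced streaming pipeline: with the three
# stages overlapped, snippet time is the max of the per-stage times.
# All rates and sizes below are illustrative assumptions.

def snippet_time(bytes_scanned, disk_mb_s, cpu_mb_s, net_mb_s, selectivity):
    disk = bytes_scanned / (disk_mb_s * 1e6)
    cpu = bytes_scanned / (cpu_mb_s * 1e6)                  # after FPGA offload
    net = bytes_scanned * selectivity / (net_mb_s * 1e6)    # only surviving rows move
    return max(disk, cpu, net)                              # overlapped, not summed

# 40 GB scan; 160 MB/s disk, 400 MB/s CPU, 100 MB/s net; 1% of bytes survive:
t = snippet_time(40e9, 160, 400, 100, 0.01)
print(round(t, 1))  # 250.0 seconds -- disk-bound, as a balanced design expects
```

In a balanced design the stage rates are matched so no component is wildly over-provisioned, which is the GB/sec/$ focus in the bullet above.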
What’s Next From the Market?
• Competition in markets
• Urgency to get ahead, or at least stay even
• Need more sophisticated use of data
• More complex analytics
• On more data
• Better ability to do what-if analysis
• More uses and users of DB systems
What’s Next From Technology?
• At the limit of the number of disks per rack
• Cache often not big enough
• Flash provides very high IOPS
  – Crossover in IOPS/$
• Lots of cores per CPU
• Lots of DRAM bandwidth
• Faster, lower-overhead networks using RDMA
• GPUs – can they be harnessed?
• FPGAs – how do we get SW developers to use them?
Conclusions
• We are ready for a big step in technology (>10x)
• Old SW stacks will no longer work
  – OLTP systems continue to not scale
  – OLAP systems are centered on HDDs
• What are the SW limits?
  – MPP – can it scale to 1,000 nodes? How about 10,000?
  – How do we minimize overheads hidden by HDDs?
• It is imperative for HW and SW to be co-designed
• How are you handling it?