Some early results for SiCortex machines

John F. Mucci

Founder and CEO, SiCortex, Inc.

Lattice 2008

The Company

A computer systems company building complete, high-processor-count, richly interconnected, low-power Linux computers

Strong belief (and now some proof) that a more efficient HPC computer can be built from the silicon up.

Around 80 or so really bright people, plus me.

Venture funded, based in Maynard, Massachusetts, USA

See http://www.sicortex.com for whitepapers and technical information

What We Are Building

A family of fully-integrated HPC Linux systems delivering best-of-breed:

Total Cost of Ownership

Delivered performance: per watt, per square foot, per dollar, per byte of I/O

Usability and deployability

Reliability

SiCortex Product Family

105 Gflops, 48 GB memory, 6.5 GB/s I/O, 200 watts

0.95 Teraflops, 864 GB memory, 30 GB/s I/O, 2.5 kilowatts

2.14 Teraflops, 1.8 Terabytes memory, 68 GB/s I/O, 5+ kilowatts

8.55 Teraflops, 7.8 Terabytes memory, 216 GB/s I/O, 20.5 kilowatts, 2.3 m²
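As a rough worked comparison of the per-watt claim, the sketch below simply divides each configuration's peak Gflops by its quoted power draw, using only the figures listed above ("5+ kilowatts" is taken as 5 kW, an assumption). These are peak ratings, not delivered performance.

/* Peak Gflops per watt, computed directly from the product-family
 * figures quoted above ("5+ kilowatts" taken as 5 kW). */
#include <stdio.h>

int main(void)
{
    struct { const char *config; double gflops, watts; } sys[] = {
        {"105 Gflops,  200 W  ",   105.0,   200.0},
        {"0.95 Tflops, 2.5 kW ",   950.0,  2500.0},
        {"2.14 Tflops, 5 kW   ",  2140.0,  5000.0},
        {"8.55 Tflops, 20.5 kW",  8550.0, 20500.0},
    };

    for (int i = 0; i < 4; i++)
        printf("%s -> %.2f Gflops/W\n", sys[i].config,
               sys[i].gflops / sys[i].watts);
    return 0;
}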

Make it Green, Don't Paint it Green

Through increasing component density and integration: ~= performance, ~= reliability, ~= 1/power

Innovate where it counts! Single-core performance and architecture

Scalability: memory, network, I/O

Reliability: network, on-chip and off-chip ECC, software to recover

Software usability

Buy or borrow the rest...


The SiCortex Node Chip


27-Node Module

27x node

54x DDR2 DIMM

2x Gigabit Ethernet

3x 8-lane PCIe modules (Fibre Channel, 10 Gb Ethernet, InfiniBand)

Compute: 236 GF/sec
Memory b/w: 345 GB/sec
Fabric b/w: 78 GB/sec
I/O b/w: 7.5 GB/sec

Fabric Interconnect

[Diagram: fabric interconnect topology, nodes numbered 0-35]

Network and DMA Engine

Network: unidirectional; 3 unique routes between any pair of nodes; 3 links in, 3 links out, plus loopback, each 2 GB/s; fully passive, reliable, in-order, no switch.

DMA engine: 100% user level, no pinning; Send/Get/Put semantics; remote initiation.

MPI: 1.0 µs latency, 1.5 GB/s bandwidth.
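For reference, figures like "1.0 µs, 1.5 GB/s" are conventionally obtained with a two-rank ping-pong microbenchmark: half the averaged round-trip time of a tiny message approximates one-way latency, and the same measurement with a large message gives an effective bandwidth. Below is a minimal, generic MPI ping-pong sketch in C, not SiCortex code; the message sizes and iteration count are illustrative assumptions.

/* Minimal MPI ping-pong sketch (run with 2 ranks, e.g. mpirun -np 2).
 * Rank 0 and rank 1 bounce a message; half the averaged round-trip
 * time approximates one-way latency for tiny messages and gives an
 * effective bandwidth for large ones.  Sizes and iterations are illustrative. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000;
    int sizes[] = {8, 1 << 20};   /* 8 bytes for latency, 1 MiB for bandwidth */
    char *buf = malloc(1 << 20);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int s = 0; s < 2; s++) {
        int n = sizes[s];
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double one_way = (MPI_Wtime() - t0) / (2.0 * iters);
        if (rank == 0)
            printf("%8d bytes: %.2f us one-way, %.1f MB/s\n",
                   n, one_way * 1e6, n / one_way / 1e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}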


Standard Linux/MPI Environment

Integrated, tuned and tested.

Open source:
• Linux
• GNU C, C++, Fortran
• Cluster file system (Lustre)
• MPI libraries (MPICH2)
• Math libraries
• Performance tools
• Management software

SiCortex:
• Optimized compiler
• Console, boot, diagnostics
• Low-level communications libraries, device drivers
• Management software
• 5 minute boot time

Licensed:
• Debugger, trace visualizer

[Software stack diagram: GNU toolchain, MPI, libraries, Gentoo-based Linux]

MILC su3_rmd Scaling (Input-10/12/14)

[Plot: slowdown (0 to 2) vs. number of cores (1 to 10,000, log scale) for the v10, v12, and v14 inputs, with AMD64/InfiniBand and Blue Gene/L reference curves]


Understanding what might be possible

The SiCortex system is new, compared against codes with 10+ years of hacking and optimization behind them. So we took a look at a test suite and benchmark provided by Andrew Pochinsky @ MIT, which looks at the problem in three phases:

• What performance do you get running from L1 cache
• What performance do you get running from L2 cache
• And from main memory

Very useful to see where cycles and time are spent, and it gives hints about what compilers might do and how to restructure codes.
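The same three-phase idea can be reproduced with a simple microbenchmark: time one kernel over working sets sized to sit in L1, in L2, and only in main memory, then compare the achieved Mflops. The C sketch below is a generic illustration of that methodology, not the Pochinsky suite itself; the kernel and working-set sizes are illustrative assumptions.

/* Generic cache-phase microbenchmark sketch: time the same kernel
 * (a simple multiply-add sweep) over working sets chosen to sit in
 * L1, in L2, or only in main memory.  Sizes are illustrative
 * assumptions, not those of the actual test suite. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double sweep(double *a, const double *b, size_t n, int reps)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        for (size_t i = 0; i < n; i++)
            a[i] = a[i] * 1.000001 + b[i];   /* 2 flops per element */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return 2.0 * (double)n * reps / secs / 1e6;   /* Mflops */
}

int main(void)
{
    /* Working sets (two double arrays each): ~16 KB (L1), ~256 KB (L2), ~64 MB (memory). */
    size_t elems[] = {1024, 16384, 4194304};
    const char *label[] = {"L1", "L2", "memory"};

    for (int k = 0; k < 3; k++) {
        size_t n = elems[k];
        double *a = malloc(n * sizeof *a);
        double *b = malloc(n * sizeof *b);
        for (size_t i = 0; i < n; i++) { a[i] = 1.0; b[i] = 2.0; }

        int reps = (1 << 26) / (int)n;   /* keep total work roughly constant */
        double mflops = sweep(a, b, n, reps);
        printf("%-6s: %8zu doubles x2, %6d reps -> %.0f Mflops (check %.3f)\n",
               label[k], n, reps, mflops, a[0]);
        free(a);
        free(b);
    }
    return 0;
}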


So what did we see?

By selective hand coding of Andrew's code we have seen:

Out of L1 cache: 1097 Mflops
Out of L2 cache: 703 Mflops
Out of memory: 367 Mflops

The compiler is improving each time we dive deeper into the code. But we're not experts on QCD and could use some help.

What conclusions might we draw?

Good communication makes for excellent scaling (MILC).

Working on single-node performance tuning (on the Pochinsky code) gives direction on performance and insight for the compiler.

DWF formulations have a higher computation-to-communication ratio, and we do quite well there. We will do even better, relatively, with formulations that have increased communication, where the interconnect matters more.
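As a rough illustration of the computation-to-communication argument: in a nearest-neighbor lattice code, each node computes over its local L^4 volume but only exchanges its 8·L^3 boundary faces, so the flops-to-bytes ratio grows with L and with the per-site flop count of the formulation. The sketch below just evaluates that surface-to-volume arithmetic; flops_per_site and bytes_per_face_site are assumed constants for illustration, not measured figures for any particular code or machine.

/* Surface-to-volume sketch: compute/communication ratio for a
 * nearest-neighbor stencil on a local L^4 lattice volume per node.
 * The per-site flop and per-face-site byte constants are illustrative
 * assumptions only; real formulations (Wilson, DWF, staggered) differ
 * in both numbers. */
#include <stdio.h>

int main(void)
{
    double flops_per_site      = 1500.0;  /* assumed flops per lattice site */
    double bytes_per_face_site = 200.0;   /* assumed bytes exchanged per boundary site */
    int sizes[] = {4, 8, 16};             /* local lattice extents to compare */

    for (int k = 0; k < 3; k++) {
        double L = sizes[k];
        double volume  = L * L * L * L;    /* sites computed per node */
        double surface = 8.0 * L * L * L;  /* boundary sites: 2 faces x 4 dimensions */
        double ratio   = (volume * flops_per_site) /
                         (surface * bytes_per_face_site);
        printf("L=%2.0f: %.1f flops per communicated byte\n", L, ratio);
    }
    return 0;
}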


SiCortex and Really Green HPC

Come downstairs (at the foot of the stairs), take a look, and give it a try.

It's a 72-processor (100+ Gflop) desktop system using ~200 watts, fully compatible with its bigger family members, which scale up to 5832 processors.

More delivered performance per square foot, per dollar, and per watt


Performance Criteria for the Tools Suite

• Work on unmodified codes
• Quick and easy characterization of:
  – Hardware utilization (on- and off-core)
  – Memory
  – I/O
  – Communication
  – Thread/task load balance
• Detailed analysis using sampling
• Simple instrumentation
• Advanced instrumentation and tracing
• Trace-based visualization
• Expert access to PMU and perfmon2


Application Performance Tool Suite

• papiex – overall application performance
• mpipex – MPI profiling
• ioex – I/O profiling
• hpcex – source code profiling
• pfmon – highly focused instrumentation
• gptlex – dynamic call path generation
• tauex – automatic profiling and visualization
• vampir – parallel execution tracing
• gprof is there too (but is not MPI-aware)


For fun (and debugging)

Thanks
