42
What I Learned Building a Parallel Processor from Scratch ISPASS, Philadelphia March 31, 2015 1

What I learned building a parallel processor from scratch

Embed Size (px)

Citation preview

Page 1: What I learned building a parallel processor from scratch

What I Learned Building a Parallel Processor from Scratch

ISPASS, Philadelphia

March 31, 2015

1

Page 2: What I learned building a parallel processor from scratch

My 2007 “Epiphany”

It's more efficient to design a pipeline of specialized processors than having a single massive general purpose processor that runs a complete application.

DUH!

2

Page 3: What I learned building a parallel processor from scratch

My Background

8 years designing DSPs

100 people, $100M R&D

TigerSHARC was a tech success,

but a huge financial flop!

Mixed signal SOCs for CCDs

Invented new ISA in 2 weeks

3-4 designers per derivative

Huge success (>$100M)

+

3

Page 4: What I learned building a parallel processor from scratch

My 2008 Processor Design Goals

● ANSI-C programmable● Floating point (industrial + medical markets)● Scalabale● 10X boost in GFLOPS/Watt (otherwise, who would buy?)● Simple enough to be built by < 5 engineers

4

Page 5: What I learned building a parallel processor from scratch

The Epiphany Multicore Architecture

RISC SRAM

NOCData

Mover

(3,3)

(0,0)

● ANSI-C Programmable● IEEE 32bit floating point ● Nothing shared● Distributed memory model● Scalabale to “infinity”● 50 GFLOPS/W

RISC(1GHZ)

SRAM

NOCData

Mover

5

Page 6: What I learned building a parallel processor from scratch

20/20 Decision Report Card

6

✔ NO HW cache✔ NO HW coherence✔ Tiny custom RISC ISA✔ Floating point✔ 2D Mesh✔ C-programmability✔ Distributed memory✔ 64-bit LDR/STR✔ 64 registers✔ 4 bank local memory✔ 5+3 dual issue pipeline✔ Single flit NOC✔ Byte addressability

✔ HW loops✔ Autoincrement LDR/STR ✔ Slow reads✔ Fetch from external✔ NO 64bit FPU✔ NO 64bit addressing

✔ 32KB of memory✔ NO native integer mult✔ NO DRAM interface

Page 7: What I learned building a parallel processor from scratch

My Development Process● One person owns the whole path from arch to layout● Spreadsheet (triangulating architectures)● Emacs/VI (does code look clean on paper?)● Manually “count” gates in code. “Feel complexity”● Verilog Simulation (checks for logical/mental mistakes)● Synthesize design (in FPGA or EDA to measure cost!!)● Use advisors (peer review to look for gotchas) ● Application development (No but I should have!!)

(*considering the lack of benchmarks, it's a small miracle Epiphany works as well as it does)7

Page 8: What I learned building a parallel processor from scratch

Step 1: Study History - 3 months

My inspiration of what to do (and NOT to do)● Henessy and Patterson's classic books. (Re)Read them!● Blackfin, TigerSHARC, and c64x (DSP features)● ARM and MIPS (standard features)● Tilera and Ambric (manycore features)

GOTCHA: Watch out for “clever” features! The really useful features will be common to all architectures.

8

Page 9: What I learned building a parallel processor from scratch

Step 2: Create a baseline - 3 months● How much area goes to computation?● How expensive is SRAM and cache?● What is the $ cost of IP and design effort?● How expensive is the network?● What is the performance gain of instruction X in app Y?● How often is instruction X used?● What does a “typical” application look like

These numbers are your anchor, get them right!9

Page 10: What I learned building a parallel processor from scratch

Step 3: Verify assumptions - 6 months

● Design and simulate (to verify cost assumptions)

● Build prototypes (to verify simulation assumptions)

● Talk to customers (to verify product feature assumptions)

10

Page 11: What I learned building a parallel processor from scratch

Epiphany (-1)● May 2008● Naive but close...● Could have had 64

cores in 2008!● Epiphany arch has

only changed by “delta” since 2008

● Submitted many grant proposals, papers (all rejected)

11

Page 12: What I learned building a parallel processor from scratch

Epiphany-0● Oct 2008 synthesis and layout prototype● Virtual prototypes (synthesis and layout of cores)● Showed that 1 Ghz operation and 50GFLOPS/W was feasible● Gave me the courage to ask friends and family for money ($200K)

12

Page 13: What I learned building a parallel processor from scratch

Epiphany-I● Jun 2009 tapeout● 16 cores, 65nm prototype● ~1 person, 12 weeks● Changes 7 days after TO!● Barely worked but proved

it was possible● ~10mm^2

13

Page 14: What I learned building a parallel processor from scratch

Raised capital (Nov 2009)

14

Series-A● $1.5M from BittWare, a

DSP board company.

(a TigerSHARC customer)● They invested because

their customers were interested

● Complicated deal● 6 traunches x $250K

Page 15: What I learned building a parallel processor from scratch

Building a Team (Dec, 2009)

Oleg RaikhmanDV/Tools

Roman TroganDesign/Layout

Andreas OlofssonArch/Design

Lesson: Team makeup is the #1 determinant for success. You must have it in place before you set out on journey.

15

Page 16: What I learned building a parallel processor from scratch

Epiphany-II● May 2010 tapeout● 16 cores, 65nm● ~3 person, ~6 weeks● Changes 1 day before TO● A vendor screwed up!● Should have been a

product!● ~12mm^2

16

Page 17: What I learned building a parallel processor from scratch

It Works!● Sept 2010 bringup● Ran floating point

program!● Proved efficiency!● Packaging yield was

less than 10% on eLink (vendor!!).

● SPI debug channel was key to debugging vendor packaging issue17

Page 18: What I learned building a parallel processor from scratch

Epiphany-III● Dec 2010 tapeout● 16 cores, 65nm● This is a product!● Kept improving with

every iteration● RTL changes 1 day

before TO● ~25 GFLOPS/W● ~12mm^2

18

Page 19: What I learned building a parallel processor from scratch

New Packaging● 60um staggered pitch is

not trivial!● Vendor failure on E1

and E2 almost broke company!

● New vendor (Kyocera)● You can't be on the

bleeding edge without the right partners!

● Better results!19

Page 20: What I learned building a parallel processor from scratch

First Delivery● May 2011● Chip worked

perfectly out of the box!

● This silicon became the E16G301

● Felt at peace● Happy moment, but

only the beginning...

20

Page 21: What I learned building a parallel processor from scratch

Embedded Systems Conference

● May, 2011● Launched Epiphany-III● Lots of press!● Competitors curious● But where were the

customers!!??● Flew home depressed

and pissed off...21

Page 22: What I learned building a parallel processor from scratch

First sale!● June 2011● A $5,000 pitiful eval kit● To a big competitor...● They never even turned

it on...● ...which was probably a

good thing.

22

Page 23: What I learned building a parallel processor from scratch

Epiphany-IV● Aug 2011 tapeout● (Jul 2012 samples)● 64 cores, 28nm● 50 GFLOPS/W● RTL changes 2 days

before TO● 3 engineers, 12 weeks!

23

Page 24: What I learned building a parallel processor from scratch

First sale!● June 2011● A $5,000 pitiful eval kit● To a big competitor...● They never even turned

it on...● ...which was probably a

good thing.

24

Page 25: What I learned building a parallel processor from scratch

First Board Product

● May, 2012 ship● FMC board from

Bittware● 4 chip array● Expensive and

not a solution● Niche market● Never shipped in

volume25

Page 26: What I learned building a parallel processor from scratch

Software● 2012 focus● Invested heavily in

software tools + demos

● Face recognition, FFT, Matmul

● Beat ARM by 10X on demo for BIG smartphone company!

● Response: “Meh...”26

Page 27: What I learned building a parallel processor from scratch

Parallella● Sept, 2012● Running out of

money...● We needed a

parallel community and a better kit, so the $99 Parallella was born.

● KS combines pre-purchase, community, and marketing.27

Page 28: What I learned building a parallel processor from scratch

The Parallella Computer

● An open parallel computer● Launched in 2012 at $99● Open source SW/HW!● Dual-core ARM A9 processor● FPGA logic● 1GB RAM, USB, HDMI, GigE● 16/64 Epiphany coprocessors● 50 Gbit/sec IO, 25/100

GFLOPS● <5 Watts● 10,000 shipped (20,000 built)● 200 Universities

Page 29: What I learned building a parallel processor from scratch

OPTIMISTIC & STUBBORN

29

NaiveOptimism

BirthdayBottomed Out

PureStubborness

Relaunch

Page 30: What I learned building a parallel processor from scratch

Success?● Nov 2012● Woke up

next morning stressed and behind schedule.

● Only sell what you have!!

30

Page 31: What I learned building a parallel processor from scratch

Powerup (Gen0)● May 2013● My first board and it

worked!● Only minor issues● Challenge was super

aggressive deisgn targets (credit card, features, cost, schedule)

31

Page 32: What I learned building a parallel processor from scratch

Gen0 Release● July 2013● Gen0 board

released!● We build working

cluster with 42 boards!

● Sent out 50 boards to KS backers, only ~1 got significant use!

● Why?32

Page 33: What I learned building a parallel processor from scratch

Chips are back!● Aug 2013● Full mask tapeout● New Tier-1 package

vendor (ASE). ● ~90% yield!● Good thermals● How to test 10,000

chips with $0 NRE?

33

Page 34: What I learned building a parallel processor from scratch

New Investment...● Dec 20, 2013● $3.6M from Ericsson + VC● Saved the company...but

came in too late to save eng team...

● ...and we were still buried in KS comittments and logistics issues...

● 5,000 customers waiting for us to deliver

34

Page 35: What I learned building a parallel processor from scratch

2014 Distractions: Manufacturig, logistics, T-shirts, cases, logistics, power supplies, accessories...what is Adapteva about again?

35

Page 36: What I learned building a parallel processor from scratch

Shipping HW● June 2014● Shipped to 200

Universities & 10,000 developers

● Built the “A1”, the world's densest cluster and launched at ISC14

● Making progress...

36

Page 37: What I learned building a parallel processor from scratch

Now what?

Faulty cores

Cooperating Program(messages)

NODE:64KB1GHz RISC Core2 GFLOPS/core

Physical Constraints:1.5ns/hop latency10pJ / FLOP30pJ / off chip read/write10pJ / on chip end2end(give or take....)

Page 38: What I learned building a parallel processor from scratch

Some Key Hardware Lessons ● Designing leading edge chips is fairly easy today for a small

team with the right experience (<<$100M)● Designing boards is easier than designing chips but not trivial● One bad choice can break a startup (99/100 is a failing grade)● Never push the tools, process, or capabilities of your vendor● Architects should be vertically integrated● Symmetry is a very powerful design principle● Your product is only strong as your weakest vendor partner● Tiled architectures are MAGICAL

38

Page 39: What I learned building a parallel processor from scratch

Some Software Lessons● Customers take the path of least resistance● Developers take the path of least resistance● Surprisingly hard to find applications that need more

performance....

...and those applications are really complex..

..meaning they have tons of legacy stack software...● SW trumps energy efficiency in 99.99% of applications● Parallel programming still very much a niche

39

Page 40: What I learned building a parallel processor from scratch

MY FINAL LESSON:(took me 7 years to learn)

IT'S THE SOFTWARE STUPID!!!!

40

Page 41: What I learned building a parallel processor from scratch

..so we keep pushing the ball up hill

The Parallel Architectures Library ● A new “standard library” for parallel● Compact C library with optimized routines for vector math,

synchronization, and multi-processor communication.● Designed to be portable across multiple ISAs● Open source (apache 2.0 permissive license)● Open invitation to participate!!● https://github.com/parallella/pal

41

Page 42: What I learned building a parallel processor from scratch

What I (still) Know to be True...● The world will continue to be driven by $$● Moore's law WILL come to an end● Parallel computing is inevitable● Architectures like Epiphany are the future● CPUs, FPGAs, and manycore will coexist● Never doubt the ingenuity of an engineer● It's a dirty job but someone has to do it...

4242