Title to be determined - GPU Technology...


Mauro Calderara, Sascha Brück, Mathieu Luisier

Using Today’s Fastest Chips to Design the Chips of Tomorrow


Apr 08 2016 Mauro Calderara 2

Overview

What we want to do → Quantum Transport: electrons and structures

How we do it → How GPUs saved the day

Probably you’re familiar with this

Zooming in

The future?

(link to video: http://iis.ee.ethz.ch/~mauro/movie_SC15.avi)

From a somewhat more abstract POV

[Figure: a box labeled “Device”, surrounded by electrons (e) and a question mark]

This is what we’re ultimately interested in!

How do electrons behave w.r.t. the device?

Change in parameters → change in behavior?

Applies not just to transistors

Batteries

Storage devices

...

[Figure: a box labeled “Device” with electrons (e); parameter labels: gate voltage, dimensions, material properties]

How would we do that? The “easy” case:

→ device behaves like bulk material

How would we do that? The “difficult” case:

→ device behaves like atomic structure

The cost of going small

Why is this “easy” ... and this “difficult”?

Can assume it is “infinite” and use a semi-empirical model.

Very finite! Need to do it from first principles.

[Figure: runtime comparison of the two cases]

Semi-empirical → O(Hours)

First principles → O(Months)

Overview

What we want to do → Quantum Transport: electrons and structures

How we do it → How GPUs saved the day

Where does all that time go?

Solve an eigenvalue problem (not discussed here).

Invert the matrix from before (selectively!) using a recursive algorithm.

[Figure: runtime bar chart, annotated “~ 40x”]

Avoiding the inversion, use a sparse solver instead

Instead of trying to invert selectively, solve the system using a generic sparse solver package

Gain: speed, parallelism, capacity for somewhat larger systems

Cost: code is now memory-bandwidth bound

And: not such a good fit for GPUs ...

[Figure: runtime bar chart, annotated “~ 40x”]
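The “mem-bw bound” remark can be made concrete with the roofline model, in which attainable performance is the smaller of peak compute and memory bandwidth times arithmetic intensity. The following is a minimal sketch; the device figures and the two intensity values are hypothetical placeholders chosen for illustration, not measurements from this work.

```python
def attainable_flops(intensity, peak_flops, mem_bw):
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_flops, mem_bw * intensity)


def is_memory_bound(intensity, peak_flops, mem_bw):
    """A kernel is bandwidth bound if its intensity lies below the ridge point."""
    return intensity < peak_flops / mem_bw


# Hypothetical device: 1 TFLOP/s peak compute, 200 GB/s memory bandwidth.
PEAK = 1.0e12
BW = 2.0e11

sparse_like = 0.25  # FLOP/byte, assumed value for a sparse kernel
dense_like = 50.0   # FLOP/byte, assumed value for a dense kernel

print(attainable_flops(sparse_like, PEAK, BW))  # 5e10 FLOP/s, 5% of the assumed peak
print(attainable_flops(dense_like, PEAK, BW))   # capped at the assumed peak
```

In this model a low-intensity kernel cannot get faster through more compute throughput alone, which is one way to read “not such a good fit for GPUs”.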

Tackling the eigenvalue problem

We’ve been able to solve that one

[Figure: runtime bar charts, annotated “~ 200x”]

Now what?

Good speedup so far (now: O(Days), still not quite there...)

But

Mem-BW bound by sparse solver

[Figure: runtime bar chart, annotated “~ 70x overall”; a picture labeled “Advisor” and “PhD student” with a question mark]
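How per-phase speedups combine into one overall figure can be sketched with Amdahl-style bookkeeping. This reads the “~ 40x” and “~ 200x” annotations as per-phase speedups (an interpretation) and assumes, purely hypothetically, that the original runtime splits evenly between the solver phase and the eigenvalue phase; the even split is an illustration, not a number from the talk.

```python
def overall_speedup(fractions, speedups):
    """Combine per-phase speedups, weighted by each phase's share of the original runtime."""
    assert abs(sum(fractions) - 1.0) < 1e-12
    new_time = sum(f / s for f, s in zip(fractions, speedups))
    return 1.0 / new_time


# Assumed 50/50 split of the original runtime (hypothetical);
# speedups of ~40x (solver phase) and ~200x (eigenvalue phase).
s = overall_speedup([0.5, 0.5], [40.0, 200.0])
print(round(s, 1))  # 66.7
```

Under this assumed split the result lands in the neighborhood of the quoted overall figure, and the phase with the smaller speedup accounts for most of the remaining runtime.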

A Sparse Solver for Transport Problems running on GPUs

Inverting the sparse system is not feasible

In our case: also not necessary

Need first and last block rows only

If we can compute this fast, we can

interleave the solving step with the boundary-condition (BC) computation

obtain the full solution very efficiently

[Figure: schematic of the matrix inverse]
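Why a few blocks of the inverse can suffice is sketched below. The sketch assumes (this is not stated on the slide) that the right-hand side $b$ of the linear system $A x = b$ is nonzero only in its first and last blocks, as happens when the boundary conditions act only on the outermost layers of the device. The symbols $b_1$, $b_N$ and $x$ are introduced here for illustration.

```latex
A x = b, \qquad
b = \begin{pmatrix} b_1 \\ 0 \\ \vdots \\ 0 \\ b_N \end{pmatrix}
\quad\Longrightarrow\quad
x = A^{-1} b
  = \left(A^{-1}\right)_{:,1} b_1 + \left(A^{-1}\right)_{:,N} b_N
```

Here $(A^{-1})_{:,1}$ and $(A^{-1})_{:,N}$ denote the first and last block columns of the inverse. Under this assumption all other blocks of $A^{-1}$ multiply zeros and never need to be formed.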

Obtaining the first and last block columns of the inverse

Recursive algorithm based on the Schwinger-Dyson equation

xGEMM + xGESV + xGEMM

Very fast on accelerators

Parallelizable

for i = N:1
    X_i ← (A_{i,i} − A_{i,i+1} X_{i+1}) \ A_{i,i−1}

for i = 2:N
    Q_i ← −X_i Q_{i−1}

[Figure: block matrix with block indices N−2, N−1, N, N+1, showing the blocks X and A]
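The two sweeps can be tried out on a small block-tridiagonal matrix. The following NumPy sketch computes only the first block column of the inverse and checks it against a dense inverse. It assumes X_{N+1} = 0 and an identity right-hand side for the i = 1 step (neither is spelled out on the slide); the function name, the 0-based list layout and the test sizes are hypothetical. It is an illustration of the recursion, not the authors’ GPU implementation.

```python
import numpy as np


def first_block_column(diag, lower, upper):
    """First block column of inv(A) for a block-tridiagonal A.

    0-based blocks: diag[i] = A[i,i], lower[i] = A[i+1,i], upper[i] = A[i,i+1].
    """
    n = len(diag)
    X = [None] * n
    schur = diag[-1]  # A[N,N] - A[N,N+1] X[N+1], taking X[N+1] = 0
    for i in range(n - 1, 0, -1):  # backward sweep
        X[i] = np.linalg.solve(schur, lower[i - 1])  # xGESV-type solve
        schur = diag[i - 1] - upper[i - 1] @ X[i]    # xGEMM-type update
    # Assumption: the first block uses an identity right-hand side.
    Q = [np.linalg.solve(schur, np.eye(schur.shape[0]))]
    for i in range(1, n):  # forward sweep
        Q.append(-X[i] @ Q[i - 1])  # xGEMM-type product
    return Q


# Small random test problem with a strong block diagonal.
rng = np.random.default_rng(0)
n, b = 5, 3
diag = [rng.standard_normal((b, b)) + 20 * np.eye(b) for _ in range(n)]
lower = [rng.standard_normal((b, b)) for _ in range(n - 1)]
upper = [rng.standard_normal((b, b)) for _ in range(n - 1)]

# Assemble the full matrix only to have a dense reference.
A = np.zeros((n * b, n * b))
for i in range(n):
    A[i * b:(i + 1) * b, i * b:(i + 1) * b] = diag[i]
for i in range(n - 1):
    A[(i + 1) * b:(i + 2) * b, i * b:(i + 1) * b] = lower[i]
    A[i * b:(i + 1) * b, (i + 1) * b:(i + 2) * b] = upper[i]

Q = np.vstack(first_block_column(diag, lower, upper))
matches = np.allclose(Q, np.linalg.inv(A)[:, :b])
print(matches)
```

Every step is a dense solve or a dense matrix product on blocks, which is why the slide lists the kernels as xGEMM and xGESV. The last block column would follow from the mirrored recursion, which is omitted here.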

A Sparse Solver for Transport Problems running on GPUs

Runs on GPUs, compute bound

Interleaves with EV computation

Memory efficient

Much faster than sparse solvers

Whole simulation: O(Hours)

[Figure: Performance [log(FLOPS)] vs. Arithmetic Intensity [log(FLOPS/Byte)], annotated “~ 10x / 80x”]
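Why dense block operations can end up compute bound is illustrated by a rough arithmetic-intensity estimate for a square matrix product. The sketch assumes an idealized traffic model in which each operand is read once and the result is written once, and double-precision real entries of 8 bytes; both are assumptions made for illustration, and the function name is hypothetical.

```python
def gemm_arithmetic_intensity(n, bytes_per_entry=8):
    """FLOP per byte for C = A @ B with n x n matrices under ideal memory traffic."""
    flops = 2 * n**3                      # n^3 multiplications plus n^3 additions
    traffic = 3 * n**2 * bytes_per_entry  # read A, read B, write C
    return flops / traffic


for n in (64, 512, 4096):
    print(n, gemm_arithmetic_intensity(n))
```

In this model the intensity grows linearly with the block size (n/12 for 8-byte entries), so large dense blocks move toward the compute-bound side of a performance versus arithmetic-intensity plot.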

Summary

Transforming a sparse problem to a dense one can be a good thing

Large speedup over state of the art (15x - 150x)

Significant increase in capacity (100’000 atoms → 10x - 100x)

Uses hybrid resources very efficiently (15 PF sustained)

Made ballistic ab-initio QT simulations for realistic structures a reality

(link to video: http://iis.ee.ethz.ch/~mauro/movie_Ag_Switch.avi)

Questions?