On the Meltdown & Spectre Design Flaws
Slides taken from Dr. Mark Hill
With some small changes.
Any errors introduced are my own.
Executive Summary
Architecture 1.0: the timing-independent functional behavior of a computer
Micro-architecture: the implementation techniques to improve performance
Question: What if a computer that is completely correct by Architecture 1.0
can be made to leak protected information via timing, a.k.a., Micro-Architecture?
Implication: The definition of Architecture 1.0 is inadequate to protect information
Meltdown leaks kernel memory, but software & hardware fixes exist
Spectre leaks memory outside of bounds checks or sandboxes, and is scary
Outline
Computer Architecture & Micro-Architecture Review
Timing Side-Channel Attack
Virtual Memory Stuff
Meltdown
Spectre
Wrap-Up
Computer Architecture 0.0 -- Pre-1964
Each Computer was New
● Implemented machine (has mass) → hardware
● Instructions for hardware (no mass) → software
Software Lagged Hardware
● Each new machine design was different
● Software needed to be rewritten in assembly/machine language
● Unimaginable today
Going forward: Need to separate HW interface from implementation
Computer Architecture & Micro-Architecture Review
Computer Architecture 1.0 -- Born 1964
IBM System 360 defined an instruction set architecture
● Stable interface across a family of implementations
● Software did NOT have to be rewritten
Architecture 1.0: the timing-independent functional behavior of a computer
Micro-architecture: implementation techniques that change timing to go fast
Note: The code is not IBM 360 assembly, but is the example used later.
branch (R1 >= bound) goto error
load R2 ← memory[train+R1]
and R3 ← R2 & 0xffff
load R4 ← memory[save+SIZE+R3]
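The four instructions can be sketched in Python (the names train, save, SIZE, and bound come from the example; the concrete values are illustrative, and the sketch shows only the architectural behavior that a speculating core may transiently violate):

```python
# Python sketch of the slide's four-instruction example. Architecturally,
# an out-of-bounds R1 never reaches the loads; Spectre-style attacks rely
# on the core speculatively running them anyway. SIZE and bound values
# here are illustrative.

SIZE = 0x10000                        # 2**16 possible masked values
bound = 16
train = list(range(bound))            # attacker-trainable array
save = [0] * (2 * SIZE)               # probe array indexed by the secret

def gadget(r1):
    if r1 >= bound:                   # branch (R1 >= bound) goto error
        return None                   # error path
    r2 = train[r1]                    # load R2 <- memory[train + R1]
    r3 = r2 & 0xFFFF                  # and  R3 <- R2 & 0xffff
    r4 = save[SIZE + r3]              # load R4 <- memory[save + SIZE + R3]
    return r4
```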
Micro-architecture Harvested Moore’s Law Bounty
For decades, every ~2 years: 2x transistors, 1.4x faster & 1x chip power possible;
2300 transistors for Intel 4004 → millions per core & billions for caches
(Micro-)architects took this ever-doubling budget to make each processor core
execute >100x faster than it otherwise would.
Key techniques w/ tutorial next:
● Instruction Speculation
● Hardware Caching
Hidden by Architecture 1.0: timing-independent functional behavior unchanged
Instruction Speculation Review
Pipelining, branch prediction, & instruction speculation
[Diagram: a multi-cycle pipeline runs add, load, branch; the branch direction (target or fall thru) is predicted, so the following and, store, and load execute speculatively before the branch resolves.]
Speculation correct: Commit architectural changes of and (register) & store (memory); go fast!
Mis-speculate: Abort architectural changes (registers, memory); go in the other branch direction
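The predict-then-commit-or-abort flow starts with branch prediction. A 2-bit saturating counter is the standard textbook predictor (a sketch, not the slides' hardware; the class name is illustrative):

```python
# A minimal 2-bit saturating-counter branch predictor (a standard
# textbook scheme, not code from the slides). Counter states:
# 0,1 predict not-taken; 2,3 predict taken.

class TwoBitPredictor:
    def __init__(self):
        self.counter = 1              # start weakly not-taken

    def predict(self):
        return self.counter >= 2      # True = predict taken

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)
```

The two-bit hysteresis means a loop branch that is taken many times and then falls through once mispredicts only at the loop exit, not again at re-entry.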
Hardware Caching Review
Main Memory (DRAM) too slow
Add Hardware Cache(s): small, transparent hardware memory
E.g., 4-entry direct-mapped cache (entry = block address mod 4):
● Access 12 → Miss; insert 12 into entry 0
● Access 07 → Miss; insert 07 into entry 3
● Access 12 → HIT! No changes
● Access 16 → Miss; victim 12, insert 16 into entry 0
● Note 12 victimized “early” due to “alias” (12 and 16 map to the same entry)
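The walkthrough above can be reproduced with a toy model (a sketch of a direct-mapped cache, with entry = block address mod 4 as in the example; the class name is illustrative):

```python
# A tiny direct-mapped cache model (a sketch, not the slide's hardware)
# reproducing the 4-entry walkthrough: entry = block address mod 4.

class DirectMappedCache:
    def __init__(self, entries=4):
        self.lines = [None] * entries     # cold cache: all entries empty

    def access(self, addr):
        idx = addr % len(self.lines)      # direct-mapped placement
        if self.lines[idx] == addr:
            return "hit"                  # no state change on a hit
        victim = self.lines[idx]          # None means a cold miss
        self.lines[idx] = addr            # insert, evicting any victim
        return "miss" if victim is None else f"miss, victim {victim}"
```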
Micro-architecture Harvested Moore’s Law Bounty
Hidden by Architecture 1.0: timing-independent functional behavior unchanged
branch (R1 >= bound) goto error ; Speculate branch not taken
load R2 ← memory[train+R1] ; Speculate load & speculate cache hit
and R3 ← R2 & 0xffff ; Speculate AND
load R4 ← memory[save+SIZE+R3] ; Speculate load & speculate cache hit
Whither Computer Architecture 1.0?
Architecture 1.0: timing-independent functional behavior
Question: What if a computer that is completely correct by Architecture 1.0
can be made to leak protected information via timing, a.k.a., micro-architecture?
Implication: The definition of Architecture 1.0 is inadequate to protect information
This is what Meltdown and Spectre do. Let's see why and explore implications.
Side channel attack
● A side-channel attack is any attack based on information gained from the
implementation of a computer system (Wikipedia)
● Example:
○ Encryption algorithms like AES do a lot of XOR operations against a secret key.
○ Suppose you can get the device to XOR values you control against the key, AND XOR consumes more power when a result bit is “1” than when it is “0”.
○ Then you can just feed the device values to XOR the key with (say 0x01, then 0x02, then 0x04, etc.) and watch the power consumption (really closely).
○ You can figure out the key!
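That attack can be simulated end to end under a hypothetical Hamming-weight power model (power proportional to the number of 1 bits in the XOR result; a real attack measures analog power draw, and all names below are illustrative):

```python
# Simulated power side channel under a hypothetical Hamming-weight power
# model: the "device" consumes power proportional to the number of 1 bits
# in the XOR result. Feeding one-hot probes 0x01, 0x02, ... and comparing
# power against an all-zero probe recovers each key bit, as described.

SECRET_KEY = 0b10110010               # the value the "device" hides

def device_power(probe):
    # Power consumed XOR-ing probe with the key: popcount of the result.
    return bin(SECRET_KEY ^ probe).count("1")

def recover_key(bits=8):
    key = 0
    base = device_power(0)            # baseline power, all-zero probe
    for i in range(bits):
        # Flipping bit i RAISES power if the key bit was 0 (a 0 became 1)
        # and LOWERS it if the key bit was 1 (a 1 became 0).
        if device_power(1 << i) < base:
            key |= 1 << i
    return key
```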
Timing Side-Channel Attack
Other weird side-channel attacks
● Time to do a computation might vary depending on values.
○ There was one attack that watched a blinking LED on the device to measure delays and thus
get secret information (with a really high-speed photodetector).
○ Old Unix machines took a variable amount of time to check to see if a password was correct
based on the password tried and the actual password.
■ So a timing attack could reveal the actual password.
● Often not too hard to fix
○ “Just” make everything take the same amount of time (Slower, but safer).
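The password case is easy to demonstrate and to fix: compare an early-exit check with a constant-time one (a sketch; in real Python code, hmac.compare_digest in the standard library provides the constant-time primitive):

```python
# An early-exit compare leaks, via timing, how many leading characters
# matched; a constant-time compare always examines every byte.
# (Python's hmac.compare_digest does this for real code.)

def leaky_equal(guess, password):
    if len(guess) != len(password):
        return False
    for g, p in zip(guess, password):
        if g != p:                    # early exit: time reveals match length
            return False
    return True

def constant_time_equal(guess, password):
    if len(guess) != len(password):
        return False
    diff = 0
    for g, p in zip(guess, password): # always walks the full string
        diff |= ord(g) ^ ord(p)       # accumulate any mismatched bits
    return diff == 0
```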
Basic idea
1. Load two different memory locations
○ Data gets in the cache (assume direct-mapped) in two different lines (call them line “A” and line “B”).
2. Load data you aren’t allowed to access.
○ Call that data the secret.
○ Will cause a fault
○ But only once the load hits the head of the RoB!
3. Load a memory location based on one bit of that secret
○ Do it so the load lands in either “A” or “B”, kicking out the data there.
4. Fault happens. Recover as normal.
○ All instructions after the fault are nuked. Almost as if they had never happened.
○ But the cache has changed!
5. Now load the original data
○ One of the two loads will be slow.
○ Now you know one bit of the secret.
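The five steps can be simulated with a toy model (a sketch: the cache is a dict of line to tag, and "slow" just means "my tag is gone"; a real attack times actual loads, and the secret-dependent load runs speculatively before the fault is delivered):

```python
# Toy simulation of the five steps. "Slow" is modeled as "the line no
# longer holds my tag" rather than a measured load latency.

cache = {}                                    # cache line -> tag it holds

def load(line, tag):
    slow = cache.get(line) != tag             # evicted or cold -> slow
    cache[line] = tag                         # the load (re)fills the line
    return slow

def leak_one_bit(secret_bit):
    cache.clear()
    load("A", "mine")                         # 1. prime line A
    load("B", "mine")                         #    ...and line B
    line = "A" if secret_bit == 0 else "B"    # 2-3. speculative load indexed
    cache[line] = "secret"                    #      by the secret evicts my data
    # 4. the fault nukes the architectural results, but not the cache state
    slow_a = load("A", "mine")                # 5. re-time both lines:
    slow_b = load("B", "mine")                #    exactly one is slow
    return 0 if slow_a else 1                 # the slow line reveals the bit
```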
But if you aren’t allowed to access the data…
● There are a bunch of fun things going on at once.
○ The processor (at least on Intel machines and maybe some ARM machines) will grab the data even if you don’t have permission to do so.
○ The page table for a user process generally includes mappings to a bunch of things that the user process isn’t allowed to touch
■ All of the kernel’s stuff
■ All of physical memory
● So you can go hunting for data.
Meltdown (https://meltdownattack.com/meltdown.pdf)
Can leak the contents of kernel memory at up to 500KB/s
Meltdown
Meltdown & Hardware
Demonstrated for many Intel x86-64 cores; NOT demonstrated for AMD
Key: When to suppress load with protection violation (user load to kernel memory)
● EARLY: AMD appears to suppress early, e.g., at TLB access
● LATE: Intel appears to suppress at end after micro-arch state changes
A SWAG (Scientific Wild A** Guess) why
● Both are correct by Architecture 1.0
● Performance shouldn’t matter as this case is supposed to be rare
● Do what’s easiest, and have good luck (AMD) or bad luck (Intel)
Meltdown & Software
Bad: Meltdown operates with bug-free OS software (by Architecture 1.0)
Good: Major commercial OSs patched for Meltdown ~January 2018
Idea: Don’t map (much of the) protected kernel address space in a user process
● Offending load now fails address translation & does nothing
● Patches quickly derived from KAISER, developed for side-channel attacks on Kernel Address Space Layout Randomization (KASLR)
● Performance impact 0-30%, depending on syscall frequency & core model
Future hardware can fix Meltdown (like AMD) so maybe we dodged a bullet
Need Computer Architecture 2.0?
With Meltdown & Spectre, Architecture 1.0 is inadequate to protect information
Augment Architecture 1.0 with Architecture 2.0 specification of
● (Abstraction of) time-visible micro-architecture?
● Bandwidth of known (unknown?) timing channels?
Change Microarchitecture to mitigate timing channel bandwidth
● Suppress some speculation
● Undo most changes on mis-speculation
Can this be (formally) solved or must it be managed like crime?
Wrap up
Need Computer Architecture 2.0?
More generally, can we reduce our dependence on SPECULATION?
Executive Summary
Architecture 1.0: the timing-independent functional behavior of a computer
Micro-architecture: the implementation techniques to improve performance
Question: What if a computer that is completely correct by Architecture 1.0
can be made to leak protected information via timing, a.k.a., Micro-Architecture?
Implication: The definition of Architecture 1.0 is inadequate to protect information
Meltdown leaks kernel memory, but software & hardware fixes exist
Spectre leaks memory outside of bounds checks or sandboxes, and is scary
Some References
New York Times: https://www.nytimes.com/2018/01/03/business/computer-flaws.html
Meltdown paper: https://meltdownattack.com/meltdown.pdf
Spectre paper: https://spectreattack.com/spectre.pdf
A blog separating the two bugs: https://danielmiessler.com/blog/simple-explanation-difference-meltdown-spectre/
Google Blog: https://security.googleblog.com/2018/01/todays-cpu-vulnerability-what-you-need.html and
https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html
Industry News Sources: https://arstechnica.com/gadgets/2018/01/whats-behind-the-intel-design-flaw-forcing-numerous-patches/ and https://www.theregister.co.uk/2018/01/02/intel_cpu_design_flaw/
Final Exam
And other closing stuff
● First line of the TCL script:
○ set hdlin_ff_always_sync_set_reset true
○ Helps deal with reset problems.
● What is it doing?
○ http://www.sunburst-design.com/papers/CummingsSNUG2003Boston_Resets.pdf page 9 covers this.
○ By the way, Cliff Cummings (first author on that report) is the best resource for Verilog advice and the like. I rely on his stuff for nearly any deep problem I find.
● The inputs to both legs of the MUX can be forced to 0 by holding rst_n asserted low; however, if ld is unknown (X) and the MUX model is pessimistic, then the flops will stay unknown (X) rather than being reset.
● Why would the model be “pessimistic”?
○ Hey, another Sutherland paper: http://www.sutherland-hdl.com/papers/2013-DVCon_In-love-with-my-X_paper.pdf
○ I’m not sure I get why you’d want to be pessimistic in this case, but sure, I can see the tool thinking the value out of the MUX might be X if ld is X.
Review/Q&A stuff
● Wednesday 4/18 9:30-11:30am.
○ Room TBA
● Sunday 4/22
○ Time and date TBA.
Coverage.
● Expect a Tomasulo’s “fill in the boxes” question at the end.
○ Be sure to look over the algorithms you aren’t using
○ And even review the one you are (you may have made some changes)
● Expect a focus on:
○ Power
○ Multi-core cache coherence
○ Static optimizations
Class summary
• Major topics
– ILP in hardware (Out-of-order processors)
• How they work AND why we use them
– Caches and Virtual Memory
– Multi-processor
– ILP in software (Compiler, IA-64)
– Power
• Less major topics
– Memory disambiguation
– Branch prediction
• Direction and target
– Advanced OoO issues
• Superscalar, instruction scheduling, multi-threading, etc.
The big questions
• What is computer architecture?
• What are the metrics of performance?
• What are the techniques we use to maximize these
metrics?
ILP in hardware (1/2)
• ILP definitions
– Hazards vs. dependencies
• Data, Name and Control dependencies
– What ILP means and finding it.
• Dynamic Scheduling
– Tomasulo’s (three versions!)
• You can be promised a question on this!
• Branch Prediction
– Local, global, hybrid/correlating
• Tournament and gshare
– BTBs
ILP in hardware (2/2)
• Multiple Issue
– Static
• Static Superscalar
• VLIW
– Dynamic superscalar
• Speculation
– Branch, data
• ILP limit studies
ILP in hardware: Questions
• True or False
1. The original T-algorithm only allows reordering within basic blocks
2. In P6, if it weren’t for precise interrupts, it would be okay to retire
instructions out-of-order as long as they had finished executing and a
branch isn’t skipped over.
3. ILP in hardware is limited in scope due to the “instruction window” which
is basically the size of the RS.
Quick idea: SMT
• One processor, two threads.
Caching (1/2)
• There is a huge amount of stuff associated with caching. The important stuff:
– Locality
• Temporal/Spatial
• 3’Cs model
• Stack distance model
– Nuts-and-bolts
• Replacement policies (LRU, pseudo-LRU)
• Performance (hit rate, Thit; Tmiss, average access time)
• Write back/Write thru
• Block size
– Basic improvement
• Multi-level cache
• Critical word first
• Write buffers
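The “Performance” bullet boils down to the standard average memory access time formula, AMAT = T_hit + miss_rate × T_miss_penalty; a quick sketch (the function name and cycle counts are illustrative, not from the slides):

```python
# Average memory access time (AMAT), the standard formula behind the
# "Performance" bullet: AMAT = T_hit + miss_rate * T_miss_penalty.
# The latencies and miss rates below are illustrative.

def amat(t_hit, miss_rate, t_miss_penalty):
    return t_hit + miss_rate * t_miss_penalty

# 2-cycle hit, 5% miss rate, 100-cycle miss penalty:
#   2 + 0.05 * 100 = 7 cycles on average.
```

Multi-level caches nest naturally: the L1 miss penalty is itself the L2's AMAT.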
Caching (2/2)
• Non-standard caches
– Hash
– Victim
– Skew
• Misc.
– Virtual addresses and caching
– Impact of prefetching
– Latency hiding with OO execution
Cache: Questions (1/2)
• Changing __________ has an impact on compulsory misses.
• A victim cache is more likely to help with ________ than ________ though it can help both (3’Cs)
• At least _____ bits are required to keep exact track of LRU in a 5-way associative cache.
Cache question (2/2)
• A ____________ cache has a number of sets equal to
the number of lines in the cache.
• A fully-associative cache with N lines will miss an access
that has a stack distance of ________ (state the largest
range you can).
Multi-processor
• Amdahl’s law as it applies to MP.
• Bus-based multi-processor
– Snooping
– MESI
– Bus transaction types (BRL etc.)
• Distributed-shared
– Directory schemes
• Synchronization
– Critical sections
– Spin-locks
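Amdahl's law as it applies to MP can be written as one line (the function name and the numbers are illustrative): with parallel fraction p on n cores, speedup = 1 / ((1 - p) + p / n).

```python
# Amdahl's law for a multi-processor: if a fraction p of the work is
# parallelizable across n cores, the overall speedup is
#   1 / ((1 - p) + p / n).

def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)
```

Even with p = 0.95, speedup saturates near 20x no matter how many cores are added: the serial fraction dominates.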
Multi-processor: Question
• Under the MESI protocol what is the
advantage of having a distinct clean and
dirty exclusive state?
Software techniques for ILP (1/2)
• Pipeline scheduling
– Reordering instructions in a basic block to remove pipe stalls
– Loop unrolling
• Static information passed to processor
– Static branch prediction
– Static dependence information
• Loop issues
– Detecting loop dependencies
– Software pipelining
Software techniques for ILP (2/2)
• Global code scheduling
– Predicated instructions and CMOV
– Memory reference speculation
– Issues with preserving exception behavior
• IA-64 as a case study of hardware support for software ILP
techniques
– Speculative loads
– Advanced loads
– Software pipelining optimizations
Software techniques for ILP: Questions
• What is the most significant disadvantage of
loop unrolling?
• Using CMOV re-write the following code
snippet, removing the branch. Don’t change
exception behavior and assume DIV only
causes an exception if R3=0
BNE R1 R2 skip
R1=R2/R3
skip: nop
Power
• Understand why it’s important
• Power vs. Energy
• How it’s related to the existence of multi-core
• Understand voltage scaling issues