On the Meltdown & Spectre Design Flaws
Slides taken from Dr. Mark Hill
With some small changes.
Any errors introduced are my own.
Executive Summary
Architecture 1.0: the timing-independent functional behavior of a computer
Micro-architecture: the implementation techniques to improve performance
Question: What if a computer that is completely correct by Architecture 1.0
can be made to leak protected information via timing, a.k.a., Micro-Architecture?
Implication: The definition of Architecture 1.0 is inadequate to protect information
Meltdown leaks kernel memory, but software & hardware fixes exist
Spectre leaks memory outside of bounds checks or sandboxes, and is scary
Outline
Computer Architecture & Micro-Architecture Review
Timing Side-Channel Attack
Virtual Memory Stuff
Meltdown
Spectre
Wrap-Up
Computer Architecture 0.0 -- Pre-1964
Each Computer was New
● Implemented machine (has mass) → hardware
● Instructions for hardware (no mass) → software
Software Lagged Hardware
● Each new machine design was different
● Software needed to be rewritten in assembly/machine language
● Unimaginable today
Going forward: Need to separate HW interface from implementation
Computer Architecture & Micro-Architecture Review
Computer Architecture 1.0 -- Born 1964
IBM System 360 defined an instruction set architecture
● Stable interface across a family of implementations
● Software did NOT have to be rewritten
Architecture 1.0: the timing-independent functional behavior of a computer
Micro-architecture: implementation techniques that change timing to go fast
Note: The code is not IBM 360 assembly, but is the example used later.
branch (R1 >= bound) goto error
load R2 ← memory[train+R1]
and R3 ← R2 & 0xffff
load R4 ← memory[save+SIZE+R3]
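The four instructions can be sketched in Python (the names train, save, SIZE, and bound come from the example; the concrete values are illustrative, and the sketch shows only the architectural behavior that a speculating core may transiently violate):

```python
# Python sketch of the slide's four-instruction example. Architecturally,
# an out-of-bounds R1 never reaches the loads; Spectre-style attacks rely
# on the core speculatively running them anyway. SIZE and bound values
# here are illustrative.

SIZE = 0x10000                        # 2**16 possible masked values
bound = 16
train = list(range(bound))            # attacker-trainable array
save = [0] * (2 * SIZE)               # probe array indexed by the secret

def gadget(r1):
    if r1 >= bound:                   # branch (R1 >= bound) goto error
        return None                   # error path
    r2 = train[r1]                    # load R2 <- memory[train + R1]
    r3 = r2 & 0xFFFF                  # and  R3 <- R2 & 0xffff
    r4 = save[SIZE + r3]              # load R4 <- memory[save + SIZE + R3]
    return r4
```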
Micro-architecture Harvested Moore’s Law Bounty
For decades, every ~2 years: 2x transistors, 1.4x faster & 1x chip power possible;
2300 transistors for Intel 4004 → millions per core & billions for caches
(Micro-)architects took this ever-doubling budget to make each processor core
execute >100x faster than it otherwise would.
Key techniques w/ tutorial next:
● Instruction Speculation
● Hardware Caching
Hidden by Architecture 1.0: timing-independent functional behavior unchanged
Instruction Speculation Review
Pipelining, branch prediction, & instruction speculation
[Diagram: a multi-cycle pipeline runs add, load, branch; the branch direction (target or fall thru) is predicted, so the following and, store, and load execute speculatively before the branch resolves.]
Speculation correct: Commit architectural changes of and (register) & store (memory); go fast!
Mis-speculate: Abort architectural changes (registers, memory); go in the other branch direction
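The predict-then-commit-or-abort flow starts with branch prediction. A 2-bit saturating counter is the standard textbook predictor (a sketch, not the slides' hardware; the class name is illustrative):

```python
# A minimal 2-bit saturating-counter branch predictor (a standard
# textbook scheme, not code from the slides). Counter states:
# 0,1 predict not-taken; 2,3 predict taken.

class TwoBitPredictor:
    def __init__(self):
        self.counter = 1              # start weakly not-taken

    def predict(self):
        return self.counter >= 2      # True = predict taken

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)
```

The two-bit hysteresis means a loop branch that is taken many times and then falls through once mispredicts only at the loop exit, not again at re-entry.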
Hardware Caching Review
Main Memory (DRAM) too slow
Add Hardware Cache(s): small, transparent hardware memory
E.g., 4-entry direct-mapped cache (entry = block address mod 4):
● Access 12 → Miss; insert 12 into entry 0
● Access 07 → Miss; insert 07 into entry 3
● Access 12 → HIT! No changes
● Access 16 → Miss; victim 12, insert 16 into entry 0
● Note 12 victimized “early” due to “alias” (12 and 16 map to the same entry)
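The walkthrough above can be reproduced with a toy model (a sketch of a direct-mapped cache, with entry = block address mod 4 as in the example; the class name is illustrative):

```python
# A tiny direct-mapped cache model (a sketch, not the slide's hardware)
# reproducing the 4-entry walkthrough: entry = block address mod 4.

class DirectMappedCache:
    def __init__(self, entries=4):
        self.lines = [None] * entries     # cold cache: all entries empty

    def access(self, addr):
        idx = addr % len(self.lines)      # direct-mapped placement
        if self.lines[idx] == addr:
            return "hit"                  # no state change on a hit
        victim = self.lines[idx]          # None means a cold miss
        self.lines[idx] = addr            # insert, evicting any victim
        return "miss" if victim is None else f"miss, victim {victim}"
```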
Micro-architecture Harvested Moore’s Law Bounty
Hidden by Architecture 1.0: timing-independent functional behavior unchanged
branch (R1 >= bound) goto error ; Speculate branch not taken
load R2 ← memory[train+R1] ; Speculate load & speculate cache hit
and R3 ← R2 & 0xffff ; Speculate AND
load R4 ← memory[save+SIZE+R3] ; Speculate load & speculate cache hit
Whither Computer Architecture 1.0?
Architecture 1.0: timing-independent functional behavior
Question: What if a computer that is completely correct by Architecture 1.0
can be made to leak protected information via timing, a.k.a., micro-architecture?
Implication: The definition of Architecture 1.0 is inadequate to protect information
This is what Meltdown and Spectre do. Let's see why and explore implications.
Side channel attack
● A side-channel attack is any attack based on information gained from the
implementation of a computer system (Wikipedia)
● Example:
○ Encryption algorithms like AES do a lot of XOR operations against a secret key.
○ Suppose you can get the device to XOR values you control against the key, AND XOR consumes more power when a result bit is “1” than when it is “0”.
○ Then you can just feed the device values to XOR the key with (say 0x01, then 0x02, then 0x04, etc.) and watch the power consumption (really closely).
○ You can figure out the key!
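That attack can be simulated end to end under a hypothetical Hamming-weight power model (power proportional to the number of 1 bits in the XOR result; a real attack measures analog power draw, and all names below are illustrative):

```python
# Simulated power side channel under a hypothetical Hamming-weight power
# model: the "device" consumes power proportional to the number of 1 bits
# in the XOR result. Feeding one-hot probes 0x01, 0x02, ... and comparing
# power against an all-zero probe recovers each key bit, as described.

SECRET_KEY = 0b10110010               # the value the "device" hides

def device_power(probe):
    # Power consumed XOR-ing probe with the key: popcount of the result.
    return bin(SECRET_KEY ^ probe).count("1")

def recover_key(bits=8):
    key = 0
    base = device_power(0)            # baseline power, all-zero probe
    for i in range(bits):
        # Flipping bit i RAISES power if the key bit was 0 (a 0 became 1)
        # and LOWERS it if the key bit was 1 (a 1 became 0).
        if device_power(1 << i) < base:
            key |= 1 << i
    return key
```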
Timing Side-Channel Attack
Other weird side-channel attacks
● Time to do a computation might vary depending on values.
○ There was one attack that watched a blinking LED on the device to measure delays and thus
get secret information (with a really high-speed photodetector).
○ Old Unix machines took a variable amount of time to check to see if a password was correct
based on the password tried and the actual password.
■ So a timing attack could reveal the actual password.
● Often not too hard to fix
○ “Just” make everything take the same amount of time (Slower, but safer).
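The password case is easy to demonstrate and to fix: compare an early-exit check with a constant-time one (a sketch; in real Python code, hmac.compare_digest in the standard library provides the constant-time primitive):

```python
# An early-exit compare leaks, via timing, how many leading characters
# matched; a constant-time compare always examines every byte.
# (Python's hmac.compare_digest does this for real code.)

def leaky_equal(guess, password):
    if len(guess) != len(password):
        return False
    for g, p in zip(guess, password):
        if g != p:                    # early exit: time reveals match length
            return False
    return True

def constant_time_equal(guess, password):
    if len(guess) != len(password):
        return False
    diff = 0
    for g, p in zip(guess, password): # always walks the full string
        diff |= ord(g) ^ ord(p)       # accumulate any mismatched bits
    return diff == 0
```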
Basic idea
1. Load two different memory locations
○ Data gets in the cache (assume direct-mapped) in two different lines (call them line “A” and line “B”).
2. Load data you aren’t allowed to access.
○ Call that data the secret.
○ Will cause a fault
○ But only once the load hits the head of the RoB!
3. Load a memory location based on one bit of that secret
○ Do it so the load lands in either “A” or “B”, kicking out the data there.
4. Fault happens. Recover as normal.
○ All instructions after the fault are nuked. Almost as if they had never happened.
○ But the cache has changed!
5. Now load the original data
○ One of the two loads will be slow.
○ Now you know one bit of the secret.
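The five steps can be simulated with a toy model (a sketch: the cache is a dict of line to tag, and "slow" just means "my tag is gone"; a real attack times actual loads, and the secret-dependent load runs speculatively before the fault is delivered):

```python
# Toy simulation of the five steps. "Slow" is modeled as "the line no
# longer holds my tag" rather than a measured load latency.

cache = {}                                    # cache line -> tag it holds

def load(line, tag):
    slow = cache.get(line) != tag             # evicted or cold -> slow
    cache[line] = tag                         # the load (re)fills the line
    return slow

def leak_one_bit(secret_bit):
    cache.clear()
    load("A", "mine")                         # 1. prime line A
    load("B", "mine")                         #    ...and line B
    line = "A" if secret_bit == 0 else "B"    # 2-3. speculative load indexed
    cache[line] = "secret"                    #      by the secret evicts my data
    # 4. the fault nukes the architectural results, but not the cache state
    slow_a = load("A", "mine")                # 5. re-time both lines:
    slow_b = load("B", "mine")                #    exactly one is slow
    return 0 if slow_a else 1                 # the slow line reveals the bit
```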
But if you aren’t allowed to access the data…
● There are a bunch of fun things going on at once.
○ The processor (at least on Intel machines and maybe some ARM machines) will grab the data even if you don’t have permission to do so.
○ The page table for a user process generally includes mappings to a bunch of things that the user process isn’t allowed to touch
■ All of the kernel’s stuff
■ All of physical memory
● So you can go hunting for data.
Meltdown (https://meltdownattack.com/meltdown.pdf)
Can leak the contents of kernel memory at up to 500KB/s
Meltdown
Meltdown & Hardware
Demonstrated for many Intel x86-64 cores; NOT demonstrated for AMD
Key: When to suppress load with protection violation (user load to kernel memory)
● EARLY: AMD appears to suppress early, e.g., at TLB access
● LATE: Intel appears to suppress at end after micro-arch state changes
A SWAG (Scientific Wild A** Guess) why
● Both are correct by Architecture 1.0
● Performance shouldn’t matter as this case is supposed to be rare
● Do what’s easiest, and have good luck (AMD) or bad luck (Intel)
Meltdown & Software
Bad: Meltdown operates with bug-free OS software (by Architecture 1.0)
Good: Major commercial OSs patched for Meltdown ~January 2018
Idea: Don’t map (much of the) protected kernel address space in a user process
● Offending load now fails address translation & does nothing
● Patches quickly derived from KAISER, developed for side-channel attacks on Kernel Address Space Layout Randomization (KASLR)
● Performance impact 0-30%, depending on syscall frequency & core model
Future hardware can fix Meltdown (like AMD) so maybe we dodged a bullet
Need Computer Architecture 2.0?
With Meltdown & Spectre, Architecture 1.0 is inadequate to protect information
Augment Architecture 1.0 with Architecture 2.0 specification of
● (Abstraction of) time-visible micro-architecture?
● Bandwidth of known (unknown?) timing channels?
Change Microarchitecture to mitigate timing channel bandwidth
● Suppress some speculation
● Undo most changes on mis-speculation
Can this be (formally) solved or must it be managed like crime?
Wrap up
Need Computer Architecture 2.0?
More generally, can we reduce our dependence on SPECULATION?
Executive Summary
Architecture 1.0: the timing-independent functional behavior of a computer
Micro-architecture: the implementation techniques to improve performance
Question: What if a computer that is completely correct by Architecture 1.0
can be made to leak protected information via timing, a.k.a., Micro-Architecture?
Implication: The definition of Architecture 1.0 is inadequate to protect information
Meltdown leaks kernel memory, but software & hardware fixes exist
Spectre leaks memory outside of bounds checks or sandboxes, and is scary
Some References
New York Times: https://www.nytimes.com/2018/01/03/business/computer-flaws.html
Meltdown paper: https://meltdownattack.com/meltdown.pdf
Spectre paper: https://spectreattack.com/spectre.pdf
A blog separating the two bugs: https://danielmiessler.com/blog/simple-explanation-difference-meltdown-spectre/
Google Blog: https://security.googleblog.com/2018/01/todays-cpu-vulnerability-what-you-need.html and
https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html
Industry News Sources: https://arstechnica.com/gadgets/2018/01/whats-behind-the-intel-design-flaw-forcing-numerous-patches/ and https://www.theregister.co.uk/2018/01/02/intel_cpu_design_flaw/
Final Exam
And other closing stuff
● First line of the TCL script:
○ set hdlin_ff_always_sync_set_reset true
○ Helps deal with reset problems.
● What is it doing?
○ http://www.sunburst-design.com/papers/CummingsSNUG2003Boston_Resets.pdf page 9 covers this.
○ By the way, Cliff Cummings (first author on that report) is the best resource for Verilog advice and the like. I rely on his stuff for nearly any deep problem I find.
● The inputs to both legs of the MUX can be forced to 0 by holding rst_n asserted low; however, if ld is unknown (X) and the MUX model is pessimistic, then the flops will stay unknown (X) rather than being reset.
● Why would the model be “pessimistic”?
○ Hey, another Sutherland paper: http://www.sutherland-hdl.com/papers/2013-DVCon_In-love-with-my-X_paper.pdf
○ I’m not sure I get why you’d want to be pessimistic in this case, but sure, I can see the tool thinking the value out of the MUX might be X if ld is X.
Review/Q&A stuff
● Wednesday 4/18 9:30-11:30am.
○ Room TBA
● Sunday 4/22
○ Time and date TBA.
Coverage.
● Expect a Tomasulo’s “fill in the boxes” question at the end.
○ Be sure to look over the algorithms you aren’t using
○ And even review the one you are (you may have made some changes)
● Expect a focus on:
○ Power
○ Multi-core cache coherence
○ Static optimizations
Class summary
• Major topics
– ILP in hardware (Out-of-order processors)
• How they work AND why we use them
– Caches and Virtual Memory
– Multi-processor
– ILP in software (Compiler, IA-64)
– Power
• Less major topics
– Memory disambiguation
– Branch prediction
• Direction and target
– Advanced OoO issues
• Superscalar, instruction scheduling, multi-threading, etc.
The big questions
• What is computer architecture?
• What are the metrics of performance?
• What are the techniques we use to maximize these
metrics?
ILP in hardware (1/2)
• ILP definitions
– Hazards vs. dependencies
• Data, Name and Control dependencies
– What ILP means and finding it.
• Dynamic Scheduling
– Tomasulo’s (three versions!)
• You can be promised a question on this!
• Branch Prediction
– Local, global, hybrid/correlating
• Tournament and gshare
– BTBs
ILP in hardware (2/2)
• Multiple Issue
– Static
• Static Superscalar
• VLIW
– Dynamic superscalar
• Speculation
– Branch, data
• ILP limit studies
ILP in hardware: Questions
• True or False
1. The original T-algorithm only allows reordering within basic blocks
2. In P6, if it weren’t for precise interrupts, it would be okay to retire
instructions out-of-order as long as they had finished executing and a
branch isn’t skipped over.
3. ILP in hardware is limited in scope due to the “instruction window” which
is basically the size of the RS.
Quick idea: SMT
• One processor, two threads.
Caching (1/2)
• There is a huge amount of stuff associated with caching. The important stuff:
– Locality
• Temporal/Spatial
• 3’Cs model
• Stack distance model
– Nuts-and-bolts
• Replacement policies (LRU, pseudo-LRU)
• Performance (hit rate, Thit; Tmiss, average access time)
• Write back/Write thru
• Block size
– Basic improvement
• Multi-level cache
• Critical word first
• Write buffers
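The “Performance” bullet boils down to the standard average memory access time formula, AMAT = T_hit + miss_rate × T_miss_penalty; a quick sketch (the function name and cycle counts are illustrative, not from the slides):

```python
# Average memory access time (AMAT), the standard formula behind the
# "Performance" bullet: AMAT = T_hit + miss_rate * T_miss_penalty.
# The latencies and miss rates below are illustrative.

def amat(t_hit, miss_rate, t_miss_penalty):
    return t_hit + miss_rate * t_miss_penalty

# 2-cycle hit, 5% miss rate, 100-cycle miss penalty:
#   2 + 0.05 * 100 = 7 cycles on average.
```

Multi-level caches nest naturally: the L1 miss penalty is itself the L2's AMAT.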
Caching (2/2)
• Non-standard caches
– Hash
– Victim
– Skew
• Misc.
– Virtual addresses and caching
– Impact of prefetching
– Latency hiding with OO execution
Cache: Questions (1/2)
• Changing __________ has an impact on compulsory misses.
• A victim cache is more likely to help with ________ than ________ though it can help both (3’Cs)
• At least _____ bits are required to keep exact track of LRU in a 5-way associative cache.
Cache question (2/2)
• A ____________ cache has a number of sets equal to
the number of lines in the cache.
• A fully-associative cache with N lines will miss an access
that has a stack distance of ________ (state the largest
range you can).
Multi-processor
• Amdahl’s law as it applies to MP.
• Bus-based multi-processor
– Snooping
– MESI
– Bus transaction types (BRL etc.)
• Distributed-shared
– Directory schemes
• Synchronization
– Critical sections
– Spin-locks
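Amdahl's law as it applies to MP can be written as one line (the function name and the numbers are illustrative): with parallel fraction p on n cores, speedup = 1 / ((1 - p) + p / n).

```python
# Amdahl's law for a multi-processor: if a fraction p of the work is
# parallelizable across n cores, the overall speedup is
#   1 / ((1 - p) + p / n).

def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)
```

Even with p = 0.95, speedup saturates near 20x no matter how many cores are added: the serial fraction dominates.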
Multi-processor: Question
• Under the MESI protocol what is the
advantage of having a distinct clean and
dirty exclusive state?
Software techniques for ILP (1/2)
• Pipeline scheduling
– Reordering instructions in a basic block to remove pipe stalls
– Loop unrolling
• Static information passed to processor
– Static branch prediction
– Static dependence information
• Loop issues
– Detecting loop dependencies
– Software pipelining
Software techniques for ILP (2/2)
• Global code scheduling
– Predicated instructions and CMOV
– Memory reference speculation
– Issues with preserving exception behavior
• IA-64 as a case study of hardware support for software ILP
techniques
– Speculative loads
– Advanced loads
– Software pipelining optimizations
Software techniques for ILP: Questions
• What is the most significant disadvantage of
loop unrolling?
• Using CMOV re-write the following code
snippet, removing the branch. Don’t change
exception behavior and assume DIV only
causes an exception if R3=0
BNE R1 R2 skip
R1=R2/R3
skip: nop
Power
• Understand why it’s important
• Power vs. Energy
• How it’s related to the existence of multi-core
• Understand voltage scaling issues