The Emerging Secure Processor Designs Youtao Zhang University of Pittsburgh


Outline

Background: Why design secure processors?

Secure processor design: What can secure processors do?

Future trends: What will future secure processors look like?

Cyber Security is Important

Widespread Internet and network access: good connectivity, but easier to attack

Cyber attacks are a real problem: more and more people are affected, and attacks are a leading cause of financial losses

Software piracy: $29B in 2003; viruses: $55B in 2003

More severe for mission-critical applications, e.g., on battlefields

Various Types of Security Attacks

Intellectual property theft: illegally copying software

Viruses/worms: triggered by a special event, a malicious program can do harmful things

Trojans: accessing the computer through a back door

Denial of service

Limitations of Software-Based Designs

Current practices:

Serial codes: make piracy a little more difficult, but not much

Installing/updating anti-virus software: defends only against known viruses

Firewalls/intrusion detection: configuration is not easy; false positives and false negatives

Secure software design: performance is an issue

Enhance Security through Hardware

Secure co-processors: speed up cryptographic operations, e.g., the IBM 4758 secure coprocessor

Secure embedded processors: tamper-resistant registers/buffers for storing sensitive information, e.g., smartcards

Emerging Secure Processors

Trusted Computing Group (TCG)

IBM: SecureBlue, 4/2006

Intel: LaGrande, 2002

Microsoft: NGSCB

What Can Secure Processors Do?

Design objectives:

Protect sensitive data, program execution, intellectual property, and network communication

Confidentiality: the adversary cannot learn the data

Integrity: the adversary cannot tamper with the data

Ease of use, execution speed, backward compatibility, manageability

Performance

The First Design: Let Us Encrypt Everything

XOM model (eXecute-Only Memory), ASPLOS 2000

Design goal: protect intellectual property; execute securely even if the hardware and OS are compromised

Design strategy: everything off-chip is encrypted

What is Trusted?

[Figure: the trusted computing base (TCB) is the processor boundary, enclosing the CPU core and cache; the OS, memory, and disk lie outside it]

Only the CPU is trusted: hacking a high-end CPU chip is difficult

However, it is possible to scan an embedded chip using a microscope

Other components are not trusted

XOM Design

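Use the processor's public key to encrypt the session key

[Figure: the processor (CPU core, cache, en-/decryption unit) holds a private/public key pair; the encrypted session key and the encrypted program reside in memory]

Use the session key to encrypt the program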
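A minimal sketch of this key arrangement, assuming textbook RSA with tiny primes as the processor's key pair and an HMAC-SHA256 keystream in place of the program cipher (XOM itself applies a symmetric cipher such as AES to the data). `package`, `load`, and `keystream` are hypothetical names; nothing here is secure, it only shows that the vendor needs the public key while only the processor can unwrap the session key.

```python
import hashlib
import hmac

# Toy textbook-RSA key pair standing in for the processor's key pair.
# Tiny primes for illustration only: this offers no real security.
P, Q = 61, 53
N = P * Q                              # public modulus (3233)
E = 17                                 # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))      # private exponent, never leaves the chip


def keystream(session_key: int, length: int) -> bytes:
    """HMAC-SHA256 in counter mode; a stand-in for the symmetric program cipher."""
    key = session_key.to_bytes(2, "big")
    out = b""
    counter = 0
    while len(out) < length:
        out += hmac.new(key, counter.to_bytes(4, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:length]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def package(program: bytes, session_key: int):
    """Vendor side: encrypt the program with the session key and wrap the key."""
    assert 1 < session_key < N, "toy RSA needs 1 < key < N"
    wrapped_key = pow(session_key, E, N)   # only the target processor can unwrap this
    return wrapped_key, xor(program, keystream(session_key, len(program)))


def load(wrapped_key: int, encrypted_program: bytes) -> bytes:
    """Processor side: unwrap the session key on-chip, then decrypt the program."""
    session_key = pow(wrapped_key, D, N)
    return xor(encrypted_program, keystream(session_key, len(encrypted_program)))


if __name__ == "__main__":
    wrapped, blob = package(b"\x55\x48\x89\xe5\xc3", 1234)
    print(load(wrapped, blob) == b"\x55\x48\x89\xe5\xc3")  # True
```

The released binary carries `wrapped` and `blob`; the session key itself never appears in memory in the clear.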

Problem 1: Slow Encryption

Encryption only: memory latency = 100 cycles, encryption latency = 50 cycles

Performance Degradation for XOM (execution time increase, %)

ammp 34, art 42, bzip2 23, equake 16, gcc 29, gzip 2, mcf 39, mesa 2, parser 13, vortex 8, vpr 22; average 21

The Lengthened Critical Path

Encryption lies on the memory access critical path

[Figure: the L2 cache and write buffer connect to main memory through the encryption/decryption unit, so both reads and writes pass through it]

Our Work

Slow memory access and crypto-operations

XOM design: sequential; performance degradation of ~20% for SPEC2000 benchmark programs

Our design: parallel; performance degradation of 2%

However, direct encryption is no longer secure: adopt a one-time-pad scheme

One-Time-Pad Encryption

XOM: clear-data = AES(cipher-data), decrypting the fetched data itself

Our design: clear-data = cipher-data ⊕ OTP

OTP = AES(address || counter || …)


[Figure: XOM and OTP datapaths between the CPU, L2 cache, and crypto unit]
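The OTP datapath fits in a few lines. This sketch assumes HMAC-SHA256 from Python's standard library as a stand-in for AES (used here as a keyed pseudorandom function) and a hypothetical 32-byte line; the seed uses only the address and counter, and the further inputs the slide elides are left out.

```python
import hashlib
import hmac

LINE_BYTES = 32  # hypothetical cache-line size for this sketch


def make_pad(key: bytes, address: int, counter: int, length: int = LINE_BYTES) -> bytes:
    """OTP = PRF(address || counter). HMAC-SHA256 stands in for AES here."""
    seed = address.to_bytes(8, "big") + counter.to_bytes(8, "big")
    return hmac.new(key, seed, hashlib.sha256).digest()[:length]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt_line(key: bytes, address: int, counter: int, clear: bytes) -> bytes:
    return xor(clear, make_pad(key, address, counter, len(clear)))


def decrypt_line(key: bytes, address: int, counter: int, cipher: bytes) -> bytes:
    # The pad depends only on (address, counter), so hardware can start computing it
    # while the memory read is still in flight; only the final XOR waits for the data.
    return xor(cipher, make_pad(key, address, counter, len(cipher)))


key = b"hypothetical-program-key"
line = bytes(range(LINE_BYTES))
cipher = encrypt_line(key, 0x8000, 7, line)
print(decrypt_line(key, 0x8000, 7, cipher) == line)  # True
```

Encryption and decryption are the same XOR, which is why the crypto unit no longer needs the data as input.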

Offloading the Crypto-Computation

Original scheme: the crypto input depends on the memory access, so en-/decryption is carried out serially after it; latency: 100 cycles + 50 cycles = 150 cycles

Our scheme: decouple en-/decryption from memory accesses and carry them out in parallel, hiding the crypto-computation latency
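The latency argument reduces to a sum versus a maximum. This sketch assumes the final XOR costs nothing (`xor_cycles=0`) and that the seed is already on-chip when the read starts; the function names are illustrative.

```python
def serial_latency(memory_cycles: int, crypto_cycles: int) -> int:
    """XOM-style: decryption needs the fetched data, so it starts after the read returns."""
    return memory_cycles + crypto_cycles


def overlapped_latency(memory_cycles: int, crypto_cycles: int, xor_cycles: int = 0) -> int:
    """OTP-style: the pad is generated while the read is in flight; only the XOR follows.

    Assumes the seed (address and sequence number) is available when the read starts.
    """
    return max(memory_cycles, crypto_cycles) + xor_cycles


print(serial_latency(100, 50))      # 150
print(overlapped_latency(100, 50))  # 100
```

As long as pad generation is no slower than the memory access, the crypto latency disappears from the critical path.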

One-Time Pad (OTP) Encryption

AES normal mode: cleartext → AES → ciphertext

OTP encryption: seed → AES → random value (pad); cleartext ⊕ pad → ciphertext

Seed Selection

Independent of the data value and known before the data is available: use the memory address

Multiple accesses to the same location must use different seeds: use a one-time sequence number

The Sequence Numbers

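Writing values V = 1, 2, 3 to address A at times t0, t1, t2:

time:              t0            t1            t2
V:                 1             2             3
(1) Use A only:    P(A) ⊕ 1      P(A) ⊕ 2      P(A) ⊕ 3
(2) Use A and t:   P(A,t0) ⊕ 1   P(A,t1) ⊕ 2   P(A,t2) ⊕ 3

With (1), every write to A reuses the same pad, so XOR-ing two ciphertexts reveals the XOR of the plaintexts; (2) gives each write a fresh pad.

OTP = AES(Address, one-time-seq)

OTP = AES(Seed) = AES(nonce, Address, one-time-seq)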
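A sketch of per-address sequence numbers, assuming HMAC-SHA256 in place of AES and a plain dictionary in place of the real sequence-number storage. `OtpMemory` and `make_pad` are hypothetical names.

```python
import hashlib
import hmac
from collections import defaultdict


def make_pad(key: bytes, nonce: bytes, address: int, seq: int, length: int) -> bytes:
    """OTP = PRF(nonce, address, one-time-seq); HMAC-SHA256 stands in for AES."""
    msg = nonce + address.to_bytes(8, "big") + seq.to_bytes(8, "big")
    return hmac.new(key, msg, hashlib.sha256).digest()[:length]


class OtpMemory:
    """Off-chip memory holding ciphertext, with one sequence number per address."""

    def __init__(self, key: bytes, nonce: bytes):
        self.key, self.nonce = key, nonce
        self.seq = defaultdict(int)   # a dict here; a real design stores/caches these
        self.cells = {}               # address -> ciphertext, as an attacker would see it

    def write(self, address: int, value: bytes) -> None:
        assert len(value) <= 32, "one HMAC-SHA256 digest covers at most 32 bytes"
        self.seq[address] += 1        # every write to this address gets a fresh pad
        pad = make_pad(self.key, self.nonce, address, self.seq[address], len(value))
        self.cells[address] = bytes(v ^ p for v, p in zip(value, pad))

    def read(self, address: int) -> bytes:
        cipher = self.cells[address]
        pad = make_pad(self.key, self.nonce, address, self.seq[address], len(cipher))
        return bytes(c ^ p for c, p in zip(cipher, pad))


mem = OtpMemory(b"hypothetical-key", nonce=b"\x00" * 8)
mem.write(0x40, b"\x01" * 16)
first = mem.cells[0x40]
mem.write(0x40, b"\x01" * 16)          # same value, same address, later time
print(mem.cells[0x40] != first)        # fresh sequence number, so the ciphertext changes
print(mem.read(0x40) == b"\x01" * 16)  # True
```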

Comparing XOM and OTP Based Encryption

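XOM vs. XOM w/ OTP, storing the value 100:

Spatially (addresses A1 and A2):
XOM: A1 → Ekey(100), A2 → Ekey(100)
XOM w/ OTP: A1 → Ekey(A1,t1) ⊕ 100, A2 → Ekey(A2,t2) ⊕ 100

Temporally (address A at times t1 and t2):
XOM: t1 → Ekey(100), t2 → Ekey(100)
XOM w/ OTP: t1 → Ekey(A,t1) ⊕ 100, t2 → Ekey(A,t2) ⊕ 100

Our scheme better randomizes encrypted data in memory: identical values no longer produce identical ciphertexts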
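The spatial and temporal comparison can be checked directly. Here `xom_style` is only a deterministic keyed function: it models the one property this comparison depends on (under direct encryption, equal plaintexts give equal ciphertexts), not XOM's actual cipher, and it is not invertible. HMAC-SHA256 stands in for Ekey; all names are hypothetical.

```python
import hashlib
import hmac

KEY = b"hypothetical-program-key"


def prf(*parts: bytes) -> bytes:
    """Keyed PRF (HMAC-SHA256, truncated to 8 bytes) standing in for Ekey."""
    return hmac.new(KEY, b"".join(parts), hashlib.sha256).digest()[:8]


def xom_style(value: int) -> bytes:
    # Deterministic stand-in for Ekey(value): equal plaintexts always map to
    # equal ciphertexts, which is the property being compared.
    return prf(b"xom", value.to_bytes(8, "big"))


def otp_style(value: int, address: int, t: int) -> bytes:
    # Ekey(address, t) XOR value: the pad changes with location and time.
    pad = prf(b"otp", address.to_bytes(8, "big"), t.to_bytes(8, "big"))
    return bytes(v ^ p for v, p in zip(value.to_bytes(8, "big"), pad))


A1, A2, A = 0x1000, 0x2000, 0x3000

# Spatially: value 100 stored at A1 and A2.
print(xom_style(100) == xom_style(100))                # True: equality leaks
print(otp_style(100, A1, 1) != otp_style(100, A2, 2))  # different pads, different ciphertexts

# Temporally: value 100 written to A at t1 and t2.
print(otp_style(100, A, 1) != otp_style(100, A, 2))
```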

More Intuitively

Our Architectural Design

[Figure: on-chip design inside the security boundary, with a sequence number cache, an encryption unit, and the write buffer on the read/write path to memory; labels: physical address, virtual address]
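A rough model of the sequence number cache (SNC), assuming a fully associative LRU organization indexed by physical line address and a dictionary standing in for the sequence numbers spilled to memory. On a hit the pad can be generated immediately; on a miss the sequence number must be fetched first, which is the cost the SNC size and associativity experiments probe. The class and method names are hypothetical.

```python
from collections import OrderedDict


class SequenceNumberCache:
    """Fully associative LRU cache of per-line sequence numbers (simplified model)."""

    def __init__(self, capacity: int, backing: dict):
        self.capacity = capacity
        self.backing = backing          # sequence numbers spilled to memory
        self.entries = OrderedDict()    # physical line address -> sequence number
        self.hits = self.misses = 0

    def lookup(self, line_addr: int) -> int:
        if line_addr in self.entries:
            self.hits += 1
            self.entries.move_to_end(line_addr)   # mark as most recently used
        else:
            self.misses += 1                      # pad generation must wait for this fetch
            self._insert(line_addr, self.backing.get(line_addr, 0))
        return self.entries[line_addr]

    def increment(self, line_addr: int) -> int:
        """On each write-back, bump the sequence number so the next pad is fresh."""
        seq = self.lookup(line_addr) + 1
        self.entries[line_addr] = seq
        return seq

    def _insert(self, line_addr: int, seq: int) -> None:
        if len(self.entries) >= self.capacity:
            victim, victim_seq = self.entries.popitem(last=False)  # evict the LRU entry
            self.backing[victim] = victim_seq                      # write it back
        self.entries[line_addr] = seq


snc = SequenceNumberCache(capacity=2, backing={})
snc.increment(0x00)
snc.increment(0x40)
snc.lookup(0x00)                          # hit
snc.lookup(0x80)                          # miss: evicts 0x40 and writes its number back
print(snc.hits, snc.misses, snc.backing)  # 1 3 {64: 1}
```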


Experimental Results

Settings: SimpleScalar toolset, SPEC2000 programs

Simulated baseline: 4-issue out-of-order processor

Caches: separate 32KB 4-way L1 I-cache and D-cache; 256KB 4-way L2 cache

Performance Comparison

[Figure: program slowdown (%) per benchmark for XOM and SNC-LRU]

Equal Area Comparison

Normalized execution time: XOM-256KL2 1.21, XOM-384KL2 1.12, SNC-32way-LRU-256KL2 1.02

SNC of Different Size

[Figure: slowdown (%) for SNC sizes of 32KB, 64KB, and 128KB; one bar is labeled 17.89]

SNC of Different Associativity

[Figure: slowdown (%) for ammp, bzip2, gcc, mcf, parser, and vpr with a fully associative vs. a 32-way set-associative SNC]

II: Tamper Resistant Memory Model

Replay attack

[Figure: the processor (CPU core, cache, en-/decryption unit, private/public key pair) and memory; an old memory value of $1000 is replayed in place of the current value of $50]

Merkle Tree

[Figure: a Merkle tree over memory data; MACs MAC1_1, MAC1_2, MAC2_1, and MAC2_2 are combined into a single root MAC]

Ensuring Data Integrity

Any change in memory results in a new root MAC

The root MAC is updated on each memory write, so illegal modification of memory is detected

Format of released code: code/data segments, an encrypted session key, and an encrypted root MAC
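A minimal Merkle-tree sketch of the root-MAC idea, assuming HMAC-SHA256 as the MAC and a key held on-chip. For brevity it recomputes the whole tree; hardware would update only the MACs on the path from the written block to the root. The `b"leaf"`/`b"node"` prefixes are a domain-separation choice of this sketch, not part of the original design.

```python
import hashlib
import hmac

KEY = b"hypothetical-on-chip-mac-key"


def mac(*parts: bytes) -> bytes:
    return hmac.new(KEY, b"".join(parts), hashlib.sha256).digest()


def root_mac(blocks: list) -> bytes:
    """Build a binary Merkle tree of MACs over the memory blocks and return its root."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [mac(b"leaf", blk) for blk in blocks]
    while len(level) > 1:
        if len(level) % 2:             # odd level: pair the last node with itself
            level.append(level[-1])
        level = [mac(b"node", level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


memory = [b"balance=$1000", b"code", b"data", b"stack"]
trusted_root = root_mac(memory)        # kept inside the processor

memory[0] = b"balance=$50"             # legitimate write: the processor updates the root
trusted_root = root_mac(memory)

memory[0] = b"balance=$1000"           # attacker replays the stale block
print("replay detected:", root_mac(memory) != trusted_root)
```

Because the trusted root lives on-chip, replaying an old but correctly encrypted block still fails the check.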

III: Protecting Multiprocessors

CPU/memory communication is protected: confidentiality and integrity

CPU/CPU communication is NOT protected, and it requires both

[Figure: processors A, B, and C share main memory over a bus]

Bus Attacks in Real Life

Mod-chip: modifies a game console so it boots any CD/DVD

[Figure: a mod-chip on a console board; labels: DSP chip, BIOS chip]

The Algorithm

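Secure: protects both confidentiality and integrity

Performance: < 1% slowdown; fast crypto operations; pad update in parallel

[Figure: steps 1-4 of a bus transfer between the originating processor and the snooping processors; data P is combined with a pad to produce C on the data bus, and the pads are updated with AES]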

Potential Attacks on the Bus

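Type 1: Dropping: a transfer (e.g., 2 of 1, 2, 3) is dropped (×) so some processors never see it

Type 2: Reordering: transfers 1, 2, 3 are delivered out of order

Type 3: Spoofing: e.g., 1, 2, 2 is received instead of 1, 2, 3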

Secure SMP multiprocessors

Design goals: secure cache-to-cache bus transfers

Security: confidentiality and integrity

Efficiency: fast

Cryptographic Algorithms

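CBC-AES: C = AES(P ⊕ M_last); AES cannot start until the data P is available

CFB-AES: C = P ⊕ AES(M_last); AES(M_last) can be computed before P is available, leaving one XOR on the critical path

(M_last: the previous chaining value)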

Bus Encryption Scheme

Crypto operations: fast, one XOR operation

Pad update: done in parallel with the bus transfer

[Figure: a bus transfer from the originating processor to the snooping processors using pads updated with AES, with an attack on the data bus]
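The one-XOR property follows from a CFB-style chain: the pad for the next transfer depends only on the last value seen on the bus, so every processor can compute it ahead of time. This sketch assumes M_last is the previous ciphertext (standard CFB), a shared group key, and HMAC-SHA256 truncated to 16 bytes in place of AES; the exact pad update in the original design may differ, and `BusEndpoint` is a hypothetical name.

```python
import hashlib
import hmac

GROUP_KEY = b"hypothetical-group-session-key"
BLOCK = 16


def aes_stand_in(block: bytes) -> bytes:
    """HMAC-SHA256 truncated to 16 bytes, standing in for AES under the group key."""
    return hmac.new(GROUP_KEY, block, hashlib.sha256).digest()[:BLOCK]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


class BusEndpoint:
    """One processor's copy of the shared CFB-style bus cipher state."""

    def __init__(self, iv: bytes):
        self.pad = aes_stand_in(iv)          # ready before any data needs to be sent

    def send(self, plaintext: bytes) -> bytes:
        assert len(plaintext) == BLOCK
        cipher = xor(plaintext, self.pad)    # critical path: a single XOR
        self.pad = aes_stand_in(cipher)      # next pad; can overlap with the bus transfer
        return cipher

    def receive(self, cipher: bytes) -> bytes:
        plaintext = xor(cipher, self.pad)
        self.pad = aes_stand_in(cipher)      # snoopers update exactly like the sender
        return plaintext


iv = bytes(BLOCK)
cpu_a, cpu_b, cpu_c = BusEndpoint(iv), BusEndpoint(iv), BusEndpoint(iv)
for msg in (b"cache line part1", b"cache line part2"):
    c = cpu_a.send(msg)
    assert cpu_b.receive(c) == msg and cpu_c.receive(c) == msg
print(cpu_a.pad == cpu_b.pad == cpu_c.pad)   # True: all pads stay in sync
```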

Basic Authentication Scheme

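Detecting an attack: each processor compares the MAC it computed on its own with the one it received

Issues: periodic authentication; the MAC covers the previous data sequence (chaining mode); the PID should also be included; …

MAC_t = AES(P_t, MAC_{t-1}); the locally computed MAC_t is checked against MAC_t' received over the bus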
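A sketch of the chained MAC check, assuming HMAC-SHA256 in place of AES, a 2-byte PID folded into each step, and an all-zero initial MAC; `next_mac` and `Authenticator` are hypothetical names.

```python
import hashlib
import hmac

MAC_KEY = b"hypothetical-group-mac-key"


def next_mac(prev_mac: bytes, pid: int, data: bytes) -> bytes:
    """MAC_t = PRF(MAC_{t-1}, PID, P_t); HMAC-SHA256 stands in for AES."""
    return hmac.new(MAC_KEY, prev_mac + pid.to_bytes(2, "big") + data, hashlib.sha256).digest()


class Authenticator:
    """Per-processor running MAC over every transfer it observes on the bus."""

    def __init__(self):
        self.mac = bytes(32)                 # agreed initial value

    def observe(self, pid: int, data: bytes) -> None:
        self.mac = next_mac(self.mac, pid, data)

    def check(self, received_mac: bytes) -> bool:
        # Periodic authentication: compare our own MAC with the one sent on the bus.
        return hmac.compare_digest(self.mac, received_mac)


sender, snooper = Authenticator(), Authenticator()
for pid, data in [(1, b"x"), (2, b"y"), (3, b"z")]:
    sender.observe(pid, data)
    snooper.observe(pid, data)
print(snooper.check(sender.mac))  # True
```

Because each MAC folds in its predecessor, one periodic comparison covers every transfer since the last check.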

Defending Type 1: Dropping Attack

[Figure: transfer 2 is dropped (×) for some processors, so the sequence they observe differs from that of the others] [PACT’04]

Defending Type 1: Dropping Attack

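[Figure: each processor's MAC chain covers the history of transfers 1; 1,2; …; 1..n, with MAC_t computed from P_t and MAC_{t-1}. If transfer 2 is dropped, the affected processor's history becomes 1,3 instead of 1,2,3, so its MAC no longer matches]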

Defending Type 2: Reordering Attacks

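[Figure: the MAC chain covers the order of transfers (1; 1,2; …; 1..n). If transfers 1 and 2 are swapped, the history becomes 2; 2,1; …; 2,1,…,n, so the MACs differ]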

Defending Type 3: Spoofing Attacks

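Replaying: 1, 2, 2 is received instead of 1, 2, 3

Insertion: a forged transfer is inserted into the sequence

[Figure: transfers 1-4 carrying data x, y, z, tagged with PIDs 2 3 4 1 2 3 2 1]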
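Running the attack types against a chained MAC shows why chaining and the PID matter: a dropped, reordered, replayed, inserted, or re-attributed transfer changes the history folded into the receiver's MAC. This sketch assumes HMAC-SHA256 in place of AES, an all-zero initial MAC, and made-up transfer lists; `chain_mac` is a hypothetical name.

```python
import hashlib
import hmac

MAC_KEY = b"hypothetical-group-mac-key"


def chain_mac(transfers) -> bytes:
    """Fold MAC_t = PRF(MAC_{t-1}, PID, P_t) over a sequence of (pid, data) transfers."""
    mac = bytes(32)
    for pid, data in transfers:
        mac = hmac.new(MAC_KEY, mac + pid.to_bytes(2, "big") + data, hashlib.sha256).digest()
    return mac


sent = [(1, b"x"), (2, b"y"), (3, b"z")]
reference = chain_mac(sent)          # MAC computed from the genuine transfers

attacks = {
    "dropping": [(1, b"x"), (3, b"z")],
    "reordering": [(2, b"y"), (1, b"x"), (3, b"z")],
    "replaying": [(1, b"x"), (2, b"y"), (2, b"y")],
    "insertion": [(1, b"x"), (2, b"y"), (4, b"w"), (3, b"z")],
    "spoofed PID": [(1, b"x"), (2, b"y"), (4, b"z")],   # same data, wrong sender
}

for name, observed in attacks.items():
    detected = not hmac.compare_digest(chain_mac(observed), reference)
    print(f"{name}: detected={detected}")
```

The "spoofed PID" case is caught only because the PID is part of each MAC step; without it, that sequence would be indistinguishable from the genuine one.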

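Architectural Design

[Figure: CPU0, CPU1, CPU2, …, CPUn run Application1 and Application2 and share main memory over the bus; each processor has a cryptographic unit, a private/public key pair, and an additional info table mapping session keys SK1, SK2, …, SKn to groups G1, G2, …, Gn]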

Experiment Environment

Tools: Simics full-system multiprocessor simulator; 5 benchmarks from the SPLASH-2 suite

Configuration: 1 GHz SPARC V9 machine running Solaris 9

Caches: separate L1 I- and D-caches (write-through, 64KB, 32B lines); integrated L2 cache (write-back, 1MB/4MB, 64B lines); MESI coherence protocol

Latencies: cache-to-cache 120 cycles; cache-to-memory 180 cycles; AES 80 cycles

Performance Slowdown

[Figure: percentage slowdown (%) for fft, radix, barnes, lu, ocean, and the average with 2 and 4 processors (2P, 4P); write-invalidate model + 4MB write-back L2 cache; y-axis from 0 to 0.2%]

Bus Traffic Increase

[Figure: bus activity increase (%) with 2 and 4 processors (2P, 4P); write-invalidate model + 4MB write-back L2 cache; y-axis from 0 to 0.5%]

Integrated System

[Figure: percentage slowdown (%) of SENSS and of SENSS+Mem_OTP_Chash; write-invalidate coherence model + 1MB write-back L2 cache; y-axis from 0 to 18%; bar labels include values from 0.01 to 0.08]

IV: Defending Against Software Vulnerabilities

Buffer overflow attack: the most common attack

Procedure: insert malicious code into user space, identify a vulnerable program point, and redirect control flow to the malicious code

Still possible on secure processors: software support is needed

Hardware alone cannot defend against all attacks

Need Compiler Support

How? Adopt existing approaches: StackGuard, array bounds checking

New approaches: hardware-supported information tracking

Raise an alarm when insecure input is used as a return address

Efficient, without significant performance loss

OS support can be included for better protection
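One way to picture hardware-supported information tracking is a one-bit tag per word that is set on untrusted input, propagated through arithmetic, and checked when a value is used as a return address. The tag and the names `Word`, `from_input`, `add`, and `ret` are hypothetical; a hardware design would keep such tags alongside registers and memory, and propagation rules vary between proposals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    value: int
    tainted: bool = False        # hypothetical one-bit tag: derived from untrusted input?


class SecurityAlarm(Exception):
    pass


def from_input(raw: int) -> Word:
    """Data arriving from an untrusted source (e.g., a network read) is tagged."""
    return Word(raw & 0xFFFFFFFF, tainted=True)


def add(a: Word, b: Word) -> Word:
    # Tags propagate: a result is tainted if either operand is.
    return Word((a.value + b.value) & 0xFFFFFFFF, a.tainted or b.tainted)


def ret(return_address: Word) -> int:
    """Model of a return instruction that checks the tag before jumping."""
    if return_address.tainted:
        raise SecurityAlarm(f"tainted return address 0x{return_address.value:08x}")
    return return_address.value


stack = {"saved_ra": Word(0x00400120)}          # written by the call instruction
print(hex(ret(stack["saved_ra"])))              # 0x400120: normal return

stack["saved_ra"] = add(from_input(0xDEADBEEF), Word(0))  # overflow overwrites the slot
try:
    ret(stack["saved_ra"])
except SecurityAlarm as alarm:
    print("alarm:", alarm)
```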

Future Trends

Wide adoption of secure processors: business transactions, mission-critical applications, general-purpose applications

Achieving more ambitious security goals: protect inter-process communication; defend against viruses, worms, and DoS attacks

Collaborating with the OS and compilers

Conclusion

Secure processors are a promising solution in an insecure world, and they are currently an active topic in both the research community and industry
