Fault Tolerance and Security Geraint Price Information Security Group Royal Holloway

Fault Tolerance and Security

Geraint Price

Information Security Group

Royal Holloway

3rd-5th April 2005, Security and Protection of Information 2005

Outline

- Introduction
- Background
  - Security
  - Fault Tolerance
- Major Contributions
- A Personal Perspective
- Future Challenges
- Conclusions


Introduction

- Computer Security and Fault Tolerance share a subset of goals
  - The ability to tolerate or mitigate failure in a computer system
- The assumptions that underpin traditional solutions make their merger non-trivial
  - Security: Remove any replication and tighten control
  - Fault Tolerance: Replicate and compare results


Introduction – II

- Recent cross-over research began with Reiter’s work on Rampart (mid-1990s)
- Spawned a new interest in the application of fault tolerant mechanisms in security:
  - Tacoma: Provision of replication for mobile agents
  - MAFTIA: A large-scale project to study survivability in Internet applications
- We concentrate on two avenues of research:
  - Development of the fault model
  - Progression of the replication mechanisms


Background – Security

- Why the relatively late interaction?
- In our opinion, it has much to do with the history of computer security:
  - Trusted Computing Base
  - Research was weighted towards confidentiality and integrity – not availability
- Others had noted this gap in the computer security literature [Needham, ’94]


Background – Security – II

- Very little in the open literature dealt with Denial of Service (the absence of availability)
- A notable exception [Gligor, ’86]:
  - An increase in Maximum Waiting Time (MWT)
  - Legitimate and other forms of denial of service – system returns before MWT
- Interesting exception [Turn and Habibi, ’86]:
  - A security function is fault tolerant if, given the presence of a fault, the system’s security policy remains intact


Background – Fault Tolerance

- Fault Modelling:
  - Fault → Error → Failure
  - Fault: Adjudged or hypothesized cause of error
  - Error: The part of the system state that may lead to failure
  - Failure: Service deviates from specification
- Four techniques within the dependability paradigm:
  - Fault prevention, fault tolerance, fault removal, fault forecasting


Background – Fault Tolerance – II

- Replication Mechanisms:
  - Underlying group communication mechanisms
  - Early work conducted at Cornell University:
    - Isis toolkit: CBCAST (Causal broadcast), ABCAST (Atomic broadcast)
- Group Structures:
  - State Machine Approach: Active replication, which masks the failure of a proportion of the servers
  - Primary Backup Approach: Passive replication; if the primary fails, then a backup takes over
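The two group structures can be contrasted in a few lines. This is a minimal sketch under stated assumptions, not code from Isis or any system named in these slides: `vote` assumes 2f+1 replicas of which at most f reply incorrectly, and `primary_backup` models replicas as plain callables that raise `ConnectionError` when they have failed. Both function names are hypothetical.

```python
from collections import Counter


def vote(replies, f):
    """Active replication: accept a value only if at least f + 1 replicas agree.

    Assumes 2f + 1 replicas with at most f faulty, so any value reported
    f + 1 times was reported by at least one correct replica.
    """
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= f + 1 else None


def primary_backup(servers, request):
    """Passive replication: ask the primary first, then each backup in turn."""
    for server in servers:
        try:
            return server(request)
        except ConnectionError:
            continue  # this replica has failed; fall through to the next one
    raise RuntimeError("all replicas failed")
```

Note the difference in what is tolerated: the voter masks wrong answers, while the primary backup loop only survives replicas that stop responding.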


Major Contributions

- Rampart
- Castro and Liskov
- Quorum Systems
- MAFTIA
- Tacoma
- Other Projects


Rampart

- Group communication implemented by Reiter [Reiter, ’94 & ’96]
- First system to implement replicated service based on Byzantine agreement protocols
- Main communication structure derived from the earlier work on Isis at Cornell
- Extension over the Isis work through its ability to tolerate the malicious failure of a proportion of the servers within the group


Rampart – II

- Choices over communication primitives within Rampart:
  - State machine approach to replication
  - Digital signatures to provide message authentication in group communication primitive
- Lack of efficiency and scalability
- Although it has its drawbacks, it inspired the majority of the remaining work
- The main research agenda as a result was the search for more efficient protocols


Castro & Liskov

- A new replication mechanism to overcome efficiency concerns [Castro & Liskov, ’99]
- Two main differences to Rampart:
  - Primary backup model
  - Pair-wise symmetric key Message Authentication Codes
- A test implementation over NFS was only 3% slower than Digital Unix NFS
- Efficiency gains are due to optimistic protocols under normal operation
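The idea of replacing one digital signature with pair-wise symmetric-key MACs can be illustrated as follows. This is a sketch of the general technique only, not the message format of [Castro & Liskov, ’99]: the function names, the replica identifiers and the choice of HMAC-SHA256 are assumptions made for the example.

```python
import hashlib
import hmac


def make_authenticator(message: bytes, pairwise_keys: dict) -> dict:
    """Sender side: compute one MAC per receiver, each under the key
    shared only between the sender and that receiver."""
    return {
        receiver_id: hmac.new(key, message, hashlib.sha256).digest()
        for receiver_id, key in pairwise_keys.items()
    }


def verify_entry(message: bytes, authenticator: dict,
                 receiver_id: str, key: bytes) -> bool:
    """Receiver side: check only the entry addressed to this receiver."""
    tag = authenticator.get(receiver_id)
    if tag is None:
        return False
    expected = hmac.new(key, message, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected)  # constant-time comparison
```

Each receiver performs a single cheap MAC check instead of a public-key signature verification. The price is that an entry convinces only the receiver holding that key, so one replica cannot use it to prove anything to a third party.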


Quorum Systems

- Data replication in a group of servers [Malkhi & Reiter, ’97]
- Move away from the state machine approach
- Increase scalability by removing the server-to-server communication for a read operation
- However, their work does require server-to-server communication for state update, and hence a write operation
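A read that needs no server-to-server communication can be sketched as below: the client collects (timestamp, value) replies from one quorum and decides locally. The constants follow the threshold construction for masking quorums as commonly presented (n ≥ 4f+1, quorums of size ⌈(n+2f+1)/2⌉); they are not stated on the slide and should be checked against [Malkhi & Reiter, ’97]. Function names are hypothetical, and values are assumed to be hashable.

```python
import math
from collections import Counter


def quorum_size(n: int, f: int) -> int:
    """Quorum size for the assumed threshold construction, under which
    any two quorums overlap in at least 2f + 1 servers."""
    if n < 4 * f + 1:
        raise ValueError("assumed construction needs n >= 4f + 1")
    return math.ceil((n + 2 * f + 1) / 2)


def read(replies, f):
    """Client-side read over replies gathered from one quorum.

    replies: list of (timestamp, value) pairs, one per server.
    Only pairs vouched for by at least f + 1 servers are trusted,
    since at most f of those servers can be faulty.
    """
    counts = Counter(replies)
    vouched = [pair for pair, count in counts.items() if count >= f + 1]
    if not vouched:
        return None
    return max(vouched, key=lambda pair: pair[0])[1]  # newest trusted value
```

A single faulty server reporting a very high timestamp is ignored, because its reply is not repeated by f other servers.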


MAFTIA

- Malicious and Accidental Fault Tolerance for Internet Applications
- Large EU funded project: 6 partners
  - Expertise in fault tolerance, distributed computing, cryptography, formal verification and intrusion detection
- 3 main areas of work: conceptual framework and architecture; mechanisms and protocols; formal verification and assessment


MAFTIA – Conceptual Model

- Extension of the Fault → Error → Failure model
- Re-defining a Fault as an Intrusion:
  - Intrusion: A malicious, externally-induced fault resulting from an attack that has been successful in exploiting a vulnerability
  - Attack: A malicious interaction fault, through which an attacker aims to deliberately violate one or more security properties
  - Vulnerability: A fault created during development of the system, or during operation, that could be exploited to create an intrusion


MAFTIA – Conceptual Model – II

- In breaking down an Intrusion, they highlight the possibility of targeting the removal or prevention of both Attacks and Vulnerabilities
- Although MAFTIA’s main focus was Intrusion Tolerance, they classify a whole range of security mechanisms according to the fault prevention, tolerance, removal and forecasting paradigms mentioned earlier


MAFTIA – Hybrid Failure Model

- Composite fault model with a hybrid failure assumption
  - The presence and severity of vulnerabilities, attacks and intrusions vary from component to component
- Assumptions present in their architectural design:
  - Built on top of trustworthy components:
    - Java Card
    - Trusted Timely Computing Base (TTCB)
    - Trusted Middleware component


MAFTIA – Hybrid Failure Model – II

- The key element of the MAFTIA architecture is the TTCB:
  - Provision of time based services through the use of a Control Channel
  - Dedicated and heavily protected security kernel – fail silent rather than arbitrary failure
- Implementation of a reliable broadcast protocol that can tolerate up to f faulty processes out of f+2 [Correia et al., ’02]
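The f of f+2 figure is easier to appreciate next to the classical requirement of n ≥ 3f+1 processes for Byzantine reliable broadcast without any trusted component. That bound is standard background and is not stated on the slide. The helper below simply tabulates the two group sizes; its name and parameters are hypothetical.

```python
def min_group_size(f: int, trusted_component: bool) -> int:
    """Smallest group that tolerates f faulty processes.

    trusted_component=True uses the f + 2 figure quoted for the
    TTCB-based protocol; False uses the classical 3f + 1 bound
    (an assumption brought in for comparison).
    """
    if f < 0:
        raise ValueError("f must be non-negative")
    return f + 2 if trusted_component else 3 * f + 1


if __name__ == "__main__":
    # Print f, the classical group size, and the TTCB-based group size.
    for f in range(1, 4):
        print(f, min_group_size(f, False), min_group_size(f, True))
```

The gap widens with f: the classical group grows by three processes per tolerated fault, while the group relying on the trusted component grows by one.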


Tacoma

- Tromsø And COrnell Moving Agents project
  - Provision of security and fault tolerance were two key elements
- Resilience for the agent on a potentially malicious host:
  - Replicated agents, with voting mechanisms
- Fault tolerance for mobile agents:
  - Extension of the primary backup approach
  - “… preserving the necessary consistency between replicas can be done efficiently only within a local-area network”


Other Projects

- COCA:
  - Replication of a CA to provide availability
  - Byzantine quorum systems
  - Proactive recovery
- OASIS (Organically Assured and Survivable Information Systems)
  - Umbrella project which sponsors separate work items in the field of resilient security


A Personal Perspective

- Control of Execution:
  - Adapting fault tolerant principles for a secure environment can come down to a principle of control
  - In the Fault → Error → Failure model, breaking the chain requires retaining control
  - Whose security policy are we protecting?
  - Proposed mechanisms for allowing a client to share that control [Price, ’99]


A Personal Perspective – II

- Use of Other Mechanisms:
  - Some of our previous work identified the possibility of using timing checks [Price, ’01]
  - Remove the attacker’s ability to delay or replay messages with impunity
  - Some variants of replay attacks rely on this
  - With hindsight, there is an interesting comparison with MAFTIA’s use of a Control Channel
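One generic way to apply a timing check is a freshness window combined with a record of recently accepted messages. The sketch below illustrates that general idea only; it is not the mechanism of [Price, ’01]. It assumes loosely synchronised clocks, an already authenticated message carrying a unique nonce and a send time, and a hypothetical `FreshnessChecker` class.

```python
import time
from typing import Optional


class FreshnessChecker:
    """Reject messages that were delayed too long or already accepted."""

    def __init__(self, max_age: float):
        self.max_age = max_age  # tolerated delay and clock skew, in seconds
        self.seen = {}          # nonce -> send time of accepted messages

    def accept(self, nonce: str, sent_at: float,
               now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        # Forget nonces whose messages would now fail the age check anyway.
        self.seen = {n: t for n, t in self.seen.items()
                     if now - t <= self.max_age}
        if abs(now - sent_at) > self.max_age:
            return False  # delayed beyond the window (or clock too far off)
        if nonce in self.seen:
            return False  # replay inside the window
        self.seen[nonce] = sent_at
        return True
```

An attacker who holds a message back for longer than `max_age` finds it rejected on arrival, and a copy replayed sooner than that is caught by the nonce record.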


Future Challenges

- Relaxation of assumptions:
  - Fully Byzantine failure models are difficult to protect against – and hence solutions are inefficient
  - Most of the work since Rampart has concentrated on feasible means of relaxing these failure assumptions: can we do better?
- Further use of hardware:
  - MAFTIA’s use of trusted hardware allows for more efficient protocols – can the principle be generalised?
  - Mixed failure environments [Siu et al., ’98]
  - Trusted Computing Group


Future Challenges – II

- Other dependability models:
  - Fault tolerance is only part of a very mature dependability literature
  - Disjoint v Inclusive error recovery?
  - MAFTIA defined a whole classification within their model
- Security service classification:
  - Quorum based systems use the parallelism of a read operation to increase efficiency
  - Can we class different services according to their communication requirements?


Conclusions

- Until 10 years ago, the work in this field was sparse and sporadic
- Now there is a large body of work in this area
- Practical efficiency is still a key research topic
- Broaden our search for other applicable mechanisms
- Availability and survivability on the Internet are only going to become more important
