CS5032 Lecture 13: organisations and failure

ORGANISATIONS AND DEPENDABILITY 1

DR JOHN ROOKSBY

IN THIS LECTURE…

High Reliability Organisations

These are organisations that are able to achieve high reliability from complex, critical systems

• This lecture will cover five of the key qualities said to be held by these organisations

This lecture uses nuclear-powered aircraft carriers as an example of a High Reliability Organisation, and NASA at the time of the Columbia disaster as an example of an unreliable organisation

NORMAL ACCIDENTS

Normal Accidents is a book by Charles Perrow (1984) that introduced the idea that failures are normal in complex systems. Perrow argued that serious failures are likely when there is:

• Interactive complexity: The presence of unfamiliar, unplanned and unexpected sequences of events in a system that are not visible or immediately comprehensible

• Tight coupling: The presence of interdependent components. Tight coupling will make a system more prone to cascading errors.
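A minimal sketch of why tight coupling makes a system more prone to cascading errors. It assumes a toy model in which a component fails as soon as any component it depends on fails. The component names and the `cascade` function are hypothetical illustrations, not something taken from Perrow.

```python
from collections import deque


def cascade(dependents, initial_failure):
    """Return the set of components that fail once `initial_failure` fails.

    `dependents` maps each component to the components that depend on it.
    Toy assumption: a component fails as soon as anything it depends on fails.
    """
    failed = {initial_failure}
    queue = deque([initial_failure])
    while queue:
        component = queue.popleft()
        for dependent in dependents.get(component, []):
            if dependent not in failed:
                failed.add(dependent)
                queue.append(dependent)
    return failed


# Hypothetical tightly coupled system: each component feeds the next one.
tight = {"pump": ["cooling"], "cooling": ["reactor"], "reactor": ["turbine"]}
# Hypothetical loosely coupled system: same components, no dependencies.
loose = {"pump": [], "cooling": [], "reactor": [], "turbine": []}

tight_failures = cascade(tight, "pump")
loose_failures = cascade(loose, "pump")
print(sorted(tight_failures))
print(sorted(loose_failures))
```

In the tightly coupled example a single pump failure reaches every component, while in the loosely coupled one it stays contained.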

So complex, tightly coupled systems shouldn’t be built?

HRO researchers argue that some complex, tightly coupled systems are far more dependable than others – because of the way they are managed

PRINCIPLES


High Reliability Organisations | Low Reliability Organisations
Focus on failure               | Focus on success
Focus on reliability           | Focus on efficiency
Reluctant to simplify          | Rely on simplicity
Dynamic hierarchies            | Inflexible hierarchy
De-centralised decision making | Centralised decision making
Open information               | Hide information
Multiple perspectives          | Single perspectives
Are committed to resilience    | Are on “automatic pilot”

NUCLEAR POWERED CARRIERS

Complex, high risk socio-technical systems

• Multiple (mechanical and digital) systems

• Dangerous objects (aircraft, fuel, and explosives) in close proximity. Aircraft take off and land at 48-60 second intervals.

• 6,000 crew. Several different kinds of aircraft and multiple squadrons. All work interdependently and must be coordinated.

• Carriers are 24 storeys high and carry enough fuel for 15 years. They have 2,000 telephones and 3,360 compartments and spaces.

NUCLEAR POWERED CARRIERS

High risk

• Nuclear reactor accidents

• Fire, flooding, grounding, collision

• Fuel and weapons explosions

• Mistaken identification of friends and foes

• High risks both to crew and a much larger public

High reliability

• Low “crunch rates”

• Comparatively few major accidents

COLUMBIA DISASTER

Feb 1st 2003 - Columbia disintegrates during re-entry into the earth’s atmosphere

The thermal protection system had been damaged during launch when a large piece of foam insulation broke off the main propellant tank and hit the shuttle

• This was a known problem.

• The majority of shuttle launches had included foam strikes, but nothing had been done about the design

• NASA was aware that the foam had struck the wing, but it was not treated as serious

• Engineers' concerns were not listened to

NASA

NASA had repeated similar failings

• The Challenger disaster, 28th Jan 1986 (Mission STS 51-L)

• The Columbia disaster, 1st Feb 2003 (Mission STS-107)

Many of the failings were the result of deep-rooted organisational failings

NASA strived to implement HRO principles

FIVE PRINCIPLES


High Reliability Organisations | Low Reliability Organisations
Focus on failure               | Focus on success
Focus on reliability           | Focus on efficiency
Reluctant to simplify          | Rely on simplicity
Dynamic hierarchies            | Inflexible hierarchy
De-centralised decision making | Centralised decision making
Share information              | Hide information
Multiple perspectives          | Single perspectives
Are committed to resilience    | Are on “automatic pilot”

1. RELIABILITY OVER EFFICIENCY

High Reliability Organisations give reliability precedence over efficiency

• Decisions are made on the grounds of reliability first and then efficiency

• Efficiency initiatives are treated with scepticism

1. RELIABILITY OVER EFFICIENCY

High Reliability Organisations do the following:

• Managers regularly talk to staff to familiarise themselves with how they do their work and why.

• Organisations develop safety measures as well as financial measures, and include these in employee evaluations

• Organisations assign value to the avoidance of accidents

• High redundancy despite cost

• Cautious actions when necessary despite cost

• Carriers have to persuade Congress that enormous amounts of redundancy (in jobs, communication structures, parts) are necessary, and that enormous amounts of training are necessary

• Constant training despite cost. Commanding officers demand that carriers have regular sea exercises, and that they are not just kept in port

NASA prioritised efficiency over reliability

• In the 1990s NASA faced drastic cuts and became overly concerned with pleasing Congress. It initiated the Faster, Better, Cheaper strategy in the mid-90s and wanted to stick to a strict schedule.

• With STS-107, NASA worried that the time needed to analyse the foam strike would delay the next mission. It did not want to change the next mission's objectives to a rescue mission.

• Positioning the shuttle over Hawaii so that images could be taken was seen as time-consuming and costly

2. PREOCCUPATION WITH FAILURE

High Reliability Organisations are preoccupied with failure (they do not focus on success)

• Workers need to be heedful of the possibility of failure

• Failures are understood to be normal (but unacceptable)

• Workers know there can be unexpected failure modes, even in common activities

2. PREOCCUPATION WITH FAILURE

High Reliability Organisations address failure by

• Constant training of all people (simulations, apprenticing, practice)

• Using incident reporting

• Designing in extensive redundancy

• Maintaining contingencies for critical operations

• Requiring proof that something is safe, rather than proof that it is unsafe

• There is constant tracking of issues around malfunctioning, defective and substandard equipment. Carriers act on these by training crew how to overcome problems and pressuring vendors to make improvements

• Extensive redundancy (overlapping jobs, multiple channels and centres of communication, spare parts, multiple sources for decision making).

• Example: if an aircraft's landing gear warning light comes on, the spotter, commander and pilot all work together to establish what the issue is.

• Multiple contingencies are maintained. Example: There will always be multiple options for how to land the plane (or for the pilot to escape).

• Foam had been shed on 65 of 79 missions prior to STS-107. There were repeated resolutions to do something about this and yet nothing happened.

• After the foam strike, engineers who raised concerns were asked to prove it posed a danger rather than prove it didn’t.

• There was no sustained effort to acquire images of the shuttle, or to share them internally

• A shuttle was available for a rescue mission, but a rescue was never actually considered.

3. SHARING THE BIG PICTURE

High Reliability Organisations want everyone to know the whole picture

• If people are narrowly focused they will act only in their own interest.

• People need to maintain awareness of other people and events around the organisation

3. SHARING THE BIG PICTURE


• Train people broadly

• Educate people about overarching objectives, and set statements of purpose

• Give people access to information on what is happening elsewhere

• Clearly specify how people and teams fit into the whole

• Maintain awareness through many communication devices and multiple kinds of communication device, and have multiple centres of communication, each with direct access to information and each vigilant.

• Have well-articulated hierarchies

• Deck hands are motivated because they are treated as core parts of teams

• People are rotated through different jobs. Top personnel are rotated to a different position every 90 days.

• Employees had little understanding of the overall organisation and its internal processes

• A team with the correct expertise was set up to assess the foam strike damage, but its objectives were fuzzy and it had no direct connection to management

• The team was not given the appropriate official category of “Tiger Team”

• The investigators did not know the process for requesting images, and were rebuked when they tried because they did not have the authority to request them or the correct approval

4. RELUCTANCE TO SIMPLIFY

All organisations have to simplify and abstract, to filter out unnecessary information (particularly for getting “big pictures”)

But High Reliability Organisations:

• Use labels and categories as little as possible, as they stop you from looking further into details and events.

• Continually rework labels and categories

• Listen to wisdom, but with scepticism

• Do not focus on information that supports expectations, but on that which doesn’t fit or disconfirms desires

• There are clear responsibilities and tasks, but in practice the crew are constantly negotiating, communicating and interacting

• If there is a problem with an aircraft, multiple people take multiple views.

• NASA narrowed the foam strike down to a ‘tile incident’ because management had expertise in tiles. It was in fact a reinforced carbon-carbon (RCC) panel incident.

• The assessment of the damage was done using simulation software called ‘Crater’.

• This software was designed for simulating small projectiles, but the foam debris was 640 times larger than the debris in the data used to calibrate Crater.

• Crater was not understood by NASA, and the simulation was actually run and interpreted outside the organisation.

• The simulation was only run twice, and the people who ran it did not think it was very useful but did not communicate this well
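The problem of applying Crater far outside its calibration data can be sketched as a simple range check that flags extrapolation. This is a minimal illustration only: the function name, the units and the calibration bounds are hypothetical placeholders and do not reflect Crater's actual interface or data. Only the factor of 640 comes from the text above, and treating it as a multiple of the largest calibrated size is an assumption.

```python
def check_within_calibration(value, calibrated_min, calibrated_max):
    """Return (ok, ratio) for an input to a calibrated model.

    ok is False when `value` lies outside the calibrated range.
    ratio is `value` divided by the largest calibrated size.
    """
    ratio = value / calibrated_max
    ok = calibrated_min <= value <= calibrated_max
    return ok, ratio


# Hypothetical units: the largest calibrated projectile has size 1.0.
smallest_calibrated = 0.1
largest_calibrated = 1.0
# Assumption: "640 times larger" is read relative to the largest calibrated size.
foam_debris = 640 * largest_calibrated

ok, ratio = check_within_calibration(foam_debris, smallest_calibrated, largest_calibrated)
if not ok:
    print(f"Input is {ratio:.0f}x the largest calibrated size: result is an extrapolation")
```

A check like this does not make the model any more accurate. It only makes the simplification visible, so that the result is questioned rather than accepted at face value.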

5. MIGRATION OF DECISION MAKING

High Reliability Organisations migrate decision making as far down the organisation as possible

• Decisions are not made by one central authority. Decisions need to be made where there is expertise. This helps decisions to be made quickly and correctly

5. MIGRATION OF DECISION MAKING

In order to defer to expertise:

• Decision making ability is migrated to the lowest appropriate levels

• People are trained in making decisions and are given the right resources to do so

• There is recognition of skill levels and legitimacy through the organisation and people are trusted

• There is hierarchy, but decision making is pushed to the extremes. For example if there is debris on the runway, whoever spots it can halt operations and have it cleared

• Rank is not treated as an issue here

NASA Mission STS-107:

• Decision making was centralised among managers, who ignored the expert opinions of engineers

• Authority was required for decisions to be made

• Example: When images were requested, the organisation worried about the rank of the requestor

KEY POINTS

• Organisational approaches are necessary for achieving dependable systems. Dependability is not a quality of a technology but a quality of technology-in-practice.

• Technologies are not inherently dependable, but require people to operate and manage them in ways that are dependable

• The HRO literature has identified a number of qualities of highly reliable organisations. These mainly relate to the operation of technology, although some researchers have studied software development organisations from this perspective.

READING

KH Roberts (1990) Some Characteristics of One Type of High Reliability Organization. Organization Science, 1, 2: 160-76.

Book: Charles Perrow (1984) Normal Accidents: Living with High-Risk Technologies

Book chapter: Karl Weick (2005) Making Sense of Blurred Images. In W Starbuck and M Farjoun, Organization at the Limit. Blackwell Publishing
