Software Kills latest version for Rocky Mountain Safety Conference - 17 September 2009
Copyright © 2009 Athens Group, Inc.
15th Annual American Industrial Hygiene Association - Rocky
Mountain Section (AIHA-RMS), and the American Society of
Safety Engineers (ASSE) Colorado Chapter
FALL TECHNICAL CONFERENCE
September 16th & 17th, 2009 Arvada Center
“Environmental Health & Safety Broadening Our Alliances”
Accidents Caused by
Software Errors
Don Shafer, CSDP
Chief Technology and HSE Officer
Athens Group, Inc.
A Safety Minute – 17 September 2009
• Safety - Dropped object fatality in the Keppel FELS shipyard
• Since arriving in Singapore, I've been mildly shocked at the nonchalance shown
around lifting operations. It's quite common to see crane operations lifting loads
over active walkways; walkways that are not taped off, and often there's little
notice by the workers using the walkways of the lifting operations occurring
above them.
• This resulted in tragedy yesterday. A basket of scrap cable was being
transferred from a rig to the dock. Reportedly, the crane was blowing its horn as
per protocol for transfer operations. The load shifted and a chunk of cable
dropped, landing on the head of a dockworker underneath. I don't believe the
dockworker was participating in the lifting operation.
• Even more disturbing: no one on the dock came to his aid; rig personnel ran
over and attempted CPR. The ambulance arrived without paramedics, and rig
personnel accompanied the dockworker to the hospital, where he was
pronounced dead.
• Be careful out there: your PPE will protect you from many things, but your
awareness will save your life.
Presentation Outline
• Examples of Software Related Incidents
– Software you can “see”
– Software you cannot “see”
• Proven Practices to Reduce Software Risk
– Life Cycle Recognition
– Configuration Management
– FMECA
Those old Software Safety Chestnuts!
And, some not so old!
1. Air France - What is known about the crash of an Air France Airbus on 1 June
bears similarities with the little-noticed loss much earlier of two computer-
controlled passenger jets. Those two crashes raised questions of whether the
pilots or systems were really in control. Airbus said this data showed that the
pilots might have received conflicting information about their speed. There
was a “divergence in airspeed measurement” by the onboard systems of the
Air France aircraft. This is one of the matters being investigated, said Airbus.
Data to the onboard computers about air speed came from sensors called
pitot tubes, at least one of which was due for replacement. French authorities
have suggested that inconsistent air speed readings are not dangerous.
Software you can “see”
Safety Incident: Injured Rig Hands
Incident
The elevators and bales of an older-model top drive reacted
erratically to a rapid and erroneous user command. The
vendor had released a software patch for that model to
prevent this erratic behavior, but somehow it had not been
communicated to or installed on that drilling unit. There was
little or no initial design and testing of the control software,
and the software interlock issue was not discovered. Little or
no system requirements gathering was done on the control
system, and no FMECA was done on the control portion of
the top drive. There was no consistent management of
change treating software as an asset on the MODU between
the supplying vendor and the operator.
Result
The bales swung around and injured two
of the rig hands, resulting in reportable
LTIs.
Estimated Lost Time: 5 days
Day Rate: $310,000
Minimum Cost: $1,550,000
Solution
Initial requirements definition, FMECA of the control system, and software change control protocols would have avoided this incident.
Safety Incident: Potentially Deadly Mishap
Incident
A driller was performing a test with a riser joint suspended
70 feet (21 meters) above the drill floor. Prior to leaving the
drill cabin for a Job Risk Analysis meeting with the
roughnecks, the driller selected “standby” mode on the
drilling chair. While doing so, he inadvertently pressed the
keypad button that activates Pipe Handling mode. In this
mode, the drill control system sends a pressure monitoring
command to the pipe elevator every three minutes. The
driller stepped out onto the drill floor and three minutes later
the pressure monitoring command was sent to the riser
handling equipment, which mistook it for an unlock
command.
Result
The riser tool released the joint, which fell through the well center into the ocean. The joint fell perfectly through the slips. Neither personal injury nor collateral equipment damage was experienced.
Estimated Lost Time: 1.5 days
Day Rate: $310,000
Minimum Cost: $465,000
Solution
An FMECA of the equipment covering operational states and message flow could have prevented this incident.
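The riser incident turns on a receiver treating a monitoring message as an unlock command. The following is a minimal sketch of the kind of message-flow check an FMECA might call for, not the actual rig protocol: the command codes, the `RiserTool` class, and its fields are all hypothetical. It shows a receiver that rejects unknown codes, keeps monitoring read-only, and refuses to unlock while a load is suspended.

```python
from dataclasses import dataclass
from enum import Enum


class Command(Enum):
    # Hypothetical command codes; the real protocol is not described in the slides.
    PRESSURE_MONITOR = 0x10
    UNLOCK = 0x20


@dataclass
class RiserTool:
    """Illustrative receiver with an explicit interlock on the unlock path."""
    load_suspended: bool = True
    locked: bool = True

    def handle(self, raw_code: int, operator_confirmed: bool = False) -> str:
        # Reject anything that is not an explicitly known command.
        try:
            command = Command(raw_code)
        except ValueError:
            return "rejected: unknown command code"

        if command is Command.PRESSURE_MONITOR:
            # Monitoring is read-only and never changes the lock state.
            return "ok: pressure reported"

        # Only UNLOCK remains. Interlock: never release a suspended load.
        if self.load_suspended:
            return "rejected: load suspended"
        if not operator_confirmed:
            return "rejected: operator confirmation required"
        self.locked = False
        return "ok: unlocked"


if __name__ == "__main__":
    tool = RiserTool()
    print(tool.handle(0x10))                           # periodic monitoring message
    print(tool.handle(0x20, operator_confirmed=True))  # unlock attempt under load
```

The design choice illustrated is that the unlock path checks the equipment state itself rather than trusting the sender's operating mode.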
Safety Incident: Dropped Blocks
Incident
The semisubmersible MODU was in the final stages of
pulling the BOP. The BOP was being lifted the last meter to
gain clearance for access to the BOP transporter in the
moonpool. With the travelling block at the uppermost limit,
the Kinetic Energy Management System was ‘tripped’, and
the resulting action was not as expected. The anti-bird
nesting components were incorrectly installed, thus limiting
the 1200 psi used to function the service brake to 200 psi.
There was no operator error and the incident was a result of
a disc brake system failure.
Result
Traveling blocks, complete with riser and suspended BOP, descended approximately 50 meters in an uncontrolled manner, until the Top Drive impacted against the riser gimbal at the rig floor level.
Estimated Lost Time: 5 days
Day Rate: $477,000
Minimum Cost: $2,385,000
Solution
An FMECA of the equipment covering operational states and message flow could have prevented this incident. Regression testing of software upgrades and formal change control should have taken place.
Safety Incident: Top Drive Out of Control
Incident
During the voyage to location, a technician was
‘tweaking’ the zone management parameters on a
newbuild. A few minutes later the top drive started
rotating by itself. The technician, in his zeal to fix one
thing, had broken another, thereby introducing a
regression into the system. He was also unable to
quickly recover to a previous known state, as he
wasn't following software change control protocols.
Result
The technician and the team had to scramble to correct the issue. Fortunately, no equipment was damaged and no personnel were injured.
Estimated Lost Time: 2 days
Day Rate: $380,000
Minimum Cost: $760,000
Solution
Following software change control and testing protocols would have prevented this.
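The inability to "recover to a previous known state" is the gap that software change control closes. As a rough illustration only, the sketch below keeps a snapshot of every parameter set so that a change can be undone in one step. The `ParameterStore` class, the parameter names, and the change-request ID are hypothetical and do not describe any vendor's system.

```python
import copy


class ParameterStore:
    """Keeps a snapshot per change so a known-good state can be restored."""

    def __init__(self, initial):
        # The first snapshot is the commissioned baseline.
        self._history = [("baseline", copy.deepcopy(initial))]
        self.current = copy.deepcopy(initial)

    def apply_change(self, change_id, updates):
        # Refuse parameters that are not part of the controlled set.
        unknown = set(updates) - set(self.current)
        if unknown:
            raise KeyError(f"unknown parameters: {sorted(unknown)}")
        self.current.update(updates)
        self._history.append((change_id, copy.deepcopy(self.current)))

    def rollback(self):
        # Drop the latest change (never the baseline) and restore the one before.
        if len(self._history) > 1:
            self._history.pop()
        change_id, snapshot = self._history[-1]
        self.current = copy.deepcopy(snapshot)
        return change_id


if __name__ == "__main__":
    # Hypothetical zone management limits, in meters.
    store = ParameterStore({"upper_limit_m": 40.0, "lower_limit_m": 2.0})
    store.apply_change("CR-001", {"upper_limit_m": 38.5})
    restored = store.rollback()
    print(restored, store.current)
```

A real change control process would also record who approved the change and what regression tests were run; the sketch covers only the rollback step.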
Safety Incident: Generator Trip
Incident
A vendor arrived onboard a rig after having been officially
requested to make changes to the rig's automation system.
While onboard, an unofficial request was made by a system
operator regarding the numbering of main engine cooling
system valves. The vendor either hadn't completely
understood the request or had been distracted, and
inadvertently made the change to the wrong valve. Some
time later a different operator attempted to give a close
command to the valve in preparation for maintenance of the
system.
Result
Closing of the incorrect valve caused a generator trip.
Estimated Lost Time: 0.5 days
Day Rate: $310,000
Minimum Cost: $155,000
Solution
Had formal change control procedures been adopted, no unofficial change request would have been carried out.
Safety Incident: Control System Reset Kills 4
Incident
A control system failure occurred on a large, off-shore construction
vessel. Two control units were restarted twice, unsuccessfully. A
blinking red lamp on the PLC indicated that a memory reset was
required, even though a memory reset had NEVER been requested
by control system diagnostics during equipment operations. As soon
as the hydraulic power packs started, a loud bang was heard. A
quadruple joint of pipe dropped approximately one meter to the
welding deck below. A second quadruple joint of pipe in the pipe
elevator was released (all clamps opened and the hydraulic safety
stop swung away) and fell the full length of the tower, smashing
through a crowded access platform to the deck below.
The initialization instruction was pre-loaded in PLC EPROM memory
and the initialization included instructions to OPEN ALL CLAMPS.
Result
Eight personnel were injured, four fatally. All were located on the access platform and several were thrown overboard by the impact.
Estimated Lost Time: 20 days
Day Rate: $510,000
Minimum Cost: $10,200,000
Solution
An FMECA of the equipment covering operational states and message flow could have prevented this incident. Document the impact of resetting control systems during operations.
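The minimum-cost figure on each incident slide is the estimated lost time multiplied by the day rate. The short sketch below reproduces that arithmetic from the figures quoted on the six slides and sums them. The totals are derived here, not stated in the presentation, and the `minimum_cost` helper is only an illustrative name.

```python
# (incident title, estimated lost time in days, day rate in USD), as quoted on the slides
INCIDENTS = [
    ("Injured Rig Hands", 5, 310_000),
    ("Potentially Deadly Mishap", 1.5, 310_000),
    ("Dropped Blocks", 5, 477_000),
    ("Top Drive Out of Control", 2, 380_000),
    ("Generator Trip", 0.5, 310_000),
    ("Control System Reset Kills 4", 20, 510_000),
]


def minimum_cost(days, day_rate):
    # Lost rig time only; excludes repairs, injuries, and other consequences.
    return days * day_rate


if __name__ == "__main__":
    for title, days, rate in INCIDENTS:
        print(f"{title}: {days} days x ${rate:,} = ${minimum_cost(days, rate):,.0f}")
    total_days = sum(days for _, days, _ in INCIDENTS)
    total_cost = sum(minimum_cost(days, rate) for _, days, rate in INCIDENTS)
    print(f"Total: {total_days} days, ${total_cost:,.0f}")
```

Across the six incidents this comes to 34 days of lost time and $15,515,000 in minimum cost, counting day rate alone.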
What Did We Learn?
• Understand the impact of resetting control systems during operations
– When a system is reset, to what configuration do the components return:
a predefined, fail-set, or fail-safe state?
– Loss of communications caused a revert to an unanticipated configuration:
pipe rams opened unexpectedly and the string was lost in-hole
• There are known instances where systems were reset as a matter of procedure,
at established intervals, to prevent incidents
– An incident occurred as a result: loss of station after a failed reboot of the DP
processor
• Statistically, most reset/reboot operations are completed without incident
– False sense of security: the perception that a reset is actually preventing
incidents
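The reset question above can be made concrete by recording, for each component, what state it takes after a reset and comparing that with the state judged safe during operations. This is a sketch under stated assumptions: "fail-set" is read here as "hold the last commanded state", which is an interpretation, and the function names and state labels are hypothetical.

```python
from enum import Enum


class ResetPolicy(Enum):
    PREDEFINED = "predefined"  # load a state stored with the initialization code
    FAIL_SET = "fail_set"      # assumed meaning: hold the last commanded state
    FAIL_SAFE = "fail_safe"    # move to the state judged safe for operations


def state_after_reset(policy, last_state, predefined_state, safe_state):
    """Return the state a component ends up in after a reset."""
    if policy is ResetPolicy.PREDEFINED:
        return predefined_state
    if policy is ResetPolicy.FAIL_SET:
        return last_state
    return safe_state


def reset_review(component, policy, last_state, predefined_state, safe_state):
    """Flag a hazard when the post-reset state differs from the safe state."""
    after = state_after_reset(policy, last_state, predefined_state, safe_state)
    return {"component": component, "after_reset": after, "hazard": after != safe_state}


if __name__ == "__main__":
    # Mirrors the clamp incident: initialization stored as "open all clamps".
    print(reset_review("pipe clamps", ResetPolicy.PREDEFINED,
                       last_state="CLOSED", predefined_state="OPEN",
                       safe_state="CLOSED"))
```

Running such a review for every component before operations begin is one way to document the impact of a reset instead of discovering it under load.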
Software you cannot “see”
Complexity is NOT your friend!
15 Copyright©2009 Athens Group, Inc.
Your IT Network is Safe?
IT contractor indicted for sabotaging offshore rig management
system. Company had refused to offer him a permanent job,
feds say (March 18, 2009):
Mario Azar, 28 of Upland, Calif., was charged with illegally
accessing and compromising a computer system used by
Pacific Energy Resources Ltd. (PER) to monitor offshore
platforms in California and Anchorage and to detect oil
leaks. The indictment papers allege that Azar's actions
affected the "integrity and availability" of the system and
resulted in it becoming temporarily unavailable. Though no
oil spill or environmental hazard occurred while the system
was compromised, Azar's actions caused thousands of
dollars in damage, the indictment said.
Cyber criminals targeting energy – 15 March 2009
• Based on an analysis of more than 240 billion requests for
analysis by corporate users, there was nearly 600% malware
growth between like quarters in 2007 and 2008, and a 300%
volume ratio increase from January 2008 through December
2008.
• A vertical industry analysis of malware growth found the energy
and oil sector to rank in the top five targets in all threat
categories. But energy and oil leads the pack by a long shot
when it comes to one important category: encounters with
unique new variants of data theft Trojans.
• With advances in the technology and sophistication of cyber
attacks, malware delivered through the web can be remotely
customized and configured once in place, based on the victim's
identity.
What do the Authorities Say?
How to implement these activities
and processes is not prescribed by
the recommended practice. The
recommended practice is primarily
focused on the ‘what to do’.
DNV RP D-201
Recommended Practice for Integrated Control Systems
How can Software become Safer?
• Awareness of Development Life Cycle
• Software Configuration Management (SCM)
• Failure Mode and Effects Criticality Analysis
(FMECA)
Athens Group Deliverables Life Cycle Model
[Figure: life cycle diagram. Phases: Design, Construction, Acceptance, Operation, with FMECA marked at three points.]
• Concept Definition Activity; System Requirements
• Hardware: Hardware Requirements, Preliminary Design, Detailed Design, Module Development, Unit Test, Integration and Test
• Software: Software Requirements, Preliminary Design, Detailed Design, Coding, Unit Test, Integration and Test
• Integrated System Testing, Deployment, Operations, Maintenance
• Athens Group deliverables shown: Contractual Software Standards, Vendor Software Process Assessment, Requirements Validation, Design Verification, Factory Acceptance Testing, Commissioning Planning, Acceptance Planning, Controls & Network Commissioning, Startup Support, Alarm Management, Troubleshooting and Remediation, Vendor Management, Software Change Management
Software Change Management
Failure Mode and Effects Criticality Analysis (FMECA)
[Figure: FMECA process flow, with the following boxes]
• SOW accepted and documents made available
• Iterative Questions with customer and vendor(s)
• Block Diagrams and Initial Product/Process Analysis
• Identification of System and Functions
• Identification of Failure Modes
• Identification of Possible Causes
• Determination of Effects of Failure Modes
• Participant Review of Analyzed Information
• Documentation of FMECA Results
• Implement Risk Mitigation
Role markers on the flow: Facilitator Enters Here, SME Enters Here, Participant Enters Here, Participant Exits Here, SME Exits Here, Facilitator Exits Here
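To make the "criticality" part of FMECA concrete, the sketch below ranks a few failure modes taken from the incident slides. The presentation does not say how Athens Group scores criticality, so this uses a common FMEA-style simplification, the risk priority number (severity x occurrence x detection, each on a 1 to 10 scale). Every score shown is a made-up illustration, not data from the incidents.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    description: str
    severity: int    # 1 (negligible) to 10 (fatal)
    occurrence: int  # 1 (remote) to 10 (frequent)
    detection: int   # 1 (always detected) to 10 (undetectable before harm)

    def __post_init__(self):
        # Keep every score on the assumed 1-10 scale.
        for name in ("severity", "occurrence", "detection"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")

    @property
    def rpn(self):
        # Risk priority number: higher means address first.
        return self.severity * self.occurrence * self.detection


def rank(failure_modes):
    """Order failure modes from highest to lowest risk priority number."""
    return sorted(failure_modes, key=lambda fm: fm.rpn, reverse=True)


# Hypothetical scores, chosen only to illustrate the ranking.
WORKSHEET = [
    FailureMode("Reset initialization opens all clamps", 10, 2, 9),
    FailureMode("Monitoring command read as unlock command", 9, 3, 8),
    FailureMode("Valve renumbered in software without change request", 4, 4, 5),
]

if __name__ == "__main__":
    for fm in rank(WORKSHEET):
        print(f"{fm.rpn:4d}  {fm.description}")
```

The ranked list is what feeds the "Implement Risk Mitigation" step: the highest-scoring failure modes get interlocks, tests, or procedures first.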
In Conclusion
You can – and MUST – make Software Safer
1. Awareness of Development Life Cycle
2. Software Configuration Management (SCM)
3. Failure Mode and Effects Criticality Analysis (FMECA)
Don Shafer, CSDP
Chief Technology Officer
5608 Parkcrest Drive, Suite 200
Austin, Tx 78731
www.athensgroup.com
512.345.0600 x117
References
• NORA Symposium 2008: Public Market for Ideas and Partnerships -
http://www.cdc.gov/niosh/nora/symp08/posters/035.html
• Fatalities Among Oil and Gas Extraction Workers --- United States,
2003 - 2006 - http://www.cdc.gov/mmwr/preview/mmwrhtml/mm5716a3.htm
• Therac 25 - http://sunnyday.mit.edu/papers/therac.pdf
• Air France 2009 -
http://www.computerweekly.com/Articles/2009/06/01/236245/air-france-crash-thought-to-be-caused-by-system-failure.htm
http://www.computerweekly.com/Articles/2009/06/16/236447/air-france-airbus-pitot-sensor-linked-to-two-fatal-crashes.htm
• 8 Software Related Death Incidents -
http://www.baselinemag.com/c/a/Projects-Processes/Eight-Fatal-SoftwareRelated-Accidents/
Speaker Bio
Don Shafer, CSDP, developed Athens Group's oil and gas practice and leads Athens
Group engineers in delivering superior rig software services and oil and gas exploration
as well as production and pipeline monitoring systems for clients such as BP, Noble,
Transocean, Maersk, ExxonMobil, ConocoPhillips and Shell. Prior to co-founding
Athens Group, Don led groups developing and marketing hardware and software
products for Motorola, AMD and Crystal Semiconductor. He was responsible for
managing a $129 million-a-year PC product group that produced the award-winning
audio components for Apple. From the development of low-level software drivers in
yet-to-be-released Microsoft operating systems to the selection and monitoring of Taiwan
semiconductor fabrication facilities, Don has led key product and process efforts.
Don earned a BS degree from the USAF Academy and an MBA from the University of
Denver. Treasurer of the IEEE Computer Society Board of Governors, Past
Editor-in-Chief of the IEEE Computer Society Press, IEEE Senior Member and software
engineering book series author, Shafer is an adjunct professor in the Cockrell School of
Engineering at the University of Texas at Austin. An avid writer, Don has contributed to
three books, written over 20 published articles, and is co-author of Quality Software
Project Management, recently released by Prentice-Hall. He is a contributor to the 2010
edition of the multi-volume Encyclopedia of Software Engineering. His latest patents
are in state-based machine control.
Who We Are
• Founded 1998
• Offices in Houston and Austin, TX
• Pioneered Drilling Technology Assurance℠ (DTA) Services
• 100% Referenceable Customers
– Over 70% have completed more than one project
with us
• Committed to supporting innovation and education in the
Oil & Gas Industry
Who We've Helped
What We Deliver