TRIBUTE TO FOUNDERS: ROGER SARGENT. PROCESS SYSTEMS ENGINEERING
TeCSMART: A Hierarchical Framework for Modeling and Analyzing Systemic Risk in Sociotechnical Systems
Venkat Venkatasubramanian and Zhizun Zhang
Dept. of Chemical Engineering, Complex Resilient Intelligent Systems Laboratory, Columbia University,
New York, NY 10027
DOI 10.1002/aic.15302
Published online in Wiley Online Library (wileyonlinelibrary.com)
Recent systemic failures in different domains continue to remind us of the fragility of complex sociotechnical systems. Although these failures occurred in different domains, there are common failure mechanisms that often underlie such events. Hence, it is important to study these disasters from a unifying systems engineering perspective so that one can understand the commonalities as well as the differences to prevent or mitigate future events. A new conceptual framework that systematically identifies the failure mechanisms in a sociotechnical system, across different domains, is proposed. Our analysis includes multiple levels of a system, both social and technical, and identifies the potential failure modes of equipment, humans, policies, and institutions. With the aid of three major recent disasters, we demonstrate how this framework can help compare systemic failures in different domains and identify the common failure mechanisms at all levels of the system. © 2016 American Institute of Chemical Engineers AIChE J, 00: 000–000, 2016
Keywords: artificial intelligence, design, fault diagnosis, safety, process control
Systemic Failures: Introduction
Recent systemic failures in different domains such as the
Global Financial Crisis (2007–2009), BP Deepwater Horizon
Oil Spill (2010), and Indian Power Outage (2012) continue to
remind us of the fragility of complex sociotechnical systems.
Systemic failures occur when an entire system collapses,
where the system is typically a large entity, whose failure negatively impacts a large number of people and their environment, causing enormous financial losses. Examples of such
systems are refineries, inter-state power grids, country-wide
financial networks, large institutions, and so forth. Union
Carbide’s Bhopal Gas Tragedy in 1984, in which an estimated
5000 died and about 100,000 were seriously injured by the
accidental release of methyl isocyanate, was a systemic failure.
Another example is the Piper Alpha Disaster in 1988, where
an offshore oil platform operated by Occidental Petroleum in
the North Sea, U.K., exploded, killing 167 and resulting in
about $2 billion in losses. The Challenger (1986) and Columbia (2003) Space Shuttle Disasters, Schering Plough Inhaler
Recall (1999), the Northeast Power Blackout (2003), the
spread of SARS (2003), the BP Texas City Refinery Explosion
(2005), and the Johnson & Johnson Multidrug Recall (2010)
are all examples of systemic failures in different domains.
Examples of financial systemic failures include Enron (2001)
and WorldCom (2002) collapses, and the Madoff Ponzi
Scheme (2008). The collapse of the News of the World newspaper organization (2011) is an example of systemic failure in
the media domain.
In each case, official postmortem inquiries were conducted and reports of the accidents were produced. Chemical engineers might study the BP Texas City Refinery Explosion Report,1 and people from the financial world may browse The Financial Crisis Inquiry Report,2 but rarely does one compare failures across different domains to study their commonalities and differences. When one undertakes such a comparative study, however, one is struck by the commonality across domains. There is an alarming sameness about such disasters, which can teach us important fundamental lessons. Although the failures listed above occurred in different domains, in different facilities, triggered by different events, there are common failure mechanisms that often underlie such events. Systematically identifying and understanding these mechanisms is essential to avoid such disasters in the future.
Modern technological advances are creating an increasing number of complex sociotechnical systems. By sociotechnical we mean that these systems comprise social elements (i.e., humans) as well as technical elements (such as pumps, valves, reactors, etc.). The human elements are not only an integral part of the system, they are also often the cause of major failures. The task of designing such systems, and their control mechanisms, to ensure safe operation over their life cycles is extremely challenging. Complex sociotechnical systems have a very large number of interconnected components with nonlinear interactions that can lead to "emergent" behavior—that is, the behavior of the whole is more than the sum of its parts—that can be difficult to anticipate and control.3 Moreover, these systems are not isolated—they interact with humans and the physical environment; in particular, human decision making and the associated errors are part of the feedback processes in these systems. The cumulative effect of the nonlinearity, interconnectedness, and interactions with humans and the environment makes these systems-of-systems potentially fragile and susceptible to systemic failures.

Correspondence concerning this article should be addressed to V. Venkatasubramanian at [email protected].
We propose a conceptual framework that can assist in systematically identifying the failure mechanisms in a complex sociotechnical system. Much like hazard and operability (HAZOP) analysis, which helps us identify potential hazards in equipment and process flowsheets by examining the failure modes of different components methodically, our framework examines the entire sociotechnical system, including the corporate, regulatory, and societal layers, and identifies the potential failure modes of equipment, humans, policies, and institutions. We also demonstrate how this new framework helps us compare systemic failures in different domains, in a detailed manner, and reveal the common failure mechanisms at all levels of the system. We compare the BP Texas City Refinery Explosion, the Global Financial Crisis, and the Northeast Blackout. Such a comparative analysis has not been conducted before, as most people generally think that these are completely different events, occurring in entirely different domains, and, therefore, are unlikely to have any common features of any value. We show that there are indeed common valuable lessons.
This article is organized as follows. The next section discusses the common patterns of failures at multiple levels. The section after that introduces our hierarchical modeling framework, the Teleo-Centric System Model for Analyzing Risks and Threats (TeCSMART). We then present a failure analysis and comparison, analyzing three prominent case studies—the Global Financial Crisis, the BP Texas City Refinery Explosion, and the Northeast Power Blackout—using the TeCSMART framework, and discuss their similarities and differences, shedding new light on systems failures. Such a model-based comparative study has not been made before. The last section discusses future directions.
Systemic Failures: Common Patterns of Failures at Multiple Levels
Postmortem investigations of many disasters have shown that systemic failures rarely occur due to a single failure of a component or personnel. Even though the senior management of a company typically tries to pin the blame on some unanticipated equipment failure, operator error, or a rogue trader, that is rarely the case for major disasters. For instance, Union Carbide initially claimed that the Bhopal Gas Tragedy was caused by a disgruntled employee who had sabotaged the equipment.4 Enron management initially blamed Andrew Fastow, Enron's CFO, as the sole culprit.5 But, again and again, investigations have shown that there are always several layers of failures, ranging from low-level personnel to senior management to regulatory agencies, that have led to major disasters.
Such investigations have shown that safety procedures had been deteriorating at the failed facilities for months, if not years, prior to the accident. For example, in the case of Piper Alpha, the Permit-to-Work system had been dysfunctional for months.6 In Bhopal, regular maintenance of safety backup systems had not been conducted for months.4 Massey Energy ran up about 600 safety violations in its Upper Big Branch mine during 2009–2010.7 OSHA statistics show that BP ran up 760 "egregious, willful" safety violations during 2008–2010 in Ohio and Texas. Compare this with the corresponding numbers for the other oil companies: Sunoco (8), ConocoPhillips (8), Citgo (2), and Exxon (1).8 These are clear evidence of a breakdown of the corporate safety culture over months or years.

One sees a similar pattern in financial disasters as well. For
example, in Enron, its senior management, led by Ken Lay and Jeff Skilling, created an extreme performance-oriented, risky culture that seems to have tolerated unethical behavior, which resulted in many violations, market manipulations, and so on.5 In the subprime crisis, the perverse incentive mechanisms in mortgage lending and its subsequent securitization and trading caused individuals and corporations to make highly leveraged bets that resulted in unsustainable risk extremes. Thus, it was not a question of if a disaster would occur but when.
Another common pattern is that people had not identified all the serious potential hazards. They had often failed to conduct a thorough process hazards analysis that would have exposed the serious hazards which resulted in the disasters later. Such incomplete hazards analysis was highlighted in the Cullen inquiry into Piper Alpha.53 Failure to perform such a hazards analysis was partially responsible for the meltdown of Bear Stearns, Lehman Brothers, Merrill Lynch, and others in the subprime market fiasco.9 However, the few who had performed such a hazards analysis did see the crash coming and profited billions of dollars, as described in Michael Lewis's book, now a movie, The Big Short.10 Yet another common cause is the inadequate training of plant personnel to handle serious emergencies.
All in all, the responsibility for a systemic failure typically goes all the way to the top levels of company management, who had paid only lip service to safety, tolerated non-compliant behavior, and even encouraged excessive risk taking and unethical behavior, all of which resulted in a poor corporate culture of safety,1,11–13 which in turn paved the way for the disasters.

We also find serious failings by regulatory, ratings, and auditing agencies, tolerated, sometimes even encouraged, by a laissez-faire political environment, playing a significant role. First and foremost, it does not matter whether the systems are chemical, petrochemical, or financial—self-policing does not work. This seems so obvious that people should not have to die, or lose all their money, to make us realize it. Sensible regulations are essential, but, more importantly, they must be audited and enforced by suitably trained personnel who have no conflicts of interest. Consider the betrayal of public trust by Arthur Andersen, the supposedly independent auditor of Enron, whose aiding and abetting of Enron's cooked books was instrumental in its systemic failure.5 The subprime market failures showed us that the rating agencies, which were supposed to make an independent assessment of the subprime-mortgage-backed securities, were so dependent on their Wall Street clients for their business that they merrily went stamping AAA ratings on junk instruments. Of the AAA-rated securities issued in 2006, an astonishing 93% were later downgraded to junk status.14
It is the same lesson we were taught by the BP Deepwater Horizon Oil Spill—how the Minerals Management Service was inherently conflicted between its goals of awarding leases and enforcing safety regulations.15 But this lesson should have been learnt long ago, after the Piper Alpha Disaster. Based on the Cullen Report's findings in 1988, the British government moved the responsibility for safety oversight from the Department of Energy to the Health and Safety Executive (HSE), the independent watchdog agency for work-related health, safety, and illness. A separate division was created within the HSE to monitor the safety of the offshore oil and gas industry.6
Indeed, the importance of addressing non-technical common causes, such as those described above, as an integral part of systems safety engineering was pointed out as far back as 1968 by Jerome Lederer, the former director of the NASA Manned Flight Safety Program for Apollo, who wrote:
System safety covers the entire spectrum of risk management. It goes beyond the hardware and associated procedures to system safety engineering. It involves: attitudes and motivation of designers and production people, employee/management rapport, the relation of industrial associations among themselves and with government, human factors in supervision and quality control, documentation on the interfaces of industrial and public safety with design and operations, the interest and attitudes of top management, the effects of the legal system on accident investigations and exchange of information, the certification of critical workers, political considerations, resources, public sentiment and many other non-technical but vital influences on the attainment of an acceptable level of risk control. These nontechnical aspects of system safety cannot be ignored.
To understand systemic failures and learn from them, one needs to go beyond analyzing them as independent one-off accidents, and examine them in the broader perspective of the potential fragility of all complex systems. One needs to study the disasters from a unifying sociotechnical systems engineering perspective, so that one can thoroughly understand the commonalities as well as the differences, and gain insights about the system-wide breakdown mechanisms in order to better design, control, and manage such systems in the future.
It is quite clear that to properly model and analyze systemic risk, one not only needs to model failures at the lowest level of a sociotechnical system (such as failures of equipment) but also, more importantly, model the human and institutional failures that occur at the higher levels of the system. The human elements are not only an integral part of the system, they are also often the cause of major failures. Hence, it is important to account for them, as explicitly as possible, in any risk modeling framework. This has not always been the case in the engineering modeling literature. For instance, most modeling studies in the process control literature do not account for errors committed by humans in their methodologies. HAZOP analysis, as another example, considers only equipment and operation failures in its guide-word based approach. We need a systematic methodology that can identify potential failure mechanisms, due to equipment, process, human, and institutional failures, at different levels of a sociotechnical system. This is what we try to accomplish in this article. This article is largely a conceptual contribution, describing a new modeling framework that articulates how the different levels of a complex sociotechnical system may be formally approached using control-theoretic ideas. Building on our prior work,16,17 we present such an integrative multiscale modeling framework, which addresses the role of the human element explicitly, and discuss its implications in the context of several prominent systemic failures in different domains.
In recent years, there has been interesting progress in understanding and modeling systemic risk in complex sociotechnical systems. Economists and physicists have used network theory for financial systems.18,19 Control theorists have proposed approaches adopting traditional control theory for understanding such systems.20,21 Others have proposed agent-based modeling22 or domain-independent system safety principles.23 Our prior work in this area has stressed the need for modeling cause-and-effect knowledge explicitly as well as the need for a multiscale modeling framework.16,17,24–28 Philosophically, our framework is similar to what has been proposed by Rasmussen and Svedung29 and by Leveson.30–33 In particular, it shares the main theme discussed by Leveson and Stephanopoulos,30 but we differ in the conceptual details of the underlying modeling framework. In addition, we demonstrate the utility of our framework across different domains using a comparative analysis of three well-known systemic failures, which has not been done before.
TeCSMART Framework
Complexity, in general, is hard to define and quantify precisely, as it comes in different flavors and can mean different things in different contexts. For instance, there is algorithmic or computational complexity as defined by computer scientists, which measures how much computational effort or time a particular problem might require for its solution—for example, polynomial vs. exponential time, as a function of some key scaling parameter of the given problem. Then there is the physics perspective, dynamical system complexity, which originated from the field of nonlinear dynamics and chaos. This deals with the general inability to predict the future behavior of a nonlinear dynamical system. In other fields such as biology (life and social sciences, in general), complexity is used to describe, in qualitative terms, the incredible diversity, organizational sophistication, and characteristics of individual agents (e.g., a cell or an animal), systems (e.g., an ecosystem, human society), processes/phenomena (e.g., intercellular and intracellular interactions), and so forth.
While it may be hard to state exactly what complexity, or a complex system, is, there is consensus, however, as to what features are typically associated with a complex system. Complex systems typically consist of many diverse, autonomous, and adaptive components that interact with one another, and their environment, in nonlinear, dynamical ways to produce a very large set of potential future states or outcomes. Interactions between such parts at a given scale typically give rise to "emergent" properties at larger scales in space and/or time, sometimes through self-organization, without any global knowledge or central control, that are hard to predict from the properties of the parts. They tend to have many feedback loops (both positive and negative), among their components as well as with their environment, which can cause adaptation and induce goal-directed (i.e., teleological) behavior, either intentionally or implicitly, thereby potentially altering the course of their future behavior. Hence, their characteristics are typically not reducible to an elementary level of description.
Thus, the essential features of a complex sociotechnical system may be summarized as: (1) goal-driven behavior, (2) many agents or components/sub-components, (3) organization in a multi-layered hierarchy or network, (4) nonlinear dynamical interactions among its agents (or components) and with the environment, (5) feedback loops, (6) decentralized control (i.e., local decision making), and (7) emergent behavior.
Most human-engineered complex systems, such as chemical plants, corporations, transportation networks, power grids, governments, societies, and so forth, are organized as a hierarchical network of human and nonhuman (e.g., machine) elements. Generally speaking, they comprise autonomous and non-autonomous elements, which usually translate to human and nonhuman entities. In this article, we do not consider nonhuman entities that are autonomous, such as robots, as they have not yet reached human-like autonomous capabilities, even though this is going to be an important development in a couple of decades.
We call our modeling framework TeCSMART (Teleo-Centric System Model for Analyzing Risks and Threats). Telos means goal or purpose in Greek. The central theme of our approach is the emphasis on recognizing and modeling the goals of different agents, at different levels of abstraction, in a complex sociotechnical system. Both individual players and groups in a complex system are goal-oriented, driven to act by their goals and incentives. Therefore, it is important to recognize and model this goal-driven behavior. Individuals (or groups) usually have different goals, and these goals may conflict with one another or with the goals of other individuals. The dynamics of how goals across the system interact, transform, and disperse through the hierarchy affects both individual and systemic performance. We use a simple feedback control module as a model for representing this goal-driven behavior, as we discuss below.
We propose an integrative framework that tries to capture the essential features of a complex teleological system with the purpose of modeling, analyzing, and managing systemic risk by accounting for the effects of both autonomous (i.e., human) and nonautonomous (i.e., "machine" or "mechanical") entities in a unified and systematic manner. We model a complex teleological system as a sociotechnical entity that is embedded in a society, affected by the society's goals and political environment. This leads to a multi-scale modeling framework, having seven layers organized as a hierarchy, as shown in Figure 1, that naturally arise and represent different perspectives of the entire system. Each layer above is a zoomed-out, aggregate view of the immediate layer below.

Figure 1. TeCSMART framework. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
For example, the block representing a process unit in the network of the Plant View contains the individual feedback loop of the Equipment View. The bottom layer of the stack is the basic building block of a system (e.g., equipment and processes). The top layer of the stack is the macroscopic view of a society.
Each layer has its own set of goals, which drive the decision-making and actions taken by the agents at that level. The decisions are taken based on the inputs the layer receives from the layers immediately above and below it. Similarly, the actions are communicated to these adjacent layers as outputs. These decisions/actions are indicated in Figure 1 by the arrows that capture these information flows, up and down the hierarchy. These information flows are the feedback loops between the layers (i.e., interlayer feedback loops). There are also feedback loops within a given layer, as depicted in Figure 1, which are intralayer loops. Associated with each layer is a set of agents (autonomous and nonautonomous), organized in a particular configuration that is appropriate for the goals of that layer (e.g., the layout of equipment in a chemical plant, called a flowsheet). Such a multilayered representation lends itself naturally to accounting for emergent phenomena that arise from one scale to another.
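The seven layers of Figure 1 (the Equipment, Plant, Management, Market, Regulatory, Government, and Societal Views) and their interlayer flows can be sketched as a simple ordered structure; the function name and dictionary keys below are ours, not the paper's:

```python
# The seven TeCSMART layers, bottom to top, as named in Figure 1.
LAYERS = [
    "Equipment View",
    "Plant View",
    "Management View",
    "Market View",
    "Regulatory View",
    "Government View",
    "Societal View",
]

def interlayer_flows(layer):
    """Sketch of the interlayer feedback loops: each layer reports
    performance upward and sets goals/set points for the layer below."""
    i = LAYERS.index(layer)
    return {
        "reports_up_to": LAYERS[i + 1] if i + 1 < len(LAYERS) else None,
        "sets_goals_for": LAYERS[i - 1] if i > 0 else None,
    }
```

For example, `interlayer_flows("Plant View")` returns the Management View as the layer it reports up to and the Equipment View as the layer it sets goals for, mirroring the arrows of Figure 1.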
We propose a uniform and unified input-output modeling framework that is conceptually the same across all levels. The elementary input-output model structure that serves as a building block in our framework is shown in Figure 2. Specifying such a uniform modeling structure across all levels has the advantage of integrating and unifying the analysis of the outcomes at different levels in a consistent manner. Such a template structure allows us to systematically identify the various failure modes of the different elements at different levels of the hierarchy, as we discuss below. There are five key elements in this control-theoretic information modeling building block: (1) sensor, (2) actuator, (3) controller, (4) "process" unit that transforms inputs to outputs, and (5) connection (e.g., wires and pipes). These, combined with input and output, complete the picture. The functions of these elements, as well as their failure modes, at different levels of the hierarchy are illustrated in the discussion below, using examples from chemical engineering. It is relatively easy to generalize this discussion to other engineering domains. The domain of finance requires special treatment, and we make that connection wherever needed.
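As a minimal sketch of this building block, the five elements can be collapsed into a small template; the class, its method names, and the proportional control law are illustrative assumptions, not the paper's specification:

```python
# Minimal sketch of the five-element building block of Figure 2.

class FeedbackBlock:
    """One building block: sensor, actuator, controller, "process" unit.
    The fifth element, the connection, is implicit in the method calls."""

    def __init__(self, setpoint, gain, process, state):
        self.setpoint = setpoint      # the block's goal
        self.gain = gain              # controller: proportional gain
        self.process = process        # "process" unit: (state, action) -> new state
        self.state = state

    def sense(self):
        return self.state             # sensor: measure the current output

    def step(self):
        error = self.setpoint - self.sense()   # controller: goal vs. measurement
        action = self.gain * error             # actuator command
        self.state = self.process(self.state, action)
        return self.state

# Usage: a trivial first-order "process" driven toward a set point of 10.
block = FeedbackBlock(setpoint=10.0, gain=0.5,
                      process=lambda x, u: x + u, state=0.0)
for _ in range(60):
    block.step()
# block.state has converged to the set point, 10.0
```

Because the same template recurs at every layer, identifying failure modes reduces to asking how each of the five elements in such a block can fail.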
As an organized group, these entities collect, decide, act on, report, and receive a variety of performance information and metrics. At any level, the layer below acts as the sensors, actuators, and processes in the interlayer feedback loop, while the layer above behaves like a controller that evaluates the lower level's performance and sets new goals. In a chemical plant, for example, agents in the Equipment View Layer collect, decide, and act on individual process and equipment performance data and metrics (such as temperature, pressure, flow rate, batch times, etc.), which are vital for safe, efficient, and profitable operation; report them to the Plant View Layer; and receive, in turn, local control specifications (such as temperature and pressure set points) from the Plant View Layer. The Plant View Layer agents make these decisions by considering information from all the processes and equipment under their purview as well as manufacturing targets (such as what to make, how much to make, when to make, etc.). These targets, in turn, are decided by the agents in the Management View, get translated into the associated set points and constraints by the agents in the Plant View, and are communicated down to the Equipment View as inputs. The target metrics are decided by the agents in the Management View by responding to competitive market conditions as dictated by the Market View. In a similar manner, relevant information regarding market or company stability, performance, fair competition, etc., is monitored and acted on by the agents in the Regulatory View, by enacting and enforcing appropriate regulations approved by the agents in the Government View (such as the Congress in the U.S.). In an ideal democracy, a government is elected by the citizens of that society, the Societal View, who have the final word in determining what kind of government and laws they would like to live by.
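This interlayer relationship, in which the upper layer acts as a controller for the layer below, can be sketched as two nested update rules; the function names, gains, and update laws here are illustrative assumptions, not from the paper:

```python
# Hedged sketch of an interlayer feedback loop: the upper layer reads the
# lower layer's reported performance and hands down a revised goal (set point).

def lower_layer_step(state, setpoint, gain=0.5):
    """Lower layer (e.g., Equipment View): tracks the set point it is given."""
    return state + gain * (setpoint - state)

def upper_layer_controller(report, target, setpoint, gain=0.2):
    """Upper layer (e.g., Plant View): adjusts the set point it hands down
    so that the lower layer's report approaches the upper layer's target."""
    return setpoint + gain * (target - report)

state, setpoint, target = 0.0, 0.0, 5.0
for _ in range(200):
    state = lower_layer_step(state, setpoint)              # intralayer dynamics
    setpoint = upper_layer_controller(state, target,       # interlayer feedback
                                      setpoint=setpoint)
# state is driven toward the upper layer's target of 5.0
```

The point of the sketch is structural: the lower layer behaves as the "process" in the upper layer's control loop, exactly as the text describes.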
Similar activities occur within layers through intralayer feedback loops. In the Equipment View Layer, for example, the stirred tank heater depicted in Figure 3 has sensors to measure temperature and tank level. Controllers evaluate these measurements and send new control signals to valves. In the Management View Layer, a firm's accounting team collects the performance data and shares it with the Board of Directors. The Board sets the company's goals based on the data. Each division follows these goals and carries out its daily operations. Periodically, new performance data are collected and the goals updated. At each layer, if autonomous or nonautonomous agents do not comply with the goal, disturbances arise at that layer. Controllers take these disturbances into account and set goals accordingly. Such intralayer feedback loops exist in all seven layers. Details of each layer are presented in the following discussion.

Figure 2. Schematic of a feedback control system (adapted from Ref. 34, Chemical Process Control, Fig. 13.1b, p. 241).

Figure 3. Stirred tank heater example (adapted from Ref. 34, Chemical Process Control, p. 89).
Perspective I: Equipment View Layer
In the Equipment View Layer, the focus is on individual equipment, such as reactors and distillation columns in the context of a chemical plant, and their operating conditions. A chemical plant is a collection of such process units suitably organized (in a layout called a flowsheet) to meet the plant-wide goal of manufacturing a desired chemical product at targeted levels of quality, quantity, cost, time of delivery, etc., safely and optimally. This collection is seen in Perspective II, the Plant View Layer. The time scale for the Equipment View Layer is typically seconds to minutes, as process dynamics happens in real-time.

In the Equipment View Layer, the autonomous agents involved are typically engineers and operators, and the nonautonomous agents are equipment, including control systems. While regulatory control systems can exhibit a certain degree of autonomy, it is negligible compared to the range of autonomy exhibited by humans. Hence, we classify regulatory controllers as nonautonomous.

Consider, for example, the stirred tank heater process (Figure 3), where the goal is to control the level h and temperature T of the fluid in the tank, which is subject to fluctuations in the inlet flow rate Fi and temperature Ti. The desired level of the fluid is referred to as the set point level hset and the desired temperature as Tset. These are accomplished by the two feedback controllers (loops 1 and 2), which receive the current h and T in real-time from the sensors (level gauge and thermocouple), and suitably manipulate the outlet flow rate F and the steam flow rate Fsteam by opening or closing the respective control valves (actuators). The seven elements of the information modeling block for this system are: (1) input: Fi, Ti, hset, Tset, Fsteam; (2) output: h and T; (3) sensors: level gauge and thermocouple; (4) actuators: outlet flow and steam valves; (5) controller; (6) "core" process unit: tank and heater; and (7) connection: pipes and wires. The constraints are lower and upper limits on the level and the temperature of the fluid in the tank.
The goal at the Equipment View level is centered on the performance of individual equipment, such as heaters, reactors, distillation columns, and so forth—that is, each piece of equipment has its goal of operating at its set point(s). At this level of granularity, typically, for engineering applications, one can develop detailed dynamical models of the equipment and processes. These tend to be sets of differential and algebraic equations (DAEs) that are solved to simulate process/equipment behavior. Since the purpose of this article is not to discuss these models at length, we refer the interested reader to several standard sources in the literature.34–37 As an example, we list below the dynamical model equations for the stirred tank heater:
A dh/dt = Fi − F

Ah dT/dt = Fi(Ti − T) + Q/(ρ cp)
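To make the balances concrete, the sketch below (not from the paper) integrates them with forward Euler and closes the two proportional loops of Figure 3; all numerical values, including the gains, are assumed for illustration:

```python
# Forward-Euler simulation of the stirred tank heater balances
#   A dh/dt  = Fi - F
#   A h dT/dt = Fi (Ti - T) + Q/(rho*cp)
# closed with two proportional feedback loops (assumed parameters and gains).

A, rho_cp = 1.0, 1.0           # tank cross-section; rho*cp lumped (assumed)
h_set, T_set = 2.0, 350.0      # set points
Kc_h, Kc_T = 2.0, 5.0          # proportional gains (illustrative)

h, T = 1.5, 300.0              # initial conditions
Fi, Ti = 1.0, 300.0            # inlet flow and temperature, held constant
dt = 0.01

for _ in range(20000):
    F = max(0.0, Fi - Kc_h * (h_set - h))   # loop 1: level via outlet valve
    Q = max(0.0, Kc_T * (T_set - T))        # loop 2: temperature via steam duty
    dh = (Fi - F) / A
    dT = (Fi * (Ti - T) + Q / rho_cp) / (A * h)
    h += dt * dh
    T += dt * dT
```

With proportional-only control as written, h settles at hset while T settles with a steady-state offset below Tset, a standard property of P-control facing a constant load.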
Another kind of model used at this level, called the signed directed graph model (or signed digraph [SDG] model), is based on graph-theoretic ideas to represent cause-and-effect relationships in a process or equipment.24–26 The SDG model for the heater example is shown in Figure 4. The nodes represent input and output variables. The arcs represent either positive (solid lines) or negative (dotted lines) relations between nodes. The figure is read as follows: a change in the inlet temperature Ti positively affects the temperature T in the stirred tank; for example, if Ti increases, T will increase. T negatively affects the temperature difference T*, which is the set point temperature Tset minus the stirred tank temperature T: as T increases, T* decreases. This means that less steam Fsteam is needed in the stirred tank, because T gets close to the set point temperature Tset. The positive relation between T* and Fsteam is depicted by a solid arc between the two nodes. Fsteam, in turn, positively affects the temperature T in the stirred tank. This causal behavior among T, T*, and Fsteam corresponds to loop 2 in Figure 3. These qualitative models are easier to develop and analyze than the DAE models, particularly for modeling and analyzing failure modes and hazards.17,28 However, as they are qualitative in nature, they are limited to certain kinds of queries and can lead to ambiguities.
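The reading of Figure 4 just described can be mechanized; the small sketch below is our own encoding (with "T*" written for the Tset − T error node), storing the arcs as +1/−1 signs and multiplying signs along a path:

```python
# Sketch of the Figure 4 SDG: arcs carry +1 (solid) or -1 (dotted) influences.

SDG = {
    ("Ti", "T"): +1,        # inlet temperature raises tank temperature
    ("T", "T*"): -1,        # as T rises, the error T* = Tset - T falls
    ("T*", "Fsteam"): +1,   # a larger error calls for more steam
    ("Fsteam", "T"): +1,    # steam heats the tank (closes loop 2)
    ("Fi", "h"): +1,        # inlet flow raises the level
    ("F", "h"): -1,         # outlet flow lowers the level
}

def propagate(path):
    """Net qualitative sign of a disturbance propagated along a path of nodes."""
    sign = +1
    for edge in zip(path, path[1:]):
        sign *= SDG[edge]
    return sign

# A rise in T shrinks the error and hence the steam demand:
# propagate(["T", "T*", "Fsteam"]) evaluates to -1
```

This is the kind of qualitative query SDG models support: fast sign-propagation along causal chains, at the price of the ambiguities noted above when paths of opposite sign compete.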
Nevertheless, such cause-and-effect based qualitative models are very useful when modeling a social system, where DAE models are usually hard to develop, such as the bank-dealer system discussed by Bookstaber et al.38 In this case, the nodes are variables related to a bank-dealer's investment and lending activities. In Figure 5, the left-hand side depicts the connections and activities within the bank-dealer, while the right-hand side shows the SDG model. A bank-dealer consists of three major desks: the finance desk determines where money should go; the prime broker determines how much money to lend based on the collateral collected; and the trading desk determines whether to sell to the market or buy from the market based on the money received from the finance desk and the leverage ratio it holds. The SDG model is read as
Figure 4. SDG for the tank heater example.
6 DOI 10.1002/aic Published on behalf of the AIChE 2016 Vol. 00, No. 00 AIChE Journal
follows: the finance desk collateral C_FD positively affects the funding capacity V_FD. V_FD in turn positively affects the loan capacity of the prime broker, V_PB, and the leverage set point of the trading desk, λ_TD^SP. In the prime broker, both the collateral amount C_PB and the margin rate v_PB positively affect the loan capacity V_PB. In the trading desk, the leverage set point λ_TD^SP and the current leverage λ_TD determine the leverage difference Δλ_TD, which positively affects the inventory quantity of the trading desk, Q_TD. As Bookstaber et al.38 demonstrate, using the SDG model one can quickly examine the causal relations of a social system like the bank-dealer system, and study unstable conditions and risks such as the fire sale and funding run scenarios.

One can always incorporate other modeling methods into the TeCSMART framework. Usually, in order to develop a quantitative model (a DAE model) or a qualitative model (an SDG model), one needs to determine the initial conditions of a system. System initial conditions at this level are values associated with equipment, such as sensor readings or controller parameters. Examining failure modes using the TeCSMART framework provides a systematic way of identifying system initial conditions. Given different system initial conditions, modelers can develop suitable models to describe the system and conduct in-depth risk analysis. Therefore, no matter what modeling methods or risk assessment tools one uses, a HAZOP-like systematic analysis using the TeCSMART framework is feasible for analyzing risks in a sociotechnical system. It enables systematic hazard identification for the risk assessment of a sociotechnical system.
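As a sketch of that point (all element names, nominal values, and offsets below are our illustrative assumptions), a HAZOP-like enumeration of failure modes can mechanically generate the perturbed initial conditions that seed a quantitative model:

```python
# Sketch: enumerate sensor failure modes (high / low / zero) and turn each
# one into a perturbed initial condition for a quantitative model run.
# Element names, nominal values, and offsets are illustrative assumptions.

NOMINAL = {"T_sensor": 300.0, "h_sensor": 1.0}
FAILURE_MODES = ("high", "low", "zero")

def initial_conditions(element, nominal, offset=10.0):
    """Map each failure mode of one element to a faulty initial reading."""
    faulty = {"high": nominal + offset, "low": nominal - offset, "zero": 0.0}
    return {(element, mode): faulty[mode] for mode in FAILURE_MODES}

scenarios = {}
for element, nominal in NOMINAL.items():
    scenarios.update(initial_conditions(element, nominal))
# Each (element, mode) key is one simulation scenario for risk analysis.
```

Each generated scenario can then be fed to whichever DAE or SDG model the analyst has chosen, which is the sense in which the framework is model-agnostic.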
The basic functional building block in Figure 2 allows us to model systematically the potential failures of both human and non-human elements at different levels. In the Equipment View Layer, let us consider a sensor, for example. Using a commonly used model of its failure modes, we can state that a sensor can fail high, low, or zero (i.e., no response; the sensor is dead). Similarly for an actuator (a valve can fail high, low, or zero) and a controller. A process might have more failure modes depending on its complexity, but usually not hundreds; more like a dozen or so. The connections can fail, too, again high, low, zero, or reverse (in the case of flow rate in pipes, for example). One can modify these to make the set of failure modes more sophisticated, if needed, but even this elementary set goes a long way, as we discuss below. We will show below how these failure modes can be generalized to accommodate typical human failures as well at different levels of the hierarchy.
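The elementary failure-mode sets just described can be sketched as a small enumeration; the tag names and the one-loop example below are our illustrative placeholders, not from the article.

```python
# Elementary failure-mode sets for the basic building-block elements,
# following the description above; element tag names are placeholders.

BASIC = ["high", "low", "zero"]

FAILURE_MODES = {
    "sensor":     BASIC,                # e.g., a stuck-high thermocouple
    "actuator":   BASIC,                # e.g., a valve failing open/closed
    "controller": BASIC,
    "connection": BASIC + ["reverse"],  # e.g., reverse flow in a pipe
}

def enumerate_failures(elements):
    """List every (element, mode) pair for a set of plant elements."""
    return [(name, mode)
            for name, kind in elements.items()
            for mode in FAILURE_MODES[kind]]

# A one-loop example: temperature sensor, steam valve, controller, pipe.
loop = {"TC-101": "sensor", "V-12": "actuator",
        "PIC-7": "controller", "steam-line": "connection"}
faults = enumerate_failures(loop)
```

Even this coarse set yields a complete checklist of single-fault scenarios for a loop, which is the systematic coverage the framework aims for.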
Perspective II: Plant View Layer
The Plant View Layer is a collection of all the equipment and processes organized in a particular configuration (or flowsheet) to manufacture a desired product safely and optimally. The autonomous agents involved in this layer are managers and supervisors, and the nonautonomous agents are equipment clusters. These clusters are usually grouped as critical process steps or unit operations,39 such as reaction, distillation, etc., which are needed in the manufacture of the desired product. Similarly, in the financial system example, the left figure in Figure 5 is the simplified "flowsheet" of a bank-dealer system. The Plant View agents collect and report metrics regarding aggregate production performance and safety to the Management View and receive, in turn, plant-wide target specifications from the Management View, as noted above. Although this level also operates in real time, the Plant View decisions typically have a larger time scale (hours or even days).
The goal at this level is to meet production performance targets (typically, product quantity and quality, cost, and time of delivery) safely and optimally at the overall plant level. These plant-wide targets translate into equipment-specific targets implemented as set points and constraints that are communicated to the Equipment View level. Models at this level tend to be the DAE models from Perspective I integrated together, reflecting the overall flowsheet organization of the plant. The flowsheet is then simulated to obtain plant-wide process and equipment behavior. One can also formulate such connected models using the SDG models from the lower level to explicitly capture the cause-and-effect relationships, which are then used for applications such as process hazards analysis.17,28,40-44
The input-output information model at this aggregate level is shown in Figure 1. From this level onward, going up to the higher levels, the emphasis shifts from decisions/actions made by individual equipment to those made by personnel, and from real-time sensor data to aggregate information concerning overall plant performance. It moves from a data-centric to an information-centric perspective. This is required to reflect the goal of this layer: to make the desired products at the targeted level of quality, quantity, cost, and time of delivery, safely and optimally. That is the charge of the Plant Manager, given to her by the senior management at the next layer above.
The seven elements here, therefore, reflect the aggregate nature of the information needed and used at this level: (1) input: aggregate, plant-level information on target as well as actual performance metrics; (2) output: schedule, set points, resource allocation, and so forth; (3) sensors: product quality and
Figure 5. SDG for the bank/dealer example (adapted from Ref. 38, "Process Systems Engineering as a Modeling Paradigm for Analyzing Systemic Risk in Financial Networks"). [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
quantity, resource utilization data, etc.; (4) actuator: plant personnel; (5) controller: Plant Manager; (6) "core" process unit: the entire plant; and (7) connection: various communication channels among plant personnel such as the Manager, Supervisors, Engineers, and Operators.
The failure modes associated with the elements at this level are conceptually similar to their counterparts at the lower Equipment Layer. For instance, sensors in this layer are not physical entities like thermocouples, but informational entities that aggregate and transform relevant data into actionable information, such as the projection made about the plant's product output for the current month. This transformation is carried out by a human, such as a process engineer. The engineer can also "fail" high, low, or zero, in the sense that the estimation reported to the Plant Manager can be erroneous along these lines; for example, the projection may be too optimistic (i.e., failing high), too conservative (i.e., failing low), or no projection is made (i.e., failing zero). Likewise, communication can also fail along these lines; perhaps the projection was made, but the Manager was not informed. Similarly, in a bank-dealer system, this layer represents the aggregation of investment and funding activities across different asset classes. The three major desks are divided into groups (actuators) to handle portfolios consisting of different assets. Sensors (i.e., analysts monitoring the metrics) in the Equipment View Layer of a bank-dealer system report leverage ratios or collateral collected, while sensors in this layer are risk models of portfolios, which aggregate and transform individual risk factors into a comprehensive picture of the portfolio's risk. We thus see that this template helps us identify systematically where and how things can fail at different levels of the hierarchy.
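The level-by-level reuse of the seven-element template can be sketched as a small data structure; the two instantiations below paraphrase the Equipment and Plant Views described above, and the specific labels are our illustrative choices.

```python
# Sketch of the seven-element input-output template described in the text,
# instantiated at two levels of the hierarchy; labels are illustrative.

from dataclasses import dataclass

@dataclass
class ViewTemplate:
    input: str
    output: str
    sensor: str
    actuator: str
    controller: str
    core_process: str
    connection: str

equipment_view = ViewTemplate(
    input="set points and constraints", output="process variables",
    sensor="thermocouple", actuator="steam valve",
    controller="PI controller", core_process="stirred tank heater",
    connection="pipes and signal lines")

plant_view = ViewTemplate(
    input="plant-wide targets", output="schedules, set points",
    sensor="process engineer's projections", actuator="plant personnel",
    controller="plant manager", core_process="the entire plant",
    connection="communication channels among personnel")
```

Because every level fills the same seven slots, the same high/low/zero failure questions can be asked of each slot, whether it is a thermocouple or an engineer's monthly projection.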
It is important to note that we are not claiming that our framework would capture everything that can go wrong in a complex system. We are only suggesting that such a systematic approach could capture many of the typical failures seen in practice, and we demonstrate this with the aid of three case studies.
Perspective III: Management View Layer
The next level up is the Management View, where the agents involved are the critical decision makers such as the CEO, Senior Vice Presidents, and the Board of Directors. Their goal is to maximize profitability and create value for the shareholders by making sure the company's business performance metrics (including safety) meet the expectations of the Market (which is the next level up). Influenced by the nature of business and accounting cycles, this layer operates on a time scale of a quarter (i.e., a 3-month period) to a year.
As seen in the control-theoretic information model of this level in Figure 6, this group of decision makers (the Management team) sets the overall policies that "control" (i.e., manage) the behavior and outcomes of the corporation, including its autonomous and non-autonomous assets. Autonomous agents at this layer include the managers and supervisors of each division, while the nonautonomous agents are corporate assets. The Market at the next level up sets and demands certain performance targets be met by the company for its survival and growth. These metrics are usually financial at this level, such as ROI, ROE, market share, sales growth, and so forth. These are the set points and constraints given to the Management team.
The Management team, in turn, translates these targets into actionable quantitative information, such as production performance metrics and the strategic deployment of resources at different plants (the corporation might have several plants distributed all over the world), as well as more qualitative ones that define the company culture, including the safety culture. They also set the incentive policy to encourage better performance from the employees. These are communicated to the Plant View Layer as its set points and constraints. The Management team decides on these targets by taking into account all relevant information concerning the survival, profitability, and growth of the company in a competitive and regulatory environment. Thus, the information flows not only from the company's internal sources but also from the environment, that is, from the two levels immediately above.
Differing from the control policies at the lower levels, which mainly focus on controlling equipment (i.e., nonautonomous agents), the policies from this layer onward focus more on achieving the desired behavior and outcomes from autonomous agents (i.e., humans). As a result, while the lower-level control policies can be based on precise models of process/equipment (as captured by DAE models), the higher-level policies necessarily have to deal with imperfect models of human behavior, which cannot be reduced to a set of equations. Consider, for instance, the difficulties involved in "modeling" the culture of a corporation. At best, we might be able to identify certain key features or characteristics that define a corporation's culture. From this level onward, we have to rely more on graph-theoretic, game-theoretic, and agent-based modeling frameworks. Thus, from this level onward, modeling becomes trickier, and the notion of "control" of agents transitions to the "management" of agents. Moreover, the importance of the TeCSMART failure-modes-based examination becomes more obvious. Such a systematic risk analysis of human decision-making would help improve safety-related management activities, among other things.
The Management team acts as a "controller" to monitor the various performance metrics (e.g., sales, expenses, revenue, profits, ROI, ROE, etc.), compare them with the set points, and take appropriate actions by manipulating the relevant variables (e.g., cost cutting, acquisitions, etc.) in order to meet the set point targets. The Management level deals with the big picture and general strategy for the corporation as a whole. These get translated into more detailed prescriptions and recommendations as they are communicated from this layer to the lower layers. The failure of the elements in Figure 6 can be modeled along the lines of the Equipment View and Plant View Layers. For example, the Performance Monitoring task (i.e., the "sensor") may fail because of errors in the measurements or estimations (e.g., fail high, low, or zero), or they may be communicated erroneously (or not communicated at all). One can methodically identify similar failure modes for the other elements, including the connections (which are the communication channels).
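The controller analogy above can be made concrete with a toy discrete feedback loop; the gain, the ROI numbers, and the assumed plant response below are all our illustrative assumptions, not a model from the article.

```python
# Sketch of the Management team as a discrete "controller": each quarter
# it compares a performance metric with its set point and adjusts a
# manipulated variable (e.g., the intensity of a cost-cutting program).
# The gain and all numbers are illustrative assumptions.

def management_step(metric, set_point, action, gain=0.5):
    """One review cycle: correct the action in proportion to the shortfall."""
    return action + gain * (set_point - metric)

action, base_roi = 0.0, 8.0             # current policy; ROI with no action, %
metric = base_roi
for _ in range(4):                      # four quarterly reviews
    action = management_step(metric, set_point=10.0, action=action)
    metric = base_roi + 0.8 * action    # assumed response of ROI to the action
```

Under these assumptions the metric converges toward the 10% target over successive quarters; the point of the analogy is only that set points, measurements, and corrective actions play the same structural roles here as in an equipment-level control loop.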
Figure 6. Control-theoretic model of the management layer.
Perspective IV: Market View Layer
Similar to the Plant View, the Market View is a collection of companies that compete, in the appropriate product/service categories, for economic survival, profitability, and growth in a free market environment. The agents at this level are mainly the customers and corporations. The market is a well-studied concept in economics; it usually refers to the exchange activities that many parties engage in. In this article, we will not discuss the economic aspects of the Market, but interpret the Market as a collection of companies and their activities. Market activities such as cooperation and competition can be explained using the input-output model structure and intra-layer feedback loops. From this layer and above, activities mainly involve autonomous agents such as humans and human organizations. The information generated at this level (e.g., the stability of individual companies and the market, fairness practices, etc.) is communicated to the Regulatory View, from which the Market receives regulatory requirements and enforcement actions. While the market dynamics is in real time, as with the Plant View, the relevant time scale is of the order of months.
Perspective V: Regulatory View Layer
As noted, regulatory agencies oversee the market and control market behavior through the enforcement of regulatory policies (Figure 7). The primary goal at this level is to ensure the security, stability, and wellbeing of the society in which these companies operate. This means, of course, the security and wellbeing of the citizens and their environment. It also means ensuring that the free market, where these companies compete, is stable, efficient, and fair. The autonomous agents are regulatory agencies such as the Occupational Safety and Health Administration (OSHA), the Environmental Protection Agency (EPA), the Securities and Exchange Commission (SEC), the Federal Reserve (Fed), the Federal Energy Regulatory Commission (FERC), the Minerals Management Service (MMS), the Food and Drug Administration (FDA), and so on, as well as the appropriate executives from the companies.

These agencies receive regulations from the agents in the Government View, namely lawmakers and their staff, which they enforce on the market participants. They also monitor the market and companies, collect information, and report the effects of regulations to the agents in the Government View for potential improvements. This feedback control loop acts on a time scale of years.

One typical example of this view is the activity of the SEC, which regulates the securities industry. As shown in Figure 8, the SEC receives laws and regulatory directives from the agents in the Government View, such as the President, the Congress, and the Federal Reserve Board. Through its five Divisions and 23 Offices, the SEC enforces federal securities laws, issues new rules, and oversees securities-related activities. For instance, the SEC regularly monitors the market for unusual trading patterns that might reveal illegal acts such as insider trading, and takes corrective actions, playing its role as a "controller" here, to ensure fairness in the securities markets. While the SEC should be praised for its post-financial-crisis actions in successfully going after various Wall Street entities for their misconduct, various failures of the SEC before and during the crisis contributed to the crisis, as Judge Rakoff argues persuasively.45 Many of these failures are failures of the elements in Figure 7 that can be modeled using our template of failure modes. In a similar manner, many of the failures at the Minerals Management Service46 that contributed to the BP Oil Spill disaster can be modeled using our approach. While we do not get into all the details, as that would make our article too long, we do provide a summary of these failures in a series of tables that compare regulatory failures in three different domains later in the article.
Perspective VI: Government View Layer
The Government View, like the Plant and Market Views, is a collection of various agencies organized to govern a society of autonomous and non-autonomous agents (e.g., physical assets). The objectives here are the security, stability, and overall wellbeing of the agents and their environment against a variety of risks and threats. Depending on the societal preference for capitalism, communism, socialism, monarchy, or dictatorship, the institutions and their structure can be widely different. The objective of our article is not to discuss these in any detail (there are vast resources on this subject in sociology and political science) but only to show how our control-theoretic framework accommodates the structures and functions at this level in a uniform and consistent manner, which is helpful for a system-theoretic analysis of system-wide risks and threats. In the context of the U.S., this structure is the three branches of government (executive, legislative, and judicial) with the associated agencies they supervise. The agents are the members of these branches. The time scale is typically four years, the presidential election cycle, but institutional memory in Congress and the judiciary can prolong this to decades. That is, it can take that long to make significant changes in governance.
Perspective VII: Societal View Layer
Finally, we arrive at the topmost level in this modeling hierarchy. The primary agents (autonomous) are the citizens and elected officials in a democracy such as the U.S. It is, of course, very different for other political structures, as noted. Again, while the presidential election cycle imposes a certain natural characteristic time, institutional memories can prolong this to decades. The societal "set points" are the preferences of the citizenry, which can vary over time, typically on the order of decades or generations. In an ideal democracy, the citizens get to decide what kind of society or country they would all like to live in. The overall goals of the citizens in the U.S., as

Figure 7. Control-theoretic model of the regulatory layer.

Figure 8. Control-theoretic model of the Securities and Exchange Commission.
Table 1. Failure Taxonomy, Part I (Examples from Refs. 2, 12, 56)

1. Monitoring Failures: Failure to monitor the key parameters effectively, or significant errors in the monitored data.

1.1 Fail to monitor: Failure to monitor key performance indicators ("failing zero").
Examples: In the BP Texas City Refinery Explosion, there were numerous measures for tracking various types of operational, environmental, and safety performance, but no clear focus on the leading indicators for potential catastrophic or major incidents. In the Northeast Blackout, MISO did not discover that Harding-Chamberlin had tripped until after the blackout, when MISO reviewed the breaker operation log that evening. In the Subprime Crisis, Moody's did not sufficiently account for the deterioration in underwriting standards or a dramatic decline in home prices, and did not even develop a model specifically to take into account the layered risks of subprime securities until late 2006, after it had already rated nearly 19,000 subprime securities.

1.2 Failure to monitor effectively: Failure to detect/report problems in a timely manner.
Examples: In the Northeast Blackout, the Cleveland-Akron area's voltage problems were well known and reflected in the stringent voltage criteria used by control area operators until 1998. BP Texas City did not effectively assess changes involving people, policies, or the organization that could impact process safety.

1.3 Significant errors in monitoring: Monitored data are significantly inaccurate, either over-reporting ("failing high") or under-reporting ("failing low") the actual trend.
Examples: In the BP Texas City Refinery Explosion, a lack of supervisory oversight and technically trained personnel during the startup, an especially hazardous period, was an omission contrary to BP safety guidelines; an extra board operator was not assigned to assist, despite a staffing assessment that recommended an additional board operator for all ISOM startups. In the Northeast Blackout, from 15:05 EDT to 15:41 EDT, MISO did not recognize the consequences of the Hanna-Juniper loss, and FE operators knew neither of the line's loss nor its consequences; PJM and AEP recognized the overload on Star-South Canton, but had not expected it because their earlier contingency analysis did not examine enough lines within the FE system to foresee this result of the Hanna-Juniper contingency on top of the Harding-Chamberlin outage.

2. Decision Making Failures: Failure to provide the correct decisions in a timely manner.

2.1 Model failures: Decisions are not supported by the local system (i.e., "plant-model mismatch").
Examples: In the Subprime Crisis, financial institutions and credit rating agencies embraced mathematical models as reliable predictors of risks, replacing judgment in too many instances. In the Northeast Blackout, one of MISO's primary system condition evaluation tools, its state estimator, was unable to assess system conditions for most of the period between 12:15 and 15:34 EDT, due to a combination of human error and the effect of the loss of DPL's Stuart-Atlanta line on other MISO lines as reflected in the state estimator's calculations.

2.2 Inadequate or incorrect local decisions: Decisions made are unfavorable to the local system under supervision.
Examples: In the BP Texas City Refinery Explosion, the process unit was started despite previously reported malfunctions of the tower level indicator, level sight glass, and a pressure control valve. In the Subprime Crisis, financial institutions made inadequate decisions in using excessive leverage and complex financial instruments. In the Northeast Blackout, FE used minimum acceptable normal voltages that were lower than, and incompatible with, those used by its interconnected neighbors.

2.3 Inadequate or incorrect global decisions: Decisions made are unfavorable for the global system, though they could be locally right.
Examples: In the Subprime Crisis, the banks had gained their own securitization skills and did not need the investment banks to structure and distribute, so the investment banks moved into mortgage origination to guarantee a supply of loans they could securitize and sell to the growing legions of investors; but they lacked a global view of the entire market. In the Northeast Blackout, many generators had pre-designed protection points that shut the unit down early in the cascade, so there were fewer units on-line to prevent island formation or to maintain balance between load and supply within each island after it formed. In particular, it appears that some generators tripped to protect the units from conditions that did not justify their protection, and many others were set to trip in ways that were not coordinated with the region's under-frequency load shedding, rendering that UFLS scheme less effective.
expressed in the Declaration of Independence, are Life, Liberty and the pursuit of Happiness.47 Given these goals, in every election the citizens get to vote on a number of issues related to the economy, environment, education, health, security, privacy, race relations, etc.

This is the topmost layer of the model. Its feedback loop involves citizens, elected government officials, and regulators. In the Government View Layer, the three branches of the U.S. government act as the "controller" of a collection of regulatory agencies and the country. In the Societal View Layer, citizens oversee and influence the society through elections. It usually takes decades for a society to adapt and evolve in any significant fashion. The societal set point is related to the history and culture of a nation.

In all systemic failures, such as the ones mentioned above, we all play a role, through the Societal View Layer, and are accountable for some of the blame, as it was our collective decision to elect (in the case of the U.S.) a particular party, and its political and regulatory views, to govern us. This accountability is a direct consequence of our responsibility. Consider, for example, the responsibility of the CEO of a large petrochemical company with many plant sites and tens of thousands of employees. The CEO may not know everything that goes on in all her plant sites on a daily basis, but when a disaster strikes, she and her C-suite executives are held accountable. Time and again, in all the official inquiries of major disasters, whether Bhopal, Piper Alpha, the BP Oil Spill, the Global Financial Crisis, the Northeast Power Blackout, and so on, the management was held responsible and accountable for
Table 2. Failure Taxonomy, Part II (Examples from Refs. 2, 12, 56)

2.4 Resource Failures: Failure to acquire, allocate, and manage the required resources properly to complete the tasks safely and achieve the goal(s).

2.4.1 Lack of resources: Failure to acquire the necessary resources, such as funds, manpower, time, etc.
Examples: In the BP Texas City Refinery Explosion, BP had not always ensured that it identified and provided the resources required for strong process safety performance at its U.S. refineries, including both financial and human resources. In the Subprime Crisis, in an interview with the FCIC, Greenspan went further, arguing that with or without a mandate, the Fed lacked sufficient resources to examine the nonbank subsidiaries; worse, the former chairman said, inadequate regulation sends a misleading message to the firms and the market. But if resources were the issue, the Fed chairman could have argued for more; the Fed was always mindful, however, that it could be subject to a government audit of its finances. In the Northeast Blackout, there was no UVLS system in place within Cleveland and Akron; had such a scheme been implemented before August 2003, shedding 1,500 MW of load in that area before the loss of the Sammis-Star line might have prevented the cascade and blackout.

2.4.2 Inadequate allocation of resources: Resources are deployed incorrectly, e.g., over-staffing ("failing high") in some areas while under-staffing ("failing low") elsewhere.
Examples: In the BP Texas City Refinery Explosion, the incident at Texas City and its connection to serious process safety deficiencies at the refinery emphasize the need for OSHA to refocus resources on preventing catastrophic accidents through greater PSM enforcement. In the Northeast Blackout, on August 14, the lack of adequate dynamic reactive reserves, coupled with not knowing the critical voltages and maximum import capability to serve native load, left the Cleveland-Akron area in a very vulnerable state.

2.4.3 Training failures: Failures related to the lack of organized activities aimed at helping employees attain the required level of knowledge and skill needed in their current job, including emergency response training.
Examples: In the BP Texas City Refinery Explosion, BP had not adequately ensured that its U.S. refinery personnel and contractors have sufficient process safety knowledge and competence. In the Subprime Crisis, in theory, borrowers are the first defense against abusive lending, but many borrowers do not understand the most basic aspects of their mortgage; borrowers with less access to credit are particularly ill equipped to challenge the more experienced person across the desk. In the Northeast Blackout, the FE operators did not recognize the information they were receiving as clear indications of an emerging system emergency.

2.5 Conflict of interest: Incorrect decisions reached due to a conflict of interest arising from competing goals that can affect proper judgment and execution of tasks, e.g., safety vs. financial gain, or ethical failures such as corruption.
Examples: In the BP Texas City Refinery Explosion, cost cutting, failure to invest, and production pressures from BP Group executive managers impaired process safety performance at Texas City. In the Subprime Crisis, many of Moody's former employees said that after the public listing, the company's culture changed; it went from one resembling a university academic department to one that values revenues at all costs, according to Eric Kolchinsky, a former managing director. In the Northeast Blackout, these protections should be set tight enough to protect the unit from the grid, but also wide enough to assure that the unit remains connected to the grid as long as possible; this coordination is a risk management issue that must balance the needs of the grid and customers relative to the needs of the individual assets.
their companies' failures. In fact, in a historic first, establishing an encouraging precedent, in April 2016 the former Massey Energy CEO was sentenced to twelve months in prison as a result of the mining company's disaster.48,49 Thus, the people in charge have to be held accountable for part of the blame. In a democratic society, the people in charge are, ultimately, us, the citizens who elected the government. Therefore, we are responsible, in some part, for the failures resulting from its policies. We are thus responsible for Bhopal, the BP Oil Spill, the Subprime Crisis, and so on. This is why it is
Table 3. Failure Taxonomy Part III (Examples from Refs. 2, 12, 56)

3. Action Failures: Actions carried out incorrectly or inadequately.

3.1 Flawed actions including supervision: Failure to perform the right actions, performing no action, or performing the wrong actions; failure to follow standard operating procedures.
In BP Texas City Refinery Explosion, numerous heat exchanger tube thickness measurements were not taken. Some pressure vessels, storage tanks, piping, relief valves, rotating equipment, and instruments were overdue for inspection in six operating units evaluated.
In Subprime Crisis, struggling to remain dominant, Fannie and Freddie loosened their underwriting standards, purchasing and guaranteeing riskier loans, and increasing their securities purchases. Yet their regulator, the Office of Federal Housing Enterprise Oversight (OFHEO), focused more on accounting and other operational issues than on Fannie's and Freddie's increasing investments in risky mortgages and securities.
In Northeast Blackout, numerous control areas in the Eastern Interconnection, including FE, were not correctly tagging dynamic schedules, resulting in large mismatches between actual, scheduled, and tagged interchange on August 14.

3.2 Late response: Failure to take the right actions at the right time.
In BP Texas City Refinery Explosion, neither Amoco nor BP replaced blowdown drums and atmospheric stacks, even though a series of incidents warned that this equipment was unsafe. In the years prior to the incident, eight serious releases of flammable material from the ISOM blowdown stack had occurred, and most ISOM startups experienced high liquid levels in the splitter tower. Neither Amoco nor BP investigated these events.
In Subprime Crisis, declining underwriting standards and new mortgage products had been on regulators' radar screens in the years before the crisis, but disagreements among the agencies and their traditional preference for minimal interference delayed action.
In Northeast Blackout, the alarm processing application had failed on occasions prior to August 14, leading to loss of the alarming of system conditions and events for FE's operators. However, FE said that the mode and behavior of this particular failure event were both first-time occurrences and ones which, at the time, FE's IT personnel neither recognized nor knew how to correct.

4. Communication Failures: Failures associated with the system of pathways (informal or formal) through which messages flow to different levels and different people in the organization.

4.1 Communication failure with external entities: Failures of communication between an individual and/or a group/organization and an external individual and/or organization.
In BP Texas City Refinery Explosion, BP and Amoco did not cooperate well to investigate previous incidents and replace the blowdown drum.
In Subprime Crisis, the leverage was often hidden. Lenders rarely discussed the leverage and the associated high risk with their investors. Investors relied on the credit rating agencies, often blindly.
In Northeast Blackout, the Stuart-Atlanta 345-kV line, operated by DPL and monitored by the PJM reliability coordinator, tripped at 14:02 EDT. However, since the line was not in MISO's footprint, MISO operators did not monitor the status of this line and did not know it had gone out of service. This led to a data mismatch that prevented MISO's state estimator (a key monitoring tool) from producing usable results later in the day, at a time when system conditions in FE's control area were deteriorating.

4.2 Peer-to-peer communication failure: Failures of communication between an individual and another individual within a group and/or organization.
In BP Texas City Refinery Explosion, the night lead operator left early, and very limited information about his control actions was given to the day board operator.
In Northeast Blackout, FE computer support staff did not effectively communicate the loss of alarm functionality to the FE system operators after the alarm processor failed at 14:14, nor did they have a formal procedure to do so.

4.3 Inter-level communication failure: Failures of communication between an individual and another individual at a greater or lower level of authority within the same group and/or organization.
In BP Texas City Refinery Explosion, supervisors and operators poorly communicated critical information regarding the startup during the shift turnover.
In Northeast Blackout, ECAR and MISO did not precisely define critical facilities, such that the 345-kV lines in FE that caused a major cascading failure would have to be identified as critical facilities for MISO. MISO's procedure in effect on August 14 was to request FE to identify critical facilities on its system to MISO.
12 DOI 10.1002/aic Published on behalf of the AIChE 2016 Vol. 00, No. 00 AIChE Journal
vitally important for the citizens to stay informed, engaged, and active in the political process. This is particularly important to remember as we begin to address the mother of all systemic failures, the Climate Change Crisis, which has been in the works for decades.
TeCSMART: Comparative Analysis of Three Major Disasters

Failure analysis and comparison
In this section, we discuss the results of applying the TeCSMART framework to three prominent systemic failures, namely, the BP Texas City Refinery Explosion (2005), the Global Financial Crisis (2008–09), and the Northeast Power Blackout (2003). We in fact studied the following thirteen systemic failures: (1) the Bhopal Disaster (1984), (2) the Space Shuttle Challenger Disaster (1986), (3) the Piper Alpha Disaster (1988), (4) the SARS Outbreak (2002–03), (5) the Space Shuttle Columbia Disaster (2003), (6) the Northeast Power Blackout (2003), (7) the BP Texas City Refinery Explosion (2005), (8) the Global Financial Crisis (2008–09), (9) the BP Deepwater Horizon Oil Spill (2010), (10) the Upper Big Branch Mine Disaster (2010), (11) the Chilean Mining Accident (2010), (12) the Fukushima Daiichi Nuclear Disaster (2011), and (13) the India Blackouts (2012), by carefully reviewing the official postmortem reports of these disasters as well as other relevant sources. However, for the sake of brevity, we present the comparative analysis of only these three disasters; the other cases exhibit similar failure patterns. We analyzed and classified over 700 failures mentioned in these reports.1,2,50–60 We categorize these failures into five primary classes and 19 subclasses, consistent with the typical failure modes discussed in the previous section.

The five classes are as follows: (1) Monitoring Failures; (2) Decision Making Failures; (3) Action Failures; (4) Communication Failures; and (5) Structural Failures. Each class has subclasses that define the failures in more detail; the subclasses are listed in Tables 1–4. The five-class failure taxonomy reveals "what can potentially go wrong" in a complex sociotechnical system. It summarizes the failure modes modeled using the TeCSMART framework. Different failure modes give
Table 4. Failure Taxonomy Part IV (Examples from Refs. 2, 12, 56)

5. Structural Failures: Deficient structures and/or models.

5.1 Design failures: Defects or deficiencies in the design of the system/component/model, or just the wrong design of the system/component/model.
In BP Texas City Refinery Explosion, occupied trailers were sited too close to a process unit handling highly hazardous materials. All fatalities occurred in or around the trailers.
In Subprime Crisis, where were Citigroup's regulators while the company piled up tens of billions of dollars of risk in the CDO business? Citigroup had a complex corporate structure and, as a result, faced an array of supervisors. The Federal Reserve supervised the holding company but, as the Gramm-Leach-Bliley legislation directed, relied on others to monitor the most important subsidiaries: the Office of the Comptroller of the Currency (OCC) supervised the largest bank subsidiary, Citibank, and the SEC supervised the securities firm, Citigroup Global Markets. Moreover, Citigroup did not really align its various businesses with the legal entities. An individual working on the CDO desk on an intricate transaction could interact with various components of the firm in complicated ways.
In Northeast Blackout, although MISO received SCADA input of the line's status change, this was presented to MISO operators as breaker status changes rather than a line failure. Because their EMS system topology processor had not yet been linked to recognize line failures, it did not connect the breaker information to the loss of a transmission line. Thus, MISO's operators did not recognize the Harding-Chamberlin trip as a significant contingency event and could not advise FE regarding the event or its consequences. Further, without its state estimator and associated contingency analyses, MISO was unable to identify potential overloads that would occur due to various line or equipment outages.

5.2 Maintenance failures: Failure to adequately repair and maintain equipment at all times.
In BP Texas City Refinery Explosion, deficiencies in BP's mechanical integrity program resulted in the "run to failure" of process equipment at Texas City.
In Northeast Blackout, FE had no periodic diagnostics to evaluate and report the state of the alarm processor, and nothing about the eventual failure of two EMS servers would have directly alerted the support staff that the alarms had failed in an infinite-loop lockup.

5.3 Operating procedure failures: Failure to develop and execute standard operating procedures for all tasks.
In BP Texas City Refinery Explosion, outdated and ineffective procedures did not address recurring operational problems during startup, leading operators to believe that procedures could be altered or did not have to be followed during the startup process.
In Subprime Crisis, in addition to the rising fraud and egregious lending practices, lending standards deteriorated in the final years of the bubble.
In Northeast Blackout, the PJM and MISO reliability coordinators lacked an effective procedure on when and how to coordinate an operating limit violation observed by one of them in the other's area. The lack of such a procedure caused ineffective communications between PJM and MISO regarding PJM's awareness of a possible overload on the Sammis-Star line as early as 15:48.
rise to systemic failures in different domains. However, there are common failure modes shared by many, if not all, the systemic failures. Such common failure pathways help us identify, proactively, how things can potentially go wrong in a complex system. By studying these common failure mechanisms, people can become more vigilant about new systems. Thus, the common patterns identified by our comparative analysis are helpful not only diagnostically but also prognostically.
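To make the taxonomy concrete, the class/subclass hierarchy can be held in a small lookup structure. A minimal Python sketch (hypothetical code, not from the article; only the subclasses detailed in Tables 3 and 4 of this excerpt are listed, whereas the full taxonomy has 19 subclasses across the five classes):

```python
# TeCSMART failure taxonomy: the five top-level classes. Subclasses are
# filled in only for the classes covered by Tables 3-4 of this excerpt;
# the full framework defines 19 subclasses in total.
TAXONOMY = {
    "1. Monitoring Failures": [],
    "2. Decision Making Failures": [],
    "3. Action Failures": [
        "3.1 Flawed actions including supervision",
        "3.2 Late response",
    ],
    "4. Communication Failures": [
        "4.1 Communication failure with external entities",
        "4.2 Peer-to-peer communication failure",
        "4.3 Inter-level communication failure",
    ],
    "5. Structural Failures": [
        "5.1 Design failures",
        "5.2 Maintenance failures",
        "5.3 Operating procedure failures",
    ],
}

def subclass_to_class(code: str) -> str:
    """Map a subclass code such as '3.2' to its top-level class name."""
    prefix = code.split(".")[0] + "."
    for cls in TAXONOMY:
        if cls.startswith(prefix):
            return cls
    raise KeyError(code)

print(subclass_to_class("3.2"))  # -> 3. Action Failures
```

A structure like this lets each piece of failure evidence from a postmortem report be tagged with a subclass code and rolled up to its class for cross-domain counts.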
The comparative analysis of the three case studies is performed in the following three steps. (1) Carefully review the official postmortem reports and classify the failures into the classes/subclasses defined in Tables 1–4. For example, at the BP Texas City Refinery the level control valve was accidentally turned off by an operator; this failure is classified as a flawed action (3.1 in Table 3). Overgrown trees are a known problem for all power grid operators, but FirstEnergy (FE) failed to trim them, which led to line trips; the inadequate tree trimming is classified as a late response failure (3.2 in Table 3). (2) Once the failures are classified, organize them in the TeCSMART framework according to the relevant agents and failure mechanisms. The relevant agents indicate the level of the failure in the TeCSMART framework, and the failure mechanisms indicate which control component the failure is associated with. One layer can have multiple failures, and one failure can appear multiple times at different levels. Thus, the level control valve failure is a flawed action of the actuator at the Process View, and the inadequate tree trimming is a late response of the actuator at the Plant View. (3) Compare the failures across domains to identify common patterns.
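The three steps above can be sketched as a small classification pipeline. A minimal Python sketch (hypothetical; the two records are the worked examples from the text, and the function and field names are our own):

```python
from collections import defaultdict

# Step 1: each failure found in a postmortem report becomes a record of
# (case, view, control_component, failure_class). The two records below
# are the worked examples from the text: the level control valve (flawed
# action of the actuator at the Process View) and the inadequate tree
# trimming (late response of the actuator at the Plant View).
failures = [
    ("BP Texas City", "Process View", "actuator", "3.1 Flawed actions"),
    ("Northeast Blackout", "Plant View", "actuator", "3.2 Late response"),
]

def organize(records):
    """Step 2: group classified failures by (view, component, class)."""
    grid = defaultdict(set)
    for case, view, component, fclass in records:
        grid[(view, component, fclass)].add(case)
    return grid

def common_patterns(grid, min_cases=2):
    """Step 3: keep failure modes shared by at least `min_cases` domains."""
    return {mode: cases for mode, cases in grid.items()
            if len(cases) >= min_cases}

grid = organize(failures)
print(len(grid))              # -> 2 distinct failure modes
print(common_patterns(grid))  # -> {} (no mode shared by these two records)
```

With the full set of 700+ classified failures as input, step 3 surfaces the failure modes that recur across domains.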
Case Studies
In this section, we briefly introduce the three prominent systemic failures: the Northeast Blackout (2003), the BP Texas City Refinery Explosion (2005), and the Subprime Crisis (2008), and compare their failures by applying the TeCSMART framework. The comparison shows the similarities and differences among the three systemic failures. Moreover, the common patterns indicate important failure modes, which can help improve system design, control, and risk management.
Overview
The Northeast Blackout, which happened on August 14, 2003, was the largest blackout of the North American power grid. With many generating units tripping and transmission lines disconnecting from noon onward, the cascading sequence was essentially complete around 4:13 p.m. A shutdown cascade triggered the blackout: supply/demand mismatch and poor vegetation management triggered power surges in the transmission lines. FE's operators did not pay attention to the warning signs and communicated poorly with other line operators. Finally, the power surges spread and the blackout emerged.56
The BP Texas City refinery is the third largest refinery in the United States, employing approximately 1800 BP workers. On March 23, 2005, the refinery initiated the startup of the ISOM raffinate splitter section. During the startup, the control valve was accidentally turned off by an operator and the tower was filled with flammable liquid for over 3 h. The pressure relief valve was activated by high pressure in the tower and discharged liquid to the blowdown drum. The
Figure 9. Cross-domain comparison table.
[Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
blowdown drum overfilled and the stack vented flammable liquid to the atmosphere, which formed a vapor cloud. When the flammable vapor cloud reached a nearby idling diesel pickup truck, an explosion occurred. The explosion and fires killed 15 people, injured 180 others, and resulted in financial losses exceeding $1.5 billion.12
In the summer of 2007, leading banks in the U.S. started to fail as a result of falling real estate prices. Bear Stearns, the fifth largest investment bank, whose stock had traded at $172 a share as late as January 2007, was sold to JP Morgan Chase for a fire-sale price of $2 a share on March 16, 2008; Lehman Brothers, the fourth largest, went bankrupt; Fannie Mae and Freddie Mac were taken over by the government; and American International Group (AIG), the insurance giant, was bailed out by taxpayers.61 Over half a million families lost their homes to foreclosure. Nearly $11 trillion in household wealth vanished. Between January 2007 and March 2009, the stock market lost half its value.62 The final cost to the U.S. economy as a result of the biggest financial crisis since the Great Depression was about $22 trillion. To get a sense of its magnitude, compare it with the U.S. GDP in 2014, which was $17.4 trillion.
TeCSMART Comparison
A cross-domain comparison, shown in Figure 9, was conducted by analyzing and comparing the failures in these three prominent systemic events. Figure 9 is a table in which the rows are TeCSMART views and failure classes, and the columns are the three systemic failures. Table 5 lists the agents of the three systemic failures. As discussed before, we classify the failure evidence found in the postmortem investigation reports into different failure classes, related to specific control components at the appropriate levels. We then mark each failure class as a colored cell in the table, with a color code: blue represents the BP Texas City Refinery Explosion, yellow represents the Subprime Crisis, and brown represents the Northeast Blackout. If all three colors appear in the same row, that particular failure occurred in all three cases. Therefore, by comparing the colored cells, we are able to study the failure mechanisms and their similarities and differences. Figure 10 highlights the failure classes classified in the comparison table (Figure 9).
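The color-coding logic of Figure 9 can be emulated with a membership table: a row is "fully colored" when all three cases exhibit that failure. A minimal Python sketch (hypothetical; the three rows are illustrative examples drawn from the discussion, not the complete Figure 9 table):

```python
# Each row of the comparison table maps a (view, failure class) pair to
# the set of cases in which that failure was found. The entries below
# are illustrative examples from the discussion, not the full table.
CASES = {"BP Texas City", "Subprime Crisis", "Northeast Blackout"}

table = {
    ("Management View", "2. Decision Making Failures"):
        {"BP Texas City", "Subprime Crisis", "Northeast Blackout"},
    ("Regulatory View", "3. Action Failures"):
        {"BP Texas City", "Subprime Crisis", "Northeast Blackout"},
    ("Plant View", "4.2 Peer-to-peer communication failure"):
        {"BP Texas City", "Northeast Blackout"},
}

# Rows "colored" in all three columns mark failure modes common to all
# three systemic failures.
common_to_all = [row for row, cases in table.items() if cases == CASES]
for row in common_to_all:
    print(row)
```

Here the first two rows would print, since those failure modes appear in all three cases, while the peer-to-peer communication row would not.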
Failures were found at every level in all three cases. Operational failures are more common at low levels; controller failures dominate at high levels. Among the many important observations and insights from the comparison, we highlight a few and discuss them in depth.
The comparison shows that lack of appropriate training was a widespread problem. In Figure 9, training failures appear in the bottom three views of all three cases. Evidence shows that operators, and even managers, had not received appropriate and sufficient training prior to the accidents. The operator training program was inadequate at the BP Texas City Refinery: the training department staff had been reduced from 28 to 8, and there were no simulators for operators to practice handling abnormal events.12 The training failure at BP is confirmed by the logic tree created by the Chemical Safety and Hazard Investigation Board (CSB), highlighted in Figure 11a. Similar things happened in the Northeast Blackout. FE operators were poorly trained to recognize emergency information. They received signals indicating line trips, but made poor decisions by relying solely on the Energy Management System (EMS). Unfortunately, the EMS failed at this time. FE engineers' poor judgment and lack of training played a significant role in the failure. Their lack of training was also highlighted by ThinkReliability in their causal map, depicted in Figure 12. Such a pattern was also seen in the financial system failure.2,64
Decision-makers are "controllers" in the TeCSMART framework. In all three cases, almost every layer showed decision making failures. For example, the decision to initiate the ISOM unit startup despite previously reported malfunctions of the raffinate tower level indicator, pressure control valve, and level sight glass was a serious failure, which directly triggered the overall disaster.12 Moreover, BP's cost-cutting decisions that led to the layoff of experienced workers from Amoco contributed to the accident as well.1 These failures are highlighted by the CSB in Figures 11b, c. In the Subprime Crisis, fund managers' decisions to invest in subprime securities without fully understanding the embedded risks were a leading cause of the financial system collapse.2 FE's decision to use minimum acceptable normal voltages (highlighted in Figure 12), which were lower than and incompatible with those of its neighbors, directly caused power surges and transmission line sag.56 At the management level, as demonstrated by both our comparison study and the CSB analysis (Figures 11a, c), a critical failure was BP not providing enough resources for strong process safety performance in its U.S. refineries.12 At the same level, CEOs of financial institutions decided to maintain large quantities of subprime-related assets using very high leverage, which magnified the scale of the crisis dramatically. Moreover, a locally favorable decision may sometimes bring undesired consequences to the system. In
Table 5. Agents of Each View

View | BP Texas City Refinery Explosion | Subprime Crisis | Northeast Blackout
Societal View | U.S. citizens | Citizens worldwide | U.S. and Canada citizens
Government View | Employees of different branches of Government | Employees of U.S. and foreign Governments | Employees of U.S. and Canada Governments
Regulatory View | Employees of OSHA | Employees of FED, SEC, FDIC, OCC, OTC | Employees of NERC and FERC of U.S.; employees of NEB of Canada
Market View | Companies in oil & gas refining industry | Institutions in financial industry | MAAC-ECAR-NPCC power grid
Management View | BP senior management | Senior management of financial institutions & credit rating agencies | Senior management of FE, AEP, MISO, PJM
Plant View | BP Texas City refinery management | Dealers, investors, managers of financial products | Eastlake 5 generation, Harding-Chamberlin line
Equipment View | Engineers and operators, equipment | Borrowers, lenders, brokers, subprime loans | Engineers and operators, equipment
the North American power grid, the preset protection points that protect single operators do not work for the whole system. When single operators dropped out of the grid, the stress fell entirely on the rest of the system. Finally, the system had no option but to fail systemically.56
Monitoring problems often play a major role in sociotechnical disasters. Monitoring failures were observed at the management level in all three cases. As discussed in the last section and in Table 1, a sensor or a monitoring task can fail low, fail high, fail zero, or fail to detect in time. BP was not aware of hazards at the Texas City Refinery because BP failed to incorporate previous incidents; even worse, the incident investigations were missing1 ("failing zero"). The monitoring failure of BP is specifically noted by the CSB in Figure 11d. Similarly, prior to the Subprime Crisis, Moody's did not account for the deterioration in underwriting standards and was not aware of the plummeting home prices. Moody's did not develop a model specifically to look into the layered risks of subprime securities, even after it had rated nearly 19,000 subprime securities2 ("failing zero"). Deregulation and self-policing by financial institutions had stripped away key safeguards2 ("failing low"). Moreover, in the Northeast Blackout, the Midcontinent Independent System Operator, Inc. (MISO) failed to recognize the consequence of the Hanna-Juniper line loss, while other operators
Figure 10. Failure modes in the comparison table.
[Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
recognized the overload but had not expected it, because the earlier contingency analysis did not examine enough lines to foresee the Hanna-Juniper contingency. The failure to recognize the line loss in a timely manner worsened the situation. When the operators finally figured out the situation, it was too late to respond56 ("failing to detect in time"). MISO's monitoring failure was not only highlighted by ThinkReliability (in Figure 12) as a lack of warning, but also raised concerns for the U.S.–Canada Power System Outage Task Force. The Task Force report56 recommends that FERC not approve the operation of a new Regional Transmission Operator (RTO) or Independent System Operator (ISO) until the applicant has met the minimum functional requirements for reliability coordinators. This recommendation directly addressed the issue of MISO, as a reliability coordinator, failing to recognize line loss in its region.

Beyond the decision making or monitoring failures, the
flawed actions of regulators and their limited oversight always contribute to sociotechnical system collapses. The reports1,12 mention that OSHA did not conduct a comprehensive inspection of any of the 29 process units at the Texas City Refinery. Knowing the high leverage and the vast sums of subprime loans, the FED did not begin routinely examining subprime subsidiaries until a pilot program in July 2007, and did not even issue new rules until July 2008, a year after the subprime market had shut down.2 The North American Electric Reliability Corporation (NERC), the power grid self-regulator, knowing FE's potential risk, did not enforce any changes or regulate FE's activities.56 All these flawed actions contributed to the disasters. Regulators also experience conflicts of interest, especially financial regulators, who face challenges from powerful financial institutions.

These observations are just a few examples of what we studied in the TeCSMART comparison. Compared with the logic tree and the causal map, the TeCSMART comparison is able to capture high-level failures, such as regulatory failures, that are not covered by the logic tree or causal map. More importantly, the TeCSMART comparison can systematically identify potential risks in a sociotechnical system by identifying
Figure 11. The logic tree of BP Texas City Refinery Explosion (Adapted from Ref. [12], Investigation Report, Refinery Explosion and Fire).
[Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
possible failure modes associated with different components at
different levels.
Summary and Conclusions
As the recent systemic failures in different domains remind academicians and practitioners alike, one can never take system safety for granted. All of us (individuals, corporate management, regulatory agencies, and communities) need to learn the lessons from every accident, particularly from the systemic ones. It is imperative to study all these disasters from a common systems engineering perspective so that one can thoroughly understand the commonalities as well as the differences to prevent or mitigate future ones. This is the approach we have adopted in this article.

Analyzing systemic risk in a complex sociotechnical system thus requires modeling the system at multiple levels and from multiple perspectives, using a systematic and unified framework. It is not enough to focus only on equipment failures. It is important to systematically examine the potential failures associated with humans and institutions at all levels in a society. We have proposed such an approach, the TeCSMART framework, which models sociotechnical systems in seven layers using control-theoretic concepts. Using this framework, a HAZOP-like hazard identification can be conducted for every layer of a sociotechnical system. The failure modes identified using the TeCSMART framework, at all levels, serve as a common platform to compare systemic failures from different domains, to elicit and understand common failure mechanisms that can help with improved design and risk management in the future. They also serve as input information for developing other types of models (e.g., DAE, SDG, game-theoretic, agent-based) for more detailed studies.

We carried out such a comparative analysis of 13 major systemic events from different domains, analyzing over 700 failures discussed in official postmortem reports. Even though we highlight the results from only three of them, for the sake of brevity, the common failure patterns we identify in this article were found in the other events as well. These 700+ failures can be systematically classified into the five categories (and their subcategories) that can occur at all levels of the system. Using a unifying control-theoretic framework, we show how these correspond to common failure modes associated with the elements of a control system, namely, the sensor, controller, actuator, process unit, and communication channels. Even though every systemic failure happens in some unique manner, and is not an exact replica of a past event, we show that the underlying failure mechanisms can be traced back to similar patterns associated with other events.

No modern engineered system of ever-increasing complexity can be totally risk free. However, minimizing the inherent risks in our products and processes is an important societal challenge, both intellectually and practically, for innovative science and engineering. Safety is not the responsibility of just the environment, health, and safety department; it is everyone's responsibility in the facility. There is a need for systems, procedures, and corporate and regulatory cultures that ensure this. In the long run, considerable technological help would come from progress in taming complexity, which would result in more effective prognostic and diagnostic systems for monitoring, analyzing, and controlling systemic risks. But getting there will require innovative thinking, bolder vision, and overcoming certain misconceptions about process safety as an intellectually dull activity.
Acknowledgment
This work is supported in part by the Center for the Management of Systemic Risk at Columbia University.
Literature Cited
1. Baker J, Leveson N, Bowman F, Priest S. The Report of the BP U.S. Refineries Independent Safety Review Panel. Report; Independent Safety Review Panel, 2007.
2. Financial Crisis Inquiry Commission. The Financial Crisis Inquiry Report: Final Report of the National Commission on the Causes of the Financial and Economic Crisis in the United States. PublicAffairs; 2011.
3. Ottino JM. Engineering complex systems. Nature. 2004;427(6973):399.
4. Jasanoff S. Learning from Disaster: Risk Management after Bhopal. Philadelphia: University of Pennsylvania Press, 1994. ISBN 081221532X.
5. Plotz D. Play the Enron Blame Game! Slate.com. 2002. Access Date:February 23, 2016. [Available from: http://www.slate.com/articles/news_and_politics/politics/2002/02/play_the_enron_blame_game.html.]
6. CCPS. Building process safety culture: tools to enhance processsafety performance. Report; Center for Chemical Process Safety ofthe American Institute of Chemical Engineers, New York. 2005.
7. MSNBC. Mine Owner Ran Up Serious Violations: MSNBC; 2010.Access Date: February 23, 2016. [updated April 6, 2010. Availablefrom: http://www.nbcnews.com/id/36202623/.]
Figure 12. The cause map of Northeast Blackout (Adapted from Ref. [63], The cause map of Northeast Blackout of 2003).
[Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
8. Thomas P, Jones LA, Cloherty J, Ryan J. BP's dismal safety record. ABC News. 2010. Access Date: February 23, 2016. [updated May 27, 2010. Available at: http://abcnews.go.com/WN/bps-dismal-safety-record/story?id=10763042.]
9. Johnson LD, Neave EH. The subprime mortgage market: familiarlessons in a new context. Manag Res News. 2007;31(1):12–26.
10. Lewis M. The Big Short: Inside the Doomsday Machine. New York:W. W. Norton, 2011. ISBN 9780393078190.
11. Olive C, O'Connor TM, Mannan MS. Relationship of safety culture and process safety. J Hazard Mater. 2006;130(1):133–140.
12. CSB. Investigation report refinery explosion and fire. Report; U.S.Chemical Safety and Hazard Investigation Board. 2005.
13. Hopkins A. Failure to Learn: The BP Texas City Refinery Disaster.CCH Australia Ltd, 2008. ISBN 1921322446.
14. Krugman P. Berating the raters. The New York Times. 2010. Access Date: February 23, 2016. [updated April 25, 2010. Available from: http://www.nytimes.com/2010/04/26/opinion/26krugman.html?_r=0.]
15. Urbina I. Inspector general’s inquiry faults regulators: New YorkTimes; 2010. Access Date: February 23, 2016. [updated May 24,2010. Available from: http://www.nytimes.com/2010/05/25/us/25mms.html.]
16. Venkatasubramanian V. Systemic failures: Challenges and opportuni-ties in risk management in complex systems. AIChE J. 2011;57(1):2–9.
17. Venkatasubramanian V, Zhao JS, Viswanathan S. Intelligent systemsfor hazop analysis of complex process plants. Comput Chem Eng.2000;24(9–10):2291–2302.
18. Catanzaro M, Buchanan M. Network opportunity. Nat Phys. 2013;9(3):121–123.
19. Caldarelli G, Chessa A, Gabrielli A, Pammolli F, Puliga M. Recon-structing a credit network. Nat Phys. 2013;9(3):125–126.
20. Galbiati M, Delpini D, Battiston S. The power to control. Nat Phys.2013;9(3):126–128.
21. Ashby WR. Requisite variety and its implications for the control ofcomplex systems. In Facets of Systems Science 1991 (pp. 405–417).Springer US.
22. Natarajan S, Srinivasan R. Implementation of multi agents basedsystem for process supervision in large-scale chemical plants. Com-put Chem Eng. 2014;60:182–196.
23. Saleh JH, Marais KB, Favar FM. System safety principles: a multi-disciplinary engineering perspective. J Loss Prev Process Ind. 2014;29:283–294.
24. Maurya MR, Rengaswamy R, Venkatasubramanian V. A systematicframework for the development and analysis of signed digraphs forchemical processes. 1. algorithms and analysis. Ind Eng Chem Res.2003;42(20):4789–4810.
25. Maurya MR, Rengaswamy R, Venkatasubramanian V. A systematicframework for the development and analysis of signed digraphs forchemical processes. 2. control loops and flowsheet analysis. Ind EngChem Res. 2003;42(20):4811–4827.
26. Maurya MR, Rengaswamy R, Venkatasubramanian V. Applicationof signed digraphs-based analysis for fault diagnosis of chemicalprocess flowsheets. Eng Appl Artif Intell. 2004;17(5):501–518.
27. Srinivasan R, Venkatasubramanian V. Multi-perspective models forprocess hazards analysis of large scale chemical processes. ComputChem Eng. 1998;22(98):S961–S964.
28. Venkatasubramanian V, Vaidhyanathan R. A knowledge-basedframework for automating hazop analysis. AIChe J. 1994;40(3):496–505.
29. Rasmussen J, Svedung I. Proactive Risk Management in a Dynamic Society. Karlstad, Sweden: Swedish Rescue Services Agency, 2000. ISBN 9789172530843.
30. Leveson NG, Stephanopoulos G. A system-theoretic, control-inspiredview and approach to process safety. AIChE J. 2014;60(1):2–14.
31. Leveson NG. Engineering a Safer World: Systems Thinking Applied to Safety, 1st ed. Cambridge, MA: The MIT Press, 2011. ISBN 9780262016629.
32. Leveson NG. A systems-theoretic approach to safety in software-intensive systems. IEEE Trans Dependable Secure Comput. 2004;1(1):66–86.
33. Leveson N. A new accident model for engineering safer systems.Safety Sci. 2004;42(4):237–270.
34. Stephanopoulos G. Chemical Process Control: An Introduction to Theory and Practice. Englewood Cliffs, NJ: Prentice-Hall, 1984.
35. Seborg D, Edgar TF, Mellichamp D. Process Dynamics & Control.United States of America: Wiley, 2006. ISBN 8126508345.
36. Ogunnaike BA, Ray WH. Process Dynamics, Modeling, and Control, vol. 1. New York: Oxford University Press, 1994.
37. Bequette BW. Process Dynamics: Modeling, Analysis, and Simulation. Upper Saddle River, NJ: Prentice Hall PTR, 1998. ISBN 0132068893.
38. Bookstaber R, Glasserman P, Iyengar G, Luo Y, Venkatasubramanian V, Zhang Z. Process systems engineering as a modeling paradigm for analyzing systemic risk in financial networks. Off Financ Res Work Pap Ser. 2015;15(1).
39. Seider WD, Seader JD, Lewin DR. Product & Process Design Principles: Synthesis, Analysis and Evaluation. United States of America: Wiley, 2009. ISBN 8126520329.
40. Srinivasan R, Venkatasubramanian V. Petri net-digraph models for automating HAZOP analysis of batch process plants. Comput Chem Eng. 1996;20:S719–S725.
41. Srinivasan R, Venkatasubramanian V. Automating HAZOP analysis of batch chemical plants: Part I. The knowledge representation framework. Comput Chem Eng. 1998;22(9):1345–1355.
42. Srinivasan R, Venkatasubramanian V. Automating HAZOP analysis of batch chemical plants: Part II. Algorithms and application. Comput Chem Eng. 1998;22(9):1357–1370.
43. Vaidhyanathan R, Venkatasubramanian V. Digraph-based models for automated HAZOP analysis. Reliab Eng Syst Saf. 1995;50(1):33–49.
44. Vaidhyanathan R, Venkatasubramanian V. A semi-quantitative reasoning methodology for filtering and ranking HAZOP results in HAZOPExpert. Reliab Eng Syst Saf. 1996;53(2):185–203.
45. Rakoff JS. The financial crisis: why have no high-level executives been prosecuted? The New York Review of Books; 2014. Access Date: February 23, 2016. [updated January 9, 2014. Available from: http://www.nybooks.com/articles/2014/01/09/financial-crisis-why-no-executive-prosecutions/.]
46. Eilperin J, Higham S. How the Minerals Management Service's partnership with industry led to failure. The Washington Post; 2010. Available at: http://www.washingtonpost.com/wp-dyn/content/article/2010/08/24/AR2010082406754.html.
47. Jefferson T. United States Declaration of Independence. archives.gov; 1776. Access Date: February 23, 2016. [Available from: http://www.archives.gov/exhibits/charters/declaration_transcript.html.]
48. Blinder A. Donald Blankenship sentenced to a year in prison in mine safety case. The New York Times; 2016. Access Date: April 23, 2016. [updated April 6, 2016. Available from: http://www.nytimes.com/2016/04/07/us/donald-blankenship-sentenced-to-a-year-in-prison-in-mine-safety-case.html?_r=0.]
49. Steinzor R. Why Not Jail?: Industrial Catastrophes, Corporate Malfeasance, and Government Inaction. New York: Cambridge University Press, 2014. ISBN 1316194884.
50. Presidential Commission. Deep Water: The Gulf Oil Disaster and the Future of Offshore Drilling. Report; National Commission on the BP Deepwater Horizon Oil Spill and Offshore Drilling, Washington. 2011.
51. Browning JB. Union Carbide: Disaster at Bhopal. In: Managing under Siege. Detroit, MI: Union Carbide Corporation, 1993:1–15.
52. Investigation of the Challenger Accident. Report; Committee on Science and Technology, House of Representatives, Washington. 1986.
53. Cullen WD. The Public Inquiry into the Piper Alpha Disaster. Report 0046-0702, London. 1993.
54. WHO. SARS: How a Global Epidemic Was Stopped. Report. Geneva; 2006. Available at: http://www.tandfonline.com/doi/abs/10.1080/17441690903061389.
55. CAIB. Columbia Accident Investigation Board Report. Report; Columbia Accident Investigation Board: Washington. 2003. Available at: http://www.slac.stanford.edu/spires/find/books?irn=317624.
56. Task Force. Final report on the August 14, 2003 blackout in the United States and Canada. Report; US-Canada Power System Outage Task Force. 2004.
57. McAteer JD, Beall K, Beck J, McGinley P. Upper Big Branch: The April 5, 2010, Explosion: A Failure of Basic Coal Mine Safety Practices. Report to the Governor; Governor's Independent Investigation Panel, West Virginia. 2011.
58. Bonnefoy P. Poor safety standards led to Chilean mine disaster. GlobalPost; 2010. Access Date: February 23, 2016. [updated August 29, 2010. Available from: http://www.globalpost.com/dispatch/chile/100828/mine-safety.]
59. Kurokawa K, Ishibashi K, Oshima K, Sakiyama H, Sakurai M, Tanaka K, Tanaka M, Nomura S, Hachisuka R, Yokoyama Y. The Official Report of the Fukushima Nuclear Accident Independent Investigation Commission. Report; The Fukushima Nuclear Accident Independent Investigation Commission, Japan. 2012.
60. CERC. Report on the grid disturbance on 30th July 2012 and grid disturbance on 31st July 2012. Report, India; 2012.
61. Blackburn R. The subprime crisis. New Left Review. 2008;50:63.
62. Jickling M. Containing financial crisis. Report; Congressional Research Service. 2011.
63. Think Reliability. The cause map of the Northeast Blackout of 2003. Houston. 2008. URL: http://www.thinkreliability.com/Instructor-Blogs/Blog%20-%20NE%20Blackout.pdf.
64. Schumer CE, Maloney CB. The subprime lending crisis: the economic impact on wealth, property values and tax revenues, and how we got here. 2007. Available at: www.jec.senate.gov/Documents/Reports/10.25.07OctoberSubprimeReport.pdf.
Manuscript received Feb. 26, 2016, and revision received Apr. 30, 2016.