
TRIBUTE TO FOUNDERS: ROGER SARGENT. PROCESS SYSTEMS ENGINEERING

TeCSMART: A Hierarchical Framework for Modeling and Analyzing Systemic Risk in Sociotechnical Systems

Venkat Venkatasubramanian and Zhizun Zhang
Dept. of Chemical Engineering, Complex Resilient Intelligent Systems Laboratory, Columbia University, New York, NY 10027

DOI 10.1002/aic.15302
Published online in Wiley Online Library (wileyonlinelibrary.com)

Recent systemic failures in different domains continue to remind us of the fragility of complex sociotechnical systems. Although these failures occurred in different domains, there are common failure mechanisms that often underlie such events. Hence, it is important to study these disasters from a unifying systems engineering perspective so that one can understand the commonalities as well as the differences to prevent or mitigate future events. A new conceptual framework that systematically identifies the failure mechanisms in a sociotechnical system, across different domains, is proposed. Our analysis includes multiple levels of a system, both social and technical, and identifies the potential failure modes of equipment, humans, policies, and institutions. With the aid of three major recent disasters, we demonstrate how this framework can help us compare systemic failures in different domains and identify the common failure mechanisms at all levels of the system. © 2016 American Institute of Chemical Engineers AIChE J, 00: 000–000, 2016

Keywords: artificial intelligence, design, fault diagnosis, safety, process control

Systemic Failures: Introduction

Recent systemic failures in different domains such as the Global Financial Crisis (2007–2009), BP Deepwater Horizon Oil Spill (2010), and Indian Power Outage (2012) continue to remind us of the fragility of complex sociotechnical systems. Systemic failures occur when an entire system collapses, where the system is typically a large entity whose failure negatively impacts a large number of people and their environment, causing enormous financial losses. Examples of such systems are refineries, inter-state power grids, country-wide financial networks, large institutions, and so forth. Union Carbide's Bhopal Gas Tragedy in 1984, in which an estimated 5000 died and about 100,000 were seriously injured by the accidental release of methyl isocyanate, was a systemic failure. Another example is the Piper Alpha Disaster in 1988, where an offshore oil platform operated by Occidental Petroleum in the North Sea, U.K., exploded, killing 167 and resulting in about $2 billion in losses. The Challenger (1986) and Columbia (2003) Space Shuttle Disasters, the Schering-Plough Inhaler Recall (1999), the Northeast Power Blackout (2003), the spread of SARS (2003), the BP Texas City Refinery Explosion (2005), and the Johnson & Johnson Multidrug Recall (2010) are all examples of systemic failures in different domains. Examples of financial systemic failures include the Enron (2001) and WorldCom (2002) collapses, and the Madoff Ponzi Scheme (2008). The collapse of the News of the World newspaper organization (2011) is an example of systemic failure in the media domain.

In each case, official postmortem inquiries were conducted and reports of the accidents were produced. Chemical engineers might study the BP Texas City Refinery Explosion Report,1 and people from the financial world may browse The Financial Crisis Inquiry Report,2 but rarely does one compare failures across the different domains to study their commonalities and differences. But when one undertakes such a comparative study, one is struck by the commonality across different domains. There is an alarming sameness about such disasters, which can teach us important fundamental lessons. Although the failures listed above occurred in different domains, in different facilities, triggered by different events, there are common failure mechanisms that often underlie such events. Systematically identifying and understanding these mechanisms is essential to avoid such disasters in the future.

Modern technological advances are creating an increasing number of complex sociotechnical systems. By sociotechnical we mean that these systems comprise social elements (i.e., humans) as well as technical elements (such as pumps, valves, reactors, etc.). The human elements are not only an integral part of the system, they are also often the cause of major failures. The task of designing such systems, and their control mechanisms, to ensure safe operations over their life cycles is extremely challenging. Complex sociotechnical systems have a very large number of interconnected components with nonlinear interactions that can lead to "emergent" behavior—that is, the behavior of the whole is more than the sum of its parts—that can be difficult to anticipate and control.3 Moreover, these systems are not isolated—they interact with humans and the physical environment; in particular, human decision making and the associated errors are part of the feedback processes in these systems.

Correspondence concerning this article should be addressed to V. Venkatasubramanian at [email protected].

© 2016 American Institute of Chemical Engineers


The cumulative effect of the nonlinearity, interconnectedness, and interactions with humans and the environment makes these systems-of-systems potentially fragile and susceptible to systemic failures.

We propose a conceptual framework that can assist in systematically identifying the failure mechanisms in a complex sociotechnical system. Much like hazard and operability (HAZOP) analysis, which helps us identify potential hazards in equipment and process flowsheets by methodically examining the failure modes of different components, our framework examines the entire sociotechnical system, including the corporate, regulatory, and societal layers, and identifies the potential failure modes of equipment, humans, policies, and institutions. We also demonstrate how this new framework helps us compare systemic failures in different domains, in a detailed manner, and reveal the common failure mechanisms at all levels of the system. We compare the BP Texas City Refinery Explosion, the Global Financial Crisis, and the Northeast Blackout. Such a comparative analysis has not been conducted before, as most people generally think that these are completely different events, occurring in entirely different domains, and therefore unlikely to have any common features of any value. We show that there are indeed common, valuable lessons.

This article is organized as follows. The next section discusses the common patterns of failures at multiple levels. The section after that introduces our hierarchical modeling framework, the Teleo-Centric System Model for Analyzing Risks and Threats (TeCSMART). The subsequent section presents the failure analysis and comparison: it analyzes three prominent case studies—the Global Financial Crisis, the BP Texas City Refinery Explosion, and the Northeast Power Blackout—using the TeCSMART framework, and discusses their similarities and differences, shedding new light on systemic failures. Such a model-based comparative study has not been made before. The last section discusses future directions.

Systemic Failures: Common Patterns of Failures at Multiple Levels

Postmortem investigations of many disasters have shown that systemic failures rarely occur due to a single failure of a component or personnel. Even though the senior management of a company typically tries to pin the blame on some unanticipated equipment failure, operator error, or a rogue trader, that is rarely the case for major disasters. For instance, Union Carbide initially claimed that the Bhopal Gas Tragedy was caused by a disgruntled employee who had sabotaged the equipment.4 Enron management initially blamed Andrew Fastow, Enron's CFO, as the sole culprit.5 But, again and again, investigations have shown that there are always several layers of failures, ranging from low-level personnel to senior management to regulatory agencies, that have led to major disasters.

Such investigations have shown that safety procedures had been deteriorating at the failed facilities for months, if not years, prior to the accident. For example, in the case of Piper Alpha, the Permit-to-Work system had been dysfunctional for months.6 In Bhopal, regular maintenance of safety backup systems had not been conducted for months.4 Massey Energy ran up about 600 safety violations in its Upper Big Branch mine during 2009–2010.7 OSHA statistics show that BP ran up 760 "egregious, willful" safety violations during 2008–2010 in Ohio and Texas. Compare this with the corresponding numbers for the other oil companies: Sunoco (8), ConocoPhillips (8), Citgo (2), and Exxon (1).8 These are clear evidence of a breakdown of the corporate safety culture over months or years.

One sees a similar pattern in financial disasters as well. For example, at Enron, the senior management, led by Ken Lay and Jeff Skilling, created an extreme performance-oriented, risky culture that seems to have tolerated unethical behavior, which resulted in many violations, market manipulations, and so on.5 In the subprime crisis, the perverse incentive mechanisms in mortgage lending and its subsequent securitization and trading caused individuals and corporations to make highly leveraged bets that resulted in unsustainable risk extremes. Thus, it was not a question of if a disaster would occur but when.

Another common pattern is that people had not identified all the serious potential hazards. They had often failed to conduct a thorough process hazards analysis that would have exposed the serious hazards that later resulted in the disasters. Such incomplete hazards analysis was highlighted in the Cullen inquiry into Piper Alpha.53 Failure to perform such a hazards analysis was partially responsible for the meltdown of Bear Stearns, Lehman Brothers, Merrill Lynch, and others in the subprime market fiasco.9 However, the few who had performed such a hazards analysis did see the crash coming and profited billions of dollars, as described in Michael Lewis's book, now a movie, The Big Short.10 Yet another common cause is the inadequate training of plant personnel to handle serious emergencies.

All in all, the responsibility for a systemic failure typically goes all the way to the top levels of company management, who had paid only lip service to safety, tolerated noncompliant behavior, and even encouraged excessive risk taking and unethical behavior, all of which resulted in a poor corporate culture of safety,1,11–13 which in turn paved the way for the disasters.

We also find that serious failings by regulatory, ratings, and auditing agencies, tolerated, and sometimes even encouraged, by a laissez-faire political environment, played a significant role. First and foremost, it does not matter whether the systems are chemical, petrochemical, or financial—self-policing does not work. This seems so obvious that people should not have to die, or lose all their money, to make us realize it. Sensible regulations are essential, but, more importantly, they must be audited and enforced by suitably trained personnel who have no conflicts of interest. Consider the betrayal of public trust by Arthur Andersen, the supposedly independent auditor of Enron, whose aiding and abetting of Enron's cooked books was instrumental in Enron's systemic failure.5 The subprime market failures showed us that the rating agencies, which were supposed to make an independent assessment of subprime-mortgage-backed securities, were so dependent on their Wall Street clients for their business that they merrily went about stamping AAA ratings on junk instruments. Of the AAA-rated securities issued in 2006, an astonishing 93% were later downgraded to junk status.14

It is the same lesson we were taught by the BP Deepwater Horizon Oil Spill—how the Minerals Management Service was inherently conflicted between its goals of awarding leases and enforcing safety regulations.15 But this lesson should have been learned long ago, after the Piper Alpha Disaster. Based on the findings of the Cullen Report into the 1988 disaster, the British government moved the responsibility for safety oversight from the Department of Energy to the Health and Safety Executive (HSE), the independent watchdog agency for work-related health, safety, and illness. A separate division was created within the HSE to monitor the safety of the offshore oil and gas industry.6

Indeed, the importance of addressing non-technical common causes, such as those described above, as an integral part of systems safety engineering was pointed out as far back as 1968 by Jerome Lederer, the former director of the NASA Manned Flight Safety Program for Apollo, who wrote:

System safety covers the entire spectrum of risk management. It goes beyond the hardware and associated procedures to system safety engineering. It involves: attitudes and motivation of designers and production people, employee/management rapport, the relation of industrial associations among themselves and with government, human factors in supervision and quality control, documentation on the interfaces of industrial and public safety with design and operations, the interest and attitudes of top management, the effects of the legal system on accident investigations and exchange of information, the certification of critical workers, political considerations, resources, public sentiment and many other non-technical but vital influences on the attainment of an acceptable level of risk control. These nontechnical aspects of system safety cannot be ignored.

To understand systemic failures and learn from them, one needs to go beyond analyzing them as independent one-off accidents, and examine them in the broader perspective of the potential fragility of all complex systems. One needs to study the disasters from a unifying sociotechnical systems engineering perspective, so that one can thoroughly understand the commonalities as well as the differences, and gain insights into the system-wide breakdown mechanisms, in order to better design, control, and manage such systems in the future.

It is quite clear that to properly model and analyze systemic risk, one not only needs to model failures at the lowest level of a sociotechnical system (such as failures of equipment) but also, more importantly, the human and institutional failures that occur at the higher levels of the system. The human elements are not only an integral part of the system, they are also often the cause of major failures. Hence, it is important to account for them, as explicitly as possible, in any risk modeling framework. This has not always been the case in the engineering modeling literature. For instance, most modeling studies in the process control literature do not account for errors committed by humans in their methodologies. HAZOP analysis, as another example, considers only equipment and operation failures in its guide-word based approach. We need a systematic methodology that can identify potential failure mechanisms, due to equipment, process, human, and institutional failures, at different levels of a sociotechnical system. That is what we try to accomplish in this article. This article is largely a conceptual contribution, describing a new modeling framework that articulates how the different levels of a complex sociotechnical system may be formally approached using control-theoretic ideas. Building on our prior work,16,17 we present such an integrative multiscale modeling framework, which addresses the role of the human element explicitly, and discuss its implications in the context of several prominent systemic failures in different domains.

In recent years, there has been interesting progress in understanding and modeling systemic risk in complex sociotechnical systems. Economists and physicists have used network theory to do this for financial systems.18,19 Control theorists have proposed approaches adopting traditional control theory for understanding such systems.20,21 Others have proposed agent-based modeling22 or domain-independent system safety principles.23 Our prior work in this area has stressed the need for modeling cause-and-effect knowledge explicitly as well as the need for a multiscale modeling framework.16,17,24–28 Philosophically, our framework is similar to what has been proposed by Rasmussen and Svedung29 and by Leveson.30–33 In particular, it shares the main theme discussed by Leveson and Stephanopoulos,30 but we differ in the conceptual details of the underlying modeling framework. In addition, we demonstrate the utility of our framework across different domains using a comparative analysis of three well-known systemic failures, which has not been done before.

TeCSMART Framework

Complexity, in general, is hard to define and quantify precisely, as it comes in different flavors and can mean different things in different contexts. For instance, there is algorithmic or computational complexity as defined by computer scientists, which measures how much computational effort or time a particular problem might require for its solution—for example, polynomial vs. exponential time, as a function of some key scaling parameter of the given problem. Then there is the physics perspective, dynamical system complexity, which originated from the field of nonlinear dynamics and chaos. This deals with the general inability to predict the future behavior of a nonlinear dynamical system. In other fields such as biology (life and social sciences, in general), complexity is used to describe, in qualitative terms, the incredible diversity, organizational sophistication, and characteristics of individual agents (e.g., a cell or an animal), systems (e.g., an ecosystem, human society), processes/phenomena (e.g., intercellular and intracellular interactions), and so forth.

While it may be hard to state exactly what complexity, or what a complex system, is, there is consensus as to what features are typically associated with a complex system. Complex systems typically consist of many diverse, autonomous, and adaptive components that interact with one another, and with their environment, in nonlinear, dynamical ways to produce a very large set of potential future states or outcomes. Interactions between such parts at a given scale typically give rise to "emergent" properties at larger scales in space and/or time, sometimes through self-organization, without any global knowledge or central control, that are hard to predict from the properties of the parts. They tend to have many feedback loops (both positive and negative), among their components as well as with their environment, which can cause adaptation and induce goal-directed (i.e., teleological) behavior, either intentionally or implicitly, thereby potentially altering the course of their future behavior. Hence, their characteristics are typically not reducible to an elementary level of description.

Thus, the essential features of a complex sociotechnical system may be summarized as: (1) goal-driven behavior, (2) many agents or components/sub-components, (3) organized in a multi-layered hierarchy or network, (4) nonlinear dynamical interactions among its agents (or components) and with the environment, (5) feedback loops, (6) decentralized control (i.e., local decision making), and (7) emergent behavior.

Most human-engineered complex systems, such as chemical plants, corporations, transportation networks, power grids, governments, societies, and so forth, are organized as a hierarchical network of human and nonhuman (e.g., machine) elements. Generally speaking, they comprise autonomous and non-autonomous elements, which usually translate to human and nonhuman entities. In this article, we do not consider nonhuman entities that are autonomous, such as robots, as they have not yet reached human-like autonomous capabilities, even though this is going to be an important development in a couple of decades.

We call our modeling framework TeCSMART (Teleo-Centric System Model for Analyzing Risks and Threats). Telos means goal or purpose in Greek. The central theme of our approach is the emphasis on recognizing and modeling the goals of different agents, at different levels of abstraction, in a complex sociotechnical system. Both individual players and groups in a complex system are goal-oriented, driven to act by their goals and incentives. Therefore, it is important to recognize and model this goal-driven behavior. Individuals (or groups) usually have different goals, or even goals that conflict with one another or with the goals of other individuals. The dynamics of how goals across the system interact, transform, and disperse in the hierarchy affects both individual and systemic performance. We use a simple feedback control module as a model for representing this goal-driven behavior, as we discuss below.

We propose an integrative framework that tries to capture the essential features of a complex teleological system, with the purpose of modeling, analyzing, and managing systemic risk by accounting for the effects of both autonomous (i.e., human) and nonhuman (i.e., "machine" or "mechanical") entities in a unified and systematic manner.

Figure 1. TeCSMART framework. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]


We model a complex teleological system as a sociotechnical entity that is embedded in a society, affected by the society's goals and political environment. This leads to a multi-scale modeling framework, having seven layers organized as a hierarchy, as shown in Figure 1, that naturally arise and represent different perspectives of the entire system. Each layer above is a zoomed-out, aggregate view of the immediate layer below. For example, the block representing a process unit in the network of the Plant View contains the individual feedback loop in the Equipment View. The bottom layer of the stack is the basic building block of a system (e.g., equipment and processes). The top layer of the stack is the macroscopic view of a society.

Each layer has its own set of goals, which drive the decision-making and actions taken by the agents at that level. The decisions are made based on the inputs the layer receives from the layers immediately above and below it. Similarly, the actions are communicated to these adjacent layers as outputs. These decisions/actions are indicated in Figure 1 by the arrows that capture these information flows, up and down the hierarchy. These information flows are the feedback loops between the layers (i.e., interlayer feedback loops). There are also feedback loops within a given layer, as depicted in Figure 1, which are intralayer loops. Associated with each layer is a set of agents (autonomous and nonautonomous), organized in a particular configuration that is appropriate for the goals of that layer (e.g., the layout of equipment in a chemical plant, called a flowsheet). Such a multilayered representation lends itself naturally to accounting for emergent phenomena that arise from one scale to another.

We propose a uniform and unified input-output modeling framework that is conceptually the same across all levels. The elementary input-output model structure that serves as a building block in our framework is shown in Figure 2. Specifying such a uniform modeling structure across all levels has the advantage of integrating and unifying the analysis of the outcomes at different levels in a consistent manner. Such a template structure allows us to systematically identify the various failure modes of the different elements at different levels of the hierarchy, as we discuss below. There are five key elements in this control-theoretic information modeling building block: (1) sensor, (2) actuator, (3) controller, (4) "process" unit that transforms inputs to outputs, and (5) connection (e.g., wires and pipes). These, combined with input and output, complete the picture. The functions of these elements, as well as their failure modes, at different levels of the hierarchy are illustrated in the discussion below, using examples from chemical engineering. It is relatively easy to generalize this discussion to other engineering domains. The domain of finance requires special treatment, and we make that connection wherever needed.

As an organized group, these entities collect, decide, act on, report, and receive a variety of performance information and metrics. At any level, the layer below acts as sensors, actuators, and processes in the interlayer feedback loop, while the layer above it behaves like a controller that evaluates the lower level performance and sets new goals. In a chemical plant, for example, agents in the Equipment View Layer collect, decide, and act on individual process and equipment performance data and metrics (such as temperature, pressure, flow rate, batch times, etc.), which are vital for safe, efficient, and profitable operation; they report them to the Plant View Layer and receive, in turn, local control specifications (such as temperature and pressure set points) from the Plant View Layer. The Plant View Layer agents make these decisions by considering information from all the processes and equipment under their purview as well as manufacturing targets (such as what to make, how much to make, when to make, etc.). These targets, in turn, are decided by the agents in the Management View, get translated into the associated set points and constraints by the agents in the Plant View, and are communicated down to the Equipment View as inputs. The target metrics are decided by the agents in the Management View by responding to competitive market conditions as dictated by the Market View. In a similar manner, relevant information regarding market or company stability, performance, fair competition, etc., is monitored and acted on by the agents in the Regulatory View, by enacting and enforcing appropriate regulations approved by the agents in the Government View (such as the Congress in the U.S.). In an ideal democracy, a government is elected by the citizens of that society, the Societal View, who have the final word in determining what kind of government and laws they would like to live by.
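To make this control-theoretic reading concrete, the following is a minimal Python sketch of the Figure 2 template, with the interlayer loop represented as a layer above handing a new set point down to a feedback module below. The class, its attributes, and all numbers are illustrative assumptions, not constructs defined in the article.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FeedbackModule:
    """One instance of the Figure 2 template: a sensor and an actuator wrapped
    around a 'process', with a controller comparing the output to a set point."""
    name: str
    set_point: float
    sense: Callable[[], float]     # sensor: measure the process output
    act: Callable[[float], None]   # actuator: apply the control action
    gain: float = 1.0              # proportional controller gain (assumed)

    def step(self) -> float:
        error = self.set_point - self.sense()  # controller: goal vs. measurement
        self.act(self.gain * error)
        return error

# The "process": a tank temperature nudged by the control action.
state = {"T": 340.0}
equipment = FeedbackModule(
    name="heater loop",
    set_point=350.0,
    sense=lambda: state["T"],
    act=lambda u: state.__setitem__("T", state["T"] + 0.1 * u),
)

# Interlayer loop: the layer above acts as the controller that hands down a new
# goal, e.g., the Plant View translating a management target into a set point.
equipment.set_point = 355.0
for _ in range(50):
    equipment.step()
print(round(state["T"], 1))  # the equipment loop converges toward the new goal
```

The same template is meant to be reused at every level; only the interpretation of sensor, actuator, and process changes, as the perspectives below describe.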

Similar activities occur within layers through intralayer feedback loops. In the Equipment View Layer, for example, the stirred tank heater depicted in Figure 3 has sensors to measure temperature and tank level. Controllers evaluate these metrics and send new control signals to valves. In the Management View Layer, a firm's accounting team collects the performance data and shares it with the Board of Directors. The Board sets the company's goals based on the data. Each division follows the goals and carries out its daily operations. Periodically, new performance data are collected and the goals updated.

Figure 2. Schematic of a feedback control system (Adapted from Ref. [34], Chemical Process Control, Fig. 13.1b, p. 241).

Figure 3. Stirred tank heater example (Adapted from Ref. [34], Chemical Process Control, p. 89).


At each layer, if autonomous or nonautonomous agents do not comply with the goal, disturbances arise at that layer. Controllers take the disturbance into account and set goals accordingly. Such intralayer feedback loops exist in all seven layers. Details of each layer will be presented in the following discussion.

Perspective I: Equipment View Layer

In the Equipment View Layer, the focus is on individual equipment, such as reactors and distillation columns in the context of a chemical plant, and their operating conditions. A chemical plant is a collection of such process units suitably organized (called a flowsheet) to meet the plant-wide goal of manufacturing a desired chemical product at targeted levels of quality, quantity, cost, time of delivery, etc., safely and optimally. This collection is seen in Perspective II, the Plant View Layer. The time scale for the Equipment View Layer is typically seconds to minutes, as the process dynamics happen in real time.

In the Equipment View Layer, the autonomous agents involved are typically engineers and operators, and the nonautonomous agents are equipment, including control systems. While regulatory control systems can exhibit a certain degree of autonomy, it is negligible compared to the range of autonomy exhibited by humans. Hence, we classify regulatory controllers as nonautonomous.

Consider, for example, the stirred tank heater process (Figure 3), where the goal is to control the level h and temperature T of the fluid in the tank, which is subject to fluctuations in the inlet flow rate Fi and temperature Ti. The desired level of the fluid is referred to as the set point level hset, and the desired temperature as Tset. These are accomplished by the two feedback controllers (loops 1 and 2), which receive the current h and T in real time from the sensors (level gauge and thermocouple) and suitably manipulate the outlet flow rate F and the steam flow rate Fsteam by opening or closing the respective control valves (actuators). The seven elements of the information modeling block for this system are: (1) input: Fi, Ti, hset, Tset, Fsteam, (2) output: h and T, (3) sensors: level gauge and thermocouple, (4) actuators: outlet flow and steam valves, (5) controller, (6) "core" process unit: tank and heater, and (7) connection: pipes and wires. The constraints are lower and upper limits on the level and the temperature of the fluid in the tank.

The goal at the Equipment View level is centered on the performance of individual equipment such as heaters, reactors, distillation columns, and so forth—that is, each piece of equipment has its goal of operating at its set point(s). At this level of granularity, typically, for engineering applications, one can develop detailed dynamical models of the equipment and processes. These tend to be sets of differential and algebraic equations (DAEs) that are solved to simulate process/equipment behavior. Since the purpose of this article is not to discuss these models at length, we refer the interested reader to several standard sources in the literature.34–37 As an example, we list below the dynamical model equations for the stirred tank heater:

$$A\,\frac{dh}{dt} = F_i - F$$

$$A\,h\,\frac{dT}{dt} = F_i\,(T_i - T) + \frac{Q}{\rho C_p}$$
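To make these equations concrete, here is a minimal simulation sketch in Python: explicit Euler integration of the two balance equations, with simple proportional controllers standing in for loops 1 and 2. All parameter values and gains are illustrative assumptions, not taken from the article or Ref. 34.

```python
# Minimal sketch: Euler integration of the stirred tank heater model with the
# two proportional feedback loops of Figure 3 (loop 1: level -> outlet flow;
# loop 2: temperature -> steam heat input). All numbers are assumed.

A = 1.0                      # tank cross-sectional area (assumed units)
rho_cp = 4.2e3               # rho * Cp of the fluid (assumed)
h_set, T_set = 1.5, 350.0    # set points: level and temperature
Kc_h, Kc_T = 2.0, 2.0        # proportional controller gains (assumed)

h, T = 1.0, 340.0            # initial level and temperature
Fi, Ti = 0.5, 330.0          # inlet flow rate and inlet temperature (disturbances)

dt = 0.1
for _ in range(600):
    # Controllers compare measurements against set points (loops 1 and 2)
    F = max(0.0, Fi - Kc_h * (h_set - h))      # drain less when the level is low
    Q = max(0.0, Kc_T * (T_set - T) * rho_cp)  # add steam heat when T is low

    # Process model: the two balance equations from the text
    dh_dt = (Fi - F) / A
    dT_dt = (Fi * (Ti - T) + Q / rho_cp) / (A * h)

    h += dt * dh_dt
    T += dt * dT_dt

print(f"h = {h:.2f} (set point {h_set}), T = {T:.1f} (set point {T_set})")
```

Running this sketch, the level settles at its set point while the temperature settles slightly below Tset, the steady-state offset expected of proportional-only control; the point here is only the closed-loop structure of sensor, controller, actuator, and process.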

Another kind of model used at this level, the signed directed graph (or signed digraph [SDG]) model, is based on graph-theoretic ideas and represents cause-and-effect relationships in a process or equipment.24–26 The SDG model for the heater example is shown in Figure 4. The nodes represent input and output variables. The arcs represent either positive (solid lines) or negative (dotted lines) relations between nodes. The figure is read as follows: a change in the inlet temperature Ti positively affects the temperature T in the stirred tank; for example, if Ti increases, T will increase. T negatively affects the temperature deviation ΔT, defined as the set point temperature Tset minus the stirred tank temperature T: as T increases, ΔT decreases, meaning that less steam Fsteam is needed in the stirred tank because T gets close to the set point temperature Tset. This positive relation between ΔT and Fsteam is depicted by a solid arc between the two nodes. Fsteam, in turn, positively affects the temperature T in the stirred tank. This causal behavior among T, ΔT, and Fsteam corresponds to loop 2 in Figure 3. These qualitative models are easier to develop and analyze than the DAE models, particularly for modeling and analyzing failure modes and hazards.17,28 However, as they are qualitative in nature, they are limited to certain kinds of queries and can lead to ambiguities.
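This qualitative reading can be mechanized. The sketch below encodes the four arcs of the heater SDG as a signed adjacency map and propagates the sign of a disturbance along directed paths (the predicted sign of an effect is the product of the arc signs on the path). The data structure and traversal are illustrative assumptions, not the algorithms of Refs. 24-26.

```python
# Minimal sketch of the tank-heater SDG and a qualitative "what happens if Ti
# rises?" query. "dT" stands for the deviation (Tset - T) in the text.

# Arcs: (source, target) -> +1 for a solid (positive) arc, -1 for a dotted one
SDG = {
    ("Ti", "T"): +1,        # inlet temperature raises tank temperature
    ("T", "dT"): -1,        # higher T lowers the deviation Tset - T
    ("dT", "Fsteam"): +1,   # a larger deviation calls for more steam (loop 2)
    ("Fsteam", "T"): +1,    # more steam raises tank temperature
}

def propagate(node, sign, seen, effects):
    """Depth-first qualitative propagation: each arc is traversed once, and the
    predicted sign of an effect is the running product of arc signs."""
    for (src, dst), arc_sign in SDG.items():
        if src == node and (src, dst) not in seen:
            seen.add((src, dst))
            effects.setdefault(dst, set()).add(sign * arc_sign)
            propagate(dst, sign * arc_sign, seen, effects)

effects = {}
propagate("Ti", +1, set(), effects)   # disturbance: Ti increases
label = {1: "up", -1: "down"}
for var, signs in effects.items():
    print(var, "->", sorted(label[s] for s in signs))
```

Note that T collects both an "up" (the direct arc from Ti) and a "down" (through the compensating loop 2 path): exactly the kind of ambiguity, caused by negative feedback, that the text notes qualitative models can produce.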

Nevertheless, such cause-effect based qualitative models are very useful when modeling a social system, where DAE models are usually hard to develop, such as the bank-dealer system discussed by Bookstaber et al.38 In this case, the nodes are variables related to a bank-dealer's investment and lending activities. In Figure 5, the left-hand side depicts the connections and activities within the bank-dealer, while the right-hand side shows the SDG model. A bank-dealer consists of three major desks: the finance desk determines where money should go; the prime broker determines how much money to lend based on the collateral collected; and the trading desk determines whether to sell to the market or buy from the market, based on the money received from the finance desk and the leverage ratio it holds.

Figure 4. SDG for the tank heater example.


The SDG model is read as follows: the finance desk collateral C_FD positively affects the funding capacity V_FD. V_FD in turn positively affects the loan capacity of the prime broker V_PB and the leverage set point of the trading desk λ_TD^SP. In the prime broker, both the collateral amount C_PB and the margin rate v_PB positively affect the loan capacity V_PB. In the trading desk, the leverage set point λ_TD^SP and the current leverage λ_TD determine the leverage difference Δλ_TD, which positively affects the inventory quantity of the trading desk Q_TD. As Bookstaber et al.38 demonstrate, using the SDG model one can quickly examine the causal relations of a social system like the bank-dealer system, and study unstable conditions and risks such as the fire-sale and funding-run scenarios.

the TeCSMART framework. Usually, in order to develop aquantitative model (DAE model) or a qualitative model (SDG

model), one needs to determine the initial conditions of a sys-tem. System initial conditions at this level are values associ-

ated with equipment, such as sensor readings or controllerparameters. Examining failure modes using TeCSMART

framework provides a systematic way for identifying systeminitial conditions. By giving different system initial conditions,

modelers can develop suitable models to describe the systemand conduct in-depth risk analysis. Therefore, no matter what

modeling methods or risk assessment tools one will use, aHAZOP-like systematic analysis using TeCSMART frame-

work is feasible for analyzing risks in a sociotechnical system.It enables a systematic hazard identification for the risk assess-ment of a sociotechnical system.

The basic functional building block in Figure 2 allows us to model systematically the potential failures, at different levels, of both human and non-human elements. In the Equipment View Layer, consider a sensor, for example. Using a commonly used model of its failure modes, we can state that a sensor can fail high, low, or zero (i.e., no response; the sensor is dead). The same holds for an actuator (a valve can fail high, low, or zero) and a controller. A process might have more failure modes depending on its complexity, but usually not hundreds; more like a dozen or so. The connections can fail, too, again high, low, zero, or reverse (in the case of flow in pipes, for example). One can modify these to make the set of failure modes more sophisticated, if needed, but even this elementary set goes a long way, as we discuss below. We will show how these failure modes can be generalized to accommodate typical human failures as well, at different levels of the hierarchy.
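Because the building block has a fixed set of elements and a small set of elementary modes, the examination can be generated mechanically, which is what makes it HAZOP-like. A minimal sketch follows; the lists and the extra-mode table are illustrative assumptions.

```python
# Minimal sketch: enumerate (element, failure mode) pairs for one instance of
# the Figure 2 building block. Names and groupings are illustrative.

ELEMENTS = ["sensor", "actuator", "controller", "process", "connection"]
BASIC_MODES = ["fails high", "fails low", "fails zero"]

# Element-specific additions mentioned in the text
EXTRA_MODES = {
    "connection": ["reverses"],          # e.g., reverse flow in a pipe
    "process": ["unit-specific modes"],  # typically a dozen or so, not hundreds
}

def failure_modes():
    """Yield every (element, mode) pair for systematic examination."""
    for element in ELEMENTS:
        for mode in BASIC_MODES + EXTRA_MODES.get(element, []):
            yield element, mode

for element, mode in failure_modes():
    print(f"examine: {element} {mode}")
```

At the higher layers described next, the same pairs take on an informational reading: a "sensor" failing high at the Plant View is, for example, an engineer's overly optimistic production projection.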

Perspective II: Plant View Layer

The Plant View Layer is a collection of all the equipment and processes organized in a particular configuration (or flowsheet) to manufacture a desired product safely and optimally. The autonomous agents involved in this layer are managers and supervisors, and the nonautonomous agents are equipment clusters. These clusters are usually grouped as critical process steps or unit operations,39 such as reaction, distillation, etc., which are needed in the manufacture of the desired product. Similarly, in the financial system example, the left panel of Figure 5 is the simplified "flowsheet" of a bank-dealer system. The Plant View agents collect and report metrics regarding aggregate production performance and safety to the Management View and receive, in turn, plant-wide target specifications from the Management View, as noted above. Although this level also operates in real time, the Plant View decisions typically have a larger time scale (hours or even days).

The goal at this level is to ensure that production performance targets (typically, product quantity and quality, cost, and time of delivery) are met safely and optimally at the overall plant level. These plant-wide targets translate into equipment-specific targets implemented as set points and constraints that are communicated to the Equipment View level. Models at this level tend to be the DAE models from Perspective I integrated together, reflecting the overall flowsheet organization of the plant. The flowsheet is then simulated to obtain plant-wide process and equipment behavior. One can also formulate such connected models using the SDG models from the lower level to explicitly capture the cause-and-effect relationships, which are then used for applications such as process hazards analysis.17,28,40–44

The input-output information model at this aggregate level is shown in Figure 1. From this level onward, going up to the higher levels, the emphasis shifts from decisions/actions made by individual equipment to those made by personnel, and from real-time sensor data to aggregate information concerning the overall plant performance. It moves from a data-centric to an information-centric perspective. This is required to reflect the goal of this layer—to make the desired products at the targeted levels of quality, quantity, cost, and time of delivery, safely and optimally. That is the charge of the Plant Manager, given to her by the senior management at the next layer above.

The seven elements here, therefore, reflect the aggregate nature of the information needed and used at this level: (1) input: aggregate, plant-level information on target as well as actual performance metrics, (2) output: schedules, set points, resource allocation, and so forth, (3) sensors: product quality and quantity, resource utilization data, etc., (4) actuators: plant personnel, (5) controller: the Plant Manager, (6) "core" process unit: the entire plant, and (7) connection: various communication channels among plant personnel such as the Manager, Supervisors, Engineers, and Operators.

Figure 5. SDG for the bank/dealer example (Adapted from Ref. [38], Process Systems Engineering as a Modeling Paradigm for Analyzing Systemic Risk in Financial Networks). [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

The failure modes associated with the elements at this level are conceptually similar to their counterparts at the lower Equipment Layer. For instance, sensors in this layer are not physical entities like thermocouples but informational entities that aggregate and transform relevant data into actionable information, such as the projection made about the plant's product output for the current month. This transformation is carried out by a human, such as a process engineer. The engineer can also "fail" high, low, or zero, in the sense that the estimate reported to the Plant Manager can be erroneous along these lines—for example, the projection may be too optimistic (i.e., failing high), too conservative (i.e., failing low), or no projection may be made at all (i.e., failing zero). Likewise, communication can also fail along these lines—perhaps the projection was made, but the Manager was not informed. Similarly, in a bank-dealer system, this layer represents the aggregation of investment and funding activities across different asset classes. The three major desks are divided into groups (actuators) to handle portfolios consisting of different assets. Sensors (i.e., analysts monitoring the metrics) in the Equipment View Layer of a bank-dealer system report leverage ratios or collateral collected, while sensors in this layer are risk models of portfolios, which aggregate and transform individual risk factors into a comprehensive picture of the portfolio's risk. We thus see that this template helps us identify systematically where and how things can fail at different levels of the hierarchy.

It is important to note that we are not claiming that our framework would capture everything that can go wrong in a complex system. We are only suggesting that such a systematic approach could capture many of the typical failures seen in practice, and we demonstrate this with the aid of three case studies.

Perspective III: Management View Layer

The next level up is the Management View, where the agents involved are the critical decision makers, such as the CEO, Senior Vice Presidents, and the Board of Directors. Their goal is to maximize profitability and create value for the shareholders by making sure the company's business performance metrics (including safety) meet the expectations of the Market (which is the next level up). Influenced by the nature of business and accounting cycles, this layer operates on a time scale of a quarter (i.e., a 3-month period) to a year.

As seen in the control-theoretic information model of this level in Figure 6, this group of decision makers (the Management team) sets the overall policies that "control" (i.e., manage) the behavior and outcomes of the corporation, including its autonomous and non-autonomous assets. Autonomous agents at this layer include the managers and supervisors of each division, while the nonautonomous agents are corporate assets. The Market at the next level up sets and demands certain performance targets be met by the company for its survival and growth. These metrics are usually financial at this level, such as ROI, ROE, market share, sales growth, and so forth. These are the set points and constraints given to the Management team.

The Management team, in turn, translates these targets into actionable quantitative information, such as production performance metrics and strategic deployment of resources, at different plants (the corporation might have several plants distributed all over the world), as well as more qualitative ones that define the company culture, including the safety culture. They also set the incentive policy to encourage better performance from the employees. These are communicated to the Plant View Layer as its set points and constraints. The Management team decides on these targets by taking into account all relevant information concerning the survival, profitability, and growth of the company in a competitive and regulatory environment. Thus, the information flows not only from the company's internal sources but also from the environment, namely the two levels immediately above.

Differing from the control policies at the lower levels, which mainly focus on controlling equipment (i.e., nonautonomous agents), the policies from this layer onward focus more on achieving the desired behavior and outcomes from autonomous agents (i.e., humans). As a result, while the lower level control policies can be based on precise models of processes/equipment (as captured by DAE models), the higher level policies necessarily have to deal with imperfect models of human behavior, which cannot be reduced to a set of equations. Consider, for instance, the difficulties involved in "modeling" the culture of a corporation. At best, we might be able to identify certain key features or characteristics that define a corporation's culture. From this level onward, we have to rely more on graph-theoretic, game-theoretic, and agent-based modeling frameworks. Thus, from this level onward, modeling becomes trickier, and the notion of "control" of agents transitions to the "management" of agents. Moreover, the importance of the TeCSMART failure-modes-based examination becomes more obvious. Such a systematic risk analysis of human decision-making would help improve safety-related management activities, among other things.

The Management team acts as a "controller" that monitors the various performance metrics (e.g., sales, expenses, revenue, profits, ROI, ROE, etc.), compares them with the set points, and takes appropriate actions by manipulating the relevant variables (e.g., cost cutting, acquisitions, etc.) in order to meet the set point targets. The Management level deals with the big picture and general strategy for the corporation as a whole. These get translated into more detailed prescriptions and recommendations as they are communicated from this layer to the lower layers. The failures of the elements in Figure 6 can be modeled along the lines of the Equipment View and Plant View Layers. For example, the Performance Monitoring task (i.e., the "sensor") may fail because of errors in the measurements or estimations (e.g., fail high, low, or zero), or they may be communicated erroneously (or not communicated at all). One can methodically identify similar failure modes for the other elements, including the connections (which are the communication channels).

Figure 6. Control-theoretic model of the management layer.


Perspective IV: Market View Layer

Similar to the Plant View, the Market View is a collection of companies that compete, in the appropriate product/service categories, for economic survival, profitability, and growth in a free market environment. The agents at this level are mainly the customers and corporations. The market is a well-studied concept in economics; it usually refers to the exchange activities that many parties engage in. In this article, we do not discuss the economic aspects of the Market, but interpret the Market as a collection of companies and their activities. Market activities such as cooperation and competition can be explained using the input-output model structure and intralayer feedback loops. From this layer and above, activities mainly involve autonomous agents such as humans and human organizations. The information generated at this level (e.g., the stability of individual companies and the market, fairness practices, etc.) is communicated to the Regulatory View, from which the Market, in turn, receives regulatory requirements and enforcement actions. While the market dynamics are in real time, as with the Plant View, the relevant time scale is of the order of months.

Perspective V: Regulatory View Layer

As noted, regulatory agencies oversee the market and con-

trol the market behavior through the enforcement of regulatory

policies (Figure 7). The primary goal at this level is to ensure

the security, stability, and wellbeing of the society where these

companies operate. This means, of course, the security and

wellbeing of the citizens and their environment. This also

means ensuring that the free market, where these companies

compete, is stable, efficient and fair. The autonomous agents

are regulatory agencies such as Occupational Safety and

Health Administration (OSHA), Environmental Protection

Agency (EPA), Securities and Exchange Commission (SEC),

Federal Reserve (FED), Federal Energy Regulatory Commis-

sion(FERC), Mineral Management Service (MMS), Food and

Drug Administration (FDA), and so on, and the appropriate

executives from the companies.These agencies receive from the agents in Government

View, namely, lawmakers and their staff, regulations which

they enforce on the market participants. They also monitor the

market and companies, collect information, and report the

effects of regulations to the agents in Government View for

potential improvements. This feedback control loop acts at a

time scale of years.One typical example of this view is the activity of the SEC

which regulates the securities industry. As shown in Figure 8, the SEC receives laws and regulatory directives from the agents in the Government View, such as the President, the Congress, and the Federal Reserve Board. Through its five Divisions and 23 Offices, the SEC enforces federal securities laws, issues new rules, and oversees securities-related activities. For instance, the SEC regularly monitors the market for unusual trading patterns that might reveal illegal acts such as insider trading, and takes corrective actions, playing the role of a "controller" here, to ensure fairness in the securities markets. While the SEC should be praised for its post-crisis enforcement actions, successfully going after various Wall Street entities for their misconduct, several failures of the SEC before and during the crisis contributed to it, as Judge Rakoff argues persuasively.45 Many of these failures are failures of the elements in Figure 7 that can be modeled using our template of failure modes. In a similar manner, many of the failures at the Minerals Management Service46 that contributed to the BP Oil Spill disaster can be modeled using our approach. While we do not go into all the details, as that would make this article too long, we do provide a summary of these failures in a series of tables, later in the article, that compare regulatory failures in three different domains.
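To give a flavor of the SEC's "controller" role described above, the following sketch flags trading days whose volume deviates from a rolling baseline by more than a few standard deviations. This is purely our own toy illustration: the rule, thresholds, data, and function names are assumptions for exposition, not an actual SEC surveillance method.

import statistics

def flag_unusual_volume(volumes, window=20, threshold=3.0):
    """Flag indices whose volume lies more than `threshold` standard
    deviations from the trailing `window`-day mean. A toy rule only."""
    flagged = []
    for i in range(window, len(volumes)):
        baseline = volumes[i - window:i]
        mean = statistics.mean(baseline)
        stdev = statistics.stdev(baseline)
        if stdev > 0 and abs(volumes[i] - mean) > threshold * stdev:
            flagged.append(i)
    return flagged

# Synthetic daily volumes with one abnormal spike at day 30
# (e.g., a hypothetical burst of trading ahead of an announcement).
volumes = [100 + (i % 5) for i in range(40)]
volumes[30] = 450
print(flag_unusual_volume(volumes))   # -> [30]

In control terms, the detector is the sensor; the enforcement action that follows a confirmed anomaly is the actuator closing the regulatory loop.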

Perspective VI: Government View Layer

The Government View, like the Plant and Market Views, is a collection of various agencies organized to govern a society of autonomous and non-autonomous agents (e.g., physical assets). The objectives here are the security, stability, and overall wellbeing of the agents and their environment against a variety of risks and threats. Depending on the societal preference for capitalism, communism, socialism, monarchy, or dictatorship, the institutions and their structure can be widely different. The objective of our article is not to discuss these in any detail (there are vast resources on this subject in sociology and political science), but only to show how our control-theoretic framework accommodates the structures and functions at this level in a uniform and consistent manner, which is helpful for a system-theoretic analysis of system-wide risks and threats. In the context of the U.S., this structure is the three branches of government (executive, legislative, and judicial) with the associated agencies they supervise. The agents are the members of these branches. The time scale is typically four years, the presidential election cycle, but institutional memory in Congress and the judiciary can prolong this to decades. That is, it can take that long to make significant changes in governance.

Perspective VII: Societal View Layer

Finally, we arrive at the topmost level in this modeling hierarchy. The primary agents (autonomous) are the citizens and the elected officials in a democracy such as the U.S. It is, of course, very different for other political structures, as noted. Again, while the presidential election cycle imposes a certain natural characteristic time, institutional memories can prolong this to decades. The societal "set points" are the preferences of the citizenry, which can vary over time, typically on the order of decades or generations. In an ideal democracy, the citizens get to decide what kind of society or country they would all like to live in.

Figure 7. Control-theoretic model of regulatory layer.

Figure 8. Control-theoretic model of the Securities and Exchange Commission.

Table 1. Failure Taxonomy Part I (examples are drawn from Refs. 2, 12, and 56)

1. Monitoring Failures: Failure to monitor the key parameters effectively, or having significant errors in the monitored data.

1.1 Fail to monitor: Failure to monitor key performance indicators ("failing zero").
Examples: In the BP Texas City Refinery Explosion, there were numerous measures for tracking various types of operational, environmental, and safety performance, but no clear focus on the leading indicators for potential catastrophic or major incidents. In the Northeast Blackout, MISO did not discover that Harding-Chamberlin had tripped until after the blackout, when MISO reviewed the breaker operation log that evening. In the Subprime Crisis, Moody's did not sufficiently account for the deterioration in underwriting standards or a dramatic decline in home prices, and did not even develop a model specifically to take into account the layered risks of subprime securities until late 2006, after it had already rated nearly 19,000 subprime securities.

1.2 Failure to monitor effectively: Failure to detect/report problems in a timely manner.
Examples: In the Northeast Blackout, the Cleveland-Akron area's voltage problems were well known and reflected in the stringent voltage criteria used by control area operators until 1998. BP Texas City did not effectively assess changes involving people, policies, or the organization that could impact process safety.

1.3 Significant errors in monitoring: Monitored data are significantly inaccurate, either over-reporting ("failing high") or under-reporting ("failing low") the actual trend.
Examples: In the BP Texas City Refinery Explosion, a lack of supervisory oversight and technically trained personnel during the startup, an especially hazardous period, was an omission contrary to BP safety guidelines; an extra board operator was not assigned to assist, despite a staffing assessment that recommended an additional board operator for all ISOM startups. In the Northeast Blackout, from 15:05 EDT to 15:41 EDT, MISO did not recognize the consequences of the Hanna-Juniper loss, and FE operators knew neither of the line's loss nor its consequences; PJM and AEP recognized the overload on Star-South Canton, but had not expected it because their earlier contingency analysis did not examine enough lines within the FE system to foresee this result of the Hanna-Juniper contingency on top of the Harding-Chamberlin outage.

2. Decision Making Failures: Failure to provide the correct decisions in a timely manner.

2.1 Model failures: Decisions are not supported by the local system (i.e., "plant-model mismatch").
Examples: In the Subprime Crisis, financial institutions and credit rating agencies embraced mathematical models as reliable predictors of risk, replacing judgment in too many instances. In the Northeast Blackout, one of MISO's primary system condition evaluation tools, its state estimator, was unable to assess system conditions for most of the period between 12:15 and 15:34 EDT, due to a combination of human error and the effect of the loss of DPL's Stuart-Atlanta line on other MISO lines as reflected in the state estimator's calculations.

2.2 Inadequate or incorrect local decisions: Decisions made are unfavorable to the local system under supervision.
Examples: In the BP Texas City Refinery Explosion, the process unit was started despite previously reported malfunctions of the tower level indicator, level sight glass, and a pressure control valve. In the Subprime Crisis, financial institutions made the inadequate decision of using excessive leverage and complex financial instruments. In the Northeast Blackout, FE used minimum acceptable normal voltages that were lower than, and incompatible with, those used by its interconnected neighbors.

2.3 Inadequate or incorrect global decisions: Decisions made are unfavorable for the global system, even if locally right.
Examples: In the Subprime Crisis, the banks had gained their own securitization skills and no longer needed the investment banks to structure and distribute, so the investment banks moved into mortgage origination to guarantee a supply of loans they could securitize and sell to the growing legions of investors; but they lacked a global view of the entire market. In the Northeast Blackout, many generators had pre-designed protection points that shut the units down early in the cascade, so there were fewer units on-line to prevent island formation or to maintain balance between load and supply within each island after it formed; in particular, some generators tripped to protect the units from conditions that did not justify their protection, and many others were set to trip in ways that were not coordinated with the region's under-frequency load-shedding, rendering that UFLS scheme less effective.


The overall goals of the citizens in the U.S., as expressed in the Declaration of Independence, are Life, Liberty and the pursuit of Happiness.47 Given these goals, in every election the citizens get to vote on a number of issues related to the economy, environment, education, health, security, privacy, race relations, and so on.

This is the topmost layer of the model. Its feedback loop involves the citizens, elected government officials, and regulators. In the Government View Layer, the three branches of the U.S. government act as the "controller" of a collection of regulatory agencies and the country. In the Societal View Layer, citizens oversee and influence the society through elections. It usually takes decades for a society to adapt and evolve in any significant fashion; the societal set point is related to the history and culture of a nation.
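Before moving on, the layered structure just described can be summarized compactly. The sketch below is our own illustrative restatement: the view names and agents follow Table 5 and the discussion above, while the time scales for the lower three layers are our assumptions, since the text states them explicitly only from the Market View upward.

# Illustrative summary of the seven TeCSMART views. Agents follow
# Table 5; the lower three time scales are assumptions (marked),
# as the text specifies them only from the Market View up.
TECSMART_LAYERS = [
    ("Equipment View",  "engineers, operators, equipment",  "seconds to minutes (assumed)"),
    ("Plant View",      "plant management and operators",   "hours to days (assumed)"),
    ("Management View", "senior corporate management",      "weeks to months (assumed)"),
    ("Market View",     "companies and their activities",   "months"),
    ("Regulatory View", "OSHA, EPA, SEC, FERC, etc.",       "years"),
    ("Government View", "the three branches of government", "four years to decades"),
    ("Societal View",   "citizens and elected officials",   "decades or generations"),
]

for view, agents, scale in TECSMART_LAYERS:
    print(f"{view:16s} | {agents:34s} | {scale}")

The widening time scales are the point: a disturbance that is corrected in minutes at the Equipment View may take a generation to correct at the Societal View.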

In all systemic failures, such as the ones mentioned above, we all play a role through the Societal View Layer, and we bear some of the blame, as it was our collective decision (in the case of the U.S.) to elect a particular party, with its political and regulatory views, to govern us. This accountability is a direct consequence of our responsibility. Consider, for example, the responsibility of the CEO of a large petrochemical company with many plant sites and tens of thousands of employees. The CEO may not know everything that goes on at all her plant sites on a daily basis, but when a disaster strikes, she and her C-suite executives are held accountable. Time and again, in all the official inquiries of major disasters, whether it was Bhopal, Piper Alpha, the BP Oil Spill, the Global Financial Crisis, or the Northeast Power Blackout,

Table 2. Failure Taxonomy Part II (examples are drawn from Refs. 2, 12, and 56)

2.4 Resource Failures: Failure to acquire, allocate, and manage the required resources properly to complete the tasks safely and achieve the goal(s).

2.4.1 Lack of resources: Failure to acquire the necessary resources, such as funds, manpower, time, etc.
Examples: In the BP Texas City Refinery Explosion, BP had not always ensured that it identified and provided the resources required for strong process safety performance at its U.S. refineries, including both financial and human resources. In the Subprime Crisis, in an interview with the FCIC, Greenspan went further, arguing that with or without a mandate, the Fed lacked sufficient resources to examine the nonbank subsidiaries; worse, the former chairman said, inadequate regulation sends a misleading message to the firms and the market. But if resources were the issue, the Fed chairman could have argued for more; the Fed was always mindful, however, that it could be subject to a government audit of its finances. In the Northeast Blackout, there was no UVLS system in place within Cleveland and Akron; had such a scheme been implemented before August 2003, shedding 1,500 MW of load in that area before the loss of the Sammis-Star line might have prevented the cascade and blackout.

2.4.2 Inadequate allocation of resources: Resources are deployed incorrectly, e.g., over-staffing ("failing high") in some areas while under-staffing ("failing low") elsewhere.
Examples: In the BP Texas City Refinery Explosion, the incident at Texas City and its connection to serious process safety deficiencies at the refinery emphasize the need for OSHA to refocus resources on preventing catastrophic accidents through greater PSM enforcement. In the Northeast Blackout, on August 14, the lack of adequate dynamic reactive reserves, coupled with not knowing the critical voltages and maximum import capability to serve native load, left the Cleveland-Akron area in a very vulnerable state.

2.4.3 Training failures: Failures related to the lack of organized activities aimed at helping employees attain the level of knowledge and skill needed in their current job; this includes emergency response training.
Examples: In the BP Texas City Refinery Explosion, BP had not adequately ensured that its U.S. refinery personnel and contractors had sufficient process safety knowledge and competence. In the Subprime Crisis, in theory, borrowers are the first defense against abusive lending; but many borrowers did not understand the most basic aspects of their mortgage, and borrowers with less access to credit are particularly ill equipped to challenge the more experienced person across the desk. In the Northeast Blackout, the FE operators did not recognize the information they were receiving as clear indications of an emerging system emergency.

2.5 Conflict of interest: Incorrect decisions reached due to a conflict of interest arising from competing goals that can affect proper judgment and execution of tasks, e.g., safety versus financial gain, or ethical failures such as corruption.
Examples: In the BP Texas City Refinery Explosion, cost-cutting, failure to invest, and production pressures from BP Group executive managers impaired process safety performance at Texas City. In the Subprime Crisis, many of Moody's former employees said that after the public listing the company's culture changed; it went from one resembling a university academic department to one that values revenues at all costs, according to Eric Kolchinsky, a former managing director. In the Northeast Blackout, generator protections should be set tight enough to protect the unit from the grid, but also wide enough to assure that the unit remains connected to the grid as long as possible; this coordination is a risk management issue that must balance the needs of the grid and customers against the needs of the individual assets.


the management was held responsible and accountable for their companies' failures. In fact, in a historic first, establishing an encouraging precedent, in April 2016 the former Massey Energy CEO was sentenced to twelve months in prison as a result of the mining company's disaster.48,49 Thus, the people in charge have to be held accountable for part of the blame.

In a democratic society, the people in charge are, ultimately, us, the citizens who elected the government. Therefore, we are responsible, in some part, for the failures resulting from its policies. We are thus responsible for Bhopal, the BP Oil Spill, the Subprime Crisis, and so on.

Table 3. Failure Taxonomy Part III (examples are drawn from Refs. 2, 12, and 56)

3. Action Failures: Actions carried out incorrectly or inadequately.

3.1 Flawed actions, including supervision: Failure to perform the right actions, performing no action, or performing the wrong actions; failure to follow standard operating procedures.
Examples: In the BP Texas City Refinery Explosion, numerous heat exchanger tube thickness measurements were not taken; some pressure vessels, storage tanks, piping, relief valves, rotating equipment, and instruments were overdue for inspection in the six operating units evaluated. In the Subprime Crisis, struggling to remain dominant, Fannie and Freddie loosened their underwriting standards, purchasing and guaranteeing riskier loans and increasing their securities purchases; yet their regulator, the Office of Federal Housing Enterprise Oversight (OFHEO), focused more on accounting and other operational issues than on Fannie's and Freddie's increasing investments in risky mortgages and securities. In the Northeast Blackout, numerous control areas in the Eastern Interconnection, including FE, were not correctly tagging dynamic schedules, resulting in large mismatches between actual, scheduled, and tagged interchange on August 14.

3.2 Late response: Failure to take the right actions at the right time.
Examples: In the BP Texas City Refinery Explosion, neither Amoco nor BP replaced blowdown drums and atmospheric stacks, even though a series of incidents warned that this equipment was unsafe; in the years prior to the incident, eight serious releases of flammable material from the ISOM blowdown stack had occurred, and most ISOM startups experienced high liquid levels in the splitter tower, yet neither Amoco nor BP investigated these events. In the Subprime Crisis, declining underwriting standards and new mortgage products had been on regulators' radar screens in the years before the crisis, but disagreements among the agencies and their traditional preference for minimal interference delayed action. In the Northeast Blackout, the alarm processing application had failed on occasions prior to August 14, leading to loss of the alarming of system conditions and events for FE's operators; however, FE said that the mode and behavior of this particular failure event were first-time occurrences which, at the time, FE's IT personnel neither recognized nor knew how to correct.

4. Communication Failures: Failures associated with the system of pathways (informal or formal) through which messages flow to different levels and different people in the organization.

4.1 Communication failure with external entities: Failures of communication between an individual and/or a group/organization and an external individual and/or organization.
Examples: In the BP Texas City Refinery Explosion, BP and Amoco did not cooperate well to investigate previous incidents and replace the blowdown drum. In the Subprime Crisis, the leverage was often hidden; lenders rarely discussed the leverage and the associated high risk with their investors, and investors relied on the credit rating agencies, often blindly. In the Northeast Blackout, the Stuart-Atlanta 345-kV line, operated by DPL and monitored by the PJM reliability coordinator, tripped at 14:02 EDT; however, since the line was not in MISO's footprint, MISO operators did not monitor the status of this line and did not know it had gone out of service, which led to a data mismatch that prevented MISO's state estimator (a key monitoring tool) from producing usable results later in the day, at a time when system conditions in FE's control area were deteriorating.

4.2 Peer-to-peer communication failure: Failures of communication between an individual and another individual within a group and/or organization.
Examples: In the BP Texas City Refinery Explosion, the night lead operator left early, and very limited information about his control actions was given to the day board operator. In the Northeast Blackout, FE computer support staff did not effectively communicate the loss of alarm functionality to the FE system operators after the alarm processor failed at 14:14, nor did they have a formal procedure to do so.

4.3 Inter-level communication failure: Failures of communication between an individual and another individual at a greater or lower level of authority within the same group and/or organization.
Examples: In the BP Texas City Refinery Explosion, supervisors and operators poorly communicated critical information regarding the startup during the shift turnover. In the Northeast Blackout, ECAR and MISO did not precisely define critical facilities such that the 345-kV lines in FE that caused a major cascading failure would have been identified as critical facilities for MISO; MISO's procedure in effect on August 14 was to request FE to identify critical facilities on its system to MISO.


This is why it is vitally important for citizens to stay informed, engaged, and active in the political process. This is particularly important to remember as we begin to address the mother of all systemic failures, the Climate Change Crisis, which has been in the making for decades.

TeCSMART: Comparative Analysis of Three Major Disasters

Failure analysis and comparison

In this section, we discuss the results of applying the TeCSMART framework to three prominent systemic failures, namely, the BP Texas City Refinery Explosion (2005), the Global Financial Crisis (2008-09), and the Northeast Power Blackout (2003). We in fact studied the following thirteen systemic failures: (1) the Bhopal Disaster (1984), (2) the Space Shuttle Challenger Disaster (1986), (3) the Piper Alpha Disaster (1988), (4) the SARS Outbreak (2002-03), (5) the Space Shuttle Columbia Disaster (2003), (6) the Northeast Power Blackout (2003), (7) the BP Texas City Refinery Explosion (2005), (8) the Global Financial Crisis (2008-09), (9) the BP Deepwater Horizon Oil Spill (2010), (10) the Upper Big Branch Mine Disaster (2010), (11) the Chilean Mining Accident (2010), (12) the Fukushima Daiichi Nuclear Disaster (2011), and (13) the India Blackouts (2012), by carefully reviewing the official postmortem reports of these disasters as well as other relevant sources. However, for the sake of brevity, we present the comparative analysis of only the three disasters named above; the other cases show similar failure patterns as well. We analyzed and classified over 700 failures mentioned in these reports.1,2,50-60 We categorize these failures into five primary classes and 19 subclasses, which are consistent with the typical failure modes we discussed in the previous section.

The five classes are as follows: (1) Monitoring Failures; (2) Decision Making Failures; (3) Action Failures; (4) Communication Failures; and (5) Structural Failures. Each category has subcategories that define more detailed failures; subclass details are listed in Tables 1-4. The five-class failure taxonomy reveals "what can potentially go wrong" in a complex sociotechnical system, and summarizes the failure modes modeled using the TeCSMART framework.

Table 4. Failure Taxonomy Part IV (examples are drawn from Refs. 2, 12, and 56)

5. Structural Failures: Deficient structures and/or models.

5.1 Design failures: Defects or deficiencies in the design of the system/component/model, or simply a wrong design of the system/component/model.
Examples: In the BP Texas City Refinery Explosion, occupied trailers were sited too close to a process unit handling highly hazardous materials; all fatalities occurred in or around the trailers. In the Subprime Crisis, where were Citigroup's regulators while the company piled up tens of billions of dollars of risk in the CDO business? Citigroup had a complex corporate structure and, as a result, faced an array of supervisors: the Federal Reserve supervised the holding company but, as the Gramm-Leach-Bliley legislation directed, relied on others to monitor the most important subsidiaries; the Office of the Comptroller of the Currency (OCC) supervised the largest bank subsidiary, Citibank; and the SEC supervised the securities firm, Citigroup Global Markets. Moreover, Citigroup did not really align its various businesses with the legal entities; an individual working on the CDO desk on an intricate transaction could interact with various components of the firm in complicated ways. In the Northeast Blackout, although MISO received SCADA input of the line's status change, this was presented to MISO operators as breaker status changes rather than a line failure; because their EMS system topology processor had not yet been linked to recognize line failures, it did not connect the breaker information to the loss of a transmission line. Thus, MISO's operators did not recognize the Harding-Chamberlin trip as a significant contingency event and could not advise FE regarding the event or its consequences; further, without its state estimator and associated contingency analyses, MISO was unable to identify potential overloads that would occur due to various line or equipment outages.

5.2 Maintenance failures: Failure to adequately repair and maintain equipment at all times.
Examples: In the BP Texas City Refinery Explosion, deficiencies in BP's mechanical integrity program resulted in the run-to-failure of process equipment at Texas City. In the Northeast Blackout, FE had no periodic diagnostics to evaluate and report the state of the alarm processor, and nothing about the eventual failure of two EMS servers would have directly alerted the support staff that the alarms had failed in an infinite-loop lockup.

5.3 Operating procedure failures: Failure to develop and execute standard operating procedures for all tasks.
Examples: In the BP Texas City Refinery Explosion, outdated and ineffective procedures did not address recurring operational problems during startup, leading operators to believe that procedures could be altered or did not have to be followed during the startup process. In the Subprime Crisis, in addition to the rising fraud and egregious lending practices, lending standards deteriorated in the final years of the bubble. In the Northeast Blackout, the PJM and MISO reliability coordinators lacked an effective procedure on when and how to coordinate an operating limit violation observed by one of them in the other's area; the lack of such a procedure caused ineffective communications between PJM and MISO regarding PJM's awareness of a possible overload on the Sammis-Star line as early as 15:48.


Different failure modes give rise to systemic failures in different domains. However, there are common failure modes shared by many, if not all, systemic failures. Such common failure pathways help us identify, proactively, how things can potentially go wrong in a complex system. By studying these common failure mechanisms, people can become more vigilant when designing and operating new systems. Thus, the common patterns identified by our comparative analysis are helpful not only diagnostically but also prognostically.
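To suggest how this taxonomy could be used computationally, the following sketch encodes the five classes and the subclasses of Tables 1-4 so that failure evidence can be tagged programmatically. The encoding itself is our own illustration; the class names follow the tables, but the data structures and identifiers are hypothetical.

from enum import Enum

class FailureClass(Enum):
    """The five primary failure classes of the TeCSMART taxonomy."""
    MONITORING = 1
    DECISION_MAKING = 2
    ACTION = 3
    COMMUNICATION = 4
    STRUCTURAL = 5

# Subclass codes follow Tables 1-4 (e.g., "3.1" = flawed actions).
SUBCLASSES = {
    "1.1": "Fail to monitor",
    "1.2": "Failure to monitor effectively",
    "1.3": "Significant errors in monitoring",
    "2.1": "Model failures",
    "2.2": "Inadequate or incorrect local decisions",
    "2.3": "Inadequate or incorrect global decisions",
    "2.4.1": "Lack of resources",
    "2.4.2": "Inadequate allocation of resources",
    "2.4.3": "Training failures",
    "2.5": "Conflict of interest",
    "3.1": "Flawed actions including supervision",
    "3.2": "Late response",
    "4.1": "Communication failure with external entities",
    "4.2": "Peer-to-peer communication failure",
    "4.3": "Inter-level communication failure",
    "5.1": "Design failures",
    "5.2": "Maintenance failures",
    "5.3": "Operating procedure failures",
}

def primary_class(code: str) -> FailureClass:
    # The leading digit of a subclass code identifies its primary class.
    return FailureClass(int(code.split(".")[0]))

print(primary_class("3.1"))   # FailureClass.ACTION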

The comparative analysis of the three case studies is performed in the following three steps. (1) Carefully review the official postmortem reports and classify the failures into the classes/subclasses defined in Tables 1-4. For example, the level control valve was accidentally turned off by an operator at the BP Texas City Refinery; this failure is classified as a flawed action (3.1 in Table 3). Overgrown trees are a known problem for all power grid operators, but FirstEnergy (FE) failed to trim them, which led to line trips; the inadequate tree trimming is classified as a late response failure (3.2 in Table 3). (2) Once failures are classified properly, organize them in the TeCSMART framework according to the relevant agents and the failure mechanisms. The relevant agents indicate the level of the failure in the TeCSMART framework, and the failure mechanisms explain which control component the failure is associated with. One layer can have multiple failures, and one failure can appear multiple times at different levels. Thus, the level control valve failure is a flawed action of the actuator at the Process View, and the inadequate tree trimming is a late response of the actuator at the Plant View. (3) Compare failures across domains to identify common patterns.
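These three steps lend themselves to a simple record-keeping scheme. The sketch below is a minimal illustration under our own assumptions (the record fields and example strings are hypothetical): each piece of failure evidence becomes a record tagged with its case, TeCSMART view, control component, and taxonomy subclass, and records are then grouped by view and mechanism as in step (2).

from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class FailureRecord:
    """One piece of failure evidence from a postmortem report."""
    case: str        # e.g., "BP Texas City"
    view: str        # TeCSMART level, e.g., "Process View"
    component: str   # control element: sensor, controller, actuator, ...
    subclass: str    # taxonomy code from Tables 1-4, e.g., "3.1"
    note: str

records = [
    FailureRecord("BP Texas City", "Process View", "actuator", "3.1",
                  "level control valve accidentally turned off"),
    FailureRecord("Northeast Blackout", "Plant View", "actuator", "3.2",
                  "inadequate tree trimming led to line trips"),
]

# Step (2): organize by (view, component) to see which control element
# failed at which level; step (3) then compares these across cases.
by_location = defaultdict(list)
for r in records:
    by_location[(r.view, r.component)].append(r)

for (view, component), recs in by_location.items():
    print(view, "/", component, "->", [r.subclass for r in recs])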

Case Studies

In this section, we briefly introduce the three prominent systemic failures, the Northeast Blackout (2003), the BP Texas City Refinery Explosion (2005), and the Subprime Crisis (2008), and compare their failures by applying the TeCSMART framework. The comparison shows the similarities and differences among the three systemic failures. Moreover, the common patterns indicate important failure modes, which can help improve system design, control, and risk management.

Overview

The Northeast Blackout, which happened on August 14, 2003, was the largest blackout of the North American power grid. With many generating units tripping and transmission lines disconnecting starting around noon, the cascading sequence was essentially complete by 4:13 p.m. A shutdown cascade triggered the blackout: a supply/demand mismatch and poor vegetation management triggered power surges in the transmission lines, FE's operators did not pay attention to the warning signs and communicated poorly with other line operators, and finally the power surges spread and the blackout emerged.56

The BP Texas City refinery is the third largest refinery in the United States, employing approximately 1,800 BP workers. On March 23, 2005, the refinery initiated the startup of the ISOM raffinate splitter section. During the startup, a control valve was accidentally turned off by an operator and the tower was filled with flammable liquid for over 3 hours. The pressure relief valve was activated by the high pressure in the tower and discharged liquid to the blowdown drum.

Figure 9. Cross-domain comparison table.

The blowdown drum overfilled and the stack vented flammable liquid to the atmosphere, forming a vapor cloud. When the flammable vapor cloud reached an idling diesel pickup truck, an explosion occurred. The explosion and fires killed 15 people, injured 180 others, and resulted in financial losses exceeding $1.5 billion.12

In the summer of 2007, leading banks in the U.S. started to fail as a result of falling real estate prices. Bear Stearns, the fifth largest investment bank, whose stock had traded at $172 a share as late as January 2007, was sold to JP Morgan Chase at the fire-sale price of $2 a share on March 16, 2008; Lehman Brothers, the fourth largest, went bankrupt; Fannie Mae and Freddie Mac were taken over by the government; and American International Group (AIG), the insurance giant, was bailed out by taxpayers.61 Over half a million families lost their homes to foreclosure. Nearly $11 trillion in household wealth vanished. Between January 2007 and March 2009, the stock market lost half its value.62 The final cost to the U.S. economy as a result of the biggest financial crisis since the Great Depression was about $22 trillion! To get a sense of its magnitude, compare it with the U.S. GDP in 2014, which was $17.4 trillion.

TeCSMART Comparison

A cross-domain comparison, shown in Figure 9, was conducted by analyzing and comparing the failures of these three prominent systemic failures. Figure 9 is a table whose rows are TeCSMART views and failure classes, and whose columns are the three systemic failures. Table 5 lists the agents of the three systemic failures. As discussed before, we classify the failure evidence found in the postmortem investigation reports into different failure classes, related to specific control components at the appropriate levels. We then mark each failure class as a colored cell in the table, with a color code in which blue represents the BP Texas City Refinery Explosion, yellow represents the Subprime Crisis, and brown represents the Northeast Blackout. If all three colors appear in the same row, that particular failure occurred in all three cases. Therefore, by comparing the colored cells, we are able to study the failure mechanisms and their similarities and differences. Figure 10 highlights the failure classes identified in the comparison table (Figure 9).
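The color-coded table of Figure 9 is, in effect, a boolean matrix over (view, failure subclass) rows and case columns. The small sketch below (our own illustration with made-up entries, not the paper's actual data) shows how such a matrix can be scanned for rows in which all three cases are marked, i.e., the common failure patterns.

# Rows are (TeCSMART view, failure subclass); columns are the cases.
CASES = ["BP Texas City", "Subprime Crisis", "Northeast Blackout"]

# Illustrative entries only; the real table is built from Refs. 1, 2, 56.
table = {
    ("Management View", "2.4.3 Training failures"): {
        "BP Texas City", "Subprime Crisis", "Northeast Blackout"},
    ("Regulatory View", "3.1 Flawed actions"): {
        "BP Texas City", "Subprime Crisis", "Northeast Blackout"},
    ("Plant View", "5.2 Maintenance failures"): {
        "BP Texas City", "Northeast Blackout"},
}

# Common patterns: rows "colored" in every column.
common = [row for row, cases in table.items() if cases == set(CASES)]
for view, subclass in common:
    print(f"common to all three cases: {subclass} at the {view}")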

Failures were found at every level in all three cases. Operational failures are more common at the lower levels, while controller failures dominate at the higher levels. Among the many important observations and insights from the comparison, we highlight a few and discuss them in depth.

The comparison shows that lack of appropriate training was a widespread problem. In Figure 9, training failures appear in the bottom three views of all three cases. Evidence shows that operators, and even managers, had not received appropriate and sufficient training prior to the accidents. The operator training program was inadequate at the BP Texas City Refinery: the training department staff had been reduced from 28 to 8, and there were no simulators for operators to practice handling abnormal events.12 The training failure at BP is confirmed by the logic tree created by the Chemical Safety and Hazard Investigation Board (CSB), highlighted in Figure 11a. Similar things happened in the Northeast Blackout. FE operators were poorly trained to recognize emergency information: they received signals indicating line trips, but made poor decisions by relying solely on the Energy Management System (EMS), which unfortunately had failed at this time. The FE engineers' poor judgment and lack of training played a significant role in the failure. Their lack of training was also highlighted by ThinkReliability in their causal map, depicted in Figure 12. Such a pattern was also seen in the financial system failure.2,64

Decision-makers are "controllers" in the TeCSMART framework. In all three cases, almost every layer shows decision making failures. For example, the decision to start up the ISOM unit despite previously reported malfunctions of the raffinate tower level indicator, pressure control valve, and level sight glass was a serious failure, which directly triggered the overall disaster.12 Moreover, BP's cost-cutting decisions, which led to the layoff of experienced workers from Amoco, contributed to the accident as well.1 These failures are highlighted by the CSB in Figures 11b, c. In the Subprime Crisis, fund managers' decision to invest in subprime securities without fully understanding the embedded risks was a leading cause of the financial system's collapse.2 FE's decision to use minimum acceptable normal voltages (highlighted in Figure 12), which were lower than and incompatible with those of its neighbors, directly caused power surges and transmission-line sag.56 At the management level, as demonstrated by both our comparison study and the CSB analysis (Figures 11a, c), a critical failure was BP's not providing enough resources for strong process safety performance at its U.S. refineries.12 At the same level, the CEOs of financial institutions decided to maintain large quantities of subprime-related assets using very high leverage, which dramatically magnified the scale of the crisis. Moreover, a locally favorable decision may sometimes bring undesired consequences to the overall system.

Table 5. Agents of Each View

Societal View. BP Texas City Refinery Explosion: U.S. citizens. Subprime Crisis: citizens worldwide. Northeast Blackout: U.S. and Canadian citizens.

Government View. BP Texas City: employees of the different branches of government. Subprime Crisis: employees of the U.S. and foreign governments. Northeast Blackout: employees of the U.S. and Canadian governments.

Regulatory View. BP Texas City: employees of OSHA. Subprime Crisis: employees of the FED, SEC, FDIC, OCC, OTS. Northeast Blackout: employees of NERC and FERC (U.S.) and of the NEB (Canada).

Market View. BP Texas City: companies in the oil and gas refining industry. Subprime Crisis: institutions in the financial industry. Northeast Blackout: the MAAC-ECAR-NPCC power grid.

Management View. BP Texas City: BP senior management. Subprime Crisis: senior management of financial institutions and credit rating agencies. Northeast Blackout: senior management of FE, AEP, MISO, PJM.

Plant View. BP Texas City: BP Texas City refinery management. Subprime Crisis: dealers, investors, and managers of financial products. Northeast Blackout: Eastlake 5 generation, Harding-Chamberlin line.

Equipment View. BP Texas City: engineers, operators, and equipment. Subprime Crisis: borrowers, lenders, brokers, subprime loans. Northeast Blackout: engineers, operators, and equipment.


In the North American power grid, the pre-designed protection points that protect individual generating units did not work for the whole system: when individual units dropped out of the grid, the burden fell entirely on the remaining parts of the system, which finally had no option but to fail systemically.56

Monitoring problems often play a major role in sociotechnical disasters. Monitoring failures were observed at the management level in all three cases. As discussed in the last section and in Table 1, a sensor or a monitoring task can fail low, fail high, fail zero, or fail to detect in time. BP was not aware of hazards at the Texas City Refinery because it failed to incorporate previous incidents; even worse, the incident investigations were missing1 ("failing zero"). BP's monitoring failure is specifically mentioned by the CSB in Figure 11d. Similarly, prior to the Subprime Crisis, Moody's did not account for the deterioration in underwriting standards and was not aware of the plummeting home prices; Moody's did not develop a model specifically to look into the layered risks of subprime securities until after it had rated nearly 19,000 of them2 ("failing zero"). Deregulation and self-policing by financial institutions had stripped away key safeguards2 ("failing low"). Moreover, in the Northeast Blackout, the Midcontinent Independent System Operator (MISO) failed to recognize the consequence of the Hanna-Juniper line loss,

Figure 10. Failure modes in the comparison table.


while other operators recognized the overload but had not expected it, because the earlier contingency analysis did not examine enough lines to foresee the Hanna-Juniper contingency. The failure to recognize the line loss in a timely manner worsened the situation; when the operators finally figured out what was happening, it was too late to respond56 ("failing to detect in time"). MISO's monitoring failure was not only highlighted by ThinkReliability (in Figure 12) as a lack of warning, but also raised the concern of the U.S.-Canada Power System Outage Task Force. The Task Force report56 recommends that FERC not approve the operation of a new Regional Transmission Operator (RTO) or Independent System Operator (ISO) until the applicant has met the minimum functional requirements for reliability coordinators. This recommendation directly addressed the issue of MISO, as a reliability coordinator, failing to recognize line loss in its region.

Beyond the decision making and monitoring failures, the

flawed actions of regulators and their limited oversight also contribute to sociotechnical system collapses. The reports1,12 mention that OSHA did not conduct a comprehensive inspection of any of the 29 process units at the Texas City Refinery. Knowing the high leverage and the vast sums of subprime loans, the FED did not begin routinely examining subprime subsidiaries until a pilot program in July 2007, and did not even issue new rules until July 2008, a year after the subprime market had shut down.2 The North American Electric Reliability Corporation (NERC), the power grid's self-regulator, knowing FE's potential risk, did not enforce any changes or regulate FE's activities.56 All these flawed actions contributed to the disasters. Regulators also experience conflicts of interest; financial regulators, especially, face challenges from powerful financial institutions.

These observations are just a few examples of what we studied in the TeCSMART comparison. Compared with the logic tree and the causal map, the TeCSMART comparison is able to capture high-level failures, such as regulatory failures, which are not covered in the logic tree or causal map. More importantly, the TeCSMART comparison can systematically identify potential risks in a sociotechnical system by identifying possible failure modes associated with different components at different levels.

Figure 11. The logic tree of the BP Texas City Refinery Explosion (adapted from Ref. 12, Investigation Report, Refinery Explosion and Fire).

Summary and Conclusions

As the recent systemic failures in different domains remind academics and practitioners alike, one can never take system safety for granted. All of us (individuals, corporate management, regulatory agencies, and communities) need to learn the lessons from every accident, particularly from the systemic ones. It is imperative to study all these disasters from a common systems engineering perspective, so that one can thoroughly understand the commonalities as well as the differences, in order to prevent or mitigate future ones. This is the approach we have adopted in this article.

Analyzing systemic risk in a complex sociotechnical system

thus requires modeling the system at multiple levels and from multiple perspectives, using a systematic and unified framework. It is not enough to focus only on equipment failures; it is important to systematically examine the potential failures associated with humans and institutions at all levels in a society. We have proposed such an approach, the TeCSMART framework, which models sociotechnical systems in seven layers using control-theoretic concepts. Using this framework, a HAZOP-like hazards identification can be conducted for every layer of a sociotechnical system. The failure modes identified using the TeCSMART framework, at all levels, serve as a common platform for comparing systemic failures from different domains, to elicit and understand common failure mechanisms that can help with improved design and risk management in the future. They also serve as input information for developing other types of models (e.g., DAE, SDG, game-theoretic, agent-based) for more detailed studies.

We carried out such a comparative analysis of 13 major systemic events from different domains, analyzing over 700 failures discussed in official postmortem reports. Even though we highlight the results from only three of them, for the sake of brevity, the common failure patterns we identify in this article were found in the other events as well. These 700+ failures can be systematically classified into the five categories (and their subcategories) that can occur at all levels of the system. Using a unifying control-theoretic framework, we show how these correspond to common failure modes associated with the elements of a control system, namely, the sensor, controller, actuator, process unit, and communication channels. Even though every systemic failure happens in some unique manner, and is not an exact replica of a past event, we show that the underlying failure mechanism can be traced back to similar

patterns associated with other events.

No modern engineered system of ever-increasing complexity can be totally risk free. However, minimizing the inherent risks in our products and processes is an important societal challenge, both intellectually and practically, for innovative science and engineering. Safety is not the responsibility of just the environment, health, and safety department; it is everyone's responsibility in the facility, and there is a need for systems, procedures, and corporate and regulatory cultures that ensure this. In the long run, considerable technological help would come from progress in taming complexity, which would result in more effective prognostic and diagnostic systems for monitoring, analyzing, and controlling systemic risks. But getting there will require innovative thinking, bolder vision, and overcoming certain misconceptions about process safety as an intellectually dull activity.
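As a compact recap of the correspondence described above between control-system elements and failure classes, the sketch below pairs each element with the taxonomy class most naturally associated with it. The pairing is our own illustrative reading of the framework, not a figure reproduced from the analysis, and individual failures may of course implicate several elements at once.

# Control-system elements paired with the taxonomy classes (Tables 1-4)
# most naturally associated with them; an illustrative summary only.
ELEMENT_FAILURES = {
    "sensor":                "1. Monitoring Failures",
    "controller":            "2. Decision Making Failures",
    "actuator":              "3. Action Failures",
    "communication channel": "4. Communication Failures",
    "process unit":          "5. Structural Failures",
}

for element, failure_class in ELEMENT_FAILURES.items():
    print(f"{element:22s} -> {failure_class}")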

Acknowledgment

This work is supported in part by the Center for the Management of Systemic Risk at Columbia University.

Literature Cited

1. Baker J, Leveson N, Bowman F, Priest S. The report of the BP U.S. refineries independent safety review panel. Report; Independent Safety Review Panel. 2007.

2. Financial Crisis Inquiry Commission. The financial crisis inquiry report: final report of the National Commission on the Causes of the Financial and Economic Crisis in the United States. PublicAffairs; 2011.

3. Ottino JM. Engineering complex systems. Nature. 2004;427(6973):399.

4. Jasanoff S. Learning from Disaster: Risk Management after Bhopal. Philadelphia: University of Pennsylvania Press, 1994. ISBN 081221532X.

5. Plotz D. Play the Enron blame game! Slate.com. 2002. Access date: February 23, 2016. [Available from: http://www.slate.com/articles/news_and_politics/politics/2002/02/play_the_enron_blame_game.html.]

6. CCPS. Building process safety culture: tools to enhance process safety performance. Report; Center for Chemical Process Safety of the American Institute of Chemical Engineers, New York. 2005.

7. MSNBC. Mine owner ran up serious violations. MSNBC; 2010. Access date: February 23, 2016. [Updated April 6, 2010. Available from: http://www.nbcnews.com/id/36202623/.]

Figure 12. The cause map of the Northeast Blackout (adapted from Ref. 63).

8. Thomas P, Jones LA, Cloherty J, Ryan J. BP's dismal safety record. ABC News. 2010. Access date: February 23, 2016. [Updated May 27, 2010. Available from: http://abcnews.go.com/WN/bps-dismal-safety-record/story?id=10763042.]

9. Johnson LD, Neave EH. The subprime mortgage market: familiar lessons in a new context. Manag Res News. 2007;31(1):12–26.

10. Lewis M. The Big Short: Inside the Doomsday Machine. New York: W. W. Norton, 2011. ISBN 9780393078190.

11. Olive C, O'Connor TM, Mannan MS. Relationship of safety culture and process safety. J Hazard Mater. 2006;130(1):133–140.

12. CSB. Investigation report: refinery explosion and fire. Report; U.S. Chemical Safety and Hazard Investigation Board. 2005.

13. Hopkins A. Failure to Learn: The BP Texas City Refinery Disaster. CCH Australia Ltd, 2008. ISBN 1921322446.

14. Krugman P. Berating the raters. The New York Times. 2010. Access date: February 23, 2016. [Updated April 25, 2010. Available from: http://www.nytimes.com/2010/04/26/opinion/26krugman.html?_r=0.]

15. Urbina I. Inspector general's inquiry faults regulators. The New York Times; 2010. Access date: February 23, 2016. [Updated May 24, 2010. Available from: http://www.nytimes.com/2010/05/25/us/25mms.html.]

16. Venkatasubramanian V. Systemic failures: challenges and opportunities in risk management in complex systems. AIChE J. 2011;57(1):2–9.

17. Venkatasubramanian V, Zhao JS, Viswanathan S. Intelligent systems for HAZOP analysis of complex process plants. Comput Chem Eng. 2000;24(9–10):2291–2302.

18. Catanzaro M, Buchanan M. Network opportunity. Nat Phys. 2013;9(3):121–123.

19. Caldarelli G, Chessa A, Gabrielli A, Pammolli F, Puliga M. Reconstructing a credit network. Nat Phys. 2013;9(3):125–126.

20. Galbiati M, Delpini D, Battiston S. The power to control. Nat Phys. 2013;9(3):126–128.

21. Ashby WR. Requisite variety and its implications for the control of complex systems. In: Facets of Systems Science. Springer US; 1991:405–417.

22. Natarajan S, Srinivasan R. Implementation of multi agents based system for process supervision in large-scale chemical plants. Comput Chem Eng. 2014;60:182–196.

23. Saleh JH, Marais KB, Favar FM. System safety principles: a multidisciplinary engineering perspective. J Loss Prev Process Ind. 2014;29:283–294.

24. Maurya MR, Rengaswamy R, Venkatasubramanian V. A systematic framework for the development and analysis of signed digraphs for chemical processes. 1. Algorithms and analysis. Ind Eng Chem Res. 2003;42(20):4789–4810.

25. Maurya MR, Rengaswamy R, Venkatasubramanian V. A systematic framework for the development and analysis of signed digraphs for chemical processes. 2. Control loops and flowsheet analysis. Ind Eng Chem Res. 2003;42(20):4811–4827.

26. Maurya MR, Rengaswamy R, Venkatasubramanian V. Application of signed digraphs-based analysis for fault diagnosis of chemical process flowsheets. Eng Appl Artif Intell. 2004;17(5):501–518.

27. Srinivasan R, Venkatasubramanian V. Multi-perspective models for process hazards analysis of large scale chemical processes. Comput Chem Eng. 1998;22:S961–S964.

28. Venkatasubramanian V, Vaidhyanathan R. A knowledge-based framework for automating HAZOP analysis. AIChE J. 1994;40(3):496–505.

29. Rasmussen J, Svedung I. Proactive risk management in a dynamic society. Swedish Rescue Services Agency, Karlstad, Sweden. 2000. ISBN 9789172530843.

30. Leveson NG, Stephanopoulos G. A system-theoretic, control-inspired view and approach to process safety. AIChE J. 2014;60(1):2–14.

31. Leveson NG. Engineering a Safer World: Systems Thinking Applied to Safety. 1st ed. Cambridge, MA: The MIT Press, 2011. ISBN 9780262016629.

32. Leveson NG. A systems-theoretic approach to safety in software-intensive systems. IEEE Trans Dependable Secure Comput. 2004;1(1):66–86.

33. Leveson N. A new accident model for engineering safer systems. Safety Sci. 2004;42(4):237–270.

34. Stephanopoulos G. Chemical Process Control: An Introduction to Theory and Practice. Englewood Cliffs, NJ: Prentice-Hall, 1984.

35. Seborg D, Edgar TF, Mellichamp D. Process Dynamics & Control. Wiley, 2006. ISBN 8126508345.

36. Ogunnaike BA, Ray WH. Process Dynamics, Modeling, and Control. Vol. 1. New York: Oxford University Press, 1994.

37. Bequette BW. Process Dynamics: Modeling, Analysis, and Simulation. Upper Saddle River, NJ: Prentice Hall PTR, 1998. ISBN 0132068893.

38. Bookstaber R, Glasserman P, Iyengar G, Luo Y, Venkatasubramanian V, Zhang Z. Process systems engineering as a modeling paradigm for analyzing systemic risk in financial networks. Off Financ Res Work Pap Ser. 2015;15(1).

39. Seider WD, Seader JD, Lewin DR. Product & Process Design Principles: Synthesis, Analysis and Evaluation. Wiley; 2009. ISBN 8126520329.

40. Srinivasan R, Venkatasubramanian V. Petri net-digraph models for automating HAZOP analysis of batch process plants. Comput Chem Eng. 1996;20:S719–S725.

41. Srinivasan R, Venkatasubramanian V. Automating HAZOP analysis of batch chemical plants: Part I. The knowledge representation framework. Comput Chem Eng. 1998;22(9):1345–1355.

42. Srinivasan R, Venkatasubramanian V. Automating HAZOP analysis of batch chemical plants: Part II. Algorithms and application. Comput Chem Eng. 1998;22(9):1357–1370.

43. Vaidhyanathan R, Venkatasubramanian V. Digraph-based models for automated HAZOP analysis. Reliab Eng Syst Saf. 1995;50(1):33–49.

44. Vaidhyanathan R, Venkatasubramanian V. A semi-quantitative reasoning methodology for filtering and ranking HAZOP results in HAZOPExpert. Reliab Eng Syst Saf. 1996;53(2):185–203.

45. Rakoff JS. The financial crisis: why have no high-level executives been prosecuted? The New York Review of Books; 2014. Access date: February 23, 2016. [Updated January 9, 2014. Available from: http://www.nybooks.com/articles/2014/01/09/financial-crisis-why-no-executive-prosecutions/.]

46. Eilperin J, Higham S. How the Minerals Management Service's partnership with industry led to failure. The Washington Post. 2010. Available at: http://www.washingtonpost.com/wp-dyn/content/article/2010/08/24/AR2010082406754.html.

47. Jefferson T. United States Declaration of Independence. archives.gov; 1776. Access date: February 23, 2016. [Available from: http://www.archives.gov/exhibits/charters/declaration_transcript.html.]

48. Blinder A. Donald Blankenship sentenced to a year in prison in mine safety case. The New York Times; 2016. Access date: April 23, 2016. [Updated April 6, 2016. Available from: http://www.nytimes.com/2016/04/07/us/donald-blankenship-sentenced-to-a-year-in-prison-in-mine-safety-case.html?_r=0.]

49. Steinzor R. Why Not Jail?: Industrial Catastrophes, Corporate Malfeasance, and Government Inaction. New York: Cambridge University Press, 2014. ISBN 1316194884.

50. Presidential Commission. Deep water: the Gulf oil disaster and the future of offshore drilling. Report; National Commission on the BP Deepwater Horizon Oil Spill and Offshore Drilling, Washington. 2011.

51. Browning JB. Union Carbide: disaster at Bhopal. In: Managing under Siege. Detroit, MI: Union Carbide Corporation, 1993:1–15.

52. Investigation of the Challenger accident. Report; Committee on Science and Technology, House of Representatives, Washington. 1986.

53. Cullen WD. The public inquiry into the Piper Alpha disaster. Report 0046-0702, London. 1993.

54. WHO. SARS: how a global epidemic was stopped. Report. Geneva. 2006. Available at: http://www.tandfonline.com/doi/abs/10.1080/17441690903061389.

55. CAIB. Columbia accident investigation board report. Report; Columbia Accident Investigation Board, Washington. 2003. Available at: http://www.slac.stanford.edu/spires/find/books?irn=317624.

56. Task Force. Final report on the August 14, 2003 blackout in the United States and Canada. Report; U.S.-Canada Power System Outage Task Force. 2004.

57. McAteer JD, Beall K, Beck J, McGinley P. Upper Big Branch: the April 5, 2010, explosion: a failure of basic coal mine safety practices. Report to the Governor; Governor's Independent Investigation Panel, West Virginia. 2011.

58. Bonnefoy P. Poor safety standards led to Chilean mine disaster. GlobalPost; 2010. Access date: February 23, 2016. [Updated August 29, 2010. Available from: http://www.globalpost.com/dispatch/chile/100828/mine-safety.]

59. Kurokawa K, Ishibashi K, Oshima K, Sakiyama H, Sakurai M, Tanaka K, Tanaka M, Nomura S, Hachisuka R, Yokoyama Y. The official report of the Fukushima Nuclear Accident Independent Investigation Commission. Report; The Fukushima Nuclear Accident Independent Investigation Commission, Japan. 2012.

60. CERC. Report on the grid disturbance on 30th July 2012 and grid disturbance on 31st July 2012. Report, India; 2012.

61. Blackburn R. The subprime crisis. New Left Review. 2008;50:63.

62. Jickling M. Containing financial crisis. Report; Congressional Research Service. 2011.

63. ThinkReliability. The cause map of the Northeast Blackout of 2003. Houston. 2008. URL: http://www.thinkreliability.com/Instructor-Blogs/Blog%20-%20NE%20Blackout.pdf.

64. Schumer CE, Maloney CB. The subprime lending crisis: the economic impact on wealth, property values and tax revenues, and how we got here. 2007. Available at: www.jec.senate.gov/Documents/Reports/10.25.07OctoberSubprimeReport.pdf.

Manuscript received Feb. 26, 2016, and revision received Apr. 30, 2016.
