The term “black swan” has achieved mainstream status since the publication of Nassim Nicholas Taleb’s highly successful book, The Black Swan. Taleb uses the colorful phrase to describe highly unlikely and yet highly consequential events that surprise and astonish us. He addresses a major limitation of EUT by asking, “What is the definition of risk when Pr(failure) approaches zero and consequence approaches infinity?” After all, the product of zero times infinity is mathematically undefined. What does it mean? [The terrorist attack of 9/11 is an example: its likelihood approached zero and its consequence approached infinity. It seems to render EUT useless].


To understand the risk of black swan events we must understand yet another probability – the so-called exceedence probability, EP. EP(C) is the probability that an event occurs whose consequence equals or exceeds a certain value, C. In other words, probability is a function of consequence. We drop the condition of independence between V and C assumed by the PRA and EUT models. EP(C) is typically obtained from historical evidence, but it is unrelated to the Bayesian probability theory discussed previously.

To get risk, EP(C) is substituted in place of TV, so that risk is defined as the Probable Maximum Loss, PML:

R = PML = EP(C)C

Unlike simple PRA probability, exceedence probability is the sum of the point probabilities of all events whose consequences equal or exceed C. Exceedence is an a posteriori estimate obtained from historical data or simulations. In other words, the shape of EP(C) is determined by observation and measurement, not mathematical permutations. For example, the Richter scale for measuring earthquake magnitude, the flood level scale for measuring the severity of floods, and methods for measuring the likelihood and intensity of hurricanes are all based on historical measurement of past earthquakes, floods, and hurricanes. EP curves are well known for natural disasters such as floods and earthquakes.
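To make the construction concrete, here is a minimal Python sketch that estimates an empirical exceedence probability from historical consequence data and multiplies it by C to obtain the PML risk curve defined above. The loss values are hypothetical, chosen only for illustration; in practice the record would come from catalogs of past floods, earthquakes, or attacks.

```python
import numpy as np

# Hypothetical historical consequences (losses) from past incidents,
# in arbitrary units: illustrative only, not real data.
losses = np.array([1, 2, 2, 3, 3, 3, 5, 8, 13, 40, 120, 900], dtype=float)

# Empirical exceedence probability: the fraction of observed events whose
# consequence equals or exceeds C.
def exceedence_probability(C, history):
    return float(np.mean(history >= C))

# PML risk as defined in the text: R = EP(C) * C.
for C in np.unique(losses):
    ep = exceedence_probability(C, losses)
    print(f"C = {C:7.1f}   EP(C) = {ep:5.3f}   R = EP(C)*C = {ep * C:7.2f}")
```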

Exceedence probability curves are typically long-tailed distributions as shown in Figure 4.

Most incidents are small (the left side of the curve, where exceedence probability is high), while a few are medium-sized (the middle part of the distribution). Only the rare black swans associated with the “100-year flood” or “once-in-a-lifetime” event appear at the extreme right end of the long tail, where exceedence probability is nearly zero. Black swans are extreme events where probability approaches zero and consequences approach infinity – the mysterious undefined region.

Figure 4 includes two curves for PML risk. In the left plot, the dotted line rises as consequence increases. This represents a high-risk hazard. In the right plot, the dashed line quickly reaches a peak and then declines with increasing consequence. This represents a low-risk hazard, because it declines and eventually reaches zero. The shape of these curves is completely determined by the slope of the exceedence probability curve. In turn, the slope determines the thickness and length of the curve’s tail. High-risk hazards have long-tailed exceedence probability curves, and low-risk hazards have short-tailed curves. The EP curve contains a lot of useful information on both likelihood and risk. More importantly, it solves the black swan risk puzzle.

Here is the answer to the puzzle. High-risk hazards approach infinite risk as consequence increases; low-risk hazards approach zero risk. There is no ambiguity – hazards are either high- or low-risk, depending on the slope (exponent) of their exceedence probability distributions. The risk of one approaches zero as probability approaches zero and consequence approaches infinity, while the risk of the other approaches infinity with plunging probability and escalating consequence. PML risk answers a long-unanswered question about black swan events: some are more “risky” than others.

If we want to reduce the impact of hazards such as hurricanes and terrorism, the goal must be to shift the hazard’s exceedence probability from high to low risk. This is the same as bending the exceedence curve downward (increasing its slope or exponent) to make the tail shorter and thinner. A variety of prevention and response techniques can be used to achieve this. Generally speaking, the more resilient a building, power grid, or rail network, the shorter and thinner its exceedence probability tail. Thus the slope of the exceedence curve is also a measure of the target’s resilience.

An exceedence exponent quantifies resilience. If the exponent is greater than 1.0, the tail is short, corresponding to a low-risk hazard and a resilient system. Conversely, if the exponent is less than 1.0, the tail is long, corresponding to a high-risk hazard and a fragile system. Black swan events may still surprise and astonish us, but this characterization of resiliency suggests a strategy for dealing with them.
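As a rough numerical illustration, suppose the tail of the exceedence curve follows a power law, EP(C) = C^(-q), an assumption made here only for illustration. Then PML risk is R(C) = EP(C)C = C^(1 - q), which shrinks toward zero when q is greater than 1 and grows without bound when q is less than 1. The short Python sketch below makes the contrast visible:

```python
import numpy as np

# Assume a power-law (long-tailed) exceedence probability, EP(C) = C**(-q),
# so that PML risk is R(C) = EP(C)*C = C**(1 - q).  Illustration only.
consequences = np.array([1e1, 1e2, 1e3, 1e4, 1e5])

for q, label in [(1.5, "exponent q > 1: short tail, low-risk"),
                 (0.5, "exponent q < 1: long tail, high-risk")]:
    print(label)
    for C in consequences:
        risk = C ** (1.0 - q)
        print(f"  C = {C:9.0f}   R = EP(C)*C = {risk:12.4f}")
```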

Exceedence probability curves are based on a posteriori probability. We obtain them by observing or simulating hazardous events. While they are used to predict natural hazards, they can also be used to predict terrorist events. For example, the exceedence probability of al Qaeda-caused fatalities follows a long-tailed exceedence probability curve, and its risk curves indicate a low-risk hazard (see Table I).

PRA and other single-asset models are limited in how much they can tell us about risk. Paramount among the limitations of single-asset risk assessment methods is their inability to model interdependencies and network effects prominent in complex systems such as the Internet, power grids, telecommunications, transportation, supply chain systems, and the spread of diseases through human populations. In other words, a more expressive method is required to assess entire systems – not merely isolated assets.

System-level methods are critically important because systems-of-systems with interdependencies abound throughout modern society. Telecom networks are linked to power grids, and power grids are linked to water supplies and transportation networks. These highly interconnected systems are often linked together so tightly that a failure in one quickly spreads to the others, resulting in major cascade failures.

For example, the 2003 Eastern Power Grid Blackout started as a single-asset mishap, but quickly cascaded throughout the Northeastern US and Canada, leaving 55 million people in the dark. Without power, telecommunications, transportation, and energy systems could not function. Similarly, Internet viruses can disrupt web sites and potentially banks and power plants. In fact, it is not clear how fragile such critical systems are to cascade failures that spread like a contagion across assets. The fact that they are interconnected systems complicates risk assessment, because risk is more than the sum of all component risks.

It is impossible to assess this kind of system risk with single-asset methodologies such as PRA, MSRAM, and others. For this reason, a network model is used to analyze complex systems such as the Internet or power grid. Network modeling is actually a very simple way to represent such interconnected systems. Nodes can be anything – people, buildings, power plants, police stations, etc. Links can also be anything – transmission lines, roads, and relationships among people. When links tie components together, something unusual happens – a failure in one node or link can spread to other nodes and links. As it turns out, “interconnectedness” acts as a force-multiplier, which increases risk more than linearly. [Doubling the size of these networks can more than double the risk].
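One way to build intuition for the force-multiplier effect is to count the node-to-node paths along which a failure could spread. The toy calculation below is an illustrative assumption, not a result from the text: it simply assumes any node can affect any other, so the number of node pairs grows roughly with the square of the number of nodes, and doubling the network’s size more than doubles the opportunities for cascades.

```python
# Count the node-to-node pairs along which a cascade could travel,
# assuming (for illustration only) that any node can affect any other.
for n in (10, 20, 40, 80):
    pairs = n * (n - 1) // 2
    print(f"n = {n:3d} nodes -> {pairs:5d} potential cascade pairs")
```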

Figure 5 illustrates the simplicity and power of network analysis. The Boston mass transit system, the MBTA, is a vast network of stations, tunnels, railways, and bridges. These assets are tied together in such a way that a failure in one part can spread its effects to other parts. That is, it is prone to cascade failures, strictly because of the network’s topology (wiring diagram). Congestion in one part of the MBTA network can affect other parts in non-linear ways. This means that the topology of such networks is as important to estimating the risk of collapse as the fragility of individual assets! We need stronger tools to better understand the relationship between asset risk and system risk.

Charles Perrow (1925-2019) was an early pioneer of catastrophe theory in coupled or connected systems. His theory of normal accidents (NAT) sought to understand why and how seemingly simple mistakes or errors propagate through a system, causing its ultimate collapse. According to Perrow, small discrepancies grow and magnify like a row of collapsing dominoes when components of a nuclear power plant, telecommunications system, deep-water oil rig, or power grid are linked together. The probabilities of failure are not independent. Rather, they are conditional, and they magnify as a collapse propagates through the complex system.

Consider flooding along the Mississippi River, which occurs with a frequency defined by a long-tailed exceedence probability (see Table I).

Rising water in the north increases the probability of collapse of levees and dams downstream. Collapse of downstream levees increases the probability of flooding in New Orleans, and flooding in New Orleans threatens the port of New Orleans. The Louisiana Offshore Oil Port, LOOP, may also be closed due to flooding, which impacts the global oil supply chain. This supply chain is so fragile that disruption of LOOP likely causes an increase in gasoline prices, which puts pressure on the economy. At each step of this chain of normal accidents, conditional probabilities change. And according to Bayes’ Theorem, conditional probabilities tend to increase as precursor conditions become certainties.
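A minimal sketch of this kind of conditional escalation, using entirely hypothetical probabilities for each stage of the Mississippi chain, shows how the likelihood of the final event (a gasoline price spike) jumps as each precursor becomes a certainty:

```python
# Hypothetical conditional probabilities for each stage of the chain,
# chosen only for illustration: P(stage k+1 occurs | stage k occurred).
chain = [
    0.30,  # levee/dam collapse downstream, given rising water upstream
    0.50,  # flooding in New Orleans, given levee collapse
    0.60,  # port / LOOP disruption, given flooding in New Orleans
    0.70,  # gasoline price spike, given LOOP disruption
]

def prob_final_event(stages_already_certain):
    """Probability of the last event in the chain, treating the first
    `stages_already_certain` stages as having already occurred."""
    p = 1.0
    for cond_p in chain[stages_already_certain:]:
        p *= cond_p
    return p

for k in range(len(chain) + 1):
    print(f"{k} precursor(s) certain -> "
          f"P(gasoline price spike) = {prob_final_event(k):.3f}")
```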

Perrow’s normal accident theory states the obvious: big incidents often start with small incidents that cascade and increase in intensity as the collapse propagates from asset to asset. But a less obvious result of normal accident theory is that system risk is quite different from simple or expected utility risk, because of interconnectivity. A profound but simple reality is that interconnectivity is a force multiplier that magnifies risk simply because a system is connected. That is, topology is another factor, besides T, V, and C, that defines risk. To accurately estimate system risk, it is imperative that topology be included, and this is where the theory of self-organization becomes useful.

The principal feature of complex networked systems that distinguishes them from single assets is their connectivity, or topology. Consider the sample network shown in Figure 5.

What is the impact of a failure in one node (a station or building) on the other nodes connected to the faulty node by links (rail lines, tunnels)? If each node has an associated risk defined by TVC, how is risk defined for the entire network system?

As a first approximation, network risk can be defined as the sum total of risks across all nodes in the network. This is classical expected utility theory: R = r1 + r2 +…+rn, for a network with n nodes. For simplicity, let ri = TiViCi. Unfortunately, this simple definition of system risk yields almost no insight into how or why an entire system might fail, because it ignores the effects of cascades. Instead of measuring the effect of a discrepancy or incident on one node, we want to measure the effect of a discrepancy or incident at one node on all other nodes.
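A minimal sketch of this first approximation, using hypothetical T, V, and C values for three nodes, shows how simple the expected-utility sum is:

```python
# Hypothetical threat, vulnerability, and consequence values for three
# nodes (illustration only).  Classical expected-utility network risk is
# just the sum of the single-asset risks r_i = T_i * V_i * C_i.
nodes = [
    (0.2, 0.5, 100.0),   # (T, V, C) for node 1
    (0.1, 0.9,  50.0),   # node 2
    (0.4, 0.3, 200.0),   # node 3
]

system_risk = sum(T * V * C for T, V, C in nodes)
print(f"Expected-utility system risk = {system_risk:.1f}")   # 10 + 4.5 + 24 = 38.5
```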

The conundrum is that individual asset risk, ri, is inadequate to describe system-wide risk, because the sum over all nodes overstates the risk of a network that only partially fails most of the time. Expected utility theory falls short as a model of the risk potential of network cascades, because it is too static and ignores topology.

Per Bak (1948-2002) discovered a key property of complex networked systems in the 1980s that showed how to model network risk. He named it self-organized criticality, but it is also an explanation of why risk is magnified by topology. Bak observed that nearly all complex systems organize themselves to adapt to their environment, and as a side effect, they become more fragile. He illustrated the impact of self-organization using a metaphor – the sand pile. As grains of sand accumulate in a pile, the likelihood that some part of the pile will falter and collapse steadily increases until the sand pile reaches a critical state. Collapse is inevitable, but the timing and size of the inevitable collapse are unpredictable. In fact, the size of sand pile landslides obeys a long-tailed exceedence probability. Hence, Bak’s sand pile has become a fitting metaphor for complex systems that behave unexpectedly even though they appear to be simple.
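The sand pile metaphor is easy to simulate. The sketch below is a minimal Bak-Tang-Wiesenfeld sand pile on a small grid; the grid size, number of grains, and toppling threshold of four are conventional modeling choices, not values taken from the text. Dropping grains one at a time produces mostly tiny avalanches punctuated by occasional very large ones, the signature of a long-tailed exceedence distribution.

```python
import random
from collections import Counter

# A minimal Bak-Tang-Wiesenfeld sand pile.  A cell topples when it holds
# 4 or more grains, sending one grain to each neighbor; grains that fall
# off the edge of the grid are lost.
N = 20
grid = [[0] * N for _ in range(N)]

def drop_grain():
    """Drop one grain at a random cell and return the avalanche size
    (the number of topplings it triggers)."""
    i, j = random.randrange(N), random.randrange(N)
    grid[i][j] += 1
    avalanche = 0
    unstable = [(i, j)]
    while unstable:
        x, y = unstable.pop()
        if grid[x][y] < 4:
            continue
        grid[x][y] -= 4
        avalanche += 1
        if grid[x][y] >= 4:              # still unstable, topple again later
            unstable.append((x, y))
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < N and 0 <= ny < N:
                grid[nx][ny] += 1
                unstable.append((nx, ny))
    return avalanche

sizes = Counter(drop_grain() for _ in range(50_000))
print("smallest avalanche sizes and their counts (most drops cause little):")
for size in sorted(sizes)[:10]:
    print(f"  size {size:3d}: {sizes[size]}")
print("largest avalanche observed:", max(sizes))
```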

Lewis explains self-organization in great detail in Bak’s Sand Pile:

 

Complex networks evolve over time and tend to transform (re-wire) themselves, evolving from random interconnections to structured interconnections that increase the likelihood of cascade failures. In general, the transformation increases the overall number of links (percolation), the number of links connecting a favored node (hub), and the number of paths running through a favored node or link (betweenness). As a consequence, the number of nodes impacted by a failure in one node is magnified by percolation, hub size, and betweenness.

As these networks evolve, they increase the self-organized criticality of the network itself, which increases its overall risk of cascade failure – both the likelihood and size of the collapse in terms of affected nodes and links.

For example, the power grid has been evolving over many decades because of regulation and economics. Its topology contains hubs and betweeners, and it is known that the betweeners contribute to potential collapses simply because they increase self-organized criticality. Telecommunications infrastructures are shaped by the Internet and regulation. Self-organized criticality has emerged since the 1996 Telecommunications Act, resulting in telecommunication hubs called Telecom Hotels. Water supply networks, mass transit networks, and air transportation networks all respond to socio-economic and regulatory forces that tend to create hubs and betweener nodes and links.

Percolation, increases in hub size, and increases in betweenness are known contributors to self-organized criticality. There may be others, but these three are enough to greatly increase system risk. Self-organization typically reduces redundancy and improves the efficiency of a system, but it also drives the system towards its tipping point. Bak called the inevitable collapse punctuation, and noted that complex systems repeat a cycle of increasing self-organization followed by collapse, followed by chaotic shock waves, and then long periods of calm. He called this punctuated equilibrium, because complex systems oscillate between stable and unstable states with periods of equilibrium in between.

For the purposes of risk assessment, it is now clear how to define risk within a network. Network or system risk is directly related to self-organized criticality, but how? Model-Based Risk Analysis (MBRA) defines network risk as the expected utility across all nodes and links, weighted by hub size and/or betweenness:

System Risk = w1T1V1C1 + w2T2V2C2 + … + wnTnVnCn

In this formulation, both nodes and links are assigned a weight, wi, and PRA parameters Ti, Vi, and Ci. The weight is obtained by analyzing network topology, e.g., the number of links connecting each node to other nodes, and the number of paths passing through every node and link from and to other nodes. MBRA is computer software that performs these calculations automatically so analysts do not have to. MBRA produces risk and exceedence probability curves that are shaped by a network’s topology.
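The sketch below illustrates the idea of topology-weighted risk in the spirit of the formula above; it is not MBRA itself, and the station names, T/V/C values, and the use of betweenness centrality as the weight wi are assumptions made only for illustration.

```python
import networkx as nx

# A sketch of topology-weighted network risk (not the MBRA software).
# Station names and T/V/C values are hypothetical.
G = nx.Graph()
G.add_edges_from([
    ("station_A", "station_B"), ("station_B", "station_C"),
    ("station_B", "station_D"), ("station_D", "station_E"),
    ("station_C", "station_E"),
])

# Hypothetical single-asset PRA parameters (T, V, C) for each node.
tvc = {
    "station_A": (0.2, 0.5, 100.0),
    "station_B": (0.3, 0.4, 150.0),
    "station_C": (0.1, 0.6,  80.0),
    "station_D": (0.4, 0.3, 120.0),
    "station_E": (0.2, 0.2,  60.0),
}

weights = nx.betweenness_centrality(G)       # "betweenness" of each node
unweighted = sum(T * V * C for T, V, C in tvc.values())
weighted = sum(weights[n] * T * V * C for n, (T, V, C) in tvc.items())

print(f"Unweighted (expected-utility) risk: {unweighted:6.1f}")
print(f"Topology-weighted system risk:      {weighted:6.1f}")
```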

So the first property of complex networked systems leading to greater risk is Bak’s self-organization. We also know that self-organization is caused by optimization of system performance. Highly connected nodes are economically more efficient than isolated nodes, and highly utilized links are more cost-effective than lightly used links. Modern society rewards efficiency and optimization of resources, but these optimizations increase the fragility of these systems. And fragility leads to greater risk.

Bak wrote extensively about the causes of self-organized criticality in complex systems. Generally, it is attributed to a lack of surge capacity, reduction of redundancy, and optimization for greater efficiency. Many of our critical infrastructure systems are in a state of self-organized criticality because they are highly optimized and efficient. Of course operators like efficiency and optimal profitability, but according to Bak, efficiency and optimization are the causes of fragility. If resiliency is the desired goal, then less efficient systems will be required.