When I was growing up, I remember hearing this phrase from my father: "Proper Planning and Preparation Prevents Poor Performance" (I dropped one of the original 7 Ps for this purpose, but the intent is still the same[1]). It is an old adage dating back to WWII and carried on by organizations on both sides of the "pond." It is a brash but effective saying that aligns with our "do things the hard way" approach to AI governance and building robust, performant, resilient modeling systems.
With the CrowdStrike IT incident of 2024, this generation’s Y2K finally happened[17]. Ahead of the year 2000, teams anticipated and averted worldwide computer failures at the hands of the millennium bug. CrowdStrike and the systems in its path, however, didn’t have the benefit of foresight or avoidance. Instead, by deploying a single threat detection configuration file update, they caused a massive collapse of IT infrastructure worldwide, making us realize how fragile our automated, computerized society has become[2].
Although unrelated to AI, the incident made us think about its parallels to AI systems and how we’ve yet to have our CrowdStrike-scale AI fiasco. We[9] and others[10] believe that such an event is a matter of “when”, not “if”. This is the first blog post in a series, building on our systems engineering discussions[11], about how to build and govern robust, performant, resilient systems the right way. Our goal is to help insulate your company from AI-related disasters, as well as to help you beat the odds and create successful AI systems[12].
To begin, let's examine three intertwining concepts that are at the core of performant systems: Robustness, Resiliency, and the gold standard of Antifragility.
NIST defines Robustness, as it pertains to AI, as: “The ability of an information assurance (IA) entity to operate correctly and reliably across a wide range of operational conditions, and to fail gracefully outside of that operational range.”[3]
Robustness is a very important but often overlooked component of building modeling systems. Partly due to the Kaggle-driven training of data scientists, generalization (learning from data in a way that carries over to new data, as discussed in Chapter 8, “Generalization,” of Christoph Molnar and Timo Freiesleben’s in-progress book Supervised Machine Learning for Science[4]) is often a weak spot of AI systems: models are optimized to perform well on their training data but not well enough to hold up on unseen data. The goal should be to introduce noise into the modeling system and to build models that can handle less-than-ideal conditions rather than eke out the last 1% of benchmark performance. To specifically target model robustness, introduce adversarial examples, distribution shifts, and tests of how the system performs on unseen or difficult data. The fundamentals matter: careful input selection and extensive testing make our systems more robust.
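As a minimal sketch of what such a check can look like, the snippet below compares a model's accuracy on clean test data against noise-perturbed copies of the same data. The model, dataset, and noise levels are illustrative assumptions, not a prescription for any particular system.

```python
# Minimal robustness check: compare accuracy on clean vs. noise-perturbed inputs.
# Model choice, dataset, and noise scales below are illustrative assumptions only.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
baseline = accuracy_score(y_test, model.predict(X_test))
print(f"clean accuracy: {baseline:.3f}")

rng = np.random.default_rng(0)
for noise_scale in (0.05, 0.10, 0.25):  # fraction of each feature's standard deviation
    X_noisy = X_test + rng.normal(0, noise_scale * X_test.std(axis=0), X_test.shape)
    noisy_acc = accuracy_score(y_test, model.predict(X_noisy))
    print(f"noise {noise_scale:.2f}: accuracy {noisy_acc:.3f} (drop {baseline - noisy_acc:.3f})")
```

A large accuracy drop under small perturbations is a signal that the model has been tuned to its training distribution rather than to the conditions it will actually face.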
Resiliency is the ability of a system to absorb adverse stimuli without being destroyed and to return to its pre-event state[5]. As a concept in engineering, resiliency dates back to Thomas Tredgold’s 1818 publication On the Transverse Strength and Resilience of Timber[6]. It is a critical component of civil engineering, recently under particular focus for critical infrastructure[13,14] such as the power grid, where engineers attempt to create systems that can weather storms and return to their initial strength.
As an example of excellent resiliency engineering, the Akashi Kaikyo bridge in Japan has dampeners to act as a counterbalance to earthquake tremors and high winds[16]. It was successfully tested in 1995 when it withstood the 6.9 magnitude Great Hanshin earthquake. Do you build your AI systems with dampeners?
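To make the analogy concrete in software terms, a “dampener” for a prediction service might be a simple fallback wrapper: if the model errors out or returns a low-confidence prediction, the system degrades gracefully instead of failing outright. The class name, confidence threshold, and fallback behavior below are assumptions chosen for this sketch, not recommended settings.

```python
# Illustrative "dampener" for a prediction service: absorb model failures and
# low-confidence outputs instead of letting them break the larger system.
# The threshold and fallback value are hypothetical, for illustration only.
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class DampenedPredictor:
    predict_proba: Callable[[Sequence[float]], Sequence[float]]  # model's class-probability output
    fallback: int = 0            # safe default class when the model can't be trusted
    min_confidence: float = 0.7  # below this, defer to the fallback

    def predict(self, features: Sequence[float]) -> int:
        try:
            probs = self.predict_proba(features)
        except Exception:
            # Model failure: absorb the shock and return the safe default.
            return self.fallback
        best = max(range(len(probs)), key=lambda i: probs[i])
        return best if probs[best] >= self.min_confidence else self.fallback
```

A production version would also log every fallback so the failure modes can be studied later, which feeds directly into the antifragility discussed below.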
As with many engineering concepts, a job well done isn’t notable, which creates incentives to cut corners: robustness and resiliency work, testing, and planning are easy components to leave out. It is when resiliency and robustness are lacking that these concepts come to a head, as in the tragic 2011 Fukushima nuclear disaster[15].
Antifragility is a characteristic of a system that becomes more robust when placed under stress[7]. Unlike resiliency, which we define as the ability to absorb damaging inputs without breaking, antifragility is the ability of a system to improve from challenging stimuli.
As a simple example, the human body is an antifragile system up to a point. If you stress it with exercise and appropriate recovery, it will generally adapt to get stronger.
Antifragility is the gold standard of engineering: systems that are not only robust (operate under a wide range of conditions) and resilient (survive adverse stimuli without long-term effects), but also able to adapt to adversarial scenarios. As in life, we want our AI systems to thrive in adversity and learn from past experiences to become stronger. A key question we need to ask ourselves: if we are not actively building our AI systems to be antifragile, why are we using AI systems at all?
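One hedged sketch of what that can look like in practice: capture the cases where the deployed model fails or is least confident, and fold them back into the next training cycle so the system improves from the stress it encounters. The file path, helper names, and confidence threshold below are hypothetical placeholders.

```python
# Sketch of an antifragile feedback loop: hard or failed cases are recorded in
# production and added back into the next training run. Function names, the
# storage path, and the threshold are hypothetical, for illustration only.
import json
from pathlib import Path

HARD_CASES = Path("hard_cases.jsonl")  # assumed append-only store of difficult examples

def record_hard_case(features, prediction, outcome, confidence, threshold=0.7):
    """Log cases where the model was wrong or unsure, for later retraining."""
    if prediction != outcome or confidence < threshold:
        with HARD_CASES.open("a") as f:
            f.write(json.dumps({"features": features,
                                "label": outcome,
                                "confidence": confidence}) + "\n")

def retrain_with_hard_cases(base_X, base_y, fit_model):
    """Retrain on the original data plus every accumulated hard case."""
    extra = [json.loads(line) for line in HARD_CASES.open()] if HARD_CASES.exists() else []
    X = list(base_X) + [case["features"] for case in extra]
    y = list(base_y) + [case["label"] for case in extra]
    return fit_model(X, y)  # e.g., an estimator's fit routine wrapped in a function
```

The point is not this particular mechanism but the loop itself: stress on the system becomes training signal rather than silent damage.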
Bringing this back to CrowdStrike: there was a major failure point in the larger IT system, and the principles of Robustness, Resiliency, and Antifragility were not applied at either the CrowdStrike or individual company level. The purpose of this article is not to point fingers or speculate on root causes, but to encourage us as an AI community to hold our systems to a higher standard. Spend some time evaluating how you govern, build, and validate your systems. Keeping these three principles in mind helps us build robust, resilient, antifragile, and lasting, consequential systems.
Stay tuned for Part 2 on proper governance policies and processes and Part 3 on validation processes rooted in the three principles discussed in this post.
For more information about AI incidents and preparedness, listen to the full podcast Preparing AI for the unexpected: Lessons from recent IT incidents.