Resilience as a Design Principle, Not an Afterthought

Connecting regulatory confidence work to operational resilience requirements

On 19 July 2024, a single faulty configuration update from CrowdStrike crashed approximately 8.5 million Windows devices worldwide. Airlines grounded flights. Hospitals cancelled procedures. Payment systems froze. The worldwide financial damage ran into tens of billions.

The incident wasn’t a cyberattack. It was a routine software update that bypassed adequate testing. And it exposed a truth that regulators had been warning about for years:

Resilience cannot be bolted on after the architecture is built. It must be woven into the design from the first line on the whiteboard.

In How Architecture Supports Regulatory Confidence Without Slowing Delivery, I explored how enterprise architecture can satisfy regulatory demands without becoming a bottleneck. This post extends that argument into the specific domain of operational resilience, examining how the regulatory landscape has shifted, what it demands of architects and how to embed resilience as a first-class design principle rather than a compliance afterthought.

The Regulatory Landscape Has Changed

We are no longer in an era where regulators ask “do you have a disaster recovery plan?” and accept a dusty binder as evidence. The regulatory expectation has fundamentally shifted from preventing failure to tolerating disruption.

The Basel Committee on Banking Supervision codified this shift in March 2021 with its Principles for Operational Resilience, defining operational resilience as:

“A bank’s ability to deliver critical operations in the face of disruption. This ability should enable a bank to identify and protect itself from threats and potential failures.” Source: Basel Committee, Principles for Operational Resilience

The UK’s Financial Conduct Authority (FCA) and Prudential Regulation Authority (PRA) translated this into binding requirements that came into full force on 31 March 2025, requiring firms to demonstrate they can remain within their impact tolerances for each important business service during severe but plausible disruptions.

The EU’s Digital Operational Resilience Act (DORA, Regulation (EU) 2022/2554) became applicable on 17 January 2025 and represents what FD Capital calls “the most comprehensive single piece of operational resilience regulation in the world.” DORA mandates ICT risk management frameworks, incident reporting, resilience testing, and third-party risk management across the entire financial sector.

And the trajectory continues: on 18 March 2026, the FCA and PRA published final policy statements introducing a unified operational incident and third-party reporting regime, effective from 18 March 2027.

The message is unambiguous:

Regulators expect resilience to be architecturally embedded, continuously tested and demonstrably governed.

Why “Resilience as an Afterthought” Fails

The traditional approach treats resilience as a non-functional requirement: something captured in a requirements document, delegated to infrastructure teams, and validated (if at all) through annual DR tests. This approach fails for three reasons:

1. It conflates recovery with resilience.

Disaster recovery asks: “How quickly can we restore service after failure?” Operational resilience asks a fundamentally different question: “How do we continue delivering service through disruption?” As the FCA’s operational resilience framework makes clear:

“Operational resilience focuses on tolerability, how much disruption a specific business service can sustain before causing intolerable harm to consumers or market integrity and what the firm must do to stay within that tolerance across the full range of severe but plausible scenarios.” Source: FCA’s operational resilience framework

Recovery is a subset of resilience. Architecture that only plans for recovery has already accepted a period of total failure.

2. It treats resilience as a property of components, not of services.

Regulators don’t ask whether your database cluster is highly available. They ask whether your payment processing service can continue operating within defined tolerances when that database cluster fails. This is a fundamentally different framing, one that requires architectural thinking across the full service delivery chain, not just infrastructure redundancy at the component level.

3. It assumes failure modes are predictable.

The CrowdStrike incident wasn’t in anyone’s risk register as a specific scenario. A security vendor’s routine update causing global infrastructure collapse wasn’t a “severe but plausible” scenario that most firms had tested against. Yet the FCA explicitly cited it as a lesson in why firms must think beyond their own boundaries when mapping dependencies and testing resilience.

From Resilience Engineering to Architecture Principles

The academic field of resilience engineering offers a crucial reframe. Unlike traditional safety approaches that focus on preventing specific known hazards, resilience engineering examines “a more general capability of systems to deal with hazards that were not previously known before they were encountered.”

Sidney Dekker, one of the founding scientists behind resilience engineering, developed the concept of Safety Differently in 2012:

“Safety Differently sees safety not as the absence of negative events but as the presence of positive capacities in people, teams and processes that make things go well.” Source: Sidney Dekker, Safety Differently

Translated into architectural terms: resilience isn’t the absence of failure. It’s the presence of architectural capacities that allow services to continue functioning when components inevitably fail.

Nassim Nicholas Taleb pushes this further with the concept of antifragility, which describes systems that don’t merely survive stress but actually improve because of it:

“Antifragility is fundamentally different from the concepts of resiliency (the ability to recover from failure) and robustness (the ability to resist failure).” Source: Antifragile: Things That Gain from Disorder

For enterprise architects, this creates a spectrum of design ambition:

Most organisations aspire to robust. Regulators now demand resilient. The best architects design for antifragile.

Five Architecture Principles for Resilience by Design

Drawing from regulatory requirements, resilience engineering theory, and practical architectural patterns, here are five principles that embed resilience into the fabric of enterprise architecture.

Principle 1: Design for Graceful Degradation, Not Binary Availability

Traditional availability thinking is binary: the service is either up or down. Resilient architecture recognises a spectrum of service delivery, from full capability through progressively degraded modes to complete unavailability.

The architectural pattern:

This maps directly to the regulatory concept of impact tolerances. The FCA has been clear that “time-based measures alone are insufficient”. Firms must define what degraded service looks like and at what point degradation becomes intolerable harm.

Architectural implementation:

Define service levels for each important business service (full, degraded, minimal, unavailable)
Design explicit fallback paths between levels
Implement circuit breakers that trigger graceful degradation rather than cascading failure
Test each degradation level independently
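The four implementation steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pattern: the ServiceLevel names, the CircuitBreaker class, the serve function and the balance handlers are all hypothetical, and the breaker deliberately omits the half-open and timed-reset behaviour a real one would need.

```python
from enum import IntEnum
from typing import Callable, List, Tuple


class ServiceLevel(IntEnum):
    """Hypothetical service levels, ordered from best to worst."""
    FULL = 0
    DEGRADED = 1
    MINIMAL = 2
    UNAVAILABLE = 3


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures so callers skip the path."""

    def __init__(self, max_failures: int = 3) -> None:
        self.max_failures = max_failures
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.max_failures

    def call(self, fn: Callable[[], str]) -> str:
        if self.is_open:
            raise RuntimeError("circuit open")  # fail fast, do not touch the dependency
        try:
            result = fn()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the consecutive-failure count
        return result


Path = Tuple[ServiceLevel, CircuitBreaker, Callable[[], str]]


def serve(paths: List[Path]) -> Tuple[ServiceLevel, str]:
    """Try each path from best to worst; return the first level that succeeds."""
    for level, breaker, handler in paths:
        try:
            return level, breaker.call(handler)
        except Exception:
            continue  # explicit fallback to the next, more degraded path
    return ServiceLevel.UNAVAILABLE, "service unavailable"


def live_balance() -> str:
    raise ConnectionError("core ledger unreachable")  # simulated outage


def cached_balance() -> str:
    return "balance as of last sync"


live_breaker = CircuitBreaker(max_failures=2)
paths = [
    (ServiceLevel.FULL, live_breaker, live_balance),
    (ServiceLevel.DEGRADED, CircuitBreaker(), cached_balance),
]

for _ in range(3):
    level, response = serve(paths)
    print(level.name, "-", response)
```

The point of the sketch is that the degraded path is a designed outcome with its own name, not an accident of error handling. Because each level has its own handler and breaker, each can be exercised in isolation, which is what testing each degradation level independently requires.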

Principle 2: Map Dependencies as Attack Surfaces for Disruption

The CrowdStrike incident demonstrated that your most dangerous dependencies may not be the ones you think about most. A security tool (ostensibly protecting your systems) became the vector for their failure.

The Basel Committee’s principles require a bank to “identify and protect itself from threats and potential failures” across the full operational chain. The UK framework requires mapping “the people, processes, technology, facilities and information supporting important business services.”

The architectural approach:

For each important business service, map dependencies across six dimensions:

Technology – Infrastructure, platforms, applications, data stores
Third parties – Vendors, cloud providers, SaaS platforms, data feeds
People – Skills, knowledge, availability, single points of expertise
Processes – Manual steps, approvals, handoffs, escalation paths
Facilities – Physical locations, network connectivity, power
Information – Data sources, reference data, configuration, secrets

For each dependency, ask: “If this dependency becomes unavailable for [impact tolerance duration], can the service continue within tolerance?”

If the answer is no, you’ve found an architectural vulnerability that requires design intervention, not just a risk register entry.
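One way to make that question mechanical is to record, for every mapped dependency, how long the service could keep running within tolerance without it, and compare that figure with the impact tolerance. The sketch below is illustrative only: the Dependency fields, the vulnerabilities function, the four-hour tolerance and the example dependencies are assumptions of mine, not values taken from any regulator or real service.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass(frozen=True)
class Dependency:
    name: str
    dimension: str                # one of the six dimensions listed above
    survivable_outage: timedelta  # how long the service stays within tolerance without it


def vulnerabilities(deps: List[Dependency], impact_tolerance: timedelta) -> List[Dependency]:
    """Return dependencies whose loss would breach tolerance before the window ends."""
    return [d for d in deps if d.survivable_outage < impact_tolerance]


# Hypothetical map for a payments service with a four-hour impact tolerance.
tolerance = timedelta(hours=4)
payment_deps = [
    Dependency("core ledger database", "Technology", timedelta(0)),
    Dependency("card scheme gateway", "Third parties", timedelta(hours=6)),
    Dependency("fraud rules analyst", "People", timedelta(hours=1)),
]

for dep in vulnerabilities(payment_deps, tolerance):
    print(f"{dep.dimension}: {dep.name} needs design intervention")
```

A survivable outage of zero means there is no fallback at all. Anything the function returns is a candidate for redesign rather than a line in the risk register.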

This connects directly to the STORMWATCH method I developed for risk identification. The Dependencies dimension in STORMWATCH becomes particularly critical when viewed through the lens of operational resilience. Every dependency is a potential disruption vector.

Principle 3: Isolate Blast Radius Through Architectural Boundaries

The bulkhead pattern (borrowed from naval architecture where watertight compartments prevent a single hull breach from sinking the entire ship) is the foundational resilience pattern for enterprise systems.

“Bulkhead pattern is a design pattern used in software architecture for isolating parts of an application into pools or compartments so that failure of one component will not cascade to other components.” Source: Wikipedia, Bulkhead Pattern

Architectural implementation:

At the enterprise level, this means:

Service isolation – Important business services should not share failure domains
Resource partitioning – Dedicated capacity for critical paths, not shared pools
Dependency isolation – A failing third party should only affect services that directly depend on it
Data isolation – Corruption or unavailability in one data domain shouldn’t cascade
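Within a single process, resource partitioning can be sketched with one bounded semaphore per compartment. This is a minimal illustration, assuming hypothetical Bulkhead and BulkheadFull names and made-up capacity figures. At enterprise scale the compartments would more likely be separate clusters, accounts or connection pools.

```python
import threading
from contextlib import contextmanager


class BulkheadFull(Exception):
    """Raised when a compartment has no free capacity."""


class Bulkhead:
    """Dedicated, bounded capacity for one service (a hypothetical helper)."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Reject immediately instead of queueing, so a saturated compartment
        # cannot tie up callers that belong to other compartments.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(f"{self.name} compartment is full")
        try:
            yield
        finally:
            self._slots.release()


payments = Bulkhead("payments", max_concurrent=2)
reporting = Bulkhead("reporting", max_concurrent=1)

with reporting.slot():                 # reporting is now saturated
    try:
        with reporting.slot():
            reporting_rejected = False
    except BulkheadFull:
        reporting_rejected = True      # further reporting work is shed
    with payments.slot():
        payments_served = True         # payments capacity is untouched

print(reporting_rejected, payments_served)
```

The design choice worth noting is that capacity is never borrowed across compartments. That costs some utilisation in normal operation, which is the price of keeping a failure in one service from consuming another's capacity.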

The FCA’s findings one year on praised firms that demonstrated “documented review cycles under which firms reassess their important business services and impact tolerances annually or following material changes to the business.”

Architectural boundaries make this reassessment tractable. You can reason about resilience service by service rather than across an undifferentiated monolith.

Principle 4: Test Through Controlled Failure, Not Theoretical Scenarios

Netflix pioneered chaos engineering in 2011 with Chaos Monkey, a tool that randomly terminates production instances to verify that services survive component failure. The principle is counterintuitive but powerful:

“Resilience grows when systems experience controlled failure rather than quiet comfort.” Source: Digital Digest, Netflix Chaos Monkey

The regulatory framework demands something similar. The UK operational resilience rules require firms to conduct scenario testing against “severe but plausible” disruptions. Not just tabletop exercises, but tests that verify the architecture actually behaves as designed under stress.

The architectural testing hierarchy:

Level	Method	What It Proves
1	Component failover testing	Individual components recover
2	Dependency removal testing	Services degrade gracefully when dependencies fail
3	Chaos engineering	System-wide resilience under random failure
4	Scenario-based testing	End-to-end service delivery within impact tolerances
5	Cross-boundary testing	Resilience when third parties fail

The FCA’s CrowdStrike lessons emphasised that firms must test scenarios that originate outside their control boundary. Architecture that hasn’t been tested against third-party failure isn’t resilient. It’s merely hopeful.

Key insight: Testing isn’t a phase that happens after architecture is complete. The testability of resilience must be designed in. If you can’t inject failure at a specific point in your architecture, you can’t verify resilience at that point. Observability, controllability, and isolation are prerequisites for resilience testing and therefore prerequisites for resilience itself.
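To make "designed-in testability" concrete, here is a minimal sketch of a named injection point placed at a third-party boundary. Everything in it is hypothetical: the FaultInjector class, the "fx-feed" point, the quote function and the rates are illustrative stand-ins, and a real system would gate injection behind environment and access controls.

```python
from typing import Dict


class FaultInjector:
    """Registry of named injection points that a test can switch on and off."""

    def __init__(self) -> None:
        self._active: Dict[str, Exception] = {}

    def inject(self, point: str, error: Exception) -> None:
        self._active[point] = error

    def clear(self, point: str) -> None:
        self._active.pop(point, None)

    def check(self, point: str) -> None:
        """Called by production code at the boundary; raises if a fault is active."""
        error = self._active.get(point)
        if error is not None:
            raise error


faults = FaultInjector()
LAST_KNOWN_RATE = {"GBPUSD": 1.25}  # illustrative cached value


def fetch_fx_rate(pair: str) -> float:
    faults.check("fx-feed")  # injection point at the third-party boundary
    return 1.27              # stand-in for a real network call


def quote(pair: str) -> dict:
    """Serve a live rate, or a clearly flagged stale rate if the feed fails."""
    try:
        return {"rate": fetch_fx_rate(pair), "stale": False}
    except ConnectionError:
        return {"rate": LAST_KNOWN_RATE[pair], "stale": True}


normal = quote("GBPUSD")
faults.inject("fx-feed", ConnectionError("feed down"))
degraded = quote("GBPUSD")   # the fallback path is exercised, not assumed
faults.clear("fx-feed")
recovered = quote("GBPUSD")

print(normal, degraded, recovered)
```

The injection point gives controllability, the "stale" flag gives observability, and the boundary function gives isolation. Without the first of these, the fallback in quote could only be reviewed, never verified.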

Principle 5: Govern Resilience as a Continuous Property, Not a Point-in-Time Assessment

The FCA’s April 2026 findings made clear that operational resilience is not a compliance milestone to be achieved and forgotten:

“Rather than seeking to eliminate disruption altogether, the regime is designed to ensure that firms can continue to deliver their most important business services, or to resume them within acceptable timeframes, even in the event of severe but plausible disruptions.” Source: FCA

This demands architectural governance that treats resilience as a living property, one that can degrade silently as systems evolve, dependencies change and new services are introduced.

Governance mechanisms:

Architecture Decision Records (ADRs) – explicitly capturing resilience trade-offs for every significant decision
Resilience debt tracking – analogous to technical debt, but specifically measuring where architectural changes have degraded resilience posture
Continuous dependency mapping – automated discovery of new dependencies introduced through deployment
Impact tolerance validation – regular verification that stated tolerances remain achievable given current architecture
Change impact assessment – every architectural change evaluated for its effect on resilience of important business services
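As one example of what continuous dependency mapping could look like in practice, the sketch below compares the dependencies a governance board has reviewed with those observed in a deployment and reports the drift. The function name, the service names and the idea that observed dependencies arrive as a simple mapping are all assumptions for illustration. How dependencies are actually discovered (traces, service mesh data, manifests) is outside its scope.

```python
from typing import Dict, Set


def unreviewed_dependencies(
    approved: Dict[str, Set[str]],
    observed: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Return, per service, dependencies seen in deployment but never reviewed."""
    drift: Dict[str, Set[str]] = {}
    for service, deps in observed.items():
        new = deps - approved.get(service, set())
        if new:
            drift[service] = new
    return drift


# Hypothetical data: what was reviewed versus what the latest deployment uses.
approved = {"payments": {"core-ledger", "card-gateway"}}
observed = {
    "payments": {"core-ledger", "card-gateway", "geo-ip-saas"},
    "onboarding": {"id-verification"},
}

drift = unreviewed_dependencies(approved, observed)
for service, deps in sorted(drift.items()):
    print(f"{service}: unreviewed dependencies {sorted(deps)}")
```

Run on every deployment, a check like this turns silent resilience drift into a visible governance event that can trigger a change impact assessment or a new ADR.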

As I explored in Designing for Optionality in Enterprise Architecture, the best architectures preserve future choices. Resilience governance ensures that today’s changes don’t inadvertently close off tomorrow’s recovery options.

The Architect’s Role: From Compliance to Confidence

The shift from “resilience as compliance” to “resilience as design principle” fundamentally changes the enterprise architect’s role. We move from:

Documenting resilience requirements → Designing resilient architectures
Reviewing DR plans → Architecting graceful degradation
Cataloguing dependencies → Isolating blast radius
Scheduling annual tests → Embedding continuous verification
Reporting compliance status → Governing resilience as a living property

This is the connection to regulatory confidence. When resilience is architecturally embedded (when the design itself enforces isolation, enables degradation and supports testing) regulatory confidence becomes a natural byproduct rather than a separate workstream.

You don’t need to prove resilience through documentation alone because the architecture demonstrates it through its structure and behaviour.

Connecting the Threads

The regulatory landscape (Basel Committee principles, UK FCA/PRA requirements, EU DORA and the forthcoming incident reporting regime) converges on a single architectural truth:

Systems that treat resilience as an afterthought will fail both their users and their regulators.

The CrowdStrike incident of July 2024 was a global demonstration of what happens when centralisation and homogeneity create systemic fragility. As cybersecurity expert Ciaran Martin observed, it was “a very, very uncomfortable illustration of the fragility of the world’s core internet infrastructure.”

But fragility is a design choice. So is resilience.

The five principles outlined here (graceful degradation, dependency mapping, blast radius isolation, controlled failure testing and continuous governance) provide an architectural framework for making resilience a first-class design concern. Not a checkbox. Not a compliance exercise. A fundamental property of how we build systems.

As Taleb reminds us, the opposite of fragile isn’t robust, it’s antifragile. The best architectures don’t merely survive disruption. They learn from it, adapt to it and emerge stronger. That’s not just good engineering. In today’s regulatory environment, it’s a requirement.

References and Further Reading

Principles for Operational Resilience – Bank for International Settlements, Basel Committee on Banking Supervision
Principles for Operational Resilience – Executive Summary – Basel Committee on Banking
Operational Resilience – Financial Conduct Authority
CrowdStrike Outage: Lessons for Operational Resilience – Financial Conduct Authority
UK Operational Resilience Rules – FCA’s Findings One Year On – Covington & Burling
UK Regulators Finalise Unified Operational Incident and Third Party Reporting Regime – Covington & Burling
New Operational Incident Reporting Rules for Banks and CRR Firms – TLT Solicitors
FCA Publishes Insights and Observations on Operational Resilience a Year On – HFW
Digital Operational Resilience Act (DORA) – European Securities and Markets Authority
DORA: Digital Operational Resilience Act UK Guide – FD Capital
Operational Resilience: The Complete UK Guide – FD Capital
Operational Resilience: CrowdStrike and Beyond – Clifford Chance
CrowdStrike Outage: A Lesson in Operational Resilience – Grant Thornton
UK Regulators Finalise Policy on Operational Incident and Third-Party Reporting – Global Financial Regulation Blog
Resilience Engineering – Wikipedia
Sidney Dekker – Wikipedia
Antifragility – Wikipedia
Antifragile: Things That Gain from Disorder – Taleb, N.N.
Chaos Engineering – Wikipedia
Netflix Chaos Monkey: An Idea That Reshaped Modern Reliability – Digital Digest
Bulkhead Pattern – Wikipedia
Resiliency in the IBM Well-Architected Framework – IBM
Reliability Design Principles – Azure Well-Architected Framework – Microsoft
Important Business Services Mapping: Step-by-Step Guide – Risk Publishing
PRA Operational Resilience: What Firms Must Do Now – NAQ Cyber
CrowdStrike-related IT Outages – Wikipedia

STORMWATCH: Evolving a Risk Tool for the Age of AI and Cloud

Designing for Optionality in Enterprise Architecture

How Architecture Supports Regulatory Confidence Without Slowing Delivery
