Disaster recovery exercises have a reputation for being painful, high-risk, and disruptive. Ours weren't — and that was by design.
The constraint that shaped everything
We couldn't impact real customers during testing. That single constraint forced every decision: synthetic traffic, feature flags, shadow environments, and careful sequencing across 15+ components.
What 100% success actually means
It doesn't mean nothing went wrong. It means every failure was caught by our runbooks before it propagated. The playbooks were the product of dozens of dry-run sessions with 7+ teams — that investment paid off on the day.
AWS regions aren't symmetric
The biggest lesson: don't assume your secondary region is a mirror image of primary. Configuration drift is invisible until you need to fail over. We built drift detection into our standard checks.