PtechHub
The evolution of chaos engineering: From chaos monkey at Netflix to reliability management in the AI era

By CIO Dive
March 17, 2026


I remember the early days of distributed systems, when outages felt unavoidable, emergent, messy, and almost impossible to reason about. Watching services fail in production made one thing clear: if we wanted reliable systems, we had to practice failing intentionally and learn from it.

The idea didn’t appear overnight. I was an engineer and call leader at Amazon 15 years ago, where my team was responsible for the uptime and performance of the Amazon retail website. For every minute the website went down, we were losing tens of thousands of dollars in revenue. My team built a suite of fault injection tools to help prepare for big moments like Black Friday that were later adopted across the entire organization. 

At the same time, Netflix launched its open source tool Chaos Monkey, which really pushed the idea of Chaos Engineering forward. While still a cutting-edge practice at the time, purposefully terminating servers in production revealed brittle assumptions and made Netflix's engineering organization not only more resilient, but one of the best in the world.

When I joined Netflix, I worked on their second generation of fault injection tooling that was less about randomly shutting down servers and more about running specific, hypothesis-driven experiments that contained the blast radius and maximized learnings. This is where we started talking about flu shot analogies. Leading engineering organizations around the world were starting to embrace the idea that thoughtfully injecting harm into their systems in order to methodically and systemically build up a resilience to it was a worthwhile endeavor.

Working for years at Amazon and Netflix gave me a front row seat to how these reliability efforts were critical to the success of both of those digital businesses. I had a strong feeling that every company would need them in the future, so I decided to launch Gremlin in 2016. My goal was straightforward: make chaos engineering safe, secure, and simple for every engineering team. Early practitioners saw the value of fault injection, but it was scattered, ad-hoc, and often confined to a few elite organizations. Gremlin set out to change that.

Safe tooling mattered because you can’t scale a practice that risks customer experience, and realistically, most engineering organizations didn’t have the experience to run these experiments confidently. In response, we built controls like blast-radius limits, safeguards, and straightforward rollbacks so teams could experiment safely. The Gremlin agent can automatically stop and roll back tests, health checks can monitor systems for unexpected behaviors, and any team member can press a halt button to instantly stop a running test. This lowered the barrier to entry and let more teams validate assumptions in production and strengthen their systems without unnecessary risk.
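In outline, those safeguards look something like the sketch below. The class, method, and parameter names are hypothetical illustrations of the pattern (blast-radius limit, health check, instant halt, rollback), not Gremlin's actual API:

```python
class SafeFaultInjection:
    """Hypothetical sketch of a fault-injection run wrapped in safety controls."""

    def __init__(self, target_hosts, blast_radius=0.05, error_rate_ceiling=0.02):
        # Blast-radius limit: only touch a small fraction of the fleet.
        count = max(1, int(len(target_hosts) * blast_radius))
        self.targets = target_hosts[:count]
        self.error_rate_ceiling = error_rate_ceiling
        self.halted = False

    def health_check(self, current_error_rate):
        # Safeguard: halt automatically on unexpected system behavior.
        if current_error_rate > self.error_rate_ceiling:
            self.halt()

    def halt(self):
        # The "halt button": anyone (or an automated check) stops the test instantly.
        self.halted = True
        self.rollback()

    def rollback(self):
        # Undo the injected fault on every targeted host.
        for host in self.targets:
            pass  # e.g. remove injected latency, restore terminated processes
```

The key design choice is that every unsafe path converges on `halt()`, so an experiment can never outlive the moment it starts harming real traffic.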

Methodology mattered just as much. Chaos engineering is empirical: form a hypothesis about steady state, design controlled experiments to challenge that hypothesis, observe impact, and iterate. Gremlin worked to codify that flow into playbooks and tutorials so that teams could move beyond random fault injection to hypothesis-driven experiments. Teaching teams how to define steady-state metrics, craft meaningful attacks, and interpret results helped make chaos engineering a reproducible engineering practice rather than an occasional curiosity.
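That empirical loop can be sketched in a few lines. The metric names and thresholds below are hypothetical examples of a steady-state hypothesis, not a prescribed standard:

```python
def steady_state_ok(metrics, max_p99_ms=300, min_success_rate=0.999):
    """Hypothesis about steady state: latency and success rate stay within bounds."""
    return (metrics["p99_latency_ms"] <= max_p99_ms
            and metrics["success_rate"] >= min_success_rate)

def run_experiment(inject_fault, measure, remove_fault):
    """Hypothesis-driven chaos experiment: verify steady state, inject, observe, clean up."""
    baseline = measure()
    if not steady_state_ok(baseline):
        return "skipped: system not in steady state"
    inject_fault()              # e.g. add 100ms of latency to a dependency
    try:
        during = measure()      # observe impact while the fault is active
    finally:
        remove_fault()          # always clean up, even if measurement fails
    return "hypothesis held" if steady_state_ok(during) else "weakness found"
```

Either outcome is a learning: "hypothesis held" builds confidence in the steady-state claim, while "weakness found" is a failure mode discovered on your terms rather than a customer's.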

 

Integration was the third pillar. For chaos engineering to be useful, it had to live in the same workflows teams already used: CI/CD pipelines, observability stacks, and incident review processes. We built integrations and examples showing how to correlate attacks with monitoring, schedule experiments, and leverage Gremlin as part of postmortem validation. Today, teams can set up intelligent health checks in one click, automate experiments with our API, and even connect Gremlin to their favorite AI model. That made chaos engineering less of a separate activity and more a part of normal reliability work.

So we’d successfully figured out how to make Chaos Engineering an effective, safe, and replicable practice. But that’s where we ran into a snag. By its nature, Chaos Engineering is about exploration and discovery, which makes it perfect for answering individual questions like, “What happens if 100ms of latency is added to our dependency chain?” Answering those is vital for individual engineers, but the business worries about a different question: “Is our system reliable enough to be there for our customers?”

That’s where we needed to move beyond Chaos Engineering into managing reliability across the entire organization in a way that could be measured, tracked, and reported. We started with the same fault injection techniques that Chaos Engineering had proven effective, then layered in a more systematic, scalable approach. We replaced manual experiments with automated collections of tests, or test suites, designed to run safely on a regular basis. The results are combined into a reliability score that can be tracked over time to show the reliability of systems across the company and prove the impact of reliability efforts.
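As a minimal sketch, such a score could be aggregated from suite results like this. The unweighted formula is a hypothetical illustration, not Gremlin's actual scoring method:

```python
def reliability_score(suite_results):
    """Combine automated test-suite results into a single 0-100 score.

    suite_results: list of (tests_passed, tests_total) tuples, one per suite.
    Hypothetical unweighted formula; a real system might weight suites by
    criticality or recency.
    """
    passed = sum(p for p, _ in suite_results)
    total = sum(t for _, t in suite_results)
    return round(100 * passed / total, 1) if total else 0.0
```

Recomputed after every scheduled run, a score like this turns individual pass/fail results into a trend line that can be tracked per system and reported across the organization.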

Now we had a metric that didn’t just show when your system went down, but could predict future reliability and be used to prevent outages. This is the dream of Chaos Engineering finally realized. Domain experts can still use targeted experiments to uncover new failure modes, while known failure modes can be tested across an entire system to verify resilience, uncover risks, and give teams true reliability visibility with predictive metrics.

And it’s a good thing, because we now find ourselves at one of the most dramatic shifts in the history of software. AI is accelerating code generation and deployment at an unprecedented pace, and while AI-generated code is great for velocity, it’s not without its risks. It has been shown to increase bugs, config drift, and other incident-causing issues. That doesn’t mean we should all stop using AI, but it does necessitate better guardrails and testing.

Now more than ever, the proactive reliability testing and management that started with Chaos Engineering is critical. Engineering organizations simply can’t afford to hope for resilience and only discover failures when they cause a costly, painful outage. They need predictive reliability testing and metrics so they can systematically measure, manage, and improve reliability.

 






Copyright © 2025 | Powered By Porpholio
