PtechHub
The evolution of chaos engineering: From chaos monkey at Netflix to reliability management in the AI era

By CIO Dive
March 17, 2026


I remember the early days of distributed systems, when outages felt unavoidable, emergent, messy, and almost impossible to reason about. Watching services fail in production made one thing clear: if we wanted reliable systems, we had to practice failing intentionally and learn from it.

The idea didn’t appear overnight. I was an engineer and call leader at Amazon 15 years ago, where my team was responsible for the uptime and performance of the Amazon retail website. For every minute the website went down, we were losing tens of thousands of dollars in revenue. My team built a suite of fault injection tools to help prepare for big moments like Black Friday that were later adopted across the entire organization. 

At the same time, Netflix launched its open source tool Chaos Monkey, which really pushed the idea of Chaos Engineering forward. While still a cutting-edge practice at the time, purposefully terminating servers in production revealed brittle assumptions and made Netflix's engineering organization not only more resilient, but one of the best in the world.

When I joined Netflix, I worked on their second generation of fault injection tooling that was less about randomly shutting down servers and more about running specific, hypothesis-driven experiments that contained the blast radius and maximized learnings. This is where we started talking about flu shot analogies. Leading engineering organizations around the world were starting to embrace the idea that thoughtfully injecting harm into their systems in order to methodically and systemically build up a resilience to it was a worthwhile endeavor.

Working for years at Amazon and Netflix gave me a front row seat to how these reliability efforts were critical to the success of both of those digital businesses. I had a strong feeling that every company would need them in the future, so I decided to launch Gremlin in 2016. My goal was straightforward: make chaos engineering safe, secure, and simple for every engineering team. Early practitioners saw the value of fault injection, but it was scattered, ad-hoc, and often confined to a few elite organizations. Gremlin set out to change that.

Safe tooling mattered because you can’t scale a practice that risks customer experience, and realistically, most engineering organizations didn’t have the experience to run these experiments confidently. In response, we built controls like blast-radius limits, safeguards, and straightforward rollbacks so teams could experiment safely. The Gremlin agent can automatically stop and roll back tests, health checks can monitor systems for unexpected behaviors, and any team member can press a halt button to instantly stop a running test. This lowered the barrier to entry and let more teams validate assumptions in production and strengthen their systems without unnecessary risk.
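In outline, those safeguards look something like the sketch below. The class, method, and parameter names are hypothetical illustrations of the pattern (blast-radius limit, health check, instant halt, rollback), not Gremlin's actual API:

```python
class SafeFaultInjection:
    """Hypothetical sketch of a fault-injection run wrapped in safety controls."""

    def __init__(self, target_hosts, blast_radius=0.05, error_rate_ceiling=0.02):
        # Blast-radius limit: only touch a small fraction of the fleet.
        count = max(1, int(len(target_hosts) * blast_radius))
        self.targets = target_hosts[:count]
        self.error_rate_ceiling = error_rate_ceiling
        self.halted = False

    def health_check(self, current_error_rate):
        # Safeguard: halt automatically on unexpected system behavior.
        if current_error_rate > self.error_rate_ceiling:
            self.halt()

    def halt(self):
        # The "halt button": anyone (or an automated check) stops the test instantly.
        self.halted = True
        self.rollback()

    def rollback(self):
        # Undo the injected fault on every targeted host.
        for host in self.targets:
            pass  # e.g. remove injected latency, restore terminated processes
```

The key design choice is that every unsafe path converges on `halt()`, so an experiment can never outlive the moment it starts harming real traffic.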

Methodology mattered just as much. Chaos engineering is empirical: form a hypothesis about steady state, design controlled experiments to challenge that hypothesis, observe impact, and iterate. Gremlin worked to codify that flow into playbooks and tutorials so that teams could move beyond random fault injection to hypothesis-driven experiments. Teaching teams how to define steady-state metrics, craft meaningful attacks, and interpret results helped make chaos engineering a reproducible engineering practice rather than an occasional curiosity.
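That empirical loop can be sketched in a few lines. The metric names and thresholds below are hypothetical examples of a steady-state hypothesis, not a prescribed standard:

```python
def steady_state_ok(metrics, max_p99_ms=300, min_success_rate=0.999):
    """Hypothesis about steady state: latency and success rate stay within bounds."""
    return (metrics["p99_latency_ms"] <= max_p99_ms
            and metrics["success_rate"] >= min_success_rate)

def run_experiment(inject_fault, measure, remove_fault):
    """Hypothesis-driven chaos experiment: verify steady state, inject, observe, clean up."""
    baseline = measure()
    if not steady_state_ok(baseline):
        return "skipped: system not in steady state"
    inject_fault()              # e.g. add 100ms of latency to a dependency
    try:
        during = measure()      # observe impact while the fault is active
    finally:
        remove_fault()          # always clean up, even if measurement fails
    return "hypothesis held" if steady_state_ok(during) else "weakness found"
```

Either outcome is a learning: "hypothesis held" builds confidence in the steady-state claim, while "weakness found" is a failure mode discovered on your terms rather than a customer's.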

 

Integration was the third pillar. For chaos engineering to be useful, it had to live in the same workflows teams already used: CI/CD pipelines, observability stacks, and incident review processes. We built integrations and examples showing how to correlate attacks with monitoring, schedule experiments, and leverage Gremlin as part of postmortem validation. Today, teams can set up intelligent health checks in one click, automate experiments with our API, and even connect Gremlin to their favorite AI model. That made chaos engineering less of a separate activity and more a part of normal reliability work.

So we’d successfully figured out how to make Chaos Engineering an effective, safe, and replicable practice. But that’s where we ran into a snag. By its nature, Chaos Engineering is about exploration and discovery, which makes it perfect for answering individual questions like, “What happens if 100ms of latency is added to our dependency chain?” Answering those is vital for individual engineers, but the business worries about a different question: “Is our system reliable enough to be there for our customers?”

That’s where we needed to move beyond Chaos Engineering into managing reliability across the entire organization in a way that could be measured, tracked, and reported. We started with the same fault injection techniques that Chaos Engineering had proven effective, then layered in a more systematic, scalable approach. We replaced manual experiments with automated collections of tests, or test suites, designed to run safely on a regular basis. The results are combined into a reliability score that can be tracked over time to show the reliability of systems across the company and prove the impact of reliability efforts.
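As a minimal sketch, such a score could be aggregated from suite results like this. The unweighted formula is a hypothetical illustration, not Gremlin's actual scoring method:

```python
def reliability_score(suite_results):
    """Combine automated test-suite results into a single 0-100 score.

    suite_results: list of (tests_passed, tests_total) tuples, one per suite.
    Hypothetical unweighted formula; a real system might weight suites by
    criticality or recency.
    """
    passed = sum(p for p, _ in suite_results)
    total = sum(t for _, t in suite_results)
    return round(100 * passed / total, 1) if total else 0.0
```

Recomputed after every scheduled run, a score like this turns individual pass/fail results into a trend line that can be tracked per system and reported across the organization.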

Now we had a metric that didn’t just show when your system went down, but could predict future reliability and be used to prevent outages. This is the dream of Chaos Engineering finally realized. Domain experts can still use targeted experiments to uncover new failure modes, while known failure modes can be tested across an entire system to verify resilience, uncover risks, and give teams true reliability visibility with predictive metrics.

And it’s a good thing, because we now find ourselves at one of the most dramatic shifts in the history of software. AI is accelerating code generation and deployment at an unprecedented pace, and while AI-generated code is great for velocity, it’s not without its risks. It has been shown to increase bugs, config drift, and other incident-causing issues. That doesn’t mean we should all stop using AI, but it does necessitate better guardrails and testing.

Now more than ever, the proactive reliability testing and management that started with Chaos Engineering is critical. Engineering organizations simply can’t afford to hope for resilience and only discover failures when they cause a costly, painful outage. They need predictive reliability testing and metrics so they can systematically measure, manage, and improve reliability.

 






Copyright © 2025 | Powered By Porpholio
