Introducing a Baby Chaos Monkey for Our Microservices

In a microservice system, the only real way to know how resilient you are is to break things on purpose and watch what happens. That’s the idea behind chaos engineering. Netflix’s Chaos Monkey is the famous example: it randomly terminates service instances in production so the team finds out early whether the system can take it.

Killing live services is overkill for most teams, especially outside production, but the underlying idea is worth borrowing. So we added a small piece of middleware to one of our services: a manually triggered, route-level failure tool that acts like a baby Chaos Monkey.

Our infrastructure is a web of interdependent microservices. A small change in one can break something two or three services away, especially where the dependencies aren’t obvious. This middleware gives us a lightweight, controlled way to simulate partial outages in development and staging, without the risk or randomness of full-blown chaos engineering.

How it works:

The middleware can be configured via a dedicated route.
It only runs in development and staging environments.
It lets you specify which routes should fail, for how long, and with what status code.
The configuration resets automatically after at most 5 minutes, even if a longer duration was requested.
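
To make the two guardrails concrete (the environment check and the 5-minute ceiling), here is a rough TypeScript sketch. It is not the actual middleware code: the names ALLOWED_ENVS, MAX_DURATION_MS, chaosAllowed and expiresAt are made up for illustration, and it assumes the requested duration is expressed in minutes.

```typescript
// Illustrative names only; none of these come from the real service.
const ALLOWED_ENVS: readonly string[] = ["development", "staging"];
const MAX_DURATION_MS = 5 * 60 * 1000; // the 5-minute ceiling

// True only for environments where the failure tool is allowed to run.
function chaosAllowed(env: string | undefined): boolean {
  return env !== undefined && ALLOWED_ENVS.indexOf(env) !== -1;
}

// Timestamp (ms since epoch) at which the configuration should reset.
// Assumption: the requested duration is given in minutes.
function expiresAt(nowMs: number, requestedMinutes: number): number {
  const requestedMs = Math.max(0, requestedMinutes) * 60 * 1000;
  return nowMs + Math.min(requestedMs, MAX_DURATION_MS);
}
```

Passing the environment name and the current time in as arguments, rather than reading them inside the functions, keeps both checks easy to unit test.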

The config route accepts the following options:

{
  "forceFail": true,
  "forceFailStatus": 503,
  "forceFailDuration": 3,
  "forceFailRouteMatch": "/api/v1/payments"
}
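
Since this payload decides which routes start failing, it is worth validating before it is stored. The sketch below shows one way that could look in TypeScript. The field names match the options above, but the function parseForceFailConfig and its rules are assumptions for illustration: it only accepts 4xx and 5xx status codes, a positive duration, and a non-empty route match.

```typescript
interface ForceFailConfig {
  forceFail: boolean;
  forceFailStatus: number;
  forceFailDuration: number;
  forceFailRouteMatch: string;
}

// Returns a typed config, or a message describing the first problem found.
// The validation rules here are assumptions, not the service's actual rules.
function parseForceFailConfig(body: unknown): ForceFailConfig | string {
  if (typeof body !== "object" || body === null) {
    return "body must be a JSON object";
  }
  const { forceFail, forceFailStatus, forceFailDuration, forceFailRouteMatch } =
    body as Record<string, unknown>;

  if (typeof forceFail !== "boolean") {
    return "forceFail must be a boolean";
  }
  if (
    typeof forceFailStatus !== "number" ||
    Math.floor(forceFailStatus) !== forceFailStatus ||
    forceFailStatus < 400 ||
    forceFailStatus > 599
  ) {
    return "forceFailStatus must be an integer between 400 and 599";
  }
  if (
    typeof forceFailDuration !== "number" ||
    !isFinite(forceFailDuration) ||
    forceFailDuration <= 0
  ) {
    return "forceFailDuration must be a positive number";
  }
  // An empty string is contained in every URL, so it would fail all routes.
  if (typeof forceFailRouteMatch !== "string" || forceFailRouteMatch.length === 0) {
    return "forceFailRouteMatch must be a non-empty string";
  }
  return { forceFail, forceFailStatus, forceFailDuration, forceFailRouteMatch };
}
```

Rejecting an empty forceFailRouteMatch matters with substring matching, because an empty string would match every request.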

Once active, the middleware checks each request against the configured forceFailRouteMatch. If the request URL contains that string, it immediately returns the configured forceFailStatus and a response body explaining that the failure was intentional.
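
The per-request check is small enough to sketch as a pure function. This is a simplified TypeScript illustration rather than the real middleware: the ActiveFailure and ForcedResponse shapes, the checkForcedFailure name, and the wording of the response body are all assumptions.

```typescript
// Hypothetical shape of the currently active failure configuration.
interface ActiveFailure {
  status: number;       // the configured forceFailStatus
  routeMatch: string;   // the configured forceFailRouteMatch
  expiresAtMs: number;  // when the configuration resets (ms since epoch)
}

interface ForcedResponse {
  status: number;
  body: { error: string; forced: true };
}

// Returns the response to send if this request should fail, otherwise null
// so the request continues to the normal handler.
function checkForcedFailure(
  active: ActiveFailure | null,
  url: string,
  nowMs: number
): ForcedResponse | null {
  if (active === null) return null;               // nothing configured
  if (nowMs >= active.expiresAtMs) return null;   // configuration has expired
  if (url.indexOf(active.routeMatch) === -1) return null; // URL does not contain the match string
  return {
    status: active.status,
    body: {
      error: "Intentional failure injected by the force-fail middleware",
      forced: true,
    },
  };
}
```

In a web framework, the middleware would call a function like this first and, on a non-null result, send that status and body instead of passing the request on. Because the match is a plain substring check, "/api/v1/payments" also catches URLs such as "/api/v1/payments/42".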

It’s a simple trick that’s earned its keep. Faking a partial outage lets us watch how the other services react, check whether our error handling does what we think it does, and find the places where we’re missing a retry, a circuit breaker, or a clearer alert.