There is no denying that in the last few years, technologies like Docker and Kubernetes, to name a few of the most relevant, have revolutionized how we reason about software development and deployment.
But whilst the fast pace of the software development industry pushes developers to adopt the most recent technologies, it is important to take a step back and take a closer look at the established patterns that underpin parts of these technologies.
The circuit breaker pattern is one of those patterns, widely adopted in microservices architectures. We are going to compare the pros and cons of implementing it with two different approaches: Hystrix and Istio.
The Core Issue With Synchronous Communication In Microservices
Imagine a very simple Microservice architecture consisting of:
- A back-end service
- A front-end service
Let’s assume the back-end and the front-end communicate via synchronous HTTP calls.
Clients C1 and C2 call the front-end to retrieve some information.
Since the front-end doesn’t have all the required data, it calls the back-end to get the missing pieces.
But because of network communication, a lot of things can happen:
- A network failure between the front-end and the back-end
- The back-end can be down because of a bug
- A service the back-end depends on (e.g. database) can be down
And as per Murphy’s law (“Anything that can go wrong will go wrong”), communication between the front-end and the back-end will fail sooner or later.
If we look into the life cycle of a single call from the front-end to the back-end and consider that the back-end is down for whatever reason, at some point the front-end will cancel the call with a timeout.
Zooming out to the application level, multiple clients call the front-end at the same time, which translates to multiple calls to the back-end: the front-end will soon be flooded with requests and will drown in timeouts.
The only sane solution in this scenario is to fail fast: the front-end should be made aware that something has gone wrong on the back-end’s side, and return a failure to its own clients immediately.
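Timeouts on the calling side are the usual first line of defense. As a minimal sketch, assuming the front-end calls the back-end through Spring’s RestTemplate (the class name and timeout values below are arbitrary), a client-side timeout could be configured like this:

```java
import org.springframework.http.client.SimpleClientHttpRequestFactory;
import org.springframework.web.client.RestTemplate;

public class BackendClientConfig {

    public RestTemplate backendRestTemplate() {
        SimpleClientHttpRequestFactory factory = new SimpleClientHttpRequestFactory();
        factory.setConnectTimeout(1_000); // give up connecting after 1 second
        factory.setReadTimeout(2_000);    // give up waiting for a response after 2 seconds
        return new RestTemplate(factory);
    }
}
```

A timeout only bounds how long each individual call hangs; it does nothing to stop new calls from piling up, which is exactly what the circuit breaker pattern addresses.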
The Circuit-breaker Pattern
In the domain of electrical circuitry, a circuit breaker is an automatically operated electrical switch designed to protect an electrical circuit. Its basic function is to interrupt current flow after a fault is detected. It can then be reset (either manually or automatically) to resume normal operation once the fault is resolved.
This looks pretty similar to our issue: to protect our application from an excess of requests, it’s better to interrupt communication between the front-end and the back-end as soon as a recurring fault has been detected in the back-end.
In his book Release It, Michael Nygard uses this analogy and makes a case for a design pattern applied to the timeout issue above. The process flow behind it is pretty simple:
- If a call fails, increment the number of failed calls by one
- If the number of failed calls goes above a certain threshold, open the circuit
- If the circuit is open, immediately return with an error or a default response
- If the circuit is open and some time has passed, half-open the circuit
- If the circuit is half-open and the next call fails, open it again
- If the circuit is half-open and the next call succeeds, close it
This can be summed up in the following diagram:
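In addition to the diagram, the flow can be sketched in code. Here is a minimal, illustrative state machine in plain Java; the class name, the failure threshold, and the open duration are all assumptions made for the example, not taken from any particular library:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

public class SimpleCircuitBreaker<T> {

    private enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // failed calls before the circuit opens
    private final Duration openDuration;  // how long the circuit stays open

    private State state = State.CLOSED;
    private int failureCount = 0;
    private Instant openedAt;

    public SimpleCircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    public T call(Supplier<T> protectedCall, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (Duration.between(openedAt, Instant.now()).compareTo(openDuration) >= 0) {
                state = State.HALF_OPEN;   // some time has passed: let the next call through
            } else {
                return fallback.get();     // circuit open: fail fast
            }
        }
        try {
            T result = protectedCall.get();
            state = State.CLOSED;          // the call succeeded: close the circuit
            failureCount = 0;
            return result;
        } catch (RuntimeException e) {
            failureCount++;                // the call failed: one more failure
            if (state == State.HALF_OPEN || failureCount >= failureThreshold) {
                state = State.OPEN;        // threshold reached, or the half-open probe failed
                openedAt = Instant.now();
            }
            return fallback.get();
        }
    }
}
```

A production-grade implementation would also have to deal with concurrency and with failure counts over a sliding time window, which is precisely what libraries like Hystrix take care of.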
The Istio Circuit Breaker
Istio is a service mesh, a configurable infrastructure layer for a Microservices application. It makes communication between service instances flexible, reliable, and fast, and provides service discovery, load balancing, encryption, authentication and authorization, support for the circuit breaker pattern, and other capabilities.
Istio’s control plane provides an abstraction layer over the underlying cluster management platform, such as Kubernetes, Mesos, etc., and requires your application to be managed on such a platform.
At its core, Istio consists of Envoy proxy instances that sit in front of the application instances, using the sidecar container pattern, and Pilot, a tool to manage them. This proxying strategy has many advantages:
- Automatic load balancing for HTTP, gRPC, WebSocket, and TCP traffic.
- Fine-grained control of traffic behavior with rich routing rules, retries, failovers, and fault injection.
- A pluggable policy layer and configuration API supporting access controls, rate limits and quotas.
- Automatic metrics, logs, and traces for all traffic within a cluster, including cluster ingress and egress.
- Secure service-to-service communication in a cluster with strong identity-based authentication and authorization.
Because outbound calls to the back-end go through the Envoy proxy, it’s easy to detect when they time out. The proxy can then intercept further calls and return immediately, effectively failing fast. In particular, this enables the circuit breaker pattern to operate in a black-box way.
Configuring The Istio Circuit Breaker
As mentioned, Istio builds upon the cluster management platform of your choice and requires your application to be deployed through it.
Istio implements the circuit breaker pattern via a DestinationRule, or more specifically the path TrafficPolicy (formerly circuitBreaker) -> OutlierDetection, according to the following model:
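For illustration, a DestinationRule applying these settings to the back-end service could look like the following sketch; the rule name, the target host, and the threshold values are assumptions, not prescriptions:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: backend-circuit-breaker
spec:
  host: backend                # assumed name of the back-end service
  trafficPolicy:
    outlierDetection:
      consecutiveErrors: 5     # 5xx responses before the circuit opens
      interval: 30s            # time between circuit breaker check analyses
      baseEjectionTime: 30s    # minimum opening duration
      maxEjectionPercent: 50   # at most half of the pool can be ejected
```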
Parameters are as follows:
| Field | Description |
|---|---|
| consecutiveErrors | Number of 5xx return codes before the circuit breaker opens. |
| interval | Time interval between circuit breaker check analyses. |
| baseEjectionTime | Minimum opening duration. The circuit will remain open for a period equal to the product of the minimum ejection duration and the number of times the circuit has been opened. |
| maxEjectionPercent | Maximum % of hosts in the load-balancing pool for the upstream service that can be ejected. |
Compared to the nominal circuit breaker described above, there are two main deviations:
- There’s no such thing as a half-open state. However, the duration the circuit breaker stays open depends on the number of times the called service failed before: a constantly failing service will cause longer and longer opening durations. For example, with a baseEjectionTime of 30s, the third consecutive ejection lasts 90s.
- In the basic pattern, there’s a single called application (the back-end). In a more realistic production setup, there will likely be multiple instances of the same application deployed behind a load balancer. Some instances might fail while others keep working, and since Istio also plays the role of the load balancer, it can track the failing instances and eject them from the load-balancing pool, up to a point: the role of the maxEjectionPercent attribute is to keep a fraction of the instances in the pool.
Istio’s approach to the circuit breaker is a black-box one. It takes a high-level viewpoint and can only open the circuit when things go wrong. On the flip side, it’s pretty simple to set up, doesn’t require any knowledge of the underlying code, and can be configured as an afterthought.
The Hystrix Circuit Breaker
Hystrix is an Open Source Java library initially provided by Netflix. It’s a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.
Hystrix has many features, including:
- Protecting against latency and failure from dependencies accessed (typically over the network) via third-party client libraries.
- Preventing cascading failures in a complex distributed system.
- Failing fast and recovering rapidly.
- Falling back and gracefully degrading when possible.
- Enabling near real-time monitoring, alerting, and operational control.
Of course, the circuit breaker pattern figures among those features. Because Hystrix is a library, it implements the pattern in a white-box way.
Resilience4J
Netflix has recently announced it has stopped development of the Hystrix library in favor of the less well-known Resilience4J project.
Even though the client code might be a bit different, the approach taken by Hystrix and Resilience4J is similar.
A Hystrix Circuit Breaker Example
Consider the case of an e-commerce web application. The architecture of the app is made of different micro-services, each built upon a business feature:
- Authentication
- Catalog browsing
- Cart management
- Pricing and quoting
- Etc.
When a catalog item is displayed, the pricing/quoting Microservice is queried for its price. If it’s down, circuit breaker or not, no price will be sent back, and it won’t be possible to order anything.
From a business point of view, any downtime will not only damage the perception of the brand, it will also decrease sales. Most sales strategies would rather sell anyway, even if the displayed price is not entirely up to date. One way to achieve this could be to cache the prices returned by the pricing/quoting service while it is available, and to return the cached price when the service is down.
Hystrix enables that approach by providing a circuit breaker implementation that allows a fallback when the circuit is open.
This is a pretty simplified class diagram of Hystrix’s model:
The magic happens in the HystrixCommand methods run() and getFallback():
- run() contains the actual logic, e.g. fetching the price from the quoting service
- getFallback() returns the fallback result when the circuit breaker is open, e.g. the cached price
This could translate into the following code, using Spring’s RestTemplate:
import java.util.UUID;

import javax.cache.Cache;

import org.springframework.web.client.RestTemplate;

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class FetchQuoteCommand extends HystrixCommand<Double> {

    private final UUID productId;            // 1
    private final RestTemplate template;     // 2
    private final Cache<UUID, Double> cache; // 3

    public FetchQuoteCommand(UUID productId,
                             RestTemplate template,
                             Cache<UUID, Double> cache) {
        super(HystrixCommandGroupKey.Factory.asKey("GetQuote")); // 4
        this.template = template;
        this.cache = cache;
        this.productId = productId;
    }

    @Override
    protected Double run() {
        Double quote = template.getForObject("https://acme.com/api/quote/{id}", // 5
                                             Double.class,
                                             productId);
        cache.put(productId, quote); // 6
        return quote;
    }

    @Override
    protected Double getFallback() {
        return cache.get(productId); // 7
    }
}
This warrants some explanation:
- The command wraps a product’s id, modeled as a UUID.
- Spring’s RestTemplate is used to make REST calls. Any other alternative will do.
- A shared JCache instance stores quotes while the service is available.
- Hystrix commands require a group key, so that they can be grouped together if need be. This is another feature of Hystrix and goes beyond the scope of this post. Interested readers can read about command groups in the Hystrix wiki.
- Execute the call to the quoting service. If it fails, the Hystrix circuit breaker flow starts.
- If the call succeeds, cache the returned quote in the shared JCache instance.
- getFallback() is called when the circuit breaker is open. In that case, get the quote from the cache.
The Hystrix wiki features more advanced examples e.g. where the fallback is itself a command that needs to be executed.
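As an illustration of that approach, the fallback of the quote command above could itself execute another command. In the sketch below, FetchCachedQuoteCommand is a hypothetical second command, for example one reading the quote from a remote cache, protected by its own circuit breaker:

```java
@Override
protected Double getFallback() {
    // FetchCachedQuoteCommand is hypothetical: a second HystrixCommand wrapping
    // another remote call, so the fallback itself runs behind a circuit breaker.
    return new FetchCachedQuoteCommand(productId).execute();
}
```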
Integrating Hystrix With Spring Cloud
While the above code works, a Hystrix command object needs to be created every time a quote is requested.
Spring Cloud, a library built on top of Spring Boot (itself built upon the Spring framework), offers great integration between Hystrix and Spring. It lets one simply annotate a method with the desired fallback, and handles the instantiation of the Hystrix command object at runtime:
import java.util.UUID;

import javax.cache.Cache;

import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;

@Service // a Spring bean, so that the @HystrixCommand aspect can intercept calls
public class FetchQuoteService {

    private final RestTemplate template;
    private final Cache<UUID, Double> cache;

    public FetchQuoteService(RestTemplate template,
                             Cache<UUID, Double> cache) {
        this.template = template;
        this.cache = cache;
    }

    @HystrixCommand(fallbackMethod = "getQuoteFromCache") // 1
    public Double getQuoteFor(UUID productId) { // 2
        Double quote = template.getForObject("https://acme.com/api/quote/{id}", // 3
                                             Double.class,
                                             productId);
        cache.put(productId, quote); // 4
        return quote;
    }

    public Double getQuoteFromCache(UUID productId) { // 5
        return cache.get(productId);
    }
}
- The method should be annotated with @HystrixCommand. The fallbackMethod element references the fallback method. Obviously, this is handled through reflection and is not typesafe, as it’s a string after all.
- Spring Cloud Hystrix allows the product’s id to be passed at method invocation. Compared to the simple Hystrix command above, this allows for a generic service object. The creation of the Hystrix command is handled by Spring Cloud at runtime.
- The core logic doesn’t change.
- Likewise, the caching process stays the same.
- The fallback method is a regular method. It will be called with the exact same parameter values as the main method, hence it must have the same argument types (in the same order). Because getQuoteFor() accepts a UUID, so does this method.
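For the @HystrixCommand annotation to be processed at runtime, Hystrix support typically has to be enabled on the Spring Boot application class. Here is a minimal sketch, assuming Spring Cloud Netflix is on the classpath; the application class name is made up for the example, and the RestTemplate and JCache beans are assumed to be defined elsewhere:

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.client.circuitbreaker.EnableCircuitBreaker;

@SpringBootApplication
@EnableCircuitBreaker // enables Hystrix support, including @HystrixCommand processing
public class ShopApplication {

    public static void main(String[] args) {
        SpringApplication.run(ShopApplication.class, args);
    }
}
```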
Hystrix, whether standalone or wrapped by Spring Cloud, requires the circuit breaker to be handled at the code level. Thus, it needs to be planned in advance, and changes require the updated binary to be redeployed. However, this allows for very fine, custom-tailored behavior when things go wrong.
Istio vs Hystrix: battle of circuit breakers
If there is a possibility for things to fail, then given time, things will fail, and Microservices that heavily rely on the network need to be designed for failure. The circuit breaker pattern is one of the ways to handle the lack of availability of a service: instead of queuing requests and choking the caller, it fails fast and returns immediately.
There are two ways to implement the circuit breaker, the black-box way and the white-box way. Istio, as a proxy management tool, uses the black-box way. It’s a no-brainer to implement, it doesn’t depend on the underlying technology stack, and it can be configured as an afterthought.
On the other side, the Hystrix library uses the white-box way. It allows for all different kinds of fallbacks:
- A single default value
- A cache
- Calls to a different service.
It also offers cascading fallbacks. These additional features come at a cost: they require fallback decisions to be made while still in the development phase.
The best fit between the two approaches will likely depend on one’s own context: in some cases such as the quoting service, a white-box strategy with a fallback may be a better fit, while for other cases it may be perfectly acceptable to fail fast, such as a centralized remote logging service.
And of course, nothing prevents you from using them side by side.