Break circuits, save Kong
2020 was a milestone year for us at Dream11. We became the world's largest fantasy sports platform with over 100 million users and were the first-ever sports brand to become the Title Sponsor for the Indian Premier League (IPL) 2020. The growing popularity of fantasy sports has led to a huge uplift in user engagement on the Dream11 app. During the IPL, we receive up to 80 million requests per minute on our API Gateway, which distributes them to more than 75 microservices, each performing its own function. Such a high influx of user requests comes with its own set of challenges.
In a microservice architecture, service, database or even network failure can lead to a cascading effect on other parts of your infrastructure. This can lead to:
- Socket hangup and resource contention due to unresponsive/failing requests on downstream services like the API Gateway, which can potentially bring down the entire app.
- Users having to retry the requests from the client, keeping the database/microservice overwhelmed, not allowing it to recover.
The Circuit Breaker pattern was introduced to tackle this very problem.
"The basic idea behind the circuit breaker is very simple. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all." (Martin Fowler)
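The idea in the quote can be sketched in a few lines of Lua. This is only an illustration under simplifying assumptions: `new_breaker`, `breaker.call` and `flaky` are hypothetical names, the breaker counts absolute failures rather than a percentage, and it never closes again. It is not the API of any library mentioned in this post.

```lua
-- Minimal illustration of the circuit-breaker idea.
-- All names here are hypothetical, chosen for this sketch only.
local function new_breaker(failure_threshold)
  local breaker = { failures = 0, open = false }

  -- Runs fn(...) unless the circuit is open.
  -- Returns ok, result_or_error (same shape as pcall).
  function breaker.call(fn, ...)
    if breaker.open then
      -- Tripped: fail immediately without making the protected call.
      return false, "circuit open"
    end
    local ok, result = pcall(fn, ...)
    if not ok then
      breaker.failures = breaker.failures + 1
      if breaker.failures >= failure_threshold then
        breaker.open = true
      end
    end
    return ok, result
  end

  return breaker
end

-- A protected function that always fails, and counts how often it ran.
local calls = 0
local function flaky()
  calls = calls + 1
  error("upstream unavailable")
end

local breaker = new_breaker(3)
for _ = 1, 5 do
  breaker.call(flaky)
end
print(calls) --> 3
```

Five calls are attempted, but the protected function runs only three times: once the threshold is reached, the remaining calls are rejected without touching the failing dependency.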
We use the community version of Kong as our API gateway in production.
Most of our microservices are written in Java, where we use resilience4j to wrap server-to-server (S2S) calls in a circuit breaker; the few services written in Node.js use opossum. Taking inspiration from both, we listed the requirements for our API gateway (Kong):
Requirements:
- Request level circuit breaker: Track the failure % of each route served by Kong over a time window. We want to break the circuit at the route level rather than the service level because we don't want to stop requests that are not causing failures.
- Timeout: A 5XX response from the upstream service counts as a failure. A request that does not complete within X milliseconds should time out and also count as a failure, because we don't want to keep Kong's sockets hanging.
- Fail fast: If the failure % (total_failed/total_requests) for a route is above the defined threshold, the plugin should open the circuit, block any further requests to this route and immediately return a 5XX error.
- Half Open Circuit: After a defined time, Kong should allow a few requests on the blocked route and recalculate the failure %. If the failure % has dropped below the defined threshold, allow requests on this route as usual. Otherwise, keep blocking all requests on this route.
- Circuit breaker as a library: We have written multiple custom plugins that make internal HTTP requests to upstream services, for example a request to the Auth service to validate a cookie. Ideally, we want to wrap these HTTP requests in a circuit breaker as well, so that Kong's performance does not suffer if, say, the Authentication service fails.
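The requirements above amount to a small state machine: closed, open and half-open. The following is a rough sketch of that behaviour, not the actual plugin code. The field names are borrowed from the sample configuration shown later in this post, but how they are interpreted here is an assumption, and `new_breaker`, `allow`, `record` and the injected `now` clock are hypothetical names used for illustration.

```lua
-- Sketch of a closed / open / half-open breaker driven by failure %.
-- `now` is an injected clock (seconds) so the example is deterministic.
local function new_breaker(conf, now)
  local b = { state = "closed", total = 0, failed = 0, admitted = 0 }
  b.window_start = now()

  local function reset_window()
    b.window_start = now()
    b.total, b.failed, b.admitted = 0, 0, 0
  end

  local function transition(state)
    b.state = state
    if state == "open" then b.opened_at = now() end
    reset_window()
  end

  -- Returns true when a request may be sent upstream.
  function b.allow()
    if b.state == "open" then
      if now() - b.opened_at < conf.wait_duration_in_open_state then
        return false -- fail fast
      end
      transition("half_open")
    end
    if b.state == "half_open" then
      -- Only let a limited number of trial requests through.
      if b.admitted >= conf.half_open_max_calls_in_window then
        return false
      end
      b.admitted = b.admitted + 1
    end
    return true
  end

  -- Records the outcome of a request that was allowed.
  function b.record(success)
    if b.state == "closed" and now() - b.window_start >= conf.window_time then
      reset_window()
    end
    b.total = b.total + 1
    if not success then b.failed = b.failed + 1 end

    local min_calls = b.state == "half_open"
      and conf.half_open_min_calls_in_window
      or conf.min_calls_in_window
    if b.total < min_calls then return end -- not enough data yet

    if b.failed / b.total * 100 > conf.failure_percent_threshold then
      transition("open")
    elseif b.state == "half_open" then
      transition("closed")
    end
  end

  return b
end

-- Demo with a fake clock and small, made-up thresholds.
local clock = 0
local conf = {
  window_time = 10, min_calls_in_window = 4, failure_percent_threshold = 50,
  wait_duration_in_open_state = 15,
  half_open_min_calls_in_window = 2, half_open_max_calls_in_window = 3,
}
local b = new_breaker(conf, function() return clock end)

for _ = 1, 4 do
  if b.allow() then b.record(false) end
end
local state_after_failures = b.state  -- "open"
local allowed_while_open = b.allow()  -- false

clock = 15                            -- the open-state wait has elapsed
for _ = 1, 2 do
  if b.allow() then b.record(true) end
end
print(state_after_failures, allowed_while_open, b.state)
```

A real implementation also has to enforce the request timeout and cope with concurrent in-flight requests, both of which this sketch leaves out.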
Kong comes with an out-of-the-box circuit breaker that detects the health of an upstream service and blocks further requests if the upstream becomes unhealthy. However, this circuit breaker did not meet the primary requirements mentioned above, as it only works at the service level. We wanted to track failures at the per-route level because:
- We do not want to stop all requests to an upstream service when just one route of that service is failing.
- A read request may be working fine because of caching while a write request to the same microservice is failing because the database is overwhelmed.
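To make the per-route argument concrete, here is a small sketch that keeps failure statistics keyed by route instead of by service. The route names and the `"METHOD path"` key format are made up for this example, and `record` and `failure_percent` are hypothetical helpers.

```lua
-- Per-route failure statistics, keyed by "METHOD path".
local stats_by_route = {}

-- Lazily creates the counters for a route.
local function stats_for(route)
  local stats = stats_by_route[route]
  if not stats then
    stats = { total = 0, failed = 0 }
    stats_by_route[route] = stats
  end
  return stats
end

-- A 5XX status from the upstream counts as a failure.
local function record(route, status)
  local stats = stats_for(route)
  stats.total = stats.total + 1
  if status >= 500 then
    stats.failed = stats.failed + 1
  end
end

local function failure_percent(route)
  local stats = stats_for(route)
  if stats.total == 0 then return 0 end
  return stats.failed / stats.total * 100
end

-- Same upstream service, two routes: cached reads succeed, writes fail.
for _ = 1, 10 do
  record("GET /contests", 200)
  record("POST /contests", 503)
end

print(failure_percent("GET /contests"))
print(failure_percent("POST /contests"))
```

With service-level tracking these two routes would share a single 50% failure rate. Tracked separately, only the write route crosses any reasonable threshold, so reads keep flowing.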
Solution:
We explored whether we could build our plugin on a pre-existing Lua library, but could not find one that was production-ready and stress-tested. In the end, we decided to write a Lua library (lua-circuit-breaker) that abstracts the circuit-breaker logic, and used that library to create a Kong plugin (kong-circuit-breaker) as well.
While creating these repositories, we drew inspiration from resilience4j when it came to request-timeouts, percentage thresholds, time windows, custom error messages, and others. A sample configuration of the plugin looks like this:
    configuration = {
      window_time = 10,
      min_calls_in_window = 20,
      failure_percent_threshold = 51,        -- percent
      wait_duration_in_half_open_state = 120,
      wait_duration_in_open_state = 15,      -- seconds
      half_open_max_calls_in_window = 10,
      half_open_min_calls_in_window = 5,
      api_call_timeout_ms = 300,             -- milliseconds
      error_status_code = 599,
      error_msg_override = "Circuit breaker did not allow the request"
    }
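As a worked example of how two of these settings might interact, the sketch below evaluates the fail-fast rule (failure % above the threshold) with `min_calls_in_window = 20` and `failure_percent_threshold = 51`. It assumes the breaker makes no decision until the window has seen the minimum number of calls. `should_open` is a hypothetical helper, not a function of the plugin.

```lua
-- Hypothetical helper illustrating the threshold arithmetic.
local conf = { min_calls_in_window = 20, failure_percent_threshold = 51 }

local function should_open(total, failed)
  -- Too few calls in the window: not enough data to judge the route.
  if total < conf.min_calls_in_window then
    return false
  end
  return failed / total * 100 > conf.failure_percent_threshold
end

print(should_open(19, 19)) --> false (below min_calls_in_window)
print(should_open(20, 10)) --> false (50% is not above 51%)
print(should_open(20, 11)) --> true  (55% is above 51%)
```

The minimum-calls guard keeps a handful of failures on a quiet route from opening the circuit.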
As you can see from the images below, the route GET /user-contests started to fail because the database of the contest-reader-vertx service experienced a spike in latency. The circuit breaker on this route opened and failed fast for wait_duration_in_open_state seconds, rejecting 4.87M requests with status code 599. It then let a few requests through and, once the route had recovered, closed the circuit so that requests flowed as usual.
We have been using the library and the plugin for the last six months and they have been working very well. They have been battle-tested at Dream11 IPL scale (1.2M requests per second).
The plugin has been really helpful: it opens circuits quickly and prevents cascading failures when a service, database or network fails. So after IPL 2021, we open-sourced the library and the plugin to contribute to the Kong community. We want the community to try them out, raise feature requests and share any feedback to help make them better ❤️.
dream11/lua-circuit-breaker
_lua-circuit-breaker provides circuit-breaker functionality like resilience4j i.e. for Java. Any IO function/method that…_ (github.com)
dream11/kong-circuit-breaker
_kong-circuit-breaker is a Kong plugin that provides circuit-breaker functionality at the route level. It uses…_ (github.com)