Break circuits, save Kong 🦍

2020 was a milestone year for us at Dream11. We became the world’s largest fantasy sports platform with over 100 million users and were the first-ever sports brand to become the Title Sponsor for the Indian Premier League (IPL) 2020. The increase in the popularity of fantasy sports has led to a huge uplift in user engagement on the Dream11 app. During the IPL, we receive up to 80 million requests per minute on our API Gateway, which distributes them across more than 75 microservices, each performing its own micro-function. Such a high influx of user requests comes with its own set of challenges.

In a microservice architecture, a service, database or even network failure can have a cascading effect on other parts of your infrastructure. This can lead to:

  • Socket hang-ups and resource contention caused by unresponsive or failing requests, which pile up on components like the API Gateway and can potentially bring down the entire app.
  • Users retrying requests from the client, which keeps the database/microservice overwhelmed and never lets it recover.

The Circuit Breaker pattern was introduced to tackle this very problem.

The basic idea behind the circuit breaker is very simple. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all. — Martin Fowler
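To make the idea concrete, here is a minimal sketch in Lua. It is purely illustrative (the names are ours, not the API of any library mentioned below): a breaker object counts failures of the protected call and, once a threshold is crossed, starts failing fast without invoking it at all.

local CircuitBreaker = {}
CircuitBreaker.__index = CircuitBreaker

-- Create a breaker that trips after `failure_threshold` consecutive failures.
function CircuitBreaker.new(failure_threshold)
    return setmetatable({ failures = 0, threshold = failure_threshold, open = false }, CircuitBreaker)
end

-- Wrap a protected function call; fail fast while the circuit is open.
function CircuitBreaker:call(protected_fn, ...)
    if self.open then
        return nil, "circuit open: failing fast"
    end
    local ok, result = pcall(protected_fn, ...)
    if not ok then
        self.failures = self.failures + 1
        if self.failures >= self.threshold then
            self.open = true -- trip the breaker
        end
        return nil, result
    end
    self.failures = 0 -- a success resets the count
    return result
end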

We use the community version of Kong as our API gateway in production.
Most of our microservices are written in Java, where we use resilience4j to wrap server-to-server (S2S) calls with the circuit-breaker pattern, and opossum for the few services written in Node.js. Taking inspiration from them, we listed down the requirements we need in our API gateway (Kong):

Requirements:

  • Request level circuit breaker: Track the failure % of each route served by Kong over a time window. We want to break the circuit at the route level rather than the service level because we don’t want to block requests that are not causing failures.
  • Timeout: A 5XX response from the upstream service is counted as a failure. If a request does not complete within X milliseconds, it should time out and also be counted as a failure, since we don’t want to keep Kong’s sockets hanging.
  • Fail fast: If the failure % (total_failed/total_requests) for a route is above the defined threshold, then the plugin should open the circuit, block any further requests to this route and immediately return a 5XX error.
  • Half-open circuit: After a defined time, Kong should allow a few requests on this route (which was blocked for a while) and recalculate the failure % to check whether they succeed. If the failure % has dropped below the defined threshold, requests on this route flow as usual again; otherwise, all requests on this route stay blocked (see the sketch after this list).
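A hedged sketch of how these requirements translate into state transitions for a single route. The field names are taken from the sample plugin configuration shown later; the function and the stats structure are assumptions for illustration, not the plugin’s internals.

local CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

-- stats = { total = calls seen in the current window, failed = failures in it,
--           opened_at = timestamp at which the circuit last opened }
local function next_state(state, stats, conf, now)
    local failure_percent = stats.total > 0 and (stats.failed / stats.total) * 100 or 0

    if state == CLOSED then
        -- Fail fast: open once enough calls were seen and the failure % crosses the threshold.
        if stats.total >= conf.min_calls_in_window
            and failure_percent >= conf.failure_percent_threshold then
            return OPEN
        end
    elseif state == OPEN then
        -- After wait_duration_in_open_state seconds, allow a few probe requests through.
        if now - stats.opened_at >= conf.wait_duration_in_open_state then
            return HALF_OPEN
        end
    elseif state == HALF_OPEN then
        -- Close only if the probes brought the failure % back under the threshold; else re-open.
        if stats.total >= conf.half_open_min_calls_in_window then
            if failure_percent < conf.failure_percent_threshold then
                return CLOSED
            else
                return OPEN
            end
        end
    end
    return state
end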

Circuit breaker as a library: We have written multiple custom plugins that make internal HTTP requests to upstream services for different use cases, such as calling the Auth service to validate a cookie. Ideally, we want to wrap these HTTP requests with a circuit breaker too, so that Kong’s performance does not suffer if, say, the Authentication service fails.
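For example, an internal auth call made from a custom plugin could be wrapped like this. This is a hedged sketch: lua-resty-http is commonly used for such calls from Kong plugins, but the breaker’s allow/record_success/record_failure methods and the auth URL are illustrative names, not the exact lua-circuit-breaker API.

local http = require "resty.http"

-- Wrap the cookie-validation call to the Auth service with a breaker so that
-- a failing Auth service cannot tie up Kong's workers.
local function validate_cookie(breaker, cookie)
    if not breaker:allow() then
        return nil, "auth circuit open: failing fast"
    end

    local httpc = http.new()
    httpc:set_timeout(300) -- ms; mirrors api_call_timeout_ms so sockets never hang
    local res, err = httpc:request_uri("http://auth.internal/validate", {
        method = "POST",
        headers = { ["Cookie"] = cookie },
    })

    if not res or res.status >= 500 then
        breaker:record_failure() -- timeouts and 5XX responses both count as failures
        return nil, err or ("auth returned " .. res.status)
    end

    breaker:record_success()
    return res
end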

Kong comes with an out-of-the-box circuit breaker to detect the health of an upstream service and block any further requests if the upstream becomes unhealthy. However, this circuit breaker did not meet the primary requirements mentioned above as it only works at the service level. We wanted to track failures at a per-route level because:

  1. We do not want to stop all requests to an upstream service if just one route of that service is failing.
  2. A read request may be working fine because of caching while a write request to the same microservice is failing because the database is overwhelmed (see the sketch after this list).
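A minimal sketch of the per-route idea (assumed structure, not the plugin’s code): keep one breaker instance per route, so a route like GET /user-contests can trip without affecting other routes of the same upstream service.

-- One breaker per route, created lazily and looked up by route name.
local breakers = {}

local function breaker_for(route_name, new_breaker)
    if not breakers[route_name] then
        breakers[route_name] = new_breaker() -- e.g. one instance per Kong route
    end
    return breakers[route_name]
end

-- A failing write route of a service never opens the circuit for a cached,
-- healthy read route of the same service, because each route has its own breaker.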

Solution:

We explored whether we could leverage a pre-existing Lua library to write our plugin, but could not find one that was production-ready and stress-tested. So we decided to write a Lua library (lua-circuit-breaker) that abstracts out the circuit-breaker logic, and used the same library to create a Kong plugin (kong-circuit-breaker) as well.

While creating these repositories, we drew inspiration from resilience4j when it came to request timeouts, percentage thresholds, time windows, custom error messages, and so on. A sample configuration of the plugin looks like this:

configuration = {
        window_time = 10,
        min_calls_in_window = 20,
        failure_percent_threshold = 51,
        wait_duration_in_half_open_state = 120,
        wait_duration_in_open_state = 15,
        half_open_max_calls_in_window = 10,
        half_open_min_calls_in_window = 5,
        api_call_timeout_ms = 300,
        error_status_code = 599,
        error_msg_override = "Circuit breaker did not allow the request"
}
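To make the numbers concrete, here is a hedged reading of that sample configuration, consistent with the requirements above: failures on a route are tracked over 10-second windows; once a window has at least 20 calls and 51% or more of them fail (a 5XX, or no response within 300 ms), the circuit opens for 15 seconds and every request on that route immediately gets a 599 with the override message, after which a limited number of probe calls (between 5 and 10 in the half-open window) decide whether the circuit closes again. For example, 11 failures out of 20 calls is a 55% failure rate, which crosses the 51% threshold and opens the circuit.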

As you can see from the images below, the route GET /user-contests started to fail because the database of the contest-reader-vertx service experienced a spike in latency. The circuit breaker on this route opened and failed requests fast for wait_duration_in_open_state seconds, resulting in 4.87M failures with status code 599. It then let a few probe requests through and, once the route had recovered, closed the circuit and allowed requests to flow as usual.

We have been using the library and the plugin for the last 6 months and they have been working very well. They have been battle-tested at Dream11’s IPL scale (1.2 million requests per second).

The plugin has been really helpful in quickly opening circuits and preventing cascading effects in case of service/DB/network failures. So, after IPL 2021, we open-sourced the library and the plugin to contribute back to the Kong community. We want the community to try them out, raise feature requests and share feedback to help make them better ❤️.

Are you looking to solve more problems like these with the Dream11 team? Join us by applying here!

dream11/lua-circuit-breaker (github.com)
lua-circuit-breaker provides circuit-breaker functionality like resilience4j in Java. Any IO function/method that…

dream11/kong-circuit-breaker (github.com)
kong-circuit-breaker is a Kong plugin that provides circuit-breaker functionality at the route level. It uses…
