Deployment At Scale: The Story Behind Dream11's In-House Blue-Green Deployment Platform 'OneClick'

Introduction

As part of the agile development revolution, organizations have come to believe in rolling small, quick changes out to the market. The "build fast, ship fast" philosophy plays a significant role in improving any product by shipping features with very little time to market. However, it also burdens the tech team, which has to allocate dedicated people to ensure these frequent changes go live on the application seamlessly.

The answer to this problem was to automate the deployment process while minimizing failures and keeping rollbacks under control. Let's understand how we achieve this with our in-house platform, "OneClick".

Seamless Blue-Green Deployment

Given the criticality that comes with the high scale at which Dream11 operates, we opted for Seamless Blue-Green Deployment as our artifact deployment strategy to minimize the risk of downtime. This strategy is an extended version of the default, or instantaneous, Blue-Green Deployment: the stack carrying the new application changes receives incoming traffic gradually, unlike in instantaneous Blue-Green Deployment.

Now let’s quickly understand our custom automated workflow for the Seamless Blue-Green Deployment.

Briefly, this architecture depicts the following components:

  • Jenkins: This acts as the interaction layer for the deployment process. It collects information from users such as the targeted service name, the number of shards needed for this specific deployment, and a description of the change being deployed for the service (a sketch of this request is shown after this list).
  • OneClick: This acts as the brain behind our deployments, deciding which infrastructure is eligible for deployment and performing smart traffic movement to achieve Seamless Blue-Green Deployment. (Stay tuned! We will talk more about this in the upcoming sections below.)
  • Concurrency Model: Our data science team at Dream11 has developed a model for predicting user concurrency. After trying multiple models with hundreds of features, they settled on an XGBoost model that predicts half-hourly concurrency on the Dream11 platform (more details are here). OneClick uses this information about upcoming traffic to estimate the infrastructure resources needed for the deployment, and the corresponding Green stack creation calls are then made to Terraform.
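Purely as an illustration of the inputs collected at the Jenkins layer (the real payload and field names are internal to OneClick), the deployment request could be modelled like this:

```python
from dataclasses import dataclass

# Hypothetical shape of the deployment request Jenkins hands over to OneClick;
# field names are illustrative, not OneClick's real schema.
@dataclass
class DeploymentRequest:
    service_name: str        # targeted microservice
    shard_count: int         # number of shards requested for this deployment
    change_description: str  # what is being shipped, for the audit trail

request = DeploymentRequest(
    service_name="example-service",
    shard_count=2,
    change_description="Enable new contest listing API",
)
```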

Before we talk more about how the OneClick engine achieves Seamless Blue-Green Deployment, let's glance at the typical microservice architecture at Dream11.

The above architecture depicts incoming requests interacting with the cloud-managed DNS, i.e. Route53, via CloudFront. At Route53, the load balancers are enlisted as weighted record entries, and these weights determine the proportion of incoming traffic that flows to each load balancer.

A Blue-Green traffic switch can be done at the web-server, load balancer, or DNS level. We preferred the DNS-level switch over a load balancer-level one because of AWS's quota of 1,000 targets per load balancer; during peak intervals most of our microservices exceed this per-load-balancer limit, and we then rely on sharding to deliver the infrastructure resources those microservices need (sharding here is the process of distributing traffic across multiple Blue/active load balancers). A sketch of such a DNS-level weight update is shown below.
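To illustrate the DNS-level switch, here is a minimal sketch of how a weighted Route53 alias record pointing at a load balancer could be upserted with boto3. The hosted zone IDs, record name, and load balancer details are hypothetical placeholders, and this is not OneClick's actual code:

```python
import boto3

route53 = boto3.client("route53")

def set_lb_weight(hosted_zone_id, record_name, set_identifier,
                  lb_dns_name, lb_zone_id, weight):
    """UPSERT a weighted alias record pointing at a load balancer."""
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={
            "Comment": f"Set weight={weight} for {set_identifier}",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "SetIdentifier": set_identifier,  # e.g. "blue-shard-1" / "green-shard-1"
                    "Weight": weight,                  # 0-255; relative share of traffic
                    "AliasTarget": {
                        "HostedZoneId": lb_zone_id,    # hosted zone of the load balancer
                        "DNSName": lb_dns_name,
                        "EvaluateTargetHealth": True,
                    },
                },
            }],
        },
    )

# Hypothetical usage: send a small share of traffic to the Green LB.
# set_lb_weight("Z123EXAMPLE", "service.example.internal", "green-shard-1",
#               "green-lb-123.ap-south-1.elb.amazonaws.com", "ZP97RAFLXTNZK", 10)
```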

Now let's take a deep dive into "OneClick". The workflow below represents the responsibilities and features of the "OneClick" engine that let us achieve automated Seamless Blue-Green Deployment.

🚀 Identify Eligible Shards.

In this step, OneClick identifies the right set of load balancers from the microservice's load balancer inventory, based on the number of shards requested by the user. Load balancers are categorized into Active (Blue), Passive, and Eligible (Green) sections along with their attached resources (Target Groups, ASGs, EC2 instances, etc.) and compiled into JSON objects. These compiled JSONs are processed during stack creation and traffic routing, and are eventually shipped to AWS S3 for reference during rollback events, when the Blue stack has to be created again.
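As an illustrative example (the actual schema is internal to OneClick), the compiled shard metadata shipped to S3 could look something like this:

```python
# Hypothetical shape of the per-deployment shard metadata OneClick compiles and
# ships to S3; field names and values are illustrative only.
shard_inventory = {
    "service": "example-service",
    "deployment_id": "deploy-2022-10-01-001",
    "active_blue": [
        {
            "load_balancer_arn": "arn:aws:elasticloadbalancing:region:acct:loadbalancer/app/blue-1/abc",
            "target_group_arn": "arn:aws:elasticloadbalancing:region:acct:targetgroup/blue-1/def",
            "asg_name": "example-service-blue-1",
            "route53_weight": 100,
        }
    ],
    "eligible_green": [
        {
            "load_balancer_arn": "arn:aws:elasticloadbalancing:region:acct:loadbalancer/app/green-1/ghi",
            "target_group_arn": "arn:aws:elasticloadbalancing:region:acct:targetgroup/green-1/jkl",
            "asg_name": None,          # attached later, after Green stack creation
            "route53_weight": 0,
        }
    ],
    "passive": [],
}
```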

🚀 Pre-Deployment Cleanup.

In this step, the framework identifies any stale resources (EC2 instances, launch templates, or Auto Scaling groups) still attached to the eligible Green shards from earlier deployments and cleans them up. Once the cleanup is done, the eligible Green load balancers are blank entities with no resources attached to them, ready to be used by the new artifacts.
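A minimal sketch of such a cleanup with boto3, assuming each eligible Green shard records the name of its previously attached Auto Scaling group and launch template (the names are hypothetical):

```python
import boto3

autoscaling = boto3.client("autoscaling")
ec2 = boto3.client("ec2")

def cleanup_green_shard(asg_name, launch_template_name):
    """Delete the stale ASG (terminating its instances) and its launch template."""
    if asg_name:
        # ForceDelete terminates any EC2 instances still attached to the ASG.
        autoscaling.delete_auto_scaling_group(
            AutoScalingGroupName=asg_name, ForceDelete=True
        )
    if launch_template_name:
        ec2.delete_launch_template(LaunchTemplateName=launch_template_name)

# Hypothetical usage for every eligible Green shard identified earlier:
# cleanup_green_shard("example-service-green-1", "example-service-green-1-lt")
```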

🚀 Prewarm Eligible (Green) Load Balancers.

Since we perform the traffic switch (release) onto the Green load balancers even during peak traffic, we need to prewarm these load balancers in advance so they are ready to serve the upcoming flash traffic, avoiding any load balancer capacity anomalies. Load balancers are prewarmed in terms of LCUs (Load Balancer Capacity Units), and LCU consumption is directly proportional to max(new connections, active connections, rule evaluations, processed bytes).

As Dream11's traffic is highly volatile, the million-dollar question is: how many LCUs are enough? OneClick decides that using the simple formula below:

Eligible LCU Prewarm Units = max(minimum benchmarked LCUs for the microservice, currently provisioned LCUs on the Blue LB, consumed LCUs on the Blue LB)
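Expressed as code, the estimate is simply a max over the three signals. A trivial sketch, assuming the inputs come from the microservice's benchmarks and the Blue load balancer's metrics:

```python
def eligible_lcu_prewarm_units(benchmark_min_lcu: float,
                               blue_provisioned_lcu: float,
                               blue_consumed_lcu: float) -> float:
    """Pick the largest of the three signals so the Green LB is never
    prewarmed below what the Blue LB is already provisioned for or handling."""
    return max(benchmark_min_lcu, blue_provisioned_lcu, blue_consumed_lcu)

# Example: benchmark says 50 LCUs, Blue is provisioned for 80 and consuming 65 -> prewarm to 80.
assert eligible_lcu_prewarm_units(50, 80, 65) == 80
```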

Scale-up and scale-down requirements for LCUs during deployments are catered to via infrastructure locks between our scaling platform, i.e. Scaler, and OneClick.

🚀 ML-Based Resource Estimation

Thanks to our ML-based concurrency prediction model, we always know the incoming concurrency on the Dream11 platform in advance. Based on the predicted concurrency, our Scaler backend can compute the number of EC2 instances needed by the microservice targeted for deployment. OneClick simply asks the Scaler backend for the EC2 capacity required for the predicted concurrency and orchestrates the provisioning process.
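As a hypothetical sketch of that interaction (the Scaler API, its endpoint, and its response fields are internal to Dream11 and assumed here purely for illustration), OneClick could ask for the required EC2 count like this:

```python
import requests

# Hypothetical Scaler endpoint; the real Scaler API is internal to Dream11.
SCALER_URL = "http://scaler.internal/api/v1/capacity-estimate"

def estimate_ec2_capacity(service: str, predicted_concurrency: int) -> int:
    """Ask the Scaler backend how many EC2 instances the service needs
    for the predicted concurrency window."""
    response = requests.get(
        SCALER_URL,
        params={"service": service, "concurrency": predicted_concurrency},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["required_ec2_count"]  # hypothetical response field
```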

We are also working to tune our LCU estimation logic along the same lines as the EC2 estimation above, for more accurate estimates.

🚀 Green ( Passive ) Stack Creation

Once the EC2 capacity (and other compute resources) needed for the deployment is estimated as per the expected traffic, OneClick calls our Terraform modules to procure the estimated capacity and attach it to the eligible Green load balancers identified earlier.
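A minimal sketch of how such a Terraform call could be orchestrated from Python; the module directory and variable names are assumptions, and OneClick's actual Terraform integration may differ:

```python
import subprocess

def create_green_stack(module_dir: str, service: str,
                       desired_capacity: int, target_group_arn: str) -> None:
    """Run `terraform apply` on the Green stack module with the estimated capacity."""
    subprocess.run(
        [
            "terraform", "apply", "-auto-approve",
            "-var", f"service_name={service}",
            "-var", f"desired_capacity={desired_capacity}",
            "-var", f"target_group_arn={target_group_arn}",
        ],
        cwd=module_dir,   # directory containing the (hypothetical) Green stack module
        check=True,
    )
```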

🚀 Pre-Route Operations

In this phase, we ensure the readiness of the Green stack via:

  • Health Check Validations.
  • Ensuring the Green stack's capacity is never lower than the current Blue stack's (see the sketch after this list).
  • Acquiring inter-automation infrastructure locks.
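As a minimal sketch of the first two checks (OneClick's real validations are richer), the healthy target count on the Green target group can be compared against the Blue one with boto3:

```python
import boto3

elbv2 = boto3.client("elbv2")

def healthy_target_count(target_group_arn: str) -> int:
    """Count targets that ELB currently reports as healthy."""
    health = elbv2.describe_target_health(TargetGroupArn=target_group_arn)
    return sum(
        1 for target in health["TargetHealthDescriptions"]
        if target["TargetHealth"]["State"] == "healthy"
    )

def green_stack_ready(blue_tg_arn: str, green_tg_arn: str) -> bool:
    """Green is ready only if it has at least as many healthy targets as Blue."""
    return healthy_target_count(green_tg_arn) >= healthy_target_count(blue_tg_arn)
```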

Let's understand a little more about the last of these, the inter-automation infrastructure locks.

🚀 Inter-Automation Infrastructure Locks

As our infrastructure grows, so do our automation entities. These entities are responsible for various CRUD operations within the Dream11 infrastructure. With a growing number of invokers for infrastructure operations, we needed a well-defined way for them to coordinate with each other, and we achieve this via a locking mechanism on infra-mutation activities. During the deployment process, OneClick acquires a lock on the microservice's infrastructure, releases it after a successful traffic routing process, and then notifies the other automation components so they can resume their infra-modification operations.
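One common way to implement such a lock is a conditional write into a shared store. Here is a minimal sketch using a DynamoDB table; the table name and key layout are hypothetical, and OneClick's actual locking mechanism may differ:

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
LOCK_TABLE = "infra-locks"   # hypothetical table keyed by "service"

def acquire_infra_lock(service: str, owner: str) -> bool:
    """Take the lock for a service; fails if another automation already holds it."""
    try:
        dynamodb.put_item(
            TableName=LOCK_TABLE,
            Item={"service": {"S": service}, "owner": {"S": owner}},
            ConditionExpression="attribute_not_exists(service)",
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False   # another component (e.g. Scaler) holds the lock
        raise

def release_infra_lock(service: str) -> None:
    """Release the lock so other automation components can mutate the infra."""
    dynamodb.delete_item(TableName=LOCK_TABLE, Key={"service": {"S": service}})
```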

🚀 Seamless Blue-Green Traffic Switch

With all the prerequisites and validations in place and no red flags, OneClick opens the traffic gates to the Green load balancers. It ensures a smooth shift by moving a small percentage of the total active traffic to the Green stack and then observing a set of verification signals to catch any failures. These signals are performance and functional metrics (load balancer hardware errors, application failures, latency, etc.). Once the captured metrics are within the thresholds defined for the microservice, OneClick sends a larger percentage of traffic to the Green stack, removes the same portion from the Blue stack, and repeats the verification. The iterations continue until the Green stack carries 100% of the incoming traffic and the previous Blue stack carries 0%.
If anything goes wrong at any iteration during this seamless Blue-Green traffic switch, the traffic movement pauses and OneClick prompts the user for an automatic or manual rollback. A simplified sketch of this loop is shown below.
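A minimal sketch of the iterative switch, assuming a hypothetical `set_traffic_split` helper that updates the Route53 weights (see the earlier DNS sketch) and a hypothetical `metrics_within_threshold` check over the verification metrics; the step sizes and soak time are illustrative, not OneClick's actual values:

```python
import time

def set_traffic_split(green_weight: int) -> None:
    """Placeholder: update the Route53 weighted records so the Green LBs carry
    `green_weight`% of traffic and the Blue LBs carry the rest."""

def metrics_within_threshold() -> bool:
    """Placeholder: check LB hardware errors, application failures and latency
    against the thresholds defined for the microservice."""
    return True

def seamless_switch(steps=(5, 25, 50, 75, 100), soak_seconds=120) -> bool:
    """Gradually shift traffic to Green, verifying metrics after each step.
    Returns False (and pauses further movement) if any verification fails."""
    for green_weight in steps:
        set_traffic_split(green_weight)
        time.sleep(soak_seconds)          # let the new split soak before verifying
        if not metrics_within_threshold():
            # Pause here; the user is prompted for an auto or manual rollback.
            return False
    return True
```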

🚀 Post-Deployment Downscaling

After a successful traffic switch, the previous Blue stack carries 0% of the active traffic but still has unused compute resources attached. These resources are effectively wastage, so OneClick downscales the capacity of the attached Auto Scaling groups to zero. By downscaling the idle stack completely and spawning it again quickly whenever it is needed, OneClick fixes one of the classic disadvantages of Blue-Green Deployment, i.e. the high cost of carrying a parallel stack in production.
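A minimal sketch of that downscaling step with boto3 (the ASG name is hypothetical):

```python
import boto3

autoscaling = boto3.client("autoscaling")

def downscale_to_zero(asg_name: str) -> None:
    """Scale an idle stack's ASG down to zero so no EC2 cost is incurred
    while keeping the ASG around for a quick respawn."""
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=0,
        DesiredCapacity=0,
    )

# Hypothetical usage for every ASG attached to the now-idle previous Blue stack:
# downscale_to_zero("example-service-blue-1")
```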

Now the question is: how do we manage the uncertainty of rollback decisions taken a few hours, or even days, after the deployment has completed?

🚀 Rollback Pre-Validations

OneClick allows users to take rollback decisions and automatically moves traffic back to the previous Blue (active) stack. In an earlier stage of the deployment, OneClick shipped all the pre-deployment metadata of the Blue and Green stacks to AWS S3; now it fetches that metadata by the unique deployment identifier and re-procures the infrastructure as described there. OneClick also validates the health of the re-spawned stack and signs off on infrastructure readiness before switching traffic back to the previous Blue stack.
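A minimal sketch of fetching that metadata back from S3 by deployment identifier (the bucket name and key layout are hypothetical):

```python
import json
import boto3

s3 = boto3.client("s3")
METADATA_BUCKET = "oneclick-deployment-metadata"   # hypothetical bucket

def fetch_deployment_metadata(service: str, deployment_id: str) -> dict:
    """Load the pre-deployment Blue/Green metadata stored at deployment time,
    used to re-procure the previous Blue stack before a rollback."""
    obj = s3.get_object(
        Bucket=METADATA_BUCKET,
        Key=f"{service}/{deployment_id}.json",
    )
    return json.loads(obj["Body"].read())
```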

🚀 Rollback

After successful rollback pre-validations, OneClick seeks approval from the user. On an affirmative response, it quickly switches the traffic in a single iteration, so the previous Blue (active) stack carries 100% of the incoming traffic again and the previous Green stack serves 0%.

The in-house OneClick engine has not only increased our confidence in releases through inter-step verifications and quick rollback options, but has also allowed us to leverage the Blue-Green approach without worrying about the cost of a parallel running stack. Alongside OneClick, our Site Reliability Engineering team is continuously building more automation and reliability platforms that cater to Dream11's high scale and operational needs.

That’s it from our end for now! Keep watching this space for more of our upcoming blogs, till then Happy Coding!
