To Scale In Or Scale Out? Here’s How We Scale at Dream11
For Dream11's 100 million users, the thrill and excitement of playing fantasy sports on our platform is unparalleled. They enjoy creating their own teams and competing with fellow fans and friends! From a backend perspective, however, we face significant variation in traffic and engagement, particularly just before a match starts. To keep the application running smoothly at these critical, high-traffic moments, we built a scalable and customisable solution. As a result, we are able to run multiple contests simultaneously and efficiently process millions of user requests per second without compromising the fantasy sports experience.
How do we manage traffic that fluctuates so drastically in such short intervals on the Dream11 platform? Before we answer that, let's look at the application's architecture for better understanding.
The architecture supports:
- A user base of more than 100 million
- User concurrency of over 5.5 million
- Over 100 million requests per minute (RPM) at edge services
- More than 30,000 compute resources to support peak IPL traffic
- More than 100 microservices running in parallel
Functional Challenges
The behavioural trend of Dream11 users is a very spiky one (as seen in the Traffic Surge diagram above). The number of users is dynamic and surges frequently, especially when they rush to create or edit fantasy teams and join contests. This also depends on multiple factors such as the popularity of the sport or match and the toss timeline (in the case of cricket matches). Such tsunamis of user traffic on the application can be divided into three main phases: 'before the match', 'during the match' and 'after the match' (including when the winners are declared).
Once users open the Dream11 application, their journey usually oscillates between the home page, tour, team and contest pages. The load on the application shifts accordingly, and the edge layers, their dependent services and microservices have to be scaled in or out to match.
Interestingly, user concurrency may rise or drop between events or after matches, and predicting it based on year-on-year user growth can be challenging.
Let us also consider the uncontrolled variables which generate spikes on the platform before, during and after matches:
- User interest depends on the real-life popularity of matches and players, which affects the volume of RPM.
- Match-specific events such as the toss, squad announcement, team creation by users post squad announcement, and mid-match events like the fall of wickets, sixes, hat-tricks, breakout events and other unpredictable factors such as rain.
The tsunami traffic, represented by the vertical spikes in the graph above, brings tremendous volatility and load to the infrastructure.
Challenges in Scaling
Auto Scaling (why it won’t work)
Autoscaling, at the infrastructure level, has several limitations. Its provisioning time is not fast enough to meet compute requirements during key events! A flood of users during tsunami traffic demands very short provisioning times to keep up with the spike; instance build time would leave users waiting.
- Spot availability of nodes is limited and highly competitive — especially at key hours!
- Step scaling may not work at this point either, as each scaling step is limited to a certain number of nodes, which is insufficient at Dream11's scale
- Rebalancing or rearranging nodes across availability zones (AZs) based on resource availability may add further provisioning time and cost.
Limitations of classic and application load balancers (CLB/ALB)
- Load balancers must be sharded based on throughput, as there is a limit on the number of requests a single load balancer can serve. For higher throughput driven by user concurrency, shards need to be created and managed per service routing.
- Pre-warming of ELBs is required in order to handle sudden surges of traffic.
Limitations of Cloud Control Plane (Cloud Provider)
- There are limits on our cloud provider's features too. Application Programming Interface (API) call rates are capped per account, and this needs to be considered while allocating resources
- Console operations are heavy due to the number of resources provisioned.
- Operational overhead of scaling out 100+ microservices.
Solution: Predictive - Proactive Scaling
Our Homegrown Concurrency Prediction Model
Our data science team at Dream11 has developed an XGBoost-based model to predict hourly user concurrency on the Dream11 platform, chosen after trying multiple models with hundreds of features.
We first run every match’s metadata through a linear model which gives the tier of the match. A match tier is an indicator variable for how popular the match will be.
Match tiers are then categorised (prioritised by high concurrency or demand) based on the past concurrency of similar matches.
The model then iterates over multiple features to predict the concurrency for a particular match. These can be hourly features such as the number of matches by tier in that hour (and in the hours around it), active users in previous hours/days, average transaction sizes, etc. All of this data goes through normalisation to account for Dream11's exponential growth.
On top of all that, we need a suitable cost-sensitive loss function that leaves no room for under-prediction. In all, we have over 200 variables and more data artistry than data science, making the XGBoost model work with very limited hyperparameter tuning.
As our data science team believes, “Error Analysis >>>> Hyperparameter Tuning”.
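A loss function with "no option for under-prediction" can be realised as a custom XGBoost objective that penalises under-prediction far more heavily than over-prediction, since under-provisioning infrastructure is far costlier than over-provisioning. The sketch below is illustrative only: the penalty factor and function names are our own, not Dream11's actual implementation.

```python
import numpy as np

# Hypothetical penalty factor: errors where we predicted too LOW are
# weighted this many times more than errors where we predicted too high.
UNDER_PENALTY = 10.0

def asymmetric_objective(preds, actuals):
    """Asymmetric squared-error objective in the (gradient, hessian)
    form that XGBoost's custom `obj` callback expects."""
    residual = preds - actuals            # negative => under-prediction
    weight = np.where(residual < 0, UNDER_PENALTY, 1.0)
    grad = 2.0 * weight * residual        # d/dpred of weight * residual^2
    hess = 2.0 * weight                   # second derivative
    return grad, hess
```

In real use this would be passed to `xgboost.train(..., obj=...)` (where the second argument is a `DMatrix` holding the labels); the asymmetry pushes the trained model toward erring on the side of over-predicting concurrency.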
Performance Testing & Gamedays
Based on the prediction model which provides concurrency estimates, the performance team holds ‘Game Days’ to benchmark infrastructure capacity along with factoring trends based on past matches.
The performance testing framework used is Torque.
The infrastructure provisioning framework is Playground (watch this space for more on this).
Using Playground to provision infrastructure and Torque to run performance tests, the performance team validates the following for business functionality, based on the user concurrency predictions.
Performance metrics & validations:
- Defining the application latency: the acceptable latency to serve the business
- Identifying individual service capacity
- Benchmarking compute and network throughput
- Identifying failure and saturation points of the applications
- Generating sudden spikes and surges to identify impact on backend infrastructure and identifying cascading effects in the architecture
- Testing infrastructure boundaries with respect to compute, network, API limits and operations.
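Generating the sudden spikes mentioned above requires a load profile that ramps up to a peak, holds, and ramps back down. A minimal sketch of such a profile generator (the function name and parameters are hypothetical, not part of Torque's actual API):

```python
def spike_profile(base_rps, peak_rps, ramp_s, hold_s):
    """Return a per-second list of target request rates for a spike test:
    a linear ramp from base to peak, a hold at peak, then a ramp back down."""
    up = [base_rps + (peak_rps - base_rps) * (i + 1) / ramp_s
          for i in range(ramp_s)]
    hold = [peak_rps] * hold_s
    down = list(reversed(up))
    return up + hold + down

# e.g. ramp 100 -> 1000 RPS over 5s, hold 3s, ramp back down
profile = spike_profile(100, 1000, 5, 3)
```

A load generator would then fire requests at each second's target rate, while the team watches for cascading failures and saturation points in the backend.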
Deployment optimisations to reduce provisioning time
- Fully baked Amazon Machine Images (AMIs) for deployments, with the application artefact included, for faster scaling
- Provisioning multiple compute instance types across Availability Zones (diversified), reducing capacity challenges
- Capacity Optimised allocation strategy for spot instances
- Cost optimisation ensuring 100% of resources run on spot instances
- Failure notifications in case of spot unavailability, with fallback to on-demand provisioning.
- Weighted domain name system (DNS) records to support ELB shards.
Scaling for the key hours using Scaler
- The DevOps and SRE teams at Dream11 have built a platform, 'Scaler', which enables proactive scaling as per the concurrency predictions and performance benchmarks.
- Based on performance tests against the predicted concurrency and trend, the system is fed different slabs of user concurrency and the corresponding infrastructure to provision across microservices before, during, and after the match.
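The slab mechanism can be sketched as a lookup that always rounds up to the next benchmarked capacity, so a prediction is never served by an under-provisioned plan. The slab boundaries, service names and instance counts below are invented for illustration:

```python
# Hypothetical slabs: (max predicted concurrency, per-service instance
# counts certified for that concurrency during Game Day benchmarks).
SLABS = [
    (1_000_000, {"edge": 40,  "contest": 30,  "team": 20}),
    (3_000_000, {"edge": 120, "contest": 90,  "team": 60}),
    (5_500_000, {"edge": 220, "contest": 170, "team": 110}),
]

def capacity_for(predicted_concurrency):
    """Pick the smallest slab whose ceiling covers the prediction,
    falling back to the largest slab for out-of-range predictions."""
    for ceiling, plan in SLABS:
        if predicted_concurrency <= ceiling:
            return plan
    return SLABS[-1][1]
```

A scheduler would then apply `capacity_for(...)` for the before-match, during-match and after-match predictions at the times the model indicates, scaling each microservice to its certified count.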
Technology and Cloud Services Used
🎖Results Achieved!
🚀 Improved quality with lower risk: Scaler helps the Dream11 DevOps/SRE team manage scale-in and scale-out operations much better, making our process far more efficient given the sheer volume of matches hosted on the platform.
🚀 Faster service: It streamlined daily operational processes and infrastructural tasks; what previously took hours to complete is now achieved in minutes.
🚀 Increased flexibility: Scaler helps us save a significant amount of time, as scaling operations are now based on schedules of matches. This increases operational efficiency and enables the DevOps/SRE team to focus on making engineering improvements.
🚀 Lower infrastructure cost: As the scaling operations are scheduled based on different slabs and tiers, overall capacity of infrastructure can be provisioned based on the events of matches. This reduces the monthly infrastructure cost by 50%.
🚀 Distinctive insights: Analytics based on the scaling events and user trends provide better feedback to the machine learning model. This makes it possible to predict organic and inorganic growth patterns of infrastructure and users, which in turn helps us predict the provisioning requirements for the future.
Future Scope of Work
As we mature the architecture, we are looking at predictive and scheduled scaling for containers and data stores. We are also looking to optimise our infrastructure cost further and to scale out in real time based on the spikes we see on the Dream11 platform. To achieve this, we are looking for talented engineers excited about solving infrastructure problems at scale and delivering a great product to Dream11 users.
Does solving the above challenges interest you? We at Dream11 have open positions for those who are SMEs in their domain. Apply Now.