Power of Machine Learning ahead of big matches
Dream11 is the world’s largest fantasy sports platform, with 150 million sports fans participating in over 10,000 contests. This makes predicting traffic patterns a formidable task, especially during peak seasons like the IPL and the World Cup, when traffic spikes from thousands to millions of concurrent users in just a few minutes!
Our vision is to Make Sports Better and offer our users an unmatched, world-class engagement experience. To do so, we have to ensure 100% uptime for all matches, which is where prediction-based scaling and scale-out infrastructure come in. Our mobile applications capture all user action events, and our microservices generate Application Performance Monitoring (APM) traces, system metrics, network metrics and DNS metrics. In total, we generate around 25+ terabytes of data and 24+ billion events on a popular match day, 2.5 times more than on regular days. Our machine learning algorithms predict demand by processing multiple data points, both historical and recent: the players on each team, the tournament, virality factors, and the playing patterns of new and repeat users.
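To make the demand-prediction idea concrete, here is a minimal sketch of what such a model could look like, assuming a gradient-boosted regressor over match-level features. The feature names, values and model choice are purely illustrative, not Dream11’s actual pipeline:

```python
# Illustrative sketch only: features and model are assumptions,
# not Dream11's production demand-prediction pipeline.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Historical matches with the kinds of signals mentioned above.
history = pd.DataFrame({
    "tournament_tier":    [3, 3, 2, 1, 3],      # e.g. 3 = IPL / World Cup
    "star_player_count":  [6, 4, 3, 1, 7],
    "virality_score":     [0.9, 0.6, 0.4, 0.2, 0.95],
    "repeat_user_ratio":  [0.7, 0.65, 0.6, 0.5, 0.72],
    "peak_concurrency_m": [5.8, 3.1, 1.9, 0.4, 6.2],  # millions of users
})

X = history.drop(columns="peak_concurrency_m")
y = history["peak_concurrency_m"]
model = GradientBoostingRegressor().fit(X, y)

# Predict expected peak concurrency for an upcoming match.
upcoming = pd.DataFrame([{"tournament_tier": 3, "star_player_count": 6,
                          "virality_score": 0.85, "repeat_user_ratio": 0.68}])
print(f"Predicted peak concurrency: {model.predict(upcoming)[0]:.2f}M users")
```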
For instance, the ICC T20 World Cup is one of the most anticipated sporting events, enjoyed by millions of cricket fans. During this season, our platform scaled to handle 6.21 million concurrent users at the edge layer.
We firmly believe that our Dreamsters play a critical role in offering an optimal user experience. Every service owner maintains readiness checklists and runbooks to handle unforeseen incidents, and we follow strict protocols to ensure that our customer service team can address incidents within a short turnaround time.
Using technology to solve some of our biggest challenges
Observability from a single dashboard lends itself to efficiency
Monitoring is critical for accelerated troubleshooting, since our network is complex and distributed across microservices. Our monitoring tools let us track ingress/egress traffic alongside application, infrastructure and DNS performance. Every minute before the round lock is crucial, and troubleshooting can take considerable time and effort if not managed efficiently, especially since everything from networking to application and business performance indicators needs to be considered. Our status pages and dashboards help us focus on the areas that require urgent attention during a fantasy sports contest: for example, the top Application Programming Interfaces (APIs) with response times above 200 ms, APIs with 5xx or 4xx error rates above 0.1%, or the top Relational Database Service (RDS) instances by connections established or Central Processing Unit (CPU) utilization. We have created a bird’s-eye view of the entire Dream11 infrastructure on a unified dashboard, through which we can quickly navigate issues and bring down the Mean Time To Detect (MTTD) and Mean Time To Resolution (MTTR). Our monitoring tool can build relationships between CloudWatch metrics, APM metrics, network metrics, and logs.
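As a rough illustration of the triage these dashboards enable, a sketch like the following could flag services breaching the thresholds mentioned above; the API names and statistics are made up:

```python
# Hypothetical triage helper using the thresholds from the dashboards above:
# 200 ms response time and a 0.1% 4xx/5xx error rate.
api_stats = [
    {"api": "/teams/create",  "p95_ms": 240, "error_rate_pct": 0.02},
    {"api": "/contests/join", "p95_ms": 130, "error_rate_pct": 0.35},
    {"api": "/leaderboard",   "p95_ms": 95,  "error_rate_pct": 0.01},
]

LATENCY_MS, ERROR_PCT = 200, 0.1

for s in api_stats:
    issues = []
    if s["p95_ms"] > LATENCY_MS:
        issues.append(f"slow ({s['p95_ms']} ms)")
    if s["error_rate_pct"] > ERROR_PCT:
        issues.append(f"erroring ({s['error_rate_pct']}% 4xx/5xx)")
    if issues:
        print(f"{s['api']}: " + ", ".join(issues))
```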
Realtime Event Stream: CloudWatch metrics lag and API rate limit issue
We used to send metrics to our monitoring tool using an Amazon Web Services (AWS) integration, which scraped CloudWatch data and stored it as integration metrics. During mega events we generate over 700 million CloudWatch metrics per day, and this process ran at a configurable interval of two to five minutes. Due to the high volume of metrics, we ran into API rate limits and a delay of roughly 20 to 30 minutes, which increased the number of false and delayed alerts. We then deployed a CloudWatch Metric Streams solution. We can now send metrics to any primary monitoring solution provider with relatively low latency, reducing the delay to a negligible one to two seconds.
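For readers who want to try the same approach, a metric stream can be created with a few lines of boto3. The stream name, Firehose ARN and IAM role below are placeholders, and a Kinesis Data Firehose delivery stream pointing at the monitoring vendor must already exist:

```python
# Sketch of wiring up CloudWatch Metric Streams with boto3. All names and
# ARNs here are placeholders, not Dream11's real resources.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-south-1")

cloudwatch.put_metric_stream(
    Name="match-day-metric-stream",  # hypothetical stream name
    FirehoseArn="arn:aws:firehose:ap-south-1:123456789012:deliverystream/monitoring",
    RoleArn="arn:aws:iam::123456789012:role/metric-stream-role",
    OutputFormat="opentelemetry0.7",
    # Stream only the namespaces we care about during mega events.
    IncludeFilters=[{"Namespace": "AWS/EC2"}, {"Namespace": "AWS/RDS"}],
)
```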
Performance Testing & Benchmarking
Performance testing is another crucial step in the life cycle of any software. We run regular chaos and load tests to surface faults and establish benchmarks for each technical component.
All our apps have intelligent user handling to ensure that the user experience is not hampered even when backend applications have degraded performance.
Using our network monitoring, we can sniff out network abnormalities quickly; a clear signal to validate this is checking TCP retransmits by Availability Zone (AZ). We have multiple slice-and-dice options for analyzing network performance, so we can filter traffic by availability zone, service, environment, domain, host, IP, VPC, port, region, and IP type (public, private or local).
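A toy version of the retransmits-by-AZ check might look like this; the flow records and counts are invented for illustration:

```python
# Hypothetical slice-and-dice over exported network flow records: group TCP
# retransmits by availability zone to spot an unhealthy AZ quickly.
from collections import defaultdict

flows = [
    {"az": "ap-south-1a", "retransmits": 12},
    {"az": "ap-south-1b", "retransmits": 841},
    {"az": "ap-south-1a", "retransmits": 9},
    {"az": "ap-south-1c", "retransmits": 15},
]

by_az = defaultdict(int)
for f in flows:
    by_az[f["az"]] += f["retransmits"]

# Sort descending so the outlier AZ surfaces first.
for az, count in sorted(by_az.items(), key=lambda kv: -kv[1]):
    print(f"{az}: {count} TCP retransmits")
```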
For example, our APM tool provides end-to-end distributed tracing from frontend devices to databases. By seamlessly correlating distributed traces with frontend and backend data, it automatically monitors service dependencies and latency, helping us eliminate errors so our users get the best possible experience. Distributed tracing gives us visibility into the lifetime of a request as it travels across several systems.
This is immensely useful for debugging and figuring out where the application spends the most time. Our Service Map inspects all the services to collect their RED metrics (Rate, Errors, Duration) and their dependencies, with filtering by Application Service, Databases, Caches, Lambda Functions and Custom Scripts. The service map reflects state almost in real time; the monitoring agent sends data to our tool every 10 seconds. The map shows a service in green if no issues are detected and in red if any problems are picked up, based on the monitor configured for each service.
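As a hedged sketch, RED metrics could be derived from a window of trace spans roughly like this; the span fields and service names are assumptions, not our tool’s actual schema:

```python
# Illustrative computation of RED metrics (Rate, Errors, Duration) per service
# from a batch of trace spans; field and service names are invented.
from collections import defaultdict

spans = [
    {"service": "contest-svc", "duration_ms": 42,  "error": False},
    {"service": "contest-svc", "duration_ms": 310, "error": True},
    {"service": "wallet-svc",  "duration_ms": 18,  "error": False},
]
WINDOW_S = 10  # agent reporting interval mentioned above

grouped = defaultdict(list)
for s in spans:
    grouped[s["service"]].append(s)

for svc, svc_spans in grouped.items():
    rate = len(svc_spans) / WINDOW_S                           # Rate
    errors = sum(s["error"] for s in svc_spans) / len(svc_spans)  # Errors
    duration = sum(s["duration_ms"] for s in svc_spans) / len(svc_spans)  # Duration
    status = "red" if errors > 0 else "green"
    print(f"{svc}: {rate:.1f} req/s, {errors:.0%} errors, "
          f"{duration:.0f} ms avg -> {status}")
```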
Being future-ready for managing scale
We conduct predictive scaling based on our data science model, which predicts the concurrency expected for each match. Based on this prediction and our load tests, we benchmark the numbers needed to scale over 100 microservices. Previously, we set Auto Scaling group (ASG) capacities manually and validated them regularly, which was complex and time-consuming. The process is now automated, and our monitoring tool helps us verify the data faster. We have also configured several alerts that help us locate and resolve issues.
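A minimal sketch of prediction-driven scaling, assuming each service maps predicted concurrency to a benchmarked instance count; the ASG name and the instances-per-million factor are hypothetical:

```python
# Sketch only: translate a predicted concurrency into an ASG desired capacity.
# The service-to-capacity ratio would come from load-test benchmarks.
import math
import boto3

autoscaling = boto3.client("autoscaling", region_name="ap-south-1")

def scale_for_prediction(asg_name: str, predicted_concurrency_m: float,
                         instances_per_million: int) -> None:
    """Set the ASG's desired capacity from the predicted concurrency."""
    desired = math.ceil(predicted_concurrency_m * instances_per_million)
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=asg_name,
        DesiredCapacity=desired,
        HonorCooldown=False,  # pre-match scale-out should not wait on cooldowns
    )

# e.g. a benchmark saying a service needs 40 instances per million users:
scale_for_prediction("contest-svc-asg", predicted_concurrency_m=6.21,
                     instances_per_million=40)
```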
We send the expected capacity from our benchmarks, along with the actual capacity from CloudWatch, to our monitoring system on an hourly basis. Comparing the two helps us detect provisioning issues faster.
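A simple comparison along these lines, with invented service names and numbers, could surface under-provisioned services:

```python
# Hypothetical check: benchmarked (expected) capacity vs actual capacity
# reported by CloudWatch, flagging any shortfall.
expected = {"contest-svc": 248, "wallet-svc": 120, "feed-svc": 60}
actual   = {"contest-svc": 248, "wallet-svc": 96,  "feed-svc": 60}

for svc, want in expected.items():
    have = actual.get(svc, 0)
    if have < want:
        print(f"ALERT {svc}: provisioned {have}/{want} instances")
```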
Whenever there are unpredictable spikes, such as during a delayed match toss, unexpected rain, or when a key player is left out of the lineups, the scaling mechanism adapts to prevent a bottleneck. The network can scale up or down to suit traffic requirements without manual intervention, and all of this usually happens in the few minutes before the match starts.
As a team, we also love experimenting and regularly introduce new features. For the T20 World Cup, we created Team Share & Guru Teams, wherein users could share teams with their friends and use teams created by experts. We also experimented with KnockOut and Gladiator formats for Gameplay, and received very good feedback from users. Since users also love competing against their family and friends during tournaments, we simplified the group leaderboard calculation and updated it faster after each match. As an added surprise, we introduced the much-awaited emoji reactions in Chat Messages.
What’s next
The tech team is now busy incorporating the learnings from the recent tournament and getting our systems ready for the Indian Premier League (IPL) in 2023. Next year’s IPL will be bigger than 2022’s, which means Dream11’s systems need to be even more robust, resilient and reliable to ensure the best user experience.
- Authors: Dream11 (@Dream11Engg)