Navigating the Streamverse: A Technical Odyssey into Advanced Stream Processing at Dream11
Discovering the Vastness of the Streaming Universe
In today’s rapidly evolving digital landscape, organizations face the daunting task of effectively managing and processing massive amounts of data in real-time. Real-time data processing has emerged as a game-changing solution, empowering businesses to unlock the true potential of data by instantly capturing, analyzing, and acting on it.
Real-time data processing operates by collecting, analyzing, and acting upon data as it is generated, eliminating delays. Unlike traditional batch processing methods, real-time processing provides instant insights, enabling businesses to act on them swiftly, whether that means personalizing the app experience, creating an offer, or scaling the infrastructure. It empowers businesses with the agility to detect patterns, anomalies, and trends in real-time, facilitating bold responses and immediate actions. This capability is particularly critical in time-sensitive industries such as gaming, finance, e-commerce, supply chain management, and customer support, serving use cases ranging from real-time data visualization to user engagement & personalization and real-time nudges.
To illustrate the significance of real-time data processing, consider the following graph showcasing the diminishing business value over time following a user event:
Therefore, it is essential to incorporate real-time decision-making into your organization to gain a competitive advantage and stay ahead in the rapidly evolving tech landscape.
Celestial Metamorphosis: Exploring the Evolution of Real-Time Data Processing in Dream11's Universe
In our journey to find the most appropriate stream processing framework, we initially relied on a combination of vanilla Kafka consumers and producers implemented in Java. This approach helped us leverage the core functionality of Kafka to handle real-time data streams. However, as our requirements became more complex and demanding, we sought out alternative solutions.
One such solution was the Spring Kafka framework, which provided a higher-level abstraction for Kafka consumers and producers. By utilizing this framework, we were able to streamline our development process and take advantage of its built-in features, such as automatic deserialization and serialization of messages, simplified configuration, and robust error handling.
While Spring Kafka proved to be a useful tool, certain pipelines within our system required more advanced stream processing capabilities. This led us to explore the KSQL processing framework, allowing us to perform real-time stream processing and analysis with a familiar SQL-like syntax. KSQL enabled us to quickly filter, transform, and aggregate data streams, making it a valuable addition to our stream processing arsenal.
Finally, after careful evaluation and benchmarking, we ultimately settled on Apache Flink as our preferred stream processing framework. Apache Flink's impressive performance and scalability made it the ideal choice to handle the high throughput demands of our business. Additionally, Flink's support for SQL, stateful computations and vast ecosystem of connectors helped make our stream processing pipeline more versatile and adaptable.
In conclusion, our journey towards finding the optimal stream processing framework led us through various options. However, Apache Flink ultimately met our requirements, providing us with the capability to handle the necessary throughput, scale, support for stateful computations, and a comprehensive set of connector packages.
Navigating the Flink Frontier: Challenges and Obstacles in the Initial Implementation Phase
During our journey of developing streaming pipelines using Apache Flink, we ran into several challenges and pain points that hindered our productivity and efficiency. Here are some of the key ones:
- Duplicated Code/Effort: For almost every use case, developers across teams and the org ended up writing similar code for repetitive tasks, e.g., reading data from the available sources and writing to the various available sinks. This resulted in a significant amount of duplicated effort and maintenance.
- Testing and Refactoring: Once the core logic was implemented, our developers had to spend considerable time testing and refining the processing logic. This iterative process involved reviewing the code to make corrections and improvements, leading to extra time and effort being spent.
- Manual Deployment: When our code was ready for production, we built the jar file and uploaded it to one of several existing Standalone Flink clusters or created a new one if required. If the existing cluster lacked sufficient resources to run the job, we would scale up the cluster before deployments. This manual deployment process was time-consuming and prone to human errors.
- Multiple Flink Clusters: To logically separate jobs based on criteria like business priority or high-level projects, we created multiple Flink clusters. This approach gave us separation by criticality, but managing the separate clusters increased the operational overhead.
- Alerting and Monitoring Challenges: Setting up alerting and monitoring for each Flink cluster proved to be a significant and time-consuming challenge. Furthermore, we manually maintained metadata mapping jobs to the clusters they ran on. Identifying failed jobs required cross-checking against this metadata to find the impacted cluster, a manual process that became increasingly error-prone and time-consuming as the number of clusters and jobs grew.
- Flink Session Cluster Limitations: Because we used Flink session clusters, multiple jobs shared the same cluster resources. This shared resource model posed a significant risk: in the event of a task manager failure, all the jobs running on that task manager would go into a restart loop, causing extensive downtime and impacting overall system reliability.
- Debugging and Deployment Iterations: When bugs were identified in production pipelines, resolving the issues involved multiple iterations of debugging the logic, rebuilding the code, deploying it and verifying the changes. This process took time and effort due to the above-mentioned obstacles.
Streamverse: Empowering Dream11's Interstellar Journey into Real-Time Data Processing
At Dream11, we are a data-driven team dedicated to creating personalized user experiences in the sports domain. To achieve this, we recognize the importance of deriving actionable insights from the massive amounts of data flowing into our systems, enabling us to adopt a user-first approach and deliver a truly personalized experience for our 190M+ users.
To unlock the full potential of our data-driven approach, we developed Streamverse, our in-house real-time stream processing platform. Streamverse is a universe of streams, along with entities called operators that perform operations on those streams, enabling us to process data in real-time. With these powerful primitives, we can effortlessly create data pipelines capable of processing billions of data points while tailoring our business logic to achieve the desired outcomes.
Unleashing Celestial Potential: Exploring The Benefits and Advantages of Streamverse
The Streamverse platform offers a range of compelling advantages for developers, empowering them to streamline their development processes and enhance overall efficiency:
- Reduced Development Time: By leveraging the platform, developers experience a significant reduction in end-to-end development and deployment time for real-time use cases. With the platform taking care of underlying complexities, developers can solely focus on core business logic, accelerating the development cycle.
- Ease of Debugging and Instant Validation: The platform facilitates seamless debugging and validation through its data preview feature. Developers can quickly examine and verify transformation logic within operators, saving valuable time by enabling rapid identification and resolution of issues.
- Managed Infrastructure: The platform alleviates concerns around deployment and resource allocation. Developers no longer need to worry about determining the optimal deployment location or allocating appropriate resources. The intuitive user interface (UI) makes job scaling effortless, allowing adjustments to job-level parameters such as vCPU cores, memory, and other core configurations as required.
- Observability: The platform ensures comprehensive observability by exposing all Flink job metrics to our observability platform. This enables monitoring of crucial metrics such as lag in data consumption, consumption speed, resource usage and other key metrics. With the ability to set up monitors, developers gain valuable insights into the health and performance of their jobs, facilitating proactive maintenance and optimization.
By harnessing the benefits of the Streamverse platform, developers can optimize their development processes, streamline debugging, enjoy managed infrastructure, and achieve heightened observability for their Flink-based applications.
Streamverse in Action: Diverse Use-Cases Amplified at Dream11
- Data movement: Streamverse enables smooth and effortless data movement across diverse systems and platforms at Dream11, by leveraging the wide array of sinks and processing operators at its disposal.
- Real-time alerting: Streamverse enables systems to generate timely notifications and alerts based on processed and/or aggregated real-time data streams.
- Feature generation for data science use cases: Streamverse provides capabilities to generate features and prepare data for data science and machine learning applications.
- Processing web application firewall logs: Streamverse helps derive valuable insights from web application firewall logs, allowing for prompt actions to be taken.
- Business-specific use cases for instant decision-making: Streamverse supports custom use cases where data processing is necessary to make immediate business decisions.
- Real-time visualization: Streamverse allows for the generation of aggregated data for real-time visualizations, enabling quick insights and analysis.
Immersing into Streamverse
Demystifying the Platform's Core Primitives for Efficient Data Processing
Streamverse leverages two core primitives: Streams and Operators. Streams represent real-time data flows, while operators perform operations on these streams. An operator can take N input streams and produce M output streams. This powerful combination forms the backbone of Streamverse, facilitating complex data pipelines and diverse business logic implementation.
Streams are continuous data flows generated and consumed in real-time. They consist of ordered events or records from various sources like IoT devices, social media feeds, log files, or any other data-producing system. Streams are dynamic and infinitely scalable, serving as reliable sources for real-time processing and timely decision-making.
Operators are the building blocks within Streamverse that transform and manipulate data. They perform operations such as filtering, aggregating, joining, transforming, enriching, and analyzing data in real-time. These operations yield immediate results for informed decision-making and prompt actions.
By harnessing streams and operators, Streamverse empowers developers to process and analyze data in real-time, enabling informed decisions and immediate actions. These core primitives establish a foundation for scalable, robust, and efficient data processing systems capable of handling the high volume, velocity, and variety of data in today's digital landscape.
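To make the two primitives concrete, here is a minimal sketch in plain Java that models a stream as a sequence of events and an operator as a stream-to-stream function. The `Event` record, the operator names, and the chaining style are all illustrative assumptions, not Streamverse's actual API.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Stream;

public class PipelineSketch {
    // A hypothetical event flowing through a stream.
    record Event(String user, String action, long points) {}

    // Operators are stream-to-stream functions that can be chained.
    static Function<Stream<Event>, Stream<Event>> filterOp =
            s -> s.filter(e -> e.points() > 50);           // keep high-value events
    static Function<Stream<Event>, Stream<String>> transformOp =
            s -> s.map(e -> e.user() + ":" + e.action());  // project fields

    public static void main(String[] args) {
        List<Event> source = List.of(
                new Event("u1", "join_contest", 80),
                new Event("u2", "view_page", 10));
        // Compose operators into a pipeline, then drain to a sink (here: stdout).
        List<String> out = filterOp.andThen(transformOp)
                .apply(source.stream()).toList();
        System.out.println(out); // [u1:join_contest]
    }
}
```

The same idea scales up in Streamverse: sources feed streams, operators compose into a pipeline, and sinks drain the transformed stream.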
Unleashing the Potential of Real-Time Data Processing with Streamverse
Our primary motivation behind the development of Streamverse was to simplify the intricate stream processing logic to the point where software developers can focus solely on designing, building, and validating the business logic. Streamverse takes care of all the additional tasks, such as establishing connections to sources and sinks, deployment, alerting, monitoring, and more.
Streamverse provides developers with a dynamic playground where they can seamlessly integrate streams as sources. Through a user-friendly drag-and-drop interface, operators can be effortlessly added to manipulate the data flowing within these streams. The transformed stream can then be directed to the desired location. The user interface offers a comprehensive visual representation of the data's journey, from the source to the destination, showcasing the applied transformations.
The Building Blocks: Unveiling various components of Streamverse for Efficient Data Processing
Streamverse simplifies the development of real-time pipelines through fundamental abstractions of Streams and Operators. By leveraging these abstractions, Streamverse equips developers with the necessary tools to handle complex data flow and manipulation within software engineering.
The Streamverse framework includes a variety of operators, such as source operators (e.g., Kafka Source Operator), transformation and enrichment operators, filtering and projection operators, join and groupBy operators, SQL and lookup operators (e.g., JDBC Lookup Operator, HTTP Lookup Operator), and sink operators (e.g., Kafka Sink Operator, JDBC Sink Operator). Additionally, the framework allows developers to implement custom operators to address specific use cases and requirements.
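A custom operator like those described above might be expressed as a small interface that maps an input stream to an output stream. The sketch below shows one such operator, a groupBy-count aggregation; the `Operator` interface and the operator's name are hypothetical stand-ins for Streamverse's real abstractions.

```java
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CustomOperatorSketch {
    // Hypothetical single-in, single-out operator contract.
    interface Operator<IN, OUT> {
        Stream<OUT> apply(Stream<IN> input);
    }

    // A groupBy-count operator: aggregates occurrences per key.
    static Operator<String, Map.Entry<String, Long>> countByKey = input ->
            input.collect(Collectors.groupingBy(k -> k, Collectors.counting()))
                 .entrySet().stream();

    public static void main(String[] args) {
        var counts = countByKey.apply(Stream.of("a", "b", "a"))
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
        System.out.println(counts); // {a=2, b=1}
    }
}
```

Implementing a new operator then amounts to supplying another function behind the same contract, which is what lets the framework's catalog grow without touching the pipeline machinery.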
Streamverse's Architectural Landscape
The Anatomy of Streamverse: Demystifying the Components that power the platform
The Streamverse service provides REST APIs to create jobs, utilize operators, and establish connections between them. It supports job planning and execution on a Flink cluster running on Kubernetes, manages the lifecycle of jobs, and monitors their status. It also enables efficient debugging by allowing the preview of job output at any node.
The core engine processes the job plan as a directed acyclic graph (DAG) and performs a topological sort to execute low-level code for each operator. It validates and executes the Flink job.
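The planning step described above can be sketched with Kahn's algorithm: treat the job plan as a DAG of operator nodes and topologically sort it to obtain a valid execution order. This is a minimal illustration under that assumption; the node names are made up, and the real engine works on richer operator metadata.

```java
import java.util.*;

public class DagPlanner {
    // Topologically sort a DAG given as adjacency lists (Kahn's algorithm).
    static List<String> topoSort(Map<String, List<String>> edges) {
        Map<String, Integer> indegree = new HashMap<>();
        edges.forEach((n, outs) -> {
            indegree.putIfAbsent(n, 0);
            for (String m : outs) indegree.merge(m, 1, Integer::sum);
        });
        Deque<String> ready = new ArrayDeque<>();
        indegree.forEach((n, d) -> { if (d == 0) ready.add(n); });
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String n = ready.poll();
            order.add(n);
            for (String m : edges.getOrDefault(n, List.of()))
                if (indegree.merge(m, -1, Integer::sum) == 0) ready.add(m);
        }
        // A leftover node means a cycle: the plan is not a valid DAG.
        if (order.size() != indegree.size())
            throw new IllegalStateException("cycle detected: plan is not a DAG");
        return order;
    }

    public static void main(String[] args) {
        // kafka-source -> filter -> kafka-sink
        Map<String, List<String>> plan = Map.of(
                "kafka-source", List.of("filter"),
                "filter", List.of("kafka-sink"));
        System.out.println(topoSort(plan)); // [kafka-source, filter, kafka-sink]
    }
}
```

Once the order is known, the engine can emit the low-level code for each operator in sequence, with every operator's inputs guaranteed to be defined before it runs.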
The Streamverse UI offers a user-friendly playground for creating nodes by dragging and dropping operators and connecting them. It provides the ability to preview node output by processing the pipeline with sample records. It simplifies job lifecycle management and offers visibility into all running jobs on the platform.
Cosmic Harmony: Uncovering the Job Orchestration Symphony of Streamverse
In Streamverse, the orchestration of Flink jobs is seamlessly managed through the utilization of the powerful Flink Kubernetes Operator. When a request is received by the operator, it assumes the pivotal role of creating job manager and task manager pods, diligently monitoring the status of the job throughout its lifecycle.
We have integrated custom plugins into the system to improve its orchestration capabilities. These plugins generate alerts on Slack, ensuring proactive notification of critical events. Additionally, we have established an incident management setup for jobs within the Flink Kubernetes Operator to ensure prompt attention and resolution. This integration further fortifies the robustness and reliability of our orchestration mechanism.
Smashing Stats: Streamverse handling Dream Sports’ Scale like a Pro
Streamverse is a highly capable real-time stream processing platform that efficiently works through massive volumes of data to surface real-time insights. Engineered with a strong emphasis on agile scalability, it orchestrates data with precision and efficiency.
In the demanding landscape of Dream11, with over 190 million users and a peak user concurrency of 10.56 million, the need for high-volume data processing is paramount. Streamverse rises to this challenge with ease, handling hundreds of running jobs that utilize thousands of streams and processing ~1.4 billion records per minute. Impressively, Streamverse has consistently demonstrated its robustness while managing billions of events in a single day.
A Universe of Possibilities: Streamverse's Future Advancements and Open Source Plans
Streamverse will introduce several exciting features for developers. These include a Stream Catalog, which will serve as a centralized repository for real-time streams within an organization, offering comprehensive stream details, usage insights, lineage functionality, and advanced access control lists (ACL) for security and governance.
A Streamverse SDK will aid developers by providing a versatile software development kit for effortless reading and writing of streams. It will also integrate with Dream Sports' backend services to enable efficient data exchange.
Additionally, Streamverse is developing a developer-friendly CLI (Command Line Interface) that will simplify stream and real-time job management. The CLI will allow users to create, edit, run, and preview Streamverse jobs and perform stream creation and management tasks.
Stay tuned as we continue to expand the Streamverse ecosystem and open-source these cutting-edge features, both those already built and those still in development.