Data Feast — A Highly Scalable Feature Store at Dream11

At Dream11, we have seen tremendous growth in our user base to become the world’s largest fantasy sports platform. Our users can demonstrate their knowledge and skills in sports and connect more deeply with the sport they love. With an ever-growing user base that currently stands at over 120 million, we work meticulously on our data platform to handle the sheer scale of data generated every second. Our goal is to build self-serve data products, be a data-obsessed team, and promote the concept of ‘data as a service’ within the organisation.

We started building a feature store to cater to our Machine Learning (ML) and in-house customer data platform needs, staying true to our goal. Here’s our journey of building Data Feast — our feature store product built entirely in-house.

What is a feature store?

A feature store is a data platform product that makes it easy to build, search, deploy, and use features for various use cases.

Here is a quick refresher on what a feature, a feature value, and a feature group/entity are.

A feature store is a central framework that processes, stores, and serves the data that powers various use cases across ML models and our customer data platform. It helps transform raw data into feature values, store them, and serve them for model training and online inferencing. It also helps reduce feature duplication by providing a registry/catalog for searching feature details. By automating all these steps, the feature store minimises duplicated data engineering effort, speeds up the ML models’ time to production, and improves collaboration across data science teams.
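To make these terms concrete, here is a minimal, hypothetical sketch of how a feature, a feature value, and a feature group could be modelled. The field names are illustrative assumptions, not Data Feast’s actual schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, List


@dataclass
class Feature:
    name: str            # e.g. "avg_contests_joined_7d" (hypothetical feature)
    dtype: str           # e.g. "double"
    description: str
    version: int = 1


@dataclass
class FeatureGroup:
    name: str            # e.g. "user_engagement" (hypothetical group)
    entity: str          # the key the features are computed for, e.g. "user_id"
    owner: str
    features: List[Feature]


@dataclass
class FeatureValue:
    entity_id: str       # a specific entity, e.g. one user id
    feature_name: str
    value: Any
    event_timestamp: datetime  # when this value became true in the real world
```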

Feature Store components

Five significant components that together form a feature store are:

1. Storage: This is the heart of a feature store. It provides capabilities for storage and retrieval of feature data for training and inference purposes.

2. Serving: The serving layer is responsible for serving feature data to models in a consistent format across training and inferencing.

3. Registry: A centralized registry is one of the essential components of a feature store. It enables standardized feature definitions and metadata for discovery and serves as a single source of truth for information about a feature.

4. Monitoring and Alerting: When something goes wrong with an ML setup, it usually points to a data problem. A feature store should detect such issues and raise alerts when they arise. The monitoring and alerting component calculates metrics on the data stored and served, describing the correctness and quality of the features in question.

5. Transformation: In all ML applications, we require capabilities to process data to generate features that the models can then consume for inferencing and prediction. Feature stores should thus have out-of-the-box capabilities to perform data transformations that produce these features.

The Feature Store at Dream11, aka Data Feast

As we at Dream11 began experimenting with, adapting, and creating many intelligent ML models, we saw exponential growth in the features generated by our various teams for multiple use cases. Every team developed hundreds, if not thousands, of features across multiple domains for various individual projects. We felt the need for a system that could handle all requirements for storing and serving features in a consolidated, easy, effective, and standard way across all teams, projects, and models. We also foresaw the need for a central repository that helps discover features and their metadata, thus reducing duplicated effort across teams.

With all this in mind, we set out to build a system that would help solve all the above problems, and this is how Data Feast was envisioned and brought to life at Dream11.

Where we currently stand with Data Feast

Here is how we built the five significant components of feature stores that we talked about earlier.

Storage

In terms of storage requirements, a feature store has two primary storage tiers, online and offline, to support the different needs of model inferencing and training.

The online storage tier

The online storage tier should offer high-throughput (typically millions of read requests per second) and low-latency (single-digit millisecond) read capabilities. These are typically implemented with key-value stores. An online store usually holds the latest version of the feature value for a given entity and feature.

For the first iteration of the online store, we decided to use HBase with Apache Phoenix due to its storage scalability and high ingestion throughput with low-latency reads. The ability to run SQL queries on top of the feature data through the Phoenix interface was an added advantage for specific use cases being developed. We have built and heavily used systems and frameworks with HBase and Phoenix to meet many requirements for high-throughput ingestion and fast retrieval over terabytes of data.

As adoption and the scale of data and usage grew, we noticed performance bottlenecks with respect to throughput, mainly due to the limitation of the Phoenix query servers running on the EMR master nodes, which led us to rethink our online store strategy.

After much brainstorming, we concluded that a single store would not solve all our requirements perfectly. A store like HBase with Phoenix gave us decent throughput with the added functionality of SQL aggregate querying for specific situations, but it did not meet our expectations on latency and throughput once features and data reached scale. We decided to move to a hybrid online store model, so we could dynamically introduce an additional online store based on the use case at hand. This way, we are not held back by a single store and its limitations, and we can use the best parts of multiple stores to solve all kinds of use cases.

During our search for a store that could meet our very high throughput, single-digit millisecond latency requirement, we came across RonDB, the open-source distribution of NDB Cluster. RonDB claims to provide LATS: low Latency, high Availability, high Throughput, and scalable Storage, and it had good benchmark numbers to back that claim (more details on benchmarks available here). We performed a proof of concept (POC) with RonDB and realized it would work for us based on the benchmark numbers. RonDB (NDB stands for Network Database) is accessible via a MySQL server cluster setup, which was another plus point for us in terms of easy scalability and a SQL interface.
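Because RonDB exposes a standard MySQL interface, an online feature lookup can be issued as plain SQL. Below is a minimal sketch of such a lookup using the mysql-connector-python client; the host, credentials, table, and column names are hypothetical, not Data Feast’s actual schema.

```python
import mysql.connector

# Connect to the MySQL endpoint fronting the RonDB (NDB) cluster.
# Host, credentials, and schema below are illustrative assumptions.
conn = mysql.connector.connect(
    host="rondb-mysql.internal",
    user="feast_reader",
    password="********",
    database="online_store",
)


def get_latest_feature(entity_id: str, feature_name: str):
    """Fetch the latest value of one feature for one entity (hypothetical table)."""
    cursor = conn.cursor(dictionary=True)
    cursor.execute(
        """
        SELECT value, event_timestamp
        FROM feature_values
        WHERE entity_id = %s AND feature_name = %s
        ORDER BY event_timestamp DESC
        LIMIT 1
        """,
        (entity_id, feature_name),
    )
    row = cursor.fetchone()
    cursor.close()
    return row


print(get_latest_feature("user_42", "avg_contests_joined_7d"))
```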

The offline storage tier

For the offline storage tier, we currently maintain all the feature data on S3 in Parquet format. Data is stored at the feature group level and is partitioned by feature, version, and date for easier querying. We also maintain a de-duplicated copy of feature values in Delta format for quicker offline data processing in bulk/batch fashion.
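As an illustration, reading a slice of the offline store for training could look like the PySpark sketch below; the S3 paths and filter columns are assumptions based on the layout described above, not the exact Data Feast paths.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("offline-feature-read").getOrCreate()

# Hypothetical feature-group-level path, partitioned by feature / version / date.
history = (
    spark.read.parquet("s3://feature-store/offline/user_engagement/")
    .where("feature = 'avg_contests_joined_7d' AND version = 1")
    .where("date >= '2022-01-01' AND date <= '2022-01-31'")
)

# The de-duplicated Delta copy (latest value per entity) suits bulk/batch jobs.
# Requires the Delta Lake package on the Spark classpath.
latest = spark.read.format("delta").load(
    "s3://feature-store/offline-latest/user_engagement/"
)

training_df = history.select("entity_id", "value", "event_timestamp")
training_df.show(5)
```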

To keep data consistent across both storage tiers, we use Kafka clusters as a front door. All data that needs to be written to the feature store is simply published to Kafka. This setup gives us the flexibility to implement our own consumers or use Kafka Connect clusters to sink data to multiple destinations (the online and offline stores). This ensures data is available in both stores and that there are no inconsistencies between the online and offline storage tiers.
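A minimal sketch of publishing a feature value through this front door is shown below, using the kafka-python client. The topic name and message layout are hypothetical; in practice the message format would be governed by the schema registry described later, rather than raw JSON.

```python
import json
from datetime import datetime, timezone

from kafka import KafkaProducer

# JSON serialisation is used here only for illustration.
producer = KafkaProducer(
    bootstrap_servers=["kafka-broker-1:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "feature_group": "user_engagement",        # hypothetical feature group
    "feature_name": "avg_contests_joined_7d",  # hypothetical feature
    "entity_id": "user_42",
    "value": 3.5,
    "event_timestamp": datetime.now(timezone.utc).isoformat(),
}

# Kafka Connect sink connectors (or custom consumers) fan this event out to the
# online and offline stores, keeping the two tiers consistent.
producer.send("feature-store.user_engagement", value=event)
producer.flush()
```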

Serving

The serving layer is available in two parts: a low-latency, scalable REST API service and an SDK written in Python and Java.

In tandem with our registry service, the REST API knows which feature is stored in which datastore. Based on the features requested, the API service pulls the feature values from the appropriate online store and serves the result to the user. The serving layer we built is currently a Spring Boot application backed by HikariCP for connection pooling, which helps us achieve good throughput in terms of requests per second.
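From a client’s perspective, an online feature lookup against this service could look like the sketch below. The endpoint path and request payload are hypothetical stand-ins, since the actual API contract is internal to Data Feast.

```python
import requests

# Hypothetical endpoint and payload shape for an online feature lookup.
FEAST_SERVING_URL = "http://data-feast-serving.internal/api/v1/features"

payload = {
    "feature_group": "user_engagement",
    "features": ["avg_contests_joined_7d", "last_login_days_ago"],
    "entity_id": "user_42",
}

# A tight timeout reflects the single-digit-millisecond serving expectation.
response = requests.post(FEAST_SERVING_URL, json=payload, timeout=0.05)
response.raise_for_status()

# Assumed response shape: a map of feature name -> latest value.
print(response.json())
```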

Our SDK for Data Feast is another touchpoint for retrieving data from the feature store. The SDK is written in Python and Java to cater to all codebases, and also comes with support for Spark (Java/Scala) and PySpark. It provides easy-to-use functions and methods to interact with Data Feast for data storage and retrieval.

Registry

A registry is the most critical component of any feature store. Keeping this in mind, we built a REST API service using Spring Boot, deployed in high-availability mode. An RDS instance backs the service for metadata storage. We maintain all feature-related metadata such as name, description, owner, domain, team, storage tiers enabled (offline, online, or both), versions, data types, and other operational data.

The metadata in the registry sets the tone for Data Feast to function. We have automated jobs that use the registry to schedule and configure data ingestion and storage of features. The serving APIs utilise the registry to get information on which feature values are available, where to find them and how to access these features.

The registry service exposes APIs for feature group/feature addition, updates, deletion, and metadata retrieval, and maintains the source of truth for all feature-related metadata. It helps us make the entire feature registration and metadata retrieval process as self-serve as possible. We also use the Confluent Schema Registry to maintain the schemas of data flowing to and from Kafka. Additionally, the Data Feast registry service is responsible for other operational tasks such as registering feature schemas with the Confluent Schema Registry, creating Kafka topics, and adding sink connectors (Kafka Connect).
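To illustrate the self-serve flow, registering a feature group could look something like the sketch below. The endpoint and payload fields are hypothetical, modelled on the metadata fields listed above rather than the actual registry API.

```python
import requests

REGISTRY_URL = "http://data-feast-registry.internal/api/v1/feature-groups"  # hypothetical

feature_group = {
    "name": "user_engagement",
    "description": "Engagement signals computed per user",
    "owner": "data-science-growth",
    "domain": "engagement",
    "team": "growth",
    "storage_tier": "both",          # offline, online, or both
    "entity": "user_id",
    "features": [
        {"name": "avg_contests_joined_7d", "data_type": "double", "version": 1},
        {"name": "last_login_days_ago", "data_type": "int", "version": 1},
    ],
}

# Registering the group would (behind the scenes) also register the schema with
# the Confluent Schema Registry, create the Kafka topic, and add sink connectors.
response = requests.post(REGISTRY_URL, json=feature_group, timeout=5)
response.raise_for_status()
print(response.json())
```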

Monitoring and Alerting

There are two major categories of metrics we need to capture to know the state and health of a feature store. The first is operational metrics about the production system as a whole, and the second is data-related metrics such as quality, freshness, and drift.

As a first iteration, we focused on the operational metrics for the system’s core functionality.

We track metrics related to storage (availability, capacity, utilization, etc.), feature serving (throughput, latency, error rates, etc.), ingestion of feature data into the respective stores (job success rate, throughput, processing lag, etc.), and more. We have thresholds set on these metrics that, when breached, raise an incident for action to be taken.
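A stripped-down sketch of such a threshold check is shown below; the metric names, values, and the incident hook are hypothetical placeholders for whatever monitoring and paging systems sit behind Data Feast.

```python
from dataclasses import dataclass


@dataclass
class MetricThreshold:
    metric_name: str
    max_value: float  # breach when the observed value exceeds this


# Illustrative thresholds on serving latency and ingestion lag.
THRESHOLDS = [
    MetricThreshold("serving_p99_latency_ms", max_value=10.0),
    MetricThreshold("ingestion_processing_lag_s", max_value=300.0),
]


def check_and_alert(observed: dict) -> None:
    """Raise an incident (here, just print) for every breached threshold."""
    for t in THRESHOLDS:
        value = observed.get(t.metric_name)
        if value is not None and value > t.max_value:
            # In production this would page the on-call / open an incident.
            print(f"INCIDENT: {t.metric_name}={value} breached max {t.max_value}")


check_and_alert({"serving_p99_latency_ms": 14.2, "ingestion_processing_lag_s": 45.0})
```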

We are currently building a component to start capturing data-related metrics such as data quality, drift, training-serving skew, freshness, and profiling.

Transformation

During the initial phase of building Data Feast, a lot of existing data pipelines that created feature values were already in production. From the beginning, it was crucial for us to gradually migrate these older pipelines to Data Feast and ensure all new pipelines are built on it. To achieve seamless integration of feature engineering and computation, we decided to use Data Titan as the transformation layer for Data Feast.

Data Titan is a compute abstraction layer product that we built entirely in-house. It allows us to run data computation across various data stores with an option to run federated queries. We can decide on multiple destinations for the computed data, Data Feast being one of them. Data Titan is available in two forms at Dream11 — Batch and Real-time.

As the name suggests, Data Titan Batch allows computation on batch data from various sources like the data lake or different data warehouses. Data Titan is intelligent enough to understand where the data resides and directly run the computation on the data source. It also uses Apache Spark to run federated queries based on data sources and destinations.

Data Titan Real-time lets us run computations on real-time data sources like Kafka and store the results in Data Feast. Apache Flink backs this system for all real-time computational needs.

We will soon be releasing another blog that talks more about Data Titan and how it works, so stay tuned!

The high-level architecture diagram shown below brings the system’s current state together.

The future of Data Feast

At Dream11, we have just begun our journey in the world of Feature Stores, and there is a long way to go. Here are some significant milestones that we have envisioned for the immediate future:

Integration of Delta Lake or Apache Hudi format for time-travel capabilities on the feature data: We are currently evaluating both for feasibility and the other advantages they provide (see the sketch after this list for what time travel enables).

Integration of Data Feast with Data Titan: The purpose is to provide an easy and seamless experience of generating features for the end-user.

Tackling issues proactively: We are working on building a feature data monitoring, profiling, and alerting component to catch and tackle any data-related issues proactively.

Further perfecting Data Feast: The feature store and data space are ever-evolving, and there is always scope for improvement at every step at any given moment. We continuously work to evolve our systems for scale and performance as Dream11 grows.
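As a taste of the time-travel capability mentioned above, this is roughly what reading an older snapshot of a feature group looks like with Delta Lake’s public API; the path is a hypothetical offline store location, and Apache Hudi offers an equivalent point-in-time query capability.

```python
from pyspark.sql import SparkSession

# Requires the delta-spark package / Delta Lake jars on the Spark classpath.
spark = SparkSession.builder.appName("feature-time-travel").getOrCreate()

path = "s3://feature-store/offline-latest/user_engagement/"  # hypothetical path

# Read the feature group as it was at a specific table version...
as_of_version = spark.read.format("delta").option("versionAsOf", 12).load(path)

# ...or as it was at a specific point in time, e.g. when a model was trained.
as_of_time = (
    spark.read.format("delta")
    .option("timestampAsOf", "2022-06-01 00:00:00")
    .load(path)
)

as_of_time.show(5)
```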

Are you interested in solving problems related to data at multi-petabyte scale, handling and processing billions of data points per day, or working with passionate and innovative minds on the turf? We are currently hiring across all levels! Apply here to join us. More exciting stuff to come from the Data Team at Dream11! Stay tuned!

Reference Links:

https://hbase.apache.org/

https://phoenix.apache.org/

https://kafka.apache.org/

https://www.confluent.io/product/confluent-connectors/

https://www.rondb.com/
