Data Feast — A Highly Scalable Feature Store at Dream11
At Dream11, we have seen tremendous growth in our user base to become the world’s largest fantasy sports platform. Our users demonstrate their knowledge and skills in sports and connect more deeply with the sports they love. With an ever-growing user base that currently stands at over 120 million, we work meticulously on our data platform to handle the sheer scale of data generated every second. Our goal is to build self-serve data products, be a data-obsessed team, and promote the concept of ‘data as a service’ within the organisation.
We started building a feature store to cater to our Machine Learning (ML) and in-house customer data platform needs, staying true to our goal. Here’s our journey of building Data Feast — our feature store product built entirely in-house.
What is a feature store?
A feature store is a data platform product that makes it easy to build, search, deploy, and use features for various use-cases.
Here is a quick refresher on what a feature, a feature value, and a feature group/entity are.
A feature store is a central framework that processes, stores, and serves the data powering various use-cases across ML models and our customer data platform. It helps transform raw data into feature values, store them, and serve them for model training and online inferencing. It also helps reduce feature duplication by providing a registry/catalog for searching feature details. By automating all these steps, a feature store minimises duplication of data engineering effort, speeds up ML models’ time to production, and improves collaboration across data science teams.
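To make the refresher concrete, here is a small illustrative sketch: a feature describes a property of an entity, a feature value is that property materialised for one entity at a point in time, and a feature group bundles related features for the same entity. All entity, feature, and value names below are hypothetical, not Data Feast’s actual schema.

```python
# Illustrative only -- not Data Feast's actual schema or API.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Feature:
    name: str          # e.g. "total_contests_joined_7d"
    dtype: str         # e.g. "int"
    description: str

@dataclass
class FeatureValue:
    entity_id: str     # the entity this value belongs to, e.g. a user id
    feature_name: str
    value: object
    event_timestamp: datetime

# A feature group bundles related features computed for the same entity type
user_engagement = {
    "entity": "user_id",
    "features": [
        Feature("total_contests_joined_7d", "int",
                "Contests joined by the user in the last 7 days"),
        Feature("favourite_sport", "string",
                "Sport the user interacts with the most"),
    ],
}

# One concrete feature value for a single user at a point in time
fv = FeatureValue("user_123", "total_contests_joined_7d", 4, datetime.utcnow())
```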
Feature Store components
Five significant components that together form a feature store are:
1. Storage: This is the heart of a feature store. It provides capabilities for storage and retrieval of feature data for training and inference purposes.
2. Serving: The serving layer is responsible for serving feature data to models in a consistent format across training and inferencing.
3. Registry: A centralized registry is one of the essential components of a feature store. It enables standardized feature definitions and metadata for discovery and serves as a single source of truth for information about a feature.
4. Monitoring and Alerting: Usually, when something goes wrong with an ML setup, the root cause is a data problem. A feature store should detect such issues and raise alerts when they arise. Monitoring and alerting compute metrics on the data stored and served, describing the correctness and quality of the features in question.
5. Transformation: In all ML applications, we require capabilities to process data to generate features that the models can then consume for inferencing and prediction. Feature stores should thus have out-of-the-box capabilities to perform data transformations that produce these features.
The Feature Store at Dream11, aka Data Feast
As we at Dream11 began experimenting, adapting, and creating many intelligent ML models, we saw exponential growth in the features generated by our various teams for multiple use-cases. Every team developed hundreds, even thousands, of features across multiple domains for various individual projects. We felt the need to build a system that could manage and handle all requirements of storing and serving features in a consolidated, effective, and standard way across all teams, projects, and models. We also foresaw the need for a central repository that helps discover features and their metadata, thus reducing duplicate efforts across teams.
With all this in mind, we set out to build a system that would help solve all the above problems, and this is how Data Feast was envisioned and brought to life at Dream11.
Where we currently stand with Data Feast
Here is how we built the five significant components of a feature store that we talked about earlier.
Storage
In terms of storage requirements, there are two primary storage tiers in a feature store — online and offline — to support the different needs of inferencing and training models.
The online storage tier
The online storage tier needs high-throughput (often multi-million read requests per second) and low-latency (single-digit millisecond) read capabilities. These are typically implemented with key-value stores. An online store usually holds the latest version of the feature value for a given entity in a feature.
For the first iteration of the online store, we decided to use HBase with Apache Phoenix due to its scalability in terms of storage and its high ingestion throughput with low-latency reads. The ability to run SQL queries over the feature data through the Phoenix interface was an added advantage for specific use cases under development. We have heavily used HBase and Phoenix, and built systems and frameworks on top of them, to meet many requirements for high-throughput ingestion and quick retrieval over terabytes of data.
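For illustration, feature values in such a setup can be read over the Phoenix Query Server with plain SQL. The sketch below uses the open-source phoenixdb client; the endpoint, table, and column names are hypothetical, not Data Feast’s actual schema.

```python
# Minimal sketch of reading a feature value through the Phoenix Query Server
# using the open-source phoenixdb client. The endpoint, table, and column names
# are illustrative only.
import phoenixdb

# Phoenix Query Server endpoint (runs alongside the EMR cluster in this setup)
conn = phoenixdb.connect("http://phoenix-query-server:8765/", autocommit=True)
cursor = conn.cursor()

cursor.execute(
    "SELECT FEATURE_VALUE FROM USER_FEATURES "
    "WHERE ENTITY_ID = ? AND FEATURE_NAME = ?",
    ("user_123", "total_contests_joined_7d"),
)
row = cursor.fetchone()
print(row[0] if row else None)
```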
As adoption and the scale of data and usage grew, we noticed performance bottlenecks in throughput, mainly due to the limitations of the Phoenix Query Servers on the EMR master nodes, which led us to rethink our online store strategy.
After much brainstorming, we concluded that a single store would not solve all our requirements perfectly. A store like HBase with Phoenix gave us decent throughput with the added functionality of SQL aggregate querying for specific situations. Still, it did not meet our latency and throughput expectations once features and data reached scale. We decided to move to a hybrid online store model so we could dynamically introduce an additional online store based on the use case at hand. This way, we are not held back by any single store and its limitations, and we can use the best parts of multiple stores to solve all kinds of use cases.
During our search for a store that could meet our very high throughput, single-digit millisecond latency requirement, we came across RonDB, the open-source distribution of the NDB Cluster. RonDB claims to provide LATS: low Latency, high Availability, high Throughput, and scalable Storage, and it has good benchmark numbers to back that claim (more details on benchmarks available here). We performed a proof of concept (POC) with RonDB and, based on the benchmark numbers, realized it would work for us. RonDB (NDB — Network Database) is accessible via a MySQL server cluster setup, which was another plus point for us regarding easy scalability and a SQL interface.
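Because RonDB is reachable through standard MySQL servers, an online point lookup can be issued with any MySQL client. Below is a minimal sketch using PyMySQL; the host, credentials, table, and column names are hypothetical.

```python
# Minimal sketch of an online feature lookup against RonDB via its MySQL
# interface. Host, credentials, table, and column names are illustrative;
# RonDB itself only requires the table to use ENGINE=NDBCLUSTER.
import pymysql

conn = pymysql.connect(host="rondb-mysql", user="feast", password="***",
                       database="online_store")
try:
    with conn.cursor() as cur:
        # Latest value of one feature for one entity -- a key lookup, which is
        # where RonDB's low-latency, high-throughput claims matter most.
        cur.execute(
            "SELECT feature_value FROM user_features "
            "WHERE entity_id = %s AND feature_name = %s",
            ("user_123", "total_contests_joined_7d"),
        )
        row = cur.fetchone()
        print(row[0] if row else None)
finally:
    conn.close()
```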
The offline storage tier
We currently maintain all the feature data on S3 in Parquet format for the offline store tier. Data is stored at a feature group level and is partitioned on the feature, version, and date for easier querying. We also maintain a de-duped copy of feature values in Delta format for quicker offline data processing in bulk/batch fashion.
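As a sketch of what this layout looks like in practice (bucket names, paths, and column names are hypothetical), a training job can prune partitions when reading the raw history, and a compaction job can maintain the de-duplicated Delta copy:

```python
# Sketch of working with the offline-store layout described above; bucket
# names, paths, and columns are illustrative. Assumes a Spark session with S3
# and Delta Lake configured.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("offline-store-sketch").getOrCreate()

# Raw, append-only feature history: one folder per feature group, partitioned
# by feature, version, and date so queries can prune partitions.
history = spark.read.parquet(
    "s3://feature-offline-store/feature_group=user_engagement/")

one_day = history.filter(
    (F.col("feature_name") == "total_contests_joined_7d")
    & (F.col("version") == 1)
    & (F.col("dt") == "2022-06-01"))
print(one_day.count())

# De-duplicated copy: keep only the latest value per (entity, feature, version)
# and store it in Delta format for quicker bulk/batch processing.
latest = (history
    .withColumn("rn", F.row_number().over(
        Window.partitionBy("entity_id", "feature_name", "version")
              .orderBy(F.col("event_ts").desc())))
    .filter("rn = 1")
    .drop("rn"))

(latest.write.format("delta").mode("overwrite")
    .save("s3://feature-offline-store-delta/feature_group=user_engagement/"))
```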
To ensure consistent data in both storage tiers, we use Kafka clusters as a front door. All data that needs to be written to the feature store is simply published to Kafka. This setup gives us the flexibility to implement our own consumers or use Kafka Connect clusters to sink data to multiple data stores (online and offline), ensuring data availability in both stores and no inconsistencies across the online and offline storage tiers.
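For example, a feature value write then amounts to producing one message on a feature-group topic. A minimal sketch with the confluent-kafka client follows; the topic name and payload shape are hypothetical, and in practice the message schema would be managed via the schema registry rather than plain JSON.

```python
# Minimal sketch of publishing a feature value to the Kafka "front door".
# Topic name and payload shape are illustrative only.
import json
import time
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "kafka:9092"})

event = {
    "entity_id": "user_123",
    "feature_name": "total_contests_joined_7d",
    "version": 1,
    "value": 4,
    "event_ts": int(time.time() * 1000),
}

# Keying by entity id keeps all values for an entity on the same partition
producer.produce(
    "feature_group.user_engagement",
    key=event["entity_id"],
    value=json.dumps(event).encode("utf-8"),
)
producer.flush()
```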
Serving
The serving layer comes in two parts — a low-latency, scalable REST API service and an SDK written in Python and Java.
In tandem with our registry service, the REST API knows which feature is stored in which datastore. Based on the features requested, the API service pulls the feature values from the online store and serves the result to the user. The serving layer is currently a Spring Boot application backed by HikariCP for connection pooling, which helps us achieve good throughput in terms of requests per second.
Our SDK for Data Feast is another touchpoint to retrieve data from the feature storage. The SDK is written in Python and Java to cater to all codebases, and also comes with support for Spark (Java/Scala) and PySpark. There are easy-to-use functions and methods to interact with Data Feast for data storage and retrieval.
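To illustrate the serving flow (the endpoint path and request/response shapes below are hypothetical, not Data Feast’s actual API), an online lookup boils down to asking the REST service for a set of features for a set of entities:

```python
# Hypothetical sketch of an online feature lookup against the serving REST API.
# The endpoint path and payload shape are illustrative only.
import requests

payload = {
    "feature_group": "user_engagement",
    "features": ["total_contests_joined_7d", "favourite_sport"],
    "entity_ids": ["user_123", "user_456"],
}

resp = requests.post("https://data-feast.internal/api/v1/features:serve",
                     json=payload, timeout=0.1)  # tight timeout for online inference
resp.raise_for_status()

# e.g. {"user_123": {"total_contests_joined_7d": 4, "favourite_sport": "cricket"}, ...}
print(resp.json())
```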
Registry
A registry is one of the most critical components of any feature store. Keeping this in mind, we built a REST API service using Spring Boot, deployed in high-availability mode. An RDS instance backs the service for metadata storage. We maintain all metadata related to features, such as name, description, owner, domain, team, storage tiers enabled (offline, online, or both), versions, data types, and other operations-related data.
The metadata in the registry sets the tone for Data Feast to function. We have automated jobs that use the registry to schedule and configure data ingestion and storage of features. The serving APIs utilise the registry to get information on which feature values are available, where to find them and how to access these features.
The registry service exposes APIs for FeatureGroup/Feature addition, editing, deletion, and metadata retrieval, and maintains a single source of truth for all feature-related metadata. It helps us make the entire feature registration and metadata retrieval process as self-serve as possible. We also use the Confluent Schema Registry to maintain the schema of data flowing to and from Kafka. Additionally, the Data Feast registry service is responsible for other operational tasks such as registering feature schemas with the schema registry, Kafka topic creation, and sink connector (Kafka Connect) addition.
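As an illustration of what registration looks like (the endpoint path and field names are hypothetical), a feature group and its metadata are registered once and then drive ingestion, storage, and serving:

```python
# Hypothetical sketch of registering a feature group with the Data Feast
# registry. Endpoint path and field names are illustrative only.
import requests

feature_group = {
    "name": "user_engagement",
    "description": "Engagement features computed per user",
    "owner": "data-science-growth",
    "domain": "user",
    "entity": "user_id",
    "storage_tiers": ["online", "offline"],
    "features": [
        {"name": "total_contests_joined_7d", "dtype": "int", "version": 1},
        {"name": "favourite_sport", "dtype": "string", "version": 1},
    ],
}

resp = requests.post("https://data-feast.internal/api/v1/feature-groups",
                     json=feature_group, timeout=5)
resp.raise_for_status()
print(resp.json())  # registry-assigned ids, Kafka topic, sink connector details, etc.
```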
Monitoring and Alerting
There are two major categories of metrics that we need to capture to know the state and health of a feature store. First is the operational metrics about the production system as a whole, and the second is data-related metrics like quality, freshness, drift, etc.
As a first iteration, we focused on the operational metrics for the system’s core functionality.
We track metrics related to storage (availability, capacity, utilization, etc.), feature serving (throughput, latency, error rates, etc.), ingestion of feature data into the respective stores (job success rate, throughput, processing lag, etc.), and more. We have thresholds set on these metrics that raise an incident for action to be taken when breached.
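A simplified sketch of the threshold-based alerting idea follows; the metric names, threshold values, and incident hook are hypothetical placeholders.

```python
# Simplified sketch of threshold-based alerting on operational metrics.
# Metric names, thresholds, and the incident hook are hypothetical placeholders.
THRESHOLDS = {
    "serving.p99_latency_ms": 10.0,       # single-digit millisecond target
    "serving.error_rate": 0.01,
    "ingestion.consumer_lag_msgs": 100_000,
}

def raise_incident(metric: str, value: float, limit: float) -> None:
    # Placeholder for paging the on-call channel / incident tooling
    print(f"INCIDENT: {metric}={value} breached threshold {limit}")

def check_and_alert(metrics: dict) -> None:
    """Raise an incident for every metric that breaches its threshold."""
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            raise_incident(name, value, limit)

check_and_alert({"serving.p99_latency_ms": 14.2, "serving.error_rate": 0.002})
```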
We are currently building a component to start capturing data-related metrics such as data quality, drift, training-serving skew, freshness, and profiling.
Transformation
During the initial phase of building Data Feast, many data pipelines that created feature values were already in production. From the beginning, it was crucial for us to gradually migrate these older pipelines to Data Feast and ensure that all new pipelines were built on it. To achieve seamless integration of feature engineering and computation, we decided to use Data Titan as the transformation layer for Data Feast.
Data Titan is a compute abstraction layer product that we built entirely in-house. It allows us to run data computation across various data stores with an option to run federated queries. We can decide on multiple destinations for the computed data, Data Feast being one of them. Data Titan is available in two forms at Dream11 — Batch and Real-time.
As the name suggests, Data Titan Batch allows computation on batch data from various sources like the data lake or different data warehouses. Data Titan is intelligent enough to understand where the data resides and directly run the computation on the data source. It also uses Apache Spark to run federated queries based on data sources and destinations.
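As a rough sketch of the federated-query idea (all paths, JDBC details, topic, and column names are hypothetical, and Data Titan hides this plumbing behind its own interface), Spark can join data from the lake with a warehouse table, derive a feature, and publish the result to Data Feast’s Kafka front door:

```python
# Rough sketch of a federated batch computation: join lake data with a
# warehouse table, derive a feature, and publish it to the Kafka front door.
# Paths, JDBC details, topic, and column names are hypothetical; the
# spark-sql-kafka package and the JDBC driver are assumed to be on the classpath.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("data-titan-batch-sketch").getOrCreate()

lake_events = spark.read.parquet("s3://data-lake/events/contest_joins/")
warehouse_users = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://warehouse:5432/analytics")
    .option("dbtable", "dim_users")
    .option("user", "titan").option("password", "***")
    .load())

features = (lake_events
    .filter(F.col("event_ts") >= F.date_sub(F.current_date(), 7))
    .groupBy("user_id")
    .agg(F.count("*").alias("value"),
         F.max("event_ts").alias("event_ts"))
    # restrict to users present in the warehouse dimension (cross-source join)
    .join(warehouse_users.select("user_id"), "user_id")
    .withColumn("feature_name", F.lit("total_contests_joined_7d"))
    .withColumn("version", F.lit(1)))

# Publish the derived feature values to the feature-group topic
(features
    .selectExpr("CAST(user_id AS STRING) AS key", "to_json(struct(*)) AS value")
    .write.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("topic", "feature_group.user_engagement")
    .save())
```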
Data Titan Real-time lets us run computations on real-time data sources like Kafka and store the results in Data Feast. Apache Flink backs this system for all immediate computational needs.
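A minimal sketch of the streaming counterpart is shown below, using Flink SQL through PyFlink; the topic names, columns, and windowing choice are hypothetical, and the Kafka SQL connector jar is assumed to be available on the classpath.

```python
# Minimal PyFlink sketch of a streaming feature computation: count contest
# joins per user per day and publish the result to the feature-group topic.
# Topic names, columns, and the windowing choice are illustrative only.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE contest_joins (
        user_id STRING,
        event_ts TIMESTAMP(3),
        WATERMARK FOR event_ts AS event_ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'contest_joins',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'json'
    )
""")

t_env.execute_sql("""
    CREATE TABLE user_engagement_features (
        entity_id STRING,
        feature_name STRING,
        value BIGINT,
        event_ts TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'feature_group.user_engagement',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'json'
    )
""")

# Daily tumbling-window aggregation keeps the output stream append-only
t_env.execute_sql("""
    INSERT INTO user_engagement_features
    SELECT user_id,
           'contests_joined_1d',
           COUNT(*),
           MAX(event_ts)
    FROM TABLE(TUMBLE(TABLE contest_joins, DESCRIPTOR(event_ts), INTERVAL '1' DAY))
    GROUP BY user_id, window_start, window_end
""").wait()
```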
We will soon be releasing another blog to talk more about Data Titan and how it works, so stay tuned!
The high-level architecture diagram shown below brings the system’s current state together.
The future of Data Feast
At Dream11, we have just begun our journey in the world of Feature Stores, and there is a long way to go. Here are some significant milestones that we have envisioned for the immediate future:
- Integration of Delta or Apache Hudi format for time-travel capabilities on the feature data: We are currently evaluating both for feasibility and the other advantages they provide.
- Integration of Data Feast with Data Titan: The purpose is to provide an easy and seamless experience of generating features for the end-user.
- Tackling issues proactively: We are working on building a feature data monitoring, profiling, and alerting component to catch and tackle any data-related issues proactively.
- Further perfecting Data Feast: The feature store and data space are ever-evolving, and there is always scope for improvement at every step. We continuously work to evolve our systems for scale and performance as Dream11 grows.
Are you interested in solving problems related to data at a multi-petabyte scale, handling and processing billions of data points per day, or working with passionate and innovative minds on the turf? We are currently hiring across all levels! Apply here to join us. More exciting stuff to come from the Data Team at Dream11! Stay tuned!