Building DataAware: The In-House Funnel Analytics Tool At Dream11

Introduction

As the world’s largest fantasy sports platform with 110 million+ users, Dream11 hosts hundreds and thousands of fantasy sports contests every day. Here, our users actively engage with real-life sporting events and showcase their knowledge of sports. Giving so many users the best possible experience every day is challenging. At this scale, one of our most common behavioural analytics requirements is funnel analytics, which helps us understand user behaviour and preferences. Our in-house Data Platform, where we collect, process and serve terabytes of data per day, allowed us to solve this. The hero of this journey has been DataAware, a funnel analytics tool that we use to identify user behaviour trends at Dream11’s scale. We developed DataAware with the following features:

  • A user interface to select the event sequence interactively
  • Filters on event properties
  • Auto-suggestions for event filter properties
  • A configurable conversion window
  • A date range, so that users can analyze across days, weeks, or months

Defining the Funnel and its importance for Dream11

Funnel analysis involves mapping and analyzing a series of events that lead towards a defined goal. At Dream11, an example is the journey from opening the application to joining a contest: understanding where users drop off while navigating the app lets us take appropriate actions to increase conversions. On an eCommerce platform, an example is a flow that starts with user engagement in a mobile app and ends in a sale. Funnel analysis is an effective way to calculate conversion rates for specific user behaviours, where a conversion can be a sale, a registration, or any other intended action from an audience.
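To make conversion rates and drop-offs concrete, here is a minimal Python sketch that turns per-step user counts into funnel metrics. The function name, the step names, and the numbers are illustrative assumptions, not Dream11 data or DataAware code.

```python
from typing import List, Tuple


def funnel_report(step_counts: List[Tuple[str, int]]) -> List[dict]:
    """Compute per-step and overall conversion for an ordered funnel.

    step_counts holds (step name, number of users who reached the step),
    ordered from the first step to the goal.
    """
    report = []
    first = step_counts[0][1] if step_counts else 0
    previous = first
    for name, users in step_counts:
        report.append({
            "step": name,
            "users": users,
            # Share of users from the previous step who reached this one.
            "step_conversion": users / previous if previous else 0.0,
            # Share of users from the first step who reached this one.
            "overall_conversion": users / first if first else 0.0,
            # Users lost between the previous step and this one.
            "drop_off": previous - users,
        })
        previous = users
    return report


# Illustrative numbers only.
steps = [("app_opened", 1000), ("contest_viewed", 400), ("contest_joined", 100)]
for row in funnel_report(steps):
    print(row)
```

The step with the largest drop-off is usually the first place to look when trying to increase conversions.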

[Figure: a common example of e-commerce funnel analysis]

Architecture of DataAware

Data Collection — Raw Layer

We have Data Highway, our in-house event collection service, to collect events from mobile devices and websites. Data Highway captures a stream of events in JSON format on Kafka topics. To make this data queryable, we have to park it somewhere. We chose S3 as our data lake and the Confluent S3 sink connector to sink event data to S3. At this stage, the data on S3 is plain-text JSON.

Key observations:

  1. Every event type lands in a separate Kafka topic and, subsequently, in a separate S3 directory, with its schema registered in a centralized Glue catalog so that it is queryable via any query engine
  2. Every event's data is partitioned by date (Hive-style data partitioning) to reduce the data scanned during daily ETL jobs
  3. The raw layer is used for near-real-time queries and has strict data retention policies. Raw layer data is moved to a more efficient, optimized processed layer.
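The layout described in these observations can be sketched as follows. The bucket name, the "raw" directory, and the partition column name "dt" are hypothetical; what makes the layout Hive-style is the key=value directory for the date.

```python
from datetime import date, timedelta
from typing import List


def raw_partition_prefix(bucket: str, event_name: str, day: date) -> str:
    """Build the S3 prefix holding one day of one event's raw data.

    One directory per event, one dt=YYYY-MM-DD sub-directory per day.
    """
    return f"s3://{bucket}/raw/{event_name}/dt={day.isoformat()}/"


def prefixes_for_range(bucket: str, event_name: str,
                       start: date, end: date) -> List[str]:
    """List the prefixes a job has to read for an inclusive date range."""
    days = (end - start).days + 1
    return [
        raw_partition_prefix(bucket, event_name, start + timedelta(days=n))
        for n in range(days)
    ]


# A daily ETL job only needs to touch a single prefix.
print(raw_partition_prefix("example-bucket", "contest_joined", date(2021, 1, 5)))
```

Because a query engine can map a filter on the date column to these directories, a query for one day reads one directory instead of the whole event history.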

Data Storage and Processing

The processed layer takes care of data enrichments, lookups, denormalization and storage of data in a more efficient Parquet format.

Benefits of the Parquet format:

  1. Requires a minimal data scan thanks to its columnar format, which saves cost because Athena charges per terabyte of data scanned by a query
  2. Is efficient for aggregation queries like funnel analytics
  3. Supports flexible compression options and efficient encoding schemes
  4. Works best with interactive and serverless technologies like AWS Athena
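The cost argument in the first benefit can be illustrated with some rough arithmetic. This sketch assumes a price of 5 USD per TB scanned, a TB of 2**40 bytes, and equally sized columns, and it ignores compression and encoding; all of these are simplifying assumptions, so check current Athena pricing before relying on the numbers.

```python
TB = 1024 ** 4  # assumption: one terabyte taken as 2**40 bytes


def scan_cost_usd(bytes_scanned: float, usd_per_tb: float = 5.0) -> float:
    """Cost of a query that is billed per terabyte scanned."""
    return bytes_scanned / TB * usd_per_tb


def columnar_scan_bytes(total_bytes: float, total_columns: int,
                        columns_read: int) -> float:
    """Bytes read when only some columns are needed.

    Assumes every column has the same size, which is rarely true in
    practice but keeps the arithmetic simple.
    """
    return total_bytes * columns_read / total_columns


# Hypothetical table: 2 TB, 40 columns, and a funnel query reading 3 of them.
full_scan = scan_cost_usd(2 * TB)
column_scan = scan_cost_usd(columnar_scan_bytes(2 * TB, 40, 3))
print(f"row-oriented scan: ${full_scan:.2f}, columnar scan: ${column_scan:.2f}")
```

A row-oriented format such as plain-text JSON forces the engine to read every field of every record, whereas a columnar format lets it skip the columns a funnel query does not reference.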

Query Layer

For the query layer, we had two choices: serverless Athena, and the in-house Presto cluster that we had been using for batch processing, which offers more predictable response times.

Athena

Athena is an interactive, serverless query engine with a pay-per-use model based on the terabytes of data scanned. It works best on top of Parquet and takes advantage of data partitioning. However, in certain cases and for large data scans, performance can be unpredictable, depending on the time of day and the shared pool of resources available behind the scenes, because it is a managed service.

Presto

Presto gives us full control over performance and predictable response times. Our centralised Glue catalog is available to any query engine (Spark SQL, Presto, and Athena), so the query engine can be picked dynamically for each query, based on its data scan and query performance patterns.
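One way such dynamic engine selection could look is sketched below. The inputs, the thresholds, and the routing direction (small scans to Athena, large or slow ones to Presto) are assumptions made for illustration; the actual selection rules are not described here.

```python
from dataclasses import dataclass

GB = 1024 ** 3


@dataclass
class QueryProfile:
    # Hypothetical inputs to the routing decision.
    estimated_scan_bytes: int    # e.g. derived from partition sizes
    athena_p95_seconds: float    # recent latency seen for similar queries


def choose_engine(profile: QueryProfile,
                  max_athena_scan_bytes: int = 50 * GB,
                  max_athena_p95_seconds: float = 30.0) -> str:
    """Pick a query engine name for one funnel query."""
    # Large scans go to the cluster we control, for predictable latency.
    if profile.estimated_scan_bytes > max_athena_scan_bytes:
        return "presto"
    # If the managed service has recently been slow, avoid it.
    if profile.athena_p95_seconds > max_athena_p95_seconds:
        return "presto"
    return "athena"


print(choose_engine(QueryProfile(estimated_scan_bytes=5 * GB,
                                 athena_p95_seconds=8.0)))
```

Since both engines read the same catalog, the generated query does not need to know which engine will run it.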

API Layer — Dynamic Query Generation

The Application Programming Interface (API) layer is built on the Flask framework, with multiple microservices that power the different UI components, such as:

  1. The initial listing of events for the end user to select from
  2. Fetching the list of event attributes on which filters can be applied
  3. While filters are being applied, identifying the possible values for event properties and giving auto-suggestions to the end user

Behind the scenes, real-time jobs interact with the real-time layers to keep all of this event metadata up to date.
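As an illustration of the auto-suggestion step, the following sketch keeps the known values of each event property in sorted order and answers prefix lookups with a binary search. The class name, the in-memory storage, and the sample values are assumptions; a production service would more likely back this with a shared store that the real-time jobs update.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple


class PropertyValueIndex:
    """Sorted, de-duplicated values per (event, property) pair."""

    def __init__(self) -> None:
        self._values: Dict[Tuple[str, str], List[str]] = {}

    def add(self, event: str, prop: str, value: str) -> None:
        """Record a value seen for an event property, keeping the list sorted."""
        values = self._values.setdefault((event, prop), [])
        i = bisect_left(values, value)
        if i == len(values) or values[i] != value:
            values.insert(i, value)

    def suggest(self, event: str, prop: str, prefix: str,
                limit: int = 10) -> List[str]:
        """Return up to `limit` known values that start with `prefix`."""
        values = self._values.get((event, prop), [])
        # In a sorted list, all values sharing a prefix are contiguous and
        # begin at the insertion point of the prefix itself.
        start = bisect_left(values, prefix)
        matches: List[str] = []
        for value in values[start:]:
            if len(matches) == limit or not value.startswith(prefix):
                break
            matches.append(value)
        return matches


index = PropertyValueIndex()
for sport in ["football", "cricket", "kabaddi", "cricket"]:
    index.add("contest_joined", "sport", sport)
print(index.suggest("contest_joined", "sport", "c"))
```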

Based on user inputs, and with a standardized Glue catalog for every event, we form the funnel queries on the fly and send them to the query engine to get the aggregated conversion numbers. We then send the result back to the visualization layer. A caching layer sits in between to cache the results. When a query hits the service, it is compared with cached queries on its entire semantics: the participating events in the funnel, the filters, and the date range. If these match, the result will be the same, so it is served directly from the cache.
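A minimal sketch of dynamic query generation and of a semantic cache key follows. It is not DataAware's actual code: the column names user_id, event_time (assumed to be a timestamp) and dt, the use of one table per event, and the choice to anchor the conversion window on each user's first occurrence of the first step are all assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Step:
    event: str                                             # table name (assumed)
    filters: Dict[str, str] = field(default_factory=dict)  # property -> value


def _quote(value: str) -> str:
    # Escape single quotes for a SQL string literal. A real service should
    # also validate event and property names against the catalog.
    return "'" + value.replace("'", "''") + "'"


def build_funnel_sql(steps: List[Step], start: str, end: str,
                     window_hours: int) -> str:
    """Build one Presto-style query that counts users reaching each step."""
    if not steps:
        raise ValueError("a funnel needs at least one step")
    ctes = []
    for i, step in enumerate(steps, start=1):
        conds = [f"e.dt BETWEEN {_quote(start)} AND {_quote(end)}"]
        conds += [f"e.{prop} = {_quote(val)}"
                  for prop, val in sorted(step.filters.items())]
        if i == 1:
            select = ("e.user_id, MIN(e.event_time) AS step_time, "
                      "MIN(e.event_time) AS first_time")
            source = f"{step.event} e"
            group_by = "e.user_id"
        else:
            # Later steps must follow the previous step and fall inside the
            # conversion window measured from the user's first step.
            select = "e.user_id, MIN(e.event_time) AS step_time, p.first_time"
            source = f"step_{i - 1} p JOIN {step.event} e ON e.user_id = p.user_id"
            conds.append("e.event_time > p.step_time")
            conds.append("e.event_time <= p.first_time + "
                         f"INTERVAL '{int(window_hours)}' HOUR")
            group_by = "e.user_id, p.first_time"
        ctes.append(f"step_{i} AS (SELECT {select} FROM {source} "
                    f"WHERE {' AND '.join(conds)} GROUP BY {group_by})")
    counts = ", ".join(f"(SELECT COUNT(*) FROM step_{i}) AS step_{i}_users"
                       for i in range(1, len(steps) + 1))
    return "WITH " + ",\n".join(ctes) + "\nSELECT " + counts


def cache_key(steps: List[Step], start: str, end: str,
              window_hours: int) -> str:
    """Hash the funnel's semantics so that equivalent requests share a key."""
    canonical = json.dumps({
        "steps": [{"event": s.event, "filters": s.filters} for s in steps],
        "start": start, "end": end, "window_hours": window_hours,
    }, sort_keys=True)  # filter order is irrelevant, step order is not
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


funnel = [Step("app_opened"), Step("contest_joined", {"sport": "cricket"})]
print(build_funnel_sql(funnel, "2021-01-01", "2021-01-07", window_hours=24))
```

Keying the cache on a canonical form of the request, rather than on the SQL text, means that two requests that differ only in the order of their filters resolve to the same cached result.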

Visualization Layer

Because leading BI tools did not satisfy some of our niche requirements, we decided to build a custom UI with UX tailored to our needs. We already had an internal portal to manage internal systems, so we plugged in one more component: DataAware. We wrote it from scratch in React, leveraging various community charting libraries. The React PWA (Progressive Web App) offers users functionality such as auto-suggested values. We also built the ability to save and share funnels for quick access. All of these UI components are powered by the Flask APIs described above.

All in all, building our own in-house funnel analytics tool was a resounding success and helped bridge the gaps in our pre-existing analytics tool. Here’s to DataAware!

Keen to work with us and build unique solutions at Dream11? Join us by applying here!
