Finding Order in Chaos: How We Automated Performance Testing with Torque

The IPL has a huge fan following, which makes it a defining event for us at Dream11. During Dream11 IPL 2020, we served an enormous amount of traffic. With more than 5.5 million concurrent users enjoying the fantasy sports experience, some of our services received more than 80 million requests per minute. The ultimate challenge is to give our users a seamless experience even at this scale.

The Challenge

Preparing the Dream11 platform to scale for this event needed a herculean effort. The major challenges we had to overcome were:

Benchmarking 100+ microservices, built using varied technologies such as Java, Scala, Node, MySQL, Cassandra, Aerospike and Kafka.
Traffic patterns which vary vastly before and after the match starts, creating a completely different set of usage patterns (and subsequently stressing different sets of systems) at different points of time.
Generating load at 5 times the regular traffic, ensuring the system can handle unexpected surges. For some of the services, this meant more than 100 million requests per minute (RPM).
Managing huge amounts of test data, usually in GBs, required to simulate these traffic patterns.
Capability to execute tests in parallel on multiple environments.
The need for a black-box solution that anyone could use without worrying about the underlying intricacies.

Until IPL 2019, our performance testing setup was scattered across multiple tools such as JMeter, Rundeck and shell scripts, among others. This created the need for a central framework where we could automate data preparation, scaling of infrastructure and similar tasks.

What We Did

Keeping all this in mind, we created a fully automated and scalable framework where all scripts, data preparation and configurations could be kept centrally. What’s more, separate modules can easily be added to this framework for any new service, and scripts for existing services can be reused for multiple kinds of tests. These tests can be executed easily using Jenkins jobs. All in all, the dream scene!

We automated the entire process and named our framework Torque.

How We Did It

The major technologies involved in Torque are:

Gatling — It’s a Scala-based scripting tool, which enables us to write Performance Tests As Code. We can define user behaviour directly in code, without any need for clunky UIs or bloated XMLs; just plain code. It can also generate a high load with relatively little infrastructure.
Scala — Gatling DSL is built on Scala, which made it the natural choice of language for Torque. Using Scala, we automated some of the peripheral tasks required for performance tests, such as data preparation before the test, cleaning up infrastructure after the tests, so on and so forth.
Redis — We work with very high throughput for our tests. Even with Gatling’s capabilities, a single generator is not sufficient for such load models, which is why we use multiple generators. With multiple generators, some tasks require mutual exclusion (e.g. distributing unique data to each generator or sending Slack alerts). For such tasks, we use Redis locks, with Redisson as our library of choice.
AWS S3 — Since our test data files usually go into GBs, we wanted a storage which can be used to save, reuse and distribute data over different load generators.
AWS Lambda — In cases where unique data is required for all load generators, we use a Lambda function to read the existing S3 data files, split them as per the number of generators required and then store the parts back in S3. The Redis lock mentioned above helps make sure each generator gets a unique set of data.
Apache Spark — Lambda has restrictions on processing duration and memory, so it doesn’t work for tests that need a larger data set (~ 100–200 million rows). In those cases, a Spark job performs the same splitting process instead.
Jenkins — It provides a simple interface, using which we can control the various aspects of tests like users, duration, load model, environment and a lot more.
Ansible & Terraform — These are used to provision load generators, deploy the Torque code and initialise the tests based on Jenkins parameters.
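
The splitting step handled by Lambda or Spark can be illustrated with a small Python sketch. It works on an in-memory list instead of S3 objects, and the function name `split_rows` and the round-robin strategy are assumptions made for illustration, not Torque's actual implementation.

```python
from typing import List, Sequence


def split_rows(rows: Sequence[str], generator_count: int) -> List[List[str]]:
    """Split rows into generator_count disjoint chunks of near-equal size."""
    if generator_count < 1:
        raise ValueError("generator_count must be at least 1")
    # Round-robin assignment: row i goes to chunk i % generator_count,
    # so no row is shared and chunk sizes differ by at most one.
    chunks: List[List[str]] = [[] for _ in range(generator_count)]
    for index, row in enumerate(rows):
        chunks[index % generator_count].append(row)
    return chunks


if __name__ == "__main__":
    # Hypothetical test data: ten user ids shared between three generators.
    users = [f"user-{n}" for n in range(10)]
    for generator_id, chunk in enumerate(split_rows(users, 3)):
        print(generator_id, chunk)
```

Each generator then receives exactly one chunk, which is what guarantees that no two generators replay the same test data.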

Torque in Action

A user or a CI/CD pipeline triggers the test from Jenkins.
Jenkins utilizes Terraform & Ansible to create infrastructure and deploy the code on EC2 load generators.
The Torque application is started on all generators with the parameters passed via Jenkins.
All generators try to acquire a lock on Redis to initialize test data preparation.
The generator that acquires the lock fetches data from the datasource and uploads it to AWS S3.
The same generator then triggers a lambda function/spark job.
The lambda function/spark job splits the data file into multiple files (equal to the number of load generators) and stores these back in AWS S3. It also updates Redis with the S3 file path assigned to each generator.
Every generator polls Redis to fetch the S3 path of the data file assigned to it.
Each generator then downloads the test data file assigned to it from S3.
At this point, all generators have the required code and test data. Each of them then starts generating load on the Application Under Test (AUT) using the Gatling simulation. A Slack notification is sent for each completed action, to notify the user about the progress of the test.
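
The coordination in the steps above (one generator wins the lock and prepares data, every generator polls for its own file path) can be sketched in Python. This is only a simulation under stated assumptions: threads stand in for load generators, a `threading.Lock` stands in for the Redis lock, a dictionary stands in for the Redis keys, and the `Coordinator` class and the S3 path layout are hypothetical.

```python
import threading
import time
from typing import Dict, List, Tuple


class Coordinator:
    """In-process stand-in (assumption) for the Redis lock and Redis keys."""

    def __init__(self, generator_count: int) -> None:
        self.generator_count = generator_count
        self.lock = threading.Lock()        # plays the role of the Redis lock
        self.paths: Dict[int, str] = {}     # plays the role of the Redis keys
        self.preparers: List[int] = []      # records which generator prepared data

    def prepare_data(self, generator_id: int) -> None:
        """Winner's job: pretend to upload, split, and publish one path per generator."""
        self.preparers.append(generator_id)
        for target in range(self.generator_count):
            # Hypothetical path layout, for illustration only.
            self.paths[target] = f"s3://example-bucket/run/part-{target}.csv"


def run_generator(coord: Coordinator, generator_id: int, results: Dict[int, str]) -> None:
    # Non-blocking acquire: exactly one generator wins. The lock is never
    # released in this sketch, so data is prepared only once per run.
    if coord.lock.acquire(blocking=False):
        coord.prepare_data(generator_id)
    # Every generator, including the winner, polls for its own path.
    while generator_id not in coord.paths:
        time.sleep(0.01)
    results[generator_id] = coord.paths[generator_id]


def run_all(generator_count: int) -> Tuple[Coordinator, Dict[int, str]]:
    coord = Coordinator(generator_count)
    results: Dict[int, str] = {}
    threads = [
        threading.Thread(target=run_generator, args=(coord, gid, results))
        for gid in range(generator_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return coord, results


if __name__ == "__main__":
    coordinator, assigned = run_all(4)
    print("prepared by generator", coordinator.preparers[0])
    for gid in sorted(assigned):
        print(gid, assigned[gid])
```

Which generator wins the lock varies from run to run, but the data is always prepared exactly once and every generator ends up with a distinct path.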

What We Have

Everything mentioned before boils down to the features detailed below, which Torque provides us.

Tests As Code — Develop easily maintainable test suites in a central code base.
Automated Test Environment and Data setup — Ability to perform tasks before and after actual test execution like data setup, clean up of environment and more, using various utilities in the framework.
Reusability — Reuse the same definitions for different kinds of tests (smoke, performance & functional) and different load models (soak, spike, etc.).
Multi-Datasource support — Run load tests on data models for Cassandra, Aerospike, Redis, RDS, etc., over different protocols.
CI Support — Use Jenkins jobs for test execution, which can be consumed by anyone — SDETs, Devs or DevOps.
Utilities — Helpers to interact with AWS services, push messages to Kafka, run repetitive tasks in parallel to actual test executions.
High Configurability — Be it environment, users, load model or throughput, everything is configurable. Even the number of generators can be configured based on load required.
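
Since the number of generators is configurable based on the load required, a rough sizing calculation can help pick a starting value. The Python sketch below uses the 100 million RPM figure quoted earlier; the per-generator capacity of 3 million RPM is a purely hypothetical number chosen for illustration, and both function names are assumptions.

```python
def requests_per_second(rpm: int) -> float:
    """Convert requests per minute to requests per second."""
    return rpm / 60


def generators_needed(target_rpm: int, rpm_per_generator: int) -> int:
    """Smallest number of generators whose combined capacity covers target_rpm."""
    if target_rpm < 0 or rpm_per_generator <= 0:
        raise ValueError("target_rpm must be >= 0 and rpm_per_generator > 0")
    # Integer ceiling division avoids floating-point rounding surprises.
    return (target_rpm + rpm_per_generator - 1) // rpm_per_generator


if __name__ == "__main__":
    target_rpm = 100_000_000          # peak load figure quoted in the article
    rpm_per_generator = 3_000_000     # hypothetical capacity of one generator
    print(round(requests_per_second(target_rpm)), "requests per second")
    print(generators_needed(target_rpm, rpm_per_generator), "generators")
```

In practice the per-generator capacity would have to be measured for each simulation, since it depends on the protocol, payload size and instance type.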

Conclusion

As part of our Dream11 IPL preparation, Torque helped us write tests for 150+ APIs, execute more than 1,500 load iterations, and benchmark more than 50 critical services at 5 times the normal scale. These runs covered different scenarios and traffic patterns, including simulations of end-to-end match-day scenarios.

The simplicity of its interface and the abstraction it provides allowed people from different teams to execute these tests with ease, helping us achieve high productivity and velocity while maintaining quality.

Torque thus helped us build highly stable and scalable systems, resulting in no scale-related issues in production and a great user experience, which is what we aspire to at Dream11.
