
A Quick Guide To Chaos Engineering


When it comes to performance testing, clients commonly ask the following question to gauge their KPIs:

How Has The Application Behaved During Peak Hours?

While performance and stress testing solve many software challenges, chaos testing is becoming the need of the hour. The practice was pioneered by Netflix, whose Chaos Monkey tool (later expanded into the Simian Army suite) was built to test the resilience of its IT infrastructure. Chaos testing deliberately introduces failures into the production environment so that weaknesses are found and fixed early, and the impact of cyberattacks, outages, and software failures can be greatly reduced.

Chaos testing can cover infrastructure failures, network failures, and application failures, with tools typically exposing APIs for injecting faults at each of these levels. Typically, a person from the DevOps team designs the scenario, executes the test, and determines and records the results. He or she is also responsible for minimizing the impact on the production system.

 

What is Chaos Engineering?

Chaos engineering is the practice of making your servers, infrastructure, and applications resilient to disruptive conditions such as a prime-time usage surge or many users requesting the same content at once. Chaos engineering tools plug into the production servers through APIs and run their experiments in the live environment. This process requires constant monitoring of servers and applications, which is mostly done by the DevOps team. Development and QA work side by side with the DevOps team, because code changes can themselves cause application failures. Chaos testing helps to identify defects that cannot be found in a non-production environment.

To sum it up, chaos testing does two things. First, it verifies that all parts of the system are integrated well enough to deliver a good end-user experience, even when something fails. Second, it helps enterprises quickly identify the root cause of a software failure and act on it accordingly.

 

Advantages of Chaos Testing

The following are some of the advantages of running chaos tests:

  • It helps to quickly identify issues and resolve them in a timely manner, including issues that might not surface when testing in a simulated, non-production environment.
  • It reduces the chances of unplanned outages or downtime by prompting proactive fixes.
  • It fosters stronger system integrity.
  • It gives confidence when building or developing large, complex application systems deployed on different cloud-based services.
  • It helps in making decisions about scaling up and scaling down.
  • It helps to increase the speed of software recovery after an outage.

Testing is crucial when designing systems, software, and infrastructure. That's why companies rely on quality assurance teams and tools to analyze their products for vulnerabilities before releasing them. To quantify the benefits of chaos engineering, organizations should first calculate the cost of outages on a per-minute, hourly, or daily basis. From that figure, they can estimate how much downtime, and therefore how much money, chaos engineering practices could save them. Also, by practicing chaos engineering, organizations can get their QA teams comfortable with responding to vulnerabilities and finding solutions faster.
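As a rough illustration of the outage-cost estimate described above, here is a minimal Python sketch. The function names and every figure in it (cost per minute, downtime before and after) are hypothetical placeholders chosen for the example, not benchmarks from any real organization.

```python
def outage_cost(cost_per_minute: float, downtime_minutes: float) -> float:
    """Cost of an outage, assuming a flat cost for every minute of downtime."""
    return cost_per_minute * downtime_minutes


def estimated_savings(cost_per_minute: float,
                      baseline_minutes: float,
                      expected_minutes: float) -> float:
    """Money saved if downtime drops from the baseline to the expected level."""
    return (outage_cost(cost_per_minute, baseline_minutes)
            - outage_cost(cost_per_minute, expected_minutes))


# Hypothetical inputs: replace with figures measured for your own system.
COST_PER_MINUTE = 500.0          # revenue lost per minute of downtime
BASELINE_MINUTES = 240.0         # downtime per year today
EXPECTED_MINUTES = 150.0         # downtime per year hoped for after chaos testing

yearly_savings = estimated_savings(COST_PER_MINUTE, BASELINE_MINUTES, EXPECTED_MINUTES)
print(f"Estimated yearly savings: {yearly_savings:,.2f}")
```

A flat per-minute cost is a simplification; in practice the cost of an outage often depends on the time of day and on which service is affected.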

Even though chaos tests offer many benefits, they may not be required where the application is very small and its user base is limited. Such applications usually do not require 100% uptime, and their users can be contacted directly when a problem occurs.

 

Real-life Scenarios

Let us look at some examples of complete software failure in recent times to understand how chaos testing is useful. In March 2020, Brno University Hospital in the Czech Republic suffered massive disruption when its computer systems were shut down in the middle of the pandemic. Other instances include Zoom server outages, which disrupted meetings and classes for several hours.

Software shutdowns often harm the reputation of organizations. Hence, it is important to monitor and improve chaos testing strategies implemented for the product.

Most Common Tools Used for Chaos Engineering

Listed below are the most important tools that can be used for chaos testing:

  1. Chaos Monkey

    Chaos Monkey is a tool developed by Netflix when they started using Amazon Web Services. It is used to test the resilience of the IT infrastructure. It works by purposefully disabling computers in Netflix's production environment to test how the remaining systems respond to the outage. Chaos Monkey later became part of the Simian Army, a much larger suite of tools designed to test responses to various kinds of system failure.

  2. LitmusChaos

    LitmusChaos is a tool that enables teams to identify outages or weaknesses in infrastructure by inducing chaos tests in a controlled way. It is a fully open-source chaos engineering platform that helps SREs and developers practice chaos engineering in a cloud-native way.

  3. Byte-Monkey

    Byte-Monkey is a small Java library for testing JVM applications. It works by changing the code on the fly, deliberately introducing faults and errors. It is best suited to applications written in Java, and it can be adopted during the development phase with the help of the Dev and QA teams.

  4. Facebook Storm

    Facebook Storm is built to prepare for the failure of large data centers, which are expanding geographically at a very fast rate. Facebook regularly tests the resilience of its infrastructure in test and production environments. Also known as the Storm Project, the program simulates massive data center failures to uncover problems. Organizations of this size have exact replicas of the production environment and can test system integrity using chaos testing methodologies without causing downtime in production.

  5. Mangle

    Mangle is built to run chaos testing experiments against applications and infrastructure components to assess resilience and fault tolerance. It is designed to initiate faults with little pre-configuration and can support any infrastructure that you might have, including Docker, vCenter, or any remote machine with SSH enabled. It is built on a plugin model, where you can define a custom fault of your choice based on a template and execute it without rebuilding your code.
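To make the idea behind a tool like Chaos Monkey more concrete, the following Python sketch simulates disabling random instances in a cluster and then checking whether the rest can still serve traffic. It is a toy model only: the `Cluster` class, the `unleash_monkey` function, and the instance names are hypothetical and do not reflect the actual API of Chaos Monkey or of any other tool listed above.

```python
import random


class Cluster:
    """Toy model of a service that needs a minimum number of healthy instances."""

    def __init__(self, instance_ids, min_healthy):
        self.healthy = set(instance_ids)
        self.min_healthy = min_healthy

    def terminate(self, instance_id):
        # Simulate an instance being shut down.
        self.healthy.discard(instance_id)

    def is_serving(self):
        # The service stays up while enough instances remain healthy.
        return len(self.healthy) >= self.min_healthy


def unleash_monkey(cluster, rng, kills):
    """Terminate up to `kills` randomly chosen healthy instances."""
    candidates = sorted(cluster.healthy)  # sorted so a seeded run is repeatable
    victims = rng.sample(candidates, k=min(kills, len(candidates)))
    for victim in victims:
        cluster.terminate(victim)
    return victims


# Hypothetical five-instance cluster that needs three instances to serve traffic.
cluster = Cluster([f"web-{n}" for n in range(1, 6)], min_healthy=3)
victims = unleash_monkey(cluster, random.Random(42), kills=2)

print("Terminated:", victims)
print("Still serving:", cluster.is_serving())
```

Seeding the random generator keeps the experiment repeatable, which makes it easier to compare results before and after a fix.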

 

How to Start Chaos Testing?

The first thing that you need to think about when starting a chaos testing exercise is how to purposefully crash the production application system. Although it might sound peculiar to many, here are a few reasons why QA engineers should do it this way:

  • First, real system failures are unpredictable and occur without notice. In a chaos test, the specific failure may not be known in advance, but its date and time are decided beforehand and can be arranged to minimize the impact. The technical team will therefore be ready to take action immediately and fix the problems.

  • Second, there will be a complete focus on monitoring system data before, during, and after the failure. This will help the recovery process, and also help teams to gather data for subsequent analysis.

  • Third, when the problem is resolved and the application system is back up, subsequent analysis will yield new insights about the production system. You may discover new bugs, as well as untested software that was not expected to be faulty. Emergency management can be improved and better logging can be implemented.
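The three points above can be sketched as a small experiment harness in Python: measure a health metric before the fault, during it, and after recovery, then keep the readings for analysis. This is a minimal sketch under stated assumptions. The `run_experiment` function and the toy replica-count system are hypothetical, and a real experiment would probe live monitoring data rather than an in-memory dictionary.

```python
from typing import Callable, Dict, List


def run_experiment(probe: Callable[[], float],
                   inject_fault: Callable[[], None],
                   recover: Callable[[], None],
                   samples_per_phase: int = 3) -> Dict[str, List[float]]:
    """Record a health metric before, during, and after an injected fault."""
    readings: Dict[str, List[float]] = {}
    readings["before"] = [probe() for _ in range(samples_per_phase)]
    inject_fault()
    try:
        readings["during"] = [probe() for _ in range(samples_per_phase)]
    finally:
        recover()  # always roll the fault back, even if probing raises
    readings["after"] = [probe() for _ in range(samples_per_phase)]
    return readings


# Toy system (hypothetical): capacity is the fraction of replicas still running.
TOTAL_REPLICAS = 3
state = {"replicas": TOTAL_REPLICAS}


def probe() -> float:
    return state["replicas"] / TOTAL_REPLICAS


def inject_fault() -> None:
    state["replicas"] -= 1  # simulate losing one replica


def recover() -> None:
    state["replicas"] = TOTAL_REPLICAS


report = run_experiment(probe, inject_fault, recover)
for phase, values in report.items():
    print(phase, values)
```

Wrapping the "during" phase in `try`/`finally` reflects the point about minimizing impact: the fault is rolled back even if monitoring itself fails.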

 

How QASource Can Help With Chaos Engineering

The idea behind chaos testing is to reduce the time to recovery, so that systems come back up as fast as possible and the impact on business, revenue, and customers is kept short. It is therefore crucial to hire QA engineers who can handle such situations and have the expertise required to reduce the time to recovery substantially.

The automated nature of the DevOps workflows means that a vast majority of testing is done by using automated tools. From unit testing to smoke testing, DevOps is designed to deliver software without a QA engineer testing the build.

QA engineers need to have knowledge of chaos engineering. It is not a dedicated testing practice, and many QA engineers still believe their job is done when an application reaches production.

Chaos Monkey is one of the best-known tools, and QA engineers can contribute further by analyzing the application failures it exposes. Chaos engineering helps teams to test their applications in a production environment and check their resilience during shutdowns.

For QA engineers, chaos testing is very interesting compared to traditional functional testing, as it brings latent production bugs to the surface. Chaos engineering enables QA engineers to expand their skills and add value in determining the quality of an application, without disturbing the core business functionality.

More and more companies are now adopting chaos engineering practices, as they aim to prevent security issues and outages before they happen. It is a continuous activity: at the end of each experiment, you have to act on the findings. After that, a new experiment should be designed to test stability once those changes are implemented.

 

QASource has a team of extremely skilled engineers who have extensive experience in chaos testing, QA, and DevOps practices. If you’re looking for a partner that can help test your current processes and identify gaps in your systems, we have the solution for you. We can provide you with a team of testing experts without the hassle and cost of hiring an entire department of them.
Contact us today to learn more.
