
Chaos Engineering in Azure: Automating Resilience Testing with Terraform & Pipelines

Chaos Engineering in Azure with Chaos Studio

Azure Chaos Studio is Microsoft’s managed Chaos Engineering service, allowing teams to create controlled failure scenarios in a safe and repeatable manner. With fault injection capabilities across compute, networking, and application layers, teams can simulate real-world incidents and enhance their system’s resilience.

Key Features of Azure Chaos Studio:

  • Agent-based and Service-based faults: Inject failures at the infrastructure or application level.
  • Targeted chaos experiments: Apply disruptions to specific resources like VMs, AKS, or networking components.
  • Integration with Azure Pipelines: Automate experiment execution within CI/CD workflows.

Automating Chaos Engineering with Terraform and Azure Pipelines

The repository https://github.com/geralexgr/ai-cloud-modern-workplace provides a ready-to-use automation pipeline that streamlines the deployment and execution of Chaos Engineering experiments.

Terraform for Experiment Setup

Terraform is used to define and deploy chaos experiments in Azure. The repository includes IaC (Infrastructure as Code) to:

  • Provision Chaos Studio experiments.
  • Define failure scenarios (e.g., CPU stress, network latency, VM shutdowns).
  • Assign experiments to specific Azure resources.

Using Terraform ensures that experiments are version-controlled, repeatable, and easily managed across different environments.

Azure DevOps Pipeline for Experiment Execution

A CI/CD pipeline is included in the repository to automate:

  1. Deployment of Chaos Experiments using Terraform.
  2. Execution of Chaos Tests within Azure Chaos Studio.
  3. Monitoring and reporting of experiment results.

This automation allows teams to integrate chaos testing into their release process, ensuring that new changes do not introduce unforeseen weaknesses.

Details

The pipeline consists of two stages. The first creates the experiment with Terraform, and the second runs the experiment created in the first stage.

The experiment targets a specific web app, identified through a variable, and its action is to stop that app. As a prerequisite for running the experiment, you need a user-assigned managed identity with the necessary IAM permissions on the target resource.
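The identity prerequisite can itself be expressed in Terraform. The following is a minimal sketch and is not taken from the repository: the variable names and the identity name are hypothetical, and the "Website Contributor" role is an assumption based on the role Chaos Studio documents for App Service targets, so verify it against your own requirements.

```hcl
# Hypothetical input variables; adjust to match your own module.
variable "chaos_resource_group_name" {
  type = string
}

variable "chaos_location" {
  type = string
}

variable "target_web_app_id" {
  type        = string
  description = "Resource ID of the web app the experiment will stop."
}

# User-assigned managed identity that the experiment runs as.
resource "azurerm_user_assigned_identity" "chaos" {
  name                = "id-chaos-experiment" # hypothetical name
  location            = var.chaos_location
  resource_group_name = var.chaos_resource_group_name
}

# Grant the identity permission to act on the target web app.
# Assumption: "Website Contributor" is sufficient for the App Service stop fault.
resource "azurerm_role_assignment" "chaos_web_app" {
  scope                = var.target_web_app_id
  role_definition_name = "Website Contributor"
  principal_id         = azurerm_user_assigned_identity.chaos.principal_id
}
```

The experiment resource would then reference this identity in its identity block, with type set to "UserAssigned" and the identity's ID listed in identity_ids.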

Finally, you can find the results of the experiment in the Azure portal, inside Chaos Studio.

By combining Terraform, Azure Chaos Studio, and Azure Pipelines, you can automate and streamline Chaos Engineering in Azure. This approach helps identify system weaknesses early, improves system reliability, and ensures your cloud workloads can handle unexpected failures.

Links:

https://github.com/geralexgr/ai-cloud-modern-workplace


Azure Chaos Studio Terraform properties

Recently I was playing around with Azure Chaos Studio, and as a DevOps engineer I wanted to automate experiment creation through Terraform. Although it might seem straightforward, it is not, and along the way you will find that some extra information is needed.

First things first: go through the documentation and use the azurerm_chaos_studio_experiment Terraform resource. This resource requires selectors, which define the resources that the experiment will affect.

selectors {
  name                    = "Selector1"
  chaos_studio_target_ids = [azurerm_chaos_studio_target.example.id]
}
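For context, here is a sketch of where the selectors block sits inside the full experiment resource. It follows the general shape of the example in the Terraform registry docs, but the experiment name, resource group, step and branch names, duration, and the choice of a system-assigned identity are all assumptions made for illustration. The target and capability it references are the ones described in the rest of this post.

```hcl
resource "azurerm_chaos_studio_experiment" "example" {
  name                = "stop-web-app"  # hypothetical experiment name
  location            = "West Europe"
  resource_group_name = "rg-chaos-demo" # hypothetical resource group

  # Assumption: system-assigned identity; a user-assigned identity also works.
  identity {
    type = "SystemAssigned"
  }

  # Which targets the experiment is allowed to affect.
  selectors {
    name                    = "Selector1"
    chaos_studio_target_ids = [azurerm_chaos_studio_target.example.id]
  }

  # What the experiment does to the selected targets.
  steps {
    name = "Step1"

    branch {
      name = "Branch1"

      actions {
        # Assumption: the App Service stop fault runs as a continuous action
        # for the given ISO 8601 duration; check the fault library.
        action_type   = "continuous"
        duration      = "PT10M"
        selector_name = "Selector1"
        urn           = azurerm_chaos_studio_capability.example.urn
      }
    }
  }
}
```

The selector_name inside actions has to match the name given in the selectors block, which is how an action is tied to its targets.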

As the code shows, we then need an azurerm_chaos_studio_target Terraform resource similar to the one below:

resource "azurerm_chaos_studio_target" "example" {
  location           = "West Europe"
  target_resource_id = data.azurerm_windows_web_app.example.id
  target_type        = "Microsoft-AppService"
}
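The target above reads the web app ID from a data source that is not shown. A minimal sketch of that lookup could look like the following, assuming an existing Windows web app; the two variable names are hypothetical.

```hcl
# Hypothetical input variables identifying the existing web app.
variable "web_app_name" {
  type = string
}

variable "web_app_resource_group_name" {
  type = string
}

# Look up the existing Windows web app so its ID can be used as the target.
data "azurerm_windows_web_app" "example" {
  name                = var.web_app_name
  resource_group_name = var.web_app_resource_group_name
}
```

If the app runs on Linux, the azurerm_linux_web_app data source would be the equivalent lookup.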

The first thing I had to look up was the list of available target types. They are listed on the page below. For my example I wanted to target a web app, which is why I selected Microsoft-AppService.

https://learn.microsoft.com/en-us/azure/chaos-studio/chaos-studio-fault-providers

Next, you will need a capability. The capability is referenced inside the steps of the experiment, as shown in the example in the Terraform docs (linked at the bottom of the article). Each capability has a unique name that depends on the resource type and the action you want to perform. You can find the names at the link below.

https://learn.microsoft.com/en-us/azure/chaos-studio/chaos-studio-fault-library#stop-app-service

resource "azurerm_chaos_studio_capability" "example" {
  chaos_studio_target_id = azurerm_chaos_studio_target.example.id
  capability_type        = "Stop-1.0"
}

In my case I wanted to stop a web app, so I selected the matching capability, Stop-1.0. If you need to shut down a virtual machine instead, you will have to use a different capability, Shutdown-1.0.
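As a sketch of the virtual machine variant, the target type and capability change together. The variable name and resource labels here are hypothetical, and the Microsoft-VirtualMachine target type is taken from the fault providers page, so double-check it for your scenario.

```hcl
# Hypothetical input variable holding the resource ID of an existing VM.
variable "virtual_machine_id" {
  type = string
}

# Enable the VM as a Chaos Studio target.
resource "azurerm_chaos_studio_target" "vm" {
  location           = "West Europe"
  target_resource_id = var.virtual_machine_id
  target_type        = "Microsoft-VirtualMachine"
}

# Enable the shutdown capability on that target.
resource "azurerm_chaos_studio_capability" "vm_shutdown" {
  chaos_studio_target_id = azurerm_chaos_studio_target.vm.id
  capability_type        = "Shutdown-1.0"
}
```

The capability type must belong to the chosen target type, so switching the resource kind always means updating both values.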

Bonus:

When running Terraform to create your Azure Chaos Studio resources, you may encounter the error below:

retrieving list of chaos target types: loading results: Get "": unsupported protocol scheme ""

There is an open issue for this in the hashicorp/pandora repository, and at the time of writing it is under investigation.

https://github.com/hashicorp/pandora/issues/4535

Links:

https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/chaos_studio_experiment