
Azure Chaos Studio terraform properties

Recently I was playing around with Azure Chaos Studio, and as a DevOps engineer I wanted to automate experiment creation through Terraform. It looks straightforward, but along the way you will find that a few pieces of information are needed that are not obvious.

First things first: you can go through the documentation and use the azurerm_chaos_studio_experiment Terraform resource. This resource requires selectors, which define the resources that the experiment will affect.

selectors {
  name                    = "Selector1"
  chaos_studio_target_ids = [azurerm_chaos_studio_target.example.id]
}

As the code shows, we then need an azurerm_chaos_studio_target Terraform resource similar to the one below:

resource "azurerm_chaos_studio_target" "example" {
  location           = "West Europe"
  target_resource_id = data.azurerm_windows_web_app.example.id
  target_type        = "Microsoft-AppService"
}

The first thing I had to look up was the list of valid target_type values. They are listed on the page below. For my example I wanted to target a Web App, which is why I selected Microsoft-AppService.

https://learn.microsoft.com/en-us/azure/chaos-studio/chaos-studio-fault-providers
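
The target above reads its target_resource_id from a data source that the snippet does not show. A minimal sketch of that lookup, assuming the web app already exists; the app name and resource group are hypothetical placeholders:

```hcl
# Look up an existing Windows web app so the Chaos Studio target can reference its ID.
# "my-web-app" and "my-resource-group" are placeholders, replace them with your own values.
data "azurerm_windows_web_app" "example" {
  name                = "my-web-app"
  resource_group_name = "my-resource-group"
}
```

For a Linux app, the azurerm_linux_web_app data source should play the same role.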

Afterwards you will need a capability. The capability is referenced inside the steps of the experiment, as shown in the example in the Terraform docs (linked at the bottom of the article). Each capability has a unique name that depends on the action you want to perform and on the resource type. You can find them at the link below.

https://learn.microsoft.com/en-us/azure/chaos-studio/chaos-studio-fault-library#stop-app-service

resource "azurerm_chaos_studio_capability" "example" {
  chaos_studio_target_id = azurerm_chaos_studio_target.example.id
  capability_type        = "Stop-1.0"
}

In my case I wanted to stop a Web App, so I selected Stop-1.0. If you need to shut down a virtual machine instead, you will have to use a different capability, Shutdown-1.0.
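
Putting the pieces together, the experiment could look roughly like the sketch below. It is not a definitive configuration: the experiment name, resource group, step and branch names, and the duration are hypothetical, and the URN is the one I understand the fault library lists for Stop App Service, so verify it there before use.

```hcl
# Sketch of an experiment that ties the target and the capability together.
# Name, resource group, step/branch names and duration are placeholders.
resource "azurerm_chaos_studio_experiment" "example" {
  name                = "chaos-stop-webapp"
  location            = "West Europe"
  resource_group_name = "my-resource-group"

  # System-assigned identity that the experiment uses to act on the target.
  identity {
    type = "SystemAssigned"
  }

  selectors {
    name                    = "Selector1"
    chaos_studio_target_ids = [azurerm_chaos_studio_target.example.id]
  }

  steps {
    name = "step1"

    branch {
      name = "branch1"

      actions {
        # Assumption: URN of the Stop-1.0 capability as listed in the fault library.
        # The capability resource also exports its URN; check the provider docs
        # for the exact attribute name if you prefer a reference over a literal.
        urn           = "urn:csci:microsoft:appService:stop/1.0"
        selector_name = "Selector1"
        action_type   = "continuous"
        duration      = "PT10M"
      }
    }
  }

  # The capability has to exist on the target before the experiment can use it.
  depends_on = [azurerm_chaos_studio_capability.example]
}
```

The explicit depends_on is there because the literal URN gives Terraform no implicit dependency on the capability resource.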

Bonus:

When running Terraform to create your Azure Chaos Studio resources you may encounter the error below:

retrieving list of chaos target types: loading results: Get "": unsupported protocol scheme ""

There is an open bug for this issue in the hashicorp/pandora repository, and it is under investigation.

https://github.com/hashicorp/pandora/issues/4535

Links:

https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/chaos_studio_experiment


Automating chaos experiment execution with Azure DevOps

In the previous article I demonstrated how to create chaos experiments through the Azure portal to test your infrastructure against failures.

To automate the experiment execution through Azure DevOps, we will create a new pipeline and use the Azure CLI task.

trigger:
- none

variables:
- name: EXP_NAME
  value: chaos-az-down
- name: SUB_NAME
  value: YOUR_SUB_ID
- name: RG_NAME
  value: chaos

pool:
  vmImage: ubuntu-latest
stages:
- stage: chaos_stage
  displayName: Chaos Experiment stage
  jobs:
  - job: run_experiment
    displayName: Run chaos experiment job
    steps:
    - task: AzureCLI@2
      displayName: run experiment to stop app service
      inputs:
        azureSubscription: 'MVP'
        scriptType: 'pscore'
        scriptLocation: 'inlineScript'
        inlineScript: 'az rest --method post --uri https://management.azure.com/subscriptions/$(SUB_NAME)/resourceGroups/$(RG_NAME)/providers/Microsoft.Chaos/experiments/$(EXP_NAME)/start?api-version=2023-11-01'

When we run the pipeline we will see that the task succeeded.

Finally the experiment execution will start automatically.

Links:

https://learn.microsoft.com/en-us/azure/chaos-studio/chaos-studio-tutorial-agent-based-cli


Chaos Engineering with Azure – simulate web app failure

Chaos engineering is crucial because it helps organizations proactively test the resilience and reliability of their systems in unpredictable environments. In today’s cloud-native architectures, services are often distributed and complex, making them vulnerable to unexpected failures. By deliberately introducing controlled failures, chaos engineering allows teams to identify weaknesses, ensure systems can recover quickly, and improve overall reliability. It helps organizations prepare for real-world disruptions, enhancing system stability and reducing downtime during critical failures.

Azure Chaos Engineering is Microsoft Azure’s platform for conducting chaos experiments to test the resilience of cloud applications running in Azure. Azure offers Chaos Studio, which allows users to simulate outages, latency issues, or resource exhaustion on various services such as Virtual Machines, Kubernetes clusters, and more. This helps developers and DevOps teams to identify vulnerabilities and fix them before they cause actual service disruptions, ensuring that applications running on Azure are robust and can handle unexpected failures.

In this series of articles we will describe how to implement a chaos engineering framework using Azure services. To put chaos to the test, we will first create an app service (web app) that we will then target with chaos experiments. During the creation of the web app we should select the Free tier so that zone redundancy is disabled. This means that our application will not have high availability across different zones.

When our app is ready we can access it through the auto generated URL and we can see the content as shown below.

We will now navigate to Azure Chaos Studio and choose to create a new experiment from a template.

Then we will select availability zone down

and we will continue by giving our experiment a name and choosing where it will be placed.

Then we can tick the following checkbox:

Enable custom role creation and assignment

As a next step, go to the experiment designer to configure the experiment. The most important thing to configure is the fault action.

By pressing the button we can select one from the available options provided by Microsoft.

For our case we can select the stop app service action, which will stop our web app inside the region. If we had high availability enabled, we would still see our application up and running.

Then we should select the target. Before adding our target to the experiment, we should go to Targets in Chaos Studio and enable it by pressing the button.

When we have our target enabled we can go under our experiment template and select our target.

Finally we press create and our experiment will be ready to use. We can execute it by pressing start and voila.

If you face the error below, you should grant the chaos experiment's identity the permissions it needs to perform actions in Azure, such as starting and stopping the app service.

The target resource(s) could not be resolved. Please verify that your targets exist and your managed identity has sufficient permissions on all target resources. Error Code: AccessDenied. Target Resource(s):

You can do that by navigating inside the identity and pressing Azure role assignments.
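
If the experiment is managed with Terraform, the same permission could be sketched as a role assignment. This assumes the experiment declares a system-assigned identity and that the built-in Website Contributor role is enough to stop and start the app service. The resource names are hypothetical and follow the naming used in the Terraform article.

```hcl
# Grant the experiment's managed identity rights on the web app it targets.
# Assumption: the experiment resource declares identity { type = "SystemAssigned" }.
resource "azurerm_role_assignment" "chaos_web_app" {
  scope                = data.azurerm_windows_web_app.example.id
  role_definition_name = "Website Contributor"
  principal_id         = azurerm_chaos_studio_experiment.example.identity[0].principal_id
}
```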

When you finally have the permissions to execute the actions described in the experiment template then the experiment will start.

and it will stop your web app as requested.

You all know this error, right? This happens because our setup is not highly available, an important factor that we should take into consideration in our architecture.