
Chaos Engineering in Azure: Automating Resilience Testing with Terraform & Pipelines

Chaos Engineering in Azure with Chaos Studio

Azure Chaos Studio is Microsoft’s managed Chaos Engineering service, allowing teams to create controlled failure scenarios in a safe and repeatable manner. With fault injection capabilities across compute, networking, and application layers, teams can simulate real-world incidents and enhance their system’s resilience.

Key Features of Azure Chaos Studio:

  • Agent-based and Service-based faults: Inject failures at the infrastructure or application level.
  • Targeted chaos experiments: Apply disruptions to specific resources like VMs, AKS, or networking components.
  • Integration with Azure Pipelines: Automate experiment execution within CI/CD workflows.

Automating Chaos Engineering with Terraform and Azure Pipelines

The repository https://github.com/geralexgr/ai-cloud-modern-workplace provides a ready-to-use automation pipeline that streamlines the deployment and execution of Chaos Engineering experiments.

Terraform for Experiment Setup

Terraform is used to define and deploy chaos experiments in Azure. The repository includes IaC (Infrastructure as Code) to:

  • Provision Chaos Studio experiments.
  • Define failure scenarios (e.g., CPU stress, network latency, VM shutdowns).
  • Assign experiments to specific Azure resources.

Using Terraform ensures that experiments are version-controlled, repeatable, and easily managed across different environments.
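As an illustration of what such IaC might look like, below is a minimal, hypothetical sketch of a Chaos Studio experiment in Terraform. The resource names, the target web app, and the exact attributes are assumptions based on the azurerm provider's Chaos Studio resources, not code copied from the repository:

```hcl
# Onboard the target web app to Chaos Studio (service-direct target)
resource "azurerm_chaos_studio_target" "webapp" {
  location           = azurerm_resource_group.chaos.location
  target_resource_id = azurerm_linux_web_app.demo.id
  target_type        = "Microsoft-AppService"
}

# Enable the "Stop" fault capability on that target
resource "azurerm_chaos_studio_capability" "stop" {
  chaos_studio_target_id = azurerm_chaos_studio_target.webapp.id
  capability_type        = "Stop-1.0"
}

# The experiment itself: one step, one branch, one continuous action
resource "azurerm_chaos_studio_experiment" "stop_webapp" {
  name                = "stop-webapp-experiment"
  location            = azurerm_resource_group.chaos.location
  resource_group_name = azurerm_resource_group.chaos.name

  identity {
    type         = "UserAssigned"
    identity_ids = [azurerm_user_assigned_identity.chaos.id]
  }

  selectors {
    name                    = "webapp-selector"
    chaos_studio_target_ids = [azurerm_chaos_studio_target.webapp.id]
  }

  steps {
    name = "step-1"
    branch {
      name = "branch-1"
      actions {
        urn           = azurerm_chaos_studio_capability.stop.urn
        selector_name = "webapp-selector"
        action_type   = "continuous"
        duration      = "PT5M"
      }
    }
  }
}
```

Because the target, capability, and experiment are ordinary Terraform resources, the same failure scenario can be promoted across environments with nothing more than different variable values.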

Azure DevOps Pipeline for Experiment Execution

A CI/CD pipeline is included in the repository to automate:

  1. Deployment of Chaos Experiments using Terraform.
  2. Execution of Chaos Tests within Azure Chaos Studio.
  3. Monitoring and reporting of experiment results.

This automation allows teams to integrate chaos testing into their release process, ensuring that new changes do not introduce unforeseen weaknesses.

Details

The pipeline consists of two stages: the first creates the experiment through Terraform, and the second runs the experiment created in the previous step.
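In outline, the two stages could look like the following sketch. The task names and Terraform arguments here are illustrative, not copied from the repository:

```yaml
stages:
- stage: deploy_experiment
  displayName: Create chaos experiment
  jobs:
  - job: terraform
    steps:
    # provision the Chaos Studio experiment defined in the .tf files
    - script: |
        terraform init
        terraform apply -auto-approve
      displayName: terraform apply

- stage: run_experiment
  displayName: Run chaos experiment
  dependsOn: deploy_experiment
  jobs:
  - job: start
    steps:
    - task: AzureCLI@2
      displayName: start experiment
      inputs:
        azureSubscription: 'MVP'
        scriptType: 'pscore'
        scriptLocation: 'inlineScript'
        inlineScript: 'az rest --method post --uri https://management.azure.com/subscriptions/$(SUB_NAME)/resourceGroups/$(RG_NAME)/providers/Microsoft.Chaos/experiments/$(EXP_NAME)/start?api-version=2023-11-01'
```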

The experiment targets a specific web app, identified via a variable, and its action is to stop it. A prerequisite for running the experiments is a user-assigned managed identity with the necessary IAM role assignments on the target resources.
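For example, the identity and its role assignment could be prepared with the Azure CLI along these lines. The identity name, role, and scope below are illustrative placeholders, not values from the repository:

```shell
# create the user-assigned managed identity (name is illustrative)
az identity create --name chaos-identity --resource-group chaos

# grant it permission to stop the target web app
# (Website Contributor is one built-in role that covers the stop action)
az role assignment create \
  --assignee "<principal-id-of-chaos-identity>" \
  --role "Website Contributor" \
  --scope "/subscriptions/<sub-id>/resourceGroups/chaos/providers/Microsoft.Web/sites/<webapp-name>"
```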

Finally, you can find the results of the experiment in the Azure portal, inside Chaos Studio.

By combining Terraform, Azure Chaos Studio, and Azure Pipelines, you can automate and streamline Chaos Engineering in Azure. This approach helps identify system weaknesses early, improves system reliability, and ensures your cloud workloads can handle unexpected failures.

Links:

https://github.com/geralexgr/ai-cloud-modern-workplace


Automating chaos experiment execution with Azure DevOps

In the previous article I demonstrated how to create chaos experiments to test your infrastructure against failures through the Azure portal.

In order to automate the experiment execution through Azure DevOps, we will need to create a new pipeline and use the Azure CLI task (AzureCLI@2).

trigger:
- none

variables:
- name: EXP_NAME
  value: chaos-az-down
- name: SUB_NAME
  value: YOUR_SUB_ID
- name: RG_NAME
  value: chaos

pool:
  vmImage: ubuntu-latest
stages:
- stage: chaos_stage
  displayName: Chaos Experiment stage
  jobs:
  - job: run_experiment
    displayName: Run chaos experiment job
    steps:
    - task: AzureCLI@2
      displayName: run experiment to stop app service
      inputs:
        azureSubscription: 'MVP'
        scriptType: 'pscore'
        scriptLocation: 'inlineScript'
        inlineScript: 'az rest --method post --uri https://management.azure.com/subscriptions/$(SUB_NAME)/resourceGroups/$(RG_NAME)/providers/Microsoft.Chaos/experiments/$(EXP_NAME)/start?api-version=2023-11-01'

When we run the pipeline, we will see that the task succeeded.

Finally, the experiment execution will start automatically.
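If you also want to verify the run from the pipeline (or a local shell), the execution history can be queried through the same REST surface. A minimal sketch, assuming the same variable names as the pipeline above:

```shell
# list executions of the experiment and inspect their status
az rest --method get \
  --uri "https://management.azure.com/subscriptions/$SUB_NAME/resourceGroups/$RG_NAME/providers/Microsoft.Chaos/experiments/$EXP_NAME/executions?api-version=2023-11-01"
```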

Links:

https://learn.microsoft.com/en-us/azure/chaos-studio/chaos-studio-tutorial-agent-based-cli


Automatic rollback procedure for Azure DevOps

Azure DevOps pipelines provide a variety of tools for automated procedures. One mechanism that administrators can build using the YAML structure is an automated rollback during a deployment.

This means that after a deployment you can revert to the previous state using your YAML tasks, without having to redeploy. Another case would be a broken deployment identified by monitoring tools, where a validation step approves or rejects the final release. This is exactly what the image below depicts: after releasing a version, a validation step requires manual approval from an administrator. If the validation is approved, the release proceeds; otherwise, the rollback is triggered.

This mechanism is described below in YAML. The release stage includes release, validation, and rollback jobs. The release job performs the actual release. The validation job depends on the release job and continues only if it is approved. The rollback job runs only if validation failed, which means that an administrator rejected the approval.

trigger: none
pr: none

stages:

- stage: releaseStage
  jobs:

  - deployment: release
    displayName: Release
    environment:
      name: dev
      resourceType: VirtualMachine
    strategy:
      runOnce:
        deploy:
          steps:
            - task: PowerShell@2
              displayName: hostname
              inputs:
                targetType: 'inline'
                script: |
                    deployment script here...
  
  - job: validation
    dependsOn: release
    pool: server
    steps:
    - task: ManualValidation@0
      inputs:
        notifyUsers: 'admin@domain.com'
        instructions: 'continue?'
        onTimeout: reject

  - deployment: rollback
    displayName: rollback
    dependsOn: validation
    condition: failed()
    environment:
      name: dev
      resourceType: VirtualMachine
    strategy:
      runOnce:
        deploy:
          steps:
            - task: PowerShell@2
              displayName: rolling back
              inputs:
                targetType: 'inline'
                script: |
                    rollback script here..
                    Write-Host "rollback"

When the release is verified by the administrator, the rollback job will be skipped. This is the case when the validation is approved by the user.

The validation task will ask the user for a review.

On the other hand, if the validation is rejected, the rollback job will run.


Dynamically set dependsOn using variables – Azure DevOps

dependsOn is a property in Azure DevOps pipelines with which you can define dependencies between jobs and stages.

An example can be found in the picture below, where stage2 depends on the production stage and will execute only when the production stage finishes. If the production stage fails, then stage2 will not continue its execution.

The typical way to define a dependency is to name the stages and reference the stage you depend on. For example, in stage2 we use dependsOn with the value stage1:

stages:
- stage: stage1
  displayName: running stage1
  jobs:
  - job: job1
    displayName: running job1
    steps:
    - script: echo job1.task1
      displayName: running job1.task1  

- stage: stage2
  dependsOn: stage1
  displayName: running stage2
  jobs:
  - job: job2
    displayName: running job2
    steps:
    - script: echo job2.task1
      displayName: running job2.task1

However, you can also define dependsOn using a variable. This means that you can dynamically set the stage on which another stage depends, instead of setting it statically.

An example of this can be found below:

parameters:
  - name: myparam
    type: string
    values:
      - production
      - dev
      - qa

variables:
  ${{ if eq( parameters['myparam'], 'production' ) }}:
    myenv: production
  ${{ elseif eq( parameters['myparam'], 'dev' ) }}:
    myenv: dev
  ${{ elseif eq( parameters['myparam'], 'qa' ) }}:
    myenv: qa

trigger:
- none

pool:
  vmImage: ubuntu-latest

stages:
- stage: ${{ variables.myenv }}
  displayName: running ${{ variables.myenv }}
  jobs:
  - job: job1
    displayName: running job1
    steps:
    - script: echo job1.task1
      displayName: running job1.task1  

- stage: stage2
  dependsOn: ${{ variables.myenv }}
  displayName: running stage2
  jobs:
  - job: job2
    displayName: running job2
    steps:
    - script: echo job2.task1
      displayName: running job2.task1

When we run the pipeline we will be asked for the environment as a parameter.

This parameter will be then passed into a variable and then this variable will be used for dependsOn condition.

You could also use the parameter itself as shown below.

- stage: stage2
  dependsOn: ${{ parameters.myparam }}
  displayName: running stage2
  jobs:
  - job: job2
    displayName: running job2
    steps:
    - script: echo job2.task1
      displayName: running job2.task1

Keep in mind that when you use variables in dependsOn, you must use template expression syntax (${{ }}), which is processed at compile time.
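The distinction matters because macro syntax is only expanded at runtime, after the stage dependency graph has already been built. A short sketch of what does and does not work:

```yaml
# works: template expression syntax, resolved at compile time
- stage: stage2
  dependsOn: ${{ variables.myenv }}

# does NOT work: macro syntax is resolved at runtime,
# after dependencies have already been evaluated
# - stage: stage2
#   dependsOn: $(myenv)
```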
