Wednesday 19 January 2022

Chaos Engineering by Azure

Azure Chaos Studio - Chaos engineering experimentation | Microsoft Azure 


Chaos is a part of everyday life. To give the subject a mathematical twist: “Chaos describes a situation where typical solutions (or orbits) of a differential equation (or typical evolutions of some other model describing deterministic evolution) do not converge to a stationary or periodic function (of time) but continue to exhibit a seemingly unpredictable behavior.” The key phrase here is unpredictable behavior.

IT systems are no strangers to unpredictable behavior, and the probability of it rises when the cloud is added to the equation. There is a long laundry list of items that can fail in a cloud architecture. Most architects have now begun to add another dimension to their discussions: resiliency, or chaos engineering.

What is Chaos engineering?

Chaos engineering is a methodology that helps developers attain reliability by hardening services against failures in production. 

How is it done?

It is done by deliberately injecting faults that cause the system to fail, for instance by taking dependencies offline (stopping API apps, shutting down VMs), restricting access, introducing slowness in services, and many more. This helps identify whether the application's architecture is indeed resilient to failure.
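To make the idea concrete, here is a minimal in-process sketch of fault injection. It is not how Azure injects faults; every name in it (inject_fault, fetch_price, get_price, DependencyDown) is hypothetical and made up for illustration. It assumes an application that is supposed to fall back to a degraded answer when a dependency is offline, and checks that it really does.

```python
import time


class DependencyDown(Exception):
    """Raised by the injected fault to simulate an offline dependency."""


def inject_fault(func, *, fail=False, delay_seconds=0.0):
    """Wrap a dependency call so it is slowed down or made to fail on purpose."""
    def wrapper(*args, **kwargs):
        if delay_seconds > 0:
            time.sleep(delay_seconds)  # simulate slowness in the service
        if fail:
            raise DependencyDown(f"{func.__name__} is offline (injected)")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(item):
    """Hypothetical stand-in for a real downstream API call."""
    return {"item": item, "price": 42, "source": "live"}


def get_price(item, dependency=fetch_price):
    """Application code under test: degrades gracefully if the dependency fails."""
    try:
        return dependency(item)
    except DependencyDown:
        return {"item": item, "price": None, "source": "fallback"}


# Normal run versus a run with the dependency taken offline.
healthy = get_price("book")
broken = get_price("book", dependency=inject_fault(fetch_price, fail=True))
print(healthy["source"], broken["source"])
```

If the second call raised an exception instead of returning the fallback, the experiment would have found a resiliency gap before a real outage did.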

Of course, resiliency can also be measured, but that is a different topic of discussion.

Introducing Azure Chaos

Azure Chaos (officially Azure Chaos Studio) is a service that helps improve resiliency by way of experiments. At its simplest, an experiment is a plan to introduce controlled faults into a system.
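The notion of an experiment as a plan of controlled faults can be sketched in a few lines. This is only a toy model under stated assumptions, not the Azure experiment format or API: the Fault and Experiment classes, the state dictionary, and the health check are all hypothetical. The point it shows is the "controlled" part: each fault is started, the system is checked, and the fault is always rolled back.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Fault:
    """One controlled fault: how to start it and how to undo it (hypothetical)."""
    name: str
    start: Callable[[], None]
    stop: Callable[[], None]


@dataclass
class Experiment:
    """A plan: an ordered list of faults to apply one at a time (hypothetical)."""
    name: str
    faults: List[Fault] = field(default_factory=list)

    def run(self, check: Callable[[], bool]) -> Dict[str, bool]:
        results = {}
        for fault in self.faults:
            fault.start()
            try:
                # Record whether the system stayed healthy under this fault.
                results[fault.name] = check()
            finally:
                fault.stop()  # always roll the fault back, even if the check raises
        return results


# Toy system: the app needs the database, but should survive without the cache.
state = {"cache_up": True, "db_up": True}


def stop_cache():
    state["cache_up"] = False


def start_cache():
    state["cache_up"] = True


def app_is_healthy():
    return state["db_up"]


experiment = Experiment("cache outage", [Fault("cache offline", stop_cache, start_cache)])
report = experiment.run(app_is_healthy)
print(report)
```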

To start with, Azure Chaos offers a good list of faults, given below.




What does AWS have in the same space?

AWS has the Fault Injection Simulator, a fully managed service for running fault injection experiments on AWS that makes it easier to improve an application’s performance, observability, and resiliency.


Does Azure Chaos support custom faults?

For faults that are not supported, the option is to go with an agent-based fault, which requires setting up and installing the chaos agent, unlike a service-direct fault, which runs directly against an Azure resource without any need for instrumentation.

Does it have any limitations?

Find the limitations here

Parting word…

Architects should start looking at every architecture from a chaos engineering standpoint, especially when the cost of downtime to the business can be very high.
