Understand and baseline the current capabilities, constraints, dependencies, performance, business impact, etc. Does it make sense to conduct ‘Chaos’ tests?
Establish the organization maturity
Assess tools, processes, techniques, automation, culture… to determine the impact of the “blast” and whether it could be contained, managed, and resolved effectively.
Understand and set objectives
What does the organization want to achieve from these ‘Chaos’ experiments? What needs to be discovered or improved? Will it be increased availability, lower MTTR and MTTD, fewer bugs, reduced supervision…?
Select the test environment
Will it be pre-production or production, depending on the organizational maturity assessment? Will it be ‘Canary’ or ‘Generic’…?
Articulate the tests and “blast” scope
What kinds of tests will be conducted: latency injection, fault injection, load generation…?
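As an illustration of latency and fault injection, here is a minimal sketch in Python. The decorator `inject_chaos`, the exception `InjectedFault`, and the two example functions are hypothetical names introduced here; a real experiment would usually inject faults at the network or infrastructure layer rather than inside application code.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised in place of a real failure during an experiment (hypothetical)."""


def inject_chaos(latency_s=0.0, failure_rate=0.0, rng=None):
    """Wrap a function so each call is delayed and fails with a given probability."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                time.sleep(latency_s)  # latency injection
            if rng.random() < failure_rate:
                # fault injection: fail before the real call is made
                raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@inject_chaos(latency_s=0.01, failure_rate=1.0)
def always_fails():
    return "ok"


@inject_chaos(latency_s=0.0, failure_rate=0.0)
def never_fails():
    return "ok"
```

Keeping `latency_s` and `failure_rate` as explicit parameters makes the “blast” scope of each test visible and easy to dial down.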
Establish measures and metrics
Track the progress of the experiment, its impact on the system, and the corrective actions taken and their effect; collect the data as a baseline for future experiments…
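Metrics such as MTTD and MTTR can be derived from timestamps recorded during each run. The sketch below assumes both are measured from the moment the fault is injected (definitions vary between organizations); `ExperimentRun` and its fields are hypothetical names.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ExperimentRun:
    injected_at: float  # seconds: when the fault was injected
    detected_at: float  # seconds: when monitoring raised an alert
    resolved_at: float  # seconds: when the system was back to baseline


def mttd(runs):
    """Mean time to detect, measured from injection (assumed definition)."""
    return mean(r.detected_at - r.injected_at for r in runs)


def mttr(runs):
    """Mean time to resolve, measured from injection (assumed definition)."""
    return mean(r.resolved_at - r.injected_at for r in runs)


# Made-up sample data, for illustration only.
runs = [
    ExperimentRun(injected_at=0.0, detected_at=30.0, resolved_at=300.0),
    ExperimentRun(injected_at=0.0, detected_at=90.0, resolved_at=500.0),
]
```

Storing the raw timestamps, rather than only the averages, keeps the data reusable as a baseline for later experiments.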
Incident response plan
How to contain an incident that occurs during the experiment: containment, corrective actions, rollback…
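One way to build containment into the experiment itself is an abort guard that watches a health metric and always rolls back. This is a minimal sketch; `run_with_guard` and its callback parameters are hypothetical, and a real setup would poll a monitoring system over time rather than a plain function.

```python
def run_with_guard(inject, rollback, read_error_rate, max_error_rate, checks=5):
    """Inject a fault, poll a health metric, and abort if it breaches the limit."""
    try:
        inject()
        for _ in range(checks):
            if read_error_rate() > max_error_rate:
                return "aborted"  # stop early once the limit is breached
        return "completed"
    finally:
        rollback()  # always restore the system, even on abort or exception


# Usage with stand-in callbacks and made-up error rates.
events = []
rates = iter([0.01, 0.02, 0.20])
outcome = run_with_guard(
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
    read_error_rate=lambda: next(rates),
    max_error_rate=0.05,
)
```

Placing the rollback in `finally` means the system is restored whether the run completes, aborts, or raises.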
Evaluate results
Generate insights from collected data to plug the weaknesses.
Our Approach
Think of a system in terms of well-coordinated but demarcated verticals or functions: essentially, a component interaction map to design and plan the experiments better.
These verticals or functions are further detailed during planning, e.g., infrastructure will have CPU, Memory, Storage, Load Balancers etc.
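A component interaction map can be sketched as a simple dependency graph, which also gives a first estimate of the “blast” scope of a failure. The components and the `blast_radius` helper below are hypothetical and stand in for whatever the planning exercise identifies.

```python
from collections import deque

# component -> components it depends on (hypothetical example system)
DEPENDS_ON = {
    "web": ["api"],
    "api": ["database", "cache"],
    "worker": ["database"],
    "database": ["storage"],
    "cache": [],
    "storage": [],
}


def blast_radius(depends_on, failed):
    """Return every component that directly or transitively depends on `failed`."""
    # Invert the map: component -> components that depend on it.
    dependents = {name: set() for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(name)

    # Breadth-first walk over the inverted map.
    affected, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected
```

For example, `blast_radius(DEPENDS_ON, "database")` returns the set containing `api`, `worker`, and `web`, which suggests the services to watch when a database fault is injected.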