Our Approach

Understand the system and practices

  • Understand and baseline the current capabilities, constraints, dependencies, performance, and business impact. Does it make sense to conduct ‘Chaos’ tests?

Establish the organization maturity

  • Assess tools, processes, techniques, automation, culture… to determine the impact of the “blast” and whether it could be contained, managed, and resolved effectively.

Understand and set objectives

  • What does the organization want to achieve from these ‘Chaos’ experiments? What needs to be discovered or improved? Will it be increased availability, lower MTTR or MTTD, fewer bugs, reduced supervision…?

Select the test environment

  • Will it be pre-production or production, based on the organizational maturity assessment? Will it be ‘Canary’ or ‘Generic’…?

Articulate the tests and “blast” scope

  • What kind of tests will be conducted: latency injection, fault injection, load generation…?
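
To make latency injection and fault injection concrete, here is a minimal sketch in Python. It is an illustration only, not a specific chaos tool: the names `chaos`, `InjectedFault`, and `fetch_price` are hypothetical, and a real experiment would usually inject faults at the network or infrastructure layer rather than inside application code.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised in place of a real failure during an experiment (hypothetical name)."""


def chaos(latency_s=0.0, failure_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap a call so it is delayed and, with some probability, fails."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # latency injection: delay every call
            if rng.random() < failure_rate:
                raise InjectedFault(func.__name__)  # fault injection
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Hypothetical dependency call: 50 ms extra latency, roughly 20% injected failures.
@chaos(latency_s=0.05, failure_rate=0.2, rng=random.Random(7))
def fetch_price(item):
    return {"item": item, "price": 10}
```

Keeping `latency_s` and `failure_rate` as explicit parameters is one way to keep the “blast” scope small and adjustable per experiment.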

Establish measures and metrics

  • Track the progress of the experiment and its impact on the system, the corrective actions and their impact, and the data collected for future experiments and baselines…
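
As an illustration of such measures, the sketch below computes MTTD and MTTR from recorded experiment timestamps. The `Incident` record and its field names are hypothetical, and it assumes MTTR is measured from the start of the fault to resolution; some teams measure it from detection instead.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    """Timestamps in seconds for one injected fault (hypothetical record)."""
    started: float   # when the fault was injected
    detected: float  # when monitoring or a person noticed it
    resolved: float  # when the system was back to its steady state


def mttd(incidents):
    """Mean time to detect: average of (detected - started)."""
    return mean(i.detected - i.started for i in incidents)


def mttr(incidents):
    """Mean time to recover: average of (resolved - started), by assumption."""
    return mean(i.resolved - i.started for i in incidents)
```

Storing the raw timestamps rather than only the averages lets later experiments be compared against the same baseline.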

Incident response plan

  • How to contain an incident that occurs during the experiment: containment, corrective actions, rollback…
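
One possible shape for the containment and rollback step is an experiment runner that aborts when a health metric crosses a threshold and always rolls back. This is a sketch under assumptions: `run_experiment` and its callback parameters are hypothetical, and the single error-rate check stands in for whatever steady-state signals the organization actually monitors.

```python
def run_experiment(inject, rollback, error_rate, max_error_rate, checks=5):
    """Inject a fault, watch a health metric, and always roll back.

    inject, rollback, and error_rate are caller-supplied callables
    (hypothetical hooks into the system under test).
    """
    inject()
    try:
        for _ in range(checks):
            if error_rate() > max_error_rate:
                return "aborted"  # containment: stop as soon as the limit is breached
        return "completed"
    finally:
        rollback()  # runs on completion, abort, or an unexpected exception
```

Placing the rollback in a `finally` clause means the corrective action does not depend on the experiment finishing cleanly.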

Evaluate results

  • Generate insights from collected data to plug the weaknesses.

Think of a system in terms of well-coordinated but demarcated verticals or functions: essentially, a component interaction map that helps design and plan the experiments better. These verticals or functions are further detailed during planning, e.g., infrastructure will have CPU, memory, storage, load balancers, etc.
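
A component interaction map can be sketched as a dependency graph, which then gives a rough estimate of the “blast” scope of failing one component. The component names and the `DEPENDS_ON` and `blast_radius` identifiers below are hypothetical, and the sketch assumes that a failure affects every component that depends on the failed one, directly or indirectly.

```python
from collections import deque

# Hypothetical map: each component lists the components it calls.
DEPENDS_ON = {
    "web": ["api"],
    "api": ["db", "cache"],
    "batch": ["db"],
    "db": [],
    "cache": [],
}


def blast_radius(depends_on, failed):
    """Return every component that directly or transitively depends on `failed`."""
    # Invert the map: component -> components that call it.
    dependents = {name: set() for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(name)

    # Breadth-first walk upward from the failed component.
    affected, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected
```

With this hypothetical map, failing `db` reaches `api`, `batch`, and `web`, while failing `web` affects nothing else, which suggests starting experiments at the edge components.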