Design for Robustness
Complex systems are inherently unpredictable. A key concept in studying them is emergence: the idea that the system as a whole can behave very differently from its individual parts. Combined with partial observability, this leads to nondeterministic behavior.
While we can attempt to model complex systems, their behavior is often stochastic rather than precisely predictable. Examples of complex systems include biological systems like the human body, social systems, financial markets, and the one we’ll focus on here: distributed software.
There are many proven practices that help improve software quality, including test-driven development, good test coverage, continuous integration, code reviews, frequent small deployments, static analysis tools, and pre-release checklists or risk assessments. But there is a saying in software circles: "All bugs have one thing in common — they all passed the tests."
Distributed software systems are inherently complex, exhibiting emergent behaviors and a high degree of nondeterminism. While the rigorous engineering practices described earlier are essential, they are not sufficient to cover the vast space of possible interactions and failure modes. Is there a better way to think about quality in complex software systems?
This reminds me of Taguchi’s Robust Engineering methods, which focus on making systems less sensitive to variations in noise factors — and therefore more reliable and consistent. Bob Moesta, who worked with Taguchi in the 90s, introduced me to these ideas some time ago. There are three things I’m exploring to see how they can fit into our software development process.
P-Diagrams are a useful way to model a system by breaking it down into signal, control, and noise factors. This helps clarify what we can directly control versus what we can’t — the noise factors. In some cases, we can design the system to take control of a noise factor. In others, we need to build the system to tolerate the variations.
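To make this concrete, here is a minimal sketch of what a P-diagram might look like if you capture it as data, for a hypothetical order-processing API. Every factor name below is an assumption I made up for illustration, not something taken from a real system:

```python
# A P-diagram captured as data for a hypothetical order-processing API.
# All factor names are illustrative assumptions, not from a real system.
from dataclasses import dataclass, field


@dataclass
class PDiagram:
    """Signal in, response out; control factors we set, noise factors we only tolerate."""
    signal: str                                                  # the input the system transforms
    response: str                                                # the intended output
    control_factors: list[str] = field(default_factory=list)    # things we choose
    noise_factors: list[str] = field(default_factory=list)      # things we cannot fix


order_api = PDiagram(
    signal="incoming order request",
    response="order persisted and acknowledged",
    control_factors=["retry policy", "timeout budget", "queue depth", "replica count"],
    noise_factors=["network latency spikes", "dependency outages",
                   "malformed client payloads", "traffic bursts"],
)
```

Even this simple listing forces a useful conversation: for each noise factor, do we try to bring it under control, or do we design the system to tolerate it?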
Once you’ve modeled the system’s factors, you can start designing and testing for robustness. Techniques like property-based testing and chaos engineering are built for this. Most modern languages now have frameworks that support fuzzing and property-based testing; Python’s Hypothesis is a popular example. On the chaos engineering side, tools like Chaos Mesh and Chaos Monkey help simulate failures and unpredictable conditions in distributed systems.
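Here is a minimal Hypothesis sketch of the property-based testing side. The encode/decode pair is a toy stand-in I invented for illustration; the point is that you state a property and let the framework hunt for counterexamples instead of hand-picking cases:

```python
# A minimal Hypothesis sketch: state a property ("decode(encode(x)) == x")
# and let the framework search for inputs that violate it.
# The encode/decode pair is a toy stand-in for whatever serializer you own.
import json

from hypothesis import given, strategies as st


def encode(record: dict) -> str:
    return json.dumps(record, sort_keys=True)


def decode(payload: str) -> dict:
    return json.loads(payload)


@given(st.dictionaries(keys=st.text(), values=st.integers() | st.text() | st.booleans()))
def test_round_trip(record):
    # Property: encoding then decoding any record returns the original record.
    assert decode(encode(record)) == record
```

Run under pytest, Hypothesis generates many random dictionaries, shrinks any failing input to a minimal counterexample, and replays it on the next run.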
Taguchi also made extensive use of Design of Experiments, a method for efficiently exploring interactions between control and noise factors. I’m still wrapping my head around how to apply it within a software development process, but I’d love to use these methods to understand and tune complex system designs.
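Here is roughly how I picture a first, crude experiment in code: a full-factorial sweep over a couple of control and noise factors for a hypothetical retry loop, measuring how often a simulated call succeeds. A real Taguchi study would use orthogonal arrays and signal-to-noise ratios to cut down the number of runs; this sketch only shows the shape of the exploration, and every number in it is made up:

```python
# A crude Design of Experiments sketch: full-factorial sweep over control and
# noise factors for a hypothetical retry loop, measuring simulated success rate.
import itertools
import random

CONTROL = {"retries": [1, 3, 5], "timeout_s": [0.5, 2.0]}
NOISE = {"failure_rate": [0.05, 0.30], "latency_s": [0.1, 1.5]}


def simulate_call(retries, timeout_s, failure_rate, latency_s, trials=500):
    """Return the fraction of simulated requests that eventually succeed."""
    successes = 0
    for _ in range(trials):
        for _ in range(retries):
            if latency_s > timeout_s:        # this attempt times out, try again
                continue
            if random.random() > failure_rate:
                successes += 1
                break
    return successes / trials


for combo in itertools.product(*CONTROL.values(), *NOISE.values()):
    retries, timeout_s, failure_rate, latency_s = combo
    rate = simulate_call(retries, timeout_s, failure_rate, latency_s)
    print(f"retries={retries} timeout={timeout_s}s "
          f"failure_rate={failure_rate} latency={latency_s}s -> success={rate:.2f}")
```

The configurations worth keeping are the ones whose success rate stays high across all the noise combinations, not just the ones that win under friendly conditions.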
We’re lucky in software — we have a growing set of methods and tools to help us design, test, and build for robustness. I’d love to see more teams use them.
Over the next few months, I’m planning to explore these ideas further and look for ways to apply them in the clinical trials software space — where building more robust systems can directly support better drug development and improve patient outcomes.