Platform lesson #3: Balance architecture and continuous integration/test
A key challenge for any software product is quality. Before we can do anything to experiment with customers and improve the product, we need to make sure that the product does what it’s supposed to. Although no software is free from defects, too many of them hurt both the perception of the product and the value it is meant to deliver.
A key enabler of quality is test automation. Running tests every time new code is checked in, and running more advanced system-level tests periodically, is one of the most effective ways to ensure that functionality that used to work still works. Of course, this needs to be complemented with test-driven development and some amount of manual exploratory testing.
The best way to set up a test infrastructure and what to test when are subjects that have received vast amounts of research, and every software engineer has an opinion about them. I’ll not go into details here, but in an earlier post, I discussed the CIVIT model as an effective approach to visualize all the test activities that are conducted end-to-end for a product. This helps understand the current state, define the desired state and prioritize improvements.
The challenge for complex software is that you can’t test your way to quality. Especially for platforms, the number of configurations and connections between different parts of the system is so high that it’s simply impossible, or at least prohibitively expensive, to test everything. This is made worse for platform-based products that allow for significant configuration of each product instance at the customer site. The result may be a very high number of issues being reported by customers with little commonality between them.
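To get a feel for the numbers, here is a back-of-the-envelope sketch. The option counts and the one-minute test duration are purely hypothetical assumptions, chosen only to illustrate how quickly independent configuration options multiply.

```python
from math import prod


def configuration_count(option_sizes):
    """Number of distinct configurations when every option is chosen independently."""
    return prod(option_sizes)


# Hypothetical platform: 20 on/off feature flags plus 3 settings with 4 values each.
option_sizes = [2] * 20 + [4] * 3
total = configuration_count(option_sizes)

# Assumed duration of one system-level test run, in seconds.
seconds_per_run = 60
seconds_per_year = 60 * 60 * 24 * 365
years = total * seconds_per_run / seconds_per_year

print(f"{total:,} configurations, about {years:.0f} years of sequential testing")
```

Under these assumptions, a modest set of options already yields tens of millions of configurations and more than a century of sequential system-level testing, and that is before customer-site configuration is taken into account.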
The best way to deal with this challenge is not to test more, although that may be required, but rather to focus on refactoring the platform and product architecture. A clean architecture with strong interfaces and decoupled functionality simplifies testing: most testing can be pushed to the component and subsystem level, and system-level testing can be minimized. For platforms, this means that it should be possible to test them independently of the products. This, of course, requires a defined API that the products need to use. It also means that products can be tested, at least to some extent, without the platform.
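As an illustration of testing a product without the platform, consider the following minimal sketch. All names in it (PlatformApi, FakePlatform, OverheatAlarm, read_sensor) are hypothetical and invented for this example; the only point is that product code depends on a defined API rather than on the platform implementation.

```python
from typing import Protocol


class PlatformApi(Protocol):
    """Hypothetical API that the platform exposes to products."""

    def read_sensor(self, name: str) -> float:
        ...


class FakePlatform:
    """Test double that satisfies PlatformApi without any real platform code."""

    def __init__(self, readings):
        self._readings = dict(readings)

    def read_sensor(self, name: str) -> float:
        return self._readings[name]


class OverheatAlarm:
    """Hypothetical product feature that only talks to the platform via the API."""

    def __init__(self, platform: PlatformApi, limit_celsius: float):
        self._platform = platform
        self._limit = limit_celsius

    def is_triggered(self) -> bool:
        return self._platform.read_sensor("temperature") > self._limit


def test_alarm_triggers_above_limit():
    alarm = OverheatAlarm(FakePlatform({"temperature": 95.0}), limit_celsius=90.0)
    assert alarm.is_triggered()


def test_alarm_stays_quiet_below_limit():
    alarm = OverheatAlarm(FakePlatform({"temperature": 20.0}), limit_celsius=90.0)
    assert not alarm.is_triggered()


test_alarm_triggers_above_limit()
test_alarm_stays_quiet_below_limit()
```

The same API can be exercised from the other side: the platform team can run a suite against the real implementation of the API without involving any product code.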
In practice, however, every architecture suffers from architectural technical debt and, no matter how the system is decomposed, there will always be functionality and quality requirements that have cross-cutting consequences. Consequently, we need to balance architecture work and investment in continuous integration and test. The architecture helps engineers avoid mistakes in the first place, and continuous integration and test catch introduced errors early, so that the cost of fixing them is minimal.
With more and more products being connected, an additional factor is the use of the post-deployment stage to identify quality issues. Of course, the main functionality needs to be confirmed before deployment, but minor and rare quality issues can be identified post-deployment by monitoring system behavior and detecting deviations from the baseline established by the previous software version. I know of several cases where companies used this to detect issues at customer sites, after which they rolled out a fix before the customer even noticed the issue.
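A very simple version of such baseline monitoring could look like the sketch below. The function name, the latency metric, the sample values and the threshold of three standard deviations are all assumptions made for illustration; a real deployment would likely use more robust statistics and many more metrics.

```python
from statistics import mean, stdev


def deviates_from_baseline(baseline, current, threshold=3.0):
    """Return True if the mean of `current` lies more than `threshold`
    baseline standard deviations away from the baseline mean.

    `baseline` holds measurements from the previous software version and
    needs at least two samples; `current` holds measurements from the
    newly deployed version.
    """
    base_mean = mean(baseline)
    base_std = stdev(baseline)
    if base_std == 0:
        # A perfectly flat baseline: any change at all counts as a deviation.
        return mean(current) != base_mean
    return abs(mean(current) - base_mean) / base_std > threshold


# Hypothetical response times in milliseconds.
previous_version = [100, 102, 98, 101, 99]
healthy_release = [101, 100, 99]
regressed_release = [110, 112, 108]

print(deviates_from_baseline(previous_version, healthy_release))
print(deviates_from_baseline(previous_version, regressed_release))
```

With these sample values, the healthy release stays within the baseline while the regressed release is flagged, which is the signal that would trigger an investigation or a rollback before the customer notices.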
As more and more products with high-reliability requirements adopt DevOps, ensuring quality becomes increasingly important. The knee-jerk reaction in many companies is to simply test more. However, in complex systems, such as platform-based products, you can’t test your way to quality; the number of configurations is often prohibitively large. A complementary architecture refactoring initiative is required. Decoupling components and minimizing interaction, for example through a message bus and microservices, is a powerful way to focus test effort on the components and reduce system-level testing. Remember, sometimes the obvious is the wrong thing to do. Fix root causes, not symptoms.