Back in the years 2013-2016, I visited numerous customers for a Continuous Delivery Maturity Scan or DevOps Capability Review. In those days, CI/CD was new to many enterprise customers. Since then, I have held other jobs focussing on development and architecture of microservices and cloud adoption.
Recently, my friend asked about these CI/CD and DevOps reviews. More than 6 years have passed since my last review, so I pointed out that quite a few things have changed since then. Here is a write-up of what I would look at when assessing CI/CD and DevOps capabilities today.
Disclaimer: This blog post is my personal opinion and does not represent my employer.
Cloud is a CI/CD catalyst
For large enterprises, the years 2015 and prior were early days for cloud adoption. Some companies might have moved their web and mobile applications to cloud as a pilot. But largely, enterprise workloads were running on-premises using tools like Puppet, Chef, Ansible and Salt Stack. We were looking at the first versions of Kubernetes and there were no managed offerings.
Cloud technology is an enabler for CI/CD and DevOps. The ability to spin up and tear down full environments as part of your CI/CD pipeline adds to your flexibility and reduces the need for long-living environments. You spin up an environment only for the duration of the (automated) test where possible. This reduces cost and adds to your flexibility to parallelize for speed as needed.
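As a minimal sketch of the "environment only for the duration of the test" idea, a pipeline step could wrap its test run in a context manager that always tears the environment down. Everything here is hypothetical: ACTIVE_ENVIRONMENTS is an in-memory stand-in for a cloud provider, and ephemeral_environment and run_integration_tests are illustrative names, not part of any real tool.

```python
from contextlib import contextmanager
import uuid

# Hypothetical in-memory stand-in for a cloud provider. A real pipeline
# would call its Infrastructure as Code tooling here instead.
ACTIVE_ENVIRONMENTS = set()


@contextmanager
def ephemeral_environment(prefix="ci"):
    """Create an environment for one test run and always tear it down."""
    env_id = f"{prefix}-{uuid.uuid4().hex[:8]}"
    ACTIVE_ENVIRONMENTS.add(env_id)  # "spin up"
    try:
        yield env_id
    finally:
        ACTIVE_ENVIRONMENTS.discard(env_id)  # "tear down", even if tests fail


def run_integration_tests(env_id):
    # Placeholder for the real test suite that targets env_id.
    return env_id in ACTIVE_ENVIRONMENTS


with ephemeral_environment() as env:
    result = run_integration_tests(env)
```

Because each run gets its own uniquely named environment, several of these blocks can run in parallel without stepping on each other, which is where the speed benefit comes from.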
So the first thing I'd look for today is the customer's state of cloud adoption. I recommend that customers go all-in with cloud technology. Make sure you choose the right cloud provider. Security and reliability are non-negotiables.
Organization: self-contained, decentralized, multi-disciplinary teams
We now live in a post-Agile era. That does not mean we're not agile anymore. It means Agile is table stakes. I am not worried about a specific methodology. Ideally, an organization would have used a methodology like Scrum as training wheels, grown beyond the method, and now be continuously optimizing.
What do we look for in an organization? Multi-disciplinary teams, centered around a business objective. Teams that can deliver independently at their own pace. Organizations like this have a decentralized nature.
Does that mean no centralized teams at all? Unlikely. Depending on your organization and requirements, there may still be value in some centralized capabilities. Just make sure these teams do not become bottlenecks. Any services offered should be available with the push of a button or the call of an API. Any engagements with specialists should be incidental on a need basis rather than structural dependencies for every release. For structural dependencies, make the specialist part of the multi-disciplinary team.
For a more sophisticated discussion of organizational design, have a look at the book Team Topologies by Matthew Skelton and Manuel Pais.
Architecture: use reverse-Conway's law to achieve autonomous Microservices
Organization and architecture should go hand-in-hand. For teams to deliver independently, aim for a Microservice architecture where teams fully own one or more services. Apply reverse-Conway's law so that your organizational design leads to good architecture.
For deployment speed, follow best practices from the 12 Factor App manifesto. In addition, adopt managed services where possible. You don't want to waste cycles running a self-managed NoSQL database when you can adopt DynamoDB instead. And even in managed databases, aim for the solution that requires the least maintenance. For example, DynamoDB doesn't ask you to specify maintenance windows and doesn't require you to "upgrade" the database on a regular basis. That makes it a preferable default choice over RDS when you get the chance. The same goes for your container orchestration service. You don't want to be forced into a full blue-green deployment to a new cluster whenever a new version comes along. (See also McKinsey's recent article "Does Kubernetes really give you multicloud portability?")
Create flow with fully automated pipelines
Back in 2014, when I was interviewing teams about their CI/CD practices, I often found that these teams had reasonable levels of build automation, test automation and deployment automation. And still, they were experiencing long cycle times with wait times in-between steps. What was missing? A pipeline that connects everything together in a fully automated fashion.
The value of automation is largely eliminated when you need to file a ticket to another team to trigger the automation several hours or even days later. Automation is most valuable when connected as part of a fully automated pipeline.
Another litmus test: when any step in the pipeline breaks, how long does it take for someone to notice and take action? Is the failure abundantly visible? Are there alarm bells ringing and lights flashing in your office? Or do you at least have a monitor that changes color? A set of Slack messages that keep reminding you until you acknowledge the issue? Use the technique that works for you in practice, but make sure the pipeline serves as an Andon cord where all development stops until the pipeline is back up and running.
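The "keep reminding until acknowledged" behaviour can be sketched in a few lines. This is an illustration under assumptions: PipelineAlert and remind are hypothetical names, and notify stands in for whatever channel you use (chat message, dashboard, siren). A scheduler would call remind on every tick.

```python
from dataclasses import dataclass


@dataclass
class PipelineAlert:
    """Hypothetical record of a broken pipeline stage."""
    stage: str
    acknowledged: bool = False
    reminders_sent: int = 0


def remind(alert, notify):
    """Send one reminder unless the alert has been acknowledged."""
    if alert.acknowledged:
        return False
    alert.reminders_sent += 1
    notify(f"[{alert.reminders_sent}] Pipeline broken at '{alert.stage}': stop and fix")
    return True


messages = []  # stand-in for a chat channel
alert = PipelineAlert(stage="integration-tests")
remind(alert, messages.append)  # scheduler tick 1
remind(alert, messages.append)  # scheduler tick 2
alert.acknowledged = True       # someone takes ownership of the failure
remind(alert, messages.append)  # no further noise once acknowledged
```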
Build: Continuous Integration and short-lived branches
The goal of a good build process is to keep your software in a potentially shippable state. You can't move fast if every release requires hours of integration work (or more!). How do you achieve this?
Follow Continuous Integration best practices. Ensure your branching strategy encourages frequent/continuous integration of changes. If you have a tight-knit team with a high level of ownership, trunk-based development is likely the best way to achieve this. If, however, your team is more distributed and contributions come from a wider group of people with less ownership, you may still prefer a branching strategy with feature branches. In the latter case, ensure feature branches are short-lived. Create mechanisms through which feature branches are merged back into the mainline the same day.
In order to maintain a healthy code base while integrating continuously, you need good build automation with unit tests to verify that the most recent merge did not break any code. To keep fast feedback loops, ensure that the duration of build and unit tests combined is less than 5 minutes.
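One way to keep the five-minute target honest is to have the pipeline check it. The sketch below assumes you already collect per-stage durations in seconds. The stage names and the check_feedback_budget helper are made up for illustration.

```python
BUDGET_SECONDS = 5 * 60  # the five-minute target from the text


def check_feedback_budget(stage_seconds, budget=BUDGET_SECONDS):
    """Summarize whether build plus unit tests fit the feedback budget."""
    total = sum(stage_seconds.values())
    slowest = max(stage_seconds, key=stage_seconds.get)
    return {
        "total": total,
        "within_budget": total <= budget,
        "slowest": slowest,  # first candidate for optimization
    }


# Hypothetical timings for one commit build.
report = check_feedback_budget({"compile": 70, "unit-tests": 180, "package": 65})
```

With these example numbers the total is 315 seconds, so the check fails and points at the unit-tests stage as the place to start trimming.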
QA: Maintain a healthy test pyramid
The biggest anti-pattern in testing is the absence of automated testing. However, even when automating tests, you can create a less-than-ideal setup. Most commonly, teams of testers set up test automation at the UI layer. While it is good to have a basic set of acceptance tests through the UI, you don't want all your tests to run this way. Why? Because acceptance testing through the UI is slow and expensive - even when automated.
What is better? Maintain a healthy test pyramid where most of the testing is done at the level where tests are fast and low-cost. Do continue with integration tests in addition but avoid running all tests at one level only.
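A rough health check for the pyramid is to count tests per layer and verify that each layer is larger than the one above it. The three layer names below ("unit", "integration", "ui") are an assumption for the sake of the example, so adapt them to however your suite is organized.

```python
def is_pyramid(counts):
    """True when each layer has more tests than the layer above it."""
    return counts["unit"] > counts["integration"] > counts["ui"]


# Hypothetical test counts for two teams.
healthy = {"unit": 800, "integration": 150, "ui": 20}
inverted = {"unit": 30, "integration": 40, "ui": 400}  # mostly UI tests

healthy_ok = is_pyramid(healthy)
inverted_ok = is_pyramid(inverted)
```

Test counts are only a proxy. Total runtime per layer tells a similar story and may be easier to pull from your CI tool.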
Security & Compliance by design
QA is more than functional testing. If you want to regularly release software, you need to be able to ensure it is running securely. Ensure a clear security classification of your application and data so that you employ fit-for-purpose security and compliance solutions. Make security checks part of the release pipeline. Employ both preventative as well as detective tools to maintain compliance across your environments.
Release early and often!
The main mantra of Continuous Delivery is: "If it hurts, do it more frequently". This specifically refers to your number of go-lives to production. When companies experience issues during their go-live, they tend to become more conservative and many conclude that they should release less often to avoid the risk of go-lives.
Counter-intuitively, the best thing to do is to release more frequently, yet responsibly. Why? The risk of a release to production is correlated to the amount and impact of changes being rolled out. Saving up four months' worth of changes and releasing them all at one time concentrates the release risk into a single moment. If something breaks, you need to figure out which of all these changes was the culprit. If, on the other hand, you release a single change at a time, possibly multiple times a day, you know the risk/impact of that particular change. If - despite automated testing efforts - the change fails in production, you know exactly which change caused it.
Finally, in organizations with strong CI/CD capabilities, you find a set of tools and techniques that support frequent, low-risk go-lives. These organizations separate the technical deployment to production from the functional release of functionality to their customers through the use of feature flags. That way, no one would notice if you have to roll back a version right after go-live. Other tools in this tool box include canary releases, rolling updates and blue/green deployments. These techniques can be used hand-in-hand with more functional/commercial techniques like A/B testing. And an organization with strong analytics capabilities would find out soon if the latest release reduces their funnel conversion rates.
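To put a number on "which of all these changes was the culprit": if you assume a single change caused the failure and that you can test any subset of the batch, a binary search needs roughly log2(n) rounds to find it. The helper below is a back-of-the-envelope sketch under those assumptions, and the batch sizes are invented.

```python
def bisect_rounds(changes_in_release):
    """Worst-case rounds of binary search to isolate one faulty change."""
    if changes_in_release < 1:
        raise ValueError("a release contains at least one change")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without float error.
    return (changes_in_release - 1).bit_length()


big_bang = bisect_rounds(120)  # e.g. months of changes in one release
single = bisect_rounds(1)      # one change per release
```

Under these assumptions the 120-change release needs up to 7 rounds of investigation while production is broken. The single-change release needs none.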
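Feature flags and canary-style rollouts can share one small mechanism: hash each user into a stable bucket and enable the feature for buckets below a rollout percentage. The sketch below is a generic illustration, not the API of any particular feature flag product. The flag name and user IDs are hypothetical.

```python
import hashlib


def bucket(user_id, flag):
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag, user_id, rollout_percent):
    """The code is deployed for everyone; the feature is released to a share."""
    return bucket(user_id, flag) < rollout_percent


users = [f"user-{n}" for n in range(1000)]
# Deployed but dark: nobody sees the feature yet.
dark = sum(is_enabled("new-checkout", u, 0) for u in users)
# Canary: a small share of users, the same ones on every request.
canary = [u for u in users if is_enabled("new-checkout", u, 10)]
# Full release: flip to 100 without a new deployment.
full = sum(is_enabled("new-checkout", u, 100) for u in users)
```

Setting the percentage back to 0 acts as a functional rollback without touching the deployed version, which is what makes the go-live itself uneventful.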
Production is everyone's responsibility!
In a healthy team following DevOps philosophies, the developer of functionality cares about how it runs in production. Because of their ownership, developers actively build monitoring, logging and tracing capabilities. Developers create self-healing software that is resilient against common failure scenarios (e.g. machine failure, increased latency, unavailability of dependencies, overload scenarios). A healthy team is continuously learning in production and improving their software. Non-functional quality in production is at least as important as developing new functionality.
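One small building block of resilient software is retrying a flaky dependency with exponential backoff instead of failing on the first error. The following is a minimal sketch: call_with_retries and flaky_dependency are hypothetical names, and a production version would typically add jitter, a time limit and a circuit breaker.

```python
import time


def call_with_retries(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry a flaky operation with exponential backoff; re-raise on final failure."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...


calls = {"count": 0}


def flaky_dependency():
    # Simulated dependency that is unavailable for the first two calls.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("dependency unavailable")
    return "ok"


delays = []  # record the delays instead of actually sleeping
response = call_with_retries(flaky_dependency, sleep=delays.append)
```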
Provisioning: Immutable Infrastructure
I have seen companies that use cloud technology yet require you to fill in a form in Excel to request a new EC2 instance. That form would take 2 weeks of processing before you were assigned your new EC2 instance. With such processes, you may as well not use cloud at all! Instead, for fast release cycles, you want a combination of fast infrastructure provisioning with strong ownership and visibility of the resulting costs. Ensure resources are tagged and separated across multiple accounts so you can split out the cost for each internal owner.
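Once resources carry an owner tag, splitting the bill is a simple aggregation. The sketch below works on a made-up list of billing line items. The field names ("resource", "cost", "tags") and the "owner" tag key are assumptions, not the format of any specific cloud billing export.

```python
from collections import defaultdict


def cost_by_owner(line_items, tag_key="owner"):
    """Sum cost per owner tag; untagged resources get their own bucket."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "UNTAGGED")
        totals[owner] += item["cost"]
    return dict(totals)


# Hypothetical billing line items.
bill = [
    {"resource": "vm-1", "cost": 120.0, "tags": {"owner": "team-checkout"}},
    {"resource": "db-1", "cost": 30.5, "tags": {"owner": "team-checkout"}},
    {"resource": "vm-2", "cost": 64.0, "tags": {"owner": "team-search"}},
    {"resource": "vm-3", "cost": 12.0, "tags": {}},
]
totals = cost_by_owner(bill)
```

The UNTAGGED bucket is worth watching: it shows how much spend has no clear owner.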
During my reviews back in 2015, companies were using tools like Chef, Puppet, Ansible and Salt Stack to maintain a desired state configuration. These tools were popular due to their declarative infrastructure as code and offered a powerful tool set to control larger numbers of servers - particularly on-premises.
In a cloud environment, you can do better than enforce a desired state like that: employ Immutable Servers. When it is physically impossible to log in to a server or make changes, the only way to make changes is by provisioning a new server. When you have a mechanism that tracks back the creation of new servers to Infrastructure as Code maintained in version control, you can track the full state of that server through version control. For VMs, it takes a bit of work and creativity to employ this philosophy. However, in the world of Containers, immutability is a native part of the way containers work.
In this blog post, I discussed a number of best practices and techniques that I would look for in a CI/CD pipeline and DevOps team today (late 2022). This post covers the most important elements that come to mind in response but may not be fully comprehensive. There are so many tools and techniques out there that it is hard to do justice to all of them in a mere blog post.
Looking at the best practices presented here, you do see a few common patterns: moving fast, high levels of automation and self-service, reducing hand-offs between teams, creating flow through pipelines (inspired by Lean practices) and focussing on business value rather than technical heavy-lifting. And don't forget to maintain security and compliance every step of the way. Automation will help you do so without slowing down!
Copyrights: Cover image is licensed as Creative Commons from Wikipedia - see source and cut/resized to fit Hashnode