Infrastructure as Code’s Broken Promises

It was supposed to make our lives easier.

Think back to the early days of your career. For me it was working as a lone-wolf sysadmin at a small regional retail company. I was still very green and was hired because they needed someone who could learn how to reset passwords in Active Directory. I had no concept of the wider world of technology, the products and tools that existed in it, or the depth of the ocean I was starting to wade into. And just like everyone else who doesn’t know any better, I thought the best way to manage my resources (servers, desktops, etc.) was by hand. Manually imaging and re-imaging desktops, provisioning new servers, slogging through endless GUI menus and dialogs: this was my reality for a long time. Sure, tools like batch and PowerShell could solve some problems, but their scope was limited. And then I met Ansible.

Fast forward to today: Ansible is no longer the hot-ticket tool; it’s tools like Terraform and the AWS CDK. But each of those comes with added complexity. Terraform, for example, allows you to create reusable modules. Great! We’re bringing lessons from software development into the SRE realm, and this is a good thing! Except, after using it for a while, you realize something: infrastructure isn’t anything like software. It’s messy. Things change, environments drift, emergencies happen. Infrastructure, with all of its moving pieces and interactions, is more like a living thing than software. Code, at its heart, is a static thing that defines a static product. Input goes in one end, and output comes out the other in a predictable manner. But infrastructure isn’t like that. It mutates, it changes, demands are met, resources are recycled. Those modules you spent hours working on are no better than if you had written them as resources directly; it turns out that not many things follow an easy inheritance pattern.
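To make the drift point concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `find_drift`, the setting names, and the values are made up for illustration and don't come from Terraform or any other real tool. It only shows the shape of the problem, which is that the code describes one world and the environment has quietly become another.

```python
# Hypothetical illustration: none of these names come from a real tool.

def find_drift(declared: dict, actual: dict) -> dict:
    """Return {key: (declared_value, actual_value)} for every setting that differs."""
    drift = {}
    for key in declared.keys() | actual.keys():
        want = declared.get(key)
        have = actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# What the code says the server looks like.
declared = {"instance_type": "t3.small", "open_ports": [443], "owner": "web-team"}

# What the server actually looks like after someone fixed an emergency by hand.
actual = {"instance_type": "t3.large", "open_ports": [443, 8080], "owner": "web-team"}

drift = find_drift(declared, actual)
for key in sorted(drift):
    want, have = drift[key]
    print(f"{key}: code says {want!r}, reality says {have!r}")
```

The code never changed, yet it is now wrong about two of the three settings. A program doesn't rewrite itself between runs; an environment does.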

But it wasn’t supposed to be like this! There were promises made. They said that treating your infrastructure like code would make things easier, not harder. They said that writing infrastructure code would mean it would be testable, not more vague and unknowable. They said that we could spend more time on making things better than on making sure things stayed up. Broken promises, all of them.

Things have not become easier. We have abstracted our abstractions behind more abstractions, creating modules of resources that are themselves virtualized resources. Our modules are almost 1:1 definitions of the base resources they begin with. Where are the savings? Where is the simplicity? Why does it take longer for a new engineer to get up to speed with the automation code than it did five years ago? I thought things were supposed to get better, not worse. All we’ve done is trade one type of complexity for another.

Things have not become testable. When was the last time you wrote tests for your Terraform scripts, your Ansible playbooks, your Chef recipes, or your Salt stacks? And even if you did write tests, how would you be able to trust them when a majority of the resources can only be validated on resource creation or modification? How do you test code that only runs in production against production resources? The simple truth is that while many claim to be testable, you can never fully trust those tests. Your syntax may be correct for whatever tool you’re using, but your syntax checks are never going to correct for errors raised by your cloud or infrastructure providers. And even then, your logic may not be correct. With more of these tools introducing more code-like features, some of them hover dangerously close to Turing-complete and are more like entire programming languages than simple configuration scripts. More types of complexity.
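Here is a rough sketch of that gap, again in Python and again entirely hypothetical: `validate_locally`, `FakeProvider`, and `ProviderError` are stand-ins I made up, not the API of any real tool or cloud. The local check can only see the definition itself. The provider knows things the definition cannot, such as which names are already taken.

```python
import re


class ProviderError(Exception):
    """Raised by the (fake) provider when a request is rejected."""


def validate_locally(resource: dict) -> list:
    """Static checks only: required fields and name format."""
    errors = []
    for field in ("name", "region"):
        if field not in resource:
            errors.append(f"missing field: {field}")
    name = resource.get("name", "")
    if name and not re.fullmatch(r"[a-z0-9-]{3,63}", name):
        errors.append("name must be 3-63 characters of a-z, 0-9, or '-'")
    return errors


class FakeProvider:
    """Stand-in for a cloud API holding state the definition cannot see."""

    def __init__(self, existing_names, quota):
        self.existing_names = set(existing_names)
        self.quota = quota

    def create(self, resource: dict) -> None:
        if resource["name"] in self.existing_names:
            raise ProviderError(f"name already taken: {resource['name']}")
        if len(self.existing_names) >= self.quota:
            raise ProviderError("quota exceeded")
        self.existing_names.add(resource["name"])


def try_apply(provider: FakeProvider, resource: dict) -> str:
    """Attempt the change and report what happened."""
    try:
        provider.create(resource)
        return "created"
    except ProviderError as exc:
        return f"failed at apply time: {exc}"


resource = {"name": "assets-bucket", "region": "us-east-1"}

local_errors = validate_locally(resource)  # nothing wrong, as far as we can tell
provider = FakeProvider(existing_names={"assets-bucket"}, quota=10)
result = try_apply(provider, resource)     # the provider disagrees

print("local errors:", local_errors)
print("apply result:", result)
```

The definition passes every check you can run on your own machine and still fails the moment it meets the real environment.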

I have not saved any time. In fact, I have spent more time writing automation than I would have spent if I had just made the change in situ. Am I saying we go back to the halcyon days of “do everything by hand”? Absolutely not. But for tools that are supposed to save us time, money, and resources, they certainly have done none of that. Every single change, every little update requires altering my automation. Those alterations need to be applied across my infrastructure so that my state is current. If there are any errors, then I need to fix those, recommit my code, let my build system or central server deploy the new automation, and wait for additional errors. Rinse & repeat. There has to be a better way.

I’ll be the first to admit that I don’t know what the solution is. But I do know that whatever that solution is, it has to be less complex than what we have now. Our environments can’t handle much more complexity.