I’m not entirely sure how to begin this one; I know what I want to say, but I’m not sure how to descend into that madness without being incredibly blunt. Usually, I’d start with some sort of short personal story or a parable, or some kind of recent news reference, but there’s none of that here. This piece is really just a culmination of some growing concerns I’ve been having for the past year or so about the state of the DevOps/SRE industry, so I guess I’m left with being blunt: the state of the industry is terrible.
There are a number of reasons that have led me to this conclusion, but I want to start with what I think is the biggest and most egregious one: we have allowed our industry to be taken over by Kubernetes. I’ve written about my dislike of Kubernetes in the past, but this goes far beyond personal feelings or misgivings about a platform. We’ve given K8S so much control over our environments that we’re getting to the point where we can’t really run anything else. We constantly see products do complete pivots to make Kubernetes their core model, completely forsaking their foundations. Rancher is a good example of this.
When Rancher first arrived on my radar it was a new, platform-agnostic way to run containers. In fact, you had multiple options for container orchestration: Cattle (Rancher’s own scheduling engine), Docker Swarm, or Kubernetes. And this was a good thing. Teams could pick up Rancher without worrying about a full-scale migration to a new platform, because odds were it already supported your orchestrator, and if you didn’t have one, you could just use the native one. But since then we’ve seen Rancher shift away from both Swarm and Cattle and fully embrace a Kubernetes-only mindset, a shift we’ve also seen from other products like Cloud Foundry.
Slowly but surely Kubernetes is taking over as the orchestrator of choice, and it’s forcing companies to make the difficult choice of either completely rewriting their products to support Kubernetes or simply discontinuing them. Just this year we saw DC/OS discontinued as Mesosphere rebranded to D2iQ and focused solely on its Kubernetes offerings. And let me not mince words: this is not a good thing. The continued homogenization of the container orchestration space presents serious challenges, now and into the future, for both security and operations.
We’ve seen this phenomenon before with Microsoft Windows, where a single unpatched vulnerability has the potential to affect millions of machines. Should the industry continue toward a monolithic Kubernetes infrastructure, we will see the scale and severity of security vulnerabilities in Kubernetes rise and become close to all-encompassing. A single vulnerability in the Kubernetes core repo, or in one of the many codebases it depends on (CoreDNS, etcd, cri-o, containerd, runc, nginx, etc.), would cause a massive, wide-ranging problem for the industry as a whole. Competition not only breeds improvement overall, it also limits the blast radius should any one product be compromised; a lack of competition does just the opposite.
And speaking of competition, it’s not as if HashiCorp’s Nomad is in a position to mount a challenge.
There’s No Such Thing as Enterprise Open Source
I love HashiCorp products. Let me put that out there before I’m accused of just hating on Hashi. But every HashiCorp product shares a single considerable flaw that makes it hard to justify adopting them in a real-world setting: you have to pay for the features you actually need. The Hashi-stack could, and should, give Kubernetes a run for its money, yet the adoption rate of Nomad, Consul, et al. is slow and limited, and I think this is a direct and serious consequence of HashiCorp’s pricing model. For any serious use, such as the backbone of an enterprise container platform, you have to pay their considerable Enterprise licensing fees, not just for one product but for every product you want to use.
Let’s say, for example, you want to set up Nomad as your container platform. Not only would you want paid Nomad features like resource quotas, multi-cluster deployments, redundancy zones, automated upgrades and backups, and audit logging, but Nomad also depends on Consul. And when running Consul in a real-world setting you’re going to need redundancy, namespaces, SSO, audit logging, and other paid features. It’s the same story with Vault, HashiCorp’s secrets manager that also integrates with Nomad: namespacing, monitoring, multi-datacenter replication, read replicas, audit logging, and disaster recovery are all paid Vault features. So now you’re not only on the hook for a Nomad Enterprise license, but for Consul and Vault Enterprise licenses too. These products are touted as open source, and they are, up to a point. The tipping point is exactly the set of features you need to run them in a real-world production environment. And this is a pattern replicated all throughout the open-source community. There’s no such thing as enterprise-scale open source anymore; it’s all open-core software with paid enterprise features. And this is where something like Kubernetes comes in: it’s one of the few truly free enterprise-scale tools still available.
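To make the free/paid split concrete, here’s a minimal sketch of a Nomad job using the Vault integration, which does work in the open-source tier. The image name, policy name, and secret path are hypothetical placeholders; the point is how quickly a real deployment bumps into the Enterprise line.

```hcl
# Minimal Nomad job with Vault integration -- this much is free.
# Names and paths below are illustrative, not from any real deployment.
job "web" {
  datacenters = ["dc1"]

  group "app" {
    task "server" {
      driver = "docker"
      config {
        image = "nginx:1.25" # placeholder image
      }

      # Basic Vault integration is available in open-source Nomad...
      vault {
        policies = ["web-read"] # hypothetical Vault policy
      }

      # ...and so is templating a secret into the task environment.
      template {
        data        = "API_KEY={{ with secret \"secret/data/web\" }}{{ .Data.data.api_key }}{{ end }}"
        destination = "secrets/app.env"
        env         = true
      }
    }
  }

  # But namespaces, resource quotas, audit logging, and multi-region
  # deployments around this job are all Enterprise features.
}
```

The job itself runs fine on open-source Nomad; it’s the operational scaffolding around it (quotas, namespaces, audit trails) that sits behind the license.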
People Still Use VMs and That’s Okay
You would be astonished at the number of companies that still use VMs. Or maybe you wouldn’t; what else are they supposed to use? And that’s kind of my point. This industry loves to make a fuss about leaving VMs behind for serverless or whatever the Next Great Abstracted Architecture is, but the reality is that VMs still power the vast majority of people’s infrastructures, and there’s no really good way to manage them.
The industry has seemingly moved us into a post-VM state where products and tools are starting to deprecate VM-specific features (Terraform’s provisioners dropping support for Puppet and Chef, and strongly discouraging the use of remote-exec, is a good example) when the reality on the ground is that companies depend on these tools more than ever. Tools like SaltStack have pivoted into a more security-focused field, RedHat has ruined Ansible, and Puppet and Chef are as unwieldy and difficult to use as they have always been, while focusing less on VM-based infrastructure and more on the cloud as a whole. We’re basically left with cloud-specific tools like CloudFormation, writing cloud-init configs, or hacking together Terraform and remote-exec calls despite the warnings to the contrary. We really have no good options left for managing a fleet of VMs, much less for configuring them after creation in a way that is observable, repeatable, and idempotent.
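For anyone who hasn’t lived it, this is the kind of Terraform-plus-remote-exec glue teams end up writing despite HashiCorp’s own documentation calling provisioners a last resort. The AMI ID, username, and packages are hypothetical placeholders.

```hcl
# A typical remote-exec hack: provision a VM, then shell into it to
# configure it. Runs exactly once at creation time -- not observable,
# not repeatable, not idempotent. Identifiers below are placeholders.
resource "aws_instance" "app" {
  ami           = "ami-0abcdef1234567890" # hypothetical AMI
  instance_type = "t3.small"

  provisioner "remote-exec" {
    inline = [
      "sudo apt-get update -y",
      "sudo apt-get install -y nginx",
    ]

    connection {
      type        = "ssh"
      user        = "ubuntu"
      host        = self.public_ip
      private_key = file("~/.ssh/id_rsa")
    }
  }
}
```

If the inline script drifts from reality, Terraform has no idea: the instance’s actual configuration is invisible to state, which is exactly the observability and idempotency gap described above.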
(Don’t even suggest running Ansible in local mode; not only was it never designed for that, but its reliance on a specific version of Python to execute, and the way it translates YAML into shelled-out code, are fragile and icky at best.)
A Vague Conclusion
So what am I trying to say? I’m not really sure. Maybe there’s some profound conclusion to be gleaned from all this, but I think it boils down to this: we need to be careful not to get ahead of ourselves. We need to make sure that the tools we need are the right tools, that they exist, and, if they don’t, that we bring them into existence. Because that’s what SRE does: we see a need and create the tools to meet it.