With a single blog post, Netflix showed the world that they have probably the best engineering team on the planet. And I don't think it's even close right now. What they showed was an engineering team that is perfectly comfortable operating at the usual layers of abstraction – cloud provider, instances, services, etc. – but still fully capable of diving deep into the intersection of software and hardware, reasoning about how the two interact, and finding ways to improve performance at a very deep, very obscure level. The ability to drill down into the minute details and still understand how they impact the whole is, in my estimation, one of the greatest skills that an engineer can have regardless of engineering specialty. And not only does Netflix's tale of macro-to-micro engineering make for a compelling read, it validates something I have been saying for a long time now: fundamentals matter.
Today's computing landscape is made up of layers and layers of abstractions and obfuscation. Especially with the rise of public cloud providers and containerized workloads, we interact less and less with the "primitive" systems that these abstractions are built on. This isn't a bad thing – managing a fleet of just 20 virtual machines on a single bare-metal server is job enough without the automation, live migration, storage networks, monitoring, remediation, and everything else that goes into turning that single machine into one of many that make up a virtual machine service. Abstractions are incredibly useful and necessary; as the engineering landscape continues to grow more complex, the only way to properly manage and understand it is to deal in abstractions. But those very abstractions have a negative effect as well: by hiding the underlying systems and processes, they make it easy to forget the "how" and "why" of the way things work, and understanding those two facets of the systems you operate can make all the difference when it matters most.
None of this is to say that abstractions are a bad thing – just the opposite is true, in fact. The systems and technologies that we're able to build today can only happen on the shoulders of the abstractions and layers previously laid down by those who pioneered the industry before us. Without programmable circuitry we would be hardwiring logic into our machines; without low-level languages like Assembly we would be manually flipping bits and registers; without human-readable languages like C and C++ we would be writing a lot of complicated Assembly; without memory-managed languages like Python, Java, or Go a lot of our modern software could be more unreliable and less user-friendly; and the list goes on. Without these advancements in reliability and quality of life the software development world – and the world at large – would look very different. But the things that came before us are not obsolete, however much less relevant they are to your average programmer's day-to-day routine. Quite the contrary: those layers play a more vital role in the smooth operation of today's world than most realize, and understanding them is crucial to the long-term survivability of any product, service, company, or line of code.
Castles On Sand
One of my absolute favorite YouTube channels is one called Brick Immortar. Their content is typically long-form (30+ minutes) documentary-style videos detailing various tragedies and mishaps that have happened over the centuries, from bridge collapses to sinking merchant vessels, with a focus on incidents where faulty engineering, gross oversight, or systemic failures have directly led to the loss of life. While the subject matter is admittedly dark at times, and more than a bit fascinating, there are also valuable lessons to be gleaned from stories such as these.
A common thread throughout the stories told on the Brick Immortar channel is that the people involved rarely, if ever, see the whole picture. Whether it's a Ride the Ducks tour operator not having a crucial weather report or an engineering firm contracted to modify a ship to increase cargo capacity, without the full picture we leave space for tragedy to unfold. Take as an example the history of the SS El Faro, originally christened the SS Puerto Rico. She was built as a "roll-on, roll-off" cargo vessel, designed to transport semi trailers on three decks – two internal and one external. But as the standardized shipping container became the dominant mode of transport for goods she was modified to a "ro-con", or rolling container ship, one that would carry both rolling stock (vehicles and trailers) in her lower decks and containers on her top deck. These modifications and changes to cargo layout compromised the vessel's seaworthiness and, on sailing too near a hurricane that would otherwise not have posed a major issue for a properly seaworthy vessel, she sank with all hands lost. There were many points along the way at which any number of people could have raised the alarm. The shipyard in Alabama that was contracted to make not one but two different sets of modifications could and should have recognized the danger, as should the ship's operator, captain, second mates, and crew.
Granted, the stakes for software are often much lower, but the principle remains true nonetheless. Much like the SS El Faro, the FV Emmy Rose, the MV Sewol, and Stretch Duck 7, all of which had modifications made to them that compromised handling, buoyancy, seaworthiness, and other critical capabilities, we often find ourselves modifying or building on top of systems and structures we don't fully understand, with only the hope that things will work out fine in the end. In doing so we set the stage for catastrophe, from both a technical and a business perspective.
Factors Leading Up To...
So how did we get here?
Well, it's rather simple, honestly: we prioritized skillsets and knowledge related to the abstractions rather than the scaffolding underneath.
In my career I have been fortunate enough to be in a position to hire a number of DevOps/SRE engineers at various organizations. Those candidate searches always end up being frustrating ordeals because, without fail, the majority of the engineers you interview have a knowledge area that is 100 feet wide but only an inch deep. They may have approximate knowledge about a large array of topics and technologies, but dig any deeper into those areas and you'll find the understanding is lacking in a very concerning way. Containers are the backbone of a lot of modern production environments, with Kubernetes being one of the most popular container orchestrators out there. Yet too many engineers hold one or more Kubernetes certifications but have no clue what actually makes up a container or how it works, and fewer still understand containers well enough to actually troubleshoot core issues or look for ways to secure them. Imagine a world where scientists built rockets but didn't understand the force of gravity, or a world where your local mechanic blindly followed a computer readout to diagnose and fix your car without ever actually understanding why. This is exactly what we're asking our engineers to do with the very systems that we rely on every day, and it is unsustainable.
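To make "what actually makes up a container" a little more concrete: on Linux, a container is at its core an ordinary process that the kernel has placed into its own set of namespaces (with cgroups limiting the resources it can use). The sketch below is my own illustration rather than anything taken from a particular tool. The helper name list_namespaces is made up for this example, and it assumes a Linux host with /proc mounted. It simply lists the namespaces a process belongs to.

```python
import os


def list_namespaces(pid="self"):
    """Return {namespace_name: kernel_identifier} for a process.

    Hypothetical helper for illustration. Assumes Linux with /proc
    mounted; returns an empty dict anywhere that is not the case.
    """
    ns_dir = f"/proc/{pid}/ns"
    namespaces = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        # Not Linux, no /proc, or no such process: nothing to report.
        return namespaces
    for name in names:
        try:
            # Each entry is a symlink whose target looks like "pid:[4026531836]".
            namespaces[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            # Skip entries we are not permitted to read.
            continue
    return namespaces


if __name__ == "__main__":
    for name, ident in list_namespaces().items():
        print(f"{name:20} {ident}")
```

Two processes that report the same identifier for a namespace share it. Run this on the host and again inside a container, and you will typically see different identifiers for most of the entries, which is a good starting point for understanding what the container runtime actually did.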
The way we can fix it is rather simple, believe it or not: hire people who know the underlying infrastructure and systems. If you're hiring a DevOps engineer, make sure they know things like what containers are, how DNS works, and basic networking. If you're hiring software developers, make sure they know about caching, I/O costs, and basic networking. Be sure that the criteria you judge your candidates by include these deeper skills. More often than not, though, you're not going to find someone who fits the bill in every category. This is where upskilling becomes such an important piece of the puzzle.
Upskilling, or training your engineers in new tools, technologies, or other areas, is critical to the survivability of a technical organization. As teams change, personnel move on or move up, and technologies change, it is absolutely vital that your engineers continue to learn, grow, and evolve. Without that evolution your teams will stagnate, your products will stagnate, and eventually the company, project, team, organization, or whatever will wither and die. But a robust program of training or a culture of self-improvement and self-motivated learning will counteract this decay, make your organization stronger, and give your people those deep skills that are so crucial to the underpinnings of modern technology.
Anything I missed? Anything you want to see me cover? Let me know in the comments!