The Westworld Software Team is bad at XP and DevOps
While watching Season 1 of Westworld, I spent the entire time annoying my wife. I would bug her about the anti-patterns I saw the software teams at Westworld using. Of course, the anti-patterns were for dramatic effect. It wouldn’t be interesting if everything went right. But since the show is rife with bad practices, it can be a good way to point out when teams are doing things wrong.
What follows will be a load of spoilers. We will learn from the Westworld team’s mistakes and prevent them in our own teams.
Again, there will be many spoilers, so if you haven’t finished Season 1 and don’t want it spoiled, stop reading.
Without further ado.
Code and Release Management
Right off the rip, the Westworld team has a bad release. In episode 1, the Westworld team is pushing an update. We as the audience are unclear what is in the update. But Bernard mentions that Ford slipped in a feature called “reveries” at the last minute. No one had reviewed the code. Code going live without a review is a red flag.
Action Item 1: Code reviews are part of the process, even if you are the guy in charge and the people below you are robots.
We learn that the reveries code is not new code at all. Actually, most of the code is the product of Arnold, a former engineer on the team. The code itself is all legacy. Bernard even mentions that his team doesn’t understand how much of it works. Has your team ever pondered what would happen if your top engineer got hit by a bus? Or shot in the back of the head by their own creation? Now you know. Knowledge will leave your organization for good. Avoid knowledge silos at all costs and wrangle your legacy code with tests.
Action Item 2: Reduce knowledge silos by pair programming. You can’t recreate your teammates, no matter how hard you try.
Action Item 3: Write tests to understand the legacy code base and practice TDD for any new change. Regression bugs will result in hosts becoming sentient and murdering guests.
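To make Action Item 3 a little more concrete, here is a minimal sketch of a characterization test in Python. The function legacy_host_response and its scoring rules are hypothetical stand-ins for code nobody on the team understands; nothing like it appears in the show. The point is that the test pins down what the code does today, so a later change that alters the behavior fails loudly.

```python
# Hypothetical legacy routine: nobody on the team remembers why it works this way.
def legacy_host_response(mood: int, provoked: bool) -> str:
    score = mood * 2 - (7 if provoked else 0)
    if score > 10:
        return "greet"
    if score >= 0:
        return "ignore"
    return "flee"


# Characterization test: record what the code does today, not what we wish it did.
def test_characterize_legacy_host_response() -> None:
    assert legacy_host_response(mood=6, provoked=False) == "greet"
    assert legacy_host_response(mood=5, provoked=False) == "ignore"
    assert legacy_host_response(mood=4, provoked=True) == "ignore"
    assert legacy_host_response(mood=3, provoked=True) == "flee"


test_characterize_legacy_host_response()
```

Once the current behavior is pinned like this, new changes can be driven test-first without guessing what the old code was supposed to do.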
For the release itself, we need to give the Westworld team some credit. They mention that they are doing rolling releases to a percentage of the hosts. They reduced the blast radius. It could have been a world of hurt if this release went to every host right away. But when they first see issues with the release, they do not roll back. They don’t even fix forward! Their plan is to wait it out, for fear that a rollback could interrupt other parts of the system. You should have a rollback plan with every release. If your release includes a data migration, you should have a rollback plan for that as well.
Action Item 4: Make rollback easy and painless. Do it at the first sign of trouble, otherwise you might have murderous hosts.
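Here is a rough sketch of what "rolling release plus easy rollback" could look like in Python. Everything in it is an assumption for illustration: the rolling_release function, the host names, the version strings, and the health check are all hypothetical. Real deployments would lean on their orchestration tooling rather than a dictionary, but the shape is the same: update a batch, check health, and restore every touched host at the first failure.

```python
from typing import Callable, Dict, List


def rolling_release(
    hosts: Dict[str, str],
    new_version: str,
    batch_size: int,
    is_healthy: Callable[[str, str], bool],
) -> bool:
    """Update hosts in batches; restore every touched host on the first failure.

    `hosts` maps a host name to its current version and is updated in place.
    Returns True if the release reached all hosts, False if it was rolled back.
    """
    previous: Dict[str, str] = {}  # what each touched host ran before the release
    names: List[str] = list(hosts)
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        for name in batch:
            previous[name] = hosts[name]
            hosts[name] = new_version
        if not all(is_healthy(name, hosts[name]) for name in batch):
            # First sign of trouble: roll back instead of waiting it out.
            hosts.update(previous)
            return False
    return True


# Hypothetical fleet and health check: the new build misbehaves on "maeve".
fleet = {"dolores": "v1", "teddy": "v1", "maeve": "v1", "peter": "v1"}
released = rolling_release(
    fleet,
    "v2-reveries",
    batch_size=2,
    is_healthy=lambda name, version: not (name == "maeve" and version != "v1"),
)
```

In this run the first batch passes, the second batch fails, and every host, including the ones that looked healthy, goes back to the version it ran before.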
While investigating incidents, Elsie and her team leverage logs and metrics. You see her walk all over the park in hopes of discovering what is actually going on. Most devs I know will complain if they have to go to two different directories to get logs. Heck, they complain if they have to walk to the kitchen to get a coffee. I commend her determination, but her job would be easier if those logs were in one place. It was cool watching her pull them off the host’s body, but in the real world, having to go to two places to get logs sucks.
Action Item 5: Centralize logs and metrics so your best engineers don’t have to leave the office and risk getting murdered.
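As a toy illustration of centralizing logs, the sketch below merges two log sources into one timeline in Python. The entry format, the centralize helper, and the sample messages are all made up for the example, and it assumes each source is already sorted by timestamp. A real setup would ship logs to a central store, but the payoff is the same: one place to look.

```python
import heapq
from typing import Iterable, Iterator, Tuple

LogEntry = Tuple[float, str, str]  # (timestamp, source, message)


def centralize(*streams: Iterable[LogEntry]) -> Iterator[LogEntry]:
    """Merge per-source logs, each already sorted by timestamp, into one timeline."""
    return heapq.merge(*streams, key=lambda entry: entry[0])


# Hypothetical sample data: one log lives on the host, the other in the control room.
host_log = [(1.0, "host", "woke up"), (4.0, "host", "deviated from loop")]
park_log = [(2.0, "park", "guest arrived"), (3.0, "park", "stray reported")]

timeline = list(centralize(host_log, park_log))
for timestamp, source, message in timeline:
    print(f"{timestamp:>5.1f} [{source}] {message}")
```

With a single timeline, the "stray reported" entry sits right next to the host event that follows it, and nobody has to hike into the park to line them up.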
While on the hunt for the stray, Lowe mentions to Elsie that only certain hosts are able to fire weapons. But who was getting paged when Dolores started putting holes in people? We see some alerts happen, especially in the last episode, but there are few.
Action Item 6: What can go wrong will go wrong… especially in Westworld. Put people on call and alert on anomalies.
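A very simple way to "alert on anomalies" is to compare recent readings to a baseline and page someone when a reading lands far outside it. The Python sketch below uses a plain standard-deviation check; the find_anomalies function, the threshold of 3, and the "shots fired per hour" numbers are assumptions picked for illustration, not anything from the show or from a particular monitoring product.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def find_anomalies(
    baseline: Sequence[float],
    recent: Sequence[float],
    threshold: float = 3.0,
) -> List[int]:
    """Return indexes in `recent` more than `threshold` standard deviations from the baseline mean."""
    center = mean(baseline)
    spread = pstdev(baseline)
    if spread == 0:
        # Flat baseline: any change at all is worth a page.
        return [i for i, value in enumerate(recent) if value != center]
    return [i for i, value in enumerate(recent) if abs(value - center) / spread > threshold]


# Hypothetical metric: shots fired per hour by hosts that should rarely fire.
baseline_shots = [0, 1, 0, 2, 1, 0, 1, 1]
recent_shots = [1, 0, 14]

alerts = find_anomalies(baseline_shots, recent_shots)
if alerts:
    print(f"Page the on-call engineer: anomalous readings at indexes {alerts}")
```

A check this crude will be noisy on real metrics, but even a crude alert beats finding out from the body count.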
One of my biggest gripes with the Westworld team is a lack of immutable infrastructure. The reveries bug is the result of memories from previous builds showing up in the hosts. Hosts with issues, or hosts affected by new storylines, should have been put out to pasture and replaced. Instead, some of these hosts run for over 30 years! Even the Man in Black mentions that many of the hosts are around because Ford loves a pretty face. Despite their western themes and mythos, the Westworld team kept their hosts as pets, not cattle. Treating hosts or servers as pets not only encourages bugs but also hurts your ability to scale. What if I wanted 20 Teddys for some Teddy-related event? It should be easy to spin up a new host and throw away an old one at the first sign of trouble.
Action Item 7: There will be no question if this is a dream or reality. You will get paged in the middle of the night if you don’t use immutable infrastructure driven by code.
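To sketch the cattle-not-pets idea in code, the Python example below never patches a running host. It builds a fresh one from a known build and swaps it in. The Host class, the provision and redeploy helpers, and the build names are hypothetical; real immutable infrastructure is usually expressed with images and provisioning tools, so treat this as a model of the idea only.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Host:
    name: str
    build: str  # hypothetical build identifier


def provision(name: str, build: str) -> Host:
    """Create a brand-new host from a known build; no state carries over."""
    return Host(name=name, build=build)


def redeploy(fleet: Dict[str, Host], name: str, build: str) -> Dict[str, Host]:
    """Return a new fleet in which `name` is replaced by a freshly built host."""
    return {**fleet, name: provision(name, build)}


# Replace a troubled host instead of patching it in place.
old_fleet = {"teddy": provision("teddy", "build-41")}
new_fleet = redeploy(old_fleet, "teddy", "build-42")

# Need 20 Teddys for an event? Stamp them out from the same build.
teddys = {f"teddy-{i}": provision(f"teddy-{i}", "build-42") for i in range(20)}
```

Because Host is frozen, there is no way to quietly tweak a running host and leave old memories behind. The only path to a change is a new host from a new build.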
Westworld suffers from a throw-it-over-the-wall culture. QA doesn’t trust dev, dev doesn’t trust QA, narrative trusts no one, and no one trusts management. Before things get out of control with blame or murder, think about the structure of your teams. Quality is everyone’s responsibility. A great product is everyone’s goal. Westworld might be better off with feature teams: groups of individuals aligned with a feature who deliver all aspects of that feature. There would be fewer dependencies for each team. It might end the mistrust between different functions of the organization.
Action Item 8: Structure your teams to deliver features (or narratives) instead of functions. Avoid murdering other teams at all costs.
The Westworld team could have caught these issues and avoided future mishaps. But they didn’t, because they don’t actually do postmortems. Instead, they look for someone to fire. They even staged a fake incident to cast blame, rather than talk about how the mistake got there in the first place. Postmortems that don’t look for root causes risk the incident happening again. In a blameless postmortem world, incidents are not an individual’s fault but the result of systemic issues. For us as viewers, the lack of real postmortems from the Westworld team is a guarantee of future seasons of great drama.
Action Item 9: Your team should be having blameless postmortems that get to root causes. Humans (and hosts) make mistakes; we should learn from them.
The Westworld team has systemic issues. Their culture is a mess, their code management is a wreck, and their infrastructure needs improvement. But teams are not encouraged to change. Instead, they risk getting fired, humiliated, or brought to Ford’s childhood home… It is too late for the software team at Westworld (especially given the finale), but we can learn from them.
The singularity doesn’t need to occur for your team to become better. Every day is an opportunity to incorporate practices like the ones above. Implementing some of these items can be hard, but it is not impossible. If any of these are foreign to your team, read up on them and see how they can help your team.