
I know it’s too late for a bunch of shops, but for God’s sake please don’t use unattended upgrades to do your patching unless you want to hate your life and chase down hard-to-find, hard-to-undo bugs.

Build your images in a CI job and make your deploy version the pair (code version, image version), so patching runs through all the same tests your code does and you have a trivial roll-forward to undo any mess you find yourself in.
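A minimal sketch of the versioning idea (all names and values here are hypothetical; in a real pipeline they would come from git and the CI system):

```shell
#!/bin/sh
# Hypothetical CI step: derive a deploy version from the code commit and
# the image build number, so OS patches ship (and roll back) through the
# same pipeline as code changes.
code_version="1a2b3c4"        # in CI: git rev-parse --short HEAD
image_build="42"              # in CI: the pipeline's build number
deploy_version="${code_version}-img${image_build}"
echo "deploy_version=${deploy_version}"
# Undoing a bad patch is then just re-deploying the previous known-good
# pair ("roll forward"), not mutating already-patched hosts in place.
```

The point is that the OS state is captured in the image version, so every deploy is reproducible and revertible by pair.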



> don’t use unattended upgrades

> Build your images in CI job

I know container images should generally be immutable, but I would expect unattended upgrades to be used mostly on the host, not in a container, and that management scheme doesn't really work on hosts (unless you're doing VMs where you can deploy immutable root images to the VMs as well, or some fun bare metal + PXE combination).


> or some fun bare metal + PXE combination

This is actually what I implemented for our hypervisor tier; it’s not as scary as it sounds. I could legit completely rebuild our entire stack down to the metal in about 3 hours.

Kick off a new hypervisor version, the inactive side PXE boots all the nodes, installs and configures a Proxmox cluster, slaves itself to our Ceph cluster, and then either does a hot migration of all the VMs or kicks off a full deploy which rebuilds all the infra (Consul, Rabbit, Redis, LDAP, Elastic, PowerDNS, etc) along with the app servers. The hardest part (which really isn’t) is maintaining the clusters across the blue/green sides.

With this setup our only mutable infrastructure was our Ceph cluster (because replacing OSDs takes unacceptably long) and our DB (for performance, the writers lived on dedicated servers while the read replicas lived on the VMs).


Alternatively, depending on the size of your operation, consider running a dummy prod with at least one of each server type in your environment and using it to validate host upgrades. After that, you can push an unattended upgrade via a self-hosted package/upgrade server.

Let things be automatic to the maximum degree possible but give yourself a single hard human checkpoint and some minimum level of validation in a dummy environment first.


The idea is that your deploy step should handle both deploying code and upgrading the OS, so all changes go through the same pipeline.
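One way to sketch this (base image, package manager, and paths are purely illustrative): apply OS updates at image-build time, so a security patch produces a new image that has to pass CI before it can ship, exactly like a code change.

```dockerfile
# Illustrative only: OS updates ride the same build as the application,
# so "patch the OS" and "deploy the app" are the same pipeline.
FROM ubuntu:22.04
RUN apt-get update && apt-get -y upgrade \
    && rm -rf /var/lib/apt/lists/*
COPY ./app /srv/app
CMD ["/srv/app/run"]
```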


Sorry, not my experience.

My experience has been that by the time I notice some serious vulnerability in the news, my servers have already patched themselves. I have never "hated life" or had a "hard to find and undo bug" due to automatic security patching. When this incident hit, I pretty quickly found the cause and had a clear path to resolution.

This is the first security update that caused a boot failure in about a decade. It was bad, but it didn't change my mind about unattended-upgrades. My takeaway is that maybe I should have upgraded my 20.04 servers to 22.04 sooner.


You’re conflating unattended-upgrades (server mutability, hard to roll back) with automated patching in general. Do automated patching, but also run the changes through your CI so you can catch breaking changes and roll them out in a way that’s easy to debug (you can diff images) and revert.
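The "diff images" part can be as simple as comparing package manifests exported from the old and new images (the manifests and version strings below are made up for illustration; in practice they might come from running `dpkg -l` inside each image):

```shell
#!/bin/sh
# Hypothetical package manifests for two image builds; diffing them shows
# exactly which packages a patch release changed.
printf 'openssl 3.0.2-0ubuntu1.14\nzlib1g 1:1.2.11\n' > old.pkgs
printf 'openssl 3.0.2-0ubuntu1.15\nzlib1g 1:1.2.11\n' > new.pkgs
diff old.pkgs new.pkgs || true   # diff exits 1 when files differ
```

When a deploy misbehaves, this narrows the suspect list to the handful of packages that actually changed between the two images.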

I bet when you update your software dependencies you run those changes through your tests, but your OS is a giant pile of code that usually gets updated differently and independently, for mostly historical reasons.


> I bet when you update your software dependencies you run those changes through your tests, but your OS is a giant pile of code that usually gets updated differently and independently, for mostly historical reasons.

Close. We are moving towards defining our server states through Ansible, but the project is not close to completion. Perhaps once that's further along, we could use Ansible Molecule + CI to test a new server state when there's a new patch available, but that's not an option on the table today.

The system we had in place for /today/ worked: lower-priority or redundant servers were set to auto-reboot after applying security updates, while critical servers required a manual reboot at low-risk times. By then, the patch had already been tested on the lower-risk servers.
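The staggered setup described above maps roughly onto stock unattended-upgrades options (this excerpt shows real option names from the default Ubuntu config file; the time value is illustrative):

```
# /etc/apt/apt.conf.d/50unattended-upgrades (excerpt)

# Lower-priority hosts: apply security updates and reboot automatically
# during a low-traffic window.
Unattended-Upgrade::Automatic-Reboot "true";
Unattended-Upgrade::Automatic-Reboot-Time "02:00";

# Critical hosts: same file with Automatic-Reboot "false"; updates are
# staged and an operator reboots at a low-risk time.
```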

As a result, this issue caused no user-visible downtime for us, and thanks to the staggered runs of unattended-upgrades it affected a minimal number of servers.

And since this was the first time in 10+ years that something like this happened, we choose to prioritize our process-improvement time based on likelihood and impact.


> I know it’s too late for a bunch of shops, but for God’s sake please don’t use unattended upgrades to do your patching unless you want to hate your life and chase down hard-to-find, hard-to-undo bugs

Some years ago everyone said the same about Windows servers ;)



