This was even worse than the headline made it sound.
If you had `unattended-upgrades` running and had the "automatic reboot" option enabled, then all your Ubuntu 20.04 servers running Docker would reboot themselves and not come back up.
First, the bug was in a security branch. Second, it wasn't just the containers that crashed. If you booted containers on boot via Docker, then the host OS kernel-panicked and crashed at boot, since the containers share the kernel with the host.
At that point, you can't SSH in and have to follow the procedure for restoring from backup or re-mounting the root volume on an alternate host to revert the kernel version being run.
And then of course if you revert the kernel upgrade, you were once again vulnerable to whatever problem the security update was fixing...
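For anyone unsure whether a host is in that situation, the relevant knobs are easy to check; a minimal sketch (option names as shipped by stock Ubuntu unattended-upgrades, the grep is just a convenience):

    # Show whether the daily unattended run and the automatic reboot are both enabled.
    apt-config dump | grep -E 'APT::Periodic::Unattended-Upgrade|Unattended-Upgrade::Automatic-Reboot'
    # Typically set in /etc/apt/apt.conf.d/20auto-upgrades and 50unattended-upgrades, e.g.:
    #   APT::Periodic::Unattended-Upgrade "1";
    #   Unattended-Upgrade::Automatic-Reboot "true";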
QuentinM 44 days ago [-]
Sounds about right. And not the first time it has happened either. I recall getting a few of those instant unit 3 panics over the past few years with Ubuntu. Often with things not as common out there in production, like tc (which in our case we were using in production to work around conntrack race conditions), and sometimes we also got non-panicking but absolutely production-wrecking, nerve-wracking issues like TCP window size calculation overflows after the window went to zero due to a temporarily slow consumer - freezing the window size to a few bytes instead of getting a prompt full-window recovery.
Not to mention we’ve also had our fair share of production triple faults from bugs in the Intel firmware patches for Spectre, which took weeks to investigate & fix between ourselves struggling to keep our exchange up & running, Intel, and AWS.
And that is why there's value in the CoreOS/ContainerLinux-like solutions we designed & implemented nearly a decade ago now. Being able to promptly roll back any kernel/system/package upgrades at once - either manually or automatically after it's detected a few panics in quick succession - is actually quite awesome. Not to mention the slow update rollout strategy baked into the Omaha controller.
But the reality is that the what-ifs are always the hardest to market, nearly always afterthoughts, and with fast-spiking/fast-decaying traction after major events.
stingraycharles 43 days ago [-]
It really seems like there’s no good non-redhat (but still “production capable”) alternative to CoreOS nowadays, right? It’s pretty much Fedora / Redhat CoreOS or go directly to things such as k3os?
k3os is in a dying limbo; now is the time to get some interest in using stuff like it.

Elemental is pretty close to CoreOS: https://github.com/rancher/elemental/

They even have a way to build arbitrary OS images: https://github.com/rancher/elemental-toolkit

It's pretty great
Spivak 44 days ago [-]
I know it's too late for a bunch of shops but for god's sake please don't use unattended upgrades to do your patching unless you want to hate your life and chase down hard-to-find, hard-to-undo bugs.
Build your images in a CI job and have your deploy version be (code version, image version), so patching runs through all the same tests your code does and you have a trivial roll-forward to undo any mess you find yourself in.
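A minimal sketch of that pipeline, assuming Docker images (the registry name, variable names and ./run-tests.sh script are made up for illustration):

    # Build the image in CI and identify the deploy by (code version, image version).
    CODE_VERSION="$(git rev-parse --short HEAD)"     # the application revision
    IMAGE_VERSION="$(date -u +%Y%m%d%H%M%S)"         # monotonically increasing image build id
    TAG="registry.example.com/myapp:${CODE_VERSION}-${IMAGE_VERSION}"
    docker build -t "$TAG" .
    # OS patching lands in the image build, so it goes through the same tests as code changes.
    docker run --rm "$TAG" ./run-tests.sh
    docker push "$TAG"

Undoing a bad patch is then just another deploy of a rebuilt (or the previous known-good) pair, rather than untangling an in-place upgrade.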
yjftsjthsd-h 44 days ago [-]
> don’t use unattended upgrades
> Build your images in a CI job
I know container images should generally be immutable, but I would expect unattended upgrades to be mostly used on the host, not in a container, in which case that management system doesn't really work (unless you're doing VMs where you can deploy immutable root images to the VMs as well, or some fun bare metal + PXE combination).
jacoblambda 43 days ago [-]
Alternatively, I suppose, depending on the size of your operation, you should consider having a dummy prod with at least one of each of the server types in your environment and using that to validate host upgrades. After that you can push an unattended upgrade via a self-hosted package+upgrade server.
Let things be automatic to the maximum degree possible but give yourself a single hard human checkpoint and some minimum level of validation in a dummy environment first.
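A low-tech way to get that hard checkpoint without standing up a full package server is to hold the risky packages fleet-wide and release the hold only after the dummy-prod host has taken the update and rebooted cleanly; a sketch (the meta-package name depends on your kernel flavour, e.g. linux-image-aws on EC2):

    # Fleet: keep the kernel meta-package held so unattended-upgrades can't touch it.
    sudo apt-mark hold linux-image-generic
    # Dummy prod: let it upgrade and reboot, then check it actually came back.
    # Fleet, once the canary looks healthy:
    sudo apt-mark unhold linux-image-generic
    sudo unattended-upgrade --dry-run --debug   # preview what the next automatic run would do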
ec109685 44 days ago [-]
The idea is that your deploy step should handle both deploying code and upgrading the OS, so all changes go through the same pipeline.
Spivak 43 days ago [-]
> or some fun bare metal + PXE combination
This is actually what I implemented for our hypervisor tier, it’s not as scary as it sounds. I could legit completely rebuild our entire stack down to the metal in about 3 hours.
Kick off a new hypervisor version, the inactive side PXE boots all the nodes, installs and configures a Proxmox cluster, slaves itself to our Ceph cluster, and then either does a hot migration of all the VMs or kicks off a full deploy which rebuilds all the infra (Consul, Rabbit, Redis, LDAP, Elastic, PowerDNS, etc) along with the app servers. The hardest part (which really isn’t) is maintaining the clusters across the blue/green sides.
With this setup our only mutable infrastructure was our Ceph cluster (because replacing OSDs takes unacceptably long) and our DB (for performance the writers lived on dedicated servers, the read replicas lived on the VMs.).
markstos 43 days ago [-]
Sorry, not my experience.
My experience has been that by the time I notice some serious vulnerability is in the news, my servers have already patched themselves. I have never "hated life" or had a "hard to find and undo bug" due to automatic security patching. I pretty quickly found what caused this and had a clear path to resolution.
This is the first security update that caused a boot failure in about a decade. It was bad, but it didn't change my mind about unattended-upgrades. My takeaway is that maybe I should have upgraded my 20.04 servers to 22.04 sooner.
Spivak 43 days ago [-]
You're conflating unattended-upgrades (server mutability, hard to roll back) with automated patching in general. Do automated patching, but also run the changes through your CI so you can catch breaking changes and roll them out in a way that's easy to debug (you can diff images) and revert.
I bet when you update your software dependencies you run those changes through your tests but your OS is a giant pile of code that usually gets updated differently and independently because mostly historical reasons.
markstos 43 days ago [-]
> I bet when you update your software dependencies you run those changes through your tests but your OS is a giant pile of code that usually gets updated differently and independently because mostly historical reasons.
Close. We are moving towards defining our server states through Ansible, but the project is not close to completion. Perhaps once that's further along, we could use Ansible Molecule + CI to test a new server state when there's a new patch available, but that's not an option on the table today.
The system we had in place for /today/ worked: Lower priority or redundant servers were set to auto-reboot after applying security updates, while other critical servers require manual reboot at low-risk times. By then, the patch has already been tested on lower-risk servers.
As a result, this issue caused no user-visible downtime for us, and due to the staggered runs of unattended-upgrades affected a minimal number of servers.
And this was the first time in 10+ years that something like this happened; we have to choose how to prioritize spending our process-improvement time based on likelihood and impact.
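For the "manual reboot at low-risk times" half of that split, the pending state is easy to see on Ubuntu/Debian, e.g.:

    # Set on critical hosts instead of Automatic-Reboot "true":
    #   Unattended-Upgrade::Automatic-Reboot "false";
    # Then check whether an applied update is waiting for a reboot, and which packages asked for it.
    [ -f /var/run/reboot-required ] && cat /var/run/reboot-required.pkgs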
nix23 43 days ago [-]
>I know it's too late for a bunch of shops but for god's sake please don't use unattended upgrades to do your patching unless you want to hate your life and chase down hard-to-find, hard-to-undo bugs
Some years ago everyone said the same about windows-servers ;)
akx 43 days ago [-]
> have to follow the procedure for restoring from backup or re-mounting the root volume on an alternate host to revert the kernel version being run.
Or add `systemd.mask=docker.service` to your boot parameters to prevent Docker from starting.
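If you do get a shell (console, rescue boot, or the brief pre-panic SSH window mentioned elsewhere in the thread), making that parameter permanent is a one-liner plus a grub regeneration; a sketch (back up /etc/default/grub first):

    # Prepend the mask to the default kernel command line, then rebuild grub.cfg.
    sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&systemd.mask=docker.service /' /etc/default/grub
    sudo update-grub
    # One-off alternative: press "e" on the entry in the GRUB menu, append the parameter
    # to the "linux ..." line, and boot with Ctrl-X.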
capableweb 43 days ago [-]
Which, if your server is stuck in an infinite "boot -> docker starting -> container starting -> crashing kernel -> reboot" loop, means you won't ever get a chance to actually add anything to your boot parameters.
dspillett 43 days ago [-]
If you have access to the console (local physical machine, VM on a system that can expose the console, physical box that you have console access to via IPMI or other means), can you not specify that directive to be passed through via grub's interactive menu?
Failing that you could try the “single” directive and poke other configurations once booted in that mode.
A faff to be sure, but hopefully viable options (assuming the interactive menu hasn't been disabled to save a few seconds off boot time!).
bravetraveler 43 days ago [-]
Absolutely can, I'm quite surprised at the 'what do' attitude around this. It's routine -- not in all organizations to be sure, but it's a solved problem.
There are options even without out of band management. You can choose to configure your systems with PXE -- if the installation ever fails, it can boot into a recovery environment over the network.
jacquesm 43 days ago [-]
That's not correct. If you stop the boot you can add 'single' to the boot statement which will drop you in a single user shell from where you can do quite a bit of maintenance.
markstos 43 days ago [-]
AWS at least provides serial console access, so you have the option to access it during the boot cycle.
Alternatively, you umount the drive, attach it to another machine, chroot into it, fix grub or whatever, reverse the process and boot again. It's a few steps, but can be done in a few minutes with practice.
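Roughly, that rescue path looks like this (device name, mount point, and the kernel package name are examples only; 5.13.0-1028 is the affected build mentioned elsewhere in the thread):

    # On the rescue instance, with the broken root volume attached as /dev/xvdf:
    sudo mount /dev/xvdf1 /mnt
    for d in dev proc sys; do sudo mount --bind /$d /mnt/$d; done
    sudo chroot /mnt /bin/bash
    # Inside the chroot: drop the bad kernel (or mask docker.service), then rebuild grub and leave.
    apt remove linux-image-5.13.0-1028-aws
    update-grub
    exit
    # Unmount everything, detach the volume, reattach it to the original instance, and boot.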
bravetraveler 43 days ago [-]
Out of band management is common and highly recommended
mrintegrity 42 days ago [-]
Actually networking and ssh come up for a couple of seconds before containerd triggers the kernel panic so you can fix it by doing this:
while true; do ssh <servername> sudo mv /usr/bin/containerd /usr/bin/containerd.backup ; sleep 1; done
While rebooting the system
gtirloni 43 days ago [-]
Sorry, what? You don't lose complete control of a server because it's rebooting nonstop.
loopz 44 days ago [-]
Wouldn't rollback of kernel be a choice in grub menu?
It's pretty standard for all distros to have that choice.
withinboredom 43 days ago [-]
That usually requires physical access to the server to select it during boot.
darkwater 43 days ago [-]
If you have unattended-upgrade and automatic reboot in the cloud to benefit from security updates for long-lived instances, then you better make sure to have a tty console attached to it. You are treating it like a physical machine, you must have the same tooling around.
bravetraveler 43 days ago [-]
Not really; console access through IPMI is found on most servers.
Exceptions tend to be white boxes built with desktop components, at which point, yea. The proverbial You asked for this problem
akx 43 days ago [-]
Not necessarily. With good timing and some luck, you can connect the serial/"recovery" console before GRUB's timeout ends and either change the running kernel or add the `systemd.mask=docker.service` boot parameter to prevent Docker from starting.
withinboredom 43 days ago [-]
Sounds like a VM and not a physical server.
laumars 43 days ago [-]
Nope. Back before VMs were a thing it was common to do "lights out" style remote management via a console server. That console server would then have a serial connection (the old 9 pin d-sub plug[1]) to your individual physical servers. You could then connect to your remote server's local TTY via the console server, a little like jumping to remote servers via an SSH bastion. However it did sometimes require a little bit of prior configuration, depending on your distro[2].
This wasn't just limited to Linux either. It was a common UNIX trick :)
This is a bit of a lost art these days though. iLo, IPMI have replaced the need for serial. Then virtualisation and, to a lesser extent, containerisation have lowered the bar even further plus also moving the industry towards more ephemeral systems that can be destroyed and rebuilt automatically rather than the old habits of nursing failed hosts back to health.

[1] https://duckduckgo.com/?q=9+pin+d-sub+plug&t=newext&atb=v316...

[2] https://www.kernel.org/doc/html/v5.3/admin-guide/serial-cons... (a lot of distros at the time did ship a kernel with this support compiled in. I don't know how common it is now).
And quite a few implementations actually emulate the serial console, allowing for the exact same access. (Serial Over LAN, or SOL for short.)
topranks 43 days ago [-]
Still common on network devices (Cisco, Juniper, Arista etc.). No IPMI or similar on those.
Console servers from the likes of OpenGear and Lantronix still heavily used for those.
akx 43 days ago [-]
Sure. For a physical server, you'd use its lights-out management to the same effect.
ape4 43 days ago [-]
If it's in the cloud you'd have a virtual console.
zurn 43 days ago [-]
Unsurprisingly AWS is ghetto about this.
taspeotis 43 days ago [-]
Or a real server with Lights Out Management.
withinboredom 43 days ago [-]
That’s why “usually” is in the sentence. :)
Most smaller teams usually don’t prioritize physical access — they usually only need it for one-off events. While this would be a one-off event, it would be one that affects many servers.
corobo 43 days ago [-]
I'd be more inclined to say that physical servers usually have some sort of console access available.
I'm not sure I've ever worked with any (2008-present) that don't in any case.
phillu 43 days ago [-]
That is really not my experience at all. Every professional smaller team I worked with "usually" had this figured out and set up.
In times of home office, no one wants to be at the office for just pressing a single button on some server.
Oh well, I guess experiences differ.
withinboredom 43 days ago [-]
My experience with ops is all pre-2012 and with teams numbering fewer than 3 for the whole org. So I'm sure things have changed or gotten cheaper? I can't see a team of 3-4 having the budget to get something that allows them to be "lazy", especially when that budget can go towards something useful. But I guess the pandemic probably changed things there?
laumars 43 days ago [-]
Serial connections will only cost you a Raspberry Pi (there's probably some really cheap console servers on eBay too).
I don't think the issue is so much cost but more that this kind of systems administration is becoming a forgotten art, because 99% of the time modern tooling removes the need for it. So younger sysadmins are never taught how to do these kinds of things. However, when I started out, I worked in a few small companies that had their physical hosts connected to a console server (which was a Cisco device, like a network switch) via serial cables, and you'd then connect to that console server remotely.
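The hardware side of that really is minimal; a rough sketch (ttyS0, ttyUSB0 and 115200 are the usual defaults, not a given):

    # On the server: tell the kernel and systemd to put a console on the serial port.
    #   append  console=tty0 console=ttyS0,115200n8  to the kernel command line
    sudo systemctl enable --now serial-getty@ttyS0.service
    # On the Pi (or any box with a USB-to-serial adapter wired to that port):
    sudo apt install screen
    sudo screen /dev/ttyUSB0 115200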
topranks 43 days ago [-]
Depends on the infra and how it’s set up.
If you can afford to have something down for an extended period then fine. But even with a small team some services are built such that certain device outages cannot be tolerated, at least for an extended period.
So out-of-band/console servers or whatever still make a lot of sense and a relatively high priority.
jacquesm 43 days ago [-]
You can do this kind of thing across the network if you have to.
hansel_der 43 days ago [-]
no.
it requires access to the serial console or baseboard management controller or whatever terms have emerged.
have never rented a physical server w/o this.
sofixa 43 days ago [-]
> If you had `unattended-upgrades` running and had the "automatic reboot" option enabled, then all your Ubuntu 20.04 servers running Docker would reboot themselves and not come back up.
Isn't the common wisdom that you should have them enabled, but staggered across hours/days?
gtirloni 43 days ago [-]
Not a huge Debian/Ubuntu user, but I think the systemd timer that triggers the unattended updates has a random delay added to it. I don't know if it's hours or just seconds.
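On stock Ubuntu it's a randomized delay on the apt systemd timers, and it's easy to inspect or widen if you want hosts spread further apart (the 4h below is just an example value):

    # See the current stagger window for the unattended-upgrade run.
    systemctl cat apt-daily-upgrade.timer | grep -i RandomizedDelaySec
    # Widen it with a drop-in so hosts pick up (and reboot after) updates hours apart:
    sudo systemctl edit apt-daily-upgrade.timer
    #   [Timer]
    #   RandomizedDelaySec=4h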
markstos 43 days ago [-]
I believe it's staggered across hours by default and it seems that Canonical might have been able to at least stop pushing out the bad update even before they had a fix
AtlasBarfed 43 days ago [-]
Probably better to have rolling A/B replacements that stop the replacement run if the replacement doesn't come up.
This is mostly an in-place upgrade issue?
ConstantVigil 44 days ago [-]
And while I haven't had this happen to me yet, the fear of something like this or even worse is why I try to stay one step behind the update paths on linux distros.
Security patches matter, but I'm no one important, so I should be fine to wait a week or month...
Anyone else who is important though... servers for example...
rawoke083600 43 days ago [-]
That sounds downright horrible!
mroche 44 days ago [-]
Copy-pasta of Jonathan Corbet:
It's nice to see LWN on HN ... but please remember: it is only LWN subscribers that make this kind of writing possible. If you are enjoying it, please consider becoming a subscriber yourself — or, even better, getting your employer to subscribe.

https://news.ycombinator.com/item?id=31852477
If you're interested in detailed commentary on and investigations of the FOSS space, I can't recommend a subscription to LWN enough!
stefantalpalaru 43 days ago [-]
mrintegrity 43 days ago [-]
This was exceptionally annoying for me, some ec2 instances are used only during the day and we stop/start them with an in house scheduling application outside office hours. Also automatic security upgrades are enabled. Came in to work one day last week and all of our UAT environment was down.
It is possible to ssh in for about 2 seconds before the kernel panic so I solved it by doing this:
while true; do ssh <servername> sudo mv /usr/bin/containerd /usr/bin/containerd.backup ; sleep 1; done
On the next reboot I was able to ssh in and change to the (then just released within the past hour) kernel that doesn't have this stupid bug. After another reboot you can move containerd back and it should be working again.
affected: linux-image-5.13.0-1028
not affected: >linux-image-5.13.0-1029
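Once you can stay logged in, pointing the next boot at a known-good kernel can also be done non-interactively; a sketch (the menu entry string is a guess at the usual Ubuntu format; pull the exact one out of grub.cfg, and grub-reboot only takes effect with GRUB_DEFAULT=saved):

    # List the entries grub knows about, then pick one for the next boot only.
    grep -E "^(menuentry|submenu) " /boot/grub/grub.cfg | cut -d"'" -f2
    sudo grub-reboot 'Advanced options for Ubuntu>Ubuntu, with Linux 5.13.0-1029-aws'
    sudo reboot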
kubanczyk 43 days ago [-]
I love the approach. Nowadays you can even
sleep 0.1
on most systems. (And on Busybox, you should have usleep.)
heurisko 43 days ago [-]
I like Ubuntu, but in the last few months I have been following how things are packaged more closely.
For example, looking at the package for postgresql-14, an update still hasn't been released for the unscheduled mid-June release, version 14.4, which fixed possible index corruption:

https://packages.ubuntu.com/jammy-updates/postgresql-14

http://changelogs.ubuntu.com/changelogs/pool/main/p/postgres...

I would have thought this would have been packaged earlier, as I would expect Ubuntu + postgresql to be a common combination.

It makes me wonder exactly how much of a resource is behind creating Ubuntu distributions.
I don’t really have any sources to back this up, but my impression is that Canonical is kinda trying to punch above their weight.
atoav 44 days ago [-]
The cost of complexity showing itself.
A sysadmin friend of mine is totally against docker and his reason is that he wants as little complexity as is needed on his systems. Complexity, he says, leads to emergent behavior.
jve 43 days ago [-]
Docker actually helps manage complexity, by taking the bits and pieces scattered on the floor and putting them into a single cardboard box.
- If you throw the box out, you know you did no harm to other boxes.
- If you change your floor, you know you didn't wipe out something useful.
- Aaand you can `git switch` to a well known state
Of course it's not 100% like that; in reality you still have to have some kind of consistency on where you put your docker-compose file and the Dockerfiles for all the boxes, where you mount your volumes (in some folder, or scattered all over the system), maybe dealing with the host firewall, dealing with not committing secrets into git, etc.
But overall, it's very positive - docker-compose is the (almost) one-stop file you need to see all your references to volumes, Dockerfiles, network configuration, and environment files with secrets.
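As a concrete (and entirely hypothetical) example of that one-stop file, something in this direction:

    # docker-compose.yml -- volumes, ports, env files and images all declared in one place.
    cat > docker-compose.yml <<'EOF'
    version: "3.8"
    services:
      app:
        build: .
        env_file: .env              # secrets stay out of git
        ports: ["8080:8080"]
        volumes:
          - ./data:/var/lib/app     # the one place host state lives
      db:
        image: postgres:14
        volumes:
          - dbdata:/var/lib/postgresql/data
    volumes:
      dbdata:
    EOF
    docker-compose up -d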
markstos 43 days ago [-]
Containers help tame complexity, but the container itself could be run without the docker runtime, using systemd-nspawn, or run as a regular systemd service using podman.
It seems less complex to manage a bunch of systemd services than one pile of systemd services that are managed and logged one way plus a bunch of docker services that are managed and logged another way.
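For the podman route, podman can generate the unit for you, which keeps management and logging uniform with everything else on the host; a sketch (container name and image are placeholders, and newer podman versions prefer Quadlet files, but the idea is the same):

    # Run the container once, then emit and enable a systemd unit for it.
    sudo podman run -d --name myapp docker.io/library/nginx:stable
    sudo podman generate systemd --new --files --name myapp   # writes container-myapp.service
    sudo podman stop myapp && sudo podman rm myapp            # the --new unit recreates it itself
    sudo mv container-myapp.service /etc/systemd/system/
    sudo systemctl daemon-reload
    sudo systemctl enable --now container-myapp.service
    journalctl -u container-myapp.service                     # logged like any other unit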
solarkraft 43 days ago [-]
Deviation from standards is what I find to cause a lot (the most?) of problems. Complexity generally makes things more error-prone, but it tends to become less of a problem in a well-maintained system that a lot of people use (someone, maybe yourself, has already hit and solved the problem you're having).
If your standard is dealing with systemd units, it may make sense to make your containers conform to that (thereby deviating from the most common way of managing containers). Maybe it's what I'd do in a larger operation. For my personal use I find it the most pragmatic to just use Docker since it's reasonably well documented, has reasonably low friction in usage and is very easy to set up.
markstos 43 days ago [-]
Agreed. When I'm picking "the right tool for the job", something I factor in is the tools and languages my team knows. If my team knows JavaScript, I'm going to weigh that when choosing the language for the next project.
jsight 43 days ago [-]
Exactly, you get many of the benefits of separate boxes without some of the added complexity. Docker and containers in general can reduce complexity significantly in a lot of cases.
dncornholio 43 days ago [-]
What you are saying is actual complexity. A box can be a machine instead of a container and all your points will still stand, but with less complexity.
jve 43 days ago [-]
Perhaps, if we talk about deployment.
However, looking at the whole chain of the process:
I can deploy a service/app on my local machine, including dependencies, along with other independent services/apps.
So I can reuse what I have for production. But of course, local development stuff will have different env variables, some docker-compose, etc.
But it is a joy, when you can:
1. git clone something
2. set some env variables
3. docker-compose up -d
And your app, along with database, elasticsearch, whatnot - is running
So for development purposes it really helps. For deployment purposes - if that box is dedicated to something, then yeah, many of the good use cases aren't necessary.
solarkraft 43 days ago [-]
Having all of your services in the same box opens you up to a ton of possible unwanted cross-interaction, which is complexity.
(just not from the system perspective, but from the "reasoning about it" perspective)
iasay 43 days ago [-]
I agree with this. I work on a very large platform and the cost of complexity like this is immense. It’s a not insignificant measurable loss versus not using docker and kubernetes. We’d have been better off using flat EC2 instances for everything and not incurring the packaging complexity, the repository management complexity, the pipeline complexity and the extreme staffing cost to keep multiple large kubernetes clusters running and understood.
Even container security and compliance around it is a measurable loss on its own which is trivially solved if you have bare EC2 instances and a patch cycle.
RedShift1 43 days ago [-]
Although Docker may be another layer, from sysadmin point of view containers are not very complex. What I absolutely love about containers is that it shifts the responsibility of making the software run back to the developers. No more ridiculous installation requirements and long winded instructions just to get something going. Just start container, maybe add some mount volumes and env variables for configuration, boom, done. Way less complicated to set up and manage containers than having to learn what each piece of software does special just to get it going.
throwaway787544 43 days ago [-]
Complexity is just another part of natural systems. It's not something to be avoided for its own sake, any more than we should avoid having eyeballs because they are crazily complex. Yet we tend to like ours and find them worth whatever cost they incur.
The emergent behavior of containerization has had an overall positive effect, even if it has annoying costs.
klqr 43 days ago [-]
For the end user there is no positive effect. Many websites were better in 2005 and had greater uptime.
They were also better organized. Ebay and Amazon were leaner and more pleasant to use.
sofixa 43 days ago [-]
> Many websites were better in 2005 and had greater uptime.
Citation needed. There were a lot fewer sites, maintenance windows lasting hours weren't uncommon, there was no security to speak of (SQL injections, no SSL/TLS, etc.), and sites could do maybe 1% of what today's sites can (not saying that all of it is good or necessary, but I quite like a non-insignificant amount of those new features like native video, audio, graphics, dynamism, etc.).
trasz 43 days ago [-]
It's quite noticeable tbh. Nowadays companies don't care about actual reliability, they only care about "apparent reliability", which is a bullshit statistic, and it shows.
throwaway787544 43 days ago [-]
We can now push 15 different apps to prod on the same server with completely different base distributions and dependencies, and they'll run the same as they did on a dev's laptop. The apps & servers don't crash as much as they used to when the app or packages would break on an update, and somebody had to juggle actual dependencies or run separate VMs or physical servers to run all those apps, and the configuration management that used to hose the box when it was misconfigured now no longer exists. The site is much more reliable now, and more dynamic, as it can be updated more frequently with random tech.
When was the last time you saw a weekly "Our website is down for maintenance for the next 2 days" message, other than for some government website still running COBOL on a mainframe? When was the last time you saw 500 errors? Used to be a daily thing.
nwh5jg56df 43 days ago [-]
> Many websites were better in 2005 and had greater uptime.
Source? Sounds bullshit
lixtra 44 days ago [-]
This is not a docker bug, it is a kernel bug.
It could be triggered by other complex applications that use kernel container features.
atoav 43 days ago [-]
Or it could not be triggered if you don't use containers.
kuschku 43 days ago [-]
It could also be triggered by application sandboxing solutions that aren't containers.
vbezhenar 43 days ago [-]
That's another reason to avoid those appimages and snaps.
Fnoord 43 days ago [-]
Would it have occurred on Podman as well?
markstos 43 days ago [-]
Some of my servers were running Ubuntu with systemd+podman to manage services, and none of them had a problem.
remram 43 days ago [-]
You would have to be running this specific kernel version and one of your containers has to memory-map a file (not from a volume?), from what I understand.
remram 43 days ago [-]
Depending on your configuration. You would need to be using overlayfs with Podman.
jacquesm 43 days ago [-]
That pretty much makes the GP's point: emergent behavior arising from complexity.
dncornholio 43 days ago [-]
Docker hasn't solved any problem for me, so I don't see any use in using it. Meanwhile I have multiple junior devs asking me docker stuff so they can run stuff locally. When I ask them why they even use docker, they always say that some tutorial told them to.
This is when I introduce them to something called VirtualBox and then their eyes go bright with wonder on how simple that works.
jve 43 days ago [-]
> I introduce them to something called VirtualBox and then their eyes go bright with wonder on how simple that works.
Well, if I had a workhorse with loads of RAM... I'd still choose docker, because of how FAST it starts/restarts. And because it is easy to recreate everything with docker - a VM may get messy when installing stuff for APP #1, #2, #x, "works on my machine!" etc.
speedgoose 44 days ago [-]
I think it’s more complex to ask people to package software in a good way without software containers.
capableweb 43 days ago [-]
Is it really? Windows has .exe files, macOS has .app files, Linux has .AppImage (or even tarballs with binaries), and that's just on the host level. Java has .jar files, and so on.
Not to mention, if you want to "natively" pack something for Windows and macOS, containers won't even solve that problem, as they only run on Linux. Only reason you can use Docker on macOS is because of virtualization.
remram 43 days ago [-]
Those are a lot like containers though.
Mac .app files are not just a binary (Mach-O); they can include libraries, "frameworks", etc. that will override what's loaded from the system. AppImage is even worse [1]. .exe files are usually set up with an installer that triggers the side-by-side assembly mechanism, pretending that the system is using the version of the libraries that you included (and growing your WinSXS folder forever). JAR files usually include all their required transitive dependencies rather than "dynamic linking" with other JARs.

[1]: AppImage official documentation: "Do not depend on system-provided resources" https://docs.appimage.org/introduction/concepts.html#do-not-...
A software container is a bit more than a binary or a software package. It also includes the dependencies and the required files. It’s not only a .jar but a .jar with a compatible JVM with the compatible dependencies.
I don't think the fact that it uses virtualisation on Windows or Mac is very bad. I think that it's an advantage for simplicity that everything is Linux (I pretend that windows containers do not exist).
dspillett 43 days ago [-]
I don't do much by way of containers myself¹ but some teams in DayJob do, and some other contacts also. Some run containers in VMs² to separate out some of the complexity due to boot bugs like this - there is a little performance hit from the VM, but failures in the container parts of the kernel can't cause the whole machine not to boot, so it is easier to get in to revert things back to a last-known-good state.
----
[1] I have a couple of bits running via LXC but otherwise use VMs to split services out
[2] One large VM running many containers³, or sometimes a couple of VMs, perhaps separating them performance-wise across drives or with CPU core affinity where that was/seemed easier, or just so in case of disaster they could concentrate on getting the higher priority VM+containers restored and back up first.
[3] Obviously one VM per container would defeat the container benefits, though I've seen this done where docker was the only officially supported install option and they wanted to run a service in a VM.
tpetry 43 days ago [-]
Interestingly, solutions like docker can also make the system less complex. With CoreOS you get an operating system reduced to the bare bones just for running containers; there's a lot less complexity if your OS is only designed to run containers and nothing else.
Gordonjcp 43 days ago [-]
Okay, so presumably they either sysadmin one single site, or an entire fleet of individual servers each for one specific task?
Either way, I don't think they understand what Docker is, what it's for, and why it makes things less complicated.
I've recently migrated to Ubuntu 22.04 and got this: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1971505 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1970453 on HP ProLiant servers.

I've just added some info to those bugs on a possible upstream stable fix.
MBCook 44 days ago [-]
This seems like the kind of thing that automated testing should have been able to catch. It’s not like running Docker is a small use-case these days.
jillesvangurp 44 days ago [-]
My thoughts exactly, the details of what this bug is about technically are interesting and fascinating but the key take away is that something went terribly wrong with Ubuntu's testing processes. This should not have shipped without more scrutiny. Somebody presumably cut some corners there and it's worrying that that is possible at all.
I actually rolled out Ubuntu 22.04 to a few servers a few weeks ago. Pretty uneventful update, all my Ansible scripts for 20.04 worked without modification against these new servers. So, I guess I dodged this bug for now. One reason I've always preferred Ubuntu over Red Hat for servers is that with Red Hat/Centos essentially everything I care about is perpetually and hopelessly out of date and obsolete. So, it just creates a lot of hassle to work around that and get reasonably current versions of things I actually need my servers to run. With Ubuntu that was always a lot more straightforward.
I currently write this on a laptop with Manjaro and Linux 5.18. I'm glad I don't have to deal with about a year of long fixed issues with hardware, bluetooth, GPUs, performance, etc. IMHO there's very little value in sticking with older kernels on desktop machines. Especially when that involves a convoluted process of back-porting and integrating lots of complicated patches. I recently put Ubuntu on an old imac (secure boot prevents booting Manjaro) and I promptly ran into hardware issues that I recall having with Manjaro a few months ago that were fixed by simply upgrading the kernel. Bluetooth especially seems way more flaky. And that's not exactly flawless on 5.18 either. I get the if it ain't broke don't fix it thing; my point is that with modern Desktop Linux things being broken is a constant. The least broken version of Linux is usually the kernel that was just released that has all the cumulative fixes for all the issues addressed in previous kernel releases. Opting out of a few years of those fixes seems misguided.
Even on servers, I suspect simply updating the kernel more regularly would not be the end of the world for most users. With an incubation period to catch bugs/blocking issues of course, the more people use a kernel version, the more stable it gets. I doubt many users would experience any regressions. And it's a lot cheaper to support. If I had the option, I don't think I would opt to run 2-3 year old kernels on any of my servers if I had a different choice. I don't see the value of opting out of 2-3 years worth of known & fixed stability, performance, and other issues.
ungamedplayer 44 days ago [-]
> . One reason I've always preferred Ubuntu over Red Hat for servers is that with Red Hat/Centos essentially everything I care about is perpetually and hopelessly out of date and obsolete
This is exactly why you choose it. Lesser chance of insanity.
ladyanita22 43 days ago [-]
You can always choose Fedora Server if you want a more up-to-date server OS.
lllkbcxdd 43 days ago [-]
Fedora Server.. is RedHat...
There's no "Fedora Server" product and never has been. Do you mean the rolling release CentOS?
That is specifically and explicitly intended for workstations, i.e. desktops and laptops... not for servers.

https://xanmod.org
dncornholio 43 days ago [-]
> I don't see the value of opting out of 2-3 years worth of known & fixed stability, performance, and other issues.
My 3 year old server is running fine. What am I missing out on exactly? My 6 year old router is also running perfectly. Don't fix what isn't broken. Updates often break things without providing me any value.
I'm running a 5 year old Android. Upgrading to a newer version will make my phone sluggish. I don't need a newer Android (yet). My phone works perfectly for me.
Now, if you are going to tell me my security is at risk. Please be specific and provide an example :)
Perhaps add “June 2022” to the title to reduce panic? Updated packages that resolve the issue were released on 2022-06-10, so this article is a post-mortem not an alert of a new problem that could affect people now.
oynqr 44 days ago [-]
They are just trying to convert docker users to snap enthusiasts.
kramerger 43 days ago [-]
Well, lxd is a snap now, so you are maybe onto something
compsciphd 43 days ago [-]
ubuntu 22.04 also broke many IBM laptops. Took them 2 months to fix it, without any acknowledgement that the bug existed. See the number of tickets in launchpad, such as https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1970957

It makes me question the value of my org looking into an Ubuntu Advantage subscription. When there are tickets that have lots of "me too" that result in unusable laptops, one should at least triage them / consolidate them into a single ticket and then be able to mark when fixed.
up6w6 44 days ago [-]
I'm using Oracle's ARM servers and I thought it was some weird patch they did to the kernel, the bug only disappeared when I force upgraded it to 22.04. Ubuntu/Canonical itself would be the last place I would have thought to be the source of a problem like that.
nyc_pizzadev 43 days ago [-]
Interesting, around the same time both of my Ubuntu 20 laptops (Dell and Lenovo) started having major problems connecting to my home wifi. It only affects these 2 laptops; all my other devices have no problems. Before reading this I did think this was a result of a bad Ubuntu update. Given they switched kernel versions, my guess is that this is in fact the culprit. Very annoying, it takes me anywhere from 5 to 45 minutes to establish a wifi session now.
jacquesm 43 days ago [-]
Auto update strikes again. Really, we need to re-think this.
akvadrako 43 days ago [-]
There is no need to rethink it; it's never been a good idea to leave it turned on.
jacquesm 43 days ago [-]
Well, you're between a rock and a hard place. No auto-update = security risk exposure, auto-update = stability risk exposure (and sometimes security risk exposure thrown in for free as well).
amelius 43 days ago [-]
If the only externally visible service you run is sshd then how important is it to auto-update for security reasons? (Also considering that security risks in sshd are almost guaranteed to end up on the front page of HN, so you won't miss it).
leaflets2 43 days ago [-]
> the front page of HN
What if you're in bed with the flu?
But if you're a team, then maybe.
Still, that could delay the response by a whole day (checking HN once a day).
symlinkk 44 days ago [-]
Wow, another buggy Ubuntu patch breaks something. Why don’t they just stick to what’s upstream?
baggy_trough 44 days ago [-]
Yeah, I don't really get why they don't use the stable kernel releases, of which there are many, rather than rolling their own.
kramerger 43 days ago [-]
Seems like every department at canonical needs to learn this on their own.
After all, they reinvented everything from the DE to the init system at least once in the past.
(They also have their own containers, LXD. I actually really like that one, please keep working on that canonical)
leaflets2 43 days ago [-]
What are some other bad mistakes that have been made?
But I assume as Ubuntu follows an April release schedule, it doesn't always match with an appropriate LTS kernel.
lproven 43 days ago [-]
Ubuntu was originally designed as a desktop OS and its release cycle was synched with the GNOME 2.x release cycle.
treesknees 43 days ago [-]
Agreed. Ubuntu does use a stable kernel by default for LTS, at least for ISO installs. This problem occurred within the HWE (hardware enablement) release train, where they backport non-LTS kernels and features, which for some reason they use as a default in various places like their official cloud images.
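If you're not sure which train a given 20.04 host is on, the kernel version and the installed meta-packages give it away; a rough check (the grep pattern is just a convenience and won't cover every flavour):

    uname -r        # 5.4.x is the 20.04 GA kernel; 5.13/5.15.x come from the HWE / cloud kernels
    apt list --installed 2>/dev/null | grep -E '^linux-(generic|virtual|aws|azure|gcp)'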
fomine3 44 days ago [-]
According to the article, it looks like a reasonable modification (though a hard one), but it should have been tested.
zerop 44 days ago [-]
Is there any good documentation that talks about how big open source projects manage code changes and release cadences, given contributors from across the world?
wronglyprepaid 43 days ago [-]
I'm fairly sure this differs between projects/organizations; I'm not sure there is a rule, and not sure there are really any considerations that are specific to open source. Good practices are good practices regardless.
That being said, I rate Canonical's practices as rather poor.
zerop 43 days ago [-]
Taking an example, how are Linux kernel releases planned and managed?
lwswl 41 days ago [-]
The more container bugs the better.
I hope they can't fix it.
latte2021 43 days ago [-]
Does this apply for desktop also?
lproven 43 days ago [-]
Probably not, unless you're working with containers. Most desktop users are not, I suspect... unless they are developers building containers for later deployment on servers.