I hit a funky interaction bug with the last bookworm stable point
release upgrade. In isolation, each component behaves reasonably, but
their combination may result in unexpected service failure. I seek
feedback on improving the situation.
1. Services such as cron and ssh may leave processes behind after a
restart. A long cron job and an existing ssh connection are example
situations where this happens.
2. By default, unattended-upgrades sets Unattended-Upgrade::MinimalSteps
to true and therefore upgrades one source package at a time. As a
consequence it invokes needrestart a number of times.
3. Every time needrestart is invoked, it considers all running services
and considers those left-over cron jobs or ssh connections as a reason
to restart the service even if the main daemon process is no longer
using an outdated copy.
4. systemd imposes a limit on restarting services too frequently. If you
restart a service 10 times within a minute, it temporarily ignores
start requests and leaves the service in a failed state.
The end result is that a stable point release may upgrade glibc rather
early, then each of the minimal steps will restart your service until it
fails. A stable point release has sufficiently many updates to trigger
systemd's limit if you operate on fast storage.
Terminating ssh in an unattended upgrade is a significant problem
justifying important severity. I hope you agree.
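To make the interaction concrete, here is a toy model of points 2 to 4.
It is a sketch under stated assumptions, not systemd's implementation:
the burst of 10 starts per 60 seconds is the figure quoted in this
report (on a real system the values come from StartLimitBurst= and
StartLimitIntervalSec=), and StartLimiter and simulate are hypothetical
names. Every upgrade step is assumed to trigger one restart.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StartLimiter:
    """Toy start-rate limiter: at most `burst` starts per `interval` seconds."""

    burst: int = 10         # figure quoted in this report, not read from systemd
    interval: float = 60.0  # seconds, likewise an assumption
    window_start: float = 0.0
    count: int = 0

    def allow(self, now: float) -> bool:
        # Open a fresh window on first use or once the current one expired.
        if self.count == 0 or now - self.window_start >= self.interval:
            self.window_start = now
            self.count = 0
        if self.count >= self.burst:
            return False  # the service would be left in a failed state
        self.count += 1
        return True


def simulate(steps: int, seconds_per_step: float,
             limiter: StartLimiter) -> Optional[int]:
    """Return the 1-based upgrade step whose restart is refused, or None."""
    for step in range(1, steps + 1):
        now = (step - 1) * seconds_per_step
        if not limiter.allow(now):
            return step
    return None


# Fast storage: 40 minimal steps, 2 seconds each.
fast = simulate(40, 2.0, StartLimiter())
# Slow storage (or a sleep between steps): 7 seconds each.
slow = simulate(40, 7.0, StartLimiter())
```

In this model the fast run is refused at step 11, while the slow run
never accumulates enough restarts within one window.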
Now the question arises what could be done to improve the situation.
The default of Unattended-Upgrade::MinimalSteps is true, on the argument
that this is safer. Arguably, setting it to false also provides a kind
of safety against unattended-upgrades terminating your ssh server.
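For anyone who wants to try that locally, a minimal apt configuration
fragment could look like this. The file name is only an example; any
file under /etc/apt/apt.conf.d/ that sorts after the packaged
unattended-upgrades configuration should work.

```
// /etc/apt/apt.conf.d/52unattended-upgrades-local (example file name)
Unattended-Upgrade::MinimalSteps "false";
```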
Another way to look at this would be that needrestart maybe should
recognize that restarting cron or ssh is not going to help in this
situation and skip doing that.
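As a rough illustration of that idea, the decision could depend only on
the main daemon process rather than on every process left in the
service. This is a hypothetical Python sketch, not needrestart's actual
logic (needrestart is written in Perl); Proc and restart_would_help are
made-up names.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Proc:
    pid: int
    uses_outdated_library: bool


def restart_would_help(main_pid: int, procs: Iterable[Proc]) -> bool:
    """True only if the main daemon process itself uses an outdated copy."""
    return any(p.pid == main_pid and p.uses_outdated_library for p in procs)


# sshd was already restarted (main PID 100 is fresh), but an existing
# connection (PID 2345) still runs the old code.
procs = [Proc(100, False), Proc(2345, True)]
helps = restart_would_help(100, procs)
```

Here helps is False: restarting ssh again would not change anything for
the left-over connection.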
Yet another way of looking at it is that unattended-upgrades maybe
should interact with needrestart more closely and batch up needrestart
even in the face of Unattended-Upgrade::MinimalSteps. Maybe it could
temporarily disable needrestart somehow and then run it once after doing
its thing? That would also speed things up.
We're not yet at the end of options. Skipping restarts of young
processes is also a possible avenue, as suggested by Paul Wise via
#889552.
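One possible reading of "young" is the age of the service's main
process. The sketch below is Linux-only, is not taken from #889552, and
uses hypothetical names and a hypothetical 300 second threshold.

```python
import os


def process_age_seconds(pid: int) -> float:
    """Seconds since `pid` started, derived from Linux /proc."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The command name (field 2) may contain spaces or parentheses, so
    # split after the last ")". Field 22 (starttime, in clock ticks since
    # boot) is then at index 19 of the remaining fields.
    fields = stat.rsplit(")", 1)[1].split()
    start_ticks = int(fields[19])
    with open("/proc/uptime") as f:
        uptime = float(f.read().split()[0])
    return max(0.0, uptime - start_ticks / os.sysconf("SC_CLK_TCK"))


def should_skip_restart(age_seconds: float,
                        min_age_seconds: float = 300.0) -> bool:
    """Skip the restart if the main process started only recently."""
    return age_seconds < min_age_seconds
```

A caller would feed process_age_seconds(main_pid) into
should_skip_restart and leave the service alone when it returns True.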
Last but not least, having unattended-upgrades perform a sleep between
the upgrade operations would make it slow enough to not trigger
systemd's limit.
As we can see, there are lots of options to twist the current behavior
into something that avoids this particular failure mode. On the flip
side, each of them has other subtle consequences, so it is not clear to
me what the best option is. I would appreciate some feedback from the
relevant package maintainers.
Control: tags -1 + patch
On Mon, Mar 17, 2025 at 10:46:23AM +0100, Helmut Grohne wrote:
[...]
I attempted to solve the problem at the needrestart level. There, the
prospects were dim. needrestart supports a NEEDRESTART_SUSPEND variable
to skip its operation, but it cannot reasonably know when to set that. I
also suggested moving the actual restarts into a systemd unit that would
order after apt-daily-upgrade.service to batch service activations, but
there are other distributions with other services invoking needrestart.
We ended up concluding that needrestart was not a good place to fix
this. Still, Thomas Liske provided some feedback and agreed to
contribute to the discussion.
My second attempt is at the unattended-upgrades level. In effect, it is
an unattended-upgrades process that ends up calling needrestart: an apt
invocation runs needrestart's apt-pinvoke. Thus it is able to control
needrestart via NEEDRESTART_SUSPEND. If the unattended-upgrades process
were to set that variable and call apt-pinvoke once at the end, we'd
effectively get the batching suggested earlier and fix the root cause.
Now unattended-upgrades has a plugin mechanism. The primary hooking
mechanism is postrun, which is a good place to call apt-pinvoke. In
addition, __init__ is called early and allows us to modify the process
environment. It probably was not intended that way, but it works. And
that approach yields a fairly reliable mechanism for batching
needrestart when called from unattended-upgrades. I'm attaching the
resulting unattended-upgrades plugin that can be dropped into one of
/etc/unattended-upgrades/plugins or
/usr/share/unattended-upgrades/plugins.
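The attachment is not reproduced here. For readers who only want the
shape of the approach, the following is a minimal sketch written from
the description above, not the attached file. Several details are
assumptions: that plugin classes are discovered by a name starting with
UnattendedUpgradesPlugin, that postrun receives a single result
argument, that apt-pinvoke lives at /usr/lib/needrestart/apt-pinvoke and
can be called without arguments, and that any non-empty value of
NEEDRESTART_SUSPEND is sufficient.

```python
import os
import subprocess

# Assumed location of needrestart's apt hook; adjust if it differs.
APT_PINVOKE = "/usr/lib/needrestart/apt-pinvoke"


class UnattendedUpgradesPluginNeedrestartBatch:
    """Suspend needrestart during the run, then invoke it once afterwards."""

    def __init__(self):
        # Called early: every child process of this unattended-upgrades
        # process inherits the variable, so needrestart skips its work.
        os.environ["NEEDRESTART_SUSPEND"] = "1"  # value is an assumption

    @staticmethod
    def _run(command, env):
        try:
            return subprocess.call(command, env=env)
        except FileNotFoundError:
            return None  # needrestart is not installed; nothing to do

    def postrun(self, result):
        # `result` is not needed for batching and is ignored here.
        os.environ.pop("NEEDRESTART_SUSPEND", None)
        # One needrestart pass for the whole run instead of one per step.
        self._run([APT_PINVOKE], dict(os.environ))
```

The hook may need the same arguments that needrestart's own apt
configuration passes to it; check that file before relying on the bare
call.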
Now the question becomes whether either unattended-upgrades or
needrestart would be willing to install this plugin below /usr/share to
make it active by default. From my point of view, either package would
be a good fit for doing so.