• Bug#1100338: cloud.debian.org: cloud-init is not consistently enabled o

    From Noah Meyerhans@21:1/5 to All on Wed Mar 12 19:30:01 2025
    Package: cloud.debian.org
    Severity: important

    This has so far only been observed on Azure. It's not clear whether it's impacted other cloud environments or not.

    Cloud-init is not consistently being enabled during VM provisioning on Microsoft Azure. The external symptom is that the launch times out. We can see from the logs that the various systemd services associated with cloud-init are never invoked at all:

    Failed launch:

    root@boot-failure:~# journalctl -b -1 | grep cloud-init | wc -l

    0

    Versus a successful launch:

    root@boot-failure:~# journalctl -b 0 | grep cloud-init | wc -l
    102
    root@boot-failure:~# journalctl -b 0 | grep cloud-init | head -n1
    Mar 12 18:12:01 localhost systemd[1]: Starting cloud-init-main.service - Cloud-init: Single Process...
    root@boot-failure:~# journalctl -b 0 | grep cloud-init.target
    Mar 12 18:12:12 boot-failure systemd[1]: Reached target cloud-init.target - Cloud-init target.

    It may have something to do with cloud-init's systemd generator, which is responsible for conditionally enabling cloud-init.target early in the boot process. On a failed launch it generates the following in /run/cloud-init/cloud-init-generator.log:

    /usr/lib/systemd/system-generators/cloud-init-generator normal=/run/systemd/generator early=/run/systemd/generator.early late=/run/systemd/generator.late
    checking for datasource

    While a successful launch specifically indicates that it's enabling cloud-init.target:

    /usr/lib/systemd/system-generators/cloud-init-generator normal=/run/systemd/generator early=/run/systemd/generator.early late=/run/systemd/generator.late
    checking for datasource
    ds-identify rc=0
    enabled via /run/systemd/generator.early/multi-user.target.wants/cloud-init.target -> /lib/systemd/system/cloud-init.target

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Noah Meyerhans@21:1/5 to All on Thu Mar 13 03:40:01 2025
    I can confirm that systemd is sending a TERM to (at least) the cloud-init-generator process. However, I'm not yet sure why:

    * The cloud-init-generator process typically runs in approximately 30ms,
    and is sometimes killed.

    * I've been able to insert an artificial 2s pause in the middle of the
    script without appearing to impact the frequency of it being
    killed.

    * Timing of this specific generator doesn't explain why several other
    generators are also apparently being killed.

    The systemd docs (https://manpages.debian.org/unstable/systemd/systemd.generator.7.en.html) indicate that generators are executed synchronously during the boot
    process and should thus take care to not "delay the entire boot if they
    are slow". However, there's no mention of a specific runtime
    requirement. Is the deadline dynamic in some way?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bastian Blank@21:1/5 to Noah Meyerhans on Thu Mar 13 18:30:02 2025
    On Wed, Mar 12, 2025 at 10:36:19PM -0400, Noah Meyerhans wrote:
    The systemd docs (https://manpages.debian.org/unstable/systemd/systemd.generator.7.en.html) indicate that generators are executed synchronously during the boot
    process and should thus take care to not "delay the entire boot if they
    are slow". However, there's no mention of a specific runtime
    requirement. Is the deadline dynamic in some way?

    The default timeout is 90s, so it's far off.

    I read a bit through systemd. I don't see anything apart "alarm()" to
    enforce a timeout.

    "systemd.log_level=debug systemd.log_target=console" on the kernel
    command line gives some more insight. But this is a lot of output, so
    requires serial console output.

    Bastian

    --
    No one wants war.
    -- Kirk, "Errand of Mercy", stardate 3201.7

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Noah Meyerhans@21:1/5 to Bastian Blank on Thu Mar 13 18:40:02 2025
    On Thu, Mar 13, 2025 at 06:21:13PM +0100, Bastian Blank wrote:
    "systemd.log_level=debug systemd.log_target=console" on the kernel
    command line gives some more insight. But this is a lot of output, so requires serial console output.

    Yep. One of my earlier updates contains the full debug output from
    systemd. Interestingly, it doesn't seem like systemd logs anything at
    all when it sends a SIGTERM (at least to generators).

    You mentioned on IRC that you've observed this on the generic images,
    and I've now observed the same thing. Note that it doesn't occur only
    on first boot; it appears that it can happen on any reboot (or
    presumably on daemon-reload, as generators are executed then as well).

    One potentially interesting entry from the debug logs is:
    [ 3.398752] boot-failure1 (sd-gens)[271]: (sd-exec-strv) failed with exit status 1.

    It does occur on a successful launch as well, but because the generators
    all run in parallel, it's possible that whatever causes this does not
    always lead to failure depending on how the events are ordered.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bastian Blank@21:1/5 to Noah Meyerhans on Thu Mar 13 19:30:02 2025
    On Thu, Mar 13, 2025 at 01:36:03PM -0400, Noah Meyerhans wrote:
    Yep. One of my earlier updates contains the full debug output from
    systemd. Interestingly, it doesn't seem like systemd logs anything at
    all when it sends a SIGTERM (at least to generators).

    That mail did not make it to the list. I also don't see any code path
    that would do explicit SIGTERM.



    You mentioned on IRC that you've observed this on the generic images,
    and I've now observed the same thing. Note that it doesn't occur only
    on first boot; it appears that it can happen on any reboot (or
    presumably on daemon-reload, as generators are executed then as well).

    One potentially interesting entry from the debug logs is:
    [ 3.398752] boot-failure1 (sd-gens)[271]: (sd-exec-strv) failed with exit status 1.

    Yes. This code bails out on all and every error with exit(1).

    Okay, found it:

    wait_for_terminate_and_check:
    | log_full(prio, "%s terminated by signal %s.", strna(name), signal_to_string(status.si_status));
    | return -EPROTO;

    do_execute:
    | r = wait_for_terminate_and_check(…);
    | if (r < 0) return r;

    execute_strv:
    | r = do_execute(…)
    | _exit(r < 0 ? EXIT_FAILURE : r);

    So the SIGABRT of netplan short circuits the error handling and kills
    off all other generators.

    Bastian

    --
    Those who hate and fight must stop themselves -- otherwise it is not stopped.
    -- Spock, "Day of the Dove", stardate unknown

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Noah Meyerhans@21:1/5 to Bastian Blank on Thu Mar 13 20:20:02 2025
    Control: block -1 by 1100418
    Control: affects 1100418 + cloud.debian.org

    On Thu, Mar 13, 2025 at 07:23:24PM +0100, Bastian Blank wrote:
    Yep. One of my earlier updates contains the full debug output from
    systemd. Interestingly, it doesn't seem like systemd logs anything at
    all when it sends a SIGTERM (at least to generators).

    That mail did not make it to the list. I also don't see any code path
    that would do explicit SIGTERM.

    I think the TERM comes from https://github.com/systemd/systemd/blob/main/src/shared/exec-util.c#L271

    The mail is here and the log is at https://bugs.debian.org/cgi-bin/bugreport.cgi?att=1;bug=1100338;filename=journal.fail;msg=10,
    but I don't think that matters anymore.

    You mentioned on IRC that you've observed this on the generic images,
    and I've now observed the same thing. Note that it doesn't occur only
    on first boot; it appears that it can happen on any reboot (or
    presumably on daemon-reload, as generators are executed then as well).

    One potentially interesting entry from the debug logs is:
    [ 3.398752] boot-failure1 (sd-gens)[271]: (sd-exec-strv) failed with exit status 1.

    Yes. This code bails out on all and every error with exit(1).

    Okay, found it:

    wait_for_terminate_and_check:
    | log_full(prio, "%s terminated by signal %s.", strna(name), signal_to_string(status.si_status));
    | return -EPROTO;

    do_execute:
    | r = wait_for_terminate_and_check(…);
    | if (r < 0) return r;

    execute_strv:
    | r = do_execute(…)
    | _exit(r < 0 ? EXIT_FAILURE : r);

    So the SIGABRT of netplan short circuits the error handling and kills
    off all other generators.

    Yep, that's basically the path I was on, too. That explains the
    intermittent nature of the failures, too, since it's possible for the
    ABRT to happen after some generators have already completed
    successfully.

    I wonder what's going on with netplan-generator...

    noah

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bastian Blank@21:1/5 to Noah Meyerhans on Fri Mar 14 20:40:01 2025
    Control: clone -1 -2
    Control: reassing -2 systemd
    Control: severity -2 important
    Control: retitle -2 systemd - generators gets killed

    systemd kills remaining generators in cleanup if one generator died with
    a signal. This might break booting.

    On Thu, Mar 13, 2025 at 03:10:51PM -0400, Noah Meyerhans wrote:
    On Thu, Mar 13, 2025 at 07:23:24PM +0100, Bastian Blank wrote:
    You mentioned on IRC that you've observed this on the generic images,
    and I've now observed the same thing. Note that it doesn't occur only
    on first boot; it appears that it can happen on any reboot (or
    presumably on daemon-reload, as generators are executed then as well).

    One potentially interesting entry from the debug logs is:
    [ 3.398752] boot-failure1 (sd-gens)[271]: (sd-exec-strv) failed with exit status 1.

    Yes. This code bails out on all and every error with exit(1).

    Okay, found it:

    wait_for_terminate_and_check:
    | log_full(prio, "%s terminated by signal %s.", strna(name), signal_to_string(status.si_status));
    | return -EPROTO;

    do_execute:
    | r = wait_for_terminate_and_check(…);
    | if (r < 0) return r;

    execute_strv:
    | r = do_execute(…)
    | _exit(r < 0 ? EXIT_FAILURE : r);

    So the SIGABRT of netplan short circuits the error handling and kills
    off all other generators.

    Yep, that's basically the path I was on, too. That explains the
    intermittent nature of the failures, too, since it's possible for the
    ABRT to happen after some generators have already completed
    successfully.

    I wonder what's going on with netplan-generator...

    Bastian

    --
    Prepare for tomorrow -- get ready.
    -- Edith Keeler, "The City On the Edge of Forever",
    stardate unknown

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)