Mechanism to detect kernel hang at bootup
Easy Excalibur has an unresolved problem: a random hang at bootup. It gets as far as displaying "Loading kernel modules...", which is printed from /etc/rc.d/rc.sysinit, and then stays stuck there.
The kernel has a mechanism to detect hung processes, described here:
https://blog.cloudflare.com/es-la/searching-for-the-cause-of-hung-tasks-in-the-linux-kernel/
So, I have compiled the 6.12.44 kernel with this configuration, in the "Kernel hacking -> Debug Oops, Lockups and Hangs" section:
CONFIG_DETECT_HUNG_TASK=y
CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=60
# CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set
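To confirm that the running kernel picked up these settings, the values can be read back from procfs. This is only a sketch, assuming /proc is mounted; the function name show_hung_task is made up for illustration, and the optional directory argument exists only so that the function can be tried against a scratch directory:

```shell
# Print the hung-task settings of the running kernel.
# $1 is the sysctl directory (defaults to /proc/sys/kernel).
show_hung_task() {
    dir="${1:-/proc/sys/kernel}"
    for f in hung_task_timeout_secs hung_task_panic; do
        if [ -r "$dir/$f" ]; then
            echo "$f=$(cat "$dir/$f")"
        else
            # File is missing if the kernel lacks CONFIG_DETECT_HUNG_TASK
            echo "$f=unavailable"
        fi
    done
}
show_hung_task
```

With the configuration above, I would expect hung_task_timeout_secs to read 60 and hung_task_panic to read 0.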
In /etc/rc.d/rc.sysinit, I inserted this at line 636:
#20250831 if rc.sysinit completes, renamed to .syslogd.log, see further down...
syslogd -O /mnt/wkg/.syslogd.log.${$}
klogd
Then much later in the script:
#20250831
killall klogd
killall syslogd
mv -f /mnt/wkg/.syslogd.log.${$} /mnt/wkg/.syslogd.log
#...see /etc/init.d/00sys_logger
If the boot hangs between those two blocks, wait several seconds and then reboot; the file /mnt/wkg/.syslogd.log.${$} will still exist afterwards. It contains the kernel log, and can be studied for hang or timeout reports.
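Since the process ID in the file name changes from boot to boot, a glob is the easy way to find a leftover log. The following is a rough sketch, not part of rc.sysinit; the function name scan_boot_logs is made up. It counts lines containing "blocked for more than", which is the wording of the kernel's hung-task report, and lines containing "timeout: killing", which is what udevd logs when an event times out:

```shell
# List boot logs left behind by an incomplete rc.sysinit run and count
# lines that look like hang or timeout reports.
# $1 is the directory to search (defaults to /mnt/wkg).
scan_boot_logs() {
    dir="${1:-/mnt/wkg}"
    for log in "$dir"/.syslogd.log.*; do
        [ -f "$log" ] || continue   # glob matched nothing
        # grep -c prints 0 and returns non-zero when nothing matches
        hung=$(grep -c 'blocked for more than' "$log" || true)
        killed=$(grep -c 'timeout: killing' "$log" || true)
        echo "$log: hung_task=$hung udev_timeout=$killed"
    done
}
```

Running scan_boot_logs with no argument looks in /mnt/wkg; no output means no leftover log was found.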
I also edited /etc/init.d/00sys_logger, which runs only if rc.sysinit gets past the second code block above:
#!/bin/sh
case $1 in
start)
#20250831 .syslogd.log created in /etc/rc.d/rc.sysinit
if [ -f /mnt/wkg/.syslogd.log ];then
cat /mnt/wkg/.syslogd.log > /var/log/messages
fi
syslogd #writes to /var/log/messages
klogd
;;
stop)
killall klogd
killall syslogd
;;
esac
...the second startup of syslogd and klogd will append to /var/log/messages.
My Lenovo PC had not hung at bootup for at least a week, and I thought, hey, when will it happen again. Well, serendipity: I rebooted after setting up the above, and it hung, right at "Loading kernel modules...".
I rebooted into Easy Scarthgap, to be careful not to modify /mnt/wkg/.syslogd.log.${$} (the $$ is 335 in my case), and looked at the file. Very interesting: it just keeps repeating this, over and over:
Aug 30 22:03:45 (none) daemon.warn kernel: [ 1271.605276] udevd[420]: slow: 'ata_id --export /dev/sr0' [475]
Aug 30 22:03:46 (none) daemon.err kernel: [ 1272.606540] udevd[420]: timeout: killing 'ata_id --export /dev/sr0' [475]
That 'ata_id' is a binary executable called by a udev rule. I don't think that the kernel hung-task detection has anything to do with these messages, as it has a timeout of 60 seconds (see above). Instead, they come from the 20 second event timeout in udevd, see line 700 in rc.sysinit:
udevd --daemon --resolve-names=early --children-max=32 --event-timeout=20 >/tmp/udevd-debug.log 2>&1
What seems to be happening is that udevd tries to kill 'ata_id' but fails, and it just keeps retrying. At least, that is what seems to be happening. I need to study ata_id, what it does.
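To see which commands udevd keeps trying to kill, and how many times, the saved log can be summarized. This is a minimal sketch, assuming the message format shown in the log excerpt above; the function name killed_commands is made up:

```shell
# Count udevd "timeout: killing" messages per command.
# $1 is the log file; the command is the text between the single quotes.
killed_commands() {
    awk -F"'" '/timeout: killing/ { n[$2]++ }
               END { for (c in n) print n[c], c }' "$1"
}
```

For example, killed_commands /mnt/wkg/.syslogd.log.335 prints one line per command, with the count first. If only a single command shows up with a large count, that supports the idea that udevd is retrying the same kill.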
Anyway, we have progress.
EDIT:
I have removed /usr/lib/udev/rules.d/60-persistent-storage.rules; this is what calls /usr/lib/udev/ata_id
Actually, I had removed it some time ago, as I was suspicious of it, but an EasyOS user asked why the /dev/disk folder was missing, so I put it back. EasyOS does not need that folder. Google's AI says this:
"udev_ata_id is a callout program for the udev device manager that reads product and serial numbers from ATA drives to provide udev with unique, stable identifiers. Udev then uses this information to create symbolic links in /dev/disk/by-id/ and /dev/disk/by-label/."
Nor do users need the /dev/disk folder. You can use the 'blkid' utility to find out that information.
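Run with a partition as its argument, blkid prints a line with fields such as LABEL, UUID and TYPE. Here is a small sketch of pulling one field out of such a line; the function name blkid_field and the device name in the usage comment are only examples, and the sketch assumes the usual 'device: NAME="value" ...' output format:

```shell
# Extract one field (LABEL, UUID, TYPE, ...) from a line of blkid output.
# $1 = field name, $2 = one line as printed by blkid.
blkid_field() {
    printf '%s\n' "$2" | sed -n "s/.* $1=\"\([^\"]*\)\".*/\1/p"
}

# Example usage against a real partition (device name is hypothetical):
#   blkid_field UUID "$(blkid /dev/sda1)"
```

The leading space in the pattern keeps a request for UUID from matching PARTUUID.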
Tags: easy