Monday, July 4, 2016

Fedora 23 boot time optimization

Before I start, if you are only interested in the solutions to my boot problems, skip directly to the "Let's get started" section further below.  For those of you interested in the background story, read on:

This past weekend I decided to fix a problem that has been bothering me for a while.  I have a perfectly good Intel based PC that I use at home for various tasks.

It's an older machine dating from 2006, running on an Intel Core 2 Duo E6600 with 4 gigabytes of RAM.  Of course there are no SSDs; it's running on 5600rpm SATA drives.  Despite its low hardware specs, it should still be fairly useful as long as I'm not trying to run the latest games on it.  Besides, I mostly use it to code in C++ and work on small projects.

I recently installed Fedora 23 from a live DVD, in order to have a server that I can access remotely.  Note that this was an "out-of-the-box" installation with no special requirements.

Immediately I noticed that the boot-time was relatively slow.  I didn't measure it at that time, but it was probably around 1.5 minutes or more.  Since I wasn't planning on using the system much, I didn't care to look into it either.  My plan was to boot it up in the morning, use it remotely and turn it off at night... boot time was not an issue.

Recently however, I found myself accessing it more often and the boot time started to become cumbersome.  Following some research online I found a few tools which anyone serious about fixing slow boot-times should get familiar with:

systemd-analyze
systemd-analyze blame
systemd-analyze critical-chain
systemd-analyze plot > somefile.svg

I used all four of these to gather the data that I needed to determine what was causing my booting woes and believe it or not, I went from over 3 minutes to less than 25 seconds.

Here's a summary of the various systemd-analyze boot times as I experimented with the tweaks below:

[removed@removed ~]$ cat ./boot-time
Startup finished in 904ms (kernel) + 4.155s (initrd) + 2min 59.009s (userspace) = 3min 4.069s
Startup finished in 906ms (kernel) + 4.096s (initrd) + 1min 439ms (userspace) = 1min 5.441s
Startup finished in 905ms (kernel) + 4.089s (initrd) + 1min 1.978s (userspace) = 1min 6.973s
Startup finished in 904ms (kernel) + 4.162s (initrd) + 41.080s (userspace) = 46.147s
Startup finished in 904ms (kernel) + 4.148s (initrd) + 40.696s (userspace) = 45.749s
Startup finished in 905ms (kernel) + 4.209s (initrd) + 35.845s (userspace) = 40.959s
Startup finished in 905ms (kernel) + 4.183s (initrd) + 1min 158ms (userspace) = 1min 5.246s
Startup finished in 905ms (kernel) + 4.186s (initrd) + 34.855s (userspace) = 39.947s
Startup finished in 905ms (kernel) + 4.278s (initrd) + 30.879s (userspace) = 36.063s
Startup finished in 906ms (kernel) + 4.047s (initrd) + 31.017s (userspace) = 35.971s
Startup finished in 905ms (kernel) + 4.161s (initrd) + 30.502s (userspace) = 35.569s
Startup finished in 905ms (kernel) + 4.070s (initrd) + 29.762s (userspace) = 34.739s
Startup finished in 906ms (kernel) + 4.198s (initrd) + 19.584s (userspace) = 24.688s
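The mixed "min", "s" and "ms" units make these lines awkward to compare at a glance.  As a rough sketch (the to_seconds helper is my own illustration, not part of systemd), the total at the end of each line can be converted to plain seconds with awk:

```shell
# to_seconds: read systemd-analyze summary lines on stdin and print the
# total boot time of each one (the part after " = ") in seconds.
# Hypothetical helper for illustration only.
to_seconds() {
    awk -F' = ' '/^Startup finished/ {
        n = split($2, part, " ")
        total = 0
        for (i = 1; i <= n; i++) {
            # awk uses the leading number of each token, e.g. "3min" -> 3
            if (part[i] ~ /min$/)      total += part[i] * 60
            else if (part[i] ~ /ms$/)  total += part[i] / 1000
            else if (part[i] ~ /s$/)   total += part[i]
        }
        printf "%.3f\n", total
    }'
}
```

With the function loaded in the shell, running "to_seconds < ./boot-time" prints one number per recorded boot.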

To be fair, I had added configuration to my system which only served to slow down boot time: SMB and NMB.  I wanted to share some files with my Windows computers, but I have now decided to disable both services.

I'm not going to go through all of the changes that I made, the research I did and the reasons I chose to do certain things.  A lot of it was hit-and-miss, with data gathered from all over the place.  I found that the Arch Linux documentation is, as always, extremely helpful.  Here's an example:

https://wiki.archlinux.org/index.php/systemd#Journal_size_limit

I created a file containing my changes and the perceived effect on boot time and I will attempt to describe what I did from beginning to the end.

Let's get started:

First boot time trace using systemd-analyze >> ./boot-time

Startup finished in 904ms (kernel) + 4.155s (initrd) + 2min 59.009s (userspace) = 3min 4.069s

My first attempt at reducing boot time was based on the output of systemd-analyze blame, which showed that firewalld seemed to be a bit of a bottleneck.  Replacing it cut boot time by close to two minutes.  Why does firewalld take so long to start, and why does it block the boot process?  I don't know; I haven't researched it yet.
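The blame listing can be long, since it prints one line per unit, slowest first.  A minimal sketch for trimming it down, assuming the time is printed as the first field and that anything measured only in milliseconds or microseconds is not worth chasing (the slow_units name is hypothetical):

```shell
# slow_units: filter "systemd-analyze blame" output read from stdin,
# dropping units whose start-up time is given only in ms or us.
# Hypothetical helper for illustration only.
slow_units() {
    awk '$1 !~ /^[0-9.]+(ms|us)$/'
}
```

It would be used as "systemd-analyze blame | slow_units".  Keep in mind that a slow unit only hurts boot time if other units wait on it, which is what systemd-analyze critical-chain helps to show.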

Switch from Firewalld to IPTables (considerable difference)

# Removed firewalld - replaced with iptables
# sudo systemctl stop firewalld
# sudo systemctl disable firewalld
# sudo dnf install iptables-services
# sudo systemctl enable iptables.service
# sudo systemctl start iptables.service

Startup finished in 906ms (kernel) + 4.096s (initrd) + 1min 439ms (userspace) = 1min 5.441s

Disable plymouth-quit-wait.service (considerable difference)

Another bottleneck was the plymouth-quit-wait.service.  This was an obvious one and many people recommend disabling it, but I'm not sure that it should be disabled; I still have to research this one further.  I have a feeling I would prefer to disable plymouth entirely.  Yet the difference on boot is considerable, with a gain of around 20 seconds.  Note that you have to both disable and mask the service for the gain to take effect.

# sudo systemctl disable plymouth-quit-wait.service
# sudo systemctl mask plymouth-quit-wait.service - BIG DIFFERENCE
Startup finished in 904ms (kernel) + 4.162s (initrd) + 41.080s (userspace) = 46.147s

Readahead on boot? (negligible or negative)

I tried installing preload to have some readahead capabilities, but this only helps applications after boot, and has a small negative effect on boot time, so I ended up removing it.

# sudo dnf install preload
# sudo systemctl enable preload.service
Startup finished in 904ms (kernel) + 4.148s (initrd) + 40.696s (userspace) = 45.749s


Note that the RedHat team decided to completely remove systemd-readahead, whose sole purpose was to improve boot speed.  More on this later.

Things are now still relatively slow and inconsistent:

Startup finished in 905ms (kernel) + 4.183s (initrd) + 1min 158ms (userspace) = 1min 5.246s

Disable Samba and NetBIOS (considerable difference)

The next step was to disable SMB and NMB which also had a considerable impact:

# DISABLED SAMBA and NETBIOS - BIG DIFFERENCE
# sudo systemctl stop smb.service
# sudo systemctl stop nmb.service
# sudo systemctl disable smb.service
# sudo systemctl disable nmb.service

Startup finished in 905ms (kernel) + 4.186s (initrd) + 34.855s (userspace) = 39.947s

Disable libvirtd (minor)

I noticed a few services starting up which I didn't think I needed, so I researched them and disabled them.  I use VirtualBox and have the VirtualBox kernel driver installed, so I have no need for libvirtd or libvirt.  I saved 4 seconds but again, these differences are not always consistent.

# DISABLED LIBVIRTD - NOT SIGNIFICANT
# sudo systemctl disable libvirtd.service
Startup finished in 905ms (kernel) + 4.278s (initrd) + 30.879s (userspace) = 36.063s

Disable ModemManager (minor)

ModemManager seemed to be one that took a long time to start up, but it was not a blocking service, so disabling it did not improve boot time significantly.  Still, if I don't need it, it's wasting precious time.

# DISABLED MODEMMANAGER - NOT SIGNIFICANT
Startup finished in 905ms (kernel) + 4.161s (initrd) + 30.502s (userspace) = 35.569s

Disable rngd (minor)

The random number generator daemon caused errors in the logs since I have no hardware-based seed.  I have no idea why this has to be loaded by default.

# DISABLED rngd.service - NOT SIGNIFICANT
Startup finished in 905ms (kernel) + 4.070s (initrd) + 29.762s (userspace) = 34.739s
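The ModemManager and rngd steps above list no commands.  They follow the same stop-and-disable pattern used for the other services, sketched here as a small function.  The disable_units name and the RUN variable are my own invention, and I'm assuming the unit names ModemManager.service and rngd.service; check the exact names on your system with systemctl list-unit-files.

```shell
# disable_units: stop each named unit and keep it from starting at boot.
# Set RUN=echo to print the commands instead of executing them.
# Hypothetical helper for illustration only.
disable_units() {
    for unit in "$@"; do
        ${RUN:-sudo} systemctl stop "$unit"
        ${RUN:-sudo} systemctl disable "$unit"
    done
}
```

For example, "RUN=echo disable_units ModemManager.service rngd.service" shows what would be run, and dropping RUN=echo actually applies it.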

Limit systemd-journal size (considerable)

Now another big one was the systemd-journal.service.  After several boot-ups, this one took approximately 13 seconds to load on average and it was blocking other services.

Apparently, the journal system is slow because on boot it reads entries from one set of files and rewrites them to another set of files.  It does not copy the files directly, but goes into each file, reads the entries as objects and dumps these into another file.  The bigger your logs, the slower this will be.  This information comes from an old bug report, so things may have changed since; I'm not sure.

My home PC had a /var/log/journal directory of about 750 MB and the service took 13 seconds almost consistently.

On the other hand, my work PC runs Fedora 23 (same basic install) on a 2012 Dell with an Intel i7, 16 GB of RAM and 7200rpm drives.  Its journal log was 1.58 GB and the systemd-journal.service started in 4 seconds... That's a big difference and it all has to do with hardware.
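To check how large the journal is on a given machine, a plain du on the directory is enough.  A small sketch, with /var/log/journal as the default path taken from above (the journal_size name is hypothetical):

```shell
# journal_size: print the disk space used by the journal directory.
# Takes an optional directory argument; defaults to /var/log/journal.
# Hypothetical helper for illustration only.
journal_size() {
    du -sh "${1:-/var/log/journal}"
}
```

journalctl --disk-usage reports the same kind of figure from journald's point of view.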

There is a way to solve this problem for slower computers and the difference is a considerable 10 second improvement:

# SET MAX JOURNAL SIZE: https://wiki.archlinux.org/index.php/systemd#Journal_size_limit    -- BIG DIFFERENCE
Startup finished in 906ms (kernel) + 4.198s (initrd) + 19.584s (userspace) = 24.688s
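The linked wiki section describes the SystemMaxUse= setting in /etc/systemd/journald.conf.  Below is a rough sketch of applying it from a script.  The set_journal_max function is hypothetical, the 50M value is only an example and not necessarily the limit I ended up with, and sed -i assumes GNU sed.

```shell
# set_journal_max SIZE [FILE]: set SystemMaxUse in a journald.conf-style
# file, replacing an existing (possibly commented-out) entry or appending
# one.  FILE defaults to /etc/systemd/journald.conf.
# Hypothetical helper for illustration only.
set_journal_max() {
    size="$1"
    conf="${2:-/etc/systemd/journald.conf}"
    if grep -q '^#*SystemMaxUse=' "$conf"; then
        sed -i "s/^#*SystemMaxUse=.*/SystemMaxUse=$size/" "$conf"
    else
        printf 'SystemMaxUse=%s\n' "$size" >> "$conf"
    fi
}
```

Run it from a root shell, for example "set_journal_max 50M", then restart systemd-journald or reboot.  Journal files that are already on disk can be trimmed with "sudo journalctl --vacuum-size=50M".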

... 24.688 seconds ...

Conclusion:

Apart from SMB and NMB, which I added and later removed myself, the bulk of these are relatively minor tweaks which should come out-of-the-box.  While I agree that it's nice to have OSes able to use powerful hardware, I find it's a waste to load services if they are not needed.  That being said, while I find that most RedHat folks are great and helpful, there are some points of view that I disagree with wholeheartedly:  https://lists.freedesktop.org/archives/systemd-devel/2014-August/022002.html

This is just an example, but I find it hard to believe that support would be dropped for boot-time readahead just because "Nobody in the systemd team still works on a laptop with rotating media, hence nobody tries to optimize it in any way."  - that is a very poor excuse for not optimizing systems.

I agree that read-ahead does become cumbersome on SSD drives, but then at least provide it as an option during the installation of the OS.  Why should the multitudes of rotating-media systems suffer simply because the RedHat team all use SSDs?

At the end of the day I start to ask myself, why am I running Fedora again?

Thursday, January 14, 2016

CentOS 7 - No network device detected

While installing CentOS 7 on older hardware, I found a solution to the problem where the network device is not automatically detected during or after the system install.

After some searching on Google, the following forum post gave me the solution to this problem:  http://unix.stackexchange.com/a/200029

"This device uses the forcedeth driver which is disabled in the CentOS 7 kernel.
You can use the kmod-forcedeth driver from elrepo.org:
http://elrepo.org/linux/elrepo/el7/x86_64/RPMS/kmod-forcedeth-0.64-1.el7.elrepo.x86_64.rpm"

As the answer states, it is simply a matter of installing the RPM and rebooting the system.  To do this, I downloaded the file to a USB stick and ran the yum localinstall command as follows:

# yum localinstall /run/media/.../kmod-forcedeth-0.64-1.el7.elrepo.x86_64.rpm
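After the reboot, it is worth confirming that the driver was actually loaded.  The kernel lists loaded modules in /proc/modules (the same data lsmod shows), so a minimal check could look like this; the module_loaded name is hypothetical:

```shell
# module_loaded NAME [FILE]: succeed if kernel module NAME appears in the
# module list.  FILE defaults to /proc/modules.
# Hypothetical helper for illustration only.
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}"
}
```

For example: "module_loaded forcedeth && echo loaded".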