Antisocial worker nodes, and maybe 2GB of memory is not enough
I'm almost done: once I spin up my two worker nodes, I'll have my own private Kubernetes cluster. I breeze through the first few steps of chapter 9, until I get to the part about disabling swap (by default, the kubelet refuses to start while swap is enabled).
This should be easy: identify the swap partition or file, comment it out of /etc/fstab, and reboot. But swap is no longer defined there:
root@node-0:~# swapon --show
NAME      TYPE      SIZE USED PRIO
/dev/sda5 partition 975M 524K   -2
root@node-0:~# grep sda5 /etc/fstab
# swap was on /dev/sda5 during installation
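For reference, the classic entry I expected to find and comment out would have looked something like this (a sketch; the exact mount options vary by install):

# the traditional fstab swap entry (sketch):
/dev/sda5  none  swap  sw  0  0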
I'm all in favor of systemd: over the years I've been bitten too many times by the shortcomings of the classic SysV init/rc system. Even so, there are times when its all-encompassing nature irks me.
Turning off swap the systemd way:
root@node-0:~# systemctl --type swap
UNIT                                                                      LOAD   ACTIVE SUB
dev-disk-by\x2duuid-8ad8b7b3\x2d87ae\x2d423a\x2daa1c\x2d67b82c1a8812.swap loaded active active
Are those backslashes escapes or literal backslashes? One way to find out:
root@node-0:~# systemctl mask dev-disk-by\x2duuid-8ad8b7b3\x2d87ae\x2d423a\x2daa1c\x2d67b82c1a8812.swap
Unit dev-disk-byx2duuid-8ad8b7b3x2d87aex2d423ax2daa1cx2d67b82c1a8812.swap does not exist, proceeding anyway.
Created symlink /etc/systemd/system/dev-disk-byx2duuid-8ad8b7b3x2d87aex2d423ax2daa1cx2d67b82c1a8812.swap → /dev/null.
root@node-0:~# systemctl mask 'dev-disk-by\x2duuid-8ad8b7b3\x2d87ae\x2d423a\x2daa1c\x2d67b82c1a8812.swap'
Created symlink /etc/systemd/system/dev-disk-by\x2duuid-8ad8b7b3\x2d87ae\x2d423a\x2daa1c\x2d67b82c1a8812.swap → /dev/null.
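So they're escapes as far as the shell is concerned: unquoted, the backslashes are eaten before systemd ever sees the name, which is why the first attempt masked a unit that doesn't exist. Rather than guessing at the quoting, systemd-escape will generate the exact unit name; a sketch, using the by-uuid path that unit points at:

root@node-0:~# systemd-escape --path --suffix=swap /dev/disk/by-uuid/8ad8b7b3-87ae-423a-aa1c-67b82c1a8812
dev-disk-by\x2duuid-8ad8b7b3\x2d87ae\x2d423a\x2daa1c\x2d67b82c1a8812.swap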
Then turn off any swap still in use:
root@node-0:~# swapoff -a
root@node-0:~# systemctl --type swap
UNIT LOAD ACTIVE SUB DESCRIPTION
0 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
Reboot, and check our work:
root@node-0:~# systemctl --type swap --all
UNIT                                                                      LOAD   ACTIVE   SUB
dev-disk-by\x2duuid-56340219\x2d2c91\x2d4837\x2db4d2\x2d82e9823cbea9.swap loaded inactive dead
I finish up the rest of node-0's configuration, start containerd and the Kubernetes services, and verify everything's running:
root@node-0:~# systemctl|grep kube
kube-proxy.service        loaded active running Kubernetes Kube Proxy
kubelet.service           loaded active running Kubernetes Kubelet
kubepods-besteffort.slice loaded active active  libcontainer container kubepods-besteffort.slice
kubepods-burstable.slice  loaded active active  libcontainer container kubepods-burstable.slice
kubepods.slice            loaded active active  libcontainer container kubepods.slice
I repeat these steps on node-1, then do the final verification from jumpbox:
root@jumpbox:~/kubernetes-the-hard-way# ssh root@server \
    "kubectl get nodes \
    --kubeconfig admin.kubeconfig"
Or I try to: it takes a very long time just to get a shell prompt, and then the ssh command hangs. OK, I'll run it from server instead.
I'm finally able to get the worker node status, though it takes a minute or so for the command to complete:
root@server:~# kubectl get nodes --kubeconfig admin.kubeconfig
NAME     STATUS     ROLES    AGE     VERSION
node-0   NotReady   <none>   7m31s   v1.32.3
node-1   NotReady   <none>   9m27s   v1.32.3
Not good, and the painful sluggishness of my VMs is making this hard to debug. I pause the jumpbox and node-1 VMs and start poking around node-0.
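With more patience I might have started with kubectl describe node, whose Conditions section gives a reason for Ready being False:

root@server:~# kubectl describe node node-0 --kubeconfig admin.kubeconfig

But the logs on the node itself seem more promising.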
journalctl shows me (lightly edited):
E0522 20:04:54.979171 3938 kubelet.go:3002] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
W0522 20:04:57.927081 3938 transport.go:356] Unable to cancel request for *otelhttp.Transport
E0522 20:04:57.927232 3938 controller.go:195] "Failed to update lease" err="Put \"https://server.kubernetes.local:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/node-0?timeout=10s\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
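That first error points straight at CNI. The kubelet loads its pod network config from /etc/cni/net.d, so one sanity check is whether the tutorial's config files actually landed there; a sketch (10-bridge.conf is the tutorial's file name and may differ):

root@node-0:~# ls /etc/cni/net.d/
root@node-0:~# cat /etc/cni/net.d/10-bridge.conf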
Is there a bridge network adapter?
root@node-0:~# ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever
2: ens18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether bc:24:11:af:19:5a brd ff:ff:ff:ff:ff:ff
    altname enp0s18
    inet 10.1.1.226/24 brd 10.1.1.255 scope global dynamic ens18
       valid_lft 151706sec preferred_lft 151706sec
    inet6 fe80::be24:11ff:feaf:195a/64 scope link
       valid_lft forever preferred_lft forever
Doesn't look like it, but I haven't worked much with bridges under Linux; maybe ip doesn't show them?
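It turns out ip does show bridges alongside everything else, but there are also ways to ask for them explicitly:

root@node-0:~# ip link show type bridge
root@node-0:~# bridge link show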
Did the bridge driver not get properly configured? It appears to have been loaded:
root@node-0:~# lsmod|grep bridge
bridge                311296  1 br_netfilter
stp                    16384  1 bridge
llc                    16384  2 bridge,stp
I search the web for the error messages from journald. I discover a lot of cool stuff about CNI plugins and pod networking, but very little relevant to my problem.
It's almost Memorial Day weekend in the US, so I'm ready to take a break. My plan for the weekend is:
- See if I can find any compatible memory priced cheaply. It's painful to work on the machine right now, and there's a very slight chance that my problems are due to memory exhaustion and timeouts; a quick check is sketched after this list.
- Verify that the Dell's Xeon processor supports the "virtualization in a VM" VT-x extensions required by containerd. This is the worst-case scenario: that my machine is simply too old to run this. (The sketch below covers this, too.)
- Learn more about CNI, and how to debug it.
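The first two items are quick to check from inside a worker VM; a minimal sketch:

root@node-0:~# free -h                           # how much of the 2GB is actually free?
root@node-0:~# vmstat 5 5                        # a consistently high wa column means the VM is I/O-starved
root@node-0:~# grep -cE 'vmx|svm' /proc/cpuinfo  # nonzero means VT-x/AMD-V flags are exposed to the guest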
First, though, I'm going to kick back for the weekend.