Tuesday, June 24, 2025

Sometimes TV gets it right

Of course it's anime

I learned to suspend my critical thinking skills watching movie or TV scenes featuring a computer back in the IBM PC era. At the time, any shot of someone working on a computer was invariably paired with CLACKY TYPEWRITER NOISES, since how else would the audience know they were doing computer things?

Crack into that super-secret defense system by randomly guessing passwords, because there's no lockout or MFA in TV Land? I can ignore that. IPv4 address octets greater than 255? No problem (though I do want a scene where a developer from the 1980s, thawed out of suspended animation, screams in horror at the sight of an IPv6 address).

At the same time, when a TV show gets it right, it deserves praise, given the huge influence media has on public perception of technology.

This is from Kowloon Generic Romance, episode 12:

Laptop displaying Python code

Not only is this valid Python code, but whoever wrote it made sure it was harmless- because they knew someone would run it as-is.


Monday, June 23, 2025

Homelab Adventures, part 6: Playing with Proxmox VM management

Herding the cats VMs

Before doing any more Kubernetes work I need to get my VMs under control. I need a way to group related VMs together, and ideally manage them as a group. I also want to be able to quickly spin up new VM instances.

The first one is easy. I use the pulldown menu in the top left of the Proxmox console to switch to Tag View, and then assign my Kubernetes The Hard Way VMs a tag. I'll label them kthw:

Assigning a tag to a VM in the Proxmox console

To tag a VM, select it, click the pencil icon, and pick the tag you want or type a new one. As a bonus, Proxmox automatically color-codes tags and groups tagged VMs together.
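
The same tagging can be done from the host's shell with the qm tool; a minimal sketch, using a placeholder VMID:

# Tag a VM from the host shell (201 is a placeholder VMID)
qm set 201 --tags kthw

# Verify the tag landed in the VM's config
qm config 201 | grep ^tags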

I can work on these VMs as a group in bulk operations by specifying the tag. For instance, I can start all the Kubernetes The Hard Way VMs in one operation by choosing kthw in the "Include Tags" box of the Bulk Start dialog:

Proxmox bulk start dialog, with all VMs tagged kthw selected

This gives me what I need to manage my VMs.
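
The bulk start can be scripted from the host shell, too; a rough sketch that starts every VM carrying the kthw tag:

# Start every VM whose config carries the kthw tag
for vmid in $(qm list | awk 'NR>1 {print $1}'); do
    if qm config "$vmid" | grep -q '^tags:.*kthw'; then
        qm start "$vmid"
    fi
done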

Send in the clones

Proxmox will let me convert any VM into a template, which I can then use as a base for additional VMs.

First I create a new VM, using the same Debian 12 install iso as before. I make two changes in the install process: I manually partition the virtual disk to eliminate the swap partition, so I don't have to manually disable swap post-install, and I skip installing the "Debian Desktop" software.

Once the install completes I log in to the VM, do the usual steps of enabling root ssh logins, adding my local user to the sudo group, and installing some essential software:

# Truncate the too-verbose motd
echo 'Your motd here' > /etc/motd

# Essential software
apt install curl git zip unzip

# Bring packages up to date and clean up
apt update
apt upgrade
apt autoremove

Let's see how big the VM is:

root@debian12:~# df -k /
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sda1       32845584 1807868  29343716   6% /

Plenty of room for additional software.

Now stop the VM, right-click on the VM name, select "Convert to template", and behold:

Proxmox VM window, showing a template VM

I've given my template a name reflecting its creation date, since the VM can no longer be updated after being converted to a template.
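
Spinning up a fresh VM from the template should now be a couple of commands; a sketch with placeholder VMIDs and name:

# Full clone of the template (9000) into a new VM (201); IDs and name are placeholders
qm clone 9000 201 --name node-2 --full
qm start 201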

I've got my VM tools ready. Time to attack Kubernetes again.

Wednesday, June 04, 2025

Homelab Adventures, part 5: Rocket Surgery

 More memory, same problems

Over the Memorial Day weekend, I find an eBay store selling compatible ECC DRAMs for a surprisingly low price. I grab two 8gb sticks- the Dell requires memory to be installed in pairs- for the price of a couple of pizzas.

I also track down a pdf of the Dell machine's manual and copy it to a directory on my home file server. I've been doing this with manuals for a few years, especially for those odd little devices whose documentation consists of one big foldout page that I usually wind up losing. I'm running nginx to serve these up, so I can view them from a browser any time I'm on my home network.

Once the memory arrives, I power down the Dell and pop open the case. 

The connectors on the slots for each memory lane are color-coded:

Interior of Dell server, showing the motherboard and a large airflow guide

The grey thing on the lower left is an airflow guide forcing cool air over that ginormous heatsink. It's got a clever quick release mechanism (not shown), and once I pop it off I have easy access to the memory slots:

Dell server mainboard, showing memory slots

Firmly press in the new memory sticks, close everything up, and behold!

BIOS boot screen showing 18gb of memory

It takes several minutes for Proxmox to spin up all four VMs. I wait a bit longer for the Kubernetes services to chat with each other, and log in to node-0. I get a login prompt in a couple of seconds, and, overall, the system is pretty responsive- much better than before.

I still have the same error, though:

networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"

I decide to do some housekeeping before digging into this further.

First, I apply the latest firmware update, from 2018. From the release notes, it primarily addresses some of the speculative-execution bugs discovered in Intel processors over the years. While this won't fix my Kubernetes problem, it might give me a slight performance boost, since the Linux kernel sometimes enables software mitigations for unpatched firmware bugs.

And it's just good computing hygiene.
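
One way to see what the kernel thinks of the CPU, before and after the update, is to read the sysfs vulnerability reports:

# Per-vulnerability mitigation status as reported by the kernel
grep . /sys/devices/system/cpu/vulnerabilities/*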

Dell provides a convenient Linux executable to apply this update from the command line:

root@vega:~# ./T110_BIOS_C4W9T_LN_1.12.0.BIN 
Collecting inventory...
.
Running validation...

Server BIOS 11G

The version of this Update Package is newer than the currently installed version.
Software application name: BIOS
Package version: 1.12.0
Installed version: 1.3.4

Continue? Y/N:y
Executing update...
WARNING: DO NOT STOP THIS PROCESS OR INSTALL OTHER PRODUCTS WHILE UPDATE IS IN PROGRESS.
THESE ACTIONS MAY CAUSE YOUR SYSTEM TO BECOME UNSTABLE!
.......................................................................................

I then do some tinkering with the CPU settings of the VMs. While I was researching the virtualization capabilities of the Dell's Xeon CPU I discovered that Xeon "Lynnfield" processors are part of the "Nehalem" processor family, which is one of the CPU options in Proxmox' VM configuration.

I stop node-0, change its hardware processor setting from "Conroe" to "Nehalem", and restart it.

It works! lscpu identifies the processor as Intel Core i7 rather than Celeron, and the CPU flags show SSE 4 as being supported.
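
The check itself is quick, for anyone playing along at home:

# Confirm the guest now sees the newer instruction sets
lscpu | grep -i 'model name'
grep -o 'sse4_[12]' /proc/cpuinfo | sort -u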

Next, I'm going to explore Proxmox further. I want to create a VM template so I can quickly spin up clean VMs, and see if I have better success with minikube or k3s than I have with the full Kubernetes distro. Hopefully, having a working Kubernetes install to compare against my current one will give me enough clues to identify the problem. 

At worst, I'll have a working Kubernetes- just not the one I expected.

Saturday, May 31, 2025

Homelab adventures, part 4: Worker nodes of the world, unite! Or maybe not.

Antisocial worker nodes, and maybe 2gb of memory is not enough

I'm almost done- once I spin up my two worker nodes, I'll have my own private Kubernetes cluster. I breeze through the first few steps in chapter 9, until I get to the part about disabling swap. 

This should be easy: identify the swap partition or file, comment it out of /etc/fstab, and reboot. But swap is no longer defined there:

root@node-0:~# swapon --show
NAME      TYPE      SIZE USED PRIO
/dev/sda5 partition 975M 524K   -2

root@node-0:~# grep sda5 /etc/fstab
# swap was on /dev/sda5 during installation

I'm all in favor of systemd: over the years I've been bitten too many times by the shortcomings of the classic SysV init/rc system. Even so, there are times when its all-encompassing nature still irks me.

Turning off swap the systemd way:

root@node-0:~# systemctl --type swap
  UNIT                                                                      LOAD   ACTIVE SUB
  dev-disk-by\x2duuid-8ad8b7b3\x2d87ae\x2d423a\x2daa1c\x2d67b82c1a8812.swap loaded active active

Are those backslashes escapes or literal backslashes? One way to find out:

root@node-0:~# systemctl mask dev-disk-by\x2duuid-8ad8b7b3\x2d87ae\x2d423a\x2daa1c\x2d67b82c1a8812.swap
Unit dev-disk-byx2duuid-8ad8b7b3x2d87aex2d423ax2daa1cx2d67b82c1a8812.swap does not exist, proceeding anyway.
Created symlink 
  /etc/systemd/system/dev-disk-byx2duuid-8ad8b7b3x2d87aex2d423ax2daa1cx2d67b82c1a8812.swap → /dev/null.

root@node-0:~# systemctl mask 'dev-disk-by\x2duuid-8ad8b7b3\x2d87ae\x2d423a\x2daa1c\x2d67b82c1a8812.swap'
Created symlink 
  /etc/systemd/system/dev-disk-by\x2duuid-8ad8b7b3\x2d87ae\x2d423a\x2daa1c\x2d67b82c1a8812.swap → /dev/null.
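
In hindsight, systemd-escape could have answered the backslash question without the trial and error; a sketch, reusing the UUID from the listing above:

# Derive the escaped unit name for the swap device path;
# the backslashes it prints are literal characters in the name
systemd-escape --path --suffix=swap /dev/disk/by-uuid/8ad8b7b3-87ae-423a-aa1c-67b82c1a8812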

Then turn off any swap still in use:

root@node-0:~# swapoff -a
root@node-0:~# systemctl --type swap
  UNIT LOAD ACTIVE SUB DESCRIPTION
0 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

Reboot, and check our work:

root@node-0:~# systemctl --type swap --all
  UNIT                                                                      LOAD   ACTIVE   SUB
  dev-disk-by\x2duuid-56340219\x2d2c91\x2d4837\x2db4d2\x2d82e9823cbea9.swap loaded inactive dead

I finish up the rest of the node-0's configuration, start containerd and the Kubernetes services, and verify everything's running:

root@node-0:~# systemctl|grep kube
  kube-proxy.service                 loaded active running   Kubernetes Kube Proxy
  kubelet.service                    loaded active running   Kubernetes Kubelet
  kubepods-besteffort.slice          loaded active active    libcontainer container kubepods-besteffort.slice
  kubepods-burstable.slice           loaded active active    libcontainer container kubepods-burstable.slice
  kubepods.slice                     loaded active active    libcontainer container kubepods.slice

I repeat these steps on node-1, then do the final verification from jumpbox:

root@jumpbox:~/kubernetes-the-hard-way# ssh root@server \
  "kubectl get nodes \
  --kubeconfig admin.kubeconfig"

Or try to. It takes a very long time just to get a shell prompt, and then the ssh command hangs.  Ok, I'll do this from server.

I'm finally able to get the worker node status, though it takes a minute or so for the command to complete:

root@server:~# kubectl get nodes --kubeconfig admin.kubeconfig
NAME     STATUS     ROLES    AGE     VERSION
node-0   NotReady   <none>   7m31s   v1.32.3
node-1   NotReady   <none>   9m27s   v1.32.3

Not good, and the painful sluggishness of my VMs is making this hard to debug. I pause the jumpbox and node-1 VMs and start poking around on node-0.

journalctl shows me (lightly edited):

E0522 kubelet.go:3002] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady 
    message:Network plugin returns error: cni plugin not initialized"
E0522 20:04:54.979171    3938 kubelet.go:3002] "Container runtime network not ready" networkReady="NetworkReady=false 
    reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
W0522 20:04:57.927081    3938 transport.go:356] Unable to cancel request for *otelhttp.Transport
E0522 20:04:57.927232    3938 controller.go:195] "Failed to update lease" 
    err="Put \"https://server.kubernetes.local:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/node-0?timeout=10s\": 
    net/http: request canceled (Client.Timeout exceeded while awaiting headers)"

Is there a bridge network adapter?

root@node-0:~# ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever
2: ens18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether bc:24:11:af:19:5a brd ff:ff:ff:ff:ff:ff
    altname enp0s18
    inet 10.1.1.226/24 brd 10.1.1.255 scope global dynamic ens18
       valid_lft 151706sec preferred_lft 151706sec
    inet6 fe80::be24:11ff:feaf:195a/64 scope link
       valid_lft forever preferred_lft forever

Doesn't look like it, but I haven't worked much with bridges under Linux- maybe ip doesn't show them?
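
For what it's worth, bridges do show up as ordinary interfaces in ip address; these commands list them explicitly:

# List only bridge devices, with driver details
ip -d link show type bridge

# Show which interfaces are attached to a bridge
bridge link show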

Did the bridge driver not get properly configured? It appears to have been loaded:

root@node-0:~# lsmod|grep bridge
bridge                311296  1 br_netfilter
stp                    16384  1 bridge
llc                    16384  2 bridge,stp
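
One thing worth checking- Kubernetes The Hard Way puts its CNI pieces in the standard locations, so the node should have plugin binaries and a network config on disk:

# Standard CNI locations used by containerd and kubelet
ls -l /opt/cni/bin/
ls -l /etc/cni/net.d/
cat /etc/cni/net.d/*.conf* 2>/dev/null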

I search on the error messages from journald. I discover a lot of cool stuff about CNI plugins and pod networking, but very little relevant to my problem.

It's almost Memorial Day weekend in the US, so I'm ready to take a break. My plan for the weekend is:

  • See if I can find any compatible memory priced cheaply. It's painful to work on the machine right now, and there's a very slight chance that my problems are due to memory exhaustion and timeouts.
  • Verify that the Dell's Xeon processor supports the "virtualization in a VM" VT-x extensions required by containerd. This is the worst-case scenario- that my machine is just too old to run this.
  • Learn more about CNI, and how to debug it.

If all else fails I'll set aside these VMs and try minikube or k3s instead. This will (hopefully) give me a working Kubernetes installation that I can compare against the state of these machines.

I'm going to kick back for the weekend, first.

Thursday, May 29, 2025

Did you know curl has a theme song?

It's taken me longer than I expected to write up the next entry in Homelab Adventures, due to the amount of material I want to cover (spinning up my Kubernetes worker nodes did not go well, to put it mildly). 

In the interim, please enjoy the curl theme song:



Friday, May 23, 2025

Homelab Adventures, part 3: Provisioning the Kubernetes control plane

 "RTFM" also means "Follow the Fine Manual"

Now that I have my VMs, it's time to provision Kubernetes. I'll work through chapters 2-8 of Kubernetes The Hard Way, setting up etcd and the Kubernetes control plane.

I enable root ssh login and distribute ssh keys to the other VMs. While I'm at it I enable sudo for the non-root user I created at install time. Having to always su to root gives me that uncomfortable, driving-without-a-seatbelt feeling.

I know some distros are moving away from sudo, so I check the convention for Debian and find I need to add this user to the sudo group, rather than the traditional wheel.
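
The change itself is a one-liner (the username is a placeholder):

# Add my login user to the sudo group; takes effect at the next login
usermod -aG sudo myuser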

Next it's time to set up the hosts file on the VMs- but do I really need to do this?

I had set the VMs' hostnames at install time, and dnsmasq, the DNS/DHCP server on my home network, lets clients auto-register their hostnames and IP addresses, so I can resolve these names even when they're not in /etc/hosts.

Looks like I can skip this step.

Narrator's voice: he could not, in fact, skip this step.

I breeze through the setup instructions in chapters 4-6, then stumble briefly in chapter 7, when I'm told to:

Set the etcd name to match the hostname of the current compute instance

I decide to ignore this, and start etcd, which comes up without any errors.

root@server:~# etcdctl member list
6702b0a34e2cfd39, started, controller, http://127.0.0.1:2380, http://127.0.0.1:2379, false

Now on to the control plane. Chapter 8 is lengthy, but quick- it's almost all mv commands. I start the Kubernetes services, and check their status:

root@server:~# systemctl|grep kube
  kube-apiserver.service             loaded active     running      Kubernetes API Server
  kube-controller-manager.service    loaded active     running      Kubernetes Controller Manager
  kube-scheduler.service             loaded activating auto-restart Kubernetes Scheduler

That kube-scheduler status looks odd. journalctl --since "5 minutes ago" tells me:

err="stat /var/lib/kubernetes/kube-scheduler.kubeconfig: no such file or dir

Looks like I missed a step. I move the missing .kubeconfig file into /var/lib/kubernetes and bounce the Kubernetes services. Things look much better now:

root@server:~# systemctl|grep kube
  kube-apiserver.service            loaded active running   Kubernetes API Server
  kube-controller-manager.service   loaded active running   Kubernetes Controller Manager
  kube-scheduler.service            loaded active running   Kubernetes Scheduler

I make a final check with journalctl:

...Get "https://server.larock.nu:6443/api/v1/persistentvolumes?limit=500&resourceVersion=0":
tls: failed to verify certificate: x509: certificate is valid for kubernetes, kubernetes.default,
kubernetes.default.svc, kubernetes.default.svc.cluster,kubernetes.svc.cluster.local,
server.kubernetes.local, api-server.kubernetes.local, not server.larock.nu

That's not good.

The certificate's SAN list doesn't include my domain, and there might also be an issue with my domain not ending in .local (there are nuances to how the certificate chain gets validated that I'm no expert on).
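
A quick way to see exactly which names a certificate covers, assuming the tutorial's file name of kube-api-server.crt:

# Dump the certificate's Subject Alternative Names
openssl x509 -in kube-api-server.crt -noout -text | grep -A1 'Subject Alternative Name'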

I shut down the Kubernetes services and etcd, and clear any data associated with them:

rm -r /etc/kubernetes/config /var/lib/kubernetes
rm -r /etc/etcd /var/lib/etcd/*

Then it's back to chapter 3: I do the hosts file work I had skipped before, and repeat the configuration steps.
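
The entries themselves are short; roughly this on each VM, with placeholder addresses standing in for my actual DHCP leases:

# /etc/hosts additions - the IP addresses here are placeholders
10.1.1.10 server.kubernetes.local server
10.1.1.11 node-0.kubernetes.local node-0
10.1.1.12 node-1.kubernetes.local node-1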

I start everything up on server, don't see any errors in journalctl, and probe Kubernetes:

root@server:~# kubectl cluster-info --kubeconfig admin.kubeconfig
Kubernetes control plane is running at https://127.0.0.1:6443

Then a test from the jumpbox:

root@jumpbox:~/kubernetes-the-hard-way# curl --cacert ca.crt \
  https://server.kubernetes.local:6443/version
{
  "major": "1",
  "minor": "32",
  "gitVersion": "v1.32.3",
  "gitCommit": "32cc146f75aad04beaaa245a7157eb35063a9f99",
  "gitTreeState": "clean",
  "buildDate": "2025-03-11T19:52:21Z",
  "goVersion": "go1.23.6",
  "compiler": "gc",
  "platform": "linux/amd64"
}

I have a working control plane.

Next: provisioning the workers, and final steps.

Monday, May 19, 2025

Homelab Adventures, part 2: Time to make the donuts VMs

x86-64 is the concept of a plan of an architecture

Now that I have my KVM box running, it's time to make some VMs. I'll spin up four Debian instances, based on the configurations in Kubernetes The Hard Way.

I grab the Debian 12 install iso, upload it through the Proxmox console, then configure the first VM: jumpbox.

I've used Debian before, but never installed it from scratch; usually I work with Red Hat family distros. It's a straightforward process, and the install seems simpler than a Rocky or Fedora install (though since my first Linux distro was Slackware ftp'ed onto 3.5" floppies, I may have a low bar for "simpler"). I stick with the default options, except for adding an OpenSSH server to the list of software to be installed. The installer completes, prompts me to allow it to reboot, and the new VM starts up- then dies with an error about unsupported machine instructions.

I had selected kvm64 as the CPU type in the VM configuration dialog; I knew the Dell's Xeon processor long preceded x86-64-v2 and the like, and I had found some comments online about incompatibilities between Debian 12 and the most recent x86-64 processors. But was this a mistake in my choice of CPU type, or did I mangle the install?

Setting the CPU to qemu64 gets the VM to work, but it's obviously going to be too slow for anything useful. This confirms that my install is good, at least. Now I need to figure out which of the Intel processor codenames in the configuration menu best matches the Xeon processor in the Dell box.

CPU selection menu listing processor codenames
The Proxmox CPU selection menu in the VM configuration dialog

I get a shell on the host from the Proxmox console (I don't know whether to be awed or horrified at the idea of Xterm implemented in JavaScript), and dmesg provides a few additional scraps of data about the processor. I'm able to use this to find a range of dates for when the processor was probably manufactured, but not the actual name.
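
In retrospect, /proc/cpuinfo on the host would have gotten me a little further, though it reports a model number rather than a codename:

# The marketing name and stepping, as reported by the host kernel
grep -m1 'model name' /proc/cpuinfo
lscpu | grep -E 'Model name|Stepping'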

Using the information in Wikipedia's List of Intel codenames, I'm able to come up with a set of candidate names. Luckily, the unsupported instruction error happens early in the boot sequence, so it doesn't take me long to iterate over them, and discover the correct selection for my machine is "Conroe". 

Soon the standard GDM login screen displays in the VM's console window. Success! I click on the user name, and wait for the password dialog. 

And wait. And wait. Maybe a GUI environment isn't a viable choice for a machine this strapped for resources. 

I ssh into the VM and uninstall Gnome, Evolution, and anything else that's obviously a GUI application (I leave the X client libraries), then shut down the VM.
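
Roughly what that looked like; the package names here are my guess at a stock Debian 12 desktop selection:

# Remove the desktop stack but leave the X client libraries alone
apt purge task-gnome-desktop gnome-core evolution
apt autoremove --purge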

From the Options panel I remove the VM's mount of the .iso, and enable QEMU Guest Agent. I'm not sure if this is installed, but if it is, this will give the Proxmox console some additional status information from the running VM.

I restart the VM, and in a minute I have a login prompt in the console window, and the Summary panel now shows the VM's IP address.

I then configure VMs for the server and node-0 machines. I bump node-0's CPU count to 2; from the performance stats in the console the Dell is memory-constrained, but still has a lot of compute available.

I choose the non-graphical install, and untick "Gnome" from the list of software to be installed (I still add the SSH server). When prompted to reboot, I halt the VM instead, and repeat the changes in the Options panel I made to jumpbox. I start each VM, and check that I can log in through the Proxmox console.

I could probably create node-1 by cloning node-0's VM, and I definitely want to explore Proxmox' VM management features in the future, but for now I just do another install, with the same parameters as node-0.

I now have four running VMs. Time to provision Kubernetes. 


Sunday, May 11, 2025

Homelab Adventures, part 1: Installing Proxmox

The saga of an ancient Dell server, and goodbye OpenSolaris

Having some free time on my hands as a result of being DOGE'd- my employer was heavily reliant on now-frozen NIH and NSF contracts- I decided to take the opportunity to learn more about Kubernetes. I had dabbled enough in it to see its potential for the kind of work I did, but was never able to explore it in depth.

Also as a result of being DOGE'd, I needed to spend as little money as possible on this. Luckily, I had an unused Dell server lying around (I'm a computer person, so of course I have a collection of surplus computers, boards, and accessories in the basement. And someday I'll need that Centronics printer cable). It was over a decade old, and only had 2 gb of memory, but its processor supported VT-x, so it should work as a KVM box, albeit a slow one. 
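
Checking for VT-x from a running Linux system is a one-liner, for the record- a nonzero count means the processor advertises the vmx flag:

# Count the CPU threads advertising Intel VT-x
grep -c vmx /proc/cpuinfo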

My plan was to set up a KVM hypervisor, then spin up a Kubernetes cluster as described in Kubernetes The Hard Way. After some research I decided to use Proxmox for the hypervisor. It was free for home use, well regarded, and looked relatively simple to use. In a couple of days, I had this:

Proxmox management console for my homelab


But not without a few mistakes along the way.

Resurrecting the dead (server)

I gave the Dell box a good cleaning before powering it up, blasting away the dust bunnies in its interior, including clearing a felt-like coat from the largest CPU heatsink I've ever seen. While people have strong opinions about vendors (to put it mildly), I've always been impressed by Dell's servers and business-grade laptops.

Now, time to give OpenSolaris an honorable farewell.  

I worked with SunOS and Solaris systems from the Sun-3 days through the dot-com era of SPARC servers ruling the internet. I ran OpenSolaris on this machine (and am now running OmniOS on its replacement) because the CDDL-licensed version of ZFS is both bulletproof and drop-dead simple to manage. And because I hate Larry Ellison with the fire of a thousand suns.

Time to power things up and see if everything still works:

♫ Won't you give me one last boot... ♫ [apologies to Hikaru Utada]

After a quick check that there was nothing on the machine I still needed, I rebooted into the BIOS, enabled VT-x, and set USB as the first bootable device.

Then it was time to install Proxmox. I downloaded the iso, burned a bootable USB drive with Rufus, and booted the Dell back up. After that I simply followed the installation instructions in the Proxmox administration document. I kept the old hostname and static IP address, so I wouldn't have to tinker with my home DNS configuration, completed the install, and rebooted.

To a blank screen.

The disk the installer had found and configured as the boot disk wasn't the first disk in the boot order. Three tries later (on a four-disk machine) I had the order right, and a Proxmox console showing in my browser.

The install had also only used one of the four disks. I used the Wipe Disk and Initialize Disk with GPT controls on the console to clear the other three disks, then allocated the disks as follows:

I formatted one disk as a single partition and added it to the LVM volume group, growing the logical volume, which had seemed severely undersized. While I haven't used Linux's physical/logical volume tools for a long time, doing this was pretty straightforward, though tedious enough to remind me why I prefer ZFS' "throw the raw disk into a pool" simplicity.
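
For my own future reference, the steps were roughly these- the device and volume names are placeholders (on a stock Proxmox install the volume group is typically pve):

# Turn the new partition into an LVM physical volume and grow the volume group
pvcreate /dev/sdb1
vgextend pve /dev/sdb1

# Then grow the logical volume into the new space
lvextend -l +100%FREE pve/data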

I created a ZFS pool from the second disk. I'll use this for virtual disks for my VMs.

I carved an 8 gb swap partition out of the last disk, and reconfigured the system to use it for swap instead of the swapfile in the LVM volume. I don't expect a dedicated swap drive to turn a system with only 2 gb of memory into a speed demon, but it should give a slight performance boost, and even if I wasn't trying to be frugal I'd be reluctant to put in the effort required to track down compatible vintage ECC DRAMs for this machine. If I can scrounge up a cheap SATA SSD I'll use that for swap, instead.
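
The swap switch itself was short; a sketch with a placeholder device name:

# Create and activate swap on the dedicated partition
mkswap /dev/sdd1
swapoff -a
swapon /dev/sdd1
# ...then update /etc/fstab so the new partition is used at boot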

Now it's time to install my VMs. And discover just how ancient that Dell's processor is.