LXC multiple personality disorder

I have a couple of server systems running as Linux Containers (LCX) as a test since Debian 6.0 (squeeze). The host system has been upgraded to Debian 7.0 (wheezy) lxc version: 0.8.0-rc1 and things generally work fine but I rebooted the other day and the containers fell to pieces.

After manually stopping the containers (or so I thought) and starting them up again one of the containers was fine. The other, not so much. Connecting to it remotely with the PuTTY ssh client would fail either immediately with "Network error: Connection reset by peer" or it would work for some seemingly random amount of time, a second to minutes, before another error would appear "Network error: Software caused connection abort"

Scouring the web I found lots of suggestions saying it was missing config files or keys in the instance's /etc/ssh/ directory, but I knew this was not the case. The files were there and the connection worked, sometimes. Plus I did some tests running netcat (nc) as a client and a server and those connections also failed either after a while or sometimes right away. Sometimes when connecting to the server instance I had just started I would be told the connection was refused.

I started to believe that I had another server running in my network that claimed the same IP address and server name on login. This belief moved to some kind of server multiple personality disorder when I saw that my tmux session sometimes existed and sometimes didn't on login even though the file I had created in the "tmux exists" connection was there in the "no tmux" connection.

I popped onto the lxc irc channel on freenode for some advice. A fellow user, wam, ran me through some tests. I wasn't running out of memory. My configuration was very similar to the working container. No firewalls were blocking stuff on this internal private network. He suggested that I use tshark to track down the possible RSET, so I went (t)shark fishing:

  4.999284 3com_c0:25:71 -> 46:4d:07:7e:87:9c ARP 60 Who has  Tell
  4.999327 46:4d:07:7e:87:9c -> 3com_c0:25:71 ARP 42 is at 46:4d:07:7e:87:9c
  5.007975 fa:11:43:eb:f6:eb -> 3com_c0:25:71 ARP 42 Who has  Tell (duplicate use of detected!)
  5.008086 3com_c0:25:71 -> fa:11:43:eb:f6:eb ARP 60 is at 00:50:da:c0:25:71 (duplicate use of detected!)

Duplicate use of detected with different mac addresses? I thought I had just ruled out multiple servers.

DeHackEd on the LXC IRC channel, #lxcontaines, suggested checking brctl showmacs which I filtered further using other information he shared:

brctl showmacs br0 | grep -v '  1'
port no mac addr                is local?       ageing timer
  4     46:4d:07:7e:87:9c       no                70.08
  2     fa:11:43:eb:f6:eb       no                22.85
  2     fe:68:85:a8:dc:6e       yes                0.00
  3     fe:b0:fc:d7:2e:9f       yes                0.00

  4     fe:dd:ac:a6:ac:f4       yes                0.00

Both of the systems claiming are running on the LXC host. Odd. Using lxc-ls and lxc-list shows only the working container and the broken one. Not three. Another person, devonblzx suggested that I just specify the hwaddr in the lxc config file. I, in fact, had done this once upon a time and I had long since commented it out. I don't remember why. Maybe it wasn't the unique lxc.network.hwaddr but the non-unique lxc.network.name that was tripping me up. After what DeHackEd and devonblzx pointed out in /sys/class/net/$bridgename/brif/ and /sys/devices/virtual/net/$bridgename/ I bet it was the .name value. I should try it again. The question that nagged me was, how had I launched duplicate instances and would setting hwaddr protect me? DeHackEd said I'm not suppose to be able to launch multiple instances, at least not by the same user, due to control channels that use the names that would conflict. I thought it was maybe a bug in lxc 0.8.0 so I shared my ps output that showed there were indeed three instances with two pointing to the same config file:

root 9287 0.0 0.0 20920 664 ? Ss Feb13 0:00 lxc-start -n lxc -f /etc/lxc/auto/brokencontainer.conf -d
root 18816 0.0 0.0 20920 668 ? Ss Feb13 0:00 lxc-start -d -n brokencontainer

DeHackEd promptly said "no name conflict..." and it took me a minute to spot it. One had been started with the name lxc. I asked why lxc-ls or lxc-list don't show it but no one volunteered that answer so I dove into the start-up process to figure out why it started with -n lxc.

My configuration was from Debian 6.0 worked like this. The /etc/default/lxc file specified both that I wanted it to run from init and listed the CONTAINERS I wanted to start. The init script didn't care anymore about the CONTAINERS variable, instead it looked at /etc/lxc/auto/* and tried deriving the names from them. My /etc/lxc directory looks like this:

/etc/lxc/auto/workingcontainer.conf -> /etc/lxc/workingcontainer.conf
/etc/lxc/auto/brokencontainer.conf -> /etc/lxc/brokencontainer.conf

This is not quite how the README.Debian file suggests things to be:

LXC container can be automatically started on boot. In order to enable this, the LXC init script has to be enabled in /etc/default/lxc and and container that should be automatically started needs its configuration file symlinked (or copied) into the /etc/lxc/auto directory.
Note that the name in /etc/lxc/auto needs to be the container name, e.g.:
  /etc/lxc/auto/www.example.org -> /var/lib/lxc/www.example.org/config
I joined the #debian channel on  the OFTC IRC network to get some advice and figure out if my init.d/lxc script or something was messed up and peter1138 helped straighten me out. He said he had a similar setup when he first upgraded from squeeze but when he created a new container in wheezy he saw the config files were in /var/lib/lxc/containername/config and that the /etc/lxc/auto/containername pointed there.

This made it so that the init script works by extracting containername from the folder holding config. There is nothing in that process that cares what the file in /etc/lxc/auto/* is named, it just better be a symbolic link to a file in a directory who's name is the container name you want. I complained about config files in /var, broken upgrades, and a seemingly misleading emphasis on the auto/ name and how autostart works and was given the bug! challenge.

I think it would be even better if it just read the lxc.utsname from the file as peter1138 suggested, then it could be a symlink or copy to any file without needing some specific directory layout. I said it didn't seem that the name in auto needed to be the container name at all and peter1138 agreed that for autostart to auto start this was true, but if the name was the container name then lxc-list would tag the container in the listing as autostart.

I hope this is helpful to someone else facing similar sounding issues even if that someone else turns out to be a future me.
Post a Comment