Byzantine Reality

Searching for Byzantine failures in the world around us

Xen on the Xeon

Over the last three weeks I’ve had nothing but trouble trying to get a little cluster set up in our lab, and even more trouble trying to get Xen to work. The biggest frustration is that these technologies are not bleeding-edge, untested, beta-quality pieces of garbage: they’re technologies that are heavily invested in! Thankfully, we finally resolved some of these problems, so for those of you with Xeon CPUs with HyperThreading, here’s how we did it:

Step 1: Go into the BIOS and turn off HyperThreading.

Step 2: There is no Step 2.

That’s it. That’s all it took to get rid of the evil kernel panic plaguing me after I installed Xen on our Xeon box. Specifically, after installing Xen, trying to create a virtual machine causes this to happen:

[ 300.375060]
[ 300.375176] Pid: 14620, comm: gzip Not tainted (2.6.24-19-xen #1)
[ 300.375298] EIP: 0061:[] EFLAGS: 00010a13 CPU: 1
[ 300.375420] EIP is at 0xc1bb5429
[ 300.375539] EAX: c1bb9a60 EBX: c1bb3460 ECX: 00000000 EDX: 00000000
[ 300.375660] ESI: 00000001 EDI: 40040000 EBP: 00000000 ESP: e9eadd00
[ 300.375783] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
[ 300.375903] Process gzip (pid: 14620, ti=e9eac000 task=ec529830 task.ti=e9eac000)
[ 300.376026] Stack: c01623a5 00000000 00000000 e9eadd44 00000001 00000001 e9eadd3c c0162456
[ 300.377094]    c1bb9a60 c1522174 00000001 00000001 c0165997 00000001 c0176681 00000001
[ 300.378146]    00000000 c1bb9a60 15852ff8 f578e000 00000000 c0000000 c0169d61 c1bb9a60
[ 300.379199] Call Trace:
[ 300.379427] [] free_hot_cold_page+0x195/0x220
[ 300.379661] [] __pagevec_free+0x26/0x30
[ 300.379890] [] release_pages+0x137/0x160
[ 300.380116] [] move_page_tables+0x611/0x800
[ 300.380346] [] dec_zone_page_state+0x21/0x70
[ 300.380574] [] free_pgd_range+0x26c/0x370
[ 300.380804] [] free_pages_and_swap_cache+0x74/0xa0
[ 300.381043] [] setup_arg_pages+0x284/0x290
[ 300.381275] [] load_elf_binary+0x3d9/0x1c90
[ 300.381508] [] file_read_actor+0x0/0x100
[ 300.381738] [] current_fs_time+0x13/0x20
[ 300.381971] [] follow_page+0x20d/0x410
[ 300.382205] [] get_user_pages+0x163/0x540
[ 300.382437] [] get_arg_page+0x4b/0xb0
[ 300.382667] [] load_elf_binary+0x0/0x1c90
[ 300.382893] [] search_binary_handler+0x9a/0x1e0
[ 300.383123] [] do_execve+0x1a6/0x1d0
[ 300.383350] [] sys_execve+0x2f/0x80
[ 300.383577] [] syscall_call+0x7/0xb
[ 300.383805] [] vcc_getsockopt+0x110/0x170
[ 300.384036] =======================
[ 300.384154] Code: 20 00 00 00 00 40 01 00 00 00 ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 01 10 00 00 02 20 00 00 20 00 40 01 00 00 00 ff ff ff 14 70 bb c1 00 00 00 00 20 cb bb c1 00 01 10 00 00 02
[ 300.390976] EIP: [] 0xc1bb5429 SS:ESP 0069:e9eadd00
[ 300.391340] ---[ end trace 750438800d7fe836 ]---

Similar messages come up on my other Xeon boxes when I try to copy files, but since they don’t have HyperThreading, the same fix isn’t applicable there. However, since they work fine when not using the Xen-modified kernel, we’ll likely just re-task them for other work.
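If you want to double-check whether HyperThreading is actually on for a given box (before or after flipping the BIOS switch), here is a rough sketch in Python. It assumes a Linux x86 kernel whose /proc/cpuinfo exposes the "siblings" and "cpu cores" fields, and treats "more siblings than cores" as a sign that HyperThreading is active. The function names are mine, not part of any tool, and under a Xen-modified kernel the reported topology may not match the real hardware, so take the answer as a hint rather than proof.

```python
def hyperthreading_enabled(cpuinfo_text):
    """Guess HyperThreading state from the text of /proc/cpuinfo.

    Returns True if any processor entry reports more siblings than
    cores, False if every entry with both fields reports siblings <= cores,
    and None if the fields are missing (so we cannot tell).
    """
    saw_topology = False
    # /proc/cpuinfo separates per-processor entries with a blank line.
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        try:
            siblings = int(fields["siblings"])
            cores = int(fields["cpu cores"])
        except (KeyError, ValueError):
            # This entry has no usable topology fields; skip it.
            continue
        saw_topology = True
        if siblings > cores:
            return True
    return False if saw_topology else None


def check_host(path="/proc/cpuinfo"):
    """Read the local cpuinfo file (Linux only) and apply the check."""
    with open(path) as f:
        return hyperthreading_enabled(f.read())
```

On a Linux machine, calling check_host() should give True, False, or None (for "can't tell"). The BIOS setting is still the thing that actually matters; this only saves a reboot when you just want to look.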

But that’s beside the point: what the hell about this error message would have told you that HyperThreading was the culprit? I’m certainly no newcomer to Linux (although I don’t do any kernel hacking), but there is no way a reasonable user could have figured that out from this trace. Xen was announced five years ago; like I said, this is not a new technology. The Hardware Compatibility List shows Xeon servers, so we know they’ve tested Xen with Xeon boxes. It’s obviously impossible to test every combination of hardware and software, but there’s no reason to see this on six different Xeon boxes with different hardware (only one had HyperThreading).

That being said, I’ve learned an important lesson today:

If you’ve tried everything logical to fix a problem and failed, try the most illogical thing you can think of.