Biostar z87x 3D on Linux

Posted by Jack on 2013-12-29 at 15:27
Tagged: software, hardware, linux

UPDATE: This has been fixed with a BIOS microcode update.

I've been having trouble with a system that I built a month or two ago. Up until this winter break I'd been using it mostly like a Wintendo, only booting into Windows to play Steam games. Seeing as I'm spending a little more time at home, I decided to finally get comfortable in the Arch install, which, to my surprise, began to hard lock every time I turned around.

The Short Version

If you've got a Biostar z87x 3D that crashes in Linux all the time, add the following to your kernel command line:

nolapic

Please note the l: it's nolapic, not noapic, which is a different (although related) option.

You can usually add these flags in your grub configuration, either through /etc/default/grub with GRUB_CMDLINE_LINUX or, failing that, in /boot/grub/grub.cfg directly.
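
If you'd rather script that edit than do it by hand, the following Python sketch appends a flag to the GRUB_CMDLINE_LINUX line. Treat it as a minimal sketch under assumptions: add_kernel_flag is a name made up for illustration, it only understands a simple one-line KEY="value" assignment, and it works on text you hand it rather than touching /etc/default/grub itself.

```python
import re


def add_kernel_flag(grub_text, flag, key="GRUB_CMDLINE_LINUX"):
    """Return grub_text with `flag` present in the value assigned to `key`.

    Hypothetical helper: assumes `key` is set on a single line, optionally
    quoted, the way a stock /etc/default/grub usually looks.
    """
    # Match KEY=value, KEY="value" or KEY='value' on one line.
    pattern = re.compile(r'^(' + re.escape(key) + r')=(["\']?)(.*?)\2\s*$')
    out = []
    found = False
    for line in grub_text.splitlines():
        m = pattern.match(line)
        if m:
            found = True
            flags = m.group(3).split()
            if flag not in flags:  # don't add the same flag twice
                flags.append(flag)
            line = f'{key}="{" ".join(flags)}"'
        out.append(line)
    if not found:
        # No existing assignment: add one at the end.
        out.append(f'{key}="{flag}"')
    return "\n".join(out) + "\n"


example = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX="clocksource=hpet"\n'
print(add_kernel_flag(example, "nolapic"), end="")
```

After changing /etc/default/grub, the generated config still has to be rebuilt (on Arch, typically with grub-mkconfig -o /boot/grub/grub.cfg) before the flag shows up at boot.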

For reference, and in case I'm wrong, I also have clocksource=hpet in there because TSC worries me a bit, but I don't think that's relevant to the bug.
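
To confirm after a reboot that the flags actually took effect, you can look at what the running kernel was booted with. The sketch below is one way to do that, assuming the usual /proc/cmdline and the sysfs clocksource file are available; parse_cmdline and current_clocksource are illustrative names, not part of any existing tool.

```python
from pathlib import Path

# Standard sysfs location of the active clocksource on Linux.
CLOCKSOURCE = "/sys/devices/system/clocksource/clocksource0/current_clocksource"


def parse_cmdline(text):
    """Split a kernel command line into {name: value}; bare flags map to True."""
    params = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


def current_clocksource(path=CLOCKSOURCE):
    """Return the clocksource the kernel is currently using."""
    return Path(path).read_text().strip()


def report(cmdline_path="/proc/cmdline"):
    """Print whether nolapic is set and which clocksource is active."""
    params = parse_cmdline(Path(cmdline_path).read_text())
    print("nolapic set:", "nolapic" in params)
    print("clocksource requested:", params.get("clocksource"))
    print("clocksource in use:", current_clocksource())
```

Calling report() on the affected machine should show nolapic set and, if the clocksource flag was honoured, hpet for both clocksource lines.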

The Long Version

The symptoms of this crash were varied. Mostly it just hard locked the system: no input, no output, no response to SSH or ping. It actually only crashed occasionally, and usually after I finished running a Minecraft server or playing Dwarf Fortress.

First things first, I built a kernel from git and started to use it. My hardware is relatively new (this year's Haswell + z87 chipset) so it's not out of the question that there would be some kernel patches in flight between Arch's 3.12 and 3.13-rc5 in git. No dice though; the problem was just as prevalent on git, so I moved on to narrowing down the malfunctioning devices.

I eliminated the USB wireless device first, which I only suspected because it was throwing warnings all over my dmesg output. It's a WiPi device that I had lying around, and its rt2800usb driver was complaining about transmission timeouts. Surprisingly, with the nolapic fix this driver has shut up, so the warnings were likely a symptom of the same problem.

Then I eliminated the video card by running in a nouveau console without ever starting X. I would have reverted to the board's internal Intel device, but running from the hardware console and provoking the crash actually let me get a glimpse of the kernel debug output. Running MOC seemed to agitate it enough that I could reproduce the crash in about half an hour. With the messages the kernel was dumping to the hardware console I was able to find the pattern in the crashes.

They were all in an interrupt context, which is serious bad news for the kernel. They were all in different interrupt contexts as well, meaning that, unless there are multiple board-wide failures, something is wrong with interrupts themselves. So, naturally, I began to tweak knobs on the APIC (the Advanced Programmable Interrupt Controller). noapic, noapictimer, and nolapic_timer were all insufficient to fix it, but nolapic did the trick.

Why Did This Work?

This is definitely a question going forward and one that I'll give more attention in the future when I'm not on winter break. However, my first theory is that the Intel idle driver expects the LAPIC timer to be reliable when it actually isn't. When the core is idle and the processor has been put into a low(er) power state, the LAPIC wakeup either comes late or doesn't come at all. This would explain why the scheduler craps itself in an interrupt context (I don't think it responds well to starvation or bad timing), as well as why I could use the system for literally hours with no problem and then have it fail shortly after I was done.

I also think this is true because, Googling some LAPIC quirks, I discovered a handful of Intel Atom chips that have to be similarly gimped in the intel_idle driver because of unreliable LAPICs. The associated bugs were about random hangs.
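
One way to poke at this theory is to check which cpuidle driver is active and how much time a core spends in each idle state. The sketch below reads the standard cpuidle sysfs files; idle_driver and idle_states are illustrative names, and the sketch assumes the kernel was built with cpuidle support so those files exist. It only reports numbers and proves nothing about the LAPIC by itself.

```python
from pathlib import Path

# Standard sysfs root for per-CPU information on Linux.
CPU_ROOT = Path("/sys/devices/system/cpu")


def idle_driver(root=CPU_ROOT):
    """Return the active cpuidle driver name (e.g. intel_idle), or None."""
    path = Path(root) / "cpuidle" / "current_driver"
    return path.read_text().strip() if path.exists() else None


def idle_states(cpu=0, root=CPU_ROOT):
    """List idle states for one CPU with entry counts and residency."""
    base = Path(root) / f"cpu{cpu}" / "cpuidle"
    # Directories are named state0, state1, ...; sort them numerically.
    dirs = sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    states = []
    for d in dirs:
        states.append({
            "name": (d / "name").read_text().strip(),
            "usage": int((d / "usage").read_text()),   # times entered
            "time_us": int((d / "time").read_text()),  # microseconds spent
        })
    return states


def show(cpu=0):
    """Print the driver and a per-state summary for one CPU."""
    print("cpuidle driver:", idle_driver())
    for s in idle_states(cpu):
        print(f'{s["name"]:>8}  entered {s["usage"]}  time {s["time_us"]} us')
```

Comparing the output of show() while the machine is busy and again after the load stops would at least show whether the deeper states only start getting used around the time the hangs tend to happen.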

What this theory doesn't explain is why running MOC would agitate it, although the sound drivers likely make use of precision timing as well, and it seems likely that MOC wouldn't have to keep the processor too busy just to keep the audio buffer full.

Anyway, there might be a quirk patch in here if I can get around to pinning it down.