There’s an interesting discussion going on on the Linux kernel mailing list. Normally, the intimate details of Linux kernel development don’t make it to the security trade media (let alone the mainstream media), but this one did. For example, The Register is covering it.
First, the backstory. Kees Cook (ChromeOS security engineer, Debian developer, and Ubuntu security maintainer) and Linus Torvalds (Linux kernel author/maintainer) had a back and forth this weekend over a set of Linux hardening features (specifically, hardening features for usercopy). The feature is designed to do a bunch of stuff, but notably: catch heap overflows, prevent kernel modification, and enforce memory boundary checking. Seems like a useful feature. The current debate, though, is what the behavior should be when a check fails. It’s written (currently) such that it’ll warn, with the idea that in the future it’ll kernel panic.
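Since “warn vs. panic” is the crux of the disagreement, here’s a minimal sketch of the design question in plain C. This is not the kernel’s actual implementation; the policy enum, check_usercopy(), and copy_is_whitelisted() are hypothetical names I’m using purely to illustrate the two behaviors under debate (report-and-continue vs. halt).

```c
/*
 * Hypothetical sketch (not the kernel's actual code) of the design
 * question at issue: when a usercopy hardening check fails, do we
 * warn and keep going, or treat it as fatal?
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumed policy knob for this sketch; the real kernel expresses this differently. */
enum hardening_policy { POLICY_WARN, POLICY_PANIC };

static enum hardening_policy policy = POLICY_WARN;

/* Stand-in for the bounds check: is this copy inside the whitelisted region? */
static bool copy_is_whitelisted(size_t offset, size_t len, size_t whitelist_len)
{
    return offset <= whitelist_len && len <= whitelist_len - offset;
}

static void check_usercopy(size_t offset, size_t len, size_t whitelist_len)
{
    if (copy_is_whitelisted(offset, len, whitelist_len))
        return;

    if (policy == POLICY_WARN) {
        /* The behavior Linus favors as a first step: report and continue. */
        fprintf(stderr, "WARN: usercopy outside whitelist (offset=%zu len=%zu)\n",
                offset, len);
    } else {
        /* The "drastic measure": treat the violation as fatal. */
        fprintf(stderr, "BUG: usercopy outside whitelist, halting\n");
        abort(); /* stand-in for a kernel panic */
    }
}

int main(void)
{
    check_usercopy(0, 64, 128);  /* inside the whitelist: silent        */
    check_usercopy(96, 64, 128); /* overflows the whitelist: warns here */
    return 0;
}
```

The whole argument is essentially about the default for that policy value, and about when (if ever) it’s appropriate to flip it.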
The full exchange is worth reading through in detail, but the reason it’s in the news is mostly Linus’ comments about security people not “doing sane things,” and the fact that his response was a bit on the vitriolic side. Specifically, per Linus’ reply:
So honestly, this is the kind of completely unacceptable “security person” behavior that we had with the original user access hardening too, and made that much more painful than it ever should have been… Some security people have scoffed at me when I say that security problems are primarily “just bugs”. Those security people are f*cking morons. Because honestly, the kind of security person who doesn’t accept that security problems are primarily just bugs, I don’t want to work with. If you don’t see your job as “debugging first”, I’m simply not interested.
He goes on to say:
The primary focus should be “let’s make sure the kernel released in a year is better than the one released today”… the hardening efforts should instead _start_ from the standpoint of “let’s warn about what looks dangerous, and maybe in a _year_ when we’ve warned for a long time, and we are confident that we’ve actually caught all the normal cases, _then_ we can start taking more drastic measures”.
There are a few reasons why I think this is interesting. First, there’s the argument itself. To discuss it analytically and objectively, though, you have to be willing to at least consider the other side of the argument, which, frankly, Linus’ response makes a little hard to do because of the metacognitive dynamics of how he phrased his position. Those dynamics themselves are interesting too, but for a totally different reason. Lastly, it’s interesting because of synecdoche (i.e., “part represents the whole”): this whole exchange is a crystallized moment that is, writ small, the essence of the back-and-forth that happens around nearly every product about security vs. functionality trade-offs. That’s worth talking about, because customers are often unhappy with the “security” of what they buy or use, and they might not understand that a discussion just like this one is the reason why that product is the way it is.
Anyway, let’s start with Linus’ point. Is he right in saying that security issues are just bugs? I think fundamentally he is, but I will note that it’s nuanced. Let’s start with what’s easy. We all know there are a few different potential types of security “issues” that could happen: coding issues (e.g. injection issues and the like), logic issues (for example, behavior that isn’t constructed in such a way as to mitigate potential attack paths), and features that don’t behave as advertised or that allow an attacker to misuse the architecture or the implementation. Those are clearly bugs. It gets a little murkier, though, when we include security functionality in the discussion. Philosophically, are missing features (i.e. those that would be required to meet a reasonable security baseline based on how the product will be used) “bugs”? For example, say I were to author a web server that didn’t support TLS. Is it a “bug” that someone could snoop on the traffic? Meaning, would it be a request for a new feature (ergo, an improvement) to add it? Or is adding TLS addressing a bug? I think the answer depends on usage and on the claims made about how the product is supposed to function.
For example, it’s probably hard to argue it isn’t a bug if the web server example above is an e-commerce application advertised as “a secure, hardened e-commerce platform fully in compliance with the PCI DSS.” In that case, failing to implement a clearly-necessary feature to meet the goals of the product and the expectations of the user community is a bug. The expectations the user has dictate that. We’ve set up the expectation that it behave a certain way, so its not doing so is a bug. But it’s almost entirely dependent on what we’ve said about it and how the user expects it to behave.
Now, in the case of a general-purpose OS, the usage is undefined. For normative usage, I think the right behavior to enforce, and what I expect as a user, is that you don’t kernel panic if there’s an issue with usercopy. Warn me about it first. Seriously. Warn me a lot if you want, but start with the warning (which, note, is what the WARN behavior does). The situation might be significantly different, though, if it’s, say, a voting system or biomedical system running something like BusyBox. In that case, I don’t expect the usage or the user base to change much. If usercopy is happening in that situation, particularly in a way that fails the check, a kernel panic is probably an acceptable option. In fact, I’d prefer that. I might even prefer it if the kernel panics if someone employs usercopy in the first place.
tl;dr? I agree with Linus philosophically, with the caveat that, depending on usage, it might be non-“f*cking moronic” to advocate the “panic if the check fails” position (note that nobody in this thread is arguing that…). I wish Linus had picked a different way to phrase this, though. He’s set the argument up such that either you agree or you’re a moron. That makes it really hard for anyone to argue the alternative position. For example, did I argue the way I did above based on the logic, or because I don’t want Linus to think I’m a moron? I don’t know. I like to think it’s based on the logic. I also get that I have a metacognitive bias based on the Milgram effect (when it comes to Linux, who’s an authority if not Linus Torvalds?), so frankly I will never truly know for sure. That’s interesting if you’re interested in that kind of thing (which I am).
Finally, I think it’s interesting because these are exactly the security/usability tradeoff discussions that happen with every product there is. On the one hand, there are folks who want it to just work: functionality is paramount and trumps enhanced security. On the other, there are those who feel the system is flawed if certain security-related design criteria are not met. Almost always, the decision is to go with the former. Is that always the right thing? For the Linux kernel, I think it is. Ideally, I’d prefer to have the option to enable “kernel panic on whitelist failure”, “warn on whitelist failure” or “ignore whitelist failure” based on my particular use case. For BusyBox, I’d almost always prefer that it panic (and default that way); for my desktop, I’d almost always prefer that it warn (and default to that); for a teardown malware-testing virtual instance, I think “off” is the right setting. But what’s interesting to me about it is that it represents a security vs. usability design decision, and there’s a grey area. I think it’s important that we recognize it has the potential to be a grey area, because that way discussion and learning happen.
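To make that three-way option concrete, here’s a small sketch, again in plain C rather than kernel code, of the kind of knob I’m wishing for. The policy names, parse_whitelist_policy(), and handle_whitelist_failure() are all made up for illustration; they are not real kernel options.

```c
/*
 * Hypothetical illustration only: a three-way policy knob of the kind
 * described above ("panic", "warn", or "ignore" on a whitelist failure).
 * None of these names correspond to real kernel options.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum whitelist_policy { WL_IGNORE, WL_WARN, WL_PANIC };

/* Parse a boot-parameter-style string into a policy, defaulting to warn. */
static enum whitelist_policy parse_whitelist_policy(const char *arg)
{
    if (strcmp(arg, "panic") == 0)
        return WL_PANIC;
    if (strcmp(arg, "ignore") == 0)
        return WL_IGNORE;
    return WL_WARN;
}

/* What happens at the point a whitelist check fails, per policy. */
static void handle_whitelist_failure(enum whitelist_policy policy)
{
    switch (policy) {
    case WL_PANIC:
        fprintf(stderr, "whitelist failure: halting\n");
        abort(); /* stand-in for a kernel panic */
    case WL_WARN:
        fprintf(stderr, "whitelist failure: warning and continuing\n");
        break;
    case WL_IGNORE:
        break; /* deliberately permissive, e.g. a malware-teardown VM */
    }
}

int main(void)
{
    /* The three deployments from the paragraph above. */
    handle_whitelist_failure(parse_whitelist_policy("warn"));   /* desktop   */
    handle_whitelist_failure(parse_whitelist_policy("ignore")); /* sandbox   */
    handle_whitelist_failure(parse_whitelist_policy("panic"));  /* appliance */
    return 0;
}
```

Notably, mainline kernels already expose a coarse version of this dial in the panic_on_warn setting, which escalates any WARN() into a panic, though that applies globally rather than per hardening feature.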