WannaCry lessons, patching redux


So I promised yesterday that I would continue the discussion of the rampant foolishness that is “WannaCry” – and, more importantly, the lessons that we can learn from it.  I talked yesterday about what I perceive to be issues with the way that we as an industry responded from a communication standpoint (i.e. the “festival of half-cocked media evangelism”), but I think there is another lesson to learn as well.  Specifically, a lesson about patching.

Here’s the deal: over the weekend, the universe sorted organizations into two groups – those that patched for MS17-010 and those that did not.  Now, I’m not going to sit here and “armchair quarterback” the folks that didn’t patch: it’s super easy to criticize in hindsight, and doing so almost always leaves out important considerations on the part of those that made the decision.  That said, the question of whether or not they would have been better off having patched is an interesting one to suss out.  Specifically, because I think it’s possible that patching (even given a possible production impact) could have been the right decision in almost every case.  Not as in “now we know WannaCry came through, so it’s the right decision in hindsight” but instead as in “we don’t know yet whether MS17-010 will be exploited, but it’s serious enough that maybe the right decision is to patch – even if there could be some production impact in doing so.”  If you’re in IT – particularly in a “critical infrastructure” organization – you probably realize how bold a statement that is.

Look, patching is a hard thing to get right.  For example, in a hospital or health system, you have to balance the potential impact of the patch itself (which can sometimes have health and safety consequences) against the likelihood and impact of the vulnerability being exploited.  You need to weigh how much to test, and the time it takes to do so, against the potential impact of exploitation and the role of the systems that could be exploited.  To address potentially critical production issues, you might choose to test a given patch thoroughly – to research it exhaustively, test its operation on critical systems, etc.  That could take months.  Alternatively, you might choose to streamline testing given the severity of the issue in question.  This might take weeks.  Or you might choose – in a few rare situations – to push the patch through with minimal testing because the potential impact is so dire that it needs to be addressed ASAP.  This you can do in days.  I think MS17-010 should be in that last category for most shops in most situations.  Now, I’m excepting here areas where, as a practical matter, you can’t effect patching yourself (like a biomed system, communications routing equipment, etc.) because the system might be maintained by an outside vendor; but really, you have SMB turned off or otherwise filtered for those already, right?

To get back to the point, if you look at it a certain way, “not patching” is itself a certain kind of control or countermeasure.  Specifically, it’s a control to safeguard against potential downtime impact – or potential adverse consequences caused by the patch itself.  This use of “not patching” as a control is (or should be) weighed against the potential impact caused should the vulnerability the patch addresses be exploited.  Choosing between those two controls (patching vs. not patching) is therefore a risk-based decision: the two sets of undesirable consequences are weighed against each other, and the least impactful option becomes the plan moving forward.  And here’s where I’m going with this: expedited patching – even with minimal (or even no) testing – might have resulted in less impact in aggregate than waiting to thoroughly test before doing so.
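
To make that weighing a little more concrete, here is a minimal sketch in Python of the comparison described above.  Every name and number in it is a made-up assumption for illustration (the probabilities, the outage hours, the PatchOption class itself); plug in your own estimates.  It just computes an expected downtime for each option and picks the smaller one.

```python
from dataclasses import dataclass


@dataclass
class PatchOption:
    """One way of handling a patch. All figures are illustrative guesses."""
    name: str
    p_patch_breaks: float        # chance the patch itself causes an outage
    patch_outage_hours: float    # downtime if the patch does break something
    p_exploited: float           # chance of exploitation while still exposed
    exploit_outage_hours: float  # downtime if the vulnerability is exploited

    def expected_downtime(self) -> float:
        # Expected hours lost = patch-caused risk + exploitation risk.
        return (self.p_patch_breaks * self.patch_outage_hours
                + self.p_exploited * self.exploit_outage_hours)


def least_impact(options):
    """Return the option with the lowest expected downtime."""
    return min(options, key=lambda o: o.expected_downtime())


# Hypothetical inputs: expediting raises the odds that the patch breaks
# something, but sharply shortens the window in which you can be exploited.
expedited = PatchOption("expedited", 0.10, 8.0, 0.01, 72.0)
full_test = PatchOption("full testing", 0.01, 8.0, 0.20, 72.0)

best = least_impact([expedited, full_test])
print(best.name, round(best.expected_downtime(), 2))
```

With these invented inputs, the expedited option comes out at roughly 1.5 expected hours of downtime against roughly 14.5 for full testing.  The point is not the specific numbers, which are guesses, but that “not patching” carries its own expected cost, and that cost belongs on the same scale as the risk of the patch breaking something.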

In the case of the issues fixed by MS17-010 (i.e. CVE-2017-0143, CVE-2017-0144, et cetera), I think we can all agree that we knew ahead of time that the impact (when exploited) had the potential to be significant, right?  If you didn’t know that, you have bigger fish to fry than systematically evaluating the relative performance of your patching – or “not patching” – controls.  Like, stop reading this and go spin up a process to start tracking and evaluating this stuff – we’ll wait.  Anyway, assuming you did have a process that flagged this issue ahead of time along with its relative seriousness, I would argue that a 9.3 CVSS, remotely-exploitable issue, for which there is exploit code readily available, in one of the most commonly-used protocols on the planet is the time to pull out the expedited process in almost every case.  On the whole, I can almost guarantee you that the shops that had problems with WannaCry, had they applied an expedited patching process from the get-go, would have seen less production impact in aggregate than what they experienced by not doing so.
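
For what it’s worth, the triage logic described above (three testing tracks, with the expedited one reserved for the really dire stuff) can be sketched in a few lines of Python.  The thresholds, the Track names, and the choose_track function are all hypothetical; your own process will have its own criteria, and this is just one way to write them down so they get applied consistently.

```python
from enum import Enum


class Track(Enum):
    FULL = "full testing (months)"
    STREAMLINED = "streamlined testing (weeks)"
    EXPEDITED = "minimal testing (days)"


def choose_track(cvss: float, remote: bool, exploit_public: bool) -> Track:
    """Pick a patching track from a few attributes of the vulnerability.

    The cut-offs below are illustrative assumptions, not a standard.
    """
    # Severe, remotely exploitable, and exploit code is already circulating:
    # the cost of waiting is likely to outweigh the cost of a bad patch.
    if cvss >= 9.0 and remote and exploit_public:
        return Track.EXPEDITED
    # Serious and reachable over the network, but no known exploit code yet.
    if cvss >= 7.0 and remote:
        return Track.STREAMLINED
    # Everything else goes through the normal, thorough process.
    return Track.FULL


# MS17-010 as characterized above: CVSS 9.3, remote, exploit code available.
print(choose_track(9.3, remote=True, exploit_public=True).value)
```

Under these assumed thresholds, MS17-010 lands on the expedited track, which is the argument I’m making here.  Systems you can’t patch yourself (the vendor-maintained gear mentioned earlier) are out of scope for this sketch and need compensating controls instead.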

So what does that mean from a practical point of view?  It means that those shops that had issues with WannaCry may need to re-examine their patching process.  This is not going to be the last issue of this type that we’ll encounter.  The onus is on us – between now and the next time that it happens – to make sure that our processes are as streamlined as they can be and that we’re prioritizing appropriately in light of the full range of potential impacts of exploitation.