In a typical enterprise, the division of responsibilities is codified: an IT team runs IT systems and a security team operates security systems. There is little risk of security systems affecting IT systems until the security tools run on end-user devices and servers, or sit as active elements in the network (firewall admins will agree with me; they get plenty of unwarranted grief from IT teams claiming that “the firewall is slowing things down”).

Among the security tools with potential impact on IT-managed systems, anti-malware products with kernel-hooked drivers stand out. As cyber threat actors improve their attacks, so too must the capabilities of anti-malware tools. To perform their function efficiently, these tools are granted privileged access to the deeper levels of the operating system and its applications. That is where the technical, responsibility and incident-management issues arise. To resolve them, IT and security teams must work together, not against each other.

Take a security tool that requires a piece of software (an agent, service or kernel driver) to run on IT-managed systems, be they end-user computers or servers. The security team cannot and should not demand that the IT team install that software on its systems, blindly trusting the security team’s assurance that “this software is safe”.

Instead, the IT team should insist on proper justification and performance impact testing. An assessment should be made of how these tools, managed by the security team, affect the Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs) that the IT team has agreed with the rest of the business.

Unfortunately, based on my experience and on analysis of the biggest IT incident caused by a security company to date, many enterprises, even in regulated industries, have failed to do just that.

You might recall the businesses that were unable to resume normal operations even days after CrowdStrike distributed a faulty channel file update and released a fix a few hours later. Take Delta Air Lines as an example. While all other US airlines restored their operations within two days of the fix being made available, Delta was unable to operate normally for five days. As per CrowdStrike’s blog post, the blame for not restoring in due time is shared between CrowdStrike and Delta’s own IT and security teams.

While I am not advocating reducing CrowdStrike’s portion of the blame, I argue that the failure to resume operations once the fix was available represents a failure of the IT and security teams in the affected organisations.

The IT team’s primary objective is to deliver business value by making sure necessary IT systems are available and performing within agreed parameters, while the security team’s primary objective is to reduce the probability of material impact from a cyber event. The CrowdStrike incident was not a cyber event; it was an IT event caused by a security vendor. Similar events happen every year due to Microsoft blunders.

Inevitably, a lack of preparedness to restore normal operations within the agreed RTOs and RPOs tarnishes the reputation of both the IT and security teams in the eyes of executives across the other business functions.

Lost trust and a damaged reputation are difficult to regain. As an industry, we need to learn from this and work smarter.

The following are three lessons learned from this era-defining incident:

  • Focus on testing recovery against the agreed RTOs and RPOs. Security teams should insist that IT teams perform recovery testing covering scenarios where a security tool renders the operating system non-bootable (a minimal sketch of such a check follows this list).
  • CIOs and CISOs should jointly talk to the rest of the business executives, explaining the need for specialised security tooling while also providing assurance that tested recovery falls within the agreed parameters (e.g. RTOs and RPOs).
  • Engage the company’s legal counsel and procurement teams to review security vendors’ contracts and identify unfair advantages that vendors have embedded regarding the compensation due when faults in their service delivery occur.
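
To make the first lesson concrete, here is a minimal sketch, in Python, of how a recovery drill’s results could be compared against the agreed RTO and RPO targets. Every system name, target and measurement below is hypothetical and purely illustrative; in practice these figures would come from the organisation’s backup and monitoring tooling.

```python
# Hypothetical sketch: compare measured results from a recovery drill
# (e.g. restoring machines left non-bootable by a faulty agent update)
# against the RTO/RPO targets agreed with the business, and flag breaches.
from dataclasses import dataclass


@dataclass
class RecoveryTarget:
    rto_minutes: int   # agreed maximum time to restore the service
    rpo_minutes: int   # agreed maximum acceptable data-loss window


@dataclass
class DrillResult:
    restore_minutes: int    # measured time to get the system back in service
    data_loss_minutes: int  # measured age of the most recent usable backup


# Agreed targets per system (all values invented for illustration).
targets = {
    "booking-frontend": RecoveryTarget(rto_minutes=60, rpo_minutes=15),
    "crew-scheduling": RecoveryTarget(rto_minutes=240, rpo_minutes=60),
}

# Measured results from the drill (also invented).
drill = {
    "booking-frontend": DrillResult(restore_minutes=45, data_loss_minutes=10),
    "crew-scheduling": DrillResult(restore_minutes=400, data_loss_minutes=30),
}


def report_breaches(targets: dict, drill: dict) -> None:
    """Print OK/BREACH per system, or NOT TESTED if the drill skipped it."""
    for name, target in targets.items():
        result = drill.get(name)
        if result is None:
            print(f"{name}: NOT TESTED")
            continue
        rto_ok = result.restore_minutes <= target.rto_minutes
        rpo_ok = result.data_loss_minutes <= target.rpo_minutes
        status = "OK" if (rto_ok and rpo_ok) else "BREACH"
        print(
            f"{name}: {status} "
            f"(RTO {result.restore_minutes}/{target.rto_minutes} min, "
            f"RPO {result.data_loss_minutes}/{target.rpo_minutes} min)"
        )


if __name__ == "__main__":
    report_breaches(targets, drill)
```

The point of the exercise is not the script itself but the habit: any system that reports a breach, or is not tested at all, is a conversation the CIO and CISO need to have before the next faulty update arrives.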


