What lessons can we learn from the CrowdStrike outage?

The recent CrowdStrike outage, triggered by a faulty software update, resulted in the disruption of over 8.5 million Windows PCs worldwide.

This incident had far-reaching consequences, including thousands of cancelled flights, halted surgeries in hospitals, and temporary shutdowns of 911 systems, impacting critical sectors such as healthcare, retail, and finance. The estimated financial toll on Fortune 500 companies already exceeds US$5.4 billion (S$7.3 billion).

In response, Singapore established a task force to assess the situation and enhance digital resilience. While the APAC region was less affected due to fewer CrowdStrike customers, the incident underscores vulnerabilities in software management and highlights the need for improved cybersecurity and operational resilience across industries increasingly reliant on digital technologies.

To gain deeper insights on the lessons learned from the CrowdStrike incident and potential mitigation strategies, iTNews Asia speaks with industry practitioners to explore how the tech providers and businesses can address such disruptions if they happen here. Our respondents include:

Vaibhav Dutta, Associate Vice President and Global Head-Cybersecurity Products & Services at Tata Communications;
Dan Elliott, Head of Cyber Resilience, Zurich Resilience Solutions, Australia, Zurich Insurance;
Aaron Bugal, Field Chief Technology Officer, Asia Pacific and Japan, Sophos;
Damien Wong, Senior Vice President, Asia Pacific and Japan, Tricentis;
Anthony Lim, Director, Centre for Strategic Cyber and International Risks (CSCIS); and
Mark A. Johnston, Vice President, Global Healthcare Innovation, Infovision.

iTNews Asia: What insights can we gain from the CrowdStrike outage?

Zurich’s Elliott: After an incident like this, it’s essential to hold discussions internally, with vendors, and with the company insurer, focusing on incident response and business continuity planning.

Some organisations reacted quickly because they had already established mitigating strategies, including well-planned communications to employees and customers. We can learn valuable lessons from them.

Organisations should practise for such incidents the way schools conduct fire drills. This also highlights the importance of third-party risk management, an area of cyber risk I’ve been focused on for some time.

The traditional “castle-and-moat” network security model is outdated due to remote work and cloud infrastructure, and we must consider our dependence and interconnectedness with customers and vendors.

Tata’s Dutta: As critical infrastructure becomes more digital, cyberattacks can severely disrupt daily life. Resilience is about readiness and recovery, not just prevention.

Organisations should develop comprehensive contingency plans that go beyond IT to include essential business operations. Key measures include testing updates in staging environments, implementing phased rollouts, and conducting regular risk assessments.

A well-prepared incident response plan with clearly defined roles is vital for effective crisis management. Additionally, investing in continuous monitoring and AI-driven anomaly detection can improve threat response, while promoting a culture of security awareness helps reduce human error.

Sophos’ Bugal: Recent cybersecurity events and ongoing software trends clearly indicate that changes are necessary.

We must consider the advantages of Windows offering an extended set of native security interfaces for the entire endpoint security ecosystem while also weighing the monoculture risks associated with replacing the diverse proprietary innovations and controls currently available.

– Aaron Bugal, Field Chief Technology Officer, Asia Pacific and Japan, Sophos

Ultimately, transparency and open communication are the best ways to enhance outcomes for defenders and customers as quickly as possible.

CSCIS’ Lim: The CrowdStrike incident is a stark reminder of the delicate balance between maintaining security and stability in the cybersecurity realm within our increasingly interconnected and complex digital landscape.

While proactive threat detection and response are vital, they must be balanced with meticulous testing and contingency planning. As the industry navigates this incident, it will undoubtedly lead to enhanced protocols and safeguards to prevent similar occurrences in future.

Infovision’s Johnston: The CrowdStrike outage highlights the need for redundancy and regular third-party security audits to identify vulnerabilities in both primary and backup systems.

Proactively auditing security measures can uncover potential weaknesses, preventing future disruptions. Threat hunting, which involves identifying and neutralising threats before they manifest, should also be a priority to reduce reliance on reactive security measures. This incident shows the importance of having distributed security controls across layers to ensure resilience.

Tricentis’ Wong: Many organisations in APAC still rely on traditional methods like manual testing and legacy script-based automation, which are inadequate for today’s complex IT environments.

Integrating automation and AI into quality assurance and DevOps processes is essential for modernising these outdated approaches and ensuring more efficient testing, so that minor code changes don’t have the same ripple effect we saw with the CrowdStrike incident.

AI-Augmented DevOps is already enhancing team efficiency, reducing skills gaps, cutting costs, and improving software quality.

Our latest global survey found that 60 percent of DevOps practitioners believe testing is where AI adds the most value, emphasising its importance for software reliability and performance. For APAC organisations to stay resilient, adopting these advanced technologies and evolving their quality assurance strategies is a business imperative.

iTNews Asia: Should we reconsider our approach to integrating third-party software and managing software updates to avoid exposing ourselves to a single point of failure?

Tricentis’ Wong: Enterprise IT environments and application ecosystems are increasingly complex and interconnected. The pressure to accelerate software development and delivery often leads to a trade-off between speed and quality.

To keep pace with the frequency of updates and integrations while minimising risk and ensuring system stability, robust software testing strategies are essential.

To deliver high-quality products and services, teams must focus on identifying and addressing risks and defects early, shortening release cycles, and incorporating automation for continuous, actionable feedback. This shift in mindset can help reduce vulnerabilities and enhance overall software quality.

Sophos’ Bugal: Any security product, whether it utilises its own kernel drivers or Windows platform features, requires periodic updates that alter system behaviour. Such changes should be rolled out gradually to ensure stability and functionality.

Vendors should focus on making their products as safe and reliable as possible while providing customers with as much visibility and control as feasible.

CSCIS’ Lim: This will be challenging or near impossible, unless there is global outrage, lobbying, and legislation pushing software vendors and big tech to allow such options. I wouldn’t be surprised if the EU implements such a requirement.

In this incident, nearly all Microsoft enterprise users were unaware that CrowdStrike was installed on their computers until the story broke. CrowdStrike had been functioning quietly until a pre-production software update was rolled out, crashing millions of machines worldwide.

While we can reconsider our approach to software updates, the way software is deployed and the concentrated risks involved leave us with limited options. We can’t label it a single point of failure since the issue lies with the machine OS itself. It’s impractical to maintain a parallel OS or a backup laptop that doesn’t run Windows.

The real single point of failure isn’t the CrowdStrike endpoint. To take an extreme case that illustrates this state of dependency and why it is difficult to overcome: if our home power supply goes out, we won’t have a parallel power source.

Another question this incident raises is why CrowdStrike is necessary as an endpoint security solution. Is Microsoft’s own Defender not sufficient? And what about the millions of Windows home users relying on Defender for protection?

Zurich’s Elliott: There is room for improvement in these processes. Before onboarding a vendor, it’s crucial to assess the risks they may pose to the organisation. This can be achieved through a questionnaire, discussions about their practices, or even a penetration test of their network.

When building a cybersecurity program, it’s vital to avoid the risks of putting “all your eggs in one basket” by relying solely on a single security vendor. Additionally, best practices for software updates include initially updating just one computer or a few devices to see how those updates interact with existing software.

Tata’s Dutta:

Before adopting any third-party software, conduct thorough due diligence by assessing the vendor’s security track record, certifications, and incident response plans. Prioritise partners with a proven commitment to security, as they serve as your first line of defence.

– Vaibhav Dutta, Associate Vice President and Global Head-Cybersecurity Products & Services at Tata Communications

Cloud services operate on a shared responsibility model: providers secure the core infrastructure, while businesses must protect their applications, endpoints, and data. To prevent a single point of failure, businesses should diversify their software vendors to ensure redundancy and reduce over-reliance on any one provider.

By implementing robust security testing and using advanced tools like vulnerability scanning, businesses can stay ahead of threats. Ultimately, investing in comprehensive strategies not only strengthens defences but also builds trust and confidence in your digital infrastructure.

iTNews Asia: From CSCIS’ perspective, how prepared are governments, critical infrastructure service providers, and larger organisations in Asia should a similar incident take place here? Should they relook their business resiliency posture?

CSCIS’ Lim: Organisations should assess their exposure or dependency on third and fourth parties in light of this and future incidents. Even if their own endpoints, servers, and cloud workloads are unaffected, external parties they rely on may still be impacted, making it crucial to understand these relationships.

While third and fourth parties may recover and resume operations, it’s vital to determine whether their supply chains are operating without essential security controls due to this issue – for example, some businesses might disable CrowdStrike, thereby losing critical protections.

– Anthony Lim, Director, Centre for Strategic Cyber and International Risks

Many government and industry regulators are already focusing on third-party risks, and this scrutiny is likely to increase following the CrowdStrike incident.

These efforts will likely include:

Developing an aggregated view of technology dependencies across critical infrastructure sectors and industries – including reliance on technology service providers and software products – in order to identify systemic cyber risks, supply chain risks, sector-wide dependencies, and/or vulnerabilities.
Evaluating whether market presence or critical infrastructure technology dependencies should create new reliability and security obligations.
Vendors offering customers a robust rollback mechanism to quickly revert to a previous state when problematic or unsuitable updates are pushed out automatically.
Promoting effective communication channels in organisations and support mechanisms crucial for guiding users through troubleshooting processes, especially during widespread incidents.
Requiring vendors to roll out software updates incrementally to customers, and customers to build redundancy into their systems to cater for such situations.
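
To make the rollback idea in the list above concrete, here is a minimal Python sketch. It is an illustration only and does not describe any vendor's actual mechanism: the class name `VersionedDeployment`, the `health_check` callback and the version strings are all hypothetical.

```python
from typing import Callable, List


class VersionedDeployment:
    """Tracks installed versions so that a bad update can be reverted (hypothetical)."""

    def __init__(self, initial_version: str) -> None:
        # The last entry in the history is the version currently in use.
        self.history: List[str] = [initial_version]

    @property
    def current(self) -> str:
        return self.history[-1]

    def update(self, new_version: str, health_check: Callable[[str], bool]) -> bool:
        """Install new_version, then revert to the previous one if the check fails."""
        self.history.append(new_version)
        if health_check(new_version):
            return True
        self.history.pop()  # roll back to the last known-good version
        return False


deployment = VersionedDeployment("1.0")
deployment.update("1.1", health_check=lambda version: False)  # simulated bad update
print(deployment.current)  # 1.0
```

In practice the health check would have to run somewhere that survives a crash of the updated component, which is the hard part this sketch leaves out.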

iTNews Asia: What practical changes can be considered in cybersecurity and management strategies to mitigate the impact of such incidents? How can we revamp our business contingency plans?

Tata’s Dutta: Follow a strategy-driven focus on technology. As security infrastructure becomes more complex, it’s vital to simplify operations and ensure strong defences. Instead of adding more tech solutions, businesses should streamline their cybersecurity strategies to better protect data and customer interactions.

Building cyber resilience requires both policy and technological changes. This involves implementing a centralised system for managing software updates to ensure consistency and control.

Establishing a rigorous testing environment to validate updates before deploying them to production is essential, as is considering a phased rollout approach for effective monitoring and troubleshooting.

Finally, educating employees about the importance of timely software updates and their role in the overall security strategy is crucial for preventing vulnerabilities and reducing human error. Engaging IT teams and collaborating with risk management experts can help organisations navigate these challenges.

Zurich’s Elliott: There are a number of approaches that vendors could institute to potentially mitigate the impact of an outage like this. Staggering update rollouts allows for early detection and monitoring of potential issues.

Similarly, vendors can test their updates in a “sandbox” environment to identify any problems before deployment. There is also a compelling case for AI-driven anomaly detection to enhance the reliability and stability of code and updates.
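
As a rough illustration of how a staggered rollout enables early detection, the Python sketch below applies an update one group ("ring") of machines at a time and stops as soon as a ring's failure rate crosses a threshold. The function `staged_rollout`, the `apply_update` callback, the host names and the 1 percent threshold are assumptions made for this example, not details of any vendor's process.

```python
from typing import Callable, Dict, List, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    apply_update: Callable[[str], bool],
    max_failure_rate: float = 0.01,
) -> Dict[str, object]:
    """Update hosts ring by ring, halting if a ring fails too often (hypothetical)."""
    updated: List[str] = []
    for index, ring in enumerate(rings):
        # apply_update returns True when the host is healthy after the update.
        failures = [host for host in ring if not apply_update(host)]
        updated.extend(host for host in ring if host not in failures)
        failure_rate = len(failures) / len(ring) if ring else 0.0
        if failure_rate > max_failure_rate:
            # Later rings are never touched, which limits the blast radius.
            return {"halted_at_ring": index, "updated": updated, "failed": failures}
    return {"halted_at_ring": None, "updated": updated, "failed": []}


rings = [
    ["canary-1", "canary-2"],
    ["office-1", "office-2", "office-3"],
    [f"fleet-{i}" for i in range(100)],
]
# Simulate an update that breaks every "office" machine.
result = staged_rollout(rings, apply_update=lambda host: not host.startswith("office"))
print(result["halted_at_ring"])  # 1
```

Here the two canary machines update cleanly, the second ring fails, and the 100 fleet machines are left on the previous version.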

Tricentis’ Wong:

This incident was not due to cybersecurity oversight, but rather a lack of testing before software release. In today’s complex IT environments, even a minor change in one application can trigger a ripple effect that impacts interconnected business processes.

– Damien Wong, Senior Vice President, Asia Pacific and Japan, Tricentis

A mature quality assurance and testing strategy is essential for addressing user needs across different environments. Software quality intelligence, which uses real-time data analytics to monitor and analyse software risk, can enhance this approach by identifying which code changes need testing with each update, enabling enterprises to predict impacts and mitigate risks effectively.

Sophos’ Bugal: Strategies like gradual rollouts, feature flags, and customer control over software versions could be important considerations for ensuring business continuity in the face of potential disruptions.
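
One common way to combine the gradual rollouts, feature flags and customer version control mentioned here is percentage-based bucketing with an optional pinned version. The Python sketch below is a minimal illustration under that assumption; the function names, device identifiers and version numbers are hypothetical and do not describe how Sophos or any other vendor implements this.

```python
import hashlib
from typing import Optional


def rollout_bucket(device_id: str, flag: str) -> int:
    """Map a device to a stable bucket from 0 to 99 for a given flag."""
    digest = hashlib.sha256(f"{flag}:{device_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(device_id: str, flag: str, percent: int) -> bool:
    """Return True when the device falls inside the enabled percentage."""
    return rollout_bucket(device_id, flag) < percent


def choose_version(
    device_id: str,
    latest: str,
    previous: str,
    percent: int,
    pinned: Optional[str] = None,
) -> str:
    """Pick the version a device should run; a customer pin always wins."""
    if pinned is not None:
        return pinned
    flag = f"update-{latest}"
    return latest if is_enabled(device_id, flag, percent) else previous


# A customer that has pinned a version keeps it regardless of the rollout stage.
print(choose_version("laptop-042", "7.12", "7.11", percent=10, pinned="7.11"))  # 7.11
```

Because the bucket is derived from a hash rather than chosen at random on each check, a device that receives the update at 10 percent still receives it when the rollout widens to 50 percent.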

CSCIS’ Lim: The disruption recalls how a mistake in switch configuration knocked out the global BlackBerry email platform for three days after the error spread to every switch on the network. Human error is inevitable, but there are always lessons to learn to minimise future risks – switching everyone to MacBooks as a backup solution is not practical and may not even solve the problem.

While the likelihood of such events recurring is low, their impact can be significant. Agent-based detection systems like CrowdStrike Falcon EDR often require enhanced or administrator-level privileges to monitor computer activity, as they are integrated into critical OS components. There’s no quick fix; companies must manually reboot each affected device in ‘safe mode.’

Addressing the issue involves hands-on work for hundreds of thousands of machines, which poses a greater challenge for organisations with a large, remote workforce.

A bigger risk arises when users are asked to self-remediate, as this entails giving them administrative passwords, which could have further security implications.

Infovision’s Johnston: In addition to implementing cloud-based failover systems and decentralised security monitoring, organisations should integrate Security Orchestration, Automation, and Response (SOAR) platforms to streamline incident response across various tools.

Continuous security awareness training for employees is essential, as human error remains a major attack vector during security incidents. Additionally, threat intelligence must be updated regularly to adapt to new risks and improve response times.

– Mark A. Johnston, Vice President, Global Healthcare Innovation, Infovision

iTNews Asia: Does the CrowdStrike outage challenge the traditional notions of cybersecurity resilience held at Zurich Insurance, and what new paradigms might emerge from this event?

Zurich’s Elliott: During the outage, some organisations reportedly considered temporarily disabling their CrowdStrike network monitoring tools, leading to a difficult choice: face the risks of a network outage or leave the system vulnerable to attacks.

This underscores a significant issue where security tools meant to protect us can sometimes cause the business interruptions they are designed to prevent. It creates a challenging situation where systems either function perfectly or fail completely.

However, we must remember that network monitoring remains a fundamental aspect of network defence. While much focus has been on cyberattacks and breaches, we should not overlook other incidents, like system failures, that can also significantly impact organisations.

iTNews Asia: How can issues like this be prevented going into the future? How can companies balance between protection and exposure and be better prepared for the next disruption?

CSCIS’ Lim: Two steps an organisation can take to strengthen its business continuity in the case of a similar situation:

Build awareness, capability and resources into the Incident Response, Disaster Recovery & Business Continuity teams.
Work with the company’s usual system integrator (SI) to ensure that expertise and resources are available to help mitigate the situation when it arises and the SI is called in.

Organisations should consider what could have been done differently to reduce the impact, and revise their incident response, business interruption and disaster recovery plans accordingly.

Tata’s Dutta: Preventing future issues relies on building cyber resilience, much like vaccination for modern enterprises. The focus should shift from traditional tools to tackling evasive threats like ransomware and advanced persistent threats that often evade detection. Companies must be proactive, anticipating and mitigating risks before they lead to disruptions.

However, resilience isn’t about eliminating all risks but managing them effectively. To balance protection and exposure, companies should streamline security, anticipate risks, implement phased updates, and invest in advanced monitoring, patch management, and configuration management, all while fostering a culture of security awareness.

On a broader scale, governments can support this effort by developing robust guidelines and standards. For instance, Singapore’s proposed amendment to the Cybersecurity Act, which includes cloud data centre operators, represents a significant step toward enhancing data security. Such regulatory measures highlight the need for stringent cybersecurity practices and proactive threat management.

Zurich’s Elliott: Cyber risk advisors and engineers will continue to evolve global assessment methods for underwriters. Traditionally, risks from prospective insured clients are evaluated based on their controls, whether those controls are properly configured, and the risks that remain.

Moving forward, two additional factors will be increasingly important: incident response plans for non-security incidents (like vendor outages) and avoiding vendor lock-in or single points of failure, which occur when an organisation becomes overly reliant on a single vendor or technology for its cybersecurity program.

– Dan Elliott, Head of Cyber Resilience, Zurich Resilience Solutions, Australia, Zurich Insurance

Tricentis’ Wong: Strong change and release management processes are crucial for preventing faulty software updates from causing significant IT outages. These processes help identify affected areas and assess potential risks before implementation.

By analysing these factors, IT teams can ensure that faulty software isn’t released into production environments. While mistakes can happen, organisations should use these opportunities to re-evaluate their quality assurance strategies, particularly their change validation processes, and adopt more advanced testing when needed.

iTNews Asia: From your user perspective, how can companies and vendors work together more effectively to handle such outages in the future?

Zurich’s Elliott: At a contractual level, it’s vital to understand your service level agreements (SLAs) and the shared responsibility model for each vendor. However, communication is key; the worst time to discuss issues with a vendor is during an incident. Formalising communication as part of the annual vendor management cycle is essential if ad hoc discussions aren’t feasible.

The most successful organisations during incidents are not necessarily those using different vendors, as outages can occur with any security provider.

Rather, success lies in having IT and cybersecurity teams that can react quickly, gather information from the vendor, and communicate guidance to the rest of the organisation in an orderly manner.
