
Chapter 3: Incident Response Fundamentals

Module Overview

Imagine it is 4:45 PM on a Friday. You are packing up your bag, looking forward to the weekend. Suddenly, your phone buzzes. Then it buzzes again. Then the Slack channel for the Security Operations Center (SOC) lights up red. A server in the payroll department just started encrypting its own hard drive.

At this exact moment, the theory stops, and reality begins.

Welcome to Incident Response (IR). If cybersecurity is the shield, Incident Response is the sword we use when the shield cracks. In this profession, we operate under a simple, sobering truth: It is not a matter of if you will be breached, but when.

When that bad day comes, the difference between a minor annoyance and a company-ending headline is not the tools you bought—it is how well you prepared and how disciplined you are in your response. This week, we are moving away from general risk concepts and stepping into the role of the "First Responder." We will learn how to distinguish noise from actual threats, how to prioritize chaos, and most importantly, how to build a plan before the building catches fire.


Learning Objectives

By the end of this module, you will be able to:

  • Differentiate between a benign security event and an actionable security incident based on impact and intent.
  • Analyze raw security data to apply correlation techniques that identify complex attack patterns.
  • Categorize security incidents by Severity Levels (P1–P4) to determine resource allocation and response timelines.
  • Construct the framework of an Incident Response Plan (IRP) with a specific focus on the Preparation Phase of the NIST SP 800-61 lifecycle.
  • Apply the RACI matrix model to assign specific roles and responsibilities within a Computer Security Incident Response Team (CSIRT).

1. The Signal vs. The Noise: Events and Incidents

In a modern enterprise environment, our systems generate millions of logs every single day. Every login, every file access, every website visit creates a digital footprint. The first skill a SOC Analyst must master is the ability to filter this massive stream of data.

Defining the Terms

Many people use the words "event" and "incident" interchangeably. In this class (and in your job interview), you must not make that mistake.

Feature | Security Event | Security Incident
Definition | An observable occurrence in a system or network. | An event that violates security policies or threatens assets.
Frequency | Happens thousands of times per second. | Happens occasionally (hopefully rarely).
Nature | Usually neutral. | Inherently negative/hostile.
Example | A user types a password incorrectly once. | A script tries 500 passwords in 1 minute (Brute Force).
Action Required | Usually none (Log and ignore). | Immediate Action Required.

Real-World Scenarios: Event vs. Incident

To help you spot the difference in the logs, here is a direct comparison of similar activities across different security domains.

Domain | Security Event (Log & Ignore) | Security Incident (Alert & Respond)
Network | A firewall blocks a random port scan from the internet. (This is normal background noise.) | A firewall logs an internal server communicating outbound to a known Command & Control (C2) IP address.
Email | A user receives a generic spam email selling vitamins. | A user receives a targeted spear-phishing email pretending to be the CEO asking for W-2 forms.
Endpoint | Antivirus software successfully updates its signature database. | Antivirus detects a trojan on the HR Director's laptop but fails to quarantine it.
Access Control | A user mistypes their password once on a Monday morning. | A user account is locked out after 50 failed login attempts in 10 seconds (Brute Force Attack).
Physical Security | A badge reader grants access to an active employee at 8:00 AM. | A badge reader grants access to a terminated employee at 3:00 AM on a Saturday.
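To make the Access Control row concrete, here is a minimal sketch of a sliding-window check that separates one mistyped password from a burst of failures. The function name, the tuple layout of the log records, and the default thresholds (50 failures in 10 seconds, taken from the table) are assumptions for illustration; a real SIEM rule would be written in that product's own query language.

```python
from collections import defaultdict, deque


def find_brute_force(failed_logins, threshold=50, window_seconds=10):
    """Return accounts with at least `threshold` failures inside any window.

    failed_logins: iterable of (timestamp_seconds, account) tuples,
    assumed to be sorted by timestamp.
    """
    recent = defaultdict(deque)  # account -> failure timestamps still in the window
    flagged = set()
    for ts, account in failed_logins:
        window = recent[account]
        window.append(ts)
        # Drop failures that have aged out of the window.
        while ts - window[0] >= window_seconds:
            window.popleft()
        if len(window) >= threshold:
            flagged.add(account)
    return flagged


# One mistyped password (an event) vs. 50 failures in under 10 seconds (an incident).
logins = [(0.0, "alice")] + [(100 + i * 0.1, "hr_director") for i in range(50)]
print(find_brute_force(logins))
```

The single failure for "alice" never crosses the threshold, so it stays a logged event; only the burst against "hr_director" is flagged for response.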

Mentorship Note: Think of a Security Event like a heartbeat monitor in a hospital. It beeps constantly to say "I'm alive." That is normal. A Security Incident is when that heartbeat stops or goes erratic. That is when you grab the crash cart.

What Constitutes an Incident?

An incident is declared when the CIA Triad (Confidentiality, Integrity, Availability) is threatened. To upgrade an Event to an Incident, we usually look for three criteria:

  1. Policy Violation: Did someone break the rules? (e.g., An admin installing unauthorized software).
  2. Security Impact: Is data at risk? Is a system down?
  3. Documentation Necessity: Do we need to record this for legal or regulatory reasons?

Event Correlation: Connecting the Dots

A single event is rarely a smoking gun. Hackers are smart; they move "low and slow" to avoid tripping alarms. Correlation is the technique of connecting multiple seemingly harmless events to reveal a malicious pattern.

The Tool of the Trade: SIEM

We use a Security Information and Event Management (SIEM) system (like Splunk, Microsoft Sentinel, or LogRhythm) to do this heavy lifting.

Example of Correlation Logic:

  • Event A (10:00 AM): HR Manager swipes badge to enter the building in Ohio.
  • Event B (10:05 AM): HR Manager's account logs into the VPN from an IP address in Russia.
  • Analysis: Individually, both events might be technically possible. Correlated, they represent an "Impossible Travel" scenario. This is an instant Incident.
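The "Impossible Travel" logic above can be sketched in a few lines: compute the great-circle distance between the two locations and check whether any realistic travel speed could cover it in the time available. This is only an illustration; the coordinates (roughly Columbus, Ohio and Moscow), the 900 km/h speed ceiling, and the event dictionary layout are all assumptions, and real SIEM products rely on their own geolocation data.

```python
import math
from datetime import datetime


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on Earth, in kilometres."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def is_impossible_travel(event_a, event_b, max_speed_kmh=900.0):
    """True if covering the distance between two events would need more than max_speed_kmh."""
    distance = haversine_km(event_a["lat"], event_a["lon"], event_b["lat"], event_b["lon"])
    hours = abs((event_b["time"] - event_a["time"]).total_seconds()) / 3600
    if hours == 0:
        return distance > 0
    return distance / hours > max_speed_kmh


# Event A: badge swipe in Ohio. Event B: VPN login geolocated to Russia five minutes later.
badge = {"time": datetime(2024, 1, 5, 10, 0), "lat": 39.96, "lon": -83.00}
vpn = {"time": datetime(2024, 1, 5, 10, 5), "lat": 55.76, "lon": 37.62}
print(is_impossible_travel(badge, vpn))
```

Neither event is alarming on its own; the rule only fires because both are tied to the same account and compared against each other.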

"Left of Bang" vs "Right of Bang"

In the industry, you will often hear SOC Managers talk about shifting "Left of Bang." This concept comes from the military (specifically the US Marine Corps Combat Hunter program), but it is standard terminology in cybersecurity operations.

"The Bang" is the moment the attack successfully executes (e.g., the moment the ransomware encrypts the drive, or the data leaves the network).

Left of Bang (Proactive): Everything that happens before the attack succeeds. This is where we want to live.

  • Activities: Threat Hunting, patching, user awareness training, setting up firewall rules, analyzing pre-attack indicators (like scanning).
  • Goal: Disruption and Prevention.

Right of Bang (Reactive): Everything that happens after the attack succeeds. This is traditional Incident Response.

  • Activities: Forensics, re-imaging servers, restoring from backups, writing legal reports, paying fines.
  • Goal: Survival and Recovery.

Note

If your security team spends 100% of its time "Right of Bang," you are just digital firefighters. You will burn out. A mature security program invests heavily "Left of Bang" to stop the spark before it becomes an inferno.


2. Triage: Severity and Priority

When a breach occurs, panic is the enemy. You cannot fix everything at once. If a printer is infected with malware and the CEO's laptop is being ransomed, you must know which one to tackle first. We use Severity Levels (often called Priority Levels) to make these decisions objectively.

The Standard Priority Matrix

Priority | Classification | Description | Response SLA (Time to React)
P1 | Critical | Mission-critical systems down, active data exfiltration, or massive financial/reputational damage. | Immediate (15 minutes or less)
P2 | High | Significant degradation of services, or a localized breach that could spread. | Urgent (within 1-2 hours)
P3 | Medium | Policy violation or malware on a non-critical system. Business continues, but risk exists. | Standard (within 24 hours)
P4 | Low | Annoyance, spam phishing attempt that failed, or minor policy infraction. | Scheduled (Next Business Day)

Scenario: I once worked a ticket where a user reported "Internet is slow." We marked it P4. Two hours later, we realized the "slowness" was actually a DDoS attack taking down our main web server. We had to escalate that to a P1 instantly. Categorization is dynamic; it can change as you learn more.
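The "Internet is slow" story can be modeled as a ticket whose priority is recomputed every time a new finding is added. The sketch below is one possible way to encode the matrix above; the indicator names, the `Ticket` class, and the rule that the highest matching level wins are assumptions made for illustration, not a standard.

```python
from dataclasses import dataclass, field

# Hypothetical indicator names mapped to the rows of the priority matrix.
P1_SIGNS = {"critical_system_down", "active_exfiltration", "major_financial_damage"}
P2_SIGNS = {"service_degraded", "localized_breach"}
P3_SIGNS = {"policy_violation", "malware_on_noncritical_system"}

RESPONSE_SLA = {
    "P1": "15 minutes or less",
    "P2": "within 1-2 hours",
    "P3": "within 24 hours",
    "P4": "next business day",
}


def classify(indicators):
    """Return the highest priority level matched by any indicator."""
    if indicators & P1_SIGNS:
        return "P1"
    if indicators & P2_SIGNS:
        return "P2"
    if indicators & P3_SIGNS:
        return "P3"
    return "P4"


@dataclass
class Ticket:
    summary: str
    indicators: set = field(default_factory=set)

    @property
    def priority(self):
        # Priority is derived from the current findings, so it changes as we learn more.
        return classify(self.indicators)

    def add_finding(self, indicator):
        self.indicators.add(indicator)
        return self.priority


ticket = Ticket("Internet is slow")
print(ticket.priority, RESPONSE_SLA[ticket.priority])
ticket.add_finding("critical_system_down")  # the "slowness" turns out to be a DDoS
print(ticket.priority, RESPONSE_SLA[ticket.priority])
```

Deriving the priority from the findings, rather than storing it once at intake, is what keeps the categorization dynamic.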

Common Types of Security Incidents

You will encounter these standard archetypes in the field:

  • Ransomware: Malicious software encrypts data and demands payment. (High Availability/Integrity impact).
  • Unauthorized Access: Use of stolen credentials to access systems (Confidentiality impact).
  • Data Exfiltration: The unauthorized transfer of data out of the network.
  • Insider Threat: A disgruntled employee or contractor abusing their legitimate access.
  • Supply Chain Attack: A vendor you trust is compromised, and the attacker rides their connection into your network (e.g., the SolarWinds breach).


3. The Incident Response Lifecycle (NIST SP 800-61)

We do not make up Incident Response processes as we go. We follow the NIST SP 800-61 Revision 2 framework. This is the gold standard for US government and most private industries.

The lifecycle is a continuous loop consisting of four distinct phases. Understanding the specific activities in each phase is crucial for passing your certs and surviving your first week in a SOC.

Phase 1: Preparation

This is the only phase that happens before the bad guys get in. It is about establishing the capability to respond.

  • Goal: Ensure the organization is ready to handle an incident.
  • Key Activities:
    • Baselining: Knowing what "normal" network traffic looks like so you can spot "abnormal."
    • Tooling: Installing EDR agents, setting up the SIEM, and ensuring logging is turned on.
    • Contact Lists: Ensuring you have the 24/7 phone numbers for Legal, PR, and your ISP.
    • User Training: Teaching "Human Firewalls" not to click phishing links.
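Baselining can be as simple as recording what "normal" looks like and measuring how far a new observation sits from it. The sketch below uses a z-score over hourly outbound traffic; the sample numbers, the 3.0 threshold, and the function name are assumptions for illustration, and production tools typically use richer models.

```python
import statistics


def is_abnormal(baseline, observed, z_threshold=3.0):
    """Flag an observation that sits more than z_threshold standard deviations from the baseline mean.

    baseline needs at least two samples, because statistics.stdev requires them.
    """
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        return observed != mean
    return abs(observed - mean) / stdev > z_threshold


# Hypothetical baseline: outbound megabytes per hour from one server during a normal week.
normal_outbound_mb = [120, 135, 110, 128, 140, 125, 132]

print(is_abnormal(normal_outbound_mb, 130))    # an ordinary hour
print(is_abnormal(normal_outbound_mb, 51200))  # roughly 50 GB in one hour
```

Without the baseline list there is nothing to compare against, which is why this work belongs in Preparation rather than in the middle of an incident.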

Phase 2: Detection & Analysis

This is where the alert fires and the investigation begins. This phase often consumes the most time because false positives are common.

  • Goal: Determine if an incident has occurred, how severe it is, and what the scope is.
  • Key Activities:
    • Triage: Discarding false alarms (e.g., a user forgetting their password vs. a brute force attack).
    • Scoping: Answering the question: "Is it just one laptop, or is it the whole network?"
    • Forensics: Capturing volatile memory (RAM) or disk images to identify "Patient Zero" (the first compromised machine) and the malware that infected it.
    • Correlation: Using the SIEM to link a door swipe to a login event.

Phase 3: Containment, Eradication, & Recovery

Once we know what the threat is, we have to stop it. This phase is broken into three sub-steps:

A. Containment (Stopping the Bleeding)

  • Short-Term Containment: Disconnecting the infected server from the network (pulling the cable) or isolating the VLAN.
  • Long-Term Containment: Applying temporary patches or changing passwords on compromised accounts to prevent reentry.
  • Critical Decision: Do you shut it down immediately (stopping the damage but destroying evidence in RAM) or watch it to learn the attacker's moves?

B. Eradication (Removing the Cancer)

  • Goal: Eliminate all components of the incident.
  • Examples: Deleting the malware executable, disabling breached accounts, identifying and patching the specific vulnerability (e.g., the unpatched web server) that let them in.

C. Recovery (Getting Back to Business)

  • Goal: Restore systems to normal operation.
  • Examples: Restoring data from a clean backup (confirmed not to be infected), rebuilding the server from a gold image, and—crucially—monitoring the system for 24-48 hours to ensure the attacker doesn't return.

Phase 4: Post-Incident Activity

Often skipped, but arguably the most valuable phase. This happens after the dust settles.

  • Goal: Learn from the attack to prevent it from happening again.
  • Key Activities:
    • Lessons Learned Meeting (The "Hot Wash"): A meeting where the team asks: "What went wrong? How did they get in? Why did it take us 4 hours to notice?"
    • Metric Analysis: Calculating "Time to Detect" and "Time to Remediate."
    • Plan Updates: Rewriting the Incident Response Plan (IRP) based on what failed during the actual event.
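The Metric Analysis step is simple arithmetic on timestamps. The sketch below assumes that "Time to Detect" runs from the start of malicious activity to detection and "Time to Remediate" runs from detection to remediation; organizations define these intervals differently, and the field names and sample incidents are invented for illustration.

```python
from datetime import datetime, timedelta


def response_metrics(incidents):
    """Return mean Time to Detect and mean Time to Remediate for a list of incidents."""
    detect = [i["detected"] - i["started"] for i in incidents]
    remediate = [i["remediated"] - i["detected"] for i in incidents]
    count = len(incidents)
    return {
        "mean_time_to_detect": sum(detect, timedelta()) / count,
        "mean_time_to_remediate": sum(remediate, timedelta()) / count,
    }


# Two hypothetical incidents pulled from past tickets.
history = [
    {
        "started": datetime(2024, 3, 1, 9, 0),
        "detected": datetime(2024, 3, 1, 13, 0),
        "remediated": datetime(2024, 3, 1, 19, 0),
    },
    {
        "started": datetime(2024, 4, 10, 22, 0),
        "detected": datetime(2024, 4, 11, 0, 0),
        "remediated": datetime(2024, 4, 11, 10, 0),
    },
]

for name, value in response_metrics(history).items():
    print(name, value)
```

Tracking these averages from one Hot Wash to the next shows whether plan updates are actually shortening the response.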

Note

While we just covered the overview of the entire incident response lifecycle, this module will focus primarily on the first phase: Preparation. We cover all phases in depth in future weekly modules.


4. Deep Dive: The Preparation Phase

Abraham Lincoln reputedly said, "Give me six hours to chop down a tree and I will spend the first four sharpening the axe."

In Cybersecurity, Preparation is sharpening the axe. If you try to figure out who to call or where the backups are during a ransomware attack, you have already failed.

A. Developing Incident Response Policies

The Policy is the document that gives you the authority to act. It must be signed by senior leadership. Without this, you are just an IT person unplugging cables; with it, you are an authorized responder protecting the business.

The RACI Matrix

Confusion kills response times. We use a RACI Matrix to define roles clearly:

  • R - Responsible: The person who does the work (e.g., The SOC Analyst).
  • A - Accountable: The person who owns the outcome (e.g., The CISO).
  • C - Consulted: Subject matter experts who give advice (e.g., Legal Counsel).
  • I - Informed: People who need to be updated (e.g., The CEO or PR team).
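A RACI matrix is easy to keep as structured data, which also lets us check it for gaps before an incident. The sketch below assumes the common convention that every task has exactly one Accountable role and at least one Responsible role; the task names and role assignments are hypothetical examples, not a recommended allocation.

```python
# Hypothetical RACI matrix: task -> {role: letter}.
RACI = {
    "Isolate infected host": {
        "SOC Analyst": "R", "CISO": "A", "Legal Counsel": "C", "CEO": "I",
    },
    "Notify regulators": {
        "Legal Counsel": "R", "CISO": "A", "PR Team": "C", "CEO": "I",
    },
}


def validate_raci(matrix):
    """Return a list of problems; an empty list means every task passes the checks."""
    problems = []
    for task, assignments in matrix.items():
        letters = list(assignments.values())
        if letters.count("A") != 1:
            problems.append(f"{task}: expected exactly one Accountable, found {letters.count('A')}")
        if letters.count("R") < 1:
            problems.append(f"{task}: nobody is Responsible")
    return problems


print(validate_raci(RACI))
```

Running a check like this during Preparation surfaces the "I thought you were handling that" problem before it costs response time.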

B. The Incident Response Plan (IRP)

The Incident Response Plan (IRP) is the central governing document that directs the organization's response to security incidents. While the Policy establishes the mandate, the Plan provides the roadmap for the Computer Security Incident Response Team (CSIRT) to navigate the chaos of a breach.

A robust IRP must contain the following critical components:

  1. Mission & Objectives:
    • It answers the question: "When things go wrong, what matters most?" (e.g., Integrity vs. Availability).
  2. Scope:
    • What are we authorized to protect? (Physical, Technical, Assets).
    • Does the team have authority over BYOD (Bring Your Own Device)?
  3. Authority & Management Approval:
    • This is a formal statement signed by Executive Leadership.
    • It grants the CSIRT the power to make drastic decisions, such as severing internet connectivity, without needing to schedule a meeting during a crisis.
    • TIP: An IRP without this specific authorization is dangerous. If you unplug a server to stop a virus and cost the college money, you need a signed document proving you were authorized to take that action.
  4. Performance Metrics (KPIs):
    • Time to Detect (TTD).
    • Time to Acknowledge (TTA).
    • Time to Remediate (TTR).
  5. Forms & Checklists:
    • Incident Intake Form.
    • Chain of Custody Form (for forensics).
    • Call Trees.
  6. Plan Maintenance & Testing:
    • How often is the plan reviewed and tested?
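The Chain of Custody Form listed under Forms & Checklists usually records a cryptographic hash of each piece of evidence so that anyone can later show it was not altered. Here is a minimal sketch using SHA-256 from the Python standard library; the field names, the evidence ID format, and the helper functions are assumptions for illustration, and the "evidence" is just a temporary file.

```python
import hashlib
import os
import tempfile
from datetime import datetime, timezone


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large disk images do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def custody_entry(evidence_id, path, handler, action):
    """Build one chain-of-custody record (hypothetical field names)."""
    return {
        "evidence_id": evidence_id,
        "sha256": sha256_of_file(path),
        "handler": handler,
        "action": action,
        "recorded_at_utc": datetime.now(timezone.utc).isoformat(),
    }


def is_unaltered(path, entry):
    """Re-hash the evidence and compare it with the hash recorded at acquisition."""
    return sha256_of_file(path) == entry["sha256"]


# Demo with a throwaway file standing in for a disk image.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"simulated disk image contents")
    evidence_path = tmp.name

entry = custody_entry("EV-001", evidence_path, "analyst_1", "acquired")
print(entry["sha256"])
print(is_unaltered(evidence_path, entry))
os.remove(evidence_path)
```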

C. Communication Protocols

Silence is bad, but the wrong communication is worse.

  • Out-of-Band Communication: If hackers are inside your email server, you cannot use email to coordinate the response. You need a backup channel (e.g., Signal, a phone tree, or personal Gmail accounts).
  • Need-to-Know: Never broadcast details of a breach to the whole company until Legal approves it. Leaks can destroy stock prices.

D. Testing the Plan: Tabletop Exercises (TTX)

A plan that sits in a binder is useless. A Tabletop Exercise (TTX) is a discussion-based simulation where the team gathers to talk through a fake emergency.

The Goal: Validating Effectiveness

We do not run TTXs to "win." We run them to fail safely.

  • Does the plan actually work?
  • Does everyone know their role (RACI)?
  • Are the contact numbers for legal and insurance up to date?

Resources for Running a TTX

  • Backdoors & Breaches: An Incident Response card game by Black Hills Information Security.
  • TryHackMe Tabletop Exercise Simulator: A digital platform with guided scenarios and "Injects" (new information added to escalate pressure).

E. The CSIRT

The Computer Security Incident Response Team (CSIRT) is the group responsible for handling the incident.

  • Technical: Security Analysts, Network Engineers, SysAdmins.
  • Non-Technical: Legal, Human Resources (for insider threats), Public Relations.

5. Deep Dive Scenario: The "Friday Night" Supply Chain

Let's walk through how Event Correlation, Severity, and Preparation come together in a real workflow.

The Setup: You are monitoring the SIEM for a manufacturing college. You have established a Preparation baseline: you know what "normal" network traffic looks like.

The Events (The Inputs):

  1. Event A (11:00 PM): A vendor account (HVAC_Service) logs into the network. Status: Normal event.
  2. Event B (11:03 PM): The HVAC_Service account attempts to access the Student Financial Aid file server. Status: Suspicious Event (Why does the AC guy need financial records?).
  3. Event C (11:05 PM): The Financial Aid server initiates a 50 GB outbound transfer to an unknown IP address in North Korea. Status: Critical Event.

The Process:

  • Correlation: The SIEM connects these three distinct dots. It realizes a Service Account is behaving like a Data Thief.
  • Triage: You receive the alert. You classify this as a P1 (Critical) because it involves Data Exfiltration of sensitive student info (PII).
  • Response (The Plan): You open the "Data Breach Playbook."
    1. Sever the connection (Containment).
    2. Call the CISO (Communication Protocol).
    3. Engage Legal (RACI - Consulted).

The Lesson: If you hadn't done the Preparation (defining what the HVAC account should access), you would never have spotted the anomaly.
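The walkthrough above can be expressed as a small correlation rule: a vendor account touches a host outside its baseline, and that same host then starts a large outbound transfer within a short window. The sketch below is an illustration only; the allowlist, host names, event fields, 10-minute window, and 10 GB threshold are all assumptions, and a real SIEM would express this in its own rule language.

```python
from datetime import datetime, timedelta

# Preparation output (hypothetical): which hosts each vendor account should touch.
ALLOWED_HOSTS = {"HVAC_Service": {"hvac-controller"}}


def correlate(events, window=timedelta(minutes=10), exfil_bytes=10 * 1024**3):
    """Raise a P1 alert when out-of-scope access is followed by a large transfer from the same host."""
    alerts = []
    out_of_scope = []
    for event in sorted(events, key=lambda e: e["time"]):
        if event["type"] == "access":
            account = event["account"]
            # Only accounts with a defined baseline can be judged against it.
            if account in ALLOWED_HOSTS and event["host"] not in ALLOWED_HOSTS[account]:
                out_of_scope.append(event)
        elif event["type"] == "outbound_transfer" and event["bytes"] >= exfil_bytes:
            for access in out_of_scope:
                gap = event["time"] - access["time"]
                if access["host"] == event["host"] and timedelta(0) <= gap <= window:
                    alerts.append(
                        {"priority": "P1", "account": access["account"], "host": event["host"]}
                    )
    return alerts


friday_night = [
    {"time": datetime(2024, 1, 5, 23, 0), "type": "login", "account": "HVAC_Service"},
    {"time": datetime(2024, 1, 5, 23, 3), "type": "access",
     "account": "HVAC_Service", "host": "finaid-fs"},
    {"time": datetime(2024, 1, 5, 23, 5), "type": "outbound_transfer",
     "host": "finaid-fs", "bytes": 50 * 1024**3},
]
print(correlate(friday_night))
```

Note that the rule depends entirely on `ALLOWED_HOSTS`: remove the baseline and Event B looks like any other file access.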


6. Case Study: The SolarWinds Supply Chain Attack

To understand why we treat "Supply Chain Attacks" as P1 Critical incidents, we must look at one of the most significant breaches of the last decade: The SolarWinds "Sunburst" attack (detected Dec 2020).

The Attack Mechanism

SolarWinds makes software called Orion, which is used by IT departments to monitor network traffic. It has "God Mode" access to everything because it needs to see everything to monitor it.

Russian state-sponsored hackers compromised the build pipeline of SolarWinds. They didn't hack the customers directly; they poisoned the software before it shipped. When roughly 18,000 customers (including Microsoft and US government agencies) downloaded the legitimately signed software update, they unknowingly installed a backdoor (SolarWinds.Orion.Core.BusinessLayer.dll).

Note: This is the nightmare scenario. The call is coming from inside the house. You trusted the vendor, you installed the patch to stay secure, and the patch itself was the malware.

How Do You Respond? (The IR Perspective)

If you were a SOC Lead during this event, your standard playbook for "Malware" would have failed.

Phase 1: Detection (The Hard Part)

Standard antivirus did not catch this because the file was digitally signed by SolarWinds.

  • The Hero Moment: FireEye (a security company) discovered the breach not because of a fancy AI tool, but because a security analyst noticed a user had two registered Multi-Factor Authentication (MFA) devices instead of one. Correlation of subtle anomalies is what caught one of the biggest hacks in history.

Phase 2: Containment (The Pivot)

You cannot just "clean" the server. The software itself is compromised.

  • Immediate Action: Isolate the SolarWinds Orion server. Pull the network cable.
  • Assume Breach: Because Orion had admin credentials, you must assume the attackers stole those credentials.

Phase 3: Eradication (Scorched Earth)

  • Rebuild: The only safe response is to decommission the server entirely and rebuild from a known good source.
  • Credential Reset: You must force a password reset for every service account that the SolarWinds server had access to.

Phase 4: Lessons Learned

The SolarWinds attack changed how we write IR Plans. Now, our "Preparation" phase includes questions like: Does our security software have internet access it doesn't need?
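One way to answer that Preparation question is to compare what a server actually talks to against what it is supposed to talk to. The sketch below is a simple allowlist comparison; the host names, the `.example` destinations, and the shape of the connection log are hypothetical, and in practice the observed connections would come from firewall or proxy logs.

```python
def unexpected_egress(observed, allowlist):
    """Return, per host, the destinations it contacted that are not on its allowlist."""
    findings = {}
    for host, destination in observed:
        if destination not in allowlist.get(host, set()):
            findings.setdefault(host, set()).add(destination)
    return findings


# Hypothetical policy: the monitoring server only needs its vendor's update service.
egress_allowlist = {"orion-monitor": {"updates.vendor.example"}}

# Hypothetical outbound connections pulled from firewall logs.
connections = [
    ("orion-monitor", "updates.vendor.example"),
    ("orion-monitor", "unknown-c2.example"),
]
print(unexpected_egress(connections, egress_allowlist))
```

A host with no allowlist entry is treated as having no approved destinations, so every connection it makes is reported; that default is a deliberate, conservative choice in this sketch.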


Module Summary

This week, we learned that Incident Response is a structured discipline, not a frantic reaction. We distinguished between Events (happen all the time) and Incidents (require action). We discussed the importance of Correlation to find the needle in the haystack and how to use Severity Levels (P1-P4) to manage resources.

Finally, we explored the NIST SP 800-61 Preparation Phase. We learned that the most important work happens before the hacker arrives—by writing Policies, defining the RACI matrix, and testing our teams with Tabletop Exercises.

Discussion Questions

  1. Why is "Accountability" (The 'A' in RACI) different from "Responsibility" (The 'R'), and why is it dangerous for one person to hold both roles in a major incident?
  2. Imagine a scenario where a P3 incident (a minor policy violation) could escalate to a P1 incident within minutes. What would that look like?
  3. Many companies fail to perform Tabletop Exercises because they "don't have time." How would you argue the Return on Investment (ROI) of a Tabletop exercise to a CFO?