CH5: IR Fundamentals and the CSIRT

Introduction

In the previous chapters, we focused on the architecture of resilience: analyzing risk, conducting business impact analyses, and building disaster recovery plans. We have spent weeks preparing for "what if." Now, we shift our focus to "what is happening."

This chapter marks the transition from Planning to Handling.

When defenses fail and an adversary successfully infiltrates a network, the organization relies on Incident Response (IR). IR is not a passive activity; it is active combat against a threat actor within the digital environment. To win this engagement, organizations cannot rely on ad-hoc efforts or heroic individuals. They require a structured framework, a specialized team (the CSIRT), and a rigorous system for tracking the battle (Case Management).

Learning Objectives

By the end of this chapter, you will be able to:

Differentiate between an "Event" and an "Incident" and explain the importance of proper classification.
Compare and contrast the NIST SP 800-61 and ISO/IEC 27035 incident response frameworks.
Analyze the 7-step Incident Response lifecycle and map it to the NIST 4-step federal standard.
Define the core roles within a CSIRT, distinguishing between the technical analysis of the Investigator and the strategic authority of the Incident Commander.
Apply the RACI model to an incident scenario to determine ownership and accountability across different departments.
Explain the critical components of Case Management, including Chain of Custody, evidence tracking, and decision logging.

5.1 Incident Response Frameworks

An Incident Response framework provides the lifecycle model that guides a security team from the moment an alarm triggers to the final lessons learned meeting. Without a framework, response efforts become chaotic, often resulting in "whack-a-mole" eradication attempts that alert the adversary without removing their access. As in earlier chapters, we anchor the discussion in industry standards from NIST, specifically NIST SP 800-61.

The 7-Step Incident Response Lifecycle

Before we align with specific government standards, it is critical to understand the granular steps of handling an attack. While many frameworks group these into broad phases, the operational reality of an incident often follows seven distinct stages.

We will explore each of these phases in depth in Chapters 6 through 11, but here is the high-level roadmap of the battle:

1. Preparation

The work done before the incident occurs to ensure the team is ready to fight. This is the most critical phase.
Examples: Writing policies, deploying EDR agents, configuring log retention, creating "break-glass" administrative accounts, and conducting tabletop exercises.

2. Detection

The process of recognizing that a security event has occurred and determining if it qualifies as an incident.
Examples: A SIEM alert triggers for "Impossible Travel," a user reports a phishing email, or an IDS signature fires on a SQL injection attempt.

3. Analysis

The forensic process of determining what happened, how the attacker got in, and where they went. We identify the "scope" of the incident. This often happens in parallel with containment.
Examples: Analyzing RAM dumps for fileless malware, reviewing firewall logs to trace lateral movement, and identifying "Patient Zero" (the first compromised host) and the vulnerability that was exploited.

4. Containment

Now that we know the full scope, we can take immediate actions to stop the spread of the attack and prevent further damage. This is "applying the tourniquet."
Examples: Disconnecting a server from the network, disabling a compromised user account, or blocking an attacker's IP address at the firewall.

5. Eradication

The complete removal of the threat from the environment.
Examples: Deleting malicious scheduled tasks, removing backdoors or maliciously created user accounts, and patching the vulnerability that allowed entry.

6. Recovery

Restoring systems to normal operation and validating they are safe to return to production.
Examples: Restoring data from clean backups, resetting passwords for all users, and monitoring the system closely for 48 hours to ensure the attacker doesn't return.

7. Lessons Learned (Post-Incident)

The review phase to improve the process for next time.
Examples: Conducting a "Hot Wash" meeting, calculating the cost of the breach, and updating the IR Plan to fix gaps discovered during the response.

Note

CompTIA vs. NIST Mappings

If you are preparing for the CompTIA Security+ or CySA+ exams, you will often see the lifecycle broken down into these 7 distinct phases.

However, NIST SP 800-61—the federal standard we use for governance—groups these seven steps into four broader phases:

Preparation
Detection & Analysis
Containment, Eradication & Recovery
Post-Incident Activity

Be aware of this mapping when translating between "Classroom Theory" (7 steps) and "Federal Policy" (4 steps).
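
The mapping in this note can be written down as a simple lookup table. The sketch below is illustrative only: the names LIFECYCLE_TO_NIST and nist_phase are hypothetical and are not defined by NIST or CompTIA.

```python
# Hypothetical lookup table: 7-step lifecycle -> NIST SP 800-61 four-phase model.
LIFECYCLE_TO_NIST = {
    "Preparation": "Preparation",
    "Detection": "Detection & Analysis",
    "Analysis": "Detection & Analysis",
    "Containment": "Containment, Eradication & Recovery",
    "Eradication": "Containment, Eradication & Recovery",
    "Recovery": "Containment, Eradication & Recovery",
    "Lessons Learned": "Post-Incident Activity",
}


def nist_phase(step: str) -> str:
    """Return the NIST phase that contains the given lifecycle step."""
    return LIFECYCLE_TO_NIST[step]


print(nist_phase("Eradication"))  # Containment, Eradication & Recovery
```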

NIST SP 800-61 and the CSF 2.0 Alignment

The gold standard for Incident Response in the United States is NIST Special Publication 800-61 (Computer Security Incident Handling Guide). Modern operations align these steps with the NIST Cybersecurity Framework (CSF) 2.0 to ensure IR is not a siloed function but part of enterprise risk management.

Govern & Identify: These functions align with the Preparation phase. They involve establishing the policies, asset inventories, and risk assessments that inform how the team fights.
Protect: This involves the proactive controls (firewalls, MFA, EDR) that reduce the number of incidents the team must handle.
Detect: The moment an event is observed. This is the trigger for the IR lifecycle.
Respond: The core operational phase where the team analyzes, contains, and eradicates the threat.
Recover: Restoring systems to normal operation and ensuring business continuity.

ISO/IEC 27035: The International Alternative

While NIST is the dominant standard in the US, organizations with a global footprint—especially those in Europe or Asia—often adopt ISO/IEC 27035. This is the international standard for incident management and is part of the broader ISO 27000 information security series.

Key Differences from NIST:

Five-Phase Model: ISO 27035 explicitly separates "Plan and Prepare" (Phase 1) and "Lessons Learned" (Phase 5) as distinct, formalized audit requirements.
Process-Heavy: It places a stronger emphasis on the administrative process of managing the incident (documentation, reporting structures) rather than just the technical steps of fixing it.
Terminology: You may see terms like "Information Security Incident Management (ISIM)" used more frequently in ISO shops.

For the purposes of this course, we will follow the NIST model, but you must be aware that ISO 27035 exists, particularly if you work for a multinational corporation.

Incident Taxonomy & Classification

One of the most frequent failures in early-stage IR programs is the inability to distinguish between an Event and an Incident.

Event: Any observable occurrence in a system or network. A user connecting to a file share is an event.
Incident: An event that violates computer security policies, acceptable use policies, or standard security practices. A user connecting to a known malware command-and-control (C2) server is an incident.

To manage resources effectively, organizations must adopt a clear Taxonomy—a controlled vocabulary for naming and categorizing incidents. Using consistent categories (e.g., "Malware," "Denial of Service," "Unauthorized Access," "Harassment") allows the organization to track trends over time.
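
To make the Event/Incident distinction concrete, here is a minimal sketch in Python. It assumes a hypothetical threat-intelligence set called KNOWN_C2_SERVERS and a hypothetical classify function; a real program would draw on many more signals than a single destination address.

```python
from enum import Enum
from typing import Optional


class Category(Enum):
    """Controlled vocabulary taken from the example categories above."""
    MALWARE = "Malware"
    DENIAL_OF_SERVICE = "Denial of Service"
    UNAUTHORIZED_ACCESS = "Unauthorized Access"
    HARASSMENT = "Harassment"


# Hypothetical threat-intel list; 203.0.113.0/24 is a documentation-only range.
KNOWN_C2_SERVERS = {"203.0.113.50"}


def classify(destination_ip: str) -> Optional[Category]:
    """Return a taxonomy category if the event is an incident, else None."""
    if destination_ip in KNOWN_C2_SERVERS:
        return Category.MALWARE
    return None  # an ordinary event: observable, but no policy was violated


print(classify("10.0.0.5"))       # None
print(classify("203.0.113.50"))   # Category.MALWARE
```

Because every incident is tagged with a member of the same Category enumeration, counting incidents per category over time becomes a simple aggregation.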

Severity Matrix & Escalation Triggers

Not all incidents are created equal. A single laptop infected with adware does not require the same response as a ransomware deployment on the primary database. To determine the appropriate level of force, organizations utilize a Severity Matrix.

The matrix typically uses a combination of Impact (How badly does this hurt?) and Scope (How widespread is it?) to assign a severity level, often denoted as SEV1 (Critical) through SEV4 (Low).

SEV4 (Low): Routine commodity malware on a single endpoint. Handled by Tier 1 analysts.
SEV3 (Medium): Targeted phishing or confirmed unauthorized access to a non-critical system. Escalated to Tier 2/3.
SEV2 (High): Data exfiltration, compromise of a critical server, or widespread outage. Activates the Incident Commander and Crisis Management.
SEV1 (Critical): Existential threat (e.g., enterprise-wide ransomware, loss of all customer data). Involves the Board of Directors and external authorities.

Escalation Triggers are pre-defined conditions that force an elevation in severity. For example, a policy might state: "Any incident involving the PII of more than 500 customers automatically escalates to SEV2." These triggers remove ambiguity and hesitation during the "fog of war."
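
The interaction between the matrix and an escalation trigger can be sketched as follows. The numeric weights and score thresholds are assumptions chosen for illustration, not values from any standard; only the 500-customer PII trigger comes from the example policy above.

```python
# Hypothetical weights for the two axes of the matrix.
IMPACT = {"low": 1, "medium": 2, "high": 3, "critical": 4}
SCOPE = {"single_host": 1, "multiple_hosts": 2, "enterprise_wide": 3}


def base_severity(impact: str, scope: str) -> int:
    """Combine Impact and Scope into a level from 1 (Critical) to 4 (Low)."""
    score = IMPACT[impact] * SCOPE[scope]
    if score >= 9:
        return 1
    if score >= 3:
        return 2
    if score >= 2:
        return 3
    return 4


def assign_severity(impact: str, scope: str, pii_customers: int = 0) -> str:
    """Apply the matrix, then any escalation trigger that forces a higher level."""
    sev = base_severity(impact, scope)
    # Example policy: PII of more than 500 customers escalates to at least SEV2.
    if pii_customers > 500:
        sev = min(sev, 2)
    return f"SEV{sev}"


print(assign_severity("low", "single_host"))                     # SEV4
print(assign_severity("low", "single_host", pii_customers=750))  # SEV2
```

Note that the trigger only ever raises severity: an incident already rated SEV1 stays at SEV1.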

Declaration & Decision Rights

Who has the authority to shut down the company's e-commerce portal? Who decides to sever the connection to a critical vendor?

These are questions of Decision Rights. In a crisis, seeking consensus is fatal. The IR Plan must explicitly state who has the Declaration Authority—the power to declare a major incident—and who holds the decision rights for containment actions that impact business revenue. Typically, technical containment (blocking an IP) is delegated to the security team, while business-impacting containment (shutting down a plant) requires executive approval, often vetted by the Incident Commander.

5.2 The CSIRT Structure

The Computer Security Incident Response Team (CSIRT) is the operational body responsible for executing the IR plan. The structure of this team varies based on the size and culture of the organization.

CSIRT Models

Centralized (SOC): A single team, usually located at headquarters, handles all incident response for the global enterprise. This offers high consistency but may lack local context or language skills for remote offices.
Distributed: Multiple independent IR teams exist in different business units or geographic regions. This provides excellent local response speed but often results in poor coordination and fractured intelligence.
Hybrid: The most common model for large enterprises. A central "Core" team sets policy, manages tooling, and handles major investigations, while local "Liaisons" or smaller teams handle routine incidents and provide "boots on the ground" support.

Key CSIRT Roles

A functional CSIRT is not just a room full of hackers; it requires specific roles to manage the flow of information and operations.

The Incident Commander (IC)

The IC is the most critical role during a major incident. The IC does not touch the keyboard. Their job is to manage the incident, not fix the server.

Responsibilities: Coordinate the team, maintain the master timeline, manage resource allocation, and serve as the single point of contact for executive leadership.
Authority: The IC typically holds the delegated authority to make time-sensitive containment decisions.

The Lead Investigator / Analyst

This is the technical lead responsible for the "deep dive."

Responsibilities: Forensics, log analysis, malware reverse engineering, and root cause analysis. They report findings to the IC.

The Scribe / Recorder

In high-stress environments, memory is fallible.

Responsibilities: Document every decision, action, timestamp, and finding in the Master Chronology. This documentation is vital for legal defense and the Post-Incident Review.

Extended Team Liaisons

Legal Liaison: Advise on regulatory reporting timelines (e.g., GDPR's 72-hour window) and privilege.
PR / Communications: Manage the narrative to the press and customers.
HR Liaison: Handle incidents involving insider threats or employee misconduct.

Professional Insight: The "Fog of War"

In the heat of a breach, technical teams often develop tunnel vision on the immediate problem (e.g., "I need to decrypt this file"). It is the Incident Commander's job to pull them back and look at the strategic picture (e.g., "Stop trying to decrypt; we need to isolate the backup server before the attackers find it"). A team without a Commander is just a group of individuals working in parallel, often at cross-purposes.

5.3 Case Management Operations

Incident Response is an evidence-driven discipline. If an action or finding is not documented, it effectively did not happen. Relying on email threads, Slack chats, or verbal shouts across the room is insufficient for professional incident management. Teams require a dedicated Case Management System.

The "System of Record"

The Case Management System (CMS) acts as the single source of truth for the incident. It is a specialized ticketing system designed for security workflows. Unlike a standard IT Help Desk ticket (which focuses on "fixing" and "closing"), an IR case focuses on "investigating," "containing," and "preserving."

Ticketing & Workflow

Effective case management relies on structured workflows. When an alert enters the queue, the CMS should guide the analyst through the triage process:

Triage: Is this a false positive?
Assignment: Who owns this?
Severity: What is the initial classification?
SLA Tracking: The system tracks "Time to Acknowledge" and "Time to Contain" against internal Service Level Agreements.
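
As a rough illustration of SLA tracking, the sketch below computes "Time to Acknowledge" and "Time to Contain" from three timestamps. The SLA targets (15 minutes and 4 hours) and the function name sla_report are assumptions made for this example; each organization sets its own values.

```python
from datetime import datetime, timedelta

# Hypothetical internal SLA targets.
SLA = {"acknowledge": timedelta(minutes=15), "contain": timedelta(hours=4)}


def sla_report(created: datetime, acknowledged: datetime, contained: datetime) -> dict:
    """Measure both intervals from alert creation and compare them to the SLA."""
    tta = acknowledged - created
    ttc = contained - created
    return {
        "time_to_acknowledge": tta,
        "time_to_contain": ttc,
        "acknowledge_met": tta <= SLA["acknowledge"],
        "contain_met": ttc <= SLA["contain"],
    }


report = sla_report(
    created=datetime(2024, 1, 5, 16, 48),
    acknowledged=datetime(2024, 1, 5, 16, 53),
    contained=datetime(2024, 1, 5, 21, 48),
)
print(report["acknowledge_met"], report["contain_met"])  # True False
```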

Minimum Case Fields

To ensure the data is usable for future metrics and legal defense, every case must capture specific data points:

Timeline: When did the event happen? When was it detected? When was it closed?
Asset Data: Which systems (Hostnames, IPs, Users) are involved?
Observables/IoCs: What IP addresses, file hashes, or domains were observed?
Hypothesis: What does the analyst think is happening?
Actions Taken: What containment steps were executed (e.g., "Reset password for User X at 14:00")?
Evidence Location: Where are the log exports or disk images stored?
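
One way to picture these minimum fields is as a record type. The sketch below uses a Python dataclass; the class name Case, the attribute names, and the missing_fields helper are hypothetical and do not correspond to the schema of any particular case management product.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Case:
    # Timeline
    occurred_at: Optional[datetime]
    detected_at: datetime
    closed_at: Optional[datetime] = None
    # Asset data: hostnames, IPs, users
    assets: List[str] = field(default_factory=list)
    # Observables / IoCs: IP addresses, file hashes, domains
    observables: List[str] = field(default_factory=list)
    hypothesis: str = ""
    actions_taken: List[str] = field(default_factory=list)
    evidence_locations: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Names of minimum fields that are still empty."""
        return [name for name, value in vars(self).items() if value in (None, "", [])]


case = Case(occurred_at=None, detected_at=datetime(2024, 1, 5, 16, 48), assets=["FIN-LT-04"])
case.actions_taken.append("Reset password for User X at 14:00")
print(case.missing_fields())
```

A check like missing_fields could run before a case is allowed to close, so that incomplete records are caught while the analyst still remembers the details.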

RACI & Handoffs

Incidents rarely stay within one team. The SOC might detect the alert, but the Server Admin team must patch the vulnerability, the Network Team must block the firewall port, and Legal must approve the external notification. Without clear roles, tasks fall through the cracks, or multiple teams collide trying to fix the same issue.

To solve this, IR teams utilize a RACI Matrix—a linear responsibility chart that clarifies ownership for every major task.

Defining the RACI Model

R - Responsible (The Doer): The person or role who actually performs the work. They are the "boots on the ground" touching the keyboard. There can be multiple "R"s for a task.
A - Accountable (The Owner): The person who is ultimately answerable for the correct and thorough completion of the task. They delegate the work to the Responsible party. There must be only one "A" per task to avoid confusion.
C - Consulted (The Adviser): Subject matter experts whose opinions are sought before a decision or action. This is two-way communication. (e.g., "Legal, can we legally shut this server down?")
I - Informed (The Recipient): Those who are kept up-to-date on progress, often after a decision or action. This is one-way communication. (e.g., Telling the CEO, "The server is down.")
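
The rules above (at least one "R" and exactly one "A" per task) are easy to check mechanically. The following sketch represents a RACI matrix as a nested dictionary; the tasks, role assignments, and the validate function are illustrative assumptions, not a recommended allocation.

```python
# Hypothetical RACI matrix: task -> {role: code}.
RACI = {
    "Isolate infected host": {
        "SOC": "R", "Incident Commander": "A", "Network Team": "C", "CEO": "I",
    },
    "Patch vulnerability": {
        "Server Admin": "R", "Incident Commander": "A", "SOC": "C",
    },
    "Approve external notification": {
        "PR": "R", "Legal": "A", "CEO": "I",
    },
}


def validate(matrix: dict) -> list:
    """Return a list of rule violations; an empty list means the matrix is valid."""
    problems = []
    for task, roles in matrix.items():
        codes = list(roles.values())
        if codes.count("A") != 1:
            problems.append(f"{task}: needs exactly one Accountable, found {codes.count('A')}")
        if codes.count("R") < 1:
            problems.append(f"{task}: needs at least one Responsible")
    return problems


print(validate(RACI))  # []
```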

Evidence Tracking (Chain of Custody)

In legal proceedings, the Chain of Custody documents who handled each piece of digital evidence and when, demonstrating that it has not been altered from the moment it was collected to the moment it is presented in court.

The CMS must include—or link to—a mechanism for tracking evidence. This includes:

Who collected the evidence.
When it was collected.
Where it is stored (e.g., a secure, access-controlled evidence locker or digital vault).
Hash Values: Cryptographic hashes (e.g., SHA-256) taken at the time of collection to prove integrity later.
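
The hashing step can be sketched with Python's standard hashlib module. The record layout and the function names custody_entry and verify are assumptions for illustration; a real evidence system would also record every subsequent transfer of the item.

```python
import hashlib
from datetime import datetime, timezone


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large disk images do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def custody_entry(path: str, collector: str, storage_location: str) -> dict:
    """Record who collected the evidence, when, where it is stored, and its hash."""
    return {
        "collected_by": collector,
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "stored_at": storage_location,
        "sha256": sha256_of_file(path),
    }


def verify(path: str, entry: dict) -> bool:
    """Re-hash the file and compare with the hash taken at collection time."""
    return sha256_of_file(path) == entry["sha256"]
```

If verify returns False at any later point, the file no longer matches what was collected, and its value as evidence is in doubt.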

Tooling

While many organizations attempt to use standard IT ticketing systems (like ServiceNow or Jira) for IR, specialized tools often provide better support for observables and threat intelligence integration.

TheHive: A popular open-source Security Incident Response Platform (SIRP) that integrates tightly with threat intelligence feeds.
Cortex XSOAR / Splunk SOAR: Commercial platforms that automate playbook execution and evidence gathering.
Jira (Customized): Often used due to ubiquity, but requires significant customization to handle the sensitivity and specific field requirements of IR (e.g., hiding sensitive breach details from general IT staff).

Visit TheHive website to learn more: https://strangebee.com/thehive/

5.4 Breaking Down the Lifecycle: A Real-World Scenario

To visualize how these frameworks, teams, and tools interact, let's walk through a single incident—a Ransomware attack—using the 7-step lifecycle we defined in Section 5.1.

Scenario: It is Friday at 4:45 PM. A user in the Finance department, eager to finish their week, clicks a link in an email labeled "Q3 Invoice Discrepancy."

Step 1: Preparation (Before the Click)

Long before Friday afternoon, the CSIRT was busy. They established the IR Policy giving them the authority to disconnect Finance servers. They configured Immutable Backups so the data could be recovered even if encrypted. They also deployed EDR (Endpoint Detection & Response) agents to all laptops, ensuring they had visibility into process execution.

If this step had been skipped: The team would have no authority, no backups, and no logs. The battle would be lost before it began.

Step 2: Detection (The Alarm)

At 4:48 PM, the user notices their files are renamed to .locked and a text file appears on their desktop demanding Bitcoin. Simultaneously, the SIEM triggers a high-severity alert: "Multiple File Modifications & High-Entropy Writes Detected on Host FIN-LT-04."

The SOC Analyst validates the alert: This is not a glitch; it is an Incident.
They categorize it as "Malware / Ransomware" and assign it SEV2 (High Severity).
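
The "High-Entropy Writes" in the alert refer to data that looks statistically random, which is typical of encrypted output. As a hedged illustration, Shannon entropy over the bytes of a buffer can be computed as below. The 7.5 bits-per-byte threshold and the function names are assumptions; real detection products combine entropy with other signals, because compressed files also score high.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for uniform content, up to 8.0 for random data."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_encrypted(data: bytes, threshold: float = 7.5) -> bool:
    """Flag a write whose entropy exceeds the (assumed) threshold."""
    return shannon_entropy(data) > threshold


plain = b"Q3 invoice totals: 100, 200, 300. " * 50
print(looks_encrypted(plain))              # False
print(looks_encrypted(bytes(range(256))))  # True
```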

Step 3: Analysis (Scoping the Breach)

Before the team can "stop" the attack, they must know where it is. If they only focus on the user's laptop, they might miss the attacker moving laterally to the main file server.

The Investigation: The Forensic Analyst traces the attacker's activity. They identify Patient Zero (the Finance laptop) but also see a network connection from that laptop to the company's "Payroll-Share."
The Finding: The scope is not just one laptop; it is the laptop and the file server.

Tip

Reality Check: Quick Isolation vs. Formal Containment

In the real world, you will often perform an immediate "quick isolation" of the obvious infected host (the laptop) right when you detect it. However, the formal Containment Phase cannot truly begin until the Analysis phase tells you the full scope. If you isolate the laptop but fail to analyze the network logs, you leave the infected file server active, and the ransomware continues to spread.

Step 4: Containment (Applying the Tourniquet)

Now that the Analysis phase has confirmed the full scope (Laptop + File Server), the Incident Commander orders the containment plan.

Technical Containment: The SOC isolates both the laptop and the file server from the network.
Identity Containment: The compromised user's Active Directory account is disabled to prevent the attacker from logging in elsewhere.
Business Containment: The Finance department is notified that Payroll processing is suspended.

Step 5: Eradication (Clean Up)

Antivirus might say "Virus Removed," but we do not trust it. The attacker may have left "backdoors" (scheduled tasks or new admin accounts) to get back in later.

Action: The team decides not to simply clean the malware. Instead, they wipe the drives of both the laptop and the server and re-image them from a known-good "Golden Image."
Root Cause Fix: They identify the malicious email on the Exchange server and delete it from all other user mailboxes to ensure no one else clicks it.

Step 6: Recovery (Back to Business)

With the environment clean, the business needs to run.

Restoration: The Finance team restores clean copies of the encrypted payroll files from the Immutable Backups taken the night before.
Validation: The CSIRT monitors the restored devices closely for 24 hours to ensure the attacker doesn't return.
Acceptance: The CFO confirms the data is accurate, and the "All Clear" is given.

Step 7: Post-Incident Activity (Lessons Learned)

Two weeks later, the team holds a "Hot Wash" meeting.

Finding: The user clicked the link because the email banner didn't say "EXTERNAL SENDER."
Action Item: IT Operations is assigned a task to configure email banners for all external mail.
Metric: The team notes it took 3 minutes to detect (4:45 PM click, 4:48 PM alert) and 15 minutes to contain. They aim to reduce containment time to 10 minutes next quarter.

Summary

Building a resilient organization requires more than just disaster recovery servers; it requires a capability to detect and respond to adversaries in real-time. This capability is built upon three pillars:

Frameworks: Adopting a lifecycle model (whether the 7-step operational view, the 4-step NIST standard, or ISO 27035) to ensure a repeatable process.
People (CSIRT): Structuring a team with clear roles, specifically the Incident Commander, to manage the chaos.
Process (Case Management): Implementing a rigorous system of record to track the investigation, manage evidence, and ensure nothing falls through the cracks.

In the next chapter, we will examine the Preparation phase in detail, looking at the specific policies, tools, and intelligence sources required to arm the CSIRT before the battle begins.