CH7: IR Detection Phase
Introduction
In the lifecycle of incident response, we have moved from "Left of Boom" (Preparation) to the moment of impact. The "Boom" represents the initial compromise or the realization of an attack. However, a compromise does not automatically trigger a response. A breach can only be managed if it is detected.
Detection is the art and science of identifying malicious activity amidst the massive volume of benign background noise generated by modern IT environments. It is the phase where the organization’s senses—human observations, network sensors, and log aggregators—must function in unison to signal that a security event is occurring.
Without effective detection, an adversary can dwell within a network for months, escalating privileges and exfiltrating data. This chapter explores the three primary pillars of detection: the Human Sensor (employees and help desk staff), the Network Sensor (IDS/IPS and traffic analysis), and the Log Aggregator (SIEM and correlation).
Learning Objectives
By the end of this chapter, you will be able to:
- Explain the role of the "Human Sensor" in identifying phishing attempts and technical anomalies that automated tools might miss.
- Differentiate between Intrusion Detection Systems (IDS) and Intrusion Prevention Systems (IPS), including their placement and response capabilities.
- Compare and contrast detection methodologies, specifically signature-based versus anomaly-based logic.
- Analyze network visibility data, distinguishing the use cases for Full Packet Capture (PCAP) versus NetFlow.
- Identify common NSM and SIEM tools, including industry standards like Wireshark, Splunk, and Zeek.
- Explain the concept of Log Correlation and demonstrate how it links disparate events to reveal an attack.
- Definitively categorize security data into Events, Alerts, and Incidents using industry-standard criteria.
- Apply the MITRE ATT&CK framework to map detection coverage and identify gaps using heat maps.
7.1 The Human Sensor: Employee Reporting
While cybersecurity often focuses on high-tech automated tools, the most ubiquitous sensor in any organization is the human workforce. Employees, particularly those in non-technical roles, are frequently the first targets of an attack and, consequently, the first to witness indicators of compromise (IoCs).
Phishing Reporting
Phishing remains the primary vector for initial access in most major breaches. Technical controls like Secure Email Gateways (SEGs) filter out the majority of spam and malicious attachments, but sophisticated social engineering attacks often bypass these filters.
The "Human Firewall" becomes the critical last line of defense. Organizations must deploy mechanisms, such as a "Report Phishing" button within email clients, to allow users to flag suspicious communications instantly.

The Triage Workflow: When a user reports a phishing email, it should not merely disappear into a black hole. It triggers a specific workflow:
- Ingestion: The email is moved to a secure sandbox or analysis queue.
- Analysis: Automated tools check the header for spoofing (SPF/DKIM failures) and scan links/attachments.
- Feedback: If the email was malicious, the reporter receives positive reinforcement. If it was a simulation, they are congratulated.
- Cluster Defense: If one user reports a malicious email, the incident response team can search the mail server logs to find and purge that same email from all other user inboxes.
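The Analysis step above can be sketched in code. This is a minimal, hypothetical triage function (function name, header parsing, and the "suspicious TLD" heuristic are all assumptions for illustration); a real secure email gateway exposes far richer verdicts than a regex over raw headers.

```python
import re

# Hypothetical triage sketch: classify a reported email based on its
# Authentication-Results header and any embedded links.
def triage_reported_email(raw_headers: str, body: str) -> dict:
    findings = []
    # Analysis step: check for SPF/DKIM failures in the headers.
    if re.search(r"spf=(fail|softfail)", raw_headers, re.I):
        findings.append("SPF failure")
    if re.search(r"dkim=fail", raw_headers, re.I):
        findings.append("DKIM failure")
    # Toy link check: flag domains on cheap TLDs often abused in phishing.
    for domain in re.findall(r"https?://([\w.-]+)", body):
        if domain.endswith(".xyz") or domain.endswith(".top"):
            findings.append(f"suspicious link domain: {domain}")
    verdict = "malicious" if findings else "likely benign"
    return {"verdict": verdict, "findings": findings}

headers = "Authentication-Results: mx.example.com; spf=fail; dkim=fail"
body = "Reset your password at https://login-update.xyz/reset"
print(triage_reported_email(headers, body))
```

In practice this logic feeds the Feedback and Cluster Defense steps: a "malicious" verdict triggers both the reporter notification and the mailbox-wide purge search.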

Help Desk Escalation
The IT Service Desk (Help Desk) is the "Canary in the Coal Mine." They receive the initial complaints that often masquerade as technical glitches but are actually symptoms of an attack.
An Incident Response Plan must include specific training for Help Desk staff to recognize security anomalies.
Scenario: The "Slow Computer"
- Standard IT View: A user calls complaining their laptop is running slowly and the fan is loud. The standard fix is to reboot or check for Windows updates.
- Security View: High CPU usage and fan noise can be indicative of Cryptojacking (malware mining cryptocurrency in the background).
- Standard IT View: A user cannot open their Excel files; the file names look "weird."
- Security View: This is the hallmark of a Ransomware encryption process beginning.
If the Help Desk treats these tickets as routine performance issues, the attack is allowed to spread. Effective detection requires an escalation path where specific keywords (e.g., "encryption," "ransom," "strange popup," "mouse moving by itself") automatically flag a ticket for security review.
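A keyword-based escalation filter like the one just described can be sketched in a few lines. The keyword list and function name here are illustrative assumptions; a production ticketing system would use its own rule engine.

```python
# Sketch of a Help Desk escalation filter: scan ticket text for
# security-relevant phrases and route matches to the SOC queue.
SECURITY_KEYWORDS = {"encryption", "ransom", "strange popup", "mouse moving"}

def needs_security_review(ticket_text: str) -> bool:
    text = ticket_text.lower()
    return any(kw in text for kw in SECURITY_KEYWORDS)

tickets = [
    "Laptop slow, fan loud since Monday",
    "Excel files look weird, a note mentions ransom payment",
]
for t in tickets:
    print(needs_security_review(t), "-", t)
```

Note that the first ticket (the possible cryptojacking case) slips through a pure keyword filter, which is exactly why keyword flagging supplements, rather than replaces, analyst training.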
7.2 Intrusion Detection & Prevention (IDS/IPS)
Moving to technical sensors, the Intrusion Detection System (IDS) and Intrusion Prevention System (IPS) serve as the automated sentries of the network. These systems analyze traffic at the packet level to identify and stop malicious activity.
Functional Definitions
- Intrusion Detection System (IDS): A passive monitoring solution that inspects a copy of network traffic. Its primary function is to alert analysts when potential threats are identified. It provides visibility without affecting network performance or availability. Think of an IDS as a Burglar Alarm; it makes noise when a window breaks, but it cannot stop the thief.
- Intrusion Prevention System (IPS): An active control mechanism placed inline with traffic flow. Its primary function is to block malicious packets before they reach their destination. It can actively drop packets, reset connections, or deny specific traffic streams. Think of an IPS as a Bouncer; it physically stops unauthorized people from entering the club.

Architecture & Placement
- Network-Based (NIDS): These sensors are placed at strategic choke points, such as the network perimeter (behind the firewall) or at the core switch between network segments (e.g., separating the User LAN from the Data Center). They analyze packet headers and payloads as they traverse the wire.
- Host-Based (HIDS): These agents operate directly on the endpoint (server or workstation). They have visibility into the operating system itself, monitoring file integrity, memory, and local log files. Modern Endpoint Detection and Response (EDR) platforms have largely superseded traditional HIDS.

What They Actually Look For (Detection Logic)
IDS and IPS engines do not "know" good from bad inherently; they must be taught what to look for using specific detection logic.
1. Signature-Based Detection (Known Bad)
This method compares packet contents against a database of known threat signatures.
- Exploit Payloads: The engine looks for specific byte sequences associated with known vulnerabilities.
- Example: Detecting the "EternalBlue" exploit by identifying specific malformed SMB (Server Message Block) packets used to exploit the Windows SMBv1 service.
- Example: Detecting SQL Injection attempts by scanning HTTP GET requests for strings like `' OR 1=1--`.
- Command & Control (C2) Traffic: The engine looks for known malicious domains or communication patterns.
- Example: A signature that triggers when an internal host attempts to connect to a known botnet IP address.
- Example: Identifying specific User-Agent strings used by malware (e.g., "Morbius" or "CobaltStrike").
2. Anomaly-Based & Protocol Analysis (Deviation from Standard)
This method identifies traffic that technically adheres to a protocol but violates standard usage or statistical baselines.
- Protocol Violations: The engine understands how protocols (like TCP, HTTP, or DNS) are supposed to work based on RFC standards.
- Example: HTTP over Non-Standard Ports: Detecting HTTP traffic running over port 53 (DNS) or 445 (SMB), which is a common technique used to hide data exfiltration.
- Example: Malformed Headers: A TCP packet with conflicting flags set (e.g., a packet that has both SYN and FIN flags set), often used in reconnaissance scanning to crash firewalls.
- Statistical Anomalies:
- Example: A printer that normally sends 5MB of data a day suddenly sending 5GB of data to an external IP.
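The printer example above amounts to comparing observed volume against a per-host baseline. A minimal sketch (hostnames, field layout, and the 10x threshold are all assumptions; real anomaly engines use rolling statistical baselines, not a fixed multiplier):

```python
# Toy statistical-anomaly check: flag any host whose daily outbound
# volume exceeds a multiple of its historical baseline.
def volume_anomalies(baseline_mb: dict, today_mb: dict, factor: float = 10.0):
    flagged = []
    for host, usual in baseline_mb.items():
        observed = today_mb.get(host, 0)
        if observed > usual * factor:
            flagged.append((host, usual, observed))
    return flagged

baseline = {"printer-01": 5, "web-01": 2000}     # typical MB/day per host
today = {"printer-01": 5000, "web-01": 2500}     # printer jumped 5MB -> 5GB
print(volume_anomalies(baseline, today))
```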

7.3 Network Security Monitoring (NSM)
While IDS/IPS is about alerting, Network Security Monitoring (NSM) is about visibility. NSM provides the data required to investigate an alert. It answers the question: "What actually happened?"
Full Packet Capture (PCAP)
PCAP is the gold standard of evidence. It records the entire payload of the network conversation—every bit and byte.
- The Artifact: PCAP captures the actual data transferred. If an attacker downloaded a file, the PCAP contains that file. If they viewed a webpage, the PCAP contains the HTML.
- Challenge: Storage. Recording every packet on a high-speed network generates terabytes of data daily. Most organizations can only afford to retain PCAP for a few days or weeks.
NetFlow
NetFlow provides metadata about network traffic, rather than the content itself. It records the "Who, What, When, Where, and How Much".
- The Artifact: NetFlow records the source IP, destination IP, port, protocol, timestamp, and volume of bytes transferred. It does not show the content of the message.
- Use Case: NetFlow is incredibly efficient. Because it is just text-based metadata, organizations can store months or years of NetFlow data. This is crucial for hunting long-term threats like Command and Control (C2) beaconing.
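C2 beaconing hunting over flow metadata often reduces to one question: do connections from a host to a destination recur at suspiciously regular intervals? A hedged sketch (the jitter threshold and timestamp format are assumptions; tools like RITA apply much more sophisticated scoring):

```python
from statistics import pstdev

# Sketch: detect beacon-like regularity in NetFlow-style connection
# timestamps for one (source, destination) pair.
def looks_like_beacon(timestamps, max_jitter: float = 2.0) -> bool:
    if len(timestamps) < 4:
        return False                      # too few samples to judge
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    # A near-constant gap (low standard deviation) suggests automation,
    # not a human clicking around.
    return pstdev(gaps) < max_jitter

beacon = [0, 60, 120.5, 180, 239.8]       # ~60-second heartbeat
human = [0, 45, 300, 310, 1400]           # bursty, irregular browsing
print(looks_like_beacon(beacon), looks_like_beacon(human))
```

This is exactly the kind of analysis that long-retention NetFlow enables: a 60-second heartbeat is invisible in one day of data but obvious over a month.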

Common Tools for NSM
To capture, store, and analyze this data, security teams rely on specialized toolsets.
- Wireshark: The world's most widely used network protocol analyzer. It is the "microscope" of network forensics.
- Primary Use: Deep-dive analysis of specific PCAP files. When an analyst needs to reconstruct a single TCP stream or view the hex code of a payload, they open Wireshark.
- Limitation: It is not designed for continuous, automated monitoring of an entire network (it would crash). It is an interactive tool for human analysts.
- Arkime (formerly Moloch): An open-source, large-scale, full packet capturing, indexing, and database system. It provides a simple web interface to view PCAP data and is widely used for hunting and incident handling across massive datasets.
- Zeek (formerly Bro): While also an IDS, Zeek is primarily a "Network Flight Recorder." It sits on a TAP and converts raw traffic into structured, tab-separated log files (e.g., `http.log`, `dns.log`, `ssl.log`). This makes network traffic searchable without the massive storage cost of full PCAP.
- Security Onion: A free and open-source Linux distribution for threat hunting and enterprise security monitoring. It bundles several tools together (including Zeek, Suricata, and Arkime) into a pre-configured platform, making it a favorite for education and small-to-mid-sized businesses.
- RITA (Real Intelligence Threat Analytics): An open-source framework developed by Black Hills Information Security specifically for analyzing Zeek logs to detect beaconing behavior (long-duration C2 channels). More information can be found at https://www.activecountermeasures.com/free-tools/rita/.

7.4 SIEM & Log Analysis
In a modern enterprise, security data is distributed across firewalls, servers, cloud tenants, and antivirus consoles. The Security Information and Event Management (SIEM) system is the central brain that aggregates these disparate data sources.

The SIEM Architecture
- Collection: Agents or syslog forwarders send logs from devices to the SIEM.
- Normalization: The SIEM converts logs from different vendors into a standard format (e.g., converting "User_Login_Failed" from a Cisco router and "Event ID 4625" from Windows into a single standard event: `Authentication Failure`).
- Correlation: The engine applies logic rules to link events together.
- Alerting: If a rule triggers, an incident ticket is generated.

High-Value Log Sources
Not all logs are created equal. For effective detection, the following sources are critical:
- Authentication Logs: The most vital source. Analysts look for brute-force attempts (thousands of failures in seconds), "Impossible Travel" (a user logging in from New York and London within 5 minutes), and logins at unusual times.
- Firewall/Perimeter Logs: These show denied connections and, more importantly, successful outbound traffic to bad-reputation IP addresses.
- Endpoint Logs (EDR/Antivirus): These provide details on process execution (e.g., `powershell.exe` launching a script), file modifications, and malware blocks.
- DNS Logs: Often undervalued, DNS logs are critical for detecting C2. Malware must resolve a domain name (e.g., `attacker.com`) to phone home. Spotting requests to known malicious domains or randomly generated domain names (DGA) often reveals an infection.
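One common heuristic for spotting DGA names in DNS logs is character entropy: machine-generated labels look random, while human-chosen names reuse common letters. A toy sketch (the 3.5-bit threshold and example domains are assumptions; real detectors combine entropy with n-gram frequency and domain-age data):

```python
import math
from collections import Counter

# Shannon entropy of a domain label, in bits per character.
def shannon_entropy(label: str) -> float:
    counts = Counter(label)
    n = len(label)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def maybe_dga(domain: str, threshold: float = 3.5) -> bool:
    label = domain.split(".")[0]      # score only the leftmost label
    return shannon_entropy(label) > threshold

print(maybe_dga("google.com"), maybe_dga("xk9qwv7zpl4mnr2t.com"))
```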
Correlation Logic and Real-World Scenario
The true power of a SIEM is not just collecting logs, but Correlation.
Definition: Correlation is the automated process of connecting relationships between two or more seemingly unrelated events to reveal a larger pattern. A single log entry is often "noise," but a sequence of specific entries across different systems is a "signal."
Scenario: The "Needle in the Haystack"
Let's look at how a SIEM correlates four separate log sources to identify a compromised laptop communicating with a Command & Control (C2) server.
- Event 1 (Firewall Log): A user's laptop at Headquarters initiates an outbound connection on Port 443 (HTTPS) to an IP address in a foreign country.
  - Analysis: By itself, this is benign. Users browse the web all day; HTTPS traffic is normal.
- Event 2 (Authentication Log): The user `JSmith` logged into this laptop at 2:00 AM local time.
  - Analysis: By itself, this is suspicious but not definitive. Maybe JSmith is just working late.
- Event 3 (Endpoint/EDR Log): A PowerShell script `update_win.ps1` was executed by `JSmith` at 2:01 AM.
  - Analysis: By itself, this is suspicious. Why is a user running scripts? But maybe it's a scheduled task.
- Event 4 (DNS Log): The laptop requested to resolve the domain `update-microsoft-support.xyz`.
  - Analysis: By itself, this is highly suspicious. Legitimate Microsoft updates do not use `.xyz` domains.
The Correlation Rule: The SIEM engine sees all four events occur within a 5-minute window involving the same IP and Username. It correlates them into a single high-severity incident:
"Suspected C2 Beaconing: Late night login + Unsigned PowerShell Script + High-Risk DNS Request."
Without correlation, the analyst would have to manually find these four needles in four different haystacks. With correlation, the root cause (the malicious PowerShell script calling out to a bad domain) is immediately visible.
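The correlation rule in this scenario can be sketched as a time-window join across log sources. The event schema, hostnames, and 5-minute window below are illustrative assumptions; production SIEMs express this in their query language (SPL, KQL, etc.) rather than raw Python.

```python
from datetime import datetime, timedelta

# Normalized events from three different log sources, keyed by host.
events = [
    {"src": "auth", "host": "LT-042", "type": "login",      "ts": "02:00"},
    {"src": "edr",  "host": "LT-042", "type": "powershell", "ts": "02:01"},
    {"src": "dns",  "host": "LT-042", "type": "risky_dns",  "ts": "02:02"},
    {"src": "auth", "host": "LT-099", "type": "login",      "ts": "09:00"},
]

# Raise one incident when all required event types hit the same host
# within the correlation window.
def correlate(events, window_minutes=5,
              required=frozenset({"login", "powershell", "risky_dns"})):
    incidents = []
    parsed = [dict(e, dt=datetime.strptime(e["ts"], "%H:%M")) for e in events]
    for anchor in parsed:
        window_end = anchor["dt"] + timedelta(minutes=window_minutes)
        seen = {e["type"] for e in parsed
                if e["host"] == anchor["host"]
                and anchor["dt"] <= e["dt"] <= window_end}
        if required <= seen:
            incidents.append(anchor["host"])
    return sorted(set(incidents))

print(correlate(events))
```

Each event alone is noise; only the join across sources within the window produces the high-severity signal.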
Industry Standard SIEM Tools
The SIEM market is vast, ranging from expensive enterprise suites to community-driven open-source projects.
Commercial Leaders
- Splunk: The longstanding market leader. Known for its powerful "Search Processing Language" (SPL) and massive ecosystem of integrations. It is highly customizable but can be expensive as costs often scale by data volume (GB/day).
- Microsoft Sentinel: A cloud-native SIEM built directly into Azure. It has gained massive popularity because it integrates seamlessly with Office 365 and uses the Kusto Query Language (KQL). It removes the need to manage backend servers.
- CrowdStrike Falcon LogScale (formerly Humio): Known for extreme speed and index-free logging, allowing for real-time searching of massive datasets with lower latency than traditional SIEMs.
Open Source & Free Options
- Elastic Security (ELK Stack): Comprised of Elasticsearch (database), Logstash (ingestion), and Kibana (visualization). While "free" to download, it requires significant engineering effort to build and maintain at scale. It is the backbone of many modern SOCs.
- Wazuh: An open-source security platform that combines XDR (Extended Detection and Response) and SIEM capabilities. It excels at endpoint monitoring and compliance auditing.
- Graylog: A centralized log management solution that is robust and user-friendly. While often used for IT operations, its security plugins make it a viable SIEM alternative for smaller organizations.
7.5 The Critical Distinction: Event vs. Incident
Before discussing the triage workflow, we must establish the specific vocabulary of detection. In a security operations center (SOC), words have precise meanings. Confusing an "Event" with an "Incident" can lead to panic, resource exhaustion, or legal liability.
This classification process is essential for accurate reporting and trending.

1. Security Event
A security event is any observable occurrence in a system or network. It is a neutral fact.
- Characteristics: Common, high volume, usually benign.
- Examples: A user connecting to a file share; a firewall denying a connection; a server rebooting; a scheduled task running.
2. Security Alert
An alert is a specific type of event that triggers a notification because it matches a predefined rule or threshold. It indicates that something might be wrong.
- Characteristics: Needs human or automated review.
- Examples: A user failed to login 10 times in 1 minute (Brute Force Rule); a computer scanned port 445 on 50 hosts (Scanner Rule).
3. Security Incident
An incident is a violation (or imminent threat of violation) of computer security policies, acceptable use policies, or standard security practices. It implies harm or unauthorized access.
- Characteristics: Requires formal response, containment, and remediation.
- Examples: A ransomware payload executing; an unauthorized user viewing payroll data; a Denial of Service (DoS) attack taking down the website.
Comparative Scenarios
The following table illustrates how a simple event evolves into an incident:
| Scenario | The Event (Observable Fact) | The Alert (Trigger) | Is it an Incident? |
|---|---|---|---|
| Scenario A | A firewall drops a packet from an external IP on port 22. | "SSH Inbound Denied" (Count: 1) | No. The control worked. The threat was stopped. This is just "background radiation." |
| Scenario B | Antivirus detects mimikatz.exe on a laptop and deletes it. | "Malware Detected & Quarantined" | Maybe. If the malware was deleted, the immediate threat is gone. However, how did it get there? This is likely an incident requiring investigation. |
| Scenario C | A user logs in from an IP address in North Korea. | "Geo-Location Anomaly" | Yes (High Probability). Unless you have employees in North Korea, this is an active policy violation and likely account compromise. |
| Scenario D | A database server sends 5GB of data to a Dropbox URL. | "Data Exfiltration Anomaly" | Yes. This threatens the confidentiality of data (Data Loss). |
The Golden Rule of Triage:
We investigate Alerts to determine if they are Incidents.
7.6 Alert Triage & Detection Engineering
Deploying tools is only half the battle. The output of these tools is a queue of alerts that must be managed by human analysts.
The Alert Triage Workflow
When an alert lands in the queue, the analyst must rapidly answer three questions to determine if it is a True Positive (Real Attack) or False Positive (Benign Noise):
- Validate Signal: Is this activity actually malicious? (e.g., Is the "malware" actually a legitimate administrative tool?)
- Determine Scope: Is this affecting one laptop or the entire server farm?
- Prioritize Impact: Does this threaten critical business functions (CBFs) or sensitive data?
False Positives vs. False Negatives
- False Positive: The alarm rings, but there is no fire. Too many of these lead to "Alert Fatigue," where analysts ignore the dashboard because "it's always broken."
- False Negative: There is a fire, but the alarm stays silent. This is the most dangerous scenario, as the attacker operates undetected.

Alert Tuning: Reducing the Noise
Once we identify False Positives, we must perform Alert Tuning. This involves modifying the logic of the detection rule so it continues to catch the bad guys but stops annoying the analysts.
Let's look at a classic example: The "Noisy" Brute Force Rule.
- The Original Rule: `Alert if User Login Failures > 3 in 60 minutes.`
  - The Problem: This rule is far too sensitive. If a user changes their password on Monday morning but forgets to update it on their iPhone, the phone will retry the old password repeatedly in the background. This generates hundreds of alerts for innocent employees.
  - The Result: Alert Fatigue. Analysts stop looking at Brute Force alerts because "it's just iPhone sync issues."
- The "Tuned" Rule: `Alert if User Login Failures > 10 in 1 minute AND Source IP is NOT Internal.`
  - The Logic Change:
    - We increased the threshold (10 failures) and shortened the time window (1 minute). This targets automated scripts rather than forgetful humans.
    - We added a logic condition (`Source IP is NOT Internal`) to ignore the office Wi-Fi, focusing only on external attacks.
  - The Result: Fewer alerts, but higher fidelity. When this alert fires now, it is almost certainly an attack.
Note
Regarding the logic above, you might be thinking, "What if it is an insider threat doing the brute forcing?" You are correct to think that! Alert rules specifically designed to detect insider threats are another important consideration. Every SOC will have different tunings depending on the context of its environment; a high-risk environment may even have a dedicated insider-threat prevention team.
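The tuned brute-force rule can be sketched as executable logic. The internal IP range, timestamp format, and threshold values below are assumptions chosen to mirror the rule as written:

```python
from ipaddress import ip_address, ip_network

INTERNAL = ip_network("10.0.0.0/8")   # assumed corporate address space

# Tuned rule: more than 10 failures within a 60-second window,
# AND the source IP is NOT internal.
def tuned_brute_force_alert(failures):
    # failures: list of (timestamp_seconds, source_ip) for one account
    for ts, ip in failures:
        if ip_address(ip) in INTERNAL:
            continue                          # ignore office Wi-Fi / sync noise
        window = [t for t, i in failures
                  if i == ip and ts <= t < ts + 60]
        if len(window) > 10:
            return True
    return False

# 12 rapid failures from an external address -> the rule fires.
external_burst = [(i, "203.0.113.7") for i in range(12)]
# 12 slow failures from an iPhone on internal Wi-Fi -> no alert.
internal_sync = [(i * 300, "10.1.2.3") for i in range(12)]
print(tuned_brute_force_alert(external_burst), tuned_brute_force_alert(internal_sync))
```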
Detection Engineering & The MITRE ATT&CK Framework
In the early days of cybersecurity, detection was often based on "signatures"—looking for a specific file hash or a known bad IP address. If the attacker changed one byte of the file or rented a new server, the detection failed. Today, we move beyond simple alert writing into Detection Engineering: a systematic process of designing, building, testing, and maintaining detection logic that focuses on adversary behaviors rather than static indicators.
To do this effectively, we need a common language to describe those behaviors. This is where the MITRE ATT&CK framework becomes our most valuable tool.
- Website: https://attack.mitre.org

ATT&CK stands for Adversarial Tactics, Techniques, and Common Knowledge. It is a globally accessible knowledge base of adversary behaviors based on real-world observations. It moves beyond simple "Indicators of Compromise" (like IP addresses, which change easily) and focuses on "Behavior" (which is hard for attackers to change).
The framework is structured around the concept of TTPs (Tactics, Techniques, and Procedures). Understanding the distinction between these three layers is critical for a detection engineer, as it dictates how durable your detections will be.
TTPs: The Hierarchy of Behavior
When analyzing an attack or building a detection, we break the behavior down into three levels of granularity:
- Tactics (The "Why"): Tactics represent the adversary's technical goal or the reason for performing an action. This is the highest level of the framework.
  - Example: Initial Access (The attacker wants to get in), Credential Access (The attacker wants to steal passwords), or Exfiltration (The attacker wants to steal data).
  - Detection Value: You cannot write a detection for a "Tactic" directly because it is too broad. You cannot simply "detect Initial Access"; you must detect the specific methods used to achieve it.
- Techniques (The "How"): Techniques represent the way an adversary achieves a tactical goal. There are often many techniques to achieve a single tactic.
  - Example: To achieve the tactic of Credential Access, an attacker might use the technique OS Credential Dumping (extracting passwords from memory) or Brute Force (guessing passwords).
  - Detection Value: This is the "sweet spot" for detection engineering. If you can detect the technique of dumping memory, you catch the attacker regardless of which specific tool they use.
- Procedures (The "What"): Procedures are the specific implementation details—the exact tools, commands, or sequence of actions the adversary uses to execute a technique.
  - Example: To execute the OS Credential Dumping technique, the adversary might use the specific procedure of running the tool `Mimikatz` with the command `sekurlsa::logonpasswords`, or they might use `Task Manager` to create a dump file of the `lsass.exe` process.
  - Detection Value: Procedures are highly specific. Detecting `mimikatz.exe` is useful, but brittle. If the attacker renames the file to `notepad.exe` or uses a different tool to dump credentials, a procedure-based detection might fail.
Real-World Example: The TTP Chain
To visualize how these fit together, imagine an attacker named "APT29" trying to steal data.
- Tactic: Credential Access (The Goal).
- Technique: T1003.001 - OS Credential Dumping: LSASS Memory (The Method).
- Procedure: `procdump.exe -ma lsass.exe lsass.dmp` (The Specific Command).
The Pyramid of Pain
The relationship between TTPs and detection quality is often visualized using the Pyramid of Pain, which we covered in Chapter 6.
- Hash Values and IP Addresses are at the bottom. They are easy to detect but trivial for an attacker to change.
- TTPs are at the apex. Changing a TTP (learning a completely new way to dump credentials without touching LSASS) is difficult and expensive for the attacker.
As detection engineers, our goal is to move up the pyramid. While we alert on known bad IPs (Procedures/Indicators), our focus must be on writing logic that identifies the underlying behavior (Techniques). For example, rather than looking for "Mimikatz," we look for "any process attempting to read the memory of lsass.exe." This detects Mimikatz, but it also detects five other tools that do the same thing, making the detection robust and difficult to evade.
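The "any process reading lsass.exe memory" idea can be sketched as a behavior-based rule over endpoint telemetry. The event shape below loosely mimics an EDR/Sysmon "process access" record but is an assumption, not a real product schema, and the allowlist entries are illustrative:

```python
# Behavior-based detection: flag ANY process-access event targeting
# lsass.exe, regardless of the source tool's filename.
def lsass_access_alerts(process_access_events):
    allowlist = {"MsMpEng.exe", "csrss.exe"}   # illustrative OS/AV exclusions
    return [e["source_process"] for e in process_access_events
            if e["target_process"].lower() == "lsass.exe"
            and e["source_process"] not in allowlist]

events = [
    {"source_process": "notepad.exe", "target_process": "lsass.exe"},  # renamed dumper
    {"source_process": "MsMpEng.exe", "target_process": "lsass.exe"},  # Defender, expected
    {"source_process": "chrome.exe",  "target_process": "chrome.exe"},
]
print(lsass_access_alerts(events))
```

Because the rule keys on the target (`lsass.exe`) rather than the attacker's filename, renaming Mimikatz to `notepad.exe` does not evade it; that is the move up the pyramid.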
Mapping Coverage with Heat Maps
Detection Engineers use the ATT&CK Matrix to visualize the effectiveness of their security program. This is often done by creating a Heat Map.
- The Concept: You take the full matrix of hundreds of techniques and overlay your current detection capabilities.
- Green Cell: "We have a reliable SIEM alert for this technique."
- Red/Blank Cell: "We have no visibility here. If an attacker uses this technique, we will miss it."
This visualization helps leadership prioritize budget and effort. If the "Exfiltration" column is entirely Red, the organization knows it must urgently invest in Data Loss Prevention (DLP) or egress monitoring rules, rather than buying another firewall.
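A heat-map gap report can be sketched as a simple coverage overlay. The technique IDs below are real ATT&CK identifiers, but the coverage values are invented for illustration:

```python
# Toy coverage "heat map": True = reliable detection exists (Green),
# False = no visibility (Red).
coverage = {
    "Credential Access": {"T1003 OS Credential Dumping": True,
                          "T1110 Brute Force": True},
    "Exfiltration":      {"T1041 Exfiltration Over C2 Channel": False,
                          "T1567 Exfiltration to Cloud Storage": False},
}

def gap_report(coverage):
    gaps = {}
    for tactic, techniques in coverage.items():
        missing = [t for t, covered in techniques.items() if not covered]
        if missing:
            gaps[tactic] = missing
    return gaps

print(gap_report(coverage))
```

An output dominated by one tactic column (here, Exfiltration) is precisely the signal leadership needs to prioritize DLP or egress monitoring spend.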

Summary
Detection is the pivotal phase that converts a silent compromise into a managed incident. It requires a "Defense in Depth" approach to visibility. We rely on the Human Sensor to report social engineering, Network Sensors (IDS/NSM) to catch traffic anomalies, and Log Aggregators (SIEM) to correlate scattered clues into a coherent narrative.
Crucially, we must discern between mere Events and true Incidents. Not every alert requires a "Battle Stations" response; learning to distinguish noise from signal—and tuning that signal to reduce false positives—is the hallmark of a mature security team. By mapping these tuned detections to the MITRE ATT&CK framework, we ensure our eyes are open to the specific behaviors adversaries use to target us.
In the next chapter, we will discuss what happens once an alert is verified as a true incident: the Analysis phase.
Key Takeaways
- Event vs. Incident: An event is an occurrence; an incident is a violation or threat. Triage is the process of sorting one from the other.
- Phishing Reporting: Empower users to be sensors; the Help Desk must recognize security indicators in support tickets.
- IDS vs. IPS: IDS watches (passive); IPS blocks (active). Both rely on signatures (known bad) or anomalies (weird behavior).
- PCAP vs. NetFlow: PCAP is the full recording (storage heavy); NetFlow is the call log (storage light).
- SIEM Correlation: The true value of a SIEM lies in connecting disparate logs (DNS + Auth + EDR) to confirm complex attacks like C2 beaconing.
- MITRE ATT&CK: A framework of adversary behaviors (Tactics = Goals, Techniques = Methods) used to create heat maps that visualize detection gaps.
