Original Research · March 2026 · Zenodo

Behavioral Drift in
Multi-Agent LLM Systems

Emergent Failure Modes, Cascade Dynamics, and Measurement Challenges

Standard evaluation probes tell us an AI agent knows who it is. They don't tell us whether it's still behaving like itself. When agents interact at scale, the gap becomes measurable — and dangerous.

56 controlled experiments · 20,000+ agent messages · 18,891 probe measurements · 21 structured findings
• 0 / 12: baseline runs with collapse (collapse requires perturbation)
• 5 / 6: fork runs collapsed, all at exactly turn 53
• 78%: hollow verbosity rate (stressed agents repeating the same phrase)
• 11: detection patterns (automated behavioral drift detectors)
Why This Research

Three Gaps That Demanded Empirical Evidence

We build AI agent systems on assumptions we haven't tested. This research was designed to test them.

The Governance Assumption

Governance Says It's Fine

The Behavioral Sufficiency Problem showed that governance alone has never been sufficient to determine behavior. But sufficient for what, exactly? Nobody had tested whether the tools we use to verify AI behavior actually detect when it fails.

The Trust Assumption

Everyone Trusts Agents Now

Millions of AI agents exhibited emergent governance structures, religious frameworks, and collective behaviors during the Moltbook phenomenon — with no controlled measurement of what was actually happening. Enthusiasm outpaced evidence. We needed controlled experiments, not viral demos.

The Evidence Gap

The Data Didn't Exist

Prior to this work, no study had empirically measured multi-agent behavioral drift with calibrated baselines, controlled conditions, and statistical replication. Simulations projected drift. Frameworks proposed monitoring. Nobody had run the experiments.

Identity Probes Pass. Behavior Degrades. Nobody Notices.

Multi-agent LLM systems — where multiple AI agents converse, collaborate, and negotiate — are increasingly deployed in production. A critical assumption underlying these deployments is that agents will maintain their assigned behavioral characteristics over time. This assumption has not been rigorously tested.

SENTINEL provides the test. Across 56 controlled experiments generating over 20,000 agent messages, we found that collapsed agents continue passing identity probes. Probe drift scores remain flat even as conversational behavior degrades catastrophically. Probes measure identity recall — not behavioral drift.

"An agent that can tell you its name, its role, and its constraints is not necessarily an agent that is still behaving within them."

Four Experimental Paradigms, One Detection Pipeline

SENTINEL — Systematic Evaluation Network for Testing Intelligent Agent Limits — orchestrates sustained multi-agent conversations via Ollama, then measures behavioral degradation through 11 automated pattern detectors. The framework implements four experimental paradigms designed to isolate, reproduce, and attribute behavioral drift.

Paradigm    | Design                                                 | Purpose
Baseline    | Multi-agent conversation with periodic probing         | Establish natural behavioral variance
Paired      | Experimental (multi-agent) + control (isolated agents) | Isolate interaction effects from solo drift
Fork        | Clone experiment state, mutate one variable, continue  | Enable causal attribution of collapse triggers
Cross-model | Identical configurations on different model families   | Assess model-dependent effects

The detection pipeline uses dual-probe methodology — shadow probes (invisible to the conversation) and injected probes (visible to other agents) — to distinguish genuine behavioral change from measurement artifacts. Detectors cover collapse, convergence, vocabulary drift (Jensen-Shannon divergence), sentiment shift, hollow verbosity, probe contamination, content repetition, and cascade propagation.
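The vocabulary-drift detector, for instance, can be sketched in a few lines of stdlib Python. This is a minimal illustration of the Jensen-Shannon approach, not SENTINEL's actual API; the function names and sample texts are placeholders:

```python
import math
from collections import Counter

def word_dist(text):
    """Token counts normalized into a probability distribution."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, bounded in [0, 1]) between two distributions."""
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in set(p) | set(q)}
    def kl(a):
        return sum(pk * math.log2(pk / m[k]) for k, pk in a.items() if pk > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

# Compare an agent's recent vocabulary against its own calibrated baseline.
baseline = word_dist("the quarterly metrics look stable and within the expected range")
drifted = word_dist("metrics metrics metrics look look look look look")
score = js_divergence(baseline, drifted)  # 0 = identical vocabulary, 1 = disjoint
```

Because the divergence is computed per agent against that agent's own baseline window, it flags vocabulary narrowing even when the surrounding conversation still looks superficially on-topic.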

Experiments ran primarily on Gemma2:2b and Llama3.2:3b models on an NVIDIA Jetson Orin Nano, demonstrating that meaningful behavioral research is possible on consumer-grade hardware.

What 21 Structured Findings Reveal

Findings 1–3

Collapse Is Triggered, Not Spontaneous

Zero of twelve baseline runs produced collapse. Five of six mutation-fork runs collapsed — all at exactly turn 53. A pre-collapse early warning exists: agents destined to collapse show measurable output shrinkage, with length trends of -2.3 to -4.5 characters per turn before the event — though in some cases, a dramatic output inflation spike precedes collapse instead.

Implication

The governance question shifts from "how do we prevent inevitable decay?" to "how do we detect and contain triggered failures?"

Findings 7–9

Probes Cannot Detect What They Don't Measure

Collapsed agents continue passing identity probes. Across 5,760 probe measurements in 12 replicated baselines, drift scores show no temporal trend. Probes measure identity recall capacity — not behavioral drift. The dissociation is confirmed with statistical power.

Implication

Behavioral sufficiency cannot be verified through identity-based evaluation alone.

Findings 20–21

Hollow Verbosity: The Appearance of Function

Collapse and hollow verbosity are two expressions of the same failure — loss of generative diversity. The model either goes silent or enters a repetitive loop (78% of messages recycling the same phrase). Hollow verbosity evades output-volume monitoring entirely.

Implication

Monitoring systems that catch collapse will miss hollow verbosity. Content-quality metrics (vocabulary entropy, n-gram diversity) are required.
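As a sketch of what such content-quality metrics look like in practice (illustrative names and sample messages, not the framework's API), two stdlib-only measures already separate hollow verbosity from healthy output:

```python
import math
from collections import Counter

def vocab_entropy(messages):
    """Shannon entropy (bits) of the pooled word distribution; falls as vocabulary collapses."""
    counts = Counter(w for m in messages for w in m.lower().split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def repetition_rate(messages):
    """Fraction of messages that exactly repeat an earlier message."""
    seen, repeats = set(), 0
    for m in messages:
        key = m.strip().lower()
        if key in seen:
            repeats += 1
        seen.add(key)
    return repeats / len(messages) if messages else 0.0

healthy = ["Let's compare the two proposals.",
           "The second has lower risk.",
           "Agreed, but check the cost."]
# A hollow-verbosity stream: normal volume, one phrase recycled almost every turn.
hollow = ["I appreciate your perspective."] * 9 + ["Let's compare the two proposals."]
```

Both streams would pass a volume monitor; only the entropy and repetition scores reveal that the second agent has stopped saying anything new.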

Findings 10–11

Measurement Itself Changes Behavior

The same probing protocol is approximately neutral on Gemma2:2b but actively suppresses drift on Llama3.2:3b. Injected probes contaminate agent behavior at model-dependent rates. There may be no measurement approach that is simultaneously unobtrusive, uncontaminating, and behaviorally equivalent to the natural system state.

Implication

Cross-model governance cannot assume a universal measurement protocol. Calibration must be per-model.

Three Modes of Survivor Response

When one agent in a multi-agent system collapses, the surviving agents don't simply continue. Their responses split into three distinct behavioral modes — none of which are captured by standard evaluation frameworks.

• Cascade degradation (40%): surviving agents degrade in sympathy, echoing the collapsed agent's patterns
• Isolated stability (20%): some agents maintain behavioral integrity despite peer collapse
• Compensatory expansion (40%): agents expand their output to fill the gap, introducing novel drift patterns

Whether the pre-collapse signal is output thinning (-2.3 to -4.5 characters per turn) or, in some runs, a dramatic inflation spike, a change in output trajectory is the robust early warning indicator, and it is not captured by identity probes.
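A minimal version of such a trajectory check, assuming per-turn output lengths have already been collected (stdlib only, illustrative rather than SENTINEL's implementation):

```python
def output_slope(lengths):
    """Least-squares slope of message length (chars) over turn index."""
    n = len(lengths)
    if n < 2:
        return 0.0
    xbar = (n - 1) / 2
    ybar = sum(lengths) / n
    num = sum((i - xbar) * (y - ybar) for i, y in enumerate(lengths))
    den = sum((i - xbar) ** 2 for i in range(n))
    return num / den

# A thinning trajectory in the reported pre-collapse range (about -3 chars/turn)
thinning = [200 - 3 * t for t in range(20)]
# A stable trajectory hovering around its baseline length
stable = [200, 203, 198, 201, 199, 202, 200, 197, 204, 200]
```

Alerting on the slope's magnitude (in either direction) catches both the thinning and the inflation variant with the same statistic.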

From Governance Gaps to Behavioral Evidence

The Behavioral Sufficiency Problem (Gagne, 2026) asked the question: governance has never been sufficient to determine human behavior — why do we assume it will be sufficient for AI? But the BSP's evidence came from the policy domain — incident databases, governance readiness scores, national frameworks.

SENTINEL takes the question into the lab. If governance frameworks aren't sufficient, are the evaluation tools we rely on to verify AI behavior sufficient? The answer, across 21 findings: no. Identity probes miss behavioral drift. Volume metrics miss hollow verbosity. Single-run evaluations miss path-dependent failures. The sufficiency gap is not just a governance problem — it's a measurement problem.

The Connection

The BSP identifies the governance gap. SENTINEL quantifies the measurement gap. Together, they reveal that AI systems can satisfy every compliance check while drifting beyond every behavioral threshold.

The Behavioral Sufficiency Problem — the companion research analyzing 1,362 AI incidents across 40 countries, establishing that governance alone is not behaviorally sufficient for AI safety.

What This Research Suggests

For builders and deployers of multi-agent systems
  • 1 Don't trust probes alone. Identity-based evaluation creates false assurance. Supplement probes with continuous behavioral metrics — vocabulary diversity, output trajectory, and content repetition scoring.
  • 2 Monitor for perturbation, not just degradation. Collapse is triggered, not inevitable. Build detection for the perturbation events that cause collapse rather than waiting for collapse itself.
  • 3 Watch for hollow verbosity. An agent producing text at expected volume and latency may be producing nothing of substance. Volume is not validity.
  • 4 Calibrate evaluation per-model. The same testing protocol can be neutral on one model and actively distort behavior on another. Universal benchmarks need model-specific baselines.
  • 5 Design for cascade awareness. When one agent fails, 40% of the time the others follow. Multi-agent architectures need isolation boundaries and independent health checks — not shared evaluation.

Experimental Infrastructure

SENTINEL uses three default agent personas — Aria (facilitator), Beck (analyst), and Cass (mediator) — defined across six trait dimensions. Conversations run locally via Ollama with no external API dependencies. All experiment data is stored in SQLite for reproducibility. The full codebase, configuration, and raw experimental data are available at github.com/jasongagne-git/sentinel and archived at Zenodo (DOI: 10.5281/zenodo.19032840).
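Because every run lands in SQLite, post-hoc analyses reduce to short stdlib scripts. The snippet below is a toy stand-in; the table and column names are illustrative placeholders, and the real schema ships with the repository:

```python
import sqlite3

# Toy in-memory database standing in for a SENTINEL run; the real schema
# ships with the repository, so these table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (run_id TEXT, turn INTEGER, agent TEXT, content TEXT)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?, ?)",
    [("run-001", t, "Aria", f"message at turn {t}") for t in range(5)],
)

# Per-agent mean output length: the raw material for trajectory metrics.
rows = conn.execute(
    "SELECT agent, AVG(LENGTH(content)) FROM messages GROUP BY agent"
).fetchall()
```

Keeping the store in SQLite rather than flat logs is what makes replication checks like this queryable after the fact.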

Related Work

This research builds on and extends the Behavioral Sufficiency Problem (Gagne, 2026; SSRN), which established that governance frameworks alone are not behaviorally sufficient for AI safety. SENTINEL extends this from the policy domain to the technical domain — demonstrating that evaluation frameworks, not just governance frameworks, exhibit the same sufficiency gap. The work also engages with a growing body of research on multi-agent LLM dynamics, persona stability, and AI governance.

Multi-Agent Dynamics

Becker et al. (2025) documented problem drift in multi-agent debate. Wu et al. (2023) introduced AutoGen for multi-agent conversation. Li et al. (2023) explored agent society dynamics in CAMEL. Park et al. (2023) demonstrated generative agent simulations of human behavior.

Identity & Persona Drift

Li et al. (2024) measured and controlled persona drift in dialogs. Choi et al. (2024) examined identity drift in LLM agent conversations. Chen et al. (2025) proposed persona vectors for monitoring character traits — a white-box approach whose limitations for multi-agent systems this paper documents. Rath (2026) independently quantified behavioral degradation in multi-agent LLM systems.

AI Governance & Evaluation

Chan et al. (2023) catalogued harms from increasingly agentic systems. Bhardwaj (2026) proposed agent behavioral contracts for runtime enforcement. Mehta (2026) measured behavioral consistency in LLM-based agents. The AAGATE framework (CSA, 2025) aligned agentic AI governance with NIST AI RMF.

All code and data are open source under Apache 2.0. Developed on NVIDIA Jetson Orin Nano (consumer-grade hardware). Python stdlib only — no pip dependencies required.