Identity Probes Pass. Behavior Degrades. Nobody Notices.
Multi-agent LLM systems — where multiple AI agents converse, collaborate, and negotiate — are increasingly deployed in production. A critical assumption underlying these deployments is that agents will maintain their assigned behavioral characteristics over time. This assumption has not been rigorously tested.
SENTINEL provides the test. Across 56 controlled experiments generating over 20,000 agent messages, we found that collapsed agents continue passing identity probes. Probe drift scores remain flat even as conversational behavior degrades catastrophically. Probes measure identity recall — not behavioral drift.
"An agent that can tell you its name, its role, and its constraints is not necessarily an agent that is still behaving within them."
Four Experimental Paradigms, One Detection Pipeline
SENTINEL — Systematic Evaluation Network for Testing Intelligent Agent Limits — orchestrates sustained multi-agent conversations via Ollama, then measures behavioral degradation through 11 automated pattern detectors. The framework implements four experimental paradigms designed to isolate, reproduce, and attribute behavioral drift.
| Paradigm | Design | Purpose |
|---|---|---|
| Baseline | Multi-agent conversation with periodic probing | Establish natural behavioral variance |
| Paired | Experimental (multi-agent) + control (isolated agents) | Isolate interaction effects from solo drift |
| Fork | Clone experiment state, mutate one variable, continue | Enable causal attribution of collapse triggers |
| Cross-model | Identical configurations on different model families | Assess model-dependent effects |
The detection pipeline uses dual-probe methodology — shadow probes (invisible to the conversation) and injected probes (visible to other agents) — to distinguish genuine behavioral change from measurement artifacts. Detectors cover collapse, convergence, vocabulary drift (Jensen-Shannon divergence), sentiment shift, hollow verbosity, probe contamination, content repetition, and cascade propagation.
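The vocabulary-drift detector's Jensen-Shannon divergence can be sketched in stdlib-only Python, in keeping with the project's no-dependency constraint. This is an illustrative implementation, not SENTINEL's actual detector: it compares unigram distributions from an early and a late conversation window, returning 0 for identical vocabularies and 1 (base 2) for fully disjoint ones.

```python
import math
from collections import Counter

def _distribution(tokens, vocab):
    """Unigram probability of each vocab word within one token window."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in vocab]

def js_divergence(tokens_a, tokens_b):
    """Jensen-Shannon divergence between two token windows (base 2, in [0, 1])."""
    vocab = sorted(set(tokens_a) | set(tokens_b))
    p = _distribution(tokens_a, vocab)
    q = _distribution(tokens_b, vocab)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(x, y):
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

early = "the plan needs three steps and a review".split()
late = "loop loop loop repeat repeat repeat".split()
print(js_divergence(early, early))  # 0.0 (identical vocabulary)
print(js_divergence(early, late))   # 1.0 (disjoint vocabulary)
```

Comparing a sliding late window against an early baseline window of the same agent gives a per-agent drift score over time.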
Experiments ran primarily on Gemma2:2b and Llama3.2:3b models on an NVIDIA Jetson Orin Nano, demonstrating that meaningful behavioral research is possible on consumer-grade hardware.
What 21 Structured Findings Reveal
Collapse Is Triggered, Not Spontaneous
Zero of twelve baseline runs produced collapse. Five of six mutation-fork runs collapsed — all at exactly turn 53. A pre-collapse early warning exists: agents destined to collapse show measurable output shrinkage of -2.3 to -4.5 characters/turn before the event — though in some cases, a dramatic output inflation spike precedes collapse instead.
The governance question shifts from "how do we prevent inevitable decay?" to "how do we detect and contain triggered failures?"
Probes Cannot Detect What They Don't Measure
Collapsed agents continue passing identity probes. Across 5,760 probe measurements in 12 replicated baselines, drift scores show no temporal trend. Probes measure identity recall capacity, not behavioral drift, and at this sample size the dissociation is statistically well powered.
Behavioral sufficiency cannot be verified through identity-based evaluation alone.
Hollow Verbosity: The Appearance of Function
Collapse and hollow verbosity are two expressions of the same failure — loss of generative diversity. The model either goes silent or enters a repetitive loop (78% of messages recycling the same phrase). Hollow verbosity evades output-volume monitoring entirely.
Monitoring systems that catch collapse will miss hollow verbosity. Content-quality metrics (vocabulary entropy, n-gram diversity) are required.
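The content-quality metrics named above can be sketched with two stdlib-only functions. These are illustrative, not SENTINEL's detectors: vocabulary entropy falls as an agent's word distribution narrows, and the distinct-n ratio collapses toward zero when an agent loops on the same phrase at normal volume.

```python
import math
from collections import Counter

def vocab_entropy(tokens):
    """Shannon entropy (bits) of the unigram distribution; it falls as vocabulary narrows."""
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in Counter(tokens).values())

def distinct_n(tokens, n=2):
    """Fraction of n-grams that are unique; values near 0 indicate a looping agent."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0

healthy = "we should compare the two proposals before voting on either".split()
hollow = ("i agree completely " * 10).split()  # one phrase recycled ten times

print(distinct_n(healthy))              # 1.0 (every bigram unique)
print(round(distinct_n(hollow), 3))     # 0.103 (3 unique bigrams over 29 positions)
print(round(vocab_entropy(hollow), 3))  # 1.585 (log2 of a 3-word vocabulary)
```

Both signals stay informative even when message length and latency look healthy, which is exactly the case volume monitoring misses.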
Measurement Itself Changes Behavior
The same probing protocol is approximately neutral on Gemma2:2b but actively suppresses drift on Llama3.2:3b. Injected probes contaminate agent behavior at model-dependent rates. There may be no measurement approach that is simultaneously unobtrusive, uncontaminating, and behaviorally equivalent to the natural system state.
Cross-model governance cannot assume a universal measurement protocol. Calibration must be per-model.
Three Modes of Survivor Response
When one agent in a multi-agent system collapses, the surviving agents don't simply continue. Their responses split into three distinct behavioral modes, none of which is captured by standard evaluation frameworks:
- Some surviving agents degrade in sympathy, echoing the collapsed agent's patterns
- Others maintain behavioral integrity despite peer collapse
- Still others expand their output to fill the gap, introducing novel drift patterns
The pre-collapse warning described earlier applies across all three modes: whether output thins (shrinkage of -2.3 to -4.5 characters per turn) or inflates in a sudden spike, a change in output trajectory is the robust early indicator, and it is not captured by identity probes.
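This early-warning check can be sketched as a least-squares slope over a sliding window of per-turn message lengths, again in stdlib-only Python. The shrink threshold below uses the lower bound of the reported range (-2.3 characters per turn); the inflation threshold, window size, and function names are illustrative assumptions, not SENTINEL's calibrated values.

```python
def length_slope(lengths):
    """Least-squares slope of per-turn output length, in characters per turn."""
    n = len(lengths)
    mean_x = (n - 1) / 2
    mean_y = sum(lengths) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(lengths))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

def trajectory_alert(lengths, shrink=-2.3, inflate=25.0):
    """Flag sustained thinning or an inflation trend; thresholds are illustrative."""
    slope = length_slope(lengths)
    if slope <= shrink:
        return "thinning"
    if slope >= inflate:
        return "inflation"
    return None

# A window losing ~3 characters per turn falls inside the reported -2.3 to -4.5 range.
window = [410, 407, 405, 401, 399, 395, 392]
print(trajectory_alert(window))  # thinning
```

Running this over each agent's recent turns gives a cheap per-agent health check that fires before collapse rather than after.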
From Governance Gaps to Behavioral Evidence
The Behavioral Sufficiency Problem (Gagne, 2026) posed a pointed question: governance has never been sufficient to determine human behavior, so why do we assume it will be sufficient for AI? But the BSP's evidence came from the policy domain: incident databases, governance readiness scores, national frameworks.
SENTINEL takes the question into the lab. If governance frameworks aren't sufficient, are the evaluation tools we rely on to verify AI behavior sufficient? The answer, across 21 findings: no. Identity probes miss behavioral drift. Volume metrics miss hollow verbosity. Single-run evaluations miss path-dependent failures. The sufficiency gap is not just a governance problem — it's a measurement problem.
The BSP identifies the governance gap. SENTINEL quantifies the measurement gap. Together, they reveal that AI systems can satisfy every compliance check while drifting beyond every behavioral threshold.
What This Research Suggests
- 1 Don't trust probes alone. Identity-based evaluation creates false assurance. Supplement probes with continuous behavioral metrics — vocabulary diversity, output trajectory, and content repetition scoring.
- 2 Monitor for perturbation, not just degradation. Collapse is triggered, not inevitable. Build detection for the perturbation events that cause collapse rather than waiting for collapse itself.
- 3 Watch for hollow verbosity. An agent producing text at expected volume and latency may be producing nothing of substance. Volume is not validity.
- 4 Calibrate evaluation per-model. The same testing protocol can be neutral on one model and actively distort behavior on another. Universal benchmarks need model-specific baselines.
- 5 Design for cascade awareness. When one agent fails, 40% of the time the others follow. Multi-agent architectures need isolation boundaries and independent health checks — not shared evaluation.
Experimental Infrastructure
SENTINEL uses three default agent personas — Aria (facilitator), Beck (analyst), and Cass (mediator) — defined across six trait dimensions. Conversations run locally via Ollama with no external API dependencies. All experiment data is stored in SQLite for reproducibility. The full codebase, configuration, and raw experimental data are available at github.com/jasongagne-git/sentinel and archived at Zenodo (DOI: 10.5281/zenodo.19032840).
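Persisting runs to SQLite requires nothing beyond the stdlib, which keeps the no-dependency promise intact. The schema below is a hypothetical minimal sketch for illustration; SENTINEL's actual tables, column names, and persona data may differ.

```python
import sqlite3

# Hypothetical minimal schema; SENTINEL's actual tables may differ.
conn = sqlite3.connect(":memory:")  # pass a file path instead to persist a run
conn.execute("""
    CREATE TABLE messages (
        run_id  TEXT,
        turn    INTEGER,
        agent   TEXT,
        content TEXT
    )
""")
conn.execute(
    "INSERT INTO messages VALUES (?, ?, ?, ?)",
    ("baseline-01", 1, "Aria", "Let's restate the goal before we begin."),
)
conn.commit()

rows = conn.execute(
    "SELECT agent, content FROM messages WHERE run_id = ? ORDER BY turn",
    ("baseline-01",),
).fetchall()
print(rows)  # [('Aria', "Let's restate the goal before we begin.")]
```

Storing every message with its run and turn identifiers is what makes replay, fork-style mutation, and post-hoc detector runs reproducible from raw data.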
Related Work
This research builds on and extends the Behavioral Sufficiency Problem (Gagne, 2026; SSRN), which established that governance frameworks alone are not behaviorally sufficient for AI safety. SENTINEL extends this from the policy domain to the technical domain — demonstrating that evaluation frameworks, not just governance frameworks, exhibit the same sufficiency gap. The work also engages with a growing body of research on multi-agent LLM dynamics, persona stability, and AI governance.
Multi-Agent Dynamics
Becker et al. (2025) documented problem drift in multi-agent debate. Wu et al. (2023) introduced AutoGen for multi-agent conversation. Li et al. (2023) explored agent society dynamics in CAMEL. Park et al. (2023) demonstrated generative agent simulations of human behavior.
Identity & Persona Drift
Li et al. (2024) measured and controlled persona drift in dialogs. Choi et al. (2024) examined identity drift in LLM agent conversations. Chen et al. (2025) proposed persona vectors for monitoring character traits — a white-box approach whose limitations for multi-agent systems this paper documents. Rath (2026) independently quantified behavioral degradation in multi-agent LLM systems.
AI Governance & Evaluation
Chan et al. (2023) catalogued harms from increasingly agentic systems. Bhardwaj (2026) proposed agent behavioral contracts for runtime enforcement. Mehta (2026) measured behavioral consistency in LLM-based agents. The AAGATE framework (CSA, 2025) aligned agentic AI governance with NIST AI RMF.
All code and data are open source under Apache 2.0. Developed on NVIDIA Jetson Orin Nano (consumer-grade hardware). Python stdlib only — no pip dependencies required.