Automated Failure Attribution in LLM Multi-Agent Systems: A New Benchmark and Methods

Introduction

Large language model (LLM) multi-agent systems are increasingly used to tackle complex problems through collaborative effort. However, despite their promise, these systems frequently encounter task failures. When a failure occurs, developers face a daunting question: which agent, at which point in the process, caused the problem? Manually sifting through extensive interaction logs to find the root cause is like searching for a needle in a haystack—time-consuming and labor-intensive. This challenge is especially acute in autonomous multi-agent environments, where long information chains and agent autonomy make debugging nearly impossible without automated assistance.

Source: syncedreview.com

To address this, researchers from Penn State University and Duke University, in collaboration with teams from Google DeepMind, the University of Washington, Meta, and others, have introduced the novel problem of automated failure attribution. They have built the first benchmark dataset for this task—called Who&When—and developed several attribution methods. Their work has been accepted as a Spotlight presentation at ICML 2025, and the code and dataset are fully open-source.

Background and Challenges

LLM multi-agent systems show immense potential across domains like software development, scientific research, and customer support. Yet they remain fragile. A single agent’s error, a misunderstanding between agents, or a mistake in information transmission can derail the entire task. According to the research, current debugging relies on manual methods:

  • Manual log archaeology: Developers must review lengthy interaction logs to pinpoint the failure source.
  • Reliance on expertise: The debugging process depends heavily on the developer’s deep understanding of the system.

These approaches are inefficient and fail to scale as systems grow more complex. The researchers highlight that automated failure attribution is essential for rapid system iteration and optimization.

The Who&When Dataset

The Who&When dataset is the first benchmark specifically designed for automated failure attribution in LLM multi-agent systems. It contains multiple multi-agent interaction logs, each annotated with the responsible agent and the time step where the failure originated. The dataset covers various task types and failure modes, providing a standardized evaluation platform.

Key features of the dataset include:

  • Diverse scenarios: Tasks range from simple question answering to multi-step reasoning.
  • Ground-truth labels: Each failure trace comes with precise annotations of the failing agent and step.
  • Realistic complexity: Logs reflect the autonomous, unpredictable nature of multi-agent collaboration.
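To make the annotation format concrete, here is a minimal sketch of what one labeled failure trace might look like. The field names and log structure are our illustration, not the actual Who&When schema:

```python
from dataclasses import dataclass

@dataclass
class FailureTrace:
    """One annotated multi-agent failure log (hypothetical schema)."""
    task: str            # the task the agents attempted
    history: list        # chronological messages: {"agent": str, "content": str}
    failure_agent: str   # ground truth: who caused the failure
    failure_step: int    # ground truth: when (index into history)

trace = FailureTrace(
    task="What year was the Eiffel Tower completed?",
    history=[
        {"agent": "planner", "content": "Ask the retriever for the completion year."},
        {"agent": "retriever", "content": "The Eiffel Tower was completed in 1887."},  # wrong fact
        {"agent": "writer", "content": "Final answer: 1887."},
    ],
    failure_agent="retriever",
    failure_step=1,
)
```

Note that the erroneous step contains no explicit error message, only a confidently stated wrong fact, which is exactly what makes attribution hard.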

The dataset is available on Hugging Face and the code on GitHub.

Automated Attribution Methods

The researchers proposed and evaluated several automated approaches for failure attribution. These methods fall into two main categories:

Heuristic Methods

These baseline approaches apply simple rules, such as blaming the agent that produced the first anomalous output or the agent with the most errors in the log. They are fast but often inaccurate.
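A "first anomalous output" rule can be sketched in a few lines. The error keywords and log format below are assumptions chosen for illustration:

```python
import re

# Crude surface signals of failure; real anomaly detection would be richer.
ERROR_PATTERNS = re.compile(r"\b(error|exception|cannot|failed|invalid)\b", re.IGNORECASE)

def first_anomaly_heuristic(history):
    """Blame the first agent whose message matches an error pattern.

    `history` is a list of {"agent": str, "content": str} messages in order.
    Returns (agent, step), or (None, None) if nothing matches.
    """
    for step, msg in enumerate(history):
        if ERROR_PATTERNS.search(msg["content"]):
            return msg["agent"], step
    return None, None

log = [
    {"agent": "planner", "content": "Plan the query and delegate."},
    {"agent": "executor", "content": "Tool call failed: invalid argument."},
    {"agent": "writer", "content": "Unable to produce an answer."},
]
```

This illustrates the weakness noted above: a silently wrong answer with no error keyword (the common case in reasoning failures) slips through entirely.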

Learning-Based Methods

These more sophisticated techniques train or prompt models to recognize failure patterns. The team explored:

  • Supervised learning: Using the Who&When dataset to train classifiers that predict the failing agent and step from log features.
  • Prompting-based attribution: Feeding the interaction log to a separate LLM and asking it to identify the failure source.
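A prompting-based attributor needs little more than a log formatter and an answer parser around a judge LLM. The prompt wording and the `Agent: <name>, Step: <number>` answer convention below are our assumptions, not the paper's exact setup; the judge-model call itself is left out:

```python
import re

def build_attribution_prompt(task, history):
    """Format a failed run as a prompt asking a judge LLM to attribute the failure."""
    lines = [f"Step {i} [{m['agent']}]: {m['content']}" for i, m in enumerate(history)]
    return (
        "The following multi-agent run failed at its task.\n"
        f"Task: {task}\n" + "\n".join(lines) + "\n"
        "Which agent caused the failure, and at which step? "
        "Answer exactly as: Agent: <name>, Step: <number>"
    )

def parse_attribution(reply):
    """Extract (agent, step) from the judge's reply; None if unparseable."""
    m = re.search(r"Agent:\s*(\S+),\s*Step:\s*(\d+)", reply)
    return (m.group(1), int(m.group(2))) if m else None
```

In practice the full log is sent to the judge model and `parse_attribution` is applied to its reply; long logs may exceed the context window, which is one reason this approach struggles on long interaction chains.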

Preliminary results show that learning-based methods significantly outperform heuristics, but the task remains challenging—especially for long interaction chains and subtle errors.

Results and Key Findings

Experiments on the Who&When benchmark revealed several insights:

  • Agent identification is harder than step identification: Models were generally better at pinpointing when a failure occurred than which agent caused it.
  • Context matters: Attribution accuracy improved when the method had access to the full interaction log rather than local snippets.
  • Prompting struggles: LLM-based prompting for attribution performed worse than supervised learning, likely due to the nuanced nature of multi-agent failures.
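Findings like "who is harder than when" come from scoring the two labels separately. A minimal evaluation sketch, assuming predictions and ground truth are aligned lists of (agent, step) pairs:

```python
def attribution_accuracy(predictions, ground_truth):
    """Agent-level ("who") and step-level ("when") accuracy over failure traces.

    Both arguments are lists of (agent, step) pairs, aligned by index.
    """
    n = len(ground_truth)
    who = sum(p[0] == g[0] for p, g in zip(predictions, ground_truth)) / n
    when = sum(p[1] == g[1] for p, g in zip(predictions, ground_truth)) / n
    return {"agent_acc": who, "step_acc": when}
```

Scoring the labels independently makes the asymmetry visible: a method can locate the failing step while blaming the wrong agent, or vice versa.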

The study establishes a baseline for future work and highlights the need for more sophisticated reasoning in automated failure attribution.

Conclusion and Future Directions

The introduction of automated failure attribution for LLM multi-agent systems is a crucial step toward building more reliable and debuggable AI systems. The Who&When dataset provides a foundation for research, and the proposed methods open new avenues for improvement. Future work could explore:

  • Integrating attribution into real-time monitoring systems.
  • Extending the approach to dynamic agent topologies.
  • Combining attribution with corrective actions to enable self-healing systems.

As multi-agent systems become more prevalent, tools like these will be essential for developers to maintain and scale their applications. The researchers hope their work encourages the community to tackle this important problem.

Paper: arXiv | Code: GitHub | Dataset: Hugging Face
