How to Automatically Attribute Failures in LLM Multi-Agent Systems Using the Who&When Dataset
Introduction
When your LLM-powered multi-agent system fails on a task, you're not just left with a broken output — you're left with a headache. Which agent made the mistake? At what step did things go wrong? Manual log crawling feels like hunting for a single typo in a novel. Fortunately, researchers from Penn State University and Duke University, in collaboration with Google DeepMind, the University of Washington, Meta, Nanyang Technological University, and Oregon State University, have introduced a structured solution: automated failure attribution. Their work, accepted as a Spotlight presentation at ICML 2025, provides the first benchmark dataset (Who&When) and several evaluation methods to pinpoint the root cause of failures. This guide walks you through applying these tools to your own multi-agent systems, saving you hours of frustration.

What You Need
- Python 3.8+ environment
- Git to clone the official code repository
- Basic understanding of LLM multi-agent system architectures
- Access to a Hugging Face account to download the Who&When dataset
- Compute resources (a machine with at least 16GB RAM, GPU optional but recommended for large models)
Step-by-Step Guide
Step 1: Understand the Task of Failure Attribution
Before diving into code, grasp the core concept. In LLM multi-agent systems, multiple agents collaborate (e.g., via conversation or tool use) to solve a problem. A failure occurs when the final output is incorrect or incomplete. Failure attribution answers two questions: which agent caused the failure and at which point in the interaction (i.e., which timestamp or turn). The Who&When dataset simulates such failures with ground-truth labels, so you can evaluate the accuracy of your attribution method.
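To make the "who" and "when" concrete, here is a minimal sketch of what a labeled failure record looks like. The field names (`mistake_agent`, `mistake_step`, and so on) are illustrative, not the dataset's actual schema:

```python
# A toy attributed failure: field names here are illustrative only.
failure_record = {
    "agents": ["Orchestrator", "WebSurfer", "Coder", "Verifier"],
    "conversation": [
        {"step": 0, "agent": "Orchestrator", "content": "Plan: search, then compute."},
        {"step": 1, "agent": "WebSurfer", "content": "Found population: 8.3M."},
        {"step": 2, "agent": "Coder", "content": "Using 3.8M in the calculation."},  # the actual error
        {"step": 3, "agent": "Verifier", "content": "Final answer: 1.9."},
    ],
    "is_correct": False,
    # Ground-truth attribution labels: the "who" and the "when".
    "mistake_agent": "Coder",
    "mistake_step": 2,
}

def attribution_targets(record):
    """Return the (who, when) pair an attribution method must predict."""
    return record["mistake_agent"], record["mistake_step"]
```

An attribution method is scored on how often it recovers exactly this pair from the raw conversation alone.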
Step 2: Clone the Repository and Set Up the Environment
- Open a terminal and run:
git clone https://github.com/mingyin1/Agents_Failure_Attribution.git
- Navigate to the directory:
cd Agents_Failure_Attribution
- Create a virtual environment (recommended):
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
Step 3: Download the Who&When Dataset
The dataset is hosted on Hugging Face. Run the provided download script or use the Hugging Face datasets library:
from datasets import load_dataset
dataset = load_dataset("Kevin355/Who_and_When")
Alternatively, visit the dataset page and download the files manually. Place them in a data/ folder within the repository.
Step 4: Understand the Dataset Structure
The dataset contains multi-agent interaction logs, each labeled with:
- Failure type (e.g., reasoning error, miscommunication, missing information)
- Responsible agent (by ID or role)
- Failure timestamp (step index where the error first manifested)
Familiarize yourself with the format by examining a sample: dataset['train'][0] in Python.
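Once the data is loaded, a quick tally of failure types shows which error classes dominate a split. The snippet below uses stand-in records; with the real dataset you would pass dataset['train'] instead, and the 'failure_type' field name is an assumption that may differ from the actual schema:

```python
from collections import Counter

def failure_type_distribution(records):
    """Tally failure types across a split to see which errors dominate.
    Assumes each record carries a 'failure_type' field (name is illustrative)."""
    return Counter(r["failure_type"] for r in records)

# Stand-in records; swap in dataset["train"] for the real thing.
sample = [
    {"failure_type": "reasoning error"},
    {"failure_type": "miscommunication"},
    {"failure_type": "reasoning error"},
]
print(failure_type_distribution(sample))
# Counter({'reasoning error': 2, 'miscommunication': 1})
```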
Step 5: Choose an Attribution Method
The paper introduces several automated methods. Start with the trace-based method, which feeds the entire interaction trace to a pre-trained LLM and asks it to predict the responsible agent and the failure step. More advanced options include:
- Contrastive attribution: compares failed traces with successful ones to isolate divergences.
- Causal intervention: simulates “what if” scenarios by modifying agent outputs and checking if the failure is avoided.
The repository includes scripts for each. For your first run, use the default trace-based approach.
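The trace-based idea can be sketched in a few lines: serialize the whole conversation into one judging prompt, send it to an LLM, and parse a structured answer back out. The prompt wording and the `agent=<name>, step=<index>` answer format below are assumptions for illustration; the repository's actual prompt template differs.

```python
import re

def build_trace_prompt(conversation):
    """Assemble the full interaction trace into a single judging prompt.
    Wording is a sketch, not the repository's real template."""
    lines = [f"Step {i} [{turn['agent']}]: {turn['content']}"
             for i, turn in enumerate(conversation)]
    return (
        "The following multi-agent conversation ended in a wrong answer.\n"
        + "\n".join(lines)
        + "\nWhich agent made the decisive mistake, and at which step? "
        "Answer as: agent=<name>, step=<index>"
    )

def parse_attribution(llm_reply):
    """Extract the (who, when) prediction from the model's reply."""
    m = re.search(r"agent=(\w+),\s*step=(\d+)", llm_reply)
    return (m.group(1), int(m.group(2))) if m else (None, None)

prompt = build_trace_prompt(
    [{"agent": "Coder", "content": "Using 3.8M in the calculation."}]
)
parse_attribution("agent=Coder, step=2")  # -> ("Coder", 2)
```

In practice you would send the prompt to your model of choice and feed its reply into the parser; a reply without the expected structure yields (None, None) so you can flag it for re-prompting.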
Step 6: Run Attribution on a Sample Failure
Execute the provided evaluation script:
python run_attribution.py --dataset_path ./data/Who_and_When --method trace_based --split test
This will analyze a batch of test cases and output predictions vs. ground truth. The script logs the results, including accuracy for who and when separately.
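The two scores the script reports can be computed from aligned prediction and ground-truth lists. This helper is a sketch of that scoring, assuming each item is a (agent, step) pair; the script's own bookkeeping may differ:

```python
def attribution_accuracy(predictions, ground_truth):
    """Score 'who' and 'when' separately, as the evaluation reports them.
    Each item is an (agent, step) pair; the two lists are aligned by case."""
    n = len(ground_truth)
    who_hits = sum(p[0] == g[0] for p, g in zip(predictions, ground_truth))
    when_hits = sum(p[1] == g[1] for p, g in zip(predictions, ground_truth))
    return {"who_acc": who_hits / n, "when_acc": when_hits / n}

preds = [("Coder", 2), ("WebSurfer", 1), ("Coder", 5)]
gold = [("Coder", 2), ("Coder", 1), ("Coder", 4)]
attribution_accuracy(preds, gold)  # both accuracies come out to 2/3 here
```

Note that "who" and "when" can diverge: the second case above gets the step right but blames the wrong agent.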
Step 7: Interpret the Results
Check the output summary. A high "who" accuracy (e.g., >80%) indicates the method reliably identifies the failing agent; a low "when" accuracy suggests it struggles to pinpoint the exact moment. Examine the misattributions: does the model blame the wrong agent, or place the failure too early or too late? The paper reports baseline metrics (e.g., random guessing gives ~25% "who" accuracy in a 4-agent system), so compare accordingly.
Step 8: Apply to Your Own Multi-Agent System
To use this on your custom system, you must log interactions in the same format as the dataset: a JSON or dict with keys for agent names, message content, timestamps, and final success/failure. Modify the attribution scripts to accept your data. The trace_based method can be adapted by feeding your logs to the LLM with a similar prompt template.
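A small adapter usually suffices for the logging step. The converter below maps a custom run log onto a dataset-like record; the target field names (`history`, `is_correct`) are assumptions, so mirror whatever schema the attribution scripts in the repository actually read:

```python
import json

def convert_log(agent_messages, success):
    """Map a custom run log onto a Who&When-style record.
    Target field names are an assumption; match the schema the
    repository's attribution scripts actually expect."""
    return {
        "history": [
            {"step": i, "agent": name, "content": text}
            for i, (name, text) in enumerate(agent_messages)
        ],
        "is_correct": success,
    }

record = convert_log(
    [("Planner", "Break the task into two searches."),
     ("Searcher", "Top result retrieved.")],
    success=False,
)
json.dumps(record)  # records should stay JSON-serializable for the scripts
```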
Tips for Success
- Start simple: Begin with the provided dataset to validate that your environment runs correctly before applying to your own data.
- Use a strong LLM: The attribution quality improves with more capable models (e.g., GPT-4, Claude, or open-source models like LLaMA-3). The default script supports OpenAI and Hugging Face models.
- Log everything: For your own system, ensure you capture every input, output, and intermediate state of each agent. Missing logs lead to ambiguous attribution.
- Combine methods: The paper shows that ensemble methods (e.g., averaging predictions from trace-based and contrastive) can boost accuracy by 5–10%.
- Beware of cascading failures: Sometimes an earlier harmless error triggers a later failure. The “when” label might be earlier than the obvious first mistake — the dataset accounts for this.
- Iterate: Use failure attribution as a feedback loop. Once you find a common failure pattern (e.g., Agent 3 consistently misreads numeric data), modify that agent’s prompt or tool use.
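The "combine methods" tip can be as simple as majority voting over each method's (who, when) prediction. This combiner is a minimal sketch under that assumption; the paper's ensembling may differ in detail:

```python
from collections import Counter

def combine_predictions(per_method_preds):
    """Majority-vote the 'who' and 'when' answers from several methods.
    Ties resolve to whichever value was seen first (Counter ordering)."""
    whos = Counter(p[0] for p in per_method_preds)
    whens = Counter(p[1] for p in per_method_preds)
    return whos.most_common(1)[0][0], whens.most_common(1)[0][0]

combine_predictions([("Coder", 2), ("Coder", 3), ("Verifier", 2)])
# -> ("Coder", 2): two methods agree on the agent, two on the step
```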